---
title: "Gifi Theory"
author: "Patrick Mair, Jan De Leeuw"
output: rmarkdown::html_vignette
bibliography: gifi.bib
vignette: >
  %\VignetteIndexEntry{Gifi Theory}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
The `Gifi` package represents an easier-to-use version of the `homals` [@deLeeuw+Mair:2009b] package for multivariate analysis with optimal scaling [@Gifi:1990]. There are two main differences between `Gifi` and `homals`:
- Theory: `homals` uses the concept of rank and set restrictions to fit methods like princals, morals, overals, etc. `Gifi` is based on the concept of *copies*.
- Implementation: `homals` has a single function called `homals()` which, depending on the `rank` and `sets` argument settings, fits various Gifi methods. `Gifi` offers wrapper functions like `princals()`, `homals()`, and `morals()` that allow users to fit corresponding solutions in a more user-friendly way. It also presents the results in a more straightforward and accessible manner.
This vignette focuses on the Gifi theory using the idea of *copies*. The other vignettes are applied and demonstrate how to use the main functions of the `Gifi` package.
## Theory in a Nutshell
The data are collected in an $n \times m$ data frame. Whereas the `homals` package uses the homogeneity loss function [see @deLeeuw+Mair:2009b], the `Gifi` package minimizes the following loss function, the *meet loss* [see @Gifi:1990, Sec. 4.4], where SSQ denotes the sum of squares of the elements of a matrix:
$$\sigma(X,Z,A)=\frac{1}{mp}\sum_{j=1}^m\text{SSQ}\ (X-\sum_{i\in I_j}H_iZ_iA_i)$$
The index set $\mathcal{N}=\{1,2,\ldots,N\}$, where $N$ is the total number of *active* variables (see below) in the analysis, is partitioned into $m$ index sets $I_j$, with $I_j\cap I_l=\emptyset$ for $j\neq l$ and $\bigcup_{j=1}^m I_j=\mathcal{N}$.
$X$ is the $n\times p$ matrix of *object scores* ($p$ being the number of dimensions). $X$ is centered ($e'X=0$, with $e$ a vector of ones) and orthonormal ($X'X=I$). For each variable $i$, the meet loss involves:
- $H_i$ as the *basis*, $n \times k_i$, observations by basis, known and fixed;
- $Z_i$ as the *coefficients*, $k_i\times l_i$, basis by copies;
- $A_i$ as the *loadings*, $l_i\times p$, copies by dimensions.
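The loss can be evaluated directly from these pieces. The following is a minimal sketch in base R, not code taken from the `Gifi` package: the helper names `ssq()` and `meet_loss()` and the toy dimensions are assumptions made for illustration, and the sets $I_j$ are passed as a list of index vectors.

```r
# Sum of squares of all elements of a matrix
ssq <- function(M) sum(M^2)

# Meet loss for object scores X and per-variable lists H, Z, A.
# `sets` is a list of index vectors, one per set I_j, partitioning 1..N.
# (Hypothetical helper, not a Gifi function.)
meet_loss <- function(X, H, Z, A, sets) {
  N <- length(H)
  # the sets must be disjoint and jointly cover all active variables
  stopifnot(length(unlist(sets)) == N,
            all(sort(unlist(sets)) == seq_len(N)))
  m <- length(sets)
  p <- ncol(X)
  per_set <- vapply(sets, function(I_j) {
    fitted <- Reduce(`+`, lapply(I_j, function(i) H[[i]] %*% Z[[i]] %*% A[[i]]))
    ssq(X - fitted)
  }, numeric(1))
  sum(per_set) / (m * p)
}

# Toy data: three variables in two sets, p = 2 dimensions
set.seed(42)
n <- 10; p <- 2
X <- qr.Q(qr(scale(matrix(rnorm(n * p), n, p), scale = FALSE)))  # centered, orthonormal
k <- c(3, 2, 4)   # basis columns per variable
l <- c(1, 2, 1)   # copies per variable
H <- lapply(k, function(ki)
  1 * outer(sample(rep(seq_len(ki), length.out = n)), seq_len(ki), "=="))
Z <- Map(function(ki, li) matrix(rnorm(ki * li), ki, li), k, l)
A <- lapply(l, function(li) matrix(rnorm(li * p), li, p))
sets <- list(1:2, 3)

loss <- meet_loss(X, H, Z, A, sets)

# With all loadings set to zero every set contributes SSQ(X) = p,
# so the normalized loss equals 1
A0 <- lapply(A, function(a) 0 * a)
loss0 <- meet_loss(X, H, Z, A0, sets)
```

The division by $mp$ therefore scales the loss so that the trivial solution with zero loadings has loss 1.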
Instead of using rank restrictions like `homals`, `Gifi` uses the idea of *copies*, first introduced by @deLeeuw:1984: literal copies (or duplicates) of a variable that enter the loss. This concept, called *multiple quantifications* in the original Gifi terminology, makes the system more flexible.
In addition, in `homals` all data were categorical and the basis was always an indicator matrix. In `Gifi` the basis is either categorical or a *B-spline basis* [@vanRijckevorsel:1988]. A B-spline basis requires the user to specify the knots, which implies that the data must be numerical.
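To make the two kinds of basis concrete, here is a small sketch. It uses `splines::bs()` from base R as a stand-in for the B-spline code inside `Gifi`, which may differ in detail; the knot locations, the degree, and the object names `H_cat` and `H_num` are illustrative assumptions.

```r
library(splines)  # ships with base R

# Categorical variable: indicator basis, one column per category
f <- factor(c("a", "b", "a", "c", "b", "c"))
H_cat <- 1 * outer(as.integer(f), seq_len(nlevels(f)), "==")

# Numerical variable: B-spline basis with user-chosen interior knots
x <- c(0.1, 0.4, 0.5, 0.7, 0.9, 1.0)
H_num <- bs(x, knots = c(0.3, 0.6), degree = 2, intercept = TRUE,
            Boundary.knots = c(0, 1))

dim(H_cat)  # 6 x 3: observations by categories
dim(H_num)  # 6 x 5: number of interior knots + degree + 1 columns
```

In both cases each row of the basis sums to one, so the indicator matrix can be seen as the simplest member of the same family.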
For each variable $i$, the following matrices are returned by various `Gifi` functions like `homals()` and `princals()`:
- *transformations* $T_i=H_iZ_i$, orthogonalized, observations by copies, $n\times l_i$;
- *loadings* $A_i$, copies by dimensions, $l_i\times p$;
- *scores* $S_i=T_iA_i=H_iZ_iA_i$, observations by dimensions, $n\times p$;
- *quantifications* $Q_i=Z_iA_i$, basis by dimensions, $k_i\times p$;
- *coefficients* $Z_i$, basis by copies, $k_i\times l_i$.
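The relations between these matrices are plain matrix products. The sketch below builds them for one variable from random toy values (all names and dimensions are assumptions for illustration) and skips the orthogonalization of the transformations mentioned above.

```r
set.seed(1)
n <- 8; k <- 3; l <- 2; p <- 2   # observations, basis columns, copies, dimensions

H <- 1 * outer(sample(rep(1:k, length.out = n)), 1:k, "==")  # indicator basis
Z <- matrix(rnorm(k * l), k, l)   # coefficients, basis by copies
A <- matrix(rnorm(l * p), l, p)   # loadings, copies by dimensions

Tr <- H %*% Z    # transformations, n x l (not orthogonalized here)
S  <- Tr %*% A   # scores, n x p
Q  <- Z %*% A    # quantifications, k x p

# The scores can equally be computed from the quantifications
same <- isTRUE(all.equal(S, H %*% Q))
```

With an indicator basis, row $r$ of $S_i$ is simply the row of $Q_i$ that belongs to the category observed for object $r$.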
The `Gifi` loss is minimized using alternating least squares (ALS) combined with majorization. The `gifiEngine()` function alternates between updating $X$ and updating $Z_i$ and $A_i$.
`Gifi` also allows variables to be declared *active* or *passive*. Active variables are the variables of main interest; they contribute to the loss and to the ALS step that updates $X$. Passive (or supplementary) variables do not contribute to either; each of them is scaled in a separate step by minimizing $\text{SSQ}\ (X-H_iZ_iA_i)$ for the optimal $X$.
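The alternating scheme can be illustrated on the simplest special case. The sketch below is not `gifiEngine()`: it assumes one categorical variable per set, $p$ unrestricted copies (so only the product $Q_j=Z_jA_j$ matters), no majorization step, and an orthogonal Procrustes update for $X$. All object names are illustrative.

```r
set.seed(123)
n <- 60; m <- 4; k <- 3; p <- 2

# One indicator basis per variable; each variable forms its own set
H <- lapply(seq_len(m), function(j)
  1 * outer(sample(rep(seq_len(k), length.out = n)), seq_len(k), "=="))

# Meet loss for this special case
loss <- function(X, Q) {
  sum(mapply(function(Hj, Qj) sum((X - Hj %*% Qj)^2), H, Q)) / (m * p)
}

# Random start: centered, orthonormal object scores
X <- qr.Q(qr(scale(matrix(rnorm(n * p), n, p), scale = FALSE)))

history <- numeric(0)
for (iter in 1:50) {
  # Step 1: for fixed X, least-squares update of each Q_j
  Q <- lapply(H, function(Hj) qr.coef(qr(Hj), X))
  # Step 2: for fixed Q_j, update X under e'X = 0 and X'X = I
  # (orthogonal Procrustes on the centered sum of the scores H_j Q_j)
  S <- Reduce(`+`, Map(`%*%`, H, Q))
  S <- scale(S, center = TRUE, scale = FALSE)
  sv <- svd(S)
  X <- sv$u %*% t(sv$v)
  history <- c(history, loss(X, Q))
}
```

Because each step minimizes the loss over one block of parameters with the other block held fixed, the values in `history` cannot increase from one iteration to the next.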
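For a passive variable the object scores are held fixed, so the fit reduces to a single least-squares problem. A minimal sketch, assuming a categorical passive variable with $p$ unrestricted copies and a random centered, orthonormal matrix standing in for the optimal $X$ (names are illustrative):

```r
set.seed(2)
n <- 30; k <- 4; p <- 2

# Stand-in for the optimal X obtained from the active variables
X <- qr.Q(qr(scale(matrix(rnorm(n * p), n, p), scale = FALSE)))

# Indicator basis of the passive variable
H_pass <- 1 * outer(sample(rep(seq_len(k), length.out = n)), seq_len(k), "==")

# Least-squares quantifications for fixed X; X itself is not updated
Q_pass <- qr.coef(qr(H_pass), X)
loss_pass <- sum((X - H_pass %*% Q_pass)^2)

# For an indicator basis these are the category centroids of X
centroids <- crossprod(H_pass, X) / colSums(H_pass)
```

The passive variable is thus placed in the solution without influencing it.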
`Gifi` provides several options for handling missing values:
- *single*: add a single 0/1 column to the basis with 1 for missing;
- *multiple*: add a 0/1 column to the basis for each missing observation, i.e., append an identity matrix;
- *average*: use a basis row with all elements equal to the mean of the non-missing rows.
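The three options can be sketched for an indicator basis as follows. This is an illustration, not the package code. In particular, the *average* option is read here as filling a missing row with one constant, the mean entry of the observed rows (which is $1/k_i$ for an indicator basis); that reading is an assumption.

```r
# Categorical variable with k = 3 categories and two missing values
x <- c(1, 2, NA, 3, 2, NA, 1)
k <- 3
miss <- which(is.na(x))

# Indicator basis; rows of missing observations start out as zeros
H <- 1 * outer(x, seq_len(k), "==")
H[miss, ] <- 0

# single: one extra column flagging all missing observations
H_single <- cbind(H, as.numeric(is.na(x)))

# multiple: one extra column per missing observation (identity block)
extra <- matrix(0, length(x), length(miss))
extra[cbind(miss, seq_along(miss))] <- 1
H_multiple <- cbind(H, extra)

# average: missing rows get a constant, here the mean entry of the
# observed rows (assumed reading of the option)
H_average <- H
H_average[miss, ] <- mean(H[-miss, ])

dim(H_single)    # 7 x 4
dim(H_multiple)  # 7 x 5
dim(H_average)   # 7 x 3
```

Under *single* all missing observations share one quantification, whereas under *multiple* each missing observation receives its own.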
## Implementation
So far, the following wrapper functions are implemented. Internally, they all use the same `gifiEngine()` function to minimize the loss given above. The main differences between these functions are the default settings for the number of copies and the number of sets:
- `princals()`: one variable per set, all variables one copy.
- `homals()`: one variable per set, all variables $p$ copies.
- `morals()`: two sets, one set has a single variable with one copy ($p = 1$).
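These defaults can be laid out as a small table of sets and copies. The helper below is hypothetical and not part of `Gifi`; it only tabulates the layout described in the list. For `morals()` it assumes that the last variable is the single-variable set and that, with $p = 1$, the remaining variables also get one copy.

```r
# Hypothetical helper (not a Gifi function): default sets and copies
default_layout <- function(wrapper = c("princals", "homals", "morals"),
                           n_vars, p = 2) {
  wrapper <- match.arg(wrapper)
  vars <- seq_len(n_vars)
  switch(wrapper,
    # one variable per set, one copy each
    princals = data.frame(variable = vars, set = vars, copies = 1),
    # one variable per set, p copies each
    homals   = data.frame(variable = vars, set = vars, copies = p),
    # two sets; the last variable forms its own set with one copy
    # (one copy for the other variables is an assumption, since p = 1)
    morals   = data.frame(variable = vars,
                          set = c(rep(1, n_vars - 1), 2),
                          copies = 1)
  )
}

lay_princals <- default_layout("princals", n_vars = 4)
lay_homals   <- default_layout("homals", n_vars = 4, p = 3)
lay_morals   <- default_layout("morals", n_vars = 4)
```

Seen this way, the wrappers differ only in how they fill the `set` and `copies` columns before the same engine is called.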
`princals()` is designed to fit ordinal or mixed PCA in a user-friendly way, whereas `homals()` is designed for multiple correspondence analysis. However, if the default settings for `princals()` and `homals()` are changed accordingly, they both give the same result. `morals()` performs multiple (monotone) regression analysis within the Gifi system. Applied vignettes are provided for these three functions.