Gifi Theory

Patrick Mair, Jan De Leeuw

The Gifi package represents an easier-to-use version of the homals (De Leeuw and Mair 2009) package for multivariate analysis with optimal scaling (Gifi 1990). There are two main differences between Gifi and homals:

This vignette focuses on the Gifi theory using the idea of copies. The other vignettes are applied and demonstrate how to use the main functions of the Gifi package.

Theory in a Nutshell

The data are collected in an \(n \times m\) data frame. Whereas the homals package uses the homogeneity loss function (see De Leeuw and Mair 2009), the Gifi package minimizes the following loss function (the meet loss; see Gifi 1990, Sec. 4.4), where SSQ denotes the sum of squares of the elements of a matrix:

\[\sigma(X,Z,A)=\frac{1}{mp}\sum_{j=1}^m\text{SSQ}\ (X-\sum_{i\in I_j}H_iZ_iA_i)\]

The index set \(\mathcal{N}=\{1,2,\cdots,N\}\), where \(N\) is the total number of active variables (see below) in the analysis, is partitioned into the \(m\) index sets \(I_j\), with \(I_j\cap I_l=\emptyset\) and \(\bigcup_{j=1}^m I_j=\mathcal{N}\).
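
To make the formula concrete, the following minimal sketch evaluates the meet loss for a toy problem in plain Python. It is not code from the Gifi package. The function names (`matmul`, `ssq`, `meet_loss`) are hypothetical, and the shapes are an assumption: \(H_i\) is taken to be \(n\times k_i\), \(Z_i\) to be \(k_i\times r_i\), and \(A_i\) to be \(r_i\times p\), so that each product \(H_iZ_iA_i\) has the same shape as \(X\).

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def ssq(a):
    """Sum of squares of all entries of a matrix."""
    return sum(v * v for row in a for v in row)


def meet_loss(X, H, Z, A, sets):
    """Evaluate sigma(X, Z, A); `sets` holds the index sets I_1, ..., I_m."""
    n, p = len(X), len(X[0])
    m = len(sets)
    total = 0.0
    for I_j in sets:
        # Accumulate the sum of H_i Z_i A_i over the variables in set j.
        fit = [[0.0] * p for _ in range(n)]
        for i in I_j:
            term = matmul(matmul(H[i], Z[i]), A[i])
            fit = [[f + t for f, t in zip(fr, tr)] for fr, tr in zip(fit, term)]
        resid = [[x - f for x, f in zip(xr, fr)] for xr, fr in zip(X, fit)]
        total += ssq(resid)
    return total / (m * p)


# Toy data (hypothetical): n = 4 objects, p = 1 dimension, two variables,
# each forming its own set, so m = 2.
X = [[0.5], [0.5], [-0.5], [-0.5]]
H = [
    [[1, 0], [1, 0], [0, 1], [0, 1]],  # variable 0 reproduces X exactly
    [[1, 0], [0, 1], [1, 0], [0, 1]],  # variable 1 gets a zero weight below
]
Z = [[[0.5], [-0.5]], [[0.5], [-0.5]]]
A = [[[1.0]], [[0.0]]]
sets = [[0], [1]]

loss = meet_loss(X, H, Z, A, sets)
```

In this toy case the first set fits \(X\) perfectly and the second contributes \(\text{SSQ}(X)=1\), so the loss is \((0+1)/(2\cdot 1)=0.5\).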

\(X\) is an \(n\times p\) matrix of object scores (\(p\) being the number of dimensions). \(X\) is centered (\(e'X=0\), with \(e\) a vector of ones) and orthonormal (\(X'X=I\)). For each variable \(i\), the meet loss involves:

Instead of using rank restrictions like homals, Gifi uses the idea of copies, first introduced by De Leeuw (1984), which are literally copies (or duplicates) of variables that enter the loss. Overall, this concept, called multiple quantifications in the original Gifi terminology, makes the system more flexible.

In addition, in homals all data were categorical and the basis was always an indicator matrix. In Gifi the basis is either categorical or a B-spline basis (van Rijckevorsel 1988); the latter requires the user to specify the knots, which implies that the data must be numerical.
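
As an illustration of the categorical case, the sketch below builds an indicator matrix from a single categorical variable in plain Python. The function name `indicator_matrix` and the toy data are hypothetical, and the alphabetical ordering of categories is a choice made here, not necessarily what the package does. The B-spline case is not shown.

```python
def indicator_matrix(values):
    """Return (categories, G) with G[r][c] = 1 if values[r] == categories[c]."""
    categories = sorted(set(values))
    column = {cat: c for c, cat in enumerate(categories)}
    G = [[0] * len(categories) for _ in values]
    for r, v in enumerate(values):
        G[r][column[v]] = 1
    return categories, G


# Toy categorical variable observed on four objects.
colour = ["red", "blue", "red", "green"]
cats, G = indicator_matrix(colour)
# cats is ['blue', 'green', 'red'];
# G is [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

Each row contains exactly one 1, marking the category the object belongs to.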

For each variable \(i\), the following matrices are returned by various Gifi functions like homals() and princals():

The Gifi loss is minimized using alternating least squares (ALS), combined with majorization. The gifiEngine() function alternates between updating \(X\) and updating \(Z_i\) and \(A_i\).
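
Whatever the exact update, \(X\) has to satisfy \(e'X=0\) and \(X'X=I\) after each step. The following sketch shows one way to impose both constraints on an arbitrary matrix, using column centering followed by Gram-Schmidt orthonormalization. This is an assumption made for illustration: the engine may well use a different normalization (for example an SVD-based one), and the function names are hypothetical.

```python
import math


def center_columns(X):
    """Subtract the column mean from every column, so that e'X = 0."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - mu for v, mu in zip(row, means)] for row in X]


def gram_schmidt(X):
    """Orthonormalize the columns of X (assumed linearly independent)."""
    basis = []  # orthonormal columns found so far
    for col in zip(*X):
        v = list(col)
        for q in basis:
            # Remove the component of v along the earlier column q.
            proj = sum(a * b for a, b in zip(q, v))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        basis.append([a / norm for a in v])
    return [list(row) for row in zip(*basis)]


# Toy 4 x 2 matrix standing in for an unnormalized update of X.
raw = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [6.0, 5.0]]
X = gram_schmidt(center_columns(raw))
```

Centering is done first because Gram-Schmidt only forms linear combinations of the columns, so centered columns stay centered.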

Gifi also allows for declaring variables as active vs. passive. Active variables are all variables of main interest, contributing to the loss and to the ALS step that updates \(X\). Passive (or supplementary) variables don’t contribute to these components; each of them is scaled in a separate step via \(\text{SSQ}\ (X-H_iZ_iA_i)\) using the optimal \(X\).
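
For intuition about the passive step, consider the simplest case: a nominal passive variable with an indicator basis \(H_i\) and no restriction on the product \(Y_i=Z_iA_i\). Both are assumptions made here for illustration. Minimizing \(\text{SSQ}\ (X-H_iY_i)\) over \(Y_i\) is then an ordinary least-squares problem whose solution is the mean of the object scores within each category. The sketch below computes these centroids; the function name and data are hypothetical, and ordinal or spline restrictions would require a constrained fit instead.

```python
def category_centroids(values, X):
    """Least-squares Y for SSQ(X - H Y), with H the indicator matrix of `values`."""
    p = len(X[0])
    sums, counts = {}, {}
    for v, row in zip(values, X):
        acc = sums.setdefault(v, [0.0] * p)
        for d in range(p):
            acc[d] += row[d]
        counts[v] = counts.get(v, 0) + 1
    # Each row of Y is the average object score within one category.
    return {v: [s / counts[v] for s in sums[v]] for v in sums}


# Fixed object scores (centered, X'X = 1) and a toy passive variable.
X = [[0.5], [0.5], [-0.5], [-0.5]]
passive = ["a", "a", "b", "a"]
Y = category_centroids(passive, X)
```

Because \(X\) is held fixed, this computation does not feed back into the object scores, which is what distinguishes a passive variable from an active one.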

Gifi provides several options for handling missing values:

Implementation

So far, the following wrapper functions are implemented. Internally they all use the same gifiEngine() function to minimize the loss given above. The main differences between these functions are the default settings for the number of copies and the number of sets:

princals() is designed to fit ordinal or mixed PCA in a user-friendly way, whereas homals() is designed for multiple correspondence analysis. However, if the default settings of princals() and homals() are changed accordingly, both give the same result. morals() performs multiple (monotone) regression analysis within the Gifi system. Applied vignettes are provided for these three functions.

References



De Leeuw, J. 1984. “The Gifi System of Nonlinear Multivariate Analysis.” In Data Analysis and Informatics, Vol. III, edited by E. Diday, M. Jambu, L. Lebart, J. Pages, and R. Tomassone, 735–52. Amsterdam, The Netherlands: North Holland Publishing Company.

De Leeuw, J., and P. Mair. 2009. “Gifi Methods for Optimal Scaling in R: The Package homals.” Journal of Statistical Software 31 (4): 1–20. https://doi.org/10.18637/jss.v031.i04.

Gifi, A. 1990. Nonlinear Multivariate Analysis. Chichester, UK: John Wiley & Sons.

van Rijckevorsel, J. L. A. 1988. “Fuzzy Coding and B-Splines.” In Data Analysis and Informatics, Vol. III, edited by J. L. A. van Rijckevorsel and J. De Leeuw, 33–54. Chichester, UK: John Wiley & Sons.