One of the most fundamental analyses in the social sciences is the comparison of groups. However, this seemingly straightforward task is often complicated by two significant challenges: ensuring group equivalence and assessing measurement invariance. If not adequately addressed, these challenges can lead to misleading conclusions and undermine the validity of research findings. This application aims to make the process accessible by integrating several advanced statistical techniques, including:
You can launch CALMs in one of two ways:
Simply visit: evaluent.shinyapps.io/CALMs
No setup or installation is needed. The application runs directly in your browser.
Make sure R or RStudio is installed on your system. Open the application to begin.
Run the following in the R Console:
install.packages("calms",dependencies=TRUE)
This will install the CALMs package from a local tarball file rather than a CRAN repository to maintain author anonymity while under peer-review.
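When installing from a local tarball rather than CRAN, the call would instead reference the downloaded file directly. A sketch is shown below; the file name is a placeholder for the actual tarball distributed with the manuscript:

```r
# Sketch: install from a local tarball in the working directory
# ("calms_1.0.tar.gz" is a placeholder file name)
install.packages("calms_1.0.tar.gz", repos = NULL, type = "source")
```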
After installing the CALMs package, run the application locally via R with run_calms():

calms::run_calms()
Users can run CALMs analyses using the dataset built into the application. This built-in dataset is a subset of data from Work Orientations IV – ISSP 2015 (ISSP Research Group, 2017) and is included with permission from the ISSP Research Group. The subset and modifications applied to the original dataset were generated using the following code:
### Load necessary packages
library(foreign)
library(haven)

### Read in data set without labels
dso <- read.spss("ZA6770_v2-1-0.sav",
                 use.value.labels = FALSE, max.value.labels = Inf,
                 to.data.frame = TRUE)
nrow(dso)
names(dso)

### Read in data set with labels
dsoa <- read.spss("ZA6770_v2-1-0.sav",
                  use.value.labels = TRUE, max.value.labels = Inf,
                  to.data.frame = TRUE)
nrow(dsoa)
names(dsoa)

### Select only needed columns
# quality of job content (JC: v22-v24) and quality of work environment (WE: v25-v27)
# demographics: SEX, EMPREL, TYPORG2, DEGREE
ds <- subset(dso, select = c(country, v22:v27, SEX, DEGREE, EMPREL, TYPORG2))
names(ds)
ds[, c("country", "SEX", "DEGREE", "EMPREL", "TYPORG2")] <-
  dsoa[, c("country", "SEX", "DEGREE", "EMPREL", "TYPORG2")]

### Get data for the groups (i.e., countries)
# country numerical codes in SPSS: UK = 826, US = 840
table(ds$country)
ds <- subset(ds, (country == "GB-Great Britain and/or United Kingdom" |
                  country == "US-United States"))
ds$country <- factor(ds$country)
table(ds$country)
nrow(ds)

### Get rid of missing values
nrow(ds)
ds <- na.omit(ds)
nrow(ds)

### Check values and collapse levels
table(ds$SEX)
table(ds$DEGREE)
table(ds$EMPREL)
table(ds$TYPORG2)
table(ds$country)
levels(ds$EMPREL) <- c("Employee", "Self-employed", "Self-employed", NA)
levels(ds$DEGREE) <- c(rep("no univ", 5), rep("univ", 2))

### Get rid of missing values introduced by the recoding
nrow(ds)
ds <- na.omit(ds)
nrow(ds)

### Recode factors to numeric 0/1 indicators
ds$SEX
levels(ds$SEX)
levels(ds$SEX) <- c(1, 0)      # Set "Male" to 1
levels(ds$EMPREL)
levels(ds$EMPREL) <- c(0, 1)   # Set "Self-employed" to 1
levels(ds$TYPORG2)
levels(ds$TYPORG2) <- c(0, 1)  # Set "Private employer" to 1
levels(ds$DEGREE)
levels(ds$DEGREE) <- c(0, 1)   # Set "univ" to 1
levels(ds$country)
levels(ds$country) <- c(1, 0)  # Set "US-United States" to 1
ds$SEX     <- as.numeric(ds$SEX) - 1
ds$EMPREL  <- as.numeric(ds$EMPREL) - 1
ds$TYPORG2 <- as.numeric(ds$TYPORG2) - 1
ds$DEGREE  <- as.numeric(ds$DEGREE) - 1
ds$country <- as.numeric(ds$country) - 1
nrow(ds)
names(ds)

### Save the final demonstration dataset
write_sav(ds, "WosDemo.sav")
Users can run CALMs analyses on their own datasets. To do so, they must upload two files simultaneously from the same directory:
The CALMs application supports data files in .csv, .dat, and .sav formats.
The meta file must be a .csv file named such that the last four characters of the file name (before the extension) are “Meta” (e.g., My_Meta.csv, Meta.csv). The meta file must contain the columns itemo, item, type, scale, ds, and missing.
A sample meta file is provided below that corresponds to a subset of the 2015 Work Orientations dataset (ISSP Research Group, 2017) that is built into the CALMs application for demonstration purposes.
itemo     item        type   scale  ds           missing
country   USA         group         WosDemo.sav  NA
v22       JC1         item   JC                  NA
v23       JC2         item   JC                  NA
v24       JC3         item   JC                  NA
v25       WE1         item   WE                  NA
v26       WE2         item   WE                  NA
v27       WE3         item   WE                  NA
SEX       Male        cov                        NA
DEGREE    UnivDegree  cov                        NA
EMPREL    SelfEmp     cov                        NA
TYPORG2   PrivateOrg  cov                        NA
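Before uploading, users can verify their meta file from R. The sketch below (the file name My_Meta.csv is hypothetical) checks the naming convention and required columns described above:

```r
# Sketch: validate a CALMs meta file before upload (file name is hypothetical)
meta_path <- "My_Meta.csv"

# The file name must end in "Meta" before the .csv extension
stopifnot(grepl("Meta\\.csv$", meta_path))

meta <- read.csv(meta_path, stringsAsFactors = FALSE)

# The meta file must contain these columns
required <- c("itemo", "item", "type", "scale", "ds", "missing")
stopifnot(all(required %in% names(meta)))

# Every row must be declared as a group, item, or covariate
stopifnot(all(meta$type %in% c("group", "item", "cov")))
```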
The CALMs Shiny application is organized with multiple tabs that each serve a specific purpose:
This section walks through the CALMs application interface using screenshots for illustrative purposes. Specifically, we analyze data from the 2015 Work Orientations Survey that includes responses from the United States (USA) and the United Kingdom (UK; ISSP Research Group, 2017). The 2015 Work Orientations dataset is from an international project that began in 1984 and was collected across 37 countries (ISSP Research Group, 2017).
The portion of the 2015 Work Orientations dataset used for the demonstration includes 1,477 responses from the USA and 1,793 responses from the UK. We chose these two countries because full scalar invariance was not supported in previous measurement invariance studies of the constructs quality of job context (JX), quality of job content (JC), and quality of work environment (WE), using the 1989 Work Orientations dataset (Cheung & Lau, 2012; Cheung & Rensvold, 1999).
The 2015 Work Orientations dataset provided data for two of these previously utilized constructs, JC and WE (ISSP Research Group, 2017). Each construct is measured by three items, scored on a five-point Likert-type scale ranging from 1 (strongly agree) to 5 (strongly disagree). Figure 1 depicts the 2-factor measurement model used in the illustrative example. What follows is a recommended set of steps to comprehensively analyze the latent means of JC and WE by country, where country is either USA or UK. Note that researchers may choose to use the application in a different way than the example workflow and skip tests if that fits their research scenario.
Figure 1. Measurement Model
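In lavaan-style syntax (a common notation for such models in R; not necessarily the exact specification CALMs generates internally), the two-factor measurement model in Figure 1 can be written as:

```r
# Two-factor measurement model from Figure 1 (lavaan syntax; item names
# follow the demonstration meta file)
model <- '
  JC =~ JC1 + JC2 + JC3   # quality of job content
  WE =~ WE1 + WE2 + WE3   # quality of work environment
'
```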
Users can either use the built-in dataset by leaving Use 2015 Work Orientations Survey Data selected, or upload their own data by deselecting this option.
To upload your own dataset and accompanying *Meta.csv file, follow the steps shown in the GIF below.
The labeling of the items in the original dataset (ISSP Research Group, 2017) was not intuitive for our illustrative example; hence, the original items were renamed as previously described and as depicted in Figure 2.
Figure 2. View Data Tab
CALMs uses the MatchIt package in R (Ho et al., 2011) for propensity score analysis, including checking for group equivalency. The comparison groups for the demonstration with the 2015 Work Orientations Survey data are the USA and the UK. Hence, USA was selected as the Grouping Variable. All possible covariates were selected as Covariates to Check.
Figure 3. Check Group Equivalency Tab
The output in Figure 3 shows significant differences by country in employment type and organization type: employment type (SelfEmp) differed statistically significantly (p < .05), while organization type (PrivateOrg) differed both statistically (p < .05) and practically (Cramer’s V > .10) significantly.
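The equivalency checks reported in Figure 3 can also be reproduced manually. The sketch below (variable names follow the demonstration meta file, and ds is assumed to hold the demonstration data) computes a chi-square test and Cramer’s V for one covariate by country:

```r
# Sketch: chi-square test and Cramer's V for one covariate by country
# (ds and the variable names are assumptions based on the demo dataset)
tab <- table(ds$PrivateOrg, ds$USA)
chi <- chisq.test(tab, correct = FALSE)
n   <- sum(tab)
v   <- sqrt(unname(chi$statistic) / (n * (min(dim(tab)) - 1)))
chi$p.value  # statistical significance (p < .05)
v            # practical significance (Cramer's V > .10)
```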
CALMs offers users the flexibility to use either a default call to MatchIt or to define a custom call. A link to the MatchIt documentation is included within the application.
To customize the call, deselect Use Default call to matchit and edit the arguments following data=dpsm in the provided code box.
Two propensity score matching (PSM) methods, nearest neighbor and genetic matching, are the most common.
Nearest neighbor matching requires the input of all demographic variables and has been recommended as the most straightforward PSM method (Caliendo & Kopeinig, 2008; Keiffer & Lane, 2016). Although it is computationally efficient with large datasets, it pairs each treated unit with its nearest control without optimizing the overall set of matches. As a result, covariates may remain imperfectly balanced and the groups less equivalent than under more stringent or robust matching methods.
Genetic matching is recommended when the matched groups must be highly equivalent, as it achieves good covariate balance even with highly complex data (Randolph et al., 2014). Like nearest neighbor matching, it takes all demographic variables (e.g., gender, age group, race/ethnicity, and educational level) as input, but it iteratively searches for the match that best removes statistically (e.g., p ≤ .05) and practically (e.g., Cramer’s V ≥ .10) significant differences between groups.
By default, CALMs uses the nearest neighbor method.
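The default call corresponds roughly to the MatchIt sketch below; the formula uses the variable names from the demonstration meta file, and dpsm is the data object named in the editable code box:

```r
library(MatchIt)

# Nearest neighbor matching on all covariates (the CALMs default; sketch)
m_nn <- matchit(USA ~ Male + UnivDegree + SelfEmp + PrivateOrg,
                data = dpsm, method = "nearest")
summary(m_nn)                # balance diagnostics before/after matching
matched <- match.data(m_nn)  # extract the matched dataset

# Genetic matching alternative (requires the Matching and rgenoud packages)
m_gen <- matchit(USA ~ Male + UnivDegree + SelfEmp + PrivateOrg,
                 data = dpsm, method = "genetic")
```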
Figure 4. Propensity Score Analysis Setup Tab
Figure 5 presents the result of the propensity score analysis using the default call previously described.
Figure 5. Propensity Score Analysis Results Tab
The nearest neighbor method yielded two equivalent groups with 769 responses in each country. A statistically significant difference in gender (Male) by country remained, but we retained the nearest neighbor results because the difference was not practically significant (all Cramer’s V < .10).
When conducting measurement invariance tests, the application defaults to using the matched dataset.
To change this, users can deselect Use matched data for invariance tests.
Users can also select the Grouping Variable and Items to Analyze. By default, the application includes all items identified in the *Meta.csv file as type item.
Measurement invariance tests include configural, metric, and scalar invariance. Omnibus and scale-level tests are provided for both the metric and scalar invariance tests. Commonly recommended fit index criteria include: (a) comparative fit index (CFI) ≥ .95; (b) standardized root-mean-square residual (SRMR) ≤ .05; and (c) root-mean-square error of approximation (RMSEA) between .05 and .08 (Kline, 2016; Schumacker & Lomax, 2016).
Statistically significant model noninvariance is determined based on the p-value of the χ² difference test at p ≤ .05 (Cheung & Rensvold, 1999; van de Schoot et al., 2012). Guidelines have been provided to evaluate the ΔCFI for practical model (non)invariance, namely: (a) practical model invariance for ΔCFI ≥ -.01; (b) potential practical model noninvariance for ΔCFI between -.01 and -.02; and (c) practical model noninvariance for ΔCFI ≤ -.02 (Cheung & Rensvold, 2002).
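The configural/metric/scalar sequence that CALMs automates can be sketched with the lavaan package (an assumption for illustration; matched denotes the matched dataset and USA the grouping variable from the demonstration):

```r
library(lavaan)

# Two-factor model from Figure 1
model <- '
  JC =~ JC1 + JC2 + JC3
  WE =~ WE1 + WE2 + WE3
'

fit_config <- cfa(model, data = matched, group = "USA")
fit_metric <- cfa(model, data = matched, group = "USA",
                  group.equal = "loadings")
fit_scalar <- cfa(model, data = matched, group = "USA",
                  group.equal = c("loadings", "intercepts"))

# Chi-square difference tests between nested models
anova(fit_config, fit_metric, fit_scalar)

# Fit indices for evaluating each model and computing delta-CFI
fitMeasures(fit_config, c("cfi", "srmr", "rmsea"))
```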
Figure 6. Measurement Invariance Tab
The results of the measurement invariance tests (see Figure 6) indicated that the configural model showed good fit (SRMR = .031, CFI = .956). The metric model was compared to the configural model and met the criteria for both statistical and practical invariance (Δχ²[4] = 8.484, p = .075; ΔCFI = -.004). However, the data did not reach the thresholds for scalar invariance (Δχ²[4] = 55.772, p < .001; ΔCFI = -.042). Both the JC (Δχ²[2] = 24.996, p < .001; ΔCFI = -.019) and WE (Δχ²[2] = 30.796, p < .001; ΔCFI = -.023) scales demonstrated evidence of scalar non-invariance.
In our illustrative example, it was not necessary to conduct follow-up tests for metric invariance as neither the omnibus test nor the scale-level tests for JC and WE indicated evidence of metric non-invariance.
However, for demonstration purposes, we conducted metric invariance tests specifically on the JC scale.
Note that the application uses the p-value of the χ² difference test when determining invariant subsets of items (Cheung & Rensvold, 1999). The default significance level (alpha) is set to .05, but users may adjust this threshold as needed. In this example, we set the alpha to .01 (see Figure 7).
Figure 7. Metric Invariance Tab
The factor ratio test (see Figure 7) confirmed that all JC items were metric invariant. Similarly, all WE items were metric invariant (tests not shown).
Because full scalar invariance was not demonstrated, we conducted partial measurement invariance testing on each scale. Had we determined that the factor loadings were non-invariant at the metric invariance assessment, we could have allowed a set of loadings to be freely estimated to allow for a partial scalar invariance assessment.
Figure 8. Scalar Invariance Tab
The factor ratio test (see Figure 8) identified JC2 and JC3 as an invariant subset of JC items (p > .01). Similarly, WE1 and WE2 were identified (tests not shown) as an invariant subset of WE items (p > .01).
Based on the results of the scalar invariance assessment, the intercepts for WE3 and JC1 should be freely estimated to account for the partial scalar invariance.
Building on the results of the scalar invariance testing, we allowed the intercepts for WE3 and JC1 to be freely estimated. Structural invariance holds when the comparison between an unconstrained and a constrained structural model yields a non-significant χ² difference (p > .05) and a negligible CFI difference (Cheung & Rensvold, 1999; Cheung & Rensvold, 2002; Kline, 2016; Schumacker & Lomax, 2016).
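In lavaan terms (a sketch under the same assumptions as before: matched is the matched dataset and USA the grouping variable), freeing the two intercepts and then constraining the latent means corresponds to:

```r
library(lavaan)

model <- '
  JC =~ JC1 + JC2 + JC3
  WE =~ WE1 + WE2 + WE3
'

# Partial scalar model: free the JC1 and WE3 intercepts across groups
fit_partial <- cfa(model, data = matched, group = "USA",
                   group.equal = c("loadings", "intercepts"),
                   group.partial = c("JC1 ~ 1", "WE3 ~ 1"))

# Structural model: additionally constrain the latent means to equality
fit_means <- cfa(model, data = matched, group = "USA",
                 group.equal = c("loadings", "intercepts", "means"),
                 group.partial = c("JC1 ~ 1", "WE3 ~ 1"))

# Chi-square difference test between the two models
anova(fit_partial, fit_means)
```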
Figure 9. Structural Invariance Tab
The results indicate that the set of scales met the criteria for structural invariance. Although the structural invariance model was statistically significantly different from the scalar model (Δχ²[2] = 6.374, p = .041), the difference was not practically significant (ΔCFI = -.004; see Figure 9). However, considering only JC, a statistically significant latent mean difference was observed (-.077, p = .013). Given that the latent mean for the USA was constrained to zero, the negative estimate indicates that the latent mean for the UK is lower in JC. There was no significant latent mean difference for WE across the two countries (-.047, p = .326).
Caliendo, M., & Kopeinig, S. (2008). Some practical guidance for the implementation of propensity score matching. Journal of Economic Surveys, 22(1), 31–72. https://doi.org/10.1111/j.1467-6419.2007.00527.x
Cheung, G. W., & Lau, R. S. (2012). A direct comparison approach for testing measurement invariance. Organizational Research Methods, 15(2), 167–198. https://doi.org/10.1177/1094428111421987
Cheung, G. W., & Rensvold, R. B. (1999). Testing factorial invariance across groups: A reconceptualization and proposed new method. Journal of Management, 25(1), 1–27. https://doi.org/10.1177/014920639902500101
Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling, 9(2), 233–255. https://doi.org/10.1207/S15328007SEM0902_5
Ho, D., Imai, K., King, G., & Stuart, E. (2011). MatchIt: Nonparametric preprocessing for parametric causal inference. Journal of Statistical Software, 42(8), 1–28. https://doi.org/10.18637/jss.v042.i08
ISSP Research Group. (2017). International Social Survey Programme: Work Orientations IV – ISSP 2015 (ZA6770 data file, version 2.1.0). GESIS Data Archive, Cologne. https://doi.org/10.4232/1.12848
Keiffer, G. L., & Lane, F. C. (2016). Propensity score analysis: An alternative statistical approach for HRD researchers. European Journal of Training and Development, 40(8/9), 660–675. https://doi.org/10.1108/EJTD-06-2015-0046
Kline, R. B. (2016). Principles and practice of structural equation modeling (4th ed.). New York: The Guilford Press.
Randolph, J. J., Falbe, K., Manuel, A., & Balloun, J. (2014). A step-by-step guide to propensity score matching in R. Practical Assessment, Research & Evaluation, 19, 1–6. https://doi.org/10.7275/n3pv-tx27
Schumacker, R. E., & Lomax, R. G. (2016). A beginner’s guide to structural equation modeling (4th ed.). New York: Routledge.
van de Schoot, R., Lugtig, P., & Hox, J. (2012). A checklist for testing measurement invariance. European Journal of Developmental Psychology, 9(4), 486–492. https://doi.org/10.1080/17405629.2012.686740