HW #11: SEM I

Daniel Lewis

9/30/2020

Load Packages

I will use the R package lavaan to estimate the models Rosseel (2012) . This is the most popular SEM package for R, and from what I can tell, is considered to work similarly well to LISREL and Mplus or the built-in modules in SAS, Stata, and SPSS. I prefer it because it allows me to continue to practice learning R instead of having to start over from scratch with a new statistical package. But, where I can, I will check my results from lavaan against the results provided from LISREL.

# Projects
library(here)

# R Markdown
library(knitr)
library(rbbt)

# Statistics
library(psych)
library(lavaan)

# Tidyverse
library(readxl)
library(tidyverse)
library(holodeck)

Import Data

tbl <- read_excel(here('data', 'AMBSTD.xlsx'))

2. Structural Model with Disattenuated Correlations

Disattenuated Correlations

Looks like the first step is to calculate disattenuated correlations. I’ll do the first one by hand(ish) and then use the built in correct.cor() function from the popular psych package to do the remaining ones.

The general equation, pulled from the syllabus, is: \(\phi_{12}= \frac{r_{12}}{\sqrt{\alpha_1}\sqrt{\alpha_2}}\). So I need the (attenuated) correlation and the reliabilities for each variable.

cor(tbl$amb, tbl$cd)
## [1] 0.5513968

The reliabilities for AMB and CD are given and are approximately .663 and .793, respectively.

Then, \(\phi_{AMB, CD} = \frac{r_{AMB, CD}}{\sqrt{\alpha_{AMB}}\sqrt{\alpha_{CD}}} \approx \frac{.551}{\sqrt{.663} \times \sqrt{.793}}\).

.551 / (sqrt(.663) * sqrt(.793)) %>%
  round(3)
## [1] 0.76

Now let’s see if that lines up with the results of the correct.cor() function.

mtx.2.a <- cor(tbl)

num.2.a <- c(.663,.793,.631,.629,.745)

num.2.b <- correct.cor(mtx.2.a,num.2.a) %>%
  set_diag(1) %>%
  t %>%
  .[lower.tri(., diag = TRUE)]

chr.2.a <- c("amb","cd","tp","hi","anx")

mtx.2.b <- getCov(num.2.b, lower=FALSE,
                     names = chr.2.a)

options(knitr.kable.NA = "")
mtx.2.c <- mtx.2.b
mtx.2.c[upper.tri(mtx.2.c)] <- as.numeric("")
mtx.2.c %>%
  kable(digits = 3)
amb cd tp hi anx
amb 1.000
cd 0.760 1.000
tp 0.492 0.429 1.000
hi 0.436 0.428 0.700 1.000
anx 0.160 0.107 0.509 0.594 1

Great, that matches perfectly.

Theoretical Model A

If I understand correctly, the next step is to estimate a structural model, except instead of using the raw data from the path analysis assignment to compute regressions one-by-one, I now use the disattenuated correlation matrix to estimate a system of equations at once using SEM.

Before I can use the sem() function from the lavaan package, I need to specify the model with a series of formulas corresponding to the system of regression equations from the path analysis procedure.

options(knitr.kable.NA = "NA")

chr.2.b <- '
  # regressions
    cd ~ amb
    tp ~ amb + cd
    hi ~ tp + cd
    anx ~ tp + hi
'

Before proceeding, I will verify that sem() produces the same results as path analysis when using the raw data. If it does, I can feel confident using it with the disattenuated correlation matrix.

mtx.2.d <- mtx.2.a %>%
  set_diag(1) %>%
  t %>%
  .[lower.tri(., diag = TRUE)] %>%
  getCov(lower = F, names = chr.2.a)
  
fit.2.a <- sem(chr.2.b, sample.cov = mtx.2.d, sample.nobs = 211)
fit.2.a %>%
  parameterEstimates %>%
  kable
lhs op rhs est se z pvalue ci.lower ci.upper
cd ~ amb 0.5513968 0.0574317 9.600920 0.0000000 0.4388328 0.6639608
tp ~ amb 0.2171171 0.0772044 2.812239 0.0049198 0.0657993 0.3684349
tp ~ cd 0.1834381 0.0772044 2.376007 0.0175011 0.0321203 0.3347559
hi ~ tp 0.3846249 0.0635617 6.051203 0.0000000 0.2600462 0.5092035
hi ~ cd 0.1859613 0.0635617 2.925681 0.0034370 0.0613826 0.3105400
anx ~ tp 0.2104844 0.0685709 3.069589 0.0021435 0.0760879 0.3448808
anx ~ hi 0.3135577 0.0685709 4.572754 0.0000048 0.1791613 0.4479541
cd ~~ cd 0.6926631 0.0674366 10.271319 0.0000000 0.5604898 0.8248365
tp ~~ tp 0.8711407 0.0848129 10.271319 0.0000000 0.7049104 1.0373710
hi ~~ hi 0.7704466 0.0750095 10.271319 0.0000000 0.6234307 0.9174626
anx ~~ anx 0.7953794 0.0774369 10.271319 0.0000000 0.6436058 0.9471529
amb ~~ amb 0.9952607 0.0000000 NA NA 0.9952607 0.9952607

These results are approximately identical to my results from section 2a in HW #10. I feel safe to continue.

Using the same model specification—but now with the disattenuated correlations—I will re-estimate the model.

fit.2.b <- sem(chr.2.b, sample.cov = mtx.2.b, sample.nobs = 211)

fit.2.b %>%
parameterEstimates %>%
  kable
lhs op rhs est se z pvalue ci.lower ci.upper
cd ~ amb 0.7604504 0.0447063 17.009929 0.0000000 0.6728277 0.8480731
tp ~ amb 0.3940041 0.0918609 4.289138 0.0000179 0.2139600 0.5740481
tp ~ cd 0.1289423 0.0918609 1.403669 0.1604174 -0.0511018 0.3089864
hi ~ tp 0.6325881 0.0533254 11.862792 0.0000000 0.5280722 0.7371040
hi ~ cd 0.1573001 0.0533254 2.949815 0.0031796 0.0527842 0.2618159
anx ~ tp 0.1825826 0.0765498 2.385149 0.0170722 0.0325478 0.3326174
anx ~ hi 0.4658416 0.0765498 6.085473 0.0000000 0.3158068 0.6158764
cd ~~ cd 0.4197165 0.0408630 10.271319 0.0000000 0.3396266 0.4998065
tp ~~ tp 0.7473085 0.0727568 10.271319 0.0000000 0.6047078 0.8899093
hi ~~ hi 0.4874784 0.0474602 10.271319 0.0000000 0.3944582 0.5804985
anx ~~ anx 0.6275901 0.0611012 10.271319 0.0000000 0.5078339 0.7473462
amb ~~ amb 0.9952607 0.0000000 NA NA 0.9952607 0.9952607

At a glance, it looks like all the parameter estimates are higher in the model with disattenuated correlations, but let’s just check quickly.

parameterEstimates(fit.2.b)$est[1:7] - parameterEstimates(fit.2.a)$est[1:7]
## [1]  0.20905358  0.17688694 -0.05449579  0.24796324 -0.02866124 -0.02790171
## [7]  0.15228393

Interesting, good thing I checked! Four of the seven path coefficients increased by using the disattenuated correlations, but three of the seven decreased. That said, the three that decreased only did so by small fractions, while the the four that increased did so substantially. So my original analysis was approximately correct.

Clearly using the disattenuated correlations affected the results. And it makes sense to me that disattenuating the correlations would increase the effect sizes. What’s not as clear to me is why disattenuating correlations is epistemically valid. If the variables have measurement error, shouldn’t that change only our confidence in the results (e.g., standard errors)? I don’t see why we should be able to increase effect sizes.

Putting philosophical considerations aside, I will continue on to the next model.

Saturated Model

In a saturated model, every variable is supposed to have every ‘upstream’ variable as a direct predictor Pedhazur (1982) . For example, in Theoretical Model A, AMB is hypothesized to predict HI via CD and TP, but in a saturated version, AMB would predict HI directly.

chr.2.c <- '
  # regressions
    cd ~ amb
    tp ~ amb + cd
    hi ~ amb + cd + tp
    anx ~ amb + cd + tp + hi
'
fit.2.c <- sem(chr.2.c, sample.cov = mtx.2.b, sample.nobs = 211)

fit.2.c %>%
  parameterEstimates %>%
  kable
lhs op rhs est se z pvalue ci.lower ci.upper
cd ~ amb 0.7604504 0.0447063 17.0099294 0.0000000 0.6728277 0.8480731
tp ~ amb 0.3940041 0.0918609 4.2891381 0.0000179 0.2139600 0.5740481
tp ~ cd 0.1289423 0.0918609 1.4036686 0.1604176 -0.0511018 0.3089863
hi ~ amb 0.0121744 0.0773544 0.1573842 0.8749421 -0.1394376 0.1637863
hi ~ cd 0.1491040 0.0745334 2.0004977 0.0454466 0.0030211 0.2951869
hi ~ tp 0.6301102 0.0555982 11.3332771 0.0000000 0.5211396 0.7390807
anx ~ amb -0.0472125 0.0850853 -0.5548850 0.5789733 -0.2139766 0.1195515
anx ~ cd -0.1862894 0.0827513 -2.2511964 0.0243731 -0.3484790 -0.0240999
anx ~ tp 0.2469094 0.0775616 3.1833962 0.0014556 0.0948914 0.3989274
anx ~ hi 0.5211863 0.0757186 6.8832020 0.0000000 0.3727806 0.6695919
cd ~~ cd 0.4197165 0.0408630 10.2713193 0.0000000 0.3396266 0.4998064
tp ~~ tp 0.7473085 0.0727568 10.2713193 0.0000000 0.6047078 0.8899093
hi ~~ hi 0.4874211 0.0474546 10.2713193 0.0000000 0.3944119 0.5804304
anx ~~ anx 0.5896464 0.0574071 10.2713193 0.0000000 0.4771306 0.7021622
amb ~~ amb 0.9952607 0.0000000 NA NA 0.9952607 0.9952607

Saturating the model generated parameter estimates for the effects of AMB on HI and ANX and for the effect of CD on ANX. This seems analogous to the decomposition of correlations in HW #10 into indirect effects, but because I am using the disattenuated correlations here, it doesn’t really make sense to compare the results.

Anyway, the results suggest that perhaps the relationship between CD and ANX could be included (if founded in theory), as the effect is significant.

3. Structural Equation Modeling with Observed Variables

I’m a little confused by the instructions for this section, because it seems exactly the same as what I did in the previous step. But I think the point is to compare the output of SEM with the the output of a series of regression equations. So in this step, I will output the full summary of the fitted model for Theoretical Model A instead of just the parameter estimates.

fit.2.a %>%
  summary
## lavaan 0.6-7 ended normally after 13 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of free parameters                         11
##                                                       
##   Number of observations                           211
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                 3.086
##   Degrees of freedom                                 3
##   P-value (Chi-square)                           0.379
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   cd ~                                                
##     amb               0.551    0.057    9.601    0.000
##   tp ~                                                
##     amb               0.217    0.077    2.812    0.005
##     cd                0.183    0.077    2.376    0.018
##   hi ~                                                
##     tp                0.385    0.064    6.051    0.000
##     cd                0.186    0.064    2.926    0.003
##   anx ~                                               
##     tp                0.210    0.069    3.070    0.002
##     hi                0.314    0.069    4.573    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .cd                0.693    0.067   10.271    0.000
##    .tp                0.871    0.085   10.271    0.000
##    .hi                0.770    0.075   10.271    0.000
##    .anx               0.795    0.077   10.271    0.000

OK, let’s take a look at the output. First, we see that the software was able to execute properly. It used an iterative estimating method called maximum likelihood (ML) and found an optimal solution in 13 tries. It found 11 parameters to estimate, corresponding to seven path coefficients and four variances.

In the next section, the output tells us that the overall fit of the model was not significant (\(p>.05\)). lavaan uses the basic chi-square test statistic. We probably do not want to rely wholly on this test statistic, so I will come back to test statistics in a moment.

In the regressions section, we see the exact same output as from the series of regressions in path analysis. These are the \(\beta\)’s in regression notation, \(p\)’s in path analysis notation, and the \(\gamma\)’s and \(\beta\)’s in SEM notation Pedhazur & Schmelkin (1991) .

The variances are simply the variances of the endogenous variables. In SEM, these are referred to as disturbances, are represented as \(\zeta\)’s, and would be found in the PSI matrix in LISREL.

While the basic lavaan summary output does not provide additional test statistics, such as RMSEA or CFI, they can easily be called with summary(fit, fit.measures = TRUE) or, more completely, with fitMeasures(fit).

summary(fit.2.a, fit.measures = T)
## lavaan 0.6-7 ended normally after 13 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of free parameters                         11
##                                                       
##   Number of observations                           211
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                 3.086
##   Degrees of freedom                                 3
##   P-value (Chi-square)                           0.379
## 
## Model Test Baseline Model:
## 
##   Test statistic                               208.997
##   Degrees of freedom                                10
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       0.999
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -1092.624
##   Loglikelihood unrestricted model (H1)      -1091.081
##                                                       
##   Akaike (AIC)                                2207.248
##   Bayesian (BIC)                              2244.118
##   Sample-size adjusted Bayesian (BIC)         2209.263
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.012
##   90 Percent confidence interval - lower         0.000
##   90 Percent confidence interval - upper         0.117
##   P-value RMSEA <= 0.05                          0.587
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.025
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   cd ~                                                
##     amb               0.551    0.057    9.601    0.000
##   tp ~                                                
##     amb               0.217    0.077    2.812    0.005
##     cd                0.183    0.077    2.376    0.018
##   hi ~                                                
##     tp                0.385    0.064    6.051    0.000
##     cd                0.186    0.064    2.926    0.003
##   anx ~                                               
##     tp                0.210    0.069    3.070    0.002
##     hi                0.314    0.069    4.573    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .cd                0.693    0.067   10.271    0.000
##    .tp                0.871    0.085   10.271    0.000
##    .hi                0.770    0.075   10.271    0.000
##    .anx               0.795    0.077   10.271    0.000

Overall, I think we have to say this is a pretty weak model.

Like LISREL, lavaan can also produce estimates for the covariances of structural variables (vcov()), \(R^2\) for structural equations (summary(fit, rsquare = TRUE)), fitted values and residuals (fitted.values(), resid(fit, type = standardized)), a standardized solution (standardizedSolution()), and modification indices (modIndices()). I won’t go through all of them with this model, so suffice it to say for now that lavaan provides the information.

4. Structural Equation Modeling with Corrections for Measurement Error

SEM with Disattenuated Correlations

I understand this model to be equivalent to the structural model computed in section 2, except we will now look at the additional information provided by the SEM procedure.

I will go in order of the assignment.

Model fit

I’m going to keep it simple and only look at \(\chi^2\), RMSEA, and the Tucker-Lewis Index (TLI):

fitMeasures(fit.2.b, fit.measures = c("chisq", "df", "pvalue", "rmsea", "tli")) %>%
  kable
x
chisq 13.183572086
df 3.000000000
pvalue 0.004255979
rmsea 0.126837658
tli 0.931246288

The \(\chi^2\) statistic looks good with a \(p\)-value less than .001. But an RMSEA of 0.13 is unacceptably high and a TLI of .93 is considered marginal Kenny (2020) . Overall, I’m getting mixed signals on the goodness of fit.

This might be a good time to compare the results from lavaan with those from LISREL for this model. LISREL does not report the TLI, but the \(\chi^2\) and RMSEA values are within rounding distance of each other. 13.183 vs. 13.166 and 0.127 vs. 0.124. That gives me confidence that lavaan is working as expected.

Parameter Estimates

This really should be identical to the analysis of the structural model in section 2, so I won’t reproduce the table. But looking back at it, all the estimates are significant and positive, as hypothesized.

I checked the LISREL output again for correspondence, and this time the results are identical.

SEM with Latent Variables

This model does not incorporate any additional information beyond the previous model, so I don’t expect it to diverge dramatically, but the different statistical procedure may produce some artifacts.

I’m going to make an attempt to reproduce the analysis in R using lavaan, but if I can’t figure it out, I’ll fall back on the LISREL output provided.

# Loadings
sqrt(num.2.a)

# Error Variances
1 - num.2.a

# Model specification
chr.2.d <- '
  # measurement model
    amb_xi =~ .814*amb
    cd_eta =~ .890*cd
    tp_eta =~ .794*tp
    hi_eta =~ .793*hi
    anx_eta =~ .863*anx
  # regressions
    cd_eta ~ amb_xi
    tp_eta ~ amb_xi + cd_eta
    hi_eta ~ tp_eta + cd_eta
    anx_eta ~ tp_eta + hi_eta
  # fix error variances
    amb ~~ .337*amb
    cd ~~ .207*cd
    tp ~~ .369*tp
    hi ~~ .371*hi
    anx ~~ .255*anx
'

I think I may have actually done that right! I honestly wasn’t sure I’d be able to pull it off. Let’s test it.

fit.4 <- sem(chr.2.d, sample.cov = mtx.2.d, sample.nobs = 211)

summary(fit.4)
## lavaan 0.6-7 ended normally after 33 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of free parameters                         12
##                                                       
##   Number of observations                           211
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                 5.170
##   Degrees of freedom                                 3
##   P-value (Chi-square)                           0.160
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   amb_xi =~                                           
##     amb               0.814                           
##   cd_eta =~                                           
##     cd                0.890                           
##   tp_eta =~                                           
##     tp                0.794                           
##   hi_eta =~                                           
##     hi                0.793                           
##   anx_eta =~                                          
##     anx               0.863                           
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   cd_eta ~                                            
##     amb_xi            0.763    0.088    8.662    0.000
##   tp_eta ~                                            
##     amb_xi            0.397    0.198    2.007    0.045
##     cd_eta            0.115    0.189    0.609    0.542
##   hi_eta ~                                            
##     tp_eta            0.659    0.120    5.498    0.000
##     cd_eta            0.115    0.104    1.111    0.266
##   anx_eta ~                                           
##     tp_eta            0.149    0.172    0.866    0.386
##     hi_eta            0.471    0.174    2.700    0.007
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .amb               0.337                           
##    .cd                0.207                           
##    .tp                0.369                           
##    .hi                0.371                           
##    .anx               0.255                           
##     amb_xi            0.993    0.146    6.793    0.000
##    .cd_eta            0.418    0.095    4.403    0.000
##    .tp_eta            0.759    0.137    5.535    0.000
##    .hi_eta            0.491    0.129    3.794    0.000
##    .anx_eta           0.652    0.110    5.914    0.000

I’m seeing a \(\chi^2\) of about 5.17 in the R version and a \(\chi^2\) of about 5.12 in the LISREL version. That’s pretty close, no? I’m going to proceed with the R version.

Model Fit

We’ve already seen the \(\chi^2\) is non-significant (\(p>0.1\)), but let’s check the RMSEA and TLI, as before.

fitMeasures(fit.4, fit.measures = c("rmsea", "tli")) %>%
  kable
x
rmsea 0.05854444
tli 0.96365816

OK, now we’re cooking with gas! An RMSEA of .06 is not bad, and a TLI of .96 is considered good Kenny (2020) . Overall, I think we have to conclude that this model fits the data better than the model that simply used disattenuated correlations.

Parameter Estimates

parameterEstimates(fit.4, remove.nonfree = T) %>%
  kable
lhs op rhs est se z pvalue ci.lower ci.upper
6 cd_eta ~ amb_xi 0.7625065 0.0880293 8.6619658 0.0000000 0.5899723 0.9350407
7 tp_eta ~ amb_xi 0.3968277 0.1976774 2.0074511 0.0447017 0.0093871 0.7842682
8 tp_eta ~ cd_eta 0.1148200 0.1885138 0.6090802 0.5424713 -0.2546603 0.4843003
9 hi_eta ~ tp_eta 0.6592937 0.1199245 5.4975737 0.0000000 0.4242460 0.8943414
10 hi_eta ~ cd_eta 0.1152745 0.1037270 1.1113253 0.2664284 -0.0880268 0.3185757
11 anx_eta ~ tp_eta 0.1486640 0.1716193 0.8662430 0.3863569 -0.1877037 0.4850317
12 anx_eta ~ hi_eta 0.4705380 0.1742488 2.7003805 0.0069260 0.1290167 0.8120593
18 amb_xi ~~ amb_xi 0.9934571 0.1462385 6.7934017 0.0000000 0.7068348 1.2800793
19 cd_eta ~~ cd_eta 0.4175367 0.0948342 4.4028055 0.0000107 0.2316650 0.6034083
20 tp_eta ~~ tp_eta 0.7593272 0.1371784 5.5353273 0.0000000 0.4904625 1.0281918
21 hi_eta ~~ hi_eta 0.4910500 0.1294226 3.7941604 0.0001481 0.2373864 0.7447136
22 anx_eta ~~ anx_eta 0.6515061 0.1101622 5.9140646 0.0000000 0.4355923 0.8674200

The effect estimates are all positive but definitely not all significant. CD is not significantly predictive of either TP or HI, and TP is not significantly predictive of ANX.

Conclusions

The latent-variable model performed the best in terms of overall model fit, but not all its paths were significant. That hardly gives me confidence. The disattenuated-correlations model was mediocre for overall model fit, but the parameter estimates matched the theory. That’s a mixed bag. And the raw-data structural model had strong TLI and RMSEA values, but we know that one is flawed because the reliabilities are low.

The bigger picture is that different statistical procedures that all incorporate the same information are producing different results, and I find that worrisome. I don’t really understand what’s going on under the hood well enough to say why it’s happening, but overall, I do not have confidence in Theoretical Model A.

Export Objects

# Objects to use again in HW #12
fit.observed <- fit.2.a
fit.disattenuated <- fit.2.b
fit.reliabilities <- fit.4
save(fit.observed, fit.disattenuated, fit.reliabilities, file = here('data', "hw11data.RData"))

References

bbt_write_bib("biblio.json", bbt_detect_citations("hw11.Rmd"), overwrite = TRUE)

Kenny, D. A. (2020, June 5). Measuring Model Fit. http://davidakenny.net/cm/fit.htm

Pedhazur, E. J. (1982). Path Analysis. In Multiple Regression in Behavioral Research. Holt.

Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: an integrated approach. Lawrence Erlbaum Associates.

Rosseel, Y. (2012). lavaan: An R Package for Structural Equation Modeling. Journal of Statistical Software, 48(1), 1–36. https://doi.org/10.18637/jss.v048.i02