2. Path Analysis of Theoretical Model A
Before diving into the analysis, I will load my packages, import the data, and take a look.
Packages
Data
amb | cd | tp | hi | anx |
---|---|---|---|---|
-1.8254544 | -2.3190302 | -2.2932604 | -1.5294401 | -1.9294029 |
1.8031440 | 1.6117188 | 0.6198821 | 0.2343117 | 1.0286644 |
-0.2481765 | 0.7392864 | 1.0568971 | 0.3556038 | 1.3984228 |
-0.2481765 | -0.1335827 | 0.3285387 | 0.9948955 | -0.8201277 |
1.1830218 | 1.1755026 | 0.7658451 | 0.3565158 | -0.0806108 |
0.5628995 | -0.5711088 | 1.0568971 | 1.4198739 | 0.6589060 |
0.0845604 | -0.5711088 | -1.2741412 | -2.0155206 | -0.4503692 |
0.2287305 | -0.1348926 | -0.3995284 | -1.0424476 | -0.4503692 |
1.5167134 | 0.7388497 | 0.7658451 | 0.2042167 | 3.9867317 |
1.6599287 | 0.3030702 | -0.1084764 | -0.8290464 | -0.8201277 |
n | mean | sd | min | max | |
---|---|---|---|---|---|
amb | 211 | 0 | 1 | -3.497254 | 1.803144 |
cd | 211 | 0 | 1 | -3.628115 | 1.611719 |
tp | 211 | 0 | 1 | -2.293843 | 1.493912 |
hi | 211 | 0 | 1 | -2.259017 | 2.149451 |
anx | 211 | 0 | 1 | -1.929403 | 3.986732 |
No data are missing, and the data are fully standardized. Excellent.
a. Path Coefficients
First, I’m going to write down Model A as a system of equations, using equations 15.1a-15.1d in Pedhazur (1982) .
\[ \begin{aligned} AMB &= e_1 \\ CD &= p_{21}AMB + e_2 \\ TP &= p_{31}AMB + p_{32}CD + e_3 \\ HI &= p_{42}CD + p_{43}TP + e_4 \\ ANX &= p_{53}TP + p_{54}HI + e_5 \end{aligned} \] These equations will inform the regression models I estimate.
I believe, because AMB is an exogenous variable, the estimate of \(e_1\) is simply the expected value of AMB.
## [1] 1.266381e-15
Now for the interesting path coefficients.
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.0000000 | 0.0575689 | 0.00000 | 1 |
amb | 0.5513968 | 0.0577058 | 9.55531 | 0 |
\(p_{21}\) is the estimate of the coefficient of amb, so \(p_{21} \approx 0.55\).
According to Pedhazur (1982), \(e_j = \sqrt{1 - R^2_{j,12...i}}\), “where \(R^2_{j,12...i}\) is the squared multiple of correlation of endogenous variable \(j\) with variables \(1, 2, \dots, i\)” (Pedhazur, 1982, p. 585). Thus, in the single-predictor, standardized case, \[ \begin{aligned} e_2 &= \sqrt{1-R^2_{2,1}} \\ &= \sqrt{1-r^2_{21}} \\ &= \sqrt{1-p^2_{21}} \end{aligned} \]
## amb
## 0.8342431
The result is that \(e_2 \approx 0.83\).
Next, let’s take a look at \(p_{31}\), \(p_{32}\) and \(e_3\).
fit.2.3 <- lm(tp ~ amb + cd, tbl.2)
p31 <- coef(fit.2.3)["amb"]
p32 <- coef(fit.2.3)["cd"]
fit.2.3 %>%
tidy %>%
kable
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.0000000 | 0.0647161 | 0.000000 | 1.0000000 |
amb | 0.2171172 | 0.0777591 | 2.792176 | 0.0057235 |
cd | 0.1834380 | 0.0777591 | 2.359054 | 0.0192481 |
\(p_{31}\) is the effect of AMB on TP, and is approximately .22. \(p_{32}\) is the effect of CD on TP, and is approximately .18.
\(e_3\) is slightly more complicated to calculate than \(e_2\) because we now have more than one predictor. Thankfully, R can automatically compute \(R^2\) for the fitted model.
## [1] 0.9355688
Thus, \(e_3 \approx .94\), which is a little larger than \(e_2\).
The process for \(p_{42}\), \(p_{43}\), \(e_4\), \(p_{53}\), \(p_{54}\), and \(e_5\) is essentially the same as that presented above, so I will present the syntax and results with less commentary.
fit.2.4 <- lm(hi ~ cd + tp, tbl.2)
p42 <- coef(fit.2.4)["cd"]
p43 <- coef(fit.2.4)["tp"]
e4 <- sqrt(1 - summary(fit.2.4)$r.squared)
fit.2.5 <- lm(anx ~ tp + hi, tbl.2)
p53 <- coef(fit.2.5)["tp"]
p54 <- coef(fit.2.5)["hi"]
e5 <- sqrt(1 - summary(fit.2.5)$r.squared)
tibble(p42, p43, e4, p53, p54, e5) %>%
kable
p42 | p43 | e4 | p53 | p54 | e5 |
---|---|---|---|---|---|
0.1859613 | 0.3846249 | 0.8798383 | 0.2104845 | 0.3135576 | 0.8939613 |
Now with all the coefficients estimated, I can point out that \(p_{21}\) (the effect of AMB on CD) is the strongest, followed by \(p_{43}\).
b. Decomposed Correlations
Definitions of direct effect, indirect effect, total effect, spurious component, and unanalyzed component Pedhazur (1982):
- Direct Effect: The estimated effect of one variable on another.
- Ex: The effect of AMB on CD.
- Indirect Effect: The mediated effect of one variable on another.
- Ex: The effect of AMB on HI via CD.
- Total Effect (also known as the effect coefficient): The sum of the direct and indirect effects of one variable on anther.
- Ex: The effect of AMB on TP and the effect of AMB on TP via CD.
- Spurious Component: The correlation between two variables not explained by the total effect of one variable on another&em;in other words, the correlation between two variables that is explained by a common cause.
- Ex: The correlation between CD and TP explained by AMB, rather than the effect of CD on TP.
- Unanalyzed Component: The correlation between two variables not explained by a causal path in the model.
- Ex: The effect of AMB on TP explained by AMB’s correlation with CD and CD’s effect on TP in Theoretical Model B. No correlations in Theoretical Model A will have unanalyzed components.
The sum of the spurious and unanalyzed components is the noncausal part of the correlation coefficient.
Let’s move on to the calculations, beginning with \(r_{12}\), the correlation between AMB and CD.
\(r_{12}\) is fully explained by \(p_{21}\), the direct effect of AMB on CD, which we already know is approximately .55.
\(r_{13}\), the correlation between AMB and TP, is a combination of the direct effect of AMB on TP and the indirect effect mediated by CD. Thus, \(r_{13} = p_{31} + p_{21}p_{32}\).
\(r_{23}\), the correlation between CD and TP has a direct effect and a spurious component: \(r_{23} = p_{32} + p_{21}p_{31}\).
Moving along, \(r_{14}\), the correlation between AMB and HI, is the sum of three indirect effects, mediated by CD, TP, and the combination of both CD and TP: \(r_{14} = p_{21}p_{42} + p_{31}p_{43} + p_{21}p_{32}p_{43}\).
\(r_{24}\) is composed of a direct effect, an indirect effect, and a spurious component.
\(r_{34}\) is composed of a direct effect and a spurious component. Because the spurious component is solely a result of TP’s correlation with CD, I will reuse \(r_{23}\).
\(r_{15}\) comprises only an indirect effect. Well, really five indirect effects.
IE.15 <- p31 * p53 + p21 * p42 * p53 + p31 * p43 * p53 + p21 * p32 * p53 + p21 * p32 * p43 * p54
r15 <- IE.15
\(r_{25}\) comprises both an indirect effect and a spurious component.
IE.25 <- p32 * p53 + p42 * p54 + p32 * p43 * p54
S.25 <- p21 * p31 * p53 + p21 * p31 * p43 * p54
r25 <- IE.25 + S.25
\(r_{35}\) comprises a direct effect, an indirect effect, and a spurious component.
\(r_{45}\) is the final correlation to decompose. It comprises a direct effect and a spurious component.
All the components are broken down in the following table.
tbl.2.b <- tibble(Vars = c("AMB, CD",
"AMB, TP", "CD, TP",
"AMB, HI", "CD, HI", "TP, HI",
"AMB, ANX", "CD, ANX", "TP, ANX", "HI, ANX"),
r = c(r12, r13, r23, r14, r24, r34, r15, r25, r35, r45),
DE = c(DE.12, DE.13, DE.23, NA, DE.24, DE.34, NA, NA, DE.35, DE.45),
IE = c(NA, IE.13, NA, IE.14, IE.24, NA, IE.15, IE.25, IE.35, NA),
S = c(NA, NA, S.23, NA, S.24, S.34, NA, S.25, S.35, S.45),
U = c(rep(NA,10)))
tbl.2.b %>% kable
Vars | r | DE | IE | S | U |
---|---|---|---|---|---|
AMB, CD | 0.5513968 | 0.5513968 | NA | NA | NA |
AMB, TP | 0.3182644 | 0.2171172 | 0.1011471 | NA | NA |
CD, TP | 0.3031557 | 0.1834380 | NA | 0.1197177 | NA |
AMB, HI | 0.2249509 | NA | 0.2249509 | NA | NA |
CD, HI | 0.3025626 | 0.1859613 | 0.0705548 | 0.0460464 | NA |
TP, HI | 0.4410001 | 0.3846249 | NA | 0.0563752 | NA |
AMB, ANX | 0.1183483 | NA | 0.1183483 | NA | NA |
CD, ANX | 0.1586804 | NA | 0.1190434 | 0.0396369 | NA |
TP, ANX | 0.3487634 | 0.2104845 | 0.1206021 | 0.0176769 | NA |
HI, ANX | 0.4063813 | 0.3135576 | NA | 0.0928237 | NA |
c. Fit Assessment
Overall Fit
In Theoretical Model A, we have have an “overidentified model” (Pedhazur, 1982, p. 617) because there are fewer paths in the model than are possible. This is equivalent to saying that the model implicitly hypothesizes no effect of some variables on others.
Specifically, the table above clearly shows that AMB has no direct effect on HI or ANX, and CD has no direct effect on ANX. Thus, in this case, we have three “overidentifying restrictions” (Pedhazur, 1982, p. 618), which will equal the degrees of freedom in the significance test.
\[d = df = 3\] In order to calculate the test statistic, \(W\), in addition to \(d\), we need the sample size, \(N\), the generalized squared multiple correlation, \(R^2_m\), and what I call ‘Specht’s M’. Specht’s M is essentially a rebranded \(R^2_m\) for overidentified models Specht (1975).
The exact formula is Equation 15.21 in (1982, p. 619). I present it here with modernized notation:
\[W = - (N - d)\ln{\frac{1 - R^2_m}{1-M}}\] \(W\) is approximately \(\chi^2\)-distributed with \(df = d\) degrees of freedom.
I will now calculate each variable in turn.
If I understand this correctly, \(R^2_m\) is simply the product of (one minus) all the squared observed correlations, less than one.
# Note: Using for loops in R is discouraged, but it's not a big deal for a short vector,
# and using nested map() functions is annoying.
num.2.c <- c()
for (col in select(correlate(tbl.2), -c(1, 2))) {
for (r in col) {
if (!is.na(r)) {
num.2.c <- c(num.2.c, r)
} else {
break
}
}
}
##
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
Finally, I believe Specht’s \(M\) is calculated in the same way, except by using the reproduced correlations instead of the observed correlations.
Now, I can calculate \(W\) and test it against a \(\chi^2\) distribution.
W | p |
---|---|
2.152695 | 0.199503 |
This result suggests that the model is not a good fit.
Correspondence Between Observed and Reproduced Correlations
To compare the observed correlations and the reproduced correlations, we can simply start by taking the differences.
tbl.2.c <- tbl.2.b %>%
select(c(1,2)) %>%
rename(Reproduced = r) %>%
mutate(Observed = num.2.c, .after = Vars) %>%
mutate(Difference = Observed - Reproduced)
tbl.2.c %>%
kable
Vars | Observed | Reproduced | Difference |
---|---|---|---|
AMB, CD | 0.5513968 | 0.5513968 | 0.0000000 |
AMB, TP | 0.3182644 | 0.3182644 | 0.0000000 |
CD, TP | 0.3031557 | 0.3031557 | 0.0000000 |
AMB, HI | 0.2813076 | 0.2249509 | 0.0563567 |
CD, HI | 0.3025626 | 0.3025626 | 0.0000000 |
TP, HI | 0.4410001 | 0.4410001 | 0.0000000 |
AMB, ANX | 0.1122043 | 0.1183483 | -0.0061440 |
CD, ANX | 0.0821677 | 0.1586804 | -0.0765127 |
TP, ANX | 0.3487634 | 0.3487634 | 0.0000000 |
HI, ANX | 0.4063813 | 0.4063813 | 0.0000000 |
It appears that for all the pairs of variables with direct effects, the difference between the observed and reproduced correlations is infinitesimally small. I’m not sure what to make of that. It may simply be a result of those relationships being identified. The differences for the variables without direct effects are larger, but without a significance test, it’s difficult to say how much credence to lend the differences.
Although we are not talking about correlations from two samples, it might be worth doing a Fisher \(r\)-to-\(z\) transformation and checking whether those three correlations are significantly different.
tbl.2.c %>%
select(- "Difference") %>%
mutate(z = paired.r(Observed, Reproduced, n = nrow(tbl.2))$z,
p = paired.r(Observed, Reproduced, n = nrow(tbl.2))$p) %>%
kable
Vars | Observed | Reproduced | z | p |
---|---|---|---|---|
AMB, CD | 0.5513968 | 0.5513968 | 0.0000000 | 1.0000000 |
AMB, TP | 0.3182644 | 0.3182644 | 0.0000000 | 1.0000000 |
CD, TP | 0.3031557 | 0.3031557 | 0.0000000 | 1.0000000 |
AMB, HI | 0.2813076 | 0.2249509 | 0.6142955 | 0.5390201 |
CD, HI | 0.3025626 | 0.3025626 | 0.0000000 | 1.0000000 |
TP, HI | 0.4410001 | 0.4410001 | 0.0000000 | 1.0000000 |
AMB, ANX | 0.1122043 | 0.1183483 | 0.0635007 | 0.9493678 |
CD, ANX | 0.0821677 | 0.1586804 | 0.7921770 | 0.4282575 |
TP, ANX | 0.3487634 | 0.3487634 | 0.0000000 | 1.0000000 |
HI, ANX | 0.4063813 | 0.4063813 | 0.0000000 | 1.0000000 |
If I did this right, then none of the three are even remotely significantly different.
d. Hypotheses
Tested as a whole, Theoretical Model A did not perform well. It did not explain enough additional variance vs. an identified model to justify the additional constraints.
The tests of individual correlations tell the same story. Constraining the three pairs of variable to indirect paths did not lead to improved reproduced correlations vs. the observed correlations.
Thus, we cannot conclude that the omitted paths should be rejected from the model.
e. Reliabilities
I’m sure we will find out exactly what implications reliabilities have when we move onto SEM and include a measurement model. For now, suffice it to say that these relatively low scores imply that an assumption of path analysis may have been violated (Pedhazur, 1982, p. 633).
My guess is that in this case, the low reliabilities are more likely to bias the effect sizes downward, but without more information about the measures, it’s difficult to say.
f. Conclusions
I’m going to go out on a limb and say “ambition” is related to “competitive drive” and “time pressure” is related to “hurried/impatient”, although I hardly think we needed this study to tell us that. Each is, if not synonymous, then at least closely semantically related to the other. But the data do bear those relationships out.
Overall, I don’t think this analysis lends Theoretical Model A much support.
3. Path Analysis of Theoretical Model B
Theoretical Model B resolves one of my complaints, namely that ambition did not seem so much like a cause of competitive drive as a conceptually-related construct. Let’s see if that makes a difference.
a. Path Coefficients
With the exception of \(p_{21}\)’s theoretical replacement with \(r_{12}\) (which has the identical value), all the path coefficients remain the same to those in Part 2. Therefore, please refer to Part 2.a for the path coefficients.
However, as we shall see in a moment, the decomposed correlations will be different because some previously analyzed components will no longer be so.
b. Decomposed Correlations
I will go in the same order as before.
\(r_{12}\) is simply the unanalyzed “path” between AMB and CD, represented by a correlation.
# this seems almost absurdly unnecessary, but I'm including it for the sake
# of completeness
U.12 <- r12
r12 <- U.12
\(r_{13}\) includes the direct effect of AMB on TP and an unanalyzed component for the effect of CD on TP.
Likewise, \(r_{23}\) comprises a direct effect of CD on TP and an unanalyzed component for the effect of AMB on TP.
Moving along, \(r_{14}\) comprises an indirect effect and an unanalyzed component.
So many unanalyzed components! \(r_{24}\) comprises a direct effect, an indirect effect, and an unanalyzed component.
Next, \(r_{34}\) has a direct effect, a spurious component (thanks to the common cause of CD), and an unanalyzed component.
Getting close… \(r_{15}\) is composed of an indirect effect and an unanalyzed component.
IE.15 <- p31 * p53 + p31 * p43 * p54
U.15 <- (p32 * p53 + p42 * p53 + p32 * p43 * p54) * r12
r15 <- IE.15 + U.15
\(r_{25}\) is similarly composed of an indirect effect and an unanalyzed component. The \(r_{12}\) is simply switched.
IE.25 <- p32 * p53 + p42 * p54 + p32 * p43 * p54
U.25 <- (p31 * p53 + p31 * p43 * p54) * r12
r25 <- IE.25 + U.25
Almost there… \(r_{35}\) comprises a direct effect, an indirect effect, a spurious component, and an unanalyzed component. Wow!!!!
DE.35 <- p53
IE.35 <- p43 * p54
S.35 <- p32 * p42 * p54
U.35 <- p31 * p42 * p54 * r12
r35 <- DE.35 + IE.35 + S.35 + U.35
Last and arguably least, \(r_{45}\) comprises a direct effect, a spurious component, and an unanlyzed component.
DE.45 <- p54
S.45 <- p42 * p32 * p53 + p43 * p53
U.45 <- (p42 + p32 * p43) * (p31 * p53) * r12
r45 <- DE.45 + S.45 + U.45
As before, I’ve broken down all the components in a single table.
tbl.3.b <- tibble(Vars = c("AMB, CD", "AMB, TP", "CD, TP",
"AMB, HI", "CD, HI", "TP, HI",
"AMB, ANX", "CD, ANX", "TP, ANX", "HI, ANX"),
r = c(r12, r13, r23, r14, r24, r34, r15, r25, r35, r45),
DE = c(NA, DE.13, DE.23, NA, DE.24, DE.34, NA, NA, DE.35, DE.45),
IE = c(NA, IE.13, NA, IE.14, IE.24, NA, IE.15, IE.25, IE.35, NA),
S = c(NA, NA, NA, NA, NA, S.34, NA, NA, S.35, S.45),
U = c(U.12, U.13, U.23, U.14, U.24, U.34, U.15, U.25, U.35, U.45))
tbl.3.b %>% kable
Vars | r | DE | IE | S | U |
---|---|---|---|---|---|
AMB, CD | 0.5513968 | NA | NA | NA | 0.5513968 |
AMB, TP | 0.3182644 | 0.2171172 | 0.1011471 | NA | 0.1011471 |
CD, TP | 0.3031557 | 0.1834380 | NA | NA | 0.1197177 |
AMB, HI | 0.2249509 | NA | 0.0835087 | NA | 0.1414422 |
CD, HI | 0.3025626 | 0.1859613 | 0.0705548 | NA | 0.0460464 |
TP, HI | 0.4410001 | 0.3846249 | NA | 0.0341124 | 0.0222629 |
AMB, ANX | 0.1269558 | NA | 0.0718846 | NA | 0.0550712 |
CD, ANX | 0.1586804 | NA | 0.1190434 | NA | 0.0396369 |
TP, ANX | 0.3487634 | 0.2104845 | 0.1206021 | 0.0106962 | 0.0069807 |
HI, ANX | 0.4081592 | 0.3135576 | NA | 0.0881377 | 0.0064639 |
c. Fit Assessment
Overall Fit
I will attempt to use the same procedures as before. But first, it is necessary to determine whether the proposed correlation (and not causation) between AMB and CD constitutes an overidentifying restriction. (Pedhazur, 1982), to my reading, is ambiguous on this point. Instead, I relied on (Kaplan, 2009)’s explication of Kenneth Bollen’s counting rule. This clarifies that the correlation between exogenous variables is not parameterized. Thus, the correlation is an overidentifying restriction, and \(d\) increases from 3 to 4.
\(N\) remains the same at 211. However, I believe the values of \(R^2_m\) and Specht’s \(M\) will have changed.
Repurposing my code from earlier (another R “no-no”):
num.3.c <- c()
for (col in select(correlate(tbl.2), -c(1, 2))) {
for (r in col) {
if (!is.na(r)) {
num.3.c <- c(num.3.c, r)
} else {
break
}
}
}
##
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
R2m <- 1 - prod(map_dbl(num.3.c, ~ 1 - . ^ 2))
M <- 1 - prod(map_dbl(tbl.3.b$r, ~ 1 - . ^ 2))
W <- -(N - d) * log((1 - R2m) / (1 - M))
p <- dchisq(W, d)
tibble(W = W, p = p) %>%
kable
W | p |
---|---|
1.33921 | 0.171389 |
The result is that \(W\) is lower than before, but it is still not statistically significant.
Correspondence Between Observed and Reproduced Correlations
At this point, I will check whether the reproduced correlations for the three restricted paths (not including the correlation between AMB and CD) are significantly different from the observed correlations.
tbl.2.b %>%
select(c(1,2)) %>%
rename(Reproduced = r) %>%
mutate(Observed = num.2.c, .after = Vars) %>%
mutate(z = paired.r(Observed, Reproduced, n = nrow(tbl.2))$z,
p = paired.r(Observed, Reproduced, n = nrow(tbl.2))$p) %>%
kable
Vars | Observed | Reproduced | z | p |
---|---|---|---|---|
AMB, CD | 0.5513968 | 0.5513968 | 0.0000000 | 1.0000000 |
AMB, TP | 0.3182644 | 0.3182644 | 0.0000000 | 1.0000000 |
CD, TP | 0.3031557 | 0.3031557 | 0.0000000 | 1.0000000 |
AMB, HI | 0.2813076 | 0.2249509 | 0.6142955 | 0.5390201 |
CD, HI | 0.3025626 | 0.3025626 | 0.0000000 | 1.0000000 |
TP, HI | 0.4410001 | 0.4410001 | 0.0000000 | 1.0000000 |
AMB, ANX | 0.1122043 | 0.1183483 | 0.0635007 | 0.9493678 |
CD, ANX | 0.0821677 | 0.1586804 | 0.7921770 | 0.4282575 |
TP, ANX | 0.3487634 | 0.3487634 | 0.0000000 | 1.0000000 |
HI, ANX | 0.4063813 | 0.4063813 | 0.0000000 | 1.0000000 |
Once again, the reproduced correlations are not significantly different from the observed correlations.
d. Hypotheses
Theoretical Model B may have performed better than Theoretical Model A, but not sufficiently to be able to say that it performed well. The overall fit assessment found it be a non-significant improvement on the identified model. And the reproduced correlations in Theoretical Model B were identical to those in Theoretical Model A (this is worrying to me, but I can’t see how I might have made a mistake in calculating the reproduced correlations).
e. Conclusions
Because Theoretical Model B did perform (if nonsignficantly, at least directionally) better than Theoretical Model A, I will take it at face value that the respecification of the relationship between AMB and CD is justified. In other words, it’s better to say that ambition and competitive drive are correlated than that ambition causes competitive drive.
However, I can’t help but wonder if a Theoretical Model C in which the relationship between time pressure and hurried/impatient were respecified might perform even better. The two seem closely related, and although we generally conceive of cognitive states as leading to behaviors, I imagine that the measures do not precisely distinguish between the two.
References
Kaplan, D. (2009). Structural Equation Modeling (2nd ed.): Foundations and Extensions. SAGE Publications, Inc. https://doi.org/10.4135/9781452226576
Pedhazur, E. J. (1982). Path Analysis. In Multiple Regression in Behavioral Research. Holt.
Specht, D. A. (1975). On the evaluation of causal models. Social Science Research, 4(2), 113–133. https://doi.org/10.1016/0049-089X(75)90007-1