The data has started to roll in, and I’ve noticed a number of issues arising. I will try to address them here.
I will wrestle with some missing data, detect “fake” data (where participants seem to have clicked through without reading), and identify outliers. Having done so, I will consider whether to use thresholds on the indicators of data quality, such as duration.
Import packages.
library(lubridate)
library(tidyverse)
library(flextable)
library(ggdark)
Import my R objects from previous documents.
formatAsTable <- readRDS("format.rds")
structureTable <- readRDS("structure-table.rds")
clean <- readRDS(file.path("..", "data", "clean-data.rds"))
OK, now I’ll take a look at what I’m dealing with.
# rectangular data excerpt
clean %>%
  select(1:7) %>%
  head %>%
  formatAsTable
id | interdep | disclose | validate | intcheck | discheck | aff1 |
1 | FALSE | FALSE | FALSE | Neither agree nor disagree | Strongly disagree | Neither agree nor disagree |
2 | TRUE | TRUE | FALSE | Somewhat agree | Strongly agree | Somewhat agree |
3 | TRUE | TRUE | FALSE | Somewhat disagree | Neither agree nor disagree | Somewhat agree |
4 | FALSE | FALSE | FALSE | Strongly disagree | Strongly disagree | Strongly disagree |
5 | FALSE | TRUE | FALSE | Strongly disagree | Strongly agree | Neither agree nor disagree |
6 | TRUE | FALSE | FALSE | Somewhat agree | Strongly disagree | Strongly disagree |
It’s a little hard to see what’s going on with the full dataset because I can’t fit all the variables here. I’m going to look at it a different way, by mimicking R’s str() function, which shows the structure of the data.
# data structure table
# data structure table
clean %>%
  structureTable %>%
  formatAsTable
variable | type | head |
id | integer | 1, 2, 3 |
interdep | logical | FALSE, TRUE, TRUE |
disclose | logical | FALSE, TRUE, TRUE |
validate | logical | FALSE, FALSE, FALSE |
intcheck | factor | Neither agree nor disagree, Somewhat agree, Somewhat disagree |
discheck | factor | Strongly disagree, Strongly agree, Neither agree nor disagree |
aff1 | factor | Neither agree nor disagree, Somewhat agree, Somewhat agree |
aff1ord | integer | 2, 1, 2 |
aff2 | factor | Neither agree nor disagree, Somewhat agree, Neither agree nor disagree |
aff2ord | integer | 1, 1, 1 |
aff3 | factor | Somewhat agree, Somewhat agree, Neither agree nor disagree |
aff3ord | integer | 1, 2, 2 |
aff4 | factor | Neither agree nor disagree, Strongly agree, Neither agree nor disagree |
aff4ord | integer | 2, 2, 1 |
cog1 | factor | Somewhat agree, Somewhat agree, Somewhat disagree |
cog1ord | integer | 2, 1, 2 |
cog2 | factor | Somewhat agree, Somewhat agree, Somewhat disagree |
cog2ord | integer | 1, 2, 2 |
cog3 | factor | Neither agree nor disagree, Neither agree nor disagree, Neither agree nor disagree |
cog3ord | integer | 1, 2, 1 |
cog4 | factor | Somewhat agree, Neither agree nor disagree, Somewhat disagree |
cog4ord | integer | 2, 1, 1 |
lik1 | factor | Somewhat agree, Somewhat agree, Neither agree nor disagree |
lik1ord | integer | 2, 1, 1 |
lik2 | factor | Somewhat agree, Somewhat agree, Somewhat agree |
lik2ord | integer | 1, 2, 1 |
lik3 | factor | Somewhat disagree, Somewhat disagree, Somewhat disagree |
lik3ord | integer | 2, 2, 2 |
lik4 | factor | Somewhat disagree, Somewhat disagree, Neither agree nor disagree |
lik4ord | integer | 1, 1, 2 |
gender | factor | Male, Male, Male |
age | integer | 27, 30, 33 |
employ | factor | Full-time, Full-time, Full-time |
educat | factor | Bachelor's degree, Bachelor's degree, Bachelor's degree |
educat_7_text | character | NA, NA, NA |
party | factor | Moderate, Conservative, Conservative |
adhd | character | Family member,Acquaintance, I do not know anyone with ADHD, Prefer not to answer |
adhd_family | logical | TRUE, FALSE, FALSE |
adhd_acquaintance | logical | TRUE, FALSE, FALSE |
adhd_nobody | logical | FALSE, TRUE, FALSE |
adhd_prefer | logical | FALSE, FALSE, TRUE |
adhd_myself | logical | FALSE, FALSE, FALSE |
adhd_friend | logical | FALSE, FALSE, FALSE |
adhd_coworker | logical | FALSE, FALSE, FALSE |
adhd_classmate | logical | FALSE, FALSE, FALSE |
work | character | service, professional, professional |
work_service | logical | TRUE, FALSE, FALSE |
work_professional | logical | FALSE, TRUE, TRUE |
work_prefer | logical | FALSE, FALSE, FALSE |
work_manual | logical | FALSE, FALSE, FALSE |
work_other | logical | FALSE, FALSE, FALSE |
work_4_text | character | NA, NA, NA |
ethnic | character | White or Caucasian, White or Caucasian, White or Caucasian |
ethnic_white | logical | TRUE, TRUE, TRUE |
ethnic_hispanic | logical | FALSE, FALSE, FALSE |
ethnic_other | logical | FALSE, FALSE, FALSE |
ethnic_black | logical | FALSE, FALSE, FALSE |
ethnic_asian | logical | FALSE, FALSE, FALSE |
ethnic_prefer | logical | FALSE, FALSE, FALSE |
ethnic_5_text | character | NA, NA, NA |
startdate | datetime | 2021-05-24 09:26:24, 2021-05-24 09:31:12, 2021-05-24 09:32:08 |
duration | integer | 154, 95, 128 |
gender_3_text | character | NA, NA, NA |
party_4_text | character | NA, NA, NA |
employ_6_text | character | NA, NA, NA |
Here I can see the main variables of interest.
IVs: interdep and disclose are the independent variables, corresponding to interdependence and disclosure. They were manipulated, so there’s nothing to do with those.
Checks: intcheck and discheck are the manipulation checks. We’ll want to take a close look at those, because if participants responded to both checks contrary to their assigned conditions, it’s unclear how to use their data in further analyses.
Mediators: The four variables beginning with aff are the scale for affective trust, and the four variables beginning with cog are the scale for cognitive trust. There’s no right or wrong answer here, but I’ll want to look for coherence. For example, cog had a reverse-coded item, so I can see whether participants responded differently to that (see the quick check just after this list).
DV: The four variables beginning with lik are the scale for liking. As with cog, I can look for coherence here.
The other variables are demographic questions, metadata, and variables that I may use for exploratory analyses or as controls.
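As a quick illustration of the coherence idea, here’s a side check I’m sketching in (nothing downstream depends on it). Assuming cog4 is the reverse-coded cognitive item (it’s the one I flip later), its raw responses should run opposite to the other three items:

# coherence sketch (side check only): assuming cog4 is the reverse-coded
# item, its raw scores should correlate negatively with the other three
clean %>%
  transmute(across(matches("^cog\\d$"), as.integer)) %>%
  summarise(across(c(cog1, cog2, cog3),
                   ~ cor(., cog4, use = "pairwise.complete.obs"))) %>%
  formatAsTable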
The questionnaire was designed such that participants were reminded to respond to all items before moving on, but it did not force them to. Thus, it is possible that some participants failed to respond to some items. Let’s see which variables have missing data (coded as NA).
clean %>%
  select(where(~ any(is.na(.)))) %>%
  summarise(across(everything(),
                   ~ sum(is.na(.)))) %>%
  t %>%
  as_tibble(rownames = "vars") %>%
  rename(NA_count = V1) %>%
  formatAsTable
vars | NA_count |
age | 2 |
educat_7_text | 440 |
work_4_text | 425 |
ethnic_5_text | 440 |
gender_3_text | 442 |
party_4_text | 439 |
employ_6_text | 426 |
As of this writing, there were just two missing values, both in the age variable; otherwise, the missingness was confined to the optional text fields, which we expected to be mostly NA anyway. That is a great relief.
I won’t bother with multiple imputation for the age
variable, since it’s so ancillary to the study.
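For the record, if imputation ever did seem worthwhile, a simple median fill would be the minimal alternative to multiple imputation. This is only a sketch and isn’t run as part of the analysis (note that nothing is assigned back to clean):

# sketch only: a median fill for the two missing ages, not used here
clean %>%
  mutate(age = coalesce(age, as.integer(round(median(age, na.rm = TRUE)))))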
I won’t try to define “fake” or “empty” data. Instead, consider these hypothetical examples:
The random clicker: A participant fails one or both manipulation checks, gives inconsistent responses to scales, and selects contradictory options on a demographic question. It could be that they misunderstood what they were reading, or that slightly different wording measured different constructs; but, more likely, the participant clicked buttons at random instead of actually responding.
The non-participant: A participant responds with “Neither agree nor disagree” to all Likert-style scales and responds with “Prefer not to answer” to all the remaining questions. It could be that the participant lacks attitudes or opinions and is deeply private about demographic information; more likely, though, the participant wasn’t motivated to engage with the content of the survey. Thus, it is probably appropriate to drop these observations entirely.
The FTL thinker: A participant “completes” the survey in less than a minute, meaning they probably spent fewer than two seconds on each question. It could be that they have lightning-fast processing speed; but it’s more likely that they raced through the questionnaire as fast as possible without reading the vignettes or thinking about the questions.
I will try to detect the presence of each of these types of participants.
OK, let’s start with the random clicker. I’ll quantify this type of participant with a score: one point each time the participant (a) answered a manipulation check contrary to their assigned condition, (b) gave a response on the minority side of the scale midpoint within the affective trust, cognitive trust, or liking scale (after reverse-coding), or (c) selected “Prefer not to answer” alongside another option on a demographic question.
clean <- clean %>%
  mutate(across(where(~is.factor(.)),
                ~ as.integer(.),
                .names = "{.col}.int"))

demo <- c("gender", "employ", "educat", "party", "adhd", "work")
demo.wrong <- paste0(demo, "wrong")

clean <- clean %>%
  mutate(across(c(cog4.int, lik3.int, lik4.int),
                ~ 6 - .),
         diswrong = ifelse(disclose,
                           discheck.int < 3,
                           discheck.int > 3),
         intwrong = ifelse(interdep,
                           intcheck.int < 3,
                           intcheck.int > 3),
         affwrong = pmin(rowSums(across(matches("aff\\d.int"), ~ . < 3)),
                         rowSums(across(matches("aff\\d.int"), ~ . > 3))),
         cogwrong = pmin(rowSums(across(matches("cog\\d.int"), ~ . < 3)),
                         rowSums(across(matches("cog\\d.int"), ~ . > 3))),
         likwrong = pmin(rowSums(across(matches("lik\\d.int"), ~ . < 3)),
                         rowSums(across(matches("lik\\d.int"), ~ . > 3))),
         across(demo,
                ~ str_detect(., ",") & str_detect(., "Prefer"),
                .names = "{.col}wrong"),
         demowrong = rowSums(across(demo.wrong)),
         wrong = rowSums(across(contains("wrong")))) %>%
  select(-ends_with(".int") & -demo.wrong)
Clearly there’s a lot of nonsense here. But the median number of self-contradictory clicks was exactly 1. Let’s take a look at the distribution; my guess is that a small number of people account for the bulk of the wrong clicks.
clean %>%
  count(wrong) %>%
  formatAsTable
wrong | n |
0.00 | 177 |
1.00 | 153 |
2.00 | 74 |
3.00 | 28 |
4.00 | 5 |
5.00 | 4 |
6.00 | 3 |
It looks like almost everyone made three or fewer mistakes. Just 12 participants made four or more. I won’t take them out just yet, but I’ll add a variable to mark them so it’ll be easy to run analyses with and without them later on.
clean <- clean %>%
  mutate(random_clicker = wrong >= 4)
Next, the non-participant. It appears that some participants responded to the survey but refrained from revealing any opinions, attitudes, or demographic information. I’ll try to identify these folks, too.
clean <- clean %>%
  mutate(decline = rowSums(across(matches("\\w{3}\\d") & where(is.factor),
                                  ~ . == "Neither agree nor disagree")) +
           rowSums(across(demo,
                          ~ . == "Prefer not to answer")))
clean %>%
  count(decline) %>%
  group_by(decline = cut(decline,
                         breaks = seq(0, 18, by = 3),
                         include.lowest = T)) %>%
  summarise(n = sum(n)) %>%
  formatAsTable
decline | n |
[0,3] | 258 |
(3,6] | 112 |
(6,9] | 56 |
(9,12] | 15 |
(12,15] | 1 |
(15,18] | 2 |
It looks like 18 declined to respond to at least 10 questions. I’ll leave them in but mark them.
clean <- clean %>%
  mutate(nonpart = decline >= 10)
Finally, the FTL thinker. How long did it take participants to complete the questionnaire?
clean %>%
  group_by(duration = cut(duration,
                          c(seq(0, 480, 60), 1260))) %>%
  count(duration) %>%
  formatAsTable
duration | n |
(0,60] | 2 |
(60,120] | 68 |
(120,180] | 156 |
(180,240] | 116 |
(240,300] | 50 |
(300,360] | 17 |
(360,420] | 10 |
(420,480] | 10 |
(480,1.26e+03] | 15 |
All but 15 participants completed the questionnaire in under eight minutes. The median duration was 2.97 minutes, which is very similar to the original estimated time of 3 minutes.
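For reference, that median is easy to check directly; this snippet is just a quick side sketch:

# sketch: median completion time in minutes
clean %>%
  summarise(median_minutes = median(duration) / 60) %>%
  formatAsTable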
The worrying observations are the 18 participants who took 90 seconds or less to complete the questionnaire. I’ll mark them and move on.
clean <- clean %>%
  mutate(ftl = duration <= 90)
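A quick tally, just as a side sketch, shows how many observations fall at or under the 90-second cutoff:

# sketch: count observations flagged by the 90-second cutoff
clean %>%
  count(ftl) %>%
  formatAsTable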
Did anyone manage to raise all three red flags?
triple <- clean %>%
  transmute(random_clicker & nonpart & ftl) %>%
  pull %>%
  sum
Thankfully, exactly 0 participants raised all three red flags.
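Since nobody tripped all three flags at once, it’s still worth a glance at how much the flags overlap pairwise. This summary is just a side sketch, not something the later analyses rely on:

# sketch: totals for each flag and their pairwise overlaps
clean %>%
  summarise(n_random_clicker = sum(random_clicker),
            n_nonpart = sum(nonpart),
            n_ftl = sum(ftl),
            random_and_nonpart = sum(random_clicker & nonpart),
            random_and_ftl = sum(random_clicker & ftl),
            nonpart_and_ftl = sum(nonpart & ftl)) %>%
  formatAsTable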
Given the structure of this data, outlier analysis does not seem to be particularly important. Likert-scale items and demographic questions cannot produce outliers, by design.
That leaves the continuous variables, age and duration, as the only possible sources of outliers. For duration, at least, it doesn’t matter how high the number is, and I’ve already dealt with the extreme low values. And age is ancillary to the purpose of this study.
Let’s take a quick look at it anyway.
clean %>%
  ggplot(aes(age)) +
  geom_histogram() +
  dark_theme_minimal() +
  theme(plot.background = element_rect(fill = '#3b434f'))
There’s obviously a floor at 18, which was our age restriction for participants, and a long tail at the upper end. The oldest participant was 83, which is statistically an outlier, but not actually concerning.
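To back up the “statistically an outlier” claim, here is the conventional 1.5 × IQR boxplot fence for age; this is just a side sketch, nothing the analysis depends on:

# sketch: the upper boxplot fence (Q3 + 1.5 * IQR) for age
clean %>%
  summarise(q3 = quantile(age, 0.75, na.rm = TRUE, names = FALSE),
            upper_fence = q3 + 1.5 * IQR(age, na.rm = TRUE)) %>%
  formatAsTable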
Export data:
clean %>%
  saveRDS(file.path("..", "data", "quality-data.rds"))

clean %>%
  write_csv(file.path("..", "data", "quality-data.csv"))
Output document:
options(knitr.duplicate.label = "allow")
rmarkdown::render("data-qa.Rmd",
                  output_dir = file.path("..", "github", "thesis"))