The data has started to roll in, and I’ve noticed a number of issues. I’ll try to address them here.
I will wrestle with some missing data, detect “fake” data (where participants seemed to have clicked through without reading), and identify outliers. Having done so, I will consider whether to use thresholds on the indicators of data quality, such as duration.
Import packages.
library(lubridate)
library(tidyverse)
library(flextable)
library(ggdark)
Import my R objects from previous documents.
formatAsTable <- readRDS("format.rds")
structureTable <- readRDS("structure-table.rds")
clean <- readRDS(file.path("..", "data", "clean-data.rds"))
OK, now I’ll take a look at what I’m dealing with.
# rectangular data excerpt
clean %>%
select(1:7) %>%
head %>%
formatAsTable
id | interdep | disclose | validate | intcheck | discheck | aff1 |
1 | FALSE | FALSE | FALSE | Neither agree nor disagree | Strongly disagree | Neither agree nor disagree |
2 | TRUE | TRUE | FALSE | Somewhat agree | Strongly agree | Somewhat agree |
3 | TRUE | TRUE | FALSE | Somewhat disagree | Neither agree nor disagree | Somewhat agree |
4 | FALSE | FALSE | FALSE | Strongly disagree | Strongly disagree | Strongly disagree |
5 | FALSE | TRUE | FALSE | Strongly disagree | Strongly agree | Neither agree nor disagree |
6 | TRUE | FALSE | FALSE | Somewhat agree | Strongly disagree | Strongly disagree |
It’s a little hard to see what’s going on with the full dataset because I can’t fit all the variables here.
I’m going to look at it a different way, by mimicking R’s str() function, which shows the structure of the data.
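For context, structureTable was defined in an earlier document; it returns one row per variable with its type and its first few values. In case that document isn’t handy, a rough stand-in might look like this (a sketch only, not the actual helper):
# sketch: a rough equivalent of structureTable (the real helper lives in an earlier document)
structureSketch <- function(df) {
  tibble(variable = names(df),
         type = map_chr(df, ~ class(.)[1]),
         head = map_chr(df, ~ paste(head(., 3), collapse = ", ")))
}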
# data structure table
clean %>%
structureTable %>%
formatAsTable
variable | type | head |
id | integer | 1, 2, 3 |
interdep | logical | FALSE, TRUE, TRUE |
disclose | logical | FALSE, TRUE, TRUE |
validate | logical | FALSE, FALSE, FALSE |
intcheck | factor | Neither agree nor disagree, Somewhat agree, Somewhat disagree |
discheck | factor | Strongly disagree, Strongly agree, Neither agree nor disagree |
aff1 | factor | Neither agree nor disagree, Somewhat agree, Somewhat agree |
aff1ord | integer | 2, 1, 2 |
aff2 | factor | Neither agree nor disagree, Somewhat agree, Neither agree nor disagree |
aff2ord | integer | 1, 1, 1 |
aff3 | factor | Somewhat agree, Somewhat agree, Neither agree nor disagree |
aff3ord | integer | 1, 2, 2 |
aff4 | factor | Neither agree nor disagree, Strongly agree, Neither agree nor disagree |
aff4ord | integer | 2, 2, 1 |
cog1 | factor | Somewhat agree, Somewhat agree, Somewhat disagree |
cog1ord | integer | 2, 1, 2 |
cog2 | factor | Somewhat agree, Somewhat agree, Somewhat disagree |
cog2ord | integer | 1, 2, 2 |
cog3 | factor | Neither agree nor disagree, Neither agree nor disagree, Neither agree nor disagree |
cog3ord | integer | 1, 2, 1 |
cog4 | factor | Somewhat agree, Neither agree nor disagree, Somewhat disagree |
cog4ord | integer | 2, 1, 1 |
lik1 | factor | Somewhat agree, Somewhat agree, Neither agree nor disagree |
lik1ord | integer | 2, 1, 1 |
lik2 | factor | Somewhat agree, Somewhat agree, Somewhat agree |
lik2ord | integer | 1, 2, 1 |
lik3 | factor | Somewhat disagree, Somewhat disagree, Somewhat disagree |
lik3ord | integer | 2, 2, 2 |
lik4 | factor | Somewhat disagree, Somewhat disagree, Neither agree nor disagree |
lik4ord | integer | 1, 1, 2 |
gender | factor | Male, Male, Male |
age | integer | 27, 30, 33 |
employ | factor | Full-time, Full-time, Full-time |
educat | factor | Bachelor's degree, Bachelor's degree, Bachelor's degree |
educat_7_text | character | NA, NA, NA |
party | factor | Moderate, Conservative, Conservative |
adhd | character | Family member,Acquaintance, I do not know anyone with ADHD, Prefer not to answer |
adhd_family | logical | TRUE, FALSE, FALSE |
adhd_acquaintance | logical | TRUE, FALSE, FALSE |
adhd_nobody | logical | FALSE, TRUE, FALSE |
adhd_prefer | logical | FALSE, FALSE, TRUE |
adhd_myself | logical | FALSE, FALSE, FALSE |
adhd_friend | logical | FALSE, FALSE, FALSE |
adhd_coworker | logical | FALSE, FALSE, FALSE |
adhd_classmate | logical | FALSE, FALSE, FALSE |
work | character | service, professional, professional |
work_service | logical | TRUE, FALSE, FALSE |
work_professional | logical | FALSE, TRUE, TRUE |
work_prefer | logical | FALSE, FALSE, FALSE |
work_manual | logical | FALSE, FALSE, FALSE |
work_other | logical | FALSE, FALSE, FALSE |
work_4_text | character | NA, NA, NA |
ethnic | character | White or Caucasian, White or Caucasian, White or Caucasian |
ethnic_white | logical | TRUE, TRUE, TRUE |
ethnic_hispanic | logical | FALSE, FALSE, FALSE |
ethnic_other | logical | FALSE, FALSE, FALSE |
ethnic_black | logical | FALSE, FALSE, FALSE |
ethnic_asian | logical | FALSE, FALSE, FALSE |
ethnic_prefer | logical | FALSE, FALSE, FALSE |
ethnic_5_text | character | NA, NA, NA |
startdate | datetime | 2021-05-24 09:26:24, 2021-05-24 09:31:12, 2021-05-24 09:32:08 |
duration | integer | 154, 95, 128 |
gender_3_text | character | NA, NA, NA |
party_4_text | character | NA, NA, NA |
employ_6_text | character | NA, NA, NA |
Here I can see the main variables of interest.
IVs: interdep and disclose are the independent variables, corresponding to interdependence and disclosure. They were manipulated, so there’s nothing to do with those.
Checks: intcheck and discheck are the manipulation checks. We’ll want to take a close look at those, because if participants answered both in a way that contradicts their assigned condition, it’s unclear how to use their data in further analyses (a quick sanity-check sketch follows these descriptions).
Mediators: The four variables beginning with aff are the scale for affective trust, and the four variables beginning with cog are the scale for cognitive trust. There’s no right or wrong answer here, but I’ll want to look for coherence. For example, cog had a reverse-coded item, so I can see if participants responded differently to that.
DV: The four variables beginning with lik are the scale for liking. As with cog, I can look for coherence here.
The other variables are demographic questions, metadata, and variables that I may use for exploratory analyses or as controls.
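Before any formal scoring, a quick way to eyeball the checks and the reverse-coded item is to cross-tabulate each check against its condition and correlate cog4 with its sibling items. This is only a sketch; the formal quality scoring comes further down.
# sketch: do the manipulation checks track their conditions?
clean %>% count(interdep, intcheck)
clean %>% count(disclose, discheck)
# sketch: cog4 is reverse-worded, so coherent responders should tend to produce negative correlations here
clean %>%
  summarise(across(c(cog1, cog2, cog3),
                   ~ cor(as.integer(.), as.integer(cog4))))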
The questionnaire reminded participants to respond to all items before moving on, but it did not force them to. Thus, some participants may have left some items blank. Let’s see which variables have missing data (coded as NA).
# count missing values in each variable that has any
clean %>%
select(where(~ any(is.na(.)))) %>%
summarise(across(everything(),
~ sum(is.na(.)))) %>%
t %>%
as_tibble(rownames = "vars") %>%
rename(NA_count = V1) %>%
formatAsTable
vars | NA_count |
age | 2 |
educat_7_text | 440 |
work_4_text | 425 |
ethnic_5_text | 440 |
gender_3_text | 442 |
party_4_text | 439 |
employ_6_text | 426 |
As of this writing, there were just two missing values, both in the age variable; the rest of the NAs were in the optional free-text fields, which we expected to be mostly empty anyway. That is a great relief.
I won’t bother with multiple imputation for the age variable, since it’s so ancillary to the study.
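(If age ever does need to go into a model, a complete-case filter or a simple median fill would be enough. The snippet below sketches the latter; age_filled is just an illustrative name, and nothing here is applied to the working data.)
# sketch only: a simple median fill for age (not applied to the working data)
clean %>%
  mutate(age_filled = coalesce(as.double(age), median(age, na.rm = TRUE)))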
I won’t try to define “fake” or “empty” data. Instead, consider these hypothetical examples:
The random clicker: A participant fails one or both manipulation checks, gives inconsistent responses to the scales, and selects contradictory options on a demographic question. It could be that they misunderstood what they were reading, or that slightly different wording tapped different constructs; more likely, though, they clicked buttons at random instead of actually responding.
The non-participant: A participant responds “Neither agree nor disagree” to every Likert-style item and “Prefer not to answer” to all the remaining questions. It could be that the participant genuinely lacks opinions and is deeply private about demographic information; more likely, though, they weren’t motivated to engage with the content of the survey. It is probably appropriate to drop these observations entirely.
The FTL thinker: A participant “completes” the survey in less than a minute, meaning they probably spent fewer than two seconds on each question. It could be that they have lightning-fast processing speed; more likely, they raced through the questionnaire as fast as possible without reading the vignettes or thinking about the questions.
I will try to detect the presence of each of these types of participants.
I’ll start with the random clicker and quantify it with a simple score: one point each time a participant did any of the following: answered a manipulation check in the direction opposite to their assigned condition, gave a scale response on the minority side of the scale’s midpoint (after reverse-coding), or selected both a substantive option and “Prefer not to answer” on a multi-select demographic question.
# convert every factor to its integer code, keeping the original columns
clean <- clean %>%
  mutate(across(where(is.factor),
                as.integer,
                .names = "{.col}.int"))
# demographic variables to check for self-contradictory selections, plus the names of their flags
demo <- c("gender", "employ", "educat", "party", "adhd", "work")
demo.wrong <- paste0(demo, "wrong")
clean <- clean %>%
  # reverse-code cog4, lik3, and lik4 so every item runs in the same direction
  mutate(across(c(cog4.int, lik3.int, lik4.int),
                ~ 6 - .),
         # a check is "wrong" when it points away from the assigned condition
         diswrong = ifelse(disclose,
                           discheck.int < 3,
                           discheck.int > 3),
         intwrong = ifelse(interdep,
                           intcheck.int < 3,
                           intcheck.int > 3),
         # within each scale, count the responses on the minority side of the midpoint
         affwrong = pmin(rowSums(across(matches("aff\\d.int"), ~ . < 3)),
                         rowSums(across(matches("aff\\d.int"), ~ . > 3))),
         cogwrong = pmin(rowSums(across(matches("cog\\d.int"), ~ . < 3)),
                         rowSums(across(matches("cog\\d.int"), ~ . > 3))),
         likwrong = pmin(rowSums(across(matches("lik\\d.int"), ~ . < 3)),
                         rowSums(across(matches("lik\\d.int"), ~ . > 3))),
         # "Prefer not to answer" selected alongside another option on a multi-select question
         across(all_of(demo),
                ~ str_detect(., ",") & str_detect(., "Prefer"),
                .names = "{.col}wrong"),
         demowrong = rowSums(across(all_of(demo.wrong))),
         # overall score: the sum of every flag created above
         wrong = rowSums(across(contains("wrong")))) %>%
  select(-ends_with(".int"), -all_of(demo.wrong))
Clearly there’s a lot of nonsense here, but the median number of self-contradictory clicks was exactly 1. Let’s take a look at the distribution. My guess is that a small number of people account for the bulk of the wrong clicks.
clean %>%
count(wrong) %>%
formatAsTable
wrong | n |
0 | 177 |
1 | 153 |
2 | 74 |
3 | 28 |
4 | 5 |
5 | 4 |
6 | 3 |
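For the record, the median quoted above and the cutoff used next can be checked with a quick summary (a sketch; the column names are just for display):
# sketch: median number of self-contradictory clicks, and how many participants exceed three
clean %>%
  summarise(median_wrong = median(wrong),
            n_over_three = sum(wrong >= 4))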
It looks like almost everyone made three or fewer self-contradictory clicks. Just 12 participants (5 + 4 + 3) made four or more. I won’t take them out just yet, but I’ll add a variable to mark them so it’ll be easy to run analyses with and without them later on.
clean <- clean %>%
  mutate(random_clicker = wrong >= 4)
It appears that some participants responded to the survey but refrained from revealing any opinions, attitudes, or demographic information. I’ll try to identify these folks, too.
# count non-responses: midpoint answers on the scale items plus "Prefer not to answer" on the demographics
clean <- clean %>%
  mutate(decline = rowSums(across(matches("\\w{3}\\d") & where(is.factor),
                                  ~ . == "Neither agree nor disagree")) +
                   rowSums(across(all_of(demo),
                                  ~ . == "Prefer not to answer")))
clean %>%
count(decline) %>%
group_by(decline = cut(decline,
breaks = seq(0, 18, by = 3),
include.lowest = T)) %>%
summarise(n = sum(n)) %>%
formatAsTable
decline | n |
[0,3] | 258 |
(3,6] | 112 |
(6,9] | 56 |
(9,12] | 15 |
(12,15] | 1 |
(15,18] | 2 |
It looks like 18 participants (15 + 1 + 2 in the top three bins) declined to respond to at least 10 questions. I’ll leave them in but mark them.
clean <- clean %>%
  mutate(nonpart = decline >= 10)
How long did it take participants to complete the questionnaire?
clean %>%
group_by(duration = cut(duration,
c(seq(0, 480, 60), 1260))) %>%
count(duration) %>%
formatAsTable
duration | n |
(0,60] | 2 |
(60,120] | 68 |
(120,180] | 156 |
(180,240] | 116 |
(240,300] | 50 |
(300,360] | 17 |
(360,420] | 10 |
(420,480] | 10 |
(480,1260] | 15 |
All but 15 participants completed the questionnaire in under eight minutes. The median duration was 2.97 minutes, which is very similar to the original estimated time of 3 minutes.
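For the record, those figures can be reproduced with something like the following (a sketch; the column names are just for display):
# sketch: median completion time in minutes, and how many finished in 90 seconds or less
clean %>%
  summarise(median_minutes = median(duration) / 60,
            n_under_90s = sum(duration <= 90))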
The worrying observations are the 18 participants who took 90 seconds or less to complete the questionnaire. I’ll mark them and move on.
clean <- clean %>%
  mutate(ftl = duration <= 90)
Did anyone manage to raise all three red flags?
triple <- clean %>%
transmute(random_clicker & nonpart & ftl) %>%
pull %>%
  sum
Thankfully, exactly 0 participants raised all three red flags.
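None of the flagged participants are being dropped yet; the plan is to run later analyses both with and without them, which amounts to a simple filter. A sketch (strict is just an illustrative name):
# sketch: a stricter subsample that excludes every flagged participant
strict <- clean %>%
  filter(!random_clicker, !nonpart, !ftl)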
Given the structure of this data, outlier analysis does not seem to be particularly important. Likert-scale items and demographic questions cannot produce outliers, by design.
That leaves the continuous variables, age and duration, as the only possible sources of outliers. For duration, at least, it doesn’t matter how high the value is, and I’ve already dealt with the extreme low values. And age is ancillary to the purpose of this study.
Let’s take a quick look at it anyway.
clean %>%
ggplot(aes(age)) +
geom_histogram() +
dark_theme_minimal() +
  theme(plot.background = element_rect(fill = '#3b434f'))
There’s obviously a floor at 18, which was our age restriction for participants, and a long tail at the upper end. The oldest participant was 83, which is statistically an outlier, but not actually concerning.
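For what it’s worth, the usual boxplot rule (anything beyond 1.5 × IQR above the third quartile) can back that up; a sketch, assuming that standard definition:
# sketch: count ages beyond the standard upper fence (Q3 + 1.5 * IQR)
clean %>%
  summarise(upper_fence = quantile(age, 0.75, na.rm = TRUE) + 1.5 * IQR(age, na.rm = TRUE),
            n_above = sum(age > upper_fence, na.rm = TRUE))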
Export data:
clean %>%
saveRDS(file.path("..", "data", "quality-data.rds"))
clean %>%
  write_csv(file.path("..", "data", "quality-data.csv"))
Output document:
options(knitr.duplicate.label = "allow")
rmarkdown::render("data-qa.Rmd",
output_dir = file.path("..", "github", "thesis"))