The data has started to roll in, and I’ve noticed a number of issues. I will try to address them here.

I will wrestle with some missing data, detect “fake” data (where participants seemed to have clicked through without reading), and identify outliers. Having done so, I will consider whether to use thresholds on the indicators of data quality, such as duration.

Setup

Import packages.

library(lubridate)  # date/time handling
library(tidyverse)  # data wrangling and plotting

library(flextable)  # formatted tables
library(ggdark)     # dark ggplot2 themes

Import my R objects from previous documents.

formatAsTable <- readRDS("format.rds")
structureTable <- readRDS("structure-table.rds")
clean <- readRDS(file.path("..", "data", "clean-data.rds"))

OK, now I’ll take a look at what I’m dealing with.

# rectangular data excerpt
clean %>%
  select(1:7) %>%
  head %>%
  formatAsTable

It’s a little hard to see what’s going on with the full dataset because I can’t fit all the variables here.

I’m going to look at it a different way, by mimicking R’s str() function, which shows the structure of the data.
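(The actual structureTable helper was defined in a previous document and imported above via readRDS(); purely as a hypothetical sketch, it might do something like this:)

# hypothetical sketch of a structureTable-like helper;
# the real one was imported above via readRDS()
structureSketch <- function(df) {
  tibble(variable = names(df),
         class    = map_chr(df, ~ class(.x)[1]),
         values   = map_chr(df, ~ toString(head(unique(.x), 3))))
}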

# data structure table
clean %>%
  structureTable %>%
  formatAsTable

Here I can see the main variables of interest.

IVs: interdep and disclose are the independent variables, corresponding to interdependence and disclosure. They were manipulated, so there’s no cleaning to do with those.

Checks: intcheck and discheck are the manipulation checks. We’ll want to take a close look at those, because if a participant responded to both checks contrary to the manipulations, it’s unclear whether their data can be used in further analyses.

Mediators: The four variables beginning with aff form the scale for affective trust, and the four beginning with cog form the scale for cognitive trust. There’s no right or wrong answer here, but I’ll want to look for coherence. For example, cog had a reverse-coded item, so I can check whether participants responded consistently with that reversal (see the quick check after this overview).

DV: The four variables beginning with lik are the scale for liking. As with cog, I can look for coherence here.

The other variables are demographic questions, metadata, and variables that I may use for exploratory analyses or as controls.
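As a quick sketch of that coherence check (assuming, as the recoding later in this document implies, that cog4 is the reverse-coded item), the reverse-coded item should correlate negatively with its siblings before any recoding:

# before recoding, a reverse-coded item should correlate
# negatively with the other items on its scale
clean %>%
  select(matches("cog\\d")) %>%
  mutate(across(everything(), as.integer)) %>%
  cor() %>%
  round(2)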

Missing data

The questionnaire was designed such that participants were reminded to respond to all items before moving on, but it did not force them to. Thus, there is the possibility that some participants failed to respond to some items. Let’s see which variables have missing data (coded as NA).

# count NAs in each variable that has any
clean %>%
  select(where(~ any(is.na(.)))) %>%
  summarise(across(everything(),
                   ~ sum(is.na(.)))) %>%
  t %>%
  as_tibble(rownames = "vars") %>%
  rename(NA_count = V1) %>%
  formatAsTable

As of this writing, there were just two missing values, both in the age variable; the rest of the missingness was in the optional text fields, which we expected to be mostly NA anyway. That is a great relief.

I won’t bother with multiple imputation for the age variable, since it’s so ancillary to the study.
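Still, it’s easy enough to eyeball the two affected rows (reusing the excerpt columns from earlier):

# inspect the observations with missing age
clean %>%
  filter(is.na(age)) %>%
  select(1:7) %>%
  formatAsTable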

Fake or empty data

I won’t try to define “fake” or “empty” data. Instead, consider these hypothetical examples:

  • The random clicker: A participant fails one or both manipulation checks, gives inconsistent responses to scales, and selects contradictory options on a demographic question. It could be that they misunderstood what they were reading, or that slightly different wording measured different constructs; but, more likely, the participant clicked buttons at random instead of actually responding.

  • The non-participant: A participant responds with “Neither agree nor disagree” to all Likert-style scales, and responds with “Prefer not to answer” to all the remaining questions. It could be that the participant lacks attitudes or opinions and is deeply private about demographic information; more likely, though, the participant wasn’t motivated to engage with the content of the survey. Thus, it is probably appropriate to drop these observations entirely.

  • The FTL thinker: A participant “completes” the survey in less than a minute, meaning they probably spent fewer than two seconds on each question. It could be that they have lightning-fast processing speed; but it’s more likely that they raced through the questionnaire as fast as possible without reading the vignettes or thinking about the questions.

I will try to detect the presence of each of these types of participants.

The Random Clicker

OK, I will try to quantify this type of participant with criteria. One point for each time the participant did the following:

  1. Responded to a manipulation check incorrectly.
  2. Selected a self-contradicting option on a Likert scale.
  3. Selected at least two mutually contradictory options on a multiple-choice question.

# convert every factor to integers so responses can be compared numerically
clean <- clean %>%
  mutate(across(where(is.factor),
                as.integer,
                .names = "{.col}.int"))

demo <- c("gender", "employ", "educat", "party", "adhd", "work")
demo.wrong <- paste0(demo, "wrong")

clean <- clean %>%
  # reverse-code the reversed items so every scale points the same way
  mutate(across(c(cog4.int, lik3.int, lik4.int),
                ~ 6 - .),
         # one point for failing a manipulation check
         diswrong = ifelse(disclose,
                           discheck.int < 3,
                           discheck.int > 3),
         intwrong = ifelse(interdep,
                           intcheck.int < 3,
                           intcheck.int > 3),
         # one point per self-contradictory response within a scale
         affwrong = pmin(rowSums(across(matches("aff\\d\\.int"), ~ . < 3)),
                         rowSums(across(matches("aff\\d\\.int"), ~ . > 3))),
         cogwrong = pmin(rowSums(across(matches("cog\\d\\.int"), ~ . < 3)),
                         rowSums(across(matches("cog\\d\\.int"), ~ . > 3))),
         likwrong = pmin(rowSums(across(matches("lik\\d\\.int"), ~ . < 3)),
                         rowSums(across(matches("lik\\d\\.int"), ~ . > 3))),
         # one point for pairing "Prefer not to answer" with another option
         across(all_of(demo),
                ~ str_detect(., ",") & str_detect(., "Prefer"),
                .names = "{.col}wrong"),
         demowrong = rowSums(across(all_of(demo.wrong))),
         # total, counting each demographic contradiction only once
         wrong = diswrong + intwrong + affwrong + cogwrong +
           likwrong + demowrong) %>%
  select(-ends_with(".int"), -all_of(demo.wrong))

Clearly there’s a lot of nonsense here. But the median number of self-contradictory clicks was exactly 1. Let’s take a look at the distribution. My guess is that a small number of people accounted for the bulk of the wrong clicks.

clean %>%
  count(wrong) %>%
  formatAsTable

It looks like almost everyone made three or fewer mistakes; just 12 participants made more than three. I won’t take them out just yet, but I’ll add a variable to mark them so it’ll be easy to do analyses with and without them later on.

clean <- clean %>%
  mutate(random_clicker = wrong >= 4)
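As an example of how the flag might be used later (the cell-count check here is hypothetical, not a planned analysis):

# example downstream use: condition cell counts with the
# suspected random clickers excluded
clean %>%
  filter(!random_clicker) %>%
  count(interdep, disclose) %>%
  formatAsTable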

The Non-Participant

It appears that some participants responded to the survey but refrained from revealing any opinions, attitudes, or demographic information. I’ll try to identify these folks, too.

# count "Neither agree nor disagree" scale responses
# plus "Prefer not to answer" demographic responses
clean <- clean %>%
  mutate(decline = rowSums(across(matches("\\w{3}\\d") & where(is.factor),
                                  ~ . == "Neither agree nor disagree")) +
           rowSums(across(all_of(demo),
                          ~ . == "Prefer not to answer")))

clean %>%
  count(decline) %>%
  group_by(decline = cut(decline,
                         breaks = seq(0, 18, by = 3),
                         include.lowest = TRUE)) %>%
  summarise(n = sum(n)) %>%
  formatAsTable

It looks like 18 participants declined to respond to at least 10 questions. I’ll leave them in but mark them.

clean <- clean %>%
  mutate(nonpart = decline >= 10)

The FTL Thinker

How long did it take participants to complete the questionnaire?

# distribution of completion times (in seconds)
clean %>%
  count(duration = cut(duration,
                       c(seq(0, 480, 60), 1260))) %>%
  formatAsTable

All but 15 participants completed the questionnaire in under eight minutes. The median duration was 2.97 minutes, very close to the original estimate of 3 minutes.
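For reference, that median comes straight from the duration variable, which (judging by the cut points above) is recorded in seconds:

# median completion time, converted from seconds to minutes
clean %>%
  summarise(median_minutes = median(duration) / 60)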

The worrying observations are the 18 participants who took 90 seconds or less to complete the questionnaire. I’ll mark them and move on.

clean <- clean %>%
  mutate(ftl = duration <= 90)

The Triple Threat

Did anyone manage to raise all three red flags?

triple <- clean %>%
  summarise(n = sum(random_clicker & nonpart & ftl)) %>%
  pull(n)

Thankfully, exactly 0 participants raised all three red flags.
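A quick cross-tabulation of the flags shows how much they overlap pairwise, too:

# cross-tabulate the three red flags to inspect their overlap
clean %>%
  count(random_clicker, nonpart, ftl) %>%
  formatAsTable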

Outliers

Given the structure of this data, outlier analysis does not seem particularly important. Likert-scale items and demographic questions cannot produce outliers by design.

That leaves the continuous variables, age and duration, as the only possible sources of outliers. For duration, at least, it doesn’t matter how high the number is, and I’ve already dealt with the extreme low values. And age is ancillary to the purpose of this study.

Let’s take a quick look at it anyway.

clean %>%
  ggplot(aes(age)) +
  geom_histogram() +
  dark_theme_minimal() +
  theme(plot.background = element_rect(fill = '#3b434f'))

There’s obviously a floor at 18, which was our age restriction for participants, and a long tail at the upper end. The oldest participant was 83, which is statistically an outlier, but not actually concerning.
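(For the record, the “statistically an outlier” claim can be formalized with the conventional 1.5 × IQR rule; a minimal sketch:)

# upper fence under the 1.5 * IQR rule; na.rm because of the
# two missing ages
clean %>%
  summarise(q1 = quantile(age, .25, na.rm = TRUE),
            q3 = quantile(age, .75, na.rm = TRUE),
            upper_fence = q3 + 1.5 * (q3 - q1),
            n_above = sum(age > upper_fence, na.rm = TRUE))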


Export data:

clean %>%
  saveRDS(file.path("..", "data", "quality-data.rds"))

clean %>%
  write_csv(file.path("..", "data", "quality-data.csv"))

Output document:

options(knitr.duplicate.label = "allow")
rmarkdown::render("data-qa.Rmd",
                  output_dir = file.path("..", "github", "thesis"))