The data has started to roll in, and I’ve noticed a number of issues. I will try to address them here.

I will wrestle with some missing data, detect “fake” data (where participants seemed to have clicked through without reading), and identify outliers. Having done so, I will consider whether to use thresholds on the indicators of data quality, such as duration.

Setup

Import packages.

library(lubridate)  # date/time handling
library(tidyverse)  # data wrangling and plotting

library(flextable)  # formatted tables
library(ggdark)     # dark ggplot2 themes

Import my R objects from previous documents.

formatAsTable <- readRDS("format.rds")
structureTable <- readRDS("structure-table.rds")
clean <- readRDS(file.path("..", "data", "clean-data.rds"))

OK, now I’ll take a look at what I’m dealing with.

# rectangular data excerpt
clean %>%
  select(1:7) %>%
  head %>%
  formatAsTable

It’s a little hard to see what’s going on with the full dataset because I can’t fit all the variables here.

I’m going to look at it a different way, by mimicking R’s str() function, which shows the structure of the data.
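(The actual structureTable helper was defined in a previous document and imported above via readRDS(); purely as a hypothetical sketch, it might do something like this:)

# hypothetical sketch of a structureTable-like helper;
# the real one was imported above via readRDS()
structureSketch <- function(df) {
  tibble(variable = names(df),
         class    = map_chr(df, ~ class(.x)[1]),
         values   = map_chr(df, ~ toString(head(unique(.x), 3))))
}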

# data structure table
clean %>%
  structureTable %>%
  formatAsTable

Here I can see the main variables of interest.

IVs: interdep and disclose are the independent variables, corresponding to interdependence and disclosure. They were manipulated, so there’s no cleaning to do with those.

Checks: intcheck and discheck are the manipulation checks. We’ll want to take a close look at those, because if a participant responded to both checks contrary to the manipulations, it’s unclear whether their data can be used in further analyses.

Mediators: The four variables beginning with aff form the scale for affective trust, and the four beginning with cog form the scale for cognitive trust. There’s no right or wrong answer here, but I’ll want to look for coherence. For example, cog had a reverse-coded item, so I can check whether participants responded consistently with that reversal (see the quick check after this overview).

DV: The four variables beginning with lik are the scale for liking. As with cog, I can look for coherence here.

The other variables are demographic questions, metadata, and variables that I may use for exploratory analyses or as controls.
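As a quick sketch of that coherence check (assuming, as the recoding later in this document implies, that cog4 is the reverse-coded item), the reverse-coded item should correlate negatively with its siblings before any recoding:

# before recoding, a reverse-coded item should correlate
# negatively with the other items on its scale
clean %>%
  select(matches("cog\\d")) %>%
  mutate(across(everything(), as.integer)) %>%
  cor() %>%
  round(2)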

Missing data

The questionnaire was designed such that participants were reminded to respond to all items before moving on, but it did not force them to. Thus, there is the possibility that some participants failed to respond to some items. Let’s see which variables have missing data (coded as NA).

# count NAs in each variable that has any
clean %>%
  select(where(~ any(is.na(.)))) %>%
  summarise(across(everything(),
                   ~ sum(is.na(.)))) %>%
  t %>%
  as_tibble(rownames = "vars") %>%
  rename(NA_count = V1) %>%
  formatAsTable

As of this writing, there were just two missing values, both in the age variable; the rest of the missingness was in the optional text fields, which we expected to be mostly NA anyway. That is a great relief.

I won’t bother with multiple imputation for the age variable, since it’s so ancillary to the study.
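Still, it’s easy enough to eyeball the two affected rows (reusing the excerpt columns from earlier):

# inspect the observations with missing age
clean %>%
  filter(is.na(age)) %>%
  select(1:7) %>%
  formatAsTable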

Fake or empty data

I won’t try to define “fake” or “empty” data. Instead, consider these hypothetical examples:

  • The random clicker: A participant fails one or both manipulation checks, gives inconsistent responses to scales, and selects contradictory options on a demographic question. It could be that they misunderstood what they were reading, or that slightly different wording measured different constructs; but, more likely, the participant clicked buttons at random instead of actually responding.

  • The non-participant: A participant responds with “Neither agree nor disagree” to all Likert-style scales, and responds with “Prefer not to answer” to all the remaining questions. It could be that the participant lacks attitudes or opinions and is deeply private about demographic information; more likely, though, the participant wasn’t motivated to engage with the content of the survey. Thus, it is probably appropriate to drop these observations entirely.

  • The FTL thinker: A participant “completes” the survey in less than a minute, meaning they probably spent fewer than two seconds on each question. It could be that they have lightning-fast processing speed; but it’s more likely that they raced through the questionnaire as fast as possible without reading the vignettes or thinking about the questions.

I will try to detect the presence of each of these types of participants.

The Random Clicker

OK, I will try to quantify this type of participant with criteria. One point for each time the participant did the following:

  1. Responded to a manipulation check incorrectly.
  2. Selected a self-contradicting option on a Likert scale.
  3. Selected at least two mutually contradictory options on a multiple-choice question.

# convert every factor to integers so responses can be compared numerically
clean <- clean %>%
  mutate(across(where(is.factor),
                as.integer,
                .names = "{.col}.int"))

demo <- c("gender", "employ", "educat", "party", "adhd", "work")
demo.wrong <- paste0(demo, "wrong")

clean <- clean %>%
  # reverse-code the reversed items so every scale points the same way
  mutate(across(c(cog4.int, lik3.int, lik4.int),
                ~ 6 - .),
         # one point for failing a manipulation check
         diswrong = ifelse(disclose,
                           discheck.int < 3,
                           discheck.int > 3),
         intwrong = ifelse(interdep,
                           intcheck.int < 3,
                           intcheck.int > 3),
         # one point per self-contradictory response within a scale
         affwrong = pmin(rowSums(across(matches("aff\\d\\.int"), ~ . < 3)),
                         rowSums(across(matches("aff\\d\\.int"), ~ . > 3))),
         cogwrong = pmin(rowSums(across(matches("cog\\d\\.int"), ~ . < 3)),
                         rowSums(across(matches("cog\\d\\.int"), ~ . > 3))),
         likwrong = pmin(rowSums(across(matches("lik\\d\\.int"), ~ . < 3)),
                         rowSums(across(matches("lik\\d\\.int"), ~ . > 3))),
         # one point for pairing "Prefer not to answer" with another option
         across(all_of(demo),
                ~ str_detect(., ",") & str_detect(., "Prefer"),
                .names = "{.col}wrong"),
         demowrong = rowSums(across(all_of(demo.wrong))),
         # total, counting each demographic contradiction only once
         wrong = diswrong + intwrong + affwrong + cogwrong +
           likwrong + demowrong) %>%
  select(-ends_with(".int"), -all_of(demo.wrong))

Clearly there’s a lot of nonsense here. But the median number of self-contradictory clicks was exactly 1. Let’s take a look at the distribution. My guess is that a small number of people accounted for the bulk of the wrong clicks.

clean %>%
  count(wrong) %>%
  formatAsTable

It looks like almost everyone made three or fewer mistakes; just 12 participants made more than three. I won’t take them out just yet, but I’ll add a variable to mark them so it’ll be easy to do analyses with and without them later on.

clean <- clean %>%
  mutate(random_clicker = wrong >= 4)
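As an example of how the flag might be used later (the cell-count check here is hypothetical, not a planned analysis):

# example downstream use: condition cell counts with the
# suspected random clickers excluded
clean %>%
  filter(!random_clicker) %>%
  count(interdep, disclose) %>%
  formatAsTable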

The Non-Participant

It appears that some participants responded to the survey but refrained from revealing any opinions, attitudes, or demographic information. I’ll try to identify these folks, too.

# count "Neither agree nor disagree" scale responses
# plus "Prefer not to answer" demographic responses
clean <- clean %>%
  mutate(decline = rowSums(across(matches("\\w{3}\\d") & where(is.factor),
                                  ~ . == "Neither agree nor disagree")) +
           rowSums(across(all_of(demo),
                          ~ . == "Prefer not to answer")))

clean %>%
  count(decline) %>%
  group_by(decline = cut(decline,
                         breaks = seq(0, 18, by = 3),
                         include.lowest = TRUE)) %>%
  summarise(n = sum(n)) %>%
  formatAsTable

It looks like 18 participants declined to respond to at least 10 questions. I’ll leave them in but mark them.

clean <- clean %>%
  mutate(nonpart = decline >= 10)

The FTL Thinker

How long did it take participants to complete the questionnaire?

# distribution of completion times (in seconds)
clean %>%
  count(duration = cut(duration,
                       c(seq(0, 480, 60), 1260))) %>%
  formatAsTable

All but 15 participants completed the questionnaire in under eight minutes. The median duration was 2.97 minutes, very close to the original estimate of 3 minutes.
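For reference, that median comes straight from the duration variable, which (judging by the cut points above) is recorded in seconds:

# median completion time, converted from seconds to minutes
clean %>%
  summarise(median_minutes = median(duration) / 60)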

The worrying observations are the 18 participants who took 90 seconds or less to complete the questionnaire. I’ll mark them and move on.

clean <- clean %>%
  mutate(ftl = duration <= 90)

The Triple Threat

Did anyone manage to raise all three red flags?

triple <- clean %>%
  summarise(n = sum(random_clicker & nonpart & ftl)) %>%
  pull(n)

Thankfully, exactly 0 participants raised all three red flags.
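A quick cross-tabulation of the flags shows how much they overlap pairwise, too:

# cross-tabulate the three red flags to inspect their overlap
clean %>%
  count(random_clicker, nonpart, ftl) %>%
  formatAsTable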

Outliers

Given the structure of this data, outlier analysis does not seem particularly important. Likert-scale items and demographic questions cannot produce outliers by design.

That leaves the continuous variables, age and duration, as the only possible sources of outliers. For duration, at least, it doesn’t matter how high the number is, and I’ve already dealt with the extreme low values. And age is ancillary to the purpose of this study.

Let’s take a quick look at it anyway.

clean %>%
  ggplot(aes(age)) +
  geom_histogram() +
  dark_theme_minimal() +
  theme(plot.background = element_rect(fill = '#3b434f'))

There’s obviously a floor at 18, which was our age restriction for participants, and a long tail at the upper end. The oldest participant was 83, which is statistically an outlier, but not actually concerning.
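(For the record, the “statistically an outlier” claim can be formalized with the conventional 1.5 × IQR rule; a minimal sketch:)

# upper fence under the 1.5 * IQR rule; na.rm because of the
# two missing ages
clean %>%
  summarise(q1 = quantile(age, .25, na.rm = TRUE),
            q3 = quantile(age, .75, na.rm = TRUE),
            upper_fence = q3 + 1.5 * (q3 - q1),
            n_above = sum(age > upper_fence, na.rm = TRUE))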


Export data:

clean %>%
  saveRDS(file.path("..", "data", "quality-data.rds"))

clean %>%
  write_csv(file.path("..", "data", "quality-data.csv"))

Output document:

options(knitr.duplicate.label = "allow")
rmarkdown::render("data-qa.Rmd",
                  output_dir = file.path("..", "github", "thesis"))