Data Tidying

While I wait for all the data to come in, I’m going to prepare the syntax for tidying the data. I want to get it as close to ready to work with as possible.

I’m not going to deal with analytical judgment calls here. So I’m going to ignore missing data, outliers, and other data-quality concerns. I’m not going to center variables, calculate predicted values of latent variables, or produce new variables like “answered all checks correctly” or product terms.

What I am trying to do here is rationalize variable names, coalesce variables that were spread across multiple columns, dummy-code options in “select all that apply” questions, coerce variables to their appropriate data types, and arrange columns in a meaningful order.

Setup

Import packages.

library(magrittr)
library(lubridate)
library(tidyverse)

library(flextable)

Copy over table formatting.

# turn dataframe into html table
formatAsTable <- function(data) {
  data %>%
    mutate(across(where(is.double), ~ round(., 3))) %>%
    flextable %>%
    color(color = "white", part = "all")
}

Raw data

Import data.

raw.data <- read_csv(
  file.path("..", "data", "adhd-disclosure-raw-data.csv")
)

Let’s look at the top-left just to get an idea of what we’re dealing with.

# preview the first 4 rows and the first c columns as a formatted table
quickview <- function(data, c = 6) {
  data %>%
    head(4) %>%
    select(1:c) %>%
    formatAsTable %>%
    autofit
}

raw.data %>% quickview(3)

StartDate	EndDate	Status
Start Date	End Date	Response Type
{"ImportId":"startDate","timeZone":"America/Denver"}	{"ImportId":"endDate","timeZone":"America/Denver"}	{"ImportId":"status"}
2021-05-24 09:26:24	2021-05-24 09:28:59	IP Address
2021-05-24 09:31:12	2021-05-24 09:32:47	IP Address

Clearly this is going to take some work.

Header rows

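First, it looks like there are three header rows. We only need one. Since read_csv() already used the first one for the column names, the other two show up as the first two rows of data, so those are the rows I'll drop.

raw.data <- raw.data[-(1:2), ]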

raw.data %>% quickview

StartDate	EndDate	Status	IPAddress	Progress	Duration (in seconds)
2021-05-24 09:26:24	2021-05-24 09:28:59	IP Address	73.121.40.9	100	154
2021-05-24 09:31:12	2021-05-24 09:32:47	IP Address	71.115.145.137	100	95
2021-05-24 09:32:08	2021-05-24 09:34:16	IP Address	174.48.238.47	100	128
2021-05-24 09:37:42	2021-05-24 09:38:54	IP Address	75.129.124.127	100	72

That’s already much better.
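
To see what that row-dropping step does in isolation, here is a minimal sketch on a toy table. The values in `toy` are made up for illustration and are not from the real export.

```r
library(tibble)

# Toy stand-in for a Qualtrics export: the first two data rows are really
# extra header rows (question text and import metadata). Values are made up.
toy <- tibble(
  Progress = c("Progress", "{\"ImportId\":\"progress\"}", "100", "100"),
  Finished = c("Finished", "{\"ImportId\":\"finished\"}", "1", "1")
)

# Drop rows 1 and 2, keeping only the real responses.
toy_clean <- toy[-(1:2), ]

nrow(toy_clean)
# The columns are still character, because the header rows were read as text.
is.character(toy_clean$Progress)
```

Note that the remaining columns stay character even though they now hold only numbers, so type conversion is still needed later.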

This is a style choice, but I prefer my variables to be all lowercase with no spaces.

raw.data <- raw.data %>%
  rename_with(tolower) %>%
  rename(duration = `duration (in seconds)`)

The naming scheme for the variables got distorted somewhere along the way. I’m going to rationalize it. I’m going to make each variable the scale name, followed by survey block number, followed by item number.

varnames <- map(c("aff", "cog", "lik"),
                function(x) map(1:2,
                      function(y) map(1:4,
                            function(z) paste(x, y, z, sep = "")))) %>%
  unlist

oldnames <- raw.data %>%
  select(starts_with(c("aff", "cog", "trust", "liking"))) %>%
  names

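# rename_with() receives the selected names in the order given by oldnames,
# so the new names can be supplied positionally
raw.data <- raw.data %>%
  rename_with(~ varnames, .cols = all_of(oldnames))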
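
The naming scheme (scale, then block, then item) can also be sketched with a lookup table. This is an illustration under assumptions: `old_names` and `toy` are hypothetical stand-ins, not the real column names.

```r
library(dplyr)

# Build names as scale + block + item: "aff11", "aff12", ..., "lik24".
# expand.grid() varies its first argument fastest, so item is listed first.
grid <- expand.grid(item = 1:4, block = 1:2,
                    scale = c("aff", "cog", "lik"),
                    stringsAsFactors = FALSE)
new_names <- paste0(grid$scale, grid$block, grid$item)

# Hypothetical old names standing in for the distorted originals.
old_names <- paste0("q", seq_along(new_names))

# A named vector of the form c(new = "old") works as a lookup table.
lookup <- setNames(old_names, new_names)

# One-row toy data frame with the old column names.
toy <- as.data.frame(matrix(0, nrow = 1, ncol = 24,
                            dimnames = list(NULL, old_names)))
toy <- rename(toy, all_of(lookup))

names(toy)[1:5]
```

The lookup-table form makes the old-to-new pairing explicit, which is easier to check by eye than a positional rename.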

Experimental conditions

Disclosure and interdependence came in as dummy-coded character vectors. Honestly, that’s probably fine, but I feel like they should be logical.

raw.data <- raw.data %>%
  mutate(across(c(disclose, interdep), ~ as.logical(as.numeric(.))))
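
A small check of why the conversion goes through as.numeric() first. The values in `x` are made up.

```r
# Dummy codes stored as text, the way they arrive from the export.
x <- c("0", "1", "1", NA)

# as.logical() only understands strings like "TRUE" or "F", so "1" becomes NA.
as.logical("1")

# Going through numeric first gives the intended result: 0 -> FALSE, 1 -> TRUE.
converted <- as.logical(as.numeric(x))
converted
```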

And the manipulation checks need to be renamed.

raw.data <- raw.data %>%
  rename_with(~ c("intcheck", "discheck"),
              .cols = contains("checks"))

Variables

Qualtrics dumps a whole bunch of extra metadata variables in that we don’t need. Let’s take them out.

raw.data <- raw.data %>%
  relocate(duration, .after = startdate) %>%
  select(!enddate:userlanguage)

raw.data %>% quickview

There might also be some variables that have no data at all. Let’s take those out, too. While I’m at it, I’ll remove any rows that have no data at all (or, more precisely, only metadata and embedded data).

raw.data <- raw.data %>%
  select(-where(~all(is.na(.)))) %>%
  filter(if_any(-c(interdep, disclose,
                   startdate, duration), ~ !is.na(.)))
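
Here is the same two-step filter on a toy table, as a rough sketch. The column names and values are invented for the example.

```r
library(dplyr)

toy <- tibble(
  condition = c("1", "0", "1"),   # embedded data: always present
  q1        = c("a", NA, "b"),
  q2        = c(NA, NA, "c"),
  empty     = c(NA, NA, NA)       # never answered by anyone
)

toy_clean <- toy %>%
  # drop columns in which every value is missing
  select(-where(~ all(is.na(.)))) %>%
  # keep rows with at least one non-missing answer outside `condition`
  filter(if_any(-condition, ~ !is.na(.)))

toy_clean
```

The second row goes away because its only non-missing value is the embedded condition code.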

Exactly 7 “observations” were dropped.

I’m going to create a basic ID column that can be used as key values in joins.

raw.data <- raw.data %>%
  mutate(id = row_number())

There are four columns named read, left over from the validation checks that followed the vignettes. They carry no real data, but they do show which observations were collected after I added the checks (I only implemented them after 40 or so observations had already been collected). So what I want to do is create a new column called validate that will be TRUE for any observations taken after the validation check was added. Then we can dump the read columns.

raw.data <- raw.data %>%
  mutate(validate = !is.na(read) | !is.na(read_1)) %>%
  select(-starts_with("read"))

Because of how I set up the survey, some of these variables need to be consolidated. For example, aff11 is actually the same as aff21, but was shown to the participant in the first, rather than the second, set. I’m going to create a new variable that will tell us when the item was displayed so we don’t lose that information when I coalesce the columns.

raw.data <- raw.data %>%
  mutate(across(matches("\\w{3}1\\d"),
                ~ ifelse(!is.na(.), 1L, 2L),
                .names = "{.col}ord")) %>%
  rename_with(~ str_remove(., "1"),
              contains("ord"))

Now I will consolidate the data by coalescing the columns.

first_sets <- varnames[unlist(map(seq(1, 24, 8), ~ seq(., . + 3)))]

raw.data <- first_sets %>%
  map_dfc(~ coalesce(raw.data[[.]],
                     raw.data[[str_replace(., "1", "2")]])) %>%
  set_colnames(first_sets) %>%
  mutate(id = row_number()) %>%
  inner_join(select(raw.data, !matches("\\w{3}\\d{2}")), "id") %>%
  rename_with(~ str_remove(., "1"),
              matches("\\w{3}1\\d"))

raw.data %>% quickview

aff1	aff2	aff3	aff4	cog1	cog2
Neither agree nor disagree	Neither agree nor disagree	Somewhat agree	Neither agree nor disagree	Somewhat agree	Somewhat agree
Somewhat agree	Somewhat agree	Somewhat agree	Strongly agree	Somewhat agree	Somewhat agree
Somewhat agree	Neither agree nor disagree	Neither agree nor disagree	Neither agree nor disagree	Somewhat disagree	Somewhat disagree
Strongly disagree	Strongly disagree	Somewhat disagree	Somewhat disagree	Neither agree nor disagree	Somewhat agree
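
For a single item, the idea behind the order flag and the coalescing step can be sketched like this. The responses are made up, and the sketch handles one pair of columns rather than all twelve.

```r
library(dplyr)

# Each participant saw the item in block 1 OR block 2, never both.
toy <- tibble(
  aff11 = c("Somewhat agree", NA, "Strongly disagree"),
  aff21 = c(NA, "Somewhat agree", NA)
)

toy <- toy %>%
  mutate(
    aff1ord = if_else(!is.na(aff11), 1L, 2L),  # which block showed the item
    aff1    = coalesce(aff11, aff21)           # first non-missing value
  ) %>%
  select(aff1, aff1ord)

toy
```

The order flag has to be computed before coalescing, because afterwards there is no way to tell which column a value came from.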

Factors

R has a feature called ordered factors that combines the functionality of numeric vectors and character vectors for use in ordinal data (like Likert-scale data). Factors will look like character vectors, but ordering functions like min() and arrange() will follow the order of the levels instead of sorting alphabetically.

This kind of data is specific to R and will be lost if exported to a CSV file. However, we can save it in an R data file (.rds).
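
A quick demonstration of both points, using a shortened three-level scale and throwaway temporary files. The level labels and paths here are just for illustration.

```r
lvls <- c("Disagree", "Neutral", "Agree")
x <- ordered(c("Agree", "Disagree", "Neutral"), levels = lvls)

min(x)                  # smallest by level order
x < "Agree"             # comparisons work on ordered factors
sort(as.character(x))   # plain text would sort alphabetically instead

# Round trip through a CSV file and an RDS file.
df <- data.frame(x = x)
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")
write.csv(df, csv_path, row.names = FALSE)
saveRDS(df, rds_path)

from_csv <- read.csv(csv_path)
from_rds <- readRDS(rds_path)

is.ordered(from_csv$x)  # the level order did not survive the CSV
is.ordered(from_rds$x)  # the RDS file kept it
```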

The Likert-scale data is easy to factor-ize, so I’ll take care of that first.

lvls <- c("Strongly disagree",
          "Somewhat disagree",
          "Neither agree nor disagree",
          "Somewhat agree",
          "Strongly agree")

raw.data <- raw.data %>%
  mutate(across(c(aff1:lik4, intcheck:discheck),
                ~ ordered(., levels = lvls)))

Depending on how participants answer, education and politics could be ordinal variables. Someone has more or less education and can be more or less liberal. But if we get answers like “Other” or “Prefer not to answer”, they don’t really work as ordinal variables.

So here’s what I’m going to do. I will set these as ordinal variables, and if I decide to do any actual analyses based on that, I will drop the exceptions.

educat.lvls <- c("High school diploma",
                 "Associate's degree",
                 "Bachelor's degree",
                 "Master's degree",
                 "Doctoral degree",
                 "Other", "Prefer not to answer")

party.lvls <- c("Liberal", "Moderate", "Conservative",
                "Other", "Prefer not to answer")

raw.data <- raw.data %>%
  mutate(educat = ordered(educat, educat.lvls),
         party = ordered(party, party.lvls))
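
If I do end up treating one of these as ordinal, dropping the exceptions could look something like the sketch below. The responses and the `scale_lvls` helper are made up for the example.

```r
party.lvls <- c("Liberal", "Moderate", "Conservative",
                "Other", "Prefer not to answer")
party <- ordered(c("Moderate", "Other", "Liberal", "Conservative"),
                 levels = party.lvls)

# Keep only the responses that sit on a genuine scale,
# then drop the now-unused exception levels.
scale_lvls <- c("Liberal", "Moderate", "Conservative")
party_ord <- droplevels(party[party %in% scale_lvls])

levels(party_ord)
```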

R also has unordered factors for categorical (i.e., nominal) data (like gender). I’ll take care of that now for gender and employment status.

raw.data <- raw.data %>%
  mutate(across(c(gender, employ), factor))

Ethnicity, working history, and ADHD relationships remain an issue because we allowed participants to “select all that apply.” To deal with these properly, we need a dummy variable for each option that was selected at least once.

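# add one logical dummy column for each option selected at least once
opts <- function(data, var) {
  # every distinct option that appears in the comma-separated answers
  o <- data %>%
    select(all_of(var)) %>%
    map(~ str_split(., ",")) %>%
    unlist %>%
    unique
  o %>%
    # flag rows whose answer contains the first word of the option
    map_dfc(~ str_detect(data[[var]], word(.))) %>%
    set_colnames(o) %>%
    # name each dummy as variable_firstword, e.g. "adhd_family"
    rename_with(~ paste(var, tolower(word(.)), sep="_")) %>%
    mutate(id = row_number()) %>%
    inner_join(data, by = "id")
}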

raw.data <- raw.data %>%
  opts("ethnic") %>%
  opts("work") %>%
  opts("adhd") %>%
  rename_with(tolower) %>%
  rename("adhd_nobody" = "adhd_i")
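
Stripped down to one variable, the dummy-coding idea looks roughly like this. It is a simplified sketch with made-up answers: it matches on the full option text rather than the first word, and it would misfire if one option were a substring of another.

```r
library(stringr)

# "Select all that apply" answers, comma-separated as in the export.
adhd <- c("Friend,Coworker", "Myself", "Friend", NA)

# Every option chosen at least once (missing answers are skipped).
opts_seen <- unique(unlist(str_split(adhd[!is.na(adhd)], ",")))

# One logical column per option; fixed() matches the text literally.
dummies <- sapply(opts_seen, function(o) str_detect(adhd, fixed(o)))
colnames(dummies) <- paste0("adhd_", tolower(opts_seen))

dummies
```

A participant who skipped the question gets NA in every dummy column rather than FALSE.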

Continuous data

Although most of the data is categorical or ordinal, a few variables are properly continuous. They came in as character vectors rather than numeric vectors, most likely because the two extra header rows made read_csv() parse every column as text. Leaving them that way would cause problems later: R modelling functions like lm() treat a character predictor as categorical (it gets converted to a factor), not as a number. So I’ll convert the data manually upfront.

raw.data <- raw.data %>%
  mutate(across(c(duration, age),
                as.integer))
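
A quick check of what the explicit conversion changes. The ages and outcome values below are made up. The sketch uses model.matrix() to show how a model formula would code the variable.

```r
age_chr <- c("27", "30", "33", "41")
class(age_chr)                 # stored as text

age <- as.integer(age_chr)
mean(age)                      # arithmetic works after conversion

# In a model formula, a text column is treated as categorical:
# one dummy column per extra level, instead of a single slope.
d <- data.frame(y = c(1, 2, 3, 5), age_chr = age_chr)
n_cols_text <- ncol(model.matrix(y ~ age_chr, d))
n_cols_num  <- ncol(model.matrix(y ~ as.integer(age_chr), d))

n_cols_text   # intercept + 3 dummies
n_cols_num    # intercept + 1 slope
```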

I saved the startdate variable, so I might as well prep that too. As in other programming languages, dates and times are handled a little differently from other numeric data. I will use the package lubridate to convert the dates.

raw.data <- raw.data %>%
  mutate(startdate = ymd_hms(startdate, tz = "America/Denver"))
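
For one timestamp, the conversion behaves like this. The sketch uses the first start time shown in the raw data preview and assumes the timestamps were recorded in the America/Denver zone named in the import metadata.

```r
library(lubridate)

# Parse the text and attach the time zone explicitly.
x <- ymd_hms("2021-05-24 09:26:24", tz = "America/Denver")

class(x)             # a POSIXct date-time, no longer text
tz(x)                # the zone travels with the value
with_tz(x, "UTC")    # same instant, displayed in UTC
hour(with_tz(x, "UTC"))
```

Without the tz argument, ymd_hms() would assume UTC, which would shift every timestamp by the Denver offset.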

Organizing variables

At this point, we have all the variables, but they are in an order that doesn’t make a whole lot of sense. I’m going to sort them by:

ID
IVs
Mediators
DVs
Demographics
Metadata

clean.data <- raw.data %>%
  relocate(id) %>%
  relocate(c(interdep, disclose, validate) | intcheck:discheck,
           .after = id) %>%
  relocate(as.vector(rbind(names(select(raw.data, aff1:lik4)),
                           names(select(raw.data, aff1ord:lik4ord)))),
           .after = discheck) %>%
  relocate(c(gender, age, employ, educat, educat_7_text, party),
           .after = lik4ord) %>%
  relocate(c(adhd | contains("adhd_"),
             work | contains("work_"),
             ethnic | contains("ethnic_")), .after = party)
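
The as.vector(rbind(...)) idiom in the middle is what interleaves each item with its order flag. On a short example with three hypothetical item names:

```r
items  <- c("aff1", "aff2", "aff3")
orders <- paste0(items, "ord")

# rbind() stacks the two vectors as rows of a matrix; as.vector() then reads
# the matrix column by column, which alternates the two vectors.
interleaved <- as.vector(rbind(items, orders))
interleaved
```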

Export data

I will export the data both as a CSV and as an R data file.

clean.data %>%
  write_csv(file.path("..", "data", "clean-data.csv"))

clean.data %>%
  saveRDS(file.path("..", "data", "clean-data.rds"))

And let’s take one more look at the data before we go:

structure_table <- function(data) {
  tibble(variable = names(data),
         type = map_chr(data,
                        ~ ifelse(is.factor(.),
                                 "factor",
                                 ifelse(is.POSIXct(.),
                                        "datetime",
                                        typeof(.)))),
         head = map_chr(data,
                        ~ paste(head(., 3),
                                collapse = ", ")))
}

structure_table %>%
  saveRDS("structure-table.rds")

clean.data %>%
  structure_table %>%
  formatAsTable %>%
  autofit

variable	type	head
id	integer	1, 2, 3
interdep	logical	FALSE, TRUE, TRUE
disclose	logical	FALSE, TRUE, TRUE
validate	logical	FALSE, FALSE, FALSE
intcheck	factor	Neither agree nor disagree, Somewhat agree, Somewhat disagree
discheck	factor	Strongly disagree, Strongly agree, Neither agree nor disagree
aff1	factor	Neither agree nor disagree, Somewhat agree, Somewhat agree
aff1ord	integer	2, 1, 2
aff2	factor	Neither agree nor disagree, Somewhat agree, Neither agree nor disagree
aff2ord	integer	1, 1, 1
aff3	factor	Somewhat agree, Somewhat agree, Neither agree nor disagree
aff3ord	integer	1, 2, 2
aff4	factor	Neither agree nor disagree, Strongly agree, Neither agree nor disagree
aff4ord	integer	2, 2, 1
cog1	factor	Somewhat agree, Somewhat agree, Somewhat disagree
cog1ord	integer	2, 1, 2
cog2	factor	Somewhat agree, Somewhat agree, Somewhat disagree
cog2ord	integer	1, 2, 2
cog3	factor	Neither agree nor disagree, Neither agree nor disagree, Neither agree nor disagree
cog3ord	integer	1, 2, 1
cog4	factor	Somewhat agree, Neither agree nor disagree, Somewhat disagree
cog4ord	integer	2, 1, 1
lik1	factor	Somewhat agree, Somewhat agree, Neither agree nor disagree
lik1ord	integer	2, 1, 1
lik2	factor	Somewhat agree, Somewhat agree, Somewhat agree
lik2ord	integer	1, 2, 1
lik3	factor	Somewhat disagree, Somewhat disagree, Somewhat disagree
lik3ord	integer	2, 2, 2
lik4	factor	Somewhat disagree, Somewhat disagree, Neither agree nor disagree
lik4ord	integer	1, 1, 2
gender	factor	Male, Male, Male
age	integer	27, 30, 33
employ	factor	Full-time, Full-time, Full-time
educat	factor	Bachelor's degree, Bachelor's degree, Bachelor's degree
educat_7_text	character	NA, NA, NA
party	factor	Moderate, Conservative, Conservative
adhd	character	Family member,Acquaintance, I do not know anyone with ADHD, Prefer not to answer
adhd_family	logical	TRUE, FALSE, FALSE
adhd_acquaintance	logical	TRUE, FALSE, FALSE
adhd_nobody	logical	FALSE, TRUE, FALSE
adhd_prefer	logical	FALSE, FALSE, TRUE
adhd_myself	logical	FALSE, FALSE, FALSE
adhd_friend	logical	FALSE, FALSE, FALSE
adhd_coworker	logical	FALSE, FALSE, FALSE
adhd_classmate	logical	FALSE, FALSE, FALSE
work	character	service, professional, professional
work_service	logical	TRUE, FALSE, FALSE
work_professional	logical	FALSE, TRUE, TRUE
work_prefer	logical	FALSE, FALSE, FALSE
work_manual	logical	FALSE, FALSE, FALSE
work_other	logical	FALSE, FALSE, FALSE
work_4_text	character	NA, NA, NA
ethnic	character	White or Caucasian, White or Caucasian, White or Caucasian
ethnic_white	logical	TRUE, TRUE, TRUE
ethnic_hispanic	logical	FALSE, FALSE, FALSE
ethnic_other	logical	FALSE, FALSE, FALSE
ethnic_black	logical	FALSE, FALSE, FALSE
ethnic_asian	logical	FALSE, FALSE, FALSE
ethnic_prefer	logical	FALSE, FALSE, FALSE
ethnic_5_text	character	NA, NA, NA
startdate	datetime	2021-05-24 09:26:24, 2021-05-24 09:31:12, 2021-05-24 09:32:08
duration	integer	154, 95, 128
gender_3_text	character	NA, NA, NA
party_4_text	character	NA, NA, NA
employ_6_text	character	NA, NA, NA

Output document:

options(knitr.duplicate.label = "allow")
rmarkdown::render("data-tidying.Rmd", output_dir = file.path("..", "github", "thesis"))

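Output of the quickview call in the Variables section, after the Qualtrics metadata columns were removed (the read columns are empty in these rows):

startdate	duration	read	read_1	read_2	read_3
2021-05-24 09:26:24	154
2021-05-24 09:31:12	95
2021-05-24 09:32:08	128
2021-05-24 09:37:42	72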