While I wait for all the data to come in, I’m going to prepare the syntax for tidying the data. I want to get it as close to ready to work with as possible.

I’m not going to deal with analytical judgment calls here. So I’m going to ignore missing data, outliers, and other data-quality concerns. I’m not going to center variables, calculate predicted values of latent variables, or produce new variables like “answered all checks correctly” or product terms.

What I am trying to do here is rationalize variable names, coalesce variables that were spread across multiple columns, dummy-code options in “select all that apply” questions, coerce variables to their appropriate data types, and arrange columns in a meaningful order.

Setup

Import packages.

library(magrittr)
library(lubridate)
library(tidyverse)

library(flextable)

Copy over table formatting.

# turn dataframe into html table
formatAsTable <- function(data) {
  data %>%
    mutate(across(where(is.double), ~ round(., 3))) %>%
    flextable %>%
    color(color = "white", part = "all")
}

Raw data

Import data.

raw.data <- read_csv(
  file.path("..", "data", "adhd-disclosure-raw-data.csv")
)

Let’s look at the top-left just to get an idea of what we’re dealing with.

# show the first 4 rows and the first n.cols columns as a table
quickview <- function(data, n.cols = 6) {
  data %>%
    head(4) %>%
    select(all_of(seq_len(n.cols))) %>%
    formatAsTable %>%
    autofit
}

raw.data %>% quickview(3)

Clearly this is going to take some work.

Header rows

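First, it looks like there are three header rows. We only need one. read_csv() already used the first as the column names, so the other two are sitting in the first two rows of the data.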

raw.data <- raw.data[-(1:2), ]

raw.data %>% quickview

That’s already much better.

This is a style choice, but I prefer my variables to be all lowercase with no spaces.

raw.data <- raw.data %>%
  rename_with(tolower) %>%
  rename(duration = `duration (in seconds)`)

The naming scheme for the variables got distorted somewhere along the way. I’m going to rationalize it. I’m going to make each variable the scale name, followed by survey block number, followed by item number.

varnames <- map(c("aff", "cog", "lik"),
                function(x) map(1:2,
                      function(y) map(1:4,
                            function(z) paste(x, y, z, sep = "")))) %>%
  unlist

oldnames <- raw.data %>%
  select(starts_with(c("aff", "cog", "trust", "liking"))) %>%
  names

# look up each old name's position and swap in the new name at that position
raw.data <- raw.data %>%
  rename_with(~ varnames[match(., oldnames)],
              .cols = all_of(oldnames))
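
To make the target scheme concrete, here is a small base R sketch that builds the same 24 names without nested map() calls. The `toy.` objects are made up for illustration; only the three scale prefixes come from the code above.

```r
# Scale prefix, then block number (1-2), then item number (1-4).
toy.scales <- c("aff", "cog", "lik")

toy.names <- paste0(
  rep(toy.scales, each = 8),            # each scale covers 2 blocks x 4 items
  rep(rep(1:2, each = 4), times = 3),   # block number repeats within each scale
  rep(1:4, times = 6)                   # item number cycles fastest
)

head(toy.names, 8)
```

The first eight values should be the aff items for block 1 followed by the aff items for block 2, which is the order the renaming step relies on.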

Experimental conditions

Disclosure and interdependence came in as dummy-coded character vectors. Honestly, that’s probably fine, but I feel like they should be logical.

raw.data <- raw.data %>%
  mutate(across(c(disclose, interdep), ~ as.logical(as.numeric(.))))

And the manipulation checks need to be renamed.

raw.data <- raw.data %>%
  rename_with(~ c("intcheck", "discheck"),
              .cols = contains("checks"))

Variables

Qualtrics dumps a whole bunch of extra metadata variables in that we don’t need. Let’s take them out.

raw.data <- raw.data %>%
  relocate(duration, .after = startdate) %>%
  select(!enddate:userlanguage)

raw.data %>% quickview

There might also be some variables that have no data at all. Let’s take those out, too. While I’m at it, I’ll remove any rows that have no data at all (or, more precisely, only metadata and embedded data).

raw.data <- raw.data %>%
  select(-where(~all(is.na(.)))) %>%
  filter(if_any(-c(interdep, disclose,
                   startdate, duration), ~ !is.na(.)))

Exactly 7 “observations” were dropped.

I’m going to create a basic ID column that can be used as key values in joins.

raw.data <- raw.data %>%
  mutate(id = row_number())

There are four columns whose names start with read, left over from the validation checks that followed the vignettes. They carry no real data, with one exception: I only added the check after 40 or so observations had already been collected, so whether a read column is filled in tells us whether the participant saw the check. What I want is a new column called validate that is TRUE for any observation taken after the validation check was added. Then we can drop the read columns.

raw.data <- raw.data %>%
  mutate(validate = !is.na(read) | !is.na(read_1)) %>%
  select(-starts_with("read"))

Because of how I set up the survey, some of these variables need to be consolidated. For example, aff11 is actually the same as aff21, but was shown to the participant in the first, rather than the second, set. I’m going to create a new variable that will tell us when the item was displayed so we don’t lose that information when I coalesce the columns.

raw.data <- raw.data %>%
  mutate(across(matches("\\w{3}1\\d"),
                ~ ifelse(!is.na(.), 1L, 2L),
                .names = "{.col}ord")) %>%
  rename_with(~ str_remove(., "1"),
              contains("ord"))

Now I will consolidate the data by coalescing the columns.

first_sets <- varnames[unlist(map(seq(1, 24, 8), ~ seq(., . + 3)))]

raw.data <- first_sets %>%
  map_dfc(~ coalesce(raw.data[[.]],
                     raw.data[[str_replace(., "1", "2")]])) %>%
  set_colnames(first_sets) %>%
  mutate(id = row_number()) %>%
  inner_join(select(raw.data, !matches("\\w{3}\\d{2}")), "id") %>%
  rename_with(~ str_remove(., "1"),
              matches("\\w{3}1\\d"))

raw.data %>% quickview
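
To show what those two steps do to a single item, here is a toy version in base R. The data frame and its values are hypothetical; the column names follow the scheme above.

```r
# Hypothetical responses: each participant saw the item in only one block.
toy <- data.frame(aff11 = c("Agree", NA, "Disagree"),
                  aff21 = c(NA, "Agree", NA),
                  stringsAsFactors = FALSE)

# Record which block showed the item before the columns are merged.
toy$aff1ord <- ifelse(!is.na(toy$aff11), 1L, 2L)

# Keep the first non-missing value; dplyr::coalesce(aff11, aff21) does the same.
toy$aff1 <- ifelse(is.na(toy$aff11), toy$aff21, toy$aff11)

toy[, c("aff1", "aff1ord")]
```

One thing this makes visible: with this rule, a participant who skipped the item in both blocks would also be flagged as 2, so the order column is only meaningful where the item itself is not missing.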

Factors

R has a feature called ordered factors that combines the functionality of numeric vectors and character vectors for use in ordinal data (like Likert-scale data). Factors will look like character vectors, but ordering functions like min() and arrange() will work.

This kind of data is specific to R and will be lost if exported to a CSV file. However, we can save it in an R data file (.rds).
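
Here is a quick sketch of both points, using base R and a made-up three-level scale (the `toy.` names and temporary files are only for illustration).

```r
# Made-up scale; the order of `levels` defines the ranking, not the alphabet.
toy.lvls <- c("Strongly disagree", "Neither agree nor disagree", "Strongly agree")
toy.resp <- ordered(c("Strongly agree", "Strongly disagree"), levels = toy.lvls)

min(toy.resp)                 # lowest level present
toy.resp < "Strongly agree"   # comparisons follow the level order

# Round-trip through CSV and RDS to see which one keeps the ordering.
toy.csv <- tempfile(fileext = ".csv")
toy.rds <- tempfile(fileext = ".rds")
toy.df <- data.frame(resp = toy.resp)
write.csv(toy.df, toy.csv, row.names = FALSE)
saveRDS(toy.df, toy.rds)

is.ordered(read.csv(toy.csv)$resp)   # the CSV only stores the labels
is.ordered(readRDS(toy.rds)$resp)    # the RDS stores the R object itself
```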

The Likert-scale data is easy to factor-ize, so I’ll take care of that first.

lvls <- c("Strongly disagree",
          "Somewhat disagree",
          "Neither agree nor disagree",
          "Somewhat agree",
          "Strongly agree")

raw.data <- raw.data %>%
  mutate(across(c(aff1:lik4, intcheck:discheck),
                ~ ordered(., levels = lvls)))

Depending on how participants answer, education and politics could be ordinal variables. Someone has more or less education and can be more or less liberal. But if we get answers like “Other” or “Prefer not to answer”, they don’t really work as ordinal variables.

So here’s what I’m going to do. I will set these as ordinal variables, and if I decide to do any actual analyses based on that, I will drop the exceptions.

educat.lvls <- c("High school diploma",
                 "Associate's degree",
                 "Bachelor's degree",
                 "Master's degree",
                 "Doctoral degree",
                 "Other", "Prefer not to answer")

party.lvls <- c("Liberal", "Moderate", "Conservative",
                "Other", "Prefer not to answer")

raw.data <- raw.data %>%
  mutate(educat = ordered(educat, educat.lvls),
         party = ordered(party, party.lvls))

R also has unordered factors for categorical (i.e., nominal) data (like gender). I’ll take care of that now for the gender and employment status.

raw.data <- raw.data %>%
  mutate(across(c(gender, employ), factor))

Ethnicity, working history, and ADHD relationships remain an issue because we allowed participants to “select all that apply.” To deal with these properly, we need a dummy variable for each option that was selected at least once.

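# dummy-code a "select all that apply" column:
# one logical column per option selected at least once
opts <- function(data, var) {
  # every distinct option that appears in the comma-separated answers
  o <- data %>%
    select(all_of(var)) %>%
    map(~ str_split(., ",")) %>%
    unlist %>%
    unique
  o %>%
    # flag rows whose answer contains the first word of the option
    map_dfc(~ str_detect(data[[var]], word(.))) %>%
    set_colnames(o) %>%
    # name each dummy <var>_<first word of option>
    rename_with(~ paste(var, tolower(word(.)), sep = "_")) %>%
    mutate(id = row_number()) %>%
    inner_join(data, by = "id")
}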

raw.data <- raw.data %>%
  opts("ethnic") %>%
  opts("work") %>%
  opts("adhd") %>%
  rename_with(tolower) %>%
  rename("adhd_nobody" = "adhd_i")
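
For anyone who wants to see the idea on a small scale, here is a base R sketch of dummy-coding a comma-separated column. The data are invented, and it matches whole options exactly rather than by first word, so it is a variation on the function above and not a copy of it.

```r
# Invented answers to a "select all that apply" question.
toy <- data.frame(id = 1:3,
                  ethnic = c("White,Asian", "Black", "Asian"),
                  stringsAsFactors = FALSE)

# Split each answer into the options that were ticked.
toy.picks <- strsplit(toy$ethnic, ",", fixed = TRUE)
toy.options <- unique(unlist(toy.picks))

# One logical column per option: TRUE where that option was among the picks.
toy.dummies <- sapply(toy.options, function(opt) {
  vapply(toy.picks, function(p) opt %in% p, logical(1))
})
colnames(toy.dummies) <- paste("ethnic", tolower(toy.options), sep = "_")

toy <- cbind(toy, toy.dummies)
toy
```

Exact matching avoids false hits when one option's text happens to appear inside another, at the cost of needing the option labels to be free of commas.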

Continuous data

Although most of the data is categorical or ordinal, a few variables are properly continuous. They came in as character vectors rather than numeric vectors, most likely because the extra header rows made read_csv() treat every column as text. This does matter: R modelling functions like lm() will not convert character vectors to numbers for you. They treat a character predictor as a categorical variable. So I’ll convert the data manually upfront.

raw.data <- raw.data %>%
  mutate(across(c(duration, age),
                as.integer))
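
A toy reproduction of the situation, in base R with invented values, shows one plausible way numeric columns end up as text and what the conversion does. The label rows here are stand-ins, not the real Qualtrics headers.

```r
# Invented export: one real header row plus two extra label rows.
toy.csv <- "age,duration\nAge,Duration (in seconds)\nQID1,duration\n25,310\n31,275"
toy <- read.csv(text = toy.csv, stringsAsFactors = FALSE)

sapply(toy, class)   # the label rows force both columns to be read as text

# Drop the two label rows, then convert every column to integer.
toy <- toy[-(1:2), ]
rownames(toy) <- NULL
toy[] <- lapply(toy, as.integer)

sapply(toy, class)
```

Note that as.integer() turns anything it cannot parse into NA with a warning, so it is worth checking for new missing values after a conversion like this.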

I saved the startdate variable, so I might as well prep that too. As in other programming languages, dates and times are treated a little differently from other numeric data. I will use the package lubridate to convert the dates.

raw.data <- raw.data %>%
  mutate(startdate = ymd_hms(startdate, tz = "America/Denver"))
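
To illustrate what the time zone argument does, here is a base R sketch with an invented timestamp. It uses as.POSIXct() rather than lubridate so it runs without extra packages; as far as I know, ymd_hms() likewise returns a POSIXct value and reads the clock time in the zone given by tz.

```r
# Invented timestamp in the "YYYY-MM-DD HH:MM:SS" layout.
toy.stamp <- "2021-03-01 14:30:00"

# Read the clock time as Denver local time.
toy.time <- as.POSIXct(toy.stamp, tz = "America/Denver")

class(toy.time)
format(toy.time, tz = "UTC", usetz = TRUE)   # same instant, shown in UTC
```

The stored value is a single instant; the time zone only controls how the text is read in and how it is printed.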

Organizing variables

At this point, we have all the variables, but they are in an order that doesn’t make a whole lot of sense. I’m going to sort them by:

  1. ID
  2. IVs
  3. Mediators
  4. DVs
  5. Demographics
  6. Metadata

clean.data <- raw.data %>%
  relocate(id) %>%
  relocate(c(interdep, disclose, validate) | intcheck:discheck,
           .after = id) %>%
  relocate(as.vector(rbind(names(select(raw.data, aff1:lik4)),
                           names(select(raw.data, aff1ord:lik4ord)))),
           .after = discheck) %>%
  relocate(c(gender, age, employ, educat, educat_7_text, party),
           .after = lik4ord) %>%
  relocate(c(adhd | contains("adhd_"),
             work | contains("work_"),
             ethnic | contains("ethnic_")), .after = party)
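
The as.vector(rbind(...)) call is what puts each item next to its order flag. A two-item sketch with made-up name vectors shows the trick:

```r
# Made-up name vectors standing in for the item and order columns.
toy.items <- c("aff1", "aff2")
toy.order <- c("aff1ord", "aff2ord")

# rbind() stacks the vectors as rows; as.vector() then reads the matrix
# column by column, which interleaves the two sets of names.
toy.interleaved <- as.vector(rbind(toy.items, toy.order))
toy.interleaved
```

This only lines up correctly when both vectors have the same length and the same item order.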

Export data

I will export the data both as a CSV and as an R data file.

clean.data %>%
  write_csv(file.path("..", "data", "clean-data.csv"))

clean.data %>%
  saveRDS(file.path("..", "data", "clean-data.rds"))

And let’s take one more look at the data before we go:

structure_table <- function(data) {
  tibble(variable = names(data),
         type = map_chr(data,
                        ~ ifelse(is.factor(.),
                                 "factor",
                                 ifelse(is.POSIXct(.),
                                        "datetime",
                                        typeof(.)))),
         head = map_chr(data,
                        ~ paste(head(., 3),
                                collapse = ", ")))
}

structure_table %>%
  saveRDS("structure-table.rds")

clean.data %>%
  structure_table %>%
  formatAsTable %>%
  autofit

Output document:

options(knitr.duplicate.label = "allow")
rmarkdown::render("data-tidying.Rmd", output_dir = file.path("..", "github", "thesis"))