While I wait for all the data to come in, I’m going to prepare the syntax for tidying the data. I want to get it as close to ready to work with as possible.
I’m not going to deal with analytical judgment calls here. So I’m going to ignore missing data, outliers, and other data-quality concerns. I’m not going to center variables, calculate predicted values of latent variables, or produce new variables like “answered all checks correctly” or product terms.
What I am trying to do here is rationalize variable names, coalesce variables that were spread across multiple columns, dummy-code options in “select all that apply” questions, coerce variables to their appropriate data types, and arrange columns in a meaningful order.
Import packages.
library(magrittr)
library(lubridate)
library(tidyverse)
library(flextable)
Copy over table formatting.
# turn dataframe into html table
formatAsTable <- function(data) {
  data %>%
    mutate(across(where(is.double), ~ round(., 3))) %>%
    flextable %>%
    color(color = "white", part = "all")
}
Import data.
raw.data <- read_csv(
  file.path("..", "data", "adhd-disclosure-raw-data.csv")
)
Let’s look at the top-left just to get an idea of what we’re dealing with.
quickview <- function(data, c = 6) {
  data %>%
    head(4) %>%
    select(1:c) %>%
    formatAsTable %>%
    autofit
}
raw.data %>% quickview(3)
StartDate | EndDate | Status |
Start Date | End Date | Response Type |
{"ImportId":"startDate","timeZone":"America/Denver"} | {"ImportId":"endDate","timeZone":"America/Denver"} | {"ImportId":"status"} |
2021-05-24 09:26:24 | 2021-05-24 09:28:59 | IP Address |
2021-05-24 09:31:12 | 2021-05-24 09:32:47 | IP Address |
Clearly this is going to take some work.
First, it looks like there are three header rows. We only need one.
raw.data <- raw.data[-2:-1, ]
raw.data %>% quickview
StartDate | EndDate | Status | IPAddress | Progress | Duration (in seconds) |
2021-05-24 09:26:24 | 2021-05-24 09:28:59 | IP Address | 73.121.40.9 | 100 | 154 |
2021-05-24 09:31:12 | 2021-05-24 09:32:47 | IP Address | 71.115.145.137 | 100 | 95 |
2021-05-24 09:32:08 | 2021-05-24 09:34:16 | IP Address | 174.48.238.47 | 100 | 128 |
2021-05-24 09:37:42 | 2021-05-24 09:38:54 | IP Address | 75.129.124.127 | 100 | 72 |
That’s already much better.
This is a style choice, but I prefer my variables to be all lowercase with no spaces.
raw.data <- raw.data %>%
  rename_with(tolower) %>%
  rename(duration = `duration (in seconds)`)
The naming scheme for the variables got distorted somewhere along the way. I’m going to rationalize it. I’m going to make each variable the scale name, followed by survey block number, followed by item number.
varnames <- map(c("aff", "cog", "lik"),
                function(x) map(1:2,
                                function(y) map(1:4,
                                                function(z) paste(x, y, z, sep = "")))) %>%
  unlist
oldnames <- raw.data %>%
  select(starts_with(c("aff", "cog", "trust", "liking"))) %>%
  names
raw.data <- raw.data %>%
  rename_with(~ varnames[which(oldnames == .)],
              .cols = oldnames)
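To make the naming scheme concrete, the same 24 names can be built in base R. This is only an illustrative sketch, not the code used above; the object name scheme is made up for the example, and it assumes three scales, two blocks, and four items per block.

```r
# Scale name, then block number (1-2), then item number (1-4).
# `scheme` is a hypothetical name used only in this sketch.
scheme <- unlist(lapply(c("aff", "cog", "lik"), function(scale) {
  lapply(1:2, function(block) paste0(scale, block, 1:4))
}))

head(scheme, 8)
# "aff11" "aff12" "aff13" "aff14" "aff21" "aff22" "aff23" "aff24"
```

So aff21 reads as "affect scale, second block, first item".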
Disclosure and interdependence came in as dummy-coded character vectors. Honestly, that’s probably fine, but I feel like they should be logical.
raw.data <- raw.data %>%
  mutate(across(c(disclose, interdep), ~ as.logical(as.numeric(.))))
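The detour through as.numeric() is needed because as.logical() does not understand the strings "0" and "1". A small sketch on a made-up vector:

```r
# Hypothetical dummy-coded character vector
x <- c("0", "1", "1")

direct   <- as.logical(x)              # NA NA NA: "0"/"1" are not recognised
via_num  <- as.logical(as.numeric(x))  # FALSE TRUE TRUE
```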
And the manipulation checks need to be renamed.
raw.data <- raw.data %>%
  rename_with(~ c("intcheck", "discheck"),
              .cols = contains("checks"))
Qualtrics dumps a whole bunch of extra metadata variables in that we don’t need. Let’s take them out.
raw.data <- raw.data %>%
  relocate(duration, .after = startdate) %>%
  select(!enddate:userlanguage)
raw.data %>% quickview
startdate | duration | read | read_1 | read_2 | read_3 |
2021-05-24 09:26:24 | 154 | ||||
2021-05-24 09:31:12 | 95 | ||||
2021-05-24 09:32:08 | 128 | ||||
2021-05-24 09:37:42 | 72 |
There might also be some variables that have no data at all. Let’s take those out, too. While I’m at it, I’ll remove any rows that have no data at all (or, more precisely, only metadata and embedded data).
raw.data <- raw.data %>%
  select(-where(~ all(is.na(.)))) %>%
  filter(if_any(-c(interdep, disclose,
                   startdate, duration),
                ~ !is.na(.)))
Exactly 7 “observations” were dropped.
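Here is a minimal sketch of the same two steps on a made-up tibble, assuming dplyr 1.0.4 or later (for if_any()). The column names and values are invented for illustration.

```r
library(dplyr)

# Toy data: `empty` has no data at all, and row 2 has only metadata
toy <- tibble(
  startdate = c("a", "b", "c"),
  interdep  = c("1", "0", "1"),
  q1        = c("Yes", NA, NA),
  q2        = c(NA, NA, "No"),
  empty     = c(NA, NA, NA)
)

tidy <- toy %>%
  # drop columns that are entirely NA
  select(-where(~ all(is.na(.)))) %>%
  # keep rows with at least one non-missing value outside the metadata columns
  filter(if_any(-c(startdate, interdep), ~ !is.na(.)))

names(tidy)  # "startdate" "interdep" "q1" "q2"
nrow(tidy)   # 2
```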
I’m going to create a basic ID column that can be used as key values in joins.
raw.data <- raw.data %>%
  mutate(id = row_number())
There are four columns with the name read that resulted from the validation checks that followed the vignettes. None of them holds any real data, but I only added the validation check after 40 or so observations had already been collected, so whether they are filled in tells us which observations came after the change. So what I want to do is create a new column called validate that will be TRUE for any observations taken after the validation check was added. Then we can dump the read columns.
raw.data <- raw.data %>%
  mutate(validate = !is.na(read) | !is.na(read_1)) %>%
  select(-starts_with("read"))
Because of how I set up the survey, some of these variables need to be consolidated. For example, aff11 is actually the same as aff21, but was shown to the participant in the first, rather than the second, set. I’m going to create a new variable that will tell us when the item was displayed so we don’t lose that information when I coalesce the columns.
raw.data <- raw.data %>%
  mutate(across(matches("\\w{3}1\\d"),
                ~ ifelse(!is.na(.), 1L, 2L),
                .names = "{.col}ord")) %>%
  rename_with(~ str_remove(., "1"),
              contains("ord"))
Now I will consolidate the data by coalescing the columns.
first_sets <- varnames[unlist(map(seq(1, 24, 8), ~ seq(., . + 3)))]
raw.data <- first_sets %>%
  map_dfc(~ coalesce(raw.data[[.]],
                     raw.data[[str_replace(., "1", "2")]])) %>%
  set_colnames(first_sets) %>%
  mutate(id = row_number()) %>%
  inner_join(select(raw.data, !matches("\\w{3}\\d{2}")), "id") %>%
  rename_with(~ str_remove(., "1"),
              matches("\\w{3}1\\d"))
raw.data %>% quickview
aff1 | aff2 | aff3 | aff4 | cog1 | cog2 |
Neither agree nor disagree | Neither agree nor disagree | Somewhat agree | Neither agree nor disagree | Somewhat agree | Somewhat agree |
Somewhat agree | Somewhat agree | Somewhat agree | Strongly agree | Somewhat agree | Somewhat agree |
Somewhat agree | Neither agree nor disagree | Neither agree nor disagree | Neither agree nor disagree | Somewhat disagree | Somewhat disagree |
Strongly disagree | Strongly disagree | Somewhat disagree | Somewhat disagree | Neither agree nor disagree | Somewhat agree |
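The order-then-coalesce idea is easier to see on a single item. This is a simplified sketch with made-up responses, not the pipeline above: each respondent saw the item in exactly one block, so one of the two columns is always NA.

```r
library(dplyr)

# Toy data: the same item stored under block 1 (aff11) or block 2 (aff21)
toy <- tibble(
  aff11 = c("Agree", NA, "Disagree"),
  aff21 = c(NA, "Agree", NA)
)

toy <- toy %>%
  mutate(
    aff1ord = ifelse(!is.na(aff11), 1L, 2L),  # which block showed the item
    aff1    = coalesce(aff11, aff21)          # first non-missing value
  ) %>%
  select(aff1, aff1ord)

toy$aff1     # "Agree" "Agree" "Disagree"
toy$aff1ord  # 1 2 1
```

The order variable has to be computed before the columns are merged, because afterwards there is no way to tell which block a response came from.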
R has a feature called ordered factors that combines the functionality of numeric vectors and character vectors for use in ordinal data (like Likert-scale data). Factors will look like character vectors, but ordering functions like min() and arrange() will work.
This kind of data is specific to R and will be lost if exported to a CSV file. However, we can save it in an R data file (.rds).
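A short base R sketch of both points, using a made-up three-level scale and temporary files (the object names are only for illustration):

```r
# Hypothetical three-point scale, lowest to highest
scale_lvls <- c("Disagree", "Neutral", "Agree")
x <- ordered(c("Agree", "Disagree", "Neutral"), levels = scale_lvls)

lowest  <- min(x)        # Disagree
below   <- x < "Agree"   # FALSE TRUE TRUE
in_order <- sort(x)      # Disagree, Neutral, Agree

# Round trip: the ordering survives .rds but not .csv
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")
write.csv(data.frame(x = x), csv_path, row.names = FALSE)
saveRDS(data.frame(x = x), rds_path)

from_csv <- read.csv(csv_path)
from_rds <- readRDS(rds_path)

is.ordered(from_csv$x)  # FALSE
is.ordered(from_rds$x)  # TRUE
```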
The Likert-scale data is easy to factor-ize, so I’ll take care of that first.
lvls <- c("Strongly disagree",
          "Somewhat disagree",
          "Neither agree nor disagree",
          "Somewhat agree",
          "Strongly agree")
raw.data <- raw.data %>%
  mutate(across(c(aff1:lik4, intcheck:discheck),
                ~ ordered(., levels = lvls)))
Depending on how participants answer, education and politics could be ordinal variables. Someone has more or less education and can be more or less liberal. But if we get answers like “Other” or “Prefer not to answer”, they don’t really work as ordinal variables.
So here’s what I’m going to do. I will set these as ordinal variables, and if I decide to do any actual analyses based on that, I will drop the exceptions.
educat.lvls <- c("High school diploma",
                 "Associate's degree",
                 "Bachelor's degree",
                 "Master's degree",
                 "Doctoral degree",
                 "Other", "Prefer not to answer")
party.lvls <- c("Liberal", "Moderate", "Conservative",
                "Other", "Prefer not to answer")
raw.data <- raw.data %>%
  mutate(educat = ordered(educat, educat.lvls),
         party = ordered(party, party.lvls))
R also has unordered factors for categorical (i.e., nominal) data (like gender). I’ll take care of that now for gender and employment status.
raw.data <- raw.data %>%
  mutate(across(c(gender, employ), factor))
Ethnicity, working history, and ADHD relationships remain an issue because we allowed participants to “select all that apply.” To deal with these properly, we need a dummy variable for each option that was selected at least once.
# compile list of options selected at least once
opts <- function(data, var) {
  # every distinct option that appears in the comma-separated answers
  o <- data %>%
    select(all_of(var)) %>%
    map(~ str_split(., ",")) %>%
    unlist %>%
    unique
  # one logical column per option, matched on the option's first word
  o %>%
    map_dfc(~ str_detect(data[[var]], word(.))) %>%
    set_colnames(o) %>%
    rename_with(~ paste(var, tolower(word(.)), sep = "_")) %>%
    mutate(id = row_number()) %>%
    inner_join(data, by = "id")
}
raw.data <- raw.data %>%
  opts("ethnic") %>%
  opts("work") %>%
  opts("adhd") %>%
  rename_with(tolower) %>%
  rename("adhd_nobody" = "adhd_i")
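The core of the dummy-coding step can be sketched on a plain vector. This is a simplified illustration with made-up answers; unlike the function above, it matches the full option text literally (via fixed()) rather than only its first word.

```r
library(stringr)

# Toy "select all that apply" answers, comma-separated as Qualtrics exports them
adhd <- c("Family member,Acquaintance", "Friend", "Family member")

# Every option that was selected at least once
options <- unique(unlist(str_split(adhd, ",")))

# One logical column per option: TRUE where the answer contains that option
dummies <- sapply(options, function(o) str_detect(adhd, fixed(o)))

colnames(dummies)           # "Family member" "Acquaintance" "Friend"
dummies[, "Family member"]  # TRUE FALSE TRUE
```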
Although most of the data is categorical or ordinal, a few variables are properly continuous. For some reason they came in as character vectors rather than numeric vectors. This matters, because R modelling functions like lm() treat character vectors as categorical predictors rather than converting them to numbers. So I’ll convert the data manually up front.
raw.data <- raw.data %>%
  mutate(across(c(duration, age),
                as.integer))
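One thing worth checking here is how lm() handles a character predictor. As far as I know, it converts the column to a factor and fits one coefficient per distinct value, instead of a single slope. A small sketch on made-up data:

```r
# Toy data: age stored as character, as it arrived from the survey export
toy <- data.frame(y = c(1, 2, 3, 4),
                  age = c("20", "30", "40", "50"),
                  stringsAsFactors = FALSE)

fit_chr <- lm(y ~ age, data = toy)
n_chr <- length(coef(fit_chr))  # 4: intercept plus a dummy for each extra value

toy$age <- as.integer(toy$age)
fit_num <- lm(y ~ age, data = toy)
n_num <- length(coef(fit_num))  # 2: intercept plus one slope
```

Converting up front avoids silently fitting the categorical version.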
I saved the startdate variable, so I might as well prep that too. As with other programming languages, dates and times are treated a little differently from other numeric data. I will use the package lubridate to convert the dates.
raw.data <- raw.data %>%
  mutate(startdate = ymd_hms(startdate, tz = "America/Denver"))
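A quick sketch of what the conversion gives us, using the first timestamp from the data. It assumes the system has the usual time zone database installed.

```r
library(lubridate)

# Parse a character timestamp as local Denver time
x <- ymd_hms("2021-05-24 09:26:24", tz = "America/Denver")

is.POSIXct(x)            # TRUE
tz(x)                    # "America/Denver"
hour(x)                  # 9
hour(with_tz(x, "UTC"))  # 15: Denver is on daylight time (UTC-6) in May
```

Once the column is a proper datetime, components like the hour or weekday can be pulled out directly, and the time zone is carried along with the value.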
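At this point, we have all the variables, but they are in an order that doesn’t make a whole lot of sense. I’m going to sort them so that the ID comes first, followed by the experimental conditions, validation flag, and manipulation checks, then each scale item next to its display-order variable, then the demographics, and finally the “select all that apply” variables with their dummy columns: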
clean.data <- raw.data %>%
  relocate(id) %>%
  relocate(c(interdep, disclose, validate) | intcheck:discheck,
           .after = id) %>%
  relocate(as.vector(rbind(names(select(raw.data, aff1:lik4)),
                           names(select(raw.data, aff1ord:lik4ord)))),
           .after = discheck) %>%
  relocate(c(gender, age, employ, educat, educat_7_text, party),
           .after = lik4ord) %>%
  relocate(c(adhd | contains("adhd_"),
             work | contains("work_"),
             ethnic | contains("ethnic_")), .after = party)
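The as.vector(rbind(...)) call is what places each item next to its order variable. rbind() stacks the two name vectors as rows of a matrix, and as.vector() reads the matrix column by column, which interleaves them. A tiny sketch with made-up names:

```r
items  <- c("aff1", "aff2", "aff3")
orders <- c("aff1ord", "aff2ord", "aff3ord")

# 2 x 3 matrix read column-wise: item, order, item, order, ...
interleaved <- as.vector(rbind(items, orders))
interleaved
# "aff1" "aff1ord" "aff2" "aff2ord" "aff3" "aff3ord"
```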
I will export the data both as a CSV and as an R data file.
clean.data %>%
  write_csv(file.path("..", "data", "clean-data.csv"))
clean.data %>%
  saveRDS(file.path("..", "data", "clean-data.rds"))
And let’s take one more look at the data before we go:
structure_table <- function(data) {
  tibble(variable = names(data),
         type = map_chr(data,
                        ~ ifelse(is.factor(.),
                                 "factor",
                                 ifelse(is.POSIXct(.),
                                        "datetime",
                                        typeof(.)))),
         head = map_chr(data,
                        ~ paste(head(., 3),
                                collapse = ", ")))
}
structure_table %>%
  saveRDS("structure-table.rds")
clean.data %>%
  structure_table %>%
  formatAsTable %>%
  autofit
variable | type | head |
id | integer | 1, 2, 3 |
interdep | logical | FALSE, TRUE, TRUE |
disclose | logical | FALSE, TRUE, TRUE |
validate | logical | FALSE, FALSE, FALSE |
intcheck | factor | Neither agree nor disagree, Somewhat agree, Somewhat disagree |
discheck | factor | Strongly disagree, Strongly agree, Neither agree nor disagree |
aff1 | factor | Neither agree nor disagree, Somewhat agree, Somewhat agree |
aff1ord | integer | 2, 1, 2 |
aff2 | factor | Neither agree nor disagree, Somewhat agree, Neither agree nor disagree |
aff2ord | integer | 1, 1, 1 |
aff3 | factor | Somewhat agree, Somewhat agree, Neither agree nor disagree |
aff3ord | integer | 1, 2, 2 |
aff4 | factor | Neither agree nor disagree, Strongly agree, Neither agree nor disagree |
aff4ord | integer | 2, 2, 1 |
cog1 | factor | Somewhat agree, Somewhat agree, Somewhat disagree |
cog1ord | integer | 2, 1, 2 |
cog2 | factor | Somewhat agree, Somewhat agree, Somewhat disagree |
cog2ord | integer | 1, 2, 2 |
cog3 | factor | Neither agree nor disagree, Neither agree nor disagree, Neither agree nor disagree |
cog3ord | integer | 1, 2, 1 |
cog4 | factor | Somewhat agree, Neither agree nor disagree, Somewhat disagree |
cog4ord | integer | 2, 1, 1 |
lik1 | factor | Somewhat agree, Somewhat agree, Neither agree nor disagree |
lik1ord | integer | 2, 1, 1 |
lik2 | factor | Somewhat agree, Somewhat agree, Somewhat agree |
lik2ord | integer | 1, 2, 1 |
lik3 | factor | Somewhat disagree, Somewhat disagree, Somewhat disagree |
lik3ord | integer | 2, 2, 2 |
lik4 | factor | Somewhat disagree, Somewhat disagree, Neither agree nor disagree |
lik4ord | integer | 1, 1, 2 |
gender | factor | Male, Male, Male |
age | integer | 27, 30, 33 |
employ | factor | Full-time, Full-time, Full-time |
educat | factor | Bachelor's degree, Bachelor's degree, Bachelor's degree |
educat_7_text | character | NA, NA, NA |
party | factor | Moderate, Conservative, Conservative |
adhd | character | Family member,Acquaintance, I do not know anyone with ADHD, Prefer not to answer |
adhd_family | logical | TRUE, FALSE, FALSE |
adhd_acquaintance | logical | TRUE, FALSE, FALSE |
adhd_nobody | logical | FALSE, TRUE, FALSE |
adhd_prefer | logical | FALSE, FALSE, TRUE |
adhd_myself | logical | FALSE, FALSE, FALSE |
adhd_friend | logical | FALSE, FALSE, FALSE |
adhd_coworker | logical | FALSE, FALSE, FALSE |
adhd_classmate | logical | FALSE, FALSE, FALSE |
work | character | service, professional, professional |
work_service | logical | TRUE, FALSE, FALSE |
work_professional | logical | FALSE, TRUE, TRUE |
work_prefer | logical | FALSE, FALSE, FALSE |
work_manual | logical | FALSE, FALSE, FALSE |
work_other | logical | FALSE, FALSE, FALSE |
work_4_text | character | NA, NA, NA |
ethnic | character | White or Caucasian, White or Caucasian, White or Caucasian |
ethnic_white | logical | TRUE, TRUE, TRUE |
ethnic_hispanic | logical | FALSE, FALSE, FALSE |
ethnic_other | logical | FALSE, FALSE, FALSE |
ethnic_black | logical | FALSE, FALSE, FALSE |
ethnic_asian | logical | FALSE, FALSE, FALSE |
ethnic_prefer | logical | FALSE, FALSE, FALSE |
ethnic_5_text | character | NA, NA, NA |
startdate | datetime | 2021-05-24 09:26:24, 2021-05-24 09:31:12, 2021-05-24 09:32:08 |
duration | integer | 154, 95, 128 |
gender_3_text | character | NA, NA, NA |
party_4_text | character | NA, NA, NA |
employ_6_text | character | NA, NA, NA |
Output document:
options(knitr.duplicate.label = "allow")
rmarkdown::render("data-tidying.Rmd", output_dir = file.path("..", "github", "thesis"))