Funding Source: This work was supported by the National Science Foundation (SES1851690, https://www.nsf.gov/awardsearch/showAward?AWD_ID=1851690).

Introduction

This website provides a tutorial for fitting dynamic tree-based item response (IRTree) models as presented in Cho, Brown-Schmidt, De Boeck, and Shen (2020, https://link.springer.com/article/10.1007/s11336-020-09694-6) using R. In this tutorial we will show how to manage and recode your data, how to describe change processes (autocorrelation, partial autocorrelation, and trend) in your data, and how to fit the models.

Data Management

This section will show how to recode your data to include variables for node values (\(node\)) and recoded responses (\(yy\)).

First, read the file for your data (as a data frame) into R:

setwd("C:\\Users\\MyName\\Documents\\Code")
data <- read.table("data.txt",sep="\t",header=TRUE)

Here is what your data should (roughly) resemble:

##   trial item person time contrast privileged y
## 1   273    5      1  180      0.5        0.5 3
## 2   273    5      1  190      0.5        0.5 3
## 3   273    5      1  200      0.5        0.5 3
## 4   273    5      1  210      0.5        0.5 3
## 5   273    5      1  220      0.5        0.5 3
## 6   273    5      1  230      0.5        0.5 3
## 7   273    5      1  240      0.5        0.5 1
## 8   273    5      1  250      0.5        0.5 1

The only variable we need for data management is the response data (\(y\)).

Second, we’ll duplicate the data based on the number of nodes. Each duplicate will have that node’s value for the \(node\) variable. In this example, we will have 2 nodes, therefore, the final data frame \(data.node\) will be a data frame composed of 2 copies of the original data, with the first copy having \(node=1\), and the second copy having \(node=2\).

number.of.nodes <- 2
for (node in 1:number.of.nodes){
data.copy <- data
data.copy$node <- rep(node, nrow(data))
if (node==1){data.node <- data.copy} else {data.node <- rbind(data.node,data.copy)}
}

Line 1 in the above code sets the number of nodes (in this example 2), and the loop in lines 2-6 creates the new data frame with the inclusion of the \(node\) variable, first by copying the data for each node and then combining the copies into a single data frame. Note that the number of rows for \(data.node\) [\(nrow(data.node)\)] should be equal to \(number.of.nodes \times nrow(data)\).
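
As a quick sanity check, the duplication step can be run on a small toy data set (the \(y\) values below are made up for illustration):

```r
# Toy response data (hypothetical values), just to verify the step
data <- data.frame(y = c(3, 3, 1))
number.of.nodes <- 2
for (node in 1:number.of.nodes){
data.copy <- data
data.copy$node <- rep(node, nrow(data))
if (node==1){data.node <- data.copy} else {data.node <- rbind(data.node,data.copy)}
}
# 2 nodes x 3 rows = 6 rows; node is 1 for the first copy, 2 for the second
nrow(data.node)  # 6
```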

Third, the following table illustrates how the recoded responses \(yy\) are created given the values of \(node\) and \(y\) at each observation:

node | y = 1 | y = 2 | y = 3
-----|-------|-------|------
  1  |   1   |   1   |   0
  2  |   1   |   0   |  NA

The above table is translated into code as \(yy.list\) below:

yy.list <- list(list(1,1,0),list(1,0,NA))

Notice that the value for \(yy\) given a value for \(node\) and a value for \(y\) is indexed as \(yy.list[[node]][[y]]\). For example, if \(node=1\) and \(y=2\), then \(yy=yy.list[[1]][[2]]=1\), and if \(node=2\) and \(y=3\), then \(yy=yy.list[[2]][[3]]=NA\).
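
These lookups can be checked directly at the R console:

```r
yy.list <- list(list(1,1,0),list(1,0,NA))
yy.list[[1]][[2]]  # node = 1, y = 2, so yy = 1
yy.list[[2]][[3]]  # node = 2, y = 3, so yy = NA
```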

Fourth, the following adds \(yy\) as a variable to our data \(data.node\):

for (observation in 1:nrow(data.node)){
data.node$yy[observation] <- yy.list[[data.node$node[observation]]][[data.node$y[observation]]]
}
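
As an aside, the same recoding can be done without an explicit loop using mapply(); this is an equivalent alternative to the loop above, shown here on a few made-up rows:

```r
yy.list <- list(list(1,1,0),list(1,0,NA))
# A few toy (node, y) pairs for illustration (hypothetical values)
data.node <- data.frame(node = c(1, 1, 2, 2), y = c(3, 1, 3, 2))
# Look up yy for every (node, y) pair at once
data.node$yy <- mapply(function(node, y) yy.list[[node]][[y]],
data.node$node, data.node$y)
data.node$yy  # 0 1 NA 0
```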

Your \(data.node\) should now include values for \(node\) (ranging from 1 to the number of nodes) and \(yy\):

##   trial item person time contrast privileged y node yy
## 1   273    5      1  180      0.5        0.5 3    1  0
## 2   273    5      1  190      0.5        0.5 3    1  0
## 3   273    5      1  200      0.5        0.5 3    1  0
## 4   273    5      1  210      0.5        0.5 3    1  0
## 5   273    5      1  220      0.5        0.5 3    1  0
## 6   273    5      1  230      0.5        0.5 3    1  0
## 7   273    5      1  240      0.5        0.5 1    1  1
## 8   273    5      1  250      0.5        0.5 1    1  1

Data Description

This section will show how to calculate empirical logits (\(logit = log(\frac{proportion}{1 - proportion})\)) for your data and use them to explore change processes (autocorrelation, partial autocorrelation, and trend). We’ll then plot the autocorrelations, partial autocorrelations, and trend using the ggplot2 package (Wickham, 2016).
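
To build intuition for the formula: if, say, 25% of observations have \(yy = 1\) at a given time point, the empirical logit is:

```r
proportion <- 0.25
log(proportion/(1 - proportion))  # -1.098612
```

Negative logits correspond to proportions below 0.5, positive logits to proportions above 0.5, and \(proportion = 0.5\) gives a logit of 0.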

Calculation: Empirical logit

We’ll be using the data set we created in the data management step (\(data.node\)):

##   trial item person time contrast privileged y node yy
## 1   273    5      1  180      0.5        0.5 3    1  0
## 2   273    5      1  190      0.5        0.5 3    1  0
## 3   273    5      1  200      0.5        0.5 3    1  0
## 4   273    5      1  210      0.5        0.5 3    1  0
## 5   273    5      1  220      0.5        0.5 3    1  0
## 6   273    5      1  230      0.5        0.5 3    1  0
## 7   273    5      1  240      0.5        0.5 1    1  1
## 8   273    5      1  250      0.5        0.5 1    1  1

First, for calculating the empirical logit we’ll only need the following variables: \(trial\), \(person\), \(time\), \(node\), and \(yy\), so we’ll create a new data frame called \(data.el\) containing just these variables:

data.el <- data.node[,c(1,3,4,8,9)]
names(data.el) <- names(data.node)[c(1,3,4,8,9)]
head(data.el)
##   trial person time node yy
## 1   273      1  180    1  0
## 2   273      1  190    1  0
## 3   273      1  200    1  0
## 4   273      1  210    1  0
## 5   273      1  220    1  0
## 6   273      1  230    1  0

Line 1 above extracts the variables we need from \(data.node\), which is all of them except for \(item\), \(contrast\), and \(privileged\). Line 2 then sets the names of the variables in \(data.el\) to the names of the original variables in \(data.node\).

Second, we’ll create a matrix called \(Empirical.Logit\), with each row having a unique combination of \(trial\), \(time\), and \(node\) (note that we don’t include \(person\) here, because the empirical logit will be calculated across persons):

unique.trial <- sort(unique(data.el$trial))
unique.time <- sort(unique(data.el$time))
unique.node <- sort(unique(data.el$node))

number.variables <- 3

Empirical.Logit <- matrix(nrow=prod(length(unique.trial),length(unique.time),
length(unique.node)), ncol = (number.variables + 1))

Lines 1-3 above assign a variable to the unique values for \(trial\), \(time\), and \(node\) (respectively) in \(data.el\). Line 5 specifies that the number of variables that we’ll be calculating the empirical logit with is 3 (\(trial\), \(time\), and \(node\)). Line 7 creates the matrix \(Empirical.Logit\), in which we’ll insert every unique combination of \(trial\), \(time\), and \(node\), as well as the empirical logit calculated for that combination. Notice that \(Empirical.Logit\) has a number of rows equal to the product of the number of unique values each variable possesses, and a number of columns equal to the number of variables plus one (the additional column will be used to input the calculated empirical logits).
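
To see the sizing logic concretely, here is the same construction with small made-up counts:

```r
# Hypothetical counts: 5 trials, 10 time points, 2 nodes
n.trial <- 5; n.time <- 10; n.node <- 2
M <- matrix(nrow=prod(n.trial, n.time, n.node), ncol=(3 + 1))
dim(M)  # 100 rows (5 x 10 x 2), 4 columns (3 variables + 1 for the logit)
```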

Third, we’ll calculate the empirical logits and input them into the \(Empirical.Logit\) matrix:

row.index <- 1

for (trial in 1:length(unique.trial)){
for (time in 1:length(unique.time)){
for (node in 1:length(unique.node)){

data.el.trial.time.node <- data.el[(data.el$trial==unique.trial[trial] 
& data.el$time==unique.time[time] & data.el$node==unique.node[node]),]
yy.trial.time.node <- data.el.trial.time.node$yy
proportion <- sum(yy.trial.time.node[!is.na(yy.trial.time.node)])/length(yy.trial.time.node)
empirical.logit <- log(proportion/(1 - proportion))
Empirical.Logit[row.index,] <- c(unique.trial[trial], unique.time[time], 
unique.node[node], empirical.logit)
row.index <- row.index + 1
}}}

Empirical.Logit <- data.frame(Empirical.Logit)
names(Empirical.Logit) <- c("trial", "time", "node", "empirical.logit")
Empirical.Logit$empirical.logit[Empirical.Logit$empirical.logit < -10^6] <- -10^6
Empirical.Logit$empirical.logit[Empirical.Logit$empirical.logit > 10^6] <- 10^6

Line 1 in the code above sets the initial value for the \(row.index\) we’ll use to specify which row of \(Empirical.Logit\) we’re currently filling. Lines 3-5 indicate that we’re repeating the following process for each unique combination of \(trial\), \(time\), and \(node\). Lines 7-8 extract the rows of \(data.el\) that match the unique combination of \(trial\), \(time\), and \(node\) we’re calculating the empirical logit for, calling this variable \(data.el.trial.time.node\). Line 9 extracts the values for \(yy\) from \(data.el.trial.time.node\), which will be used to calculate the empirical logit. Lines 10-11 calculate the empirical logit using the formula \(empirical.logit = log(\frac{proportion}{1 - proportion})\), where \(proportion\) is the proportion of \(yy.trial.time.node\) equal to one (\(\frac{\sum yy}{length(yy)}\)). Lines 12-14 insert each combination and its empirical logit into the matrix and advance \(row.index\). Lines 17-18 turn \(Empirical.Logit\) into a data frame with the appropriate names for its variables. Lines 19-20 deal with any cases where \(empirical.logit = \pm \infty\), which can occur when \(proportion=0\) or \(proportion=1\). Because the autocorrelations and partial autocorrelations cannot be calculated with infinite empirical logits, we replace these values with finite values of very large magnitude (in this case, \(\pm 10^6\)).
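
The effect of lines 19-20 can be seen on a few example proportions (0, 0.5, and 1, chosen for illustration):

```r
proportion <- c(0, 0.5, 1)
el <- log(proportion/(1 - proportion))  # -Inf, 0, Inf
el[el < -10^6] <- -10^6
el[el > 10^6] <- 10^6
el  # -1e6, 0, 1e6
```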

Looking at \(Empirical.Logit\), we should now have the calculated values for empirical logit:

##   trial time node empirical.logit
## 1     1  180    1       -1.791759
## 2     1  180    2       -2.564949
## 3     1  190    1       -1.791759
## 4     1  190    2       -2.564949
## 5     1  200    1       -1.791759
## 6     1  200    2       -2.564949

Calculation: Autocorrelation & Partial Autocorrelations

In this section we’ll be calculating the autocorrelations and partial autocorrelations, using the empirical logits calculated above. Note that an autocorrelation will be calculated for each combination of \(trial\), \(node\), and \(time.lag\). First, choose what you want your maximum time lag to be for the autocorrelations and partial autocorrelations. For this example, we’ll use a maximum time lag of 20:

time.lag.max <- 20

Second, we’ll set up the matrix \(AC.PAC\) that will contain the autocorrelations (\(AC\)) and partial autocorrelations (\(PAC\)):

AC.PAC <- matrix(nrow=prod(length(unique.trial),length(unique.node),time.lag.max), 
ncol=(number.variables + 2))

Notice that \(AC.PAC\) has a number of rows equal to the number of unique combinations of \(trial\), \(node\), and \(time.lag\) (ranging from 1 to \(time.lag.max\)), and a number of columns equal to the number of variables from before, plus two. The additional two columns are for inputting autocorrelations and partial autocorrelations.

Third, we’ll calculate the autocorrelations and partial autocorrelations and input them into the \(AC.PAC\) matrix:

row.index <- 1

for (trial in 1:length(unique.trial)){
for (node in 1:length(unique.node)){

Empirical.Logit.trial.node <- Empirical.Logit[(Empirical.Logit$trial==unique.trial[trial] 
& Empirical.Logit$node==unique.node[node]),]
autocorrelations <- acf(Empirical.Logit.trial.node$empirical.logit, 
lag.max = time.lag.max, plot=FALSE)
autocorrelations <- autocorrelations$acf[2:(time.lag.max + 1)]
partial.autocorrelations <- pacf(Empirical.Logit.trial.node$empirical.logit, 
lag.max = time.lag.max, na.action=na.pass, plot=FALSE)
partial.autocorrelations <- c(partial.autocorrelations$acf)
for (time.lag in 1:time.lag.max){
AC.PAC[(row.index + time.lag - 1),] <- c(unique.trial[trial], unique.node[node], time.lag, 
autocorrelations[time.lag], partial.autocorrelations[time.lag])
}
row.index <- row.index + time.lag.max
}}

AC.PAC <- data.frame(AC.PAC)
names(AC.PAC) <- c("trial", "node", "time.lag", "autocorrelation", "partial.autocorrelation")

Line 1 above sets the initial value for \(row.index\), used to reference which rows of \(AC.PAC\) we’re calculating values for. Lines 3-4 indicate that we’re repeating the following process for each unique combination of \(trial\) and \(node\). Lines 6-7 extract the rows of \(Empirical.Logit\) that match the unique combination of \(trial\) and \(node\) we’re calculating the autocorrelation and partial autocorrelation for, calling this variable \(Empirical.Logit.trial.node\). Lines 8-13 calculate the autocorrelations and partial autocorrelations for each time lag, ranging from 1 to \(time.lag.max\). Notice in line 10 that the autocorrelations are extracted for indices 2 through (\(time.lag.max\) + 1); this is because the \(acf()\) function includes the result for \(time.lag = 0\), which we aren’t interested in for these purposes (\(pacf()\), in contrast, starts at \(time.lag = 1\)). Lines 14-17 insert the autocorrelation and partial autocorrelation for each combination of \(trial\), \(node\), and \(time.lag\) into the \(AC.PAC\) matrix. Lines 21-22 turn \(AC.PAC\) into a data frame with the appropriate names for its variables.
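
The lag-0 behavior of \(acf()\) mentioned above can be verified on any series; here we use arbitrary simulated noise:

```r
set.seed(1)
x <- rnorm(50)  # arbitrary series, for illustration only
a <- acf(x, lag.max=5, plot=FALSE)
a$acf[1]  # lag 0: always exactly 1
length(a$acf)  # lag.max + 1 = 6 values (lags 0 through 5)
```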

Looking at \(AC.PAC\), we should now have the calculated values for both autocorrelations and partial autocorrelations:

##   trial node time.lag autocorrelation partial.autocorrelation
## 1     1    1        1       0.9417591              0.94175909
## 2     1    1        2       0.8903962              0.03082523
## 3     1    1        3       0.8390333             -0.02378418
## 4     1    1        4       0.7876704             -0.02773465
## 5     1    1        5       0.7489045              0.08291772
## 6     1    1        6       0.7041814             -0.06349548

Plotting Figures: Autocorrelation & Partial Autocorrelations

In this step, we’ll use ggplot2 to plot figures for the autocorrelations and partial autocorrelations calculated previously.

First, set the \(time.lag\) variable in \(AC.PAC\) as a factor:

AC.PAC$time.lag <- as.factor(AC.PAC$time.lag)

Second, we will plot the results for one node at a time. For this example, we’ll plot the autocorrelations and partial autocorrelations for \(node=1\):

node.to.plot <- 1
AC.PAC.node <- AC.PAC[AC.PAC$node==node.to.plot,]

Change the value for \(node.to.plot\) above to the node you wish to plot; \(AC.PAC.node\) then contains only the rows of \(AC.PAC\) with \(node=1\).

Third, plot the figures for autocorrelation and partial autocorrelation using ggplot2:

library(ggplot2)
AC.plot <- ggplot(AC.PAC.node, aes(x=time.lag, y=autocorrelation)) +
geom_boxplot() +
labs(title = "Figure 1: Autocorrelation", x = "Time Lag", y = "Autocorrelation for Trials")

Calling up \(AC.plot\) gives the following: