Loading

Description

ShinyItemAnalysis provides analysis of educational tests (such as admission tests) and their items, including:

  • Exploration of total and standard scores on the Summary page.
  • Correlation structure and predictive validity analysis on the Validity page.
  • Item and distractor analysis on the Item analysis page.
  • Item analysis by logistic models on the Regression page.
  • Item analysis by item response theory models on the IRT models page.
  • Differential item functioning (DIF) and differential distractor functioning (DDF) methods on the DIF/Fairness page.

This application is based on the free statistical software R and its shiny package.

A download button is provided for all graphical outputs. Moreover, an HTML or PDF report can be created on the Reports page. Additionally, all application outputs are complemented by selected R code, hence a similar analysis can be run and modified directly in R.

Data

For demonstration purposes, the 20-item dataset GMAT from the difNLR R package is used by default. Four other datasets are available: GMAT2 and MSAT-B from the difNLR package, and Medical 100 and HCI from the ShinyItemAnalysis package. You can change the dataset (and try your own) on the Data page.

Availability

The application can be downloaded as an R package from CRAN. It is also available online at the Czech Academy of Sciences and at shinyapps.io.

Version

The current version of ShinyItemAnalysis available on CRAN is 1.2.7. The version available online is 1.2.7. The newest development version available on GitHub is 1.2.7.
See also older versions: 0.1.0, 0.2.0, 1.0.0, 1.1.0, 1.2.3, 1.2.6.

Authors and contributors

Jakub Houdek
Lubomir Stepanek

List of packages used

library(corrplot)
library(CTT)
library(data.table)
library(deltaPlotR)
library(DT)
library(difNLR)
library(difR)
library(ggplot2)
library(grid)
library(gridExtra)
library(knitr)
library(latticeExtra)
library(ltm)
library(mirt)
library(moments)
library(msm)
library(nnet)
library(plotly)
library(psych)
library(psychometric)
library(reshape2)
library(rmarkdown)
library(shiny)
library(shinyBS)
library(shinyjs)
library(stringr)
library(WrightMap)
library(xtable)

References

To cite the ShinyItemAnalysis package in publications, please use:

Martinkova P., Drabinova A., Leder O., & Houdek J. (2018). ShinyItemAnalysis: Test and item analysis via shiny. R package version 1.2.6. https://CRAN.R-project.org/package=ShinyItemAnalysis

Martinkova, P., Drabinova, A., & Houdek, J. (2017). ShinyItemAnalysis: Analyza prijimacich a jinych znalostnich ci psychologickych testu [ShinyItemAnalysis: Analyzing admission and other educational and psychological tests]. TESTFORUM, 6(9), 16-35. doi:10.5817/TF2017-9-129

Bug reports

If you discover a problem with this application, please contact the project maintainer at martinkova(at)cs.cas.cz or use GitHub.

Acknowledgments

This project was supported by the Czech Science Foundation grant GJ15-15856Y.

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU GPL 3 as published by the Free Software Foundation. This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.



Data

For demonstration purposes, the 20-item dataset GMAT from the difNLR R package is used. On this page, you may select one of five datasets offered by the difNLR and ShinyItemAnalysis packages, or you may upload your own dataset (see below). To return to the demonstration dataset, refresh this page in your browser (F5).

Training datasets

The default dataset GMAT (Martinkova et al., 2017) is a simulated dataset based on parameters of the real Graduate Management Admission Test (GMAT; Kingston et al., 1985). However, the first two items were simulated to function differently, in a uniform and a non-uniform way respectively. The dataset represents responses of 2,000 subjects (1,000 males, 1,000 females) to a multiple-choice test of 20 items. The distribution of total scores is the same for both groups; see Martinkova et al. (2017) for further discussion. GMAT also contains a simulated continuous criterion variable.

GMAT2 (Drabinova & Martinkova, 2017), also from the difNLR R package, is a simulated dataset based on parameters of the GMAT (Kingston et al., 1985). Again, the first two items were generated to function differently, in a uniform and a non-uniform way respectively. The dataset represents responses of 1,000 subjects (500 males, 500 females) to a multiple-choice test of 20 items.

MSAT-B (Drabinova & Martinkova, 2017) is a subset of a real Medical School Admission Test in Biology in the Czech Republic. The dataset represents responses of 1,407 subjects (484 males, 923 females) to a multiple-choice test of 20 items. The first item was previously detected as functioning differently. For more details of item selection, see Drabinova and Martinkova (2017). The dataset can be found in the difNLR R package.

Medical 100 is a real dataset of an admission test to a medical school from the ShinyItemAnalysis R package. The dataset represents responses of 2,392 subjects (750 males, 1,633 females, and 9 subjects without gender specification) to a multiple-choice test of 100 items. Medical 100 contains a criterion variable: an indicator of whether the student studies standardly or not.

HCI (McFarland et al., 2017) is a real dataset of the Homeostasis Concept Inventory from the ShinyItemAnalysis R package. The dataset represents responses of 651 subjects (405 males, 246 females) to a multiple-choice test of 20 items. HCI contains a criterion variable: an indicator of whether the student plans to major in the life sciences.
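
All five training datasets can also be loaded directly in R. The following minimal sketch assumes the dataset object names used by the difNLR and ShinyItemAnalysis packages (Medical 100 is stored as dataMedical); check the package manuals if your version differs.

library(difNLR)
library(ShinyItemAnalysis)

# training datasets from the difNLR package
data(GMAT)   # simulated GMAT data with group and criterion variables
data(GMAT2)  # simulated GMAT data
data(MSATB)  # subset of the medical school admission test in biology

# training datasets from the ShinyItemAnalysis package
data(dataMedical)  # Medical 100 admission test data
data(HCI)          # Homeostasis Concept Inventory data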



Upload your own datasets

The main data file should contain responses of individual respondents (rows) to given items (columns). The header may contain item names; no row names should be included. If responses are in unscored ABCD format, the key provides the correct response for each item. If responses are scored 0-1, the key is a vector of 1s.

The group variable is a 0-1 vector, where 0 represents the reference group and 1 represents the focal group. Its length needs to be the same as the number of individual respondents in the main dataset. If the group vector is not provided, it won't be possible to run DIF and DDF detection procedures on the DIF/Fairness page.

The criterion variable is either a discrete or continuous vector (e.g., future study success or future GPA in the case of admission tests) which should be predicted by the measurement. Again, its length needs to be the same as the number of individual respondents in the main dataset. If the criterion variable is not provided, it won't be possible to run validity analysis in the Predictive validity section on the Validity page.

Headers should be used consistently in all data files: either included in all of them, or in none. Columns of the dataset are by default renamed to Item followed by the column number. If you want to keep your own item names, check the Keep items names box below. Missing values in the scored dataset are by default scored as 0. If you want to keep them as missing, check the Keep missing values box below.
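
Before uploading, you may want to check in R that your files match the layout described above. A minimal sketch, assuming hypothetical CSV files responses.csv, key.csv, group.csv, and criterion.csv:

# responses of respondents (rows) to items (columns)
responses <- read.csv("responses.csv", header = TRUE)
# one correct answer per item (or a vector of 1s for 0-1 scored data)
key <- unlist(read.csv("key.csv", header = FALSE, stringsAsFactors = FALSE))
# 0-1 group membership (0 = reference, 1 = focal), one value per respondent
group <- scan("group.csv")
# discrete or continuous criterion, one value per respondent
criterion <- scan("criterion.csv")

# consistency checks mirroring the requirements above
stopifnot(length(key) == ncol(responses),
          length(group) == nrow(responses),
          length(criterion) == nrow(responses))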


Data specification



Data exploration

Here you can explore the uploaded dataset. Rendering of the tables can take some time.

Main dataset

Key (correct answers)

Scored test

Group vector

Criterion variable vector



Analysis of total scores

Summary table

The table below summarizes basic characteristics of total scores, including minimum and maximum, mean, median, standard deviation, skewness, and kurtosis. The kurtosis here is estimated by the sample kurtosis \(\frac{m_4}{m_2^2}\), where \(m_4\) is the fourth central moment and \(m_2\) is the second central moment. The skewness is estimated by the sample skewness \(\frac{m_3}{m_2^{3/2}}\), where \(m_3\) is the third central moment. The kurtosis for normally distributed scores is near the value of 3 and the skewness is near the value of 0.

Histogram of total score

For the selected cut-score, the blue part of the histogram shows respondents with a total score above the cut-score, the grey column shows respondents with a total score equal to the cut-score, and the red part of the histogram shows respondents below the cut-score.

Download figure

Selected R code

library(difNLR)
library(ggplot2)
library(moments)

# loading data
data(GMAT)
data <- GMAT[, 1:20]

# total score calculation
score <- apply(data, 1, sum)

# summary of total score 
c(min(score), max(score), mean(score), median(score), sd(score), skewness(score), kurtosis(score))

# colors by cut-score
cut <- median(score) # cut-score 
color <- c(rep("red", cut - min(score)), "gray", rep("blue", max(score) - cut))
df <- data.frame(score)

# histogram
ggplot(df, aes(score)) + 
  geom_histogram(binwidth = 1, fill = color, col = "black") + 
  xlab("Total score") + 
  ylab("Number of respondents") + 
  theme_bw() + 
  theme(legend.title = element_blank(), 
        axis.line  = element_line(colour = "black"), 
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(), 
        text = element_text(size = 14))

Standard scores

The total score, also known as the raw score, is the total number of correct answers. It can be used to compare an individual score to a norm group; e.g., if the mean is 12, an individual score can be compared to see if it is below or above this average.
The percentile indicates the value below which a percentage of observations falls; e.g., an individual score at the 80th percentile means that the individual score is the same as or higher than the scores of 80% of all respondents.
The success rate is the percentage of success; e.g., if the test maximum is 20 points and the individual score is 12, then the success rate is 12/20 = 0.6, i.e., 60%.
The Z-score, also called the standardized score, is a linear transformation of the total score with a mean of 0 and a variance of 1. If X is the total score, M its mean, and SD its standard deviation, then Z-score = (X - M) / SD.
The T-score is a transformed Z-score with a mean of 50 and a standard deviation of 10. If Z is the Z-score, then T-score = (Z * 10) + 50.

Table by score


Selected R code

library(difNLR) 

# loading data
data(GMAT) 
data <- GMAT[, 1:20] 

# scores calculations
score <- apply(data, 1, sum) # Total score 
tosc <- sort(unique(score)) # Levels of total score 
perc <- cumsum(prop.table(table(score))) # Percentiles 
sura <- 100 * (tosc / max(score)) # Success rate 
zsco <- sort(unique(scale(score))) # Z-score 
tsco <- 50 + 10 * zsco # T-score
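
The computed vectors can then be assembled into the table of standard scores displayed above. A minimal sketch, continuing the code above:

# table of standard scores, one row per achieved total score
tab <- data.frame("Total score" = tosc, 
                  "Percentile" = perc, 
                  "Success rate" = sura, 
                  "Z-score" = zsco, 
                  "T-score" = tsco, 
                  check.names = FALSE)
tab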

Correlation structure

Polychoric correlation heat map

The polychoric correlation heat map is a correlation plot which displays polychoric correlations of items. The size and shade of circles indicate how strongly the items are correlated (a larger and darker circle means a larger correlation). The color of circles indicates the direction of the correlation: blue shows positive correlation and red shows negative correlation.

The polychoric correlation heat map can be reordered using a hierarchical clustering method selected below. Ward's method aims at finding compact clusters by minimizing the within-cluster sum of squares. Ward's method n. 2 uses squared dissimilarities. The single linkage method connects clusters by their nearest neighbours, i.e., the distance between two clusters is calculated as the minimum of the distances between observations in one cluster and observations in the other cluster. Complete linkage uses the farthest neighbours, i.e., the maximum of the distances. The average linkage method uses a distance based on the weighted average of the individual distances. The McQuitty method uses the unweighted average. Median linkage calculates the distance as the median of the distances between an observation in one cluster and an observation in the other cluster. The centroid method uses the distance between the centroids of the clusters.

With a number of clusters larger than 1, rectangles representing the clusters are drawn.

Download figure

Scree plot

A scree plot displays the eigenvalues associated with a component or a factor in descending order, versus the number of the component or factor.

Download figure

Selected R code

library(corrplot) 
library(difNLR) 
library(ggplot2) 
library(psych)

# loading data
data(GMAT) 
data <- GMAT[, 1:20] 

# correlation heat map 
corP <- polychoric(data) # polychoric correlation calculation
corP$rho # correlation matrix 
corrplot(corP$rho) # correlation plot 
corrplot(corP$rho, order = "hclust", hclust.method = "ward.D", addrect = 3) # correlation plot with 3 clusters using Ward method

# scree plot 
ev <- eigen(corP$rho)$values # eigen values
df <- data.frame(comp = 1:length(ev), ev)

ggplot(df, aes(x = comp, y = ev)) + 
  geom_point() + 
  geom_line() + 
  ylab("Eigen value") + 
  xlab("Component number") +
  theme_bw() + 
  theme(legend.title = element_blank(), 
        axis.line  = element_line(colour = "black"), 
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(), 
        text = element_text(size = 14))

Criterion validity

This section requires a criterion variable (e.g., future study success or future GPA in the case of admission tests) which should correlate with the measurement. The criterion variable can be uploaded in the Data section.

Descriptive plots of criterion variable on total score

Total scores are plotted according to the criterion variable. A boxplot or scatterplot is displayed depending on the type of the criterion variable, i.e., whether it is discrete or continuous. The scatterplot is provided with a red linear regression line.

Download figure

Correlation of criterion variable and total score

The test for association between the total score and the criterion variable is based on Spearman's \(\rho\). This rank-based measure has been recommended if a bivariate normal distribution is not guaranteed. The null hypothesis is that the correlation is 0.

Selected R code

library(ShinyItemAnalysis) 
library(difNLR) 

# loading data
data(GMAT) 
data01 <- GMAT[, 1:20] 
# total score calculation
score <- apply(data01, 1, sum) 
# criterion variable
criterion <- GMAT[, "criterion"] 
# number of respondents in each criterion level
size <- as.factor(criterion)
levels(size) <- table(as.factor(criterion))
size <- as.numeric(paste(size))
df <- data.frame(score, criterion, size)

# descriptive plots 
### boxplot, for discrete criterion
ggplot(df, aes(y = score, x = as.factor(criterion), fill = as.factor(criterion))) +
  geom_boxplot() +
  geom_jitter(shape = 16, position = position_jitter(0.2)) +
  scale_fill_brewer(palette = "Blues") +
  xlab("Criterion group") +
  ylab("Total score") +
  coord_flip() +
  theme_bw() + 
  theme(legend.title = element_blank(), 
        axis.line  = element_line(colour = "black"), 
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(), 
        text = element_text(size = 14))

### scatterplot, for continuous criterion
ggplot(df, aes(x = score, y = criterion)) + 
  geom_point() + 
  ylab("Criterion variable") + 
  xlab("Total score") + 
  geom_smooth(method = lm,
              se = FALSE,
              color = "red") + 
  theme_bw() + 
  theme(legend.title = element_blank(), 
        axis.line  = element_line(colour = "black"), 
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(), 
        text = element_text(size = 14))

# correlation 
cor.test(criterion, score, method = "spearman", exact = F)

Criterion validity

This section requires a criterion variable (e.g., future study success or future GPA in the case of admission tests) which should correlate with the measurement. The criterion variable can be uploaded in the Data section. Here you can explore how the criterion correlates with individual items.

In distractor analysis based on the criterion variable, we are interested in how test takers select the correct answer and the distractors (wrong answers), with respect to groups based on the criterion variable.

Distractor plot

With the option Combinations, all item selection patterns are plotted (e.g., AB, ACD, BC). With the option Distractors, answers are split into distractors (e.g., A, B, C, D).

Download figure

Correlation of criterion variable and scored item

The test for association between the scored item and the criterion variable is based on Spearman's \(\rho\). This rank-based measure has been recommended if a bivariate normal distribution is not guaranteed. The null hypothesis is that the correlation is 0.

Selected R code

library(ShinyItemAnalysis) 
library(difNLR) 

# loading data
data("GMAT", "GMATtest", "GMATkey") 
data <- GMATtest[, 1:20] 
data01 <- GMAT[, 1:20] 
key <- GMATkey 
criterion <- GMAT[, "criterion"] 

# distractor plot for item 1 and 3 groups 
plotDistractorAnalysis(data, key, num.groups = 3, item = 1, matching = criterion) 

# correlation for item 1 
cor.test(criterion, data01[, 1], method = "spearman", exact = F)

Traditional item analysis

Traditional item analysis uses proportions of correct answers or correlations to estimate item properties.

Item difficulty/discrimination plot

Difficulty (red) and discrimination (blue) are displayed for all items. Items are ordered by difficulty.
The difficulty of an item is estimated as the percentage of respondents who answered the item correctly.
Discrimination is by default described by the difference of the percentage correct in the upper and lower third of respondents (Upper-Lower Index, ULI). By a rule of thumb, it should not be lower than 0.2 (borderline in the plot), except for very easy or very difficult items. Discrimination can be customized (see also Martinkova, Stepanek, et al. (2017)) by changing the number of groups and by changing which groups should be compared:


Download figure

Cronbach's alpha

Cronbach's alpha is an estimate of the reliability of a psychometric test. It is a function of the number of items in a test, the average covariance between item pairs, and the variance of the total score (Cronbach, 1951).
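
As a complement to the psych::alpha() call in the code below, alpha can also be computed directly from its definition. A minimal sketch:

library(difNLR)

# loading data
data(GMAT)
data <- GMAT[, 1:20]

# Cronbach's alpha = k/(k - 1) * (1 - sum of item variances / variance of total score)
k <- ncol(data)
item_vars <- apply(data, 2, var)
total_var <- var(apply(data, 1, sum))
k / (k - 1) * (1 - sum(item_vars) / total_var)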

Traditional item analysis table


Selected R code

library(difNLR) 
library(psych) 
library(psychometric) 
library(ShinyItemAnalysis) 

# loading data
data(GMAT) 
data <- GMAT[, 1:20] 

# difficulty and discrimination plot 
DDplot(data, k = 3, l = 1, u = 3) 

# Cronbach alpha 
psych::alpha(data) 

# traditional item analysis table 
tab <- round(data.frame(item.exam(data, discr = TRUE)[, c(4, 1, 5, 2, 3)], 
                        psych::alpha(data)$alpha.drop[, 1], 
                        gDiscrim(data, k = 3, l = 1, u = 3)), 2) 
colnames(tab) <- c("Difficulty", "SD", "Discrimination ULI", "Discrimination RIT", "Discrimination RIR", "Alpha Drop", "Customized Discrimination") 
tab

Distractor analysis

In distractor analysis, we are interested in how test takers select the correct answer and how the distractors (wrong answers) were able to function effectively by drawing the test takers away from the correct answer.

Distractors plot

With the option Combinations, all item selection patterns are plotted (e.g., AB, ACD, BC). With the option Distractors, answers are split into distractors (e.g., A, B, C, D).

Download figure

Table with counts

Table with proportions


Barplot of item response patterns

Download figure

Histogram of total scores

Download figure

Table of total scores by groups



Selected R code

library(difNLR) 
library(reshape2) 
library(ShinyItemAnalysis) 

# loading data
data(GMATtest) 
data <- GMATtest[, 1:20] 
data(GMATkey) 
key <- GMATkey 

# combinations - plot for item 1 and 3 groups 
plotDistractorAnalysis(data, key, num.groups = 3, item = 1, multiple.answers = T) 

# distractors - plot for item 1 and 3 groups 
plotDistractorAnalysis(data, key, num.groups = 3, item = 1, multiple.answers = F) 

# table with counts and margins - item 1 and 3 groups 
DA <- DistractorAnalysis(data, key, num.groups = 3)[[1]] 
dcast(as.data.frame(DA), response ~ score.level, sum, margins = T, value.var = "Freq") 

# table with proportions - item 1 and 3 groups 
DistractorAnalysis(data, key, num.groups = 3, p.table = T)[[1]]

Logistic regression on total scores

Various regression models may be fitted to describe item properties in more detail. Logistic regression can model the dependency of the probability of a correct answer on the total score by an S-shaped logistic curve. Parameter b0 describes the horizontal position of the fitted curve, parameter b1 describes its slope.


Plot with estimated logistic curve

Points represent the proportion of correct answers with respect to the total score. Their size is determined by the number of respondents who achieved a given level of the total score.

Download figure

Equation

$$\mathrm{P}(Y = 1|X, b_0, b_1) = \mathrm{E}(Y|X, b_0, b_1) = \frac{e^{\left( b_{0} + b_1 X\right)}}{1+e^{\left( b_{0} + b_1 X\right) }} $$

Table of parameters


Selected R code

library(difNLR) 
library(ggplot2)

# loading data
data(GMAT) 
data <- GMAT[, 1:20] 
score <- apply(data, 1, sum) # total score

# logistic model for item 1 
fit <- glm(data[, 1] ~ score, family = binomial) 

# coefficients 
coef(fit) 

# function for plot 
fun <- function(x, b0, b1){exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))} 

# empirical probabilities calculation
df <- data.frame(x = sort(unique(score)),
                 y = tapply(data[, 1], score, mean),
                 size = as.numeric(table(score)))

# plot of estimated curve
ggplot(df, aes(x = x, y = y)) +
  geom_point(aes(size = size),
             color = "darkblue",
             fill = "darkblue",
             shape = 21, alpha = 0.5) +
  stat_function(fun = fun, geom = "line",
                args = list(b0 = coef(fit)[1],
                            b1 = coef(fit)[2]),
                size = 1,
                color = "darkblue") +
  xlab("Total score") +
  ylab("Probability of correct answer") +
  ylim(0, 1) +
  ggtitle("Item 1")

Logistic regression on standardized total scores

Various regression models may be fitted to describe item properties in more detail. Logistic regression can model the dependency of the probability of a correct answer on the standardized total score (Z-score) by an S-shaped logistic curve. Parameter b0 describes the horizontal position of the fitted curve (difficulty), parameter b1 describes its slope at the inflection point (discrimination).


Plot with estimated logistic curve

Points represent the proportion of correct answers with respect to the standardized total score. Their size is determined by the number of respondents who achieved a given level of the standardized total score.

Download figure

Equation

$$\mathrm{P}(Y = 1|Z, b_0, b_1) = \mathrm{E}(Y|Z, b_0, b_1) = \frac{e^{\left( b_{0} + b_1 Z\right) }}{1+e^{\left( b_{0} + b_1 Z\right) }} $$

Table of parameters


Selected R code

library(difNLR) 
library(ggplot2)

# loading data
data(GMAT) 
data <- GMAT[, 1:20] 
zscore <- scale(apply(data, 1, sum)) # standardized total score

# logistic model for item 1 
fit <- glm(data[, 1] ~ zscore, family = binomial) 

# coefficients 
coef(fit) 

# function for plot 
fun <- function(x, b0, b1){exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))} 

# empirical probabilities calculation
df <- data.frame(x = sort(unique(zscore)),
                 y = tapply(data[, 1], zscore, mean),
                 size = as.numeric(table(zscore)))

# plot of estimated curve
ggplot(df, aes(x = x, y = y)) +
  geom_point(aes(size = size),
             color = "darkblue",
             fill = "darkblue",
             shape = 21, alpha = 0.5) +
  stat_function(fun = fun, geom = "line",
                args = list(b0 = coef(fit)[1],
                            b1 = coef(fit)[2]),
                size = 1,
                color = "darkblue") +
  xlab("Standardized total score") +
  ylab("Probability of correct answer") +
  ylim(0, 1) +
  ggtitle("Item 1")

Logistic regression on standardized total scores with IRT parameterization

Various regression models may be fitted to describe item properties in more detail. Logistic regression can model the dependency of the probability of a correct answer on the standardized total score (Z-score) by an S-shaped logistic curve. Note the change in parametrization: the IRT parametrization used here corresponds to the parametrization used in IRT models. Parameter b describes the horizontal position of the fitted curve (difficulty), parameter a describes its slope at the inflection point (discrimination).


Plot with estimated logistic curve

Points represent the proportion of correct answers with respect to the standardized total score. Their size is determined by the number of respondents who achieved a given level of the standardized total score.

Download figure

Equation

$$\mathrm{P}(Y = 1|Z, a, b) = \mathrm{E}(Y|Z, a, b) = \frac{e^{ a\left(Z - b\right) }}{1+e^{a\left(Z - b\right)}} $$

Table of parameters


Selected R code

library(difNLR) 
library(ggplot2)

# loading data
data(GMAT) 
data <- GMAT[, 1:20] 
zscore <- scale(apply(data, 1, sum)) # standardized total score

# logistic model for item 1 
fit <- glm(data[, 1] ~ zscore, family = binomial) 

# coefficients
coef <- c(a = coef(fit)[2], b = - coef(fit)[1] / coef(fit)[2]) 
coef  

# function for plot 
fun <- function(x, a, b){exp(a * (x - b)) / (1 + exp(a * (x - b)))} 

# empirical probabilities calculation
df <- data.frame(x = sort(unique(zscore)),
                 y = tapply(data[, 1], zscore, mean),
                 size = as.numeric(table(zscore)))

# plot of estimated curve
ggplot(df, aes(x = x, y = y)) +
  geom_point(aes(size = size),
             color = "darkblue",
             fill = "darkblue",
             shape = 21, alpha = 0.5) +
  stat_function(fun = fun, geom = "line",
                args = list(a = coef[1],
                            b = coef[2]),
                size = 1,
                color = "darkblue") +
  xlab("Standardized total score") +
  ylab("Probability of correct answer") +
  ylim(0, 1) +
  ggtitle("Item 1")

Nonlinear three parameter regression on standardized total scores with IRT parameterization

Various regression models may be fitted to describe item properties in more detail. Nonlinear regression can model the dependency of the probability of a correct answer on the standardized total score (Z-score) by an S-shaped logistic curve. The IRT parametrization used here corresponds to the parametrization used in IRT models. Parameter b describes the horizontal position of the fitted curve (difficulty), parameter a describes its slope at the inflection point (discrimination). This model also allows for a nonzero lower left asymptote c (pseudo-guessing parameter).


Plot with estimated nonlinear curve

Points represent the proportion of correct answers with respect to the standardized total score. Their size is determined by the number of respondents who achieved a given level of the standardized total score.

Download figure

Equation

$$\mathrm{P}(Y = 1|Z, a, b, c) = \mathrm{E}(Y|Z, a, b, c) = c + \left( 1-c \right) \cdot \frac{e^{a\left(Z-b\right) }}{1+e^{a\left(Z-b\right) }} $$

Table of parameters


Selected R code

library(difNLR) 
library(ggplot2)

# loading data
data(GMAT) 
data <- GMAT[, 1:20] 
zscore <- scale(apply(data, 1, sum)) # standardized total score

# NLR 3P model for item 1 
fun <- function(x, a, b, c){c + (1 - c) * exp(a * (x - b)) / (1 + exp(a * (x - b)))} 

fit <- nls(data[, 1] ~ fun(zscore, a, b, c), 
           algorithm = "port", 
           start = startNLR(data, GMAT[, "group"], model = "3PLcg", parameterization = "classic")[[1]][1:3],
           lower = c(-Inf, -Inf, 0),
           upper = c(Inf, Inf, 1)) 
# coefficients 
coef(fit) 

# empirical probabilities calculation
df <- data.frame(x = sort(unique(zscore)),
                 y = tapply(data[, 1], zscore, mean),
                 size = as.numeric(table(zscore)))

# plot of estimated curve
ggplot(df, aes(x = x, y = y)) +
  geom_point(aes(size = size),
             color = "darkblue",
             fill = "darkblue",
             shape = 21, alpha = 0.5) +
  stat_function(fun = fun, geom = "line",
                args = list(a = coef(fit)[1],
                            b = coef(fit)[2],
                            c = coef(fit)[3]),
                size = 1,
                color = "darkblue") +
  xlab("Standardized total score") +
  ylab("Probability of correct answer") +
  ylim(0, 1) +
  ggtitle("Item 1")

Nonlinear four parameter regression on standardized total scores with IRT parameterization

Various regression models may be fitted to describe item properties in more detail. Nonlinear four parameter regression can model the dependency of the probability of a correct answer on the standardized total score (Z-score) by an S-shaped logistic curve. The IRT parametrization used here corresponds to the parametrization used in IRT models. Parameter b describes the horizontal position of the fitted curve (difficulty), parameter a describes its slope at the inflection point (discrimination), the pseudo-guessing parameter c describes the lower asymptote, and the inattention parameter d describes the upper asymptote.


Plot with estimated nonlinear curve

Points represent the proportion of correct answers with respect to the standardized total score. Their size is determined by the number of respondents who achieved a given level of the standardized total score.

Download figure

Equation

$$\mathrm{P}(Y = 1|Z, a, b, c, d) = \mathrm{E}(Y|Z, a, b, c, d) = c + \left( d-c \right) \cdot \frac{e^{a\left(Z-b\right) }}{1+e^{a\left(Z-b\right) }} $$

Table of parameters


Selected R code

library(difNLR) 
library(ggplot2)

# loading data
data(GMAT) 
data <- GMAT[, 1:20] 
zscore <- scale(apply(data, 1, sum)) # standardized total score

# NLR 4P model for item 1 
fun <- function(x, a, b, c, d){c + (d - c) * exp(a * (x - b)) / (1 + exp(a * (x - b)))} 

fit <- nls(data[, 1] ~ fun(zscore, a, b, c, d), 
           algorithm = "port", 
           start = startNLR(data, GMAT[, "group"], model = "4PLcgdg", parameterization = "classic")[[1]][1:4],
           lower = c(-Inf, -Inf, 0, 0),
           upper = c(Inf, Inf, 1, 1)) 
# coefficients 
coef(fit) 

# empirical probabilities calculation
df <- data.frame(x = sort(unique(zscore)),
                 y = tapply(data[, 1], zscore, mean),
                 size = as.numeric(table(zscore)))

# plot of estimated curve
ggplot(df, aes(x = x, y = y)) +
  geom_point(aes(size = size),
             color = "darkblue",
             fill = "darkblue",
             shape = 21, alpha = 0.5) +
  stat_function(fun = fun, geom = "line",
                args = list(a = coef(fit)[1],
                            b = coef(fit)[2],
                            c = coef(fit)[3],
                            d = coef(fit)[4]),
                size = 1,
                color = "darkblue") +
  xlab("Standardized total score") +
  ylab("Probability of correct answer") +
  ylim(0, 1) +
  ggtitle("Item 1")

Logistic regression model selection

Here you can compare the classic 2PL logistic regression model to the non-linear models item by item, using the following information criteria:

  • AIC is the Akaike information criterion (Akaike, 1974),
  • BIC is the Bayesian information criterion (Schwarz, 1978).

Another approach for comparing nested models is the likelihood ratio chi-squared test. The significance level is set to 0.05. As tests are performed item by item, a multiple comparison correction method may be applied.

Table of comparison statistics

Rows BEST indicate which model has the lowest value of the criterion, or which is the largest model significant by the likelihood ratio test.


Selected R code

library(difNLR) 

# loading data
data(GMAT) 
Data <- GMAT[, 1:20] 
zscore <- scale(apply(Data, 1, sum)) # standardized total score

# function for fitting models
fun <- function(x, a, b, c, d){c + (d - c) * exp(a * (x - b)) / (1 + exp(a * (x - b)))} 

# starting values for item 1
start <- startNLR(Data, GMAT[, "group"], model = "4PLcgdg", parameterization = "classic")[[1]][1:4]

# 2PL model for item 1 
fit2PL <- nls(Data[, 1] ~ fun(zscore, a, b, c = 0, d = 1), 
              algorithm = "port", 
              start = start[1:2]) 
# NLR 3P model for item 1 
fit3PL <- nls(Data[, 1] ~ fun(zscore, a, b, c, d = 1), 
              algorithm = "port", 
              start = start[1:3],
              lower = c(-Inf, -Inf, 0), 
              upper = c(Inf, Inf, 1)) 
# NLR 4P model for item 1 
fit4PL <- nls(Data[, 1] ~ fun(zscore, a, b, c, d), 
              algorithm = "port", 
              start = start,
              lower = c(-Inf, -Inf, 0, 0), 
              upper = c(Inf, Inf, 1, 1)) 

# comparison 
### AIC
AIC(fit2PL); AIC(fit3PL); AIC(fit4PL) 
### BIC
BIC(fit2PL); BIC(fit3PL); BIC(fit4PL) 
### LR test, using Benjamini-Hochberg correction
###### 2PL vs NLR 3P
LRstat <- -2 * (logLik(fit2PL) - logLik(fit3PL)) # here for item 1 only; a vector when fitted for all items
LRdf <- 1 
LRpval <- 1 - pchisq(LRstat, LRdf) 
LRpval <- p.adjust(LRpval, method = "BH") 
###### NLR 3P vs NLR 4P
LRstat <- -2 * (logLik(fit3PL) - logLik(fit4PL)) # here for item 1 only; a vector when fitted for all items
LRdf <- 1 
LRpval <- 1 - pchisq(LRstat, LRdf) 
LRpval <- p.adjust(LRpval, method = "BH")

Multinomial regression on standardized total scores

Various regression models may be fitted to describe item properties in more detail. Multinomial regression allows for simultaneous modelling of the probability of choosing given distractors depending on the standardized total score (Z-score).


Plot with estimated curves of multinomial regression

Points represent the proportion of respondents selecting a given option with respect to the standardized total score. Their size is determined by the number of respondents who achieved a given level of the standardized total score and selected the given option.

Download figure

Equation

Table of parameters

Interpretation:

Selected R code

library(difNLR) 
library(nnet) 

# loading data
data(GMAT, GMATtest, GMATkey) 
zscore <- scale(apply(GMAT[, 1:20] , 1, sum)) # standardized total score
data <- GMATtest[, 1:20] 
key <- GMATkey

# multinomial model for item 1 
fit <- multinom(relevel(data[, 1], ref = paste(key[1])) ~ zscore) 

# coefficients 
coef(fit)
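
The fitted model can also be used to draw the estimated category curves described above. A minimal sketch continuing the code above (the grid of Z-scores and the base-graphics plot are illustrative, not the application's own code):

# estimated category probabilities across the ability range
zscore_grid <- seq(-3, 3, 0.1)
probs <- predict(fit, newdata = data.frame(zscore = zscore_grid), type = "probs")

# one curve per response option (columns of probs)
matplot(zscore_grid, probs, type = "l", lty = 1,
        xlab = "Standardized total score", ylab = "Probability of answer")
legend("right", colnames(probs), col = 1:ncol(probs), lty = 1)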

Rasch model

Item Response Theory (IRT) models are mixed-effect regression models in which respondent ability (theta) is assumed to be a random effect and is estimated together with item parameters. Ability (theta) is often assumed to follow a normal distribution.

In the Rasch model (Rasch, 1960), all items are assumed to have the same slope at the inflection point, i.e., the same discrimination parameter a, which is fixed to a value of 1. Items may differ in the location of their inflection point, i.e., in the difficulty parameter b.

Equation

$$\mathrm{P}\left(Y_{ij} = 1\vert \theta_{i}, b_{j} \right) = \frac{e^{\left(\theta_{i}-b_{j}\right) }}{1+e^{\left(\theta_{i}-b_{j}\right) }} $$

Item characteristic curves

Download figure

Item information curves

Download figure

Test information function

Download figure

Table of parameters with item fit statistics

Estimates of parameters are complemented by the SX2 item fit statistic (Ames & Penfield, 2015). SX2 is computed only when no missing data are present; if your data contain missing values, consider using an imputed dataset.

Scatter plot of factor scores and standardized total scores

Download figure

Wright map

The Wright map (Wilson, 2005; Wright & Stone, 1979), also called an item-person map, is a graphical tool for displaying person ability estimates and item parameters. The person side (left) shows a histogram of the estimated abilities of respondents. The item side (right) displays estimates of the difficulty parameters of individual items.

Download figure

Selected R code

library(difNLR)
library(mirt)
library(WrightMap)
data(GMAT)
data <- GMAT[, 1:20]

# Model
fit <- mirt(data, model = 1, itemtype = "Rasch", SE = T)
# Item Characteristic Curves
plot(fit, type = "trace", facet_items = F)
# Item Information Curves
plot(fit, type = "infotrace", facet_items = F)
# Test Information Function
plot(fit, type = "infoSE")
# Coefficients
coef(fit, simplify = TRUE)
coef(fit, IRTpars = TRUE, simplify = TRUE)
# Item fit statistics
itemfit(fit)
# Factor scores vs Standardized total scores
fs <- as.vector(fscores(fit))
sts <- as.vector(scale(apply(data, 1, sum)))
plot(fs ~ sts)

# Wright Map
b <- -sapply(1:ncol(data), function(i) coef(fit)[[i]][, "d"]) # difficulty b = -d in mirt's intercept parameterization
wrightMap(fs, b, item.side = itemClassic)

One parameter Item Response Theory model

Item Response Theory (IRT) models are mixed-effect regression models in which respondent ability (theta) is assumed to be a random effect and is estimated together with item parameters. Ability (theta) is often assumed to follow a normal distribution.

In the 1PL IRT model, all items are assumed to have the same slope at the inflection point, i.e., the same discrimination a. Items can differ in the location of their inflection point, i.e., in the item difficulty parameters b.

Equation

$$\mathrm{P}\left(Y_{ij} = 1\vert \theta_{i}, a, b_{j} \right) = \frac{e^{a\left(\theta_{i}-b_{j}\right) }}{1+e^{a\left(\theta_{i}-b_{j}\right) }} $$

Item characteristic curves

Download figure

Item information curves

Download figure

Test information function

Download figure

Table of parameters with item fit statistics

Estimates of parameters are complemented by the SX2 item fit statistic (Ames & Penfield, 2015). SX2 is computed only when no missing data are present; if your data contain missing values, consider using an imputed dataset.

Scatter plot of factor scores and standardized total scores

Download figure

Wright map

The Wright map (Wilson, 2005; Wright & Stone, 1979), also called an item-person map, is a graphical tool for displaying person ability estimates and item parameters. The person side (left) shows a histogram of the estimated abilities of respondents. The item side (right) displays estimates of the difficulty parameters of individual items.

Download figure

Selected R code

library(difNLR)
library(mirt)
library(WrightMap)
data(GMAT)
data <- GMAT[, 1:20]

# Model
fit <- mirt(data, model = 1, itemtype = "2PL", constrain = list((1:ncol(data)) + seq(0, (ncol(data) - 1)*3, 3)), SE = T)
# Item Characteristic Curves
plot(fit, type = "trace", facet_items = F)
# Item Information Curves
plot(fit, type = "infotrace", facet_items = F)
# Test Information Function
plot(fit, type = "infoSE")
# Coefficients
coef(fit, simplify = TRUE)
coef(fit, IRTpars = TRUE, simplify = TRUE)
# Item fit statistics
itemfit(fit)
# Factor scores vs Standardized total scores
fs <- as.vector(fscores(fit))
sts <- as.vector(scale(apply(data, 1, sum)))
plot(fs ~ sts)

# Wright Map
b <- sapply(1:ncol(data), function(i) -coef(fit)[[i]][, "d"] / coef(fit)[[i]][, "a1"]) # difficulty b = -d/a in mirt's intercept parameterization
wrightMap(fs, b, item.side = itemClassic)


# You can also use ltm library for IRT models
library(difNLR)
library(ltm)
data(GMAT)
data <- GMAT[, 1:20]

# Model
fit <- rasch(data)
# for Rasch model use
# fit <- rasch(data, constraint = cbind(ncol(data) + 1, 1))
# Item Characteristic Curves
plot(fit)
# Item Information Curves
plot(fit, type = "IIC")
# Test Information Function
plot(fit, items = 0, type = "IIC")
# Coefficients
coef(fit)
# Factor scores vs Standardized total scores
df1 <- ltm::factor.scores(fit, return.MIvalues = T)$score.dat
FS <- as.vector(df1[, "z1"])
df2 <- df1
df2$Obs <- df2$Exp <- df2$z1 <- df2$se.z1 <- NULL
STS <- as.vector(scale(apply(df2, 1, sum)))
df <- data.frame(FS, STS)
plot(FS ~ STS, data = df, xlab = "Standardized total score", ylab = "Factor score")

Two parameter Item Response Theory model

Item Response Theory (IRT) models are mixed-effect regression models in which respondent ability (theta) is assumed to be a random effect and is estimated together with item parameters. Ability (theta) is often assumed to follow a normal distribution.

The 2PL IRT model allows for different slopes at the inflection point, i.e., different discrimination parameters a. Items can also differ in the location of their inflection point, i.e., in the item difficulty parameters b.

Equation

$$\mathrm{P}\left(Y_{ij} = 1\vert \theta_{i}, a_{j}, b_{j}\right) = \frac{e^{a_{j}\left(\theta_{i}-b_{j}\right) }}{1+e^{a_{j}\left(\theta_{i}-b_{j}\right) }} $$

Item characteristic curves

Download figure

Item information curves

Download figure

Test information function

Download figure

Table of parameters with item fit statistics

Estimates of parameters are complemented by the SX2 item fit statistic (Ames & Penfield, 2015). SX2 is computed only when no missing data are present; if your data contain missing values, consider using an imputed dataset.

Scatter plot of factor scores and standardized total scores

Download figure

Selected R code

library(difNLR)
library(mirt)
data(GMAT)
data <- GMAT[, 1:20]

# Model
fit <- mirt(data, model = 1, itemtype = "2PL", SE = T)
# Item Characteristic Curves
plot(fit, type = "trace", facet_items = F)
# Item Information Curves
plot(fit, type = "infotrace", facet_items = F)
# Test Information Function
plot(fit, type = "infoSE")
# Coefficients
coef(fit, simplify = TRUE)
coef(fit, IRTpars = TRUE, simplify = TRUE)
# Item fit statistics
itemfit(fit)
# Factor scores vs Standardized total scores
fs <- as.vector(fscores(fit))
sts <- as.vector(scale(apply(data, 1, sum)))
plot(fs ~ sts)


# You can also use ltm library for IRT models
library(difNLR)
library(ltm)
data(GMAT)
data <- GMAT[, 1:20]

# Model
fit <- ltm(data ~ z1, IRT.param = TRUE)
# Item Characteristic Curves
plot(fit)
# Item Information Curves
plot(fit, type = "IIC")
# Test Information Function
plot(fit, items = 0, type = "IIC")
# Coefficients
coef(fit)
# Factor scores vs Standardized total scores
df1 <- ltm::factor.scores(fit, return.MIvalues = T)$score.dat
FS <- as.vector(df1[, "z1"])
df2 <- df1
df2$Obs <- df2$Exp <- df2$z1 <- df2$se.z1 <- NULL
STS <- as.vector(scale(apply(df2, 1, sum)))
df <- data.frame(FS, STS)
plot(FS ~ STS, data = df, xlab = "Standardized total score", ylab = "Factor score")

Three parameter Item Response Theory model

Item Response Theory (IRT) models are mixed-effect regression models in which respondent ability (theta) is assumed to be a random effect and is estimated together with item parameters. Ability (theta) is often assumed to follow a normal distribution.

The 3PL IRT model allows for different discriminations of items a, different item difficulties b, and also for a nonzero left asymptote, the pseudo-guessing parameter c.

Equation

$$\mathrm{P}\left(Y_{ij} = 1\vert \theta_{i}, a_{j}, b_{j}, c_{j} \right) = c_{j} + \left(1 - c_{j}\right) \cdot \frac{e^{a_{j}\left(\theta_{i}-b_{j}\right) }}{1+e^{a_{j}\left(\theta_{i}-b_{j}\right) }} $$

Item characteristic curves

Download figure

Item information curves

Download figure

Test information function

Download figure

Table of parameters with item fit statistics

Estimates of parameters are complemented by the SX2 item fit statistic (Ames & Penfield, 2015). SX2 is computed only when no missing data are present; if your data contain missing values, consider using an imputed dataset.

Scatter plot of factor scores and standardized total scores

Download figure

Selected R code

library(difNLR)
library(mirt)
data(GMAT)
data <- GMAT[, 1:20]

# Model
fit <- mirt(data, model = 1, itemtype = "3PL", SE = T)
# Item Characteristic Curves
plot(fit, type = "trace", facet_items = F)
# Item Information Curves
plot(fit, type = "infotrace", facet_items = F)
# Test Information Function
plot(fit, type = "infoSE")
# Coefficients
coef(fit, simplify = TRUE)
coef(fit, IRTpars = TRUE, simplify = TRUE)
# Item fit statistics
itemfit(fit)
# Factor scores vs Standardized total scores
fs <- as.vector(fscores(fit))
sts <- as.vector(scale(apply(data, 1, sum)))
plot(fs ~ sts)

# You can also use ltm library for IRT models
library(difNLR)
library(ltm)
data(GMAT)
data <- GMAT[, 1:20]

# Model
fit <- tpm(data, IRT.param = TRUE)
# Item Characteristic Curves
plot(fit)
# Item Information Curves
plot(fit, type = "IIC")
# Test Information Function
plot(fit, items = 0, type = "IIC")
# Coefficients
coef(fit)
# Factor scores vs Standardized total scores
df1 <- ltm::factor.scores(fit, return.MIvalues = T)$score.dat
FS <- as.vector(df1[, "z1"])
df2 <- df1
df2$Obs <- df2$Exp <- df2$z1 <- df2$se.z1 <- NULL
STS <- as.vector(scale(apply(df2, 1, sum)))
df <- data.frame(FS, STS)
plot(FS ~ STS, data = df, xlab = "Standardized total score", ylab = "Factor score")

Four parameter Item Response Theory model

Item Response Theory (IRT) models are mixed-effect regression models in which respondent ability (theta) is assumed to be a random effect and is estimated together with item parameters. Ability (theta) is often assumed to follow a normal distribution.

The 4PL IRT model allows for different discriminations of items a, different item difficulties b, a nonzero left asymptote, i.e., the pseudo-guessing parameter c, and also an upper asymptote lower than one, i.e., the inattention parameter d.

Equation

$$\mathrm{P}\left(Y_{ij} = 1\vert \theta_{i}, a_{j}, b_{j}, c_{j}, d_{j} \right) = c_{j} + \left(d_{j} - c_{j}\right) \cdot \frac{e^{a_{j}\left(\theta_{i}-b_{j}\right) }}{1+e^{a_{j}\left(\theta_{i}-b_{j}\right) }} $$

Item characteristic curves

Download figure

Item information curves

Download figure

Test information function

Download figure

Table of parameters with item fit statistics

Estimates of parameters are complemented by the SX2 item fit statistic (Ames & Penfield, 2015). SX2 is computed only when no missing data are present; if your data contain missing values, consider using an imputed dataset.

Scatter plot of factor scores and standardized total scores

Download figure

Selected R code

library(difNLR)
library(mirt)
data(GMAT)
data <- GMAT[, 1:20]

# Model
fit <- mirt(data, model = 1, itemtype = "4PL", SE = T)
# Item Characteristic Curves
plot(fit, type = "trace", facet_items = F)
# Item Information Curves
plot(fit, type = "infotrace", facet_items = F)
# Test Information Function
plot(fit, type = "infoSE")
# Coefficients
coef(fit, simplify = TRUE)
coef(fit, IRTpars = TRUE, simplify = TRUE)
# Item fit statistics
itemfit(fit)
# Factor scores vs Standardized total scores
fs <- as.vector(fscores(fit))
sts <- as.vector(scale(apply(data, 1, sum)))
plot(fs ~ sts)

Item Response Theory model selection

Item Response Theory (IRT) models are mixed-effect regression models in which respondent ability (theta) is assumed to be a random effect and is estimated together with item parameters. Ability (theta) is often assumed to follow a normal distribution.

IRT models can be compared by several information criteria:

  • AIC is the Akaike information criterion (Akaike, 1974),
  • AICc is AIC with a correction for finite sample size,
  • BIC is the Bayesian information criterion (Schwarz, 1978),
  • SABIC is the sample-size adjusted BIC criterion.

Another approach for comparing IRT models is the likelihood ratio chi-squared test. The significance level is set to 0.05.

Table of comparison statistics

The row BEST indicates which model has the lowest value of the criterion, or which is the largest model significant by the likelihood ratio test.


Selected R code

library(difNLR)
library(mirt)
data(GMAT)
data <- GMAT[, 1:20]

# 1PL IRT model
s <- paste("F = 1-", ncol(data), "\nCONSTRAIN = (1-", ncol(data), ", a1)")
model <- mirt.model(s)
fit1PL <- mirt(data, model = model, itemtype = "2PL")
# 2PL IRT model
fit2PL <- mirt(data, model = 1, itemtype = "2PL")
# 3PL IRT model
fit3PL <- mirt(data, model = 1, itemtype = "3PL")
# 4PL IRT model
fit4PL <- mirt(data, model = 1, itemtype = "4PL")

# Comparison
anova(fit1PL, fit2PL)
anova(fit2PL, fit3PL)
anova(fit3PL, fit4PL)

Bock's nominal Item Response Theory model

The nominal response model (NRM) was introduced by Bock (1972) as a way to model responses to items with two or more nominal categories. This model is suitable for multiple-choice items with no particular ordering of distractors. It is also a generalization of some models for ordinal data, e.g., the generalized partial credit model (GPCM) or its restricted versions, the partial credit model (PCM) and the rating scale model (RSM).

Equation

For K possible response categories, the probability of choosing category k for person i with latent trait \(\theta_i\) on item j is given by the following equation: $$\mathrm{P}(Y_{ij} = k \vert \theta_i, a_{j1}, ak_{j(l-1)}, d_{j(l-1)}, l = 1, \dots, K) = \frac{e^{ak_{j(k-1)} \cdot a_{j1} \cdot \theta_i + d_{j(k-1)}}}{\sum_{l = 1}^{K} e^{ak_{j(l-1)} \cdot a_{j1} \cdot \theta_i + d_{j(l-1)}}}$$

Item characteristic curves

Download figure

Item information curves

Download figure

Test information function

Download figure

Table of parameters

Scatter plot of factor scores and standardized total scores

Download figure

Selected R code

library(difNLR)
library(mirt)
data(GMAT)
data <- GMAT[, 1:20]

# Model
fit <- mirt(data, model = 1, itemtype = "nominal")
# Item Characteristic Curves
plot(fit, type = "trace", facet_items = F)
# Item Information Curves
plot(fit, type = "infotrace", facet_items = F)
# Test Information Function
plot(fit, type = "infoSE")
# Coefficients
coef(fit, simplify = TRUE)
coef(fit, IRTpars = TRUE, simplify = TRUE)
# Factor scores vs Standardized total scores
fs <- as.vector(fscores(fit))
sts <- as.vector(scale(apply(data, 1, sum)))
plot(fs ~ sts)

Dichotomous models

Dichotomous models are used for modelling items producing a simple binary response (i.e., true/false). The most complex unidimensional dichotomous IRT model described here is the 4PL IRT model. The Rasch model (Rasch, 1960) assumes discrimination fixed to \(a = 1\), guessing fixed to \(c = 0\), and inattention fixed to \(d = 1\). Similarly, other restricted models (1PL, 2PL, and 3PL models) can be obtained by fixing appropriate parameters of the 4PL model.

In this section, you can explore the behavior of the item characteristic curves \(\mathrm{P}\left(\theta\right)\) and item information functions \(\mathrm{I}\left(\theta\right)\) of two items under the 4PL IRT model.

Parameters

Select parameters \(a\) (discrimination), \(b\) (difficulty), \(c\) (guessing), and \(d\) (inattention). By constraining \(a = 1\), \(c = 0\), \(d = 1\) you get the Rasch model. With \(c = 0\) and \(d = 1\) you get the 2PL model, and with \(d = 1\) only, the 3PL model.

When different curve parameters describe properties of the same item for different groups of respondents, this phenomenon is called Differential Item Functioning (DIF). See the DIF/Fairness section for more information.

Select also the value of the latent ability \(\theta\) to see the interpretation of the item characteristic curves.

Equations

$$\mathrm{P}\left(\theta \vert a, b, c, d \right) = c + \left(d - c\right) \cdot \frac{e^{a\left(\theta-b\right) }}{1+e^{a\left(\theta-b\right) }} $$ $$\mathrm{I}\left(\theta \vert a, b, c, d \right) = a^2 \cdot \left(d - c\right) \cdot \frac{e^{a\left(\theta-b\right) }}{\left[1+e^{a\left(\theta-b\right)}\right]^2} $$


Exercise 1

Consider the following 2PL items with parameters
Item 1: \(a = 2.5, b = -0.5\)
Item 2: \(a = 1.5, b = 0\)
For these items, fill in the following exercises with an accuracy of up to 0.05, then click on the Submit answers button. If you need a hint, click on the blue button with the question mark. A short R snippet for checking your calculations is provided below this exercise.

  • Sketch item characteristic and information curves.
  • Calculate probability of correct answer for latent abilities \(\theta = -2, -1, 0, 1, 2\).
    Item 1:
    Item 2:
  • For what level of ability \(\theta\) are the probabilities equal?
  • Which item provides more information for weak (\(\theta = -2\)), average (\(\theta = 0\)) and strong (\(\theta = 2\)) students?
    \(\theta = -2\)
    \(\theta = 0\)
    \(\theta = 2\)
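
A minimal sketch for checking your calculations in R (the function below is the 2PL special case of the equations above):

# 2PL item characteristic function
icc <- function(theta, a, b){ 1 / (1 + exp(-a * (theta - b))) }

# probabilities of correct answer at selected ability levels
theta <- c(-2, -1, 0, 1, 2)
round(icc(theta, a = 2.5, b = -0.5), 2)  # Item 1
round(icc(theta, a = 1.5, b = 0), 2)     # Item 2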

Exercise 2

Consider now two items with the following parameters
Item 1: \(a = 1.5, b = 0, c = 0, d = 1\)
Item 2: \(a = 1.5, b = 0, c = 0.2, d = 1\)
For these items, fill in the following exercises with an accuracy of up to 0.05, then click on the Submit answers button.

  • What is the lower asymptote for items?
    Item 1:
    Item 2:
  • What is the probability of correct answer for latent ability \(\theta = b\)?
    Item 1:
    Item 2:
  • Which item is more informative?

Exercise 3

Consider now two items with the following parameters
Item 1: \(a = 1.5, b = 0, c = 0, d = 0.9\)
Item 2: \(a = 1.5, b = 0, c = 0, d = 1\)
For these items, fill in the following exercises with an accuracy of up to 0.05, then click on the Submit answers button.

  • What is the upper asymptote for items?
    Item 1:
    Item 2:
  • What is the probability of correct answer for latent ability \(\theta = b\)?
    Item 1:
    Item 2:
  • Which item is more informative?


Selected R code

library(ggplot2)
library(data.table)

# parameters 
a1 <- 1; b1 <- 0; c1 <- 0; d1 <- 1 
a2 <- 2; b2 <- 0.5; c2 <- 0; d2 <- 1 

# latent ability 
theta <- seq(-4, 4, 0.01)
# latent ability level
theta0 <- 0

# function for IRT characteristic curve 
icc_irt <- function(theta, a, b, c, d){ return(c + (d - c)/(1 + exp(-a*(theta - b)))) } 

# calculation of characteristic curves
df <- data.frame(theta, 
                 "icc1" = icc_irt(theta, a1, b1, c1, d1),
                 "icc2" = icc_irt(theta, a2, b2, c2, d2))
df <- melt(df, id.vars = "theta")

# plot for characteristic curves 
ggplot(df, aes(x = theta, y = value, color = variable)) + 
  geom_line() + 
  geom_segment(aes(y = icc_irt(theta0, a = a1, b = b1, c = c1, d = d1), 
                   yend = icc_irt(theta0, a = a1, b = b1, c = c1, d = d1), 
                   x = -4, xend = theta0), 
               color = "gray", linetype = "dashed") + 
  geom_segment(aes(y = icc_irt(theta0, a = a2, b = b2, c = c2, d = d2), 
                   yend = icc_irt(theta0, a = a2, b = b2, c = c2, d = d2), 
                   x = -4, xend = theta0), 
               color = "gray", linetype = "dashed") + 
  geom_segment(aes(y = 0, 
                   yend = max(icc_irt(theta0, a = a1, b = b1, c = c1, d = d1), 
                              icc_irt(theta0, a = a2, b = b2, c = c2, d = d2)), 
                   x = theta0, xend = theta0),
               color = "gray", linetype = "dashed") + 
  xlim(-4, 4) + 
  xlab("Ability") + 
  ylab("Probability of correct answer") + 
  theme_bw() + 
  ylim(0, 1) + 
  theme(axis.line = element_line(colour = "black"), 
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) + 
  ggtitle("Item characteristic curve") 

# function for IRT information function 
iic_irt <- function(theta, a, b, c, d){ return(a^2*(d-c)*exp(a*(theta-b))/(1 + exp(a*(theta-b)))^2) } 

# calculation of information curves
df <- data.frame(theta, 
                 "iic1" = iic_irt(theta, a1, b1, c1, d1),
                 "iic2" = iic_irt(theta, a2, b2, c2, d2))
df <- melt(df, id.vars = "theta")

# plot for information curves 
ggplot(df, aes(x = theta, y = value, color = variable)) + 
  geom_line() + 
  xlim(-4, 4) + 
  xlab("Ability") + 
  ylab("Information") + 
  theme_bw() + 
  ylim(0, 4) + 
  theme(axis.line = element_line(colour = "black"), 
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) + 
  ggtitle("Item information curve") 

Polytomous models

Polytomous models are used when a partial score is possible, or when items are graded on a Likert scale (e.g., from Totally disagree to Totally agree); some polytomous models can also be used for analyzing multiple-choice items. In this section you can explore item response functions of some polytomous models.


Two main classes of polytomous IRT models are considered:

Difference models are defined by setting a mathematical form for cumulative probabilities, while category probabilities are calculated as their differences. These models are also sometimes called cumulative logit models, as they set a linear form to cumulative logits.

As an example, the Graded Response Model (GRM; Samejima, 1970) uses the 2PL IRT model to describe cumulative probabilities (probabilities of obtaining a score higher than 1, 2, 3, etc.). Category probabilities are then described as differences of two subsequent cumulative probabilities.


For divide-by-total models, response category probabilities are defined as the ratio between category-related functions and their sum.

In the Generalized Partial Credit Model (GPCM; Muraki, 1992), the probability of a successful transition from one category score to the next is modelled by the 2PL IRT model, while the Partial Credit Model (PCM; Masters, 1982) uses the 1PL IRT model to describe this probability. An even more restricted version, the Rating Scale Model (RSM; Andrich, 1978), assumes exactly the same K response categories for each item and threshold parameters which can be split into a response-threshold parameter and an item-specific location parameter. These models are also sometimes called adjacent-category logit models, as they set a linear form to adjacent logits.

To model distractor properties in multiple-choice items, the Nominal Response Model (NRM; Bock, 1972) can be used. The NRM is an IRT analogy of the multinomial regression model. It is also a generalization of the GPCM/PCM/RSM ordinal models. The NRM is sometimes called a baseline-category logit model, as it sets a linear form to the log-odds of selecting a given category over a baseline category. The baseline can be chosen arbitrarily, although usually the correct answer or the first answer is chosen.

Graded response model

The Graded Response Model (GRM; Samejima, 1970) uses the 2PL IRT model to describe cumulative probabilities (probabilities of obtaining a score higher than 1, 2, 3, etc.). Category probabilities are then described as differences of two subsequent cumulative probabilities.

It belongs to the class of difference models, which are defined by setting a mathematical form for cumulative probabilities, while category probabilities are calculated as their differences. These models are also sometimes called cumulative logit models, as they set a linear form to cumulative logits.

Parameters

Select the number of responses, the difficulties for the cumulative probabilities b, and the common discrimination parameter a. The cumulative probability \(P(Y \geq 0)\) is always equal to 1 and is not displayed; the corresponding category probability \(P(Y = 0)\) is displayed in black.




Equations

$$\pi_k^* = \mathrm{P}\left(Y \geq k \vert \theta, a, b_k\right) = \frac{e^{a\left(\theta-b_k\right)}}{1+e^{a\left(\theta-b_k\right)}} $$ $$\pi_k = \mathrm{P}\left(Y = k \vert \theta, a, b_k, b_{k+1}\right) = \pi_k^* - \pi_{k+1}^* $$ $$\mathrm{E}\left(Y \vert \theta, a, b_1, \dots, b_K\right) = \sum_{k = 0}^K k\pi_k$$

Plots

Selected R code

library(ggplot2) 
library(data.table) 

# setting parameters 
a <- 1 
b <- c(-1.5, -1, -0.5, 0) 
theta <- seq(-4, 4, 0.01) 

# calculating cumulative probabilities 
ccirt <- function(theta, a, b){ return(1/(1 + exp(-a*(theta - b)))) } 
df1 <- data.frame(sapply(1:length(b), function(i) ccirt(theta, a, b[i])), theta)
df1 <- melt(df1, id.vars = "theta") 

# plotting cumulative probabilities 
ggplot(data = df1, aes(x = theta, y = value, col = variable)) + 
  geom_line() + 
  xlab("Ability") + 
  ylab("Cummulative probability") + 
  xlim(-4, 4) + 
  ylim(0, 1) + 
  theme_bw() + 
  theme(text = element_text(size = 14), 
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) + 
  ggtitle("Cummulative probabilities") + 
  scale_color_manual("", values = c("red", "yellow", "green", "blue"), labels = paste0("P(Y >= ", 1:4, ")")) 

# calculating category probabilities 
df2 <- data.frame(1, sapply(1:length(b), function(i) ccirt(theta, a, b[i]))) 
df2 <- data.frame(sapply(1:length(b), function(i) df2[, i] - df2[, i+1]), df2[, ncol(df2)], theta) 
df2 <- melt(df2, id.vars = "theta") 

# plotting category probabilities 
ggplot(data = df2, aes(x = theta, y = value, col = variable)) + 
  geom_line() + 
  xlab("Ability") + 
  ylab("Category probability") + 
  xlim(-4, 4) + 
  ylim(0, 1) + 
  theme_bw() + 
  theme(text = element_text(size = 14), 
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) + 
  ggtitle("Category probabilities") + 
  scale_color_manual("", values = c("black", "red", "yellow", "green", "blue"), labels = paste0("P(Y >= ", 0:4, ")"))

# calculating expected item score
df3 <- data.frame(1, sapply(1:length(b), function(i) ccirt(theta, a, b[i]))) 
df3 <- data.frame(sapply(1:length(b), function(i) df3[, i] - df3[, i+1]), df3[, ncol(df3)])
df3 <- data.frame(exp = as.matrix(df3) %*% 0:4, theta)

# plotting expected item score 
ggplot(data = df3, aes(x = theta, y = exp)) + 
  geom_line() + 
  xlab("Ability") + 
  ylab("Expected item score") + 
  xlim(-4, 4) + 
  ylim(0, 4) + 
  theme_bw() + 
  theme(text = element_text(size = 14), 
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) + 
  ggtitle("Expected item score")


Generalized partial credit model

In the Generalized Partial Credit Model (GPCM; Muraki, 1992), the probability of a successful transition from one category score to the next is modelled by the 2PL IRT model. The response category probabilities are then ratios between category-related functions (cumulative sums of exponentials) and their sum.

Two simpler models can be derived from the GPCM by restricting some parameters: the Partial Credit Model (PCM; Masters, 1982) uses the 1PL IRT model to describe the transition probability, i.e., it fixes a = 1. An even more restricted version, the Rating Scale Model (RSM; Andrich, 1978), assumes exactly the same K response categories for each item and threshold parameters which can be split into a response-threshold parameter \(\lambda_t\) and an item-specific location parameter \(\delta_i\). These models are also sometimes called adjacent-category logit models, as they set a linear form to adjacent logits.

Parameters

Select the number of responses, their threshold parameters d, and the common discrimination parameter a. With a = 1 you obtain the PCM. The numerator of \(\pi_0 = P(Y = 0)\) is set to 1 and \(\pi_0\) is displayed in black.




Equations

$$\pi_k =\mathrm{P}\left(Y = k \vert \theta, \alpha, \delta_0, \dots, \delta_K\right) = \frac{\exp\sum_{t = 0}^k \alpha(\theta - \delta_t)}{\sum_{r = 0}^K\exp\sum_{t = 0}^r \alpha(\theta - \delta_t)} $$ $$\mathrm{E}\left(Y \vert \theta, \alpha, \delta_0, \dots, \delta_K\right) = \sum_{k = 0}^K k\pi_k$$

Plots

Selected R code

library(ggplot2) 
library(data.table) 

# setting parameters 
a <- 1 
d <- c(-1.5, -1, -0.5, 0) 
theta <- seq(-4, 4, 0.01) 

# calculating category probabilities 
ccgpcm <- function(theta, a, d){ a*(theta - d) } 
df <- sapply(1:length(d), function(i) ccgpcm(theta, a, d[i])) 
pk <- sapply(1:ncol(df), function(k) apply(as.data.frame(df[, 1:k]), 1, sum)) 
pk <- cbind(0, pk) 
pk <- exp(pk) 
denom <- apply(pk, 1, sum) 
df <-  apply(pk, 2, function(x) x/denom)
df1 <- melt(data.frame(df, theta), id.vars = "theta") 

# plotting category probabilities 
ggplot(data = df1, aes(x = theta, y = value, col = variable)) + 
  geom_line() + 
  xlab("Ability") + 
  ylab("Category probability") + 
  xlim(-4, 4) + 
  ylim(0, 1) + 
  theme_bw() + 
  theme(text = element_text(size = 14), 
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) + 
  ggtitle("Category probabilities") + 
  scale_color_manual("", values = c("black", "red", "yellow", "green", "blue"), labels = paste0("P(Y = ", 0:4, ")"))

# calculating expected item score
df2 <- data.frame(exp = as.matrix(df) %*% 0:4, theta)
# plotting category probabilities 
ggplot(data = df2, aes(x = theta, y = exp)) + 
  geom_line() + 
  xlab("Ability") + 
  ylab("Expected item score") + 
  xlim(-4, 4) + 
  ylim(0, 4) + 
  theme_bw() + 
  theme(text = element_text(size = 14), 
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) + 
  ggtitle("Expected item score")


Nominal response model

In the Nominal Response Model (NRM; Bock, 1972), the probability of selecting a given category over the baseline category is modelled by the 2PL IRT model. This model is also sometimes called a baseline-category logit model, as it sets a linear form to the log odds of selecting a given category over the baseline category. The baseline can be chosen arbitrarily, although usually the correct answer or the first answer is used. The NRM is a generalization of the GPCM obtained by allowing item-specific and category-specific intercept and slope parameters.

Parameters

Select the number of distractors, their threshold parameters d, and discrimination parameters a. The parameters of \(\pi_0 = P(Y = 0)\) are set to zero and \(\pi_0\) is displayed in black.



Equations

$$\pi_k =\mathrm{P}\left(Y = k \vert \theta, \alpha_0, \dots, \alpha_K, \delta_0, \dots, \delta_K\right) = \frac{\exp(\alpha_k\theta + \delta_k)}{\sum_{r = 0}^K\exp(\alpha_r\theta + \delta_r)} $$

Plots


Selected R code

library(ggplot2) 
library(data.table) 

# setting parameters 
a <- c(2.5, 2, 1, 1.5) 
d <- c(-1.5, -1, -0.5, 0) 
theta <- seq(-4, 4, 0.01) 

# calculating category probabilities 
ccnrm <- function(theta, a, d){ exp(d + a*theta) } 
df <- sapply(1:length(d), function(i) ccnrm(theta, a[i], d[i])) 
df <- data.frame(1, df) 
denom <- apply(df, 1, sum) 
df <- apply(df, 2, function(x) x/denom) 
df1 <- melt(data.frame(df, theta), id.vars = "theta") 

# plotting category probabilities 
ggplot(data = df1, aes(x = theta, y = value, col = variable)) + 
  geom_line() + 
  xlab("Ability") + 
  ylab("Category probability") + 
  xlim(-4, 4) + 
  ylim(0, 1) + 
  theme_bw() + 
  theme(text = element_text(size = 14), 
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) + 
  ggtitle("Category probabilities") + 
  scale_color_manual("", values = c("black", "red", "yellow", "green", "blue"), labels = paste0("P(Y = ", 0:4, ")"))

# calculating expected item score
df2 <- data.frame(exp = as.matrix(df) %*% 0:4, theta)

# plotting expected item score
ggplot(data = df2, aes(x = theta, y = exp)) + 
  geom_line() + 
  xlab("Ability") + 
  ylab("Expected item score") + 
  xlim(-4, 4) + 
  ylim(0, 4) + 
  theme_bw() + 
  theme(text = element_text(size = 14), 
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) + 
  ggtitle("Expected item score")


Differential Item Functioning / Item Fairness

Differential item functioning (DIF) occurs when people from different social groups (commonly defined by gender or ethnicity) with the same underlying true ability have a different probability of answering the item correctly. If an item functions differently for two groups, it is potentially unfair. In general, two types of DIF can be recognized: if the item has different difficulty for the two groups but the same discrimination, uniform DIF is present (left figure). If the item has different discrimination, and possibly also different difficulty, for the two groups, non-uniform DIF is present (right figure).
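
Both situations can be illustrated with simple item characteristic curves; a minimal sketch (using hypothetical 2PL parameter values, not estimates from any dataset) follows:

library(ggplot2) 
library(data.table) 

theta <- seq(-4, 4, 0.01)
icc <- function(theta, a, b){ 1/(1 + exp(-a*(theta - b))) }

# uniform DIF: same discrimination, different difficulty
# non-uniform DIF: different discrimination (and possibly difficulty)
df <- data.frame(theta,
                 uniform_ref = icc(theta, a = 1.5, b = 0),
                 uniform_foc = icc(theta, a = 1.5, b = 0.8),
                 nonuniform_ref = icc(theta, a = 1.5, b = 0),
                 nonuniform_foc = icc(theta, a = 0.8, b = 0.4))
df <- melt(df, id.vars = "theta")

ggplot(df, aes(x = theta, y = value, col = variable)) + 
  geom_line() + 
  xlab("Ability") + 
  ylab("Probability of correct answer") + 
  theme_bw()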




Total scores

DIF is not about total scores! Two groups may have the same distribution of total scores, yet some items may function differently for the two groups. Conversely, one of the groups may have a significantly lower total score, yet the test may contain no DIF items (Martinkova et al., 2017).

Summary of total scores for groups

Histograms of total scores for groups

For the selected cut-score, the blue part of the histogram shows respondents with a total score above the cut-score, the grey column shows respondents with a total score equal to the cut-score, and the red part shows respondents with a total score below the cut-score.


Selected R code

library(difNLR)
data <- GMAT[, 1:20]
group <- GMAT[, "group"]

# Summary table
sc_zero <- apply(data[group == 0, ], 1, sum); summary(sc_zero) # total scores of reference group
sc_one <- apply(data[group == 1, ], 1, sum); summary(sc_one) # total scores of focal group
# Histograms
hist(sc_zero, breaks = 0:20)
hist(sc_one, breaks = 0:20)

Delta plot

Delta plot (Angoff & Ford, 1973) compares the proportions of correct answers per item in the two groups. It displays non-linear transformation of these proportions using quantiles of standard normal distributions (so called delta scores) for each item for the two genders in a scatterplot called diagonal plot or delta plot (see Figure). Item is under suspicion of DIF if the delta point considerably departs from the diagonal. The detection threshold is either fixed to value 1.5 or based on bivariate normal approximation (Magis & Facon, 2012).




Selected R code

library(deltaPlotR)
library(difNLR)
data(GMAT)
data <- GMAT[, 1:20]
group <- GMAT[, "group"]

# Delta scores with fixed threshold
deltascores <- deltaPlot(data.frame(data, group), group = "group", focal.name = 1, thr = 1.5)
deltascores
# Delta plot
diagPlot(deltascores, thr.draw = T)

# Delta scores with normal threshold
deltascores <- deltaPlot(data.frame(data, group), group = "group", focal.name = 1, thr = "norm", purify = F)
deltascores
# Delta plot
diagPlot(deltascores, thr.draw = T)

Mantel-Haenszel test

The Mantel-Haenszel test is a DIF detection method based on contingency tables calculated for each level of the total score (Mantel & Haenszel, 1959).

Summary table

Here you can select a correction method for multiple comparisons or item purification.


Selected R code

library(difNLR)
library(difR)
data(GMAT)
data <- GMAT[, 1:20]
group <- GMAT[, "group"]

# Mantel-Haenszel test
fit <- difMH(Data = data, group = group, focal.name = 1, p.adjust.method = "none", purify = F)
fit

Mantel-Haenszel test

The Mantel-Haenszel test is a DIF detection method based on contingency tables calculated for each level of the total score (Mantel & Haenszel, 1959).

Contingency tables and odds ratio calculation


Selected R code

library(difNLR)
library(difR)
library(reshape2)
data(GMAT)
data <- GMAT[, 1:20]
group <- GMAT[, "group"]

# Contingency table for item 1 and score 12
df <- data.frame(data[, 1], group)
colnames(df) <- c("Answer", "Group")
df$Answer <- relevel(factor(df$Answer, labels = c("Incorrect", "Correct")), "Correct")
df$Group <- factor(df$Group, labels = c("Reference Group", "Focal Group"))
score <- apply(data, 1, sum)
df <- df[score == 12, ]
tab <- dcast(data.frame(xtabs(~ Group + Answer, data = df)), Group ~ Answer, value.var = "Freq", margins = T, fun = sum)
tab
# Mantel-Haenszel estimate of OR
fit <- difMH(Data = data, group = group, focal.name = 1, p.adjust.method = "none", purify = F)
fit$alphaMH

Logistic regression on total scores

Logistic regression allows for detection of uniform and non-uniform DIF (Swaminathan & Rogers, 1990) by adding a group-specific intercept term b2 (uniform DIF) and a group-specific interaction term b3 (non-uniform DIF) into the model and testing for their significance.

Equation

$$\mathrm{P}\left(Y_{ij} = 1 | X_i, G_i, b_0, b_1, b_2, b_3\right) = \frac{e^{b_0 + b_1 X_i + b_2 G_i + b_3 X_i G_i}}{1+e^{b_0 + b_1 X_i + b_2 G_i + b_3 X_i G_i}} $$
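
For a single item, the same model can be fitted directly with glm(); the following sketch (using the GMAT demonstration data; the joint likelihood-ratio test of b2 and b3 shown here is one of several possible tests) illustrates the idea:

library(difNLR)
data(GMAT)
data <- GMAT[, 1:20]
group <- GMAT[, "group"]
score <- apply(data, 1, sum)

# full model for item 1: group-specific intercept b2 and interaction b3
fit1 <- glm(data[, 1] ~ score*group, family = binomial)
# reduced model without any group effect
fit0 <- glm(data[, 1] ~ score, family = binomial)

# joint significance test of b2 and b3
anova(fit0, fit1, test = "LRT")
coef(fit1) # b0, b1, b2, b3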

Summary table

Here you can choose which type of DIF to test. You can also select a correction method for multiple comparisons or item purification.


Selected R code

library(difNLR)
library(difR)
data(GMAT)
data <- GMAT[, 1:20]
group <- GMAT[, "group"]

# Logistic regression DIF detection method
fit <- difLogistic(Data = data, group = group, focal.name = 1, type = "both", p.adjust.method = "none", purify = F)
fit

Logistic regression on total scores

Logistic regression allows for detection of uniform and non-uniform DIF (Swaminathan & Rogers, 1990) by adding a group-specific intercept term b2 (uniform DIF) and a group-specific interaction term b3 (non-uniform DIF) into the model and testing for their significance.

Plot with estimated DIF logistic curve

Here you can choose which type of DIF to test. You can also select a correction method for multiple comparisons or item purification.

Points represent the proportion of correct answers with respect to the total score. Their size is determined by the number of respondents who achieved the given level of total score, with respect to group membership.

NOTE: Plots and tables are based on the DIF logistic procedure without any correction method.


Equation

$$\mathrm{P}\left(Y_{ij} = 1 | X_i, G_i, b_0, b_1, b_2, b_3\right) = \frac{e^{b_0 + b_1 X_i + b_2 G_i + b_3 X_i G_i}}{1+e^{b_0 + b_1 X_i + b_2 G_i + b_3 X_i G_i}} $$

Table of parameters


Selected R code

library(difNLR)
library(difR)
data(GMAT)
data <- GMAT[, 1:20]
group <- GMAT[, "group"]

# Logistic regression DIF detection method
fit <- difLogistic(Data = data, group = group, focal.name = 1, type = "both", p.adjust.method = "none", purify = F)
fit
# Plot of characteristic curve for item 1
plotDIFLogistic(data, group, type = "both", item = 1, IRT = F, p.adjust.method = "none", purify = F)
# Coefficients
fit$logitPar

Logistic regression on standardized total scores with IRT parameterization

Logistic regression allows for detection of uniform and non-uniform DIF (Swaminathan & Rogers, 1990) by adding a group-specific difficulty parameter bDIF (uniform DIF) and a group-specific discrimination parameter aDIF (non-uniform DIF) into the model and testing for their significance.

Equation

$$\mathrm{P}\left(Y_{ij} = 1 | Z_i, G_i, a_j, b_j, a_{\text{DIF}j}, b_{\text{DIF}j}\right) = \frac{e^{\left(a_j + a_{\text{DIF}j} G_i\right) \left(Z_i -\left(b_j + b_{\text{DIF}j} G_i\right)\right)}}{1+e^{\left(a_j + a_{\text{DIF}j} G_i\right) \left(Z_i -\left(b_j + b_{\text{DIF}j} G_i\right)\right)}} $$

Summary table

Here you can choose which type of DIF to test. You can also select a correction method for multiple comparisons.


Selected R code

library(difNLR)
library(difR)
data(GMAT)
data <- GMAT[, 1:20]
group <- GMAT[, "group"]
scaled.score <- scale(score)

# Logistic regression DIF detection method
fit <- difLogistic(Data = data, group = group, focal.name = 1, type = "both", match = scaled.score, p.adjust.method = "none", purify = F)
fit

Logistic regression on standardized total scores with IRT parameterization

Logistic regression allows for detection of uniform and non-uniform DIF by adding a group-specific difficulty parameter bDIF (uniform DIF) and a group-specific discrimination parameter aDIF (non-uniform DIF) into the model and testing for their significance.

Plot with estimated DIF logistic curve

Here you can choose which type of DIF to test. You can also select a correction method for multiple comparisons.

Points represent the proportion of correct answers with respect to the standardized total score. Their size is determined by the number of respondents who achieved the given level of standardized total score, with respect to group membership.

NOTE: Plots and tables are based on the DIF logistic procedure without any correction method.


Equation

$$\mathrm{P}\left(Y_{ij} = 1 | Z_i, G_i, a_j, b_j, a_{\text{DIF}j}, b_{\text{DIF}j}\right) = \frac{e^{\left(a_j + a_{\text{DIF}j} G_i\right)\left(Z_i -\left(b_j + b_{\text{DIF}j} G_i\right)\right)}} {1+e^{\left(a_j + a_{\text{DIF}j} G_i\right)\left(Z_i -\left(b_j + b_{\text{DIF}j} G_i\right)\right)}} $$

Table of parameters


Selected R code

library(difNLR)
library(difR)
data(GMAT)
data <- GMAT[, 1:20]
group <- GMAT[, "group"]
scaled.score <- scale(score)

# Logistic regression DIF detection method
fit <- difLogistic(Data = data, group = group, focal.name = 1, type = "both", match = scaled.score, p.adjust.method = "none", purify = F)
fit
# Plot of characteristic curve for item 1
plotDIFLogistic(data, group, type = "both", item = 1, IRT = T, p.adjust.method = "none", purify = F)
# Coefficients for item 1 - recalculation
coef_old <- fit$logitPar[1, ]
coef <- c()
# recalculation: a = b1, b = -b0/b1, aDIF = b3, bDIF = -(b1*b2 - b0*b3)/(b1*(b1 + b3))
coef[1] <- coef_old[2]
coef[2] <- -(coef_old[1] / coef_old[2])
coef[3] <- coef_old[4]
coef[4] <- -(coef_old[2] * coef_old[3] - coef_old[1] * coef_old[4]) / (coef_old[2] * (coef_old[2] + coef_old[4]))
coef

Nonlinear regression on standardized total scores with IRT parameterization

The nonlinear regression model allows for a nonzero lower asymptote, the pseudo-guessing parameter c (Drabinova & Martinkova, 2017). Similarly to logistic regression, nonlinear regression allows for detection of uniform and non-uniform DIF by adding a group-specific difficulty parameter bDIF (uniform DIF) and a group-specific discrimination parameter aDIF (non-uniform DIF) into the model and testing for their significance.

Equation

$$\mathrm{P}\left(Y_{ij} = 1 | Z_i, G_i, a_j, b_j, c_j, a_{\text{DIF}j}, b_{\text{DIF}j}\right) = c_j + \left(1 - c_j\right) \cdot \frac{e^{\left(a_j + a_{\text{DIF}j} G_i\right)\left(Z_i -\left(b_j + b_{\text{DIF}j} G_i\right)\right)}} {1+e^{\left(a_j + a_{\text{DIF}j} G_i\right)\left(Z_i -\left(b_j + b_{\text{DIF}j} G_i\right)\right)}} $$

Summary table

Here you can choose which type of DIF to test. You can also select a correction method for multiple comparisons or item purification.


Selected R code

library(difNLR)
data(GMAT)
Data <- GMAT[, 1:20]
group <- GMAT[, "group"]

# Nonlinear regression DIF method
fit <- difNLR(Data = Data, group = group, focal.name = 1, model = "3PLcg", type = "both", p.adjust.method = "none")
fit

Nonlinear regression on standardized total scores with IRT parameterization

The nonlinear regression model allows for a nonzero lower asymptote, the pseudo-guessing parameter c (Drabinova & Martinkova, 2017). Similarly to logistic regression, nonlinear regression allows for detection of uniform and non-uniform DIF by adding a group-specific difficulty parameter bDIF (uniform DIF) and a group-specific discrimination parameter aDIF (non-uniform DIF) into the model and testing for their significance.

Plot with estimated DIF nonlinear curve

Here you can choose which type of DIF to test. You can also select a correction method for multiple comparisons or item purification.

Points represent the proportion of correct answers with respect to the standardized total score. Their size is determined by the number of respondents who achieved the given level of standardized total score, with respect to group membership.


Equation

$$\mathrm{P}\left(Y_{ij} = 1 | Z_i, G_i, a_j, b_j, c_j, a_{\text{DIF}j}, b_{\text{DIF}j}\right) = c_j + \left(1 - c_j\right) \cdot \frac{e^{\left(a_j + a_{\text{DIF}j} G_i\right)\left(Z_i -\left(b_j + b_{\text{DIF}j} G_i\right)\right)}} {1+e^{\left(a_j + a_{\text{DIF}j} G_i\right)\left(Z_i -\left(b_j + b_{\text{DIF}j} G_i\right)\right)}} $$

Table of parameters


Selected R code

library(difNLR)
data(GMAT)
Data <- GMAT[, 1:20]
group <- GMAT[, "group"]

# Nonlinear regression DIF method
fit <- difNLR(Data = Data, group = group, focal.name = 1, model = "3PLcg", type = "both", p.adjust.method = "none")
# Plot of characteristic curve of item 1
plot(fit, item = 1)
# Coefficients
fit$nlrPAR

Lord test for IRT models

The Lord test (Lord, 1980) is based on an IRT model (1PL, 2PL, or 3PL with the same guessing parameter for both groups). It uses the difference between the item parameters estimated for the two groups to detect DIF. In statistical terms, the Lord statistic is equal to the Wald statistic.
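
The computation behind the statistic is a standard Wald test on the vector of parameter differences; a generic sketch (with hypothetical 2PL estimates and covariance matrices, not values estimated from any dataset):

# hypothetical 2PL estimates (a, b) of one item in each group
par_ref <- c(1.3, 0.2)
par_foc <- c(1.0, 0.7)
# hypothetical covariance matrices of the two estimates
V_ref <- matrix(c(0.04, 0.01, 0.01, 0.03), 2)
V_foc <- matrix(c(0.05, 0.01, 0.01, 0.04), 2)

# Wald (Lord) statistic and p-value, df = number of item parameters
d <- par_ref - par_foc
W <- t(d) %*% solve(V_ref + V_foc) %*% d
pchisq(as.numeric(W), df = length(d), lower.tail = FALSE)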



Summary table

Here you can choose the model used to test DIF. You can also select a correction method for multiple comparisons or item purification.


Selected R code

library(difNLR)
library(difR)
data(GMAT)
data <- GMAT[, 1:20]
group <- GMAT[, "group"]

# 1PL IRT MODEL
fit1PL <- difLord(Data = data, group = group, focal.name = 1, model = "1PL", p.adjust.method = "none", purify = F)
fit1PL

# 2PL IRT MODEL
fit2PL <- difLord(Data = data, group = group, focal.name = 1, model = "2PL", p.adjust.method = "none", purify = F)
fit2PL

# 3PL IRT MODEL with the same guessing for groups
guess <- itemParEst(data, model = "3PL")[, 3]
fit3PL <- difLord(Data = data, group = group, focal.name = 1, model = "3PL", c = guess, p.adjust.method = "none", purify = F)
fit3PL

Lord test for IRT models

The Lord test (Lord, 1980) is based on an IRT model (1PL, 2PL, or 3PL with the same guessing parameter for both groups). It uses the difference between the item parameters estimated for the two groups to detect DIF. In statistical terms, the Lord statistic is equal to the Wald statistic.


Plot with estimated DIF characteristic curve

Here you can choose the model used to test DIF. You can also select a correction method for multiple comparisons or item purification.

NOTE: Plots and tables are based on the larger DIF IRT model.


Equation

Table of parameters


Selected R code

library(difNLR)
library(difR)
data(GMAT)
data <- GMAT[, 1:20]
group <- GMAT[, "group"]

# 1PL IRT MODEL
fit1PL <- difLord(Data = data, group = group, focal.name = 1, model = "1PL", p.adjust.method = "none", purify = F)
fit1PL
# Coefficients for all items
tab_coef1PL <- fit1PL$itemParInit
# Plot of characteristic curve of item 1
plotDIFirt(parameters = tab_coef1PL, item = 1, test = "Lord")

# 2PL IRT MODEL
fit2PL <- difLord(Data = data, group = group, focal.name = 1, model = "2PL", p.adjust.method = "none", purify = F)
fit2PL
# Coefficients for all items
tab_coef2PL <- fit2PL$itemParInit
# Plot of characteristic curve of item 1
plotDIFirt(parameters = tab_coef2PL, item = 1, test = "Lord")

# 3PL IRT MODEL with the same guessing for groups
guess <- itemParEst(data, model = "3PL")[, 3]
fit3PL <- difLord(Data = data, group = group, focal.name = 1, model = "3PL", c = guess, p.adjust.method = "none", purify = F)
fit3PL
# Coefficients for all items
tab_coef3PL <- fit3PL$itemParInit
# Plot of characteristic curve of item 1
plotDIFirt(parameters = tab_coef3PL, item = 1, test = "Lord")

Raju test for IRT models

The Raju test (Raju, 1988, 1990) is based on an IRT model (1PL, 2PL, or 3PL with the same guessing parameter for both groups). It uses the area between the item characteristic curves for the two groups to detect DIF.
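
For the 2PL model, the signed area between the two curves has a simple closed form (it equals the difference of the difficulties when the discriminations are equal); the sketch below verifies this numerically for hypothetical parameter values:

# hypothetical 2PL parameters of one item in the two groups
icc <- function(theta, a, b){ 1/(1 + exp(-a*(theta - b))) }
a_ref <- 1.2; b_ref <- 0
a_foc <- 1.2; b_foc <- 0.5

# signed area between the two item characteristic curves
integrate(function(x) icc(x, a_ref, b_ref) - icc(x, a_foc, b_foc),
          lower = -Inf, upper = Inf)$value # equals b_foc - b_ref = 0.5 here

# unsigned area, relevant when the curves cross (e.g. unequal discriminations)
integrate(function(x) abs(icc(x, a_ref, b_ref) - icc(x, 0.8, 0.4)),
          lower = -10, upper = 10)$value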



Summary table

Here you can choose the model used to test DIF. You can also select a correction method for multiple comparisons or item purification.


Selected R code

library(difNLR)
library(difR)
data(GMAT)
data <- GMAT[, 1:20]
group <- GMAT[, "group"]

# 1PL IRT MODEL
fit1PL <- difRaju(Data = data, group = group, focal.name = 1, model = "1PL", p.adjust.method = "none", purify = F)
fit1PL

# 2PL IRT MODEL
fit2PL <- difRaju(Data = data, group = group, focal.name = 1, model = "2PL", p.adjust.method = "none", purify = F)
fit2PL

# 3PL IRT MODEL with the same guessing for groups
guess <- itemParEst(data, model = "3PL")[, 3]
fit3PL <- difRaju(Data = data, group = group, focal.name = 1, model = "3PL", c = guess, p.adjust.method = "none", purify = F)
fit3PL

Raju test for IRT models

The Raju test (Raju, 1988, 1990) is based on an IRT model (1PL, 2PL, or 3PL with the same guessing parameter for both groups). It uses the area between the item characteristic curves for the two groups to detect DIF.


Plot with estimated DIF characteristic curve

Here you can choose the model used to test DIF. You can also select a correction method for multiple comparisons or item purification.

NOTE: Plots and tables are based on the larger DIF IRT model.


Equation

Table of parameters


Selected R code

library(difNLR)
library(difR)
data(GMAT)
data <- GMAT[, 1:20]
group <- GMAT[, "group"]

# 1PL IRT MODEL
fit1PL <- difRaju(Data = data, group = group, focal.name = 1, model = "1PL", p.adjust.method = "none", purify = F)
fit1PL
# Coefficients for all items
tab_coef1PL <- fit1PL$itemParInit
# Plot of characteristic curve of item 1
plotDIFirt(parameters = tab_coef1PL, item = 1, test = "Raju")

# 2PL IRT MODEL
fit2PL <- difRaju(Data = data, group = group, focal.name = 1, model = "2PL", p.adjust.method = "none", purify = F)
fit2PL
# Coefficients for all items
tab_coef2PL <- fit2PL$itemParInit
# Plot of characteristic curve of item 1
plotDIFirt(parameters = tab_coef2PL, item = 1, test = "Raju")

# 3PL IRT MODEL with the same guessing for groups
guess <- itemParEst(data, model = "3PL")[, 3]
fit3PL <- difRaju(Data = data, group = group, focal.name = 1, model = "3PL", c = guess, p.adjust.method = "none", purify = F)
fit3PL
# Coefficients for all items
tab_coef3PL <- fit3PL$itemParInit
# Plot of characteristic curve of item 1
plotDIFirt(parameters = tab_coef3PL, item = 1, test = "Raju")

Differential Distractor Functioning with multinomial log-linear regression model

Differential Distractor Functioning (DDF) occurs when people from different groups, but with the same knowledge, have a different probability of selecting at least one distractor choice. DDF is examined here by a multinomial log-linear regression model with the standardized total score (Z-score) and group membership as covariates.

Equation

For an item j with K possible answer choices, the probability that person i with standardized total score \(Z_i\) and group membership \(G_i\) selects the correct answer (category K) is given by:

$$\mathrm{P}(Y_{ij} = K|Z_i, G_i, b_{jl0}, b_{jl1}, b_{jl2}, b_{jl3}, l = 1, \dots, K-1) = \frac{1}{1 + \sum_l e^{\left( b_{jl0} + b_{jl1} Z_i + b_{jl2} G_i + b_{jl3} Z_i G_i\right)}}$$

The probability of choosing distractor k is then given by:

$$\mathrm{P}(Y_{ij} = k|Z_i, G_i, b_{jl0}, b_{jl1}, b_{jl2}, b_{jl3}, l = 1, \dots, K-1) = \frac{e^{\left( b_{jk0} + b_{jk1} Z_i + b_{jk2} G_i + b_{jk3} Z_i G_i\right)}}{1 + \sum_l e^{\left( b_{jl0} + b_{jl1} Z_i + b_{jl2} G_i + b_{jl3} Z_i G_i\right)}}$$
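
For a single item, the same model can be fitted directly with multinom() from the nnet package; the following sketch (using the GMATtest demonstration data, in which item responses are stored as factors and GMATkey holds the correct answers) illustrates the idea:

library(difNLR)
library(nnet)
data(GMATtest, GMATkey)
Data <- GMATtest[, 1:20]
group <- GMATtest[, "group"]
key <- GMATkey

# standardized total score from scored responses
scored <- sapply(1:length(key), function(i) as.numeric(Data[, i] == as.character(key[i])))
Z <- as.vector(scale(apply(scored, 1, sum)))

# multinomial model for item 1, correct answer as the baseline category
answer <- relevel(factor(Data[, 1]), ref = as.character(key[1]))
fit <- multinom(answer ~ Z*group)
summary(fit)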

Summary table

Here you can choose which type of DDF to test. You can also select a correction method for multiple comparisons or item purification.


Selected R code

library(difNLR)
data(GMATtest, GMATkey)
Data <- GMATtest[, 1:20]
group <- GMATtest[, "group"]
key <- GMATkey

# DDF with difNLR package
fit <- ddfMLR(Data, group, focal.name = 1, key, type = "both", p.adjust.method = "none")
fit

Differential Distractor Functioning with multinomial log-linear regression model

Differential Distractor Functioning (DDF) occurs when people from different groups, but with the same knowledge, have a different probability of selecting at least one distractor choice. DDF is examined here by a multinomial log-linear regression model with the standardized total score (Z-score) and group membership as covariates.

Plot with estimated DDF curves

Here you can choose which type of DDF to test. You can also select a correction method for multiple comparisons or item purification.

Points represent the proportion of respondents selecting a given answer with respect to the standardized total score. Their size is determined by the number of respondents who achieved the given level of standardized total score and selected the given option, with respect to group membership.


Equation

Table of parameters


Selected R code

library(difNLR)
data(GMATtest, GMATkey)
Data <- GMATtest[, 1:20]
group <- GMATtest[, "group"]
key <- GMATkey

# DDF with difNLR package
fit <- ddfMLR(Data, group, focal.name = 1, key, type = "both", p.adjust.method = "none")
# Estimated coefficients of item 1
fit$mlrPAR[[1]]

Download report

Settings of report

ShinyItemAnalysis offers an option to download a report in HTML or PDF format. PDF report creation requires an up-to-date version of MiKTeX (or another TeX distribution). If you don't have one installed, please use the HTML report.

There is an option to customize the report settings. By checking Customize settings, local settings will be offered and used for each selected section of the report; otherwise, the settings are taken from the individual pages of the application. You can also include your name in the report, as well as the name of the dataset used.

Content of report

By default, reports contain a summary of total scores, a table of standard scores, item analysis, distractor plots for each item, and multinomial regression plots for each item. Other analyses can be selected below.

Validity

Difficulty/discrimination plot

Distractors plots

DIF method selection

Delta plot settings

Logistic regression settings

Multinomial regression settings

Recommendation: Report generation is faster and more reliable when you first visit the sections you intend to include. For example, if you wish to include a 3PL IRT model, first visit the IRT models section and its 3PL subsection.




References

Akaike, H. (1974). A New Look at the Statistical Model Identification. IEEE Transactions on Automatic Control, 19(6), 716-723. See online.

Ames, A. J., & Penfield, R. D. (2015). An NCME Instructional Module on Item-Fit Statistics for Item Response Theory Models. Educational Measurement: Issues and Practice, 34(3), 39-48. See online.

Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43(4), 561-573. See online.

Angoff, W. H., & Ford, S. F. (1973). Item-Race Interaction on a Test of Scholastic Aptitude. Journal of Educational Measurement, 10(2), 95-105. See online.

Bock, R. D. (1972). Estimating Item Parameters and Latent Ability when Responses Are Scored in Two or More Nominal Categories. Psychometrika, 37(1), 29-51. See online.

Cronbach, L. J. (1951). Coefficient Alpha and the Internal Structure of Tests. Psychometrika, 16(3), 297-334. See online.

Drabinova, A., & Martinkova, P. (2017). Detection of Differential Item Functioning with Non-Linear Regression: Non-IRT Approach Accounting for Guessing. Journal of Educational Measurement, 54(4), 498-517. See online.

Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems. Routledge.

Magis, D., & Facon, B. (2012). Angoff's Delta Method Revisited: Improving DIF Detection under Small Samples. British Journal of Mathematical and Statistical Psychology, 65(2), 302-321. See online.

Mantel, N., & Haenszel, W. (1959). Statistical Aspects of the Analysis of Data from Retrospective Studies. Journal of the National Cancer Institute, 22 (4), 719-748. See online.

Martinkova, P., Drabinova, A., & Houdek, J. (2017). ShinyItemAnalysis: Analyza prijimacich a jinych znalostnich ci psychologických testu. TESTFORUM, 6(9), 16–35. See online. (ShinyItemAnalysis: Analyzing admission and other educational and psychological tests)

Martinkova, P., Drabinova, A., Liaw, Y. L., Sanders, E. A., McFarland, J. L., & Price, R. M. (2017). Checking Equity: Why Differential Item Functioning Analysis Should Be a Routine Part of Developing Conceptual Assessments. CBE-Life Sciences Education, 16(2). See online.

Martinkova, P., Stepanek, L., Drabinova, A., Houdek, J., Vejrazka, M., & Stuka, C. (2017). Semi-real-time analyses of item characteristics for medical school admission tests. In: Proceedings of the 2017 Federated Conference on Computer Science and Information Systems. In print.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149-174. See online.

McFarland, J. L., Price, R. M., Wenderoth, M. P., Martinkova, P., Cliff, W., Michael, J., ... & Wright, A. (2017). Development and validation of the homeostasis concept inventory. CBE-Life Sciences Education, 16(2), ar35. See online.

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. ETS Research Report Series, 1992(1). See online.

Swaminathan, H., & Rogers, H. J. (1990). Detecting Differential Item Functioning Using Logistic Regression Procedures. Journal of Educational Measurement, 27(4), 361-370. See online.

Raju, N. S. (1988). The Area between Two Item Characteristic Curves. Psychometrika, 53 (4), 495-502. See online.

Raju, N. S. (1990). Determining the Significance of Estimated Signed and Unsigned Areas between Two Item Response Functions. Applied Psychological Measurement, 14 (2), 197-207. See online.

Rasch, G. (1960) Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Paedagogiske Institute.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika, 34(1), 1-97. See online.

Schwarz, G. (1978). Estimating the Dimension of a Model. The Annals of Statistics, 6(2), 461-464. See online.

Wilson, M. (2005). Constructing Measures: An Item Response Modeling Approach. Mahwah, NJ: Lawrence Erlbaum Associates.

Wright, B. D., & Stone, M. H. (1979). Best Test Design. Chicago: Mesa Press.