### Data

For demonstration purposes, the 20-item GMAT dataset from the difNLR R package is used. On this page, you may select one of several toy datasets, mostly offered by the ShinyItemAnalysis and difNLR packages, or you may upload your own dataset (see below). To return to the demonstration dataset, click on the Unload data button.

#### Training datasets

The main data file should contain responses of individual respondents (rows) to given items (columns). Data need to be either binary, nominal (e.g. in ABCD format), or ordinal (e.g. on a Likert scale). The header may contain item names; no row names should be included. The header should be either included in all uploaded datasets or excluded from all of them. Columns of the dataset are by default renamed to Item followed by the column number. If you want to keep your own names, check the box Keep item names below. Missing values in the scored dataset are by default evaluated as 0. If you want to keep them as missing, check the box Keep missing values below.

For nominal data, it is necessary to upload key of correct answers.

For ordinal data, you are advised to include a vector containing the cut-score used for binarization of the uploaded data, i.e., values greater than or equal to the provided cut-score are set to 1, otherwise to 0. You can either upload a dataset of item-specific values, or you can provide one value for the whole dataset.

Note: In case that cut-score is not provided, vector of maximal values is used.
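The scoring rules described above can be sketched in base R. This is only an illustration on made-up data, not the application's own code: the objects `nominal`, `key`, `ordinal` and `cut_score` are hypothetical names, and the default cut-score is taken here to be the observed maximum of each item, which is an assumption.

```r
# Hypothetical nominal responses (rows = respondents, columns = items)
nominal <- data.frame(
  Item1 = c("A", "B", "A"),
  Item2 = c("C", "C", "D")
)
key <- c("A", "C") # assumed key of correct answers, one entry per item

# 1 where the response matches the key of its item, 0 otherwise
scored_nominal <- 1 * sweep(as.matrix(nominal), 2, key, "==")

# Hypothetical ordinal responses (4 respondents, 3 items)
ordinal <- matrix(c(0, 1, 2,
                    3, 2, 1,
                    4, 0, 2,
                    1, 1, 1), ncol = 3, byrow = TRUE)
cut_score <- c(2, 1, 2) # item-specific cut-scores

# values greater than or equal to the cut-score become 1, others 0
scored_ordinal <- 1 * sweep(ordinal, 2, cut_score, ">=")

# without a cut-score: use the vector of item maxima (observed maxima assumed)
scored_default <- 1 * sweep(ordinal, 2, apply(ordinal, 2, max), ">=")
```

A single cut-score for the whole dataset would reduce to `1 * (ordinal >= 2)`.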

For ordinal data, it is optional to upload minimal and maximal values of answers. You can either upload datasets of item-specific values, or you can provide one value for whole dataset.

Note: If no minimal or maximal values are provided, these values are set automatically based on observed values.

Group is binary vector, where 0 represents reference group and 1 represents focal group. Its length needs to be the same as number of individual respondents in the main dataset. If the group is not provided then it won't be possible to run DIF and DDF detection procedures in DIF/Fairness section. Missing values are not supported for group membership vector and such cases/rows of the data should be removed.

Criterion variable is either a discrete or a continuous vector (e.g. future study success or future GPA in case of admission tests) which should be predicted by the measurement. Its length needs to be the same as the number of individual respondents in the main dataset. If the criterion variable is not provided, it won't be possible to run validity analysis in the Predictive validity section on the Validity page.

DIF matching variable is a vector of the same length as number of observations in your data. If not supplied, total score is automatically computed and utilized by default.

### Data exploration

Here you can explore uploaded dataset. Rendering of tables can take some time.

### Analysis of total scores

Total score, also known as raw score or sum score, is a total number of correct answers.

#### Summary table

The table below summarizes basic characteristics of total scores, including minimum and maximum, mean, median, standard deviation, skewness and kurtosis. The kurtosis here is estimated by sample kurtosis $$\frac{m_4}{s^4}$$, where $$m_4$$ is the fourth central moment and $$s^2$$ is sample variance. The skewness is estimated by sample skewness $$\frac{m_3}{s^3}$$, where $$m_3$$ is the third central moment. The kurtosis for normally distributed scores is near the value of 3 and the skewness is near the value of 0.
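The two formulas can be checked by hand with a few lines of base R. This is a minimal sketch on made-up scores; the helper `sample_moments()` is a hypothetical name, and it assumes that $$s^2$$ is the usual sample variance with denominator $$n - 1$$. Some implementations use the second central moment (denominator $$n$$) instead, so values can differ slightly in small samples.

```r
# sample skewness m3 / s^3 and sample kurtosis m4 / s^4
sample_moments <- function(x) {
  m3 <- mean((x - mean(x))^3) # third central moment
  m4 <- mean((x - mean(x))^4) # fourth central moment
  s <- sd(x)                  # assumption: s^2 is the n - 1 sample variance
  c(skewness = m3 / s^3, kurtosis = m4 / s^4)
}

score <- c(1, 2, 3, 4, 5) # made-up total scores, symmetric around 3
res <- sample_moments(score)
res
```

For these symmetric scores the skewness is zero, and the kurtosis is well below 3 because the scores are spread evenly rather than bell-shaped.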

#### Histogram of total score

For selected cut-score, blue part of histogram shows respondents with total score above the cut-score, grey column shows respondents with total score equal to the cut-score and red part of histogram shows respondents below the cut-score.

#### Selected R code

library(difNLR)
library(ggplot2)
library(moments)
library(ShinyItemAnalysis) # provides theme_app()

# loading data
data(GMAT)
data <- GMAT[, 1:20]

# total score calculation
score <- apply(data, 1, sum)

# summary of total score
c(min(score), max(score), mean(score), median(score), sd(score), skewness(score), kurtosis(score))

# colors by cut-score
cut <- median(score) # cut-score
color <- c(rep("red", cut - min(score)), "gray", rep("blue", max(score) - cut))
df <- data.frame(score)

# histogram
ggplot(df, aes(score)) +
  geom_histogram(binwidth = 1, fill = color, col = "black") +
  xlab("Total score") +
  ylab("Number of respondents") +
  theme_app()

### Standard scores

Total score, also known as raw score, is the total number of correct answers. It can be used to compare an individual score to a norm group, e.g. if the mean is 12, an individual score can be compared to see whether it is below or above this average.

Percentile indicates the value below which a given percentage of observations falls, e.g. an individual score at the 80th percentile means that the individual score is the same as or higher than the scores of 80% of all respondents.

Success rate is the percentage of success, e.g. if the maximum number of points on the test is 20 and the individual score is 12, then the success rate is 12/20 = 0.6, i.e. 60%.

Z-score, also called standardized score, is a linear transformation of the total score with a mean of 0 and a variance of 1. If X is the total score, M its mean and SD its standard deviation, then Z-score = (X - M) / SD.

T-score is a transformed Z-score with a mean of 50 and a standard deviation of 10. If Z is the Z-score, then T-score = (Z * 10) + 50.

#### Selected R code

library(difNLR)

# loading data
data(GMAT)
data <- GMAT[, 1:20]

# scores calculations
score <- apply(data, 1, sum)             # Total score
tosc <- sort(unique(score))              # Levels of total score
perc <- cumsum(prop.table(table(score))) # Percentiles
sura <- 100 * (tosc / max(score))        # Success rate
zsco <- sort(unique(scale(score)))       # Z-score
tsco <- 50 + 10 * zsco                   # T-score

### Reliability

We are typically interested in unobserved true score $$T$$, but have available only the observed score $$X$$ which is contaminated by some measurement error $$e$$, such that $$X = T + e$$ and error term is uncorrelated with the true score.

#### Equation

Reliability is defined as squared correlation of the true and observed score

$$\text{rel}(X) = \text{cor}(T, X)^2$$

Equivalently, reliability can be re-expressed as the ratio of the true score variance to total observed variance

$$\text{rel}(X) = \frac{\sigma^2_T}{\sigma^2_X}$$
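The two definitions agree because the error is uncorrelated with the true score: then $$\text{cov}(T, X) = \sigma^2_T$$, so $$\text{cor}(T, X)^2 = \frac{\sigma^4_T}{\sigma^2_T \sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_X}$$. The following base R simulation is a rough illustration of this equivalence; the sample size, the variances and all object names are assumptions chosen for the example. In practice $$T$$ is unobserved, which is why reliability has to be estimated by other means.

```r
set.seed(123)
n <- 100000
true_score <- rnorm(n, mean = 10, sd = 2) # T, variance 4
error <- rnorm(n, mean = 0, sd = 1)       # e, variance 1, independent of T
observed <- true_score + error            # X = T + e

rel_cor <- cor(true_score, observed)^2     # squared correlation of T and X
rel_var <- var(true_score) / var(observed) # ratio of true to observed variance

# both estimates should be close to the theoretical value 4 / (4 + 1) = 0.8
c(rel_cor = rel_cor, rel_var = rel_var)
```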

### Spearman-Brown formula

#### Equation

For test with $$I$$ items total score is calculated as $$X = X_1 + ... + X_I$$. Let $$\text{rel}(X)$$ be the reliability of the test. For a test consisting of $$I^*$$ items (equally precise, measuring the same construct), that is for test which is $$m = \frac{I^*}{I}$$ times longer/shorter, the reliability would be

$$\text{rel}(X^*) = \frac{m\cdot \text{rel}(X)}{1 + (m - 1)\cdot\text{rel}(X)}.$$

Spearman-Brown formula can be used to determine reliability of a test with similar items but of different number of items. It can also be used to determine necessary number of items to achieve desired reliability.
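Both uses of the formula can be sketched in base R. The functions `sb_reliability()` and `sb_length_ratio()` are hypothetical helpers written for this illustration (the second is the formula above solved for $$m$$), and the reliability of 0.7 and the 20 items are assumed values.

```r
# reliability of a test m times longer (or shorter, for m < 1)
sb_reliability <- function(rel, m) {
  m * rel / (1 + (m - 1) * rel)
}

# length ratio m needed to reach a target reliability
sb_length_ratio <- function(rel, rel_target) {
  rel_target * (1 - rel) / (rel * (1 - rel_target))
}

rel_original <- 0.7  # assumed reliability of the original test
items_original <- 20 # assumed number of items in the original test

# reliability after lengthening the test to 30 comparable items
rel_30 <- sb_reliability(rel_original, m = 30 / items_original)

# number of items needed for a reliability of 0.8, rounded up
m_needed <- sb_length_ratio(rel_original, rel_target = 0.8)
items_needed <- ceiling(m_needed * items_original)
c(rel_30 = rel_30, m_needed = m_needed, items_needed = items_needed)
```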

In the calculations below, reliability of the original data is by default set to the value of Cronbach's $$\alpha$$ of the dataset currently in use. Number of items in the original data is by default set to the number of items of the dataset currently in use.

#### Estimate of reliability with different number of items

Here you can calculate estimate of reliability of a test consisting of different number of items (equally precise, measuring the same construct).

#### Necessary number of items for required level of reliability

Here you can calculate necessary number of items (equally precise, measuring the same construct) to gain required level of reliability.

#### Selected R code

library(psychometric)
library(ShinyItemAnalysis)

# loading data
data(HCI)
data <- HCI[, 1:20]

# reliability of original data
rel.original <- psychometric::alpha(data)
# number of items in original data
items.original <- ncol(data)

# number of items in new data
items.new <- 30
# ratio of tests lengths
m <- items.new / items.original
# determining reliability
psychometric::SBrel(Nlength = m, rxx = rel.original)

# desired reliability
rel.new <- 0.8
# determining test length
(m.new <- psychometric::SBlength(rxxp = rel.new, rxx = rel.original))
# number of required items
m.new * items.original

### Split-half method

Split-half method uses correlation between two subscores for estimation of reliability. The underlying assumption is that the two halves of the test (or even all items on the test) are equally precise and measure the same underlying construct. Spearman-Brown formula is then used to correct the estimate for the number of items.

#### Equation

For test with $$I$$ items total score is calculated as $$X = X_1 + ... + X_I$$. Let $$X^*_1$$ and $$X^*_2$$ be total scores calculated from items only in the first and second subsets. Then estimate of reliability is given by Spearman-Brown formula (Spearman, 1910; Brown, 1910) with $$m = 2$$.

$$\text{rel}(X) = \frac{m\cdot \text{cor}(X^*_1, X^*_2)}{1 + (m - 1)\cdot\text{cor}(X^*_1, X^*_2)} = \frac{2\cdot \text{cor}(X^*_1, X^*_2)}{1 + \text{cor}(X^*_1, X^*_2)}$$

Below you can choose from different split-half approaches. First-last method uses correlation between the first half of items and the second half of items. Even-odd places even items into the first subset and odd items into the second one. Random method performs a random split of items, thus the resulting estimate may be different for each call. Revelle's $$\beta$$ is the worst split-half (Revelle, 1979); here the estimate is calculated as the lowest split-half reliability among random splits (10,000 by default). Finally, Average considers 10,000 split halves by default and averages the resulting estimates. The number of split halves can be changed below. In case of an odd number of items, the first subset contains one more item than the second one.
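The split-half approaches can be sketched in base R as follows. This is an illustration under stated assumptions rather than the application's implementation: the binary data are simulated, `split_half()` is a hypothetical helper, and only 1,000 random splits are drawn to keep the example fast.

```r
set.seed(42)
n <- 500      # simulated respondents
n_items <- 20 # simulated items
theta <- rnorm(n)
b <- seq(-2, 2, length.out = n_items)
prob <- plogis(outer(theta, b, "-")) # success probabilities, n x n_items
data <- matrix(rbinom(n * n_items, 1, prob), nrow = n)

# Spearman-Brown corrected correlation of two subscores;
# `first` holds the column indices of the first subset
split_half <- function(data, first) {
  x1 <- rowSums(data[, first, drop = FALSE])
  x2 <- rowSums(data[, -first, drop = FALSE])
  r <- cor(x1, x2)
  2 * r / (1 + r)
}

half <- ceiling(n_items / 2)
first_last <- split_half(data, 1:half)
even_odd <- split_half(data, seq(2, n_items, by = 2))
random_splits <- replicate(1000, split_half(data, sample(n_items, half)))

c(first_last = first_last,
  even_odd = even_odd,
  random = random_splits[1],      # a single random split
  lowest = min(random_splits),    # lowest of the sampled splits
  average = mean(random_splits))  # average over the sampled splits
```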

#### Reliability estimate with confidence interval

Estimate of reliability for First-last, Even-odd, Random and Revelle's $$\beta$$ is calculated using the Spearman-Brown formula. Confidence interval is based on the confidence interval of the correlation using the delta method. Estimate of reliability for the Average method is the mean value of sampled reliabilities, and the confidence interval is the confidence interval of this mean.

#### Histogram of reliability estimates

Histogram is based on selected number of split halves estimates (10,000 by default). The current estimate is highlighted by red colour.

### 1PL model

Item Response Theory (IRT) models are mixed-effect regression models in which respondent ability $$\theta$$ is assumed to be latent and is estimated together with item parameters.

In 1PL IRT model, all items are assumed to have the same slope in inflection point, i.e., the same discrimination $$a$$. Its value corresponds to standard deviation of ability estimates in Rasch model. Items can differ in location of their inflection point, i.e., in item difficulty parameters $$b$$. Model parameters are estimated using marginal maximum likelihood (MML) method. Ability $$\theta$$ is assumed to follow standard normal distribution.

#### Equation

$$\mathrm{P}\left(Y_{ij} = 1\vert \theta_{i}, a, b_{j} \right) = \frac{e^{a\left(\theta_{i}-b_{j}\right) }}{1+e^{a\left(\theta_{i}-b_{j}\right) }}$$

#### Table of estimated parameters

Estimates of parameters are completed by SX2 item fit statistics (Orlando and Thissen, 2000). SX2 statistics are computed only when no missing data are present.

### Two parameter Item Response Theory model

Item Response Theory (IRT) models are mixed-effect regression models in which respondent ability $$\theta$$ is assumed to be latent and is estimated together with item parameters.

2PL IRT model allows for different slopes in inflection point, i.e., different discrimination parameters $$a$$. Items can also differ in location of their inflection point, i.e., in item difficulty parameters $$b$$. Model parameters are estimated using marginal maximum likelihood (MML) method. Ability $$\theta$$ is assumed to follow standard normal distribution.

#### Equation

$$\mathrm{P}\left(Y_{ij} = 1\vert \theta_{i}, a_{j}, b_{j}\right) = \frac{e^{a_{j}\left(\theta_{i}-b_{j}\right) }}{1+e^{a_{j}\left(\theta_{i}-b_{j}\right) }}$$

#### Table of estimated parameters

Estimates of parameters are completed by SX2 item fit statistics (Orlando and Thissen, 2000). SX2 statistics are computed only when no missing data are present.

#### Ability estimates

This table shows the response score of only six respondents. If you want to see scores for all respondents, click on Download abilities button.

#### Selected R code

library(difNLR)
library(mirt)
data(GMAT)
data <- GMAT[, 1:20]

# Model
fit <- mirt(data, model = 1, itemtype = "2PL", SE = T)
# Item Characteristic Curves
plot(fit, type = "trace", facet_items = F)
# Item Information Curves
plot(fit, type = "infotrace", facet_items = F)
# Test Information Function
plot(fit, type = "infoSE")
# Coefficients
coef(fit, simplify = TRUE)
coef(fit, IRTpars = TRUE, simplify = TRUE)
# Item fit statistics
itemfit(fit)
# Factor scores vs Standardized total scores
fs <- as.vector(fscores(fit))
sts <- as.vector(scale(apply(data, 1, sum)))
plot(fs ~ sts)

# You can also use ltm library for IRT models
library(difNLR)
library(ltm)
data(GMAT)
data <- GMAT[, 1:20]

# Model
fit <- ltm(data ~ z1, IRT.param = TRUE)
# Item Characteristic Curves
plot(fit)
# Item Information Curves
plot(fit, type = "IIC")
# Test Information Function
plot(fit, items = 0, type = "IIC")
# Coefficients
coef(fit)
# Factor scores vs Standardized total scores
df1 <- ltm::factor.scores(fit, return.MIvalues = T)$score.dat
FS <- as.vector(df1[, "z1"])
df2 <- df1
df2$Obs <- df2$Exp <- df2$z1 <- df2$se.z1 <- NULL
STS <- as.vector(scale(apply(df2, 1, sum)))
df <- data.frame(FS, STS)
plot(FS ~ STS, data = df, xlab = "Standardized total score", ylab = "Factor score")

### Three parameter Item Response Theory model

Item Response Theory (IRT) models are mixed-effect regression models in which respondent ability $$\theta$$ is assumed to be latent and is estimated together with item parameters.

3PL IRT model allows for different discriminations of items $$a$$, different item difficulties $$b$$ and allows also for nonzero left asymptote, pseudo-guessing $$c$$. Model parameters are estimated using marginal maximum likelihood (MML) method. Ability $$\theta$$ is assumed to follow standard normal distribution.

#### Equation

$$\mathrm{P}\left(Y_{ij} = 1\vert \theta_{i}, a_{j}, b_{j}, c_{j} \right) = c_{j} + \left(1 - c_{j}\right) \cdot \frac{e^{a_{j}\left(\theta_{i}-b_{j}\right) }}{1+e^{a_{j}\left(\theta_{i}-b_{j}\right) }}$$

#### Table of estimated parameters

Estimates of parameters are completed by SX2 item fit statistics (Orlando and Thissen, 2000). SX2 statistics are computed only when no missing data are present.

#### Ability estimates

This table shows the response score of only six respondents. If you want to see scores for all respondents, click on Download abilities button.

#### Selected R code

library(difNLR)
library(mirt)
data(GMAT)
data <- GMAT[, 1:20]

# Model
fit <- mirt(data, model = 1, itemtype = "3PL", SE = T)
# Item Characteristic Curves
plot(fit, type = "trace", facet_items = F)
# Item Information Curves
plot(fit, type = "infotrace", facet_items = F)
# Test Information Function
plot(fit, type = "infoSE")
# Coefficients
coef(fit, simplify = TRUE)
coef(fit, IRTpars = TRUE, simplify = TRUE)
# Item fit statistics
itemfit(fit)
# Factor scores vs Standardized total scores
fs <- as.vector(fscores(fit))
sts <- as.vector(scale(apply(data, 1, sum)))
plot(fs ~ sts)

# You can also use ltm library for IRT models
library(difNLR)
library(ltm)
data(GMAT)
data <- GMAT[, 1:20]

# Model
fit <- tpm(data, IRT.param = TRUE)
# Item Characteristic Curves
plot(fit)
# Item Information Curves
plot(fit, type = "IIC")
# Test Information Function
plot(fit, items = 0, type = "IIC")
# Coefficients
coef(fit)
# Factor scores vs Standardized total scores
df1 <- ltm::factor.scores(fit, return.MIvalues = T)$score.dat
FS <- as.vector(df1[, "z1"])
df2 <- df1
df2$Obs <- df2$Exp <- df2$z1 <- df2$se.z1 <- NULL
STS <- as.vector(scale(apply(df2, 1, sum)))
df <- data.frame(FS, STS)
plot(FS ~ STS, data = df, xlab = "Standardized total score", ylab = "Factor score")

### Four parameter Item Response Theory model

Item Response Theory (IRT) models are mixed-effect regression models in which respondent ability $$\theta$$ is assumed to be latent and is estimated together with item parameters.

4PL IRT model allows for different discriminations of items $$a$$, different item difficulties $$b$$, nonzero left asymptote, pseudo-guessing $$c$$ and also for upper asymptote lower than one, i.e., inattention parameter $$d$$. Model parameters are estimated using marginal maximum likelihood (MML) method. Ability $$\theta$$ is assumed to follow standard normal distribution.

#### Equation

$$\mathrm{P}\left(Y_{ij} = 1\vert \theta_{i}, a_{j}, b_{j}, c_{j}, d_{j} \right) = c_{j} + \left(d_{j} - c_{j}\right) \cdot \frac{e^{a_{j}\left(\theta_{i}-b_{j}\right) }}{1+e^{a_{j}\left(\theta_{i}-b_{j}\right) }}$$

#### Table of estimated parameters

Estimates of parameters are completed by SX2 item fit statistics (Orlando and Thissen, 2000). SX2 statistics are computed only when no missing data are present.

#### Ability estimates

This table shows the response score of only six respondents. If you want to see scores for all respondents, click on Download abilities button.

#### Selected R code

library(difNLR)
library(mirt)
data(GMAT)
data <- GMAT[, 1:20]

# Model
fit <- mirt(data, model = 1, itemtype = "4PL", SE = T)
# Item Characteristic Curves
plot(fit, type = "trace", facet_items = F)
# Item Information Curves
plot(fit, type = "infotrace", facet_items = F)
# Test Information Function
plot(fit, type = "infoSE")
# Coefficients
coef(fit, simplify = TRUE)
coef(fit, IRTpars = TRUE, simplify = TRUE)
# Item fit statistics
itemfit(fit)
# Factor scores vs Standardized total scores
fs <- as.vector(fscores(fit))
sts <- as.vector(scale(apply(data, 1, sum)))
plot(fs ~ sts)

### Item Response Theory model selection

Item Response Theory (IRT) models are mixed-effect regression models in which respondent ability $$\theta$$ is assumed to be latent and is estimated together with item parameters. Model parameters are estimated using marginal maximum likelihood (MML) method. In 1PL, 2PL, 3PL and 4PL IRT models, ability $$\theta$$ is assumed to follow standard normal distribution.

IRT models can be compared by several information criteria:

• AIC is the Akaike information criterion (Akaike, 1974),
• AICc is AIC with a correction for finite sample size,
• BIC is the Bayesian information criterion (Schwarz, 1978),
• SABIC is the sample-size adjusted BIC.

Another approach to comparing nested IRT models is the likelihood ratio chi-squared test. The significance level is set to 0.05.
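To make the criteria concrete, the following base R sketch computes them from a log-likelihood, the number of estimated parameters and the sample size, using the commonly cited formulas. The log-likelihood values, parameter counts and sample size are invented for illustration, and `info_criteria()` is a hypothetical helper, not a function of any package mentioned here.

```r
# information criteria from log-likelihood, number of parameters k and sample size n
info_criteria <- function(log_lik, k, n) {
  dev <- -2 * log_lik
  c(AIC = dev + 2 * k,
    AICc = dev + 2 * k + 2 * k * (k + 1) / (n - k - 1),
    BIC = dev + k * log(n),
    SABIC = dev + k * log((n + 2) / 24))
}

# made-up results for two nested models fitted to 20 items and 2000 respondents
ll_1pl <- -12400; k_1pl <- 21 # 20 difficulties + 1 common discrimination
ll_2pl <- -12350; k_2pl <- 40 # 20 difficulties + 20 discriminations

ic_1pl <- info_criteria(ll_1pl, k_1pl, n = 2000)
ic_2pl <- info_criteria(ll_2pl, k_2pl, n = 2000)
rbind(ic_1pl, ic_2pl) # lower values indicate the preferred model

# likelihood ratio test of the 1PL model against the 2PL model
lr_stat <- 2 * (ll_2pl - ll_1pl)
p_value <- pchisq(lr_stat, df = k_2pl - k_1pl, lower.tail = FALSE)
c(lr_stat = lr_stat, p_value = p_value)
```

With these invented numbers the larger model is preferred by the test, since the p-value falls below 0.05.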

#### Table of comparison statistics

Row BEST indicates which model has the lowest value of criterion, or is the largest significant model by likelihood ratio test.

#### Selected R code

library(difNLR)
library(mirt)

# loading data
data(GMAT)
data <- GMAT[, 1:20]

# 1PL IRT model
s <- paste("F = 1-", ncol(data), "\n",
           "CONSTRAIN = (1-", ncol(data), ", a1)")
model <- mirt.model(s)
fit1PL <- mirt(data, model = model, itemtype = "2PL")
# 2PL IRT model
fit2PL <- mirt(data, model = 1, itemtype = "2PL")
# 3PL IRT model
fit3PL <- mirt(data, model = 1, itemtype = "3PL")
# 4PL IRT model
fit4PL <- mirt(data, model = 1, itemtype = "4PL")

# comparison
anova(fit1PL, fit2PL)
anova(fit2PL, fit3PL)
anova(fit3PL, fit4PL)

### Bock's nominal Item Response Theory model

The nominal response model (NRM) was introduced by Bock (1972) as a way to model responses to items with two or more nominal categories. This model is suitable for multiple-choice items with no particular ordering of distractors. It is also generalization of some models for ordinal data, e.g. generalized partial credit model (GPCM) or its restricted versions partial credit model (PCM) and rating scale model (RSM).

#### Equation

For $$K$$ possible test choices, the probability of choice $$k$$ for person $$i$$ with latent trait $$\theta_i$$ in item $$j$$ is given by the following equation:

$$\mathrm{P}(Y_{ij} = k \vert \theta_i, a_{j1}, al_{j(l-1)}, d_{j(l-1)}, l = 1, \dots, K) = \frac{e^{(ak_{j(k-1)} \cdot a_{j1} \cdot \theta_i + d_{j(k-1)})}}{\sum_l e^{(al_{j(l-1)} \cdot a_{j1} \cdot \theta_i + d_{j(l-1)})}}$$
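The equation can be evaluated directly in base R. The sketch below uses a hypothetical helper `nrm_probs()` and arbitrary parameter values for one item with $$K = 4$$ choices; the vectors `ak` and `d` hold the category-specific slopes and intercepts from the equation, one entry per choice.

```r
# category probabilities of the nominal response model for one item
nrm_probs <- function(theta, a1, ak, d) {
  z <- ak * a1 * theta + d # one linear predictor per category
  exp(z) / sum(exp(z))     # divide by the total over all categories
}

# arbitrary illustrative parameters, not estimates from any dataset
p <- nrm_probs(theta = 0.5, a1 = 1.2,
               ak = c(0, 1, 2, 3),
               d = c(0, 0.5, 0.3, -0.4))
p
sum(p) # probabilities over the K choices add up to 1
```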

#### Ability estimates

This table shows the response score of only six respondents. If you want to see scores for all respondents, click on Download abilities button.

#### Selected R code

library(difNLR)
library(mirt)
library(ShinyItemAnalysis)

# loading data
data("dataMedicalgraded")
data <- dataMedicalgraded[, 1:100]

# model
fit <- mirt(data, model = 1, itemtype = "nominal")
# item characteristic curves
plot(fit, type = "trace", facet_items = F)
# item information curves
plot(fit, type = "infotrace", facet_items = F)
# test information function
plot(fit, type = "infoSE")
# coefficients
coef(fit, simplify = TRUE)
coef(fit, IRTpars = TRUE, simplify = TRUE)
# factor scores vs standardized total scores
fs <- as.vector(fscores(fit))
sts <- as.vector(scale(apply(data, 1, sum)))
plot(fs ~ sts)

### Dichotomous models

Dichotomous models are used for modelling items producing a simple binary response (e.g., true/false). The most complex unidimensional dichotomous IRT model described here is the 4PL IRT model. Rasch model (Rasch, 1960) assumes discrimination fixed to $$a = 1$$, guessing fixed to $$c = 0$$ and inattention fixed to $$d = 1$$. Similarly, other restricted models (1PL, 2PL and 3PL models) can be obtained by fixing appropriate parameters in the 4PL model.

In this section, you can explore behavior of two item characteristic curves $$\mathrm{P}\left(\theta\right)$$ and their item information functions $$\mathrm{I}\left(\theta\right)$$ in 4PL IRT model.

#### Parameters

Select parameters $$a$$ (discrimination), $$b$$ (difficulty), $$c$$ (guessing) and $$d$$ (inattention). By constraining $$a = 1$$, $$c = 0$$, $$d = 1$$ you get Rasch model. With option $$c = 0$$ and $$d = 1$$ you get 2PL model and with option $$d = 1$$ 3PL model.

When different curve parameters describe properties of the same item but for different groups of respondents, this phenomenon is called Differential Item Functioning (DIF). See further section for more information.

Select also the value of latent ability $$\theta$$ to see the interpretation of the item characteristic curves.

#### Equations

$$\mathrm{P}\left(\theta \vert a, b, c, d \right) = c + \left(d - c\right) \cdot \frac{e^{a\left(\theta-b\right) }}{1+e^{a\left(\theta-b\right) }}$$

$$\mathrm{I}\left(\theta \vert a, b, c, d \right) = a^2 \cdot \left(d - c\right) \cdot \frac{e^{a\left(\theta-b\right) }}{\left[1+e^{a\left(\theta-b\right)}\right]^2}$$

#### Exercise 1

Consider the following 2PL items with parameters
Item 1: $$a = 2.5, b = -0.5$$
Item 2: $$a = 1.5, b = 0$$
For these items, complete the following exercises with an accuracy of up to 0.05. Then click on the Submit answers button. If you need a hint, click on the blue button with a question mark.

• Sketch item characteristic and information curves.
• Calculate probability of correct answer for latent abilities $$\theta = -2, -1, 0, 1, 2$$.
Item 1:
Item 2:
• For what level of ability $$\theta$$ are the probabilities equal?
• Which item provides more information for weak ($$\theta = -2$$), average ($$\theta = 0$$) and strong ($$\theta = 2$$) students?
$$\theta = -2$$
$$\theta = 0$$
$$\theta = 2$$

#### Exercise 2

Consider now 2 items with following parameters
Item 1: $$a = 1.5, b = 0, c = 0, d = 1$$
Item 2: $$a = 1.5, b = 0, c = 0.2, d = 1$$
For these items, complete the following exercises with an accuracy of up to 0.05. Then click on the Submit answers button.

• What is the lower asymptote for items?
Item 1:
Item 2:
• What is the probability of correct answer for latent ability $$\theta = b$$?
Item 1:
Item 2:

#### Exercise 3

Consider now 2 items with following parameters
Item 1: $$a = 1.5, b = 0, c = 0, d = 0.9$$
Item 2: $$a = 1.5, b = 0, c = 0, d = 1$$
For these items, complete the following exercises with an accuracy of up to 0.05. Then click on the Submit answers button.

• What is the upper asymptote for items?
Item 1:
Item 2:
• What is the probability of correct answer for latent ability $$\theta = b$$?
Item 1:
Item 2:

#### Selected R code

library(ggplot2)
library(data.table)

# parameters
a1 <- 1; b1 <- 0; c1 <- 0; d1 <- 1
a2 <- 2; b2 <- 0.5; c2 <- 0; d2 <- 1

# latent ability
theta <- seq(-4, 4, 0.01)
# latent ability level
theta0 <- 0

# function for IRT characteristic curve
icc_irt <- function(theta, a, b, c, d){
  return(c + (d - c)/(1 + exp(-a*(theta - b))))
}

# calculation of characteristic curves
df <- data.frame(theta,
                 "icc1" = icc_irt(theta, a1, b1, c1, d1),
                 "icc2" = icc_irt(theta, a2, b2, c2, d2))
# convert to data.table so that data.table's melt method is used
df <- melt(as.data.table(df), id.vars = "theta")

# plot for characteristic curves
ggplot(df, aes(x = theta, y = value, color = variable)) +
  geom_line() +
  geom_segment(aes(y = icc_irt(theta0, a = a1, b = b1, c = c1, d = d1),
                   yend = icc_irt(theta0, a = a1, b = b1, c = c1, d = d1),
                   x = -4, xend = theta0),
               color = "gray", linetype = "dashed") +
  geom_segment(aes(y = icc_irt(theta0, a = a2, b = b2, c = c2, d = d2),
                   yend = icc_irt(theta0, a = a2, b = b2, c = c2, d = d2),
                   x = -4, xend = theta0),
               color = "gray", linetype = "dashed") +
  geom_segment(aes(y = 0,
                   yend = max(icc_irt(theta0, a = a1, b = b1, c = c1, d = d1),
                              icc_irt(theta0, a = a2, b = b2, c = c2, d = d2)),
                   x = theta0, xend = theta0),
               color = "gray", linetype = "dashed") +
  xlim(-4, 4) +
  xlab("Ability") +
  ylab("Probability of correct answer") +
  theme_bw() +
  ylim(0, 1) +
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  ggtitle("Item characteristic curve")

# function for IRT information function
iic_irt <- function(theta, a, b, c, d){
  return(a^2*(d-c)*exp(a*(theta-b))/(1 + exp(a*(theta-b)))^2)
}

# calculation of information curves
df <- data.frame(theta,
                 "iic1" = iic_irt(theta, a1, b1, c1, d1),
                 "iic2" = iic_irt(theta, a2, b2, c2, d2))
# convert to data.table so that data.table's melt method is used
df <- melt(as.data.table(df), id.vars = "theta")

# plot for information curves
ggplot(df, aes(x = theta, y = value, color = variable)) +
  geom_line() +
  xlim(-4, 4) +
  xlab("Ability") +
  ylab("Information") +
  theme_bw() +
  ylim(0, 4) +
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  ggtitle("Item information curve")

### Polytomous models

Polytomous models are used when partial score is possible, or when items are graded on Likert scale (e.g. from Totally disagree to Totally agree); some polytomous models can also be used when analyzing multiple-choice items. In this section you can explore item response functions of some polytomous models.

Two main classes of polytomous IRT models are considered:

Difference models are defined by setting mathematical form to cumulative probabilities, while category probabilities are calculated as their difference. These models are also sometimes called cumulative logit models as they set linear form to cumulative logits.

As an example, Graded Response Model (GRM; Samejima, 1970) uses 2PL IRT model to describe cumulative probabilities (probabilities to obtain score higher than 1, 2, 3, etc.). Category probabilities are then described as differences of two subsequent cumulative probabilities.

For divide-by-total models response category probabilities are defined as the ratio between category-related functions and their sum.

In Generalized Partial Credit Model (GPCM; Muraki, 1992), probability of the successful transition from one category score to the next category score is modelled by 2PL IRT model, while Partial Credit Model (PCM; Masters, 1982) uses 1PL IRT model to describe this probability. Even more restricted version, the Rating Scale Model (RSM; Andrich, 1978) assumes exactly the same K response categories for each item and threshold parameters which can be split into a response-threshold parameter and an item-specific location parameter. These models are also sometimes called adjacent-category logit models as they set linear form to adjacent logits.

To model distractor properties in multiple-choice items, Nominal Response Model (NRM; Bock, 1972) can be used. NRM is an IRT analogue of the multinomial regression model. This model is also a generalization of the GPCM/PCM/RSM ordinal models. NRM is also sometimes called baseline-category logit model, as it sets linear form to the log of odds of selecting a given category over a baseline category. The baseline can be chosen arbitrarily, although usually the correct answer or the first answer is chosen.

### Graded response model

Graded response model (GRM; Samejima, 1970) uses 2PL IRT model to describe cumulative probabilities (probabilities of obtaining a score of at least 1, 2, 3, etc.). Category probabilities are then described as differences of two subsequent cumulative probabilities.

It belongs to class of difference models, which are defined by setting mathematical form to cumulative probabilities, while category probabilities are calculated as their difference. These models are also sometimes called cumulative logit models, as they set linear form to cumulative logits.

#### Parameters

Select number of responses and difficulty for cumulative probabilities $$b$$ and common discrimination parameter $$a$$. Cumulative probability $$P(Y \geq 0)$$ is always equal to 1 and is not displayed; the corresponding category probability $$P(Y = 0)$$ is displayed in black.

#### Equations

$$\pi_k^* = \mathrm{P}\left(Y \geq k \vert \theta, a, b_k\right) = \frac{e^{a\left(\theta-b_k\right)}}{1+e^{a\left(\theta-b_k\right)}}$$

$$\pi_k = \mathrm{P}\left(Y = k \vert \theta, a, b_k, b_{k+1}\right) = \pi_k^* - \pi_{k+1}^*$$

$$\mathrm{E}\left(Y \vert \theta, a, b_1, \dots, b_K\right) = \sum_{k = 0}^K k\pi_k$$
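
As a rough illustration of the equations above, the following sketch evaluates cumulative probabilities, category probabilities, and the expected item score at a single ability level. The helper name `grm_probs` and the parameter values are assumptions made for this illustration only.

```r
# Hypothetical helper (not part of any package): GRM probabilities at one ability level
grm_probs <- function(theta, a, b) {
  # cumulative probabilities P(Y >= k) for k = 0, 1, ..., K; P(Y >= 0) is always 1
  cum <- c(1, 1 / (1 + exp(-a * (theta - b))))
  # category probabilities P(Y = k) as differences of subsequent cumulative probabilities
  catp <- cum - c(cum[-1], 0)
  list(
    cumulative = cum,
    category = catp,
    expected = sum((seq_along(catp) - 1) * catp)
  )
}

# illustrative values: item rated 0-1-2-3, evaluated at theta = 0
res <- grm_probs(theta = 0, a = 1, b = c(-0.5, 1, 1.5))
res$cumulative
res$category
res$expected
```

Because the category probabilities are differences of subsequent cumulative probabilities, they sum to 1 and the expected item score equals the sum of the cumulative probabilities for $$k \geq 1$$.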

#### Exercise

Consider item following graded response model rated $$0-1-2-3$$, with discrimination $$a = 1$$ and difficulties $$b_{1} = -0.5$$, $$b_{2} = 1$$ and $$b_{3} = 1.5$$.

• Calculate probabilities of obtaining $$k$$ and more points for specific level of ability $$\theta$$
$$k \geq 0$$
$$k \geq 1$$
$$k \geq 2$$
$$k \geq 3$$
• Calculate probabilities of obtaining exactly $$k$$ points for specific level of ability $$\theta$$
$$k = 0$$
$$k = 1$$
$$k = 2$$
$$k = 3$$
• What is the expected item score for specific level of ability $$\theta$$?
$$\theta = -2$$
$$\theta = -1$$
$$\theta = 0$$
$$\theta = 1$$
$$\theta = 2$$

#### Selected R code

    library(ggplot2)
    library(data.table)

    # setting parameters
    a <- 1
    b <- c(-1.5, -1, -0.5, 0)
    theta <- seq(-4, 4, 0.01)

    # calculating cumulative probabilities
    ccirt <- function(theta, a, b) {
      return(1 / (1 + exp(-a * (theta - b))))
    }
    df1 <- data.frame(sapply(1:length(b), function(i) ccirt(theta, a, b[i])), theta)
    # melt() from data.table expects a data.table
    df1 <- melt(as.data.table(df1), id.vars = "theta")

    # plotting cumulative probabilities
    ggplot(data = df1, aes(x = theta, y = value, col = variable)) +
      geom_line() +
      xlab("Ability") +
      ylab("Cumulative probability") +
      xlim(-4, 4) +
      ylim(0, 1) +
      theme_bw() +
      theme(text = element_text(size = 14),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank()) +
      ggtitle("Cumulative probabilities") +
      scale_color_manual("", values = c("red", "yellow", "green", "blue"),
                         labels = paste0("P(Y >= ", 1:4, ")"))

    # calculating category probabilities
    df2 <- data.frame(1, sapply(1:length(b), function(i) ccirt(theta, a, b[i])))
    df2 <- data.frame(sapply(1:length(b), function(i) df2[, i] - df2[, i + 1]),
                      df2[, ncol(df2)], theta)
    df2 <- melt(as.data.table(df2), id.vars = "theta")

    # plotting category probabilities
    ggplot(data = df2, aes(x = theta, y = value, col = variable)) +
      geom_line() +
      xlab("Ability") +
      ylab("Category probability") +
      xlim(-4, 4) +
      ylim(0, 1) +
      theme_bw() +
      theme(text = element_text(size = 14),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank()) +
      ggtitle("Category probabilities") +
      scale_color_manual("", values = c("black", "red", "yellow", "green", "blue"),
                         labels = paste0("P(Y = ", 0:4, ")"))

    # calculating expected item score
    df3 <- data.frame(1, sapply(1:length(b), function(i) ccirt(theta, a, b[i])))
    df3 <- data.frame(sapply(1:length(b), function(i) df3[, i] - df3[, i + 1]),
                      df3[, ncol(df3)])
    df3 <- data.frame(exp = as.matrix(df3) %*% 0:4, theta)

    # plotting expected item score
    ggplot(data = df3, aes(x = theta, y = exp)) +
      geom_line() +
      xlab("Ability") +
      ylab("Expected item score") +
      xlim(-4, 4) +
      ylim(0, 4) +
      theme_bw() +
      theme(text = element_text(size = 14),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank()) +
      ggtitle("Expected item score")

### Generalized partial credit model

In Generalized Partial Credit Model (GPCM; Muraki, 1992), probability of the successful transition from one category score to the next category score is modelled by 2PL IRT model. The response category probabilities are then ratios between category-related functions (cumulative sums of exponentials) and their sum.

Two simpler models can be derived from GPCM by restricting some parameters: Partial Credit Model (PCM; Masters, 1982) uses 1PL IRT model to describe this probability, thus parameters $$\alpha = 1$$. Even more restricted version, the Rating Scale Model (RSM; Andrich, 1978) assumes exactly the same K response categories for each item and threshold parameters which can be split into a response-threshold parameter $$\lambda_t$$ and an item-specific location parameter $$\delta_i$$. These models are also sometimes called adjacent logit models, as they set linear form to adjacent logits.

#### Parameters

Select number of responses and their threshold parameters $$\delta$$ and common discrimination parameter $$\alpha$$. With $$\alpha = 1$$ you get PCM. Numerator of $$\pi_0 = P(Y = 0)$$ is set to 1 and $$\pi_0$$ is displayed with black color.

#### Equations

$$\pi_k = \mathrm{P}\left(Y = k \vert \theta, \alpha, \delta_0, \dots, \delta_K\right) = \frac{\exp\sum_{t = 0}^k \alpha(\theta - \delta_t)}{\sum_{r = 0}^K\exp\sum_{t = 0}^r \alpha(\theta - \delta_t)}$$

$$\mathrm{E}\left(Y \vert \theta, \alpha, \delta_0, \dots, \delta_K\right) = \sum_{k = 0}^K k\pi_k$$
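
The category probabilities can also be evaluated at a single ability level, which may help when checking values by hand. This is a minimal sketch; the helper name `gpcm_probs` and the parameter values are assumptions for illustration, and the term for category 0 is fixed to 0 so that the numerator of $$\pi_0$$ equals 1.

```r
# Hypothetical helper: GPCM category probabilities at one ability level
gpcm_probs <- function(theta, alpha, delta) {
  # cumulative sums of alpha * (theta - delta_t); the term for category 0 is 0,
  # so the numerator of P(Y = 0) is exp(0) = 1
  z <- c(0, cumsum(alpha * (theta - delta)))
  exp(z) / sum(exp(z))
}

# illustrative item rated 0-1-2 with thresholds -1 and 1, evaluated at theta = -1
p <- gpcm_probs(theta = -1, alpha = 1, delta = c(-1, 1))
p
# expected item score at this ability level
sum((seq_along(p) - 1) * p)
```

Since the ratio of probabilities of two adjacent categories $$k$$ and $$k - 1$$ is $$\exp(\alpha(\theta - \delta_k))$$, their curves cross where $$\theta = \delta_k$$; here the probabilities of scores 0 and 1 coincide at $$\theta = -1$$.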

#### Exercise

Consider item following generalized partial credit model rated $$0-1-2$$, with discrimination $$a = 1$$ and threshold parameters $$d_{1} = -1$$ and $$d_{2} = 1$$.

• For what ability levels do the category probability curves cross?
• What is the expected item score for these ability levels?
$$\theta = -1.5$$
$$\theta = 0$$
$$\theta = 1.5$$
• Change discrimination to $$a = 2$$. Do the category probability curves cross for the same ability levels?
• How did the expected item score change for these ability levels?
$$\theta = -1.5$$
$$\theta = 0$$
$$\theta = 1.5$$

#### Selected R code

    library(ggplot2)
    library(data.table)

    # setting parameters
    a <- 1
    d <- c(-1.5, -1, -0.5, 0)
    theta <- seq(-4, 4, 0.01)

    # calculating category probabilities
    ccgpcm <- function(theta, a, d) {
      a * (theta - d)
    }
    df <- sapply(1:length(d), function(i) ccgpcm(theta, a, d[i]))
    pk <- sapply(1:ncol(df), function(k) apply(as.data.frame(df[, 1:k]), 1, sum))
    pk <- cbind(0, pk)
    pk <- exp(pk)
    denom <- apply(pk, 1, sum)
    df <- apply(pk, 2, function(x) x / denom)
    # melt() from data.table expects a data.table
    df1 <- melt(as.data.table(data.frame(df, theta)), id.vars = "theta")

    # plotting category probabilities
    ggplot(data = df1, aes(x = theta, y = value, col = variable)) +
      geom_line() +
      xlab("Ability") +
      ylab("Category probability") +
      xlim(-4, 4) +
      ylim(0, 1) +
      theme_bw() +
      theme(text = element_text(size = 14),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank()) +
      ggtitle("Category probabilities") +
      scale_color_manual("", values = c("black", "red", "yellow", "green", "blue"),
                         labels = paste0("P(Y = ", 0:4, ")"))

    # calculating expected item score
    df2 <- data.frame(exp = as.matrix(df) %*% 0:4, theta)

    # plotting expected item score
    ggplot(data = df2, aes(x = theta, y = exp)) +
      geom_line() +
      xlab("Ability") +
      ylab("Expected item score") +
      xlim(-4, 4) +
      ylim(0, 4) +
      theme_bw() +
      theme(text = element_text(size = 14),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank()) +
      ggtitle("Expected item score")

### Nominal response model

In Nominal Response Model (NRM; Bock, 1972), probability of selecting given category over baseline category is modelled by 2PL IRT model. This model is also sometimes called baseline-category logit model, as it sets linear form to the log of odds of selecting a given category over a baseline category. The baseline can be chosen arbitrarily, although usually the correct answer or the first answer is chosen. NRM generalizes GPCM by allowing item-specific and category-specific intercept and slope parameters.

#### Parameters

Select number of distractors and their threshold parameters $$\delta$$ and discrimination parameters $$\alpha$$. Parameters of $$\pi_0 = P(Y = 0)$$ are set to zeros and $$\pi_0$$ is displayed with black color.

#### Equations

$$\pi_k =\mathrm{P}\left(Y = k \vert \theta, \alpha_0, \dots, \alpha_K, \delta_0, \dots, \delta_K\right) = \frac{\exp(\alpha_k\theta + \delta_k)}{\sum_{r = 0}^K\exp(\alpha_r\theta + \delta_r)}$$
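
The following sketch evaluates the equation above at one ability level and checks the baseline-category logit interpretation. The helper name `nrm_probs` and the parameter values are assumptions for illustration; category 0 serves as the baseline with both parameters fixed to zero.

```r
# Hypothetical helper: NRM category probabilities at one ability level
nrm_probs <- function(theta, alpha, delta) {
  # baseline category 0 has alpha_0 = 0 and delta_0 = 0
  z <- c(0, alpha * theta + delta)
  exp(z) / sum(exp(z))
}

alpha <- c(2.5, 2, 1, 1.5)
delta <- c(-1.5, -1, -0.5, 0)
p <- nrm_probs(theta = 0.5, alpha = alpha, delta = delta)
p

# log odds of each category against the baseline are linear in theta
log(p[-1] / p[1])
alpha * 0.5 + delta
```

The last two lines should return the same values, which is what the name baseline-category logit model refers to.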

#### Selected R code

    library(ggplot2)
    library(data.table)

    # setting parameters
    a <- c(2.5, 2, 1, 1.5)
    d <- c(-1.5, -1, -0.5, 0)
    theta <- seq(-4, 4, 0.01)

    # calculating category probabilities
    ccnrm <- function(theta, a, d) {
      exp(d + a * theta)
    }
    df <- sapply(1:length(d), function(i) ccnrm(theta, a[i], d[i]))
    df <- data.frame(1, df)
    denom <- apply(df, 1, sum)
    df <- apply(df, 2, function(x) x / denom)
    # melt() from data.table expects a data.table
    df1 <- melt(as.data.table(data.frame(df, theta)), id.vars = "theta")

    # plotting category probabilities
    ggplot(data = df1, aes(x = theta, y = value, col = variable)) +
      geom_line() +
      xlab("Ability") +
      ylab("Category probability") +
      xlim(-4, 4) +
      ylim(0, 1) +
      theme_bw() +
      theme(text = element_text(size = 14),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank()) +
      ggtitle("Category probabilities") +
      scale_color_manual("", values = c("black", "red", "yellow", "green", "blue"),
                         labels = paste0("P(Y = ", 0:4, ")"))

    # calculating expected item score
    df2 <- data.frame(exp = as.matrix(df) %*% 0:4, theta)

    # plotting expected item score
    ggplot(data = df2, aes(x = theta, y = exp)) +
      geom_line() +
      xlab("Ability") +
      ylab("Expected item score") +
      xlim(-4, 4) +
      ylim(0, 4) +
      theme_bw() +
      theme(text = element_text(size = 14),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank()) +
      ggtitle("Expected item score")

### Differential Item/Distractor Functioning

Differential item functioning (DIF) occurs when respondents from different social groups (such as those defined by gender or ethnicity) with the same underlying ability have a different probability of answering the item correctly or endorsing the item. If an item functions differently for two groups, it is potentially unfair and its wording should be checked. In general, two types of DIF can be distinguished. Uniform DIF describes a situation in which the item advantages one of the groups at all levels of the latent ability (left figure). In such a case, the item has different difficulty (location) parameters for the two groups, while the item discrimination is the same. In contrast, non-uniform DIF (right figure) means that the item advantages one of the groups at lower ability levels and the other group at higher ability levels. In this case, the item has different discrimination (slope) parameters and possibly also different difficulty parameters for the two groups.
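
The two types of DIF can be illustrated numerically with 2PL characteristic curves. This is only a sketch: the function name `icc` and all parameter values are assumptions chosen for illustration, not estimates from any dataset.

```r
# 2PL item characteristic curve
icc <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))
theta <- seq(-4, 4, 0.5)

# uniform DIF: same discrimination, different difficulty in the two groups
diff_uniform <- icc(theta, a = 1, b = 0) - icc(theta, a = 1, b = 1)

# non-uniform DIF: different discrimination (here with the same difficulty)
diff_nonuniform <- icc(theta, a = 1, b = 0) - icc(theta, a = 2, b = 0)

# one group is advantaged at every ability level
all(diff_uniform > 0)
# the sign of the difference changes, i.e. the curves cross
range(sign(diff_nonuniform))
```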

Differential distractor functioning (DDF) occurs when respondents from different groups but with the same latent ability have different probability of selecting at least one distractor choice. Again, two types of DDF can be distinguished - uniform (left figure below) and non-uniform DDF (right figure below).

### Total scores and other matching variables

DIF analysis may come to a different conclusion than test of group differences in total scores. Two groups may have the same distribution of total scores, yet, some items may function differently for the two groups. Also, one of the groups may have significantly lower total score, yet, it may happen that there is no DIF item (Martinkova et al., 2017). This section examines the differences in total scores only. Explore further DIF sections to analyze differential item functioning.

DIF can also be explored with respect to matching criteria other than the total score of analyzed items. For example, to analyze instructional sensitivity, Martinkova et al. (2020) analyzed differential item functioning in change (DIF-C) by analyzing DIF on Grade 9 item answers while matching on Grade 6 total scores of the same respondents in a longitudinal setting (see toy data Learning to Learn 9 in Data section).

#### Comparison of

Notes: The test for a difference between the reference and the focal group is based on the Welch two-sample t-test.
Diff. (CI) - difference in group means with 95% confidence interval, $$t$$-value - test statistic, df - degrees of freedom, $$p$$-value - a value lower than 0.05 indicates a significant difference between the reference and the focal group.

#### Selected R code

    library(ggplot2)
    library(moments)
    library(ShinyItemAnalysis)

    # Loading data
    data(GMAT, package = "difNLR")
    Data <- GMAT[, 1:20]
    group <- GMAT[, "group"]

    # Total score calculation wrt group
    score <- rowSums(Data)
    score0 <- score[group == 0] # reference group
    score1 <- score[group == 1] # focal group

    # Summary of total score
    rbind(
      c(length(score0), min(score0), max(score0), mean(score0), median(score0),
        sd(score0), skewness(score0), kurtosis(score0)),
      c(length(score1), min(score1), max(score1), mean(score1), median(score1),
        sd(score1), skewness(score1), kurtosis(score1))
    )

    df <- data.frame(score, group = as.factor(group))

    # Histogram of total scores wrt group
    ggplot(data = df, aes(x = score, fill = group, col = group)) +
      geom_histogram(binwidth = 1, position = "dodge2", alpha = 0.75) +
      xlab("Total score") +
      ylab("Number of respondents") +
      scale_fill_manual(values = c("dodgerblue2", "goldenrod2"),
                        labels = c("Reference", "Focal")) +
      scale_colour_manual(values = c("dodgerblue2", "goldenrod2"),
                          labels = c("Reference", "Focal")) +
      theme_app() +
      theme(legend.position = "left")

    # t-test to compare total scores
    t.test(score0, score1)

### Delta plot

Delta plot (Angoff & Ford, 1973) compares the proportions of correct answers per item in the two groups. It displays a non-linear transformation of these proportions, based on quantiles of the standard normal distribution (so-called delta scores), for each item in the two groups in a scatterplot called diagonal plot or delta plot (see Figure below). An item is under suspicion of DIF if its delta point departs considerably from the main axis of the ellipsoid formed by delta scores.
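
To make the transformation concrete, the sketch below computes delta scores and perpendicular distances from the major axis by hand. It assumes the usual delta scale $$\Delta = 4\Phi^{-1}(1 - p) + 13$$ and a major axis fitted by orthogonal regression; the proportions are made-up values, and the `deltaPlotR` package may differ in implementation details.

```r
# made-up proportions of correct answers for five items (not GMAT results)
p_ref <- c(0.80, 0.65, 0.50, 0.72, 0.35)
p_foc <- c(0.75, 0.60, 0.48, 0.45, 0.30)

# delta scores: harder items get larger values
delta_ref <- 4 * qnorm(1 - p_ref) + 13
delta_foc <- 4 * qnorm(1 - p_foc) + 13

# major axis of the ellipsoid: delta_foc = a + b * delta_ref
s2_ref <- var(delta_ref)
s2_foc <- var(delta_foc)
s_rf <- cov(delta_ref, delta_foc)
b <- (s2_foc - s2_ref + sqrt((s2_foc - s2_ref)^2 + 4 * s_rf^2)) / (2 * s_rf)
a <- mean(delta_foc) - b * mean(delta_ref)

# signed perpendicular distances of the delta points from the major axis
d_perp <- (b * delta_ref + a - delta_foc) / sqrt(b^2 + 1)
round(d_perp, 3)

# items whose distance exceeds the fixed threshold of 1.5
which(abs(d_perp) > 1.5)
```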

#### Method specification

The detection threshold is either fixed at 1.5 or based on a bivariate normal approximation (Magis & Facon, 2012). The item purification algorithms offered when using the threshold based on the normal approximation are as follows: IPP1 uses the threshold obtained after the first run in all following runs; IPP2 updates only the slope parameter of the threshold formula and thus lessens the impact of DIF items; IPP3 adjusts every parameter and completely discards the effect of items flagged as DIF from the computation of the threshold (for further details see Magis & Facon, 2013). When the fixed threshold is combined with item purification, the threshold (1.5) stays the same throughout the purification algorithm.

#### Summary table

Summary table contains information about proportions of correct answers in the reference and the focal group together with their transformations into delta scores. It also includes distances of delta scores from the main axis of the ellipsoid formed by delta scores.

#### Selected R code

    library(deltaPlotR)

    # Loading data
    data(GMAT, package = "difNLR")
    Data <- GMAT[, 1:20]
    group <- GMAT[, "group"]

    # Delta scores with fixed threshold
    (DS_fixed <- deltaPlot(data = data.frame(Data, group), group = "group",
                           focal.name = 1, thr = 1.5, purify = FALSE))
    # Delta plot
    diagPlot(DS_fixed, thr.draw = TRUE)

    # Delta scores with normal threshold
    (DS_normal <- deltaPlot(data = data.frame(Data, group), group = "group",
                            focal.name = 1, thr = "norm", purify = FALSE))
    # Delta plot
    diagPlot(DS_normal, thr.draw = TRUE)

### Mantel-Haenszel test

Mantel-Haenszel test is a DIF detection method based on contingency tables which are calculated for each level of the total score (Mantel & Haenszel, 1959).

#### Method specification

Here you can select correction method for multiple comparison, and/or item purification.

#### Summary table

Summary table contains information about Mantel-Haenszel $$\chi^2$$ statistics, corresponding $$p$$-values considering selected adjustment, and significance codes. Moreover, the table offers values of Mantel-Haenszel estimates of odds ratio $$\alpha_{\mathrm{MH}}$$, which incorporate all levels of total score, and their transformations into D-DIF indices $$\Delta_{\mathrm{MH}} = -2.35 \log(\alpha_{\mathrm{MH}})$$ to evaluate DIF effect size.
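
The odds ratio estimate and the D-DIF index can be reproduced by hand from the contingency tables. The sketch below uses made-up counts at three total-score levels (rows: reference and focal group, columns: correct and incorrect answer) and the standard Mantel-Haenszel common odds ratio formula; the object names are assumptions for illustration.

```r
# made-up 2 x 2 tables, one per total-score level
# matrix() fills by column: first column = correct, second column = incorrect
tabs <- list(
  matrix(c(10, 6, 20, 24), nrow = 2),
  matrix(c(25, 18, 15, 22), nrow = 2),
  matrix(c(30, 26, 5, 9), nrow = 2)
)

# Mantel-Haenszel estimate of the common odds ratio across score levels
num <- sum(sapply(tabs, function(t) t[1, 1] * t[2, 2] / sum(t)))
den <- sum(sapply(tabs, function(t) t[1, 2] * t[2, 1] / sum(t)))
alpha_mh <- num / den
alpha_mh

# D-DIF index
delta_mh <- -2.35 * log(alpha_mh)
delta_mh
```

With these counts the odds of a correct answer are higher in the reference group at every score level, so $$\alpha_{\mathrm{MH}} > 1$$ and $$\Delta_{\mathrm{MH}} < 0$$.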

#### Selected R code

    library(difR)

    # Loading data
    data(GMAT, package = "difNLR")
    Data <- GMAT[, 1:20]
    group <- GMAT[, "group"]

    # Mantel-Haenszel test
    (fit <- difMH(Data = Data, group = group, focal.name = 1, match = "score",
                  p.adjust.method = "none", purify = FALSE))

### Mantel-Haenszel test

Mantel-Haenszel test is a DIF detection method based on contingency tables which are calculated for each level of total score (Mantel & Haenszel, 1959).

#### Contingency tables and odds ratio calculation

For selected item and for selected level of total score you can display the contingency table and calculate the odds ratio of answering the item correctly. This can be compared to Mantel-Haenszel estimate of odds ratio $$\alpha_{\mathrm{MH}}$$, which incorporates all levels of total score. Further, $$\alpha_{\mathrm{MH}}$$ can be transformed into Mantel-Haenszel D-DIF index $$\Delta_{\mathrm{MH}}$$ to evaluate DIF effect size.

#### Selected R code

    library(difR)
    library(reshape2)

    # Loading data
    data(GMAT, package = "difNLR")
    Data <- GMAT[, 1:20]
    group <- GMAT[, "group"]

    # contingency table for item 1 and score 12
    item <- 1
    cut <- 12
    df <- data.frame(Data[, item], group)
    colnames(df) <- c("Answer", "Group")
    df$Answer <- relevel(factor(df$Answer, labels = c("Incorrect", "Correct")), "Correct")
    df$Group <- factor(df$Group, labels = c("Reference Group", "Focal Group"))
    score <- rowSums(Data) # total score calculation
    df <- df[score == cut, ] # responses of those with total score equal to cut
    dcast(data.frame(xtabs(~ Group + Answer, data = df)),
          Group ~ Answer,
          value.var = "Freq", margins = TRUE, fun.aggregate = sum)

    # Mantel-Haenszel estimate of OR
    (fit <- difMH(Data = Data, group = group, focal.name = 1, match = "score",
                  p.adjust.method = "none", purify = FALSE))
    fit$alphaMH

    # D-DIF index calculation
    -2.35 * log(fit$alphaMH)

### Logistic regression

Logistic regression method allows for detection of uniform and non-uniform DIF (Swaminathan & Rogers, 1990) by including a group specific intercept $$b_{2}$$ (uniform DIF) and group specific interaction $$b_{3}$$ (non-uniform DIF) into model and by testing for their significance.

#### Method specification

Here you can choose what type of DIF to be tested. You can also select correction method for multiple comparison or item purification. Finally, you may change the DIF matching variable. While matching on the standardized total score is typical, upload of other DIF matching variable is possible in Section Data. Using a pre-test (standardized) total score as DIF matching variable allows for testing differential item functioning in change (DIF-C) to provide proofs of instructional sensitivity (Martinkova et al., 2020), also see Learning To Learn 9 toy dataset.

#### Equation

$$\mathrm{P}\left(Y_{ij} = 1 | X_i, G_i, b_0, b_1, b_2, b_3\right) = \frac{e^{b_0 + b_1 X_i + b_2 G_i + b_3 X_i G_i}}{1+e^{b_0 + b_1 X_i + b_2 G_i + b_3 X_i G_i}}$$
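
The testing idea behind this equation can be sketched in base R with nested `glm()` fits and likelihood ratio tests. The data below are simulated with uniform DIF only, and all parameter values and object names are assumptions for illustration; `difLogistic()` has its own implementation, which this sketch does not reproduce.

```r
set.seed(1)
n <- 1000
x <- rnorm(n)                 # standardized matching variable X
g <- rep(0:1, each = n / 2)   # group membership G: 0 = reference, 1 = focal

# simulated responses with uniform DIF only (b2 = -0.6, b3 = 0)
y <- rbinom(n, 1, plogis(0.2 + 1.1 * x - 0.6 * g))

fit_null <- glm(y ~ x, family = binomial)            # no DIF
fit_udif <- glm(y ~ x + g, family = binomial)        # adds b2 (uniform DIF)
fit_both <- glm(y ~ x + g + x:g, family = binomial)  # adds b3 (non-uniform DIF)

# likelihood ratio tests of the nested models
lr_tests <- anova(fit_null, fit_udif, fit_both, test = "Chisq")
lr_tests

# estimates of b0, b1, b2, b3 in the full model
coef(fit_both)
```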

#### Summary table

Summary table contains information about DIF test statistics $$LR(\chi^2)$$, corresponding $$p$$-values considering selected adjustment, and significance codes. Moreover, it offers values of Nagelkerke's $$R^2$$ with DIF effect size classifications. Table also provides estimated parameters for the best fitted model for each item.

#### Selected R code

    library(difR)

    # Loading data
    data(GMAT, package = "difNLR")
    Data <- GMAT[, 1:20]
    group <- GMAT[, "group"]

    # Logistic regression DIF detection method
    (fit <- difLogistic(Data = Data, group = group, focal.name = 1, match = "score",
                        type = "both", p.adjust.method = "none", purify = FALSE))

    # Loading data
    data(LearningToLearn, package = "ShinyItemAnalysis")
    Data <- LearningToLearn[, 87:94]        # item responses from Grade 9 from subscale 6
    group <- LearningToLearn$track          # school track - group membership variable
    match <- scale(LearningToLearn$score_6) # standardized test score from Grade 6

    # Detecting differential item functioning in change (DIF-C) using
    # logistic regression DIF detection method
    # and standardized total score from Grade 6 as matching criterion
    (fit <- difLogistic(Data = Data, group = group, focal.name = "AS", match = match,
                        type = "both", p.adjust.method = "none", purify = FALSE))

### Logistic regression

Logistic regression method allows for detection of uniform and non-uniform DIF (Swaminathan & Rogers, 1990) by including a group specific intercept $$b_{2}$$ (uniform DIF) and group specific interaction $$b_{3}$$ (non-uniform DIF) into model and by testing for their significance.

#### Method specification

Here you can choose what type of DIF to be tested. You can also select correction method for multiple comparison or item purification. Finally, you may change the DIF matching variable. While matching on the standardized total score is typical, upload of other DIF matching variable is possible in Section Data. Using a pre-test (standardized) total score as DIF matching criterion allows for testing differential item functioning in change (DIF-C) to provide proofs of instructional sensitivity (Martinkova et al., 2020), also see Learning To Learn 9 toy dataset. For selected item you can display plot of its characteristic curves and table of its estimated parameters with standard errors.

#### Plot with estimated DIF logistic curve

Points represent proportion of correct answer (empirical probabilities) with respect to the DIF matching variable. Their size is determined by count of respondents who achieved given level of DIF matching variable with respect to the group membership.

#### Equation

$$\mathrm{P}\left(Y_{ij} = 1 | X_i, G_i, b_0, b_1, b_2, b_3\right) = \frac{e^{b_0 + b_1 X_i + b_2 G_i + b_3 X_i G_i}}{1+e^{b_0 + b_1 X_i + b_2 G_i + b_3 X_i G_i}}$$

#### Table of parameters

Table summarizes estimated item parameters together with standard errors.

### Raju test for IRT models

To detect DIF, Raju test (Raju, 1988, 1990) uses area between the item characteristic curves of selected IRT model, fitted separately on data of the two groups. Model is either 1PL, 2PL, or 3PL with guessing which is the same for the two groups. In case of 3PL model, the guessing parameter is estimated based on the whole dataset and is subsequently considered fixed.
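
The area between two characteristic curves can be approximated by numerical integration. The sketch below uses assumed 1PL difficulties for the two groups (not estimates from any dataset) and only illustrates the quantity itself; the test statistic additionally involves standard errors of the estimated parameters.

```r
# logistic item characteristic curve
icc <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))

# assumed difficulties in the reference and the focal group
b_ref <- 0
b_foc <- 0.5

# signed area between the curves, integrated over a wide ability range
signed_area <- integrate(
  function(t) icc(t, a = 1, b = b_ref) - icc(t, a = 1, b = b_foc),
  lower = -30, upper = 30
)$value
signed_area
```

With equal discrimination and no guessing, the signed area equals the difference in difficulties, here $$0.5$$ up to numerical error.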

#### Method specification

Here you can choose underlying IRT model used to test DIF. You can also select correction method for multiple comparison, and/or item purification.

#### Summary table

Summary table contains information about Raju's $$Z$$-statistics, corresponding $$p$$-values considering selected adjustment, and significance codes. Table also provides estimated parameters for both groups. Note that item parameters might differ slightly even for non-DIF items because two separate models are fitted; however, this difference is non-significant. Also note that under the 3PL model, the guessing parameter $$c$$ is estimated from the whole dataset and is considered fixed in the final models, thus no standard error is displayed.

#### Selected R code

    library(difR)
    library(ltm)

    # Loading data
    data(GMAT, package = "difNLR")
    Data <- GMAT[, 1:20]
    group <- GMAT[, "group"]

    # 1PL IRT MODEL
    (fit1PL <- difRaju(Data = Data, group = group, focal.name = 1, model = "1PL",
                       p.adjust.method = "none", purify = FALSE))

    # 2PL IRT MODEL
    (fit2PL <- difRaju(Data = Data, group = group, focal.name = 1, model = "2PL",
                       p.adjust.method = "none", purify = FALSE))

    # 3PL IRT MODEL with the same guessing for groups
    guess <- itemParEst(Data, model = "3PL")[, 3]
    (fit3PL <- difRaju(Data = Data, group = group, focal.name = 1, model = "3PL",
                       c = guess, p.adjust.method = "none", purify = FALSE))

### Raju test for IRT models

To detect DIF, Raju test (Raju, 1988, 1990) uses area between the item characteristic curves of selected IRT model, fitted separately on data of the two groups. Model is either 1PL, 2PL, or 3PL with guessing which is the same for the two groups. In case of 3PL model, the guessing parameter is estimated based on the whole dataset and is subsequently considered fixed.

#### Method specification

Here you can choose underlying IRT model used to test DIF. You can also select correction method for multiple comparison, and/or item purification. For selected item you can display plot of its characteristic curves and table of its estimated parameters with standard errors.

#### Plot with estimated DIF characteristic curve

Note that plots might differ slightly even for non-DIF items because two separate models are fitted; however, this difference is non-significant.

#### Table of parameters

Table summarizes estimated item parameters together with standard errors. Note that item parameters might differ slightly even for non-DIF items because two separate models are fitted; however, this difference is non-significant. Also note that under the 3PL model, the guessing parameter $$c$$ is estimated from the whole dataset and is considered fixed in the final models, thus no standard error is available.

#### Selected R code

    library(difR)
    library(ltm)
    library(ShinyItemAnalysis)

    # Loading data
    data(GMAT, package = "difNLR")
    Data <- GMAT[, 1:20]
    group <- GMAT[, "group"]

    # 1PL IRT MODEL
    (fit1PL <- difRaju(Data = Data, group = group, focal.name = 1, model = "1PL",
                       p.adjust.method = "none", purify = FALSE))
    # Estimated coefficients for all items
    (coef1PL <- fit1PL$itemParInit)
    # Plot of characteristic curve of item 1
    plotDIFirt(parameters = coef1PL, item = 1, test = "Raju")

    # 2PL IRT MODEL
    (fit2PL <- difRaju(Data = Data, group = group, focal.name = 1, model = "2PL",
                       p.adjust.method = "none", purify = FALSE))
    # Estimated coefficients for all items
    (coef2PL <- fit2PL$itemParInit)
    # Plot of characteristic curve of item 1
    plotDIFirt(parameters = coef2PL, item = 1, test = "Raju")

    # 3PL IRT MODEL with the same guessing for groups
    guess <- itemParEst(Data, model = "3PL")[, 3]
    (fit3PL <- difRaju(Data = Data, group = group, focal.name = 1, model = "3PL",
                       c = guess, p.adjust.method = "none", purify = FALSE))
    # Estimated coefficients for all items
    (coef3PL <- fit3PL$itemParInit)
    # Plot of characteristic curve of item 1
    plotDIFirt(parameters = coef3PL, item = 1, test = "Raju")

### SIBTEST

The SIBTEST method (Shealy & Stout, 1993) allows for detection of uniform DIF without requiring an item response model. Its modified version, the Crossing-SIBTEST (Chalmers, 2018; Li & Stout, 1996), focuses on detection of non-uniform DIF.

#### Method specification

Here you can choose type of DIF to test. With uniform DIF, SIBTEST is applied, while with non-uniform DIF, the Crossing-SIBTEST method is used instead. You can also select correction method for multiple comparison or item purification.

#### Summary table

Summary table contains estimates of $$\beta$$ together with standard errors (only available when testing uniform DIF), corresponding $$\chi^2$$-statistics with $$p$$-values considering selected adjustment, and significance codes.

#### Selected R code

    library(difR)

    # Loading data
    data(GMAT, package = "difNLR")
    Data <- GMAT[, 1:20]
    group <- GMAT[, "group"]

    # SIBTEST (uniform DIF)
    (fit_udif <- difSIBTEST(Data = Data, group = group, focal.name = 1, type = "udif",
                            p.adjust.method = "none", purify = FALSE))

    # Crossing-SIBTEST (non-uniform DIF)
    (fit_nudif <- difSIBTEST(Data = Data, group = group, focal.name = 1, type = "nudif",
                             p.adjust.method = "none", purify = FALSE))

### Method comparison

Here you can compare all offered DIF detection methods. In the table below, columns represent DIF detection methods, and rows represent item numbers. If a method detects an item as DIF, value 1 is assigned to that item, otherwise 0 is assigned. If any method fails to converge or cannot be fitted, NA is displayed instead of 0/1 values. Available methods:

• Delta is delta plot method (Angoff & Ford, 1973; Magis & Facon, 2012),
• MH is Mantel-Haenszel test (Mantel & Haenszel, 1959),
• LR is logistic regression (Swaminathan & Rogers, 1990),
• NLR is generalized (non-linear) logistic regression (Drabinova & Martinkova, 2017),
• LORD is Lord chi-square test (Lord, 1980),
• RAJU is Raju area method (Raju, 1990),
• SIBTEST is SIBTEST (Shealy & Stout, 1993) and crossing-SIBTEST method (Chalmers, 2018; Li & Stout, 1996).

### Table with method comparison

Settings for individual methods (DIF matching criterion, type of DIF to be tested, correction method, item purification) are taken from subsection pages of given methods. In case your settings are not unified, you can set some of them below. Note that changing the options globally can be computationally demanding. This applies especially to item purification. To see the complete setting of all analyses, please refer to the note below the table. The last column shows how many methods detect a certain item as DIF. The last row shows how many items are detected as DIF by a certain method.
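
The layout of such a table can be mimicked with a small 0/1 matrix. The values below are made up for illustration, and counting `NA` entries as not detected in the totals is an assumption of this sketch.

```r
# made-up detection results for four items and three methods (NA = method failed)
res <- matrix(
  c(1, 0, 1,
    0, 0, 0,
    1, 1, NA,
    0, 1, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("Item", 1:4), c("MH", "LR", "RAJU"))
)

# last column: number of methods flagging each item
item_totals <- rowSums(res, na.rm = TRUE)
# last row: number of items flagged by each method
method_totals <- colSums(res, na.rm = TRUE)

rbind(cbind(res, total = item_totals), total = c(method_totals, NA))
```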

### Cumulative logit regression model for DIF detection

Cumulative logit regression allows for detection of uniform and non-uniform DIF among ordinal data by adding a group specific intercept $$b_2$$ (uniform DIF) and interaction $$b_3$$ between group and DIF matching variable (non-uniform DIF) into model and by testing for their significance.

#### Method specification

Here you can change DIF matching variable and choose type of DIF to be tested. You can also select correction method for multiple comparison or item purification.

#### Equation

The probability that person $$p$$ with DIF matching variable (e.g., standardized total score) $$Z_p$$ and group membership $$G_p$$ obtained at least $$k$$ points in item $$i$$ is given by the following equation:

The probability that person $$p$$ with DIF matching variable (e.g., standardized total score) $$Z_p$$ and group membership $$G_p$$ obtained exactly $$k$$ points in item $$i$$ is then given as difference between probabilities of obtaining at least $$k$$ and $$k + 1$$ points:
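
The equations themselves are not reproduced in this text. A sketch of the form they would take, assuming the same linear predictor as in the logistic regression model with a category-specific intercept (the symbols $$b_{0k}$$ and $$b_1$$ are assumptions; only $$b_2$$ and $$b_3$$ are named in the text):

$$\mathrm{P}\left(Y_{pi} \geq k | Z_p, G_p\right) = \frac{e^{b_{0k} + b_1 Z_p + b_2 G_p + b_3 Z_p G_p}}{1+e^{b_{0k} + b_1 Z_p + b_2 G_p + b_3 Z_p G_p}}$$

$$\mathrm{P}\left(Y_{pi} = k | Z_p, G_p\right) = \mathrm{P}\left(Y_{pi} \geq k | Z_p, G_p\right) - \mathrm{P}\left(Y_{pi} \geq k + 1 | Z_p, G_p\right)$$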

#### Summary table

Summary table contains information about $$\chi^2$$-statistics, corresponding $$p$$-values considering selected adjustment, and significance codes. Table also provides estimated parameters for the best fitted model for each item.

#### Selected R code

    library(difNLR)

    # Loading data
    data(dataMedicalgraded, package = "ShinyItemAnalysis")
    Data <- dataMedicalgraded[, 1:100]
    group <- dataMedicalgraded[, 101]

    # DIF with cumulative logit regression model
    (fit <- difORD(Data = Data, group = group, focal.name = 1, model = "cumulative",
                   type = "both", match = "zscore", p.adjust.method = "none",
                   purify = FALSE))

### Cumulative logit regression model for DIF detection

Cumulative logit regression allows for detection of uniform and non-uniform DIF among ordinal data by adding a group specific intercept $$b_2$$ (uniform DIF) and interaction $$b_3$$ between group and DIF matching variable (non-uniform DIF) into model and by testing for their significance.

#### Method specification

Here you can change DIF matching variable and choose type of DIF to be tested. You can also select correction method for multiple comparison or item purification. For selected item you can display plot of its characteristic curves and table of its estimated parameters with standard errors.

#### Plot with estimated DIF curves

Points represent proportion of obtained score with respect to DIF matching variable. Their size is determined by count of respondents who achieved given level of DIF matching variable and who selected given option with respect to the group membership.
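The quantities behind the plotted points can be sketched in base R as follows. The data are simulated and all variable names are hypothetical; the application computes the points from the uploaded dataset:

```r
set.seed(1)

# Simulated data (hypothetical): total score as matching variable, a 0/1 group
# indicator, and an ordinal item scored 0-2
n <- 200
total <- sample(0:10, n, replace = TRUE)
group <- rbinom(n, 1, 0.5)
item  <- rbinom(n, 2, plogis((total - 5) / 2))

# Count respondents for every combination of score level, group, and item score
counts <- as.data.frame(
  table(total = total, group = group, score = item),
  responseName = "count"
)

# Denominator: number of respondents at that score level within the group
level_n <- ave(counts$count, counts$total, counts$group, FUN = sum)

# Point height corresponds to the proportion, point size to the count
counts$proportion <- ifelse(level_n > 0, counts$count / level_n, NA)

head(counts[counts$count > 0, ])
```

Large points therefore mark combinations of matching score and item score that many respondents in the group share, while small points rest on few observations and should be read with caution.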

#### Table of parameters

Table summarizes estimated item parameters together with standard errors.

#### Selected R code

    library(difNLR)

    # Loading data
    data(dataMedicalgraded, package = "ShinyItemAnalysis")
    Data <- dataMedicalgraded[, 1:100]
    group <- dataMedicalgraded[, 101]

    # DIF with cumulative logit regression model
    (fit <- difORD(
      Data = Data, group = group, focal.name = 1, model = "cumulative",
      type = "both", match = "zscore", p.adjust.method = "none", purify = FALSE
    ))

    # Plot of characteristic curves for item X2003, cumulative probabilities
    plot(fit, item = "X2003", plot.type = "cumulative")

    # Plot of characteristic curves for item X2003, category probabilities
    plot(fit, item = "X2003", plot.type = "category")

    # Estimated coefficients for all items with standard errors
    coef(fit, SE = TRUE)

### Adjacent category logit regression model for DIF detection

Adjacent category logit regression model allows for detection of uniform and non-uniform DIF among ordinal data by adding a group specific intercept $$b_2$$ (uniform DIF) and interaction $$b_3$$ between group and DIF matching variable (non-uniform DIF) into model and by testing for their significance.

#### Method specification

Here you can change DIF matching variable and choose type of DIF to be tested. You can also select correction method for multiple comparison or item purification.

#### Equation

The probability that person $$p$$ with DIF matching variable (e.g., standardized total score) $$Z_p$$ and group membership $$G_p$$ obtained $$k$$ points in item $$i$$ is given by the following equation:
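The displayed equation is not reproduced in this text. A standard adjacent category logit form consistent with the description above, assuming category-specific intercepts $$b_{i0t}$$, a maximum item score $$K$$, and a linear predictor fixed at zero for $$t = 0$$, would be:

$$\mathrm{P}(Y_{ip} = k|Z_p, G_p, b_{i01}, \dots, b_{i0K}, b_{i1}, b_{i2}, b_{i3}) = \frac{e^{\sum_{t = 0}^{k} \left( b_{i0t} + b_{i1} Z_p + b_{i2} G_p + b_{i3} Z_p:G_p\right)}}{\sum_{r = 0}^{K} e^{\sum_{t = 0}^{r} \left( b_{i0t} + b_{i1} Z_p + b_{i2} G_p + b_{i3} Z_p:G_p\right)}}$$

As in the other models on this page, $$b_{i2}$$ captures uniform DIF and $$b_{i3}$$ captures non-uniform DIF.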

#### Summary table

Summary table contains information about $$\chi^2$$-statistics, corresponding $$p$$-values considering the selected adjustment, and significance codes. Table also provides estimated parameters for the best fitted model for each item.

#### Selected R code

    library(difNLR)

    # Loading data
    data(dataMedicalgraded, package = "ShinyItemAnalysis")
    Data <- dataMedicalgraded[, 1:100]
    group <- dataMedicalgraded[, 101]

    # DIF with adjacent category logit regression model
    (fit <- difORD(
      Data = Data, group = group, focal.name = 1, model = "adjacent",
      type = "both", match = "zscore", p.adjust.method = "none", purify = FALSE
    ))

### Adjacent category logit regression model for DIF detection

Adjacent category logit regression model allows for detection of uniform and non-uniform DIF among ordinal data by adding a group specific intercept $$b_2$$ (uniform DIF) and interaction $$b_3$$ between group and DIF matching variable (non-uniform DIF) into model and by testing for their significance.

#### Method specification

Here you can change DIF matching variable and choose type of DIF to be tested. You can also select correction method for multiple comparison or item purification. For selected item you can display plot of its characteristic curves and table of its estimated parameters with standard errors.

#### Plot with estimated DIF curves

Points represent proportion of obtained score with respect to DIF matching variable. Their size is determined by count of respondents who achieved given level of DIF matching variable and who selected given option with respect to the group membership.

#### Table of parameters

Table summarizes estimated item parameters together with standard errors.

#### Selected R code

    library(difNLR)

    # Loading data
    data(dataMedicalgraded, package = "ShinyItemAnalysis")
    Data <- dataMedicalgraded[, 1:100]
    group <- dataMedicalgraded[, 101]

    # DIF with adjacent category logit regression model
    (fit <- difORD(
      Data = Data, group = group, focal.name = 1, model = "adjacent",
      type = "both", match = "zscore", p.adjust.method = "none", purify = FALSE
    ))

    # Plot of characteristic curves for item X2003
    plot(fit, item = "X2003")

    # Estimated coefficients for all items with standard errors
    coef(fit, SE = TRUE)

### Multinomial regression model for DDF detection

Differential Distractor Functioning (DDF) occurs when people from different groups but with the same knowledge have different probability of selecting at least one distractor choice. DDF is here examined by multinomial log-linear regression model with Z-score and group membership as covariates.

#### Method specification

Here you can change DIF matching variable and choose type of DDF to be tested. You can also select correction method for multiple comparison or item purification.

#### Equation

For $$K$$ possible test choices, the probability of the correct answer for person $$p$$ with DIF matching variable (e.g., standardized total score) $$Z_p$$ and group membership $$G_p$$ in item $$i$$ is given by the following equation:

$$\mathrm{P}(Y_{ip} = K|Z_p, G_p, b_{il0}, b_{il1}, b_{il2}, b_{il3}, l = 1, \dots, K-1) = \frac{1}{1 + \sum_l e^{\left( b_{il0} + b_{il1} Z_p + b_{il2} G_p + b_{il3} Z_p:G_p\right)}}$$

The probability of choosing distractor $$k$$ is then given by:

$$\mathrm{P}(Y_{ip} = k|Z_p, G_p, b_{il0}, b_{il1}, b_{il2}, b_{il3}, l = 1, \dots, K-1) = \frac{e^{\left( b_{ik0} + b_{ik1} Z_p + b_{ik2} G_p + b_{ik3} Z_p:G_p\right)}} {1 + \sum_l e^{\left( b_{il0} + b_{il1} Z_p + b_{il2} G_p + b_{il3} Z_p:G_p\right)}}$$
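The two equations above can be evaluated numerically with a few lines of base R. This is a sketch with hypothetical coefficients; the function name `choice_probs`, the distractor labels, and all values are illustrative only:

```r
# Multinomial sketch with hypothetical coefficients (not estimates from any data).
# Each row of `b` holds (b_l0, b_l1, b_l2, b_l3) for one distractor; the correct
# answer is the reference category with its linear predictor fixed at 0.
choice_probs <- function(z, g, b) {
  eta <- b[, 1] + b[, 2] * z + b[, 3] * g + b[, 4] * z * g
  denom <- 1 + sum(exp(eta))
  c(exp(eta) / denom, correct = 1 / denom)
}

b <- rbind(
  A = c(-0.5, -1.0, 0.0,  0.0),
  B = c(-1.0, -0.8, 0.6,  0.0), # group effect: uniform DDF in distractor B
  C = c(-1.5, -1.2, 0.0, -0.4)  # interaction: non-uniform DDF in distractor C
)

# Same matching score, different group membership
round(rbind(
  reference = choice_probs(z = 0, g = 0, b),
  focal     = choice_probs(z = 0, g = 1, b)
), 3)
```

With these values, respondents in the focal group choose distractor B more often than respondents in the reference group with the same matching score, which is the pattern the DDF test looks for.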

#### Summary table

Summary table contains information about $$\chi^2$$-statistics, corresponding $$p$$-values considering the selected adjustment, and significance codes. Table also provides estimated parameters for the best fitted model for each item.

#### Selected R code

    library(difNLR)

    # Loading data
    data(GMATtest, GMATkey, package = "difNLR")
    Data <- GMATtest[, 1:20]
    group <- GMATtest[, "group"]
    key <- GMATkey

    # DDF with multinomial regression model
    (fit <- ddfMLR(
      Data, group, focal.name = 1, key, type = "both", match = "zscore",
      p.adjust.method = "none", purify = FALSE
    ))

### Multinomial regression model for DDF detection

Differential Distractor Functioning (DDF) occurs when people from different groups but with the same knowledge have different probability of selecting at least one distractor choice. DDF is here examined by multinomial log-linear regression model with Z-score and group membership as covariates.

#### Method specification

Here you can change DIF matching variable and choose type of DDF to be tested. You can also select correction method for multiple comparison or item purification. For selected item you can display plot of its characteristic curves and table of its estimated parameters with standard errors.

#### Plot with estimated DDF curves

Points represent proportion of selected answer with respect to DIF matching variable. Their size is determined by count of respondents who achieved given level of DIF matching variable and who selected given option with respect to the group membership.

#### Table of parameters

Table summarizes estimated item parameters together with standard errors.

#### Selected R code

    library(difNLR)

    # Loading data
    data(GMATtest, GMATkey, package = "difNLR")
    Data <- GMATtest[, 1:20]
    group <- GMATtest[, "group"]
    key <- GMATkey

    # DDF with multinomial regression model
    (fit <- ddfMLR(
      Data, group, focal.name = 1, key, type = "both", match = "zscore",
      p.adjust.method = "none", purify = FALSE
    ))

    # Plot of characteristic curves for item 1
    plot(fit, item = 1)

    # Estimated coefficients for all items with standard errors
    coef(fit, SE = TRUE)

#### Settings of report

ShinyItemAnalysis offers an option to download a report in HTML or PDF format. PDF report creation requires the latest version of MiKTeX (or another TeX distribution). If you do not have the latest installation, please use the HTML report.

There is an option to use customized settings. When the Customize settings box is checked, local settings are offered and used for each selected section of the report. Otherwise, the settings are taken from the corresponding sections of the application. You may also include your name in the report, as well as the name of the analyzed dataset.

#### Content of report

Reports by default contain summary of total scores, table of standard scores, item analysis, distractor plots for each item and multinomial regression plots for each item. Other analyses can be selected below.

• Validity
• Difficulty/discrimination plot
• Distractors plots
• DIF method selection
• Delta plot settings
• Mantel-Haenszel test settings
• Logistic regression settings
• Multinomial regression settings

Recommendation: Report generation can be faster and more reliable when you first check sections of intended contents. For example, if you wish to include a 3PL IRT model, you can first visit IRT models section and 3PL subsection.

### Welcome

Welcome to ShinyItemAnalysis!

ShinyItemAnalysis is an interactive online application for psychometric analysis of educational and other psychological tests and their items, built on R and shiny. You can start using the application by choosing a toy dataset (or uploading your own) in the Data section and running analyses including:

• Exploration of total and standard scores in Summary section
• Analysis of measurement error in Reliability section
• Correlation structure and criterion validity analysis in Validity section
• Item and distractor analysis in Item analysis section
• Item analysis with regression models in Regression section
• Item analysis by item response theory models in IRT models section
• Differential item functioning (DIF) and differential distractor functioning (DDF) methods in DIF/Fairness section

All graphical outputs and selected tables can be downloaded via download button. Moreover, you can automatically generate HTML or PDF report in Reports section. All offered analyses are complemented by selected R code which is ready to be copy-pasted into your R console, hence a similar analysis can be run and modified in R.

#### Availability

The application is also available online at Czech Academy of Sciences and shinyapps.io.

#### Versions

Current CRAN version is 1.3.3.
Version available online is 1.3.3.
The newest development version available on GitHub is 1.3.3.

#### Feedback

If you discover a problem with this application please contact the project maintainer at martinkova(at)cs.cas.cz or use GitHub. We also encourage you to provide your feedback using Google form.

This program is free software and you can redistribute it and/or modify it under the terms of the GNU GPL 3 as published by the Free Software Foundation. This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.

To cite ShinyItemAnalysis in publications, please use:

Martinkova P., & Drabinova A. (2018).
ShinyItemAnalysis for teaching psychometrics and to enforce routine analysis of educational tests.
The R Journal, 10(2), 503-515. doi: 10.32614/RJ-2018-074

#### Acknowledgments

Project was supported by Czech Science Foundation grant GJ15-15856Y 'Estimation of psychometric measures as part of admission test development' and by Charles University under project PRIMUS/17/HUM/11 'Center for Educational Measurement and Psychometrics (CEMP)'.

### R packages

• corrplot Wei, T. & Simko, V. (2017). R package "corrplot": Visualization of a Correlation Matrix. R package version 0.84. See online.
• cowplot Wilke, C. O. (2018). cowplot: Streamlined Plot Theme and Plot Annotations for "ggplot2". R package version 0.9.3. See online.
• CTT Willse, J. T. (2018). CTT: Classical Test Theory Functions. R package version 2.3.3. See online.
• data.table Dowle, M. & Srinivasan, A. (2019). data.table: Extension of "data.frame". R package version 1.12.8. See online.
• deltaPlotR Magis, D. & Facon, B. (2014). deltaPlotR: An R Package for Differential Item Functioning Analysis with Angoff's Delta Plot. Journal of Statistical Software, Code Snippets, 59(1), 1-19. See online.
• difNLR Drabinova, A., Martinkova, P. (2020). difNLR: DIF and DDF Detection by Non-Linear Regression Models. R package version 1.3.2. See online.
• difR Magis, D., Beland, S., Tuerlinckx, F. & De Boeck, P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42(3), 847-862.
• DT Xie, Y., Cheng, J. & Tan, X. (2019). DT: A Wrapper of the JavaScript Library "DataTables". R package version 0.10. See online.
• ggdendro de Vries, A. & Ripley, B.D. (2016). ggdendro: Create Dendrograms and Tree Diagrams Using "ggplot2". R package version 0.1-20. See online.
• ggplot2 Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. See online.
• gridExtra Auguie, B. (2017). gridExtra: Miscellaneous Functions for "Grid" Graphics. R package version 2.3. See online.
• knitr Xie, Y. (2019). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.26. See online.
• latticeExtra Sarkar, D. & Andrews, F. (2016). latticeExtra: Extra Graphical Utilities Based on Lattice. R package version 0.6-28. See online.
• ltm Rizopoulos, D. (2006). ltm: An R package for Latent Variable Modelling and Item Response Theory Analyses. Journal of Statistical Software, 17(5), 1-25. See online.
• mirt Chalmers, R. P. (2012). mirt: A Multidimensional Item Response Theory Package for the R Environment. Journal of Statistical Software, 48(6), 1-29.
• moments Komsta, L. & Novomestky, F. (2015). moments: Moments, cumulants, skewness, kurtosis and related tests. R package version 0.14. See online.
• msm Jackson, C. H. (2011). Multi-State Models for Panel Data: The msm Package for R. Journal of Statistical Software, 38(8), 1-29. See online.
• nnet Venables, W. N. & Ripley, B. D. (2002). Modern Applied Statistics with S. See online.
• plotly Sievert, C., Parmer, C., Hocking, T., Chamberlain, S., Ram, K., Corvellec, M. & Despouy, P. (2017). plotly: Create Interactive Web Graphics via "plotly.js". R package version 4.9.1. See online.
• psych Revelle, W. (2018). psych: Procedures for Psychological, Psychometric, and Personality Research. R package version 1.8.12. See online.
• psychometric Fletcher, T. D. (2010). psychometric: Applied Psychometric Theory. R package version 2.2. See online.
• reshape2 Wickham, H. (2007). Reshaping Data with the reshape Package. Journal of Statistical Software, 21(12), 1-20. See online.
• rmarkdown Xie, Y., Allaire, J. J. & Grolemund, G. (2018). R Markdown: The Definitive Guide. Chapman and Hall/CRC. ISBN 9781138359338. See online.
• shiny Chang, W., Cheng, J., Allaire, J., Xie, Y. & McPherson, J. (2019). shiny: Web Application Framework for R. R package version 1.4.0. See online.
• shinyBS Bailey, E. (2015). shinyBS: Twitter Bootstrap Components for Shiny. R package version 0.61. See online.
• shinydashboard Chang, W. & Borges Ribeiro, B. (2018). shinydashboard: Create Dashboards with "Shiny". R package version 0.7.1. See online.
• ShinyItemAnalysis Martinkova, P., & Drabinova, A. (2018). ShinyItemAnalysis for teaching psychometrics and to enforce routine analysis of educational tests. The R Journal, 10(2), 503-515. See online.
• shinyjs Attali, D. (2018). shinyjs: Easily Improve the User Experience of Your Shiny Apps in Seconds. R package version 1.0. See online.
• stringr Wickham, H. (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0. See online.
• xtable Dahl, D., Scott, D., Roosen, C., Magnusson, A. & Swinton, J. (2019). xtable: Export Tables to LaTeX or HTML. R package version 1.8-4. See online.
• VGAM Yee, T. W. (2019). VGAM: Vector Generalized Linear and Additive Models. R package version 1.1-2. See online.

### References

• Akaike, H. (1974). A New Look at the Statistical Model Identification. IEEE Transactions on Automatic Control, 19(6), 716-723. See online.
• Ames, A. J., & Penfield, R. D. (2015). An NCME Instructional Module on Item-Fit Statistics for Item Response Theory Models. Educational Measurement: Issues and Practice, 34(3), 39-48. See online.
• Andrich, D. (1978). A Rating Formulation for Ordered Response Categories. Psychometrika, 43(4), 561-573. See online.
• Angoff, W. H., & Ford, S. F. (1973). Item-Race Interaction on a Test of Scholastic Aptitude. Journal of Educational Measurement, 10(2), 95-105. See online.
• Bartholomew, D., Steel, F., Moustaki, I., & Galbraith, J. (2002). The Analysis and Interpretation of Multivariate Data for Social Scientists. London: Chapman and Hall.
• Bock, R. D. (1972). Estimating Item Parameters and Latent Ability when Responses Are Scored in Two or More Nominal Categories. Psychometrika, 37(1), 29-51. See online.
• Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 1904-1920, 3(3), 296-322. See online.
• Chalmers, R. P. (2018). Improving the Crossing-SIBTEST Statistic for Detecting Non-uniform DIF. Psychometrika, 83(2), 376-386. See online.
• Cronbach, L. J. (1951). Coefficient Alpha and the Internal Structure of Tests. Psychometrika, 16(3), 297-334. See online.
• Drabinova, A., & Martinkova, P. (2017). Detection of Differential Item Functioning with Non-Linear Regression: Non-IRT Approach Accounting for Guessing. Journal of Educational Measurement, 54(4), 498-517. See online.
• Feldt, L. S., Woodruff, D. J., & Salih, F. A. (1987). Statistical inference for coefficient alpha. Applied Psychological Measurement, 11(1), 93-103. See online.
• Li, H.-H., & Stout, W. (1996). A New Procedure for Detection of Crossing DIF. Psychometrika, 61(4), 647-677. See online.
• Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems. Routledge.
• Magis, D., & Facon, B. (2012). Angoff's Delta Method Revisited: Improving DIF Detection under Small Samples. British Journal of Mathematical and Statistical Psychology, 65(2), 302-321. See online.
• Magis, D., & Facon, B. (2013). Item purification does not always improve DIF detection: a counter-example with Angoff's Delta plot. Educational and Psychological Measurement, 73(2), 293-311. See online.
• Mantel, N., & Haenszel, W. (1959). Statistical Aspects of the Analysis of Data from Retrospective Studies. Journal of the National Cancer Institute, 22(4), 719-748. See online.
• Martinkova, P., Drabinova, A., & Houdek, J. (2017). ShinyItemAnalysis: Analyza Prijimacich a Jinych Znalostnich ci Psychologickych Testu. [ShinyItemAnalysis: Analyzing Admission and Other Educational and Psychological Tests] TESTFORUM, 6(9), 16-35. See online.
• Martinkova, P., Drabinova, A., Liaw, Y. L., Sanders, E. A., McFarland, J. L., & Price, R. M. (2017). Checking Equity: Why Differential Item Functioning Analysis Should Be a Routine Part of Developing Conceptual Assessments. CBE-Life Sciences Education, 16(2), rm2. See online.
• Martinkova, P., Stepanek, L., Drabinova, A., Houdek, J., Vejrazka, M., & Stuka, C. (2017). Semi-real-time Analyses of Item Characteristics for Medical School Admission Tests. In Proceedings of the 2017 Federated Conference on Computer Science and Information Systems, 189-194. See online.
• Martinkova, P., Drabinova, A., & Potuznikova, E. (2020). Is academic tracking related to gains in learning competence? Using propensity score matching and differential item change functioning analysis for better understanding of tracking implications. Learning and Instruction, 66(April). See online.
• Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149-174. See online.
• McFarland, J. L., Price, R. M., Wenderoth, M. P., Martinkova, P., Cliff, W., Michael, J., ... & Wright, A. (2017). Development and Validation of the Homeostasis Concept Inventory. CBE-Life Sciences Education, 16(2), ar35. See online.
• Muraki, E. (1992). A Generalized Partial Credit Model: Application of an EM Algorithm. ETS Research Report Series, 1992(1). See online.
• Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24(1), 50-64. See online.
• Swaminathan, H., & Rogers, H. J. (1990). Detecting Differential Item Functioning Using Logistic Regression Procedures. Journal of Educational Measurement, 27(4), 361-370. See online.
• Raju, N. S. (1988). The Area between Two Item Characteristic Curves. Psychometrika, 53(4), 495-502. See online.
• Raju, N. S. (1990). Determining the Significance of Estimated Signed and Unsigned Areas between Two Item Response Functions. Applied Psychological Measurement, 14(2), 197-207. See online.
• Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Paedagogiske Institute.
• Revelle, W. (1979). Hierarchical cluster analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1), 57-74. See online.
• Samejima, F. (1969). Estimation of Latent Ability Using a Response Pattern of Graded Scores. Psychometrika, 34(1), 1-97. See online.
• Schwarz, G. (1978). Estimating the Dimension of a Model. The Annals of Statistics, 6(2), 461-464. See online.
• Shealy, R., & Stout, W. (1993). A Model-Based Standardization Approach that Separates True Bias/DIF from Group Ability Differences and Detects Test Bias/DTF as well as Item Bias/DIF. Psychometrika, 58(2), 159-194. See online.
• Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 1904-1920, 3(3), 271-295. See online.
• Wilson, M. (2005). Constructing Measures: An Item Response Modeling Approach.
• Wright, B. D., & Stone, M. H. (1979). Best Test Design. Chicago: Mesa Press.

### Settings

#### IRT models setting

Set the number of cycles for IRT 1PL, 2PL, 3PL and 4PL models.