Welcome to ShinyItemAnalysis!
ShinyItemAnalysis is an interactive online application for the psychometric analysis of educational tests, psychological assessments, health-related and other types of multi-item measurements, or ratings from multiple raters, built on R and shiny. You can easily start using the application with the default toy dataset. You may also select from a number of other toy datasets or upload your own in the Data section. Offered methods include traditional item analysis, regression models, IRT models, DIF methods, and reliability and validity analyses.
All graphical outputs and selected tables can be downloaded via the download button. Moreover, you can automatically generate an HTML or PDF report in the Reports section. All offered analyses are complemented by selected R code which is ready to be copied and pasted into your R console, so that a similar analysis can be run and modified directly in R.
Visit the www.ShinyItemAnalysis.org webpage to learn more about ShinyItemAnalysis!
A new paper on range-restricted inter-rater reliability has been published in JRSS-A (Erosheva, Martinkova, & Lee, 2021). To try the examples interactively, set the AIBS toy dataset in the Data section by clicking on the menu in the upper left corner, and go to the Reliability/Restricted range section.
New papers on differential item functioning have been published in Learning and Instruction (Martinkova, Hladka, & Potuznikova, 2020) and in The R Journal (Hladka & Martinkova, 2020). To try these examples interactively, set the Learning to Learn 9 toy dataset in the Data section by clicking on the menu in the upper left corner, and go to the DIF/Fairness/Generalized logistic section.
The application can be downloaded as an R package from CRAN. It is also available online at the Czech Academy of Sciences and at shinyapps.io.
If you discover a problem with this application, please contact the project maintainer at martinkova(at)cs.cas.cz or use GitHub. We also encourage you to provide your feedback using the Google form.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU GPL 3 as published by the Free Software Foundation. This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.
To cite ShinyItemAnalysis in publications, please use:
This work was supported by the Czech Science Foundation (GJ15-15856Y, 21-03658S) and Charles University (PRIMUS/17/HUM/11).
For demonstration purposes, the 20-item dataset GMAT is used. While on this page, you may select one of several other toy datasets or upload your own dataset (see below). To return to the demonstration dataset, click on the Unload data button.
Here you can upload your own dataset. Select all the necessary files and use the Upload data button at the bottom of this page.
The main data file should contain the responses of individual respondents (rows) to the given items (columns). Data need to be either binary, nominal (e.g., in ABCD format), or ordinal (e.g., on a Likert scale). The header may contain item names; no row names should be included. The header should be consistently included in, or excluded from, all uploaded files. By default, the columns of the dataset are renamed to Item followed by the column number. If you want to keep your own names, check the box Keep item names below. Missing values in the scored dataset are by default scored as 0. If you want to keep them as missing, check the box Keep missing values below.
For ordinal data, you are advised to include a vector of cut-scores which is used for the binarization of the uploaded data, i.e., values greater than or equal to the cut-score are set to 1, otherwise to 0. You can either upload a dataset of item-specific values or provide one value for the whole dataset.
Note: If no cut-score is provided, the vector of maximal values is used.
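A rough sketch of this binarization rule in R (the example data and the single cut-score below are hypothetical):
# hypothetical ordinal responses to two items
data <- data.frame(i1 = c(1, 3, 4), i2 = c(2, 5, 3))
cut_score <- 3 # one cut-score for the whole dataset
# values greater than or equal to the cut-score become 1, otherwise 0
binarized <- as.data.frame(+(data >= cut_score))
binarized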
For nominal data, it is necessary to upload a key of correct answers.
For ordinal data, you may optionally upload the minimal and maximal possible values of answers. You can either upload datasets of item-specific values or provide one value for the whole dataset.
Note: If no minimal or maximal values are provided, these values are set automatically based on the observed values.
Group is a variable for DIF and DDF analyses. It should be a binary vector, where 0 represents the reference group and 1 represents the focal group. Its length needs to be the same as the number of individual respondents in the main dataset. Missing values are not supported for the group variable and such cases/rows of the data should be removed.
Note: If no group variable is provided, the DIF and DDF analyses in the DIF/Fairness section are not available.
Criterion is either a discrete or continuous variable (e.g., future study success or future GPA in the case of admission tests) which should be predicted by the measurement. Its length needs to be the same as the number of individual respondents in the main dataset.
Note: If no criterion variable is provided, it won't be possible to run the validity analysis in the Predictive validity section on the Validity page.
Observed score is a variable describing the observed ability or trait of the respondents. If supplied, it is offered in the Regression and in the DIF/Fairness sections for analyses with respect to this external variable. Its length needs to be the same as the number of individual respondents in the main dataset.
Note: If no observed score is provided, the total scores or standardized total scores are used instead.
Here you can explore the uploaded dataset. The rendering of the tables can take some time.
The total score, also known as the raw score or sum score, is the simplest measure of the latent trait being measured. The total score is calculated as the sum of the item scores. For binary correct/false items, the total score corresponds to the total number of correct answers.
The table below summarizes basic descriptive statistics for the total scores, including the number of respondents \(n\), minimum, maximum, mean, median, \(\textrm{SD}\), skewness, and kurtosis. The skewness for normally distributed scores is near the value of 0 and the kurtosis is near the value of 3.
For a selected cut-score, the blue part of the histogram shows respondents with a total score above the cut-score, the grey column shows respondents with a total score equal to the cut-score and the red part of the histogram shows respondents below the cut-score.
library(ggplot2)
library(psych)
library(ShinyItemAnalysis)
# loading data
data(GMAT, package = "difNLR")
data <- GMAT[, 1:20]
# total score calculation
score <- rowSums(data)
df <- data.frame(score)
# summary of total score
tab <- describe(score)[, c("n", "min", "max", "mean", "median", "sd", "skew", "kurtosis")]
tab$kurtosis <- tab$kurtosis + 3 # describe() reports excess kurtosis
tab
# histogram
ggplot(df, aes(score)) +
geom_histogram(binwidth = 1, col = "black") +
xlab("Total score") +
ylab("Number of respondents") +
theme_app()
# colors by cut-score
cut <- median(score) # cut-score
color <- c(rep("red", cut - min(score)), "gray", rep("blue", max(score) - cut))
# histogram
ggplot(df, aes(score)) +
geom_histogram(binwidth = 1, fill = color, col = "black") +
xlab("Total score") +
ylab("Number of respondents") +
theme_app()
Total score is calculated as the sum of the item scores. Percentile indicates the value below which a percentage of observations falls, e.g., an individual score at the 80th percentile means that the individual's score is the same as or higher than the scores of 80% of all respondents. Success rate is the percentage of the maximum score obtained, e.g., if the maximum score of the test is 20, the minimum is 0, and the individual score is 12, then the success rate is \(12 / 20 = 0.6\), i.e., 60%. The Z-score, also known as the standardized score, is the total score linearly transformed to have a mean of 0 and a standard deviation of 1. The T-score is the Z-score linearly transformed to have a mean of 50 and a standard deviation of 10.
# loading data
data(GMAT, package = "difNLR")
data <- GMAT[, 1:20]
# scores calculations (unique values)
score <- rowSums(data) # Total score
tosc <- sort(unique(score)) # Levels of total score
perc <- ecdf(score)(tosc) # Percentiles
sura <- 100 * (tosc / max(score)) # Success rate
zsco <- sort(unique(scale(score))) # Z-score
tsco <- 50 + 10 * zsco # T-score
cbind(tosc, perc, sura, zsco, tsco)
A correlation heat map displays the selected type of correlation between items. The size and shade of the circles indicate how strongly the items are correlated (a larger and darker circle means a stronger correlation). The color of the circles indicates the direction of the correlation: blue means a positive correlation and red means a negative correlation. The correlation heat map can be reordered using a hierarchical clustering method selected below. With a number of clusters larger than 1, rectangles representing the clusters are drawn. The values of the correlation heat map may be displayed and also downloaded.
The Pearson correlation coefficient describes the strength and direction of a linear relationship between two random variables \(X\) and \(Y\). It is given by the formula
$$\rho = \frac{cov(X,Y)}{\sqrt{var(X)}\sqrt{var(Y)}}.$$The sample Pearson correlation coefficient may be calculated as
$$ r = \frac{\sum_{i = 1}^{n}(x_{i} - \bar{x})(y_{i} - \bar{y})}{\sqrt{\sum_{i = 1}^{n}(x_{i} - \bar{x})^2}\sqrt{\sum_{i = 1}^{n}(y_{i} - \bar{y})^2}}$$The Pearson correlation coefficient has a value between -1 and +1. Sample correlations of -1 and +1 correspond to all data points lying exactly on a line (decreasing in the case of negative linear correlation, -1, and increasing for +1). If the coefficient is equal to 0, there is no linear relationship between the two variables.
A polychoric/tetrachoric correlation between two ordinal/binary variables is calculated from their contingency table, under the assumption that the ordinal variables are discretized versions of continuous latent variables that follow a bivariate normal distribution.
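As a minimal sketch, a tetrachoric correlation matrix for binary items may be computed with the psych package (the binary HCI dataset is used here for illustration):
library(psych)
# loading binary data
data(HCI, package = "ShinyItemAnalysis")
data <- HCI[, 1:20]
# tetrachoric correlation matrix, assuming bivariate normal latent variables
tet <- tetrachoric(data)
tet$rho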
Spearman's rank correlation coefficient describes the strength and the direction of a monotonic relationship between random variables \(X\) and \(Y\), i.e., the dependence between the rankings of the two variables. It is given by the formula
$$\rho = \frac{cov(rg_{X},rg_{Y})}{\sqrt{var(rg_{X})}\sqrt{var(rg_{Y})}},$$where \(rg_{X}\) and \(rg_{Y}\) are the random variables \(X\) and \(Y\) transformed into ranks, i.e., the Spearman correlation coefficient is the Pearson correlation coefficient between the ranked variables.
The sample Spearman correlation is calculated by converting \(X\) and \(Y\) to ranks (average ranks are used in case of ties) and applying the sample Pearson correlation formula. If both \(X\) and \(Y\) have \(n\) unique ranks, i.e., there are no ties, then the sample correlation coefficient is given by the formula
$$ r = 1 - \frac{6\sum_{i = 1}^{n}d_i^{2}}{n(n^2-1)},$$where \(d_i = rg_{X_i} - rg_{Y_i}\) is the difference between the two ranks of observation \(i\) and \(n\) is the number of observations. The Spearman rank correlation coefficient has a value between -1 and 1, where 1 means identical ranks of the two variables and -1 means reversed ranks. In the case of no repeated values, a Spearman correlation of +1 or -1 means that all data points lie exactly on some monotone line. If the Spearman coefficient is equal to 0, there is no tendency for \(Y\) to either increase or decrease as \(X\) increases.
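To illustrate the difference between the two coefficients, the following sketch simulates a monotonic but nonlinear relationship (the simulated data are for illustration only):
set.seed(42)
x <- rnorm(100)
y <- exp(x) + rnorm(100, sd = 0.2)
cor(x, y, method = "pearson") # attenuated by the nonlinearity
cor(x, y, method = "spearman") # close to 1 for a monotonic trend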
Clustering methods. Ward's method aims at finding compact clusters by minimizing the within-cluster sum of squares. Ward's n. 2 method uses squared dissimilarities. The Single method connects clusters by their nearest neighbours, i.e., the distance between two clusters is calculated as the minimum distance between observations in one cluster and observations in the other cluster. Complete linkage, on the other hand, uses the farthest neighbours, i.e., the maximum distance. The Average linkage method uses a distance based on a weighted average of the individual distances. The McQuitty method uses an unweighted average. Median linkage calculates the distance as the median of the distances between an observation in one cluster and an observation in the other cluster. The Centroid method uses the distance between the centroids of the clusters. A sketch of one such clustering in base R follows.
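A minimal sketch of clustering items with base R, assuming the dissimilarity 1 minus the Pearson correlation (this dissimilarity choice is an assumption of the sketch, not necessarily the one used by the application):
# loading data
data(HCI, package = "ShinyItemAnalysis")
corP <- cor(HCI[, 1:20])
# hierarchical clustering on the dissimilarity 1 - correlation
hc <- hclust(as.dist(1 - corP), method = "ward.D2")
plot(hc) # dendrogram
cutree(hc, k = 3) # item membership for 3 clusters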
The scree plot below displays two sets of eigenvalues associated with the factors in descending order. The location of a bend (an elbow) in the "real data" part can be considered indicative of the suitable number of factors (Cattell, 1966). Another rule, proposed by Kaiser (1960), discards all factors with eigenvalues below 1 (the information of a single average item).
A much better, modern approach called parallel analysis (Horn, 1965) compares the eigenvalues of the real-data correlation matrix with the eigenvalues (or, more precisely, the 95th percentiles of their sampling distributions) obtained from simulated zero-factor random matrices. The optimal number of factors to extract is the number of factors with eigenvalues larger than the eigenvalue at the first (leftmost) crossing of the two curves.
Once the optimal number of factors is found, the exploratory factor analysis (EFA) itself may be conducted. The number of factors suggested by the parallel analysis is offered as the default value. You can select the preferred factor rotation of the solution or hide loadings that are not of interest. There is also an option to sort items by their importance on each factor. Below the loadings table, there is a factor summary with the proportion of variance each factor explains, as well as a list of common model fit indices.
library(psych)
library(ggplot2)
library(ShinyItemAnalysis)
# loading data
data(HCI, package = "ShinyItemAnalysis")
data <- HCI[, 1:20]
# scree plot, parallel analysis
(fa_paral <- fa_parallel(data))
plot(fa_paral)
as.data.frame(fa_paral)
# EFA for 1, 2, and 3 factors
(FA1 <- psych::fa(data, nfactors = 1))
(FA2 <- psych::fa(data, nfactors = 2))
(FA3 <- psych::fa(data, nfactors = 3))
# Model fit for different number of factors
VSS(data)
# Path diagrams
fa.diagram(FA1)
fa.diagram(FA2)
fa.diagram(FA3)
# Higher order factor solution
(om.h <- omega(data, sl = FALSE))
Depending on the criterion variable, different types of criterion validity may be examined. For example, a correlation between the test score and future study success or future GPA may serve as evidence of predictive validity in the case of admission tests. A criterion variable may be uploaded in the Data section.
Total scores are plotted according to the criterion variable. A boxplot or scatterplot is displayed depending on whether the criterion variable is discrete or continuous. The scatterplot is provided with a red linear regression line.
An association between the total score and the criterion variable can be estimated using the Pearson product-moment correlation coefficient \(r\). The null hypothesis being tested states that the correlation is exactly 0.
library(ggplot2)
library(ShinyItemAnalysis)
# loading data
data(GMAT, package = "difNLR")
data <- GMAT[, 1:20]
score <- rowSums(data) # total score calculation
criterion <- GMAT[, "criterion"] # criterion variable
hist(criterion)
criterionD <- round(criterion) # discrete criterion variable
hist(criterionD)
# number of respondents in each criterion level
size <- as.factor(criterionD)
levels(size) <- table(as.factor(criterionD))
size <- as.numeric(paste(size))
df <- data.frame(score, criterion, criterionD, size)
# descriptive plots
### boxplot, for discrete criterion
ggplot(df, aes(y = score, x = as.factor(criterionD), fill = as.factor(criterionD))) +
geom_boxplot() +
geom_jitter(shape = 16, position = position_jitter(0.2)) +
scale_fill_brewer(palette = "Blues") +
xlab("Criterion group") +
ylab("Total score") +
coord_flip() +
theme_app()
### scatterplot, for continuous criterion
ggplot(df, aes(x = score, y = criterion)) +
geom_point() +
ylab("Criterion variable") +
xlab("Total score") +
geom_smooth(
method = lm,
se = FALSE,
color = "red"
) +
theme_app()
# test for association between total score and criterion variable
cor.test(criterion, score, method = "pearson", exact = FALSE)
Let \(\text{rel}(X)\) be the reliability of a test composed of \(I\) equally precise items measuring the same construct, \(X = X_1 + ... + X_I\). Then for a test consisting of \(I^*\) such items, that is, for a test which is \(m = \frac{I^*}{I}\) times longer/shorter, the reliability would be
$$\text{rel}(X^*) = \frac{m\cdot \text{rel}(X)}{1 + (m - 1)\cdot\text{rel}(X)}.$$The Spearman-Brown formula can be used to determine the reliability of a test with a different number of equally precise items measuring the same construct. It can also be used to determine the number of items needed to achieve a desired reliability.
In the calculations below, reliability of original data is by default set to the value of Cronbach's \(\alpha\) for the dataset currently in use. The number of items in the original data is by default set to the number of items in the dataset currently in use.
Here you can calculate an estimate of reliability for a test consisting of a different number of items.
Here you can calculate the necessary number of items to gain the required level of reliability.
library(psychometric)
# loading data
data(HCI, package = "ShinyItemAnalysis")
data <- HCI[, 1:20]
# reliability of original data
rel.original <- psychometric::alpha(data)
# number of items in original data
items.original <- ncol(data)
# number of items in new data
items.new <- 30
# ratio of tests lengths
m <- items.new / items.original
# determining reliability
SBrel(Nlength = m, rxx = rel.original)
# desired reliability
rel.new <- 0.8
# determining test length
(m.new <- SBlength(rxxp = rel.new, rxx = rel.original))
# number of required items
m.new * items.original
The split-half method uses the correlation between two subscores for an estimation of reliability. The underlying assumption is that the two halves of the test (or even all items on the test) are equally precise and measure the same underlying construct. The Spearman-Brown formula is then used to correct the estimate for the number of items.
For a test with \(I\) items, the total score is calculated as \(X = X_1 + ... + X_I\). Let \(X^*_1\) and \(X^*_2\) be the total scores calculated from the items found only in the first and in the second subset, respectively. The estimate of reliability is then given by the Spearman-Brown formula (Spearman, 1910; Brown, 1910) with \(m = 2\).
$$\text{rel}(X) = \frac{m\cdot \text{cor}(X^*_1, X^*_2)}{1 + (m - 1)\cdot\text{cor}(X^*_1, X^*_2)} = \frac{2\cdot \text{cor}(X^*_1, X^*_2)}{1 + \text{cor}(X^*_1, X^*_2)}$$You can choose below from different split-half approaches. The First-last method uses a correlation between the first half of items and the second half of items. The Even-odd method places even numbered items into the first subset and odd numbered items into the second one. The Random method performs a random split of items, thus the resulting estimate may be different for each call. Out of a specified number of random splits (10,000 by default), the Worst method selects the lowest estimate and the Average method calculates the average. In the case of an odd number of items, the first subset contains one more item than the second one.
The estimate of reliability for the First-last, Even-odd, Random, and Worst methods is calculated using the Spearman-Brown formula. The confidence interval is based on the confidence interval of the correlation using the delta method. The estimate of reliability for the Average method is the mean of the sampled reliabilities, and the confidence interval is the confidence interval of this mean.
A histogram is based on a selected number of split halves estimates (10,000 by default). The current estimate is highlighted by a red colour.
library(psych)
# loading data
data(HCI, package = "ShinyItemAnalysis")
# first-second half split
df1 <- HCI[, 1:10]
df2 <- HCI[, 11:20]
# total score calculation
ts1 <- rowSums(df1)
ts2 <- rowSums(df2)
# correlation
cor.x <- cor(ts1, ts2)
# apply Spearman-Brown formula to estimate reliability
(rel.x <- 2 * cor.x / (1 + cor.x))
# even-odd half split
df1 <- HCI[, seq(1, 20, 2)]
df2 <- HCI[, seq(2, 20, 2)]
# total score calculation
ts1 <- rowSums(df1)
ts2 <- rowSums(df2)
# correlation
cor.x <- cor(ts1, ts2)
# apply Spearman-Brown formula to estimate reliability
(rel.x <- 2 * cor.x / (1 + cor.x))
# random halves split
samp <- sample(1:20, 10)
df1 <- HCI[, samp]
df2 <- HCI[, setdiff(1:20, samp)]
# total score calculation
ts1 <- rowSums(df1)
ts2 <- rowSums(df2)
# correlation
cor.x <- cor(ts1, ts2)
# apply Spearman-Brown formula to estimate reliability
(rel.x <- 2 * cor.x / (1 + cor.x))
# minimum of 10,000 split-halves
split <- psych::splitHalf(HCI[, 1:20], raw = TRUE)
items1 <- which(split$minAB[, "A"] == 1)
items2 <- which(split$minAB[, "B"] == 1)
df1 <- HCI[, items1]
df2 <- HCI[, items2]
# total score calculation
ts1 <- rowSums(df1)
ts2 <- rowSums(df2)
# correlation
cor.x <- cor(ts1, ts2)
# apply Spearman-Brown formula to estimate reliability
(rel.x <- 2 * cor.x / (1 + cor.x))
# calculation of CI
z.r <- 0.5 * log((1 + cor.x) / (1 - cor.x))
n <- length(ts1)
z.low <- z.r - 1.96 * sqrt(1 / (n - 3))
z.upp <- z.r + 1.96 * sqrt(1 / (n - 3))
cor.low <- (exp(2 * z.low) - 1) / (exp(2 * z.low) + 1)
cor.upp <- (exp(2 * z.upp) - 1) / (exp(2 * z.upp) + 1)
rel.x <- 2 * cor.x / (1 + cor.x)
rel.low <- 2 * cor.low / (1 + cor.low)
rel.upp <- 2 * cor.upp / (1 + cor.upp)
# average 10,000 split-halves
split <- psych::splitHalf(HCI[, 1:20], raw = TRUE)
(rel.x <- mean(split$raw))
# average all split-halves
split <- psych::splitHalf(HCI[, 1:20], raw = TRUE, brute = TRUE)
(rel.x <- mean(split$raw))
# calculation of CI
n <- length(split$raw)
rel.low <- rel.x - 1.96 * sd(split$raw) / sqrt(n)
rel.upp <- rel.x + 1.96 * sd(split$raw) / sqrt(n)
Cronbach's \(\alpha\) is an estimate of the internal consistency of a psychometric test. It is a function of the number of items in a test, the average covariance between item-pairs, and the variance of the total score (Cronbach, 1951).
For a test with \(I\) items, where \(X = X_1 + ... + X_I\) is the total score, \(\sigma^2_X\) its variance, and \(\sigma^2_{X_i}\) the variances of the items, Cronbach's \(\alpha\) is given by the following equation:
$$\alpha = \frac{I}{I-1}\left(1 - \frac{\sum_{i = 1}^I \sigma^2_{X_i}}{\sigma^2_X}\right)$$A confidence interval is based on the F distribution, as proposed by Feldt et al. (1987).
library(psychometric)
# loading data
data(HCI, package = "ShinyItemAnalysis")
data <- HCI[, 1:20]
# Cronbach's alpha with confidence interval
a <- psychometric::alpha(data)
psychometric::alpha.CI(a, N = nrow(data), k = ncol(data), level = 0.95)
This section illustrates the issue of range-restricted reliability and the difficulties with maximum likelihood estimation, described in more detail in the context of inter-rater reliability in grant proposal review in Erosheva, Martinkova, & Lee (2021). To replicate their examples, select the AIBS toy dataset in the Data section.
Below, you may select the ratio and type of range restriction given by the proportion of rated subjects/objects. A rated object could be a grant proposal in grant review (as is the case in the AIBS dataset), a student in educational assessment, a job application in hiring, a patient in a medical study, etc. Further, you may select the direction of restriction (top or bottom).
The left plot illustrates the variability in ratings for the whole dataset, shading out the data which would be lost due to range restriction. The right plot provides the calculated inter-rater reliability estimates, defined by the intraclass correlation in the simplest model, which includes the ratee effect only. The estimates are accompanied by bootstrapped 95% confidence intervals; see the settings section for further details on the number of bootstrap samples used.
library(ggplot2)
library(purrr)
library(ShinyItemAnalysis)
# loading data
data(AIBS, package = "ShinyItemAnalysis")
# estimate reliability with ICC for complete AIBS dataset
ICCrestricted(Data = AIBS, case = "ID", var = "Score", rank = "ScoreRankAdj")
# estimate range-restricted ICC
ICCrestricted(Data = AIBS, case = "ID", var = "Score", rank = "ScoreRankAdj",
sel = 0.90, dir = "top")
# caterpillar plot
AIBS %>%
ggplot(aes(x = ScoreRankAdj, y = Score, group = ID)) +
geom_line(col = "gray") +
geom_point(shape = 1, size = 1.5) +
stat_summary(fun = mean, fun.args = list(na.rm = TRUE), geom = "point",
col = "red", shape = 5, size = 2.5, stroke = .35) +
labs(x = "Ratee rank", y = "Rating (score)") +
coord_cartesian(ylim = c(1, 5)) +
theme_app()
# estimate all possible top-restricted subsets
all_top_restricted <- map_dfr(2:72,
~ ICCrestricted(Data = AIBS, case = "ID", var = "Score",
rank = "ScoreRankAdj", sel = .x, nsim = 10))
all_top_restricted
# or alternatively, in base R:
base_way <- lapply(2:72, function(x) {
ICCrestricted(Data = AIBS, case = "ID", var = "Score", rank = "ScoreRankAdj",
sel = x, nsim = 10)})
do.call(rbind.data.frame, base_way)
# plot
all_top_restricted %>%
ggplot(aes(prop_sel, ICC1, ymin = ICC1_LCI, ymax = ICC1_UCI)) +
geom_pointrange() + scale_x_continuous(labels = scales::percent) +
labs(x = ("Proportion of top ratees"), y = "Reliability") +
coord_cartesian(ylim = c(0, 1), xlim = c(0, 1)) +
theme_app()
Traditional item analysis uses proportions and correlations to estimate item properties. Displayed are the difficulty (red) and discrimination (blue) of all items, ordered by difficulty.
The Difficulty of an item is by default estimated as its average scaled score, i.e., the average item score divided by its range. Below you can change the estimate of difficulty to the average score of the item. For binary items, both estimates are equivalent and can be interpreted as the percentage of respondents who answered the item correctly.
Discrimination is by default estimated as the coRrelation between Item and Total score (RIT index). Other options for the discrimination index include the coRrelation between Item and the total score based on the Rest of the items (RIR index). Discrimination can also be estimated as the difference in (scaled) item score between the upper and lower third of respondents (Upper-Lower Index, ULI). ULI can be further customized by changing the number of groups and by changing which groups are compared (see also Martinkova, Stepanek, et al., 2017). As a rule of thumb, all items with a discrimination lower than 0.2 (the threshold in the plot) should be checked for content. Lower discrimination is to be expected in the case of very easy or very difficult items, or for ULI based on more homogeneous groups (such as the 4th and the last fifth). The threshold may be adjusted for these cases or may be set to 0.
library(psych)
library(ShinyItemAnalysis)
# loading data
data(GMAT, package = "difNLR")
Data <- GMAT[, 1:20]
# difficulty and discrimination plot
DDplot(Data, discrim = 'ULI', k = 3, l = 1, u = 3)
# Cronbach alpha
psych::alpha(Data)
# traditional item analysis table
ItemAnalysis(Data)
Empirical item response curves describe how test takers from different ability groups select the available responses. In the case of multiple-choice items, these curves can show whether the distractors (wrong answers) function effectively by drawing the test takers away from the correct answer.
With the option Combinations, all item selection patterns are plotted (e.g., AB, ACD, BC). With the option Distractors, answers are split among the remaining incorrect answers (e.g., A, B, C, D).
library(data.table)
library(ShinyItemAnalysis)
# loading data
data(GMATtest, GMATkey, package = "difNLR")
Data <- GMATtest[, 1:20]
key <- unlist(GMATkey)
# combinations - plot for item 1 and 3 groups
plotDistractorAnalysis(Data, key, num.groups = 3, item = 1, multiple.answers = TRUE)
# distractors - plot for item 1 and 3 groups
plotDistractorAnalysis(Data, key, num.groups = 3, item = 1, multiple.answers = FALSE)
# table with counts and margins - item 1 and 3 groups
DA <- DistractorAnalysis(Data, key, num.groups = 3)[[1]]
dcast(as.data.frame(DA), response ~ score.level, sum, margins = TRUE, value.var = "Freq")
# table with proportions - item 1 and 3 groups
DistractorAnalysis(Data, key, num.groups = 3, p.table = TRUE)[[1]]
This section requires a criterion variable (e.g., future study success or future GPA in the case of admission tests) which should correlate with the measurement. A criterion variable can be uploaded in the Data section. Here you can explore how the criterion correlates with individual items.
The following plot depicts the criterion validity of every individual item (blue) together with its difficulty (red). Items are ordered by difficulty. You can choose from two indices of criterion validity: the item-criterion correlation and the so-called "item validity index". The former refers to a simple Pearson product-moment correlation (or, in the case of a binary dataset, a point-biserial correlation); the latter also takes into account the item variance (see Allen & Yen, 1979, for details). Further item analysis can be performed in the Item Analysis tab.
In a distractor analysis based on a criterion variable, we are interested in how test takers select the correct answer and the distractors (wrong answers) with respect to groups based on the criterion variable.
With the option Combinations, all item selection patterns are plotted (e.g., AB, ACD, BC). With the option Distractors, answers are split into the various distractors (e.g., A, B, C, D).
A test for association between the total score and the criterion variable is based on the Pearson product-moment correlation coefficient \(r\). The null hypothesis is that the correlation is 0.
library(ShinyItemAnalysis)
# loading data
data(GMAT, GMATtest, GMATkey, package = "difNLR")
data <- GMATtest[, 1:20]
data_binary <- GMAT[, 1:20]
key <- GMATkey
criterion <- GMAT[, "criterion"]
# item difficulty / criterion validity plot
DDplot(data_binary, criterion = criterion, val_type = "simple")
# distractor plot for item 1 and 3 groups
plotDistractorAnalysis(data, key, num.groups = 3, item = 1, criterion = criterion)
# test for association between total score and criterion variable for item 1
cor.test(criterion, data_binary[, 1], method = "pearson", exact = FALSE)
Various regression models may be fitted to describe item properties in more detail. Logistic regression can model the dependency of the probability of correctly answering item \(i\) by respondent \(p\) on their total score \(X_p\) by an S-shaped logistic curve. Parameter \(\beta_{i0}\) describes the horizontal position of the fitted curve and parameter \(\beta_{i1}\) describes its slope.
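In this intercept/slope parametrization, the fitted curve takes the form
$$\mathrm{P}\left(Y_{pi} = 1\vert X_p\right) = \frac{e^{\beta_{i0} + \beta_{i1} X_p}}{1 + e^{\beta_{i0} + \beta_{i1} X_p}},$$the same function as fun in the R code below.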
Points represent proportion of correct answers with respect to the total score. Their size is determined by the count of respondents who achieved a given level of the total score.
library(ggplot2)
library(ShinyItemAnalysis)
# loading data
data(GMAT, package = "difNLR")
data <- GMAT[, 1:20]
score <- rowSums(data) # total score
# logistic model for item 1
fit <- glm(data[, 1] ~ score, family = binomial)
# coefficients
coef(fit) # estimates
sqrt(diag(vcov(fit))) # SE
summary(fit)$coefficients[, 1:2] # estimates and SE
# function for plot
fun <- function(x, b0, b1) {
exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))
}
# empirical probabilities calculation
df <- data.frame(
x = sort(unique(score)),
y = tapply(data[, 1], score, mean),
size = as.numeric(table(score))
)
# plot of estimated curve
ggplot(df, aes(x = x, y = y)) +
geom_point(aes(size = size),
color = "darkblue",
fill = "darkblue",
shape = 21, alpha = 0.5
) +
stat_function(
fun = fun, geom = "line",
args = list(
b0 = coef(fit)[1],
b1 = coef(fit)[2]
),
size = 1,
color = "darkblue"
) +
xlab("Total score") +
ylab("Probability of correct answer") +
ylim(0, 1) +
ggtitle("Item 1") +
theme_app()
Various regression models may be fitted to describe item properties in more detail. Logistic regression can model dependency of the probability of correctly answering item \(i\) by respondent \(p\) on their standardized total score \(Z_p\) (Z-score) by an S-shaped logistic curve. Parameter \(\beta_{i0}\) describes horizontal position of the fitted curve and parameter \(\beta_{i1}\) describes its slope.
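In this intercept/slope parametrization, the fitted curve takes the form
$$\mathrm{P}\left(Y_{pi} = 1\vert Z_p\right) = \frac{e^{\beta_{i0} + \beta_{i1} Z_p}}{1 + e^{\beta_{i0} + \beta_{i1} Z_p}},$$the same function as fun in the R code below.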
Points represent proportion of correct answers with respect to the standardized total score. Their size is determined by the count of respondents who achieved a given level of the standardized total score.
library(ggplot2)
library(ShinyItemAnalysis)
# loading data
data(GMAT, package = "difNLR")
data <- GMAT[, 1:20]
zscore <- scale(rowSums(data)) # standardized total score
# logistic model for item 1
fit <- glm(data[, 1] ~ zscore, family = binomial)
# coefficients
coef(fit) # estimates
sqrt(diag(vcov(fit))) # SE
summary(fit)$coefficients[, 1:2] # estimates and SE
# function for plot
fun <- function(x, b0, b1) {
exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))
}
# empirical probabilities calculation
df <- data.frame(
x = sort(unique(zscore)),
y = tapply(data[, 1], zscore, mean),
size = as.numeric(table(zscore))
)
# plot of estimated curve
ggplot(df, aes(x = x, y = y)) +
geom_point(aes(size = size),
color = "darkblue",
fill = "darkblue",
shape = 21, alpha = 0.5
) +
stat_function(
fun = fun, geom = "line",
args = list(
b0 = coef(fit)[1],
b1 = coef(fit)[2]
),
size = 1,
color = "darkblue"
) +
xlab("Standardized total score") +
ylab("Probability of correct answer") +
ylim(0, 1) +
ggtitle("Item 1") +
theme_app()
Various regression models may be fitted to describe item properties in more detail. Logistic regression can model the dependency of the probability of correctly answering item \(i\) by respondent \(p\) on their standardized total score \(Z_p\) (Z-score) by an S-shaped logistic curve. Note the change in parametrization: the IRT parametrization used here corresponds to the parametrization used in IRT models. Parameter \(b_{i}\) describes the horizontal position of the fitted curve (difficulty) and parameter \(a_{i}\) describes its slope at the inflection point (discrimination).
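In the IRT parametrization, the fitted curve takes the form
$$\mathrm{P}\left(Y_{pi} = 1\vert Z_p\right) = \frac{e^{a_i\left(Z_p - b_i\right)}}{1 + e^{a_i\left(Z_p - b_i\right)}},$$the same function as fun in the R code below.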
Points represent proportion of correct answers with respect to the standardized total score. Their size is determined by the count of respondents who achieved a given level of the standardized total score.
library(ggplot2)
library(msm)
library(ShinyItemAnalysis)
# loading data
data(GMAT, package = "difNLR")
data <- GMAT[, 1:20]
zscore <- scale(rowSums(data)) # standardized total score
# logistic model for item 1
fit <- glm(data[, 1] ~ zscore, family = binomial)
# coefficients
(coef <- c(a = coef(fit)[2], b = -coef(fit)[1] / coef(fit)[2])) # estimates
# SE using delta method
(se <- deltamethod(
list(~x2, ~ -x1 / x2),
mean = coef(fit),
cov = vcov(fit),
ses = TRUE
))
cbind(coef, se) # estimates and SE
# function for plot
fun <- function(x, a, b) {
exp(a * (x - b)) / (1 + exp(a * (x - b)))
}
# empirical probabilities calculation
df <- data.frame(
x = sort(unique(zscore)),
y = tapply(data[, 1], zscore, mean),
size = as.numeric(table(zscore))
)
# plot of estimated curve
ggplot(df, aes(x = x, y = y)) +
geom_point(aes(size = size),
color = "darkblue",
fill = "darkblue",
shape = 21, alpha = 0.5
) +
stat_function(
fun = fun, geom = "line",
args = list(
a = coef[1],
b = coef[2]
),
size = 1,
color = "darkblue"
) +
xlab("Standardized total score") +
ylab("Probability of correct answer") +
ylim(0, 1) +
ggtitle("Item 1") +
theme_app()
Various regression models may be fitted to describe item properties in more detail. Nonlinear regression can model the dependency of the probability of correctly answering item \(i\) by respondent \(p\) on their standardized total score \(Z_p\) (Z-score) by an S-shaped logistic curve. The IRT parametrization used here corresponds to the parametrization used in IRT models. Parameter \(b_{i}\) describes the horizontal position of the fitted curve (difficulty) and parameter \(a_{i}\) describes its slope at the inflection point (discrimination). This model also allows for a nonzero lower asymptote \(c_i\) (pseudo-guessing parameter).
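With the pseudo-guessing parameter included, the fitted curve takes the form
$$\mathrm{P}\left(Y_{pi} = 1\vert Z_p\right) = c_i + \left(1 - c_i\right)\frac{e^{a_i\left(Z_p - b_i\right)}}{1 + e^{a_i\left(Z_p - b_i\right)}},$$the same function as fun in the R code below.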
Points represent proportion of correct answers with respect to the standardized total score. Their size is determined by the count of respondents who achieved a given level of the standardized total score.
library(difNLR)
library(ggplot2)
library(ShinyItemAnalysis)
# loading data
data(GMAT, package = "difNLR")
data <- GMAT[, 1:20]
zscore <- scale(rowSums(data)) # standardized total score
# NLR 3P model for item 1
fun <- function(x, a, b, c) {
c + (1 - c) * exp(a * (x - b)) / (1 + exp(a * (x - b)))
}
fit <- nls(data[, 1] ~ fun(zscore, a, b, c),
algorithm = "port",
start = startNLR(
data, GMAT[, "group"],
model = "3PLcg",
parameterization = "classic"
)[[1]][1:3],
lower = c(-Inf, -Inf, 0),
upper = c(Inf, Inf, 1)
)
# coefficients
coef(fit) # estimates
sqrt(diag(vcov(fit))) # SE
summary(fit)$coefficients[, 1:2] # estimates and SE
# empirical probabilities calculation
df <- data.frame(
x = sort(unique(zscore)),
y = tapply(data[, 1], zscore, mean),
size = as.numeric(table(zscore))
)
# plot of estimated curve
ggplot(df, aes(x = x, y = y)) +
geom_point(aes(size = size),
color = "darkblue",
fill = "darkblue",
shape = 21, alpha = 0.5
) +
stat_function(
fun = fun, geom = "line",
args = list(
a = coef(fit)[1],
b = coef(fit)[2],
c = coef(fit)[3]
),
size = 1,
color = "darkblue"
) +
xlab("Standardized total score") +
ylab("Probability of correct answer") +
ylim(0, 1) +
ggtitle("Item 1") +
theme_app()
Various regression models may be fitted to describe item properties in more detail. Nonlinear regression can model dependency of the probability of correctly answering item \(i\) by respondent \(p\) on their standardized total score \(Z_p\) (Z-score) by an S-shaped logistic curve. The IRT parametrization used here corresponds to the parametrization used in IRT models. Parameter \(b_{i}\) describes horizontal position of the fitted curve (difficulty), parameter \(a_{i}\) describes its slope at the inflection point (discrimination), pseudo-guessing parameter \(c_i\) describes its lower asymptote and inattention parameter \(d_i\) describes its upper asymptote.
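With both asymptote parameters included, the fitted curve takes the form
$$\mathrm{P}\left(Y_{pi} = 1\vert Z_p\right) = c_i + \left(d_i - c_i\right)\frac{e^{a_i\left(Z_p - b_i\right)}}{1 + e^{a_i\left(Z_p - b_i\right)}},$$the same function as fun in the R code below.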
Points represent proportion of correct answers with respect to the standardized total score. Their size is determined by the count of respondents who achieved a given level of the standardized total score.
library(difNLR)
library(ggplot2)
library(ShinyItemAnalysis)
# loading data
data(GMAT, package = "difNLR")
data <- GMAT[, 1:20]
zscore <- scale(rowSums(data)) # standardized total score
# NLR 4P model for item 1
fun <- function(x, a, b, c, d) {
c + (d - c) * exp(a * (x - b)) / (1 + exp(a * (x - b)))
}
fit <- nls(data[, 1] ~ fun(zscore, a, b, c, d),
algorithm = "port",
start = startNLR(
data, GMAT[, "group"],
model = "4PLcgdg",
parameterization = "classic"
)[[1]][1:4],
lower = c(-Inf, -Inf, 0, 0),
upper = c(Inf, Inf, 1, 1)
)
# coefficients
coef(fit) # estimates
sqrt(diag(vcov(fit))) # SE
summary(fit)$coefficients[, 1:2] # estimates and SE
# empirical probabilities calculation
df <- data.frame(
x = sort(unique(zscore)),
y = tapply(data[, 1], zscore, mean),
size = as.numeric(table(zscore))
)
# plot of estimated curve
ggplot(df, aes(x = x, y = y)) +
geom_point(aes(size = size),
color = "darkblue",
fill = "darkblue",
shape = 21, alpha = 0.5
) +
stat_function(
fun = fun, geom = "line",
args = list(
a = coef(fit)[1],
b = coef(fit)[2],
c = coef(fit)[3],
d = coef(fit)[4]
),
size = 1,
color = "darkblue"
) +
xlab("Standardized total score") +
ylab("Probability of correct answer") +
ylim(0, 1) +
ggtitle("Item 1") +
theme_app()
Here you can compare a classic 2PL logistic regression model to the non-linear 3PL and 4PL models item by item, using information criteria such as AIC and BIC:
Rows BEST indicate which model has the lowest value of the given information criterion.
library(difNLR)
# loading data
data(GMAT, package = "difNLR")
Data <- GMAT[, 1:20]
zscore <- scale(rowSums(Data)) # standardized total score
# function for fitting models
fun <- function(x, a, b, c, d) {
c + (d - c) * exp(a * (x - b)) / (1 + exp(a * (x - b)))
}
# starting values for item 1
start <- startNLR(
Data, GMAT[, "group"], model = "4PLcgdg",
parameterization = "classic"
)[[1]][, 1:4]
# 2PL model for item 1
fit2PL <- nls(Data[, 1] ~ fun(zscore, a, b, c = 0, d = 1),
algorithm = "port",
start = start[1:2]
)
# NLR 3P model for item 1
fit3PL <- nls(Data[, 1] ~ fun(zscore, a, b, c, d = 1),
algorithm = "port",
start = start[1:3],
lower = c(-Inf, -Inf, 0),
upper = c(Inf, Inf, 1)
)
# NLR 4P model for item 1
fit4PL <- nls(Data[, 1] ~ fun(zscore, a, b, c, d),
algorithm = "port",
start = start,
lower = c(-Inf, -Inf, 0, 0),
upper = c(Inf, Inf, 1, 1)
)
# comparison
### AIC
AIC(fit2PL)
AIC(fit3PL)
AIC(fit4PL)
### BIC
BIC(fit2PL)
BIC(fit3PL)
BIC(fit4PL)
Various regression models may be fitted to describe item properties in more detail. Cumulative logit regression can model cumulative probabilities, i.e., the probabilities of obtaining an item score higher than or equal to 1, 2, 3, etc.
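With a common slope across categories (the parallel = TRUE setting used in the code below), the cumulative probabilities take the logistic form
$$\mathrm{P}\left(Y_{pi} \geq k \vert Z_p\right) = \frac{e^{\beta_{i0k} + \beta_{i1} Z_p}}{1 + e^{\beta_{i0k} + \beta_{i1} Z_p}},$$with a category-specific intercept \(\beta_{i0k}\); category probabilities are then obtained as differences of successive cumulative probabilities.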
A cumulative logit model can be fitted on selected Observed score - standardized total scores or total scores, using IRT or classical (intercept/slope) parametrization.
Lines determine the cumulative probabilities \(\mathrm{P}(Y_{pi} \geq k)\). Circles represent the proportion of answers having at least \(k\) points with respect to the matching criterion, i.e., the empirical cumulative probabilities. The size of the points is determined by the count of respondents who achieved a given level of the matching criterion.
Lines determine the category probabilities \(\mathrm{P}(Y_{pi} = k)\). Circles represent the proportion of answers having exactly \(k\) points with respect to the matching criterion, i.e., the empirical category probabilities. The size of the points is determined by the count of respondents who achieved a given level of the matching criterion.
library(msm)
library(ShinyItemAnalysis)
library(VGAM)
# loading data
data(Science, package = "mirt")
# standardized total score calculation
zscore <- scale(rowSums(Science))
Science[, 1] <- factor(
Science[, 1], levels = sort(unique(Science[, 1])), ordered = TRUE
)
# cumulative logit model for item 1
fit <- vglm(Science[, 1] ~ zscore,
family = cumulative(reverse = TRUE, parallel = TRUE))
# coefficients under intercept/slope parametrization
coef(fit) # estimates
sqrt(diag(vcov(fit))) # SE
# IRT parametrization
# delta method
num_par <- length(coef(fit))
formula <- append(
paste0("~ x", num_par),
as.list(paste0("~ -x", 1:(num_par - 1), "/", "x", num_par))
)
formula <- lapply(formula, as.formula)
se <- deltamethod(
formula,
mean = coef(fit),
cov = vcov(fit),
ses = TRUE
)
# estimates and SE in IRT parametrization
cbind(c(coef(fit)[num_par], -coef(fit)[-num_par] / coef(fit)[num_par]), se)
# plot of estimated cumulative probabilities
plotCumulative(fit, type = "cumulative", matching.name = "Standardized total score")
# plot of estimated category probabilities
plotCumulative(fit, type = "category", matching.name = "Standardized total score")
Models for ordinal responses need not use cumulative probabilities. An adjacent categories model assumes a linear form of the logarithm of the ratio of the probabilities of two successive scores (e.g., 1 vs. 2, 2 vs. 3, etc.), i.e., of the adjacent category logits.
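With a common slope across categories (the parallel = TRUE setting used in the code below), the adjacent category logits take the form
$$\log\frac{\mathrm{P}\left(Y_{pi} = k \vert Z_p\right)}{\mathrm{P}\left(Y_{pi} = k - 1 \vert Z_p\right)} = \beta_{i0k} + \beta_{i1} Z_p,$$with a category-specific intercept \(\beta_{i0k}\).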
An adjacent category logit model can be fitted on selected Observed score - standardized total scores or total scores, using IRT or classical (intercept/slope) parametrization.
Lines determine the category probabilities \(\mathrm{P}(Y_{pi} = k)\). Circles represent the proportion of answers with \(k\) points with respect to the total score, i.e., the empirical category probabilities. The size of the circles is determined by the count of respondents who achieved a given level of the total score.
library(msm)
library(ShinyItemAnalysis)
library(VGAM)
# loading data
data(Science, package = "mirt")
# standardized total score calculation
zscore <- scale(rowSums(Science))
Science[, 1] <- factor(
Science[, 1], levels = sort(unique(Science[, 1])), ordered = TRUE
)
# adjacent category logit model for item 1
fit <- vglm(Science[, 1] ~ zscore,
family = acat(reverse = FALSE, parallel = TRUE))
# coefficients under intercept/slope parametrization
coef(fit) # estimates
sqrt(diag(vcov(fit))) # SE
# IRT parametrization
# delta method
num_par <- length(coef(fit))
formula <- append(
paste0("~ x", num_par),
as.list(paste0("~ -x", 1:(num_par - 1), "/", "x", num_par))
)
formula <- lapply(formula, as.formula)
se <- deltamethod(
formula,
mean = coef(fit),
cov = vcov(fit),
ses = TRUE
)
# estimates and SE in IRT parametrization
cbind(c(coef(fit)[num_par], -coef(fit)[-num_par] / coef(fit)[num_par]), se)
# plot of estimated category probabilities
plotAdjacent(fit, matching.name = "Standardized total score")
Various regression models may be fitted to describe item properties in more detail. Multinomial regression allows for simultaneous modelling of the probability of choosing given distractors on selected Observed score - standardized total scores or total scores, using IRT or classical (intercept/slope) parametrization.
Points represent the proportion of a selected option with respect to the matching criterion. Their size is determined by the count of respondents who achieved a given level of the matching criterion and who selected a given option.
library(msm)
library(nnet)
library(ShinyItemAnalysis)
# loading data
data(GMAT, GMATtest, GMATkey, package = "difNLR")
# standardized total score calculation
zscore <- scale(rowSums(GMAT[, 1:20]))
# multinomial model for item 1
fit <- multinom(relevel(GMATtest[, 1], ref = paste(GMATkey[1])) ~ zscore)
# coefficients under intercept/slope parametrization
coef(fit) # estimates
sqrt(diag(vcov(fit))) # SE
# IRT parametrization
# delta method
subst_vcov <- function(vcov, cat) {
ind <- grep(cat, colnames(vcov))
vcov[ind, ind]
}
se <- t(sapply(
rownames(coef(fit)),
function(.x) {
vcov_subset <- subst_vcov(vcov(fit), .x)
msm::deltamethod(
list(~ -x1 / x2, ~x2),
mean = coef(fit)[.x, ],
cov = vcov_subset,
ses = TRUE
)
}
))
# estimates and SE in IRT parametrization
cbind(-coef(fit)[, 1] / coef(fit)[, 2], se[, 1], coef(fit)[, 2], se[, 2])
# plot of estimated category probabilities
plotMultinomial(fit, zscore, matching.name = "Standardized total score")
Item Response Theory (IRT) models are mixed-effect regression models in which respondent ability \(\theta_p\) is assumed to be latent and is estimated together with item parameters.
The item characteristic function \(\pi_{pi} = \mathrm{P}\left(Y_{pi} = 1\vert \theta_{p}\right)\) describes the probability of a correct answer to given item \(i\). The item information function \(\mathrm{I}_i(\theta_p)\) describes how well the item discriminates between two nearby ability levels, i.e., how much information it provides about the given ability. The test information function \(\mathrm{T}(\theta_p)\) sums up all item informations and thus describes the information of the whole test. The standard error (SE) of measurement is the inverse of the square root of the test information.
The equation and estimated item parameters can be displayed using the IRT or intercept/slope parametrization.
Estimates of item parameters can be displayed using the IRT or intercept/slope parametrization, which can be selected at the top of this tab. Parameter estimates are completed by SX2 item fit statistics (Orlando & Thissen, 2000). SX2 statistics are computed only when no missing data are present.
This table shows the response and factor scores for only six respondents. If you want to see the scores for all respondents, click on the Download abilities button.
The Wright map (Wilson, 2005; Wright & Stone, 1979), also called an item-person map, is a graphical tool used to display person ability estimates and item parameters on one scale. The person side (left) represents a histogram of the estimated abilities of the respondents. The item side (right) displays estimates of the difficulty parameters of the individual items.
library(mirt)
library(ShinyItemAnalysis)
# loading data
data(GMAT, package = "difNLR")
# fitting Rasch model
fit <- mirt(GMAT[, 1:20], model = 1, itemtype = "Rasch", SE = TRUE)
# item characteristic curves
plot(fit, type = "trace", facet_items = FALSE)
# test score curve
plot(fit)
# item information curves
plot(fit, type = "infotrace", facet_items = FALSE)
# test information curve
plot(fit, type = "infoSE")
# estimated parameters
coef(fit, simplify = TRUE) # classical intercept-slope parametrization
coef(fit) # including confidence intervals
coef(fit, IRTpars = TRUE, simplify = TRUE) # IRT parametrization
coef(fit, IRTpars = TRUE) # including confidence intervals
# item fit statistics
itemfit(fit)
# factor scores vs standardized total scores
fs <- as.vector(fscores(fit))
head(fs)
fs.se <- fscores(fit, full.scores.SE = TRUE) # with SE
head(fs.se)
sts <- as.vector(scale(rowSums(GMAT[, 1:20])))
plot(fs ~ sts, xlab = "Standardized total score", ylab = "Factor score")
cor(fs, sts)
# Wright map
b <- coef(fit, IRTpars = TRUE, simplify = TRUE)$items[, "b"]
ggWrightMap(fs, b)
# you can also use the ltm package
library(ltm)
# fitting Rasch model
fit <- rasch(GMAT[, 1:20], constraint = cbind(ncol(GMAT[, 1:20]) + 1, 1))
# item characteristic curves
plot(fit)
# item information curves
plot(fit, type = "IIC")
# test information curve
plot(fit, items = 0, type = "IIC")
# estimated parameters
coef(fit)
# factor scores vs standardized total scores
df1 <- ltm::factor.scores(fit, return.MIvalues = TRUE)$score.dat
FS <- as.vector(df1[, "z1"])
df2 <- df1
df2$Obs <- df2$Exp <- df2$z1 <- df2$se.z1 <- NULL
STS <- as.vector(scale(rowSums(df2[, 1:20])))
df <- data.frame(FS, STS)
plot(FS ~ STS, data = df, xlab = "Standardized total score", ylab = "Factor score")
cor(FS, STS)
library(ltm)
library(mirt)
library(ShinyItemAnalysis)
# loading data
data(GMAT, package = "difNLR")
# fitting 1PL model
fit <- mirt(GMAT[, 1:20],
model = 1, itemtype = "2PL",
constrain = list((1:20) + seq(0, (20 - 1) * 3, 3)), SE = TRUE
)
# item characteristic curves
plot(fit, type = "trace", facet_items = FALSE)
# test score curve
plot(fit)
# item information curves
plot(fit, type = "infotrace", facet_items = FALSE)
# test information curve
plot(fit, type = "infoSE")
# estimated parameters
coef(fit, simplify = TRUE) # classical intercept-slope parametrization
coef(fit) # including confidence intervals
coef(fit, IRTpars = TRUE, simplify = TRUE) # IRT parametrization
coef(fit, IRTpars = TRUE) # including confidence intervals
# item fit statistics
itemfit(fit)
# factor scores vs standardized total scores
fs <- as.vector(fscores(fit))
sts <- as.vector(scale(rowSums(GMAT[, 1:20])))
plot(fs ~ sts, xlab = "Standardized total score", ylab = "Factor score")
cor(fs, sts)
# Wright map
b <- coef(fit, IRTpars = TRUE, simplify = TRUE)$items[, "b"]
ggWrightMap(fs, b)
# you can also use the ltm package
library(ltm)
# fitting 1PL model
fit <- rasch(GMAT[, 1:20])
# item characteristic curves
plot(fit)
# item information curves
plot(fit, type = "IIC")
# test information curve
plot(fit, items = 0, type = "IIC")
# estimated parameters
coef(fit)
# factor scores vs standardized total scores
df1 <- ltm::factor.scores(fit, return.MIvalues = TRUE)$score.dat
FS <- as.vector(df1[, "z1"])
df2 <- df1
df2$Obs <- df2$Exp <- df2$z1 <- df2$se.z1 <- NULL
STS <- as.vector(scale(rowSums(df2[, 1:20])))
df <- data.frame(FS, STS)
plot(FS ~ STS, data = df, xlab = "Standardized total score", ylab = "Factor score")
cor(FS, STS)
library(ltm)
library(mirt)
library(ShinyItemAnalysis)
# loading data
data(GMAT, package = "difNLR")
# fitting 2PL model
fit <- mirt(GMAT[, 1:20], model = 1, itemtype = "2PL", SE = TRUE)
# item characteristic curves
plot(fit, type = "trace", facet_items = FALSE)
# test score curve
plot(fit)
# item information curves
plot(fit, type = "infotrace", facet_items = FALSE)
# test information curve
plot(fit, type = "infoSE")
# estimated parameters
coef(fit, simplify = TRUE) # classical intercept-slope parametrization
coef(fit) # including confidence intervals
coef(fit, IRTpars = TRUE, simplify = TRUE) # IRT parametrization
coef(fit, IRTpars = TRUE) # including confidence intervals
# item fit statistics
itemfit(fit)
# factor scores vs standardized total scores
fs <- as.vector(fscores(fit))
sts <- as.vector(scale(rowSums(GMAT[, 1:20])))
plot(fs ~ sts, xlab = "Standardized total score", ylab = "Factor score")
cor(fs, sts)
# you can also use the ltm package
library(ltm)
# fitting 2PL model
fit <- ltm(GMAT[, 1:20] ~ z1, IRT.param = TRUE)
# item characteristic curves
plot(fit)
# item information curves
plot(fit, type = "IIC")
# test information curve
plot(fit, items = 0, type = "IIC")
# estimated parameters
coef(fit)
# factor scores vs standardized total scores
df1 <- ltm::factor.scores(fit, return.MIvalues = TRUE)$score.dat
FS <- as.vector(df1[, "z1"])
df2 <- df1
df2$Obs <- df2$Exp <- df2$z1 <- df2$se.z1 <- NULL
STS <- as.vector(scale(rowSums(df2[, 1:20])))
df <- data.frame(FS, STS)
plot(FS ~ STS, data = df, xlab = "Standardized total score", ylab = "Factor score")
cor(FS, STS)
library(ltm)
library(mirt)
library(ShinyItemAnalysis)
# loading data
data(GMAT, package = "difNLR")
# fitting 3PL model
fit <- mirt(GMAT[, 1:20], model = 1, itemtype = "3PL", SE = TRUE)
# item characteristic curves
plot(fit, type = "trace", facet_items = FALSE)
# test score curve
plot(fit)
# item information curves
plot(fit, type = "infotrace", facet_items = FALSE)
# test information curve
plot(fit, type = "infoSE")
# estimated parameters
coef(fit, simplify = TRUE) # classical intercept-slope parametrization
coef(fit) # including confidence intervals
coef(fit, IRTpars = TRUE, simplify = TRUE) # IRT parametrization
coef(fit, IRTpars = TRUE) # including confidence intervals
# item fit statistics
itemfit(fit)
# factor scores vs standardized total scores
fs <- as.vector(fscores(fit))
sts <- as.vector(scale(rowSums(GMAT[, 1:20])))
plot(fs ~ sts, xlab = "Standardized total score", ylab = "Factor score")
cor(fs, sts)
# you can also use the ltm package
library(ltm)
# fitting 3PL model
fit <- tpm(GMAT[, 1:20], IRT.param = TRUE)
# item characteristic curves
plot(fit)
# item information curves
plot(fit, type = "IIC")
# test information curve
plot(fit, items = 0, type = "IIC")
# estimated parameters
coef(fit)
# factor scores vs standardized total scores
df1 <- ltm::factor.scores(fit, return.MIvalues = TRUE)$score.dat
FS <- as.vector(df1[, "z1"])
df2 <- df1
df2$Obs <- df2$Exp <- df2$z1 <- df2$se.z1 <- NULL
STS <- as.vector(scale(rowSums(df2[, 1:20])))
df <- data.frame(FS, STS)
plot(FS ~ STS, data = df, xlab = "Standardized total score", ylab = "Factor score")
cor(FS, STS)
library(mirt)
library(ShinyItemAnalysis)
# loading data
data(GMAT, package = "difNLR")
# fitting 4PL model
fit <- mirt(GMAT[, 1:20], model = 1, itemtype = "4PL", SE = TRUE)
# item characteristic curves
plot(fit, type = "trace", facet_items = FALSE)
# test score curve
plot(fit)
# item information curves
plot(fit, type = "infotrace", facet_items = FALSE)
# test information curve
plot(fit, type = "infoSE")
# estimated parameters
coef(fit, simplify = TRUE) # classical intercept-slope parametrization
coef(fit) # including confidence intervals
coef(fit, IRTpars = TRUE, simplify = TRUE) # IRT parametrization
coef(fit, IRTpars = TRUE) # including confidence intervals
# item fit statistics
itemfit(fit)
# factor scores vs standardized total scores
fs <- as.vector(fscores(fit))
sts <- as.vector(scale(rowSums(GMAT[, 1:20])))
plot(fs ~ sts, xlab = "Standardized total score", ylab = "Factor score")
cor(fs, sts)
Item Response Theory (IRT) models are mixed-effect regression models in which respondent ability \(\theta_p\) is assumed to be latent and is estimated together with item parameters.
The item characteristic function \(\pi_{pi} = \mathrm{P}\left(Y_{pi} = 1\vert \theta_{p}\right)\) describes the probability of a correct answer to given item \(i\). The item information function \(\mathrm{I}_i(\theta_p)\) describes how well the item discriminates between two nearby ability levels, i.e., how much information it provides about the given ability.
The equation and estimated item parameters can be displayed using the IRT or intercept/slope parametrization.
Estimates of item parameters can be displayed using the IRT or intercept/slope parametrization, which can be selected at the top of this tab. Parameter estimates are completed by SX2 item fit statistics (Orlando & Thissen, 2000). SX2 statistics are computed only when no missing data are present.
Item Response Theory (IRT) models are mixed-effect regression models in which respondent ability \(\theta_p\) is assumed to be latent and is estimated together with item parameters. Model parameters are estimated using the marginal maximum likelihood method; in the 1PL, 2PL, 3PL, and 4PL IRT models, ability \(\theta_p\) is assumed to follow a standard normal distribution.
IRT models can be compared by several information criteria:
Row BEST indicates which model has the lowest value of the given information criterion.
library(mirt)
# loading data
data(GMAT, package = "difNLR")
# 1PL IRT model
fit1PL <- mirt(GMAT[, 1:20],
model = 1,
constrain = list((1:20) + seq(0, (20 - 1) * 3, 3)), itemtype = "2PL"
)
# 2PL IRT model
fit2PL <- mirt(GMAT[, 1:20], model = 1, itemtype = "2PL")
# 3PL IRT model
fit3PL <- mirt(GMAT[, 1:20], model = 1, itemtype = "3PL")
# 4PL IRT model
fit4PL <- mirt(GMAT[, 1:20], model = 1, itemtype = "4PL")
# comparison
anova(fit1PL, fit2PL)
anova(fit2PL, fit3PL)
anova(fit3PL, fit4PL)
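To inspect the information criteria directly, a minimal sketch (assuming the four model objects fitted above and mirt's extract.mirt() accessor) collects them into one table; the model with the lowest value of a given criterion corresponds to the BEST row:
# collect information criteria for all four models into one table
models <- list("1PL" = fit1PL, "2PL" = fit2PL, "3PL" = fit3PL, "4PL" = fit4PL)
sapply(models, function(m) {
  c(
    AIC = extract.mirt(m, "AIC"),
    BIC = extract.mirt(m, "BIC"),
    SABIC = extract.mirt(m, "SABIC")
  )
})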
The Nominal Response Model (NRM) was introduced by Bock (1972) as a way to model responses to items with two or more nominal categories. This model is suitable for multiple-choice items with no particular ordering of distractors. It is also a generalization of some models for ordinal data, e.g., Generalized Partial Credit Model (GPCM) or its restricted versions Partial Credit Model (PCM) and Rating Scale Model (RSM).
Item characteristic curves may be displayed for each item in the Items subtab.
This table shows the scores of only the first six respondents. If you want to see the scores of all respondents, click on the Download abilities button.
library(mirt)
# loading data
data(HCItest, HCI, package = "ShinyItemAnalysis")
HCInumeric <- HCItest[, 1:20]
HCInumeric[] <- sapply(HCInumeric, as.numeric)
# model
fit <- mirt(HCInumeric, model = 1, itemtype = "nominal")
# item response curves
plot(fit, type = "trace")
# item information curves
plot(fit, type = "infotrace", facet_items = FALSE)
# test information curve
plot(fit, type = "infoSE")
# estimated parameters
coef(fit, simplify = TRUE) # classical intercept-slope parametrization
coef(fit, IRTpars = TRUE, simplify = TRUE) # IRT parametrization
# factor scores vs standardized total scores
fs <- as.vector(fscores(fit))
sts <- as.vector(scale(rowSums(HCI[, 1:20])))
plot(fs ~ sts, xlab = "Standardized total score", ylab = "Factor score")
cor(fs, sts)
The Nominal Response Model (NRM) was introduced by Bock (1972) as a way to model responses to items with two or more nominal categories. This model is suitable for multiple-choice items with no particular ordering of distractors. It is also a generalization of some models for ordinal data, e.g., Generalized Partial Credit Model (GPCM) or its restricted versions Partial Credit Model (PCM) and Rating Scale Model (RSM).
Dichotomous models are used for modelling items producing a simple binary response (i.e., true/false). The most complex unidimensional dichotomous IRT model described here is the 4PL IRT model. The Rasch model (Rasch, 1960) assumes discrimination fixed at \(a = 1\), guessing fixed at \(c = 0\), and inattention fixed at \(d = 1\). Other restricted models (1PL, 2PL, and 3PL models) can be obtained by fixing the appropriate parameters in the 4PL model.
In this section, you can explore the behavior of two item characteristic curves \(\mathrm{P}\left(Y = 1|\theta\right)\) and their item information functions \(\mathrm{I}\left(\theta\right)\) in the 4PL IRT model.
Select parameters \(a\) (discrimination), \(b\) (difficulty), \(c\) (guessing), and \(d\) (inattention). By constraining \(a = 1\), \(c = 0\), \(d = 1\) you get the Rasch model. With option \(c = 0\) and \(d = 1\) you get the 2PL model, and with option \(d = 1\) the 3PL model.
You may also select the value of latent ability \(\theta\) to obtain the interpretation of the item characteristic curves for this ability.
Note that for 1PL and 2PL models, the item information is the highest at \(\theta = b\). This is not necessarily the case for 3PL and 4PL models.
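To see this numerically, here is a minimal self-contained sketch (with illustrative item parameters) that locates the ability level maximizing the item information:
# 4PL item information function
iic_irt <- function(theta, a, b, c, d) {
  p <- c + (d - c) / (1 + exp(-a * (theta - b)))
  a^2 * (p - c)^2 * (d - p)^2 / (p * (1 - p) * (d - c)^2)
}
# for a 2PL item (c = 0, d = 1), the maximum is at theta = b = 0
optimize(iic_irt, interval = c(-4, 4), maximum = TRUE, a = 1.5, b = 0, c = 0, d = 1)
# for a 3PL item with c = 0.2, the maximum shifts above theta = b
optimize(iic_irt, interval = c(-4, 4), maximum = TRUE, a = 1.5, b = 0, c = 0.2, d = 1)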
Consider the following 2PL items with parameters
Item 1:
\(a = 2.5, b = -0.5\)
Item 2:
\(a = 1.5, b = 0\)
For these items fill in the following exercises with an accuracy of up to 0.05, then click on the Submit answers button.
If you need a hint, click on the blue button with a question mark.
Now consider 2 items with the following parameters
Item 1:
\(a = 1.5, b = 0, c = 0, d = 1\)
Item 2:
\(a = 1.5, b = 0, c = 0.2, d = 1\)
For these items fill in the following exercises with an accuracy of up to 0.05, then click on the Submit answers button.
Now consider 2 items with the following parameters
Item 1:
\(a = 1.5, b = 0, c = 0, d = 0.9\)
Item 2:
\(a = 1.5, b = 0, c = 0, d = 1\)
For these items fill in the following exercises with an accuracy of up to 0.05, then click on the Submit answers button.
library(ggplot2)
library(data.table)
# parameters
a1 <- 1
b1 <- 0
c1 <- 0
d1 <- 1
a2 <- 2
b2 <- 0.5
c2 <- 0
d2 <- 1
# latent ability
theta <- seq(-4, 4, 0.01)
# latent ability level
theta0 <- 0
# function for IRT characteristic curve
icc_irt <- function(theta, a, b, c, d) {
return(c + (d - c) / (1 + exp(-a * (theta - b))))
}
# calculation of characteristic curves
df <- data.frame(theta,
"icc1" = icc_irt(theta, a1, b1, c1, d1),
"icc2" = icc_irt(theta, a2, b2, c2, d2)
)
df <- melt(df, id.vars = "theta")
# plot for characteristic curves
ggplot(df, aes(x = theta, y = value, color = variable)) +
geom_line() +
geom_segment(aes(
y = icc_irt(theta0, a = a1, b = b1, c = c1, d = d1),
yend = icc_irt(theta0, a = a1, b = b1, c = c1, d = d1),
x = -4, xend = theta0
),
color = "gray", linetype = "dashed"
) +
geom_segment(aes(
y = icc_irt(theta0, a = a2, b = b2, c = c2, d = d2),
yend = icc_irt(theta0, a = a2, b = b2, c = c2, d = d2),
x = -4, xend = theta0
),
color = "gray", linetype = "dashed"
) +
geom_segment(aes(
y = 0,
yend = max(
icc_irt(theta0, a = a1, b = b1, c = c1, d = d1),
icc_irt(theta0, a = a2, b = b2, c = c2, d = d2)
),
x = theta0, xend = theta0
),
color = "gray", linetype = "dashed"
) +
xlim(-4, 4) +
xlab("Ability") +
ylab("Probability of correct answer") +
theme_bw() +
ylim(0, 1) +
theme(
axis.line = element_line(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()
) +
ggtitle("Item characteristic curve")
# function for IRT information function
iic_irt <- function(theta, a, b, c, d) {
pi <- c + (d - c) * exp(a * (theta - b)) / (1 + exp(a * (theta - b)))
return(a^2 * (pi - c)^2 * (d - pi)^2 / (pi * (1 - pi) * (d - c)^2))
}
# calculation of information curves
df <- data.frame(theta,
"iic1" = iic_irt(theta, a1, b1, c1, d1),
"iic2" = iic_irt(theta, a2, b2, c2, d2)
)
df <- melt(df, id.vars = "theta")
# plot for information curves
ggplot(df, aes(x = theta, y = value, color = variable)) +
geom_line() +
xlim(-4, 4) +
xlab("Ability") +
ylab("Information") +
theme_bw() +
ylim(0, 4) +
theme(
axis.line = element_line(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()
) +
ggtitle("Item information curve")
Polytomous models are used when a partial score is possible, or when items are graded on a Likert scale (e.g., from Totally disagree to Totally agree); some polytomous models can also be used for analyzing multiple-choice items. In this section you can explore item response functions for some polytomous models.
Two main classes of polytomous IRT models are considered:
Difference models are defined by setting the mathematical form to cumulative probabilities, while category probabilities are calculated by their difference. These models are sometimes called cumulative logit models as they set a linear form to cumulative logits.
As an example, the Graded Response Model (GRM; Samejima, 1970) uses a 2PL IRT model to describe cumulative probabilities (probabilities to obtain a score higher than 1, 2, 3, etc.). Category probabilities are then described as the differences between two subsequent cumulative probabilities.
For the divide-by-total models, response category probabilities are defined as the ratio between category-related functions and their sum.
In the Generalized Partial Credit Model (GPCM; Muraki, 1992), the probability of a successful transition from one category score to the next is modelled by the 2PL IRT model, while the Partial Credit Model (PCM; Masters, 1982) uses the 1PL IRT model to describe this probability. In an even more restricted version, the Rating Scale Model (RSM; Andrich, 1978) assumes exactly the same \(K\) response categories for each item and threshold parameters which can be split into a response-threshold parameter and an item-specific location parameter. These models are sometimes called adjacent-category logit models because they set a linear form to adjacent-category logits.
To model distractor properties in multiple-choice items, the Nominal Response Model (NRM; Bock, 1972) can be used. The NRM is an IRT analogy of a multinomial regression model. It is also a generalization of the GPCM/PCM/RSM ordinal models. The NRM is sometimes called a baseline-category logit model because it sets a linear form to the log odds of selecting a given category over the baseline category. The baseline can be chosen arbitrarily, although typically the correct answer or the first option is chosen.
Graded response model (GRM; Samejima, 1970) uses the 2PL IRT model to describe cumulative probabilities (probabilities to obtain a score higher than 1, 2, 3, etc.). Category probabilities are then described as the differences between two subsequent cumulative probabilities.
It belongs to the class of difference models, which are defined by setting a mathematical form to cumulative probabilities, while category probabilities are calculated as their differences. These models are sometimes called cumulative logit models because they set a linear form to cumulative logits.
Select the number of responses, the inflection points of cumulative probabilities \(b_k\), and the common discrimination parameter \(a\). The cumulative probability \(P(Y \geq 0 \vert \theta)\) is always equal to 1 and is not displayed; the corresponding category probability \(P(Y = 0 \vert \theta)\) is displayed in black.
Consider an item following the graded response model, rated \(0-1-2-3\), with discrimination \(a = 1\) and difficulties \(b_{1} = -0.5\), \(b_{2} = 1\), and \(b_{3} = 1.5\).
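As a worked check for this exercise, the following minimal sketch (assuming the item parameters above) computes the category probabilities and the expected item score at a chosen ability level:
# exercise item: a = 1, b = (-0.5, 1, 1.5), categories 0-3
a <- 1
b <- c(-0.5, 1, 1.5)
theta0 <- 0
# cumulative probabilities P(Y >= 0), ..., P(Y >= 3) at theta0
cumul <- c(1, 1 / (1 + exp(-a * (theta0 - b))))
# category probabilities as differences of subsequent cumulative probabilities
probs <- c(cumul[-length(cumul)] - cumul[-1], cumul[length(cumul)])
setNames(probs, paste0("P(Y = ", 0:3, ")"))
# expected item score at theta0
sum(probs * 0:3)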
library(ggplot2)
library(data.table)
# setting parameters
a <- 1
b <- c(-1.5, -1, -0.5, 0)
theta <- seq(-4, 4, 0.01)
# calculating cumulative probabilities
ccirt <- function(theta, a, b) {
return(1 / (1 + exp(-a * (theta - b))))
}
df1 <- data.frame(sapply(1:length(b), function(i) ccirt(theta, a, b[i])), theta)
df1 <- melt(df1, id.vars = "theta")
# plotting cumulative probabilities
ggplot(data = df1, aes(x = theta, y = value, col = variable)) +
geom_line() +
xlab("Ability") +
ylab("Cumulative probability") +
xlim(-4, 4) +
ylim(0, 1) +
theme_bw() +
theme(
text = element_text(size = 14),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()
) +
ggtitle("Cumulative probabilities") +
scale_color_manual("",
values = c("red", "yellow", "green", "blue"),
labels = paste0("P(Y >= ", 1:4, ")")
)
# calculating category probabilities
df2 <- data.frame(1, sapply(
1:length(b),
function(i) ccirt(theta, a, b[i])
))
df2 <- data.frame(sapply(
1:length(b),
function(i) df2[, i] - df2[, i + 1]
), df2[, ncol(df2)], theta)
df2 <- melt(df2, id.vars = "theta")
# plotting category probabilities
ggplot(data = df2, aes(x = theta, y = value, col = variable)) +
geom_line() +
xlab("Ability") +
ylab("Category probability") +
xlim(-4, 4) +
ylim(0, 1) +
theme_bw() +
theme(
text = element_text(size = 14),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()
) +
ggtitle("Category probabilities") +
scale_color_manual("",
values = c("black", "red", "yellow", "green", "blue"),
labels = paste0("P(Y >= ", 0:4, ")")
)
# calculating expected item score
df3 <- data.frame(1, sapply(
1:length(b),
function(i) ccirt(theta, a, b[i])
))
df3 <- data.frame(sapply(
1:length(b),
function(i) df3[, i] - df3[, i + 1]
), df3[, ncol(df3)])
df3 <- data.frame(exp = as.matrix(df3) %*% 0:4, theta)
# plotting expected item score
ggplot(data = df3, aes(x = theta, y = exp)) +
geom_line() +
xlab("Ability") +
ylab("Expected item score") +
xlim(-4, 4) +
ylim(0, 4) +
theme_bw() +
theme(
text = element_text(size = 14),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()
) +
ggtitle("Expected item score")
In the Generalized Partial Credit Model (GPCM; Muraki, 1992), the probability of successful transition from one category score to the next category score is modelled by the 2PL IRT model. The response category probabilities are then ratios between category-related functions (cumulative sums of exponentials) and their sum.
Two simpler models can be derived from the GPCM by restricting some parameters: The Partial Credit Model (PCM; Masters, 1982) uses the 1PL IRT model to describe this probability, i.e., the discrimination is fixed at \(a = 1\). An even more restricted version, the Rating Scale Model (RSM; Andrich, 1978), assumes exactly the same \(K\) response categories for each item and threshold parameters which can be split into a response-threshold parameter \(\lambda_k\) and an item-specific location parameter \(b_i\). These models are sometimes called adjacent-category logit models, as they set a linear form to adjacent-category logits.
Select the number of responses, their threshold parameters \(b_k\), and the common discrimination parameter \(a\). With \(a = 1\) you get the PCM. The numerator of \(\pi_0 = P(Y = 0 \vert \theta)\) is set to 1 and \(\pi_0\) is displayed in black.
Consider an item following the generalized partial credit model, rated \(0-1-2\), with discrimination \(a = 1\) and threshold parameters \(b_{1} = -1\) and \(b_{2} = 1\).
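As a worked check for this exercise, a minimal sketch (assuming the item parameters above) computes the GPCM category probabilities at a chosen ability level:
# exercise item: a = 1, b = (-1, 1), categories 0-2
a <- 1
b <- c(-1, 1)
theta0 <- 0
# numerators: exponentials of cumulative sums of a * (theta0 - b_k),
# with 0 for the baseline category
num <- exp(c(0, cumsum(a * (theta0 - b))))
# category probabilities P(Y = 0), P(Y = 1), P(Y = 2)
setNames(num / sum(num), paste0("P(Y = ", 0:2, ")"))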
library(ggplot2)
library(data.table)
# setting parameters
a <- 1
d <- c(-1.5, -1, -0.5, 0)
theta <- seq(-4, 4, 0.01)
# calculating category probabilities
ccgpcm <- function(theta, a, d) {
a * (theta - d)
}
df <- sapply(1:length(d), function(i) ccgpcm(theta, a, d[i]))
pk <- sapply(1:ncol(df), function(k) apply(as.data.frame(df[, 1:k]), 1, sum))
pk <- cbind(0, pk)
pk <- exp(pk)
denom <- apply(pk, 1, sum)
df <- apply(pk, 2, function(x) x / denom)
df1 <- melt(data.frame(df, theta), id.vars = "theta")
# plotting category probabilities
ggplot(data = df1, aes(x = theta, y = value, col = variable)) +
geom_line() +
xlab("Ability") +
ylab("Category probability") +
xlim(-4, 4) +
ylim(0, 1) +
theme_bw() +
theme(
text = element_text(size = 14),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()
) +
ggtitle("Category probabilities") +
scale_color_manual("",
values = c("black", "red", "yellow", "green", "blue"),
labels = paste0("P(Y = ", 0:4, ")")
)
# calculating expected item score
df2 <- data.frame(exp = as.matrix(df) %*% 0:4, theta)
# plotting expected item score
ggplot(data = df2, aes(x = theta, y = exp)) +
geom_line() +
xlab("Ability") +
ylab("Expected item score") +
xlim(-4, 4) +
ylim(0, 4) +
theme_bw() +
theme(
text = element_text(size = 14),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()
) +
ggtitle("Expected item score")
In the Nominal Response Model (NRM; Bock, 1972), the probability of selecting a given category over the baseline category is modelled by the 2PL IRT model. This model is sometimes called the baseline-category logit model because it sets a linear form to the log odds of selecting a given category over the baseline category. The baseline can be chosen arbitrarily, although typically the correct answer or the first option is chosen. The NRM is a generalization of the GPCM, obtained by allowing item-specific and category-specific intercept and slope parameters.
Select the number of distractors, their threshold parameters \(b_k\), and discrimination parameters \(a_k\). The parameters of \(\pi_0 = P(Y = 0 \vert \theta)\) are set to zero and \(\pi_0\) is displayed in black.
library(ggplot2)
library(data.table)
# setting parameters
a <- c(2.5, 2, 1, 1.5)
d <- c(-1.5, -1, -0.5, 0)
theta <- seq(-4, 4, 0.01)
# calculating category probabilities
ccnrm <- function(theta, a, d) {
exp(d + a * theta)
}
df <- sapply(1:length(d), function(i) ccnrm(theta, a[i], d[i]))
df <- data.frame(1, df)
denom <- apply(df, 1, sum)
df <- apply(df, 2, function(x) x / denom)
df1 <- melt(data.frame(df, theta), id.vars = "theta")
# plotting category probabilities
ggplot(data = df1, aes(x = theta, y = value, col = variable)) +
geom_line() +
xlab("Ability") +
ylab("Category probability") +
xlim(-4, 4) +
ylim(0, 1) +
theme_bw() +
theme(
text = element_text(size = 14),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()
) +
ggtitle("Category probabilities") +
scale_color_manual("",
values = c("black", "red", "yellow", "green", "blue"),
labels = paste0("P(Y = ", 0:4, ")")
)
# calculating expected item score
df2 <- data.frame(exp = as.matrix(df) %*% 0:4, theta)
# plotting expected item score
ggplot(data = df2, aes(x = theta, y = exp)) +
geom_line() +
xlab("Ability") +
ylab("Expected item score") +
xlim(-4, 4) +
ylim(0, 4) +
theme_bw() +
theme(
text = element_text(size = 14),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()
) +
ggtitle("Expected item score")
Differential item functioning (DIF) occurs when respondents from different social groups (such as those defined by gender or ethnicity) with the same underlying ability have a different probability of answering the item correctly or endorsing the item. If an item functions differently for two groups, it is potentially unfair and its wording should be checked. In general, two types of DIF can be distinguished: Uniform DIF describes a situation when the item advantages one of the groups at all levels of the latent ability (left figure). In such a case, the item has different difficulty (location) parameters for the two groups, while the item discrimination is the same. In contrast, non-uniform DIF (right figure) means that the item advantages one of the groups at lower ability levels and the other group at higher ability levels. In this case, the item has different discrimination (slope) parameters and possibly also different difficulty parameters for the two groups.
Differential distractor functioning (DDF) occurs when respondents from different groups but with the same latent ability have a different probability of selecting at least one distractor choice. Again, two types of DDF can be distinguished - uniform (left figure below) and non-uniform DDF (right figure below).
DIF analysis may come to a different conclusion than a test of group differences in total scores. Two groups may have the same distribution of total scores, yet some items may function differently for the two groups. Also, one of the groups may have a significantly lower total score, yet there may be no DIF item (Martinkova et al., 2017). This section examines the differences in observed scores only. Explore further DIF sections to analyze differential item functioning.
In DIF analysis, the groups are compared in the functioning of items with respect to respondent ability. In many methods, an observed ability, such as the standardized total score, is used as the matching criterion. DIF can also be explored with respect to another observed score or criterion. For example, to analyze instructional sensitivity, Martinkova et al. (2020) analyzed differential item functioning in change (DIF-C) by analyzing DIF on Grade 9 item answers while matching on the Grade 6 total scores of the same respondents in a longitudinal setting (see the Learning to Learn 9 toy dataset in the Data section).
library(ggplot2)
library(moments)
library(ShinyItemAnalysis) # provides theme_app() used below
# loading data
data(GMAT, package = "difNLR")
data <- GMAT[, 1:20]
group <- GMAT[, "group"]
# total score calculation wrt group
score <- rowSums(data)
score0 <- score[group == 0] # reference group
score1 <- score[group == 1] # focal group
# Summary of total score
rbind(
c(
length(score0), min(score0), max(score0), mean(score0), median(score0),
sd(score0), skewness(score0), kurtosis(score0)
),
c(
length(score1), min(score1), max(score1), mean(score1), median(score1),
sd(score1), skewness(score1), kurtosis(score1)
)
)
df <- data.frame(score, group = as.factor(group))
# histogram of total scores wrt group
ggplot(data = df, aes(x = score, fill = group, col = group)) +
geom_histogram(binwidth = 1, position = "dodge2", alpha = 0.75) +
xlab("Total score") +
ylab("Number of respondents") +
scale_fill_manual(
values = c("dodgerblue2", "goldenrod2"), labels = c("Reference", "Focal")
) +
scale_colour_manual(
values = c("dodgerblue2", "goldenrod2"), labels = c("Reference", "Focal")
) +
theme_app() +
theme(legend.position = "left")
# t-test to compare total scores
t.test(score0, score1)
A delta plot (Angoff & Ford, 1973) compares the proportions of correct answers per item in the two groups. It displays a non-linear transformation of these proportions using quantiles of the standard normal distribution (so-called delta scores) for each item for the two groups in a scatterplot called a diagonal plot or delta plot (see the figure below). An item is under suspicion of DIF if its delta point departs considerably from the main axis of the ellipsoid formed by the delta scores.
The detection threshold is either fixed to the value 1.5 or based on a bivariate normal approximation (Magis & Facon, 2012). The item purification algorithms offered when using the threshold based on the normal approximation are as follows: IPP1 uses the threshold obtained after the first run in all subsequent runs, IPP2 updates only the slope parameter of the threshold formula and thus lessens the impact of DIF items, and IPP3 adjusts every single parameter and completely discards the effect of items flagged as DIF from the computation of the threshold (for further details, see Magis & Facon, 2013). When using the fixed threshold with item purification, the threshold (1.5) remains the same throughout the purification process.
A summary table contains information about the proportions of correct answers in the reference and the focal group together with their transformations into delta scores. It also includes the distances of delta scores from the main axis of the ellipsoid formed by delta scores.
library(deltaPlotR)
# loading data
data(GMAT, package = "difNLR")
data <- GMAT[, -22]
# delta scores with fixed threshold
(DS_fixed <- deltaPlot(
data = data, group = "group", focal.name = 1,
thr = 1.5, purify = FALSE
))
# delta plot
diagPlot(DS_fixed, thr.draw = TRUE)
# delta scores with normal threshold
(DS_normal <- deltaPlot(
data = data, group = "group", focal.name = 1,
thr = "norm", purify = FALSE
))
# delta plot
diagPlot(DS_normal, thr.draw = TRUE)
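Item purification can be combined with the normal-approximation threshold via the purification variants described above; a hedged sketch, assuming deltaPlotR's purify and purType arguments:
# delta scores with normal threshold and IPP1 item purification
(DS_purified <- deltaPlot(
  data = data, group = "group", focal.name = 1,
  thr = "norm", purify = TRUE, purType = "IPP1"
))
# delta plot
diagPlot(DS_purified, thr.draw = TRUE)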
The Mantel-Haenszel test is a DIF detection method based on contingency tables which are calculated for each level of the total score (Mantel & Haenszel, 1959).
Here you can select a correction method for multiple comparisons and/or item purification.
The summary table contains information about Mantel-Haenszel \(\chi^2\) statistics, corresponding \(p\)-values considering the selected adjustment, and significance codes. Moreover, this table offers values of the Mantel-Haenszel estimates of the odds ratio \(\alpha_{\mathrm{MH}}\), which incorporate all levels of the total score, and their transformations into D-DIF indices \(\Delta_{\mathrm{MH}} = -2.35 \log(\alpha_{\mathrm{MH}})\) to evaluate the DIF effect size.
library(difR)
# loading data
data(GMAT, package = "difNLR")
data <- GMAT[, 1:20]
group <- GMAT[, "group"]
# Mantel-Haenszel test
(fit <- difMH(
Data = data, group = group, focal.name = 1, match = "score",
p.adjust.method = "none", purify = FALSE
))
The Mantel-Haenszel test is a DIF detection method based on contingency tables which are calculated for each level of the total score (Mantel & Haenszel, 1959).
For the selected item and for the selected level of the total score you can display a contingency table and calculate the odds ratio of answering an item correctly. This can be compared to the Mantel-Haenszel estimate of odds ratio \(\alpha_{\mathrm{MH}}\), which incorporates all levels of the total score. Further, \(\alpha_{\mathrm{MH}}\) can be transformed into the Mantel-Haenszel D-DIF index \(\Delta_{\mathrm{MH}}\) to evaluate the DIF effect size.
library(difR)
# loading data
data(GMAT, package = "difNLR")
data <- GMAT[, 1:20]
group <- GMAT[, "group"]
# contingency table for item 1 and score 12
item <- 1
cut <- 12
df <- data.frame(data[, item], group)
colnames(df) <- c("Answer", "Group")
df$Answer <- relevel(factor(df$Answer, labels = c("Incorrect", "Correct")), "Correct")
df$Group <- factor(df$Group, labels = c("Reference Group", "Focal Group"))
score <- rowSums(data) # total score calculation
df <- df[score == cut, ] # responses of those with the given total score
xtabs(~ Group + Answer, data = df)
# Mantel-Haenszel estimate of OR
(fit <- difMH(
Data = data, group = group, focal.name = 1, match = "score",
p.adjust.method = "none", purify = FALSE
))
fit$alphaMH
# D-DIF index calculation
-2.35 * log(fit$alphaMH)
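For orientation, ETS classifies items by the size of \(|\Delta_{\mathrm{MH}}|\): values below 1 are negligible DIF (category A) and values of 1.5 or above are large DIF (category C). A simplified sketch of this classification, ignoring the statistical significance conditions of the full ETS rules:
# simplified ETS classification of D-DIF indices (thresholds only)
deltaMH <- -2.35 * log(fit$alphaMH)
cut(abs(deltaMH),
  breaks = c(0, 1, 1.5, Inf),
  labels = c("A (negligible)", "B (moderate)", "C (large)"),
  right = FALSE
)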
The SIBTEST method (Shealy & Stout, 1993) allows for detection of uniform DIF without requiring an item response model. Its modified version, the Crossing-SIBTEST (Chalmers, 2018; Li & Stout, 1996), focuses on detection of non-uniform DIF.
Here you can choose the type of DIF to test. With uniform DIF, SIBTEST is applied, while with non-uniform DIF, the Crossing-SIBTEST method is used instead. You can also select the correction method for multiple comparisons or item purification.
This summary table contains estimates of \(\beta\) together with standard errors (only available when testing uniform DIF), corresponding \(\chi^2\)-statistics with \(p\)-values considering the selected adjustment, and significance codes.
library(difR)
# loading data
data(GMAT, package = "difNLR")
data <- GMAT[, 1:20]
group <- GMAT[, "group"]
# SIBTEST (uniform DIF)
(fit_udif <- difSIBTEST(
Data = data, group = group, focal.name = 1, type = "udif",
p.adjust.method = "none", purify = FALSE
))
# Crossing-SIBTEST (non-uniform DIF)
(fit_nudif <- difSIBTEST(
Data = data, group = group, focal.name = 1, type = "nudif",
p.adjust.method = "none", purify = FALSE
))
The logistic regression method allows for detection of uniform and non-uniform DIF (Swaminathan & Rogers, 1990) by including a group-membership variable (uniform DIF) and its interaction with a matching criterion (non-uniform DIF) into a model for item \(i\) and by testing for significance of their effect.
Here you can change the type of DIF to be tested and the parametrization - either based on IRT models or classical intercept/slope. You can also select a correction method for multiple comparisons and/or item purification. Finally, you may also change the Observed score. While matching on the standardized total score is typical, other observed scores can be uploaded in the Data section. Using a pre-test (standardized) total score as the observed score allows for testing differential item functioning in change (DIF-C) to provide proofs of instructional sensitivity (Martinkova et al., 2020); also see the Learning To Learn 9 toy dataset.
The probability that respondent \(p\) with observed score \(Z_p\) and group membership variable \(G_p\) answers item \(i\) correctly is given by the following equation (in the classical intercept/slope parametrization):
\(\mathrm{P}\left(Y_{pi} = 1 \vert Z_p, G_p\right) = \frac{\exp(\beta_{0i} + \beta_{1i} Z_p + \beta_{2i} G_p + \beta_{3i} Z_p G_p)}{1 + \exp(\beta_{0i} + \beta_{1i} Z_p + \beta_{2i} G_p + \beta_{3i} Z_p G_p)}\)
The summary table contains information about the DIF test statistics \(LR(\chi^2)\) based on a likelihood ratio test, the corresponding \(p\)-values considering the selected adjustment, and the significance codes. Moreover, it offers the values of Nagelkerke's \(R^2\) with DIF effect size classifications. This table also provides estimated parameters of the best fitted model for each item, and their standard errors.
library(difR)
# loading data
data(GMAT, package = "difNLR")
data <- GMAT[, 1:20]
group <- GMAT[, "group"]
# logistic regression DIF detection method
(fit <- difLogistic(
Data = data, group = group, focal.name = 1, match = "score",
type = "both", p.adjust.method = "none", purify = FALSE
))
# loading data
data(LearningToLearn, package = "ShinyItemAnalysis")
data <- LearningToLearn[, 87:94] # item responses from Grade 9 from subscale 6
group <- LearningToLearn$track # school track - group membership variable
match <- scale(LearningToLearn$score_6) # standardized test score from Grade 6
# detecting differential item functioning in change (DIF-C) using
# the logistic regression DIF detection method
# and standardized total score from Grade 6 as the matching criterion
(fit <- difLogistic(
Data = data, group = group, focal.name = "AS", match = match,
type = "both", p.adjust.method = "none", purify = FALSE
))
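The Nagelkerke \(R^2\) effect sizes mentioned above can also be inspected directly; a hedged sketch, assuming the fitted difR object stores the \(R^2\) differences in its deltaR2 component:
# Nagelkerke's R^2 effect sizes (difference between the models with and
# without the group-related terms) for each item
fit$deltaR2
# Jodoin & Gierl (2001) classification:
# < 0.035 negligible, 0.035-0.07 moderate, > 0.07 large
cut(fit$deltaR2,
  breaks = c(0, 0.035, 0.07, 1),
  labels = c("negligible", "moderate", "large"),
  include.lowest = TRUE
)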
The logistic regression method allows for detection of uniform and non-uniform DIF (Swaminathan & Rogers, 1990) by including a group-membership variable (uniform DIF) and its interaction with a matching criterion (non-uniform DIF) into a model for item \(i\) and by testing for significance of their effect.
Here you can change the type of DIF to be tested and the parametrization - either based on IRT models or classical intercept/slope. You can also select a correction method for multiple comparisons and/or item purification. Finally, you may also change the Observed score. While matching on the standardized total score is typical, other observed scores can be uploaded in the Data section. Using a pre-test (standardized) total score as the observed score allows for testing differential item functioning in change (DIF-C) to provide proofs of instructional sensitivity (Martinkova et al., 2020); also see the Learning To Learn 9 toy dataset. For a selected item you can display a plot of its characteristic curves and a table of its estimated parameters with standard errors.
Points represent the proportion of correct answers (empirical probabilities) with respect to the observed score. Their size is determined by the count of respondents who achieved a given level of the observed score, with respect to the group membership.
The probability that respondent \(p\) with observed score \(Z_p\) and group membership variable \(G_p\) answers item \(i\) correctly is given by the following equation (in the classical intercept/slope parametrization):
\(\mathrm{P}\left(Y_{pi} = 1 \vert Z_p, G_p\right) = \frac{\exp(\beta_{0i} + \beta_{1i} Z_p + \beta_{2i} G_p + \beta_{3i} Z_p G_p)}{1 + \exp(\beta_{0i} + \beta_{1i} Z_p + \beta_{2i} G_p + \beta_{3i} Z_p G_p)}\)
This table summarizes estimated item parameters and their standard errors.
library(difR)
library(ShinyItemAnalysis)
# loading data
data(GMAT, package = "difNLR")
data <- GMAT[, 1:20]
group <- GMAT[, "group"]
# logistic regression DIF detection method
(fit <- difLogistic(
Data = data, group = group, focal.name = 1, match = "score",
type = "both", p.adjust.method = "none", purify = FALSE
))
# plot of characteristic curve for item 1
plotDIFLogistic(fit, item = 1, Data = data, group = group)
# estimated coefficients for item 1
fit$logitPar[1, ]
Generalized logistic regression models are extensions of the logistic regression method which account for the possibility of guessing by allowing for a nonzero lower asymptote - pseudo-guessing \(c_i\) (Drabinova & Martinkova, 2017) - or for an upper asymptote lower than one - inattention \(d_i\). Similarly to logistic regression, these extensions also provide detection of uniform and non-uniform DIF by letting the difficulty parameter \(b_i\) (uniform) and the discrimination parameter \(a_i\) (non-uniform) differ between groups and by testing for the difference in their values. Moreover, these extensions allow for testing differences in the pseudo-guessing and inattention parameters, and they can be seen as proxies of the 3PL and 4PL IRT models for DIF detection.
Here you can specify the assumed model. In the 3PL and 4PL models, the abbreviations \(c_{g}\) or \(d_{g}\) mean that the parameters \(c_i\) or \(d_i\) are assumed to be the same for both groups; otherwise they are allowed to differ. With type you can specify the type of DIF to be tested by choosing the parameters in which a difference between groups should be tested. You can also select a correction method for multiple comparisons or item purification.
Finally, you may change the Observed score. While matching on the standardized total score is typical, other observed scores can be uploaded in the Data section. Using a pre-test (standardized) total score allows for testing differential item functioning in change (DIF-C) to provide proofs of instructional sensitivity (Martinkova et al., 2020); also see the Learning To Learn 9 toy dataset.
The displayed equation is based on the model selected below.
This summary table contains information about the DIF test statistic \(LR(\chi^2)\), corresponding \(p\)-values considering the selected adjustment, and significance codes. This table also provides estimated parameters of the best fitted model for each item. Note that \(a_{iG_p}\) (and also other parameters) from the equation above consists of a parameter for the reference group and a parameter for the difference between the focal and reference groups, i.e., \(a_{iG_p} = a_{i} + a_{iDif}G_{p}\), where \(G_{p} = 0\) for the reference group and \(G_{p} = 1\) for the focal group, as stated in the table below.
library(difNLR)
# loading data
data(GMAT, package = "difNLR")
data <- GMAT[, 1:20]
group <- GMAT[, "group"]
# generalized logistic regression DIF method
# using 3PL model with the same guessing parameter for both groups
(fit <- difNLR(
Data = data, group = group, focal.name = 1, model = "3PLcg",
match = "zscore", type = "all", p.adjust.method = "none", purify = FALSE
))
# loading data
data(LearningToLearn, package = "ShinyItemAnalysis")
data <- LearningToLearn[, 87:94] # item responses from Grade 9 from subscale 6
group <- LearningToLearn$track # school track - group membership variable
match <- scale(LearningToLearn$score_6) # standardized test score from Grade 6
# detecting differential item functioning in change (DIF-C) using
# the generalized logistic regression DIF method with 3PL model
# with the same guessing parameter for both groups
# and standardized total score from Grade 6 as the matching criterion
(fit <- difNLR(
Data = data, group = group, focal.name = "AS", model = "3PLcg",
match = match, type = "all", p.adjust.method = "none", purify = FALSE
))
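To illustrate the decomposition \(a_{iG_p} = a_{i} + a_{iDif}G_{p}\) described above, a hedged sketch, assuming coef() returns the reference-group estimates together with the group differences (aDif, bDif) for each item:
# estimated parameters for item 1
pars <- coef(fit)[[1]]
pars
# focal-group parameters as reference estimate plus group difference
c(
  a_focal = unname(pars["a"] + pars["aDif"]),
  b_focal = unname(pars["b"] + pars["bDif"])
)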
Generalized logistic regression models are extensions of the logistic regression method which account for the possibility of guessing by allowing for a nonzero lower asymptote - pseudo-guessing \(c_i\) (Drabinova & Martinkova, 2017) - or for an upper asymptote lower than one - inattention \(d_i\). Similarly to logistic regression, these extensions also provide detection of uniform and non-uniform DIF by letting the difficulty parameter \(b_i\) (uniform) and the discrimination parameter \(a_i\) (non-uniform) differ between groups and by testing for the difference in their values. Moreover, these extensions allow for testing differences in the pseudo-guessing and inattention parameters, and they can be seen as proxies of the 3PL and 4PL IRT models for DIF detection.
Here you can specify the assumed model. In the 3PL and 4PL models, the abbreviations \(c_{g}\) or \(d_{g}\) mean that the parameters \(c\) or \(d\) are assumed to be the same for both groups; otherwise they are allowed to differ. With type you can specify the type of DIF to be tested by choosing the parameters in which a difference between groups should be tested. You can also select a correction method for multiple comparisons or item purification.
Finally, you may change the Observed score. While matching on the standardized total score is typical, other observed scores can be uploaded in the Data section. Using a pre-test (standardized) total score allows for testing differential item functioning in change (DIF-C) to provide proofs of instructional sensitivity (Martinkova et al., 2020); also see the Learning To Learn 9 toy dataset. For a selected item you can display a plot of its characteristic curves and a table of its estimated parameters with standard errors.
Points represent the proportion of correct answers (empirical probabilities) with respect to the observed score. Their size is determined by the count of respondents who achieved a given level of the observed score, with respect to the group membership.
This table summarizes estimated item parameters together with their standard errors. Note that \(a_{iG_p}\) (and also other parameters) from the equation above consists of a parameter for the reference group and a parameter for the difference between the focal and reference groups, i.e., \(a_{iG_p} = a_{i} + a_{iDif}G_{p}\), where \(G_{p} = 0\) for the reference group and \(G_{p} = 1\) for the focal group, as stated in the table below.
library(difNLR)
# loading data
data(GMAT, package = "difNLR")
data <- GMAT[, 1:20]
group <- GMAT[, "group"]
# generalized logistic regression DIF method
# using 3PL model with the same guessing parameter for both groups
(fit <- difNLR(
Data = data, group = group, focal.name = 1, model = "3PLcg",
match = "zscore", type = "all", p.adjust.method = "none", purify = FALSE
))
# plot of characteristic curve of item 1
plot(fit, item = 1)
# estimated coefficients for item 1 with SE
coef(fit, SE = TRUE)[[1]]
To detect DIF, the Lord test (Lord, 1980) compares item parameters of a selected IRT model, fitted separately on data of the two groups. The model is either 1PL, 2PL, or 3PL with guessing, which is the same for the two groups. In the case of the 3PL model, the guessing parameter is estimated based on the whole dataset and is subsequently considered fixed. In statistical terms, the Lord statistic is equal to the Wald statistic.
Here you can choose the underlying IRT model used to test DIF. You can also select the correction method for multiple comparisons, and/or item purification.
This summary table contains information about Lord's \(\chi^2\)-statistics, corresponding \(p\)-values considering the selected adjustment, and significance codes. The table also provides estimated parameters for both groups. Note that item parameters might differ slightly even for non-DIF items because two separate models are fitted; this difference is, however, not significant. Also note that under the 3PL model, the guessing parameter \(c\) is estimated from the whole dataset and is considered fixed in the final models, thus no standard error is displayed.
library(difR)
library(ltm)
# loading data
data(GMAT, package = "difNLR")
data <- GMAT[, 1:20]
group <- GMAT[, "group"]
# 1PL IRT model
(fit1PL <- difLord(
Data = data, group = group, focal.name = 1, model = "1PL",
p.adjust.method = "none", purify = FALSE
))
# 2PL IRT model
(fit2PL <- difLord(
Data = data, group = group, focal.name = 1, model = "2PL",
p.adjust.method = "none", purify = FALSE
))
# 3PL IRT model with the same guessing for groups
guess <- itemParEst(data, model = "3PL")[, 3]
(fit3PL <- difLord(
Data = data, group = group, focal.name = 1, model = "3PL",
c = guess, p.adjust.method = "none", purify = FALSE
))
To detect DIF, the Lord test (Lord, 1980) compares item parameters of a selected IRT model, fitted separately on data of the two groups. The model is either 1PL, 2PL, or 3PL with guessing which is the same for the two groups. In the case of the 3PL model, the guessing parameter is estimated based on the whole dataset and is subsequently considered fixed. In statistical terms, the Lord statistic is equal to the Wald statistic.
Here you can choose an underlying IRT model used to test DIF. You can also select a correction method for multiple comparisons and/or item purification. For a selected item you can display the plot of its characteristic curves and the table of its estimated parameters with standard errors.
Note that the plots might differ slightly even for non-DIF items because two separate models are fitted; this difference is, however, not significant.
The table summarizes estimated item parameters together with standard errors. Note that item parameters might differ slightly even for non-DIF items because two separate models are fitted; this difference is, however, not significant. Also note that under the 3PL model, the guessing parameter \(c\) is estimated from the whole dataset and is considered fixed in the final models, thus no standard error is displayed.
library(difR)
library(ltm)
library(ShinyItemAnalysis)
# loading data
data(GMAT, package = "difNLR")
data <- GMAT[, 1:20]
group <- GMAT[, "group"]
# 1PL IRT model
(fit1PL <- difLord(
Data = data, group = group, focal.name = 1, model = "1PL",
p.adjust.method = "none", purify = FALSE
))
# estimated coefficients for all items
(coef1PL <- fit1PL$itemParInit)
# plot of characteristic curve of item 1
plotDIFirt(parameters = coef1PL, item = 1, test = "Lord")
# 2PL IRT model
(fit2PL <- difLord(
Data = data, group = group, focal.name = 1, model = "2PL",
p.adjust.method = "none", purify = FALSE
))
# estimated coefficients for all items
(coef2PL <- fit2PL$itemParInit)
# plot of characteristic curve of item 1
plotDIFirt(parameters = coef2PL, item = 1, test = "Lord")
# 3PL IRT model with the same guessing for groups
guess <- itemParEst(data, model = "3PL")[, 3]
(fit3PL <- difLord(
Data = data, group = group, focal.name = 1, model = "3PL",
c = guess, p.adjust.method = "none", purify = FALSE
))
# estimated coefficients for all items
(coef3PL <- fit3PL$itemParInit)
# plot of characteristic curve of item 1
plotDIFirt(parameters = coef3PL, item = 1, test = "Lord")
To detect DIF, the Raju test (Raju, 1988, 1990) uses the area between the item characteristic curves of the selected IRT model, fitted separately to the data of the two groups. The model is either 1PL, 2PL, or 3PL with guessing which is the same for the two groups. In the case of the 3PL model, the guessing parameter is estimated based on the whole dataset and is subsequently considered fixed.
Here you can choose an underlying IRT model used to test DIF. You can also select a correction method for multiple comparisons and/or item purification.
This summary table contains information about Raju's \(Z\)-statistics, corresponding \(p\)-values considering the selected adjustment, and significance codes. The table also provides estimated parameters for both groups. Note that item parameters might differ slightly even for non-DIF items because two separate models are fitted; this difference is, however, not significant. Also note that under the 3PL model, the guessing parameter \(c\) is estimated from the whole dataset and is considered fixed in the final models, thus no standard error is displayed.
library(difR)
library(ltm)
# loading data
data(GMAT, package = "difNLR")
data <- GMAT[, 1:20]
group <- GMAT[, "group"]
# 1PL IRT model
(fit1PL <- difRaju(
Data = data, group = group, focal.name = 1, model = "1PL",
p.adjust.method = "none", purify = FALSE
))
# 2PL IRT model
(fit2PL <- difRaju(
Data = data, group = group, focal.name = 1, model = "2PL",
p.adjust.method = "none", purify = FALSE
))
# 3PL IRT model with the same guessing for groups
guess <- itemParEst(data, model = "3PL")[, 3]
(fit3PL <- difRaju(
Data = data, group = group, focal.name = 1, model = "3PL",
c = guess, p.adjust.method = "none", purify = FALSE
))
To detect DIF, the Raju test (Raju, 1988, 1990) uses the area between the item characteristic curves of the selected IRT model, fitted separately to the data of the two groups. The model is either 1PL, 2PL, or 3PL with guessing which is the same for the two groups. In the case of the 3PL model, the guessing parameter is estimated based on the whole dataset and is subsequently considered fixed.
Here you can choose an underlying IRT model used to test DIF. You can also select a correction method for multiple comparisons and/or item purification. For a selected item you can display the plot of its characteristic curves and the table of its estimated parameters with standard errors.
Note that the plots might differ slightly even for non-DIF items because two separate models are fitted; this difference is, however, not significant.
This table summarizes the estimated item parameters together with the standard errors. Note that item parameters might differ slightly even for non-DIF items because two separate models are fitted; this difference is, however, not significant. Also note that under the 3PL model, the guessing parameter \(c\) is estimated from the whole dataset and is considered fixed in the final models, thus no standard error is available.
library(difR)
library(ltm)
library(ShinyItemAnalysis)
# loading data
data(GMAT, package = "difNLR")
data <- GMAT[, 1:20]
group <- GMAT[, "group"]
# 1PL IRT model
(fit1PL <- difRaju(
Data = data, group = group, focal.name = 1, model = "1PL",
p.adjust.method = "none", purify = FALSE
))
# estimated coefficients for all items
(coef1PL <- fit1PL$itemParInit)
# plot of characteristic curve of item 1
plotDIFirt(parameters = coef1PL, item = 1, test = "Raju")
# 2PL IRT model
(fit2PL <- difRaju(
Data = data, group = group, focal.name = 1, model = "2PL",
p.adjust.method = "none", purify = FALSE
))
# estimated coefficients for all items
(coef2PL <- fit2PL$itemParInit)
# plot of characteristic curve of item 1
plotDIFirt(parameters = coef2PL, item = 1, test = "Raju")
# 3PL IRT model with the same guessing for groups
guess <- itemParEst(data, model = "3PL")[, 3]
(fit3PL <- difRaju(
Data = data, group = group, focal.name = 1, model = "3PL",
c = guess, p.adjust.method = "none", purify = FALSE
))
# estimated coefficients for all items
(coef3PL <- fit3PL$itemParInit)
# plot of characteristic curve of item 1
plotDIFirt(parameters = coef3PL, item = 1, test = "Raju")
Here you can compare all offered DIF detection methods. In the table below, columns represent DIF detection methods and rows represent items. If a method detects an item as DIF, the value 1 is assigned to that item, otherwise 0. If a method fails to converge or cannot be fitted, NA is displayed instead of the 0/1 values. Available methods:
Settings for individual methods (Observed score, type of DIF to be tested, correction method, item purification) are taken from the subsection pages of the given methods. In case your settings are not unified, you can set some of them below. Note that changing the options globally can be computationally demanding. This applies especially to item purification requests. To see the complete settings of all analyses, please refer to the note below the table. The last column shows how many methods detect a certain item as DIF. The last row shows how many items are detected as DIF by a certain method.
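A minimal sketch of how such a 0/1 table can be assembled, using two of the offered methods (Mantel-Haenszel and logistic regression) and the DIFitems component of difR fits; the settings here are illustrative defaults:
library(difR)
# loading data
data(GMAT, package = "difNLR")
data <- GMAT[, 1:20]
group <- GMAT[, "group"]
# fitting two DIF detection methods
fitMH <- difMH(
  Data = data, group = group, focal.name = 1,
  p.adjust.method = "none", purify = FALSE
)
fitLR <- difLogistic(
  Data = data, group = group, focal.name = 1,
  type = "both", p.adjust.method = "none", purify = FALSE
)
# 0/1 flags per item and method, plus row counts
flags <- data.frame(
  item = colnames(data),
  MH = as.numeric(seq_len(ncol(data)) %in% fitMH$DIFitems),
  logistic = as.numeric(seq_len(ncol(data)) %in% fitLR$DIFitems)
)
flags$total <- rowSums(flags[, c("MH", "logistic")])
flags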
Cumulative logit regression allows for detection of uniform and non-uniform DIF among ordinal data by adding a group-membership variable (uniform DIF) and its interaction with observed score (non-uniform DIF) into a model for item \(i\) and by testing for their significance.
Here you can change the type of DIF to be tested, the Observed score, and the parametrization - either the IRT or the classical intercept/slope. You can also select a correction method for multiple comparisons and/or item purification.
The probability that respondent \(p\) with the observed score (e.g., standardized total score) \(Z_p\) and the group membership variable \(G_p\) obtained at least \(k\) points in item \(i\) is given by the following equation (with a category-specific intercept \(\beta_{0ik}\)):
\(\mathrm{P}\left(Y_{pi} \geq k \vert Z_p, G_p\right) = \frac{\exp(\beta_{0ik} + \beta_{1i} Z_p + \beta_{2i} G_p + \beta_{3i} Z_p G_p)}{1 + \exp(\beta_{0ik} + \beta_{1i} Z_p + \beta_{2i} G_p + \beta_{3i} Z_p G_p)}\)
The probability that respondent \(p\) with the observed score (e.g., standardized total score) \(Z_p\) and group membership \(G_p\) obtained exactly \(k\) points in item \(i\) is then given as the difference between the probabilities of obtaining at least \(k\) and at least \(k + 1\) points:
\(\mathrm{P}\left(Y_{pi} = k \vert Z_p, G_p\right) = \mathrm{P}\left(Y_{pi} \geq k \vert Z_p, G_p\right) - \mathrm{P}\left(Y_{pi} \geq k + 1 \vert Z_p, G_p\right)\)
This summary table contains information about \(\chi^2\)-statistics of the likelihood ratio test, corresponding \(p\)-values considering selected correction method, and significance codes. The table also provides estimated parameters for the best fitted model for each item.
library(difNLR)
# loading data
data(dataMedicalgraded, package = "ShinyItemAnalysis")
data <- dataMedicalgraded[, 1:100]
group <- dataMedicalgraded[, 101]
# DIF with cumulative logit regression model
(fit <- difORD(
Data = data, group = group, focal.name = 1, model = "cumulative",
type = "both", match = "zscore", p.adjust.method = "none", purify = FALSE,
parametrization = "classic"
))
Cumulative logit regression allows for detection of uniform and non-uniform DIF among ordinal data by adding a group-membership variable (uniform DIF) and its interaction with observed score (non-uniform DIF) into a model for item \(i\) and by testing for their significance.
Here you can change the type of DIF to be tested, the Observed score, and the parametrization - either the IRT or the classical intercept/slope. You can also select a correction method for multiple comparisons and/or item purification.
Points represent the proportion of the obtained score with respect to the observed score. Their size is determined by the count of respondents who achieved a given level of the observed score and who obtained a given score, with respect to the group membership.
This table summarizes estimated item parameters together with the standard errors.
library(difNLR)
# loading data
data(dataMedicalgraded, package = "ShinyItemAnalysis")
data <- dataMedicalgraded[, 1:100]
group <- dataMedicalgraded[, 101]
# DIF with cumulative logit regression model
(fit <- difORD(
Data = data, group = group, focal.name = 1, model = "cumulative",
type = "both", match = "zscore", p.adjust.method = "none", purify = FALSE,
parametrization = "classic"
))
# plot of cumulative probabilities for item X2003
plot(fit, item = "X2003", plot.type = "cumulative")
# plot of category probabilities for item X2003
plot(fit, item = "X2003", plot.type = "category")
# estimated coefficients for all items with SE
coef(fit, SE = TRUE)
An adjacent category logit regression allows for detection of uniform and non-uniform DIF among ordinal data by adding a group-membership variable (uniform DIF) and its interaction with observed score (non-uniform DIF) into a model for item \(i\) and by testing for their significance.
Here you can change the type of DIF to be tested, the Observed score, and the parametrization - either based on IRT models or classical intercept/slope. You can also select a correction method for multiple comparisons and/or item purification.
The probability that respondent \(p\) with the observed score (e.g., standardized total score) \(Z_p\) and the group membership variable \(G_p\) obtained \(k\) points in item \(i\) is given by the following equation:
The summary table contains information about the \(\chi^2\)-statistics of the likelihood ratio test, corresponding \(p\)-values considering the selected correction method, and significance codes. The table also provides estimated parameters of the best fitted model for each item.
library(difNLR)
# loading data
data(dataMedicalgraded, package = "ShinyItemAnalysis")
data <- dataMedicalgraded[, 1:100]
group <- dataMedicalgraded[, 101]
# DIF with adjacent category logit regression model
(fit <- difORD(
Data = data, group = group, focal.name = 1, model = "adjacent",
type = "both", match = "zscore", p.adjust.method = "none", purify = FALSE,
parametrization = "classic"
))
An adjacent category logit regression allows for detection of uniform and non-uniform DIF among ordinal data by adding a group-membership variable (uniform DIF) and its interaction with observed score (non-uniform DIF) into a model for item \(i\) and by testing for their significance.
Here you can change the type of DIF to be tested, the Observed score, and the parametrization - either based on IRT models or classical intercept/slope. You can also select a correction method for multiple comparisons and/or item purification.
Points represent the proportion of the obtained score with respect to the observed score. Their size is determined by the count of respondents who achieved a given level of the observed score and who obtained a given score, with respect to the group membership.
The table summarizes estimated item parameters together with standard errors.
library(difNLR)
# loading data
data(dataMedicalgraded, package = "ShinyItemAnalysis")
data <- dataMedicalgraded[, 1:100]
group <- dataMedicalgraded[, 101]
# DIF with adjacent category logit regression model
(fit <- difORD(
Data = data, group = group, focal.name = 1, model = "adjacent",
type = "both", match = "zscore", p.adjust.method = "none", purify = FALSE,
parametrization = "classic"
))
# plot of characteristic curves for item X2003
plot(fit, item = "X2003")
# estimated coefficients for all items with SE
coef(fit, SE = TRUE)
Differential distractor functioning (DDF) occurs when respondents from different groups but with the same ability have a different probability of selecting item responses in a multiple-choice item. DDF is examined here by a multinomial log-linear regression model.
Here you can change the type of DDF to be tested, the Observed score, and the parametrization - either IRT or intercept/slope. You can also select a correction method for multiple comparisons and/or item purification.
For \(K_i\) possible responses of item \(i\), the probability of the correct answer (the baseline category \(K_i\)) for respondent \(p\) with a DIF matching variable (e.g., standardized total score) \(Z_p\) and group membership \(G_p\) is given by the following equation:
The probability of choosing distractor \(k\) is then given by:
This summary table contains information about \(\chi^2\)-statistics of the likelihood ratio test, corresponding \(p\)-values considering selected correction method, and significance codes.
The table provides estimated parameters of the fitted model for each item and each distractor (incorrect option).
library(difNLR)
# loading data
data(GMATtest, GMATkey, package = "difNLR")
data <- GMATtest[, 1:20]
group <- GMATtest[, "group"]
key <- GMATkey
# DDF with multinomial regression model
(fit <- ddfMLR(
Data = data, group = group, focal.name = 1, key,
type = "both", match = "zscore",
p.adjust.method = "none", purify = FALSE,
parametrization = "classic"
))
Differential distractor functioning (DDF) occurs when respondents from different groups but with the same ability have a different probability of selecting item responses in a multiple-choice item. DDF is examined here by a multinomial log-linear regression model.
Here you can change the type of DDF to be tested, the Observed score, and the parametrization - either IRT or intercept/slope. You can also select a correction method for multiple comparisons and/or item purification.
Points represent a proportion of the response selection with respect to the observed score. Their size is determined by the count of respondents from a given group who achieved a given level of the observed score and who selected a given response option.
The table summarizes estimated item parameters together with standard errors.
library(difNLR)
# loading data
data(GMATtest, GMATkey, package = "difNLR")
data <- GMATtest[, 1:20]
group <- GMATtest[, "group"]
key <- GMATkey
# DDF with multinomial regression model
(fit <- ddfMLR(
Data = data, group = group, focal.name = 1, key,
type = "both", match = "zscore",
p.adjust.method = "none", purify = FALSE,
parametrization = "classic"
))
# plot of characteristic curves for item 1
plot(fit, item = 1)
# estimated coefficients for all items with SE
coef(fit, SE = TRUE)
In this section, you can explore the group-specific model used for testing differential item functioning between two groups - the reference and the focal group.
Select parameters \(a\) (discrimination) and \(b\) (difficulty) of an item following the 2PL IRT model for the reference and the focal group. When the item parameters differ for the reference and the focal group, this phenomenon is termed differential item functioning.
You may also select the value of latent ability \(\theta\) to obtain the interpretation of the item characteristic curves for this ability.
Consider an item following the 2PL model with the following parameters
Reference group: \(a_R = 1, b_R = 0\)
Focal group: \(a_F = 1, b_F = 1\)
For this item, fill in the following exercises with an accuracy of up to 0.05, then click on the Submit answers button. If you need a hint, click on the blue button with a question mark.
Consider an item following the 2PL model with the following parameters
Reference group: \(a_R = 0.8, b_R = -0.5\)
Focal group: \(a_F = 1.5, b_F = 1\)
For this item, fill in the following exercises with an accuracy of up to 0.05, then click on the Submit answers button. If you need a hint, click on the blue button with a question mark.
ShinyItemAnalysis offers an option to download a report in HTML or PDF format. PDF report creation requires the latest version of MiKTeX (or another TeX distribution). If you don't have the latest installation, please use the HTML report.
There is also an option to use customized settings. When the Customize settings box is checked, local settings will be offered and used for each selected section of the report. Otherwise, the settings are taken from the individual sections of the application. You may also include your name in the report and change the name of the analyzed dataset.
Reports by default contain a summary of total scores, a table of standard scores, item analysis, distractor plots for each item, and multinomial regression plots for each item. Other analyses can be selected below.
Validity
Difficulty/discrimination plot
Distractor plots
DIF method selection
Delta plot settings
Mantel-Haenszel test settings
Logistic regression settings
Multinomial regression settings
Recommendation: Report generation can be faster and more reliable when you first visit the sections you intend to include in the report. For example, if you wish to include the 3PL IRT model, first visit the Dichotomous models subsection of the IRT models section and fit the 3PL IRT model there.
Set the number of cycles for IRT models in the IRT models section.
Set the number of bootstrap samples for the confidence interval calculation in the Reliability / Restricted range section.
Here you can change setting for download of figures.
cowplot
Wilke, C.O. (2020).
cowplot: Streamlined plot theme and plot annotations for "ggplot2".
R package version 1.1.1.
See online.
data.table
Dowle, M., & Srinivasan, A. (2020).
data.table: Extension of "data.frame".
R package version 1.13.6.
See online.
deltaPlotR
Magis, D., & Facon, B. (2014).
deltaPlotR: An R package for differential item functioning analysis with Angoff's delta plot.
Journal of Statistical Software, Code Snippets, 59(1), 1-19.
See online.
difNLR
Hladka, A., & Martinkova, P. (2020).
difNLR: Generalized logistic regression models for DIF and DDF detection.
The R Journal, 12(1), 300-323.
See online.
difR
Magis, D., Beland, S., Tuerlinckx, F., & De Boeck, P. (2010).
A general framework and an R package for the detection of dichotomous differential item functioning.
Behavior Research Methods, 42, 847-862.
DT
Xie, Y., Cheng, J., & Tan, X. (2021).
DT: A wrapper of the JavaScript library "DataTables".
R package version 0.17.
See online.
ggdendro
de Vries, A., & Ripley, B.D. (2020).
ggdendro: Create dendrograms and tree diagrams using "ggplot2".
R package version 0.1-22.
See online.
ggplot2
Wickham, H. (2016).
ggplot2: Elegant graphics for data analysis.
Springer-Verlag New York.
See online.
gridExtra
Auguie, B. (2017).
gridExtra: Miscellaneous functions for "grid" graphics.
R package version 2.3.
See online.
knitr
Xie, Y. (2020).
knitr: A general-purpose package for dynamic report generation in R.
R package version 1.30.
See online.
latticeExtra
Sarkar, D., & Andrews, F. (2019).
latticeExtra: Extra graphical utilities based on lattice.
R package version 0.6-29.
See online.
lme4
Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015).
Fitting linear mixed-effects models using lme4.
Journal of Statistical Software, 67(1), 1-48.
See online.
ltm
Rizopoulos, D. (2006).
ltm: An R package for latent variable modelling and item response theory analyses.
Journal of Statistical Software, 17(5), 1-25.
See online.
magrittr
Bache, S. M., & Wickham, H. (2020).
magrittr: A forward-pipe operator for R.
R package version 2.0.1.
See online.
mirt
Chalmers, R. P. (2012).
mirt: A multidimensional item response theory package for the R environment.
Journal of Statistical Software, 48(6), 1-29.
msm
Jackson, C. H. (2011).
Multi-state models for panel data: The msm package for R.
Journal of Statistical Software, 38(8), 1-29.
See online.
nnet
Venables, W. N., & Ripley, B. D. (2002).
Modern applied statistics with S.
New York, USA: Springer.
See online.
plotly
Sievert, C. (2020).
Interactive web-based data visualization with R, plotly, and shiny.
Chapman and Hall/CRC.
See online.
psych
Revelle, W. (2020).
psych: Procedures for psychological, psychometric, and personality research.
R package version 2.0.12.
See online.
psychometric
Fletcher, T., & Fletcher, D. (2010).
psychometric: Applied psychometric theory.
R package version 2.2.
See online.
purrr
Henry, L., & Wickham, H. (2020).
purrr: Functional programming tools.
R package version 0.3.4.
See online.
rlang
Henry, L., & Wickham, H. (2020).
rlang: Functions for base types and core R and "tidyverse" features.
R package version 0.4.10.
See online.
rmarkdown
Xie, Y., Allaire, J.J., & Grolemund, G. (2018).
R Markdown: The definitive guide.
Chapman and Hall/CRC. ISBN 9781138359338.
See online.
rstudioapi
Ushey, K., Allaire, J.J., Wickham, H., & Ritchie, G. (2018).
rstudioapi: Safely access the RStudio API.
R package version 0.13.
See online.
scales
Wickham, H., & Seidel, D. (2020).
scales: Scale functions for visualization.
R package version 1.1.1.
See online.
shiny
Chang, W., Cheng, J., Allaire, J., Xie, Y., & McPherson, J. (2020).
shiny: Web application framework for R.
R package version 1.5.0.
See online.
shinyBS
Bailey, E. (2015).
shinyBS: Twitter bootstrap components for shiny.
R package version 0.61.
See online.
shinydashboard
Chang, W., & Borges Ribeiro, B. (2018).
shinydashboard: Create dashboards with "shiny".
R package version 0.7.1.
See online.
ShinyItemAnalysis
Martinkova, P., & Drabinova, A. (2018).
ShinyItemAnalysis for teaching psychometrics and to enforce routine analysis of educational tests.
The R Journal, 10(2), 503-515.
See online.
shinyjs
Attali, D. (2020).
shinyjs: Easily improve the user experience of your shiny apps in seconds.
R package version 2.0.0.
See online.
stringr
Wickham, H. (2019).
stringr: Simple, consistent wrappers for common string operations.
R package version 1.4.0.
See online.
tibble
Müller, K., & Wickham, H. (2020).
tibble: Simple data frames.
R package version 3.0.4.
See online.
tidyr
Wickham, H. (2020).
tidyr: Tidy messy data.
R package version 1.1.2.
See online.
VGAM
Yee, T. W. (2015).
Vector generalized linear and additive models: With an implementation in R.
New York, USA: Springer.
See online.
xtable
Dahl, D., Scott, D., Roosen, C., Magnusson, A., & Swinton, J. (2019).
xtable: Export tables to LaTeX or HTML.
R package version 1.8-4.
See online.