### Welcome

Welcome to ShinyItemAnalysis!

ShinyItemAnalysis is an interactive online application for the psychometric analysis of educational tests, psychological assessments, health-related and other types of multi-item measurements, or ratings from multiple raters, built on R and shiny. You can easily start using the application with the default toy dataset. You may also select from a number of other toy datasets or upload your own in the Data section. Offered methods include:

• Exploration of total and standard scores in the Summary section
• Analysis of measurement error in the Reliability section
• Correlation structure and criterion validity analysis in the Validity section
• Item and distractor analysis in the Item analysis section
• Item analysis with regression models in the Regression section
• Item analysis by item response theory models in the IRT models section
• Detection of differential item functioning in the DIF/Fairness section

All graphical outputs and selected tables can be downloaded via the download button. Moreover, you can automatically generate an HTML or PDF report in the Reports section. All offered analyses are complemented by selected R code that is ready to be copied and pasted into your R console, so a similar analysis can be run and modified in R.

#### News

A new paper on range-restricted inter-rater reliability has been published in JRSS-A (Erosheva, Martinkova, & Lee, 2021). To try examples interactively, set the AIBS toy dataset in the Data section by clicking on the menu in the upper left corner and go to the Reliability/Restricted range section.
New papers on differential item functioning have been published in Learning and Instruction (Martinkova, Hladka, & Potuznikova, 2020) and in The R Journal (Hladka & Martinkova, 2020). To try these examples interactively, set the Learning to Learn 9 toy dataset in the Data section by clicking on the menu in the upper left corner and go to the DIF/Fairness/Generalized logistic section.

#### Availability

The application is also available online at the Czech Academy of Sciences and shinyapps.io.

#### Versions

The current CRAN version is 1.3.6.
The version available online is 1.3.6.
The newest development version available on GitHub is 1.3.6.

#### Feedback

If you discover a problem with this application, please contact the project maintainer at martinkova(at)cs.cas.cz or use GitHub. We also encourage you to provide your feedback using the Google form.

This program is free software and you can redistribute it and/or modify it under the terms of the GNU GPL 3 as published by the Free Software Foundation. This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.

To cite ShinyItemAnalysis in publications, please use:

Martinkova, P., & Drabinova, A. (2018).
ShinyItemAnalysis for teaching psychometrics and to enforce routine analysis of educational tests.
The R Journal, 10(2), 503-515, doi: 10.32614/RJ-2018-074

#### Funding

Czech Science Foundation (GJ15-15856Y, 21-03658S), Charles University (PRIMUS/17/HUM/11).

### Data

For demonstration purposes, the 20-item dataset GMAT is used. While on this page, you may select one of several other toy datasets or you may upload your own dataset (see below). To return to the demonstration dataset, click on the Unload data button.

#### Training datasets

The main data file should contain the responses of individual respondents (rows) to given items (columns). Data need to be either binary, nominal (e.g., in ABCD format), or ordinal (e.g., on a Likert scale). The header may contain item names; however, no row names should be included. In all data sets, the header should be either included or excluded. Columns of the dataset are by default renamed to Item followed by the number of the particular column. If you want to keep your own names, check the box Keep item names below. Missing values in the scored dataset are by default evaluated as 0. If you want to keep them as missing, check the box Keep missing values below.

For ordinal data, you are advised to include a vector of cut-scores used for binarization of the uploaded data, i.e., values greater than or equal to the provided cut-score are set to 1, otherwise to 0. You can either upload a dataset of item-specific values or provide one value for the whole dataset.

Note: If the cut-score is not provided, a vector of maximal values is used.
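
The binarization rule for ordinal data can be sketched in base R as follows. This is only an illustration: the toy matrix `ordinal` and the vector `cut_score` are made up and are not objects used by the application.

```r
# toy ordinal responses: 4 respondents (rows) x 3 items (columns); illustrative only
ordinal <- matrix(
  c(1, 3, 5,
    2, 4, 1,
    5, 5, 3,
    3, 2, 4),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, paste0("Item", 1:3))
)

# hypothetical item-specific cut-scores
# (a single value for the whole dataset could be expanded with rep())
cut_score <- c(3, 4, 4)

# values greater than or equal to the item's cut-score become 1, all others 0
binary <- 1 * sweep(ordinal, 2, cut_score, FUN = ">=")
binary
```

Using `sweep()` compares each column with its own cut-score, which mirrors the item-specific option; with one common cut-score, `1 * (ordinal >= 3)` would give the same kind of result.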

For nominal data, it is necessary to upload a key of correct answers.

For ordinal data, it is optional to upload minimal and maximal values of answers. You can either upload a dataset of item-specific values or provide one value for the whole dataset.

Note: If no minimal or maximal values are provided, these values are set automatically based on observed values.

Group is a variable for DIF and DDF analyses. It should be a binary vector, where 0 represents the reference group and 1 represents the focal group. Its length needs to be the same as the number of individual respondents in the main dataset. Missing values are not supported for the group variable and such cases/rows of the data should be removed.

Note: If no group variable is provided, the DIF and DDF analyses in the DIF/Fairness section are not available.

Criterion is either a discrete or continuous variable (e.g., future study success or future GPA in the case of admission tests) which should be predicted by the measurement. Its length needs to be the same as the number of individual respondents in the main dataset.

Note: If no criterion variable is provided, it won't be possible to run a validity analysis in the Predictive validity section on the Validity page.

Observed score is a variable describing the observed ability or trait of respondents. If supplied, it is offered in the Regression and in the DIF/Fairness sections for analyses with respect to this external variable. Its length needs to be the same as the number of individual respondents in the main dataset.

Note: If no observed score is provided, the total scores or standardized total scores are used instead.

### Data exploration

Here you can explore the uploaded dataset. The rendering of tables can take some time.

### Total scores

Total score, also known as raw score or sum score, is the simplest measure of the latent trait being measured. The total score is calculated as the sum of the item scores. For binary (correct/incorrect) items, the total score corresponds to the total number of correct answers.

#### Summary table

The table below summarizes basic descriptive statistics for the total scores, including the number of respondents $$n$$, minimum and maximum, mean, median, standard deviation $$\textrm{SD}$$, skewness, and kurtosis. The skewness for normally distributed scores is near the value of 0 and the kurtosis is near the value of 3.

#### Histogram of total score

For a selected cut-score, the blue part of the histogram shows respondents with a total score above the cut-score, the grey column shows respondents with a total score equal to the cut-score and the red part of the histogram shows respondents below the cut-score.

#### Selected R code

    library(ggplot2)
    library(psych)
    library(ShinyItemAnalysis)

    # loading data
    data(GMAT, package = "difNLR")
    data <- GMAT[, 1:20]

    # total score calculation
    score <- rowSums(data)

    # summary of total score
    tab <- describe(score)[, c("n", "min", "max", "mean", "median", "sd", "skew", "kurtosis")]
    # describe() reports excess kurtosis; add 3 so that normal scores give about 3
    tab$kurtosis <- tab$kurtosis + 3
    tab

    # data frame for plotting (defined before the first ggplot() call)
    df <- data.frame(score)

    # histogram
    ggplot(df, aes(score)) +
      geom_histogram(binwidth = 1, col = "black") +
      xlab("Total score") +
      ylab("Number of respondents") +
      theme_app()

    # colors by cut-score
    cut <- median(score) # cut-score
    color <- c(rep("red", cut - min(score)), "gray", rep("blue", max(score) - cut))

    # histogram colored by cut-score
    ggplot(df, aes(score)) +
      geom_histogram(binwidth = 1, fill = color, col = "black") +
      xlab("Total score") +
      ylab("Number of respondents") +
      theme_app()

### Standard scores

Total score is calculated as the sum of the item scores.
Percentile indicates the value below which a percentage of observations falls, e.g., an individual score at the 80th percentile means that the individual score is the same or higher than the scores of 80% of all respondents.
Success rate is the percentage of scores obtained, e.g., if the maximum number of points on the test is 20, the minimum is 0, and the individual score is 12, then the success rate is $$12 / 20 = 0.6$$, i.e., 60%.
The Z-score, also known as the standardized score, is a linear transformation of the total score with a mean of 0 and a standard deviation of 1.
The T-score is a linear transformation of the Z-score with a mean of 50 and a standard deviation of 10.

#### Selected R code

    # loading data
    data(GMAT, package = "difNLR")
    data <- GMAT[, 1:20]

    # scores calculations (unique values)
    score <- rowSums(data) # Total score
    tosc <- sort(unique(score)) # Levels of total score
    perc <- ecdf(score)(tosc) # Percentiles
    sura <- 100 * (tosc / max(score)) # Success rate
    zsco <- sort(unique(scale(score))) # Z-score
    tsco <- 50 + 10 * zsco # T-score

    cbind(tosc, perc, sura, zsco, tsco)

### Correlation structure

#### Correlation heat map

A correlation heat map displays the selected type of correlations between items. The size and shade of the circles indicate how much the items are correlated (larger and darker circles mean greater correlations). The color of the circles indicates in which way the items are correlated: blue means a positive correlation and red means a negative correlation. A correlation heat map can be reordered using a hierarchical clustering method selected below. With a number of clusters larger than 1, the rectangles representing clusters are drawn. The values of a correlation heat map may be displayed and also downloaded.

The Pearson correlation coefficient describes the strength and direction of a linear relationship between two random variables $$X$$ and $$Y$$. It is given by the formula

$$\rho = \frac{cov(X,Y)}{\sqrt{var(X)}\sqrt{var(Y)}}.$$

The sample Pearson correlation coefficient may be calculated as

$$r = \frac{\sum_{i = 1}^{n}(x_{i} - \bar{x})(y_{i} - \bar{y})}{\sqrt{\sum_{i = 1}^{n}(x_{i} - \bar{x})^2}\sqrt{\sum_{i = 1}^{n}(y_{i} - \bar{y})^2}}$$

Pearson correlation coefficient has a value between -1 and +1. Sample correlation of -1 and +1 correspond to all data points lying exactly on a line (decreasing in case of negative linear correlation -1 and increasing for +1). If the coefficient is equal to 0, it means there is no linear relationship between the two variables.
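
As a small check of the sample formula, the following base R sketch computes the coefficient term by term and compares it with the built-in `cor()`. The simulated variables `x` and `y` are illustrative only.

```r
# simulated data; x and y are illustrative only
set.seed(1)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)

# sample Pearson correlation computed directly from the formula
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))

# the same quantity from the built-in function
r_builtin <- cor(x, y, method = "pearson")

# both approaches agree up to numerical precision
all.equal(r_manual, r_builtin)
```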

A polychoric/tetrachoric correlation between two ordinal/binary variables is calculated from their contingency table, under the assumption that the ordinal variables dissect continuous latent variables that are bivariate normal.

Spearman's rank correlation coefficient describes the strength and the direction of a monotonic relationship between random variables $$X$$ and $$Y$$, i.e., the dependence between the rankings of two variables. It is given by the formula

$$\rho = \frac{cov(rg_{X},rg_{Y})}{\sqrt{var(rg_{X})}\sqrt{var(rg_{Y})}},$$

where $$rg_{X}$$ and $$rg_{Y}$$ are the random variables $$X$$ and $$Y$$ transformed into ranks, i.e., the Spearman correlation coefficient is the Pearson correlation coefficient between the ranked variables.

The sample Spearman correlation is calculated by converting $$X$$ and $$Y$$ to ranks (average ranks are used in case of ties) and by applying the sample Pearson correlation formula. If both $$X$$ and $$Y$$ have $$n$$ unique ranks, i.e., there are no ties, then the sample correlation coefficient is given by the formula

$$r = 1 - \frac{6\sum_{i = 1}^{n}d_i^{2}}{n(n^2-1)}$$

where $$d_i = rg_{X_i} - rg_{Y_i}$$ is the difference between the two ranks of observation $$i$$ and $$n$$ is the size of $$X$$ and $$Y$$. The Spearman rank correlation coefficient has a value between -1 and 1, where 1 means identical ranks of the two variables and -1 means reversed ranks of the two variables. In case of no repeated values, a Spearman correlation of +1 or -1 means that all data points lie exactly on some monotone line. If the Spearman coefficient is equal to 0, there is no tendency for $$Y$$ to either increase or decrease as $$X$$ increases.
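
The two routes to the sample Spearman coefficient can be compared in a short base R sketch: the Pearson correlation of the ranks, and the no-ties shortcut $$1 - 6\sum d_i^2 / (n(n^2 - 1))$$. The simulated data are illustrative and continuous, so no ties occur.

```r
# simulated continuous data (no ties); illustrative only
set.seed(2)
x <- rnorm(30)
y <- x^3 + rnorm(30, sd = 0.5)

# ranks of both variables
rg_x <- rank(x)
rg_y <- rank(y)

# route 1: Pearson correlation applied to the ranks
r_ranks <- cor(rg_x, rg_y)

# route 2: shortcut formula, valid only when there are no ties
n <- length(x)
d <- rg_x - rg_y
r_short <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

# built-in Spearman correlation for comparison
r_builtin <- cor(x, y, method = "spearman")

c(r_ranks = r_ranks, r_short = r_short, r_builtin = r_builtin)
```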

Clustering methods. Ward's method aims at finding compact clusters based on minimizing the within-cluster sum of squares. Ward's n. 2 method uses squared dissimilarities. The Single method connects clusters with their nearest neighbours, i.e., the distance between two clusters is calculated as the minimum of the distances between observations in one cluster and observations in the other cluster. Complete linkage with the farthest neighbours, on the other hand, uses the maximum of the distances. The Average linkage method uses the distance based on a weighted average of the individual distances. The McQuitty method uses an unweighted average. The Median linkage calculates the distance as the median of distances between an observation in one cluster and an observation in another cluster. The Centroid method uses the distance between centroids of clusters.
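
A minimal sketch of clustering items by their correlations with base R's `hclust()` is shown below. The simulated responses and the dissimilarity `1 - correlation` are assumptions of this sketch; the application may construct the dissimilarity differently.

```r
# simulated binary responses: 200 respondents, 6 items; illustrative only
set.seed(3)
theta <- rnorm(200)
data_sim <- sapply(1:6, function(i) rbinom(200, 1, plogis(theta)))
colnames(data_sim) <- paste0("Item", 1:6)

# dissimilarity between items derived from their correlations
# (assumed here as 1 - correlation)
corr <- cor(data_sim)
dist_items <- as.dist(1 - corr)

# hierarchical clustering with Ward's method; hclust() also accepts
# "ward.D2", "single", "complete", "average", "mcquitty", "median", "centroid"
hc <- hclust(dist_items, method = "ward.D")

# assign items to two clusters
clusters <- cutree(hc, k = 2)
clusters
```

Changing the `method` argument is enough to compare how the linkage rules group the same items.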

### Factor analysis

#### Finding the optimal number of factors

The scree plot below displays two sets of eigenvalues associated with the factors in descending order. The location of a bend (an elbow) in the "real" part can be considered indicative of the suitable number of factors (Cattell, 1966). Another rule, as proposed by Kaiser (1960), discards all factors below the eigenvalue of 1 (the information of a single average item).

A much better, modern approach called a parallel analysis (Horn, 1965) compares the eigenvalues of the real data correlation matrix with the eigenvalues (or more precisely, 95th percentiles of their sampling distributions) obtained from simulated zero-factor random matrices. The number of factors with the eigenvalue bigger than the eigenvalue at the first (leftmost) curves crossing is then the optimal number to extract in factor analysis.
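
The idea of parallel analysis can be sketched in base R as follows. This is a simplified illustration on simulated one-factor data, using eigenvalues of Pearson correlation matrices and normally distributed random data; it is not the implementation used by the application, and all object names are hypothetical.

```r
set.seed(4)
n <- 300 # respondents
p <- 8   # items

# simulated data driven by a single common factor; illustrative only
f <- rnorm(n)
data_sim <- sapply(1:p, function(j) 0.7 * f + rnorm(n, sd = sqrt(1 - 0.49)))

# eigenvalues of the real-data correlation matrix
eig_real <- eigen(cor(data_sim), only.values = TRUE)$values

# eigenvalues of zero-factor random data of the same size, repeated n_sim times
n_sim <- 200
eig_sim <- replicate(n_sim, {
  rand <- matrix(rnorm(n * p), nrow = n)
  eigen(cor(rand), only.values = TRUE)$values
})

# 95th percentiles of the simulated eigenvalues, one per position
eig_95 <- apply(eig_sim, 1, quantile, probs = 0.95)

# number of factors: real eigenvalues exceeding the simulated ones
# before the first crossing of the two curves
below <- which(eig_real <= eig_95)
n_factors <- if (length(below) > 0) below[1] - 1 else p
n_factors
```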

Method used to compute the correlation matrix. For ordinal datasets with only a few categories, the polychoric option is recommended. The choice is automatically forwarded to the EFA below.

#### Exploratory factor analysis

Once the optimal number of factors is found, the exploratory factor analysis (EFA) itself may be conducted. The number of factors found by the parallel analysis is offered as the default value. You can select the preferred factor rotation of the solution or hide the loadings that are not of interest. There is also an option to sort items by their importance on each factor. Below the loadings table, there is a factor summary with the proportion of variance each factor explains, as well as a list of common model fit indices.

#### Selected R code

    library(psych)
    library(ggplot2)
    library(ShinyItemAnalysis) # provides fa_parallel()

    # loading data
    data(HCI, package = "ShinyItemAnalysis")
    data <- HCI[, 1:20]

    # scree plot, parallel analysis
    (fa_paral <- fa_parallel(data))
    plot(fa_paral)
    as.data.frame(fa_paral)

    # EFA for 1, 2, and 3 factors
    (FA1 <- psych::fa(data, nfactors = 1))
    (FA2 <- psych::fa(data, nfactors = 2))
    (FA3 <- psych::fa(data, nfactors = 3))

    # Model fit for different number of factors
    VSS(data)

    # Path diagrams
    fa.diagram(FA1)
    fa.diagram(FA2)
    fa.diagram(FA3)

    # Higher order factor solution
    (om.h <- omega(data, sl = FALSE))

### Criterion validity

Depending on the criterion variable, different types of criterion validity may be examined. As an example, a correlation between the test score and the future study success or future GPA may be used as a proof of predictive validity in the case of admission tests. A criterion variable may be uploaded in the Data section.

#### Descriptive plots of criterion variable on total score

Total scores are plotted according to a criterion variable. Boxplot or scatterplot is displayed depending on the type of criterion variable - whether it is discrete or continuous. Scatterplot is provided with a red linear regression line.

#### Correlation of criterion variable and total score

An association between the total score and the criterion variable can be estimated using the Pearson product-moment correlation coefficient $$r$$. The null hypothesis being tested states that the correlation is exactly 0.

#### Selected R code

    library(ggplot2)
    library(ShinyItemAnalysis)

    # loading data
    data(GMAT, package = "difNLR")
    data <- GMAT[, 1:20]
    score <- rowSums(data) # total score calculation
    criterion <- GMAT[, "criterion"] # criterion variable
    hist(criterion)
    criterionD <- round(criterion) # discrete criterion variable
    hist(criterionD)

    # number of respondents in each criterion level
    size <- as.factor(criterionD)
    levels(size) <- table(as.factor(criterionD))
    size <- as.numeric(paste(size))
    df <- data.frame(score, criterion, criterionD, size)

    # descriptive plots
    # boxplot, for discrete criterion
    ggplot(df, aes(y = score, x = as.factor(criterionD), fill = as.factor(criterionD))) +
      geom_boxplot() +
      geom_jitter(shape = 16, position = position_jitter(0.2)) +
      scale_fill_brewer(palette = "Blues") +
      xlab("Criterion group") +
      ylab("Total score") +
      coord_flip() +
      theme_app()

    # scatterplot, for continuous criterion
    ggplot(df, aes(x = score, y = criterion)) +
      geom_point() +
      ylab("Criterion variable") +
      xlab("Total score") +
      geom_smooth(method = lm, se = FALSE, color = "red") +
      theme_app()

    # test for association between total score and criterion variable
    cor.test(criterion, score, method = "pearson", exact = FALSE)

### Spearman-Brown formula

#### Equation

Let $$\text{rel}(X)$$ be the reliability of the test composed of $$I$$ equally precise items measuring the same construct, $$X = X_1 + ... + X_I$$. Then for a test consisting of $$I^*$$ such items, that is for a test which is $$m = \frac{I^*}{I}$$ times longer/shorter, the reliability would be

$$\text{rel}(X^*) = \frac{m\cdot \text{rel}(X)}{1 + (m - 1)\cdot\text{rel}(X)}.$$

The Spearman-Brown formula can be used to determine the reliability of a test with a different number of equally precise items measuring the same construct. It can also be used to determine the number of items necessary to achieve a desired reliability.

In the calculations below, reliability of original data is by default set to the value of Cronbach's $$\alpha$$ for the dataset currently in use. The number of items in the original data is by default set to the number of items in the dataset currently in use.
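
Both uses of the Spearman-Brown formula can be written as short base R functions. This is a sketch for illustration: the function names `sb_reliability()` and `sb_length()` and the input values are hypothetical, and the second function is simply the formula solved for $$m$$.

```r
# reliability of a test that is m times longer (or shorter)
sb_reliability <- function(rel, m) {
  m * rel / (1 + (m - 1) * rel)
}

# length ratio m needed to reach the desired reliability rel_new
# (the Spearman-Brown formula solved for m)
sb_length <- function(rel, rel_new) {
  rel_new * (1 - rel) / (rel * (1 - rel_new))
}

# illustrative values, not taken from any dataset
rel_original <- 0.7
items_original <- 20

# reliability of a 30-item test built from equally precise items
sb_reliability(rel_original, m = 30 / items_original)

# number of items needed for a reliability of 0.8, rounded up
m_needed <- sb_length(rel_original, rel_new = 0.8)
items_needed <- ceiling(m_needed * items_original)
items_needed
```

With these inputs, lengthening the test from 20 to 30 items raises the reliability from 0.7 to about 0.78, and 35 items are needed to reach 0.8.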

#### Estimate of reliability with different number of items

Here you can calculate an estimate of reliability for a test consisting of a different number of items.

#### Necessary number of items for required level of reliability

Here you can calculate the necessary number of items to gain the required level of reliability.

#### Selected R code

    library(psychometric)

    # loading data
    data(HCI, package = "ShinyItemAnalysis")
    data <- HCI[, 1:20]

    # reliability of original data
    rel.original <- psychometric::alpha(data)
    # number of items in original data
    items.original <- ncol(data)

    # number of items in new data
    items.new <- 30
    # ratio of tests lengths
    m <- items.new / items.original
    # determining reliability
    SBrel(Nlength = m, rxx = rel.original)

    # desired reliability
    rel.new <- 0.8
    # determining test length
    (m.new <- SBlength(rxxp = rel.new, rxx = rel.original))
    # number of required items
    m.new * items.original

### Split-half method

The split-half method uses the correlation between two subscores for an estimation of reliability. The underlying assumption is that the two halves of the test (or even all items on the test) are equally precise and measure the same underlying construct. The Spearman-Brown formula is then used to correct the estimate for the number of items.

#### Equation

For a test with $$I$$ items total score is calculated as $$X = X_1 + ... + X_I$$. Let $$X^*_1$$ and $$X^*_2$$ be total scores calculated from items found only in the first and second subsets. The estimate of reliability is then given by the Spearman-Brown formula (Spearman, 1910; Brown, 1910) with $$m = 2$$.

$$\text{rel}(X) = \frac{m\cdot \text{cor}(X^*_1, X^*_2)}{1 + (m - 1)\cdot\text{cor}(X^*_1, X^*_2)} = \frac{2\cdot \text{cor}(X^*_1, X^*_2)}{1 + \text{cor}(X^*_1, X^*_2)}$$

You can choose below from different split-half approaches. The First-last method uses a correlation between the first half of items and the second half of items. The Even-odd method places even numbered items into the first subset and odd numbered items into the second one. The Random method performs a random split of items, thus the resulting estimate may be different for each call. Out of a specified number of random splits (10,000 by default), the Worst method selects the lowest estimate and the Average method calculates the average. In the case of an odd number of items, the first subset contains one more item than the second one.
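
The Even-odd approach can be sketched in base R as follows. The simulated binary responses are illustrative only, and the sketch does not reproduce the confidence intervals or the random-split methods.

```r
# simulated binary responses: 500 respondents, 10 items; illustrative only
set.seed(5)
theta <- rnorm(500)
items <- sapply(1:10, function(i) rbinom(500, 1, plogis(theta)))

# even-numbered items form the first subset, odd-numbered items the second
even <- seq(2, ncol(items), by = 2)
odd <- seq(1, ncol(items), by = 2)
score_1 <- rowSums(items[, even])
score_2 <- rowSums(items[, odd])

# correlation between the two subscores
r_halves <- cor(score_1, score_2)

# Spearman-Brown correction with m = 2
rel_split_half <- 2 * r_halves / (1 + r_halves)
rel_split_half
```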

#### Reliability estimate with confidence interval

The estimate of reliability for First-last, Even-odd, Random, and Worst is calculated using the Spearman-Brown formula. The confidence interval is based on a confidence interval of correlation using the delta method. The estimate of reliability for the Average method is a mean value of sampled reliabilities and the confidence interval is the confidence interval of this mean.

#### Histogram of reliability estimates

A histogram is based on a selected number of split halves estimates (10,000 by default). The current estimate is highlighted by a red colour.

### Logistic regression on standardized total scores

Various regression models may be fitted to describe item properties in more detail. Logistic regression can model dependency of the probability of correctly answering item $$i$$ by respondent $$p$$ on their standardized total score $$Z_p$$ (Z-score) by an S-shaped logistic curve. Parameter $$\beta_{i0}$$ describes horizontal position of the fitted curve and parameter $$\beta_{i1}$$ describes its slope.

#### Plot with estimated logistic curve

Points represent proportion of correct answers with respect to the standardized total score. Their size is determined by the count of respondents who achieved a given level of the standardized total score.

#### Equation

$$\mathrm{P}(Y_{pi} = 1|Z_p) = \mathrm{E}(Y_{pi}|Z_p) = \frac{e^{\left(\beta_{i0} + \beta_{i1} Z_p\right)}}{1 + e^{\left(\beta_{i0} + \beta_{i1} Z_p\right)}}$$
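
A model of this form can be fitted with base R's `glm()`. The sketch below uses simulated data in which `z` stands in for the standardized total score and `y` for the responses to one item; the true parameter values are arbitrary choices for illustration.

```r
# simulated data for a single item; illustrative only
set.seed(6)
n <- 1000
z <- rnorm(n)  # stand-in for the standardized total score
beta0 <- -0.5  # arbitrary true intercept
beta1 <- 1.2   # arbitrary true slope
y <- rbinom(n, 1, plogis(beta0 + beta1 * z))

# logistic regression of the item response on the standardized total score
fit <- glm(y ~ z, family = binomial)
coef(fit)

# fitted probability of a correct answer at Z = 0,
# which equals exp(b0) / (1 + exp(b0)) for the estimated intercept b0
p0 <- predict(fit, newdata = data.frame(z = 0), type = "response")
p0
```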

### Nonlinear four parameter regression on standardized total scores with IRT parameterization

Various regression models may be fitted to describe item properties in more detail. Nonlinear regression can model dependency of the probability of correctly answering item $$i$$ by respondent $$p$$ on their standardized total score $$Z_p$$ (Z-score) by an S-shaped logistic curve. The IRT parametrization used here corresponds to the parametrization used in IRT models. Parameter $$b_{i}$$ describes horizontal position of the fitted curve (difficulty), parameter $$a_{i}$$ describes its slope at the inflection point (discrimination), pseudo-guessing parameter $$c_i$$ describes its lower asymptote and inattention parameter $$d_i$$ describes its upper asymptote.

#### Plot with estimated nonlinear curve

Points represent proportion of correct answers with respect to the standardized total score. Their size is determined by the count of respondents who achieved a given level of the standardized total score.

#### Equation

$$\mathrm{P}(Y_{pi} = 1|Z_p) = \mathrm{E}(Y_{pi}|Z_p) = c_i + \left(d_i - c_i\right) \cdot \frac{e^{a_i\left(Z_p - b_i\right)}}{1 + e^{a_i\left(Z_p - b_i\right)}}$$
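
The role of the four parameters can be checked numerically with a small base R function. The function name `p_4pl()` and the parameter values are hypothetical and serve only to illustrate the equation.

```r
# four-parameter curve in IRT parameterization;
# plogis(x) is exp(x) / (1 + exp(x))
p_4pl <- function(z, a_i, b_i, c_i, d_i) {
  c_i + (d_i - c_i) * plogis(a_i * (z - b_i))
}

# illustrative parameter values
a_i <- 1.5  # discrimination
b_i <- 0.3  # difficulty
c_i <- 0.2  # pseudo-guessing (lower asymptote)
d_i <- 0.95 # inattention (upper asymptote)

# at Z = b_i the curve is halfway between the two asymptotes
p_mid <- p_4pl(b_i, a_i, b_i, c_i, d_i)
p_mid

# far below and far above b_i the curve approaches c_i and d_i
p_low <- p_4pl(-100, a_i, b_i, c_i, d_i)
p_high <- p_4pl(100, a_i, b_i, c_i, d_i)
c(p_low, p_high)
```

Setting `c_i = 0` and `d_i = 1` reduces the curve to the two-parameter logistic form.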

### Generalized logistic regression

Generalized logistic regression models are extensions of a logistic regression method which account for the possibility of guessing by allowing for nonzero lower asymptote - pseudo-guessing $$c_i$$ (Drabinova & Martinkova, 2017) or an upper asymptote lower than one - inattention $$d_i$$. Similarly to logistic regression, its extensions also provide detection of uniform and non-uniform DIF by letting the difficulty parameter $$b_i$$ (uniform) and the discrimination parameter $$a_i$$ (non-uniform) differ for groups and by testing for the difference in their values. Moreover, these extensions allow for testing differences in pseudo-guessing and inattention parameters and they can be seen as proxies of 3PL and 4PL IRT models for DIF detection.

#### Method specification

Here you can specify the assumed model. In 3PL and 4PL models, the abbreviations $$c_{g}$$ or $$d_{g}$$ mean that parameters $$c_i$$ or $$d_i$$ are assumed to be the same for both groups, otherwise they are allowed to differ. With type you can specify the type of DIF to be tested by choosing the parameters in which a difference between groups should be tested. You can also select correction method for multiple comparison or item purification.

Finally, you may change the Observed score. While matching on the standardized total score is typical, the upload of other Observed scores is possible in the Data section. Using a pre-test (standardized) total score allows for testing differential item functioning in change (DIF-C) to provide proofs of instructional sensitivity (Martinkova et al., 2020), also see Learning To Learn 9 toy dataset.

#### Equation

The displayed equation is based on the model selected below

#### Summary table

This summary table contains information about the DIF test statistic $$LR(\chi^2)$$, corresponding $$p$$-values considering the selected adjustment, and significance codes. This table also provides estimated parameters for the best fitted model for each item. Note that $$a_{iG_p}$$ (and also other parameters) from the equation above consists of a parameter for the reference group and a parameter for the difference between focal and reference groups, i.e., $$a_{iG_p} = a_{i} + a_{iDif}G_{p}$$, where $$G_{p} = 0$$ for the reference group and $$G_{p} = 1$$ for the focal group, as stated in the table below.

#### Selected R code

    library(difNLR)

    # loading data
    data(GMAT, package = "difNLR")
    data <- GMAT[, 1:20]
    group <- GMAT[, "group"]

    # generalized logistic regression DIF method
    # using 3PL model with the same guessing parameter for both groups
    (fit <- difNLR(
      Data = data, group = group, focal.name = 1, model = "3PLcg",
      match = "zscore", type = "all", p.adjust.method = "none", purify = FALSE
    ))

    # loading data
    data(LearningToLearn, package = "ShinyItemAnalysis")
    data <- LearningToLearn[, 87:94] # item responses from Grade 9 from subscale 6
    group <- LearningToLearn$track # school track - group membership variable
    match <- scale(LearningToLearn$score_6) # standardized test score from Grade 6

    # detecting differential item functioning in change (DIF-C) using
    # the generalized logistic regression DIF method with 3PL model
    # with the same guessing parameter for both groups
    # and standardized total score from Grade 6 as the matching criterion
    (fit <- difNLR(
      Data = data, group = group, focal.name = "AS", model = "3PLcg",
      match = match, type = "all", p.adjust.method = "none", purify = FALSE
    ))

### Generalized logistic regression

Generalized logistic regression models are extensions of a logistic regression method which account for the possibility of guessing by allowing for nonzero lower asymptote - pseudo-guessing $$c_i$$ (Drabinova & Martinkova, 2017) or an upper asymptote lower than one - inattention $$d_i$$. Similarly to logistic regression, its extensions also provide detection of uniform and non-uniform DIF by letting the difficulty parameter $$b_i$$ (uniform) and the discrimination parameter $$a_i$$ (non-uniform) differ for groups and by testing for the difference in their values. Moreover, these extensions allow for testing differences in pseudo-guessing and inattention parameters and they can be seen as proxies of 3PL and 4PL IRT models for DIF detection.

#### Method specification

Here you can specify the assumed model. In 3PL and 4PL models, the abbreviations $$c_{g}$$ or $$d_{g}$$ mean that parameters $$c$$ or $$d$$ are assumed to be the same for both groups, otherwise they are allowed to differ. With type you can specify the type of DIF to be tested by choosing the parameters in which a difference between groups should be tested. You can also select correction method for multiple comparison or item purification.

Finally, you may change the Observed score. While matching on the standardized total score is typical, the upload of other observed scores is possible in the Data section. Using a pre-test (standardized) total score allows for testing differential item functioning in change (DIF-C) to provide proofs of instructional sensitivity (Martinkova et al., 2020), also see Learning To Learn 9 toy dataset. For a selected item, you can display the plot of its characteristic curves and the table of its estimated parameters with standard errors.

#### Plot with estimated DIF generalized logistic curve

Points represent a proportion of the correct answer (empirical probabilities) with respect to the observed score. Their size is determined by the count of respondents who achieved a given level of observed score with respect to the group membership.

#### Table of parameters

This table summarizes estimated item parameters together with their standard errors. Note that $$a_{iG_p}$$ (and also other parameters) from the equation above consists of a parameter for the reference group and a parameter for the difference between focal and reference groups, i.e., $$a_{iG_p} = a_{i} + a_{iDif}G_{p}$$, where $$G_{p} = 0$$ for the reference group and $$G_{p} = 1$$ for the focal group, as stated in the table below.

#### Selected R code

    library(difNLR)

    # loading data
    data(GMAT, package = "difNLR")
    data <- GMAT[, 1:20]
    group <- GMAT[, "group"]

    # generalized logistic regression DIF method
    # using 3PL model with the same guessing parameter for both groups
    (fit <- difNLR(
      Data = data, group = group, focal.name = 1, model = "3PLcg",
      match = "zscore", type = "all", p.adjust.method = "none", purify = FALSE
    ))

    # plot of characteristic curve of item 1
    plot(fit, item = 1)

    # estimated coefficients for item 1 with SE
    coef(fit, SE = TRUE)[[1]]

### Lord test for IRT models

To detect DIF, the Lord test (Lord, 1980) compares item parameters of a selected IRT model, fitted separately on data of the two groups. The model is either 1PL, 2PL, or 3PL with guessing, which is the same for the two groups. In the case of the 3PL model, the guessing parameter is estimated based on the whole dataset and is subsequently considered fixed. In statistical terms, the Lord statistic is equal to the Wald statistic.
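
For the simplest case, the 1PL model, the statistic compares a single difficulty parameter per group and can be sketched in base R as a Wald-type statistic. The estimates and standard errors below are hypothetical and are assumed to be already on a common scale. For the 2PL and 3PL models, the statistic involves the vector of parameter differences and the covariance matrices of the estimates, which this sketch does not cover.

```r
# hypothetical difficulty estimates and standard errors for one item,
# obtained from models fitted separately in the two groups
b_ref <- 0.10
se_ref <- 0.12
b_foc <- 0.55
se_foc <- 0.14

# Wald-type statistic: squared difference of the estimates
# divided by the sum of their variances
lord_chi2 <- (b_foc - b_ref)^2 / (se_ref^2 + se_foc^2)

# p-value from the chi-square distribution with 1 degree of freedom
p_value <- pchisq(lord_chi2, df = 1, lower.tail = FALSE)

c(statistic = lord_chi2, p_value = p_value)
```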

#### Method specification

Here you can choose the underlying IRT model used to test DIF. You can also select the correction method for multiple comparisons, and/or item purification.

#### Summary table

This summary table contains information about Lord's $$\chi^2$$-statistics, corresponding $$p$$-values considering the selected adjustment, and significance codes. The table also provides estimated parameters for both groups. Note that item parameters might slightly differ even for non-DIF items as two separate models are fitted; however, this difference is non-significant. Also note that under the 3PL model, the guessing parameter $$c$$ is estimated from the whole dataset and is considered fixed in the final models, thus no standard error is displayed.

#### Selected R code

    library(difR)
    library(ltm)

    # loading data
    data(GMAT, package = "difNLR")
    data <- GMAT[, 1:20]
    group <- GMAT[, "group"]

    # 1PL IRT model
    (fit1PL <- difLord(
      Data = data, group = group, focal.name = 1, model = "1PL",
      p.adjust.method = "none", purify = FALSE
    ))

    # 2PL IRT model
    (fit2PL <- difLord(
      Data = data, group = group, focal.name = 1, model = "2PL",
      p.adjust.method = "none", purify = FALSE
    ))

    # 3PL IRT model with the same guessing for groups
    guess <- itemParEst(data, model = "3PL")[, 3]
    (fit3PL <- difLord(
      Data = data, group = group, focal.name = 1, model = "3PL",
      c = guess, p.adjust.method = "none", purify = FALSE
    ))

### Lord test for IRT models

To detect DIF, the Lord test (Lord, 1980) compares item parameters of a selected IRT model, fitted separately on data of the two groups. The model is either 1PL, 2PL, or 3PL with guessing which is the same for the two groups. In the case of the 3PL model, the guessing parameter is estimated based on the whole dataset and is subsequently considered fixed. In statistical terms, the Lord statistic is equal to the Wald statistic.

#### Method specification

Here you can choose the underlying IRT model used to test DIF. You can also select a correction method for multiple comparisons and/or item purification. For a selected item, you can display the plot of its characteristic curves and the table of its estimated parameters with standard errors.

#### Plot with estimated DIF characteristic curve

Note that plots might differ slightly even for non-DIF items because two separate models are fitted; for such items the difference is not statistically significant.

#### Table of parameters

The table summarizes estimated item parameters together with standard errors. Note that item parameters might differ slightly even for non-DIF items because two separate models are fitted; for such items the difference is not statistically significant. Also note that under the 3PL model, the guessing parameter $$c$$ is estimated from the whole dataset and is considered fixed in the final models, so no standard error is displayed.

### Method comparison

Here you can compare all offered DIF detection methods. In the table below, columns represent DIF detection methods, and rows represent item numbers. If the method detects an item as DIF, value 1 is assigned to that item, otherwise 0 is assigned. In the case that any method fails to converge or cannot be fitted, NA is displayed instead of 0/1 values. Available methods:

• Delta is the delta plot method (Angoff & Ford, 1973; Magis & Facon, 2012),
• MH is the Mantel-Haenszel test (Mantel & Haenszel, 1959),
• LR is logistic regression (Swaminathan & Rogers, 1990),
• NLR is generalized (non-linear) logistic regression (Drabinova & Martinkova, 2017),
• LORD is the Lord chi-square test (Lord, 1980),
• RAJU is the Raju area method (Raju, 1990),
• SIBTEST is the SIBTEST method (Shealy & Stout, 1993) and the crossing-SIBTEST method (Chalmers, 2018; Li & Stout, 1996).

#### Table with method comparison

Settings for individual methods (Observed score, type of DIF to be tested, correction method, item purification) are taken from the subsection pages of the given methods. If your settings are not unified, you can set some of them below. Note that changing the options globally can be computationally demanding, especially when item purification is requested. To see the complete settings of all analyses, please refer to the note below the table. The last column shows how many methods detect a given item as DIF. The last row shows how many items are detected as DIF by a given method.
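
The layout of such a comparison table can be sketched in base R. The detection results below are invented, only three of the methods are shown, and ignoring NA values in the row and column totals (`na.rm = TRUE`) is an assumption of this sketch rather than a documented behavior of the application.

```r
# hypothetical 0/1 detection results for five items and three methods
# (1 = item flagged as DIF, NA = method could not be fitted)
detected <- cbind(
  MH = c(0, 1, 0, 1, 0),
  LR = c(0, 1, 0, 0, 0),
  LORD = c(NA, 1, 0, 1, 0)
)
rownames(detected) <- paste0("Item", 1:5)

# last column: number of methods flagging each item
comparison <- cbind(detected, n_methods = rowSums(detected, na.rm = TRUE))
# last row: number of items flagged by each method
comparison <- rbind(comparison, n_items = colSums(comparison, na.rm = TRUE))
comparison
```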

### Cumulative logit model for DIF detection

Cumulative logit regression allows for detection of uniform and non-uniform DIF among ordinal data by adding a group-membership variable (uniform DIF) and its interaction with observed score (non-uniform DIF) into a model for item $$i$$ and by testing for their significance.

#### Method specification

Here you can change the type of DIF to be tested, the Observed score, and the parametrization (either IRT or classical intercept/slope). You can also select a correction method for multiple comparisons and/or item purification.

#### Equation

The probability that respondent $$p$$ with the observed score (e.g., standardized total score) $$Z_p$$ and the group membership variable $$G_p$$ obtained at least $$k$$ points in item $$i$$ is given by the following equation:
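
A common way to write this model under the classical intercept/slope parametrization is sketched here. The symbol $$Y_{pi}$$ for the score of respondent $$p$$ on item $$i$$ and the coefficient names $$b_{i0k}$$, $$b_{i1}$$, $$b_{i2}$$, $$b_{i3}$$ are notation assumed for this sketch, with a category-specific intercept and slopes shared across categories:

$$\mathrm{P}(Y_{pi} \geq k \mid Z_p, G_p) = \frac{e^{b_{i0k} + b_{i1} Z_p + b_{i2} G_p + b_{i3} Z_p G_p}}{1 + e^{b_{i0k} + b_{i1} Z_p + b_{i2} G_p + b_{i3} Z_p G_p}}$$

In this form, $$b_{i2}$$ corresponds to uniform DIF and $$b_{i3}$$ to non-uniform DIF.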

The probability that respondent $$p$$ with the observed score (e.g., standardized total score) $$Z_p$$ and group membership $$G_p$$ obtained exactly $$k$$ points in item $$i$$ is then given as the difference between the probabilities of obtaining at least $$k$$ and $$k + 1$$ points:
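
In symbols, writing $$Y_{pi}$$ for the score of respondent $$p$$ on item $$i$$ (a symbol assumed here), the difference reads:

$$\mathrm{P}(Y_{pi} = k \mid Z_p, G_p) = \mathrm{P}(Y_{pi} \geq k \mid Z_p, G_p) - \mathrm{P}(Y_{pi} \geq k + 1 \mid Z_p, G_p)$$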

#### Summary table

This summary table contains the $$\chi^2$$-statistics of the likelihood ratio test, the corresponding $$p$$-values under the selected correction method, and significance codes. The table also provides estimated parameters for the best fitted model for each item.

#### Selected R code

    library(difNLR)

    # loading data
    data(dataMedicalgraded, package = "ShinyItemAnalysis")
    data <- dataMedicalgraded[, 1:100]
    group <- dataMedicalgraded[, 101]

    # DIF with cumulative logit regression model
    (fit <- difORD(
      Data = data, group = group, focal.name = 1, model = "cumulative",
      type = "both", match = "zscore", p.adjust.method = "none",
      purify = FALSE, parametrization = "classic"
    ))

### Cumulative logit model for DIF detection

Cumulative logit regression allows for detection of uniform and non-uniform DIF among ordinal data by adding a group-membership variable (uniform DIF) and its interaction with observed score (non-uniform DIF) into a model for item $$i$$ and by testing for their significance.

#### Method specification

Here you can change the type of DIF to be tested, the Observed score, and the parametrization (either IRT or classical intercept/slope). You can also select a correction method for multiple comparisons and/or item purification.

#### Plot with estimated DIF curves

Points represent a proportion of the obtained score with respect to the observed score. Their size is determined by the count of respondents who achieved a given level of the observed score and who selected given option with respect to the group membership.
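
One way to read this description is that each point is an empirical proportion computed within a level of the observed score and a group, with the cell count used as the point size. The sketch below follows that reading in base R; the tiny dataset and the column names are made up for illustration.

```r
# hypothetical data: observed score, group (0 = reference, 1 = focal), item score
d <- data.frame(
  score = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  group = c(0, 0, 1, 1, 0, 0, 0, 1, 1, 1),
  item  = c(0, 1, 0, 0, 1, 2, 2, 1, 1, 2)
)

# counts per (observed score, group, item score) cell: used as point size
tab <- table(d$score, d$group, d$item, dnn = c("score", "group", "item"))

# proportions of item scores within each (observed score, group) level
prop <- prop.table(tab, margin = c(1, 2))

# one row per point, keeping only cells that actually occur
pts <- merge(
  as.data.frame(tab, responseName = "n"),
  as.data.frame(prop, responseName = "proportion")
)
pts[pts$n > 0, ]
```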

#### Table of parameters

This table summarizes estimated item parameters together with the standard errors.

#### Selected R code

    library(difNLR)

    # loading data
    data(dataMedicalgraded, package = "ShinyItemAnalysis")
    data <- dataMedicalgraded[, 1:100]
    group <- dataMedicalgraded[, 101]

    # DIF with cumulative logit regression model
    (fit <- difORD(
      Data = data, group = group, focal.name = 1, model = "cumulative",
      type = "both", match = "zscore", p.adjust.method = "none",
      purify = FALSE, parametrization = "classic"
    ))

    # plot of cumulative probabilities for item X2003
    plot(fit, item = "X2003", plot.type = "cumulative")

    # plot of category probabilities for item X2003
    plot(fit, item = "X2003", plot.type = "category")

    # estimated coefficients for all items with SE
    coef(fit, SE = TRUE)

### Adjacent category logit model for DIF detection

An adjacent category logit regression allows for detection of uniform and non-uniform DIF among ordinal data by adding a group-membership variable (uniform DIF) and its interaction with observed score (non-uniform DIF) into a model for item $$i$$ and by testing for their significance.

#### Method specification

Here you can change the type of DIF to be tested, the Observed score, and the parametrization (either IRT or classical intercept/slope). You can also select a correction method for multiple comparisons and/or item purification.

#### Equation

The probability that respondent $$p$$ with the observed score (e.g., standardized total score) $$Z_p$$ and the group membership variable $$G_p$$ obtained $$k$$ points in item $$i$$ is given by the following equation:
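
A common form of the adjacent category logit model under the classical intercept/slope parametrization is sketched here. The notation is assumed for this sketch: $$Y_{pi}$$ is the score of respondent $$p$$ on item $$i$$, the item is scored $$0, 1, \dots, K_i$$, the coefficients are $$b_{i0l}$$, $$b_{i1}$$, $$b_{i2}$$, $$b_{i3}$$, and an empty sum (for $$k = 0$$ or $$j = 0$$) equals zero:

$$\mathrm{P}(Y_{pi} = k \mid Z_p, G_p) = \frac{\exp\left(\sum_{l=1}^{k} \left(b_{i0l} + b_{i1} Z_p + b_{i2} G_p + b_{i3} Z_p G_p\right)\right)}{\sum_{j=0}^{K_i} \exp\left(\sum_{l=1}^{j} \left(b_{i0l} + b_{i1} Z_p + b_{i2} G_p + b_{i3} Z_p G_p\right)\right)}$$

Under this form, the log-odds of scoring $$k$$ rather than $$k - 1$$ equal $$b_{i0k} + b_{i1} Z_p + b_{i2} G_p + b_{i3} Z_p G_p$$.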

#### Summary table

This summary table contains the $$\chi^2$$-statistics of the likelihood ratio test, the corresponding $$p$$-values under the selected correction method, and significance codes. The table also provides estimated parameters for the best fitted model for each item.

#### Selected R code

    library(difNLR)

    # loading data
    data(dataMedicalgraded, package = "ShinyItemAnalysis")
    data <- dataMedicalgraded[, 1:100]
    group <- dataMedicalgraded[, 101]

    # DIF with adjacent category logit regression model
    (fit <- difORD(
      Data = data, group = group, focal.name = 1, model = "adjacent",
      type = "both", match = "zscore", p.adjust.method = "none",
      purify = FALSE, parametrization = "classic"
    ))

### Adjacent category logit model for DIF detection

An adjacent category logit regression allows for detection of uniform and non-uniform DIF among ordinal data by adding a group-membership variable (uniform DIF) and its interaction with observed score (non-uniform DIF) into a model for item $$i$$ and by testing for their significance.

#### Method specification

Here you can change the type of DIF to be tested, the Observed score, and the parametrization (either IRT or classical intercept/slope). You can also select a correction method for multiple comparisons and/or item purification.

#### Plot with estimated DIF curves

Points represent a proportion of the obtained score with respect to the observed score. Their size is determined by the count of respondents who achieved a given level of the observed score and who selected a given option with respect to the group membership.

#### Table of parameters

This table summarizes estimated item parameters together with standard errors.

#### Selected R code

    library(difNLR)

    # loading data
    data(dataMedicalgraded, package = "ShinyItemAnalysis")
    data <- dataMedicalgraded[, 1:100]
    group <- dataMedicalgraded[, 101]

    # DIF with adjacent category logit regression model
    (fit <- difORD(
      Data = data, group = group, focal.name = 1, model = "adjacent",
      type = "both", match = "zscore", p.adjust.method = "none",
      purify = FALSE, parametrization = "classic"
    ))

    # plot of characteristic curves for item X2003
    plot(fit, item = "X2003")

    # estimated coefficients for all items with SE
    coef(fit, SE = TRUE)

### Multinomial model for DDF detection

Differential distractor functioning (DDF) occurs when respondents from different groups but with the same ability have a different probability of selecting item responses in a multiple-choice item. DDF is examined here by a multinomial log-linear regression model.

#### Method specification

Here you can change the type of DDF to be tested, the Observed score, and the parametrization (either IRT or classical intercept/slope). You can also select a correction method for multiple comparisons and/or item purification.

#### Equation

For $$K_i$$ possible item responses, the probability of the correct answer $$K_i$$ for respondent $$p$$ with a DIF matching variable (e.g., standardized total score) $$Z_p$$ and a group membership $$G_p$$ in item $$i$$ is given by the following equation:
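
A common form of this probability, with the correct answer as the baseline category and the classical intercept/slope parametrization, is sketched here. The symbol $$Y_{pi}$$ for the response of respondent $$p$$ to item $$i$$ and the coefficient names $$b_{il0}$$, $$b_{il1}$$, $$b_{il2}$$, $$b_{il3}$$ for distractor $$l$$ are notation assumed for this sketch:

$$\mathrm{P}(Y_{pi} = K_i \mid Z_p, G_p) = \frac{1}{1 + \sum_{l=1}^{K_i - 1} e^{b_{il0} + b_{il1} Z_p + b_{il2} G_p + b_{il3} Z_p G_p}}$$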

The probability of choosing distractor $$k$$ is then given by:
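
In the same baseline-category form, the probability for distractor $$k$$ can be sketched as follows. Again, $$Y_{pi}$$ and the distractor-specific coefficients $$b_{il0}$$, $$b_{il1}$$, $$b_{il2}$$, $$b_{il3}$$ are notation assumed for this sketch:

$$\mathrm{P}(Y_{pi} = k \mid Z_p, G_p) = \frac{e^{b_{ik0} + b_{ik1} Z_p + b_{ik2} G_p + b_{ik3} Z_p G_p}}{1 + \sum_{l=1}^{K_i - 1} e^{b_{il0} + b_{il1} Z_p + b_{il2} G_p + b_{il3} Z_p G_p}}$$

The probabilities of the $$K_i - 1$$ distractors and of the correct answer sum to one.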

#### Summary table

This summary table contains the $$\chi^2$$-statistics of the likelihood ratio test, the corresponding $$p$$-values under the selected correction method, and significance codes.

#### Estimates of item parameters

This table provides estimated parameters for the fitted model for each item and distractor (incorrect option).

#### Selected R code

    library(difNLR)

    # loading data
    data(GMATtest, GMATkey, package = "difNLR")
    data <- GMATtest[, 1:20]
    group <- GMATtest[, "group"]
    key <- GMATkey

    # DDF with multinomial regression model
    (fit <- ddfMLR(
      Data = data, group = group, focal.name = 1, key = key,
      type = "both", match = "zscore", p.adjust.method = "none",
      purify = FALSE, parametrization = "classic"
    ))

### Multinomial model for DDF detection

Differential distractor functioning (DDF) occurs when respondents from different groups but with the same ability have a different probability of selecting item responses in a multiple-choice item. DDF is examined here by a multinomial log-linear regression model.

#### Method specification

Here you can change the type of DDF to be tested, the Observed score, and the parametrization (either IRT or classical intercept/slope). You can also select a correction method for multiple comparisons and/or item purification.

#### Plot with estimated DDF curves

Points represent a proportion of the response selection with respect to the observed score. Their size is determined by the count of respondents from a given group who achieved a given level of the observed score and who selected a given response option.

#### Table of parameters

This table summarizes estimated item parameters together with standard errors.

#### Selected R code

    library(difNLR)

    # loading data
    data(GMATtest, GMATkey, package = "difNLR")
    data <- GMATtest[, 1:20]
    group <- GMATtest[, "group"]
    key <- GMATkey

    # DDF with multinomial regression model
    (fit <- ddfMLR(
      Data = data, group = group, focal.name = 1, key = key,
      type = "both", match = "zscore", p.adjust.method = "none",
      purify = FALSE, parametrization = "classic"
    ))

    # plot of characteristic curves for item 1
    plot(fit, item = 1)

    # estimated coefficients for all items with SE
    coef(fit, SE = TRUE)

### DIF training

In this section, you can explore the group-specific model for testing differential item functioning between two groups, reference and focal.

#### Parameters

Select parameters $$a$$ (discrimination) and $$b$$ (difficulty) of an item under the 2PL IRT model for the reference and the focal group. When the item parameters for the reference and the focal group differ, this phenomenon is termed differential item functioning.

You may also select the value of latent ability $$\theta$$ to obtain the interpretation of the item characteristic curves for this ability.
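
The group-specific curves can also be evaluated by hand. The following is a minimal base R sketch, assuming the logistic 2PL form without the 1.7 scaling constant; the function name `icc_2pl` and the parameter values are hypothetical and chosen only for illustration.

```r
# 2PL item characteristic curve: probability of a correct answer
icc_2pl <- function(theta, a, b) {
  1 / (1 + exp(-a * (theta - b)))
}

# hypothetical item parameters for the two groups
ref <- list(a = 1.2, b = -0.3)
foc <- list(a = 1.2, b = 0.4)

# probabilities of a correct answer at selected ability levels
theta <- c(-1, 0, 1)
p_ref <- icc_2pl(theta, ref$a, ref$b)
p_foc <- icc_2pl(theta, foc$a, foc$b)
round(rbind(reference = p_ref, focal = p_foc), 2)

# difference between the groups at each ability level
round(p_ref - p_foc, 2)
```

Comparing the sign of the difference across ability levels shows whether one group is favored everywhere or whether the curves cross.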

#### Exercise 1

Consider an item following the 2PL model with the following parameters:

Reference group: $$a_R = 1, b_R = 0$$

Focal group: $$a_F = 1, b_F = 1$$

For this item, complete the following exercises with an accuracy of up to 0.05. Then click the Submit answers button. If you need a hint, click the blue button with a question mark.

• Sketch item characteristic curves for both groups.
• What type of DIF is displayed?
• What are the probabilities of a correct answer for latent abilities $$\theta = -2, 0, 2$$ for the reference and the focal group?
Reference:
Focal:
• Which group is favored?

#### Exercise 2

Consider an item following the 2PL model with the following parameters:

Reference group: $$a_R = 0.8, b_R = -0.5$$

Focal group: $$a_F = 1.5, b_F = 1$$

For this item, complete the following exercises with an accuracy of up to 0.05. Then click the Submit answers button. If you need a hint, click the blue button with a question mark.

• Sketch item characteristic curves for both groups.
• What type of DIF is displayed?
• What are the probabilities of a correct answer for latent abilities $$\theta = -1, 0, 1$$ for the reference and the focal group?
Reference:
Focal:
• Which group is favored?

#### Settings of report

ShinyItemAnalysis offers an option to download a report in HTML or PDF format. PDF report creation requires the latest version of MiKTeX (or another TeX distribution). If you don't have the latest installation, please use the HTML report.

There is also an option to use customized settings. When the Customize settings option is checked, local settings will be offered and used for each selected section of the report. Otherwise, the settings will be taken from the selections made in the individual sections of the application. You may also include your name in the report and change the name of the analyzed dataset.

#### Content of report

Reports by default contain a summary of total scores, table of standard scores, item analysis, distractor plots for each item and multinomial regression plots for each item. Other analyses can be selected below.

Validity

Difficulty/discrimination plot

Distractors plots

DIF method selection

Delta plot settings

Mantel-Haenszel test settings

Logistic regression settings

Multinomial regression settings

Recommendation: Report generation can be faster and more reliable when you first check sections of intended contents. For example, if you wish to include a 3PL IRT model, you can first visit the Dichotomous models subsection of the IRT models section and fit the 3PL IRT model.

### Settings

#### IRT models setting

Set the number of cycles for IRT models in the IRT models section.

#### Range-restricted reliability settings

Set the number of bootstrap samples for the confidence interval calculation in the Reliability / Restricted range section.

### R packages

• cowplot Wilke, C.O. (2020). cowplot: Streamlined plot theme and plot annotations for "ggplot2". R package version 1.1.1. See online.
• data.table Dowle, M., & Srinivasan, A. (2020). data.table: Extension of "data.frame". R package version 1.13.6. See online.
• deltaPlotR Magis, D., & Facon, B. (2014). deltaPlotR: An R package for differential item functioning analysis with Angoff's delta plot. Journal of Statistical Software, Code Snippets, 59(1), 1-19. See online.
• difNLR Hladka, A., & Martinkova, P. (2020). difNLR: Generalized logistic regression models for DIF and DDF detection. The R Journal, 12(1), 300-323. See online.
• difR Magis, D., Beland, S., Tuerlinckx, F., & De Boeck, P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42, 847-862.
• DT Xie, Y., Cheng, J., & Tan, X. (2021). DT: A wrapper of the JavaScript library "DataTables". R package version 0.17. See online.
• ggdendro de Vries, A., & Ripley, B.D. (2020). ggdendro: Create dendrograms and tree diagrams using "ggplot2". R package version 0.1-22. See online.
• ggplot2 Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. See online.
• gridExtra Auguie, B. (2017). gridExtra: Miscellaneous functions for "grid" graphics. R package version 2.3. See online.
• knitr Xie, Y. (2020). knitr: A general-purpose package for dynamic report generation in R. R package version 1.30. See online.
• latticeExtra Sarkar, D., & Andrews, F. (2019). latticeExtra: Extra graphical utilities based on lattice. R package version 0.6-29. See online.
• lme4 Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). Mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1-48. See online.
• ltm Rizopoulos, D. (2006). ltm: An R package for latent variable modelling and item response theory analyses. Journal of Statistical Software, 17(5), 1-25. See online.
• magrittr Bache, S. M., & Wickham, H. (2020). magrittr: A forward-pipe operator for R. R package version 2.0.1. See online.
• mirt Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1-29.
• msm Jackson, C. H. (2011). Multi-state models for panel data: The msm package for R. Journal of Statistical Software, 38(8), 1-29. See online.
• nnet Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S. See online.
• plotly Sievert, C. (2020). Interactive web-based data visualization with R, plotly, and shiny. Chapman and Hall/CRC Florida, 2020. See online.
• psych Revelle, W. (2020). psych: Procedures for psychological, psychometric, and personality research. R package version 2.0.12. See online.
• psychometric Fletcher, T. D. (2010). psychometric: Applied psychometric theory. R package version 2.2. See online.
• purrr Henry, L., & Wickham, H. (2020). purrr: Functional programming tools. R package version 0.3.4. See online.
• rlang Henry, L., & Wickham, H. (2020). rlang: Functions for base types and core R and "tidyverse" features. R package version 0.4.10. See online.
• rmarkdown Xie, Y., Allaire, J.J., & Grolemund G. (2018). R Markdown: The definitive guide. Chapman and Hall/CRC. ISBN 9781138359338. See online.
• rstudioapi Ushey, K., Allaire J.J., Wickham, H., & Ritchie G. (2018). rstudioapi: Safely access the RStudio API. R package version 0.13. See online.
• scales Wickham, H., & Seidel D. (2020). scales: Scale functions for visualization. R package version 1.1.1. See online.
• shiny Chang, W., Cheng, J., Allaire, J., Xie, Y., & McPherson, J. (2020). shiny: Web application framework for R. R package version 1.5.0. See online.
• shinyBS Bailey, E. (2015). shinyBS: Twitter bootstrap components for shiny. R package version 0.61. See online.
• shinydashboard Chang, W., & Borges Ribeiro, B. (2018). shinydashboard: Create dashboards with "shiny". R package version 0.7.1. See online.
• ShinyItemAnalysis Martinkova, P., & Drabinova, A. (2018). ShinyItemAnalysis for teaching psychometrics and to enforce routine analysis of educational tests. The R Journal, 10(2), 503-515. See online.
• shinyjs Attali, D. (2020). shinyjs: Easily improve the user experience of your shiny apps in seconds. R package version 2.0.0. See online.
• stringr Wickham, H. (2019). stringr: Simple, consistent wrappers for common string operations. R package version 1.4.0. See online.
• tibble Müller, K., & Wickham, H. (2020). tibble: Simple data frames. R package version 3.0.4. See online.
• tidyr Wickham, H. (2020). tidyr: Tidy messy data. R package version 1.1.2. See online.
• VGAM Yee, T. W. (2015). Vector Generalized linear and additive models: With an implementation in R. New York, USA: Springer. See online.
• xtable Dahl, D., Scott, D., Roosen, C., Magnusson, A., & Swinton, J. (2019). xtable: Export tables to LaTeX or HTML. R package version 1.8-4. See online.

### References

• Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716-723. See online.
• Ames, A. J., & Penfield, R. D. (2015). An NCME instructional module on item-fit statistics for item response theory models. Educational Measurement: Issues and Practice, 34(3), 39-48. See online.
• Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43(4), 561-573. See online.
• Angoff, W. H., & Ford, S. F. (1973). Item-race interaction on a test of scholastic aptitude. Journal of Educational Measurement, 10(2), 95-105. See online.
• Bartholomew, D., Steel, F., Moustaki, I., & Galbraith, J. (2002). The analysis and interpretation of multivariate data for social scientists. London: Chapman and Hall.
• Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37(1), 29-51. See online.
• Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 1904-1920, 3(3), 296-322. See online.
• Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1, 245-276. See online.
• Chalmers, R. P. (2018). Improving the crossing-SIBTEST statistic for detecting non-uniform DIF. Psychometrika, 83(2), 376-386. See online.
• Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334. See online.
• Drabinova, A., & Martinkova, P. (2017). Detection of differential item functioning with non-linear regression: Non-IRT approach accounting for guessing. Journal of Educational Measurement, 54(4), 498-517. See online.
• Erosheva, E. A., Martinkova, P., & Lee, C. J. (2021). When zero may not be zero: A cautionary note on the use of inter-rater reliability in evaluating grant peer review. Journal of the Royal Statistical Society: Series A. See online.
• Feldt, L. S., Woodruff, D. J., & Salih, F. A. (1987). Statistical inference for coefficient alpha. Applied Psychological Measurement, 11(1), 93-103. See online.
• Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30, 179-185. See online.
• Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educational and Psychological Measurement, 20, 141-151. See online.
• Li, H.-H., & Stout, W. (1996). A new procedure for detection of crossing DIF. Psychometrika, 61(4), 647-677. See online.
• Lord, F. M. (1980). Applications of item response theory to practical testing problems. Routledge.
• Magis, D., & Facon, B. (2012). Angoff's delta method revisited: Improving DIF detection under small samples. British Journal of Mathematical and Statistical Psychology, 65(2), 302-321. See online.
• Magis, D., & Facon, B. (2013). Item purification does not always improve DIF detection: A counter-example with Angoff's Delta plot. Educational and Psychological Measurement, 73(2), 293-311. See online.
• Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies. Journal of the National Cancer Institute, 22(4), 719-748. See online.
• Martinkova, P., Drabinova, A., & Houdek, J. (2017). ShinyItemAnalysis: Analyza prijimacich a jinych znalostnich ci psychologickych testu. [ShinyItemAnalysis: Analyzing admission and other educational and psychological tests] TESTFORUM, 6(9), 16-35. See online.
• Martinkova, P., Drabinova, A., Liaw, Y. L., Sanders, E. A., McFarland, J. L., & Price, R. M. (2017). Checking equity: Why differential item functioning analysis should be a routine part of developing conceptual assessments. CBE-Life Sciences Education, 16(2), rm2. See online.
• Martinkova, P., Stepanek, L., Drabinova, A., Houdek, J., Vejrazka, M., & Stuka, C. (2017). Semi-real-time analyses of item characteristics for medical school admission tests. In Proceedings of the 2017 Federated Conference on Computer Science and Information Systems, 189-194. See online.
• Martinkova, P., Hladka, A., & Potuznikova, E. (2020). Is academic tracking related to gains in learning competence? Using propensity score matching and differential item change functioning analysis for better understanding of tracking implications. Learning and Instruction, 66(April). See online.
• Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149-174. See online.
• McFarland, J. L., Price, R. M., Wenderoth, M. P., Martinkova, P., Cliff, W., Michael, J., ..., & Wright, A. (2017). Development and validation of the homeostasis concept inventory. CBE-Life Sciences Education, 16(2), ar35. See online.
• Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. ETS Research Report Series, 1992(1). See online.
• Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24(1), 50-64. See online.
• Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53(4), 495-502. See online.
• Raju, N. S. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14(2), 197-207. See online.
• Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Paedagogiske Institute.
• Revelle, W. (1979). Hierarchical cluster analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1), 57-74. See online.
• Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika, 34(1), 1-97. See online.
• Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461-464. See online.
• Shealy, R., & Stout, W. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58(2), 159-194. See online.
• Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 1904-1920, 3(3), 271-295. See online.
• Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4), 361-370. See online.
• Wilson, M. (2005). Constructing measures: An item response modeling approach.
• Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: Mesa Press.