Analysis of the Characteristics of a Large-Scale Reading Assessment

Marty McCall

ABSTRACT
Many tests of achievement are designed according to the principles of latent trait theory, which encompasses a family of models varying in complexity, in assumptions about the latent trait, and in the type of responses examinees can make. A set of response data from a statewide public school reading test was analyzed using several latent trait models to evaluate model fit, reliability and validity.

First, three nested models of increasing complexity were compared. All of these models assume a single latent trait (unidimensionality) and examinee responses coded as either “correct” or “incorrect” (dichotomous coding). To check whether the more complex models overfit the data, parameters for all models were estimated from a sample of the overall data set. Fit statistics were then computed by applying these parameters to a new population. There was no evidence of overfitting. Parameters were also estimated for all models at varying sample sizes, using an assortment of test designs. Under all conditions, the most complex model (3PL) showed the best fit, with no loss of generalizability when used with new populations. The differences in fit were statistically significant in every case, but none was large enough to make a practical difference.
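The nesting of the three models, and the idea of scoring fit on new examinees, can be sketched as follows. This is a minimal illustration, not the estimation procedure used in the study. It assumes the two simpler models are the usual 1PL and 2PL special cases of the 3PL (the abstract names only the 3PL), and the names `p_correct` and `log_likelihood`, together with the parameter names `a`, `b`, `c` and `theta`, are hypothetical.

```python
import math

def p_correct(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under the 3PL logistic model.

    theta: examinee ability; b: item difficulty;
    a: item discrimination; c: lower asymptote ("guessing").
    Assumed nesting: a=1 and c=0 gives a 1PL form, c=0 gives a 2PL form.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(responses, thetas, items):
    """Log-likelihood of 0/1 responses given fixed item parameters.

    responses: one row of 0/1 values per examinee.
    thetas: one ability value per examinee.
    items: one (a, b, c) tuple per item, e.g. estimated on another sample.
    """
    total = 0.0
    for row, theta in zip(responses, thetas):
        for x, (a, b, c) in zip(row, items):
            p = p_correct(theta, b, a, c)
            total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total
```

Evaluating `log_likelihood` on examinees who were not used to estimate the item parameters is one simple way to see whether the extra parameters of a richer model generalize or merely fit noise.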

Next, factor analysis was used to test the assumption of unidimensionality. Factors did not correspond to content categories. Instead, items clustered by reading passage, indicating multidimensionality at the item level. In this case, dichotomous coding overestimates reliability, because some of the variance shared among items reflects passage effects rather than variation in latent reading ability.

Finally, the data were recoded so that each passage became a single item with a range of correctness (polytomous coding). This model was slightly less reliable than the corresponding dichotomous model. As a test of validity, scores from both models were correlated with scores from an external test. There was no significant difference between the correlations, suggesting that the scores are virtually equivalent.
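The recoding step can be illustrated with a short sketch that collapses 0/1 item responses into one score per passage and then computes a reliability coefficient on either coding. Cronbach's alpha is used here only as a familiar stand-in, since the abstract does not say which reliability index was used; the function names and the `passage_of_item` argument are hypothetical.

```python
from statistics import pvariance

def passage_scores(responses, passage_of_item):
    """Sum 0/1 item responses within each passage (polytomous recoding).

    responses: one row of 0/1 values per examinee.
    passage_of_item: the passage label of each item column.
    Returns one row per examinee, with one score per passage (sorted by label).
    """
    passages = sorted(set(passage_of_item))
    recoded = []
    for row in responses:
        totals = {p: 0 for p in passages}
        for x, p in zip(row, passage_of_item):
            totals[p] += x
        recoded.append([totals[p] for p in passages])
    return recoded

def cronbach_alpha(scores):
    """Cronbach's alpha for a persons-by-items score matrix.

    Assumes at least two items and non-zero variance in total scores.
    """
    k = len(scores[0])
    item_vars = [pvariance(col) for col in zip(*scores)]
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

Computing `cronbach_alpha(responses)` and `cronbach_alpha(passage_scores(responses, passage_of_item))` on the same data gives a rough item-level versus passage-level comparison of the kind described above.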

The more complex models do not produce perceptible changes in scores, but they are statistically justified. They are promising for assessments whose results depend strongly on the appropriateness of the model.

Monday, June 10, 2002
DISSERTATION COMMITTEE
James A. Paulson, Chairman
Dalton Miller-Jones
Lynne Steinberg
Martin Zwick
Ron Narode, Graduate Studies Rep.