NCES research program on efficient multilevel modeling of NAEP data



This program has run formally since 2003, though related work began earlier in two projects: one during my 2000-2002 appointment as Chief Statistician at ESSI, the other in 2002-2003.

Background to NAEP analyses

The analyses currently used for NAEP publications rely on a set of tools for the psychometric models, for the survey designs, and for imputing student ability from very large regression models. For the national math surveys, these can be briefly described as follows:

Survey designs

The sampling of students taking the test in the early surveys was three-stage cluster sampling, of national PSUs, then schools within PSUs, then students within schools. In later surveys the national sample comprised a set of state samples, in which the design was two-stage sampling, of schools within state and students within schools. The school sampling was stratified with oversampling of minority schools, and the student sampling oversampled minority students.

Adjustment of estimates for the survey design is by jackknifing of PSUs and reweighting for the oversampling of minorities. No adjustment is made for the school sampling in the national surveys, other than including school fixed effects in the very large regression model described below.
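As an illustration only, the idea of jackknifing PSUs can be sketched as a generic delete-one-PSU jackknife for a weighted mean. This is a minimal sketch and not the operational NAEP procedure, whose replication scheme may differ; the record layout (PSU identifier, sampling weight, score) and the function names are assumptions made for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical record layout: (psu_id, sampling_weight, score)
Record = Tuple[str, float, float]


def weighted_mean(records: List[Record]) -> float:
    """Weighted mean of the scores; the weights undo the oversampling."""
    total_weight = sum(w for _, w, _ in records)
    return sum(w * y for _, w, y in records) / total_weight


def jackknife_se(
    records: List[Record],
    estimator: Callable[[List[Record]], float] = weighted_mean,
) -> Tuple[float, float]:
    """Return (estimate, delete-one-PSU jackknife standard error)."""
    psus = sorted({psu for psu, _, _ in records})
    g = len(psus)
    if g < 2:
        raise ValueError("at least two PSUs are needed")
    full_estimate = estimator(records)
    # Recompute the estimate with each PSU left out in turn.
    replicates = [
        estimator([r for r in records if r[0] != psu]) for psu in psus
    ]
    replicate_mean = sum(replicates) / g
    variance = (g - 1) / g * sum((t - replicate_mean) ** 2 for t in replicates)
    return full_estimate, variance ** 0.5
```

Because whole PSUs are dropped, the resulting standard error reflects PSU-level clustering only; it does not by itself capture clustering of students within schools inside a PSU.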

Psychometric models

The 2PL, 3PL and ordered categorical response models are used. The item parameters for the models are obtained by fitting the item responses by a null (no explanatory variables) item model, ignoring the survey design and the variables recorded in the survey. The estimated item parameters are then held fixed, and the item responses are regressed on the latent student abilities using a very large "conditioning" regression model with ~200 principal variables of about 1000 survey explanatory variables, including the school fixed effects from the survey design, and main effects and some two-level interactions of a large number of important survey variables. The latent student abilities are then multiply imputed by generating five values from the posterior distribution of the student abilities, given their item responses and explanatory variable values, and the item and regression model parameter estimates.
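For readers unfamiliar with these models, the 2PL and 3PL item response functions can be sketched as below. This is the textbook logistic form, given as an illustration; the parameter names (a for discrimination, b for difficulty, c for guessing) are conventional but are assumptions here, and any scaling constant that a particular implementation may apply to the logit is omitted.

```python
import math


def item_response_prob(theta: float, a: float, b: float, c: float = 0.0) -> float:
    """Probability of a correct response for a student of ability theta.

    a: discrimination, b: difficulty, c: lower asymptote ("guessing").
    With c = 0 this is the 2PL model; with 0 < c < 1 it is the 3PL model.
    """
    if not 0.0 <= c < 1.0:
        raise ValueError("guessing parameter c must lie in [0, 1)")
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic
```

In the 2PL model a student whose ability equals the item difficulty answers correctly with probability one half; in the 3PL model the guessing parameter raises the floor of the curve, so that even very low-ability students succeed with probability at least c.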

Group differences in ability, and other tabulations by important variables, are then computed five times, once for each of the five plausible values, and the results are combined by the standard rules for multiple imputation to give a single analysis for each variable, or for each pair of variables in cross-tabulations.
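The standard combining rules for multiple imputation (Rubin's rules) can be sketched as follows for a single scalar estimate. The function name and inputs are assumptions for illustration: each plausible value yields one point estimate and one squared standard error, and the five pairs are pooled.

```python
from statistics import mean, variance
from typing import Sequence, Tuple


def combine_plausible_values(
    estimates: Sequence[float], variances: Sequence[float]
) -> Tuple[float, float]:
    """Pool m estimates and their sampling variances; return (estimate, SE)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching sequences of length at least 2")
    pooled_estimate = mean(estimates)   # average of the m point estimates
    within = mean(variances)            # average within-imputation variance
    between = variance(estimates)       # between-imputation variance (divisor m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled_estimate, total ** 0.5
```

The between-imputation term is what carries the uncertainty about the latent abilities into the final standard error; using a single plausible value would omit it.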

Summary

The current NAEP analysis methods are based on the psychometric models developed in the 1980s (originally by Bock and Aitkin 1981) and take no account of the school sampling in the survey design. The imputation of abilities reflects the limitations of 1980s computer power, in not being able to handle simultaneously the item parameters and explanatory variables in a single model. The adjustment to standard errors for the PSU design effect (in the early surveys) does not account for the much larger design effect resulting from the school sampling, and neither does the inclusion of school fixed effects in the conditioning model.

Philosophy of our NAEP research program

The aim of our program is to develop a unified high-level efficient statistical modeling analysis system for NAEP data, which will enable a detailed and rich analysis of NAEP surveys, by

Program history

Past projects

The two projects preceding our NAEP program are:

1) Imputation and Data Quality (June 2002, M. Aitkin and Y.-Y. Shieh)

This project at ESSI examined maximum likelihood (ML) estimation for simple linear regression models with missing covariate data. The ML estimates and the observed-data information matrix, from which the standard errors follow, were computed using the EM algorithm for incomplete data. The ML estimates were nearly unbiased and had smaller (sometimes much smaller) mean square errors than the complete case estimates.

The importance of this project was that the additional information in the incomplete cases could be obtained relatively easily (in this simple model), and the standard errors resulting were uniformly smaller than those for the complete case estimates.

2) Standard Errors from the Information Matrix with Missing Covariate Data (September 2003, M. Aitkin and T. Chadwick)

This project extended the approach above to two-variable regression models using the EM algorithm, and compared parameter estimates and standard errors with those from multiple imputation (MI). The MI and ML estimates required assuming a joint normal distribution for the covariates; biases and standard errors for the MI estimates were slightly larger than those for the ML estimates. If the covariate distribution was binary rather than normal, the parameter estimates were almost unaffected, but the information matrix gave serious biases in the standard errors.

The importance of this project was that it suggested a general method for standard errors for parameter estimates in models with missing covariate data, by computing the information matrix using the actual (empirical) covariate distribution, rather than a multivariate normal distribution, for the contributions of the incomplete observations to this matrix.

Projects completed under the NAEP program (and their main contributions)

Identification of Ability Distributions in IRT Models for NAEP Items (August 2004, Aitkin and Aitkin)

This project began the NAEP series. It set out the generalized linear model framework for IRT models, and its extension to multilevel models for clustered survey designs.

The importance of this project was that it showed that the estimates of upper-(individual) level parameters by Gaussian quadrature, used currently in the NAEP analysis for 2PL and other models, were very robust to various degrees of non-normality of the ability distribution, and that more complex semi-nonparametric and fully nonparametric forms of estimation did not improve the upper-level parameter estimates, and were much more computer-intensive.
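To make the role of Gaussian quadrature concrete, the sketch below approximates the marginal probability of a response pattern under a 2PL model with a standard normal ability distribution. It is a toy illustration under stated assumptions: it uses the 3-point Gauss-Hermite rule (nodes 0 and plus or minus the square root of 3, with weights 2/3 and 1/6), whereas practical analyses use many more points, and the item parameters are made up.

```python
import math
from typing import Sequence, Tuple

# 3-point Gauss-Hermite rule, rescaled for a standard normal ability distribution.
NODES = (-math.sqrt(3.0), 0.0, math.sqrt(3.0))
WEIGHTS = (1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0)


def prob_correct(theta: float, a: float, b: float) -> float:
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def marginal_pattern_prob(
    responses: Sequence[int], items: Sequence[Tuple[float, float]]
) -> float:
    """Approximate P(responses), integrating ability out by quadrature.

    responses: 0/1 item scores; items: (a, b) pairs, one per response.
    """
    total = 0.0
    for node, weight in zip(NODES, WEIGHTS):
        likelihood = 1.0
        for y, (a, b) in zip(responses, items):
            p = prob_correct(node, a, b)
            likelihood *= p if y == 1 else 1.0 - p
        total += weight * likelihood
    return total
```

The integral over ability is replaced by a weighted sum over a few fixed ability values, so the likelihood can be maximized over the item and regression parameters without any student-level ability estimates.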

You can download the report here

Comparison of Direct Estimation with the Conditioning Model and Plausible Value Imputation (September 2004, Aitkin and Aitkin)

This project established that in simulations, direct maximum likelihood estimation of both item parameters and reporting group parameters was uniformly superior to the current method, of estimating item parameters first, generating five plausible values of ability from the item model, fitting a regression of each plausible value on the reporting group variables, and finally combining the five sets of reporting group parameter estimates.

The importance of this project was that it established the superiority of direct maximum likelihood estimation of regression model parameters in IRT models over the current indirect methods of analysis.

You can download the report here

Multi-level Model Analysis of the Knowledge and Skills Scale of the NAEP 1986 Math Data (September 2005, Aitkin and Aitkin)

This project established that the four-level maximum likelihood analysis of a (30-item) scale of the 1986 NAEP math data using the 2PL model for all items was computationally feasible, and that this analysis allowed properly for the survey design (requiring two extra model levels), giving correct and efficient standard errors. The Gllamm program in Stata was used; it was effective but very slow, and alternative efficient methods were clearly needed for routine maximum likelihood analysis of such surveys.

The importance of this project was that it established, for the first time, that full multi-level model-based maximum likelihood analysis was feasible for NAEP-scale data.

You can download the report here

Percentile Estimation for the Ability Distribution in Item Response Models (October 2005, Aitkin and Aitkin)

This project examined a range of probability models for the ability distribution. It established that for reliable inference about percentiles of this distribution, explicit and detailed parametric modeling of it was essential: reliance on the normal distribution was unsound, and nonparametric estimation of the ability distribution was ineffective.

The importance of this project was that NAEP reports percentiles of the ability distribution by major reporting group variables assuming a normal ability distribution. The reported percentiles depend strongly on the distributional assumption for ability, so this needs to be checked.

You can download the report here

Comparison of Joint and Separate Estimation Analyses of the Knowledge and Skills Scale of the NAEP 1986 Math Data (October 2005, Aitkin and Aitkin)

This project compared the full multi-level approach to the analysis of NAEP data, used in the September 2005 project, with the "separate estimation" approach, of first fitting the null item model, then holding the item parameters fixed and estimating the reporting group regression model. The reporting group estimates from the separate estimation approach were good approximations to the full ML estimates, provided that the full four-level survey design was used in the null item model estimation, but their standard errors were seriously underestimated. Separate estimation gave a negligible saving in computation time.

The importance of this project was that it showed that it is essential to allow for the survey design in any analysis of NAEP data, and that the separate estimation approach seriously underestimated the standard errors of the regression parameter estimates.

You can download the report here

Investigation of the Ability Distribution in the NAEP 1986 Math Survey (June 2006, Aitkin and Aitkin)

This project examined the effect of varying the distributional model for student ability on the reporting group parameter estimates and standard errors for the 1986 NAEP math scale. Several parametric distributions (including extreme cases) and the nonparametric maximum likelihood estimate (for the two-level model) were used instead of the normal distribution; these resulted in only small changes in parameter estimates, and very small changes in standard errors. We concluded that the normal distribution and Gaussian quadrature provide a robust analysis for the estimation of reporting group parameters which is almost unaffected by the actual form of the ability distribution. This stability did not however apply to percentiles of the ability distribution, as we found in October 2005.

The importance of this project was that it established the robustness of the ML estimation approach to regression coefficient estimation in real NAEP data against variation in the form of the ability distribution. (This robustness did not however extend to percentiles of the ability distribution, as reported above.)

You can download the report here

Investigation of the Identifiability of the 3PL Model in the NAEP 1986 Math Survey (November 2006, Aitkin and Aitkin)

This project found that the 3PL model could be successfully fitted and identified with dense item data - 5 2PL items and 5 3PL items on 1000 subjects. Fitting the 2PL model to all 10 items in the simulations gave biased estimates of reporting group parameters; smaller but still serious biases resulted from fitting 3PL models with incorrectly specified guessing parameters. However, the 3PL model could not be identified (using Gllamm in Stata) for the small numbers of items taken by each of the 10,000 students on the 30-item 1986 NAEP math scale.

The importance of this project was that it raised a serious problem - that not fitting the 3PL model when it was known to be necessary could lead to serious bias in regression model parameter estimates, but the 3PL model might itself be unidentifiable.

You can download the report here

Efficient Maximum Likelihood Estimation in Large-scale Multilevel Models (December 2006, Aitkin and Aitkin)

This project examined the facilities of current packages and algorithms for NAEP-scale maximum likelihood multi-level modeling, to see whether they could be extended or adapted for fitting psychometric models with a multi-level structure. The report listed the features a suitable package would need, and the sequence of developments required to achieve a suitable analysis system.

You can download the report here

Investigation of Alternative Models for Guessing (September 2007, Aitkin and Aitkin)

This project examined polynomial models in the latent ability and latent class models for guessing, or "engagement", as alternatives to the 3PL model. The 3PL model was found to be identifiable on the NAEP 1986 math data, but it could be only item-specific, and not person-specific, unlike the latent class guessing model. The different models were compared for goodness of fit.

The importance of this project was that it showed i) that the 3PL model could be identified with existing packages without strong priors, and ii) that the latent class model for engagement was a serious competitor to the 3PL model for guessing.

You can download the report here

Multidimensional Abilities (April 2008, Aitkin and Aitkin)

This project examined differences in reporting group estimates resulting from full multidimensional ability estimation, separate estimation of each scale and the combination of estimates, and the estimation of a single overall ability scale, for both simulated data from a small model and 79 items in three scales in the 2005 NAEP math data.

The importance of this project was that it showed that estimation of inter-factor correlations in a joint estimation of all factors was not necessary to obtain good parameter estimates and standard errors, but it was necessary to fit the multiple-factor model: fitting a single factor model could give biased estimates of the reporting group parameters.

You can download the report here

Model Identification of Student, School and Interaction Factors Affecting Item Responses on the 2005 NAEP Math Test (February 2009, Aitkin and Aitkin)

This project used the Latent Gold package to develop multilevel latent class models for math achievement on the 2005 NAEP math data. Analyses were carried out on the California and Texas state samples with a selection of school and teacher variables available in this survey, as well as the available reporting group variables.

The importance of this project was that it established that the latent class model for guessing was superior to the 3PL model for both state samples. The latent class model simultaneously identified the important variables for membership in the "guessing" class, and adjusted the group difference estimates for guessing.


Last modified: Tue Oct 25 2011