So Happy Together? Combining Rasch and Item Response Theory Model Estimates with Support Vector Machines to Detect Test Fraud

Thomas, Sarah, Psychology - Graduate School of Arts and Sciences, University of Virginia
Schmidt, Karen, Arts & Sciences Graduate, University of Virginia

Each year, thousands of people must pass standardized exams to be certified in medical, legal, clinical, and technological fields. Unfortunately, the growing number of examinees taking such tests appears to have been accompanied by more reported cheating and by increasingly sophisticated cheating techniques. Stealing and sharing proprietary test content, as when items are recorded and then circulated, carries legal consequences for the culprit and can also damage the integrity, reputation, and budget of a testing program. Compromised items therefore represent a significant threat to the integrity of testing. The purpose of the present study was to detect items suspected of being compromised by combining Rasch and Item Response Theory (IRT) model estimates, along with other item properties such as average response times and local dependence estimates, with Support Vector Machines (SVMs). We applied this combination of methods to detect items of an international healthcare certification exam (N = 13,584) that were suspected to be compromised in screenshots or notes. The results showed that the method was somewhat accurate at classifying suspected compromised and suspected uncompromised items, but the main factor driving these results appeared to be the relative size of the two classes: the SVMs were biased toward predicting items into whichever class was larger. Thus, the accuracy of the SVMs, once balanced for class size, was much lower than desired. We hypothesized that the most important item features would be Rasch item infit and outfit, but the feature results showed that the two most important features were instead the Rasch model standard error and the Rasch item difficulty.
We also hypothesized that the Rasch estimates would outperform the 3PL estimates, but the evidence for this hypothesis was mixed: some Rasch estimates outperformed the 3PL estimates, while others performed worse. Our final hypothesis was that the 3PL discrimination would outperform the 3PL lower limit; the evidence here was also somewhat mixed, though discrimination did outperform the lower limit in the majority of the models. However, the feature-weight results of the current study should be interpreted cautiously for several reasons. Feature weights indicate how important each feature was to the classifications the models made, and in the current study those classifications appeared to be biased by class size and were often inaccurate. The feature weights may therefore contribute very little to understanding which item properties actually distinguish suspected uncompromised from suspected compromised items. By combining Rasch and Item Response Theory model estimates with Support Vector Machines, the current study brings together the disparate fields of psychometrics and machine learning and points to a promising future for the resulting hybrid method.
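The workflow described above can be illustrated with a brief sketch. This is not the authors' code: the data are simulated, and the feature columns (Rasch difficulty, Rasch standard error, infit, outfit, mean response time) merely mimic the kinds of item properties the study used. It shows, with scikit-learn, why raw accuracy can look acceptable under class imbalance while balanced accuracy does not, and how `class_weight="balanced"` counteracts the majority-class bias noted in the abstract.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, balanced_accuracy_score

rng = np.random.default_rng(0)

# Simulated item pool: roughly 90% suspected uncompromised (0),
# 10% suspected compromised (1) -- an imbalanced two-class problem.
n_items = 500
y = (rng.random(n_items) < 0.10).astype(int)

# Hypothetical psychometric features per item (columns):
# difficulty, standard error, infit, outfit, mean response time (s).
# Compromised items are made slightly "easier" and faster on average.
X = np.column_stack([
    rng.normal(0.0, 1.0, n_items) - 0.5 * y,   # Rasch difficulty
    rng.normal(0.10, 0.02, n_items),           # Rasch standard error
    rng.normal(1.0, 0.10, n_items),            # infit
    rng.normal(1.0, 0.15, n_items),            # outfit
    rng.normal(60.0, 15.0, n_items) - 8.0 * y, # mean response time
])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# class_weight="balanced" reweights the SVM's loss so the minority
# (compromised) class is not swamped by the majority class.
clf = make_pipeline(StandardScaler(),
                    SVC(kernel="linear", class_weight="balanced"))
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Raw accuracy rewards always guessing the majority class;
# balanced accuracy averages recall over both classes instead.
print(f"raw accuracy:      {accuracy_score(y_te, pred):.2f}")
print(f"balanced accuracy: {balanced_accuracy_score(y_te, pred):.2f}")

# For a linear kernel, coefficient magnitudes give a rough
# feature-importance ranking (the "feature weights" discussed above).
weights = np.abs(clf.named_steps["svc"].coef_.ravel())
print("feature weight ranking (0=difficulty ... 4=response time):",
      np.argsort(weights)[::-1])
```

The same caveat from the study applies to the sketch: when the classifier's predictions are driven mostly by class size, the coefficient ranking says little about which item properties genuinely separate the classes.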

PHD (Doctor of Philosophy)
psychometrics, cheating, machine learning
All rights reserved (no additional license for public reuse)