Abstract
Binary decision-making occurs in many areas of science and policy, including medicine (tumor present or absent), forensics (identification or exclusion), finance (good or bad credit risk), and agriculture (healthy or diseased plant). Lab or field studies may be conducted to assess the error rates of such binary decision-making processes (e.g., proficiency tests for radiologists or latent print examiners). In such tests, the true outcome, or "ground truth," is known (e.g., the latent print and file print did or did not come from the same source), but in practice three responses are allowed: "same," "different," and "inconclusive." Error rates reported from such studies prove to be inconsistent: some studies count inconclusive decisions as "correct," some count them as "incorrect," and others ignore inconclusive decisions entirely. These three reporting options yield unreliable and inconsistent estimates.
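To illustrate how far the three reporting conventions can diverge, the following minimal sketch computes an error rate from the same set of decisions under each convention. The function name `error_rates` and the counts are hypothetical, chosen only for illustration; they are not data from any study.

```python
def error_rates(n_correct, n_incorrect, n_inconclusive):
    """Error rate under three conventions for handling inconclusive decisions."""
    total = n_correct + n_incorrect + n_inconclusive
    conclusive = n_correct + n_incorrect
    if conclusive == 0:
        raise ValueError("at least one conclusive decision is required")
    return {
        # Inconclusives counted as correct: only wrong conclusions are errors.
        "inconclusive_as_correct": n_incorrect / total,
        # Inconclusives counted as incorrect: they add to the error count.
        "inconclusive_as_incorrect": (n_incorrect + n_inconclusive) / total,
        # Inconclusives ignored: they are dropped from the denominator.
        "inconclusive_ignored": n_incorrect / conclusive,
    }


# Hypothetical counts for illustration only.
rates = error_rates(n_correct=850, n_incorrect=30, n_inconclusive=120)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts, the reported error rate ranges from 3% to 15% depending solely on how inconclusive decisions are treated.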
In this dissertation, we first discuss a standardization method for comparing error rates consistently across studies that differ in the proportion of inconclusive decisions and in the quality of the prints included, assuming that inconclusive decisions are treated the same way in every study when error rates are calculated.
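As a rough illustration of the idea of standardization, the sketch below applies classical direct standardization: stratum-specific error rates from each study are reweighted to a shared reference mix of print quality. This is an assumption about the general technique, not necessarily the exact procedure developed in the dissertation; the strata names, rates, and weights are hypothetical.

```python
def standardized_rate(stratum_rates, reference_weights):
    """Directly standardized rate: stratum rates weighted by a shared reference mix."""
    if set(stratum_rates) != set(reference_weights):
        raise ValueError("strata in rates and weights must match")
    total_weight = sum(reference_weights.values())
    weighted = sum(stratum_rates[s] * reference_weights[s] for s in stratum_rates)
    return weighted / total_weight


# Hypothetical error rates by print quality for two studies.
study_a = {"good": 0.01, "fair": 0.04, "poor": 0.10}
study_b = {"good": 0.02, "fair": 0.04, "poor": 0.08}

# Hypothetical reference distribution of print quality shared by both studies.
reference = {"good": 0.5, "fair": 0.3, "poor": 0.2}

rate_a = standardized_rate(study_a, reference)
rate_b = standardized_rate(study_b, reference)
print(f"Study A standardized error rate: {rate_a:.3f}")
print(f"Study B standardized error rate: {rate_b:.3f}")
```

Because both studies are evaluated against the same reference mix, differences in the standardized rates no longer reflect differences in how many poor-quality prints each study happened to include.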
We then develop a group-level Item Response Theory (IRT) model to describe examiners' decision-making process. Previous models assume that inconclusive decisions occur in a separate step from conclusive decisions and use a separate parameter for each examiner's proficiency, which makes them complex. Our model assumes that examiners fall into three experience-level classes (Novice, Intermediate, and Expert) and that the participants' proficiency levels within any one class arise from a distribution with its own mean and variance, which broadens the model's applicability and reduces the number of parameters to be estimated. The model is also consistent with the examiners' workflow, in that it allows examiners to make conclusive and inconclusive decisions simultaneously, as they do in practice (via nominal Item Response Theory modeling).
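For orientation, one common way to write a nominal response model with group-level proficiency is shown below. This is the standard Bock-style form given as an illustrative assumption; the notation (examiner $i$, comparison $j$, class $g(i)$, and parameters $a_{jk}$, $c_{jk}$, $\mu_g$, $\sigma_g^2$) is introduced here and may differ from the parameterization used in the dissertation.

```latex
% Illustrative nominal response model (assumed notation, not the author's):
% all three response categories are modeled in a single step.
\[
  P(Y_{ij} = k \mid \theta_i)
  = \frac{\exp\left(a_{jk}\,\theta_i + c_{jk}\right)}
         {\sum_{h \in \mathcal{K}} \exp\left(a_{jh}\,\theta_i + c_{jh}\right)},
  \qquad
  k \in \mathcal{K} = \{\text{same},\ \text{different},\ \text{inconclusive}\},
\]
% Group-level proficiency: one mean and variance per experience class.
\[
  \theta_i \sim \mathcal{N}\left(\mu_{g(i)},\ \sigma^2_{g(i)}\right),
  \qquad
  g(i) \in \{\text{Novice},\ \text{Intermediate},\ \text{Expert}\}.
\]
```

Under this form, only a mean and a variance are estimated per experience class rather than one proficiency parameter per examiner.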
Finally, we discuss a method to incorporate inconclusive decisions into error rates consistently, without determining whether such decisions are correct or incorrect. Specifically, we report error rates as a weighted average in which the weights for inconclusive decisions are defined by the probability that an examiner will make an inconclusive decision on a comparison, given the examiner's proficiency and the comparison's difficulty (as estimated by the group-level IRT model). These methods yield comparable, reliable, and reproducible error rates that enable fair comparisons across studies, accounting for variations in test difficulty and in the frequency of inconclusive decisions. The methods can be applied both within specific procedures (e.g., comparing the accuracy of different latent print analyses or bone imaging techniques) and across procedures within a discipline (e.g., comparing the accuracy of latent print analysis to that of firearms analysis in forensic science, or comparing bone scan accuracy to brain scan accuracy in diagnostic imaging).