Assessing and Improving Critical Properties of Test Oracles for Effective Software Bug Detection

Author:
Hossain, Soneya Binta, Computer Science - School of Engineering and Applied Science, University of Virginia (ORCID: orcid.org/0000-0002-7282-061X)
Advisor:
Dwyer, Matthew, EN-Comp Science Dept, University of Virginia
Abstract:

Software is everywhere, shaping nearly every aspect of modern life—from the systems we rely on for communication and transportation to those that power healthcare and finance. However, this heavy reliance on software introduces significant risks. Software bugs are more than just technical inconveniences; they can be extremely costly and lead to serious consequences. Beyond system outages, security breaches, and financial losses, software bugs can jeopardize lives.

Such failures often occur when testing—the first and often most critical line of defense against software bugs—is performed inadequately or ineffectively. Effective testing and bug-detection mechanisms are essential for ensuring software reliability, security, and overall quality. At the core of this process are test oracles—mechanisms that determine whether a software system produces the correct output for a given input. As the ground truth for expected behavior, test oracles are not merely supportive components; they are the ultimate arbiters of pass or fail outcomes. Consequently, they play a pivotal role in the overall effectiveness of software testing.
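To make the role of a test oracle concrete, consider this minimal illustrative sketch (not drawn from the dissertation; the function and values are invented for this example). The test input exercises the system under test, and the assertion is the oracle that decides pass or fail:

```python
# Illustrative only: a hypothetical system under test and a test oracle.

def normalize_phone(raw: str) -> str:
    """System under test: strip formatting from a US phone number."""
    return "".join(ch for ch in raw if ch.isdigit())

def test_normalize_phone():
    # Test input: exercises the behavior under test.
    result = normalize_phone("(434) 924-0311")
    # Test oracle: encodes the expected behavior and decides pass/fail.
    assert result == "4349240311"

test_normalize_phone()
```

Here the assertion serves as the ground truth for expected behavior; if the implementation deviates on this input, the oracle flags the test as failing.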

While effective testing also relies on diverse and high-quality test inputs, this area has seen substantial progress. Many automated techniques have been developed to generate inputs that exercise complex program behaviors and achieve high code coverage. In contrast, the automatic generation of effective test oracles remains both more complex and relatively underexplored. Given an input to a system, determining whether the resulting output is correct remains a longstanding challenge in automated software testing, referred to as the test oracle problem.

The primary goal of this dissertation is to address the test oracle problem. Drawing initial inspiration from the PIE fault model, the dissertation argues that solving this problem requires a test suite to include adequate oracles that are also effective at detecting bugs. Oracle adequacy refers to the extent to which oracles observe and check program behavior, while oracle effectiveness depends on their correctness—alignment with the specification—and their strength in detecting behavioral deviations. Accordingly, this dissertation focuses on assessing and improving three critical properties of test oracles: the ability to check relevant behavior, correctness with respect to the specification, and strength in detecting deviations from expected behavior. Collectively, these are referred to as the CCS properties.
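The distinction between the checked and strong dimensions can be sketched with a small invented example (the functions and the injected bug are hypothetical, not taken from the dissertation): two oracles may both check a behavior, yet differ in how many deviations they can actually detect.

```python
# Invented example contrasting oracle strength.

def absolute_value(x: int) -> int:
    """Correct implementation (specification: |x|)."""
    return x if x >= 0 else -x

def absolute_value_buggy(x: int) -> int:
    """Buggy variant: off by one for negative inputs."""
    return x if x >= 0 else -x + 1

def weak_oracle(result: int) -> bool:
    # Checked but weak: only confirms non-negativity, so many
    # behavioral deviations slip through undetected.
    return result >= 0

def strong_oracle(result: int, x: int) -> bool:
    # Strong: pins down the exact value the specification requires,
    # so any deviation on this input is detected.
    return result == (x if x >= 0 else -x)

# Both oracles accept the correct implementation...
assert weak_oracle(absolute_value(-5))
assert strong_oracle(absolute_value(-5), -5)
# ...but only the strong oracle exposes the injected bug.
assert weak_oracle(absolute_value_buggy(-5))            # bug undetected
assert not strong_oracle(absolute_value_buggy(-5), -5)  # bug detected
```

In this sketch both oracles "check" the return value, but only the strong oracle is effective at detecting the behavioral deviation, which is the gap the CCS framing makes explicit.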

To achieve this goal, this dissertation presents OracleGuru—a framework of methods and datasets that automatically assess and improve the CCS properties of test oracles. OracleGuru integrates program analysis techniques with generative artificial intelligence to evaluate oracle adequacy through the checked dimension, identify under-checked behaviors, and enhance adequacy by synthesizing additional test oracles that are both correct with respect to the specification and strong enough to effectively expose bugs.

OracleGuru is extensively evaluated on large-scale, real-world software systems and on two curated benchmarks—OracleEval25 and SF110. The results show that OracleGuru both measures and mitigates gaps in the checked dimension, substantially improving oracle adequacy across test suites. Its generation module, TOGLL, synthesizes oracles that are both correct and strong, achieving state-of-the-art bug-detection performance. In addition, Doc2OracLL analyzes software artifacts—specifically documentation—to (i) pinpoint the components that most aid oracle generation, (ii) identify helpful and harmful aspects of documentation, and (iii) offer actionable guidelines for writing documentation that better supports test oracle generation.

Collectively, OracleGuru provides foundational insights, methods, and artifacts for automatically assessing and improving test oracle quality at scale. By rigorously analyzing the CCS properties and uniting program analysis with generative AI, this work advances the solution to the test oracle problem—the central goal of this dissertation—making automated testing more trustworthy, scalable, and effective.

Degree:
PHD (Doctor of Philosophy)
Keywords:
Software Testing, Test Oracle Generation, Test Oracle Problem, CCS Properties (Checked, Correct, Strong), Bug Detection, Program Analysis, Generative AI
Sponsoring Agency:
DARPA; The Air Force Office of Scientific Research; Lockheed Martin Advanced Technology Laboratories; National Science Foundation
Language:
English
Issued Date:
2025/07/05