Automatically Describing Program Structure and Behavior

Buse, Raymond P. L., Computer Science - School of Engineering and Applied Science, University of Virginia
Weimer, Westley, Department of Computer Science, University of Virginia

Powerful development environments, data-rich project management systems, and ubiquitous software frameworks have fundamentally altered the way software is constructed and maintained. Today, professional developers spend less than 10% of their time actually writing new code and instead primarily try to understand existing software. Increasingly, programmers work by searching for examples or other documentation and then assembling pre-constructed components. Yet code comprehension is poorly understood, and documentation is often incomplete, incorrect, or unavailable. Moreover, few tools exist to support structured code search, which makes the process of finding useful examples ad hoc, slow, and error-prone.

The overarching goal of this research is to help humans better understand software at many levels and in doing so improve development productivity and software quality. The contributions of this work are in three areas: readability, runtime behavior, and documentation.

We introduce the first readability metric for program source code. Our model for readability, which is internally based on logistic regression, was learned from data gathered from 120 study participants. Our model agrees with human annotators as much as they agree with each other. The importance of measuring code readability is highlighted by the observation that reading code is now the most time-consuming part of the most expensive activity in the software development process.
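As a rough illustration of what a logistic-regression readability model can look like, the sketch below trains a tiny classifier on hand-labeled snippets. Everything in it is an assumption made for illustration: the three surface features, the helper names (`features`, `train`, `predict`), and the four toy snippets are hypothetical and are not taken from the dissertation, whose actual feature set and training data are not described in this abstract.

```python
import math

def features(snippet):
    """Hypothetical surface features of a code snippet, scaled to roughly [0, 1]."""
    lines = [ln for ln in snippet.splitlines() if ln.strip()]
    n = max(len(lines), 1)
    avg_len = sum(len(ln) for ln in lines) / n
    max_indent = max((len(ln) - len(ln.lstrip()) for ln in lines), default=0)
    comment_ratio = sum(ln.lstrip().startswith("#") for ln in lines) / n
    return [avg_len / 100.0, max_indent / 20.0, comment_ratio]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Estimated probability that a snippet with feature vector x is readable."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def train(samples, labels, epochs=2000, lr=0.5):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            err = predict(weights, bias, x) - y  # gradient of the log loss w.r.t. z
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

# Toy training set (invented): label 1 = judged readable, 0 = judged unreadable.
readable_a = "# add two numbers\ntotal = a + b\n"
readable_b = "# greet the user\nname = 'x'\nprint(name)\n"
dense_a = "if a:\n        if b:\n                x = f(g(h(a, b), c), d) + f(g(h(b, a), d), c) * q\n"
dense_b = "while p:\n      for i in r:\n            y = k(m(n(i, p), p), i) - k(m(n(p, i), i), p) / z\n"

snippets = [readable_a, readable_b, dense_a, dense_b]
labels = [1, 1, 0, 0]
weights, bias = train([features(s) for s in snippets], labels)

for s, y in zip(snippets, labels):
    print(y, round(predict(weights, bias, features(s)), 3))
```

In a real study the labels would come from human annotators rather than being assigned by hand, and agreement between model and annotators would be measured on snippets held out from training.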

Beyond surface-level readability, understanding software demands understanding the behavior of the running program. To that end, we contribute a model describing runtime behavior in terms of program paths. Our approach is based on two key insights. First, we observe a relationship between the relative frequency of a path and its effect on program state. Second, we carefully choose the right level of abstraction for considering program paths: inter-procedural for precision, but limited to one class for scalability. Over several benchmarks, the top 5% of paths as ranked by our algorithm account for over half of program runtime.
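To make the closing measurement concrete, the sketch below computes the share of total runtime covered by the top-ranked paths. It is only an illustration of the evaluation measure, not of the ranking algorithm itself, which this abstract does not specify. The helper name `runtime_share_of_top_paths`, the `(score, runtime)` representation, and the numbers are all invented for the example.

```python
def runtime_share_of_top_paths(paths, top_percent=5):
    """Share of total runtime spent in the top `top_percent` of paths by score.

    `paths` is a list of (score, runtime) pairs, where `score` is whatever
    ranking value a path-ranking algorithm assigns (higher = ranked earlier)
    and `runtime` is the observed time attributed to that path.
    """
    ranked = sorted(paths, key=lambda p: p[0], reverse=True)
    # Integer arithmetic avoids floating-point surprises; keep at least one path.
    k = max(1, (len(ranked) * top_percent) // 100)
    total = sum(runtime for _, runtime in ranked)
    if total == 0:
        return 0.0
    return sum(runtime for _, runtime in ranked[:k]) / total

# Invented toy data: 20 paths, one of which is both highly ranked and hot.
toy_paths = [(0.9, 60.0)] + [(0.1, 2.0)] * 19
share = runtime_share_of_top_paths(toy_paths, top_percent=5)
print(f"top 5% of paths cover {share:.1%} of runtime")
```

With 20 paths, the top 5% is a single path, so the result is simply that path's runtime divided by the total.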

Finally, we describe automated approaches for improving the understandability of software through documentation synthesis. A 2005 NASA survey found that the most significant barrier to code reuse is that software is too difficult to understand or is poorly documented. As software systems grow ever larger and more complex, we believe that automated documentation tools will become indispensable. The key to our approach is adapting programming language techniques (e.g., symbolic execution) to create output that is directly comparable to existing human-written artifacts. This has two key advantages: (1) it simplifies evaluation by permitting objective comparisons to existing documentation, and (2) it enables tools to be used immediately, with no significant change to the development process. We describe and evaluate documentation synthesis algorithms for exceptions, code changes, and APIs. In each case, we ask human participants to evaluate the output of our algorithms and compare it against artifacts created by real developers, finding that our synthesized documentation is of similar quality to human-written artifacts.

Increasingly, software comprehension is a critical factor in modern development. However, software remains difficult to understand. This research seeks to improve that state by automatically describing program structure and behavior.

PHD (Doctor of Philosophy)
readability, documentation, software engineering, programming languages, program understanding
All rights reserved (no additional license for public reuse)