Online Archive of University of Virginia Scholarship
Towards Interactive Analysis of Compressed Provenance Graphs Without Decompression
Author
Mensah-Homiah, Jason, Computer Science - School of Engineering and Applied Science, University of Virginia, ORCID: 0009-0003-6624-1034
Advisors
Hassan, Wajih, Computer Science - School of Engineering and Applied Science, University of Virginia
Abstract
Enterprise security systems generate massive volumes of log data essential for forensic analysis and cyber threat investigation. The challenge is scale. Storing these logs requires significant resources, and searching through them is painfully slow. Raw logs are discrete events that lack the contextual relationships needed for effective analysis. Provenance graphs solve this by capturing causal dependencies between system events, making it possible to trace attack paths and identify incident origins. However, provenance graphs consume even more storage than the original logs, often by orders of magnitude. This creates a critical need: a compression system for provenance graphs that preserves all forensic information while enabling direct causal analysis without decompression.
ZPGraph addresses this challenge through a novel architecture that separates graph topology from log content, enabling efficient encoding of causal relationships while maintaining provenance semantics. The system achieves lossless compression by introducing a layered structure that decouples how events connect from what those events contain. This design allows ZPGraph to apply domain-specific compression techniques that eliminate redundant schema and field content in logs without compromising the ability to query causality.
This thesis introduces visualization, analysis, and querying layers to ZPGraph that allow analysts to interact with compressed provenance graphs as if they were uncompressed. Analysts can execute forensic queries directly on compressed data, with the interaction model supporting exploration: tracing causality backward and forward through time, detecting anomalies, and reconstructing incidents.
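As an illustration only, the idea of tracing causality over a topology layer that is kept apart from log content can be sketched in a few lines of Python. This is not ZPGraph's implementation: the class `LayeredGraph`, its methods, and the sample events are hypothetical, the dictionary encoding of schemas and field values is a simple stand-in for the domain-specific compression the abstract mentions, and timestamps are ignored for brevity. The point of the sketch is that a backward or forward trace touches only the edge lists, and field content is decoded only for the events an analyst chooses to inspect.

```python
from collections import defaultdict, deque


class LayeredGraph:
    """Toy layered store (hypothetical): topology apart from encoded content."""

    def __init__(self):
        # Topology layer: adjacency lists holding only node names and event ids.
        self.out_edges = defaultdict(list)  # src -> [(dst, event_id)]
        self.in_edges = defaultdict(list)   # dst -> [(src, event_id)]
        # Content layer: each distinct schema and field value is stored once.
        self.schemas, self.schema_ids = [], {}
        self.values, self.value_ids = [], {}
        self.events = []                    # event_id -> (schema_id, [value_id])

    @staticmethod
    def _intern(item, table, index):
        """Return the id of item, appending it to table on first sight."""
        if item not in index:
            index[item] = len(table)
            table.append(item)
        return index[item]

    def add_event(self, src, dst, fields):
        """Record that src causally affects dst, with the given log fields."""
        keys = tuple(sorted(fields))
        sid = self._intern(keys, self.schemas, self.schema_ids)
        vids = [self._intern(fields[k], self.values, self.value_ids) for k in keys]
        eid = len(self.events)
        self.events.append((sid, vids))
        self.out_edges[src].append((dst, eid))
        self.in_edges[dst].append((src, eid))
        return eid

    def trace(self, start, backward=True):
        """Breadth-first causal trace that reads the topology layer only."""
        edges = self.in_edges if backward else self.out_edges
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbor, _event_id in edges.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        seen.discard(start)
        return seen

    def decode(self, event_id):
        """Rebuild one event's fields on demand from the content layer."""
        sid, vids = self.events[event_id]
        return dict(zip(self.schemas[sid], (self.values[v] for v in vids)))


# Hypothetical sample events: a shell spawns a downloader that drops a file.
g = LayeredGraph()
g.add_event("bash", "curl", {"op": "fork", "user": "alice"})
g.add_event("curl", "/tmp/payload", {"op": "write", "user": "alice"})
g.add_event("/tmp/payload", "malware", {"op": "exec", "user": "alice"})

ancestors = g.trace("malware", backward=True)   # where did this come from?
descendants = g.trace("curl", backward=False)   # what did this affect?
```

In this toy example the three events share one schema entry and four distinct field values, and neither trace reads the content tables. A real system would also need to respect event ordering in time during traversal, which this sketch leaves out.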
Evaluation demonstrates that the system delivers strong compression ratios without sacrificing the ability to execute complex forensic queries. Query performance remains consistent regardless of the original log volume, which proves essential when analyzing months or years of data from thousands of machines. The result is clear: organizations no longer face a trade-off between compression and analysis. For security teams managing large-scale log data, this approach makes comprehensive forensic investigation practically feasible.
Mensah-Homiah, Jason. Towards Interactive Analysis of Compressed Provenance Graphs Without Decompression. University of Virginia, Computer Science - School of Engineering and Applied Science, MS (Master of Science), 2025-12-04, https://doi.org/10.18130/bjhy-4t92.