LDaRM: A Technique for Improving Knowledge Discovery in Large Text Corpora

Author:
Loose, Davis, Systems Engineering - School of Engineering and Applied Science, University of Virginia
ORCID: orcid.org/0000-0001-9026-260X
Advisor:
Fleming, Cody, EN-Eng Sys and Environment, University of Virginia
Abstract:

Latent Dirichlet allocation -- association rule mining (LDaRM) is a methodology for uncovering latent semantic information from a set of documents. Latent Dirichlet allocation (LDA) is a form of topic modeling, a statistical learning technique that clusters and organizes keywords in a set of documents into a set of topics, which represent the underlying themes in a corpus. While powerful, LDA often produces topics that are difficult for a human to understand, and larger topic models become cumbersome for an individual to analyze.

Association rule mining (ARM) is a data mining technique for uncovering interesting patterns in data that enable a user to explore new knowledge domains or adjust behavior in response to new information. However, applying ARM directly to a text corpus would make the run time needed to find interesting patterns prohibitive.
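
To make the ARM terminology concrete, the following is a minimal sketch of the two standard rule measures, support and confidence, on a toy set of transactions. The data, the function names, and the choice to treat each document as a set of terms are assumptions made for illustration; they are not taken from the thesis. The last helper counts the non-empty itemsets over n distinct items, which is one way to see why mining directly over a large vocabulary may become expensive.

```python
# Toy transactions (hypothetical data): each set holds the distinct terms of one document.
transactions = [
    {"topic", "model", "corpus"},
    {"topic", "model", "rule"},
    {"rule", "mining", "corpus"},
    {"topic", "model", "mining"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of antecedent."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / support(antecedent, transactions)

def candidate_itemsets(n_items):
    """Number of non-empty itemsets over n_items distinct items: 2**n - 1."""
    return 2 ** n_items - 1

rule_support = support({"topic", "model"}, transactions)
rule_confidence = confidence({"topic"}, {"model"}, transactions)
```

Here the rule "topic implies model" holds in every document that contains "topic". With only five distinct terms there are already 31 candidate itemsets, and the count doubles with each additional term.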

LDaRM combines these two techniques to address the weaknesses of each. LDA reduces the dimensionality of the underlying data for ARM, while ARM improves user comprehension of topic models developed using LDA. LDaRM can uncover interesting rules that an analyst could use as a springboard to more in-depth analysis of a corpus. In particular, LDaRM is useful for contextualizing named entities: proper nouns that may have ambiguous definitions or are highly context-dependent. Finally, LDaRM is an effective tool for identifying terms that confound topics and make it difficult for individuals to interpret topic model results.
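
The abstract does not spell out how the two steps are joined, so the following is only a rough sketch of one possible arrangement, not the thesis's actual procedure. It assumes an LDA model has already produced per-document topic proportions (the `doc_topic` numbers are invented), converts each document into a small transaction of topic labels using a hypothetical threshold, and then searches for single-item rules among those labels. The helper names `to_transactions` and `pairwise_rules` and all threshold values are illustrative assumptions.

```python
from itertools import permutations

# Hypothetical document-topic proportions, as an LDA model might produce
# (rows: documents, columns: topics). The numbers are made up for illustration.
doc_topic = [
    [0.60, 0.30, 0.10],
    [0.55, 0.35, 0.10],
    [0.10, 0.20, 0.70],
    [0.50, 0.40, 0.10],
]

def to_transactions(doc_topic, threshold=0.25):
    """Keep, for each document, the topics whose proportion meets the threshold."""
    return [
        {f"topic_{k}" for k, p in enumerate(row) if p >= threshold}
        for row in doc_topic
    ]

def pairwise_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Return (antecedent, consequent, support, confidence) for single-item rules."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        both = sum(a in t and b in t for t in transactions)
        has_a = sum(a in t for t in transactions)
        if has_a == 0:
            continue
        sup, conf = both / n, both / has_a
        if sup >= min_support and conf >= min_confidence:
            rules.append((a, b, sup, conf))
    return rules

transactions = to_transactions(doc_topic)
rules = pairwise_rules(transactions)
```

In this toy case the rule search runs over three topic labels rather than the full vocabulary, which illustrates the dimensionality reduction the abstract describes. The surviving rules link `topic_0` and `topic_1` in both directions, the kind of pattern an analyst might then inspect more closely.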

Degree:
MS (Master of Science)
Keywords:
latent Dirichlet allocation, topic model, association rule mining, exploratory text analytics, text mining, data mining, stop words
Language:
English
Rights:
All rights reserved (no additional license for public reuse)
Issued Date:
2020/07/28