Contextual Representation Learning for Text Data

Author:
Xun, Guangxu, Computer Science - School of Engineering and Applied Science, University of Virginia
ORCID: orcid.org/0000-0002-7657-4305
Advisor:
Zhang, Aidong, EN-Comp Science Dept, University of Virginia
Abstract:

Nowadays, text data is being generated at an ever-increasing rate. It is prevalent in domains such as social media, newspapers, clinical notes, and online reviews. Text data contains rich information, and understanding it is important for Artificial Intelligence (AI) tasks, especially Natural Language Processing (NLP) tasks. The key to understanding text data lies in how the data is represented, as the success of NLP algorithms depends heavily on the quality of the text representations. For that reason, many conventional NLP systems rely on hand-designed preprocessing pipelines and data transformations to produce good representations of text data. Such feature engineering is useful but requires careful design and prior knowledge. It is therefore desirable to learn representations of text data automatically and lessen the degree of feature engineering in NLP systems, so that downstream NLP applications can be built faster and achieve better performance.

Context information of text data, including spatial context, temporal context, and domain context, is naturally a good source for learning text representations, because it not only carries the syntactic and semantic information that text representations require, but is also easy and convenient to collect. I will first introduce how to extract semantic and syntactic features from text data based on its spatial context, such as word-word co-occurrences and document-word co-occurrences, and how to coordinate both kinds of spatial context. Next, I will demonstrate how to learn time-aware representations based on the temporal context of text data, for example, temporal representations that capture the semantic evolution of words. Then I will show how to learn domain-specific representations of text data based on its domain context, for example, by extracting domain-related features from documents given the task domain. Extensive evaluations are also conducted and presented to demonstrate the effectiveness of the proposed contextual representation learning algorithms.
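To make the notion of spatial context concrete, the sketch below counts word-word and document-word co-occurrences from a toy corpus. The corpus, the window size, and the counting scheme are illustrative assumptions for exposition only, not the specific models developed in the dissertation.

    from collections import Counter

    # A toy corpus standing in for real text data (hypothetical example).
    corpus = [
        "the patient reports mild chest pain",
        "chest pain worsens after exercise",
        "the review praises the battery life",
    ]

    window = 2  # context window size for word-word co-occurrence (an assumption)

    word_word = Counter()  # counts of word pairs appearing within the window
    doc_word = Counter()   # counts of (document id, word) pairs

    for doc_id, doc in enumerate(corpus):
        tokens = doc.split()
        for i, w in enumerate(tokens):
            doc_word[(doc_id, w)] += 1
            # words to the right of w within the window serve as its spatial context
            for c in tokens[i + 1 : i + 1 + window]:
                word_word[(w, c)] += 1
                word_word[(c, w)] += 1

    print(word_word[("chest", "pain")])  # word-word spatial context count
    print(doc_word[(0, "pain")])         # document-word spatial context count

Matrices built from such counts are a common starting point for learning word and document representations; coordinating the two kinds of spatial context, as described above, goes beyond this simple counting step.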

Degree:
PhD (Doctor of Philosophy)
Keywords:
Representation learning, Text data, Context information, Natural language processing, Deep learning, Topic modeling
Language:
English
Rights:
All rights reserved (no additional license for public reuse)
Issued Date:
2021/01/18