Brill Tagging on the Micron Automata Processor

Zhou, Qi, Systems Engineering - School of Engineering and Applied Science, University of Virginia
Brown, Donald, Department of Systems and Information Engineering, University of Virginia

Natural Language Processing (NLP) is growing in importance because it enables human-machine interaction, the extraction of insights from text documents and unstructured data, machine translation, and more. The NLP pipeline involves many tasks. Part-of-speech (POS) tagging is one of them: it assigns a tag, such as noun, verb, adjective, or adverb, to each input token. Various tagging techniques have been developed to accomplish this task. Brill tagging is a classic rule-based algorithm for POS tagging. However, the traditional CPU implementation of the tagger is inherently slow. In this work, we take advantage of existing computer hardware as well as the Micron Automata Processor (AP), a new computing architecture that can perform massive pattern matching in parallel, and implement the second stage of Brill tagging as template matching. The direct implementation is tested on a subset of the Brown Corpus using 218 contextual rules. The results show a significant speed-up for the second-stage tagger. To illustrate the general utility of hardware acceleration for other NLP tasks, the 218 contextual rules are then converted into regular expressions (Regex), which are more widely used in NLP, and compared as single-threaded and multi-threaded versions on a CPU, the Xeon Phi, and the AP. The results show a promising performance improvement from using the AP as a Regex accelerator. This work serves as a guide to using different accelerators for various computational linguistic tasks, particularly those that involve rule-based or pattern-matching approaches, as well as Regex matching.
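As a rough illustration of how a Brill contextual rule can be expressed as a regular expression, the Python sketch below rewrites one tag based on the tag of the preceding token. It makes several assumptions that do not come from the thesis: the rule ("change NN to VB when the previous tag is TO") is hypothetical and not one of the 218 rules, the `word/TAG` text encoding is chosen only for readability, and the helper names `rule_to_regex` and `apply_rule` are invented for this example. It runs on an ordinary CPU regex engine and says nothing about how the patterns are compiled for the AP.

```python
import re


def rule_to_regex(from_tag, prev_tag):
    """Build a pattern for 'a token tagged from_tag preceded by a token tagged prev_tag'.

    Group 1 captures everything up to the tag that will be rewritten.
    The input encoding (space-separated word/TAG tokens) is an assumption.
    """
    return re.compile(
        r"(\S+/%s \S+/)%s(?= |$)" % (re.escape(prev_tag), re.escape(from_tag))
    )


def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Apply one hypothetical contextual rule to a tagged sentence."""
    pattern = rule_to_regex(from_tag, prev_tag)
    # Keep the captured context and swap only the final tag.
    return pattern.sub(lambda m: m.group(1) + to_tag, tagged)


# Output of a (hypothetical) first-stage tagger, which tagged "run" as a noun.
tagged = "to/TO run/NN quickly/RB"
result = apply_rule(tagged, "NN", "VB", "TO")
print(result)
```

Because each rule becomes an independent pattern, a full rule set can in principle be matched against the same input stream in parallel, which is the property that pattern-matching accelerators are designed to exploit.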

MS (Master of Science)
Brill tagging, POS tagging, the Automata Processor, hardware accelerators, regular expressions, natural language processing
All rights reserved (no additional license for public reuse)
Issued Date: