Combining Human-defined Semantic Knowledge with Data-driven Learning: A Topic Modeling Approach

Mark Steyvers
University of California at Irvine, USA

        Statistical topic models provide a general data-driven framework for automatically discovering high-level knowledge in large collections of text documents. Although topic models can potentially discover a broad range of themes in a data set, the learned topics are not always easy to interpret. Human-defined concepts, in the form of content labels or word clusters, tend to be more useful, but they might not span the themes in a data set exhaustively. An important goal in modeling is therefore to combine human-defined semantic knowledge with data-driven learning approaches. In this talk, I will review the basic unsupervised topic model and a variety of extensions, such as the Labeled Latent Dirichlet Allocation model (Labeled LDA) and Concept-Topic models, that combine human background knowledge with learned topics. I will demonstrate the utility of these models on the task of annotating documents with hierarchically organized human concepts as well as learned topics.