Topic modeling is commonly used in text mining to discover the main topics of documents through statistical analysis of the words they contain.
(Graphic concept illustration: Charles Lang, HUDK4051 Learning Analytics lecture notes @Teachers College, Columbia University)
A topic is a probability distribution over words. We can characterize a document by a list of topics, based on the vocabulary it uses and the probability of each topic. This method converts the text into word-frequency vectors, building the numeric foundation for data modeling (Lang, 2017). However, it oversimplifies the complexity of text mining because it does not take word order into consideration, making it difficult to determine the major topic when a document contains multiple topics.
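The word-frequency (bag-of-words) representation can be sketched in a few lines of Python. The mini-corpus below is a hypothetical example, not data from the original analysis:

```python
from collections import Counter

def word_frequency_vector(document, vocabulary):
    """Count how often each vocabulary word appears in a tokenized document.

    Word order is discarded, which is exactly the simplification
    described above.
    """
    counts = Counter(document)
    return [counts[word] for word in vocabulary]

# Hypothetical mini-corpus for illustration
docs = [
    "music helps students think about creativity".split(),
    "education research values creativity and music".split(),
]
vocabulary = sorted(set(word for doc in docs for word in doc))
vectors = [word_frequency_vector(doc, vocabulary) for doc in docs]
```

Each row of `vectors` is the numeric representation of one document; topic models operate on this matrix rather than on raw text.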
Latent Dirichlet Allocation (LDA) is a text mining method developed by David Blei (Computer Science Professor at Columbia University), Andrew Ng (co-founder of Coursera), and Michael Jordan (advisor of both David Blei and Andrew Ng). As an unsupervised learning method (no prior labels for the data structure), LDA is a hierarchical Bayesian model that treats "documents-topics-words" as the Bayesian chain (Blei et al., 2003).
The generative process of LDA:
- draw a topic from the document's topic distribution;
- draw a word from the topic chosen in step 1;
- repeat steps 1 and 2 until every word in the document is matched with a topic.
The major topic of a document is inferred from the "document-topic" and "topic-word" distributions. LDA can be further extended with variational Bayes, the expectation-maximization (EM) algorithm, and Gibbs sampling.
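The generative process above can be simulated directly. This is a minimal sketch using only the standard library (a Dirichlet draw can be built from normalized Gamma samples); the two topics and the toy vocabulary are invented for illustration:

```python
import random

def sample_dirichlet(alpha):
    """Draw from a Dirichlet distribution via normalized Gamma samples."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(doc_length, alpha, topic_word_dists, vocabulary):
    """LDA's generative story for a single document:
    1. draw the document's topic mixture theta ~ Dirichlet(alpha);
    2. for each word position, draw a topic z from theta,
       then draw a word w from that topic's word distribution."""
    theta = sample_dirichlet(alpha)
    topics = list(range(len(alpha)))
    words = []
    for _ in range(doc_length):
        z = random.choices(topics, weights=theta)[0]
        w = random.choices(vocabulary, weights=topic_word_dists[z])[0]
        words.append(w)
    return words

# Two hypothetical topics over a toy vocabulary
vocabulary = ["music", "guitar", "school", "teacher"]
topic_word_dists = [
    [0.6, 0.4, 0.0, 0.0],  # a "music"-like topic
    [0.0, 0.0, 0.5, 0.5],  # an "education"-like topic
]
doc = generate_document(10, [0.5, 0.5], topic_word_dists, vocabulary)
```

Inference runs this story in reverse: given only the observed words, it estimates the hidden theta and topic assignments, which is what variational Bayes or Gibbs sampling does.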
An example of LDA: Pressible
I applied LDA topic modeling to analyze data from the school blogging system Pressible. The following procedure was used:
- Construct a database for the Pressible project;
- Clean the data by removing repeated messages;
- Pre-process the data with stemming and removal of stop words;
- Set hyperparameters for the Dirichlet distributions and construct the data frames with LDA.
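The pre-processing steps (tokenizing, removing stop words, stemming) can be sketched as follows. The stop-word list, the suffix-stripping "stemmer," and the sample posts are all simplified stand-ins for illustration, not the actual Pressible pipeline:

```python
import re

# A tiny illustrative stop-word list (real pipelines use a much larger one)
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "about"}

def stem(word):
    """Very crude suffix stripping, a placeholder for e.g. Porter stemming."""
    for suffix in ("ing", "ers", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(post):
    """Lowercase, tokenize, drop stop words, and stem one blog post."""
    tokens = re.findall(r"[a-z]+", post.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

# Hypothetical posts standing in for Pressible content
posts = [
    "Thinking about creativity in music education",
    "Teachers are thinking of creative music lessons",
]
corpus = [preprocess(p) for p in posts]
```

The cleaned `corpus` (a list of token lists) is what the LDA step consumes, after which topic-word and document-topic data frames can be built.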
"Music," "creativity," "education," and "think" are the most popular topics in this sample dataset of Pressible. The numbers are the IDs of Pressible users. The size of each circle represents the engagement of the topic or user (for example, a more active user such as ID 1489 occupies a larger circle). The color represents the co-occurrence of topics and users; the IDs in blue are more likely to discuss the "education" and "think" topics.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.