I grabbed all the text from Research Digest and built Latent Dirichlet Allocation (LDA) topic models with the RTextTools and topicmodels libraries in R.
It basically uses machine learning to group the text into topics. With these two packages, classifying the topics takes only a few lines of R code, but cleaning the text takes a little tweaking (since we store HTML in the database), essentially some text wrangling before the text can be processed. I wrote a script to automate the text cleaning step. (Java apps are available upon request.)
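To give a rough idea of what the cleaning step involves, here is a minimal sketch in base R. It is not the script mentioned above; `clean_html` is a hypothetical helper name, and the sample strings are made up. It only handles the most common cases (tags, script/style blocks, a few entities).

```r
# Hypothetical helper: strip HTML markup from a character vector (base R only).
clean_html <- function(x) {
  # Drop <script> and <style> blocks together with their contents.
  x <- gsub("(?is)<(script|style)[^>]*>.*?</\\1>", " ", x, perl = TRUE)
  # Drop any remaining tags.
  x <- gsub("<[^>]+>", " ", x)
  # Decode two very common entities, then drop whatever entities remain.
  x <- gsub("&nbsp;", " ", x, fixed = TRUE)
  x <- gsub("&amp;", "&", x, fixed = TRUE)
  x <- gsub("&[a-zA-Z]+;|&#[0-9]+;", " ", x)
  # Collapse runs of whitespace and trim the ends.
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

# Made-up sample rows standing in for HTML stored in the database.
raw <- c("<p>Topic <b>models</b> &amp; text&nbsp;mining</p>",
         "<div><script>var a = 1;</script>Research   Digest</div>")
cleaned <- clean_html(raw)
print(cleaned)
```

For messy real-world HTML a proper parser would probably be more robust than regular expressions; this is only meant to show the shape of the wrangling.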
library(RTextTools)   # provides create_matrix()
library(topicmodels)  # provides LDA()
matrix <- create_matrix(cbind(as.vector(rd_topic$title),as.vector(rd_topic$text)), language="english", removeNumbers=TRUE, stemWords=TRUE, weighting=weightTf)
lda <- LDA(matrix, 30)
Our top 30 topics are displayed as a word cloud.
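One way to get the word weights for such a cloud is to take the highest-probability terms of each topic. The sketch below uses a toy topic-term matrix with made-up numbers so it runs in base R; with topicmodels, a matrix of the same shape (topics in rows, terms in columns) should be available as `posterior(lda)$terms`. `top_terms` is a hypothetical helper, not part of either package.

```r
# Toy topic-term probability matrix (rows = topics, columns = stemmed terms).
# The values are invented for illustration only.
beta <- matrix(c(0.50, 0.30, 0.15, 0.05,
                 0.10, 0.10, 0.20, 0.60),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("topic1", "topic2"),
                               c("brain", "memori", "sleep", "child")))

# Hypothetical helper: the n highest-probability terms of each topic,
# returned as a list of named numeric vectors (one per topic).
top_terms <- function(beta, n = 3) {
  lapply(seq_len(nrow(beta)), function(k) {
    sort(beta[k, ], decreasing = TRUE)[seq_len(min(n, ncol(beta)))]
  })
}

weights <- top_terms(beta, n = 2)
print(weights[[1]])
```

Assuming the wordcloud package is installed, each element `w` of `weights` could then be drawn with something like `wordcloud::wordcloud(names(w), w, min.freq = 0)`; the `min.freq = 0` matters because the weights are probabilities, well below the package's default frequency cutoff.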
I remember that not long ago I tried a cloud comparison, and it seemed nearly impossible. Check out this previous blog post.