Chapter 4

Support code for Chapter 4: Topic Modeling

Word cloud creation

Word cloud creation requires that pytagcloud be installed (in turn, this requires pygame). Since this is not an essential part of the chapter, the code will work even if you have not installed it (naturally, the cloud image will not be generated and a warning will be printed).
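The optional-dependency behaviour described above can be sketched as follows. This is a minimal illustration, not the chapter's actual code: the helper name `create_cloud` and its arguments are hypothetical, and the image size is an arbitrary choice.

```python
import warnings

try:
    # pytagcloud is optional; it in turn requires pygame.
    import pytagcloud
except ImportError:
    pytagcloud = None


def create_cloud(output_path, words):
    """Hypothetical helper: draw a word cloud, or warn and skip.

    `words` is a list of (word, integer weight) pairs.
    Returns True if an image was written, False if pytagcloud is missing.
    """
    if pytagcloud is None:
        warnings.warn("pytagcloud is not installed; skipping word cloud")
        return False
    tags = pytagcloud.make_tags(words, maxsize=120)
    pytagcloud.create_tag_image(tags, output_path, size=(1800, 1200))
    return True
```

With this structure the rest of a script can call `create_cloud` unconditionally and carry on whether or not an image was produced.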

Wikipedia processing

You will need a lot of disk space. The download of the Wikipedia text is 11GB, and preprocessing it takes another 24GB to save it in the intermediate format that gensim uses, for a total of 35GB!
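Before starting the download, it may be worth checking that enough space is free. The sketch below uses only the standard library; the function name `has_enough_space` is hypothetical, and the two sizes are the figures quoted above, which may have grown since Wikipedia keeps changing.

```python
import shutil

DOWNLOAD_GB = 11      # compressed Wikipedia dump, as quoted above
PREPROCESSED_GB = 24  # intermediate gensim format, as quoted above


def has_enough_space(path=".", required_gb=DOWNLOAD_GB + PREPROCESSED_GB):
    """Return True if the filesystem holding `path` has `required_gb` free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


if __name__ == "__main__":
    if not has_enough_space("."):
        print("Warning: not enough free disk space for the Wikipedia data")
```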

Run the following two commands inside the data/ directory:

./download_wp.sh
./preprocess-wikidata.sh

As the filenames indicate, the first step downloads the data and the second one preprocesses it. Preprocessing can take several hours, but it is feasible to run it on a modern laptop. Once the second step is finished, you may remove the input file (data/enwiki-latest-pages-articles.xml.bz2) if you want to save disk space.

To generate the model, run the wikitopics_create.py script. The wikitopics_plot.py script then plots the most heavily discussed topic as well as the least heavily discussed one. The code is split into two steps because the first one can take a very long time; it saves its results so that you can later explore them at leisure.
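The compute-once, explore-later split can be illustrated with a generic caching helper. This is only a sketch of the pattern using pickle; the function `load_or_compute` is hypothetical, and the actual scripts may well save their results differently (for example, through gensim's own save and load methods).

```python
import os
import pickle


def load_or_compute(cache_path, compute):
    """Return the cached result at `cache_path`, computing it if absent.

    `compute` is a zero-argument callable that performs the slow step.
    """
    if os.path.exists(cache_path):
        # Fast path: reuse the result saved by an earlier run.
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result
```

A plotting script can then call the same helper with the same path and get the saved result back without repeating the slow step.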

You should not expect that your results will exactly match the results in the book, for two reasons:

  1. The LDA algorithm is probabilistic and can give different results every time it is run.
  2. Wikipedia keeps changing. Thus, even your input data will be different.
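The first reason can be illustrated with a toy example. The function below is not LDA; it is a hypothetical stand-in for a randomized initialization step, showing that unseeded runs may differ while runs with a fixed seed repeat exactly. Fixing a seed addresses only the first reason; it does nothing about changes in the input data.

```python
import random


def random_topic_assignment(n_words, n_topics, seed=None):
    """Toy stand-in for a random initialization: one topic index per word."""
    rng = random.Random(seed)  # seed=None draws a fresh seed from the OS
    return [rng.randrange(n_topics) for _ in range(n_words)]


# Two runs with the same seed give identical assignments.
run_a = random_topic_assignment(1000, 10, seed=42)
run_b = random_topic_assignment(1000, 10, seed=42)
```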

Scripts

blei_lda.py
Computes LDA using the AP Corpus.