TopicMapping finds topics in a set of documents using network clustering (Infomap) as an initial guess for LDA likelihood optimization. This guide explains how to compile the program, what the input and output formats are, and how to tune the algorithm’s parameters.
TopicMapping, as well as the programs for validating the algorithms, can be downloaded from this Bitbucket repository. Alternatively, you can open a terminal and type:
hg clone https://bitbucket.org/andrealanci/topicmapping
Numerical experiments validation
Most of the code used in our numerical experiments is available from this other repository. It is provided mainly for researchers who want to reproduce our results, and it is therefore not maintained.
The documents from Web of Science can be downloaded here.
Open a terminal and type
If you are using Windows, you can run the program by installing MinGW (Minimalist GNU for Windows).
Input and Output
Let us start with an example:
./bin/topicmap -f quantum-and-granular-large-stemmed -t 10 -o test_results
-f is followed by the name of the file where the corpus is stored. This file must contain strings separated by newlines. Every string is treated as a separate document.
In the example, “quantum comput predict util . . . ” is the first document, “develop theori interlay tunnel . . . ” is the second document and so on.
The reason why the documents look a little strange is that we used a stemming algorithm and we removed stop-words. If you want to do the same thing (we recommend it), you can use the following:
python Sources/NatLangProc/stem.py [original file] [output file]
which requires the Python library called stemming.
The list of stop words is in
IMPORTANT: Please make sure that your corpus file does not contain empty lines (or lines with only whitespace).
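The check above can be automated before running TopicMapping. The following is a minimal sketch, not part of the TopicMapping distribution: the function names are hypothetical, and the corpus is assumed to be UTF-8 text with one document per line.

```python
def find_blank_lines(corpus_path):
    """Return the 1-based numbers of lines that are empty or whitespace-only."""
    blank = []
    with open(corpus_path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                blank.append(number)
    return blank


def write_clean_corpus(corpus_path, clean_path):
    """Copy the corpus without blank lines; return the number of documents kept."""
    kept = 0
    with open(corpus_path, encoding="utf-8") as src, \
            open(clean_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                dst.write(line.rstrip("\n") + "\n")
                kept += 1
    return kept
```

Note that dropping blank lines shifts the position of the documents that follow them, so the line order of the cleaned file, not of the original one, is what the output files will refer to.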
The other options in the example are explained in the next two sections.
-o specifies the directory where all the output is written: test_results in the example above.
Each document is associated with a probability distribution of topics and each topic is characterized by a distribution of words. These two distributions are written in two separate files:
lda_gammas_final.txt provides the usage of each topic for each document, from which p(topic|doc) is obtained. Every line refers to a document, in the same order as the documents appear in the corpus file. Each number is the average usage of the corresponding topic.
For instance, “0.025 0.01 58.8” means that topic 0 is used 0.025 times, topic 1 is used 0.01 times, and topic 2 is used 58.8 times, on average. To get the probabilities, just normalize this vector.
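The normalization can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the numbers on each line are assumed to be separated by whitespace.

```python
def normalize(gamma_row):
    """Turn a row of average topic usages into probabilities p(topic|doc)."""
    total = sum(gamma_row)
    if total <= 0:
        raise ValueError("row has no positive topic usage")
    return [value / total for value in gamma_row]


def read_topic_probabilities(path):
    """Read a gammas file and return one probability vector per document."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            values = [float(token) for token in line.split()]
            if values:  # skip blank lines, if any
                rows.append(normalize(values))
    return rows


# The row from the example above: almost all of the weight goes to topic 2.
example = normalize([0.025, 0.01, 58.8])
```

Calling read_topic_probabilities("test_results/lda_gammas_final.txt") would then return the distributions in corpus order.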
lda_betas_sparse_final.txt provides p(word|topic). Every line is a topic. The first number is the topic number. After that, pairs (word-id, probability) are sorted starting from the most probable. The word-ids can be mapped to the actual words from the file
Similar files such as
lda_gammas_6.txt etc., are printed every few iterations. Files
plsa_betas_sparse.txt also have the same content, obtained before running LDA optimization.
Other supporting files are:
lda_summary_final.txt gives overall information about the topics, such as their probability p(t), the total number of words which have positive probability given this topic, and their top 100 words.
word_wn_count.txt contains strings in the format “word word-id occurrences”.
infomap.part contains the (hard) partition of words found by Infomap, where words are represented with the word-id which can be found in
infomap-words.part is the same file as before, written in words.
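To inspect topics by word rather than by word-id, the files described above can be combined. The sketch below rests on assumptions that the text does not confirm: fields are separated by whitespace, word-ids are integers, and the ids in the betas files are the same ids listed in word_wn_count.txt. All function names are hypothetical.

```python
def read_word_ids(path):
    """Map word-id -> word from lines of the form 'word word-id occurrences'."""
    id_to_word = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 3:
                continue  # ignore lines that do not match the assumed format
            word, word_id, _occurrences = fields
            id_to_word[int(word_id)] = word
    return id_to_word


def parse_topic_line(line):
    """Parse 'topic id1 p1 id2 p2 ...' into (topic, [(id, p), ...])."""
    tokens = line.split()
    topic = int(tokens[0])
    rest = tokens[1:]
    pairs = [(int(rest[i]), float(rest[i + 1]))
             for i in range(0, len(rest) - 1, 2)]
    return topic, pairs


def top_words(line, id_to_word, n=10):
    """Return the topic number and its n most probable (word, probability) pairs."""
    topic, pairs = parse_topic_line(line)
    # Pairs are already sorted by probability in the file, so slicing is enough.
    # Assumption: unknown ids are shown as the id itself instead of failing.
    words = [(id_to_word.get(word_id, str(word_id)), prob)
             for word_id, prob in pairs[:n]]
    return topic, words
```

If the separators in your files differ (for example commas or parentheses around the pairs), only parse_topic_line needs to change.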
The basic way to run TopicMapping is to specify just the corpus file and the out-directory.
However, TopicMapping also has a number of options to tune the size of the topics, the execution time and more. The following is an overview of the most commonly used options. The others are listed when the program is called without arguments.
The algorithm runs without supervision; in particular, it does not require you to specify the number of topics in advance. However, two options are available to tune the granularity of the topics to some extent:
-t [threshold (integer)]: Minimum number of documents per topic. A few topics will likely be very small because of some isolated words. This option removes very small topics, namely those used by fewer documents than the threshold. The threshold is measured in number of documents: for instance, -t 10 means that each topic must be covered by at least 10 documents. 10 (default) or 100 is recommended for fairly large corpora. Documents which are entirely isolated from the others cannot be assigned to any other topic and will still belong to their own topic.
-p [p-value (float)]: Higher p-values deliver fewer and more coarse-grained topics, because the network of words is more connected. Default is 5%.
./bin/topicmap -f quantum-and-granular-large-stemmed -p 0.1 -o test_results -t 10
This does not change the topics very much. But the next example will filter out small topics, so that only two will be left.
./bin/topicmap -f quantum-and-granular-large-stemmed -o test_results -t 100
Speed vs accuracy
There are two options to set the accuracy of the algorithm. By tuning them, you can get faster or more accurate results:
-r [number of runs (integer)]: How many times you want the network clustering algorithm to run. Default is 10.
-step [interval in PLSA local optimization (float)]: Default is 0.01. For example, 0.05 can be used to get faster results. Similarly, by setting -maxf, you can narrow the filter range and make the algorithm faster.
./bin/topicmap -f quantum-and-granular-large-stemmed -r 1 -step 0.05 -o test_results
Recycling previous runs
If you would like to run the algorithm again with a different threshold, you can read the word partition saved by a previous run in the file called infomap.part. This skips the first part of the algorithm: building the network and running Infomap to find the topics. The option is:
-part [infomap.part (string)]
After running the algorithm without the option, try:
./bin/topicmap -f quantum-and-granular-large-stemmed -o test_results2 -t 100 -part test_results/infomap.part
This allows you to explore how the topics change when more topics are filtered out (-t 100), without running everything from scratch.
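Trying several thresholds against one saved partition could be scripted as follows. This is a minimal sketch that only assembles command lines from the -f, -o, -t and -part options described in this guide; the helper names and the naming scheme for the output directories (test_results_t100 and so on) are hypothetical.

```python
import subprocess


def build_commands(corpus, first_dir, thresholds, binary="./bin/topicmap"):
    """Build one topicmap command per threshold, reusing first_dir/infomap.part."""
    part_file = f"{first_dir}/infomap.part"
    commands = []
    for threshold in thresholds:
        # Hypothetical convention: one output directory per threshold.
        out_dir = f"{first_dir}_t{threshold}"
        commands.append([binary, "-f", corpus, "-o", out_dir,
                         "-t", str(threshold), "-part", part_file])
    return commands


def run_all(commands):
    """Run the commands one after another, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

For example, run_all(build_commands("quantum-and-granular-large-stemmed", "test_results", [50, 100])) would run the program twice, assuming a first run without -part has already written test_results/infomap.part.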
Random number generator
If you do not specify any seed, it will be read from the file
time_seed.dat, which is updated at each run. If you would like to set the seed yourself, the option is
Example: ./bin/topicmap -f quantum-and-granular-large-stemmed -o test_results -seed 101010
Sometimes it is interesting to zoom in on a topic to find its sub-topics. To do that, we first need to decide which topic we want to break down further (for that, it is helpful to look at the file lda_summary_final.txt). Let us say that we would like to zoom in on topic 0. Running:
python Sources/py_utils/write_sub_corpus.py test_results/lda_word_assignments_final.txt 0
we get a file called sub_corpus.txt, which only contains the words that were more likely drawn from topic 0. We can now simply run TopicMapping on the sub-corpus:
./bin/topicmap -f sub_corpus.txt -o sub_results
The file doc_list.txt reports the original ids of the documents which appear in the sub-corpus.
We also provide a simple Python script to parallelize the part of the program which builds the network. Try
python Sources/py_utils/run_parallel_signet.py quantum-and-granular-large-stemmed
and follow the instructions to run TopicMapping after this.
There is also a script to parallelize the LDA optimization. To use it, you first need to run TopicMapping with option
-skip_lda, which just provides the file
plsa_betas_sparse.txt. After that, try
python Sources/py_utils/run_parallel_lda.py where
[intial model] should be the path to the file
TopicMapping re-uses some of the code which implements the original Infomap. The LDA likelihood optimization code closely follows the original code by David Blei. Xiaohan Zeng curated the stemming algorithm. David Mertens compiled the corpus “quantum-and-granular-large-stemmed” by pulling abstracts about “quantum computing” and “transitions in granular systems”. All the rest was developed by Andrea Lancichinetti.