Luís A. Nunes Amaral

co-Director, Northwestern Institute on Complex Systems
Professor of Chemical & Biological Engineering
Professor of Physics & Astronomy (by courtesy)
Professor of Medicine (by courtesy)

Chemical & Biological Engineering
2145 Sheridan Road (Room E136)
Evanston, IL 60208, US
Phone: (847) 491-7850

Abstract

Much of human knowledge sits in large databases of unstructured text. Leveraging this knowledge requires algorithms that extract and record metadata on unstructured text documents. Assigning topics to documents enables intelligent searching, statistical characterization, and meaningful classification. Latent Dirichlet allocation (LDA) is the state of the art in topic modeling. Here, we perform a systematic theoretical and numerical analysis demonstrating that current optimization techniques for LDA often fail to infer the most suitable model parameters accurately. Adapting approaches from community detection in networks, we propose a new algorithm that is highly reproducible, highly accurate, and computationally efficient. We apply it to a large set of documents in the English Wikipedia and reveal its hierarchical structure.