Topic model
In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. An early topic model was probabilistic latent semantic indexing (PLSI), created by Thomas Hofmann in 1999. Latent Dirichlet allocation (LDA), perhaps the most common topic model currently in use, is a generalization of PLSI developed by David Blei, Andrew Ng, and Michael Jordan in 2002, allowing documents to have a mixture of topics. Other topic models are generally extensions of LDA, such as Pachinko allocation, which improves on LDA by modeling correlations between topics in addition to the word correlations that constitute topics. Although topic models were first described and implemented in the context of natural language processing, they have applications in other fields such as bioinformatics.
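The idea that a document can be a mixture of topics can be illustrated with a minimal sketch of an LDA-style generative process. This is not a fitting algorithm and not any particular library's API: the vocabulary, the two hand-made topics, the prior values, and the helper names `sample_dirichlet` and `generate_document` are all assumptions chosen for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, doc_alpha, length, rng):
    """Generate one document under an LDA-style generative process.

    topic_word: list of per-topic word distributions (each sums to 1)
    doc_alpha:  Dirichlet prior over topics for the document
    """
    # Per-document topic mixture: this is what lets a document blend topics.
    theta = sample_dirichlet(doc_alpha, rng)
    words, assignments = [], []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic.
        z = rng.choices(range(len(topic_word)), weights=theta)[0]
        w = rng.choices(vocab, weights=topic_word[z])[0]
        assignments.append(z)
        words.append(w)
    return theta, assignments, words


# Toy vocabulary and two hand-made topics (illustrative values only).
VOCAB = ["gene", "protein", "cell", "ball", "team", "score"]
TOPIC_WORD = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "biology" topic
    [0.0, 0.0, 0.0, 0.3, 0.4, 0.3],  # a "sports" topic
]

rng = random.Random(0)
theta, assignments, words = generate_document(TOPIC_WORD, VOCAB, [0.5, 0.5], 12, rng)
print("topic mixture:", [round(t, 2) for t in theta])
print("words:", " ".join(words))
```

Fitting a topic model runs this process in reverse: given only the words, it estimates the topic-word distributions and each document's mixture.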
Case studies
Templeton's survey of work on topic modeling in the humanities grouped previous work into synchronic and diachronic approaches. The synchronic approaches identify topics at a certain time. For example, Jockers used topic modeling to classify 177 bloggers writing on the 2010 'Day of Digital Humanities' and identify the topics they wrote about for that day. Meeks modeled 50 texts in the Humanities Computing/Digital Humanities genre to identify self-definitions of scholars working on digital humanities and to visualize networks of researchers and topics. Drouin examined Proust to identify topics and show them as a graphical network.
Diachronic approaches include Block and Newman's determination of the temporal dynamics of topics in the Pennsylvania Gazette during 1728–1800. Griffiths & Steyvers used topic modeling on abstracts from the journal PNAS to identify topics that rose or fell in popularity from 1991 to 2001. Nelson has been analyzing change in topics over time in the Richmond Times-Dispatch to understand social and political changes and continuities in Richmond during the American Civil War. Yang, Torget and Mihalcea applied topic modeling methods to newspapers from 1829–2008. Blevins has been topic modeling Martha Ballard's diary to identify thematic trends across the 27-year diary. Mimno used topic modeling with 24 journals on classical philology and archaeology spanning 150 years to look at how topics in the journals change over time and how the journals become more different or similar over time.
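One simple way a diachronic analysis can track topics that rise or fall is to average each topic's per-document proportion within each time period. The sketch below assumes that a fitted topic model has already produced those proportions. The function name `topic_trends` and the numbers are hypothetical and not taken from any of the studies above.

```python
from collections import defaultdict


def topic_trends(doc_years, doc_topics):
    """Average each topic's proportion over the documents of each year.

    doc_years:  list of publication years, one per document
    doc_topics: list of per-document topic proportions (each sums to 1)
    Returns {year: [mean proportion of topic 0, topic 1, ...]}.
    """
    sums = {}
    counts = defaultdict(int)
    for year, theta in zip(doc_years, doc_topics):
        if year not in sums:
            sums[year] = [0.0] * len(theta)
        for k, p in enumerate(theta):
            sums[year][k] += p
        counts[year] += 1
    # Divide the accumulated proportions by the number of documents per year.
    return {y: [s / counts[y] for s in sums[y]] for y in sorted(sums)}


# Made-up proportions for four documents and two topics (not real data).
years = [1991, 1991, 2001, 2001]
thetas = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7], [0.1, 0.9]]
trends = topic_trends(years, thetas)
for year, means in trends.items():
    print(year, [round(m, 2) for m in means])
```

In this toy input the first topic's mean share falls between the two years while the second rises, which is the kind of pattern a diachronic study looks for.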
External links
- Topic modeling bibliography maintained by David Mimno
- Topic Modeling in the Humanities: An Overview by Clay Templeton at the Maryland Institute for Technology in the Humanities
- Topic Models Applied to Online News and Reviews Video of a Google Tech Talk presentation by Alice Oh on topic modeling with LDA
- Modeling Science: Dynamic Topic Models of Scholarly Research Video of a Google Tech Talk presentation by David M. Blei
- Automated Topic Models in Political Science Video of a presentation by Brandon Stewart at the Tools for Text Workshop, 14 June 2010
Further reading
- Mark Steyvers; Tom Griffiths (2007) "Probabilistic Topic Models" In: T. Landauer, D. McNamara, S. Dennis, and W. Kintsch (eds), Handbook of Latent Semantic Analysis, Psychology Press. ISBN 978-0-8058-5418-3
- Blei, D.M.; Lafferty, J.D. (2009) Topic Models Manuscript
- Blei, D. and Lafferty, J. (2007). "A correlated topic model of Science". Annals of Applied Statistics, 1(1), 17–35.
- Mimno, D. to appear. Computational Historiography: Data Mining in a Century of Classics Journals. ACM Transactions on Computational Logic, Vol., No., 20, Pages 1–0??. pre-print
- Jockers, M. 2011 Who's your DH Blog Mate: Match-Making the Day of DH Bloggers with Topic Modeling Matthew L. Jockers, posted 19 March 2010
- Meeks, E. 2011 Comprehending the Digital Humanities Digital Humanities Specialist, posted 19 February 2011
- Drouin, J. 2011 Foray Into Topic Modeling Ecclesiastical Proust Archive. posted 17 March 2011
- Templeton, C. 2011 Topic Modeling in the Humanities: An Overview Maryland Institute for Technology in the Humanities Blog. posted 1 August 2011
- Griffiths, T., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl. 1), 5228–5235.
- Yang, T., A Torget and R. Mihalcea (2011) Topic Modeling on Historical Newspapers. Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities. The Association for Computational Linguistics, Madison, WI. pages 96–104.
- Block, S. 2006 Doing More with Digitization: An introduction to topic modeling of early American sources Common-place The Interactive Journal of Early American Life. Vol 6. No. 2 January 2006
- Newman, D. and S. Block (2006) "Probabilistic Topic Decomposition of an Eighteenth-Century Newspaper," Journal of the American Society for Information Science and Technology. 57:5 (March 2006) post-print
- Blevins, C. 2010. Topic Modeling Martha Ballard’s Diary historying. posted 1 April 2010.