Pachinko allocation
In machine learning and natural language processing, the pachinko allocation model (PAM) is a topic model, i.e. a generative statistical model for discovering the abstract "topics" that occur in a collection of documents. The algorithm improves upon earlier topic models such as latent Dirichlet allocation by modeling correlations between topics in addition to the word correlations which constitute topics. While first described and implemented in the context of natural language processing, the algorithm may have applications in other fields such as bioinformatics. The name comes from pachinko, a type of Japanese gaming machine related to pinball.
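To make the idea of correlated topics concrete, the following is a minimal sketch of how a four-level pachinko allocation model could generate one document: a root node chooses a super-topic, the super-topic chooses a sub-topic, and the sub-topic emits a word. The vocabulary, the Dirichlet parameters, the hand-fixed word distributions, and all function names are illustrative assumptions, not taken from the original paper or from MALLET, and no inference is performed here.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw an index according to a probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(n_words, root_alpha, super_alphas, sub_topic_words, vocab, rng):
    """Generate one document from a four-level PAM-style hierarchy (sketch)."""
    # Per-document distribution over super-topics, drawn at the root.
    root_theta = sample_dirichlet(root_alpha, rng)
    # Per-document distribution over sub-topics, one for each super-topic.
    super_thetas = [sample_dirichlet(alpha, rng) for alpha in super_alphas]
    words = []
    for _ in range(n_words):
        s = sample_index(root_theta, rng)          # pick a super-topic
        t = sample_index(super_thetas[s], rng)     # pick a sub-topic under it
        w = sample_index(sub_topic_words[t], rng)  # pick a word from the sub-topic
        words.append(vocab[w])
    return words


# Hypothetical toy setup: two super-topics, two sub-topics, six words.
rng = random.Random(0)
vocab = ["gene", "protein", "cell", "model", "data", "learning"]
sub_topic_words = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # sub-topic 0: biology-like words
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # sub-topic 1: machine-learning-like words
]
root_alpha = [1.0, 1.0]
# Each super-topic favors a different sub-topic; this is where correlations
# between sub-topics are encoded in the sketch.
super_alphas = [[5.0, 0.5], [0.5, 5.0]]

doc = generate_document(20, root_alpha, super_alphas, sub_topic_words, vocab, rng)
print(doc)
```

In this sketch the sub-topic word distributions are fixed by hand for brevity; in a full model they would themselves be learned from the corpus. Giving each super-topic its own Dirichlet over sub-topics is what lets sub-topics that tend to co-occur be grouped under a common parent.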
History
Pachinko allocation was first described by Wei Li and Andrew McCallum in 2006. The idea was extended with hierarchical pachinko allocation (HPAM) by Li, McCallum, and David Mimno in 2007. The algorithm has been implemented in the MALLET software package published by McCallum's group at the University of Massachusetts Amherst.
See also
- Probabilistic latent semantic indexing (PLSI), an early topic model from Thomas Hofmann in 1999.
- Latent Dirichlet allocation, a generalization of PLSI developed by David Blei, Andrew Ng, and Michael Jordan in 2002, allowing documents to have a mixture of topics.
- MALLET, an open-source Java library that implements Pachinko allocation.
External links
- Mixtures of Hierarchical Topics with Pachinko Allocation, a video recording of David Mimno presenting HPAM in 2007.