Sequence mining
Encyclopedia
Sequence mining is concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence. It is usually presumed that the values are discrete, and thus Time series
mining is closely related, but usually considered a different activity. Sequence mining is a special case of structured data mining
.
There are two different kinds of sequence mining: string mining and itemset mining. String mining is widely used in biology, to examine gene
and protein
sequences, and is primarily concerned with sequences with a single member at each position. There exist a variety of prominent algorithms to perform alignment of a query sequence with those existing in databases. The kind of alignment could either involve matching a query with one subject e.g. BLAST
or matching multiple query sets with each other e.g. ClustalW. Itemset mining is used more often in marketing and CRM applications, and is concerned with multiple-symbols at each position. Itemset mining is also a popular approach to text mining
.
There are several key problems within this field. These include building efficient databases and indexes for sequence information, extracting the frequently occurring patterns, comparing sequences for similarity, and recovering missing sequence members.
Two common techniques that are applied to sequence databases for frequent itemset
mining are the influential apriori algorithm
and the more-recent FP-Growth technique. However, there is nothing in these techniques that restricts them to sequences, per se.
Time series
In statistics, signal processing, econometrics and mathematical finance, a time series is a sequence of data points, measured typically at successive times spaced at uniform time intervals. Examples of time series are the daily closing value of the Dow Jones index or the annual flow volume of the...
mining is closely related, but usually considered a different activity. Sequence mining is a special case of structured data mining
Structured data mining
Structure mining or structured data mining is the process of finding and extracting useful information from semi structured data sets. Graph mining is a special case of structured data mining.-Description:...
.
There are two different kinds of sequence mining: string mining and itemset mining. String mining is widely used in biology, to examine gene
Gene
A gene is a molecular unit of heredity of a living organism. It is a name given to some stretches of DNA and RNA that code for a type of protein or for an RNA chain that has a function in the organism. Living beings depend on genes, as they specify all proteins and functional RNA chains...
and protein
Protein
Proteins are biochemical compounds consisting of one or more polypeptides typically folded into a globular or fibrous form, facilitating a biological function. A polypeptide is a single linear polymer chain of amino acids bonded together by peptide bonds between the carboxyl and amino groups of...
sequences, and is primarily concerned with sequences with a single member at each position. There exist a variety of prominent algorithms to perform alignment of a query sequence with those existing in databases. The kind of alignment could either involve matching a query with one subject e.g. BLAST
BLAST
In bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences...
or matching multiple query sets with each other e.g. ClustalW. Itemset mining is used more often in marketing and CRM applications, and is concerned with multiple-symbols at each position. Itemset mining is also a popular approach to text mining
Text mining
Text mining, sometimes alternately referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as...
.
There are several key problems within this field. These include building efficient databases and indexes for sequence information, extracting the frequently occurring patterns, comparing sequences for similarity, and recovering missing sequence members.
Two common techniques that are applied to sequence databases for frequent itemset
Association rule learning
In data mining, association rule learning is a popular andwell researched method for discovering interesting relations between variablesin large databases. Piatetsky-Shapirodescribes analyzing and presenting...
mining are the influential apriori algorithm
Apriori algorithm
In computer science and data mining, Apriori is a classic algorithm for learning association rules. Apriori is designed to operate on databases containing transactions...
and the more-recent FP-Growth technique. However, there is nothing in these techniques that restricts them to sequences, per se.