C4.5 algorithm
C4.5 is an algorithm used to generate a decision tree, developed by Ross Quinlan. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason, C4.5 is often referred to as a statistical classifier.
Algorithm
C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set S = s1, s2, ... of already classified samples. Each sample si is a vector (x1,i, x2,i, ..., xp,i) where the xj represent attributes or features of the sample. The training data is augmented with a vector C = c1, c2, ... where cj represents the class to which each sample belongs.
At each node of the tree, C4.5 chooses one attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. Its criterion is the normalized information gain (difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the smaller sublists.
This algorithm has a few base cases.
- All the samples in the list belong to the same class. When this happens, C4.5 simply creates a leaf node for the decision tree saying to choose that class.
- None of the features provide any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class.
- An instance of a previously unseen class is encountered. Again, C4.5 creates a decision node higher up the tree using the expected value.
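The splitting criterion described above can be sketched in a few lines of Python. This is an illustrative rendering of entropy and gain ratio (the "normalized information gain"), not Quinlan's own code, and it assumes discrete attribute values:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain_ratio(samples, labels, attr_index):
    """Gain ratio (normalized information gain) for splitting
    the samples on the attribute at attr_index."""
    total = len(labels)
    # Partition the class labels by the attribute's value.
    partitions = {}
    for sample, label in zip(samples, labels):
        partitions.setdefault(sample[attr_index], []).append(label)
    # Information gain: reduction in entropy produced by the split.
    gain = entropy(labels) - sum(
        len(part) / total * entropy(part) for part in partitions.values())
    # Split information: entropy of the attribute values themselves;
    # dividing by it penalizes attributes with many distinct values.
    split_info = entropy([s[attr_index] for s in samples])
    return gain / split_info if split_info > 0 else 0.0
```

For instance, a binary attribute that perfectly separates two equal-sized classes yields a gain ratio of 1.0.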
Pseudocode
In pseudocode, the general algorithm for building decision trees is:
1. Check for base cases.
2. For each attribute a, find the normalized information gain from splitting on a.
3. Let a_best be the attribute with the highest normalized information gain.
4. Create a decision node that splits on a_best.
5. Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of node.
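The steps above can be sketched as a short recursive function. This is an illustrative Python version of the pseudocode (samples represented as attribute dicts, gain ratio as the splitting criterion), not C4.5's actual source:

```python
import math
from collections import Counter

def _entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def _gain_ratio(samples, labels, attr):
    # Partition the class labels by the attribute's value.
    parts = {}
    for s, c in zip(samples, labels):
        parts.setdefault(s[attr], []).append(c)
    gain = _entropy(labels) - sum(
        len(p) / len(labels) * _entropy(p) for p in parts.values())
    split_info = _entropy([s[attr] for s in samples])
    return gain / split_info if split_info > 0 else 0.0

def build_tree(samples, labels, attributes):
    """Build a decision tree; samples are dicts mapping attribute
    name to value, attributes is the set of names still unused."""
    # Base case: all samples belong to one class -> leaf node.
    if len(set(labels)) == 1:
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    # Base case: no attributes left to split on.
    if not attributes:
        return majority
    # Let a_best be the attribute with the highest gain ratio.
    a_best = max(attributes, key=lambda a: _gain_ratio(samples, labels, a))
    # Base case: no attribute provides any information gain.
    if _gain_ratio(samples, labels, a_best) <= 0:
        return majority
    # Create a decision node splitting on a_best, then recurse on
    # each sublist and attach the results as children.
    node = {'split_on': a_best, 'children': {}}
    for value in {s[a_best] for s in samples}:
        sub = [(s, c) for s, c in zip(samples, labels) if s[a_best] == value]
        node['children'][value] = build_tree(
            [s for s, _ in sub], [c for _, c in sub], attributes - {a_best})
    return node
```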
Implementations
J48 is an open source Java implementation of the C4.5 algorithm in the Weka data mining tool.
Improvements from ID3 algorithm
C4.5 made a number of improvements to ID3. Some of these are:
- Handling both continuous and discrete attributes - In order to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it.
- Handling training data with missing attribute values - C4.5 allows attribute values to be marked as ? for missing. Missing attribute values are simply not used in gain and entropy calculations.
- Handling attributes with differing costs.
- Pruning trees after creation - C4.5 goes back through the tree once it's been created and attempts to remove branches that do not help by replacing them with leaf nodes.
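The threshold mechanism for continuous attributes can be illustrated with a short sketch. This example makes simplifying assumptions and is not C4.5's source: candidate thresholds are taken as the midpoints between adjacent sorted values, and missing values (encoded here as None) are skipped, mirroring their exclusion from gain and entropy calculations:

```python
import math
from collections import Counter

def _entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def best_threshold(values, labels):
    """Pick the threshold on a continuous attribute that maximizes
    information gain, trying midpoints between adjacent sorted
    values. Missing values (None) are ignored entirely."""
    pairs = sorted((v, c) for v, c in zip(values, labels) if v is not None)
    vals = [v for v, _ in pairs]
    labs = [c for _, c in pairs]
    base = _entropy(labs)
    best_gain, best_t = 0.0, None
    for i in range(1, len(vals)):
        if vals[i] == vals[i - 1]:
            continue  # no threshold can separate identical values
        t = (vals[i] + vals[i - 1]) / 2
        left, right = labs[:i], labs[i:]
        gain = (base
                - len(left) / len(labs) * _entropy(left)
                - len(right) / len(labs) * _entropy(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain
```

Splitting on the returned threshold turns the continuous attribute into a binary test (value <= t versus value > t), which then competes with the discrete attributes on gain ratio as usual.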
Improvements in C5.0/See5 algorithm
Quinlan went on to create C5.0 and See5 (C5.0 for Unix/Linux, See5 for Windows), which he markets commercially. C5.0 offers a number of improvements on C4.5. Some of these are:
- Speed - C5.0 is significantly faster than C4.5 (several orders of magnitude).
- Memory usage - C5.0 is more memory efficient than C4.5
- Smaller decision trees - C5.0 gets similar results to C4.5 with considerably smaller decision trees.
- Support for boosting - Boosting improves the trees and gives them more accuracy.
- Weighting - C5.0 allows you to weight different cases and misclassification types.
- Winnowing - a C5.0 option automatically winnows the attributes to remove those that may be unhelpful.
Source for a single-threaded Linux version of C5.0 is available under the GPL.
External links
- Original implementation on Ross Quinlan's homepage: http://www.rulequest.com/Personal/
- See5 and C5.0