ID3 algorithm
In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm, invented by Ross Quinlan, used to generate a decision tree. ID3 is the precursor to the C4.5 algorithm.
Algorithm
The ID3 algorithm can be summarized as follows:
- Take all unused attributes and compute their entropy with respect to the test samples.
- Choose the attribute for which entropy is minimum (or, equivalently, for which information gain is maximum).
- Make a node containing that attribute.
The algorithm is as follows:

ID3 (Examples, Target_Attribute, Attributes)
- Create a root node for the tree.
- If all examples are positive, return the single-node tree Root, with label = +.
- If all examples are negative, return the single-node tree Root, with label = -.
- If the set of predicting attributes is empty, return the single-node tree Root, with label = the most common value of the target attribute in the examples.
- Otherwise begin:
  - A = the attribute that best classifies the examples.
  - Decision tree attribute for Root = A.
  - For each possible value, v_i, of A:
    - Add a new tree branch below Root, corresponding to the test A = v_i.
    - Let Examples(v_i) be the subset of examples that have the value v_i for A.
    - If Examples(v_i) is empty, then below this new branch add a leaf node with label = the most common target value in the examples.
    - Else below this new branch add the subtree ID3 (Examples(v_i), Target_Attribute, Attributes – {A}).
- End
- Return Root
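The pseudocode above translates almost directly into a short recursive program. The following Python sketch is illustrative rather than canonical: it assumes examples are dictionaries of discrete attribute values, and the helper names (entropy, information_gain, id3) are invented for this example. Attribute selection uses information gain, defined in the next section.

    import math
    from collections import Counter

    def entropy(labels):
        # Shannon entropy (base 2) of a list of class labels.
        total = len(labels)
        return -sum((count / total) * math.log2(count / total)
                    for count in Counter(labels).values())

    def information_gain(examples, attribute, target):
        # Entropy reduction from splitting `examples` on `attribute`.
        total = len(examples)
        remainder = 0.0
        for value in {ex[attribute] for ex in examples}:
            subset = [ex[target] for ex in examples if ex[attribute] == value]
            remainder += len(subset) / total * entropy(subset)
        return entropy([ex[target] for ex in examples]) - remainder

    def id3(examples, target, attributes):
        # Returns a tree of nested dicts {attribute: {value: subtree_or_label}}.
        labels = [ex[target] for ex in examples]
        if len(set(labels)) == 1:          # all examples share one label: leaf
            return labels[0]
        if not attributes:                 # no attributes left: most common label
            return Counter(labels).most_common(1)[0][0]
        # A = the attribute that best classifies the examples (maximum gain).
        best = max(attributes, key=lambda a: information_gain(examples, a, target))
        tree = {best: {}}
        # Branching only on values observed in the examples means the
        # pseudocode's empty-subset case cannot arise here.
        for value in {ex[best] for ex in examples}:
            subset = [ex for ex in examples if ex[best] == value]
            remaining = [a for a in attributes if a != best]
            tree[best][value] = id3(subset, target, remaining)
        return tree

Called as id3(examples, 'play', ['outlook', 'windy']) on a small table of weather observations, the sketch returns a nested dictionary whose keys trace the attribute tests from the root down to the class labels at the leaves.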
The ID3 metrics
The algorithm is based on Occam's razor: it prefers smaller decision trees (simpler theories) over larger ones. However, it does not always produce the smallest tree, and is therefore a heuristic. Occam's razor is formalized using the concept of information entropy:
Entropy
E(S) = -\sum_{j=1}^{n} f_S(j) \log_2 f_S(j)

where:
- E(S) is the information entropy of the set S;
- n is the number of different values of the attribute in S (entropy is computed for one chosen attribute);
- f_S(j) is the frequency (proportion) of the value j in the set S;
- \log_2 is the binary logarithm.
An entropy of 0 identifies a perfectly classified set.
Entropy is used to determine which node to split next in the algorithm: the higher the entropy of a set, the greater the potential to improve its classification by splitting it.
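As a concrete illustration (the two-class set below is hypothetical, not from the article), here is a minimal sketch of the entropy computation, assuming the labels are given as a flat list:

    import math
    from collections import Counter

    def entropy(values):
        # E(S) = -sum_j f_S(j) * log2(f_S(j))
        total = len(values)
        return -sum((count / total) * math.log2(count / total)
                    for count in Counter(values).values())

    print(round(entropy(['+'] * 9 + ['-'] * 5), 3))  # 0.94: a mixed, uncertain set
    print(entropy(['+'] * 14) == 0.0)                # True: a perfectly classified set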
Gain
Gain G(S, A) is computed to estimate the gain produced by a split over an attribute A:

G(S, A) = E(S) - \sum_{i=1}^{m} f_S(A_i) \, E(S_{A_i})

where:
- G(S, A) is the gain of the set S after a split over the attribute A;
- E(S) is the information entropy of the set S;
- m is the number of different values of the attribute A in S;
- f_S(A_i) is the frequency (proportion) of the items possessing A_i as the value for A in S;
- A_i is the i-th possible value of A;
- S_{A_i} is the subset of S containing all items where the value of A is A_i.
Gain quantifies the entropy improvement by splitting over an attribute: higher is better.
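To make the definition concrete, here is a small sketch that computes G(S, A) by weighting each subset's entropy by its relative size. The weather attribute and its values are invented for illustration, and the entropy helper from the previous snippet is repeated so the example runs on its own:

    import math
    from collections import Counter

    def entropy(values):
        total = len(values)
        return -sum((count / total) * math.log2(count / total)
                    for count in Counter(values).values())

    def gain(examples, attribute, target):
        # G(S, A) = E(S) - sum_i f_S(A_i) * E(S_{A_i})
        total = len(examples)
        split_entropy = 0.0
        for value in {ex[attribute] for ex in examples}:            # each A_i
            subset = [ex[target] for ex in examples if ex[attribute] == value]
            split_entropy += len(subset) / total * entropy(subset)  # f_S(A_i) * E(S_{A_i})
        return entropy([ex[target] for ex in examples]) - split_entropy

    # Invented toy data: does 'windy' help predict 'play'?
    data = [
        {'windy': 'no',  'play': 'yes'},
        {'windy': 'no',  'play': 'yes'},
        {'windy': 'no',  'play': 'no'},
        {'windy': 'yes', 'play': 'no'},
    ]
    print(round(gain(data, 'windy', 'play'), 3))  # 0.311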
External links
- Seminars - http://www2.cs.uregina.ca/
- Description and examples - http://www.cise.ufl.edu/
- Description and examples - http://www.cis.temple.edu/
- An implementation of ID3 in Python
- An implementation of ID3 in Ruby
- An implementation of ID3 in Common Lisp
- An implementation of ID3 algorithm in C#
- An implementation of ID3 in Perl
- An implementation of ID3 in Prolog
- An implementation of ID3 in C