Temporal difference learning
Temporal difference learning is a prediction method. It has been mostly used for solving the reinforcement learning problem. "TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas." TD resembles a Monte Carlo method because it learns by sampling the environment according to some policy. TD is related to dynamic programming techniques because it approximates its current estimate based on previously learned estimates (a process known as bootstrapping). The TD learning algorithm is related to the temporal difference model of animal learning.
As a prediction method, TD learning takes into account the fact that subsequent predictions are often correlated in some sense. In standard supervised predictive learning, one learns only from actually observed values: a prediction is made, and when the observation becomes available, the prediction is adjusted to better match the observation. The core idea of TD learning is instead to adjust predictions to match other, more accurate predictions about the future. This procedure is a form of bootstrapping, as the following example illustrates:
- Suppose you wish to predict the weather for Saturday, and you have some model that predicts Saturday's weather given the weather of each day of the week. In the standard case, you would wait until Saturday and then adjust all your models. However, when it is, for example, Friday, you should already have a pretty good idea of what the weather will be on Saturday, and thus be able to adjust, say, Monday's model before Saturday arrives.
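The sketch below illustrates the weather example in code. All day names, probabilities, and the learning rate are hypothetical illustrations, not taken from the article; the point is only the direction of the update.

```python
# Minimal sketch of the weather example above (illustrative assumptions only).

alpha = 0.5  # learning rate (assumed)

# predictions[d]: the estimate, made on day d, of the chance of rain on Saturday
predictions = {"Mon": 0.50, "Tue": 0.55, "Wed": 0.40, "Thu": 0.70, "Fri": 0.90}
days = ["Mon", "Tue", "Wed", "Thu", "Fri"]

def supervised_update(saturday_outcome):
    """Standard supervised style: wait for Saturday, then adjust every day's
    prediction toward the actually observed outcome (1.0 = rain, 0.0 = dry)."""
    for d in days:
        predictions[d] += alpha * (saturday_outcome - predictions[d])

def td_update():
    """TD style: adjust each day's prediction toward the next day's
    (presumably more accurate) prediction, without waiting for Saturday."""
    for d, d_next in zip(days, days[1:]):
        predictions[d] += alpha * (predictions[d_next] - predictions[d])
```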
Mathematically speaking, in both the standard and the TD approach we try to optimize some cost function related to the error in our prediction of the expectation of some random variable, E[z]. However, while in the standard approach we in some sense assume E[z] = z (the actual observed value), in the TD approach we use a model. In the particular case of reinforcement learning, the major application of TD methods, z is the total return and E[z] is given by the Bellman equation of the return.
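For concreteness, the Bellman equation referred to here can be written in a conventional form; the notation below is a standard one assumed for illustration, not quoted from this article.

```latex
% Bellman (expectation) equation for the return under a policy \pi:
% the value of state s is the expected immediate reinforcement plus the
% discounted value of the successor state.
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, r_t + \gamma\, V^{\pi}(s_{t+1}) \mid s_t = s \,\right]
```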
TD algorithm in neuroscience
The TD algorithm has also received attention in the field of neuroscience. Researchers have found that the firing rate of dopamine neurons in the ventral tegmental area (VTA) and substantia nigra pars compacta (SNc) appears to mimic the error function of the algorithm. The error function reports the difference between the estimated reward at any given state or time step and the actual reward received; the larger the error, the larger the difference between the expected and actual reward. When this error is paired with a stimulus that accurately predicts a future reward, the error can be used to associate the stimulus with the future reward.
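Under this reading, the dopamine firing rate is often identified with the TD error. A conventional form of that error, consistent with the notation used in the formulation below but assumed rather than given explicitly in the article, is:

```latex
% TD error (reward-prediction error) at time t: the reinforcement actually
% received plus the discounted new prediction, minus the previous prediction.
\delta_t = r_t + \gamma\, V(s_{t+1}) - V(s_t)
```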
Dopamine cells appear to behave in a similar manner. In one experiment, dopamine cells were recorded while a monkey was trained to associate a stimulus with a reward of juice. Initially the dopamine cells increased their firing rate when the monkey received the juice, indicating a difference between expected and actual reward. Over time, this increase in firing propagated back to the earliest reliable stimulus predicting the reward. Once the monkey was fully trained, there was no increase in firing rate upon delivery of the predicted reward. Conversely, the firing rate of the dopamine cells dropped below baseline when the expected reward was not delivered. This closely mirrors how the error function in TD is used for reinforcement learning.
The relationship between the model and potential neurological function has prompted research attempting to use TD to explain many aspects of behavior. It has also been used to study conditions such as schizophrenia and the consequences of pharmacological manipulations of dopamine on learning.
Mathematical formulation
Let r_t be the reinforcement received on time step t, and let \bar V_t be the correct prediction, equal to the discounted sum of all future reinforcement. The discounting is done by powers of a factor \gamma, so that reinforcement at more distant time steps counts for less:

    \bar V_t = \sum_{i=0}^{\infty} \gamma^{i} r_{t+i}, \qquad 0 \le \gamma < 1.

This formula can be expanded as

    \bar V_t = r_t + \sum_{i=1}^{\infty} \gamma^{i} r_{t+i}

and, by changing the index i to start again from 0,

    \bar V_t = r_t + \gamma \sum_{i=0}^{\infty} \gamma^{i} r_{t+i+1} = r_t + \gamma \bar V_{t+1}.

Thus, the reinforcement is the difference between the ideal prediction and the discounted ideal prediction at the next time step:

    r_t = \bar V_t - \gamma \bar V_{t+1}.
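In practice the ideal prediction \bar V_t is unknown and is replaced by a learned estimate, which is nudged toward the one-step target r_t + \gamma V(s_{t+1}). The following is a minimal sketch of the resulting tabular TD(0) update; the environment interface, learning rate, and episode scheme are assumptions for illustration, not part of the article.

```python
import random

def td0_evaluate(env_step, states, policy, alpha=0.1, gamma=0.9, episodes=500):
    """Tabular TD(0) policy evaluation (illustrative sketch).

    `env_step(state, action) -> (reward, next_state, done)` and
    `policy(state) -> action` are assumed interfaces; neither is defined
    in the article above.
    """
    V = {s: 0.0 for s in states}              # value-prediction table
    for _ in range(episodes):
        s = random.choice(states)             # arbitrary start state (assumption)
        done = False
        while not done:
            a = policy(s)
            r, s_next, done = env_step(s, a)
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])   # move estimate toward the TD target
            s = s_next
    return V
```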
TD-Lambda is a learning algorithm invented by Richard S. Sutton, based on earlier work on temporal difference learning by Arthur Samuel. This algorithm was famously applied by Gerald Tesauro to create TD-Gammon, a program that learned to play the game of backgammon at the level of expert human players. The lambda (λ) parameter refers to the trace-decay parameter, with 0 ≤ λ ≤ 1. Higher settings lead to longer-lasting traces; that is, a larger proportion of credit from a reward can be assigned to more distal states and actions when λ is higher, with λ = 1 producing learning that parallels Monte Carlo RL algorithms.
See also
- Reinforcement learning
- Q-learning
- SARSA
- Rescorla-Wagner model
- Adaptive Heuristic Critic
- PVLV (primary value learned value)
External links
- Scholarpedia Temporal difference Learning
- TD-Gammon
- TD-Networks Research Group
- Connect Four TDGravity Applet (+ mobile phone version), self-taught using the TD-Leaf method (a combination of TD-Lambda with shallow tree search)