Spurious relationship
In statistics
, a spurious relationship (or, sometimes, spurious correlation or spurious regression) is a mathematical relationship in which two events or variables have no direct causal connection, yet it may be wrongly inferred that they do, due to either coincidence or the presence of a certain third, unseen factor (referred to as a "confounding factor" or "lurking variable
"). Suppose there is found to be a correlation between A and B. Aside from coincidence, there are three possible relationships:
- A causes B,
- B causes A, or
- C causes both A and B.
In the last case there is a spurious correlation between A and B. In a regression model where A is regressed on B but C is actually the true causal factor for A, this misleading choice of independent variable
(B instead of C) is called specification error.
Because correlation can arise from the presence of a lurking variable rather than from direct causation, it is often said that "Correlation does not imply causation
".
General example
An example of a spurious relationship can be illuminated by examining a city's ice cream
sales. These sales are highest when the rate of drownings in city swimming pools is highest. To allege that ice cream sales cause drowning, or vice versa, would be to imply a spurious relationship between the two. In reality, a heat wave
may have caused both. The heat wave is an example of a hidden or unseen variable, also known as a confounding variable.
Another popular example is a series of Dutch statistics showing a positive correlation between the number of storks nesting over a series of springs and the number of human babies born at those times. Of course there was no causal connection: the two were correlated with each other only because both were correlated with the weather nine months before the observations.
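The heat-wave example can be sketched numerically. In the following simulation (all effect sizes are invented for illustration), an unseen "heat" variable drives both ice cream sales and drownings; the raw correlation between the two is strong, but it vanishes once the confounder is controlled for by correlating the residuals left after regressing each variable on heat:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical confounded system: heat drives both variables,
# but neither variable causes the other.
n = 10_000
heat = rng.normal(size=n)                    # unseen confounding variable
sales = 2.0 * heat + rng.normal(size=n)      # ice cream sales
drownings = 1.5 * heat + rng.normal(size=n)  # drownings

# Raw correlation is strong despite there being no causal link.
r = np.corrcoef(sales, drownings)[0, 1]

# Partial out the confounder: correlate the residuals after
# regressing each variable on heat.
res_sales = sales - np.polyval(np.polyfit(heat, sales, 1), heat)
res_drown = drownings - np.polyval(np.polyfit(heat, drownings, 1), heat)
r_partial = np.corrcoef(res_sales, res_drown)[0, 1]

print(round(r, 2), round(abs(r_partial), 2))  # strong raw, near-zero partial
```

The same residual-correlation idea underlies the "controlling for" language used throughout this article: once the confounder's contribution is removed from both variables, nothing systematic remains.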
Detecting spurious relationships
The term "spurious relationship" is commonly used in statistics
and in particular in experimental research
techniques, both of which attempt to understand and predict direct causal relationships (X → Y). A non-causal correlation can be created spuriously by an antecedent that causes both (W → X and W → Y). Intervening variables (X → W → Y), if undetected, can make indirect causation look direct. Because of this, experimentally identified correlations do not represent causal relationships unless spurious relationships can be ruled out.
Experiments
In experiments, spurious relationships can often be identified by controlling for other factors, including those that have been theoretically identified as possible confounding factors. For example, consider a researcher trying to determine whether a new drug kills bacteria; when the researcher applies the drug to a bacterial culture, the bacteria die. But to help rule out the presence of a confounding variable, a second culture is subjected to conditions that are as nearly identical as possible to those facing the first culture, except that the second culture is not exposed to the drug. If there is an unseen confounding factor in those conditions, this control culture will die as well, so that no conclusion of the drug's efficacy can be drawn from the results of the first culture. On the other hand, if the control culture does not die, then the researcher cannot reject the hypothesis that the drug is efficacious.
Non-experimental statistical analyses
Primarily non-experimental disciplines such as economics
usually employ pre-existing data rather than experimental data to establish causal relationships and to determine that they are not spurious. The body of statistical techniques used in economics is referred to as econometrics, and involves substantial use of multivariate regression analysis. Typically a linear relationship such as

y = a0 + a1x1 + a2x2 + ... + akxk + e

is postulated, in which y is the dependent variable (hypothesized to be the caused variable), xj for j = 1, ..., k is the jth independent variable (hypothesized to be a causative variable), and e is the error term (containing the combined effects of all other causative variables, which must be uncorrelated with the included independent variables). If there is reason to believe that none of the xj is caused by y, then estimates of the coefficients aj are obtained. If the null hypothesis that aj = 0 is rejected, then the alternative hypothesis that aj ≠ 0, and equivalently that xj causes y, cannot be rejected. On the other hand, if the null hypothesis that aj = 0 cannot be rejected, then equivalently the hypothesis of no causal effect of xj on y cannot be rejected. Here the notion of causality is one of contributory causality: if the true value aj ≠ 0, then a change in xj will result in a change in y unless some other causative variable(s), either included in the regression or implicit in the error term, change in such a way as to exactly offset its effect; thus a change in xj is not sufficient to change y. Likewise, a change in xj is not necessary to change y, because a change in y could be caused by something implicit in the error term (or by some other causative explanatory variable included in the model).
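As a concrete sketch, the linear specification above can be estimated by ordinary least squares; in this simulation the true coefficient values (1.0, 2.0, -0.5) are invented for illustration, and the estimates recovered from the data land close to them:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate the postulated model y = a0 + a1*x1 + a2*x2 + e
n = 5_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
e = rng.normal(size=n)             # error term, uncorrelated with x1 and x2
y = 1.0 + 2.0 * x1 - 0.5 * x2 + e  # true a0 = 1.0, a1 = 2.0, a2 = -0.5

# OLS: least-squares solve for [a0, a1, a2] with an intercept column
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef.round(1))  # estimates close to the true [1.0, 2.0, -0.5]
```

In practice an econometrician would also compute standard errors for each aj and test the null hypothesis aj = 0, as described above; the point here is only the mechanics of the multivariate fit.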
Regression analysis controls for other relevant variables by including them as regressors (explanatory variables). This helps to avoid false inferences of causality due to the presence of a third, underlying variable that influences both the potentially causative variable and the potentially caused variable: its effect on the potentially caused variable is captured by directly including it in the regression, so that effect will not be picked up as a spurious effect of the potentially causative variable of interest. In addition, the use of multivariate regression helps to avoid wrongly inferring that an indirect effect of, say, x1 (e.g., x1 → x2 → y) is a direct effect (x1 → y).
Just as an experimenter must be careful to control for every confounding factor, by holding such factors constant throughout the experiment, so also must the user of multiple regression be careful to control for every confounding factor by including them as xj variables in the regression. If a confounding factor is omitted from the regression, it exists by default in the error term, and if the latter is correlated with one (or more) of the included explanators then the regression results may be spurious.
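The consequence of omitting a confounder can be demonstrated directly. In this sketch (effect sizes are assumptions for illustration), the true direct effect of x1 on y is 1.0, but a confounder w influences both; regressing y on x1 alone inflates the estimated coefficient, while including w as a regressor recovers the true value:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup: w confounds x1 and y; true direct effect of x1 is 1.0
n = 20_000
w = rng.normal(size=n)                # confounding factor
x1 = 0.8 * w + rng.normal(size=n)
y = 1.0 * x1 + 2.0 * w + rng.normal(size=n)

def ols(y, *cols):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    X = np.column_stack([np.ones(len(y))] + list(cols))
    return np.linalg.lstsq(X, y, rcond=None)[0]

biased = ols(y, x1)[1]       # w omitted: its effect loads onto x1's coefficient
unbiased = ols(y, x1, w)[1]  # w included as a regressor: bias removed
print(round(biased, 1), round(unbiased, 1))
```

When w is omitted it lives in the error term, which is then correlated with x1, exactly the situation the paragraph above warns against; the biased estimate is roughly double the true effect here.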
See also
- Causality
- Correlation does not imply causation
- Specification (regression)
- Omitted-variable bias
External links
- Burns, William C., "Spurious Correlations", 1997.
- "The Art and Science of Cause and Effect": a slide show and tutorial lecture by Judea Pearl