Regression dilution
Regression dilution is a statistical phenomenon also known as "attenuation".
Consider fitting a straight line for the relationship of an outcome variable y to a predictor variable x, and estimating the gradient (slope) of the line. Statistical variability, measurement error or random noise in the y variable causes imprecision in the estimated gradient, but not bias: on average, the procedure calculates the right gradient. However, variability, measurement error or random noise in the x variable causes bias in the estimated gradient (as well as imprecision). The greater the variance of the error in the x measurement, the closer the estimated slope is pulled towards 0 and away from the true gradient. This 'dilution' of the gradient towards 0 is referred to as "regression dilution", "attenuation", or "attenuation bias".
It may seem counter-intuitive that noise in the predictor variable x induces a bias, but noise in the outcome variable y does not. Recall that linear regression is not symmetric: the line of best fit for predicting y from x (the usual linear regression) is not the same as the line of best fit for predicting x from y (see, for example, Draper & Smith, "Applied Regression Analysis"; page 5 of the 1966 edition).
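The asymmetry can be illustrated with a small simulation. This is a minimal sketch, not taken from the sources cited here: it assumes normally distributed true x values and independent additive normal noise, and the names ols_slope and simulate, the sample size and the standard deviations are all choices made for this illustration.

```python
import random

def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def simulate(n=20000, true_slope=2.0, sd_x=1.0, sd_noise=1.0, seed=0):
    """Compare the fitted slope when noise is added to y versus to x.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, sd_x) for _ in range(n)]
    y_exact = [true_slope * xi for xi in x]
    # Noise in the outcome only: the slope estimate is imprecise but unbiased.
    y_noisy = [yi + rng.gauss(0.0, sd_noise) for yi in y_exact]
    # Noise in the predictor only: the slope estimate is pulled towards 0.
    x_noisy = [xi + rng.gauss(0.0, sd_noise) for xi in x]
    return {
        "noise_in_y": ols_slope(x, y_noisy),
        "noise_in_x": ols_slope(x_noisy, y_exact),
    }

slopes = simulate()
print(slopes)
```

With these settings the slope fitted with noisy y stays close to the true value of 2, while the slope fitted with noisy x is close to 1, since the noise variance equals the variance of the true x values.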
The case of a randomly distributed x variable
The case that the x variable arises randomly is known as the structural model or structural relationship. For example, in a medical study patients are recruited as a sample from a population, and their characteristics such as blood pressure may be viewed as arising from a random sample.
Under certain assumptions (typically, normal distribution assumptions) there is a known ratio between the true gradient and the expected estimated gradient. Frost and Thompson (2000) review several methods for estimating this ratio and hence correcting the estimated gradient. The term regression dilution ratio (beware: it is not defined in quite the same way by all authors) is used for this general approach, in which the usual linear regression is fitted and a correction is then applied. The reply to Frost & Thompson by Longford (2001) refers the reader to other methods, which expand the regression model to acknowledge the variability in the x variable, so that no bias arises. Fuller (1987) is one of the standard references for assessing and correcting for regression dilution.
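One common way to write such a ratio is sketched below. It assumes the classical additive error model, in which the observed predictor w is the true value x plus independent normal error u. The symbols w, u, λ and the variances are notation introduced for this illustration, and, as noted above, some authors define the ratio differently (for example as the reciprocal).

```latex
\begin{aligned}
y &= \alpha + \beta x + \varepsilon, \qquad w = x + u, \\
x &\sim N(\mu_x, \sigma_x^2), \qquad u \sim N(0, \sigma_u^2), \qquad u \text{ independent of } x \text{ and } \varepsilon, \\
\beta_w &= \frac{\operatorname{Cov}(y, w)}{\operatorname{Var}(w)}
         = \frac{\beta\,\sigma_x^2}{\sigma_x^2 + \sigma_u^2}
         = \lambda \beta,
\qquad \lambda = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_u^2}, \\
\hat{\beta}_{\text{corrected}} &= \hat{\beta}_w / \hat{\lambda}.
\end{aligned}
```

Since λ lies between 0 and 1, the gradient of y on w is closer to 0 than β, and dividing the fitted gradient by an estimate of λ undoes the dilution under these assumptions.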
Hughes (1993) shows that the regression dilution ratio methods apply approximately in survival models. Rosner (1992) shows that the ratio methods apply approximately to logistic regression models. Carroll et al. (1995) give more detail on regression dilution in nonlinear models, presenting the regression dilution ratio methods as the simplest case of regression calibration methods, in which additional covariates may also be incorporated.
In general, methods for the structural model require some estimate of the variability of the x variable. This will require repeated measurements of the x variable in the same individuals, either in a sub-study of the main data set, or in a separate data set. Without this information it will not be possible to make a correction.
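The role of repeated measurements can be sketched in code. This is one simple estimator under stated assumptions, not necessarily the method of any author cited here: every individual has two measurements w1 and w2 of the same true x, with independent errors of equal variance, so the covariance of w1 and w2 estimates the variance of x. The function names and all numerical settings are hypothetical.

```python
import random

def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)

def corrected_slope(w1, w2, y):
    """Return (naive slope, estimated ratio, corrected slope).

    The ratio is estimated as Cov(w1, w2) / Var(w1), which assumes the two
    measurement errors are independent of each other and of the true x.
    """
    naive = covariance(w1, y) / covariance(w1, w1)
    ratio = covariance(w1, w2) / covariance(w1, w1)
    return naive, ratio, naive / ratio

# Simulated data: true slope 2, error variance equal to the variance of x.
rng = random.Random(1)
n = 20000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
w1 = [xi + rng.gauss(0.0, 1.0) for xi in x]  # first measurement of x
w2 = [xi + rng.gauss(0.0, 1.0) for xi in x]  # repeat measurement of x
y = [2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

naive, ratio, corrected = corrected_slope(w1, w2, y)
print(naive, ratio, corrected)
```

In practice the repeat measurements often come from a sub-study or a separate data set rather than from every individual, and the uncertainty in the estimated ratio should be carried through to the corrected gradient; the sketch ignores both points.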
The case of a fixed x variable
The case that x is fixed, but measured with noise, is known as the functional model or functional relationship. See, for example, Riggs et al. (1978).
Multiple x variables
The case of multiple predictor variables (possibly correlated) subject to variability (possibly correlated) has been well-studied for linear regression, and for some non-linear regression models. Other non-linear models, such as proportional hazards models for survival analysis, have been considered only with a single predictor subject to variability.
Is correction necessary?
In many (perhaps most) applications, correction is neither necessary nor appropriate. To understand this, consider the measurement error as follows. Let y be the outcome variable, x be the true predictor variable, and w be an approximate observation of x. Frost and Thompson suggest, for example, that x may be the true, long-term blood pressure of a patient, and w may be the blood pressure observed on one particular clinic visit. Regression dilution arises if we are interested in the relationship between y and x, but estimate the relationship between y and w. Because w is measured with variability, the gradient of the regression line of y on w is closer to 0 than the gradient of the regression line of y on x.
Does this matter? In predictive modelling, no. Standard methods can fit a regression of y on w without bias. There is bias only if we then use the regression of y on w as an approximation to the regression of y on x. In the example, assuming that blood pressure measurements are similarly variable in future patients, our regression line of y on w (observed blood pressure) gives unbiased predictions.
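The point about prediction can be checked with a simulation. The sketch below is illustrative only: it assumes normal true values, additive normal measurement error of the same size in current and future patients, and hypothetical helper names (fit_line, make_data, mean_squared_error) and parameter values.

```python
import random

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def make_data(rng, n, true_slope=2.0):
    """Simulate observed predictor w and outcome y; true x is never observed."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    w = [xi + rng.gauss(0.0, 1.0) for xi in x]
    y = [true_slope * xi + rng.gauss(0.0, 1.0) for xi in x]
    return w, y

def mean_squared_error(slope, intercept, ws, ys):
    """Average squared prediction error of the line on (ws, ys)."""
    return sum((y - (intercept + slope * w)) ** 2 for w, y in zip(ws, ys)) / len(ws)

rng = random.Random(2)
train_w, train_y = make_data(rng, 20000)  # current data set
test_w, test_y = make_data(rng, 20000)    # future, similarly variable patients

# Diluted line of y on w, fitted by ordinary least squares.
slope_w, intercept_w = fit_line(train_w, train_y)
mse_fitted = mean_squared_error(slope_w, intercept_w, test_w, test_y)
# Line with the true y-on-x gradient (2, intercept 0) applied to observed w.
mse_true_line = mean_squared_error(2.0, 0.0, test_w, test_y)
# Average signed prediction error of the fitted line on the future data.
mean_error = sum(
    y - (intercept_w + slope_w * w) for w, y in zip(test_w, test_y)
) / len(test_w)
print(slope_w, mse_fitted, mse_true_line, mean_error)
```

In this setup the diluted line fitted to w predicts future outcomes with average error near zero and a smaller mean squared error than the line with the true y-on-x gradient, which is why no correction is wanted for this kind of prediction.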
An example of a circumstance in which correction is desired is prediction of change. Suppose the change in x is known under some new circumstance: to estimate the likely change in an outcome variable y, the gradient of the regression of y on x is needed, not y on w. This arises in epidemiology. To continue the example in which x denotes blood pressure, perhaps a large clinical trial has provided an estimate of the change in blood pressure under a new treatment; then the possible effect on y, under the new treatment, should be estimated from the gradient in the regression of y on x.
Another circumstance is predictive modelling in which future observations are also variable, but not (in the phrase used above) "similarly variable". For example, the current data set may include blood pressure measured with greater precision than is common in clinical practice. One specific example of this arose when developing a regression equation based on a clinical trial, in which blood pressure was the average of six measurements, for use in clinical practice, where blood pressure is usually a single measurement.
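The effect of averaging several measurements can be illustrated with a short worked formula. This is only a sketch under assumptions not stated above: each single measurement equals the true value x plus an independent error with the same variance, the symbols (λ with a subscript k for the number of measurements averaged, and the two variances) are introduced here, and the numerical values are hypothetical.

```latex
\begin{aligned}
\bar{w}_k &= \frac{1}{k}\sum_{j=1}^{k} w_j, \qquad w_j = x + u_j, \qquad \operatorname{Var}(u_j) = \sigma_u^2, \\
\lambda_k &= \frac{\sigma_x^2}{\sigma_x^2 + \sigma_u^2 / k},
\qquad \text{expected gradient of } y \text{ on } \bar{w}_k \approx \lambda_k \beta, \\
\text{with } \sigma_x^2 = \sigma_u^2 = 1: \quad
\lambda_6 &= \frac{1}{1 + 1/6} = \frac{6}{7} \approx 0.86,
\qquad \lambda_1 = \frac{1}{2}, \\
\frac{\lambda_1}{\lambda_6} &= \frac{7}{12} \approx 0.58 .
\end{aligned}
```

Under these assumed values, a gradient estimated from six-measurement averages would be multiplied by about 0.58 before being applied to single clinic measurements, with a matching adjustment to the intercept.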
Caveats
All of these results can be shown mathematically, in the case of simple linear regression assuming normal distributions throughout (the framework of Frost & Thompson). However, it has been pointed out that a poorly executed correction for regression dilution may do more damage to an estimate than no correction.
Further reading
Regression dilution was first mentioned, under the name attenuation, by Spearman (1904). Those seeking a readable mathematical treatment might like to start with Frost and Thompson (2000), or see correction for attenuation.