Cook's distance
In statistics, Cook's distance is a commonly used estimate of the influence of a data point when performing least squares regression analysis. In a practical ordinary least squares analysis, Cook's distance can be used in several ways: to indicate data points that are particularly worth checking for validity, and to indicate regions of the design space where it would be good to be able to obtain more data points.
Definition
Cook's distance measures the effect of deleting a given observation. Data points with large residuals (outliers) and/or high leverage may distort the outcome and accuracy of a regression. Points with a large Cook's distance are considered to merit closer examination in the analysis. It is defined as
$$D_i = \frac{\sum_{j=1}^{n} (\hat{Y}_j - \hat{Y}_{j(i)})^2}{p \, \mathrm{MSE}}$$
The following is an algebraically equivalent expression:
$$D_i = \frac{e_i^2}{p \, \mathrm{MSE}} \left[ \frac{h_{ii}}{(1 - h_{ii})^2} \right]$$
In the above equations:
- $\hat{Y}_j$ is the prediction from the full regression model for observation j;
- $\hat{Y}_{j(i)}$ is the prediction for observation j from a refitted regression model in which observation i has been omitted;
- $h_{ii}$ is the i-th diagonal element of the hat matrix $\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$;
- $e_i$ is the crude residual (i.e., the difference between the observed value and the value fitted by the proposed model);
- MSE is the mean square error of the regression model;
- $p$ is the number of fitted parameters in the model.
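As a rough illustration of the definition, the sketch below computes Cook's distance in two ways: by refitting the regression with each observation omitted, and from the residuals and hat-matrix diagonal alone. It is a minimal sketch, not a reference implementation. It assumes NumPy, a full-rank design matrix, and that MSE is the residual sum of squares divided by n - p. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def cooks_distance_closed_form(X, y):
    """Cook's distance from residuals and leverages (no refitting)."""
    n, p = X.shape
    # Hat matrix H = X (X^T X)^{-1} X^T; its diagonal holds the leverages h_ii.
    H = X @ np.linalg.solve(X.T @ X, X.T)
    h = np.diag(H)
    e = y - H @ y              # crude residuals e_i
    mse = e @ e / (n - p)      # assumption: MSE = RSS / (n - p)
    return e**2 / (p * mse) * h / (1.0 - h) ** 2


def cooks_distance_refit(X, y):
    """Cook's distance from the definition: refit with observation i omitted."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    mse = np.sum((y - fitted) ** 2) / (n - p)
    d = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        # Sum of squared changes in all n predictions when i is left out.
        d[i] = np.sum((fitted - X @ beta_i) ** 2) / (p * mse)
    return d


# Hypothetical data: a noisy straight line plus one shifted point far out in x.
rng = np.random.default_rng(0)
x = np.append(np.linspace(0.0, 1.0, 20), 3.0)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size)
y[-1] += 5.0
X = np.column_stack([np.ones_like(x), x])  # intercept column plus one predictor

d_closed = cooks_distance_closed_form(X, y)
d_refit = cooks_distance_refit(X, y)
print(np.allclose(d_closed, d_refit))  # the two expressions should agree
print(int(np.argmax(d_closed)))        # index of the most influential point
```

The closed form needs only one fit, which is why it is usually preferred in practice; the refit version is included only to check it against the definition.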
Detecting highly influential observations using Cook's distance
There are different opinions regarding what cut-off values to use for spotting highly influential points. A simple operational guideline of a Cook's distance $D_i > 1$ has been suggested. Others have indicated that $D_i > 4/n$, where $n$ is the number of observations, might be used.
Interpreting Cook's distance
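Specifically, $D_i$ can be interpreted as the distance one's estimates move within the confidence ellipsoid that represents a region of plausible values for the parameters. This is shown by an alternative but equivalent representation of Cook's distance in terms of changes to the estimates of the regression parameters between the cases where the particular observation is either included or excluded from the regression analysis: $D_i = (\hat{\beta} - \hat{\beta}_{(i)})^T \mathbf{X}^T \mathbf{X} (\hat{\beta} - \hat{\beta}_{(i)}) / (p \, \mathrm{MSE})$, where $\hat{\beta}$ and $\hat{\beta}_{(i)}$ are the parameter estimates with and without observation i, $\mathbf{X}$ is the design matrix, $p$ is the number of fitted parameters, and MSE is the mean square error of the full model.
See also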
- Outlier
- Leverage (statistics)
- Partial leverage
- DFFITS
- Studentized residual