Cook's distance - AbsoluteAstronomy.com

In statistics, Cook's distance is a commonly used estimate of the influence of a data point when doing least squares regression analysis. In a practical ordinary least squares analysis, Cook's distance can be used in several ways: to indicate data points that are particularly worth checking for validity, and to indicate regions of the design space where it would be good to be able to obtain more data points.

Definition

Cook's distance measures the effect of deleting a given observation. Data points with large residuals (outliers) and/or high leverage may distort the outcome and accuracy of a regression. Points with a large Cook's distance are considered to merit closer examination in the analysis.

Cook's distance for observation i, in a data set of n observations, is calculated as

$$D_i = \frac{\sum_{j=1}^{n} (\hat{Y}_j - \hat{Y}_{j(i)})^2}{p \, \mathrm{MSE}}.$$

The following is an algebraically equivalent expression:

$$D_i = \frac{e_i^2}{p \, \mathrm{MSE}} \left[ \frac{h_{ii}}{(1 - h_{ii})^2} \right].$$

In the above equations:

$\hat{Y}_j$ is the prediction from the full regression model for observation j;

$\hat{Y}_{j(i)}$ is the prediction for observation j from a refitted regression model in which observation i has been omitted;

$h_{ii}$ is the i-th diagonal element of the hat matrix $X (X^T X)^{-1} X^T$;

$e_i$ is the crude residual (i.e., the difference between the observed value and the value fitted by the proposed model);

MSE is the mean square error of the regression model;

$p$ is the number of fitted parameters in the model.
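As an illustration of the definition, the following is a minimal NumPy sketch that computes Cook's distance in two ways: from the residuals and leverages of a single fit, and by literally refitting the model with each observation omitted. The function names `cooks_distance` and `cooks_distance_by_deletion` are hypothetical, the design matrix `X` is assumed to already contain any intercept column, and MSE is assumed to be the residual sum of squares divided by n - p.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance from residuals and leverages of one full fit.

    X: (n, p) design matrix (include a column of ones for an intercept).
    y: (n,) vector of observed responses.
    """
    n, p = X.shape
    # Hat matrix H = X (X^T X)^{-1} X^T; pinv is used to keep the sketch short.
    H = X @ np.linalg.pinv(X.T @ X) @ X.T
    h = np.diag(H)                 # leverages h_ii
    e = y - H @ y                  # crude residuals e_i
    mse = e @ e / (n - p)          # assumption: MSE = RSS / (n - p)
    return e**2 / (p * mse) * h / (1.0 - h) ** 2

def cooks_distance_by_deletion(X, y):
    """Cook's distance by refitting with each observation left out."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta              # predictions from the full model
    resid = y - fitted
    mse = resid @ resid / (n - p)
    d = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i   # drop observation i
        beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        # Sum of squared changes in all n predictions.
        d[i] = np.sum((fitted - X @ beta_i) ** 2) / (p * mse)
    return d
```

On well-conditioned data the two functions should agree up to floating-point error. The leverage-based version needs only one fit, which is why it is usually preferred in practice.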

Detecting highly influential observations using Cook's distance

There are different opinions regarding what cut-off values to use for spotting highly influential points. A simple operational guideline of $D_i > 1$ has been suggested. Others have indicated that $D_i > 4/n$, where $n$ is the number of observations, might be used.
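The two commonly quoted rules of thumb, a distance greater than 1 or greater than 4/n, can be applied mechanically once the distances are known. The sketch below assumes the distances are already available as an array; the helper name `flag_influential` and its `rule` argument are hypothetical, and the example values are made up purely to show the behaviour.

```python
import numpy as np

def flag_influential(d, rule="one"):
    """Return indices of observations whose Cook's distance exceeds a cut-off.

    rule="one"         uses the fixed cut-off D_i > 1.
    rule="four_over_n" uses the sample-size cut-off D_i > 4 / n.
    """
    d = np.asarray(d, dtype=float)
    if rule == "one":
        cutoff = 1.0
    elif rule == "four_over_n":
        cutoff = 4.0 / d.size
    else:
        raise ValueError(f"unknown rule: {rule!r}")
    return np.flatnonzero(d > cutoff)

# Made-up distances for n = 8 observations, so 4 / n = 0.5.
example = [0.02, 0.15, 1.3, 0.6, 0.01, 0.03, 0.04, 0.02]
strict = flag_influential(example, rule="one")           # only index 2
loose = flag_influential(example, rule="four_over_n")    # indices 2 and 3
```

Neither cut-off is a formal test. Flagged points are candidates for closer inspection, not automatic deletions.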

Interpreting Cook's distance

Specifically, $D_i$ can be interpreted as the distance one's estimates move within the confidence ellipsoid that represents a region of plausible values for the parameters. This is shown by an alternative but equivalent representation of Cook's distance in terms of changes to the estimates of the regression parameters between the cases where the particular observation is either included or excluded from the regression analysis.
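For reference, the parameter-based representation usually given in textbooks can be written as follows. The symbols here are assumptions not defined in the text above: $X$ is the design matrix, $\hat{\beta}$ is the least squares estimate from all observations, and $\hat{\beta}_{(i)}$ is the estimate with observation i omitted.

```latex
D_i = \frac{(\hat{\beta} - \hat{\beta}_{(i)})^{T} \, X^{T} X \, (\hat{\beta} - \hat{\beta}_{(i)})}{p \, \mathrm{MSE}}
```

Because the vector of fitted values is $X\hat{\beta}$, the numerator equals the sum of squared changes in the n predictions when observation i is deleted. The same quadratic form in $X^{T} X$ defines the boundary of the usual joint confidence ellipsoid for the parameters, which is what ties the distance to that region.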

External links

Procedure for calculating Cook's distance

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
