Generalized estimating equations
In statistics, a generalized estimating equation (GEE) is used to estimate the parameters of a generalized linear model with a possibly unknown correlation between outcomes.
Parameter estimates from the GEE are consistent even when the variance structure is misspecified, under mild regularity conditions. Intuitively, the focus of the GEE is on estimating the average response over the population ("population-averaged" effects) rather than the regression parameters that would enable prediction of the effect of changing one or more covariates on a given individual. GEEs are usually used in conjunction with Huber-White standard error estimates, also known as robust standard error or sandwich variance estimates. In the case of a linear model with a working independence variance structure, these are known as heteroskedasticity-consistent standard error estimators. Indeed, the GEE unified several independent formulations of these standard error estimators in a general framework.
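As a minimal sketch of the linear-model special case mentioned above (working independence, where the sandwich estimate reduces to a heteroskedasticity-consistent one), the following Python snippet computes ordinary least squares coefficients together with naive and robust standard errors. The function name `ols_with_sandwich`, the simulated data, and the choice of the basic "HC0" form of the estimator (no small-sample correction) are assumptions made for illustration and do not come from the text.

```python
import numpy as np

def ols_with_sandwich(X, y):
    """OLS fit with naive and HC0 (sandwich) standard errors.

    Illustrative helper, not part of any library.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    bread = np.linalg.inv(X.T @ X)          # (X'X)^{-1}
    beta = bread @ X.T @ y                  # least squares solution
    resid = y - X @ beta
    # Naive covariance: assumes one common error variance.
    sigma2 = resid @ resid / (n - p)
    naive_cov = sigma2 * bread
    # Sandwich covariance: bread * meat * bread,
    # with meat = sum_i resid_i^2 * x_i x_i'.
    meat = (X * resid[:, None] ** 2).T @ X
    robust_cov = bread @ meat @ bread
    return beta, np.sqrt(np.diag(naive_cov)), np.sqrt(np.diag(robust_cov))

# Simulated data whose noise level grows with x (heteroskedastic by design).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 4.0, size=500)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 + x, size=x.size)

beta, naive_se, robust_se = ols_with_sandwich(X, y)
print("coefficients:", beta)
print("naive SE:    ", naive_se)
print("robust SE:   ", robust_se)
```

The coefficient estimates are the same under both approaches; only the standard errors differ, which is the sense in which the robust estimate protects inference against a misspecified variance structure.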
GEEs belong to a class of semiparametric regression techniques because they rely on specification of only the first two moments. They are a popular alternative to the likelihood-based generalized linear mixed model, which is more sensitive to variance structure specification. They are commonly used in large epidemiological studies, especially multi-site cohort studies, because they can handle many types of unmeasured dependence between outcomes.
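To make "only the first two moments" concrete, one common way of writing the specification is sketched below. The notation is assumed here rather than taken from the text: $Y_{ij}$ is outcome $j$ in cluster $i$, $x_{ij}$ its covariate vector, $g$ a link function, $v$ a variance function, $\phi$ a dispersion parameter, $R_i(\alpha)$ a working correlation matrix, and $A_i$ the diagonal matrix with entries $v(\mu_{ij})$.

```latex
\begin{aligned}
\operatorname{E}[Y_{ij}] &= \mu_{ij}, & g(\mu_{ij}) &= x_{ij}^{\top}\beta, \\
\operatorname{Var}(Y_{ij}) &= \phi\, v(\mu_{ij}), & V_i &= \phi\, A_i^{1/2} R_i(\alpha)\, A_i^{1/2}.
\end{aligned}
```

Nothing beyond the mean and this working covariance is specified; in particular, no full joint distribution for the outcomes within a cluster is assumed.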
Formulation
Given a mean model $\mu_i(\beta)$ for the outcomes $Y_i$ of cluster $i$ and a variance structure $V_i$, the estimating equation is formed via:

$$U(\beta) = \sum_{i=1}^{N} \left( \frac{\partial \mu_i}{\partial \beta} \right)^{\top} V_i^{-1} \{ Y_i - \mu_i(\beta) \}$$

The parameter estimates solve $U(\beta) = 0$ and are typically obtained via the Newton-Raphson algorithm. The term variance structure refers to the algebraic form of the covariance matrix between outcomes, Y, in the sample. Examples of variance structure specifications include independence, exchangeable, autoregressive, stationary m-dependent, and unstructured. The variance structure is chosen to improve the efficiency of the parameter estimates. The Hessian of the solution to the GEEs in the parameter space can be used to calculate robust standard error estimates.

The most popular form of inference on GEE regression parameters is the Wald test using naive or robust standard errors, though the score test is also valid and preferable when it is difficult to obtain estimates of information under the alternative hypothesis. The likelihood ratio test is not valid in this setting because the estimating equations are not necessarily likelihood equations.
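The estimation and inference steps described above can be sketched numerically. The Python snippet below is a minimal illustration under several assumptions that are not in the text: a log-link mean model with Poisson-type variance, an exchangeable variance structure whose correlation `alpha` is held fixed rather than estimated, dispersion fixed at 1, and simulated clustered counts. The helper names (`exchangeable`, `gee_pieces`, `fit_gee_poisson`) are hypothetical. The update uses the expected derivative matrix (Fisher scoring), a Newton-type iteration, rather than the exact Newton-Raphson Hessian.

```python
import numpy as np

def exchangeable(n, alpha):
    """n x n working correlation: 1 on the diagonal, alpha elsewhere."""
    return (1.0 - alpha) * np.eye(n) + alpha * np.ones((n, n))

def gee_pieces(clusters, beta, alpha):
    """Return U(beta), the 'bread' matrix B and the 'meat' matrix M."""
    p = beta.size
    U = np.zeros(p)
    B = np.zeros((p, p))
    M = np.zeros((p, p))
    for X, y in clusters:
        mu = np.exp(X @ beta)                 # log-link mean model
        D = mu[:, None] * X                   # d mu_i / d beta
        s = np.sqrt(mu)                       # Poisson-type variance: Var = mu
        V = s[:, None] * exchangeable(len(y), alpha) * s[None, :]
        W = np.linalg.solve(V, np.column_stack([D, y - mu]))
        Vinv_D, Vinv_r = W[:, :-1], W[:, -1]  # V^{-1} D and V^{-1} (y - mu)
        u_i = D.T @ Vinv_r                    # cluster contribution to U
        U += u_i
        B += D.T @ Vinv_D
        M += np.outer(u_i, u_i)
    return U, B, M

def fit_gee_poisson(clusters, alpha=0.0, tol=1e-8, max_iter=50):
    """Solve U(beta) = 0; return beta, naive and robust covariance."""
    beta = np.zeros(clusters[0][0].shape[1])
    for _ in range(max_iter):
        U, B, _ = gee_pieces(clusters, beta, alpha)
        step = np.linalg.solve(B, U)          # Fisher scoring update
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    _, B, M = gee_pieces(clusters, beta, alpha)
    B_inv = np.linalg.inv(B)
    return beta, B_inv, B_inv @ M @ B_inv     # naive, then sandwich

# Simulated data: 200 clusters of 4 counts sharing a multiplicative
# gamma effect with mean 1, so the marginal mean is exp(X @ true_beta).
rng = np.random.default_rng(1)
true_beta = np.array([0.5, 0.3])
clusters = []
for _ in range(200):
    x = rng.uniform(-1.0, 1.0, size=4)
    X = np.column_stack([np.ones(4), x])
    shared = rng.gamma(shape=4.0, scale=0.25)
    y = rng.poisson(shared * np.exp(X @ true_beta)).astype(float)
    clusters.append((X, y))

beta_hat, naive_cov, robust_cov = fit_gee_poisson(clusters, alpha=0.3)
wald_z = beta_hat / np.sqrt(np.diag(robust_cov))  # Wald statistics for beta = 0
print("estimate: ", beta_hat)
print("naive SE: ", np.sqrt(np.diag(naive_cov)))
print("robust SE:", np.sqrt(np.diag(robust_cov)))
print("Wald z:   ", wald_z)
```

Holding `alpha` fixed keeps the sketch short; it affects efficiency but not the consistency of the estimates, in line with the robustness to variance structure misspecification described earlier. Packaged implementations typically estimate the correlation and dispersion parameters alongside the regression parameters.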
Computation
Software for solving generalized estimating equations is available in MATLAB, SAS (proc genmod), SPSS (the gee procedure), Stata (the xtgee command) and R (packages gee and geepack).