Missing values
In statistics, missing data, or missing values, occur when no data value is stored for the variable in the current observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.
Types of missing data
Missing data can occur because of nonresponse: no information is provided for several items, or no information is provided for a whole unit. Some items are more prone to nonresponse than others, for example items about private subjects such as income.
Dropout is a type of missingness that occurs mostly when studying development over time. In this type of study the measurement is repeated after a certain period of time. Missingness occurs when participants drop out before the study ends and one or more measurements are missing.
Sometimes missing values are caused by the researchers themselves, for example if data collection was not done properly or if mistakes were made during data entry (Ader, H.J., Mellenbergh, G.J. 2008).
In addition, a great deal of missing data arise in cross-national research in economics, sociology, and political science because governments choose not to, or fail to, report critical statistics for one or more years (Messner 1992).
It is important to ask why the data are missing, because the answer can help in finding a solution to the problem. If the values are missing at random, there is still information about each variable in each unit, but if the values are missing systematically the problem is more severe because the sample can no longer be assumed to be representative of the population. For example, suppose a study examines the relation between IQ and income. If participants with an above-average IQ do not answer the question ‘What is your salary?’, the results may show no association between IQ and salary when in fact there is a relationship. Because of these problems, methodologists routinely advise researchers to design research so as to minimize the incidence of missing values (Ader, H.J., Mellenbergh, G.J. 2008).
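The IQ and income example can be illustrated with a small simulation. This is only a sketch under invented assumptions: the linear income model, its coefficients, the noise level, and the rule that everyone with an IQ above 100 withholds their salary are all hypothetical and chosen purely for illustration.

```python
import random


def pearson(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: income rises with IQ, plus noise.
people = []
for _ in range(5000):
    iq = rng.gauss(100, 15)
    income = 20000 + 300 * iq + rng.gauss(0, 8000)
    people.append((iq, income))

# Loss unrelated to IQ: about half of the answers are dropped at random.
random_loss = [p for p in people if rng.random() < 0.5]

# Systematic loss: everyone with an above-average IQ withholds the salary.
systematic_loss = [p for p in people if p[0] <= 100]

r_full = pearson(people)
r_random = pearson(random_loss)
r_systematic = pearson(systematic_loss)

print(f"full sample:     r = {r_full:.2f}")
print(f"random loss:     r = {r_random:.2f}")
print(f"systematic loss: r = {r_systematic:.2f}")
```

Under these assumptions the correlation in the systematically truncated sample comes out noticeably lower than in the full sample, while random loss of a similar size leaves it roughly unchanged.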
Techniques of dealing with missing data
Missing data reduce the representativeness of the sample and can therefore distort inferences about the population. Where possible, think about how to prevent missing data before the actual data gathering takes place. For example, computer questionnaires often do not allow a question to be skipped: a question has to be answered, otherwise one cannot continue to the next. Missing values due to the participant are thereby eliminated by this type of questionnaire. In survey research, it is common to make multiple efforts to contact each individual in the sample, often sending letters to attempt to persuade those who have decided not to participate to change their minds (Stoop et al. 2010: 161-187). However, such techniques can either help or hurt in terms of reducing the negative inferential effects of missing data, because the kind of people who are willing to be persuaded to participate after initially refusing or not being home are likely to be significantly different from the kinds of people who will still refuse or remain unreachable after additional effort (Stoop et al. 2010: 188-198).
In situations where missing data are likely to occur, the researcher is often advised to plan to use methods of data analysis that are robust to missingness. An analysis is robust when we are confident that mild to moderate violations of the technique's key assumptions will produce little or no bias, or distortion, in the conclusions drawn about the population.
Imputation
If it is known that the data analysis technique which is to be used is not robust to missingness, it is good to consider imputing the missing data. This can be done in several ways; multiple imputation is recommended. Rubin argued that even with a small number, m, of repeated imputations (m equal to or smaller than 5) the quality of estimation improves enormously (in: Ader, H.J., Mellenbergh, G.J. 2008). For most practical purposes 2 or 3 imputations capture most of the relative efficiency that could be captured with a larger number of imputations. However, low values of m can lead to a substantial loss of statistical power, and some scholars now recommend that m be set to values from 20 to 100 or more (Graham, Olchowski, and Gilreath 2007). Any multiply imputed data analysis has to be repeated for each of the m imputed data sets and, in some cases, the relevant statistics have to be combined in a relatively complicated way (Ader, H.J., Mellenbergh, G.J. 2008).
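As an illustration of the combining step, the sketch below pools m point estimates and their squared standard errors using the combining rules usually attributed to Rubin. The text does not spell these rules out, so their use here is an assumption, and the function names and the three example estimates are hypothetical. The sketch also evaluates the commonly quoted relative-efficiency approximation 1 / (1 + γ/m), where γ is an assumed fraction of missing information.

```python
def pool_estimates(estimates, variances):
    """Pool one estimate and one squared standard error per imputed data set."""
    m = len(estimates)
    pooled = sum(estimates) / m                    # average point estimate
    within = sum(variances) / m                    # average within-imputation variance
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between         # total variance of the pooled estimate
    return pooled, total


def relative_efficiency(gamma, m):
    """Approximate efficiency of m imputations relative to infinitely many."""
    return 1 / (1 + gamma / m)


# Hypothetical results from m = 3 imputed data sets.
pooled, total_variance = pool_estimates([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])

# Assumed fraction of missing information (gamma) of 0.3.
efficiency_m3 = relative_efficiency(0.3, 3)
efficiency_m20 = relative_efficiency(0.3, 20)

print(pooled, total_variance)
print(round(efficiency_m3, 3), round(efficiency_m20, 3))
```

With γ = 0.3, three imputations already reach roughly 91% relative efficiency under this approximation, which is consistent with the remark that 2 or 3 imputations capture most of it. The approximation says nothing about statistical power, which is the concern behind the larger recommended values of m.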
Examples of imputations are:
Partial imputation
The expectation-maximization algorithm is an approach in which values of the statistics which would be computed if a complete dataset were available are estimated (imputed), taking into account the pattern of missing data. In this approach, values for individual missing data items are not usually imputed.
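A minimal sketch of this idea follows, assuming a bivariate normal model in which x is always observed and y is sometimes missing. The model, the simulated data, and the function name are illustrative assumptions, not part of the text. The E-step replaces the missing contributions to the sums of y, y² and xy by their conditional expectations, and the M-step recomputes the mean, variance and covariance from those sums. No individual missing value is stored as imputed data.

```python
import random


def em_mean_of_y(xs, ys, iterations=200):
    """EM estimate of the mean of y; ys may contain None, xs is complete."""
    n = len(xs)
    mu_x = sum(xs) / n
    var_x = sum((x - mu_x) ** 2 for x in xs) / n

    # Start from complete-case values and zero covariance.
    seen = [y for y in ys if y is not None]
    mu_y = sum(seen) / len(seen)
    var_y = sum((y - mu_y) ** 2 for y in seen) / len(seen)
    cov_xy = 0.0

    for _ in range(iterations):
        beta = cov_xy / var_x                 # slope of y on x
        resid_var = var_y - beta * cov_xy     # variance of y given x
        sum_y = sum_yy = sum_xy = 0.0
        for x, y in zip(xs, ys):
            if y is None:
                # E-step: expected y and y^2 given x and current parameters
                ey = mu_y + beta * (x - mu_x)
                eyy = ey * ey + resid_var
            else:
                ey, eyy = y, y * y
            sum_y += ey
            sum_yy += eyy
            sum_xy += x * ey
        # M-step: update the parameters from the expected sums
        mu_y = sum_y / n
        var_y = sum_yy / n - mu_y ** 2
        cov_xy = sum_xy / n - mu_x * mu_y
    return mu_y


rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(2000)]
true_ys = [2 + x + rng.gauss(0, 0.5) for x in xs]   # true mean of y is 2
# Hypothetical missingness rule: y is unrecorded whenever x exceeds 0.5.
ys = [None if x > 0.5 else y for x, y in zip(xs, true_ys)]

observed_ys = [y for y in ys if y is not None]
complete_case_mean = sum(observed_ys) / len(observed_ys)
em_mean = em_mean_of_y(xs, ys)

print(f"complete-case mean: {complete_case_mean:.2f}")
print(f"EM estimate:        {em_mean:.2f}")
```

In this setup the missingness depends only on the observed x, so the EM estimate can recover the mean of y, whereas the mean of the observed y values alone is biased downward.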
Partial deletion
Methods which involve reducing the data available to a dataset having no missing values include:
- Listwise deletion/casewise deletion (albeit a naive solution)
- Pairwise deletion (albeit a naive solution)
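The difference between the two deletion methods can be shown on a toy dataset. The records, variable names, and helper functions below are hypothetical and serve only as an illustration: listwise deletion keeps a record only if every variable is present, while pairwise deletion keeps, for each pair of variables being analysed, every record in which both of those variables are present.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"iq": 110, "income": 52000, "age": 34},
    {"iq": 95, "income": None, "age": 41},
    {"iq": None, "income": 38000, "age": 29},
    {"iq": 120, "income": 61000, "age": None},
    {"iq": 102, "income": 45000, "age": 50},
]


def listwise(records):
    """Keep only records with no missing value in any variable."""
    return [r for r in records if all(v is not None for v in r.values())]


def pairwise(records, a, b):
    """Keep the (a, b) values of records where both variables are present."""
    return [(r[a], r[b]) for r in records
            if r[a] is not None and r[b] is not None]


complete_rows = listwise(rows)
iq_income = pairwise(rows, "iq", "income")
income_age = pairwise(rows, "income", "age")

print(len(complete_rows), "records survive listwise deletion")
print(len(iq_income), "records are usable for iq vs income")
print(len(income_age), "records are usable for income vs age")
```

Pairwise deletion retains more data per analysis, but each pair of variables is then analysed on a different subset of records.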
Full analysis
Methods which take full account of all information available, without the distortion resulting from using imputed values as if they were actually observed:
- The expectation-maximization algorithm
- Full information maximum likelihood estimation