Censoring (statistics)
In statistics, engineering, and medical research, censoring occurs when the value of a measurement or observation is only partially known.
For example, suppose a study is conducted to measure the impact of a drug on mortality. In such a study, it may be known that an individual's age at death is at least 75 years. Such a situation could occur if the individual withdrew from the study at age 75, or if the individual is currently alive at the age of 75.
Censoring also occurs when a value falls outside the range of a measuring instrument. For example, a bathroom scale might only measure up to 300 lb. If a 350 lb individual is weighed using the scale, the observer would only know that the individual's weight is at least 300 lb.
Types
- Left censoring – a data point is below a certain value but it is unknown by how much
- Interval censoring – a data point is somewhere on an interval between two values
- Right censoring – a data point is above a certain value but it is unknown by how much
- Type I censoring occurs if an experiment has a set number of subjects or items and stops the experiment at a predetermined time, at which point any subjects remaining are right-censored.
- Type II censoring occurs if an experiment has a set number of subjects or items and stops the experiment when a predetermined number are observed to have failed; the remaining subjects are then right-censored.
- Random (or non-informative) censoring occurs when each subject has a censoring time that is statistically independent of their failure time. The observed value is the minimum of the censoring and failure times; subjects whose failure time is greater than their censoring time are right-censored.
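The three right-censoring schemes listed above can be illustrated with a small simulation. The following is a minimal sketch, not a standard implementation: the function names, the exponential lifetimes, and the numeric settings are all assumptions chosen for illustration. Each function returns pairs of (observed time, whether the failure was actually observed).

```python
import random

def type_i_censor(failure_times, stop_time):
    """Type I: stop at a fixed time; unfailed items are right-censored there."""
    return [(min(t, stop_time), t <= stop_time) for t in failure_times]

def type_ii_censor(failure_times, n_failures):
    """Type II: stop at the n_failures-th failure; the rest are right-censored."""
    stop_time = sorted(failure_times)[n_failures - 1]
    return [(min(t, stop_time), t <= stop_time) for t in failure_times]

def random_censor(failure_times, censor_times):
    """Random censoring: observe min(failure, censoring) and which came first."""
    return [(min(t, c), t <= c) for t, c in zip(failure_times, censor_times)]

# Hypothetical data: lifetimes with mean 100 and independent censoring times
# with mean 150 (units are arbitrary; both distributions are assumptions).
rng = random.Random(0)
failures = [rng.expovariate(1 / 100.0) for _ in range(10)]
censors = [rng.expovariate(1 / 150.0) for _ in range(10)]

type_i = type_i_censor(failures, stop_time=120.0)
type_ii = type_ii_censor(failures, n_failures=4)
rand = random_censor(failures, censors)
```

Under Type I censoring the stopping time is fixed and the number of observed failures is random; under Type II censoring the number of observed failures is fixed and the stopping time is random.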
Censoring should not be confused with the related idea of truncation. With censoring, observations result either in knowing the exact value that applies, or in knowing that the value lies within an interval. With truncation, observations never result in values outside a given range: values in the population outside the range are never seen, or are never recorded if they are seen. Note that in statistics, truncation is not the same as rounding.
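The distinction can be made concrete with the bathroom-scale example. In the sketch below, the function names and the sample weights are invented for illustration: a censored sample still contains one record per individual, whereas a truncated sample silently loses the individuals above the limit.

```python
SCALE_LIMIT_LB = 300.0  # upper limit of the hypothetical bathroom scale

def censor_at_limit(weights, limit=SCALE_LIMIT_LB):
    """Censoring: everyone is recorded; over-limit values become (limit, True)."""
    return [(min(w, limit), w > limit) for w in weights]

def truncate_at_limit(weights, limit=SCALE_LIMIT_LB):
    """Truncation: individuals over the limit never enter the data set at all."""
    return [w for w in weights if w <= limit]

true_weights = [150.0, 220.0, 350.0, 310.0]  # made-up values for illustration

censored = censor_at_limit(true_weights)
truncated = truncate_at_limit(true_weights)

print(censored)   # [(150.0, False), (220.0, False), (300.0, True), (300.0, True)]
print(truncated)  # [150.0, 220.0]
```

With the censored sample, an analyst knows that two of four individuals weigh at least 300 lb; with the truncated sample, there is no trace that those individuals existed.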
The problem of censored data, in which the observed value of some variable is partially known, is related to the problem of missing data, where the observed value of some variable is unknown.
Interval censoring can occur when observing a value requires follow-ups or inspections. Left and right censoring are special cases of interval censoring, with the beginning of the interval at zero or the end at infinity, respectively.
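Because left and right censoring are special cases of interval censoring, one convenient way to store such data is as a pair of bounds per observation. The sketch below assumes a non-negative quantity (so the smallest possible lower bound is zero); the class name `Observation` and its `kind` property are hypothetical and not taken from any particular library.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Observation:
    """A non-negative value known only to lie between lower and upper."""
    lower: float
    upper: float

    @property
    def kind(self) -> str:
        if self.lower == self.upper:
            return "exact"
        if self.lower == 0.0 and math.isinf(self.upper):
            return "uninformative"      # nothing is known beyond non-negativity
        if self.lower == 0.0:
            return "left-censored"      # interval begins at zero
        if math.isinf(self.upper):
            return "right-censored"     # interval ends at infinity
        return "interval-censored"

observations = [
    Observation(42.0, 42.0),       # value observed exactly
    Observation(0.0, 5.0),         # below a limit of 5
    Observation(75.0, math.inf),   # at least 75
    Observation(12.0, 18.0),       # changed state between two inspections
]
kinds = [obs.kind for obs in observations]
```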
Left-censored data are observed, for example, in environmental analytical data, where trace concentrations of chemicals may indeed be present in an environmental sample (e.g., groundwater, soil) but are "non-detectable", i.e., below the detection limit of the analytical instrument or laboratory method. Estimation methods for left-censored data vary, and not every method is applicable to, or the most reliable for, all data sets.
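As one illustration of how non-detects can enter an estimate, a likelihood-based approach treats each detected value as contributing a density term and each non-detect as contributing the probability of falling below its detection limit. The notation here is an assumption introduced for this sketch: a parametric model with density $f$ and cumulative distribution function $F$, detected values $x_i$ (index set $D$), and non-detects with detection limits $d_j$ (index set $C$).

```latex
L(\theta) \;=\; \prod_{i \in D} f(x_i;\theta) \;\times\; \prod_{j \in C} F(d_j;\theta)
```

This is only one of several possible approaches, and its reliability depends on how well the assumed distribution fits the data.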
Epidemiology
One of the earliest attempts to analyse a statistical problem involving censored data was Daniel Bernoulli's 1766 analysis of smallpox morbidity and mortality data to demonstrate the efficacy of inoculation.
Operating life testing
Reliability testing often consists of conducting a test on an item (under specified conditions) to determine the time it takes for a failure to occur.
- Sometimes a failure is planned and expected but does not occur, because of operator error, equipment malfunction, a test anomaly, etc. The test result is not the desired time-to-failure but can be (and should be) used as a time-to-termination. The use of censored data is unintentional but necessary.
- Sometimes engineers plan a test program so that, after a certain time limit or number of failures, all other tests will be terminated. These suspended times are treated as right-censored data. The use of censored data is intentional.
An analysis of the data from replicate tests includes both the times-to-failure for the items that failed and the time-of-test-termination for those that did not fail.
Analysis
Special techniques may be used to handle censored data. Tests with specific failure times are coded as actual failures; censored data are coded for the type of censoring and the known interval or limit. Special software programs (often reliability-oriented) can perform maximum likelihood estimation to obtain summary statistics, confidence intervals, etc.
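A simple case shows how maximum likelihood uses both failures and suspensions. The sketch below assumes an exponential (constant failure rate) model, which is chosen only because it has a closed-form solution; the function name and the test data are hypothetical. Each failure at time t contributes the density λ·exp(−λt) to the likelihood and each suspended item contributes the survival probability exp(−λt), so the log-likelihood is d·log λ − λ·Σt, which is maximized at λ = d / Σt, where d is the number of failures and Σt is the total time on test.

```python
import math

def exponential_mle(times, failed):
    """MLE of a constant failure rate from right-censored data.

    times  -- time-to-failure or time-of-test-termination for each item
    failed -- True if the item failed, False if it was suspended (right-censored)
    """
    n_failures = sum(failed)
    total_time = sum(times)            # censored items still add exposure time
    if n_failures == 0:
        raise ValueError("at least one observed failure is needed")
    rate = n_failures / total_time
    # Approximate 95% Wald interval computed on the log(rate) scale.
    half_width = 1.96 / math.sqrt(n_failures)
    return rate, (rate * math.exp(-half_width), rate * math.exp(half_width))

# Made-up test results: three failures and two items suspended at 500 hours.
times = [120.0, 340.0, 500.0, 500.0, 210.0]
failed = [True, True, False, False, True]

rate, (lower, upper) = exponential_mle(times, failed)
mean_time_to_failure = 1.0 / rate
```

Discarding the two suspended items, or treating their termination times as failures, would both bias the estimated rate upward; the likelihood approach instead counts their time on test without counting them as failures.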
External links
- "Engineering Statistics Handbook", NIST/SEMATECH, http://www.itl.nist.gov/div898/handbook/
See also
- Survival analysis
- Kaplan–Meier estimator
- Data analysis
- Reliability (statistics)
- Imputation (statistics)
- Censored regression model