Testing hypotheses suggested by the data
In statistics, hypotheses suggested by the data, if tested using the same data set that suggested them, are likely to be accepted even when they are not true. Circular reasoning is involved: something seems true in the limited data set, so we hypothesize that it is true in general, and then we (wrongly) test it on that same limited data set, which appears to confirm it. Generating hypotheses from data already observed, without testing them on new data, is referred to as post hoc theorizing.
The correct procedure is to test any hypothesis on a data set that was not used to generate the hypothesis.
Example of fallacious acceptance of a hypothesis
Suppose fifty different researchers, unaware of each other's work, run clinical trials to test whether Vitamin X is efficacious in treating cancer. Forty-nine of them find no significant difference between measurements on patients who took Vitamin X and those who took a placebo. The fiftieth study finds a big difference, but of a size one would expect to see in about one of every fifty studies even if Vitamin X has no effect at all, purely by chance (for example, patients who were going to recover anyway may disproportionately end up in the Vitamin X group rather than the control group, which can happen because the entire population of cancer patients cannot be included in the study). When all fifty studies are pooled, one would say no effect of Vitamin X was found, because the positive result was no more frequent than chance predicts, i.e. it was not statistically significant.
However, it would be reasonable for the investigators running the fiftieth study to consider it likely that they have found an effect, at least until they learn of the other forty-nine studies. Now suppose that the one anomalous study was conducted in Denmark. The data suggest the hypothesis that Vitamin X is more efficacious in Denmark than elsewhere. But Denmark was, by chance, the one country in fifty in which an extreme value of the test statistic happened to occur; one expects such extreme cases about once in fifty on average when no effect is present. It would therefore be fallacious to cite the data as serious evidence for this particular hypothesis suggested by the data.
However, if another study is then done in Denmark and again finds a difference between the vitamin and the placebo, then the first study strengthens the case provided by the second study. Or, if a second series of studies is done on fifty countries, and Denmark stands out in the second study as well, the two series together constitute important evidence even though neither by itself is at all impressive.
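The Denmark scenario can be reproduced in a short simulation. The sketch below is a toy model, not a reconstruction of any real study: it assumes unit-variance normal outcomes, a one-sided z-test, and illustrative choices for the country count, sample size, and threshold. It singles out the best-looking of fifty null "countries" and shows that re-testing that country on the very data that suggested it is almost always "significant", while testing it on a fresh sample is significant only about 5% of the time.

```python
import random
from statistics import NormalDist, mean

random.seed(1)

N = 30            # patients per country (illustrative)
COUNTRIES = 50
ALPHA = 0.05
RUNS = 2000
norm = NormalDist()

def p_value(sample):
    """One-sided z-test of 'mean > 0', assuming known unit variance."""
    z = mean(sample) * len(sample) ** 0.5
    return 1 - norm.cdf(z)

reuse_hits = fresh_hits = 0
for _ in range(RUNS):
    # No country has any real effect: every observation is pure noise.
    data = [[random.gauss(0, 1) for _ in range(N)] for _ in range(COUNTRIES)]
    best = max(range(COUNTRIES), key=lambda c: mean(data[c]))  # "Denmark"
    # Fallacy: test the suggested hypothesis on the data that suggested it.
    if p_value(data[best]) < ALPHA:
        reuse_hits += 1
    # Sound procedure: collect a fresh sample for the singled-out country.
    fresh = [random.gauss(0, 1) for _ in range(N)]
    if p_value(fresh) < ALPHA:
        fresh_hits += 1

print(f"'significant' when reusing the suggesting data: {reuse_hits / RUNS:.2f}")
print(f"'significant' on a fresh confirmation sample:   {fresh_hits / RUNS:.2f}")
```

The reuse rate comes out near 1 − 0.95⁵⁰ ≈ 0.92, because picking the maximum of fifty null test statistics almost guarantees an extreme one; the fresh-sample rate stays near the nominal 5%.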
The general problem
Testing a hypothesis suggested by the data can very easily result in false positives (type I errors). If one looks long enough and in enough different places, eventually data can be found to support any hypothesis. Unfortunately, such positive data do not by themselves constitute evidence that the hypothesis is correct. The negative test data that were thrown out are just as important, because they indicate how common the positive results are compared to chance. Running an experiment, seeing a pattern in the data, proposing a hypothesis from that pattern, and then using the same experimental data as evidence for the new hypothesis is extremely suspect, because data from all other experiments, completed or potential, have essentially been "thrown out" by choosing to look only at the experiments that suggested the new hypothesis in the first place.
A large set of tests as described above greatly inflates the probability of type I error, since all but the data most favorable to the hypothesis are discarded. This is a risk not only in hypothesis testing but in all statistical inference, as it is often problematic to describe accurately the process that has been followed in searching for and discarding data. In other words, one wants to keep all data (regardless of whether they tend to support or refute the hypothesis) from "good tests", but it is sometimes difficult to figure out what a "good test" is. It is a particular problem in statistical modelling, where many different models are rejected by trial and error before publishing a result (see also overfitting, publication bias).
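The inflation is easy to quantify. Under a true null hypothesis a p-value is uniformly distributed, so with m independent tests each run at level α, the chance that at least one comes out "significant" is 1 − (1 − α)^m. A minimal stdlib-Python sketch (the counts are illustrative):

```python
import random

random.seed(0)

M = 50               # independent tests, every null hypothesis true
ALPHA = 0.05
RUNS = 5000

at_least_one = 0
for _ in range(RUNS):
    # Under a true null hypothesis, a p-value is uniform on [0, 1].
    p_values = [random.random() for _ in range(M)]
    if min(p_values) < ALPHA:
        at_least_one += 1

observed = at_least_one / RUNS
analytic = 1 - (1 - ALPHA) ** M       # familywise error rate
print(f"simulated: {observed:.3f}, analytic: {analytic:.3f}")
```

With fifty tests at the 5% level, the familywise error rate is about 92%: a "significant" finding somewhere is the expected outcome, not a surprise.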
The error is particularly prevalent in data mining and machine learning. It also commonly occurs in academic publishing, where only reports of positive, rather than negative, results tend to be accepted, resulting in the effect known as publication bias.
Correct procedures
All strategies for sound testing of hypotheses suggested by the data involve including a wider range of tests in an attempt to validate or refute the new hypothesis. These include:
- Collecting confirmation samples
- Cross-validation
- Methods of compensation for multiple comparisons
- Simulation studies including adequate representation of the multiple testing actually involved
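One common method of compensating for multiple comparisons is the Bonferroni correction (chosen here purely as an illustration; the text above does not single out any particular method): each of m hypotheses is tested at level α/m, which keeps the familywise error rate at most α. A stdlib-Python sketch under the all-nulls-true assumption:

```python
import random

random.seed(0)

M = 50          # simultaneous tests, all of true null hypotheses
ALPHA = 0.05
RUNS = 5000

naive = corrected = 0
for _ in range(RUNS):
    # Under a true null, each test's p-value is uniform on [0, 1].
    p_values = [random.random() for _ in range(M)]
    if min(p_values) < ALPHA:          # test each hypothesis at alpha
        naive += 1
    if min(p_values) < ALPHA / M:      # Bonferroni-adjusted threshold
        corrected += 1

print(f"familywise error, uncorrected: {naive / RUNS:.3f}")
print(f"familywise error, Bonferroni:  {corrected / RUNS:.3f}")
```

The uncorrected rate lands near 0.92, the Bonferroni-corrected rate near the nominal 0.05; the price of the correction is reduced power to detect real effects.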
Henry Scheffé's simultaneous test of all contrasts in multiple comparison problems is the best-known remedy in the case of analysis of variance. It is a method designed for testing hypotheses suggested by the data while avoiding the fallacy described above.
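A sketch of the idea, using NumPy/SciPy (the group sizes and the particular data-suggested contrast are illustrative assumptions, not part of the article): a contrast chosen after looking at the data is declared significant only if its scaled F ratio exceeds the Scheffé critical value, which simultaneously covers every possible contrast among the group means.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k, n = 4, 10                                   # 4 groups, 10 obs each
groups = [rng.normal(0.0, 1.0, n) for _ in range(k)]   # null: equal means

means = np.array([g.mean() for g in groups])
N = k * n
mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / (N - k)

# Data-suggested contrast: the best-looking group against the rest.
c = np.full(k, -1.0 / (k - 1))
c[means.argmax()] = 1.0                        # coefficients sum to zero

L = c @ means                                  # estimated contrast
f_stat = L ** 2 / (mse * (c ** 2 / n).sum()) / (k - 1)
crit = stats.f.ppf(0.95, k - 1, N - k)         # Scheffé critical value

print(f"F = {f_stat:.2f}, Scheffé critical value = {crit:.2f}")
print("significant" if f_stat > crit else "not significant")
```

Because the critical value bounds the maximum over all contrasts, cherry-picking the most extreme-looking contrast after seeing the data no longer inflates the type I error rate, at the cost of a more conservative test.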
See also
- Data analysis
- Data dredging
- Data-snooping bias
- Exploratory data analysis
- Predictive analytics
- Texas sharpshooter fallacy
- Type I and type II errors
- Uncomfortable science