For each dataset, there is a limit to what we can use that dataset to test. Using the standard p-value based methods of science, the more hypotheses we check against the data, the more likely it will be that some of these checks give inaccurate conclusions. And this presents a big problem for the way science is practiced.
Let’s take an example to illustrate the principle. Suppose that you have information about 1000 people selected at random from the U.S. adult population. Your dataset includes these people’s heights, weights, ages, shoe sizes, and so forth. Now, if your goal is to know the mean height of all adults in America, you can produce an estimate of this quantity by averaging the heights of the 1000 people you have information about. Despite the fact that your sample contains just 1000 people, rather than the full set of 230,000,000 or so American adults of interest, your estimate will, with high probability, be within a couple of inches of the true population mean height. This is because the 1000 people were sampled at random (so we shouldn’t expect our sample to differ from the entire population in a systematic way) and because the standard deviation of heights is not very large (if there were tremendous outliers in the data, such as 500-foot-tall giants, we would need more samples to get an accurate estimate). This idea is made precise by the central limit theorem. It tells us how likely the true population mean is to fall various distances from our sample estimate, and says that the error of our estimate shrinks like one over the square root of the size of our sample.
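To make this concrete, here is a minimal simulation sketch in Python (using a made-up normal distribution of heights rather than real census data) showing how the typical error of a sample mean shrinks roughly like one over the square root of the sample size:

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up but plausible height distribution: mean 67 inches, SD 4 inches.
    TRUE_MEAN, SD = 67.0, 4.0

    for n in (10, 100, 1_000, 10_000):
        # Draw many samples of size n and see how far each sample mean lands
        # from the true mean. The typical error shrinks roughly like 1/sqrt(n).
        errors = [abs(rng.normal(TRUE_MEAN, SD, size=n).mean() - TRUE_MEAN)
                  for _ in range(500)]
        print(f"n = {n:>6}: typical error of the sample mean ≈ {np.mean(errors):.3f} inches")

Quadrupling the sample size roughly halves the typical error, which is exactly the one-over-square-root behavior described above.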
The same technique could work to approximate the mean weight of adult Americans, or age, or shoe size, or number of children. And in each case, the estimate would, with high probability, be quite accurate. We could even, if we liked, estimate all these quantities simultaneously, since we collected all of this information about each of our 1000 people. But the more quantities we estimate, the greater the chance that at least one estimate is quite inaccurate. Since each estimate has some chance of being bad, if we make a sufficiently large number of estimates we should expect to get unlucky at some point and end up with one or more bad ones. So, if we aren’t just estimating mean height, but rather the means of 50 different traits, we cannot claim that all 50 of these estimates are likely to be good. We should expect that some of them will be inaccurate, though we don’t know which ones.
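To put rough numbers on this, suppose (purely for illustration) that each individual estimate independently has a 5% chance of being badly off. A quick calculation shows how fast the odds of at least one bad estimate grow:

    # Illustrative only: assume each of k estimates independently has a 5% chance
    # of being badly off (the 5% is a made-up figure, not a universal constant).
    p_bad = 0.05
    for k in (1, 5, 20, 50):
        p_at_least_one_bad = 1 - (1 - p_bad) ** k
        print(f"{k:>2} estimates: chance at least one is bad ≈ {p_at_least_one_bad:.0%}")

With 50 estimates, the chance that at least one is bad is over 90%, even though each estimate on its own is probably fine.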
This is where problems arise. Suppose that you are a researcher who is trying to find interesting differences between, say, southerners and northerners in the United States. Your dataset of 1000 adults contains 500 people from each group. What do you do? Well, it might seem reasonable to go ahead and compute the mean value of many different traits, and look at how these means differ between the two groups, to see if you can find any large differences that seem interesting. For instance, you may compute the average salary of each group, and see if they deviate from each other by a large enough amount to be deemed statistically significant. If they don’t, you can try another trait like IQ, or number of children, and repeat the process. If you try enough different traits, hopefully you’ll eventually find an intriguingly large difference between the groups.
The trouble is, we know that if you estimate a large number of quantities, some of them will be inaccurate, and so some of the apparent differences between your two groups may just be due to these inaccuracies. If you test enough traits, you will eventually find differences between the populations that look significant, even though they are just the result of chance.
In fact, even if northerners and southerners had no systematic differences between them, there would still be apparent differences arising just from the particular sample of 1000 people you happened to have data on. For example, in your dataset, it just might happen that the northerners have fewer children than the southerners, even if this isn’t true of the underlying populations of all northerners and southerners. If you were to publish this finding without mentioning the number of hypotheses you tested before finding it, it might seem that you had produced a meaningful result. In fact, the assessment of this result should take into account the number of hypotheses (e.g. northerners have smaller shoe sizes than southerners, northerners have greater salaries than southerners, etc.) that you tested before you discovered this one (and the p-values can be modified to incorporate this information). The most significant-seeming deviation between the groups, found after testing 100 different hypotheses, is very likely greatly inflated by chance. Whereas if you had only tested a small number of hypotheses against your data and found a strong result, this would likely be a meaningful finding.
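Here is a hedged sketch of that effect in Python: two groups drawn from exactly the same distribution, 100 invented “traits” compared with t-tests, and a count of how many differences look “significant” at p < 0.05 before and after a Bonferroni correction (one standard, if conservative, way of modifying the significance threshold for the number of hypotheses tested):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    n_per_group, n_traits, alpha = 500, 100, 0.05

    # Both "northerners" and "southerners" are drawn from the *same* distribution,
    # so any difference that turns up is pure sampling noise.
    north = rng.normal(0, 1, size=(n_per_group, n_traits))
    south = rng.normal(0, 1, size=(n_per_group, n_traits))

    pvals = np.array([stats.ttest_ind(north[:, j], south[:, j]).pvalue
                      for j in range(n_traits)])

    print("'Significant' at 0.05 with no correction:", np.sum(pvals < alpha))
    # Bonferroni: demand p < alpha / (number of hypotheses tested).
    print("Still 'significant' after Bonferroni:    ", np.sum(pvals < alpha / n_traits))

With no correction you typically see a handful of spurious “significant” differences; after dividing the threshold by the number of tests, they generally disappear.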
As a general rule, the greater the number of data points you have, the larger the number of quantities you can accurately estimate from your dataset. On a set of just 10 points, you may not even be able to get an accurate estimate of the mean value of a single trait (unless the trait had a very small standard deviation). Whereas on a dataset of a billion points, you probably could estimate dozens of quantities accurately.
Unfortunately, when you’re reading a paper, there is no way to tell how many hypotheses the researcher tested on his dataset unless he chooses to report it. And there is a strong incentive to obscure this information. If a researcher discloses that he tested 20 hypotheses before finding one that was statistically significant, readers may discount the result, or reviewers may reject it for publication. And if the researcher spent a lot of time and money collecting his dataset, it would feel like a waste to give up on the data just because his first five hypotheses didn’t pan out. It might take a lot of restraint not to just keep testing hypothesis after hypothesis until he finds something publishable.
But even if individual researchers were scrupulously careful, that wouldn’t fully resolve the problem. When a hypothesis is confirmed by a dataset, we must consider whether it is truly a confirmation of the hypothesis being tested, or a consequence of the fact that 20 researchers tested 20 false hypotheses, and this one of the 20 happened to look true by chance. That is, if enough hypotheses are tested overall, we should expect a fair number of false hypotheses among them to end up looking true.
What makes this problem more pernicious is that when a hypothesis fails to pan out, the result is often not published. This is because hypothesis disconfirmations (e.g. “no association was found between cabbage eating and longevity”) are generally less interesting and harder to publish than confirmations (e.g. “an association was found between cabbage eating and longevity”). But since most new hypotheses in science turn out to be false, we should expect the number of negative results to be very large (except in situations where previously well-validated results are being confirmed). Hence, the number of published test results will be much smaller than the total number of tests conducted, with test failures substantially underreported. So there is no good way to tell how many times a hypothesis failed to be confirmed before one researcher finally ran a test that seemed to confirm it. And if a very large number of false hypotheses are tested, but mostly just the ones that turn out to look true are published, you could end up with a field’s journals being flooded with false but seemingly verified hypotheses. In exploratory fields where almost all hypotheses are false, and where disconfirmations of a hypothesis are almost never published, you might even get into a situation where most published research findings are false.
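A back-of-the-envelope calculation, with made-up but not outlandish numbers, shows how that last situation can arise. Suppose only 5% of the hypotheses tested in some exploratory field are true, tests have 80% power, the significance threshold is 0.05, and only positive results get published:

    # All numbers here are illustrative assumptions, not measurements.
    prior_true = 0.05   # fraction of tested hypotheses that are actually true
    power      = 0.80   # chance a true hypothesis yields a significant result
    alpha      = 0.05   # chance a false hypothesis yields a significant result

    true_positives  = prior_true * power          # 0.04
    false_positives = (1 - prior_true) * alpha    # 0.0475

    frac_published_false = false_positives / (true_positives + false_positives)
    print(f"Fraction of published 'positive' findings that are false: {frac_published_false:.0%}")

Under these assumptions, a bit over half of the published positive findings are false, and the fraction only gets worse as the share of true hypotheses being tested shrinks.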
Because this idea is well established, reporting the number of hypotheses tested should have become standard practice by now. Anyone who doesn’t report it is obviously engaged in sophistry.
Come to think of it, this is probably why most statistical science is just wrong. The p=0.05 value they calculated is mathematically wrong because they calculated it as if they had only tested one hypothesis, and didn’t modify it to take into account the number of hypotheses checked. (Before just now, I knew the 0.05 was wrong, I just thought it was due to something else.)
I’m going to have to double-check that I properly adjust my theories, as well. Okay, it matches the data…but on what try? It’s not too important because I always test predictively as well, which means on new data. But I’d still like to use it to correctly assess how likely my prediction is a priori, even if I’m going to test it regardless.
There is a journal of negative results. But frankly it should be the most prestigious journal. Number one. Since we know no natural prestige accrues to negative results, to get them widely known, we’re gonna have to lavishly adorn them with artificial prestige. I’d rather null results be over-reported than the reverse. How about you?
Someone got me a book with lots of “facts” about left-handers. Many of them just didn’t make sense, things along the lines of “left-handers are more likely to excel at tennis and do poorly at pool.” I suspected that those statistics weren’t accurate, but I could imagine why. What you’re describing seems very likely. People want to find differences between lefties and righties, but they don’t care what those differences are.
First off — nice blog. I was skeptical at first, as it seemed that you were reproducing themes that clearly existed in LessWrong posts. I retract that skepticism. I really like what you did with ending “How Great We Are” with the practical worksheet, as well as your synthesis of both LW material with the work of Ariely, Wiseman, etc. I’m a fan!
This post surprised me greatly! I had no idea this was occurring or even potentially common. I’ve read some papers in which it is stated, “The data collected aligned well with H1, but not with H2.” I, perhaps wrongly, read this as the researchers having made their predictions well before the data was collected.
It seems that you are suggesting that:
1) Data is collected and then potential hypotheses are examined for best fit?
2) Data is collected, and if original, pre-data-collection hypotheses fail, they a) don’t state this and b) carry on with #1 like nothing happened?
3) The same as #2, except that new hypotheses are not sought; the results are simply never made public.
Is that a reasonable read of the land?
Hi Hendy, thanks for the comment. It’s definitely difficult to write about rationality without having some overlap with Less Wrong, for obvious reasons. Hopefully when that happens I’ll at least explain the concepts in a somewhat different way or put my own spin on them.
The extent to which the problems in this post occur vary a lot from field to field and researcher to researcher. I know of cases where results were purely data mined, with the researcher trying hypothesis after hypothesis on a fixed data set until something statistically significant was found. Unless this is done very carefully, it is very bad science.
What is probably more common is that the researcher does come up with hypotheses beforehand, but they come up with quite a number of them. Then their paper emphasizes those few results that are statistically significant, but doesn’t adjust the p-values for the many hypotheses tested.
What is perhaps more common still is when a researcher runs multiple experiments and only publishes the ones that lead to statistically significant outcomes. This is quite understandable, as journals are less likely to accept negative results for publication, and besides, it doesn’t always seem like a good investment of time to write up a boring, negative result. Of course, the more unpublished negative results they have for each positive result, the more likely it is that the positive result is just a consequence of testing so many hypotheses.
Then, there is the macro scale problem of many researchers testing many different hypotheses, with a selection bias in favor of publishing the ones that work out.
It is difficult to generalize about the prevalence of all of this, though, because it depends so much on the field, and because it is not always obvious when you read a paper how the results were arrived at. When you’re in a research area where you can easily replicate a promising result in order to get a p-value of 0.00001, these issues are obviously much less of a problem.
Here is a disturbing statistic from the New York Times:
“In a survey of more than 2,000 American psychologists scheduled to be published this year, Leslie John of Harvard Business School and two colleagues found that 70 percent had acknowledged, anonymously, to cutting some corners in reporting data. About a third said they had reported an unexpected finding as predicted from the start, and about 1 percent admitted to falsifying data.”
Mind blown. I knew about the risks of sampling bias at the outset of a study and selective reporting of results at the end, but it never occurred to me that you could, in a sense, bake both of those flaws in to the heart of it!
It’s understandable that a researcher would not want the effort of collecting a data set to “go to waste”, though. Is there a possibility for rehabilitating the 20th hypothesis, once found? E.g., treat it as a potentially promising lead and gather a fresh data set to see if it repeats?
Great question. Treating the 20th hypothesis as a promising lead and gathering fresh data works well. Another approach, rarely taken in most fields: after collecting the data but before doing any analysis, hold some of that data off to the side and don’t touch it. At the very end, test that really promising 20th hypothesis (and only that 20th hypothesis) against the never-been-touched data, and see if it still holds up!
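To spell that out, here is a rough sketch of the holdout idea in Python (the dataset, group labels, and split fraction are all hypothetical): set aside a confirmation slice before any exploration, dredge freely on the rest, and then test only the single most promising hypothesis on the untouched slice.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)

    # Pretend dataset: 1000 people, 50 traits, and a north/south label.
    # (All numbers and names here are hypothetical.)
    traits = rng.normal(0, 1, size=(1000, 50))
    is_north = rng.integers(0, 2, size=1000).astype(bool)

    # 1) Before touching anything, set aside a confirmation holdout (~30%).
    holdout_mask = rng.random(1000) < 0.3
    explore, confirm = ~holdout_mask, holdout_mask

    # 2) Explore freely on the exploration set: test every trait.
    explore_p = [stats.ttest_ind(traits[explore & is_north, j],
                                 traits[explore & ~is_north, j]).pvalue
                 for j in range(traits.shape[1])]
    best_trait = int(np.argmin(explore_p))

    # 3) Test only that single best trait on the never-touched holdout.
    final_p = stats.ttest_ind(traits[confirm & is_north, best_trait],
                              traits[confirm & ~is_north, best_trait]).pvalue
    print(f"Trait {best_trait} looked best while exploring; "
          f"on the holdout its p-value is {final_p:.3f}")

Because only one pre-chosen hypothesis ever touches the holdout, its p-value doesn’t need to be adjusted for all the dredging that happened on the exploration set.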