If someone handed you a piece of fruit they’d painted over and told you that it was a lime, you could test that default (null) hypothesis by cutting it open. If you then saw something that looks nothing like a lime, you could probably reject the hypothesis that it was a lime. (If you were really cutting a lime open, there’s probably much less than a 5% chance that you’d see something that looks at least this different from a lime when you cut it open.) Image by davisuko on Unsplash.

Demystifying p-values

There is a tremendous amount of confusion about what p-values actually are, despite their widespread use in science. Here is my attempt to explain the concept of p-values concisely and clearly (including why they are useful and what often goes wrong with them).

— What’s a p-value? —

If you run a study, then (all else equal, aside from rare edge cases) the lower the p-value, the lower the chance that your results are due to random chance or luck.

More precisely: a p-value is the probability you’d get a result at least as extreme as what you got IF there were actually no effect (or if some other pre-specified “null hypothesis” is true).

So it’s a probability calculated based on assuming that there is no effect (or assuming that a pre-specified “null hypothesis” is true). Here the phrase “no effect” would mean, in the case of a study on a new medicine, that the medicine doesn’t do anything.

To put it in terms of coin flips: suppose you’re trying to decide if a coin is fair (i.e., if it has an equal chance of landing on heads and tails – so that’s your “null hypothesis” in this context). You flip the coin 100 times and get 60 heads. You calculate the p-value (p=0.06).

This p-value tells you there’s a 6% chance you’d get 60 or more heads OR 60 or more tails out of 100 flips if the coin were actually fair.
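
For the curious, here is a minimal Python sketch (standard library only; the helper name prob_of_exactly is just mine for illustration) that reproduces this coin-flip calculation. The exact two-sided value comes out to about 0.057, which is where the roughly 6% figure above comes from.

```python
# A minimal sketch (standard library only) of the coin-flip p-value described
# above: the probability of a result at least as extreme as 60 heads in 100
# flips, assuming the coin is actually fair. The helper name is illustrative.
from math import comb

n_flips = 100
observed_heads = 60

def prob_of_exactly(k_heads: int) -> float:
    """Probability of exactly k heads in n_flips tosses of a fair coin."""
    return comb(n_flips, k_heads) * 0.5 ** n_flips

# One-sided tail: probability of 60 or more heads.
upper_tail = sum(prob_of_exactly(k) for k in range(observed_heads, n_flips + 1))

# Two-sided p-value: 60+ heads OR 60+ tails. By the symmetry of a fair coin,
# that's just twice the upper tail.
p_value = 2 * upper_tail
print(round(p_value, 3))  # 0.057, which is roughly the "6% chance" quoted above
```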

What makes p-values useful is that when they are high, you usually can’t rule out your effect being due to random chance or luck. And, when they are very low, random chance is (in most cases) unlikely to be the explanation for your result.


— What’s the problem with p-values? —

In social science, p<0.05 is often used as the cutoff for a “successful” result (i.e., researchers treat the effect as real and the result as potentially publishable). This is an arbitrary cutoff; there’s nothing special about 0.05. The phrase “statistically significant” is defined simply to mean that p<0.05.

There are many ways that p-values get commonly misused, creating lots of problems. For instance:

• p-values often get misinterpreted as the probability that an effect is not real (recall: a p-value is actually the probability of getting a result at least this extreme if there is no effect, which is not the same thing)

• If you see one study where the main finding’s p-value is, say, 0.05, and another study where the main finding’s p-value is, say, 0.01, it’s tempting to conclude that the finding of the 2nd study is much less likely to be the result of chance (e.g., 1/5th as likely) than the 1st study’s finding. Unfortunately, we can’t draw this conclusion. The probability that a study’s finding is the result of chance is not the same as the p-value, and in fact, it can’t even be calculated just by knowing the p-value.

• Because a p-value threshold is often used for a result to be publishable (p<0.05 in social science), researchers sometimes engage in fishy methods to get their p-values below the threshold. This is known as “p-hacking.”

• Researchers sometimes focus on a result’s p-value (or “statistical significance”) at the expense of other factors that are also important. For instance, a result may have a low p-value but reflect an effect so weak that it’s totally useless or uninteresting.

• While a low p-value helps you rule out the possibility that your effect is merely due to random chance, unfortunately, that’s all it helps you with. But researchers sometimes act as though it tells them more than that. Even an extremely low p-value doesn’t mean an effect is “real” or that the effect means what you think. Low p-values can result from a variety of causes, including mistakes in experimental design or confounds.

Here’s another way to think about what a p-value is and isn’t that some people find helpful: a p-value does not tell you the probability that your result is due to chance. It tells you how consistent your results are with being due to chance. (I’m paraphrasing from here.) So, the lower the p-value, the less consistent your results are with them being due to chance.

It’s interesting to note that, empirically, results with lower p-values are more likely to be genuine effects (i.e., not false positives). I looked at results for 325 psychology study replications, and when the original study p-value was at most 0.01, about 72% replicated. When p>0.01, only 48% did.

Ultimately, p-values are a useful (though often abused) statistical tool.


— BONUS APPENDIX: what’s the chance of a hypothesis being “true” if p<0.05? —

One annoying thing about p-values is that they don’t answer the question we are usually interested in. Usually, we want to know something like “What’s the probability that my hypothesis is true?” or “What’s the probability that the effect of this drug is bigger than X?” but p-values don’t tell us those things.

However, we can put a different spin on p-values to get them to answer questions that are closer to what we’re really interested in. Let’s think of p-values as giving us a decision procedure (in an overly simplified world where you either “believe” in an effect or you fail to believe in it). 

Suppose you test 100 totally separate, previously unexplored hypotheses about humans, and suppose that you commit to “believe” a hypothesis is true if and only if you get p<0.05 (and otherwise, you don’t believe it).

I think it’s realistic that in a social science context, most hypotheses studied will be false since discovering novel, publishable hypotheses about humans is hard. So let’s suppose that 80% of the hypotheses you test are *not* true. 

Finally, suppose that you use a large enough number of participants in your studies so that, if you are testing for the presence of a real effect, there is an 80% chance you’ll detect it (i.e., get p<0.05). This 80% figure is a common recommendation for “statistical power.”

Under these assumptions, if you test 100 hypotheses, then you will end up believing in 20 of them, of which 80% will be true and the other 20% will be false positives. In other words, of the results you believe in, 80% will be correct! Of course, this assumes no mistakes are made in the process of designing the experiment, running the statistics, and so on.

Here’s how the math works out if you’re curious:

• Out of the 100 hypotheses, 20 will be true, and of those, you’ll believe 16 = 0.80 * 20 (these are the true positives) and fail to believe 4 (these are the false negatives).

• Out of the 100 hypotheses, 80 will be false, and of those, you’ll believe 4 = 0.05 * 80 (these are the false positives), and you’ll reject 76 (these are the true negatives).
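
If it helps to see this bookkeeping spelled out, here is a short Python sketch (the function belief_breakdown and its labels are mine, purely for illustration; the inputs are the assumptions above: a 20% base rate of true hypotheses, 80% power, and a p<0.05 belief threshold).

```python
# A sketch of the bookkeeping above. Function and key names are illustrative;
# the inputs mirror the assumptions in the text (20% of hypotheses true,
# 80% statistical power, believe a hypothesis if and only if p < 0.05).

def belief_breakdown(n_hypotheses: float, base_rate: float,
                     power: float, alpha: float) -> dict:
    n_true = n_hypotheses * base_rate            # hypotheses that are actually true
    n_false = n_hypotheses - n_true              # hypotheses that are actually false
    true_positives = power * n_true              # real effects you detect and believe
    false_negatives = n_true - true_positives    # real effects you miss
    false_positives = alpha * n_false            # non-effects that slip under alpha
    true_negatives = n_false - false_positives   # non-effects you correctly reject
    believed = true_positives + false_positives
    return {
        "true positives": true_positives,
        "false negatives": false_negatives,
        "false positives": false_positives,
        "true negatives": true_negatives,
        "share of believed hypotheses that are true": true_positives / believed,
    }

print(belief_breakdown(n_hypotheses=100, base_rate=0.20, power=0.80, alpha=0.05))
# -> 16 true positives, 4 false negatives, 4 false positives, 76 true negatives,
#    and 16 / 20 = 80% of believed hypotheses are true.
```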

Of course, if the numbers here had been different, the conclusions would be different as well. For instance, imagine if you started with 2000 hypotheses, and this time, imagine that only 1% of them were true. If the power was still 80%, then:

• Out of the 2000 hypotheses, 20 of them would be true, and of those, you’d believe 16 (0.80 * 20) of them (these are true positives) and fail to believe 4 of them (these are false negatives).

• Out of the 2000 hypotheses, 1980 would be false, and of those, you’d believe 99 (0.05*1980) of them (these are false positives), and you’d reject the other 1881 of them (these are true negatives).

• So, altogether, you’d believe 115 (16 + 99) hypotheses, of which only 16 would’ve actually been true, so of the results you believe in, less than 14% would be correct! 

From analyses like these, we can see that the probability that a specific hypothesis is true, given that we’ve found p<0.05, depends on a variety of factors, including the sample size, the true effect size, the base rate probability that a new hypothesis tested by that researcher is true, the probability of errors being made in the experimental design or statistical analysis, and so on.
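
To make that dependence concrete, here is a small, self-contained Python sketch (the share_true function is mine, just for illustration) that reproduces both scenarios above and shows how the answer shifts as the base rate and the statistical power change, with the p<0.05 threshold held fixed.

```python
# A self-contained sketch (the function name is illustrative) of how the share
# of "believed" hypotheses that are actually true depends on the base rate of
# true hypotheses and on statistical power, with the p < 0.05 threshold fixed.

def share_true(base_rate: float, power: float, alpha: float = 0.05) -> float:
    """Of the hypotheses believed under the 'believe iff p < alpha' rule,
    what fraction are actually true?"""
    true_positive_rate = power * base_rate           # real effects detected
    false_positive_rate = alpha * (1 - base_rate)    # non-effects slipping through
    return true_positive_rate / (true_positive_rate + false_positive_rate)

print(share_true(base_rate=0.20, power=0.80))  # 0.80, the first scenario above
print(share_true(base_rate=0.01, power=0.80))  # ~0.139, i.e., 16 out of 115

# Sweep over a few base rates and power levels:
for base_rate in (0.01, 0.10, 0.20, 0.50):
    for power in (0.50, 0.80):
        print(f"base rate {base_rate:.0%}, power {power:.0%}: "
              f"{share_true(base_rate, power):.0%} of believed hypotheses are true")
```

Note that statistical power itself bundles together the sample size and the true effect size, which is why those factors appear in the list above even though they don’t show up explicitly in this sketch.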


In real life:

(1) Studies often don’t use large enough numbers of participants (and so are underpowered).

(2) Researchers sometimes engage in p-hacking to artificially lower their p-values to help their papers get published.

(3) Researchers often don’t carefully track how many hypotheses they’ve really tested.

(4) The decision procedure described above is often not adhered to so strictly (e.g., a result of p=0.08 might be treated as suggestive evidence for the hypothesis, and hence the hypothesis is not rejected).

(5) Real hypotheses often carry auxiliary assumptions that the p-value doesn’t account for (such as the assumption that there are no confounders, no serious errors in the experimental setup, and so on).

I personally don’t like thinking in terms of this decision procedure for p-values, because modeling hypotheses as simply “true” or “false” is not a good approach to thinking clearly: when trying to understand the answers to complex questions, it’s usually much better to think in terms of probabilities than in terms of a “true”/“false” dichotomy.

Some people have argued that we should switch to a Bayesian approach to hypothesis testing since such an approach avoids many of the issues of p-values (including avoiding the problematic “true”/”false” dichotomy). But it also introduces other challenges, such as how to come up with an appropriate “prior” (which represents one’s belief about the probability of the hypothesis having different strengths of effects prior to seeing the study results).

This piece was first written on December 31, 2022, and first appeared on this site on April 2, 2023.


