Whenever you are looking at an average of something, it’s useful to ask yourself, “plus or minus what?” Averages nearly always have uncertainty associated with them because they are calculated based on a sample of a larger population.
For instance, if you want to know how happy people in the U.S. are, on average, you could try to figure it out by asking them, “On a scale from 0 to 100, how happy are you?” But of course, you can’t pose that question to everyone in the U.S., so instead, you could ask a random sample of people. The larger the number of people you ask (i.e., the larger the sample size), the less uncertainty there will be in the result. But how much uncertainty is there?
Let’s say that you polled 100 random adults in the U.S., and you find that they report an average happiness level of 70 out of 100. If you polled the entire country, you’d probably get a result similar to that, but probably not exactly. So you can’t say the average happiness level of the whole population is “exactly 70.” But you *can* estimate a range of values that you expect the average would fall into if you repeated the sampling many, many times.
In science, people typically give a 95% confidence interval (CI) around mean estimates. This is the range that you’d expect the true mean to fall within 95% of the time if you did the same study repeatedly. In everyday conversation, though, people almost never report a 95% CI.
People will often tell you that the only way you can calculate the 95% CI is if you know the standard deviation of the data. Unfortunately, that’s information you almost never have. But thankfully, as long as certain assumptions about the data and population are met, there is a trick you can use to get around this: you can actually estimate the confidence interval for yourself if you know the mean, the sample size, and the range of the measurement scale that the mean is based on.
To actually calculate the 95% confidence interval around a mean, you’d need to know about the variance of the values involved. But let’s say you don’t know that – how do you estimate what the 95% confidence interval is if the only information you have is the mean and the sample size? This also is a big problem for power calculations (i.e., statistical estimates of the sample size a study requires), as power calculators often ask you to estimate the standard deviation of a variable but you have no way to know this. Even running a pilot study often doesn’t help since, on a tiny sample, the standard deviation estimate will be excessively noisy.
So, with that in mind, here’s a statistical trick I came up with (and have never heard anyone else discuss) that I use all the time to easily estimate the 95th percentile confidence interval in my head. I call it “the Uncertainty Trick.” This trick can also be really useful for sample size estimation/calculation when you don’t have a large pilot study, or you don’t know the standard deviation of your variable, so you can’t use standard sample size calculators. The quick summary (though you’ll have to read on to know how to interpret it) is that:
If you calculated an average from 25 data points, you should have an uncertainty (in the sense of a 95th percentile confidence interval) about what the true average is of about plus or minus 20% of the allowed range of that value. So that means that if the original scale was 0 to 100, the 95th percentile confidence interval estimate would be plus or minus 20 (hence, if the average you measured was 40, the 95th percentile confidence interval would be 20 to 60). On the other hand, if the original scale was 0 to 10, you’d have a 95th percentile confidence interval of about plus or minus 2.
With 100 data points, the 95th percentile uncertainty is at most 10% of the allowed range. So, for instance, on a 0 to 10 scale, that would mean an uncertainty of +-1, whereas on a scale from 0 to 5, that would mean an uncertainty of +- 0.5.
With 1000 data points, the uncertainty is at most 3% of the allowed range.
The technique used to produce these results also has a handy second use, which is to easily estimate how many samples you have to look at in order to be confident that the average (that you’re using those samples to estimate) is within a certain degree of accuracy. For example, how many random apples in a country would you have to examine before you could be confident the true average price of apples in that country was within 10% of your estimated average? A lot of people have the intuition that this will depend on the total number of apples in that country, but actually, the total number of apples essentially doesn’t matter; only the number of apples that you look at does.
When we are given the average of something, we usually accept it at face value and don’t think to ask, “How big was the sample used to estimate this average, and what does that tell us about how uncertain this average actually is?” For instance, if only 25 random people were used to estimate the average age of people in the U.S., that average wouldn’t be very reliable. The Uncertainty Trick is a quick way to estimate how much uncertainty is in an average. (e.g., the average age of the entire population estimated from a sample of only 25 random people could be off by 10% to 20%, whereas if 1000 people were used, the uncertainty in that average would be much lower).
The Setup
Suppose that an average value was calculated from some subset of a population in order to try to estimate the average for the entire population. This is almost always the case when you’re given an average and also when you calculate one yourself.
For instance, maybe you or someone else calculated the average height of 1000 random people in the U.S. in order to estimate the average height of all people in the U.S., or someone calculated the average number of puppies that a pregnant dog will have by examining the litters of 100 random dogs.
Often we’re tempted to take these sorts of averages as being precise, but if the number of samples used to calculate the average wasn’t that large (say, was less than 10000), there could be quite a lot of uncertainty in what the true average is for the entire population.
So, how much uncertainty does such an estimate have? As the size of the subset that was used to calculate the average (i.e., the “sample size”, which we’ll call N) increases, clearly, the average of that subset becomes an increasingly good estimate for the average of the entire population. But how much uncertainty is there in this estimate for different sample sizes? (Here we assume that each data point was drawn independently and at random with equal likelihood from the entire population.)
The Uncertainty Trick
When calculating an average of values that are constrained to be in a known range (e.g., each value that you’ll be taking the average of is constrained to be within 0 to 100 or within 0 to 5, etc.), then the 95th percentile confidence interval (which captures your uncertainty in estimating the true value of the average) will be:
• At most about plus or minus 20% of the size of the range if the sample size was 25
• At most about plus or minus 10% of the size of the range if the sample size was 100
• At most about plus or minus 3% of the size of the range if the sample size was 1000
Or put another way: 25 samples means 95th percentile uncertainty of at most about +- 20% of the range size, 100 samples +- 10% of the range, and 1000 +- 3% of the range.
That’s the Uncertainty Trick!
More generally, if you’re trying to estimate the mean of an ENTIRE population, but all you have is a random sample of size N that you’ve measured the mean value for, the size that N needs to be so that the 95th percentile confidence interval is a given fraction of the total range (call that fraction Range_Fraction) is upper bounded by 1 / Range_Fraction^2, so:
N < (1 / Range_Fraction^2)
That’s the Uncertainty Trick in its general form!
Plugging in Range_Fraction=.20, we get N=25 (i.e., a sample size of 25 will imply that the 95th percentile confidence interval is less than 20% of the range), plugging in Range_Fraction=.10, we get N=100, and plugging in Range_Fraction=.03, we get approximately N=1000 (the formula actually gives about 1111, but since the formula is itself a conservative upper bound, as shown in the math at the end, 1000 is reasonably close to the true upper bound).
You can also reverse the Uncertainty Trick if you’re given a sample size, N, and want to get an upper bound on how big a fraction, Range_Fraction, of the range size the 95th percentile confidence interval will be. You’ll find that:
Range_Fraction < 1 / sqrt(N)
So if N=10000, then the 95th percentile confidence interval will be, at most, about 1% of the range size (i.e., Range_Fraction=0.01). That’s why when N=10000, we usually don’t have to worry about the uncertainty of the estimate at all (since, for most purposes, a +- 1% difference in value is not worth worrying about).
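If you’d rather not do the square roots in your head, here’s a minimal Python sketch of both directions of the trick (the function names are my own, purely for illustration):

```python
import math

def max_ci_fraction(n):
    """Upper bound on the 95th percentile uncertainty, as a fraction of the allowed range."""
    return 1 / math.sqrt(n)

def max_sample_size_needed(range_fraction):
    """Upper bound on the sample size needed for the uncertainty to be at most
    range_fraction of the allowed range."""
    return 1 / range_fraction ** 2

for n in (25, 100, 1000, 10000):
    print(f"N = {n}: uncertainty of at most +-{100 * max_ci_fraction(n):.1f}% of the range")

for f in (0.20, 0.10, 0.03):
    print(f"+-{100 * f:.0f}% of the range: at most {max_sample_size_needed(f):.0f} samples needed")
```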
[Note: this estimate is designed to be *conservative*, in the sense that the actual 95th percentile confidence interval will almost always be at least somewhat narrower than this estimate. That is, think of this trick as a very easy-to-use upper bound, telling you AT MOST how much uncertainty there is. The way I apply it, generally, is to get a quick sense of how much uncertainty there is in an average (it’s not meant as a precise estimate). Just how conservative is it? Well, I’ve seen cases where it can give 2x or 3x the size of the real confidence interval (e.g., if the values you have were sampled from a normal distribution with mean 0 and standard deviation 1/2 and you say that the allowed range of values is -1 to 1). So please only interpret this trick as giving an upper bound. This upper bound really CAN be reached, though, for instance, if your values were random coin flips with a 0 if you get tails and 1 if you get heads, and the range you use is 0 to 1. Technically, due to rounding to a convenient number, the N=1000 case can be exceeded by a few % in worst-case scenarios.]
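To get a feel for just how conservative the bound can be, here’s a small simulation sketch (assuming NumPy is available) that compares the trick’s bound to the typical width of a standard 1.96-standard-errors confidence interval, for coin flips on a 0 to 1 range (the worst case just mentioned) and for data bunched up near the middle of that range:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
trick_bound = 1 / np.sqrt(n)  # the trick's bound, as a fraction of a 0-to-1 range

def typical_half_width(draw_sample, trials=2000):
    """Average half-width of the usual 1.96 * SEM confidence interval across simulated samples."""
    widths = []
    for _ in range(trials):
        x = draw_sample(n)
        widths.append(1.96 * x.std(ddof=1) / np.sqrt(n))
    return float(np.mean(widths))

def coin_flips(n):
    # worst case: every value sits at one end of the range or the other
    return rng.integers(0, 2, n).astype(float)

def bunched_in_middle(n):
    # values concentrated near the middle of the range, so the bound is conservative
    return np.clip(rng.normal(0.5, 0.1, n), 0, 1)

print(f"Trick's upper bound:      +-{trick_bound:.3f}")
print(f"Coin flips (worst case):  +-{typical_half_width(coin_flips):.3f}")
print(f"Bunched-up data:          +-{typical_half_width(bunched_in_middle):.3f}")
```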
Example Use Cases
Example use case: If 1000 random adults in the U.S. were asked how religious they are on a scale from 0 to 100, and the average score was 65, then your 95% confidence interval for that value (which captures your uncertainty in what the value would be if every U.S. adult had been asked, rather than just 1000 of them) is at largest about 65 plus or minus 3, since 3 is 3% of the 0 to 100 range.
If it had been 100 people whose religiosity was measured instead of 1000 then your 95% confidence interval would have been at largest 65 +- 10.
If only 25 people were asked the question, then your 95% confidence interval would be at the largest 65 +- 20 since 20 is 20% of the 0 to 100 range.
More complicated example use case: less than 0.001% of people are greater than 7 feet tall (84 inches), and babies are almost always 10 inches long or more, so we can (while making very little error) say that human height is in the range 10 to 84 inches. If someone measures 100 people in the U.S. selected totally at random, and finds an average height of 57 inches, our 95th percentile confidence interval for the true average height of all people in the U.S. will be at most this number plus or minus 10% of the range (here 84 - 10 = 74 inches is the range, so 10% of this range is 7.4 inches); hence, our 95th percentile confidence interval is at most:
57 +- 7.4 inches
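Both of these examples boil down to the same arithmetic: the plus or minus amount is at most the size of the range divided by the square root of the sample size. Here’s a minimal sketch of that calculation (the function name is my own, purely for illustration):

```python
import math

def trick_interval(mean, scale_min, scale_max, n):
    """Widest 95th percentile confidence interval the Uncertainty Trick allows:
    the mean plus or minus (scale_max - scale_min) / sqrt(n)."""
    half_width = (scale_max - scale_min) / math.sqrt(n)
    return mean - half_width, mean + half_width

# Religiosity example: mean of 65 on a 0 to 100 scale
for n in (1000, 100, 25):
    low, high = trick_interval(65, 0, 100, n)
    print(f"n = {n}: at most 65 +- {(high - low) / 2:.1f}, i.e., ({low:.1f}, {high:.1f})")

# Height example: mean of 57 inches, range of 10 to 84 inches, n = 100
low, high = trick_interval(57, 10, 84, 100)
print(f"Height: at most 57 +- {(high - low) / 2:.1f} inches, i.e., ({low:.1f}, {high:.1f})")
```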
What If Your Data Is Unbounded?
The work above assumes that your data falls within a known range. But what if the variable you’re studying is unbounded, so it has no clear maximum (e.g., with incomes)? Or what if almost all the data is very likely to fall within a known range, but there’s always a small chance of a value falling outside of it (e.g., for most studies, a maximum age of 95 will suffice, but what if you then have a 110-year-old who participates)?
In such cases, you can still use the above methods as long as you pre-commit to a plan for how to handle any collected values that fall outside the pre-specified minimum and maximum range. Ideally, this plan should be decided in advance, before collecting data, and (if it’s for formal scientific work) you should consider pre-registering it. The two main strategies to deal with such cases are:
• Pre-commit to throw away all data from any participants who have a value outside the specified range. For instance, if you set the range for an age variable you are studying to be 18 to 95 years old, you will throw away the data from any study subjects who are younger than 18 or older than 95.
• Pre-commit to clip (a.k.a. winsorize) the values of anyone outside this range. For instance, if you are studying income, you may decide that any income over $1,000,000 per year (which you expect to be very rare in your sample) will simply be clipped to $1,000,000 per year, and any income below $0 per year to $0 (e.g., a negative income might be a typo, a study participant messing with you, or a study participant who actually lost money that year as their stock portfolio went down and who considered that negative income). Once you’ve decided that, then you can validly assume that all values will be from $0 to $1,000,000. Both strategies are sketched in code just below.
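Here’s a minimal sketch of those two strategies in code, with cutoffs borrowed from the examples above:

```python
def discard_out_of_range(values, lowest, highest):
    """Strategy 1: drop every data point that falls outside the pre-specified range."""
    return [v for v in values if lowest <= v <= highest]

def clip_to_range(values, lowest, highest):
    """Strategy 2: clip (winsorize) every data point to the pre-specified range."""
    return [min(max(v, lowest), highest) for v in values]

ages = [17, 25, 42, 96, 70]
incomes = [-500, 45_000, 2_000_000, 80_000]

print(discard_out_of_range(ages, 18, 95))    # [25, 42, 70]
print(clip_to_range(incomes, 0, 1_000_000))  # [0, 45000, 1000000, 80000]
```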
Other Common Cases
What if, instead of being interested in the confidence interval of a single mean, what you care about is the confidence interval of the difference between two means (e.g., the goal of your study is to measure how much bigger the mean of one group is than that of another group, or to compare the difference between two variables, such as comparing the mean of an outcome at the end of your study to what value it had at the beginning)? Or what if what you’re interested in is the confidence interval of a correlation? Here is the Uncertainty Trick for all three cases, so you can compare them:
When R is the range of the variable (e.g., a rating on a 0 to 10 scale would mean that R=10, and a variable ranging from 0 to 100 would mean R=100), n is the number of study participants (or, in the case of (2), the number of study participants per group), and n is sufficiently large, then the 95th percentile confidence interval (CI) is bounded by:
(1) For a single mean:
CI ≤ R / sqrt(n)
If the standard deviation is relatively high, this bound will be pretty accurate, but if the standard deviation is low (relative to the size of the range), then this confidence interval will be conservative (i.e., the real confidence interval will be smaller than this formula gives).
(2) For the comparison of two means (assuming each group or variable has a sample size of exactly n, and both of the two measurements have a range of R):
CI ≤ sqrt(2) R / sqrt(n)
If the standard deviation of the difference between these two variables is high, the bound will be pretty accurate, but if the standard deviation is low (which can happen, for instance, when each individual measurement has a low standard deviation relative to its range, or when the two measurements are very correlated and so taking their difference causes them to cancel each other out), this confidence interval will be conservative (i.e., overly wide).
(3) For a correlation (so long as the central limit theorem kicks in, which means n is large enough and there aren’t substantial outliers):
CI ≤ 2 / sqrt(n)
When the empirical correlation is close to 0 this confidence interval will be pretty accurate, and when the empirical correlation is closer to 1 or -1 the confidence interval will be conservative (i.e., overly wide). This bound is derived by observing that the worst-case scenario is when the empirical correlation is 0.
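Here are the same three bounds written out as a small code sketch (r is the range size and n the sample size, as defined above):

```python
import math

def ci_bound_single_mean(r, n):
    """Upper bound on the 95th percentile CI (plus or minus) for a single mean."""
    return r / math.sqrt(n)

def ci_bound_difference_of_means(r, n):
    """Upper bound on the CI for the difference of two means, with n participants per group."""
    return math.sqrt(2) * r / math.sqrt(n)

def ci_bound_correlation(n):
    """Upper bound on the CI for a correlation."""
    return 2 / math.sqrt(n)

n = 100
print(ci_bound_single_mean(10, n))          # +-1.0 on a 0 to 10 scale
print(ci_bound_difference_of_means(10, n))  # about +-1.41 on a 0 to 10 scale
print(ci_bound_correlation(n))              # +-0.2
```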
Estimated Sample Sizes In Various Cases
If we want to use the formulas above to estimate the sample size n for a study (i.e., the number of study participants we need), then the n we require is at most:
n ≤ (Q / CI)^2
where CI is the desired 95th percentile confidence interval (i.e., the desired plus or minus amount), and where for an individual mean Q=R, for the difference of two means Q=sqrt(2) R, and for a correlation Q=2.
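Turning that around into code, here’s a sketch that gives the upper bound on the sample size needed for a desired plus or minus amount (the example numbers are made up):

```python
import math

def samples_needed(q, desired_ci):
    """Upper bound on the sample size needed so the 95th percentile CI is at most desired_ci.
    Use q = R for a single mean, q = sqrt(2) * R for a difference of means, q = 2 for a correlation."""
    return (q / desired_ci) ** 2

r = 100  # e.g., a 0 to 100 scale
print(f"{samples_needed(r, 5):.0f}")                 # single mean within +-5: at most about 400 people
print(f"{samples_needed(math.sqrt(2) * r, 5):.0f}")  # difference of means within +-5: at most about 800 per group
print(f"{samples_needed(2, 0.1):.0f}")               # correlation within +-0.1: at most about 400 people
```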
The Math For One Variable But Only For Those Who Really Care How This Works
[If anyone notices any mistakes here, please let me know so I can correct them!]
If you’re interested in the math behind this, here’s how it works out. If your sample size is large enough, if your data points are drawn independently and uniformly at random from the population at large, and if there are no large outliers (which there can’t be in this case because our data is constrained by stipulation to be within a known range), then the central limit theorem tells us that our uncertainty in the true mean will be well described by a normal distribution with mean equal to the mean we calculated on our sample, and standard deviation sigma/sqrt(N), where sigma is the standard deviation of our sample.
In a normal distribution, the probability of a value falling within a certain range of the mean can be calculated using the error function (Erf), defined as Erf(x) = 2/sqrt(pi) * integral from t=0 to x of e^(-t^2) dt. The error function relates to the normal distribution in the following way: the probability of a value being within k standard deviations of the mean is given by:
p = Erf[k/sqrt(2)]
where p is the probability, and k is the number of standard deviations.
If we invert this function to solve for k in terms of the probability, we get k = sqrt(2) InvErf[p], where InvErf is the inverse of the Erf function.
The central limit theorem tells us that the likelihood of the true mean taking on different values can be well-described by a normal distribution with standard deviation sigma/sqrt(N) (i.e., the standard error of the mean, or SEM). That means that the 95th percentile confidence interval for the true mean is plus or minus the “uncertainty,” which we will call U. To find the uncertainty U (i.e., the margin of error) for a 95% confidence interval, we multiply the SEM by the value of k corresponding to p = 0.95:
U = k * SEM, so:
U = k * sigma/sqrt(N) = sqrt(2) InvErf[p] * sigma / sqrt(N)
Okay, so what sample size do we need to have an uncertainty of U? Well, solving for N in this equation, we get:
N = 2 * sigma^2 InvErf[p]^2 / U^2
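If you do happen to know (or are willing to guess) sigma, these two formulas are easy to evaluate directly. Here’s a sketch using SciPy’s inverse error function (the example values of sigma and U are made up):

```python
import math
from scipy.special import erfinv

def uncertainty(sigma, n, p=0.95):
    """U = sqrt(2) * InvErf(p) * sigma / sqrt(N): half-width of the central-p confidence interval."""
    return math.sqrt(2) * erfinv(p) * sigma / math.sqrt(n)

def sample_size_for(sigma, u, p=0.95):
    """N = 2 * sigma^2 * InvErf(p)^2 / U^2: sample size needed for an uncertainty of U."""
    return 2 * sigma ** 2 * erfinv(p) ** 2 / u ** 2

print(math.sqrt(2) * erfinv(0.95))   # about 1.96, the familiar 95% z-value
print(uncertainty(sigma=15, n=100))  # about 2.94
print(sample_size_for(15, 2.94))     # about 100
```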
Now, we note that if (as we assumed at the beginning) the values that we’re taking the average of are all constrained to be within a fixed range, then the standard deviation sigma can be, AT MOST, half the size of that range (the worst case being when half the values sit at one end of the range and half at the other). Call the size of that range Range. Substituting sigma = Range/2, that means that:
N <= 2 * (Range/2)^2 * InvErf[p]^2 / U^2 = Range^2 * InvErf[p]^2 / (2 * U^2)
So suppose that we want our uncertainty U to be a fraction of the size of our range; call that fraction Range_Fraction. Then U = Range_Fraction*Range, and so we have:
N <= Range^2 * InvErf[p]^2 / (2 * (Range_Fraction*Range)^2)
N <= InvErf[p]^2 / (2 * Range_Fraction^2)
If we want this to be a 95th percentile confidence interval, we set p=0.95 (i.e., want to be in the central region of the normal distribution encompassing 95% of the probability):
N <= 0.960365 / Range_Fraction^2
So:
N <= 1 / Range_Fraction^2
Plugging in Range_Fraction=.20 gives an upper bound of about N=25, plugging in Range_Fraction=.10 gives an upper bound of about N=100, and plugging in Range_Fraction=.03 gives an upper bound of about 1000.
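As a quick sanity check on that final constant, here’s a sketch that computes InvErf[0.95]^2 / 2 with SciPy and plugs in the same three fractions:

```python
from scipy.special import erfinv

constant = erfinv(0.95) ** 2 / 2
print(constant)  # about 0.960365

for range_fraction in (0.20, 0.10, 0.03):
    print(f"Range_Fraction = {range_fraction}: N <= {constant / range_fraction ** 2:.0f}")
# Range_Fraction = 0.2: N <= 24
# Range_Fraction = 0.1: N <= 96
# Range_Fraction = 0.03: N <= 1067
```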
This piece was first written on July 28, 2017, and first appeared on my website on April 24, 2024.