Guessing the Probability Distribution

Tips for guessing what distribution a variable (or set of values) might have

People often assume that the frequency with which a variable will take on different values (i.e., probability distribution) is likely to follow a bell curve (i.e., a normal distribution); this is often a mistake. Instead, consider these rules of thumb for deciding which probability distribution to use as a model in different situations:

(1) Binomial – when flipping coins

Number of occurrences out of a fixed number of tries: if the variable represents the number of times something happened out of a certain fixed number of trials that are at least mostly independent from each other (e.g., the number of times a team wins out of a fixed number of games, or the number of people on your mailing list that opened the last email you sent) then try modeling the probability of outcomes with a binomial distribution. Note that when the number of trials is large, and the rate of success is not too close to 0% or 100%, then this can be approximated with a normal distribution, but when the number of trials is small, or the rate of success is close to 0% or 100% a normal distribution may fit it very poorly.
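To make the caveat about the normal approximation concrete, here is a minimal sketch that compares exact binomial probabilities with a normal approximation (using a continuity correction). The helper names and the two parameter settings are illustrative assumptions, not part of the original tips.

```python
from math import comb, erf, sqrt

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

def normal_approx_pmf(k, n, p):
    """Normal approximation to P(X = k), with a continuity correction."""
    mean, sd = n * p, sqrt(n * p * (1 - p))
    return normal_cdf(k + 0.5, mean, sd) - normal_cdf(k - 0.5, mean, sd)

def max_abs_error(n, p):
    """Largest gap between the exact and approximate probabilities."""
    return max(
        abs(binomial_pmf(k, n, p) - normal_approx_pmf(k, n, p))
        for k in range(n + 1)
    )

# Hypothetical settings: many trials at a moderate rate vs. few trials at a rate near 0%.
error_many_trials = max_abs_error(100, 0.5)
error_few_trials = max_abs_error(10, 0.02)
print(error_many_trials, error_few_trials)
```

Under these assumed settings the second error should come out far larger than the first, which is the situation where the bell curve fits poorly.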

(2) Poisson – when watching the clock

Number of occurrences in a fixed amount of time: if the variable represents the number of things of a certain type that happen within a fixed amount of time (e.g., 1 hour), and the chance of the next one happening isn’t affected by how much time has elapsed since the last one (e.g., the number of burglaries that will happen in a given city in a day, or the number of emails that you get on a Monday between 9 am and 10 am), then try a Poisson distribution. Instead of fixed amounts of time, it can also be applied to fixed amounts of space or stuff, such as the number of mutations in a fixed-sized region of DNA once a given dose of radiation is applied.
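As a rough illustration, the Poisson probabilities can be computed directly from the average rate. The rate of 3 emails per hour below is a made-up number chosen only for the example.

```python
from math import exp, factorial

def poisson_pmf(k, rate):
    """Probability of exactly k events when `rate` events are expected on average."""
    return rate**k * exp(-rate) / factorial(k)

rate = 3.0  # hypothetical average number of emails between 9 am and 10 am

p_no_emails = poisson_pmf(0, rate)
p_at_most_five = sum(poisson_pmf(k, rate) for k in range(6))
# The mean of the distribution equals the rate (the sum is truncated at 60 terms).
approx_mean = sum(k * poisson_pmf(k, rate) for k in range(60))

print(p_no_emails, p_at_most_five, approx_mean)
```

The only input is the average rate, which is part of why the Poisson distribution is convenient when you have little more than a historical count per hour or per day.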

(3) Exponential – when luck fades

Positive variables where larger values are always less likely than smaller values: if the variable is constrained to be positive and it’s reasonable to assume that higher values are always less likely than lower values, then the Exponential distribution may be a reasonable choice. It can also be applied to cases where the variable is constrained to be above a certain value rather than strictly positive (if we’re willing to shift the Exponential distribution so that it starts at that value). Exponential distributions are an especially good choice when modeling the time until an event if the estimated amount of time remaining until the event occurs is not affected by the amount of time elapsed so far (e.g., the amount of time until the next telephone call at a call center around a fixed time of day, say 2 pm). It is also the right choice when, with each passing unit of time, the chance that the thing has not yet happened falls by a fixed percentage (e.g., the chance that someone has NOT gotten an injury after their nth day of riding a motorcycle, if we can assume roughly a fixed chance of injury each day). Another application of the Exponential distribution is when you know a variable must be positive and its maximum value is unlimited, and you know its mean value, but otherwise you have the maximum uncertainty possible about outcomes of the distribution (i.e., you want to use the maximum entropy distribution for a known mean).
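The "time elapsed so far doesn't matter" property can be checked numerically. This is a small sketch with assumed values (a rate of 0.5 events per unit of time, and a 1% daily injury chance for the motorcycle example); neither number comes from the text above.

```python
from math import exp

def survival(t, rate):
    """P(T > t) for an Exponential waiting time with the given rate."""
    return exp(-rate * t)

rate = 0.5              # hypothetical: events per unit of time
waited, extra = 4.0, 3.0

# Chance of waiting `extra` more, given we have already waited `waited`.
conditional = survival(waited + extra, rate) / survival(waited, rate)
# Chance of waiting `extra` starting from scratch.
from_scratch = survival(extra, rate)
print(conditional, from_scratch)  # the two agree up to floating-point rounding

# Motorcycle example with an assumed 1% chance of injury per day:
daily_injury_chance = 0.01

def chance_uninjured(days):
    """Chance of no injury after `days` days of riding."""
    return (1 - daily_injury_chance) ** days

# Each extra day multiplies the chance of being uninjured by the same factor.
daily_factor = chance_uninjured(11) / chance_uninjured(10)
print(daily_factor)
```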

(4) Gamma – when anything could happen (but it’s got to be positive)

Positive variables that don’t have a clear maximum value: when the restriction of larger values being always less likely than smaller values is not necessarily valid (but you still have an unbounded variable that must be positive), then the Gamma distribution can be a reasonable choice. It generalizes the exponential distribution to a wider range of cases and also includes as a special case the famous chi-squared distribution. The Gamma distribution is useful for modeling things such as the total amount of rainfall in a certain country in a certain year (it must be a positive number, and we can’t say for sure that the maximum rain that has ever occurred is truly the maximum that could occur). Another example application would be the amount of time between one random car passing a certain point and the next car passing that point, which again must be positive but doesn’t have a clear maximum value.
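A quick sketch of how the Gamma distribution generalizes the Exponential: with a shape parameter of 1 the two densities coincide, while a larger shape moves the most likely value away from zero. The shape/scale parameterization and the specific values below are assumptions chosen for illustration.

```python
from math import exp, gamma

def gamma_pdf(x, shape, scale):
    """Density of a Gamma distribution (shape/scale parameterization), for x > 0."""
    return x ** (shape - 1) * exp(-x / scale) / (gamma(shape) * scale**shape)

def exponential_pdf(x, rate):
    """Density of an Exponential distribution, for x > 0."""
    return rate * exp(-rate * x)

# Shape 1 reduces to an Exponential distribution with rate 1 / scale.
same_as_exponential = abs(gamma_pdf(1.7, 1, 2.0) - exponential_pdf(1.7, 0.5))

# With shape 3 and scale 2 the density peaks at (shape - 1) * scale = 4,
# so larger values are no longer always less likely than smaller ones.
density_low = gamma_pdf(1.0, 3, 2.0)
density_peak = gamma_pdf(4.0, 3, 2.0)
density_high = gamma_pdf(8.0, 3, 2.0)
print(same_as_exponential, density_low, density_peak, density_high)

# A chi-squared distribution with k degrees of freedom is the Gamma
# distribution with shape k / 2 and scale 2.
```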

(5) Normal – when things just add up

Sums of variables: if the variable can be thought of as a sum or average (or weighted average, where no one variable gets most of the weight) of other variables that aren’t that correlated to each other (e.g., human height, which can be approximated as being affected by a weighted average of different effects from different genes, or IQ test scores, which can be thought of as a sum of many different components of intelligence such as working memory, verbal comprehension, spatial reasoning, processing speed, practice taking such tests, focus, etc.) then try a Normal distribution (a.k.a. bell curve). Regarding human height, a Normal distribution fits it quite well if you consider adults, but even better if you just consider males or just consider females (since the male/female chromosome difference causes a relatively large change, which can be thought of as a variable with excessively large weight in the weighted average). A Normal distribution is also a good choice when you know that the center (e.g., mean or median) and width (e.g., standard deviation) are set in stone, but other than that, you have the maximum amount of uncertainty about what outcomes can occur (i.e., you want the maximum entropy distribution for a known mean and standard deviation), or when you’re trying to pick a reasonable prior distribution for the mean value of something (e.g., you want to express uncertainty about the mean of another probability distribution).
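The "things just add up" idea can be seen in a small simulation: sums of many independent, decidedly non-normal variables end up with bell-curve proportions. The choice of 30 uniform variables per sum, 20,000 sums, and the random seed are arbitrary assumptions for this sketch.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Each value is a sum of 30 independent uniform(0, 1) variables.
sums = [sum(rng.random() for _ in range(30)) for _ in range(20000)]

center, spread = mean(sums), pstdev(sums)

# For a Normal distribution, roughly 68% of values fall within one standard
# deviation of the mean and roughly 95% within two.
within_one_sd = sum(abs(x - center) <= spread for x in sums) / len(sums)
within_two_sd = sum(abs(x - center) <= 2 * spread for x in sums) / len(sums)
print(within_one_sd, within_two_sd)
```

If one term dominated the sum (the analogue of the male/female difference in the height example), these proportions would drift away from the Normal benchmarks.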

(6) Log-normal – when luck is compounding

Products of variables: if the variable can be thought of as a product of (as opposed to a sum of) positive variables that are at least mostly independent, or thought of as an accumulation of small percentage changes (from a positive starting point) due to different unrelated causes, then consider using a Log-normal distribution (a.k.a. Galton distribution). Examples where these assumptions seem to work reasonably well include the income distribution of a large population (if we cut off the top 3% or so richest, as that top tail may be better modeled as a Pareto distribution), the distribution of populations of cities, or perhaps people’s total “success” in life by some fixed metric (excluding the very, very highest of achievers). The Log-normal distribution can also be a reasonable choice when a variable represents percentage changes in a thing from a fixed starting point of 1 (where 0 is the lowest possible value since it would indicate the thing is totally gone) when there is not a clear maximum (e.g., the distribution of returns for a diversified asset portfolio – though this distribution will leave out the extreme outlier tail risk caused by black swan events such as the world being hit by a huge asteroid or a world-altering new technology). The Log-normal distribution is also a reasonable choice when trying to set a prior distribution for a scale parameter (e.g., you need to use a probability distribution to model the uncertainty in the standard deviation of some other distribution) or if you expect the number of orders of magnitude of a thing to be well modeled by a Normal distribution (so that the thing itself is the exponential of a normally distributed variable). When the mean is much larger than the standard deviation, a Log-normal distribution approximates a Normal distribution.
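Here is a rough simulation of compounding: a value starts at 1 and is repeatedly multiplied by small random percentage changes. The raw outcomes come out skewed to the right, while their logarithms look roughly symmetric, which is the signature of a Log-normal distribution. The step count, the ±10% range, and the seed are assumptions made up for this sketch.

```python
import random
from math import log
from statistics import mean, median, pstdev

rng = random.Random(42)  # fixed seed so the sketch is reproducible

def compounded_value(steps=50):
    """Start at 1 and apply `steps` independent changes of up to +/-10% each."""
    value = 1.0
    for _ in range(steps):
        value *= 1 + rng.uniform(-0.10, 0.10)
    return value

def skewness(values):
    """Sample skewness: about 0 for symmetric data, positive for a long right tail."""
    m, sd = mean(values), pstdev(values)
    return mean([((v - m) / sd) ** 3 for v in values])

outcomes = [compounded_value() for _ in range(10000)]
log_outcomes = [log(v) for v in outcomes]

skew_raw = skewness(outcomes)       # clearly positive
skew_log = skewness(log_outcomes)   # close to zero
print(skew_raw, skew_log, mean(outcomes), median(outcomes))
```

The mean of the outcomes also sits above the median, another common sign that a few compounding "lucky streaks" are pulling the average up.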

(7) Beta – when possibilities are penned in

Continuous variables constrained to be in a certain range: if the variable you’re trying to model the probability distribution of is a continuous variable that is only allowed to be in a bounded range (e.g., a corruption score that can be any value from 0 to 1 or a quality score in the range of 1 to 1000), consider using a Beta distribution (rescaled and/or shifted to cover the range of allowed values of your variable, since by default the Beta distribution has the range 0 to 1). The Beta distribution works especially well when the variable you’re trying to model the probability distribution of is itself a probability (which by definition is always going to be in the range of 0 to 1) or if you need a prior distribution for a variable that represents a probability. The Beta distribution can also work well when trying to model how likely the median value (or, more generally, the pth percentile value) of a set of values is to take on different values if the original values are approximately uniformly distributed in the range 0 to 1. The commonly used uniform distribution, which assigns an equal likelihood to each value within the range 0 to 1 (and no chance to anything outside of that range), is a special case of the Beta distribution.
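Two of the claims above are easy to check in a short sketch: the uniform distribution is the Beta distribution with both parameters equal to 1, and the median of uniformly distributed values follows a Beta distribution. The sample size of 5 (whose median follows Beta(3, 3)), the number of repetitions, and the seed are assumptions for illustration.

```python
import random
from math import gamma
from statistics import mean, median, pvariance

def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution for 0 < x < 1."""
    normalizer = gamma(a + b) / (gamma(a) * gamma(b))
    return normalizer * x ** (a - 1) * (1 - x) ** (b - 1)

def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

# Beta(1, 1) is flat: every value between 0 and 1 has density 1.
flat_densities = [beta_pdf(x, 1, 1) for x in (0.1, 0.5, 0.9)]

# The median of 5 independent uniform(0, 1) values follows Beta(3, 3).
rng = random.Random(1)  # fixed seed so the sketch is reproducible
medians = [median([rng.random() for _ in range(5)]) for _ in range(20000)]

simulated_variance = pvariance(medians)
theoretical_variance = beta_variance(3, 3)  # 1/28, about 0.0357
print(flat_densities, mean(medians), simulated_variance, theoretical_variance)
```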

(8) Pareto – when outcomes are driven by outliers

80/20 type rules for positive variables: if you expect that the top X% of the outcomes will contribute Y% of the total, but ALSO that X% of those top X% will contribute Y% of the total of just that top group of outcomes, and X% of the top X% of the top X% will contribute Y% of the total of that top group of the top group, and so on and so forth, then consider using a Pareto distribution (a.k.a. power law). Many people believe this is the right model for the total value of startups or the amount of total funds raised by startups. The Pareto distribution also is a good choice when you are modeling the total outcome value of a thing that occurs over time, and a reasonable estimate for the value remaining to be added is proportional to the amount of value accumulated so far. So for instance, if a reasonable estimate for the final equity valuation of startups is 15% higher than the valuation at their last equity sale, then a Pareto distribution might be a reasonable choice to model equity valuation for startups. Or, to give another example, if the average time remaining for a project to finish is proportional to the amount of time elapsed so far, then the completion times of projects might follow a Pareto distribution. Pareto distributions can be good for modeling things where a lot of the total of all the outcomes is driven by really extreme outliers. Relatedly, Pareto distributions have the funny property that they can have infinite variance or even infinite mean for certain parameter values. Things in the real world don’t really seem to have infinite means or variances as far as we can tell, but in some cases, it could still be a reasonable model of real-world phenomena. As statistician George Box said, “all models are wrong, but some are useful.”
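The nested "top X% of the top X%" pattern can be made concrete with a small calculation. For a Pareto distribution with shape parameter alpha greater than 1, the top fraction f of outcomes accounts for a share f ** (1 - 1 / alpha) of the total. The sketch below picks the alpha that gives an exact 80/20 split; the function and variable names are illustrative.

```python
from math import log

def top_share(fraction, alpha):
    """Share of the total contributed by the top `fraction` of outcomes
    under a Pareto distribution with shape alpha > 1."""
    return fraction ** (1 - 1 / alpha)

# Shape for which the top 20% contribute 80% of the total (about 1.16).
alpha_80_20 = log(5) / log(4)

share_top_20 = top_share(0.20, alpha_80_20)   # about 0.80
share_top_4 = top_share(0.04, alpha_80_20)    # top 20% of the top 20%

# Within the top group, its own top 20% again contribute about 80%.
share_within_top_group = share_top_4 / share_top_20
print(alpha_80_20, share_top_20, share_top_4, share_within_top_group)

# With alpha <= 2 the variance is infinite and with alpha <= 1 the mean is
# infinite, so this 80/20 shape has a finite mean but an infinite variance.
```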

(9) Gumbel – when you’re studying the worst-case scenario

The maximum of a large number of values: if you are trying to model something characterized by a maximum (or minimum) of many values, such as the max amount of rainfall in a city / the max daily stock market decline / the maximum worldwide earthquake strength in any day over a ten year period, consider using a Gumbel distribution, Weibull distribution, or Fréchet distribution. Which to use depends on details about the “tails” of the distribution from which each individual value is drawn. See here for more details about “extreme value theory,” which characterizes what sort of distributions the maximum can have (when the maximum is taken over a large number of random values drawn from some distribution): https://en.wikipedia.org/wiki/Extreme_value_theory…
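As one worked case, the maximum of many exponentially distributed values converges to a Gumbel distribution once it is shifted by the logarithm of the number of values. The sketch below compares the exact distribution of the maximum with the Gumbel limit; the choice of unit-rate exponential values and the grid of test points are assumptions, and other starting distributions can lead to the Weibull or Fréchet cases instead.

```python
from math import exp, log

def gumbel_cdf(z):
    """Cumulative distribution function of the standard Gumbel distribution."""
    return exp(-exp(-z))

def shifted_max_cdf(z, n):
    """P(max of n unit-rate exponential values - log(n) <= z), computed exactly."""
    x = z + log(n)
    if x <= 0:
        return 0.0
    return (1 - exp(-x)) ** n

def max_gap(n):
    """Largest gap between the exact CDF and the Gumbel limit on a grid of points."""
    grid = [i / 10 for i in range(-30, 61)]  # z from -3.0 to 6.0
    return max(abs(shifted_max_cdf(z, n) - gumbel_cdf(z)) for z in grid)

gap_small_n = max_gap(10)
gap_large_n = max_gap(1000)
print(gap_small_n, gap_large_n)  # the gap shrinks as n grows
```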


