Disputes Over How to Use Statistics in the Real World

There is a surprising lack of consensus on how to do statistics, especially as it applies to science. Statistics is the tool that underpins the scientific enterprise, so you'd think we would have figured it out by now. You'd be wrong.

The mathematical proofs themselves are, of course, very rarely disputed. What's disputed far more often is how the mathematics should be used.

Why do these disputes arise? I’ve observed five different types.

Disputes in Applications of Statistics to Science

(1) Disputes over philosophy:

Example 1: Is it valid to assign a probability to an event that is not part of any clear sequence of events or mathematical process (e.g., trying to estimate the probability that humanity will go extinct)?

We don’t agree on what we mean by “probability.”

Example 2: Under what conditions should a strange data point be considered an “outlier” and hence removed from our data before calculating our statistics?

We don’t agree on what we mean by “outlier.”
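
To make the disagreement concrete, here is a minimal Python sketch (with made-up numbers) showing that two common outlier heuristics, the 1.5×IQR “boxplot” rule and a 3-standard-deviation z-score cutoff, can disagree about the very same data point. Neither rule is “the” definition; they're just two of the conventions people reach for.

```python
import numpy as np

# Made-up sample with one moderately extreme point (hypothetical data).
data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.1, 9.7, 10.0, 10.3, 13.0])

# Heuristic 1: the 1.5 * IQR ("boxplot") rule.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)

# Heuristic 2: flag points more than 3 standard deviations from the mean.
z_scores = (data - data.mean()) / data.std(ddof=1)
z_outliers = np.abs(z_scores) > 3

print("IQR rule flags:    ", data[iqr_outliers])   # flags 13.0
print("z-score rule flags:", data[z_outliers])     # flags nothing here
```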

Resolving these disputes might require more consensus on what we mean by different concepts in statistics and potentially more disambiguation of possible interpretations of these concepts.

(2) Disputes over what we’re trying to achieve:

Example 1: Is a p-value a measure of something that we directly care about, and if not, is it still worth using?

We don’t agree on the extent to which one of the most commonly used statistics is something we care about.
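
To show why this is contentious, here is a small, hypothetical Python simulation: among results with p < 0.05, the fraction that are actually null depends heavily on how common true effects are in the first place. So a p-value by itself is not the “chance the finding is wrong” that many of us would really like to know.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 10_000   # hypothetical batch of studies
n_per_group = 30
effect_size = 0.5        # true effect (in SD units) when an effect exists

def fraction_of_false_discoveries(prop_true_effects):
    """Among p < 0.05 results, how many came from experiments with no real effect?"""
    false_discoveries, discoveries = 0, 0
    for _ in range(n_experiments):
        has_effect = rng.random() < prop_true_effects
        a = rng.normal(0, 1, n_per_group)
        b = rng.normal(effect_size if has_effect else 0, 1, n_per_group)
        if stats.ttest_ind(a, b).pvalue < 0.05:
            discoveries += 1
            if not has_effect:
                false_discoveries += 1
    return false_discoveries / discoveries

# The same p < 0.05 cutoff means very different things in different settings.
print("True effects common (50%):", round(fraction_of_false_discoveries(0.5), 2))   # small
print("True effects rare   (5%) :", round(fraction_of_false_discoveries(0.05), 2))  # much larger
```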

Example 2: When we’re testing multiple hypotheses at once, how should we adjust our resulting statistics to account for this “multiple hypothesis testing”?

We don’t agree on how to handle the increased rate of false positives that occurs when we report multiple statistical tests in a single paper.
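
As a sketch of two of the competing conventions, here is how a Bonferroni correction and a Benjamini–Hochberg false-discovery-rate correction can reach different conclusions from the same (hypothetical) set of p-values, using the statsmodels library.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from ten tests reported in the same paper.
p_values = [0.001, 0.008, 0.012, 0.020, 0.030, 0.041, 0.049, 0.200, 0.450, 0.900]

# Bonferroni controls the chance of *any* false positive (very conservative).
reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg controls the expected *proportion* of false positives.
reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Significant under Bonferroni:", sum(reject_bonf))  # fewer rejections
print("Significant under BH (FDR):  ", sum(reject_bh))    # more rejections
```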

Resolving these disputes might require more agreement on exactly what we’re trying to achieve with statistics.

(3) Disputes over what is “good enough” or “accurate enough”:

Example 1: When you’re comparing the means of two groups, when is it okay to use the very popular Student’s t-test, which assumes equal variances in the two groups, rather than Welch’s t-test, which doesn’t assume equal variances? (This article does a good job of exploring what a horror show this seemingly simple question is.)

We don’t agree on how much we should worry about the basic assumptions of our commonly used statistical tests being violated.
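
For concreteness, here is a minimal Python sketch with made-up data in which the two groups have very different variances and sample sizes; the equal-variance Student’s t-test and Welch’s t-test, both available in scipy, can give noticeably different p-values for the same comparison.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical data: different variances and very unequal sample sizes.
group_a = rng.normal(loc=0.0, scale=1.0, size=10)
group_b = rng.normal(loc=0.8, scale=4.0, size=60)

# Student's t-test assumes the two groups share a single variance.
p_student = stats.ttest_ind(group_a, group_b, equal_var=True).pvalue

# Welch's t-test drops that assumption (equal_var=False in scipy).
p_welch = stats.ttest_ind(group_a, group_b, equal_var=False).pvalue

print(f"Student's t-test p-value: {p_student:.3f}")
print(f"Welch's t-test p-value:   {p_welch:.3f}")
```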

Example 2: To control for a variable (i.e., to “factor out” its influence on the relationship between two other variables), is it sufficient to run a regression that includes that control variable in the model?

We don’t agree on what is sufficient to remove the causal effects of a variable we don’t care about on one we do care about, so that those effects don’t contaminate our main results.
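
Below is a minimal sketch of the “just add the control variable to the regression” approach, using simulated data and the statsmodels formula API. It shows the textbook case where adding the confounder works; the dispute is over whether this is sufficient in general (for example, when the relationship isn’t linear or the control is measured with error, the regression itself won’t tell you).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500

# Simulated data: z influences both x and y (a classic confounder).
z = rng.normal(size=n)
x = 0.7 * z + rng.normal(size=n)
y = 0.5 * z + rng.normal(size=n)          # x has no direct effect on y
df = pd.DataFrame({"x": x, "y": y, "z": z})

naive = smf.ols("y ~ x", data=df).fit()            # leaves z out
controlled = smf.ols("y ~ x + z", data=df).fit()   # includes the control variable

print("Coefficient on x without control:", round(naive.params["x"], 2))      # biased away from 0
print("Coefficient on x with control:   ", round(controlled.params["x"], 2)) # near 0
```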

Resolving these disputes might require more agreement on how robust methods are to violations of their assumptions and clearer best practices for what to do when we expect the assumptions might be violated.

(4) Disputes due to misconceptions:

Example 1: It is often assumed that the true value has a 95% chance of falling within a 95% confidence interval computed for it, but that’s actually a subtle misinterpretation of what a 95% confidence interval is.

The proper interpretations of our statistics can be extremely subtle, and we can easily get them wrong.
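
One way to see the correct interpretation is with a simulation (a hypothetical Python sketch using scipy): the “95%” refers to how often intervals constructed this way capture the true value across repeated experiments, not to the chance that the truth lies inside any single interval you’ve already computed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean = 10.0
n, n_experiments = 25, 10_000
covered = 0

for _ in range(n_experiments):
    sample = rng.normal(true_mean, 3.0, size=n)
    # 95% confidence interval for the mean, based on the t distribution.
    half_width = stats.t.ppf(0.975, df=n - 1) * sample.std(ddof=1) / np.sqrt(n)
    lo, hi = sample.mean() - half_width, sample.mean() + half_width
    covered += (lo <= true_mean <= hi)

# Roughly 95% of the intervals contain the true mean -- that's the guarantee.
# Any *single* interval either contains it or it doesn't.
print("Coverage:", covered / n_experiments)
```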

Example 2: The trim-and-fill method is often used in meta-analysis to try to correct for the fact that results finding no effect are more likely to go unpublished than results finding an effect, but unfortunately, this method sometimes doesn’t do what it is supposed to.

We sometimes continue to rely on techniques out of inertia even though they are known to occasionally produce wrong results.
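
To show the underlying problem (this is a sketch of publication bias itself, not of the trim-and-fill method), here is a hypothetical simulation: when null results tend to go unpublished, a simple average of the published effects overstates the true effect, which is exactly the bias trim-and-fill attempts, imperfectly, to repair.

```python
import numpy as np

rng = np.random.default_rng(7)
true_effect = 0.2
n_studies, n_per_study = 200, 50

# Simulate many small studies of the same true effect.
se = 1 / np.sqrt(n_per_study)
observed = rng.normal(true_effect, se, size=n_studies)

# Publication bias: non-"significant" studies (z < 1.96) mostly go unpublished.
significant = observed / se > 1.96
published = observed[significant | (rng.random(n_studies) < 0.2)]

print("True effect:                 ", true_effect)
print("Average of all studies:      ", round(observed.mean(), 2))   # close to 0.2
print("Average of published studies:", round(published.mean(), 2))  # noticeably larger
```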

Resolving these disputes might require better statistical education around common misconceptions and common confusions.

(5) Disputes over what to do when we lack information:

Example 1: When we don’t have any empirical data or previous experience to use to estimate a prior distribution for a variable, how should we set our prior?

We don’t have standardized steps that everyone can agree on for situations like this, where a prior must be chosen with nothing to base it on.
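
As a tiny illustration of why the choice matters (hypothetical numbers, using scipy’s Beta distribution): with very little data, even two “standard” uninformative priors for a proportion, the uniform Beta(1, 1) and the Jeffreys Beta(0.5, 0.5), give noticeably different posterior estimates.

```python
from scipy import stats

# Hypothetical data: 0 successes out of 5 trials, and no prior information.
successes, trials = 0, 5

# With a Beta(a, b) prior, the posterior for the success rate is
# Beta(a + successes, b + failures).
for name, a, b in [("Uniform Beta(1, 1)     ", 1.0, 1.0),
                   ("Jeffreys Beta(0.5, 0.5)", 0.5, 0.5)]:
    posterior = stats.beta(a + successes, b + trials - successes)
    print(f"{name}: posterior mean = {posterior.mean():.3f}")

# Uniform prior  -> posterior mean 1/7   ~= 0.143
# Jeffreys prior -> posterior mean 0.5/6 ~= 0.083
```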

Example 2: If we don’t have reason to believe that our data is normally distributed, but the test we would usually run requires normality, should we use a non-parametric version of the test instead (even though it has less statistical power)? Or should we base that decision on the p-value of a test for normality, even though doing so forms a “compound test” and hence potentially changes the interpretation of our resulting p-value?

When certain information is lacking, we don’t have a consensus on how to adapt.
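
Here is what that disputed “compound” procedure looks like in code, as a hypothetical Python sketch: a normality test (Shapiro–Wilk) decides whether to run Welch’s t-test or its non-parametric cousin (Mann–Whitney U), which means the final p-value’s interpretation depends on an earlier, data-dependent decision.

```python
import numpy as np
from scipy import stats

def compare_groups(a, b, normality_alpha=0.05):
    """Disputed two-stage procedure: test normality first, then pick the test."""
    looks_normal = (stats.shapiro(a).pvalue > normality_alpha and
                    stats.shapiro(b).pvalue > normality_alpha)
    if looks_normal:
        # Parametric route: Welch's t-test.
        return "welch", stats.ttest_ind(a, b, equal_var=False).pvalue
    # Non-parametric route: Mann-Whitney U (fewer assumptions, less power).
    return "mann-whitney", stats.mannwhitneyu(a, b).pvalue

rng = np.random.default_rng(3)
a = rng.exponential(scale=1.0, size=40)   # clearly non-normal, hypothetical data
b = rng.exponential(scale=1.5, size=40)
print(compare_groups(a, b))
```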

Resolving these disputes might require a more standardized agreement on dos and don’ts for situations where we lack critical information.


  
