Photo by Sunder Muthukumaran on Unsplash

Three reasons to be cautious when reading data-driven “explanations”

Did you know that, fairly often, you can tell several very different stories about the same data, none of which is false? In other words, the mapping from statistical results to true stories about those results is not unique.

This leads to a lot of confusion, and it also implies that claims about “the reason” behind a complex social phenomenon should be interpreted with caution.

Here are three common situations where this happens, each illustrated with realistic political examples:



1) Correlated variables
You want to explain why politician X won. If being older and Christian are substantially correlated with each other AND ALSO with voting for X, then you could write either:

  • X wins because of older voters!
  • X wins because of Christian voters!
    …depending on who the writer prefers to blame it on.
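To make this concrete, here is a minimal sketch with made-up numbers (the counts and the `support_for_x` helper are purely illustrative assumptions, not real polling data). Age and religion overlap heavily in this toy electorate, so the "older voters" story and the "Christian voters" story are backed by exactly the same gap in support:

```python
# Hypothetical counts: (is_older, is_christian) -> (voters, votes_for_x)
cells = {
    (True, True): (400, 280),
    (True, False): (100, 50),
    (False, True): (100, 50),
    (False, False): (400, 160),
}

def support_for_x(position, value):
    """Share voting for X among voters whose trait at `position` equals `value`.

    position 0 is "older", position 1 is "Christian".
    """
    voters = sum(n for key, (n, _) in cells.items() if key[position] == value)
    votes = sum(v for key, (_, v) in cells.items() if key[position] == value)
    return votes / voters

# Gap in support for X between the two groups along each trait.
older_gap = support_for_x(0, True) - support_for_x(0, False)
christian_gap = support_for_x(1, True) - support_for_x(1, False)

print(f"older vs. younger gap: {older_gap:.2f}")
print(f"Christian vs. non-Christian gap: {christian_gap:.2f}")
```

Both gaps come out the same here, so the summary numbers alone cannot tell you which headline is "the" right one.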


2) Multi-causality
Many events have multiple causes, and so while it isn’t incorrect to say that any one of them “caused” the event, it is incomplete and can be misleading.

For instance, suppose you want to explain what caused politician Y to lose in a very close election. Any one of many small deviations could have changed the result. So you could validly say:

  • Y loses because fewer Z’s voted than usual!
  • Y loses because more W’s voted than usual!

That way, the writer can choose which group to cast as the guilty party. There may be dozens of such small discrepancies that could have “caused” the result.
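A small counterfactual check shows why both headlines hold up. The vote totals and turnout deviations below are invented for illustration, and `y_wins` is just a hypothetical helper:

```python
# Hypothetical figures for a close race that Y lost.
y_votes = 49_750
opponent_votes = 50_250

# Hypothetical deviations from "usual" turnout.
missing_z_votes_for_y = 800       # Z's who usually vote for Y but stayed home
extra_w_votes_for_opponent = 700  # additional W's who turned out for the opponent

def y_wins(y, opponent):
    """True if Y has strictly more votes than the opponent."""
    return y > opponent

# What actually happened, and two "what if" versions of the election.
actual = y_wins(y_votes, opponent_votes)
if_z_usual = y_wins(y_votes + missing_z_votes_for_y, opponent_votes)
if_w_usual = y_wins(y_votes, opponent_votes - extra_w_votes_for_opponent)

print("Y wins as it happened:", actual)
print("Y wins if Z turnout had been usual:", if_z_usual)
print("Y wins if W turnout had been usual:", if_w_usual)
```

Undoing either deviation on its own flips the result in this sketch, so each group can be named as "the cause" without saying anything false.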



3) Causal chains

If A caused B, which caused C, which caused D, you could blame D happening on any of A, B, or C, and each story is true (though incomplete).

This can provide convenient excuses to ignore solvable problems. For instance, people often say nuclear power is not adopted much because it’s so expensive. True – but why is it so expensive? If it is regulation that has made it so expensive (which I believe is true in this case), then blaming low usage on cost alone is misleading.
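The A-to-D chain can be sketched as a toy model in which each link simply passes its value along. The `outcome` function and its `force_b` / `force_c` parameters are assumptions made up for this illustration:

```python
def outcome(a=True, force_b=None, force_c=None):
    """Return D for the chain A -> B -> C -> D.

    Each link copies the previous one unless it is overridden
    via force_b or force_c (None means "no override").
    """
    b = a if force_b is None else force_b
    c = b if force_c is None else force_c
    d = c
    return d

# D happens when the chain runs undisturbed...
d_as_is = outcome()

# ...and removing any single link prevents it.
d_without_a = outcome(a=False)
d_without_b = outcome(force_b=False)
d_without_c = outcome(force_c=False)

print(d_as_is, d_without_a, d_without_b, d_without_c)
```

Since blocking A, B, or C each stops D in this sketch, "D happened because of C" is true, but it hides the earlier links that might be easier to fix.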



When reading an explanatory story based on data, it’s worth asking yourself: Is this the only story the dataset tells? Fairly often, it won’t be.


This piece was first written on August 7, 2023, and first appeared on this site on September 10, 2023.


