Simpson's paradox: Understand the risks in data interpretation, avoid the trap

In 1934, Morris Cohen and Ernest Nagel compared pulmonary tuberculosis death rates in New York City and Richmond. The overall death rate was higher in Richmond (226/100,000) than in New York (187/100,000). However, when the data were broken down by ethnicity, the result was surprising: the death rates for both Caucasians and African Americans were higher in New York than in Richmond. The comparison shows how strongly a grouping variable such as ethnicity can shape an aggregate mortality rate.

(Look at the table below).

Death rates due to pulmonary tuberculosis

Ethnicity	New York	Richmond
Caucasian	179/100,000	162/100,000
African American	560/100,000	332/100,000
Total	187/100,000	226/100,000

The detailed data set is available through this source: https://plato.stanford.edu/entries/paradox-simpson/notes.html#note-1

This puzzle is known as “Simpson’s paradox”. This is a statistical illusion.

How did this happen?

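When we disaggregate the data into sub-groups, the real situation appears; in extreme instances, such as this one, the direction of the comparison reverses. This happens because the sub-groups make up different proportions of each city's population, so each city's total rate is a differently weighted average of its sub-group rates.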
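The reversal in the tuberculosis table can be checked with a little arithmetic. The sketch below is only an illustration: it assumes each city's total rate is a weighted average of the two listed groups (ignoring any other groups and the rounding in the published figures) and solves for the weight that the totals imply. The helper name `implied_share` is hypothetical.

```python
# Death rates per 100,000, copied from the table above (rounded as published).
rates = {
    "New York": {"Caucasian": 179, "African American": 560, "Total": 187},
    "Richmond": {"Caucasian": 162, "African American": 332, "Total": 226},
}


def implied_share(city_rates):
    """Solve total = w * high + (1 - w) * low for w.

    w is the population share of the higher-rate group that the
    published total implies, under the two-group assumption.
    """
    low = city_rates["Caucasian"]
    high = city_rates["African American"]
    total = city_rates["Total"]
    return (total - low) / (high - low)


shares = {city: implied_share(r) for city, r in rates.items()}

for city, w in shares.items():
    print(f"{city}: implied African American share of population = {w:.1%}")
```

Under these assumptions the implied share is about 2% in New York and about 38% in Richmond. Because the group with the higher death rate carries far more weight in Richmond's total, Richmond's overall rate ends up higher even though both of its group-specific rates are lower.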

Such situations are called “Simpson’s paradox” because Edward Simpson explained the phenomenon using hypothetical data in 1951. However, Yule had already demonstrated the same bias in 1903, also using a hypothetical data set.

According to statisticians, Simpson’s paradox, by definition, is not a true paradox; rather, it is a statistical illusion and could also be called aggregate bias. It is also a manifestation of confounding effects.

Its practical implications could be devastating, particularly when we make decisions based on aggregate data.

In the above example, decision-makers who were not aware of this could have wrongly prioritised Richmond over New York City when allocating resources to reduce the death rate due to pulmonary tuberculosis.

This bias seems to occur much more commonly than previously thought.

Here are a few more examples:

Hospital admissions of men with psychiatric illnesses over the years: have they gone up or down?

I created the following table using data that appeared in a short paper in the British Medical Journal.

According to the first table, men as a proportion of all admissions with psychiatric illnesses declined slightly from 1970 to 1975.

	1970	1975
Admission rate	46.4% (343/739)	46.2% (238/515)

Now, look at the same data disaggregated by age. The pattern reverses: the male admission rate went up in both age groups.

	1970	1975
Those aged <=65	59.4% (255/429)	60.5% (156/258)
Those aged >65	28.4% (88/310)	31.9% (82/257)
Overall	46.4% (343/739)	46.2% (238/515)
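The counts in the table are enough to verify the reversal and to see what drives it. The following is a minimal sketch using only the figures above; the variable and function names are illustrative.

```python
# (male admissions, all admissions) per age group and year, from the table above.
admissions = {
    "<=65": {1970: (255, 429), 1975: (156, 258)},
    ">65": {1970: (88, 310), 1975: (82, 257)},
}
years = (1970, 1975)


def rate(males, total):
    """Men as a fraction of all admissions."""
    return males / total


def overall(year):
    """Sum the age groups to get the aggregate counts for one year."""
    males = sum(admissions[g][year][0] for g in admissions)
    total = sum(admissions[g][year][1] for g in admissions)
    return males, total


# Did the male share rise within every age group?
stratum_up = all(
    rate(*admissions[g][1975]) > rate(*admissions[g][1970]) for g in admissions
)
# Did the male share fall in the aggregate?
overall_down = rate(*overall(1975)) < rate(*overall(1970))

# Share of all admissions that came from the older age group, per year.
older_share = {y: admissions[">65"][y][1] / overall(y)[1] for y in years}

print("Rose in every age group:", stratum_up)
print("Fell overall:", overall_down)
for y in years:
    print(f"{y}: patients aged >65 were {older_share[y]:.1%} of all admissions")
```

Patients aged over 65, among whom men are a smaller proportion, made up roughly 42% of admissions in 1970 and roughly 50% in 1975. That shift in the age mix is what pulls the overall male share down while it rises within each age group.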

Another example from a hospital setting

The data below come from a published study on the use of prophylactic antibiotics in eight hospitals in the Netherlands. According to the first table, it seems better to use antibiotics prophylactically, because the urinary tract infection (UTI) rate is lower with them than without them.

	Prophylactic antibiotics	No prophylactic antibiotics
Urinary tract infection (UTI) rate	3.3% (42/1279)	4.6% (104/2240)

Since the researchers were sceptical about the finding, they disaggregated the data by grouping the hospitals into low-incidence and high-incidence hospitals based on their UTI rates, using 2.5% as an arbitrary cut-off. The first observation was then reversed: within both groups, the UTI rates were higher when prophylactic antibiotics were used.

UTI rates	Prophylactic antibiotics	No prophylactic antibiotics
Low-incidence (<=2.5%) hospitals	1.8% (20/1113)	0.7% (5/720)
High-incidence (>2.5%) hospitals	13.2% (22/166)	6.5% (99/1520)
Overall UTI rate	3.3% (42/1279)	4.6% (104/2240)
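One common way to adjust a comparison like this is direct standardisation: apply each arm's group-specific rates to the same reference mix of hospitals. The sketch below is not the analysis used in the study; it is an illustration that assumes the combined patient population of both arms as the reference, and the names `crude_rate` and `standardized_rate` are hypothetical.

```python
# (UTI cases, patients) per hospital group and treatment arm, from the table above.
data = {
    "low-incidence": {"antibiotics": (20, 1113), "no antibiotics": (5, 720)},
    "high-incidence": {"antibiotics": (22, 166), "no antibiotics": (99, 1520)},
}
arms = ("antibiotics", "no antibiotics")

# Reference population (an assumption): all patients in each hospital group,
# both arms combined.
stratum_size = {s: sum(data[s][a][1] for a in arms) for s in data}
n_total = sum(stratum_size.values())


def crude_rate(arm):
    """Aggregate rate: pool cases and patients across hospital groups."""
    cases = sum(data[s][arm][0] for s in data)
    patients = sum(data[s][arm][1] for s in data)
    return cases / patients


def standardized_rate(arm):
    """Weight each group-specific rate by the reference population mix."""
    return sum(
        (stratum_size[s] / n_total) * (data[s][arm][0] / data[s][arm][1])
        for s in data
    )


for arm in arms:
    print(
        f"{arm}: crude {crude_rate(arm):.1%}, "
        f"standardised {standardized_rate(arm):.1%}"
    )
```

With the same hospital mix applied to both arms, the rate with antibiotics comes out at roughly 7.3% against roughly 3.5% without, matching the direction seen within each hospital group. The crude comparison pointed the other way because most patients given antibiotics (1113 of 1279) were in low-incidence hospitals, while most patients not given them (1520 of 2240) were in high-incidence hospitals.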

The above study was featured on the Royal Statistical Society website in a discussion of Simpson’s paradox.

2 thoughts on “Simpson’s paradox: Understand the risks in data interpretation, avoid the trap”

Prasantha De Silva says:

December 18, 2020 at 5:46 am

Dana Mackenzie explains Simpson’s paradox with regard to COVID-19 using the US CDC data. This is a brilliant explanation. http://causality.cs.ucla.edu/blog/index.php/2020/07/06/race-covid-mortality-and-simpsons-paradox-by-dana-mackenzie/

Prasantha De Silva says:

December 18, 2020 at 5:51 am

I found another excellent article that explains this paradox using COVID-19 data. https://arxiv.org/pdf/2005.07180.pdf

The UpStreamBoat
