Dark Data and the Pandemic

by David Hand, Emeritus Professor at Imperial College London

Statistics and data science have caught the public eye in recent years, with their promise to revolutionize our world through economic, social and health benefits. In the past year, nowhere has the public role of data become more apparent than in the COVID-19 pandemic. Policies, decisions and planning – all balancing health impact against economic, educational and social impact – have had to be based on counts of infections and fatalities, on understanding how the disease spread, and on estimating what would happen under different interventions. That is, on data describing the disease and its consequences.

Not surprisingly, though, in the early days, data was limited and often of poor quality. John Ioannidis went so far as to describe it as a “once-in-a-century evidence fiasco,” but I think he was being unfair. It is unrealistic to expect data adequately describing new and previously unencountered situations to be ready-made, ripe for analysis. Rather, we have to develop data collection strategies and establish measurement procedures. Then we must collect, collate and interpret the data. And while this is being done, we have to do the best we can with the limited data at hand. This is especially true of politicians, who do not have the luxury of being able to wait while the science sorts itself out; they have to make a decision on the data available at the time. One consequence is that it may well be unfair to criticize governments for “flip-flopping” in their policies. Rather, perhaps they should be commended – as JM Keynes said, “When the facts change, I change my mind.”

It would be nice to think that another consequence was a raised public awareness of the contingent nature of science: that science is not a fixed collection of facts but a process, one that is always susceptible to change as new information becomes available.


While a paucity of data at the start of a new challenge like the pandemic is understandable, we can draw sensible conclusions only if we recognize the limitations of the data. That means giving uncertainty intervals that take into account the possible values of what you don’t know. But that recognition highlights other risks. You may be able to handle the dangers arising from numbers about which you are uncertain; it is much tougher to handle, or even recognize, the dangers arising from numbers you don’t even know exist. Here the consequence is not simply one of drawing a highly uncertain conclusion. Instead it is one of drawing a “certain” conclusion that is wrong.

Take COVID-19 infection and death rates, for example.

It’s easy to determine the number of people with COVID-19, and the rate at which new people are becoming infected – you just count the number with symptoms. Except that, on the one hand, many people appear to contract (and be able to transmit) COVID-19 without having any symptoms, and on the other, symptoms of COVID-19 are also symptoms of other illnesses. Worse still, one cannot rely on those who present at clinics or hospitals, as they are likely to be non-representative of the population as a whole. Formal surveys, using carefully constructed sampling frames, are needed to avoid self-selection issues. But even then, while error arising from sampling variability in a survey is easy enough to handle using well-established tools, and error arising from non-response is somewhat more difficult, error arising from poor or misleading definitions is something else.
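
As a minimal sketch of those well-established tools, the following computes an approximate 95% interval for a prevalence estimated from a simple random sample. The survey numbers are hypothetical, and the normal approximation is just one common choice. The interval reflects sampling variability only; it says nothing about non-response or about a poor definition of who counts as infected.

```python
import math

def prevalence_interval(positives, sample_size, z=1.96):
    """Normal-approximation interval for a proportion from a simple random sample."""
    if sample_size <= 0 or not 0 <= positives <= sample_size:
        raise ValueError("need sample_size > 0 and 0 <= positives <= sample_size")
    p = positives / sample_size
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    margin = z * standard_error
    # Clamp to [0, 1], since a proportion cannot fall outside that range.
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical survey: 120 positives among 10,000 randomly sampled people.
estimate, lower, upper = prevalence_interval(120, 10_000)
print(f"estimated prevalence {estimate:.2%}, 95% interval {lower:.2%} to {upper:.2%}")
```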

The answer, you might rightly say, is not to rely on symptoms and their intrinsic uncertainty, but to carry out formal medical tests that have precisely defined procedures. Which is fine, provided you know the false-positive and false-negative rates. Unfortunately, however, these are not simple properties of the tests themselves; they also depend on how carefully the tests are administered.
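
To make the role of those error rates concrete, here is a small sketch of the standard Rogan-Gladen correction, which converts the fraction of positive tests into an estimate of true prevalence. The sensitivity and specificity values are hypothetical, chosen only for illustration, and in practice they would themselves be uncertain for the reasons just given.

```python
def corrected_prevalence(apparent, sensitivity, specificity):
    """Rogan-Gladen estimate of true prevalence from the apparent (test-positive) rate.

    sensitivity = 1 - false-negative rate
    specificity = 1 - false-positive rate
    """
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0:
        raise ValueError("test must beat chance: sensitivity + specificity > 1")
    estimate = (apparent + specificity - 1.0) / denominator
    # Clamp to [0, 1]; the raw estimate can stray outside when rates are mis-specified.
    return min(1.0, max(0.0, estimate))

# Hypothetical: 5% of tests come back positive.
apparent_rate = 0.05
with_good_specificity = corrected_prevalence(apparent_rate, sensitivity=0.80, specificity=0.98)
with_poor_specificity = corrected_prevalence(apparent_rate, sensitivity=0.80, specificity=0.95)
print(f"specificity 98%: estimated true prevalence {with_good_specificity:.2%}")
print(f"specificity 95%: estimated true prevalence {with_poor_specificity:.2%}")
```

With a 5% false-positive rate, the whole of a 5% positive rate could be produced by false positives alone, so a small change in the assumed error rates changes the conclusion substantially.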

If infection rates represent challenges in determining the penetration of the disease into the population, what about death rates? There the definition is surely much simpler: It’s generally clear if someone is alive or dead, so numbers are easier to count. Except, it turns out that it’s not so clear after all. Do you count those who died of COVID-19 or with COVID-19 – if you can tell the difference? What about people who died of another cause that was aggravated by COVID-19? How long after a positive COVID-19 test do you regard the COVID-19 risk as having reduced to zero? And are you missing people who died of COVID-19 but did not have a formal test?

Different definitions presumably also go some way toward explaining the sometimes very substantial differences in death rates between countries. For example, by May 28, 2020, the UK was reporting 267,240 COVID-19 cases and 37,460 deaths, while Russia was reporting 379,051 cases but only 4,142 deaths. Were different ways of counting deaths responsible for this wide discrepancy?
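
The size of that discrepancy is easy to check from the figures quoted above. The sketch below divides reported deaths by reported cases for each country; this crude ratio is only an illustration, since it inherits every definitional problem in both the numerator and the denominator.

```python
def case_fatality_ratio(deaths, cases):
    """Reported deaths divided by reported cases."""
    if cases <= 0:
        raise ValueError("cases must be positive")
    return deaths / cases

# Figures as reported by May 28, 2020 (taken from the text above).
reported = {
    "UK": {"cases": 267_240, "deaths": 37_460},
    "Russia": {"cases": 379_051, "deaths": 4_142},
}

ratios = {
    country: case_fatality_ratio(r["deaths"], r["cases"])
    for country, r in reported.items()
}
for country, ratio in ratios.items():
    print(f"{country}: {ratio:.1%} of reported cases ended in a reported death")
```

On these figures the UK ratio is about 14% and the Russian ratio about 1%, a gap of roughly thirteen-fold.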

Other problems with data – other types of dark data* – that have occurred during the pandemic include:

• Data that might have been – that is, the counterfactuals arising during interventions or clinical trials.
• Gaming. As Donald Trump rightly pointed out, one way to reduce the observed rate of infection is to reduce the number of tests carried out.
• Changes over time (due to behavioral fatigue, for example, where people abandon social distancing measures).
• Missing entire relevant variables. For example, only gradually did it become apparent that the severity of the illness was related to age, deprivation and other characteristics.
• Summary data. For example, a national count of 20 per 100,000 becoming infected could be very misleading if all the cases arose at a single sporting event.
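
The summary-data problem in the last item can be illustrated with a few lines of arithmetic. The country below is entirely hypothetical: ten equal regions, with every case concentrated in one of them as a stand-in for the single sporting event.

```python
def rate_per_100k(cases, population):
    """Cases per 100,000 people."""
    return 100_000 * cases / population

# Hypothetical country: ten regions of 100,000 people each,
# with all 200 cases arising in region "A".
regions = {name: {"population": 100_000, "cases": 0} for name in "ABCDEFGHIJ"}
regions["A"]["cases"] = 200

total_cases = sum(r["cases"] for r in regions.values())
total_population = sum(r["population"] for r in regions.values())

national_rate = rate_per_100k(total_cases, total_population)
regional_rates = {
    name: rate_per_100k(r["cases"], r["population"])
    for name, r in regions.items()
}

print(f"national rate: {national_rate:.0f} per 100,000")
print(f"highest regional rate: {max(regional_rates.values()):.0f} per 100,000")
print(f"regions with no cases: {sum(rate == 0 for rate in regional_rates.values())}")
```

The national figure of 20 per 100,000 describes none of the regions: one has ten times that rate and the other nine have no cases at all.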

There is no doubt that the pandemic has presented novel statistical – and above all, data – challenges. But one of the really encouraging things – encouraging beyond the pandemic – is the way people have rapidly collected relevant data, increased understanding and developed effective interventions. It almost gives one hope for the human race – hope achieved through statistics and data science.

*Dark Data: Why What You Don’t Know Matters, David J. Hand, Princeton University Press, 2020.

