Bad data can lead us astray in the fight against COVID-19
No other disease has been broken down into numbers as much as COVID-19. Every day, Finland’s largest newspaper, Helsingin Sanomat (HS), reports exactly how many people have been reported infected, have died of COVID-19, have been hospitalised, or are in intensive care because of the disease.
At the time of publishing this story, there were 7,144 confirmed cases of COVID-19 in Finland. Many other media besides HS report the Finnish figures, and outlets across the world are doing the same.
For Jaakko Nevalainen, professor of biostatistics at Tampere University, figures are a natural way to describe the world. He believes that for the general public, even a smaller flood of numbers would be enough to understand the overall situation.
“It would be best if we started using only those figures of which we can be most certain,” Nevalainen says.
Such data would include, for example, how many patients have been hospitalised or are in more demanding intensive care. We also know for certain how many of those in hospital have died.
However, the figure describing the number of infections is based on how many patients have been tested.
“The true number of persons who are infected with COVID-19 is not known, because many individuals may have mild symptoms and do not seek care or testing. And some may have no symptoms at all. Being infected and getting sick with COVID-19 are two different things. For example, if the testing criteria or testing activity change along the way, the number of laboratory-confirmed and reported cases of the disease will not be comparable with the previous data,” says Pekka Nuorti, professor of epidemiology at Tampere University.
The problem with the coronavirus numbers is that, even though their shortcomings are recognised, they are still used as a basis for general thinking, for forming a picture of the disease, and for public debate. When there is no better source, data literacy and data criticism emerge as new civic skills.
Helps researchers but confuses others
At the beginning of a pandemic, the total number of infected people in the population can only be estimated. Because a proportion of those infected have a mild or even asymptomatic infection, the true extent of the disease is an educated guess.
For example, the Finnish Institute for Health and Welfare estimated earlier that the number of people with COVID-19 infection is about 20 to 30 times the number of confirmed cases.
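To make the scale of that estimate concrete, here is a minimal back-of-the-envelope sketch in Python. The case count and the 20–30x multiplier come from the figures quoted above; the result is only as good as the multiplier assumption itself.

```python
# Back-of-the-envelope sketch of the under-ascertainment estimate above.
# The case count and the 20-30x multiplier are from the article; the
# output is only as reliable as the multiplier assumption.
confirmed_cases = 7_144               # laboratory-confirmed cases in Finland
multiplier_low, multiplier_high = 20, 30

low = confirmed_cases * multiplier_low
high = confirmed_cases * multiplier_high
print(f"Estimated true infections: {low:,} to {high:,}")
# -> Estimated true infections: 142,880 to 214,320
```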
However, in antibody tests conducted in the Uusimaa region, which has had by far the most infections, estimates of the proportion infected ranged from 2‰ (0.2%) to 2% in May 2020.
Neither figure is likely to be correct or final. The true figures will only be discovered much later, when there is more experience with antibody testing. The figures are currently difficult to interpret, and antibody tests are not yet considered reliable indicators of the disease, immunity, or its duration. These questions require further investigation.
The numbers of laboratory-confirmed, reported cases have taken on an important role in describing the spread of the epidemic, because the media have repeated them and monitored their development from the very beginning. However, the total number of infections is a very uncertain estimate, and the first data on it are only just becoming available.
Yet information that mainly confuses the general public is important to researchers.
“Reliable estimates of the prevalence and incidence of antibody-positive cases will eventually help to obtain information on the time of development of immunity and the persistence of antibodies,” Nevalainen says.
This information can be used, for example, to plan the order in which people will be vaccinated when a vaccine becomes available.
When even death is uncertain
Of the indicators used to measure the burden of the disease, death is, in principle, unequivocal: a person was alive, and then their life ended. However, when looking at the coronavirus pandemic through statistics, even death is not unambiguous.
“A person can die of COVID-19 or he or she can die with COVID-19,” Nuorti points out.
“The discussion is further complicated by the fact that it is not always clear which concept of mortality is actually referred to at a given time,” Nevalainen notes.
We want to know about mortality because it may tell us how likely we are to die of the disease. On a personal level, the precise classification of a loved one’s death may seem a secondary question, but it is of great importance in assessing the overall impact and progression of the epidemic.
It is necessary to understand what is meant by mortality in order to compile, compare or use statistics and figures for risk assessment.
“In the discussion, the case fatality rate (CFR) and infection fatality rate (IFR) are sometimes confused,” Nuorti says.
Currently, CFR is the better known of these epidemiological concepts of mortality. It means the proportion of people who have died among those who have had a confirmed COVID-19 infection.
IFR, in turn, seeks to assess the proportion of all coronavirus-infected individuals, including asymptomatic and untested ones, who eventually die of the disease.
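To make the distinction concrete, here is a minimal sketch in Python. All of the numbers are invented for illustration; the total number of infections is exactly the unknown quantity discussed above.

```python
# Illustrative sketch of the CFR/IFR distinction with invented numbers.
# In reality, total_infections is unknown during an epidemic, which is
# why the IFR can only be estimated afterwards.
deaths = 300                 # reported COVID-19 deaths (hypothetical)
confirmed_cases = 7_000      # laboratory-confirmed cases (hypothetical)
total_infections = 150_000   # all infections incl. untested (hypothetical)

cfr = deaths / confirmed_cases    # case fatality rate: deaths per confirmed case
ifr = deaths / total_infections   # infection fatality rate: deaths per infection

print(f"CFR = {cfr:.1%}")   # -> 4.3%, looks alarming
print(f"IFR = {ifr:.1%}")   # -> 0.2%, far lower because the denominator is larger
```

The same number of deaths yields very different rates depending on the denominator, which is why confusing the two concepts distorts risk assessment.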
“We will not know the infection fatality rate for sure until the epidemic has ended. Confirmed results from antibody tests will help in estimating it. It is also possible to assess the situation by examining excess death figures, i.e. by comparing this year’s death rate with previous years, but at an early stage one must be cautious in using such figures,” Nuorti says.
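As a minimal sketch of that excess-death comparison, with invented weekly figures standing in for real registry data:

```python
# Sketch of the excess-mortality comparison described above. A baseline
# is built from the same calendar weeks in previous years and subtracted
# from this year's observed deaths. All figures here are invented.
baseline_weekly_deaths = [1_050, 1_040, 1_060, 1_030]  # average of prior years
observed_weekly_deaths = [1_080, 1_150, 1_220, 1_190]  # same weeks this year

excess = sum(obs - base for obs, base in
             zip(observed_weekly_deaths, baseline_weekly_deaths))
print(f"Excess deaths over the period: {excess}")  # -> 460
```

In practice, epidemiologists adjust such baselines for trends and seasonality, which is one reason early excess-death figures must be read cautiously.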
Care home deaths are a sensitive issue
Nevalainen points out that it is also important to understand the scale of the COVID-19 figures correctly. According to Statistics Finland, in 2018, 54,000 people died in Finland, or about 150 people every day. At the time of writing, the COVID-19 epidemic has lasted for three months in Finland, during which time 327 people have died from the disease.
Indeed, many have begun to interpret the figures quite optimistically using similar comparisons.
“In the list of causes of death, COVID-19 is currently far from the top. However, we have seen its potential to spread and rise to the top very quickly. The current situation rests on restrictions that have worked,” Nevalainen cautions.
In terms of numbers, the essential question is who is dying from the disease. As many as half of COVID-19 victims in Europe have been care home residents.
The question of the deaths of old people in care homes is sensitive. Among care home residents, the average remaining life expectancy may be two to three years, but each of those days is as valuable as anyone else’s. At the same time, looking at the statistics, one must also understand that when the virus spreads to a care home, it will end many fragile lives in a short time. This shows up as a spike in the cause-of-death statistics.
How should we think about this so that we can be both ethically and statistically correct?
According to Nevalainen, every avoidable death is a tragedy. The tragedy is accentuated if, because of the coronavirus, people must spend the last days of their lives in lockdown without the presence of their loved ones. On the other hand, the population-level figures for the same deaths are generated by counting, which may seem callous.
“When we assess the reasons for changes in overall mortality, we often talk about competing causes of death. For example, if COVID-19 caused 300 deaths, it would not necessarily mean that overall mortality from all causes increases by 300 in 2020, because some frail people would probably have died of some other cause in the absence of COVID-19,” Nevalainen says.
“In other words, it is sometimes difficult to ascertain whether an individual is dying with the disease (association) or from the disease (causation),” adds Nuorti.
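A toy calculation makes the competing-causes point concrete; the overlap share below is invented purely for illustration.

```python
# Toy illustration of competing causes of death: 300 COVID-19 deaths do
# not necessarily raise all-cause mortality by 300, because some of the
# frailest victims would likely have died of another cause within the
# same year anyway. The share below is a made-up assumption.
covid_deaths = 300
share_dying_of_competing_causes = 0.4   # hypothetical overlap

net_increase = covid_deaths * (1 - share_dying_of_competing_causes)
print(f"Net increase in all-cause mortality: {net_increase:.0f}")  # -> 180
```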
More data does not mean better data
As the coronavirus pandemic progresses, a huge amount of data, and modelling based on it, is published daily.
“Growing numbers do not always increase our understanding, especially if we do not know where the figures come from, which makes it hard to assess the quality of the data,” Nevalainen summarises.
With figures, statistics, and the models based on them, it is important to evaluate the reliability of the underlying analyses, not just to interpret the results. A problem arises if a risk assessment or forecast is published without explaining the rationale, the assumed mechanisms, or the model structure. There is often no trace of where the figures used as the basis of the analysis came from or how they were produced, and in a hurry they are not sufficiently questioned, either.
The World Health Organization’s (WHO) global COVID-19 statistics are a good example. The WHO is a specialised agency of the United Nations that appears to be a neutral umbrella organisation. And so it is: as an international actor, the WHO is the only body that can coordinate the response to the pandemic even to some degree.
At the same time, however, it should be remembered that the open coronavirus data produced by the WHO, often used as the raw material for news, also requires caution in interpretation. When different countries collect data in different ways and there are large variations in the basic level of their health care systems, countries can hardly be compared.
“Establishing coordinated reporting practices would in principle be possible, but at present national health authorities are doing their own thing and reporting on the disease in their own way. Common practices should be established in a quieter time than now, when we have our hands full with other things,” Nevalainen says.
Open data alone is not the key
Open data for everyone to use is usually a good thing. Open data and the justified questioning of prevailing knowledge are the building blocks of scientific research and democracy.
At the same time, the coronavirus pandemic has shown that open data is also problematic. The criteria and scope of testing vary from country to country, and while the number of hospital beds is a reliable measure, how universally health care is accessible affects the number of people in treatment as well as the death toll.
There are also differences, and even clear shortcomings, in the way the course of the disease is reported. For these reasons, an analysis that genuinely increases understanding of the progression and duration of the coronavirus epidemic is not produced simply by entering figures into an Excel file in a technically correct manner. Anyone can create credible-looking reports if they know how to make a visually compelling presentation on their website or social media channel. And anyone can believe them.
“Some of the published coronavirus data journalism concerns me. There are carefully constructed, substantiated and critically interpreted analyses produced by skilled research groups, but also quick calculation exercises by those who lack the necessary expertise. It takes data literacy to note that they are not of equivalent value, and at worst, oversimplified or mis-specified calculations can be misleading,” Nevalainen explains.
The COVID-19 data paradox is that the more people have access to the data, the more presentations, interpretations, and predictions by amateur epidemiologists are published. It is reasonable to wonder how many of the people who make their own predictions correct and republish their interpretations as more knowledge of the disease becomes available. That, after all, is at the heart of scientific research.
Hurry creates bad data
The major problem in producing numbers and statistics is the rush arising from the pressure on health care and the economy to get the disease under control. The media intensify the rush, because the pace of publication and the thirst for fast information have accelerated significantly in the internet age.
The self-corrective nature of scientific research and knowledge is well understood. However, the scientific process of self-correction and consensus building is slow.
“An important issue that has not been talked about much is the quality of published studies, the strength of the evidence, and the lack of peer review. Anyone can put a draft or a not-yet-peer-reviewed version of an article on a preprint server, and these studies often find their way into the public eye,” Nuorti says.
When the criteria for scientific work are relaxed due to urgency, essential issues in assessing the validity of findings – such as chance, bias, confounding factors, or the appropriateness of the control group – are often left unaddressed, not to speak of the technical errors made by researchers. A worrying consequence is that low-quality rapid research may guide further research and decision-making. People are eagerly awaiting a medicine that works.
“This is shown, for example, in the studies on the use of the anti-malarial drug hydroxychloroquine in the treatment of COVID-19, which began with one low-quality study. That set the snowball in motion, and several clinical trials of the drug were started,” Nuorti says.
Despite high hopes, it was not possible to confirm the effectiveness of the drug.
“This can lead to substantial resources being put in the wrong place while perhaps something completely different should be explored. It is a pretty big problem when peer review is conducted on Twitter,” Nuorti says.
Author: Juho Paavola