Newcomb–Benford law

Coronavirus Covid-19: What can Benford’s Law tell us about the truthfulness of the reported numbers

Introduction

This article discusses whether Benford’s law gives any rational grounds to distrust the numbers reported by individual countries during the coronavirus crisis.

China is often accused of forging or manipulating its data, and it is claimed that there may be millions of undetected cases. There are even scientific papers about this (1). But no one seems to look at how other countries fare by this standard. Hence we will look at how China’s reporting compares with that of 6 European countries and the USA under the Newcomb–Benford law.

Two data sets can be used for this analysis: the cumulative counts and the daily, non-cumulative reports. The inherent problem with applying Benford’s law in an ongoing pandemic is WHEN to run the analysis, because the results can change daily.

Benford’s law, also known as the Newcomb–Benford law, is frequently used to detect possible fraud in accounting (2). It has also frequently been used in disease reporting (3)(4)(5)(6).

In (6) it is stated that:

In addition we find it also holds for other natural science observables such as the rotation frequencies of pulsars; green-house gas emissions, the masses of exoplanets as well as numbers of infectious diseases reported to the World Health Organization.

Please also see this video on why Benford’s Law should fit.

Data for this article were taken from the respected Johns Hopkins University GitHub repository and Worldometer on the 15th of April.

Benford’s Law states that, in a naturally occurring set of numbers, the smaller digits appear disproportionately more often as the leading digits.

The Newcomb–Benford distribution looks like this:

Newcomb–Benford law
Benford’s law, also called the Newcomb–Benford law, the law of anomalous numbers, or the first-digit law, is an observation about the frequency distribution of leading digits in many real-life sets of numerical data. The law states that in many naturally occurring collections of numbers, the leading significant digit is likely to be small.[7]
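A real-valued random variable 𝑋 is said to be Benford (base 10) if

ℙ(𝑆(𝑋) ≤ 𝑠) = log₁₀ 𝑠 for all 1 ≤ 𝑠 < 10,

where 𝑆(𝑋) is the significand of 𝑋, that is, 𝑋 scaled by a power of 10 so that it lies between 1 and 10.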
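The familiar first-digit probabilities follow from this definition: the leading digit is d exactly when d ≤ 𝑆(𝑋) < d + 1, so its probability is log₁₀(d + 1) − log₁₀(d) = log₁₀(1 + 1/d). A minimal Python sketch of that calculation (the function name is my own, chosen for illustration):

```python
import math

def benford_first_digit_probs():
    """Return {d: P(leading digit = d)} for d = 1..9 under Benford's law (base 10)."""
    # P(d) = log10(d + 1) - log10(d) = log10(1 + 1/d)
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

probs = benford_first_digit_probs()
for digit, p in probs.items():
    print(f"{digit}: {p:.3f}")
```

A leading 1 is expected in roughly 30% of the numbers and a leading 9 in fewer than 5%.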

 

But (a) this is an ongoing pandemic, and (b) there are two kinds of numbers we can look at: the cumulative cases and deaths, and the increase from day to day.

If we run the test on the day-to-day cases and deaths and plot the Chi-Square probability against days, we arrive at the following graph, which also includes Iran, Belgium and the whole world as a comparison.
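The calculation behind such a graph can be sketched roughly as follows. This is an illustration under assumptions, not the exact code used for the plots: I assume that zero reports are skipped, that the test is re-run each day on all observations up to that day, and that the p-value comes from a Chi-Square distribution with 8 degrees of freedom (9 digits minus 1). All function names are made up for the example.

```python
import math
from collections import Counter

# Expected first-digit probabilities under Benford's law
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(n):
    """First significant digit of a non-zero integer count."""
    return int(str(abs(int(n)))[0])

def benford_chi_square(values):
    """Return (statistic, p_value) for the leading digits of the non-zero
    values, or None if there is nothing to test."""
    digits = [leading_digit(v) for v in values if int(v) != 0]
    n = len(digits)
    if n == 0:
        return None
    counts = Counter(digits)
    chi2 = sum((counts.get(d, 0) - n * p) ** 2 / (n * p)
               for d, p in BENFORD.items())
    # Survival function of the Chi-Square distribution with 8 degrees of
    # freedom; for even degrees of freedom it has this closed form.
    h = chi2 / 2
    p_value = math.exp(-h) * (1 + h + h ** 2 / 2 + h ** 3 / 6)
    return chi2, p_value

def running_p_values(series):
    """p-value of the test applied to the first k observations, for each k."""
    result = []
    for k in range(1, len(series) + 1):
        outcome = benford_chi_square(series[:k])
        result.append(None if outcome is None else outcome[1])
    return result

# Usage with a made-up series (not real case data): powers of 2
example = [2 ** k for k in range(1, 101)]
print(benford_chi_square(example))
```

With only a handful of observations the expected counts per digit are very small, so the early part of such a curve should be read with caution.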

Benford’s Law Chi – Square value plotted against days when not zero for day to day reported new cases and deaths for 10 countries and the World

As we can see, the Chi-Square probability tends to collapse to 0, because eventually, with more and more observations, no distribution of numbers will hold perfectly to Benford’s Law. China and the USA do not follow this pattern in newly reported case and death numbers, and France does not in case numbers. Belgium, Germany and Sweden don’t collapse towards 0. We can then do the same for cumulative cases and deaths.

Benford’s Law Chi – Square value plotted against days when not zero for day to day reported cumulative cases and deaths for 10 countries and the World

As we can see, in the cumulative cases ALL Chi-Square probabilities eventually collapse towards 0, with Iran holding out the longest, although the World cases start to recover. In cumulative death numbers, only China’s numbers have collapsed. But Benford’s law will not hold for cumulative cases once the exponential peak has been reached, as the numbers then settle into a steady state.
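The steady-state effect can be illustrated with a toy curve. The sketch below assumes a logistic curve for the cumulative count; the plateau of 80,000, the growth rate and the midpoint are invented parameters, not fitted to any country. During the growth phase the leading digits are spread across 1 to 9, but once the curve flattens the leading digit gets stuck.

```python
import math
from collections import Counter

def logistic_cumulative(days, plateau=80_000, rate=0.25, midpoint=40):
    """Toy cumulative case curve (hypothetical parameters)."""
    return [round(plateau / (1 + math.exp(-rate * (t - midpoint))))
            for t in range(days)]

def first_digit_shares(values):
    """Share of each leading digit 1..9 among the positive values."""
    digits = [int(str(v)[0]) for v in values if v > 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

growth_shares = first_digit_shares(logistic_cumulative(40))    # growth phase only
plateau_shares = first_digit_shares(logistic_cumulative(300))  # long plateau included

print("growth phase:", growth_shares)
print("with plateau:", plateau_shares)
```

In the second series most of the observations start with the leading digit of the plateau value, so a first-digit test must fail there regardless of how honestly the numbers were reported.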

Conclusion

As this is an ongoing pandemic, the results are not final; a final result can only come after Covid-19 has been defeated. Failing Benford’s law is also not definitive proof that a nation is untruthful, or that it is incompetent and lacks the control or willingness to report truthful data.

Yet as we stand today we know, for example, that the UK is only testing hospital admissions and, outside Scotland, systematically neglects care home and home deaths.

China is an authoritarian state that thrives on constant surveillance and on misinforming its citizens. It was reported that “China admits Wuhan death toll 50% higher”. This can be seen as a step towards transparency.

We should also not forget that the Spanish Flu was called the Spanish Flu because Spain was honest in its case reporting, while other countries like the UK tried to hide the true extent. Britain, France, Germany and the United States censored and restricted early reports, while in Spain the Spanish Flu was actually called the French Flu. What we can learn is that China’s reporting might be as good or as bad as the reporting procedure of any other country.

And as we see from the analysis above, it matters when Benford’s law is applied and to what.

Applying Benford’s law to cumulative cases, up to the point where the curve peaks and the epidemic leaves the exponential stage, might be the best way to use it. But here we see that over the long run ALL countries, including the World data, eventually fail the test. Is this just because with increasing cases eventually nothing will fit Benford’s law perfectly, or do all these countries lie to an extent?

It is most likely that eventually all series fail the test as the number of samples increases.

The Chi-Square test is extremely sensitive for large sample sizes and tends to reject the null hypothesis, that is, to report a statistically significant deviation, even for small differences.
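A small worked example shows why. For fixed observed digit shares, the Chi-Square statistic grows in proportion to the number of observations n, so the same small deviation from Benford’s law looks harmless at n = 100 and damning at n = 10,000. The observed shares below are hypothetical: Benford’s shares with 2 percentage points moved from digit 1 to digit 2.

```python
import math

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def chi2_sf_8(x):
    """Survival function of the Chi-Square distribution with 8 degrees of freedom."""
    h = x / 2
    return math.exp(-h) * (1 + h + h ** 2 / 2 + h ** 3 / 6)

# Hypothetical observed shares: a small shift from digit 1 to digit 2
observed = BENFORD.copy()
observed[0] -= 0.02
observed[1] += 0.02

def chi2_for_sample_size(n):
    """Chi-Square statistic if n observations showed exactly these shares."""
    return n * sum((q - p) ** 2 / p for q, p in zip(observed, BENFORD))

for n in (100, 1_000, 10_000):
    stat = chi2_for_sample_size(n)
    print(f"n={n:>6}  chi2={stat:8.2f}  p={chi2_sf_8(stat):.6f}")
```

The deviation itself never changes; only the sample size does.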

On daily new reported deaths and cases one is on much shakier ground as it is not entirely clear if Benford’s law should hold in any case.

We should also not forget that Britain today is a mass surveillance society: a country that has a “minority gets all” voting system (FPTP) and no codified constitution, in which the ruling party has essentially the powers of an unelected monarch, and where even the courts have no power over the ruling party.

In the US the electoral college has also brought minority rule and a shaky administration. Hungary, Brazil and Australia are also ruled by right-wing propaganda parties at odds with facts or reason.

Corruption based on ancient laws operating a pseudo-democracy would appear to be not just China’s problem. And accusations that China lies about its data are probably founded more on other reasons than on actual proof. All governments have a need to prove themselves, so one should probably trust no government to ultimately tell the truth. Benford’s law could ultimately give a hint as to where the real perpetrators of false data are, but it seems a weak indicator for this pandemic.

But obviously governments can borrow a lot of money to employ statisticians that can simulate numbers that do not fail Benford’s law, so ultimate proof of where the liars are will always be hidden from the public.

Personally, I cannot see that Benford’s law is that useful in proving fraud, or the absence of fraud, in an ongoing epidemic.

It is also interesting that it is mainly Asian websites that have picked up on an article released on SSRN by Christoffer Koch and Ken Okamura:

China’s COVID-19 data matches Benford’s Law like U.S. and Italy: Researchers
No evidence of manipulation of Chinese COVID-19 data: study
Study: No evidence of manipulation of Chinese Covid-19 data
Chinese figures can be trusted, study finds

The preprint is here.

“We focus on periods when the number of national cases grows by at least 10% as after this the data series no longer follow an exponential path and the distribution of first digits is not likely to follow Benford’s Law”

“Deaths do not show these multiple changes in magnitude for Chinese provinces and only to a limited extent in the U.S. and Italy. Chinese provincial recoveries do show multiple changes in magnitude. More than two changes of magnitude are desirable for a series to follow Benford’s Law, which is why we focus on the number of confirmed cases.”
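To make these two selection criteria concrete, here is a rough sketch of how they could be computed. It is not the authors’ code: I assume the 10% refers to day-on-day growth of the cumulative count, and I measure the “changes in magnitude” as the base-10 logarithm of the ratio between the largest and smallest value. Both function names and the sample numbers are made up.

```python
import math

def fast_growth_days(cumulative, min_growth=0.10):
    """Keep the cumulative values on days where the count grew by at least
    min_growth relative to the previous day (assumed reading of the 10% rule)."""
    kept = []
    for prev, cur in zip(cumulative, cumulative[1:]):
        if prev > 0 and (cur - prev) / prev >= min_growth:
            kept.append(cur)
    return kept

def orders_of_magnitude(values):
    """Number of powers of ten spanned by the positive values."""
    positive = [v for v in values if v > 0]
    if not positive:
        return 0.0
    return math.log10(max(positive) / min(positive))

# Made-up cumulative series: doubling at first, then flattening out
series = [1, 2, 4, 8, 16, 17, 17]
selected = fast_growth_days(series)
print(selected, orders_of_magnitude(selected))
```

A series that spans less than about two orders of magnitude would, by the authors’ own criterion, not be a good candidate for the test.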

While it is mathematically correct what the authors do, it seems funny to have to choose your data so it fits the underlying maths and then conclude the data is OK.

We already know without Benford’s Law that each country seems to have its own variation of data collection. There are problems seemingly everywhere. What does it prove if I choose a period of exponential growth with changes in order of magnitude? Biological growth is expected to be exponential anyway. So all we can decide is whether a country, in the periods in which it reports exponentially growing data, is putting out numbers that fit Benford’s law exactly or not. We already know without Benford’s law that the data is flawed.

Benford’s Law Chi – Square value plotted against days when not zero for day to day reported and cumulative cases and deaths for the World data series

Did the World lie, as the Chi-Square test would indicate at some point? The fact is that the data represents sub-units, countries, which collect data from smaller units such as provinces, and these provinces in turn collect data from hospitals or doctors.

From what I have read, Benford’s law seems to work best empirically in accounting, although exceptions are known there too. And when there is fraud, maybe only one person made up the numbers. Here we have data collected by thousands of people. Even countries collect from people in different regions, and within regions many hospital trusts and other places (like GPs) have to report.

It is possible to create data series that fit Benford’s Law (Diekmann, 2007). But a forger would need to manipulate the sub-regional data series. Here is a GitHub repository, Benford Random Number Generator, that can do that.
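How easy this is can be shown in a few lines. The sketch below is not the method of the repository mentioned above, nor Diekmann’s; it uses a standard construction that I am assuming here for illustration: if U is uniform on [0, 1), then 10 raised to the power U has a significand that satisfies ℙ(𝑆 ≤ 𝑠) = log₁₀ 𝑠. Stretching U over several powers of ten gives counts of different sizes. The function name and parameters are invented.

```python
import math
import random
from collections import Counter

def benford_sample(n, magnitudes=4, seed=0):
    """Generate n positive integers whose leading digits follow Benford's law.

    Each value is 10**(k * U) with U uniform on [0, 1) and k a whole number
    of orders of magnitude; truncating to an integer keeps the leading digit.
    """
    rng = random.Random(seed)
    return [int(10 ** (magnitudes * rng.random())) for _ in range(n)]

fake_counts = benford_sample(20_000)
digit_counts = Counter(int(str(v)[0]) for v in fake_counts)
for d in range(1, 10):
    share = digit_counts[d] / len(fake_counts)
    print(f"{d}: observed {share:.3f}  expected {math.log10(1 + 1 / d):.3f}")
```

Numbers made up this way would pass a first-digit test, which is one more reason why passing the test proves little.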

So it seems possible that Benford’s law is best suited to detecting fraud at the level of the smallest reporter, such as a hospital manager. But that person would be unlikely to fit their data to the big picture unless there is a coordinated mass conspiracy. And whether an individual reporting hospital would meet the preconditions (exponential growth and changes in order of magnitude) would depend on the size of the hospital and the stage of the epidemic.

And maybe within all these hospitals and doctors mistakes are made, intentionally or unintentionally. I am doubtful whether Benford’s law would pick this up, as, unlike in accountancy, there is not yet enough empirical data to compare and contrast.

I would like to end with a quote from the Royal Statistical Society:
“The key is that, in itself, Benford’s law is not a hypothesis test, and does not ground such tests without considerable qualification. Sometimes, it may flag things that could be found another way, such as “Why were so many expense approvals in the $9000 range?” (Perhaps because one’s signing authority stopped at $10 000.) But the hope is that no one is ever formally accused, let alone convicted, based on Benford’s law, without independent, and carefully thought‐out, evidence.”
The promises and pitfalls of Benford’s law

Also note the other Coronavirus, Covid-19 articles:

CORONAVIRUS COVID-19: MOBILITY AND CASES, WHY SHOPPING IS MORE DANGEROUS THAN GOING TO WORK.
CORONAVIRUS COVID-19: THE FALLACY OF SOME SIMPLE ARGUMENTS AS TO WHY COUNTRIES DIFFER
CORONAVIRUS COVID-19: WHAT WILL OUR AI DOMINATED SOCIETIES LOOK LIKE AFTER THE PANDEMIC?

(1) Correcting under-reported COVID-19 case numbers: estimating the true scale of the pandemic
(2) Benford’s Law
(3) Performance of public health surveillance systems during the influenza A(H1N1) pandemic in the Americas: testing a new method based on Benford’s Law
(4) The economic impact of pandemic influenza
(5) Monitoring the Paraguayan epidemiological dengue surveillance system (2009-2011) using Benford’s law
(6) Benford’s law in the natural sciences
(7) What is Benford’s Law?