Working With Public Coronavirus Data

The UK Government provides public datasets that can be used by the media or the public. One such dataset contains information collected for the Coronavirus pandemic and its impact on people living and working in the UK. The problem with this dataset, though, is that we weren’t able to record the data until after the pandemic had got into full swing. Some examples of factors that defeat those trying to understand the data are below…

There was no actual testing system in place in March, which likely means a massive under-reporting of cases. The current testing system has reached capacity in some areas, which will mean under-reporting of cases now. In between, when there was a fully working test system, we would have been receiving reasonable data (the testing process itself is not 100% accurate, but we can expect the numbers to provide a good indication of the state of affairs when the system is in place and working).

Case numbers have been the go-to metric for reporting on Coronavirus, but I think this could be a mistake, given the massive problems with the collection of the data for cases. With this in mind, I have examined the other data that is available and made an effort to construct a model based on a more reliable measurement for how the situation has developed. A reliable metric needs to be one that is likely to have reported reasonable numbers throughout the March to September date range, and one that is not affected by time slices with no testing or limited testing. From this, we can examine the relationship between cases and our new metric to see if it provides a model for predicting cases.

Stable Metric – Hospital Admissions

The measurement I have selected is hospital admissions. This metric is not dependent on self-reporting or the availability of testing. We can theorise that there is a strong relationship between the number of cases and the number of patients admitted to hospital. Using the data from the public dataset, we can construct the following chart.

Original Coronavirus Data

The main suspect area in this chart appears on the left-hand side, where the number of hospital admissions seems high compared to the number of cases. There is a secondary suspect area on the right, where the media reported that the availability of tests was limited.

Building a Model

If we take the data “in the middle”, where we know there was a testing system in place (but before the tests started to run out), we can create the following chart based on a relationship between hospital admissions and cases.

Coronavirus Adjusted Model
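To make the idea of fitting on the “middle” window concrete, here is a minimal Python sketch. It is not the exact method used to produce the chart: it assumes the model is simply the median of the daily cases-to-admissions ratios inside the trusted window, and the function names, day indexes, and numbers are all made-up placeholders rather than values from the public dataset.

```python
from statistics import median

def fit_ratio(rows, first_day, last_day):
    """Median of daily cases/admissions ratios for days inside the window.

    rows holds (day_index, cases, admissions) tuples.
    """
    ratios = [
        cases / admissions
        for day, cases, admissions in rows
        if first_day <= day <= last_day and admissions > 0
    ]
    if not ratios:
        raise ValueError("no usable days inside the window")
    return median(ratios)

def cases_from_admissions(admissions, ratio):
    """Apply the fitted ratio to an admissions count."""
    return admissions * ratio

# Made-up numbers, NOT taken from the public dataset.
rows = [
    (1, 300, 100),    # early day, testing not in place: excluded by the window
    (50, 1500, 100),
    (51, 1600, 100),
    (52, 1700, 100),
    (90, 900, 100),   # late day, tests running out: excluded by the window
]
ratio = fit_ratio(rows, first_day=50, last_day=52)
print(ratio)                              # 16.0
print(cases_from_admissions(200, ratio))  # 3200.0
```

The window boundaries are the judgment call here: move them and the fitted ratio moves with them, which is one reason to stay sceptical of any single figure.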

The model suggests that the number of cases during the peak of the pandemic may have been in the region of 50,000 cases per day. This is significantly higher than the reported figures, in fact so much higher that we need to remain sceptical about the model. Let’s test the model on recent numbers.

Model Prediction August/September

And now let’s look at the reported numbers for the same period.

Coronavirus Reported Number August/September

The model isn’t too far out from the reported numbers. The reality is likely to be in the same zone: in all probability, higher than the numbers reported over the past week by some fifteen to twenty-five percent.

Early Case Reporting Likely to be Wrong

Based on these numbers, the current number of cases isn’t as worrying as it first appears in official charts. It is still concerning, as cases are going up and we know that growth is not a linear process. The under-reporting in March/April is likely to have resulted in a big hole in the cases dataset. That means our understanding of the spread of Coronavirus, as based on the early data, is likely to be wrong. What we can do is track a more reliable metric, although we also need to understand that it may suffer from more lag than cases (cases are likely to be reported closer to an individual first being ill, with hospitalisation happening days later).

The model that we have scratched together might not be perfect, but what we can infer is that cases were massively under-reported in March. How wide the “error bar” needs to be is unclear, as much of the reporting of ratios is based on the same data we are questioning in this article. We are also working our way up from quite a small number (250 a day in a population of sixty million), which means being one or two out at this level makes a big difference to the number of cases we can infer.

In the US, the case rates were 5x the hospitalisation numbers. The model in this article finds a somewhat larger gap (more like 16x) – but please remember this is just one way to examine the relationship between the numbers. The relationship we are basing this on seems a likely one, but that’s not proven by the above analysis.

And finally, please don’t walk away from this post thinking “oh well, that’s okay then!” The ability of the epidemic to spread will take us by surprise if we don’t keep a careful watch on the spread of the virus. In the UK the rates are doubling each week, which means the graph will “hockey stick” upwards if we can’t get things under control fast. If you can stay away from other people, limit that contact, and stop the spread – you are going to save real lives.
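For a feel of what “doubling each week” means, here is a small Python sketch of the arithmetic. It assumes the doubling time stays fixed at seven days, which real epidemics do not guarantee, and the starting value of 1,000 is a placeholder rather than a reported figure.

```python
def project(current, days, doubling_days=7.0):
    """Value after `days` if it doubles every `doubling_days` days."""
    return current * 2 ** (days / doubling_days)

# Placeholder starting point of 1,000 a day, not a reported number.
for week in range(5):
    print(week, project(1000, week * 7))
```

After four weeks of steady doubling the starting number has multiplied by sixteen, which is the “hockey stick” shape in numerical form.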

Updates

As of the 22nd October 2020 the Cases to Admissions ratio seems to be reliably around the 16x mark. The average since the start of August is 16.08 and the median for the same period is 16.73. That means that on average, around 6% of cases result in hospitalisation. The distribution of daily ratios is shown below. Not quite a normal distribution, but while it’s not a bell, it is jellyfish-like.

Distribution of Cases to Admissions
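The step from a cases-to-admissions ratio to a hospitalisation percentage is just a reciprocal. A quick Python check using the mean and median quoted above (the function name is ours, purely for illustration):

```python
def share_hospitalised(cases_per_admission):
    """Fraction of cases admitted to hospital, given cases per admission."""
    return 1.0 / cases_per_admission

mean_ratio = 16.08    # average ratio quoted in the article
median_ratio = 16.73  # median ratio quoted in the article

print(f"{share_hospitalised(mean_ratio):.1%}")    # 6.2%
print(f"{share_hospitalised(median_ratio):.1%}")  # 6.0%
```

Both values land close to the “around 6%” figure.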

Based on the median of 16.73x cases to admissions, we can see how the predicted cases compare to the actual numbers since August. As you’d expect, hospital admissions are a lagging indicator, as people tend to become a case some days before they are admitted to hospital.

Predicted vs Actual Cases Since August
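The prediction itself can be sketched as a multiplication plus an optional shift for the lag. This is a rough Python illustration rather than the code behind the chart: the `lag_days` parameter is our own assumption (the article only says “some days”), and the admissions values in the usage note are placeholders.

```python
MEDIAN_RATIO = 16.73  # median cases-to-admissions ratio quoted in the article

def predict_cases(admissions, ratio=MEDIAN_RATIO, lag_days=0):
    """Predicted cases per day from a list of daily admissions.

    With lag_days = k, admissions on day i + k are attributed to cases
    on day i. Days whose matching admissions are not yet known are None.
    """
    if lag_days < 0:
        raise ValueError("lag_days must be zero or positive")
    predicted = []
    for i in range(len(admissions)):
        j = i + lag_days
        predicted.append(admissions[j] * ratio if j < len(admissions) else None)
    return predicted
```

For example, `predict_cases([100, 200, 300], ratio=10, lag_days=1)` gives `[2000, 3000, None]`: the most recent day cannot be predicted until its lagged admissions figure arrives.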

We can then look back to the previous periods to predict how many cases there were during the first peak, even though it wasn’t possible to measure the number of cases at the time. Perhaps we should call this “predicted vs recorded”, as the recorded numbers are less likely to be the actual numbers than the predicted ones.

Predicted vs Recorded Cases Since March

We can revisit this 16.73x multiplier with new data as it emerges to confirm the model, but it looks like the uppermost spike was almost 60,000 cases. That’s higher than the original model, which pitched it at around 50,000 – but both models are likely to be more indicative of the true extent of cases than the recorded numbers.