As COVID-19 continues to spread, healthcare groups and companies of all kinds are under pressure to provide care in the face of increasing demand. Healthy Data is your window into how data can help organizations address this crisis.

COVID-19 continues to be the major challenge for governments, health care agencies, and businesses all over the world. Many governments have instituted lockdowns, businesses have closed, and people everywhere have been practicing social distancing to slow the spread of the virus. In California, Governor Gavin Newsom ordered nearly 40 million Californians to stay at home following predictions that nearly 56% of the state’s residents could contract the virus in eight weeks without such an order. How are such predictions made? In this post, we’ll use SQL to predict the number of COVID-19 cases day by day. Let’s get started. 

The dataset 

Johns Hopkins University (JHU) maintains a GitHub repository of various datasets around COVID-19 and has made the data available for educational and academic research purposes. The daily COVID-19 report data is available in CSV form in that repository. This Kaggle repository maintains an aggregated version of the JHU dataset. Let’s use the CSV upload feature of Sisense for Cloud Data Teams to import this data into a table for analysis. 

Step 1: Preparing the data

We are going to be predicting the number of cases just in the U.S. So let’s filter the data to the “US” region and group the data by the number of confirmed cases on a given day. Let’s also add a row number to this data, which will be used for some SQL analysis later. Finally, let’s use common table expressions (CTEs) to set this up.

with
  -- Filter to the US and total the confirmed cases reported on each day.
  us_cases as (
    select
      TO_DATE(OBSERVATIONDATE) as OBSERVATION_DATE
      , sum(confirmed) as CONFIRMED_CASES
    from
      [covid_19_2020_03_29]
    where
      COUNTRYREGION = 'US'
    group by
      OBSERVATION_DATE
  )
  -- Number the days so consecutive rows can be self-joined later.
  , us_cases_with_row_nums as (
    select
      row_number() over(partition by 1 order by OBSERVATION_DATE) as row_number
      , *
    from
      us_cases
  )
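To see these two CTEs in action without a warehouse connection, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column values, and sample rows are illustrative stand-ins, not the real dataset, and TO_DATE is omitted because the toy dates are already ISO strings (SQLite has no TO_DATE).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table covid_19_2020_03_29 (
        OBSERVATIONDATE text, COUNTRYREGION text, CONFIRMED integer)
""")
# Two US provinces reporting on each of two days, plus one non-US row
# that the WHERE clause should exclude.
conn.executemany(
    "insert into covid_19_2020_03_29 values (?, ?, ?)",
    [("2020-03-27", "US", 60000), ("2020-03-27", "US", 41657),
     ("2020-03-28", "US", 70000), ("2020-03-28", "US", 51478),
     ("2020-03-28", "Italy", 92472)])

rows = conn.execute("""
    with us_cases as (
        select OBSERVATIONDATE as OBSERVATION_DATE,
               sum(CONFIRMED) as CONFIRMED_CASES
        from covid_19_2020_03_29
        where COUNTRYREGION = 'US'
        group by OBSERVATION_DATE
    )
    select row_number() over (order by OBSERVATION_DATE) as row_number, *
    from us_cases
    order by OBSERVATION_DATE
""").fetchall()
print(rows)  # [(1, '2020-03-27', 101657), (2, '2020-03-28', 121478)]
```

Each day's provincial counts collapse into one national total, and the row number gives every day a sequential index.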

Step 2: Calculating the pandemic growth factor

The pandemic growth factor (PGF) between two consecutive days can be calculated by dividing the number of confirmed cases for the latter day by the number of confirmed cases for the former day. 

PGF for Day(N) = Number of Cases on Day(N)/Number of Cases on Day(N-1)

For example, the number of confirmed cases for March 28 is 121,478, and the number of confirmed cases for March 27 is 101,657. This gives a PGF of 121,478/101,657 = 1.19 for March 28. By calculating the average PGF for the last few days, we are going to predict how many cases there will be in the future. 
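The worked example above is just a ratio of two consecutive daily totals, which we can verify in a couple of lines:

```python
# PGF for March 28, using the two confirmed-case counts quoted above.
cases_mar_27 = 101_657
cases_mar_28 = 121_478

pgf = cases_mar_28 / cases_mar_27
print(round(pgf, 2))  # 1.19
```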

Let’s use another CTE to set up the calculation of average PGF. This can be done by self joining the CTE we set up in step 1; the row number we added there is what makes the self join possible. Let’s take the average growth factor for the last five days to predict the number of cases for the next 10 days. The 10-day value is configurable and is passed through a date range filter.

  , average_growth_factor as (
    select
      -- Cast to float so integer division doesn't truncate each ratio to 1.
      avg(curr.CONFIRMED_CASES::float / prev.CONFIRMED_CASES) as average_growth_factor
    from
      us_cases_with_row_nums as curr
      join us_cases_with_row_nums as prev
        on curr.row_number = prev.row_number + 1
    where
      curr.OBSERVATION_DATE > DATEADD(day, -5, [daterange_start])
      and curr.OBSERVATION_DATE <= [daterange_start]
  )

Step 3: Predicting the number of future cases

Pandemics grow at an exponential scale. Now that we have the average pandemic growth factor, we can predict how many cases there will be N days from today. 

Number of Cases After N Days = (Number of Cases Today) * (Pandemic Growth Factor)^N
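The formula is a one-liner in any language. Here is a small sketch; the inputs in the example call are illustrative placeholders, not figures from the article:

```python
def project_cases(cases_today: int, growth_factor: float, n_days: int) -> int:
    """Cases expected n_days from now, assuming exponential growth
    at a constant pandemic growth factor."""
    return round(cases_today * growth_factor ** n_days)

# e.g. a hypothetical 100,000 cases today growing at a PGF of 1.19 for 5 days:
print(project_cases(100_000, 1.19, 5))  # 238635
```

Note how sensitive the result is to the growth factor: at a PGF of 1.19, the case count more than doubles in under five days.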

So we first need the number of cases on the day from which we are going to begin predicting. Let’s call it starting cases and set it up using a CTE. 

  , starting_cases as (
    select
      CONFIRMED_CASES as STARTING_CASES
    from
      us_cases
    where
      OBSERVATION_DATE > [daterange_start]
    order by
      OBSERVATION_DATE
    limit 1
  )

Now that we have the starting cases, we are ready to make some projections on the future number of cases. The SQL for it is below. 

select
  OBSERVATION_DATE
  , CONFIRMED_CASES
  , row_number() over(partition by 1 order by OBSERVATION_DATE) as DAY_NUMBER
  -- Repeat the window expression here: most dialects don't allow a select
  -- list to reference the DAY_NUMBER alias defined in the same select.
  , ROUND(STARTING_CASES * power(average_growth_factor,
      row_number() over(partition by 1 order by OBSERVATION_DATE) - 1), 0) as PROJECTED_CASES
  , average_growth_factor
  , STARTING_CASES
from
  us_cases
  cross join average_growth_factor
  cross join starting_cases
where
  [OBSERVATION_DATE=daterange]

This outputs the data in the format below. 

As you can see, by using the average growth factor from March 13 to March 18, we have predicted the number of cases from March 19 to March 28. Our model predicted about 117,000 cases on March 28. In reality, there were about 121,000 cases. 
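Using the rounded figures quoted above, we can put a number on how far off the projection was:

```python
# Rough accuracy check: ~117,000 predicted vs. ~121,000 actual on March 28.
predicted, actual = 117_000, 121_000

error_pct = abs(actual - predicted) / actual * 100
print(f"{error_pct:.1f}% off")  # 3.3% off
```

An error in the low single digits over a 10-day horizon is respectable for a model this simple.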

Step 4: Creating visualizations

Now that we have the data in a table format, it’s time to create some visualizations to see how the actual number of cases compared with the predicted number throughout the time period. Let’s draw a line chart with the date on the X-axis and actual as well as projected cases on the Y-axis. 

This results in a chart like the one below.

Applying the projection to more recent dates, we have projected 793,000 cases on April 19. In reality, there were 759,000 confirmed cases. This time the number of confirmed cases is less than the number of projected cases. This is a good sign: it indicates that the average pandemic growth factor has been decreasing. 

We will know that the pandemic is coming to an end when the growth factor becomes closer to one. Anything we can do to slow the spread of the pandemic—social distancing, frequent handwashing, disinfecting frequently used items, and following the guidelines of the state and local authorities—can help bring the growth factor down. 

Summary

Using a few lines of SQL, we have prepared COVID-19 data to be analyzed and from that data built a simple prediction of how many COVID-19 cases there are likely to be in the near future. This shows the power of tools in our hands that help us perform data analysis today. Sisense for Cloud Data Teams (previously Periscope Data) empowers data teams to quickly connect to cloud data sources, then explore and analyze data in a matter of minutes using SQL, R, and Python.


Govind Rajagopalan is a senior engineering manager at Sisense. He has over 15 years of experience across both engineering and management roles with diverse companies, domains, teams, and technologies. He is excited to teach, always helps his teammates build their skills, and loves improving his craft. 
