As COVID-19 continues to spread, healthcare groups and companies of all kinds are under pressure to provide care in the face of increasing demand. Healthy Data is your window into how data can help organizations address this crisis.

As the rapid spread of COVID-19 continues, data managers around the world are pulling together a wide variety of global data sources to inform governments, the private sector, and the public with the latest on the spread of this disease.

In this article, we discuss how this data is accessed, walk through an example environment and setup for data processing, share sample Python code that shows how simple data transformations are with Pandas, and explain how this straightforward architecture can help you unlock new insights from the data yourself.  Let’s get started.


The importance and impact of reliable data

Reliable data of this scale has helped to promote responsible decision-making in workplaces and communities around the world, including legislative action around international travel, the provision of emergency medical resources, the bolstering of financial markets, support for small business owners and proprietors, and medicine and treatment to those who are infected.  

Households and families are also using this data to prepare for the nuances of everyday life that will change as a result of this virus, whether it be securing a paycheck, buying groceries and essentials, playing with their kids, inviting the neighbors over for a barbecue or walking the dog. 

Properly prepared data is paramount to the success of our attempts to mitigate and contain the spread of the virus, and the impact of information sharing at this level is truly transformative. But this type of globally aggregated data doesn’t just appear on its own.  Reliable sources first need to be discovered, retrieved, parsed, and aggregated before they can be distributed around the world. Never before have we been able to pull this together in such a short period of time.

Specific challenges involved with data related to the Coronavirus

A reliable source of contagion data for a global event unfolding in real time is not easy to find. The major repositories that track public health issues and disease outbreaks can be accessed through an API.

Often, however, the data needs to be communicated to the public so quickly after it’s generated that it’s first provided via more accessible methods, leaving programmatic access via an API for later historical analysis. These methods (PDFs or HTML tables) can be time-consuming to parse and scrape correctly and consistently across a bulk set of examples, yet that wasn’t enough to stop the developer community from getting started.
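When a source does publish its figures as an HTML table, the scraping side can be fairly painless. As a minimal illustration (the URL below is a placeholder, not one of the actual source pages), Pandas can read such a table directly into a DataFrame:

import pandas as pd

# Hypothetical URL standing in for a public situation-report page
url = 'https://example.org/covid19/situation-report.html'

# read_html returns one DataFrame per <table> element found on the page
tables = pd.read_html(url)
report = tables[0]
print(report.head())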

One such effort comes from a team of researchers affiliated with Johns Hopkins University, who have been programmatically retrieving the PDF files posted on these organizations’ websites and parsing their contents into CSV files stored in a public repository on GitHub. The repository has been starred over 15k times and forked over 7k times, and engineers around the world are using it as source data to ingest into their data pipelines.

To provide a globally useful dataset, they have connected to more than 14 data sources from around the world and aggregated them into their data model.  Not all of these sources can be treated the same way; each has its own specific needs. Some include confirmed cases while others include presumptive cases.  Some are Excel spreadsheets or HTML tables while others are location pins on Google Maps. And some are in languages other than English.

The team at JHU has done a fantastic job performing the bulk of the tricky normalization for us, yet they continue to face challenges every day in conforming to the new shapes, sizes, and formats of data that seem to be coming online exponentially as the virus continues to spread around the world.  In the next section, we’ll provide the link to access this data and perform some data cleaning and normalization operations using Python and Pandas!

How to set up your own data environment for analyzing COVID-19 data

Data Access

Data access is provided by Johns Hopkins University.  They have built a pipeline that ingests global situation reports from the World Health Organization, translates data from South Korea, China, and Taiwan into English, and pulls from 10+ other global sources that appear to be quite labor-intensive to retrieve (a lot of web scraping).

While some cleaning and normalization has already gone into this dataset, such as converting all timestamps to a UTC time zone and addressing some inconsistencies related to update frequency, there are still plenty of opportunities for us to dive in and focus on cleaning and normalization activities that will unlock real insights.

# | Data Publisher            | Scope | Source | Format | Update Cycle | Location
1 | Johns Hopkins University* | USA   | GitHub | JSON   | Daily        | https://github.com/CSSEGISandData/COVID-19

* Johns Hopkins University is actively parsing daily situation reports from the WHO and integrating them into their data model for the open-source community to access
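If you simply want to explore the data before building a full pipeline, a single daily report can be pulled straight from the repository’s raw CSV URL. The path below follows the repository’s csse_covid_19_daily_reports layout; the date in the filename is just an example:

import pandas as pd

# Raw CSV for one daily report; files are named MM-DD-YYYY.csv
base_url = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/'
            'master/csse_covid_19_data/csse_covid_19_daily_reports')
daily = pd.read_csv(f'{base_url}/04-10-2020.csv')
print(daily.head())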

Data infrastructure

Here is an example of a simple, cloud-based architecture suitable for rapid deployment of a data pipeline.  In the environment created for this article, a Virtual Private Cloud (VPC) containing a Linux EC2 instance, a PostgreSQL database, and an internet gateway was spun up on Amazon Web Services and then connected to an external BI dashboarding tool.

The pipeline manager (the EC2 instance) hosts a Python installation and the Apache Airflow task scheduler (originally developed at Airbnb), which operates the data pipeline.  After some simple configuration, Airflow is up and running, executing the Python scripts on its own and writing data to the database every time the source data is updated.

Meanwhile, the internet gateway allows external BI tools to connect to the data over a trusted connection so the data can be explored visually and reports can be generated to communicate findings to others, all without needing to download a local copy of the data. This architecture allows the data in the database and the downstream reports to stay up to date automatically.
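To make the scheduling piece concrete, here is a minimal sketch of what the Airflow DAG could look like. The DAG ID, owner, and schedule below are illustrative assumptions rather than the exact configuration used in this environment, and the task callable simply stands in for the fetch/clean/load logic shown in the next section:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def refresh_covid_data():
    # Placeholder: fetch the latest JHU CSVs, clean them with Pandas,
    # and write the result to the PostgreSQL database (see next section)
    pass


default_args = {
    'owner': 'data-team',              # illustrative owner name
    'retries': 1,
    'retry_delay': timedelta(minutes=15),
}

dag = DAG(
    dag_id='covid19_daily_refresh',    # hypothetical DAG name
    default_args=default_args,
    start_date=datetime(2020, 3, 1),
    schedule_interval='@daily',        # JHU publishes one report per day
    catchup=False,
)

refresh_task = PythonOperator(
    task_id='refresh_covid_data',
    python_callable=refresh_covid_data,
    dag=dag,
)

Once a file like this is placed in Airflow’s dags/ folder, the scheduler picks it up and runs the refresh once per day, which is what keeps the database and downstream reports current.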


Processing and cleaning the data using Python and Pandas, writing to SQL database

Finally, we’ll walk through a few simple examples of data cleaning and normalization in Python that can be applied to this dataset so it can be inserted into a SQL table and queried for valuable insights. The goal here is to format the data sources into a common structure for effective bulk processing.

1. Combine data from multiple .csv files and drop any duplicates that may exist.

import pandas as pd
import os

# Collect every daily report CSV from the JHU repository folder
files = [name for name in os.listdir('csse_covid_19_daily_reports') if name.endswith('.csv')]

df = pd.DataFrame()
for filename in files:
    filepath = f'csse_covid_19_daily_reports/{filename}'
    with open(filepath) as f:
        da = pd.read_csv(f)
    df = df.append(da, ignore_index=True, sort=False)

# Remove duplicate rows that appear across overlapping reports
df.drop_duplicates(inplace=True)

2. Fill null values with empty strings to prevent rows from being dropped by subsequent table transformations.

df['Province/State'].fillna('', inplace=True)
df['Country/Region'].fillna('', inplace=True)

3. Sort dates in descending order with most recent at the top.

df.sort_values(['Last Update'], ascending=False, inplace=True)

4. Convert datetimes to dates in preparation for grouping and retrieving the most recent record from each date.

# Reduce full timestamps to calendar dates (rows stay sorted newest-first within each date)
df['Last Update'] = df['Last Update'].apply(lambda x: pd.to_datetime(x).strftime('%Y-%m-%d'))

5. Group and aggregate to retrieve the latest report from each day.

# Rows are sorted newest-first, so 'first' keeps the latest report for each group
df = df.groupby(['Country/Region', 'Province/State', 'Last Update']) \
    .agg({
        'Confirmed': 'first',
        'Deaths': 'first',
        'Recovered': 'first',
        'Latitude': 'first',
        'Longitude': 'first'})

6. Convert floating-point decimal fields to type “int.”

for col in ['Confirmed', 'Deaths', 'Recovered']:
    df[col] = df[col].astype(int)

7. Write the data to a SQL database.

# Bring the group keys back as regular columns before writing
df.reset_index(inplace=True)
df.to_sql('cssc_daily_reports', con=connection, index=False, if_exists='replace')
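The connection object above is assumed to be a SQLAlchemy engine pointing at the PostgreSQL database in the VPC; a minimal setup with placeholder credentials might look like this:

from sqlalchemy import create_engine

# Placeholder connection string; substitute your own host, database, and credentials
connection = create_engine('postgresql://username:password@your-db-host:5432/covid19')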

8. Connect to your SQL database and visualize your newly written data.
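As a quick check that everything landed correctly, you can query the new table directly (here with Pandas, though a BI tool would issue similar SQL). The table and column names match what was written in step 7; the aggregation is just one possible view of the data:

query = """
    SELECT "Country/Region" AS country,
           "Last Update" AS report_date,
           SUM("Confirmed") AS confirmed
    FROM cssc_daily_reports
    GROUP BY "Country/Region", "Last Update"
    ORDER BY "Last Update"
"""

# Reuse the same SQLAlchemy engine from step 7
trend = pd.read_sql(query, con=connection)
print(trend.tail())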

In this graphic we see that China’s cases mostly leveled off after 3-4 weeks, while growth in other countries is still ongoing. 

As of 4/10/2020

In this graphic we see the differences in new cases each day between the United States, Germany, Italy, and Spain. 

As of 4/10/2020

Fighting fire with data

As we continue to encounter global challenges at this scale, whether it be the coronavirus or another civilization-halting crisis, the importance of data collaboration across countries and state lines should not be underestimated. Not only because data wins debates, but because good data converts talk into consensus and action.


Govind Rajagopalan is a Senior Engineering Manager at Sisense. He has over 15 years of experience across both engineering and management roles with diverse companies, domains, teams, and technologies. He is excited to teach, help his teammates thrive, and have fun improving his craft.
