Using pandas to combine and transform data about social distancing in Brazil

Matheus Ricardo dos Santos
5 min read · Oct 20, 2020

One of the main skills of a data scientist is gathering data from different parts of the internet. However, this data might not be in the shape you need. In this article, you will learn how to combine two datasets and extract useful information from them.

1. The datasets

The first dataset is part of the Google Mobility Report. It contains data about people’s mobility during quarantine. Since it is a global report we can find data about almost any region of the globe.

Actually, we are going to use a reshaped version of this dataset. To get the data in the shape we are going to use, you can follow the steps in this article: How to prepare your dataset and add styles to plots using matplotlib.

The second dataset is available here. It is one of the public datasets built by Brasil.io and it contains data about the daily covid reports of any city, state, or region of Brazil.

2. Look at the data

It’s a good practice to look at the data, so we can have some insights from it.

import pandas as pd

# this is the first dataset (from the Google Mobility Report)
data_br = pd.read_csv('social_distancing.csv')
data_br.head()
This is what the first dataset (the one about social distancing) looks like

The second dataset:

data_br_day = pd.read_csv('covid19-br-day.csv')
data_br_day.head()
The second dataset. As you can see, it contains a lot of columns.

3. Combining features

An interesting thing about the first dataset is that the columns retail, grocery, parks, transit, and workplaces are related to the column residential. This means that if the residential column were not available, we could devise a strategy to recreate it from the other columns.

One of the most basic operations we can use is the apply function. It works like map, but it lets us choose which axis to iterate over, and it also lets us pass custom arguments to our transformation function.
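To make that concrete, here is a minimal sketch of apply with axis=1 and a custom keyword argument, using a tiny made-up frame (the values and the scaled_mean helper are illustrative, not part of the real dataset):

```python
import pandas as pd

# A tiny frame with made-up mobility values (hypothetical, not the real data)
df = pd.DataFrame({'retail': [-40, -35], 'parks': [-20, -10]})

# axis=1 hands each row (as a Series) to the function; extra keyword
# arguments are forwarded to it by apply
def scaled_mean(row, factor=1.0):
    return row.mean() * factor

result = df.apply(scaled_mean, axis=1, factor=2.0)
```

With axis=0 (the default), the same function would receive whole columns instead of rows.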

So let's look at the first ten rows of these columns:

data_br.iloc[:, [3, 4, 5, 6, 7, 8]].head(10)

Let’s see if the mean of the columns retail, grocery, parks, transit, and workplaces is a good way to represent the column residential.

data_br.iloc[:, [3, 4, 5, 6, 7]].head(10).apply(lambda x: sum(x)/5, axis=1)

It's important to notice that we are not trying to reproduce the exact values; we only want to capture the column's overall trend.

We can conclude that our new column is inversely related to the residential column. That makes sense: when people are visiting establishments, they are not at home, and vice versa.
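One way to check this inverse relation quantitatively is Pearson correlation. Here is a sketch with synthetic stand-in values (the numbers below are invented; only the column names follow the dataset):

```python
import pandas as pd

# Synthetic stand-in for the mobility data: when visits drop,
# time at home rises (hypothetical values, not the real dataset)
df = pd.DataFrame({
    'retail':      [-40, -50, -60, -30],
    'grocery':     [-10, -20, -25, -5],
    'parks':       [-30, -45, -50, -20],
    'transit':     [-50, -60, -70, -35],
    'workplaces':  [-45, -55, -65, -30],
    'residential': [ 15,  20,  25,  10],
})

# Mean of the five "away from home" columns, as in the article
away_mean = df[['retail', 'grocery', 'parks',
                'transit', 'workplaces']].mean(axis=1)

# A strongly negative coefficient confirms the inverse relation
corr = away_mean.corr(df['residential'])
```

Running the same check on the real merged data would tell you how well the custom column actually tracks residential.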

But in this case we have the residential column (thanks, Google), so we don't need to use our custom column.

4. Prepare to merge

Since we want to merge these datasets, we need to find columns with common values. As you can see, both datasets have columns that store information about the state and the date. However, the state names are written in different formats.

So, let's fix it:

data_br.state.unique()
The state names contain the prefix ‘State of’
# remove the prefix 'State of'
data_br.state = data_br.state.str.replace('State of', '').str.strip()
data_br.state.unique()

Now, we need to map each state's name to its two-letter code. We can do this with the map function, which iterates over every value in the Series and applies a transformation to each one.

# a dict to map the states' name to their code
state_dict = {
'Federal District': 'DF',
'Acre': 'AC',
'Alagoas': 'AL',
.
.
.
}
data_br.state = data_br.state.map(lambda el: state_dict[el])
data_br.state.unique()
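As a side note, Series.map also accepts a dict directly, in which case names missing from the dict become NaN instead of raising a KeyError, which can be handy for spotting unexpected values. A small sketch (the 'Unknown State' entry is invented to show the behavior):

```python
import pandas as pd

states = pd.Series(['Acre', 'Alagoas', 'Unknown State'])
state_dict = {'Acre': 'AC', 'Alagoas': 'AL'}

# Passing the dict directly: keys found are mapped, missing keys become NaN
codes = states.map(state_dict)
```

The lambda version used above fails loudly on an unknown name; the dict version fails silently, so pick whichever behavior you want.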

Let’s select which columns we are going to use in the second dataset

columns = ['date', 'state', 'last_available_confirmed', 'new_confirmed']
data_br_day_cases = data_br_day.loc[:, columns]
data_br_day_cases.columns = ['date', 'state', 'total_cases', 'new_cases']
data_br_day_cases.head()

5. Merge

With the previous steps done, we have two matching columns to merge the datasets on. Let's do it:

data_br_merged = data_br.merge(data_br_day_cases, on=['date', 'state'])
data_br_merged.head()
The resulting dataset
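By default, merge performs an inner join, so any state/date pair that appears in only one dataset is dropped silently. If you want to see how many rows matched, the indicator parameter adds a _merge column. A sketch with toy frames (the dates and values below are invented):

```python
import pandas as pd

left = pd.DataFrame({'date': ['2020-04-01', '2020-04-02'],
                     'state': ['SP', 'SP'],
                     'residential': [20, 22]})
right = pd.DataFrame({'date': ['2020-04-01'],
                      'state': ['SP'],
                      'new_cases': [50]})

# how='outer' keeps unmatched rows; indicator=True records each row's origin
merged = left.merge(right, on=['date', 'state'], how='outer', indicator=True)
```

Counting the 'left_only' and 'right_only' values in _merge is a quick sanity check that the join keys really line up.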

6. Plot

Our final goal is to plot, for each state, one line representing the residential data and another representing the new cases per day. To draw both lines on the same graph, we need to normalize them to the same interval, [-1, 1].

This is our normalization function:

def normalize(row, df):
    residential_max = df['residential'].abs().max()
    confirmed_cases_max = df['total_cases'].abs().max()
    new_cases_max = df['new_cases'].abs().max()
    row['residential_normalized'] = row['residential'] / residential_max
    row['total_cases_normalized'] = row['total_cases'] / confirmed_cases_max
    row['new_cases_normalized'] = row['new_cases'] / new_cases_max
    return row

Let’s create the plot
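A minimal sketch of such a plot in matplotlib, assuming the merged frame carries the normalized columns created by normalize above (the demo frame below is hypothetical stand-in data, not the real dataset):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

def plot_state(df, state):
    """Draw the two normalized lines for a single state."""
    subset = df[df['state'] == state].sort_values('date')
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(subset['date'], subset['residential_normalized'],
            label='residential')
    ax.plot(subset['date'], subset['new_cases_normalized'],
            label='new cases')
    ax.set_title(state)
    ax.legend()
    return fig

# Hypothetical demo data standing in for data_br_merged
demo = pd.DataFrame({
    'state': ['SP', 'SP', 'SP'],
    'date': ['2020-04-01', '2020-04-02', '2020-04-03'],
    'residential_normalized': [0.4, 0.6, 0.8],
    'new_cases_normalized': [0.1, 0.3, 0.5],
})
fig = plot_state(demo, 'SP')
```

Calling plot_state once per state code gives one small figure per state, which matches the per-state comparison made in the conclusion.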

7. Conclusion

Looking at the graphs, we can see that the six states we chose started quarantine before the number of reported covid cases began to increase. However, in Rio de Janeiro (RJ), São Paulo (SP), Bahia (BA), and Amazonas (AM), people stayed home less while the number of new cases per day was increasing. We can also notice that in Rio Grande do Norte (RN), people started staying home less only when cases were decreasing. These graphs help us understand one of the reasons why the covid crisis was more intense in states like SP, RJ, AM, and BA than in states like RN.

This article is part of my journey of learning data science, so if you have any suggestions or improvements to my code, just let me know in the comments.
