How to prepare your dataset and add styles to plots using matplotlib

Matheus Ricardo dos Santos
6 min read · Oct 2, 2020

And an exploratory analysis of social distancing in Brazil

In most cases, when we are just exploring the data, we don’t need to create beautiful plots. But what if we need to present our results to the public?
In this post, I’ll show you how to reshape your dataset so that it is easy to ‘query’, and how to add styles to your plots.

1. Get the data

The dataset we are going to use is part of the Google Mobility Report. It contains data about people’s mobility during the quarantine. Since it is a global report we can find data about almost any region of the globe.

So, the first step is to download our dataset:

import pandas as pd

link = "https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv"
data = pd.read_csv(link)
data.head()
These are the first 5 rows of the dataset

In this dataset, we have data about how some kinds of places are being frequented by people compared to the baseline. So, if the number is negative it means that people aren’t attending this place as they were before the quarantine. The opposite is also true: if the number is positive, people are attending the place more than the baseline.

According to Google’s documentation, the baseline is the median value for the same day of the week during the five-week period from January 3 to February 6, 2020, that is, before the quarantine started.
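To make the sign convention concrete, here is a minimal sketch of how a percent change from a baseline can be computed. The visit counts are made-up numbers, and `percent_change` is a hypothetical helper that I wrote for illustration; it is not part of the Google report.

```python
def percent_change(value, baseline):
    """Return the change of `value` relative to `baseline`, in percent."""
    return (value - baseline) / baseline * 100

# made-up numbers: visits on a given day versus a baseline of 100 visits
below = percent_change(80, 100)   # negative: fewer visits than the baseline
above = percent_change(125, 100)  # positive: more visits than the baseline
print(below, above)
```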

2. Select the data we want to work with

First, we need to select which columns are relevant to work with.

data.columns
The columns of the dataset

The relevant data is stored in the following columns:
- country_region
- sub_region_1
- sub_region_2
- date
- residential_percent_change_from_baseline

You can also select other columns you are curious about, e.g. workplaces_percent_change_from_baseline. I selected the residential data because it is closely related to the other columns.

As I said before, in this tutorial we are going to look at some data from Brazil. At the time of writing, Brazil has the third-highest number of confirmed COVID-19 cases in the world.

Let’s see how to retrieve the rows that contain data about Brazil:

# select the 'country_region' column and get the unique values
data.country_region.unique()
These are the countries that are present in the dataset. As you can see, Brazil is one of the countries included in the list.
# this way we select only the rows that have 'country_region' equal to 'Brazil'
# and only the columns that are relevant to our analysis (selected by position)
data_br = data.loc[data.country_region == "Brazil",:].iloc[:,[1,2,3,7,13]].copy()
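Selecting columns by position depends on the column order of the CSV file, which may change over time. As a hedged alternative, the sketch below selects the same kind of columns by name. The `sample` frame is a tiny made-up stand-in for the real report, so the values are not real data.

```python
import pandas as pd

# tiny made-up sample that mimics a few columns of the mobility report
sample = pd.DataFrame({
    "country_region_code": ["BR", "BR", "AR"],
    "country_region": ["Brazil", "Brazil", "Argentina"],
    "sub_region_1": [None, "State of São Paulo", None],
    "sub_region_2": [None, None, None],
    "date": ["2020-03-01", "2020-03-01", "2020-03-01"],
    "residential_percent_change_from_baseline": [2.0, 3.0, 1.0],
})

# boolean mask: True for the rows whose country is Brazil
is_brazil = sample["country_region"] == "Brazil"

# select rows with the mask and columns by name instead of by position
wanted = ["country_region", "sub_region_1", "sub_region_2", "date",
          "residential_percent_change_from_baseline"]
sample_br = sample.loc[is_brazil, wanted].copy()
print(sample_br)
```

With names instead of positions, the selection keeps working even if new columns are added to the file.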

3. Modeling the dataset

Now that we have selected the data, we need to make a few small changes to keep the dataset easy to work with:

# rename the columns
data_br.columns = ["country","state","city","date","residential"]
# convert the date column to datetime and use it as the index
data_br.date = pd.to_datetime(data_br.date)
data_br.index = data_br.date
# drop the date column, since it is now the index
data_br.drop(labels="date", axis=1, inplace=True)
data_br.head(2)
This is the new shape of our dataset
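One reason to use the dates as the index is that pandas can then select rows by date directly. Here is a small sketch with made-up values, just to show the idea:

```python
import pandas as pd

# made-up daily values, only to illustrate date-based selection
df = pd.DataFrame(
    {"residential": [1.0, 2.0, 3.0, 4.0]},
    index=pd.to_datetime(["2020-03-30", "2020-03-31", "2020-04-01", "2020-04-02"]),
)

# partial string indexing: every row from April 2020
april = df.loc["2020-04"]

# a date range (both endpoints are included)
two_days = df.loc["2020-03-31":"2020-04-01"]
print(april)
print(two_days)
```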

Since we want to analyze the data about each state separately, let’s see which states are present in the dataset:

data_br.state.unique()
These are the 27 Brazilian federative units (26 states plus the Federal District)
# clean up the state names
data_br.state = data_br.state.str.replace("State of ","", regex=False)

The next step is to select the data where the state name isn’t null and the city name is null. This means that we are looking for data that are related to the whole state.

data_br_state = data_br.loc[~data_br.state.isnull() & data_br.city.isnull()].copy()
data_br_state.head()
# drop the column we no longer need (it is null in every remaining row)
data_br_state.drop(labels="city", axis=1, inplace=True)
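To see what this filter keeps, here is a minimal sketch on three made-up rows: one for the whole country, one for a whole state, and one for a city.

```python
import pandas as pd

# made-up rows: one country-level, one state-level, one city-level
rows = pd.DataFrame({
    "state": [None, "São Paulo", "São Paulo"],
    "city": [None, None, "Campinas"],
    "residential": [5.0, 6.0, 7.0],
})

# keep rows that name a state but no city, i.e. the state-wide aggregates
state_level = rows.loc[rows["state"].notna() & rows["city"].isna()]
print(state_level)
```

Only the middle row survives, because it is the only one with a state name and no city name.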

Now we need a better way to compare the numbers across states. In the current dataset, this comparison isn’t easy because the values for a single day are scattered across many rows, one per state.

So, let’s reshape the dataset to make the comparison easy:

# let's group the data by date and state name
# we use the mean in case there is more than one row per state per day
# numeric_only=True skips text columns such as 'country'
data_br_grouped = data_br_state.groupby(by=[data_br_state.index,"state"]).mean(numeric_only=True)
data_br_grouped.head()
The rows are grouped by day

After that, we have a better way to visualize the data. But we can make it even better.

data_br_unstack = data_br_grouped.unstack()
data_br_unstack.head(2)
Each state became a column
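The reshape from one row per (date, state) to one column per state can be tried on a tiny made-up frame. In this sketch I select the "residential" column before unstacking, so the result has plain state names as columns instead of the two-level columns we get above. I also show `pivot_table`, which I believe does the same reshape in a single call.

```python
import pandas as pd

# made-up long-format data: one row per (date, state)
long_df = pd.DataFrame({
    "date": pd.to_datetime(["2020-04-01", "2020-04-01", "2020-04-02", "2020-04-02"]),
    "state": ["Ceará", "Amazonas", "Ceará", "Amazonas"],
    "residential": [10.0, 12.0, 11.0, 13.0],
})

# long -> wide in two steps: group by (date, state), then move "state" to the columns
wide = long_df.groupby(["date", "state"])["residential"].mean().unstack()

# the same reshape in a single call
wide_pivot = long_df.pivot_table(index="date", columns="state",
                                 values="residential", aggfunc="mean")
print(wide)
```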

With this new shape, we can apply filters like this:

# select by place and by state name
data_br_unstack["residential"][["Rio Grande do Norte","São Paulo"]]
Two states compared side by side

We can also compute, for each day, the mean of that day and the six days before it. This is called a 7-day moving average:

# we call dropna() because the first six days are null: they don't have six previous days to complete a 7-day window
data_br_unstack["residential"][["Rio Grande do Norte","São Paulo"]].rolling(window=7).mean().dropna()
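Here is a minimal sketch of what `rolling(window=7).mean()` does, using a made-up series so that the numbers are easy to check by hand:

```python
import pandas as pd

# made-up series: the values 1..10 on ten consecutive days
s = pd.Series(range(1, 11), index=pd.date_range("2020-04-01", periods=10), dtype=float)

# each value becomes the mean of that day and the six days before it
ma = s.rolling(window=7).mean()

# the first six days have no full 7-day window, so they are null
n_missing = ma.isna().sum()

# the first valid value is the mean of 1, 2, ..., 7
first_valid = ma.dropna().iloc[0]
print(n_missing, first_valid)
```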

4. Plotting the data

With this new dataset, we can easily plot our data:

data_br_unstack["residential"][["Rio Grande do Norte","São Paulo"]].rolling(window=7).mean().dropna().plot()
The resulting plot
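If the plot is going to be shown to other people, a title and axis labels help a lot. This is a small sketch on made-up values; the label texts are my own choice, not something defined by the dataset.

```python
import pandas as pd

# made-up values for two states over ten days
demo = pd.DataFrame(
    {"São Paulo": range(10, 20), "Rio Grande do Norte": range(12, 22)},
    index=pd.date_range("2020-04-01", periods=10),
)

# DataFrame.plot() returns a matplotlib Axes that can be customized further
ax = demo.plot()
ax.set_title("Residential mobility (made-up data)")
ax.set_xlabel("Date")
ax.set_ylabel("Change from baseline (%)")
```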

5. Adding styles to the plot

Until now, all the steps have been quite intuitive. But adding styles is not as easy as it seems. This is the tricky part: there isn’t much content on the internet about how to customize your plots, and the matplotlib docs sometimes aren’t as clear as they could be, so you can run into a lot of trouble when trying to plot the data with custom styles.

This is the easiest way to add styles to your plot:

import matplotlib.pyplot as plt

# change the plot theme
plt.style.use("Solarize_Light2")
# you can find more themes here: https://matplotlib.org/3.1.1/gallery/style_sheets/style_sheets_reference.html
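Note that `plt.style.use` changes the style for every plot created afterwards. If you only want the style for a single figure, a possible alternative is `plt.style.context`, sketched below on made-up data:

```python
import matplotlib.pyplot as plt

# list a few of the style names that ship with matplotlib
print(plt.style.available[:5])

# apply a style only inside the `with` block, leaving the global settings untouched
with plt.style.context("Solarize_Light2"):
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])
    ax.set_title("Styled only inside this block")
```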

First, we need to select a colormap for our plot. matplotlib provides a way to map an interval of values onto a range of RGB colors.

You can see all the available colormaps here.

# first we need to import some modules
import matplotlib.cm as cms
import matplotlib.dates as mdates
from matplotlib.collections import LineCollection
import numpy as np
from matplotlib.colors import ListedColormap

# select a colormap
# (on matplotlib 3.9+ cm.get_cmap is no longer available; use matplotlib.colormaps["GnBu"].resampled(256) there)
GnBu = cms.get_cmap('GnBu', 256)
# since we don't want the whole colormap, we select only the values between 0.6 and 1
newcolors = GnBu(np.linspace(0.6, 1, 256))
newcmp = ListedColormap(newcolors)
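To see how a colormap turns numbers into colors, here is a minimal sketch. It assumes matplotlib 3.5 or newer, where the `matplotlib.colormaps` registry is available, and the value range of -10 to 30 is made up for the example.

```python
import numpy as np
import matplotlib
from matplotlib.colors import ListedColormap, Normalize

# take the upper part (0.6 to 1.0) of the GnBu colormap
base = matplotlib.colormaps["GnBu"]
cmap = ListedColormap(base(np.linspace(0.6, 1, 256)))

# Normalize rescales data values linearly from [vmin, vmax] to [0, 1]
norm = Normalize(vmin=-10, vmax=30)

# each data value is mapped to an RGBA tuple with components between 0 and 1
rgba = cmap(norm([-10, 10, 30]))
print(rgba.shape)
```

The lowest value gets the color found at 0.6 in the original colormap, and the highest value gets the color found at 1.0.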

Now we select which states we want to plot and create the figure containing the subplots:

# select which states we want to plot
item = "residential"
states = ["São Paulo","Rio Grande do Norte", "Pernambuco", "Amazonas","Ceará"]
# create a figure with one row and five columns
fig, ax = plt.subplots(nrows=1,ncols=5, figsize=(20,5))

The last step is the most complicated: we need to iterate over the selected states and add the elements we want to plot.

I left some comments in the code to make each step easier to understand.
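As a rough sketch of what such a loop could look like (not necessarily the exact code used for the figure), the example below draws one subplot per state and colors each piece of the line by its value using `LineCollection`. The data is randomly generated as a stand-in for the moving averages, and the variable names `smooth`, `points` and `segments` are my own.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.collections import LineCollection
from matplotlib.colors import Normalize

# made-up stand-in for the 7-day moving averages of each state
dates = pd.date_range("2020-03-01", periods=120)
states = ["São Paulo", "Ceará"]
rng = np.random.default_rng(0)
smooth = pd.DataFrame(
    {s: 20 * np.exp(-np.arange(120) / 80) + rng.normal(0, 0.5, 120) for s in states},
    index=dates,
)

fig, axes = plt.subplots(nrows=1, ncols=len(states), figsize=(10, 4), sharey=True)
for ax, state in zip(axes, states):
    x = mdates.date2num(smooth.index.to_pydatetime())  # dates as floats
    y = smooth[state].to_numpy()

    # split the curve into short segments so each one can get its own color
    points = np.column_stack([x, y]).reshape(-1, 1, 2)
    segments = np.concatenate([points[:-1], points[1:]], axis=1)

    # color every segment by the y value at its starting point
    lc = LineCollection(segments, cmap="GnBu", norm=Normalize(y.min(), y.max()))
    lc.set_array(y[:-1])
    lc.set_linewidth(2)
    ax.add_collection(lc)

    # set the limits explicitly so the whole curve is visible
    ax.set_xlim(x.min(), x.max())
    ax.set_ylim(y.min() - 1, y.max() + 1)
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%b"))
    ax.set_title(state)
```

The trick is that a single line can only have one color, so the curve is cut into many two-point segments and each segment is colored on its own.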

Finally, this is the resulting plot:

As we can see from the image, the routine in Brazil is slowly coming back to the state it was before the start of the quarantine. This is the pattern followed by the majority of Brazilian states.

6. Conclusion

After working through these steps, we have a clear view of what it takes to transform data into information, and of how to present that information to our audience in a friendly way.

This article is part of my journey learning about data science, so if you have any suggestions or improvements for my code, just let me know in the comments.
