Matplotlib is the most popular 2D plotting library in Python. Using matplotlib, you can create pretty much any type of plot.
Pandas has tight integration with matplotlib.
If you have Anaconda installed, then matplotlib was already installed together with it.
If you have a standalone Python3 and Jupyter Notebook installation, open a command prompt / terminal and type in:
pip3 install matplotlib
We will use the pyplot module of the matplotlib package for plotting. You can import this module as usual. By convention it is aliased as plt:
import matplotlib.pyplot as plt
Let's use the World Countries dataset. For each country the following information is given: name, region, population, area, GDP and literacy.
The dataset is stored in the data/countries_world.csv file. The delimiter is the semicolon (;) character.
import pandas as pd
import matplotlib.pyplot as plt
# Special Jupyter Notebook command, so the plots made by matplotlib are displayed inside the Jupyter Notebook
%matplotlib inline
countries = pd.read_csv('../data/countries_world.csv', delimiter = ';')
countries.columns = ['country', 'region', 'population', 'area', 'gdp', 'literacy']
display(countries)
Data source: US Government
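As a minimal sketch of how the delimiter parameter works, the following reads a small semicolon-separated text from memory instead of from a file. The sample rows, names and values are made up for illustration and are not taken from the World Countries dataset.

```python
import io
import pandas as pd

# Hypothetical sample in a semicolon-separated layout (made-up values)
sample_csv = io.StringIO(
    "Country;Region;Population\n"
    "Aland;North;1000\n"
    "Borduria;South;2500\n"
)

# delimiter=';' tells pandas to split fields on semicolons instead of commas
sample = pd.read_csv(sample_csv, delimiter=';')

# Renaming the columns works the same way as for the real dataset
sample.columns = ['country', 'region', 'population']
print(sample)
```

Without the delimiter argument, pandas would look for commas and read each row as a single field.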
Let's take just the top 50 countries by area, so the visualizations in the following tasks will be easier to read:
countries50 = countries.sort_values(by = 'area', ascending = False).head(50)
display(countries50)
Plots can be generated with the plot() function of a Pandas DataFrame (table) or Series (column). The most important parameter of the function is the kind parameter, which defines the type of plot to be generated. Supported kinds are (non-exhaustive list):
line
bar (vertical bar)
barh (horizontal bar)
scatter
hist (histogram)
box (boxplot)
pie
After a plot is generated, it can be displayed by the show() function of the matplotlib.pyplot module.
Display a bar plot on the area of the selected 50 largest countries.
countries50.plot(kind='bar', x='country', y='area', figsize = [15, 3])
plt.show() # matplotlib.pyplot was imported as plt
The size of the diagram can be configured with the figsize parameter. The size is given in inches (1 inch = 2.54 centimeters).
The default size is [6.4, 4.8].
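The effect of figsize can be checked on the objects that plot() returns. The sketch below uses a tiny made-up DataFrame (the names toy, name and value are illustrative only) and reads the size back from the figure.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data, only used to produce a figure
toy = pd.DataFrame({'name': ['a', 'b', 'c'], 'value': [3, 1, 2]})

# plot() returns the matplotlib Axes object the bars were drawn on
ax = toy.plot(kind='bar', x='name', y='value', figsize=[15, 3])

# The Axes belongs to a Figure, which knows its own size in inches
width, height = ax.get_figure().get_size_inches()
print(width, height)
```

Keeping the returned Axes in a variable is optional, but it is handy whenever a plot needs to be inspected or adjusted after it was created.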
The bar diagram can also be created directly on the selected Series (column of data). In this case the values of the Series are placed along the Y axis, while the index of the DataFrame is used for the horizontal X axis, so the bars are labeled with row numbers instead of country names.
countries50['area'].plot(kind='bar', figsize = [15, 3])
plt.show()
The index column can be changed with the set_index function of the DataFrame (see Chapter 7 for more details), which returns a new DataFrame:
countries50_indexed = countries50.set_index('country')
display(countries50_indexed)
Creating the bar plot from the countries50_indexed DataFrame will display the country names as labels correctly.
countries50_indexed['area'].plot(kind='bar', figsize = [15, 3])
plt.show()
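The difference between the default index and a custom index can be made visible by reading the tick labels back from the plot. This is a sketch on made-up data; toy, name and value are illustrative names.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data (made-up names and values)
toy = pd.DataFrame({'name': ['a', 'b', 'c'], 'value': [3, 1, 2]})

# Plotting the Series directly labels the bars with the default integer index
plt.figure()  # start from an empty figure
ax_default = toy['value'].plot(kind='bar')
default_labels = [tick.get_text() for tick in ax_default.get_xticklabels()]

# After set_index, the same Series is labeled with the names instead
plt.figure()  # new figure, so the two plots do not overlap
ax_named = toy.set_index('name')['value'].plot(kind='bar')
named_labels = [tick.get_text() for tick in ax_named.get_xticklabels()]

print(default_labels)
print(named_labels)
```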
The color of the bars can be defined with the color parameter.
The width of the bars is set by the width parameter, 1.0 meaning 100%.
countries50_indexed['area'].plot(kind='bar', figsize = [15, 3], color = 'red', width = 1.0)
plt.show()
Multiple colors can be passed in a list.
countries50_indexed['area'].plot(kind='bar', figsize = [15, 3], color = ['red', 'green', 'yellow'], width = 1.0)
plt.show()
Display a horizontal bar plot on the population of the selected 50 largest countries.
countries50.plot(kind='barh', x='country', y='population', figsize = [10, 10])
plt.show()
Note that for the horizontal bar plot, the x parameter defines the vertical axis and the y parameter defines the horizontal axis. It is defined this way so that only the kind parameter of the plot() function has to be changed when switching to a different type of diagram.
Before visualizing the data, sort it by the column population, instead of the default area.
countries50.sort_values(by = 'population').plot(kind='barh', x='country', y='population', figsize = [10, 10])
plt.show()
Display a scatter plot on the correlation of the area and the population columns of the selected 50 largest countries.
Question: What correlation can be expected between these 2 attributes of countries?
countries50.plot(kind='scatter', x='area', y='population', title='Area vs. Population of Top 50 Largest Countries')
plt.show()
A title to be displayed above the generated diagram can be given with the title parameter.
Extend the scatter plot for all countries in the dataset.
countries.plot(kind='scatter', x='area', y='population', title='Area vs. Population correlation')
plt.show()
As we can observe, there is a moderately strong correlation between area and population, which matches our expectation.
The limits of the X and Y axes can be configured with the xlim and ylim parameters, so the outliers can be excluded from the visualization. Both a minimum and a maximum boundary can be given, as a tuple.
countries.plot(kind='scatter', x='area', y='population', title='Area vs. Population correlation', xlim=(0, 1e6), ylim=(0, 1e8))
plt.show()
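Axis limits only change what is visible, not the data behind the plot. The following sketch, on made-up values, sets both limits and then reads them back from the returned Axes object.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data with one extreme outlier in the last row
toy = pd.DataFrame({'area': [10, 20, 30, 5000],
                    'population': [100, 250, 300, 90000]})

# Restrict both axes so that the outlier falls outside the visible range
ax = toy.plot(kind='scatter', x='area', y='population',
              xlim=(0, 50), ylim=(0, 500))

# The limits only change the view; the DataFrame itself is untouched
print(ax.get_xlim(), ax.get_ylim())
print(len(toy))
```

Any statistic computed on the DataFrame afterwards still includes the hidden outlier. To exclude it from calculations as well, the rows would have to be filtered first.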
The correlation matrix between the Series of a Pandas DataFrame can be generated with the corr() function:
display(countries.corr(numeric_only=True)) # skip the text columns (parameter available since pandas 1.5)
Or just for 2 selected Series:
print(countries['area'].corr(countries['population']))
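By default, corr() computes the Pearson correlation coefficient. As a sketch of what that number means, the example below computes it by hand on made-up values and compares the result with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (made-up values)
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviations of every value from the mean of its Series
dx = x - x.mean()
dy = y - y.mean()

# Pearson coefficient: sum of the products of the deviations,
# normalized by the magnitude of the deviations of both Series
manual_r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Series.corr() uses the Pearson method by default
pandas_r = x.corr(y)

print(manual_r, pandas_r)
```

Both values agree up to floating point rounding. The result is always between -1 and 1.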
Every correlation has two qualities: strength and direction. The direction of a correlation is either positive or negative. When two variables have a positive correlation, it means the variables move in the same direction. This means that as one variable increases, so does the other one. In a negative correlation, the variables move in inverse, or opposite, directions. In other words, as one variable increases, the other variable decreases.
We determine the strength of a relationship between two correlated variables by looking at the numbers. A correlation of 0 means that no relationship exists between the two variables, whereas a correlation of 1 indicates a perfect positive relationship. It is uncommon to find a perfect positive relationship in the real world.
The further away from 1 that a positive correlation lies, the weaker the correlation. Similarly, the further a negative correlation lies from -1, the weaker the correlation. The following guidelines are useful when determining the strength of a positive correlation:
Question: Which Series of the DataFrame show a strong correlation?
A histogram is an accurate representation of the distribution of numerical data. It differs from a bar graph, in the sense that a bar graph relates two variables, but a histogram relates only one.
Display a histogram on the area of the selected 50 countries.
countries50['area'].plot(kind='hist')
plt.show()
The number of columns (called bins or buckets) in the histogram can be configured with the bins parameter.
countries50['area'].plot(kind='hist', bins=20)
plt.show()
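To see what the bins parameter does, the sketch below draws a histogram of made-up values with 5 bins and reads the height of each drawn column back from the plot.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data: the integers 0..99
values = pd.Series(range(100))

# Request 5 equal-width bins; each bin is drawn as one rectangle
plt.figure()  # start from an empty figure
ax = values.plot(kind='hist', bins=5)

# Read back the height (number of values) of every rectangle
counts = [int(patch.get_height()) for patch in ax.patches]
print(counts)
```

The value range is split into equally wide intervals, and the height of each column is the number of values that fall into its interval.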
Extend the histogram to cover all countries in the dataset. Apply a logarithmic scale with the logx / logy parameter.
countries['area'].plot(kind='hist', bins=20, logy=True)
plt.show()
In descriptive statistics, a boxplot is a method for graphically depicting groups of numerical data through their quartiles.
Display a boxplot on the GDP of the selected 50 largest countries.
countries50['gdp'].plot(kind='box')
plt.show()
Explaining the graphical visualization of a boxplot:
Task: Display a boxplot on the literacy of all countries! What can we state based on the diagram?
countries['literacy'].plot(kind='box')
plt.show()
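The values that a boxplot draws can also be computed numerically. The sketch below uses made-up values and assumes matplotlib's default whisker rule, where the whiskers reach to the most extreme data points within 1.5 times the interquartile range of the box.

```python
import pandas as pd

# Hypothetical toy data with one obvious outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# The box spans from the first to the third quartile, with a line at the median
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range: the height of the box

# Points beyond these bounds are drawn individually as outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = values[(values < lower_bound) | (values > upper_bound)]

print(q1, median, q3, iqr)
print(outliers.tolist())
```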
Display a pie chart on the area share of the selected 50 largest countries. Since we are creating this plot on the area Series, we use the countries50_indexed DataFrame, which was indexed with the country names, so the labels will contain them instead of numerical indices.
countries50_indexed['area'].plot(kind='pie', figsize=[10,10], label="", title="Area share of the top 50 largest countries")
plt.show()
Percentages for each slice can be displayed with the autopct parameter:
countries50_indexed['area'].plot(kind='pie', figsize=[10,10], autopct='%.1f%%', label="", title="Area share of the top 50 largest countries")
plt.show()
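The autopct value is a format string that is applied to the percentage share of each slice. The sketch below reproduces that calculation by hand on made-up values, without drawing anything.

```python
import pandas as pd

# Hypothetical toy data (made-up areas)
areas = pd.Series([50, 30, 20], index=['A', 'B', 'C'])

# A pie chart shows each value as a share of the total
shares = areas / areas.sum() * 100

# '%.1f%%' prints the share with one decimal, followed by a literal percent sign
labels = ['%.1f%%' % share for share in shares]
print(labels)
```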
Instead of using the show() function of the matplotlib.pyplot module, the savefig() function can be used to export and save a created plot to an external file.
countries50.plot(kind='bar', x='country', y='area', figsize = [15, 3])
plt.savefig('10_country_area.png')
Hint: look for the created file right next to this Jupyter Notebook file on your computer.
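A plot can also be saved through the Figure object that owns it. The sketch below uses made-up data and writes to a temporary directory. The file name toy_bar.png is a hypothetical example, and dpi and bbox_inches are optional savefig parameters.

```python
import os
import tempfile
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data, only used to produce a figure
toy = pd.DataFrame({'name': ['a', 'b', 'c'], 'value': [3, 1, 2]})
ax = toy.plot(kind='bar', x='name', y='value')

# Save into a temporary directory; the file name is a made-up example
out_path = os.path.join(tempfile.mkdtemp(), 'toy_bar.png')

# Saving through the Figure avoids relying on which figure is the current one;
# dpi sets the resolution, bbox_inches='tight' trims the surrounding whitespace
ax.get_figure().savefig(out_path, dpi=150, bbox_inches='tight')

print(os.path.exists(out_path))
```

The file format is chosen from the file extension, so a name ending in .pdf or .svg would produce a vector image instead.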
Read the Population History dataset from the data/population_history.csv file, which contains the population data for all countries between the years 1950 and 2019.
In total, the dataset covers 239 countries (or territories) over 70 years, which gives 16,730 rows of data.
Each row stores the following data:
The delimiter is the semicolon (;) character.
population_history = pd.read_csv('../data/population_history.csv', delimiter = ';')
display(population_history)
Data source: United Nations
Display the countries in the dataset:
print(population_history['Country'].unique())
Line diagrams work best with a series of data, assuming continuous change between the known discrete values.
Let's visualize the total and male population of Hungary between the years 1950 and 2019.
First filter the rows based on the country for Hungary and set the year as the index column.
hungary = population_history[population_history['Country'] == 'Hungary'].copy() # copy, so the in-place change below does not trigger a SettingWithCopyWarning
hungary.set_index('Year', drop=False, inplace=True)
display(hungary)
Now a line plot on the total population change of Hungary between 1950 and 2019 can be displayed.
hungary['PopTotal'].plot(kind='line')
plt.show()
# same:
#hungary.plot(kind='line', x='Year', y='PopTotal')
#plt.show()
Let's use multiple columns in the previous line plot, and add the male population to the diagram as a second line.
Successive calls to the plot() method of Pandas Series draw on the same diagram. Calling the show() function of the matplotlib.pyplot module then displays all of them together.
hungary['PopTotal'].plot(kind='line')
hungary['PopMale'].plot(kind='line')
plt.show()
Add a legend to the diagram:
hungary['PopTotal'].plot(kind='line', legend=True)
hungary['PopMale'].plot(kind='line', legend=True)
plt.show()
The same can be done by calling the plot() method of a Pandas DataFrame. Be aware though that in this case each plot will be displayed in an individual diagram:
hungary.plot(kind='line', x='Year', y='PopTotal', legend=True)
hungary.plot(kind='line', x='Year', y='PopMale', legend=True)
plt.show()
This can be fixed by explicitly passing the same matplotlib Axes object to both plot() calls:
ca = plt.gca() # gca = get current Axes (one is created if none exists yet)
hungary.plot(kind='line', x='Year', y='PopTotal', ax=ca, legend=True) # draw on the ca Axes object
hungary.plot(kind='line', x='Year', y='PopMale', ax=ca, legend=True) # draw on the same ca Axes object
plt.show()
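As an alternative sketch, the Axes object can be created explicitly with plt.subplots(), or several columns can be passed to a single plot() call. The data and the names toy, year, total and male below are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data (made-up values)
toy = pd.DataFrame({'year': [2000, 2001, 2002],
                    'total': [10, 11, 12],
                    'male': [5, 5, 6]})

# Create the figure and its Axes explicitly instead of asking for the current one
fig, ax = plt.subplots()

# Both calls draw on the same Axes because it is passed in through ax=
toy.plot(kind='line', x='year', y='total', ax=ax, legend=True)
toy.plot(kind='line', x='year', y='male', ax=ax, legend=True)

# Alternative: a list of columns in y draws all of them in one call
ax_both = toy.plot(kind='line', x='year', y=['total', 'male'])

# Each Axes now holds one line per plotted column
print(len(ax.get_lines()), len(ax_both.get_lines()))
```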
Use a different, secondary scale for the male population.
hungary['PopTotal'].plot(kind='line', legend=True)
hungary['PopMale'].plot(kind='line', secondary_y=True, legend=True)
plt.show()
Pandas supports the grouping of data by the given column(s), which then can be used also for visualization.
Select 10 countries of your choice.
selected_countries = pd.Series(['Hungary', 'Germany', 'France', 'United Kingdom', 'Romania', 'Oman', 'Libya', 'Turkey', 'Chile', 'Viet Nam'])
display(selected_countries)
Select the rows of the original DataFrame for these selected countries.
selected_history = population_history[population_history['Country'].isin(selected_countries)]
display(selected_history)
The selected_history DataFrame now contains all historical data for the selected 10 countries.
Visualize the population change of the selected 10 countries for the time period 1950-2019 in a line diagram.
To achieve this, we first group the selected_history DataFrame by the Country column:
selected_history.groupby('Country')
We get a DataFrameGroupBy object, which can be converted to a list:
grouped_history = list(selected_history.groupby('Country'))
print("Length: {0}".format(len(grouped_history)))
Each item of the list contains all records for a given country (the column used for grouping):
print(grouped_history[0])
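The structure of a grouped DataFrame is easier to see on a small example. The sketch below uses made-up rows with the same Country, Year and PopTotal column names and iterates over the groups directly.

```python
import pandas as pd

# Hypothetical toy data (made-up values)
toy = pd.DataFrame({'Country': ['B', 'A', 'B', 'A'],
                    'Year': [2000, 2000, 2001, 2001],
                    'PopTotal': [20, 10, 21, 11]})

# Iterating over a groupby object yields (group key, sub-DataFrame) pairs,
# sorted by the group key by default
for country, rows in toy.groupby('Country'):
    print(country, rows['PopTotal'].tolist())

# The same pairs can be collected into a list or a dictionary
group_keys = [country for country, _ in toy.groupby('Country')]
groups = dict(list(toy.groupby('Country')))
```

Each sub-DataFrame keeps the original row labels, so the rows of one group do not have to be adjacent in the source table.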
Question: What happens if we group by the Year column?
Based on the grouped DataFrame, we select the PopTotal Series and create a line plot.
First the Year column is set as the index to be used for the X axis.
# Reassign instead of inplace=True: selected_history is a filtered slice of population_history
selected_history = selected_history.set_index('Year', drop=False)
selected_history.groupby('Country')['PopTotal'].plot(
    kind='line', logy=True,
    figsize=[15, 6], legend=True,
    title='Population history of 10 selected countries 1950-2019')
plt.show()
Aggregate functions (min, max, mean, median, sum, etc.) transform a group of values into a single value. By calling an aggregate function on a grouped DataFrame, the aggregated value of each group is calculated.
Let's calculate the largest population each country ever had between 1950 and 2019.
population_history.groupby('Country').max()
Sort the result by the PopTotal column and only display PopTotal:
largest_pop = population_history.groupby('Country').max().sort_values(by = 'PopTotal')['PopTotal']
display(largest_pop)
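Selecting the column before aggregating gives the same kind of result without computing the maximum of every other column. The sketch below shows this on made-up rows, together with agg(), which applies several aggregate functions at once.

```python
import pandas as pd

# Hypothetical toy data (made-up values)
toy = pd.DataFrame({'Country': ['B', 'A', 'B', 'A'],
                    'Year': [2000, 2000, 2001, 2001],
                    'PopTotal': [20, 10, 21, 11]})

# One aggregated value per group: the largest PopTotal of each country
peak = toy.groupby('Country')['PopTotal'].max().sort_values()

# agg() computes several aggregates in one call, one column per function
summary = toy.groupby('Country')['PopTotal'].agg(['min', 'max', 'mean'])

print(peak)
print(summary)
```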
Task: Use the World Countries dataset defined in the countries variable. That dataset contains the region of each country. Compute how many countries belong to each region. Visualize the results in a pie chart.
Hint: use grouping.
countries_by_region = countries.groupby('region').count()['country']
display(countries_by_region)
countries_by_region.plot(kind='pie', figsize=[10,10], autopct='%.1f%%', label="",
                         title="Region distribution among countries")
plt.show()
Task: Calculate the accumulated population of the world for each year between 1950 and 2019 based on the Population History dataset stored in the population_history variable.
Create a bar diagram visualizing how the aggregated population changed over the years.
aggregated_by_year = population_history.groupby('Year').sum(numeric_only=True) # sum only the numeric columns, not the country names
display(aggregated_by_year)
aggregated_by_year.plot(kind='bar', y='PopTotal', figsize=[15, 4], width=0.8, color='orange')
plt.show()