The pandas package is a high-level data manipulation tool for Python.
The name pandas is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals.
If you have Anaconda installed, then pandas was already installed together with it.
If you have a standalone Python3 and Jupyter Notebook installation, open a command prompt / terminal and type in:
pip3 install pandas xlrd openpyxl
The xlrd
and openpyxl
packages are required for managing Microsoft Excel files.
If you have Anaconda installed, these packages were also included in the default installation.
The xlrd
package is used for older format MS Excel files (.xls
), while openpyxl
is used for newer format MS Excel files (.xlsx
).
In case support for OpenDocument format is required (.ods
files), e.g. for compatibility with OpenOffice or LibreOffice, the odf
package can be used:
conda install odf
pip3 install odf
The pandas package is a module which you can simply import. It is usually aliased with the pd
abbreviation:
import pandas as pd
The primary two components of pandas are the Series and DataFrame.
A Series is essentially a column, and a DataFrame is a multi-dimensional table made up of a collection of Series.
We can work with DataFrames and Series in a similar way, since many operations support both of them. (E.g. calculating the mean value of a Series or a DataFrame.)
A Python dictionary can easily be converted to a pandas DataFrame. Each (key, value)
tuple in the dictionary corresponds to a column in the resulting DataFrame. The values in the rows for each column are given as lists.
import pandas as pd
data = {
'apples': [3, 2, 0, 1, 5, 0, 4],
'oranges': [0, 3, 7, 2, 1, 6, 2]
}
df = pd.DataFrame(data)
display(df)
By default the rows in the DataFrame are indexed numerically from 0, but we can set a custom Index:
df = pd.DataFrame(data, index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
display(df)
Remark: the display()
function is a special Jupyter Notebook function to provide "prettier" output. Where display()
is used, print()
could also be used.
print(df)
print('Apples purchased over the week:')
print(df['apples']) # df is a DataFrame, df['apples'] is a Series
Values of single cells can also be accessed:
print('Apples purchased on Monday:')
print(df['apples']['Monday'])
print(df['apples'][0])
Rows can be accessed through the loc
property with their textual indexes:
print('Fruits purchased on Monday:')
print(df.loc['Monday'])
Rows can also be accessed through the iloc
property with ther numerical indexes:
print('Fruits purchased on Monday:')
print(df.iloc[0])
Accessing single cells:
print('Apples purchased on Monday:')
print(df.loc['Monday']['apples'])
print(df.loc['Monday'][0])
print(df.iloc[0]['apples'])
print(df.iloc[0][0])
Manually iterating through all the rows with the iterrows()
method:
for index, row in df.iterrows():
print("Index: {0}, Apples: {1}, Oranges: {2}".format(index, row["apples"], row["oranges"]))
Alternatively, if only the index values are required, the df.index
list can be iterated:
print(df.index)
The pandas library has a great support for reading (and writing) external data files, like CSV files, Excel files, JSON files, etc.
Let's use the European countries datatset.
The dataset contains the country name, capital city name, area (in km2), population (in millions) and the region data for 43 European countries respectively.
Data source: EuroStat
The dataset is given in the data/countries_europe.csv
file, which we can read with th read_csv()
method, passing the file path and the delimiter symbol (latter will be discussed soon).
import pandas as pd
countries = pd.read_csv('../data/countries_europe.csv', delimiter = ';')
display(countries)
A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. The separator (also called the delimiter) can be another character than a comma, often a semicolon is used. The default delimiter is the comma, but we can easily configure it int the read_csv()
method call.
A CSV file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.
For example:
Country;Capital;Area (km2);Population (millions);Region
Albania;Tirana;28748;3.2;Southern
Andorra;Andorra la Vella;468;0.07;Western
Austria;Vienna;83857;7.6;Western
Belgium;Brussels;30519;10;Western
Bosnia and Herzegovina;Sarajevo;51130;4.5;Southern
Bulgaria;Sofia;110912;9;Southern
Czech Republic;Prague;78864;10.4;Central
Denmark;Copenhagen;43077;5.1;Northern
United Kingdom;London;244100;57.2;Western
...
Overwrite the used column names:
countries.columns = ['country', 'capital', 'area', 'population', 'region']
display(countries)
Use further reading functions, like read_excel
, read_json
to easily load other file formats into a pandas DataFrame.
The same European country dataset is given in the data/countries_europe.xls
and data/countries_europe.xlsx
MS Excel files:
countries = pd.read_excel('../data/countries_europe.xls')
countries.columns = ['country', 'capital', 'area', 'population', 'region']
display(countries)
countries = pd.read_excel('../data/countries_europe.xlsx')
countries.columns = ['country', 'capital', 'area', 'population', 'region']
display(countries)
Query the row and column count of a DataFrame:
print('Number of rows: {0}'.format(len(countries)))
print('Number of rows: {0}'.format(countries.shape[0]))
print('Number of columns: {0}'.format(countries.shape[1]))
print('Number of rows by columns:')
print(countries.count())
Remark: if a column contains empty cells, we would see different numbers here.
The first or last few rows of a DataFrame can be fetched with the head(n)
and tail(n)
methods. They are especially useful with large DataFrames.
display(countries.head(3))
display(countries.tail(3))
Calculate the population density of each country and add it as a new column to the countries
DataFrame.
First, create a list of the density values:
density = []
for i in range(len(countries)):
density.append(countries['population'][i] * 1e6 / countries['area'][i])
print(density)
Then add a new column to the DataFrame:
countries['density'] = density
display(countries)
Summation, mean and median values:
print("Sum population: {0:.2f} million".format(countries["population"].sum()))
print("Mean population: {0:.2f} million".format(countries["population"].mean()))
print("Median population: {0:.2f} million".format(countries["population"].median()))
Maximum and minimum values and their indexes:
print("Max population:{0:.2f} million".format(countries["population"].max()))
print("Max population index: {0}".format(countries["population"].idxmax()))
print("Min population:{0:.2f} million".format(countries["population"].min()))
print("Min population index: {0}".format(countries["population"].idxmin()))
Standard deviation:
print("Standard deviation of population: {0:.2f} million".format(countries["population"].std()))
Quantile calculation:
print("The 90% quantile for the European countries:")
print(countries.quantile(0.9))
Which means that e.g. the top 5 countries (top 10%) has a higher population than 49.2 million.
Calculate all basic statistical data for all columns at once:
display(countries.describe())
A DataFrame can be sorted into a new DataFrame through the sort_values
function. The original DataFrame remains intact.
bypopulation = countries.sort_values(by = 'population', ascending = False)
display(bypopulation)
Note that the row indices remained intact.
Task: which countries will display the following code cell?
print(bypopulation.loc[0])
print()
print(bypopulation.iloc[0])
Now we can e.g. verify that the top 5 countries (top 10%) has a higher population than 49.2 million:
print("Population of the 4th country: {0:.1f} million".format(bypopulation.iloc[4]["population"]))
print("Population of the 5th country: {0:.1f} million".format(bypopulation.iloc[5]["population"]))
Note: a DataFrame can be sorted by modifying it (and without creating a new one) by passing the inplace = True
argument to the sort_values()
method:
countries.sort_values(by = 'population', ascending = False, inplace = True)
A DataFrame can be sorted using multiple columns, by passing a list of columns to the by
parameter. (And a list of boolean values to the ascending
parameter.)
byregion = countries.sort_values(by = ['region', 'population'], ascending = [True, False])
display(byregion)
We can assign one of the columns as the index column. The indexer column must contain unique values.
countries_indexed = countries.set_index('country')
display(countries_indexed)
In the indexed DataFrame, rows can be accessed through the values of the index column with the already used loc
property:
print(countries_indexed.loc["Hungary"])
Remark: by default the set_index()
method call removes the indexer column. We can keep the indexer column also as a "normal" column through setting the drop
parameter to False
.
countries_indexed = countries.set_index('country', drop = False)
display(countries_indexed)
Configure a condition as a boolean expression:
condition = countries["region"] == "Central"
print(type(condition))
As we have seen, the result is a Series, so a new column! Evaluate it:
print(condition)
The series in the condition
variable stores the boolean True
/ False
value for each row, whether the profit was positive for that row or not.
Now filter the DataFrame by the condition. As we can observe, only the rows with the logical True
value in the condition
series remained:
centralEuropean = countries[condition]
display(centralEuropean)
The whole workflow can be achieved in a single statement:
display(countries[countries["region"] == "Central"])
Display the name of the countries which have a population of less than 1 million.
small_countries = countries[countries["population"] < 1]
print(small_countries["country"])
Write a program that calculates which Western-European country has the largest area?
max_index = countries[countries["region"] == "Western"]["area"].idxmax()
print(countries.iloc[max_index])