Shapely is a Python package for manipulation and analysis of planar geometric objects. While Shapely is not concerned with data formats or coordinate systems, it can be readily integrated with packages that are. Indeed, Shapely is a central and essential package to any geometry/geography related work, and many higher abstraction level packages like GeoPandas use Shapely under the hood.
We need to install the shapely
package.
If you have Anaconda installed, open the Anaconda Prompt and type in:
conda update --all
conda install -c conda-forge shapely
Note: updating the currently installed packages to their most recent version can be required to avoid dependency issues. Note: we install from the conda-forge channel, as it contains more recent versions of these packages compared to the default channel of Anaconda.
If you have standalone Python3 and Jupyter Notebook install on Linux, open a command prompt / terminal and type in:
pip3 install shapely
The installation of these packages is much more complicated with pip on Windows, because several library binaries must be installed separately or compiled from source. (E.g. the shapely package highly depends on the GEOS library.) An easier approach is to install these packages from Python binary wheel files.
Due to its complexity these options are only recommended for advanced Python users; and instead it is strongly advised to use Anaconda on Windows.
We can either import the complete shapely module or just some parts of it which will be used, e.g.:
from shapely import geometry
Now we can simply refer to e.g. the shapely.geometry.Point
type simply as geometry.Point
.
The fundamental types of geometric objects implemented by Shapely are points, line string, polygons and their collections.
Elementary planar geometries can be created from scratch.
Let's define a new point with the coordinate (5,6)
.
from shapely import geometry
point = geometry.Point(5,6)
print(point)
Line strings can be defined through the list of their vertices.
line = geometry.LineString([(6,6), (7,7), (8,9)])
print(line)
Polygons are closed line strings (optionally with holes). The line string is automatically closed. The coordinates can be given with either tuples or lists.
rectangle1 = geometry.Polygon([[0,0], [10,0], [10,10], [0,10]])
rectangle2 = geometry.Polygon([(-4,-4), (4,-4), (4,4), (-4,4)])
print(rectangle1)
print(rectangle2)
A holed polygon can be defined as an exterior shell with a list of inner shells as the holes.
rectangle3 = geometry.Polygon([(0,0), (10,0), (10,10), (0,10)],
[[(2,2), (2,3), (3,3), (3,2)],
[(5,5), (5,7), (7,7), (7,5)]])
print(rectangle3)
For points and line strings the sequence of coordinates can be accessed through the coords
property, while the separate list of X and Y coordinates can be accessed through the xy
property of the Shapely objects.
print("Coords: {0}".format(list(point.coords)))
x, y = point.xy
print("X coords: {0}".format(x))
print("Y coords: {0}".format(y))
print("Coords: {0}".format(list(line.coords)))
x, y = line.xy
print("X coords: {0}".format(x))
print("Y coords: {0}".format(y))
For polygons, the outer shell can be accessed as the exterior
.
Note: the exterior is also available for points and line strings.
print("Coords: {0}".format(list(rectangle1.exterior.coords)))
x, y = rectangle1.exterior.xy
print("X coords: {0}".format(x))
print("Y coords: {0}".format(y))
The holes of a polygon can be accessed through the interiors
list of the object.
print("Coords: {0}".format(list(rectangle3.exterior.coords)))
x, y = rectangle3.exterior.xy
print("X coords: {0}".format(x))
print("Y coords: {0}".format(y))
print("Holes:")
for hole in rectangle2.interiors:
print(hole)
Various geometric properties can be easily computed through Shapely:
print('Area of Rectangle1: {0:.2f}'.format(rectangle1.area))
print('Area of Rectangle2: {0:.2f}'.format(rectangle2.area))
print('Area of Rectangle2: {0:.2f}'.format(rectangle3.area))
print('Length of Line: {0:.2f}'.format(line.length))
print(point.distance(rectangle2))
print(line.distance(rectangle2))
print('Rectangle1 contains Point: {0}'.format(rectangle1.contains(point)))
print('Rectangle2 contains Point: {0}'.format(rectangle2.contains(point)))
print('Rectangle1 contains Rectangle2: {0}'.format(rectangle1.contains(rectangle2)))
print('Rectangle1 intersects Rectangle2: {0}'.format(rectangle1.intersects(rectangle2)))
Geometries can also be loaded using the well-known text (WKT) format.
Well-known text (WKT) is a text markup language for representing vector geometry objects. It is a human-readable, but verbose format. A binary equivalent, known as well-known binary (WKB) is used to transfer and store the same information in a more compact form convenient for computer processing but that is not human-readable.
WKT and WKB are understood by many applications and software libraries, including Shapely.
from shapely import wkt
rectangle2 = wkt.loads('POLYGON ((-4 -4, 4 -4, 4 4, -4 4, -4 -4))')
print(rectangle2)
Note: Shapely also displays geometries as a WKT as the default string representation.
GeoPandas is an open source project to make working with geospatial data in Python easier. GeoPandas extends the datatypes used by pandas to allow spatial operations on geometric types.
The following Python packages are required to be installed:
If you have Anaconda installed, open the Anaconda Prompt and type in:
conda update --all
conda install -c conda-forge geopandas descartes mapclassify rtree
Note: updating the currently installed packages to their most recent version can be required to avoid dependency issues.
Note: we install from the conda-forge channel, as it contains more recent versions of these packages compared to the default channel of Anaconda.
If you have standalone Python3 and Jupyter Notebook install on Linux, open a command prompt / terminal and type in:
pip3 install geopandas descartes mapclassify rtree
For the rtree Python package you must also install the libspatialindex-dev system package, which will require administrative priviliges:
sudo apt-get install libspatialindex-dev
The installation of these packages is much more complicated with pip on Windows, because several library binaries must be installed separately or compiled from source. (E.g. the geopandas package highly depends on the GDAL library.)
An easier approach is to install these packages from Python binary wheel files.
Due to its complexity these options are only recommended for advanced Python users and it is strongly advised to use Anaconda on Windows.
The geopandas package is also a module which you can simply import. It is usually aliased with the gpd
abbreviation.
import geopandas as gpd
Geopandas can read many vector-based spatial data format including Shapefiles, GeoJSON files and much more. Only the read_file()
function has to be called.
The result is a geopandas dataframe, a GeoDataFrame.
Read the data/ne_10m_admin_0_countries.shp
shapefile located in the data
folder. This dataset contains both scalar and spatial data of the countries all over the world.
Source: Natural Earth
import geopandas as gpd
import matplotlib.pyplot as plt
%matplotlib inline
countries_gdf = gpd.read_file('../data/ne_10m_admin_0_countries.shp')
display(countries_gdf)
GeoPandas uses Shapely to represent geometries. Observe the geometry
column (the last one), which contains the Shapely geometry objects of the row, displayed as a string (in WKT format) in the table.
Since this GeoDataFrame has quite a number of columns, some of them are hidden by the display. Let's list all the columns:
print(countries_gdf.columns)
With a lot of columns it can be useful to select only a few columns to make the displayed results more human-readable.
This can be done by in a similar way when selecting a single Series from a DataFrame, but now we shall define a list of Series to select.
Remark: this makes a copy of the dataframe.
countries_gdf = countries_gdf[['NAME', 'POP_EST', 'POP_YEAR', 'GDP_MD_EST', 'GDP_YEAR', 'REGION_UN', 'geometry']]
display(countries_gdf)
Geopandas extends the capabilties of the pandas library, which means we can use all what we have learned with pandas.
Let's sort the GeoDataFrame by the name of the countries:
display(countries_gdf.sort_values(by='NAME'))
Filter the dataframe to contain only the European countries:
condition = countries_gdf['REGION_UN'] == 'Europe'
europe_gdf = countries_gdf[condition]
display(europe_gdf)
Sort the European countries by their population in a descending order:
display(europe_gdf.sort_values(by = 'POP_EST', ascending = False))
We can fetch the CRS (coordinate reference system) of the geometry
column in the GeoDataFrame:
print(countries_gdf.crs)
display(countries_gdf.crs)
As we can observe the spatial data are in WGS 84 (EPSG:4326). Since that is a geographic CRS, it would be unsuitable to calculate the area of the countries.
The geometries can be transformed on-the-fly to a different CRS with GeoPandas. Let's select a projected CRS, Mercator (EPSG:3857).
countries_mercator = countries_gdf.to_crs('epsg:3857')
Now the area of each geometry can be calculated in $km^2$ units:
countries_mercator['AREA'] = countries_mercator.area / 10**6
display(countries_mercator)
Use the round()
function to limit the number decimal digits, hence we can get rid of the scientific notation:
countries_mercator['AREA'] = countries_mercator['AREA'].round(2)
display(countries_mercator)
Since the Mercator projection applies is azimuthal (meaning the angles are correct), but not equal-area, areas inflate with distance from the equator such that the polar regions are grossly exaggerated. Therefore there are great territorial distortion int calculated values, e.g. for Hungary the area is more than twice of the real value.
display(countries_mercator[countries_mercator['NAME'] == 'Hungary'])
Let's use the Mollweide (ESRI:54009) equal-area projection instead to calculate the proper area of the countries.
countries_mollweide = countries_gdf.to_crs('esri:54009')
countries_mollweide['AREA'] = countries_mollweide.area / 10**6
countries_mollweide['AREA'] = countries_mollweide['AREA'].round(2)
display(countries_mollweide[countries_mollweide['NAME'] == 'Hungary'])
Remark: EPSG (European Petroleum Survey Group) and ESRI (American company Environmental Systems Research Institute) are two authorities providing well-known identifiers (WKID) for CRS.
However these numbers don’t overlap for avoiding confusion.
Task: When working with local spatial data for Hungary often the Uniform National Projection named EOV (abbreviation of Egységes Országos Vetület) is utilizied. It is an azimuthal projected CRS, and while not equal-area, only applies a minimal distortion on the region of Hungary.
Calculate the area of Hungary in EOV!
countries_eov = countries_gdf.to_crs('EPSG:23700') # EOV is EPSG:23700
countries_eov.set_index('NAME', drop=False, inplace=True)
countries_eov['AREA'] = countries_eov.area / 10**6
display(countries_eov.loc['Hungary'])
Geopandas provides a high-level interface to the matplotlib library for making maps. Mapping shapes is as easy as using the plot()
method on a GeoDataFrame (or GeoSeries).
countries_gdf.plot(figsize=[20,10])
plt.show()
The plot()
function call on a GeoDataFrame (or a regular pandas DataFrame) will return an axis configuration object, which we can use to further customize our plot (map in this case). E.g. we can hide the axes with the set_axis_off()
function:
ax = countries_gdf.plot(figsize=[20,10])
ax.set_axis_off()
plt.show()
Geopandas makes it easy to create so called choropleth maps (maps where the color of each shape is based on the value of an associated variable). Simply use the plot()
method with the column
argument set to the column whose values you want used to assign colors.
countries_gdf.plot(column='POP_EST', figsize=[20,10])
plt.show()
Add a legend to the map.
countries_gdf.plot(column='POP_EST', legend=True, figsize=[20,10])
plt.show()
We can choose from various available color maps. A complete list can be found on the matplotlib website.
countries_gdf.plot(column='GDP_MD_EST', legend=True, cmap='YlOrRd', figsize=[20,10])
plt.show()
The way color maps are scaled can also be manipulated with the scheme
option (the mapclassify Python library must be installed).
A full list of schemes are available on the project's GitHub page and some examples of result on the package's website.
countries_gdf.plot(column='GDP_MD_EST', legend=True, cmap='YlOrRd', figsize=[20,10], scheme='quantiles')
plt.show()
With the user_defined
scheme, a custom classification can be defined.
countries_gdf.plot(column='GDP_MD_EST', legend=True, cmap='YlOrRd', figsize=[20,10],
scheme='user_defined', classification_kwds={'bins':[1500, 60000, 300000]})
plt.show()
We can easily combine the data of multiple GeoDataFrames and even visualize them as multiple layers with geopandas.
Open and read a second data source defined in the data/World_Cities.shp
shapefile, containing scalar and spatial data about major cities all around the world.
Source: ArcGIS
cities_gdf = gpd.read_file('../data/World_Cities.shp')
display(cities_gdf)
Reduce the number of columns, by selecting only the most important ones:
cities_gdf = cities_gdf[['CITY_NAME', 'CNTRY_NAME', 'STATUS', 'POP', 'geometry']]
display(cities_gdf)
Plot the cities:
cities_gdf.plot(color='red', markersize=3, figsize=[20,10])
plt.show()
Verify whether both datasets use the same coordinate reference system:
print(cities_gdf.crs)
print(countries_gdf.crs)
Would be they different, geopandas would also be capable to transform one of the dataframes to the other CRS:
cities_gdf = cities_gdf.to_crs(countries_gdf.crs)
Create a combined visualization of multiple layers, by simply calling the plot()
method on all GeoDataFrames, but drawing them on the same axis object.
base = countries_gdf.plot(color='white', edgecolor='black', figsize=[20, 10])
cities_gdf.plot(ax=base, color='red', markersize=3)
plt.show()
Contextily is a Python package to retrieve tile maps from the internet. It can add those tiles as basemap to matplotlib figures.
The basemap tiles are in the Web Mercator (EPSG:3857) projection. To use them, we must convert our dataset to this CRS first.
import contextily as ctx
# Convert dataset to Web Mercator (EPSG:3857)
cities_mercator = cities_gdf.to_crs('epsg:3857')
ax = cities_mercator.plot(figsize=[20, 10], color='red', markersize=3)
ctx.add_basemap(ax)
ax.set_axis_off()
plt.show()
The same journey can be travelled in the opposite direction by leaving your data untouched and warping the tiles coming from the web.
import contextily as ctx
ax = cities_gdf.plot(figsize=[20, 10], color='red', markersize=3)
ctx.add_basemap(ax, crs=cities_gdf.crs)
ax.set_axis_off()
plt.show()
Note: it is also possible to convert both dataset and the basemap tiles into a different, third CRS.
Geopandas offers a coordinate indexer (cx
), which can be used to select only the records which geometry overlaps with the selected region.
Let's select and plot the countries in the northern hemisphere.
northern_gdf = countries_gdf.cx[:, 0:]
northern_gdf.plot(figsize=[20, 10])
plt.show()
Note: with this approach countries overlapping both the northern and southern hemispheres are not clipped.
We can perform real clipping with the clip()
function of geopandas. As a showcase let's clip the countries and country parts inside the bounding box of Europe; defined with the following polygon (given in WKT format):
POLYGON ((-10 35, 40 35, 40 70, -10, 70, -10, 35))
.
Geopandas uses the Shapely library in the background to represent and manipulate vector data. Therefore, first we define a regular pandas DataFrame named europe_df
, where the Coordinates column will contain a polygon defined with Shapely.
import pandas as pd
from shapely.geometry import Polygon
europe_df = pd.DataFrame({
'Name': ['Europe'],
'Coordinates': [Polygon([(-10, 35), (40, 35), (40, 70), (-10, 70), (-10, 35)])]
# the polygon is defined as a closed line
})
display(europe_df)
Now our GeoDataFrame can be constructed from the DataFrame stored in europe_df
, by defining which Series (column) contains the geometries and the CRS. (Use the CRS of the countries dataset.)
europe_gdf = gpd.GeoDataFrame(europe_df, geometry='Coordinates', crs=countries_gdf.crs)
display(europe_gdf)
Finally, we can perform the clipping operation between the GeoDataFrames:
clipped_gdf = gpd.clip(countries_gdf, europe_gdf)
clipped_gdf.plot(figsize=[10, 10])
plt.show()
In an attribute join, a GeoDataFrame (or a GeoSeries) is combined with a regular pandas DataFrame or Series based on a common variable. (This is analogous to normal merging or joining in pandas.)
countries_europe = pd.read_csv('../data/countries_europe.csv', delimiter = ';')
display(countries_europe)
The attribute join can be performed with the merge()
method, defining the columns used for merging. (Or alternatively left_index
and right_index
.)
countries_merged = countries_gdf.merge(countries_europe, left_on='NAME', right_on='Country')
display(countries_merged)
The spatial join (sjoin()
) function of geopandas performs a spatial intersection check between the records of one or two GeoDataFrames. (The rtree package must be installed for spatial indexing support.)
Let's match the countries and cities based on their spatial location:
display(gpd.sjoin(countries_gdf, cities_gdf))
Limit the number of columns displayed to get an output easier to interpret:
display(gpd.sjoin(countries_gdf, cities_gdf)[['NAME', 'CITY_NAME']])
Select the cities inside Hungary for a quick verification of the results:
condition = countries_gdf['NAME'] == 'Hungary'
hungary_gdf = countries_gdf[condition]
display(gpd.sjoin(hungary_gdf, cities_gdf)[['NAME', 'CITY_NAME']])
Perform a spatial intersection check between the dataframe containing only Hungary (hungary_gdf
) and the dataframe containing all countries (countries_gdf
). The result shall be the neighbouring countries of Hungary.
display(gpd.sjoin(hungary_gdf, countries_gdf)[['NAME_left', 'NAME_right']])
Remark: the NAME
column was renamed to NAME_left
and NAME_right
automatically, since column names must be unique.
GeoDataFrames can be easily persisted with the to_file()
function. As when reading files, various file formats are supported again.
clipped_gdf.to_file('11_clipped.shp')
#clipped_gdf.to_file('11_clipped2.geojson', driver='GeoJSON')
Beside the countries_gdf
GeoDataFrame, read the data/ne_10m_rivers_lake_centerlines.shp
shapefile located in the data
folder. This dataset contains both scalar and spatial data of the larger rivers and lakes around the world.
Source: Natural Earth
rivers_gdf = gpd.read_file('../data/ne_10m_rivers_lake_centerlines.shp')
display(rivers_gdf)
Visualize the country boundaries and the river/lake layers on the same map. (Rivers and lakes shall be blue.)
base = countries_gdf.plot(color='white', edgecolor='black', figsize=[20, 10])
rivers_gdf.plot(ax=base, color='blue')
plt.show()
Visualize only Hungary (on any preferred country) and the rivers flowing through it.
hungary_gdf = countries_gdf[countries_gdf['NAME'] == 'Hungary']
hungary_rivers = gpd.sjoin(rivers_gdf, hungary_gdf)
base = hungary_gdf.plot(color='white', edgecolor='black', figsize=[20, 10])
hungary_rivers.plot(ax=base, color='blue')
plt.show()
With clipping to country boundaries:
hungary_rivers = gpd.clip(hungary_rivers, hungary_gdf)
base = hungary_gdf.plot(color='white', edgecolor='black', figsize=[20, 10])
hungary_rivers.plot(ax=base, color='blue')
plt.show()
Determine for the river Danube (or any major river) that which countries it flows through.
Hint: the river might consist of multiple line segments in the river dataset, but you can filter all of them by e.g. the name_en
field.
danube_gdf = rivers_gdf[rivers_gdf['name_en'] == 'Danube']
display(danube_gdf)
danube_countries = gpd.sjoin(danube_gdf, countries_gdf)
display(danube_countries[['NAME']])
The data/hungary_admin_8.shp
shapefile contains the city level administrative boundaries of Hungary.
Data source: OpenStreetMap
Load the dataset into a GeoDataFrame, set the NAME
column as index and convert it to the EOV coordinate reference system.
import geopandas as gpd
cities_admin = gpd.read_file('../data/hungary_admin_8.shp')
print("Initial CRS: {0}".format(cities_admin.crs))
cities_admin.set_index('NAME', inplace=True)
cities_admin.to_crs('epsg:23700', inplace=True) # EOV
print("Converted CRS: {0}".format(cities_admin.crs))
display(cities_admin)
Process all the rows in the GeoDataFrame and display only the counties with an area larger than 200 km2:
for name, row in cities_admin.iterrows():
geom = row['geometry']
if geom.area / 1e6 >= 200:
print('{0}, Area: {1:.1f} km2, Centroid: {2}'.format(name, geom.area / 1e6, geom.centroid))
The EOV coordinates (653812, 239106)
are inside the territory of Budapest.
Check whether really on this administrative unit contains this location.
pos_budapest = geometry.Point(653812, 239106)
for name, row in cities_admin.iterrows():
geom = row['geometry']
if geom.contains(pos_budapest):
print(name)