This chapter covers spatial indexing with KD-trees, Quadtrees and R-trees.
The package requirement for these spatial indexes are the scipy.spatial
, pyqtree
and rtree
modules respectively.
If you have Anaconda installed, the scipy
package was installed together with, you only need to install pyqtree
and rtree
.
Open the Anaconda Prompt and type in:
conda install -c conda-forge pyqtree rtree
If you have standalone Python3 and Jupyter Notebook install, open a command prompt / terminal and type in:
pip3 install scipy pyqtree rtree
You most likely have already installed rtree
, as it was an optional dependency for geopandas
in Chapter 11.
Read the hungary_cities.shp
shapefile located in the data
folder.
This dataset contains both scalar and spatial data of the Hungarian cities, and should be familiar from Chapter 15.
import geopandas as gpd
cities = gpd.read_file('../data/hungary_cities.shp')
display(cities)
Calculate the minimal bounding box for all the points! (We will use it later.)
def get_x(point):
return point.x
def get_y(point):
return point.y
# Calculating the minimal bounding box
min_x = min(cities['geometry'], key = get_x).x # or cities.geometry
max_x = max(cities['geometry'], key = get_x).x
min_y = min(cities['geometry'], key = get_y).y
max_y = max(cities['geometry'], key = get_y).y
print("Bounding box: ({0:.1f}, {1:.1f}) - ({2:.1f}, {3:.1f})".format(min_x, min_y, max_x, max_y))
Python lambdas are little, anonymous functions, subject to a more restrictive but more concise syntax than regular Python functions.
Lambda functions can have any number of arguments but only one expression. The evaluated expression is the return value of the function.
A lambda function in python has the following syntax:
lambda arguments: expression
Lambda functions can be used wherever function objects are required.
# Calculating the minimal bounding box
min_x = min(cities['geometry'], key = lambda p: p.x).x
max_x = max(cities['geometry'], key = lambda p: p.x).x
min_y = min(cities['geometry'], key = lambda p: p.y).y
max_y = max(cities['geometry'], key = lambda p: p.y).y
print("Bounding box: ({0:.1f}, {1:.1f}) - ({2:.1f}, {3:.1f})".format(min_x, min_y, max_x, max_y))
A kdTree (short for k-dimensional tree) is a space-partitioning data structure for organizing points in a k-dimensional space. KdTrees are especially useful for searches involving a multidimensional search key, e.g. nearest neighbor searches and range searches.
Example KdTree:
Representation:
Select a random city and create a point which we will query later.
import random
random.seed(42) # for reproducibility
idx = random.randint(0, len(cities) - 1)
city = cities.iloc[idx]
print(city)
Create the query point, by slightly distorting the location of the selected city.
from shapely.geometry import Point
city_point = city.geometry
query_point = Point(city_point.x + 1, city_point.y + 2)
print("City location: {0}".format(city_point))
print("Query location: {0}".format(query_point))
The scipy
module can construct KD-Tree from a list of points, where each point is represented by a 2 element list or tuple.
points = [(p.x, p.y) for p in cities['geometry']]
print(points[:10])
Now the KD-Tree can be constructed.
import scipy.spatial
kdtree = scipy.spatial.KDTree(points)
Query the closest neighbor to the query point.
print("City location: {0}".format(city_point))
print("Query location: {0}".format(query_point))
dist, idx = kdtree.query(query_point)
print("Closest neighbor: distance = {0:.4f}, index = {1}, point = {2}".format(dist, idx, points[idx]))
print("Closest neighbor city: {0}".format(cities.iloc[idx]['City']))
Query the 3 closest neighbors to the query point.
distances, indices = kdtree.query(query_point, k = 3)
print("Query location: {0}".format(query_point))
print("3 closest neighbors:")
for i in range(len(indices)):
idx = indices[i]
dist = distances[i]
print("{0}. neighbor: distance = {1:.4f}, index = {2}, point = {3}, city = {4}".format(i+1, dist, idx, points[idx], cities.iloc[idx]['City']))
Query the 50 closest neighbors to the query point within 10km.
distances, indices = kdtree.query(query_point, k = 50, distance_upper_bound = 10000)
print("Distance list: %s" % distances)
print("Index list: %s" % indices)
Most likely will only find less than 50 neighbors in a 10km range, but the index list has still 50 elements.
For the invalid elements the indices[i]
is not a valid index, but instead equals to len(cities)
.
So with a simple check we can detect the end of the valid results.
valid_indices = [idx for idx in indices if idx < len(cities)]
print(valid_indices)
print("50 closest neighbors within 10km:")
for i in range(len(valid_indices)):
idx = valid_indices[i]
dist = distances[i]
print("{0}. neighbor: distance = {1:.1f}, index = {2}, location = {3}, city = {4}".format(i+1, dist, idx, points[idx], cities.iloc[idx]['City']))
Task 1: Implement a linear search for the closest point instead of using a KD-Tree!
def find_closest(points, query):
min_dist = None
min_point = None
for point in points:
dist = point.distance(query)
if min_dist is None or dist < min_dist:
min_dist = dist
min_point = point
return min_point
print("City location: {0}".format(city_point))
print("Query location: {0}".format(query_point))
closest_point = find_closest(cities['geometry'], query_point)
print("Closest location: {0}".format(closest_point))
Task 2: Compare the execution time of the linear search and the spatial index query (logarithmic asymptotic complexity) approach!
Hint: import the time
module to record the timestamp before and after the execution of the desired algorithm:
start = time.time()
# ... measured code ...
end = time.time()
print("Execution time: {0:.6f}s".format(end-start))
import time
start = time.time()
find_closest(cities['geometry'], query_point)
end = time.time()
print("Linear search execution time: {0:.6f}s".format(end-start))
start = time.time()
kdtree.query(query_point)
end = time.time()
print("KD-tree search execution time: {0:.6f}s".format(end-start))
Create a 10x10km query area around a point.
query_area_size = 10000
query_area = (
query_point.x - query_area_size/2,
query_point.y - query_area_size/2,
query_point.x + query_area_size/2,
query_point.y + query_area_size/2
)
print("Query area: {0}, side length = {1:.1f} km".format(query_area, query_area_size / 1000))
import pyqtree
quadtree = pyqtree.Index(bbox=(min_x, min_y, max_x, max_y))
for i in range(len(points)):
obj = { "id": i, "point": points[i] }
quadtree.insert(obj, points[i]) # object, bbox
Note: for a polygon, the first argument should be the indexed object (e.g. the polygon itself), and the second argument should be the bounding box of the polygon.
matches = quadtree.intersect(query_area)
print("Matches: {0}".format(matches))
for obj in matches:
print("Index: {0}, Location: {1}, City: {2}".format(obj['id'], obj['point'], cities.iloc[obj['id']]['City']))
We will use the same query_area
for demonstration, as before with the Quadtree.
from rtree import index as rtree_index
rtree = rtree_index.Index()
for i in range(len(points)):
rtree.insert(i, points[i]) # index, bbox
matches = rtree.intersection(query_area)
print("Matches: {0}".format(list(matches)))
matches = rtree.intersection(query_area)
for idx in matches:
city = cities.iloc[idx]
print("Index: {0}, Location: {1}, City: {2}".format(idx, city['geometry'], city['City']))
If the rtree
module is installed, the geopandas
module utilizes an R-tree in the background to spatially index the spatial objects in a GeoDataFrame.
This spatial index can be accessed directly as the sindex
property of the GeoDataFrame:
print(cities.sindex)
matches = cities.sindex.intersection(query_area)
print("Matches: {0}".format(list(matches)))
The R-Tree spatial index is also used by the sjoin()
and clip()
function of geopandas.