The method of determining the properties of the thematic classes directly from the reference data is called supervised classification, because the analyst actually “supervises” how the discriminant functions of the classes are formed by providing the reference data.
Unsupervised classification methods on the other hand group data points (e.g. pixels) together based on their similarities, with no information from the user about which ones belong together. The user selects the independent or predictor variables of interest, and the chosen algorithm does the rest. This doesn’t mean that you don’t need to know what you’re classifying, however. Once a classification is produced, it’s up to the user to interpret it and decide which types of features correspond to which generated classes, or if they even do correspond nicely.
Unsupervised classification is also called clustering.
The K-Means method is one of the most common unsupervised classification approaches.
The algorithm starts from a set of arbitrarily specified initial cluster centers; each center is then represented by the mean of the data points assigned to it. In the naïve solution, the user only defines the number of clusters, and random data points are selected as the initial centers.
This will generate a very crude set of clusters. The data points are then reassigned to the cluster with the closest center, and the centers are recomputed. The process is repeated until there is no further movement of data points between clusters. In practice, with large data sets, the process is not run to completion and some other stopping rule (e.g. a maximum number of iterations) is used.
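As a minimal NumPy sketch of this iteration (assuming points is an (n, d) array; edge cases such as clusters that end up empty are ignored):

import numpy as np

def kmeans(points, k, max_iter=100):
    # Naive initialization: k random data points become the initial centers
    rng = np.random.default_rng()
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):  # simple stopping rule for large data sets
        # Assign every point to the cluster with the closest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned points
        new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # no further movement
            break
        centers = new_centers
    return labels, centers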
Considering the squared distance between each data point and its cluster center as the squared error, the sum of squared errors (SSE) progressively decreases with each iteration. If the Euclidean distance is used as the metric, this simply means accumulating the squared distances between all points and their respective cluster centers.
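Formally, with clusters $C_1, \dots, C_k$ and cluster centers $\mu_1, \dots, \mu_k$:

$$SSE = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$$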
Although no general proof of convergence exists for this algorithm, it can be expected to yield acceptable results when the data exhibit characteristic pockets which are relatively far from each other. In most practical cases the application of this algorithm will require experimenting with various numbers of clusters (values of $k$), as well as different choices of starting configurations.
The K-Means method can be used with an arbitrary distance function. For raster imagery the distance is computed as if the pixel values were coordinates. For example, if the intensity values of two RGB pixels were $(25,42,37)$ and $(31,40,32)$, the squared distance would be $(25-31)^2 + (42-40)^2 + (37-32)^2 = 65$ in the three-dimensional spectral space, no matter where the pixels are located relative to each other spatially.
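The same computation can be verified with a few lines of NumPy:

import numpy as np
p1 = np.array([25, 42, 37])  # first RGB pixel
p2 = np.array([31, 40, 32])  # second RGB pixel
print(np.sum((p1 - p2) ** 2))  # 65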
One of the most challenging tasks in K-Means clustering is to choose the right number of clusters (the value of $k$). What should the right value of $k$ be, and how do we choose it?
The Elbow Method is one of the most popular methods to determine the optimal value of $k$. The idea is to run K-Means clustering on the dataset for a range of values of $k$ (e.g. from 1 to 10), and for each value of $k$ calculate the sum of squared errors (SSE).
Then, visualize a line chart of the SSE for each value of $k$. If the line chart looks like an arm, then the “elbow” of the arm is the value of $k$ that is the best. The idea is that we want a small SSE, but the SSE tends to decrease toward $0$ as we increase $k$. (The SSE is $0$ when $k$ equals the number of data points in the dataset, because then each data point forms its own cluster, and there is no error between it and the center of its cluster.) Hence we select the value of $k$ at the “elbow”, i.e. the point after which the SSE starts decreasing in an approximately linear fashion.
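A minimal sketch of the Elbow Method with scikit-learn (introduced in the next section), assuming data is an (n_samples, n_features) array; the fitted KMeans model exposes the SSE as its inertia_ attribute:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

sse = []
k_values = range(1, 11)
for k in k_values:
    kmeans = KMeans(n_clusters=k).fit(data)  # data: array of feature vectors
    sse.append(kmeans.inertia_)  # inertia_ is the SSE of the fitted model

# Plot the SSE against k and look for the elbow
plt.plot(k_values, sse, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('SSE')
plt.show()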
Scikit-learn (also known as sklearn) is a machine learning library for Python. It features various classification, regression and clustering algorithms, including K-Means.
If you have Anaconda installed, then scikit-learn is already installed together with it.
If you have a standalone Python3 and Jupyter Notebook installation, open a command prompt / terminal and type in:
pip3 install scikit-learn
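The installation can be verified by printing the library version:

import sklearn
print(sklearn.__version__)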
Read the hungary_cities.shp shapefile located in the data folder.
This dataset contains both scalar and spatial data of the Hungarian cities, and should be familiar from Chapter 15.
import geopandas as gpd
from sklearn.cluster import KMeans
cities_gdf = gpd.read_file('../data/hungary_cities.shp')
display(cities_gdf)
Fetch points for cities:
points = [(geom.x, geom.y) for geom in cities_gdf.geometry]
print("Number of points: {0}".format(len(points)))
Cluster the points using the K-Means algorithm:
pred = KMeans(n_clusters=19).fit_predict(points)
print(pred)
print(len(pred))
Plot figure:
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(12, 8))
# Fetch list of X and Y coordinates
xs = [point[0] for point in points]
ys = [point[1] for point in points]
# Put the cluster points on the plot
plt.scatter(xs, ys, c=pred)
# Add a title and display the plot
plt.title("Cluster map of the Hungarian cities")
plt.show()
The data/LC08_L1TP_188027_20200420_20200508_01_T1_Szekesfehervar.tif
file is a segment of a Landsat 8 satellite image of the city of Székesfehérvár, Lake Velence and their surroundings, acquired on 20 April 2020. It should be familiar from Chapter 12.
import rasterio
szfv_2020 = rasterio.open('../data/LC08_L1TP_188027_20200420_20200508_01_T1_Szekesfehervar.tif')
print(szfv_2020.count) # band count
print(szfv_2020.width) # dimensions
print(szfv_2020.height)
Read the blue, green, red and near-infrared (NIR) bands:
blue = szfv_2020.read(2)   # Landsat 8 band 2: blue
green = szfv_2020.read(3)  # band 3: green
red = szfv_2020.read(4)    # band 4: red
nir = szfv_2020.read(5)    # band 5: near-infrared
Cluster the satellite image based on the near-infrared band.
nir_1d = nir.reshape(nir.shape[0] * nir.shape[1], 1)  # flatten the 2D band into a column of samples
print(nir_1d.shape)
pred = KMeans(n_clusters=5).fit_predict(nir_1d)
img_clusters = pred.reshape(nir.shape)
import matplotlib.colors as mc
cmap = mc.LinearSegmentedColormap.from_list('', ['purple', 'red', 'green', 'beige', 'blue'])
plt.figure(figsize=[12,12])
plt.imshow(img_clusters, cmap=cmap)
plt.axis('off')
plt.show()
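Note that K-Means assigns the cluster indices in an arbitrary order, so the mapping between the colormap colors and the land cover types may change from run to run; the resulting clusters have to be interpreted visually after each execution.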
Cluster the satellite image based on the RGBN (red, green, blue, NIR) bands.
red_1d = red.reshape(red.shape[0] * red.shape[1], 1)
green_1d = green.reshape(green.shape[0] * green.shape[1], 1)
blue_1d = blue.reshape(blue.shape[0] * blue.shape[1], 1)
import numpy as np
# Stack the four bands into an (n_pixels, 4) array of RGBN feature vectors
rgbn_1d = np.column_stack((red_1d, green_1d, blue_1d, nir_1d))
print(rgbn_1d[10000])  # print a sample item
pred = KMeans(n_clusters=6).fit_predict(rgbn_1d)
img_clusters = pred.reshape(red.shape)
cmap = mc.LinearSegmentedColormap.from_list('', ['blue', 'red', 'green', 'brown', 'beige', 'purple'])
plt.figure(figsize=[15,15])
plt.imshow(img_clusters, cmap=cmap)
plt.axis('off')
plt.show()
The LC08_L1TP_188027_20200420_20200508_01_T1
file is a complete Landsat 8 satellite image tile, containing Budapest and parts of western Hungary, acquired on 20 April 2020.
Download: https://gis.inf.elte.hu/files/public/landsat-budapest-2020 (1.4 GB)
import rasterio
bp_2020 = rasterio.open('LC08_L1TP_188027_20200420_20200508_01_T1.tif')
print(bp_2020.count) # band count
print(bp_2020.width) # dimensions
print(bp_2020.height)
To speed up the processing of larger raster files, we may downsample them at the price of a less accurate result.
First, define the resampling function:
from rasterio.enums import Resampling

def read_resampled_band(dataset, band, resample_factor):
    # Read a single band, rescaled by resample_factor, using bilinear resampling
    data = dataset.read(band,
        out_shape=(
            1,
            int(dataset.height * resample_factor),
            int(dataset.width * resample_factor)
        ),
        resampling=Resampling.bilinear
    )
    return data
Read the blue, green, red and near-infrared bands into NumPy arrays. Resample them to a smaller size to make further processing (especially the clustering) faster.
bp = {}
bp['blue'] = read_resampled_band(bp_2020, 2, 1/4)
bp['green'] = read_resampled_band(bp_2020, 3, 1/4)
bp['red'] = read_resampled_band(bp_2020, 4, 1/4)
bp['nir'] = read_resampled_band(bp_2020, 5, 1/4)
print(bp['red'].shape)
Display the near-infrared band for verification:
plt.figure(figsize=[10,10])
plt.imshow(bp['nir'], cmap='Reds')
plt.axis('off')
plt.colorbar()
plt.show()
Display the RGB image for verification:
from rasterio.plot import show
import numpy as np
bp['red_max'] = np.percentile(bp['red'], 99.99)
bp['blue_max'] = np.percentile(bp['blue'], 99.99)
bp['green_max'] = np.percentile(bp['green'], 99.99)
# astype('f4') converts the array to 32-bit (4-byte) floats
bp['redf'] = bp['red'].astype('f4') / bp['red_max']
bp['bluef'] = bp['blue'].astype('f4') / bp['blue_max']
bp['greenf'] = bp['green'].astype('f4') / bp['green_max']
bp['rgb'] = [bp['redf'], bp['greenf'], bp['bluef']]
plt.figure(figsize=[10,10])
show(bp['rgb'])
plt.show()
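The bands are normalized with their 99.99th percentile instead of their maximum value, so that a handful of extremely bright outlier pixels do not darken the whole stretched image; the few values above the percentile simply saturate.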
Implement a function which performs single-band clustering on a rasterio band (a NumPy array). Execute it on the NIR band of the complete satellite image.
Example of how it should work:
single_band_clustering(bp['nir'], ['red', 'black', 'gray', 'green', 'white', 'blue']) # 6 clusters with these colors
def single_band_clustering(band, clusters=['red', 'black', 'gray', 'green', 'white', 'blue']):
    # Flatten the 2D band into a column of samples
    band_1d = band.reshape(band.shape[0] * band.shape[1], 1)
    # One cluster per color in the list
    pred = KMeans(n_clusters=len(clusters)).fit_predict(band_1d)
    img_clusters = pred.reshape(band.shape)
    cmap = mc.LinearSegmentedColormap.from_list('', clusters)
    plt.figure(figsize=[12,12])
    plt.imshow(img_clusters, cmap=cmap)
    plt.axis('off')
    plt.show()
single_band_clustering(bp['nir'])