How to create balanced k-means geospatial clusters? - python

I have 9000 USA-based points (that are accounts) with a variety of different string and numeric columns/attributes. I'm trying to evenly divide these points/accounts up into equitable groupings that are both spatially grouped as well as weighted (in a gravity sense) by number of employees, which is one of the columns/attributes. I used sklearn K-means clustering to do a grouping and it seemed to work fine but I noticed that the groupings are not equal. Some of the groups have ~600 and some of them have ~70. This is somewhat logical as there is more data in certain areas. The problem here is that I need these groups to be more equal. Here’s the code I used:
kmeans = KMeans(n_clusters = 30, max_iter=1000, init ='k-means++')
lat_long = dftobeclustered[dftobeclustered.columns[1:3]]
_employees = dftobeclustered[dftobeclustered.columns[3]]
weighted_kmeans_clusters = kmeans.fit(lat_long, sample_weight = _employees)
dftobeclustered['cluster_label'] = kmeans.predict(lat_long, sample_weight = _employees)
centers = kmeans.cluster_centers_
labels = dftobeclustered['cluster_label']
Is it possible to divide up the k-means clusters in a more equal way? I think the core problem is that it breaks low population areas like Montana or Hawaii off into their own groups when I actually need it to combine those areas into bigger groups. But I don't know.

K-means is not designed to work that way. Observations are assigned to clusters based on their actual MEASURED distances from centroids.
If you try to force the number of members in a cluster, you completely undo that distance-measurement component, especially when you are working geographically with lat/lon.
You may need to look at another method of subsetting your observations or reconsider the requirement for equally sized clusters.
In all honesty, most of the time geographic distance clustering is directly related to the similarity of observations in other ways (think of how house styles, demographics, or income vary by neighborhood and how that might translate to a zip code, or of tree types in a localized region). These sorts of things do not respect our need for them to be groups of the same size.
Clusters based on qualities OTHER than geography are more likely to level out than clusters built on straight lat/lon, provided there is clear differentiation across roughly even numbers of observations, because with lat/lon there is no way around the distances.
So areas with dense populations of observations WILL have more members than those with fewer. And the distance between MT and HI will always be greater than the distance between MT and NYC, so MT and HI will NOT be clustered together by geographic distance.
I understand that you want equal groups, but is it necessary that they are geographically grouped? Given that MT and HI would end up together, the geographic label would not mean much. It might be better to cluster on all of the NON-geographic numerical values to create groups of observations that are contextually alike.
Otherwise, you can use business rules to dissect the observations (by this I mean: if var_x > 7 & var_y < 227 & ..., then label = 1) and make some groups yourself. You can use groupby() and describe() in pandas to create cross tables to see what might be good values to split on.
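If equal group sizes matter more than strict nearest-centroid membership, one way to make that trade-off explicit is a capacity-limited assignment on top of ordinary k-means centroids. The sketch below is only an illustration under assumptions: the function name, the toy data, and the greedy "closest pairs first" rule are mine, it caps the number of points per group (not the employee totals), and it uses plain Euclidean distance.

```python
import numpy as np
from sklearn.cluster import KMeans

def capacity_limited_labels(points, centers, capacity):
    """Greedy assignment: closest (point, center) pairs first,
    and each center accepts at most `capacity` points."""
    n, k = len(points), len(centers)
    dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = np.full(n, -1)
    counts = np.zeros(k, dtype=int)
    # walk through all (point, center) pairs from nearest to farthest
    for flat in np.argsort(dist, axis=None):
        i, j = divmod(flat, k)
        if labels[i] == -1 and counts[j] < capacity:
            labels[i] = j
            counts[j] += 1
    return labels

# hypothetical toy data: one dense blob and one sparse, distant blob
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 1, (300, 2)), rng.normal(6, 1, (60, 2))])
k = 6
centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pts).cluster_centers_
capacity = int(np.ceil(len(pts) / k))
labels = capacity_limited_labels(pts, centers, capacity)
print(np.bincount(labels, minlength=k))
```

Points that lose out on their nearest centroid get pushed to the next nearest one with room, which is exactly the loss of distance fidelity described above: sparse regions end up attached to whichever group still has capacity.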

Try DBSCAN. See my sample code below.
# import necessary modules
import pandas as pd, numpy as np, matplotlib.pyplot as plt, time
from sklearn.cluster import DBSCAN
from sklearn import metrics
from geopy.distance import great_circle
from shapely.geometry import MultiPoint
# define the number of kilometers in one radian
kms_per_radian = 6371.0088
# load the data set
df = pd.read_csv('C:\\your_path\\summer-travel-gps-full.csv', encoding = "ISO-8859-1")
# how many rows are in this data set?
# scatterplot it to get a sense of what it looks like
df = df.sort_values(by=['lat', 'lon'])
ax = df.plot(kind='scatter', x='lon', y='lat', alpha=0.5, linewidth=0)
# represent points consistently as (lat, lon)
# coords = df.as_matrix(columns=['lat', 'lon'])
df_coords = df[['lat', 'lon']]
# coords = df.to_numpy(df_coords)
# define epsilon as 10 kilometers, converted to radians for use by haversine
epsilon = 10 / kms_per_radian
start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=10, algorithm='ball_tree', metric='haversine').fit(np.radians(df_coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)
# get the number of clusters
num_clusters = len(set(cluster_labels))
# get colors and plot all the points, color-coded by cluster (or gray if not in any cluster, aka noise)
fig, ax = plt.subplots()
# one color per cluster label (the colormap is an assumption; the original choice was lost)
colors = plt.get_cmap('Spectral')(np.linspace(0, 1, len(unique_labels)))
# for each cluster label and color, plot the cluster's points
for cluster_label, color in zip(unique_labels, colors):
    size = 150
    if cluster_label == -1:  # make the noise (which is labeled -1) appear as smaller gray points
        color = 'gray'
        size = 30
    # plot only the points that match the current cluster label
    mask = cluster_labels == cluster_label
    x_coords = df_coords.loc[mask, 'lon']
    y_coords = df_coords.loc[mask, 'lat']
    ax.scatter(x=x_coords, y=y_coords, color=color, edgecolor='k', s=size, alpha=0.5)
ax.set_title('Number of clusters: {}'.format(num_clusters))
coefficient = metrics.silhouette_score(df_coords, cluster_labels)
print('Silhouette coefficient: {:0.03f}'.format(metrics.silhouette_score(df_coords, cluster_labels)))
# set eps low (1.5km) so clusters are only formed by very close points
epsilon = 1.5 / kms_per_radian
# set min_samples to 1 so we get no noise - every point will be in a cluster even if it's a cluster of 1
start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(df_coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)
# get the number of clusters
num_clusters = len(set(cluster_labels))
# all done, print the outcome
message = 'Clustered {:,} points down to {:,} clusters, for {:.1f}% compression in {:,.2f} seconds'
print(message.format(len(df), num_clusters, 100*(1 - float(num_clusters) / len(df)), time.time()-start_time))
# Result:
# Silhouette coefficient: 0.854
# Clustered 1,759 points down to 138 clusters, for 92.2% compression in 0.17 seconds
coefficient = metrics.silhouette_score(df_coords, cluster_labels)
print('Silhouette coefficient: {:0.03f}'.format(metrics.silhouette_score(df_coords, cluster_labels)))
# number of clusters, ignoring noise if present
num_clusters = len(set(cluster_labels)) #- (1 if -1 in labels else 0)
print('Number of clusters: {}'.format(num_clusters))
# Number of clusters: 138
# create a series to contain the clusters - each element in the series is the points that compose each cluster
clusters = pd.Series([df_coords[cluster_labels == n] for n in range(num_clusters)])
0 lat lon
1587 37.921659 22...
1 lat lon
1658 37.933609 23...
2 lat lon
1607 37.966766 23...
3 lat lon
1586 38.149019 22...
4 lat lon
1584 38.374766 21...
133 lat lon
662 50.37369 18.889205
134 lat lon
561 50.448704 19.0...
135 lat lon
661 50.462271 19.0...
136 lat lon
559 50.489304 19.0...
137 lat lon
1 51.474005 -0.450999

During cluster assignment, one can also add a 'frequency penalty' to the distance. This is described in "Frequency Sensitive Competitive Learning for Balanced Clustering on High-dimensional Hyperspheres" by Arindam Banerjee and Joydeep Ghosh, IEEE Transactions on Neural Networks.
They also have an online/streaming version.
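To make the idea concrete, here is a rough sketch of a frequency-penalised online assignment. It is not the algorithm from the paper (which works with directions on a hypersphere); it assumes plain Euclidean distance and one simple form of the penalty, namely multiplying each cluster's distance by the number of points it has already won. The function name and all parameters are hypothetical.

```python
import numpy as np

def fscl_fit_predict(points, k, n_epochs=5, lr=0.05, seed=0):
    """Online clustering where each cluster's distance is scaled by its win count."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    wins = np.ones(k)
    for _ in range(n_epochs):
        for i in rng.permutation(len(points)):
            dist = np.linalg.norm(centers - points[i], axis=1)
            j = np.argmin(wins * dist)        # busy clusters look farther away
            centers[j] += lr * (points[i] - centers[j])
            wins[j] += 1
    # final labelling pass: centers are fixed, counts restart from one
    labels = np.empty(len(points), dtype=int)
    sizes = np.ones(k)
    for i in rng.permutation(len(points)):
        dist = np.linalg.norm(centers - points[i], axis=1)
        labels[i] = np.argmin(sizes * dist)
        sizes[labels[i]] += 1
    return labels, centers

# hypothetical toy data: one dense blob and one sparse blob
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 1, (300, 2)), rng.normal(8, 1, (60, 2))])
labels, centers = fscl_fit_predict(pts, k=4)
print(np.bincount(labels, minlength=4))
```

The penalty only nudges sizes toward each other; it does not guarantee equal groups, and the result depends on the order in which points are visited.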


How to calculate distance of coordinates and categorical dataset with DBSCAN Algorithm?

I have a dataset containing coordinates and categorical data, such as below:
I have searched many papers and journals for an explanation of which distance measure I should use on my dataset with the DBSCAN algorithm. I have a mixed dataset with Latitude and Longitude (coordinates) and Jenis Kecelakaan (Accident Type) as categorical data. What I find hard is how to cluster a mixed dataset like this. Are there any recommendations for a distance measure that works well and can be applied in DBSCAN in my case?
I've been stuck with this problem for days. Please help me out of this problem by giving me some explanation, paper/journal link, or blog like medium/towardsdatascience.
Read this article. I prefer using one-hot encoding:
import pandas as pd
your_df = pd.read_csv('./your_data.csv')
# generate binary values using get_dummies
dum_df = pd.get_dummies(your_df, columns=["Jenis Kecelakaan"])
Try it this way.
# import necessary modules
import pandas as pd, numpy as np, matplotlib.pyplot as plt, time
from sklearn.cluster import DBSCAN
from sklearn import metrics
from geopy.distance import great_circle
from shapely.geometry import MultiPoint
# define the number of kilometers in one radian
kms_per_radian = 6371.0088
# load the data set
df = pd.read_csv('C:\\travel-gps-full.csv', encoding = "ISO-8859-1")
# how many rows are in this data set?
# scatterplot it to get a sense of what it looks like
df = df.sort_values(by=['lat', 'lon'])
ax = df.plot(kind='scatter', x='lon', y='lat', alpha=0.5, linewidth=0)
# represent points consistently as (lat, lon)
# coords = df.as_matrix(columns=['lat', 'lon'])
df_coords = df[['lat', 'lon']]
# coords = df.to_numpy(df_coords)
# define epsilon as 10 kilometers, converted to radians for use by haversine
epsilon = 10 / kms_per_radian
start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=10, algorithm='ball_tree', metric='haversine').fit(np.radians(df_coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)
# get the number of clusters
num_clusters = len(set(cluster_labels))
# get colors and plot all the points, color-coded by cluster (or gray if not in any cluster, aka noise)
fig, ax = plt.subplots()
# one color per cluster label (the colormap is an assumption; the original choice was lost)
colors = plt.get_cmap('Spectral')(np.linspace(0, 1, len(unique_labels)))
# for each cluster label and color, plot the cluster's points
for cluster_label, color in zip(unique_labels, colors):
    size = 150
    if cluster_label == -1:  # make the noise (which is labeled -1) appear as smaller gray points
        color = 'gray'
        size = 30
    # plot only the points that match the current cluster label
    mask = cluster_labels == cluster_label
    x_coords = df_coords.loc[mask, 'lon']
    y_coords = df_coords.loc[mask, 'lat']
    ax.scatter(x=x_coords, y=y_coords, color=color, edgecolor='k', s=size, alpha=0.5)
ax.set_title('Number of clusters: {}'.format(num_clusters))
coefficient = metrics.silhouette_score(df_coords, cluster_labels)
print('Silhouette coefficient: {:0.03f}'.format(metrics.silhouette_score(df_coords, cluster_labels)))
# set eps low (1.5km) so clusters are only formed by very close points
epsilon = 1.5 / kms_per_radian
# set min_samples to 1 so we get no noise - every point will be in a cluster even if it's a cluster of 1
start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(df_coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)
# get the number of clusters
num_clusters = len(set(cluster_labels))
# all done, print the outcome
message = 'Clustered {:,} points down to {:,} clusters, for {:.1f}% compression in {:,.2f} seconds'
print(message.format(len(df), num_clusters, 100*(1 - float(num_clusters) / len(df)), time.time()-start_time))
# Result:
# Silhouette coefficient: 0.854
# Clustered 1,759 points down to 138 clusters, for 92.2% compression in 0.17 seconds
coefficient = metrics.silhouette_score(df_coords, cluster_labels)
print('Silhouette coefficient: {:0.03f}'.format(metrics.silhouette_score(df_coords, cluster_labels)))
# number of clusters, ignoring noise if present
num_clusters = len(set(cluster_labels)) #- (1 if -1 in labels else 0)
print('Number of clusters: {}'.format(num_clusters))
# Result:
# Number of clusters: 138
# create a series to contain the clusters - each element in the series is the points that compose each cluster
clusters = pd.Series([df_coords[cluster_labels == n] for n in range(num_clusters)])
I had the same question and this is the best link I could find online. It's a bit complex but I think creating the distance matrix by yourself, as suggested in the link, is the best option I'm aware of.
Many ML algorithms create a distance matrix internally to find the neighbors. Here, you need to build your own distance matrix from lat/long using haversine, then build another distance matrix for the categorical feature, then concatenate the two matrices side by side and pass the result as input to the model.
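As a rough illustration of the distance-matrix idea: scikit-learn's DBSCAN accepts metric="precomputed", which expects a single square matrix, so the sketch below adds the two matrices instead of placing them side by side. That substitution, the fixed 5 km penalty for differing categories, and the toy coordinates are all assumptions for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import haversine_distances

EARTH_RADIUS_KM = 6371.0088

def mixed_distance_matrix(lat_lon_deg, categories, category_penalty_km=5.0):
    """Pairwise distance in km: haversine plus a fixed penalty when categories differ."""
    geo_km = haversine_distances(np.radians(lat_lon_deg)) * EARTH_RADIUS_KM
    categories = np.asarray(categories)
    differs = (categories[:, None] != categories[None, :]).astype(float)
    return geo_km + category_penalty_km * differs

# hypothetical toy data: four points roughly 80 m apart, two accident types
lat_lon = np.array([[-6.2000, 106.8160],
                    [-6.2005, 106.8165],
                    [-6.2010, 106.8170],
                    [-6.2015, 106.8175]])
accident_type = ["A", "A", "B", "B"]

D = mixed_distance_matrix(lat_lon, accident_type)
# eps is now in kilometres because D is in kilometres
labels = DBSCAN(eps=1.0, min_samples=2, metric="precomputed").fit_predict(D)
print(labels)
```

The size of the penalty relative to eps decides whether different categories can ever share a cluster: with a penalty larger than eps, as here, they cannot.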

Plotting DWT Scaleogram in Python

I have a signal from a magnetic detector that I'm interested in analyzing. I've decomposed the signal using wavedec():
coeffs = pywt.wavedec(dane_K180_40['CH1[uV]'], 'coif5', level=5)
And I've received decomposition coefficients as follows:
cA5, cD5, cD4, cD3, cD2, cD1 = coeffs
These are ndarray objects of various lengths.
cD1 is (1519,), cD2 is (774,), and so on. The differing array lengths are my main obstacle.
My question:
I have to make DWT Scaleogram and I can't stress it enough that I've tried my best and couldn't do it.
What is the best approach? Using matplotlib's imshow() as follows:
plt.imshow(np.abs([cD5, cD4, cD3, cD2, cD1]), cmap='bone', interpolation='none', aspect='auto')
gives me an error
TypeError: Image data of dtype object cannot be converted to float
I've tried to google it since I'm not an expert in python and I've tried to change the ndarrays to float.
What is the best for plotting scaleogram, matshow, pcolormesh? ;D
Basically, each cDi array has half the amount of samples as the previous array (this is not the case for every mother wavelet!), so I create a 2D numpy array where the first element is the 'full' amount of samples, and for each subsequent level I repeat the samples 2^level times so that the end result is a rectangular block. You can pick whether you want the Y-axis plotted as a linear or as a logarithmic scale.
import numpy as np
import matplotlib.pyplot as plt
import pywt
from math import ceil, log2
# N and t_n were not defined in the snippet as posted; these values are assumptions.
# N should be a power of two so that the repeated rows below line up.
N = 256   # number of samples
t_n = 1   # signal duration in seconds
# Create signal
xc = np.linspace(0, t_n, num=N)
xd = np.linspace(0, t_n, num=32)
sig = np.sin(2*np.pi * 64 * xc[:32]) * (1 - xd)
composite_signal3 = np.concatenate([np.zeros(32), sig[:32], np.zeros(N-32-32)])
# Use the Daubechies wavelet
w = pywt.Wavelet('db1')
# Perform Wavelet transform up to log2(N) levels
lvls = ceil(log2(N))
coeffs = pywt.wavedec(composite_signal3, w, level=lvls)
# Each level of the WT will split the frequency band in two and apply a
# WT on the highest band. The lower band then gets split into two again,
# and a WT is applied on the higher band of that split. This repeats
# 'lvls' times.
# Since the amount of samples in each step decreases, we need to make
# sure that we repeat the samples 2^i times where i is the level so
# that at each level, we have the same amount of transformed samples
# as in the first level. This is only necessary because of plotting.
cc = np.abs(np.array([coeffs[-1]]))
for i in range(lvls - 1):
    # take the absolute value of each new row before stacking it under the previous rows
    row = np.abs(np.array([np.repeat(coeffs[lvls - 1 - i], pow(2, i + 1))]))
    cc = np.concatenate([cc, row])
plt.figure()
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
plt.title('Discrete Wavelet Transform')
# X-axis has a linear scale (time)
x = np.linspace(start=0, stop=1, num=N//2)
# Y-axis has a logarithmic scale (frequency)
y = np.logspace(start=lvls-1, stop=0, num=lvls, base=2)
X, Y = np.meshgrid(x, y)
plt.pcolormesh(X, Y, cc, shading='auto')
use_log_scale = False
# The end of the snippet was cut off; the branches below are a reconstruction.
if use_log_scale:
    plt.yscale('log')
else:
    yticks = [pow(2, i) for i in range(lvls)]
    plt.yticks(yticks)
plt.show()

How can you create a KDE from histogram values only?

I have a set of values that I'd like to plot the gaussian kernel density estimation of, however there are two problems that I'm having:
I only have the values of bars not the values themselves
I am plotting onto a categorical axis
Here's the plot I've generated so far:
The order of the y axis is actually relevant since it is representative of the phylogeny of each bacterial species.
I'd like to add a gaussian kde overlay for each color, but so far I haven't been able to leverage seaborn or scipy to do this.
Here's the code for the above grouped bar plot using python and matplotlib:
N = len(color1_plotting_values)
fig, ax = plt.subplots(figsize=(20,30))
ind = np.arange(N) # the x locations for the groups
width = .5 # the width of the bars
p1 = ax.barh(Species_Ordering.Species.values, color1_plotting_values, width, label='Color1', log=True)
p2 = ax.barh(Species_Ordering.Species.values, color2_plotting_values, width, label='Color2', log=True)
for b in p2:
    b.xy = (b.xy[0], b.xy[1]+width)
How to plot a "KDE" starting from a histogram
The protocol for kernel density estimation requires the underlying data. You could come up with a new method that uses the empirical pdf (ie the histogram) instead, but then it wouldn't be a KDE distribution.
Not all hope is lost, though. You can get a good approximation of a KDE distribution by first taking samples from the histogram, and then using KDE on those samples. Here's a complete working example:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sts
n = 100000
# generate some random multimodal histogram data
samples = np.concatenate([np.random.normal(np.random.randint(-8, 8), size=n)*np.random.uniform(.4, 2) for i in range(4)])
h,e = np.histogram(samples, bins=100, density=True)
x = np.linspace(e.min(), e.max())
# plot the histogram
plt.figure(figsize=(8,6))
plt.bar(e[:-1], h, width=np.diff(e), ec='k', align='edge', label='histogram')
# plot the real KDE
kde = sts.gaussian_kde(samples)
plt.plot(x, kde.pdf(x), c='C1', lw=8, label='KDE')
# resample the histogram and find the KDE.
resamples = np.random.choice((e[:-1] + e[1:])/2, size=n*5, p=h/h.sum())
rkde = sts.gaussian_kde(resamples)
# plot the KDE
plt.plot(x, rkde.pdf(x), '--', c='C3', lw=4, label='resampled KDE')
plt.title('n = %d' % n)
The red dashed line and the orange line nearly completely overlap in the plot, showing that the real KDE and the KDE calculated by resampling the histogram are in excellent agreement.
If your histograms are really noisy (like what you get if you set n = 10 in the above code), you should be a bit cautious when using the resampled KDE for anything other than plotting purposes:
Overall the agreement between the real and resampled KDEs is still good, but the deviations are noticeable.
Munge your categorical data into an appropriate form
Since you haven't posted your actual data I can't give you detailed advice. I think your best bet will be to just number your categories in order, then use that number as the "x" value of each bar in the histogram.
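A small sketch of what that could look like, with one substitution: instead of resampling, it passes the bar heights directly as weights to scipy's gaussian_kde (the weights argument needs a reasonably recent SciPy). The bar values and the bandwidth factor are made up for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

# hypothetical bar heights, one per category, listed in the axis order
bar_values = np.array([3.0, 10.0, 25.0, 12.0, 4.0, 1.0, 8.0, 20.0])
# number the categories in order: category i sits at position i
positions = np.arange(len(bar_values), dtype=float)

# weight each position by its bar height instead of drawing resamples
kde = gaussian_kde(positions, weights=bar_values, bw_method=0.3)

x = np.linspace(-2, len(bar_values) + 1, 400)
density = kde(x)

# rescale so the curve is comparable to bars of unit width
overlay = density * bar_values.sum()
print(overlay.max())
```

For a horizontal bar chart like the one in the question, the curve would be drawn with ax.plot(overlay, x) so that the category positions run along the y axis. Note that with a log-scaled value axis the overlay will look very different from the linear picture.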
I have stated my reservations to applying a KDE to OP's categorical data in my comments above. Basically, as the phylogenetic distance between species does not obey the triangle inequality, there cannot be a valid kernel that could be used for kernel density estimation. However, there are other density estimation methods that do not require the construction of a kernel. One such method is k-nearest neighbour inverse distance weighting, which only requires non-negative distances which need not satisfy the triangle inequality (nor even need to be symmetric, I think). The following outlines this approach:
import numpy as np
# simulate data
total_classes = 10
sample_values = np.random.rand(total_classes)
distance_matrix = 100 * np.random.rand(total_classes, total_classes)
# Distances to the values itself are zero; hence remove diagonal.
distance_matrix -= np.diag(np.diag(distance_matrix))
# --------------------------------------------------------------------------------
# For each sample, compute an average based on the values of the k-nearest neighbors.
# Weigh each sample value by the inverse of the corresponding distance.
# Apply a regularizer to the distance matrix.
# This limits the influence of values with very small distances.
# In particular, this affects how the value of the sample itself (which has distance 0)
# is weighted w.r.t. other values.
regularizer = 1.
distance_matrix += regularizer
# Set number of neighbours to "interpolate" over.
k = 3
# Compute average based on sample value itself and k neighbouring values weighted by the inverse distance.
# The following assumes that the value of distance_matrix[ii, jj] corresponds to the distance from ii to jj.
new_sample_values = np.zeros_like(sample_values)
for ii in range(total_classes):
    # determine neighbours
    indices = np.argsort(distance_matrix[ii, :])[:k+1] # +1 to include the value of the sample itself
    # compute weights
    distances = distance_matrix[ii, indices]
    weights = 1. / distances
    weights /= np.sum(weights) # weights need to sum to 1
    # compute weighted average
    values = sample_values[indices]
    new_sample_values[ii] = np.sum(values * weights)
For now, I am skipping any philosophical argument about the validity of using Kernel density in such settings. Will come around that later.
An easy way to do this is using scikit-learn KernelDensity:
import numpy as np
import pandas as pd
from sklearn.neighbors import KernelDensity
from sklearn import preprocessing
# ds is assumed to be your DataFrame with a categorical 'State' column
Y=ds.loc[:,'State'].values # State is AL, AK, AZ, etc...
# With categorical data we need some label encoding here...
le = preprocessing.LabelEncoder()
y=le.fit_transform(Y) # le.classes_ would be the sorted labels ['AK', 'AL', 'AZ',...; y would be [0, 2, 3, ..., 6, 7, 9]
y=y[:, np.newaxis] # preparing for kde
kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(y)
# You can control the bandwidth so the KDE function performs better
# To find the optimum bandwidth for your data you can try Crossvalidation
x=np.linspace(0,5,100)[:, np.newaxis] # let's get some x values to plot on
log_dens=kde.score_samples(x) # log-density evaluated at each x
dens=np.exp(log_dens) # these are the density function values
array([0.06625658, 0.06661817, 0.06676005, 0.06669403, 0.06643584,
0.06600488, 0.0654239 , 0.06471854, 0.06391682, 0.06304861,
0.06214499, 0.06123764, 0.06035818, 0.05953754, 0.05880534,
0.05818931, 0.05771472, 0.05740393, 0.057276 , 0.05734634,
0.05762648, 0.05812393, 0.05884214, 0.05978051, 0.06093455,
0.11885574, 0.11883695, 0.11881434, 0.11878766, 0.11875657,
0.11872066, 0.11867943, 0.11863229, 0.11857859, 0.1185176 ,
0.11844852, 0.11837051, 0.11828267, 0.11818407, 0.11807377])
And these values are all you need to plot your Kernel Density over your histogram. Capito?
Now, on the theoretical side, if X is a categorical(*), unordered variable with c possible values, then for 0 ≤ h < 1
is a valid kernel. For an ordered X,
where |x1-x2| should be understood as how many levels apart x1 and x2 are. As h tends to zero, both of these become indicators and return a relative frequency count. h is often referred to as the bandwidth.
(*) No distance needs to be defined on the variable space. Doesn't need to be a metric space.
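For concreteness, two standard kernels that fit the description above are written out below. This is an assumption about which kernels are meant: the first is the Aitchison-Aitken kernel for an unordered variable with c levels, and the second is the geometric kernel commonly used for ordered discrete variables.

```latex
% unordered X with c possible values (Aitchison-Aitken form)
K_h(x_1, x_2) =
\begin{cases}
  1 - h,            & x_1 = x_2, \\[4pt]
  \dfrac{h}{c - 1}, & x_1 \neq x_2,
\end{cases}
\qquad
% ordered X, where |x_1 - x_2| counts how many levels apart the values are
K_h(x_1, x_2) = h^{\,|x_1 - x_2|}.
```

Both reduce to the indicator of x_1 = x_2 as h tends to zero (taking 0^0 = 1), which matches the statement about relative frequency counting. For the Aitchison-Aitken form, the bandwidth is usually restricted further to h ≤ (c-1)/c so that matching values never get less weight than non-matching ones.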
Devroye, Luc and Gábor Lugosi (2001). Combinatorial Methods in Density Estimation. Berlin: Springer-Verlag.

Clustering observations based first on an attribute and on distance matrix

I have a dataset with locations (coordinates) and a scalar attribute of each location (for example, temperature). I need to cluster the locations based on the scalar attribute, but taking into consideration the distance between locations.
The problem is that, using temperature as an example, it is possible for locations that are far from each other to have the same temperature. If I cluster on temperature, these locations will be in the same cluster, when they shouldn't. The opposite is true if two locations that are near each other have different temperatures. In this case, clustering on temperature may result in these observations being in different clusters, while clustering based on a distance matrix would put them in the same one.
So, is there a way in which I could cluster observations giving more importance to one attribute (temperature) and then "refining" based on the distance matrix?
Here is a simple example showing how clustering differs depending on whether an attribute is used as the basis or the distance matrix. My goal is to be able to use both, the attribute and the distance matrix, giving more importance to the attribute.
import numpy as np
import matplotlib.pyplot as plt
import haversine
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial import distance as ssd
# Create location data
x = np.random.rand(100, 1)
y = np.random.rand(100, 1)
t = np.random.randint(0, 20, size=(100,1))
# Compute distance matrix
D = np.zeros((len(x),len(y)))
for k in range(len(x)):
    for j in range(len(y)):
        # the haversine package exposes haversine(point1, point2), returning km by default
        distance_pair = haversine.haversine((x[k, 0], y[k, 0]), (x[j, 0], y[j, 0]))
        D[k,j] = distance_pair
# Compare clustering alternatives
Zt = linkage(t, 'complete')
Zd = linkage(ssd.squareform(D), method="complete")
# Cluster based on t
clt = fcluster(Zt, 5, criterion='distance').reshape(100,1)
plt.figure(figsize=(10, 8))
plt.scatter(x, y, c=clt)
# Cluster based on distance matrix
cld = fcluster(Zd, 10, criterion='distance').reshape(100,1)
plt.figure(figsize=(10, 8))
plt.scatter(x, y, c=cld)
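One possible way to use both sources of information at once is to blend the two pairwise distances into a single one before running the linkage, with a weight that favours the attribute. The sketch below is only an illustration under assumptions: the weight alpha, the scaling of each distance to [0, 1], and the use of Euclidean distance as a stand-in for the haversine matrix D are choices made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
xy = rng.random((100, 2))                              # locations
t = rng.integers(0, 20, size=(100, 1)).astype(float)   # scalar attribute

d_attr = pdist(t)     # condensed pairwise |t_i - t_j|
d_space = pdist(xy)   # condensed pairwise spatial distance

# scale both to [0, 1] so that the weight is meaningful, then blend
alpha = 0.7           # hypothetical share given to the attribute
d_mix = alpha * d_attr / d_attr.max() + (1 - alpha) * d_space / d_space.max()

Z = linkage(d_mix, method="complete")
labels = fcluster(Z, t=5, criterion="maxclust")
print(np.bincount(labels)[1:])
```

With alpha close to 1 the result approaches clustering on the attribute alone, and with alpha close to 0 it approaches clustering on location alone. With a precomputed square matrix such as D, ssd.squareform(D) gives the condensed form used here.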

Finding the closest ground pixel on an irregular grid for given coordinates

I work with satellite data organised on an irregular two-dimensional grid whose dimensions are scanline (along track dimension) and ground pixel (across track dimension). Latitude and longitude information for each ground pixel are stored in auxiliary coordinate variables.
Given a (lat, lon) point, I would like to identify the closest ground pixel on my set of data.
Let's build a 10x10 toy data set:
import numpy as np
import xarray as xr
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
%matplotlib inline
lon, lat = np.meshgrid(np.linspace(-20, 20, 10),
np.linspace(30, 60, 10))
lon += lat/10
lat += lon/10
da = xr.DataArray(data = np.random.normal(0,1,100).reshape(10,10),
dims=['scanline', 'ground_pixel'],
coords = {'lat': (('scanline', 'ground_pixel'), lat),
'lon': (('scanline', 'ground_pixel'), lon)})
ax = plt.subplot(projection=ccrs.PlateCarree())
da.plot.pcolormesh(x='lon', y='lat', ax=ax, cmap='Blues')
ax.scatter(lon, lat, transform=ccrs.PlateCarree())
Note that the lat/lon coordinates identify the centre pixel and the pixel boundaries are automatically inferred by xarray.
Now, say I want to identify the closest ground pixel to Rome.
The best way I came up with so far is to use a scipy's kdtree on a stacked flattened lat/lon array:
from scipy import spatial
pixel_center_points = np.stack((da.lat.values.flatten(), da.lon.values.flatten()), axis=-1)
tree = spatial.KDTree(pixel_center_points)
rome = (41.9028, 12.4964)
distance, index = tree.query(rome)
# 36
I have then to apply unravel_index to get my scanline/ground_pixel indexes:
pixel_coords = np.unravel_index(index, da.shape)
# (3, 6)
Which gives me the scanline/ground_pixel coordinates of the (supposedly) closest ground pixel to Rome:
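ax = plt.subplot(projection=ccrs.PlateCarree())
da.plot.pcolormesh(x='lon', y='lat', ax=ax, cmap='Blues')
# mark the pixel found above (the start of this call was truncated and is reconstructed)
ax.scatter(da.lon[pixel_coords], da.lat[pixel_coords],
           marker='x', color='r', transform=ccrs.PlateCarree())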
I'm convinced there must be a much more elegant way to approach this problem. In particular, I would like to get rid of the flattening/unraveling steps (all my attempts to build a kdtree on a two-dimensional array failed miserably), and make use of my xarray dataset's variables as much as possible (adding a new centre_pixel dimension for example, and using it as input to KDTree).
I am going to answer my own question as I believe I came up with a decent solution, which is discussed at much greater length on my blog post on this subject.
Geographical distance
First of all, defining the distance between two points on the earth's surface as simply the euclidean distance between the two lat/lon pairs can give inaccurate results, with the error depending on how far apart the points are. It is thus better to transform the coordinates to ECEF coordinates first and build a KD-Tree on the transformed coordinates. Assuming points on the surface of the planet (h=0), the coordinate transformation is done as follows:
def transform_coordinates(coords):
    """ Transform coordinates from geodetic to cartesian
    Keyword arguments:
    coords - a set of lat/lon coordinates (e.g. a tuple or
             an array of tuples)
    """
    # WGS 84 reference coordinate system parameters
    A = 6378.137 # semi-major axis [km]
    E2 = 6.69437999014e-3 # eccentricity squared
    # np.float was removed from recent NumPy releases; the builtin float is equivalent
    coords = np.asarray(coords).astype(float)
    # is coords a tuple? Convert it to a one-element array of tuples
    if coords.ndim == 1:
        coords = np.array([coords])
    # convert to radians
    lat_rad = np.radians(coords[:,0])
    lon_rad = np.radians(coords[:,1])
    # convert to cartesian coordinates
    r_n = A / (np.sqrt(1 - E2 * (np.sin(lat_rad) ** 2)))
    x = r_n * np.cos(lat_rad) * np.cos(lon_rad)
    y = r_n * np.cos(lat_rad) * np.sin(lon_rad)
    z = r_n * (1 - E2) * np.sin(lat_rad)
    return np.column_stack((x, y, z))
Building the KD-Tree
We can then build the KD-Tree by transforming our dataset coordinates, taking care to flatten the 2D grid to a one-dimensional sequence of lat/lon tuples. This is because the KD-Tree input data needs to have shape (N, K), where N is the number of points and K is the dimensionality (K=2 for the lat/lon pairs, as we assume no height component; after the ECEF transformation each point has three components).
# reshape and stack coordinates
coords = np.column_stack((da.lat.values.ravel(),
                          da.lon.values.ravel()))
# construct KD-tree
ground_pixel_tree = spatial.cKDTree(transform_coordinates(coords))
Querying the tree and indexing the xarray dataset
Querying the tree is now as simple as transforming our point's lat/lon coordinates to ECEF and passing those to the tree's query method:
rome = (41.9028, 12.4964)
# query returns a (distance, index) pair
distance, index = ground_pixel_tree.query(transform_coordinates(rome))
In doing so though, we need to unravel our index on the original dataset's shape, to get the scanline/ground_pixel indexes:
index = np.unravel_index(index, da.shape)
We could now use the two components to index our original xarray dataset, but we could also build two indexers to use with xarray pointwise indexing feature:
index = xr.DataArray(index[0], dims='pixel'), \
xr.DataArray(index[1], dims='pixel')
Getting the closest pixel is now easy and elegant at the same time:
Note that we could also query more than one point at once, and by building the indexers as above, we could still index the dataset with a single call:
Which would then return a subset of the dataset containing the closest ground pixels to our query points.
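A minimal sketch of the multi-point case described above, under two simplifying assumptions: plain NumPy arrays stand in for the xarray dataset, and the Earth is treated as a unit sphere instead of the WGS 84 ellipsoid. The helper name to_unit_sphere and the stand-in data values are hypothetical.

```python
import numpy as np
from scipy import spatial

def to_unit_sphere(lat_lon_deg):
    """(lat, lon) in degrees -> 3D points on a unit sphere (spherical Earth assumption)."""
    lat, lon = np.radians(np.atleast_2d(lat_lon_deg)).T
    return np.column_stack((np.cos(lat) * np.cos(lon),
                            np.cos(lat) * np.sin(lon),
                            np.sin(lat)))

# same toy grid as in the question
lon, lat = np.meshgrid(np.linspace(-20, 20, 10), np.linspace(30, 60, 10))
lon += lat / 10
lat += lon / 10
data = np.arange(100.0).reshape(10, 10)   # stand-in values, data[i, j] = 10*i + j

# flatten the grid, transform, and build the tree once
grid_points = np.column_stack((lat.ravel(), lon.ravel()))
tree = spatial.cKDTree(to_unit_sphere(grid_points))

# query several (lat, lon) points in one call
points = [(41.9028, 12.4964),   # Rome
          (48.8566, 2.3522)]    # Paris
_, flat_index = tree.query(to_unit_sphere(points))
scanline, ground_pixel = np.unravel_index(flat_index, data.shape)

# pointwise indexing: one value per query point
closest = data[scanline, ground_pixel]
print(scanline, ground_pixel, closest)
```

With xarray, the same scanline and ground_pixel arrays can be wrapped in xr.DataArray(..., dims='pixel') as shown earlier, so that the dataset is indexed pointwise along a new pixel dimension.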
Further readings
Using the euclidean norm on the lat/lon tuples could be accurate enough for smaller distances (think of it as approximating the earth as flat, which works on a small scale). More details on geographic distances here.
Using a KD-Tree to find the nearest neighbour is not the only way to address this problem, see this very comprehensive article.
An implementation of KD-Tree directly into xarray is in the pipeline.
My blog post on the subject.

