I'm analyzing GPS location data with weights indicating "importance". This can be easily plotted as heatmap, e.g. in google maps.
I would like to analyze this using the python data stack and, in particular, want to find the mode of a kernel density estimate.
How can I compute the mode of a KDE in python?
Very specifically, given the example at https://scikit-learn.org/stable/auto_examples/neighbors/plot_species_kde.html how would you find the location with highest probability of finding the "Bradypus variegatus" species?
Lets consider a simple example of getting kde-estimation:
import numpy as np
from scipy.stats import gaussian_kde
from pylab import plt
np.random.seed(10)
x = np.random.rand(100)
y = np.random.rand(100)
kde = gaussian_kde(np.vstack([x, y]))
X, Y = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
Z = kde(np.vstack([X.ravel(), Y.ravel()])).reshape(X.shape)
plt.contourf(X, Y, Z)
plt.show()
Now, we can get the coordinates X and Y, where Z takes its maximal value:
X.ravel()[np.argmax(Z.ravel())]
0.3535353535353536
Y.ravel()[np.argmax(Z.ravel())]
0.5555555555555556
In practice, when estimating locations of highest probability of occurrence some
species, you need not the only one position, but some area around it.
In this case, you can choose, for example, all locations, where the probability
is greater than 90 percentile of all possible probability values, e.g.
Y.ravel()[Z.ravel() > np.percentile(Z, 90)]
X.ravel()[Z.ravel() > np.percentile(Z, 90)]
In case of cited example, you can try the same approach to get desired result. Probably, you will need to tweak threshold value, e.g. choose 75-percentile instead of 90-percentile value.
Related
I'm plotting x and y points. This results in a curved line, the line is first bending and then after a point its straight and after some time it bends again. I want to retrieve those two points. Though x is linear and y is plotted against x but y is not linearly dependent on x.
I tried matplotlib for plotting and numpy polynomial functions, and am currently looking into splines, but it seems that for these y needs to be directly dependent on x.
Your data is noisy, so you can't use a simple numerical derivative. Instead, as you may have found already, you should fit it with a spline and then check the curvature of the spline.
Keying off this answer, you can fit a spline and calculate the second derivative (curvature) like this:
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import UnivariateSpline
x = file['n']
y = file['Ds/2']
y_spline = UnivariateSpline(x, y)
x_range = np.linspace(x[0], x[-1], 1000) # or could use x_range = x
y_spline_deriv = y_spl.derivative(n=2)
curvature = y_spline_deriv(x_range)
Then you can find the start and end of the straight region like this:
straight_points = np.where(curvature.abs() <= 0.1)[0] # pick your threshold
start_idx = straight_points[0]
end_idx = straight_points[-1]
start_x = x_range[start_idx]
end_x = x_range[end_idx]
Alternatively, if you're mainly interested in finding the flattest part of the curve (as shown in your graphic), you could try calculating the first derivative and then finding regions where the slope is within some small amount of the minimum slope anywhere in the data. In that case, just substitute y_spline_deriv = y_spl.derivative(n=1) in the code above.
I have a set of values that I'd like to plot the gaussian kernel density estimation of, however there are two problems that I'm having:
I only have the values of bars not the values themselves
I am plotting onto a categorical axis
Here's the plot I've generated so far:
The order of the y axis is actually relevant since it is representative of the phylogeny of each bacterial species.
I'd like to add a gaussian kde overlay for each color, but so far I haven't been able to leverage seaborn or scipy to do this.
Here's the code for the above grouped bar plot using python and matplotlib:
enterN = len(color1_plotting_values)
fig, ax = plt.subplots(figsize=(20,30))
ind = np.arange(N) # the x locations for the groups
width = .5 # the width of the bars
p1 = ax.barh(Species_Ordering.Species.values, color1_plotting_values, width, label='Color1', log=True)
p2 = ax.barh(Species_Ordering.Species.values, color2_plotting_values, width, label='Color2', log=True)
for b in p2:
b.xy = (b.xy[0], b.xy[1]+width)
Thanks!
How to plot a "KDE" starting from a histogram
The protocol for kernel density estimation requires the underlying data. You could come up with a new method that uses the empirical pdf (ie the histogram) instead, but then it wouldn't be a KDE distribution.
Not all hope is lost, though. You can get a good approximation of a KDE distribution by first taking samples from the histogram, and then using KDE on those samples. Here's a complete working example:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sts
n = 100000
# generate some random multimodal histogram data
samples = np.concatenate([np.random.normal(np.random.randint(-8, 8), size=n)*np.random.uniform(.4, 2) for i in range(4)])
h,e = np.histogram(samples, bins=100, density=True)
x = np.linspace(e.min(), e.max())
# plot the histogram
plt.figure(figsize=(8,6))
plt.bar(e[:-1], h, width=np.diff(e), ec='k', align='edge', label='histogram')
# plot the real KDE
kde = sts.gaussian_kde(samples)
plt.plot(x, kde.pdf(x), c='C1', lw=8, label='KDE')
# resample the histogram and find the KDE.
resamples = np.random.choice((e[:-1] + e[1:])/2, size=n*5, p=h/h.sum())
rkde = sts.gaussian_kde(resamples)
# plot the KDE
plt.plot(x, rkde.pdf(x), '--', c='C3', lw=4, label='resampled KDE')
plt.title('n = %d' % n)
plt.legend()
plt.show()
Output:
The red dashed line and the orange line nearly completely overlap in the plot, showing that the real KDE and the KDE calculated by resampling the histogram are in excellent agreement.
If your histograms are really noisy (like what you get if you set n = 10 in the above code), you should be a bit cautious when using the resampled KDE for anything other than plotting purposes:
Overall the agreement between the real and resampled KDEs is still good, but the deviations are noticeable.
Munge your categorial data into an appropriate form
Since you haven't posted your actual data I can't give you detailed advice. I think your best bet will be to just number your categories in order, then use that number as the "x" value of each bar in the histogram.
I have stated my reservations to applying a KDE to OP's categorical data in my comments above. Basically, as the phylogenetic distance between species does not obey the triangle inequality, there cannot be a valid kernel that could be used for kernel density estimation. However, there are other density estimation methods that do not require the construction of a kernel. One such method is k-nearest neighbour inverse distance weighting, which only requires non-negative distances which need not satisfy the triangle inequality (nor even need to be symmetric, I think). The following outlines this approach:
import numpy as np
#--------------------------------------------------------------------------------
# simulate data
total_classes = 10
sample_values = np.random.rand(total_classes)
distance_matrix = 100 * np.random.rand(total_classes, total_classes)
# Distances to the values itself are zero; hence remove diagonal.
distance_matrix -= np.diag(np.diag(distance_matrix))
# --------------------------------------------------------------------------------
# For each sample, compute an average based on the values of the k-nearest neighbors.
# Weigh each sample value by the inverse of the corresponding distance.
# Apply a regularizer to the distance matrix.
# This limits the influence of values with very small distances.
# In particular, this affects how the value of the sample itself (which has distance 0)
# is weighted w.r.t. other values.
regularizer = 1.
distance_matrix += regularizer
# Set number of neighbours to "interpolate" over.
k = 3
# Compute average based on sample value itself and k neighbouring values weighted by the inverse distance.
# The following assumes that the value of distance_matrix[ii, jj] corresponds to the distance from ii to jj.
for ii in range(total_classes):
# determine neighbours
indices = np.argsort(distance_matrix[ii, :])[:k+1] # +1 to include the value of the sample itself
# compute weights
distances = distance_matrix[ii, indices]
weights = 1. / distances
weights /= np.sum(weights) # weights need to sum to 1
# compute weighted average
values = sample_values[indices]
new_sample_values[ii] = np.sum(values * weights)
print(new_sample_values)
THE EASY WAY
For now, I am skipping any philosophical argument about the validity of using Kernel density in such settings. Will come around that later.
An easy way to do this is using scikit-learn KernelDensity:
import numpy as np
import pandas as pd
from sklearn.neighbors import KernelDensity
from sklearn import preprocessing
ds=pd.read_csv('data-by-State.csv')
Y=ds.loc[:,'State'].values # State is AL, AK, AZ, etc...
# With categorical data we need some label encoding here...
le = preprocessing.LabelEncoder()
le.fit(Y) # le.classes_ would be ['AL', 'AK', 'AZ',...
y=le.transform(Y) # y would be [0, 2, 3, ..., 6, 7, 9]
y=y[:, np.newaxis] # preparing for kde
kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(y)
# You can control the bandwidth so the KDE function performs better
# To find the optimum bandwidth for your data you can try Crossvalidation
x=np.linspace(0,5,100)[:, np.newaxis] # let's get some x values to plot on
log_dens=kde.score_samples(x)
dens=np.exp(log_dens) # these are the density function values
array([0.06625658, 0.06661817, 0.06676005, 0.06669403, 0.06643584,
0.06600488, 0.0654239 , 0.06471854, 0.06391682, 0.06304861,
0.06214499, 0.06123764, 0.06035818, 0.05953754, 0.05880534,
0.05818931, 0.05771472, 0.05740393, 0.057276 , 0.05734634,
0.05762648, 0.05812393, 0.05884214, 0.05978051, 0.06093455,
..............
0.11885574, 0.11883695, 0.11881434, 0.11878766, 0.11875657,
0.11872066, 0.11867943, 0.11863229, 0.11857859, 0.1185176 ,
0.11844852, 0.11837051, 0.11828267, 0.11818407, 0.11807377])
And these values are all you need to plot your Kernel Density over your histogram. Capito?
Now, on the theoretical side, if X is a categorical(*), unordered variable with c possible values, then for 0 ≤ h < 1
is a valid kernel. For an ordered X,
where |x1-x2|should be understood as how many levels apart x1 and x2 are. As h tends to zero, both of these become indicators and return a relative frequency counting. h is oftentimes referred to as bandwidth.
(*) No distance needs to be defined on the variable space. Doesn't need to be a metric space.
Devroye, Luc and Gábor Lugosi (2001). Combinatorial Methods in Density Estimation. Berlin: Springer-Verlag.
I have a dataset with locations (coordinates) and a scalar attribute of each location (for example, temperature). I need to cluster the locations based on the scalar attribute, but taking into consideration the distance between locations.
The problem is that, using temperature as an example, it is possible for locations that are far from each other to have the same temperature. If I cluster on temperature, these locations will be in the same cluster, when they shouldn't. The opposite is true if two locations that are near each other have different temperatures. In this case, clustering on temperature may result in these observations being in different clusters, while clustering based on a distance matrix would put them in the same one.
So, is there a way in which I could cluster observations giving more importance to one attribute (temperature) and then "refining" based on the distance matrix?
Here is a simple example showing how clustering differs depending on whether an attribute is used as the basis or the distance matrix. My goal is to be able to use both, the attribute and the distance matrix, giving more importance to the attribute.
import numpy as np
import matplotlib.pyplot as plt
import haversine
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial import distance as ssd
# Create location data
x = np.random.rand(100, 1)
y = np.random.rand(100, 1)
t = np.random.randint(0, 20, size=(100,1))
# Compute distance matrix
D = np.zeros((len(x),len(y)))
for k in range(len(x)):
for j in range(len(y)):
distance_pair= haversine.distance((x[k], y[k]), (x[j], y[j]))
D[k,j] = distance_pair
# Compare clustering alternatives
Zt = linkage(t, 'complete')
Zd = linkage(ssd.squareform(D), method="complete")
# Cluster based on t
clt = fcluster(Zt, 5, criterion='distance').reshape(100,1)
plt.figure(figsize=(10, 8))
plt.scatter(x, y, c=clt)
plt.show()
# Cluster based on distance matrix
cld = fcluster(Zd, 10, criterion='distance').reshape(100,1)
plt.figure(figsize=(10, 8))
plt.scatter(x, y, c=cld)
plt.show()
haversine.py is available here: https://gist.github.com/rochacbruno/2883505
Thanks.
I am trying to come up with a generalised way in Python to identify pitch rotations occurring during a set of planned spacecraft manoeuvres. You could think of it as a particular case of a shift detection problem.
Let's consider the solar_elevation_angle variable in my set of measurements, identifying the elevation angle of the sun measured from the spacecraft's instrument. For those who might want to play with the data, I saved the solar_elevation_angle.txt file here.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec
from scipy.signal import argrelmax
from scipy.ndimage.filters import gaussian_filter1d
solar_elevation_angle = np.loadtxt("solar_elevation_angle.txt", dtype=np.float32)
fig, ax = plt.subplots()
ax.set_title('Solar elevation angle')
ax.set_xlabel('Scanline')
ax.set_ylabel('Solar elevation angle [deg]')
ax.plot(solar_elevation_angle)
plt.show()
The scanline is my time dimension. The four points where the slope changes identify the spacecraft pitch rotations.
As you can see, the solar elevation angle evolution outside the spacecraft manoeuvres regions is pretty much linear as a function of time, and that should always be the case for this particular spacecraft (except for major failures).
Note that during each spacecraft manoeuvre, the slope change is obviously continuous, although discretised in my set of angle values. That means: for each manoeuvre, it does not really make sense to try to locate a single scanline where a manoeuvre has taken place. My goal is rather to identify, for each manoeuvre, a "representative" scanline in the range of scanlines defining the interval of time where the manoeuvre occurred (e.g. middle value, or left boundary).
Once I get a set of "representative" scanline indexes where all manoeuvres have taken place, I could then use those indexes for rough estimations of manoeuvres durations, or to automatically place labels on the plot.
My solution so far has been to:
Compute the 2nd derivative of the solar elevation angle using
np.gradient.
Compute absolute value and clipping of resulting
curve. The clipping is necessary because of what I assume to be
discretisation noise in the linear segments, which would then severely affect the identification of the "real" local maxima in point 4.
Apply smoothing to the resulting curve, to get rid of multiple peaks. I'm using scipy's 1d gaussian filter with a trial-and-error sigma value for that.
Identify local maxima.
Here's my code:
fig = plt.figure(figsize=(8,12))
gs = gridspec.GridSpec(5, 1)
ax0 = plt.subplot(gs[0])
ax0.set_title('Solar elevation angle')
ax0.plot(solar_elevation_angle)
solar_elevation_angle_1stdev = np.gradient(solar_elevation_angle)
ax1 = plt.subplot(gs[1])
ax1.set_title('1st derivative')
ax1.plot(solar_elevation_angle_1stdev)
solar_elevation_angle_2nddev = np.gradient(solar_elevation_angle_1stdev)
ax2 = plt.subplot(gs[2])
ax2.set_title('2nd derivative')
ax2.plot(solar_elevation_angle_2nddev)
solar_elevation_angle_2nddev_clipped = np.clip(np.abs(np.gradient(solar_elevation_angle_2nddev)), 0.0001, 2)
ax3 = plt.subplot(gs[3])
ax3.set_title('absolute value + clipping')
ax3.plot(solar_elevation_angle_2nddev_clipped)
smoothed_signal = gaussian_filter1d(solar_elevation_angle_2nddev_clipped, 20)
ax4 = plt.subplot(gs[4])
ax4.set_title('Smoothing applied')
ax4.plot(smoothed_signal)
plt.tight_layout()
plt.show()
I can then easily identify the local maxima by using scipy's argrelmax function:
max_idx = argrelmax(smoothed_signal)[0]
print(max_idx)
# [ 689 1019 2356 2685]
Which correctly identifies the scanline indexes I was looking for:
fig, ax = plt.subplots()
ax.set_title('Solar elevation angle')
ax.set_xlabel('Scanline')
ax.set_ylabel('Solar elevation angle [deg]')
ax.plot(solar_elevation_angle)
ax.scatter(max_idx, solar_elevation_angle[max_idx], marker='x', color='red')
plt.show()
My question is: Is there a better way to approach this problem?
I find that having to manually specify the clipping threshold values to get rid of the noise and the sigma in the gaussian filter weakens this approach considerably, preventing it to be applied to other similar cases.
First improvement would be to use a Savitzky-Golay filter to find the derivative in a less noisy way. For example, it can fit a parabola (in the sense of least squares) to each data slice of certain size, and then take the second derivative of that parabola. The result is much nicer than just taking 2nd order difference with gradient. Here it is with window size 101:
savgol_filter(solar_elevation_angle, window_length=window, polyorder=2, deriv=2)
Second, instead of looking for points of maximum with argrelmax it is better to look for places where the second derivative is large; for example, at least half its maximal size. This will of course return many indexes, but we can then look at the gaps between those indexes to identify where each peak begins and ends. The midpoint of the peak is then easily found.
Here is the complete code. The only parameter is window size, which is set to 101. The approach is robust; the size 21 or 201 gives essentially the same outcome (it must be odd).
from scipy.signal import savgol_filter
window = 101
der2 = savgol_filter(solar_elevation_angle, window_length=window, polyorder=2, deriv=2)
max_der2 = np.max(np.abs(der2))
large = np.where(np.abs(der2) > max_der2/2)[0]
gaps = np.diff(large) > window
begins = np.insert(large[1:][gaps], 0, large[0])
ends = np.append(large[:-1][gaps], large[-1])
changes = ((begins+ends)/2).astype(np.int)
plt.plot(solar_elevation_angle)
plt.plot(changes, solar_elevation_angle[changes], 'ro')
plt.show()
The fuss with insert and append is because the first index with large derivative should qualify as "peak begins" and the last such index should qualify as "peak ends", even though they don't have a suitable gap next to them (the gap is infinite).
Piecewise linear fit
This is an alternative (not necessarily better) approach, which does not use derivatives: fit a smoothing spline of degree 1 (i.e., a piecewise linear curve), and notice where its knots are.
First, normalize the data (which I call y instead of solar_elevation_angle) to have standard deviation 1.
y /= np.std(y)
The first step is to build a piecewise linear curve that deviates from the data by at most the given threshold, arbitrarily set to 0.1 (no units here because y was normalized). This is done by calling UnivariateSpline repeatedly, starting with a large smoothing parameter and gradually reducing it until the curve fits. (Unfortunately, one can't simply pass in the desired uniform error bound).
from scipy.interpolate import UnivariateSpline
threshold = 0.1
m = y.size
x = np.arange(m)
s = m
max_error = 1
while max_error > threshold:
spl = UnivariateSpline(x, y, k=1, s=s)
interp_y = spl(x)
max_error = np.max(np.abs(interp_y - y))
s /= 2
knots = spl.get_knots()
values = spl(knots)
So far we found the knots, and noted the values of the spline at those knots. But not all of these knots are really important. To test the importance of each knot, I remove it and interpolate without it. If the new interpolant is substantially different from the old (doubling the error), the knot is considered important and is added to the list of found slope changes.
ts = knots.size
idx = np.arange(ts)
changes = []
for j in range(1, ts-1):
spl = UnivariateSpline(knots[idx != j], values[idx != j], k=1, s=0)
if np.max(np.abs(spl(x) - interp_y)) > 2*threshold:
changes.append(knots[j])
plt.plot(y)
plt.plot(changes, y[np.array(changes, dtype=int)], 'ro')
plt.show()
Ideally, one would fit piecewise linear functions to given data, increasing the number of knots until adding one more does not bring "substantial" improvement. The above is a crude approximation of that with SciPy tools, but far from best possible. I don't know of any off-the-shelf piecewise linear model selection tool in Python.
I have a numpy array of points in an XY plane like:
I want to select the n points (let's say 100) better distributed from all these points. This is, I want the density of points to be constant anywhere.
Something like this:
Is there any pythonic way or any numpy/scipy function to do this?
#EMS is very correct that you should give a lot of thought to exactly what you want.
There more sophisticated ways to do this (EMS's suggestions are very good!), but a brute-force-ish approach is to bin the points onto a regular, rectangular grid and draw a random point from each bin.
The major downside is that you won't get the number of points you ask for. Instead, you'll get some number smaller than that number.
A bit of creative indexing with pandas makes this "gridding" approach quite easy, though you can certainly do it with "pure" numpy, as well.
As an example of the simplest possible, brute force, grid approach: (There's a lot we could do better, here.)
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
total_num = 100000
x, y = np.random.normal(0, 1, (2, total_num))
# We'll always get fewer than this number for two reasons.
# 1) We're choosing a square grid, and "subset_num" may not be a perfect square
# 2) There won't be data in every cell of the grid
subset_num = 1000
# Bin points onto a rectangular grid with approximately "subset_num" cells
nbins = int(np.sqrt(subset_num))
xbins = np.linspace(x.min(), x.max(), nbins+1)
ybins = np.linspace(y.min(), y.max(), nbins+1)
# Make a dataframe indexed by the grid coordinates.
i, j = np.digitize(y, ybins), np.digitize(x, xbins)
df = pd.DataFrame(dict(x=x, y=y), index=[i, j])
# Group by which cell the points fall into and choose a random point from each
groups = df.groupby(df.index)
new = groups.agg(lambda x: np.random.permutation(x)[0])
# Plot the results
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True)
axes[0].plot(x, y, 'k.')
axes[0].set_title('Original $(n={})$'.format(total_num))
axes[1].plot(new.x, new.y, 'k.')
axes[1].set_title('Subset $(n={})$'.format(len(new)))
plt.setp(axes, aspect=1, adjustable='box-forced')
fig.tight_layout()
plt.show()
Loosely based on #EMS's suggestion in a comment, here's another approach.
We'll calculate the density of points using a kernel density estimate, and then use the inverse of that as the probability that a given point will be chosen.
scipy.stats.gaussian_kde is not optimized for this use case (or for large numbers of points in general). It's the bottleneck here. It's possible to write a more optimized version for this specific use case in several ways (approximations, special case here of pairwise distances, etc). However, that's beyond the scope of this particular question. Just be aware that for this specific example with 1e5 points, it will take a minute or two to run.
The advantage of this method is that you get the exact number of points that you asked for. The disadvantage is that you are likely to have local clusters of selected points.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
total_num = 100000
subset_num = 1000
x, y = np.random.normal(0, 1, (2, total_num))
# Let's approximate the PDF of the point distribution with a kernel density
# estimate. scipy.stats.gaussian_kde is slow for large numbers of points, so
# you might want to use another implementation in some cases.
xy = np.vstack([x, y])
dens = gaussian_kde(xy)(xy)
# Try playing around with this weight. Compare 1/dens, 1-dens, and (1-dens)**2
weight = 1 / dens
weight /= weight.sum()
# Draw a sample using np.random.choice with the specified probabilities.
# We'll need to view things as an object array because np.random.choice
# expects a 1D array.
dat = xy.T.ravel().view([('x', float), ('y', float)])
subset = np.random.choice(dat, subset_num, p=weight)
# Plot the results
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True)
axes[0].scatter(x, y, c=dens, edgecolor='')
axes[0].set_title('Original $(n={})$'.format(total_num))
axes[1].plot(subset['x'], subset['y'], 'k.')
axes[1].set_title('Subset $(n={})$'.format(len(subset)))
plt.setp(axes, aspect=1, adjustable='box-forced')
fig.tight_layout()
plt.show()
Unless you give a specific criterion for defining "better distributed" we can't give a definite answer.
The phrase "constant density of points anywhere" is also misleading, because you have to specify the empirical method for calculating density. Are you approximating it on a grid? If so, the grid size will matter, and points near the boundary won't be correctly represented.
A different approach might be as follows:
Calculate the distance matrix between all pairs of points
Treating this distance matrix as a weighted network, calculate some measure of centrality for each point in the data, such as eigenvalue centrality, Betweenness centrality or Bonacich centrality.
Order the points in descending order according to the centrality measure, and keep the first 100.
Repeat steps 1-4 possibly using a different notion of "distance" between points and with different centrality measures.
Many of these functions are provided directly by SciPy, NetworkX, and scikits.learn and will work directly on a NumPy array.
If you are definitely committed to thinking of the problem in terms of regular spacing and grid density, you might take a look at quasi-Monte Carlo methods. In particular, you could try to compute the convex hull of the set of points and then apply a QMC technique to regularly sample from anywhere within that convex hull. But again, this privileges the exterior of the region, which should be sampled far less than the interior.
Yet another interesting approach would be to simply run the K-means algorithm on the scattered data, with a fixed number of clusters K=100. After the algorithm converges, you'll have 100 points from your space (the mean of each cluster). You could repeat this several times with different random starting points for the cluster means and then sample from that larger set of possible means. Since your data do not appear to actually cluster into 100 components naturally, the convergence of this approach won't be very good and may require running the algorithm for a large number of iterations. This also has the downside that the resulting set of 100 points are not necessarily points that come form the observed data, and instead will be local averages of many points.
This method to iteratively pick the point from the remaining points which has the lowest minimum distance to the already picked points has terrible time complexity, but produces pretty uniformly distributed results:
from numpy import array, argmax, ndarray
from numpy.ma import vstack
from numpy.random import normal, randint
from scipy.spatial.distance import cdist
def well_spaced_points(points: ndarray, num_points: int):
"""
Pick `num_points` well-spaced points from `points` array.
:param points: An m x n array of m n-dimensional points.
:param num_points: The number of points to pick.
:rtype: ndarray
:return: A num_points x n array of points from the original array.
"""
# pick a random point
current_point_index = randint(0, num_points)
picked_points = array([points[current_point_index]])
remaining_points = vstack((
points[: current_point_index],
points[current_point_index + 1:]
))
# while there are more points to pick
while picked_points.shape[0] < num_points:
# find the furthest point to the current point
distance_pk_rmn = cdist(picked_points, remaining_points)
min_distance_pk = distance_pk_rmn.min(axis=0)
i_furthest = argmax(min_distance_pk)
# add it to picked points and remove it from remaining
picked_points = vstack((
picked_points,
remaining_points[i_furthest]
))
remaining_points = vstack((
remaining_points[: i_furthest],
remaining_points[i_furthest + 1:]
))
return picked_points