Determine the center of the cluster with the most points - python

After performing clustering on a dataset with GPS locations using KMeans, is there a way to determine the cluster with the most points, i.e. the largest cluster and then associate one of the centers with this specific cluster?
Suppose my code is:
kmeans = KMeans(n_clusters=4)
kmeans.fit(points)
I know I can print the centers via:
print(kmeans.cluster_centers_) -> e.g [[lat1, long1], [lat2, long2], ...]
and then determine the number of points in each cluster via:
print(Counter(kmeans.labels_)) -> e.g. Counter({0: 510, 1: 200, 2: 50, 3: 44})
How can I now link the largest cluster (the one with 510 points) to the correct center coordinates? Is this possible in Python?

You can get the label of the largest cluster from the counter and then link it to the corresponding center by simple indexing.
import numpy as np
from sklearn.cluster import KMeans
from collections import Counter
points = np.random.normal(0, 3, size=(100, 2))
kmeans = KMeans(n_clusters=4)
kmeans.fit(points)
counter = Counter(kmeans.labels_)
# most_common(1) returns [(label, count)] for the biggest cluster
largest_cluster_label = counter.most_common(1)[0][0]
largest_cluster_center = kmeans.cluster_centers_[largest_cluster_label]

The index 0 in labels corresponds to center 0, the index 1 to center 1.
Everything else would be madness, wouldn't it?
Even if they were automatically ordered by size (which would break some things), the labels would have to be updated as well, because users need to be able to find the right center for each point.
Also, the theory that they are reordered by size is easy to refute: just run it a few more times on different data and you'll find counterexamples. In particular, if you use reversed(cluster_centers_) as the initialization, the fit should finish within one iteration and return the centers in that reversed order.
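For the record, here is a small experiment along the lines of that suggestion (the random data and variable names are mine):
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans
points = np.random.normal(0, 3, size=(300, 2))
km = KMeans(n_clusters=4, n_init=10).fit(points)
print(Counter(km.labels_), km.cluster_centers_)
# Re-fit with the converged centers fed back in reverse order as the explicit
# initialization: the fit converges immediately, the centers come back reversed,
# and the label counts permute accordingly, so labels are not ordered by size.
km_rev = KMeans(n_clusters=4, init=km.cluster_centers_[::-1], n_init=1).fit(points)
print(Counter(km_rev.labels_), km_rev.cluster_centers_)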

Related

How can I find connected points on a graph in 3D?

I have a set of 1000+ x, y and z coordinates and I want to find out how they cluster. I'd like to set a maximum distance that specifies whether points belong in the same cluster, i.e. if a point has a Euclidean distance of less than 1 from another point, the algorithm should cluster them together. I've tried to brute force this in Python with little success; does anyone have any ideas or a pre-established algorithm that does something similar?
Thanks in advance
You can find quite a few clustering algorithms in module scikit-learn: https://scikit-learn.org/stable/modules/clustering.html
With your particular definition of clusters, it appears that sklearn.cluster.AgglomerativeClustering(n_clusters=None, distance_threshold=1) is exactly what you want.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
N = 1500
box_size = 10
# 2D points for illustration; the same call works unchanged on your (N, 3) array
points = np.random.rand(N, 2) * box_size
# points looks like array([[5.93688935, 6.63209391], [2.6182196 , 8.33040083], ...])
clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=1).fit(points)
print('Number of clusters:', clustering.n_clusters_)
# Number of clusters: 224
An alternative approach would have been to build a graph, then get the connected components of the graph, using for instance module networkx: networkx.algorithms.components.connected_components
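A minimal sketch of that alternative, with the close pairs found via scipy's cKDTree (that part is my own addition, not from the original answer):
import numpy as np
import networkx as nx
from scipy.spatial import cKDTree
points = np.random.rand(1500, 3) * 10  # 3D points this time
# an edge between every pair of points closer than the threshold distance
tree = cKDTree(points)
pairs = tree.query_pairs(r=1.0)
G = nx.Graph()
G.add_nodes_from(range(len(points)))
G.add_edges_from(pairs)
# each connected component of the graph is one cluster
components = list(nx.connected_components(G))
print('Number of clusters:', len(components))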

Clustering observations based first on an attribute and on distance matrix

I have a dataset with locations (coordinates) and a scalar attribute of each location (for example, temperature). I need to cluster the locations based on the scalar attribute, but taking into consideration the distance between locations.
The problem is that, using temperature as an example, it is possible for locations that are far from each other to have the same temperature. If I cluster on temperature, these locations will be in the same cluster, when they shouldn't. The opposite is true if two locations that are near each other have different temperatures. In this case, clustering on temperature may result in these observations being in different clusters, while clustering based on a distance matrix would put them in the same one.
So, is there a way in which I could cluster observations giving more importance to one attribute (temperature) and then "refining" based on the distance matrix?
Here is a simple example showing how the clustering differs depending on whether the attribute or the distance matrix is used as the basis. My goal is to be able to use both the attribute and the distance matrix, giving more importance to the attribute.
import numpy as np
import matplotlib.pyplot as plt
import haversine
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial import distance as ssd
# Create location data
x = np.random.rand(100, 1)
y = np.random.rand(100, 1)
t = np.random.randint(0, 20, size=(100,1))
# Compute distance matrix
D = np.zeros((len(x), len(y)))
for k in range(len(x)):
    for j in range(len(y)):
        distance_pair = haversine.distance((x[k], y[k]), (x[j], y[j]))
        D[k, j] = distance_pair
# Compare clustering alternatives
Zt = linkage(t, 'complete')
Zd = linkage(ssd.squareform(D), method="complete")
# Cluster based on t
clt = fcluster(Zt, 5, criterion='distance').reshape(100,1)
plt.figure(figsize=(10, 8))
plt.scatter(x, y, c=clt)
plt.show()
# Cluster based on distance matrix
cld = fcluster(Zd, 10, criterion='distance').reshape(100,1)
plt.figure(figsize=(10, 8))
plt.scatter(x, y, c=cld)
plt.show()
haversine.py is available here: https://gist.github.com/rochacbruno/2883505
Thanks.
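One rough sketch of what is being asked (my own, not an accepted approach, with an arbitrary weight w and rescaling): blend a pairwise attribute-difference matrix with the spatial distance matrix and cluster on the weighted combination.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster
# toy data: 100 locations with a scalar attribute (temperature)
xy = np.random.rand(100, 2)
t = np.random.randint(0, 20, size=(100, 1)).astype(float)
# condensed pairwise distances for the attribute and for the locations,
# each rescaled to [0, 1] so the weights are comparable
d_attr = pdist(t)
d_geo = pdist(xy)
d_attr /= d_attr.max()
d_geo /= d_geo.max()
w = 0.7  # how much more the attribute counts than the spatial distance
d_combined = w * d_attr + (1 - w) * d_geo
Z = linkage(d_combined, method='complete')
labels = fcluster(Z, t=0.3, criterion='distance')
print(len(np.unique(labels)), 'clusters')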

SKlearn: KDTree how to return nearest neighbour based on threshold (Python)

I have a database of 300 Images and I extracted for each of them a BOVW. Starting from a query image (with query_BOVW extracted from the same dictionary) I need to find similar images in my training dataset.
I used sklearn's KDTree on my training set, kd_tree = KDTree(training), and then I calculate the distance from the query vector with kd_tree.query(query_vector). The latter function takes as its second parameter the number of nearest neighbours to return, but what I am looking for is to set a threshold on the Euclidean distance and, based on this threshold, get a varying number of nearest neighbours.
I looked into the documentation but I did not find anything about that. Am I wrong in seeking something that perhaps makes no sense?
Thanks for the help.
You want to use query_radius here.
query_radius(self, X, r, count_only = False):
query the tree for neighbors within a radius r
...
Here is the example from the linked documentation (note that BinaryTree is the abstract base class, so in practice you instantiate KDTree):
import numpy as np
from sklearn.neighbors import KDTree
np.random.seed(0)
X = np.random.random((10, 3))  # 10 points in 3 dimensions
tree = KDTree(X, leaf_size=2)
print(tree.query_radius(X[:1], r=0.3, count_only=True))
ind = tree.query_radius(X[:1], r=0.3)
print(ind)  # indices of neighbors within distance 0.3
From the documentation, you can use the method query_radius:
Query for neighbors within a given radius:
import numpy as np
from sklearn.neighbors import KDTree
np.random.seed(0)
X = np.random.random((10, 3))  # 10 points in 3 dimensions
tree = KDTree(X, leaf_size=2)
print(tree.query_radius(X[:1], r=0.3, count_only=True))
ind = tree.query_radius(X[:1], r=0.3)  # indices of neighbors within distance 0.3
This works with sklearn version 0.19.1.
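As a small follow-up (my own note, not from the original answers): query_radius can also return the distances, sorted, so that the closest matches within the radius come first.
# return_distance/sort_results give the neighbours ordered by distance
ind, dist = tree.query_radius(X[:1], r=0.3, return_distance=True, sort_results=True)
print(ind[0], dist[0])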

Parse list of x,y coordinates and detect continuous areas

I have a list of x, y coordinates.
What I need to do is separate those into groups of continuous areas.
All the x, y coordinates in the list will end up belonging to a particular group.
I currently have a simple algorithm that just goes through each point and finds all the adjacent points (points whose coordinates differ by at most 1 in x and in y).
However, it is much too slow when it comes to using large x,y lists.
PS Keep in mind that there could be holes in the middle of groups.
One simple method that you could use is k-means clustering. k-means partitions a list of observations into k clusters, where each point belongs to the cluster with the nearest mean. If you know that there are k=2 groups of points, then this method should work very well, assuming your clusters of points are reasonably well separated (and even if they have holes). SciPy has an implementation of k-means that should be easy to apply.
Here's an example of the type of analysis you can perform.
# import required modules
import numpy as np
from scipy.cluster.vq import kmeans2
# generate clouds of 2D normally distributed points
N = 6000000 # number of points in each cluster
# cloud 1: mean (0, 0)
mean1 = [0, 0]
cov1 = [[1, 0], [0, 1]]
x1,y1 = np.random.multivariate_normal(mean1, cov1, N).T
# cloud 2: mean (5, 5)
mean2 = [5, 5]
cov2 = [[1, 0], [0, 1]]
x2,y2 = np.random.multivariate_normal(mean2, cov2, N).T
# merge the clouds and arrange into data points
xs, ys = np.concatenate( (x1, x2) ), np.concatenate( (y1, y2) )
points = np.array([xs, ys]).T
# cluster the points using k-means
centroids, clusters = kmeans2(points, k=2)
Running this on my 2012 MBA with 12 million data points is pretty fast:
>>> time python test.py
real 0m20.957s
user 0m18.128s
sys 0m2.732s
It is also 100% accurate (not surprising given that the point clouds don't overlap at all). Here's some quick code for computing the accuracy of the cluster assignments. The only tricky part is I first use Euclidean distance to identify which cluster's centroid matches up with the mean of the original data cloud.
# determine which centroid belongs to which cluster
# using Euclidean distance
dist1 = np.linalg.norm(centroids[0] - mean1)
dist2 = np.linalg.norm(centroids[1] - mean1)
if dist1 <= dist2:
    FIRST, SECOND = 0, 1
else:
    FIRST, SECOND = 1, 0
# compute accuracy by iterating through all 2N points
# note: first N points are from cloud1, second N points are from cloud2
correct = 0
for i in range(len(clusters)):
    if clusters[i] == FIRST and i < N:
        correct += 1
    elif clusters[i] == SECOND and i >= N:
        correct += 1
# output accuracy
print('Accuracy: %.2f' % (correct * 100. / len(clusters)))
What you want to do is called finding connected components in image processing. You have a binary image in which all the (x, y) pixels that are in your list are 1, and pixels that aren't are 0.
You can use numpy/scipy to turn your data into a 2D binary image, and then call ndimage.label to find the connected components.
Supposing all x and y are >= 0, you know max_x and max_y, and the resulting image fits into memory, then something like this should work:
import numpy as np
from scipy import ndimage
# +1 so that coordinates up to max_x / max_y fit into the image
image = np.zeros((max_x + 1, max_y + 1))
for x, y in huge_list_of_xy_points:
    image[x, y] = 1
# ndimage.label returns the labelled image and the number of groups found
labelled, num_groups = ndimage.label(image)
This should give you an array in which all pixels in group 1 have value 1, all pixels in group 2 have value 2, et cetera. Not tested.
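If you then need each group back as a list of coordinates, one straightforward way (my own addition, reusing labelled and huge_list_of_xy_points from above) is:
# collect the points belonging to each label into per-group lists
groups = {}
for x, y in huge_list_of_xy_points:
    groups.setdefault(labelled[x, y], []).append((x, y))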
First of all, you can model the problem with a corresponding graph G(V, E):
Points are vertices and there is an edge e between point A and point B if and only if A is "close" to B, where you can define "close" on your own.
Since each point belongs to exactly one group, groups form disjoint sets and you can use a simple DFS to assign points to groups. In graph theory the underlying problem is called Connected Components.
The complexity of DFS is linear i.e. O(V + E).
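A minimal sketch of that DFS idea in Python, assuming 8-connected adjacency (coordinates differing by at most 1 in both x and y count as "close"):
def group_points(points):
    point_set = set(points)
    seen = set()
    groups = []
    for p in points:
        if p in seen:
            continue
        # iterative DFS starting from p
        stack, group = [p], []
        seen.add(p)
        while stack:
            x, y = stack.pop()
            group.append((x, y))
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    q = (x + dx, y + dy)
                    if q in point_set and q not in seen:
                        seen.add(q)
                        stack.append(q)
        groups.append(group)
    return groups

print(group_points([(0, 0), (1, 1), (5, 5), (5, 6)]))
# [[(0, 0), (1, 1)], [(5, 5), (5, 6)]]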

Peak detection in a noisy 2d array

I'm trying to get python to return, as close as possible, the center of the most obvious clustering in an image like the one below:
In my previous question I asked how to get the global maximum and the local maxima of a 2D array, and the answers given worked perfectly. The issue is that the center estimate I get by averaging the global maxima obtained with different bin sizes is always slightly off from the one I would set by eye, because I'm only accounting for the biggest bin instead of a group of the biggest bins (like one does by eye).
I tried adapting the answer to this question to my problem, but it turns out my image is too noisy for that algorithm to work. Here's my code implementing that answer:
import numpy as np
from scipy.ndimage.filters import maximum_filter
from scipy.ndimage.morphology import generate_binary_structure, binary_erosion
import matplotlib.pyplot as pp
from os import getcwd
from os.path import join, realpath, dirname
# Save path to dir where this code exists.
mypath = realpath(join(getcwd(), dirname(__file__)))
myfile = 'data_file.dat'
x, y = np.loadtxt(join(mypath,myfile), usecols=(1, 2), unpack=True)
xmin, xmax = min(x), max(x)
ymin, ymax = min(y), max(y)
rang = [[xmin, xmax], [ymin, ymax]]
paws = []
for d_b in range(25, 110, 25):
    # Number of bins in x,y given the bin width 'd_b'
    binsxy = [int((xmax - xmin) / d_b), int((ymax - ymin) / d_b)]
    H, xedges, yedges = np.histogram2d(x, y, range=rang, bins=binsxy)
    paws.append(H)

def detect_peaks(image):
    """
    Takes an image and detects the peaks using the local maximum filter.
    Returns a boolean mask of the peaks (i.e. 1 when
    the pixel's value is the neighborhood maximum, 0 otherwise)
    """
    # define an 8-connected neighborhood
    neighborhood = generate_binary_structure(2, 2)
    # apply the local maximum filter; all pixels of maximal value
    # in their neighborhood are set to 1
    local_max = maximum_filter(image, footprint=neighborhood) == image
    # local_max is a mask that contains the peaks we are
    # looking for, but also the background.
    # In order to isolate the peaks we must remove the background from the mask.
    # we create the mask of the background
    background = (image == 0)
    # a little technicality: we must erode the background in order to
    # successfully subtract it from local_max, otherwise a line will
    # appear along the background border (artifact of the local maximum filter)
    eroded_background = binary_erosion(background, structure=neighborhood, border_value=1)
    # we obtain the final mask, containing only peaks,
    # by removing the eroded background from the local_max mask
    detected_peaks = local_max & ~eroded_background
    return detected_peaks

# applying the detection and plotting results
for i, paw in enumerate(paws):
    detected_peaks = detect_peaks(paw)
    pp.subplot(4, 2, (2 * i + 1))
    pp.imshow(paw)
    pp.subplot(4, 2, (2 * i + 2))
    pp.imshow(detected_peaks)
pp.show()
and here's the result of that (varying the bin size):
Clearly my background is too noisy for that algorithm to work, so the question is: how can I make that algorithm less sensitive? If an alternative solution exists then please let me know.
EDIT
Following Bi Rico's advice I attempted smoothing my 2D array before passing it to the local maximum finder, like so (gaussian_filter here is scipy.ndimage.gaussian_filter):
H, xedges, yedges = np.histogram2d(x, y, range=rang, bins=binsxy)
H1 = gaussian_filter(H, 2, mode='nearest')  # smooth the histogram
paws.append(H1)
These were the results with a sigma of 2, 4 and 8:
EDIT 2
Using mode='constant' seems to work much better than 'nearest'. It converges to the right center with sigma=2 for the largest bin size:
So, how do I get the coordinates of the maximum that shows in the last image?
Answering the last part of your question: whenever you have point sources in an image, you can find their coordinates by searching, in some order, for the local maxima of the image. In case your data is not a point source, you can apply a mask to each peak so that its neighborhood cannot show up as a maximum in a later search. I propose the following code:
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import numpy as np
import copy
def get_std(image):
    return np.std(image)

def get_max(image, sigma, alpha=20, size=10):
    i_out = []
    j_out = []
    image_temp = copy.deepcopy(image)
    while True:
        k = np.argmax(image_temp)
        j, i = np.unravel_index(k, image_temp.shape)
        if(image_temp[j, i] >= alpha*sigma):
            i_out.append(i)
            j_out.append(j)
            # zero out a square of side 2*size around the peak so it is
            # not found again in the next iteration
            x = np.arange(i-size, i+size)
            y = np.arange(j-size, j+size)
            xv, yv = np.meshgrid(x, y)
            image_temp[yv.clip(0, image_temp.shape[0]-1),
                       xv.clip(0, image_temp.shape[1]-1)] = 0
        else:
            break
    return i_out, j_out

# reading the image
image = mpimg.imread('ggd4.jpg')
# computing the standard deviation of the image
sigma = get_std(image)
# getting the peaks
i, j = get_max(image[:, :, 0], sigma, alpha=10, size=10)
# let's see the results
plt.imshow(image, origin='lower')
plt.plot(i, j, 'ro', markersize=10, alpha=0.5)
plt.show()
The image ggd4 for the test can be downloaded from:
http://www.ipac.caltech.edu/2mass/gallery/spr99/ggd4.jpg
The first part is to get some information about the noise in the image. I did it by computing the standard deviation of the full image (actually, it is better to select a small rectangle without signal). This tells us how much noise is present in the image.
The idea for getting the peaks is to ask for successive maxima that are above a certain threshold (say, 3, 4, 5, 10, or 20 times the noise). This is what the function get_max is actually doing. It keeps searching for maxima until one of them falls below the threshold imposed by the noise. To avoid finding the same maximum many times, it is necessary to remove the peaks from the image. In general, the shape of the mask used to do so depends strongly on the problem one wants to solve. For the case of stars, it would be good to remove the star using a Gaussian function, or something similar. For simplicity I have chosen a square mask, whose size (in pixels) is the variable "size".
I think that from this example, anybody can improve the code by adding more general things.
EDIT:
The original image looks like:
While the image after identifying the luminous points looks like this:
Too much of a n00b on Stack Overflow to comment on Alejandro's answer elsewhere here. I would refine his code a bit to use a preallocated numpy array for output:
def get_max(image, sigma, alpha=3, size=10):
    from copy import deepcopy
    import numpy as np
    # preallocate a lot of peak storage
    k_arr = np.zeros((10000, 2))
    image_temp = deepcopy(image)
    peak_ct = 0
    while True:
        k = np.argmax(image_temp)
        j, i = np.unravel_index(k, image_temp.shape)
        if(image_temp[j, i] >= alpha*sigma):
            k_arr[peak_ct] = [j, i]
            # this is the part that masks already-found peaks.
            x = np.arange(i-size, i+size)
            y = np.arange(j-size, j+size)
            xv, yv = np.meshgrid(x, y)
            # the clip here handles edge cases where the peak is near the
            # image edge
            image_temp[yv.clip(0, image_temp.shape[0]-1),
                       xv.clip(0, image_temp.shape[1]-1)] = 0
            peak_ct += 1
        else:
            break
    # trim the output for only what we've actually found
    return k_arr[:peak_ct]
Profiling this and Alejandro's code on his example image, this code is about 33% faster (0.03 s for Alejandro's code, 0.02 s for mine). I expect that on images with larger numbers of peaks it would be even faster, since appending the output to a list gets slower and slower as more peaks are found.
I think the first step needed here is to express the values in H in terms of the standard deviation of the field:
import numpy as np
H = H / np.std(H)
Now you can put a threshold on the values of this H. If the noise is assumed to be Gaussian, picking a threshold of 3 you can be quite sure (99.7%) that this pixel can be associated with a real peak and not noise. See here.
Now the further selection can start. It is not exactly clear to me what exactly you want to find. Do you want the exact location of peak values? Or do you want one location for a cluster of peaks which is in the middle of this cluster?
Anyway, starting from this point with all pixel values expressed in standard deviations of the field, you should be able to get what you want. If you want to find clusters you could perform a nearest neighbour search on the >3-sigma gridpoints and put a threshold on the distance. I.e. only connect them when they are close enough to each other. If several gridpoints are connected you can define this as a group/cluster and calculate some (sigma-weighted?) center of the cluster.
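A rough sketch of that last step (my own, assuming the smoothed 2D histogram H1 from the edit above and a single dominant cluster):
import numpy as np
# express the histogram in units of its standard deviation
H_sig = H1 / np.std(H1)
xi, yi = np.nonzero(H_sig > 3)          # bin indices of the significant bins
w = H_sig[xi, yi]
x_cent_bin = np.average(xi, weights=w)  # sigma-weighted centre, in bin units
y_cent_bin = np.average(yi, weights=w)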
Hope my first contribution on Stackoverflow is useful for you!
The way I would do it:
1) normalize H between 0 and 1.
2) pick a threshold value, as tcaswell suggests. It could be between .9 and .99 for example
3) use masked arrays to keep only the x,y coordinates with H above threshold:
import numpy.ma as ma
x_masked = ma.masked_array(x, mask=H < threshold)
y_masked = ma.masked_array(y, mask=H < threshold)
4) now you can weight-average the masked coordinates, with a weight such as (H-threshold)^2, or any other power greater than or equal to one, depending on your taste/tests (see the sketch after the comments below).
Comment:
1) This is not robust with respect to the type of peaks you have, since you may have to adapt the threshold. That is a minor problem;
2) This DOES NOT work with two peaks as it is, and will give wrong results if the 2nd peak is above threshold.
Nonetheless, it will always give you an answer without crashing (with pros and cons of the thing..)
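A sketch of step 4 (my own, assuming x, y and H have already been aligned to the same shape, and with threshold left as a free parameter):
import numpy as np
import numpy.ma as ma
# weight each retained coordinate by (H - threshold)**2
weights = ma.masked_array((H - threshold) ** 2, mask=H < threshold)
x_peak = ma.average(x_masked, weights=weights)
y_peak = ma.average(y_masked, weights=weights)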
I'm adding this answer because it's the solution I ended up using. It's a combination of Bi Rico's comment here (May 30 at 18:54) and the answer given in this question: Find peak of 2d histogram.
As it turns out, using the peak detection algorithm from the question Peak detection in a 2D array only complicates matters. After applying the Gaussian filter to the image, all that needs to be done is to ask for the maximum bin (as Bi Rico pointed out) and then obtain the maximum in x,y coordinates.
So instead of using the detect_peaks function as I did above, I simply add the following code after the Gaussian-filtered 2D histogram is obtained:
# Get 2D histogram.
H, xedges, yedges = np.histogram2d(x, y, range=rang, bins=binsxy)
# Get Gaussian filtered 2D histogram.
H1 = gaussian_filter(H, 2, mode='nearest')
# Get center of maximum in bin coordinates.
x_cent_bin, y_cent_bin = np.unravel_index(H1.argmax(), H1.shape)
# Get center in x,y coordinates.
x_cent_coord, y_cent_coord = np.average(xedges[x_cent_bin:x_cent_bin + 2]), np.average(yedges[y_cent_bin:y_cent_bin + 2])
