Is there a way to vectorize this code to eliminate the for loop:
import numpy as np
Z = np.concatenate((X, labels[:,None]), axis=1)
centroids = np.empty([len(np.unique(labels)) - 1, 2])
for i in np.unique(labels[labels > -1]):
    centroids[i, :] = Z[Z[:, -1] == i][:, :-1].mean(0)
centroids
This code produces pseudo-centroids from the scikit-learn DBSCAN example, in case you want to play with it to find a vectorized form; X and labels are defined in that example.
Thanks for your help!
You can use bincount() three times. With a weights argument, np.bincount() sums the weights per label, so dividing the summed coordinates by the per-label counts gives the means. Mask out the noise points (label -1) first, since bincount() does not accept negative values:
mask = labels > -1
count = np.bincount(labels[mask])
x = np.bincount(labels[mask], X[mask, 0])
y = np.bincount(labels[mask], X[mask, 1])
centroids = np.c_[x, y] / count[:, None]
print(centroids)
But if you can use pandas, this is very simple:
import pandas as pd

Z = np.concatenate((X, labels[:, None]), axis=1)
df = pd.DataFrame(Z, columns=("x", "y", "label"))
df[df['label'] > -1].groupby("label").mean()
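If you need the result as a plain NumPy array rather than a DataFrame, a small follow-up sketch (assuming the df built above; use .values on older pandas):
centroids = df[df['label'] > -1].groupby("label").mean().to_numpy()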
Related
I have arrays x and y: x consists of three single-value arrays, and y consists of three arrays of seven values each.
x= [np.array([6.03437288]), np.array([6.39850922]), np.array([6.07835145])]
y= [np.array([[-1.06565856, -0.16222044, 7.85850477, -2.62498475, -0.46315498,
-0.33087472, -0.1394244 ]]),
np.array([[-1.41487104e+00, 5.81421750e-03, 7.92917001e+00,
-3.37987517e+00, 1.14685839e-01, -2.91779263e-01,
2.51753851e-01]]),
np.array([[-1.56496814, 0.2612637 , 7.60577761, -3.55727614, 0.18844392,
-0.75112678, -0.48055978]])]
I concatenate x and y into one dataframe
df = pd.DataFrame({'x': x,'y': y})
then I tried to cluster this dataframe by k-medoids
kmedoids = KMedoids(n_clusters=3, random_state=0).fit(df)
cluster_labels = kmedoids.predict(df)
but I faced this error
ValueError: setting an array element with a sequence.
I searched for a solution to this problem but haven't found a concrete one. Any suggestions are welcome, even ones that modify the code.
Given arrays x and y as provided in the question:
import numpy as np
import pandas as pd
from sklearn_extra.cluster import KMedoids

df = pd.DataFrame({'x': x, 'y': y})
First, concatenate x and y of the dataframe into one array per row:
df2 = df.apply(lambda r: np.append(r.x, r.y), axis = 1)
Then create one X array:
X = np.array(df2.values.tolist())
that can be passed to the clustering method:
kmedoids = KMedoids(n_clusters=3, random_state=0).fit(X)
cluster_labels = kmedoids.predict(X)
The result of clustering:
array([2, 0, 1], dtype=int64)
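As an aside, a minimal sketch that builds the same (3, 8) array without the DataFrame detour, assuming x and y as given in the question:
import numpy as np

X = np.hstack([np.array(x).reshape(-1, 1), np.vstack(y)])  # shape (3, 8)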
I figured that sklearn kmeans uses imaginary points as cluster centroids.
So far, I found no option to use real data points as centroids in sklearn.
I am currently calculating the data point that is closest to a centroid but thought there might be an easier way.
I am not necessarily restricted to kmeans by the way.
A google search around clustering with real data centroids wasn't fruitful either.
Has anyone had the same problem before?
import numpy as np
from sklearn.cluster import KMeans
import math
def distance(a, b):
    dist = math.sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2)
    return dist
x = np.random.rand(10)
y = np.random.rand(10)
xy = np.array((x,y)).T
kmeans = KMeans(n_clusters=2)
kmeans.fit(xy)
centroids = kmeans.cluster_centers_
print(np.where(xy == centroids[0])[0])
for c in centroids:
    nearest = min(xy, key=lambda x: distance(x, c))
    print('centroid', c)
    print('nearest data point to centroid', nearest)
Actually, sklearn.cluster.KMeans now allows custom centroids through the init parameter.
See the init section here: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
or the source code for sklearn KMeans here: https://github.com/scikit-learn/scikit-learn/blob/b194674c4/sklearn/cluster/_kmeans.py#L649
"If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers."
I hope that it works. Please try.
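A minimal sketch of that option, assuming you use two rows of xy (real data points) as the starting centers; note these are only the initial centers, and cluster_centers_ is still recomputed as cluster means during fitting:
init_points = xy[:2]  # shape (n_clusters, n_features)
kmeans = KMeans(n_clusters=2, init=init_points, n_init=1)
kmeans.fit(xy)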
Centroids do not have to be points in your set. Since you are in a 2-d space, you will find centroids with 2-d coordinates. If you want to print the distances between each centroid and each point, you can:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
x = np.random.rand(10)
y = np.random.rand(10)
xy = np.array((x,y)).T
kmeans = KMeans(n_clusters=2)
kmeans.fit(xy)
centroids = kmeans.cluster_centers_
for centroid in centroids:
    print(f'List of distances between centroid {centroid} and each point:\n'
          f'{np.linalg.norm(centroid - xy, axis=1)}\n')
List of distances between centroid [0.87236496 0.74034618] and each point:
[0.21056113 0.84946149 0.83381298 0.31347176 0.40811323 0.85442416
0.44043437 0.66736601 0.55282619 0.14813826]
List of distances between centroid [0.37243631 0.37851987] and each point:
[0.77005698 0.29192851 0.25249753 0.60881231 0.2219568 0.24264077
0.27374379 0.39968813 0.31728732 0.58604271]
As you can see, each prediction corresponds to the centroid to which the distance is minimal:
kmeans.predict(xy)
array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])
distances = np.vstack([np.linalg.norm(centroids[0]-xy, axis=1),
np.linalg.norm(centroids[1]-xy, axis=1)])
distances.argmin(axis=0)
array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])
Let's plot the data: centroids are square-shaped and points are circle-shaped, with size inversely proportional to the distance from their centroid. Although the figure plots other random data points, I hope it helps.
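Since the original figure is not reproduced here, a rough plotting sketch along those lines, assuming xy, centroids, and kmeans from above:
import matplotlib.pyplot as plt

pred = kmeans.predict(xy)
# distance of each point to its own centroid; marker size is inversely proportional to it
d = np.linalg.norm(xy - centroids[pred], axis=1)
plt.scatter(xy[:, 0], xy[:, 1], s=20 / d, c=pred)                   # points (circles)
plt.scatter(centroids[:, 0], centroids[:, 1], marker='s', c='red')  # centroids (squares)
plt.show()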
I've been through the same question: how to find the sample within each cluster that minimizes inertia. I made this function:
import numpy as np
from sklearn.metrics import pairwise_distances_chunked
def index_representative_points(km, X):
    ret = []
    for k in range(km.n_clusters):
        mask = (km.labels_ == k).nonzero()[0]
        s = []
        for chunk in pairwise_distances_chunked(X=X[mask]):
            s.append(np.square(chunk).sum(axis=1))
        ret.append(mask[np.argmin(np.concatenate(s))])
    return np.array(ret)
And it can be used like this:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=100, centers=3, cluster_std=0.60, random_state=0)
km = KMeans(n_clusters=3, random_state=0).fit(X)
index_representative_points(km, X)
>>> array([89, 25, 28], dtype=int64)
EDIT:
For very large datasets, the function is very slow. But it can be proven that the point within a cluster that minimizes the inertia is the one closest to the centroid. Hence, this second version:
from sklearn.metrics import pairwise_distances_argmin

def index_representative_points(km, X):
    ret = []
    for k in range(km.n_clusters):
        mask = (km.labels_ == k).nonzero()[0]
        centroid = np.mean(X[mask], axis=0)
        i0 = mask[pairwise_distances_argmin(centroid[None, :], X[mask])[0]]
        ret.append(i0)
    return np.array(ret)
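If the model is a fitted KMeans, a shorter sketch that queries the stored centers directly with pairwise_distances_argmin (assuming km and X as above); note this searches over all samples, so in rare cases it can return a point assigned to a different cluster than the center's own:
from sklearn.metrics import pairwise_distances_argmin

# for each cluster center, the index of the closest sample in X
pairwise_distances_argmin(km.cluster_centers_, X)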
I'm following an excellent Medium article (https://towardsdatascience.com/k-medoids-clustering-on-iris-data-set-1931bf781e05) to implement k-medoids from scratch. There is a place in the code where each point's distance to the medoid centers is calculated, and it is VERY slow: it has numpy.linalg.norm inside a loop. Is there a way to optimize this with numpy.linalg.norm, with numpy broadcasting, or with scipy.spatial.distance.cdist and np.argmin, to do the same thing?
###helper function here###
def compute_d_p(X, medoids, p):
    m = len(X)
    medoids_shape = medoids.shape
    # If a 1-D array is provided,
    # it will be reshaped to a single row 2-D array
    if len(medoids_shape) == 1:
        medoids = medoids.reshape((1, len(medoids)))
    k = len(medoids)
    S = np.empty((m, k))
    for i in range(m):
        d_i = np.linalg.norm(X[i, :] - medoids, ord=p, axis=1)
        S[i, :] = d_i**p
    return S
This is where the slowdown occurs:
for datap in cluster_points:
    new_medoid = datap
    new_dissimilarity = np.sum(compute_d_p(X, datap, p))
    if new_dissimilarity < avg_dissimilarity:
        avg_dissimilarity = new_dissimilarity
        out_medoids[i] = datap
Full code below. All credits to the article author.
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets
from sklearn.decomposition import PCA
# Dataset
iris = datasets.load_iris()
data = pd.DataFrame(iris.data,columns = iris.feature_names)
target = iris.target_names
labels = iris.target
#Scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
#PCA Transformation
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
principalComponents = pca.fit_transform(data)
PCAdf = pd.DataFrame(data = principalComponents , columns = ['principal component 1', 'principal component 2','principal component 3'])
datapoints = PCAdf.values
m, f = datapoints.shape
k = 3
def init_medoids(X, k):
    from numpy.random import choice
    from numpy.random import seed
    seed(1)
    samples = choice(len(X), size=k, replace=False)
    return X[samples, :]
medoids_initial = init_medoids(datapoints, 3)
def compute_d_p(X, medoids, p):
    m = len(X)
    medoids_shape = medoids.shape
    # If a 1-D array is provided,
    # it will be reshaped to a single row 2-D array
    if len(medoids_shape) == 1:
        medoids = medoids.reshape((1, len(medoids)))
    k = len(medoids)
    S = np.empty((m, k))
    for i in range(m):
        d_i = np.linalg.norm(X[i, :] - medoids, ord=p, axis=1)
        S[i, :] = d_i**p
    return S
S = compute_d_p(datapoints, medoids_initial, 2)
def assign_labels(S):
    return np.argmin(S, axis=1)
labels = assign_labels(S)
def update_medoids(X, medoids, p):
    S = compute_d_p(X, medoids, p)
    labels = assign_labels(S)
    out_medoids = medoids
    for i in set(labels):
        avg_dissimilarity = np.sum(compute_d_p(X, medoids[i], p))
        cluster_points = X[labels == i]
        for datap in cluster_points:
            new_medoid = datap
            new_dissimilarity = np.sum(compute_d_p(X, datap, p))
            if new_dissimilarity < avg_dissimilarity:
                avg_dissimilarity = new_dissimilarity
                out_medoids[i] = datap
    return out_medoids
def has_converged(old_medoids, medoids):
    return set([tuple(x) for x in old_medoids]) == set([tuple(x) for x in medoids])
#Full algorithm
def kmedoids(X, k, p, starting_medoids=None, max_steps=np.inf):
    if starting_medoids is None:
        medoids = init_medoids(X, k)
    else:
        medoids = starting_medoids
    converged = False
    labels = np.zeros(len(X))
    i = 1
    while (not converged) and (i <= max_steps):
        old_medoids = medoids.copy()
        S = compute_d_p(X, medoids, p)
        labels = assign_labels(S)
        medoids = update_medoids(X, medoids, p)
        converged = has_converged(old_medoids, medoids)
        i += 1
    return (medoids, labels)
results = kmedoids(datapoints, 3, 2)
final_medoids = results[0]
data['clusters'] = results[1]
There's a good chance numpy's broadcasting capabilities will help. Getting broadcasting to work in 3+ dimensions is a bit tricky, and I usually have to resort to a bit of trial and error to get the details right.
The use of linalg.norm here compounds things further, because my version of the code won't give identical results to linalg.norm for all inputs. But I believe it will give identical results for all relevant inputs in this case.
I've added some comments to the code to explain the thinking behind certain details.
def compute_d_p_broadcasted(X, medoids, p):
    # If a 1-D array is provided,
    # it will be reshaped to a single row 2-D array
    if len(medoids.shape) == 1:
        medoids = medoids.reshape((1, len(medoids)))

    # In general, broadcasting n-dim arrays requires that the last
    # dim of the first array be a singleton dimension, and that the
    # first dim of the second array be a singleton dimension. We can
    # quickly accomplish that by slicing with `None` in the appropriate
    # places. (`np.newaxis` is a slightly more self-documenting way
    # of spelling `None`, but I rarely bother.)

    # In this case, the shapes of the other two dimensions also
    # have to align in the same way you'd expect for a dot product.
    # So we pass `medoids.T`.
    diff = np.abs(X[:, :, None] - medoids.T[None, :, :])

    # The last tricky bit is to figure out which axis to sum. Right
    # now, the array is a 3-dimensional array, with the first
    # dimension corresponding to the rows of `X` and the last
    # dimension corresponding to the columns of `medoids.T`.
    # The middle dimension corresponds to the underlying dimensionality
    # of the space; that's what we want to sum for a sum of squares.
    # (Or sum of cubes for L3 norm, etc.)
    return (diff ** p).sum(axis=1)
def compute_d_p(X, medoids, p):
    m = len(X)
    medoids_shape = medoids.shape
    # If a 1-D array is provided,
    # it will be reshaped to a single row 2-D array
    if len(medoids_shape) == 1:
        medoids = medoids.reshape((1, len(medoids)))
    k = len(medoids)
    S = np.empty((m, k))
    for i in range(m):
        d_i = np.linalg.norm(X[i, :] - medoids, ord=p, axis=1)
        S[i, :] = d_i**p
    return S
# A couple of simple tests:
X = np.array([[1.0, 2, 3],
              [4, 5, 6],
              [7, 8, 9],
              [10, 11, 12]])
medoids = X[[0, 2], :]

np.allclose(compute_d_p(X, medoids, 2),
            compute_d_p_broadcasted(X, medoids, 2))
# Returns True

np.allclose(compute_d_p(X, medoids, 3),
            compute_d_p_broadcasted(X, medoids, 3))
# Returns True
Of course, these tests don't tell whether this actually gives a significant speedup. You'll have to check that yourself for the relevant use-case. But I suspect it will at least help.
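Since the question also mentions scipy.spatial.distance.cdist, here is a minimal alternative sketch under the same assumptions; cdist's Minkowski metric is the p-norm, so raising it to the p-th power should match compute_d_p:
from scipy.spatial.distance import cdist

def compute_d_p_cdist(X, medoids, p):
    # reshape a single medoid into a one-row 2-D array, as before
    if medoids.ndim == 1:
        medoids = medoids.reshape((1, len(medoids)))
    return cdist(X, medoids, metric='minkowski', p=p) ** p

np.allclose(compute_d_p(X, medoids, 2), compute_d_p_cdist(X, medoids, 2))
# Returns True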
For sorting a numpy array via argsort, we can do:
import numpy as np
x = np.random.rand(3)
x_sorted = x[np.argsort(x)]
I am looking for a numpy solution for the generalization to two or higher dimensions.
The indexing as in the 1d case won't work for 2d matrices.
Y = np.random.rand(4, 3)
sort_indices = np.argsort(Y)
#Y_sorted = Y[sort_indices] (what would that line be?)
Related: I am looking for a pure numpy answer that addresses the same problem as solved in this answer: https://stackoverflow.com/a/53700995/2272172
Use np.take_along_axis:
import numpy as np
np.random.seed(42)
x = np.random.rand(3)
x_sorted = x[np.argsort(x)]
Y = np.random.rand(4, 3)
sort_indices = np.argsort(Y)
print(np.take_along_axis(Y, sort_indices, axis=1))
print(np.array(list(map(lambda x, y: y[x], np.argsort(Y), Y)))) # the solution provided
Output
[[0.15599452 0.15601864 0.59865848]
[0.05808361 0.60111501 0.86617615]
[0.02058449 0.70807258 0.96990985]
[0.18182497 0.21233911 0.83244264]]
[[0.15599452 0.15601864 0.59865848]
[0.05808361 0.60111501 0.86617615]
[0.02058449 0.70807258 0.96990985]
[0.18182497 0.21233911 0.83244264]]
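If you instead need to sort down each column (axis=0), the same function applies; a quick sketch, assuming Y from above:
col_indices = np.argsort(Y, axis=0)
print(np.take_along_axis(Y, col_indices, axis=0))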
I found this example for using the kmeans2 algorithm in Python. I can't understand the following part:
# make some z values
z = numpy.sin(xy[:,1]-0.2*xy[:,1])
# whiten them
z = whiten(z)
# let scipy do its magic (k==3 groups)
res, idx = kmeans2(numpy.array(zip(xy[:,0],xy[:,1],z)),3)
The points are zip(xy[:,0],xy[:,1]), so what is the third value z doing here?
Also what is whitening?
Any explanation is appreciated. Thanks.
First:
# make some z values
z = numpy.sin(xy[:,1]-0.2*xy[:,1])
The weirdest thing about this is that it's equivalent to:
z = numpy.sin(0.8*xy[:, 1])
So I don't know why it's written that way. Maybe there's a typo?
Next,
# whiten them
z = whiten(z)
Whitening is simply normalizing the variance of the population. See here for a demo:
>>> z = np.sin(.8*xy[:, 1]) # the original z
>>> zw = vq.whiten(z) # save it under a different name
>>> zn = z / z.std() # make another 'normalized' array
>>> map(np.std, [z, zw, zn]) # standard deviations of the three arrays
[0.42645, 1.0, 1.0]
>>> np.allclose(zw, zn) # whitened is the same as normalized
True
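For 2-D data, whiten() does the same thing per feature column (each column divided by its own standard deviation); a quick check of that, using xy:
>>> np.allclose(vq.whiten(xy), xy / xy.std(axis=0))
True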
It's not obvious to me why it is whitened. Anyway, moving along:
# let scipy do its magic (k==3 groups)
res, idx = kmeans2(numpy.array(zip(xy[:,0],xy[:,1],z)),3)
Let's break that into two parts:
data = np.array(zip(xy[:, 0], xy[:, 1], z))
which is a weird (and slow) way of writing
data = np.column_stack([xy, z])
In any case, you started with two arrays and merge them into one:
>>> xy.shape
(30, 2)
>>> z.shape
(30,)
>>> data.shape
(30, 3)
Then it's data that is passed to the kmeans algorithm:
res, idx = vq.kmeans2(data, 3)
So now you can see that it's 30 points in 3-d space that are passed to the algorithm; the confusing part was just how the set of points was created.