As far as I can tell, sklearn's KMeans uses imaginary points (the cluster means) rather than actual data points as cluster centroids.
So far, I have found no option to use real data points as centroids in sklearn.
I am currently calculating the data point that is closest to each centroid, but thought there might be an easier way.
I am not necessarily restricted to kmeans, by the way.
A Google search for clustering with real data points as centroids wasn't fruitful either.
Has anyone had the same problem before?
import numpy as np
from sklearn.cluster import KMeans
import math
def distance(a, b):
    dist = math.sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2)
    return dist
x = np.random.rand(10)
y = np.random.rand(10)
xy = np.array((x,y)).T
kmeans = KMeans(n_clusters=2)
kmeans.fit(xy)
centroids = kmeans.cluster_centers_
print(np.where(xy == centroids[0])[0])
for c in centroids:
    nearest = min(xy, key=lambda x: distance(x, c))
    print('centroid', c)
    print('nearest data point to centroid', nearest)
Actually, sklearn.cluster.KMeans now lets you pass custom initial centroids.
See the init section here: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
or the source code for sklearn's KMeans here: https://github.com/scikit-learn/scikit-learn/blob/b194674c4/sklearn/cluster/_kmeans.py#L649
"If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers."
Note that this only sets the starting positions; the fitted centroids are still cluster means and need not coincide with any data point. Please try it and see whether it helps.
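Whatever is passed as init, KMeans still returns cluster means, so mapping each centroid back to a real sample remains a separate step. A minimal sketch of that step, assuming scikit-learn's pairwise_distances_argmin_min helper and random 2D data like in the question (the variable names are only illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

# Random 2D data, similar to the question's example
rng = np.random.default_rng(0)
xy = rng.random((10, 2))

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(xy)

# For every centroid, find the index of (and distance to) the closest real sample
nearest_idx, nearest_dist = pairwise_distances_argmin_min(kmeans.cluster_centers_, xy)
nearest_points = xy[nearest_idx]

for c, p in zip(kmeans.cluster_centers_, nearest_points):
    print('centroid', c, '-> nearest data point', p)
```

If the centers themselves must be data points, exemplar-based methods such as AffinityPropagation (which exposes cluster_centers_indices_) may be worth a look instead.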
Centroids do not have to be points in your set. Since you are in a 2D space, the centroids are simply points with 2D coordinates. If you want to print the distances between each centroid and each point, you can do:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
x = np.random.rand(10)
y = np.random.rand(10)
xy = np.array((x,y)).T
kmeans = KMeans(n_clusters=2)
kmeans.fit(xy)
centroids = kmeans.cluster_centers_
for centroid in centroids:
    print(f'List of distances between centroid {centroid} and each point:\n\
{np.linalg.norm(centroid-xy, axis=1)}\n')
List of distances between centroid [0.87236496 0.74034618] and each point:
[0.21056113 0.84946149 0.83381298 0.31347176 0.40811323 0.85442416
0.44043437 0.66736601 0.55282619 0.14813826]
List of distances between centroid [0.37243631 0.37851987] and each point:
[0.77005698 0.29192851 0.25249753 0.60881231 0.2219568 0.24264077
0.27374379 0.39968813 0.31728732 0.58604271]
As you can see, each point's prediction corresponds to the centroid at minimal distance:
kmeans.predict(xy)
array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])
distances = np.vstack([np.linalg.norm(centroids[0]-xy, axis=1),
np.linalg.norm(centroids[1]-xy, axis=1)])
distances.argmin(axis=0)
array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])
Let's plot the data: centroids are drawn as squares and points as circles whose size is inversely proportional to the distance from their centroid.
Although the figure was made with a different random draw of data points, I hope it helps.
I ran into the same question: how to find the sample within each cluster that minimizes inertia. I wrote this function:
import numpy as np
from sklearn.metrics import pairwise_distances_chunked
def index_representative_points(km, X):
    ret = []
    for k in range(km.n_clusters):
        mask = (km.labels_ == k).nonzero()[0]
        s = []
        for _ in pairwise_distances_chunked(X=X[mask]):
            s.append(np.square(_).sum(axis=1))
        ret.append(mask[np.argmin(np.concatenate(s))])
    return np.array(ret)
And it can be used like this :
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=100, centers=3, cluster_std=0.60, random_state=0)
km = KMeans(n_clusters=3, random_state=0).fit(X)
index_representative_points(km, X)
>>> array([89, 25, 28], dtype=int64)
EDIT :
For very large datasets, this function is very slow. But it can be shown that the point within the cluster that minimizes the inertia is the one closest to the centroid. Hence this second version:
from sklearn.metrics import pairwise_distances_argmin
def index_representative_points(km, X):
    ret = []
    for k in range(km.n_clusters):
        mask = (km.labels_ == k).nonzero()[0]
        centroid = np.mean(X[mask], axis=0)
        # index (within the cluster) of the sample closest to the centroid
        i0 = mask[pairwise_distances_argmin(centroid[None, :], X[mask])[0]]
        ret.append(i0)
    return np.array(ret)
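The equivalence used in the second version follows from the identity sum_j |x_j - p|^2 = sum_j |x_j - c|^2 + n |p - c|^2, where c is the cluster mean: the first term does not depend on p, so the best p is the one nearest to c. A small numerical check of that claim, on made-up random data (an illustration, not part of the original function):

```python
import numpy as np

# One synthetic "cluster" of 50 points in 3 dimensions
rng = np.random.default_rng(0)
cluster = rng.normal(size=(50, 3))

# Brute force: for each candidate point, sum of squared distances to all points
sq_dists = ((cluster[:, None, :] - cluster[None, :, :]) ** 2).sum(axis=2)
brute_force = int(sq_dists.sum(axis=1).argmin())

# Shortcut: the point closest to the cluster mean
centroid = cluster.mean(axis=0)
nearest_to_centroid = int(((cluster - centroid) ** 2).sum(axis=1).argmin())

print(brute_force, nearest_to_centroid)
```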
Related
My dataset can be found on Kaggle: https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python. I'm running k-means with k = 5 on this dataset, which has 4 columns and 200 rows. I wanted to find the cluster radius, so I measured the average distance of the data points from their respective cluster center, but whenever I re-run my program the values change. My cluster centers don't change between runs, so what's going on exactly? How do I fix this?
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import euclidean_distances
from sklearn.preprocessing import StandardScaler
import numpy as np
import scipy.spatial.distance as sdist
df = pd.read_csv(r'D:\Mall_Customers.csv', usecols = ['Spending Score (1-100)', 'Annual Income (k$)'])
x = StandardScaler().fit_transform(df)
kmeans = KMeans(n_clusters=5, max_iter=100, random_state=0)
y_kmeans= kmeans.fit_predict(x)
centroids = kmeans.cluster_centers_
print(centroids)
df["cluster"] = kmeans.labels_
n_clusters = 5
clusters = [x[y_kmeans == i] for i in range(n_clusters)]
for i, c in enumerate(clusters):
    print('Cluster {} has {} observations: {}...'.format(i, len(c), c[0]))
df["cluster"] = kmeans.labels_
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)
#cluster radius
def k_mean_distance(data, cx, cy, i_centroid, cluster_labels):
    distances = [np.sqrt((x - cx) ** 2 + (y - cy) ** 2) for (x, y) in data[cluster_labels == i_centroid]]
    return np.mean(distances)
t_data = PCA(n_components=2).fit_transform(x)
k_means = KMeans()
clusters = k_means.fit_predict(t_data)
centroids = kmeans.cluster_centers_
c_mean_distances = []
for i, (cx, cy) in enumerate(centroids):
    mean_distance = k_mean_distance(t_data, cx, cy, i, clusters)
    c_mean_distances.append(mean_distance)
print("mean distances are", c_mean_distances)
Output 1 [1.5381892556224435, 1.796763983963032, 1.5144402423920744, 3.4372440532366753, 1.6533031213582314]
Iteration 2 [3.180393284279158, 2.809194267986748, 0.7823704675079582, 3.4929008204149365, 1.8109097594336663]
Iteration 3 [1.9461073260609538, 3.2032294269352155, 2.447917517713439, 3.4372440532366753, 2.197239028470577]
I'll add an answer to document the issue.
First, when you compute a lower-dimensional embedding, make sure it doesn't need a random seed to be repeatable. In this case (PCA) I think it is OK, but other lower-dimensional embeddings may vary.
Second, KMeans does not always converge to a global optimum, so different runs can converge to different clusters. To keep KMeans repeatable, scikit-learn provides the random_state parameter.
You set this the first time you ran KMeans, which kept the first part of your code repeatable. To make the clustering after the PCA embedding repeatable too, set the random state in the same way:
k_means = KMeans(n_clusters=5, max_iter=100, random_state=0)
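It is also worth noting that the question's code reads cluster_centers_ from kmeans (fitted on the scaled data) while the labels come from k_means (fitted on the PCA output), so the centers and labels belong to different models. A minimal sketch of a repeatable radius computation that takes both from the same seeded model; the make_blobs data and the helper name mean_cluster_radii are stand-ins, not the Kaggle dataset or the asker's code:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

def mean_cluster_radii(data, n_clusters=5, seed=0):
    """Mean distance of each cluster's points to that cluster's own center."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(data)
    # Distance of every point to the center of the cluster it was assigned to
    dists = np.linalg.norm(data - model.cluster_centers_[model.labels_], axis=1)
    return np.array([dists[model.labels_ == k].mean() for k in range(n_clusters)])

# Synthetic stand-in for the scaled customer data (200 rows, 4 columns)
X, _ = make_blobs(n_samples=200, centers=5, n_features=4, random_state=0)
t_data = PCA(n_components=2).fit_transform(X)

first_run = mean_cluster_radii(t_data)
second_run = mean_cluster_radii(t_data)
print(first_run)
print(second_run)
```

With the seed fixed, the two printed arrays should be identical across runs.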
I'm getting bad clusters. I would like to rewrite this so that I can plug in any algorithm I like (e.g. hierarchical, knn, k-means, etc.).
#takes in our text_extracts dictionary and returns clusters in an indexed list
def run_clustering(plan):
    """ Transform texts to Tf-Idf coordinates and cluster texts using K-Means """
    vectorizer = TfidfVectorizer(tokenizer=process_text,
                                 max_df=0.5,
                                 min_df=0.005,
                                 ngram_range=(1,4),
                                 lowercase=True)
    #set the model with the vectorizer which will tokenize with our process_text function
    extracts = {}
    for page in plan.page_list:
        if len(page.text_extract) > 50:
            extracts[str(page.document_id) + '_' + str(page.page_number)] = page.text_extract
    extract_lst = [extracts[text] for text in extracts]
    tfidf_model = vectorizer.fit_transform(extract_lst)
    #determine cluster number with silhouette coefficient
    #start with 2 as a cluster size in case the set is very small
    num_of_clusters_to_test = [2]
    #going to test 25 more sizes in equal intervals based on the number of docs we are clustering
    intervals_to_test = int(len(extracts) / 25)
    #print(intervals_to_test)
    num_of_clusters_to_test += [i for i in range(len(extracts)) if i % intervals_to_test == 0 and i != 0]
    #these variables will help us determine the max silhouette
    #iters_since_new_max is just being held so that if we aren't reaching optimal size for
    #four iterations in a row, we dont have to keep testing huge cluster sizes
    max_silhouette_coef = 0
    iters_since_new_max = 0
    good_size = 2
    #cluster with a certain cluster size and record the silhouette coefficient
    for size in num_of_clusters_to_test:
        kmeans = KMeans(n_clusters=size).fit(tfidf_model)
        label = kmeans.labels_
        sil_coeff = silhouette_score(tfidf_model, label, metric='euclidean')
        if sil_coeff > max_silhouette_coef:
            max_silhouette_coef = sil_coeff
            good_size = size
            iters_since_new_max = 0
        else:
            iters_since_new_max += 1
        if iters_since_new_max > 4:
            break
    # finally cluster for with the good size we want
    km_model = KMeans(n_clusters=good_size)
    km_model.fit(tfidf_model)
    clustering = collections.defaultdict(list)
    for idx, label in enumerate(km_model.labels_):
        clustering[label].append(idx)
    return clustering
I left as many comments as I could to help you all follow what I am going for. Can anyone help me improve this?
You know KMeans is for numeric data only, right? I mean, don't expect it to work on labeled data. With KMeans, you calculate the distance to the nearest centroid (cluster center) and add the point to that cluster. What is the 'distance' between apple, banana, and watermelon? It doesn't make sense! So, just make sure you are running your KMeans over numerics.
import numpy as np
import pandas as pd
from pylab import plot,show
from numpy import vstack,array
from scipy.cluster.vq import kmeans,vq
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
import seaborn as sns
df = pd.read_csv('foo.csv')
# get only numeric fields from your dataframe
df = df.sample(frac=0.1, replace=True, random_state=1)
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
newdf = df.select_dtypes(include=numerics)
for col in newdf.columns:
    print(col)
# your independent variables
X = newdf[['NumericField1','NumericField2','NumericField3','list_price']]
# your dependent variable
y = newdf['DependentVariable']
# take all numeric features from the corr exercise, and turn into an array
# so we can feed it into a clustering algorithm
data = np.asarray(newdf)
X = data
# computing K-Means with K = 100 (100 clusters)
centroids,_ = kmeans(data,100)
# assign each sample to a cluster
idx,_ = vq(data,centroids)
# some plotting using numpy's logical indexing
plot(data[idx==0,0],data[idx==0,1],'ob',
data[idx==1,0],data[idx==1,1],'oy',
data[idx==2,0],data[idx==2,1],'or',
data[idx==3,0],data[idx==3,1],'og',
data[idx==4,0],data[idx==4,1],'om')
plot(centroids[:,0],centroids[:,1],'sg',markersize=8)
show()
details = [(name,cluster) for name, cluster in zip(df.brand,idx)]
for detail in details:
    print(detail)
I've found Affinity Propagation to produce much tighter clusters than KMeans can achieve. Here is an example.
from sklearn.cluster import AffinityPropagation
# Run Affinity Propagation Experiment
af = AffinityPropagation(preference=20).fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
n_clusters_ = len(cluster_centers_indices)
print('Estimated number of clusters: %d' % n_clusters_)
# plt.scatter(X[:, 0], X[:, 1], s=50)
# Plot result
import matplotlib.pyplot as plt
from itertools import cycle
plt.close('all')
plt.figure(1)
plt.clf()
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = X[cluster_centers_indices[k]]
    plt.plot(X[class_members, 0], X[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    for x in X[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
Try these concepts and see how you get along.
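On the original request to plug in different algorithms: one possible approach is to pass the estimator in as an argument and rely on the fit_predict method that scikit-learn clusterers share. The sketch below is only an illustration under that assumption; cluster_texts and the tiny document list are hypothetical, and the default TfidfVectorizer stands in for the question's custom tokenizer.

```python
import collections

from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_texts(texts, estimator):
    """Vectorize texts with TF-IDF and group document indices by cluster label."""
    tfidf = TfidfVectorizer().fit_transform(texts)
    # Some estimators (e.g. AgglomerativeClustering) need a dense array
    labels = estimator.fit_predict(tfidf.toarray())
    clustering = collections.defaultdict(list)
    for idx, label in enumerate(labels):
        clustering[int(label)].append(idx)
    return clustering

docs = [
    "apple banana fruit",
    "banana apple smoothie",
    "python code bug",
    "code review python",
]

by_kmeans = cluster_texts(docs, KMeans(n_clusters=2, n_init=10, random_state=0))
by_hierarchy = cluster_texts(docs, AgglomerativeClustering(n_clusters=2))
print(dict(by_kmeans))
print(dict(by_hierarchy))
```

Converting to a dense array is fine for small corpora; for large ones it may be better to keep the sparse matrix and restrict the choice to estimators that accept it.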
I am running K-Means on some statistical data. My matrix size is [192x31634].
K-Means performs well and creates the 7 centroids I want, so my result is [192x7].
As a self-check, I store the index values obtained in the K-Means run in a dictionary.
centroids,idx = runkMeans(X_train, initial_centroids, max_iters)
resultDict.update({'centroid' : centroids})
resultDict.update({'idx' : idx})
Then I test my K-Means on the same Data I used to find the centroids. Strangely my Result differs:
dict= pickle.load(open("MyDictionary.p", "rb"))
currentIdx = findClosestCentroids(X_train, dict['centroid'])
print("idx Differs: ",np.count_nonzero(currentIdx != dict['idx']))
Output:
idx Differs: 189
Can someone explain this difference to me? I turned up the max iterations of the algorithm to 50, which seems to be way too much. @Joe Halliwell pointed out that K-Means is non-deterministic. findClosestCentroids gets called by runkMeans, so I do not see why the two idx results can differ. Thanks for any ideas.
Here is my code:
def findClosestCentroids(X, centroids):
    K = centroids.shape[0]
    m = X.shape[0]
    dist = np.zeros((K,1))
    idx = np.zeros((m,1), dtype=int)
    #number of columns defines my number of data points
    for i in range(m):
        #Every column is one data point
        x = X[i,:]
        #number of rows defines my number of centroids
        for j in range(K):
            #Every row is one centroid
            c = centroids[j,:]
            #distance of the two points c and x
            dist[j] = np.linalg.norm(c-x)
            #if last centroid is processed
            if (j == K-1):
                #the Result idx is set with the index of the centroid with minimal distance
                idx[i] = np.argmin(dist)
    return idx
def runkMeans(X, initial_centroids, max_iters):
    #Initialize values
    m,n = X.shape
    K = initial_centroids.shape[0]
    centroids = initial_centroids
    previous_centroids = centroids
    for i in range(max_iters):
        print("K_Means iteration:",i)
        #For each example in X, assign it to the closest centroid
        idx = findClosestCentroids(X, centroids)
        #Given the memberships, compute new centroids
        centroids = computeCentroids(X, idx, K)
    return centroids,idx
Edit: I turned my max_iters to 60 and get a
idx Differs: 0
Seems that was the problem.
K-means is a non-deterministic algorithm when its initial centroids are chosen at random. One typically controls for this by setting the random seed. For example, scikit-learn's implementation provides the random_state argument for this purpose:
from sklearn.cluster import KMeans
import numpy as np
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
See the documentation at https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
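For the hand-written loop in the question there may also be a deterministic explanation: runkMeans returns idx computed before the last centroid update, so idx only matches findClosestCentroids(X, centroids) once the algorithm has converged, which would fit the observation that raising max_iters made the difference vanish. A small sketch of that effect; compute_centroids is a hypothetical stand-in for the computeCentroids function that the question does not show, and the data is synthetic:

```python
import numpy as np

def find_closest_centroids(X, centroids):
    # Distance from every sample to every centroid, then the closest one
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def compute_centroids(X, idx, K):
    # Hypothetical stand-in: each centroid is the mean of its assigned samples
    return np.array([X[idx == k].mean(axis=0) for k in range(K)])

def run_kmeans(X, initial_centroids, max_iters):
    centroids = initial_centroids
    for _ in range(max_iters):
        idx = find_closest_centroids(X, centroids)             # labels for the OLD centroids
        centroids = compute_centroids(X, idx, len(centroids))  # centroids move afterwards
    return centroids, idx

# Three synthetic blobs; one starting centroid is taken from each blob
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(70, 2)) for c in blob_centers])
initial = X[[0, 70, 140]]

for max_iters in (1, 100):
    centroids, idx = run_kmeans(X, initial, max_iters)
    stale = np.count_nonzero(find_closest_centroids(X, centroids) != idx)
    print(max_iters, "iterations ->", stale, "labels differ")
```

Recomputing idx once more after the loop, or stopping when the centroids no longer move, would make the returned labels consistent with the returned centroids regardless of max_iters.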
I have clustered my data (12000, 3) using sklearn's Gaussian mixture model (GMM) algorithm. I have 3 clusters. Each point of my data represents a molecular structure. I would like to know how I could sample from each cluster. I have tried the function:
gmm = GMM(n_components=3).fit(Data)
gmm.sample(n_samples=20)
but it performs a sampling of the whole distribution, whereas I need a sample from each of the components.
Well this is not that easy since you need to calculate the eigenvectors of all covariance matrices. Here is some example code for a problem I studied
import numpy as np
from scipy.stats import multivariate_normal
import random
from operator import truediv
import itertools
from scipy import linalg
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn import mixture
#import some data which can be used for gmm
mix = np.loadtxt("mixture.txt", usecols=(0,1), unpack=True)
#print(mix.shape)
color_iter = itertools.cycle(['navy', 'c', 'cornflowerblue', 'gold',
'darkorange'])
def plot_results(X, Y_, means, covariances, index, title):
    #function for plotting the gaussians
    splot = plt.subplot(2, 1, 1 + index)
    for i, (mean, covar, color) in enumerate(zip(
            means, covariances, color_iter)):
        v, w = linalg.eigh(covar)
        v = 2. * np.sqrt(2.) * np.sqrt(v)
        u = w[0] / linalg.norm(w[0])
        # as the DP will not use every component it has access to
        # unless it needs it, we shouldn't plot the redundant
        # components.
        if not np.any(Y_ == i):
            continue
        plt.scatter(X[Y_ == i, 0], X[Y_ == i, 1], .8, color=color)
        # Plot an ellipse to show the Gaussian component
        angle = np.arctan(u[1] / u[0])
        angle = 180. * angle / np.pi # convert to degrees
        # pass the angle by keyword so the call works across matplotlib versions
        ell = mpl.patches.Ellipse(mean, v[0], v[1], angle=180. + angle, color=color)
        ell.set_clip_box(splot.bbox)
        ell.set_alpha(0.5)
        splot.add_artist(ell)
    plt.xlim(-4., 3.)
    plt.ylim(-4., 2.)
gmm = mixture.GaussianMixture(n_components=3, covariance_type='full').fit(mix.T)
print(gmm.predict(mix.T))
plot_results(mix.T, gmm.predict(mix.T), gmm.means_, gmm.covariances_, 0,
'Gaussian Mixture')
So for my problem the resulting plot looked like this:
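Edit: here is the answer to your comment. I would use pandas for this. Assume X is your feature matrix (a DataFrame) and y holds your labels (a Series); then
import pandas as pd
# wrap the predictions in a named Series so pd.concat can align them with X and y
y_pred = pd.Series(gmm.predict(X), index=X.index, name='y_pred')
df_all_info = pd.concat([X,y,y_pred], axis=1)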
In the resulting dataframe you can check all the information you want, you can even just exclude the samples the algorithm misclassified with:
df_wrong = df_all_info[df_all_info['name of y-column'] != df_all_info['name of y_pred column']]
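Coming back to the original question of sampling each component separately, two options seem workable with sklearn.mixture.GaussianMixture: draw directly from each fitted Gaussian using means_ and covariances_, or call sample and split the result by the component labels it returns. A minimal sketch on synthetic 3D data (a stand-in for the molecular-structure data, with covariance_type='full' assumed):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 3D data with three well-separated groups
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(200, 3)) for c in (-4.0, 0.0, 4.0)])

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0).fit(data)

# Option 1: draw 20 points from each component's own Gaussian
samples_per_component = {
    k: rng.multivariate_normal(gmm.means_[k], gmm.covariances_[k], size=20)
    for k in range(gmm.n_components)
}

# Option 2: sample the whole mixture, then split by the returned component labels
X_sampled, component_labels = gmm.sample(n_samples=60)
from_component_0 = X_sampled[component_labels == 0]

print({k: v.shape for k, v in samples_per_component.items()})
print(from_component_0.shape)
```

Option 1 gives an exact count per component; option 2 draws components in proportion to their mixture weights, so the per-component counts vary.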
I am learning about Linear Discriminant Analysis and am using the scikit-learn module. I am confused by the "coef_" attribute from the LinearDiscriminantAnalysis class. As far as I understand, these are the discriminant function coefficients (sklearn calls them weight vectors). Since there should be (n_classes-1) discriminant functions, I would expect the coef_ attribute to be an array with shape (n_components, n_features), but instead it prints an (n_classes, n_features) array. Below is an example of this using the Iris dataset example from sklearn. Since there are 3 classes and 2 components, I would expect print(lda.coef_) to give me a 2x4 array instead of a 3x4 array...
Maybe I'm misinterpreting what the weight vectors are, perhaps they are the coefficients for the classification function?
And how do I get the coefficients for each variable in each discriminant/canonical function?
Code here:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import numpy as np
iris = datasets.load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names
lda = LinearDiscriminantAnalysis(n_components=2,store_covariance=True)
X_r = lda.fit(X, y).transform(X)
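plt.figure()
# colors was not defined in the original snippet; any three matplotlib colors work
colors = ['navy', 'turquoise', 'darkorange']
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], alpha=.8, color=color,
                label=target_name)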
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.xlabel('Function 1 (%.2f%%)' %(lda.explained_variance_ratio_[0]*100))
plt.ylabel('Function 2 (%.2f%%)' %(lda.explained_variance_ratio_[1]*100))
plt.title('LDA of IRIS dataset')
print(lda.coef_)
#output -> [[ 6.24621637 12.24610757 -16.83743427 -21.13723331]
# [ -1.51666857 -4.36791652 4.64982565 3.18640594]
# [ -4.72954779 -7.87819105 12.18760862 17.95082737]]
You can calculate the coefficients with the following code:
import numpy as np
import pandas as pd
def LDA_coefficients(X,lda):
    # X must be a pandas DataFrame (its column names are reused below)
    nb_col = X.shape[1]
    # identity rows plus a final all-zero row, pushed through lda.transform
    matrix= np.zeros((nb_col+1,nb_col), dtype=int)
    Z=pd.DataFrame(data=matrix,columns=X.columns)
    for j in range(0,nb_col):
        Z.iloc[j,j] = 1
    LD = lda.transform(Z)
    nb_funct= LD.shape[1]
    results = pd.DataFrame()
    index = ['const']
    for j in range(0,LD.shape[0]-1):
        index = np.append(index,'C'+str(j+1))
    for i in range(0,LD.shape[1]):
        # the all-zero row gives the constant; differences give the coefficients
        coef = [LD[-1][i]]
        for j in range(0,LD.shape[0]-1):
            coef = np.append(coef,LD[j][i]-LD[-1][i])
        result = pd.Series(coef)
        result.index = index
        column_name = 'LD' + str(i+1)
        results[column_name] = result
    return results
Before calling this function, you need to fit the linear discriminant analysis (X should be a pandas DataFrame, since the function reads X.columns):
lda = LinearDiscriminantAnalysis()
lda.fit(X,y)
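As a cross-check, the fitted estimator also exposes scalings_ and xbar_. With the default svd solver, transform appears to be the centered data multiplied by scalings_, so the columns of scalings_ can be read as the per-variable coefficients of each discriminant function, while coef_ holds one weight vector per class for the classification decision function. A minimal sketch on the Iris data, assuming the default solver:

```python
import numpy as np
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = datasets.load_iris()
X, y = iris.data, iris.target

lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)

# One weight vector per class (used by decision_function / predict)
print(lda.coef_.shape)

# One column per discriminant function, one row per original variable
print(lda.scalings_)

# Rebuild the projection by hand from the overall mean and the scalings
manual_projection = (X - lda.xbar_) @ lda.scalings_[:, :2]
print(np.allclose(manual_projection, lda.transform(X)))
```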