Weird clustering output (scikit-learn KMeans) - python

I have an imbalanced dataset with four labels in total. Two of them have a much higher appearance frequency than the other two. I have nearly one million observations.
I'm trying to understand the data components a bit better by exploring with sklearn.cluster.KMeans clustering.
Here's my data:
print(X)
[[68. 0. 0. ... 0. 0. 0.]
[18. 1. 1. ... 1. 0. 0.]
[18. 1. 1. ... 0. 0. 0.]
...
[59. 0. 0. ... 0. 0. 0.]
[48. 1. 0. ... 0. 0. 1.]
[47. 1. 1. ... 0. 0. 0.]]
print(y)
[1 2 3 ... 3 2 3]
The observed labels have four levels (ordinal variables 0 - 3).
Here's my code:
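import matplotlib.pyplot as plt
import mglearn
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Standardize every feature to zero mean and unit variance before clustering
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
algo1 = KMeans(n_clusters = 4)
y_pred = algo1.fit_predict(X_scaled)
# Only the first two features are plotted here
mglearn.discrete_scatter(X_scaled[:, 0], X_scaled[:,1], y_pred)
plt.legend(["cluster 0", "cluster 1", "cluster 2", "cluster 3"], loc = 'best')
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")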
This looks weird, so I tried 3 clusters.
algo2 = KMeans(n_clusters = 3)
y_pred2 = algo2.fit_predict(X_scaled)
mglearn.discrete_scatter(X_scaled[:, 0], X_scaled[:,1], y_pred2)
plt.legend(["cluster 0", "cluster 1", "cluster 2"], loc = 'best')
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
And then 2 clusters
algo3 = KMeans(n_clusters = 2)
y_pred3 = algo3.fit_predict(X_scaled)
mglearn.discrete_scatter(X_scaled[:, 0], X_scaled[:,1], y_pred3)
plt.legend(["cluster 0", "cluster 1"], loc = 'best')
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
I'm trying to figure out what exactly is happening to the clustering. Is there an alternative way that I could better understand the data structure?

You did not mention how many features you have, but by the looks of it, X has way more than just two variables. In your code, you are visualizing the clusters using only two features while algo1 used more variables than that.
In particular, you are visualizing the clusters using Feature 1, which appears to be binary (it takes only two distinct values, even after scaling), so it's not that the clustering is unsuccessful; you are simply visualizing the clusters under a very limited number of features.
By plotting in 2-D, you are limiting yourself to seeing the clusters as a function of only two variables, so there's a chance you'll be missing out on some relationships that are only visible in 3-D or even higher dimensions. If you wish to carry on this way, I recommend plotting Feature 1 against all other features, then Feature 2 against all other features, and so on. This way, you will visualize the clusters under all combinations of size two and perhaps this will help you understand the relationship between certain pairs of features and the cluster they belong to.
Remember also that KMeans is an unsupervised algorithm, so the clusters are not necessarily related to the labels in y. The results simply mean that the observations in each cluster are similar to each other in terms of distance to the centroid.
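To see how the clusters actually line up with the labels in y, a cross-tabulation can be more telling than a 2-D scatter plot. Below is a minimal sketch, assuming integer labels and cluster ids that start at 0; the helper name cluster_label_table and the toy arrays are made up for illustration and are not the asker's data.

```python
import numpy as np

def cluster_label_table(labels, clusters):
    """Count how many observations of each label fall into each cluster.

    Rows are labels, columns are cluster ids (both assumed to be 0-based ints).
    """
    labels = np.asarray(labels)
    clusters = np.asarray(clusters)
    table = np.zeros((labels.max() + 1, clusters.max() + 1), dtype=int)
    # np.add.at accumulates correctly even when the same (label, cluster) pair repeats
    np.add.at(table, (labels, clusters), 1)
    return table

# Toy stand-ins for y and y_pred (hypothetical values)
y_toy = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred_toy = np.array([1, 1, 1, 0, 0, 0, 2, 3])

table = cluster_label_table(y_toy, y_pred_toy)
print(table)
```

With the real data you would pass y and y_pred instead. A row whose counts are spread over several columns means that label is not captured by any single cluster.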

Related

Different results using affinity propagation "precomputed" distance matrix

I am working with two-dimensional data
X = array([[5.40310335, 0. ],
[6.86136114, 6.56225717],
[0. , 0. ],
...,
[5.88838732, 0. ],
[6.0003473 , 0. ],
[6.25971331, 0. ]])
Looking for clusters using Euclidean distance, I run affinity propagation from scikit-learn on this raw data as follows
af = AffinityPropagation(damping=.9, max_iter=300, random_state=0).fit(X)
obtaining as a result 9 clusters.
I understand that when you want to use another distance you have to pass the negative distance matrix and use affinity='precomputed', as follows
af_c = AffinityPropagation(damping=.9, max_iter=300,
affinity='precomputed', random_state=0).fit(distM)
If I use as distM the negative Euclidean distance matrix, calculated as follows
distM = -distance_matrix(X,X)
np.fill_diagonal(distM, np.median(distM))
filling the diagonal with the median, since that is also the method's default preference value.
Using this I am getting 34 clusters as a result and I would expect to have 9 as if working with the default distance. I don't know if I'm interpreting the way of entering the distance matrix correctly or if the library does something different when one uses everything predefined.
I would appreciate any help.
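One possible explanation, going by the scikit-learn documentation: with affinity='euclidean' the similarities are the negative squared Euclidean distances, not the negative plain distances. The sketch below only builds both matrices with NumPy on a made-up X_toy to show that they differ; it does not run AffinityPropagation, and I have not verified that the squared version reproduces the 9 clusters.

```python
import numpy as np

# Hypothetical toy data standing in for X
X_toy = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Pairwise Euclidean distances via broadcasting
diff = X_toy[:, None, :] - X_toy[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

sim_plain = -dist            # what the question passes as distM
sim_squared = -(dist ** 2)   # negative squared distance

print(sim_plain)
print(sim_squared)
```

If this is the cause, passing the squared version with affinity='precomputed' should bring the two runs closer together. As far as I can tell the estimator also sets the diagonal to the preference itself, so filling it by hand may not be needed.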

Clustering with custom distance metric in sklearn

I am trying to implement a custom distance metric for clustering. The code snippet looks like:
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, MeanShift
def distance(x, y):
    # print(x, y) -> This x and y aren't one-hot vectors and is the source of this question
    match_count = 0.
    for xi, yi in zip(x, y):
        if float(xi) == 1. and xi == yi:
            match_count += 1
    return match_count
def custom_metric(x, y):
    # x, y are two vectors
    # distance(.,.) calculates count of elements when both xi and yi are True
    return distance(x, y)
vectorized_text = np.stack([[1, 0, 0, 1] * 100,
                            [1, 1, 1, 0] * 100,
                            [0, 1, 1, 0] * 100,
                            [0, 0, 0, 1] * 100] * 100)
dbscan = DBSCAN(min_samples=2, metric=custom_metric, eps=3, p=1).fit(vectorized_text)
The vectorized_text is a one-hot encoded feature matrix of size n_samples x n_features. But when custom_metric is called, one of x or y turns out to be a real-valued vector while the other remains a one-hot vector. I expected both x and y to be one-hot vectors. This causes custom_metric to return wrong results at run time, so the clustering is not correct.
Example of x and y in distance(x, y) method:
x = [0.5 0.5 0.5 ... 0.5 0.5]
y = [0. 0. 0. 1. 0. 0. ... 1. 0.]
Both should have been one-hot vectors.
Does anyone have an idea to go about this situation?
First of all, your distance is wrong.
Distances must return small values for similar vectors. You have defined a similarity, not a distance.
Secondly, naive Python code such as a zip loop will perform extremely poorly. Python does not optimize such code well; all the work is done in the slow interpreter. Python speed is only okay if you vectorize everything. And in fact, this code can be vectorized trivially, and then it likely won't even matter whether your inputs are binary or float data. What you are computing in a very complicated fashion is nothing but the dot product of two vectors, isn't it?
Thus, your distance should probably look like this:
def distance(x, y):
    return x.shape[0] - np.dot(x, y)
Or whatever distance transformation you intend to use.
Now for your actual problem: my guess is that sklearn tries to accelerate your distance with a ball tree. That won't help much because of the poor performance of Python interpreter callbacks (in fact, you should probably precompute the entire distance matrix in one vectorized operation; for X of shape n_samples x n_features, something like dist = dim - X.dot(X.transpose())? Do the math yourself to figure out the equation). Other languages such as Java (e.g., the ELKI tool) are much better to extend this way, because of the way the HotSpot JIT compiler can optimize and inline such calls everywhere.
To test the hypothesis that the sklearn ball tree is the cause of the odd values you are observing, try setting algorithm="brute" (see the documentation) to disable the ball tree. But in the end, you'll want to either precompute the entire distance matrix (if you can afford the O(n²) cost), or switch to a different programming language (implementing your distance in Cython, for example, helps, but you'll still likely see the data suddenly arriving as numpy float arrays).
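The precomputed matrix mentioned above can be sketched in a few lines of NumPy. This is only an illustration on a made-up 4 x 4 matrix X_bin, using the "dimension minus dot product" transformation from this answer; pick whatever transformation fits your problem.

```python
import numpy as np

# Hypothetical small one-hot matrix, shape (n_samples, n_features)
X_bin = np.array([[1, 0, 0, 1],
                  [1, 1, 1, 0],
                  [0, 1, 1, 0],
                  [0, 0, 0, 1]])

n_features = X_bin.shape[1]
# For every pair of rows, count the positions where both rows are 1
matches = X_bin @ X_bin.T
# Turn the match count (a similarity) into a dissimilarity
dist = n_features - matches
# Note: with this transformation the diagonal is not zero unless a row is all ones
print(dist)
```

A matrix like dist can then be handed to DBSCAN with metric='precomputed', which avoids the per-pair Python callback entirely.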
I don't get your question, if I have:
x = [1, 0, 1]
y = [0, 0, 1]
and I use:
def distance(x, y):
    # print(x, y) -> This x and y aren't one-hot vectors and is the source of this question
    match_count = 0.
    for xi, yi in zip(x, y):
        if float(xi) == 1. and xi == yi:
            match_count += 1
    return match_count
print(distance(x, y))
1.0
and on top if you print x, y now:
x
[1, 0, 1]
y
[0, 0, 1]
so it is working?
I reproduced your code and I do get your error. I explain it better here:
He has a vectorized_text variable (np.stack) which simulates a One Hot Encoded feature set (only contains 0s and 1s). And in the DBSCAN model, he uses a custom_metric function to calculate the distance. It is expected that when the model is run, the custom metric function takes as parameters pairs of observations as they are: One Hot encoded values, but instead when printing those values inside the distance function, only one is taken as it is, and the other one appears to be a list of real values as he described in the question:
x = [0.5 0.5 0.5 ... 0.5 0.5]
y = [0. 0. 0. 1. 0. 0. ... 1. 0.]
Anyway, when I pass lists to the fit parameter, the function obtains the values as they are:
from sklearn.cluster import KMeans, DBSCAN, MeanShift
x = [1, 0, 1]
y = [0, 0, 1]
feature_set = [x*5]*5
def distance(x, y):
    # Printing here the values. Should be 0s and 1s
    print(x, y)
    match_count = 0.
    for xi, yi in zip(x, y):
        if float(xi) == 1. and xi == yi:
            match_count += 1
    return match_count
def custom_metric(x, y):
    # x, y are two vectors
    # distance(.,.) calculates count of elements when both xi and yi are True
    return distance(x, y)
dbscan = DBSCAN(min_samples=2, metric=custom_metric, eps=3, p=1).fit(feature_set)
Result:
[1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1.] ... [1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1.]
[1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1.] ... [1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1.]
I suggest you use a pandas DataFrame or some other input type and see if it works.

How to plot a weighted graph of k-Neighbors in Python

I have created a weighted graph of k-Neighbors using scikit-learn, I'm wondering if there is any way to plot it as a graph.
Here is the result of computation in form of array which I want to plot:
array([[0. , 2.08243189, 0. , 3.42661108],
[2.08243189, 0. , 3.27141008, 0. ],
[0. , 3.27141008, 0. , 1.57294787],
[0. , 3.29779083, 1.57294787, 0. ]])
I just need to get some visualization of data, that's all I need.
More details about the array:
Each row represents a node and each column represents the weight of connectivity of that node with the other nodes.
For example: second column of first row (2.08243189) is the weight of connectivity from first node to second node.
Another example: second row, second column (0): the weight of connectivity from node 2 to itself.
The numbers represent Euclidean distances.
Are you talking about something simple like this where the size of the point gives a visual indication of the relative weight compared to the other values? Assume the array is named ar:
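import matplotlib.pyplot as plt
# One point per (row, column) pair; the marker size grows with the weight
for i in range(len(ar)):
    for j in range(len(ar)):
        v = ar[i, j]
        plt.scatter(i+1, j+1, lw=0, s=10**v)
plt.grid(True)
plt.xlabel('Row')
plt.ylabel('Column')
ticks = list(range(1, 1+len(ar)))
plt.xticks(ticks)
plt.yticks(ticks)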
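If you would rather draw it as an actual node-and-edge graph, the first step is usually to turn the matrix into a weighted edge list. Here is a minimal sketch with NumPy, assuming a zero entry means "no edge" and that nodes are numbered from 1 as in the question; the name edges is just for illustration.

```python
import numpy as np

ar = np.array([[0.        , 2.08243189, 0.        , 3.42661108],
               [2.08243189, 0.        , 3.27141008, 0.        ],
               [0.        , 3.27141008, 0.        , 1.57294787],
               [0.        , 3.29779083, 1.57294787, 0.        ]])

# Indices of all non-zero entries, i.e. the edges that exist (assumption: 0 = no edge)
rows, cols = np.nonzero(ar)
# (source node, target node, weight), with 1-based node numbers
edges = [(int(i) + 1, int(j) + 1, float(ar[i, j])) for i, j in zip(rows, cols)]

for src, dst, weight in edges:
    print(f"{src} -> {dst}: {weight:.3f}")
```

An edge list in this form can be fed to a graph library (networkx, for instance, accepts such triples through add_weighted_edges_from) to get a proper graph drawing.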

Most important original feature(s) of Principal Component Analysis

I am doing PCA and I am interested in which original features were most important. Let me illustrate this with an example:
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[1,-1, -1,-1], [1,-2, -1,-1], [1,-3, -2,-1], [1,1, 1,-1], [1,2,1,-1], [1,3, 2,-0.5]])
print(X)
Which outputs:
[[ 1. -1. -1. -1. ]
[ 1. -2. -1. -1. ]
[ 1. -3. -2. -1. ]
[ 1. 1. 1. -1. ]
[ 1. 2. 1. -1. ]
[ 1. 3. 2. -0.5]]
Intuitively, one could already say that feature 1 and feature 4 are not very important due to their low variance. Let's apply pca on this set:
pca = PCA(n_components=2)
pca.fit_transform(X)
comps = pca.components_
Output:
array([[ 0. , 0.8376103 , 0.54436943, 0.04550712],
[-0. , 0.54564656, -0.8297757 , -0.11722679]])
This output represents the importance of each original feature for each of the two principal components (see this for reference). In other words, for the first principal component, feature 2 is most important, then feature 3. For the second principal component, feature 3 looks most important.
The question is, which feature is most important, which one second most etc? Can I use the components_ attribute for this? Or am I wrong and is PCA not the correct method for doing such analyses (and should I use a feature selection method instead)?
The components_ attribute alone is not the right place to look for feature importance. The loadings in the two arrays (i.e. the two components PC1 and PC2) tell you how each original feature contributes to each component (taken together, they form a rotation matrix). But they don't tell you how much each component contributes to describing the transformed feature space, so you don't know yet how to compare the loadings across the two components.
However, the answer that you linked actually tells you what to use instead: the explained_variance_ratio_ attribute. This attribute tells you how much of the variance in your feature space is explained by each principal component:
In [5]: pca.explained_variance_ratio_
Out[5]: array([ 0.98934303, 0.00757996])
This means that the first principal component explains almost 99 percent of the variance. You know from components_ that PC1 has the highest loading for the second feature. It follows, therefore, that feature 2 is the most important feature in your data space. Feature 3 is the next most important feature, as it has the second highest loading in PC1.
In PC2, the absolute loadings are nearly swapped between feature 2 and feature 3. But as PC2 explains next to nothing of the overall variance, this can be neglected.
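The reasoning above can be turned into a rough ranking by weighting the absolute loadings with the explained variance ratio of their component. The sketch below does this with a plain NumPy SVD on the X from the question rather than with sklearn; the weighting is one possible heuristic, not an established importance measure, and the names weighted and ranking are made up here.

```python
import numpy as np

X = np.array([[1, -1, -1, -1], [1, -2, -1, -1], [1, -3, -2, -1],
              [1, 1, 1, -1], [1, 2, 1, -1], [1, 3, 2, -0.5]])

# PCA via SVD of the centered data: rows of vt are the principal axes
Xc = X - X.mean(axis=0)
_, s, vt = np.linalg.svd(Xc, full_matrices=False)

# Share of the total variance carried by each component
ratio = s ** 2 / np.sum(s ** 2)

# Absolute loadings of the first two components, weighted by that share
weighted = (np.abs(vt[:2]) * ratio[:2, None]).sum(axis=0)

# Feature numbers (1-based) from most to least important under this heuristic
ranking = np.argsort(weighted)[::-1] + 1
print(ratio[:2])
print(ranking)
```

Taking absolute values matters because the sign of a principal axis is arbitrary; only the magnitude of a loading says how strongly a feature enters the component.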

Dendrogram through scipy given a similarity matrix

I have computed a jaccard similarity matrix with Python. I want to cluster highest similarities to lowest, however, no matter what linkage function I use it produces the same dendrogram! I have a feeling that the function assumes that my matrix is of original data, but I have already computed the first similarity matrix. Is there any way to pass this similarity matrix through to the dendrogram so it plots correctly? Or am I going to have to output the matrix and simply do it with R. Passing through the original raw data is not possible, as I am computing similarities of words. Thanks for the help!
Here is some code:
SimMatrix = [[ 0.,0.09259259, 0.125 , 0. , 0.08571429],
[ 0.09259259, 0. , 0.05555556, 0. , 0.05128205],
[ 0.125 , 0.05555556, 0. , 0.03571429, 0.05882353],
[ 0. , 0. , 0.03571429, 0. , 0. ],
[ 0.08571429, 0.05128205, 0.05882353, 0. , 0. ]]
linkage = hcluster.complete(SimMatrix) #doesnt matter what linkage...
dendro = hcluster.dendrogram(linkage) #same plot for all types?
show()
If you run this code, you will see a dendrogram that is completely backwards. No matter what linkage type I use, it produces the same dendrogram. This intuitively can not be correct!
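Here's the solution. It turns out the SimMatrix first needs to be converted into a condensed matrix (the entries above the diagonal, flattened into a vector), and the similarities need to be turned into distances.
You can see this in the code below:
import scipy.cluster.hierarchy as hcluster
import scipy.spatial.distance as ssd
from matplotlib.pyplot import show
# squareform turns the square matrix into its condensed (vector) form
simVec = ssd.squareform(SimMatrix)
# linkage expects distances, so convert similarity to distance
linkage = hcluster.linkage(1 - simVec)
dendro = hcluster.dendrogram(linkage)
show()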
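To make the "condensed" form less mysterious, here is what the conversion amounts to, written out with NumPy only. It is a sketch of the idea, not scipy's implementation; sim_vec and dist_vec are illustrative names.

```python
import numpy as np

SimMatrix = np.array([[0.        , 0.09259259, 0.125     , 0.        , 0.08571429],
                      [0.09259259, 0.        , 0.05555556, 0.        , 0.05128205],
                      [0.125     , 0.05555556, 0.        , 0.03571429, 0.05882353],
                      [0.        , 0.        , 0.03571429, 0.        , 0.        ],
                      [0.08571429, 0.05128205, 0.05882353, 0.        , 0.        ]])

n = SimMatrix.shape[0]
# Indices of the entries strictly above the diagonal, row by row
upper = np.triu_indices(n, k=1)
# Condensed similarity vector: one entry per unordered pair of items
sim_vec = SimMatrix[upper]
# Similarity in [0, 1] turned into a distance
dist_vec = 1.0 - sim_vec

print(len(dist_vec))
print(dist_vec)
```

For n items the vector has n*(n-1)/2 entries, ordered as pairs (0,1), (0,2), ..., (n-2,n-1), which is the layout the linkage function expects when it is given a one-dimensional input.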
