kmeans clustering with dataframe (scipy)

kmeans clustering with dataframe (scipy) - python

I would like to run kmeans clustering with more than 3 features. I've tried with two features and wondering how to provide more than 3 features to sklearn.cluster KMeans.
Here's my code and dataframe that I'd like to select features to run. I have multiple dataframes as an input and I have to provide them as features.
# currently two features are selected
# I'd like to combine more than 3 features and provide them to dataset
df_features = pd.merge(df_max[['id', 'max']],
df_var[['id', 'variance']], on='id', how='left')
cols = list(df_features.loc[:,'max':'variance'])
X = df_features.as_matrix(columns=cols)
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
centroid = kmeans.cluster_centers_
labels = kmeans.labels_
colors = ["g.","r.","c."]
for i in range(len(X)):
print ("coordinate:" , X[i], "label:", labels[i])
plt.plot(X[i][0],X[i][1],colors[labels[i]],markersize=10)
plt.scatter(centroid[:,0],centroid[:,1], marker = "x", s=150, linewidths = 5, zorder =10)
plt.show()

Generally you wouldn't want id to be a feature, because, unless you have good reason to believe otherwise, they do not correlate with anything.
As long as you feed in a valid matrix X at kmeans.fit(X), it will run KMean algorithm for you regardless of number of features in X. Though, if you have a huge amount of features, it may take longer to finish.
The problem is then how to construct X. As you have shown in your example, you can simply merge dataframes, select the wanted columns, and extract feature matrix with a .as_matrix() call. If you have more dataframes and columns, I guess you just merge more and select more.
Feature selection and dimensional reduction may come in handy once you have more than enough features in your dataset. Read more about them when you have time.
P.S. Why scipy in the title?

Related

T-SNE for better data visualization

My dataset shape is (248857, 11)
This is how it looks like before StandartScaler. I performed clustering analysis because of those clustering algorithms such as K-means do need feature scaling before they are fed to the algo.
After
I performed K-Means with three clusters and I am trying to find a way to show these clusters.
I found T-SNE as a solution but I am stuck.
This is how I implemented it:
# save the clusters into a variable l.
l = df_scale['clusters']
d = df_scale.drop("clusters", axis = 1)
standardized_data = StandardScaler().fit_transform(d)
# TSNE Picking the top 100000points as TSNE
data_points = standardized_data[0:100000, :]
labels_80 = l[0:100000]
model = TSNE(n_components = 2, random_state = 0)
tsne_data = model.fit_transform(data_points)
# creating a new data frame which help us in ploting the result data
tsne_data = np.vstack((tsne_data.T, labels_80)).T
tsne_df = pd.DataFrame(data = tsne_data,
columns =("Dimension1", "Dimension2", "Clusters"))
# Ploting the result of tsne
sns.FacetGrid(tsne_df, hue ="Clusters", size = 6).map(
plt.scatter, 'Dimension1', 'Dimension2').add_legend()
plt.show()
As you see, it is not that good. How to visualize this better?

It seems you need to tune the perplexity hyper-parameter which is:
a tunable parameter that says (loosely) how to balance attention between local and global aspects of your data. The parameter is, in a sense, a guess about the number of close neighbors each point has. The perplexity value has a complex effect on the resulting pictures.
Read more about it in this post and more specifically, here.

how to save separated parts (that come from k-means clustering) as different Dataframes

I used k-means clustering for a Dataframe that include salesNumber and Products. I used 3 means clustering. I can separate them and find centroids. But I also want to save my new 3 different divided datas as 3 different Dataframes. How can I do that?
df dataframe have 2 columns named SalesNumbers and ProductTypes.
kmeans = KMeans(n_clusters=3).fit(df)
centroidsProduct = kmeans.cluster_centers_
print(centroidsProduct)

Use labels_:
df['ClusterID'] = kmeans.labels_

Feature agglomeration: How to retrieve the features that make up the clusters?

I am using the scikit-learn's feature agglomeration to use a hierarchical clustering procedure on features rather than on the observations.
This is my code:
from sklearn import cluster
import pandas as pd
#load the data
df = pd.read_csv('C:/Documents/data.csv')
agglo = cluster.FeatureAgglomeration(n_clusters=5)
agglo.fit(df)
df_reduced = agglo.transform(df)
My original df had the shape (990, 15), after using feature agglomeration, df_reduced now has (990, 5).
How do now find out how the original 15 features have been clustered together? In other words, what original features from df make up each of the 5 new features in df_reduced?

The way how the features within each of the clusters are combined during transform is set by the way you perform the hierarchical clustering. The reduced feature set simply consists of the n_clusters cluster centers (which are n_samples - dimensional vectors). For certain applications you might think of computing centers manually using different definitions of cluster centers (i.e. median instead of mean to avoid the influence of outliers etc.).
n_features = 15
feature_identifier = range(n_features)
feature_groups = [np.array(feature_identifier )[agglo.labels_==i] for i in range(n_clusters)]
new_features = [df.loc[:,df.keys()[group]].mean(0) for group in feature_groups]
Don't forget to standardize the features beforehand (for example using sklearn's scaler). Otherwise you are rather grouping the scales of the quantities than clustering similar behavior.
Hope that helps!
Haven't tested the code. Let me know if there are problems.

After fitting the clusterer, agglo.labels_ contains a list that tells in which cluster in the reduced dataset each feature in the original dataset belongs.

Confusion matrix for Clustering in scikit-learn

I have a set of data with known labels. I want to try clustering and see if I can get the same clusters given by known labels. To measure the accuracy, I need to get something like a confusion matrix.
I know I can get a confusion matrix easily for a test set of a classification problem. I already tried that like this.
However, it can't be used for clustering as it expected both columns and rows to have the same set of labels, which makes sense for a classification problem. But for a clustering problem what I expect is something like this.
Rows - Actual labels
Columns - New cluster names (i.e. cluster-1, cluster-2 etc.)
Is there a way to do this?
Edit: Here are more details.
In sklearn.metrics.confusion_matrix, it expects y_test and y_pred to have the same values, and labels to be the labels of those values.
That's why it gives a matrix which has the same labels for both rows and columns like this.
But in my case (KMeans Clustering), the real values are Strings and estimated values are numbers (i.e. cluster number)
Therefore, if I call confusion_matrix(y_true, y_pred) it gives below error.
ValueError: Mix of label input types (string and number)
This is the real problem. For a classification problem, this makes sense. But for a clustering problem, this restriction shouldn't be there, because real label names and new cluster names don't need to be the same.
With this, I understand I'm trying to use a tool, which is supposed to be used for classification problems, for a clustering problem. So, my question is, is there a way I can get such a matrix for may clustered data.
Hope the question is now clearer. Please let me know if it isn't.

I wrote a code myself.
# Compute confusion matrix
def confusion_matrix(act_labels, pred_labels):
uniqueLabels = list(set(act_labels))
clusters = list(set(pred_labels))
cm = [[0 for i in range(len(clusters))] for i in range(len(uniqueLabels))]
for i, act_label in enumerate(uniqueLabels):
for j, pred_label in enumerate(pred_labels):
if act_labels[j] == act_label:
cm[i][pred_label] = cm[i][pred_label] + 1
return cm
# Example
labels=['a','b','c',
'a','b','c',
'a','b','c',
'a','b','c']
pred=[ 1,1,2,
0,1,2,
1,1,1,
0,1,2]
cnf_matrix = confusion_matrix(labels, pred)
print('\n'.join([''.join(['{:4}'.format(item) for item in row])
for row in cnf_matrix]))
Edit:
(Dayyyuumm) just found that I could do this easily with Pandas Crosstab :-/.
labels=['a','b','c',
'a','b','c',
'a','b','c',
'a','b','c']
pred=[ 1,1,2,
0,1,2,
1,1,1,
0,1,2]
# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'Labels': labels, 'Clusters': pred})
# Create crosstab: ct
ct = pd.crosstab(df['Labels'], df['Clusters'])
# Display ct
print(ct)

You can easily compute a pairwise intersection matrix.
But it may be necessary to do this yourself, if the sklearn library has been optimized for the classification use case.

Does sklearn silhouette_samples depend on the order of the data?

I’ve been trying to determine the silhouette scores for each sample in a data set, which contains two different classes. However, the distribution and sample values change depending on how I’ve sorted my data ahead of time. For example, if I sort my dataframe by the class labels (0 & 1) in ascending vs descending order prior to calling silhouette_samples(), the silhouette scores change.
Can someone help me figure out what’s going on? I’d like to know
Is there a bug in my code that I’m not aware of?
Is this normal behavior of the sklearn silhouette_samples function that I’m
ignorant of?
Or is this a bug in the sklearn silhouette_samples?
The effect occurs with the following code:
import pandas as pd
from sklearn.metrics import silhouette_samples
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
df_myfeatures #data frame containing features and class labels
'''data frame sorted by output labels in ascending order'''
df1 = df_myfeatures.copy().sort_values(['output_label'], ascending = True)
'''data frame sorted by output labels in descending order'''
df2 = df_myfeatures.copy().sort_values(['output_label'], ascending = False)
'''
standardize the features ahead of time since they’re on different scales
the X matrix has 26k rows (observations) and 9 columns (features)
'''
standard_scaler = StandardScaler()
X1 = standard_scaler.fit_transform(df1[cols]) #cols is just a list of columns for fitting
X2 = standard_scaler.fit_transform(df2[cols])
y1 = df1['output_label']
y2 = df2['output_label']
'''find the silhouette scores'''
ss1 = silhouette_samples(X1,y1)
ss2 = silhouette_samples(X2,y2)
'''plot the distribution'''
plt.hist(ss1, bins = np.linspace(-1,1,21), alpha = 0.3, label = 'sorted ascending')
plt.hist(ss2, bins = np.linspace(-1,1,21), alpha = 0.3, label = 'sorted descending')
plt.legend()
plt.title('distribution of silhouette scores')
Which generates the following distributions of scores:
histogram of silhouette scores
As you can see, the distribution of scores changes depending on the order of data. I’ve verified that there’s not an issue with standard scaler producing different results for the data depending on the order, and that there’s not an issue with the pandas sorting the data and somehow messing up the alignment of different rows across columns. I'm completely stumped, as far as I was aware, there shouldn't be any order effects in the calculation of silhouette scores.
Please, help me understand this behavior! Thanks!
Note: I’m running on a Windows 10 machine, using Anaconda 4.0 (64 bit), Python v3.5.1, sklearn v0.17.1 and Pandas v0.18.0

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.