I have a customer data set with about 20-25 attributes about the customer such as:
age
gender_F
gender_M
num_purchases
loyalty_status_new
loyalty_status_intermediate
loyalty_status_advanced
...
I have cleaned my dataset so that it has no null values and have one-hot encoded the categorical variables into a pandas dataframe my_df. I have used scikit-learn's KMeans to create 2 clusters on this dataset, but I would like to understand how to tell which customers were placed in which cluster.
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# scale the features so no single attribute dominates the distance computation
scaler = StandardScaler()
my_df_scaler = scaler.fit_transform(my_df)
# fit k-means with 2 clusters and get a label for each customer
kmeans = KMeans(n_clusters=2)
model = kmeans.fit(my_df_scaler)
preds = model.predict(my_df_scaler)
Basically, I am looking for some help in getting insights like:
Cluster 1 represents people with larger values for age and loyalty_status_new
Thanks in advance!
If you have the cluster label for each customer, you can compute the average of each parameter by cluster and you will have your answer. More generally, you can check the distribution of each parameter within each cluster and compare the clusters.
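For example, here is a minimal sketch, assuming my_df and preds are the dataframe and predicted cluster labels from your code:

import pandas as pd

# attach the cluster labels to the original (unscaled) dataframe
profiled = my_df.copy()
profiled['cluster'] = preds
# mean of every feature per cluster; big differences between the rows
# hint at which attributes drive each cluster
print(profiled.groupby('cluster').mean())

Comparing these per-cluster means against the overall means of my_df is usually enough to write statements like the one you gave for Cluster 1.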
Also, looking at your parameters, you should not keep both gender_M and gender_F, as these features are perfectly correlated (gender_M = 1 - gender_F).
I also see loyalty_status_new, loyalty_status_intermediate and loyalty_status_advanced. If these columns are derived from a continuous (or ordinal) variable, you should keep that single variable rather than three related dummy columns like this.
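For the one-hot columns, one way to avoid this redundancy is to drop one dummy per category at encoding time; a minimal sketch with pandas, assuming raw_df is a hypothetical dataframe that still holds the original categorical columns:

import pandas as pd

# drop_first=True keeps k-1 dummies per categorical column, so gender becomes a single
# 0/1 column instead of the redundant pair gender_F / gender_M
encoded = pd.get_dummies(raw_df, columns=['gender', 'loyalty_status'], drop_first=True)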
Anyway, here are some links that should help with your clustering:
- rfm clustering principles: https://towardsdatascience.com/apply-rfm-principles-to-cluster-customers-with-k-means-fef9bcc9ab16
- go deeper in KMeans understanding: https://towardsdatascience.com/k-means-clustering-8e1e64c1561c
Related
I am working on a 'songs' dataset that has 2973 records and 2973 unique genres. In the end, I want to create a function that takes one genre as input and prints other similar genres.
I thought about doing this by applying label or one-hot encoding and then clustering using K-Means. The ultimate idea is that a function called 'genre_recommender' searches for the input genre within the clusters and prints other values from the same cluster. I have done the encoding and the clustering, but I can't make any progress on the function. How can I do it?
from sklearn.cluster import KMeans

# cluster the scaled features into 4 groups
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(scaled_data)
# As it's difficult to visualise clusters when the data is high-dimensional, we'll use
# boxplots to help us see how the clusters are grouping the samples
df_bonus["cluster"] = cluster_labels
I clustered using k-means with 4 clusters after applying the elbow method. After this point, I am stuck.
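Not a definitive answer, just a minimal sketch of such a lookup, assuming df_bonus also has a 'genre' column with the genre names next to the 'cluster' column added above ('rock' is only a placeholder input):

def genre_recommender(genre, df=df_bonus, n=10):
    # cluster that the requested genre was assigned to
    genre_cluster = df.loc[df['genre'] == genre, 'cluster'].iloc[0]
    # other genres that landed in the same cluster
    similar = df.loc[(df['cluster'] == genre_cluster) & (df['genre'] != genre), 'genre']
    return similar.head(n).tolist()

print(genre_recommender('rock'))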
I have a dataset with thousands of rows. Each row is a person that I need to assign to one of 4 clusters. I know there are many possible ways to do that and to find the best clusters, but in this case I already know the characteristics of each cluster. Generally, with ML, the clusters are found automatically.
For example, imagine that I have 4 columns to look at: money_spending, salary, segment, days_to_buy. Also, I have:
Cluster 1 -> money_spending: 350-700
salary: 700-1000
segment: farmacy
days_to_buy: 12
Cluster 2 -> money_spending: 500-950
salary: 1000-1300
segment: construction material
days_to_buy: 18
Cluster 3 -> money_spending: 900-1400
salary: 1200-2000
segment: supermarket
days_to_buy: 20
Cluster 4 -> money_spending: 250-600
salary: 550-1000
segment: farmacy
days_to_buy: 30
What's the best way to apply this to my dataset? I would use k-nearest neighbors, but I don't know how to use my cluster information.
Can someone help me?
Plus: if I have more columns or more clusters, does the solution still work?
Edit: My original dataset only has the columns. The clusters are known, but they are not in the dataset. The job is exactly to apply this cluster information to the dataset. I don't have any idea how to do that.
You can try the following approach:
Run k-means and find the best number of clusters k using the elbow method or a silhouette graph.
Use the cluster labels as a class.
e.g. if 4 is the optimal number of clusters, then class = 0, 1, 2, 3 (which will be the cluster labels)
Merge the class with the original dataset and treat it as a supervised learning problem.
Try running any classification model after the train/test split.
See the classification report to check model performance (a rough sketch of this pipeline follows after the PS).
PS
Try normalizing the data too, as many clustering algorithms are sensitive to feature scale and outliers.
Please check that the classes are somewhat equally distributed, e.g. 1000, 800, 1150, 890 and not 1500, 80, 150, etc., as a strong imbalance will hurt the classifiers.
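Here is that rough sketch (X stands for your feature dataframe with money_spending, salary, the encoded segment and days_to_buy; the RandomForestClassifier is just one possible choice of model):

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# X is assumed to be the numeric feature dataframe described in the question
X_scaled = StandardScaler().fit_transform(X)

# cluster and use the labels as the class
y = KMeans(n_clusters=4, random_state=0).fit_predict(X_scaled)

# treat it as a supervised learning problem
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))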
I was recently introduced to clustering techniques because I was given the task of finding "profiles" or "patterns" of professors at my university based on a survey they had to answer. I've been studying some of the available options to perform this and I came across the k-means clustering algorithm. Since most of my data is categorical, I had to perform one-hot encoding (transforming each categorical variable into 0-1 single-column vectors), and right after that I did a correlation analysis in Excel in order to exclude some redundant variables. After this I used Python with the pandas, numpy, matplotlib and sklearn libraries to check the optimal cluster number (elbow method) and then, finally, run k-means.
This is the code I used to import the .csv with the data from the professors' survey and to run the elbow method:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# loads the .csv dataframe (DF)
df = pd.read_csv('./dados_selecionados.csv', sep=",")
# prints the df
print(df)

# list for the sum of squared distances
SQD = []
# maximum cluster number for testing in the elbow method
num_clusters = 10
# runs k-means for each cluster number
for k in range(1, num_clusters+1):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(df)
    SQD.append(kmeans.inertia_)

# sets up the plot and shows it
plt.figure(figsize=(16, 8))
plt.plot(range(1, num_clusters+1), SQD, 'bx-')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distances from each point to its cluster center')
plt.title('Elbow method')
plt.show()
According to the figure I decided to go with 3 clusters. After that I ran k-means for 3 clusters and sent the cluster data to a .xlsx file with the following code:
# runs k-means and gets the cluster label for each row
# (fit_predict fits the model and returns the labels, so a separate fit call is not needed)
kmeans = KMeans(n_clusters=3, max_iter=100, verbose=2)
clusters = kmeans.fit_predict(df)

# adds the cluster information as a column in the df
df['cluster'] = clusters

# saves the df as a .xlsx
df.to_excel("3_clusters_k_means_selecionado.xlsx")

# shows the resulting df
print(df)

# shows the first rows of each separate cluster (iterate over the unique labels,
# not the full label array, to avoid printing the same cluster repeatedly)
for c in sorted(set(clusters)):
    print(df[df['cluster'] == c].head(10))
My main doubt right now is how to perform a reasonable analysis of each cluster's data to understand how the clusters were formed. I began by taking means of each variable and using conditional formatting in Excel to see if some patterns would show up, and they kind of did actually, but I think this is not the best option.
And I'm also going to use this post to ask for any recommendations on the whole method. Maybe some of the steps I took were not the best.
If you're using scikit-learn's KMeans, there is a parameter called n_init, which is the number of times the k-means algorithm will run with different centroid seeds. By default it is set to 10, so essentially it does 10 different runs and outputs the single result with the lowest sum of squares. Another parameter you could experiment with is random_state, which is the seed used to initialize the centroids randomly. This gives you better reproducibility because you choose the seed, so if you see an optimal result you know which seed corresponds to it.
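For example (df being the survey dataframe from the question; 25 restarts and seed 42 are arbitrary choices):

from sklearn.cluster import KMeans

# restart k-means 25 times from different centroid seeds and keep the best run by inertia;
# fixing random_state makes the whole procedure reproducible
kmeans = KMeans(n_clusters=3, n_init=25, random_state=42)
labels = kmeans.fit_predict(df)
print(kmeans.inertia_)  # sum of squared distances of the selected run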
You may want to consider testing several different clustering algos. Here is a list of some of the popular ones.
https://scikit-learn.org/stable/modules/clustering.html
I think there are over 100 different clustering algos out there now.
Also, some clustering algos will automatically select the optimal number of clusters for you, so you don't have to 'guess'. I say guess, because the silhouette and elbow techniques will help quantify the K number for you, but you, yourself, still need to do some kind of guess-work.
I am using scikit-learn's FeatureAgglomeration to apply a hierarchical clustering procedure to the features rather than to the observations.
This is my code:
from sklearn import cluster
import pandas as pd

# load the data
df = pd.read_csv('C:/Documents/data.csv')

# group the 15 original features into 5 clusters of features
agglo = cluster.FeatureAgglomeration(n_clusters=5)
agglo.fit(df)
df_reduced = agglo.transform(df)
My original df had the shape (990, 15); after using feature agglomeration, df_reduced now has shape (990, 5).
How do I now find out how the original 15 features have been clustered together? In other words, which original features from df make up each of the 5 new features in df_reduced?
How the features within each of the clusters are combined during transform is determined by how you set up the hierarchical clustering. The reduced feature set simply consists of the n_clusters cluster centers, which are n_samples-dimensional vectors. For certain applications you might compute the centers manually using a different definition of the cluster center (e.g. the median instead of the mean, to reduce the influence of outliers).
import numpy as np

n_clusters = 5  # must match the FeatureAgglomeration setting
feature_identifier = np.arange(15)  # n_features = 15
# original feature indices assigned to each cluster
feature_groups = [feature_identifier[agglo.labels_ == i] for i in range(n_clusters)]
# each reduced feature is the per-sample mean over its group of columns (axis=1)
new_features = [df.iloc[:, group].mean(axis=1) for group in feature_groups]
Don't forget to standardize the features beforehand (for example using sklearn's StandardScaler). Otherwise you end up grouping the quantities by their scales rather than clustering similar behavior.
Hope that helps!
Haven't tested the code. Let me know if there are problems.
After fitting the clusterer, agglo.labels_ contains an array that tells, for each feature in the original dataset, which cluster in the reduced dataset it belongs to.
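For example, a quick way to see the grouping by column name (reusing the df and agglo objects from the question):

# list which original columns were merged into each of the 5 reduced features
for i in range(agglo.n_clusters):
    print(i, list(df.columns[agglo.labels_ == i]))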
I have a set of data with known labels. I want to try clustering and see if I can get the same clusters given by known labels. To measure the accuracy, I need to get something like a confusion matrix.
I know I can get a confusion matrix easily for a test set of a classification problem. I already tried that like this.
However, it can't be used for clustering as it expected both columns and rows to have the same set of labels, which makes sense for a classification problem. But for a clustering problem what I expect is something like this.
Rows - Actual labels
Columns - New cluster names (i.e. cluster-1, cluster-2 etc.)
Is there a way to do this?
Edit: Here are more details.
In sklearn.metrics.confusion_matrix, it expects y_test and y_pred to have the same values, and labels to be the labels of those values.
That's why it gives a matrix which has the same labels for both rows and columns like this.
But in my case (k-means clustering), the real values are strings and the estimated values are numbers (i.e. cluster numbers).
Therefore, if I call confusion_matrix(y_true, y_pred) it gives below error.
ValueError: Mix of label input types (string and number)
This is the real problem. For a classification problem, this makes sense. But for a clustering problem, this restriction shouldn't be there, because real label names and new cluster names don't need to be the same.
With this, I understand I'm trying to use a tool that is meant for classification problems on a clustering problem. So, my question is: is there a way I can get such a matrix for my clustered data?
Hope the question is now clearer. Please let me know if it isn't.
I wrote some code myself.
# Compute confusion matrix
# (assumes the predicted cluster labels are integers 0..k-1, so they can index the columns directly)
def confusion_matrix(act_labels, pred_labels):
    uniqueLabels = list(set(act_labels))
    clusters = list(set(pred_labels))
    cm = [[0 for _ in range(len(clusters))] for _ in range(len(uniqueLabels))]
    for i, act_label in enumerate(uniqueLabels):
        for j, pred_label in enumerate(pred_labels):
            if act_labels[j] == act_label:
                cm[i][pred_label] = cm[i][pred_label] + 1
    return cm
# Example
labels=['a','b','c',
'a','b','c',
'a','b','c',
'a','b','c']
pred=[ 1,1,2,
0,1,2,
1,1,1,
0,1,2]
cnf_matrix = confusion_matrix(labels, pred)
print('\n'.join([''.join(['{:4}'.format(item) for item in row])
for row in cnf_matrix]))
Edit:
(Dayyyuumm) just found that I could do this easily with Pandas Crosstab :-/.
labels=['a','b','c',
'a','b','c',
'a','b','c',
'a','b','c']
pred=[ 1,1,2,
0,1,2,
1,1,1,
0,1,2]
import pandas as pd

# Create a DataFrame with labels and clusters as columns: df
df = pd.DataFrame({'Labels': labels, 'Clusters': pred})
# Create crosstab: ct
ct = pd.crosstab(df['Labels'], df['Clusters'])
# Display ct
print(ct)
You can easily compute a pairwise intersection matrix. But it may be necessary to do this yourself, since the sklearn confusion_matrix function is designed for the classification use case.
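If you prefer not to build it by hand, scikit-learn also exposes such a pairwise intersection matrix as sklearn.metrics.cluster.contingency_matrix; a small sketch reusing the labels/pred example above:

from sklearn.metrics.cluster import contingency_matrix

labels = ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c']
pred = [1, 1, 2, 0, 1, 2, 1, 1, 1, 0, 1, 2]
# rows follow the sorted true labels, columns the sorted cluster labels
print(contingency_matrix(labels, pred))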