I have a dataset with thousands of rows. Each row is a person that I need to assign to one of 4 clusters. I know there are many ways to do that and to find the best clusters automatically, but in this case I already know the characteristics of each cluster. Generally, with ML, the clusters are found by the algorithm itself.
For example, imagine that I have 4 columns to look at: money_spending, salary, segment, days_to_buy. Also, I have:
Cluster 1 -> money_spending: 350-700, salary: 700-1000, segment: pharmacy, days_to_buy: 12
Cluster 2 -> money_spending: 500-950, salary: 1000-1300, segment: construction material, days_to_buy: 18
Cluster 3 -> money_spending: 900-1400, salary: 1200-2000, segment: supermarket, days_to_buy: 20
Cluster 4 -> money_spending: 250-600, salary: 550-1000, segment: pharmacy, days_to_buy: 30
What's the best way to apply this to my dataset? I would use k-nearest neighbors, but I don't know how to use my cluster information with it.
Can someone help me?
Plus: if I have more columns or more clusters, does the solution keep working?
Edit: my original dataset only has the columns. The clusters are known, but they are not in the dataset. The job is exactly to apply this cluster information to the dataset. I don't have any idea how to do that.
You can try the following approach:
Run k-means and find the best number of clusters k using the elbow method or a silhouette plot.
Use the cluster labels as a class.
e.g. if 4 is the optimal number of clusters, then class = 0, 1, 2, 3 (which will be the cluster labels).
Merge the class with the original dataset and treat it as a supervised learning problem.
Try running any classification model after the train/test split.
Check the classification report to evaluate model performance (a sketch of the whole pipeline follows below).
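A minimal sketch of these steps, assuming the columns from the question (the file name and classifier choice are placeholders, and the categorical segment column would need one-hot encoding first):

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

df = pd.read_csv("customers.csv")  # hypothetical file name
# numeric columns from the question; 'segment' would need one-hot encoding
X = StandardScaler().fit_transform(df[["money_spending", "salary", "days_to_buy"]])

# steps 1-3: cluster, then use the labels as a class
labels = KMeans(n_clusters=4, random_state=0).fit_predict(X)
df["class"] = labels

# steps 4-5: train/test split, fit a classifier, inspect the report
X_train, X_test, y_train, y_test = train_test_split(X, labels, stratify=labels, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))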
PS
Try normalizing the data too, as many clustering algorithms are sensitive to feature scales and outliers.
Check that the classes are roughly equally distributed, e.g. 1000, 800, 1150, 890 rather than 1500, 80, 150, ..., as a skewed split will create class imbalance for the classifiers.
Related
I am clustering a data set in python using kmeans. Before I clustered the data set, I determined the optimal number of clusters using an elbow curve.
The optimal number of clusters was 5. So after kmeans clustered the dataset, I had 5 different clusters.
So here's my question. Now that I have 5 different clusters, I would like to cluster each of those 5 clusters again to get smaller clusters, then cluster those smaller clusters again, and repeat until I have only about 20 points in each cluster. The dataset has 1,000,000+ observations.
What is the best way to do this? Is there a way to build a clustering loop? Is there a completely different better way to do this? I know this isn’t a specific coding question, but I’d love to hear some thoughts.
I'm going to provide a rough sketch, since you didn't provide any details about your code (which you should, by the way):
from sklearn.cluster import KMeans

def cluster_until_20(data):
    # base case: this cluster is already small enough
    if len(data) <= 20:
        return [data]
    # split the current data into 5 sub-clusters with k-means
    labels = KMeans(n_clusters=5).fit_predict(data)
    clusters = [data[labels == k] for k in range(5)]
    # recurse into each sub-cluster and flatten the results
    return [sub for c in clusters for sub in cluster_until_20(c)]
The key is using recursion with a list comprehension that goes "deeper" as long as a cluster still has more than 20 points.
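For example, assuming the observations are in a NumPy array (the shape here is just a placeholder):

import numpy as np

X = np.random.rand(10_000, 8)  # stand-in for your 1,000,000+ observations
small_clusters = cluster_until_20(X)
print(len(small_clusters), "clusters of at most 20 points each")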
I have a dataset with 28000 records. The data is of an e-commerce store menu items. The challenge is the following:
Multiple stores have similar products but with different names. For example, 'HP laptop 1102' is present in different stores as 'HP laptop 1102', 'Hewlett-Packard laptop 1102', 'HP notebook 1102' and many other different names.
I have opted to convert the product names to TF-IDF vectors and use KMeans clustering to group similar products together. I am also using some other features like product category, sub-category, etc. (I have one-hot encoded all the categorical features.)
Now my challenge is to estimate the optimal n_clusters for the KMeans algorithm. As the clustering should occur at the product level, I'm assuming I need a high n_clusters value. Is there an upper limit for n_clusters?
Also any suggestions and advice on the solution approach would be really helpful.
Thanks in advance.
You are optimising for k, so you could try an approach similar to this one here: how do I cluster a list of geographic points by distance?
As for max k, you can only ever have as many clusters as you have data points, so try using that as your upper bound.
The upper limit is the number of data points, but you almost surely want a number a good bit lower for clustering to provide any value. If you have 10,000 products I would think 5,000 clusters would be a rough maximum from a usefulness standpoint.
You can use the silhouette score and inertia metrics to help determine the optimal number of clusters.
The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of....
The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. - from the scikit-learn docs
inertia_ is an attribute of a fitted clustering object in scikit-learn - not a separate evaluation metric.
It is the "Sum of squared distances of samples to their closest cluster center." - see the KMeans clustering docs in scikit-learn, for example.
Note that inertia increases as you add more clusters, so you may want to use an elbow plot to visualize where the change becomes minimal.
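As a rough sketch (X here is a random placeholder for your feature matrix), both metrics can be computed in one loop:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(500, 10)  # stand-in for your TF-IDF / one-hot features

for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    # higher silhouette is better; inertia always shrinks as k grows
    print(k, silhouette_score(X, km.labels_), km.inertia_)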
I was recently introduced to clustering techniques because I was given the task of finding "profiles" or "patterns" of professors at my university based on a survey they had to answer. I've been studying some of the available options to perform this and came across the k-means clustering algorithm. Since most of my data is categorical, I had to perform one-hot encoding (transforming each categorical variable into 0-1 single-column vectors) and right after that I did a correlation analysis in Excel to exclude some redundant variables. After this I used Python with the pandas, numpy, matplotlib and sklearn libraries to check the optimal cluster number (elbow method) and then, finally, run k-means.
This is the code I used to import the .csv with the data from the professors survey and to run the elbow method:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# load the .csv into a dataframe (df)
df = pd.read_csv('./dados_selecionados.csv', sep=",")
# print the df
print(df)

# list for the sums of squared distances
SQD = []
# maximum number of clusters to test in the elbow method
num_clusters = 10
# run k-means for each cluster number
for k in range(1, num_clusters + 1):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(df)
    SQD.append(kmeans.inertia_)

# set up the plot and show it
plt.figure(figsize=(16, 8))
plt.plot(range(1, num_clusters + 1), SQD, 'bx-')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distances from each point to its cluster center')
plt.title('Elbow method')
plt.show()
According to the figure, I decided to go with 3 clusters. After that I ran k-means with 3 clusters and exported the cluster data to a .xlsx with the following code:
# run k-means with 3 clusters and get the label for each row
kmeans = KMeans(n_clusters=3, max_iter=100, verbose=2)
clusters = kmeans.fit_predict(df)

# add the cluster label as a column in the df
df['cluster'] = clusters

# save the df as a .xlsx
df.to_excel("3_clusters_k_means_selecionado.xlsx")
# show the resulting df
print(df)

# show the first rows of each separate cluster
for c in sorted(df['cluster'].unique()):
    print(df[df['cluster'] == c].head(10))
My main doubt right now is how to perform a reasonable analysis of each cluster's data to understand how the clusters were formed. I began computing means for each variable and using conditional formatting in Excel to see if patterns would show up, and they kind of did, but I think this is not the best option.
And I'm also going to use this post to ask for any recommendations on the whole method. Maybe some of the steps I took were not the best.
If you're using scikit-learn's KMeans, there is a parameter called n_init, which is the number of times the k-means algorithm will run with different centroid seeds. By default it is set to 10, so essentially it does 10 different runs and outputs the single result with the lowest sum of squares. Another parameter you could play with is random_state, a seed number used to initialize the centroids randomly. This gives you better reproducibility because you choose the seed number, so if you see an optimal result you know which seed corresponds to it.
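A minimal illustration of those two parameters (the values here are arbitrary):

from sklearn.cluster import KMeans

# n_init=10: run k-means 10 times with different centroid seeds and keep
# the run with the lowest inertia; random_state pins the seeding so the
# same result can be reproduced later
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)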
You may want to consider testing several different clustering algos. Here is a list of some of the popular ones.
https://scikit-learn.org/stable/modules/clustering.html
I think there are over 100 different clustering algos out there now.
Also, some clustering algos will automatically select the optimal number of clusters for you, so you don't have to 'guess'. I say guess, because the silhouette and elbow techniques will help quantify the K number for you, but you, yourself, still need to do some kind of guess-work.
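For example, DBSCAN infers the number of clusters from the data's density rather than taking k as input (a sketch; eps and min_samples would need tuning for real data):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(500, 10)  # placeholder features
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(len(set(labels) - {-1}), "clusters found (-1 marks noise points)")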
I have a customer data set with about 20-25 attributes about the customer such as:
age
gender_F
gender_M
num_purchases
loyalty_status_new
loyalty_status_intermediate
loyalty_status_advanced
...
I have cleaned my dataset so it has no null values and have one-hot encoded the categorical variables into a pandas dataframe my_df. I have used scikit-learn's KMeans to create 2 clusters on this dataset, but I would like to understand which customers were assigned to which cluster.
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

scaler = StandardScaler()
my_df_scaler = scaler.fit_transform(my_df)  # scale so no feature dominates
kmeans = KMeans(2)
model = kmeans.fit(my_df_scaler)
preds = model.predict(my_df_scaler)  # cluster label for each customer
Basically, I am looking for some help in getting insights like:
Cluster 1 represents people with larger values for age and loyalty_status_new
Thanks in advance!
If you have the cluster label for each customer, you can compute the average of each parameter by cluster and you will have your answer. More generally, you can check the distribution of each parameter within each cluster and compare them between clusters.
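For instance, a short sketch assuming my_df and preds from the question:

import pandas as pd

profiled = my_df.assign(cluster=preds)
# mean of every attribute per cluster; attributes that deviate most from
# the overall mean are what characterize each cluster
print(profiled.groupby("cluster").mean())
print(my_df.mean())  # overall means, for comparison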
That said, looking at your parameters, you should not include both Gender_M and Gender_F, since these features are perfectly correlated (Gender_M = 1 - Gender_F).
I also see loyalty status new, intermediate and advanced... If these parameters are computed from a continuous variable, you should keep the continuous variable rather than three related dummies like this.
Anyway, here are some links that should help you with your clustering:
- rfm clustering principles: https://towardsdatascience.com/apply-rfm-principles-to-cluster-customers-with-k-means-fef9bcc9ab16
- go deeper in KMeans understanding: https://towardsdatascience.com/k-means-clustering-8e1e64c1561c
Suppose I have 20,000 features on a map, and each feature has many attributes (as well as latitude and longitude). One of the attributes is called population.
I want to split these 20,000 features into 3 clusters where the total sum of population in each cluster equals a specific value, 90,000, and the features in each cluster should be near each other (i.e., locations are taken into consideration).
So, the output clusters should satisfy the following conditions:
Sum(population) of all points/items/features in cluster 1=90,000
Sum(population) of all points/items/features in cluster 2=90,000
Sum(population) of all points/items/features in cluster 3=90,000
I tried to use the k-means clustering algorithm, which gave me 3 clusters, but how do I enforce the above constraint (the sum of population should equal 90,000)?
Any idea is appreciated.
A turnkey solution will not work for you.
You'll have to formulate this as a standard constrained optimization problem and run a solver to optimize it. It's fairly straightforward: take the k-means objective and add your constraints...
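For illustration, here is a rough sketch of the assignment step as an integer program using PuLP (the library choice, the random data, and the 5% tolerance are all assumptions; a full solution would alternate this step with re-computing the centers):

import numpy as np
import pulp

rng = np.random.default_rng(0)
points = rng.random((200, 2))                       # placeholder coordinates
population = rng.integers(100, 2000, size=200).astype(float)
population *= 270000 / population.sum()             # make 3 x 90,000 feasible
centers = points[rng.choice(len(points), 3, replace=False)]

# squared distance of every point to every center: the k-means objective
cost = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

prob = pulp.LpProblem("balanced_clusters", pulp.LpMinimize)
x = pulp.LpVariable.dicts(
    "x", [(i, j) for i in range(len(points)) for j in range(3)], cat="Binary"
)
prob += pulp.lpSum(cost[i, j] * x[i, j] for i in range(len(points)) for j in range(3))
for i in range(len(points)):      # each point belongs to exactly one cluster
    prob += pulp.lpSum(x[i, j] for j in range(3)) == 1
tol = 0.05 * 90000                # small tolerance keeps the problem feasible
for j in range(3):
    cluster_pop = pulp.lpSum(population[i] * x[i, j] for i in range(len(points)))
    prob += cluster_pop >= 90000 - tol
    prob += cluster_pop <= 90000 + tol
prob.solve()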