I have a problem: I want to cluster my dataset, but my centroids end up outside of the clusters instead of inside them. I have already read
"Python k-mean, centroids are placed outside of the clusters" about this.
However, I do not know what the reason could be. How can I cluster correctly?
You can find the dataset at https://gist.githubusercontent.com/Coderanker3/24c948d2ff0b7f71e51b3774c2cc7b22/raw/253ba0660720de3a9cf7dee2a2d25a37f61095ca/Dataset
import pandas as pd
from sklearn.cluster import KMeans
from scipy.cluster import hierarchy
import seaborn as sns
from sklearn import metrics
from sklearn.metrics import silhouette_samples
import matplotlib as mpl
import matplotlib.pyplot as plt
df = pd.read_csv(r'https://gist.githubusercontent.com/Coderanker3/24c948d2ff0b7f71e51b3774c2cc7b22/raw/253ba0660720de3a9cf7dee2a2d25a37f61095ca/Dataset')
df.shape
features_clustering = ['review_scores_accuracy',
'distance_to_center',
'bedrooms',
'review_scores_location',
'review_scores_value',
'number_of_reviews',
'beds',
'review_scores_communication',
'accommodates',
'review_scores_checkin',
'amenities_count',
'review_scores_rating',
'reviews_per_month',
'corrected_price']
# cluster over the 14 selected features
df_cluster = df[features_clustering].copy()
X = df_cluster.copy()
model = KMeans(n_clusters=4, random_state=53, n_init=10, max_iter=1000, tol=0.0001)
clusters = model.fit_predict(X)
df_cluster["cluster"] = clusters
fig = plt.figure(figsize=(8, 8))
# data points over two of the features, colored by cluster label
sns.scatterplot(data=df_cluster, x="amenities_count", y="corrected_price", hue="cluster", palette='Set2_r')
# first two coordinates of the cluster centers
sns.scatterplot(x=model.cluster_centers_[:,0], y=model.cluster_centers_[:,1], color='blue', marker='*',
                label='centroid', s=250)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
#plt.ylim(ymin=0)
plt.xlim(xmin=-0.1)
plt.show()
model.cluster_centers_
inertia = model.inertia_
sil = metrics.silhouette_score(X,model.labels_)
print(f'inertia {inertia:.3f}')
print(f'silhouette {sil:.3f}')
[OUT]
inertia 4490.076
silhouette 0.156
The answer to your main question: the cluster centers are not outside of your clusters.
1. You are clustering over the 14 features listed in features_clustering.
2. You are viewing the clusters in a two-dimensional space, arbitrarily choosing amenities_count and corrected_price for the data, while plotting the cluster centers with x=model.cluster_centers_[:,0], y=model.cluster_centers_[:,1], two coordinates that do not correspond to those same features.
For these reasons you are going to get strange results; they really don't mean anything.
The bottom line is that you cannot view 14-dimensional clustering in two dimensions.
To show point 2 more clearly, change the line that plots the cluster centers to
sns.scatterplot(x=model.cluster_centers_[:,10], y=model.cluster_centers_[:,13], color='blue', marker='*', label='centroid', s=250)
so that the cluster centers are plotted against the same features as the data.
The SO answer you linked, about cluster centers lying outside of the cluster data, is about scaling the data to between 0 and 1 before clustering and then not scaling the cluster centers back up when plotting them with the real data. That is not the same as your issue here.
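For contrast, a minimal sketch of that scenario, e.g. with MinMaxScaler (reusing X and KMeans from your question; this is not what is happening in your code, just an illustration of the linked answer):

from sklearn.preprocessing import MinMaxScaler

# scale each feature to [0, 1] before clustering
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
model_scaled = KMeans(n_clusters=4, random_state=53, n_init=10).fit(X_scaled)

# the centers live in the scaled space; map them back to the original units
# before plotting them together with the unscaled data
centers_original_units = scaler.inverse_transform(model_scaled.cluster_centers_)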
You are building multidimensional clusters and trying to fit them onto a two-dimensional map; by itself that will not work. Let me explain: each variable is a dimension, x1, x2, x3, ..., xn, and the cluster centers you find are expressed in those same dimensions, y1, y2, y3, ..., yn. If you map the result in 2D as you are doing (taking your example),
with "amenities_count" and "corrected_price" as the two plotted variables,
the plot shows only those two variables, while the plotting call, given a 2D map, simply takes the first two coordinates of each cluster center, y1 and y2. Note that the plotted variables have no direct relationship with y1.
You must either: 1) do a conversion to find the corresponding x, y coordinates, or 2) reduce the dimensionality of the data to generate a 2D map that carries information from all the variables (see the sketch below).
For the first case I am not very sure, because I have never done it (remapping the data).
For dimensionality reduction, I recommend t-SNE (https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) or the classic PCA.
Moral: if you want to see a 2D cluster, make sure you only have 2 variables.
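For example, a minimal sketch with PCA (assuming X, model and clusters from the question; t-SNE would work similarly for the data, but it has no transform method to project the centers):

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# project both the 14-D data and the 14-D cluster centers into the same 2-D PCA space
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
centers_2d = pca.transform(model.cluster_centers_)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=clusters, cmap='Set2', s=10)
plt.scatter(centers_2d[:, 0], centers_2d[:, 1], color='blue', marker='*', s=250, label='centroid')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()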
Related
I performed a PCA (using the scikit-learn library) on a dataset that I centered using the preprocessing module:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
std = StandardScaler(with_std=False)
df_pca = pd.DataFrame(std.fit_transform(df),
index=df.index,
columns=df.columns)
because my variables have the same unit of measurement (millimeters).
I used this code to create an instance of the PCA class:
pca= PCA(n_components=2, svd_solver="full")
pca.fit(df_pca)
My concern is when I represent my circle of correlations with the code:
import numpy as np
import matplotlib.pyplot as plt
pcs = pca.components_
fig, ax = plt.subplots(figsize=(10,10))
plt.xlim(-1,1)
plt.ylim(-1,1)
plt.quiver(np.zeros(pcs.shape[1]), np.zeros(pcs.shape[1]),
pcs[0,:], pcs[1,:], angles='uv', scale_units='xy', scale=1, color='r', width= 0.003)
for i, (x, y) in enumerate(zip(pcs[0, :], pcs[1, :])):
    plt.text(x, y, df.columns[i])
circle = plt.Circle((0,0), 1, facecolor='none', edgecolor='b')
plt.gca().add_artist(circle)
plt.plot([-1,1],[0,0],color='silver',linestyle='--',linewidth=1)
plt.plot([0,0],[-1,1],color='silver',linestyle='--',linewidth=1)
plt.title('Circle of correlations')
plt.xlabel('F{} ({}%)'.format(1, round(100*pca.explained_variance_ratio_[0],1)))
plt.ylabel('F{} ({}%)'.format(2, round(100*pca.explained_variance_ratio_[1],1)))
plt.show()
I get this correlation circle:
In PCA theory it is said that "2 perpendicular arrows correspond to independent (uncorrelated) variables if these 2 variables are well represented on the factorial plane", and here the variables margin_low and length seem to be very well represented.
But my correlation matrix shows a correlation between these two variables:
Bonus question: is my correlation circle incorrect?
Here is the extract of the correlation coefficients of the variables on the F1 and F2 axes:
When I use the fanalysis package, I can pass the argument std_unit=False when creating my PCA, indicating that the PCA is performed on centered but not scaled data. Here is the PCA class instantiation code:
from fanalysis.pca import PCA as fan_PCA
X = df_acp.to_numpy()
my_pca = fan_PCA(std_unit=False,
row_labels=df_acp.index,
col_labels=df_acp.columns)
my_pca.fit(X)
I then get this circle of correlations using my_pca.correlation_circle(num_x_axis=1, num_y_axis=2):
The arrows of the variables length and margin_low are better drawn and I obtain the same percentage of inertia.
I tried to look at the documentation of the package's functions, in particular the file pca.py (https://github.com/OlivierGarciaDev/fanalysis/blob/9aa2cc8b1e5cc5600a05813144973b77143cfe42/fanalysis/pca.py), to understand how std_unit works so that I could reproduce it with the scikit-learn library and more easily customize the function that displays my circle of correlations. I understand that it affects how the dimensionality reduction is applied to X (line 169, the transform function), but being still a beginner in data science I cannot work out how to obtain the same result with sklearn.decomposition.
My code used to display my circle of correlations is shown above.
Is my first circle of correlations wrong? How can I reproduce fanalysis's std_unit=False argument using sklearn.decomposition?
Note: my dataset has shape (1500, 6). Given its size, I thought it would be complicated to attach it to my question. If it turns out to be necessary to resolve the question, I will include it in an edit.
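For reference, my current guess (just a sketch, I am not sure it is the correct way to reproduce std_unit=False) is that with centered-but-unscaled data the raw pca.components_ are not correlations and would have to be converted by hand, assuming df_pca and pca from my code above:

import numpy as np

# per-variable standard deviations of the centered data (same as the original data)
stds = df_pca.std(axis=0, ddof=1).to_numpy()
# candidate coordinates of the variables on the correlation circle:
# loading * sqrt(eigenvalue) / standard deviation of the variable
corr = (pca.components_.T * np.sqrt(pca.explained_variance_)) / stds[:, None]
# corr[:, 0] and corr[:, 1] would then be the F1/F2 coordinates, in [-1, 1]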
I am trying to cluster the supplied charging power to different vehicles.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN, OPTICS, cluster_optics_dbscan
df_temp_12 = list(map(lambda x, y: [x, y], VehicleList, Power))  # pair each vehicle with its power value
Eps = 1.3
dbscan_12_object = OPTICS(eps=Eps, cluster_method="dbscan")
dbscan_12 = dbscan_12_object.fit_predict(df_temp_12)
My present output looks like this:
I am trying to cluster mainly along the horizontal axis (as shown by the cluster in green). If you look closely, all the other clusters have not been clustered in the same way (i.e. along the horizontal), so I am trying to figure out whether I can change the eps value along each axis separately, and if so, how.
EDIT:
After the suggestion by @PlzBePython, the output looks like this. Precisely what I needed.
If you already know that you want to cluster along a specific axis, why not cluster that feature in isolation?
So for example:
from sklearn.cluster import OPTICS

data = np.array(Power).reshape((-1, 1))   # cluster the power values on their own
clusterer = OPTICS(eps=1.3, cluster_method="dbscan")
clustered_data = clusterer.fit(data).labels_
Otherwise, I would try creating OPTICS without parameters (the default is to identify clusters across all scales):
clusterer = OPTICS()
or using a Gaussian Mixture, as sketched below.
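A minimal sketch of the Gaussian Mixture idea (assuming Power from the question; n_components=4 is just a guess to tune):

from sklearn.mixture import GaussianMixture
import numpy as np

# fit a 1-D mixture on the power values alone and use the component assignments as clusters
data = np.array(Power).reshape((-1, 1))
gmm = GaussianMixture(n_components=4, random_state=0)
labels = gmm.fit_predict(data)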
I built a GMM model and used this to run a prediction.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
bead = df['Ce140Di']
dna = df['DNA_1']
X = np.column_stack((dna, bead)) # create a 2D array from the two columns
#plt.scatter(X[:,0], X[:,1], s=0.5, c='black')
#plt.show()
gmm = GaussianMixture(n_components=4, covariance_type='tied')
gmm.fit(X)
labels = gmm.predict(X)
and then generated a plot as follows...
df['predicted_cluster'] = labels
fig= plt.figure()
colors = {1:'red', 2:'orange', 3:'purple', 0:'grey'}
plt.scatter(df['DNA_1'], df['Ce140Di'], c=df['predicted_cluster'].apply(lambda x: colors[x]), s = 0.5, alpha=0.5)
plt.show()
scatter plot colored by predictions
Whilst I have the output prediction for each row of my df, I don't actually know which cluster it corresponds to without looking at my colors dictionary. Is there a way to do this without having to look at the scatter plot each time?
In other words, I want to know that 0 will always correspond to my grey cluster, or that 1 will always be the red cluster, but this changes each time.
Colors aside, how do I know the position of each cluster? What does a label of 0 mean?
EDIT: I believe the answer to my perhaps silly question is to use np.random.seed, but I could be wrong...
Hello Hajar,
I think the answer to your question will disappoint you. I assume each Gaussian in your GMM is initialised to some random mean and variance. If you set a random seed, then you can be reasonably certain that the resulting clusters will always be the same.
With that said, in multi-label scenarios without a random seed there are (to my knowledge) no clustering algorithms that guarantee which label is assigned to each cluster.
Clustering algorithms assign labels arbitrarily. The only guarantee any clustering algorithm makes about a point assigned a certain label is that it is similar to other points with the same label by some metric.
This makes measuring the accuracy of clustering algorithms quite challenging. Hence the existence of metrics like the Adjusted Mutual Information Score and the Adjusted Rand Index.
You could account for this with a sort of semi-supervised approach, in which you force a particular point to start with a "ground-truth" label and hope your algorithm centres a cluster on it, but even then there may be variance.
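For the random-seed part, a minimal sketch (assuming X from your question; the relabelling step is just one idea, not something GMM does for you):

from sklearn.mixture import GaussianMixture
import numpy as np

# fixing random_state makes the fit reproducible, so the label-to-cluster mapping
# stays the same across runs
gmm = GaussianMixture(n_components=4, covariance_type='tied', random_state=42)
labels = gmm.fit_predict(X)

# to get labels that are stable by construction, re-number the components by,
# for example, their mean along the first feature (DNA_1 in your case)
order = np.argsort(gmm.means_[:, 0])
remap = {old: new for new, old in enumerate(order)}
stable_labels = np.array([remap[l] for l in labels])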
Good luck, and I hope this helps.
I have a dataset with locations (coordinates) and a scalar attribute of each location (for example, temperature). I need to cluster the locations based on the scalar attribute, but taking into consideration the distance between locations.
The problem is that, using temperature as an example, it is possible for locations that are far from each other to have the same temperature. If I cluster on temperature, these locations will be in the same cluster, when they shouldn't. The opposite is true if two locations that are near each other have different temperatures. In this case, clustering on temperature may result in these observations being in different clusters, while clustering based on a distance matrix would put them in the same one.
So, is there a way in which I could cluster observations giving more importance to one attribute (temperature) and then "refining" based on the distance matrix?
Here is a simple example showing how clustering differs depending on whether an attribute is used as the basis or the distance matrix. My goal is to be able to use both, the attribute and the distance matrix, giving more importance to the attribute.
import numpy as np
import matplotlib.pyplot as plt
import haversine
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial import distance as ssd
# Create location data
x = np.random.rand(100, 1)
y = np.random.rand(100, 1)
t = np.random.randint(0, 20, size=(100,1))
# Compute distance matrix
D = np.zeros((len(x),len(y)))
for k in range(len(x)):
    for j in range(len(y)):
        distance_pair = haversine.distance((x[k], y[k]), (x[j], y[j]))
        D[k, j] = distance_pair
# Compare clustering alternatives
Zt = linkage(t, 'complete')
Zd = linkage(ssd.squareform(D), method="complete")
# Cluster based on t
clt = fcluster(Zt, 5, criterion='distance').reshape(100,1)
plt.figure(figsize=(10, 8))
plt.scatter(x, y, c=clt)
plt.show()
# Cluster based on distance matrix
cld = fcluster(Zd, 10, criterion='distance').reshape(100,1)
plt.figure(figsize=(10, 8))
plt.scatter(x, y, c=cld)
plt.show()
haversine.py is available here: https://gist.github.com/rochacbruno/2883505
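For concreteness, something like the following is what I am after (just a sketch, I do not know whether this weighted combination of normalized distances is sound):

# combine the attribute distances and the geographic distances into one weighted matrix
dt = ssd.pdist(t)               # condensed pairwise temperature differences
dg = ssd.squareform(D)          # condensed geographic distances
dt = dt / dt.max()              # normalize both so the weight is meaningful
dg = dg / dg.max()
w = 0.7                         # importance given to the attribute over the distance matrix
Zc = linkage(w * dt + (1 - w) * dg, method="complete")
clc = fcluster(Zc, 0.5, criterion='distance')   # threshold 0.5 is an arbitrary guess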
Thanks.
I am currently stuck on a school exercise. The exercise is as follows.
We will consider a subset of the wild faces data described in
berg2005[1]. Load the wildfaces data, Data/wildfaces using the loadmat
function. Each data object is a 40*40*3=4800 dimensional
vector, corresponding to a 3-color 40*40 pixels image. Compute a
k-means clustering of the data with K=10 clusters. Plot a few random
images from the data set as well as their corresponding cluster
centroids to see how they are represented.
[1] Tamara L Berg, Alexander C Berg, Jaety Edwards, and DA Forsyth. Who's in the picture. Advances in Neural Information Processing Systems, 17:137-144, 2005.
Now to my question: how do I compute the centroids for one image? I am currently able to display the face and to calculate centroids for the dataset. What I don't understand is how I know which centroid corresponds to image 4 (as used in my code sample). Do I have to calculate centroids for the entire dataset X or just for X[4]? What steps do I need to take now to 'plot the corresponding cluster centroids to see how they are represented'?
import scipy.io as spio
import sklearn.cluster as cl
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow
faces = spio.loadmat('Data/wildfaces.mat',squeeze_me=True)
X = faces['X']
Y = cl.k_means(X, 10)   # returns (centroids, labels, inertia)
centroids = Y[0]
clusters = Y[1]
imshow(np.reshape(X[4,:], (3, 40, 40)).T)   # display image 4
plt.show()
You already have centroids. One per cluster.
You don't need to compute them, only display them.
Check the contents of Y
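For example, a minimal sketch using X, centroids and clusters from your code: the centroid that corresponds to image 4 is the centroid of the cluster that image 4 was assigned to.

i = 4
cluster_idx = clusters[i]                     # cluster assigned to image i
fig, axes = plt.subplots(1, 2)
axes[0].imshow(np.reshape(X[i, :], (3, 40, 40)).T)
axes[0].set_title('image {}'.format(i))
# depending on the dtype of X, the centroid values may need to be cast or
# rescaled (e.g. divided by 255) before imshow will display them correctly
axes[1].imshow(np.reshape(centroids[cluster_idx, :], (3, 40, 40)).T)
axes[1].set_title('centroid of cluster {}'.format(cluster_idx))
plt.show()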