I am trying to cluster the charging power supplied to different vehicles.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN, OPTICS, cluster_optics_dbscan
# Pair each vehicle with its charging power to form 2-D points
df_temp_12 = list(map(lambda x, y: [x, y], VehicleList, Power))
Eps = 1.3
# OPTICS with a DBSCAN-style extraction at the chosen eps
dbscan_12_object = OPTICS(eps=Eps, cluster_method="dbscan")
dbscan_12 = dbscan_12_object.fit_predict(df_temp_12)
My present output looks like this:
I am trying to cluster mainly along the horizontal axis (as shown by the cluster in green). If you look closely, the other clusters have not been formed in the same way (i.e. along the horizontal axis), so I am trying to figure out whether I can set the eps value separately for each axis, and if so, how.
EDIT:
After the suggestion by @PlzBePython, the output looks like this. Precisely what I needed.
If you already know that you want to cluster along a specific axis, why not cluster that feature in isolation?
So for example:
import numpy as np
import sklearn.cluster

data = np.array(Power).reshape((-1, 1))  # one feature, reshaped to a column vector
clusterer = sklearn.cluster.OPTICS(eps=1.3, cluster_method="dbscan")
clustered_data = clusterer.fit(data).labels_
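As a quick visual check (just a sketch, reusing VehicleList and Power from the question), you can color the original scatter by these one-dimensional labels:
import matplotlib.pyplot as plt

plt.scatter(VehicleList, Power, c=clustered_data)  # one color per power cluster
plt.xlabel("vehicle")
plt.ylabel("power")
plt.show()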
Otherwise, I would try creating OPTICS without parameters (the default is to identify clusters across all scales):
clusterer = sklearn.cluster.OPTICS()
or using a Gaussian Mixture.
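For the Gaussian Mixture route, a minimal sketch on the same one-column data array (n_components=3 is an assumption you would tune, e.g. to the expected number of vehicles):
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, random_state=0)  # n_components is an assumption
labels = gmm.fit_predict(data)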
I have a problem: I want to cluster my dataset, but the centroids end up outside the clusters rather than inside them. I have already read the question "Python k-mean, centroids are placed outside of the clusters" about this.
However, I do not know what the reason could be. How can I cluster correctly?
You can find the dataset at https://gist.githubusercontent.com/Coderanker3/24c948d2ff0b7f71e51b3774c2cc7b22/raw/253ba0660720de3a9cf7dee2a2d25a37f61095ca/Dataset
import pandas as pd
from sklearn.cluster import KMeans
from scipy.cluster import hierarchy
import seaborn as sns
from sklearn import metrics
from sklearn.metrics import silhouette_samples
import matplotlib as mpl
import matplotlib.pyplot as plt
df = pd.read_csv(r'https://gist.githubusercontent.com/Coderanker3/24c948d2ff0b7f71e51b3774c2cc7b22/raw/253ba0660720de3a9cf7dee2a2d25a37f61095ca/Dataset')
df.shape
features_clustering = ['review_scores_accuracy',
'distance_to_center',
'bedrooms',
'review_scores_location',
'review_scores_value',
'number_of_reviews',
'beds',
'review_scores_communication',
'accommodates',
'review_scores_checkin',
'amenities_count',
'review_scores_rating',
'reviews_per_month',
'corrected_price']
df_cluster = df[features_clustering].copy()
X = df_cluster.copy()
model = KMeans(n_clusters=4, random_state=53, n_init=10, max_iter=1000, tol=0.0001)
clusters = model.fit_predict(X)
df_cluster["cluster"] = clusters
fig = plt.figure(figsize=(8, 8))
sns.scatterplot(data=df_cluster, x="amenities_count", y="corrected_price", hue="cluster", palette='Set2_r')
sns.scatterplot(x=model.cluster_centers_[:,0], y=model.cluster_centers_[:,1], color='blue',marker='*',
label='centroid', s=250)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
#plt.ylim(ymin=0)
plt.xlim(xmin=-0.1)
plt.show()
model.cluster_centers_
inertia = model.inertia_
sil = metrics.silhouette_score(X,model.labels_)
print(f'inertia {inertia:.3f}')
print(f'silhouette {sil:.3f}')
[OUT]
inertia 4490.076
silhouette 0.156
The answer to your main question: the cluster centers are not outside of your clusters.
1. You are clustering over the 14 features listed in features_clustering.
2. You are viewing the clusters in a two-dimensional space, arbitrarily choosing amenities_count and corrected_price for the data, but taking the first two coordinates of the cluster centers (x=model.cluster_centers_[:,0], y=model.cluster_centers_[:,1]), which do not correspond to those features.
For these reasons you are going to get strange results; they really don't mean anything.
The bottom line is that you cannot view 14-dimensional clustering in two dimensions.
To show point 2 more clearly, change the centroid-plotting line to
sns.scatterplot(x=model.cluster_centers_[:,10], y=model.cluster_centers_[:,13], color='blue', marker='*', label='centroid', s=250)
so that the cluster centers are plotted against the same features as the data.
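If you prefer not to hard-code the column indices, a small sketch that looks them up from the features_clustering list defined above:
ix = features_clustering.index("amenities_count")   # column 10
iy = features_clustering.index("corrected_price")   # column 13
sns.scatterplot(x=model.cluster_centers_[:, ix], y=model.cluster_centers_[:, iy], color='blue', marker='*', label='centroid', s=250)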
The SO answer you linked about cluster centers lying outside the data is about scaling the data to between 0 and 1 before clustering and then not scaling the cluster centers back up when plotting them against the real, unscaled data. That is not the same as your issue here.
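For completeness, a sketch of that scaling pitfall (not your code; MinMaxScaler and the parameters here are only illustrative): if you scale before clustering, the centers must be mapped back with inverse_transform before plotting them against the unscaled data.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
model_scaled = KMeans(n_clusters=4, random_state=53, n_init=10).fit(X_scaled)
centers_unscaled = scaler.inverse_transform(model_scaled.cluster_centers_)  # back to original units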
You are building multidimensional clusters and trying to fit them onto a two-dimensional map; by itself that will not work. Let me explain: each variable is a dimension, x1, x2, x3, ..., xn, and the clustering returns cluster centers y1, y2, y3, ..., yn in that same space. If you map the result in 2D as you are doing (taking your example, where the data axes are "amenities_count" and "corrected_price"),
the plot shows only those two variables, while the plotter simply takes the first two coordinates of each cluster center, y1 and y2, to draw the centroids. Note that an arbitrary xi has no direct relationship with y1.
You must either 1) pick out the centroid coordinates that correspond to the plotted features, or 2) reduce the dimensionality of the data so that a 2D map carries information from all the variables.
For the first case I am not very sure, because I have never done it (remapping the data).
But for dimensionality reduction, I recommend t-SNE (https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) or classic PCA.
Moral: if you want to see clusters in 2D, make sure you are plotting only 2 variables.
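A minimal sketch of the PCA route (assuming X, model and the imports from the question above): project both the data and the fitted centers into the same 2-D space and plot those.
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)                          # all 14 features -> 2 components
centers_2d = pca.transform(model.cluster_centers_)   # centroids in the same 2-D space
plt.figure(figsize=(8, 8))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=model.labels_, s=10)
plt.scatter(centers_2d[:, 0], centers_2d[:, 1], color='blue', marker='*', s=250, label='centroid')
plt.legend()
plt.show()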
I have a dataset with locations (coordinates) and a scalar attribute of each location (for example, temperature). I need to cluster the locations based on the scalar attribute, but taking into consideration the distance between locations.
The problem is that, using temperature as an example, it is possible for locations that are far from each other to have the same temperature. If I cluster on temperature, these locations will be in the same cluster, when they shouldn't. The opposite is true if two locations that are near each other have different temperatures. In this case, clustering on temperature may result in these observations being in different clusters, while clustering based on a distance matrix would put them in the same one.
So, is there a way in which I could cluster observations giving more importance to one attribute (temperature) and then "refining" based on the distance matrix?
Here is a simple example showing how the clustering differs depending on whether the attribute or the distance matrix is used as the basis. My goal is to be able to use both the attribute and the distance matrix, giving more importance to the attribute.
import numpy as np
import matplotlib.pyplot as plt
import haversine
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial import distance as ssd
# Create location data
x = np.random.rand(100, 1)
y = np.random.rand(100, 1)
t = np.random.randint(0, 20, size=(100,1))
# Compute distance matrix
D = np.zeros((len(x),len(y)))
for k in range(len(x)):
    for j in range(len(y)):
        distance_pair = haversine.distance((x[k], y[k]), (x[j], y[j]))
        D[k, j] = distance_pair
# Compare clustering alternatives
Zt = linkage(t, 'complete')
Zd = linkage(ssd.squareform(D), method="complete")
# Cluster based on t
clt = fcluster(Zt, 5, criterion='distance').reshape(100,1)
plt.figure(figsize=(10, 8))
plt.scatter(x, y, c=clt)
plt.show()
# Cluster based on distance matrix
cld = fcluster(Zd, 10, criterion='distance').reshape(100,1)
plt.figure(figsize=(10, 8))
plt.scatter(x, y, c=cld)
plt.show()
haversine.py is available here: https://gist.github.com/rochacbruno/2883505
Thanks.
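One possible approach (a sketch, not a definitive answer; it reuses the imports and variables from the code above): normalize the temperature distances and the spatial distances to a common scale, blend them with a weight that favors the attribute, and feed the combined distances to the hierarchical clustering. The weight and the threshold below are assumptions to tune.
Dt = ssd.pdist(t)                 # condensed pairwise temperature differences
Dd = ssd.squareform(D)            # spatial distances in condensed form
Dt = Dt / Dt.max()                # normalize both to [0, 1] so the weight is meaningful
Dd = Dd / Dd.max()
w = 0.8                           # assumed relative importance of temperature
Zc = linkage(w * Dt + (1 - w) * Dd, method="complete")
clc = fcluster(Zc, 0.3, criterion='distance')   # threshold is also an assumption
plt.figure(figsize=(10, 8))
plt.scatter(x, y, c=clc)
plt.show()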
I have a large data matrix X and I use the SciPy implementation of Ward's hierarchical clustering like so:
from scipy.cluster.hierarchy import ward, dendrogram
import matplotlib.pyplot as plt

Z = ward(X.todense())
fig = plt.figure(figsize=(25, 10))
dn = dendrogram(Z)
I now wish to see which cluster each X[i] belongs to. How can I do this?
From the linkage matrix Z you can get the clusters with scipy.cluster.hierarchy.fcluster.
First, I assume you want the same clusters as the colors of the dendrogram. From the docs we can see that color_threshold is set to 0.7*max(Z[:,2]) if nothing else is specified, so that is what we will use.
For example:
from sklearn.datasets import make_classification
from scipy.cluster.hierarchy import linkage, fcluster
X, y = make_classification(n_samples=10)
Z = linkage(X, method='ward')
thresh = 0.7 * max(Z[:, 2])   # same default threshold the dendrogram uses for coloring
labels = fcluster(Z, thresh, criterion='distance')
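The cluster that a given row X[i] belongs to is then simply labels[i], for example:
print(labels[0])   # cluster id of X[0]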
See also How to get flat clustering corresponding to color clusters in the dendrogram created by scipy
I am currently stuck on a school exercise. The exercise is as follows.
We will consider a subset of the wild faces data described in
berg2005[1]. Load the wildfaces data, Data/wildfaces using the loadmat
function. Each data object is a 40*40*3=4800 dimensional
vector, corresponding to a 3-color 40*40 pixels image. Compute a
k-means clustering of the data with K=10 clusters. Plot a few random
images from the data set as well as their corresponding cluster
centroids to see how they are represented.
[1] Tamara L Berg, Alexander C Berg, Jaety Edwards, and DA Forsyth. Who's in the picture. Advances in Neural Information Processing Systems, 17:137-144, 2005.
Now to my question: how do I compute the centroids for one image? I am currently able to display a face and calculate centroids for the dataset. What I don't understand is, how do I know which centroids correspond to image 4 (as used in my code sample)? Do I have to calculate centroids for the entire dataset X or just for X[4]? What steps do I need to take now to 'plot the corresponding cluster centroids to see how they are represented'?
import scipy.io as spio
import sklearn.cluster as cl
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow
faces = spio.loadmat('Data/wildfaces.mat',squeeze_me=True)
X = faces['X']
Y = cl.k_means(X, 10)    # returns (centroids, labels, inertia)
centroids = Y[0]         # one 4800-dimensional centroid per cluster
clusters = Y[1]          # cluster label assigned to each image
imshow(np.reshape(X[4, :], (3, 40, 40)).T)   # display image 4
plt.show()
You already have the centroids, one per cluster. You don't need to compute them again, only display them. Check the contents of Y.
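For example, a sketch of how you might display image 4 next to the centroid of its cluster (reusing X, centroids and clusters from your code; the rescaling comment is an assumption about the pixel range):
i = 4
fig, axes = plt.subplots(1, 2)
axes[0].imshow(np.reshape(X[i, :], (3, 40, 40)).T)
axes[0].set_title('image %d' % i)
centroid_img = np.reshape(centroids[clusters[i], :], (3, 40, 40)).T
# if X holds 0-255 integer pixels, the float-valued centroid may need casting:
# centroid_img = centroid_img.astype(X.dtype)
axes[1].imshow(centroid_img)
axes[1].set_title('centroid of cluster %d' % clusters[i])
plt.show()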
I have a time series dataset on which I used a machine learning algorithm to identify distinct patterns. The classification is done, but I want to check it visually to see how well it did.
How do I make a time series graph with different colors for each pattern classification OR what is the best way to visualize or check for accuracy with time series classification data?
The data basically looks like this:
DATE     DEMAND   CLASSIFICATION
June 4   678      1
Generally the classification would look like this:
0000000000000000011111111111111111110000000000000000000000000022222222222222222 etc.
Any help?
Here's what I've tried:
1. Convert your data (assuming it is a string) to a list of ints:
x = '0000000000000000011111111111111111110000000000000000000000000022222222222222222'
x = [int(x[i]) for i in range(len(x))]
2. Create a color label array. Here I use grayscale, but you can use any RGB values.
import numpy as np
colors=[(i,i,i) for i in np.linspace(0,1,10)]
3. Use the bar plotting function with these colors as keys. I've tweaked some other properties of the bar function to make it a solid block.
import matplotlib.pyplot as plt
plt.bar(range(len(x)), np.ones(len(x)), width=1, linewidth=0, color=[colors[i] for i in x])
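A hedged variation of step 3 (the demand values here are random placeholders for illustration): use the demand as the bar height so the shape of the series and the class coloring are visible together.
demand = np.random.randint(500, 800, size=len(x))   # placeholder demand values
plt.bar(range(len(x)), demand, width=1, linewidth=0, color=[colors[i] for i in x])
plt.xlabel('time index')
plt.ylabel('demand')
plt.show()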