I am currently stuck on a school exercise. The exercise is as follows.
We will consider a subset of the wild faces data described in
berg2005[1]. Load the wildfaces data, Data/wildfaces using the loadmat
function. Each data object is a 40*40*3 = 4800-dimensional vector, corresponding to a 3-color 40*40-pixel image. Compute a
k-means clustering of the data with K=10 clusters. Plot a few random
images from the data set as well as their corresponding cluster
centroids to see how they are represented.
[1] Tamara L. Berg, Alexander C. Berg, Jaety Edwards, and D. A. Forsyth. Who's in the picture. Advances in Neural Information Processing Systems, 17:137-144, 2005.
Now to my question: how do I compute the centroids for one image? I can currently display a face and compute centroids for the whole dataset. What I don't understand is how to know which centroid corresponds to image 4 (the one used in my code sample). Do I have to calculate centroids for the entire dataset X or just for X[4]? What steps do I need to take now to 'plot the corresponding cluster centroids to see how they are represented'?
import scipy.io as spio
import sklearn.cluster as cl
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow

faces = spio.loadmat('Data/wildfaces.mat', squeeze_me=True)
X = faces['X']                  # one 4800-dimensional row per image
Y = cl.k_means(X, 10)           # returns (centroids, labels, inertia)
centroids = Y[0]                # shape (10, 4800): one centroid per cluster
clusters = Y[1]                 # one cluster label per image
imshow(np.reshape(X[4, :], (3, 40, 40)).T)   # show image 4 as a 40x40 RGB picture
plt.show()
You already have the centroids, one per cluster.
You don't need to compute anything extra for a single image, only display them.
Check the contents of Y: k_means returns the centroid array, the cluster label of every image, and the inertia, so clusters[4] tells you which centroid corresponds to image 4.
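For example, a minimal sketch (reusing X, centroids, and clusters from your code; depending on the value range of X, the centroid may need casting or rescaling before imshow displays it sensibly):

import numpy as np
import matplotlib.pyplot as plt

i = 4
label = clusters[i]            # cluster label assigned to image 4
centroid = centroids[label]    # the 4800-dimensional centre of that cluster

fig, axes = plt.subplots(1, 2)
axes[0].imshow(np.reshape(X[i, :], (3, 40, 40)).T)   # the original face
axes[0].set_title('image {}'.format(i))
# cast back to the dtype of X so imshow interprets the values like the raw images
axes[1].imshow(np.reshape(centroid, (3, 40, 40)).T.astype(X.dtype))
axes[1].set_title('cluster {} centroid'.format(label))
plt.show()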
I ran a PCA (using the scikit-learn library) on a dataset that I centered with the preprocessing module:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

std = StandardScaler(with_std=False)   # center only, do not scale
df_pca = pd.DataFrame(std.fit_transform(df),
                      index=df.index,
                      columns=df.columns)
because my variables all have the same unit of measurement (millimeters).
I used this code to create an instance of the PCA class:
pca= PCA(n_components=2, svd_solver="full")
pca.fit(df_pca)
My concern arises when I draw my circle of correlations with this code:
import numpy as np
import matplotlib.pyplot as plt

pcs = pca.components_

fig, ax = plt.subplots(figsize=(10, 10))
plt.xlim(-1, 1)
plt.ylim(-1, 1)
# one arrow per variable, from the origin to its coordinates on (F1, F2)
plt.quiver(np.zeros(pcs.shape[1]), np.zeros(pcs.shape[1]),
           pcs[0, :], pcs[1, :],
           angles='uv', scale_units='xy', scale=1, color='r', width=0.003)
for i, (x, y) in enumerate(zip(pcs[0, :], pcs[1, :])):
    plt.text(x, y, df.columns[i])
circle = plt.Circle((0, 0), 1, facecolor='none', edgecolor='b')
plt.gca().add_artist(circle)
plt.plot([-1, 1], [0, 0], color='silver', linestyle='--', linewidth=1)
plt.plot([0, 0], [-1, 1], color='silver', linestyle='--', linewidth=1)
plt.title('Circle of correlations')
plt.xlabel('F{} ({}%)'.format(1, round(100 * pca.explained_variance_ratio_[0], 1)))
plt.ylabel('F{} ({}%)'.format(2, round(100 * pca.explained_variance_ratio_[1], 1)))
plt.show()
I get this correlation circle:
In PCA theory it is said that "two perpendicular arrows correspond to independent (uncorrelated) variables if these two variables are well represented on the factorial plane", and here the variables margin_low and length seem to be very well represented.
But my correlation matrix shows a correlation between these two variables:
Bonus question: is my correlation circle incorrect?
Here is the extraction of the correlation coefficients of the variables on the F1 and F2 axes:
When I use the fanalysis package, I can pass the argument std_unit=False when creating my PCA, indicating that the PCA is performed on centered but not scaled data. Here is the class-instantiation code:
from fanalysis.pca import PCA as fan_PCA

X = df_acp.to_numpy()
my_pca = fan_PCA(std_unit=False,
                 row_labels=df_acp.index,
                 col_labels=df_acp.columns)
my_pca.fit(X)
I then get this circle of correlations using my_pca.correlation_circle(num_x_axis=1, num_y_axis=2):
The arrows of the variables length and margin_low are better drawn and I obtain the same percentage of inertia.
I tried to look at the documentation of the package, in particular the file pca.py (https://github.com/OlivierGarciaDev/fanalysis/blob/9aa2cc8b1e5cc5600a05813144973b77143cfe42/fanalysis/pca.py), to understand what std_unit does and reproduce it with the scikit-learn library, so that I can more easily customize the function that displays my circle of correlations. I understand that it affects how the dimensionality reduction is applied to X (line 169, the transform function), but as a beginner in data science I cannot work out how to obtain the same result with sklearn.decomposition.
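From what I understand, the correlations could be recomputed by hand (a sketch, assuming the classical formula correlation = loading * sqrt(explained variance) / standard deviation of the variable; I am not certain this is exactly what fanalysis does internally):

import numpy as np

# correlation of each original variable with each component,
# for a PCA fitted on centered (but not scaled) data
stds = df_pca.std(axis=0, ddof=1).to_numpy()   # per-variable standard deviations
corr = pca.components_ * np.sqrt(pca.explained_variance_)[:, np.newaxis] / stds
# corr has shape (2, n_variables) with values in [-1, 1]; these could replace
# pca.components_ when drawing the arrows of the correlation circle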
My code for displaying the circle of correlations is shown above. Is my first circle of correlations wrong? How can I reproduce fanalysis's std_unit=False argument using sklearn.decomposition?
Note: my dataset has shape (1500, 6). Given its size, I thought it would be complicated to attach it to my question; if it turns out to be needed to resolve this, I will include it in an edit.
I am trying to cluster the supplied charging power to different vehicles.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN, OPTICS, cluster_optics_dbscan

# one [vehicle, power] pair per charging sample
df_temp_12 = list(map(lambda x, y: [x, y], VehicleList, Power))

Eps = 1.3
dbscan_12_object = OPTICS(eps=Eps, cluster_method="dbscan")   # DBSCAN-style extraction from OPTICS
dbscan_12 = dbscan_12_object.fit_predict(df_temp_12)          # one cluster label per sample
My present output looks like this:
I am trying to cluster mainly along the horizontal axis (as shown by the cluster in green). If you look closely, the other clusters have not been formed the same way (i.e. along the horizontal axis), so I am trying to figure out whether I can change the eps value for each axis separately, and if so, how?
EDIT:
After the suggestion by PlzBePython below, the output looks like this. Precisely what I needed.
If you already know that you want to cluster along a specific axis, why not cluster that feature in isolation?
So for example:
import numpy as np
import sklearn.cluster

data = np.array(Power).reshape((-1, 1))                              # the power values on their own
clusterer = sklearn.cluster.OPTICS(eps=1.3, cluster_method="dbscan")
clustered_data = clusterer.fit(data).labels_                         # one label per sample
Otherwise, I would try creating OPTICS without parameters (the default is to identify clusters across all scales):
clusterer = sklearn.cluster.OPTICS()
or using Gaussian Mixture.
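A rough sketch of the Gaussian-mixture option (assuming the same Power list; the number of components here is an arbitrary guess and would need tuning):

import numpy as np
from sklearn.mixture import GaussianMixture

data = np.array(Power).reshape((-1, 1))                  # power values only
gmm = GaussianMixture(n_components=4, random_state=0)    # 4 components chosen arbitrarily
labels = gmm.fit_predict(data)                           # one cluster label per sample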
I have a problem: I want to cluster my dataset, but unfortunately my centroids end up outside the clusters rather than inside them. I have already read the question "Python k-mean, centroids are placed outside of the clusters" about this. However, I do not know what the reason could be. How can I cluster correctly?
You can find the dataset at https://gist.githubusercontent.com/Coderanker3/24c948d2ff0b7f71e51b3774c2cc7b22/raw/253ba0660720de3a9cf7dee2a2d25a37f61095ca/Dataset
import pandas as pd
from sklearn.cluster import KMeans
from scipy.cluster import hierarchy
import seaborn as sns
from sklearn import metrics
from sklearn.metrics import silhouette_samples
import matplotlib as mpl
import matplotlib.pyplot as plt
df = pd.read_csv(r'https://gist.githubusercontent.com/Coderanker3/24c948d2ff0b7f71e51b3774c2cc7b22/raw/253ba0660720de3a9cf7dee2a2d25a37f61095ca/Dataset')
df.shape
features_clustering = ['review_scores_accuracy',
                       'distance_to_center',
                       'bedrooms',
                       'review_scores_location',
                       'review_scores_value',
                       'number_of_reviews',
                       'beds',
                       'review_scores_communication',
                       'accommodates',
                       'review_scores_checkin',
                       'amenities_count',
                       'review_scores_rating',
                       'reviews_per_month',
                       'corrected_price']
df_cluster = df[features_clustering].copy()
X = df_cluster.copy()
model = KMeans(n_clusters=4, random_state=53, n_init=10, max_iter=1000, tol=0.0001)
clusters = model.fit_predict(X)
df_cluster["cluster"] = clusters
fig = plt.figure(figsize=(8, 8))
sns.scatterplot(data=df_cluster, x="amenities_count", y="corrected_price", hue="cluster", palette='Set2_r')
sns.scatterplot(x=model.cluster_centers_[:,0], y=model.cluster_centers_[:,1], color='blue',
                marker='*', label='centroid', s=250)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
#plt.ylim(ymin=0)
plt.xlim(xmin=-0.1)
plt.show()
model.cluster_centers_
inertia = model.inertia_
sil = metrics.silhouette_score(X,model.labels_)
print(f'inertia {inertia:.3f}')
print(f'silhouette {sil:.3f}')
[OUT]
inertia 4490.076
silhouette 0.156
The answer to your main question: the cluster centers are not outside of your clusters.
1: You are clustering over the 14 features listed in features_clustering.
2: You are viewing the clusters over a two-dimensional space, arbitrarily choosing amenities_count and corrected_price for the data, while the cluster-center coordinates x=model.cluster_centers_[:,0], y=model.cluster_centers_[:,1] are the centers' values for review_scores_accuracy and distance_to_center, the first two features in the list; they don't correspond to the plotted features.
For these reasons you are going to get strange results; they really don't mean anything.
The bottom line is that you cannot view 14-dimensional clustering in two dimensions.
To show point 2 more clearly, change the plotting of the clusters line to
sns.scatterplot(x=model.cluster_centers_[:,10], y=model.cluster_centers_[:,13], color='blue',marker='*', label='centroid', s=250)
so that the cluster centers are plotted against the same features as the data (columns 10 and 13 of cluster_centers_ are amenities_count and corrected_price in the features_clustering order).
The linked SO answer about cluster centers lying outside the data is about scaling the data to between 0 and 1 before clustering and then not scaling the cluster centers back up when plotting them with the real data. That is not the same as your issue here.
You are building multidimensional clusters and then trying to show them on a two-dimensional map; by itself that will not work. Let me explain: each variable is a dimension, x1, x2, x3, ..., xn, and the cluster centers the algorithm finds live in that same space, y1, y2, y3, ..., yn. If you map the result in 2D the way you are doing (taking your example, where one of the xi is "amenities_count" and another is "corrected_price"), you create a 2D map of only those two variables, while the plot of the centers simply takes their first two coordinates, y1 and y2. Note that a given xi has no direct relationship with y1.
You must either 1) do a conversion to find the center coordinates corresponding to the plotted features, or 2) reduce the dimensionality of the data so that a 2D map carries information from all the variables.
For the first case I am not very sure, because I have never done it (remapping the data).
For dimensionality reduction, I recommend t-SNE (https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) or the classic PCA; a PCA-based sketch follows below.
Moral: if you want to see a 2D cluster plot, make sure you only have 2 variables (or 2 derived dimensions).
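For instance, a minimal sketch of the PCA option (reusing X, model, and clusters from your code; the axes then show the first two principal components rather than any single original feature, and scaling the features beforehand would usually be advisable):

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)                            # project the 14 features onto 2 components
centers_2d = pca.transform(model.cluster_centers_)     # project the centroids with the same mapping

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=clusters, cmap='Set2', s=10)
plt.scatter(centers_2d[:, 0], centers_2d[:, 1], color='blue', marker='*', s=250, label='centroid')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()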
I have a set of 1000+ x, y and z coordinates and I want to find how they cluster. I'd like to set a maximum distance specifying that points belong to the same cluster, i.e. if a point has a Euclidean distance of less than 1 from another point, the algorithm should cluster them together. I've tried to brute-force this in Python with little success; does anyone have any ideas, or a pre-established algorithm that does something similar?
Thanks in advance.
You can find quite a few clustering algorithms in module scikit-learn: https://scikit-learn.org/stable/modules/clustering.html
With your particular definition of clusters, it appears that sklearn.cluster.AgglomerativeClustering(n_clusters=None, distance_threshold=1) is exactly what you want.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
N = 1500
box_size = 10
points = np.random.rand(N, 2) * box_size
# array([[5.93688935, 6.63209391], [2.6182196 , 8.33040083], [4.35061433, 7.21399521], ..., [4.09271753, 2.3614302 ], [5.69176382, 1.78457418], [9.76504841, 1.38935121]])
clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=1).fit(points)
print('Number of clusters:', clustering.n_clusters_)
# Number of clusters: 224
An alternative approach would be to build a graph with an edge between every pair of points closer than the threshold, then take the connected components of that graph, using for instance the networkx module: networkx.algorithms.components.connected_components.
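A rough sketch of that graph-based approach (using the same points array; scipy's cKDTree is used here only to find all pairs closer than the threshold):

import networkx as nx
from scipy.spatial import cKDTree

# one node per point, one edge per pair of points closer than 1
tree = cKDTree(points)
graph = nx.Graph()
graph.add_nodes_from(range(len(points)))
graph.add_edges_from(tree.query_pairs(r=1.0))

components = list(nx.connected_components(graph))   # each component is one cluster of point indices
print('Number of clusters:', len(components))

Note that with this definition a cluster is any chain of points in which consecutive points are within distance 1 of each other, which matches the question's wording but can differ from the ward-linkage result above.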
Given 2000 random points sampled with numpy.random.normal(0, 1), I want to normalize them so that they lie on the unit circle. How do I do that?
I was asked to show my efforts. This is part of a larger question: write a program that samples 2000 points uniformly from the circumference of a unit circle, then plot them and show that they indeed lie on the circumference. To generate a point (x, y) on the circumference, sample (x, y) from the standard normal distribution and normalize it.
I'm almost certain my code isn't correct, but this is where I am up to. Any advice would be helpful.
This is the new, updated code, but it still doesn't seem to be working.
import numpy as np
import matplotlib.pyplot as plot

def plot():
    xy = np.random.normal(0, 1, (2000, 2))
    for i in range(2000):
        s = np.linalg.norm(xy[i,])
        xy[i,] = xy[i,] / s
    plot.plot(xy)
    plot.show()
I think the problem is in
plot.plot(xy)
even if I use
plot.plot(xy[:,0],xy[:,1])
it doesn't work.
Connected lines are not a good visualization here: you essentially connect random points on the circle, and since you do this many times you end up with a filled circle. Try drawing points instead.
Also avoid shadowing names. You import matplotlib.pyplot as plot and also name your function plot; this leads to a name conflict.
import numpy as np
import matplotlib.pyplot as plt

def plot():
    xy = np.random.normal(0, 1, (2000, 2))
    for i in range(2000):
        s = np.linalg.norm(xy[i,])
        xy[i,] = xy[i,] / s
    fig, ax = plt.subplots(figsize=(5, 5))
    # scatter draws dots instead of lines
    ax.scatter(xy[:, 0], xy[:, 1])
If you use dots instead, you will see that your points indeed lie on the unit circle.
Your code has a couple of problems:
Why use np.random.normal (a Gaussian distribution) when the problem text asks for uniform (flat) sampling?
To pick points on a circle you need to correlate x and y; sampling x and y independently will not give a point on the circle, because x**2 + y**2 must equal 1 (for the unit circle centered at (x=0, y=0)).
Two ways to handle the second point are to "project" a random point from [-1, 1] x [-1, 1] onto the unit circle, or to pick the angle uniformly and compute the point on the circle at that angle, as sketched below.
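For example, the angle-based option could look like this (a sketch; only numpy and matplotlib are assumed):

import numpy as np
import matplotlib.pyplot as plt

theta = np.random.uniform(0, 2 * np.pi, 2000)   # uniform angles
x, y = np.cos(theta), np.sin(theta)             # points exactly on the unit circle

plt.gca().set_aspect('equal')
plt.scatter(x, y, s=2)
plt.show()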
First of all, if you look at the documentation for numpy.random.normal (and, by the way, you could just use numpy.random.randn), it takes an optional size parameter, which lets you create as large of an array as you'd like. You can use this to get a large number of values at once. For example: xy = numpy.random.normal(0,1,(2000,2)) will give you all the values that you need.
At that point, you need to normalize them such that xy[:,0]**2 + xy[:,1]**2 == 1. This should be relatively trivial after computing what xy[:,0]**2 + xy[:,1]**2 is. Simply using norm on each dimension separately isn't going to work.
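A minimal sketch of that vectorised normalisation (continuing from the xy array above):

import numpy as np

xy = np.random.normal(0, 1, (2000, 2))
r = np.sqrt(xy[:, 0]**2 + xy[:, 1]**2)   # distance of each point from the origin
xy = xy / r[:, np.newaxis]               # now xy[:, 0]**2 + xy[:, 1]**2 == 1 for every row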
Usual boilerplate
import numpy as np
import matplotlib.pyplot as plt
generate the random sample with two rows, so that it's more convenient to refer to x's and y's
xy = np.random.normal(0,1,(2,2000))
Normalize the random sample using a library function to compute the norm. axis=0 means the norm is taken over the subarrays obtained by varying the first array index, so the result is a (2000,)-shaped array of per-point norms that broadcasts in xy /= ..., giving points with unit norm, hence lying on the unit circle:
xy /= np.linalg.norm(xy, axis=0)
Eventually, the plot. Here the key is the add_subplot() method, and in particular the keyword argument aspect='equal', which requires that the scale from user units to output units is the same for both axes:
plt.figure().add_subplot(111, aspect='equal').scatter(xy[0], xy[1])
plt.show()
to obtain a ring of 2000 points lying on the unit circle.