I built a Gaussian mixture model (GMM) and used it to run a prediction:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

bead = df['Ce140Di']
dna = df['DNA_1']
X = np.column_stack((dna, bead)) # create a 2D array from the two columns
#plt.scatter(X[:,0], X[:,1], s=0.5, c='black')
#plt.show()
gmm = GaussianMixture(n_components=4, covariance_type='tied')
gmm.fit(X)
labels = gmm.predict(X)
and then generated a plot as follows...
df['predicted_cluster'] = labels
fig = plt.figure()
colors = {1: 'red', 2: 'orange', 3: 'purple', 0: 'grey'}
plt.scatter(df['DNA_1'], df['Ce140Di'], c=df['predicted_cluster'].apply(lambda x: colors[x]), s=0.5, alpha=0.5)
plt.show()
[scatter plot colored by predictions]
Whilst I have the output prediction for each row of my df, I don't actually know which cluster it corresponds to without looking at my colors dictionary. Is there a way to do this without having to look at the scatter plot each time?
In other words, I want to know that 0 will always correspond to my grey cluster or that 1 will always be the red cluster but this changes each time...
Colors aside, how do I know the position of each cluster? What does a label of 0 mean?
EDIT: I believe the answer to my perhaps silly question is to use np.random.seed, but I could be wrong...
Hello Hajar,
I think the answer to your question will disappoint you. I assume each Gaussian in your GMM is initialised to some random mean and variance. If you set a random seed then you could be reasonably certain that the resultant clusters will always be the same.
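For example, a minimal sketch along those lines (scikit-learn exposes the seed through the estimator's random_state parameter, so you don't need to touch the global numpy seed):

from sklearn.mixture import GaussianMixture

# Fixing random_state makes the initialisation, and therefore the labelling,
# reproducible across runs on the same data.
gmm = GaussianMixture(n_components=4, covariance_type='tied', random_state=42)
labels = gmm.fit_predict(X)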
With that said, in multi-label scenarios without a random seed there are (to my knowledge) no clustering algorithms that guarantee which label is assigned to each cluster.
Clustering algorithms assign labels arbitrarily. The only guarantee any clustering algorithm makes about a point assigned a certain label is that it is similar to other points with the same label by some metric.
This makes measuring the accuracy of clustering algorithms quite challenging. Hence the existence of metrics like the Adjusted Mutual Information Score and the Adjusted Rand Index.
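To illustrate what those metrics measure, here is a small toy example (the labelings are made up): both scores are invariant to how the cluster IDs are permuted, so they compare groupings rather than the label values themselves.

from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

run_a = [0, 0, 1, 1, 2, 2]
run_b = [2, 2, 0, 0, 1, 1]  # the same grouping under different label names

print(adjusted_rand_score(run_a, run_b))         # 1.0
print(adjusted_mutual_info_score(run_a, run_b))  # 1.0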
You could account for this with a sort of semi-supervised approach, in which you force a particular point to start with a "ground-truth" label and hope your algorithm centres a cluster on it, but even then there may be variance.
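If you do want to nudge the labelling in that direction, GaussianMixture accepts a means_init argument; a rough sketch (the starting means here are invented, you would pick points you consider representative of each cluster) could look like:

import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical starting means, one row per component; the row order then
# determines which label each cluster tends to receive.
init_means = np.array([[1.0, 2.0],
                       [3.0, 1.5],
                       [5.0, 4.0],
                       [0.5, 0.5]])

gmm = GaussianMixture(n_components=4, covariance_type='tied',
                      means_init=init_means, random_state=0)
labels = gmm.fit_predict(X)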
Good luck, and I hope this helps.
Related
I have a problem. I want to cluster my dataset, but unfortunately my centroids end up outside of the clusters rather than inside them. I have already read the question "Python k-mean, centroids are placed outside of the clusters" about this. However, I do not know what the reason could be. How can I cluster correctly?
You can find the dataset at https://gist.githubusercontent.com/Coderanker3/24c948d2ff0b7f71e51b3774c2cc7b22/raw/253ba0660720de3a9cf7dee2a2d25a37f61095ca/Dataset
import pandas as pd
from sklearn.cluster import KMeans
from scipy.cluster import hierarchy
import seaborn as sns
from sklearn import metrics
from sklearn.metrics import silhouette_samples
import matplotlib as mpl
import matplotlib.pyplot as plt
df = pd.read_csv(r'https://gist.githubusercontent.com/Coderanker3/24c948d2ff0b7f71e51b3774c2cc7b22/raw/253ba0660720de3a9cf7dee2a2d25a37f61095ca/Dataset')
df.shape
features_clustering = ['review_scores_accuracy',
'distance_to_center',
'bedrooms',
'review_scores_location',
'review_scores_value',
'number_of_reviews',
'beds',
'review_scores_communication',
'accommodates',
'review_scores_checkin',
'amenities_count',
'review_scores_rating',
'reviews_per_month',
'corrected_price']
df_cluster = df[features_clustering].copy()
X = df_cluster.copy()
model = KMeans(n_clusters=4, random_state=53, n_init=10, max_iter=1000, tol=0.0001)
clusters = model.fit_predict(X)
df_cluster["cluster"] = clusters
fig = plt.figure(figsize=(8, 8))
sns.scatterplot(data=df_cluster, x="amenities_count", y="corrected_price", hue="cluster", palette='Set2_r')
sns.scatterplot(x=model.cluster_centers_[:,0], y=model.cluster_centers_[:,1], color='blue',marker='*',
label='centroid', s=250)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
#plt.ylim(ymin=0)
plt.xlim(xmin=-0.1)
plt.show()
model.cluster_centers_
inertia = model.inertia_
sil = metrics.silhouette_score(X,model.labels_)
print(f'inertia {inertia:.3f}')
print(f'silhouette {sil:.3f}')
[OUT]
inertia 4490.076
silhouette 0.156
The answer to your main question: the cluster centers are not outside of your clusters.
1: You are clustering over the 14 features listed in features_clustering.
2: You are viewing the clusters over a two-dimensional space, arbitrarily choosing amenities_count and corrected_price for the data, but the two coordinates you take for the cluster centers, x=model.cluster_centers_[:,0] and y=model.cluster_centers_[:,1], do not correspond to those same features.
For these reasons you are going to get strange results; they really don't mean anything.
The bottom line is you cannot view 14 dimension clustering over two-dimensions.
To show point 2 more clearly, change the line that plots the cluster centers to
sns.scatterplot(x=model.cluster_centers_[:,10], y=model.cluster_centers_[:,13], color='blue',marker='*', label='centroid', s=250)
so that the cluster centers are plotted against the same features as the data.
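A small sketch to avoid hard-coding the column positions: look the indices up from the same feature list that was used for clustering (features_clustering from the question).

ix = features_clustering.index('amenities_count')  # 10
iy = features_clustering.index('corrected_price')  # 13
sns.scatterplot(x=model.cluster_centers_[:, ix], y=model.cluster_centers_[:, iy],
                color='blue', marker='*', label='centroid', s=250)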
The link to the SO answer about the cluster centers being outside of the cluster data is about scaling the data before clustering to be between 0 and 1, and then not scaling the cluster centers back up when plotting with the real data. This is not the same as your issues here.
You are building clusters in a multidimensional space and then trying to fit them onto a two-dimensional map; by itself that will not work. Each variable is a dimension, x1, x2, x3, ..., xn, and each cluster centre you get back is likewise an n-dimensional point y1, y2, y3, ..., yn. If you map the result in 2D the way you are doing, with amenities_count and corrected_price as the data axes, the plot of the cluster centres simply takes their first two coordinates, y1 and y2, which have no direct relationship with the two features you chose for the data.
You must either 1) map the cluster centres back onto the same two features you are plotting, or 2) reduce the dimensionality of the data so that a 2D map carries information from all of the variables.
For the first case I am not very sure, because I have never done it (remapping the data). For dimensionality reduction, I recommend t-SNE (https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) or classic PCA.
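As a rough sketch of the PCA route (reusing X, clusters and model from the question's code): project both the data and the cluster centres into the same 2D space, so the plotted centres sit inside the clusters they belong to.

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)                          # data projected onto 2 components
centers_2d = pca.transform(model.cluster_centers_)   # centres projected the same way

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=clusters, s=5, cmap='Set2')
plt.scatter(centers_2d[:, 0], centers_2d[:, 1], color='blue', marker='*', s=250, label='centroid')
plt.legend()
plt.show()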
Moral: if you want to see a 2D cluster, make sure you only have 2 variables.
I am plotting both a distribution of test scores and a fitted curve to these test scores:
h = sorted(data['Baseline']) #sorted
fit = stats.norm.pdf(h, np.mean(h), np.std(h))
plt.plot(h,fit,'-o')
plt.hist(h,normed=True) #use this to draw histogram of your data
plt.show()
The plot of the pdf, however, does not look normal (see kink in curve near x=60). See output:
I'm not sure what is going on here...any help appreciated. Is this because the normal line is being drawn between supplied observations? Can provide you with the actual data if needed, there are only 60 observations.
Yes, you are evaluating the normal pdf only at the observations. You would instead want to evaluate it on a denser, evenly spaced grid, like
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

h = np.sort(data['Baseline'])           # sorted observations, as an array
x = np.linspace(h.min(), h.max(), 151)  # dense, evenly spaced grid over the data range
fit = stats.norm.pdf(x, np.mean(h), np.std(h))

plt.plot(x, fit, '-')
plt.hist(h, density=True)               # `normed` is deprecated in favour of `density`
plt.show()
Note however, that the data does not look normally distributed at all. So potentially you would rather fit a different distribution, or maybe perform a kernel density estimate.
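If you want to try the kernel density estimate mentioned above, a minimal sketch with scipy's gaussian_kde (using h and x from the snippet above) would be:

from scipy import stats
import matplotlib.pyplot as plt

kde = stats.gaussian_kde(h)        # bandwidth chosen automatically (Scott's rule)
plt.plot(x, kde(x), '-')           # evaluate the KDE on the same dense grid
plt.hist(h, density=True)
plt.show()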
I have a set of 3d coordinates (x,y,z) to which I would like to fit a space curve. Does anyone know of existing routines for this in Python?
From what I have found (https://docs.scipy.org/doc/scipy/reference/interpolate.html), there are existing modules for fitting a curve to a set of 2d coordinates, and others for fitting a surface to a set of 3d coordinates. I want the middle path - fitting a curve to a set of 3d coordinates.
EDIT --
I found an explicit answer to this on another post here, using interpolate.splprep() and interpolate.splev(). Here are my data points:
import numpy as np
data = np.array([[21.735556483642707, 7.9999120559310359, -0.7043281314370935],
[21.009401429607784, 8.0101161320825103, -0.16388503829177037],
[20.199370045383134, 8.0361339131845497, 0.25664085801558179],
[19.318149385194054, 8.0540100864979447, 0.50434139043379278],
[18.405497793567243, 8.0621753888918484, 0.57169888018720161],
[17.952649703401562, 8.8413995204241491, 0.39316793526155014],
[17.539007529982641, 9.6245700151356104, 0.14326173861202204],
[17.100154581079089, 10.416295524018977, 0.011339000091976647],
[16.645143439968102, 11.208477191735446, 0.070252116425261066],
[16.198247656768263, 11.967005154933993, 0.31087815045809558],
[16.661378578010989, 12.717314230004659, 0.54140549139204996],
[17.126106263351478, 13.503461982612732, 0.57743407626794219],
[17.564249250974573, 14.28890107482801, 0.42307198199366186],
[17.968265052275274, 15.031985807202176, 0.10156997950061938]])
Here is my code:
from scipy import interpolate
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
data = data.transpose()
#now we get all the knots and info about the interpolated spline
tck, u = interpolate.splprep(data, k=5)
#here we generate the new interpolated dataset;
#increase the resolution by increasing the number of samples, 500 in this example
new = interpolate.splev(np.linspace(0,1,500), tck, der=0)
#now lets plot it!
fig = plt.figure()
ax = Axes3D(fig)
ax.plot(data[0], data[1], data[2], label='originalpoints', lw =2, c='Dodgerblue')
ax.plot(new[0], new[1], new[2], label='fit', lw =2, c='red')
ax.legend()
plt.savefig('junk.png')
plt.show()
This is the image:
You can see that the fit is not good, even though I am already using the maximum allowed spline order (k=5). Is this because the curve is not fully convex? Does anyone know how I can improve the fit?
Depends on what the points represent, but if it's just position data, you could use a kalman filter such as this one written in python. You could just query the kalman filter at any time to get the "expected point" at that time, so it would work just like a function of time.
If you do plan to use a kalman filter, just set the initial estimate to your first coordinate, and set your covariance to be a diagonal matrix of huge numbers, this will indicate that you are very uncertain about the position of your next point, which will quickly lock the filter onto your coordinates.
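A minimal numpy-only sketch of that idea (this is not the linked filter implementation; it assumes a constant-velocity model, treats the row index as time, and uses the huge initial covariance described above):

import numpy as np

def kalman_track(points, q=1e-3, r=1e-2):
    n = points.shape[1]                              # 3 for (x, y, z)
    F = np.eye(2 * n); F[:n, n:] = np.eye(n)         # position += velocity each step
    H = np.hstack([np.eye(n), np.zeros((n, n))])     # we only observe position
    x = np.zeros(2 * n); x[:n] = points[0]           # initial estimate = first coordinate
    P = np.eye(2 * n) * 1e6                          # huge covariance: very uncertain start
    Q = np.eye(2 * n) * q                            # process noise
    R = np.eye(n) * r                                # measurement noise
    out = []
    for z in points:
        x = F @ x                                    # predict
        P = F @ P @ F.T + Q
        S = H @ P @ H.T + R                          # update
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (z - H @ x)
        P = (np.eye(2 * n) - K @ H) @ P
        out.append(x[:n].copy())
    return np.array(out)

smoothed = kalman_track(data)   # `data` as the (N, 3) array of points (before the transpose above)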
You'd want to stay away from interpolating-spline methods, because an interpolating spline will always pass exactly through your data points.
You can fit a curve to data of any dimension. The curve fitting / optimization algorithms (say, in scipy.optimize) all treat the observations you want to model as a plain 1-d array, and do not care what the independent variables are. If you flatten your 3d data, each value will correspond to an (x, y, z) tuple. You can just pass that information along as "extra" data to your fitting routine to help you calculate the model curve that will be fitted to your data.
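One concrete way to get a fitted (rather than interpolated) curve in that spirit, sketched below: parameterize the points by cumulative chord length t and fit each coordinate separately as a low-degree polynomial in t (just one possible parameterization, not the only approach).

import numpy as np

pts = data                                            # the (N, 3) array of points from the question
seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)    # distances between consecutive points
t = np.concatenate([[0.0], np.cumsum(seg)])
t /= t[-1]                                            # chord-length parameter in [0, 1]

degree = 3                                            # low degree keeps the curve smooth
coeffs = [np.polyfit(t, pts[:, i], degree) for i in range(3)]

t_fine = np.linspace(0.0, 1.0, 500)
curve = np.column_stack([np.polyval(c, t_fine) for c in coeffs])   # (500, 3) fitted curve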
My data is 2250 x 100. I would like to plot the output, like http://glowingpython.blogspot.com/2012/04/k-means-clustering-with-scipy.html. However, the problem is that all the examples use only a small number of clusters, usually 2 or 3. How would you plot the output of kmeans in scipy if you wanted more clusters, like 100?
Here's what I got:
from scipy.cluster.vq import kmeans, vq

#get the centroids
centroids,_ = kmeans(data,100)
idx,_ = vq(data,centroids)
#do some plotting here...
Maybe with 10 colors and 10 point types?
Or you could plot each in a 10 x 10 grid of plots. The first would show the relationships better. The second would allow easier inspection of an arbitrary cluster.
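A rough sketch of the "10 colors and 10 point types" idea (assuming data and idx from the snippet above, and plotting just the first two columns of the data for illustration):

import matplotlib.pyplot as plt

colors = plt.get_cmap('tab10').colors                          # 10 distinct colors
markers = ['o', 's', '^', 'v', 'D', 'P', 'X', '*', '+', 'x']   # 10 marker shapes

fig, ax = plt.subplots(figsize=(8, 8))
for k in range(100):
    pts = data[idx == k]
    ax.scatter(pts[:, 0], pts[:, 1], s=5,
               color=colors[k % 10], marker=markers[k // 10])
plt.show()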
I have plotted some experimental data in Python and need to find a cubic fit to the data. I need this because the cubic fit will be used to remove the background (in this case the resistance in a diode), leaving the evident features. Here is the code I am currently using to make the cubic fit in the first place, where Vnew and yone are arrays of the experimental data.
from numpy import array
from scipy.optimize import curve_fit
from pylab import plot, legend, ylim   # assuming pylab-style plotting imports

answer1 = raw_input('Cubic Plot attempt?\n ')
if answer1 in ['y', 'Y', 'Yes']:
    def cubic(x, A):
        return A*x**3
    cubic_guess = array([40])
    popt, pcov = curve_fit(cubic, Vnew, yone, cubic_guess)
    plot(Vnew, cubic(Vnew, *popt), 'r-', label='Cubic Fit: curve_fit')
    #ylim(-0.05,0.05)
    legend(loc='best')
    print 'Cubic plotted'
else:
    print 'No Cubic Removal done'
I have knowledge of curve smoothing but only in theory. I do not know how to implement it. I would really appreciate any assistance.
Here is the graph generated so far:
To make the fitted curve "wider", you're looking for extrapolation. Although in this case, you could just make Vnew cover a larger interval, in which case you'd put this before your plot command:
Vnew = numpy.linspace(-1,1, 256) # min and max are merely an example, based on your graph
plot(Vnew,cubic(Vnew,*popt),'r-',label='Cubic Fit: curve_fit')
"Blanking out" the feature you see, can be done with numpy's masked arrays but also just by removing those elements you don't want from both your original Vnew (which I'll call xone) and yone:
mask = (xone > 0.1) & (xone < 0.35) # values between these voltages (?) need to be removed
xone = xone[numpy.logical_not(mask)]
yone = yone[numpy.logical_not(mask)]
Then redo the curve fitting:
popt,_ = curve_fit(cubic, xone, yone, cubic_guess)
This will have fitted only to the data that was actually there (which aren't that many points in your dataset, from the looks of it, so beware!).