Clustering observations based first on an attribute and on distance matrix - python

I have a dataset with locations (coordinates) and a scalar attribute of each location (for example, temperature). I need to cluster the locations based on the scalar attribute, but taking into consideration the distance between locations.
The problem is that, using temperature as an example, it is possible for locations that are far from each other to have the same temperature. If I cluster on temperature, these locations will be in the same cluster, when they shouldn't. The opposite is true if two locations that are near each other have different temperatures. In this case, clustering on temperature may result in these observations being in different clusters, while clustering based on a distance matrix would put them in the same one.
So, is there a way in which I could cluster observations giving more importance to one attribute (temperature) and then "refining" based on the distance matrix?
Here is a simple example showing how clustering differs depending on whether an attribute is used as the basis or the distance matrix. My goal is to be able to use both, the attribute and the distance matrix, giving more importance to the attribute.
import numpy as np
import matplotlib.pyplot as plt
import haversine
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial import distance as ssd
# Create location data
x = np.random.rand(100, 1)
y = np.random.rand(100, 1)
t = np.random.randint(0, 20, size=(100,1))
# Compute distance matrix
D = np.zeros((len(x),len(y)))
for k in range(len(x)):
for j in range(len(y)):
distance_pair= haversine.distance((x[k], y[k]), (x[j], y[j]))
D[k,j] = distance_pair
# Compare clustering alternatives
Zt = linkage(t, 'complete')
Zd = linkage(ssd.squareform(D), method="complete")
# Cluster based on t
clt = fcluster(Zt, 5, criterion='distance').reshape(100,1)
plt.figure(figsize=(10, 8))
plt.scatter(x, y, c=clt)
plt.show()
# Cluster based on distance matrix
cld = fcluster(Zd, 10, criterion='distance').reshape(100,1)
plt.figure(figsize=(10, 8))
plt.scatter(x, y, c=cld)
plt.show()
haversine.py is available here: https://gist.github.com/rochacbruno/2883505
Thanks.

Related

K-Means algorithm Centroids are not placed in the clusters

I have a problem. I want to cluster my dataset. Unfortunately my centroids are not in the clusters but outside. I have already read
Python k-mean, centroids are placed outside of the clusters about this.
However, I do not know what could be the reason. How can I cluster correctly?
You can find the dataset at https://gist.githubusercontent.com/Coderanker3/24c948d2ff0b7f71e51b3774c2cc7b22/raw/253ba0660720de3a9cf7dee2a2d25a37f61095ca/Dataset
import pandas as pd
from sklearn.cluster import KMeans
from scipy.cluster import hierarchy
import seaborn as sns
from sklearn import metrics
from sklearn.metrics import silhouette_samples
import matplotlib as mpl
import matplotlib.pyplot as plt
df = pd.read_csv(r'https://gist.githubusercontent.com/Coderanker3/24c948d2ff0b7f71e51b3774c2cc7b22/raw/253ba0660720de3a9cf7dee2a2d25a37f61095ca/Dataset')
df.shape
features_clustering = ['review_scores_accuracy',
'distance_to_center',
'bedrooms',
'review_scores_location',
'review_scores_value',
'number_of_reviews',
'beds',
'review_scores_communication',
'accommodates',
'review_scores_checkin',
'amenities_count',
'review_scores_rating',
'reviews_per_month',
'corrected_price']
df_cluster = df[features_clustering].copy()
X = df_cluster.copy()
model = KMeans(n_clusters=4, random_state=53, n_init=10, max_iter=1000, tol=0.0001)
clusters = model.fit_predict(X)
df_cluster["cluster"] = clusters
fig = plt.figure(figsize=(8, 8))
sns.scatterplot(data=df_cluster, x="amenities_count", y="corrected_price", hue="cluster", palette='Set2_r')
sns.scatterplot(x=model.cluster_centers_[:,0], y=model.cluster_centers_[:,1], color='blue',marker='*',
label='centroid', s=250)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
#plt.ylim(ymin=0)
plt.xlim(xmin=-0.1)
plt.show()
model.cluster_centers_
inertia = model.inertia_
sil = metrics.silhouette_score(X,model.labels_)
print(f'inertia {inertia:.3f}')
print(f'silhouette {sil:.3f}')
[OUT]
inertia 4490.076
silhouette 0.156
The answer to your main question: the cluster centers are not outside of your clusters.
1 : You are clustering over 14 features shown in features_clustering list.
2 : You are viewing the clusters over a two-dimensional space, arbitrarily choosing amenities_count and corrected_price for the data and two coordinates for the cluster centers x=model.cluster_centers_[:,0], y=model.cluster_centers_[:,1] which don't correspond to the same features.
For these reasons you are going to get strange results; they really don't mean anything.
The bottom line is you cannot view 14 dimension clustering over two-dimensions.
To show point 2 more clearly, change the plotting of the clusters line to
sns.scatterplot(x=model.cluster_centers_[:,10], y=model.cluster_centers_[:,13], color='blue',marker='*', label='centroid', s=250)
to be plotting the cluster centers against the same features as the data.
The link to the SO answer about the cluster centers being outside of the cluster data is about scaling the data before clustering to be between 0 and 1, and then not scaling the cluster centers back up when plotting with the real data. This is not the same as your issues here.
You are making multidimensional clusters and you want them to fit a two-dimensional map, by itself it will not work. Let me explain, a variable is a dimension: x1,x2,x3,...,xn and if you find the clusters it will give you as a result y1,y2,y3,...,yn. If you map in 2D the result as you are doing, (I take your example)
x1 is "amenities_count", x5 is "corrected_price".
It will create a 2D map of only these two variables and surely the plotter, seeing that you use a 2D map, will only take the first two variables from cluster, y1 and y2 to plot. Note that xi has no direct relationship with y1.
You must: 1) do a conversion to find its corresponding x,y or 2) reduce the dimensionality of the data you are using to generate a 2D map with the information of all the variables.
For the first case, I am not very sure because I have never done it (Remapping the data).
But in the dimensionality reduction, I recommend you to use https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding or the classic PCA.
Moral: if you want to see a 2D cluster, make sure you only have 2 variables.

Spline in 3 Dimensions for Python

I'm looking for a function which mimics MATLAB's cscvn function in their Curve Fitting Toolbox, suitable for points in 3D space. The closest function I've found has been scipy.interpolate.splprep, which is capable of computing 3 dimensions but loses its accuracy with fewer data points. If smoothness is reduced to a point of fitting the points, the curve has kinks.
I have a discrete dataset made up of physical points (elevation data) that I'm looking to model, so the spline must pass through those points. There is a finite number of points at varying chord lengths from one another.
Here's a sample of the quick test function I've written to test Python splines. Unfortunately, I can't share my MATLAB code, but the cscvn function splines smoothly and passes through all data points.
import scipy as sp
import matplotlib.pyplot as plt
import numpy as np
from scipy.interpolate import splprep, splev, interp2d
x = np.linspace(0, 10, num = 20) #list of known x coordinates
y = 2*x #list of known y coordinates
z = x*x #list of known z coordinates
## Note: You must have more points than degree of the spline. if k = 3, must have 4 points min.
print([x,y,z])
tck, u = splprep([x,y,z], s = 26) # Generate function out of provided points, default k = 3
newPoints = splev(u, tck) # Creating spline points
print(newPoints)
ax = plt.axes(projection = "3d")
ax.plot3D(x, y, z, 'go') # Green is the actual 3D function
ax.plot3D(newPoints[:][0], newPoints[:][1], newPoints[:][2], 'r-') # Red is the spline
plt.show()
Here is an example of many points creating a smooth curve (red), but the line doesn't align with the physical data points (green).
Here is an example of kinks in the spline (red) created by too few data points (green). This is more akin to what my dataset looks like.
Change your U for:
unew = np.arange(0, 1.00, 0.005)

Find mode of kernel density estimate in GPS data

I'm analyzing GPS location data with weights indicating "importance". This can be easily plotted as heatmap, e.g. in google maps.
I would like to analyze this using the python data stack and, in particular, want to find the mode of a kernel density estimate.
How can I compute the mode of a KDE in python?
Very specifically, given the example at https://scikit-learn.org/stable/auto_examples/neighbors/plot_species_kde.html how would you find the location with highest probability of finding the "Bradypus variegatus" species?
Lets consider a simple example of getting kde-estimation:
import numpy as np
from scipy.stats import gaussian_kde
from pylab import plt
np.random.seed(10)
x = np.random.rand(100)
y = np.random.rand(100)
kde = gaussian_kde(np.vstack([x, y]))
X, Y = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
Z = kde(np.vstack([X.ravel(), Y.ravel()])).reshape(X.shape)
plt.contourf(X, Y, Z)
plt.show()
Now, we can get the coordinates X and Y, where Z takes its maximal value:
X.ravel()[np.argmax(Z.ravel())]
0.3535353535353536
Y.ravel()[np.argmax(Z.ravel())]
0.5555555555555556
In practice, when estimating locations of highest probability of occurrence some
species, you need not the only one position, but some area around it.
In this case, you can choose, for example, all locations, where the probability
is greater than 90 percentile of all possible probability values, e.g.
Y.ravel()[Z.ravel() > np.percentile(Z, 90)]
X.ravel()[Z.ravel() > np.percentile(Z, 90)]
In case of cited example, you can try the same approach to get desired result. Probably, you will need to tweak threshold value, e.g. choose 75-percentile instead of 90-percentile value.

How can you create a KDE from histogram values only?

I have a set of values that I'd like to plot the gaussian kernel density estimation of, however there are two problems that I'm having:
I only have the values of bars not the values themselves
I am plotting onto a categorical axis
Here's the plot I've generated so far:
The order of the y axis is actually relevant since it is representative of the phylogeny of each bacterial species.
I'd like to add a gaussian kde overlay for each color, but so far I haven't been able to leverage seaborn or scipy to do this.
Here's the code for the above grouped bar plot using python and matplotlib:
enterN = len(color1_plotting_values)
fig, ax = plt.subplots(figsize=(20,30))
ind = np.arange(N) # the x locations for the groups
width = .5 # the width of the bars
p1 = ax.barh(Species_Ordering.Species.values, color1_plotting_values, width, label='Color1', log=True)
p2 = ax.barh(Species_Ordering.Species.values, color2_plotting_values, width, label='Color2', log=True)
for b in p2:
b.xy = (b.xy[0], b.xy[1]+width)
Thanks!
How to plot a "KDE" starting from a histogram
The protocol for kernel density estimation requires the underlying data. You could come up with a new method that uses the empirical pdf (ie the histogram) instead, but then it wouldn't be a KDE distribution.
Not all hope is lost, though. You can get a good approximation of a KDE distribution by first taking samples from the histogram, and then using KDE on those samples. Here's a complete working example:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sts
n = 100000
# generate some random multimodal histogram data
samples = np.concatenate([np.random.normal(np.random.randint(-8, 8), size=n)*np.random.uniform(.4, 2) for i in range(4)])
h,e = np.histogram(samples, bins=100, density=True)
x = np.linspace(e.min(), e.max())
# plot the histogram
plt.figure(figsize=(8,6))
plt.bar(e[:-1], h, width=np.diff(e), ec='k', align='edge', label='histogram')
# plot the real KDE
kde = sts.gaussian_kde(samples)
plt.plot(x, kde.pdf(x), c='C1', lw=8, label='KDE')
# resample the histogram and find the KDE.
resamples = np.random.choice((e[:-1] + e[1:])/2, size=n*5, p=h/h.sum())
rkde = sts.gaussian_kde(resamples)
# plot the KDE
plt.plot(x, rkde.pdf(x), '--', c='C3', lw=4, label='resampled KDE')
plt.title('n = %d' % n)
plt.legend()
plt.show()
Output:
The red dashed line and the orange line nearly completely overlap in the plot, showing that the real KDE and the KDE calculated by resampling the histogram are in excellent agreement.
If your histograms are really noisy (like what you get if you set n = 10 in the above code), you should be a bit cautious when using the resampled KDE for anything other than plotting purposes:
Overall the agreement between the real and resampled KDEs is still good, but the deviations are noticeable.
Munge your categorial data into an appropriate form
Since you haven't posted your actual data I can't give you detailed advice. I think your best bet will be to just number your categories in order, then use that number as the "x" value of each bar in the histogram.
I have stated my reservations to applying a KDE to OP's categorical data in my comments above. Basically, as the phylogenetic distance between species does not obey the triangle inequality, there cannot be a valid kernel that could be used for kernel density estimation. However, there are other density estimation methods that do not require the construction of a kernel. One such method is k-nearest neighbour inverse distance weighting, which only requires non-negative distances which need not satisfy the triangle inequality (nor even need to be symmetric, I think). The following outlines this approach:
import numpy as np
#--------------------------------------------------------------------------------
# simulate data
total_classes = 10
sample_values = np.random.rand(total_classes)
distance_matrix = 100 * np.random.rand(total_classes, total_classes)
# Distances to the values itself are zero; hence remove diagonal.
distance_matrix -= np.diag(np.diag(distance_matrix))
# --------------------------------------------------------------------------------
# For each sample, compute an average based on the values of the k-nearest neighbors.
# Weigh each sample value by the inverse of the corresponding distance.
# Apply a regularizer to the distance matrix.
# This limits the influence of values with very small distances.
# In particular, this affects how the value of the sample itself (which has distance 0)
# is weighted w.r.t. other values.
regularizer = 1.
distance_matrix += regularizer
# Set number of neighbours to "interpolate" over.
k = 3
# Compute average based on sample value itself and k neighbouring values weighted by the inverse distance.
# The following assumes that the value of distance_matrix[ii, jj] corresponds to the distance from ii to jj.
for ii in range(total_classes):
# determine neighbours
indices = np.argsort(distance_matrix[ii, :])[:k+1] # +1 to include the value of the sample itself
# compute weights
distances = distance_matrix[ii, indices]
weights = 1. / distances
weights /= np.sum(weights) # weights need to sum to 1
# compute weighted average
values = sample_values[indices]
new_sample_values[ii] = np.sum(values * weights)
print(new_sample_values)
THE EASY WAY
For now, I am skipping any philosophical argument about the validity of using Kernel density in such settings. Will come around that later.
An easy way to do this is using scikit-learn KernelDensity:
import numpy as np
import pandas as pd
from sklearn.neighbors import KernelDensity
from sklearn import preprocessing
ds=pd.read_csv('data-by-State.csv')
Y=ds.loc[:,'State'].values # State is AL, AK, AZ, etc...
# With categorical data we need some label encoding here...
le = preprocessing.LabelEncoder()
le.fit(Y) # le.classes_ would be ['AL', 'AK', 'AZ',...
y=le.transform(Y) # y would be [0, 2, 3, ..., 6, 7, 9]
y=y[:, np.newaxis] # preparing for kde
kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(y)
# You can control the bandwidth so the KDE function performs better
# To find the optimum bandwidth for your data you can try Crossvalidation
x=np.linspace(0,5,100)[:, np.newaxis] # let's get some x values to plot on
log_dens=kde.score_samples(x)
dens=np.exp(log_dens) # these are the density function values
array([0.06625658, 0.06661817, 0.06676005, 0.06669403, 0.06643584,
0.06600488, 0.0654239 , 0.06471854, 0.06391682, 0.06304861,
0.06214499, 0.06123764, 0.06035818, 0.05953754, 0.05880534,
0.05818931, 0.05771472, 0.05740393, 0.057276 , 0.05734634,
0.05762648, 0.05812393, 0.05884214, 0.05978051, 0.06093455,
..............
0.11885574, 0.11883695, 0.11881434, 0.11878766, 0.11875657,
0.11872066, 0.11867943, 0.11863229, 0.11857859, 0.1185176 ,
0.11844852, 0.11837051, 0.11828267, 0.11818407, 0.11807377])
And these values are all you need to plot your Kernel Density over your histogram. Capito?
Now, on the theoretical side, if X is a categorical(*), unordered variable with c possible values, then for 0 ≤ h < 1
is a valid kernel. For an ordered X,
where |x1-x2|should be understood as how many levels apart x1 and x2 are. As h tends to zero, both of these become indicators and return a relative frequency counting. h is oftentimes referred to as bandwidth.
(*) No distance needs to be defined on the variable space. Doesn't need to be a metric space.
Devroye, Luc and Gábor Lugosi (2001). Combinatorial Methods in Density Estimation. Berlin: Springer-Verlag.

Which cluster does X[i] belong to?

I have a large data matrix X and I use SciPy implementation of Ward's hierarchical clustering like so:
Z = ward(X.todense())
fig = plt.figure(figsize=(25, 10))
dn = dendrogram(Z)
I now wish to see which classes X[i] belongs to. How can I do this?
From the linkage matrix Z you can get the clusters with scipy.cluster.hierarchy.fcluster.
First, I assume you want the same clusters as the colors of dendrogram. From the docs we can see that the color_threshold is set to 0.7*max(Z[:,2]) if nothing else is specified. So that is what we will use.
For example:
from sklearn.datasets import make_classification
from scipy.cluster.hierarchy import linkage, fcluster
X, y = make_classification(n_samples=10)
Z = linkage(X, method='ward')
thresh = 0.7*max(Z[:,2])
fcluster(Z, thresh, criterion='distance')
See also How to get flat clustering corresponding to color clusters in the dendrogram created by scipy

Categories

Resources