From my research, only single-linkage hierarchical clustering can obtain optimal clusters. This is also known as SLINK. The libraries were originally published in C++ and are now available in Python/R.
So far, following the steps in the documentation, I managed to come up with:
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist
## generating random integers from 20 to 89, and storing them in a dataframe. This is 1-dimensional data
np.random.seed(1)
df = pd.DataFrame(np.random.randint(20,90,size=(100,1)), columns = list('A'))
df = df.sort_values(by=['A'])
df = df.values
df[:,0].sort()
## getting condensed distance matrix
d = pdist(df, metric='euclidean')
## running the SLINK algorithm
Z = linkage(d, 'single')
I understand that Z is a 'hierarchical clustering encoded as a linkage matrix' (as written in the documentation), but how do I go back to my original data set and identify the clusters calculated by this result?
I could get a clustering result with scikit-learn, but I think its clustering algorithms are not optimal, hence I turned to this SLINK algorithm. It would be much appreciated if someone could help me with this.
From scipy.cluster.hierarchy.linkage you get back how clusters are formed with each iteration.
Normally this information is not so useful, so we can look at the clustering first:
import scipy.cluster.hierarchy
import matplotlib.pyplot as plt
plt.figure()
dn =scipy.cluster.hierarchy.dendrogram(Z)
If we want to get the three clusters, we can do:
labels = scipy.cluster.hierarchy.fcluster(Z,3,'maxclust')
If you want to get it by distance between the data points:
scipy.cluster.hierarchy.fcluster(Z,2,'distance')
This gives about the same result as calling for 3 clusters because there are not many ways to cut this example dataset.
If you look at your example, the next point where you can cut it is at height ~1.5, which gives 16 clusters. So if you try scipy.cluster.hierarchy.fcluster(Z,5,'maxclust'), you get the same result as for 3 clusters. If you have a more spread-out dataset, it will work:
np.random.seed(111)
df = np.random.normal(0,1,(50,3))
## getting condensed distance matrix
d = pdist(df, metric='euclidean')
Z = linkage(d, 'single')
dn = scipy.cluster.hierarchy.dendrogram(Z,above_threshold_color='black',color_threshold=1.1)
Then this works:
scipy.cluster.hierarchy.fcluster(Z,5,'maxclust')
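To tie the flat cluster labels back to the original rows, one option is to store the fcluster output as a new column, since it returns one label per observation in the same order as the rows passed to pdist. A minimal sketch, assuming the question's data; the column name cluster and the summary table are illustrative choices, not part of the original code:

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Same 1-dimensional data as in the question
np.random.seed(1)
df = pd.DataFrame(np.random.randint(20, 90, size=(100, 1)), columns=["A"])
df = df.sort_values(by="A").reset_index(drop=True)

# Single-linkage clustering on the condensed distance matrix
Z = linkage(pdist(df[["A"]].to_numpy(), metric="euclidean"), "single")

# One label per row, in row order, so it can be attached directly
df["cluster"] = fcluster(Z, 3, criterion="maxclust")

# Value range and size of each cluster
summary = df.groupby("cluster")["A"].agg(["min", "max", "count"])
print(summary)
```

Because the labels follow row order, the data should not be re-sorted between calling pdist and attaching the labels.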
Say I have the following dataframe stored as a variable called coordinates, where the first few rows look like:
business_lat business_lng business_rating
0 19.111841 72.910729 5.
1 19.111342 72.908387 5.
2 19.111342 72.908387 4.
3 19.137815 72.914085 5.
4 19.119677 72.905081 2.
5 19.119677 72.905081 2.
. . .
. . .
. . .
As you can see, this data is geospatial (has a lat and a lng) AND every row has an additional value, business_rating, that corresponds to the rating of the business at the latlng in that row. I want to cluster the data so that businesses that are nearby and have similar ratings are assigned to the same cluster. Essentially I need a geospatial clustering with the additional requirement that the clustering must consider the rating column.
I've looked online and can't really find much that addresses this: only things for strict geospatial clustering (the only features to cluster on are latlng) or non-spatial clustering.
I have a simple DBSCAN running below, but when I plot the results of the clustering it does not seem to be doing what I want.
from sklearn.cluster import DBSCAN
import numpy as np
db = DBSCAN(eps=2/6371., min_samples=5, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates))
Would I be better served by tweaking the parameters of the DBSCAN, doing some additional processing of the data, or using a different approach altogether?
The tricky part about clustering two different types of information (location and rating) is determining how they should relate to each other. It's simple when there is just one domain and you are comparing the same units. My approach would be to look at how to relate rows within a domain and then determine some interaction between the domains. This could be done using scaling options like the MinMaxScaler mentioned in the previous answer. However, I think this is a bit heavy-handed, and we could use our knowledge of the domains to cluster better.
Handling Location
Location distance is best handled directly as this has real world meaning that we can precalculate distances for. The meaning of meters apart is direct to what we
You could use the scaling option mentioned in the previous answer but this risks distorting the location data. For example, if you have a long and thin set of locations, MinMaxScaling would give more importance to variation on the thin axis than the long axis. If you are going to use scaling, do it on the computed distance matrix, not on the lat lon themselves.
import numpy as np
from sklearn.metrics.pairwise import haversine_distances
points_in_radians = df[['business_lat','business_lng']].apply(np.radians).values
distances_in_km = haversine_distances(points_in_radians) * 6371
Adding in Rating
We can think of the problem through asking a couple of questions that relate rating to distance. We could ask, how different must ratings be to separate observations in the same place? What is the meter difference to rating difference ratio? With an idea of ratio, we can calculate another distance matrix for the rating difference for all observations and use this to scale or add on the original location distance matrix or we could increase the distance for every gap in rating. This location-plus-ratings-difference matrix can then be clustered on.
from sklearn.metrics.pairwise import euclidean_distances
added_km_per_rating_gap = 1
rating_distances = euclidean_distances(df[['business_rating']].values) * added_km_per_rating_gap
We can then simply add these together and cluster on the resulting matrix.
from sklearn.cluster import DBSCAN
distance_matrix = rating_distances + distances_in_km
clustering = DBSCAN(metric='precomputed', eps=1, min_samples=2)
clustering.fit(distance_matrix)
What we have done is cluster by location, adding a penalty for ratings difference. Making that penalty direct and controllable allows for optimisation to find the best clustering.
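The pieces above can be put together into one small function. This is only a sketch under stated assumptions: the six rows are the sample rows shown in the question, the function name cluster_location_and_rating and its parameter names are hypothetical, and the values of 1 km per rating gap and eps of 1 km are the same illustrative values used above, not tuned ones.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import haversine_distances

# Sample rows copied from the question
coordinates = pd.DataFrame({
    "business_lat": [19.111841, 19.111342, 19.111342, 19.137815, 19.119677, 19.119677],
    "business_lng": [72.910729, 72.908387, 72.908387, 72.914085, 72.905081, 72.905081],
    "business_rating": [5.0, 5.0, 4.0, 5.0, 2.0, 2.0],
})

def cluster_location_and_rating(frame, km_per_rating_gap=1.0, eps_km=1.0, min_samples=2):
    """Hypothetical helper: DBSCAN on great-circle distance plus a rating penalty."""
    radians = np.radians(frame[["business_lat", "business_lng"]].to_numpy())
    km = haversine_distances(radians) * 6371.0  # Earth radius in km
    ratings = frame["business_rating"].to_numpy(dtype=float)
    rating_gap = np.abs(ratings[:, None] - ratings[None, :])
    combined = km + km_per_rating_gap * rating_gap  # penalty expressed in km
    model = DBSCAN(metric="precomputed", eps=eps_km, min_samples=min_samples)
    return model.fit_predict(combined)

labels = cluster_location_and_rating(coordinates)
print(labels)  # -1 marks points DBSCAN treats as noise
```

Raising km_per_rating_gap makes rating differences count for more relative to physical distance, which is the single knob this approach exposes for tuning.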
Testing
The problem I'm finding is that (with my test data at least) DBSCAN has a tendency to 'walk' from observation to observation forming clusters that either blend ratings together because the penalty is not high enough or separates into single rating groups. It might be that DBSCAN is not suitable for this type of clustering. If I had more time, I would look for some open data to test this on and try other clustering methods.
Here is the code I used to test. I used the square of the ratings distance to emphasise larger gaps.
import random
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=6, cluster_std=0.60, random_state=0)
ratings = np.array([random.randint(1,4) for _ in range(len(X)//2)] \
+[random.randint(2,5) for _ in range(len(X)//2)]).reshape(-1, 1)
distances_in_km = euclidean_distances(X)
rating_distances = euclidean_distances(ratings)
def build_clusters(multiplier, eps):
    rating_addition = (rating_distances ** 2) * multiplier
    distance_matrix = rating_addition + distances_in_km
    clustering = DBSCAN(metric='precomputed', eps=eps, min_samples=10)
    clustering.fit(distance_matrix)
    return clustering.labels_
Using the DBSCAN methodology, we can calculate the distance between points (the Euclidean distance or some other distance) and look for points which are far away from others. You may want to consider using the MinMaxScaler to normalize values, so one feature doesn't overwhelm other features.
Where is your code and what are your final results? Without an actual code sample, I can only guess what you are doing.
I hacked together some sample code for you. You can see the results below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import seaborn as sns; sns.set()
import csv
df = pd.read_csv('C:\\your_path_here\\business.csv')
X=df.loc[:,['review_count','latitude','longitude']]
K_clusters = range(1,10)
kmeans = [KMeans(n_clusters=i) for i in K_clusters]
Y_axis = df[['latitude']]
X_axis = df[['longitude']]
score = [kmeans[i].fit(Y_axis).score(Y_axis) for i in range(len(kmeans))]
# Visualize
plt.plot(K_clusters, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
kmeans = KMeans(n_clusters = 3, init ='k-means++')
coords = X[['latitude', 'longitude']] # Cluster on the same two columns that are plotted below.
kmeans.fit(coords) # Compute k-means clustering.
X['cluster_label'] = kmeans.fit_predict(coords)
centers = kmeans.cluster_centers_ # Coordinates of cluster centers (latitude, longitude).
labels = kmeans.predict(coords) # Labels of each point
X.head(10)
X.plot.scatter(x = 'latitude', y = 'longitude', c=labels, s=50, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
from scipy.stats import zscore
df["zscore"] = zscore(df["review_count"])
df["outlier"] = df["zscore"].apply(lambda x: x <= -2.5 or x >= 2.5)
df[df["outlier"]]
df_cord = df[["latitude", "longitude"]]
df_cord.plot.scatter(x = "latitude", y = "longitude")
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_cord = scaler.fit_transform(df_cord)
df_cord = pd.DataFrame(df_cord, columns = ["latitude", "longitude"])
df_cord.plot.scatter(x = "latitude", y = "longitude")
from sklearn.cluster import DBSCAN
outlier_detection = DBSCAN(
eps = 0.5,
metric="euclidean",
min_samples = 3,
n_jobs = -1)
clusters = outlier_detection.fit_predict(df_cord)
clusters
# plt.get_cmap is used here because recent Matplotlib releases no longer provide cm.get_cmap
cmap = plt.get_cmap('Accent')
df_cord.plot.scatter(
x = "latitude",
y = "longitude",
c = clusters,
cmap = cmap,
colorbar = False
)
The final result looks a little weird, to tell you the truth. Remember, not everything is clusterable.
I have a Python script which does clustering over a data file in svmlight format.
I use the function sklearn.datasets.load_svmlight_file to load the data from the data file.
I know that this function returns a sparse matrix.
I need to scatter plot the clusters. Can anybody help me, please?
This is what I have done:
import sklearn.datasets
import sys
from sklearn.cluster import KMeans
dataFilename = sys.argv[1]
X, y = sklearn.datasets.load_svmlight_file(dataFilename)
kmeans = KMeans(n_clusters = 3)
kmeans.fit(X)
labels = kmeans.labels_
print(labels)
centroids = kmeans.cluster_centers_
Without having the dataset, I would suggest the following:
Since load_svmlight_file() returns a sparse matrix, turn X into a NumPy array using samples = X.toarray() prior to fitting the model.
Plot two features (for example) of the dataset using:
plt.scatter(samples[:,0], samples[:,1], c=labels). This colours the clusters by their predicted labels.
Follow this with plt.scatter(centroids[:,0], centroids[:,1], marker='D') to see the location of the centroids with diamonds.
Note that samples[:,n] represents an array containing the sample values for the nth feature of the dataset.
I hope this helps. If not, please let me know.
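Put together, the steps might look like the sketch below. Since the data file is not available, it assumes synthetic blobs wrapped in a SciPy sparse matrix as a stand-in for the output of load_svmlight_file. It also swaps in TruncatedSVD to project to two dimensions instead of picking two raw features, because TruncatedSVD accepts sparse input directly; that projection is an alternative technique, not what the answer above describes. The output file name clusters.png is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # render to a file; drop this line for interactive use
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import TruncatedSVD

# Synthetic stand-in for: X, y = load_svmlight_file(dataFilename)
dense, _ = make_blobs(n_samples=150, centers=3, n_features=10, random_state=0)
X = csr_matrix(dense)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# Project samples and centroids onto the same two components
svd = TruncatedSVD(n_components=2, random_state=0)
points_2d = svd.fit_transform(X)
centroids_2d = svd.transform(kmeans.cluster_centers_)

fig, ax = plt.subplots()
ax.scatter(points_2d[:, 0], points_2d[:, 1], c=labels, s=20)
ax.scatter(centroids_2d[:, 0], centroids_2d[:, 1], marker="D", c="black", s=80)
fig.savefig("clusters.png")
```

With many features, two arbitrarily chosen columns may show little of the cluster structure, which is the reason for projecting first.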
I have a matrix of pairwise differences between samples. I would like to label each sample as being part of a cluster, named by cluster size, where clusters are defined by an absolute cutoff in the matrix values, e.g. all those clusters with a difference of zero from each other.
Mock data:
# Load packages
import numpy as np
import pandas as pd
import seaborn as sns
## Generate fake data
# matrix
d = {'sample_A': [0,2,0,1,1,2,2,1], 'sample_B': [2,0,2,3,3,0,0,3], 'sample_C': [0,2,0,1,1,2,2,1], 'sample_D': [1,3,1,0,2,3,3,1],
'sample_E': [1,3,1,2,0,3,3,1], 'sample_F': [2,0,2,3,3,0,0,3], 'sample_G': [2,0,2,3,3,0,0,3], 'sample_H': [1,3,1,1,1,3,3,0]}
idx = ["sample_A","sample_B","sample_C","sample_D","sample_E", "sample_F", "sample_G", "sample_H"]
df = pd.DataFrame(data=d,index=idx)
df
# Visualise heatmap (this isn't directly needed for this output)
g = sns.clustermap(df, cmap="coolwarm_r")
g
# Desired output
d = {'cluster_zero': [2,1,2,"NA","NA",1,1,"NA"]}
df3 = pd.DataFrame(data=d,index=idx)
df3
So the output labels each sample as belonging to a cluster defined as having zero pairwise difference in the matrix, and names the cluster in order of size from largest to smallest. In this case, samples B, F and G all have zero differences, so get put in cluster 1. Samples A and C also have zero differences from each other, and as that cluster is smaller than B/F/G they are cluster 2. There are no other samples with zero differences in this case, so the other samples don't get a cluster.
Ideally, I would like to be able to control the threshold of difference I used to define clusters, e.g. run the script again but using a threshold of <1 or <2 rather than zero.
There are various questions similar to this (e.g. Extracting clusters from seaborn clustermap), but they seem to use metrics of calculating distance rather than the absolute count in the matrix. Another similar question is: generating numerical clusters from matrix values of a minimal size but this counts the size of each cluster, which is different to the output I want.
Thanks for your help.
I found the answer to my question in this blog:
https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/
The solution I've done is:
from scipy.cluster import hierarchy
import matplotlib.pyplot as plt
df_arr = np.asarray(df)
Z = hierarchy.linkage(df_arr, 'single')
plt.figure()
dn = hierarchy.dendrogram(Z)
from scipy.cluster.hierarchy import fcluster
# Set maximum threshold for difference e.g. 1
max_d = 1
clusters = fcluster(Z, max_d, criterion='distance')
clusters
Then I turn the clusters array into a dataframe and pd.concat it onto the distances dataframe, then extract the list of sample names with their cluster. Lastly, I keep only clusters with e.g. >2 samples in each cluster:
result = result.groupby('cluster').filter(lambda x : len(x)>2)
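One caveat: when linkage is given a square 2-D array, it treats each row as an observation vector rather than as precomputed distances, so converting the matrix to condensed form with squareform first clusters on the matrix values themselves. The sketch below is one possible way to get the output requested in the question, not the method from the blog post. The helper name label_clusters_by_size and its parameters are hypothetical. It keeps clusters of two or more samples, to match the desired output in the question, whereas the filter above keeps only clusters with more than two.

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Mock pairwise-difference matrix from the question
d = {'sample_A': [0,2,0,1,1,2,2,1], 'sample_B': [2,0,2,3,3,0,0,3],
     'sample_C': [0,2,0,1,1,2,2,1], 'sample_D': [1,3,1,0,2,3,3,1],
     'sample_E': [1,3,1,2,0,3,3,1], 'sample_F': [2,0,2,3,3,0,0,3],
     'sample_G': [2,0,2,3,3,0,0,3], 'sample_H': [1,3,1,1,1,3,3,0]}
idx = list(d)
df = pd.DataFrame(data=d, index=idx)

def label_clusters_by_size(dist_df, threshold=0.0, min_size=2):
    """Hypothetical helper: number clusters 1, 2, ... from largest to smallest."""
    condensed = squareform(dist_df.to_numpy(dtype=float), checks=True)
    Z = linkage(condensed, method="single")
    # Samples joined at a linkage distance <= threshold share a flat cluster
    raw = pd.Series(fcluster(Z, t=threshold, criterion="distance"), index=dist_df.index)
    sizes = raw.value_counts()  # sorted from largest to smallest
    sizes = sizes[sizes >= min_size]
    rank = {cluster_id: i + 1 for i, cluster_id in enumerate(sizes.index)}
    return raw.map(rank).astype("Int64")  # unclustered samples become <NA>

labels = label_clusters_by_size(df, threshold=0)
print(labels)
```

Two things to keep in mind: the threshold is inclusive, and single linkage chains, so two samples can end up in the same cluster through an intermediate sample even if their own difference exceeds the threshold.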
I have measured data (vibrations) from a wind turbine running under different operating conditions. My dataset consists of operating conditions as well as measurement features I have extracted from the measured data.
Dataset shape: (423, 15). Each of the 423 data points represent a measurement on a day, chronologically over 423 days.
I now want to cluster the data to see if there is any change in the measurements. Specifically, I want to examine if the vibrations change over time (which could indicate a fault in the turbine gearbox).
What I have currently done:
Scale the data to the range [0, 1]
Perform PCA (reduce from 15 to 5 dimensions)
Cluster using DBSCAN, since I do not know the number of clusters. I am using this code to find the optimal epsilon (eps) for DBSCAN:
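# optimal Epsilon (distance):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
X_pca = principalDf.values
neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(X_pca)
distances, indices = nbrs.kneighbors(X_pca)
# column 0 is each point's distance to itself, column 1 is its nearest neighbour
distances = np.sort(distances, axis=0)
distances = distances[:,1]
plt.plot(distances,color="#0F215A")
plt.grid(True)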
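For reference, the whole pipeline described above could be sketched end to end as follows. Everything here is an assumption made for illustration: the data is synthetic with an artificial shift after day 300, min_samples of 10 is arbitrary, and taking the 95th percentile of the k-distances is a crude stand-in for reading the elbow off the plot. The last step compares each cluster against the day index, which is one way to see whether cluster membership changes over time.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the (423, 15) dataset, one row per day
rng = np.random.default_rng(0)
features = rng.normal(size=(423, 15))
features[300:] += 3.0  # hypothetical change in behaviour after day 300

# Scale to [0, 1], then reduce from 15 to 5 dimensions
X_pca = PCA(n_components=5).fit_transform(MinMaxScaler().fit_transform(features))

# k-distance curve, with k tied to the min_samples value used for DBSCAN
min_samples = 10
neighbours = NearestNeighbors(n_neighbors=min_samples).fit(X_pca)
k_distances = np.sort(neighbours.kneighbors(X_pca)[0][:, -1])
eps = np.quantile(k_distances, 0.95)  # assumption: replaces a visual elbow read

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_pca)

# Relate each cluster to time: which days fall into which cluster?
days = np.arange(len(labels))
for label in np.unique(labels):
    members = days[labels == label]
    print(label, members.min(), members.max(), len(members))
```

If the clusters cover overlapping day ranges, the clustering gives no evidence of a change over time; if they split cleanly by date, that is worth a closer look.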
The results so far do not give any clear indication that the data is changing over time.
Of course, it could be that the data does not change over these data points. However, what are some other things I could try? It is kind of an open question, but I am running out of ideas.
First of all, with KMeans, if the dataset is not naturally partitioned, you may end up with some very weird results! As KMeans is unsupervised, you basically feed in all kinds of numeric variables and let the machine do the lifting for you; a target variable, if you have one, is only used afterwards to check the clusters. Here is a simple example using the canonical Iris dataset. You can easily modify this to fit your specific dataset. Just change the 'X' variables (all but the target variable) and the 'y' variable (just one target variable). Try that and report back.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import urllib.request
import random
# seaborn is a layer on top of matplotlib which has additional visualizations -
# just importing it changes the look of the standard matplotlib plots.
# the current version also shows some warnings which we'll disable.
import seaborn as sns
sns.set(style="white", color_codes=True)
import warnings
warnings.filterwarnings("ignore")
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, 0:4] # we take all four features.
y = iris.target
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
scaler.fit(X)
X_scaled_array = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled_array)
X_scaled.sample(5)
# try clustering on the 4d data and see if can reproduce the actual clusters.
# ie imagine we don't have the species labels on this data and wanted to
# divide the flowers into species. could set an arbitrary number of clusters
# and try dividing them up into similar clusters.
# we happen to know there are 3 species, so let's find 3 species and see
# if the predictions for each point matches the label in y.
from sklearn.cluster import KMeans
nclusters = 3 # this is the k in kmeans
seed = 0
km = KMeans(n_clusters=nclusters, random_state=seed)
km.fit(X_scaled)
# predict the cluster for each data point
y_cluster_kmeans = km.predict(X_scaled)
y_cluster_kmeans
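# use seaborn to make scatter plot showing species for each sample
data = pd.DataFrame(iris.data[:, 0:2], columns=["sepal_length", "sepal_width"])
data["species"] = iris.target_names[iris.target]
sns.FacetGrid(data, hue="species", height=4) \
.map(plt.scatter, "sepal_length", "sepal_width") \
.add_legend();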
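To check whether the predicted clusters match the species labels, as the comments in the code suggest, one option is a contingency table together with the adjusted Rand index. This is a sketch of one possible check and is not part of the original example; n_init=10 is set explicitly here only to keep the behaviour the same across scikit-learn versions.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Rows are the true species, columns are the k-means cluster ids
table = pd.crosstab(iris.target_names[iris.target], labels,
                    rownames=["species"], colnames=["cluster"])
print(table)

# 1.0 means a perfect match; values near 0 mean chance-level agreement
ari = adjusted_rand_score(iris.target, labels)
print(ari)
```

Cluster ids are arbitrary, so cluster 0 need not correspond to the first species. The adjusted Rand index ignores the numbering, which is why it is used here rather than plain accuracy.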
To get the correlation between two arrays in python, I am using:
from scipy.stats import pearsonr
x, y = [1,2,3], [1,5,7]
cor, p = pearsonr(x, y)
However, as stated in the docs, the p-value returned from pearsonr() is only meaningful with datasets larger than 500. So how can I get a p-value that is reasonable for small datasets?
My temporary solution:
After reading up on linear regression, I have come up with my own small script, which basically uses the Fisher transformation to get the z-score, from which the p-value is calculated:
import numpy as np
from scipy.stats import norm
n = len(x)
# Fisher z-transform of the correlation, scaled by its standard error 1/sqrt(n-3)
z = 0.5 * np.log((1 + cor) / (1 - cor)) * np.sqrt(n - 3)
# one-sided p-value; norm.cdf stands in for zprob, which is no longer available in scipy.stats
p = norm.cdf(-z)
It works. However, I am not sure if it is more reasonable than the p-value given by pearsonr(). Is there a Python module which already has this functionality? I have not been able to find it in SciPy or Statsmodels.
Edit to clarify:
The dataset in my example is simplified. My real dataset is two arrays of 10-50 values.
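For arrays of this size, one commonly used alternative is a permutation test, which makes no normality assumption: shuffle one array many times and count how often the shuffled correlation is at least as extreme as the observed one. The sketch below is an illustration only. The helper name permutation_pvalue, the number of resamples, and the synthetic data are all assumptions, not something taken from SciPy or Statsmodels.

```python
import numpy as np

def permutation_pvalue(x, y, n_resamples=10_000, seed=0):
    """Hypothetical helper: two-sided permutation p-value for Pearson's r."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    observed = np.corrcoef(x, y)[0, 1]
    hits = 0
    for _ in range(n_resamples):
        # Shuffling y breaks any association with x
        shuffled_r = np.corrcoef(x, rng.permutation(y))[0, 1]
        if abs(shuffled_r) >= abs(observed) - 1e-12:
            hits += 1
    # The +1 terms keep the estimate away from exactly zero
    return observed, (hits + 1) / (n_resamples + 1)

# Synthetic example with 20 values, within the 10-50 range mentioned above
rng = np.random.default_rng(1)
x = np.arange(20, dtype=float)
y = 2.0 * x + rng.normal(scale=3.0, size=20)
r, p = permutation_pvalue(x, y)
print(r, p)
```

The smallest p-value this can report is 1 / (n_resamples + 1), so the number of resamples sets the resolution of the result.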