How to classify new entries in Python given a classified knowledge base [duplicate] - python

This question already has answers here:
How to assign a new observation to existing Kmeans clusters based on nearest cluster centroid logic in python?
(3 answers)
Closed 5 years ago.
I have a set of vectors, in python, composing my knowledge base, for example:
KB=[[1,2,3,4],[1,2,2,1],[4,3,1,2],[5,4,3,5]]
Now I computed the clusters for KB, using:
from sklearn.cluster import KMeans
model=KMeans(n_clusters=3)
model.fit(KB)
Now I have a new entry (I could have more than one),
A=[3,2,1,3]
and I would like to know which of the clusters computed above best fits A, exploiting the KB.
Could you help me?
Thanks in advance

Here you are:
KB=[[1,2,3,4],[1,2,2,1],[4,3,1,2],[5,4,3,5]]
from sklearn.cluster import KMeans
model=KMeans(n_clusters=3).fit(KB)
A=[3,2,1,3]
l = model.predict([A])
print(model.labels_, l)
centers = model.cluster_centers_.copy()
print(centers)
In order for your model to be fit, I joined the two lines into one.
I then use the predict method to assign A to a cluster.
I also print the labels for each example that was used to fit the model.
Edit: add a plot
import matplotlib.pyplot as plt
import numpy
# Compute the Euclidean distance from each KB vector to each cluster centre
d = numpy.array([[numpy.linalg.norm(numpy.array(KBi) - cj) for KBi in KB] for cj in centers])
print d
# for cluster 0 and 1
plt.scatter(d[0], d[1])
plt.pause(10)

Related

Generate a GMM Dataset by using multivariate_normal from scipy.stats

How can I use from scipy.stats import multivariate_normal to generate data?
Specifically, I want to create GMM data that contains 3 columns (features) and a label column (0 or 1).
So I am basically looking to see a 3d plot that contains 6 different Gaussians (3 per class).
Thanks a lot!
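As a minimal sketch of one way to do this (not from the original thread; the means, covariances, and sample counts below are arbitrary assumptions):
import numpy as np
from scipy.stats import multivariate_normal
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # needed for the '3d' projection on older matplotlib
rng = np.random.default_rng(0)
def sample_class(means, label, n_per_component=100):
    # draw points from one spherical Gaussian per mean and attach a label column
    parts = []
    for mu in means:
        rv = multivariate_normal(mean=mu, cov=np.eye(3))  # identity covariance (assumption)
        parts.append(rv.rvs(size=n_per_component, random_state=rng))
    X = np.vstack(parts)
    y = np.full((X.shape[0], 1), label)
    return np.hstack([X, y])
# three component means per class (arbitrary choices)
class0 = sample_class([[0, 0, 0], [3, 3, 0], [0, 3, 3]], label=0)
class1 = sample_class([[6, 0, 0], [6, 6, 6], [0, 6, 0]], label=1)
data = np.vstack([class0, class1])  # columns: feature1, feature2, feature3, label
# quick 3D scatter coloured by label
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(data[:, 0], data[:, 1], data[:, 2], c=data[:, 3], cmap='coolwarm', s=10)
plt.show()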

Clustering geospatial data on coordinates AND non spatial feature

Say I have the following dataframe stored as a variable called coordinates, where the first few rows look like:
business_lat business_lng business_rating
0 19.111841 72.910729 5.
1 19.111342 72.908387 5.
2 19.111342 72.908387 4.
3 19.137815 72.914085 5.
4 19.119677 72.905081 2.
5 19.119677 72.905081 2.
. . .
. . .
. . .
As you can see this data is geospatial (has a lat and a lng) AND every row has an additional value, business_rating, that corresponds to the rating of the business at the latlng in that row. I want to cluster the data so that businesses that are nearby and have similar ratings are assigned to the same cluster. Essentially I need a geospatial clustering with the additional requirement that the clustering must consider the rating column.
I've looked online and can't really find much addressing this: only approaches for strictly geospatial clustering (where the only features to cluster on are lat/lng) or purely non-spatial clustering.
I have a simple DBSCAN running below, but when I plot the results of the clustering it does not seem to be doing what I want correctly.
from sklearn.cluster import DBSCAN
import numpy as np
db = DBSCAN(eps=2/6371., min_samples=5, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates))
Would I be better served trying to tweak the parameters of the DBSCAN, doing some additional processing of the data or using a different approach all together?
The tricky part about clustering two different types of information (location and rating) is determining how they should relate to each other. The question is simple when there is just one domain and you are comparing the same units. My approach would be to look at how to relate rows within each domain and then determine some interaction between the domains. This could be done using scaling options like the MinMaxScaler mentioned in the other answer; however, I think this is a bit heavy-handed, and we can use our knowledge of the domains to cluster better.
Handling Location
Location distance is best handled directly, as it has a real-world meaning for which we can precalculate distances. Being a certain number of meters apart is directly interpretable.
You could use the scaling option mentioned in the previous answer, but this risks distorting the location data. For example, if you have a long and thin set of locations, MinMaxScaling would give more importance to variation on the thin axis than the long axis. If you are going to use scaling, do it on the computed distance matrix, not on the lat/lng values themselves.
import numpy as np
from sklearn.metrics.pairwise import haversine_distances
points_in_radians = df[['business_lat','business_lng']].apply(np.radians).values
distances_in_km = haversine_distances(points_in_radians) * 6371
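As a side note (not from the original answer): if you did still want to scale, a minimal sketch of applying MinMaxScaler to the precomputed distance matrix rather than the raw coordinates might look like this.
from sklearn.preprocessing import MinMaxScaler
# scale all pairwise km distances to [0, 1] as a whole, then restore the square shape
scaler = MinMaxScaler()
scaled_distances = scaler.fit_transform(distances_in_km.reshape(-1, 1)).reshape(distances_in_km.shape)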
Adding in Rating
We can think about the problem by asking a couple of questions that relate rating to distance. How different must ratings be to separate observations in the same place? What is the ratio of meter difference to rating difference? With an idea of this ratio, we can calculate another distance matrix for the rating difference between all observations and use it to scale, or add to, the original location distance matrix; in other words, we increase the distance for every gap in rating. This location-plus-rating-difference matrix can then be clustered on.
from sklearn.metrics.pairwise import euclidean_distances
added_km_per_rating_gap = 1
rating_distances = euclidean_distances(df[['business_rating']].values) * added_km_per_rating_gap
We can then simply add these together and cluster on the resulting matrix.
from sklearn.cluster import DBSCAN
distance_matrix = rating_distances + distances_in_km
clustering = DBSCAN(metric='precomputed', eps=1, min_samples=2)
clustering.fit(distance_matrix)
What we have done is cluster by location, adding a penalty for ratings difference. Making that penalty direct and controllable allows for optimisation to find the best clustering.
Testing
The problem I'm finding is that (with my test data at least) DBSCAN has a tendency to 'walk' from observation to observation, forming clusters that either blend ratings together (when the penalty is not high enough) or separate into single-rating groups (when it is too high). It might be that DBSCAN is not suitable for this type of clustering. If I had more time, I would look for some open data to test this on and try other clustering methods.
Here is the code I used to test. I used the square of the ratings distance to emphasise larger gaps.
import random
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=6, cluster_std=0.60, random_state=0)
ratings = np.array([random.randint(1,4) for _ in range(len(X)//2)] \
+[random.randint(2,5) for _ in range(len(X)//2)]).reshape(-1, 1)
distances_in_km = euclidean_distances(X)
rating_distances = euclidean_distances(ratings)
def build_clusters(multiplier, eps):
    rating_addition = (rating_distances ** 2) * multiplier
    distance_matrix = rating_addition + distances_in_km
    clustering = DBSCAN(metric='precomputed', eps=eps, min_samples=10)
    clustering.fit(distance_matrix)
    return clustering.labels_
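The helper above is only defined, not called; as a hypothetical usage (the multiplier and eps values here are arbitrary assumptions):
# labels of -1 are DBSCAN noise points
for multiplier, eps in [(0.5, 1.0), (1.0, 1.0), (2.0, 1.5)]:
    labels = build_clusters(multiplier, eps)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(multiplier, eps, n_clusters, "clusters")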
Using the DBSCAN methodology, we can calculate the distance between points (the Euclidean distance or some other distance) and look for points which are far away from others. You may want to consider using the MinMaxScaler to normalize values, so one feature doesn't overwhelm other features.
Where is your code and what are your final results? Without an actual code sample, I can only guess what you are doing.
I hacked together some sample code for you. You can see the results below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import seaborn as sns; sns.set()
import csv
df = pd.read_csv('C:\\your_path_here\\business.csv')
X=df.loc[:,['review_count','latitude','longitude']]
K_clusters = range(1,10)
kmeans = [KMeans(n_clusters=i) for i in K_clusters]
Y_axis = df[['latitude']]
X_axis = df[['longitude']]
score = [kmeans[i].fit(Y_axis).score(Y_axis) for i in range(len(kmeans))]
# Visualize
plt.plot(K_clusters, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
kmeans = KMeans(n_clusters = 3, init ='k-means++')
kmeans.fit(X[X.columns[1:3]]) # Compute k-means clustering on latitude and longitude.
X['cluster_label'] = kmeans.fit_predict(X[X.columns[1:3]])
centers = kmeans.cluster_centers_ # Coordinates of cluster centers.
labels = kmeans.predict(X[X.columns[1:3]]) # Labels of each point
X.head(10)
X.plot.scatter(x = 'latitude', y = 'longitude', c=labels, s=50, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
from scipy.stats import zscore
df["zscore"] = zscore(df["review_count"])
df["outlier"] = df["zscore"].apply(lambda x: x <= -2.5 or x >= 2.5)
df[df["outlier"]]
df_cord = df[["latitude", "longitude"]]
df_cord.plot.scatter(x = "latitude", y = "longitude")
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_cord = scaler.fit_transform(df_cord)
df_cord = pd.DataFrame(df_cord, columns = ["latitude", "longitude"])
df_cord.plot.scatter(x = "latitude", y = "longitude")
from sklearn.cluster import DBSCAN
outlier_detection = DBSCAN(
eps = 0.5,
metric="euclidean",
min_samples = 3,
n_jobs = -1)
clusters = outlier_detection.fit_predict(df_cord)
clusters
from matplotlib import cm
cmap = cm.get_cmap('Accent')
df_cord.plot.scatter(
x = "latitude",
y = "longitude",
c = clusters,
cmap = cmap,
colorbar = False
)
The final result looks a little weird, to tell you the truth. Remember, not everything is clusterable.

Defining clusters from a matrix using a value threshold and named by cluster size in Python

I have a matrix of pairwise differences between samples. I would like to label each sample as being part of a cluster, named by cluster size, where clusters are defined by an absolute cutoff in the matrix values, e.g. all those samples with a difference of zero from each other.
Mock data:
# Load packages
import numpy as np
import pandas as pd
import seaborn as sns
## Generate fake data
# matrix
d = {'sample_A': [0,2,0,1,1,2,2,1], 'sample_B': [2,0,2,3,3,0,0,3], 'sample_C': [0,2,0,1,1,2,2,1], 'sample_D': [1,3,1,0,2,3,3,1],
'sample_E': [1,3,1,2,0,3,3,1], 'sample_F': [2,0,2,3,3,0,0,3], 'sample_G': [2,0,2,3,3,0,0,3], 'sample_H': [1,3,1,1,1,3,3,0]}
idx = ["sample_A","sample_B","sample_C","sample_D","sample_E", "sample_F", "sample_G", "sample_H"]
df = pd.DataFrame(data=d,index=idx)
df
# Visualise heatmap (this isn't directly needed for this output)
g = sns.clustermap(df, cmap="coolwarm_r")
g
# Desired output
d = {'cluster_zero': [2,1,2,"NA","NA",1,1,"NA"]}
df3 = pd.DataFrame(data=d,index=idx)
df3
So the output labels each sample as belonging to a cluster defined as having zero pairwise difference in the matrix, and names the cluster in order of size from largest to smallest. In this case, samples B, F and G all have zero differences, so get put in cluster 1. Samples A and C also have zero differences from each other, and as that cluster is smaller than B/F/G they are cluster 2. There are no other samples with zero differences in this case, so the other samples don't get a cluster.
Ideally, I would like to be able to control the threshold of difference I used to define clusters, e.g. run the script again but using a threshold of <1 or <2 rather than zero.
There are various questions similar to this (e.g. Extracting clusters from seaborn clustermap), but they seem to use metrics of calculating distance rather than the absolute count in the matrix. Another similar question is: generating numerical clusters from matrix values of a minimal size but this counts the size of each cluster, which is different to the output I want.
Thanks for your help.
I found the answer to my question in this blog:
https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/
The solution I've done is:
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
df_arr = np.asarray(df)
Z = linkage(df_arr, 'single')
plt.figure()
dn = dendrogram(Z)
from scipy.cluster.hierarchy import fcluster
# Set maximum threshold for difference e.g. 1
max_d = 1
clusters = fcluster(Z, max_d, criterion='distance')
clusters
Then I turn the clusters array into a dataframe and pd.concat it onto the distances dataframe, then extract the list of sample names with their cluster. Lastly, I take only clusters with e.g. >2 samples in each cluster:
result = result.groupby('cluster').filter(lambda x : len(x)>2)
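As a sketch of the intermediate step described above (the concat and renumbering; the cluster_renamed column name and the renumbering logic are assumptions added to match the size-ordered naming in the question):
# attach the fcluster labels to the original dataframe of distances
result = pd.concat([df, pd.Series(clusters, index=idx, name='cluster')], axis=1)
# keep only clusters with more than 2 samples (same filter as above)
result = result.groupby('cluster').filter(lambda x: len(x) > 2)
# renumber the remaining clusters 1, 2, ... from largest to smallest
sizes = result['cluster'].value_counts()
renumber = {old: new for new, old in enumerate(sizes.index, start=1)}
result['cluster_renamed'] = result['cluster'].map(renumber)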

Separating Clusters after using SLINK in Python/R

From my research, only single-linkage hierarchical clustering can obtain optimal clusters. This is also known as SLINK. The libraries were originally published in C++ and are now available in Python/R.
So far, following the steps in the documentations, I managed to come up with:
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist
## generating random numbers from 20 to 90, and storing them in a dataframe. This is a 1-dimensional data
np.random.seed(1)
df = pd.DataFrame(np.random.randint(20,90,size=(100,1)), columns = list('A'))
df = df.sort_values(by=['A'])
df = df.values
df[:,0].sort()
## getting condensed distance matrix
d = pdist(df, metric='euclidean')
## running the SLINK algorithm
Z = linkage(d, 'single')
I understand that Z is a 'hierarchical clustering encoded as a linkage matrix' (as written in the documentation), but I am wondering how I go back to my original data set and identify the clusters produced by this result?
I could achieve a clustering result with Scikit-Learn, but I think Scikit-Learn's clustering algorithms are not optimal, and hence I turned to this SLINK algorithm. Would be much appreciated if someone could help me with this.
From scipy.cluster.hierarchy.linkage you get back how clusters are formed with each iteration.
Normally this information is not so useful, so we can look at the clustering first:
import scipy.cluster.hierarchy
import matplotlib.pyplot as plt
plt.figure()
dn = scipy.cluster.hierarchy.dendrogram(Z)
If we want to get the three clusters, we can do:
labels = scipy.cluster.hierarchy.fcluster(Z,3,'maxclust')
If you want to get it by distance between the data points:
scipy.cluster.hierarchy.fcluster(Z,2,'distance')
This gives about the same result as calling for 3 clusters because there are not many ways to cut this example dataset.
If you look at the example you have, the next point where you can cut it is at height ~1.5, which gives 16 clusters. So if you try scipy.cluster.hierarchy.fcluster(Z,5,'maxclust'), you get the same results as for 3 clusters. If you have a more spread-out dataset, it will work:
np.random.seed(111)
df = np.random.normal(0,1,(50,3))
## getting condensed distance matrix
d = pdist(df, metric='euclidean')
Z = linkage(d, 'single')
dn = scipy.cluster.hierarchy.dendrogram(Z,above_threshold_color='black',color_threshold=1.1)
Then this works:
scipy.cluster.hierarchy.fcluster(Z,5,'maxclust')

Labels for cluster centers in Python sklearn

When using the sklearn class sklearn.cluster.KMeans for K-means clustering, a fitted k-means object has several attributes, including a numpy array of cluster centers (centers x features) named cluster_centers_. However, these centers don't have an attached label.
My question is: are the centers (rows) in cluster_centers_ ordered by the label value? That is, does row 1 correspond to the center for the cluster labeled 1? Or are they placed in the array randomly? A pointer to any documentation would be more than sufficient.
Thanks.
I couldn't find it in the documentation, but yes, it is ordered by cluster label.
So:
kmeans.cluster_centers_[0] = centroid of cluster 0
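A quick way to check this for yourself (a small sketch on synthetic data, not part of the original answer): recompute the mean of the points assigned to each label and compare it with the corresponding row of cluster_centers_.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
for label in range(3):
    manual_centroid = X[km.labels_ == label].mean(axis=0)
    # should print True for each label once the fit has converged
    print(label, np.allclose(manual_centroid, km.cluster_centers_[label]))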
