How do I use pyclustering to implement kmedoids?

I am not sure how to use kmedoids in Python. I have installed the pyclustering module from https://pypi.org/project/pyclustering/, yet I'm not sure how to call kmedoids. I am trying to implement PAM on my Gower distance matrix.
I'm trying to cluster features from a trade dataset. I used https://sourceforge.net/projects/gower-distance-4python/files/ to calculate the Gower distance on my data. Then I pass this matrix, which I've called D, through PAM/kmedoids:
import pyclustering
import pyclustering.cluster.kmedoids
from sklearn.metrics.pairwise import pairwise_distances
import numpy as np
D = gower_distances(trade_data)
pam=pyclustering.kmedoids(D)
AttributeError: module 'pyclustering' has no attribute 'kmedoids'
I get the above error. How do I call kmedoids / use PAM?

You need to correct the import and the K-Medoids initialization:
from pyclustering.cluster.kmedoids import kmedoids
... ...
pam=kmedoids(D, initial_medoids)

You need to import kmedoids as follows:
from pyclustering.cluster.kmedoids import kmedoids
You can read more about it in pyclustering's documentation here https://codedocs.xyz/annoviko/pyclustering/classpyclustering_1_1cluster_1_1kmedoids_1_1kmedoids.html

This is a very small code example from https://stats.stackexchange.com/questions/94172/how-to-perform-k-medoids-when-having-the-distance-matrix/470141#470141. It starts from an already given distance matrix; in your case you would use the result of gower_distances() instead.
from pyclustering.cluster.kmedoids import kmedoids
import numpy as np
dm = np.array(
    [[0.,   1.91, 2.23, 3.14, 4.25, 3.37],
     [0.,   0.,   2.15, 1.82, 2.41, 2.58],
     [0.,   0.,   0.,   3.12, 3.83, 4.64],
     [0.,   0.,   0.,   0.,   1.9,  2.66],
     [0.,   0.,   0.,   0.,   0.,   3.12],
     [0.,   0.,   0.,   0.,   0.,   0.]])
dm = dm + np.transpose(dm)
k = 2
# choose medoids 2 and 4 (indices 1 and 3) for C1 and C2, because they minimise the summed distance within their clusters
initial_medoids = [1,3]
kmedoids_instance = kmedoids(dm, initial_medoids, data_type = 'distance_matrix')
# Run cluster analysis and obtain results.
kmedoids_instance.process()
clusters = kmedoids_instance.get_clusters()
centers = kmedoids_instance.get_medoids()
print(clusters)
# [[1, 0, 2, 5], [3, 4]]
print(centers)
# [1, 3]
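If you want one flat label per point rather than pyclustering's list-of-index-lists, a small sketch (my addition, not part of the cited answer) converts the result:
import numpy as np
# flatten the list-of-index-lists returned by get_clusters() into per-point labels
labels = np.zeros(len(dm), dtype=int)
for cluster_id, members in enumerate(clusters):
    labels[members] = cluster_id
print(labels)
# [0 0 0 1 1 0]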

Related

Python integration of Pandas dataframe

I have the following pandas dataframe df with 2 columns, which looks like:
0    0
1   22
2   34
3   21
4   21
5   92
I would like to integrate the area under this curve if we were to plot the first column as the x-axis and the second column as the y-axis. I have tried doing this using the integrate module from scipy (from scipy import integrate), applied as follows, as I have seen in examples online:
print(df.integrate)
However, it seems the integrate function does not work. I'm receiving the error:
AttributeError: 'DataFrame' object has no attribute 'integrate'
How would I go about this?
Thank you
You want numerical integration given a fixed sample of data. The SciPy package lists a handful of methods for this: https://docs.scipy.org/doc/scipy/reference/integrate.html#integrating-functions-given-fixed-samples
For your data, the trapezoidal rule is probably the most straightforward. You provide the y and x values to the function. You did not post the column names of your data frame, so I am using the 0-index for the x values and the 1-index for the y values:
from scipy.integrate import trapz
trapz(df.iloc[:, 1], df.iloc[:, 0])
Since integrate lives in SciPy rather than pandas, you need to invoke it as follows:
from scipy.integrate import trapz, simps
print(trapz(*args))
https://docs.scipy.org/doc/scipy/reference/tutorial/integrate.html
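As a rough sketch (assuming, as above, that the first column holds the x values and the second the y values), both rules can be compared on toy data mirroring the question:
from scipy.integrate import trapz, simps
import pandas as pd
# assumed toy frame with the question's two columns
df = pd.DataFrame({'x': [0, 1, 2, 3, 4, 5], 'y': [0, 22, 34, 21, 21, 92]})
print(trapz(df['y'], df['x']))  # trapezoidal rule
print(simps(df['y'], df['x']))  # Simpson's rule, often more accurate for smooth data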
Try this
import pandas as pd
import numpy as np
def integrate(x, y):
    area = np.trapz(y=y, x=x)
    return area
df = pd.DataFrame({'x':[0, 1, 2, 3, 4, 4, 5],'y':[0, 1, 3, 3, 5, 6, 7]})
x = df.x.values
y = df.y.values
print(integrate(x, y))

Text data clustering with python

I am currently trying to cluster a list of sequences based on their similarity using python.
ex:
DFKLKSLFD
DLFKFKDLD
LDPELDKSL
...
The way I pre process my data is by computing the pairwise distances using for example the Levenshtein distance. After calculating all the pairwise distances and creating the distance matrix, I want to use it as input for the clustering algorithm.
I have already tried using Affinity Propagation, but convergence is a bit unpredictable and I would like to go around this problem.
Does anyone have any suggestions regarding other suitable clustering algorithms for this case?
Thank you!!
sklearn actually does show this example using DBSCAN, just like Luke once answered here.
This is based on that example, using !pip install python-Levenshtein.
But if you have pre-calculated all distances, you could change the custom metric, as shown below.
from Levenshtein import distance
import numpy as np
from sklearn.cluster import dbscan
data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]
def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])  # extract indices
    return distance(data[i], data[j])
X = np.arange(len(data)).reshape(-1, 1)
dbscan(X, metric=lev_metric, eps=5, min_samples=2)
And if you have pre-calculated all the distances, you could define pre_lev_metric(x, y) along the lines of:
def pre_lev_metric(x, y):
    i, j = int(x[0]), int(y[0])  # extract indices
    return DISTANCES[i, j]
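As a side note (my addition, not from the answer above): if you already have the full square distance matrix, sklearn can also consume it directly via metric='precomputed', without the index trick:
import numpy as np
from sklearn.cluster import dbscan
from Levenshtein import distance
data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]
# precompute the full square Levenshtein distance matrix
DISTANCES = np.array([[distance(a, b) for b in data] for a in data])
core_samples, labels = dbscan(DISTANCES, metric='precomputed', eps=5, min_samples=2)
print(labels)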
Alternative answer based on K-Medoids using sklearn_extra.cluster.KMedoids. K-Medoids is not yet that well known, but it also only needs distances.
I had to install it like this:
!pip uninstall -y enum34
!pip install scikit-learn-extra
Then I was able to create clusters with:
from sklearn_extra.cluster import KMedoids
import numpy as np
from Levenshtein import distance
data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]
def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])  # extract indices
    return distance(data[i], data[j])
X = np.arange(len(data)).reshape(-1, 1)
kmedoids = KMedoids(n_clusters=2, random_state=0, metric=lev_metric).fit(X)
The labels/centers are in
kmedoids.labels_
kmedoids.cluster_centers_
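To see which sequences ended up as the medoids, a short follow-up sketch (assuming sklearn_extra's medoid_indices_ attribute):
print(kmedoids.labels_)  # cluster label per sequence
# map medoid indices back to the original strings (hypothetical usage)
print([data[i] for i in kmedoids.medoid_indices_])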
Try this.
import numpy as np
from sklearn.cluster import AffinityPropagation
import distance
words = 'XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL'.split(',') #Replace this line
words = np.asarray(words) #So that indexing with a list will work
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])
affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_ == cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))
Results:
- *LDPELDKSL:* LDPELDKSL
- *DFKLKSLFD:* DFKLKSLFD
- *XYZ:* ABC, XYZ
- *DLFKFKDLD:* DLFKFKDLD

Similar matrix computation using numpy

I am trying to find a matrix B similar to a 3 x 3 matrix A, using a random invertible matrix P.
B = P_inv.A.P
import numpy as np
from scipy import linalg as LA
from numpy.linalg import inv
A = np.random.randint(1,10,9).reshape(3,3)
P = np.random.randn(3,3)
P_inv = inv(P)
eig1 = LA.eigvalsh(A)
eig1 = np.sort(eig1)
B1 = P_inv.dot(A)
B = B1.dot(P)
eig2 = LA.eigvalsh(B)
eig2 = np.sort(eig2)
print(np.round(eig1 ,3))
print(np.round(eig2,3))
However, I notice that eig1 and eig2 are never equal.
What am I missing, or is it a numerical error?
Thanks
Kedar
You're using eigvalsh, which requires that the matrix be real symmetric (or complex Hermitian), which your randomly generated matrix is not.
Deleting the h and using eigvals instead fixes this.
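A minimal sketch of the corrected script (same setup as the question, with eigvals in place of eigvalsh):
import numpy as np
from scipy import linalg as LA
from numpy.linalg import inv
A = np.random.randint(1, 10, 9).reshape(3, 3)
P = np.random.randn(3, 3)
P_inv = inv(P)
B = P_inv.dot(A).dot(P)
# eigvals handles general (non-symmetric) matrices; the eigenvalues may be complex
eig1 = np.sort(LA.eigvals(A))
eig2 = np.sort(LA.eigvals(B))
print(np.round(eig1, 3))
print(np.round(eig2, 3))  # now agrees with eig1 up to numerical precision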

Dendrogram with plotly - how to set a custom linkage method for hierarchical clustering

I am new to plotly and need to draw a dendrogram with group average linkage.
I am aware that there is a distfun parameter in create_dendrogram(), but I have no idea what to pass to that argument to get group average linkage. The distfun argument apparently has to be callable. What function should I pass to it?
As a sidenote, I have a sample pairwise distance matrix
0
13 0
2 14 0
17 1 18 0
which, when passed to the create_dendrogram() method, seems to produce an incorrect result. What am I doing wrong here?
code:
import plotly.figure_factory as ff
import numpy as np
X = np.matrix([[0,0,0,0],[13,0,0,0],[2,14,0,0],[17,1,18,0]])
names = list("0123")
fig = ff.create_dendrogram(X, orientation='left', labels=names)
fig.update_layout(width=800, height=800)
fig.show()
The code is literally copied from the plotly website because I don't know what I'm supposed to do.
This website: https://plotly.com/python/v3/dendrogram/
You can choose a linkage method using scipy.cluster.hierarchy.linkage() via the linkagefun argument of the create_dendrogram() function.
For example, to use UPGMA (Unweighted Pair Group Method with Arithmetic mean) algorithm:
import plotly.figure_factory as ff
import scipy.cluster.hierarchy as sch
import numpy as np
X = np.matrix([[0,0,0,0],[13,0,0,0],[2,14,0,0],[17,1,18,0]])
names = "0123"
fig = ff.create_dendrogram(X,
                           orientation='left',
                           labels=names,
                           linkagefun=lambda x: sch.linkage(x, "average"))
fig.update_layout(width=800, height=800)
fig.show()
Please note that X has to be a matrix of data samples.
This is a bit old, but for anyone else with similar issues: I think the distfun param simply specifies how you want to convert your data matrix to a condensed distance matrix - you define the function yourself.
For example, after a bit of head banging I cobbled together data_to_dist_matrix to convert a data matrix to a Jaccard distance matrix, then condense it. You should be aware that plotly's dendrogram implementation does not check whether your matrix is condensed, so your distfun needs to ensure this happens. Maybe this is wrong, but it looks like distfun should only take one positional param (the data matrix) and return one object (the condensed distance matrix):
import plotly.figure_factory as ff
import numpy as np
from scipy.spatial.distance import jaccard, squareform
def jaccard_dissimilarity(feature_list1, feature_list2, filler_val):  # binary
    # filler_val can be used to even up ragged lists and ignore certain entries,
    # e.g. proteins not in a module; works for both numpy arrays and lists
    all_features = set([i for i in feature_list1 if i != filler_val])
    all_features.update(set([i for i in feature_list2 if i != filler_val]))
    counts_1 = [1 if feature in feature_list1 else 0 for feature in all_features]
    counts_2 = [1 if feature in feature_list2 else 0 for feature in all_features]
    return jaccard(counts_1, counts_2)

def data_to_dist_matrix(mn_data, filler_val=0):
    # notes:
    # the original plotly example uses pdist to find manhattan distance for clustering.
    # pdist 'Returns a condensed distance matrix Y' - https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html#scipy.spatial.distance.pdist
    # a condensed distance matrix is required as input to scipy linkage for clustering.
    # plotly's dendrogram function does not do this conversion on the output of a given distfun call - https://github.com/plotly/plotly.py/blob/cfad7862594b35965c0e000813bd7805e8494a5b/packages/python/plotly/plotly/figure_factory/_dendrogram.py#L340
    # therefore you should convert the distance matrix to condensed form yourself, as below with squareform
    distance_matrix = np.array([[jaccard_dissimilarity(a, b, filler_val) for b in mn_data] for a in mn_data])
    return squareform(distance_matrix)
# toy data to visually check clustering looks sensible
data_array = np.array([[1, 2, 3, 0],
                       [2, 3, 10, 0],
                       [4, 5, 6, 0],
                       [5, 6, 7, 0],
                       [7, 8, 1, 0],
                       [1, 2, 8, 7],
                       [1, 2, 3, 8],
                       [1, 2, 3, 4]])
y_labels = [f'MODULE_{i}' for i in range(8)]
#this is the distance matrix and condensed distance matrix made by data_to_dist_matrix and is only included so I can check what it's doing
dist_matrix = np.array([[jaccard_dissimilarity(a,b, 0) for b in data_array] for a in data_array])
condensed_dist_matrix = data_to_dist_matrix(data_array, 0)
# Create Side Dendrogram
fig = ff.create_dendrogram(data_array,
                           orientation='right',
                           labels=y_labels,
                           distfun=data_to_dist_matrix)
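To render the figure, the same layout calls as in the first answer apply:
fig.update_layout(width=800, height=800)
fig.show()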

How to fix TypeError: no supported conversion for types: (dtype('<U10'),)

I'm trying to build a simple song artist recommendation system in Python using cosine similarity algorithms. The dataset that I'm using is the last.fm dataset - https://www.kaggle.com/neferfufi/lastfm
I've been following the blog at https://www.benfrederickson.com/distance-metrics/
and I've tried to write similar code.
import pandas as pd
import numpy as np
from numpy import zeros
from collections import defaultdict
from scipy.sparse import csr_matrix
import keras
from keras.layers import dot
url_data = pd.read_csv("stuff.tsv",
                       usecols=[0, 2, 3],
                       names=['user', 'artist', 'plays'])
userids = defaultdict(lambda: len(userids))
url_data['userid'] = url_data['user'].map(userids.__getitem__)
artists = dict((artist, csr_matrix(
                    (group['plays'], (zeros(len(group)), group['userid'])),
                    shape=[1, len(userids)]))
               for artist, group in url_data.groupby('artist'))
SMOOTHING = 20
def newSmoothcosine(a, b):
    overlap = dot(binarize(a), binarize(b).T)[0, 0]
    # smooth cosine by discounting by set intersection
    return (overlap / (SMOOTHING + overlap)) * cosine(a, b)

def binarize(artist):
    ret = csr_matrix(artist)
    ret.data = ones(len(artist.data))
    return ret
print(newSmoothcosine('Kanye West', 'Jay-Z'))
I expect it to return the smoothed cosine of the angle between the two artists, but instead I get
TypeError: no supported conversion for types: (dtype('<U10'),)
Please help out!
Here's a suggestion - I don't know whether it will work or not, but you can try using a lambda to convert the dtype to float:
df.apply(lambda x: x.replace('$', '').replace(',', '')).astype('float')
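As a side note, and only a guess: the reported TypeError suggests the artist name strings themselves are being fed into the sparse-matrix math. Looking the names up in the artists dictionary first may be what was intended, along the lines of:
# hypothetical fix: pass the per-artist play-count vectors, not the raw strings
print(newSmoothcosine(artists['Kanye West'], artists['Jay-Z']))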
