I am doing hierarchical clustering of a 2-dimensional matrix using the correlation distance metric (i.e. 1 - Pearson correlation). My code is the following (the data is in a variable called "data"):
from hcluster import *
Y = pdist(data, 'correlation')
cluster_type = 'average'
Z = linkage(Y, cluster_type)
dendrogram(Z)
The error I get is:
ValueError: Linkage 'Z' contains negative distances.
What causes this error? The matrix "data" that I use is simply:
[[ 156.651968 2345.168618]
[ 158.089968 2032.840106]
[ 207.996413 2786.779081]
[ 151.885804 2286.70533 ]
[ 154.33665 1967.74431 ]
[ 150.060182 1931.991169]
[ 133.800787 1978.539644]
[ 112.743217 1478.903191]
[ 125.388905 1422.3247 ]]
I don't see how pdist could ever produce negative numbers when taking 1 - pearson correlation. Any ideas on this?
thank you.
There are some lovely floating point problems going on. If you look at the results of pdist, you'll find there are very small negative numbers (-2.22044605e-16) in them. Essentially, they should be zero. You can use numpy's clip function to deal with it if you would like.
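A minimal sketch of that fix, clipping the condensed distances with numpy before the linkage step (this is my illustration of the suggestion above, not code from the original answer):
import numpy as np
Y = pdist(data, 'correlation')
Y = np.clip(Y, 0, None)  # clamp tiny negative values such as -2.22e-16 up to zero
Z = linkage(Y, cluster_type)
dendrogram(Z)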
If you were getting the error
KeyError: -428
and your code was along the lines of
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline
from scipy.cluster.hierarchy import ward, dendrogram
linkage_matrix = ward(dist)  # define the linkage matrix using Ward clustering on pre-computed distances
fig, ax = plt.subplots(figsize=(35, 20),dpi=400) # set size
ax = dendrogram(linkage_matrix, orientation="right",labels=queries);
It is due to a mismatch in the indexes of queries, for example when queries is a pandas Series whose index does not line up with the positions that dendrogram uses to look up the labels.
Might want to update to
ax = dendrogram(linkage_matrix, orientation="right",labels=list(queries));
I have 136 numbers whose distribution is an overlapping mixture of 8 Gaussian distributions. I want to find the mean and variance of each Gaussian component. Can you find any mistakes in my code?
file = open("1.txt", 'r')  # the data in 1.txt looks like 0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194...
y = [int(i) for i in list((file.read()).split(','))]  # list holding the data above
x = list(range(1, len(y)+1))  # the x values
z = list(zip(x, y))  # z elements are (1, 0), (2, 0), ...
So, through the process above, I obtained a list z whose elements are the 136 points (x, y) in the xy plane, with the given data as the y values.
Now I want to obtain each Gaussian component's mean and variance. The basic assumption is that the given data consist of 8 overlapping Gaussian distributions.
import numpy as np
from sklearn.mixture import GaussianMixture
data = np.array(z).reshape(-1,1)
model = GaussianMixture(n_components=8).fit(data)
print(model.means_)
file.close()
Actually, I don't know how to make this code print the 8 means and variances... Can anyone help me?
You can use this; I have made some sample code for your visualizations:
import numpy as np
from sklearn.mixture import GaussianMixture
import scipy.stats
import matplotlib.pyplot as plt
%matplotlib inline
#Sample data
x = [0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194]
num_components = 2
#Fit a model onto the data
data = np.array(x).reshape(-1,1)
model = GaussianMixture(n_components=num_components).fit(data)
#Get the means and standard deviations
mu = np.abs(model.means_.flatten())
sd = np.sqrt(np.abs(model.covariances_.flatten()))
#Plotting
extend_window = 50 # for zooming into or out of the graph; the higher it is, the more zoomed out
x_values = np.arange(data.min()-extend_window, data.max()+extend_window, 0.1) #For plotting smooth graphs
plt.plot(data, np.zeros(data.shape), linestyle='None', markersize = 10.0, marker='o') #plot the data on x axis
#plot the different distributions (in this case 2 of them)
for i in range(num_components):
    y_values = scipy.stats.norm(mu[i], sd[i])
    plt.plot(x_values, y_values.pdf(x_values))
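For the original question (fitting 8 components), a minimal sketch of printing each component's mean and variance could look like the following; note that fitting on the y values alone, rather than on the (x, y) pairs, is my assumption about what was intended:
import numpy as np
from sklearn.mixture import GaussianMixture

data = np.array(y).reshape(-1, 1)  # fit on the y values only
model = GaussianMixture(n_components=8).fit(data)
for mean, var in zip(model.means_.flatten(), model.covariances_.flatten()):
    print(f"mean = {mean:.3f}, variance = {var:.3f}")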
I am new to plotly and need to draw a dendrogram with group average linkage.
I am aware that there is a distfun parameter in create_dendrogram(), but I have no idea what to pass to that argument to get group average linkage. The distfun argument apparently has to be callable. What function should I pass to it?
As a side note, I have a sample pairwise distance matrix
0
13 0
2 14 0
17 1 18 0
which, when I pass it to the create_dendrogram() method, seems to produce an incorrect result. What am I doing wrong here?
code:
import plotly.figure_factory as ff
import numpy as np
X = np.matrix([[0,0,0,0],[13,0,0,0],[2,14,0,0],[17,1,18,0]])
names = list("0123")
fig = ff.create_dendrogram(X, orientation='left', labels=names)
fig.update_layout(width=800, height=800)
fig.show()
The code is literally copied from the plotly website because I don't know what I'm supposed to do.
This is the website: https://plotly.com/python/v3/dendrogram/
You can choose a linkage method using scipy.cluster.hierarchy.linkage()
via the linkagefun argument of the create_dendrogram() function.
For example, to use UPGMA (Unweighted Pair Group Method with Arithmetic mean) algorithm:
import plotly.figure_factory as ff
import scipy.cluster.hierarchy as sch
import numpy as np
X = np.matrix([[0,0,0,0],[13,0,0,0],[2,14,0,0],[17,1,18,0]])
names = "0123"
fig = ff.create_dendrogram(X,
                           orientation='left',
                           labels=names,
                           linkagefun=lambda x: sch.linkage(x, "average"))
fig.update_layout(width=800, height=800)
fig.show()
Please note that X has to be a matrix of data samples.
This is a bit old but, for anyone else with similar issues, I think the distfun param simply specifies how you want to convert your data matrix to a condensed distance matrix; you define the function yourself.
For example, after a bit of head banging I cobbled together data_to_dist_matrix to convert a data matrix to a Jaccard distance matrix and then condense it. You should be aware that plotly's dendrogram implementation does not check whether your matrix is condensed, so your distfun needs to ensure this happens. Maybe this is wrong, but it looks like distfun should only take one positional param (the data matrix) and return one object (the condensed distance matrix):
import plotly.figure_factory as ff
import numpy as np
from scipy.spatial.distance import jaccard, squareform
def jaccard_dissimilarity(feature_list1, feature_list2, filler_val):  # binary
    all_features = set([i for i in feature_list1 if i != filler_val])  # filler_val can be used to even up ragged lists and ignore certain dtypes, i.e. prots not in a module
    all_features.update(set([i for i in feature_list2 if i != filler_val]))  # works for both numpy arrays and lists
    counts_1 = [1 if feature in feature_list1 else 0 for feature in all_features]
    counts_2 = [1 if feature in feature_list2 else 0 for feature in all_features]
    return jaccard(counts_1, counts_2)

def data_to_dist_matrix(mn_data, filler_val=0):
    # notes:
    # the original plotly example uses pdist to find manhattan distance for clustering.
    # pdist 'Returns a condensed distance matrix Y' - https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html#scipy.spatial.distance.pdist.
    # a condensed distance matrix is required as input to scipy linkage for clustering.
    # the plotly dendrogram function does not apply this conversion to the output of a given distfun call - https://github.com/plotly/plotly.py/blob/cfad7862594b35965c0e000813bd7805e8494a5b/packages/python/plotly/plotly/figure_factory/_dendrogram.py#L340
    # therefore you should convert the distance matrix to condensed form yourself, as below with squareform
    distance_matrix = np.array([[jaccard_dissimilarity(a, b, filler_val) for b in mn_data] for a in mn_data])
    return squareform(distance_matrix)

# toy data to visually check the clustering looks sensible
data_array = np.array([[1, 2, 3, 0],
                       [2, 3, 10, 0],
                       [4, 5, 6, 0],
                       [5, 6, 7, 0],
                       [7, 8, 1, 0],
                       [1, 2, 8, 7],
                       [1, 2, 3, 8],
                       [1, 2, 3, 4]])

y_labels = [f'MODULE_{i}' for i in range(8)]

# this is the distance matrix and condensed distance matrix made by data_to_dist_matrix; it is only included so I can check what it's doing
dist_matrix = np.array([[jaccard_dissimilarity(a, b, 0) for b in data_array] for a in data_array])
condensed_dist_matrix = data_to_dist_matrix(data_array, 0)

# Create Side Dendrogram
fig = ff.create_dendrogram(data_array,
                           orientation='right',
                           labels=y_labels,
                           distfun=data_to_dist_matrix)
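The original snippet stops at creating the figure; to display it, the same layout and show calls used in the earlier examples apply:
fig.update_layout(width=800, height=800)
fig.show()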
I have a 2-dimensional histogram with 10 bins. I wish to know whether there is a numpy function (or any faster method) to find which points lie in each bin of the 2D grid. Is there a way to access the bin elements?
I hope this solves your problem. However, I believe others can improve my code because I am new to Python.
Create a histogram with matplotlib:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(10)  # deterministic random data
a = np.hstack((rng.normal(size=100), rng.normal(loc=5, scale=2, size=1000)))
n, bins, patches = plt.hist(a, bins=10)  # arguments are passed to np.histogram
plt.title("Histogram with '10' bins")
plt.show()
Reshape the arrays and compute the bin index of each point:
newbin = np.repeat(np.reshape(bins, (-1, len(bins))), len(a), axis=0)  # one row of bin edges per data point
newa = np.repeat(np.reshape(a, (len(a), -1)), len(bins), axis=1)  # one column per bin edge
#index_bin = (np.where(newbin[:,0] > np.reshape(a,(1,-1))[:,0]))[0][0]
index_bin = (newbin > newa).argmax(axis=1).T  # index of the first bin edge greater than each value
Test:
print(a[0] , bins)
print(index_bin[0])
Output
1.331586504129518 [-2.13171211 -0.88255884 0.36659444 1.61574771 2.86490098 4.11405425
5.36320753 6.6123608 7.86151407 9.11066734 10.35982062]
3
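As a side note (not part of the original answer), numpy's digitize computes the same bin indices directly and avoids building the intermediate arrays:
index_bin = np.digitize(a, bins)  # bin index of each value of a
print(index_bin[0])  # should print 3, matching the output above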
I have N random variables (X1,...,XN), each of which is distributed according to a specific marginal (normal, log-normal, Poisson...), and I want to generate a sample of p joint realizations of these variables Xi, given that the variables are correlated through a given copula, using Python 3. I know that R is a better option but I want to do it in Python.
Following this method I managed to do so with a Gaussian copula. Now I want to adapt the method to use an Archimedean copula (Gumbel, Frank...) or a Student copula. At the beginning of the Gaussian copula method, you draw a sample of p realizations from a multivariate normal distribution. To adapt this to another copula, for instance a bivariate Gumbel, my idea is to draw a sample from the joint distribution of a bivariate Gumbel, but I am not sure how to implement this.
I have tried several Python 3 packages: copulae, copula and copulas all provide the option to fit a particular copula to a dataset, but do not seem to allow drawing a random sample from a given copula.
Can you provide some algorithmic insight on how to draw multivariate random samples from a given Copula with uniform marginals?
Please check this page, "Create a composed distribution". I think this is what you're looking for.
For example, if you have 2 distributions, x1: Uniform on [1, 3] and x2: Normal(0, 2), and if you know the dependence structure (copula), you can build the multidimensional distribution X = (x1, x2) very easily.
import openturns as ot
x1 = ot.Uniform(1, 3)
x2 = ot.Normal(0, 2)
copula = ot.IndependentCopula()
X = ot.ComposedDistribution([x1, x2], copula)
X.getSample(5) will return a sample of size = 5:
>>> [ X0 X1 ]
0 : [ 1.87016 0.802719 ]
1 : [ 1.72333 2.73565 ]
2 : [ 1.00422 2.00869 ]
3 : [ 1.47887 1.4831 ]
4 : [ 1.51031 -0.0872247 ]
You can view the cloud in 2D
import matplotlib.pyplot as plt
sample = X.getSample(1000)
plt.scatter(sample[:, 0], sample[:, 1], s=2)
If you choose copula = ot.ClaytonCopula(2) or copula = ot.GumbelCopula(2) instead, the shape of the cloud changes accordingly (the original answer showed the corresponding scatter plots).
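As a quick sketch of that variation (same marginals, different copula; the parameter value 2 is only illustrative):
copula = ot.GumbelCopula(2)
X = ot.ComposedDistribution([x1, x2], copula)
sample = X.getSample(1000)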
The following code implements the Clayton and AMH copulas. Page 4 of this paper shows how other kinds of copulas can be implemented.
import random
import math
import scipy.stats as st
def clayton(theta, n):
    v = random.gammavariate(1/theta, 1)
    uf = [random.expovariate(1)/v for _ in range(n)]
    return [(k+1)**(-1.0/theta) for k in uf]

def amh(theta, n):
    # NOTE: Use SciPy RNG for convenience here
    v = st.geom(1-theta).rvs()
    uf = [random.expovariate(1)/v for _ in range(n)]
    return [(1-theta)/(math.exp(k)-theta) for k in uf]
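A minimal usage sketch (my addition, not from the original answer): each call returns one n-dimensional realization with uniform marginals, which can then be pushed through the inverse CDFs (ppf) of the target marginals, here a normal and a lognormal as an example (st is scipy.stats, imported above):
u = clayton(2.0, 2)  # one bivariate realization with uniform marginals on (0, 1)
x1 = st.norm(loc=0, scale=1).ppf(u[0])  # map the first component to a normal marginal
x2 = st.lognorm(s=0.5).ppf(u[1])  # map the second component to a lognormal marginal
print(u, x1, x2)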
I want to interpolate a set of temperatures, defined on each node of a CFD simulation mesh, onto a different mesh.
Data from the original set are in csv (X1,Y1,Z1,T1) and I want to find new T2 values on an X2,Y2,Z2 mesh.
Of the many possibilities that SciPy provides, which is the most suitable for this application? What are the differences between a linear and a nearest-node approach?
Thank you for your time.
EDIT
Here is an example:
import numpy as np
from scipy.interpolate import griddata
from scipy.interpolate import LinearNDInterpolator
data = np.array([
[ -3.5622760653000E-02, 8.0497122655290E-02, 3.0788827491158E-01],
[ -3.5854682326000E-02, 8.0591522802259E-02, 3.0784350432341E-01],
[ -2.8168760240000E-02, 8.0819296043557E-02, 3.0988532075795E-01],
[ -2.8413346037000E-02, 8.0890746063578E-02, 3.1002054434659E-01],
[ -2.8168663383000E-02, 8.0981744777379E-02, 3.1015319609412E-01],
[ -3.4150537103000E-02, 8.1385114641365E-02, 3.0865343388355E-01],
[ -3.4461673349000E-02, 8.1537336777452E-02, 3.0858242919307E-01],
[ -3.4285601228000E-02, 8.1655884824782E-02, 3.0877386496235E-01],
[ -2.1832991391000E-02, 8.0380712111108E-02, 3.0867371621337E-01],
[ -2.1933870390000E-02, 8.0335713699008E-02, 3.0867959866155E-01]])
temp = np.array([1.4285955811000E+03,
1.4281038818000E+03,
1.4543135986000E+03,
1.4636379395000E+03,
1.4624763184000E+03,
1.3410919189000E+03,
1.3400545654000E+03,
1.3505817871000E+03,
1.2361110840000E+03,
1.2398562012000E+03])
linInter = LinearNDInterpolator(data, temp)
print (linInter(np.array([[-2.8168760240000E-02, 8.0819296043557E-02, 3.0988532075795E-01]])))
This code works, but I have a dataset of 10 million points to be interpolated onto a dataset of the same size.
The problem is that this operation is very slow for all of my points: is there a way to improve my code?
I used LinearNDInterpolator because it seems to be faster than NearestNDInterpolator (LinearVSNearest).
One solution would be to use RegularGridInterpolator (if your grid is regular). Another approach I can think of is to reduce your data size by taking intervals:
step = 4  # you can increase this based on your data size (e.g. 100)
m = ((data.argsort(0) % step) == 0).any(1)  # boolean mask that keeps only a subset of the points
linInter = LinearNDInterpolator(data[m], temp[m])
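The reduced interpolator is then queried just like before; a small usage sketch, where new_mesh_points stands in for the X2, Y2, Z2 node coordinates of the target mesh:
new_mesh_points = np.array([[-2.82e-02, 8.08e-02, 3.10e-01]])  # placeholder for the target mesh nodes
T2 = linInter(new_mesh_points)  # returns NaN for points outside the convex hull of the sample data
print(T2)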