DBSCAN fit_predict on precomputed metrics outputs strange clusters - python

I am practicing ML. Specifically, I am attempting to apply DBSCAN to a precomputed distance matrix (just to check how this works). Yes, I know I could use the Euclidean metric directly, but I wanted to test the precomputed option.
I am unsure why the labels all have the same value for a data set of random pairs in 3 different regions - I expected DBSCAN to separate those. Note: even if I use non-overlapping ranges for data1/2/3, I still get a single cluster in the output.
Here is the code:
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.cluster import DBSCAN
from scipy.spatial.distance import pdist, squareform
import random
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
data1 = np.array([[random.randint(1, 400) for i in range(2)] for j in range(50)], dtype=np.float64)
data2 = np.array([[random.randint(300, 700) for i in range(2)] for j in range(50)], dtype=np.float64)
data3 = np.array([[random.randint(600, 900) for i in range(2)] for j in range(50)], dtype=np.float64)
data = np.append(np.append(data1, data2, axis=0), data3, axis=0)
d = pdist(data, lambda u, v: np.sqrt(((u - v) ** 2).sum()))
distance_matrix = squareform(d)
cluster = DBSCAN(eps=0.3, min_samples=2, metric='precomputed')
dbscan_model = cluster.fit_predict(distance_matrix)
plt.scatter(data[:, 0], data[:, 1], s=100, c=dbscan_model)
plt.show()
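A likely culprit, given how DBSCAN works: with coordinates up to 900 the pairwise distances are mostly in the tens to hundreds, so eps=0.3 yields no core samples and fit_predict labels every point as noise (-1), which shows up as a single color in the scatter plot. A quick sanity check along those lines (the larger eps below is purely illustrative):
print(distance_matrix.min(), distance_matrix.max())  # inspect the scale of the distances
cluster = DBSCAN(eps=50, min_samples=2, metric='precomputed')  # eps on the same scale as the data
labels = cluster.fit_predict(distance_matrix)
print(np.unique(labels))  # -1 is noise; 0, 1, 2, ... are clusters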

Related

I have some problems with a ValueError in Python

Here is my code to plot a stress-strain curve:
import matplotlib.pyplot as plt
import numpy as np
import math
from scipy.interpolate import interp1d
from matplotlib.offsetbox import AnchoredText
import pandas as pd
#both strain is a column in the given dataframe, and I manually calculated stress
df_1 = pd.read_csv('1045.csv',skiprows=25,header=[0,1])
print(df_1.head())
A1 = 40.602*(10e-6)
stress1 = ((df_1.Load)/A1)
plt.figure(figsize=(12,9))
plt.plot(df_1.Strain1.values,df_1.Load.values,'g')
plt.ylabel('stress(Pa)',fontsize=13)
plt.xlabel('Strain(%)',fontsize=13)
plt.xticks(np.arange(-6e-5,0.15,step=0.005),rotation = 45)
plt.yticks(np.arange(0,42000,step=1000))
strain = df_1.Strain1.values
stress = np.array(((df_1.Load.values)/A1))
strain = np.array((df_1.Strain1.values))
LinearLimit=1
Strain_values_linear = np.linspace(strain[0], strain[LinearLimit], num=50, endpoint=True)
Strain_values_eng = np.linspace(strain[LinearLimit], strain[-1], num=50, endpoint=True)
f1 = interp1d(strain, stress, fill_value='extrapolate')
f2 = interp1d(strain, stress, kind=3, fill_value='extrapolate')
Now I keep getting a ValueError saying: "x and y arrays must be equal in length along interpolation axis." I don't understand this... I printed the shapes of strain and stress and they are the same.
Btw, here is a screenshot of the CSV file:
[screenshot of the CSV file]
You are probably passing an array of shape (..., N) as the first argument (meaning strain has a shape of the form (..., N)). SciPy doesn't allow that and raises a ValueError. See the documentation for details. You should run a for loop if you have multiple vectors in the strain array. The following code should work, assuming you want to interpolate one function for each row in strain (and that strain is a 2-D array; if it isn't, you can easily convert it using strain.reshape(-1, N)):
import matplotlib.pyplot as plt
import numpy as np
import math
from scipy.interpolate import interp1d
from matplotlib.offsetbox import AnchoredText
import pandas as pd
#both strain is a column in the given dataframe, and I manually calculated stress
df_1 = pd.read_csv('1045.csv',skiprows=25,header=[0,1])
print(df_1.head())
A1 = 40.602*(10e-6)
stress1 = ((df_1.Load)/A1)
plt.figure(figsize=(12,9))
plt.plot(df_1.Strain1.values,df_1.Load.values,'g')
plt.ylabel('stress(Pa)',fontsize=13)
plt.xlabel('Strain(%)',fontsize=13)
plt.xticks(np.arange(-6e-5,0.15,step=0.005),rotation = 45)
plt.yticks(np.arange(0,42000,step=1000))
strain = df_1.Strain1.values
stress = np.array(((df_1.Load.values)/A1))
strain = np.array((df_1.Strain1.values))
LinearLimit=1
Strain_values_linear = np.linspace(strain[0], strain[LinearLimit], num=50, endpoint=True)
Strain_values_eng = np.linspace(strain[LinearLimit], strain[-1], num=50, endpoint=True)
f1, f2 = [], []
for row in range(len(strain)):
    f1.append(interp1d(strain[row], stress, fill_value='extrapolate'))
    f2.append(interp1d(strain[row], stress, kind=3, fill_value='extrapolate'))
Edit: From the comment, you have a strain array of shape (222, 1). This means you already have a single vector, but its shape is not the form SciPy accepts. In this case, you will have to reshape the strain and stress arrays to the form (N,). The following code should work:
import matplotlib.pyplot as plt
import numpy as np
import math
from scipy.interpolate import interp1d
from matplotlib.offsetbox import AnchoredText
import pandas as pd
#both strain is a column in the given dataframe, and I manually calculated stress
df_1 = pd.read_csv('1045.csv',skiprows=25,header=[0,1])
print(df_1.head())
A1 = 40.602*(10e-6)
stress1 = ((df_1.Load)/A1)
plt.figure(figsize=(12,9))
plt.plot(df_1.Strain1.values,df_1.Load.values,'g')
plt.ylabel('stress(Pa)',fontsize=13)
plt.xlabel('Strain(%)',fontsize=13)
plt.xticks(np.arange(-6e-5,0.15,step=0.005),rotation = 45)
plt.yticks(np.arange(0,42000,step=1000))
strain = df_1.Strain1.values
stress = np.array(((df_1.Load.values)/A1))
strain = np.array((df_1.Strain1.values))
strain = strain.reshape(-1,)
stress = stress.reshape(-1,)
LinearLimit=1
Strain_values_linear = np.linspace(strain[0], strain[LinearLimit], num=50, endpoint=True)
Strain_values_eng = np.linspace(strain[LinearLimit], strain[-1], num=50, endpoint=True)
f1 = interp1d(strain, stress, fill_value='extrapolate')
f2 = interp1d(strain, stress, kind=3, fill_value='extrapolate')
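A slightly more compact variant of the same fix, assuming the two-level header (header=[0,1]) is what makes .values come back with shape (N, 1), is to flatten with ravel() right away:
strain = df_1.Strain1.values.ravel()        # (N, 1) -> (N,)
stress = (df_1.Load.values / A1).ravel()    # same flattening for the stress values
f1 = interp1d(strain, stress, fill_value='extrapolate')
f2 = interp1d(strain, stress, kind=3, fill_value='extrapolate')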

Principal component analysis dimension reduction in python

I have to implement my own PCA function Y, V = PCA(data, M, whitening) that computes the first M principal components and transforms the data so that y_n = U^T x_n. The function should also return V, the amount of variance that is explained by the transformation.
I have to reduce the dimension of the data from D=4 to M=2, given the function skeleton below:
def PCA(data, nr_dimensions=None, whitening=False):
    """Perform PCA and reduce the dimension of the data (D) to nr_dimensions.
    Input:
        data... samples, nr_samples x D
        nr_dimensions... dimension after the transformation, scalar
        whitening... False -> standard PCA, True -> PCA with whitening
    Returns:
        transformed data... nr_samples x nr_dimensions
        variance_explained... amount of variance explained by the first nr_dimensions principal components, scalar"""
    if nr_dimensions is not None:
        dim = nr_dimensions
    else:
        dim = 2
What I have done so far is the following:
import numpy as np
import matplotlib.cm as cm
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import scipy.stats as stats
from scipy.stats import multivariate_normal
import pdb
import sklearn
from sklearn import datasets
#covariance matrix
mean_vec = np.mean(data, axis=0)  # per-feature (column) mean
cov_mat = (data - mean_vec).T.dot((data - mean_vec)) / (data.shape[0] - 1)
print('Covariance matrix \n%s' % cov_mat)
#now the eigendecomposition of the cov matrix
cov_mat = np.cov(data.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('Eigenvectors \n%s' % eig_vecs)
print('\nEigenvalues \n%s' % eig_vals)
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
This is the point where I don't know what to do now and how to reduce dimension.
Any help would be welcome! :)
Here is a simple example for the case where the initial matrix A, which contains the samples and features, has shape = [samples, features]:
from numpy import array
from numpy import mean
from numpy import cov
from numpy.linalg import eig
# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)
# calculate the mean of each column, assuming each column is a variable/feature
M = mean(A.T, axis=1)
print(M)
# center columns by subtracting column means
C = A - M
print(C)
# calculate covariance matrix of centered matrix
V = cov(C.T)
print(V)
# eigendecomposition of covariance matrix
values, vectors = eig(V)
print(vectors)
print(values)
# project data
P = vectors.T.dot(C.T)
print(P.T)
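To actually reduce the dimension to M components (the part the question is stuck on), sort the eigenpairs by decreasing eigenvalue, stack the top M eigenvectors into a projection matrix, and project the centered data onto it; the ratio of the kept eigenvalues to their sum is the variance explained. A minimal sketch continuing from the variables above (M=2 is just the illustrative choice from the question; with this toy 2-feature matrix it keeps everything):
M = 2                                 # target dimensionality (from the question)
order = values.argsort()[::-1]        # indices of eigenvalues, largest first
W = vectors[:, order[:M]]             # projection matrix, shape (D, M)
Y = C.dot(W)                          # transformed data, shape (n_samples, M)
variance_explained = values[order[:M]].sum() / values.sum()
print(Y)
print(variance_explained)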
PCA is essentially a singular value decomposition, so you can either use numpy.linalg.svd:
import numpy as np

def PCA(U, ndim, whitening=False):
    L, G, R = np.linalg.svd(U, full_matrices=False)
    if not whitening:
        L = L * G  # scale each left singular vector column by its singular value
    Y = L[:, :ndim] @ R[:, :ndim].T
    return Y, G[:ndim]
If you want to use the eigenvalue problem instead, then assuming the number of samples is higher than the number of features (otherwise your data would be underfit), it is inefficient to calculate the spatial correlations (left eigenvectors) directly. Instead, compute the right eigenvectors from U^T U and reconstruct the left ones from them:
def PCA(U, ndim, whitening=False):
    K = U.T @ U                        # Gram matrix, gives the right eigenvectors
    G, R = np.linalg.eigh(K)
    G = G[::-1]                        # eigh returns eigenvalues in ascending order
    R = R[:, ::-1]                     # reorder the eigenvector columns to match
    L = U @ R                          # reconstructing the left eigenvectors
    nrm = np.linalg.norm(L, axis=0, keepdims=True)  # normalizing them
    L /= nrm
    if not whitening:
        L = L * np.sqrt(G)             # eigenvalues of U.T @ U are the squared singular values
    Y = L[:, :ndim] @ R[:, :ndim].T
    return Y, G[:ndim]

Matplotlib plot pmf from list of 2D numpy arrays

I have a dataset from my simulations where I combine the results from each simulation seed into a bigger list using bl.extend(df['column'].tolist()).
I'm also running several simulation scenarios, so I append each scenario to a list of lists.
Finally, I'm computing the Probability Mass Function (PMF) of each list as follows (from How to plot a PMF of a sample?)
for idx, sublist in enumerate(pmf_list):
    val, cnt = np.unique(sublist, return_counts=True)
    pmf = cnt / float(len(sublist))
    plot_pmf.append(np.column_stack((val, pmf)))
The issue is that I end up with a list of numpy arrays which I don't know how to plot. The minimum code to reproduce the problem is the following:
import numpy as np
list1 = np.empty([2, 2])
list2 = np.empty([2, 2])
list3 = np.empty([2, 2])
bl = [] # big list
bl.append(list1)
bl.append(list2)
bl.append(list3)
print bl
I can plot using plt.hist(bl[0]) but it doesn't give me the right results. See plot attached for the following list.
<type 'numpy.ndarray'>
[[0.00000000e+00 1.91734780e-01]
[1.00000000e+00 2.94277080e-02]
[2.00000000e+00 3.28276369e-01]
[3.00000000e+00 4.43357154e-01]
[4.00000000e+00 3.54294582e-03]
[5.00000000e+00 1.57306794e-03]
[6.00000000e+00 2.00530733e-03]
[7.00000000e+00 2.95245485e-05]
[8.00000000e+00 2.24386568e-05]
[9.00000000e+00 2.83435665e-05]
[1.00000000e+01 1.18098194e-06]
[1.20000000e+01 1.18098194e-06]]
Formatting the y-values I get:
0.1944084241
0.0415880165
0.3480178394
0.4031723062
0.0050902199
0.0033411939
0.0040175705
0.0001480127
0.0001031961
0.0001008373
0.0000058969
0.0000011794
0.0000047175
0.0000005897
These y-values are very different from the ones shown on the histogram plot.
Does the following graph look right?
import matplotlib.pyplot as plt
import numpy as np
X = np.array([[0.00000000e+00, 1.91734780e-01],
[1.00000000e+00, 2.94277080e-02],
[2.00000000e+00, 3.28276369e-01],
[3.00000000e+00, 4.43357154e-01],
[4.00000000e+00, 3.54294582e-03],
[5.00000000e+00, 1.57306794e-03],
[6.00000000e+00, 2.00530733e-03],
[7.00000000e+00, 2.95245485e-05],
[8.00000000e+00, 2.24386568e-05],
[9.00000000e+00, 2.83435665e-05],
[1.00000000e+01, 1.18098194e-06],
[1.20000000e+01, 1.18098194e-06],])
plt.bar(x=X[:, 0], height=X[:, 1])
plt.show()
If you already have the first column as the possible values of the random variable, and the second column as the corresponding probability values, you could use a bar plot to visualize the PMF.
The histogram plot function plt.hist is for a vector of observed values. For example,
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
np.random.seed(0)
plt.hist(np.random.normal(size=1000))
plt.show()
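If the goal is to compare all of the per-scenario PMFs stored in the list of np.column_stack((val, pmf)) arrays built in the question, you can loop over that list and draw one set of bars per scenario. A small sketch of that idea (plot_pmf is assumed to be the list from the question, and the offset/width values are only illustrative):
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for idx, arr in enumerate(plot_pmf):
    # shift each scenario slightly so the bars sit next to each other
    ax.bar(arr[:, 0] + 0.25 * idx, arr[:, 1], width=0.25, label='scenario %d' % idx)
ax.set_xlabel('value')
ax.set_ylabel('probability')
ax.legend()
plt.show()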

Calculating Loss function for kmeans in pandas dataframe

I have a dataframe containing 5 columns. I am trying to cluster the points for three variables X, Y and Z and to compute the loss function for k-means clustering. The following code takes care of that, but if I run it on my real dataframe with 160,000 rows, it takes forever! I assume it can be done a lot faster.
PS: It seems that the KMeans module in sklearn does not provide the loss function, which is why I am writing my own code.
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 5), columns=list('XYZVW'))
kmeans = KMeans(n_clusters=6, random_state=0).fit(df[['X', 'Y', 'Z']].values)
df['Cluster'] = kmeans.labels_
loss = 0.0
for i in range(df.shape[0]):
    cluster = int(df.loc[i, "Cluster"])
    a = np.array(df.loc[i, ['X', 'Y', 'Z']])
    b = kmeans.cluster_centers_[cluster]
    loss += np.linalg.norm(a - b)
print(loss)
It seems that the scipy package takes care of the loss function, and it is pretty fast. Here's the code:
from scipy.cluster.vq import vq, kmeans, whiten
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 5), columns=list('XYZVW'))
features = df[['X', 'Y', 'Z']].values
centers, loss = kmeans(features, 6)  # note: scipy's kmeans returns the mean distortion, not the sum
df['Cluster'] = vq(features, centers)[0]
That being said, I am still interested to know the fastest way of calculating loss function using sklearn kmeans module.
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
inertia_ : float
Sum of distances of samples to their closest cluster center.
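Note that in current scikit-learn versions inertia_ is documented as the sum of squared distances, so it is not exactly the quantity computed in the question. If you want the sum of plain Euclidean distances from the fitted sklearn model, the per-row loop can be replaced by one vectorized expression over the centers indexed by the labels. A sketch of that idea, using the same column names as the question:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame(np.random.randn(1000, 5), columns=list('XYZVW'))
X = df[['X', 'Y', 'Z']].values
kmeans = KMeans(n_clusters=6, random_state=0).fit(X)

# distance of every sample to its own cluster center, summed in one shot
loss = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1).sum()
print(loss)
print(kmeans.inertia_)  # sum of *squared* distances, for comparison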

Determining a threshold value for a bimodal distribution via KMeans clustering

I'd like to find a threshold value for a bimodal distribution. For example, a bimodal distribution could look like the following:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(45)
n = 1000; b = n//10; i = np.random.randint(0,2,n)
x = i*np.random.normal(-2.0,0.8,n) + (1-i)*np.random.normal(2.0,0.8,n)
_ = plt.hist(x,bins=b)
An attempt to find the cluster centers did not work, as I wasn't sure how the matrix, h, should be formatted:
from sklearn.cluster import KMeans
h = np.histogram(x,bins=b)
h = np.vstack((0.5*(h[1][:-1]+h[1][1:]),h[0])).T # because h[0] and h[1] have different sizes.
kmeans = KMeans(n_clusters=2).fit(h)
I would expect to be able to find the cluster centers around -2 and 2. The threshold value would then be the midpoint of the two cluster centers.
Your question is a bit confusing to me, so please let me know if I've interpreted it incorrectly. I think you are basically trying to do 1-D k-means, and you introduce frequency as a second dimension to get KMeans to work, but you would really just be happy with [-2, 2] as the output for the centers instead of [(-2, y1), (2, y2)].
To do 1-D k-means you can just reshape your data into n vectors of length 1 (similar question: Scikit-learn: How to run KMeans on a one-dimensional array?).
code:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(45)
n = 1000;
b = n//10;
i = np.random.randint(0,2,n)
x = i*np.random.normal(-2.0,0.8,n) + (1-i)*np.random.normal(2.0,0.8,n)
_ = plt.hist(x,bins=b)
from sklearn.cluster import KMeans
h = np.histogram(x,bins=b)
h = np.vstack((0.5*(h[1][:-1]+h[1][1:]),h[0])).T # because h[0] and h[1] have different sizes.
kmeans = KMeans(n_clusters=2).fit(x.reshape(n,1))
print(kmeans.cluster_centers_)
output:
[[-1.9896414]
[ 2.0176039]]
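Since the question ultimately wants a threshold, the midpoint of the two recovered centers is just their mean. A tiny follow-up sketch, continuing from the fitted kmeans object above:
centers = kmeans.cluster_centers_.ravel()  # e.g. [-1.99, 2.02]
threshold = centers.mean()                 # midpoint of the two cluster centers
print(threshold)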
