Clustering with custom distance metric in sklearn - python

I am trying to implement a custom distance metric for clustering. The code snippet looks like:
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, MeanShift
def distance(x, y):
    # print(x, y) -> This x and y aren't one-hot vectors and is the source of this question
    match_count = 0.
    for xi, yi in zip(x, y):
        if float(xi) == 1. and xi == yi:
            match_count += 1
    return match_count

def custom_metric(x, y):
    # x, y are two vectors
    # distance(.,.) calculates count of elements when both xi and yi are True
    return distance(x, y)

vectorized_text = np.stack([[1, 0, 0, 1] * 100,
                            [1, 1, 1, 0] * 100,
                            [0, 1, 1, 0] * 100,
                            [0, 0, 0, 1] * 100] * 100)
dbscan = DBSCAN(min_samples=2, metric=custom_metric, eps=3, p=1).fit(vectorized_text)
The vectorized_text is a one-hot encoded feature matrix of size n_samples x n_features. But when custom_metric is called, one of x or y turns out to be a real-valued vector while the other remains a one-hot vector; I would expect both x and y to be one-hot vectors. This causes custom_metric to return wrong results at run time, and hence the clustering is not correct.
Example of x and y in distance(x, y) method:
x = [0.5 0.5 0.5 ... 0.5 0.5]
y = [0. 0. 0. 1. 0. 0. ... 1. 0.]
Both should have been one-hot vectors.
Does anyone have an idea to go about this situation?

First of all, your distance is wrong.
Distances must return small values for similar vectors. You have defined a similarity, not a distance.
Secondly, using naive Python code such as zip will perform extremely poorly. Python just does not optimize such code well; it will do all the work in the slow interpreter. Python speed is only okay if you vectorize everything. And in fact, this code can be vectorized trivially, and then it likely won't even matter whether your inputs are binary or float data. What you are computing, in a very complicated fashion, is nothing but the dot product of two vectors, isn't it?
Thus, your distance should probably look like this:
def distance(x, y):
    return x.shape[0] - np.dot(x, y)
Or whatever distance transformation you intend to use.
Now for your actual problem: my guess is that sklearn tries to accelerate your distance with a ball tree. That won't help much because of the poor performance of Python interpreter callbacks (in fact, you should probably precompute the entire distance matrix in one vectorized operation - something like dist = dim - X.dot(X.T)? Do the math yourself to verify the equation). Other languages such as Java (e.g., the ELKI tool) are much better to extend this way, because the HotSpot JIT compiler can optimize and inline such calls everywhere.
To test the hypothesis that the sklearn ball tree is the cause of the odd values you are observing, try setting algorithm="brute" (see the documentation) to disable the tree acceleration. But in the end, you'll want to either precompute the entire distance matrix (if you can afford the O(n²) cost) or switch to a different programming language (implementing your distance in Cython, for example, helps, but the data will likely still arrive as numpy float arrays).
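For illustration, here is a minimal sketch of the precomputed route, reusing the vectorized_text from the question. The exact distance transform (here, dimensionality minus the shared-ones count) and the eps value are assumptions you would need to adapt:
import numpy as np
from sklearn.cluster import DBSCAN

X = vectorized_text.astype(float)
# Shared-ones count for every pair of rows via one matrix product.
overlap = X @ X.T
# Turn the similarity into a distance and zero the diagonal so every
# point is at distance 0 from itself.
dist = X.shape[1] - overlap
np.fill_diagonal(dist, 0)
# eps mirrors the question but should be tuned to this distance scale.
db = DBSCAN(min_samples=2, eps=3, metric="precomputed").fit(dist)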

I don't get your question. If I have:
x = [1, 0, 1]
y = [0, 0, 1]
and I use:
def distance(x, y):
    # print(x, y) -> This x and y aren't one-hot vectors and is the source of this question
    match_count = 0.
    for xi, yi in zip(x, y):
        if float(xi) == 1. and xi == yi:
            match_count += 1
    return match_count
print(distance(x, y))
1.0
and on top of that, if you print x and y now:
>>> x
[1, 0, 1]
>>> y
[0, 0, 1]
so it is working?

I reproduced your code and I do get your error. Let me explain it better:
He has a vectorized_text variable (built with np.stack) which simulates a one-hot encoded feature set (it only contains 0s and 1s). In the DBSCAN model, he uses a custom_metric function to calculate the distance. The expectation is that, when the model is run, the custom metric function receives pairs of observations as they are: one-hot encoded values. Instead, when printing those values inside the distance function, only one of them arrives as-is, and the other appears to be a list of real values, as he described in the question:
x = [0.5 0.5 0.5 ... 0.5 0.5]
y = [0. 0. 0. 1. 0. 0. ... 1. 0.]
Anyway, when I pass plain lists to fit, the function receives the values as they are:
from sklearn.cluster import KMeans, DBSCAN, MeanShift

x = [1, 0, 1]
y = [0, 0, 1]
feature_set = [x*5]*5

def distance(x, y):
    # Printing here the values. Should be 0s and 1s
    print(x, y)
    match_count = 0.
    for xi, yi in zip(x, y):
        if float(xi) == 1. and xi == yi:
            match_count += 1
    return match_count

def custom_metric(x, y):
    # x, y are two vectors
    # distance(.,.) calculates count of elements when both xi and yi are True
    return distance(x, y)

dbscan = DBSCAN(min_samples=2, metric=custom_metric, eps=3, p=1).fit(feature_set)
Result:
[1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1.] ... [1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1.]
[1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1.] ... [1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1.]
I suggest you use a pandas DataFrame or some other input type and see if it works.

Related

Cross-Correlation for specific lags in python

I am currently using the correlate function from scipy.signal.
corrs = signal.correlate(x,y, mode="full")
where x and y are time-series arrays of the same length (N). This function returns the cross-correlation at every lag between the two arrays x and y (an array of length 2*N - 1).
Is there any other efficient way to compute the cross-correlations between x and y but only for certain lags, i.e. a function that only returns the first 10 lags?
Example:
x = [ 0. -0.11386133 0.04422655 0.03187104]
y = [ 0. 0.0970805 -0.02822892 -0.0661678 ]
corrs = [ 0. 0.00753395 0.00028781 -0.01441102 0.00339385 0.00309406 0. ]
and I want a subset of corrs: for example, I want corrs[2:5].
Thanks in advance!
I tried implementing the cross-correlation as defined in the scipy docs, but it is way slower.
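Not a full answer, but as a sketch of what computing only selected lags could look like: each entry of the mode="full" output is just a dot product of overlapping slices, so you can evaluate only the lags you care about directly (assuming real-valued series of equal length; corr_at_lag is a hypothetical helper name):
import numpy as np

def corr_at_lag(x, y, lag):
    # Equals signal.correlate(x, y, mode="full")[len(x) - 1 + lag]
    # for real-valued arrays of the same length.
    x = np.asarray(x)
    y = np.asarray(y)
    if lag >= 0:
        return np.dot(x[lag:], y[:len(y) - lag])
    return np.dot(x[:len(x) + lag], y[-lag:])

x = np.array([0., -0.11386133, 0.04422655, 0.03187104])
y = np.array([0., 0.0970805, -0.02822892, -0.0661678])
N = len(x)
# The subset corrs[2:5] corresponds to lags -1, 0 and 1 here.
subset = [corr_at_lag(x, y, k - (N - 1)) for k in range(2, 5)]
print(subset)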

CVXPY DCPError: Problem does not follow DCP rules for minimization

I'm trying to find the value of V that minimizes the trace of the quadratic form V'@W@V, where V is 3x1 and W is a 3x3 matrix, subject to the constraint V'@Q@V <= I, where Q is a 3x3 matrix. I'm new to cvxpy, but as far as I understand, the constraints here are given correctly.
Edit: As Micheal pointed out, if V is 3x1 and W is 3x3, V'WV will be a scalar. I redefined V to be a 6x2 matrix and W to be a 6x6 matrix, which should result in a 2x2 matrix whose diagonal entries I want to minimize.
V = cp.Variable(shape=(6,2))
objective = cp.Minimize(cp.trace(cp.quad_form(V,W)))
con = cp.quad_form(V,Q)
cons = [con<=np.eye(3)]
prob = cp.Problem(objective, cons)
result = prob.solve()
Raises an error
DCPError: Problem does not follow DCP rules. Specifically:
The objective is not DCP. Its following subexpressions are not:
var461 @ [[-1. 0. -1. 0. -0.77910125 -0.23003633]
[ 0. -1. 0. -1. 0.51523716 0.31351064]
[-1. 0. 0. 0. -0.43932559 0.06076264]
[ 0. -1. 0. 0. 0.43042035 0.59073962]
[-0.33977566 0.0848168 -0.43932559 0.43042035 -0.74982669 -0.26546118]
[-0.29079897 -0.27722898 0.06076264 0.59073962 -0.26546118 0.01021432]] @ var461
I'm looking for a correction on how to define V correctly.
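Not a direct fix, but a hedged sketch of one DCP-friendly way to express the trace objective, assuming W is positive semidefinite (note that the matrix printed in the error has negative diagonal entries, so the problem as stated is not convex). The quadratic constraint would need its own reformulation and is left out, and the placeholder W below is purely illustrative:
import numpy as np
import cvxpy as cp
from scipy.linalg import sqrtm

# Placeholder PSD matrix purely for illustration; substitute the real W.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
W = A @ A.T

V = cp.Variable((6, 2))

# trace(V' W V) equals the squared Frobenius norm of W^{1/2} V when W is PSD,
# which CVXPY recognises as convex.
W_half = np.real(sqrtm(W))
objective = cp.Minimize(cp.sum_squares(W_half @ V))
print(objective.is_dcp())  # True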

Weird clustering output (scikitlean kmeans)

I have an imbalanced dataset with four labels in total. Two of them have a much higher appearance frequency than the other two. I have nearly one million observations.
I'm trying to understand the data components a bit better by exploring with sklearn.cluster.KMeans clustering.
Here's my data:
print(X)
[[68. 0. 0. ... 0. 0. 0.]
[18. 1. 1. ... 1. 0. 0.]
[18. 1. 1. ... 0. 0. 0.]
...
[59. 0. 0. ... 0. 0. 0.]
[48. 1. 0. ... 0. 0. 1.]
[47. 1. 1. ... 0. 0. 0.]]
print(y)
[1 2 3 ... 3 2 3]
The observed labels have four levels (ordinal variables 0 - 3).
Here's my code:
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
algo1 = KMeans(n_clusters = 4)
y_pred = algo1.fit_predict(X_scaled)
mglearn.discrete_scatter(X_scaled[:, 0], X_scaled[:,1], y_pred)
plt.legend(["cluster 0", "cluster 1", "cluster 2", "cluster 3"], loc = 'best')
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
This looks weird, so I tried 3 clusters.
algo2 = KMeans(n_clusters = 3)
y_pred2 = algo2.fit_predict(X_scaled)
mglearn.discrete_scatter(X_scaled[:, 0], X_scaled[:,1], y_pred2)
plt.legend(["cluster 0", "cluster 1", "cluster 2"], loc = 'best')
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
And then 2 clusters
algo3 = KMeans(n_clusters = 2)
y_pred3 = algo3.fit_predict(X_scaled)
mglearn.discrete_scatter(X_scaled[:, 0], X_scaled[:,1], y_pred3)
plt.legend(["cluster 0", "cluster 1", "cluster 2"], loc = 'best')
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
I'm trying to figure out what exactly is happening to the clustering. Is there an alternative way that I could better understand the data structure?
You did not mention how many features you have, but by the looks of it, X has way more than just two variables. In your code, you are visualizing the clusters using only two features, while algo1 was fit on all of them.
In particular, you are visualizing the clusters using Feature 1, which appears to be binary (it only takes on the values -1 and 1), so it's not that the clustering is unsuccessful; you are simply visualizing the clusters with a very limited number of features.
By plotting in 2-D, you are limiting yourself to seeing the clusters as a function of only two variables, so there's a chance you'll be missing relationships that are only visible in 3-D or even higher dimensions. If you wish to carry on this way, I recommend plotting Feature 1 against all other features, then Feature 2 against all other features, and so on. This way, you will visualize the clusters under all pairwise combinations, and perhaps this will help you understand the relationship between certain pairs of features and the clusters they belong to.
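As a rough sketch of that pairwise inspection (assuming matplotlib and the X_scaled and y_pred from the code above; with many features you would restrict the loop to a handful of columns):
import matplotlib.pyplot as plt

# Plot Feature 0 against each remaining feature, coloured by cluster label.
n_features = X_scaled.shape[1]
for j in range(1, n_features):
    plt.figure()
    plt.scatter(X_scaled[:, 0], X_scaled[:, j], c=y_pred, s=5)
    plt.xlabel("Feature 0")
    plt.ylabel("Feature {}".format(j))
plt.show()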
Remember also that KMeans is an unsupervised algorithm, so the clusters are not necessarily related to the labels in y. The results simply mean that the observations in each cluster are similar to each other in terms of distance to the centroid.

difference between np.zeros((1, n)) and np.zeros(n)

I am trying to understand the difference between np.zeros((1, n)) and np.zeros(n)
row_vector = np.zeros((1, n))
vector = np.zeros(n)
print('Shape of row_vector: {0}'.format(row_vector.shape))
print('Shape of vector: {0}'.format(vector.shape))
The only difference in the output is one extra pair of brackets:
Contents of row_vector:
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
Contents of vector - Note the number of brackets compared to row_vector:
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Secondly, if I have to add them, how do I do that?
It simply boils down to the number of dimensions of the array or tensor. row_vector is a 2-dimensional array, while vector is a 1-dimensional array.
You can easily verify this by calling the ndim attribute.
e.g.
>>> row_vector.ndim
2
>>> vector.ndim
1
This additional dimension is very useful when working with tensor focused libraries such as TensorFlow and PyTorch.
The difference is that one is a 1D array, the other 2D. You can see this by printing the x.ndim attribute or len(x.shape).
That being said, for many practical purposes there is no difference. Broadcasting aligns dimensions on the right edge, so you can add the two arrays directly using the + operator:
s = vector + row_vector
The result will be 2D, with shape (1, n).
There would be a big difference if the shapes were (n,) and (n, 1). Broadcasting would again align the shapes on the right, which would make the sum an outer sum of shape (n, n).
You can perform an outer sum on the existing arrays by accessing a view with a unit axis using np.newaxis:
vector[:, np.newaxis] + row_vector
Or, equivalently, by transposing the row vector:
vector + row_vector.T
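A small sketch making the shapes explicit:
import numpy as np

n = 4
row_vector = np.zeros((1, n))
vector = np.zeros(n)

# (n,) broadcasts against (1, n); the sum keeps the 2-D shape.
print((vector + row_vector).shape)                 # (1, 4)

# Adding a unit axis to the 1-D array turns the sum into an outer sum.
print((vector[:, np.newaxis] + row_vector).shape)  # (4, 4)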

Dendrogram through scipy given a similarity matrix

I have computed a Jaccard similarity matrix with Python. I want to cluster from highest similarities to lowest; however, no matter what linkage function I use, it produces the same dendrogram! I have a feeling that the function assumes my matrix contains original data, but I have already computed the similarity matrix. Is there any way to pass this similarity matrix through to the dendrogram so it plots correctly? Or am I going to have to output the matrix and simply do it with R? Passing through the original raw data is not possible, as I am computing similarities of words. Thanks for the help!
Here is some code:
SimMatrix = [[ 0.,0.09259259, 0.125 , 0. , 0.08571429],
[ 0.09259259, 0. , 0.05555556, 0. , 0.05128205],
[ 0.125 , 0.05555556, 0. , 0.03571429, 0.05882353],
[ 0. , 0. , 0.03571429, 0. , 0. ],
[ 0.08571429, 0.05128205, 0.05882353, 0. , 0. ]]
import scipy.cluster.hierarchy as hcluster
from matplotlib.pyplot import show

linkage = hcluster.complete(SimMatrix)  # doesn't matter what linkage...
dendro = hcluster.dendrogram(linkage)   # same plot for all types?
show()
If you run this code, you will see a dendrogram that is completely backwards. No matter what linkage type I use, it produces the same dendrogram. This intuitively cannot be correct!
Here's the solution. It turns out that SimMatrix first needs to be converted into condensed form (the entries above or below the diagonal of the square matrix, flattened into a vector).
You can see this in the code below:
import scipy.spatial.distance as ssd
import scipy.cluster.hierarchy as hcluster
from matplotlib.pyplot import show

# squareform() turns the square, zero-diagonal matrix into a condensed vector;
# 1 - distVec converts the similarities into distances before linkage.
distVec = ssd.squareform(SimMatrix)
linkage = hcluster.linkage(1 - distVec)
dendro = hcluster.dendrogram(linkage)
show()
