Identify outliers from similarity matrix

I have a numpy matrix of dimension n x n where the [i,j] element is the similarity score (0-1, with 1 being identical and 0 being opposite) between two objects (in this case I'm analyzing color palettes, so it's the similarity score between color palette i and color palette j). I would like to determine which of the objects are "outliers" (using the term loosely here). The closest I've been able to think of is using something like DBSCAN and determining which objects don't seem to fit. Is there a better way of going about this?

I'd go for Markov clustering.
Essentially, the algorithm performs a random walk on a graph.
Random walks are super easy to implement if you have the proximity matrix.
The algorithm is roughly:
1) Normalize the matrix.
2) Raise it to a large power (the matrix power M^n, e.g. np.linalg.matrix_power(M, n) in numpy; M**n on a numpy array is elementwise).
3) Look at the strength of the connections between nodes.
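The three steps above can be sketched in numpy. This is only a rough illustration of the random-walk part, not full Markov clustering (which also interleaves an inflation step). The function name random_walk_scores, the steps parameter, zeroing the diagonal, and averaging the columns into a per-object score are all assumptions made for this example.

```python
import numpy as np

def random_walk_scores(sim, steps=20):
    """Score each object by how often a random walk on the similarity graph visits it."""
    sim = np.asarray(sim, dtype=float).copy()
    np.fill_diagonal(sim, 0.0)           # assumption: ignore self-similarity
    row_sums = sim.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0        # avoid dividing by zero for isolated objects
    transition = sim / row_sums          # step 1: normalize rows into probabilities
    walked = np.linalg.matrix_power(transition, steps)  # step 2: matrix power
    return walked.mean(axis=0)           # step 3: average probability of landing on each object

# Hypothetical data: four mutually similar palettes and one that resembles none of them.
sim = np.full((5, 5), 0.9)
sim[4, :] = 0.1
sim[:, 4] = 0.1
np.fill_diagonal(sim, 1.0)

scores = random_walk_scores(sim)
outlier_candidate = int(np.argmin(scores))
```

Objects with the lowest scores are the ones the walk rarely reaches, which makes them candidates for "outliers" in the loose sense of the question.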

Related

How to cluster a score / probability map and get the modes (with variable numbers) in pytorch or numpy

I have a 2d probability map (please correct me if I use any term wrong). Something like this:
Here yellow is a high value and violet is zero. Please ignore the red cross. It is represented as a matrix in numpy/pytorch.
As you can see, this example has two clusters. How can I find those clusters, including the mode coordinates (matrix indices) and the accumulated probability mass corresponding to each cluster? The number of clusters can vary between probability maps and needs to be determined automatically.
I believe something like mean-shift should work, but I am new to this field, so I don't know the best way to do it. I found sklearn.cluster.MeanShift, but it needs sampled points as input, which seems very expensive for images of roughly 512 x 512. Can I do it without sampling first?

Are there any python algorithms for conversion of coordinates between vector spaces of different norms?

Suppose I have a k x n array of data whose columns are vectors, and a distance function defined on these vectors. How do I convert the k x n array into another array of the same shape such that the Euclidean distance between the converted vectors equals the distance given by that function? I know you can directly calculate the distance matrix for the data with the given distance function and derive coordinates in R^k from it. But this method is really expensive, especially when the distance function has complexity O(n^2) or more. So I wonder if there is a simpler algorithm for this.

It sounds like you are describing multidimensional scaling (MDS). One way to do it in Python is with scikit-learn's sklearn.manifold.MDS.
MDS expects the NxN distance (or "dissimilarity") matrix as input, so it doesn't get around the cost of evaluating the distance function. The distance matrix is unavoidably needed for this conversion, so if the distance function itself is expensive, the best thing to do seems to be to reduce the number of samples or to look for a way to compute fast approximate distances. Also, beware that MDS is usually only approximate: a numerical optimization looks for the best fit of Euclidean norms to the given distances.
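To show how coordinates can come out of a distance matrix, here is a minimal numpy sketch. Note that it swaps in classical (Torgerson) MDS, a closed-form eigendecomposition variant, rather than the iterative optimization that sklearn.manifold.MDS runs. The function name classical_mds, the sample points, and the Manhattan distance are assumptions for illustration, and points are stored as rows here rather than as columns.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Embed points so that Euclidean distances approximate the entries of dist."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-center the squared distances to recover a Gram (inner-product) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # The leading eigenvectors, scaled by sqrt(eigenvalue), give the coordinates.
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))  # drop negative eigenvalues
    return eigvecs[:, top] * scale

# Hypothetical data: pairwise Manhattan distances between four 2-D points.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
dist = np.abs(points[:, None, :] - points[None, :, :]).sum(axis=-1)

coords = classical_mds(dist, n_components=2)  # one row of coordinates per point
```

As the answer says, the full distance matrix is still required, and for a non-Euclidean distance such as this one the embedding is only an approximation.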

Clustering a sparse co-occurrence matrix

I have two N x N co-occurrence matrices (484x484 and 1060x1060) that I have to analyze. The matrices are symmetrical along the diagonal and contain lots of zero values. The non-zero values are integers.
I want to group together the positions that are non-zero. In other words, what I want to do is the algorithm on this link. When order by cluster is selected, the matrix gets re-arranged in rows and columns to group the non-zero values together.
Since I am using Python for this task, I looked into SciPy Sparse Linear Algebra library, but couldn't find what I am looking for.
Any help is much appreciated. Thanks in advance.

If you have a matrix dist with pairwise distances between objects, then you can find the order in which to rearrange the matrix by applying a clustering algorithm to it (http://scikit-learn.org/stable/modules/clustering.html). For example, it might be something like:
from sklearn import cluster
import numpy as np
# metric="precomputed" marks dist as a distance matrix (scikit-learn before 1.2 calls this parameter affinity).
# The default "ward" linkage only works on Euclidean features, so pick another linkage such as "average".
model = cluster.AgglomerativeClustering(n_clusters=20, metric="precomputed", linkage="average").fit(dist)
new_order = np.argsort(model.labels_)  # indices that group samples by cluster label
ordered_dist = dist[new_order]  # can be your original matrix instead of dist
ordered_dist = ordered_dist[:, new_order]
The order is derived from the variable model.labels_, which holds the number of the cluster to which each sample belongs. A few observations:
You have to find a clustering algorithm that accepts a distance matrix as input. AgglomerativeClustering is such an algorithm (notice the metric="precomputed" option, which tells it that we are using pre-computed distances).
What you have seems to be a pairwise similarity matrix, in which case you need to transform it to a distance matrix (e.g. dist=1 - data/data.max())
In the example I assumed 20 clusters, you may have to play with this variable a bit. Alternatively, you might try to find the best one-dimensional representation of your data (using e.g. MDS) to describe the optimal ordering of samples.

Because your data is sparse, treat it as a graph, not a matrix.
Then try the various graph clustering methods. For example cliques are interesting on such data.
Note that not everything may cluster.
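One simple way to act on the graph view is to take connected components and reorder the matrix by component. The sketch below uses scipy.sparse.csgraph.connected_components; the small matrix cooc and the variable names are hypothetical, and connected components are only one of many graph clustering methods (cliques, as mentioned above, would need a different tool).

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Hypothetical symmetric co-occurrence counts for 6 items.
cooc = np.array([
    [0, 0, 3, 0, 0, 0],
    [0, 0, 0, 2, 0, 0],
    [3, 0, 0, 0, 1, 0],
    [0, 2, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
])

# Every non-zero entry is an edge; items in the same component are linked
# directly or through a chain of co-occurrences.
n_groups, labels = connected_components(csr_matrix(cooc), directed=False)

# Sorting by component label places each group's rows and columns together.
order = np.argsort(labels, kind="stable")
reordered = cooc[np.ix_(order, order)]
```

Item 5 never co-occurs with anything, so it ends up in a component of its own, which illustrates the remark that not everything may cluster.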

Problems in performing K means clustering

I am trying to cluster the following data from a CSV file with K means clustering.
Sample1,Sample2,45
Sample1,Sample3,69
Sample1,Sample4,12
Sample2,Sample2,46
Sample2,Sample1,78
It is basically a graph where Samples are nodes and the numbers are the edges (weights).
I read the file as follows:
import csv
fileopening = open('data.csv', 'rU')
reading = csv.reader(fileopening, delimiter=',')
L = list(reading)
I used this code: https://gist.github.com/betzerra/8744068
Here clusters are built based on the following:
num_points, dim, k, cutoff, lower, upper = 10, 2, 3, 0.5, 0, 200
points = map( lambda i: makeRandomPoint(dim, lower, upper), range(num_points) )
clusters = kmeans(points, k, cutoff)
for i,c in enumerate(clusters):
    for p in c.points:
        print " Cluster: ",i,"\t Point :", p
I replaced points with list L. But I got lots of errors: AttributeError, 'int' object has no attribute 'n', etc.
I need to perform K means clustering based on the third column (the edge weights) of my CSV file. The tutorial uses randomly generated points, and I am not sure how to use my CSV data as input to this k means function instead. How can I perform k means (k=2) on my data?

In short "you can't".
Long answer:
K-means is defined for Euclidean spaces only, and it requires actual point positions. You only have distances between the points, and probably not even distances in a strict mathematical sense but rather some kind of "similarity". K-means is not designed to work with similarity matrices.
What can you do?
1) You can use some other method to embed your points in Euclidean space in such a way that they closely resemble your distances. One such tool is multidimensional scaling (MDS): http://en.wikipedia.org/wiki/Multidimensional_scaling
2) Once point 1 is done, you can run k-means.
Alternatively, you can construct a kernel (valid in Mercer's sense) by applying some kernel learning technique so that it resembles your data, and then run kernel k-means on the resulting Gram matrix.

As lejlot said, distances between points alone are not enough to run k-means in the classic sense. This is easy to see once you understand how k-means works. At a high level, k-means works as follows:
1) Randomly assign points to clusters.
(Technically, there are more sophisticated ways of initial partitioning,
but that's not essential right now).
2) Compute the centroid of each cluster.
(This is where you need the actual coordinates of the points.)
3) Reassign each point to a cluster with the closest centroid.
4) Repeat steps 2)-3) until stop condition is met.
So, as you can see, in the classic interpretation, k-means will not work, because it is unclear how to compute centroids. However, I have several suggestions of what you could do.
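The steps above can be sketched in numpy to make the dependence on coordinates visible. This is a minimal illustration rather than a production implementation; the function signature is an assumption, and it initializes by picking k random points as centroids instead of randomly assigning points to clusters.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain k-means on an array with one row of coordinates per point."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialization (a variant of step 1): pick k distinct points as starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to the cluster with the closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of the member coordinates.
        # This is the part that cannot be done from pairwise distances alone.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D points forming two obvious groups.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centroids = kmeans(points, k=2)
```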
Suggestion 1.
Embed your points in N-dimensional space, where N is the number of points, so that the coordinates of each point are the distances to all the other points.
For example the data you showed:
Sample1,Sample2,45
Sample1,Sample3,69
Sample1,Sample4,12
Sample2,Sample2,46
Sample2,Sample1,78
becomes:
Sample1: (0,45,69,12,...)
Sample2: (78,46,0,0,...)
Then you can legitimately use Euclidean distance. Note that the actual distances between points will not be preserved, but this could be a simple and reasonable approximation that preserves the relative distances between the points. Another disadvantage is that if you have a lot of points, then your memory (and running time) requirements will be of order N^2.
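A minimal sketch of Suggestion 1, using the rows from the question. The variable names are hypothetical, and filling missing pairs with 0 is an assumption that simply follows the example vectors above.

```python
import numpy as np

# The (source, target, weight) rows from the CSV in the question.
edges = [
    ("Sample1", "Sample2", 45),
    ("Sample1", "Sample3", 69),
    ("Sample1", "Sample4", 12),
    ("Sample2", "Sample2", 46),
    ("Sample2", "Sample1", 78),
]

# Give every sample a fixed position in the feature vector.
names = sorted({name for src, dst, _ in edges for name in (src, dst)})
index = {name: i for i, name in enumerate(names)}

# Row i holds the weights from sample i to every other sample (0 where unknown).
features = np.zeros((len(names), len(names)))
for src, dst, weight in edges:
    features[index[src], index[dst]] = weight
```

Each row of features can then be passed to a k-means implementation as the coordinates of one sample.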
Suggestion 2.
Instead of k-means, try k-medoids. For this one, you do not need the actual coordinates of the points, because instead of centroids you compute medoids. The medoid of a cluster is the point in that cluster which has the smallest average distance to all other points in the cluster. You could look for implementations online, or write your own, as it's actually pretty easy to implement. The running time will be proportional to N^2 as well.
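A simple alternating version of k-medoids can be sketched in numpy as follows. It only needs the pairwise distance matrix; the function name, parameters, and random initialization are assumptions for illustration, and more refined variants (such as PAM) exist.

```python
import numpy as np

def k_medoids(dist, k, n_iter=100, seed=0):
    """Cluster from a pairwise distance matrix; returns labels and medoid indices."""
    dist = np.asarray(dist, dtype=float)
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(dist), size=k, replace=False)
    for _ in range(n_iter):
        # Assign every point to its nearest medoid.
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:
                # The medoid is the member with the smallest total (hence average)
                # distance to the other members of its cluster.
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return labels, medoids

# Hypothetical distances: six objects lying at these positions on a line.
positions = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
dist = np.abs(positions[:, None] - positions[None, :])

labels, medoids = k_medoids(dist, k=2)
```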
Final remark.
Why do you want to use k-means at all? It seems like you have a weighted directed graph, and there are clustering algorithms specially intended for graphs. This is beyond the scope of your question, but it may be worth considering.

Can i get features of the clusters using hierarchical clustering - numpy

I am trying to do hierarchical clustering on an m*n array.
Input array : 500 * 1000 (1000 features, 500 observations)
Calculate distance matrix using a self-defined pdist function
Feed this distance matrix to linkage function :
clusters = sch.linkage(distanceMatrix,'single')
Form flat clusters :
fc = sch.fcluster(clusters,cutoff,'distance')
This gives me some clusters (around 80, using a cutoff value of 6.0).
Now, is there any way that I can get the 1000 features corresponding to each cluster as well (like the features of the centroids we get from K-means clustering)?

Clusters in hierarchical clustering (or pretty much anything except k-means and Gaussian Mixture EM, which are restricted to "spherical" - actually: convex - clusters) do not necessarily have sensible means, because these methods allow for non-spherical clusters. That is actually a feature...
https://en.wikipedia.org/wiki/Cluster_analysis#Connectivity_based_clustering_.28hierarchical_clustering.29
Have a look at the right image titled "Linkage clustering examples". What good is a centroid in this "banana" example? The centroid might not even be in the cluster!
Note that you can still compute the centroid yourself, if you need it. As the clustering algorithm does not need the centroid, it will not be computing it for you automatically, obviously.
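Computing the centroids yourself can be as short as the sketch below. It assumes data is the observations x features array and fc is the flat label array of the kind returned by fcluster; the helper name cluster_centroids and the toy values are made up for illustration.

```python
import numpy as np

def cluster_centroids(data, labels):
    """Return {cluster label: mean feature vector of that cluster's observations}."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    return {
        int(label): data[labels == label].mean(axis=0)
        for label in np.unique(labels)
    }

# Hypothetical stand-ins: 5 observations with 3 features, and their flat cluster labels.
data = np.array([
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [10.0, 10.0, 10.0],
    [12.0, 10.0, 8.0],
    [0.0, 0.0, 0.0],
])
fc = np.array([1, 1, 2, 2, 3])

centroids = cluster_centroids(data, fc)
```

With the real data each centroid would have 1000 entries, one per feature, but keep in mind the caveat above that such a mean may not be a sensible summary of a non-convex cluster.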
