I want to apply sklearn graph clustering algorithms, but they don't accept input from networkx in .gexf format. What kind of library/transformations do I need to make my .gexf graphs suitable for sklearn?
Clustering algorithms accept either distance matrices, affinity matrices, or feature matrices. For example, k-means accepts a feature matrix (say X, with n points of m dimensions each) and applies the Euclidean distance metric, while affinity propagation accepts either an affinity matrix (i.e. a square n x n matrix D) or a feature matrix, depending on the affinity parameter.
If you want to apply a sklearn (or just a non-graph) clustering algorithm, you can load the graph with nx.read_gexf and extract an adjacency matrix from the resulting networkx graph.
import networkx as nx
A = nx.to_scipy_sparse_matrix(G)  # sparse adjacency matrix of the networkx graph G
I guess you should make sure your diagonal is 1; if it is not, do numpy.fill_diagonal(A, 1) (note that this works on a dense array, so you may need A = A.toarray() first).
This then leaves only applying the clustering algorithm:
from sklearn.cluster import AffinityPropagation

# affinity='precomputed' tells AffinityPropagation to treat A as an affinity matrix;
# it may need a dense array here, so use A.toarray() if the sparse input is rejected
ap = AffinityPropagation(affinity='precomputed').fit(A)
print(ap.labels_)  # cluster label assigned to each node
You can also convert your adjacency matrix to a distance matrix if you want to apply other algorithms, or even project your adjacency/distance matrix to a feature matrix.
Going through all of that would take too long here, but as for getting the distance matrix: if you have binary edges you can do D = 1 - A; if you have weighted edges you can do D = A.max() - A.
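As a rough sketch of that conversion, assuming binary edges and the sparse matrix A from above, the resulting distance matrix can be fed to a distance-based algorithm such as DBSCAN (the eps value below is just a placeholder):
import numpy as np
from sklearn.cluster import DBSCAN

D = 1 - A.toarray()     # binary adjacency -> distance matrix
np.fill_diagonal(D, 0)  # each node is at distance 0 from itself
db = DBSCAN(eps=0.5, metric='precomputed').fit(D)
print(db.labels_)       # -1 marks points DBSCAN considers noise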
I have a numpy matrix of dimension nxn where the [i,j] element is the similarity score (0-1, with 1 being identical and 0 being opposite) between two objects (in this case I'm analyzing color palettes, so it's the similarity score between color palette i and color palette j). I would like to determine which of the objects are "outliers" (using the definition loosely here). The closest I've been able to think of is using something like DBSCAN and determining which objects don't seem to fit. Is there a better way of going about this?
I'd go for Markov clustering.
Essentially, the algorithm simulates random walks on a graph.
Random walks are super easy to implement if you have the proximity matrix.
The algorithm is roughly (a minimal sketch follows below):
1) Normalize the matrix.
2) Raise it to a large power (M**n).
3) Look at the strength of the connections between nodes.
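A minimal sketch of that random-walk intuition (not the full Markov clustering algorithm, which alternates an expansion and an inflation step), assuming sim is the n x n similarity matrix from the question:
import numpy as np

# sim: the n x n similarity matrix (a small made-up example here)
sim = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.2],
                [0.1, 0.2, 1.0]])

M = sim / sim.sum(axis=0, keepdims=True)  # column-normalize -> transition matrix
M_n = np.linalg.matrix_power(M, 20)       # simulate many random-walk steps
print(M_n.round(3))                       # weakly connected rows/columns hint at outliers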
I have two N x N co-occurrence matrices (484x484 and 1060x1060) that I have to analyze. The matrices are symmetrical along the diagonal and contain lots of zero values. The non-zero values are integers.
I want to group together the positions that are non-zero. In other words, I want to do what the algorithm at this link does: when "order by cluster" is selected, the matrix is rearranged in rows and columns so that the non-zero values are grouped together.
Since I am using Python for this task, I looked into the SciPy sparse linear algebra library, but couldn't find what I am looking for.
Any help is much appreciated. Thanks in advance.
If you have a matrix dist with pairwise distances between objects, then you can find the order in which to rearrange the matrix by applying a clustering algorithm to this matrix (http://scikit-learn.org/stable/modules/clustering.html). For example, it might be something like:
from sklearn import cluster
import numpy as np

# linkage must be 'average', 'complete' or 'single' when distances are precomputed
model = cluster.AgglomerativeClustering(
    n_clusters=20, affinity="precomputed", linkage="average").fit(dist)
new_order = np.argsort(model.labels_)
ordered_dist = dist[new_order]       # can be your original matrix instead of dist
ordered_dist = ordered_dist[:, new_order]
The order is given by the variable model.labels_, which has the number of the cluster to which each sample belongs. A few observations:
You have to use a clustering algorithm that accepts a distance matrix as input. AgglomerativeClustering is such an algorithm (notice the affinity="precomputed" option, which tells it that we are using precomputed distances).
What you have seems to be a pairwise similarity matrix, in which case you need to transform it into a distance matrix (e.g. dist = 1 - data/data.max()).
In the example I assumed 20 clusters; you may have to play with this variable a bit. Alternatively, you might try to find the best one-dimensional representation of your data (using e.g. MDS) to describe the optimal ordering of samples, as sketched below.
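A rough sketch of that MDS idea, assuming dist is the precomputed distance matrix from above (the parameter values are placeholders):
from sklearn.manifold import MDS
import numpy as np

mds = MDS(n_components=1, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dist)      # one coordinate per sample
new_order = np.argsort(coords[:, 0])  # order samples along that single axis
ordered_dist = dist[new_order][:, new_order]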
Because your data is sparse, treat it as a graph, not a matrix.
Then try the various graph clustering methods. For example, cliques are interesting on such data.
Note that not everything may cluster.
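For instance, with networkx you could build a graph from the sparse co-occurrence matrix and enumerate maximal cliques. This is only a sketch; cooc is a hypothetical name for your matrix, and newer networkx versions use from_scipy_sparse_array instead:
import networkx as nx
from scipy import sparse

G = nx.from_scipy_sparse_matrix(sparse.csr_matrix(cooc))  # nonzero entries become weighted edges

# maximal cliques of size > 1; nodes in no such clique remain unclustered
for clique in nx.find_cliques(G):
    if len(clique) > 1:
        print(clique)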
I have two problems.
1) First: how do I generate a 1000x100 matrix where each dimension is a marginal distribution with mean 0 and standard deviation 1? I know I can use a univariate distribution for each, but how do you put 100 such distributions into one numpy/pandas matrix?
2) Generate a 100x1000 matrix. This would use a multivariate distribution, but how do you specify the mean and covariance matrix for that numpy function? It has to be random.
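A minimal sketch of what this seems to ask for, assuming standard normal marginals and a randomly generated covariance matrix (all sizes and names are placeholders):
import numpy as np

rng = np.random.default_rng(0)

# 1) 1000 x 100 matrix: each of the 100 columns is an independent
#    standard normal sample (mean 0, standard deviation 1)
X = rng.normal(loc=0.0, scale=1.0, size=(1000, 100))

# 2) 100 x 1000 matrix: 100 draws from a 1000-dimensional multivariate normal
#    with a random mean and a random positive semi-definite covariance matrix
mean = rng.normal(size=1000)
B = rng.normal(size=(1000, 1000))
cov = B @ B.T                        # B B^T is always a valid covariance matrix
Y = rng.multivariate_normal(mean, cov, size=100)
print(X.shape, Y.shape)              # (1000, 100) (100, 1000)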
If you have this hierarchical clustering call in scipy in Python:
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# dist_matrix is the full (square, redundant) distance matrix; squareform
# converts it to the condensed form that linkage expects
linkage_matrix = linkage(squareform(dist_matrix), linkage_method)
then what's an efficient way to go from this to cluster assignments for individual points? That is, a vector of length N (where N is the number of points) in which entry i is the cluster number of point i, given the number of clusters generated by a given threshold thresh on the resulting clustering?
To clarify: the cluster number is the cluster that a point ends up in after applying a threshold to the tree. Each point then belongs to exactly one "most specific cluster", defined by the threshold at which you cut the dendrogram.
I know that scipy.cluster.hierarchy.fclusterdata gives you this cluster assignment as its return value, but I am starting from a custom-made distance matrix and distance metric, so I cannot use fclusterdata. The question boils down to: how can I compute what fclusterdata computes -- the cluster assignments?
If I understand you right, that is what fcluster does:
scipy.cluster.hierarchy.fcluster(Z, t, criterion='inconsistent', depth=2, R=None, monocrit=None)
Forms flat clusters from the hierarchical clustering defined by the linkage matrix Z.
...
Returns: An array of length n. T[i] is the flat cluster number to which original observation i belongs.
So just call fcluster(linkage_matrix, t), where t is your threshold; pass criterion='distance' if t is meant as a distance at which to cut the dendrogram (the default criterion is 'inconsistent').
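A small self-contained sketch of the whole pipeline (the distance values and the threshold are made up):
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
import numpy as np

dist_matrix = np.array([[0.0, 0.1, 0.9, 0.8],   # hypothetical 4 x 4 distance matrix
                        [0.1, 0.0, 0.8, 0.9],
                        [0.9, 0.8, 0.0, 0.2],
                        [0.8, 0.9, 0.2, 0.0]])

linkage_matrix = linkage(squareform(dist_matrix), 'average')
labels = fcluster(linkage_matrix, t=0.5, criterion='distance')
print(labels)  # e.g. [1 1 2 2] -- labels[i] is the cluster of point i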
I am trying to do hierarchical clustering on an m*n array.
Input array: 500 x 1000 (500 observations, 1000 features)
Calculate the distance matrix using a self-defined pdist function.
Feed this distance matrix to the linkage function (with scipy.cluster.hierarchy imported as sch):
clusters = sch.linkage(distanceMatrix, 'single')
Form flat clusters:
fc = sch.fcluster(clusters, cutoff, 'distance')
This gives me some clusters (around 80, using a cutoff value of 6.0).
Now, is there any way that I can get the 1000 features corresponding to each cluster as well? (Like we get the features of the centroids using k-means clustering.)
Clusters in hierarchical clustering (or pretty much anything except k-means and Gaussian mixture EM, which are restricted to "spherical" - actually: convex - clusters) do not necessarily have sensible means.
That is because they allow for non-spherical clusters, which actually is a feature...
https://en.wikipedia.org/wiki/Cluster_analysis#Connectivity_based_clustering_.28hierarchical_clustering.29
Have a look at the image on the right titled "Linkage clustering examples". What good is a cluster mean in this "banana" example? The centroid might not even be in the cluster!
Note that you can still compute the centroid yourself if you need it. Since the clustering algorithm does not need the centroid, it will not compute it for you automatically.
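For example, given the observation matrix and the flat cluster labels from the question above (the data here is just a stand-in), the per-cluster means can be computed directly:
import numpy as np

X = np.random.rand(500, 1000)              # stand-in for the 500 x 1000 input array
fc = np.random.randint(1, 81, size=500)    # stand-in for the fcluster labels

# mean feature vector ("centroid") of each flat cluster
centroids = {c: X[fc == c].mean(axis=0) for c in np.unique(fc)}
print(len(centroids), next(iter(centroids.values())).shape)  # e.g. 80 (1000,)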