Python: Cosine similarity of M x N matrices

I have two M x N matrices which I construct after extracting data from images. Both have a long first row, and after the third row each row shrinks to a single column.
For example, a raw vector looks like this:
1,23,2,5,6,2,2,6,2,
12,4,5,5,
1,2,4,
1,
2,
2
:
Both vectors have a similar pattern: the first three rows are long, and the rows then thin out as they progress. To compute cosine similarity, I was thinking of using zero-padding to make these two vectors N x N. I looked at Python options for cosine similarity, but some examples used a package called numpy. I couldn't figure out how exactly numpy can do this type of padding and carry out a cosine similarity. Any guidance would be greatly appreciated.

If both arrays have the same dimensions, I would flatten them using NumPy. NumPy (together with SciPy) is a powerful scientific computing tool that makes matrix manipulation much easier.
Here is an example of how I would do it with NumPy and SciPy:
import numpy as np
from scipy.spatial import distance
A = np.array([[1,23,2,5,6,2,2,6,2],[12,4,5,5],[1,2,4],[1],[2],[2]], dtype=object )
B = np.array([[1,23,2,5,6,2,2,6,2],[12,4,5,5],[1,2,4],[1],[2],[2]], dtype=object )
Aflat = np.hstack(A)
Bflat = np.hstack(B)
dist = distance.cosine(Aflat, Bflat)
The result here is dist = 1.10e-16 (i.e., 0). distance.cosine returns the cosine distance, so the similarity is 1 - dist.
Note that I've used dtype=object here because that's the only way I know to store rows of different lengths in a NumPy array. That's why I later used hstack() to flatten the array (instead of the more common flatten() function).
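The zero-padding idea from the question can also be sketched with NumPy alone. This is a minimal sketch, not a definitive implementation: pad_rows and cosine_similarity are hypothetical helper names, the second matrix B is made-up data, and both inputs are assumed to have the same number of rows.

```python
import numpy as np

def pad_rows(rows, width):
    """Zero-pad a jagged list of rows into a rectangular 2-D float array."""
    out = np.zeros((len(rows), width), dtype=float)
    for i, row in enumerate(rows):
        out[i, :len(row)] = row
    return out

def cosine_similarity(a, b):
    """Cosine similarity of two equally shaped arrays, treated as flat vectors."""
    a = np.ravel(a)
    b = np.ravel(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A is the example from the question; B is invented for illustration and
# has rows whose lengths differ from A's.
A = [[1, 23, 2, 5, 6, 2, 2, 6, 2], [12, 4, 5, 5], [1, 2, 4], [1], [2], [2]]
B = [[1, 20, 2, 5, 6, 2, 2, 6], [12, 4, 5, 5, 1], [1, 2, 4], [1], [2], [2]]

# Pad both to the same width so that element (i, j) of A lines up with (i, j) of B.
width = max(max(len(r) for r in A), max(len(r) for r in B))
A_pad = pad_rows(A, width)
B_pad = pad_rows(B, width)

sim = cosine_similarity(A_pad, B_pad)
print(sim)
```

Padding before flattening keeps the rows aligned when the two inputs have rows of different lengths, which plain hstack() flattening would not do.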

I would convert them to a SciPy sparse matrix (http://docs.scipy.org/doc/scipy/reference/sparse.html) and then compute the cosine distances with scikit-learn.
from scipy import sparse
from sklearn.metrics import pairwise_distances
# your_np_array must be a rectangular numeric array (e.g. after zero-padding), one sample per row
sparse_matrix = sparse.csr_matrix(your_np_array)
distance_matrix = pairwise_distances(sparse_matrix, metric="cosine")

Why can't you just run a nested loop over both jagged lists (presumably), summing each row's Euclidean/vector dot product and using the result as a similarity measure? This assumes that the jagged dimensions are identical.
That said, I'm not quite sure how you are getting a jagged array from a bitmap image (I would have assumed it would be a proper dense matrix of M x N form), or how the jagged array of arrays above is meant to represent an M x N matrix/image data, and therefore how padding the data with zeros would make sense. If this were a sparse matrix representation, one would expect row/col information annotated with the values.
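The nested-loop idea could look roughly like the following. It is only a sketch under the stated assumption that both jagged lists have identical shapes and are not all zeros; jagged_cosine is a hypothetical name.

```python
import math

def jagged_cosine(a, b):
    """Cosine similarity of two jagged lists with identical row lengths."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same number of rows")
    dot = norm_a = norm_b = 0.0
    for row_a, row_b in zip(a, b):
        if len(row_a) != len(row_b):
            raise ValueError("rows must have matching lengths")
        # Accumulate the dot product and both squared norms row by row.
        for x, y in zip(row_a, row_b):
            dot += x * y
            norm_a += x * x
            norm_b += y * y
    return dot / math.sqrt(norm_a * norm_b)

A = [[1, 23, 2, 5, 6, 2, 2, 6, 2], [12, 4, 5, 5], [1, 2, 4], [1], [2], [2]]
print(jagged_cosine(A, A))
```

Dividing the accumulated dot product by the two norms turns the raw sum into a cosine similarity, so no padding or NumPy is needed when the shapes already match.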

Related

Eigenvalues of a Scipy Sparse Block Diagonal Matrix

I am trying to find the eigenvalues of many small matrices, while not trying to use a loop, with the intent to use CuPy later on.
Thus, I have tried to set up a large matrix that holds the matrices I want to solve as blocks on its diagonal. This matrix contains a lot of unnecessary zeros, so I use scipy.sparse.
All works well, until I want to find the eigenvalues, where the spsolve() function calculates the full eigenvectors to the problem, when most of the entries should also be zero.
import numpy as np
from scipy import sparse as sp
from scipy.sparse.linalg import spsolve, eigs
sigx=np.array([[0, 1],[1, 0]], dtype=np.complex128) # a 2x2 Pauli matrix
karray=np.arange(-np.pi, np.pi, np.pi/100) #200 elements
H_sci=sp.kron(sp.diags(karray), sigx) #The sparse matrix I want to find the eigenvalues to
H_reg=H_sci.toarray() #Converted into a regular numpy array to see the memory difference
print(H_sci.data.nbytes) #12800 = 2*2*200*16, reminder that 16 bytes = 128 bits --> saves 4 arrays of length 200
print(H_reg.nbytes) #2560000 = 2*2*200*200*16 --> saves the entire matrix
E_sci=eigs(H_sci, k=398) #throws an error for k=400 and 399, even though I should have 400 eigenvalues?
print(E_sci[1].data.nbytes) #2547200 --> as much as H_reg
Do I do something wrong? Is there an alternative approach to solving many matrices (here 2x2 for example) in parallel? I have used Numba for looping over the matrices before, but I would like to try to use my GPU to see whether I can speed this problem up, because I do not see why I should solve these matrices one after another.
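For context: scipy.sparse.linalg.eigs requires k to be smaller than N-1, so it cannot return all 400 eigenvalues, and it returns the eigenvectors as a dense N x k array. One alternative that avoids both the loop and the big block matrix is to stack the small matrices along a leading axis, since NumPy's linalg routines accept stacks of shape (..., M, M). The sketch below uses np.linalg.eigvalsh, which assumes each block is Hermitian (true for k * sigx with real k). Whether the same call carries over to CuPy is an assumption to check against its documentation.

```python
import numpy as np

sigx = np.array([[0, 1], [1, 0]], dtype=np.complex128)  # 2x2 Pauli matrix
karray = np.arange(-np.pi, np.pi, np.pi / 100)

# Stack of 2x2 blocks with blocks[i] = karray[i] * sigx, shape (len(karray), 2, 2).
blocks = karray[:, None, None] * sigx

# One call diagonalizes every block; eigenvalues come back real and ascending.
evals = np.linalg.eigvalsh(blocks)  # shape (len(karray), 2)

print(evals.shape)
print(blocks.nbytes)  # same storage as the nonzero blocks only
```

The stacked array stores only the blocks themselves, so it needs no sparse format, and the result has exactly two eigenvalues per block.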

Efficient numpy approach to iterate through elements of numpy arrays

With the following code snippet, I am trying to generate a vector where each element is drawn from a different normal distribution. The "mean" and "standard deviation" values (the arguments to random.normal) are obtained from two numpy vectors, meanVect and varVect. Both vectors have the same shape as the vector to be generated.
I am using a list comprehension as a quick and dirty fix to achieve my objective. Is there a numpy-specific approach that is more efficient than my current solution?
import numpy as np
from numpy import random
meanVect = np.random.rand(1,100) # using random vectors for MWE
varVect = np.random.rand(1,100) # originally, vectors from a different source are used
# the vectors have shape (1, 100), so index into row 0
newVect = [random.normal(meanVect[0][i], varVect[0][i]) for i in range(len(meanVect[0]))]
Since np.random.normal takes array-like inputs for loc and scale, you can just do:
newVect = np.random.normal(meanVect, varVect)
As long as both input vectors have the same .shape, this should work.
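A small self-contained sketch of the vectorized call, with a fixed seed added only for reproducibility. One thing worth keeping in mind: the second argument (scale) is a standard deviation, so if varVect really holds variances, np.sqrt(varVect) would be the value to pass.

```python
import numpy as np

np.random.seed(0)  # seed chosen arbitrarily, only to make the run repeatable

meanVect = np.random.rand(1, 100)
varVect = np.random.rand(1, 100)  # treated here as standard deviations

# loc and scale are broadcast element-wise, so each entry of newVect is drawn
# from its own normal distribution.
newVect = np.random.normal(meanVect, varVect)

print(newVect.shape)  # (1, 100)
```

The output has the broadcast shape of the two inputs, so no explicit loop or list comprehension is needed.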

Clustering a sparse co-occurrence matrix

I have two N x N co-occurrence matrices (484x484 and 1060x1060) that I have to analyze. The matrices are symmetrical along the diagonal and contain lots of zero values. The non-zero values are integers.
I want to group together the positions that are non-zero. In other words, what I want to do is the algorithm on this link. When order by cluster is selected, the matrix gets re-arranged in rows and columns to group the non-zero values together.
Since I am using Python for this task, I looked into SciPy Sparse Linear Algebra library, but couldn't find what I am looking for.
Any help is much appreciated. Thanks in advance.
If you have a matrix dist with pairwise distances between objects, then you can find the order on which to rearrange the matrix by applying a clustering algorithm on this matrix (http://scikit-learn.org/stable/modules/clustering.html). For example it might be something like:
from sklearn import cluster
import numpy as np
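# The default linkage ("ward") only works with Euclidean distances, so choose another one for a precomputed matrix.
# Newer scikit-learn releases call this parameter metric instead of affinity.
model = cluster.AgglomerativeClustering(n_clusters=20, affinity="precomputed", linkage="average").fit(dist)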
new_order = np.argsort(model.labels_)
ordered_dist = dist[new_order] # can be your original matrix instead of dist[]
ordered_dist = ordered_dist[:,new_order]
The order is given by the variable model.labels_, which has the number of the cluster to which each sample belongs. A few observations:
You have to find a clustering algorithm that accepts a distance matrix as input. AgglomerativeClustering is such an algorithm (notice the affinity="precomputed" option to tell it that we are using pre-computed distances).
What you have seems to be a pairwise similarity matrix, in which case you need to transform it to a distance matrix (e.g. dist=1 - data/data.max())
In the example I assumed 20 clusters, you may have to play with this variable a bit. Alternatively, you might try to find the best one-dimensional representation of your data (using e.g. MDS) to describe the optimal ordering of samples.
Because your data is sparse, treat it as a graph, not a matrix.
Then try the various graph clustering methods. For example cliques are interesting on such data.
Note that not everything may cluster.
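As one very simple instance of the graph view, the connected components of the co-occurrence graph already give a grouping that can be used to reorder the matrix. This is only a sketch: the 5 x 5 matrix is made up for illustration, and connected components are a much coarser grouping than cliques or a proper graph clustering method.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Made-up symmetric co-occurrence matrix: items {0, 2, 4} and {1, 3} co-occur.
cooc = np.array([
    [0, 0, 3, 0, 0],
    [0, 0, 0, 2, 0],
    [3, 0, 0, 0, 1],
    [0, 2, 0, 0, 0],
    [0, 0, 1, 0, 0],
])

# Every non-zero entry is treated as an undirected edge.
n_groups, labels = connected_components(csr_matrix(cooc), directed=False)

# Sort items by group label and apply the same order to rows and columns.
order = np.argsort(labels, kind="stable")
reordered = cooc[np.ix_(order, order)]

print(n_groups)
print(reordered)
```

After reordering, the non-zero values sit in blocks along the diagonal, one block per group.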

3d image compression with numpy

I have a 3d numpy array representing an object with cells as voxels and the voxels having values from 1 to 10. I would like to compress the image (a) to make it smaller and (b) to get a quick idea later on of how complex the image is by compressing it to a minimum level of agreement with the original image.
I have used SVD to do this with 2D images and seeing how many singular values were required but it looks to have difficulty with 3D ones. If e.g. I look at the diagonal terms in the S matrix, they are all zero and I was expecting singular values.
Is there any way I can use svd to compress 3D arrays (e.g. flattening in some way)? Or are other methods more appropriate? If necessary I could probably simplify the voxel values to 0 or 1.
You could essentially apply the same principle to the 3D data without flattening it. There are some algorithms to separate N-dimensional matrices, such as the CP-ALS (using Alternating Least Squares) and this is implemented in the package sktensor. You can use the package to decompose the tensor given a rank:
from sktensor import dtensor, cp_als
T = dtensor(X)
rank = 5
P, fit, itr, exectimes = cp_als(T, rank, init='random')
Here X is your data. You could then use the weights (weights = P.lmbda) to reconstruct the original array X and calculate the reconstruction error, as you would do with SVD.
Other decomposition methods for 3D data (or in general tensors) include the Tucker Decomposition or the Canonical Decomposition (also available in the same package).
It is not directly a 3D SVD, but all the methods above can be used to analyze the principal components of your data.
[Two images omitted: the Tucker decomposition, and the decomposition that CP-ALS (an optimization algorithm) tries to obtain.]
Image credits:
1- http://www.slideshare.net/KoheiHayashi1/talk-in-jokyonokai-12989223
2- http://www.bsp.brain.riken.jp/~zhougx/tensor.html
What you want is a higher-order SVD/Tucker decomposition.
In the 3D case, you will get three projection matrices (one for each dimension) and a low-rank core tensor (a 3D array).
You can do this easily using TensorLy:
from tensorly.decomposition import tucker
# recent TensorLy releases name this argument rank; older ones spelled it ranks
core, factors = tucker(tensor, rank=[2, 3, 4])
Here, core will have shape (2, 3, 4) and len(factors) will be 3, one factor for each dimension.
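To show what such a decomposition does under the hood, here is a NumPy-only sketch of a truncated higher-order SVD. It is not TensorLy's implementation (whose result may differ, since library routines typically refine the factors iteratively); unfold, mode_dot, hosvd and reconstruct are hypothetical helper names, and the voxel array is random stand-in data.

```python
import numpy as np

def unfold(tensor, mode):
    """Matricize: move axis `mode` to the front and flatten the other axes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_dot(tensor, matrix, mode):
    """Multiply `tensor` by `matrix` along axis `mode`."""
    moved = np.moveaxis(tensor, mode, 0)
    out = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(out, 0, mode)

def hosvd(tensor, ranks):
    """Truncated higher-order SVD: one factor per axis plus a small core."""
    factors = []
    for mode, r in enumerate(ranks):
        # Leading left singular vectors of each unfolding span that axis.
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :r])
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_dot(core, u.T, mode)
    return core, factors

def reconstruct(core, factors):
    """Expand the core back to the original shape."""
    out = core
    for mode, u in enumerate(factors):
        out = mode_dot(out, u, mode)
    return out

rng = np.random.default_rng(0)
X = rng.integers(1, 11, size=(6, 7, 8)).astype(float)  # stand-in voxel values 1..10

core, factors = hosvd(X, ranks=(2, 3, 4))
approx = reconstruct(core, factors)
rel_error = np.linalg.norm(X - approx) / np.linalg.norm(X)
print(core.shape, rel_error)
```

The relative error plays the same role as the discarded singular values in the 2D case: the ranks needed to push it below a chosen threshold give a rough measure of how complex the image is.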

Python: Reshaping arrays and lists

I have a numpy ndarray object with the following shape:
(3, 256, 170, 256).
So, basically, this represents an array of 3-dimensional vectors. The vector dimension comes first, which lets one write something like array[0] for the relevant vector component.
Now, I am trying to use the scipy pdist function, which computes the distances between the entries. So I need to modify this array so that it can be represented as a two-dimensional matrix with 256*170*256 rows and 3 columns; pdist should then return the matrix where each element is the squared distance between the corresponding 3-dimensional vectors (if I have interpreted the documentation correctly).
Can someone tell me how I can get a view into this numpy array, so that I can generate this matrix. I do not want to copy the data again (as these matrices can be quite large), so looking for some efficient solutions.
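One way to get such a view, sketched on a small stand-in array: collapsing the three spatial axes of a C-contiguous array with reshape and then transposing does not copy the data. Two caveats that are not part of the question: pdist returns a condensed vector of Euclidean distances by default (metric='sqeuclidean' gives squared ones) and may still make its own contiguous copy internally, and 256*170*256 = 11,141,120 points would produce on the order of 6e13 pairwise distances, which is far too many to store.

```python
import numpy as np

# Small stand-in for the (3, 256, 170, 256) array from the question.
field = np.arange(3 * 4 * 5 * 6, dtype=float).reshape(3, 4, 5, 6)

# Collapse the spatial axes, then transpose: one row per point, 3 columns.
points = field.reshape(3, -1).T

print(points.shape)                     # (120, 3)
print(np.shares_memory(points, field))  # True, so no data was copied
```

Because points is a view, writing to it also changes field; calling .copy() on it gives an independent array if that is needed.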
