Dataframe faster cosine similarity - python

I have a dataframe consisting of individual tweets (id, text, author_id, nn_list) where nn_list is a list of other tweet indices which were previously identified as potential nearest neighbours. Now I have to calculate the cosine similarity of the index and every single entry of this list by looking at the index in the tfidf matrix to compare the vectors but with my current approach this is kind of slow. The current code looks something like this:
for index, row in data_df.iterrows():
    for candidate in row["nn_list"]:
        candidate_cos = float("%.2f" % pairwise_distances(tfidf_matrix[candidate], tfidf_matrix[index], metric='cosine'))
        if candidate_cos < nn_distance:
            current_nn_candidate = candidate
            nn_distance = candidate_cos
Is there a significantly faster way to calculate this?

The following code should work, provided the range of IDs is not too large:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity
df = pd.DataFrame({"nn_list": [[1, 2], [1,2,3], [1,2,3,7], [11, 12, 13], [2,1]]})
# Data consistent with https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html
df["data"] = df["nn_list"].apply(lambda x: np.repeat(1, len(x)))
df["row"] = df.index
# Select by column label; positional x[0] / x[1] on a labelled row is deprecated in recent pandas
df["row_ind"] = df[['row', 'nn_list']].apply(lambda x: np.repeat(x['row'], len(x['nn_list'])), axis=1)
df["col_ind"] = df['nn_list'].apply(lambda x: np.array(x))
m = csr_matrix(
    (np.concatenate(df['data']),
     (np.concatenate(df['row_ind']), np.concatenate(df['col_ind']))))
cosine_similarity(m)
Will return:
array([[1. , 0.81649658, 0.70710678, 0. , 1. ],
[0.81649658, 1. , 0.8660254 , 0. , 0.81649658],
[0.70710678, 0.8660254 , 1. , 0. , 0.70710678],
[0. , 0. , 0. , 1. , 0. ],
[1. , 0.81649658, 0.70710678, 0. , 1. ]])
If you have a larger range of IDs, I recommend using Spark or having a look at cosine similarity on large sparse matrices with numpy.
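A minimal sketch of another way to speed up the loop from the question, not taken from the answer above. It assumes the tf-idf matrix is a SciPy sparse matrix (or something convertible to one) whose row positions match the indices stored in nn_list. The names nearest_candidates and nn_lists are hypothetical. The idea is to L2-normalise the rows once, so that the cosine similarity against all candidates of a tweet becomes a single sparse product instead of one pairwise_distances call per candidate.

```python
import numpy as np
from scipy import sparse


def nearest_candidates(tfidf, nn_lists):
    """Return (best_candidate, cosine_distance) for every row of tfidf.

    Assumes nn_lists[i] holds row positions of tfidf (hypothetical layout).
    """
    tfidf = sparse.csr_matrix(tfidf, dtype=float)
    # L2-normalise each row once, so a dot product equals the cosine similarity
    norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0  # leave all-zero rows untouched
    unit = (sparse.diags(1.0 / norms) @ tfidf).tocsr()

    result = []
    for i, candidates in enumerate(nn_lists):
        if len(candidates) == 0:
            result.append((None, np.inf))
            continue
        cand = np.asarray(candidates)
        # One sparse product gives the similarity to all candidates at once
        sims = (unit[cand] @ unit[i].T).toarray().ravel()
        best = int(np.argmax(sims))
        result.append((int(cand[best]), 1.0 - sims[best]))
    return result
```

With the names from the question, this could be called as nearest_candidates(tfidf_matrix, data_df["nn_list"]), provided the DataFrame index is the default 0..n-1 range.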

Related

How to analyze all traces in a graph analyzed with minimum spanning trees?

The Minimum Spanning Tree (MST) algorithm finds the shortest possible connection between all nodes. In Python, SciPy provides a package to compute MSTs. I want to find a way to get the main traces based on the connections. In the example below, there is a trace at dest2-dest4-dest1 and another one at source-dest4-dest3 (other combinations of traces are possible as well).
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial import distance_matrix
import pandas as pd
import numpy as np
The example has the following coordinate points for source and destinations
df = pd.DataFrame([[2, 2], [30, 2], [2, 30], [25, 25], [14,10]], columns=['xcord', 'ycord'], index=['source', 'dest1', 'dest2', 'dest3', 'dest4'])
A distance matrix and minimum spanning tree with resulting distances can be computed.
dm = pd.DataFrame(distance_matrix(df.values, df.values), index=df.index, columns=df.index)
mst = minimum_spanning_tree(dm)
arr_distances = mst.toarray()
This gives the following array.
array([[ 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. ],
[14.4222051 , 17.88854382, 23.32380758, 18.60107524, 0. ]])
Which means, distance source-dest4 = 14.42, dest1-dest4 = 17.89, dest2-dest4 = 23.32 and dest3-dest4 = 18.60.
How can one capture the available main trace(s) expressed in a list or numpy.array like ['source', 'dest4', 'dest3'] and ['dest2', 'dest4', 'dest1'] ? This should work for any configuration of input coordinates.
So far I have tried to get the main connexions using np.argwhere(arr_distances > 0), but somehow it is difficult to get them as aforementioned traces in different lists or sublists.
It would be good if the solution were in Python, but any other package than Scipy is welcome. Thank you very much in advance for your help!
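A minimal sketch of one possible reading of "main traces", namely the paths between the leaf nodes of the tree. This interpretation is an assumption, not something the question confirms. The array below is the MST output shown above, typed in by hand so the example runs without SciPy; tree_path is a hypothetical helper name.

```python
import numpy as np
from itertools import combinations

labels = ['source', 'dest1', 'dest2', 'dest3', 'dest4']

# MST distances copied from the output above (row 4 is dest4)
arr_distances = np.zeros((5, 5))
arr_distances[4, :4] = [14.4222051, 17.88854382, 23.32380758, 18.60107524]

# Build an undirected adjacency map from the non-zero entries
adjacency = {name: set() for name in labels}
for i, j in np.argwhere(arr_distances > 0):
    adjacency[labels[i]].add(labels[j])
    adjacency[labels[j]].add(labels[i])


def tree_path(start, goal):
    """Depth-first search; in a tree the path between two nodes is unique."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for nxt in adjacency[node]:
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    return None


# Leaves are the nodes with exactly one connection
leaves = [name for name in labels if len(adjacency[name]) == 1]
traces = [tree_path(a, b) for a, b in combinations(leaves, 2)]
```

For this example every trace passes through dest4, and the list includes ['source', 'dest4', 'dest3'] as well as the pairing of dest1 and dest2.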

How to store a distance matrix more efficiently?

I have this Python code to calculate the distances between different points, given coordinates like these:
IDs,X,Y,Z
0-20,193.722,175.733,0.0998975
0-21,192.895,176.727,0.0998975
7-22,187.065,178.285,0.0998975
0-23,192.296,178.648,0.0998975
7-24,189.421,179.012,0.0998975
8-25,179.755,179.347,0.0998975
8-26,180.436,179.288,0.0998975
7-27,186.453,179.2,0.0998975
8-28,178.899,180.92,0.0998975
The code works perfectly, but since I now have a very large number of coordinates (~50000), I need to optimise it; otherwise it is impossible to run. Could someone suggest a more memory-efficient way of doing this? Thanks for any suggestion.
#!/usr/bin/env python
import pandas as pd
import scipy.spatial as spsp
df_1 =pd.read_csv('Spots.csv', sep=',')
coords = df_1[['X', 'Y', 'Z']].to_numpy()
distances = spsp.distance_matrix(coords, coords)
df_1['dist'] = distances.tolist()
# CREATES columns d0, d1, d2, d3
dist_cols = df_1['IDs']
df_1[dist_cols] = df_1['dist'].apply(pd.Series)
df_1.to_csv("results_Spots.csv")
There are a couple of ways to save space. The first is to only store the upper triangle of your matrix and make sure that your indices always reflect that. The second is only to store the values that meet your threshold. This can be done collectively by using sparse matrices, which support most of the operations you will likely need, and will only store the elements you need.
To store half the data, preprocess your indices when you access your matrix. So for your matrix, access index [i, j] like this:
def getitem(A, i, j):
    # A holds only the upper triangle, so swap the indices when needed
    if i > j:
        i, j = j, i
    return A[i, j]
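A minimal sketch of the upper-triangle idea taken one step further: instead of an N x N array that is half empty, the n(n-1)/2 distances are kept in a flat array. The helper names condensed_index and condensed_distances are hypothetical. The ordering is row by row over the upper triangle, which is the same condensed layout that scipy.spatial.distance.pdist returns.

```python
import numpy as np


def condensed_index(i, j, n):
    """Position of the pair (i, j), i != j, in the flat upper-triangle array."""
    if i == j:
        raise ValueError("the diagonal is not stored (distance is zero)")
    if i > j:
        i, j = j, i
    return n * i - i * (i + 1) // 2 + (j - i - 1)


def condensed_distances(coords):
    """Compute all pairwise distances one row at a time into a flat array."""
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    out = np.empty(n * (n - 1) // 2)
    pos = 0
    for i in range(n - 1):
        # Distances from point i to all later points
        d = np.sqrt(((coords[i + 1:] - coords[i]) ** 2).sum(axis=1))
        out[pos:pos + len(d)] = d
        pos += len(d)
    return out


points = np.array([[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [0.0, 0.0, 1.0]])
flat = condensed_distances(points)
d_10 = flat[condensed_index(1, 0, len(points))]  # same entry as the pair (0, 1)
```

This halves the storage but is still dense, so for ~50000 points it only helps in combination with a threshold or a sparse format.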
scipy.sparse supports a number of sparse matrix formats: BSR, Coordinate, CSR, CSC, Diagonal, DOK, LIL. According to the usage reference, the easiest way to construct a matrix is using DOK or LIL format. I will show the latter for simplicity, although the former may be more efficient. I will leave it up to the reader to benchmark different options once a basic functioning approach has been shown. Remember to convert to CSR or CSC format when doing matrix math.
We will sacrifice speed for spatial efficiency by constructing one row at a time:
import numpy as np
import scipy.sparse

N = coords.shape[0]
threshold = 2
threshold2 = threshold**2 # minor optimization to save on some square roots
distances = scipy.sparse.lil_matrix((N, N))
for i in range(N):
    # Compute square distances
    d2 = np.sum(np.square((coords[i + 1:, :] - coords[i])), axis=1)
    # Threshold
    mask = np.flatnonzero(d2 <= threshold2)
    # Apply, only compute square root if necessary
    distances[i, mask + i + 1] = np.sqrt(d2[mask])
For your toy example, we find that there are only four elements that actually pass threshold, making the storage very efficient:
>>> distances.nnz
4
>>> distances.toarray()
array([[0. , 1.29304486, 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 1.1008038 , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0.68355102, 0. , 1.79082802],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ]])
Using the result from scipy.spatial.distance_matrix confirms that these numbers are in fact accurate.
If you want to fill the matrix (effectively doubling the storage, which should not be prohibitive), you should probably move away from LIL format before doing so. Simply add the transpose to the original matrix to fill it out.
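A minimal sketch of filling out the matrix by adding the transpose, using a tiny hand-made upper-triangular matrix rather than the real data. The variable names upper and full are only for illustration.

```python
import numpy as np
import scipy.sparse

# Small upper-triangular example built in LIL format
upper = scipy.sparse.lil_matrix((3, 3))
upper[0, 1] = 1.5
upper[1, 2] = 2.0

# Convert to CSR before doing arithmetic, then mirror across the diagonal
upper_csr = upper.tocsr()
full = upper_csr + upper_csr.T
```

Because the diagonal is empty, no entry is counted twice, and the number of stored values simply doubles.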
The approach shown here addresses your storage concerns, but you can improve the efficiency of the entire computation using spatial sorting and other geospatial techniques. For example, you could use scipy.spatial.KDTree or the similar scipy.spatial.cKDTree to arrange your dataset and query neighbors within a specific threshold directly and efficiently.
For example, the following would replace the matrix construction shown here with what is likely a more efficient method:
tree = scipy.spatial.KDTree(coords)
distances = tree.sparse_distance_matrix(tree, threshold)
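A minimal sketch of a related tree query on the toy data from the question, assuming a list of close pairs labelled with their IDs is what is wanted in the end. It uses query_pairs, which returns index pairs (i, j) with i < j whose distance is at most the threshold; the variable names ids and close_pairs are only for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

ids = np.array(['0-20', '0-21', '7-22', '0-23', '7-24', '8-25', '8-26', '7-27', '8-28'])
coords = np.array([
    [193.722, 175.733, 0.0998975],
    [192.895, 176.727, 0.0998975],
    [187.065, 178.285, 0.0998975],
    [192.296, 178.648, 0.0998975],
    [189.421, 179.012, 0.0998975],
    [179.755, 179.347, 0.0998975],
    [180.436, 179.288, 0.0998975],
    [186.453, 179.2, 0.0998975],
    [178.899, 180.92, 0.0998975],
])

threshold = 2.0
tree = cKDTree(coords)

# Each close pair is reported once, as (i, j) with i < j
pairs = sorted(tree.query_pairs(threshold))

# Attach the IDs and the actual distance to every pair
close_pairs = [
    (ids[i], ids[j], float(np.linalg.norm(coords[i] - coords[j])))
    for i, j in pairs
]
```

The full N x N matrix is never built, so memory use grows with the number of close pairs rather than with N squared.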
Your code asks for point-to-point distances in a ~50000 x ~50000 matrix. The result will be very big if you really want to store it. The matrix is dense, as each point has a positive distance to every other point.
I recommend revisiting your business requirements. Do you really need to calculate all these distances up front and store them in a file on disk? Sometimes it is better to do the required calculations on the fly; scipy.spatial is fast, perhaps not even much slower than reading a precalculated value.
EDIT (based on comment):
You can filter calculated distances by a threshold (here for illustration: 5.0) and then look up the IDs in the DataFrame
import numpy as np
import pandas as pd
import scipy.spatial as spsp
df_1 =pd.read_csv('Spots.csv', sep=',')
coords = df_1[['X', 'Y', 'Z']].to_numpy()
distances = spsp.distance_matrix(coords, coords)
# Index pairs (row, column) of all distances below the threshold
adj_5 = np.argwhere(distances[:] < 5.0)
pd.DataFrame(zip(df_1['IDs'][adj_5[:,0]].values,
                 df_1['IDs'][adj_5[:,1]].values),
             columns=['from', 'to'])

multiplying a 1D array by items in a 3D array

I want to ask a question about multiplying items in a 1D array with items returned from a function that are a matrix in the form of a 3D array.
I have the following array of numbers named mass_array:
array([12.0107 , 1.00794, 12.0107 , 1.00794, 12.0107 , 1.00794,
12.0107 , 1.00794, 12.0107 , 1.00794, 12.0107 , 1.00794])
and the following array of 3D coordinates (one row per atom) named coordinate_array:
array([[ 0. , 1.40272, 0. ],
[ 0. , 2.49029, 0. ],
[-1.21479, 0.70136, 0. ],
[-2.15666, 1.24515, 0. ],
[-1.21479, -0.70136, 0. ],
[-2.15666, -1.24515, 0. ],
[ 0. , -1.40272, 0. ],
[ 0. , -2.49029, 0. ],
[ 1.21479, -0.70136, 0. ],
[ 2.15666, -1.24515, 0. ],
[ 1.21479, 0.70136, 0. ],
[ 2.15666, 1.24515, 0. ]])
I am going to perform a calculation on each of these lines (each of which corresponds to an atom of benzene) to return a 3x3 matrix using a function called buildi, which performs calculations on a 1x3 matrix.
I want to multiply each corresponding item in mass_array by the result of the buildi function with its corresponding line on coordinate_array:
e.g.
for line 1 of both arrays multiplied together:
12.0107 * buildi([ 0. , 1.40272, 0. ])
and then for line 2 of both arrays:
1.00794 * buildi([ 0. , 2.49029, 0. ])
all the way down to the very last line,
1.00794 * buildi([ 2.15666, 1.24515, 0. ])
and add the results of each of these multiplications to a final array.
My attempt at doing this ended up as such:
def inertia_matrix(array1, array2):
    inertia_molecule = np.array([[0, 0, 0], [0, 0, 0], [0, 0, 0]])
    for atom in array2:
        inertia_molecule = inertia_molecule + buildi(atom)
    print(inertia_molecule)
The problem, however, is that I can't 'map' the molecular weight to the corresponding line in the for loop.
My intention was to attempt something like:
for atom in array2 and weight in array1:
    inertia_molecule = inertia_molecule + weight*buildi(atom)
but I couldn't work anything out that would fit such a purpose.
I attempted to use the zip function but I couldn't make it accommodate the weight*buildi(atom) part of my code.
How can I solve this problem?
The zip function is made for exactly this use case:
inertia_molecules = []
for mass, atom in zip(mass_array, coordinate_array):
    inertia_molecules.append( mass * buildi(atom) )
Now the list inertia_molecules holds a list of all 3x3 matrices produced by the calculations.
(If you are dealing with a large list, you might want to pre-allocate the space for speed and then access the individual cells instead of appending new values to the end.)
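A minimal sketch of accumulating the weighted matrices into a single 3x3 result, which is what the question ultimately asks for. The body of buildi is not shown in the question, so the version here is a stand-in that assumes the usual inertia-tensor term for a unit point mass, (r·r)I - r rᵀ; swap in the real function as needed. The short demo arrays are made up for illustration.

```python
import numpy as np


def buildi(r):
    """Hypothetical stand-in for the question's buildi: (r.r) * I - outer(r, r)."""
    r = np.asarray(r, dtype=float)
    return np.dot(r, r) * np.eye(3) - np.outer(r, r)


def inertia_matrix(masses, coordinates):
    """Sum mass * buildi(atom) over all atoms, pairing rows with zip."""
    total = np.zeros((3, 3))
    for mass, atom in zip(masses, coordinates):
        total += mass * buildi(atom)
    return total


# Tiny made-up example: two point masses on the x and y axes
demo_masses = np.array([2.0, 3.0])
demo_coords = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
total = inertia_matrix(demo_masses, demo_coords)

# Loop-free equivalent, valid only for the stand-in buildi above
r2 = (demo_coords ** 2).sum(axis=1)
vectorised = (demo_masses * r2).sum() * np.eye(3) - np.einsum(
    'i,ij,ik->jk', demo_masses, demo_coords, demo_coords)
```

Starting from np.zeros((3, 3)) rather than an integer array keeps the accumulator in floating point, so in-place addition works.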

python negative sum of rows on matrix diagonal efficiently

I need to sum the rows of a matrix, negate them, and put them on the diagonal of either the original matrix or a matrix where the off-diagonal terms are zero. What works is
Mat2 = numpy.diag(numpy.negative(numpy.squeeze(numpy.asarray(numpy.sum(Mat1,axis=1)))))
Is there a cleaner/faster way to do this? I'm trying to optimize some code.
I think np.diag(-Mat1.A.sum(1)) would produce the same results:
>>> Mat1 = np.matrix(np.random.rand(3,3))
>>> Mat1
matrix([[ 0.35702661, 0.0191392 , 0.34793743],
[ 0.9052968 , 0.16182118, 0.2239716 ],
[ 0.57865916, 0.77934846, 0.60984091]])
>>> Mat2 = np.diag(np.negative(np.squeeze(np.asarray(np.sum(Mat1,axis=1)))))
>>> Mat2
array([[-0.72410324, 0. , 0. ],
[ 0. , -1.29108958, 0. ],
[ 0. , 0. , -1.96784852]])
>>> np.diag(-Mat1.A.sum(1))
array([[-0.72410324, 0. , 0. ],
[ 0. , -1.29108958, 0. ],
[ 0. , 0. , -1.96784852]])
Note that matrices are a bit of a headache in numpy -- arrays are generally much more convenient -- and the only syntactic advantage they had, namely easier multiplication, doesn't really count any more now that we have @ for matrix multiplication in modern Python.
If Mat1 were an array instead of a matrix, you wouldn't need the .A there.
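A minimal sketch of both variants from the question with a plain array, using a small made-up input. np.diag builds a new matrix whose off-diagonal terms are zero, while np.fill_diagonal overwrites the diagonal of an existing array in place.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Variant 1: new matrix with the negated row sums on the diagonal
neg_row_sums = -A.sum(axis=1)
D = np.diag(neg_row_sums)

# Variant 2: keep the off-diagonal terms and overwrite the diagonal in place
B = A.copy()
np.fill_diagonal(B, neg_row_sums)
```

Note that the row sums are computed before the diagonal is overwritten, so they still include the original diagonal entries.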

find Markov steady state with left eigenvalues (using numpy or scipy)

I need to find the steady state of Markov models using the left eigenvectors of their transition matrices using some python code.
It has already been established in this question that scipy.linalg.eig fails to provide actual left eigenvectors as described, but a fix is demonstrated there. The official documentation is mostly useless and incomprehensible as usual.
A bigger problem than the incorrect format is that the eigenvalues produced are not in any particular order (not sorted and different each time). So if you want to find the left eigenvectors that correspond to the 1 eigenvalues you have to hunt for them, and this poses its own problem (see below). The math is clear, but how to get python to compute this and return the correct eigenvectors is not clear. Other answers to this question, like this one, don't seem to be using the left eigenvectors, so those can't be correct solutions.
This question provides a partial solution, but it doesn't account for the unordered eigenvalues of larger transition matrices. So, just using
leftEigenvector = scipy.linalg.eig(A,left=True,right=False)[1][:,0]
leftEigenvector = leftEigenvector / sum(leftEigenvector)
is close, but doesn't work in general because the entry in the [:,0] position may not be the eigenvector for the correct eigenvalue (and in my case it is usually not).
Okay, but the output of scipy.linalg.eig(A,left=True,right=False) is an array in which the [0] element is a list of each eigenvalue (not in any order) and that is followed in position [1] by an array of eigenvectors in a corresponding order to those eigenvalues.
I don't know a good way to sort or search that whole thing by the eigenvalues to pull out the correct eigenvectors (all eigenvectors with eigenvalue 1, normalized by the sum of the vector entries). My thinking is to get the indices of the eigenvalues that equal 1, and then pull those columns from the array of eigenvectors. My version of this is slow and cumbersome. First I have a function (that doesn't quite work) to find the positions in a list that match a value:
# Find the positions of the element a in theList
def findPositions(theList, a):
    return [i for i, x in enumerate(theList) if x == a]
Then I use it like this to get the eigenvectors matching the eigenvalues = 1.
M = transitionMatrix( G )
leftEigenvectors = scipy.linalg.eig(M,left=True,right=False)
unitEigenvaluePositions = findPositions(leftEigenvectors[0], 1.000)
steadyStateVectors = []
for i in unitEigenvaluePositions:
    thisEigenvector = leftEigenvectors[1][:,i]
    thisEigenvector = thisEigenvector / sum(thisEigenvector)
    steadyStateVectors.append(thisEigenvector)
print(steadyStateVectors)
But actually this doesn't work. There is one eigenvalue = 1.00000000e+00 +0.00000000e+00j that is not found even though two others are.
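A minimal sketch of why an exact comparison can miss such an eigenvalue: a value that prints as 1.00000000e+00 may differ from 1 in its last bits. The numbers below are made up for illustration and the tolerance of 1e-10 is an assumption; np.isclose compares within a tolerance instead of testing exact equality.

```python
import numpy as np

# Made-up eigenvalues: three of them round to 1.0 when printed
eigenvalues = np.array([1.0 + 0j,
                        0.9999999999999998 + 0j,
                        0.3 + 0j,
                        1.0000000000000002 + 0j])

# Exact equality only finds the value that is bit-for-bit equal to 1
exact_positions = [i for i, x in enumerate(eigenvalues) if x == 1.0]

# Tolerance-based comparison finds all three
close_positions = np.flatnonzero(np.isclose(eigenvalues, 1.0, rtol=0.0, atol=1e-10))
```

The resulting positions can be used directly to select columns from the eigenvector array.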
My expectation is that I am not the first person to use python to find stationary distributions of Markov models. Somebody who is more proficient/experienced probably has a working general solution (whether using numpy or scipy or not). Considering how popular Markov models are I expected there to be a library just for them to perform this task, and maybe it does exist but I couldn't find one.
You linked to How do I find out eigenvectors corresponding to a particular eigenvalue of a matrix? and said it doesn't compute the left eigenvector, but you can fix that by working with the transpose.
For example,
In [901]: import numpy as np
In [902]: import scipy.sparse.linalg as sla
In [903]: M = np.array([[0.5, 0.25, 0.25, 0], [0, 0.1, 0.9, 0], [0.2, 0.7, 0, 0.1], [0.2, 0.3, 0, 0.5]])
In [904]: M
Out[904]:
array([[ 0.5 , 0.25, 0.25, 0. ],
[ 0. , 0.1 , 0.9 , 0. ],
[ 0.2 , 0.7 , 0. , 0.1 ],
[ 0.2 , 0.3 , 0. , 0.5 ]])
In [905]: eval, evec = sla.eigs(M.T, k=1, which='LM')
In [906]: eval
Out[906]: array([ 1.+0.j])
In [907]: evec
Out[907]:
array([[-0.32168797+0.j],
[-0.65529032+0.j],
[-0.67018328+0.j],
[-0.13403666+0.j]])
In [908]: np.dot(evec.T, M).T
Out[908]:
array([[-0.32168797+0.j],
[-0.65529032+0.j],
[-0.67018328+0.j],
[-0.13403666+0.j]])
To normalize the eigenvector (which you know should be real):
In [913]: u = (evec/evec.sum()).real
In [914]: u
Out[914]:
array([[ 0.18060201],
[ 0.36789298],
[ 0.37625418],
[ 0.07525084]])
In [915]: np.dot(u.T, M).T
Out[915]:
array([[ 0.18060201],
[ 0.36789298],
[ 0.37625418],
[ 0.07525084]])
If you don't know the multiplicity of eigenvalue 1 in advance, see @pv.'s comment showing code using scipy.linalg.eig. Here's an example:
In [984]: M
Out[984]:
array([[ 0.9 , 0.1 , 0. , 0. , 0. , 0. ],
[ 0.3 , 0.7 , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0.25, 0.75, 0. , 0. ],
[ 0. , 0. , 0.5 , 0.5 , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. , 1. ],
[ 0. , 0. , 0. , 0. , 1. , 0. ]])
In [985]: import scipy.linalg as la
In [986]: evals, lvecs = la.eig(M, right=False, left=True)
In [987]: tol = 1e-15
In [988]: mask = abs(evals - 1) < tol
In [989]: evals = evals[mask]
In [990]: evals
Out[990]: array([ 1.+0.j, 1.+0.j, 1.+0.j])
In [991]: lvecs = lvecs[:, mask]
In [992]: lvecs
Out[992]:
array([[ 0.9486833 , 0. , 0. ],
[ 0.31622777, 0. , 0. ],
[ 0. , -0.5547002 , 0. ],
[ 0. , -0.83205029, 0. ],
[ 0. , 0. , 0.70710678],
[ 0. , 0. , 0.70710678]])
In [993]: u = lvecs/lvecs.sum(axis=0, keepdims=True)
In [994]: u
Out[994]:
array([[ 0.75, -0. , 0. ],
[ 0.25, -0. , 0. ],
[ 0. , 0.4 , 0. ],
[ 0. , 0.6 , 0. ],
[ 0. , -0. , 0.5 ],
[ 0. , -0. , 0.5 ]])
In [995]: np.dot(u.T, M).T
Out[995]:
array([[ 0.75, 0. , 0. ],
[ 0.25, 0. , 0. ],
[ 0. , 0.4 , 0. ],
[ 0. , 0.6 , 0. ],
[ 0. , 0. , 0.5 ],
[ 0. , 0. , 0.5 ]])
All right, I had to make some changes while implementing Warren's solution and I've included those below. It's basically the same so he gets all the credit, but the realities of numerical approximations with numpy and scipy required more massaging which I thought would be helpful to see for others trying to do this in the future. I also changed the variable names to be super noob-friendly.
Please let me know if I got anything wrong or there are further recommended improvements (e.g. for speed).
# in this case my Markov model is a weighted directed graph, so convert that nx.graph (G) into its transition matrix
M = transitionMatrix( G )
# create an array of the eigenvalues and a separate array of the left eigenvectors
theEigenvalues, leftEigenvectors = scipy.linalg.eig(M, right=False, left=True)
# for stationary distribution the eigenvalues and vectors are always real, and this speeds it up a bit
theEigenvalues = theEigenvalues.real
leftEigenvectors = leftEigenvectors.real
# set how small a difference must be to count as zero...1e-15 was too low to find one of the actual eigenvalues
tolerance = 1e-10
# create a filter to collect the eigenvalues that are near enough to one
mask = abs(theEigenvalues - 1) < tolerance
# apply that filter
theEigenvalues = theEigenvalues[mask]
# filter out the eigenvectors whose eigenvalues are not one
leftEigenvectors = leftEigenvectors[:, mask]
# convert all the tiny and negative values to zero to isolate the actual stationary distributions
# (note: an eigenvector returned with flipped sign, i.e. all entries negative, would be zeroed out entirely here)
leftEigenvectors[leftEigenvectors < tolerance] = 0
# normalize each distribution by the sum of the eigenvector columns
attractorDistributions = leftEigenvectors / leftEigenvectors.sum(axis=0, keepdims=True)
# this checks that the vectors are actually the left eigenvectors, but I guess it's not needed for normal use
#attractorDistributions = np.dot(attractorDistributions.T, M).T
# convert the column vectors into row vectors (lists) for each attractor (the standard output for this kind of analysis)
attractorDistributions = attractorDistributions.T
# a list of the states in any attractor with the approximate stationary distribution within THAT attractor (e.g. for graph coloring)
theSteadyStates = np.sum(attractorDistributions, axis=0)
Putting that all together in an easy copy-and-paste format:
M = transitionMatrix( G )
theEigenvalues, leftEigenvectors = scipy.linalg.eig(M, right=False, left=True)
theEigenvalues = theEigenvalues.real
leftEigenvectors = leftEigenvectors.real
tolerance = 1e-10
mask = abs(theEigenvalues - 1) < tolerance
theEigenvalues = theEigenvalues[mask]
leftEigenvectors = leftEigenvectors[:, mask]
leftEigenvectors[leftEigenvectors < tolerance] = 0
attractorDistributions = leftEigenvectors / leftEigenvectors.sum(axis=0, keepdims=True)
attractorDistributions = attractorDistributions.T
theSteadyStates = np.sum(attractorDistributions, axis=0)
Using this analysis on a generated Markov model produced one attractor (of three) with a steady state distribution of 0.19835218 and 0.80164782 compared to the mathematically accurate values of 0.2 and 0.8. So that's more than 0.1% off, kind of a big error for science. That's not a REAL problem because, if accuracy is important, a more accurate analysis of the behavior within each attractor can be done using a matrix subset, now that the individual attractors have been identified.
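A minimal sketch of the same idea as a self-contained function, with two deliberate differences that are my own assumptions rather than part of the code above: it takes the right eigenvectors of the transpose via np.linalg.eig (which are the left eigenvectors of M), and it normalizes by the column sum before clipping, so an eigenvector returned with flipped sign is not lost. The function name stationary_distributions is hypothetical, and the test matrix is the 4x4 example from the earlier answer.

```python
import numpy as np


def stationary_distributions(M, tol=1e-10):
    """Return one row per eigenvalue-1 left eigenvector, each normalized to sum to 1."""
    M = np.asarray(M, dtype=float)
    # Right eigenvectors of M.T are left eigenvectors of M
    evals, evecs = np.linalg.eig(M.T)
    mask = np.abs(evals - 1.0) < tol
    vecs = evecs[:, mask].real
    # Dividing by the column sum also removes an overall sign flip
    vecs = vecs / vecs.sum(axis=0, keepdims=True)
    # Clip numerical noise around zero
    vecs[np.abs(vecs) < tol] = 0.0
    return vecs.T


M = np.array([[0.5, 0.25, 0.25, 0.0],
              [0.0, 0.1, 0.9, 0.0],
              [0.2, 0.7, 0.0, 0.1],
              [0.2, 0.3, 0.0, 0.5]])
pi = stationary_distributions(M)
```

For a chain with several attractors the eigenvalue 1 is repeated, and an eigensolver is then free to return any basis of that eigenspace, so the columns are not guaranteed to separate cleanly by attractor in every case.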
