Second-order cooccurrence of terms in texts - python

Basically, I want to reimplement this video.
Given a corpus of documents, I want to find the terms that are most similar to each other.
I was able to generate a cooccurrence matrix using this SO thread and used the video to generate an association matrix. Next, I would like to generate a second-order cooccurrence matrix.
Problem statement: Consider a matrix where the rows of the matrix correspond to a term and the entries in the rows correspond to the top k terms similar to that term. Say, k = 4, and we have n terms in our dictionary, then the matrix M has n rows and 4 columns.
HAVE:
M = [[18, 34, 54, 65],  # Term IDs similar to term t_0
     [18, 12, 54, 65],  # Term IDs similar to term t_1
     ...
     [21, 43, 55, 78]]  # Term IDs similar to term t_n
So, M contains for each term ID the most similar term IDs. Now, I would like to check how many of those similar terms match. In the example of M above, it seems that term t_0 and term t_1 are quite similar, because three out of four terms match, whereas terms t_0 and t_n are not similar, because no terms match. Let's write M as a series of lists.
M = [list_0,  # Term IDs similar to term t_0
     list_1,  # Term IDs similar to term t_1
     ...
     list_n]  # Term IDs similar to term t_n
WANT:
C = [[f(list_0, list_0), f(list_0, list_1), ..., f(list_0, list_n)],
     [f(list_1, list_0), f(list_1, list_1), ..., f(list_1, list_n)],
     ...
     [f(list_n, list_0), f(list_n, list_1), ..., f(list_n, list_n)]]
I'd like to find the matrix C that has as its elements a function f applied to the lists of M. f(a, b) measures the degree of similarity between two lists a and b. Going with the example above, the degree of similarity between t_0 and t_1 should be high, whereas the degree of similarity of t_0 and t_n should be low.
My questions:
What is a good choice for comparing the ordering of two lists? That is, what is a good choice for function f?
Is there a transformation already available that takes as an input a matrix like M and produces a matrix like C? Preferably a python package?
Thank you, r0f1

In fact, cosine similarity might not be too bad in this case. The problem is that you don't want to use the index vectors (i.e. [18,34,54,65] and so on in your case); you want vectors of length n that are zero everywhere except at the positions in your index vector. Luckily, you don't have to create those vectors explicitly; you can just count how many indices the two index vectors have in common:
def f(u, v):
    return len(set(u).intersection(set(v)))
Here, I omitted a constant normalization factor k. There are some more elaborate things that one could do (for example the TF-IDF kernel), but I would stay with this for the start.
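For reference, a minimal normalized variant might look like this (a sketch, not part of the original answer; dividing by k, the list length from the problem statement, keeps the similarity in [0, 1]):
def f_normalized(u, v, k=4):
    # overlap similarity scaled to [0, 1]; k = 4 is the list length
    # from the problem statement (top-k similar terms per row)
    return len(set(u).intersection(v)) / k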
In order to run this efficiently using numpy, you would want to do two things:
Convert f to a ufunc, i.e. a numpy vectorized function. You can do that by uf = np.frompyfunc(f, 2, 1) (assuming that you did import numpy as np at some point).
Store M as a 1d array of lists (basically what you state in your second code listing). That's a little trickier, because numpy tries to be smart here, but you want something else. Here is how to do that:
n = len(M)
Marray = np.empty(n, dtype='O')  # dtype='O' allows you to have elements of type list
for i in range(n):
    Marray[i] = M[i]
Now, Marray contains essentially what you described in your second code listing. You can then use the new ufunc's outer method to get your similarity matrix. Here is how all of that would work together for your M from the example (assuming n=3):
M = [[18, 34, 54, 65],
     [18, 12, 54, 65],
     [21, 43, 55, 78]]
n = len(M)  # i.e. 3
uf = np.frompyfunc(f, 2, 1)
Marray = np.empty(n, dtype='O')
for i in range(n):
    Marray[i] = M[i]
similarities = uf.outer(Marray, Marray).astype('d')  # convert to float instead of object dtype
print(similarities)
# array([[4., 3., 0.],
#        [3., 4., 0.],
#        [0., 0., 4.]])
I hope that answers your questions.

You asked two questions, one somewhat open-ended (the first one) and another with a definitive answer, so I will start with the second one:
Is there a transformation already available that takes as an input a matrix like M and produces a matrix like C? Preferably a python package?
The answer is yes: the scipy.spatial.distance module contains functions that take a matrix like M and produce a matrix like C. The following example shows how:
import numpy as np
from scipy.spatial.distance import pdist, squareform

# initial data
M = [[18, 34, 54, 65],
     [18, 12, 54, 65],
     [21, 43, 55, 78]]

# convert to numpy array
arr = np.array(M)

result = squareform(pdist(arr, metric='euclidean'))
print(result)
Output
[[ 0.         22.         16.1245155 ]
 [22.          0.         33.76388603]
 [16.1245155  33.76388603  0.        ]]
As seen from the example above, pdist takes the M matrix and generates a C matrix. Note that the output of pdist is a condensed distance matrix, so you need to convert it to square form using squareform. Now for your first question:
What is a good choice for comparing the ordering of two lists? That is, what is a good choice for function f?
Given that order does matter in your particular case, I suggest you look at rank correlation coefficients such as Kendall's tau or Spearman's rho; both are provided in the scipy.stats package, along with a whole bunch of other coefficients. Usage example:
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import kendalltau, spearmanr

# distance functions
kendall = lambda x, y: kendalltau(x, y)[0]
spearman = lambda x, y: spearmanr(x, y)[0]

# initial data
M = [[18, 34, 54, 65],
     [18, 12, 54, 65],
     [21, 43, 55, 78]]

# convert to numpy array
arr = np.array(M)

# compute kendall C and convert to square form
kendall_result = 1 - squareform(pdist(arr, kendall))  # subtract from 1 because you want a similarity
print(kendall_result)
print()

# compute spearman C and convert to square form
spearman_result = 1 - squareform(pdist(arr, spearman))  # subtract from 1 because you want a similarity
print(spearman_result)
print()
Output
[[1.         0.33333333 0.        ]
 [0.33333333 1.         0.33333333]
 [0.         0.33333333 1.        ]]

[[1.  0.2 0. ]
 [0.2 1.  0.2]
 [0.  0.2 1. ]]
If those do not fit your needs, you can take a look at the Hamming distance, for example:
import numpy as np
from scipy.spatial.distance import pdist, squareform

# initial data
M = [[18, 34, 54, 65],
     [18, 12, 54, 65],
     [21, 43, 55, 78]]

# convert to numpy array
arr = np.array(M)

# compute hamming C and convert to square form
result = 1 - squareform(pdist(arr, 'hamming'))
print(result)
Output
[[1.   0.75 0.  ]
 [0.75 1.   0.  ]
 [0.   0.   1.  ]]
In the end, the choice of the similarity function will depend on your final application, so you will need to try out different functions and see which ones fit your needs. Both scipy.spatial.distance and scipy.stats provide a plethora of distance and coefficient functions you can try out.
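Note that pdist also accepts a plain Python callable, so the set-overlap function from the first answer can be plugged in here as well (a sketch; pdist only computes the off-diagonal entries, so the diagonal is filled in manually):
import numpy as np
from scipy.spatial.distance import pdist, squareform

M = np.array([[18, 34, 54, 65],
              [18, 12, 54, 65],
              [21, 43, 55, 78]])

overlap = lambda u, v: len(set(u).intersection(v))  # a similarity, not a distance
C = squareform(pdist(M, overlap))
np.fill_diagonal(C, M.shape[1])  # each list shares all k entries with itself
print(C)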
Further
The following paper contains a section on list similarity

I would suggest cosine similarity, as every list is a vector.
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity([list0], [list1])  # sklearn expects 2-D inputs, hence the extra brackets
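sklearn can also compute the whole matrix C in one call by passing M directly, since cosine_similarity operates pairwise on the rows of a 2-D array (a sketch; note this treats the raw ID lists as vectors, which the first answer cautions against):
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

M = np.array([[18, 34, 54, 65],
              [18, 12, 54, 65],
              [21, 43, 55, 78]])
C = cosine_similarity(M)  # n x n matrix of pairwise cosine similarities of the rows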

Related

3d array to matrix multiplication

I have a matrix called vec with two columns, vec[:,0] and vec[:,1]. P contains two matrices, P[0,:,:] and P[1,:,:]. I want to multiply P[0,:,:] with the first column of vec and multiply P[1,:,:] with the second column of vec. However, the operation P@vec also gives me the matrix product of P[0,:,:] with the second column of vec and the matrix product of P[1,:,:] with the first column of vec, which slows down my code.
Is it possible to directly compute the pairs column 1 to matrix 1 and column 2 to matrix 2 without the "off" products?
import numpy as np

P = np.arange(50).reshape(2, 5, 5)
vec = np.arange(10).reshape(5, 2)

have = P @ vec
want = np.column_stack((have[0, :, 0], have[1, :, 1]))
have, want
There is a very powerful function in numpy called np.einsum. It can perform all kind of tensor contractions, axis reordering and matrix multiplication. For your example you could use
res = np.einsum('nij,jn->in', P, vec)
after which res is exactly like want.
How does this work:
You give the np.einsum function both your arrays as well as a signature (that 'nij,jn->in' string) that tells the function how to multiply the arrays. In short, you want the third axis of the P tensor to be contracted with the first axis of vec. Therefore you choose the same index j for both in the signature string and leave it out of the part after the ->. Indices that appear on both sides of the -> are carried through without summation, which is what happens here with the n and i indices.
A more complete explanation of this very powerful function with many examples of how to use it can be found at the corresponding numpy documentation.
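To see the contraction spelled out, here is a loop-based equivalent of the signature (a sketch for checking the indices, not for speed):
import numpy as np

P = np.arange(50).reshape(2, 5, 5)
vec = np.arange(10).reshape(5, 2)

res = np.empty((5, 2))
for n in range(2):        # batch index, appears on both sides of '->'
    for i in range(5):    # row index of P, kept in the output
        res[i, n] = sum(P[n, i, j] * vec[j, n] for j in range(5))  # j is contracted

assert np.allclose(res, np.einsum('nij,jn->in', P, vec))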
@/matmul handles batches nicely, but the rules are that for 3d arrays the first dimension is the batch, and the dot is done on the last 2 dimensions, with the usual "last of A with the second-to-last of B" pairing.
It took a bit of reading to decipher your description, but it appears that you want the first axis of P to be the batch, and the last axis of vec to be the batch. That means vec needs to be transformed to a (2,5,1) to work with the (2,5,5) P.
In [176]: P @ vec.T[:, :, None]
Out[176]:
array([[[  60],
        [ 160],
        [ 260],
        [ 360],
        [ 460]],

       [[ 695],
        [ 820],
        [ 945],
        [1070],
        [1195]]])
The result is (2,5,1). We can squeeze out the last axis to get (2,5), but apparently you want a (5,2):
In [179]: (P @ vec.T[:, :, None])[..., 0].T
Out[179]:
array([[  60,  695],
       [ 160,  820],
       [ 260,  945],
       [ 360, 1070],
       [ 460, 1195]])
np.einsum('nij,jn->in', P, vec) does effectively the same, with n as the batch dimension that is 'carried through' to the result, and a sum of products over the shared j dimension.

How to put one entry across an entire diagonal for a sparse matrix in Python

I am seeking to construct a matrix of which I will calculate the inverse. This will be used in an implicit method for solving a nonlinear parabolic PDE. My current calculations, for reasons that will become obvious, are giving me a singular (non-invertible) matrix. For context, in reality the matrix will be of dimension 30 by 30, but in these examples I am using smaller matrices for testing purposes.
Say I want to create a large square sparse matrix. Using spdiags only allows you to input members of the main, lower and upper diagonals individually. So how do you make it so that each diagonal has one value for all its entries?
Example Code:
import numpy as np
from scipy.sparse import spdiags
from numpy.linalg import inv
updiag = -0.25
diag = 0.5
lowdiag = -0.25
Jdata = np.array([[diag], [lowdiag], [updiag]])
Diags = [0, -1, 1]
J = spdiags(Jdata, Diags, 3, 3).toarray()
print(J)
inverseJ = inv(J)
print(inverseJ)
This would produce a 3 x 3 matrix, but only with the first entry of each diagonal filled in. I wondered about using np.fill_diagonal, but that would require a matrix first and only handles the main diagonal. Am I misunderstanding something?
The first argument of spdiags is a matrix of values to be used as the diagonals. You can use it this way:
Jdata = np.array([3 * [diag], 3 * [lowdiag], 3 * [updiag]])
Diags = [0, -1, 1]
J = spdiags(Jdata, Diags, 3, 3).toarray()
print(J)
# [[ 0.5  -0.25  0.  ]
#  [-0.25  0.5  -0.25]
#  [ 0.   -0.25  0.5 ]]
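As an aside (not part of the original answer), scipy.sparse.diags — note: diags, not spdiags — broadcasts scalar diagonal values when a shape is given, so the repetition can be skipped entirely (a sketch):
from scipy.sparse import diags

J = diags([0.5, -0.25, -0.25], [0, -1, 1], shape=(3, 3)).toarray()
print(J)  # the same tridiagonal matrix as above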

How to multiply a 3D matrix with a 2D matrix efficiently in numpy

I have two multidimensional arrays, which I want to multiply with each other. One has the shape N,N,3 and the other has the shape N,N.
Let me set the stage:
I have an array of atom positions of the shape N,3:
atom_positions = [[x1, y1, z1],
                  [x2, y2, z2],
                  [x3, y3, z3],
                  ...
                 ]
From these I calculate an upper triangular matrix of distance vectors so that the resulting N,N,3 matrix contains all unique pair distance vectors r_ij of the vectors inside atom_positions:
pair_distance_vectors = [[[0,0,0], [x2-x1,y2-y1,z2-z1], [x3-x1,y3-y1,z3-z1], ...],
                         [[0,0,0], [0,0,0],             [x3-x2,y3-y2,z3-z2], ...],
                         ...
                        ]
Now I want to normalize each of these pair distance vectors. For that I want to use my N,N pair_distances array, which contains the length of every vector inside pair_distance_vectors.
The formula for a single vector is:
r_ij/|r_ij|
I want to do that by doing a matrix multiplication, where every entry in the N,N array becomes a scalar by which a vector inside the N,N,3 array is multiplied. I'm pretty sure that this can be achieved somehow with numpy by using numpy.dot() or a different function, but I just can't find the answer myself. Also, I'm afraid that if I do find a transformation which allows for this, my maths will be faulty.
Here's some demonstration code, which achieves what I want in a very inefficient fashion:
import numpy as np

pair_distance_vectors = np.ones(shape=(2, 2, 3))
pair_distances = np.array(((1, 2), (3, 4)))
normalized_pair_distance_vectors = np.zeros(shape=(2, 2, 3))

for i, vec_list in enumerate(pair_distance_vectors):
    for j, vec in enumerate(vec_list):
        normalized_pair_distance_vectors[i, j] = vec * pair_distances[i, j]
print(normalized_pair_distance_vectors)
Thanks in advance.
EDIT: Maybe this is clearer:
distance_vectors = [[[x11,y11,z11], [x12,y12,z12], [x13,y13,z13], ...],
                    [[x21,y21,z21], [x22,y22,z22], [x23,y23,z23], ...],
                    ...]
distance_matrix = [[r_11, r_12, r_13, ...],
                   [r_21, r_22, r_23, ...],
                   ...]
norm_distance_vectors = some_operation(distance_vectors, distance_matrix)
norm_distance_vectors = [[r_11*[x11,y11,z11], r_12*[x12,y12,z12], r_13*[x13,y13,z13], ...],
                         [r_21*[x21,y21,z21], r_22*[x22,y22,z22], r_23*[x23,y23,z23], ...],
                         ...]
You won't need a loop. The trick is to expand your pair_distances array in the 3rd dimension by repeating it m times (m being the dimension of your vectors, here 3) and then divide the two arrays elementwise (this works for any m-dimensional vectors; replace 3 with m):
pair_distances = np.repeat(pair_distances[:, :, None], 3, axis=2)
normalized_pair_distance_vectors = np.nan_to_num(pair_distance_vectors / pair_distances)
Output for your example inputs:
[[[1.         1.         1.        ]
  [0.5        0.5        0.5       ]]

 [[0.33333333 0.33333333 0.33333333]
  [0.25       0.25       0.25      ]]]
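As a side note (an observation, not part of the original answer), numpy broadcasting can do the same division without materializing the repeated array at all: adding a trailing axis to pair_distances lets the (2, 2, 1) array of lengths divide the (2, 2, 3) array of vectors directly:
import numpy as np

pair_distance_vectors = np.ones(shape=(2, 2, 3))
pair_distances = np.array(((1, 2), (3, 4)))

# (2, 2, 3) / (2, 2, 1) broadcasts along the last axis
normalized = np.nan_to_num(pair_distance_vectors / pair_distances[:, :, None])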

Difference between scipy.linalg.expm versus hand-coded one

I was trying to implement the matrix exponential function as in scipy.linalg.expm. I gained inspiration from kaityo256's github repository. I thus wrote down the following.
from scipy.linalg import expm
from scipy.linalg import eigh
from scipy.linalg import inv
from math import exp as math_exp
from numpy import array, zeros
from numpy.random import random_sample
from numpy.testing import assert_allclose
def diag2sqr(x):
    '''Makes a square matrix from a diagonal one.

    Takes a 1d matrix. Determines its data type.
    Finds out the shape of the 1d matrix.
    Makes an empty square matrix with both
    dimensions equal to the largest (nonzero) dimension of
    the 1d matrix. It then fills the elements of the
    1d matrix into the diagonal slots of the empty
    square one.

    Parameters
    ----------
    x : ndarray
        ndarray to be converted to a square ndarray

    Returns
    -------
    xsqr : ndarray
        ndarray with diagonals the same as those of x,
        all other elements zero,
        dtype same as that of x
    '''
    x_flat = x.ravel()
    xsqr = zeros((x_flat.shape[0], x_flat.shape[0]), dtype=x.dtype)
    # Making the empty matrix
    for i in range(x_flat.shape[0]):
        xsqr[i, i] = x_flat[i]
        # filling up the ith element
    print('xsqr', xsqr)
    return xsqr
def kaityo_expm(x):
    '''Exponentiates an ndarray (kaityo).

    Exponentiates an ndarray in the most naive way.

    Parameters
    ----------
    x : ndarray
        The ndarray to be exponentiated

    Returns
    -------
    kexpm : ndarray
        x after exponentiating
    '''
    rx, ux = eigh(x)
    # Find eigenvalues and eigenvectors;
    # the eigenvectors are composed to form a unitary
    ux_inv = inv(ux)
    # Inverse of the unitary
    # tx = diag([array([math_exp(i) for i in rx]).ravel()])
    # tx = array([math_exp(i) for i in rx])
    tx = diag2sqr(array([math_exp(i) for i in rx]))
    # Constructing the diagonal matrix
    kexpm1 = tx @ ux_inv
    kexpm = ux @ kexpm1
    return kexpm
Afterwards, I tried to test the above code versus scipy.linalg.expm.
x = random_sample((10, 10))
assert_allclose(expm(x), kaityo_expm(x))
This leads to the following output.
AssertionError:
Not equal to tolerance rtol=1e-07, atol=0
Mismatch: 100%
Max absolute difference: 7.04655733
Max relative difference: 0.59875635
x: array([[18.032424, 16.224408, 12.432163, 16.614248, 12.85653 , 13.705387,
15.096966, 10.577946, 18.399573, 17.938062],
[16.352809, 17.525898, 12.79079 , 16.295562, 13.512996, 14.407979,...
y: array([[18.649103, 13.157682, 11.264763, 16.099163, 15.2293 , 17.854499,
11.691586, 13.412066, 15.023189, 15.598455],
[13.157682, 13.612502, 9.628261, 12.659313, 13.559437, 13.382417,..
Obviously, the two implementations differ.
The questions are as follows:
Is it acceptable for them to differ?
Is my implementation wrong?
If my implementation is wrong, how do I fix it?
If my implementation is correct, when is it safe to use scipy.linalg.expm?
I have seen the following questions:
Matrix exponentiation with scipy: expm, expm2 and expm3
From a mathematical standpoint, the exponential of a matrix is defined using the Taylor series of the exponential:

e^A = sum_{k=0..inf} A^k / k!

If A is a diagonal square matrix, this reduces to exponentiating each diagonal entry:

e^A = diag(e^{a_11}, e^{a_22}, ..., e^{a_nn})

The problem arises when A is a generic square matrix, so before doing the exponential you need to diagonalize it using its eigenvalues and eigenvectors:

A = U Λ U^{-1}

with U the matrix of eigenvectors and Λ the matrix with the eigenvalues on the diagonal. At this point we are close to finding the exponential of the matrix:

e^A = U e^Λ U^{-1}

Now let's implement this result in a simple script:
>>> import numpy as np
>>> import scipy.linalg as ln
>>> A = [[2/3, -4/3, 2],
...      [5/6, 4/3, -2],
...      [5/6, -2/3, 0]]
>>> A = np.matrix(A)
>>> print(A)
[[ 0.66666667 -1.33333333 2. ]
[ 0.83333333 1.33333333 -2. ]
[ 0.83333333 -0.66666667 0. ]]
>>> eigvalue, eigvectors = np.linalg.eig(A)
>>> print("eigvalue: ", eigvalue)
>>> print("eigvectors:")
>>> print(eigvectors)
eigvalue: [ 1. -1. 2.]
eigvectors:
[[ 0.81649658 0.27216553 0.87287156]
[ 0.40824829 -0.68041382 -0.21821789]
[ 0.40824829 -0.68041382 0.43643578]]
>>> e_Lambda = np.eye(np.size(A, 0))*(np.exp(eigvalue))
>>> print(e_Lambda)
[[2.71828183 0. 0. ]
[0. 0.36787944 0. ]
[0. 0. 7.3890561 ]]
>>> e_A = eigvectors*e_Lambda*eigvectors.I
>>> print(e_A)
[[ 2.3265481 -6.22769903 7.01116649]
[ 0.97933433 4.27520659 -3.51559341]
[ 0.97933433 -3.11384951 3.87346269]]
>>> e_A2 = ln.expm(A)
>>> print(e_A2)
[[ 2.3265481 -6.22769903 7.01116649]
[ 0.97933433 4.27520659 -3.51559341]
[ 0.97933433 -3.11384951 3.87346269]]
>>> np.testing.assert_allclose(e_A, e_A2)
>>> print(e_A - e_A2)
[[-1.77635684e-15 1.77635684e-15 -8.88178420e-16]
[ 4.44089210e-16 -1.77635684e-15 8.88178420e-16]
[-2.22044605e-16 0.00000000e+00 4.44089210e-16]]
We see that the result is basically the same, so I think it's safe to use scipy.linalg.expm for matrix exponentiation.
I created a repo with the notebook for further testing.
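A side observation on the question itself (not part of the original answer): eigh assumes a symmetric/Hermitian input, while random_sample((10, 10)) is generally not symmetric, which by itself explains the mismatch. A sketch of the same test with the symmetry assumption satisfied:
import numpy as np
from scipy.linalg import expm, eigh, inv

x = np.random.random_sample((10, 10))
x = x + x.T  # symmetrize so that eigh's assumption holds

rx, ux = eigh(x)
kexpm = ux @ np.diag(np.exp(rx)) @ inv(ux)
np.testing.assert_allclose(expm(x), kexpm)  # passes for symmetric x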

Interpolating 2 numpy arrays

Is there any numpy, scipy, or python function to interpolate between two 2D numpy arrays? I have two 2D numpy arrays, and I want to apply changes to the first numpy array to make it similar to the second 2D array. The constraint is that I want the changes to be smooth. e.g., let the arrays be:
A
[[1 1 1]
 [1 1 1]
 [1 1 1]]
and
B
[[ 34 100  15]
 [ 62  17  87]
 [ 17  34  60]]
To make A similar to B, I could add 33 to the first grid cell of A, and so on. However, to make the changes smoother, I plan to compute a mean using a 2x2 window on array B and then apply the resulting changes to array A. Is there a built-in numpy or scipy method to do this, or to follow this approach, without using a for loop?
You've just described a Kalman Filtering / data fusion problem. You have an initial state A that has some errors and you have some observations B that also have some noise. You want to improve your estimate of state A by injecting some information from B, all while accounting for spatially correlated errors in both datasets. We don't have any prior information about the errors in A and B, so we can just make it up. Here's an implementation:
import numpy as np

# Make a matrix of the distances between points in an array
def dist(M):
    nx = M.shape[0]
    ny = M.shape[1]
    x = np.ravel(np.tile(np.arange(nx), (ny, 1))).reshape((nx*ny, 1))
    y = np.ravel(np.tile(np.arange(ny), (nx, 1))).reshape((nx*ny, 1))
    n, m = np.meshgrid(x, y)
    d = np.sqrt((n - n.T)**2 + (m - m.T)**2)
    return d

# Turn a distance matrix into a covariance matrix. Here is a linear covariance matrix.
def covariance(d, scaling_factor):
    c = (-d/np.amax(d) + 1)*scaling_factor
    return c

A = np.array([[1, 1, 1], [1, 1, 1], [1, 1, 1]])            # background state
B = np.array([[34, 100, 15], [62, 17, 87], [17, 34, 60]])  # observations

x = np.ravel(A).reshape((9, 1))  # vector representation
y = np.ravel(B).reshape((9, 1))  # vector representation

P_a = np.eye(9)*50            # background error covariance matrix (set to diagonal here)
P_b = covariance(dist(B), 2)  # observation error covariance matrix (set to a function of distance here)

# Compute the Kalman gain matrix
K = P_a.dot(np.linalg.inv(P_a + P_b))

x_new = x + K.dot(y - x)
A_new = x_new.reshape(A.shape)

print(A)
print(B)
print(A_new)
Now, this method only works if your data are unbiased. So mean(A) must equal mean(B). But you'll still get okay results regardless. Also, you can play with the covariance matrices however you like. I'd recommend reading the Kalman filter wikipedia page for more details.
By the way, the example above yields:
[[27.92920141 90.65490699  7.17920141]
 [55.92920141  7.65490699 79.17920141]
 [10.92920141 24.65490699 52.17920141]]
One way of smoothing could be to use convolve2d:
import numpy as np
from scipy import signal

B = np.array([[34, 100, 15],
              [62, 17, 87],
              [17, 34, 60]])

kernel = np.full((2, 2), .25)
smoothed = signal.convolve2d(B, kernel)
# [[ 8.5   33.5   28.75   3.75]
#  [24.    53.25  54.75  25.5 ]
#  [19.75  32.5   49.5   36.75]
#  [ 4.25  12.75  23.5   15.  ]]
The above pads the matrix with zeros from all sides and then calculates the mean of each 2x2 window placing the value at the center of the window.
If the matrices were actually larger, then using a 3x3 kernel (such as np.full((3, 3), 1/9)) and passing mode='same' to convolve2d would give a smoothed B with its shape preserved and elements "matching" the original. Otherwise you may need to decide what to do with the boundary values to make the shapes the same again.
To move A towards the smoothed B, it can be set to a chosen affine combination of the matrices using standard arithmetic operations, for instance: A = .2 * A + .8 * smoothed.
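Putting the two steps together, a minimal end-to-end sketch (the 0.2/0.8 weights are just the illustrative values from above, not a recommendation):
import numpy as np
from scipy import signal

A = np.ones((3, 3))
B = np.array([[34, 100, 15],
              [62, 17, 87],
              [17, 34, 60]], dtype=float)

kernel = np.full((3, 3), 1/9)
smoothed = signal.convolve2d(B, kernel, mode='same')  # shape-preserving smoothing
A = 0.2 * A + 0.8 * smoothed  # move A towards the smoothed B
print(A)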
