Soft cosine distance between two vectors (Python) - python

I am wondering if there is a good way to calculate the soft cosine distance between two vectors of numbers. So far, I have only found solutions for sentences, which unfortunately did not help me.
Say I have two vectors like this:
a = [0,.25,.25,0,.5]
b = [.5,.0,.0,0.25,.25]
Now, I know that the features in the vectors exhibit some degree of similarity among them. This is described via:
s = [[0,   .67, .25, .78, .53],
     [.53, 0,   .33, .25, .25],
     [.45, .33, 0,   .25, .25],
     [.85, .04, .11, 0,   .25],
     [.95, .33, .44, .25, 0]]
So a and b are 1x5 vectors, and s is a 5x5 matrix, describing how similar the features in a and b are.
Now, I would like to calculate the soft cosine distance between a and b, but accounting for between-feature similarity. I found this formula, which should calculate what I need:
soft cosine similarity: sim(a, b) = (a · (s · b)) / (sqrt(a · (s · a)) * sqrt(b · (s · b))), with the soft cosine distance being 1 minus this similarity.
I already tried implementing it using numpy:
import numpy as np
soft_cosine = 1 - (np.dot(a,np.dot(s,b)) / (np.sqrt(np.dot(a,np.dot(s,b))) * np.sqrt(np.dot(a,np.dot(s,b)))))
It is supposed to produce a number between 0 and 1, with a higher number indicating a higher distance between a and b. However, I am running this on a larger dataframe with multiple vectors a and b, and for some it produces negative values. Clearly, I am doing something wrong.
Any help is greatly appreciated, and I am happy to clarify whatever needs clarification!
Best,
Johannes

From what I see, it may just be a formula error. Could you please try mine?
soft_cosine = a @ (s @ b) / np.sqrt((a @ (s @ a)) * (b @ (s @ b)))
I use the @ operator (which is a shorthand for np.matmul on ndarrays), as I find it cleaner to write: it's just matrix multiplication, no matter if 1D or 2D. It is a simple way to compute a dot product between two 1D arrays, with less code than the usual np.dot function.
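For reference, here is a small runnable sketch of that formula with numpy. The similarity matrix below is a made-up one (not the matrix from the question): symmetric with ones on the diagonal, which is what a proper feature-similarity matrix should look like and what keeps the resulting similarity in [0, 1].

import numpy as np

a = np.array([0, .25, .25, 0, .5])
b = np.array([.5, 0, 0, .25, .25])

# Hypothetical feature-similarity matrix: symmetric, ones on the diagonal,
# constant off-diagonal similarity of 0.3 (guaranteed positive semidefinite).
s = 0.7 * np.eye(5) + 0.3 * np.ones((5, 5))

similarity = a @ (s @ b) / np.sqrt((a @ (s @ a)) * (b @ (s @ b)))
soft_cosine_distance = 1 - similarity
print(similarity, soft_cosine_distance)   # roughly 0.69 and 0.31 for these inputs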

soft_cosine = 1 - (np.dot(a,np.dot(s,b)) / (np.sqrt(np.dot(a,np.dot(s,b))) * np.sqrt(np.dot(a,np.dot(s,b)))))
I think you need to change the denominator: it should contain an "a"-with-"a" term and a "b"-with-"b" term.
soft_cosine = 1 - (np.dot(a, np.dot(s, b)) / (np.sqrt(np.dot(a, np.dot(s, a))) * np.sqrt(np.dot(b, np.dot(s, b)))))

Related

Calculating roots of multiple polynomials in numpy without using a loop

I can use the polyfit() method with a 2D array as input to fit polynomials to multiple data sets quickly. After obtaining these polynomials, I want to calculate the roots of all of them, also in a fast manner.
There is the numpy.roots() method for finding the roots of a single polynomial, but it does not work with 2D inputs (meaning multiple polynomials). I am working with millions of polynomials, so I would like to avoid looping over all of them with a for loop, map, or comprehension, because that takes minutes. I would prefer a vectorized numpy operation or a series of vectorized operations.
An example code for inefficient calculation:
import numpy as np

POLYNOMIAL_COUNT = 1000000

# Create a polynomial of second order with coefficients 2, 3 and 4
coefficients = np.array([[2, 3, 4]])

# Let's say we have the same polynomial multiple times, represented as a 2D array.
# In reality the polynomial coefficients will be different from each other,
# but they will be of the same order.
coefficients = coefficients.repeat(POLYNOMIAL_COUNT, axis=0)

# Calculate roots of these same-order polynomials.
# Looping here takes too much time.
roots = []
for i in range(POLYNOMIAL_COUNT):
    roots.append(np.roots(coefficients[i]))
Is there a way to find the roots of multiple same-order polynomials using numpy, but without looping?
For the special case of polynomials up to the fourth order, you can solve in a vectorized manner. Anything higher than that does not have an analytical solution, so it requires iterative optimization, which is fundamentally unlikely to be vectorizable since different rows may require a different number of iterations. As @John Coleman suggests, you might be able to get away with using the same number of steps for each one, but will likely have to sacrifice accuracy to do so.
That being said, here is an example of how to vectorize the second order case:
d = coefficients[:, 1:-1]**2 - 4.0 * coefficients[:, ::2].prod(axis=1, keepdims=True)
roots = -0.5 * (coefficients[:, 1:-1] + [1, -1] * np.emath.sqrt(d)) / coefficients[:, :1]
If I got the order of the coefficients wrong, replace coefficients[:, :1] with coefficients[:, -1:] in the denominator of the last assignment. Using np.emath.sqrt is nice because it will return a complex128 result automatically when your discriminant d is negative anywhere, and normal float64 result for all real roots.
You can implement a third order solution or a fourth order solution in a similar manner.
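For concreteness, here is a small sketch applying that second-order expression and cross-checking one row against np.roots; the example coefficients are placeholders rather than the original data.

import numpy as np

# Rows are [a, b, c] for a*x**2 + b*x + c, highest power first (np.roots order)
coefficients = np.array([[2.0, 3.0, 4.0],
                         [1.0, -3.0, 2.0]])

d = coefficients[:, 1:-1]**2 - 4.0 * coefficients[:, ::2].prod(axis=1, keepdims=True)
roots = -0.5 * (coefficients[:, 1:-1] + [1, -1] * np.emath.sqrt(d)) / coefficients[:, :1]

print(roots[0])                    # vectorized result for the first polynomial
print(np.roots(coefficients[0]))   # reference (ordering of the two roots may differ)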

How to find A in a matrix multiplication Ax=b, with some values of A known, and A being left stochastic

I have been trying to find an answer to this problem for a couple of hours now, but I can't find anything so far...
So I have two vectors, let's call them b and x, of which I know all values. They add up to the same amount, so sum(b) = sum(x).
I also have a matrix, let's call it A, of which I know which values are 0; all the other values are unknown (but different from 0).
Furthermore, the elements of each column of A sum to 1 (I think that makes A a left stochastic matrix).
Generally the equation can be written in the form A*x = b.
Now I'm trying to find the missing values of A.
I have found one answer to the general problem here: https://math.stackexchange.com/questions/1170843/solving-ax-b-when-x-and-b-are-given
Furthermore, I looked at the documentation of numpy.linalg: https://docs.scipy.org/doc/numpy/reference/routines.linalg.html, but I just can't figure out how to do it.
It looks similar to a multiple linear regression problem, but I couldn't find anything for it in sklearn either: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression
Not a complete answer, but a bit of a more formal statement of the problem.
I think this can be solved as just a system of linear equations. Let
NZ = {(i,j)|a(i,j) is not fixed to zero}
Then write:
sum( j | (i,j) ∈ NZ, a(i,j) * x(j) ) = b(i) ∀i
sum( i | (i,j) ∈ NZ, a(i,j)) = 1 ∀j
This is just a system of linear equations in a(i,j). It may be under- (or over-) determined, and it may be sparse; I think how best to solve it depends a bit on that. It may be possible to treat these as constraints in a linear (or quadratic) programming problem. That would allow you to add an objective in the case of an underdetermined system, or, for an overdetermined one, to minimize the sum of squared deviations or the 1-norm of the deviations. In addition, we can add bounds on a(i,j) (e.g. lower bounds of zero and upper bounds of one). So a linear programming approach may be what you are looking for.
This problem looks a bit like matrix balancing. This is used a lot for economic data sets that come from different sources and where we want to reconcile the data to get a consistent data set usable for subsequent modeling.
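Not the poster's code, but a minimal dense sketch of that system-of-equations idea, assuming small dimensions and a boolean mask marking where A may be nonzero; scipy's lsq_linear handles both the bounds and the least-squares case.

import numpy as np
from scipy.optimize import lsq_linear

def solve_for_A(x, b, mask):
    """Recover A in A @ x = b, with column sums of 1 and zeros where mask is False."""
    n, m = mask.shape                 # A is n x m
    positions = np.argwhere(mask)     # the unknown entries a(i, j)

    M = np.zeros((n + m, len(positions)))
    rhs = np.concatenate([b, np.ones(m)])
    for k, (i, j) in enumerate(positions):
        M[i, k] = x[j]                # row equation i: sum_j a(i,j) * x(j) = b(i)
        M[n + j, k] = 1.0             # column equation j: sum_i a(i,j) = 1

    # Bound the unknowns to [0, 1]; least squares if over- or underdetermined
    result = lsq_linear(M, rhs, bounds=(0.0, 1.0))
    A = np.zeros((n, m))
    A[mask] = result.x
    return A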

Calculate the astronomical distance between two sets of sky positions with theano

I want to compute the angular distance between all points in two different sets, something like cdist of scipy but with a different distance algorithm and using theano. The angular distance between two sources with right ascension (ra) in (0,2pi) and with declination (dec) in (-pi/2, pi/2) is:
theta = arccos(sin(dec1)*sin(dec2)+cos(dec1)*cos(dec2)*cos(ra1-ra2))
Suppose that X is a matrix consisting of N sources with their positions (ra, dec):
#RA DEC
54.29 -35.19
54.62 -35.45
...
and W is another set of M different sources. How can I determine the angular separation of every source in X from every source in W?
Inspired by the Euclidean distance:
edist = T.sqrt((X ** 2).sum(1).reshape((X.shape[0], 1)) + (W ** 2).sum(1).reshape((1, W.shape[0])) - 2 * X.dot(W.T))
I have tried with:
d = T.arccos(
    T.sin(X.reshape((X.shape[0], 1, -1))[..., 1]) * T.sin(W.reshape((1, W.shape[0], -1))[..., 1]) +
    T.cos(X.reshape((X.shape[0], 1, -1))[..., 1]) * T.cos(W.reshape((1, W.shape[0], -1))[..., 1]) *
    T.cos(X.reshape((X.shape[0], 1, -1))[..., 0] - W.reshape((1, W.shape[0], -1))[..., 0]))
The resulting d matrix has shape (N, M) instead of (N, M, 2), since I expected to sum over the third axis; furthermore, the numerical result is wrong (I have compared it with TOPCAT, which is an astronomy-oriented piece of software). Any suggestions?
You need to debug your expression by parts - calculate sin(dec1) first and make sure you get the right shape and the right numerical result. Then the multiplication with sin(dec2) and so on until you get the full arccos expression.
One idea of something that is possibly wrong with your code is the use of * for multiplication - if you want to multiply matrices you should use T.multiply() instead of *.
I have resolved the issue: I simply had to convert right ascension and declination from degrees to radians. Now the method works.
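For anyone comparing, here is a small plain-numpy sketch of the same broadcasting idea, including the degree-to-radian conversion that fixed the problem; the two coordinate arrays are made-up examples, not the original data.

import numpy as np

X = np.radians([[54.29, -35.19],
                [54.62, -35.45]])         # (N, 2): ra, dec, converted to radians
W = np.radians([[54.29, -35.19],
                [10.00,  20.00]])         # (M, 2)

ra1, dec1 = X[:, None, 0], X[:, None, 1]  # shape (N, 1)
ra2, dec2 = W[None, :, 0], W[None, :, 1]  # shape (1, M)

cos_theta = np.sin(dec1) * np.sin(dec2) + np.cos(dec1) * np.cos(dec2) * np.cos(ra1 - ra2)
theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))  # clip guards against rounding past 1
print(np.degrees(theta))                  # (N, M) angular separations in degrees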

Summation of Trig Functions in Python

Can you do a summation of a trig function in python? For example, sum of cos(2x) over two iterations. Thank you!
import numpy as np

def function_name(phi1, phi2, distance):
    """
    phi1     - array of first angles [radians]
    phi2     - array of second angles [radians]
    distance - array of all distances
    Note: all inputs must be the same length and must be NumPy arrays!
    """
    sig_phi = 1
    # Set up the exponent array and get the squared distances
    exponents = np.ones_like(distance) * 2
    dist_sq = np.power(distance, exponents)
    mat1 = np.array([[np.sum((1 + np.cos(2 * phi1)) / (2 * dist_sq)),
                      -np.sum(np.sin(2 * phi1) / (2 * dist_sq))],
                     [np.sum((1 - np.cos(2 * phi1)) / (2 * dist_sq)),
                      -np.sum(np.sin(2 * phi1) / (2 * dist_sq))]])
    mat1 *= 1 / sig_phi**2
    mat2 = np.array([[np.sum((1 + np.cos(2 * phi2)) / (2 * dist_sq)),
                      -np.sum(np.sin(2 * phi2) / (2 * dist_sq))],
                     [np.sum((1 - np.cos(2 * phi2)) / (2 * dist_sq)),
                      -np.sum(np.sin(2 * phi2) / (2 * dist_sq))]])
    mat2 *= 1 / sig_phi**2
    return mat1, mat2

# Example angle inputs; distance is deliberately left as a variable
phi_1 = np.array([0.698132, 0.872665])
phi_2 = np.array([0.872665, 0.698132])
I want to keep the distance as a variable- so I can see what distance maximizes the determinant of the matrix. I am not getting any results when I run the code in my terminal. Does anyone know what I should do?
I'm not sure I fully understand the question, but certainly each individual piece is quite simple to accomplish in Python.
By using the built-in sum() function together with the trigonometric functions provided in the math module, it is relatively simple to accomplish. For the example you give,
from math import *
iters = [1, 2]
summation = sum(cos(2*x) for x in iters)
should do the trick.
Also, I'd recommend being more thorough in your questions in the future. It is frowned upon to ask questions with no evidence of having attempted the problem already.
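Since the surrounding code is numpy-based, the same summation can also be written with arrays; a one-line sketch with placeholder iteration values:

import numpy as np

x = np.array([1, 2])              # the two iteration values
summation = np.sum(np.cos(2 * x))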

pseudo inverse of sparse matrix in python

I am working with data from neuroimaging and because of the large amount of data, I would like to use sparse matrices for my code (scipy.sparse.lil_matrix or csr_matrix).
In particular, I will need to compute the pseudo-inverse of my matrix to solve a least-square problem.
I have found the method scipy.sparse.linalg.lsqr, but it is not very efficient. Is there a method to compute the Moore-Penrose pseudo-inverse (the equivalent of pinv for dense matrices)?
The size of my matrix A is about 600,000 x 2,000, and every row of the matrix has from 0 up to 4 nonzero values. The size of A is given by voxels x fiber bundles (white matter fiber tracts), and we expect at most 4 tracts to cross in a voxel. In most white matter voxels we expect at least 1 tract, but around 20% of the rows could be all zeros.
The vector b is not sparse; b contains the measurement for each voxel, which is in general nonzero.
I would need to minimize the error, but there are also some conditions on the vector x. As I tried the model on smaller matrices, I never needed to constrain the system in order to satisfy these conditions (in general 0
Is that of any help? Is there a way to avoid taking the pseudo-inverse of A?
Thanks
Update 1st June:
Thanks again for the help.
I can't really show you anything about my data, because the code in Python gives me some problems. However, in order to understand how I could choose a good k, I've tried to create a testing function in MATLAB.
The code is as follows:
F = zeros(100000, 1000);
for k = 1:150000
    p = rand(1);
    a = 0;
    b = 0;
    while a <= 0 || b <= 0
        a = random('Binomial', 100000, p);
        b = random('Binomial', 1000, p);
    end
    F(a, b) = rand(1);
end
solution = repmat([0.5, 0.5, 0.8, 0.7, 0.9, 0.4, 0.7, 0.7, 0.9, 0.6], 1, 100);
size(solution)
solution = solution';
measure = F * solution;
% check = pinvF * measure;
k = 250;
F = sparse(F);
[U, S, V] = svds(F, k);
s = svds(F, k);
plot(s)
max(max(U*S*V' - F))
for s = 1:k
    if S(s, s) ~= 0
        S(s, s) = 1 / S(s, s);
    end
end
inv = V * S' * U';
inv * measure
max(inv*measure - solution)
Do you have any idea of what k should be compared to the size of F? I've taken 250 (out of 1000) and the results are not satisfactory (the waiting time is acceptable, but not short).
Here I can also compare the results with the known solution, but how could one choose k in general?
I also attached the plot of the 250 singular values that I get and their normalized squares. I don't know exactly how to do a better scree plot in MATLAB. I'm now proceeding with a bigger k to see if the values suddenly become much smaller.
Thanks again,
Jennifer
You could look further into the alternatives offered in scipy.sparse.linalg.
Anyway, please note that a pseudo-inverse of a sparse matrix is most likely to be a (very) dense one, so it's not really a fruitful avenue (in general) to follow, when solving sparse linear systems.
You may like to describe your particular problem (dot(A, x) = b + e) in a slightly more detailed manner. At least specify:
the 'typical' size of A
the 'typical' percentage of nonzero entries in A
whether your main interest is in x_hat or in b_hat (least squares implies that norm(e) is minimized, where e = b - b_hat and b_hat = dot(A, x_hat))
Update: If you have some idea of the rank of A (and it is much smaller than the number of columns), you could try the total least squares method. Here is a simple implementation, where k is the number of leading singular values and vectors to use (i.e. the 'effective' rank).
from scipy.sparse import hstack
from scipy.sparse.linalg import svds

def tls(A, b, k=6):
    """A TLS solution of Ax = b, for sparse A."""
    u, s, v = svds(hstack([A, b]), k)
    return v[-1, :-1] / -v[-1, -1]
Regardless of the answer to my comment, I would think you could accomplish this fairly easily using the Moore-Penrose SVD representation. Find the SVD with scipy.sparse.linalg.svds, replace Sigma by its pseudoinverse, and then multiply V*Sigma_pi*U' to find the pseudoinverse of your original matrix.
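A minimal sketch of that truncated-SVD approach, assuming a random sparse matrix of placeholder size and a hand-picked k; it applies the pseudo-inverse to b without ever forming the (dense) pseudo-inverse explicitly.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

A = sp.random(1000, 200, density=0.01, format='csr')  # placeholder sparse matrix
b = np.random.rand(1000)

k = 50                                 # number of singular values/vectors to keep
u, s, vt = svds(A, k=k)                # u: (1000, k), s: (k,), vt: (k, 200)
s_inv = np.where(s > 1e-12, 1.0 / s, 0.0)

# x_hat = V * Sigma_pinv * U' * b, applied right-to-left so only vectors are formed
x_hat = vt.T @ (s_inv * (u.T @ b))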
