Computing pairwise accuracy/comparison between many arrays - python

Let's say I have several arrays, where each array is the same length. I am working with binary-valued (values are 0 or 1) arrays which might simplify the problem, so it's okay if the proposed solution makes use of this property.
I want to compute pairwise accuracies between each pair of arrays, where accuracy can be thought of as the proportion of times the elements in two arrays are equal. So here is a simple example where I am using a list of lists format. Let's say A = [[1,1,1], [0,1,0], [1,1,0]]. We would want to output:
1. , 1/3, 2/3
1/3, 1., 2/3
2/3, 2/3, 1.
I can compute this using multiple loops (iterating over each pair of arrays, and over each index). However, are there built-in functionalities or a library (e.g. numpy) that can help do this more cleanly and efficiently?

You can use broadcasting:
import numpy as np
A = np.array([[1,1,1], [0,1,0], [1,1,0]])
output = A[:,None,:] == A[None,:,:]  # pairwise elementwise comparison, shape (3, 3, 3)
output = output.sum(axis=2) / 3      # fraction of matching positions (3 = row length)
print(output)
# [[1. 0.33333333 0.66666667]
# [0.33333333 1. 0.66666667]
# [0.66666667 0.66666667 1. ]]
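Equivalently, the mean over the last axis gives the same result without hard-coding the row length (a small variation on the same broadcasting idea):
output = (A[:,None,:] == A[None,:,:]).mean(axis=2)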

I'd suggest
A = np.array(A)
1 - np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1, ord=1) / A.shape[1]
which leverages NumPy's linalg.norm.
Pairwise accuracy here seemingly refers to the relative number of coinciding elements between two vectors. In that case, you compute
1 - HammingDistance(v1, v2) / len(v2)
where the Hamming distance counts the (absolute) number of indices at which the values differ. The 1-norm (ord=1) emulates this for binary vectors.
However, if you'd prefer to leverage the binary structure of your vectors without invoking NumPy's linear algebra, using only its broadcasting capability,
A = np.array(A)
1 - (A[:, None, :] != A).sum(2) / A.shape[1]
will do equally well.
Naturally, both code snippets require the lists (i.e. vectors) to have the same length. Measuring distance (and, in turn, similarity) in a mathematically rigorous way is non-trivial anyway when that is not the case.
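If SciPy is available, scipy.spatial.distance.pdist with the 'hamming' metric already computes the fraction of differing positions per pair, so the accuracy matrix is one subtraction away (a sketch, not part of the answers above):
import numpy as np
from scipy.spatial.distance import pdist, squareform
A = np.array([[1,1,1], [0,1,0], [1,1,0]])
accuracy = 1 - squareform(pdist(A, metric='hamming'))  # squareform's diagonal is 0, so it becomes 1
print(accuracy)
# [[1.         0.33333333 0.66666667]
#  [0.33333333 1.         0.66666667]
#  [0.66666667 0.66666667 1.        ]]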

Related

Numpy operation of arrays with different size

c=np.array([ 0. , 0.2, 0.22, 0.89, 0.99])
rnd = np.random.uniform(low=0.00, high=1.00, size=12)
I want to see how many elements in c are smaller than each of the 12 random numbers in rnd. It needs to be in numpy and without the use of any lists so that it's faster.
The output will be an array of 12 elements, each giving how many elements in c are smaller than the corresponding number in rnd.
You can use broadcasting: extend c from 1D to a 2D array version with None/np.newaxis, perform the comparisons against all elements in a vectorized manner, and then count by summing along rows with .sum(0), like so -
(c[:,None] < rnd).sum(0)
It seems you can also use the efficient np.searchsorted (this relies on c being sorted, as it is here), like so -
np.searchsorted(c,rnd)
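Both approaches should agree; here is a minimal check (the fixed seed is only there to make the example reproducible, it is not part of the original answer):
import numpy as np
c = np.array([0., 0.2, 0.22, 0.89, 0.99])
rnd = np.random.default_rng(0).uniform(low=0.0, high=1.0, size=12)
counts_broadcast = (c[:, None] < rnd).sum(0)   # broadcasting + counting
counts_sorted = np.searchsorted(c, rnd)        # binary search into the sorted c
print(np.array_equal(counts_broadcast, counts_sorted))  # True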

Theano sqrt returning NaN values

In my code I'm using theano to calculate a Euclidean distance matrix (code from here):
import theano
import theano.tensor as T
MAT = T.fmatrix('MAT')
squared_euclidean_distances = (MAT ** 2).sum(1).reshape((MAT.shape[0], 1)) + (MAT ** 2).sum(1).reshape((1, MAT.shape[0])) - 2 * MAT.dot(MAT.T)
f_euclidean = theano.function([MAT], T.sqrt(squared_euclidean_distances))
def pdist_euclidean(mat):
    return f_euclidean(mat)
But this code causes some values of the matrix to be NaN. I've read that this happens when calculating theano.tensor.sqrt(), and here it's suggested to
Add an eps inside the sqrt (or max(x,EPs))
So I've added an eps to my code:
import theano
import theano.tensor as T
eps = 1e-9
MAT = T.fmatrix('MAT')
squared_euclidean_distances = (MAT ** 2).sum(1).reshape((MAT.shape[0], 1)) + (MAT ** 2).sum(1).reshape((1, MAT.shape[0])) - 2 * MAT.dot(MAT.T)
f_euclidean = theano.function([MAT], T.sqrt(eps+squared_euclidean_distances))
def pdist_euclidean(mat):
    return f_euclidean(mat)
And I'm adding it before performing sqrt. I'm getting fewer NaNs, but I'm still getting them. What is the proper solution to the problem? I've also noticed that there are no NaNs if MAT is T.dmatrix().
There are two likely sources of NaNs when computing Euclidean distances.
1. Floating point representation approximation issues causing negative distances when it's really just zero. The square root of a negative number is undefined (assuming you're not interested in the complex solution).
Imagine MAT has the value
[[ 1.62434536 -0.61175641 -0.52817175 -1.07296862 0.86540763]
[-2.3015387 1.74481176 -0.7612069 0.3190391 -0.24937038]
[ 1.46210794 -2.06014071 -0.3224172 -0.38405435 1.13376944]
[-1.09989127 -0.17242821 -0.87785842 0.04221375 0.58281521]]
Now, if we break down the computation we see that (MAT ** 2).sum(1).reshape((MAT.shape[0], 1)) + (MAT ** 2).sum(1).reshape((1, MAT.shape[0])) has value
[[ 10.3838024 -9.92394296 10.39763039 -1.51676099]
[ -9.92394296 18.16971188 -14.23897281 5.53390084]
[ 10.39763039 -14.23897281 15.83764622 -0.65066204]
[ -1.51676099 5.53390084 -0.65066204 4.70316652]]
and 2 * MAT.dot(MAT.T) has value
[[ 10.3838024 14.27675714 13.11072431 7.54348446]
[ 14.27675714 18.16971188 17.00367905 11.4364392 ]
[ 13.11072431 17.00367905 15.83764622 10.27040637]
[ 7.54348446 11.4364392 10.27040637 4.70316652]]
The diagonal of these two values should be equal (the distance between a vector and itself is zero) and from this textual representation it looks like that is true, but in fact they are slightly different -- the differences are too small to show up when we print the floating point values like this
This becomes apparent when we print the value of the full expression (the second of the matrices above subtracted from the first)
[[ 0.00000000e+00 2.42007001e+01 2.71309392e+00 9.06024545e+00]
[ 2.42007001e+01 -7.10542736e-15 3.12426519e+01 5.90253836e+00]
[ 2.71309392e+00 3.12426519e+01 0.00000000e+00 1.09210684e+01]
[ 9.06024545e+00 5.90253836e+00 1.09210684e+01 0.00000000e+00]]
The diagonal is almost composed of zeros but the item in the second row, second column is now a very small negative value. When you then compute the square root of all these values you get NaN in that position because the square root of a negative number is undefined (for real numbers).
[[ 0. 4.91942071 1.64714721 3.01002416]
[ 4.91942071 nan 5.58951267 2.42951402]
[ 1.64714721 5.58951267 0. 3.30470398]
[ 3.01002416 2.42951402 3.30470398 0. ]]
2. Computing the gradient of a Euclidean distance expression with respect to a variable inside the input to the function. This can happen not only if a negative number is generated due to floating point approximations, as above, but also if any of the inputs are zero length.
If y = sqrt(x) then dy/dx = 1/(2 * sqrt(x)). So if x=0 or, for your purposes, if squared_euclidean_distances=0 then the gradient will be NaN because 2 * sqrt(0) = 0 and dividing by zero is undefined.
The first problem can be solved by clamping the squared distances at zero so they are never negative:
T.sqrt(T.maximum(squared_euclidean_distances, 0.))
To solve both problems (if you need gradients), make sure the squared distances are never negative or zero by bounding them with a small positive epsilon:
T.sqrt(T.maximum(squared_euclidean_distances, eps))
The first solution makes sense since the problem only arises from approximate representations. The second is a bit more questionable because the true distance is zero so, in a sense, the gradient should be undefined. Your specific use case may yield some alternative solution that maintains the semantics without an artificial bound (e.g. by ensuring that gradients are never computed/used for zero-length vectors). But NaN values can be pernicious: they can spread like weeds.
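Putting it together, a sketch of the corrected function (use 0. instead of eps if you do not need gradients through zero distances):
import theano
import theano.tensor as T
eps = 1e-9
MAT = T.fmatrix('MAT')
sq_norms = (MAT ** 2).sum(1)
squared_euclidean_distances = (sq_norms.reshape((MAT.shape[0], 1))
                               + sq_norms.reshape((1, MAT.shape[0]))
                               - 2 * MAT.dot(MAT.T))
f_euclidean = theano.function([MAT], T.sqrt(T.maximum(squared_euclidean_distances, eps)))
def pdist_euclidean(mat):
    return f_euclidean(mat)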
Just checking
In squared_euclidean_distances you're adding a column, a row, and a matrix. Are you sure this is what you want?
More precisely, if MAT is of shape (n, p), you're adding matrices of shapes (n, 1), (1, n) and (n, n).
Theano seems to silently repeat the rows (resp. the columns) of each one-dimensional member to match the number of rows and columns of the dot product.
If this is what you want
In reshape, you should probably specify ndim=2, per the Theano documentation (basic tensor functionality: reshape):
If the shape is a Variable argument, then you might need to use the optional ndim parameter to declare how many elements the shape has, and therefore how many dimensions the reshaped Variable will have.
Also, it seems that squared_euclidean_distances should always be positive, unless imprecision errors in the difference change zero values into small negative values. If this is true, and if negative values are responsible for the NaNs you're seeing, you could indeed get rid of them without corrupting your result by surrounding squared_euclidean_distances with abs(...).
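For instance, the last line of the question's function would become (a sketch of that suggestion, reusing the variables defined in the question's code):
f_euclidean = theano.function([MAT], T.sqrt(abs(squared_euclidean_distances)))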

Replace elements in sparse matrix created by Scipy (Python)

I have a huge sparse matrix in Scipy and I would like to replace numerous elements inside it with a given value (let's say -1).
Is there a more efficient way to do it than using:
SM[[rows],[columns]]=-1
Here is an example:
Nr=seg.shape[0] #size ~=50000
Im1=sparse.csr_matrix(np.append(np.array([-1]),np.zeros([1,Nr-1])))
Im1=sparse.csr_matrix(sparse.vstack([Im1,sparse.eye(Nr)]))
Im1[prev[1::]-1,Num[1::]-1]=-1 # this line is very slow
Im2=sparse.vstack([sparse.csr_matrix(np.zeros([1,Nr])),sparse.eye(Nr)])
IM=sparse.hstack([Im1,Im2]) #final result
I've played around with your sparse arrays. I'd encourage you to do some timings on smaller sizes, to see how different methods and sparse types behave. I like to use timeit in IPython.
Nr=10 # seg.shape[0] #size ~=50000
Im2=sparse.vstack([sparse.csr_matrix(np.zeros([1,Nr])),sparse.eye(Nr)])
Im2 has a zero first row and an offset diagonal in the rest. So it's simpler, though not much faster, to start with an empty sparse matrix:
X = sparse.vstack([sparse.csr_matrix((1,Nr)),sparse.eye(Nr)])
Or use diags to construct the offset diagonal directly:
X = sparse.diags([1],[-1],shape=(Nr+1, Nr))
Im1 is similar, except it has a -1 in the (0,0) slot. How about stacking 2 diagonal matrices?
X = sparse.vstack([sparse.diags([-1],[0],(1,Nr)),sparse.eye(Nr)])
Or make the offset diagonal (copy Im2?), and modify [0,0]. A csr matrix gives an efficiency warning, recommending the use of lil format. It does, though, take some time to convert with tolil().
X = sparse.diags([1],[-1],shape=(Nr+1, Nr)).tolil()
X[0,0] = -1 # slow warning with csr
Let's try your larger insertions:
prev = np.arange(Nr-2) # what are these like?
Num = np.arange(Nr-2)
Im1[prev[1::]-1,Num[1::]-1]=-1
With Nr=10, and various Im1 formats:
lil - 267 us
csr - 1.44 ms
coo - not supported
todense - 25 us
OK, I've picked prev and Num such that I end up modifying diagonals of Im1. In this case it would be faster to construct those diagonals right from the start.
X2=Im1.todia()
print X2.data
[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[-1. -1. -1. -1. -1. -1. -1. 0. 0. 0.]]
print X2.offsets
[-1 0]
You may have to learn how various sparse formats are stored. csr and csc are a bit complex, designed for fast linear algebra operations. lil, dia, coo are simpler to understand.
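Given the diagonal structure printed above, here is a sketch of building Im1 in one step with sparse.diags instead of assigning into a csr matrix afterwards (the diagonal values simply mirror X2.data for Nr=10):
import numpy as np
from scipy import sparse
Nr = 10
main_diag = np.r_[-np.ones(7), np.zeros(3)]   # offset 0, as in X2.data above
sub_diag = np.ones(Nr)                        # offset -1
X = sparse.diags([main_diag, sub_diag], [0, -1], shape=(Nr + 1, Nr), format='csr')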

Vectorized Portfolio Risk

I have N pairs of portfolio weights stored in a numpy array and would like to calculate portfolio risk, which is w * E * w^T where w^T is the transposed weight vector. The way I came up with is to loop through each weight pair and apply the matrix multiplication. Is there a vectorized approach to this, such that given a weight pair (or, if possible, N weights that all sum to 1) I apply a single covariance matrix to each row to get the risk (i.e. without a loop)?
import numpy as np
w = np.array([[0.2,0.8],[0.5,0.5]])
covar = np.array([0.000046,0.000017,0.000017,0.000032]).reshape([2,2])
w1 = w[0].reshape([1,2]) # each row in w
#portfolio risk
np.dot(np.dot(w1,covar),w1.T)
@Adam's answer is valid, but for big arrays it can result in very big temporary arrays (NxN) and unnecessary computations (computing the off-diagonal elements).
Here's a similar, yet much more efficient solution:
(I added another weight-pair, to distinguish between the different dimensions of the problem)
w = np.array([[0.2,0.8],[0.5,0.5], [0.33, 0.67]])
covar = np.array([0.000046,0.000017,0.000017,0.000032]).reshape([2,2])
(np.dot(w, covar) * w).sum(axis=-1)
=> array([ 2.77600000e-05, 2.80000000e-05, 2.68916000e-05])
By using plain multiplication in the second step, I'm avoiding the unnecessary computations of the off-diagonals.
EDIT: explaining the temporary arrays
# first multiplication (in both solutions)
np.dot(w, covar).shape
(3, 2)
# second, my solution
(np.dot(w, covar) * w).shape
(3, 2)
# second, Adam's solution
np.dot(np.dot(w,covar),w.T).shape
(3, 3)
Now, if you have N sets of weights you want to compute risk for (in this example N=3), and M instruments in your portfolio (here M=2), and N>>M, you get an array which is much bigger with Adam's solution (NxN). Not only will it consume more memory, the computations populating the off-diagonal elements are expensive (matrix multiplication) and unnecessary.
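The same per-row risks can also be written as a single np.einsum contraction, which likewise avoids the NxN intermediate (a sketch, equivalent to the solution above):
import numpy as np
w = np.array([[0.2,0.8],[0.5,0.5], [0.33, 0.67]])
covar = np.array([0.000046,0.000017,0.000017,0.000032]).reshape([2,2])
risks = np.einsum('ij,jk,ik->i', w, covar, w)   # sum_j sum_k w[i,j] * covar[j,k] * w[i,k]
print(risks)  # matches (np.dot(w, covar) * w).sum(axis=-1)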
It seems like your code is already set up for a vectorized approach, but you are only dealing with one row at a time. Grabbing the diagonals from the result when using your full weight matrix should give you what you want.
# portfolio risk
np.diagonal(np.dot(np.dot(w,covar),w.T))

Matrix multiplication with Numpy

Assume that I have an affinity matrix A and a diagonal matrix D. How can I compute the Laplacian matrix in Python with numpy?
L = D^(-1/2) A D^(1/2)
Currently, I use L = D**(-1/2) * A * D**(1/2). Is this the right way?
Thank you.
Please note that it is recommended to use numpy's array instead of matrix: see this paragraph in the user guide. The confusion in some of the responses is an example of what can go wrong... In particular, D**0.5 and the products are elementwise if applied to numpy arrays, which would give you a wrong answer. For example:
import numpy as np
from numpy import dot, diag
D = diag([1., 2., 3.])
print D**(-0.5)
[[ 1. Inf Inf]
[ Inf 0.70710678 Inf]
[ Inf Inf 0.57735027]]
In your case, the matrix is diagonal, and so the square root of the matrix is just another diagonal matrix with the square root of the diagonal elements. Using numpy arrays, the equation becomes
D = np.array([1., 2., 3.]) # note that we define D just by its diagonal elements
A = np.cov(np.random.randn(3,100)) # a random symmetric positive definite matrix
L = dot(diag(D**(-0.5)), dot(A, diag(D**0.5)))
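As a quick sanity check (a sketch, assuming SciPy is available and using an arbitrary symmetric example matrix), the diag-based construction above agrees with a true matrix power of the full diagonal matrix:
import numpy as np
from scipy.linalg import fractional_matrix_power
D_full = np.diag([1., 2., 3.])                             # the same D as a full matrix
A = np.array([[2., 1., 0.], [1., 2., 1.], [0., 1., 2.]])   # arbitrary symmetric example
L_ref = np.dot(np.dot(fractional_matrix_power(D_full, -0.5), A), fractional_matrix_power(D_full, 0.5))
L = np.dot(np.dot(np.diag(np.diag(D_full) ** -0.5), A), np.diag(np.diag(D_full) ** 0.5))
print(np.allclose(L, L_ref))  # True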
Numpy allows you to exponentiate a diagonal "matrix" with positive elements and a positive exponent directly:
m = diag(range(1, 11))
print m**0.5
The result is what you expect in this case because NumPy actually applies the exponentiation to each element of the NumPy array individually.
However, it indeed does not allow you to exponentiate any NumPy matrix directly:
m = matrix([[1, 1], [1, 2]])
print m**0.5
produces the TypeError that you have observed (the exception says that the exponent must be an integer, even for matrices that can be diagonalized with positive coefficients).
So, as long as your matrix D is diagonal and your exponent is positive, you should be able to directly use your formula.
Well, the only problem I see is that if you are using Python 2.6.x (without from __future__ import division), then 1/2 will be interpreted as 0 because it will be considered integer division. You can get around this by using D**(-.5) * A * D**.5 instead. You can also force float division with 1./2 instead of 1/2.
Other than that, it looks correct to me.
Edit:
Before, I was exponentiating a numpy array, not a matrix, which works with D**.5. You can exponentiate a matrix element-wise using numpy.power. So you would just use
from numpy import power
power(D, -.5) * A * power(D, .5)
Does numpy have a square root function for matrices? Then you could do sqrt(D) instead of D**(1/2).
Maybe the formula should really be written
L = (D**(-1/2)) * A * (D**(1/2))
Based on the previous comment, this formula should work when D is a diagonal matrix (I have not had a chance to prove it).
