Vectorizing numpy calculation without a tensor dot product - python

I would like to vectorize a particular case of the following mathematical formula (from Table 2 and Appendix A of this paper) with numpy:
The case I would like to compute is the following, where the scaling factors under the square root can be ignored.
The term w_kij - w̄_ij is an n x p x p array, where n is typically much greater than p.
I implemented 2 solutions neither of which are particularly good: one involves a double loop, while the other fills the memory with unnecessary calculations very quickly.
import numpy as np

dummy_data = np.random.normal(size=(100, 5, 5))

# approach 1: a double loop
out_hack = np.zeros((5, 5))
for i in range(5):
    for j in range(5):
        out_hack[i, j] = (dummy_data.T[j, j, :] * dummy_data[:, j, i]).sum()

# approach 2: slicing a diagonal from a tensor dot product
out = np.tensordot(dummy_data.T, dummy_data, axes=1)
out = out.diagonal(0, 0, 2).diagonal(0, 0, 2)
print((out.round(6) == out_hack.round(6)).all())
>>> True
Is there a way to find middle ground between these 2 approaches?

np.einsum translates that almost literally -
np.einsum('kjj,kji->ij',dummy_data,dummy_data)
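A quick self-check (a minimal sketch reusing the question's dummy data; np.allclose is used because the two summation orders can differ in the last bits):

import numpy as np

dummy_data = np.random.normal(size=(100, 5, 5))

# double loop from the question
out_hack = np.zeros((5, 5))
for i in range(5):
    for j in range(5):
        out_hack[i, j] = (dummy_data.T[j, j, :] * dummy_data[:, j, i]).sum()

# einsum one-liner: out[i, j] = sum over k of data[k, j, j] * data[k, j, i]
out = np.einsum('kjj,kji->ij', dummy_data, dummy_data)
print(np.allclose(out, out_hack))   # expected: True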

Related

What is the fastest way to apply np.linalg.norm() (python) to each element of a 2d numpy array and a given value?

I want to compute the L2 norm between a given value x and each cell of a 2d array arr (which is currently of size 1000 x 1000). My current approach:
for k in range(1000):
    for l in range(1000):
        distance = np.linalg.norm([x - arr[k][l]], ord=2)
x and arr[k][l] are both scalars. I actually want to compute the pairwise distance of each array cell to the given value x. In the end I need 1000 x 1000 distances for 1000 x 1000 values.
Unfortunately, the approach above is a bottleneck when it comes to the time it takes to finish, which is why I am searching for a way to speed it up. I am grateful for any advice.
A reproducible example (as asked for):
arr = [[1, 2, 4, 4], [5, 6, 7, 8]]
x = 2
for k in range(2):
    for l in range(4):
        distance = np.linalg.norm([x - arr[k][l]], ord=2)
Please note, that the real arr is much bigger. This is merely a toy example.
Actually, I am not bound to use np.linalg.norm(). I simply want the l2 norm for all of these array cells with the given value x. If you know any function which is more suitable, I would be willing to try it.
You can do the following:
Subtract x from the array arr
Then compute the norm
Since each cell holds a scalar, the per-cell L2 norm reduces to the absolute value of the difference:
diff = np.asarray(arr) - x
distance = np.abs(diff)
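A quick self-check on the toy example above (a minimal sketch; the loop mirrors the question's code):

import numpy as np

arr = np.array([[1, 2, 4, 4], [5, 6, 7, 8]])
x = 2

# vectorized: one expression over the whole array
vectorized = np.abs(arr - x)

# double loop from the question, for comparison
looped = np.zeros_like(arr)
for k in range(arr.shape[0]):
    for l in range(arr.shape[1]):
        looped[k, l] = np.linalg.norm([x - arr[k, l]], ord=2)

print(np.array_equal(vectorized, looped))   # expected: True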

What causes differences in array sums along an axis for C versus F ordered arrays in numpy

I am curious if anyone can explain what exactly leads to the discrepancy in this particular handling of C versus Fortran ordered arrays in numpy. See the code below:
system:
Ubuntu 18.10
Miniconda python 3.7.1
numpy 1.15.4
import numpy as np

def test_array_sum_function(arr):
    idx = 0
    val1 = arr[idx, :].sum()
    val2 = arr.sum(axis=1)[idx]
    print('axis sums:', val1)
    print('          ', val2)
    print('    equal:', val1 == val2)
    print('total sum:', arr.sum())

n = 2_000_000
np.random.seed(42)
rnd = np.random.random(n)

print('Fortran order:')
arrF = np.zeros((2, n), order='F')
arrF[0, :] = rnd
test_array_sum_function(arrF)

print('\nC order:')
arrC = np.zeros((2, n), order='C')
arrC[0, :] = rnd
test_array_sum_function(arrC)
prints:
Fortran order:
axis sums: 999813.1414744433
           999813.1414744079
    equal: False
total sum: 999813.1414744424

C order:
axis sums: 999813.1414744433
           999813.1414744433
    equal: True
total sum: 999813.1414744433
This is almost certainly a consequence of numpy sometimes using pairwise summation and sometimes not.
Let's build a diagnostic array:
eps = (np.nextafter(1.0, 2)-1.0) / 2
1+eps+eps+eps
# 1.0
(1+eps)+(eps+eps)
# 1.0000000000000002
X = np.full((32, 32), eps)
X[0, 0] = 1
X.sum(0)[0]
# 1.0
X.sum(1)[0]
# 1.000000000000003
X[:, 0].sum()
# 1.000000000000003
This strongly suggests that 1D arrays and contiguous axes use pairwise summation while strided axes in a multidimensional array don't.
Note that to see that effect the array has to be large enough, otherwise numpy falls back to ordinary summation.
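One way to see how much accuracy pairwise summation recovers is to compare against math.fsum, which returns the correctly rounded sum (a minimal sketch using the same eps as above):

import math
import numpy as np

eps = (np.nextafter(1.0, 2) - 1.0) / 2   # 2**-53: vanishes in 1.0 + eps
x = np.full(1000, eps)
x[0] = 1.0
print(sum(x.tolist()))   # naive left-to-right: every eps is absorbed, gives 1.0
print(math.fsum(x))      # correctly rounded reference: 1.0 + 999 * 2**-53
print(x.sum())           # pairwise: much closer to the fsum reference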
Floating point math isn't necessarily associative, i.e. (a+b)+c != a+(b+c).
Since you're adding along different axes, the order of operations is different, which can affect the final result. As a simple example, consider the following matrix, whose true sum is 1:
a = np.array([[1e100, 1], [-1e100, 0]])
print(a.sum()) # returns 0, the incorrect result
af = np.asfortranarray(a)
print(af.sum()) # prints 1
(Interestingly, a.T.sum() still gives 0, as does aT = a.T; aT.sum(), so I'm not sure how exactly this is implemented in the backend.)
The C order is using the sequence of operations (left-to-right) 1e100 + 1 + (-1e100) + 0 whereas the Fortran order uses 1e100 + (-1e100) + 1 + 0. The problem is that (1e100+1) == 1e100 because floats don't have enough precision to represent that small difference, so the 1 gets lost.
In general, don't do equality testing on floating point numbers; instead, compare against a small epsilon (if abs(float1 - float2) < 0.00001, or np.isclose). If you need arbitrary float precision, use the decimal module or a fixed-point representation with ints.
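For example, the classic 0.1 + 0.2 case shows why tolerance-based comparison is the safer default:

import numpy as np

a = 0.1 + 0.2
print(a == 0.3)              # False: the two sides carry different rounding errors
print(abs(a - 0.3) < 1e-9)   # True: manual epsilon comparison
print(np.isclose(a, 0.3))    # True: numpy's tolerance-based comparison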

How to vectorize 3D Numpy arrays

I have a 3D numpy array like a = np.zeros((100, 100, 20)). I want to perform an operation over every x, y position that involves all the elements over the z axis, storing the result at the corresponding x, y position of an array like b = np.zeros((100, 100)).
Right now I'm doing it using a for loop:
import sys   # note: sys.maxint means this is Python 2 code
import math
import numpy as np

d_n = np.array([...])  # a parameter with the same shape as b
for (x, y), v in np.ndenumerate(b):
    C = a[x, y, :]
    ### calculate some_value using C
    minv = sys.maxint
    depth = -1
    for d in range(len(C)):
        e = 2.5 * float(math.pow(d_n[x, y] - d, 2)) + C[d] * 0.05
        if e < minv:
            minv = e
            depth = d
    some_value = depth
    if depth == -1:
        some_value = len(C) - 1
    ###
    b[x, y] = some_value
The problem is that this operation is much slower than operations done the pythonic way, e.g. c = b * b. (I actually profiled this function and it's around two orders of magnitude slower than ones using numpy built-in and vectorized functions, over a similar number of elements.)
How can I improve the performance of this kind of function, mapping a 3D array to a 2D one?
What is usually done in 3D images is to swap the Z axis to the first index:
>>> a = a.transpose((2,0,1))
>>> a.shape
(20, 100, 100)
And now you can easily iterate over the Z axis:
>>> for slice in a:
...     # do something with the (100, 100) slice
The slice here will be each of your 100x100 fractions of your 3D matrix. Additionally, transposing allows you to access each 2D slice directly by indexing the first axis. For example, a[10] will give you the 11th 100x100 slice.
Bonus: if you store the data contiguously, without transposing (or by converting to a contiguous array using a = np.ascontiguousarray(a.transpose((2,0,1)))), access to your 2D slices will be faster since they are mapped contiguously in memory.
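You can verify the layout point directly via the flags attribute (a minimal sketch):

import numpy as np

a = np.zeros((100, 100, 20))
at = a.transpose((2, 0, 1))
print(at.flags['C_CONTIGUOUS'])   # False: transpose returns a strided view
ac = np.ascontiguousarray(at)
print(ac.flags['C_CONTIGUOUS'])   # True: each 2D slice ac[k] is now contiguous in memory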
Obviously you want to get rid of the explicit for loop, but I think whether this is possible depends on what calculation you are doing with C. As a simple example,
a = np.zeros((100,100, 20))
a[:,:] = np.linspace(1,20,20) # example data: 1,2,3,.., 20 as "z" for every "x","y"
b = np.sum(a[:,:]**2, axis=2)
will fill the 100 by 100 array b with the sum of the squared "z" values of a, that is 1+4+9+...+400 = 2870.
If your inner calculation is sufficiently complex, and not amenable to vectorization, then your iteration structure is good, and does not contribute significantly to the calculation time:
for (x, y), v in np.ndenumerate(b):
    C = a[x, y, :]
    ...
    for d in range(len(C)):
        ...  # complex, not vectorizable calc
    ...
    b[x, y] = some_value
There doesn't appear to be any special structure in the first two dimensions, so you could just as well think of it as mapping 2D onto 1D, e.g. mapping an (N, 20) array onto an (N,) array. That doesn't speed up anything, but may help highlight the essential structure of the problem.
One step is to focus on speeding up that C to some_value calculation. There are functions like cumsum and cumprod that help you do sequential calculations on a vector. cython is also a good tool.
A different approach is to see if you can perform that internal calculation over the N values all at once. In other words, if you must iterate, it is better to do so over the smallest dimension.
In a sense this is a non-answer. But without full knowledge of how you get some_value from C and d_n, I don't think we can do more.
It looks like e can be calculated for all points at once:
e = 2.5 * float(math.pow(d_n[x,y] - d, 2)) + C[d] * 0.05
E = 2.5 * (d_n[...,None] - np.arange(a.shape[-1]))**2 + a * 0.05 # (100,100,20)
E.min(axis=-1) # smallest value along the last dimension
E.argmin(axis=-1) # index of where that min occurs
At first glance it looks like this E.argmin is the b value that you want (tweaked for some boundary conditions if needed).
I don't have realistic a and d_n arrays, but with simple test ones, this E.argmin(-1) matches your b, with a 66x speedup.
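For reference, here is a minimal self-contained version of that check (small random shapes; sys.maxint from the question replaced by np.inf so it runs on Python 3):

import numpy as np

rng = np.random.default_rng(0)
a = rng.random((10, 10, 20))
d_n = rng.random((10, 10)) * 20

# loop version, mirroring the question
b_loop = np.zeros((10, 10), dtype=int)
for (x, y), _ in np.ndenumerate(b_loop):
    C = a[x, y, :]
    minv, depth = np.inf, -1
    for d in range(len(C)):
        e = 2.5 * (d_n[x, y] - d) ** 2 + C[d] * 0.05
        if e < minv:
            minv, depth = e, d
    b_loop[x, y] = depth if depth != -1 else len(C) - 1

# vectorized version
E = 2.5 * (d_n[..., None] - np.arange(a.shape[-1])) ** 2 + a * 0.05
b_vec = E.argmin(axis=-1)

print(np.array_equal(b_loop, b_vec))   # expected: True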
How can I improve the performance of this kind of function, mapping a 3D array to a 2D one?
Many functions in Numpy are "reduction" functions*, for example sum, any, std, etc. If you supply an axis argument other than None to such a function it will reduce the dimension of the array over that axis. For your code you can use the argmin function, if you first calculate e in a vectorized way:
d = np.arange(a.shape[2])
e = 2.5 * (d_n[...,None] - d)**2 + a*0.05
b = np.argmin(e, axis=2)
The indexing with [...,None] is used to engage broadcasting. The values in e are floating point values, so it's a bit strange to compare to sys.maxint but there you go:
I, J = np.indices(b.shape)
b[e[I,J,b] >= sys.maxint] = a.shape[2] - 1
* Strictly speaking, a reduction function is of the form reduce(operator, sequence), so technically std and argmin are not reductions in that sense.

Multiplying Numpy/Scipy Sparse and Dense Matrices Efficiently

I'm working to implement the following equation:
X =(Y.T * Y + Y.T * C * Y) ^ -1
Y is an (n x f) matrix and C is an (n x n) diagonal one; n is about 300k and f will vary between 100 and 200. As part of an optimization process this equation will be used almost 100 million times, so it has to be processed really fast.
Y is initialized randomly, and C is a very sparse matrix: only a few of the 300k numbers on its diagonal are different from 0. Since Numpy's diagonal function creates dense matrices, I created C as a sparse csr matrix. But when trying to solve the first part of the equation:
r = dot(C, Y)
The computer crashes due to memory limits. I then decided to try converting Y to a csr_matrix and performing the same operation:
r = dot(C, Ysparse)
and this approach took 1.38 ms. But this solution is somewhat "tricky" since I'm using a sparse matrix to store a dense one, and I wonder how efficient it really is.
So my question is: is there some way of multiplying the sparse C and the dense Y without having to turn Y into a sparse matrix? If C could somehow be represented as a dense diagonal without consuming tons of memory, maybe that would lead to very efficient performance, but I don't know if that is possible.
I appreciate your help!
The reason the dot product runs into memory issues when computing r = dot(C, Y) is that numpy's dot function has no native support for sparse matrices. What happens is that numpy treats the sparse matrix C as a python object rather than a numpy array. If you inspect the result at a small scale, you can see the problem first-hand:
>>> from numpy import dot, array
>>> from scipy import sparse
>>> Y = array([[1,2],[3,4]])
>>> C = sparse.csr_matrix(array([[1,0], [0,2]]))
>>> dot(C,Y)
array([[ (0, 0) 1
(1, 1) 2, (0, 0) 2
(1, 1) 4],
[ (0, 0) 3
(1, 1) 6, (0, 0) 4
(1, 1) 8]], dtype=object)
Clearly the above is not the result you are interested in. Instead what you want to do is compute using scipy's sparse.csr_matrix.dot function:
r = sparse.csr_matrix.dot(C, Y)
or more compactly
r = C.dot(Y)
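A minimal sketch with the toy matrices from above, showing the dense result:

import numpy as np
from scipy import sparse

Y = np.array([[1, 2], [3, 4]])
C = sparse.csr_matrix(np.array([[1, 0], [0, 2]]))
r = C.dot(Y)   # diag(1, 2) @ Y -> [[1, 2], [6, 8]]
print(r)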
Try:
import numpy as np
from scipy import sparse

f = 100
n = 300000
Y = np.random.rand(n, f)
Cdiag = np.random.rand(n)            # diagonal of C
Cdiag[np.random.rand(n) < 0.99] = 0  # keep only ~1% of the entries

# Compute Y.T * C * Y, skipping zero elements
mask = np.flatnonzero(Cdiag)
Cskip = Cdiag[mask]

def ytcy_fast(Y):
    Yskip = Y[mask, :]
    CY = Cskip[:, None] * Yskip      # broadcasting applies the diagonal
    return Yskip.T.dot(CY)

%timeit ytcy_fast(Y)

# For comparison: all-sparse matrices
C_sparse = sparse.spdiags([Cdiag], [0], n, n)
Y_sparse = sparse.csr_matrix(Y)
%timeit Y_sparse.T.dot(C_sparse * Y_sparse)
My timings:
In [59]: %timeit ytcy_fast(Y)
100 loops, best of 3: 16.1 ms per loop
In [18]: %timeit Y_sparse.T.dot(C_sparse * Y_sparse)
1 loops, best of 3: 282 ms per loop
First, are you really sure you need to perform a full matrix inversion in your problem? Most of the time, one only really needs to compute x = A^-1 y, which is a much easier problem to solve.
If you really do need the inverse, I would consider computing an approximation of it instead of the full matrix inversion, since exact inversion is really costly. See for example the Lanczos algorithm for an efficient approximation of the inverse matrix. The approximation can be stored sparsely as a bonus. Plus, it requires only matrix-vector operations, so you don't even have to store the full matrix to invert.
As an alternative, using pyoperators, you can also use the .todense method to compute the matrix to invert using efficient matrix-vector operations. There is a special sparse container for diagonal matrices.
For an implementation of the Lanczos algorithm, you can have a look at pyoperators (disclaimer: I am one of the coauthors of this piece of software).
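A minimal sketch of that first point, under illustrative assumptions (the names n, f, Cdiag, v and the reduced shapes are not from the original post): since A = Y.T*Y + Y.T*C*Y is only f x f, you can form it cheaply with broadcasting and solve a linear system instead of ever computing A^-1.

import numpy as np

n, f = 3000, 50                    # scaled down from the 300k x 100-200 case
Y = np.random.rand(n, f)
Cdiag = np.zeros(n)
nz = np.random.rand(n) < 0.01      # only a few nonzero diagonal entries
Cdiag[nz] = np.random.rand(nz.sum())

A = Y.T @ Y + (Y.T * Cdiag) @ Y    # broadcasting applies C's diagonal; A is f x f
v = np.random.rand(f)
x = np.linalg.solve(A, v)          # x = A^-1 v, without forming the inverse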
I don't know if it was possible when the question was asked, but nowadays broadcasting is your friend. An n*n diagonal matrix need only be an array of the diagonal elements to be used in a matrix product:
>>> n, f = 5, 3
>>> Y = np.random.randint(0, 10, (n, f))
>>> C = np.random.randint(0, 10, (n,))
>>> Y.shape
(5, 3)
>>> C.shape
(5,)
>>> np.all(Y.T @ np.diag(C) @ Y == Y.T*C @ Y)
True
Do note that Y.T*C @ Y is non-associative:
>>> Y.T*(C @ Y)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes (3,5) (3,)
But Y.T @ (C[:, np.newaxis]*Y) would yield the expected result:
>>> np.all(Y.T*C @ Y == Y.T @ (C[:, np.newaxis]*Y))
True

Normalize Numpy Upper-triangular subarray

I have a 4-dimensional array whose upper triangle (over the first two axes) I populate. It is initialized as
N, Q = (99, 23)
bivariate = np.zeros((N,N,Q,Q))
and then populated by something like
for i in range(N):
    for j in range(i+1, N):
        bivariate[i, j] = num
I want the upper-triangular elements to be normalized (Q,Q) matrices. I am currently doing this by just doing a
bivariate /= bivariate.sum(axis=3).sum(axis=2)[:,:,np.newaxis,np.newaxis]
but I get RuntimeWarnings because the all-zero (Q, Q) blocks of the lower-triangular portion get divided by zero sums. Is there a better way to do this than the following?
for i in range(N):
    for j in range(i+1, N):
        bivariate[i, j] /= bivariate[i, j].sum()
Thanks.
If you're concerned about getting np.nan, you could replace the zero entries of your normalization factor with 1:
norm = bivariate.sum(axis=3).sum(axis=2)[:, :, None, None]
bivariate /= np.where(norm, norm, 1)
At least you'll avoid the for loops...
FWIW, I've found that it's much easier to work on the upper-triangular portion separately and then insert it back in:
triu = np.triu_indices(N, 1)
upper_tri = bivariate[triu].reshape(-1, Q*Q)
upper_tri /= upper_tri.sum(axis=1, keepdims=True)
bivariate[triu] = upper_tri.reshape(-1, Q, Q)
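A quick self-check of that approach (a minimal sketch with small shapes):

import numpy as np

N, Q = 6, 3
bivariate = np.zeros((N, N, Q, Q))
for i in range(N):
    for j in range(i + 1, N):
        bivariate[i, j] = np.random.rand(Q, Q)

triu = np.triu_indices(N, 1)
upper_tri = bivariate[triu].reshape(-1, Q * Q)
upper_tri /= upper_tri.sum(axis=1, keepdims=True)
bivariate[triu] = upper_tri.reshape(-1, Q, Q)

# every upper-triangular (Q, Q) block now sums to 1
print(np.allclose(bivariate[triu].sum(axis=(1, 2)), 1.0))   # expected: True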
