Numpy, replace a broadcast by iteration

I have the following code snippet
def norm(x1, x2):
    return np.sqrt(((x1 - x2)**2).sum(axis=0))
def call_norm(x1, x2):
    x1 = x1[..., :, np.newaxis]
    x2 = x2[..., np.newaxis, :]
    return norm(x1, x2)
As I understand it, each x represents an array of points in N dimensional space, where N is the size of the final dimension of the array (so for points in 3-space the final dimension is size 3). It inserts extra dimensions and uses broadcasting to generate the cartesian product of these sets of points, and so calculates the distance between all pairs of points.
x = np.array([[1, 2, 3],[1, 2, 3]])
call_norm(x, x)
array([[ 0. , 1.41421356, 2.82842712],
[ 1.41421356, 0. , 1.41421356],
[ 2.82842712, 1.41421356, 0. ]])
(so the distance between [1,1] and [2,2] is 1.41421356, as expected)
I find that for moderate size problems this approach can use huge amounts of memory. I can easily "de-vectorise" the problem and replace it by iteration, but I'd expect that to be slow. Is there a (reasonably) easy compromise solution where I could have most of the speed advantages of vectorisation but without the memory penalty? Some fancy generator trick?

There is no way to do this kind of computation without the memory penalty with numpy vectorization. For the specific case of efficiently computing pairwise distance matrices, packages tend to get around this by implementing things in C (e.g. scipy.spatial.distance) or in Cython (e.g. sklearn.metrics.pairwise).
If you want to do this "by-hand", so to speak, using numpy-style syntax but without incurring the memory penalty, the current best option is probably dask.array, which automates the construction and execution of flexible task graphs for batch execution using a numpy-style syntax.
Here's an example of using dask for this computation:
import dask.array as da
# Create the chunked data. This can be created
# from numpy arrays as well, e.g. x_dask = da.from_array(x_numpy, chunks=5)
x = da.random.random((100, 3), chunks=5)
y = da.random.random((200, 3), chunks=5)
# Build the task graph (syntax just like numpy!)
diffs = x[:, None, :] - y[None, :, :]
dist = da.sqrt((diffs ** 2).sum(-1))
# Execute the task graph
result = dist.compute()
print(result.shape)
# (100, 200)
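You'll find that dask can be much more memory efficient than NumPy for this kind of computation, because the large intermediate (here diffs) is only ever materialised a few chunks at a time. It is often more computationally efficient as well, and the graph can be executed in parallel or out-of-core relatively straightforwardly.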
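If you'd rather not add a dependency, the same chunking idea can be done by hand in plain NumPy. The following is only a rough sketch: the function name pairwise_dist_blocked and the block parameter are made up for illustration, and it assumes one point per row (as in the dask example), not one point per column as in the question's snippet. Only a (block, m, d) temporary exists at any moment instead of the full (n, m, d) one.

```python
import numpy as np

def pairwise_dist_blocked(x, y, block=1024):
    """Euclidean distances between rows of x (n, d) and rows of y (m, d).

    Hypothetical helper: processes x in row blocks so that the broadcast
    temporary has shape (block, m, d) rather than (n, m, d).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    out = np.empty((x.shape[0], y.shape[0]))
    for start in range(0, x.shape[0], block):
        stop = start + block
        # Broadcast only this block of x against all of y
        diffs = x[start:stop, None, :] - y[None, :, :]
        out[start:stop] = np.sqrt((diffs ** 2).sum(axis=-1))
    return out

rng = np.random.default_rng(0)
x = rng.random((100, 3))
y = rng.random((200, 3))

blocked = pairwise_dist_blocked(x, y, block=16)
# Fully vectorised reference, for comparison on this small input
full = np.sqrt(((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1))
print(np.allclose(blocked, full))
```

The block size trades memory for speed: larger blocks mean fewer Python-level iterations but a bigger temporary. The result array itself still has to fit in memory.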

Related

MemoryError while creating cartesian product in Numpy

I have 3 numpy arrays and need to form the cartesian product between them. Dimensions of the arrays are not fixed, so they can take different values, one example could be A=(10000, 50), B=(40, 50), C=(10000,50).
Then, I perform some processing (like a+b-c). Below is the function that I am using for the product.
def cartesian_2d(arrays, out=None):
    arrays = [np.asarray(x) for x in arrays]
    dtype = arrays[0].dtype
    n = np.prod([x.shape[0] for x in arrays])
    if out is None:
        out = np.empty([n, len(arrays), arrays[0].shape[1]], dtype=dtype)
    m = n // arrays[0].shape[0]
    out[:, 0] = np.repeat(arrays[0], m, axis=0)
    if arrays[1:]:
        cartesian_2d(arrays[1:], out=out[0:m, 1:, :])
        for j in range(1, arrays[0].shape[0]):
            out[j * m:(j + 1) * m, 1:] = out[0:m, 1:]
    return out
a = [[ 0, -0.02], [1, -0.15]]
b = [[0, 0.03]]
result = cartesian_2d([a,b,a])
# array([[[ 0.  , -0.02],
#         [ 0.  ,  0.03],
#         [ 0.  , -0.02]],
#        [[ 0.  , -0.02],
#         [ 0.  ,  0.03],
#         [ 1.  , -0.15]],
#        [[ 1.  , -0.15],
#         [ 0.  ,  0.03],
#         [ 0.  , -0.02]],
#        [[ 1.  , -0.15],
#         [ 0.  ,  0.03],
#         [ 1.  , -0.15]]])
The output is the same as with itertools.product. However, I am using my custom function to take advantage of numpy vectorized operations, which is working fine compared to itertools.product in my case.
After this, I do
result[:, 0, :] + result[:, 1, :] - result[:, 2, :]
# array([[ 0.  ,  0.03],
#        [-1.  ,  0.16],
#        [ 1.  , -0.1 ],
#        [ 0.  ,  0.03]])
So this is the final expected result.
The function works as expected as long as my array fits in memory. But my usecase requires me to work with huge data and I get a MemoryError at the line np.empty() since it is unable to allocate the memory required.
I am working with circa 20GB data at the moment and this might increase in future.
These arrays represent vectors and will have to be stored in float, so I cannot use int. Also, they are dense arrays, so using sparse is not an option.
I will be using these arrays for further processing and ideally I would not like to store them in files at this stage. So memmap / h5py format may not help, although I am not sure of this.
If there are other ways to form this product, that would be okay too.
As I am sure there are applications with way larger datasets than this, I hope someone has encountered such issues before and would like to know how to handle this issue. Please help.

If at least your result fits in memory
The following produces your expected result without relying on an intermediate three times the size of the result. It uses broadcasting.
Please note that almost any NumPy operation is broadcastable like this, so in practice there is probably no need for an explicit cartesian product:
#shared dimensions:
sh = a.shape[1:]
aba = (a[:, None, None] + b[None, :, None] - a[None, None, :]).reshape(-1, *sh)
aba
#array([[ 0. , 0.03],
# [-1. , 0.16],
# [ 1. , -0.1 ],
# [ 0. , 0.03]])
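As a quick sanity check, here is a self-contained version using the small a and b from the question, converted to arrays first (the snippet above assumes they already are NumPy arrays). The comparison against an explicit itertools.product loop is my own addition, just to confirm the row order is the same.

```python
import itertools
import numpy as np

a = np.array([[0, -0.02], [1, -0.15]])
b = np.array([[0, 0.03]])

# Shared trailing dimensions
sh = a.shape[1:]
aba = (a[:, None, None] + b[None, :, None] - a[None, None, :]).reshape(-1, *sh)

# Reference: one row per (i, j, k) combination, in itertools.product order
expected = np.array([
    a[i] + b[j] - a[k]
    for i, j, k in itertools.product(range(len(a)), range(len(b)), range(len(a)))
])
print(np.allclose(aba, expected))
```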
Addressing result rows by 'ID'
You may consider leaving out the reshape. That would allow you to address the rows in the result by combined index. If your component ID's are just 0,1,2,... like in your example this would be the same as the combined ID. For example aba[1,0,0] would correspond to the row obtained as second row of a + first row of b - first row of a.
A bit of explanation
Broadcasting: When for example adding two arrays their shapes do not have to be identical, only compatible because of broadcasting. Broadcasting is in a sense a generalization of adding scalars to arrays:
      [[2],                [[7],      [[2],
7 +    [3],    equiv to     [7],   +   [3],
       [4]]                 [7]]       [4]]
Broadcasting:
               [[4],               [[1, 2, 3],      [[4, 4, 4],
[[1, 2, 3]] +   [5],    equiv to    [1, 2, 3],   +   [5, 5, 5],
                [6]]                [1, 2, 3]]       [6, 6, 6]]
For this to work, the shapes are compared dimension by dimension: each dimension of each operand must either be 1 or be equal to the corresponding dimension of every other operand whose dimension is not 1. If an operand has fewer dimensions than the others, its shape is padded with ones on the left. Note that the equiv arrays shown in the illustration are not explicitly created.
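To make the rule concrete, here is a small sketch using the arrays from the illustration. np.broadcast_to shows what the "equiv" arrays would look like; they are read-only views with a stride of 0 along the stretched axis, so nothing is actually copied. The variable names are mine.

```python
import numpy as np

row = np.array([[1, 2, 3]])        # shape (1, 3)
col = np.array([[4], [5], [6]])    # shape (3, 1)

total = row + col                  # shapes (1, 3) and (3, 1) broadcast to (3, 3)

# The "equiv" arrays from the illustration, as views rather than copies
row_b = np.broadcast_to(row, (3, 3))
col_b = np.broadcast_to(col, (3, 3))
print(row_b.strides, col_b.strides)  # the stretched axis has stride 0

# Dimensions that differ and are not 1 cannot be broadcast together
try:
    np.ones((2, 3)) + np.ones((4, 3))
    compatible = True
except ValueError:
    compatible = False
print(compatible)
```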
If the result also does not fit
In that case I don't see how you can possibly avoid using storage, so h5py or something like that it is.
Removing the first column from each operand
This is just a matter of slicing:
a_no_id = a[:, 1:]
etc. Note that, unlike Python lists, NumPy arrays when sliced do not return a copy but a view. Therefore efficiency (memory or runtime) is not an issue here.

An alternate solution is to create a cartesian product of indices (which is easier, as solutions for cartesian products of 1D arrays exist):
idx = cartesian_product(
np.arange(len(a)),
np.arange(len(b)) + len(a),
np.arange(len(a))
)
And then use fancy indexing to create the output array:
x = np.concatenate((a, b))
result = x[idx.ravel(), :].reshape(*idx.shape, -1)
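cartesian_product is not defined in the snippet above; it stands for any 1D cartesian product routine. A minimal sketch of one, assuming a meshgrid-based helper is acceptable (the implementation below is my assumption, not necessarily what the answer had in mind):

```python
import numpy as np

def cartesian_product(*arrays):
    """Hypothetical helper: each row is one combination of the 1-D inputs."""
    grids = np.meshgrid(*arrays, indexing="ij")
    return np.stack(grids, axis=-1).reshape(-1, len(arrays))

a = np.array([[0, -0.02], [1, -0.15]])
b = np.array([[0, 0.03]])

idx = cartesian_product(
    np.arange(len(a)),
    np.arange(len(b)) + len(a),   # rows of b come after rows of a in x
    np.arange(len(a)),
)
x = np.concatenate((a, b))
result = x[idx.ravel(), :].reshape(*idx.shape, -1)

combined = result[:, 0, :] + result[:, 1, :] - result[:, 2, :]
print(idx)
print(combined)
```

Note that result is still the full product array, so this does not reduce peak memory compared to the original function; only the index array is cheap.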

Writing results efficiently on disk
First, a few thoughts on the size of the resulting data.
Size of the result data
size_in_GB = A.shape[0]**2*A.shape[1]*B.shape[0]*(size_of_datatype)/1e9
In your question you mentioned A.shape=(10000,50), B.shape=(40,50). Using float64, your result will be approximately 1600 GB. This can be done without problems if you have enough disk space, but you have to think about what you want to do with the data next. Maybe this is only an intermediate result and processing the data in blocks is possible.
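For reference, the arithmetic behind that figure, written out with the shapes from the question (the variable names are just for illustration):

```python
import numpy as np

n_a, n_b, d = 10000, 40, 50               # A is (n_a, d), B is (n_b, d)
itemsize = np.dtype(np.float64).itemsize  # bytes per element

rows = n_a * n_b * n_a                    # one row per (a, b, a) combination
size_in_GB = rows * d * itemsize / 1e9
print(size_in_GB)
```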
If this is not the case, here is an example of how to handle 1600 GB of data efficiently (RAM usage will be about 200 MB). The throughput should be around 200 MB/s on realistic data.
The code calculating the results is from @PaulPanzer.
import numpy as np
import tables  # importing PyTables registers the blosc filter
import h5py as h5
import h5py_cache as h5c
a = np.arange(500*50).reshape(500, 50)
b = np.arange(40*50).reshape(40, 50)
# isn't well documented, have a look at https://github.com/Blosc/hdf5-blosc
compression_opts = [0, 0, 0, 0, 5, 1, 1]  # a list, so single entries can be assigned
compression_opts[4] = 9  # compression level 0...9
compression_opts[5] = 1  # shuffle
compression_opts[6] = 1  # compressor (I guess that's lz4)
File_Name_HDF5 = 'Test.h5'
# chunk_cache_mem_size is an h5py_cache argument, so open the file with h5c.File
f = h5c.File(File_Name_HDF5, 'w', chunk_cache_mem_size=1024**2*300)
dset = f.create_dataset('Data', shape=(a.shape[0]**2*b.shape[0], a.shape[1]), dtype='d', chunks=(a.shape[0]*b.shape[0], 1), compression=32001, compression_opts=tuple(compression_opts), shuffle=False)
# Write the data, one block of a.shape[0]*b.shape[0] rows per iteration
for i in range(a.shape[0]):
    sh = a.shape[1:]
    aba = (a[i] + b[:, None] - a).reshape(-1, *sh)
    dset[i*a.shape[0]*b.shape[0]:(i+1)*a.shape[0]*b.shape[0]] = aba
f.close()
Reading the data
File_Name_HDF5='Test.h5'
f = h5c.File(File_Name_HDF5, 'r',chunk_cache_mem_size=1024**2*300)
dset=f['Data']
chunks_size=500
for i in range(0,dset.shape[0],chunks_size):
    #Iterate over the first dimension
    data=dset[i:i+chunks_size,:] #avoid excessive calls to the hdf5 library
    #Do something with the data
f.close()
f = h5c.File(File_Name_HDF5, 'r',chunk_cache_mem_size=1024**2*300)
dset=f['Data']
for i in range(dset.shape[1]):
    # Iterate over the second dimension
    # fancy indexing e.g.[:,i] will be much slower
    # use np.expand_dims or in this case np.squeeze after the read operation from the dset
    # if you want to have the same result as [:,i] (1 dim array)
    data=dset[:,i:i+1]
    #Do something with the data
f.close()
On this test example I get a write throughput of about 550 MB/s, a read throughput of about 500 MB/s along the first dimension and 1000 MB/s along the second, and a compression ratio of 50. Numpy memmap will only provide acceptable speed if you read or write data along the fastest changing direction (in C the last dimension); with a chunked data format such as the HDF5 one used here, this isn't a problem at all. Compression is also not possible with Numpy memmap, leading to larger files and slower speed.
Please note that the compression filter and chunk shape have to be set up to suit your needs. This depends on how you want to read the data afterwards and on the actual data.
If you do something completely wrong, the performance can be 10-100 times slower compared to a proper way of doing it (e.g. the chunk shape can be optimized for the first or the second read example).

Large point-matrix array multiplication in numpy

I have two large numpy arrays, one holding a list of 3D points and another holding a list of transformation matrices. Assuming there is a 1 to 1 correspondence between the two lists, I'm looking for the best way to calculate the result array of each point transformed by its corresponding matrix.
My solution was to use a loop (see "test4" in the example code below), which worked fine with small arrays but becomes impractical with large arrays because of how wasteful my method is :)
import numpy as np
COUNT = 100
matrix = np.random.random_sample((3,3,)) # A single matrix
matrices = np.random.random_sample((COUNT,3,3,)) # Many matrices
point = np.random.random_sample((3,)) # A single point
points = np.random.random_sample((COUNT,3,)) # Many points
# Test 1, result of a single point multiplied by a single matrix
# This is as easy as it gets
test1 = np.dot(point,matrix)
print('done')
# Test 2, result of a single point multiplied by many matrices
# This works well and returns a transformed point for each matrix
test2 = np.dot(point,matrices)
print('done')
# Test 3, result of many points multiplied by a single matrix
# This works also just fine
test3 = np.dot(points,matrix)
print('done')
# Test 4, this is the case i'm trying to solve. Assuming there's a 1-1
# correspondence between the point and matrix arrays, the result i want
# is an array of points, where each point has been transformed by its
# corresponding matrix
test4 = np.zeros((COUNT,3))
for i in range(COUNT):
    test4[i] = np.dot(points[i],matrices[i])
print('done')
With a small array, this works fine. With large arrays (COUNT=1000000), Test #4 works but gets rather slow.
Is there a way to make Test #4 faster? Presumably by avoiding the loop?

You can use numpy.einsum. Here's an example with 5 matrices and 5 points:
In [49]: matrices.shape
Out[49]: (5, 3, 3)
In [50]: points.shape
Out[50]: (5, 3)
In [51]: p = np.einsum('ijk,ik->ij', matrices, points)
In [52]: p[0]
Out[52]: array([ 1.16532051, 0.95155227, 1.5130032 ])
In [53]: matrices[0].dot(points[0])
Out[53]: array([ 1.16532051, 0.95155227, 1.5130032 ])
In [54]: p[1]
Out[54]: array([ 0.79929572, 0.32048587, 0.81462493])
In [55]: matrices[1].dot(points[1])
Out[55]: array([ 0.79929572, 0.32048587, 0.81462493])
The above is doing matrices[i] * points[i] (i.e. multiplying on the right), but I just reread the question and noticed that your code uses points[i] * matrices[i]. You can do that by switching the indices and arguments of einsum:
In [76]: lp = np.einsum('ij,ijk->ik', points, matrices)
In [77]: lp[0]
Out[77]: array([ 1.39510822, 1.12011057, 1.05704609])
In [78]: points[0].dot(matrices[0])
Out[78]: array([ 1.39510822, 1.12011057, 1.05704609])
In [79]: lp[1]
Out[79]: array([ 0.49750324, 0.70664634, 0.7142573 ])
In [80]: points[1].dot(matrices[1])
Out[80]: array([ 0.49750324, 0.70664634, 0.7142573 ])
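Here is a self-contained check of the second variant against the loop from the question. The seeded random data and the matmul alternative at the end are my additions, included only to show that the results agree.

```python
import numpy as np

rng = np.random.default_rng(0)
count = 5
matrices = rng.random((count, 3, 3))
points = rng.random((count, 3))

# points[i] . matrices[i] for every i, without a Python loop
lp = np.einsum('ij,ijk->ik', points, matrices)

# Reference: the explicit loop from the question
loop = np.array([points[i].dot(matrices[i]) for i in range(count)])

# Equivalent formulation with stacked matrix multiplication:
# (count, 1, 3) @ (count, 3, 3) -> (count, 1, 3)
mm = (points[:, None, :] @ matrices)[:, 0, :]

print(np.allclose(lp, loop), np.allclose(mm, loop))
```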

It doesn't make much sense to have multiple transform matrices. You can combine transform matrices as in this question:
If I want to apply matrix A, then B, then C, I will multiply the matrices in reverse order np.dot(C,np.dot(B,A))
So you can save some memory space by precomputing that matrix. Then applying a bunch of vectors to one transform matrix should be easily handled (within reason).
I don't know why you would need one million transformations on one million vectors, but I would suggest buying more RAM.
Edit:
There isn't a way to reduce the number of operations, no. Unless your transformation matrices have a specific property such as sparsity, diagonality, etc., you're going to have to run all the multiplications and summations. However, the way you process these operations can be optimized across cores and/or by using vector operations on GPUs.
Also, pure Python loops are notably slow. You can try spreading the NumPy work across your cores using NumExpr. Or maybe use a BLAS library from C++ (notably quick ;))

Fast inverse and transpose matrix in Python

I have a large matrix A of shape (n, n, 3, 3), where n is about 5000. Now I want to find the inverse and transpose of matrix A:
import numpy as np
A = np.random.rand(1000, 1000, 3, 3)
identity = np.identity(3, dtype=A.dtype)
Ainv = np.zeros_like(A)
Atrans = np.zeros_like(A)
for i in range(1000):
    for j in range(1000):
        Ainv[i, j] = np.linalg.solve(A[i, j], identity)
        Atrans[i, j] = np.transpose(A[i, j])
Is there a faster, more efficient way to do this?

This is taken from a project of mine, where I also do vectorized linear algebra on many 3x3 matrices.
Note that there is only a loop over 3; not a loop over n, so the code is vectorized in the important dimensions. I don't want to vouch for how this compares to a C/numba extension to do the same thing though, performance wise. This is likely to be substantially faster still, but at least this blows the loops over n out of the water.
import numpy as np
def adjoint(A):
    """compute inverse without division by det; ...xv3xc3 input, or array of matrices assumed"""
    AI = np.empty_like(A)
    for i in range(3):
        # row i is the cross product of the other two rows, in cyclic order
        AI[...,i,:] = np.cross(A[...,i-2,:], A[...,i-1,:])
    return AI
def inverse_transpose(A):
    """
    efficiently compute the inverse-transpose for stack of 3x3 matrices
    """
    I = adjoint(A)
    det = dot(I, A).mean(axis=-1)
    return I / det[...,None,None]
def inverse(A):
    """inverse of a stack of 3x3 matrices"""
    return np.swapaxes( inverse_transpose(A), -1,-2)
def dot(A, B):
    """dot arrays of vecs; contract over last indices"""
    return np.einsum('...i,...i->...', A, B)
A = np.random.rand(2,2,3,3)
I = inverse(A)
# each printed 3x3 block should be close to the identity matrix
print(np.einsum('...ij,...jk',A,I))

for the transpose:
testing a bit in ipython showed:
In [1]: import numpy
In [2]: x = numpy.ones((5,6,3,4))
In [3]: numpy.transpose(x,(0,1,3,2)).shape
Out[3]: (5, 6, 4, 3)
so you can just do
Atrans = numpy.transpose(A,(0,1,3,2))
to swap the last two dimensions (axes 2 and 3) while leaving axes 0 and 1 unchanged
for the inversion:
the last example of http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.inv.html#numpy.linalg.inv
Inverses of several matrices can be computed at once:
>>> from numpy.linalg import inv
>>> a = np.array([[[1., 2.], [3., 4.]], [[1, 3], [3, 5]]])
>>> inv(a)
array([[[-2.  ,  1.  ],
        [ 1.5 , -0.5 ]],
       [[-1.25,  0.75],
        [ 0.75, -0.25]]])
So I guess in your case, the inversion can be done with just
Ainv = inv(A)
and it will know that the last two dimensions are the ones it is supposed to invert over, and that the first dimensions are just how you stacked your data. This should be much faster.
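Putting both pieces together for an (n, n, 3, 3) stack, here is a small sketch. It assumes a NumPy version where np.linalg.inv accepts stacked matrices; the test data is made diagonally dominant purely so that every 3x3 block is safely invertible, which is my assumption and not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
# Random 3x3 blocks plus 3*I: strictly diagonally dominant, hence invertible
A = rng.random((n, n, 3, 3)) + 3 * np.eye(3)

Ainv = np.linalg.inv(A)            # inverts over the last two axes
Atrans = np.swapaxes(A, -1, -2)    # transposes the last two axes (a view)

# Each product of a block with its inverse should be the identity
print(np.allclose(A @ Ainv, np.eye(3)))
print(np.shares_memory(Atrans, A))
```

np.swapaxes(A, -1, -2) is equivalent to numpy.transpose(A, (0, 1, 3, 2)) for a 4-D array, but does not depend on the number of leading dimensions.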
speed difference
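for the transpose: your method needs 3.77557015419 sec, and mine needs 2.86102294922e-06 sec (which is a speedup of over 1 million times). The reason is that numpy.transpose only returns a view with the axes relabelled and does not copy any data.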
for the inversion: i guess my numpy version is not high enough to try that numpy.linalg.inv trick with (n,n,3,3) shape, to see the speedup there (my version is 1.6.2, and the docs i based my solution on are for 1.8, but it should work on 1.8, if someone else can test that?)

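Numpy arrays have the .T property, which is a shortcut for transpose. Note that for an array with more than two dimensions .T reverses all of the axes, so for a stack of matrices like this one use np.swapaxes(A, -1, -2) instead.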
For inversions, you use np.linalg.inv(A).

As posted by wim A.I also works on matrix. e.g.
print (A.I)

for numpy-matrix object, use matrix.getI.
e.g.
A=numpy.matrix('1 3;5 6')
print (A.getI())

Calculating long expressions using Numpy (coordinate transform)?

In Python's NumPy module, is there a function that can calculate long/advanced math expressions on an array? I heard of the numexpr module but want to stay clear of further dependencies.
Better yet, can I limit these expressions to only say the first or second element of the sub arrays within my array, without having to unpack them as separate arrays?
Here is my specific problem. I have an array of arrays containing geographic point coordinates looking like this: [[x1,y1],[x2,y2],[x3,y3],etc...]. What I want is to transform these geocoords to pixel coordinates so they can be drawn on an image. I therefore want to run the following expression/calculation on the first element of each subarray, ie the xs:
((180+X)/360)*screenwidthpixels
And on the second element, ie the ys:
((-90+Y)/180)*-screenheightpixels
These expressions would work in a Python for-loop, but that is too slow, which is why I'm turning to NumPy. I know I can chain NumPy's single math operator functions one after another, and I have tried that, but it is still too slow. Besides, to do that I first had to unpack all the xs and ys into separate arrays and repack them together after the calculation, making it even slower.
So I guess I'm looking for a more direct Numpy way using less steps to transform my coordinate array using the expressions above. Any ideas?

import numpy as np
points = np.random.rand(10,2)
translation = np.array([180,-90])
scaling = np.array([1024, -768]) / np.array([360,180])
transformed_points = (points + translation) * scaling
This will do what you are looking for. It relies on numpy broadcasting rules to achieve expressiveness and performance.
But rather than explaining exactly how that works, I think you are better off finding yourself a good numpy primer and starting at the top. numpy is one of the best things about python, and you can't go wrong learning a little more about it. Suffice it to say, numpy is certainly up to the kind of task you are facing.
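To see that this matches the two expressions in the question, here is the same idea on three hand-picked coordinates. The 1024x768 screen size is taken from the snippet above, and the scaling is written with explicit floats so it does not rely on how integer division behaves.

```python
import numpy as np

screen_w, screen_h = 1024, 768  # assumed screen size in pixels

# (lon, lat) pairs: top-left corner, centre, bottom-right corner
points = np.array([[-180.0, 90.0], [0.0, 0.0], [180.0, -90.0]])

translation = np.array([180.0, -90.0])
scaling = np.array([screen_w / 360.0, -screen_h / 180.0])

# ((180 + X) / 360) * screen_w  and  ((-90 + Y) / 180) * -screen_h, in one step
pixels = (points + translation) * scaling
print(pixels)
```

The (2,) translation and scaling arrays broadcast across all rows of the (N, 2) points array, so the x and y columns never have to be unpacked.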

I'm a little confused because I'm not sure exactly what you're saying you already tried, or what the speed condition for success is.
Are you saying you already tried something like the following, but it is too slow?
arr = whatever
arr[:,0] = (arr[:,0] + 180) / (360 * screenwidthpixels)
arr[:,1] = 180 - (arr[:,1] - 90) / (180 * screenheightpixels)

I'm not sure what you mean by "having to unpack" to X and Y. Here's how you avoid unpacking (if i understand...)
arr = np.array([ [x1,y1], [x2,y2], [x3,y3] ])
arr.shape
=> (3, 2)
X = arr[:,0] # fast, creates a view
Y = arr[:,1] # fast too
((X+180)/360)/screenwidthpixels
Further speed up can be achieved by rewriting/simplifying your expressions.
((X+180)/360)/s => (X+180)/(360*s)
(180-((Y+90)/180))/s => (180/s-1/(2*s)) - Y/(180*s)
In the first rewrite, you get 2 traverses of the array, instead of 3, and in the second, the array is only traversed twice, instead of 4 times.

In [235]: xs=arange(1000)
In [236]: ys=arange(1, 1001)
In [237]: a=array([xs, ys]).T
In [238]: a
Out[238]:
array([[ 0, 1],
[ 1, 2],
[ 2, 3],
...,
[ 997, 998],
[ 998, 999],
[ 999, 1000]])
In [240]: a[:, 0]=(a[:, 0]+180)/360/1024
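a[:, 0] gives a view of the first column of a, so it is fast and saves memory. Note that a has an integer dtype here, so the values assigned back into it are truncated to integers; create the array with a float dtype if you need the fractional results.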

Speed up python code for computing matrix cofactors

As part of a complex task, I need to compute matrix cofactors. I did this in a straightforward way using this nice code for computing matrix minors. Here is my code:
def matrix_cofactor(matrix):
    C = np.zeros(matrix.shape)
    nrows, ncols = C.shape
    for row in range(nrows):
        for col in range(ncols):
            minor = matrix[np.array(list(range(row)) + list(range(row+1, nrows)))[:, np.newaxis],
                           np.array(list(range(col)) + list(range(col+1, ncols)))]
            C[row, col] = (-1)**(row+col) * np.linalg.det(minor)
    return C
It turns out that this matrix cofactor code is the bottleneck, and I would like to optimize the code snippet above. Any ideas as to how to do this?

If your matrix is invertible, the cofactor is related to the inverse:
def matrix_cofactor(matrix):
return np.linalg.inv(matrix).T * np.linalg.det(matrix)
This gives large speedups (~ 1000x for 50x50 matrices). The main reason is fundamental: this is an O(n^3) algorithm, whereas the minor-det-based one is O(n^5).
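A small check that the two definitions agree on an invertible matrix. The minor-based reference below uses np.delete instead of the index arrays from the question; that is my simplification, the numbers it produces are the same.

```python
import numpy as np

def cofactor_by_minors(m):
    """Reference implementation straight from the definition."""
    n = m.shape[0]
    c = np.empty_like(m, dtype=float)
    for r in range(n):
        for k in range(n):
            # Delete row r and column k, then take the signed determinant
            minor = np.delete(np.delete(m, r, axis=0), k, axis=1)
            c[r, k] = (-1) ** (r + k) * np.linalg.det(minor)
    return c

def cofactor_by_inverse(m):
    """Cofactor matrix via det(m) * inv(m).T; requires m to be invertible."""
    return np.linalg.inv(m).T * np.linalg.det(m)

m = np.array([[1.0, 2.0, 0.0],
              [0.0, 3.0, 0.0],
              [0.0, 7.0, 1.0]])

print(cofactor_by_minors(m))
print(cofactor_by_inverse(m))
```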
This probably means that also for non-invertible matrices, there is some clever way to calculate the cofactor (i.e., not use the mathematical formula that you use above, but some other equivalent definition).
If you stick with the det-based approach, what you can do is the following:
The majority of the time seems to be spent inside det. (Check out line_profiler to find this out yourself.) You can try to speed that part up by linking Numpy with the Intel MKL, but other than that, there is not much that can be done.
You can speed up the other part of the code like this:
minor = np.zeros([nrows-1, ncols-1])
for row in range(nrows):
    for col in range(ncols):
        minor[:row,:col] = matrix[:row,:col]
        minor[row:,:col] = matrix[row+1:,:col]
        minor[:row,col:] = matrix[:row,col+1:]
        minor[row:,col:] = matrix[row+1:,col+1:]
        ...
This gains some 10-50% total runtime depending on the size of your matrices. The original code has Python range and list manipulations, which are slower than direct slice indexing. You could also try to be more clever and copy only the parts of the minor that actually change. However, already after the above change, close to 100% of the time is spent inside numpy.linalg.det, so further optimization of the other parts does not make much sense.

The row index array that is passed as the first index of matrix (the expression ending in [:,np.newaxis]) does not depend on col, so you could move it outside the inner loop and cache the value. Depending on the number of columns you have, this might give a small optimization.

Instead of using the inverse and determinant, I'd suggest using the SVD
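def cofactors(A):
    U, sigma, Vt = np.linalg.svd(A)
    N = len(sigma)
    g = np.tile(sigma, N)
    g[::(N+1)] = 1
    # entry i is the product of all singular values except sigma[i]
    G = np.diag(np.prod(np.reshape(g, (N, N)), 1))
    # det(U)*det(Vt) is +1 or -1 and carries the sign of det(A)
    return np.linalg.det(U) * np.linalg.det(Vt) * (U @ G @ Vt)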

from sympy import *
A = Matrix([[1,2,0],[0,3,0],[0,7,1]])
A.adjugate().T
And the output (which is cofactor matrix) is:
Matrix([
[ 3, 0, 0],
[-2, 1, -7],
[ 0, 0, 3]])
