Memory growth with broadcast operations in NumPy - python

I am using NumPy to handle some large data matrices (of around ~50GB in size). The machine where I am running this code has 128GB of RAM so doing simple linear operations of this magnitude shouldn't be a problem memory-wise.
However, I am witnessing a huge memory growth (to more than 100GB) when computing the following code in Python:
import numpy as np
# memory allocations (everything works fine)
a = np.zeros((1192953, 192, 32), dtype='f8')
b = np.zeros((1192953, 192), dtype='f8')
c = np.zeros((192, 32), dtype='f8')
a[:] = b[:, :, np.newaxis] - c[np.newaxis, :, :] # memory explodes here
Please note that initial memory allocations are done without any problems. However, when I try to perform the subtract operation with broadcasting, the memory grows to more than 100GB. I always thought that broadcasting would avoid making extra memory allocations but now I am not sure if this is always the case.
As such, can someone explain why this memory growth is happening, and how the code above could be rewritten using more memory-efficient constructs?
I am running the code in Python 2.7 within IPython Notebook.

@rth's suggestion to do the operation in smaller batches is a good one. You could also try the function np.subtract and give it a destination array, to avoid creating an additional temporary array. Note that you don't need to index c as c[np.newaxis, :, :]: broadcasting prepends the missing leading axis to c automatically.
So instead of
a[:] = b[:, :, np.newaxis] - c[np.newaxis, :, :] # memory explodes here
try
np.subtract(b[:, :, np.newaxis], c, a)
The third argument of np.subtract is the destination array.
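As a quick sanity check on smaller arrays (the shapes below are scaled down purely for illustration), the keyword form out= makes the destination explicit and behaves the same way:
import numpy as np
# Scaled-down shapes, purely for illustration.
a = np.zeros((10000, 192, 32), dtype='f8')
b = np.zeros((10000, 192), dtype='f8')
c = np.zeros((192, 32), dtype='f8')
# The broadcast difference is written directly into a, so no temporary
# array of a's full size is allocated for the result.
np.subtract(b[:, :, np.newaxis], c, out=a)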

Well, your array a already takes 1192953 * 192 * 32 * 8 bytes / 1e9 ≈ 58 GB of memory.
The broadcasting does not make additional memory allocations for the initial arrays, but the result of
b[:, :, np.newaxis] - c[np.newaxis, :, :]
is still stored in a temporary array. Therefore, at that line you have allocated at least two arrays with the shape of a, for a total memory use of more than 116 GB.
You can avoid this issue by operating on a smaller subset of your array at a time:
CHUNK_SIZE = 100000
# Step through the leading axis in chunks; the final, partial chunk is included.
for start in range(0, b.shape[0], CHUNK_SIZE):
    sl = slice(start, start + CHUNK_SIZE)
    a[sl] = b[sl, :, np.newaxis] - c[np.newaxis, :, :]
This will be marginally slower, but it uses much less memory.

Related

How do I optimise this numpy array operation?

This numpy operation gives a memory error. (Here X and Y are 2D arrays with shape (5000, 3072) and (500, 3072))
dists[:,:] = np.sqrt(np.sum(np.square(np.subtract(X, Y[:,np.newaxis])), axis=2))
I think the numpy array broadcasting is taking up a lot of memory. Is there any way to optimise the memory usage for these array operations?
Edit:
(This is a problem in Assignment 1 of cs231n). I found another solution that gives the same thing without a memory error:
dists[:,:] = np.sqrt((Y**2).sum(axis=1)[:, np.newaxis] + (X**2).sum(axis=1) - 2 * Y.dot(X.T))
Can you help me understand why my solution is inefficient memory wise?
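A rough way to see why is to count the bytes of the largest temporary each expression creates (a back-of-the-envelope sketch, assuming float64 inputs):
# Broadcast approach: np.subtract(X, Y[:, np.newaxis]) materialises a
# (500, 5000, 3072) float64 temporary before the sum reduces it away.
bytes_broadcast = 500 * 5000 * 3072 * 8   # 61,440,000,000 bytes, ~61 GB
# Expanded-square approach: the largest temporaries are (500, 5000)
# matrices (Y.dot(X.T) and the broadcast sum), plus two short vectors.
bytes_matmul = 500 * 5000 * 8             # 20,000,000 bytes, ~20 MB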

Are squeezed arrays (in Python) smaller than arrays with single-dimensional entries? ([x,y] vs [x,y,1]?)

When trying to get code to work using different frameworks and sources, I've stumbled across this multiple times:
Python NumPy arrays A and B that are content-wise the same, but one has A.shape == [x, y] and the other B.shape == [x, y, 1]. From dealing with this several times, I know that I can resolve such issues with squeeze:
A == numpy.squeeze(B)
But currently I have to redesign a lot of code that errors out due to "inconsistent" arrays in that regard (some images with len(img.shape) == 2, i.e. [1024, 1024], and some images with len(img.shape) == 3, i.e. [1024, 1024, 1]).
Now I have to pick one and I'm leaning towards [1024, 1024, 1], but since this code should be memory-efficient I'm wondering:
Do arrays with single-dimensional entries consume more memory than squeezed arrays? Or is there any other reason why I should avoid single-dimensional entries?
Do arrays with single-dimensional entries consume more memory than squeezed arrays?
They take the same amount of memory.
NumPy arrays have a property called nbytes that gives the number of bytes used by the array's data. Using it, you can easily verify this:
>>> import numpy as np
>>> arr = np.ones((1024, 1024, 1))
>>> arr.nbytes
8388608
>>> arr.squeeze().nbytes
8388608
The reason it takes the same amount of memory is actually easy: NumPy arrays aren't real multi-dimensional arrays. They are one-dimensional arrays that use strides to "emulate" multidimensionality. Each stride gives the number of bytes to step in memory to move one element along that dimension:
>>> arr.strides
(8192, 8, 8)
>>> arr.squeeze().strides
(8192, 8)
So by removing the length-one dimension you only removed a stride entry that was never actually used to step through memory; the underlying data buffer is unchanged.
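A quick way to convince yourself that squeeze returns a view of the same buffer rather than a copy (this check is my addition, not part of the original demonstration):
>>> np.shares_memory(arr, arr.squeeze())
True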
Or is there any other reason why I should avoid single-dimensional entries?
It depends. In some cases you actually create these yourself to utilize broadcasting with NumPy arrays. However, in other cases they are annoying.
Note that there is in fact a small memory difference because NumPy has to store one stride and shape integer for each dimension:
>>> import sys
>>> sys.getsizeof(arr)
8388736
>>> sys.getsizeof(arr.squeeze().copy()) # remove one dimension
8388720
>>> sys.getsizeof(arr[:, None].copy()) # add one dimension
8388752
However, 16 bytes per dimension isn't much compared to the roughly 8 MB of data the array holds, or to a view, which costs only on the order of 100 bytes (squeeze returns a view, which is why I had to copy it above).

Addition speed of numpy arrays with different contiguous-type

Numpy arrays are stored with different contiguity types (C- and F-). When using numpy.swapaxes(), the contiguity type can change. I need to add two multidimensional arrays (3d, to be more specific), one of which comes from another array with swapped axes. What I've noticed is that when the first axis is swapped with the last axis, in the case of a 3d array, the contiguity changes from C- to F-. And adding two arrays with different contiguity types is extremely slow (~6 times slower than adding two C-contiguous arrays). However, if other axes are swapped (0-1 or 1-2), the resulting array has false flags for both C- and F-contiguity (i.e. it is non-contiguous). The weird thing to me is that adding one C-contiguous array and one array that is neither C- nor F-contiguous is in fact only slightly slower than adding two arrays of the same type. Here are my two questions:
Why does C- & F-contiguous array addition seem to behave differently from C- & non-contiguous array addition? Is it caused by a different rearranging mechanism, or simply because the rearranging distance between C- and F-contiguous is the longest of all possible axis orders?
If I have to add a C-contiguous array and an F-contiguous/non-contiguous array, what is the best way to speed it up?
Below is a minimal example of what I encountered. The four printed durations on my computer are 2.0s (C-contiguous + C-contiguous), 12.4s (C-contiguous + F-contiguous), 3.4s (C-contiguous + non-contiguous) and 3.3s (C-contiguous + non-contiguous).
import numpy as np
import time
np.random.seed(1234)
a = np.random.random((300, 400, 500)) # C-contiguous
b = np.swapaxes(np.random.random((500, 400, 300)), 0, 2) # F-contiguous
c = np.swapaxes(np.random.random((300, 500, 400)), 1, 2) # Non-contiguous
d = np.swapaxes(np.random.random((400, 300, 500)), 0, 1) # Non-contiguous
t = time.time()
for n in range(10):
    result = a + a
print(time.time() - t)   # C-contiguous + C-contiguous
t = time.time()
for n in range(10):
    result = a + b
print(time.time() - t)   # C-contiguous + F-contiguous
t = time.time()
for n in range(10):
    result = a + c
print(time.time() - t)   # C-contiguous + non-contiguous
t = time.time()
for n in range(10):
    result = a + d
print(time.time() - t)   # C-contiguous + non-contiguous
These flags (C and F) denote whether a matrix (or multi-dimensional array) is stored in row-major order (C, as in the C language, which uses row-major storage) or column-major order (F, as in Fortran, which uses column-major storage).
Neither layout is inherently faster on its own; it is just a storage convention, and by itself it makes no performance difference.
However, what makes an enormous difference is whether the arrays are contiguous or not. If they are contiguous, you get good timings because of caching effects, vectorization, and other optimizations the compiler might apply.
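If you do have to mix layouts, one option (my suggestion, not something established by the timings above) is to copy the non-C-contiguous operand into C order once with np.ascontiguousarray; the copy costs some time and memory, but the addition itself then runs at the C-plus-C speed:
import numpy as np
a = np.random.random((300, 400, 500))                     # C-contiguous
b = np.swapaxes(np.random.random((500, 400, 300)), 0, 2)  # F-contiguous view
# One explicit copy into C order, then a fast C + C addition.
b_c = np.ascontiguousarray(b)
result = a + b_c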

Numpy, replace a broadcast by iteration

I have the following code snippet
def norm(x1, x2):
    return np.sqrt(((x1 - x2)**2).sum(axis=0))

def call_norm(x1, x2):
    x1 = x1[..., :, np.newaxis]
    x2 = x2[..., np.newaxis, :]
    return norm(x1, x2)
As I understand it, each x represents an array of points in N-dimensional space, where N is the size of the first dimension of the array (so for points in 3-space the first dimension has size 3). The code inserts extra trailing dimensions and uses broadcasting to generate the Cartesian product of these sets of points, and so calculates the distance between all pairs of points.
x = np.array([[1, 2, 3],[1, 2, 3]])
call_norm(x, x)
array([[ 0. , 1.41421356, 2.82842712],
[ 1.41421356, 0. , 1.41421356],
[ 2.82842712, 1.41421356, 0. ]])
(so the distance between [1, 1] and [2, 2] is 1.41421356, as expected)
I find that for moderate-size problems this approach can use huge amounts of memory. I can easily "de-vectorise" the problem and replace it with iteration, but I'd expect that to be slow. Is there a (reasonably) easy compromise where I could keep most of the speed advantages of vectorisation without the memory penalty? Some fancy generator trick?
There is no way to do this kind of computation with NumPy vectorization without paying the memory penalty. For the specific case of efficiently computing pairwise distance matrices, packages tend to get around this by implementing things in C (e.g. scipy.spatial.distance) or in Cython (e.g. sklearn.metrics.pairwise).
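For example, a minimal sketch using scipy.spatial.distance.cdist (with points stored one per row, which is what cdist expects) computes pairwise Euclidean distances in C without ever materialising the big three-dimensional intermediate:
import numpy as np
from scipy.spatial.distance import cdist
x = np.random.random((1000, 3))   # 1000 points in 3-space, one point per row
y = np.random.random((2000, 3))
dists = cdist(x, y, metric='euclidean')   # shape (1000, 2000)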
If you want to do this "by-hand", so to speak, using numpy-style syntax but without incurring the memory penalty, the current best option is probably dask.array, which automates the construction and execution of flexible task graphs for batch execution using a numpy-style syntax.
Here's an example of using dask for this computation:
import dask.array as da
# Create the chunked data. Dask arrays can also be built from
# existing numpy arrays, e.g. x = da.from_array(x_numpy, chunks=5)
x = da.random.random((100, 3), chunks=5)
y = da.random.random((200, 3), chunks=5)
# Compute the task graph (syntax just like numpy!)
diffs = x[:, None, :] - y[None, :, :]
dist = da.sqrt((diffs ** 2).sum(-1))
# Execute the task graph
result = dist.compute()
print(result.shape)
# (100, 200)
You'll find that dask is much more memory efficient than NumPy, is often more computationally efficient than NumPy, and can also be computed in parallel or out-of-core relatively straightforwardly.

optimize python code for memory efficiency

I have the following Python code:
import numpy as np
sizes = 2000
array1 = np.empty((sizes, sizes, sizes, 3), dtype=np.float32)
for i in range(sizes):
    array1[i, :, :, 0] = 1.5*i
    array1[:, i, :, 1] = 2.5*i
    array1[:, :, i, 2] = 3.5*i
array2 = array1.reshape(sizes*sizes*sizes, 3)
#do something with array2
array3 = array2.reshape(sizes*sizes*sizes, 3)
I want to optimize this code for memory efficiency, but I have no idea how. Could I use numpy.reshape in a more memory-efficient way?
I think your code is already memory efficient.
When possible, np.reshape returns a view of the original array. That is so in this case and therefore np.reshape is already as memory efficient as can be.
Here is how you can tell np.reshape is returning a view:
import numpy as np
# Let's make array1 smaller; it won't change our conclusions
sizes = 5
array1 = np.arange(sizes*sizes*sizes*3).reshape((sizes, sizes, sizes, 3))
for i in range(sizes):
    array1[i, :, :, 0] = 1.5*i
    array1[:, i, :, 1] = 2.5*i
    array1[:, :, i, 2] = 3.5*i
array2 = array1.reshape(sizes*sizes*sizes, 3)
Note the value of array2 at a certain location:
assert array2[0,0] == 0
Change the corresponding value in array1:
array1[0,0,0,0] = 100
Note that the value of array2 changes.
assert array2[0,0] == 100
Since array2 changes due to a modification of array1, you can conclude that array2 is a view of array1. Views share the underlying data. Since there is no copy being made, the reshape is memory efficient.
array2 is already of shape (sizes*sizes*sizes, 3), so this reshape does nothing.
array3 = array2.reshape(sizes*sizes*sizes, 3)
Finally, the assert below shows array3 was also affected by the modification made to array1. So that proves conclusively that array3 is also a view of array1.
assert array3[0,0] == 100
So really your problem depends on what you are doing with the array. You are currently storing a huge amount of redundant information: three one-dimensional arrays of length sizes carry everything the full array contains, so you could keep a tiny fraction of the currently stored data and not lose anything.
For instance, if we define the following three one dimensional arrays
a = np.linspace(0, (sizes-1)*1.5, sizes).astype(np.float32)
b = np.linspace(0, (sizes-1)*2.5, sizes).astype(np.float32)
c = np.linspace(0, (sizes-1)*3.5, sizes).astype(np.float32)
We can reproduce any entry along the last (fastest-varying) axis of your array1:
In [235]: array1[4][3][19] == np.array([a[4],b[3],c[19]])
Out[235]: array([ True, True, True], dtype=bool)
Whether this is useful depends on what you are doing with the array, since remaking array1 from a, b and c costs some performance. However, if you are nearing the limits of what your machine can handle, sacrificing some performance for memory efficiency may be a necessary step. Moving a, b and c around also has far lower overhead than moving array1 around.
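For instance, here is a minimal sketch (the helper name rebuild_row is mine, purely for illustration) of regenerating one slice of array1 on demand from a, b and c instead of keeping the full array in memory:
import numpy as np
sizes = 5   # small, just to keep the check cheap; the question uses 2000
a = np.linspace(0, (sizes-1)*1.5, sizes).astype(np.float32)
b = np.linspace(0, (sizes-1)*2.5, sizes).astype(np.float32)
c = np.linspace(0, (sizes-1)*3.5, sizes).astype(np.float32)
def rebuild_row(i):
    # Reconstruct array1[i] from the three 1-d generators.
    row = np.empty((sizes, sizes, 3), dtype=np.float32)
    row[:, :, 0] = a[i]
    row[:, :, 1] = b[:, np.newaxis]
    row[:, :, 2] = c[np.newaxis, :]
    return row
# Sanity check against the original construction.
array1 = np.empty((sizes, sizes, sizes, 3), dtype=np.float32)
for i in range(sizes):
    array1[i, :, :, 0] = 1.5*i
    array1[:, i, :, 1] = 2.5*i
    array1[:, :, i, 2] = 3.5*i
assert np.allclose(rebuild_row(2), array1[2])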
