I have a very large numpy array whose memory never gets freed. Let me demonstrate my situation so you can see the problem yourself.
Memory allocated to a simple numpy array is freed up immediately once the variable is removed (like below, where I delete it):
import numpy as np
X = np.ones((40000, 40000))
X.nbytes
12800000000
del(X)
When I run the code above, all 12.8 GB of memory is freed up immediately. But in the case of nested numpy arrays things get complicated:
import numpy as np
import random
foo = np.array([np.array([np.ones((256,)) for j in range(random.randint(100, 150))]) for i in range(40000)])
sum(f.nbytes for f in foo)
10240481280
del(foo)
Now the 10 GB of memory never gets freed, even if you run gc.collect() explicitly. Does anyone have any clue?
P.S: The env: Ubuntu + Python 2.7 + numpy 1.15.1
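For reference, here is a scaled-down sketch of the same construction (sizes reduced roughly 100x so it is safe to run; dtype=object is written out explicitly, since newer numpy versions require it for ragged arrays):
import gc
import random
import numpy as np

# Scaled-down version of the construction above (~100 MB instead of ~10 GB).
foo = np.array([np.array([np.ones((256,)) for j in range(random.randint(100, 150))])
                for i in range(400)], dtype=object)
print(foo.dtype)                    # object: an array holding 400 separate 2-D arrays
print(sum(f.nbytes for f in foo))   # roughly 100 MB of inner-array data

del foo
gc.collect()                        # explicit collection, as mentioned above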
I have been trying to debug a program using vast amounts of memory and have distilled it into the following example:
# Caution, use carefully, this can utilise all available memory on your computer
# and render it effectively unresponsive, to the point where you cannot access
# the shell to kill the process; thus requiring reboot.
import numpy as np
import collections
import torch
# q = collections.deque(maxlen=1500) # Uses around 6.4GB
# q = collections.deque(maxlen=3000) # Uses around 12GB
q = collections.deque(maxlen=5000) # Uses around 18GB
def f():
    nparray = np.zeros([4,84,84], dtype=np.uint8)
    q.append(nparray)
    nparray1 = np.zeros([32,4,84,84], dtype=np.float32)
    tens = torch.tensor(nparray1, dtype=torch.float32)

while True:
    f()
Please note the cautionary message in the 1st line of this program. If you set maxlen to a level where it uses too much of your available RAM, it can crash your computer.
I measured the memory using top (the VIRT column), and the memory use seems wildly excessive (details in the commented lines above). From previous experience with my original program, if maxlen is high enough it will crash my computer.
Why is it using so much memory?
I calculate the increase in expected memory from maxlen=1500 to maxlen=3000 to be:
4 * 84 * 84 * 15000 / (1024**2) == 403MB.
But we see an increase of 6GB.
There seems to be some interaction between the deque and the tensor allocation: commenting out either one brings memory use back to the expected level; e.g. commenting out the tensor line leads to a total memory use of 2 GB, which seems much more reasonable.
Thanks for any help or insight,
Julian.
I think PyTorch stores and updates the computational graph each time you call f(), so the graph just keeps getting bigger and bigger.
Can you try freeing the memory with del tens (deleting the reference to the variable after use), and let me know how it works? (Found in the PyTorch docs here: https://pytorch.org/docs/stable/notes/faq.html)
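A minimal, self-contained sketch of that suggested change inside f(); whether it actually releases the memory is exactly what needs testing, so treat it as an experiment rather than a confirmed fix:
import collections
import numpy as np
import torch

q = collections.deque(maxlen=5000)

def f():
    nparray = np.zeros([4, 84, 84], dtype=np.uint8)
    q.append(nparray)
    nparray1 = np.zeros([32, 4, 84, 84], dtype=np.float32)
    tens = torch.tensor(nparray1, dtype=torch.float32)
    del tens  # explicitly drop the reference to the tensor after use

while True:  # caution: same unbounded loop as in the original example
    f()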
I have a number of large numpy arrays that need to be stored as dask arrays. While loading each array from .npy and then converting it into a dask.array, I noticed the RAM usage was almost as high as with regular numpy arrays, even after I del arr once arr has been loaded into the dask.array.
In this example:
import sys
import numpy as np
import dask.array as da

arr = np.random.random((100, 300))
print(f'Array ref count before conversion: {sys.getrefcount(arr) - 1}')  # output: 1
dask_arr = da.from_array(arr)
print(f'Distributed array ref count: {sys.getrefcount(dask_arr) - 1}')  # output: 1
print(f'Array ref count after conversion: {sys.getrefcount(arr) - 1}')  # output: 3
My only guess is that while dask was loading the array, it created references to the numpy array.
How can I free up the memory and delete all references to the memory location (like free(ptr) in C)?
If you're getting a MemoryError you may have a few options:
Break your data into smaller chunks.
Manually trigger garbage collection and/or tweak the gc settings on the workers through a Worker Plugin (the OP has tried this without success; I'll include it anyway for other readers)
Trim memory using malloc_trim, especially if working with non-NumPy data or small NumPy chunks (a ctypes sketch is shown below, after the quote)
Make sure you can see the Dask Dashboard while your computations are running to figure out which approach is working.
From this resource:
"Another important cause of unmanaged memory on Linux and MacOSX, which is not widely known about, derives from the fact that the libc malloc()/free() manage a user-space memory pool, so free() won’t necessarily release memory back to the OS."
I'm confused by how numpy's memmap handles changes to data when using copy-on-write (mmap_mode=c). Since nothing is written to the original array on disk, I expected it to store all changes in memory, and thus to run out of memory if you modify every single element. To my surprise, it didn't.
I am trying to reduce the memory usage of my machine learning scripts, which I run on a shared cluster (the less memory each instance takes, the more instances I can run at the same time). My data are very large numpy arrays (each > 8 GB). My hope is to use np.memmap to work with these arrays with little memory (< 4 GB available).
However, each instance might modify the data differently (e.g. might choose to normalize the input data differently each time). This has implications for storage space. If I use the r+ mode, then normalizing the array in my script will permanently change the stored array.
Since I don't want redundant copies of the data, and just want to store the original data on disk, I thought I should use the 'c' (copy-on-write) mode to open the arrays. But then where do the changes go? Are they kept just in memory? If so, and I change the whole array, won't I run out of memory on a small-memory system?
Here's an example of a test which I expected to fail:
On a large memory system, create the array:
import numpy as np
GB = 1000**3
GiB = 1024**3
a = np.zeros((50000, 20000), dtype='float32')
bytes = a.size * a.itemsize
print('{} GB'.format(bytes / GB))
print('{} GiB'.format(bytes / GiB))
np.save('a.npy', a)
# Output:
# 4.0 GB
# 3.725290298461914 GiB
Now, on a machine with just 2 GB of memory, this fails as expected:
a = np.load('a.npy')
But these two will succeed, as expected:
a = np.load('a.npy', mmap_mode='r+')
a = np.load('a.npy', mmap_mode='c')
Issue 1: I run out of memory running this code, trying to modify the memmapped array (fails regardless of r+/c mode):
for i in range(a.shape[0]):
    print('row {}'.format(i))
    a[i,:] = i*np.arange(a.shape[1])
Why does this fail (especially, why does it fail even in r+ mode, where it can write to the disk)? I thought memmap would only load pieces of the array into memory?
Issue 2: When I force numpy to flush the changes every once in a while, both r+ and c mode successfully finish the loop. But how can c mode do this? I didn't think flush() would do anything in c mode. The changes aren't written to disk, so they are kept in memory, and yet somehow all the changes, which must be over 3 GB, don't cause out-of-memory errors?
for i in range(a.shape[0]):
    if i % 100 == 0:
        print('row {}'.format(i))
        a.flush()
    a[i,:] = i*np.arange(a.shape[1])
Numpy isn't doing anything clever here; it's just deferring to the built-in mmap module, which has an access argument that:
accepts one of four values: ACCESS_READ, ACCESS_WRITE, or ACCESS_COPY to specify read-only, write-through or copy-on-write memory respectively, or ACCESS_DEFAULT to defer to prot.
On Linux, this works by calling the mmap system call with
MAP_PRIVATE
Create a private copy-on-write mapping. Updates to the
mapping are not visible to other processes mapping the same
file, and are not carried through to the underlying file.
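A minimal sketch of what mmap_mode='c' corresponds to at the mmap level, using the a.npy file from the example above (the write lands in a private page, not in the file, so only one page of memory is touched):
import mmap

with open('a.npy', 'rb') as f:  # the file does not even need to be writable
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)
    mm[0:4] = b'\x00\x00\x00\x00'  # modifies the private copy only
    mm.close()
# The bytes on disk are unchanged; the modified page lives in memory (or swap).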
Regarding your question:
The changes aren't written to disk, so they are kept in memory, and yet somehow all the changes, which must be over 3 GB, don't cause out-of-memory errors?
The changes likely are written to disk, just not to the file you opened. Dirty pages of a private mapping behave like anonymous memory, so they can be paged out to swap when RAM gets tight.
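A small-scale way to convince yourself that the original file never changes under mmap_mode='c', no matter what you write into the mapped array (the file name small.npy is illustrative; sizes are reduced so everything fits comfortably in RAM):
import numpy as np

np.save('small.npy', np.zeros((1000, 1000), dtype='float32'))

a = np.load('small.npy', mmap_mode='c')  # MAP_PRIVATE mapping under the hood
a[0, :] = 42.0                           # the write goes to private pages only
del a

b = np.load('small.npy')
print(b[0, :5])                          # still all zeros: the stored file is untouched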
I am working on a project in Python. For certain reasons, I have to call MATLAB for some calculations.
ubuntu 14.04 64bit
python 2.7.6
numpy 1.11.1
matlab 2016a linux-64bit
import matlab
import matlab.engine
import numpy as np
import time
data = np.random.rand(1000, 100, 100)
print ('pass begin')
st = time.time()
data_matlab = matlab.double(data.tolist())
print ('pass numpy to matlab finished in {:.2f} sec'.format(time.time() - st))
Passing a float64 numpy array with shape (1000, 100, 100) to a MATLAB array takes 63.49 seconds. This is unacceptable. Is there an efficient way to pass a big data array from numpy to a MATLAB array in Python?
pass begin
pass numpy to matlab finished in 63.49 sec
Starting with MATLAB R2022a, this operation is at least an order of magnitude faster than in previous releases. When I run the code sample above on a Windows 10 machine, the operation takes consistently less than 2 seconds now as opposed to the more than 63 seconds reported in the original question. See release notes under "Performance"/"MATLAB Engine API for Python: Improved performance with large multidimensional arrays in Python".
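If you are stuck on a pre-R2022a release, one commonly used workaround is to avoid the element-by-element tolist() conversion entirely and move the data through a .mat file instead; a hedged sketch (the file name data.mat and the variable name data are illustrative, and disk speed then becomes the limiting factor):
import numpy as np
import scipy.io
import matlab.engine

data = np.random.rand(1000, 100, 100)

# Write the array with scipy, then load it inside MATLAB's workspace,
# skipping the slow matlab.double(data.tolist()) conversion.
scipy.io.savemat('data.mat', {'data': data})

eng = matlab.engine.start_matlab()
eng.eval("load('data.mat')", nargout=0)  # 'data' now exists in the MATLAB workspace
print(eng.eval('size(data)'))            # sanity check: [1000, 100, 100]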
Let
import pyopencl as cl
import pyopencl.array as cl_array
import numpy
a = numpy.random.rand(50000).astype(numpy.float32)
mf = cl.mem_flags
What is the difference between
a_gpu = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
and
a_gpu = cl_array.to_device(self.ctx, self.queue, a)
?
And what is the difference between
result = numpy.empty_like(a)
cl.enqueue_copy(self.queue, result, result_gpu)
and
result = result_gpu.get()
?
Buffers are CL's version of malloc, while pyopencl.array.Array is a workalike of numpy arrays on the compute device.
So, for the second version in the first part of your question, you may write a_gpu + 2 to get a new array that has 2 added to each number in your array, whereas in the case of the Buffer, PyOpenCL only sees a bag of bytes and cannot perform any such operation.
The second part of your question is the same in reverse: if you've got a PyOpenCL array, .get() copies the data back and converts it into a (host-based) numpy array. Since numpy arrays are one of the more convenient ways to get contiguous memory in Python, the second variant with enqueue_copy also ends up in a numpy array. Note, however, that you could have copied this data into an array of any size (as long as it's big enough) and any type: the copy is performed as a bag of bytes, whereas .get() makes sure you get the same size and type on the host.
Bonus fact: There is of course a Buffer underlying each PyOpenCL array. You can get it from the .data attribute.
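A small sketch contrasting the two routes described above (it assumes an OpenCL device is available via cl.create_some_context(); note that the current to_device signature takes the queue rather than the context):
import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a = np.random.rand(50000).astype(np.float32)

# Array route: behaves like a numpy array that lives on the device.
a_gpu = cl_array.to_device(queue, a)  # current signature: (queue, ndarray)
b_gpu = a_gpu + 2                     # elementwise op runs on the device
result = b_gpu.get()                  # comes back as a host numpy array

# Buffer route: just a bag of bytes; copy it back into a preallocated host array.
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
out = np.empty_like(a)
cl.enqueue_copy(queue, out, a_buf)    # blocking by default

print(type(a_gpu.data))               # the Buffer underlying the Array (see .data above)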
To answer the first question, Buffer(hostbuf=...) can be called with anything that implements the buffer interface (reference). pyopencl.array.to_device(...) must be called with an ndarray (reference). ndarray implements the buffer interface and works in either place. However, only hostbuf=... would be expected to work with, for example, a bytearray (which also implements the buffer interface). I have not confirmed this, but it appears to be what the docs suggest.
On the second question, I am not sure what type result_gpu is supposed to be when you call get() on it (did you mean Buffer.get_host_array()?). In any case, enqueue_copy() works between combinations of Buffer, Image and host memory, can use offsets and regions, and can be asynchronous (with is_blocking=False), and I think these capabilities are only available that way (whereas get() is blocking and returns the whole buffer). (reference)