Python equivalent of free() for numpy arrays? - python

I have a number of large numpy arrays that need to be stored as dask arrays. While loading each array from .npy and converting it into a dask.array, I noticed that RAM usage was almost as high as with regular numpy arrays, even after I del arr once it has been loaded into the dask.array.
In this example:
import sys
import numpy as np
import dask.array as da
arr = np.random.random((100, 300))
print(f'Array ref count before conversion: {sys.getrefcount(arr) - 1}') # output: 1
dask_arr = da.from_array(arr)
print(f'Distributed array ref count: {sys.getrefcount(dask_arr) - 1}') # output: 1
print(f'Array ref count after conversion: {sys.getrefcount(arr) - 1}') # output: 3
My only guess is that while dask was loading the array, it created references to the numpy array.
How can I free up the memory and delete all references to the memory location (like free(ptr) in C)?
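One way to check whether something else still holds the array is a weak reference. The sketch below uses a plain dict (the hypothetical name holder) as a stand-in for any object that keeps a reference; it is an assumption for illustration and not how Dask stores the array internally.

```python
import gc
import weakref

import numpy as np

arr = np.random.random((100, 300))
probe = weakref.ref(arr)  # a weak reference does not keep the array alive

# Hypothetical stand-in for any object that holds on to the array.
holder = {"chunk": arr}

del arr
# The dict still references the array, so its buffer cannot be freed yet.
alive_while_held = probe() is not None

holder.clear()  # drop the last strong reference
gc.collect()    # only needed for reference cycles; harmless here
alive_after_release = probe() is not None

print(alive_while_held, alive_after_release)
```

If the probe still returns the array after you have deleted every name you know about, some other object is holding a reference, and del alone cannot release the buffer.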

If you're getting a MemoryError you may have a few options:
- Break your data into smaller chunks.
- Manually trigger garbage collection and/or tweak the gc settings on the workers through a Worker Plugin (the OP has tried this without success; I'm including it anyway for other readers).
- Trim memory using malloc_trim (especially if you are working with non-NumPy data or small NumPy chunks).
Make sure you can see the Dask Dashboard while your computations are running to figure out which approach is working.
From this resource:
"Another important cause of unmanaged memory on Linux and MacOSX, which is not widely known about, derives from the fact that the libc malloc()/free() manage a user-space memory pool, so free() won’t necessarily release memory back to the OS."
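A minimal sketch of the malloc_trim option, assuming Linux with glibc (the library name libc.so.6 is that assumption). On other platforms the helper below, whose name trim_memory is just illustrative, returns None instead of failing.

```python
import ctypes
import gc


def trim_memory():
    """Collect garbage, then ask glibc to hand freed heap pages back to the OS.

    Returns malloc_trim's result (1 if memory was released, 0 otherwise),
    or None when glibc's malloc_trim is not available on this platform.
    """
    gc.collect()
    try:
        libc = ctypes.CDLL("libc.so.6")  # assumption: Linux with glibc
        return libc.malloc_trim(0)
    except (OSError, AttributeError):
        return None


result = trim_memory()
print(result)
```

With a distributed cluster, the same function could be run on every worker, for example through client.run(trim_memory), and the effect checked on the dashboard.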

Related

Gigantic memory use in example pytorch program. Why?

I have been trying to debug a program using vast amounts of memory and have distilled it into the following example:
# Caution, use carefully, this can utilise all available memory on your computer
# and render it effectively unresponsive, to the point where you cannot access
# the shell to kill the process; thus requiring reboot.
import numpy as np
import collections
import torch
# q = collections.deque(maxlen=1500) # Uses around 6.4GB
# q = collections.deque(maxlen=3000) # Uses around 12GB
q = collections.deque(maxlen=5000) # Uses around 18GB
def f():
    nparray = np.zeros([4,84,84], dtype=np.uint8)
    q.append(nparray)
    nparray1 = np.zeros([32,4,84,84], dtype=np.float32)
    tens = torch.tensor(nparray1, dtype=torch.float32)
while True:
    f()
Please note the cautionary message in the 1st line of this program. If you set maxlen to a level where it uses too much of your available RAM, it can crash your computer.
I measured the memory using top (VIRT column), and its memory use seems wildly excessive (details on the commented lines above). From previous experience in my original program if maxlen is high enough it will crash my computer.
Why is it using so much memory?
I calculate the increase in expected memory from maxlen=1500 to maxlen=3000 to be:
4 * 84 * 84 * 1500 / (1024**2) == 40.4MB.
But we see an increase of 6GB.
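The estimate can be recomputed from the shapes in the program. This sketch assumes the deque holds only the uint8 frames and ignores per-object overhead; the variable names are illustrative.

```python
# One uint8 array of shape [4, 84, 84] stored in the deque (1 byte per element).
frame_bytes = 4 * 84 * 84 * 1

# Going from maxlen=1500 to maxlen=3000 adds this many stored frames.
extra_frames = 3000 - 1500
expected_mib = frame_bytes * extra_frames / 1024**2

# One temporary float32 array of shape [32, 4, 84, 84] (4 bytes per element).
batch_bytes = 32 * 4 * 84 * 84 * 4

print(f"{frame_bytes} bytes per frame")
print(f"{expected_mib:.1f} MiB expected increase")
print(f"{batch_bytes / 1024**2:.2f} MiB per temporary batch")
```

Either way, the frames kept in the deque account for tens of megabytes, far below the observed growth of several gigabytes.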
There seems to be some sort of interaction between the deque from collections and the tensor allocation, as commenting either one out brings memory use back to what I would expect; e.g., commenting out the tensor line leads to a total memory use of 2GB, which seems much more reasonable.
Thanks for any help or insight,
Julian.
I think PyTorch stores and updates the computational graph each time you call f(), so the graph just keeps getting bigger and bigger.
Can you try to free the memory by using del(tens) (deleting the reference to the variable after use), and let me know how it works? (Found in the PyTorch documentation here: https://pytorch.org/docs/stable/notes/faq.html)

Why does deleting columns or parts of a DataFrame increase memory usage, and how to ensure garbage collection on unused slices of DataFrame

When dealing with large DataFrames, you need to be careful with memory usage (for example you might want to download large data in chunks, process the chunks, and from then on delete all the unnecessary parts from memory).
I can't find any resources on the best procedures to deal with garbage collection in pandas, but I tried the following and got surprising results:
import os, psutil, gc
import numpy as np
import pandas as pd
def get_process_mem_usage():
    process = psutil.Process(os.getpid())
    print("{:.3f} GB".format(process.memory_info().rss / 1e9))
get_process_mem_usage()
# Out: 0.146 GB
cdf = pd.DataFrame({i:np.random.rand(int(1e7)) for i in range(10)})
get_process_mem_usage()
# Out: 0.946 GB
With the following globals() and their memory usage:
        Size
cdf     781.25MB
_iii    1.05KB
_i1     1.05KB
_oh     240.00B
When I try to delete something, I get:
del cdf[1]
gc.collect()
get_process_mem_usage()
# Out: 1.668 GB
with a high process memory usage, but the following globals()
        Size
cdf     703.13MB
_i1     1.05KB
Out     240.00B
_oh     240.00B
so some memory is still allocated but not used by any object in globals().
I've also seen weird results when doing something like
cdf2 = cdf.iloc[:,:5]
del cdf
which sometimes creates a new global with a name like "_5" and more memory usage than cdf had before. (I'm not sure what this global refers to; perhaps some sort of object containing the no-longer-referenced columns from cdf, but why is it larger?)
Another option is to "delete" columns through one of:
cdf = cdf.iloc[:, :5]
# or
cdf = cdf.drop(columns=[...])
where the columns are no longer referenced by any object so they get dropped. But for me this doesn't seem to happen every time; I could swear I've seen my process take up the same amount of memory after this operation, even when I call gc.collect() afterwards. Though when I try to recreate this in a notebook it doesn't happen.
So I guess my question is:
Why does the above happen, with deletion resulting in more memory usage?
What is the best way to ensure that no-longer-needed columns are deleted from memory and properly garbage collected?

How does numpy's memmap copy-on-write mode work?

I'm confused by how numpy's memmap handles changes to data when using copy-on-write (mmap_mode='c'). Since nothing is written to the original array on disk, I expected that it would have to store all changes in memory, and thus could run out of memory if you modify every single element. To my surprise, it didn't.
I am trying to reduce my memory usage for my machine learning scripts which I run on a shared cluster (the less mem each instance takes, the more instances I can run at the same time). My data are very large numpy arrays (each > 8 Gb). My hope is to use np.memmap to work with these arrays with small memory (<4Gb available).
However, each instance might modify the data differently (e.g. might choose to normalize the input data differently each time). This has implications for storage space. If I use the r+ mode, then normalizing the array in my script will permanently change the stored array.
Since I don't want redundant copies of the data, and just want to store the original data on disk, I thought I should use the 'c' mode (copy-on-write) to open the arrays. But then where do your changes go? Are the changes kept just in memory? If so, if I change the whole array won't I run out of memory on a small-memory system?
Here's an example of a test which I expected to fail:
On a large memory system, create the array:
import numpy as np
GB = 1000**3
GiB = 1024**3
a = np.zeros((50000, 20000), dtype='float32')
bytes = a.size * a.itemsize
print('{} GB'.format(bytes / GB))
print('{} GiB'.format(bytes / GiB))
np.save('a.npy', a)
# Output:
# 4.0 GB
# 3.725290298461914 GiB
Now, on a machine with just 2 Gb of memory, this fails as expected:
a = np.load('a.npy')
But these two will succeed, as expected:
a = np.load('a.npy', mmap_mode='r+')
a = np.load('a.npy', mmap_mode='c')
Issue 1: I run out of memory running this code, trying to modify the memmapped array (fails regardless of r+/c mode):
for i in range(a.shape[0]):
    print('row {}'.format(i))
    a[i,:] = i*np.arange(a.shape[1])
Why does this fail (especially, why does it fail even in r+ mode, where it can write to the disk)? I thought memmap would only load pieces of the array into memory.
Issue 2: When I force numpy to flush the changes every once in a while, both r+ and c modes successfully finish the loop. But how can c mode do this? I didn't think flush() would do anything in c mode. The changes aren't written to disk, so they are kept in memory, and yet somehow all the changes, which must be over 3Gb, don't cause out-of-memory errors?
for i in range(a.shape[0]):
    if i % 100 == 0:
        print('row {}'.format(i))
        a.flush()
    a[i,:] = i*np.arange(a.shape[1])
Numpy isn't doing anything clever here; it's just deferring to the builtin mmap module, which has an access argument that:
accepts one of four values: ACCESS_READ, ACCESS_WRITE, or ACCESS_COPY to specify read-only, write-through or copy-on-write memory respectively, or ACCESS_DEFAULT to defer to prot.
On Linux, this works by calling the mmap system call with
MAP_PRIVATE
Create a private copy-on-write mapping. Updates to the
mapping are not visible to other processes mapping the same
file, and are not carried through to the underlying file.
Regarding your question
The changes aren't written to disk, so they are kept in memory, and yet somehow all the changes, which must be over 3Gb, don't cause out-of-memory errors?
The changes likely are written to disk, but just not to the file you opened. They're likely paged into virtual memory somewhere.
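The copy-on-write behaviour can be seen with the standard-library mmap module alone. A minimal sketch, assuming a small temporary file (the helper name cow_demo and the file contents are illustrative): writes through an ACCESS_COPY mapping are visible in the mapping but never reach the file.

```python
import mmap
import os
import tempfile


def cow_demo():
    """Write through a copy-on-write mapping and compare it with the file."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"original")

        with open(path, "r+b") as f:
            # ACCESS_COPY gives a private mapping: writes affect memory only.
            m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)
            m[:8] = b"modified"
            in_memory = bytes(m[:8])
            m.close()

        # Re-read the file to see what actually reached the disk.
        with open(path, "rb") as f:
            on_disk = f.read()
    finally:
        os.remove(path)
    return in_memory, on_disk


in_memory, on_disk = cow_demo()
print(in_memory, on_disk)
```

The modified pages belong to the process rather than to the file, so where they end up under memory pressure is decided by the operating system, not by numpy.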

How to free up memory allocated to nested numpy arrays?

I have a huge numpy array whose memory never gets freed. Here is a simple demonstration so you can see the problem yourself.
Memory allocated to a simple numpy array is freed immediately once the variable is removed (below, I delete it):
import numpy as np
X = np.ones((40000, 40000))
X.nbytes
12800000000
del(X)
When I run the code above, all 12 GB of memory is freed immediately. But with nested numpy arrays, things get complicated:
import numpy as np
import random
foo = np.array([np.array([np.ones((256,)) for j in range(random.randint(100, 150))]) for i in range(40000)])
sum(f.nbytes for f in foo)
10240481280
del(foo)
Now the 10 GB of memory never gets freed, even if you run gc.collect() explicitly. Do you have any clue why?
P.S: The env: Ubuntu + Python 2.7 + numpy 1.15.1

Pyopencl: difference between to_device and Buffer

Let
import pyopencl as cl
import pyopencl.array as cl_array
import numpy
a = numpy.random.rand(50000).astype(numpy.float32)
mf = cl.mem_flags
What is the difference between
a_gpu = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
and
a_gpu = cl_array.to_device(self.ctx, self.queue, a)
?
And what is the difference between
result = numpy.empty_like(a)
cl.enqueue_copy(self.queue, result, result_gpu)
and
result = result_gpu.get()
?
Buffers are CL's version of malloc, while pyopencl.array.Array is a workalike of numpy arrays on the compute device.
So for the second version of the first part of your question, you may write a_gpu + 2 to get a new array that has 2 added to each number in your array, whereas in the case of the Buffer, PyOpenCL only sees a bag of bytes and cannot perform any such operation.
The second part of your question is the same in reverse: If you've got a PyOpenCL array, .get() copies the data back and converts it into a (host-based) numpy array. Since numpy arrays are one of the more convenient ways to get contiguous memory in Python, the second variant with enqueue_copy also ends up in a numpy array--but note that you could've copied this data into an array of any size (as long as it's big enough) and any type--the copy is performed as a bag of bytes, whereas .get() makes sure you get the same size and type on the host.
Bonus fact: There is of course a Buffer underlying each PyOpenCL array. You can get it from the .data attribute.
To answer the first question, Buffer(hostbuf=...) can be called with anything that implements the buffer interface (reference). pyopencl.array.to_device(...) must be called with an ndarray (reference). ndarray implements the buffer interface and works in either place. However, only hostbuf=... would be expected to work with for example a bytearray (which also implements the buffer interface). I have not confirmed this, but it appears to be what the docs suggest.
On the second question, I am not sure what type result_gpu is supposed to be when you call get() on it (did you mean Buffer.get_host_array()?). In any case, enqueue_copy() works between any combination of Buffer, Image and host, can have offsets and regions, and can be asynchronous (with is_blocking=False), and I think these capabilities are only available that way (whereas get() would be blocking and return the whole buffer). (reference)
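The "bag of bytes" idea can be illustrated on the host with NumPy alone. This is only an analogy and makes no PyOpenCL calls: a raw byte copy does not care about dtype or shape, only that the destination is large enough, while a typed array keeps both.

```python
import numpy as np

a = np.arange(4, dtype=np.float32)

# Untyped view of the data: 4 float32 values become 16 raw bytes.
raw = a.tobytes()

# A destination with a different dtype but the same number of bytes.
dest = np.empty(8, dtype=np.uint16)
dest.view(np.uint8)[:] = np.frombuffer(raw, dtype=np.uint8)

# Reinterpreting the bytes with the original dtype recovers the values.
restored = dest.view(np.float32)
print(len(raw), dest.dtype, restored)
```

The byte copy succeeds even though dest is declared as uint16; it is up to the caller to reinterpret the bytes correctly, which is the bookkeeping a typed array does for you.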
