Gigantic memory use in example pytorch program. Why?


I have been trying to debug a program using vast amounts of memory and have distilled it into the following example:
# Caution, use carefully, this can utilise all available memory on your computer
# and render it effectively unresponsive, to the point where you cannot access
# the shell to kill the process; thus requiring reboot.
import numpy as np
import collections
import torch
# q = collections.deque(maxlen=1500) # Uses around 6.4GB
# q = collections.deque(maxlen=3000) # Uses around 12GB
q = collections.deque(maxlen=5000) # Uses around 18GB
def f():
    nparray = np.zeros([4,84,84], dtype=np.uint8)
    q.append(nparray)
    nparray1 = np.zeros([32,4,84,84], dtype=np.float32)
    tens = torch.tensor(nparray1, dtype=torch.float32)
while True:
    f()
Please note the cautionary message in the 1st line of this program. If you set maxlen to a level where it uses too much of your available RAM, it can crash your computer.
I measured the memory using top (VIRT column), and its memory use seems wildly excessive (details on the commented lines above). From previous experience in my original program if maxlen is high enough it will crash my computer.
Why is it using so much memory?
I calculate the expected increase in memory from maxlen=1500 to maxlen=3000 (1500 extra uint8 arrays) to be:
4 * 84 * 84 * 1500 / (1024**2) ≈ 40MB.
But we see an increase of 6GB.
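As a quick check of that arithmetic, here is a minimal sketch that counts only the raw array payloads (it ignores Python object and deque overhead, and the helper name frames_bytes is just for illustration):

```python
def frames_bytes(count, shape=(4, 84, 84), itemsize=1):
    """Payload bytes of `count` arrays with the given shape and item size."""
    per_frame = itemsize
    for dim in shape:
        per_frame *= dim
    return count * per_frame

# Going from maxlen=1500 to maxlen=3000 keeps 1500 more uint8 frames alive.
extra = frames_bytes(3000) - frames_bytes(1500)
extra_mib = extra / 1024**2

# One float32 batch of shape [32, 4, 84, 84], as allocated on every call to f().
batch = frames_bytes(1, shape=(32, 4, 84, 84), itemsize=4)

print(extra, round(extra_mib, 1), batch)
```

So the deque contents alone should account for tens of MB, not several GB.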
There seems to be some sort of interaction between the deque and the tensor allocation, as commenting out either one brings memory use back to the expected level; e.g. commenting out the tensor line leads to total memory use of 2GB, which seems much more reasonable.
Thanks for any help or insight,
Julian.

I think PyTorch stores and updates the computational graph each time you call f(), and thus the graph just keeps getting bigger and bigger.
Can you try to free the memory by using del(tens) (deleting the reference to the variable after use), and let me know how it works? (Found in the PyTorch documentation here: https://pytorch.org/docs/stable/notes/faq.html)

Related

Is there a way to open hdf5 files with the POSIX_FADV_DONTNEED flag?

We are working with large (1.2TB) uncompressed, unchunked hdf5 files with h5py in python for a machine learning application, which requires us to work through the full dataset repeatedly, loading slices of ~15MB individually in a randomized order. We are working on a linux (Ubuntu 18.04) machine with 192 GB RAM. We noticed that the program is slowly filling the cache. When total size of cache reaches size comparable with full machine RAM (free memory in top almost 0 but plenty ‘available’ memory) swapping occurs slowing down all other applications. In order to pinpoint the source of the problem, we wrote a separate minimal example to isolate our dataloading procedures - but found that the problem was independent of each part of our method.
We tried:
Building numpy memmap and accessing requested slice:
#on init:
f = h5py.File(tv_path, 'r')
hdf5_event_data = f["event_data"]
self.event_data = np.memmap(tv_path, mode="r", shape=hdf5_event_data.shape,
offset=hdf5_event_data.id.get_offset(),dtype=hdf5_event_data.dtype)
self.e = np.ones((512,40,40,19))
#on __getitem__:
self.e = self.event_data[index,:,:,:19]
return self.e
Reopening the memmap on each call to getitem:
#on __getitem__:
self.event_data = np.memmap(self.path, mode="r", shape=self.shape,
offset=self.offset, dtype=self.dtype)
self.e = self.event_data[index,:,:,:19]
return self.e
Addressing the h5 file directly and converting to a numpy array:
#on init:
f = h5py.File(tv_path, 'r')
hdf5_event_data = f["event_data"]
self.event_data = hdf5_event_data
self.e = np.ones((512,40,40,19))
#on __getitem__:
self.e = self.event_data[index,:,:,:19]
return self.e
We also tried the above approaches within pytorch Dataset/Dataloader framework - but it made no difference.
We observe high memory fragmentation as evidenced by /proc/buddyinfo. Dropping the cache via sync; echo 3 > /proc/sys/vm/drop_caches doesn’t help while application is running. Cleaning cache before application starts removes swapping behaviour until cache eats up the memory again - and swapping starts again.
Our working hypothesis is that the system is trying to hold on to cached file data which leads to memory fragmentation. Eventually when new memory is requested swapping is performed even though most memory is still ‘available’.
As such, we looked for ways to change the Linux environment’s behaviour around file caching and found this post. Is there a way to set the POSIX_FADV_DONTNEED flag when opening an h5 file in python, or on the portion of it that we access via numpy memmap, so that this accumulation of cache does not occur? In our use case we will not be re-visiting that particular file location for a long time (until we have accessed all other remaining ‘slices’ of the file).

You can use os.posix_fadvise to tell the OS how regions you plan to load will be used. This naturally requires a bit of low-level tweaking to determine your file descriptor, and get an idea of the regions you plan on reading.
The easiest way to get the file descriptor is to supply it yourself:
pf = open(tv_path, 'rb')
f = h5py.File(pf, 'r')
You can now set the advice. For the entire file:
os.posix_fadvise(os.fileno(pf), 0, f.id.get_filesize(), os.POSIX_FADV_DONTNEED)
Or for a particular dataset:
os.posix_fadvise(os.fileno(pf), hdf5_event_data.id.get_offset(),
hdf5_event_data.id.get_storage_size(), os.POSIX_FADV_DONTNEED)
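To apply the same idea per slice, here is a minimal sketch that works on a plain file descriptor rather than through h5py. It assumes the advice is issued after each read, so the pages just read can be dropped from the cache; the helper name read_then_drop is hypothetical, and whether this cures the swapping described in the question is untested:

```python
import os
import tempfile

def read_then_drop(fd, offset, length):
    """Read `length` bytes at `offset`, then hint that the range will not be reused."""
    data = os.pread(fd, length, offset)
    # posix_fadvise only exists on some Unix platforms, so guard the call.
    if hasattr(os, "posix_fadvise"):
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)
    return data
```

With a memmap or an h5py dataset, the offset and length would have to be computed from the dataset offset and the byte range of the slice that was read.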
Other things to look at
H5py does its own chunk caching. You may want to try turning this off:
f = h5py.File(..., rdcc_nbytes=0)
As an alternative, you may want to try using one of the other drivers provided in h5py, like 'sec2':
f = h5py.File(..., driver='sec2')

How does numpy's memmap copy-on-write mode work?

I'm confused by how numpy's memmap handles changes to data when using copy-on-write (mmap_mode='c'). Since nothing is written to the original array on disk, I expected that it would have to store all changes in memory, and thus could run out of memory if you modify every single element. To my surprise, it didn't.
I am trying to reduce my memory usage for my machine learning scripts which I run on a shared cluster (the less mem each instance takes, the more instances I can run at the same time). My data are very large numpy arrays (each > 8 Gb). My hope is to use np.memmap to work with these arrays with small memory (<4Gb available).
However, each instance might modify the data differently (e.g. might choose to normalize the input data differently each time). This has implications for storage space. If I use the r+ mode, then normalizing the array in my script will permanently change the stored array.
Since I don't want redundant copies of the data, and just want to store the original data on disk, I thought I should use the 'c' mode (copy-on-write) to open the arrays. But then where do your changes go? Are the changes kept just in memory? If so, if I change the whole array won't I run out of memory on a small-memory system?
Here's an example of a test which I expected to fail:
On a large memory system, create the array:
import numpy as np
GB = 1000**3
GiB = 1024**3
a = np.zeros((50000, 20000), dtype='float32')
bytes = a.size * a.itemsize
print('{} GB'.format(bytes / GB))
print('{} GiB'.format(bytes / GiB))
np.save('a.npy', a)
# Output:
# 4.0 GB
# 3.725290298461914 GiB
Now, on a machine with just 2 Gb of memory, this fails as expected:
a = np.load('a.npy')
But these two will succeed, as expected:
a = np.load('a.npy', mmap_mode='r+')
a = np.load('a.npy', mmap_mode='c')
Issue 1: I run out of memory running this code, trying to modify the memmapped array (fails regardless of r+/c mode):
for i in range(a.shape[0]):
    print('row {}'.format(i))
    a[i,:] = i*np.arange(a.shape[1])
Why does this fail (especially, why does it fail even in r+ mode, where it can write to the disk)? I thought memmap would only load pieces of the array into memory?
Issue 2: When I force the numpy to flush the changes every once in a while, both r+/c mode successfully finish the loop. But how can c mode do this? I didn't think flush() would do anything for c mode? The changes aren't written to disk, so they are kept in memory, and yet somehow all the changes, which must be over 3Gb, don't cause out-of-memory errors?
for i in range(a.shape[0]):
    if i % 100 == 0:
        print('row {}'.format(i))
        a.flush()
    a[i,:] = i*np.arange(a.shape[1])

Numpy isn't doing anything clever here, it's just deferring to the builtin mmap module, which has an access argument that:
accepts one of four values: ACCESS_READ, ACCESS_WRITE, or ACCESS_COPY to specify read-only, write-through or copy-on-write memory respectively, or ACCESS_DEFAULT to defer to prot.
On linux, this works by calling the mmap system call with
MAP_PRIVATE
Create a private copy-on-write mapping. Updates to the
mapping are not visible to other processes mapping the same
file, and are not carried through to the underlying file.
Regarding your question
The changes aren't written to disk, so they are kept in memory, and yet somehow all the changes, which must be over 3Gb, don't cause out-of-memory errors?
The changes likely are written to disk, but just not to the file you opened. They're likely paged into virtual memory somewhere.
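The copy-on-write behaviour can be seen directly with the standard mmap module. This is a small sketch using a temporary file (the function name cow_demo is made up for illustration): a write through an ACCESS_COPY mapping is visible in the mapping but never reaches the file.

```python
import mmap
import os
import tempfile

def cow_demo():
    """Write through a copy-on-write mapping; return (mapped_bytes, file_bytes)."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"original")
        with open(path, "rb") as f:
            m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)
            m[:4] = b"ORIG"   # modifies the private copy of the page only
            mapped = m[:]     # what this process sees through the mapping
            m.close()
        with open(path, "rb") as f:
            on_disk = f.read()  # the underlying file is untouched
    finally:
        os.remove(path)
    return mapped, on_disk
```

Only the pages that are actually written get a private copy; untouched pages stay backed by the file.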

Memory leak from pyarrow?

For the parsing of a larger file, I need to write in a loop to a large number of parquet files successively. However, it appears that the memory consumed by this task increases over each iteration, whereas I would expect it to remain constant (as nothing should be appended in memory). This makes it tricky to scale.
I've added a minimal reproducible example which creates 10 000 parquet files and appends to them in a loop.
import resource
import random
import string
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))
schema = pa.schema([
    pa.field('test', pa.string()),
])
resource.setrlimit(resource.RLIMIT_NOFILE, (1000000, 1000000))
number_files = 10000
number_rows_increment = 1000
number_iterations = 100
writers = [pq.ParquetWriter('test_'+id_generator()+'.parquet', schema) for i in range(number_files)]
for i in range(number_iterations):
    for writer in writers:
        table_to_write = pa.Table.from_pandas(
            pd.DataFrame({'test': [id_generator() for i in range(number_rows_increment)]}),
            preserve_index=False,
            schema = schema,
            nthreads = 1)
        table_to_write = table_to_write.replace_schema_metadata(None)
        writer.write_table(table_to_write)
    print(i)
for writer in writers:
    writer.close()
Would anyone have any idea what causes this leak and how to prevent it?

We aren't sure what is wrong, but some other users have reported as yet undiagnosed memory leaks. I added your example to one of the tracking JIRA issues https://issues.apache.org/jira/browse/ARROW-3324

Update in 2022:
I've spent several days on a memory leak issue with pyarrow; the key points from the discussion of it are pasted below. Basically, they say it is not a memory leak in the library but common, expected behavior.
Pyarrow uses jemalloc, a custom memory allocator which does its best to hold onto memory allocated from the OS (since this can be an expensive operation). Unfortunately, this makes it difficult to track line by line memory usage with tools like memory_profiler. There are a couple of options:
You could use this library function, pyarrow.total_allocated_bytes to track allocation instead of using memory_profiler.
You can also put the following line at the top of your script; this will configure jemalloc to release memory immediately instead of holding on to it (this will likely have some performance implications). However, I tried it and it did not work for me.
import pyarrow as pa
pa.jemalloc_set_decay_ms(0)
The behavior you are seeing is pretty typical for jemalloc. For further reading, you can also see these other issues for more discussions and examples of jemalloc behaviors:
https://issues.apache.org/jira/browse/ARROW-6910
https://issues.apache.org/jira/browse/ARROW-7305

pandas, HDFstore and memory usage through load/unload cycles

I happily use pandas to store and manipulate experimental data. Usually, I choose the HDF format (which I haven't mastered) via pd.HDFStore to save stuff.
My dataframes got bigger and bigger and some economy in memory is needed.
I read some of the guides linked in related questions, although I cannot achieve a sustainable memory consumption, e.g. in the following typical task of mine:
. load some `df` in memory (scale size is 10GB)
. do business with some other preloaded `df`
. unload
. repeat
Apparently I keep on failing in the unloading stage.
Hence, I would like you to consider the following experiments.
(From fresh started kernel (in ipython notebook, if that matters))
import pandas as pd
for idx in range(6):
    print idx
    store = pd.HDFStore('detection_DB_N.h5')
    detection_DB = store['detection_DB']
    store.close()
    del detection_DB
stats (from top):
. memory used by first iteration ~8GB
. memory used at the end of execution ~10GB (6 cycles)
Then, in the same kernel, I run
for idx in range(6):
    print idx
    store = pd.HDFStore('detection_DB_N.h5')
    detection_DB = store['detection_DB']
    store.close()
    #del detection_DB #SAME AS BEFORE, BUT I DON'T del
stats:
. memory used at the end of execution ~15GB
Calling del detection_DB doesn't make any difference in memory (CPU usage goes high for some 5sec).
Analogously, calling
import gc
gc.collect()
doesn't make any relevant difference.
I'll add, for what it's worth, that by repeating the previous calls I ended up with ~20GB occupied (and no loaded object to play with).
Can anyone shed some light?
How can I achieve ~0GB (or so) occupied after del?

How to free memory after opening a file in Python

I'm opening a 3 GB file in Python to read strings. I then store this data in a dictionary. My next goal is to build a graph using this dictionary so I'm closely monitoring memory usage.
It seems to me that Python loads the whole 3 GB file into memory and I can't get rid of it. My code looks like that :
import time
with open(filename) as data:
    accounts = dict()
    for line in data:
        username = line.split()[1]
        IP = line.split()[0]
        try:
            accounts[username].add(IP)
        except KeyError:
            accounts[username] = set()
            accounts[username].add(IP)
print "The accounts will be deleted from memory in 5 seconds"
time.sleep(5)
accounts.clear()
print "The accounts have been deleted from memory"
time.sleep(5)
print "End of script"
The last lines are there so that I could monitor memory usage.
The script uses a bit more than 3 GB in memory. Clearing the dictionary frees around 300 MB. When the script ends, the rest of the memory is freed.
I'm using Ubuntu and I've monitored memory usage using both "System Monitor" and the "free" command in terminal.
What I don't understand is why does Python need so much memory after I've cleared the dictionary. Is the file still stored in memory ? If so, how can I get rid of it ? Is it a problem with my OS not seeing freed memory ?
EDIT : I've tried to force a gc.collect() after clearing the dictionary, to no avail.
EDIT2 : I'm running Python 2.7.3 on Ubuntu 12.04.LTS
EDIT3 : I realize I forgot to mention something quite important. My real problem is not that my OS does not "get back" the memory used by Python. It's that later on, Python does not seem to reuse that memory (it just asks for more memory to the OS).

this really does make no sense to me either, and I wanted to figure out how/why this happens. ( i thought that's how this should work too! ) i replicated it on my machine - though with a smaller file.
i saw two discrete problems here
why is Python reading the file into memory ( with lazy line reading, it shouldn't - right ? )
why isn't Python freeing up memory to the system
I'm not knowledgeable at all on the Python internals, so I just did a lot of web searching. All of this could be completely off the mark. ( I barely develop anymore , have been on the biz side of tech for the past few years )
Lazy line reading...
I looked around and found this post -
http://www.peterbe.com/plog/blogitem-040312-1
it's from a much earlier version of python, but this line resonated with me:
readlines() reads in the whole file at once and splits it by line.
then i saw this , also old, effbot post:
http://effbot.org/zone/readline-performance.htm
the key takeaway was this:
For example, if you have enough memory, you can slurp the entire file into memory, using the readlines method.
and this:
In Python 2.2 and later, you can loop over the file object itself. This works pretty much like readlines(N) under the covers, but looks much better
looking at pythons docs for xreadlines [ http://docs.python.org/library/stdtypes.html?highlight=readline#file.xreadlines ]:
This method returns the same thing as iter(f)
Deprecated since version 2.3: Use for line in file instead.
it made me think that perhaps some slurping is going on.
so if we look at readlines [ http://docs.python.org/library/stdtypes.html?highlight=readline#file.readlines ]...
Read until EOF using readline() and return a list containing the lines thus read.
and it sort of seems like that's what's happening here.
readline , however, looked like what we wanted [ http://docs.python.org/library/stdtypes.html?highlight=readline#file.readline ]
Read one entire line from the file
so i tried switching this to readline, and the process never grew above 40MB ( it was growing to 200MB, the size of the log file , before )
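accounts = dict()
data= open(filename)
# note: data.readline() returns a single string, so this loop walks over
# the characters of the first line only, not over the lines of the file
for line in data.readline():
    info = line.split("LOG:")
    if len(info) == 2 :
        ( a , b ) = info
        try:
            accounts[a].add(True)
        except KeyError:
            accounts[a] = set()
            accounts[a].add(True)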
my guess is that we're not really lazy-reading the file with the for x in data construct -- although all the docs and stackoverflow comments suggest that we are. readline() consumed significantly less memory for me, and readlines consumed approximately the same amount of memory as for line in data
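One caveat about that comparison: a loop over data.readline() does not visit every line. A small sketch (Python 3, with a hypothetical helper name compare_iteration) shows what each loop actually iterates over, which may explain part of the memory difference:

```python
import os
import tempfile

def compare_iteration(path):
    """Return (items from iterating the file, items from iterating f.readline())."""
    with open(path) as f:
        by_line = [line for line in f]              # one item per line, read lazily
    with open(path) as f:
        by_readline = [item for item in f.readline()]  # characters of line 1 only
    return by_line, by_readline
```

Calling readline() repeatedly until it returns an empty string (for example via iter(f.readline, '')) would be the line-by-line equivalent.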
freeing memory
in terms of freeing up memory, I'm not familiar much with Python's internals, but I recall back from when I worked with mod_perl... if I opened up a file that was 500MB, that apache child grew to that size. if I freed up the memory, it would only be free within that child -- garbage collected memory was never returned to the OS until the process exited.
so i poked around on that idea , and found a few links that suggest this might be happening:
http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm
If you create a large object and delete it again, Python has probably released the memory, but the memory allocators involved don’t necessarily return the memory to the operating system, so it may look as if the Python process uses a lot more virtual memory than it actually uses.
that was sort of old, and I found a bunch of random (accepted) patches afterwards into python that suggested the behavior was changed and that you could now return memory to the os ( as of 2005 when most of those patches were submitted and apparently approved ).
then i found this posting http://objectmix.com/python/17293-python-memory-handling.html -- and note the comment #4
"""- Patch #1123430: Python's small-object allocator now returns an arena to the system free() when all memory within an arena becomes unused again. Prior to Python 2.5, arenas (256KB chunks of memory) were never freed. Some applications will see a drop in virtual memory size now, especially long-running applications that, from time to time, temporarily use a large number of small objects. Note that when Python returns an arena to the platform C's free(), there's no guarantee that the platform C library will in turn return that memory to the operating system. The effect of the patch is to stop making that impossible, and in tests it appears to be effective at least on Microsoft C and gcc-based systems. Thanks to Evan Jones for hard work and patience.
So with 2.4 under linux (as you tested) you will indeed not always get
the used memory back, with respect to lots of small objects being
collected.
The difference therefore (I think) you see between doing an f.read() and
an f.readlines() is that the former reads in the whole file as one large
string object (i.e. not a small object), while the latter returns a list
of lines where each line is a python object.
if the 'for line in data:' construct is essentially wrapping readlines and not readline, maybe this has something to do with it? perhaps it's not a problem of having a single 3GB object, but instead having millions of 30k objects.

Which version of Python are you trying this on?
I did a test on Python 2.7/Win7, and it worked as expected, the memory was released.
Here I generate sample data like yours:
import random
fn = random.randint
with open('ips.txt', 'w') as f:
    for i in xrange(9000000):
        f.write('{0}.{1}.{2}.{3} username-{4}\n'.format(
            fn(0,255),
            fn(0,255),
            fn(0,255),
            fn(0,255),
            fn(0, 9000000),
        ))
And then your script. I replaced dict by defaultdict because throwing exceptions makes the code slower:
import time
from collections import defaultdict
def read_file(filename):
    with open(filename) as data:
        accounts = defaultdict(set)
        for line in data:
            IP, username = line.split()[:2]
            accounts[username].add(IP)
    print "The accounts will be deleted from memory in 5 seconds"
    time.sleep(5)
    accounts.clear()
    print "The accounts have been deleted from memory"
    time.sleep(5)
    print "End of script"
if __name__ == '__main__':
    read_file('ips.txt')
In my test, memory reached 1.4G and was then released, leaving 36MB.
Using your original script I got the same results, but a bit slower.

There is a difference between when Python releases memory for reuse by Python and when it releases memory back to the OS. Python has internal pools for some kinds of objects; it will reuse these itself but doesn't give them back to the OS.
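To see the Python-side half of that distinction, one option is the standard tracemalloc module, which reports what Python's allocator currently holds rather than what the OS shows for the process. A rough sketch (Python 3; the function name and sample data are made up for illustration):

```python
import tracemalloc

def measure_alloc_and_release(n=100000):
    """Return (bytes_while_alive, bytes_after_clear) as traced by Python."""
    tracemalloc.start()
    # Build a dict of sets, roughly like the accounts structure in the question.
    accounts = {"user-%d" % i: {"10.0.0.%d" % (i % 256)} for i in range(n)}
    alive, _peak = tracemalloc.get_traced_memory()
    accounts.clear()
    released, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return alive, released
```

If the traced figure drops after clear() while top or free still shows a large process, the memory has been released inside Python but not handed back to the OS.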

The gc module may be useful, particularly the collect function. I have never used it myself, but from the documentation, it looks like it may be useful. I would try running gc.collect() before you run accounts.clear().
