pandas, HDFStore and memory usage through load/unload cycles - python

I happily use pandas to store and manipulate experimental data. Usually I choose the HDF format (which I don't fully master) via pd.HDFStore to save stuff.
My dataframes keep getting bigger and bigger, so some economy in memory is needed.
I have read some of the guides linked in related questions, but I still cannot achieve sustainable memory consumption, e.g. in the following typical task of mine:
. load some `df` in memory (scale size is 10GB)
. do business with some other preloaded `df`
. unload
. repeat
Apparently I keep failing at the unloading stage.
Hence, I would like you to consider the following experiments.
(From a freshly started kernel, in an IPython notebook, if that matters.)
import pandas as pd

for idx in range(6):
    print idx
    store = pd.HDFStore('detection_DB_N.h5')
    detection_DB = store['detection_DB']
    store.close()
    del detection_DB
stats (from top):
. memory used by first iteration ~8GB
. memory used at the end of execution ~10GB (6 cycles)
Then, in the same kernel, I run
for idx in range(6):
    print idx
    store = pd.HDFStore('detection_DB_N.h5')
    detection_DB = store['detection_DB']
    store.close()
    #del detection_DB  # SAME AS BEFORE, BUT I DON'T del
stats:
. memory used at the end of execution ~15GB
Calling del detection_DB afterwards doesn't make any difference in memory (CPU usage goes high for about 5 seconds).
Analogously, calling
import gc
gc.collect()
doesn't make any relevant difference.
For what it's worth, I'll add that by repeating the previous calls I ended up with ~20GB occupied (and no loaded object left to play with).
Can anyone shed some light?
How can I achieve ~0GB (or so) occupied after del?
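For reference, one pattern that does reliably hand memory back to the OS after each cycle is to run the load/process step in a short-lived worker process, so that everything the child allocated is released when it exits. A minimal sketch (the process_one_cycle worker is hypothetical; it assumes the per-file work can be isolated in a function):

import multiprocessing as mp
import pandas as pd

def process_one_cycle(path):
    # Load the frame, do business with it, and let the process exit;
    # the OS reclaims all of the child's memory at that point.
    store = pd.HDFStore(path)
    detection_DB = store['detection_DB']
    store.close()
    # ... do business with detection_DB here ...

if __name__ == '__main__':
    for idx in range(6):
        p = mp.Process(target=process_one_cycle, args=('detection_DB_N.h5',))
        p.start()
        p.join()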

Related

Gigantic memory use in example pytorch program. Why?

I have been trying to debug a program using vast amounts of memory and have distilled it into the following example:
# Caution, use carefully, this can utilise all available memory on your computer
# and render it effectively unresponsive, to the point where you cannot access
# the shell to kill the process; thus requiring reboot.
import numpy as np
import collections
import torch
# q = collections.deque(maxlen=1500) # Uses around 6.4GB
# q = collections.deque(maxlen=3000) # Uses around 12GB
q = collections.deque(maxlen=5000) # Uses around 18GB
def f():
    nparray = np.zeros([4,84,84], dtype=np.uint8)
    q.append(nparray)
    nparray1 = np.zeros([32,4,84,84], dtype=np.float32)
    tens = torch.tensor(nparray1, dtype=torch.float32)

while True:
    f()
Please note the cautionary message in the 1st line of this program. If you set maxlen to a level where it uses too much of your available RAM, it can crash your computer.
I measured the memory using top (VIRT column), and its memory use seems wildly excessive (details on the commented lines above). From previous experience with my original program, if maxlen is high enough it will crash my computer.
Why is it using so much memory?
I calculate the increase in expected memory from maxlen=1500 to maxlen=3000 to be:
4 * 84 * 84 * 1500 / (1024**2) ≈ 40MB.
But we see an increase of 6GB.
There seems to be some sort of interaction between the deque and the tensor allocation, as commenting out either one brings memory use back to what I'd expect; e.g. commenting out the tensor line leads to total memory use of 2GB, which seems much more reasonable.
Thanks for any help or insight,
Julian.
I think PyTorch stores and updates the computational graph each time you call f(), and thus the graph just keeps getting bigger and bigger.
Can you try to free the memory by using del tens (deleting the reference to the variable after use), and let me know how it works? (See the PyTorch documentation here: https://pytorch.org/docs/stable/notes/faq.html)
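In code, that suggestion amounts to dropping the reference inside f() as soon as the tensor is no longer needed; a sketch (same shapes as above, with the actual use of tens elided):

import collections
import numpy as np
import torch

q = collections.deque(maxlen=5000)

def f():
    nparray = np.zeros([4, 84, 84], dtype=np.uint8)
    q.append(nparray)
    nparray1 = np.zeros([32, 4, 84, 84], dtype=np.float32)
    tens = torch.tensor(nparray1, dtype=torch.float32)
    # ... use tens here ...
    del tens  # drop the reference once the tensor has been used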

Why does deleting columns or parts of a DataFrame increase memory usage, and how to ensure garbage collection on unused slices of DataFrame

When dealing with large DataFrames, you need to be careful with memory usage (for example you might want to download large data in chunks, process the chunks, and from then on delete all the unnecessary parts from memory).
I can't find any resources on the best procedures to deal with garbage collection in pandas, but I tried the following and got surprising results:
import os, psutil, gc
import numpy as np
import pandas as pd

def get_process_mem_usage():
    process = psutil.Process(os.getpid())
    print("{:.3f} GB".format(process.memory_info().rss / 1e9))

get_process_mem_usage()
# Out: 0.146 GB

cdf = pd.DataFrame({i: np.random.rand(int(1e7)) for i in range(10)})
get_process_mem_usage()
# Out: 0.946 GB
With the following globals() and their memory usage:
Size
cdf 781.25MB
_iii 1.05KB
_i1 1.05KB
_oh 240.00B
When I try to delete something, I get:
del cdf[1]
gc.collect()
get_process_mem_usage()
# Out: 1.668 GB
with a high process memory usage, but the following globals()
Size
cdf 703.13MB
_i1 1.05KB
Out 240.00B
_oh 240.00B
so some memory is still allocated but not used by any object in globals().
I've also seen weird results when doing something like
cdf2 = cdf.iloc[:,:5]
del cdf
which sometimes creates a new global with a name like "_5" and more memory usage than cdf had before (I'm not sure what this global refers to; perhaps some sort of object containing the no-longer referenced columns from cdf, but then why is it larger?).
Another option is to "delete" columns through one of:
cdf = cdf.iloc[:, :5]
# or
cdf = cdf.drop(columns=[...])
where the columns are no longer referenced by any object so they get dropped. But for me this doesn't seem to happen every time; I could swear I've seen my process take up the same amount of memory after this operation, even when I call gc.collect() afterwards. Though when I try to recreate this in a notebook it doesn't happen.
So I guess my questions are:
Why does deleting, as above, result in more memory usage?
What is the best way to ensure that no-longer-needed columns are removed from memory and properly garbage collected?
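One approach that should avoid lingering references (a sketch, not from the original post; it reuses the cdf from above and assumes an interactive session) is to materialise the columns you want to keep with an explicit .copy(), then drop every other reference before collecting:

import gc
import numpy as np
import pandas as pd

cdf = pd.DataFrame({i: np.random.rand(int(1e7)) for i in range(10)})

# Keep only the first five columns as a fresh, self-contained DataFrame.
# .copy() forces new blocks, so nothing in `keep` points back at cdf's data.
keep = cdf[[0, 1, 2, 3, 4]].copy()

del cdf
gc.collect()

In a notebook, IPython's output cache can also keep old frames alive: the Out dict and the _N globals (which is likely where the "_5" name comes from) hold references to any displayed result, and %xdel or %reset_selective can be used to clear those entries.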

How does numpy's memmap copy-on-write mode work?

I'm confused by how numpy's memmap handles changes to data when using copy-on-write (mmap_mode=c). Since nothing is written to the original array on disk, I'm expecting that it has to store all changes in memory, and thus could run out of memory if you modify every single element. To my surprise, it didn't.
I am trying to reduce my memory usage for my machine learning scripts which I run on a shared cluster (the less mem each instance takes, the more instances I can run at the same time). My data are very large numpy arrays (each > 8 Gb). My hope is to use np.memmap to work with these arrays with small memory (<4Gb available).
However, each instance might modify the data differently (e.g. might choose to normalize the input data differently each time). This has implications for storage space. If I use the r+ mode, then normalizing the array in my script will permanently change the stored array.
Since I don't want redundant copies of the data, and just want to store the original data on disk, I thought I should use the 'c' mode (copy-on-write) to open the arrays. But then where do your changes go? Are the changes kept just in memory? If so, if I change the whole array won't I run out of memory on a small-memory system?
Here's an example of a test which I expected to fail:
On a large memory system, create the array:
import numpy as np
GB = 1000**3
GiB = 1024**3
a = np.zeros((50000, 20000), dtype='float32')
bytes = a.size * a.itemsize
print('{} GB'.format(bytes / GB))
print('{} GiB'.format(bytes / GiB))
np.save('a.npy', a)
# Output:
# 4.0 GB
# 3.725290298461914 GiB
Now, on a machine with just 2 Gb of memory, this fails as expected:
a = np.load('a.npy')
But these two will succeed, as expected:
a = np.load('a.npy', mmap_mode='r+')
a = np.load('a.npy', mmap_mode='c')
Issue 1: I run out of memory running this code, trying to modify the memmapped array (fails regardless of r+/c mode):
for i in range(a.shape[0]):
    print('row {}'.format(i))
    a[i,:] = i*np.arange(a.shape[1])
Why does this fail (especially, why does it fail even in r+ mode, where it can write to the disk)? I thought memmap would only load pieces of the array into memory?
Issue 2: When I force numpy to flush the changes every once in a while, both r+ and c mode successfully finish the loop. But how can c mode do this? I didn't think flush() would do anything for c mode. The changes aren't written to disk, so they are kept in memory, and yet somehow all the changes, which must be over 3Gb, don't cause out-of-memory errors?
for i in range(a.shape[0]):
    if i % 100 == 0:
        print('row {}'.format(i))
        a.flush()
    a[i,:] = i*np.arange(a.shape[1])
Numpy isn't doing anything clever here; it's just deferring to the built-in mmap module, whose access argument:
accepts ACCESS_READ, ACCESS_WRITE, or ACCESS_COPY to specify read-only, write-through or copy-on-write memory respectively.
On Linux, this works by calling the mmap system call with MAP_PRIVATE:
Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file.
Regarding your question
The changes aren't written to disk, so they are kept in memory, and yet somehow all the changes, which must be over 3Gb, don't cause out-of-memory errors?
The changes likely are written to disk, but just not to the file you opened. They're likely paged into virtual memory somewhere.
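As a concrete check (my own sketch, reusing the a.npy file created above and assuming psutil is installed), dirtying pages of a copy-on-write mapping shows up as private, anonymous memory in the process (which can be pushed to swap), while the file on disk stays unchanged:

import os
import numpy as np
import psutil

proc = psutil.Process(os.getpid())

a = np.load('a.npy', mmap_mode='c')   # copy-on-write mapping
print('before: {:.2f} GB rss'.format(proc.memory_info().rss / 1e9))

a[:1000, :] = 1.0                     # dirtied pages become private anonymous memory
print('after:  {:.2f} GB rss'.format(proc.memory_info().rss / 1e9))

# The file itself is untouched: a fresh read-only mapping still sees the original zeros.
b = np.load('a.npy', mmap_mode='r')
print(b[:1000, :].max())              # 0.0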

Memory leak from pyarrow?

While parsing a larger file, I need to write to a large number of parquet files successively in a loop. However, it appears that the memory consumed by this task increases with each iteration, whereas I would expect it to remain constant (as nothing should be appended in memory). This makes it tricky to scale.
I've added a minimal reproducible example which creates 10,000 parquet files and appends to them in a loop.
import resource
import random
import string
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

schema = pa.schema([
    pa.field('test', pa.string()),
])

resource.setrlimit(resource.RLIMIT_NOFILE, (1000000, 1000000))

number_files = 10000
number_rows_increment = 1000
number_iterations = 100

writers = [pq.ParquetWriter('test_' + id_generator() + '.parquet', schema) for i in range(number_files)]

for i in range(number_iterations):
    for writer in writers:
        table_to_write = pa.Table.from_pandas(
            pd.DataFrame({'test': [id_generator() for i in range(number_rows_increment)]}),
            preserve_index=False,
            schema=schema,
            nthreads=1)
        table_to_write = table_to_write.replace_schema_metadata(None)
        writer.write_table(table_to_write)
    print(i)

for writer in writers:
    writer.close()
Would anyone have any idea what causes this leak and how to prevent it?
We aren't sure what is wrong, but some other users have reported as yet undiagnosed memory leaks. I added your example to one of the tracking JIRA issues https://issues.apache.org/jira/browse/ARROW-3324
Update in 2022:
I've spent several days on a memory leak issue with pyarrow. Please see here for a better understanding. I'll paste the key points below. Basically, they are saying it is not a library memory-leak issue; rather it is common allocator behavior.
Pyarrow uses jemalloc, a custom memory allocator which does its best to hold onto memory allocated from the OS (since this can be an expensive operation). Unfortunately, this makes it difficult to track line by line memory usage with tools like memory_profiler. There are a couple of options:
You could use the library function pyarrow.total_allocated_bytes to track allocations instead of using memory_profiler.
You can also put the following line at the top of your script; this will configure jemalloc to release memory immediately instead of holding on to it (this will likely have some performance implications). However, when I tried it, it did not help in my case.
import pyarrow as pa
pa.jemalloc_set_decay_ms(0)
The behavior you are seeing is pretty typical for jemalloc. For further reading, you can also see these other issues for more discussions and examples of jemalloc behaviors:
https://issues.apache.org/jira/browse/ARROW-6910
https://issues.apache.org/jira/browse/ARROW-7305
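Putting the two suggestions together, a minimal sketch (my own illustration with a single hypothetical writer, not the original 10,000-writer setup) looks like this:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Ask jemalloc to return freed memory to the OS right away (may cost some speed).
pa.jemalloc_set_decay_ms(0)

schema = pa.schema([pa.field('test', pa.string())])
writer = pq.ParquetWriter('test_single.parquet', schema)

for i in range(100):
    table = pa.Table.from_pandas(
        pd.DataFrame({'test': ['row-{}'.format(j) for j in range(1000)]}),
        preserve_index=False)
    writer.write_table(table)
    # Track Arrow's own allocations; process RSS is inflated by jemalloc's caching.
    print(i, pa.total_allocated_bytes())

writer.close()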

Memory leak in Pandas.groupby.apply()?

I'm currently using Pandas for a project with csv source files of around 600mb. During the analysis I am reading in the csv to a dataframe, grouping on some column and applying a simple function to the grouped dataframe. I noticed that I was going into Swap Memory during this process and so carried out a basic test:
I first created a fairly large dataframe in the shell:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3000000, 3),index=range(3000000),columns=['a', 'b', 'c'])
I defined a pointless function called do_nothing():
def do_nothing(group):
    return group
And ran the following command:
df = df.groupby('a').apply(do_nothing)
My system has 16gb of RAM and is running Debian (Mint). After creating the dataframe I was using ~600mb of RAM. As soon as the apply method began to execute, that value started to soar. It steadily climbed up to around 7gb(!) before finishing the command and settling back down to 5.4gb (while the shell was still active). The problem is, my work requires doing more than the 'do_nothing' method and as such while executing the real program, I cap my 16gb of RAM and start swapping, making the program unusable. Is this intended? I can't see why Pandas should need 7gb of RAM to effectively 'do_nothing', even if it has to store the grouped object.
Any ideas on what's causing this/how to fix it?
Cheers,
.P
Using 0.14.1, I don't think there is a memory leak (1/3 the size of your frame).
In [79]: df = DataFrame(np.random.randn(100000,3))
In [77]: %memit -r 3 df.groupby(df.index).apply(lambda x: x)
maximum of 3: 1365.652344 MB per loop
In [78]: %memit -r 10 df.groupby(df.index).apply(lambda x: x)
maximum of 10: 1365.683594 MB per loop
Two general comments on how to approach a problem like this:
1) use the cython-level functions if at all possible; they will be MUCH faster, and will use much less memory. IOW, it is almost always worth it to decouple a groupby expression and avoid using a function (if possible; some things are just too complicated, but that's the point, you want to break things down). e.g.
Instead of:
df.groupby(...).apply(lambda x: x.sum() / x.mean())
It is MUCH better to do:
g = df.groupby(...)
g.sum() / g.mean()
2) You can easily 'control' the groupby by doing your aggregation manually (additionally this will allow periodic output and garbage collection if needed).
results = []
for i, (g, grp) in enumerate(df.groupby(....)):
    if i % 500 == 0:
        print "checkpoint: %s" % i
        gc.collect()
    results.append(func(g, grp))

# final result
pd.concat(results)
