Memory leak in Pandas.groupby.apply()? - python

I'm currently using Pandas for a project with csv source files of around 600mb. During the analysis I am reading in the csv to a dataframe, grouping on some column and applying a simple function to the grouped dataframe. I noticed that I was going into Swap Memory during this process and so carried out a basic test:
I first created a fairly large dataframe in the shell:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3000000, 3),index=range(3000000),columns=['a', 'b', 'c'])
I defined a pointless function called do_nothing():
def do_nothing(group):
    return group
And ran the following command:
df = df.groupby('a').apply(do_nothing)
My system has 16gb of RAM and is running Debian (Mint). After creating the dataframe I was using ~600mb of RAM. As soon as the apply method began to execute, that value started to soar. It steadily climbed up to around 7gb(!) before finishing the command and settling back down to 5.4gb (while the shell was still active). The problem is, my work requires doing more than the 'do_nothing' method and as such while executing the real program, I cap my 16gb of RAM and start swapping, making the program unusable. Is this intended? I can't see why Pandas should need 7gb of RAM to effectively 'do_nothing', even if it has to store the grouped object.
Any ideas on what's causing this/how to fix it?
Cheers,
.P

Using 0.14.1, I don't think there is a memory leak (1/3 size of your frame).
In [79]: df = DataFrame(np.random.randn(100000,3))
In [77]: %memit -r 3 df.groupby(df.index).apply(lambda x: x)
maximum of 3: 1365.652344 MB per loop
In [78]: %memit -r 10 df.groupby(df.index).apply(lambda x: x)
maximum of 10: 1365.683594 MB per loop
Two general comments on how to approach a problem like this:
1) Use the cython-level functions if at all possible; they will be MUCH faster and will use much less memory. IOW, it is almost always worth it to decouple a groupby expression and avoid passing a Python function (if possible; some things are just too complicated, but that's the point, you want to break things down). e.g.
Instead of:
df.groupby(...).apply(lambda x: x.sum() / x.mean())
It is MUCH better to do:
g = df.groupby(...)
g.sum() / g.mean()
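You can sanity-check the difference on your own frame with memory_profiler's %memit (assuming it is loaded via %load_ext memory_profiler); exact numbers will depend on your data, but the decoupled form should peak much lower:
%load_ext memory_profiler
g = df.groupby(df.index)
%memit -r 3 g.sum() / g.mean()
%memit -r 3 df.groupby(df.index).apply(lambda x: x.sum() / x.mean())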
2) You can easily 'control' the groupby by doing your aggregation manually (additionally this will allow periodic output and garbage collection if needed).
import gc

results = []
for i, (g, grp) in enumerate(df.groupby(....)):
    if i % 500 == 0:
        print "checkpoint: %s" % i
        gc.collect()
    results.append(func(g, grp))

# final result
pd.concat(results)

Related

Possible overhead on dask computation over list of delayed objects

I have a ddf with lots of partitions
ddf = dd.read_parquet("./input-*", engine='fastparquet')
ddf
Dask DataFrame Structure:
datetime ndvi str utm_x utm_y fpath scl_value
npartitions=71
Dask Name: read-parquet, 71 tasks
In each partition I want to run a custom function
my_df_list = list()
for arg_key, arg_value in my_dict_of_args.items():
    ddf_item = ddf_sliced.map_partitions(myfunc,
                                         my_arg1=arg_key,
                                         my_arg2=arg_value,
                                         meta=my_meta)
    my_df_list.append(ddf_item)
Things start to get tricky there: the following command is too much for my PC, taking forever to even begin computing the first item and eventually depleting all my RAM:
dask.compute(*my_df_list)
Example graph using 2 dfs instead of 71, dask.visualize(*my_df_list):
But it can easily handle the computation of each partition, one by one:
my_df_list[0].compute()
...
my_df_list[71].compute()
Example graph using 2 dfs instead of 71, my_df_list[0].visualize():
I'm struggling to understand the difference, since to me it's the same iteration scheme.
If this is indeed overhead, I would be glad to learn about alternative flows that avoid calling .compute on each element manually.
EDIT 1
After posting the graph images I understand that dask.compute(*list) boosts parallelism to optimize the df readings. See the documentation section Avoid calling compute repeatedly.
Now I can see the real problem is the initialization of the graph, and probably my code: even when loading 2 dfs instead of 71, my memory is depleted long before the real computation starts when using dask.compute(*list).
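One middle ground I'm considering (just a sketch on my side; batch_size is an arbitrary knob, and each batch may re-read shared inputs) is computing the mapped frames a few at a time so the merged graph stays small:
import dask

def compute_in_batches(items, batch_size=8):
    # compute only batch_size collections at once, trading some of the
    # shared-read optimization for a bounded peak memory
    results = []
    for start in range(0, len(items), batch_size):
        results.extend(dask.compute(*items[start:start + batch_size]))
    return results

# my_results = compute_in_batches(my_df_list, batch_size=4)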

Why does deleting columns or parts of a DataFrame increase memory usage, and how to ensure garbage collection on unused slices of DataFrame

When dealing with large DataFrames, you need to be careful with memory usage (for example you might want to download large data in chunks, process the chunks, and from then on delete all the unnecessary parts from memory).
I can't find any resources on the best procedures to deal with garbage collection in pandas, but I tried the following and got surprising results:
import os, psutil, gc
import numpy as np
import pandas as pd

def get_process_mem_usage():
    process = psutil.Process(os.getpid())
    print("{:.3f} GB".format(process.memory_info().rss / 1e9))
get_process_mem_usage()
# Out: 0.146 GB
cdf = pd.DataFrame({i:np.random.rand(int(1e7)) for i in range(10)})
get_process_mem_usage()
# Out: 0.946 GB
With the following globals() and their memory usage:
Size
cdf 781.25MB
_iii 1.05KB
_i1 1.05KB
_oh 240.00B
When I try to delete something, I get:
del cdf[1]
gc.collect()
get_process_mem_usage()
# Out: 1.668 GB
with a high process memory usage, but the following globals()
Size
cdf 703.13MB
_i1 1.05KB
Out 240.00B
_oh 240.00B
so some memory is still allocated but not used by any object in globals().
I've also seen weird results when doing something like
cdf2 = cdf.iloc[:,:5]
del cdf
which sometimes creates a new global with a name like "_5" and more memory usage than cdf had before (I'm not sure what this global refers to; perhaps some sort of object containing the no-longer referenced columns from cdf, but why is it larger?).
Another option is to "delete" columns through one of:
cdf = cdf.iloc[:, :5]
# or
cdf = cdf.drop(columns=[...])
where the columns are no longer referenced by any object so they get dropped. But for me this doesn't seem to happen every time; I could swear I've seen my process take up the same amount of memory after this operation, even when I call gc.collect() afterwards. Though when I try to recreate this in a notebook it doesn't happen.
So I guess my questions are:
Why does the above happen, with deleting resulting in more memory usage?
What is the best way to ensure that no-longer-needed columns are deleted from memory and properly garbage collected?
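One thing I've been experimenting with (I'm not sure it's the "proper" way, and the .copy() itself costs a temporary spike) is forcing a real copy of the slice I want to keep, so nothing still references the original blocks, and then collecting:
import gc
import numpy as np
import pandas as pd

cdf = pd.DataFrame({i: np.random.rand(int(1e7)) for i in range(10)})

# keep only the first 5 columns as an independent copy, rebind the name so the
# old frame loses its last reference, then ask the collector to release it
cdf = cdf.iloc[:, :5].copy()
gc.collect()
Is forcing the copy like this actually enough to let the old blocks go, or is there a better pattern?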

Pandas: Why is Series indexing using .loc taking 100x longer on the first run when timing it?

I'm slicing a quite big pandas series (~5M) using .loc and I stumble upon some weird behavior when checking times in an attempt to optimize my code.
It's weird that the first slicing attempt like series_object.loc[some_indexes] is taking 100x longer than the following ones.
When I try timeit it does not reflect this behaviour, but when checking the partial laps using time, we can see that the first lap is taking much longer than the following ones.
Is .loc using some sort of caching? If so, why is garbage collection not influencing this?
Is timeit doing the caching even with the garbage collector disabled, and not behaving as it's supposed to?
Which time should I trust for how long my app will take when running in a live production environment?
I tried this on windows and linux machines using different versions of python (3.6, 3.7 and 2.7) and the behavior is always the same.
Thanks in advance for your help. This thing has been banging my head for a week already and I miss not doubting %timeit :)
to reproduce:
Save the following code to a python file, e.g. test_loc_times.py
import pandas as pd
import numpy as np
import timeit
import time, gc

def get_data():
    ids = np.arange(size_bigseries)
    big_series = pd.Series(index=ids, data=np.random.rand(len(ids)), name='{} elements series'.format(len(ids)))
    small_slice = np.arange(size_slice)
    return big_series, small_slice

# Method to test: a simple pandas slicing with .loc
def basic_loc_indexing(pd_series, slice_ids):
    return pd_series.loc[slice_ids].dropna()

# method to time it
def timing_it(func, n, *args):
    gcold = gc.isenabled()
    gc.disable()
    times = []
    for i in range(n):
        s = time.time()
        func(*args)
        times.append((time.time() - s) * 1000)
    if gcold:
        gc.enable()
    return times

if __name__ == '__main__':
    import sys
    n_tries = int(sys.argv[1]) if len(sys.argv) > 1 and sys.argv[1] is not None else 1000
    size_bigseries = int(sys.argv[2]) if len(sys.argv) > 2 and sys.argv[2] is not None else 5000000  # 5M
    size_slice = int(sys.argv[3]) if len(sys.argv) > 3 and sys.argv[3] is not None else 100  # 100 elements

    # 1: timeit()
    big_series, small_slice = get_data()
    time_with_timeit = timeit.timeit('basic_loc_indexing(big_series, small_slice)', "gc.disable(); from __main__ import basic_loc_indexing, big_series, small_slice", number=n_tries)
    print("using timeit: {:.6f}ms".format(time_with_timeit / n_tries * 1000))
    del big_series, small_slice

    # 2: time()
    big_series, small_slice = get_data()
    time_with_time = timing_it(basic_loc_indexing, n_tries, big_series, small_slice)
    print("using time: {:.6f}ms".format(np.mean(time_with_time)))
    print('head detail: {}\n'.format(time_with_time[:5]))
try out:
Run
python test_loc_times.py 1000 5000000 100
This will run timeit and time 1000 laps on slicing 100 elements from a 5M pandas.Series.
You can try it yourself with other values; the first run is always the one taking longer.
stdout:
>>> using timeit: 0.789754ms
>>> using time: 0.829869ms
>>> head detail: [145.02716064453125, 0.7691383361816406, 0.7028579711914062, 0.5738735198974609, 0.6380081176757812]
Weird right?
edit:
I found this answer which might be related. What do you think?
This code is likely not idempotent (has side effects that impact its execution).
timeit will run the code once first to measure the time and deduce the number of loops and runs it should use. If your code is not idempotent (has side effects, like caching) then that first run (not recorded) will be longer and the subsequent (faster) runs will be measured and reported.
You can take a look at the arguments you can pass to timeit (see the doc) to specify the number of loops and forgo that initial run.
Also note that (taken from the doc linked above):
The times reported by %timeit will be slightly higher than those reported by the timeit.py script when variables are accessed. This is due to the fact that %timeit executes the statement in the namespace of the shell, compared with timeit.py, which uses a single setup statement to import function or create variables. Generally, the bias does not matter as long as results from timeit.py are not mixed with those from %timeit.
Edit: Missed the fact that you were passing the number of runs to timeit. In that case, only the latter part of my answer applies, but the numbers you are seeing seem to point to another issue...
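One hypothesis worth testing here (just a sketch, and I'm assuming Index.get_indexer exercises the same lookup machinery as .loc, so it would pre-build whatever the first call is paying for): the first .loc call may be paying for building the index's lookup table, so warming the index up before timing should remove the outlier:
import time
import numpy as np
import pandas as pd

ids = np.arange(5000000)
big_series = pd.Series(index=ids, data=np.random.rand(len(ids)))
small_slice = np.arange(100)

# assumption: this forces the index's hash-table engine to be built up front
big_series.index.get_indexer(small_slice)

s = time.time()
big_series.loc[small_slice].dropna()
print("first .loc after warm-up: {:.3f}ms".format((time.time() - s) * 1000))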

Python Parallelize a Function with Multiple Inputs

New to python. Working with IPython.
I want to do some calculation on a pandas dataframe with a rolling window. The process looks like this:
def calculate_avg_ret_t(return_matrix, rolling_window, t):
    ret_t = return_matrix.iloc[np.arange(t - rolling_window + 1, t + 1, 1), ]
    avg_ret_t = ret_t.mean().mean()  # much more complicated in reality
    return avg_ret_t

return_matrix = pd.DataFrame(np.random.randn(10000, 10000))
rolling_window = 21
avg_ret_ts = []
for t in np.arange(rolling_window - 1, 10000, 1):
    %time avg_ret_t = calculate_avg_ret_t(return_matrix, rolling_window, t)
    avg_ret_ts.append(avg_ret_t)
The actual function executed within each for loop is much more complicated and time-consuming, hence the need for parallelization. Can this process be parallelized, and if so, what's the most user-friendly module to do that?
I realized the potential problem is that the function has to access the gigantic input return_matrix in each loop. Should I first transform that matrix into an R-list-like object, depending on rolling_window?
If the function is only dependent on the data in a given slice, then this would be easily parallelized. I would do the following:
1) Split the data set into N sets where N is the number of processors. The sets should overlap sufficiently.
2) Each processor compute the quantities on its own data subset.
You may want to look at using mpi4py in ipython. See for example https://ipython.org/ipython-doc/3/parallel/parallel_mpi.html. This would allow you to develop and debug parallel code quite easily.
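If you want to stay in the standard library rather than setting up MPI, the same split-with-overlap idea can be sketched with multiprocessing; the chunk bookkeeping below is my own, and I've shrunk the matrix so the example runs quickly:
import numpy as np
import pandas as pd
from multiprocessing import Pool

rolling_window = 21

def calculate_avg_ret_t(return_matrix, rolling_window, t):
    ret_t = return_matrix.iloc[np.arange(t - rolling_window + 1, t + 1, 1), ]
    return ret_t.mean().mean()  # stand-in for the real, heavier function

def worker(args):
    # each worker receives a sub-matrix that already includes the
    # rolling_window - 1 rows of overlap it needs on the left,
    # plus the number of windows it should compute
    sub_matrix, n_windows = args
    return [calculate_avg_ret_t(sub_matrix, rolling_window, t)
            for t in range(rolling_window - 1, rolling_window - 1 + n_windows)]

if __name__ == '__main__':
    return_matrix = pd.DataFrame(np.random.randn(10000, 10))  # smaller than the original 10000x10000
    n_proc = 4
    # window end positions rolling_window-1 .. len-1, split into n_proc chunks
    bounds = np.linspace(rolling_window - 1, len(return_matrix), n_proc + 1, dtype=int)
    jobs = [(return_matrix.iloc[b0 - rolling_window + 1:b1], b1 - b0)
            for b0, b1 in zip(bounds[:-1], bounds[1:])]
    with Pool(n_proc) as pool:
        avg_ret_ts = [v for chunk in pool.map(worker, jobs) for v in chunk]
    print(len(avg_ret_ts))  # 10000 - (rolling_window - 1) windows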

pandas, HDFstore and memory usage through load/unload cycles

I happily use pandas to store and manipulate experimental data. Usually, I choose the HDF format (which I don't master) via pd.HDFStore to save stuff.
My dataframes got bigger and bigger and some economy in memory is needed.
I read some of the guides linked in related questions, but I cannot achieve sustainable memory consumption, e.g. in the following typical task of mine:
. load some `df` in memory (scale size is 10GB)
. do business with some other preloaded `df`
. unload
. repeat
Apparently I keep on failing in the unloading stage.
Hence, I would like you to consider the following experiments.
(From a freshly started kernel (in an ipython notebook, if that matters))
import pandas as pd

for idx in range(6):
    print idx
    store = pd.HDFStore('detection_DB_N.h5')
    detection_DB = store['detection_DB']
    store.close()
    del detection_DB
stats (from top):
. memory used by first iteration ~8GB
. memory used at the end of execution ~10GB (6 cycles)
Then, in the same kernel, I run
for idx in range(6):
    print idx
    store = pd.HDFStore('detection_DB_N.h5')
    detection_DB = store['detection_DB']
    store.close()
    #del detection_DB  # SAME AS BEFORE, BUT I DON'T del
stats:
. memory used at the end of execution ~15GB
Calling del detection_DB doesn't make any difference in memory (CPU usage goes high for some 5 sec).
Analogously, calling
import gc
gc.collect()
doesn't make any relevant difference.
I'll add, for what it's worth, that repeating the previous calls, I ended up with ~20GB occupied (and no loaded object to play with).
Can anyone shed some light?
How can I achieve ~0GB (or so) occupied after del?
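One workaround that might be worth trying (I haven't verified it against this dataset): run each load/process cycle in a child process, so whatever the allocator holds on to is returned to the OS when that process exits:
import multiprocessing as mp
import pandas as pd

def one_cycle(path):
    store = pd.HDFStore(path)
    detection_DB = store['detection_DB']
    store.close()
    # ... do business with detection_DB here ...
    # return only small results; returning the whole frame would copy it back

if __name__ == '__main__':
    for idx in range(6):
        p = mp.Process(target=one_cycle, args=('detection_DB_N.h5',))
        p.start()
        p.join()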
