Python Pandas leaking memory?

I have a script that is constantly measuring some data and regularly storing it in a file. In the past I was storing the data in a "manually created CSV" file in this way (pseudocode):
with open('data.csv', 'w') as ofile:
    print('var1,var2,var3,...,varN', file=ofile)  # Create the CSV header.
    while measure:
        do_something()
        print(f'{var1},{var2},{var3},...,{varN}', file=ofile)  # Store one row of data.
I worked in this way for several months and ran this script several hundred times with no issues, other than that 1) it is cumbersome (and error prone) when N is large (in my case between 20 and 30) and 2) CSV does not preserve data types. So I decided to change to something like this:
temporary_df = pandas.DataFrame()
while measure:
    do_something()
    temporary_df = temporary_df.append({'var1': var1, 'var2': var2, ..., 'varN': varN}, ignore_index=True)
    if save_data_in_this_iteration():
        temporary_df.to_feather(f'file_{datetime.datetime.now()}.fd')
        temporary_df = pandas.DataFrame()  # Clean the dataframe.
merge_all_feathers_into_single_feather()
At first glance this was working perfectly, as I expected. However, after some hours Python crashes. After experiencing this on both a Windows machine and a (separate) Linux machine, I noticed that the problem is that Python slowly eats up the machine's memory until there is none left, and then of course it crashes.
Since the function do_something is unchanged between the two approaches, the crash happens before merge_all_feathers_into_single_feather is ever called, and save_data_in_this_iteration is trivially simple, I am blaming Pandas for this problem.
Google has told me that other people have had memory problems while using Pandas in the past. I have tried adding a garbage-collector call in each iteration, as suggested e.g. here, but it did not work for me. I haven't tried the multiprocessing approach yet because it looks like killing an ant with a nuke, and it may bring other complications...
Is there any solution to keep using Pandas like this? Is there a better solution to this without using Pandas? Which?

Pandas was not the problem
After struggling with this problem for a while, I decided to create an MWE (minimal working example) to do some tests. So I wrote this:
import pandas
import numpy
import datetime
df = pandas.DataFrame()
while True:
    df = df.append({f'col_{i}': numpy.random.rand() for i in range(99)}, ignore_index=True)
    if 'last_clean' not in locals() or (datetime.datetime.now() - last_clean).seconds > .5:
        last_clean = datetime.datetime.now()
        df.to_feather('delete_me.fd')
        df = df[0:0]
To my surprise, this script does not drain the memory! So at this point I concluded that Pandas was not my problem.
Then I added a new component to the MWE and I found the issue:
import pandas
import numpy
import datetime
import matplotlib.pyplot as plt
def save_matplotlib_plot(df):
    fig, ax = plt.subplots()
    ax.plot(df['col_1'], df['col_2'])
    fig.savefig('delete_me.png')
    # Uncomment the following two lines to release the memory and stop the "leak":
    # fig.clear()
    # plt.close(fig)

df = pandas.DataFrame()
while True:
    df = df.append({f'col_{i}': numpy.random.rand() for i in range(99)}, ignore_index=True)
    if 'last_clean' not in locals() or (datetime.datetime.now() - last_clean).seconds > .5:
        last_clean = datetime.datetime.now()
        df.to_feather('delete_me.fd')
        save_matplotlib_plot(df)  # Here was my "leak" (not really a leak: matplotlib keeps track of all the figures it creates, so it was working as expected).
        df = df[0:0]
It seems that when I switched from "handmade CSV" to "Pandas" I also changed something with the plots, so I was blaming Pandas when it was not the problem.
Just for completeness, the multiprocessing solution also works. The following script has no memory issues:
import pandas
import numpy
import datetime
import matplotlib.pyplot as plt
from multiprocessing import Process
def save_matplotlib_plot(df):
    fig, ax = plt.subplots()
    ax.plot(df['col_1'], df['col_2'])
    fig.savefig('delete_me.png')

df = pandas.DataFrame()
while True:
    df = df.append({f'col_{i}': numpy.random.rand() for i in range(99)}, ignore_index=True)
    if 'last_clean' not in locals() or (datetime.datetime.now() - last_clean).seconds > .5:
        last_clean = datetime.datetime.now()
        df.to_feather('delete_me.fd')
        p = Process(target=save_matplotlib_plot, args=(df,))
        p.start()
        p.join()
        df = df[0:0]
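One possible variation, not part of the original fix: since DataFrame.append copies the whole frame on every call, the rows could instead be buffered in a plain Python list and the DataFrame built only when saving. A rough sketch, reusing the same hypothetical helpers from the question (measure, do_something, save_data_in_this_iteration, merge_all_feathers_into_single_feather):

import datetime
import pandas

rows = []  # Buffer plain dicts; building one DataFrame per save is cheaper than appending row by row.
while measure:
    do_something()
    rows.append({'var1': var1, 'var2': var2, 'varN': varN})  # ...and so on up to varN.
    if save_data_in_this_iteration():
        pandas.DataFrame(rows).to_feather(f'file_{datetime.datetime.now()}.fd')
        rows = []
merge_all_feathers_into_single_feather()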

Related

How to use pandas .describe()

I have used pandas in the past, but I have recently run into a problem where my code is not displaying the output of .head() or .describe(). I copied the code below from another website and it is still not displaying anything. Any help is appreciated.
import pandas as pd
import tensorflow as tf
from matplotlib import pyplot as plt
pd.options.display.max_rows = 10
pd.options.display.float_format = "{:.1f}".format
training_df = pd.read_csv(filepath_or_buffer="california_housing_train.csv")
training_df["median_house_value"] /= 1000.0
training_df.describe(include = 'all')
Your code will work in a notebook or REPL, where the last expression of a cell is displayed automatically, but it doesn't actually print anything. When running it as a plain script, make sure to call the print() function to see the output.
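For example, a rough sketch of the same script with explicit printing (assuming california_housing_train.csv is in the working directory; the unused tensorflow import is dropped):

import pandas as pd

pd.options.display.max_rows = 10
pd.options.display.float_format = "{:.1f}".format

training_df = pd.read_csv("california_housing_train.csv")
training_df["median_house_value"] /= 1000.0

# describe() returns a DataFrame; outside a notebook it must be printed explicitly.
print(training_df.head())
print(training_df.describe(include="all"))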

Dask prints warning to use client.scatter although I'm using the suggested approach

In dask distributed I get the following warning, which I would not expect:
/home/miniconda3/lib/python3.6/site-packages/distributed/worker.py:739: UserWarning: Large object of size 1.95 MB detected in task graph:
  (['int-58e78e1b34eb49a68c65b54815d1b158', 'int-5cd ... 161071d7ae7'],)
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good

  % (format_bytes(len(b)), s))
The reason I'm surprised is that I'm doing exactly what the warning suggests:
import dask.dataframe as dd
import pandas
from dask.distributed import Client, LocalCluster
c = Client(LocalCluster())
dask_df = dd.from_pandas(pandas.DataFrame.from_dict({'A':[1,2,3,4,5]*1000}), npartitions=10)
filter_list = c.scatter(list(range(2,100000,2)))
mask = c.submit(dask_df['A'].isin, filter_list)
dask_df[mask.result()].compute()
So my question is: Am I doing something wrong or is this a bug?
pandas='0.22.0'
dask='0.17.0'
The main reason why dask is complaining isn't the list, it's the pandas dataframe inside the dask dataframe.
dask_df = dd.from_pandas(pandas.DataFrame.from_dict({'A':[1,2,3,4,5]*1000}), npartitions=10)
You are creating a biggish amount of data locally when you create a pandas dataframe in your local session. Then you operate with it on the cluster. This will require moving your pandas dataframe to the cluster.
You're welcome to ignore these warnings, but in general I would not be surprised if performance here is worse than with pandas alone.
There are a few other things going on here. Your scatter of a list produces a bunch of futures, which may not be what you want. You're calling submit on a dask object, which is usually unnecessary.
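For reference, a minimal sketch of the pattern the warning itself recommends, applied to the dataframe rather than to the list (func here is just an illustrative placeholder):

import pandas
from dask.distributed import Client, LocalCluster

client = Client(LocalCluster())

big_data = pandas.DataFrame.from_dict({'A': [1, 2, 3, 4, 5] * 1000})

def func(df):
    # Placeholder for whatever work should run on a worker.
    return df['A'].sum()

# Bad: big_data gets serialized into the task graph itself.
future = client.submit(func, big_data)

# Good: move the data to a worker once, then pass the resulting future around.
big_future = client.scatter(big_data)
future = client.submit(func, big_future)
print(future.result())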

Python3 not handling matplotlib plot when using a multiprocess pool

I have a small script creating different plots. Since no data are shared, I can do some multiprocessing. Using Python 2.7, no problem. With Python 3.6, I can't seem to make it work.
I am using a pool (https://docs.python.org/3/library/multiprocessing.html and https://docs.python.org/2/library/multiprocessing.html) since I do not share objects or anything.
With Python 3, I get a crash without a traceback at the line fig = plt.figure(number).
I am running on Mac OS X Sierra. I believe the problem is the same as in this topic (Saving multiple matplotlib figures with multiprocessing). Unfortunately, the problem wasn't really addressed there, as it was not the main issue.
One quick answer would be to use Python 2.7, but other pieces of my work rely on Python 3+ features.
Any idea on how to get at least a traceback (verbose mode didn't show anything related to the crash), and then how to solve this issue?
Many thanks
Here is the smallest code producing the error, coming from the thread mentioned above. (This code will create 4 files in the folder of the script.)
import matplotlib.pyplot as plt
import numpy.random as random
from multiprocessing import Pool
def do_plot(number):
    fig = plt.figure(number)
    a = random.sample(100)
    b = random.sample(100)
    plt.scatter(a, b)
    plt.savefig("%03d.jpg" % (number,))
    plt.close()
    print("Done ", number)

if __name__ == '__main__':
    pool = Pool()
    pool.map(do_plot, range(4))
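A commonly cited workaround for this kind of crash (not taken from this thread, so treat it as an assumption) is to force the non-interactive Agg backend before pyplot is imported, since macOS GUI backends are not safe to use in forked worker processes; a minimal sketch:

import matplotlib
matplotlib.use('Agg')  # Select a non-interactive backend; GUI backends are not fork-safe on macOS.
import matplotlib.pyplot as plt
import numpy.random as random
from multiprocessing import Pool

def do_plot(number):
    fig = plt.figure(number)
    a = random.sample(100)
    b = random.sample(100)
    plt.scatter(a, b)
    plt.savefig("%03d.jpg" % (number,))
    plt.close(fig)
    print("Done ", number)

if __name__ == '__main__':
    with Pool() as pool:
        pool.map(do_plot, range(4))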

IPython notebook: How to write cell magic which can access notebook variables?

My question is: How can I write an IPython cell magic which has access to the namespace of the IPython notebook?
IPython allows writing user-defined cell magics. My plan is to create a plotting function which can plot one or more arbitrary Python expressions (expressions based on Pandas Series objects), whereby each line in the cell is drawn as a separate series in the chart.
This is the code of the cell magic:
def p(line, cell):
    import pandas as pd
    import matplotlib.pyplot as plt
    df = pd.DataFrame()
    line_list = cell.split('\n')
    counter = 0
    for line in line_list:
        df['series' + str(counter)] = eval(line)
        counter += 1
    plt.figure(figsize=[20, 6])
    ax = plt.subplot(111)
    df.plot(ax=ax)

def load_ipython_extension(ipython):
    ipython.register_magic_function(p, 'cell')
The function receives the entire cell contents as a string. This string is then split by line breaks and evaluated using eval(). The result is added to a Pandas DataFrame. Finally the DataFrame is plotted using matplotlib.
Usage example: First define the Pandas Series object in IPython notebook.
import pandas as pd
ts = pd.Series([1,2,3])
Then call the magic in IPython notebook (whereby the whole code below is one cell):
%%p
ts * 3
ts + 1
This code fails with the following error:
NameError: name 'ts' is not defined
I suspect the problem is that the p function only receives ts * 3\n ts + 1 as a string and that it does not have access to the ts variable defined in the namespace of IPython notebook (because the p function is defined in a separate .py file).
How does my code have to be changed so the cell magic has access to the ts variable defined in the IPython notebook (and therefore does not fail with the NameError)?
Use the @needs_local_scope decorator. Documentation is a bit sparse, but you can see how it is used in the IPython code base, and contributing to the docs would be welcome.
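A minimal sketch of how the extension file from the question could use it, assuming a reasonably recent IPython where cell magics honour needs_local_scope (the reworked loop is only for brevity):

import pandas as pd
import matplotlib.pyplot as plt
from IPython.core.magic import needs_local_scope

@needs_local_scope
def p(line, cell, local_ns=None):
    # local_ns is the notebook's namespace, so eval() can see variables such as ts.
    df = pd.DataFrame()
    for i, expr in enumerate(cell.strip().split('\n')):
        df['series' + str(i)] = eval(expr, local_ns)
    plt.figure(figsize=[20, 6])
    ax = plt.subplot(111)
    df.plot(ax=ax)

def load_ipython_extension(ipython):
    ipython.register_magic_function(p, 'cell')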
You could also use shell.user_ns from Magics. For example something like:
from IPython.core.magic import Magics

class MyClass(Magics):
    def myfunc(self):
        print(self.shell.user_ns)
See how it's used in code examples: here and here.

Memory leak in Pandas.groupby.apply()?

I'm currently using Pandas for a project with CSV source files of around 600 MB. During the analysis I read the CSV into a dataframe, group on some column and apply a simple function to the grouped dataframe. I noticed that I was going into swap memory during this process, so I carried out a basic test:
I first created a fairly large dataframe in the shell:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3000000, 3),index=range(3000000),columns=['a', 'b', 'c'])
I defined a pointless function called do_nothing():
def do_nothing(group):
    return group
And ran the following command:
df = df.groupby('a').apply(do_nothing)
My system has 16 GB of RAM and is running Debian (Mint). After creating the dataframe I was using ~600 MB of RAM. As soon as the apply method began to execute, that value started to soar. It steadily climbed to around 7 GB(!) before finishing the command and settling back down to 5.4 GB (while the shell was still active). The problem is that my work requires doing more than the do_nothing method, and so while executing the real program I cap my 16 GB of RAM and start swapping, making the program unusable. Is this intended? I can't see why Pandas should need 7 GB of RAM to effectively 'do nothing', even if it has to store the grouped object.
Any ideas on what's causing this/how to fix it?
Cheers,
.P
Using 0.14.1, I don't think there is a memory leak (1/3 size of your frame).
In [79]: df = DataFrame(np.random.randn(100000,3))
In [77]: %memit -r 3 df.groupby(df.index).apply(lambda x: x)
maximum of 3: 1365.652344 MB per loop
In [78]: %memit -r 10 df.groupby(df.index).apply(lambda x: x)
maximum of 10: 1365.683594 MB per loop
Two general comments on how to approach a problem like this:
1) Use a cython-level (built-in) aggregation function if at all possible; it will be MUCH faster and will use much less memory. IOW, it is almost always worth it to decouple a groupby expression and avoid passing a Python function (if possible; some things are just too complicated, but that's the point, you want to break things down). e.g.
Instead of:
df.groupby(...).apply(lambda x: x.sum() / x.mean())
It is MUCH better to do:
g = df.groupby(...)
g.sum() / g.mean()
2) You can easily 'control' the groupby by doing your aggregation manually (additionally this will allow periodic output and garbage collection if needed).
import gc

results = []
for i, (g, grp) in enumerate(df.groupby(...)):
    if i % 500 == 0:
        print("checkpoint: %s" % i)
        gc.collect()
    results.append(func(g, grp))  # func is your per-group function.
# final result
pd.concat(results)
