Python - Working around memory leaks

I have a Python program that runs a series of experiments, with no data intended to be stored from one test to another. My code contains a memory leak which I am completely unable to find (I've looked at the other threads on memory leaks). Due to time constraints, I have had to give up on finding the leak, but if I were able to isolate each experiment, the program would probably run long enough to produce the results I need.
Would running each test in a separate thread help?
Are there any other methods of isolating the effects of a leak?
Detail on the specific situation
My code has two parts: an experiment runner and the actual experiment code.
Although no globals are shared between the code for running all the experiments and the code used by each experiment, some classes/functions are necessarily shared.
The experiment runner isn't just a simple for loop that can be easily put into a shell script. It first decides on the tests which need to be run given the configuration parameters, then runs the tests then outputs the data in a particular way.
I tried manually calling the garbage collector in case the issue was simply that garbage collection wasn't being run, but this did not work.
Update
Gnibbler's answer has actually allowed me to find out that my ClosenessCalculation objects, which store all of the data used during each calculation, are not being killed off. I then used that to manually delete some links, which seems to have fixed the memory issues.
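For anyone in a similar spot, the fix amounted to something like the sketch below; the attribute names are made up, and the point is just breaking the references that kept each ClosenessCalculation (and its data) alive:
import gc

class ClosenessCalculation(object):
    def __init__(self, data):
        self.data = data
        self.owner = None              # back-reference set by the runner

def run_experiment():
    calc = ClosenessCalculation(list(range(100000)))
    calc.owner = calc                  # stand-in for a reference cycle
    # ... run the experiment ...
    # Break the links so the bulky data can be reclaimed.
    calc.data = None
    calc.owner = None
    gc.collect()                       # sweep up anything left in a cycle

run_experiment()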

You can use something like this to help track down memory leaks
>>> from collections import defaultdict
>>> from gc import get_objects
>>> before = defaultdict(int)
>>> after = defaultdict(int)
>>> for i in get_objects():
...     before[type(i)] += 1
...
Now suppose the test leaks some memory:
>>> leaked_things = [[x] for x in range(10)]
>>> for i in get_objects():
...     after[type(i)] += 1
...
>>> print [(k, after[k] - before[k]) for k in after if after[k] - before[k]]
[(<type 'list'>, 11)]
11 because we have leaked one list containing 10 more lists
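To narrow it down per test, the same counting can be wrapped around each experiment so you can see which types grow from run to run. A rough sketch (run_test is a made-up stand-in for your experiment function):
from collections import defaultdict
from gc import get_objects

def snapshot():
    counts = defaultdict(int)
    for obj in get_objects():
        counts[type(obj)] += 1
    return counts

def run_test(n):
    # hypothetical experiment; replace with the real test
    return [x * n for x in range(10)]

leaks = []  # simulate a leak by keeping references around

for n in range(3):
    before = snapshot()
    leaks.append(run_test(n))
    after = snapshot()
    growth = [(t, after[t] - before[t]) for t in after if after[t] - before[t]]
    print(growth)   # types whose instance count grew during this test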

Threads would not help. If you must give up on finding the leak, then the only solution to contain its effect is running a new process once in a while (e.g., when a test has left overall memory consumption too high for your liking -- you can determine VM size easily by reading /proc/self/status in Linux, and other similar approaches on other OS's).
Make sure the overall script takes an optional parameter to tell it what test number (or other test identification) to start from, so that when one instance of the script decides it's taking up too much memory, it can tell its successor where to restart from.
Or, more solidly, make sure that as each test is completed its identification is appended to a file with a well-known name. When the program starts, it begins by reading that file and thus knows which tests have already been run. This architecture is more solid because it also covers the case where the program crashes during a test; of course, to fully automate recovery from such crashes, you'll want a separate watchdog program and process to be in charge of starting a fresh instance of the test program when it determines the previous one has crashed (it could use subprocess for the purpose). The watchdog also needs a way to tell when the sequence is finished: e.g., a normal exit from the test program could mean the run is complete, while any crash or exit with a status != 0 would signify the need to start a fresh instance.
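For concreteness, a rough sketch of that machinery on Linux; the memory threshold, the file name, and run_test are placeholders rather than anything from your program:
import os
import sys

MEM_LIMIT_KB = 2 * 1024 * 1024          # assumed 2 GB ceiling; tune to taste
DONE_FILE = 'completed_tests.txt'        # hypothetical well-known file name

def vm_size_kb():
    # Read the interpreter's virtual memory size from /proc/self/status.
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmSize:'):
                return int(line.split()[1])    # reported in kB
    return 0

def already_done():
    if not os.path.exists(DONE_FILE):
        return set()
    with open(DONE_FILE) as f:
        return set(line.strip() for line in f)

def run_test(test_id):
    pass                                     # stand-in for the real experiment

def main(test_ids):
    done = already_done()
    for test_id in test_ids:
        if test_id in done:
            continue                         # a previous instance finished this one
        run_test(test_id)
        with open(DONE_FILE, 'a') as f:
            f.write(test_id + '\n')          # record completion before checking memory
        if vm_size_kb() > MEM_LIMIT_KB:
            # hand over to a fresh interpreter; it will skip the completed tests
            os.execv(sys.executable, [sys.executable] + sys.argv)

if __name__ == '__main__':
    main(['test%d' % i for i in range(100)])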
If these architectures appeal but you need further help implementing them, just comment to this answer and I'll be happy to supply example code -- I don't want to do it "preemptively" in case there are as-yet-unexpressed issues that make the architectures unsuitable for you. (It might also help to know what platforms you need to run on).

I had the same problem with a third-party C library that was leaking. The cleanest workaround I could think of was to fork and wait. The advantage is that you don't even have to create a separate process after each run; you can define the size of your batch.
Here's a general solution (if you ever find the leak, the only change you need to make is to change run() to call run_single_process() instead of run_forked() and you'll be done):
import os, sys

batchSize = 20

class Runner(object):
    def __init__(self, dataFeedGenerator, dataProcessor):
        self._dataFeed = dataFeedGenerator
        self._caller = dataProcessor
        self._child_pid = None

    def run(self):
        self.run_forked()

    def run_forked(self):
        dataFeed = self._dataFeed
        dataSubFeed = []
        for i, dataMorsel in enumerate(dataFeed, 1):
            dataSubFeed.append(dataMorsel)
            if i % batchSize == 0:
                # Hand the completed batch to a forked child.
                self._dataFeed = dataSubFeed
                self.fork()
                dataSubFeed = []
                if self._child_pid == 0:
                    self.run_single_process()
                self.endBatch()
        if dataSubFeed:
            # Process whatever is left over in a final, smaller batch.
            self._dataFeed = dataSubFeed
            self.fork()
            if self._child_pid == 0:
                self.run_single_process()
            self.endBatch()

    def run_single_process(self):
        for dataMorsel in self._dataFeed:
            self._caller(dataMorsel)

    def fork(self):
        self._child_pid = os.fork()

    def endBatch(self):
        if self._child_pid != 0:
            os.waitpid(self._child_pid, 0)  # parent: wait for the child to finish
        else:
            sys.exit()  # child: exit when done
This isolates the memory leak to the child processes; each child can leak for at most batchSize iterations before it exits.
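Usage might look something like this (the generator and the processing function here are just placeholders):
def process(item):
    print('processing %d' % item)      # stand-in for the leaky per-item work

Runner((n for n in range(100)), process).run()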

I would simply refactor the experiments into individual functions (if they aren't structured like that already), then accept an experiment number from the command line and call the corresponding experiment function.
Then just bodge up a shell script as follows:
#!/bin/bash
for expnum in 1 2 3 4 5 6 7 8 9 10 11 ; do
    python youProgram ${expnum} otherParams
done
That way, you can leave most of your code as-is and this will clear out any memory leaks you think you have in between each experiment.
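On the Python side, the dispatch could be as simple as this sketch (the experiment function names are made up):
import sys

def experiment_1():
    print('running experiment 1')      # placeholder for the real test

def experiment_2():
    print('running experiment 2')

EXPERIMENTS = {1: experiment_1, 2: experiment_2}

if __name__ == '__main__':
    expnum = int(sys.argv[1])          # experiment number passed in by the shell script
    EXPERIMENTS[expnum]()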
Of course, the best solution is always to find and fix the root cause of a problem but, as you've already stated, that's not an option for you.
Although it's hard to imagine a memory leak in Python, I'll take your word on that one - you may at least want to consider the possibility that you're mistaken there, however. Consider raising that in a separate question, something that we can work on at low priority (as opposed to this quick-fix version).
Update: Making this community wiki since the question has changed somewhat from the original. I'd delete the answer but for the fact I still think it's useful: you could do the same to your experiment runner as I proposed with the bash script; you just need to ensure that the experiments run as separate processes so that memory leaks don't carry over (if the memory leaks are in the runner, you're going to have to do root-cause analysis and fix the bug properly).

Related

Is it possible to execute indented lines from another file in Python?

Say I have two files: demo.py
# demo.py
from pathlib import Path
for i in range(5):
    exec(Path('another_file.txt').read_text())
and another_file.txt (note the indent)
    print(i)
Is it possible to make python demo.py run?
N.B. This is useful when using Page (or wxformbuilder or pyqt's designer) to generate a GUI layout, where skeletons of the callback functions are generated automatically. The skeletons have to be modified, but at the same time each iteration overwrites them, so the code snippets have to be copied back. Anyway, you know what I am talking about if you have used any of Page, wxformbuilder, or pyqt's designer.
You can solve the basic problem by removing the indent:
from pathlib import Path
import textwrap
for i in range(5):
    exec(textwrap.dedent(Path('another_file.txt').read_text()))
There are still two rather major problems with this:
There are serious security implications here. You're running code without including it in your project. The idea that you can "worry about security and other issues later" will cause you pain at a later point. You'll see similar advice on this site with avoiding SQL injection. That later date may never come, and even if it does, there's a very real chance you won't remember or correctly identify all of the problems. It's absolutely better to avoid the problems in the first place.
Also, with dynamic code like this, you run the very real risk of running into a syntax error with a call stack that shows you nothing about where the code is from. It's not bad for simple cases like this, but as you add more and more complexity to a project like this, you'll likely find you're adding more support to help you debug issues you run into, rather than spending your time adding features.
And, combining these two problems can be fun. It's contrived, but if you changed the for loop to a while loop like this:
i = 0
while i < 5:
    exec(textwrap.dedent(Path('another_file.txt').read_text()))
    i += 1
And then modified the text file to be this:
    print(i)
    i += 1
It's trivial to understand why it's no longer operating the 5 times you expect, but as both "sides" of this project get more complex, figuring out the complex interplay between the elements will get much harder.
In short, don't use exec (or eval). Future you will thank past you for making your life easier.

Best Approach for I/O Bound Problems?

I am currently running code on an HPC cluster that writes several 16 MB files to disk (in the same directory) for a short period of time and then deletes them. They are written to disk and then deleted sequentially. However, the total number of I/O operations exceeds 20,000 × 12,000.
I am using the joblib module in Python 2.7 to take advantage of running my code on several cores. It's basically a nested-loop problem, with the outer loop parallelised by joblib and the inner loop run sequentially inside the function. In total it's a 20,000 × 12,000 loop.
The basic skeleton of my code is the following.
from joblib import Parallel, delayed
import subprocess

def f(a, b, c, d):
    cmds = 'path/to/a/bash_script_on_disk with arguments from a,b > \
            save_file_to_disk'
    subprocess.check_output(cmds, shell=True)
    cmds1 = 'path/to/a/second_bash_script_on_disk > \
             save_file_to_disk'
    subprocess.check_output(cmds1, shell=True)
    # The structure above is repeated several times.
    # However, I do delete the files as soon as I can using:
    cmds2 = 'rm -rf files'
    subprocess.check_output(cmds2, shell=True)
    # This is followed by the second/inner loop.
    for i in range(12000):
        # Do some computation, create and delete files in each iteration.
        pass

if __name__ == '__main__':
    num_cores = 48
    Parallel(n_jobs=num_cores)(delayed(f)(a, b, c, d) for i in range(20000))
    # range(20000) is batched by a wrapper script that sends no more
    # than 48 jobs per node (max. cores available).
This code is extremely slow and the bottleneck is the I/O time. Is this a good use case to temporarily write files to /dev/shm/? I have 34GB of space available as tmpfs on /dev/shm/.
Things I already tested:
I tried to set up the same code on a smaller scale on my laptop which has 8 cores. However, writing to /dev/shm/ ran slower than writing to disk.
Side note: the inner loop can be parallelised too; however, the number of cores I have available is far smaller than 20,000, which is why I am sticking with this configuration. Please let me know if there are better ways to do this.
First, do not talk about total I/O operations; that is meaningless. Instead, talk about IOPS and throughput.
Second, it is almost impossible for writing to /dev/shm/ to be slower than writing to disk. Please provide more information. You can test write performance using fio, for example: sudo fio --name fio_test_file --rw=read --direct=1 --bs=4k --size=50M --numjobs=16 --group_reporting; my test result is bw=428901KB/s, iops=107225.
Third, you are really writing too many files; you should rethink your structure.
It depends on your temporary data size.
If you have much more memory than you're using for the data, then yes - shm will be a good place for it. If you're going to write almost as much as you've got available, then you're likely going to start swapping - which would kill the performance of everything.
If you can fit your data in memory, then tmpfs by definition will always be faster than writing to a physical disk. If it isn't, then there are more factors impacting your environment. Running your code under a profiler would be a good idea in this case.
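If you do experiment with tmpfs, a rough sketch of pointing the temporary files there (assuming /dev/shm is mounted, as on most Linux systems) could be:
import os
import shutil
import subprocess
import tempfile

# Create a scratch directory on tmpfs; fall back to the default temp dir if it's absent.
base = '/dev/shm' if os.path.isdir('/dev/shm') else None
scratch = tempfile.mkdtemp(dir=base)
try:
    out_path = os.path.join(scratch, 'chunk.dat')
    # Point the shell command's output at the tmpfs-backed file.
    subprocess.check_output('echo test > %s' % out_path, shell=True)
    # ... read or post-process out_path here ...
finally:
    shutil.rmtree(scratch)             # removes the whole batch of files in one call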

Is there an easy way to have "checkpoints" in an extended python script?

To preface my question let me give a bit of context: I'm currently working on a data pipeline that has a number of different steps. Each step can go wrong and many take some time (not a huge amount, but on the order of minutes).
For this reason the pipeline is currently heavily supervised by humans. An analyst goes through each step, running the python code in a Jupyter notebook and upon experiencing problems will make minor code adjustments inline and repeat that section.
In the long run, the goal here is to have zero human intervention. However, in the shorter term we'd like to make this process more seamless. The simplest way to do that seems to be to split each section into its own script, and have a parent script that runs each bit and verifies output. However, we also need the ability to rerun a step with an identical setup if it fails.
For example:
run a --> ✅
run b --> ✅ (b relies on some data produced by a)
run c --> ❌ (c relies on data produced by a and b)
// make some changes to c
run c --> ✅ (c should run in an identical state to its original run)
The most obvious way to do this would be to write the output from each script to a file, and load all of those files into the next one. This would work, but seems a bit inelegant. A database seems another valid option, but a lot of the data doesn't fit cleanly into a DB format.
Does anyone have any suggestions for some ways to achieve what I'm looking for? If anything is unclear I'm also more than happy to clarify any points!
You could create an object that basically maintains the state after each step and use pickle to serialize that object to a file.
Then it's up to your python script to unpickle that file and then decide which step it needs to start from based on the state.
https://wiki.python.org/moin/UsingPickle
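A rough sketch of that idea; the step functions, the dictionary layout, and the file name are all made up for illustration:
import os
import pickle

STATE_FILE = 'pipeline_state.pkl'      # hypothetical checkpoint file name

def run_a(data):
    return {'raw': [1, 2, 3]}          # placeholder for the real step

def run_b(data):
    return {'cleaned': data['a']['raw']}

def run_c(data):
    return {'total': sum(data['b']['cleaned'])}

def load_state():
    # Resume from the checkpoint if one exists, otherwise start fresh.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE, 'rb') as f:
            return pickle.load(f)
    return {'completed': [], 'data': {}}

def save_state(state):
    with open(STATE_FILE, 'wb') as f:
        pickle.dump(state, f)

state = load_state()
for name, step in [('a', run_a), ('b', run_b), ('c', run_c)]:
    if name in state['completed']:
        continue                       # finished in a previous run; skip it
    state['data'][name] = step(state['data'])
    state['completed'].append(name)
    save_state(state)                  # checkpoint after every successful step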
Queues and pipelines work very nicely with each other (architecturally speaking). One of the nicer bits is the fact that slower stages can get more workers than faster stages, allowing the pipeline to be optimized based on the workload.
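For example, a toy two-stage pipeline connected by queues might look like this, with extra workers on the slower stage (the stage bodies are just placeholders):
import queue
import threading

raw = queue.Queue()
done = queue.Queue()

def fast_stage():
    # Quick step: a single worker keeps up easily.
    while True:
        item = raw.get()
        done.put(item * 2)             # stand-in for the cheap transformation
        raw.task_done()

def slow_stage():
    # Expensive step: several workers pull from the same queue.
    while True:
        item = done.get()
        print('finished', item)        # stand-in for the slow work
        done.task_done()

threading.Thread(target=fast_stage, daemon=True).start()
for _ in range(3):
    threading.Thread(target=slow_stage, daemon=True).start()

for i in range(10):
    raw.put(i)
raw.join()    # block until the fast stage has consumed every item
done.join()   # block until the slow stage has drained its queue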

Python: CrashingPython explanation

I have some Python code that makes the Python interpreter crash randomly. I have tried to isolate the source of the problem, but I am still investigating. While searching the net for problems that could make the interpreter crash, I stumbled upon this:
import ctypes

def crash():
    '''\
    crash the Python interpreter...
    '''
    i = ctypes.c_char('a')
    j = ctypes.pointer(i)
    c = 0
    while True:
        j[c] = 'a'
        c += 1
    j
http://wiki.python.org/moin/CrashingPython
Since I am using ctypes, I think the problem could be related to the way ctypes is used, so I am trying to understand why that code can make the Python interpreter crash. Understanding it would help me investigate the problem in my own ctypes code.
Can anybody explain this?
Help would be appreciated.
It makes a pointer to a single character's worth of memory, and then writes to that location and to every successive address after it.
Should the initial write succeed, it keeps trying successive addresses until it finds one that isn't writable. Not all memory addresses are writable, so it's bound to crash eventually.
The problem is that you're writing to successive memory locations indefinitely, so at some point the memory being accessed will not be writable and the program will crash.

Pythonistas, please help convert this to utilize Python Threading concepts

Update: for anyone wondering what I went with in the end -
I divided the result set into 4 and ran 4 instances of the same program, with one argument each indicating which set to process. It did the trick for me. I also considered the PP module; though it worked, I preferred sticking with a single program. Please pitch in if this is a horrible implementation! Thanks.
The following is what my program does. Nothing memory intensive; it is serial processing, and boring. Could you help me convert it into a more efficient and exciting process? Say I process 1,000 records this way; with 4 threads I could get it to run in 25% of the time!
I read articles on how Python threading can be inefficient if done wrong, and even Python's creator says the same. So I am scared, and while I read more about threading, I want to see if the bright folks on here can steer me in the right direction. Muchas gracias!
def startProcessing(sernum, name):
    '''
    Bunch of statements depending on result,
    will write to database (one update statement)
    Try Catch blocks which upon failing,
    will call this function until the request succeeds.
    '''

for record in result:
    startProc = startProcessing(str(record[0]), str(record[1]))
Python threads can't run at the same time due to the Global Interpreter Lock; you want new processes instead. Look at the multiprocessing module.
(I was instructed to post this as an answer =p.)
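A rough sketch with multiprocessing.Pool, reusing the question's names; note that Pool.map passes a single argument, so the two fields are packed into a tuple here, and the result list is a stand-in for your query result:
from multiprocessing import Pool

def startProcessing(args):
    sernum, name = args
    # Placeholder for the real work: compute, then issue one DB update.
    return 'processed %s (%s)' % (sernum, name)

if __name__ == '__main__':
    result = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]   # stand-in for the query result
    pool = Pool(processes=4)
    try:
        outcomes = pool.map(startProcessing,
                            [(str(r[0]), str(r[1])) for r in result])
    finally:
        pool.close()
        pool.join()
    print(outcomes)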
