python multiprocessing set memory per process

I'm using Python to do some processing on text files and am running into MemoryErrors. Sometimes the file being processed is quite large, which means that a multiprocessing Process is using too much RAM.
Here is a snippet of my code:
import multiprocessing as mp
import os

def preprocess_file(file_path):
    with open(file_path, "r+") as f:
        file_contents = f.read()
        # modify the file_contents
        # ...
        # overwrite file
        f.seek(0)
        f.write(file_contents)
        f.truncate()
if __name__ == "main":
with mp.Pool(mp.cpu_count()) as pool:
pool_processes = []
# for all files in dir
for root, dirs, files in os.walk(some_path):
for f in files:
pool_processes.append(os.path.join(root, f))
# start the processes
pool.map(preprocess_file, pool_processes)
I have tried using the resource package to set a limit on how much RAM each process can use, as shown below, but this hasn't fixed the issue: I still get MemoryErrors, which leads me to believe it's the pool.map that is causing issues. I was hoping to have each process deal with the exception individually so that the file could be skipped rather than crashing the whole program.
import resource

def preprocess_file(file_path):
    try:
        hard = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")  # total bytes of RAM in machine
        soft = (hard - 512 * 1024 * 1024) // mp.cpu_count()  # split between each cpu and save 512MB for the system
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))  # apply limit
        with open(file_path, "r+") as f:
            # ...
    except Exception as e:  # bad practice - should be more specific but just a placeholder
        # ...
How can I let an individual process run out of memory while letting the other processes continue unaffected? Ideally I want to catch the exception within the preprocess_file function so that I can log exactly which file caused the error.
Edit: The preprocess_file function does not share data with any other processes, so there is no need for shared memory. The function also needs to read the entire file at once, as the file is reformatted in a way that cannot be done line by line.
Edit 2: The traceback from the program is below. As you can see, the error doesn't actually point to the file being run, and instead comes from the multiprocessing package's own files.
Process ForkPoolWorker-2:
Traceback (most recent call last):
File "/usr/lib64/python3.6/multiprocessing/pool.py", line 125, in worker
File "/usr/lib64/python3.6/multiprocessing/queues.py", line 341, in put
File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 51, in dumps
File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 39, in __init__
MemoryError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib64/python3.6/multiprocessing/process.py", line 258, in _bootstrap
File "/usr/lib64/python3.6/multiprocessing/process.py", line 93, in run
File "/usr/lib64/python3.6/multiprocessing/pool.py", line 130, in worker
File "/usr/lib64/python3.6/multiprocessing/queues.py", line 341, in put
File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 51, in dumps
File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 39, in __init__
MemoryError

If MemoryError is raised, the worker process may or may not be able to recover from the situation. If it does, then, as @Thomas suggests, catch the MemoryError somewhere.
import multiprocessing as mp
from time import sleep

def initializer():
    # Probably set the memory limit here
    pass

def worker(i):
    sleep(1)
    try:
        if i % 2 == 0:
            raise MemoryError
    except MemoryError as ex:
        return str(ex)
    return i

if __name__ == '__main__':
    with mp.Pool(2, initializer=initializer) as pool:
        tasks = range(10)
        results = pool.map(worker, tasks)
        print(results)
If the worker cannot recover, the whole pool is unlikely to keep working. For example, change worker to force an exit:
def worker(i):
    sleep(1)
    try:
        if i % 2 == 0:
            raise MemoryError
        elif i == 5:
            import sys
            sys.exit()
    except MemoryError as ex:
        return str(ex)
    return i
With this change, Pool.map never returns and blocks forever.
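Applying that pattern to the original question might look like the sketch below. This is an assumption-laden illustration, not a tested configuration: the rlimit is set once per worker via the Pool initializer, preprocess_file catches MemoryError itself and returns something small that identifies the file, and some_path is a placeholder. Note also that, as in the posted traceback, an allocation can still fail inside the pool's own result-pickling machinery, in which case the worker's except clause never sees it.
import multiprocessing as mp
import os
import resource

def set_memory_limit():
    # Runs once in each worker process when the pool starts it.
    hard = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    soft = (hard - 512 * 1024 * 1024) // mp.cpu_count()
    resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

def preprocess_file(file_path):
    try:
        with open(file_path, "r+") as f:
            file_contents = f.read()
            # ... reformat file_contents ...
            f.seek(0)
            f.write(file_contents)
            f.truncate()
        return file_path, True
    except MemoryError:
        # Keep the return value small so it can be pickled back to the parent.
        return file_path, False

if __name__ == "__main__":
    some_path = "."  # placeholder for the real directory
    paths = [os.path.join(root, name)
             for root, dirs, files in os.walk(some_path)
             for name in files]
    with mp.Pool(mp.cpu_count(), initializer=set_memory_limit) as pool:
        for path, ok in pool.map(preprocess_file, paths):
            if not ok:
                print("skipped (ran out of memory):", path)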

Related

Why does multiprocessing not work when opening a file?

While trying out the multiprocessing Pool module, I noticed that it does not work when I am loading / opening any kind of file. The code below works as expected. When I uncomment the two lines that open test.txt, the script skips the pool.apply_async call, and loopingTest never runs.
import time
from multiprocessing import Pool

class MultiClass:
    def __init__(self):
        file = 'test.txt'
        # with open(file, 'r') as f:  # This is the culprit
        #     self.d = f
        self.n = 50000000
        self.cases = ['1st time', '2nd time']
        self.multiProc(self.cases)
        print("It's done")

    def loopingTest(self, cases):
        print(f"looping start for {cases}")
        n = self.n
        while n > 0:
            n -= 1
        print(f"looping done for {cases}")

    def multiProc(self, cases):
        test = False
        pool = Pool(processes=2)
        if not test:
            for i in cases:
                pool.apply_async(self.loopingTest, (i,))
        pool.close()
        pool.join()

if __name__ == '__main__':
    start = time.time()
    w = MultiClass()
    end = time.time()
    print(f'Script finished in {end - start} seconds')
You see this behavior because calling apply_async fails when you save the file descriptor (self.d) to your instance. When you call apply_async(self.loopingTest, ...), Python needs to pickle self.loopingTest to send it to the worker process, which also requires pickling self. When you have the open file descriptor saved as a property of self, the pickling fails, because file descriptors can't be pickled. You'll see this for yourself if you use apply instead of apply_async in your sample code. You'll get an error like this:
Traceback (most recent call last):
File "a.py", line 36, in <module>
w = MultiClass()
File "a.py", line 12, in __init__
self.multiProc(self.cases)
File "a.py", line 28, in multiProc
out.get()
File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
File "/usr/lib/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
put(task)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/usr/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
TypeError: cannot serialize '_io.TextIOWrapper' object
You need to change your code to either avoid saving the file descriptor to self, only creating it in the worker method (if that's where you need to use it), or to use the tools Python provides to control the pickle/unpickle process for your class (__getstate__ and __setstate__). Depending on the use-case, you can also turn the method you're passing to apply_async into a top-level function, so that self doesn't need to be pickled at all.
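As a rough illustration of the __getstate__/__setstate__ route (the class and attribute names below are made up for the example, and it assumes a readable test.txt exists): strip the file object before pickling and reopen it in the worker.
import time
from multiprocessing import Pool

class FileHolder:
    def __init__(self, path='test.txt'):
        self.path = path
        self.f = open(path, 'r')       # unpicklable file object

    def __getstate__(self):
        state = self.__dict__.copy()
        state['f'] = None              # drop the handle before pickling
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.f = open(self.path, 'r')  # reopen in the worker process

    def loopingTest(self, case):
        print("looping start for {}".format(case))
        return case

if __name__ == '__main__':
    holder = FileHolder()
    with Pool(processes=2) as pool:
        print(pool.map(holder.loopingTest, ['1st time', '2nd time']))
Because pool.map pickles the bound method, the instance travels through __getstate__/__setstate__ and each worker gets its own freshly opened handle.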

Flask-RQ2 Redis error: ZADD requires an equal number of values and scores

I have tried implementing the basic Flask-RQ2 setup as per the docs to attempt to write to two separate files concurrently, but I am getting the following Redis error: redis.exceptions.RedisError: ZADD requires an equal number of values and scores when the worker tries to perform a job in the Redis queue.
Here's the full stack trace:
10:20:37: Worker rq:worker:1d0c83d6294249018669d9052fd759eb: started, version 1.2.0
10:20:37: *** Listening on default...
10:20:37: Cleaning registries for queue: default
10:20:37: default: tester('time keeps on slipping') (02292167-c7e8-4040-a97b-742f96ea8756)
10:20:37: Worker rq:worker:1d0c83d6294249018669d9052fd759eb: found an unhandled exception, quitting...
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/rq/worker.py", line 515, in work
self.execute_job(job, queue)
File "/usr/local/lib/python3.6/dist-packages/rq/worker.py", line 727, in execute_job
self.fork_work_horse(job, queue)
File "/usr/local/lib/python3.6/dist-packages/rq/worker.py", line 667, in fork_work_horse
self.main_work_horse(job, queue)
File "/usr/local/lib/python3.6/dist-packages/rq/worker.py", line 744, in main_work_horse
raise e
File "/usr/local/lib/python3.6/dist-packages/rq/worker.py", line 741, in main_work_horse
self.perform_job(job, queue)
File "/usr/local/lib/python3.6/dist-packages/rq/worker.py", line 866, in perform_job
self.prepare_job_execution(job, heartbeat_ttl)
File "/usr/local/lib/python3.6/dist-packages/rq/worker.py", line 779, in prepare_job_execution
registry.add(job, timeout, pipeline=pipeline)
File "/usr/local/lib/python3.6/dist-packages/rq/registry.py", line 64, in add
return pipeline.zadd(self.key, {job.id: score})
File "/usr/local/lib/python3.6/dist-packages/redis/client.py", line 1691, in zadd
raise RedisError("ZADD requires an equal number of "
redis.exceptions.RedisError: ZADD requires an equal number of values and scores
My main is this:
#!/usr/bin/env python3
from flask import Flask
from flask_rq2 import RQ
import time
import tester
app = Flask(__name__)
rq = RQ(app)
default_worker = rq.get_worker()
default_queue = rq.get_queue()
tester = tester.Tester()
while True:
default_queue.enqueue(tester.tester, args=["time keeps on slipping"])
default_worker.work(burst=True)
with open('test_2.txt', 'w+') as f:
data = f.read() + "it works!\n"
time.sleep(5)
if __name__ == "__main__":
app.run()
and my tester.py module is thus:
#!/usr/bin/env python3
import time
class Tester:
def tester(string):
with open('test.txt', 'w+') as f:
data = f.read() + string + "\n"
f.write(data)
time.sleep(5)
I'm using the following:
python==3.6.7-1~18.04
redis==2.10.6
rq==1.2.0
Flask==1.0.2
Flask-RQ2==18.3
I've also tried the simpler setup from the documentation where you don't specify either queue or worker but implicitly rely on the Flask-RQ2 module defaults... Any help with this would be greatly appreciated.
Reading the docs a bit deeper, it seems that for it to work with Redis<3.0 you need Flask-RQ2==18.2.2.
I have different errors now but I can work with this.
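For what it's worth, the last two frames of the traceback show what the version pinning works around: rq 1.2.0 calls zadd with a {member: score} mapping, which is the redis-py 3.x calling convention, while redis-py 2.10.6 expects flat score/member arguments and therefore rejects the single dict as an unequal number of values and scores. A hedged sketch of a version-aware call (zadd_compat is a hypothetical helper, not part of rq or Flask-RQ2, and the 2.x argument order is from memory):
import redis

def zadd_compat(client, key, member, score):
    """Call ZADD in whichever style the installed redis-py expects."""
    major = int(redis.__version__.split(".")[0])
    if major >= 3:
        # redis-py 3.x: mapping of member -> score
        return client.zadd(key, {member: score})
    # redis-py 2.x: flat score, member pairs passed as *args
    return client.zadd(key, score, member)
In practice the cleaner fix is keeping the stack aligned, as described above: either pin Flask-RQ2==18.2.2 against redis<3.0, or upgrade the redis package to a 3.x release that matches rq 1.2.0.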

python tempfile and multiprocessing pool error

I'm experimenting with Python's multiprocessing. I struggled with a bug in my code and managed to narrow it down. However, I still don't know why this happens. What I'm posting is just sample code. If I import the tempfile module and change tempdir, the code crashes at pool creation. I'm using Python 2.7.5.
Here's the code
from multiprocessing import Pool
import tempfile

tempfile.tempdir = "R:/"  # REMOVING THIS LINE FIXES THE ERROR

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)            # start 4 worker processes
    result = pool.apply_async(f, [10])  # evaluate "f(10)" asynchronously
    print result.get(timeout=1)         # prints "100" unless your computer is *very* slow
    print pool.map(f, range(10))        # prints "[0, 1, 4,..., 81]"
Here's the error:
R:\>mp_pool_test.py
Traceback (most recent call last):
File "R:\mp_pool_test.py", line 11, in <module>
pool = Pool(processes=4) # start 4 worker processes
File "C:\Python27\lib\multiprocessing\__init__.py", line 232, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild)
File "C:\Python27\lib\multiprocessing\pool.py", line 138, in __init__
self._setup_queues()
File "C:\Python27\lib\multiprocessing\pool.py", line 233, in _setup_queues
self._inqueue = SimpleQueue()
File "C:\Python27\lib\multiprocessing\queues.py", line 351, in __init__
self._reader, self._writer = Pipe(duplex=False)
File "C:\Python27\lib\multiprocessing\__init__.py", line 107, in Pipe
return Pipe(duplex)
File "C:\Python27\lib\multiprocessing\connection.py", line 223, in Pipe
1, obsize, ibsize, win32.NMPWAIT_WAIT_FOREVER, win32.NULL
WindowsError: [Error 123] The filename, directory name, or volume label syntax is incorrect
This code works fine.
from multiprocessing import Pool
import tempfile as TF

TF.tempdir = "R:/"

def f(x):
    return x*x

if __name__ == '__main__':
    print("test")
The bizarre thing is that in both cases I don't do anything with TF.tempdir, but the version with the Pool doesn't work for some reason.
It looks like you have a name collision: from what I can see in "C:\Program Files\PYTHON\Lib\multiprocessing\connection.py", multiprocessing is using tempfile as well. That behavior should not happen, but it looks to me like the problem is in line 66 of connection.py:
elif family == 'AF_PIPE':
    return tempfile.mktemp(prefix=r'\\.\pipe\pyc-%d-%d-' %
                           (os.getpid(), _mmap_counter.next()))
I am still poking at this. I looked at globals after importing tempfile and then tempfile as TF; different names exist, but now I am wondering about references, and I am trying to figure out whether they point to the same thing.
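To add a hedged note on why this likely fails: on Windows, that mktemp call builds the named-pipe address by joining the prefix onto tempfile.gettempdir(), so replacing the default tempdir with "R:/" likely turns the pipe name into something like R:/\\.\pipe\pyc-..., which Windows rejects with error 123. Assuming the real goal is only to create your own temporary files on R:\, one possible workaround is to pass dir= explicitly and leave the module-level default alone:
from multiprocessing import Pool
import tempfile

def f(x):
    return x * x

if __name__ == '__main__':
    pool = Pool(processes=4)   # pipes are created while the default tempdir is still valid
    print(pool.map(f, range(10)))
    pool.close()
    pool.join()
    # Ask for files on R:\ explicitly instead of mutating tempfile.tempdir:
    tmp = tempfile.NamedTemporaryFile(dir="R:/", delete=False)
    tmp.write(b"scratch data")
    tmp.close()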

Python multiprocessing memory error

I'm using map_async to share my workload, but I'm getting a MemoryError and I haven't found a solution or a workaround. Here is the error I get:
Exception in thread Thread-3:
Traceback (most recent call last):
File "C:\Python27\lib\threading.py", line 810, in __bootstrap_inner
self.run()
File "C:\Python27\lib\threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "C:\Python27\lib\multiprocessing\pool.py", line 380, in _handle_results
task = get()
MemoryError
Here is the code:
pool = Pool(maxtasksperchild=2)
fixed_args = (targetdirectorytxt, value_dict)
varg = ((filename,) + fixed_args for filename in readinfiles)
op_list = pool.map_async(PPD_star, list(varg), chunksize=1)
while not op_list.ready():
    print("Number of files left to process: {}".format(op_list._number_left))
    time.sleep(600)
op_list = op_list.get()
pool.close()
pool.join()
Here is what I have tried:
Reducing the number of threads
Limiting maxtasksperchild
apply_async instead of map_async
Are there any more suggestions to avoid this error?
I'm reading in the file with:
with open(os.path.join(txtdatapath, pathfilename), "r") as data:
    datalines = (line.rstrip('\r\n') for line in data)
    for record in datalines:
I agree with @AndréLaszio: it is likely that the files are too large to be kept in memory. Altering your logic to keep only one line in memory at a time should relieve the memory pressure, unless each line is also huge.
Here is an alternate approach to opening the file and working with one line at a time. Keeping the file contents available in memory as an array is an expensive operation.
readingfiles.py:
from memory_profiler import profile

@profile
def open_file_read_all_then_parse(filename):
    """
    Open a file and remove all the newline characters from each line, then
    parse the resulting array of clean lines.
    """
    with open(filename, "r") as data:
        datalines = (line.rstrip('\r\n') for line in data)
        for record in datalines:
            pass

@profile
def open_file_read_and_parse(filename):
    """
    Open a file and iterate over each line of the file while stripping the record
    of any newline characters.
    """
    with open(filename, "r") as data:
        for record in data:
            record.rstrip('\r\n')

if __name__ == '__main__':
    # input.dat is a roughly 4M file with 10000 lines
    open_file_read_all_then_parse("./input.dat")
    open_file_read_and_parse("./input.dat")
I used an additional module to help track down my memory usage; the module is named memory_profiler. This module helped me verify where my memory issues were coming from and may be useful for your debugging. It lists the memory usage of a program and the areas where the memory is being used.
For more in-depth performance analysis I recommend this article by Huy Nguyen.
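If the MemoryError keeps appearing in the pool's result-handling thread (the task = get() frame above) even after switching to line-by-line reading, the returned objects themselves may be too large to pickle back to the parent. One option, not part of the original answer, is to have each worker persist its output to disk and return only a path; the sketch below uses a stand-in worker, since the real PPD_star isn't shown.
import os
import pickle
import tempfile
from multiprocessing import Pool

def process_one(filename):
    # Stand-in for the real worker: build a potentially large result.
    with open(filename, "r") as fh:
        big_result = [line.upper() for line in fh]
    # Persist it and hand back only a path, so the parent's result queue
    # never has to carry the large object.
    out_path = os.path.join(tempfile.gettempdir(),
                            os.path.basename(filename) + ".pkl")
    with open(out_path, "wb") as out:
        pickle.dump(big_result, out)
    return out_path

if __name__ == "__main__":
    filenames = ["a.txt", "b.txt"]  # placeholder input list
    pool = Pool(maxtasksperchild=2)
    # imap_unordered yields results as they finish and keeps few in flight.
    for path in pool.imap_unordered(process_one, filenames, chunksize=1):
        print("finished: {}".format(path))
    pool.close()
    pool.join()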

Error with multiprocessing, atexit and global data

Sorry in advance, this is going to be long ...
Possibly related:
Python Multiprocessing atexit Error "Error in atexit._run_exitfuncs"
Definitely related:
python parallel map (multiprocessing.Pool.map) with global data
Keyboard Interrupts with python's multiprocessing Pool
Here's a "simple" script I hacked together to illustrate my problem...
import time
import multiprocessing as multi
import atexit

cleanup_stuff = multi.Manager().list([])

##################################################
# Some code to allow keyboard interrupts
##################################################
was_interrupted = multi.Manager().list([])

class _interrupt(object):
    """
    Toy class to allow retrieval of the interrupt that triggered its execution
    """
    def __init__(self, interrupt):
        self.interrupt = interrupt

def interrupt():
    was_interrupted.append(1)

def interruptable(func):
    """
    decorator to allow functions to be "interruptable" by
    a keyboard interrupt when in python's multiprocessing.Pool.map
    **Note**, this won't actually cause the Map to be interrupted,
    It will merely cause the following functions to be not executed.
    """
    def newfunc(*args, **kwargs):
        try:
            if(not was_interrupted):
                return func(*args, **kwargs)
            else:
                return False
        except KeyboardInterrupt as e:
            interrupt()
            return _interrupt(e)  # If we really want to know about the interrupt...
    return newfunc

@atexit.register
def cleanup():
    for i in cleanup_stuff:
        print(i)
    return

@interruptable
def func(i):
    print(i)
    cleanup_stuff.append(i)
    time.sleep(float(i)/10.)
    return i

# Must wrap func here, otherwise it won't be found in __main__'s dict
# Maybe because it was created dynamically using the decorator?
def wrapper(*args):
    return func(*args)

if __name__ == "__main__":
    # This is an attempt to use signals -- I also attempted something similar where
    # The signals were only caught in the child processes...Or only on the main process...
    #
    # import signal
    # def onSigInt(*args): interrupt()
    # signal.signal(signal.SIGINT,onSigInt)

    # Try 2 with signals (only catch signal on main process)
    # import signal
    # def onSigInt(*args): interrupt()
    # signal.signal(signal.SIGINT,onSigInt)
    # def startup(): signal.signal(signal.SIGINT,signal.SIG_IGN)
    # p=multi.Pool(processes=4,initializer=startup)

    # Try 3 with signals (only catch signal on child processes)
    # import signal
    # def onSigInt(*args): interrupt()
    # signal.signal(signal.SIGINT,signal.SIG_IGN)
    # def startup(): signal.signal(signal.SIGINT,onSigInt)
    # p=multi.Pool(processes=4,initializer=startup)

    p = multi.Pool(4)
    try:
        out = p.map(wrapper, range(30))
        # out = p.map_async(wrapper, range(30)).get()  # This doesn't work either...
        # The following lines don't work either
        # Effectively trying to roll my own p.map() with p.apply_async
        # results = [p.apply_async(wrapper, args=(i,)) for i in range(30)]
        # out = [r.get() for r in results]
    except KeyboardInterrupt:
        print("Hello!")
        out = None
    finally:
        p.terminate()
        p.join()
    print(out)
This works just fine if no KeyboardInterrupt is raised. However, if I raise one, the following exception occurs:
10
7
9
12
^CHello!
None
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/usr/lib/python2.6/atexit.py", line 24, in _run_exitfuncs
func(*targs, **kargs)
File "test.py", line 58, in cleanup
for i in cleanup_stuff:
File "<string>", line 2, in __getitem__
File "/usr/lib/python2.6/multiprocessing/managers.py", line 722, in _callmethod
self._connect()
File "/usr/lib/python2.6/multiprocessing/managers.py", line 709, in _connect
conn = self._Client(self._token.address, authkey=self._authkey)
File "/usr/lib/python2.6/multiprocessing/connection.py", line 143, in Client
c = SocketClient(address)
File "/usr/lib/python2.6/multiprocessing/connection.py", line 263, in SocketClient
s.connect(address)
File "<string>", line 1, in connect
error: [Errno 2] No such file or directory
Error in sys.exitfunc:
Traceback (most recent call last):
File "/usr/lib/python2.6/atexit.py", line 24, in _run_exitfuncs
func(*targs, **kargs)
File "test.py", line 58, in cleanup
for i in cleanup_stuff:
File "<string>", line 2, in __getitem__
File "/usr/lib/python2.6/multiprocessing/managers.py", line 722, in _callmethod
self._connect()
File "/usr/lib/python2.6/multiprocessing/managers.py", line 709, in _connect
conn = self._Client(self._token.address, authkey=self._authkey)
File "/usr/lib/python2.6/multiprocessing/connection.py", line 143, in Client
c = SocketClient(address)
File "/usr/lib/python2.6/multiprocessing/connection.py", line 263, in SocketClient
s.connect(address)
File "<string>", line 1, in connect
socket.error: [Errno 2] No such file or directory
Interestingly enough, the code does exit the Pool.map function without calling any of the additional functions ... The problem seems to be that the KeyboardInterrupt isn't handled properly at some point, but it is a little confusing where that is, and why it isn't handled in interruptable. Thanks.
Note, the same problem happens if I use out=p.map_async(wrapper,range(30)).get()
EDIT 1
A little closer ... If I enclose the out=p.map(...) in a try/except/finally clause, it gets rid of the first exception ... the others are still raised in atexit however. The code and traceback above have been updated.
EDIT 2
Something else that does not work has been added to the code above as a comment. (Same error). This attempt was inspired by:
http://jessenoller.com/2009/01/08/multiprocessingpool-and-keyboardinterrupt/
EDIT 3
Another failed attempt using signals added to the code above.
EDIT 4
I have figured out how to restructure my code so that the above is no longer necessary. In the (unlikely) event that someone stumbles upon this thread with the same use-case that I had, I will describe my solution ...
Use Case
I have a function which generates temporary files using the tempfile module. I would like those temporary files to be cleaned up when the program exits. My initial attempt was to pack each temporary file name into a list and then delete all the elements of the list with a function registered via atexit.register. The problem is that the list was not being updated across multiple processes. This is where I got the idea of using multiprocessing.Manager to manage the list data. Unfortunately, this fails on a KeyboardInterrupt no matter how hard I tried, because the communication sockets between processes were broken for some reason.
The solution to this problem is simple. Prior to using multiprocessing, set the temporary file directory, something like tempfile.tempdir = tempfile.mkdtemp(), and then register a function to delete the temporary directory. Each of the processes writes to the same temporary directory, so it works. Of course, this solution only works where the shared data is a list of files that needs to be deleted at the end of the program's life.
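A minimal sketch of that solution, assuming fork-based (Unix) process start as in the question; the worker body is a stand-in, and the point is that no Manager-backed shared list is needed:
import atexit
import shutil
import tempfile
import multiprocessing as multi

# Point all temporary files at one private directory, created before any
# worker processes exist, and remove the whole directory on exit.
tempfile.tempdir = tempfile.mkdtemp()

@atexit.register
def cleanup():
    shutil.rmtree(tempfile.tempdir, ignore_errors=True)

def func(i):
    # Each worker drops its scratch data into the shared temporary directory.
    tmp = tempfile.NamedTemporaryFile(delete=False)
    tmp.write(str(i).encode("ascii"))
    tmp.close()
    return i

if __name__ == "__main__":
    p = multi.Pool(4)
    try:
        print(p.map(func, range(30)))
    finally:
        p.terminate()
        p.join()
The forked workers inherit tempfile.tempdir, so everything they create lands in the same directory, and the single atexit handler in the parent removes it in one shot.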
