Using Concurrent Futures without running out of RAM - python

I'm doing some file parsing that is a CPU bound task. No matter how many files I throw at the process it uses no more than about 50MB of RAM.
The task is parrallelisable, and I've set it up to use concurrent futures below to parse each file as a separate process:
from concurrent import futures
with futures.ProcessPoolExecutor(max_workers=6) as executor:
# A dictionary which will contain a list the future info in the key, and the filename in the value
jobs = {}
# Loop through the files, and run the parse function for each file, sending the file-name to it.
# The results of can come back in any order.
for this_file in files_list:
job = executor.submit(parse_function, this_file, **parser_variables)
jobs[job] = this_file
# Get the completed jobs whenever they are done
for job in futures.as_completed(jobs):
# Send the result of the file the job is based on (jobs[job]) and the job (job.result)
results_list = job.result()
this_file = jobs[job]
# delete the result from the dict as we don't need to store it.
del jobs[job]
# post-processing (putting the results into a database)
post_process(this_file, results_list)
The problem is that when I run this using futures, RAM usage rockets and before long I've run out and Python has crashed. This is probably in large part because the results from parse_function are several MB in size. Once the results have been through post_processing, the application has no further need of them. As you can see, I'm trying del jobs[job] to clear items out of jobs, but this has made no difference, memory usage remains unchanged, and seems to increase at the same rate.
I've also confirmed it's not because it's waiting on the post_process function by only using a single process, plus throwing in a time.sleep(1).
There's nothing in the futures docs about memory management, and while a brief search indicates it has come up before in real-world applications of futures (Clear memory in python loop and http://grokbase.com/t/python/python-list/1458ss5etz/real-world-use-of-concurrent-futures) - the answers don't translate to my use-case (they're all concerned with timeouts and the like).
So, how do you use Concurrent futures without running out of RAM?
(Python 3.5)

I'll take a shot (Might be a wrong guess...)
You might need to submit your work bit by bit since on each submit you're making a copy of parser_variables which may end up chewing your RAM.
Here is working code with "<----" on the interesting parts
with futures.ProcessPoolExecutor(max_workers=6) as executor:
# A dictionary which will contain a list the future info in the key, and the filename in the value
jobs = {}
# Loop through the files, and run the parse function for each file, sending the file-name to it.
# The results of can come back in any order.
files_left = len(files_list) #<----
files_iter = iter(files_list) #<------
while files_left:
for this_file in files_iter:
job = executor.submit(parse_function, this_file, **parser_variables)
jobs[job] = this_file
if len(jobs) > MAX_JOBS_IN_QUEUE:
break #limit the job submission for now job
# Get the completed jobs whenever they are done
for job in futures.as_completed(jobs):
files_left -= 1 #one down - many to go... <---
# Send the result of the file the job is based on (jobs[job]) and the job (job.result)
results_list = job.result()
this_file = jobs[job]
# delete the result from the dict as we don't need to store it.
del jobs[job]
# post-processing (putting the results into a database)
post_process(this_file, results_list)
break; #give a chance to add more jobs <-----

Try adding del to your code like this:
for job in futures.as_completed(jobs):
del jobs[job] # or `val = jobs.pop(job)`
# del job # or `job._result = None`

Looking at the concurrent.futures.as_completed() function, I learned it is enough to ensure there is no longer any reference to the future. If you dispense this reference as soon as you've got the result, you'll minimise memory usage.
I use a generator expression for storing my Future instances because everything I care about is already returned by the future in its result (basically, the status of the dispatched work). Other implementations use a dict for example like in your case, because you don't return the input filename as part of the thread workers result.
Using a generator expression means once the result is yielded, there is no longer any reference to the Future. Internally, as_completed() has already taken care of removing its own reference, after it yielded the completed Future to you.
futures = (executor.submit(thread_worker, work) for work in workload)
for future in concurrent.futures.as_completed(futures):
output = future.result()
... # on next loop iteration, garbage will be collected for the result data, too
Edit: Simplified from using a set and removing entries, to simply using a generator expression.

Same problem for me.
In my case I need to start millions of threads. For python2, I would write a thread pool myself using a dict. But in python3 I encounted the following error when I del finished threads dynamically:
RuntimeError: dictionary changed size during iteration
So I have to use concurrent.futures, at first I coded like this:
from concurrent.futures import ThreadPoolExecutor
......
if __name__ == '__main__':
all_resouces = get_all_resouces()
with ThreadPoolExecutor(max_workers=50) as pool:
for r in all_resouces:
pool.submit(handle_resource, *args)
But soon memory exhausted, because memory will be released only after all threads finished. I need to del finished threads before to many thread started. So I read the docs here: https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures
Find that Executor.shutdown(wait=True) might be what I need.
And this is my final solution:
from concurrent.futures import ThreadPoolExecutor
......
if __name__ == '__main__':
all_resouces = get_all_resouces()
i = 0
while i < len(all_resouces):
with ThreadPoolExecutor(max_workers=50) as pool:
for r in all_resouces[i:i+1000]:
pool.submit(handle_resource, *args)
i += 1000
You can avoid having to call this method explicitly if you use the with statement, which will shutdown the Executor (waiting as if Executor.shutdown() were called with wait set to True)

Related

Running a list of custom funcs in parallel, outside of main

I have the following line being called from within a particular area of my script:
data = [extract_func(domain, response) for domain, response, extract_func in responses]
Basically I collected a bunch of webpage responses asynchronously using aiohttp in the variable responses which is nice and fast and so we've already got that. Problem is that the parsing of those responses (using Beautiful Soup) is not asynchronous so I have to parallelize that some other way.
(Each extract_func is technically one of many different various extraction functions that were pre-packaged with the response data so that I call the right Beautiful Soup parsing code for each page. The domain is passed in too for other packaging purposes.)
Anyways I don't know how I'd run all these extraction functions at the same time and then collect the results. I tried looking into multiprocessing but it doesn't seem to apply here / requires that you launch it directly from main, whereas this collection process of mine is taking place from within another function.
I tried this for example (where each extract_function, at the end, adds the returned result to some global list - then here I try):
global extract_shared
extract_shared = []
proc = []
for domain, response, extract_func in responses:
p = Process(target=extract_func, args=(domain, response))
p.start()
proc.append(p)
for p in proc:
p.join()
data = extract_shared
However this still seems to move along super slowly, and I end up with no data anyway so my code is still wrong.
Is there a better way I should be going about this?
Is this correct?
pool = multiprocessing.Pool(multiprocessing.cpu_count())
result_objects = [pool.apply_async(extract_func, args=(domain, response)) for domain, response, extract_func in responses]
data = [r.get() for r in result_objects]
pool.close()
pool.join()
return data
The problem is that your extract_shared list as you have defined it exists as individual instances in each process's address space. You need to have a shared memory implementation of extract_shared so that each process is appending to the same list. If I knew what type of data was being appended, I might be able to recommend which flavor of multiprocessing.Array to use. Alternatively, although it carries a bit more overhead to use, a managed list that is created by a multiprocessing.SyncManager and functions just like a regular list, might be simpler to use.
Using a multiprocessing pool is the way to go. If your worker functions are not returning meaningful results, there is no need to save the the AsyncResult instances returned by the call to ApplyAsync. Simply calling pool.close() followed by pool.join() is sufficient to wait for all outstanding submitted tasks to complete.
import multiprocessing
def init_pool(the_list):
global extract_shared
extract_shared = the_list
# required for Windows:
if __name__ == '__main__':
# compute pool size required but no greater than number of CPU cores we have:
n_processes = min(len(responses), multiprocessing.cpu_count())
# create a managed list:
extract_shared = multiprocessing.Manager().list()
# Initialize each process in the pool's global variable extract_shared with our extract_shared
# (a managed list can also be passed as another argument to the worker function instead)
pool = multiprocessing.Pool(n_processes, initializer=init_pool, initargs=(extract_shared,))
for domain, response, extract_func in responses:
pool.apply_async(extract_func, args=(domain, response))
# wait for tasks to complete
pool.close()
pool.join()
# results in extract_shared
print(extract_shared)
Update
If is easier just to have the worker functions return the results and the main process do the appending. And you had essentially the correct code for that except I would limit the pool size to less than the number of CPU cores you have if the number of tasks you are submitting is less than that number.

Dump intermediate results of multiprocessing job to filesystem and continue with processing later on

I have a job that uses the multiprocessing package and calls a function via
resultList = pool.map(myFunction, myListOfInputParameters).
Each entry of the list of input parameters is independent from others.
This job will run a couple of hours. For safety reasons, I would like to store the results that are made in between in regular time intervals, like e.g. once an hour.
How can I do this and be able to continue with the processing when the job was aborted and I want to restart it based on the last available backup?
Perhaps use pickle. Read more here:
https://docs.python.org/3/library/pickle.html
Based on aws_apprentice's comment I created a full multiprocessing example in case you weren't sure how to use intermediate results. The first time this is run it will print "None" as there are no intermediate results. Run it again to simulate restarting.
from multiprocessing import Process
import pickle
def proc(name):
data = None
# Load intermediate results if they exist
try:
f = open(name+'.pkl', 'rb')
data = pickle.load(f)
f.close()
except:
pass
# Do something
print(data)
data = "intermediate result for " + name
# Periodically save your intermediate results
f = open(name+'.pkl', 'wb')
pickle.dump(data, f, -1)
f.close()
processes = []
for x in range(5):
p = Process(target=proc, args=("proc"+str(x),))
p.daemon = True
p.start()
processes.append(p)
for process in processes:
process.join()
for process in processes:
process.terminate()
You can also use json if that makes sense to output intermediate results in human readable format. Or sqlite as a database if you need to push data into rows.
There are at least two possible options.
Have each call of myFunction save its output into a uniquely named file. The file name should be based on or linked to the input data. Use the parent program to gather the results. In this case myFunction should return an identifier of the item that is finished.
Use imap_unordered instead of map. This will start yielding results as soon as they are available, instead of returing when all processing is finished. Have the parent program save the returned data and a indication which items are finished.
In both cases, the program would have to examine the data saved from previous runs to adjust myListOfInputParameters when it is being re-started.
Which option is best depends to a large degree on the amount of data returned by myFunction. If this is a large amount, there is a significant overhead associated with transferring it back to the parent. In that case option 1 is probably best.
Since writing to disk is relatively slow, calculations wil probably go faster with option 2. And it is easier for the parent program to track progress.
Note that you can also use imap_unordered with option 1.

multiprocessing.Pool.imap_unordered with fixed queue size or buffer?

I am reading data from large CSV files, processing it, and loading it into a SQLite database. Profiling suggests 80% of my time is spent on I/O and 20% is processing input to prepare it for DB insertion. I sped up the processing step with multiprocessing.Pool so that the I/O code is never waiting for the next record. But, this caused serious memory problems because the I/O step could not keep up with the workers.
The following toy example illustrates my problem:
#!/usr/bin/env python # 3.4.3
import time
from multiprocessing import Pool
def records(num=100):
"""Simulate generator getting data from large CSV files."""
for i in range(num):
print('Reading record {0}'.format(i))
time.sleep(0.05) # getting raw data is fast
yield i
def process(rec):
"""Simulate processing of raw text into dicts."""
print('Processing {0}'.format(rec))
time.sleep(0.1) # processing takes a little time
return rec
def writer(records):
"""Simulate saving data to SQLite database."""
for r in records:
time.sleep(0.3) # writing takes the longest
print('Wrote {0}'.format(r))
if __name__ == "__main__":
data = records(100)
with Pool(2) as pool:
writer(pool.imap_unordered(process, data, chunksize=5))
This code results in a backlog of records that eventually consumes all memory because I cannot persist the data to disk fast enough. Run the code and you'll notice that Pool.imap_unordered will consume all the data when writer is at the 15th record or so. Now imagine the processing step is producing dictionaries from hundreds of millions of rows and you can see why I run out of memory. Amdahl's Law in action perhaps.
What is the fix for this? I think I need some sort of buffer for Pool.imap_unordered that says "once there are x records that need insertion, stop and wait until there are less than x before making more." I should be able to get some speed improvement from preparing the next record while the last one is being saved.
I tried using NuMap from the papy module (which I modified to work with Python 3) to do exactly this, but it wasn't faster. In fact, it was worse than running the program sequentially; NuMap uses two threads plus multiple processes.
Bulk import features of SQLite are probably not suited to my task because the data need substantial processing and normalization.
I have about 85G of compressed text to process. I'm open to other database technologies, but picked SQLite for ease of use and because this is a write-once read-many job in which only 3 or 4 people will use the resulting database after everything is loaded.
As I was working on the same problem, I figured that an effective way to prevent the pool from overloading is to use a semaphore with a generator:
from multiprocessing import Pool, Semaphore
def produce(semaphore, from_file):
with open(from_file) as reader:
for line in reader:
# Reduce Semaphore by 1 or wait if 0
semaphore.acquire()
# Now deliver an item to the caller (pool)
yield line
def process(item):
result = (first_function(item),
second_function(item),
third_function(item))
return result
def consume(semaphore, result):
database_con.cur.execute("INSERT INTO ResultTable VALUES (?,?,?)", result)
# Result is consumed, semaphore may now be increased by 1
semaphore.release()
def main()
global database_con
semaphore_1 = Semaphore(1024)
with Pool(2) as pool:
for result in pool.imap_unordered(process, produce(semaphore_1, "workfile.txt"), chunksize=128):
consume(semaphore_1, result)
See also:
K Hong - Multithreading - Semaphore objects & thread pool
Lecture from Chris Terman - MIT 6.004 L21: Semaphores
Since processing is fast, but writing is slow, it sounds like your problem is
I/O-bound. Therefore there might not be much to be gained from using
multiprocessing.
However, it is possible to peel off chunks of data, process the chunk, and
wait until that data has been written before peeling off another chunk:
import itertools as IT
if __name__ == "__main__":
data = records(100)
with Pool(2) as pool:
chunksize = ...
for chunk in iter(lambda: list(IT.islice(data, chunksize)), []):
writer(pool.imap_unordered(process, chunk, chunksize=5))
It sounds like all you really need is to replace the unbounded queues underneath the Pool with bounded (and blocking) queues. That way, if any side gets ahead of the rest, it'll just block until they're ready.
This would be easy to do by peeking at the source, to subclass or monkeypatch Pool, something like:
class Pool(multiprocessing.pool.Pool):
def _setup_queues(self):
self._inqueue = self._ctx.Queue(5)
self._outqueue = self._ctx.Queue(5)
self._quick_put = self._inqueue._writer.send
self._quick_get = self._outqueue._reader.recv
self._taskqueue = queue.Queue(10)
But that's obviously not portable (even to CPython 3.3, much less to a different Python 3 implementation).
I think you can do it portably in 3.4+ by providing a customized context, but I haven't been able to get that right, so…
A simple workaround might be to use psutil to detect the memory usage in each process and say if more than 90% of memory are taken, than just sleep for a while.
while psutil.virtual_memory().percent > 75:
time.sleep(1)
print ("process paused for 1 seconds!")

Memory usage keep growing with Python's multiprocessing.pool

Here's the program:
#!/usr/bin/python
import multiprocessing
def dummy_func(r):
pass
def worker():
pass
if __name__ == '__main__':
pool = multiprocessing.Pool(processes=16)
for index in range(0,100000):
pool.apply_async(worker, callback=dummy_func)
# clean up
pool.close()
pool.join()
I found memory usage (both VIRT and RES) kept growing up till close()/join(), is there any solution to get rid of this? I tried maxtasksperchild with 2.7 but it didn't help either.
I have a more complicated program that calles apply_async() ~6M times, and at ~1.5M point I've already got 6G+ RES, to avoid all other factors, I simplified the program to above version.
EDIT:
Turned out this version works better, thanks for everyone's input:
#!/usr/bin/python
import multiprocessing
ready_list = []
def dummy_func(index):
global ready_list
ready_list.append(index)
def worker(index):
return index
if __name__ == '__main__':
pool = multiprocessing.Pool(processes=16)
result = {}
for index in range(0,1000000):
result[index] = (pool.apply_async(worker, (index,), callback=dummy_func))
for ready in ready_list:
result[ready].wait()
del result[ready]
ready_list = []
# clean up
pool.close()
pool.join()
I didn't put any lock there as I believe main process is single threaded (callback is more or less like a event-driven thing per docs I read).
I changed v1's index range to 1,000,000, same as v2 and did some tests - it's weird to me v2 is even ~10% faster than v1 (33s vs 37s), maybe v1 was doing too many internal list maintenance jobs. v2 is definitely a winner on memory usage, it never went over 300M (VIRT) and 50M (RES), while v1 used to be 370M/120M, the best was 330M/85M. All numbers were just 3~4 times testing, reference only.
I had memory issues recently, since I was using multiple times the multiprocessing function, so it keep spawning processes, and leaving them in memory.
Here's the solution I'm using now:
def myParallelProcess(ahugearray):
from multiprocessing import Pool
from contextlib import closing
with closing(Pool(15)) as p:
res = p.imap_unordered(simple_matching, ahugearray, 100)
return res
Simply create the pool within your loop and close it at the end of the loop with
pool.close().
Use map_async instead of apply_async to avoid excessive memory usage.
For your first example, change the following two lines:
for index in range(0,100000):
pool.apply_async(worker, callback=dummy_func)
to
pool.map_async(worker, range(100000), callback=dummy_func)
It will finish in a blink before you can see its memory usage in top. Change the list to a bigger one to see the difference. But note map_async will first convert the iterable you pass to it to a list to calculate its length if it doesn't have __len__ method. If you have an iterator of a huge number of elements, you can use itertools.islice to process them in smaller chunks.
I had a memory problem in a real-life program with much more data and finally found the culprit was apply_async.
P.S., in respect of memory usage, your two examples have no obvious difference.
I have a very large 3d point cloud data set I'm processing. I tried using the multiprocessing module to speed up the processing, but I started getting out of memory errors. After some research and testing I determined that I was filling the queue of tasks to be processed much quicker than the subprocesses could empty it. I'm sure by chunking, or using map_async or something I could have adjusted the load, but I didn't want to make major changes to the surrounding logic.
The dumb solution I hit on is to check the pool._cache length intermittently, and if the cache is too large then wait for the queue to empty.
In my mainloop I already had a counter and a status ticker:
# Update status
count += 1
if count%10000 == 0:
sys.stdout.write('.')
if len(pool._cache) > 1e6:
print "waiting for cache to clear..."
last.wait() # Where last is assigned the latest ApplyResult
So every 10k insertion into the pool I check if there are more than 1 million operations queued (about 1G of memory used in the main process). When the queue is full I just wait for the last inserted job to finish.
Now my program can run for hours without running out of memory. The main process just pauses occasionally while the workers continue processing the data.
BTW the _cache member is documented the the multiprocessing module pool example:
#
# Check there are no outstanding tasks
#
assert not pool._cache, 'cache = %r' % pool._cache
You can limit the number of task per child process
multiprocessing.Pool(maxtasksperchild=1)
maxtasksperchild is the number of tasks a worker process can complete before it will exit and be replaced with a fresh worker process, to enable unused resources to be freed. The default maxtasksperchild is None, which means worker processes will live as long as the pool. link
I think this is similar to the question I posted, but I'm not sure you have the same delay. My problem was that I was producing results from the multiprocessing pool faster than I was consuming them, so they built up in memory. To avoid that, I used a semaphore to throttle the inputs into the pool so they didn't get too far ahead of the outputs I was consuming.

Multiprocessing with python3 only runs once

I have a problem running multiple processes in python3 .
My program does the following:
1. Takes entries from an sqllite database and passes them to an input_queue
2. Create multiple processes that take items off the input_queue, run it through a function and output the result to the output queue.
3. Create a thread that takes items off the output_queue and prints them (This thread is obviously started before the first 2 steps)
My problem is that currently the 'function' in step 2 is only run as many times as the number of processes set, so for example if you set the number of processes to 8, it only runs 8 times then stops. I assumed it would keep running until it took all items off the input_queue.
Do I need to rewrite the function that takes the entries out of the database (step 1) into another process and then pass its output queue as an input queue for step 2?
Edit:
Here is an example of the code, I used a list of numbers as a substitute for the database entries as it still performs the same way. I have 300 items on the list and I would like it to process all 300 items, but at the moment it just processes 10 (the number of processes I have assigned)
#!/usr/bin/python3
from multiprocessing import Process,Queue
import multiprocessing
from threading import Thread
## This is the class that would be passed to the multi_processing function
class Processor:
def __init__(self,out_queue):
self.out_queue = out_queue
def __call__(self,in_queue):
data_entry = in_queue.get()
result = data_entry*2
self.out_queue.put(result)
#Performs the multiprocessing
def perform_distributed_processing(dbList,threads,processor_factory,output_queue):
input_queue = Queue()
# Create the Data processors.
for i in range(threads):
processor = processor_factory(output_queue)
data_proc = Process(target = processor,
args = (input_queue,))
data_proc.start()
# Push entries to the queue.
for entry in dbList:
input_queue.put(entry)
# Push stop markers to the queue, one for each thread.
for i in range(threads):
input_queue.put(None)
data_proc.join()
output_queue.put(None)
if __name__ == '__main__':
output_results = Queue()
def output_results_reader(queue):
while True:
item = queue.get()
if item is None:
break
print(item)
# Establish results collecting thread.
results_process = Thread(target = output_results_reader,args = (output_results,))
results_process.start()
# Use this as a substitute for the database in the example
dbList = [i for i in range(300)]
# Perform multi processing
perform_distributed_processing(dbList,10,Processor,output_results)
# Wait for it all to finish.
results_process.join()
A collection of processes that service an input queue and write to an output queue is pretty much the definition of a process pool.
If you want to know how to build one from scratch, the best way to learn is to look at the source code for multiprocessing.Pool, which is pretty simply Python, and very nicely written. But, as you might expect, you can just use multiprocessing.Pool instead of re-implementing it. The examples in the docs are very nice.
But really, you could make this even simpler by using an executor instead of a pool. It's hard to explain the difference (again, read the docs for both modules), but basically, a future is a "smart" result object, which means instead of a pool with a variety of different ways to run jobs and get results, you just need a dumb thing that doesn't know how to do anything but return futures. (Of course in the most trivial cases, the code looks almost identical either way…)
from concurrent.futures import ProcessPoolExecutor
def Processor(data_entry):
return data_entry*2
def perform_distributed_processing(dbList, threads, processor_factory):
with ProcessPoolExecutor(processes=threads) as executor:
yield from executor.map(processor_factory, dbList)
if __name__ == '__main__':
# Use this as a substitute for the database in the example
dbList = [i for i in range(300)]
for result in perform_distributed_processing(dbList, 8, Processor):
print(result)
Or, if you want to handle them as they come instead of in order:
def perform_distributed_processing(dbList, threads, processor_factory):
with ProcessPoolExecutor(processes=threads) as executor:
fs = (executor.submit(processor_factory, db) for db in dbList)
yield from map(Future.result, as_completed(fs))
Notice that I also replaced your in-process queue and thread, because it wasn't doing anything but providing a way to interleave "wait for the next result" and "process the most recent result", and yield (or yield from, in this case) does that without all the complexity, overhead, and potential for getting things wrong.
Don't try to rewrite the whole multiprocessing library again. I think you can use any of multiprocessing.Pool methods depending on your needs - if this is a batch job you can even use the synchronous multiprocessing.Pool.map() - only instead of pushing to input queue, you need to write a generator that yields input to the threads.

Categories

Resources