Multiprocessing is not executing parallel in Python - python

I have edited the code , currently it is working fine . But thinks it is not executing parallely or dynamically . Can anyone please check on to it
Code :
def folderStatistic(t):
j, dir_name = t
row = []
for content in dir_name.split(","):
row.append(content)
print(row)
def get_directories():
import csv
with open('CONFIG.csv', 'r') as file:
reader = csv.reader(file,delimiter = '\t')
return [col for row in reader for col in row]
def folderstatsMain():
freeze_support()
start = time.time()
pool = Pool()
worker = partial(folderStatistic)
pool.map(worker, enumerate(get_directories()))
def datatobechecked():
try:
folderstatsMain()
except Exception as e:
# pass
print(e)
if __name__ == '__main__':
datatobechecked()
Config.CSV
C:\USERS, .CSV
C:\WINDOWS , .PDF
etc.
There may be around 200 folder paths in config.csv

welcome to StackOverflow and Python programming world!
Moving on to the question.
Inside the get_directories() function you open the file in with context, get the reader object and close the file immediately after the moment you leave the context so when the time comes to use the reader object the file is already closed.
I don't want to discourage you, but if you are very new to programming do not dive into parallel programing yet. Difficulty in handling multiple threads simultaneously grows exponentially with every thread you add (pools greatly simplify this process though). Processes are even worse as they don't share memory and can't communicate with each other easily.
My advice is, try to write it as a single-thread program first. If you have it working and still need to parallelize it, isolate a single function with input file path as a parameter that does all the work and then use thread/process pool on that function.
EDIT:
From what I can understand from your code, you get directory names from the CSV file and then for each "cell" in the file you run parallel folderStatistics. This part seems correct. The problem may lay in dir_name.split(","), notice that you pass individual "cells" to the folderStatistics not rows. What makes you think it's not running paralelly?.

There is a certain amount of overhead in creating a multiprocessing pool because creating processes is, unlike creating threads, a fairly costly operation. Then those submitted tasks, represented by each element of the iterable being passed to the map method, are gathered up in "chunks" and written to a multiprocessing queue of tasks that are read by the pool processes. This data has to move from one address space to another and that has a cost associated with it. Finally when your worker function, folderStatistic, returns its result (which is None in this case), that data has to be moved from one process's address space back to the main process's address space and that too has a cost associated with it.
All of those added costs become worthwhile when your worker function is sufficiently CPU-intensive such that these additional costs is small compared to the savings gained by having the tasks run in parallel. But your worker function's CPU requirements are so small as to reap any benefit from multiprocessing.
Here is a demo comparing single-processing time vs. multiprocessing times for invoking a worker function, fn, twice where the first time it only performs its internal loop 10 times (low CPU requirements) while the second time it performs its internal loop 1,000,000 times (higher CPU requirements). You can see that in the first case the multiprocessing version runs considerable slower (you can't even measure the time for the single processing run). But when we make fn more CPU-intensive, then multiprocessing achieves gains over the single-processing case.
from multiprocessing import Pool
from functools import partial
import time
def fn(iterations, x):
the_sum = x
for _ in range(iterations):
the_sum += x
return the_sum
# required for Windows:
if __name__ == '__main__':
for n_iterations in (10, 1_000_000):
# single processing time:
t1 = time.time()
for x in range(1, 20):
fn(n_iterations, x)
t2 = time.time()
# multiprocessing time:
worker = partial(fn, n_iterations)
t3 = time.time()
with Pool() as p:
results = p.map(worker, range(1, 20))
t4 = time.time()
print(f'#iterations = {n_iterations}, single processing time = {t2 - t1}, multiprocessing time = {t4 - t3}')
Prints:
#iterations = 10, single processing time = 0.0, multiprocessing time = 0.35399389266967773
#iterations = 1000000, single processing time = 1.182999849319458, multiprocessing time = 0.5530076026916504
But even with a pool size of 8, the running time is not reduced by a factor of 8 (it's more like a factor of 2) due to the fixed multiprocessing overhead. When I change the number of iterations for the second case to be 100,000,000 (even more CPU-intensive), we get ...
#iterations = 100000000, single processing time = 109.3077495098114, multiprocessing time = 27.202054023742676
... which is a reduction in running time by a factor of 4 (I have many other processes running in my computer, so there is competition for the CPU).

Related

Using pool for multiprocessing in Python (Windows)

I have to do my study in a parallel way to run it much faster. I am new to multiprocessing library in python, and could not yet make it run successfully.
Here, I am investigating if each pair of (origin, target) remains at certain locations between various frames of my study. Several points:
It is one function, which I want to run faster (It is not several processes).
The process is performed subsequently; it means that each frame is compared with the previous one.
This code is a very simpler form of the original code. The code outputs a residece_list.
I am using Windows OS.
Can someone check the code (the multiprocessing section) and help me improve it to make it work. Thanks.
import numpy as np
from multiprocessing import Pool, freeze_support
def Main_Residence(total_frames, origin_list, target_list):
Previous_List = {}
residence_list = []
for frame in range(total_frames): #Each frame
Current_List = {} #Dict of pair and their residence for frames
for origin in range(origin_list):
for target in range(target_list):
Pair = (origin, target) #Eahc pair
if Pair in Current_List.keys(): #If already considered, continue
continue
else:
if origin == target:
if (Pair in Previous_List.keys()): #If remained from the previous frame, add residence
print "Origin_Target remained: ", Pair
Current_List[Pair] = (Previous_List[Pair] + 1)
else: #If new, add it to the current
Current_List[Pair] = 1
for pair in Previous_List.keys(): #Add those that exited from residence to the list
if pair not in Current_List.keys():
residence_list.append(Previous_List[pair])
Previous_List = Current_List
return residence_list
if __name__ == '__main__':
pool = Pool(processes=5)
Residence_List = pool.apply_async(Main_Residence, args=(20, 50, 50))
print Residence_List.get(timeout=1)
pool.close()
pool.join()
freeze_support()
Residence_List = np.array(Residence_List) * 5
Multiprocessing does not make sense in the context you are presenting here.
You are creating five subprocesses (and three threads belonging to the pool, managing workers, tasks and results) to execute one function once. All of this is coming at a cost, both in system resources and execution time, while four of your worker processes don't do anything at all. Multiprocessing does not speed up the execution of a function. The code in your specific example will always be slower than plainly executing Main_Residence(20, 50, 50) in the main process.
For multiprocessing to make sense in such a context, your work at hand would need to be broken down to a set of homogenous tasks that can be processed in parallel with their results potentially being merged later.
As an example (not necessarily a good one), if you want to calculate the largest prime factors for a sequence of numbers, you can delegate the task of calculating that factor for any specific number to a worker in a pool. Several workers would then do these individual calculations in parallel:
def largest_prime_factor(n):
p = n
i = 2
while i * i <= n:
if n % i:
i += 1
else:
n //= i
return p, n
if __name__ == '__main__':
pool = Pool(processes=3)
start = datetime.now()
# this delegates half a million individual tasks to the pool, i.e.
# largest_prime_factor(0), largest_prime_factor(1), ..., largest_prime_factor(499999)
pool.map(largest_prime_factor, range(500000))
pool.close()
pool.join()
print "pool elapsed", datetime.now() - start
start = datetime.now()
# same work just in the main process
[largest_prime_factor(i) for i in range(500000)]
print "single elapsed", datetime.now() - start
Output:
pool elapsed 0:00:04.664000
single elapsed 0:00:08.939000
(the largest_prime_factor function is taken from #Stefan in this answer)
As you can see, the pool is only roughly twice as fast as single process execution of the same amount of work, all while running in three processes in parallel. That's due to the overhead introduced by multiprocessing/the pool.
So, you stated that the code in your example has been simplified. You'll have to analyse your original code to see if it can be broken down to homogenous tasks that can be passed down to your pool for processing. If that is possible, using multiprocessing might help you speed up your program. If not, multiprocessing will likely cost you time, rather than save it.
Edit:
Since you asked for suggestions on the code. I can hardly say anything about your function. You said yourself that it is just a simplified example to provide an MCVE (much appreciated by the way! Most people don't take the time to strip down their code to its bare minimum). Requests for a code review are anyway better suited over at Codereview.
Play around a bit with the available methods of task delegation. In my prime factor example, using apply_async came with a massive penalty. Execution time increased ninefold, compared to using map. But my example is using just a simple iterable, yours needs three arguments per task. This could be a case for starmap, but that is only available as of Python 3.3.Anyway, the structure/nature of your task data basically determines the correct method to use.
I did some q&d testing with multiprocessing your example function.
The input was defined like this:
inp = [(20, 50, 50)] * 5000 # that makes 5000 tasks against your Main_Residence
I ran that in Python 3.6 in three subprocesses with your function unaltered, except for the removal of the print statment (I/O is costly). I used, starmap, apply, starmap_async and apply_async and also iterated through the results each time to account for the blocking get() on the async results.
Here's the output:
starmap elapsed 0:01:14.506600
apply elapsed 0:02:11.290600
starmap async elapsed 0:01:27.718800
apply async elapsed 0:01:12.571200
# btw: 5k calls to Main_Residence in the main process looks as bad
# as using apply for delegation
single elapsed 0:02:12.476800
As you can see, the execution times differ, although all four methods do the same amount of work; the apply_async you picked appears to be the fastest method.
Coding Style. Your code looks quite ... unconventional :) You use Capitalized_Words_With_Underscore for your names (both, function and variable names), that's pretty much a no-no in Python. Also, assigning the name Previous_List to a dictionary is ... questionable. Have a look at PEP 8, especially the section Naming Conventions to see the commonly accepted coding style for Python.
Judging by the way your print looks, you are still using Python 2. I know that in corporate or institutional environments that's sometimes all you have available. Still, keep in mind that the clock for Python 2 is ticking

python multiprocessing benchmark

I want to perform some benchmarking between 'multiprocessing' a file and sequential processing a file.
In basics it's a file that is read line by line (consists of 100 lines), and the first character is read from eachline and is put into the list if it doesn't exists.
import multiprocessing as mp
import sys
import time
database_layout=[]
def get_first_characters(string):
global database_layout
if string[0:1] not in database_layout:
database_layout.append(string[0:1])
if __name__ == '__main__':
start_time = time.time()
bestand_first_read=open('random.txt','r', encoding="latin-1")
for line in bestand_first_read:
p = mp.Process(target=get_first_characters, args=(line,))
p.start()
print(str(len(database_layout)))
print("Finished part one: "+ str(time.time() - start_time))
bestand_first_read.close()
###Part two
database_layout_two=[]
start_time = time.time()
bestand_first_read_two=open('random.txt','r', encoding="latin-1")
for linetwo in bestand_first_read_two:
if linetwo[0:1] not in database_layout_two:
database_layout_two.append(linetwo[0:1])
print(str(len(database_layout_two)))
print("Finished: part two"+ str(time.time() - start_time))
But when i execute this program i get the following result:
python test.py
0
Finished part one: 17.105965852737427
10
Finished part two: 0.0
Two problems arise at this moment.
1) Why does the multiprocessing takes much longer (+/- 17 sec) than the sequential processing (+/- 0 sec).
2) Why does the list 'database_layout' defined not get filled? (It is the same code)
EDIT
A same example which works with Pools.
import multiprocessing as mp
import timeit
def get_first_characters(string):
return string
if __name__ == '__main__':
database_layout=[]
start = timeit.default_timer()
nr = 0
with mp.Pool(processes=4) as pool:
for i in range(99999):
nr += 1
database_layout.append(pool.starmap(get_first_characters, [(str(i),)]))
stop = timeit.default_timer()
print("Pools: %s " % (stop - start))
database_layout=[]
start = timeit.default_timer()
for i in range(99999):
database_layout.append(get_first_characters(str(i)))
stop = timeit.default_timer()
print("Regular: %s " % (stop - start))
After running above example the following output is shown.
Pools: 22.058468394726148
Regular: 0.051738489109649066
This shows that in such a case working with Pools is 440 times slower than using sequential processing. Any clou why this is?
Multiprocessing starts one process for each line of your input. That means that all the overhead of opening one new Python interpreter for each line of your (possibly very long) file. That accounts for the long time it takes to go through the file.
However, there are other issues with your code. While there is no synchronisation issue due to fighting for the file (since all reads are done in the main process, where the line iteration is going on), you have misunderstood how multiprocessing works.
First of all, your global variable is not global across processes. Actually processes don't usually share memory (like threads) and you have to use some interface to share objects (and hence why shared objects must be picklable). When your code opens each process, each interpreter instance starts by loading your file, which creates a new database_layout variable. Because of that, each interpreter starts with an empty list, which means it ends with a single-element list. For actually sharing the list, you might want to use a Manager (also see how to share state in the docs).
Also because of the huge overhead of opening new interpreters, your script performance may benefit from using a pool of workers, since this will open just a few processes for sharing the work. Remember that resource contention will impact performance if opening more processes than you have CPU cores.
The second problem, besides the issue of sharing your variable, is that your code does not wait for the processing to finish. Hence, even if the state was shared, your processing might not have finished when you check the length of database_layout. Again, using a pool might help with that.
PS: unless you want to preserve the insertion order, you might get even faster by using a set, though I'm not sure the Manager supports it.
EDIT after the OP EDIT: Your pool code is still starting up the pool for each line (or number). As you did, you still have much of your processing in the main process, just looping and passing arguments to the other processes. Besides, you're still running each element in the pool individually and appending in the list, which pretty much uses only one worker process at a time (remember that map or starmaps waits until the work finishes to return). This is from Process Explorer running your code:
Note how the main process is still doing all the hard work (22% in a quad-core machine means its CPU is maxed). What you need to do is pass the iterable to map() in a single call, minimizing the work (specially switching between Python and the C side):
import multiprocessing as mp
import timeit
def get_first_characters(number):
return str(number)[0]
if __name__ == '__main__':
start = timeit.default_timer()
with mp.Pool(processes=4) as pool:
database_layout1 = (pool.map(get_first_characters, range(99999)))
stop = timeit.default_timer()
print("Pools: %s " % (stop - start))
database_layout2=[]
start = timeit.default_timer()
for i in range(99999):
database_layout2.append(get_first_characters(str(i)))
stop = timeit.default_timer()
print("Regular: %s " % (stop - start))
assert database_layout1 == database_layout2
This got me from this:
Pools: 14.169268206710512
Regular: 0.056271265139002935
To this:
Pools: 0.35610273658926417
Regular: 0.07681461930314981
It's still slower than the single-processing one, but that's mainly because of the message-passing overhead for a very simple function. If your function is more complex it'll make more sense.

Why is reading multiple files at the same time slower than reading sequentially?

I am trying to parse many files found in a directory, however using multiprocessing slows my program.
# Calling my parsing function from Client.
L = getParsedFiles('/home/tony/Lab/slicedFiles') <--- 1000 .txt files found here.
combined ~100MB
Following this example from python documentation:
from multiprocessing import Pool
def f(x):
return x*x
if __name__ == '__main__':
p = Pool(5)
print(p.map(f, [1, 2, 3]))
I've written this piece of code:
from multiprocessing import Pool
from api.ttypes import *
import gc
import os
def _parse(pathToFile):
myList = []
with open(pathToFile) as f:
for line in f:
s = line.split()
x, y = [int(v) for v in s]
obj = CoresetPoint(x, y)
gc.disable()
myList.append(obj)
gc.enable()
return Points(myList)
def getParsedFiles(pathToFile):
myList = []
p = Pool(2)
for filename in os.listdir(pathToFile):
if filename.endswith(".txt"):
myList.append(filename)
return p.map(_pars, , myList)
I followed the example, put all the names of the files that end with a .txt in a list, then created Pools, and mapped them to my function. Then I want to return a list of objects. Each object holds the parsed data of a file. However it amazes me that I got the following results:
#Pool 32 ---> ~162(s)
#Pool 16 ---> ~150(s)
#Pool 12 ---> ~142(s)
#Pool 2 ---> ~130(s)
Graph:
Machine specification:
62.8 GiB RAM
Intel® Core™ i7-6850K CPU # 3.60GHz × 12
What am I missing here ?
Thanks in advance!
Looks like you're I/O bound:
In computer science, I/O bound refers to a condition in which the time it takes to complete a computation is determined principally by the period spent waiting for input/output operations to be completed. This is the opposite of a task being CPU bound. This circumstance arises when the rate at which data is requested is slower than the rate it is consumed or, in other words, more time is spent requesting data than processing it.
You probably need to have your main thread do the reading and add the data to the pool when a subprocess becomes available. This will be different to using map.
As you are processing a line at a time, and the inputs are split, you can use fileinput to iterate over lines of multiple files, and map to a function processing lines instead of files:
Passing one line at a time might be too slow, so we can ask map to pass chunks, and can adjust until we find a sweet-spot. Our function parses chunks of lines:
def _parse_coreset_points(lines):
return Points([_parse_coreset_point(line) for line in lines])
def _parse_coreset_point(line):
s = line.split()
x, y = [int(v) for v in s]
return CoresetPoint(x, y)
And our main function:
import fileinput
def getParsedFiles(directory):
pool = Pool(2)
txts = [filename for filename in os.listdir(directory):
if filename.endswith(".txt")]
return pool.imap(_parse_coreset_points, fileinput.input(txts), chunksize=100)
In general it is never a good idea to read from the same physical (spinning) hard disk from different threads simultaneously, because every switch causes an extra delay of around 10ms to position the read head of the hard disk (would be different on SSD).
As #peter-wood already said, it is better to have one thread reading in the data, and have other threads processing that data.
Also, to really test the difference, I think you should do the test with some bigger files. For example: current hard disks should be able to read around 100MB/sec. So reading the data of a 100kB file in one go would take 1ms, while positioning the read head to the beginning of that file would take 10ms.
On the other hand, looking at your numbers (assuming those are for a single loop) it is hard to believe that being I/O bound is the only problem here. Total data is 100MB, which should take 1 second to read from disk plus some overhead, but your program takes 130 seconds. I don't know if that number is with the files cold on disk, or an average of multiple tests where the data is already cached by the OS (with 62 GB or RAM all that data should be cached the second time) - it would be interesting to see both numbers.
So there has to be something else. Let's take a closer look at your loop:
for line in f:
s = line.split()
x, y = [int(v) for v in s]
obj = CoresetPoint(x, y)
gc.disable()
myList.append(obj)
gc.enable()
While I don't know Python, my guess would be that the gc calls are the problem here. They are called for every line read from disk. I don't know how expensive those calls are (or what if gc.enable() triggers a garbage collection for example) and why they would be needed around append(obj) only, but there might be other problems because this is multithreading:
Assuming the gc object is global (i.e. not thread local) you could have something like this:
thread 1 : gc.disable()
# switch to thread 2
thread 2 : gc.disable()
thread 2 : myList.append(obj)
thread 2 : gc.enable()
# gc now enabled!
# switch back to thread 1 (or one of the other threads)
thread 1 : myList.append(obj)
thread 1 : gc.enable()
And if the number of threads <= number of cores, there wouldn't even be any switching, they would all be calling this at the same time.
Also, if the gc object is thread safe (it would be worse if it isn't) it would have to do some locking in order to safely alter it's internal state, which would force all other threads to wait.
For example, gc.disable() would look something like this:
def disable()
lock() # all other threads are blocked for gc calls now
alter internal data
unlock()
And because gc.disable() and gc.enable() are called in a tight loop, this will hurt performance when using multiple threads.
So it would be better to remove those calls, or place them at the beginning and end of your program if they are really needed (or only disable gc at the beginning, no need to do gc right before quitting the program).
Depending on the way Python copies or moves objects, it might also be slightly better to use myList.append(CoresetPoint(x, y)).
So it would be interesting to test the same on one 100MB file with one thread and without the gc calls.
If the processing takes longer than the reading (i.e. not I/O bound), use one thread to read the data in a buffer (should take 1 or 2 seconds on one 100MB file if not already cached), and multiple threads to process the data (but still without those gc calls in that tight loop).
You don't have to split the data into multiple files in order to be able to use threads. Just let them process different parts of the same file (even with the 14GB file).
A copy-paste snippet, for people who come from Google and don't like reading
Example is for json reading, just replace __single_json_loader with another file type to work with that.
from multiprocessing import Pool
from typing import Callable, Any, Iterable
import os
import json
def parallel_file_read(existing_file_paths: Iterable[str], map_lambda: Callable[[str], Any]):
result = {p: None for p in existing_file_paths}
pool = Pool()
for i, (temp_result, path) in enumerate(zip(pool.imap(map_lambda, existing_file_paths), result.keys())):
result[path] = temp_result
pool.close()
pool.join()
return result
def __single_json_loader(f_path: str):
with open(f_path, "r") as f:
return json.load(f)
def parallel_json_read(existing_file_paths: Iterable[str]):
combined_result = parallel_file_read(existing_file_paths, __single_json_loader)
return combined_result
And usage
if __name__ == "__main__":
def main():
directory_path = r"/path/to/my/file/directory"
assert os.path.isdir(directory_path)
d: os.DirEntry
all_files_names = [f for f in os.listdir(directory_path)]
all_files_paths = [os.path.join(directory_path, f_name) for f_name in all_files_names]
assert(all(os.path.isfile(p) for p in all_files_paths))
combined_result = parallel_json_read(all_files_paths)
main()
Very straight forward to replace a json reader with any other reader, and you're done.

Parallelism Speed

I'm trying to get a handle on python Parallelism. This is the code i'm using
import time
from concurrent.futures import ProcessPoolExecutor
def listmaker():
for i in xrange(10000000):
pass
#Without duo core
start = time.time()
listmaker()
end = time.time()
nocore = "Total time, no core, %.3f" % (end- start)
#with duo core
start = time.time()
pool = ProcessPoolExecutor(max_workers=2) #I have two cores
results = list(pool.map(listmaker()))
end = time.time()
core = "Total time core, %.3f" % (end- start)
print nocore
print core
I was under the assumption that because i'm using two cores my the speed be be close to double. However when i run this code most of the time the nocore output is faster than the core output. This is true even if I change
def listmaker():
for i in xrange(10000000):
pass
to
def listmaker():
for i in xrange(10000000):
print i
In fact in some runs the no core run is faster. Could someone shed some light on the issue? I'm is my setup correct? I'm I doing something wrong?
You're using pool.map() incorrectly. Take a look at the pool.map documentation. It expects an iterable argument, and it will pass each of the items from the iterable to the pool individually. Since your function only returns None, there is nothing for it to do. However, you're still incurring the overhead of spawning extra processes, which takes time.
Your usage of pool.map should look like this:
results = pool.map(function_name, some_iterable)
Notice a couple of things:
Since you're using the print statement rather than a function, I'm assuming you're using some Python2 variant. In Python2, pool.map returns a list anyway. No need to convert it to a list again.
The first argument should be the function name without parentheses. This identifies the function that the pool workers should execute. When you include the parentheses, the function is called right there, instead of in the pool.
pool.map is intended to call a function on every item in an iterable, so your test cases needs to create some iterable for it to consume, instead of a function that takes no arguments like your current example.
Try to run your trial again with some actual input to the function, and retrieve the output. Here's an example:
import time
from concurrent.futures import ProcessPoolExecutor
def read_a_file(file_name):
with open(file_name) as fi:
text = fi.read()
return text
file_list = ['t1.txt', 't2.txt', 't3.txt']
#Without duo core
start = time.time()
single_process_text_list = []
for file_name in file_list:
single_process_text_list.append(read_a_file(file_name))
end = time.time()
nocore = "Total time, no core, %.3f" % (end- start)
#with duo core
start = time.time()
pool = ProcessPoolExecutor(max_workers=2) #I have two cores
multiprocess_text_list = pool.map(read_a_file, file_list)
end = time.time()
core = "Total time core, %.3f" % (end- start)
print(nocore)
print(core)
Results:
Total time, no core, 0.047
Total time core, 0.009
The text files are 150,000 lines of gibberish each. Notice how much work had to be done before the parallel processing was worth it. When I ran the trial with 10,000 lines in each file, the single process approach was still faster because it didn't have the overhead of spawning extra processes. But with that much work to do, the extra processes become worth the effort.
And by the way, this functionality is available with multiprocessing pools in Python2, so you can avoid importing anything from futures if you want to.

Memory usage keep growing with Python's multiprocessing.pool

Here's the program:
#!/usr/bin/python
import multiprocessing
def dummy_func(r):
pass
def worker():
pass
if __name__ == '__main__':
pool = multiprocessing.Pool(processes=16)
for index in range(0,100000):
pool.apply_async(worker, callback=dummy_func)
# clean up
pool.close()
pool.join()
I found memory usage (both VIRT and RES) kept growing up till close()/join(), is there any solution to get rid of this? I tried maxtasksperchild with 2.7 but it didn't help either.
I have a more complicated program that calles apply_async() ~6M times, and at ~1.5M point I've already got 6G+ RES, to avoid all other factors, I simplified the program to above version.
EDIT:
Turned out this version works better, thanks for everyone's input:
#!/usr/bin/python
import multiprocessing
ready_list = []
def dummy_func(index):
global ready_list
ready_list.append(index)
def worker(index):
return index
if __name__ == '__main__':
pool = multiprocessing.Pool(processes=16)
result = {}
for index in range(0,1000000):
result[index] = (pool.apply_async(worker, (index,), callback=dummy_func))
for ready in ready_list:
result[ready].wait()
del result[ready]
ready_list = []
# clean up
pool.close()
pool.join()
I didn't put any lock there as I believe main process is single threaded (callback is more or less like a event-driven thing per docs I read).
I changed v1's index range to 1,000,000, same as v2 and did some tests - it's weird to me v2 is even ~10% faster than v1 (33s vs 37s), maybe v1 was doing too many internal list maintenance jobs. v2 is definitely a winner on memory usage, it never went over 300M (VIRT) and 50M (RES), while v1 used to be 370M/120M, the best was 330M/85M. All numbers were just 3~4 times testing, reference only.
I had memory issues recently, since I was using multiple times the multiprocessing function, so it keep spawning processes, and leaving them in memory.
Here's the solution I'm using now:
def myParallelProcess(ahugearray):
from multiprocessing import Pool
from contextlib import closing
with closing(Pool(15)) as p:
res = p.imap_unordered(simple_matching, ahugearray, 100)
return res
Simply create the pool within your loop and close it at the end of the loop with
pool.close().
Use map_async instead of apply_async to avoid excessive memory usage.
For your first example, change the following two lines:
for index in range(0,100000):
pool.apply_async(worker, callback=dummy_func)
to
pool.map_async(worker, range(100000), callback=dummy_func)
It will finish in a blink before you can see its memory usage in top. Change the list to a bigger one to see the difference. But note map_async will first convert the iterable you pass to it to a list to calculate its length if it doesn't have __len__ method. If you have an iterator of a huge number of elements, you can use itertools.islice to process them in smaller chunks.
I had a memory problem in a real-life program with much more data and finally found the culprit was apply_async.
P.S., in respect of memory usage, your two examples have no obvious difference.
I have a very large 3d point cloud data set I'm processing. I tried using the multiprocessing module to speed up the processing, but I started getting out of memory errors. After some research and testing I determined that I was filling the queue of tasks to be processed much quicker than the subprocesses could empty it. I'm sure by chunking, or using map_async or something I could have adjusted the load, but I didn't want to make major changes to the surrounding logic.
The dumb solution I hit on is to check the pool._cache length intermittently, and if the cache is too large then wait for the queue to empty.
In my mainloop I already had a counter and a status ticker:
# Update status
count += 1
if count%10000 == 0:
sys.stdout.write('.')
if len(pool._cache) > 1e6:
print "waiting for cache to clear..."
last.wait() # Where last is assigned the latest ApplyResult
So every 10k insertion into the pool I check if there are more than 1 million operations queued (about 1G of memory used in the main process). When the queue is full I just wait for the last inserted job to finish.
Now my program can run for hours without running out of memory. The main process just pauses occasionally while the workers continue processing the data.
BTW the _cache member is documented the the multiprocessing module pool example:
#
# Check there are no outstanding tasks
#
assert not pool._cache, 'cache = %r' % pool._cache
You can limit the number of task per child process
multiprocessing.Pool(maxtasksperchild=1)
maxtasksperchild is the number of tasks a worker process can complete before it will exit and be replaced with a fresh worker process, to enable unused resources to be freed. The default maxtasksperchild is None, which means worker processes will live as long as the pool. link
I think this is similar to the question I posted, but I'm not sure you have the same delay. My problem was that I was producing results from the multiprocessing pool faster than I was consuming them, so they built up in memory. To avoid that, I used a semaphore to throttle the inputs into the pool so they didn't get too far ahead of the outputs I was consuming.

Categories

Resources