Why is appending to a list slower with multiprocessing? - python

During testing I found that, in the following code, the multiprocessing (MP) method runs a bit slower than the normal one:
import time
from multiprocessing import Pool

def eat_time(j):
    result = []
    for j in range(10**4):
        a = 0
        for i in range(1000):
            a += 101
            result.append(a)
    return result

if __name__ == '__main__':
    #MP method
    t = time.time()
    pool = Pool()
    result = []
    data = pool.map(eat_time, [i for i in range(5)])
    for d in data:
        result += d
    print(time.time()-t) #11s for my computer
    #Normal method
    t = time.time()
    integers = []
    for i in range(5):
        integers += eat_time(i)
    print(time.time()-t) #8s for my computer
However, if I don't require it to aggregate the data by changing eat_time() to
def eat_time(j):
    result = []
    for j in range(10**4):
        a = 0
        for i in range(1000):
            a += 101
            #result.append(a)
    return result
With this change the MP version is much faster, taking just 3s on my computer, while the normal method still takes 8s (as expected).
This looks strange to me: since result is declared separately inside each call, I did not expect appending to ruin the MP version completely.
Is there a correct way to do this? And why is MP slower when append is involved?
Edited for comment
Thanks to @torek and @akhavro for clarifying the point.
Yes, I understand that creating processes takes time; that is why the problem was raised.
Actually, the original code had the for-loop outside and called the simple method again and again; with a sufficiently large number of tasks (in my case more than 10**6 calls) it was a bit faster than the normal method.
Therefore I moved the loop inside, making the method a bit more complicated, by moving the line for j in range(10**4): into eat_time().
But it seems that making the code more complicated causes communication lag due to the larger data size.
So the answer is probably that there is no way to solve it.

It is not append that causes your slowness but returning the result with appended elements. You can test it by changing your code to do the append but return only the first few elements of your result. Now it should work much faster again.
When you return your result from a Pool worker, this is in practice implemented as a queue from multiprocessing. It works but it is not a miracle performer, definitely much slower than just manipulating in-memory structures. When you return a lot of data, the queue needs to transmit a lot.
There is no easy workaround. You could try shared memory but I do not personally like it due to added complexity. The better way would be to redesign your application so that it does not need to transmit a lot of data between processes. For example, would it be possible to process data in your worker further so that you do not need to return it all but only a processed subset?
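A minimal sketch of that redesign, assuming the caller only needs an aggregate (here a sum and a count) rather than every element. The name eat_time_summary and its outer/inner parameters are hypothetical, and the loop sizes are reduced so the example runs quickly.

```python
from multiprocessing import Pool

def eat_time_summary(j, outer=100, inner=1000):
    """Do the same kind of work as eat_time, but return only a small summary."""
    total = 0
    count = 0
    for _ in range(outer):
        a = 0
        for _ in range(inner):
            a += 101
            total += a   # aggregate inside the worker...
            count += 1   # ...instead of appending to a big list
    return total, count  # only two integers travel back to the parent

if __name__ == '__main__':
    with Pool(2) as pool:
        summaries = pool.map(eat_time_summary, range(5))
    grand_total = sum(t for t, _ in summaries)
    grand_count = sum(c for _, c in summaries)
    print(grand_total, grand_count)
```

The work per task is unchanged; what shrinks is the amount of data that has to be pickled and sent through the result queue.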

Related

Generator in threading

I have a generator that returns me a certain string, how can I use it together with this code?
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(4)
results = pool.map(my_function, my_array)
Instead of the array that is passed above, I use a generator, since the number of values does not fit in the array, and I do not need to store them.
Additional question:
In this case, the transition to the next iteration should be carried out inside the function?
Essentially the same question that was posed here.
The essence is that multiprocessing will convert any iterable without a __len__ method into a list.
There is an open issue to add support for generators but for now, you're SOL.
If your array is too big to fit into memory, consider reading it in in chunks, processing it, and dumping the results to disk in chunks. Without more context, I can't really provide a more concrete solution.
UPDATE:
Thanks for posting your code. My first question: is it absolutely necessary to use multiprocessing? Depending on what my_function does, you may see no benefit from a ThreadPool, as Python is famously limited by the GIL, so a CPU-bound worker function won't speed up. In that case, a process pool (multiprocessing.Pool) may be better. Otherwise, you are probably better off just running results = map(my_function, generator).
Of course, if you don't have the memory to load the input data, it is unlikely you will have the memory to store the results.
Secondly, you can improve your generator by using itertools
Try:
import itertools
import string

letters = string.ascii_lowercase
cod = itertools.permutations(letters, 6)

def my_function(x):
    return x

def dump_results_to_disk(results, outfile):
    with open(outfile, 'w') as fp:
        for result in results:
            fp.write(str(result) + '\n')

def process_in_chunks(generator, chunk_size=50):
    accumulator = []
    chunk_number = 1
    for item in generator:
        accumulator.append(item)
        if len(accumulator) == chunk_size:
            # chunk is full: process it, write it out, start a new one
            results = list(map(my_function, accumulator))
            dump_results_to_disk(results, "results" + str(chunk_number) + '.txt')
            chunk_number += 1
            accumulator = []
    if accumulator:
        # process whatever is left over in the last, partial chunk
        results = list(map(my_function, accumulator))
        dump_results_to_disk(results, "results" + str(chunk_number) + '.txt')

process_in_chunks(cod)
Obviously, change my_function() to whatever your worker function is, and maybe you want to do something other than dumping to disk. You can scale chunk_size to however many entries fit in memory. If you don't have the disk space or the memory for the results, then there's really no way for you to process the data in aggregate.
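The chunking idea can also be combined with the thread pool from the question. The following is only a sketch: iter_chunks and map_in_chunks are hypothetical helper names, and the approach assumes that one chunk of inputs plus its results fits in memory at a time.

```python
import itertools
from multiprocessing.dummy import Pool as ThreadPool

def iter_chunks(iterable, chunk_size):
    """Yield successive lists of at most chunk_size items from any iterable."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

def map_in_chunks(func, iterable, chunk_size=1000, workers=4):
    """Apply func across the iterable, materializing one chunk at a time."""
    with ThreadPool(workers) as pool:
        for chunk in iter_chunks(iterable, chunk_size):
            # pool.map only ever sees a short list, never the whole generator
            yield from pool.map(func, chunk)

if __name__ == '__main__':
    gen = (str(i) for i in range(10))
    for result in map_in_chunks(str.upper, gen, chunk_size=3, workers=2):
        print(result)
```

Because map_in_chunks is itself a generator, the caller can consume results one by one without storing them all.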

Prevent RAM space from filling from repeated Python function

I have a Python function example below which simply takes in a variable and performs a simple mathematical operation on it before returning.
If I parallelise this function, to better reflect the operation I would like to do in real life, and run the parallelised function 10 times, I notice on my IDE that the memory increases despite using the del results line.
import multiprocessing as mp
import numpy as np
from tqdm import tqdm

def function(x):
    return x*2

test_array = np.arange(0,1e4,1)
for i in range(10):
    pool = mp.Pool(processes=4)
    results = list(tqdm(pool.imap(function,test_array),total=len(test_array)))
    results = [x for x in results if str(x) != 'nan']
    del results
I have a few questions I would be grateful to know the answers to:
Is there a way to prevent this memory increase?
Is this memory loading due to the parallelisation process?
I haven't tried this out, but I'm quite sure you don't need to define
pool = mp.Pool(processes=4)
within the loop; you're starting up 10 instances of the pool for no reason. Maybe try moving that out and seeing if your memory usage decreases?
If that doesn't help, consider restructuring your code to use yield instead, to prevent your memory from filling up.
Each worker process that the pool creates needs to receive some information about the function and the element it applies the function to. This information is copied, which increases memory use.
If you want to reduce it, you might want to look at the chunksize argument of pool.imap.
Another way would be to just rely on functions from numpy. You might already know, but you could just do results = test_array * 2. I don't know what your real-life example looks like, but you might not need to use Python's pool.
Also, if you intend to actually write fast code, don't use tqdm. It is nice, and if you need it, you need it, but it will slow down your code.
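A sketch of the first two suggestions together: one pool created outside the loop and reused, with a chunksize passed to imap. The names double and run_batches and the chunksize value are assumptions chosen for illustration, and plain lists are used instead of numpy to keep the example self-contained.

```python
import multiprocessing as mp

def double(x):
    return x * 2

def run_batches(pool, data, n_batches=10, chunksize=256):
    """Reuse one pool for every batch; return the size of each batch's result."""
    sizes = []
    for _ in range(n_batches):
        # chunksize groups inputs so fewer, larger messages are sent to workers
        results = list(pool.imap(double, data, chunksize=chunksize))
        sizes.append(len(results))
    return sizes

if __name__ == '__main__':
    data = list(range(10000))
    # One pool for the whole run; the with-block terminates the workers on exit.
    with mp.Pool(processes=4) as pool:
        print(run_batches(pool, data))
```

In the original loop each iteration created a pool that was never closed, so its worker processes could linger; here only four workers exist for the whole run.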

Using pool for multiprocessing in Python (Windows)

I have to run my study in parallel to make it much faster. I am new to the multiprocessing library in Python and have not yet managed to make it run successfully.
Here, I am investigating whether each (origin, target) pair remains at certain locations between the various frames of my study. Several points:
It is one function that I want to run faster (it is not several processes).
The processing is sequential; that is, each frame is compared with the previous one.
This code is a much simpler form of the original code. It outputs a residence_list.
I am using Windows.
Can someone check the code (the multiprocessing section) and help me improve it so that it works? Thanks.
import numpy as np
from multiprocessing import Pool, freeze_support

def Main_Residence(total_frames, origin_list, target_list):
    Previous_List = {}
    residence_list = []
    for frame in range(total_frames): #Each frame
        Current_List = {} #Dict of pair and their residence for frames
        for origin in range(origin_list):
            for target in range(target_list):
                Pair = (origin, target) #Each pair
                if Pair in Current_List.keys(): #If already considered, continue
                    continue
                else:
                    if origin == target:
                        if (Pair in Previous_List.keys()): #If remained from the previous frame, add residence
                            print "Origin_Target remained: ", Pair
                            Current_List[Pair] = (Previous_List[Pair] + 1)
                        else: #If new, add it to the current
                            Current_List[Pair] = 1
        for pair in Previous_List.keys(): #Add those that exited from residence to the list
            if pair not in Current_List.keys():
                residence_list.append(Previous_List[pair])
        Previous_List = Current_List
    return residence_list

if __name__ == '__main__':
    pool = Pool(processes=5)
    Residence_List = pool.apply_async(Main_Residence, args=(20, 50, 50))
    print Residence_List.get(timeout=1)
    pool.close()
    pool.join()
    freeze_support()
    Residence_List = np.array(Residence_List) * 5
Multiprocessing does not make sense in the context you are presenting here.
You are creating five subprocesses (and three threads belonging to the pool, managing workers, tasks and results) to execute one function once. All of this is coming at a cost, both in system resources and execution time, while four of your worker processes don't do anything at all. Multiprocessing does not speed up the execution of a function. The code in your specific example will always be slower than plainly executing Main_Residence(20, 50, 50) in the main process.
For multiprocessing to make sense in such a context, your work at hand would need to be broken down to a set of homogenous tasks that can be processed in parallel with their results potentially being merged later.
As an example (not necessarily a good one), if you want to calculate the largest prime factors for a sequence of numbers, you can delegate the task of calculating that factor for any specific number to a worker in a pool. Several workers would then do these individual calculations in parallel:
from datetime import datetime
from multiprocessing import Pool

def largest_prime_factor(n):
    p = n
    i = 2
    while i * i <= n:
        if n % i:
            i += 1
        else:
            n //= i
    return p, n

if __name__ == '__main__':
    pool = Pool(processes=3)
    start = datetime.now()
    # this delegates half a million individual tasks to the pool, i.e.
    # largest_prime_factor(0), largest_prime_factor(1), ..., largest_prime_factor(499999)
    pool.map(largest_prime_factor, range(500000))
    pool.close()
    pool.join()
    print("pool elapsed", datetime.now() - start)
    start = datetime.now()
    # same work just in the main process
    [largest_prime_factor(i) for i in range(500000)]
    print("single elapsed", datetime.now() - start)
Output:
pool elapsed 0:00:04.664000
single elapsed 0:00:08.939000
(the largest_prime_factor function is taken from @Stefan in this answer)
As you can see, the pool is only roughly twice as fast as single process execution of the same amount of work, all while running in three processes in parallel. That's due to the overhead introduced by multiprocessing/the pool.
So, you stated that the code in your example has been simplified. You'll have to analyse your original code to see if it can be broken down to homogenous tasks that can be passed down to your pool for processing. If that is possible, using multiprocessing might help you speed up your program. If not, multiprocessing will likely cost you time, rather than save it.
Edit:
Since you asked for suggestions on the code. I can hardly say anything about your function. You said yourself that it is just a simplified example to provide an MCVE (much appreciated by the way! Most people don't take the time to strip down their code to its bare minimum). Requests for a code review are anyway better suited over at Codereview.
Play around a bit with the available methods of task delegation. In my prime factor example, using apply_async came with a massive penalty: execution time increased ninefold compared to using map. But my example uses just a simple iterable, whereas yours needs three arguments per task. This could be a case for starmap, but that is only available as of Python 3.3. Anyway, the structure/nature of your task data basically determines the correct method to use.
I did some q&d testing with multiprocessing your example function.
The input was defined like this:
inp = [(20, 50, 50)] * 5000 # that makes 5000 tasks against your Main_Residence
I ran that in Python 3.6 in three subprocesses with your function unaltered, except for the removal of the print statement (I/O is costly). I used starmap, apply, starmap_async and apply_async, and also iterated through the results each time to account for the blocking get() on the async results.
Here's the output:
starmap elapsed 0:01:14.506600
apply elapsed 0:02:11.290600
starmap async elapsed 0:01:27.718800
apply async elapsed 0:01:12.571200
# btw: 5k calls to Main_Residence in the main process looks as bad
# as using apply for delegation
single elapsed 0:02:12.476800
As you can see, the execution times differ, although all four methods do the same amount of work; the apply_async you picked appears to be the fastest method.
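For reference, two of the delegation methods compared above can be sketched like this. count_matches is a hypothetical, much smaller stand-in for a three-argument task such as Main_Residence, and the task list is deliberately tiny; it is not the benchmark that produced the timings.

```python
from multiprocessing import Pool

def count_matches(total_frames, origins, targets):
    """Hypothetical three-argument task: count (origin, target) pairs that are equal."""
    return sum(1 for _ in range(total_frames)
                 for o in range(origins)
                 for t in range(targets) if o == t)

if __name__ == '__main__':
    inp = [(2, 5, 5)] * 8   # eight small tasks, three arguments each
    with Pool(processes=3) as pool:
        # starmap unpacks each tuple into positional arguments (Python 3.3+)
        via_starmap = pool.starmap(count_matches, inp)
        # apply_async submits one task per call; get() blocks until it is done
        pending = [pool.apply_async(count_matches, args) for args in inp]
        via_apply_async = [r.get() for r in pending]
    print(via_starmap == via_apply_async)
```

Both approaches return the same results in the same order; which one is faster depends on the task size and count, as the timings show.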
Coding Style. Your code looks quite ... unconventional :) You use Capitalized_Words_With_Underscore for your names (both, function and variable names), that's pretty much a no-no in Python. Also, assigning the name Previous_List to a dictionary is ... questionable. Have a look at PEP 8, especially the section Naming Conventions to see the commonly accepted coding style for Python.
Judging by the way your print looks, you are still using Python 2. I know that in corporate or institutional environments that's sometimes all you have available. Still, keep in mind that the clock for Python 2 is ticking

Using local memory in Pool workers with python's multiprocessing module

I'm working on implementing a randomized algorithm in python. Since this involves doing the same thing many (say N) times, it rather naturally parallelizes and I would like to make use of that. More specifically, I want to distribute the N iterations on all of the cores of my CPU. The problem in question involves computing the maximum of something and is thus something where every worker could compute their own maximum and then only report that one back to the parent process, which then only needs to figure out the global maximum out of those few local maxima.
Somewhat surprisingly, this does not seem to be an intended use-case of the multiprocessing module, but I'm not entirely sure how else to go about it. After some research I came up with the following solution (toy problem to find the maximum in a list that is structurally the same as my actual one):
import random
import multiprocessing

l = []
N = 100
numCores = multiprocessing.cpu_count()
# globals for every worker
mySendPipe = None
myRecPipe = None

def doWork():
    pipes = zip(*[multiprocessing.Pipe() for i in range(numCores)])
    pool = multiprocessing.Pool(numCores, initializeWorker, (pipes,))
    pool.map(findMax, range(N))
    results = []
    # collate results
    for p in pipes[0]:
        if p.poll():
            results.append(p.recv())
    print(results)
    return max(results)

def initializeWorker(pipes):
    global mySendPipe, myRecPipe
    # ID of a worker process; they are consistently named PoolWorker-i
    myID = int(multiprocessing.current_process().name.split("-")[1])-1
    # Modulo: When starting a second pool for the second iteration of doWork() they are named with IDs 5-8.
    mySendPipe = pipes[1][myID%numCores]
    myRecPipe = pipes[0][myID%numCores]

def findMax(count):
    myMax = 0
    if myRecPipe.poll():
        myMax = myRecPipe.recv()
    value = random.choice(l)
    if myMax < value:
        myMax = value
    mySendPipe.send(myMax)

l = range(1, 1001)
random.shuffle(l)
max1 = doWork()
l = range(1001, 2001)
random.shuffle(l)
max2 = doWork()
return (max1, max2)
This works, sort of, but I've got a problem with it. Namely, using pipes to store the intermediate results feels rather silly (and is probably slow). But it also has the real problem, that I can't send arbitrarily large things through the pipe, and my application unfortunately sometimes exceeds this size (and deadlocks).
So, what I'd really like is a function analogue to the initializer that I can call once for every worker on the pool to return their local results to the parent process. I could not find such functionality, but maybe someone here has an idea?
A few final notes:
I use a global variable for the input because in my application the input is very large and I don't want to copy it to every process. Since the processes never write to it, I believe it should never be copied (or am I wrong there?). I'm open to suggestions to do this differently, but mind that I need to run this on changing inputs (sequentially though, just like in the example above).
I'd like to avoid using the Manager-class, since (by my understanding) it introduces synchronisation and locks, which in this problem should be completely unnecessary.
The only other similar question I could find is Python's multiprocessing and memory, but they wish to actually process the individual results of the workers, whereas I do not want the workers to return N things, but to instead only run a total of N times and return only their local best results.
I'm using Python 2.7.15.
tl;dr: Is there a way to use local memory for every worker process in a multiprocessing pool, so that every worker can compute a local optimum and the parent process only needs to worry about figuring out which one of those is best?
You might be overthinking this a little.
By making your worker-functions (in this case findMax) actually return a value instead of communicating it, you can store the result from calling pool.map() - it is just a parallel variant of map, after all! It will map a function over a list of inputs and return the list of results of that function call.
The simplest example illustrating my point follows your "distributed max" example:
import multiprocessing
# [0,1,2,3,4,5,6,7,8]
x = range(9)
# split the list into 3 chunks
# [(0, 1, 2), (3, 4, 5), (6, 7, 8)]
input = zip(*[iter(x)]*3)
pool = multiprocessing.Pool(2)
# compute the max of each chunk:
# max((0,1,2)) == 2
# max((3,4,5)) == 5
# ...
res = pool.map(max, input)
print(res)
This returns [2, 5, 8].
Be aware that there is some light magic going on: I use the built-in max() function which expects iterables as input. Now, if I would only pool.map over a plain list of integers, say, range(9), that would result in calls to max(0), max(1) etc. - not very useful, huh? Instead, I partition my list into chunks, so effectively, when mapping, we now map over a list of tuples, thus feeding a tuple to max on each call.
So perhaps you have to:
return a value from your worker func
think about how you structure your input domain so that you feed meaningful chunks to each worker
PS: You wrote a great first question! Thank you, it was a pleasure reading it :) Welcome to StackOverflow!
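Applied to the randomized setting from the question, the same idea might look like the sketch below (written for Python 3). local_max and split_iterations are hypothetical names. The input list is passed with every task purely for simplicity; for a very large input you would probably keep it in a module-level global as the question does. It also assumes there are at least as many iterations as batches.

```python
import random
from multiprocessing import Pool

def local_max(args):
    """Run `iterations` random draws from `values`; return only the best one."""
    values, iterations, seed = args
    rng = random.Random(seed)   # per-task generator, seeded for repeatability
    return max(rng.choice(values) for _ in range(iterations))

def split_iterations(n, parts):
    """Split n iterations into `parts` batch sizes that sum to n."""
    base, extra = divmod(n, parts)
    return [base + (1 if k < extra else 0) for k in range(parts)]

if __name__ == '__main__':
    values = list(range(1, 1001))
    batches = split_iterations(100, 4)
    tasks = [(values, size, seed) for seed, size in enumerate(batches)]
    with Pool(4) as pool:
        maxima = pool.map(local_max, tasks)   # one number per batch comes back
    print(max(maxima))
```

Each task keeps its running maximum in an ordinary local variable, so no pipes are needed and the parent only compares a handful of values.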

How to use multiprocessing in python

I am new to Python and want to parallelize the following code using multiprocessing. How should I modify the code? I have been searching for methods using Pool, but found few examples that I can follow. Can anyone help? Thank you.
Note that setinner and setouter are two independent functions and that's where I want to use parallel programming to reduce the running time.
def solve(Q,G,n):
    i = 0
    tol = 10**-4
    while i < 1000:
        inneropt,partition,x = setinner(Q,G,n)
        outeropt = setouter(Q,G,n)
        if (outeropt - inneropt)/(1 + abs(outeropt) + abs(inneropt)) < tol:
            break
        node1 = partition[0]
        node2 = partition[1]
        G = updateGraph(G,node1,node2)
        if i == 999:
            print "Maximum iteration reaches"
    print inneropt
It's hard to parallelize code that needs to mutate the same shared data from different tasks. So, I'm going to assume that setinner and setouter are non-mutating functions; if that's not true, things will be more complicated.
The first step is to decide what you want to do in parallel.
One obvious thing is to do the setinner and setouter at the same time. They're completely independent of each other, and always need to both get done. So, that's what I'll do. Instead of doing this:
inneropt,partition,x = setinner(Q,G,n)
outeropt = setouter(Q,G,n)
… we want to submit the two functions as tasks to the pool, then wait for both to be done, then get the results of both.
The concurrent.futures module (which requires a third-party backport in Python 2.x) makes it easier to do things like "wait for both to be done" than the multiprocessing module (which is in the stdlib in 2.6+), but in this case, we don't need anything fancy; if one of them finishes early, we don't have anything to do until the other finishes anyway. So, let's stick with multiprocessing.apply_async:
pool = multiprocessing.Pool(2) # we never have more than 2 tasks to run
while i < 1000:
    # parallelly start both tasks
    inner_result = pool.apply_async(setinner, (Q, G, n))
    outer_result = pool.apply_async(setouter, (Q, G, n))
    # sequentially wait for both tasks to finish and get their results
    inneropt,partition,x = inner_result.get()
    outeropt = outer_result.get()
    # the rest of your loop is unchanged
You may want to move the pool outside the function so it lives forever and can be used by other parts of your code. And if not, you almost certainly want to shut the pool down at the end of the function. (Later versions of multiprocessing let you just use the pool in a with statement, but that requires Python 3.3+, so on older versions you have to do it explicitly.)
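A minimal sketch of shutting the pool down explicitly, in a form that also works on versions without the with statement support. square and run_pair are hypothetical stand-ins for setinner/setouter and for the body of the loop.

```python
import multiprocessing

def square(x):
    return x * x

def run_pair(pool, a, b):
    """Submit two independent tasks, then wait for both results."""
    first = pool.apply_async(square, (a,))
    second = pool.apply_async(square, (b,))
    return first.get(), second.get()

if __name__ == '__main__':
    pool = multiprocessing.Pool(2)
    try:
        print(run_pair(pool, 3, 4))
    finally:
        pool.close()   # no more tasks will be submitted
        pool.join()    # wait for the worker processes to exit
```

The try/finally guarantees the workers are cleaned up even if one of the tasks raises.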
What if you want to do more work in parallel? Well, there's nothing else obvious to do here without restructuring the loop. You can't do updateGraph until you get the results back from setinner and setouter, and nothing else is slow here.
But if you could reorganize things so that each loop's setinner were independent of everything that came before (which may or may not be possible with your algorithm—without knowing what you're doing, I can't guess), you could push 2000 tasks onto the queue up front, then loop by just grabbing results as needed. For example:
pool = multiprocessing.Pool() # let it default to the number of cores
inner_results = []
outer_results = []
for j in range(1000):
    inner_results.append(pool.apply_async(setinner, (Q,G,n,j)))
    outer_results.append(pool.apply_async(setouter, (Q,G,n,j)))
while i < 1000:
    inneropt,partition,x = inner_results.pop(0).get()
    outeropt = outer_results.pop(0).get()
    # the rest of your loop is the same as before
Of course you can make this fancier.
For example, let's say you rarely need more than a couple hundred iterations, so it's wasteful to always compute 1000 of them. You can just push the first N at startup, and push one more every time through the loop (or N more every N times) so you never do more than N wasted iterations—you can't get an ideal tradeoff between perfect parallelism and minimal waste, but you can usually tune it pretty nicely.
Also, if the tasks don't actually take that long, but you have a lot of them, you may want to batch them up. One really easy way to do this is to use one of the map variants instead of apply_async; this can make your fetching code a tiny bit more complicated, but it makes the queuing and batching code completely trivial (e.g., to map each func over a list of 100 parameters with a chunksize of 10 is just two simple lines of code).
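As a rough illustration of that batching, assuming each task can take its parameters packed into a single tuple: setinner_stub is a hypothetical placeholder (the real setinner is not shown in the question), and imap is used here as one of the map variants.

```python
import multiprocessing

def setinner_stub(params):
    """Hypothetical stand-in for setinner; takes one packed parameter tuple."""
    q, g, n, i = params
    return q + g + n + i

if __name__ == '__main__':
    params = [(1, 2, 3, i) for i in range(100)]
    with multiprocessing.Pool() as pool:
        # imap yields results in order while sending tasks to workers 10 at a time
        for value in pool.imap(setinner_stub, params, chunksize=10):
            pass  # consume each result as the loop needs it
```

The queuing and batching is the single imap call; the fetching side becomes an ordinary for loop over the results.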
