When I run the code without lru_cache I get this result, which is understandable:
with multiprocessing
time took 0.4375
without multiprocessing
time took 8.8125
But when I run using lru_cache this is the result:
Test1
with multiprocessing
time took 0.34375
without multiprocessing
time took 0.3125
Test2
with multiprocessing
time took 3.234375
without multiprocessing
time took 3.046875
Here we can clearly see that without multiprocessing is almost equal to, or a little faster than, the multiprocessing method. What's the reason for this? I understand that creating processes is overhead, but the work list is very large (10 million items), so I would guess the chunk size is not too small. Or am I doing this the wrong way?
Code explanation:
oddlist() takes a number and returns the sum of all odd numbers in that range
oddcounts is a tuple containing 10 million random numbers
Code:
import os
import time
from random import randint
from functools import reduce, lru_cache
from operator import add
from multiprocessing import Pool

# @lru_cache(maxsize=None)   # uncomment for the lru_cache runs
def oddlist(num):
    return reduce(add, (i for i in range(num) if i & 1))

if __name__ == '__main__':
    oddcounts = tuple(randint(10, 50) for i in range(10000000))

    print('with multiprocessing')
    s = time.process_time()
    with Pool(12) as p:
        mp = p.map(oddlist, oddcounts)
    e = time.process_time()
    print(f'time took {e-s}')

    print('without multiprocessing')
    s = time.process_time()
    z = tuple(oddlist(i) for i in oddcounts)
    e = time.process_time()
    print(f'time took {e-s}')
Each process has its own cache, so with multiprocessing, caching is roughly 1/12th as effective as it would otherwise be. There are only 41 possible input values to oddlist (10 through 50). In the multiprocessing case, each process computes all 41 results itself before its cache starts paying off; without multiprocessing, each of the 41 is computed only once. So, in addition to the overhead of starting the processes, each process does more work than it would need to if caching worked as intended. There is also a cost to passing each piece of work to a process and passing the result back.
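One way to see how cheap the cached case really is (a minimal sketch, reusing the oddlist definition from the question): precompute the 41 possible results once in the parent process and replace the call with a plain dictionary lookup, which removes both the redundant per-process computation and the pickling traffic.

from functools import reduce
from operator import add
from random import randint

def oddlist(num):
    return reduce(add, (i for i in range(num) if i & 1))

if __name__ == '__main__':
    oddcounts = tuple(randint(10, 50) for _ in range(10000000))
    # Only 41 distinct inputs are possible, so compute each answer exactly once.
    lookup = {n: oddlist(n) for n in range(10, 51)}
    # A plain lookup per element; no processes, no per-process cache warm-up.
    results = tuple(lookup[n] for n in oddcounts)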
Related
I just learned about multiprocessing and tried to see how fast it is compared to a simple for loop.
I used this simple code to compare them:
import multiprocessing
from time import time as tt

def spawn(num, num2):
    print('Process {} {}'.format(num, num2))

# normal process / single core
st = tt()
for i in range(1000):
    spawn(i, i + 1)
print('Total Running Time simple for loop:{}'.format((tt() - st)))

# multiprocessing
st2 = tt()
if __name__ == '__main__':
    for i in range(1000):
        p = multiprocessing.Process(target=spawn, args=(i, i + 1))
        p.start()
print('Total Running Time multiprocessing:{}'.format((tt() - st2)))
The output that I got showed that multiprocessing is much slower than the simple for loop:
Total Running Time simple for loop:0.09924721717834473
Total Running Time multiprocessing:40.157875299453735
Can anyone explain why this happens?
It is because of the overhead of creating and managing the processes. In this case the creation and teardown of 1,000 processes costs far more than the trivial work each one does, so there is no performance boost from running in parallel. If the executed code were more computationally expensive, there would probably be a speedup.
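A minimal sketch of the usual fix (assuming the same spawn function, with a return value instead of print so console I/O does not dominate the timing): reuse a small, fixed pool of worker processes so the process start-up cost is paid only a few times instead of 1,000 times.

import multiprocessing
from time import time as tt

def spawn(num, num2):
    # return instead of print so the timing is not dominated by console I/O
    return 'Process {} {}'.format(num, num2)

if __name__ == '__main__':
    st = tt()
    with multiprocessing.Pool() as pool:
        results = pool.starmap(spawn, [(i, i + 1) for i in range(1000)])
    print('Total Running Time pooled:{}'.format(tt() - st))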
I don't know how to parallelise code in Python that takes each sequence of a FASTA file and computes some statistics on it, such as GC content. Do you have any tips or libraries that would help me decrease the execution time?
I've tried to use os.fork(), but it gives me a longer execution time than the sequential code, probably because I don't know exactly how to give each child a different sequence.
# Computing GC Content
from Bio import SeqIO

with open('chr1.fa', 'r') as f:
    records = list(SeqIO.parse(f, 'fasta'))

GC_for_sequence = []
for i in records:
    GC = 0
    for j in i:
        if j in "GC":
            GC += 1
    GC_for_sequence.append(GC / len(i))

print(GC_for_sequence)
The expected execution would be: each process takes one sequence, and they compute the statistics in parallel.
A few notes on your existing code to start with:
I'd suggest not doing list(SeqIO.parse(…)), as that will pause execution until all sequences have been loaded into memory; you're much better off (in memory and total execution time) leaving it as an iterator and handing elements off to workers as needed.
Looping over each character is pretty slow; using str.count is going to be much faster.
Putting this together, you can do:
from Bio import SeqIO

with open('chr1.fa') as fd:
    gc_for_sequence = []
    for seq in SeqIO.parse(fd, 'fasta'):
        gc = sum(seq.seq.count(base) for base in "GC")
        gc_for_sequence.append(gc / len(seq))
If this still isn't fast enough, then you can use the multiprocessing module like:
from Bio import SeqIO
from multiprocessing import Pool

def sequence_gc_prop(seq):
    return sum(seq.count(base) for base in "GC") / len(seq)

with open('chr1.fa') as fd, Pool() as pool:
    gc_for_sequence = pool.map(
        sequence_gc_prop,
        (seq.seq for seq in SeqIO.parse(fd, 'fasta')),
        chunksize=1000,
    )
Comments from Lukasz mostly apply. Other non-obvious points:
the odd-looking seq.seq for seq in… is there to make sure that we're not pickling unnecessary data
chunksize is set to quite a large value because the function should be quick, so we want to give the children a reasonable amount of work to do and avoid the parent process spending all its time orchestrating things
Here's one idea with the standard multiprocessing module:
from multiprocessing import Pool
import numpy as np

no_cores_to_use = 4
GC_for_sequence = [np.random.rand(100) for x in range(10)]

with Pool(no_cores_to_use) as pool:
    result = pool.map(np.average, GC_for_sequence)

print(result)
In the code I used the numpy module to simulate a list with some content. pool.map takes the function you want to apply to your data as its first argument and the data list as its second. You can easily define the function yourself; by default, it should take a single argument. If you want to pass more, use functools.partial.
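A minimal sketch of the functools.partial pattern mentioned above (the gc_fraction function and its bases parameter are made up for illustration):

from functools import partial
from multiprocessing import Pool

def gc_fraction(sequence, bases):
    # fraction of positions occupied by any of the given bases
    return sum(sequence.count(b) for b in bases) / len(sequence)

if __name__ == '__main__':
    records = ['ACTGTCGCAGC', 'GGGCATC', 'ATATAT']
    # Fix the extra argument so the mapped function takes a single parameter.
    worker = partial(gc_fraction, bases='GC')
    with Pool(4) as pool:
        print(pool.map(worker, records))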
[EDIT] Here's an example much closer to your problem:
from multiprocessing import Pool
import numpy as np

records = ['ACTGTCGCAGC' for x in range(10)]
no_cores_to_use = 4

def count(sequence):
    # note: this counts non-overlapping occurrences of the substring 'GC',
    # not the number of G and C bases individually
    count = sequence.count('GC')
    return count

with Pool(no_cores_to_use) as pool:
    result = pool.map(count, records)

print(sum(result))
I'm trying to get all possible combinations with replacement and do some calculation with each of them. I'm using the code below:

from itertools import combinations_with_replacement

for seq in combinations_with_replacement('ABCDE', 500):
    # some calculation
How can I parallelize this calculation using multiprocessing?
You can use the standard library concurrent.futures.
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations_with_replacement

def processing(combination):
    print(combination)
    # Compute interesting stuff

if __name__ == '__main__':
    executor = ProcessPoolExecutor(max_workers=8)
    result = executor.map(processing, combinations_with_replacement('ABCDE', 25))
    for r in result:
        pass  # do stuff ...
A bit more explanation:
This code creates an executor that uses processes. Another possibility would be to use threads, but full Python threads only run on one core, so that is probably not the solution of interest here, as you need to run heavy computation.
executor.map returns a lazy iterator, so the executor.map(...) line is non-blocking and you can do other computation before collecting the results in the for loop.
It is important to declare the processing function outside the if __name__ == '__main__': block and to create and use the executor inside it. This prevents infinite process spawning and allows the worker function to be pickled and passed to the child processes. Without this guard, the code is likely to fail.
I recommend this over multiprocessing.Pool as it dispatches the work in a cleverer way when you are consuming an iterator.
Note the scale of your computation. combinations_with_replacement('ABCDE', 500) yields C(504, 4) ≈ 2.7 billion tuples, which is already a lot; if you actually meant every length-500 string over 'ABCDE' (itertools.product), that would be 5**500 > 1e349 elements, which is not computable at all. Parallelizing only reduces the work linearly by a factor of max_workers (8 here), so each process would still have to handle roughly 3.3 * 10^8 combinations, and no amount of parallelism rescues the 5**500 case.
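A quick sanity check on those counts (a small sketch; math.comb needs Python 3.8+):

import math
from itertools import combinations_with_replacement

# Multiset combinations of 5 symbols taken k at a time number C(k + 4, 4).
print(math.comb(504, 4))  # 2656615626 tuples for k = 500
# Verify the formula on a small case that can be enumerated directly.
print(sum(1 for _ in combinations_with_replacement('ABCDE', 3)), math.comb(7, 4))  # 35 35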
I am trying to get to grips with multiprocessing in Python. I started by creating this code. It simply computes cos(i) for integers i and measures the time taken when one uses multiprocessing and when one does not. I am not observing any time difference. Here is my code:
import multiprocessing
from multiprocessing import Pool
import numpy as np
import time

def tester(num):
    return np.cos(num)

if __name__ == '__main__':
    starttime1 = time.time()
    pool_size = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=pool_size)
    pool_outputs = pool.map(tester, range(5000000))
    pool.close()
    pool.join()
    endtime1 = time.time()
    timetaken = endtime1 - starttime1

    starttime2 = time.time()
    for i in range(5000000):
        tester(i)
    endtime2 = time.time()
    timetaken2 = timetaken = endtime2 - starttime2

    print('The time taken with multiple processes:', timetaken)
    print('The time taken the usual way:', timetaken2)
I am observing no (or very minimal) difference between the two times measured. I am using a machine with 8 cores, so this is surprising. What have I done incorrectly in my code?
Note that I learned all of this from this tutorial:
http://pymotw.com/2/multiprocessing/communication.html
I understand that "joblib" might be more convenient for an example like this, but the ultimate thing that this needs to be applied to does not work with "joblib".
Your job seems to be the computation of a single cos value. That is going to be essentially unnoticeable compared to the time spent communicating with the worker process.
Try making 5 computations of 1,000,000 cos values each and you should see them run in parallel.
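A minimal sketch of that suggestion (the block sizes are only illustrative): hand each worker one large block of numbers so the communication overhead is amortised over many cos evaluations.

import numpy as np
from multiprocessing import Pool

def cos_block(block):
    # one task computes cos over a whole array at once
    return np.cos(block)

if __name__ == '__main__':
    blocks = np.array_split(np.arange(5000000), 5)
    with Pool(5) as pool:
        results = pool.map(cos_block, blocks)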
First, you wrote:
timetaken2 = timetaken = endtime2 - starttime2
So it is normal that the same time is displayed twice. But this is not the important part.
I ran your code on my computer (i7, 4 cores), and I get :
('The time taken with multiple processes:', 14.95710802078247)
('The time taken the usual way:', 6.465447902679443)
The multiprocessed loop is slower than the plain for loop. Why?
The multiprocessing module uses multiple processes to sidestep the Python Global Interpreter Lock, but separate processes do not share memory. So when you launch a Pool, the useful variables have to be copied into each worker, the calculation is processed there, and the result is sent back. This costs a little time for every task and makes the parallel version less efficient.
But that only hurts because you are doing a very small computation: multiprocessing is only useful for larger calculations, where copying the memory and retrieving the results is cheaper (in time) than the calculation itself.
I tried with the following tester, which is much more expensive, over 2000 runs:
def expenser_tester(num):
    A = np.random.rand(10 * num)     # creation of a random 1D array
    for k in range(0, len(A) - 1):   # some useless but costly operation
        A[k + 1] = A[k] * A[k + 1]
    return A
('The time taken with multiple processes:', 4.030329942703247)
('The time taken the usual way:', 8.180987119674683)
You can see that on an expensive calculation it is more efficient with multiprocessing, even if you don't always get what you might expect (I could have had a x4 speedup, but I only got x2).
Keep in mind that Pool has to duplicate every bit of memory used in the calculation, so it may be memory-expensive.
If you really want to speed up a small calculation like your example, make each task bigger by grouping and sending a list of variables to the pool instead of one variable per process, for example as in the sketch below.
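One way to do that grouping without restructuring the code (a sketch; the chunksize value is arbitrary): Pool.map accepts a chunksize argument that batches the inputs, so each round trip to a worker carries many tasks at once.

import numpy as np
from multiprocessing import Pool

def tester(num):
    return np.cos(num)

if __name__ == '__main__':
    with Pool() as pool:
        # chunksize batches the inputs; 10000 is an arbitrary illustrative value
        outputs = pool.map(tester, range(5000000), chunksize=10000)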
You should also know that numpy and scipy have a lot of expensive functions written in C/Fortran that are already parallelized, so there is not much you can do to speed those up.
If the problem is CPU-bound then you should see the expected speed-up (provided the operation is long enough and the overhead is not significant). But with multiprocessing (because memory is not shared between processes) it is easier to end up with a memory-bound problem.
For a map task from a list src_list to dest_list, where len(src_list) is on the order of thousands:
def my_func(elem):
    # some complex work, for example a minimizing task
    return new_elem

dest_list[i] = my_func(src_list[i])
I use multiprocessing.Pool
pool = Pool(4)
# took 543 seconds
dest_list = list(pool.map(my_func, src_list, chunksize=len(src_list)/8))
# took 514 seconds
dest_list = list(pool.map(my_func, src_list, chunksize=4))
# took 167 seconds
dest_list = [my_func(elem) for elem in src_list]
I am confused. Can someone explain why the multiprocessing version runs even slower?
I also wonder what the considerations are for the choice of chunksize, and for the choice between multiple threads and multiple processes, especially for my problem. Also, currently, I measure time by summing all the time spent in the my_func method, because directly using
t = time.time()
dest_list = pool.map...
print time.time() - t
doesn't work. However, the documentation here says that map() blocks until the result is ready, which seems different from what I observe. Is there another way rather than simply summing the time? I have tried pool.close() with pool.join(), which does not work.
src_list is of length around 2000. time.time() - t doesn't work because it does not sum up all the time spent in my_func inside pool.map. And a strange thing happened when I used timeit.
def wrap_func(src_list):
    pool = Pool(4)
    dest_list = list(pool.map(my_func, src_list, chunksize=4))

print timeit("wrap_func(src_list)", setup="import ...")
It ran into:
OSError: Cannot allocate memory
I guess I have used timeit in the wrong way...
I use python 2.7.6 under Ubuntu 14.04.
Thanks!
Multiprocessing requires overhead to pass the data between processes, because processes do not share memory. Any object passed between processes must be pickled (serialized to a byte string) and unpickled. This includes the objects passed to the function in your list src_list and any object returned to dest_list. This takes time. To illustrate it, you might try timing the following function in a single process and in parallel.
def NothingButAPickle(elem):
    return elem
If you loop over your src_list in a single process this should be extremely fast, because Python only has to make one copy of each object in the list in memory. If instead you call this function in parallel with the multiprocessing package, it has to (1) pickle each object to send it from the main process to a subprocess as a byte string, (2) unpickle each object in the subprocess to go from the string representation back to an object in memory, (3) pickle the result to return it to the main process as a byte string, and then (4) unpickle it again in the main process. Without seeing your data or the actual function: this overhead typically only exceeds the multiprocessing gains if the objects you are passing are extremely large and/or the function is not that computationally intensive.
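A small sketch of that timing experiment (Python 3 syntax; the list size and chunksize are arbitrary): the gap between the two timings is essentially pure pickling and inter-process communication overhead, since the function itself does nothing.

import time
from multiprocessing import Pool

def NothingButAPickle(elem):
    return elem

if __name__ == '__main__':
    src_list = list(range(200000))

    t = time.time()
    serial = [NothingButAPickle(e) for e in src_list]
    print('single process:', time.time() - t)

    t = time.time()
    with Pool(4) as pool:
        parallel = pool.map(NothingButAPickle, src_list, chunksize=4)
    print('multiprocessing:', time.time() - t)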