Why multiprocessing is slower than simple for loop?

Why multiprocessing is slower than simple for loop? - python

I just learned about multiprocessing and tried to see how fast is it compared to simple for loop.
I use simple code to compare it,
import multiprocessing
from time import time as tt
def spawn(num,num2):
print('Process {} {}'.format(num,num2))
#normal process/single core
st = tt()
for i in range (1000):
spawn(i,i+1)
print('Total Running Time simple for loop:{}'.format((tt()-st)))
#multiprocessing
st2 = tt()
if __name__ == '__main__':
for i in range(1000):
p=multiprocessing.Process(target=spawn, args=(i,i+1))
p.start()
print('Total Running Time multiprocessing:{}'.format((tt()-st2)))
The output that I got showed that multiprocessing is much slower than the simple for loop
Total Running Time simple for loop:0.09924721717834473
Total Running Time multiprocessing:40.157875299453735
Can anyone explain why this happens?

It is because of the overhead for handling the processes. In this case the creation and deletion of the processes does not weigh up to the performance boost the code gets from running it parallel. If the executed code is more complex there will probably be a speedup.

Related

why pool and cached method has almost same runtime?

When I run code without lru_cache I get this result. Which is understandable
with multiprocessing
time took 0.4375
without multiprocessing
time took8.8125
But when I run using lru_cache this is the result:
Test1
with multiprocessing
time took 0.34375
without multiprocessing
time took 0.3125
Test2
with multiprocessing
time took 3.234375
without multiprocessing
time took 3.046875
He we can clearly see without multiprocessing is almost equal or little faster than multiprocessing method. What's the reason for this? I understand that creating process is overhead but work list is very huge (10 million) so I guess chunk size is not too small. Or am I doing this wrong way?
Code explanation:
oddlist () take number and return the sum of all odd nums in that range
oddcount is a tuple contains 10 million random numbers
Code:
import os
from random import randint
from functools import reduce
from operator import add
from multiprocessing import Pool
import time
from functools import lru_cache
#lru_cache(maxsize=None)
def oddlist(num):
return reduce(add,(i for i in range(num) if i&1))
if __name__ == '__main__':
oddcounts=tuple(randint(10,50) for i in range(10000000))
print('with multiporcessing')
s=time.process_time()
with Pool(12) as p:
mp=p.map(oddlist, oddcounts)
e=time.process_time()
print(f'time took {e-s}')
print('witout multiporcessing')
s=time.process_time()
z=tuple(oddlist(i) for i in oddcounts)
e=time.process_time()
print(f'time took {e-s}')

Each process has its own cache, so while using multiprocessing, caching is 1/12th as effective as it would otherwise be. There are only 40 possible input values to oddlist. In the multiprocessing case, each process computes all 40, then uses the cache. Without multiprocessing, all 40 are only computed once. So, in addition to the overhead of starting the processes, each process does more work than it would need to if caching were working as intended. Also, there is a cost to pass the work to be done in each process to it, and passing the result back.

Why is multiprocessing slower here?

I am trying to speed up some code with multiprocessing in Python, but I cannot understand one point. Assume I have the following dumb function:
import time
from multiprocessing.pool import Pool
def foo(_):
for _ in range(100000000):
a = 3
When I run this code without using multiprocessing (see the code below) on my laptop (Intel - 8 cores cpu) time taken is ~2.31 seconds.
t1 = time.time()
foo(1)
print(f"Without multiprocessing {time.time() - t1}")
Instead, when I run this code by using Python multiprocessing library (see the code below) time taken is ~6.0 seconds.
pool = Pool(8)
t1 = time.time()
pool.map(foo, range(8))
print(f"Sample multiprocessing {time.time() - t1}")
From the best of my knowledge, I understand that when using multiprocessing there is some time overhead mainly caused by the need to spawn the new processes and to copy the memory state. However, this operation should be performed just once when the processed are initially spawned at the very beginning and should not be that huge.
So what I am missing here? Is there something wrong in my reasoning?
Edit: I think it is better to be more explicit on my question. What I expected here was the multiprocessed code to be slightly slower than the sequential one. It is true that I don't split the whole work across the 8 cores, but I am using 8 cores in parallel to do the same job (hence in an ideal world the processing time should more or less stay the same). Considering the overhead of spawning new processes, I expected a total increase in time of some (not too big) percentage, but not of a ~2.60x increase as I got here.

Well, multiprocessing can't possibly make this faster: you're not dividing the work across 8 processes, you're asking each of 8 processes to do the entire thing. Each process will take at least as long as your code doing it just once without using multiprocessing.
So if multiprocessing weren't helping at all, you'd expect it to take about 8 times as long (it's doing 8x the work!) as your single-processor run. But you said it's not taking 2.31 * 8 ~= 18.5 seconds, but "only" about 6. So you're getting better than a factor of 3 speedup.
Why not more than that? Can't guess from here. That will depend on how many physical cores your machine has, and how much other stuff you're running at the same time. Each process will be 100% CPU-bound for this specific function, so the number of "logical" cores is pretty much irrelevant - there's scant opportunity for processor hyper-threading to help. So I'm guessing you have 4 physical cores.
On my box
Sample timing on my box, which has 8 logical cores but only 4 physical cores, and otherwise left the box pretty quiet:
Without multiprocessing 2.468580484390259
Sample multiprocessing 4.78624415397644
As above, none of that surprises me. In fact, I was a little surprised (but pleasantly) at how effectively the program used up the machine's true capacity.

#TimPeters already answered that you are actually just running the job 8 times across the 8 Pool subprocesses, so it is slower not faster.
That answers the issue but does not really answer what your real underlying question was. It is clear from your surprise at this result, that you were expecting that the single job to somehow be automatically split up and run in parts across the 8 Pool processes. That is not the way that it works. You have to build in/tell it how to split up the work.
Different kinds of jobs needs need to be subdivided in different ways, but to continue with your example you might do something like this:
import time
from multiprocessing.pool import Pool
def foo(_):
for _ in range(100000000):
a = 3
def foo2(job_desc):
start, stop = job_desc
print(f"{start}, {stop}")
for _ in range(start, stop):
a = 3
def main():
t1 = time.time()
foo(1)
print(f"Without multiprocessing {time.time() - t1}")
pool_size = 8
pool = Pool(pool_size)
t1 = time.time()
top_num = 100000000
size = top_num // pool_size
job_desc_list = [[size * j, size * (j+1)] for j in range(pool_size)]
# this is in case the the upper bound is not a multiple of pool_size
job_desc_list[-1][-1] = top_num
pool.map(foo2, job_desc_list)
print(f"Sample multiprocessing {time.time() - t1}")
if __name__ == "__main__":
main()
Which results in:
Without multiprocessing 3.080709171295166
0, 12500000
12500000, 25000000
25000000, 37500000
37500000, 50000000
50000000, 62500000
62500000, 75000000
75000000, 87500000
87500000, 100000000
Sample multiprocessing 1.5312283039093018
As this shows, splitting the job up does allow it to take less time. The speedup will depend on the number of CPUs. In a CPU bound job you should try to limit it the pool size to the number of CPUs. My laptop has plenty more CPU's but some of the benefit is lost to the overhead. If the jobs were longer this should look more useful.

Threadpool in python is not as fast as expected

I'm beginner to python and machine learning. I'm trying to reproduce the code for countvectorizer() using multi-threading. I'm working with yelp dataset to do sentiment analysis using LogisticRegression. This is what I've written so far:
Code snippet:
from multiprocessing.dummy import Pool as ThreadPool
from threading import Thread, current_thread
from functools import partial
data = df['text']
rev = df['stars']
y = []
def product_helper(args):
return featureExtraction(*args)
def featureExtraction(p,t):
temp = [0] * len(bag_of_words)
for word in p.split():
if word in bag_of_words:
temp[bag_of_words.index(word)] += 1
return temp
# function to be mapped over
def calculateParallel(threads):
pool = ThreadPool(threads)
job_args = [(item_a, rev[i]) for i, item_a in enumerate(data)]
l = pool.map(product_helper,job_args)
pool.close()
pool.join()
return l
temp_X = calculateParallel(12)
Here this is just part of code.
Explanation:
df['text'] has all the reviews and df['stars'] has the ratings (1 through 5). I'm trying to find the word count vector temp_X using multi-threading. bag_of_words is a list of some frequent words of choice.
Question:
Without multi-threading , I was able to compute the temp_X in around 24 minutes and the above code took 33 mins for a dataset of size 100k reviews. My machine has 128GB of DRAM and 12 cores (6 physical cores with hyperthreading i.e., threads per core=2).
What am I doing wrong here?

Your whole code seems CPU Bound rather than IO Bound.You are just using threads which are under GIL so effectively running just one thread plus overheads.It runs only on one core.To run on multiple cores use
Use
import multiprocessing
pool = multiprocessing.Pool()
l = pool.map_async(product_helper,job_args)
from multiprocessing.dummy import Pool as ThreadPool is just a wrapper over thread module.It utilises just one core and not more than that.

Python and threads dont really work together very well. There is a known issue called the GIL (global interperter lock). Basically there is a lock in the interperter that makes all threads not run in parallel (even if you have multiple cpu cores). Python will simply give each thread a few milliseconds of cpu time one after another (and the reason it became slower is the overhead from context switching between those threads).
Here is a really good document explaining how it works: http://www.dabeaz.com/python/UnderstandingGIL.pdf
To fix your problem i suggest you try multi processing:
https://pymotw.com/2/multiprocessing/basics.html
Note: multiprocessing is not 100% equivilent to multithreading. Multiprocessing will run at parallel but the diffrent processes wont share memory so if you change a variable in one of them it will not be changed in the other process.

Multiprocessing in Python. Why is there no speed-up?

I am trying to get to grips with multiprocessing in Python. I started by creating this code. It simply computes cos(i) for integers i and measures the time taken when one uses multiprocessing and when one does not. I am not observing any time difference. Here is my code:
import multiprocessing
from multiprocessing import Pool
import numpy as np
import time
def tester(num):
return np.cos(num)
if __name__ == '__main__':
starttime1 = time.time()
pool_size = multiprocessing.cpu_count()
pool = multiprocessing.Pool(processes=pool_size,
)
pool_outputs = pool.map(tester, range(5000000))
pool.close()
pool.join()
endtime1 = time.time()
timetaken = endtime1 - starttime1
starttime2 = time.time()
for i in range(5000000):
tester(i)
endtime2 = time.time()
timetaken2 = timetaken = endtime2 - starttime2
print( 'The time taken with multiple processes:', timetaken)
print( 'The time taken the usual way:', timetaken2)
I am observing no (or very minimal) difference between the two times measured. I am using a machine with 8 cores, so this is surprising. What have I done incorrectly in my code?
Note that I learned all of this from this.
http://pymotw.com/2/multiprocessing/communication.html
I understand that "joblib" might be more convenient for an example like this, but the ultimate thing that this needs to be applied to does not work with "joblib".

Your job seems the computation of a single cos value. This is going to be basically unnoticeable compared to the time of communicating with the slave.
Try making 5 computations of 1000000 cos values and you should see them going in parallel.

First, you wrote :
timetaken2 = timetaken = endtime2 - starttime2
So it is normal if you have the same times displayed. But this is not the important part.
I ran your code on my computer (i7, 4 cores), and I get :
('The time taken with multiple processes:', 14.95710802078247)
('The time taken the usual way:', 6.465447902679443)
The multiprocessed loop is slower than doing the for loop. Why?
The multiprocessing module can use multiple processes, but still has to work with the Python Global Interpreter Lock, wich means you can't share memory between your processes. So when you try to launch a Pool, you need to copy useful variables, process your calculation, and retrieve the result. This cost you a little time for every process, and makes you less effective.
But this happens because you do a very small computation : multiprocessing is only useful for larger calculation, when the memory copying and results retrieving is cheaper (in time) than the calculation.
I tried with following tester, which is much more expensive, on 2000 runs:
def expenser_tester(num):
A=np.random.rand(10*num) # creation of a random Array 1D
for k in range(0,len(A)-1): # some useless but costly operation
A[k+1]=A[k]*A[k+1]
return A
('The time taken with multiple processes:', 4.030329942703247)
('The time taken the usual way:', 8.180987119674683)
You can see that on an expensive calculation, it is more efficient with the multiprocessing, even if you don't always have what you could expect (I could have a x4 speedup, but I only got x2)
Keep in mind that Pool has to duplicate every bit of memory used in calculation, so it may be memory expensive.
If you really want to improve a small calculation like your example, make it big by grouping and sending a list of variable to the pool instead of one variable by process.
You should also know that numpy and scipy have a lot of expensive function written in C/Fortran and already parallelized, so you can't do anything much to speed them.

If the problem is cpu bounded then you should see the required speed-up (if the operation is long enough and overhead is not significant). But when multiprocessing (because memory is not shared between processes) it's easier to have a memory bound problem.

python multiprocessing freezing

I am trying to implement multiprocessing with Python. It works when pooling very quick tasks, however, freezes when pooling longer tasks. See my example below:
from multiprocessing import Pool
import math
import time
def iter_count(addition):
print "starting ", addition
for i in range(1,99999999+addition):
if i==99999999:
print "completed ", addition
break
if __name__ == '__main__':
print "starting pooling "
pool = Pool(processes=2)
time_start = time.time()
possibleFactors = range(1,3)
try:
pool.map( iter_count, possibleFactors)
except:
print "exception"
pool.close()
pool.join()
#iter_count(1)
#iter_count(2)
time_end = time.time()
print "total loading time is : ", round(time_end-time_start, 4)," seconds"
In this example, if I use smaller numbers in for loop (something like 9999999) it works. But when running for 99999999 it freezes. I tried running two processes (iter_count(1) and iter_count(2)) in sequence, and it takes about 28 seconds, so not really a big task. But when I pool them it freezes. I know that there are some known bugs in python around multiprocessing, however, in my case, same code works for smaller sub tasks, but freezes for bigger ones.

You're using some version of Python 2 - we can tell because of how print is spelled.
So range(1,99999999+addition) is creating a gigantic list, with at least 100 million integers. And you're doing that in 2 worker processes simultaneously. I bet your disk is grinding itself to dust while the OS swaps out everything it can ;-)
Change range to xrange and see what happens. I bet it will work fine then.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Why multiprocessing is slower than simple for loop? - python

It is because of the overhead for handling the processes. In this case the creation and deletion of the processes does not weigh up to the performance boost the code gets from running it parallel. If the executed code is more complex there will probably be a speedup.

Related

why pool and cached method has almost same runtime?

Why is multiprocessing slower here?

Threadpool in python is not as fast as expected

Multiprocessing in Python. Why is there no speed-up?

python multiprocessing freezing

Categories

Resources