I am working on Ubuntu 12 with 8 CPUs, as reported by the System Monitor.
The test code is:
import multiprocessing as mp
def square(x):
    return x**2
if __name__ == '__main__':
    pool=mp.Pool(processes=4)
    pool.map(square,range(100000000))
    pool.close()
    # for i in range(100000000):
    #     square(i)
The problem is:
1) The entire workload seems to be scheduled on just one core, which gets close to 100% utilization, even though several processes are started. Occasionally the whole workload migrates to another core, but it is never distributed among them.
2) The version without multiprocessing is faster:
for i in range(100000000):
    square(i)
I have read similar questions on Stack Overflow, such as
"Python multiprocessing utilizes only one core",
but still have not found a solution that works.
The function you are using is far too short (i.e. it takes too little time to compute), so you spend all your time on the synchronization between processes, which has to be done serially (so it might as well happen on a single processor). Try this:
import multiprocessing as mp
def square(x):
    for i in range(10000):
        j = i**2
    return x**2
if __name__ == '__main__':
    # pool=mp.Pool(processes=4)
    # pool.map(square,range(1000))
    # pool.close()
    for i in range(1000):
        square(i)
You will see that multiprocessing suddenly works well: with the pool lines uncommented (and the plain loop commented out) it takes ~2.5 seconds to finish, while it takes 10 s without the pool.
Note: If you are using Python 2, you might want to replace every range with xrange.
Edit: I replaced time.sleep with a CPU-intensive but useless calculation.
Addendum: In general, for multi-CPU applications, you should try to make each CPU do as much work as possible without returning to the parent process. In a case like yours, this means splitting the range into lists of almost equal size, one per CPU, and sending each list to a different CPU.
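The addendum could be sketched roughly as follows, assuming Python 3. The helper names split_range, sum_squares and parallel_sum_squares are made up for this illustration, and each worker returns the sum of its squares (rather than the full list) only to keep the data sent back to the parent small.

```python
import multiprocessing as mp


def split_range(n, parts):
    """Split range(n) into `parts` contiguous (start, stop) pairs of near-equal size."""
    step, extra = divmod(n, parts)
    bounds = []
    start = 0
    for i in range(parts):
        # The first `extra` chunks get one additional element.
        stop = start + step + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def sum_squares(bounds):
    """Square every number in [start, stop) inside one worker and return the sum."""
    start, stop = bounds
    return sum(x ** 2 for x in range(start, stop))


def parallel_sum_squares(n, workers):
    """Hand one large chunk to each worker instead of one number at a time."""
    with mp.Pool(processes=workers) as pool:
        return sum(pool.map(sum_squares, split_range(n, workers)))


if __name__ == '__main__':
    print(parallel_sum_squares(1000000, 4))
```

With this layout each worker receives a single small message (two integers) and sends back a single number, so almost all of the time is spent computing rather than communicating.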
When you do:
pool.map(square, range(100000000))
Before map is invoked, Python 2's range has to build a list with 100000000 elements, and that is done by a single process. That is why you see a single core working.
Use a lazy sequence instead (xrange in Python 2), so the full list is not built up front, and you should see the speedup:
pool.map(square, xrange(100000000))
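For Python 3, where range is already lazy, a comparable sketch might use imap with an explicit chunksize. The function name lazy_sum_of_squares and the chunk size of 10000 are assumptions chosen for illustration, not values from the question.

```python
import multiprocessing as mp


def square(x):
    return x ** 2


def lazy_sum_of_squares(n, workers=4, chunksize=10000):
    """Feed a lazy range to the pool in batches and consume results as they arrive."""
    with mp.Pool(processes=workers) as pool:
        # imap does not turn the input into a list first; chunksize groups
        # the items so each inter-process message carries many numbers.
        return sum(pool.imap(square, range(n), chunksize=chunksize))


if __name__ == '__main__':
    print(lazy_sum_of_squares(1000000))
```

A larger chunksize means fewer messages between the parent and the workers, which matters a lot when the function itself is as cheap as square.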
It isn't sufficient simply to import the multiprocessing library to make use of multiple processes to schedule your work. You actually have to create processes too!
Your work is currently scheduled to a single core because you haven't done so, and so your program is a single process with a single thread.
Naturally, when you start a new process simply to square a number, you are going to get slower performance. The overhead of process creation makes sure of that, so your process pool will very likely take longer than a single-process run.
Related
from multiprocessing import Pool
import time
def myfun(a):
    return a*2
p=Pool(5)
k0=time.time()
p.map(myfun,[1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10])
k1=time.time()
print(k1-k0)
k0=time.time()
for i in [1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10]:
    myfun(i)
k1=time.time()
print(k1-k0)
I am using the multiprocessing package in Python. As you can see, I have executed two different snippets of code separately. The first one, which uses Pool.map, takes more time than the second one, which is executed serially. Can anyone explain why? I thought p.map() would be much faster. Is it not executed in parallel?
Indeed, as noted in the comments, some tasks take longer to run in parallel with multiprocessing. This is expected for very small tasks. The reason is that you have to start a Python interpreter process for each worker, and you also have to serialize and ship both the function and the data you are sending with map. This takes some time, so there is an overhead associated with using a multiprocessing.Pool. For very quick tasks, I suggest multiprocessing.dummy.Pool, which uses threads and thus minimizes setup overhead.
Try putting a time.sleep(x) in your function call and varying x. You'll see that as x increases, the function becomes more suitable to run in a thread pool, and then, for even more expensive x, in a process pool.
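One way to run that experiment is sketched below, assuming Python 3. The names slow_double and timed_map and the particular delays are invented for this illustration, and the actual timings will depend on the machine.

```python
import time
from functools import partial
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool


def slow_double(a, delay):
    """Stand-in for real work: sleep for `delay` seconds, then double `a`."""
    time.sleep(delay)
    return a * 2


def timed_map(pool_factory, data, delay, workers=4):
    """Map slow_double over data on a fresh pool; return (results, seconds).

    The timing includes pool start-up, which is the overhead being compared.
    """
    start = time.time()
    with pool_factory(workers) as pool:
        results = pool.map(partial(slow_double, delay=delay), data)
    return results, time.time() - start


if __name__ == '__main__':
    data = list(range(20))
    for delay in (0.0, 0.01, 0.05):
        for name, factory in (('threads', ThreadPool), ('processes', Pool)):
            _, seconds = timed_map(factory, data, delay)
            print(delay, name, round(seconds, 3))
```

Note that time.sleep releases the GIL, so it behaves like an I/O-bound task; a CPU-bound function would favour the process pool much sooner.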
I have some Python code that performs a collision-free task: no race conditions can occur as a result of this parallelism. I'm merely trying to increase processing speed. I have 4 files, and rather than reading them one at a time I'd like to open all four and read/edit data from them simultaneously.
I've read a few questions on here explaining that multi-threading in Python isn't possible due to the Global Interpreter Lock, but that multiprocessing gets around this. For the record, my code does exactly what it's meant to when I just run it four times in separate terminals. I'm guessing this is "multiprocessing"; however, I'd like a cleaner programmatic solution.
The data sets are large, so it can be assumed that as soon as a "process" is given to the interpreter, it is essentially locked for a long time:
Example:
import multiprocessing
def worker():
    while True:
        # do some stuff
        break  # placeholder: leave the loop once the work is done
    return
if __name__ == '__main__':
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=worker)
        jobs.append(p)
        p.start()
The issue is that the above executes the first process and never realistically starts the second until the first has finished, so they run one after the other.
Is there a way I can effectively execute worker X times, either by starting them all at the same time or by preventing them from running until they have all started? My OS has access to 8 cores.
I have been playing around with a multiprocessing problem and noticed that my algorithm is slower when I parallelize it than when it is single-threaded.
In my code I don't share memory.
And I'm pretty sure my algorithm (see code), which is just nested loops, is CPU bound.
However, no matter what I do, the parallel code runs 10-20% slower on all my computers.
I also ran this on a 20-CPU virtual machine, and single-threaded beats multithreaded every time (it is even slower up there than on my computer, actually).
from multiprocessing.dummy import Pool as ThreadPool
from multi import chunks
from random import random
import logging
import time
## Produce two sets of items we can iterate over
S = []
for x in range(100000):
    S.append({'value': x*random()})
H =[]
for x in range(255):
    H.append({'value': x*random()})
# the function for each thread
# just nested iteration
def doStuff(HH):
    R =[]
    for k in HH['S']:
        for h in HH['H']:
            R.append(k['value'] * h['value'])
    return R
# we will split the work
# between the worker threads and give each
# 5 items to iterate over the big list
HChunks = chunks(H, 5)
XChunks = []
# turn them into dictionaries, so I can pass in both
# the S and H lists
# Note: I do this because I'm not sure whether using the global
# S would spend too much time on cache synchronization or not;
# the idea is that I don't want each thread to share anything.
for x in HChunks:
    XChunks.append({'H': x, 'S': S})
print("Process")
t0 = time.time()
pool = ThreadPool(4)
R = pool.map(doStuff, XChunks)
pool.close()
pool.join()
t1 = time.time()
# measured time for 4 threads is slower
# than when i have this code just do
# doStuff(..) in non-parallel way
# Why!?
total = t1-t0
print("Took", total, "secs")
There are many related questions open, but many are geared toward code that is structured incorrectly, with each worker being I/O bound and so on.
You are using multithreading, not multiprocessing. While many languages allow threads to run in parallel, Python (at least CPython) does not. A thread is just a separate state of control, i.e. it holds its own stack, current function, etc. The Python interpreter just switches between executing each stack every now and then.
Basically, all threads are running on a single core. They will only speed up your program when you are not CPU bound.
multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module.
Multithreading is usually slower than single threading if you are CPU bound. This is because the work and processing resources stay the same, but you add overhead for managing the threads, e.g. switching between them.
How to fix this: instead of from multiprocessing.dummy import Pool as ThreadPool, use from multiprocessing import Pool as ThreadPool.
You might want to read up on the GIL, the Global Interpreter Lock. It is what prevents threads from running in parallel (that, and its implications for single-threaded performance). Python interpreters other than CPython may not have a GIL and may be able to run multithreaded code on several cores.
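A reduced sketch of the same nested-loop job on a real process pool might look like this, assuming Python 3. The chunks helper here is a hypothetical stand-in for the asker's multi.chunks, which is not shown, and the data sizes are shrunk so the demo finishes quickly.

```python
from multiprocessing import Pool  # real processes, not multiprocessing.dummy threads


def chunks(seq, size):
    """Hypothetical stand-in for multi.chunks: split seq into lists of `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def multiply_all(args):
    """Nested iteration over one (S, H-chunk) pair, as in doStuff."""
    s_values, h_values = args
    return [s * h for s in s_values for h in h_values]


if __name__ == '__main__':
    S = [float(x) for x in range(1000)]
    H = [float(x) for x in range(20)]
    tasks = [(S, h_chunk) for h_chunk in chunks(H, 5)]
    with Pool(4) as pool:
        R = pool.map(multiply_all, tasks)
    print(len(R), len(R[0]))
```

Keep in mind that with processes both the arguments and the returned lists are pickled and copied between processes, so returning very large result lists can eat into the speedup.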
I need to run the same function based on the same data a lot of times.
For this I am using multiprocessing.Pool in order to speedup the computation.
from multiprocessing import Pool
import numpy as np
x=np.array([1,2,3,4,5])
def func(x): #this should be a function that takes 3 minutes
    m=np.mean(x)
    return(m)
p=Pool(100)
mapper=p.map(func,[x]*500)
The program works well, but at the end I have 100 Python processes open and my whole system becomes very slow.
How can I solve this?
Am I using Pool in the wrong way? Should I use another function?
EDIT: If I use p = Pool(multiprocessing.cpu_count()), will my PC use 100% of its power?
Or is there something else I should use?
In addition to limiting yourself to
p = Pool(multiprocessing.cpu_count())
I believe you want to do the following when you're finished as well...
p.close()
This should close out the worker processes once the work is completed.
As a general rule, you don't want many more pool workers than you have CPU cores, because your computer won't be able to parallelize the work beyond the number of cores available to actually do the processing. It doesn't matter if you've got 100 processes when your CPU can only process four things simultaneously. A common practice is this:
p = Pool(multiprocessing.cpu_count())
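Putting both suggestions together, a minimal sketch (assuming Python 3) could look like the following. The name run_jobs is invented for the example, and statistics.mean stands in for the asker's real three-minute function.

```python
import multiprocessing
from statistics import mean


def func(x):
    """Placeholder for the expensive function from the question."""
    return mean(x)


def run_jobs(data, repeats, workers=None):
    """Run func on `data` `repeats` times with at most one worker per core."""
    workers = workers or multiprocessing.cpu_count()
    pool = multiprocessing.Pool(workers)
    try:
        return pool.map(func, [data] * repeats)
    finally:
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for the worker processes to exit


if __name__ == '__main__':
    print(run_jobs([1, 2, 3, 4, 5], 8)[:3])
```

Calling join after close makes sure no idle worker processes are left behind once the results are in.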
Recently I wanted to speed up some of my code using parallel processing, as I have a quad-core i7 and leaving it idle seemed like a waste. I learned about Python's GIL (I'm using v3.3.2, if it matters) and how it can be overcome using the multiprocessing module, so I wrote this simple test program:
from multiprocessing import Process, Queue
def sum(a,b):
    su=0
    for i in range(a,b):
        su+=i
    q.put(su)
q= Queue()
p1=Process(target=sum, args=(1,25*10**7))
p2=Process(target=sum, args=(25*10**7,5*10**8))
p3=Process(target=sum, args=(5*10**8,75*10**7))
p4=Process(target=sum, args=(75*10**7,10**9))
p1.run()
p2.run()
p3.run()
p4.run()
r1=q.get()
r2=q.get()
r3=q.get()
r4=q.get()
print(r1+r2+r3+r4)
The code runs in about 48 seconds measured using cProfile, however the single process code
def sum(a,b):
    su=0
    for i in range(a,b):
        su+=i
    print(su)
sum(1,10**9)
runs in about 50 seconds. I understand that the method has overheads, but I expected the improvement to be more drastic. The error with fork() doesn't apply to me, as I'm running the code on a Mac.
The problem is that you're calling run rather than start.
If you read the docs, run is the "Method representing the process's activity", while start is the function that starts the process's activity on the background process. (This is the same as with threading.Thread.)
So, what you're doing is running the sum function on the main process, and never doing anything on the background processes.
From timing tests on my laptop, switching to start cuts the time to about 37% of the original. Not quite the 25% you'd hope for, and I'm not sure why, but… good enough to prove that it's really multi-processing. (That, and the fact that I get four extra Python processes each using 60-100% CPU…)
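For reference, a corrected version of the test program might look roughly like this, assuming Python 3. The names range_sum, worker and parallel_sum are chosen for this sketch, the queue is passed to each process explicitly instead of being a global, and the range is shrunk so the demo runs quickly.

```python
from multiprocessing import Process, Queue


def range_sum(a, b):
    """Sum the integers in [a, b) with a plain loop, as in the question."""
    su = 0
    for i in range(a, b):
        su += i
    return su


def worker(a, b, q):
    """Run in a child process: compute one partial sum and send it back."""
    q.put(range_sum(a, b))


def parallel_sum(bounds):
    q = Queue()
    procs = [Process(target=worker, args=(a, b, q)) for a, b in bounds]
    for p in procs:
        p.start()  # start() launches a child process; run() would stay in this one
    total = sum(q.get() for _ in procs)  # collect results before joining
    for p in procs:
        p.join()
    return total


if __name__ == '__main__':
    n = 10**6
    bounds = [(1, n // 4), (n // 4, n // 2), (n // 2, 3 * n // 4), (3 * n // 4, n)]
    print(parallel_sum(bounds))
```

The results are read from the queue before join is called, which avoids waiting on a child that is still blocked writing to the queue.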
If you really want to write fast numerical code in Python, a plain Python loop is not the way to go. Use NumPy or Cython; your computations can be around a hundred times faster than in plain Python.
On the other hand, if you just want to launch a bunch of parallel jobs, use the proper tools for it, for example:
from multiprocessing import Pool
def mysum(a,b):
    su=0
    for i in range(a,b):
        su+=i
    return su
if __name__ == '__main__':
    with Pool() as pool:
        print(sum(pool.starmap(mysum, ((1,25*10**7),
                                       (25*10**7,5*10**8),
                                       (5*10**8,75*10**7),
                                       (75*10**7,10**9)))))