I need to run the same function on the same data a lot of times.
For this I am using multiprocessing.Pool in order to speed up the computation.
from multiprocessing import Pool
import numpy as np

x = np.array([1, 2, 3, 4, 5])

def func(x):  # this should be a function that takes 3 minutes
    m = np.mean(x)
    return m

p = Pool(100)
mapper = p.map(func, [x] * 500)
The program works well, but at the end I have 100 Python processes open and my whole system starts to slow down.
How can I solve this? Am I using Pool in the wrong way? Should I use another function?
EDIT: will using p = Pool(multiprocessing.cpu_count()) make my PC use 100% of its power? Or is there something else I should use?
In addition to limiting yourself to
p = Pool(multiprocessing.cpu_count())
I believe you want to do the following when you're finished as well...
p.close()
This should close out the worker processes once they have completed.
As a general rule, you don't want many more worker processes than you have CPU cores, because your computer can't parallelize the work beyond the number of cores that are actually available to do the processing. It doesn't matter that you've got 100 processes when your CPU can only run four things simultaneously. A common practice is to do this:
p = Pool(multiprocessing.cpu_count())
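Put together with the code from the question, a minimal sketch (reusing func and x from above, and letting a with block clean the pool up) might look like this:

import multiprocessing
from multiprocessing import Pool

import numpy as np

x = np.array([1, 2, 3, 4, 5])

def func(x):  # stand-in for the real 3-minute computation
    return np.mean(x)

if __name__ == '__main__':
    # one worker per CPU core; the with block shuts the pool down on exit
    with Pool(multiprocessing.cpu_count()) as p:
        results = p.map(func, [x] * 500)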
Related
I have a few hundred thousand CSV files to which I would like to apply the same function. Something like the following dummy function:
import pandas as pd

def process_single_file(fname):
    df = pd.read_csv(fname)
    # Pandas and non-pandas processing
    df.to_csv(f"./output/{fname}")
As looping over all files one by one would take too long, my question is: what is the most efficient way to schedule and parallelize this execution, given that no process depends on any other? I started off trying to use Python's multiprocessing:
import glob
import multiprocessing

files = sorted(glob.glob("./input/*.csv"))

processes = []
for fname in files:
    p = multiprocessing.Process(target=process_single_file, args=(fname,))
    processes.append(p)
    p.start()

for process in processes:
    process.join()
My computer, however, doesn't seem to like this approach: it quickly overloads all the CPUs, leading to slow-downs and crashes. Is there a more efficient way to limit the load on the CPUs and schedule the tasks, for example using Dask, a Bash script, or a change to the Python code? Thanks in advance.
It really depends on where your bottleneck is: are you spending more time reading/writing files, or doing CPU-bound processing?
This RealPython tutorial really helped me learn about all this stuff, I can only recommend a good read ;)
As explained in the tutorial, if you are I/O-bound, multithreading is enough (and possibly better than multiprocessing):
import concurrent.futures

def process_all_files(files):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        executor.map(process_single_file, files)
And if you are CPU-bound, multiprocessing will let you use all your available cores:
import multiprocessing

def process_all_files(files):
    with multiprocessing.Pool() as pool:
        pool.map(process_single_file, files)
You can also try Ray; it is quite an efficient module for parallelizing tasks.
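For example, a minimal Ray sketch for the CSV case might look like the following (process_file and its body are placeholders for the per-file work in the question):

import glob

import ray

ray.init()  # starts Ray locally, sized to the machine's cores by default

@ray.remote
def process_file(fname):
    # drop in the same pandas read/process/write as process_single_file here
    return fname

futures = [process_file.remote(f) for f in sorted(glob.glob("./input/*.csv"))]
ray.get(futures)  # block until every task has finished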
Absolutely, Pool is the way to go.
Something along the lines below:

from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)
    print(pool.map(f, range(10)))
Check the following post:
Using multiprocessing.Process with a maximum number of simultaneous processes
So here is the deal:

import multiprocessing

min_range = 1602
max_range = 9999999
for image in range(min_range, max_range):
    p1 = multiprocessing.Process(target=process, args=(image,))
    p1.start()
I have that many processes to run; I can't run them all at once or it will wreck my system. And I can't run them one at a time by using
p1.join()
So all I want to do is run 20 processes, wait until they are done, and once those 20 are done, run the next 20. But I don't know how to achieve that. Please help me... Thanks :)
multiprocessing.Process is for the case where you want to define, start, and control all the processes yourself.
Your case looks more like a use case for multiprocessing.Pool: you define a pool of parallel processes, hand it a function and the arguments (a list), and it distributes the work to the processes automatically.
A side note: why would you want 20 parallel processes? If you are doing multiprocessing to make better use of your CPU, the number of parallel processes should be <= the number of cores (or hardware threads, if you have a multi-threaded CPU).
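A minimal sketch along those lines, assuming process is the per-image function from your snippet (20 workers as you asked for, though cpu_count() is usually the better default):

import multiprocessing

def process(image):
    ...  # your per-image work goes here

if __name__ == '__main__':
    min_range = 1602
    max_range = 9999999
    with multiprocessing.Pool(processes=20) as pool:
        # the pool keeps exactly 20 workers busy and feeds them new images as they finish
        pool.map(process, range(min_range, max_range))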
I have a string processing job in Python and I wish to speed it up by using a thread pool. The string processing jobs have no dependency on each other. The results will be stored in a MongoDB database.
I wrote my code as follows:
import multiprocessing
from multiprocessing.pool import ThreadPool

def _process(s):
    # Do stuff: pure Python string manipulation.
    # Save the output to a database (pyMongo).
    ...

thread_pool_size = multiprocessing.cpu_count()
pool = ThreadPool(thread_pool_size)

for single_string in string_list:  # string_list is defined elsewhere
    pool.apply_async(_process, [single_string])

pool.close()
pool.join()
I run the code on a Linux machine with 8 CPU cores, and it turns out that the maximum CPU usage is only around 130% (read from top) when the job has been running for a few minutes.
Is my approach to using a thread pool correct? Is there a better way to do this?
You might try using multiple processes instead of multiple threads. Here is a good comparison of both options. In one of the comments it is stated that Python is not able to use multiple CPUs while working with multiple threads (due to the Global Interpreter Lock). So instead of a thread pool you should use a process pool to take full advantage of your machine.
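A minimal sketch of that swap, keeping the apply_async pattern from the question (string_list stands in for however you load the strings):

import multiprocessing

def _process(s):
    # pure-Python string manipulation + save to MongoDB, as in the question;
    # each worker process should create its own MongoClient rather than share one
    ...

if __name__ == '__main__':
    string_list = []  # fill with the strings to process

    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    for single_string in string_list:
        pool.apply_async(_process, [single_string])
    pool.close()
    pool.join()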
Perhaps _process isn't CPU-bound; it might be slowed down by the file system or network, since you're writing to a database. You could check whether the CPU usage rises when you make the function truly CPU-bound, for example:
def _process(s):
    for i in xrange(100000000):
        j = i * i
I am working on Ubuntu 12 with 8 CPUs, as reported by the System Monitor.
The testing code is:
import multiprocessing as mp

def square(x):
    return x**2

if __name__ == '__main__':
    pool = mp.Pool(processes=4)
    pool.map(square, range(100000000))
    pool.close()
    # for i in range(100000000):
    #     square(i)
The problems are:
1) All the workload seems to be scheduled to just one core, which gets close to 100% utilization, despite the fact that several processes are started. Occasionally all the workload migrates to another core, but it is never distributed among them.
2) It is faster without multiprocessing:

for i in range(100000000):
    square(i)
I have read similar questions on Stack Overflow, like:
Python multiprocessing utilizes only one core
but still have no working solution.
The function you are using is way too short (i.e. it doesn't take enough time to compute), so you spend all your time in the synchronization between processes, which has to be done serially (so why not on a single processor). Try this:
import multiprocessing as mp

def square(x):
    for i in range(10000):
        j = i**2
    return x**2

if __name__ == '__main__':
    # pool = mp.Pool(processes=4)
    # pool.map(square, range(1000))
    # pool.close()
    for i in range(1000):
        square(i)
You will see that suddenly the multiprocessing works well: it takes ~2.5 seconds to finish, while it takes 10 s without it.
Note: if you are using Python 2, you might want to replace every range with xrange.
Edit: I replaced time.sleep with a CPU-intensive but useless calculation.
Addendum: in general, for multi-CPU applications, you should try to make each CPU do as much work as possible without returning to the same process. In a case like yours, this means splitting the range into almost equal-sized chunks, one per CPU, and sending them to the various CPUs.
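A rough sketch of that chunking idea (the helper name square_chunk is mine, not from the question): each worker gets one contiguous slice of the range and processes it in a single task.

import multiprocessing as mp

def square_chunk(bounds):
    lo, hi = bounds
    return [x**2 for x in range(lo, hi)]

if __name__ == '__main__':
    n = 1000
    workers = mp.cpu_count()
    step = -(-n // workers)  # ceiling division
    chunks = [(i, min(i + step, n)) for i in range(0, n, step)]
    with mp.Pool(workers) as pool:
        results = pool.map(square_chunk, chunks)  # one big task per worker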
When you do:
pool.map(square, range(100000000))
Before invoking the map function, it has to create a list with 100000000 elements, and this is done by a single process. That's why you see a single core working.
Use a generator instead, so each core can pop a number out of it and you should see the speedup:
pool.map(square, xrange(100000000))
It isn't sufficient simply to import the multiprocessing library to make use of multiple processes to schedule your work. You actually have to create processes too!
Your work is currently scheduled to a single core because you haven't done so, and so your program is a single process with a single thread.
Naturally, when you start a new process just to square a number, you are going to get slower performance. The overhead of process creation makes sure of that. So your process pool will very likely take longer than a single-process run.
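If you want to see that overhead directly, a quick sketch (not from the answer) is to time the pooled map against a plain loop over the same trivial function:

import multiprocessing as mp
import time

def square(x):
    return x**2

if __name__ == '__main__':
    n = 10**6

    start = time.perf_counter()
    [square(i) for i in range(n)]
    print("serial:", time.perf_counter() - start)

    start = time.perf_counter()
    with mp.Pool(4) as pool:
        pool.map(square, range(n))  # every item is pickled, sent, and sent back
    print("pooled:", time.perf_counter() - start)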
Recently I wanted to speed up some of my code using parallel processing, as I have a quad-core i7 and it seemed like a waste. I learned about Python's GIL (I'm using v3.3.2, if it matters) and how it can be overcome using the multiprocessing module, so I wrote this simple test program:
from multiprocessing import Process, Queue

def sum(a, b):
    su = 0
    for i in range(a, b):
        su += i
    q.put(su)

q = Queue()

p1 = Process(target=sum, args=(1, 25*10**7))
p2 = Process(target=sum, args=(25*10**7, 5*10**8))
p3 = Process(target=sum, args=(5*10**8, 75*10**7))
p4 = Process(target=sum, args=(75*10**7, 10**9))

p1.run()
p2.run()
p3.run()
p4.run()

r1 = q.get()
r2 = q.get()
r3 = q.get()
r4 = q.get()

print(r1 + r2 + r3 + r4)
The code runs in about 48 seconds, measured using cProfile. However, the single-process code
def sum(a, b):
    su = 0
    for i in range(a, b):
        su += i
    print(su)

sum(1, 10**9)
runs in about 50 seconds. I understand that the method has overhead, but I expected the improvement to be more drastic. The error with fork() doesn't apply to me, as I'm running the code on a Mac.
The problem is that you're calling run rather than start.
If you read the docs, run is the "Method representing the process's activity", while start is the function that starts the process's activity on the background process. (This is the same as with threading.Thread.)
So, what you're doing is running the sum function on the main process, and never doing anything on the background processes.
From timing tests on my laptop, this cuts the time to about 37% of the original. Not quite the 25% you'd hope for, and I'm not sure why, but… good enough to prove that it's really multi-processing. (That, and the fact that I get four extra Python processes each using 60-100% CPU…)
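For reference, a sketch of the corrected version (I renamed sum to sum_range and pass the queue in explicitly, which the original code didn't do):

from multiprocessing import Process, Queue

def sum_range(a, b, q):
    su = 0
    for i in range(a, b):
        su += i
    q.put(su)

if __name__ == '__main__':
    q = Queue()
    bounds = [(1, 25*10**7), (25*10**7, 5*10**8), (5*10**8, 75*10**7), (75*10**7, 10**9)]
    processes = [Process(target=sum_range, args=(a, b, q)) for a, b in bounds]
    for p in processes:
        p.start()  # start(), not run(): run() would execute in the parent process
    results = [q.get() for _ in processes]  # one partial sum per worker
    for p in processes:
        p.join()
    print(sum(results))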
If you really want to write fast computations in Python, this is not the way to go. Use numpy or Cython; your computations will be hundreds of times faster than plain Python.
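For instance, a rough numpy sketch of the same 1 to 10**9 sum, done in blocks so it doesn't need gigabytes of RAM at once:

import numpy as np

def numpy_sum(a, b, block=10**7):
    total = 0
    for start in range(a, b, block):
        stop = min(start + block, b)
        total += int(np.arange(start, stop, dtype=np.int64).sum())
    return total

print(numpy_sum(1, 10**9))  # typically a few seconds, versus ~50 s for the pure-Python loop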
On the other hand, if you just want to launch a bunch of parallel jobs, use the proper tools for it, for example:
from multiprocessing import Pool

def mysum(a, b):
    su = 0
    for i in range(a, b):
        su += i
    return su

if __name__ == '__main__':
    with Pool() as pool:
        print(sum(pool.starmap(mysum, ((1, 25*10**7),
                                       (25*10**7, 5*10**8),
                                       (5*10**8, 75*10**7),
                                       (75*10**7, 10**9)))))