I have a function that:
1) reads in an HDF5 dataset as integer ASCII codes
2) converts the ASCII integers to characters with chr()
3) joins the characters into a single string
Upon profiling, I found that the vast majority of the calculation is spent on the step #2, the conversion of the ascii integers to characters. I have somewhat optimized this call by using:
''.join([chr(x) for x in file[dataSetName].value])
As my parsing function seems to be cpu bound (the conversion of integer to characters) and not i/o bound, I expected to obtain a more/less linear speed enhancement with the number of cores devoted to parsing. To parse one file serially takes ~15 seconds...to parse 10 files (on my 12 core machine) takes ~150 seconds while using 10 threads. That is, there seems to be no enhancement at all.
I have used the following code to launch my threads:
threads=[]
timer=[]
threadNumber=10
for i,d in enumerate(sortedDirSet):
    timer.append(time.time())
    # self.loadFile(d,i)
    threads.append(Thread(target=self.loadFile, args=(d,i)))
    threads[-1].start()
    if(i%threadNumber==0):
        for i2,t in enumerate(threads):
            t.join()
            print(time.time()-timer[i2])
        timer=[]
        threads=[]
for t in threads:
    t.join()
Any help would be greatly appreciated.
In CPython, threads cannot run Python code on more than one core at a time (because of the GIL); to use multiple cores you have to spawn subprocesses (with multiprocessing, for example). So you won't get any performance boost from spawning threads for CPU-bound tasks.
Here's an example of a script using multiprocessing and queue:
from Queue import Empty # <-- only needed to catch Exception (the module is named queue in Python 3)
from multiprocessing import Process, Queue, cpu_count
def loadFile(d, i, queue):
    # some other stuff
    queue.put(result)
if __name__ == "__main__":
    queue = Queue()
    no = cpu_count()
    processes = []
    for i,d in enumerate(sortedDirSet):
        p = Process(target=loadFile, args=(d, i, queue))
        p.start()
        processes.append(p)
        if i % no == 0:
            # note: if the results are large, drain the queue before
            # joining, otherwise a child can block while flushing its data
            for p in processes:
                p.join()
            processes = []
    for p in processes:
        p.join()
    results = []
    while True:
        try:
            # False means "don't wait when Empty, throw an exception instead"
            data = queue.get(False)
            results.append(data)
        except Empty:
            break
    # You have all the data, do something with it
The other (more complicated) way would be to use pipe instead of queue.
It would also be more efficient to spawn the processes once, then create a job queue and send the jobs to them (via a pipe), so you don't have to create a new process each time. But this would be even more complicated, so let's leave it like that.
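The "spawn the processes once and feed them jobs" idea could look roughly like the following. This is a minimal sketch rather than the answerer's code: it assumes Python 3, uses queues instead of raw pipes, and parse, worker and run_jobs are hypothetical names, with parse standing in for the real file-parsing work.

```python
from multiprocessing import Process, Queue

def parse(job):
    # Hypothetical stand-in for the real CPU-bound parsing of one file.
    return job, sum(range(job))

def worker(jobs, results):
    # Keep pulling jobs until the None sentinel arrives.
    for job in iter(jobs.get, None):
        results.put(parse(job))

def run_jobs(job_list, n_workers=2):
    jobs, results = Queue(), Queue()
    workers = [Process(target=worker, args=(jobs, results))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    for job in job_list:
        jobs.put(job)
    for _ in workers:
        jobs.put(None)  # one sentinel per worker
    # Drain the results before joining, so no worker blocks on a full pipe.
    out = [results.get() for _ in job_list]
    for w in workers:
        w.join()
    return dict(out)

if __name__ == "__main__":
    print(run_jobs([10, 100, 1000]))
```

The workers live for the whole run, so the process start-up cost is paid n_workers times instead of once per file.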
Freakish is correct with his answer, it will be the GIL thwarting your efforts.
If you were to use Python 3, you could do this very nicely using concurrent.futures. I believe this feature has also been backported to Python 2 as a package on PyPI.
Also, you could eke a little bit more speed out of your code by replacing your list comprehension:
''.join([chr(x) for x in file[dataSetName].value])
With a map:
''.join(map(chr, file[dataSetName].value))
My tests (on a massive random list) using above code showed 15.73s using list comprehension and 12.44s using map.
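If the dataset really holds only byte-sized codes (0-127 for pure ASCII), another option is to skip the per-element chr() call altogether and let bytes do the conversion in C. A minimal sketch, assuming Python 3 and a plain sequence of small integers; codes_to_str is a hypothetical helper name, not part of h5py:

```python
def codes_to_str(codes):
    """Convert an iterable of ASCII code points to a str in one pass."""
    # bytes() builds the whole buffer in C; decode() turns it into text.
    # Raises ValueError for values outside 0..255 and UnicodeDecodeError
    # for values above 127.
    return bytes(codes).decode("ascii")

codes = [72, 101, 108, 108, 111]
print(codes_to_str(codes))                               # Hello
print(codes_to_str(codes) == "".join(map(chr, codes)))   # True
```

Note that calling bytes() directly on a NumPy array copies its raw buffer rather than its values, so an array of wider integers would need its dtype converted first (for example with arr.astype('uint8').tobytes()).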
EDIT: it seems that I misused multiprocessing by starting a lot of short processes instead of dividing the work into n_cpu long processes.
I'm doing multi processing optimization and it's the first time that I actually use this. I have a function that takes a short time to run, however it needs to run a lot of times. I tried this code :
processes = []
# length is a variable that is in the order of 100k to 1M
for j in range(0, length):
    p = mp.Process(target=mp_function)
    processes.append(p)
    p.start()
for process in processes:
    process.join()
However this takes a very long time to run and just freezes my computer. I've also tried this but it takes a very long time (I just decided to not let it finish as it clearly takes a lot longer than what is acceptable) :
processes = []
for j in range(0, length):
    p = mp.Process(target=mp_function)
    processes.append(p)
    p.start()
    p.join()
I'm not sure what I'm doing wrong since it's the same approach as indicated on Python's documentation :
p = Process(target=f, args=('bob',))
p.start()
p.join()
From what I've read online, it seems that Pool is to be preferred when I have a few tasks that each take a long time to execute, and Process is to be used when I have a lot of short tasks.
I'm not sure what I'm doing wrong, so any help would be very appreciated
A pool may work well. Sending a value to a worker and getting a result back involves some overhead, but if you keep to simple types it's fairly fast. By default the pool will "chunk" the input data, meaning that it sends work to each worker in batches rather than one item at a time, to reduce interprocess communication.
import multiprocessing as mp
if __name__ == "__main__":
    with mp.Pool() as pool: # using default 1 process per cpu, reduce
                            # if it makes machine run too slow
        result = pool.map(my_function, [None]*length)
That list could be large in your case, so an intermediate function that knows how to break things out may be reasonable.
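def my_function_runner(num_calls):
    for _ in range(num_calls):
        my_function()
cpus = mp.cpu_count()
with mp.Pool(processes=cpus) as pool:
    # every worker gets length // cpus calls; the remainder is spread
    # one extra call at a time over the first few workers
    base, extra = divmod(length, cpus)
    num_calls_list = [base + 1] * extra + [base] * (cpus - extra)
    result = pool.map(my_function_runner, num_calls_list)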
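Pool.map also accepts a chunksize argument, so the batching can be left to the pool instead of being done by hand. A rough sketch, assuming Python 3; work is a hypothetical stand-in for the real short task, and the chunk size of 10,000 is an arbitrary value to tune:

```python
import multiprocessing as mp

def work(i):
    # Hypothetical short task; replace with the real function.
    return i * i

def run(length, chunksize=10_000):
    # One worker per CPU by default; each worker receives tasks in
    # batches of `chunksize` rather than one at a time.
    with mp.Pool() as pool:
        return pool.map(work, range(length), chunksize=chunksize)

if __name__ == "__main__":
    squares = run(100_000)
    print(len(squares), squares[:5])
```

Passing range(length) instead of a list of placeholders also avoids building a large throwaway list in the parent.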
Say I have a very large list and I'm performing an operation like so:
for item in items:
    try:
        api.my_operation(item)
    except:
        print 'error with item'
My issue is twofold:
1) There are a lot of items
2) api.my_operation takes forever to return
I'd like to use multi-threading to spin up a bunch of api.my_operations at once so I can process maybe 5 or 10 or even 100 items at once.
If my_operation() returns an exception (because maybe I already processed that item) - that's OK. It won't break anything. The loop can continue to the next item.
Note: this is for Python 2.7.3
First, in Python, if your code is CPU-bound, multithreading won't help, because only one thread can hold the Global Interpreter Lock, and therefore run Python code, at a time. So, you need to use processes, not threads.
This is not true if your operation "takes forever to return" because it's IO-bound—that is, waiting on the network or disk copies or the like. I'll come back to that later.
Next, the way to process 5 or 10 or 100 items at once is to create a pool of 5 or 10 or 100 workers, and put the items into a queue that the workers service. Fortunately, the stdlib multiprocessing and concurrent.futures libraries both wrap up most of the details for you.
The former is more powerful and flexible for traditional programming; the latter is simpler if you need to compose future-waiting; for trivial cases, it really doesn't matter which you choose. (In this case, the most obvious implementation with each takes 3 lines with futures, 4 lines with multiprocessing.)
If you're using 2.6-2.7 or 3.0-3.1, futures isn't built in, but you can install it from PyPI (pip install futures).
Finally, it's usually a lot simpler to parallelize things if you can turn the entire loop iteration into a function call (something you could, e.g., pass to map), so let's do that first:
def try_my_operation(item):
    try:
        api.my_operation(item)
    except:
        print('error with item')
Putting it all together:
executor = concurrent.futures.ProcessPoolExecutor(10)
futures = [executor.submit(try_my_operation, item) for item in items]
concurrent.futures.wait(futures)
If you have lots of relatively small jobs, the overhead of multiprocessing might swamp the gains. The way to solve that is to batch up the work into larger jobs. For example (using grouper from the itertools recipes, which you can copy and paste into your code, or get from the more-itertools project on PyPI):
def try_multiple_operations(items):
    for item in items:
        try:
            api.my_operation(item)
        except:
            print('error with item')
executor = concurrent.futures.ProcessPoolExecutor(10)
futures = [executor.submit(try_multiple_operations, group)
           for group in grouper(5, items)]
concurrent.futures.wait(futures)
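One thing to watch with the grouper recipe: it pads the last group with a fill value (None by default), so the worker would also be handed None items, and the version in the current itertools documentation takes the iterable as its first argument. A small alternative that yields plain, unpadded lists is sketched below; batched_items is a hypothetical helper name, not something from itertools:

```python
from itertools import islice

def batched_items(iterable, size):
    """Yield successive lists of up to `size` items, without padding."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

print(list(batched_items(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

Each batch can then be passed to executor.submit(try_multiple_operations, batch) in place of the grouper call.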
Finally, what if your code is IO bound? Then threads are just as good as processes, and with less overhead (and fewer limitations, but those limitations usually won't affect you in cases like this). Sometimes that "less overhead" is enough to mean you don't need batching with threads, but you do with processes, which is a nice win.
So, how do you use threads instead of processes? Just change ProcessPoolExecutor to ThreadPoolExecutor.
If you're not sure whether your code is CPU-bound or IO-bound, just try it both ways.
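For the IO-bound case, a thread-based version that also records which items failed might look like the following. It is only a sketch, assuming Python 3 (or the futures backport): my_operation is a hypothetical stand-in for api.my_operation that sleeps briefly and rejects multiples of 5, and run_all is an illustrative name.

```python
import concurrent.futures
import time

def my_operation(item):
    # Hypothetical stand-in for api.my_operation: slow, sometimes fails.
    time.sleep(0.01)
    if item % 5 == 0:
        raise ValueError("already processed")
    return item * 2

def run_all(items, max_workers=10):
    results, failed = {}, []
    with concurrent.futures.ThreadPoolExecutor(max_workers) as executor:
        future_to_item = {executor.submit(my_operation, i): i for i in items}
        for future in concurrent.futures.as_completed(future_to_item):
            item = future_to_item[future]
            try:
                results[item] = future.result()
            except Exception:
                failed.append(item)  # keep going, as in the original loop
    return results, sorted(failed)

results, failed = run_all(range(1, 11))
print(failed)  # [5, 10]
```

Swapping ThreadPoolExecutor for ProcessPoolExecutor gives the process-based variant, provided the submitted function and its arguments can be pickled.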
Can I do this for multiple functions in my python script? For example, if I had another for loop elsewhere in the code that I wanted to parallelize. Is it possible to do two multi threaded functions in the same script?
Yes. In fact, there are two different ways to do it.
First, you can share the same (thread or process) executor and use it from multiple places with no problem. The whole point of tasks and futures is that they're self-contained; you don't care where they run, just that you queue them up and eventually get the answer back.
Alternatively, you can have two executors in the same program with no problem. This has a performance cost—if you're using both executors at the same time, you'll end up trying to run (for example) 16 busy threads on 8 cores, which means there's going to be some context switching. But sometimes it's worth doing because, say, the two executors are rarely busy at the same time, and it makes your code a lot simpler. Or maybe one executor is running very large tasks that can take a while to complete, and the other is running very small tasks that need to complete as quickly as possible, because responsiveness is more important than throughput for part of your program.
If you don't know which is appropriate for your program, usually it's the first.
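Sharing a single executor between two unrelated loops can be as simple as the sketch below. The function names and the pool size of 8 are arbitrary assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# One executor for the whole program (the pool size is an arbitrary choice).
executor = ThreadPoolExecutor(max_workers=8)

def double_all(items):
    # First "loop": map returns results in input order.
    return list(executor.map(lambda x: x * 2, items))

def lengths_of(words):
    # Second "loop", elsewhere in the code, reusing the same pool.
    return list(executor.map(len, words))

doubled = double_all([1, 2, 3])
lengths = lengths_of(["a", "bcd"])
print(doubled)   # [2, 4, 6]
print(lengths)   # [1, 3]
executor.shutdown()  # wait for outstanding work before exiting
```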
There's multiprocessing.pool, and the following sample illustrates how to use one of its pools:
from multiprocessing.pool import ThreadPool as Pool
# from multiprocessing import Pool
pool_size = 5 # your "parallelness"
# define worker function before a Pool is instantiated
def worker(item):
    try:
        api.my_operation(item)
    except:
        print('error with item')
pool = Pool(pool_size)
for item in items:
    pool.apply_async(worker, (item,))
pool.close()
pool.join()
Now if you indeed identify that your process is CPU bound, as @abarnert mentioned, change ThreadPool to the process pool implementation (commented out under the ThreadPool import). You can find more details here: http://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers
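With apply_async, return values are only reachable through the AsyncResult objects it returns, which the sample above does not keep. If the outcome of each item matters, imap_unordered hands results back as they finish. A sketch, assuming Python 3, with a hypothetical failure condition standing in for api.my_operation:

```python
from multiprocessing.pool import ThreadPool

def worker(item):
    # Return (item, error) so the caller can see what failed.
    try:
        if item < 0:                      # hypothetical failure condition
            raise ValueError("negative item")
        return item, None
    except Exception as exc:
        return item, exc

with ThreadPool(5) as pool:
    # list() consumes every result before the pool is torn down.
    outcomes = list(pool.imap_unordered(worker, [3, -1, 4, -2]))

bad = sorted(item for item, err in outcomes if err is not None)
print(bad)  # [-2, -1]
```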
You can split the processing into a specified number of threads using an approach like this:
import threading
def process(items, start, end):
    for item in items[start:end]:
        try:
            api.my_operation(item)
        except Exception:
            print('error with item')
def split_processing(items, num_splits=4):
    split_size = len(items) // num_splits
    threads = []
    for i in range(num_splits):
        # determine the indices of the list this thread will handle
        start = i * split_size
        # special case on the last chunk to account for uneven splits
        end = None if i+1 == num_splits else (i+1) * split_size
        # create the thread
        threads.append(
            threading.Thread(target=process, args=(items, start, end)))
        threads[-1].start() # start the thread we just created
    # wait for all threads to finish
    for t in threads:
        t.join()
split_processing(items)
import numpy as np
import threading
def threaded_process(items_chunk):
    """ Your main process which runs in thread for each chunk"""
    for item in items_chunk:
        try:
            api.my_operation(item)
        except Exception:
            print('error with item')
n_threads = 20
# Splitting the items into chunks equal to number of threads
array_chunk = np.array_split(items, n_threads)
thread_list = []
for thr in range(n_threads):
    # args must be a tuple, hence the trailing comma
    thread = threading.Thread(target=threaded_process, args=(array_chunk[thr],))
    thread_list.append(thread)
    thread_list[thr].start()
for thread in thread_list:
    thread.join()
I am writing my first Python 2.7 multiprocessing program (woohoo).
I am using multiprocessing Queues to retrieve data from my sub processes. My question is about the Queue's .get() method. Is there any guarantee that I will get the full object (no matter how large it is) when I call the method? If not, how will it be split?
The doc says: “Remove and return an item from the queue. […]”. But I am not sure if this means that I might end up getting chunks of an object or if it is rebuilt by the method's internals.
Here is some sample code: (stats might get pretty large)
p = Process(target=process_analyze_db, args=(db_names[j], j, queue_stats))
processes.append(p)
p.start()
while 1:
    running = any(p.is_alive() for p in processes)
    while not queue_stats.empty(): #Is this loop necessary?
        data = queue_stats.get_nowait()
        results[data[0]] = data[1]
    if not running:
        break
#In the process
def process_analyze_db (db_name, profile_nr, queue_stats):
    #Do lots of stuff
    queue_stats.put([profile_nr, stats])
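For what it is worth, multiprocessing.Queue pickles each object on put() and unpickles it on get(), so get() hands back whole objects, never fragments. Polling empty() is also unnecessary when the number of expected results is known: one blocking get() per worker is enough. A sketch under that assumption, written for Python 3; analyze and collect_results are hypothetical names, and the list of integers stands in for the real stats.

```python
from multiprocessing import Process, Queue

def analyze(profile_nr, result_queue):
    stats = list(range(100_000))          # stand-in for large stats
    result_queue.put([profile_nr, stats])

def collect_results(result_queue, expected):
    """Block until `expected` [key, value] pairs have arrived."""
    results = {}
    for _ in range(expected):
        key, value = result_queue.get()   # always one complete object
        results[key] = value
    return results

if __name__ == "__main__":
    q = Queue()
    procs = [Process(target=analyze, args=(n, q)) for n in range(3)]
    for p in procs:
        p.start()
    results = collect_results(q, len(procs))  # drain first ...
    for p in procs:
        p.join()                              # ... then join
    print(sorted(results))
```

Draining before joining matters with large objects, because a child cannot exit until the data it queued has been flushed to the pipe.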
I recently tried refactoring some parallel processes into a pool and was surprised that the pool took almost twice as long as pure processes. Please assume they are running on the same machine with the same number of cores. I hope that someone can explain why my implementation using Pool is taking longer and perhaps offer some advice:
Shared dependency:
https://github.com/taynaud/python-louvain
from community import best_partition
Here is the faster implementation using Process. [UPDATE] refactored to control the number of active processes same as the Pool implementation, still faster:
processes = []
pipes = []
def _get_partition(send_end):
    send_end.send(best_partition(a_graph, resolution=res, randomize=rand))
for idx in range(iterations):
    recv_end, send_end = Pipe(False)
    p = Process(target=_get_partition, args=(send_end,))
    processes.append(p)
    pipes.append(recv_end)
running_procs = []
finished_procs = []
while len(finished_procs) < iterations:
    while len(running_procs) < n_cores and len(processes):
        proc = processes.pop()
        proc.start()
        running_procs.append(proc)
    for idx, proc in enumerate(running_procs):
        if not proc.is_alive():
            finished_procs.append(running_procs.pop(idx))
for p in finished_procs:
    p.join()
partitions = [pipe.recv() for pipe in pipes]
And here is the slower, Pool implementation. This is still slower no matter how many processes the pool is given:
pool = Pool(processes=n_cores)
results = [
    pool.apply_async(
        best_partition,
        (a_graph,),
        dict(resolution=res, randomize=rand)
    ) for i in range(iterations)
]
partitions = [res.get() for res in results]
pool.close()
pool.join()
Usually when there is a difference between a pool and a bunch of processes (it can be to the benefit of either), it is your data set and the task performed that define the outcome.
Without knowing what your a_graph is, I make a wild guess it is something big. In your process model, you rely on the in-memory copy of this in your subprocesses. In your Pool model, you transmit a copy of a_graph as an argument to each worker every time one is called. This is in practice implemented as a queue. In your process model, your subprocess gets a copy of this at C level when Python interpreter calls fork(). This is much faster than transmitting a large Python object, dictionary, array or whatever it is, via a queue.
The reverse would be true if tasks took only a minuscule time to complete. In this case, Pool is the better performing solution, as Pool passes tasks to already running processes. Processes do not need to be recreated after each task. In this case, the overhead needed to create a lot of new processes that run only a fraction of a second, slows the process implementation down.
As I said, this is pure speculation, but in your examples there is a significant difference in what you actually transmit as a parameter to your workers, and that might be the explanation.
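If the speculation above is right, one way to keep the Pool while avoiding re-sending the large object with every task is to hand it to each worker once through the Pool initializer. A sketch only, assuming Python 3: big_graph and partition_size are hypothetical stand-ins for a_graph and best_partition.

```python
from multiprocessing import Pool

_graph = None  # per-process global, filled in by the initializer

def init_worker(graph):
    # Runs once in every worker process when the pool starts.
    global _graph
    _graph = graph

def partition_size(seed):
    # Stand-in for best_partition(a_graph, ...); only the small
    # `seed` argument travels with each task.
    return (len(_graph) + seed) % 7

if __name__ == "__main__":
    big_graph = {i: [i + 1] for i in range(100_000)}
    with Pool(processes=4, initializer=init_worker,
              initargs=(big_graph,)) as pool:
        print(pool.map(partition_size, range(8)))
```

The graph is transferred once per worker rather than once per task, which is closer to what the Process version gets from fork().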
I've got to generate a set of strings based on some calculations of other strings. This takes quite a while, and I'm working on a multiprocessor/multicore server so I figured that I could break these tasks down into chunks and pass them off to different processes.
Firstly I break the first list of strings down into chunks of 10000 each, send this off to a process which creates a new set, then try to obtain a lock and report these back to the master process. However, my master process's set is empty!
Here's some code:
def build_feature_labels(self,strings,return_obj,l):
    feature_labels = set()
    for s in strings:
        feature_labels = feature_labels.union(s.get_feature_labels())
    print "method: ", len(feature_labels)
    l.acquire()
    return_obj.return_feature_labels(feature_labels)
    l.release()
    print "Thread Done"
def return_feature_labels(self,labs):
    self.feature_labels = self.feature_labels.union(labs)
    print "length self", len(self.feature_labels)
    print "length labs", len(labs)
current_pos = 0
lock = multiprocessing.Lock()
while current_pos < len(orig_strings):
    while len(multiprocessing.active_children()) > threads:
        print "WHILE: cpu count", str(multiprocessing.cpu_count())
        T.sleep(30)
    print "number of processes", str(len(multiprocessing.active_children()))
    proc = multiprocessing.Process(target=self.build_feature_labels,args=(orig_strings[current_pos:current_pos+self.MAX_ITEMS],self,lock))
    proc.start()
    current_pos = current_pos + self.MAX_ITEMS
while len(multiprocessing.active_children()) > 0:
    T.sleep(3)
print len(self.feature_labels)
What is strange is that self.feature_labels on the master process is empty, but when it is printed from each sub-process it has items. I think I'm taking the wrong approach here (it's how I used to do it in Java!). Is there a better approach?
Thanks in advance.
Consider using a pool of workers: http://docs.python.org/dev/library/multiprocessing.html#using-a-pool-of-workers. This does a lot of the work for you in a map-reduce style and returns the assembled results.
Use a multiprocessing.Pipe, or Queue (or other such object) to pass data between processes. Use a Pipe to pass data between two processes, and a Queue to allow multiple producers and consumers.
Along with the official documentation, there are nice examples to be found in Doug Hellmann's multiprocessing tutorial. In particular, it has an example of how to use multiprocessing.Pool to implement a mapreduce-type operation. It might suit your purposes very well.
Why it didn't work: multiprocessing uses processes, and process memory isn't shared. Multiprocessing can set up shared memory or pipes for IPC, but it must be done explicitly. This is how the various suggestions send data back to the master.
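Put together, the pool-of-workers suggestion could look roughly like this: each worker returns its partial set, and the parent merges them, so no lock or shared object is needed. This is a sketch assuming Python 3; labels_for_chunk, chunks and build_labels are hypothetical names, and "the characters of each string" stands in for s.get_feature_labels().

```python
from multiprocessing import Pool

def labels_for_chunk(strings):
    # Stand-in for s.get_feature_labels(): here, the set of characters.
    labels = set()
    for s in strings:
        labels.update(s)
    return labels          # pickled and sent back to the parent

def chunks(seq, size):
    # Split seq into consecutive slices of at most `size` items.
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def build_labels(strings, chunk_size=10000):
    with Pool() as pool:
        partial = pool.map(labels_for_chunk, chunks(strings, chunk_size))
    return set().union(*partial)   # merge in the parent process

if __name__ == "__main__":
    print(sorted(build_labels(["abc", "bcd", "xyz"], chunk_size=2)))
```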