Threads is not executing in parallel python with ThreadPoolExecutor - python

I'm new in python threading and I'm experimenting this:
When I run something in threads (whenever I print outputs), it never seems to be running in parallel. Also, my functions take the same time that before using the library concurrent.futures (ThreadPoolExecutor).
I have to calculate the gains of some attributes over a dataset (I cannot use libraries). Since I have about 1024 attributes and the function was taking about a minute to execute (and I have to use it in a for iteration) I dicided to split the array of attributes into 10 (just as an example) and run the separete function gain(attribute) separetly for each sub array. So I did the following (avoiding some extra unnecessary code):
def calculate_gains(self):
splited_attributes = np.array_split(self.attributes, 10)
result = {}
for atts in splited_attributes:
with concurrent.futures.ThreadPoolExecutor() as executor:
future = executor.submit(self.calculate_gains_helper, atts)
return_value = future.result()
self.gains = {**self.gains, **return_value}
Here's the calculate_gains_helper:
def calculate_gains_helper(self, attributes):
inter_result = {}
for attribute in attributes:
inter_result[attribute] = self.gain(attribute)
return inter_result
Am I doing something wrong? I read some other older posts but I couldn't get any info.
Thanks a lot for any help!

Python threads do not run in parallel (at least in CPython implementation) because of the GIL. Use processes and ProcessPoolExecutor to really have parallelism
with concurrent.futures.ProcessPoolExecutor() as executor:
...

You submit and then wait for each work item serially so all the threads do is slow everything down. I can't guarantee this will speed things up much because you are still dealing with the python GIL that keeps python level stuff from working in parallel, but here goes.
I've created a thread pool and pushed everything possible into the worker, including the slicing of self.attributes.
def calculate_gains(self):
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
result_list = executor.map(self.calculate_gains_helper,
((i, i+10) for i in range(0, len(self.attributes), 10)))
for return_value in result_list:
self.gains = {**self.gains, **return_value}
def calculate_gains_helper(self, start_end):
start, end = start_end
inter_result = {}
for attribute in self.attributes[start:end]:
inter_result[attribute] = self.gain(attribute)
return inter_result

Related

Multiprocessing is not executing parallel in Python

I have edited the code , currently it is working fine . But thinks it is not executing parallely or dynamically . Can anyone please check on to it
Code :
def folderStatistic(t):
j, dir_name = t
row = []
for content in dir_name.split(","):
row.append(content)
print(row)
def get_directories():
import csv
with open('CONFIG.csv', 'r') as file:
reader = csv.reader(file,delimiter = '\t')
return [col for row in reader for col in row]
def folderstatsMain():
freeze_support()
start = time.time()
pool = Pool()
worker = partial(folderStatistic)
pool.map(worker, enumerate(get_directories()))
def datatobechecked():
try:
folderstatsMain()
except Exception as e:
# pass
print(e)
if __name__ == '__main__':
datatobechecked()
Config.CSV
C:\USERS, .CSV
C:\WINDOWS , .PDF
etc.
There may be around 200 folder paths in config.csv
welcome to StackOverflow and Python programming world!
Moving on to the question.
Inside the get_directories() function you open the file in with context, get the reader object and close the file immediately after the moment you leave the context so when the time comes to use the reader object the file is already closed.
I don't want to discourage you, but if you are very new to programming do not dive into parallel programing yet. Difficulty in handling multiple threads simultaneously grows exponentially with every thread you add (pools greatly simplify this process though). Processes are even worse as they don't share memory and can't communicate with each other easily.
My advice is, try to write it as a single-thread program first. If you have it working and still need to parallelize it, isolate a single function with input file path as a parameter that does all the work and then use thread/process pool on that function.
EDIT:
From what I can understand from your code, you get directory names from the CSV file and then for each "cell" in the file you run parallel folderStatistics. This part seems correct. The problem may lay in dir_name.split(","), notice that you pass individual "cells" to the folderStatistics not rows. What makes you think it's not running paralelly?.
There is a certain amount of overhead in creating a multiprocessing pool because creating processes is, unlike creating threads, a fairly costly operation. Then those submitted tasks, represented by each element of the iterable being passed to the map method, are gathered up in "chunks" and written to a multiprocessing queue of tasks that are read by the pool processes. This data has to move from one address space to another and that has a cost associated with it. Finally when your worker function, folderStatistic, returns its result (which is None in this case), that data has to be moved from one process's address space back to the main process's address space and that too has a cost associated with it.
All of those added costs become worthwhile when your worker function is sufficiently CPU-intensive such that these additional costs is small compared to the savings gained by having the tasks run in parallel. But your worker function's CPU requirements are so small as to reap any benefit from multiprocessing.
Here is a demo comparing single-processing time vs. multiprocessing times for invoking a worker function, fn, twice where the first time it only performs its internal loop 10 times (low CPU requirements) while the second time it performs its internal loop 1,000,000 times (higher CPU requirements). You can see that in the first case the multiprocessing version runs considerable slower (you can't even measure the time for the single processing run). But when we make fn more CPU-intensive, then multiprocessing achieves gains over the single-processing case.
from multiprocessing import Pool
from functools import partial
import time
def fn(iterations, x):
the_sum = x
for _ in range(iterations):
the_sum += x
return the_sum
# required for Windows:
if __name__ == '__main__':
for n_iterations in (10, 1_000_000):
# single processing time:
t1 = time.time()
for x in range(1, 20):
fn(n_iterations, x)
t2 = time.time()
# multiprocessing time:
worker = partial(fn, n_iterations)
t3 = time.time()
with Pool() as p:
results = p.map(worker, range(1, 20))
t4 = time.time()
print(f'#iterations = {n_iterations}, single processing time = {t2 - t1}, multiprocessing time = {t4 - t3}')
Prints:
#iterations = 10, single processing time = 0.0, multiprocessing time = 0.35399389266967773
#iterations = 1000000, single processing time = 1.182999849319458, multiprocessing time = 0.5530076026916504
But even with a pool size of 8, the running time is not reduced by a factor of 8 (it's more like a factor of 2) due to the fixed multiprocessing overhead. When I change the number of iterations for the second case to be 100,000,000 (even more CPU-intensive), we get ...
#iterations = 100000000, single processing time = 109.3077495098114, multiprocessing time = 27.202054023742676
... which is a reduction in running time by a factor of 4 (I have many other processes running in my computer, so there is competition for the CPU).

Write python with joblib in parallel in the list

I use joblib to work in parallel, I want to write the results in parallel in a list.
So as to avoid problems, I create an ldata = [] list beforehand, so that it can be easily accessed.
During parallelization, the data are available in the list, but no longer when they are put together.
How can data be saved in parallel?
from joblib import Parallel, delayed
import multiprocessing
data = []
def worker(i):
ldata = []
... # create list ldata
data[i].append(ldata)
for i in range(0, 1000):
data.append([])
num_cores = multiprocessing.cpu_count()
Parallel(n_jobs=num_cores)(delayed(worker)(i) for i in range(0, 1000))
resultlist = []
for i in range(0, 1000):
resultlist.extend(data[i])
You should look at Parallel as a parallel map operation that does not allow for side effects. The execution model of Parallel is that it by default starts new worker copies of the master processes, serialises the input data, sends it over to the workers, have them iterate over it, then collects the return values. Any change a worker performs on data stays in its own memory space and is thus invisible to the master process. You have two options here:
First, you can have your workers return ldata instead of updating data[i]. In that case, data will have to be assigned the result returned by Parallel(...)(...):
def worker(i):
...
return ldata
data = Parallel(n_jobs=num_cores)(delayed(worker)(i) for i in range(0, 1000))
Second option is to force a shared memory semantics that uses threads instead of processes. When works execute in threads, their memory space is that of the master process, which is where data lies originally. To enforce this semantics, add require='sharedmem' keyword argument in the call to Parallel:
Parallel(n_jobs=num_cores, require='sharedmem')(delayed(worker)(i) for i in range(0, 1000))
The different modes and semantics are explained in the joblib documentation here.
Keep in mind that your worker() function is written in pure Python and is therefore interpreted. This means that worker threads can't run fully concurrently even if there is just one thread per CPU due to the dreaded Global Interpreter Lock (GIL). This is also explained in the documentation. Therefore, you'd better stick with the first solution in general, despite the marshalling and interprocess communication overheads.

How to parallelize "for" loops? [duplicate]

Say I have a very large list and I'm performing an operation like so:
for item in items:
try:
api.my_operation(item)
except:
print 'error with item'
My issue is two fold:
There are a lot of items
api.my_operation takes forever to return
I'd like to use multi-threading to spin up a bunch of api.my_operations at once so I can process maybe 5 or 10 or even 100 items at once.
If my_operation() returns an exception (because maybe I already processed that item) - that's OK. It won't break anything. The loop can continue to the next item.
Note: this is for Python 2.7.3
First, in Python, if your code is CPU-bound, multithreading won't help, because only one thread can hold the Global Interpreter Lock, and therefore run Python code, at a time. So, you need to use processes, not threads.
This is not true if your operation "takes forever to return" because it's IO-bound—that is, waiting on the network or disk copies or the like. I'll come back to that later.
Next, the way to process 5 or 10 or 100 items at once is to create a pool of 5 or 10 or 100 workers, and put the items into a queue that the workers service. Fortunately, the stdlib multiprocessing and concurrent.futures libraries both wraps up most of the details for you.
The former is more powerful and flexible for traditional programming; the latter is simpler if you need to compose future-waiting; for trivial cases, it really doesn't matter which you choose. (In this case, the most obvious implementation with each takes 3 lines with futures, 4 lines with multiprocessing.)
If you're using 2.6-2.7 or 3.0-3.1, futures isn't built in, but you can install it from PyPI (pip install futures).
Finally, it's usually a lot simpler to parallelize things if you can turn the entire loop iteration into a function call (something you could, e.g., pass to map), so let's do that first:
def try_my_operation(item):
try:
api.my_operation(item)
except:
print('error with item')
Putting it all together:
executor = concurrent.futures.ProcessPoolExecutor(10)
futures = [executor.submit(try_my_operation, item) for item in items]
concurrent.futures.wait(futures)
If you have lots of relatively small jobs, the overhead of multiprocessing might swamp the gains. The way to solve that is to batch up the work into larger jobs. For example (using grouper from the itertools recipes, which you can copy and paste into your code, or get from the more-itertools project on PyPI):
def try_multiple_operations(items):
for item in items:
try:
api.my_operation(item)
except:
print('error with item')
executor = concurrent.futures.ProcessPoolExecutor(10)
futures = [executor.submit(try_multiple_operations, group)
for group in grouper(5, items)]
concurrent.futures.wait(futures)
Finally, what if your code is IO bound? Then threads are just as good as processes, and with less overhead (and fewer limitations, but those limitations usually won't affect you in cases like this). Sometimes that "less overhead" is enough to mean you don't need batching with threads, but you do with processes, which is a nice win.
So, how do you use threads instead of processes? Just change ProcessPoolExecutor to ThreadPoolExecutor.
If you're not sure whether your code is CPU-bound or IO-bound, just try it both ways.
Can I do this for multiple functions in my python script? For example, if I had another for loop elsewhere in the code that I wanted to parallelize. Is it possible to do two multi threaded functions in the same script?
Yes. In fact, there are two different ways to do it.
First, you can share the same (thread or process) executor and use it from multiple places with no problem. The whole point of tasks and futures is that they're self-contained; you don't care where they run, just that you queue them up and eventually get the answer back.
Alternatively, you can have two executors in the same program with no problem. This has a performance cost—if you're using both executors at the same time, you'll end up trying to run (for example) 16 busy threads on 8 cores, which means there's going to be some context switching. But sometimes it's worth doing because, say, the two executors are rarely busy at the same time, and it makes your code a lot simpler. Or maybe one executor is running very large tasks that can take a while to complete, and the other is running very small tasks that need to complete as quickly as possible, because responsiveness is more important than throughput for part of your program.
If you don't know which is appropriate for your program, usually it's the first.
There's multiprocesing.pool, and the following sample illustrates how to use one of them:
from multiprocessing.pool import ThreadPool as Pool
# from multiprocessing import Pool
pool_size = 5 # your "parallelness"
# define worker function before a Pool is instantiated
def worker(item):
try:
api.my_operation(item)
except:
print('error with item')
pool = Pool(pool_size)
for item in items:
pool.apply_async(worker, (item,))
pool.close()
pool.join()
Now if you indeed identify that your process is CPU bound as #abarnert mentioned, change ThreadPool to the process pool implementation (commented under ThreadPool import). You can find more details here: http://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers
You can split the processing into a specified number of threads using an approach like this:
import threading
def process(items, start, end):
for item in items[start:end]:
try:
api.my_operation(item)
except Exception:
print('error with item')
def split_processing(items, num_splits=4):
split_size = len(items) // num_splits
threads = []
for i in range(num_splits):
# determine the indices of the list this thread will handle
start = i * split_size
# special case on the last chunk to account for uneven splits
end = None if i+1 == num_splits else (i+1) * split_size
# create the thread
threads.append(
threading.Thread(target=process, args=(items, start, end)))
threads[-1].start() # start the thread we just created
# wait for all threads to finish
for t in threads:
t.join()
split_processing(items)
import numpy as np
import threading
def threaded_process(items_chunk):
""" Your main process which runs in thread for each chunk"""
for item in items_chunk:
try:
api.my_operation(item)
except Exception:
print('error with item')
n_threads = 20
# Splitting the items into chunks equal to number of threads
array_chunk = np.array_split(input_image_list, n_threads)
thread_list = []
for thr in range(n_threads):
thread = threading.Thread(target=threaded_process, args=(array_chunk[thr]),)
thread_list.append(thread)
thread_list[thr].start()
for thread in thread_list:
thread.join()

multiprocessing is slower than thread in python

I have tested a multiprocess and thread in python, but multiprocess is slower than thread, and I calculate a distance using editdistance, my code like:
def calc_dist(kw, trie_word):
dists = []
while len(trie_word) != 0:
w = trie_word.pop()
dist = editdistance.eval(kw, w)
dists.append((w, dist))
return dists
if __name__ == "__main__":
word_list = [str(i) for i in range(1, 10000001)]
key_word = '2'
print("calc")
s = time.time()
with Pool(processes=4) as pool:
result = pool.apply_async(calc_dist, (key_word, word_list))
print(len(result.get()))
print("用时",time.time()-s)
Using threading:
class DistThread(threading.Thread):
def __init__(self, func, args):
super(DistThread, self).__init__()
self.func = func
self.args = args
self.dists = None
def run(self):
self.dists = self.func(*self.args)
def join(self):
super().join(self)
return self.dists
In my computer, it consumes about 118s, but thread takes about 36s, where is wrong with it?
a couple of issues:
a significant amount of time will be spent serialising the data so it can be sent to the other process while threads share the same address space so pointers can be used
your current code is only using one process to do all the calcs with multiprocessing. you need to seperate your array into "chunks" somehow so that it can be processed via multiple workers
e.g:
import time
from multiprocessing import Pool
import editdistance
def calc_one(trie_word):
return editdistance.eval(key_word, trie_word)
if __name__ == "__main__":
word_list = [str(i) for i in range(1, 10000001)]
key_word = '2'
print("calc")
s = time.time()
with Pool(processes=4) as pool:
result = pool.map(calc_one, word_list, chunksize=10000)
print(len(result))
print("time",time.time()-s)
s = time.time()
result = list(calc_one(w) for w in word_list)
print(len(result))
print("time",time.time()-s)
this relies on key_word being a global variable. for me, the version using multiple processes takes ~5.3 seconds while the second version takes ~16.9 secs. not 4 times as quick as the data still needs to be sent back and forth, but pretty good
I had a similar experience with threading and multi processing inside Python to consume CSVS that had a large amount of data. I had a small look into this and found that processing spawns multiple processes to perform tasks which can be slower than just running one threaded process since threading runs in one place. There is a more definitive answer here: Multiprocessing vs Threading Python.
Pasting answer from link incase link disappears;
The threading module uses threads, the multiprocessing module uses processes. The difference is that threads run in the same memory space, while processes have separate memory. This makes it a bit harder to share objects between processes with multiprocessing. Since threads use the same memory, precautions have to be taken or two threads will write to the same memory at the same time. This is what the global interpreter lock is for.
Spawning processes is a bit slower than spawning threads. Once they are running, there is not much difference.

Multiprocessing with python3 only runs once

I have a problem running multiple processes in python3 .
My program does the following:
1. Takes entries from an sqllite database and passes them to an input_queue
2. Create multiple processes that take items off the input_queue, run it through a function and output the result to the output queue.
3. Create a thread that takes items off the output_queue and prints them (This thread is obviously started before the first 2 steps)
My problem is that currently the 'function' in step 2 is only run as many times as the number of processes set, so for example if you set the number of processes to 8, it only runs 8 times then stops. I assumed it would keep running until it took all items off the input_queue.
Do I need to rewrite the function that takes the entries out of the database (step 1) into another process and then pass its output queue as an input queue for step 2?
Edit:
Here is an example of the code, I used a list of numbers as a substitute for the database entries as it still performs the same way. I have 300 items on the list and I would like it to process all 300 items, but at the moment it just processes 10 (the number of processes I have assigned)
#!/usr/bin/python3
from multiprocessing import Process,Queue
import multiprocessing
from threading import Thread
## This is the class that would be passed to the multi_processing function
class Processor:
def __init__(self,out_queue):
self.out_queue = out_queue
def __call__(self,in_queue):
data_entry = in_queue.get()
result = data_entry*2
self.out_queue.put(result)
#Performs the multiprocessing
def perform_distributed_processing(dbList,threads,processor_factory,output_queue):
input_queue = Queue()
# Create the Data processors.
for i in range(threads):
processor = processor_factory(output_queue)
data_proc = Process(target = processor,
args = (input_queue,))
data_proc.start()
# Push entries to the queue.
for entry in dbList:
input_queue.put(entry)
# Push stop markers to the queue, one for each thread.
for i in range(threads):
input_queue.put(None)
data_proc.join()
output_queue.put(None)
if __name__ == '__main__':
output_results = Queue()
def output_results_reader(queue):
while True:
item = queue.get()
if item is None:
break
print(item)
# Establish results collecting thread.
results_process = Thread(target = output_results_reader,args = (output_results,))
results_process.start()
# Use this as a substitute for the database in the example
dbList = [i for i in range(300)]
# Perform multi processing
perform_distributed_processing(dbList,10,Processor,output_results)
# Wait for it all to finish.
results_process.join()
A collection of processes that service an input queue and write to an output queue is pretty much the definition of a process pool.
If you want to know how to build one from scratch, the best way to learn is to look at the source code for multiprocessing.Pool, which is pretty simply Python, and very nicely written. But, as you might expect, you can just use multiprocessing.Pool instead of re-implementing it. The examples in the docs are very nice.
But really, you could make this even simpler by using an executor instead of a pool. It's hard to explain the difference (again, read the docs for both modules), but basically, a future is a "smart" result object, which means instead of a pool with a variety of different ways to run jobs and get results, you just need a dumb thing that doesn't know how to do anything but return futures. (Of course in the most trivial cases, the code looks almost identical either way…)
from concurrent.futures import ProcessPoolExecutor
def Processor(data_entry):
return data_entry*2
def perform_distributed_processing(dbList, threads, processor_factory):
with ProcessPoolExecutor(processes=threads) as executor:
yield from executor.map(processor_factory, dbList)
if __name__ == '__main__':
# Use this as a substitute for the database in the example
dbList = [i for i in range(300)]
for result in perform_distributed_processing(dbList, 8, Processor):
print(result)
Or, if you want to handle them as they come instead of in order:
def perform_distributed_processing(dbList, threads, processor_factory):
with ProcessPoolExecutor(processes=threads) as executor:
fs = (executor.submit(processor_factory, db) for db in dbList)
yield from map(Future.result, as_completed(fs))
Notice that I also replaced your in-process queue and thread, because it wasn't doing anything but providing a way to interleave "wait for the next result" and "process the most recent result", and yield (or yield from, in this case) does that without all the complexity, overhead, and potential for getting things wrong.
Don't try to rewrite the whole multiprocessing library again. I think you can use any of multiprocessing.Pool methods depending on your needs - if this is a batch job you can even use the synchronous multiprocessing.Pool.map() - only instead of pushing to input queue, you need to write a generator that yields input to the threads.

Categories

Resources