low efficiency python parallel/multiprocessing with image convolution - python

In each task, I have ~500 images to convolve as the first step, and it seems the filters under ndimage.filters use only one core. I have tried multiprocessing.Pool and multiprocessing.Process with multiprocessing.Queue. Both worked, but ran much slower than a single process. The cause is very likely pickling and other overhead: if I generated fake data inside each worker instead of passing the real data to each worker, multiprocessing did boost performance by a lot.
I am running Spyder on a Windows machine and will pass the code to someone else on a different machine, so recompiling Python or any low-level tweaks are not an option.
In MATLAB, convolution uses multiple cores transparently, and there is parfor, which handles the overhead decently. Any idea or suggestion for realizing multiprocess convolution in Python? Many thanks in advance!

It seems that multiprocessing is a poor choice when the task is numpy/scipy-heavy. I should have used multithreading instead of multiprocessing: most numpy/scipy functions release the GIL, so multithreading outperforms multiprocessing thanks to its much lighter overhead.
Most importantly, multithreading is faster than a single thread.
import queue  # this module was named 'Queue' on Python 2
import threading
import time

import numpy as np
from scipy import ndimage

repeats = 24

def smooth_img(q, im):
    # gaussian_laplace runs in compiled code and releases the GIL,
    # so worker threads can genuinely run in parallel
    im_stack_tmp = ndimage.filters.gaussian_laplace(im, 2.)
    q.put(im_stack_tmp)

im_all = [None] * repeats
im_all_filtered = [None] * repeats
for j in range(repeats):
    im_all[j] = np.random.randn(2048, 2048)

# baseline: filter every image sequentially
start = time.time()
for j in range(repeats):
    im_all_filtered[j] = ndimage.filters.gaussian_laplace(im_all[j], 2.)
print('single thread: ' + str(time.time() - start))

# threaded version: one worker thread per image, results collected via a queue
start = time.time()
q = queue.Queue()
for im in im_all:
    t = threading.Thread(target=smooth_img, args=(q, im))
    t.daemon = True
    t.start()
for j in range(repeats):
    im_all_filtered[j] = q.get()
print('multi thread: ' + str(time.time() - start))
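For reference, the same comparison can be written more compactly with concurrent.futures. The following is only a sketch under the same assumptions (im_all as above, a worker count chosen arbitrarily), not part of the original answer; executor.map also keeps the results in input order:
from concurrent.futures import ThreadPoolExecutor

from scipy import ndimage

def smooth(im):
    # the filter releases the GIL while it runs, so threads can use several cores
    return ndimage.gaussian_laplace(im, 2.)

# 8 worker threads is an arbitrary choice for illustration
with ThreadPoolExecutor(max_workers=8) as ex:
    im_all_filtered = list(ex.map(smooth, im_all))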

Related

Dask delayed performance issues

I'm relatively new to Dask. I'm trying to parallelize a "custom" function that doesn't use Dask containers; I would just like to speed up the computation. But when I try parallelizing with dask.delayed, it performs significantly worse than the serial version. Here is a minimal implementation demonstrating the issue (the code I actually want to do this with is significantly more involved :) )
import dask
import time

def mysum(rng):
    # CPU intensive
    z = 0
    for i in rng:
        z += i
    return z

# serial
b = time.time(); zz = mysum(range(1, 1_000_000_000)); t = time.time() - b
print(f'time to run in serial {t}')

# parallel
ms_parallel = dask.delayed(mysum)
ss = []
ncores = 10
m = 100_000_000
for i in range(ncores):
    lower = m * i
    upper = (i + 1) * m
    r = range(lower, upper)
    s = ms_parallel(r)
    ss.append(s)

j = dask.delayed(ss)
b = time.time(); yy = j.compute(); t = time.time() - b
print(f'time to run in parallel {t}')
Typical results are:
time to run in serial 55.682398080825806
time to run in parallel 135.2043571472168
It seems I'm missing something basic here.
You are running a pure CPU-bound computation in threads by default. Because of Python's Global Interpreter Lock (GIL), only one thread actually runs at a time. In short, you are only adding overhead to your original compute, due to thread switching and task execution.
To actually get faster for this workload, you should use dask.distributed. Just adding
import dask.distributed
client = dask.distributed.Client(threads_per_worker=1)
at the start of your script may well give you a decent speed-up, since this launches a number of worker processes, each with its own GIL. This scheduler becomes the default one simply by creating it.
EDIT: ignore the following, I see you are already doing it :). Leaving it here for others, unless people want it gone... The second problem, for dask, is the sheer number of tasks. For any task execution system there is an overhead associated with each task (and it is actually higher for distributed than for the default threads scheduler). You can get around it by computing batches of function calls per task. This is, in practice, what dask.array and dask.dataframe do: they operate on largeish pieces of the overall problem, so that the overhead becomes small compared to the useful CPU execution time.
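For concreteness, here is a sketch of the two process-based options applied to the question's ss list of delayed partial sums. It assumes dask.distributed is installed; the scheduler="processes" variant is an alternative added here for illustration, not something from the original answer:
import dask
import dask.distributed

# Option 1: a local distributed "cluster"; one single-threaded worker per process.
# Creating the Client makes it the default scheduler for subsequent .compute() calls.
client = dask.distributed.Client(threads_per_worker=1)
yy = dask.delayed(ss).compute()

# Option 2: dask's built-in multiprocessing scheduler, no Client needed.
yy = dask.delayed(ss).compute(scheduler="processes")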

numpy.random.normal() in multithreaded environment

I was trying to parallelize the generation of normally-distributed random arrays with the numpy.random.normal() function but it looks like the calls from the threads are executed sequentially instead.
import time
import threading

import numpy as np

start = time.time()
active_threads = 0
for i in range(100000):
    t = threading.Thread(target=lambda x: np.random.normal(0, 2, 4000), args=[0])
    t.start()
    while active_threads >= 12:
        time.sleep(0.1)
        continue
end = time.time()
print(str(end - start))
If I measure the time for a 1-thread run I get the same result as the 12-thread one.
I know this kind of parallelization suffers from a lot of overhead, but even then it should still shave some time off in the multithreaded version.
np.random.normal uses shared state internally: it draws from the module-level global generator, which is shared between threads, so it is not safe to call it from multiple threads (due to possible race conditions). In fact, the documentation provides examples for such a case (see here and there). Alternatively, you can use multiple processes (you need to configure the seed to get different results in different processes). Another solution is to use custom random number generators (RNG), so that each thread uses a different RNG object, as in the sketch below.
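A minimal sketch of that per-thread RNG approach (the worker function, seed, and thread count are illustrative, not from the question): each thread gets its own Generator spawned from a single SeedSequence, so no state is shared between threads:
import threading

import numpy as np

n_threads = 12
# spawn independent, non-overlapping child seeds, one per thread
seeds = np.random.SeedSequence(12345).spawn(n_threads)
rngs = [np.random.default_rng(s) for s in seeds]
results = [None] * n_threads

def worker(idx):
    # each thread touches only its own Generator, so there is no shared state
    results[idx] = rngs[idx].normal(0, 2, 4000)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()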

Parallelize a simple loop in Python and get results with concurrent.futures

Let's suppose I have a complicated function I want to run on a list:
import concurrent.futures
import random
import numpy as np
inputData = random.sample(range(60000, 1000000), 50)
def complicated_function(x):
    """complicated stuff"""
    return x**x
I can process it directly in a loop this way (this is how I want the results at the end of my code, in the same order if possible):
#without parallelization
processedData = [complicated_function(x) for x in inputData]
For parallel processing I looked at tutorials and came up with this code:
#with parallelization
processedData2 = []
with concurrent.futures.ThreadPoolExecutor() as e:
    fut = [e.submit(complicated_function, inputData[i]) for i in range(len(inputData))]
    for r in concurrent.futures.as_completed(fut):
        processedData2.append(r.result())
The problem is that, watching my system monitor, there is just one core working while this runs... so obviously this is not working as I need...
Thanks a lot in advance for your help!
It is because you are threading, and Python's global interpreter lock (GIL) means the tasks don't actually run in parallel across multiple cores when you use threads.
Instead you can use multiprocessing. Just like ThreadPoolExecutor, there is a ProcessPoolExecutor, which will make sure multiple cores are utilised, as in the sketch below.
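A minimal sketch of that swap, reusing the question's complicated_function and inputData; executor.map returns results in input order, which is what the question asks for (the __main__ guard matters when using processes, especially on Windows):
import concurrent.futures
import random

inputData = random.sample(range(60000, 1000000), 50)

def complicated_function(x):
    """complicated stuff"""
    return x**x

if __name__ == '__main__':
    # ProcessPoolExecutor runs workers in separate processes, each with its own
    # interpreter and GIL, so all cores can be used for CPU-bound work.
    with concurrent.futures.ProcessPoolExecutor() as e:
        # map preserves the order of inputData in the results
        processedData2 = list(e.map(complicated_function, inputData))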

Python Multiprocessing pool.map unresponsive with too many worker processes

First question on Stack Overflow, so please bear with me. I am looking to calculate the variance of group ratings (long numpy arrays). Running the program without parallel processing works fine, but given that each group can be processed independently and there are 32 groups, I am looking to use multiprocessing to speed things up. This works OK for small numbers of groups (< 10), but beyond that the program often just seems to stop running, with no error messages, at an unspecified number of groups (usually between 20 and 30), although less frequently it will run all the way through. The arrays are quite large (21451 x 11462 user-item ratings), so I am wondering if the problem is caused by running out of memory, although no error messages are printed.
import numpy as np
from functools import partial
import multiprocessing

def variance_parallel(extra_matrices, group_num):
    # do some variation calculation
    # print confirmation that we have entered function, and group number
    return single_group_var

def variance(extra_matrices, num_groups):
    variance_partial = partial(variance_parallel, extra_matrices)
    for g in list(range(num_groups)):
        group_var = pool.map(variance_partial, range(g))
    return group_var

num_cores = multiprocessing.cpu_count() - 1
pool = multiprocessing.Pool(processes=num_cores)
variance(extra_matrices, num_groups)
Running the above code shows the program progressively building the number of groups it is checking variance on ([0],[0,1],[0,1,2],...) before eventually printing nothing.
Thanks in advance for any help and apologies if my formatting / question is a bit off!
Multiple processes do not share data; data sent to a process has to be copied.
Since the arrays are large, the issue is very likely this copying of large arrays to the processes. Furthermore, in Python's multiprocessing, sending data to processes is done by serialisation, which is (a) CPU intensive and (b) takes extra memory in and of itself.
In short, multiprocessing is not a good fit for your use case. Since numpy is a native code extension (where the GIL does not apply) and is thread safe, it is best to use threading instead of multiprocessing. With threading, the worker threads can share data via their parent process's address space, which does away with having to copy it.
That should stop the program from running out of memory.
However, for threads to share address space the data they share needs to be bound to an object, like in a python class.
Something like the below - untested as the code sample is incomplete.
import numpy as np
from threading import Thread
from multiprocessing import cpu_count

class Variance(Thread):
    def __init__(self, extra_matrices, group_num):
        Thread.__init__(self)
        self.extra_matrices = extra_matrices
        self.group_num = group_num
        self.output = None

    def run(self):
        # do some variation calculation
        # print confirmation that we have entered function, and group number
        self.output = single_group_var

num_cores = cpu_count() - 1
results = []
for g in list(range(num_groups)):
    workers = [Variance(extra_matrices, range(g))
               for _ in range(num_cores)]
    # Start threads
    for worker in workers:
        worker.start()
    # Wait for completion
    for worker in workers:
        worker.join()
    results.extend([w.output for w in workers])
print(results)

Threadpool in python is not as fast as expected

I'm a beginner in Python and machine learning. I'm trying to reproduce the code for countvectorizer() using multi-threading. I'm working with the Yelp dataset to do sentiment analysis using LogisticRegression. This is what I've written so far:
Code snippet:
from multiprocessing.dummy import Pool as ThreadPool
from threading import Thread, current_thread
from functools import partial

data = df['text']
rev = df['stars']
y = []

def product_helper(args):
    return featureExtraction(*args)

def featureExtraction(p, t):
    temp = [0] * len(bag_of_words)
    for word in p.split():
        if word in bag_of_words:
            temp[bag_of_words.index(word)] += 1
    return temp

# function to be mapped over
def calculateParallel(threads):
    pool = ThreadPool(threads)
    job_args = [(item_a, rev[i]) for i, item_a in enumerate(data)]
    l = pool.map(product_helper, job_args)
    pool.close()
    pool.join()
    return l

temp_X = calculateParallel(12)
This is just part of the code.
Explanation:
df['text'] has all the reviews and df['stars'] has the ratings (1 through 5). I'm trying to find the word count vector temp_X using multi-threading. bag_of_words is a list of some frequent words of choice.
Question:
Without multi-threading, I was able to compute temp_X in around 24 minutes, whereas the above code took 33 minutes for a dataset of 100k reviews. My machine has 128GB of DRAM and 12 cores (6 physical cores with hyperthreading, i.e. 2 threads per core).
What am I doing wrong here?
Your whole code seems CPU-bound rather than IO-bound. You are just using threads, which are under the GIL, so effectively you are running just one thread plus overheads, and it runs on only one core. To run on multiple cores, use multiprocessing:
import multiprocessing

pool = multiprocessing.Pool()
# map_async returns an AsyncResult; .get() blocks until the list of results is ready
l = pool.map_async(product_helper, job_args).get()
from multiprocessing.dummy import Pool as ThreadPool is just a wrapper over the threading module. It utilises just one core and not more than that.
Python and threads don't really work well together for CPU-bound work. There is a known issue called the GIL (global interpreter lock). Basically there is a lock in the interpreter that prevents threads from running in parallel (even if you have multiple CPU cores). Python simply gives each thread a few milliseconds of CPU time, one after another (and the reason it became slower is the overhead from context switching between those threads).
Here is a really good document explaining how it works: http://www.dabeaz.com/python/UnderstandingGIL.pdf
To fix your problem I suggest you try multiprocessing:
https://pymotw.com/2/multiprocessing/basics.html
Note: multiprocessing is not 100% equivalent to multithreading. Multiprocessing runs in parallel, but the different processes won't share memory, so if you change a variable in one of them it will not be changed in the other process.
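Because the processes do not share memory, every task's arguments get pickled and sent to a worker. One common way to avoid re-sending bag_of_words with every task is a pool initializer; this is only a sketch with placeholder data and names, not code from either answer:
import multiprocessing

def init_worker(words):
    # runs once in each worker process; stores the word list as a module-level
    # global so it is not re-pickled with every task
    global bag_of_words
    bag_of_words = words

def feature_extraction(text):
    temp = [0] * len(bag_of_words)
    for word in text.split():
        if word in bag_of_words:
            temp[bag_of_words.index(word)] += 1
    return temp

if __name__ == '__main__':
    bag_of_words = ['good', 'bad', 'food', 'service']  # placeholder vocabulary
    reviews = ['good food', 'bad service']             # placeholder data
    with multiprocessing.Pool(initializer=init_worker,
                              initargs=(bag_of_words,)) as pool:
        temp_X = pool.map(feature_extraction, reviews)
    print(temp_X)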
