This is actually the continuation of my previous question:
Reduce time complexity of nested loop
I want to add multithreading to my Pearson's r algorithm. Here, I divide the matrix X into n/t x n segments and the vector y into n/t x 1 segments, and feed each segment to a thread. I am currently using ThreadPool from the multiprocessing library as an alternative to Python's threading library, because during testing I noticed that threading does not really run the threads simultaneously; upon further searching on the internet, this turned out to be due to the GIL (global interpreter lock). People suggested the multiprocessing library, and the code below is what I have done so far.
pearson_cor: This algorithm solves Pearson's r; the function accepts a matrix X, a vector y, and the (m, n) sizes. You may see the code in the link above.
My concern is that the processing time still increases as the number of threads increases (see the sample output below). It is noticeable from my tests that the time decreases from t = 2 to t = 4 and starts to increase again from t = 8 up to t = 64. Although these are still better than the single-threaded run, I expected the times to keep decreasing as more computations are processed at the same time.
I want to know why this is happening. I am a newbie at multiprocessing and only started learning it last week. Are there any mistakes in, or anything missing from, my implementation? Is this still related to the GIL? If so, do you have any suggestions on how to implement multithreading more efficiently?
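The actual pearson_cor lives in the linked question and is not reproduced here; the stand-in below is only an assumption (NumPy's corrcoef between y and each row of the given X segment) so that the snippet that follows is self-contained.

import numpy as np

def pearson_cor(X, y):
    # Hypothetical stand-in for the linked pearson_cor: Pearson's r between
    # y and each row of X (X may also be a 1-D segment, as in the code below).
    X = np.atleast_2d(X)
    return [np.corrcoef(row, y)[0, 1] for row in X]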
import numpy as np
from multiprocessing.pool import ThreadPool

n = int(input("n = "))
t = int(input("t = "))

correct_rs = []
y = np.random.randint(0, 100, size=n)
X = np.random.randint(0, 100, size=(n, n))

# Split y into t chunks and every row of X into t chunks.
split_y = np.array_split(y, t)
split_x = []
for i in range(n):
    split_x.append(np.array_split(X[i], t))

for i in range(n):
    threads = [None] * t
    ret = []
    for j in range(t):
        pool = ThreadPool(processes=t)
        threads[j] = pool.apply_async(pearson_cor, args=(split_x[i][j], split_y[j]))
        ret.append(threads[j].get())
Sample output (number of threads = runtime in seconds), from three separate test runs:

Run 1:
1 = 30.79026460647583
2 = 19.61565899848938
**4 = 22.66206121444702**
8 = 26.578197240829468
16 = 27.43799901008606
32 = 29.007505416870117
64 = 29.55176091194153

Run 2:
1 = 82.63879776000977
2 = 71.86883449554443
4 = 66.2829954624176
**8 = 72.7975389957428**
16 = 74.40859937667847
32 = 79.7437674999237
64 = 82.5101261138916

Run 3:
1 = 3.117418050765991
2 = 2.9685776233673096
4 = 2.442412853240967
**8 = 2.6580233573913574**
16 = 2.7630186080932617
32 = 2.747586727142334
64 = 2.768829345703125
One thing to consider is CPU thrashing (https://stackoverflow.com/a/20842853/6014330). Depending on various factors, the CPU will spend more time swapping between threads than actually running the code.
Another thing: if your CPU only has 8 hardware threads available and you try to use 64 threads, you're just creating 64 threads and constantly switching between them while only 8 can run at a time. The most optimal you can get is to match the number of threads your CPU has; any more than that is just waste.
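A minimal sketch of that sizing advice applied to the question's snippet (reusing its pearson_cor, n, split_x and split_y, so this is an illustration rather than a tested drop-in): create one pool sized to the machine instead of one pool per chunk, and let it distribute the chunks.

import os
from multiprocessing.pool import ThreadPool

# Size the pool to the hardware rather than the user-supplied t.
workers = os.cpu_count() or 1

with ThreadPool(processes=workers) as pool:
    for i in range(n):
        # Pair chunk j of row i with chunk j of y, exactly as in the question.
        ret = pool.starmap(pearson_cor, zip(split_x[i], split_y))

If the threads remain GIL-bound for this workload, the same pattern works with multiprocessing.Pool in place of ThreadPool.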
The issue
I am trying to optimise some calculations which lend themselves to so-called embarrassingly parallel computation, but I am finding that using Python's multiprocessing package actually slows things down.
My question is: am I doing something wrong, or is there an intrinsic reason why parallelisation actually slows things down? Is it because I am using numba? Would other packages like joblib or dask make much of a difference?
There are loads of similar questions, in which the answer is always that the overhead costs more than the time saved, but all those questions tend to revolve around very simple functions, whereas I would have expected something with nested loops to lend itself better to parallelisation. I have also not found comparisons among joblib, multiprocessing and dask.
My function
I have a function which takes a one-dimensional numpy array of shape n as its argument and outputs a numpy array of shape (n x t), where each row is independent, i.e. row 0 of the output depends only on item 0 of the input, row 1 on item 1, etc. It looks like the toy example below.
The underlying calculation is optimised with numba, which speeds things up by several orders of magnitude.
Toy example - results
I cannot share the exact code, so I have come up with a toy example. The calculation defined in my_fun_numba is actually irrelevant; it's just some very banal number crunching to keep the CPU busy.
With the toy example, the results on my PC (table not reproduced here) are very similar to what I get with my actual code: splitting the input array into chunks and sending each of them to a multiprocessing.Pool actually slows things down versus just using numba on a single core.
What I have tried
I have tried various combinations of the cache and nogil options in the numba.jit decorator, but the difference is minimal.
I have profiled the code (not the timeit.Timer part, just a single run) with PyCharm and, if I understand the output correctly, it seems most of the time is spent waiting for the pool.
(PyCharm profiler output, sorted by time and by own time, not reproduced here.)
Toy example - the code
import numpy as np
import pandas as pd
import multiprocessing
from multiprocessing import Pool
import numba
import timeit


@numba.jit(nopython=True, nogil=True, cache=True)
def my_fun_numba(x):
    # Banal number crunching: every entry evaluates cos^2 + sin^2 of the input.
    dim2 = 10
    out = np.empty((len(x), dim2))
    n = len(x)
    for r in range(n):
        for c in range(dim2):
            out[r, c] = np.cos(x[r]) ** 2 + np.sin(x[r]) ** 2
    return out


def my_fun_non_numba(x):
    # Identical to my_fun_numba, just without the JIT decorator.
    dim2 = 10
    out = np.empty((len(x), dim2))
    n = len(x)
    for r in range(n):
        for c in range(dim2):
            out[r, c] = np.cos(x[r]) ** 2 + np.sin(x[r]) ** 2
    return out


def my_func_parallel(inp, func, cpus=None):
    # Split the input into one chunk per worker and stack the results.
    if cpus is None:
        cpus = max(1, multiprocessing.cpu_count() - 1)
    inp_split = np.array_split(inp, cpus)
    pool = Pool(cpus)
    out = np.vstack(pool.map(func, inp_split))
    pool.close()
    pool.join()
    return out


if __name__ == "__main__":
    inputs = np.array([100, 10e3, 1e6]).astype(int)
    res = pd.DataFrame(index=inputs,
                       columns=['no paral, no numba', 'no paral, numba',
                                'numba 6 cores', 'numba 12 cores'])
    r = 3  # repeats per measurement
    n = 1  # executions per repeat
    for i in inputs:
        my_arg = np.arange(0, i)
        res.loc[i, 'no paral, no numba'] = min(
            timeit.Timer("my_fun_non_numba(my_arg)", globals=globals()).repeat(repeat=r, number=n)
        )
        res.loc[i, 'no paral, numba'] = min(
            timeit.Timer("my_fun_numba(my_arg)", globals=globals()).repeat(repeat=r, number=n)
        )
        res.loc[i, 'numba 6 cores'] = min(
            timeit.Timer("my_func_parallel(my_arg, my_fun_numba, cpus = 6)", globals=globals()).repeat(repeat=r, number=n)
        )
        res.loc[i, 'numba 12 cores'] = min(
            timeit.Timer("my_func_parallel(my_arg, my_fun_numba, cpus = 12)", globals=globals()).repeat(repeat=r, number=n)
        )
For some reason, the execution time is still the same as without threading.
But if I add something like time.sleep(secs), there is clearly threading at work inside the target function d.
import math
import threading

from shapely.geometry import Point, LineString  # presumably shapely; the imports are not shown in the original

# angales, Arraylock, ReturnArray and calculateDistance are defined elsewhere in the original script.

def d(CurrentPos, polygon, angale, id):
    Returnvalue = 0
    lock = True
    steg = 0.0005
    distance = 0
    x = 0
    y = 0
    # Step along the ray defined by `angale` until we leave the polygon.
    while lock == True:
        x = math.sin(math.radians(angale)) * distance + CurrentPos[0]
        y = math.cos(math.radians(angale)) * distance + CurrentPos[1]
        Localpoint = Point(x, y)
        inout = polygon.contains(Localpoint)
        distance = distance + steg
        if inout == False:
            lock = False
            l = LineString([[CurrentPos[0], CurrentPos[1]], [x, y]])
            Returnvalue = list(l.intersection(polygon).coords)[0]
            Returnvalue = calculateDistance(CurrentPos[0], CurrentPos[1],
                                            Returnvalue[0], Returnvalue[1])
    with Arraylock:
        ReturnArray.append(Returnvalue)
        ReturnArray.append(id)

def Main(CurrentPos, Map):
    threads = []
    for i in range(8):
        t = threading.Thread(target=d, name='thread{}'.format(i),
                             args=(CurrentPos, Map, angales[i], i))
        threads.append(t)
        t.start()
    for i in threads:
        i.join()
Welcome to the world of the Global Interpreter Lock, a.k.a. the GIL. Your function looks like CPU-bound code (calculations, loops, ifs, memory access, etc.). You can't use threads to increase the performance of CPU-bound tasks, sorry. It is a Python limitation.
There are functions in Python that release the GIL, e.g. disk I/O, network I/O, and the one you've actually tried: sleep. And indeed, threads do increase the performance of I/O-bound tasks. But arithmetic and/or memory access won't run in parallel in Python.
The standard workaround is to use processes instead of threads, but this is often painful due to not-that-easy interprocess communication. You may also want to consider using low-level libraries like numpy that actually release the GIL in certain situations (that can only be done at the C level; the GIL is not accessible from Python itself), or using some other language without this limitation, e.g. C#, Java, C, C++ and so on.
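A minimal sketch of that workaround for the code above, assuming d can be refactored into a pure function (called d_compute here, which is hypothetical) that returns its value instead of appending to the shared ReturnArray:

from multiprocessing import Pool

def d_worker(args):
    # Unpack one job; d_compute is a hypothetical, side-effect-free version of d().
    CurrentPos, polygon, angale, id = args
    return id, d_compute(CurrentPos, polygon, angale)

def Main(CurrentPos, Map):
    jobs = [(CurrentPos, Map, angales[i], i) for i in range(8)]
    with Pool(processes=8) as pool:
        # Each job runs in a separate process, so the GIL is not shared.
        results = pool.map(d_worker, jobs)
    return dict(results)

Note that the arguments are pickled and sent to the workers, so everything passed in (including the polygon) must be picklable.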
I am using the threading library to accelerate calculating each point's neighborhood in a point cloud, by calling the function CalculateAllPointsNeighbors at the bottom of the post.
The function receives a search radius, a maximum number of neighbors and a number of threads to split the work over. No changes are made to any of the points, and each point stores data in its own np.ndarray cell, accessed by its own index.
The following function times how long it takes N threads to finish calculating all point neighborhoods:
def TimeFuncThreads(classObj, uptothreads):
    listTimers = []
    startNum = 1
    EndNum = uptothreads + 1
    for i in range(startNum, EndNum):
        print("Current Number of Threads to Test: ", i)
        tempT = time.time()
        classObj.CalculateAllPointsNeighbors(searchRadius=0.05, maxNN=25, maxThreads=i)
        tempT = time.time() - tempT
        listTimers.append(tempT)

    PlotXY(np.arange(startNum, EndNum), listTimers)
The problem is that I get very different results in each run. The plots from 5 consecutive runs of TimeFuncThreads (X axis is the number of threads, Y is the runtime; not reproduced here) look totally random, and there is no significant acceleration.
I'm now confused about whether I'm using the threading library wrong and what this behavior I'm getting means.
Below are the function that handles the threading and the function that is called from each thread:
def CalculateAllPointsNeighbors(self, searchRadius=0.20, maxNN=50, maxThreads=8):
    threadsList = []
    pointsIndices = np.arange(self.numberOfPoints)
    splitIndices = np.array_split(pointsIndices, maxThreads)
    for i in range(maxThreads):
        threadsList.append(threading.Thread(target=self.GetPointsNeighborsByID,
                                            args=(splitIndices[i], searchRadius, maxNN)))
    [t.start() for t in threadsList]
    [t.join() for t in threadsList]

def GetPointsNeighborsByID(self, idx, searchRadius=0.05, maxNN=20):
    if isinstance(idx, int):
        idx = [idx]
    for currentPointIndex in idx:
        currentPoint = self.pointsOpen3D.points[currentPointIndex]
        pointNeighborhoodObject = self.GetPointNeighborsByCoordinates(currentPoint, searchRadius, maxNN)
        self.pointsNeighborsArray[currentPointIndex] = pointNeighborhoodObject
        self.__RotatePointNeighborhood(currentPointIndex)
It pains me to be the one to introduce you to the Python GIL. It is a very nice feature that makes parallelism using threads in Python a nightmare.
If you really want to improve your code's speed, you should look at the multiprocessing module.
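A rough sketch of that suggestion, assuming the per-point work can be pulled out into a picklable, module-level function (called compute_neighbors here, which is hypothetical; bound methods and the Open3D object itself don't transfer to worker processes for free, so this only illustrates the pattern):

from multiprocessing import Pool
import numpy as np

def neighbors_for_chunk(args):
    # Hypothetical worker: plain coordinate rows in, neighborhoods out.
    points_chunk, searchRadius, maxNN = args
    return [compute_neighbors(p, searchRadius, maxNN) for p in points_chunk]

def calculate_all_points_neighbors_mp(points, searchRadius=0.20, maxNN=50, processes=8):
    # Split the (N, 3) coordinate array into one chunk per process.
    chunks = np.array_split(points, processes)
    with Pool(processes=processes) as pool:
        results = pool.map(neighbors_for_chunk,
                           [(c, searchRadius, maxNN) for c in chunks])
    # Re-flatten so the result list is ordered by point index.
    return [nb for chunk in results for nb in chunk]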
So the case is the following: I wanted to compare the runtime of a matrix multiplication with IPython parallel against just running it on a single core.
Code for normal execution:
import numpy as np
import timeit

n = 13
dim_1, dim_2, dim_3, dim_4 = 2**n, 2**n, 2**n, 2**n

A = np.random.random((dim_1, dim_2))
B = np.random.random((dim_3, dim_4))

start = timeit.time.time()
C = np.matmul(A, B)
dur = timeit.time.time() - start
This amounts to about 24 seconds on my notebook.
If I do the same thing and try to parallelize it, I start four engines using ipcluster start -n 4 (I have 4 cores). Then I run this in my notebook:
from ipyparallel import Client

c = Client()
dview = c.load_balanced_view()

%px import numpy

def pdot(view_obj, A_mat, B_mat):
    # Push B to the engines, scatter A across them, multiply, gather back.
    view_obj['B'] = B_mat
    view_obj.scatter('A', A_mat)
    view_obj.execute('C = A.dot(B)')
    return view_obj.gather('C', block=True)

start = timeit.time.time()
pdot(dview, A, B)
dur1 = timeit.time.time() - start
dur1
which takes approximately 34 seconds.
When I look at the system monitor I can see that in both cases all cores are used. In the parallel case there seems to be an overhead phase where the cores aren't at 100% usage (I suppose that's the part where the data gets scattered across the engines), whereas in the non-parallel case all cores immediately go to 100% usage.
This surprises me, as I always thought Python intrinsically ran on a single core.
I would be happy if somebody had more insight into this.
I have a nested loop of the form:

# x and y start from initial coordinates set earlier (not shown);
# lat2[0] represents a large number.
while x < lat2[0]:
    while y > lat3[1]:
        if is_inside_nepal([x, y]):
            print("inside")
        else:
            print("not")
        y = y - (1/150.0)
    y = lat2[1]
    x = x + (1/150.0)
This normally takes around 50 s to execute.
I have changed this loop into the multiprocessing code below.
import multiprocessing as mp
from multiprocessing import Process
from Queue import Queue   # the thread Queue from the Python 2 Queue module

# lat2, lat3, x, is_inside and drange are defined earlier in the script (not shown).

def v1find_coordinates(q):
    while not q.empty():
        x1 = q.get()
        x2 = x1 + incfactor
        # Scan the [x1, x2) strip column by column.
        while x1 < x2:
            def func(x1, y):
                while y > lat3[1]:
                    if is_inside([x1, y]):
                        print x1, y, "inside"
                    else:
                        print x1, y, "not inside"
                    y = y - (1/150.0)
            func(x1, lat2[1])
            x1 = x1 + (1/150.0)

incfactor = 0.7
# drange yields values from x up to lat2[0] with a decimal step.
xvalues = drange(x, lat2[0], incfactor)

cores = mp.cpu_count()
q = Queue()
processes = []

for i in xvalues:
    q.put(i)

for i in range(0, cores):
    p = Process(target=v1find_coordinates, args=(q,))
    p.daemon = True
    p.start()
    processes.append(p)

for i in processes:
    print ("now joining")
    i.join()
This multiprocessing code also takes around 50 s of execution time, which means there is no time difference between the two.
I have also tried using pools and managing the chunk size, and I have googled and searched through other Stack Overflow questions, but I can't find a satisfying answer.
The only answer I could find was that time is spent on process management, which makes both results the same. If this is the reason, how can I make the multiprocessing work to obtain faster results?
Would implementing this in C instead of Python give faster results?
I am not expecting drastic results, but by common sense one can tell that running on 4 cores should be a lot faster than running on 1 core. Yet I am getting similar results. Any kind of help would be appreciated.
You seem to be using a thread Queue (from Queue import Queue). This does not work as expected, because Process uses fork() and the entire Queue gets cloned into each worker process.
Use:
from multiprocessing import Queue
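A minimal sketch of the corrected setup, reusing xvalues and cores from the question; the get_nowait/Empty handling is an addition, because empty() on a multiprocessing.Queue is not reliable once several processes are draining it:

from multiprocessing import Process, Queue
from Queue import Empty          # Python 2; in Python 3 this is `from queue import Empty`

def v1find_coordinates(q):
    while True:
        try:
            x1 = q.get_nowait()  # don't trust q.empty() across processes
        except Empty:
            break
        # process x1 here (same per-chunk scanning work as in the question)

q = Queue()                      # multiprocessing.Queue, shared with the workers
for i in xvalues:
    q.put(i)

processes = [Process(target=v1find_coordinates, args=(q,)) for _ in range(cores)]
for p in processes:
    p.start()
for p in processes:
    p.join()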