Python Multiprocessing speed issue

I have a nested loop of the form:
while x < lat2[0]:
    while y > lat3[1]:
        if is_inside_nepal([x, y]):
            print("inside")
        else:
            print("not")
        y = y - (1/150.0)
    y = lat2[1]
    x = x + (1/150.0)
# here lat2[0] represents a large number
This normally takes around 50s to execute. I have changed the loop into the following multiprocessing code:
def v1find_coordinates(q):
    while not q.empty():
        x1 = q.get()
        x2 = x1 + incfactor
        while x1 < x2:
            def func(x1):
                while y > lat3[1]:
                    if is_inside([x1, y]):
                        print(x1, y, "inside")
                    else:
                        print(x1, y, "not inside")
                    y = y - (1/150.0)
            func(x1)
            y = lat2[1]
            x1 = x1 + (1/150.0)

incfactor = 0.7
xvalues = drange(x, lat2[0], incfactor)
# drange builds a list using a decimal increment
cores = mp.cpu_count()
q = Queue()
for i in xvalues:
    q.put(i)

processes = []
for i in range(0, cores):
    p = Process(target=v1find_coordinates, args=(q,))
    p.start()
    p.Daemon = True
    processes.append(p)

for i in processes:
    print("now joining")
    i.join()
This multiprocessing code also takes around 50s to execute, so there is no time difference between the two.
I have also tried using pools and managing the chunk size. I have googled and searched through other Stack Overflow questions, but can't find a satisfying answer.
The only explanation I could find is that process-management overhead eats up the gain, making both results the same. If that is the reason, how can I get multiprocessing to produce faster results?
Will implementing this in C instead of Python give faster results?
I am not expecting drastic results, but common sense says that running on 4 cores should be considerably faster than running on 1 core. Instead I am getting similar timings. Any kind of help would be appreciated.

You seem to be using a thread Queue (from Queue import Queue). This does not work as expected, because Process uses fork() and each worker Process ends up with its own clone of the entire Queue.
Use:
from multiprocessing import Queue
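For reference, a minimal sketch of how the driver code from the question might look with the multiprocessing Queue (it reuses v1find_coordinates and xvalues from above and is untested, since it depends on the rest of the question's code):

import multiprocessing as mp
from multiprocessing import Process, Queue   # multiprocessing.Queue, not Queue.Queue

if __name__ == '__main__':
    q = Queue()                  # shared work queue, visible to all workers
    for i in xvalues:
        q.put(i)

    processes = []
    for _ in range(mp.cpu_count()):
        p = Process(target=v1find_coordinates, args=(q,))
        p.daemon = True          # note: lowercase 'daemon', set before start()
        p.start()
        processes.append(p)

    for p in processes:
        p.join()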

Related

Multiprocessing a for loop in Python stops after a while and the program continues to run using a single thread

I am multiprocessing a for loop in Python using Pool in this way:
if __name__ == '__main__':
    pool = Pool()  # Create a multiprocessing Pool
    result = pool.map(process_interp, range(ngal))
where process_interp is a function I have defined. However, when the number of calculations increases, the program seems to almost crash: a 0-byte file is produced as my result, and the Python program keeps running, but htop shows that it is no longer using multiple cores and only a single thread remains active. For lighter calculations everything works fine and the program runs fast and correctly.
Can someone give me insight into what is happening here and what causes this problem? I have no clue how to solve this issue. Thanks a lot!
Edit: This is the function process_interp. It basically calculates some values on a grid. The issue arises when I increase the size of the grid (N_a and N_m) and therefore the number of calculations.
def process_interp(k):
    #e = data[k][0]+np.random.normal(0, 0.3, len(data[k][0]))
    e = data[k][0]
    sigma_crit = data[k][1]
    r_s = data[k][2]
    dev_gamma = data[k][4]
    dev_k = data[k][5]
    z_d = data[k][6]
    like_grid = np.empty((N_a, N_m), dtype='longdouble')
    for i in range(N_a):
        for j in range(N_m):
            gamma_nfw, k_nfw = NFWfunc(m_grid[j], z_d, r_s, sigma_crit)
            gamma_dev = dev_gamma*pow(10, alpha_grid[i]-alpha_0)
            k_dev = dev_k*pow(10, alpha_grid[i]-alpha_0)
            mean = (gamma_nfw+gamma_dev)/(1-k_nfw-k_dev)
            loglike = np.zeros(len(e), dtype='longdouble')
            loglike = -((mean-e)**2/(2*0.3**2))
            like_grid[i,j] = np.prod(np.exp(loglike))
    return like_grid

Fastest way to perform Multiprocessing of a loop in a function?

1. I have a function var. I want to know the best possible way to run the loop within this function quickly via multiprocessing/parallel processing, utilizing all the processors, cores, threads, and RAM the system has.
import numpy as np
from pysheds.grid import Grid

xs = 82.1206, 72.4542, 65.0431, 83.8056, 35.6744
ys = 25.2111, 17.9458, 13.8844, 10.0833, 24.8306

a = r'/home/test/image1.tif'
b = r'/home/test/image2.tif'

def var(interest):
    variable_avg = []
    for (x, y) in zip(xs, ys):
        grid = Grid.from_raster(interest, data_name='map')
        grid.catchment(data='map', x=x, y=y, out_name='catch')
        variable = grid.view('catch', nodata=np.nan)
        variable = np.array(variable)
        variablemean = variable.mean()
        variable_avg.append(variablemean)
    return variable_avg
2. It would be great if I could run both the function var and the loop inside it in parallel for multiple arguments of the function, e.g. var(a) and var(b) at the same time, since that would take much less time than parallelizing the loop alone.
Ignore 2 if it does not make sense.
TLDR:
You can use the multiprocessing library to run your var function in parallel. However, as written you likely don't make enough calls to var for multiprocessing to have a performance benefit because of its overhead. If all you need to do is run those two calls, running in serial is likely the fastest you'll get. However, if you need to make a lot of calls, multiprocessing can help you out.
We'll need to use a process pool to run this in parallel, threads won't work here because Python's global interpreter lock will prevent us from true parallelism. The drawback of process pools is that processes are heavyweight to spin up. In the example of just running two calls to var the time to create the pool overwhelms the time spent running var itself.
To illustrate this, let's use a process pool with asyncio to run calls to var in parallel and compare it to just running things sequentially. Note that to run this example I used an image from the Pysheds library, https://github.com/mdbartos/pysheds/tree/master/data - if your image is much larger, the results below may not hold true.
import functools
import time
from concurrent.futures.process import ProcessPoolExecutor
import asyncio

a = 'diem.tif'
xs = 10, 20, 30, 40, 50
ys = 10, 20, 30, 40, 50

async def main():
    loop = asyncio.get_event_loop()
    pool_start = time.time()
    with ProcessPoolExecutor() as pool:
        task_one = loop.run_in_executor(pool, functools.partial(var, a))
        task_two = loop.run_in_executor(pool, functools.partial(var, a))
        results = await asyncio.gather(task_one, task_two)
    pool_end = time.time()
    print(f'Process pool took {pool_end-pool_start}')

    serial_start = time.time()
    result_one = var(a)
    result_two = var(a)
    serial_end = time.time()
    print(f'Running in serial took {serial_end - serial_start}')

if __name__ == "__main__":
    asyncio.run(main())
Running the above on my machine (a 2.4 GHz 8-Core Intel Core i9) I get the following output:
Process pool took 1.7581260204315186
Running in serial took 0.32335805892944336
In this example, a process pool is over five times slower! This is due to the overhead of creating and managing multiple processes. That said, if you need to call var more than just a few times, a process pool may make more sense. Let's adapt this to run var 100 times and compare the results:
async def main():
    loop = asyncio.get_event_loop()
    pool_start = time.time()
    tasks = []
    with ProcessPoolExecutor() as pool:
        for _ in range(100):
            tasks.append(loop.run_in_executor(pool, functools.partial(var, a)))
        results = await asyncio.gather(*tasks)
    pool_end = time.time()
    print(f'Process pool took {pool_end-pool_start}')

    serial_start = time.time()
    for _ in range(100):
        result = var(a)
    serial_end = time.time()
    print(f'Running in serial took {serial_end - serial_start}')
Running 100 times, I get the following output:
Process pool took 3.442288875579834
Running in serial took 13.769982099533081
In this case, running in a process pool is about 4x faster. You may also wish to try running each iteration of your loop concurrently. You can do this by creating a function that processes one x,y coordinate at a time and then running each point you want to examine in a process pool:
def process_poi(interest, x, y):
    grid = Grid.from_raster(interest, data_name='map')
    grid.catchment(data='map', x=x, y=y, out_name='catch')
    variable = grid.view('catch', nodata=np.nan)
    variable = np.array(variable)
    return variable.mean()

async def var_loop_async(interest, pool, loop):
    tasks = []
    for (x, y) in zip(xs, ys):
        function_call = functools.partial(process_poi, interest, x, y)
        tasks.append(loop.run_in_executor(pool, function_call))
    return await asyncio.gather(*tasks)

async def main():
    loop = asyncio.get_event_loop()
    pool_start = time.time()
    tasks = []
    with ProcessPoolExecutor() as pool:
        for _ in range(100):
            tasks.append(var_loop_async(a, pool, loop))
        results = await asyncio.gather(*tasks)
    pool_end = time.time()
    print(f'Process pool took {pool_end-pool_start}')
    serial_start = time.time()
In this case I get Process pool took 3.2950568199157715 - so not really any faster than the first version with one process per call of var. This is likely because the limiting factor at this point is the number of cores available on the CPU; splitting the work into smaller increments does not add much value.
That said, if you have 1000 x and y coordinates to examine across two images, this last approach may yield a performance gain.
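For instance, here is a rough sketch of that scenario, reusing process_poi from above with hypothetical coordinates and both image paths (a and b are assumed to exist as in the question; the random points are purely illustrative):

import random

points = [(random.uniform(0, 50), random.uniform(0, 50)) for _ in range(1000)]

async def main():
    loop = asyncio.get_event_loop()
    start = time.time()
    with ProcessPoolExecutor() as pool:
        tasks = [
            loop.run_in_executor(pool, functools.partial(process_poi, image, x, y))
            for image in (a, b)          # both rasters, 1000 points each
            for (x, y) in points
        ]
        results = await asyncio.gather(*tasks)
    print(f'Processed {len(results)} points in {time.time() - start:.2f}s')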
I think this is a reasonable and straightforward way of speeding up your code by parallelizing only the main loop. You can saturate your cores with this, so there is no need to also parallelize over the interest variable. I can't test the code, so I assume your function is correct; I have simply moved the body of the loop into a new function and parallelized it inside var().
from multiprocessing import Pool

def var(interest, xs, ys):
    grid = Grid.from_raster(interest, data_name='map')
    with Pool(4) as p:  # uses 4 cores, adjust this as you need
        variable_avg = p.starmap(loop, [(x, y, grid) for x, y in zip(xs, ys)])
    return variable_avg

def loop(x, y, grid):
    grid.catchment(data='map', x=x, y=y, out_name='catch')
    variable = grid.view('catch', nodata=np.nan)
    variable = numpy.array(variable)
    return variable.mean()
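A possible way to call it, assuming the imports and the coordinates/raster path from the question (untested here, since it depends on the local raster files):

if __name__ == '__main__':
    xs = 82.1206, 72.4542, 65.0431, 83.8056, 35.6744
    ys = 25.2111, 17.9458, 13.8844, 10.0833, 24.8306
    a = r'/home/test/image1.tif'      # raster path from the question
    print(var(a, xs, ys))             # one catchment mean per (x, y) pair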

Understanding Python multithreading with a simple math computation problem - how to make this Python code run faster

I am trying to understand multithreading in Python, but every example I read is very simple and similar to the others, and none of them serves my purpose.
I am performing heavy mathematical computation tasks, so I need to use Python multithreading. I need multithreading, NOT multiprocessing, to avoid memory issues.
Below is a simple math problem on which I want to use multithreading (not my actual workload, but enough to help me understand threading). I need to run task #1 (square) and task #2 (inve) in parallel on two different threads (waiting for the results of both), and then use the two results to obtain the final result back in the main thread.
Please use any threading option in Python (Thread, ThreadPool, ThreadPoolExecutor), whichever you prefer, as long as it really saves time and cost compared to the serial code.
If anyone has a better idea to make this code run faster, please share it.
import scipy.linalg
import numpy

def square(A):
    # take the square root of a matrix
    y = scipy.linalg.sqrtm(A)
    return y

def inve(A):
    # take the inverse of a matrix
    y = numpy.linalg.inv(A)
    return y

def main(A):
    # A is a matrix
    # a is the square root of the matrix
    # b is the inverse of the matrix
    # Run the first task on one thread
    a = square(A)
    # Run the second task in parallel on another thread
    b = inve(A)
    # and then compute the final result in the main thread
    result = a * b
    return result

A = [[1, 2], [3, 4]]
result = main(A)
Here is how you can perform multithreading:
import threading
from scipy import linalg
import numpy

x = 1
y = 1

def square(A):
    global x
    # take the square root of a matrix
    x = linalg.sqrtm(A)

def inve(A):
    global y
    # take the inverse of a matrix
    y = numpy.linalg.inv(A)

def main(A):
    # creating threads
    t1 = threading.Thread(target=square, args=(A,))
    t2 = threading.Thread(target=inve, args=(A,))
    # starting thread 1
    t1.start()
    # starting thread 2
    t2.start()
    # wait until thread 1 is completely executed
    t1.join()
    # wait until thread 2 is completely executed
    t2.join()

if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    main(A)
    result = x * y
    print(result)
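As an alternative sketch that avoids the global variables, concurrent.futures.ThreadPoolExecutor can collect the return values directly. Whether the two calls truly overlap depends on how much time the underlying LAPACK routines spend with the GIL released, so the speed-up is not guaranteed:

from concurrent.futures import ThreadPoolExecutor

import numpy
from scipy import linalg

def main(A):
    # Submit both tasks to a thread pool and wait for their return values.
    with ThreadPoolExecutor(max_workers=2) as executor:
        future_sqrt = executor.submit(linalg.sqrtm, A)      # matrix square root
        future_inv = executor.submit(numpy.linalg.inv, A)   # matrix inverse
        a = future_sqrt.result()
        b = future_inv.result()
    return a * b   # element-wise product, as in the original code

if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    print(main(A))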

Python Threading inconsistent execution time

I am using the threading library to accelerate computing each point's neighborhood in a point cloud, by calling the function CalculateAllPointsNeighbors shown at the bottom of the post.
The function receives a search radius, a maximum number of neighbors, and a number of threads to split the work across. No point is modified, and each point stores its data in its own np.ndarray cell, accessed by its own index.
The following function times how long it takes N threads to finish calculating all point neighborhoods:
def TimeFuncThreads(classObj, uptothreads):
    listTimers = []
    startNum = 1
    EndNum = uptothreads + 1
    for i in range(startNum, EndNum):
        print("Current Number of Threads to Test: ", i)
        tempT = time.time()
        classObj.CalculateAllPointsNeighbors(searchRadius=0.05, maxNN=25, maxThreads=i)
        tempT = time.time() - tempT
        listTimers.append(tempT)
    PlotXY(np.arange(startNum, EndNum), listTimers)
The problem is that I get very different results on each run. Here are the plots from 5 consecutive runs of TimeFuncThreads (the X axis is the number of threads, Y is the runtime). First, they look essentially random; second, there is no significant speed-up.
I am confused now: am I using the threading library wrong, and what explains this behaviour?
The function that handles the threading and the function that is being called from each thread:
def CalculateAllPointsNeighbors(self, searchRadius=0.20, maxNN=50, maxThreads=8):
    threadsList = []
    pointsIndices = np.arange(self.numberOfPoints)
    splitIndices = np.array_split(pointsIndices, maxThreads)
    for i in range(maxThreads):
        threadsList.append(threading.Thread(target=self.GetPointsNeighborsByID,
                                            args=(splitIndices[i], searchRadius, maxNN)))
    [t.start() for t in threadsList]
    [t.join() for t in threadsList]

def GetPointsNeighborsByID(self, idx, searchRadius=0.05, maxNN=20):
    if isinstance(idx, int):
        idx = [idx]
    for currentPointIndex in idx:
        currentPoint = self.pointsOpen3D.points[currentPointIndex]
        pointNeighborhoodObject = self.GetPointNeighborsByCoordinates(currentPoint, searchRadius, maxNN)
        self.pointsNeighborsArray[currentPointIndex] = pointNeighborhoodObject
        self.__RotatePointNeighborhood(currentPointIndex)
It pains me to be the one to introduce you to the Python GIL (global interpreter lock). It is a "very nice" feature that makes parallelism with threads in Python a nightmare: only one thread can execute Python bytecode at a time, so CPU-bound work like this does not scale with more threads.
If you really want to improve the speed of your code, you should look at the multiprocessing module.
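A rough, self-contained sketch of what that could look like (not your original code): the per-chunk work is moved into a module-level function, because methods bound to self only work with multiprocessing if the whole object can be pickled, and a brute-force radius search stands in for the real Open3D neighborhood query:

from multiprocessing import Pool

import numpy as np

def neighbors_for_chunk(args):
    # Worker: for each point index in the chunk, return the indices of up to
    # max_nn points that lie within search_radius (brute-force stand-in).
    indices, points, search_radius, max_nn = args
    out = []
    for i in indices:
        d = np.linalg.norm(points - points[i], axis=1)
        nearest = np.argsort(d)[1:max_nn + 1]          # skip the point itself
        out.append(nearest[d[nearest] <= search_radius])
    return out

def calculate_all_points_neighbors(points, search_radius=0.05, max_nn=25, workers=4):
    chunks = np.array_split(np.arange(len(points)), workers)
    with Pool(workers) as pool:
        results = pool.map(neighbors_for_chunk,
                           [(chunk, points, search_radius, max_nn) for chunk in chunks])
    # Flatten the per-chunk lists back into one list ordered like the input.
    return [n for chunk_result in results for n in chunk_result]

if __name__ == '__main__':
    pts = np.random.rand(5000, 3)
    all_neighbors = calculate_all_points_neighbors(pts)
    print(len(all_neighbors), 'neighborhoods computed')

Note that each worker receives its own pickled copy of the point array, so for very large clouds the copying cost can eat into the gain.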

Python 3 multiprocessing on 1 core gives overhead that grows with workload

I am testing the parallel capabilities of Python 3, which I intend to use in my code. I observe unexpectedly slow behaviour, so I boiled my code down to the following proof of principle: calculate a simple logarithmic series, once serially and once in parallel using 1 core. One would imagine that the timings for these two examples would be the same, except for a small overhead associated with initializing and closing the multiprocessing.Pool class. However, what I observe is that the overhead grows linearly with problem size, and thus the parallel solution on 1 core is significantly worse than the serial solution even for large inputs. Please tell me if I am doing something wrong.
import time
import numpy as np
import multiprocessing
import matplotlib.pyplot as plt

def foo(x):
    return sum([np.log(1 + i*x) for i in range(10)])

def serial_series(rangeMax):
    return [foo(x) for x in range(rangeMax)]

def parallel_series_1core(rangeMax):
    pool = multiprocessing.Pool(processes=1)
    rez = pool.map(foo, tuple(range(rangeMax)))
    pool.terminate()
    pool.join()
    return rez

nTask = [1 + i ** 2 * 1000 for i in range(1, 2)]
nTimeSerial = []
nTimeParallel = []
for taskSize in nTask:
    print('TaskSize', taskSize)
    start = time.time()
    rez = serial_series(taskSize)
    end = time.time()
    nTimeSerial.append(end - start)

    start = time.time()
    rez = parallel_series_1core(taskSize)
    end = time.time()
    nTimeParallel.append(end - start)

plt.plot(nTask, nTimeSerial)
plt.plot(nTask, nTimeParallel)
plt.legend(['serial', 'parallel 1 core'])
plt.show()
Edit:
It was commented that the overhead may be due to creating multiple jobs. Here is a modification of the parallel function that explicitly creates only 1 job; I still observe linear growth of the overhead:
def parallel_series_1core(rangeMax):
    pool = multiprocessing.Pool(processes=1)
    rez = pool.map(serial_series, [rangeMax])
    pool.terminate()
    pool.join()
    return rez
Edit 2: Once more, here is the exact code that produces the linear growth. Adding a print statement inside serial_series confirms that it is called only once per call of parallel_series_1core.
import time
import numpy as np
import multiprocessing
import matplotlib.pyplot as plt

def foo(x):
    return sum([np.log(1 + i*x) for i in range(10)])

def serial_series(rangeMax):
    return [foo(i) for i in range(rangeMax)]

def parallel_series_1core(rangeMax):
    pool = multiprocessing.Pool(processes=1)
    rez = pool.map(serial_series, [rangeMax])
    pool.terminate()
    pool.join()
    return rez

nTask = [1 + i ** 2 * 1000 for i in range(1, 20)]
nTimeSerial = []
nTimeParallel = []
for taskSize in nTask:
    print('TaskSize', taskSize)
    start = time.time()
    rez1 = serial_series(taskSize)
    end = time.time()
    nTimeSerial.append(end - start)

    start = time.time()
    rez2 = parallel_series_1core(taskSize)
    end = time.time()
    nTimeParallel.append(end - start)

plt.plot(nTask, nTimeSerial)
plt.plot(nTask, nTimeParallel)
plt.plot(nTask, [i / j for i, j in zip(nTimeParallel, nTimeSerial)])
plt.legend(['serial', 'parallel 1 core', 'ratio'])
plt.show()
When you use Pool.map() you're essentially telling it to split the passed iterable into jobs over all available sub-processes (one in your case): the larger the iterable, the more 'jobs' are created on the first call. That is what initially adds a huge, albeit linear, overhead, trumped only by the process creation itself.
Since sub-processes do not share memory, all changing data on POSIX systems (due to forking) and all data (even static data) on Windows must be pickled on one side and unpickled on the other. There is also the time needed to clear out the process stack for the next job, plus the overhead of system thread switching (which is out of your control; you would have to tinker with the system's scheduler to reduce it).
For simple/quick tasks a single process will always trump multiprocessing.
UPDATE - As I said above, the additional overhead comes from the fact that for any data exchange between processes, Python transparently runs a pickling/unpickling routine. Since the list you return from serial_series() grows in size over time, so does the performance penalty for pickling/unpickling. Here is a simple demonstration of this based on your code:
import math
import pickle
import sys
import time

# multi-platform precision timer
get_timer = time.clock if sys.platform == "win32" else time.time

def foo(x):  # logic/computation function
    return sum([math.log(1 + i*x) for i in range(10)])

def serial_series(max_range):  # main sub-process function
    return [foo(i) for i in range(max_range)]

def serial_series_slave(max_range):  # subprocess interface
    return pickle.dumps(serial_series(pickle.loads(max_range)))

def serial_series_master(max_range):  # main process interface
    return pickle.loads(serial_series_slave(pickle.dumps(max_range)))

tasks = [1 + i ** 2 * 1000 for i in range(1, 20)]
simulated_times = []
for task in tasks:
    print("Simulated task size: {}".format(task))
    start = get_timer()
    res = serial_series_master(task)
    simulated_times.append((task, get_timer() - start))
At the end, simulated_times will contain something like:
[(1001, 0.010015994115533963), (4001, 0.03402641167313844), (9001, 0.06755546622419131),
(16001, 0.1252664260421834), (25001, 0.18815836740279515), (36001, 0.28339434475444325),
(49001, 0.3757235840503601), (64001, 0.4813749807557435), (81001, 0.6115452710446636),
(100001, 0.7573718332506543), (121001, 0.9228750064147522), (144001, 1.0909038813527427),
(169001, 1.3017281342479343), (196001, 1.4830192955746764), (225001, 1.7117389965616931),
(256001, 1.9392146632682739), (289001, 2.19192682050668), (324001, 2.4497541011649187),
(361001, 2.7481495578097466)]
showing a clear greater-than-linear increase in processing time as the list grows. This is essentially what happens with multiprocessing - if your sub-process function returned nothing, it would end up considerably faster.
If you have a large amount of data you need to share among processes, I'd suggest using an in-memory database (like Redis) and having your sub-processes connect to it to store/retrieve data.
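A minimal sketch of that idea with the redis-py client, assuming a Redis server is running on localhost (the data is still serialized, but it no longer has to travel back through the pool's result pipe):

import pickle
from multiprocessing import Pool

import numpy as np
import redis

def worker(task_size):
    # Compute the chunk and store it in Redis instead of returning it,
    # so only a small task id travels back through the pool.
    result = [np.log(1 + i) for i in range(task_size)]
    r = redis.Redis(host='localhost', port=6379)
    r.set('result:{}'.format(task_size), pickle.dumps(result))
    return task_size

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        done = pool.map(worker, [1001, 4001, 9001])
    r = redis.Redis(host='localhost', port=6379)
    results = {t: pickle.loads(r.get('result:{}'.format(t))) for t in done}
    print({t: len(v) for t, v in results.items()})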
