Python Multiprocessing Increasing Time delay using pool - python

I have to solve a complex network optimization problem using a heuristic algorithm (e.g. ant algorithm). This algorithm is decomposed in two steps: 1.) Calculate new solutions using an random component, 2.) Evaluate new Solutions. These two parts are very highly time-consuming and solved the problem parallel using multiprocessing in subprograms. With increasing number of iteration the time duration of the steps increases very fast. I localized the time delay between the initialization of the parallel processes (labels timeMainCalculate and timeMainEvaluate) and the start of the first subprogram (labels timeSubCalculate and timeSubEvaluate). In the first iteration the time difference timeMainCalculate-timeSubCalculate respectively timeMainEvaluate-timeSubEvaluate is nearly 0, after 100 iterations it is about 10 seconds for both steps and after 200 steps the time difference is about 20. This time difference is linear increasing. The time duration for calculation and evaluation in the subprograms is constant. So it might be a problem in the communication between the main program and the subprograms using multiprocessing. Pool.
For your Information: I’m using Python 3.3 on a eight core computer.
opt_heuristic.py:
import multiprocessing.Pool
import datetime
import calculate, evaluate
epsilon = 1e-10
nbOfCpusForParallel = 6
nbCalculation = 6
def calculate_bound_update_information(result):
Do_some_calculation using result
return [bound,x,y,z]
if __name__ == '__main__':
Inititalize x,y,z
while bound > epsilon:
# Calculate new Solution
pool = multiprocessing.Pool(processes=nbOfCpusForParallel)
result_parallel = list()
for i in range(nbCalculation):
result_parallel.append(pool.apply_async(calculate.main,[x,y,z]))
timeMainCalculate = datetime.datetime.now()
pool.close()
pool.join()
resultCalculation = [result_parallel[i].get() for i in range(nbCalculation)]
# Evaluate Solutions
pool = multiprocessing.Pool(processes=nbOfCpusForParallel)
argsEvalute = [[resultCalculation[i][0],resultCalculation[i][1]] for i in range(len(resultCalculation))]
result_evaluate = list()
for i in range(len(resultCalculation)):
result_evaluate.append(pool.apply_async(evaluate.main,argsEvalute[i]))
timeMainEvaluate = datetime.datetime.now()
pool.close()
pool.join()
resultEvaluation = [result_evaluate[i].get() for i in range(len(resultCalculation))]
[bound,x,y,z] = calculate_bound_update_information(resultEvaluation)
calculate.py:
import datetime
def main(x,y,z):
timeSubCalculate = datetime.datetime.now()
Do some random calculation using x,y,z
return result
evaluate.py
import datetime
def main(x,y):
timeSubEvaluate = datetime.datetime.now()
Do some evaluation using x,y
return result
I seems to me that the main program store some information of the parallel process. I tried some things like delete the pool variable but it was not successful.
Has somebody an idea what's the technical problem and how it could be solved? Thanks.

Related

How to use multiprocessing in Python on for loop to generate nested dictionary?

I am currently generating a nested dictionary that saves some arrays by using a nested for loop. Unfortunately, it takes quite some time; I realized that the server I am working on has a few cores available, so I was wondering if Python's multiprocessing library could be helpful to speed up the creation of the dictionary.
The nested for loop looks something like this (the actual computation is heavier and more complex):
import numpy as np
data_dict = {}
for s in range(1,5):
data_dict[s] = {}
for d in range(1,5):
if s * d > 4:
data_dict[s][d] = np.zeros((s,d))
else:
data_dict[s][d] = np.ones((s,d))
So this is what I tried:
from multiprocessing import Pool
import numpy as np
data_dict = {}
def process():
#sci=fits.open('{}.fits'.format(name))
for s in range(1,5):
data_dict[s] = {}
for d in range(1,5):
if s * d > 4:
data_dict[s][d] = np.zeros((s,d))
else:
data_dict[s][d] = np.ones((s,d))
if __name__ == '__main__':
pool = Pool() # Create a multiprocessing Pool
pool.map(process)
But pool.map (last line) seems to require an iterable, which I'm not sure what to insert there.
In my opinion, the real problem is what kind of processing is needed to compute entries of the dictionary and how many entries are there.
The kind of processing is essential to understand if multiprocessing can significantly speed up the creation of the dictionary. If your computation is I/O bound, you should use multithreading, while if it's CPU bound you should use multiprocessing. You can find more bout this here.
Assuming that the value of each entry can be computed independently and that this computation is CPU bound, let's benchmark the difference between single process and multiprocess implementation (based on multiprocessing library).
The following code is used to test the two approaches in some scenarios, varying the complexity of the computation needed for each entry and the number of entries (for the multiprocess implementation, 7 processes were used).
import timeit
import numpy as np
def some_fun(s, d, n=1):
"""A function with an adaptable complexity"""
a = s * np.ones(np.random.randint(1, 10, (2,))) / (d + 1)
for _ in range(n):
a += np.random.random(a.shape)
return a
# Code to create dictionary with only one process
setup_simple = "from __main__ import some_fun, n_first_level, n_second_level, complexity"
code_simple = """
data_dict = {}
for s in range(n_first_level):
data_dict[s] = {}
for d in range(n_second_level):
data_dict[s][d] = some_fun(s, d, n=complexity)
"""
# Code to create a dictionary with multiprocessing: we are going to use all the available cores except 1
setup_mp = """import numpy as np
import multiprocessing as mp
import itertools
from functools import partial
from __main__ import some_fun, n_first_level, n_second_level, complexity
n_processes = mp.cpu_count() - 1
# Uncomment if you want to know how many concurrent processes are you going to use
# print(f'{n_processes} concurrent processes')
"""
code_mp = """
with mp.Pool(processes=n_processes) as pool:
dict_values = pool.starmap(partial(some_fun, n=complexity), itertools.product(range(n_first_level), range(n_second_level)))
data_dict = {
k: dict(zip(range(n_second_level), dict_values[k * n_second_level: (k + 1) * n_second_level]))
for k in range(n_first_level)
}
"""
# Time the code with different settings
print('Execution time on 10 repetitions: mean [std]')
for label, complexity, n_first_level, n_second_level in (
("TRIVIAL FUNCTION", 0, 10, 10),
("TRIVIAL FUNCTION", 0, 500, 500),
("SIMPLE FUNCTION", 5, 500, 500),
("COMPLEX FUNCTION", 50, 100, 100),
("HEAVY FUNCTION", 1000, 10, 10),
):
print(f'\n{label}, {n_first_level * n_second_level} dictionary entries')
for l, t in (
('Single process', timeit.repeat(stmt=code_simple, setup=setup_simple, number=1, repeat=10)),
('Multiprocess', timeit.repeat(stmt=code_mp, setup=setup_mp, number=1, repeat=10)),
):
print(f'\t{l}: {np.mean(t):.3e} [{np.std(t):.3e}] seconds')
These are the results:
Execution time on 10 repetitions: mean [std]
TRIVIAL FUNCTION, 100 dictionary entries
Single process: 7.752e-04 [7.494e-05] seconds
Multiprocess: 1.163e-01 [2.024e-03] seconds
TRIVIAL FUNCTION, 250000 dictionary entries
Single process: 7.077e+00 [7.098e-01] seconds
Multiprocess: 1.383e+00 [7.752e-02] seconds
SIMPLE FUNCTION, 250000 dictionary entries
Single process: 1.405e+01 [1.422e+00] seconds
Multiprocess: 2.858e+00 [5.742e-01] seconds
COMPLEX FUNCTION, 10000 dictionary entries
Single process: 1.557e+00 [4.330e-02] seconds
Multiprocess: 5.383e-01 [5.330e-02] seconds
HEAVY FUNCTION, 100 dictionary entries
Single process: 3.181e-01 [5.026e-03] seconds
Multiprocess: 1.171e-01 [2.494e-03] seconds
As you can see, assuming that you have a CPU bounded computation, the multiprocess approach achieves better results in most of the scenarios. Only if you have a very light computation for each entry and/or a very limited number of entries, the single process approach should be preferred.
On the other hand, the improvement provided by multiprocessing comes with a cost: for example, if your computation for each entry uses a significant amount of memory, you could incur an OutOfMemory error, meaning that you have to improve your code and make it more complex to avoid it, finding the right balance between memory occupation and decrease in execution time. If you look around, there are a lot of questions asking how to solve memory issues caused by a non-optimal use of multiprocessing. In other words, this means that your code will be less easy to read and maintain.
To sum up, you should judge if the improvement in execution time is worthed, even if it is possible.

How do I use multiprocessing on Python to speed up a for loop?

I have this code which I would like to use multi-processing to speed up:
matrix=[]
for i in range(len(datasplit)):
matrix.append(np.array(np.asarray(datasplit[i].split()),dtype=float))
The variable "datasplit" is a comma-separated list of strings. Each string has around 50 numbers which are separated by a space. For each string, this code adds commas between these numbers instead of spaces, turns the entire string into an array, and turns each individual number into a string. This would now look like a an array of comma-separated strings where each string is 1 of the 50 numbers. The code then turns these strings into floats, so now we have an array of 50 comma separated numbers. After the code has run, printing, "matrix" would give a list of arrays, where each array has 50 comma separated numbers.
Now my problem is that the length of datasplit is huge. It has a length of ~ 10^7. This code takes around 15 minutes to run. I need to run this for 124 other samples of similar size, so I would like to use multiprocessing to speed up the run time.
How exactly would I re-write my code using multiprocessing to get it to run faster?
I appreciate any help.
The Python standard library provides two options for multiprocessing: The modules multiprocessing and concurrent.futures. The second adds a layer of abstraction onto the first. For simple map-scenarios like yours the usage is pretty simple.
Here's something to experiment with:
import numpy as np
from time import time
from os import cpu_count
from multiprocessing import Pool
from concurrent.futures import ProcessPoolExecutor
def string_to_float(string):
return np.array(np.asarray(string.split()), dtype=float)
if __name__ == '__main__':
# Example datasplit
rng = np.random.default_rng()
num_strings = 100000
datasplit = [' '.join(str(n) for n in rng.random(50))
for _ in range(num_strings)]
# Looping (sequential processing)
start = time()
matrix = []
for i in range(len(datasplit)):
matrix.append(np.array(np.asarray(datasplit[i].split()), dtype=float))
print(f'Duration of sequential processing: {time() - start:.2f} secs')
# Setting up multiprocessing
num_workers = int(0.8 * cpu_count())
chunksize = max(1, int(len(datasplit) / num_workers))
# Multiprocessing with Pool
start = time()
with Pool(num_workers) as p:
matrix = p.map(string_to_float, datasplit, chunksize)
print(f'Duration of parallel processing (Pool): {time() - start:.2f} secs')
# Multiprocessing with ProcessPoolExecutor
start = time()
with ProcessPoolExecutor(num_workers) as ppe:
matrix = list(ppe.map(string_to_float, datasplit, chunksize=chunksize))
print(f'Duration of parallel processing (PPE): {time() - start:.2f} secs')
You should play around with the num_workers and more importantly the chunksize variable. The ones I've used here worked well for me in quite a few situations. You can also let the system decide what to choose, those arguments are optional, but the results can be suboptimal, especially when the amount of data to be processed is large.
For 10 million strings (your range) and chunksize=10000 my machine produced the following results:
Duration of sequential processing: 393.78 secs
Duration of parallel processing (Pool): 73.76 secs
Duration of parallel processing (PPE): 85.82 secs
PS: Why do you use np.array(np.asarray(string.split()), dtype=float) instead of np.asarray(string.split(), dtype=float)?
This will split your tasks to multiple cores and speed up your performance by atleast 4-8x:
from multiprocessing import Pool
import os
import numpy as np
pool = Pool(os.cpu_count())
# Add your data to the datasplit variable below:
datasplit = []
results = pool.map(lambda x: np.array(np.asarray(x.split()),dtype=float), datasplit)
pool.close()
pool.join()

Best way to simultaneously run this loop?

I have the following code:
data = [2,5,3,16,2,5]
def f(x):
return 2*x
f_total = 0
for x in data:
f_total += f(x)
print(f_total/len(data))
which I want to speed up the for loop. (In reality the code is more complex and I want to run it in a super computer with many many processing cores). I have read that I can do this with the multiprocessing library where I can get python3 to simultaneously run different chunks of the loop at the same time but I am a bit lost with it.
Could you explain me how to do it with this minimal version of my program?
Thanks!
import multiprocessing
from numpy import random
"""
This mentions the number of worker threads that you want to run in parallel.
Depending on the number of cores in your system you should choose the appropriate
number of threads. When you call 'map' function it will distribute the input
values in that many parts
"""
NUM_CORES = 6
data = random.rand(100, 1)
"""
+2 so that the cores are not left idle in case a thread is waiting for I/O.
Choose by performing an empirical analysis depending on the function you are trying to compute.
It could match up to NUM_CORES as well. You can vary the chunksize as well depending on the size of 'data' that you have.
"""
NUM_THREADS = NUM_CORES+2
CHUNKSIZE = int(len(data)/(NUM_THREADS))
def f(x):
return 2*x
# This takes care of creating pool of worker threads which will be assigned the jobs
pool = multiprocessing.Pool(NUM_THREADS)
# map vs imap. If the data is large go for imap else map is also good.
it = pool.imap(f, data, chunksize=CHUNKSIZE)
f_total = 0
# Iterate and sum up the result
for value in it:
f_total += sum(value)
print(f_total/len(data))
Why choose imap over map?

Python Threading inconsistent execution time

Using the threading library to accelerate calculating each point's neighborhood in a points-cloud. By calling function CalculateAllPointsNeighbors at the bottom of the post.
The function receives a search radius, maximum number of neighbors and a number of threads to split the work on. No changes are done on any of the points. And each point stores data in its own np.ndarray cell accessed by its own index.
The following function times how long it takes N number of threads to finish calculating all points neighborhoods:
def TimeFuncThreads(classObj, uptothreads):
listTimers = []
startNum = 1
EndNum = uptothreads + 1
for i in range(startNum, EndNum):
print("Current Number of Threads to Test: ", i)
tempT = time.time()
classObj.CalculateAllPointsNeighbors(searchRadius=0.05, maxNN=25, maxThreads=i)
tempT = time.time() - tempT
listTimers.append(tempT)
PlotXY(np.arange(startNum, EndNum), listTimers)
The problem is, I've been getting very different results in each run. Here are the plots from 5 subsequent runs of the function TimeFuncThreads. The X axis is number of threads, Y is the runtime. First thing is, they look totally random. And second, there is no significant acceleration boost.
I'm confused now whether I'm using the threading library wrong and what is this behavior that I'm getting?
The function that handles the threading and the function that is being called from each thread:
def CalculateAllPointsNeighbors(self, searchRadius=0.20, maxNN=50, maxThreads=8):
threadsList = []
pointsIndices = np.arange(self.numberOfPoints)
splitIndices = np.array_split(pointsIndices, maxThreads)
for i in range(maxThreads):
threadsList.append(threading.Thread(target=self.GetPointsNeighborsByID,
args=(splitIndices[i], searchRadius, maxNN)))
[t.start() for t in threadsList]
[t.join() for t in threadsList]
def GetPointsNeighborsByID(self, idx, searchRadius=0.05, maxNN=20):
if isinstance(idx, int):
idx = [idx]
for currentPointIndex in idx:
currentPoint = self.pointsOpen3D.points[currentPointIndex]
pointNeighborhoodObject = self.GetPointNeighborsByCoordinates(currentPoint, searchRadius, maxNN)
self.pointsNeighborsArray[currentPointIndex] = pointNeighborhoodObject
self.__RotatePointNeighborhood(currentPointIndex)
It pains me to be the one to introduce you to the Python Gil. Is a very nice feature that makes parallelism using threads in Python a nightmare.
If you really want to improve your code speed, you should be looking at the multiprocessing module

Python 3 multiprocessing on 1 core gives overhead that grows with workload

I am testing the parallel capabilities of Python3, which I intend to use in my code. I observe unexpectedly slow behaviour, and so I boil down my code to the following proof of principle. Let's calculate a simple logarithmic series. Let's do it serial, and in parallel using 1 core. One would imagine that the timing for these two examples would be the same, except for a small overhead associated with initializing and closing the multiprocessing.Pool class. However, what I observe is that the overhead grows linearly with problem size, and thus the parallel solution on 1 core is significantly worse relative to the serial solution even for large inputs. Please tell me if I am doing something wrong
import time
import numpy as np
import multiprocessing
import matplotlib.pyplot as plt
def foo(x):
return sum([np.log(1 + i*x) for i in range(10)])
def serial_series(rangeMax):
return [foo(x) for x in range(rangeMax)]
def parallel_series_1core(rangeMax):
pool = multiprocessing.Pool(processes=1)
rez = pool.map(foo, tuple(range(rangeMax)))
pool.terminate()
pool.join()
return rez
nTask = [1 + i ** 2 * 1000 for i in range(1, 2)]
nTimeSerial = []
nTimeParallel = []
for taskSize in nTask:
print('TaskSize', taskSize)
start = time.time()
rez = serial_series(taskSize)
end = time.time()
nTimeSerial.append(end - start)
start = time.time()
rez = parallel_series_1core(taskSize)
end = time.time()
nTimeParallel.append(end - start)
plt.plot(nTask, nTimeSerial)
plt.plot(nTask, nTimeParallel)
plt.legend(['serial', 'parallel 1 core'])
plt.show()
Edit:
It was commented that the overhead my be due to creating multiple jobs. Here is a modification of the parallel function that should explicitly only make 1 job. I still observe linear growth of the overhead
def parallel_series_1core(rangeMax):
pool = multiprocessing.Pool(processes=1)
rez = pool.map(serial_series, [rangeMax])
pool.terminate()
pool.join()
return rez
Edit 2: Once more, the exact code that produces linear growth. It can be tested with a print statement inside the serial_series function that it is only called once for each call of parallel_series_1core.
import time
import numpy as np
import multiprocessing
import matplotlib.pyplot as plt
def foo(x):
return sum([np.log(1 + i*x) for i in range(10)])
def serial_series(rangeMax):
return [foo(i) for i in range(rangeMax)]
def parallel_series_1core(rangeMax):
pool = multiprocessing.Pool(processes=1)
rez = pool.map(serial_series, [rangeMax])
pool.terminate()
pool.join()
return rez
nTask = [1 + i ** 2 * 1000 for i in range(1, 20)]
nTimeSerial = []
nTimeParallel = []
for taskSize in nTask:
print('TaskSize', taskSize)
start = time.time()
rez1 = serial_series(taskSize)
end = time.time()
nTimeSerial.append(end - start)
start = time.time()
rez2 = parallel_series_1core(taskSize)
end = time.time()
nTimeParallel.append(end - start)
plt.plot(nTask, nTimeSerial)
plt.plot(nTask, nTimeParallel)
plt.plot(nTask, [i / j for i,j in zip(nTimeParallel, nTimeSerial)])
plt.legend(['serial', 'parallel 1 core', 'ratio'])
plt.show()
When you use Pool.map() you're essentially telling it to split the passed iterable into jobs over all available sub-processes (which is one in your case) - the larger the iterable the more 'jobs' are created on the first call. That's what initially adds a huge (trumped only by the process creation itself), albeit linear overhead.
Since sub-processes do not share memory, for all changing data on POSIX systems (due to forking) and all data (even static) on Windows it needs to pickle it on one end and unpickle it on the other. Plus it needs time to clear out the process stack for the next job, plus there is an overhead in system thread switching (that's out of your control, you'd have to mess with the system's scheduler to reduce that one).
For simple/quick tasks a single process will always trump multiprocessing.
UPDATE - As I was saying above, the additional overhead comes from the fact that for any data exchange between processes Python transparently does pickling/unpickling routine. Since the list you return from the serial_series() function grows in size over time, so does the performance penalty for pickling/unpickling. Here's a simple demonstration of it based on your code:
import math
import pickle
import sys
import time
# multi-platform precision timer
get_timer = time.clock if sys.platform == "win32" else time.time
def foo(x): # logic/computation function
return sum([math.log(1 + i*x) for i in range(10)])
def serial_series(max_range): # main sub-process function
return [foo(i) for i in range(max_range)]
def serial_series_slave(max_range): # subprocess interface
return pickle.dumps(serial_series(pickle.loads(max_range)))
def serial_series_master(max_range): # main process interface
return pickle.loads(serial_series_slave(pickle.dumps(max_range)))
tasks = [1 + i ** 2 * 1000 for i in range(1, 20)]
simulated_times = []
for task in tasks:
print("Simulated task size: {}".format(task))
start = get_timer()
res = serial_series_master(task)
simulated_times.append((task, get_timer() - start))
At the end, simulated_times will contain something like:
[(1001, 0.010015994115533963), (4001, 0.03402641167313844), (9001, 0.06755546622419131),
(16001, 0.1252664260421834), (25001, 0.18815836740279515), (36001, 0.28339434475444325),
(49001, 0.3757235840503601), (64001, 0.4813749807557435), (81001, 0.6115452710446636),
(100001, 0.7573718332506543), (121001, 0.9228750064147522), (144001, 1.0909038813527427),
(169001, 1.3017281342479343), (196001, 1.4830192955746764), (225001, 1.7117389965616931),
(256001, 1.9392146632682739), (289001, 2.19192682050668), (324001, 2.4497541011649187),
(361001, 2.7481495578097466)]
showing clear greater-than-linear processing time increase as the list grows bigger. This is what essentially happens with multiprocessing - if your sub-process function didn't return anything it would end up considerably faster.
If you have a large amount of data you need to share among processes, I'd suggest you to use some in-memory database (like Redis) and have your sub-processes connect to it to store/retrieve data.

Categories

Resources