I am running 100 iterations of the function model, so I tried using multiprocessing to distribute the tasks. To collect the final output I tried using queues, but it takes too much time, which defeats the purpose of multiprocessing. How can I solve this problem?
def model(X,Y):
    ada_clf={}
    pred1={}
    auc_final=[]
    for iteration in range(100):
        ada_clf[iteration] = AdaBoostClassifier(DecisionTreeClassifier(),n_estimators=1000,learning_rate=0.001)
        ada_clf[iteration].fit(X,Y)
        pred1[iteration]=ada_clf[iteration].predict(test1)
    individuallabelsfromada1=[]
    for i in range(len(test1)):
        individuallabelsfromada1.append([])
        for j in range(100):
            individuallabelsfromada1[i].append(pred1[j][i])
    final_labels_ada1=[]
    for each in individuallabelsfromada1:
        final_labels_ada1.append(find_majority(each))
    final=pd.Series(final_labels_ada1)
    temp_arr=np.array(final)
    total_labels2=pd.Series(temp_arr)
    fpr, tpr, thresholds = roc_curve(y_test, total_labels2, pos_label=1)
    auc_final.append(auc(fpr,tpr))
    q.put(total_labels2)
    q1.put(auc_final)
    q2.put(ada_clf)
    print('done')

overall_labels={}
final_auc={}
final_ada_clf={}
processes=[]
q=Queue()
q1=Queue()
q2=Queue()
for iteration in range(100):
    if __name__=='__main__':
        p=multiprocessing.Process(target=model,args=(x_train,y_labels,q,q1,q2,))
        overall_labels[iteration]=q.get()
        final_auc[iteration]=q1.get()
        final_ada_clf[iteration]=q2.get()
        p.start()
        processes.append(p)
for each in processes:
    each.join()
Below is my edited version, but it returns only a single output. I tried returning multiple outputs but could not get that to work, so I settled for a single output, i.e. total_labels2:
##code before this is same as before, only thing changed is arguments of model from def model(X,Y) to def model(repeat,X,Y)
    total_labels2 = pd.Series(temp_arr)
    return (repeat,total_labels2)

def get_result(total_labels2):
    global testover_forall
    testover_forall.append(total_labels2)

if __name__ == '__main__':
    import multiprocessing as mp
    testover_forall = []
    pool = mp.Pool(40)
    for repeat in range(100):
        # args must be a single tuple of positional arguments
        pool.apply_async(bound_model, args=(repeat, x_train, y_train), callback=get_result)
    pool.close()
    pool.join()
    repetations_index=[]
    for i in range(100):
        repetations_index.append(testover_forall[i][0])
    final_last_labels = {}
    for i in range(100):
        temp = str(i)
        final_last_labels[temp] = testover_forall[repetations_index[i]][1]
    totally_last_labels=[]
    for each in final_last_labels:
        temp=np.array(final_last_labels[each])
        totally_last_labels.append(temp)
See my comments (actually questions) to your post.
You should be using a multiprocessing pool to limit the number of processes that you create to the number of CPU cores that you have. This will also make it easier to get return values back from your model function instead of writing results to 3 different queues (and you could have written a tuple of 3 values to just one queue). You will, of course, require other import statements and code. Given your use of numpy and other libraries, which may be implemented in C Language, you could also retry running this using threading to see if that helps or hurts performance. Do this by replacing ProcessPoolExecutor with ThreadPoolExecutor in the two places it is referenced.
Note
Any changes that model makes to passed arguments X and Y will not be reflected back to the main process. So if model is called repeatedly with the same arguments over and over, as it appears to be, it's not clear whether each call will return different values, especially if the calls are being done in parallel.
from concurrent.futures import ProcessPoolExecutor

def model(X,Y):
    ada_clf={}
    pred1={}
    auc_final=[]
    for iteration in range(100):
        ada_clf[iteration] = AdaBoostClassifier(DecisionTreeClassifier(),n_estimators=1000,learning_rate=0.001)
        ada_clf[iteration].fit(X,Y)
        pred1[iteration]=ada_clf[iteration].predict(test1)
    individuallabelsfromada1=[]
    for i in range(len(test1)):
        individuallabelsfromada1.append([])
        for j in range(100):
            individuallabelsfromada1[i].append(pred1[j][i])
    final_labels_ada1=[]
    for each in individuallabelsfromada1:
        final_labels_ada1.append(find_majority(each))
    final=pd.Series(final_labels_ada1)
    temp_arr=np.array(final)
    total_labels2=pd.Series(temp_arr)
    fpr, tpr, thresholds = roc_curve(y_test, total_labels2, pos_label=1)
    auc_final.append(auc(fpr,tpr))
    #q.put(total_labels2)
    #q1.put(auc_final)
    #q2.put(ada_clf)
    return total_labels2, auc_final, ada_clf
    #print('done')

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        futures = [executor.submit(model, x_train, y_labels) for iteration in range(100)]
        # simple lists will suffice:
        overall_labels = []
        final_auc = []
        final_ada_clf = []
        for future in futures:
            # get return value and store
            total_labels2, auc_final, ada_clf = future.result()
            overall_labels.append(total_labels2)
            final_auc.append(auc_final)
            final_ada_clf.append(ada_clf)
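As a minimal, self-contained sketch of the same pattern: fake_model is a hypothetical stand-in for model (it trains nothing), and run_repetitions is an illustrative helper; neither comes from the original code. The executor class is passed in as a parameter, so switching between ProcessPoolExecutor and ThreadPoolExecutor is a one-word change.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def fake_model(x, y):
    """Hypothetical stand-in for model(): returns three values as one tuple."""
    labels = [xi > 0 for xi in x]                             # pretend predictions
    score = sum(a == b for a, b in zip(labels, y)) / len(y)   # pretend metric
    return labels, score, {"n_samples": len(x)}               # pretend fitted models


def run_repetitions(executor_cls, x, y, repetitions=4):
    """Submit the same call several times and unpack each returned tuple."""
    overall_labels, final_scores, final_models = [], [], []
    with executor_cls() as executor:
        futures = [executor.submit(fake_model, x, y) for _ in range(repetitions)]
        for future in futures:  # submission order keeps the three lists aligned
            labels, score, models = future.result()
            overall_labels.append(labels)
            final_scores.append(score)
            final_models.append(models)
    return overall_labels, final_scores, final_models


if __name__ == "__main__":
    x, y = [-1.0, 2.0, 3.0], [False, True, False]
    for cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        labels, scores, models = run_repetitions(cls, x, y)
        print(cls.__name__, scores)
```

Which executor is faster for the real model depends on how much of the work releases the GIL, so it is worth timing both.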
Update
It wasn't clear from the problem specification that the returned results depend on a random number generator. If successive calls to the worker function, model, do not share a single random number generator across all processes in the multiprocessing pool, then the multiprocessing implementation will clearly return different results than the non-multiprocessing version. It is also not clear from the code provided where the random number generator is being used; it may be in library code that you have no access to. If that is the case, you have two options: (1) Use multithreading instead by changing the import statement as I have indicated in the code below; you may still achieve performance benefits, as I have already mentioned. (2) Update the signature of model as follows. You will be passed a new argument, random_generator, that currently supports two methods, randint (like random.randint) and random (like random.random), although it should be easy enough to modify the code if you need a different method from module random. You will use this random number generator in place of module random if you are able to. But note that this random generator will run much more slowly than the standard one; this is the price you pay.
Since we are also adding a repetition argument to model (it now has to be the final argument -- note the updated signature below), we can now use method map (no need to use a callback):
def model(X,Y, random_generator, repetition):
    ...
etc.

from multiprocessing import Pool
# or use the following import instead to use multithreading (but then use standard random generator):
# from multiprocessing.dummy import Pool
import random
from functools import partial
from multiprocessing.managers import BaseManager

class RandomGeneratorManager(BaseManager):
    pass

class RandomGenerator:
    def __init__(self):
        random.seed(0)
    def randint(self, a, b):
        return random.randint(a, b)
    def random(self):
        return random.random()
    # add other functions if needed

if __name__ == '__main__':
    RandomGeneratorManager.register('RandomGenerator', RandomGenerator)
    with RandomGeneratorManager() as manager:
        random_generator = manager.RandomGenerator()
        # why 40? why not use default, which is the number of cpu cores you have?:
        pool = Pool(40)
        worker = partial(model, x_train, y_labels, random_generator)
        results = pool.map(worker, range(100))
I am currently generating a nested dictionary that saves some arrays by using a nested for loop. Unfortunately, it takes quite some time; I realized that the server I am working on has a few cores available, so I was wondering if Python's multiprocessing library could be helpful to speed up the creation of the dictionary.
The nested for loop looks something like this (the actual computation is heavier and more complex):
import numpy as np

data_dict = {}
for s in range(1,5):
    data_dict[s] = {}
    for d in range(1,5):
        if s * d > 4:
            data_dict[s][d] = np.zeros((s,d))
        else:
            data_dict[s][d] = np.ones((s,d))
So this is what I tried:
from multiprocessing import Pool
import numpy as np

data_dict = {}

def process():
    #sci=fits.open('{}.fits'.format(name))
    for s in range(1,5):
        data_dict[s] = {}
        for d in range(1,5):
            if s * d > 4:
                data_dict[s][d] = np.zeros((s,d))
            else:
                data_dict[s][d] = np.ones((s,d))

if __name__ == '__main__':
    pool = Pool() # Create a multiprocessing Pool
    pool.map(process)
But pool.map (last line) seems to require an iterable, which I'm not sure what to insert there.
In my opinion, the real problem is what kind of processing is needed to compute entries of the dictionary and how many entries are there.
The kind of processing is essential to understand whether multiprocessing can significantly speed up the creation of the dictionary. If your computation is I/O bound, you should use multithreading, while if it's CPU bound you should use multiprocessing. You can find more about this here.
Assuming that the value of each entry can be computed independently and that this computation is CPU bound, let's benchmark the difference between single process and multiprocess implementation (based on multiprocessing library).
The following code is used to test the two approaches in some scenarios, varying the complexity of the computation needed for each entry and the number of entries (for the multiprocess implementation, 7 processes were used).
import timeit
import numpy as np

def some_fun(s, d, n=1):
    """A function with an adaptable complexity"""
    a = s * np.ones(np.random.randint(1, 10, (2,))) / (d + 1)
    for _ in range(n):
        a += np.random.random(a.shape)
    return a

# Code to create dictionary with only one process
setup_simple = "from __main__ import some_fun, n_first_level, n_second_level, complexity"
code_simple = """
data_dict = {}
for s in range(n_first_level):
    data_dict[s] = {}
    for d in range(n_second_level):
        data_dict[s][d] = some_fun(s, d, n=complexity)
"""

# Code to create a dictionary with multiprocessing: we are going to use all the available cores except 1
setup_mp = """import numpy as np
import multiprocessing as mp
import itertools
from functools import partial
from __main__ import some_fun, n_first_level, n_second_level, complexity
n_processes = mp.cpu_count() - 1
# Uncomment if you want to know how many concurrent processes are you going to use
# print(f'{n_processes} concurrent processes')
"""
code_mp = """
with mp.Pool(processes=n_processes) as pool:
    dict_values = pool.starmap(partial(some_fun, n=complexity), itertools.product(range(n_first_level), range(n_second_level)))
data_dict = {
    k: dict(zip(range(n_second_level), dict_values[k * n_second_level: (k + 1) * n_second_level]))
    for k in range(n_first_level)
}
"""

# Time the code with different settings
print('Execution time on 10 repetitions: mean [std]')
for label, complexity, n_first_level, n_second_level in (
    ("TRIVIAL FUNCTION", 0, 10, 10),
    ("TRIVIAL FUNCTION", 0, 500, 500),
    ("SIMPLE FUNCTION", 5, 500, 500),
    ("COMPLEX FUNCTION", 50, 100, 100),
    ("HEAVY FUNCTION", 1000, 10, 10),
):
    print(f'\n{label}, {n_first_level * n_second_level} dictionary entries')
    for l, t in (
        ('Single process', timeit.repeat(stmt=code_simple, setup=setup_simple, number=1, repeat=10)),
        ('Multiprocess', timeit.repeat(stmt=code_mp, setup=setup_mp, number=1, repeat=10)),
    ):
        print(f'\t{l}: {np.mean(t):.3e} [{np.std(t):.3e}] seconds')
These are the results:
Execution time on 10 repetitions: mean [std]
TRIVIAL FUNCTION, 100 dictionary entries
Single process: 7.752e-04 [7.494e-05] seconds
Multiprocess: 1.163e-01 [2.024e-03] seconds
TRIVIAL FUNCTION, 250000 dictionary entries
Single process: 7.077e+00 [7.098e-01] seconds
Multiprocess: 1.383e+00 [7.752e-02] seconds
SIMPLE FUNCTION, 250000 dictionary entries
Single process: 1.405e+01 [1.422e+00] seconds
Multiprocess: 2.858e+00 [5.742e-01] seconds
COMPLEX FUNCTION, 10000 dictionary entries
Single process: 1.557e+00 [4.330e-02] seconds
Multiprocess: 5.383e-01 [5.330e-02] seconds
HEAVY FUNCTION, 100 dictionary entries
Single process: 3.181e-01 [5.026e-03] seconds
Multiprocess: 1.171e-01 [2.494e-03] seconds
As you can see, assuming that you have a CPU-bound computation, the multiprocess approach achieves better results in most scenarios. The single-process approach should be preferred only if the computation for each entry is very light and/or the number of entries is very limited.
On the other hand, the improvement provided by multiprocessing comes with a cost: for example, if your computation for each entry uses a significant amount of memory, you could incur an OutOfMemory error, meaning that you have to improve your code and make it more complex to avoid it, finding the right balance between memory usage and decrease in execution time. If you look around, there are a lot of questions asking how to solve memory issues caused by a non-optimal use of multiprocessing. In other words, your code will be harder to read and maintain.
To sum up, you should judge whether the improvement in execution time is worth it, even when it is possible.
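For the original question of which iterable to hand to the pool, one possible sketch is to enumerate every (s, d) pair up front, compute one entry per pair, and rebuild the nested dictionary afterwards. The helper names compute_entry and assemble are illustrative, and plain nested lists stand in for the numpy arrays so that the example needs only the standard library.

```python
import itertools
from multiprocessing import Pool


def compute_entry(s, d):
    """Stand-in for the heavy per-entry computation (nested lists, not arrays)."""
    fill = 0.0 if s * d > 4 else 1.0
    return [[fill] * d for _ in range(s)]


def assemble(keys, values):
    """Rebuild the nested {s: {d: value}} dictionary from the flat results."""
    data_dict = {}
    for (s, d), value in zip(keys, values):
        data_dict.setdefault(s, {})[d] = value
    return data_dict


if __name__ == "__main__":
    # The iterable given to the pool: every (s, d) combination.
    keys = list(itertools.product(range(1, 5), range(1, 5)))
    with Pool() as pool:
        values = pool.starmap(compute_entry, keys)  # one task per (s, d) pair
    data_dict = assemble(keys, values)
    print(data_dict[2][3])
```

The worker returns its value instead of writing into a global dictionary, because each worker process has its own copy of module-level state.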
1. I have a function var. I want to know the best possible way to run the loop within this function quickly by multiprocessing/parallel processing by utilizing all the processors, cores, threads, and RAM memory the system has.
import numpy
from pysheds.grid import Grid

xs = 82.1206, 72.4542, 65.0431, 83.8056, 35.6744
ys = 25.2111, 17.9458, 13.8844, 10.0833, 24.8306
a = r'/home/test/image1.tif'
b = r'/home/test/image2.tif'

def var(interest):
    variable_avg = []
    for (x,y) in zip(xs,ys):
        grid = Grid.from_raster(interest, data_name='map')
        grid.catchment(data='map', x=x, y=y, out_name='catch')
        variable = grid.view('catch', nodata=numpy.nan)
        variable = numpy.array(variable)
        variablemean = (variable).mean()
        variable_avg.append(variablemean)
    return(variable_avg)
2. It would be great if I could run both the function var and the loop inside it in parallel for the multiple parameters given to the function.
e.g. var(a) and var(b) at the same time, since that would take much less time than parallelizing the loop alone.
Ignore 2 if it does not make sense.
TLDR:
You can use the multiprocessing library to run your var function in parallel. However, as written you likely don't make enough calls to var for multiprocessing to have a performance benefit because of its overhead. If all you need to do is run those two calls, running in serial is likely the fastest you'll get. However, if you need to make a lot of calls, multiprocessing can help you out.
We'll need to use a process pool to run this in parallel; threads won't work here because Python's global interpreter lock will prevent us from achieving true parallelism. The drawback of process pools is that processes are heavyweight to spin up. In the example of just running two calls to var, the time to create the pool overwhelms the time spent running var itself.
To illustrate this, let's use a process pool and use asyncio to run calls to var in parallel and compare it to just running things sequentially. Note: to run this example I used an image from the Pysheds library, https://github.com/mdbartos/pysheds/tree/master/data. If your image is much larger, the below may not hold true.
import functools
import time
from concurrent.futures.process import ProcessPoolExecutor
import asyncio

a = 'diem.tif'
xs = 10, 20, 30, 40, 50
ys = 10, 20, 30, 40, 50

async def main():
    loop = asyncio.get_event_loop()
    pool_start = time.time()
    with ProcessPoolExecutor() as pool:
        task_one = loop.run_in_executor(pool, functools.partial(var, a))
        task_two = loop.run_in_executor(pool, functools.partial(var, a))
        results = await asyncio.gather(task_one, task_two)
        pool_end = time.time()
        print(f'Process pool took {pool_end-pool_start}')
    serial_start = time.time()
    result_one = var(a)
    result_two = var(a)
    serial_end = time.time()
    print(f'Running in serial took {serial_end - serial_start}')

if __name__ == "__main__":
    asyncio.run(main())
Running the above on my machine (a 2.4 GHz 8-Core Intel Core i9) I get the following output:
Process pool took 1.7581260204315186
Running in serial took 0.32335805892944336
In this example, a process pool is over five times slower! This is due to the overhead of creating and managing multiple processes. That said, if you need to call var more than just a few times, a process pool may make more sense. Let's adapt this to run var 100 times and compare the results:
async def main():
    loop = asyncio.get_event_loop()
    pool_start = time.time()
    tasks = []
    with ProcessPoolExecutor() as pool:
        for _ in range(100):
            tasks.append(loop.run_in_executor(pool, functools.partial(var, a)))
        results = await asyncio.gather(*tasks)
        pool_end = time.time()
        print(f'Process pool took {pool_end-pool_start}')
    serial_start = time.time()
    for _ in range(100):
        result = var(a)
    serial_end = time.time()
    print(f'Running in serial took {serial_end - serial_start}')
Running 100 times, I get the following output:
Process pool took 3.442288875579834
Running in serial took 13.769982099533081
In this case, running in a process pool is about 4x faster. You may also wish to try running each iteration of your loop concurrently. You can do this by creating a function that processes one x,y coordinate at a time and then run each point you want to examine in a process pool:
def process_poi(interest, x, y):
    grid = Grid.from_raster(interest, data_name='map')
    grid.catchment(data='map', x=x, y=y, out_name='catch')
    variable = grid.view('catch', nodata=np.nan)
    variable = np.array(variable)
    return variable.mean()

async def var_loop_async(interest, pool, loop):
    tasks = []
    for (x,y) in zip(xs,ys):
        function_call = functools.partial(process_poi, interest, x, y)
        tasks.append(loop.run_in_executor(pool, function_call))
    return await asyncio.gather(*tasks)

async def main():
    loop = asyncio.get_event_loop()
    pool_start = time.time()
    tasks = []
    with ProcessPoolExecutor() as pool:
        for _ in range(100):
            tasks.append(var_loop_async(a, pool, loop))
        results = await asyncio.gather(*tasks)
        pool_end = time.time()
        print(f'Process pool took {pool_end-pool_start}')
    serial_start = time.time()
In this case I get Process pool took 3.2950568199157715 - so not really any faster than our first version with one process per each call of var. This is likely because the limiting factor at this point is how many cores we have available on our CPU, splitting our work into smaller increments does not add much value.
That said, if you have 1000 x and y coordinates you wish to examine across two images, this last approach may yield a performance gain.
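For the many-coordinates case, a rough sketch without asyncio could look like the following. mean_for_point is a hypothetical stand-in for the per-coordinate catchment computation (it does not call pysheds), and the chunksize value of 50 is an arbitrary assumption that you would want to tune.

```python
from concurrent.futures import ProcessPoolExecutor


def mean_for_point(x, y):
    """Hypothetical stand-in for the per-coordinate catchment computation."""
    return (x + y) / 2.0


def means_for_points(xs, ys, max_workers=None, chunksize=50):
    """Process every (x, y) pair in a process pool, preserving input order."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # chunksize sends several points per inter-process message,
        # which reduces overhead when each point is cheap to compute.
        return list(pool.map(mean_for_point, xs, ys, chunksize=chunksize))


if __name__ == "__main__":
    xs = [float(i) for i in range(1000)]
    ys = [float(2 * i) for i in range(1000)]
    results = means_for_points(xs, ys)
    print(len(results), results[:3])
```

With expensive per-point work, a smaller chunksize balances load better; with cheap per-point work, a larger one cuts messaging overhead.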
I think this is a reasonable and straightforward way of speeding up your code by merely parallelizing only the main loop. You can saturate your cores with this, so there is no need to parallelize also for the interest variable. I can't test the code, so I assume that your function is correct, I have just encoded the loop in a new function and parallelized it in var().
from multiprocessing import Pool

def var(interest,xs,ys):
    grid = Grid.from_raster(interest, data_name='map')
    with Pool(4) as p: #uses 4 cores, adjust this as you need
        variable_avg = p.starmap(loop, [(x,y,grid) for x,y in zip(xs,ys)])
    return variable_avg

def loop(x, y, grid):
    grid.catchment(data='map', x=x, y=y, out_name='catch')
    variable = grid.view('catch', nodata=numpy.nan)
    variable = numpy.array(variable)
    return variable.mean()
I am testing the parallel capabilities of Python3, which I intend to use in my code. I observe unexpectedly slow behaviour, and so I boil down my code to the following proof of principle. Let's calculate a simple logarithmic series. Let's do it serial, and in parallel using 1 core. One would imagine that the timing for these two examples would be the same, except for a small overhead associated with initializing and closing the multiprocessing.Pool class. However, what I observe is that the overhead grows linearly with problem size, and thus the parallel solution on 1 core is significantly worse relative to the serial solution even for large inputs. Please tell me if I am doing something wrong
import time
import numpy as np
import multiprocessing
import matplotlib.pyplot as plt

def foo(x):
    return sum([np.log(1 + i*x) for i in range(10)])

def serial_series(rangeMax):
    return [foo(x) for x in range(rangeMax)]

def parallel_series_1core(rangeMax):
    pool = multiprocessing.Pool(processes=1)
    rez = pool.map(foo, tuple(range(rangeMax)))
    pool.terminate()
    pool.join()
    return rez

nTask = [1 + i ** 2 * 1000 for i in range(1, 2)]
nTimeSerial = []
nTimeParallel = []
for taskSize in nTask:
    print('TaskSize', taskSize)
    start = time.time()
    rez = serial_series(taskSize)
    end = time.time()
    nTimeSerial.append(end - start)
    start = time.time()
    rez = parallel_series_1core(taskSize)
    end = time.time()
    nTimeParallel.append(end - start)

plt.plot(nTask, nTimeSerial)
plt.plot(nTask, nTimeParallel)
plt.legend(['serial', 'parallel 1 core'])
plt.show()
Edit:
It was commented that the overhead may be due to creating multiple jobs. Here is a modification of the parallel function that should explicitly make only 1 job. I still observe linear growth of the overhead.
def parallel_series_1core(rangeMax):
    pool = multiprocessing.Pool(processes=1)
    rez = pool.map(serial_series, [rangeMax])
    pool.terminate()
    pool.join()
    return rez
Edit 2: Once more, the exact code that produces linear growth. It can be tested with a print statement inside the serial_series function that it is only called once for each call of parallel_series_1core.
import time
import numpy as np
import multiprocessing
import matplotlib.pyplot as plt

def foo(x):
    return sum([np.log(1 + i*x) for i in range(10)])

def serial_series(rangeMax):
    return [foo(i) for i in range(rangeMax)]

def parallel_series_1core(rangeMax):
    pool = multiprocessing.Pool(processes=1)
    rez = pool.map(serial_series, [rangeMax])
    pool.terminate()
    pool.join()
    return rez

nTask = [1 + i ** 2 * 1000 for i in range(1, 20)]
nTimeSerial = []
nTimeParallel = []
for taskSize in nTask:
    print('TaskSize', taskSize)
    start = time.time()
    rez1 = serial_series(taskSize)
    end = time.time()
    nTimeSerial.append(end - start)
    start = time.time()
    rez2 = parallel_series_1core(taskSize)
    end = time.time()
    nTimeParallel.append(end - start)

plt.plot(nTask, nTimeSerial)
plt.plot(nTask, nTimeParallel)
plt.plot(nTask, [i / j for i,j in zip(nTimeParallel, nTimeSerial)])
plt.legend(['serial', 'parallel 1 core', 'ratio'])
plt.show()
When you use Pool.map() you're essentially telling it to split the passed iterable into jobs over all available sub-processes (which is one in your case) - the larger the iterable the more 'jobs' are created on the first call. That's what initially adds a huge (trumped only by the process creation itself), albeit linear overhead.
Since sub-processes do not share memory, for all changing data on POSIX systems (due to forking) and all data (even static) on Windows it needs to pickle it on one end and unpickle it on the other. Plus it needs time to clear out the process stack for the next job, plus there is an overhead in system thread switching (that's out of your control, you'd have to mess with the system's scheduler to reduce that one).
For simple/quick tasks a single process will always trump multiprocessing.
UPDATE - As I was saying above, the additional overhead comes from the fact that for any data exchange between processes Python transparently does pickling/unpickling routine. Since the list you return from the serial_series() function grows in size over time, so does the performance penalty for pickling/unpickling. Here's a simple demonstration of it based on your code:
import math
import pickle
import time

# high-resolution timer (time.clock was removed in Python 3.8)
get_timer = time.perf_counter

def foo(x): # logic/computation function
    return sum([math.log(1 + i*x) for i in range(10)])

def serial_series(max_range): # main sub-process function
    return [foo(i) for i in range(max_range)]

def serial_series_slave(max_range): # subprocess interface
    return pickle.dumps(serial_series(pickle.loads(max_range)))

def serial_series_master(max_range): # main process interface
    return pickle.loads(serial_series_slave(pickle.dumps(max_range)))

tasks = [1 + i ** 2 * 1000 for i in range(1, 20)]
simulated_times = []
for task in tasks:
    print("Simulated task size: {}".format(task))
    start = get_timer()
    res = serial_series_master(task)
    simulated_times.append((task, get_timer() - start))
At the end, simulated_times will contain something like:
[(1001, 0.010015994115533963), (4001, 0.03402641167313844), (9001, 0.06755546622419131),
(16001, 0.1252664260421834), (25001, 0.18815836740279515), (36001, 0.28339434475444325),
(49001, 0.3757235840503601), (64001, 0.4813749807557435), (81001, 0.6115452710446636),
(100001, 0.7573718332506543), (121001, 0.9228750064147522), (144001, 1.0909038813527427),
(169001, 1.3017281342479343), (196001, 1.4830192955746764), (225001, 1.7117389965616931),
(256001, 1.9392146632682739), (289001, 2.19192682050668), (324001, 2.4497541011649187),
(361001, 2.7481495578097466)]
showing a clear, roughly linear increase in processing time as the list grows bigger. This is essentially what happens with multiprocessing: if your sub-process function didn't return anything, it would end up considerably faster.
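To make the size of the returned payload concrete, here is a small sketch that compares the pickled size of the full result list with that of a single reduced value. The helper names series_total and payload_bytes are illustrative, and returning only a sum is an assumption about what the caller needs, not something stated in the question.

```python
import math
import pickle


def foo(x):
    return sum(math.log(1 + i * x) for i in range(10))


def series(n):
    """Full result list: what the worker sends back to the parent."""
    return [foo(i) for i in range(n)]


def series_total(n):
    """Reduced result: a single float, so the payload size stays constant."""
    return sum(foo(i) for i in range(n))


def payload_bytes(obj):
    """Number of bytes that would cross the process boundary for obj."""
    return len(pickle.dumps(obj))


if __name__ == "__main__":
    for n in (1001, 4001, 16001):
        print(n, payload_bytes(series(n)), payload_bytes(series_total(n)))
```

If the parent only needs an aggregate, reducing inside the worker keeps the transfer cost independent of the problem size.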
If you have a large amount of data you need to share among processes, I'd suggest using an in-memory database (like Redis) and having your sub-processes connect to it to store/retrieve data.
I'm generating 100 random int matrices of size 1000x1000. I'm using the multiprocessing module to calculate the eigen values of the 100 matrices.
The code is given below:
import timeit
import numpy as np
import multiprocessing as mp

def calEigen():
    S, U = np.linalg.eigh(a)

def multiprocess(processes):
    pool = mp.Pool(processes=processes)
    #Start timing here as I don't want to include time taken to initialize the processes
    start = timeit.default_timer()
    results = [pool.apply_async(calEigen, args=())]
    stop = timeit.default_timer()
    print("Process:", processes, stop - start)
    results = [p.get() for p in results]
    results.sort() # to sort the results

if __name__ == "__main__":
    global a
    a=[]
    for i in range(0,100):
        a.append(np.random.randint(1,100,size=(1000,1000)))
    #Print execution time without multiprocessing
    start = timeit.default_timer()
    calEigen()
    stop = timeit.default_timer()
    print(stop - start)
    #With 1 process
    multiprocess(1)
    #With 2 processes
    multiprocess(2)
    #With 3 processes
    multiprocess(3)
    #With 4 processes
    multiprocess(4)
The output is
0.510247945786
('Process:', 1, 5.1021575927734375e-05)
('Process:', 2, 5.698204040527344e-05)
('Process:', 3, 8.320808410644531e-05)
('Process:', 4, 7.200241088867188e-05)
Another iteration showed this output:
69.7296020985
('Process:', 1, 0.0009050369262695312)
('Process:', 2, 0.023727893829345703)
('Process:', 3, 0.0003509521484375)
('Process:', 4, 0.057518959045410156)
My questions are these:
Why doesn't the execution time decrease as the number of processes increases? Am I using the multiprocessing module correctly?
Am I calculating the execution time correctly?
I have edited the code given in the comments below. I want the serial and multiprocessing functions to find the eigen values for the same list of 100 matrices. The edited code is-
import numpy as np
import time
from multiprocessing import Pool

a=[]
for i in range(0,100):
    a.append(np.random.randint(1,100,size=(1000,1000)))

def serial(z):
    result = []
    start_time = time.time()
    for i in range(0,100):
        result.append(np.linalg.eigh(z[i])) #calculate eigen values and append to result list
    end_time = time.time()
    print("Single process took :", end_time - start_time, "seconds")

def caleigen(c):
    result = []
    result.append(np.linalg.eigh(c)) #calculate eigenvalues and append to result list
    return result

def mp(x, z): # x: number of workers, z: list of matrices
    start_time = time.time()
    with Pool(processes=x) as pool: # start a pool of x workers
        result = pool.map_async(caleigen, z) # distribute work to workers
        result = result.get() # collect result from MapResult object
    end_time = time.time()
    print("Multiprocessing took:", end_time - start_time, "seconds" )

if __name__ == "__main__":
    serial(a)
    mp(1,a)
    mp(2,a)
    mp(3,a)
    mp(4,a)
There is no reduction in the time as the number of processes increases. Where am I going wrong? Does multiprocessing divide the list into chunks for the processes or do I have to do the division?
You're not using the multiprocessing module correctly. As @dopstar pointed out, you're not dividing your task. There is only one task for the process pool, so no matter how many workers you assign, only one will get the job. As for your second question, I didn't use timeit to measure process time precisely. I just use the time module to get a crude sense of how fast things are. It serves the purpose most of the time, though. If I understand what you're trying to do correctly, this should be the single-process version of your code:
import numpy as np
import time

result = []
start_time = time.time()
for i in range(100):
    a = np.random.randint(1, 100, size=(1000,1000)) #generate random matrix
    result.append(np.linalg.eigh(a)) #calculate eigen values and append to result list
end_time = time.time()
print("Single process took :", end_time - start_time, "seconds")
The single-process version took 15.27 seconds on my computer. Below is the multiprocess version, which took only 0.46 seconds on my computer. I also included the single-process version for comparison. (The single-process version has to be enclosed in the if block as well and placed after the multiprocess version.) Because you would like to repeat your calculation 100 times, it is a lot easier to create a pool of workers and let them take on unfinished tasks automatically than to manually start each process and specify what each process should do. In my code, the argument to the caleigen call merely keeps track of how many times the task has been executed. Finally, map_async is generally faster than apply_async, with its downsides being that it consumes slightly more memory and takes only one argument per function call. I use map_async rather than map here; note that map is equivalent to map_async followed by get, so both return results in input order, and map_async differs only in returning a result object immediately.
from multiprocessing import Pool
import numpy as np
import time

def caleigen(x): # define work for each worker
    a = np.random.randint(1,100,size=(1000,1000))
    S, U = np.linalg.eigh(a)
    return S, U

if __name__ == "__main__": # must be "__main__", otherwise this block never runs
    start_time = time.time()
    with Pool(processes=4) as pool: # start a pool of 4 workers
        result = pool.map_async(caleigen, range(100)) # distribute work to workers
        result = result.get() # collect result from MapResult object
    end_time = time.time()
    print("Multiprocessing took:", end_time - start_time, "seconds" )

    # Run the single process version for comparison. This has to be within the if block as well.
    result = []
    start_time = time.time()
    for i in range(100):
        a = np.random.randint(1, 100, size=(1000,1000)) #generate random matrix
        result.append(np.linalg.eigh(a)) #calculate eigen values and append to result list
    end_time = time.time()
    print("Single process took :", end_time - start_time, "seconds")
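On the timing question raised in this thread, the sketch below illustrates why stopping the clock right after apply_async measures only the submission and not the work itself. slow_task is a hypothetical stand-in for an expensive call such as np.linalg.eigh, and time_submit_and_total is an illustrative helper that works with any pool object offering apply_async.

```python
import time
from multiprocessing import Pool


def slow_task(seconds):
    """Hypothetical stand-in for an expensive computation such as eigh."""
    time.sleep(seconds)
    return seconds


def time_submit_and_total(pool, seconds=0.2, n_tasks=2):
    """Return (time to submit the tasks, time until all results are back)."""
    start = time.perf_counter()
    pending = [pool.apply_async(slow_task, (seconds,)) for _ in range(n_tasks)]
    submitted = time.perf_counter() - start  # apply_async returns immediately
    for p in pending:
        p.get()                              # blocks until that task has finished
    total = time.perf_counter() - start
    return submitted, total


if __name__ == "__main__":
    with Pool(processes=2) as pool:
        submitted, total = time_submit_and_total(pool)
    print(f"submit: {submitted:.4f} s, total: {total:.4f} s")
```

The meaningful figure is the second one, taken after every get call has returned.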