taking many random samples in parallel python - python

I'm trying to repeatedly run a function that requires a few positional arguments and involves random number generation (to generate many samples of a distribution). For a MWE, I think this captures everything:
import numpy as np
import multiprocessing as mup
from functools import partial
def rarr(xsize, ysize, k):
return np.random.rand(xsize, ysize)
def clever_array(nsamp, xsize=100, ysize=100, ncores=None):
np.random.seed()
if ncores is None:
p = mup.Pool()
else:
p = mup.Pool(ncores)
out = p.map_async( partial(rarr, xsize, ysize), range(nsamp))
p.close()
return np.array(out.get())
Note that the final positional argument for rarr() is just a dummy variable, since I am using map_async(), which requires an iterable. Now if I run %timeit clever_array(500, ncores = 1) I get 208 ms, whereas %timeit clever_array(500, ncores = 5) takes 149 ms. So there is definitely some kind of parallelism happening (the speedup isn't terribly impressive for this MWE but is decent in my real code).
However, I'm wondering a few things -- is there a more natural implementation other than the dummy variable for rarr() passed as an iterable to map_async to run this many times? Is there any obvious way to pass the xsize and ysize args to rarr() other than partial()? And is there any way to ensure different results from the different cores other than initializing a different random.seed() every time?
Thanks for any help!

Typically when we use multiprocessing we would expect different results from each invocation of a function, therefore it doesn't quite make sense to call the same function many times. In order to ensure the randomness of the sampling output, it is best to separate the random state (seed) from the function itself. The best approach as recommended by the numpy official doc is to use a np.random.Generator object, created via np.random.default_rng([seed]). With that we can modify your code to
import numpy as np
import multiprocessing as mup
from functools import partial
def rarr(xsize, ysize, rng):
return rng.random((xsize, ysize))
def clever_array(nsamp, xsize=100, ysize=100, ncores=None):
if ncores is None:
p = mup.Pool()
else:
p = mup.Pool(ncores)
out = p.map_async(partial(rarr, xsize, ysize), map(np.random.default_rng, range(nsamp)))
p.close()
return np.array(out.get())

Related

Multiple iterations of function with multiple arguments returning multiple values using Multiprocessing in python

I doing 100 iterations of the function model so, i tried using multiprocessing to distribute the tasks and for getting the final output I tried using queue but it takes too much time, failing the purpose of multiprocessing. How to solve this problem?
def model(X,Y):
ada_clf={}
pred1={}
auc_final=[]
for iteration in range(100):
ada_clf[iteration] = AdaBoostClassifier(DecisionTreeClassifier(),n_estimators=1000,learning_rate=0.001)
ada_clf[iteration].fit(X,Y)
pred1[iteration]=ada_clf[iteration].predict(test1)
individuallabelsfromada1=[]
for i in range(len(test1)):
individuallabelsfromada1.append([])
for j in range(100):
individuallabelsfromada1[i].append(pred1[j][i])
final_labels_ada1=[]
for each in individuallabelsfromada1:
final_labels_ada1.append(find_majority(each))
final=pd.Series(final_labels_ada1)
temp_arr=np.array(final)
total_labels2=pd.Series(temp_arr)
fpr, tpr, thresholds = roc_curve(y_test, total_labels2, pos_label=1)
auc_final.append(auc(fpr,tpr))
q.put(total_labels2)
q1.put(auc_final)
q2.put(ada_clf)
print('done')
overall_labels={}
final_auc={}
final_ada_clf={}
processes=[]
q=Queue()
q1=Queue()
q2=Queue()
for iteration in range(100):
if __name__=='__main__':
p=multiprocessing.Process(target=model,args=(x_train,y_labels,q,q1,q2,))
overall_labels[iteration]=q.get()
final_auc[iteration]=q1.get()
final_ada_clf[iteration]=q2.get()
p.start()
processes.append(p)
for each in processes:
each.join()
Below is my edited version, but returns only single output, i tried using multiple output but could not get it, so settled for only single output i.e. total_labels2:-
##code before this is same as before, only thing changed is arguments of model from def model(X,Y) to def model(repeat,X,Y)
total_labels2 = pd.Series(temp_arr)
return (repeat,total_labels2)
def get_result(total_labels2):
global testover_forall
testover_forall.append(total_labels2)
if __name__ == '__main__':
import multiprocessing as mp
testover_forall = []
pool = mp.Pool(40)
for repeat in range(100):
pool.apply_async(bound_model, args= repeat, x_train, y_train), callback= get_result)
pool.close()
pool.join()
repetations_index=[]
for i in range(100):
repetations_index.append(testover_forall[i][0])
final_last_labels = {}
for i in range(100):
temp = str(i)
final_last_labels[temp] = testover_forall[repetations_index[i]][1]
totally_last_labels=[]
for each in final_last_labels:
temp=np.array(final_last_labels[each])
totally_last_labels.append(temp)
See my comments (actually questions) to your post.
You should be using a multiprocessing pool to limit the number of processes that you create to the number of CPU cores that you have. This will also make it easier to get return values back from your model function instead of writing results to 3 different queues (and you could have written a tuple of 3 values to just one queue). You will, of course, require other import statements and code. Given your use of numpy and other libraries, which may be implemented in C Language, you could also retry running this using threading to see if that helps or hurts performance. Do this by replacing ProcessPoolExecutor with ThreadPoolExecutor in the two places it is referenced.
Note
Any changes that model makes to passed arguments X and Y will not be reflected back to the main process. So if model is called repeatedly with the same arguments over and over, as it appears to be, it's not clear whether each call will return different values, especially if the calls are being done in parallel.
from concurrent.futures import ProcessPoolExecutor
def model(X,Y):
ada_clf={}
pred1={}
auc_final=[]
for iteration in range(100):
ada_clf[iteration] = AdaBoostClassifier(DecisionTreeClassifier(),n_estimators=1000,learning_rate=0.001)
ada_clf[iteration].fit(X,Y)
pred1[iteration]=ada_clf[iteration].predict(test1)
individuallabelsfromada1=[]
for i in range(len(test1)):
individuallabelsfromada1.append([])
for j in range(100):
individuallabelsfromada1[i].append(pred1[j][i])
final_labels_ada1=[]
for each in individuallabelsfromada1:
final_labels_ada1.append(find_majority(each))
final=pd.Series(final_labels_ada1)
temp_arr=np.array(final)
total_labels2=pd.Series(temp_arr)
fpr, tpr, thresholds = roc_curve(y_test, total_labels2, pos_label=1)
auc_final.append(auc(fpr,tpr))
#q.put(total_labels2)
#q1.put(auc_final)
#q2.put(ada_clf)
return total_labels2, auc_final, ada_clf
#print('done')
if __name__ == '__main__':
with ProcessPoolExecutor() as executor:
futures = [executor.submit(model, x_train, y_labels) for iteration in range(100)]
# simple lists will suffice:
overall_labels = []
final_auc = []
final_ada_clf = []
for future in futures:
# get return value and store
total_labels2, auc_final, ada_clf = future.result()
overall_labels.append(total_labels2)
final_auc.append(auc_final)
final_ada_clf.append(ada_clf)
Update
It wasn't clear from the problem specification that the returned results are based on a random number generator and if successive calls to the worker function, model, do not employ a single random number generator across all processes in the multiprocessing pool, then the multiprocessing implementation will clearly return different results than the non-multiprocessing version. And it is not clear from the code provided where the random number generator is being used; it may be in library code that you have no access to. If that is the case, you have two options: (1) Use multithreading instead by changing the import statement as I have indicated in the code below; you may still achieve performance benefits as I have already mentioned or (2) Update the signature to model as follows. You will be passed a new argument, random_generator, that currently supports two methods, randint (like random.randint and random (like random.random), although it should be easy enough to modify the code if you need a different method from module random. You will use this random number generator in place of module random if you are able to. But note that this random generator will run much more slowly than the standard one; this is the price you pay.
Since we are also adding a repetition argument to model (it now has to be the final argument -- note the updated signature below), we can now use method map (no need to use a callback):
def model(X,Y, random_generator, repetition):
...
etc.
from multiprocessing import Pool
# or use the following import instead to use multithreading (but then use standard random generator):
# from multiprocessing.dummy import Pool
import random
from functools import partial
from multiprocessing.managers import BaseManager
class RandomGeneratorManager(BaseManager):
pass
class RandomGenerator:
def __init__(self):
random.seed(0)
def randint(self, a, b):
return random.randint(a, b)
def random(self):
return random.random()
# add other functions if needed
if __name__ == '__main__':
RandomGeneratorManager.register('RandomGenerator', RandomGenerator)
with RandomGeneratorManager() as manager:
random_generator = manager.RandomGenerator()
# why 40? why not use default, which is the number of cpu cores you have?:
pool = Pool(40):
worker = partial(model, x_train, y_labels, random_generator)
results = pool.map(worker, range(100))

how to pass parameters other than data through pool.imap() function for multiprocessing in python?

I am working on hyperspectral images. To reduce the noise from the image I am using wavelet transformation using pywt package. When I am doing this normally(serial processing) it's working smoothly. But when I am trying to implement parallel processing using multiple cores for wavelet transformation on the image, then I have to pass certain parameters like
wavelet family
thresholding value
threshold technique (hard/soft)
But I am not able to pass these parameters using the pool object, I can pass only data as an argument when I am using the pool.imap(). But when I am using the pool.apply_async() it's taking much more time and also the order of the output is not the same. Here I am adding the code for reference:
import matplotlib.pyplot as plt
import numpy as np
import multiprocessing as mp
import os
import time
from math import log10, sqrt
import pywt
import tifffile
def spec_trans(d,wav_fam,threshold_val,thresh_type):
data=np.array(d,dtype=np.float64)
data_dec=decomposition(data,wav_fam)
data_t=thresholding(data_dec,threshold_val,thresh_type)
data_rec=reconstruction(data_t,wav_fam)
return data_rec
if __name__ == '__main__':
#input
X=tifffile.imread('data/Classification/university.tif')
#take paramaters
threshold_val=float(input("Enter the value for image thresholding: "))
print("The available wavelet functions:",pywt.wavelist())
wav_fam=input("Choose a wavelet function for transformation: ")
threshold_type=['hard','soft']
print("The available wavelet functions:",threshold_type)
thresh_type=input("Choose a type for threshholding technique: ")
start=time.time()
p = mp.Pool(4)
jobs=[]
for dataBand in xmp:
jobs.append(p.apply_async(spec_trans,args=(dataBand,wav_fam,threshold_val,thresh_type)))
transformedX=[]
for jobBit in jobs:
transformedX.append(jobBit.get())
end=time.time()
p.close()
Also when I am using the 'soft' technique for thresholding I am facing the following error:
C:\Users\Sawon\anaconda3\lib\site-packages\pywt\_thresholding.py:25: RuntimeWarning: invalid value encountered in multiply
thresholded = data * thresholded
The results of serial execution and parallel execution would be more or less the same. But here I am getting slightly different results.
Any suggestion to modify the code will be helpful
Thank
[This isn't a direct answer to the question, but a clearer follow up query than trying the below via the small comment box]
As a quick check, pass in an iterator counter to spec_trans and return it back out (as well as your result) - and push it into a separate list, transformedXseq or something - and then compare to your input sequence. i.e.
def spec_trans(d,wav_fam,threshold_val,thresh_type, iCount):
data=np.array(d,dtype=np.float64)
data_dec=decomposition(data,wav_fam)
data_t=thresholding(data_dec,threshold_val,thresh_type)
data_rec=reconstruction(data_t,wav_fam)
return data_rec, iCount
and then within main
jobs=[]
iJobs = 0
for dataBand in xmp:
jobs.append(p.apply_async(spec_trans,args=(dataBand,wav_fam,threshold_val,thresh_type, iJobs)))
iJobs = iJobs + 1
transformedX=[]
transformedXseq=[]
for jobBit in jobs:
res = jobBit.get()
transformedX.append(res[0])
transformedXseq.append(res[1])
... and check the list transformedXseq to see if you've gathered the jobs back up in the sequence you submitted them. It should match!
Assuming wav_fam, threshold_val and thresh_type do not vary from call to call, first arrange for these arguments to be the first arguments to worker function spec_trans:
def spec_trans(wav_fam, threshold_val, thresh_type, d):
Now I don't see where in your pool-creation block you have defined xmp, but presumably this is an iterable. You need to modify this code as follows:
from functools import partial
def compute_chunksize(pool_size, iterable_size):
chunksize, remainder = divmod(iterable_size, 4 * pool_size)
if remainder:
chunksize += 1
return chunksize
if __name__ == '__main__':
X=tifffile.imread('data/Classification/university.tif')
#take paramaters
threshold_val=float(input("Enter the value for image thresholding: "))
print("The available wavelet functions:",pywt.wavelist())
wav_fam=input("Choose a wavelet function for transformation: ")
threshold_type=['hard','soft']
print("The available wavelet functions:",threshold_type)
thresh_type=input("Choose a type for threshholding technique: ")
start=time.time()
p = mp.Pool(4)
# first 3 arguments to spec_trans will be wav_fam, threshold_val and thresh_type
worker = partial(spec_trans, wav_fam, threshold_val, thresh_type)
suitable_chunksize = compute_chunksize(4, len(xmp))
transformedX = list(p.imap(worker, xmp, chunksize=suitable_chunksize))
end=time.time()
To obtain improved performance over using apply_async, you must use a "suitable chunksize" value with imap. Function compute_chunksize can be used for computing such a value based on the size of your pool, which is 4, and the size of the iterable being passed to imap, which would be len(xmp). If the size of xmp is small enough such that the chunksize value computed is 1, I don't really see how imap would be significantly more performant over apply_async.
Of course, you might as well just use:
transformedX = p.map(worker, xmp)
And let the pool compute its own suitable chunksize. imap has an advantage over map when the iterable is very large and not already a list. For map to compute a suitable chunksize it would first have to convert the iterable to a list just to get its length and this could be memory inefficient. But if you know the length (or approximate length) of the iterable, then by using imap you can explicitly set a chunksize without having to convert the iterable to a list. The other advantage of imap_unordered over map is that you can process the results for the individual tasks as they become available whereas with map you only get results when all the submitted tasks are complete.
Update
If you want to catch possible exceptions thrown by individual tasks submitted to your worker function, then stick with using imap, and use the following code to iterate the results returned by imap:
#transformedX = list(p.imap(worker, xmp, chunksize=suitable_chunksize))
transformedX = []
results = p.imap(worker, xmp, chunksize=suitable_chunksize)
import traceback
while True:
try:
result = next(results)
except StopIteration: # no more results
break
except Exception as e:
print('Exception occurred:', e)
traceback.print_exc() # print stacktrace
else:
transformedX.append(result)

How to use multiprocessing in Python on for loop to generate nested dictionary?

I am currently generating a nested dictionary that saves some arrays by using a nested for loop. Unfortunately, it takes quite some time; I realized that the server I am working on has a few cores available, so I was wondering if Python's multiprocessing library could be helpful to speed up the creation of the dictionary.
The nested for loop looks something like this (the actual computation is heavier and more complex):
import numpy as np
data_dict = {}
for s in range(1,5):
data_dict[s] = {}
for d in range(1,5):
if s * d > 4:
data_dict[s][d] = np.zeros((s,d))
else:
data_dict[s][d] = np.ones((s,d))
So this is what I tried:
from multiprocessing import Pool
import numpy as np
data_dict = {}
def process():
#sci=fits.open('{}.fits'.format(name))
for s in range(1,5):
data_dict[s] = {}
for d in range(1,5):
if s * d > 4:
data_dict[s][d] = np.zeros((s,d))
else:
data_dict[s][d] = np.ones((s,d))
if __name__ == '__main__':
pool = Pool() # Create a multiprocessing Pool
pool.map(process)
But pool.map (last line) seems to require an iterable, which I'm not sure what to insert there.
In my opinion, the real problem is what kind of processing is needed to compute entries of the dictionary and how many entries are there.
The kind of processing is essential to understand if multiprocessing can significantly speed up the creation of the dictionary. If your computation is I/O bound, you should use multithreading, while if it's CPU bound you should use multiprocessing. You can find more bout this here.
Assuming that the value of each entry can be computed independently and that this computation is CPU bound, let's benchmark the difference between single process and multiprocess implementation (based on multiprocessing library).
The following code is used to test the two approaches in some scenarios, varying the complexity of the computation needed for each entry and the number of entries (for the multiprocess implementation, 7 processes were used).
import timeit
import numpy as np
def some_fun(s, d, n=1):
"""A function with an adaptable complexity"""
a = s * np.ones(np.random.randint(1, 10, (2,))) / (d + 1)
for _ in range(n):
a += np.random.random(a.shape)
return a
# Code to create dictionary with only one process
setup_simple = "from __main__ import some_fun, n_first_level, n_second_level, complexity"
code_simple = """
data_dict = {}
for s in range(n_first_level):
data_dict[s] = {}
for d in range(n_second_level):
data_dict[s][d] = some_fun(s, d, n=complexity)
"""
# Code to create a dictionary with multiprocessing: we are going to use all the available cores except 1
setup_mp = """import numpy as np
import multiprocessing as mp
import itertools
from functools import partial
from __main__ import some_fun, n_first_level, n_second_level, complexity
n_processes = mp.cpu_count() - 1
# Uncomment if you want to know how many concurrent processes are you going to use
# print(f'{n_processes} concurrent processes')
"""
code_mp = """
with mp.Pool(processes=n_processes) as pool:
dict_values = pool.starmap(partial(some_fun, n=complexity), itertools.product(range(n_first_level), range(n_second_level)))
data_dict = {
k: dict(zip(range(n_second_level), dict_values[k * n_second_level: (k + 1) * n_second_level]))
for k in range(n_first_level)
}
"""
# Time the code with different settings
print('Execution time on 10 repetitions: mean [std]')
for label, complexity, n_first_level, n_second_level in (
("TRIVIAL FUNCTION", 0, 10, 10),
("TRIVIAL FUNCTION", 0, 500, 500),
("SIMPLE FUNCTION", 5, 500, 500),
("COMPLEX FUNCTION", 50, 100, 100),
("HEAVY FUNCTION", 1000, 10, 10),
):
print(f'\n{label}, {n_first_level * n_second_level} dictionary entries')
for l, t in (
('Single process', timeit.repeat(stmt=code_simple, setup=setup_simple, number=1, repeat=10)),
('Multiprocess', timeit.repeat(stmt=code_mp, setup=setup_mp, number=1, repeat=10)),
):
print(f'\t{l}: {np.mean(t):.3e} [{np.std(t):.3e}] seconds')
These are the results:
Execution time on 10 repetitions: mean [std]
TRIVIAL FUNCTION, 100 dictionary entries
Single process: 7.752e-04 [7.494e-05] seconds
Multiprocess: 1.163e-01 [2.024e-03] seconds
TRIVIAL FUNCTION, 250000 dictionary entries
Single process: 7.077e+00 [7.098e-01] seconds
Multiprocess: 1.383e+00 [7.752e-02] seconds
SIMPLE FUNCTION, 250000 dictionary entries
Single process: 1.405e+01 [1.422e+00] seconds
Multiprocess: 2.858e+00 [5.742e-01] seconds
COMPLEX FUNCTION, 10000 dictionary entries
Single process: 1.557e+00 [4.330e-02] seconds
Multiprocess: 5.383e-01 [5.330e-02] seconds
HEAVY FUNCTION, 100 dictionary entries
Single process: 3.181e-01 [5.026e-03] seconds
Multiprocess: 1.171e-01 [2.494e-03] seconds
As you can see, assuming that you have a CPU bounded computation, the multiprocess approach achieves better results in most of the scenarios. Only if you have a very light computation for each entry and/or a very limited number of entries, the single process approach should be preferred.
On the other hand, the improvement provided by multiprocessing comes with a cost: for example, if your computation for each entry uses a significant amount of memory, you could incur an OutOfMemory error, meaning that you have to improve your code and make it more complex to avoid it, finding the right balance between memory occupation and decrease in execution time. If you look around, there are a lot of questions asking how to solve memory issues caused by a non-optimal use of multiprocessing. In other words, this means that your code will be less easy to read and maintain.
To sum up, you should judge if the improvement in execution time is worthed, even if it is possible.

Python for loop: Proper implementation of multiprocessing

The following for loop is part of a iterative simulation process and is the main bottleneck regarding computational time:
import numpy as np
class Simulation(object):
def __init__(self,n_int):
self.n_int = n_int
def loop(self):
for itr in range(self.n_int):
#some preceeding code which updates rows_list and diff with every itr
cols_red_list = []
rows_list = list(range(2500)) #row idx for diff where negative element is known to appear
diff = np.random.uniform(-1.323, 3.780, (2500, 300)) #np.random.uniform is just used as toy example
for row in rows_list:
col = next(idx for idx, val in enumerate(diff[row,:]) if val < 0)
cols_red_list.append(col)
# some subsequent code which uses the cols_red_list data
sim1 = Simulation(n_int=10)
sim1.loop()
Hence, I tried to parallelize it by using the multiprocessing package in hope to reduce computation time:
import numpy as np
from multiprocessing import Pool, cpu_count
from functools import partial
def crossings(row, diff):
return next(idx for idx, val in enumerate(diff[row,:]) if val < 0)
class Simulation(object):
def __init__(self,n_int):
self.n_int = n_int
def loop(self):
for itr in range(self.n_int):
#some preceeding code which updates rows_list and diff with every
rows_list = list(range(2500))
diff = np.random.uniform(-1, 1, (2500, 300))
if __name__ == '__main__':
num_of_workers = cpu_count()
print('number of CPUs : ', num_of_workers)
pool = Pool(num_of_workers)
cols_red_list = pool.map(partial(crossings,diff = diff), rows_list)
pool.close()
print(len(cols_red_list))
# some subsequent code which uses the cols_red_list data
sim1 = Simulation(n_int=10)
sim1.loop()
Unfortunately, the parallelization turns out to be much slower compared to the sequential piece of code.
Hence my question: Did I use the multiprocessing package properly in that particular example? Are there alternative ways to parallelize the above mentioned for loop ?
Disclaimer: As you're trying to reduce the runtime of your code through parallelisation, this doesn't strictly answer your question but it might still be a good learning opportunity.
As a golden rule, before moving to multiprocessing to improve
performance (execution time), one should first optimise the
single-threaded case.
Your
rows_list = list(range(2500))
Generates the numbers 0 to 2499 (that's the range) and stores them in memory (list), which requires time to do the allocation of the required memory and the actual write. You then only use these predictable values once each, by reading them from memory (which also takes time), in a predictable order:
for row in rows_list:
This is particularly relevant to the runtime of your loop function as you do it repeatedly (for itr in range(n_int):).
Instead, consider generating the number only when you need it, without an intermediate store (which conceptually removes any need to access RAM):
for row in range(2500):
Secondly, on top of sharing the same issue (unnecessary accesses to memory), the following:
diff = np.random.uniform(-1, 1, (2500, 300))
# ...
col = next(idx for idx, val in enumerate(diff[row,:]) if val < 0)
seems to me to be optimisable at the level of math (or logic).
What you're trying to do is get a random variable (that col index) by defining it as "the first time I encounter a random variable in [-1;1] that is lower than 0". But notice that figuring out if a random variable with a uniform distribution over [-α;α] is negative, is the same as having a random variable over {0,1} (i.e. a bool).
Therefore, you're now working with bools instead of floats and you don't even have to do the comparison (val < 0) as you already have a bool. This potentially makes the code much faster. Using the same idea as for rows_list, you can generate that bool only when you need it; testing it until it is True (or False, choose one, it doesn't matter obviously). By doing so, you only generate as many random bools as you need, not more and not less (BTW, what happens in your code if all 300 elements in the row are negative? ;) ):
for _ in range(n_int):
cols_red_list = []
for row in range(2500):
col = next(i for i in itertools.count() if random.getrandbits(1))
cols_red_list.append(col)
or, with list comprehension:
cols_red_list = [next(i for i in count() if getrandbits(1))
for _ in range(2500)]
I'm sure that, through proper statistical analysis, you even can express that col random variable as a non-uniform variable over [0;limit[, allowing you to compute it much faster.
Please test the performance of an "optimized" version of your single-threaded implementation first. If the runtime is still not acceptable, you should then look into multithreading.
multiprocessing uses system processes (not threads!) for parallelization, which require expensive IPC (inter-process communication) to share data.
This bites you in two spots:
diff = np.random.uniform(-1, 1, (2500, 300)) creates a large matrix which is expensive to pickle/copy to another process
rows_list = list(range(2500)) creates a smaller list, but the same applies here.
To avoid this expensive IPC, you have one and a half choices:
If on a POSIX-compliant system, initialize your variables on the module level, that way each process gets a quick-and-dirty copy of the required data. This is not scalable as it requires POSIX, weird architecture (you probably don't want to put everything on the module level), and doesn't support sharing changes to that data.
Use shared memory. This only supports mostly primitive data types, but mp.Array should cover your needs.
The second problem is that setting up a pool is expensive, as num_cpu processes need to be started. Your workload is small enough to be negligible compared to this overhead. A good practice is to only create one pool and reuse it.
Here is a quick-and-dirty example of the POSIX only solution:
import numpy as np
from multiprocessing import Pool, cpu_count
from functools import partial
n_int = 10
rows_list = np.array(range(2500))
diff = np.random.uniform(-1, 1, (2500, 300))
def crossings(row, diff):
return next(idx for idx, val in enumerate(diff[row,:]) if val < 0)
def workload(_):
cols_red_list = [crossings(row, diff) for row in rows_list]
print(len(cols_red_list))
class Simulation(object):
def loop(self):
num_of_workers = cpu_count()
with Pool(num_of_workers) as pool:
pool.map(workload, range(10))
pool.close()
sim1 = Simulation()
sim1.loop()
For me (and my two cores) this is roughly twice as fast as the sequential version.
Update with shared memory:
import numpy as np
from multiprocessing import Pool, cpu_count, Array
from functools import partial
n_int = 10
ROW_COUNT = 2500
### WORKER
diff = None
result = None
def init_worker(*args):
global diff, result
(diff, result) = args
def crossings(i):
result[i] = next(idx for idx, val in enumerate(diff[i*300:(i+1)*300]) if val < 0)
### MAIN
class Simulation():
def loop(self):
num_of_workers = cpu_count()
diff = Array('d', range(ROW_COUNT*300), lock=False)
result = Array('i', ROW_COUNT, lock=False)
# Shared memory needs to be passed when workers are spawned
pool = Pool(num_of_workers, initializer=init_worker, initargs=(diff, result))
for i in range(n_int):
# SLOW, I assume you use a different source of values anyway.
diff[:] = np.random.uniform(-1, 1, ROW_COUNT*300)
pool.map(partial(crossings), range(ROW_COUNT))
print(len(result))
pool.close()
sim1 = Simulation()
sim1.loop()
A few notes:
Shared memory needs to be set up at worker creation, so it's global anyway.
This still isn't faster than the sequential version, but that's mainly due to random.uniform needing to be copied entirely into shared memory. I assume that are just values for testing, and in reality you'd fill it differently anyway.
I only pass indices to the worker, and use them to read and write values to the shared memory.

Is there a simple process-based parallel map for python?

I'm looking for a simple process-based parallel map for python, that is, a function
parmap(function,[data])
that would run function on each element of [data] on a different process (well, on a different core, but AFAIK, the only way to run stuff on different cores in python is to start multiple interpreters), and return a list of results.
Does something like this exist? I would like something simple, so a simple module would be nice. Of course, if no such thing exists, I will settle for a big library :-/
I seems like what you need is the map method in multiprocessing.Pool():
map(func, iterable[, chunksize])
A parallel equivalent of the map() built-in function (it supports only
one iterable argument though). It blocks till the result is ready.
This method chops the iterable into a number of chunks which it submits to the
process pool as separate tasks. The (approximate) size of these chunks can be
specified by setting chunksize to a positive integ
For example, if you wanted to map this function:
def f(x):
return x**2
to range(10), you could do it using the built-in map() function:
map(f, range(10))
or using a multiprocessing.Pool() object's method map():
import multiprocessing
pool = multiprocessing.Pool()
print pool.map(f, range(10))
This can be done elegantly with Ray, a system that allows you to easily parallelize and distribute your Python code.
To parallelize your example, you'd need to define your map function with the #ray.remote decorator, and then invoke it with .remote. This will ensure that every instance of the remote function will executed in a different process.
import time
import ray
ray.init()
# Define the function you want to apply map on, as remote function.
#ray.remote
def f(x):
# Do some work...
time.sleep(1)
return x*x
# Define a helper parmap(f, list) function.
# This function executes a copy of f() on each element in "list".
# Each copy of f() runs in a different process.
# Note f.remote(x) returns a future of its result (i.e.,
# an identifier of the result) rather than the result itself.
def parmap(f, list):
return [f.remote(x) for x in list]
# Call parmap() on a list consisting of first 5 integers.
result_ids = parmap(f, range(1, 6))
# Get the results
results = ray.get(result_ids)
print(results)
This will print:
[1, 4, 9, 16, 25]
and it will finish in approximately len(list)/p (rounded up the nearest integer) where p is number of cores on your machine. Assuming a machine with 2 cores, our example will execute in 5/2 rounded up, i.e, in approximately 3 sec.
There are a number of advantages of using Ray over the multiprocessing module. In particular, the same code will run on a single machine as well as on a cluster of machines. For more advantages of Ray see this related post.
Python3's Pool class has a map() method and that's all you need to parallelize map:
from multiprocessing import Pool
with Pool() as P:
xtransList = P.map(some_func, a_list)
Using with Pool() as P is similar to a process pool and will execute each item in the list in parallel. You can provide the number of cores:
with Pool(processes=4) as P:
For those who looking for Python equivalent of R's mclapply(), here is my implementation. It is an improvement of the following two examples:
"Parallelize Pandas map() or apply()", as mentioned by #Rafael
Valero.
How to apply map to functions with multiple arguments.
It can be apply to map functions with single or multiple arguments.
import numpy as np, pandas as pd
from scipy import sparse
import functools, multiprocessing
from multiprocessing import Pool
num_cores = multiprocessing.cpu_count()
def parallelize_dataframe(df, func, U=None, V=None):
#blockSize = 5000
num_partitions = 5 # int( np.ceil(df.shape[0]*(1.0/blockSize)) )
blocks = np.array_split(df, num_partitions)
pool = Pool(num_cores)
if V is not None and U is not None:
# apply func with multiple arguments to dataframe (i.e. involves multiple columns)
df = pd.concat(pool.map(functools.partial(func, U=U, V=V), blocks))
else:
# apply func with one argument to dataframe (i.e. involves single column)
df = pd.concat(pool.map(func, blocks))
pool.close()
pool.join()
return df
def square(x):
return x**2
def test_func(data):
print("Process working on: ", data.shape)
data["squareV"] = data["testV"].apply(square)
return data
def vecProd(row, U, V):
return np.sum( np.multiply(U[int(row["obsI"]),:], V[int(row["obsJ"]),:]) )
def mProd_func(data, U, V):
data["predV"] = data.apply( lambda row: vecProd(row, U, V), axis=1 )
return data
def generate_simulated_data():
N, D, nnz, K = [302, 184, 5000, 5]
I = np.random.choice(N, size=nnz, replace=True)
J = np.random.choice(D, size=nnz, replace=True)
vals = np.random.sample(nnz)
sparseY = sparse.csc_matrix((vals, (I, J)), shape=[N, D])
# Generate parameters U and V which could be used to reconstruct the matrix Y
U = np.random.sample(N*K).reshape([N,K])
V = np.random.sample(D*K).reshape([D,K])
return sparseY, U, V
def main():
Y, U, V = generate_simulated_data()
# find row, column indices and obvseved values for sparse matrix Y
(testI, testJ, testV) = sparse.find(Y)
colNames = ["obsI", "obsJ", "testV", "predV", "squareV"]
dtypes = {"obsI":int, "obsJ":int, "testV":float, "predV":float, "squareV": float}
obsValDF = pd.DataFrame(np.zeros((len(testV), len(colNames))), columns=colNames)
obsValDF["obsI"] = testI
obsValDF["obsJ"] = testJ
obsValDF["testV"] = testV
obsValDF = obsValDF.astype(dtype=dtypes)
print("Y.shape: {!s}, #obsVals: {}, obsValDF.shape: {!s}".format(Y.shape, len(testV), obsValDF.shape))
# calculate the square of testVals
obsValDF = parallelize_dataframe(obsValDF, test_func)
# reconstruct prediction of testVals using parameters U and V
obsValDF = parallelize_dataframe(obsValDF, mProd_func, U, V)
print("obsValDF.shape after reconstruction: {!s}".format(obsValDF.shape))
print("First 5 elements of obsValDF:\n", obsValDF.iloc[:5,:])
if __name__ == '__main__':
main()
I know this is an old post, but just in case, I wrote a tool to make this super, super easy called parmapper (I actually call it parmap in my use but the name was taken).
It handles a lot of the setup and deconstruction of processes and adds tons of features. In rough order of importance
Can take lambda and other unpickleable functions
Can apply starmap and other similar call methods to make it very easy to directly use.
Can split amongst both threads and/or processes
Includes features such as progress bars
It does incur a small cost but for most uses, that is negligible.
I hope you find it useful.
(Note: It, like map in Python 3+, returns an iterable so if you expect all results to pass through it immediately, use list())

Categories

Resources