Difference between map() and pool.map() - python

I have code like this:
def plotFrame(n):
    a = data[n, :]
    do_something_with(a)

data = loadtxt(filename)
ids = data[:,0] # some numbers from the first column of data
map(plotFrame, ids)
That worked fine for me. Now I want to try replacing map() with pool.map() as follows:
pools = multiprocessing.Pool(processes=1)
pools.map(plotFrame, ids)
But that won't work, saying:
NameError: global name 'data' is not defined
The question is: what is going on? Why does map() not complain about the data variable that is not passed to the function, while pool.map() does?
EDIT:
I'm using Linux.
EDIT 2:
Based on @Bill's second suggestion, I now have the following code:
def plotFrame_v2(line):
    plot_with(line)

if __name__ == "__main__":
    ff = np.loadtxt(filename)
    m = int( max(ff[:,-1]) ) # max id
    l = ff.shape[0]
    nfig = 0
    pool = Pool(processes=1)
    for i in range(0, l/m, 50):
        data = ff[i*m:(i+1)*m, :] # data of one frame contains several ids
        pool.map(plotFrame_v2, data)
        nfig += 1
        plt.savefig("figs_bot/%.3d.png"%nfig)
        plt.clf()
That works just as expected. However, now I have another unexpected problem: The produced figures are blank, whereas the above code with map() produces figures with the content of data.

Using multiprocessing.Pool, you are spawning individual processes to work with the shared (global) resource data. Typically, you can allow the processes to work with a shared resource of the parent process by making that resource explicitly global. However, it is better practice to explicitly pass all needed resources to the child processes as function arguments. This is required if you are working on Windows. Check out the multiprocessing guidelines here.
So you could try doing
data = loadtxt(filename)

def plotFrame(n):
    global data
    a = data[n, :]
    do_something_with(a)

ids = data[:,0] # some numbers from the first column of data
pools = multiprocessing.Pool(processes=1)
pools.map(plotFrame, ids)
Or, even better, see this thread about feeding multiple arguments to a function with multiprocessing.Pool. A simple way could be:
def plotFrameWrapper(args):
    return plotFrame(*args)

def plotFrame(n, data):
    a = data[n, :]
    do_something_with(a)

if __name__ == "__main__":
    from multiprocessing import Pool
    data = loadtxt(filename)
    pools = Pool(1)
    ids = data[:,0]
    results = pools.map(plotFrameWrapper, zip(ids, [data]*len(ids)))
    print results
One last thing: since it looks like the only thing you are doing from your example is slicing the array, you can simply slice first then pass the sliced arrays to your function:
def plotFrame(sliced_data):
    do_something_with(sliced_data)

if __name__ == "__main__":
    from multiprocessing import Pool
    data = loadtxt(filename)
    pools = Pool(1)
    ids = data[:,0]
    results = pools.map(plotFrame, data[ids])
    print results

To avoid "unexpected" problems, avoid globals.
To reproduce your first code example with the built-in map() that calls plotFrame:
def plotFrame(n):
    a = data[n, :]
    do_something_with(a)
using multiprocessing.Pool.map(), the first thing is to deal with the global data. If do_something_with(a) also uses some global data then it should be changed as well.
To see how to pass a numpy array to a child process, see Use numpy array in shared memory for multiprocessing. If you don't need to modify the array then it is even simpler:
import numpy as np
from multiprocessing import Pool

def init(data_): # inherit data
    global data #NOTE: no other globals in the program
    data = data_

def main():
    data = np.loadtxt(filename)
    ids = data[:,0] # some numbers from the first column of data
    pool = Pool(initializer=init, initargs=[data])
    pool.map(plotFrame, ids)

if __name__=="__main__":
    main()
All arguments should either be passed explicitly as arguments to plotFrame or inherited via init().
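For example, plotFrame itself can then stay essentially unchanged and simply read the inherited array (a sketch; do_something_with stands in for your own plotting code):

def plotFrame(n):
    a = data[n, :]   # `data` is the global set by init() in each worker process
    do_something_with(a)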
Your second code example tries to manipulate global data again (via plt calls):
import matplotlib.pyplot as plt

#XXX BROKEN, DO NOT USE
pool.map(plotFrame_v2, data)
nfig += 1
plt.savefig("figs_bot/%.3d.png"%nfig)
plt.clf()
Unless you draw something in the main process, this code saves blank figures. Either plot in the child processes or send the data to be plotted back to the parent process explicitly, e.g., by returning it from plotFrame and using the value returned by pool.map(). Here's a code example: how to plot in child processes.
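Here is a minimal sketch of the second option applied to your plotFrame_v2 example: the workers only prepare the per-line data and all matplotlib calls stay in the parent process (the frame slicing is simplified here, filename is assumed to be defined, and plot_with is your own plotting helper):

from multiprocessing import Pool
import numpy as np
import matplotlib.pyplot as plt

def prepare_line(line):
    # any heavy per-line computation happens in the worker; only data comes back
    return line

if __name__ == "__main__":
    ff = np.loadtxt(filename)
    pool = Pool(processes=1)
    nfig = 0
    for i in range(0, ff.shape[0], 50):
        frame = ff[i:i+50, :]
        nfig += 1
        for line in pool.map(prepare_line, frame):
            plot_with(line)          # plotting happens only in the parent
        plt.savefig("figs_bot/%.3d.png" % nfig)
        plt.clf()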

Related

Python concurrent.futures global variables

I have multiprocessing code in which each process has to analyse the same data differently.
The input data is always the same; it is not changeable.
The input data is a data frame with 20 columns and 60k rows.
How can I efficiently 'put' this data into each process?
In a single-process application I used a global variable, but in multiprocessing it's not working.
When I try to pass the data as a function argument, I get only the first element of the table.
Welcome to Stack Overflow. You need to take the time to give a reproducible, minimal working example to get specific answers and to help the community in general.
Anyway, you shouldn't use global variables if you need to change them with each iteration/process/etc.
Multiprocessing works like this, in rough, easily digestible terms:
import concurrent.futures
import glob

def manipulate_data_function(data):
    result = torture_data(data)
    return result

# use ProcessPoolExecutor instead for CPU-bound work
with concurrent.futures.ThreadPoolExecutor(max_workers=None) as executor:
    futures = []
    for file in glob.glob('*txt'):
        futures.append(executor.submit(manipulate_data_function, file))
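To actually collect the return values, you can iterate over the futures as they finish, still inside the with block (a short sketch continuing the example above; torture_data is your own function):

    results = []
    for future in concurrent.futures.as_completed(futures):
        results.append(future.result())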
Thank you for the answer. I don't change this data at each iteration; I use the same data in every process. How to change the data is given through a function argument:
with concurrent.futures.ProcessPoolExecutor() as executor:
    res = executor.map(goal_fcn, p)
    for f in concurrent.futures.as_completed(res):
        fp = res
and then:
def goal_fcn(x):
    return heavy_calculation(x, global_DataFrame, global_String)
EDIT:
it works with:
with concurrent.futures.ProcessPoolExecutor() as executor:
    res = executor.map(goal_fcn, p, [global_DataFrame], [global_String])
    for f in concurrent.futures.as_completed(res):
        fp = res

def goal_fcn(x, DataFrame, String):
    return heavy_calculation(x, DataFrame, String)
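Note that executor.map() zips its iterables together, so the single-element lists [global_DataFrame] and [global_String] pair the constants with only the first element of p. If every task needs the same constant data, itertools.repeat (or functools.partial) expresses that more safely, roughly:

import itertools

with concurrent.futures.ProcessPoolExecutor() as executor:
    results = list(executor.map(goal_fcn, p,
                                itertools.repeat(global_DataFrame),
                                itertools.repeat(global_String)))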

Multiple iterations of function with multiple arguments returning multiple values using Multiprocessing in python

I am doing 100 iterations of the function model, so I tried using multiprocessing to distribute the tasks. For getting the final output I tried using a queue, but it takes too much time, defeating the purpose of multiprocessing. How can I solve this problem?
def model(X,Y):
    ada_clf={}
    pred1={}
    auc_final=[]
    for iteration in range(100):
        ada_clf[iteration] = AdaBoostClassifier(DecisionTreeClassifier(),n_estimators=1000,learning_rate=0.001)
        ada_clf[iteration].fit(X,Y)
        pred1[iteration]=ada_clf[iteration].predict(test1)
    individuallabelsfromada1=[]
    for i in range(len(test1)):
        individuallabelsfromada1.append([])
        for j in range(100):
            individuallabelsfromada1[i].append(pred1[j][i])
    final_labels_ada1=[]
    for each in individuallabelsfromada1:
        final_labels_ada1.append(find_majority(each))
    final=pd.Series(final_labels_ada1)
    temp_arr=np.array(final)
    total_labels2=pd.Series(temp_arr)
    fpr, tpr, thresholds = roc_curve(y_test, total_labels2, pos_label=1)
    auc_final.append(auc(fpr,tpr))
    q.put(total_labels2)
    q1.put(auc_final)
    q2.put(ada_clf)
    print('done')

overall_labels={}
final_auc={}
final_ada_clf={}
processes=[]
q=Queue()
q1=Queue()
q2=Queue()
for iteration in range(100):
    if __name__=='__main__':
        p=multiprocessing.Process(target=model,args=(x_train,y_labels,q,q1,q2,))
        overall_labels[iteration]=q.get()
        final_auc[iteration]=q1.get()
        final_ada_clf[iteration]=q2.get()
        p.start()
        processes.append(p)
for each in processes:
    each.join()
Below is my edited version, but it returns only a single output. I tried returning multiple outputs but could not get it to work, so I settled for a single output, i.e. total_labels2:
## code before this is the same as before; the only thing changed is the arguments of model, from def model(X,Y) to def model(repeat,X,Y)

    total_labels2 = pd.Series(temp_arr)
    return (repeat, total_labels2)

def get_result(total_labels2):
    global testover_forall
    testover_forall.append(total_labels2)

if __name__ == '__main__':
    import multiprocessing as mp
    testover_forall = []
    pool = mp.Pool(40)
    for repeat in range(100):
        pool.apply_async(model, args=(repeat, x_train, y_train), callback=get_result)
    pool.close()
    pool.join()
    repetations_index=[]
    for i in range(100):
        repetations_index.append(testover_forall[i][0])
    final_last_labels = {}
    for i in range(100):
        temp = str(i)
        final_last_labels[temp] = testover_forall[repetations_index[i]][1]
    totally_last_labels=[]
    for each in final_last_labels:
        temp=np.array(final_last_labels[each])
        totally_last_labels.append(temp)
See my comments (actually questions) to your post.
You should be using a multiprocessing pool to limit the number of processes that you create to the number of CPU cores that you have. This will also make it easier to get return values back from your model function instead of writing results to 3 different queues (and you could have written a tuple of 3 values to just one queue). You will, of course, require other import statements and code. Given your use of numpy and other libraries, which may be implemented in C, you could also try running this using threading to see whether that helps or hurts performance. Do this by replacing ProcessPoolExecutor with ThreadPoolExecutor in the two places it is referenced.
Note
Any changes that model makes to passed arguments X and Y will not be reflected back to the main process. So if model is called repeatedly with the same arguments over and over, as it appears to be, it's not clear whether each call will return different values, especially if the calls are being done in parallel.
from concurrent.futures import ProcessPoolExecutor

def model(X,Y):
    ada_clf={}
    pred1={}
    auc_final=[]
    for iteration in range(100):
        ada_clf[iteration] = AdaBoostClassifier(DecisionTreeClassifier(),n_estimators=1000,learning_rate=0.001)
        ada_clf[iteration].fit(X,Y)
        pred1[iteration]=ada_clf[iteration].predict(test1)
    individuallabelsfromada1=[]
    for i in range(len(test1)):
        individuallabelsfromada1.append([])
        for j in range(100):
            individuallabelsfromada1[i].append(pred1[j][i])
    final_labels_ada1=[]
    for each in individuallabelsfromada1:
        final_labels_ada1.append(find_majority(each))
    final=pd.Series(final_labels_ada1)
    temp_arr=np.array(final)
    total_labels2=pd.Series(temp_arr)
    fpr, tpr, thresholds = roc_curve(y_test, total_labels2, pos_label=1)
    auc_final.append(auc(fpr,tpr))
    #q.put(total_labels2)
    #q1.put(auc_final)
    #q2.put(ada_clf)
    return total_labels2, auc_final, ada_clf
    #print('done')

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        futures = [executor.submit(model, x_train, y_labels) for iteration in range(100)]
        # simple lists will suffice:
        overall_labels = []
        final_auc = []
        final_ada_clf = []
        for future in futures:
            # get return value and store
            total_labels2, auc_final, ada_clf = future.result()
            overall_labels.append(total_labels2)
            final_auc.append(auc_final)
            final_ada_clf.append(ada_clf)
Update
It wasn't clear from the problem specification that the returned results are based on a random number generator. If successive calls to the worker function, model, do not use a single random number generator shared across all processes in the multiprocessing pool, then the multiprocessing implementation will clearly return different results than the non-multiprocessing version. And it is not clear from the code provided where the random number generator is being used; it may be in library code that you have no access to. If that is the case, you have two options: (1) use multithreading instead by changing the import statement as I have indicated in the code below; you may still achieve performance benefits as I have already mentioned, or (2) update the signature of model as follows. You will be passed a new argument, random_generator, that currently supports two methods, randint (like random.randint) and random (like random.random), although it should be easy enough to modify the code if you need a different method from module random. You will use this random number generator in place of module random if you are able to. But note that this random generator will run much more slowly than the standard one; this is the price you pay.
Since we are also adding a repetition argument to model (it now has to be the final argument -- note the updated signature below), we can now use method map (no need to use a callback):
def model(X, Y, random_generator, repetition):
    ...  # etc.

from multiprocessing import Pool
# or use the following import instead to use multithreading (but then use the standard random generator):
# from multiprocessing.dummy import Pool
import random
from functools import partial
from multiprocessing.managers import BaseManager

class RandomGeneratorManager(BaseManager):
    pass

class RandomGenerator:
    def __init__(self):
        random.seed(0)
    def randint(self, a, b):
        return random.randint(a, b)
    def random(self):
        return random.random()
    # add other functions if needed

if __name__ == '__main__':
    RandomGeneratorManager.register('RandomGenerator', RandomGenerator)
    with RandomGeneratorManager() as manager:
        random_generator = manager.RandomGenerator()
        # why 40? why not use the default, which is the number of cpu cores you have?
        pool = Pool(40)
        worker = partial(model, x_train, y_labels, random_generator)
        results = pool.map(worker, range(100))

Adding arrays to global array using multiprocessing

I have a global NumPy array ys_final and have defined a function that generates an array ys. The ys array will be generated based on an input parameter, and I want to add these ys arrays to the global array, i.e. ys_final = ys_final + ys.
The order of addition doesn't matter, so I want to use Pool.apply_async() from the multiprocessing library, but I can't write to the global array. The code for reference is:
import multiprocessing as mp

ys_final = np.zeros(len)

def ys_genrator(i):
    #code to generate ys array
    return ys

pool = mp.Pool(mp.cpu_count())
for i in range(3954):
    ys_final = ys_final + pool.apply_async(ys_genrator, args=(i)).get()
pool.close()
pool.join()
The above block of code keeps running forever and nothing happens. I've also tried mp.Process and still face the same problem. There I defined a target function that directly adds to the global array, but it is also not working, as the block keeps running forever. Reference:
def func(i):
    #code to generate ys
    global ys_final
    ys_final = ys_final + ys

for i in range(3954):
    p = mp.Process(target=func, args=(i,))
    p.start()
    p.join()
Any suggestions will be really helpful.
EDIT:
My ys_genrator is a function for linear interpolation. Based on the parameter i, which is an index for rows in a 2D image, the function creates an array of interpolated amplitudes that will be superimposed with all the interpolated amplitudes from the image, so ys needs to be added to ys_final.
The variable len is the length of the interpolated array, which is the same for all rows.
For reference, a simpler version of ys_genrator(i) is as follows:
def ys_genrator(i):
    ys = np.ones(10)*i
    return ys
A few points:
pool.apply_async(ys_genrator, args=(i)) needs to be pool.apply_async(ys_genrator, args=(i,)). Note the comma after the i.
pool.apply_async(ys_genrator, args=(i,)).get() is exactly equivalent to pool.apply(ys_genrator, args=(i,)). That is, you will block because of your immediate call to get and you will have absolutely no parallelism. You would need to do all your calls to pool.apply_async and save the returned AsyncResult instances, and only then call get on these instances (see the short sketch after these points).
If you are running under Windows, you will have a problem. The code that creates new processes must be within a block governed by if __name__ == '__main__':
If you are running under something like Jupyter Notebook or iPython you will have a problem. The worker function, ys_genrator, would need to be in an external file and imported.
Using apply_async for submitting a lot of tasks is inefficient. You are better off using imap or imap_unordered, where the tasks get submitted in "chunks" and you can process the results one by one as they become available. But you must choose a "suitable" chunksize argument.
Any code you have at the global level, such as ys_final = np.zeros(len) will be executed by every sub-process if you are running under Windows, and this can be wasteful if the subprocesses do not need to "see" this variable. If they do need to see this variable, be aware that each process in the pool will be working with its own copy of the variable so it better be a read-only usage. Even then, it can be very wasteful of storage if the variable is large. There are ways of sharing such a variable across the processes but it is not perfectly clear whether you need to (you haven't even defined variable len). So it is difficult to give you improved code. However, it appears that your worker function does not need to "see" ys_final, so I will take a shot at an improved solution.
But be aware that if your function ys_genrator is very trivial, nothing will be gained by using multiprocessing because there is overhead in both creating the processing pool and in passing arguments from one process to another. Also, if ys_genrator is using numpy, this can also be a source of problems since numpy uses multiprocessing for some of its own functions and you are better off not mixing numpy with your own multiprocessing.
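A short sketch of the pattern described in the second point, with ys_final and ys_genrator as in your question: submit all the tasks first, and only then call get on the saved AsyncResult instances:

if __name__ == '__main__':
    pool = mp.Pool(mp.cpu_count())
    # submit everything first ...
    async_results = [pool.apply_async(ys_genrator, args=(i,)) for i in range(3954)]
    # ... and only block on the results afterwards
    for r in async_results:
        ys_final = ys_final + r.get()
    pool.close()
    pool.join()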
import multiprocessing as mp
import numpy as np

SIZE = 3

def ys_genrator(i):
    #code to generate ys array
    # for this dummy example all SIZE entries will end up with the same result:
    ys = [i] * SIZE # for example: [1, 1, 1]
    return ys

def compute_chunksize(poolsize, iterable_size):
    chunksize, remainder = divmod(iterable_size, 4 * poolsize)
    if remainder:
        chunksize += 1
    return chunksize

if __name__ == '__main__':
    ys_final = np.zeros(SIZE)
    n_iterations = 3954
    poolsize = min(mp.cpu_count(), n_iterations)
    chunksize = compute_chunksize(poolsize, n_iterations)
    print('poolsize =', poolsize, 'chunksize =', chunksize)
    pool = mp.Pool(poolsize)
    for result in pool.imap_unordered(ys_genrator, range(n_iterations), chunksize):
        ys_final += result
    print(ys_final)
Prints:
poolsize = 8 chunksize = 124
[7815081. 7815081. 7815081.]
Update
You can also just use:
for result in pool.map(ys_genrator, range(n_iterations)):
    ys_final += result
The issue is that when you use method map, the method wants to compute an efficient chunksize argument based on the size of the iterable argument (see my compute_chunksize function above, which is essentially what pool.map will use). But to do this, it will first have to convert the iterable to a list to get its size. If n_iterations is very large, this is not very efficient, although it's probably not a major issue for a size of 3954. Still, you would be better off using my compute_chunksize function in this case, since you know the size of the iterable, and then pass the chunksize argument explicitly to map, as I have done in the code using imap_unordered.
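For example, reusing compute_chunksize and the variables from the code above:

chunksize = compute_chunksize(poolsize, n_iterations)
for result in pool.map(ys_genrator, range(n_iterations), chunksize=chunksize):
    ys_final += result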

Implementing Multiprocessing with the same Function in While Loop

I have implemented an Evolutionary Algorithm process in Python 3.8 and am attempting to optimise/reduce its runtime. Due to the heavy constraints on valid solutions, it can take a few minutes to generate valid chromosomes. To avoid spending hours just generating the initial population, I want to use multiprocessing to generate several at a time.
My code at this point in time is:
populationCount = 500

def readDistanceMatrix():
    # code removed

def generateAvailableValues():
    # code removed

def generateAvailableValuesPerColumn():
    # code removed

def generateScheduleTemplate():
    # code removed

def generateChromosome():
    # code removed

if __name__ == '__main__':
    # Data type = DataFrame
    distanceMatrix = readDistanceMatrix()
    # Data type = List of Integers
    availableValues = generateAvailableValues()
    # Data type = List containing Lists of Integers
    availableValuesPerColumn = generateAvailableValuesPerColumn(availableValues)
    # Data type = DataFrame
    scheduleTemplate = generateScheduleTemplate(distanceMatrix)
    # Data type = List containing custom class (with Integer and DataFrame)
    population = []
    while len(population) < populationCount:
        chrmSolution = generateChromosome(availableValuesPerColumn, scheduleTemplate, distanceMatrix)
        population.append(chrmSolution)
The population list is filled in by the while loop at the end. I would like to replace the while loop with a multiprocessing solution that can use up to a pre-set number of cores. For example:
population = []
availableCores = 6
while len(population) < populationCount:
    while usedCores < availableCores:
        # start generating another chromosome as 'chrmSolution'
        population.append(chrmSolution)
However, after reading and watching hours' worth of tutorials, I'm unable to get a loop up and running. How should I go about doing this?
It sounds like a simple multiprocessing.Pool should do the trick, or at least be a place to start. Here's a simple example of how that might look:
from multiprocessing import Pool, cpu_count

child_globals = {}  # mutable object at the `module` level acts as container for globals (constants)

def init_child(availableValuesPerColumn, scheduleTemplate, distanceMatrix):
    # passing variables to the child process every time is inefficient if they're
    # constant, so instead pass them to the initialization function, and let
    # each child re-use them each time generateChromosome is called
    child_globals['availableValuesPerColumn'] = availableValuesPerColumn
    child_globals['scheduleTemplate'] = scheduleTemplate
    child_globals['distanceMatrix'] = distanceMatrix

def child_work(i):
    # child_work simply wraps generateChromosome with inputs, and throws out dummy `i` from `range()`
    return generateChromosome(child_globals['availableValuesPerColumn'],
                              child_globals['scheduleTemplate'],
                              child_globals['distanceMatrix'])

if __name__ == '__main__':
    # ... your existing setup code that builds availableValuesPerColumn, scheduleTemplate, distanceMatrix
    with Pool(cpu_count(),
              initializer=init_child,  # init function to stuff some constants into the child's global context
              initargs=(availableValuesPerColumn, scheduleTemplate, distanceMatrix)) as p:
        # imap_unordered doesn't make child processes wait to ensure order is preserved,
        # so it keeps the cpu busy more often. it returns a generator, so we use list()
        # to store the results into a list.
        population = list(p.imap_unordered(child_work, range(populationCount)))

Running different functions in parallel in Python 3.7 using multiprocessing

I want to run two different functions in parallel in Python. I have used the code below:
def remove_special_char(data):
    # Change cell values which start with '=' sign leading to Excel formula issues
    data['Description'] = data['Description'].apply(lambda val: re.sub(r'^=', "'=", str(val)))
    return(data)

file_path1 = '.\file1.xlsx'
file_path2 = '.\file2.xlsx'

def method1(file_path1):
    data = pd.read_excel(file_path1)
    data = remove_special_char(data)
    return data

def method2(file_path2):
    data = pd.read_excel(file_path2)
    data = remove_special_char(data)
    return data
I am using the Pool code below, but it's not working.
from multiprocessing import Pool

p = Pool(3)
result1 = p.map(method1(file_path1), args=file_path1)
result2 = p.map(method2(file_path1), args=file_path2)
I want to run both these methods in parallel to save execution time and at the same time get the return value as well.
I don't know why you are defining the same method twice with different parameter names, but in any case the map method of Pool takes a function as its first argument and an iterable as its second. What map does is call the function on each item of the iterable and return a list with all the results. So what you want to do is something more like:
from multiprocessing import Pool

# raw strings so that '\f' is not interpreted as a form-feed character
file_paths = (r'.\file1.xlsx', r'.\file2.xlsx')

def method(file_path):
    data = pd.read_excel(file_path)
    data = remove_special_char(data)
    return data

with Pool(3) as p:
    result = p.map(method, file_paths)
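result is then a list of the two DataFrames in the same order as file_paths, so you can unpack it directly (and on Windows, remember to keep the Pool creation under an if __name__ == '__main__': guard, as noted in the earlier answers):

df1, df2 = result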
