I am attempting to parallelize an algorithm I have been working on using the multiprocessing module and Pool.map(). I ran into a problem and was hoping someone could point me in the right direction.
Let x denote an array of N rows and 1 column, initialized to a vector of zeros. Let C denote an N-by-2 array. The vector x is constructed iteratively using information from subsets of C (doing some math operations). The non-parallelized code, written as one large for loop, looks roughly as follows:
for j in range(0, N):
    # indx_j will have n_j << N entries
    indx_j = build_indices(C, j)
    # x_j holds the entries to be added to vector x at indices indx_j
    # This part is time consuming
    x_j = build_x_j(indx_j, C)
    # Add x_j into the entries of x
    x[indx_j] = x[indx_j] + x_j
I was able to parallelize this using the multiprocessing module and pool.map to eliminate the large for loop. I wrote a function that does the above computations, except for the step of adding x_j to x[indx_j]. Instead, the parallelized function returns two data sets: x_j and indx_j. After those are computed, I run a (non-parallel) for loop to build up x by doing the x[indx_j] = x[indx_j] + x_j computation for j = 0, ..., N-1.
The downside to my method is that the pool.map operation returns a gigantic list of N pairs of arrays x_j and indx_j, where both x_j and indx_j are n_j-by-1 vectors (n_j << N). For large N (N > 20,000) this takes up way too much memory.
Here is my question: can I somehow do the construction operation x[indx_j] = x[indx_j] + x_j in parallel? It seems to me each process in pool.map() would have to be able to interact with the vector x. Do I place x in some sort of shared memory? How would I do such a thing? I suspect this has to be possible somehow, as I assume people assemble matrices in parallel for finite element methods all the time.
How can I have multiple processes interact with a vector without some sort of problem? I'm worried that if, say, j = 20 and j = 23 happen simultaneously, they might try to do x[indx_20] = x[indx_20] + x_20 and x[indx_23] = x[indx_23] + x_23 at the same time and some error will occur. I also don't know how to have this computation done via pool.map() at all (I don't think I can feed x in as an input, since it would be changing after each process).
I'm not sure if it matters or not, but the sets indx_j will have non-trivial intersections (e.g., indx_1 and indx_2 may be [1,2,3] and [3,4,5]).
If this is unclear, please let me know and I will attempt to clarify. This is my first time trying to work in parallel, so I am very unsure of how to proceed. Any information would be greatly appreciated. Thanks!
I don't know if I am qualified to give proper advice on the topic of shared memory arrays, but I had a similar need to share arrays across processes in Python recently and came across a small custom numpy.ndarray implementation of a shared memory array, built on the shared ctypes within multiprocessing. Here is a link to the code: shmarray.py. It acts just like a normal array, except the underlying data is stored in shared memory, meaning separate processes can both read and write to the same array.
Using Shared Memory Array
In threading, everything visible to a thread (its global and local namespaces) is effectively shared with every other thread that has access to it, but in multiprocessing that data is not so easily accessible. On Linux a child process can read the parent's data, but as soon as it writes, the affected memory is copied first, so no other process sees the change. However, if the memory being written to is shared memory, it is not copied. This means that with shmarray we can do things much the way we would with threading, but with the true parallelism of multiprocessing. One way to get access to the shared memory array is with a Process subclass. I know you are currently using Pool.map(), but I have felt limited by the way map works, especially when dealing with n-dimensional arrays; Pool.map() is not really designed for numpy-style interfaces, at least not easily. Here is a simple idea where you would spawn a process for each j in N:
import numpy as np
import shmarray
import multiprocessing

class Worker(multiprocessing.Process):
    def __init__(self, j, C, x):
        multiprocessing.Process.__init__(self)
        self.shared_x = x
        self.C = C
        self.j = j

    def run(self):
        # Your stuff
        # indx_j will have n_j << N entries
        indx_j = build_indices(self.C, self.j)
        # x_j holds the entries to be added to vector x at indices indx_j
        x_j = build_x_j(indx_j, self.C)
        # Add x_j into the entries of x (shared memory, so the parent sees the change)
        self.shared_x[indx_j] = self.shared_x[indx_j] + x_j

# And then actually do the work
N = ...  # whatever N should be
x = shmarray.zeros(shape=(N, 1))
C = ...  # whatever C is; it doesn't need shared memory, since no writing is happening

procs = []
for j in range(N):
    proc = Worker(j, C, x)
    procs.append(proc)
    proc.start()

# And then join() the processes with the main process
for proc in procs:
    proc.join()
Custom Process Pool and Queues
So this might work, but spawning several thousand processes is not really going to be of any use if you only have a few cores. The way I handled this was to implement a queue system between my processes: the main process fills a queue with j values, and a few worker processes take numbers from the queue and do work with them. Note that by implementing this, you are essentially doing exactly what Pool does. Also note that we are actually going to use multiprocessing.JoinableQueue, since it lets us use join() to wait until the queue has been emptied.
It's not hard to implement at all; we simply need to modify our subclass a bit, along with how we use it.
import numpy as np
import shmarray
import multiprocessing

class Worker(multiprocessing.Process):
    def __init__(self, C, x, job_queue):
        multiprocessing.Process.__init__(self)
        self.shared_x = x
        self.C = C
        self.job_queue = job_queue

    def run(self):
        # New queue stuff
        j = None
        while j != 'kill':  # this is how I kill processes with queues; there might be a cleaner way
            j = self.job_queue.get()  # gets a job from the queue if there is one, otherwise blocks
            if j != 'kill':
                # Your stuff
                indx_j = build_indices(self.C, j)
                x_j = build_x_j(indx_j, self.C)
                self.shared_x[indx_j] = self.shared_x[indx_j] + x_j
            # This tells the queue that the job that was pulled from it
            # has been completed (we need this for queue.join())
            self.job_queue.task_done()

# The way we interact has changed; now we need to define a job queue
job_queue = multiprocessing.JoinableQueue()

N = ...  # whatever N should be
x = shmarray.zeros(shape=(N, 1))
C = ...  # whatever C is; it doesn't need shared memory, since no writing is happening

procs = []
proc_count = multiprocessing.cpu_count()  # create as many procs as cores
for _ in range(proc_count):
    proc = Worker(C, x, job_queue)  # now we pass the job queue instead
    procs.append(proc)
    proc.start()

# All the workers are just waiting for jobs now.
for j in range(N):
    job_queue.put(j)

job_queue.join()  # this blocks the main process until the queue has been emptied

# Now if you want to kill all the processes, just send a 'kill'
# job for each process.
for proc in procs:
    job_queue.put('kill')
job_queue.join()
Finally, I really cannot say anything about how this will handle writing to overlapping indices at the same time. The worst case is that you have a serious problem if two processes attempt to write at the same time and things get corrupted or crash (I am no expert here, so I really have no idea whether that would happen). The best case is that, since you are just doing addition and the order of operations doesn't matter, everything runs smoothly.
If it doesn't run smoothly, my suggestion is to create a second custom Process subclass that specifically does the array assignment. To implement this you would pass both a job queue and an 'output' queue to the Worker subclass. Within the while loop, you would have an output_queue.put((indx_j, x_j)) call instead of the direct assignment. NOTE: if you are putting these into a Queue they are being pickled, which can be slow. I recommend making them shared memory arrays before using put if you can; it may be faster to just pickle them in some cases, but I have not tested this. To assign these as they are generated, you then need an Assigner process that reads these values from a queue as jobs and applies them, so that its work loop would essentially be:
def run(self):
    job = None
    while job != 'kill':
        job = self.job_queue.get()
        if job != 'kill':
            indx_j, x_j = job
            # Note: this is the process which really needs access to the x array.
            self.x[indx_j] += x_j
        self.job_queue.task_done()
This last solution will likely be slower than doing the assignment within the worker processes, but done this way you have no worries about race conditions, and memory stays lighter, since the indx_j and x_j values are consumed as they are generated instead of waiting until all of them are done.
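For what it's worth, here is a rough sketch (my addition, untested) of how those pieces might be wired together. It assumes a hypothetical modified Worker that takes the output queue instead of x and puts (indx_j, x_j) tuples onto it, and a hypothetical Assigner subclass whose run() is the loop shown just above:
import multiprocessing
import shmarray

N = ...  # whatever N should be
C = ...  # whatever C is

job_queue = multiprocessing.JoinableQueue()  # j values go in here
out_queue = multiprocessing.JoinableQueue()  # (indx_j, x_j) pairs come out here
x = shmarray.zeros(shape=(N, 1))

assigner = Assigner(x, out_queue)  # hypothetical subclass; run() is the loop above
assigner.start()

# Hypothetical modified Worker: same as before, but puts (indx_j, x_j) on out_queue
# instead of writing to x directly.
workers = [Worker(C, job_queue, out_queue) for _ in range(multiprocessing.cpu_count())]
for w in workers:
    w.start()

for j in range(N):
    job_queue.put(j)
job_queue.join()  # every x_j has been produced and put on out_queue

for _ in workers:
    job_queue.put('kill')
job_queue.join()

out_queue.join()  # every (indx_j, x_j) has been applied to x by the Assigner
out_queue.put('kill')
assigner.join()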
Note for Windows
I didn't do any of this work on Windows, so I am not 100% certain, but I believe the code above will be very memory intensive there, because Windows does not implement a copy-on-write system for spawning independent processes. Essentially, Windows will copy ALL information that a process needs when spawning a new one from the main process. To fix this, I think replacing your x_j and C (anything you will be handing around to other processes) with shared memory arrays instead of normal arrays should keep Windows from copying the data, but I am not certain. You did not specify which platform you were on, so I figured better safe than sorry, since multiprocessing is a different beast on Windows than on Linux.
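To make that last suggestion concrete, here is a rough sketch (my addition); it relies only on shmarray.zeros(), the one piece of the shmarray API used above, and everything else about how shmarray behaves across a Windows process spawn is an assumption:
import numpy as np
import shmarray

N = 1000                            # hypothetical problem size
C_plain = np.random.random((N, 2))  # stand-in for however C is actually built

# Build C in shared memory once, in the parent, so that (hopefully) child
# processes on Windows do not each end up with their own full copy.
C = shmarray.zeros(shape=(N, 2))
C[:] = C_plain

x = shmarray.zeros(shape=(N, 1))
# ...create and start the Worker processes exactly as before, passing C and x...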
I am trying to figure out how to perform a multiprocessing task with an unusual formulation.
Basically, given two lists, each containing 10 matrices, I have to check whether applying an operation (which I'll call fn) gives the same result when the input is (A, B) as when it is (B, A).
With a sequential approach, the solution is straightforward:
# Given
A = [matrix_a1, ..., matrix_a10]
B = [matrix_b1, ..., matrix_b10]

AB_BA = [fn(A[i], B[i]) == fn(B[i], A[i]) for i in range(0, len(A))]
The next task is a bit strange because it requires setting strictly more than ten threads and applying multiprocessing. The restriction is that you cannot just assign the single comparisons to ten different processes, because the remaining processes would be unused. I do not know why the request seems to use "process" and "thread" interchangeably.
This task seems a bit confusing because in multiprocessing, generally, you set the maximum number of workers, not the minimum.
I tried to use a solution that uses a ProcessPoolExecutor, as follows:
import concurrent.futures

def equality(A, B, i):
    res = fn(A[i], B[i]) == fn(B[i], A[i])
    return res

with concurrent.futures.ProcessPoolExecutor(max_workers=20) as executor:
    idx = range(0, len(A))
    results = executor.map(equality, A, B, idx)
    for result in results:
        print(result)
My problem is that I am not sure how to check the resource usage. I have naively tried to monitor the CPU usage with the Ubuntu system monitor as well as with top from the command line.
In addition, this solution is the most efficient among those I tried, but there is no direct specification to use at least 11 workers, so it does not seem to satisfy what was requested.
I also tried other solutions, such as using Pool directly. According to top, this spawns 10 Python instances, but again not more than 10. Here's what I tried:
import multiprocessing as mp

def equality(A, B):
    res = fn(A, B) == fn(B, A)
    return res

with mp.Pool(20) as p:
    print(p.starmap(equality, ((A[i], B[i]) for i in range(0, len(A)))))
Do you have any suggestions to address this request as well as monitor the resource usage to be sure it is working as expected?
Thank you very much for your help in advance.
I wish you had published the actual problem word for word, since your description is a bit unclear. But this is what I know (or think I know):
Unless the amount of CPU processing done by your worker function equality is great enough that the gain from running it in parallel more than offsets the additional multiprocessing overhead you would not otherwise have (starting processes, moving data from one address space to another, etc.), your multiprocessing code will run more slowly. Therefore, you should design your worker function to do as much work as possible and to pass as little data as possible.
When you specify ...
results = executor.map(equality, A, B, idx)
... your equality function will be invoked once for each element of A, B and idx. So what is being passed is not the entire lists A and B but rather individual elements (e.g. matrix_a1 and matrix_b1). Therefore, there is no point in even passing an idx argument:
def equality(matrix_a, matrix_b):
    """
    matrix_a and matrix_b are each single elements of
    lists A and B respectively.
    """
    return fn(matrix_a, matrix_b) == fn(matrix_b, matrix_a)

def main():
    from os import cpu_count
    from concurrent.futures import ProcessPoolExecutor

    A = [matrix_a1, ..., matrix_a10]
    B = [matrix_b1, ..., matrix_b10]

    # Do not create more processes than we have either
    # CPU cores or tasks that need to be submitted:
    pool_size = min(cpu_count(), len(A))
    with ProcessPoolExecutor(max_workers=pool_size) as executor:
        AB_BA = list(executor.map(equality, A, B))
    # This will be a list of 10 elements, each either `True` or `False`:
    print(AB_BA)

# Required for Windows:
if __name__ == '__main__':
    main()
So we will be submitting 10 tasks to a pool size of 10. Internally there is a "task queue" on which all the arguments being passed to equality exist:
matrix_a1, matrix_b1 # task 1
matrix_a2, matrix_b2 # task 2
...
matrix_a10, matrix_b10 # task 10
Any process in the pool that is idle will grab the next task in the queue to work on, and the results will be returned in task-submission order. But since equality is such a short-running function (unless fn is sufficiently complicated), there is a possibility that the pool process that grabs the first task completes it and grabs the second task before some other pool process has even been dispatched by the operating system. So there is no guarantee that all 10 tasks will be worked on in parallel by 10 pool processes, even if function fn is sufficiently CPU-intensive. If you were to insert a call to time.sleep(.1) at the beginning of equality, that would give the other pool processes a chance to "wake up" and grab their own tasks from the task queue. But that would slow your program down even more, since sleeping for this purpose is totally non-productive. The point I am trying to make is that you cannot ensure that all pool processes will always be active concurrently.
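If you want to see this effect for yourself, here is a small self-contained experiment (my addition, not part of the original answer) that records which pool process handles each task. With a trivial payload you will often see only one or two PIDs doing all the work; uncommenting the sleep spreads the tasks out:
import os
import time
from concurrent.futures import ProcessPoolExecutor

def which_worker(i):
    # time.sleep(0.1)  # uncomment to give every worker a chance to wake up
    return i, os.getpid()

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=10) as executor:
        for i, pid in executor.map(which_worker, range(10)):
            print(f"task {i} ran in process {pid}")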
I use joblib to work in parallel, and I want to write the results into a list in parallel.
To avoid problems, I create an empty list beforehand so that it can be easily accessed.
During parallelization the data are available in the list, but they are no longer there when the results are put together.
How can data be saved in parallel?
from joblib import Parallel, delayed
import multiprocessing

data = []

def worker(i):
    ldata = []
    ...  # create list ldata
    data[i].append(ldata)

for i in range(0, 1000):
    data.append([])

num_cores = multiprocessing.cpu_count()
Parallel(n_jobs=num_cores)(delayed(worker)(i) for i in range(0, 1000))

resultlist = []
for i in range(0, 1000):
    resultlist.extend(data[i])
You should look at Parallel as a parallel map operation that does not allow for side effects. The execution model of Parallel is that by default it starts new worker copies of the master process, serialises the input data, sends it over to the workers, has them iterate over it, and then collects the return values. Any change a worker performs on data stays in its own memory space and is thus invisible to the master process. You have two options here:
First, you can have your workers return ldata instead of updating data[i]. In that case, data will have to be assigned the result returned by Parallel(...)(...):
def worker(i):
    ...
    return ldata

data = Parallel(n_jobs=num_cores)(delayed(worker)(i) for i in range(0, 1000))
The second option is to force shared memory semantics, which uses threads instead of processes. When workers execute in threads, their memory space is that of the master process, which is where data lives originally. To enforce these semantics, add the require='sharedmem' keyword argument in the call to Parallel:
Parallel(n_jobs=num_cores, require='sharedmem')(delayed(worker)(i) for i in range(0, 1000))
The different modes and semantics are explained in the joblib documentation here.
Keep in mind that your worker() function is written in pure Python and is therefore interpreted. This means that worker threads can't run fully concurrently even if there is just one thread per CPU due to the dreaded Global Interpreter Lock (GIL). This is also explained in the documentation. Therefore, you'd better stick with the first solution in general, despite the marshalling and interprocess communication overheads.
I have a large program (specifically, a function) that I'm attempting to parallelize using a JoinableQueue and the multiprocessing map_async method. The function that I'm working with does several operations on multidimensional arrays, so I break up each array into sections, and each section evaluates independently; however I need to stitch together one of the arrays early on, but the "stitch" happens before the "evaluate" and I need to introduce some kind of delay in the JoinableQueue. I've searched all over for a workable solution but I'm very new to multiprocessing and most of it goes over my head.
This phrasing may be confusing; apologies. Here's an outline of my code (I can't post all of it because it's very long, but I can provide additional detail if needed):
import numpy as np
import multiprocessing as mp
from multiprocessing import Pool, Pipe, JoinableQueue

def main_function(section_number):
    # define section sizes
    array_this_section = array[:, start:end+1, :]
    histogram_this_section = np.zeros((3, dataset_size, dataset_size))
    # start and end are defined according to the size of the array
    # dataset_size is to show that the histogram is a different size than the array

    for m in range(1, num_iterations+1):
        # do several operations - each section of the array
        # corresponds to a section on the histogram

        hist_queue.put(histogram_this_section)
        # each process sends its own part of the histogram outside of the pool
        # to be combined with every other part - later operations
        # in this function must use the full histogram
        hist_queue.join()
        full_histogram = full_hist_queue.get()
        full_hist_queue.task_done()

        # do many more operations

hist_queue = JoinableQueue()
full_hist_queue = JoinableQueue()

if __name__ == '__main__':
    pool = mp.Pool(num_sections)
    args = np.arange(num_sections)
    pool.map_async(main_function, args, chunksize=1)
    # I need the map_async because the program is designed to display an output at the
    # end of each iteration, and each output must be a compilation of all processes

    # a few variable definitions go here

    for m in range(1, num_iterations+1):
        for i in range(num_sections):
            temp_hist = hist_queue.get()  # the code hangs here because the queue
                                          # is attempting to get before anything
                                          # has been put
            hist_full += temp_hist
        for i in range(num_sections):
            hist_queue.task_done()
        for i in range(num_sections):
            full_hist_queue.put(hist_full)  # the full histogram is sent back into
                                            # the pool
        full_hist_queue.join()
        # etc etc

    pool.close()
    pool.join()
I'm pretty sure that your issue is how you're creating the Queues and trying to share them with the child processes. If you just have them as global variables, they may be recreated in the child processes instead of inherited (the exact details depend on what OS and/or context you're using for multiprocessing).
A better way to go about solving this issue is to avoid using multiprocessing.Pool to spawn your processes and instead explicitly create Process instances for your workers yourself. This way you can pass the Queue instances to the processes that need them without any difficulty (it's technically possible to pass the queues to the Pool workers, but it's awkward).
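For reference, here is a minimal sketch of that awkward Pool route (my addition, not part of the original answer): the queues can be handed to each pool worker once, at start-up, through the pool's initializer. The explicit-Process approach below is still the simpler option.
from multiprocessing import Pool, JoinableQueue

hist_queue = None
full_hist_queue = None

def init_worker(hq, fhq):
    # Runs once in each pool worker; stores the queues as that worker's globals.
    global hist_queue, full_hist_queue
    hist_queue = hq
    full_hist_queue = fhq

if __name__ == '__main__':
    num_sections = 4  # hypothetical value for illustration
    hq = JoinableQueue()
    fhq = JoinableQueue()
    pool = Pool(num_sections, initializer=init_worker, initargs=(hq, fhq))
    # the worker function can now use hist_queue / full_hist_queue as globals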
I'd try something like this:
from multiprocessing import Process, JoinableQueue

def worker_function(section_number, hist_queue, full_hist_queue):  # take queues as arguments
    # ... the rest of the function can work as before
    # note, I renamed this from "main_function" since it's not running in the main process
    ...

if __name__ == '__main__':
    hist_queue = JoinableQueue()       # create the queues only in the main process
    full_hist_queue = JoinableQueue()  # the workers don't need to access them as globals

    processes = [Process(target=worker_function, args=(i, hist_queue, full_hist_queue))
                 for i in range(num_sections)]
    for p in processes:
        p.start()
    # ...
If the different stages of your worker function are more or less independent of one another (that is, the "do many more operations" step doesn't depend directly on the "do several operations" step above it, just on full_histogram), you might be able to keep the Pool and instead split up the different steps into separate functions, which the main process could call via several calls to map on the pool. You don't need to use your own Queues in this approach, just the ones built in to the Pool. This might be best especially if the number of "sections" you're splitting the work up into doesn't correspond closely with the number of processor cores on your computer. You can let the Pool match the number of cores, and have each one work on several sections of the data in turn.
A rough sketch of this would be something like:
import itertools
import multiprocessing

def worker_make_hist(section_number):
    # do several operations, get a partial histogram
    return histogram_this_section

def worker_do_more_ops(section_number, full_histogram):
    # whatever...
    return some_result

if __name__ == "__main__":
    pool = multiprocessing.Pool()  # by default the size will be equal to the number of cores

    for temp_hist in pool.imap_unordered(worker_make_hist, range(number_of_sections)):
        hist_full += temp_hist

    some_results = pool.starmap(worker_do_more_ops, zip(range(number_of_sections),
                                                        itertools.repeat(hist_full)))
I'm relatively new to Python, and have been able to answer most of my questions based on similar problems answered on forums, but I'm at a point where I'm stuck and could use some help.
I have a simple nested for loop script that generates an output of strings. What I need to do next is have each grouping go through a simulation, based on numerical values that the strings will be matched to.
Really my question is how do I go about this in the best way? I'm not sure if multithreading will work, since the strings are generated and then need to undergo the simulation one set at a time. I was reading about queues and wasn't sure whether the combinations could be passed into a queue for storage and then undergo the simulation in the same order they entered the queue.
Regardless of the research I've done, I'm open to any suggestion anyone can make on the matter.
Thanks!
Edit: I'm not looking for an answer on how to do the simulation, but rather a way to store the combinations while the simulations are being computed.
example
X = ["a","b"]
Y = ["c","d","e"]
Z= ["f","g"]
for A in itertools.combinations(X,1):
for B in itertools.combinations(Y,2):
for C in itertools.combinations(Z, 2):
D = A + B + C
print(D)
As was hinted at in the comments, the multiprocessing module is what you're looking for. Threading won't help you because of the Global Interpreter Lock (GIL), which limits execution to one Python thread at a time. In particular, I would look at multiprocessing pools. These objects give you an interface to have a pool of subprocesses do work for you in parallel with the main process, and you can go back and check on the results later.
Your example snippet could look something like this:
import itertools
import multiprocessing

X = ["a", "b"]
Y = ["c", "d", "e"]
Z = ["f", "g"]

pool = multiprocessing.Pool()  # by default, this will create a number of workers equal to
                               # the number of CPU cores you have available

combination_list = []  # create a list to store the combinations
for A in itertools.combinations(X, 1):
    for B in itertools.combinations(Y, 2):
        for C in itertools.combinations(Z, 2):
            D = A + B + C
            combination_list.append(D)  # append this combination to the list

results = pool.map(simulation_function, combination_list)
# simulation_function is the function you're using to actually run your
# simulation - assuming it only takes one parameter: the combination
The call to pool.map is blocking - meaning that once you call it, execution in the main process will halt until all the simulations are complete, but it is running them in parallel. When they complete, whatever your simulation function returns will be available in results, in the same order that the input arguments were in the combination_list.
If you don't want to wait for them, you can also use apply_async on your pool and store the result to look at later:
import itertools
import multiprocessing

X = ["a", "b"]
Y = ["c", "d", "e"]
Z = ["f", "g"]

pool = multiprocessing.Pool()

result_list = []  # create a list to store the simulation results
for A in itertools.combinations(X, 1):
    for B in itertools.combinations(Y, 2):
        for C in itertools.combinations(Z, 2):
            D = A + B + C
            result_list.append(pool.apply_async(
                simulation_function,
                args=(D,)))  # note the extra comma - args must be a tuple

# do other stuff

# now iterate over result_list to check the results when they're ready
If you use this structure, result_list will be full of multiprocessing.AsyncResult objects, which allow you to check whether each one is ready with result.ready() and, if it is, retrieve its value with result.get(). This approach kicks off a simulation as soon as its combination is calculated, instead of waiting until all of them have been calculated before starting to process them. The downside is that it's a little more complicated to manage and retrieve the results: you have to make sure each result is ready before you get() it (or be prepared to wait), and you need to be ready to catch exceptions that may have been raised in the worker function. The caveats are explained pretty well in the documentation.
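As an illustration (my addition, not part of the original answer), a polling loop over result_list might look like the following; simulation_function and result_list are the names used above:
import time

pending = list(result_list)
while pending:
    still_pending = []
    for result in pending:
        if result.ready():
            try:
                print(result.get())  # re-raises any exception from the worker
            except Exception as exc:
                print("simulation failed:", exc)
        else:
            still_pending.append(result)
    pending = still_pending
    if pending:
        time.sleep(0.1)  # avoid a busy loop while results are still being computed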
If calculating the combinations doesn't actually take very long and you don't mind your main process halting until they're all ready, I suggest the pool.map approach.
The upshot of the below is that I have an embarrassingly parallel for loop that I am trying to thread. There's a bit of rigamarole to explain the problem, but despite all the verbosity, I think this should be a rather trivial problem that the multiprocessing module is designed to solve easily.
I have a large length-N array of k distinct functions, and a length-N array of abcissa. Thanks to the clever solution provided by #senderle described in Efficient algorithm for evaluating a 1-d array of functions on a same-length 1d numpy array, I have a fast numpy-based algorithm that I can use to evaluate the functions at the abcissa to return a length-N array of ordinates:
import numpy as np

def apply_indexed_fast(abcissa, func_indices, func_table):
    """ Returns the output of an array of functions evaluated at a set of input points
    if the indices of the table storing the required functions are known.

    Parameters
    ----------
    func_table : array_like
        Length k array of function objects

    abcissa : array_like
        Length Npts array of points at which to evaluate the functions.

    func_indices : array_like
        Length Npts array providing the indices to use to choose which function
        operates on each abcissa element. Thus func_indices is an array of integers
        ranging between 0 and k-1.

    Returns
    -------
    out : array_like
        Length Npts array giving the evaluation of the appropriate function on each
        abcissa element.
    """
    func_argsort = func_indices.argsort()
    func_ranges = list(np.searchsorted(func_indices[func_argsort], range(len(func_table))))
    func_ranges.append(None)
    out = np.zeros_like(abcissa)

    for i in range(len(func_table)):
        f = func_table[i]
        start = func_ranges[i]
        end = func_ranges[i+1]
        ix = func_argsort[start:end]
        out[ix] = f(abcissa[ix])

    return out
What I'm now trying to do is use multiprocessing to parallelize the for loop inside this function. Before describing my approach, for clarity I'll briefly sketch how the algorithm #senderle developed works. If you can read the above code and understand it immediately, just skip the next paragraph of text.
First we find the array of indices that sorts the input func_indices, which we use to define the length-k func_ranges array of integers. The integer entries of func_ranges control the function that gets applied to the appropriate sub-array of the input abcissa, which works as follows. Let f be the i^th function in the input func_table. Then the slice of the input abcissa to which we should apply the function f is slice(func_ranges[i], func_ranges[i+1]). So once func_ranges is calculated, we can just run a simple for loop over the input func_table and successively apply each function object to the appropriate slice, filling our output array. See the code below for a minimal example of this algorithm in action.
def trivial_functional(i):
    def f(x):
        return i * x
    return f

k = 250
func_table = np.array([trivial_functional(j) for j in range(k)])

Npts = int(1e6)
abcissa = np.random.random(Npts)
func_indices = np.random.random_integers(0, len(func_table)-1, Npts)

result = apply_indexed_fast(abcissa, func_indices, func_table)
So my goal now is to use multiprocessing to parallelize this calculation. I thought this would be straightforward using my usual trick for threading embarrassingly parallel for loops. But my attempt below raises an exception that I do not understand.
from multiprocessing import Pool, cpu_count

def apply_indexed_parallelized(abcissa, func_indices, func_table):
    func_argsort = func_indices.argsort()
    func_ranges = list(np.searchsorted(func_indices[func_argsort], range(len(func_table))))
    func_ranges.append(None)
    out = np.zeros_like(abcissa)

    num_cores = cpu_count()
    pool = Pool(num_cores)

    def apply_funci(i):
        f = func_table[i]
        start = func_ranges[i]
        end = func_ranges[i+1]
        ix = func_argsort[start:end]
        out[ix] = f(abcissa[ix])

    pool.map(apply_funci, range(len(func_table)))
    pool.close()

    return out

result = apply_indexed_parallelized(abcissa, func_indices, func_table)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
I have seen this elsewhere on SO: Multiprocessing: How to use Pool.map on a function defined in a class?. One by one, I have tried each method proposed there; in all cases, I get a "too many files open" error because the threads were never closed, or the adapted algorithm simply hangs. This seems like there should be a simple solution since this is nothing more than threading an embarrassingly parallel for loop.
Warning/Caveat:
You may not want to apply multiprocessing to this problem. For relatively simple operations on large arrays, the problem will be memory-bound with numpy: the bottleneck is moving data from RAM to the CPU caches. The CPU is starved for data, so throwing more CPUs at the problem doesn't help much. Furthermore, your current approach will pickle and make a copy of the entire array for every item in your input sequence, which adds lots of overhead.
There are plenty of cases where numpy + multiprocessing is very effective, but you need to make sure you're working with a CPU-bound problem. Ideally, it's a CPU-bound problem with relatively small inputs and outputs to alleviate the overhead of pickling the input and output. For many of the problems that numpy is most often used for, that's not the case.
Two Problems with Your Current Approach
On to answering your question:
Your immediate error is due to passing in a function that's not accessible from the global scope (i.e. a function defined inside a function).
However, you have another issue. You're treating the numpy arrays as though they're shared memory that can be modified by each process. Instead, when using multiprocessing the original array will be pickled (effectively making a copy) and passed to each process independently. The original array will never be modified.
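To make that concrete, here is a tiny demonstration (my addition, not from the original answer): each pool worker operates on its own copy of the array, so its writes never reach the parent process.
import numpy as np
import multiprocessing

data = np.zeros(4)

def scribble(i):
    data[i] = 99          # modifies this worker's private copy only
    return data[i]

if __name__ == '__main__':
    pool = multiprocessing.Pool()
    print(pool.map(scribble, range(4)))  # [99.0, 99.0, 99.0, 99.0]
    pool.close()
    print(data)                          # still [0. 0. 0. 0.] in the parent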
Avoiding the PicklingError
As a minimal example to reproduce your error, consider the following:
import multiprocessing

def apply_parallel(input_sequence):
    def func(x):
        pass
    pool = multiprocessing.Pool()
    pool.map(func, input_sequence)
    pool.close()

foo = range(100)
apply_parallel(foo)
This will result in:
PicklingError: Can't pickle <type 'function'>: attribute lookup
__builtin__.function failed
Of course, in this simple example, we could simply move the function definition back into the __main__ namespace. However, in yours, you need it to refer to data that's passed in. Let's look at an example that's a bit closer to what you're doing:
import numpy as np
import multiprocessing

def parallel_rolling_mean(data, window):
    data = np.pad(data, window, mode='edge')
    ind = np.arange(len(data)) + window

    def func(i):
        return data[i-window:i+window+1].mean()

    pool = multiprocessing.Pool()
    result = pool.map(func, ind)
    pool.close()
    return result

foo = np.random.rand(20).cumsum()
result = parallel_rolling_mean(foo, 10)
There are multiple ways you could handle this, but a common approach is something like:
import numpy as np
import multiprocessing

class RollingMean(object):
    def __init__(self, data, window):
        self.data = np.pad(data, window, mode='edge')
        self.window = window

    def __call__(self, i):
        start = i - self.window
        stop = i + self.window + 1
        return self.data[start:stop].mean()

def parallel_rolling_mean(data, window):
    func = RollingMean(data, window)
    ind = np.arange(len(data)) + window

    pool = multiprocessing.Pool()
    result = pool.map(func, ind)
    pool.close()
    return result

foo = np.random.rand(20).cumsum()
result = parallel_rolling_mean(foo, 10)
Great! It works!
However, if you scale this up to large arrays, you'll soon find that it will either run very slow (which you can speed up by increasing chunksize in the pool.map call) or you'll quickly run out of RAM (once you increase the chunksize).
multiprocessing pickles the input so that it can be passed to separate and independent python processes. This means you're making a copy of the entire array for every i you operate on.
We'll come back to this point in a bit...
multiprocessing does not share memory between processes
The multiprocessing module works by pickling the inputs and passing them to independent processes. This means that if you modify something in one process the other process won't see the modification.
However, multiprocessing also provides primitives that live in shared memory and can be accessed and modified by child processes. There are a few different ways of adapting numpy arrays to use a shared memory multiprocessing.Array. However, I'd recommend avoiding those at first (read up on false sharing if you're not familiar with it). There are cases where it's very useful, but it's typically to conserve memory, not to improve performance.
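Purely to illustrate the primitive mentioned above (this is my addition, not a recommendation from this answer), a numpy view over a shared multiprocessing.Array can be set up roughly like this:
import numpy as np
import multiprocessing

def init_worker(shared_arr):
    # Give each pool worker a numpy view onto the same shared buffer.
    global shared_np
    shared_np = np.frombuffer(shared_arr.get_obj())

def fill(bounds):
    start, stop = bounds
    shared_np[start:stop] = 1.0  # visible to the parent, because the buffer is shared

if __name__ == '__main__':
    shared_arr = multiprocessing.Array('d', 10)  # ten doubles, zero-initialised
    pool = multiprocessing.Pool(initializer=init_worker, initargs=(shared_arr,))
    pool.map(fill, [(0, 5), (5, 10)])
    pool.close()
    pool.join()
    print(np.frombuffer(shared_arr.get_obj()))  # all ones
Again, this mainly conserves memory rather than improving speed, and it reintroduces exactly the kind of concurrent-write questions discussed earlier.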
Therefore, it's best to do all modifications to a large array in a single process (this is also a very useful pattern for general IO). It doesn't have to be the "main" process, but it's easiest to think about that way.
As an example, let's say we wanted to have our parallel_rolling_mean function take an output array to store things in. A useful pattern is something similar to the following. Notice the use of iterators and modifying the output only in the main process:
import numpy as np
import multiprocessing

def parallel_rolling_mean(data, window, output):
    def windows(data, window):
        padded = np.pad(data, window, mode='edge')
        for i in xrange(len(data)):
            yield padded[i:i + 2*window + 1]

    pool = multiprocessing.Pool()
    results = pool.imap(np.mean, windows(data, window))
    for i, result in enumerate(results):
        output[i] = result
    pool.close()

foo = np.random.rand(20000000).cumsum()
output = np.zeros_like(foo)
parallel_rolling_mean(foo, 10, output)
print output
Hopefully that example helps clarify things a bit.
chunksize and performance
One quick note on performance: If we scale this up, it will get very slow very quickly. If you look at a system monitor (e.g. top/htop), you may notice that your cores are sitting idle most of the time.
By default, the master process pickles each input for each process and passes it in immediately and then waits until they're finished to pickle the next input. In many cases, this means that the master process works, then sits idle while the worker processes are busy, then the worker processes sit idle while the master process is pickling the next input.
The key is to increase the chunksize parameter. This will cause pool.imap to "pre-pickle" the specified number of inputs for each process. Basically, the master thread can stay busy pickling inputs and the worker processes can stay busy processing. The downside is that you're using more memory. If each input uses up a large amount of RAM, this can be a bad idea. If it doesn't, though, this can dramatically speed things up.
As a quick example:
import numpy as np
import multiprocessing

def parallel_rolling_mean(data, window, output):
    def windows(data, window):
        padded = np.pad(data, window, mode='edge')
        for i in xrange(len(data)):
            yield padded[i:i + 2*window + 1]

    pool = multiprocessing.Pool()
    results = pool.imap(np.mean, windows(data, window), chunksize=1000)
    for i, result in enumerate(results):
        output[i] = result
    pool.close()

foo = np.random.rand(2000000).cumsum()
output = np.zeros_like(foo)
parallel_rolling_mean(foo, 10, output)
print output
With chunksize=1000, it takes 21 seconds to process a 2-million element array:
python ~/parallel_rolling_mean.py 83.53s user 1.12s system 401% cpu 21.087 total
But with chunksize=1 (the default) it takes about eight times as long (2 minutes, 41 seconds).
python ~/parallel_rolling_mean.py 358.26s user 53.40s system 246% cpu 2:47.09 total
In fact, with the default chunksize, it's actually far worse than a single-process implementation of the same thing, which takes only 45 seconds:
python ~/sequential_rolling_mean.py 45.11s user 0.06s system 99% cpu 45.187 total