I have a script that all my workers updates different dictionary in a Manager list. My question is: when one worker is writing the list, will the others wait for it, or all worker can update the list at the same time?
Here is my code:
from multiprocessing import Process, Manager
def worker(x, i, *args):
sub_l = x[i]
sub_l[i] = i
x[i] = sub_l
if __name__ == '__main__':
manager = Manager()
num = 4
x = manager.list([{}]*num)
p = []
for i in range(num):
p.append(Process(target=worker, args=(x, i)))
p[i].start()
for i in range(5):
p[i].join()
print x
I only need all my workers to run separately, and update different global variables. I kind of think using manager.list is an over kill, but not sure if there is other way to do this.
The Manager server that provides access to the Manager.list doesn't do any synchronization when you try to access objects its managing. You can basically think of it exactly the same way you would if you were just dealing with threads and an ordinary global variable; because of the GIL, two threads aren't going to be able to actually step on top of each other while doing an atomic bytecode operation, but doing something like incrementing a variable, which requires multiple bytecode operations, need to be protected by a lock.
In your case, you'd only be in trouble if multiple workers did certain operations on the same sub-list in parallel. If two workers ran this at the same time:
sub_l = x[i]
sub_l[i] = sub_l[i] + <something unique to the worker>
x[i] = sub_l
With the same i, they could end up stomping on each other; both would store a copy of the same sub_l, and both would increment sub_l[i], and then both would update x[i], but the second one to update x would overwrite the changes done by the first one.
As long as you don't try to do those kind of things in parallel across workers, you should be fine.
Related
I use joblib to work in parallel, I want to write the results in parallel in a list.
So as to avoid problems, I create an ldata = [] list beforehand, so that it can be easily accessed.
During parallelization, the data are available in the list, but no longer when they are put together.
How can data be saved in parallel?
from joblib import Parallel, delayed
import multiprocessing
data = []
def worker(i):
ldata = []
... # create list ldata
data[i].append(ldata)
for i in range(0, 1000):
data.append([])
num_cores = multiprocessing.cpu_count()
Parallel(n_jobs=num_cores)(delayed(worker)(i) for i in range(0, 1000))
resultlist = []
for i in range(0, 1000):
resultlist.extend(data[i])
You should look at Parallel as a parallel map operation that does not allow for side effects. The execution model of Parallel is that it by default starts new worker copies of the master processes, serialises the input data, sends it over to the workers, have them iterate over it, then collects the return values. Any change a worker performs on data stays in its own memory space and is thus invisible to the master process. You have two options here:
First, you can have your workers return ldata instead of updating data[i]. In that case, data will have to be assigned the result returned by Parallel(...)(...):
def worker(i):
...
return ldata
data = Parallel(n_jobs=num_cores)(delayed(worker)(i) for i in range(0, 1000))
Second option is to force a shared memory semantics that uses threads instead of processes. When works execute in threads, their memory space is that of the master process, which is where data lies originally. To enforce this semantics, add require='sharedmem' keyword argument in the call to Parallel:
Parallel(n_jobs=num_cores, require='sharedmem')(delayed(worker)(i) for i in range(0, 1000))
The different modes and semantics are explained in the joblib documentation here.
Keep in mind that your worker() function is written in pure Python and is therefore interpreted. This means that worker threads can't run fully concurrently even if there is just one thread per CPU due to the dreaded Global Interpreter Lock (GIL). This is also explained in the documentation. Therefore, you'd better stick with the first solution in general, despite the marshalling and interprocess communication overheads.
I'm working on implementing a randomized algorithm in python. Since this involves doing the same thing many (say N) times, it rather naturally parallelizes and I would like to make use of that. More specifically, I want to distribute the N iterations on all of the cores of my CPU. The problem in question involves computing the maximum of something and is thus something where every worker could compute their own maximum and then only report that one back to the parent process, which then only needs to figure out the global maximum out of those few local maxima.
Somewhat surprisingly, this does not seem to be an intended use-case of the multiprocessing module, but I'm not entirely sure how else to go about it. After some research I came up with the following solution (toy problem to find the maximum in a list that is structurally the same as my actual one):
import random
import multiprocessing
l = []
N = 100
numCores = multiprocessing.cpu_count()
# globals for every worker
mySendPipe = None
myRecPipe = None
def doWork():
pipes = zip(*[multiprocessing.Pipe() for i in range(numCores)])
pool = multiprocessing.Pool(numCores, initializeWorker, (pipes,))
pool.map(findMax, range(N))
results = []
# collate results
for p in pipes[0]:
if p.poll():
results.append(p.recv())
print(results)
return max(results)
def initializeWorker(pipes):
global mySendPipe, myRecPipe
# ID of a worker process; they are consistently named PoolWorker-i
myID = int(multiprocessing.current_process().name.split("-")[1])-1
# Modulo: When starting a second pool for the second iteration of doWork() they are named with IDs 5-8.
mySendPipe = pipes[1][myID%numCores]
myRecPipe = pipes[0][myID%numCores]
def findMax(count):
myMax = 0
if myRecPipe.poll():
myMax = myRecPipe.recv()
value = random.choice(l)
if myMax < value:
myMax = value
mySendPipe.send(myMax)
l = range(1, 1001)
random.shuffle(l)
max1 = doWork()
l = range(1001, 2001)
random.shuffle(l)
max2 = doWork()
return (max1, max2)
This works, sort of, but I've got a problem with it. Namely, using pipes to store the intermediate results feels rather silly (and is probably slow). But it also has the real problem, that I can't send arbitrarily large things through the pipe, and my application unfortunately sometimes exceeds this size (and deadlocks).
So, what I'd really like is a function analogue to the initializer that I can call once for every worker on the pool to return their local results to the parent process. I could not find such functionality, but maybe someone here has an idea?
A few final notes:
I use a global variable for the input because in my application the input is very large and I don't want to copy it to every process. Since the processes never write to it, I believe it should never be copied (or am I wrong there?). I'm open to suggestions to do this differently, but mind that I need to run this on changing inputs (sequentially though, just like in the example above).
I'd like to avoid using the Manager-class, since (by my understanding) it introduces synchronisation and locks, which in this problem should be completely unnecessary.
The only other similar question I could find is Python's multiprocessing and memory, but they wish to actually process the individual results of the workers, whereas I do not want the workers to return N things, but to instead only run a total of N times and return only their local best results.
I'm using Python 2.7.15.
tl;dr: Is there a way to use local memory for every worker process in a multiprocessing pool, so that every worker can compute a local optimum and the parent process only needs to worry about figuring out which one of those is best?
You might be overthinking this a little.
By making your worker-functions (in this case findMax) actually return a value instead of communicating it, you can store the result from calling pool.map() - it is just a parallel variant of map, after all! It will map a function over a list of inputs and return the list of results of that function call.
The simplest example illustrating my point follows your "distributed max" example:
import multiprocessing
# [0,1,2,3,4,5,6,7,8]
x = range(9)
# split the list into 3 chunks
# [(0, 1, 2), (3, 4, 5), (6, 7, 8)]
input = zip(*[iter(x)]*3)
pool = multiprocessing.Pool(2)
# compute the max of each chunk:
# max((0,1,2)) == 2
# max((3,4,5)) == 5
# ...
res = pool.map(max, input)
print(res)
This returns [2, 5, 8].
Be aware that there is some light magic going on: I use the built-in max() function which expects iterables as input. Now, if I would only pool.map over a plain list of integers, say, range(9), that would result in calls to max(0), max(1) etc. - not very useful, huh? Instead, I partition my list into chunks, so effectively, when mapping, we now map over a list of tuples, thus feeding a tuple to max on each call.
So perhaps you have to:
return a value from your worker func
think about how you structure your input domain so that you feed meaningful chunks to each worker
PS: You wrote a great first question! Thank you, it was a pleasure reading it :) Welcome to StackOverflow!
I am using Python's multiprocessing to create a parallel application. Processes need to share some data, for which I use a Manager. However, I have some common functions which processes need to call and which need to access the data stored by the Manager object. My question is whether I can avoid needing to pass the Manager instance to these common functions as an argument and rather use it like a global. In other words, consider the following code:
import multiprocessing as mp
manager = mp.Manager()
global_dict = manager.dict(a=[0])
def add():
global_dict['a'] += [global_dict['a'][-1]+1]
def foo_parallel(var):
add()
print var
num_processes = 5
p = []
for i in range(num_processes):
p.append(mp.Process(target=foo_parallel,args=(global_dict,)))
[pi.start() for pi in p]
[pi.join() for pi in p]
This runs fine and returns p=[0,1,2,3,4,5] on my machine. However, is this "good form"? Is this a good way to doing it, just as good as defining add(var) and calling add(var) instead?
Your code example seems to have bigger problems than form. You get your desired output only with luck. Repeated execution will yield different results. That's because += is not an atomic operation. Multiple processes can read the same old value one after another, before any of them has updated it and they will write back the same values. To prevent this behaviour, you'll have to use a Manager.Lock additionally.
To your original question about "good form".
IMO it would be cleaner, to let the main-function of the child process foo_parallel, pass global_dict explicitly into a generic function add(var). That would be a form of dependency injection and has some advantages. In your example non-exhaustively:
allows isolated testing
increases code reusability
easier debugging (detecting non-accessibility of the managed object shouldn't be delayed until addis called (fail fast)
less boilerplate code (for example try-excepts blocks on resources multiple functions need)
As a side note. Using list comprehensions only for it's side effects is considered a 'code smell'. If you don't need a list as result, just use a for-loop.
Code:
import os
from multiprocessing import Process, Manager
def add(l):
l += [l[-1] + 1]
return l
def foo_parallel(global_dict, lock):
with lock:
l = global_dict['a']
global_dict['a'] = add(l)
print(os.getpid(), global_dict)
if __name__ == '__main__':
N_WORKERS = 5
with Manager() as manager:
lock = manager.Lock()
global_dict = manager.dict(a=[0])
pool = [Process(target=foo_parallel, args=(global_dict, lock))
for _ in range(N_WORKERS)]
for p in pool:
p.start()
for p in pool:
p.join()
print('result', global_dict)
In a python script, I have a large dataset that I would like to apply multiple functions to. The functions are responsible for creating certain outputs that get saved to the hard drive.
A few things of note:
the functions are independent
none of the functions return anything
the functions will take variable amounts of time
some of the functions may fail, and that is fine
Can I multiprocess this in any way that each function and the dataset are sent separately to a core and run there? This way I do not need the first function to finish before the second one can kick off? There is no need for them to be sequentially dependent.
Thanks!
Since your functions are independent and only read data, as long as it is not an issue if your data is modified during the execution of a function, then they are also thread safe.
Use a thread pool (click) . You would have to create a task per function you want to run.
Note: In order for it to run on more than one core you must use Python Multiprocessing. Else all the threads will run on a single core. This happens because Python has a Global Interpreter Lock (GIL). For more information Python threads all executing on a single core
Alternatively, you could use DASK , which augments the data in order to run some multi threading. While adding some overhead, it might be quicker for your needs.
I was in a similar situation as yours, and used Processes with the following function:
import multiprocessing as mp
def launch_proc(nproc, lst_functions, lst_args, lst_kwargs):
n = len(lst_functions)
r = 1 if n % nproc > 0 else 0
for b in range(n//nproc + r):
bucket = []
for p in range(nproc):
i = b*nproc + p
if i == n:
break
proc = mp.Process(target=lst_functions[i], args=lst_args[i], kwargs=lst_kwargs[i])
bucket.append(proc)
for proc in bucket:
proc.start()
for proc in bucket:
proc.join()
This has a major drawback: all Processes in a bucket have to finish before a new bucket can start. I tried to use a JoinableQueue to avoid this, but could not make it work.
Example:
def f(i):
print(i)
nproc = 2
n = 11
lst_f = [f] * n
lst_args = [[i] for i in range(n)]
lst_kwargs = [{}] * n
launch_proc(nproc, lst_f, lst_args, lst_kwargs)
Hope it can help.
I am attempting to parallelize an algorithm that I have been working on using the Multiprocessing and Pool.map() commands. I ran into a problem and was hoping someone could point me in the right direction.
Let x denote an array of N rows and 1 column, which is initialized to be a vector of zeros. Let C denote an array of length N by 2. The vector x is constructed iteratively by using information from some subsets of C (doing some math operations). The code (not parallelized) as a large for loop looks roughly as follows:
for j in range(0,N)
#indx_j will have n_j <<N entries
indx_j = build_indices(C,j)
#x_j will be entries to be added to vector x at indices indx_j
#This part is time consuming
x_j = build_x_j(indx_j,C)
#Add x_j into entries of x
x[indx_j] = x[indx_j] + x_j
I was able to parallelize this using the multiprocessing module and using the pool.map to eliminate the large for loop. I wrote a function that did the above computations, except the step of adding x_j to x[indx_j]. The parallelized function instead returns two data sets back: x_j and indx_j. After those are computed, I run a for loop (not parallel) to build up x by doing the x[indx_j] = x[indx_j] +x_j computation for j=0,N.
The downside to my method is that pool.map operation returns a gigantic list of N pairs of arrays x_j and indx_j. where both x_j and indx_j were n_j by 1 vectors (n_j << N). For large N (N >20,000) this was taking up way too much memory. Here is my question: Can I somehow, in parallel, do the construction operation x[indx_j] = x[indx_j] + x_j. It seems to me each process in pool.map() would have to be able to interact with the vector x. Do I place x in some sort of shared memory? How would I do such a thing? I suspect that this has to be possible somehow, as I assume people assemble matrices in parallel for finite element methods all the time. How can I have multiple processes interact with a vector without having some sort of problem? I'm worried that perhaps for j= 20 and j = 23, if they happen simultaneously, they might try to add to x[indx_20] = x[indx_20] + x_20 and simultaneously x[indx_30] = x[indx_30] + x_30 and maybe some error will happen. I also don't know how to even have this computation done via the pool.map() (I don't think I can feed x in as an input, as it would be changing after each process).
I'm not sure if it matters or not, but the sets indx_j will have non-trivial intersection (e.g., indx_1 and indx_2 may have indices [1,2,3] and [3,4,5] for example).
If this is unclear, please let me know and I will attempt to clarify. This is my first time trying to work in parallel, so I am very unsure of how to proceed. Any information would be greatly appreciated. Thanks!
I dont know If I am qualified to give proper advice on the topic of shared memory arrays, but I had a similar need to share arrays across processes in python recently and came across a small custom numpy.ndarray implementation for a shared memory array in numpy using the shared ctypes within multiprocessing. Here is a link to the code: shmarray.py. It acts just like a normal array,except the underlying data is stored in shared memory, meaning separate processes can both read and write to the same array.
Using Shared Memory Array
In threading, all information available to the thread (global and local namespace) can be handled as shared between all other threads that have access to it, but in multiprocessing that data is not so easily accessible. On linux data is available for reading, but cannot be written to. Instead when a write is done, the data is copied and then written to, meaning no other process can see those changes. However, if the memory being written to is shared memory, it is not copied. This means with shmarray we can do things similar to the way we would do threading, with the true parallelism of multiprocessing. One way to have access to the shared memory array is with a subclass. I know you are currently using Pool.map(), but I had felt limited by the way map worked, especially when dealing with n-dimensional arrays. Pool.map() is not really designed to work with numpy styled interfaces, at least I don't think it can easily. Here is a simple idea where you would spawn a process for each j in N:
import numpy as np
import shmarray
import multiprocessing
class Worker(multiprocessing.Process):
def __init__(self, j, C, x):
multiprocessing.Process.__init__()
self.shared_x = x
self.C = C
self.j = j
def run(self):
#Your Stuff
#indx_j will have n_j <<N entries
indx_j = build_indices(self.C,self.j)
#x_j will be entries to be added to vector x at indices indx_j
x_j = build_x_j(indx_j,self.C)
#Add x_j into entries of x
self.shared_x[indx_j] = self.shared_x[indx_j] + x_j
#And then actually do the work
N = #What ever N should be
x = shmarray.zeros(shape=(N,1))
C = #What ever C is, doesn't need to be shared mem, since no writing is happening
procs = []
for j in range(N):
proc = Worker(j, C, x)
procs.append(proc)
proc.start()
#And then join() the processes with the main process
for proc in procs:
proc.join()
Custom Process Pool and Queues
So this might work, but spawning several thousand processes is not really going to be of any use if you only have a few cores. The way I handled this was to implement a Queue system between my process. That is to say, we have a Queue that the main process fills with j's and then a couple worker processes get numbers from the Queue and do work with it, note that by implementing this, you are essentially doing exactly what Pool does. Also note we are actually going to use multiprocessing.JoinableQueue for this since it lets use join() to wait till a queue is emptied.
Its not hard to implement this at all really, simply we must modify our Subclass a bit and how we use it.
import numpy as np
import shmarray
import multiprocessing
class Worker(multiprocessing.Process):
def __init__(self, C, x, job_queue):
multiprocessing.Process.__init__()
self.shared_x = x
self.C = C
self.job_queue = job_queue
def run(self):
#New Queue Stuff
j = None
while j!='kill': #this is how I kill processes with queues, there might be a cleaner way.
j = self.job_queue.get() #gets a job from the queue if there is one, otherwise blocks.
if j!='kill':
#Your Stuff
indx_j = build_indices(self.C,j)
x_j = build_x_j(indx_j,self.C)
self.shared_x[indx_j] = self.shared_x[indx_j] + x_j
#This tells the queue that the job that was pulled from it
#Has been completed (we need this for queue.join())
self.job_queue.task_done()
#The way we interact has changed, now we need to define a job queue
job_queue = multiprocessing.JoinableQueue()
N = #What ever N should be
x = shmarray.zeros(shape=(N,1))
C = #What ever C is, doesn't need to be shared mem, since no writing is happening
procs = []
proc_count = multiprocessing.cpu_count() # create as many procs as cores
for _ in range(proc_count):
proc = Worker(C, x, job_queue) #now we pass the job queue instead
procs.append(proc)
proc.start()
#all the workers are just waiting for jobs now.
for j in range(N):
job_queue.put(j)
job_queue.join() #this blocks the main process until the queue has been emptied
#Now if you want to kill all the processes, just send a 'kill'
#job for each process.
for proc in procs:
job_queue.put('kill')
job_queue.join()
Finally, I really cannot say anything about how this will handle writing to overlapping indices at the same time. Worst case is that you could have a serious problem if two things attempt to write at the same time and things get corrupted/crash(I am no expert here so I really have no idea if that would happen). Best case since you are just doing addition, and order of operations doesn't matter, everything runs smoothly. If it doesn't run smoothly, my suggestion is to create a second custom Process subclass that specifically does the array assignment. To implement this you would need to pass both a job queue, and an 'output' queue to the Worker subclass. Within the while loop, you should have a `output_queue.put((indx_j, x_j)). NOTE: If you are putting these into a Queue they are being pickled, which can be slow. I recommend making them shared memory arrays if they can be before using put. It may be faster to just pickle them in some cases, but I have not tested this. To assign these as they are generated, you then need to have your Assigner process read these values from a queue as jobs and apply them, such that the work loop would essentially be:
def run(self):
job = None
while job!='kill':
job = self.job_queue.get()
if job!='kill':
indx_j, x_j = job
#Note this is the process which really needs access to the X array.
self.x[indx_j] += x_j
self.job_queue.task_done()
This last solution will likely be slower than doing the assignment within the worker threads, but if you are doing it this way, you have no worries about race conditions, and memory is still lighter since you can use up the indx_j and x_j values as you generate them, instead of waiting till all of them are done.
Note for Windows
I didn't do any of this work on windows, so I am not 100% certain, but I believe the code above will be very memory intensive since windows does not implement a copy-on-write system for spawning independent processes. Essentially windows will copy ALL information that a process needs when spawning a new one from the main process. To fix this, I think replacing all your x_j and C with shared memory arrays (anything you will be handing around to other processes) instead of normal arrays should cause windows to not copy the data, but I am not certain. You did not specify which platform you were on so I figured better safe than sorry, since multiprocessing is a different beast on windows than linux.