How do I nest multiprocessing in multiprocessing, with common variables (python)?

I have a function which is running twice in two parallel processes. Let's call it parentFunction().
Each process ends with a dictionary which is added to a common list, which gives a list of two dictionaries. I solved this by using a shared list created with a Manager.
Now, inside parentFunction() I would like to run two parallel processes, each of which contributes one variable to the dictionary. I tried to do this with a shared dictionary created with a Manager.
At the end I'm converting the list of dictionaries to a pandas data frame.
def I(D, a):
    D["a"] = a

def II(D, b):
    D["a"] = b

def task(L, x):
    x = 0
    a = 1
    b = 2
    manager = Manager()
    D = manager.dict()  # <-- can be shared between processes.
    pI = Process(target=I, args=(D, 0))
    pII = Process(target=II, args=(D, 0))
    pI.start()
    pII.start()
    pI.join()
    pII.join()
    L.append(D)

if __name__ == "__main__":
    with Manager() as manager:
        L = manager.list()  # <-- can be shared between processes.
        p1 = Process(target=task, args=(L, 0))  # Passing the list
        p2 = Process(target=task, args=(L, 0))  # Passing the list
        p1.start()
        p2.start()
        p1.join()
        p2.join()
        print(L)
This returns the error:
TypeError: task() missing 1 required positional argument: 'L'
Traceback (most recent call last):
File "C:\Users\user\AppData\Roaming\JetBrains\PyCharmCE2021.2\scratches\scratch_8.py", line 88, in <module>
print(list(L))
File "<string>", line 2, in __getitem__
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\multiprocessing\managers.py", line 810, in _callmethod
kind, result = conn.recv()
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\multiprocessing\connection.py", line 256, in recv
return _ForkingPickler.loads(buf.getbuffer())
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\multiprocessing\managers.py", line 934, in RebuildProxy
return func(token, serializer, incref=incref, **kwds)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\multiprocessing\managers.py", line 784, in __init__
self._incref()
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\multiprocessing\managers.py", line 838, in _incref
conn = self._Client(self._token.address, authkey=self._authkey)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\multiprocessing\connection.py", line 505, in Client
c = PipeClient(address)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\multiprocessing\connection.py", line 707, in PipeClient
_winapi.WaitNamedPipe(address, 1000)
FileNotFoundError: [WinError 2] The system cannot find the file specified

The source you posted does not seem to match your stack trace. You would only get a FileNotFoundError when the main process tries to enumerate the objects within list L with a statement such as print(list(L)), which I see in the stack trace but not in your code. It helps when you post the actual code causing the exception. But here is the cause of your problem:
When you create a new manager with the call manager = Manager(), a new process is created, and any objects that are created via the manager "live" in the same address space and process as that manager. You are creating two manager processes, one in the main process and one in the child process running task. It is in the latter that the dictionary D is created. When that child process terminates, its manager process terminates too, along with any objects created by that manager. So when the main process attempts to print the list L, the proxy object within it, D, no longer points to an existing object. The solution is to have the main process create the dictionary D and pass it to the task child process:
from multiprocessing import Process, Manager

def I(D, a):
    D["a"] = a

def II(D, b):
    D["a"] = b

def task(L, D, x):
    x = 0
    a = 1
    b = 2
    pI = Process(target=I, args=(D, 0))
    pII = Process(target=II, args=(D, 0))
    pI.start()
    pII.start()
    pI.join()
    pII.join()
    L.append(D)

if __name__ == "__main__":
    with Manager() as manager:
        L = manager.list()  # <-- can be shared between processes.
        D = manager.dict()  # <-- can be shared between processes.
        p = Process(target=task, args=(L, D, 0))  # Passing the list
        p.start()
        p.join()
        print(L[0])
Prints:
{'a': 0}
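If you do want the two parallel task processes from your original code, the same idea extends naturally: have the main process create one managed dictionary per task and pass each one in, so every dictionary outlives the task that fills it. A minimal sketch (my variation on the code above, with II writing a separate key "b" so the two sub-processes don't overwrite each other):
from multiprocessing import Process, Manager

def I(D, a):
    D["a"] = a

def II(D, b):
    D["b"] = b

def task(L, D, x):
    pI = Process(target=I, args=(D, 1))
    pII = Process(target=II, args=(D, 2))
    pI.start()
    pII.start()
    pI.join()
    pII.join()
    L.append(D)

if __name__ == "__main__":
    with Manager() as manager:
        L = manager.list()
        dicts = [manager.dict(), manager.dict()]  # one managed dict per task
        tasks = [Process(target=task, args=(L, D, 0)) for D in dicts]
        for t in tasks:
            t.start()
        for t in tasks:
            t.join()
        for D in L:
            print(dict(D))  # expected: {'a': 1, 'b': 2} printed twice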

Related

Cycle an iterator using multiprocessing in Python

I have an iterator that will retrieve a varying number of lines from a very large (>20GB) file, depending on some features. The iterator works fine, but I can only use 1 thread to process the result. I would like to feed the value from each iteration to multiple threads / processes.
I'm using a text file with 9 lines to mimic my data; here is my code. I've been struggling with how to create the feedback so that when one process finishes, it will go and retrieve the next iteration:
from multiprocessing import Process, Manager
import time

# Iterator
class read_file(object):
    def __init__(self, filePath):
        self.file = open(filePath, 'r')

    def __iter__(self):
        return self

    def __next__(self):
        line = self.file.readline()
        if line:
            return line
        else:
            raise StopIteration

# worker for one process
def print_worker(a, n, stat):
    print(a)
    stat[n] = True  # Set the finished status as True
    return None

# main
def main():
    file_path = 'tst_mp.txt'  # the txt file with 9 lines
    n_worker = 2
    file_handle = read_file(file_path)
    workers = []
    # Create shared list for storing the dereplicated dict and progress counter
    manager = Manager()
    status = manager.list([False] * 2)  # list of dictionary for each thread
    # Initiate the workers
    for i in range(n_worker):
        workers.append(Process(target=print_worker, args=(file_handle.__next__(), i, status,)))
    for worker in workers:
        worker.start()
    block = file_handle.__next__()  # The next block (line)
    while block:  # continue if there is still a block left
        print(status)
        time.sleep(1)  # for every second
        for i in range(2):
            if status[i]:  # Worker i finished
                workers[i].join()
                # workers[i].close()
                workers[i] = Process(target=print_worker, args=(block, i, status,))
                status[i] = False  # Set worker i as busy (False)
                workers[i].start()  # Start worker i
                try:  # try to get the next item in the iterator
                    block = file_handle.__next__()
                except StopIteration:
                    block = False

if __name__ == '__main__':
    main()
The code is clumsy, but it did print out the sequence; however, it also produced some errors when I ran the code twice:
1
2
3
4
5
6
7
8
9
Process Process-10:
Traceback (most recent call last):
File "/home/zewei/mambaforge/lib/python3.9/multiprocessing/managers.py", line 802, in _callmethod
conn = self._tls.connection
AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/zewei/mambaforge/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/zewei/mambaforge/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/zewei/share/paf_depth/test_multiprocess.py", line 31, in print_worker
stat[n] = True # Set the finished status as True
File "<string>", line 2, in __setitem__
File "/home/zewei/mambaforge/lib/python3.9/multiprocessing/managers.py", line 806, in _callmethod
self._connect()
File "/home/zewei/mambaforge/lib/python3.9/multiprocessing/managers.py", line 794, in _connect
dispatch(conn, None, 'accept_connection', (name,))
File "/home/zewei/mambaforge/lib/python3.9/multiprocessing/managers.py", line 90, in dispatch
kind, result = c.recv()
File "/home/zewei/mambaforge/lib/python3.9/multiprocessing/connection.py", line 255, in recv
buf = self._recv_bytes()
File "/home/zewei/mambaforge/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
buf = self._recv(4)
File "/home/zewei/mambaforge/lib/python3.9/multiprocessing/connection.py", line 384, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
This is where I'm stuck. I was wondering if there is any fix or a more elegant way to do this?
Thanks!
Here's a better way to do what you are doing, using a Pool:
from multiprocessing import Pool
import time
.
.
.
.
# worker for one process
def print_worker(a):
    print(a)
    return None

def main():
    file_path = r''  # the txt file with 9 lines
    n_worker = 2
    file_handle = read_file(file_path)
    results = []
    with Pool(n_worker) as pool:
        for result in pool.imap(print_worker, file_handle):
            results.append(result)
    print(results)

if __name__ == '__main__':
    main()
Here, the imap function lazily iterates over the iterator, so that the whole file won't be read into memory. Pool handles spreading the tasks across the number of processes you started (using n_worker) automatically so that you don't have to manage it yourself.
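If the order of results doesn't matter, the same approach can go a little further with imap_unordered and a chunksize; here is a minimal sketch (my variation, not part of the code above, assuming the same 9-line sample file and that a plain file object serves as the line iterator):
from multiprocessing import Pool

# A sketch: imap_unordered yields results as soon as any worker finishes,
# and chunksize batches lines together to reduce IPC overhead.
def line_length(line):
    return len(line)

def main():
    file_path = 'tst_mp.txt'  # assumed sample file with 9 lines
    with open(file_path) as fh, Pool(2) as pool:
        # A file object is itself a lazy line iterator, so the whole
        # file is never read into memory at once.
        for length in pool.imap_unordered(line_length, fh, chunksize=3):
            print(length)

if __name__ == '__main__':
    main()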

AttributeError: Can't get attribute 'journalerReader' on <module '__mp_main__

I tried to implement LMAX in Python. I tried to handle data in 4 processes:
import disruptor
import multiprocessing
import random

if __name__ == '__main__':
    cb = disruptor.CircularBuffer(5)
    def receiveWriter():
        while(True):
            n = random.randint(5,20)
            cb.receive(n)
    def ReplicatorReader():
        while(True):
            cb.replicator()
    def journalerReader():
        while(True):
            cb.journaler()
    def unmarshallerReader():
        while(True):
            cb.unmarshaller()
    def consumeReader():
        while(True):
            print(cb.consume())
    p1 = multiprocessing.Process(name="p1",target=ReplicatorReader)
    p1.start()
    p0 = multiprocessing.Process(name="p0",target=receiveWriter)
    p0.start()
    p1 = multiprocessing.Process(name="p1",target=ReplicatorReader)
    p1.start()
    p2 = multiprocessing.Process(name="p2",target=journalerReader)
    p2.start()
    p3 = multiprocessing.Process(name="p3",target=unmarshallerReader)
    p3.start()
    p4 = multiprocessing.Process(name="p4",target=consumeReader)
    p4.start()
but I get this error in my code:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "<string>", line 1, in <module>
File "C:\Program Files\Python39\lib\multiprocessing\spawn.py", line 116, in spawn_main
File "C:\Program Files\Python39\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
exitcode = _main(fd, parent_sentinel)
File "C:\Program Files\Python39\lib\multiprocessing\spawn.py", line 126, in _main
File "C:\Program Files\Python39\lib\multiprocessing\spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'unmarshallerReader' on <module '__mp_main__' from 'd:\\python\\RunDisruptor.py'>
AttributeError: Can't get attribute 'consumeReader' on <module '__mp_main__' from 'd:\\python\\RunDisruptor.py'>
Your first problem is that the target of a Process call cannot be defined within the if __name__ == '__main__': block. But:
As I mentioned in an earlier post of yours, the only way I see that you can share an instance of CircularBuffer across multiple processes is to implement a managed class, which surprisingly is not all that difficult to do. But when you create a managed class and create an instance of that class, what you have is actually a proxy reference to the object. This has two implications:
1. Each method call is more like a remote procedure call to a special server process created by the manager you will start up, and therefore has more overhead than a local method call.
2. If you print the reference, the class's __str__ method will not be called; you will be printing a representation of the proxy pointer. You should probably rename method __str__ to something like dump and call that explicitly whenever you want a representation of the instance.
You should also explicitly wait for the completion of the processes you are creating so that the manager service does not shut down prematurely, which means that each process should be assigned to a unique variable and have a unique name.
import disruptor
import multiprocessing
from multiprocessing.managers import BaseManager
import random

class CircularBufferManager(BaseManager):
    pass

def receiveWriter(cb):
    while(True):
        n = random.randint(5,20)
        cb.receive(n)

def ReplicatorReader(cb):
    while(True):
        cb.replicator()

def journalerReader(cb):
    while(True):
        cb.journaler()

def unmarshallerReader(cb):
    while(True):
        cb.unmarshaller()

def consumeReader(cb):
    while(True):
        print(cb.consume())

if __name__ == '__main__':
    # Create managed class
    CircularBufferManager.register('CircularBuffer', disruptor.CircularBuffer)
    # create and start manager:
    with CircularBufferManager() as manager:
        cb = manager.CircularBuffer(5)
        p1 = multiprocessing.Process(name="p1", target=ReplicatorReader, args=(cb,))
        p1.start()
        p0 = multiprocessing.Process(name="p0", target=receiveWriter, args=(cb,))
        p0.start()
        p1a = multiprocessing.Process(name="p1a", target=ReplicatorReader, args=(cb,))
        p1a.start()
        p2 = multiprocessing.Process(name="p2", target=journalerReader, args=(cb,))
        p2.start()
        p3 = multiprocessing.Process(name="p3", target=unmarshallerReader, args=(cb,))
        p3.start()
        p4 = multiprocessing.Process(name="p4", target=consumeReader, args=(cb,))
        p4.start()
        p1.join()
        p0.join()
        p1a.join()
        p2.join()
        p3.join()
        p4.join()
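Since disruptor.CircularBuffer isn't shown here, a self-contained sketch of the same managed-class pattern, with a trivial stand-in class (a hypothetical Counter) in place of CircularBuffer, looks like this:
import multiprocessing
import threading
from multiprocessing.managers import BaseManager

class Counter:
    # Stand-in for disruptor.CircularBuffer, just to show the proxy mechanics.
    def __init__(self, start=0):
        self._lock = threading.Lock()
        self.value = start
    def increment(self):
        # The manager's server handles each client in its own thread,
        # so guard the read-modify-write with a lock.
        with self._lock:
            self.value += 1
    def get(self):
        return self.value

class CounterManager(BaseManager):
    pass

def worker(counter, n):
    # Every method call goes through the proxy to the manager's server process.
    for _ in range(n):
        counter.increment()

if __name__ == '__main__':
    CounterManager.register('Counter', Counter)
    with CounterManager() as manager:
        counter = manager.Counter()  # a proxy, not the object itself
        ps = [multiprocessing.Process(target=worker, args=(counter, 1000)) for _ in range(4)]
        for p in ps:
            p.start()
        for p in ps:
            p.join()
        print(counter.get())  # 4000: every process updated the single managed instance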

Python process not cleaned for reuse

Processes not cleaned for reuse
Hi there,
I stumbled upon a problem with ProcessPoolExecutor, where processes access data they should not be able to. Let me explain:
I have a situation similar to the example below: I have several runs to start, each with different arguments. They compute their stuff in parallel and have no reason to interact with each other. Now, as I understand it, when a process forks, it duplicates itself. The child process has the same (memory) data as its parent, but should it change anything, it does so on its own copy. If I wanted the changes to survive the lifetime of the child process, I would call in queues, pipes and other IPC stuff.
But I actually don't! The processes each manipulate data of their own, which should not carry over to any of the other runs. The example below shows otherwise, though: the next runs (not the ones running in parallel) can access the data of their previous run, implying that the data has not been scrubbed from the process.
Code/Example
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import current_process, set_start_method

class Static:
    integer: int = 0

def inprocess(run: int) -> None:
    cp = current_process()
    # Print current state
    print(f"[{run:2d} {cp.pid} {cp.name}] int: {Static.integer}", flush=True)
    # Check value
    if Static.integer != 0:
        raise Exception(f"[{run:2d} {cp.pid} {cp.name}] Variable already set!")
    # Update value
    Static.integer = run + 1

def pooling():
    cp = current_process()
    # Get master's pid
    print(f"[{cp.pid} {cp.name}] Start")
    with ProcessPoolExecutor(max_workers=2) as executor:
        for i, _ in enumerate(executor.map(inprocess, range(4))):
            print(f"run #{i} finished", flush=True)

if __name__ == '__main__':
    set_start_method("fork")  # enforce fork
    pooling()
Output
[1998 MainProcess] Start
[ 0 2020 Process-1] int: 0
[ 2 2020 Process-1] int: 1
[ 1 2021 Process-2] int: 0
[ 3 2021 Process-2] int: 2
run #0 finished
run #1 finished
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.6/concurrent/futures/process.py", line 175, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/usr/lib/python3.6/concurrent/futures/process.py", line 153, in _process_chunk
return [fn(*args) for args in chunk]
File "/usr/lib/python3.6/concurrent/futures/process.py", line 153, in <listcomp>
return [fn(*args) for args in chunk]
File "<stdin>", line 14, in inprocess
Exception: [ 2 2020 Process-1] Variable already set!
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 29, in <module>
File "<stdin>", line 24, in pooling
File "/usr/lib/python3.6/concurrent/futures/process.py", line 366, in _chain_from_iterable_of_lists
for element in iterable:
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
yield fs.pop().result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
return self.__get_result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
Exception: [ 2 2020 Process-1] Variable already set!
This behaviour can also be reproduced with max_workers=1, as the process is re-used. The start method has no influence on the error (though only "fork" seems to use more than one process).
So to summarise: I want each new run in a process with all previous data, but no new data from any of the other runs. Is that possible? How would I achieve it? Why does the above not do exactly that?
I appreciate any help.
I found multiprocessing.pool.Pool, where one can set maxtasksperchild=1, so a worker process is destroyed when its task is finished. But I dislike the multiprocessing interface; the ProcessPoolExecutor is more comfortable to use. Additionally, the whole idea of the pool is to save process setup time, which would be defeated by killing the hosting process after each run.
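For what it's worth, a minimal sketch of that maxtasksperchild=1 workaround, assuming a stripped-down version of the worker above:
from multiprocessing import Pool, current_process

class Static:
    integer: int = 0

def inprocess(run: int) -> None:
    cp = current_process()
    print(f"[{run:2d} {cp.pid} {cp.name}] int: {Static.integer}", flush=True)
    if Static.integer != 0:
        raise Exception(f"[{run:2d} {cp.pid} {cp.name}] Variable already set!")
    Static.integer = run + 1

if __name__ == '__main__':
    # maxtasksperchild=1 retires each worker after a single task, so class-level
    # state cannot leak into the next run (at the cost of extra process startups).
    with Pool(processes=2, maxtasksperchild=1) as pool:
        pool.map(inprocess, range(4))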
Brand new processes in Python do not share memory state. However, ProcessPoolExecutor reuses process instances; it's a pool of active processes, after all. I assume this is done to prevent the OS overhead of stopping and starting processes all the time.
You see the same behavior in other distribution technologies like Celery where, if you're not careful, you can bleed global state between executions.
I recommend you manage your namespace better to encapsulate your data. Using your example, you could encapsulate your code and data in a parent class which you instantiate in inprocess(), instead of storing it in a shared namespace like a static field in classes or directly in a module. That way the object will ultimately be cleaned up by the garbage collector:
class State:
    def __init__(self):
        self.integer: int = 0

    def do_stuff(self):
        self.integer += 42

def use_global_function(state):
    state.integer -= 1664
    state.do_stuff()

def inprocess(run: int) -> None:
    cp = current_process()
    state = State()
    print(f"[{run:2d} {cp.pid} {cp.name}] int: {state.integer}", flush=True)
    if state.integer != 0:
        raise Exception(f"[{run:2d} {cp.pid} {cp.name}] Variable already set!")
    state.integer = run + 1
    state.do_stuff()
    use_global_function(state)
I have been facing some potentially similar problems and saw some interesting posts, such as this one: High Memory Usage Using Python Multiprocessing, which points towards using gc.collect(); however, in your case it did not work. So I thought about how the Static class was initialized. Some points:
1. Unfortunately, I cannot reproduce your minimal example; this error is raised: ValueError: cannot find context for 'fork'
2. Considering 1, I use set_start_method("spawn")
3. A quick fix then could be to initialize the Static class every time, as below:
class Static:
    integer: int = 0
    def __init__(self):
        pass

def inprocess(run: int) -> None:
    cp = current_process()
    # Print current state
    print(f"[{run:2d} {cp.pid} {cp.name}] int: {Static().integer}", flush=True)
    # Check value
    if Static().integer != 0:
        raise Exception(f"[{run:2d} {cp.pid} {cp.name}] Variable already set!")
    # Update value
    Static().integer = run + 1

def pooling():
    cp = current_process()
    # Get master's pid
    print(f"[{cp.pid} {cp.name}] Start")
    with ProcessPoolExecutor(max_workers=2) as executor:
        for i, _ in enumerate(executor.map(inprocess, range(4))):
            print(f"run #{i} finished", flush=True)

if __name__ == "__main__":
    print("start")
    # set_start_method("fork")  # enforce fork, ValueError: cannot find context for 'fork'
    set_start_method("spawn")  # Alternative
    pooling()
This returns:
[ 0 1424 SpawnProcess-2] int: 0
[ 1 1424 SpawnProcess-2] int: 0
run #0 finished
[ 2 17956 SpawnProcess-1] int: 0
[ 3 1424 SpawnProcess-2] int: 0
run #1 finished
run #2 finished
run #3 finished

Returning multiple lists from pool.map processes?

Win 7, x64, Python 2.7.12
In the following code I am setting off some pool processes to do a trivial multiplication via the multiprocessing.Pool.map() method. The output data is collected in myList_1.
NOTE: this is a stripped down simplification of my actual code. There are multiple lists involved in the real application, all huge.
import multiprocessing
import numpy as np

def createLists(branches):
    firstList = branches[:] * node
    return firstList

def init_process(lNodes):
    global node
    node = lNodes
    print 'Starting', multiprocessing.current_process().name

if __name__ == '__main__':
    mgr = multiprocessing.Manager()
    nodes = mgr.list()
    pool_size = multiprocessing.cpu_count()
    branches = [i for i in range(1, 21)]
    lNodes = 10
    splitBranches = np.array_split(branches, int(len(branches)/pool_size))
    pool = multiprocessing.Pool(processes=pool_size, initializer=init_process, initargs=[lNodes])
    myList_1 = pool.map(createLists, splitBranches)
    pool.close()
    pool.join()
I now add an extra calculation to createLists() & try to pass back both lists.
import multiprocessing
import numpy as np

def createLists(branches):
    firstList = branches[:] * node
    secondList = branches[:] * node * 2
    return firstList, secondList

def init_process(lNodes):
    global node
    node = lNodes
    print 'Starting', multiprocessing.current_process().name

if __name__ == '__main__':
    mgr = multiprocessing.Manager()
    nodes = mgr.list()
    pool_size = multiprocessing.cpu_count()
    branches = [i for i in range(1, 21)]
    lNodes = 10
    splitBranches = np.array_split(branches, int(len(branches)/pool_size))
    pool = multiprocessing.Pool(processes=pool_size, initializer=init_process, initargs=[lNodes])
    myList_1, myList_2 = pool.map(createLists, splitBranches)
    pool.close()
    pool.join()
This raises the following error & traceback:
Traceback (most recent call last):
File "<ipython-input-6-ff188034c708>", line 1, in <module>
runfile('C:/Users/nr16508/Local Documents/Inter Trab Angle/Parallel/scratchpad.py', wdir='C:/Users/nr16508/Local Documents/Inter Trab Angle/Parallel')
File "C:\Users\nr16508\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
execfile(filename, namespace)
File "C:\Users\nr16508\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/nr16508/Local Documents/Inter Trab Angle/Parallel/scratchpad.py", line 36, in <module>
myList_1, myList_2 = pool.map(createLists, splitBranches)
ValueError: too many values to unpack
When I tried to put both lists into one to pass back, i.e....
return [firstList, secondList]
......
myList = pool.map(createLists, splitBranches)
...the output becomes too jumbled for further processing.
Is there a method of collecting more than one list from pooled processes?
This question has nothing to do with multiprocessing or threadpooling. It is simply about how to unzip lists, which can be done with the standard zip(*...) idiom.
myList_1, myList_2 = zip(*pool.map(createLists, splitBranches))
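To see what the idiom does in isolation, here is a small standalone sketch with toy data (not taken from the question):
# pool.map returns one (firstList, secondList) tuple per chunk, e.g.:
pairs = [([1, 2], [2, 4]), ([3, 4], [6, 8]), ([5, 6], [10, 12])]

# zip(*pairs) transposes the list of pairs into two sequences of lists
myList_1, myList_2 = zip(*pairs)

print(myList_1)  # ([1, 2], [3, 4], [5, 6])
print(myList_2)  # ([2, 4], [6, 8], [10, 12])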

I have an issue using the multiprocessing module under python 2.7 on Windows

I was using a huge scientific code on Linux, using the multiprocessing module to accelerate some computations. Somewhere in a library that I am using, multiprocessing is called this way:
manager = multiprocessing.Manager()
return_dict = manager.dict()
n = 0
while n < n_samples:
    if n + n_procs < n_samples:
        n_subs = n_procs
    else:
        n_subs = n_samples - n
    jobs = []
    for i in range(n_subs):
        index = n + i
        x_in = samples[index]
        p = multiprocessing.Process(target=self.__worker, args=(index, x_in, return_dict))
        jobs.append(p)
        p.start()
I wrapped the top-level call of this code in my main python script as follows:
if __name__ == '__main__':
    freeze_support()
    fitting.optimize()
as the fitting.optimize() line calls the parallel code.
Still, while launching the code an error occurs and I do not know why:
File "C:\Users\XXX\workspace\git-DLLM\Simulation\CFD_Data_Analysis\Fitting3DPanels\CoeffFitting3DPanels.py", line 48, in <module>
fitting.optimize()
....
File "C:\Users\XXX\workspace\git-MDOTools\modules\MDOTools\ValidGrad\FDGradient.py", line 97, in grad_f
manager = multiprocessing.Manager()
File "C:\Users\XXX\AppData\Local\Continuum\Anaconda2\lib\multiprocessing\__init__.py", line 99, in Manager
m.start()
File "C:\Users\XXX\AppData\Local\Continuum\Anaconda2\lib\multiprocessing\managers.py", line 528, in start
self._address = reader.recv()
EOFError
