I'm trying to add spawned subprocesses to a queue so that only one can execute at a time. I don't want to wait for each process to finish on every iteration of the for loop, because there is some code at the beginning of the loop that can run in parallel with the subprocesses.
Example code:
from multiprocessing import Queue
import subprocess

q = Queue()
hrs = range(0, 12)

for hr in hrs:
    print(hr)
    # There will be other code here that takes some time to run
    p = subprocess.Popen(['python', 'test.py', '--hr={}'.format(hr)])
    q.put(p)
This results in:
Traceback (most recent call last):
File "/home/kschneider/anaconda3/envs/ewall/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
obj = _ForkingPickler.dumps(obj)
File "/home/kschneider/anaconda3/envs/ewall/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
TypeError: cannot pickle '_thread.lock' object
Is there a different way to set this up that will not result in the thread lock errors?
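One workaround (a sketch, not taken from the original thread): since the Popen handles themselves can't be pickled, put the picklable command argument lists on a plain queue.Queue and let a single consumer thread launch and wait for them one at a time, while the for loop keeps doing its other work.
import queue
import subprocess
import threading

cmd_q = queue.Queue()              # a thread queue; nothing needs to be pickled

def runner():
    while True:
        cmd = cmd_q.get()
        if cmd is None:            # sentinel: no more work
            break
        subprocess.run(cmd)        # launches one command and waits for it

consumer = threading.Thread(target=runner)
consumer.start()

for hr in range(12):
    # other per-iteration work can happen here while queued commands run
    cmd_q.put(['python', 'test.py', '--hr={}'.format(hr)])

cmd_q.put(None)                    # tell the runner to stop
consumer.join()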
Related
When I run the below code:
from concurrent.futures import ProcessPoolExecutor, as_completed
from multiprocessing import Queue

q = Queue()

def my_task(x, queue):
    queue.put("Task Complete")
    return x

with ProcessPoolExecutor() as executor:
    tasks = [executor.submit(my_task, i, q) for i in range(10)]
    for task in as_completed(tasks):
        print(task.result())
I get this error:
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
obj = _ForkingPickler.dumps(obj)
File "/usr/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/usr/lib/python3.10/multiprocessing/queues.py", line 58, in __getstate__
context.assert_spawning(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 373, in assert_spawning
raise RuntimeError(
RuntimeError: Queue objects should only be shared between processes through inheritance
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/tmp/nn.py", line 14, in <module>
print(task.result())
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
obj = _ForkingPickler.dumps(obj)
File "/usr/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/usr/lib/python3.10/multiprocessing/queues.py", line 58, in __getstate__
context.assert_spawning(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 373, in assert_spawning
raise RuntimeError(
RuntimeError: Queue objects should only be shared between processes through inheritance
What is the purpose of multiprocessing.Queue if I cannot use it for multiprocessing? How can I make this work? In my real code, every worker needs to frequently update a queue with its task status so that another thread can read from that queue and feed a progress bar.
Short Explanation
Why can't you pass a multiprocessing.Queue as a worker function argument? The short answer is that submitted tasks (and their arguments) are placed on a hidden input queue from which the pool processes pick up the next task to perform. Those arguments must be serializable with pickle, and a multiprocessing.Queue is not serializable in general. It is serializable only in the special case of being passed to a child process as a function argument: arguments to a multiprocessing.Process are stored as an attribute of the instance when it is created, and when start is called, that state is serialized into the new address space before the run method executes there. Why this serialization works in that case but not in general is unclear to me; I would have to spend a lot of time in the interpreter source to give a definitive answer.
See what happens when I try to put a queue instance onto a queue:
>>> from multiprocessing import Queue
>>> q1 = Queue()
>>> q2 = Queue()
>>> q1.put(q2)
>>> Traceback (most recent call last):
File "C:\Program Files\Python38\lib\multiprocessing\queues.py", line 239, in _feed
obj = _ForkingPickler.dumps(obj)
File "C:\Program Files\Python38\lib\multiprocessing\reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "C:\Program Files\Python38\lib\multiprocessing\queues.py", line 58, in __getstate__
context.assert_spawning(self)
File "C:\Program Files\Python38\lib\multiprocessing\context.py", line 359, in assert_spawning
raise RuntimeError(
RuntimeError: Queue objects should only be shared between processes through inheritance
>>> import pickle
>>> b = pickle.dumps(q2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Program Files\Python38\lib\multiprocessing\queues.py", line 58, in __getstate__
context.assert_spawning(self)
File "C:\Program Files\Python38\lib\multiprocessing\context.py", line 359, in assert_spawning
raise RuntimeError(
RuntimeError: Queue objects should only be shared between processes through inheritance
>>>
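By contrast, the special case described above does work: the same kind of queue can be passed directly as an argument to a multiprocessing.Process. A minimal sketch:
from multiprocessing import Process, Queue

def worker(q):
    q.put("Task Complete")

if __name__ == '__main__':
    q = Queue()
    p = Process(target=worker, args=(q,))  # queue passed as a Process argument
    p.start()
    print(q.get())                         # prints: Task Complete
    p.join()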
How to Pass the Queue via Inheritance
First of all, your code will run more slowly with multiprocessing than if you had simply called my_task in a loop, because multiprocessing introduces additional overhead (starting processes and moving data across address spaces). That overhead only pays off when the gain from running my_task in parallel more than offsets it; in your case it doesn't, because my_task is not CPU-intensive enough to justify multiprocessing.
That said, when you want your pool processes to use a multiprocessing.Queue instance, it cannot be passed as an argument to the worker function (unlike the case where you explicitly use multiprocessing.Process instances instead of a pool). Instead, you must initialize a global variable in each pool process with the queue instance.
If you are running on a platform that uses fork to create new processes, then you can simply create the queue as a global and it will be inherited by each pool process:
from concurrent.futures import ProcessPoolExecutor, as_completed
from multiprocessing import Queue

queue = Queue()

def my_task(x):
    queue.put("Task Complete")
    return x

with ProcessPoolExecutor() as executor:
    tasks = [executor.submit(my_task, i) for i in range(10)]
    for task in as_completed(tasks):
        print(task.result())
    # This queue must be read before the pool terminates:
    for _ in range(10):
        print(queue.get())
Prints:
1
0
2
3
6
5
4
7
8
9
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
If you need portability with platforms that do not use the fork method to create processes, such as Windows (which uses the spawn method), then you cannot allocate the queue as a global, since each pool process would create its own queue instance. Instead, the main process must create the queue and then initialize each pool process's global queue variable using the initializer and initargs arguments:
from concurrent.futures import ProcessPoolExecutor, as_completed
from multiprocessing import Queue

def init_pool_processes(q):
    global queue
    queue = q

def my_task(x):
    queue.put("Task Complete")
    return x

# Windows compatibility
if __name__ == '__main__':
    q = Queue()
    with ProcessPoolExecutor(initializer=init_pool_processes, initargs=(q,)) as executor:
        tasks = [executor.submit(my_task, i) for i in range(10)]
        for task in as_completed(tasks):
            print(task.result())
        # This queue must be read before the pool terminates:
        for _ in range(10):
            print(q.get())
If you want a progress bar to advance as each task completes (you haven't stated precisely how the bar should advance; see my comment to your question), then the following shows that a queue is not necessary for that. If, however, each submitted task consisted of N parts (for a total of 10 * N parts, since there are 10 tasks) and you would like a single progress bar to advance as each part completes, then a queue is probably the most straightforward way of signaling a part completion back to the main process; a sketch of that per-part approach follows the example below.
from concurrent.futures import ProcessPoolExecutor, as_completed
from tqdm import tqdm

def my_task(x):
    return x

# Windows compatibility
if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        with tqdm(total=10) as bar:
            tasks = [executor.submit(my_task, i) for i in range(10)]
            for _ in as_completed(tasks):
                bar.update()
    # To get the results in task submission order:
    results = [task.result() for task in tasks]
    print(results)
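And for the per-part case, a sketch (the number of parts N per task is a made-up value here): each task reports every completed part on a queue set up with the pool initializer shown earlier, and the main process reads the queue to advance a single bar over all 10 * N parts.
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Queue
from tqdm import tqdm

N = 5  # hypothetical number of parts per task

def init_pool_processes(q):
    global queue
    queue = q

def my_task(x):
    for _ in range(N):
        # ... do one part of the work here ...
        queue.put("part done")          # signal that one part has completed
    return x

if __name__ == '__main__':
    q = Queue()
    with ProcessPoolExecutor(initializer=init_pool_processes, initargs=(q,)) as executor:
        tasks = [executor.submit(my_task, i) for i in range(10)]
        with tqdm(total=10 * N) as bar:
            for _ in range(10 * N):
                q.get()                 # one message arrives per completed part
                bar.update()
        results = [task.result() for task in tasks]
    print(results)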
I am trying to multiprocess system commands, but can't get it to work with a simple program. The function runit(cmd) works fine though...
#!/usr/bin/python3
from subprocess import call, run, PIPE, Popen
from multiprocessing import Pool
import os

pool = Pool()

def runit(cmd):
    proc = Popen(cmd, shell=True, stdout=PIPE, stderr=PIPE, universal_newlines=True)
    return proc.stdout.read()

#print(runit('ls -l'))

it = []
for i in range(1, 3):
    it.append('ls -l')

results = pool.map(runit, it)
It outputs:
Process ForkPoolWorker-1:
Process ForkPoolWorker-2:
Traceback (most recent call last):
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.5/multiprocessing/pool.py", line 108, in worker
task = get()
File "/usr/lib/python3.5/multiprocessing/queues.py", line 345, in get
return ForkingPickler.loads(res)
AttributeError: Can't get attribute 'runit' on <module '__main__' from './syscall.py'>
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.5/multiprocessing/pool.py", line 108, in worker
task = get()
File "/usr/lib/python3.5/multiprocessing/queues.py", line 345, in get
return ForkingPickler.loads(res)
AttributeError: Can't get attribute 'runit' on <module '__main__' from './syscall.py'>
Then it somehow waits and does nothing, and when I press Ctrl+C a few times it spits out:
^CProcess ForkPoolWorker-4:
Process ForkPoolWorker-6:
Traceback (most recent call last):
File "./syscall.py", line 17, in <module>
Process ForkPoolWorker-5:
results = pool.map(runit, it)
File "/usr/lib/python3.5/multiprocessing/pool.py", line 260, in map
...
buf = self._recv(4)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
KeyboardInterrupt
I'm not sure, since the issue I know of is Windows-related (and I don't have access to a Linux box to reproduce), but to be portable you have to wrap your multiprocessing-dependent calls in an if __name__ == "__main__" guard, or they conflict with the way Python spawns the processes. This fixed example runs fine on Windows (and should work on other platforms as well):
from multiprocessing import Pool
from subprocess import Popen, PIPE
import os

def runit(cmd):
    proc = Popen(cmd, shell=True, stdout=PIPE, stderr=PIPE, universal_newlines=True)
    return proc.stdout.read()

#print(runit('ls -l'))

it = []
for i in range(1, 3):
    it.append('ls -l')

if __name__ == "__main__":
    # all calls to the multiprocessing module are "protected" by this guard
    pool = Pool()
    results = pool.map(runit, it)
    print(results)
(Studying the error messages more closely, I'm now pretty sure that just moving pool = Pool() after the definition of runit would also fix it on Linux, but wrapping it in the __main__ guard both fixes it and makes it portable.)
That said, note that your workers do nothing but spawn new processes themselves, so you'd be better off with a thread pool (see Threading pool similar to the multiprocessing Pool?): threads that create the subprocesses, like this:
from multiprocessing.pool import ThreadPool  # uses threads, not processes
from subprocess import Popen, PIPE
import os

def runit(cmd):
    proc = Popen(cmd, shell=True, stdout=PIPE, stderr=PIPE, universal_newlines=True)
    return proc.stdout.read()

it = []
for i in range(1, 3):
    it.append('ls -l')

if __name__ == "__main__":
    pool = ThreadPool()  # ThreadPool instead of Pool
    results = pool.map(runit, it)
    print(results)
The latter solution is more lightweight and less issue-prone (multiprocessing is a delicate module to handle). You'll be able to work with objects, shared data, etc. without needing a Manager object, among other advantages.
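For example, here is a small sketch of that shared-data point (the names are made up): with a thread pool, workers can write into an ordinary dict in the same process, guarded by a plain lock, with no Manager involved.
from multiprocessing.pool import ThreadPool
from subprocess import Popen, PIPE
import threading

results = {}                    # shared directly between the threads
results_lock = threading.Lock()

def runit(cmd):
    proc = Popen(cmd, shell=True, stdout=PIPE, stderr=PIPE, universal_newlines=True)
    out, _ = proc.communicate()
    with results_lock:          # guard concurrent writes to the dict
        results[cmd] = out
    return out

if __name__ == "__main__":
    pool = ThreadPool()
    pool.map(runit, ['ls -l', 'ls /tmp'])
    pool.close()
    pool.join()
    print(sorted(results))      # the commands that were run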
I am trying some simple programs which involve multiprocessing features in Python.
The code is given below:
from multiprocessing import Process, Queue

def print_square(i):
    print i*i

if __name__ == '__main__':
    output = Queue()
    processes = [Process(target=print_square, args=(i,)) for i in range(5)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
However, this gives an error message AttributeError: 'module' object has no attribute 'heappush'. The complete output upon executing the script is given below:
Traceback (most recent call last):
File "parallel_3.py", line 15, in <module>
output = Queue()
File "C:\Users\abc\AppData\Local\Continuum\Anaconda2\lib\multiprocessing\__init__.py", line 217, in Queue
from multiprocessing.queues import Queue
File "C:\Users\abc\AppData\Local\Continuum\Anaconda2\lib\multiprocessing\queues.py", line 45, in <module>
from Queue import Empty, Full
File "C:\Users\abc\AppData\Local\Continuum\Anaconda2\lib\Queue.py", line 212, in <module>
class PriorityQueue(Queue):
File "C:\Users\abc\AppData\Local\Continuum\Anaconda2\lib\Queue.py", line 224, in PriorityQueue
def _put(self, item, heappush=heapq.heappush):
AttributeError: 'module' object has no attribute 'heappush'
The code runs fine if the output = Queue() statement is commented out.
What could possibly be causing this error?
Your filename is most likely the same as a standard-library module name: judging from the traceback, the script (or a file next to it) is named heapq.py, so Queue.py ends up importing your file instead of the real heapq module. Rename it to something else, such as heap.py, and it will work.
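A quick way to confirm that kind of shadowing (a sketch; run it from the interactive interpreter in the same directory as your script):
import heapq
# If this prints a path inside your own project directory rather than the
# Python lib directory, a local file is shadowing the standard-library module.
print(heapq.__file__)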
I'm trying to make a multiprocessing Queue in Python 2.7 that fills up to its maxsize with processes, and then, while there are more processes to be done that haven't yet been put into the Queue, refills the Queue whenever any of the current procs finish. I'm trying to maximize performance, so the size of the Queue is the number of cores on the PC, so that each core always has work (ideally the CPU will be at 100% use the whole time). I'm also trying to avoid context switching, which is why I only want this many in the Queue at any time.
For example, say there are 50 tasks to be done and the CPU has 4 cores, so the Queue will have maxsize 4. We start by filling the Queue with 4 processes, and immediately upon any of those 4 finishing (at which point there will be 3 in the Queue), a new proc is generated and sent to the queue. It continues doing this until all 50 tasks have been generated and completed.
This task is proving to be difficult since I'm new to multiprocessing, and it also seems that join() will not work for me, since it blocks until ALL of the procs in the Queue have completed, which is NOT what I want.
Here is my code right now:
def queuePut(q, thread):
    q.put(thread)

def launchThreads(threadList, performanceTestList, resultsPath, cofluentExeName):
    numThreads = len(threadList)
    threadsLeft = numThreads
    print "numThreads: " + str(numThreads)
    cpuCount = multiprocessing.cpu_count()
    q = multiprocessing.Queue(maxsize=cpuCount)
    count = 0
    while count != numThreads:
        while not q.full():
            thread = threadList[numThreads - threadsLeft]
            p = multiprocessing.Process(target=queuePut, args=(q, thread))
            print "Starting thread " + str(numThreads - threadsLeft)
            p.start()
            threadsLeft -= 1
            count += 1
            if threadsLeft == 0:
                threadsLeft += 1
                break
Here is where it gets called in code:
for i in testNames:
    p = multiprocessing.Process(target=worker, args=(i, paths[0], cofluentExeName,))
    jobs.append(p)

launchThreads(jobs, testNames, testDirectory, cofluentExeName)
The procs seem to get created and put into the queue. For an example where there are 12 tasks and 40 cores, the output is as follows, followed by the error below:
numThreads: 12
Starting thread 0
Starting thread 1
Starting thread 2
Starting thread 3
Starting thread 4
Starting thread 5
Starting thread 6
Starting thread 7
Starting thread 8
Starting thread 9
Starting thread 10
Starting thread 11
File "C:\Python27\lib\multiprocessing\queues.py", line 262, in _feed
send(obj)
File "C:\Python27\lib\multiprocessing\process.py", line 290, in __reduce__
'Pickling an AuthenticationString object is '
TypeError: Pickling an AuthenticationString object is disallowed for security re
asons
Traceback (most recent call last):
File "C:\Python27\lib\multiprocessing\queues.py", line 262, in _feed
send(obj)
File "C:\Python27\lib\multiprocessing\process.py", line 290, in __reduce__
'Pickling an AuthenticationString object is '
TTypeError: Pickling an AuthenticationString object is disallowed for security r
easons
raceback (most recent call last):
File "C:\Python27\lib\multiprocessing\queues.py", line 262, in _feed
send(obj)
File "C:\Python27\lib\multiprocessing\process.py", line 290, in __reduce__
'Pickling an AuthenticationString object is '
TTypeError: Pickling an AuthenticationString object is disallowed for security r
easons
raceback (most recent call last):
File "C:\Python27\lib\multiprocessing\queues.py", line 262, in _feed
send(obj)
File "C:\Python27\lib\multiprocessing\process.py", line 290, in __reduce__
'Pickling an AuthenticationString object is '
TypeError: Pickling an AuthenticationString object is disallowed for security re
asons
Why don't you use a multiprocessing Pool to accomplish this?
import multiprocessing
pool = multiprocessing.Pool()
pool.map(your_function, dataset)  # dataset is a list; could be any other iterable
pool.close()
pool.join()
multiprocessing.Pool() accepts a processes=N argument where you specify the number of worker processes to start. If you don't specify it, it will start as many workers as you have cores (so if you have 4 cores, 4 workers). When one task finishes, the worker automatically picks up the next one; you don't have to manage that.
Multiprocessing: https://docs.python.org/2/library/multiprocessing.html
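For instance, a sketch adapted to the call site in the question (worker, testNames and cofluentExeName are the asker's names; the values bound here are made up): the fixed per-test arguments are held in module-level constants so that the mapped function only takes the test name.
import multiprocessing

RESULTS_PATH = 'results'         # hypothetical stand-in for paths[0]
EXE_NAME = 'cofluent.exe'        # hypothetical stand-in for cofluentExeName

def worker(test_name):
    # the real per-test work would use test_name, RESULTS_PATH and EXE_NAME
    pass

if __name__ == '__main__':
    testNames = ['test_a', 'test_b', 'test_c']    # hypothetical inputs
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    pool.map(worker, testNames)  # at most cpu_count() tests run at once
    pool.close()
    pool.join()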
I wrote this sample program. It creates 8 threads and spawns a process in each one:
import threading
from multiprocessing import Process

def fast_function():
    pass

def thread_function():
    process_number = 1
    print 'start %s processes' % process_number
    for i in range(process_number):
        p = Process(target=fast_function, args=())
        p.start()
        p.join()

def main():
    threads_number = 8
    print 'start %s threads' % threads_number
    threads = [threading.Thread(target=thread_function, args=())
               for i in range(threads_number)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

if __name__ == '__main__':
    main()
It crashes with several exceptions like this
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
self.run()
File "/usr/lib/python2.6/threading.py", line 484, in run
self.__target(*self.__args, **self.__kwargs)
File "./repeat_multiprocessing_bug.py", line 15, in thread_function
p.start()
File "/usr/lib/python2.6/multiprocessing/process.py", line 99, in start
_cleanup()
File "/usr/lib/python2.6/multiprocessing/process.py", line 53, in _cleanup
if p._popen.poll() is not None:
File "/usr/lib/python2.6/multiprocessing/forking.py", line 106, in poll
pid, sts = os.waitpid(self.pid, flag)
OSError: [Errno 10] No child processes
Python version is 2.6.5. Can somebody explain what I'm doing wrong?
You're probably trying to run it from the interactive interpreter. Try writing your code to a file and running it as a Python script; it works on my machine...
See the explanation and examples at the Python multiprocessing docs.
The multiprocessing module has a thread-safety issue in 2.6.5. Your best bet is updating to a newer Python, or adding this patch to 2.6.5: http://hg.python.org/cpython/rev/41aef062d529/
The bug is described in more detail in the following links:
http://bugs.python.org/issue11891
http://bugs.python.org/issue1731717
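If neither upgrading nor patching is an option, one possible stopgap (a sketch only; I have not verified it against the linked reports) is to serialize the Process start calls across threads so that multiprocessing's internal child cleanup never runs concurrently:
import threading
from multiprocessing import Process

start_lock = threading.Lock()      # shared by all launcher threads

def fast_function():
    pass

def thread_function():
    with start_lock:               # only one thread starts a Process at a time
        p = Process(target=fast_function, args=())
        p.start()
    p.join()

if __name__ == '__main__':
    threads = [threading.Thread(target=thread_function) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()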