ThreadPoolExecutor with stateful workers - python

I'm working with a Backend class which spawns a subprocess to perform the CPU-bound work. I have no control over that class, and basically the only way to interact with it is to create an instance backend = Backend() and submit work via backend.run(data) (this in turn submits the work to the subprocess and blocks until completion). Because these computations take quite some time, I'd like to perform them in parallel. Since the Backend class already spawns its own subprocess to perform the actual work, this appears to be an IO-bound situation from the main process's point of view.
So I thought about using multiple threads, each of which uses its own Backend instance. I could create these threads manually and connect them via queues. The following is an example implementation with some Backend mock class:
import os
import pty
from queue import Queue
from subprocess import PIPE, Popen
from threading import Thread
class Backend:
    def __init__(self):
        f, g = pty.openpty()
        self.process = Popen(
            ['bash'],  # example program
            text=True, bufsize=1, stdin=PIPE, stdout=g)
        self.write = self.process.stdin.write
        self.read = os.fdopen(f).readline

    def __enter__(self):
        self.write('sleep 2\n')  # startup work
        return self

    def __exit__(self, *exc):
        self.process.stdin.close()
        self.process.kill()

    def run(self, x):
        self.write(f'sleep {x} && echo "ok"\n')  # perform work
        return self.read().strip()


class Worker(Thread):
    def __init__(self, inq, outq, **kwargs):
        super().__init__(**kwargs)
        self.inq = inq
        self.outq = outq

    def run(self):
        with Backend() as backend:
            while True:
                data = self.inq.get()
                result = backend.run(data)
                self.outq.put((data, result))


task_queue = Queue()
result_queue = Queue()
n_workers = 3
threads = [Worker(task_queue, result_queue, daemon=True) for _ in range(n_workers)]
for thread in threads:
    thread.start()

data = [2]*7
for x in data:
    task_queue.put(x)
for _ in data:
    print(f'Result ready: {result_queue.get()}')
Since the Backend needs to perform some work at startup, I don't want to create a new instance for each task. Hence each Worker creates one Backend instance for its whole life cycle. It's also important that each of the workers has its own backend, so they won't interfere with each other.
Now here's the question: Can I also use concurrent.futures.ThreadPoolExecutor to accomplish this? It looks like the Executor.map method would be the right candidate, but I can't figure out how to ensure that each worker receives its own instance of Backend (which needs to be persistent between tasks).

The state of worker threads can be saved in the global namespace, e.g. as a dict. Then threading.current_thread can be used to save/load the state for each of the workers. contextlib.ExitStack can be used to handle Backend appropriately as a context manager.
from concurrent.futures import ThreadPoolExecutor
from contextlib import ExitStack
import os
import pty
from subprocess import PIPE, Popen
import threading
class Backend:
    ...

backends = {}
exit_stack = ExitStack()

def init_backend():
    backends[threading.current_thread()] = exit_stack.enter_context(Backend())

def compute(data):
    return data, backends[threading.current_thread()].run(data)

with exit_stack:
    with ThreadPoolExecutor(max_workers=3, initializer=init_backend) as executor:
        for result in executor.map(compute, [2]*7):
            print(f'Result ready: {result}')
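A close variation on the same idea, sketched here rather than taken from the answer above: keep each worker's Backend in threading.local instead of a module-level dict (this reuses the Backend class from the question):
from concurrent.futures import ThreadPoolExecutor
from contextlib import ExitStack
import threading

local = threading.local()  # each pool thread sees its own attribute namespace
exit_stack = ExitStack()

def init_backend():
    # runs exactly once per worker thread, so the Backend persists across tasks
    local.backend = exit_stack.enter_context(Backend())

def compute(data):
    return data, local.backend.run(data)

with exit_stack:
    with ThreadPoolExecutor(max_workers=3, initializer=init_backend) as executor:
        for result in executor.map(compute, [2]*7):
            print(f'Result ready: {result}')
Either way, the important part is that the initializer runs once in each worker thread, so every thread holds on to its own persistent Backend between tasks.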

Related

Can I dynamically register objects to proxy with a multiprocessing BaseManager?

There are plenty of examples of using a multiprocessing BaseManager-derived class to register a method for returning a queue handle proxy, that clients can then use to pull/put from the queue.
This is great, but I have a different scenario - what if the number of queues that I need to proxy changes in response to outside events? What I really want is to proxy a method that returns a specific queue given a UID.
I tried this out, but I couldn't get it to work; it appears that the only things available are those registered with the class before the object is instantiated. I'm unable to call BaseManager.register("my-new-queue", lambda: queue.Queue) once I've already instantiated an instance of that class and caused it to run.
Is there any way around this? It feels to me like we should be able to handle this dynamically.
The registration is most important in the "server" process, where the callable will actually get called. Registering a callable in a "client" process only adds that typeid (the string you pass to register) as a method to the manager class. The rub is that running the server blocks, preventing you from registering new callables, and it occurs in another process, making it further difficult to modify the registry.
I've been tinkering with this a little while... imo managers are cursed. I think your prior question would also be answered (aside from our discussion in the comments) by the thing that solved it. Basically, Python attempts to be a little bit secure about not sending around the authkey parameter for proxied objects, but it stumbles sometimes (particularly with nested proxies). The fix is to set the default authkey for the process, mp.current_process().authkey = b'abracadabra', which is used as the fallback when authkey=None (https://bugs.python.org/issue7503).
Here's my full testing script which is derived from the remote manager example from the docs. Basically I create a shared dict to hold shared queues:
# server process
from multiprocessing.managers import BaseManager, DictProxy
from multiprocessing import current_process
from queue import Queue

queues = {}  # dict[uuid, Queue]

class QueueManager(BaseManager):
    pass

QueueManager.register('new_queue', callable=Queue)
QueueManager.register('get_queues', callable=lambda: queues, proxytype=DictProxy)

m = QueueManager(address=('localhost', 50000), authkey=b'abracadabra')
current_process().authkey = b'abracadabra'
s = m.get_server()
s.serve_forever()

# process A
from multiprocessing.managers import BaseManager
from multiprocessing import current_process

class QueueManager(BaseManager):
    pass

QueueManager.register('new_queue')
QueueManager.register('get_queues')

m = QueueManager(address=('localhost', 50000), authkey=b'abracadabra')
m.connect()
current_process().authkey = b'abracadabra'

queues_dict = m.get_queues()
queues_dict['my_uuid'] = m.new_queue()
queues_dict['my_uuid'].put("this is a test")

# process B
from multiprocessing.managers import BaseManager
from multiprocessing import current_process

class QueueManager(BaseManager):
    pass

QueueManager.register('new_queue')
QueueManager.register('get_queues')

m = QueueManager(address=('localhost', 50000), authkey=b'abracadabra')
m.connect()
current_process().authkey = b'abracadabra'

queues_dict = m.get_queues()
print(queues_dict['my_uuid'].get())
EDIT:
Regarding the comment "get_queue take the UUID and return the specific queue": the modification is simple, and it does not involve nested proxies, thereby avoiding the digest auth issue:
#server process
from multiprocessing.managers import BaseManager
from collections import defaultdict
from queue import Queue
queues = defaultdict(Queue)
class QueueManager(BaseManager): pass
QueueManager.register('get_queue', callable=lambda uuid:queues[uuid])
m = QueueManager(address=('localhost', 50000), authkey=b'abracadabra')
s = m.get_server()
s.serve_forever()
#process A
from multiprocessing.managers import BaseManager
class QueueManager(BaseManager): pass
QueueManager.register('get_queue')
m = QueueManager(address=('localhost', 50000), authkey=b'abracadabra')
m.connect()
m.get_queue("my_uuid").put("this is a test")
#process B
from multiprocessing.managers import BaseManager
class QueueManager(BaseManager): pass
QueueManager.register('get_queue')
m = QueueManager(address=('localhost', 50000), authkey=b'abracadabra')
m.connect()
print(m.get_queue("my_uuid").get())
Aaron's answer is perhaps the simplest way here: you share a dictionary and store the queues in that shared dictionary. However, it does not answer the problem of not being able to update the methods on a manager once it has started. Therefore, here is a more complete solution, less verbose than its alternative, where you can update the registry even after the server has started:
from queue import Queue
from multiprocessing.managers import SyncManager, Server, State, dispatch
from multiprocessing.context import ProcessError


class UpdateServer(Server):
    public = ['shutdown', 'create', 'accept_connection', 'get_methods',
              'debug_info', 'number_of_objects', 'dummy', 'incref', 'decref',
              'update_registry']

    def update_registry(self, c, registry):
        with self.mutex:
            self.registry.update(registry)


class UpdateManager(SyncManager):
    _Server = UpdateServer

    def get_server(self):
        if self._state.value != State.INITIAL:
            if self._state.value == State.STARTED:
                raise ProcessError("Already started server")
            elif self._state.value == State.SHUTDOWN:
                raise ProcessError("Manager has shut down")
            else:
                raise ProcessError(
                    "Unknown state {!r}".format(self._state.value))
        return self._Server(self._registry, self._address,
                            self._authkey, self._serializer)

    def update_registry(self):
        assert self._state.value == State.STARTED, 'server not yet started'
        conn = self._Client(self._address, authkey=self._authkey)
        try:
            dispatch(conn, None, 'update_registry', (type(self)._registry, ), {})
        finally:
            conn.close()


class MyQueue:
    def __init__(self):
        self.initialized = False
        self.q = None

    def initialize(self):
        self.q = Queue()

    def __call__(self):
        if not self.initialized:
            self.initialize()
            self.initialized = True
        return self.q


if __name__ == '__main__':
    # Create an object of the wrapper class; note that we do not initialize the queue right away (it's unpicklable)
    queue = MyQueue()
    manager = UpdateManager()
    manager.start()

    # If you register new typeids, then call update_registry. The method_to_typeid parameter maps the __call__ method
    # to return a proxy of the queue instead, since Queues are not picklable
    UpdateManager.register('uuid', queue, method_to_typeid={'__call__': 'Queue'})
    manager.update_registry()

    # Once the object is stored in the manager process, we can safely initialize the queue and share it among
    # processes. Initialization is implicit when we call uuid() if it's not already initialized
    q = manager.uuid()
    q.put('bye')
    print(q.get())
Here, UpdateServer and UpdateManager add support for an update_registry method, which informs the server whenever new typeids are registered with the manager. MyQueue is simply a wrapper class that returns the newly registered queue when called directly. While it's functionally similar to registering lambda: queue, the wrapper is necessary because lambda functions are not picklable and the server is being started in a new process here (rather than doing server.serve_forever() in the main process, though you can do that too if you want).
So, you can now register typeids even after the manager process is running; just make sure to call the update_registry function right after. This function call will even work if you are starting the server in the main process itself (by using serve_forever, like in Aaron's answer) and connecting to it from another process using manager.connect.

How to create one thread for slowly logging so that the main jobs can continue running (in python)?

I have a main job with heavy calculations and also logging with many IO operations.
I don't care much about either the speed or the order of the logging.
What I want is a log collector that can take the context I want to log and handle it in a new thread, so that my main script can keep running without being blocked.
The code I tried is as below:
import threading
from loguru import logger
from collections import deque
import time
class ThreadLogger:
    def __init__(self):
        self.thread = threading.Thread(target=self.run, daemon=True)
        self.log_queue = deque()
        self.thread.start()
        self.run()

    def run(self):
        # I also have tried while True:
        while self.log_queue:
            log_func, context = self.log_queue.popleft()
            log_func(*context)

    def addLog(self, log_func, context):
        self.log_queue.append([log_func, context])


thlogger = ThreadLogger()
for i in range(20):
    # add log here with new thread so that won't affect main jobs
    thlogger.addLog(logger.debug, (f'hi {i}',))

    # main jobs here (I want to do some real shit here with heavy calculation)
The code above doesn't really work as I expect:
It cannot detect by itself when to digest the queue.
Also, if I use "while True:" it just blocks, so the queue never gets longer.
All other techniques I can come up with don't really run on a single new thread.
Any suggestions would be very appreciated!
Remove the call to self.run(); you have already started a thread to run that method, and it is that call that is blocking your program. It causes the main thread to sit blocked on the empty queue.
def __init__(self):
    self.thread = threading.Thread(target=self.run, daemon=True)
    self.log_queue = deque()
    self.thread.start()
    # self.run()  # remove
Once you do that, you can change while self.log_queue: to while True:
Following Dan D.'s answer:
import threading
from loguru import logger
from collections import deque
import time
class ThreadLogger:
    def __init__(self):
        self.thread = threading.Thread(target=self.run, daemon=True)
        self.log_queue = deque()
        self.thread.start()

    def run(self):
        while True:
            if self.log_queue:
                log_func, context = self.log_queue.popleft()
                log_func(*context)

    def addLog(self, log_func, context):
        self.log_queue.append([log_func, context])


thlogger = ThreadLogger()
for i in range(20):
    thlogger.addLog(logger.debug, (f'hi {i}',))
    time.sleep(1)  # wait for log to happen
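A further variation, sketched here rather than taken from either answer: queue.Queue provides a blocking get(), so the logging thread sleeps while the queue is empty instead of spinning on a deque:
import queue
import threading
import time

from loguru import logger

class ThreadLogger:
    def __init__(self):
        self.log_queue = queue.Queue()
        self.thread = threading.Thread(target=self.run, daemon=True)
        self.thread.start()

    def run(self):
        while True:
            log_func, context = self.log_queue.get()  # blocks until an item arrives
            log_func(*context)

    def addLog(self, log_func, context):
        self.log_queue.put((log_func, context))

thlogger = ThreadLogger()
for i in range(20):
    thlogger.addLog(logger.debug, (f'hi {i}',))
time.sleep(1)  # give the daemon thread a moment to drain the queue before the script exits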

Is there something like NSOperationQueue from ObjectiveC in Python?

I'm looking into concurrency options for Python. Since I'm an iOS/macOS developer, I'd find it very useful if there was something like NSOperationQueue in Python.
Basically, it's a queue to which you can add operations (every operation is an Operation-derived class with a run method to implement), which are executed either serially or in parallel; ideally, various dependencies can be set on operations (i.e. that some operation depends on others being executed before it can start).
Have you looked at Celery as an option? This is how the Celery website describes it:
Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.
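For reference, a Celery task for this kind of operation might look roughly like the sketch below (the module name tasks.py and the Redis broker URL are assumptions for illustration; any supported broker works):
# tasks.py
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def heavy_operation(data):
    # the body of the "operation" goes here
    return data * 2
A worker started with celery -A tasks worker --concurrency=1 runs the operations serially (raise the concurrency for parallel execution), producers enqueue work with heavy_operation.delay(21), and simple dependencies can be expressed with Celery's chains and groups.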
I'm looking for it, too. But since it doesn't seem to exist yet, I have written my own implementation:
import time
import threading
import queue
import weakref
class OperationQueue:
    def __init__(self):
        self.thread = None
        self.queue = queue.Queue()

    def run(self):
        while self.queue.qsize() > 0:
            msg = self.queue.get()
            print(msg)
            # emulate the operation taking time
            time.sleep(2)

    def addOperation(self, string):
        # put to the queue first, for thread safety
        self.queue.put(string)
        if not (self.thread and self.thread.is_alive()):
            print('renew a thread')
            self.thread = threading.Thread(target=self.run)
            self.thread.start()


myQueue = OperationQueue()
myQueue.addOperation("test1")
# test whether it is freed automatically
item = weakref.ref(myQueue)
time.sleep(1)
myQueue.addOperation("test2")
myQueue = None
time.sleep(3)
print(f'item = {item}')
print("Done.")

Python - multiprocessing - processes became zombies

For a couple of weeks I have been trying to solve a problem with the multiprocessing module in Python (2.7.x).
Idea:
Let's have a message queue (RabbitMQ in our case). Create a listener on that queue, and on each message spawn a task which will process that message.
Problem:
Everything works fine, but after a couple of hundred tasks, some sub-processes become zombies, which is the main problem.
We also have some limits (such as a max number of tasks per machine), which in the end means that the machine stops processing any tasks.
Current implementation:
I created a minimal example which should explain our approach:
# -*- coding: utf-8 -*-
from multiprocessing import Process
import signal
from threading import Lock


class Task(Process):
    def __init__(self, data):
        super(Task, self).__init__()
        self.data = data

    def run(self):
        # ignore sigchild signals in subprocess
        signal.signal(signal.SIGCHLD, signal.SIG_DFL)
        self.do_job()  # long job there

    def do_job(self):
        # very long job
        pass


class MQListener(object):
    def __init__(self):
        self.tasks = []
        self.tasks_lock = Lock()
        self.register_signal_handler()
        mq = RabbitMQ()
        mq.listen("task_queue", self.on_message)

    def register_signal_handler(self):
        signal.signal(signal.SIGCHLD, self.on_signal_received)

    def on_signal_received(self, *_):
        self._check_existing_processes()

    def on_message(self, message):
        # ack message and create task
        task = Task(message)
        with self.tasks_lock:
            self.tasks.append(task)
        task.start()

    def _check_existing_processes(self):
        """
        Go over all created tasks; if some are not alive, remove them from the tasks collection.
        """
        try:
            with self.tasks_lock:
                running_tasks = []
                for w in self.tasks:
                    if not w.is_alive():
                        w.join()
                    else:
                        running_tasks.append(w)
                self.tasks = running_tasks
        except Exception:
            # log
            pass


if __name__ == '__main__':
    m = MQListener()
I'm quite open to use some library for that - if you can recommend some, that will be great as well.
Using SIGCHLD to catch child process termination has quite a few gotchas. The signal handler is run asynchronously, and multiple SIGCHLD deliveries might get aggregated.
In short, it is better not to use it unless you are really aware of how it works.
Your program has another issue as well: what happens if you get 10000 messages at once? You'll spawn 10000 processes altogether and kill your machine.
You could use a process Pool and let it handle all these issues for you.
from multiprocessing import Pool


class MQListener(object):
    def __init__(self):
        self.pool = Pool()
        self.rabbitclient = RabbitMQ()

    def new_message(self, message):
        self.pool.apply_async(do_job, args=(message, ))

    def run(self):
        self.rabbitclient.listen("task_queue", self.new_message)


app = MQListener()
app.run()

python interprocess querying/control

I have this Python based service daemon which is doing a lot of multiplexed IO (select).
From another script (also Python) I want to query this service daemon about status/information and/or control the processing (e.g. pause it, shut it down, change some parameters, etc).
What is the best way to send control messages ("from now on you process like this!") and query processed data ("what was the result of that?") using python?
I read somewhere that named pipes might work, but I don't know that much about named pipes, especially in Python - and whether there are any better alternatives.
Both the background service daemon AND the frontend will be programmed by me, so all options are open :)
I am using Linux.
Pipes and named pipes are a good solution for communicating between different processes.
Pipes work like a shared memory buffer, but with an interface that mimics a simple file on each of the two ends. One process writes data at one end of the pipe, and another reads that data at the other end.
Named pipes are similar to the above, except that the pipe is actually associated with a real file on your computer.
More details at
http://www.softpanorama.org/Scripting/pipes.shtml
In Python, named pipe files are created with the os.mkfifo call:
os.mkfifo(filename)
In the child and the parent, open this pipe as a file:
out = os.open(filename, os.O_WRONLY)
fin = open(filename, 'r')
To write:
os.write(out, b'xxxx')
To read:
line = fin.readline()
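Put together, a minimal end-to-end sketch might look like this (the FIFO path and the message are made up for illustration; the parent writes and a forked child reads):
import os

path = '/tmp/demo_fifo'
if not os.path.exists(path):
    os.mkfifo(path)

pid = os.fork()
if pid == 0:
    # child: open the FIFO for reading and print one line
    with open(path) as fin:
        print('child got: ' + fin.readline().strip())
    os._exit(0)
else:
    # parent: open the FIFO for writing (this blocks until the child opens its end)
    with open(path, 'w') as fout:
        fout.write('hello over the fifo\n')
    os.waitpid(pid, 0)
In the daemon/frontend scenario the two ends would simply live in two separate programs that agree on the FIFO path.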
Edit: Adding links from SO
Create a temporary FIFO (named pipe) in Python?
https://stackoverflow.com/search?q=python+named+pipes
You may want to read more on "IPC and Python"
http://www.freenetpages.co.uk/hp/alan.gauld/tutipc.htm
The best way to do IPC is using a message queue in Python, as below.
The server process, server.py (run this before running client.py and interact.py):
from multiprocessing.managers import BaseManager
import Queue
queue1 = Queue.Queue()
queue2 = Queue.Queue()
class QueueManager(BaseManager): pass
QueueManager.register('get_queue1', callable=lambda:queue1)
QueueManager.register('get_queue2', callable=lambda:queue2)
m = QueueManager(address=('', 50000), authkey='abracadabra')
s = m.get_server()
s.serve_forever()
The inter-actor, interact.py, which handles the I/O:
from multiprocessing.managers import BaseManager
import threading
import sys
class QueueManager(BaseManager): pass
QueueManager.register('get_queue1')
QueueManager.register('get_queue2')
m = QueueManager(address=('localhost', 50000),authkey='abracadabra')
m.connect()
queue1 = m.get_queue1()
queue2 = m.get_queue2()
def read():
    while True:
        sys.stdout.write(queue2.get())

def write():
    while True:
        queue1.put(sys.stdin.readline())
threads = []
threadr = threading.Thread(target=read)
threadr.start()
threads.append(threadr)
threadw = threading.Thread(target=write)
threadw.start()
threads.append(threadw)
for thread in threads:
thread.join()
The client program, client.py:
from multiprocessing.managers import BaseManager
import sys
import string
import os
class QueueManager(BaseManager): pass
QueueManager.register('get_queue1')
QueueManager.register('get_queue2')
m = QueueManager(address=('localhost', 50000), authkey='abracadabra')
m.connect()
queue1 = m.get_queue1()
queue2 = m.get_queue2()
class RedirectOutput:
    def __init__(self, stdout):
        self.stdout = stdout

    def write(self, s):
        queue2.put(s)


class RedirectInput:
    def __init__(self, stdin):
        self.stdin = stdin

    def readline(self):
        return queue1.get()


# redirect standard output and input
sys.stdout = RedirectOutput(sys.stdout)
sys.stdin = RedirectInput(sys.stdin)

# The test program which will take input and produce output
Text = raw_input("Enter Text:")
print "you have entered:", Text

def x():
    while True:
        x = raw_input("Enter 'exit' to end and something else to continue")
        print x
        if 'exit' in x:
            break

x()
This can be used to communicate between two processes over the network or on the same machine.
Remember that the inter-actor and server processes will not terminate until you kill them manually.
