Light persistence in the context of ThreadPoolExecutor in Python

I've got some Python code that farms out expensive jobs using ThreadPoolExecutor, and I'd like to keep track of which of them have completed so that if I have to restart this system, I don't have to redo the stuff that already finished. In a single-threaded context, I could just mark what I've done in a shelf. Here's a naive port of that idea to a multithreaded environment:
from concurrent.futures import ThreadPoolExecutor
import subprocess
import shelve

def do_thing(done, x):
    # Don't let the command run in the background; we want to be able to tell when it's done
    _ = subprocess.check_output(["some_expensive_command", x])
    done[x] = True

futs = []
with shelve.open("done") as done:
    with ThreadPoolExecutor(max_workers=18) as executor:
        for x in things_to_do:
            if done.get(x, False):
                continue
            futs.append(executor.submit(do_thing, done, x))
            # Can't run `done[x] = True` here--have to wait until do_thing finishes
        for future in futs:
            future.result()
        # Don't want to wait until here to mark stuff done, as the whole system might be killed at some point
        # before we get through all of things_to_do
Can I get away with this? The documentation for shelve doesn't contain any guarantees about thread safety, so I'm thinking no.
So what is the simple way to handle this? I thought that perhaps sticking done[x] = True in future.add_done_callback would do it, but that callback will often run in the same thread as the future itself. Perhaps there is a locking mechanism that plays nicely with ThreadPoolExecutor? That seems cleaner to me than writing a loop that sleeps and then checks for completed futures.

While you're still inside the outermost with context manager, the done shelf is just a normal Python object: it is not guaranteed to be written to disk until the context manager closes and runs its __exit__ method. It is therefore just as thread safe as any other Python object, thanks to the GIL (as long as you're using CPython).
Specifically, the assignment done[x] = True is thread safe / will be done atomically.
It's important to note that while the shelf's __exit__ method will run after a Ctrl-C, it won't run if the Python process ends abruptly, and the shelf won't be saved to disk.
To protect against this kind of failure, I would suggest using a lightweight, file-based, thread-safe database like sqlite3.
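A minimal sketch of that suggestion, reusing things_to_do and the expensive command from the question. The table name and the one-connection-per-thread scheme are my assumptions, not part of the original answer (sqlite3 connections are not guaranteed to be shareable across threads, so each worker thread gets its own):
import sqlite3
import subprocess
import threading
from concurrent.futures import ThreadPoolExecutor

local = threading.local()

def get_db():
    # One connection per worker thread.
    if not hasattr(local, "db"):
        local.db = sqlite3.connect("done.db")
    return local.db

def do_thing(x):
    _ = subprocess.check_output(["some_expensive_command", x])
    db = get_db()
    # The commit hits the disk immediately, so finished work survives
    # even if the whole process is killed later.
    db.execute("INSERT OR IGNORE INTO done (x) VALUES (?)", (x,))
    db.commit()

setup = sqlite3.connect("done.db")
setup.execute("CREATE TABLE IF NOT EXISTS done (x TEXT PRIMARY KEY)")
setup.commit()
already_done = {row[0] for row in setup.execute("SELECT x FROM done")}
setup.close()

with ThreadPoolExecutor(max_workers=18) as executor:
    for x in things_to_do:
        if x not in already_done:
            executor.submit(do_thing, x)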

asyncio with multiple processors [duplicate]

As almost everyone is aware when they first look at threading in Python, there is the GIL that makes life miserable for people who actually want to do processing in parallel - or at least give it a chance.
I am currently looking at implementing something like the Reactor pattern. Effectively I want to listen for incoming socket connections on one thread-like, and when someone tries to connect, accept that connection and pass it along to another thread-like for processing.
I'm not (yet) sure what kind of load I might be facing. I know there is currently a 2 MB cap set on incoming messages. Theoretically we could get thousands per second (though I don't know if we've seen anything like that in practice). The amount of time spent processing a message isn't terribly important, though obviously quicker would be better.
I was looking into the Reactor pattern, and developed a small example using the multiprocessing library that (at least in testing) seems to work just fine. However, now/soon we'll have the asyncio library available, which would handle the event loop for me.
Is there anything that could bite me by combining asyncio and multiprocessing?
You should be able to safely combine asyncio and multiprocessing without too much trouble, though you shouldn't be using multiprocessing directly. The cardinal sin of asyncio (and any other event-loop based asynchronous framework) is blocking the event loop. If you try to use multiprocessing directly, any time you block to wait for a child process, you're going to block the event loop. Obviously, this is bad.
The simplest way to avoid this is to use BaseEventLoop.run_in_executor to execute a function in a concurrent.futures.ProcessPoolExecutor. ProcessPoolExecutor is a process pool implemented using multiprocessing.Process, but asyncio has built-in support for executing a function in it without blocking the event loop. Here's a simple example:
import time
import asyncio
from concurrent.futures import ProcessPoolExecutor

def blocking_func(x):
    time.sleep(x)  # Pretend this is expensive calculations
    return x * 5

@asyncio.coroutine
def main():
    # pool = multiprocessing.Pool()
    # out = pool.apply(blocking_func, args=(10,))  # This blocks the event loop.
    executor = ProcessPoolExecutor()
    out = yield from loop.run_in_executor(executor, blocking_func, 10)  # This does not
    print(out)

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
For the majority of cases, this function alone is good enough. If you find yourself needing other constructs from multiprocessing, like Queue, Event, Manager, etc., there is a third-party library called aioprocessing (full disclosure: I wrote it) that provides asyncio-compatible versions of all the multiprocessing data structures. Here's an example demonstrating that:
import time
import asyncio
import aioprocessing
import multiprocessing

def func(queue, event, lock, items):
    with lock:
        event.set()
        for item in items:
            time.sleep(3)
            queue.put(item+5)
    queue.close()

@asyncio.coroutine
def example(queue, event, lock):
    l = [1,2,3,4,5]
    p = aioprocessing.AioProcess(target=func, args=(queue, event, lock, l))
    p.start()
    while True:
        result = yield from queue.coro_get()
        if result is None:
            break
        print("Got result {}".format(result))
    yield from p.coro_join()

@asyncio.coroutine
def example2(queue, event, lock):
    yield from event.coro_wait()
    with (yield from lock):
        yield from queue.coro_put(78)
        yield from queue.coro_put(None)  # Shut down the worker

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    queue = aioprocessing.AioQueue()
    lock = aioprocessing.AioLock()
    event = aioprocessing.AioEvent()
    tasks = [
        asyncio.async(example(queue, event, lock)),
        asyncio.async(example2(queue, event, lock)),
    ]
    loop.run_until_complete(asyncio.wait(tasks))
    loop.close()
Yes, there are quite a few bits that may (or may not) bite you.
When you run something like asyncio it expects to run on one thread or process. This does not (by itself) work with parallel processing. You somehow have to distribute the work while leaving the IO operations (specifically those on sockets) in a single thread/process.
While your idea to hand off individual connections to a different handler process is nice, it is hard to implement. The first obstacle is that you need a way to pull the connection out of asyncio without closing it. The next obstacle is that you cannot simply send a file descriptor to a different process unless you use platform-specific (probably Linux) code from a C-extension.
Note that the multiprocessing module is known to create a number of threads for communication. Most of the time when you use communication structures (such as Queues), a thread is spawned. Unfortunately, those threads are not completely invisible: for instance, they can fail to tear down cleanly when you intend to terminate your program, and depending on their number, their resource usage may be noticeable on its own.
If you really intend to handle individual connections in individual processes, I suggest examining different approaches. For instance, you can put a socket into listen mode and then accept connections from multiple worker processes in parallel. Once a worker is finished processing a request, it goes back to accept the next connection, so you still use fewer resources than forking a process for each connection. SpamAssassin and Apache (mpm_prefork) use this worker model, for instance. It might end up easier and more robust depending on your use case. Specifically, you can make your workers die after serving a configured number of requests and be respawned by a master process, thereby eliminating much of the negative effect of memory leaks. A sketch of this model follows.
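A minimal sketch of that pre-fork model, assuming a Unix platform where the fork start method lets worker processes inherit the listening socket; the port, worker count, and echo handler are placeholders:
import socket
from multiprocessing import Process

NUM_WORKERS = 4  # assumed worker count

def worker(listener):
    while True:
        conn, addr = listener.accept()  # workers compete to accept connections
        with conn:
            data = conn.recv(2 * 1024 * 1024)  # matches the 2 MB message cap
            conn.sendall(data)  # placeholder processing: echo the message back
        # fall through and accept the next connection

if __name__ == "__main__":
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("0.0.0.0", 8000))  # assumed port
    listener.listen(128)
    workers = [Process(target=worker, args=(listener,)) for _ in range(NUM_WORKERS)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()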
Based on dano's answer above, I wrote this function to replace the places where I used to use a multiprocessing pool + map.
import asyncio
from concurrent.futures import ProcessPoolExecutor
from typing import Callable

def asyncio_friendly_multiproc_map(fn: Callable, l: list):
    """
    This is designed to replace the use of this pattern:
        with multiprocessing.Pool(5) as p:
            results = p.map(analyze_day, list_of_days)
    by letting the caller drop in a replacement:
        asyncio_friendly_multiproc_map(analyze_day, list_of_days)
    """
    tasks = []
    with ProcessPoolExecutor(5) as executor:
        for e in l:
            tasks.append(asyncio.get_event_loop().run_in_executor(executor, fn, e))
        res = asyncio.get_event_loop().run_until_complete(asyncio.gather(*tasks))
    return res
See PEP 3156, in particular the section on Thread interaction:
http://www.python.org/dev/peps/pep-3156/#thread-interaction
This clearly documents the new asyncio methods you might use, including run_in_executor(). Note that the Executor is defined in concurrent.futures; I suggest you also have a look there.

When would I acquire a lock with block state set to False?

I was wondering why setting block=False would ever make sense.
from multiprocessing import Process, Lock

lock = Lock()
lock.acquire(block=False)
If I don't need to block, why would I use a Lock at all?
From Python in a Nutshell:
L.acquire(): When blocking is True, acquire locks L. If L is already locked, the calling thread suspends and waits until L is unlocked, then locks L. Even if the calling thread was the one that last locked L, it still suspends and waits until another thread releases L. When blocking is False and L is unlocked, acquire locks L and returns True. When blocking is False and L is locked, acquire does not affect L and returns False.
And a practical example using the following simple code:
from multiprocessing import Process, Lock, current_process

def blocking_testing(lock):
    if not lock.acquire(False):
        print('{} Couldn\'t get lock'.format(current_process().ident))
    else:
        print('{} Got lock'.format(current_process().ident))

if __name__ == '__main__':
    lock = Lock()
    procs = []
    for i in range(3):
        p = Process(target=blocking_testing, args=(lock,))
        procs.append(p)
        p.start()
    for p in procs:
        p.join()
With the above version (blocking=False) this outputs
12206 Got lock
12207 Couldn't get lock
12208 Couldn't get lock
If I set blocking=True (or remove it, as it defaults to True) the main process will hang, as the Lock is not being released.
Finally, if I set blocking=True and add a lock.release() at the end, my output will be
12616 Got lock
12617 Got lock
12618 Got lock
I hope this was a clear enough explanation.
multiprocessing.Lock is not used for blocking, it's used to protect one or more resources from concurrent access.
The simplest of the examples could be a file written by multiple processes. To guarantee that only one process at a time is writing on the given file, you protect it with a Lock.
There are situations where your logic cannot block. For example, if your logic is orchestrated by an event loop like the asyncio module, blocking would stop the entire execution until the Lock is released.
In such cases the common approach is to try to acquire the Lock: if you succeed, you proceed to access the protected resource; otherwise you move on to other routines and try again later.
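A minimal sketch of that try-and-retry pattern (the function and resource names are illustrative, not from the original answer):
from multiprocessing import Lock

lock = Lock()

def try_to_write(data):
    if not lock.acquire(block=False):
        return False  # lock was busy; do other work and retry later
    try:
        pass  # access the protected resource here
    finally:
        lock.release()
    return True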
This makes sense given the parameter's name: block. block=False provides non-blocking access to a protected resource.
Example one:
You have a GUI thread and a background worker thread. Your GUI thread needs to modify some data generated by the worker thread, but it cannot block, since blocking would freeze the whole interface. So you can use lock.acquire(block=False) to safely check whether the data is ready without blocking.
Example two:
Another example is an event loop such as asyncio, where non-blocking acquisition lets you access protected resources without stalling the loop.

How can I stop the execution of a Python function from outside of it?

So I have this library that I use, and within one of my functions I call a function from that library, which happens to take a really long time. At the same time, I have another thread running that checks for various conditions; if a condition is met, I want to cancel the execution of the library function.
Right now I'm checking the conditions at the start of the function, but if the conditions happen to change while the library function is running, I don't need its results, and want to return from it.
Basically this is what I have now.
def my_function():
    if condition_checker.condition_met():
        return
    library.long_running_function()
Is there a way to run the condition check every second or so and return from my_function when the condition is met?
I've thought about decorators and coroutines. I'm using 2.7, but if this can only be done in 3.x I'd consider switching; I just can't figure out how.
You cannot terminate a thread. Either the library supports cancellation by design, where it internally would have to check for a condition every once in a while to abort if requested, or you have to wait for it to finish.
What you can do is call the library in a subprocess rather than a thread, since processes can be terminated through signals. Python's multiprocessing module provides a threading-like API for spawning forks and handling IPC, including synchronization.
Or spawn a separate subprocess via subprocess.Popen if forking is too heavy on your resources (e.g. memory footprint through copying of the parent process).
I can't think of any other way, unfortunately.
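A minimal sketch of that multiprocessing approach, assuming the condition_checker and library objects from the question (the one-second polling interval is arbitrary):
import multiprocessing

def run_library_function():
    library.long_running_function()  # the call we may need to abandon

def my_function():
    p = multiprocessing.Process(target=run_library_function)
    p.start()
    while p.is_alive():
        if condition_checker.condition_met():
            p.terminate()  # processes, unlike threads, can be killed
            p.join()
            return
        p.join(timeout=1)  # re-check the condition roughly every second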
Generally, I think you want to run your long_running_function in a separate thread, and have it occasionally report its information to the main thread.
This post gives a similar example within a wxpython program.
Presuming you are doing this outside of wxpython, you should be able to replace the wx.CallAfter and wx.Publisher with threading.Thread and PubSub.
It would look something like this:
import threading
import time

def myfunction():
    # subscribe to the long_running_function
    while True:
        # subscribe to the long_running_function and get the published data
        if condition_met:
            # publish a stop command
            break
        time.sleep(1)

def long_running_function():
    for loop in loops:
        # subscribe to main thread and check for stop command; if so, break
        # do an iteration
        # publish some data
        pass

# launches your long_running_function but doesn't block flow
threading.Thread(group=None, target=long_running_function, args=()).start()
myfunction()
I haven't used pubsub a ton so I can't quickly whip up the code but it should get you there.
As an alternative, do you know the stop criteria before you launch the long_running_function? If so, you can just pass it as an argument and check whether it is met internally.
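A sketch of that alternative, using a threading.Event as the stop criterion passed into the worker (the names here are illustrative):
import threading

stop = threading.Event()

def long_running_function(stop):
    for i in range(1000):  # stand-in for the real work loop
        if stop.is_set():  # check the flag on every iteration
            return
        # do one iteration of work here

t = threading.Thread(target=long_running_function, args=(stop,))
t.start()
# ... later, from any thread:
stop.set()  # ask the worker to stop at its next check
t.join()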

Testing Python code in thread without modifications?

Let's say I have this blob of code that's made to be one long-running thread of execution, to poll for events and fire off other events (in my case, using XMLRPC calls). It needs to be refactored into clean objects so it can be unit tested, but in the meantime I want to capture some of its current behavior in some integration tests, treating it like a black box. For example:
# long-lived code
import xmlrpclib
s = xmlrpclib.ServerProxy('http://XXX:yyyy')

def do_stuff():
    while True:
        ...
        if s.xyz():
            s.do_thing(...)
# test code
import threading, time
# stub out xmlrpclib

def run_do_stuff():
    other_code.do_stuff()

def setUp():
    t = threading.Thread(target=run_do_stuff)
    t.setDaemon(True)

def tearDown():
    # somehow kill t
    t.join()

def test1():
    t.start()
    time.sleep(5)
    assert some_XMLRPC_side_effects
The last big issue is that the code under test is designed to run forever, until a Ctrl-C, and I don't see any way to force it to raise an exception or otherwise kill the thread so I can start it up from scratch without changing the code I'm testing. I lose the ability to poll any flags from my thread as soon as I call the function under test.
I know this is really not how tests are designed to work, integration tests are of limited value, etc, etc, but I was hoping to show off the value of testing and good design to a friend by gently working up to it rather than totally redesigning his software in one go.
The last big issue is that the code under test is designed to run forever, until a Ctrl-C, and I don't see any way to force it to raise an exception or otherwise kill the thread
The point of Test-Driven Development is to rethink your design so that it is testable.
Looping forever -- while seemingly fine for production use -- is untestable.
So make the loop terminable. It won't hurt production, and it will improve testability.
"Designed to run forever" is not designed for testability, so fix the design to be testable; a sketch of one minimal change follows.
I think I found a solution that does what I was looking for: Instead of using a thread, use a separate process.
I can write a small Python stub to do the mocking and run the code in a controlled way. Then I can write the actual tests to run the stub in a subprocess for each test and kill it when the test is finished. The test process can interact with the stub over stdio or a socket.
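A minimal sketch of that harness (stub_server.py is a hypothetical stub script that mocks xmlrpclib and calls do_stuff):
import subprocess
import time

proc = None

def setUp():
    global proc
    # Run the stub (and the code under test) in its own process.
    proc = subprocess.Popen(["python", "stub_server.py"])

def tearDown():
    # Unlike a thread, a whole process can be killed between tests.
    proc.terminate()
    proc.wait()

def test1():
    time.sleep(5)
    # assert on side effects observed over stdio or a socket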

How to pause and resume a thread in python

I create a thread to run a script, and it may take a long time. I want to pause and resume it from another thread. If I use a flag and poll it, the thread cannot pause immediately. I have searched a lot, but it seems that self.__flag and self.pause cannot achieve this.
class MT(threading.Thread):
    def __init__(self):
        super(MT, self).__init__()
        self.__running = threading.Event()
        self.__running.set()
        self.__flag = threading.Event()
        self.__flag.set()

    def run(self):
        '''
        run the script
        '''
        while self.__running.isSet():
            self.__flag.wait()
            moudleTest()

    def pause(self):
        '''
        pause the thread
        '''
        self.__flag.clear()

    def resume(self):
        '''
        resume the thread
        '''
        self.__flag.set()
What you want is not possible without dropping below the Python layer, using C extensions and OS-specific techniques (e.g. SuspendThread on Windows). You cannot immediately and completely suspend another thread via Python-level APIs, because doing so is considered absurdly dangerous.
Even when such a thing is possible, it's a terrible idea, prone to deadlocks and other terrible things. Just for example, pre-CPython 3.3, there was a single global import lock for the whole interpreter. If the other thread was in the middle of importing a module when it was suspended, no other thread could import at all until it was resumed and finished the import (causing a deadlock if that thread was the one responsible for resuming the suspended thread); in CPython 3.3+, it's better, but if another thread tried to import that specific module, it would deadlock just as badly.
In summary: use Locks, Events and/or Conditions appropriately, and if you need faster pauses, make the wait checks happen more often (intersperse them with the thread's "work" more regularly). If your code can't tolerate even a tiny delay before the pause, you have a design problem to fix (e.g. you're using an Event to simulate locking, possibly for performance, which is misguided: Events are built on Conditions, which are in turn built on Locks, and all but Lock are implemented at the Python layer rather than the C layer, and are therefore quite slow).
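A sketch of that cooperative approach, splitting the work into small steps so the pause Event is checked frequently (do_one_small_step is a placeholder standing in for a slice of the asker's moudleTest):
import threading

class PausableWorker(threading.Thread):
    def __init__(self):
        super(PausableWorker, self).__init__()
        self._unpaused = threading.Event()
        self._unpaused.set()
        self._stopped = threading.Event()

    def run(self):
        while not self._stopped.is_set():
            self._unpaused.wait()  # blocks here whenever paused
            do_one_small_step()    # keep each unit of work short so the
                                   # next wait() is reached quickly

    def pause(self):
        self._unpaused.clear()

    def resume(self):
        self._unpaused.set()

    def stop(self):
        self._stopped.set()
        self._unpaused.set()  # unblock wait() so the loop can exit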
