context switching with 'yield' - python

I was reading a gevent tutorial and saw this interesting snippet:
import gevent

def foo():
    print('Running in foo')
    gevent.sleep(0)
    print('Explicit context switch to foo again')

def bar():
    print('Explicit context to bar')
    gevent.sleep(0)
    print('Implicit context switch back to bar')

gevent.joinall([
    gevent.spawn(foo),
    gevent.spawn(bar),
])
The flow of execution goes foo -> bar -> foo -> bar.
Is it not possible to do the same without the gevent module, using only yield statements?
I've been trying to do this with yield, but for some reason I can't get it to work... :(

Generators used for this purpose are often called tasks (among many other terms), and I'll use that term here for clarity. Yes, it is possible. There are, in fact, several approaches that work and make sense in some contexts. However, none (that I'm aware of) work without an equivalent for at least one of gevent.spawn and gevent.joinall. The more powerful and well-designed ones require an equivalent for both.
The fundamental problem is this: Generators can be suspended (when they hit yield), but that's it. To kick them off again, you need some other code calling next() on them. In fact, you even need to call next() on a freshly-created generator for it to do anything to begin with.
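For example (a minimal illustration, not code from the question), a freshly created generator is inert until something advances it:

def task():
    print('started')
    yield
    print('resumed')

t = task()   # creates the generator; none of its body has run yet
next(t)      # prints 'started' and runs up to the first yield
try:
    next(t)  # prints 'resumed'; the generator then finishes
except StopIteration:
    pass     # a completed task signals its end this way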
Similarly, the generator itself isn't the best place to decide what should run next. So you need a loop that initiates each task's time slice (runs it to the next yield) and switches between tasks, indefinitely. This is usually called a scheduler. Schedulers tend to become really hairy really quickly, so I won't attempt to write a full one in a single answer. There are, however, some core concepts I can try to explain:
One usually treats yield as giving control back to the scheduler (in effect similar to gevent.sleep(0) in your code). That means the generator does whatever it wants to do, and when it's at a place where a context switch is convenient and possibly useful, it yields.
In Python 3.3+, yield from is a very useful tool to delegate to another generator. If you can't use it, you have to make the scheduler emulate a call stack and route return values to the right place, and do things like result = yield subtasks() in your tasks. This is slower, more complex to implement, and unlikely to yield useful stack traces (yield from does this for free). But until recently, it was the best we had.
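For instance, here is a minimal sketch (my own, assuming Python 3.3+) of how yield from transparently forwards a subtask's context switches to whatever is driving the generator, and routes its return value back:

def subtask():
    yield          # a context-switch point, seen by the driver/scheduler
    return 42      # becomes the value of the yield from expression below

def task():
    result = yield from subtask()   # delegates; no manual routing needed
    print('subtask returned', result)

# Driving it by hand, the way a scheduler would:
t = task()
try:
    while True:
        next(t)
except StopIteration:
    pass   # prints 'subtask returned 42' just before the task finishes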
Depending on your use case, you may need a wide range of tools to manage tasks. Common examples are spawning more tasks, waiting for a task to complete, waiting for any one of several tasks to complete, detecting failure (uncaught exception) of other tasks, etc. These are usually handled by the scheduler and the tasks are given an API to communicate with the scheduler. A neat (but not always perfect) way to do this communication is yielding special values.
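As a rough sketch of that idea (the ('spawn', generator) convention below is invented for illustration, not any standard API), a scheduler can inspect what each task yields and treat certain values as requests:

from collections import deque

def scheduler(*initial):
    tasks = deque(initial)
    while tasks:
        task = tasks.popleft()
        try:
            request = next(task)
        except StopIteration:
            continue                       # the task finished; drop it
        if isinstance(request, tuple) and request[0] == 'spawn':
            tasks.append(request[1])       # schedule the requested new task
        tasks.append(task)                 # requeue the current task

def child():
    print('child running')
    yield

def parent():
    yield ('spawn', child())   # ask the scheduler to start another task
    print('parent running again')

scheduler(parent())
# Output: child running / parent running again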
One rather important difference between generator-based tasks and gevent (and similar libraries) is that context switches in the latter are implicit, while tasks make it trivial to identify context switches: only things that yield [from] can possibly run scheduler code. For example, you can tell whether a piece of code is atomic (w.r.t. other tasks; if you add threads to the mix, you have to worry about those separately) just by looking at the code, without inspecting anything it calls.
Finally, you may be interested in Greg Ewing's tutorial on creating such a scheduler. (This came up on python-ideas while brainstorming over what is now PEP 3156. Those mail threads may also be of interest to you, though the web-based archive is not really suited to reading hundreds of mails in dozens of threads written half a year ago.)

The key is to realise that you will have to provide your own driving loop; I have provided a simple demo below. I was lazy and used a Queue object to provide a FIFO, as I haven't used Python for a significant project for a while.
#!/usr/bin/python
import Queue

def foo():
    print('Constructing foo')
    yield
    print('Running in foo')
    yield
    print('Explicit context switch to foo again')

def bar():
    print('Constructing bar')
    yield
    print('Explicit context to bar')
    yield
    print('Implicit context switch back to bar')

def trampoline(taskq):
    while not taskq.empty():
        task = taskq.get()
        try:
            task.next()
            taskq.put(task)
        except StopIteration:
            pass

tasks = Queue.Queue()
tasks.put(foo())
tasks.put(bar())
trampoline(tasks)
print('Finished')
And when run:
$ ./coroutines.py
Constructing foo
Constructing bar
Running in foo
Explicit context to bar
Explicit context switch to foo again
Implicit context switch back to bar
Finished

Related

twisted: processing incoming events in synchronous code

Suppose there's a synchronous function in a Twisted-powered Python program that takes a long time to execute, doing its work in a lot of reasonably sized pieces. If the function could return deferreds, this would be a no-brainer; however, the function happens to sit deep inside some synchronous code, so yielding deferreds to continue is impossible.
Is there a way to let twisted handle outstanding events without leaving that function? I.e. what I want to do is something along the lines of
def my_func():
    results = []
    for item in a_lot_of_items():
        results.append(do_computation(item))
        reactor.process_outstanding_events()
    return results
Of course, this imposes reentrancy requirements on the code, but still, there's QCoreApplication.processEvents for that in Qt, is there anything in twisted?
The solution taken by some event-loop-based systems (essentially the solution you're referencing via Qt's QCoreApplication.processEvents API) is to make the main loop re-entrant. In Twisted terms, this would mean something like (not working code):
def my_expensive_task_that_cannot_be_asynchronous():
    @inlineCallbacks
    def do_work(units):
        for unit in units:
            yield do_one_work_asynchronously(unit)
    work = do_work(some_work_units())
    work.addBoth(lambda ignored: reactor.stop())
    reactor.run()

def main():
    # Whatever your setup is...
    # Then, hypothetical event source triggering your
    # expensive function:
    reactor.callLater(
        30,
        my_expensive_task_that_cannot_be_asynchronous,
    )
    reactor.run()
Notice that there are two reactor.run calls in this program. If Twisted had a re-entrant event loop, the second call would start spinning the reactor again and not return until a matching reactor.stop call was encountered. The reactor would process all events it knows about, not just the ones generated by do_work, and so you would get the behavior you desire.
This requires a re-entrant event loop because my_expensive_task_... is already being called by the reactor loop. The reactor loop is on the call stack. Then, reactor.run is called and the reactor loop is now on the call stack again. So the usual issues apply: the event loop cannot have left over state in its frame (otherwise it may be invalid by the time the nested call is complete), it cannot leave its instance state inconsistent during any calls out to other code, etc.
Twisted does not have a re-entrant event loop. This is a feature that has been considered and, at least in the past, explicitly rejected. Supporting this feature would bring a huge amount of additional complexity (described above) to the implementation and the application. If the event loop is re-entrant, it becomes very difficult to avoid requiring all application code to be re-entrant-safe as well, which negates one of the major benefits of the cooperative multitasking approach Twisted takes to concurrency (that you are guaranteed your functions will not be re-entered).
So, when using Twisted, this solution is out.
I'm not aware of another solution which would allow you to continue to run this code in the reactor thread. You mentioned that the code in question is nested deeply within some other synchronous code. The other options that come to mind are:
make the synchronous code capable of dealing with asynchronous things
factor the expensive parts out and compute them first, then pass the result in to the rest of the code
run all of that code, not just the computationally expensive part, in another thread
You could use deferToThread.
http://twistedmatrix.com/documents/13.2.0/core/howto/threading.html
That method runs your calculation in a separate thread and returns a Deferred that fires when the calculation has finished.
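For illustration, a minimal sketch of that pattern (do_heavy_computation and the numbers are stand-ins, not code from the question):

from twisted.internet import reactor
from twisted.internet.threads import deferToThread

def do_heavy_computation(item):
    # stand-in for the expensive, blocking work
    return item * item

def on_result(result):
    print('computed:', result)
    reactor.stop()

d = deferToThread(do_heavy_computation, 7)  # runs in the reactor's thread pool
d.addCallback(on_result)
reactor.run()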
The issue is that if do_heavy_computation() blocks, execution won't move on to the next function. In that case, use deferToThread or blockingCallFromThread for heavy calculations. If you don't care about the results of the calculation, you can use callInThread. Take a look at the documentation on threads.
This should do:
for item in items:
    reactor.callLater(0, heavy_func, item)
reactor.callLater should bring you back into the event loop.
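A runnable version of that sketch might look like this (heavy_func and the item list are placeholders I've made up; the final callLater just ends the demo):

from twisted.internet import reactor

def heavy_func(item):
    # stand-in for one reasonably sized piece of work
    print('processed', item)

for item in range(5):
    reactor.callLater(0, heavy_func, item)   # queue each piece as its own event

reactor.callLater(1, reactor.stop)           # stop the demo loop afterwards
reactor.run()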

Unit-testing a periodic coroutine with mock time

I'm using Tornado as a coroutine engine for a periodic process, where the repeating coroutine calls ioloop.call_later() on itself at the end of each execution. I'm now trying to drive this with unit tests (using Tornado's gen_test) where I'm mocking the ioloop's time with a local variable t:
DUT.ioloop.time = mock.Mock(side_effect=lambda: t)
(DUT <==> Device Under Test)
Then in the test, I manually increment t, and yield gen.moment to kick the ioloop. The idea is to trigger the repeating coroutine after various intervals so I can verify its behaviour.
But the coroutine doesn't always trigger - or perhaps it yields back to the testing code before completing execution, causing failures.
I think I should be using stop() and wait() to synchronise the test code, but I can't see concretely how to use them in this situation. And how does this whole testing strategy work if the DUT runs in its own ioloop?
In general, using yield gen.moment to trigger specific events is dicey; there are no guarantees about how many "moments" you must wait, or in what order the triggered events occur. It's better to make sure that the function being tested has some effect that can be asynchronously waited for (if it doesn't have such an effect naturally, you can use a tornado.locks.Condition).
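As a rough sketch of that suggestion (all names here are invented for illustration; the periodic coroutine is reduced to a stub), the code under test can notify a Condition and the test can wait on it instead of yielding gen.moment:

from tornado import gen, locks
from tornado.testing import AsyncTestCase, gen_test

class PeriodicTest(AsyncTestCase):
    @gen_test
    def test_fires(self):
        fired = locks.Condition()

        @gen.coroutine
        def periodic():
            # stand-in for the repeating coroutine under test
            fired.notify()

        self.io_loop.call_later(0.01, periodic)
        yield fired.wait()   # resumes once the coroutine has actually run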
There are also subtleties to patching IOLoop.time. I think it will work with the default Tornado IOLoops (where it is possible without the use of mock: pass a time_func argument when constructing the loop), but it won't have the desired effect with e.g. AsyncIOLoop.
I don't think you want to use AsyncTestCase.stop and .wait, but it's not clear how your test is set up.

`eventlet.spawn` doesn't work as expected

I'm writing a web UI for data analysis tasks.
Here's the way it's supposed to work:
After a user specifies parameters like dataset and learning rate, I create a new task record, then an executor for this task is started asynchronously (the executor may take a long time to run), and the user is redirected to some other page.
After searching for an async library for python, I started with eventlet, here's what I wrote in a flask view function:
db.save(task)
eventlet.spawn(executor, task)
return redirect("/show_tasks")
With the code above, the executor didn't execute at all.
What may be the problem of my code? Or maybe I should try something else?
While you have been given direct solutions, I will try to answer your first question and explain why your code does not work as expected.
Disclosure: I currently maintain Eventlet. This answer contains a number of simplifications to keep it to a reasonable size.
Brief introduction to cooperative multithreading
There are two ways to do multithreading, and Eventlet uses the cooperative approach. At its core is the Greenlet library, which basically allows you to create independent "execution contexts". One can think of such a context as the frozen state of all local variables plus a pointer to the next instruction. Basically, multithreading = contexts + scheduler. Greenlet provides the contexts, so we need a scheduler: something that decides which context should occupy the CPU right now. It turns out that making those decisions requires running some code too, which means a separate context (green thread). This special green thread is called a Hub in the Eventlet code base. The scheduler maintains an ordered set of contexts that need to run ASAP (the run queue) and a set of contexts that are waiting for something (e.g. network IO or a time-limited sleep) to finish.
But since we are doing cooperative multitasking, one context will execute indefinitely unless it explicitly yields to another. That would be a very sad style of programming, and also by definition incompatible with existing libraries (pointing at they-know-who); so what Eventlet does is provide green versions of common modules, changed in such a way that they switch to the Hub instead of blocking everything. Time may then be spent in other green threads or in the Hub's wait-for-external-events implementation, in which case the Hub switches back to the green thread that originated the event, and it continues execution.
End of introduction. Now back to your problem.
What eventlet.spawn actually does: it creates a new execution context (basically, it allocates an object in memory) and tells the scheduler to put this context into the run queue, so at the first possible moment the Hub will switch to the newly spawned function. Your code does not provide such a moment. There is no place where you explicitly give up execution to other green threads; for Eventlet this is usually done via eventlet.sleep(). And since you don't use green versions of common modules, there is no chance to yield implicitly while other code waits. The most appropriate (if not the only) place would be your WSGI server's accept loop: it should give other green threads a chance to run while waiting for the next request. The eventlet.monkey_patch() mentioned in the first answer is just a convenient way to replace all (or a subset of) common modules with their corresponding green versions.
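To make this concrete, a minimal sketch (not the questioner's code): an explicit eventlet.sleep(0) provides exactly such a moment, and the spawned function runs:

import eventlet

def executor():
    print('executor is running')

eventlet.spawn(executor)
print('spawned, but executor has not run yet')
eventlet.sleep(0)   # explicitly yield to the Hub; the spawned task runs now
print('back in the main green thread')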
Unwanted opinion on overall design
In a separate section, so it can be skipped easily. If you are building error-resistant software, you usually want to limit the execution time of spawned threads (including but not limited to green ones) and processes, and at least report (log) or react to their unhandled errors. In the provided code, your spawned green thread may technically run at the next moment or five minutes later (again, because nobody yields the CPU), or fail with an unhandled exception. Luckily, Eventlet provides solutions for both problems: Timeout and with_timeout() let you limit waiting time (remember, if a thread does not yield, you can't possibly limit it), and GreenThread.link() lets you catch all exceptions. It may be tempting (it was for me) to reraise exceptions in the "main" code, and link() allows that easily, but consider that the exceptions would be raised from sleep and IO calls - the places where you yield to the Hub. This may produce some really counter-intuitive tracebacks.
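A small sketch of the timeout half (using eventlet.with_timeout; the slow function is a stand-in of my own):

import eventlet

def slow():
    eventlet.sleep(10)   # pretend work; sleeping yields to the Hub
    return 'done'

# If slow() does not finish within 0.1s, return timeout_value
# instead of raising eventlet.Timeout.
result = eventlet.with_timeout(0.1, slow, timeout_value='gave up')
print(result)   # prints 'gave up'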
You'll need to patch some system libraries in order to make eventlet work. Here is a minimal working example (also as gist):
#!/usr/bin/env python
from flask import Flask
import time
import eventlet
eventlet.monkey_patch()

app = Flask(__name__)
app.debug = True

def background():
    """ do something in the background """
    print('[background] working in the background...')
    time.sleep(2)
    print('[background] done.')
    return 42

def callback(gt, *args, **kwargs):
    """ this function is called when results are available """
    result = gt.wait()
    print("[cb] %s" % result)

@app.route('/')
def index():
    greenth = eventlet.spawn(background)
    greenth.link(callback)
    return "Hello World"

if __name__ == '__main__':
    app.run()
More on that:
http://eventlet.net/doc/patching.html#monkey-patch
One of the challenges of writing a library like Eventlet is that the built-in networking libraries don’t natively support the sort of cooperative yielding that we need.
Eventlet may indeed be suitable for your purposes, but it doesn't just fit in with any old application; Eventlet requires that it be in control of all your application's I/O.
You may be able to get away with either
Starting Eventlet's main loop in another thread, or even
Not using Eventlet and just spawning your task in another thread.
Celery may be another option.

Twisted: Making code non-blocking

I'm a bit puzzled about how to write asynchronous code in Python/Twisted. Suppose (for argument's sake) I am exposing a function to the world that will take a number and return True/False depending on whether it is prime/non-prime, so it looks vaguely like this:
def IsPrime(numberin):
    for n in range(2, numberin):
        if numberin % n == 0:
            return False
    return True
(just to illustrate).
Now let's say there is a webserver which needs to call IsPrime based on a submitted value. This will take a long time for a large numberin.
If in the meantime another user asks for the primality of a small number, is there a way to run the two function calls asynchronously using the reactor/deferreds architecture so that the result of the short calc gets returned before the result of the long calc?
I understand how to do this if the IsPrime functionality came from some other webserver to which my webserver would do a deferred getPage, but what if it's just a local function?
i.e., can Twisted somehow time-share between the two calls to IsPrime, or would that require an explicit invocation of a new thread?
Or, would the IsPrime loop need to be chunked into a series of smaller loops so that control can be passed back to the reactor rapidly?
Or something else?
I think your current understanding is basically correct. Twisted is just a Python library and the Python code you write to use it executes normally as you would expect Python code to: if you have only a single thread (and a single process), then only one thing happens at a time. Almost no APIs provided by Twisted create new threads or processes, so in the normal course of things your code runs sequentially; isPrime cannot execute a second time until after it has finished executing the first time.
Still considering just a single thread (and a single process), all of the "concurrency" or "parallelism" of Twisted comes from the fact that instead of doing blocking network I/O (and certain other blocking operations), Twisted provides tools for performing the operation in a non-blocking way. This lets your program continue on to perform other work when it might otherwise have been stuck doing nothing waiting for a blocking I/O operation (such as reading from or writing to a socket) to complete.
It is possible to make things "asynchronous" by splitting them into small chunks and letting event handlers run in between these chunks. This is sometimes a useful approach, if the transformation doesn't make the code too much more difficult to understand and maintain. Twisted provides a helper for scheduling these chunks of work, cooperate. It is beneficial to use this helper since it can make scheduling decisions based on all of the different sources of work and ensure that there is time left over to service event sources without significant additional latency (in other words, the more jobs you add to it, the less time each job will get, so that the reactor can keep doing its job).
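To illustrate (the chunked, generator-based version of the prime test is my own sketch, not from the question), cooperate consumes an iterator a little at a time, letting the reactor run between chunks:

from twisted.internet import reactor
from twisted.internet.task import cooperate

def isPrimeChunked(numberin, done):
    # Each yield is a point where the reactor (and other
    # cooperate() jobs) get a chance to run.
    for n in range(2, numberin):
        if numberin % n == 0:
            done(False)
            return
        yield
    done(True)

def report(result):
    print('prime?', result)
    reactor.stop()

task = cooperate(isPrimeChunked(100003, report))   # schedules the chunks
reactor.run()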
Twisted does also provide several APIs for dealing with threads and processes. These can be useful if it is not obvious how to break a job into chunks. You can use deferToThread to run a (thread-safe!) function in a thread pool. Conveniently, this API returns a Deferred which will eventually fire with the return value of the function (or with a Failure if the function raises an exception). These Deferreds look like any other, and as far as the code using them is concerned, it could just as well come back from a call like getPage - a function that uses no extra threads, just non-blocking I/O and event handlers.
Since Python isn't ideally suited for running multiple CPU-bound threads in a single process, Twisted also provides a non-blocking API for launching and communicating with child processes. You can offload calculations to such processes to take advantage of additional CPUs or cores without worrying about the GIL slowing you down, something that neither the chunking strategy nor the threading approach offers. The lowest level API for dealing with such processes is reactor.spawnProcess. There is also Ampoule, a package which will manage a process pool for you and provides an analog to deferToThread for processes, deferToAMPProcess.
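For a taste of the process-based route, here is a hedged sketch using twisted.internet.utils.getProcessOutput (a simpler convenience than Ampoule's pool; the inline primality check is my own illustration):

import sys
from twisted.internet import reactor
from twisted.internet.utils import getProcessOutput

def report(output):
    # output is the child's stdout, as bytes
    print('child said:', output.strip())

# Push the CPU-bound check into a separate Python interpreter,
# so it can use another core without contending for our GIL.
d = getProcessOutput(
    sys.executable,
    ['-c', 'print(all(100003 % n for n in range(2, 317)))'],
)
d.addCallback(report)
d.addBoth(lambda ignored: reactor.stop())
reactor.run()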

Python - How can I make this code asynchronous?

Here's some code that illustrates my problem:
def blocking1():
    while True:
        yield 'first blocking function example'

def blocking2():
    while True:
        yield 'second blocking function example'

for i in blocking1():
    print 'this will be shown'

for i in blocking2():
    print 'this will not be shown'
I have two functions which contain while True loops. These will yield data which I will then log somewhere (most likely, to an sqlite database).
I've been playing around with threading and have gotten it working. However, I don't really like it... What I would like to do is make my blocking functions asynchronous. Something like:
def blocking1(callback):
    while True:
        callback('first blocking function example')

def blocking2(callback):
    while True:
        callback('second blocking function example')

def log(data):
    print data

blocking1(log)
blocking2(log)
How can I achieve this in Python? I've seen the standard library comes with asyncore and the big name in this game is Twisted but both of these seem to be used for socket IO.
How can I async my non-socket related, blocking functions?
A blocking function is a function which doesn't return, but still leaves your process idle - unable to complete more work.
You're asking us to make your blocking functions non-blocking. However – unless you're writing an operating system – you don't have any blocking functions. You might have functions which block because they make calls to blocking system calls, or you might have functions which "block" because they do a lot of computation.
Making the former type of function non-blocking is impossible without making the underlying system call non-blocking. Depending on what that system call is, it may be difficult to make it non-blocking without also adding an event loop to your program; you don't just need to make the call and have it not block, you also need another call to find out when the result is ready and to deliver it somewhere your code can pick it up.
The answer to this question is a very long Python program and a lot of explanations of different OS interfaces and how they work, but luckily I already wrote that answer on a different site; I called it Twisted. If your particular task is already supported by a Twisted reactor, then you're in luck. Otherwise, as long as your task maps to some existing operating system concept, you can extend a reactor to support it. Practically speaking there are only two such mechanisms: file descriptors on every sensible operating system ever, and I/O Completion Ports on Windows.
In the other case, if your functions are consuming a lot of CPU, and therefore not returning, they're not really blocking; your process is still chugging along and getting work done. There are three ways to deal with that:
separate threads
separate processes
if you have an event loop, write a task that periodically yields, by writing the task in such a way that it does some work, then asks the event loop to resume it in the near future in order to allow other tasks to run.
In Twisted this last technique can be accomplished in various ways, but here's a syntactically convenient trick that makes it easy:
from twisted.internet import reactor
from twisted.internet.task import deferLater
from twisted.internet.defer import inlineCallbacks, returnValue

@inlineCallbacks
def slowButSteady():
    result = SomeResult()
    for something in somethingElse:
        result.workHardForAMoment(something)
        yield deferLater(reactor, 0, lambda: None)
    returnValue(result)
You can use generators for cooperative multitasking, but you have to write your own main loop that passes control between them.
Here's a (very simple) example using your example above:
def blocking1():
    while True:
        yield 'first blocking function example'

def blocking2():
    while True:
        yield 'second blocking function example'

tasks = [blocking1(), blocking2()]

# Repeat until all tasks have stopped
while tasks:
    # Iterate through all current tasks. Use
    # tasks[:] to copy the list because we
    # might mutate it.
    for t in tasks[:]:
        try:
            print t.next()
        except StopIteration:
            # If the generator stops, remove it from the task list
            tasks.remove(t)
You could further improve it by allowing the generators to yield new generators, which then could be added to tasks, but hopefully this simplified example will give the general idea.
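A rough sketch of that improvement (my own illustration, in the same Python 2 style as above): treat any yielded generator as a request to schedule it as a new task:

import types

def blocking3():
    yield 'spawned task example'

def spawner():
    yield blocking3()   # yielding a generator spawns it as a new task
    yield 'spawner is still running'

tasks = [spawner()]
while tasks:
    for t in tasks[:]:
        try:
            value = t.next()
        except StopIteration:
            tasks.remove(t)
            continue
        if isinstance(value, types.GeneratorType):
            tasks.append(value)   # schedule the new task
        else:
            print value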
The Twisted framework is not just for sockets. It has asynchronous adapters for many scenarios, including interacting with subprocesses. I recommend taking a closer look at it; it does what you are trying to do.
If you don't want to use full OS threading, you might try Stackless, which is a variant of Python that adds many interesting features, including "microthreads". There are a number of good examples that you will find helpful.
Your code isn't blocking. blocking1() and its brother return iterators immediately (without blocking), and no single iteration blocks either (in your case).
If you want to "eat" from both iterators one by one, don't make your program consume blocking1() entirely before continuing:
from itertools import izip  # Python 2's zip() is eager and would never return here

for b1, b2 in izip(blocking1(), blocking2()):
    print 'this will be shown', b1, 'and this, too', b2
