how to understand appengine ndb.tasklet? - python

From documentation:
An NDB tasklet is a piece of code that might run concurrently with
other code. If you write a tasklet, your application can use it much
like it uses an async NDB function: it calls the tasklet, which
returns a Future; later, calling the Future's get_result() method gets
the result.
The explanation and examples in the documentation seem like magic to me.
I can use it, but find it hard to understand properly.
For example:
Can I put any kind of code inside a function, decorate it with @ndb.tasklet, and then use it as an async function later? Or must it be an App Engine RPC?
Does this kind of decorator also work on my PC?
Is it the same as tasklets in PyPy?

If you look at the implementation of a Future, it's very comparable to what a generator is in Python. In fact, it uses the same yield keyword to achieve what it says it does. Read the intro comments in tasklets.py for some clarification.
When you use the @ndb.tasklet decorator, it creates a Future and waits for a value from the wrapped function. If the value is a generator, it adds the Future to the event loop. When you yield on a Future, the event loop runs through ALL queued Futures until the one you want is ready. The concurrency here is that each Future executes its code until it either returns (using raise ndb.Return(...) or by the function completing), throws an exception, or yields again. Technically, you can use yield just to stop executing that function in favor of letting the event loop run other Futures, but I would assume this wouldn't help much unless you really have a clever use-case in mind.
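The mechanics can be sketched in plain Python. This is a toy model of the tasklet pattern, not the real NDB implementation: the Future class, tasklet decorator, and add_async function below are all made up for illustration, and the "event loop" resolves each yielded future immediately instead of polling RPCs. (NDB itself targets Python 2, so it uses raise ndb.Return(...) where this Python 3 sketch uses a plain return.)

```python
# Toy model of the tasklet pattern (NOT the real NDB implementation).

class Future:
    """Holds a value that may be filled in later."""
    def __init__(self):
        self._result = None
        self._done = False

    def set_result(self, value):
        self._result = value
        self._done = True

    def done(self):
        return self._done

    def get_result(self):
        return self._result


def tasklet(gen_func):
    """Decorator: call the generator function and drive it via an
    (extremely simplified) scheduler, returning a Future for the result."""
    def wrapper(*args, **kwargs):
        outer = Future()
        gen = gen_func(*args, **kwargs)
        _step(gen, outer, None)
        return outer
    return wrapper


def _step(gen, outer, value):
    try:
        waiting_on = gen.send(value)   # run until the next `yield <Future>`
    except StopIteration as stop:      # generator returned: tasklet is done
        outer.set_result(getattr(stop, "value", None))
        return
    # In real NDB, an event loop would suspend here until the RPC behind
    # `waiting_on` completes; this sketch resolves it immediately.
    _step(gen, outer, waiting_on.get_result())


@tasklet
def add_async(a, b):
    f = Future()
    f.set_result(a + b)   # stand-in for an async RPC completing
    result = yield f      # suspend until the future is ready
    return result         # real NDB (Python 2) would: raise ndb.Return(result)


fut = add_async(2, 3)
print(fut.get_result())  # 5
```

A real tasklet differs mainly in that the yielded Future completes later, when the underlying RPC finishes, and the NDB event loop decides which suspended generator to resume next.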
To answer your questions:
Technically yes, but it will not run asynchronously. When you decorate a non-yielding function with @ndb.tasklet, its Future's value is computed and set when you call that function. That is, it runs through the entire function when you call it. If you want to achieve asynchronous operation, you must yield on something that does asynchronous work. Generally in GAE it will work its way down to an RPC call.
If by "work on your PC" you mean: does the dev appserver implement tasklets/Futures like GAE does, then yes, although this is more accurate with devappserver2 (now the default in the newer SDK). I'm actually not 100% sure whether local RPC calls will run in parallel when using Futures, but there is an event loop going through the Futures whether it's local or in production. If you want to use Futures in your other, non-GAE code, then I think you would be better off using Python 3.2's built-in concurrent.futures (or find a backport here).
Kind of; it's not really a simple comparison. Look at the documentation here. The idea is somewhat the same (the scheduler can be compared to the event loop), but the low-level implementations differ greatly.

Related

Why should concurrent.futures' "finishing methods" only be used by Executor implementations and unit tests?

The documentation of concurrent.futures.Future.set_result() says
This method should only be used by Executor implementations and unit tests.
The documentation of set_exception() and set_running_or_notify_cancel() says the same.
But I can imagine using Future objects for slightly different purposes, or in slightly different circumstances (and in Java I have done so in the past). In these cases, I'd just like to use a Future object. Why is this discouraged?
In order to become a little clearer: I talk about a quite complex sequence of events which involve two different schedulers and stuff, and a Future object could be used to cancel this process as well as notify about its completion. This framework is not an Executor (nor can it be one).
What the documentation says is that the set_result() method, not the Future object itself, should only be used by an implementation.
The rationale seems obvious to me: The result of a future should be the outcome of the work encapsulated by the Future object, but the work itself (the code that you want to run asynchronously) should not care how it is being executed. So who should be allowed to define what the result() method returns? Only the framework that schedules and launches the tasks.
In your example, you can pass Future objects around, cancel them, check their status, or retrieve the result (with result()). None of these operations requires you to use set_result(). But as you explained in the comments, your framework is not based on the Executor class (by inheritance or by implementing the Executor interface), yet it is still an Executor work-alike. Therefore, I would say using the set_result() method is in conformance with the spirit of the comments you are concerned about.
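A sketch of that spirit: start_job below is a hypothetical stand-in for such a non-Executor framework. Because it is the code that drives the work, it legitimately owns the "finishing methods" set_running_or_notify_cancel(), set_result(), and set_exception(); callers only ever consume the Future.

```python
import threading
from concurrent.futures import Future

def start_job(work, *args):
    """Hypothetical scheduler: not an Executor, but it hands out Futures
    and drives the work itself, so it plays the Executor's role."""
    fut = Future()

    def run():
        # Honour cancellation requested before the work started.
        if not fut.set_running_or_notify_cancel():
            return
        try:
            fut.set_result(work(*args))     # finishing methods belong here,
        except Exception as exc:            # in the code that runs the work
            fut.set_exception(exc)

    threading.Thread(target=run).start()
    return fut

fut = start_job(pow, 2, 10)
print(fut.result())  # 1024  (result() blocks until the job finishes)
```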
Incidentally, have you considered tweaking your framework so that it does offer a compatible implementation of the Executor interface? It would be conceptually nice and might make some things easier to maintain and work with in the future.

What are all these deprecated "loop" parameters in asyncio?

A lot of the functions in asyncio have deprecated loop parameters, scheduled to be removed in Python 3.10. Examples include as_completed(), sleep(), and wait().
I'm looking for some historical context on these parameters and their removal.
What problems did loop solve? Why would one have used it in the first place?
What was wrong with loop? Why is it being removed en masse?
What replaces loop, now that it's gone?
What problems did loop solve? Why would one have used it in the first place?
Prior to Python 3.6, asyncio.get_event_loop() was not guaranteed to return the event loop currently running when called from an asyncio coroutine or callback. It would return whatever event loop was previously set using set_event_loop(some_loop), or the one automatically created by asyncio. But sync code could easily create a different loop with another_loop = asyncio.new_event_loop() and spin it up using another_loop.run_until_complete(some_coroutine()). In this scenario, get_event_loop() called inside some_coroutine and the coroutines it awaits would return some_loop rather than another_loop. This kind of thing wouldn't occur when using asyncio casually, but it had to be accounted for by async libraries, which couldn't assume they were running under the default event loop. (For example, in tests or in some usages involving threads, one might want to spin up an event loop without disturbing the global setting with set_event_loop.) The libraries would offer an explicit loop argument, to which you'd pass another_loop in the above case, and which you'd use whenever the running loop differed from the one set up with asyncio.set_event_loop().
This issue was fixed in Python 3.6 and 3.5.3, where get_event_loop() was modified to reliably return the running loop if called from inside one, returning another_loop in the above scenario. Python 3.7 additionally introduced get_running_loop(), which completely ignores the global setting and always returns the currently running loop, raising an exception if there isn't one. See this thread for the original discussion.
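A small demonstration of the fixed behaviour (assuming Python 3.7+ for get_running_loop()): even though another_loop was never installed with set_event_loop(), both accessors find it from inside a coroutine it is running.

```python
import asyncio

async def report():
    # Since 3.5.3/3.6, get_event_loop() returns the *running* loop when
    # called from a coroutine; get_running_loop() (3.7+) always does, and
    # raises RuntimeError if no loop is running at all.
    return asyncio.get_event_loop(), asyncio.get_running_loop()

another_loop = asyncio.new_event_loop()   # never set as "the" global loop
try:
    seen, running = another_loop.run_until_complete(report())
    assert seen is another_loop
    assert running is another_loop
finally:
    another_loop.close()
```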
Once get_event_loop() became reliable, another problem was that of performance. Since the event loop was needed for some very frequently used calls, most notably call_soon, it was simply more efficient to pass around and cache the loop object. Asyncio itself did that, and many libraries followed suit. Eventually get_event_loop() was accelerated in C and was no longer a bottleneck.
These two changes made the loop arguments redundant.
What was wrong with loop? Why is it being removed en masse?
As any other redundancy, it complicates the API and opens up possibilities for errors. Async code should almost never just randomly communicate with a different loop, and now that get_event_loop() is both correct and fast, there is no reason not to use it.
Also, passing the loop through all the layers of abstraction of a typical application is simply tedious. With async/await becoming mainstream in other languages, it has become clear that manually propagating a global object is not ergonomic and should not be required from programmers.
What replaces loop, now that it's gone?
Just use get_event_loop() to get the loop when you need it. Alternatively, you can use get_running_loop() to assert that a loop is running.
The need for accessing the event loop is somewhat reduced in Python 3.7, as some functions that were previously only available as methods on the loop, such as create_task, are now available as stand-alone functions.
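For example, under Python 3.7+ no loop object ever needs to be named explicitly (work and main here are made-up names): asyncio.run() manages the loop, and asyncio.create_task() finds the running loop by itself.

```python
import asyncio

async def work(n):
    await asyncio.sleep(0)   # yield to the event loop once
    return n * 2

async def main():
    # Module-level create_task (3.7+): no loop parameter anywhere.
    task = asyncio.create_task(work(21))
    return await task

result = asyncio.run(main())  # run() creates and closes the loop for us
print(result)  # 42
```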
The loop parameter was the way to pass the global event loop around. New implementations of the same functions no longer require you to pass the global event loop, they instead just request it where it's needed.
As the documentation (https://docs.python.org/3/library/asyncio-eventloop.html) suggests: "Application developers should typically use the high-level asyncio functions, such as asyncio.run(), and should rarely need to reference the loop object or call its methods."
Removing the need for you to pass it around to library functions aligns with that principle. The loop is not replaced, but its disappearance simply means you no longer have to deal with it 'manually'.

How can I learn how to implement a custom Python asyncio event loop?

I’m looking into implementing a new event loop to plug into asyncio, based on existing run loop implementations such as Cocoa’s NSRunLoop and Qt’s QEventLoop, but find it difficult to pick a place to start.
The documentation says that the system is designed to be pluggable, but nowhere does it say exactly how this can be done. Should I start with AbstractEventLoop, or BaseEventLoop? What method does what, and what components do I need to provide? The only alternative implementation I find useful is uvloop, but find it difficult to understand because it relies heavily on Cython and libuv, which I am not familiar with.
Is there some kind of a write-up on how the event loop implementation is done, and how a custom one can be made? Or a less involved implementation I can wrap my head around more quickly? Thanks for any pointers.
The documentation says to inherit from AbstractEventLoop.
For the rest of your question, I didn't find the documentation very clear, but the source code for the concrete event loop in asyncio was helpful. I've written up a pretty minimal example of inheriting from AbstractEventLoop to create an event-driven simulator.
The main things that I'd have liked to be told are
Implement create_task. The end-user schedules a coroutine using asyncio.ensure_future(coro()), but that just calls your loop's create_task method. It doesn't need to be anything more than

def create_task(self, coro):
    return asyncio.Task(coro, loop=self)
Implement call_soon, call_at and call_later. These are invoked by the end-user to schedule a plain callback function. They are also invoked by the async/await system automatically, whenever the end-user schedules a coroutine.
If a regular callback raises an exception, it goes to your loop's call_exception_handler method. If a coroutine raises an exception, the exception lives in some asynchronous never-never land, and you have to catch it there.
Look up the source code for AbstractEventLoop to see all the other methods that you should be overriding. Bonus: somewhat helpful comments.
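To make the list above concrete, here is a deliberately tiny sketch of such a loop: just enough to drive plain coroutines and timers, nothing more. It leans on a couple of undocumented internals (Handle._run(), events._set_running_loop()), as custom loops in practice tend to; treat it as an illustration of the shape, not production code.

```python
import asyncio
import heapq
import time
from collections import deque

class TrivialLoop(asyncio.AbstractEventLoop):
    """Minimal illustrative loop: immediate callbacks plus monotonic timers."""

    def __init__(self):
        self._ready = deque()     # handles ready to run now
        self._scheduled = []      # heap of TimerHandles, ordered by .when()
        self._running = False

    # --- methods the asyncio machinery actually calls ---

    def get_debug(self):
        return False

    def time(self):
        return time.monotonic()

    def create_task(self, coro):
        return asyncio.Task(coro, loop=self)

    def create_future(self):
        return asyncio.Future(loop=self)

    def call_soon(self, callback, *args, context=None):
        handle = asyncio.Handle(callback, args, self, context)
        self._ready.append(handle)
        return handle

    def call_later(self, delay, callback, *args, context=None):
        return self.call_at(self.time() + delay, callback, *args, context=context)

    def call_at(self, when, callback, *args, context=None):
        timer = asyncio.TimerHandle(when, callback, args, self, context)
        heapq.heappush(self._scheduled, timer)
        return timer

    def stop(self):
        self._running = False

    def run_until_complete(self, future):
        task = asyncio.ensure_future(future, loop=self)  # routes via create_task
        task.add_done_callback(lambda _: self.stop())
        asyncio.events._set_running_loop(self)  # private, but BaseEventLoop does it too
        try:
            self._running = True
            while self._running:
                self._run_once()
        finally:
            asyncio.events._set_running_loop(None)
        return task.result()

    def _run_once(self):
        now = self.time()
        # Move due timers onto the ready queue.
        while self._scheduled and self._scheduled[0].when() <= now:
            self._ready.append(heapq.heappop(self._scheduled))
        if not self._ready:
            if not self._scheduled:
                raise RuntimeError("nothing to run, but the task is not done")
            time.sleep(max(0.0, self._scheduled[0].when() - now))
            return
        for _ in range(len(self._ready)):
            self._ready.popleft()._run()  # Handle._run invokes the callback


async def main(loop):
    fut = loop.create_future()
    loop.call_later(0.01, fut.set_result, "done")
    return await fut

loop = TrivialLoop()
print(loop.run_until_complete(main(loop)))  # done
```

Note that everything flows through call_soon: creating a Task schedules its first step there, and resolving a Future schedules its done-callbacks there, which is why those few methods are enough to get coroutines running.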

Why do we need gevent.queue?

My understanding of Gevent is that it's merely concurrency, not parallelism. My understanding of concurrency mechanisms like Gevent and asyncio is that nothing in the Python application ever executes at the same time.
The closest you get is calling a non-blocking IO method, and while waiting for that call to return, other methods within the Python application can execute. Again, no two methods within the Python application ever actually execute Python code at the same time.
With that said, why is there a need for gevent.queue? It sounds to me like the Python application doesn't really need to worry about more than one Python method accessing a queue instance at a time.
I'm sure there's a scenario that I'm not seeing that gevent.queue fixes, I'm just curious what that is.
Although you are right that no two statements execute at the same time within a single Python process, you might want to ensure that a series of statements execute atomically, or you might want to impose an order on the execution of certain statements, and in that case things like gevent.queue become useful. A tutorial is here.
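Since the same reasoning applies to asyncio (which the question itself compares to Gevent), the ordering point can be illustrated with asyncio.Queue: two coroutines interleave at unspecified yield points, but the queue imposes a well-defined hand-off order between them. (gevent.queue.Queue plays the analogous role between greenlets.)

```python
import asyncio

async def producer(q):
    for i in range(3):
        await q.put(i)   # hand items over in a defined order
    await q.put(None)    # sentinel: no more items

async def consumer(q, out):
    while True:
        item = await q.get()
        if item is None:
            break
        out.append(item)

async def main():
    q = asyncio.Queue()
    out = []
    # The two coroutines run concurrently (never in parallel), yet the
    # queue guarantees the consumer sees items in production order.
    await asyncio.gather(producer(q), consumer(q, out))
    return out

print(asyncio.run(main()))  # [0, 1, 2]
```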

Tulip/asyncIO: why not all calls be async and specify when things should be synchronous?

I went to the SF Python meetup where Guido talked about Tulip, the future asyncio library for asynchronous operations in Python.
The takeaway is that if you want something to run asynchronously, you can use "yield from" plus an expression and a couple of decorators to specify that what comes after yield from should be executed asynchronously. The nice thing about it is that you can read the statements in that function normally (as if they were synchronous), and it behaves as if it were synchronous with respect to the execution of that function (return values and error/exception propagation and handling).
My question is: why not have the opposite behavior, namely, have all function calls be by default async (and without the yield from) and have a different explicit syntax when you want to execute something synchronously?
(besides the need for another keyword/syntax spec)
The real answer is that Guido likes the fact that asynchronous yield points are explicit in coroutines, because if you don't realize that a call can yield, then that's an invitation to concurrency problems -- like with threads. But if you have to write an explicit yield from, it's fairly easy to make sure it doesn't land in the middle of two critical operations that should appear atomic to the rest of the code.
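The point about explicit yield points can be demonstrated directly with modern asyncio (the increment functions here are made up for illustration): an await between a read and a write of shared state invites lost updates, while a section containing no await is atomic with respect to other tasks.

```python
import asyncio

counter = 0

async def unsafe_increment():
    global counter
    current = counter
    await asyncio.sleep(0)   # explicit yield point *inside* the critical section
    counter = current + 1    # other tasks ran in between: updates are lost

async def safe_increment():
    global counter
    counter += 1             # no await between read and write: atomic w.r.t. tasks

async def main():
    global counter
    counter = 0
    await asyncio.gather(*(unsafe_increment() for _ in range(10)))
    lost = counter           # every task read 0 before any wrote, so counter == 1
    counter = 0
    await asyncio.gather(*(safe_increment() for _ in range(10)))
    return lost, counter

print(asyncio.run(main()))  # (1, 10)
```

Because the yield point is spelled out as an await (or a yield from), it is easy to audit that none lands between two operations that must appear atomic, which is exactly the property Guido was defending.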
As he mentions in his PyCon 2013 keynote, there are other Python async frameworks like Gevent, which are async by default, and he doesn't like that approach. (at 11:58):
And unfortunately you're still not completely clear of the problem
that the scheduler could at a random moment interrupt your task and
switch to a different one. [...] Any function that you call today that
you happen to know that it never switches, tomorrow someone could add
a logging statement or a lazy caching or a consulting of a settings
file. [...]
Note that the possible uses of yield from are a small part of the async PEP, and never need to be used. Maybe Guido oversold them in his talk ;-)
As to why functions aren't being changed to always be async by default, that's just realism. Async gimmicks bring new overheads and semantic complications, and Python isn't going to slow down and complicate life for everyone just to make a few applications easier to write.
In short, "practicality beats purity" ;-)
