Why do we need gevent.queue?

My understanding of gevent is that it provides concurrency, not parallelism. My understanding of concurrency mechanisms like gevent and asyncio is that nothing in the Python application ever executes at the same time.
The closest you get is calling a non-blocking I/O method and, while waiting for that call to return, letting other methods within the Python application run. Again, none of the methods within the Python application ever actually execute Python code at the same time.
With that said, why is there a need for gevent.queue? It sounds to me like the Python application doesn't really need to worry about more than one Python method accessing a queue instance at a time.
I'm sure there's a scenario that I'm not seeing that gevent.queue fixes, I'm just curious what that is.

Although you are right that no two statements execute at the same time within a single Python process, a greenlet can yield control at any blocking call, so another greenlet may run between two of your statements. You might therefore want to ensure that a series of statements executes atomically, or you might want to impose an order on the execution of certain statements, and in those cases things like gevent.queue become useful. A tutorial is here.
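For illustration, here is a minimal sketch of that coordination. The blocking put/get and the iteration protocol are real gevent.queue behavior; the producer/consumer functions themselves are invented:

    import gevent
    from gevent.queue import Queue

    tasks = Queue()

    def producer():
        for i in range(5):
            tasks.put(i)              # hand an item to whoever is waiting
            gevent.sleep(0)           # yield control, as real I/O would
        tasks.put(StopIteration)      # gevent convention: ends iteration below

    def consumer():
        for item in tasks:            # blocks until an item is available
            print('got', item)

    gevent.joinall([gevent.spawn(producer), gevent.spawn(consumer)])

The queue guarantees that each item is handed to exactly one consumer, in order, no matter where the greenlets happen to yield.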

Related

What is the point of Pool.apply() in Python if it blocks

I am trying to understand what the point of Pool.apply() is within a parallelisation module like multiprocessing.
My understanding is that it is synchronous, so no parallel processing occurs; you just get the overhead of running code in a separate process.
I agree that it's not useful for parallelization, but it might have other uses. For example, if, through poor life choices, your function has an unholy mess of side effects, like changing global variables (probably because you only intended to execute it in a Pool when you wrote it), running it in a separate process can help with that.
Of course, using it like that is most likely an anti-pattern, so I guess this function only exists for compatibility reasons...
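As a rough sketch of that isolation (the messy function and its global counter are invented for illustration):

    from multiprocessing import Pool

    counter = 0  # module-level state that the function mutates

    def messy(x):
        global counter
        counter += 1          # side effect, confined to the worker process
        return x * x

    if __name__ == '__main__':
        with Pool(processes=2) as pool:
            result = pool.apply(messy, (3,))  # blocks until the result arrives
        print(result)   # 9
        print(counter)  # still 0 here: the mutation happened in the child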

Is there a way to find out if Python threading locks are ever used by more than one thread?

I'm working on a personal project that has been refactored a number of times. It started off using multithreading, then parts of it used asyncio, and now it is back to being mainly single-threaded.
As a result of all these changes, I have a number of threading.Lock()s in the code that I would like to remove and clean up to prevent future issues.
How can I easily work out which locks are in use and hit by more than one thread during the runtime of the application?
If I were in the situation of needing to find that out, I would replace each lock with a wrapper that does the counting (or prints something, raises an exception, etc.) whenever the undesired behavior happens. Python is flexible enough that you can simply write such a wrapper and substitute it for the original threading.Lock to get the job done. It needs some careful implementation, though; for example, you have to catch every possible pathway to lock and unlock.
However, be aware that even then you might not exercise every possible code path, and so you can never be certain that you have really removed all the "bugs".
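A minimal sketch of that wrapper idea (the class name and the owners-set reporting are invented, not an established API):

    import threading

    class TracingLock:
        """Drop-in replacement for threading.Lock that records its users."""

        def __init__(self, name):
            self._lock = threading.Lock()
            self.name = name
            self.owners = set()   # idents of every thread that acquired it

        def acquire(self, *args, **kwargs):
            acquired = self._lock.acquire(*args, **kwargs)
            if acquired:
                self.owners.add(threading.get_ident())
            return acquired

        def release(self):
            self._lock.release()

        __enter__ = acquire

        def __exit__(self, *exc_info):
            self.release()

After a run, any lock with len(lock.owners) <= 1 was never touched by more than one thread and is a candidate for removal.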

Will asyncio help me here?

I have a little Python application, developed using wxPython 4.0.3, that performs a fairly simple ETL-type task:
Takes user input to select a parent directory containing several sub-directories full of CSV files (time series data).
Transforms the data in each of those files to comply with formatting required by a third-party recipient and writes them out to new files.
Zips each of the new files into a file named after the original file's parent directory.
At the user's initiative, the application then FTPs the zip files to the third party to be loaded into a data store.
The application works well enough, but the time required to process several thousand CSV files is pretty extreme, and mostly I/O-bound from what I can tell.
Is asyncio a reasonable option to pursue or are there other recommendations that anyone can make? I originally wrote this as a CLI and saw significant performance gains by using pypy, but I was reluctant to combine pypy with wxpython when I developed the UI for others.
Thanks for your guidance.
If you saw a significant speedup by using PyPy instead of CPython, that implies that your code probably isn't I/O-bound. Which means that making the I/O asynchronous isn't going to help very much. Plus, it'll be extra work, because you'll have to restructure all of your CPU-heavy tasks into small pieces that can await repeatedly so they don't block the other tasks.
So, you probably want to use multiple processes here.
The simplest solution is to use a concurrent.futures.ProcessPoolExecutor: just toss tasks at the executor, and it'll run them on the child processes and return you a Future.
Unlike using asyncio, you won't have to change those tasks at all. They can read a file just by looping over the csv module, process it all in one big chunk, and even use the synchronous ftplib module, without needing to worry about anyone blocking anyone else. Only your top-level code needs to change.
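A minimal sketch of that top-level change (process_file and the directory name are placeholders for the existing per-file logic):

    import concurrent.futures
    import pathlib

    def process_file(path):
        # read the CSV, transform it, write and zip the output, all in
        # plain synchronous code; nothing here knows about concurrency
        ...
        return path

    if __name__ == '__main__':
        csv_files = list(pathlib.Path('parent_dir').rglob('*.csv'))
        with concurrent.futures.ProcessPoolExecutor() as executor:
            for done in executor.map(process_file, csv_files):
                print('finished', done)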
However, you may want to consider splitting the code into a wx GUI that you run in CPython, and a multiprocessing engine that you run via subprocess in PyPy, which then spins off the ProcessPoolExecutor in PyPy as well. This would take a bit more work, but it means you'll get the CPU benefits of using PyPy, the well-tested-with-wx benefits of using CPython, and the parallelism of multiprocessing.
Another option to consider is pulling in a library like NumPy or Pandas that can do the slow parts (whether that's reading and processing the CSV, or doing some kind of elementwise computation on thousands of rows, or whatever) more quickly (and possibly even releasing the GIL, meaning you don't need multiprocessing).
If your code really is I/O-bound code, and primarily bound on the FTP requests, asyncio would help. But it would require rewriting a lot of code. You'd need to find or write an asyncio-driven FTP client library. And, if the file reading takes any significant part of your time, converting that to async is even more work.
There's also the problem of integrating the wx event loop with the asyncio event loop. You might be able to get away with running the asyncio loop in a second thread, but then you need to come up with some way of communicating between the wx event loop in the main thread and the asyncio loop in the background thread. Alternatively, you might be able to drive one loop from the other (or there might even be third-party libraries that do that for you). But this might be a lot easier to do with (or have better third-party libraries to help with) something like twisted instead of asyncio.
But, unless you need massive concurrency (which you probably don't, unless you've got hundreds of different FTP servers to talk to), threads should work just as well, with a lot fewer changes to your code. Just use a concurrent.futures.ThreadPoolExecutor, which is nearly identical to using a ProcessPoolExecutor as explained above.
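For illustration, the threaded variant is a one-line change from the process-pool sketch above (same placeholder process_file and csv_files):

    import concurrent.futures

    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        for done in executor.map(process_file, csv_files):
            print('finished', done)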
Yes, you will probably benefit from using an asynchronous library. Since most of your time is spent waiting for I/O, a well-written asynchronous program will use that time to do something else, without the overhead of extra threads or processes. It will scale really well.
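As a generic sketch of that claim (upload here is a stand-in for a real async FTP client call, not an actual library function):

    import asyncio

    async def upload(path):
        await asyncio.sleep(1)    # stands in for one network round-trip
        return path

    async def main():
        # all the waits overlap, so 100 uploads take ~1s, not ~100s
        results = await asyncio.gather(
            *(upload(f'file{i}.zip') for i in range(100)))
        print(len(results), 'uploaded')

    asyncio.run(main())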

How do I use Google Storage in Flask?

The example code makes use of this oauth2_client which it immediately locks. The script does not work without these lines. What's the correct way to integrate this into a Flask app? Do I have to manage these locks? Does it matter if my web server spawns multiple threads? Or if I'm using gunicorn+gevent? Is there documentation on this anywhere?
It's not actually locking, it's just instantiating a lock object inside the module. The lock is actually acquired/released internally by oauth2_client; you don't need to manage it yourself. You can see this by looking at the source code, here: https://github.com/GoogleCloudPlatform/gsutil/blob/master/gslib/third_party/oauth2_plugin/oauth2_client.py
In fact, based on the source code linked above, you should be able to simply call oauth2_client.InitializeMultiprocessingVariables() instead of the try/except block, since that is ultimately doing almost the exact same thing.
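That is, something like the following should suffice (a sketch based on the linked source; the import path follows gsutil's layout and may vary by version):

    from gslib.third_party.oauth2_plugin import oauth2_client

    # replaces the try/except block from the example code
    oauth2_client.InitializeMultiprocessingVariables()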

How to understand App Engine's ndb.tasklet?

From documentation:
An NDB tasklet is a piece of code that might run concurrently with other code. If you write a tasklet, your application can use it much like it uses an async NDB function: it calls the tasklet, which returns a Future; later, calling the Future's get_result() method gets the result.
The explanation and examples in the documentation seem like magic to me.
I can use it, but I find it hard to understand properly.
For example:
Can I put any kind of code inside a function and decorate it with @ndb.tasklet, then use it as an async function later? Or must it be an App Engine RPC?
Does this kind of decorator also work on my PC?
Is it the same as a tasklet in PyPy?
If you look at the implementation of a Future, it's very comparable to what a generator is in Python. In fact, it uses the same yield keyword to achieve what it says it does. Read the intro comments in tasklets.py for some clarification.
When you use the @tasklet decorator, it creates a Future and waits for a value from the wrapped function. If the value is a generator, it adds the Future to the event loop. When you yield on a Future, the event loop runs through ALL queued Futures until the Future you want is ready. The concurrency here is that each Future executes its code until it either returns (by raising ndb.Return(...) or simply completing), throws an exception, or yields again. I guess technically you can use yield in the code just to stop executing that function in favor of letting the event loop continue running other Futures, but I would assume this wouldn't help much unless you really have a clever use-case in mind.
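A small sketch of that flow in the classic NDB API (the account model and its owner_name field are invented; @ndb.tasklet, get_async(), and ndb.Return are the real API):

    from google.appengine.ext import ndb

    @ndb.tasklet
    def get_owner_name(account_key):
        account = yield account_key.get_async()  # suspend on the RPC's Future
        raise ndb.Return(account.owner_name)     # becomes this Future's result

The caller gets a Future back immediately and can start other tasklets before calling its get_result().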
To answer your questions:
Technically yes, but it will not run asynchronously. When you decorate a non-yielding function with @tasklet, its Future's value is computed and set when you call that function. That is, it runs through the entire function body when you call it. If you want to achieve asynchronous operation, you must yield on something that does asynchronous work. Generally in GAE it will work its way down to an RPC call.
If by "work on your PC" you mean "does the dev appserver implement tasklets/Futures like GAE?", then yes, although this is more accurate with devappserver2 (now the default in the newer SDK). I'm actually not 100% sure whether local RPC calls will run in parallel when using Futures, but there is an event loop going through Futures whether it's local or in production. If you want to use Futures in your other, non-GAE code, then I think you would be better off using Python 3.2's built-in concurrent.futures (or finding a backport here).
Kind of; it's not really a simple comparison. Look at the documentation here. The idea is somewhat the same (the scheduler can be compared to the event loop), but the low-level implementation differs greatly.
