Will asyncio help me here? - python

I have a little Python application, developed with wxPython 4.0.3, that performs a fairly simple ETL-type task:

1. Takes user input to select a parent directory containing several sub-directories full of CSV files (time series data).
2. Transforms the data in each of those files to comply with formatting required by a third-party recipient and writes them out to new files.
3. Zips each of the new files into a file named after the original file's parent directory.
4. At the user's initiative, the application then FTPs the zip files to the third party to be loaded into a data store.
The application works well enough, but the time required to process several thousand CSV files is pretty extreme and, from what I can tell, mostly I/O-bound.
Is asyncio a reasonable option to pursue, or are there other recommendations anyone can make? I originally wrote this as a CLI and saw significant performance gains by using PyPy, but I was reluctant to combine PyPy with wxPython when I developed the UI for others.
Thanks for your guidance.

If you saw a significant speedup by using PyPy instead of CPython, that implies that your code probably isn't I/O-bound. Which means that making the I/O asynchronous isn't going to help very much. Plus, it'll be extra work, as well, because you'll have to restructure all of your CPU-heavy tasks into small pieces that can await repeatedly so they don't block the other tasks.
So, you probably want to use multiple processes here.
The simplest solution is to use a concurrent.futures.ProcessPoolExecutor: just toss tasks at the executor, and it'll run them on the child processes and return you a Future.
Unlike using asyncio, you won't have to change those tasks at all. They can read a file just by looping over the csv module, process it all in one big chunk, and even use the synchronous ftplib module, without needing to worry about anyone blocking anyone else. Only your top-level code needs to change.
However, you may want to consider splitting the code into a wx GUI that you run in CPython, and a multiprocessing engine that you run via subprocess in PyPy, which then spins off the ProcessPoolExecutor in PyPy as well. This would take a bit more work, but it means you'll get the CPU benefits of using PyPy, the well-tested-with-wx benefits of using CPython, and the parallelism of multiprocessing.
Another option to consider is pulling in a library like NumPy or Pandas that can do the slow parts (whether that's reading and processing the CSV, or doing some kind of elementwise computation on thousands of rows, or whatever) more quickly (and possibly even releasing the GIL, meaning you don't need multiprocessing).
If your code really is I/O-bound code, and primarily bound on the FTP requests, asyncio would help. But it would require rewriting a lot of code. You'd need to find or write an asyncio-driven FTP client library. And, if the file reading takes any significant part of your time, converting that to async is even more work.
There's also the problem of integrating the wx event loop with the asyncio event loop. You might be able to get away with running the asyncio loop in a second thread, but then you need to come up with some way of communicating between the wx event loop in the main thread and the asyncio loop in the background thread. Alternatively, you might be able to drive one loop from the other (or there might even be third-party libraries that do that for you). But this might be a lot easier to do with (or have better third-party libraries to help with) something like twisted instead of asyncio.
But, unless you need massive concurrency (which you probably don't, unless you've got hundreds of different FTP servers to talk to), threads should work just as well, with a lot fewer changes to your code. Just use a concurrent.futures.ThreadPoolExecutor, which is nearly identical to using a ProcessPoolExecutor as explained above.
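For the FTP step, the thread-pool version could look something like this; `upload` here is a hypothetical placeholder for the real `ftplib` call:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def upload(path):
    # Hypothetical stand-in for an ftplib upload; the real version
    # would open an ftplib.FTP connection and storbinary() the file.
    time.sleep(0.1)          # simulate network latency
    return (path, "ok")

def upload_all(paths, max_workers=8):
    # Threads are fine here: the work is I/O-bound, so the GIL is
    # released while each thread waits on the (simulated) network.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return dict(executor.map(upload, paths))

start = time.perf_counter()
results = upload_all([f"file{i}.zip" for i in range(8)])
elapsed = time.perf_counter() - start
# Eight 0.1 s "uploads" overlap, so this finishes in far less
# wall-clock time than 0.8 s.
```

Swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` (or back) is a one-word change, which is exactly why the `concurrent.futures` interface is worth reaching for first.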

Yes, you will probably benefit from using an asynchronous library. Since most of your time is spent waiting for I/O, a well-written asynchronous program will use that time to do something else, without the overhead of extra threads/processes. It will scale really well.

Related

Does Python multiprocessing module need multi-core CPU?

Do I need a multi-core CPU to take advantage of the Python multiprocessing module?
Also, can someone tell me how it works under the hood?
multiprocessing asks the OS to launch one or more new processes, running the same version of Python and the same version of your script. It can also set up pipes or other ways of sharing data directly between them.
It usually works like magic; when you peek under the hood, sometimes it looks like sausage being made, but you can usually understand the sausage grinders. The multiprocessing docs do a great job explaining things further (they're long, but then there's a lot to explain). And if you need even more under-the-hood knowledge, the docs link to the source, which is pretty readable Python code. If you have a specific question after reading, come back to SO and ask a specific question.
Meanwhile, you can get some of the benefits of multiprocessing without multiple cores.
The main benefit—the reason the module was designed—is parallelism for speed. And obviously, without 4 cores, you aren't going to cut your time down to 25%. But sometimes, you actually can get a bit of speedup even with a single core, especially if that core has "hyperthreading" or similar technologies. I've seen times come down to 80%, or even 60%. More commonly, they'll go up to 108% instead (because you did get a small benefit from hyperthreading, but the overhead cost was higher than the gain). But try it with your code and see.
Meanwhile, you get all of the side benefits:
Concurrency: You can run multiple tasks at once without them blocking each other. Of course threads, asyncio, and other techniques can do this too.
Isolation: You can run multiple tasks at once without the risk of one of them changing data that another one wasn't expecting to change.
Crash protection: If a child task segfaults, only that task is affected. (Well, you still have to be careful of any side effects—if it crashed in the middle of writing a file that another task expects to be in a consistent shape, you're still in trouble.)
You can also use the multiprocessing module without multiple processes. Sometimes you just want the higher-level API of the module, but you want to use it with threads; multiprocessing.dummy does that. And you can switch back and forth in a couple lines of code to test it both ways. Or you can use the higher-level concurrent.futures.ProcessPoolExecutor wrapper, if its model fits what you want to do. Besides often being simpler, it lets you switch between threads and processes by just changing one word in one line.
Also, redesigning your program around multiprocessing takes you a step closer to further redesigning it as a distributed system that runs on multiple separate machines. It forces you to deal with questions like how your tasks communicate without being able to share everything, without forcing you to deal with further questions like how they communicate without reliable connections.

More than one process at the same time

Hey, I am learning Python at the moment. I wrote a few programs. Now I have a question:
Is it possible to run more "operations" at once?
As far as I know, a script runs from the top to the bottom (except for things like function definitions and if statements and so on).
For example: I want to do something, wait 5 seconds, and then continue—but while my program "waits", it should do something else? (This one is very simple.)
Or: while checking for input, do some other output things.
The examples are very poor, but I can't find anything better at the moment. (If something comes to mind, I will add it later.)
I hope you understand what my question is.
Cheers
TL;DR: Use an async approach. Raymond Hettinger is a god, and this talk explains this concept more accurately and thoroughly than I can. ;)
The behavior you are describing is called "concurrency" or "asynchronicity", where you have more than one "piece" of code executing "at the same time". This is one of the hardest problems in practical computer science, because adding the dimension of time causes scheduling problems in addition to logic problems. However, it is very much in demand these days because of multi-core processors and the inherently parallel environment of the internet.
"At the same time" is in quotes, because there are two basic ways to make this happen:
actually run the code at the same time
make it look like it is running at the same time.
The first option is called concurrent programming, and the second is called asynchronous programming (commonly "async").
Generally, "modern" programming seems to favor async, because it's easier to reason about and comes with fewer, less severe pitfalls. If you do it right, async programs can look a lot like the synchronous, procedural code you're already familiar with. Golang is basically built on the concept. Javascript has embraced "futures" in the form of Promises and async/await. I know it's not Python, but this talk by the creator of Go gives a good overview of the philosophy.
Python gives you three main ways to approach this, separated into three major modules: threading, multiprocessing, and asyncio
multiprocessing and threading are concurrent solutions. They do very similar things, but accomplish them in slightly different ways by delegating to the OS in different ways. This answer has a concise explanation of the difference. Concurrency is notoriously difficult to debug, because it is not deterministic: small differences in timing can result in completely different sequences of execution. You also have to deal with "race conditions" in threads, where two bits of code want to read/change the same piece of shared state at the same time.
asyncio, or "asynchronous input-output" is a more recent, async solution. You'll need at least Python 3.4. It uses event loops to allow long-running tasks to execute without "blocking" the rest of the program. Processes and threads do a similar thing, running two or more operations on even the same processor core by interrupting the running process periodically, forcing them to take turns. But with async, you decide where the turn-taking happens. It's like designing mature adults that interact cooperatively rather than designing kindergarteners that have to be baby-sat by the OS and forced to share the processor.
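A tiny illustration of that cooperative turn-taking, using the `asyncio.run()` API (Python 3.7+):

```python
import asyncio
import time

async def task(name, delay):
    # Each await hands control back to the event loop -- this is the
    # explicit, cooperative "turn-taking" described above.
    await asyncio.sleep(delay)
    return name

async def main():
    # Both tasks "sleep" concurrently, so the total wall-clock time is
    # roughly 0.2 s, not 0.4 s.
    return await asyncio.gather(task("a", 0.2), task("b", 0.2))

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
```

Replace `asyncio.sleep` with any real awaitable I/O operation (a socket read, an HTTP request via an async library) and the structure stays the same.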
There are also third-party packages like gevent and eventlet that predate asyncio and work in earlier versions of Python. If you can afford to target Python >=3.4, I would recommend just using asyncio, because it's part of the Python core.

Using Time in Python

I'm rewriting a simple MIDI music sequencer from JavaScript into Python as a way of teaching myself Python.
I'm ready to begin working with time (for firing MIDI events), but I can't find any good resources for executing code on a schedule, timing events, etc.
A few things I've read suggest I should use a module like tkinter, but I would rather have all the timing mechanisms independent of any GUI module.
Does anyone have any suggestions/resources for working with time?
For executing code at a certain interval (from within another script, of course), you might want to take a look at the time module (documentation here).
But if you are planning to use timing with a GUI, you might want concurrent threading or processing so that there is no delay in the user interface. In that case you can use the multithreading (documentation) or multiprocessing (documentation) modules.
As a final note, some GUI frameworks come with built-in threading support, so you might want to take a look at that. For example, PyQT4 has something called QThread which handles all thread/event manipulation.
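A GUI-independent way to schedule timed events is the standard-library sched module; `note_on` here is a hypothetical stand-in for sending a MIDI message:

```python
import sched
import time

scheduler = sched.scheduler(time.monotonic, time.sleep)
fired = []

def note_on(pitch):
    # Hypothetical stand-in for sending a MIDI note-on event.
    fired.append(pitch)

# Schedule three "MIDI events" at 0.0, 0.05 and 0.1 seconds from now.
for delay, pitch in [(0.0, 60), (0.05, 64), (0.1, 67)]:
    scheduler.enter(delay, 1, note_on, argument=(pitch,))

scheduler.run()   # blocks until the last scheduled event has fired
```

Because `scheduler.run()` blocks, in a real sequencer you would typically run it on its own thread so the rest of the program stays responsive.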

How to choose between different concurrent method available in Python?

There are different ways of doing concurrency in Python; below is a simple list:
process-based: subprocess.Popen, multiprocessing.Process, old-fashioned os.system, os.popen, os.exec*
thread-based: threading.Thread
microthread-based: greenlet
I know the difference between thread-based concurrency and process-based concurrency, and I know some (but not too much) about GIL's impact in CPython's thread support.
For a beginner who wants to implement some level of concurrency, how should one choose between them? Or, what's the general difference between them? Are there any more ways to do concurrency in Python?
I'm not sure if I'm asking the right question, please feel free to improve this question.
The reason all three of these mechanisms exist is that they have different strengths and weaknesses.
First, if you have huge numbers of small, independent tasks, and there's no sensible way to batch them up (typically, this means you're writing a C10k server, but that's not the only possible case), microthreads win hands down. You can only run a few hundred OS threads or processes before everything either bogs down or just fails. So, either you use microthreads, or you give up on automatic concurrency and start writing explicit callbacks or coroutines. This is really the only time microthreads win; otherwise, they're just like OS threads except a few things don't work right.
Next, if your code is CPU-bound, you need processes. Microthreads are an inherently single-core solution; threads in Python generally can't parallelize well because of the GIL; processes get as much parallelism as the OS can handle. So, processes will let your 4-core system run your code 4x as fast; nothing else will. (In fact, you might want to go farther and distribute across separate computers, but you didn't ask about that.) But if your code is I/O-bound, core-parallelism doesn't help, so threads are just as good as processes.
If you have lots of shared, mutable data, things are going to be tough. Processes require explicitly putting everything into sharable structures, like using multiprocessing.Array in place of list, which gets nightmarishly complicated. Threads share everything automatically—which means there are race conditions everywhere. Which means you need to think through your flow very carefully and use locks effectively. With processes, an experienced developer can build a system that works on all of the test data but has to be reorganized every time you give it a new set of inputs. With threads, an experienced developer can write code that runs for weeks before accidentally and silently scrambling everyone's credit card numbers.
Whichever of those two scares you more—do that one, because you understand the problem better. Or, if it's at all possible, step back and try to redesign your code to make most of the shared data independent or immutable. This may not be possible (without making things either too slow or too hard to understand), but think about it hard before deciding that.
If you have lots of independent data or shared immutable data, threads clearly win. Processes need either explicit sharing (like multiprocessing.Array again) or marshaling. multiprocessing and its third-party alternatives make marshaling pretty easy for the simple cases where everything is picklable, but it's still not as simple as just passing values around directly, and it's also a lot slower.
Unfortunately, most cases where you have lots of immutable data to pass around are the exact same cases where you need CPU parallelism, which means you have a tradeoff. And the best answer to this tradeoff may be OS threads on your current 4-core system, but processes on the 16-core system you have in 2 years. (If you organize things around, e.g., multiprocessing.ThreadPool or concurrent.futures.ThreadPoolExecutor, and trivially switch to Pool or ProcessPoolExecutor later—or even with a runtime configuration switch—that pretty much solves the problem. But this isn't always possible.)
Finally, if your application inherently requires an event loop (e.g., a GUI app or a network server), pick the framework you like first. Coding with, say, PySide vs. wx, or twisted vs. gevent, is a bigger difference than coding with microthreads vs. OS threads. And, once you've picked the framework, see how much you can take advantage of its event loop where you thought you needed real concurrency. For example, if you need some code to run every 30 seconds, don't start a thread (micro- or OS) for that, ask the framework to schedule it however it wants.

Python: When to use Threads vs. Multiprocessing

What are some good guidelines to follow when deciding to use threads or multiprocessing when speaking in terms of efficiency and code clarity?
Many of the differences between threading and multiprocessing are not really Python-specific, and some differences are specific to a certain Python implementation.
For CPython, I would use the multiprocessing module in either of the following cases:
I need to make use of multiple cores simultaneously for performance reasons. The global interpreter lock (GIL) would prevent any speedup when using threads. (Sometimes you can get away with threads in this case anyway, for example when the main work is done in C code called via ctypes or when using Cython and explicitly releasing the GIL where appropriate. Of course the latter requires extra care.) Note that this case is actually rather rare. Most applications are not limited by processor time, and if they really are, you usually don't use Python.
I want to turn my application into a real distributed application later. This is a lot easier to do for a multiprocessing application.
There is very little shared state needed between the tasks to be performed.
In almost all other circumstances, I would use threads. (This includes making GUI applications responsive.)
For code clarity, one of the biggest things is to learn to know and love the Queue object for talking between threads (or processes, if using multiprocessing... multiprocessing has its own Queue object). Queues make things a lot easier and I think enable a lot cleaner code.
I had a look for some decent Queue examples, and this one has some great examples of how to use them and how useful they are (with the exact same logic applying for the multiprocessing Queue):
http://effbot.org/librarybook/queue.htm
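A minimal producer/consumer sketch of the pattern: the main thread hands work to a worker thread through a Queue, using None as a shutdown sentinel:

```python
import queue
import threading

q = queue.Queue()
results = []

def worker():
    # Pull items until the producer sends the None sentinel.
    while True:
        item = q.get()
        if item is None:
            break
        results.append(item * item)

t = threading.Thread(target=worker)
t.start()

for n in range(5):          # producer: hand work to the thread
    q.put(n)
q.put(None)                 # sentinel: tell the worker to stop
t.join()
```

The same code works with `multiprocessing.Queue` for processes, except that the shared `results` list would then have to become something explicitly shareable (another queue, a pipe, or a `multiprocessing.Manager` list).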
For efficiency, the details and outcome may not noticeably affect most people, but for Python <= 3.1 the CPython implementation has some interesting (and potentially brutal) efficiency issues on multicore machines that you may want to know about. These issues involve the GIL. David Beazley did a video presentation on it a while back and it is definitely worth watching. More info here, including a followup talking about significant improvements on this front in Python 3.2.
Basically, my cheap summary of the GIL-related multicore issue is that if you are expecting to get full multi-processor use out of CPython <= 2.7 by using multiple threads, don't be surprised if performance is not great, or even worse than single core. But if your threads are doing a bunch of I/O (file read/write, DB access, socket read/write, etc.), you may not even notice the problem.
The multiprocessing module avoids this potential GIL problem entirely by creating a separate Python interpreter (each with its own GIL) per process.
