I am developing some Python code for Windows. A criterion is that it must use less than 1% of CPU. I understand that it is impossible to guarantee this all the time due to things like garbage collection, but what would be the best practice to get as close as possible? My current solution is to spread a lot of time.sleep(0.1) calls around the code, especially in loops. There are, however, obvious problems with this approach.
What other approaches could be taken?
I should also mention that the application has lots of threads in it using the threading library.
EDIT: Setting the process priority is not what I am after.
It is the job of the operating system to schedule CPU time. Use your operating system's built-in process-limits mechanisms (hopefully they exist on Windows) to restrict your process to <1% CPU.
Sprinkling unnecessary sleeps every few lines will make the code terrible to write, extend, and maintain, not to mention incredibly inelegant. (Rate-limiting yourself may be useful in very small, limited, critical sections -- for example, if your program queues lots of IO requests and you don't wish to inundate the operating system, you might put a single sleep-until-[condition] in each critical loop that has the potential to inundate the system -- but otherwise use it extremely sparingly.)
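As a concrete illustration of that "sleep-until-[condition]" idea, here is a minimal sketch (the names are invented): a thread blocked in threading.Event.wait() consumes essentially no CPU, unlike a loop that polls with time.sleep(0.1).

    import threading

    work_ready = threading.Event()

    def do_pending_work():
        pass                       # stand-in for the real work

    def worker():
        while True:
            work_ready.wait()      # blocks at ~0% CPU until someone calls set()
            work_ready.clear()
            do_pending_work()

    # Producer side (e.g. another thread): queue up work, then call
    # work_ready.set() to wake the worker.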
Ideally you would call an API to the appropriate OS mechanisms from within your program when you start up, telling the OS to throttle you appropriately.
If the goal is to not bother the user then "below 1% CPU" is the wrong approach. What you really want is "don't take time away from other processes but still complete as fast as possible" - that's what "below normal" process priority is for. See http://code.activestate.com/recipes/496767-set-process-priority-in-windows/ for an example of how process priority can be changed for the current process (calling that function with default parameters will do).
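The linked recipe relies on the pywin32 package; if you'd rather avoid that dependency, a bare-ctypes sketch of the same idea looks roughly like this (BELOW_NORMAL_PRIORITY_CLASS and the two kernel32 functions are standard Win32, but treat this as illustrative rather than production-ready):

    import ctypes

    BELOW_NORMAL_PRIORITY_CLASS = 0x00004000  # Win32 priority-class constant

    def lower_priority():
        """Drop the current process to 'below normal' priority on Windows."""
        kernel32 = ctypes.windll.kernel32
        handle = kernel32.GetCurrentProcess()   # pseudo-handle, no need to close
        kernel32.SetPriorityClass(handle, BELOW_NORMAL_PRIORITY_CLASS)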
For the sales pitch you can show the task manager while the computer is idle ("See? 99%, my application gets lots of work done") and then start some CPU-intensive application ("Almost all CPU time is spent in the application the user is working with, my application simply went into background").
If the box used for the demonstration is running Windows Server, you can use Windows System Resource Manager to restrict CPU usage below the desired threshold. Forcing this behavior from code is impossible unless a Windows API exposes the capability explicitly.
I'm writing a Python server for a telnet-like protocol. Clients connect and authenticate a session, and then issue a series of commands that each have a response. The sessions have state, in the sense that a user authenticates once and then it's assumed that subsequent commands are performed by that user. The command/response operations in different sessions are effectively independent, although they do involve reads and occasional writes to a shared IO resource (postgres) that is largely capable of managing its own concurrency.
It's a design goal to support a large number of users with a small number of 8 or 16-core servers. I'm looking for a reasonably efficient way to architect the server implementation.
Some options I've considered include:
Using threads for each session; I suspect with the GIL this will make poor use of available cores
Using a separate process for each session; I suspect that with a high ratio of sessions to servers (1000-to-1, say) the overhead of 1000 Python interpreters may exceed memory limits. You also have a "slow start" problem when a user connects.
Assigning sessions to process pools of 32 or so processes; idle sessions may get assigned to all 32 processes and prevent non-idle sessions from being processed.
Using some type of "routing" system where all sessions are handled by a single process and then individual commands are farmed out to a process pool. This still sounds substantially single-threaded to me (as there's a big single-threaded bottleneck), and this system may introduce substantial overhead if some commands are very trivial but must cross an IPC boundary two times and wait for a free process to get a response.
Use Jython/IronPython and multithreading; lack of C extensions is a concern
Python isn't a good fit for this problem; use Go/C++/Scala/Java either as a router for Python processes or abandon Python completely.
Using threads for each session; I suspect with the GIL this will make poor use of available cores
Is your code actually CPU-bound?* If it spends all its time waiting on I/O, then the GIL doesn't matter at all.** So there's absolutely no reason to use processes, or a GIL-less Python implementation.
Of course if your code is CPU-bound, then you should definitely use processes or a GIL-less implementation. But in that case, you're really only going to be able to efficiently handle N clients at a time with N CPUs, which is a very different problem than the one you're describing. Having 10000 users all fighting to run CPU-bound code on 8 cores is just going to frustrate all of them. The only way to solve that is to only handle, say, 8 or 32 at a time, which means the whole "10000 simultaneous connections" problem doesn't even arise.
So, I'll assume your code is I/O-bound, and that your problem is a sensible and solvable one.
There are other reasons threads can be limiting. In particular, if you want to handle 10000 simultaneous clients, your platform probably can't run 10000 simultaneous threads (or can't switch between them efficiently), so this will not work. But in that case, processes usually won't help either (in fact, on some platforms, they'll just make things a lot worse).
For that, you need to use some kind of asynchronous networking—either a proactor (a small thread pool and I/O completion), or a reactor (a single-threaded event loop around an I/O readiness multiplexer). The Socket Programming HOWTO in the Python docs shows how to do this with select; doing it with more powerful mechanisms is a bit more complicated, and a lot more platform-specific, but not that much harder.
However, there are libraries that make this a lot easier. Python 3.4 comes with asyncio,*** which lets you abstract all the obnoxious details out and just write protocols that talk to transports via coroutines. Under the covers, there's either a reactor or a proactor (and a good one for each platform), without you having to worry about it.
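As a rough sketch of what that looks like for a protocol like yours (CommandProtocol and handle_command are invented names; the real command dispatch and authentication are elided):

    import asyncio

    class CommandProtocol(asyncio.Protocol):
        def connection_made(self, transport):
            self.transport = transport
            self.session = {}                      # per-connection state (user, auth, ...)

        def data_received(self, data):
            for line in data.splitlines():
                response = self.handle_command(line)
                self.transport.write(response + b"\r\n")

        def handle_command(self, line):
            return b"OK " + line                   # placeholder for real dispatch

    loop = asyncio.get_event_loop()
    server = loop.run_until_complete(
        loop.create_server(CommandProtocol, "127.0.0.1", 8023))
    try:
        loop.run_forever()
    finally:
        server.close()
        loop.close()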
If you can't wait for 3.4 to be finalized, or want to use something that's less-bleeding-edge, there are popular third-party frameworks like Twisted, which have other advantages as well.****
Or, if you prefer to think in a threaded paradigm, you can use a library like gevent, which uses greenlets to fake a bunch of threads on top of a single-threaded reactor.
From your comments, it sounds like you really have two problems:
First, you need to handle 10000 connections that are mostly sitting around doing nothing. The actual scheduling and multiplexing of 10000 connections is itself a major bottleneck if you try to do it with something like select, and, as I said above, running 10000 threads or processes is not going to work. So, you need a good proactor or reactor for your platform, as described above.
Second, a few of those connections will be alive at a time.
First, for simplicity, let's assume it's all CPU-bound. So you will want processes. In particular, you want a pool of N processes, where N is the number of cores. Which you do by just creating a concurrent.futures.ProcessPoolExecutor() or multiprocessing.Pool().
But you claim they're doing a mix of CPU-bound and I/O-bound work. If all the tasks spend, say, 1/4th of their time burning CPU, use 4N processes instead. There's a bit of wasted overhead in context switching, but you're unlikely to notice it. You can get N as n = multiprocessing.cpu_count(); then use ProcessPoolExecutor(4*n) or Pool(4*n). If they're not that consistent or predictable, you can still almost always pretend they are—measure average CPU time over a bunch of tasks, and use n/avg. You can fudge this up or down depending on whether you're more concerned with maximizing peak performance or typical performance, but it's just one knob to twiddle, and you can just twiddle it empirically.
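Putting those two paragraphs together, a sketch of the sizing logic (handle_command and the 0.25 figure are placeholders; measure your own average):

    import multiprocessing
    from concurrent.futures import ProcessPoolExecutor

    def handle_command(session_state, command):
        ...  # the CPU-bound part of handling one command (placeholder)

    if __name__ == "__main__":
        n = multiprocessing.cpu_count()
        avg_cpu_fraction = 0.25   # measured: tasks spend ~1/4 of their time on CPU
        pool = ProcessPoolExecutor(max_workers=int(n / avg_cpu_fraction))  # => 4N
        # From the event loop, hand CPU-bound work to the pool:
        # future = pool.submit(handle_command, state, cmd)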
And that's it.*****
* … and in Python or in C extensions that don't release the GIL. If you're using, e.g., NumPy, it will do much of its slow work without holding the GIL.
** Well, it matters before Python 3.2. But hopefully if you're already using 3.x you can upgrade to 3.2+.
*** There's also asyncore and its friend asynchat, which have been in the stdlib for decades, but you're better off just ignoring them.
**** For example, frameworks like Twisted are chock full of protocol implementations and wrappers and adaptors and so on to tie all kinds of other functionality in without having to write a mess of complicated code yourself.
***** What if it really isn't good enough, and the task switching overhead or the idleness when all of your tasks happen to be I/O-waiting at the same time kills performance? Well, those are both very unlikely except in specific kinds of apps. If it happens, you will need to either break your tasks up to separate out the actual CPU-bound subtasks from the I/O-bound, or write some kind of application-specific adaptive load balancer.
My wx GUI shows thumbnails, but they're slow to generate, so:
The program should remain usable while the thumbnails are generating.
Switching to a new folder should stop generating thumbnails for the old folder.
If possible, thumbnail generation should make use of multiple processors.
What is the best way to do this?
Putting the thumbnail generation in a background thread with threading.Thread will solve your first problem, making the program usable.
If you want a way to interrupt it, the usual way is to add a "stop" variable which the background thread checks every so often (e.g., once per thumbnail), and the GUI thread sets when it wants to stop it. Ideally you should protect this with a threading.Condition. (The condition isn't actually necessary in most cases—the same GIL that prevents your code from parallelizing well also protects you from certain kinds of race conditions. But you shouldn't rely on that.)
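Here's a minimal sketch of that arrangement, using a threading.Event as the stop flag (an Event bundles the flag with its locking, so you don't need a separate Condition; generate_thumbnail stands in for your real code):

    import threading

    def generate_thumbnail(path):
        pass                                      # stand-in for the real image work

    class ThumbnailWorker(threading.Thread):
        def __init__(self, paths):
            threading.Thread.__init__(self)
            self.paths = paths
            self.stop_requested = threading.Event()

        def run(self):
            for path in self.paths:
                if self.stop_requested.is_set():  # checked once per thumbnail
                    return
                generate_thumbnail(path)

    # GUI thread, when the user switches folders:
    #   old_worker.stop_requested.set()
    #   worker = ThumbnailWorker(new_paths)
    #   worker.start()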
For the third problem, the first question is: Is thumbnail generation actually CPU-bound? If you're spending more time reading and writing images from disk, it probably isn't, so there's no point trying to parallelize it. But, let's assume that it is.
First, if you have N cores, you want a pool of N threads, or N-1 if the main thread has a lot of work to do too, or maybe something like 2N or 2N-1 to trade off a bit of best-case performance for a bit of worst-case performance.
However, if that CPU work is done in Python, or in a C extension that nevertheless holds the Python GIL, this won't help, because most of the time, only one of those threads will actually be running.
One solution to this is to switch from threads to processes, ideally using the standard multiprocessing module. It has built-in APIs to create a pool of processes, and to submit jobs to the pool with simple load-balancing.
The problem with using processes is that you no longer get automatic sharing of data, so that "stop flag" won't work. You need to explicitly create a flag in shared memory, or use a pipe or some other mechanism for communication instead. The multiprocessing docs explain the various ways to do this.
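One way to do it, sketched below, is a multiprocessing.Event handed to each worker through the pool's initializer (again, generate_thumbnail is a placeholder):

    import multiprocessing

    _stop = None

    def _init_worker(stop_event):
        global _stop
        _stop = stop_event                     # each worker keeps a handle to the shared flag

    def generate_thumbnail(path):
        return path                            # stand-in for the real image work

    def make_thumbnail(path):
        if _stop.is_set():                     # skip the job if a stop was requested
            return None
        return generate_thumbnail(path)

    if __name__ == "__main__":
        stop = multiprocessing.Event()
        pool = multiprocessing.Pool(initializer=_init_worker, initargs=(stop,))
        result = pool.map_async(make_thumbnail, ["a.jpg", "b.jpg"])
        # On folder change: stop.set() -- workers skip their remaining jobs.
        pool.close()
        pool.join()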
You can actually just kill the subprocesses. However, you may not want to. First, unless you've written your code carefully, killing a worker mid-job may leave your thumbnail cache in an inconsistent state that will confuse the rest of your code. Also, creating the subprocesses takes some time, especially on Windows (not "30 minutes" or anything, but enough to affect the perceived responsiveness of your code if you recreate the pool every time the user clicks a new folder), so you probably want to create the pool before you need it and keep it alive for the entire life of the program.
Other than that, all you have to get right is the job size. Hopefully creating one thumbnail isn't too big of a job—but if it's too small of a job, you can batch multiple thumbnails up into a single job—or, more simply, look at the multiprocessing API and change the way it batches jobs when load-balancing.
Meanwhile, if you go with a pool solution (whether threads or processes), if your jobs are small enough, you may not really need to cancel. Just drain the job queue—each worker will finish whichever job it's working on now, but then sleep until you feed in more jobs. Remember to also drain the queue (and then maybe join the pool) when it's time to quit.
One last thing to keep in mind is that if you successfully generate thumbnails as fast as your computer is capable of generating them, you may actually cause the whole computer—and therefore your GUI—to become sluggish and unresponsive. This usually comes up when your code is actually I/O bound and you're using most of the disk bandwidth, or when you use lots of memory and trigger swap thrash, but if your code really is CPU-bound, and you're having problems because you're using all the CPU, you may want to either use 1 fewer core, or look into setting thread/process priorities.
I'm new to multiprocessing - and I may be interpreting this wrong - but as I run my programs I notice that the more processes I spawn, the more 'sy' goes up on my Linux computer. For example:
Cpu(s): 14.0%us, 24.1%sy, 0.0%ni, 58.8%id, 0.0%wa, 2.2%hi, 0.0%si, 0.8%st
The more processes I spawn, the higher the sy number goes, and each process's actual share just gets halved (so it was 20%/cpu before and drops to 10%/cpu), while idle CPU stays the same (almost 60%). I'm not sure if this is a Linux question or a Python question, but is there anything I can do to reduce this number and allow my programs to use more of the available CPU?
The system CPU time is time used by processes inside the kernel. If you have such a big ratio of system CPU to user CPU, it probably means that your process is doing a lot of system calls.
Don't think that this is lost time: the kernel is doing something useful for your process.
You might try to lower the rate of system calls, e.g. by significantly increasing your buffer sizes. Or maybe your processes are using too many synchronization primitives.
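For example, in Python the syscall rate for file I/O is largely a matter of read size; something like the following (process is a stand-in for your real work) issues one read() per 8 MB rather than one per few KB:

    def process(chunk):
        pass                          # stand-in for the real work

    CHUNK = 8 * 1024 * 1024           # 8 MB per read() instead of a few KB

    with open("big.log", "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            process(chunk)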
You might use strace to find out about system calls done by your processes.
It's more likely a hardware question.
Some key things:
How much RAM is free?
Are you using Swap space?
How many CPUs do you have?
Is your app heavy on calculations?
How large is the shared variable(s)?
Does your app have any I/O?
If your app has a lot of output, you may want to look into a database option and insert the values into a table. This will add caching and control traffic flow between processes. No need to share a variable which may eventually cause other issues when the result set increases over time.
There may be some other tweaks you can do to Linux's memory to help.
Number of open files may be one. I can check which proc settings you can optimize if needed. It will help a little, but I think you may be running into a hardware wall.
Another option is to set up the manager to spawn processes on other servers, and then run them there. You will need to ssh to a machine and pass an arg indicating whether the process is master or slave. It can be done by adding an init override in the manager to redirect the processes.
Hope this helps
Rich
I currently have a couple of small applications < 500 lines. I am intending to eventually run them on a small LINUX ARM box. Is it better to combine these applications and use threading, or continue to have them as two separate applications?
These applications, plus a very small website, share a small sqlite database; only one of the applications writes, everything else currently only reads. Due to constraints of the target box I am using Python 2.6.
I am using SQLite to prevent data loss over several days of use. There is no direct interaction between the two applications, though there is the potential for database locking issues, especially during periodic data maintenance. Preventing these issues is a concern, as is the footprint of the two applications, since the target devices are pretty small.
Depends on whether you need them to share data and how involved the sharing is. Other than that, from a speed point of view on a multi-processor machine, threading won't give you much of an advantage over separate processes.
If sharing can easily take place via a flat file or database then just let them be separate rather than complicating via threading.
For performance, I'd suggest you use threads: a process consumes far more resources than a thread, so threads are faster to create and need less memory (useful in an embedded environment). Of course, you'll have to deal with the common traps of multithreaded programming (concurrent access, solved by locks; but locks can lead to deadlock...).
If you plan to use many libraries that make low-level calls, perhaps with C extensions that don't properly release the GIL (Global Interpreter Lock), then processes can be the better solution, allowing your applications to keep running even when one of them is blocked by the GIL.
If you need to pass data between the two, you could use the Queues and other mechanisms in the multiprocessing module.
It's often much simpler to use multiprocessing rather than sharing memory or objects using the threading module.
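For what it's worth, a minimal sketch of two processes sharing a multiprocessing.Queue (the acquirer/consumer split here is invented to mirror the two applications in the question):

    import multiprocessing

    def acquirer(q):
        # The data-acquisition process pushes readings to the other app.
        q.put({"sensor": 1, "value": 42.0})
        q.put(None)                      # sentinel: no more data

    def consumer(q):
        while True:
            item = q.get()
            if item is None:
                break
            print("got", item)

    if __name__ == "__main__":
        q = multiprocessing.Queue()
        p1 = multiprocessing.Process(target=acquirer, args=(q,))
        p2 = multiprocessing.Process(target=consumer, args=(q,))
        p1.start(); p2.start()
        p1.join(); p2.join()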
If you don't need to pass data between your programs, just run them separately (and release any locks on files or databases as soon as possible to reduce contention).
I have decided to adopt a process rather than a threaded approach to resolving this issue. The primary factor in this decision is simplicity. The second is that while one of these applications carries out data acquisition, the other communicates with a modem on an ad-hoc basis (receiving calls); I don't have control over the calling application, and based on my investigations there is the potential for a lot to go wrong.
There are a couple of factors which may change the approach further down the line, primarily the need for the two processes to interact to prevent data contention on the database. Secondly, if resources (memory/disk space/CPU) become an issue due to the size of the device, a single application should give me more ability to manage these problems.
That said the data acquisition application is already threaded. This allows the parent thread to manage the worker when exceptions arise, as the device will not be in a managed environment.
Keeping the GUI responsive while the application does some CPU-heavy processing is one of the challenges of effective GUI programming.
Here's a good discussion of how to do this in wxPython. To summarize, there are 3 ways:
Use threads
Use wxYield
Chunk the work and do it in the IDLE event handler
Which method have you found to be the most effective? Techniques from other frameworks (like Qt, GTK or Windows API) are also welcome.
Threads. They're what I always go for because you can do it in every framework you need.
And once you're used to multi-threading and parallel processing in one language/framework, you're good on all frameworks.
Definitely threads. Why? The future is multi-core. Almost any new CPU has more than one core, and if it has just one, it may support hyperthreading and thus pretend it has more than one. To make effective use of multi-core CPUs (and Intel is planning to go up to 32 cores in the not-so-far future), you need multiple threads. If you run everything in one main thread (usually the UI thread is the main thread), users will have CPUs with 8, 16, and one day 32 cores, and your application will never use more than one of them; IOW, it will run much, much slower than it could.
Actually, if you design an application nowadays, I would move away from the classical design and think of a master/slave relationship. Your UI is the master; its only task is to interact with the user, that is, displaying data to the user and gathering user input. Whenever your app needs to "process any data" (even small amounts, and much more importantly big ones), create a "task" of some kind, forward it to a background thread, and have the thread perform the task, providing feedback to the UI (e.g. what percentage it has completed, or just whether the task is still running, so the UI can show a "work-in-progress indicator"). If possible, split the task into many small, independent sub-tasks and run more than one background thread, feeding one sub-task to each. That way your application can really benefit from multi-core and get faster the more cores CPUs have.
Actually, companies like Apple and Microsoft are already planning how to make their still mostly single-threaded UIs themselves multithreaded. Even with the approach above, you may one day find that the UI is the bottleneck itself: the background processes can process data much faster than the UI can present it to the user or ask the user for input. Today many UI frameworks are barely thread-safe, many not thread-safe at all, but that will change. Serial processing (doing one task after another) is a dying design; parallel processing (doing many tasks at once) is where the future is going. Just look at graphics adapters. Even the most modern NVidia card has pitiful performance if you look at the clock speed in MHz/GHz of the GPU alone. How come it can beat the crap out of CPUs when it comes to 3D calculations? Simple: instead of calculating one polygon point or one texture pixel after another, it calculates many of them in parallel (a whole bunch at the same time), and that way it reaches a throughput that still makes CPUs cry. E.g. the ATI X1900 (to name the competitor as well) has 48 shader units!
I think delayedresult is what you are looking for:
http://www.wxpython.org/docs/api/wx.lib.delayedresult-module.html
See the wxpython demo for an example.
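A minimal sketch of the usual delayedresult pattern (the frame and slow_job here are invented for illustration): startWorker runs the job in a background thread and delivers a DelayedResult to the consumer back on the GUI thread.

    import time
    import wx
    from wx.lib.delayedresult import startWorker

    class Frame(wx.Frame):
        def __init__(self):
            wx.Frame.__init__(self, None, title="delayedresult demo")
            self.button = wx.Button(self, label="Start slow job")
            self.button.Bind(wx.EVT_BUTTON, self.on_start)

        def on_start(self, event):
            self.button.Disable()
            startWorker(self.on_done, self.slow_job, wargs=(42,))

        def slow_job(self, n):
            time.sleep(2)              # stands in for the CPU-heavy work
            return n * n

        def on_done(self, delayed):
            result = delayed.get()     # re-raises any exception from the worker
            self.button.Enable()
            self.SetTitle("Result: %s" % result)

    app = wx.App(False)
    Frame().Show()
    app.MainLoop()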
Threads or processes, depending on the application. Sometimes it's actually best to have the GUI be its own program and just send asynchronous calls to other programs when it has work to do. You'll still end up having multiple threads in the GUI to monitor for results, but it can simplify things if the work being done is complex and not directly connected to the GUI.
Threads -
Let's use a simple 2-layer view (GUI, application logic).
The application logic work should be done in a separate Python thread. For asynchronous events that need to propagate up to the GUI layer, use wx's event system to post custom events. Posting wx events is thread-safe, so you could conceivably do it from multiple contexts.
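For example, using wx.lib.newevent to declare the custom event (the handler names here are invented):

    import wx
    import wx.lib.newevent

    # Declare a custom event class plus its binder.
    ResultEvent, EVT_RESULT = wx.lib.newevent.NewEvent()

    # Worker thread side -- wx.PostEvent is safe to call off the GUI thread:
    #   wx.PostEvent(frame, ResultEvent(data=result))

    # GUI side -- bind a handler that reads the payload:
    #   frame.Bind(EVT_RESULT, lambda evt: update_view(evt.data))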
Working in the other direction (GUI input events triggering application logic), I have found it best to home-roll a custom event system. Use the Queue module to have a thread-safe way of pushing and popping event objects. Then, for every synchronous member function, pair it with an async version that pushes the sync function object and the parameters onto the event queue.
This works particularly well if only a single application-logic-level operation can be performed at a time. The benefit of this model is that synchronization is simple - each synchronous function works within its own context, sequentially from start to end, without worry of pre-emption or hand-coded yielding. You will not need locks to protect your critical sections. At the end of the function, post an event to the GUI layer indicating that the operation is complete.
You could scale this to allow multiple application-level threads to exist, but the usual concerns with synchronization will re-appear.
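A rough sketch of the single-worker version of this pattern (all names are invented; notify_gui stands in for a wx.PostEvent wrapper, and queue is the Python 3 spelling of the Queue module mentioned above):

    import queue
    import threading

    class AppLogic(object):
        def __init__(self, notify_gui):
            self._queue = queue.Queue()
            self._notify_gui = notify_gui      # thread-safe, e.g. wraps wx.PostEvent
            worker = threading.Thread(target=self._run)
            worker.daemon = True
            worker.start()

        def _run(self):
            while True:
                func, args = self._queue.get()
                if func is None:               # shutdown sentinel
                    return
                self._notify_gui(func(*args))  # runs sequentially, no locks needed

        # A synchronous operation...
        def load_document(self, path):
            with open(path) as f:
                return f.read()

        # ...and its async twin, callable from the GUI thread.
        def load_document_async(self, path):
            self._queue.put((self.load_document, (path,)))

        def shutdown(self):
            self._queue.put((None, ()))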
edit - Forgot to mention: the beauty of this is that it is possible to completely decouple the application logic from the GUI code. The modularity helps if you ever decide to use a different framework or to provide a command-line version of the app. To do this, you will need an intermediate event dispatcher (application level -> GUI) that is implemented by the GUI layer.
Working with Qt/C++ for Win32.
We divide the major work units into different processes. The GUI runs as a separate process and is able to command/receive data from the "worker" processes as needed. Works nicely in todays multi-core world.
This answer doesn't apply to the OP's question regarding Python, but is more of a meta-response.
The easy way is threads. However, not every platform has pre-emptive threading (e.g. BREW, some other embedded systems). If possible, simply chunk the work and do it in the IDLE event handler.
Another problem with using threads in BREW is that it doesn't clean up C++ stack objects, so it's way too easy to leak memory if you simply kill the thread.
I use threads so the GUI's main event loop never blocks.
For some types of operations, using separate processes makes a lot of sense. Back in the day, spawning a process incurred a lot of overhead. With modern hardware this overhead is hardly even a blip on the screen. This is especially true if you're spawning a long running process.
One (arguable) advantage is that it's a simpler conceptual model than threads that might lead to more maintainable code. It can also make your code easier to test, as you can write test scripts that exercise these external processes without having to involve the GUI. Some might even argue that is the primary advantage.
In the case of some code I once worked on, switching from threads to separate processes led to a net reduction of over 5000 lines of code while at the same time making the GUI more responsive, the code easier to maintain and test, all while improving the total overall performance.