I'm just starting to work on a tornado application that is having some CPU issues. The CPU time will monotonically grow as time goes by, maxing out the CPU at 100%. The system is currently designed to not block the main thread. If it needs to do something that blocks and asynchronous drivers aren't available, it will spawn another thread to do the blocking operation.
Thus we have the main thread being almost totally CPU-bound and a bunch of other threads that are almost totally IO-bound. From what I've read, this seems to be the perfect way to run into problems with the GIL. Plus, my profiling shows that we're spending a lot of time waiting on signals (which I'm assuming is what __semwait_signal is doing), which is consistent with the effects the GIL would have in my limited understanding.
If I use sys.setcheckinterval to set the check interval to 300, the CPU growth slows down significantly. What I'm trying to determine is whether I should increase the check interval further, leave it at 300, or be wary of raising it at all. I can see that CPU usage improves, but I'm a bit concerned that a larger interval will negatively impact the system's responsiveness.
Of course, the correct answer is probably that we need to rethink our architecture to take the GIL into account. But that isn't something that can be done immediately. So how do I determine the appropriate course of action to take in the short-term?
The first thing I would check for would be to ensure that you're properly exiting threads. It's very hard to figure out what's going on with just your description to go from, but you use the word "monotonically," which implies that CPU use is tied to time rather than to load.
You may very well be running into the threading limits of Python, but in that case CPU usage should vary up and down with load (the number of active threads), and the context-switching cost should drop as those threads exit. Is there some reason for a thread, once created, to live forever? If that's the case, prioritize that rearchitecture. Otherwise, the short-term step is to figure out why CPU usage is tied to time and not load. That pattern implies that each new thread has a permanent, irreversible cost in your system, meaning it never exits.
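One way to check whether threads really exit is to compare the number of live threads before, during, and after a burst of work. The following is only a minimal sketch: `thread_snapshot` and `blocking_job` are hypothetical names, and `time.sleep` stands in for whatever blocking call the application hands to a thread.

```python
import threading
import time


def thread_snapshot():
    """Return (count, names) for the threads that are currently alive."""
    alive = threading.enumerate()
    return len(alive), sorted(t.name for t in alive)


def blocking_job(seconds):
    # Stand-in for a blocking call that has no asynchronous driver.
    time.sleep(seconds)


if __name__ == "__main__":
    baseline, _ = thread_snapshot()
    workers = [threading.Thread(target=blocking_job, args=(0.1,)) for _ in range(5)]
    for w in workers:
        w.start()
    print("while busy:", thread_snapshot()[0] - baseline, "extra threads")
    for w in workers:
        w.join()
    # If this number keeps growing over time in a real app, threads are leaking.
    print("after join:", thread_snapshot()[0] - baseline, "extra threads")
```

Logging the snapshot periodically in the running application would show whether the thread count follows load or simply climbs with uptime.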
I have an asyncio based program which has very inconsistent CPU load. I need to do some relatively computation intensive things to fill up a buffer which the program reads from. However, if I do this while there's high load, I may end up causing the latency-sensitive parts to be slower than I'd like, as the "precompute the stuff" coroutine will be hogging a lot of CPU time. There are also coroutines that must run frequently (handling heartbeats for a websocket connection), so if this preprocessing takes too long those will die.
One solution I've come up with is to simply do this in another process which has lower priority, but if I could keep this all in a single program I'd be much happier. What is a good design for handling this sort of situation?
I have a quad core i7 920 CPU. It is Hyperthreaded, so the computer thinks it has 8 cores.
From what I've read on the interweb, when doing parallel tasks, I should use the number of physical cores, not the number of hyper threaded cores.
So I have done some timings, and was surprised that using 8 threads in a parallel loop is faster than using 4 threads.
Why is this? My example code is too long to post here, but can be found by running the example here: https://github.com/jsphon/MTVectorizer
A chart of the performance is here:
(Intel) hyperthreaded cores act like (up to) two CPUs.
The observation is that a single CPU has a set of resources that are ideally busy continuously, but in practice sit idle surprisingly often while the CPU waits for some external event, typically memory reads or writes.
By adding a bit of additional state information for another hardware thread (e.g., another copy of the registers + additional stuff), the "single" CPU can switch its attention to executing the other thread when the first one blocks. (One can generalize this to N hardware threads, and other architectures have done this; Intel quit at 2.)
If both hardware threads spend their time waiting for various events, the CPU can arguably do the corresponding processing for both hardware threads. 40 nanoseconds for a memory wait is a long time. So if your program fetches lots of memory, I'd expect it to look as if both hardware threads were fully effective, e.g., you should get nearly 2x.
If the two hardware threads are doing work that is highly local (e.g., intense computations in just the registers), then internal waits become minimal and the single CPU can't switch fast enough to service both hardware threads as fast as they generate work. In this case, performance will degrade.
I don't recall where I heard it, and I heard this a long time ago: under such circumstances the net effect is more like 1.3x than the idealized 2x. (Expecting the SO audience to correct me on this).
Your application may switch back and forth in its needs depending on which part is running at the moment. Then you will get a mix of performance. I'm happy with any speed up I can get.
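Since the result depends on the workload, the simplest check is to time the same batch of work with different thread counts. This is a rough sketch, not the asker's benchmark: `work_item` and `time_with_workers` are hypothetical names, and the choice of `hashlib.pbkdf2_hmac` rests on the assumption that it releases the GIL while hashing, so that Python threads can actually run in parallel. Note that `os.cpu_count()` reports logical CPUs, so it includes hyperthreaded ones.

```python
import hashlib
import os
import time
from concurrent.futures import ThreadPoolExecutor


def work_item(i):
    # Illustrative CPU-heavy call; assumed to release the GIL while hashing.
    return hashlib.pbkdf2_hmac("sha256", b"payload", str(i).encode(), 20_000)


def time_with_workers(n_workers, n_items=32):
    """Return the wall-clock seconds needed to finish n_items with n_workers threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(work_item, range(n_items)))
    return time.perf_counter() - start


if __name__ == "__main__":
    logical = os.cpu_count() or 1   # logical CPUs, hyperthreads included
    for n in sorted({1, max(1, logical // 2), logical}):
        print(f"{n} threads: {time_with_workers(n):.3f} s")
```

A memory-heavy workload and a register-heavy one may well rank the thread counts differently, which is the point made above.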
Ira Baxter has explained your question pretty well, but I want to add one more thing (can't comment on his answer cuz not enough rep yet): there is an overhead to switching from one thread to another. This process, called context switching (http://wiki.osdev.org/Context_Switching#Hardware_Context_Switching), requires at minimum your CPU core to change its registers to reflect data in the new thread. This cost is significant if you are doing process-level context switching, but gets quite a bit cheaper when you are doing thread-level switching. This means 2 things:
1) Hyper threading will never give you the theoretical 2x performance boost because the cost of context switching is non-trivial. This is also why highly local threads degrade performance, per Ira: frequent context switching multiplies that cost.
2) 8 single-threaded processes will run slower than 4 double-threaded processes doing the same work. Thus, you should make use of Python's thread library, or the awesome greenlet library (https://greenlet.readthedocs.org/en/latest/) if you plan on doing multithreading work.
My wx GUI shows thumbnails, but they're slow to generate, so:
The program should remain usable while the thumbnails are generating.
Switching to a new folder should stop generating thumbnails for the old folder.
If possible, thumbnail generation should make use of multiple processors.
What is the best way to do this?
Putting the thumbnail generation in a background thread with threading.Thread will solve your first problem, making the program usable.
If you want a way to interrupt it, the usual way is to add a "stop" variable which the background thread checks every so often (e.g., once per thumbnail), and the GUI thread sets when it wants to stop it. Ideally you should protect this with a threading.Condition. (The condition isn't actually necessary in most cases—the same GIL that prevents your code from parallelizing well also protects you from certain kinds of race conditions. But you shouldn't rely on that.)
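A minimal sketch of the stop-flag idea follows. It swaps in `threading.Event`, which bundles the flag and its locking, rather than the `threading.Condition` mentioned above. The function names and the `time.sleep` placeholder for the thumbnail work are assumptions made for illustration.

```python
import threading
import time


def generate_thumbnails(paths, stop_event, done):
    """Process paths one by one, checking the stop flag once per thumbnail."""
    for path in paths:
        if stop_event.is_set():
            break
        time.sleep(0.01)           # placeholder for the real thumbnail work
        done.append(path)


def start_generation(paths):
    """Start a background worker; return it with its stop flag and result list."""
    stop_event = threading.Event()
    done = []
    worker = threading.Thread(
        target=generate_thumbnails, args=(paths, stop_event, done), daemon=True
    )
    worker.start()
    return worker, stop_event, done


if __name__ == "__main__":
    worker, stop_event, done = start_generation([f"img{i}.png" for i in range(1000)])
    time.sleep(0.1)                 # the user browses for a moment...
    stop_event.set()                # ...then switches to another folder
    worker.join()
    print(f"generated {len(done)} thumbnails before stopping")
```

In a wx program the worker would also need to hand finished thumbnails back to the GUI thread, for instance with `wx.CallAfter`, instead of appending to a plain list.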
For the third problem, the first question is: Is thumbnail generation actually CPU-bound? If you're spending more time reading and writing images from disk, it probably isn't, so there's no point trying to parallelize it. But, let's assume that it is.
First, if you have N cores, you want a pool of N threads, or N-1 if the main thread has a lot of work to do too, or maybe something like 2N or 2N-1 to trade off a bit of best-case performance for a bit of worst-case performance.
However, if that CPU work is done in Python, or in a C extension that nevertheless holds the Python GIL, this won't help, because most of the time, only one of those threads will actually be running.
One solution to this is to switch from threads to processes, ideally using the standard multiprocessing module. It has built-in APIs to create a pool of processes, and to submit jobs to the pool with simple load-balancing.
The problem with using processes is that you no longer get automatic sharing of data, so that "stop flag" won't work. You need to explicitly create a flag in shared memory, or use a pipe or some other mechanism for communication instead. The multiprocessing docs explain the various ways to do this.
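One possible shape for a process pool with a shared stop flag is sketched below. It assumes the flag is a `multiprocessing.Event` handed to each worker through the pool's `initializer`. `init_worker`, `make_thumbnail`, and the string suffix that stands in for image work are hypothetical.

```python
import multiprocessing as mp

_stop = None   # set in each worker process by init_worker


def init_worker(stop_event):
    # Runs once in every pool process; stores the shared flag in a global.
    global _stop
    _stop = stop_event


def make_thumbnail(path):
    """Return a thumbnail name, or None if cancellation was requested."""
    if _stop is not None and _stop.is_set():
        return None
    return path + ".thumb"         # placeholder for real image work


if __name__ == "__main__":
    stop_event = mp.Event()
    with mp.Pool(processes=mp.cpu_count(), initializer=init_worker,
                 initargs=(stop_event,)) as pool:
        print(pool.map(make_thumbnail, ["a.png", "b.png", "c.png"]))
        stop_event.set()            # e.g. the user switched folders
        print(pool.map(make_thumbnail, ["d.png"]))
```

The `if __name__ == "__main__":` guard matters on Windows, where each worker re-imports the main module when it starts.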
You can actually just kill the subprocesses. However, you may not want to do this. First, unless you've written your code carefully, it may leave your thumbnail cache in an inconsistent state that will confuse the rest of your code. Also, if you want this to be efficient on Windows, creating the subprocesses takes some time (not as in "30 minutes" or anything, but enough to affect the perceived responsiveness of your code if you recreate the pool every time a user clicks a new folder), so you probably want to create the pool before you need it, and keep it for the entire life of the program.
Other than that, all you have to get right is the job size. Hopefully creating one thumbnail isn't too big of a job—but if it's too small of a job, you can batch multiple thumbnails up into a single job—or, more simply, look at the multiprocessing API and change the way it batches jobs when load-balancing.
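Batching by hand can be as simple as slicing the list of paths into groups and treating each group as one job. This is a sketch with hypothetical helper names. For comparison, `multiprocessing.Pool.map` and `Pool.imap` accept a `chunksize` argument that does a similar grouping for you.

```python
def batch(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def make_thumbnails(paths):
    # One job = one batch; the string suffix stands in for real image work.
    return [p + ".thumb" for p in paths]


if __name__ == "__main__":
    paths = [f"img{i}.png" for i in range(10)]
    for job in batch(paths, 4):
        print(make_thumbnails(job))
```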
Meanwhile, if you go with a pool solution (whether threads or processes), if your jobs are small enough, you may not really need to cancel. Just drain the job queue—each worker will finish whichever job it's working on now, but then sleep until you feed in more jobs. Remember to also drain the queue (and then maybe join the pool) when it's time to quit.
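Draining can be sketched with a plain `queue.Queue` feeding a worker thread. The names `worker` and `drain`, the `None` sentinel used for shutdown, and the string suffix standing in for thumbnail work are all assumptions for illustration.

```python
import queue
import threading


def worker(jobs, results):
    """Take paths off the queue until the None sentinel arrives."""
    while True:
        path = jobs.get()
        if path is None:                    # sentinel: time to quit
            jobs.task_done()
            break
        results.append(path + ".thumb")     # placeholder for real work
        jobs.task_done()


def drain(jobs):
    """Discard every job that no worker has picked up yet; return how many."""
    dropped = 0
    while True:
        try:
            jobs.get_nowait()
        except queue.Empty:
            return dropped
        jobs.task_done()
        dropped += 1


if __name__ == "__main__":
    jobs, results = queue.Queue(), []
    t = threading.Thread(target=worker, args=(jobs, results))
    t.start()
    for i in range(100):
        jobs.put(f"old{i}.png")
    print("dropped", drain(jobs), "stale jobs")   # folder switch
    jobs.put("new0.png")
    jobs.put(None)                                # quit: stop the worker
    t.join()
    print(len(results), "thumbnails generated")
```

The job a worker has already taken still finishes, so only the jobs still waiting in the queue are discarded.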
One last thing to keep in mind is that if you successfully generate thumbnails as fast as your computer is capable of generating them, you may actually cause the whole computer—and therefore your GUI—to become sluggish and unresponsive. This usually comes up when your code is actually I/O bound and you're using most of the disk bandwidth, or when you use lots of memory and trigger swap thrash, but if your code really is CPU-bound, and you're having problems because you're using all the CPU, you may want to either use 1 fewer core, or look into setting thread/process priorities.
Background
I'm a bit new to developing and had a general Python/programming question. If you have a recursive method, what is involved in enabling multiple threads or multiprocessing? I've done some light reading and tried a few examples, but they seem to apply the syntax to new code (and to tasks that are not very CPU-intensive). I'm wondering how I would redesign existing code to do this.
Say I have something that's CPU-intensive (it basically keeps adding to itself until a limit is hit):
def adderExample(sum, number):
    # Keeps adding an increasing number to the running sum until it passes 1000.
    if sum > 1000:
        print('sum is larger than 1000. Stopping')
    else:
        sum = sum + number
        print(sum)
        number = number + 1
        adderExample(sum, number)

adderExample(0, 0)
Question(s)/Thought process
How would I approach this to make it run faster, assuming I have multiple cores available? (I eventually want it to span machines, but I think that's a separate issue involving Hadoop, so I'll keep this example to one system with multiple CPUs.) It seems threading isn't the best choice (because of the time it takes to spawn new threads); if that's true, should I focus only on multiprocessing? If so, can recursions be split across different CPUs (via queues, I assume, and then rejoined once done)? Can I create multiple threads for each process and then split those processes over multiple CPUs? Lastly, is the recursion depth limit an overall limit, or is it per thread/process? If the latter, does multiprocessing/threading get around it?
Another (related) question: how do the people who try to crack codes (RSA, wireless keys, etc.) via brute force overcome this problem? I assume they are scaling their mathematical processes over multiple CPUs somehow. This or any other example to build my understanding would be great.
Any tips/suggestions would be great
Thanks!
Such a loop wouldn't benefit much at all from threading. Consider that you're doing a series of additions, whose intermediate values depend on the previous iterations. This can't be parallelized, because the threads would be stomping on each other's values and overwriting things. You can lock the data so only one thread works on it at a time, but then you lose any benefit of having multiple threads working on that data.
Threads work best when they have independent data sets. A graphics renderer is a perfect example. Each thread renders a subset of the larger image. The threads may share common data sources for texture/vertex/color/etc. data, but each thread has its own little section of the total image to work on, and doesn't touch other areas of the image. Whatever thread #1 does on its little section of pixels won't affect what thread #2 is doing elsewhere in the image.
For your related question, password cracking is another example where threading/multiprocessing makes sense. Each thread goes off on its own testing multiple possible passwords against one common "to be cracked" list. What one thread is doing doesn't affect any of the other cracker threads, unless you get a match, which may mean all threads abort since the job is "done".
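To illustrate the independent-slices idea, here is a toy sketch that brute-forces a three-letter lowercase password by giving each worker process one first letter to try. Everything in it is an assumption for illustration (the SHA-256 target, the three-letter keyspace, the names `search_chunk` and `crack`); real crackers are far more elaborate.

```python
import hashlib
import itertools
import string
from concurrent.futures import ProcessPoolExecutor


def search_chunk(args):
    """Try every 3-letter candidate starting with first_letter; return a match or None."""
    target_hex, first_letter = args
    for tail in itertools.product(string.ascii_lowercase, repeat=2):
        candidate = first_letter + "".join(tail)
        if hashlib.sha256(candidate.encode()).hexdigest() == target_hex:
            return candidate
    return None


def crack(target_hex, executor):
    # Each first letter is an independent slice of the search space.
    chunks = [(target_hex, letter) for letter in string.ascii_lowercase]
    for found in executor.map(search_chunk, chunks):
        if found is not None:
            return found
    return None


if __name__ == "__main__":
    target = hashlib.sha256(b"gil").hexdigest()
    with ProcessPoolExecutor() as pool:
        print(crack(target, pool))
```

No chunk needs a result from any other chunk, which is exactly what the recursive adder lacks.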
Once threads become interdependent, you lose a lot of the benefits of having multiple threads. They'll spend more time waiting for each other to finish than they'll spend doing actual work. Of course, this doesn't mean you should never use threads. Sometimes it does make sense to have multiple threads, even if they are interdependent, e.g. a graphics thread + sound effects thread + action processor thread + A.I. calculations thread, etc. in a game. Each one is nominally dependent on the others, but while the sound thread is busy generating the bang+ricochet audio for the gun the player just shot, the A.I. thread is off calculating what the game's mobs are doing, the graphics thread is drawing some clouds in the background, etc.
Threading kinda sorta implies multiple stacks; recursion implies a single stack. That said, you can combine the concepts: when you get to the recurse-left, recurse-right part, spawn threads for the sub-problems if the current count of threads is "low", and do straight recursion otherwise.
But regular Python is not a good language for this pattern. In regular Python (CPython) only one thread executes Python code at a time, so you won't actually pick up any multiprocessing goodness.
Phunctor is correct that the threading library is a poor choice for parallelizing this type of problem, due to the "Global Interpreter Lock" that prevents multiple threads from executing Python code in parallel.
Where the threading library can be highly useful, though, is when each thread's code spends a lot of time waiting for I/O to happen. So, for example, if you're implementing a server that has to hit the disk or wait on a network response, servicing a request in each thread can be very efficient, since the threading library can favor the ones that are not waiting on I/O and thus maximize use of the Python interpreter. (In a single thread, you'd have to use a tight loop checking the statuses of your I/O requests, which would tend to be wasteful as load got high.)
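A small sketch makes the I/O-bound case concrete. Here `time.sleep` stands in for a disk read or network round trip (an assumption; real I/O timing will vary), and `fake_request`, `run_serial`, and `run_threaded` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(n):
    time.sleep(0.2)     # stands in for a disk read or network round trip
    return n * 2


def run_serial(items):
    return [fake_request(i) for i in items]


def run_threaded(items, workers=4):
    # While one thread waits on "I/O", the others are free to run.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_request, items))


if __name__ == "__main__":
    for fn in (run_serial, run_threaded):
        start = time.perf_counter()
        fn(range(4))
        print(fn.__name__, round(time.perf_counter() - start, 2), "s")
```

The threaded version should finish in roughly the time of one wait, because the waits overlap instead of adding up.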
I am developing some Python code for Windows. One criterion is that it will use less than 1% of CPU. I understand that it is impossible to guarantee this all the time due to things like garbage collection, but what would be the best practice to get as close as possible? My current solution is to spread a lot of time.sleep(0.1) calls around the code, especially in loops. There are, however, obvious problems with this approach.
What other approaches could be taken?
I should also mention that the application has lots of threads in it using the threading library.
EDIT: Setting the process priority is not what I am after.
It is the job of the operating system to schedule CPU time. Use your operating system's built-in process-limits mechanisms (hopefully they exist on Windows) to restrict your process to <1% CPU.
This style of sprinkling unnecessary sleeps every few lines will make the code terrible to create, extend, and maintain, not to mention incredibly inelegant. (Rate-limiting yourself may be useful in very small, limited, critical sections. For example, if your program is queuing lots of IO requests and you don't wish to inundate the operating system, you might put a single sleep-until-[condition] in each critical loop that has the potential to inundate the system. Otherwise, use sleeps extremely sparingly.)
Ideally you would call an API to the appropriate OS mechanisms from within your program when you start up, telling the OS to throttle you appropriately.
If the goal is to not bother the user then "below 1% CPU" is the wrong approach. What you really want is "don't take time away from other processes but still complete as fast as possible" - that's what "below normal" process priority is for. See http://code.activestate.com/recipes/496767-set-process-priority-in-windows/ for an example of how process priority can be changed for the current process (calling that function with default parameters will do).
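For reference, a below-normal priority can also be requested without third-party packages. The sketch below is not the linked recipe: it is a stand-in that calls the Win32 `SetPriorityClass` function through `ctypes` and falls back to `os.nice` on POSIX systems. The function name `lower_priority` and the niceness increment of 5 are arbitrary choices, and the Windows branch should be treated as an untested outline.

```python
import os
import sys


def lower_priority():
    """Ask the OS to run this process below normal priority."""
    if sys.platform == "win32":
        import ctypes
        from ctypes import wintypes

        BELOW_NORMAL_PRIORITY_CLASS = 0x00004000
        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        # Declare the signatures so the process handle is not truncated.
        kernel32.GetCurrentProcess.restype = wintypes.HANDLE
        kernel32.SetPriorityClass.argtypes = [wintypes.HANDLE, wintypes.DWORD]
        kernel32.SetPriorityClass.restype = wintypes.BOOL
        if not kernel32.SetPriorityClass(
            kernel32.GetCurrentProcess(), BELOW_NORMAL_PRIORITY_CLASS
        ):
            raise ctypes.WinError(ctypes.get_last_error())
    else:
        os.nice(5)   # POSIX: raise niceness, i.e. lower priority


if __name__ == "__main__":
    lower_priority()
    print("now running at reduced priority")
```

Calling it once at startup is enough, since the setting applies to the whole process, including its threads.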
For the sales pitch you can show the task manager while the computer is idle ("See? 99%, my application gets lots of work done") and then start some CPU-intensive application ("Almost all CPU time is spent in the application the user is working with, my application simply went into background").
If the box used for the demonstration is a Windows Server, it can use Windows System Resource Manager for restricting CPU usage below the desired threshold. Trying to force this behavior by code is impossible, unless a Windows API exposes this capability explicitly.