Limit CPU usage of a Python Bottle webserver

I'm using Python + Bottle as a webserver.
As I use the production server for many other websites, I don't want Python + Bottle to eat 70% of the CPU for example.
How is it possible to limit the CPU usage of a Python Bottle webserver?
I was thinking about using resource.setrlimit, but is this a good way to do it?
What syntax should I use with resource.setrlimit to limit it to, say, 20% of the CPU?

Step 1
You should ask yourself whether resource optimization is really necessary. If you're certain that a specific application consumes more resources than it could or should, go to step 2.
Step 2
When your application consumes too many resources, the first thing you should do is try to identify its bottlenecks and see whether they can be optimized away. Python has various tools that can help you (code profilers, PyPy, etc.). If there's nothing you can do in that regard, go to step 3.
Step 3
If you absolutely must limit process resources, keep in mind that:
The OS has very sophisticated scheduling mechanisms that do their best to ensure each running process gets a fair share of CPU time. Even when the processor is overloaded, things will still run fine (up to a point).
If you deliberately limit the CPU time of one of your processes, it may respond slowly or not at all, for example due to network timeouts.
My answer to this question is: reduce the static priority of your server if you think it may starve other services, but keep in mind that your server may then suffer from starvation when the processors are overloaded. Using the nice utility would be my choice, as it won't litter your code.
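For completeness: resource.setrlimit(resource.RLIMIT_CPU, ...) caps the total number of CPU seconds a process may use before it is signalled, not a percentage of the CPU, so it won't give you a "20%" cap. A minimal sketch of the nice approach, assuming a Bottle app in app.py (the route and port are placeholders):

# Preferred, with no code changes: start the server under nice from the shell:
#     nice -n 10 python app.py
# Alternatively, raise the niceness from inside the process at startup:
import os
from bottle import Bottle, run

os.nice(10)  # Unix only: lowers this process's scheduling priority

app = Bottle()

@app.route('/')
def index():
    return 'hello'

run(app, host='127.0.0.1', port=8080)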

Related

Comparing two different processes based on the real, user and sys times

I have been through other answers on SO about real, user and sys times. In this question, apart from the theory, I am interested in understanding the practical implications of the times reported by two different processes achieving the same task.
I have a Python program and a NodeJS program https://github.com/rnanwani/vips_performance. Both work on a set of input images and process them to obtain different outputs, and both use libvips implementations.
Here are the times for the two
Python
real 1m17.253s
user 1m54.766s
sys 0m2.988s
NodeJS
real 1m3.616s
user 3m25.097s
sys 0m8.494s
The real time (the wall-clock time, as per other answers) is lower for NodeJS, which as per my understanding means that the entire process from input to output finishes much quicker on NodeJS. But the user and sys times are very high compared to Python. Also, using the htop utility, I see that the NodeJS process has a CPU usage of about 360% during the entire run, maxing out the 4 cores. Python, on the other hand, has a CPU usage ranging from 250% to 120%.
I want to understand a couple of things
Does a smaller real time and a higher user+sys time mean that the process (in this case Node) utilizes the CPU more efficiently to complete the task sooner?
What is the practical implication of these times - which is faster/better/would scale well as the number of requests increase?
My guess would be that node is running more than one vips pipeline at once, whereas python is strictly running one after the other. Pipeline startup and shutdown is mostly single-threaded, so if node starts several pipelines at once, it can probably save some time, as you observed.
You load your JPEG images in random access mode, so the whole image will be decompressed to memory with libjpeg. This is a single-threaded library, so you will never see more than 100% CPU use there.
Next, you do resize/rotate/crop/jpegsave. Running through these operations, resize will thread well, with the CPU load increasing as the square of the reduction, the rotate is too simple to have much effect on runtime, and the crop is instant. Although the jpegsave is single-threaded (of course) vips runs this in a separate background thread from a write-behind buffer, so you effectively get it for free.
I tried your program on my desktop PC (six hyperthreaded cores, so 12 hardware threads). I see:
$ time ./rahul.py indir outdir
clearing output directory - outdir
real 0m2.907s
user 0m9.744s
sys 0m0.784s
That looks like we're seeing 9.7 / 2.9, or about a 3.4x speedup from threading, but that's very misleading. If I set the vips threadpool size to 1, you see something closer to the true single-threaded performance (though it still uses the jpegsave write-behind thread):
$ export VIPS_CONCURRENCY=1
$ time ./rahul.py indir outdir
clearing output directory - outdir
real 0m18.160s
user 0m18.364s
sys 0m0.204s
So we're really getting 18.1 / 2.9, or about a 6.2x speedup.
Benchmarking is difficult and real/user/sys can be hard to interpret. You need to consider a lot of factors:
Number of cores and number of hardware threads
CPU features like SpeedStep and TurboBoost, which will clock cores up and down depending on thermal load
Which parts of the program are single-threaded
IO load
Kernel scheduler settings
And I'm sure many others I've forgotten.
If you're curious, libvips has its own profiler which can help give more insight into the runtime behaviour. It can show you graphs of the various worker threads, how long they spend in synchronisation, how long in housekeeping, how long actually processing your pixels, when memory is allocated, and when it finally gets freed again. There's a blog post about it here:
http://libvips.blogspot.co.uk/2013/11/profiling-libvips.html
Does a smaller real time and a higher user+sys time mean that the process (in this case Node) utilizes the CPU more efficiently to complete the task sooner?
It doesn't necessarily mean they utilise the processor(s) more efficiently.
The higher user time means that Node is using more user-space processor time and, in turn, completes the task quicker. As stated by Luke Exton, the CPU is spending more time on "Code you wrote/might look at".
The higher sys time means there is more context switching happening, which makes sense given your htop utilisation numbers. This means the scheduler (kernel process) is jumping between operating-system actions and user-space actions. This is the time spent finding a CPU to schedule the task onto.
What is the practical implication of these times - which is faster/better/would scale well as the number of requests increase?
The question of implementation is a long one and has many caveats. I would assume from the Python vs Node numbers that the Python threads are running longer and in turn doing more processing inline. Another thing to note is the GIL in Python: essentially, Python is a single-threaded application, and you can't easily break out of this. This could be a contributing factor to the Node implementation being quicker (using real threads).
The Node implementation appears to be correctly threaded and to split many tasks out. The advantages of a highly threaded application reach a tipping point where you spend MORE time trying to find a free CPU for a new thread than actually doing the work. When that happens, your Python implementation might start being faster again.
The higher user+sys time means that the process had more running threads and, as you noticed from the 360% figure, used almost all the available CPU resources of your 4 cores. That means the NodeJS process is already limited by the available CPU resources and unable to process more requests. Also, any other CPU-intensive process you eventually run on that machine will compete with your NodeJS process. The Python process, on the other hand, doesn't take all the available CPU resources and probably could scale with the number of requests.
So these times are not reliable in and of themselves; they say how long the process took to perform an action on the CPU. This is coupled very tightly to whatever else was happening on the machine at the same time and can fluctuate wildly based entirely on physical resources.
In terms of these times specifically:
real = Wall Clock time (Start to finish time)
user = Userspace CPU time (i.e. Code you wrote/might look at) e.g. node/python libs/your code
sys = Kernel CPU time (i.e. Syscalls, e.g Open a file from the OS.)
Specifically, a small real time means it actually finished faster. Does that mean it did it better? Not necessarily; there could simply have been less happening on the machine at the same time, for instance.
In terms of scale, these numbers are a little irrelevant; it depends on the architecture and the bottlenecks. For instance, in terms of scale and specifically cloud compute, it's about efficiently allocating resources and the relevant IO for each (compute, disk, network). Does processing this image as fast as possible help with scale? Maybe? You need to examine the bottlenecks and specifics to be sure. It could, for instance, overwhelm your network link, and then you are constrained there before you hit compute limits. Or you might be constrained by how quickly you can write to disk.
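If you want to capture real/user/sys from inside a Python script rather than from the shell's time builtin, here is a minimal sketch using only the standard library (run_workload() is a hypothetical placeholder for the actual image processing):

import os
import time

wall_start = time.perf_counter()
cpu_start = os.times()

run_workload()  # hypothetical placeholder for the work being timed

cpu_end = os.times()
wall_end = time.perf_counter()

print('real %.3fs' % (wall_end - wall_start))              # wall-clock time
print('user %.3fs' % (cpu_end.user - cpu_start.user))      # user-space CPU time
print('sys  %.3fs' % (cpu_end.system - cpu_start.system))  # kernel CPU time

Note that os.times() only counts this process's CPU time; CPU time spent in child processes shows up in the children_user and children_system fields instead.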
One potentially important aspect of this, which no one has mentioned, is the fact that your library (vips) will itself launch threads:
http://www.vips.ecs.soton.ac.uk/supported/current/doc/html/libvips/using-threads.html
When libvips calculates an image, by default it will use as many threads as you have CPU cores. Use vips_concurrency_set() to change this.
This explains what initially surprised me the most: NodeJS should (to my understanding) be pretty much single-threaded, just like Python with its GIL, it being all about asynchronous processing and all.
So perhaps Python and Node bindings for vips just use different threading settings. That's worth investigating.
(that said, a quick look doesn't find any evidence of changes to the default concurrency levels in either library)
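One quick way to rule that out is to pin the libvips thread pool explicitly on the Python side before the binding initialises. A minimal sketch, assuming the pyvips binding (an assumption; libvips reads VIPS_CONCURRENCY at startup, as the shell example above shows), with placeholder filenames:

import os

# Must be set before libvips is initialised by the binding.
os.environ['VIPS_CONCURRENCY'] = '1'

import pyvips  # assumption: the Python side uses the pyvips binding

image = pyvips.Image.new_from_file('input.jpg', access='random')
image = image.resize(0.5)
image.write_to_file('output.jpg')

If both the Python and Node versions are pinned to the same concurrency, any remaining difference in user+sys time comes from the bindings themselves rather than from the libvips thread pool.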

Measure performance and time used by Python web server

I'm running a website using Bottle and Sqlite, which are both lightweight, on a small private hosting (using Linux).
I'm starting to have more and more users each day.
How can I measure whether both of them are:
running smoothly, using only < 10% of the CPU/RAM of the hosting computer
running with medium usage, an early sign that I should think about using something more production-ready for multiple users than Sqlite
running with very high CPU or RAM usage, meaning that a pipe is going to break soon... if I don't change something!
To simply measure CPU and memory use, you can log into the machine and run top or htop.
I assume, though, that you don't simply want to measure, but that you also want monitoring and alerts. In that case you'll need to run some sort of monitoring framework; which one to choose is a broader question. One popular choice that you might take a look at is Monit.
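If you would also like the application itself to record these numbers (for example, to keep a history next to your Bottle logs), here is a minimal sketch using the third-party psutil package (an assumption, not something the answer requires); the interval, file name and thresholds are placeholders to adapt:

import time
import psutil  # third-party: pip install psutil

def log_usage(interval=60):
    # Append this process's CPU and memory usage to a log file, forever.
    proc = psutil.Process()
    while True:
        cpu = proc.cpu_percent(interval=1)      # percent of one core, sampled over 1 s
        mem = proc.memory_info().rss / 1024**2  # resident memory in MiB
        with open('usage.log', 'a') as f:
            f.write('%d cpu=%.1f%% mem=%.1fMiB\n' % (time.time(), cpu, mem))
        time.sleep(interval)

You could run this in a background thread and alert once the values cross your thresholds, but for real alerting a monitoring framework such as Monit is the more robust choice.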

Optimise network bound multiprocessing code

I have a function I'm calling with multiprocessing.Pool
Like this:
from multiprocessing import Pool

def ingest_item(id):
    # goes and does a lot of network calls
    # adds a bunch to a remote db
    return None

if __name__ == '__main__':
    p = Pool(12)
    thing_ids = range(1000000)
    p.map(ingest_item, thing_ids)
The list pool.map is iterating over contains around 1 million items;
for each ingest_item() call it will call third-party services and add data to a remote PostgreSQL database.
On a 12-core machine this processes ~1,000 pool.map items in 24 hours. CPU and RAM usage are low.
How can I make this faster?
Would switching to Threads make sense as the bottleneck seems to be network calls?
Thanks in advance!
First: remember that you are performing a network-bound task. You should expect your CPU and RAM usage to be low, because the network is orders of magnitude slower than your 12-core machine.
That said, it's wasteful to have one process per request. If you start experiencing issues from starting too many processes, you might try pycurl, as suggested here: Library or tool to download multiple files in parallel
This pycurl example looks very similar to your task https://github.com/pycurl/pycurl/blob/master/examples/retriever-multi.py
It is unlikely that using threads will substantially improve performance. This is because no matter how much you break up the task all requests have to go through the network.
To improve performance you might want to see if the 3rd party services have some kind of bulk request API with better performance.
If your workload permits it, you could attempt to use some kind of caching. However, from your explanation of the task it sounds like that would have little effect, since you're primarily sending data, not requesting it. You could also consider caching open connections (if you aren't already doing so); this helps avoid the very slow TCP handshake. This type of caching is often used in web browsers (e.g. Chrome).
Disclaimer: I have no Python experience
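To illustrate the connection-caching point from the answer above, here is a minimal sketch using the requests library (an assumption; the question does not say which HTTP client is used, and the endpoint is a hypothetical placeholder). A Session keeps the TCP connection to a host alive between calls, so repeated requests skip the handshake. With multiprocessing.Pool you need one Session per worker process, which you can set up in the pool initializer:

import requests  # third-party: pip install requests

session = None

def init_worker():
    # Runs once in each worker process; connections are then reused across calls.
    global session
    session = requests.Session()

def ingest_item(id):
    # hypothetical endpoint and payload
    resp = session.post('https://thirdparty.example.com/items', json={'id': id})
    resp.raise_for_status()
    return None

# In the main block: p = Pool(12, initializer=init_worker)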

Python/WSGI: Dynamically spin up/down server worker processes across installations

The setup
Our setup is unique in the following ways:
we have a large number of distinct Django installations on a single server.
each of these has its own code base, and even runs as a separate linux user. (Currently implemented using Apache mod_wsgi, each installation configured with a small number of threads (2-5) behind a nginx proxy).
each of these installations have a significant memory footprint (20 - 200 MB)
these installations are "web apps" - they are not exposed to the general web, and will be used by a limited number of users (1 - 100).
traffic is expected to be in (small) bursts per-installation. I.e. if a certain installation becomes used, a number of follow up requests are to be expected for that installation (but not others).
As each of these processes has the potential to rack up anywhere between 20 and 200 MB of memory, the total memory footprint of the Django processes is "too large". I.e. it quickly exceeds the available physical memory on the server, leading to extensive swapping.
I see 2 specific problems with the current setup:
We're leaving the guessing of which installation needs to be in physical memory to the OS. It seems to me that we can do better. Specifically, an installation that currently gets more traffic would be better off with a larger number of ready workers. Also: installations that get no traffic for extended amounts of time could even do with 0 ready workers, as we can live with the 1-2 s for the initial request as long as follow-up requests are fast enough. A specific reason I think we can be "smarter than the OS": after a server restart on a slow day, the server is much more responsive (the difference is so great it can be observed with the naked eye). This suggests to me that the overhead of presumably swapped processes is significant even if they have not been actively serving requests for a full day.
Some requests have larger memory needs than others. A process that has once dealt with such a request has claimed the memory from the OS, but due to fragmentation will likely not be able to return it. It would be worthwhile to be able to retire memory hogs. (Currently we simply have a restart-after-n-requests configured in Apache, but this is not specifically triggered after fragmentation.)
The question:
My idea for a solution would be to have the main server spin up/down workers per installation depending on the needs per installation in terms of traffic. Further niceties:
* configure some general system constraints, i.e. once the server becomes busy be less generous in spinning up processes
* restart memory hogs.
There are many Python (WSGI) servers available. Which of them would (easily) allow for such a setup, and what are good pointers for configuring it?
See if uWSGI works for you. I don't think there is anything more flexible.
You can have it spawn and kill workers dynamically, set maximum memory usage, etc. Or you might come up with better ideas after reading their docs.
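As a rough sketch, the relevant options for one installation might look like the ini fragment below (option names are from the uWSGI documentation as I recall them; the numbers are placeholders to tune). uWSGI's Emperor mode can then manage one such vassal per installation:

[uwsgi]
# keep at least 1 worker and scale up to 8 as traffic arrives (the "cheaper" subsystem)
workers = 8
cheaper = 1
cheaper-initial = 1
# recycle a worker once its resident memory exceeds 200 MB (retires memory hogs)
reload-on-rss = 200
# also recycle workers periodically as a safety net
max-requests = 1000
# optionally shut the whole instance down after 10 minutes without requests
idle = 600
die-on-idle = true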

flup/fastcgi cpu usage under no-load conditions

I'm running Django as threaded fastcgi via flup, served by lighttpd, communicating via sockets.
What is the expected CPU usage for each FastCGI thread under no load? On startup, each thread runs at 3-4% CPU usage for a while, and then backs off to around 0.5% over the course of a couple of hours. It doesn't sink below this level.
Is this much CPU usage normal? Do I have some bug in my code that is causing the idle loop to require more processing than it should? I expected the process to use no measurable CPU when it was completely idle.
I'm not doing anything ridiculously complicated with Django, definitely nothing that should require extended processing. I realize that this isn't a lot of load, but if it's a bug I introduced, I would like to fix it.
I've looked at this with Django running as FastCGI on both Slicehost (Django 1.1, Python 2.6) and Dreamhost (Django 1.0, Python 2.5), and I can say this:
Running the top command shows the processes use a large amount of CPU to start up for ~2-3 seconds, then drop down to 0 almost immediately.
Running the ps aux command after starting the Django app shows something similar to what you describe; however, this is actually misleading. From the Ubuntu man page for ps:
CPU usage is currently expressed as the percentage of time spent running during the entire lifetime of a process. This is not ideal, and it does not conform to the standards that ps otherwise conforms to. CPU usage is unlikely to add up to exactly 100%.
Basically, the %CPU column shown by ps is actually an average over the time the process has been running. The decay you see is due to the high initial spike followed by inactivity being averaged over time.
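If you want to confirm this, sample the CPU usage over a short window instead of relying on the lifetime average that ps shows. A quick sketch using the third-party psutil package (the PID is a hypothetical placeholder):

import psutil  # third-party: pip install psutil

proc = psutil.Process(12345)           # hypothetical PID of one FastCGI process
print(proc.cpu_percent(interval=5.0))  # percent CPU measured over the next 5 seconds

An idle process should show close to 0.0 here, even though ps aux still shows the startup spike averaged over the whole lifetime.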
Your FastCGI threads should not consume any noticeable CPU if there are no requests to process.
You should investigate the load you are describing. I use the same architecture and my threads are completely idle.
