CherryPy and concurrency - python

I'm using CherryPy in order to serve a python application through WSGI.
I tried benchmarking it, but it seems as if CherryPy can only handle exactly 10 req/sec. No matter what I do.
Built a simple app with a 3 second pause, in order to accurately determine what is going on... and I can confirm that the 10 req/sec has nothing to do with the resources used by the python script.
__
Any ideas?

By default, CherryPy's builtin HTTP server will use a thread pool with 10 threads. If you are still using the defaults, you could try increasing this in your config file.
[global]
server.thread_pool = 30
See the cpserver documentation
Or the archive.org copy of the old documentation

This was extremely confounding for me too. The documentation says that CherryPy will automatically scale its thread pool based on observed load. But my experience is that it will not. If you have tasks which might take a while and may also use hardly any CPU in the mean time, then you will need to estimate a thread_pool size based on your expected load and target response time.
For instance, if the average request will take 1.5 seconds to process and you want to handle 50 requests per second, then you will need 75 threads in your thread_pool to handle your expectations.
In my case, I delegated the heavy lifting out to other processes via the multiprocessing module. This leaves the main CherryPy process and threads at idle. However, the CherryPy threads will still be blocked awaiting output from the delegated multiprocessing processes. For this reason, the server needs enough threads in the thread_pool to have available threads for new requests.
My initial thinking is that the thread_pool would not need to be larger than the multiprocessing pool worker size. But this turns out also to be a misled assumption. Somehow, the CherryPy threads will remain blocked even where there is available capacity in the multiprocessing pool.
Another misled assumption is that the blocking and poor performance have something to do with the Python GIL. It does not. In my case I was already farming out the work via multiprocessing and still needed a thread_pool sized on the average time and desired requests per second target. Raising the thread_pool size addressed the issue. Although it looks like and incorrect fix.
Simple fix for me:
cherrypy.config.update({
'server.thread_pool': 100
})

Your client needs to actually READ the server's response. Otherwise the socket/thread will stay open/running until timeout and garbage collected.
use a client that behaves correctly and you'll see that your server will behave too.

Related

Concurrency: multiprocessing, threading, greenthreads and asyncio

I'm currently working on Python project that receives a lot os AWS SQS messages (more than 1 million each day), process these messages, and send then to another SQS queue with additional data. Everything works fine, but now we need to speed up this process a lot!
From what we have seen, or biggest bottleneck is in regards to HTTP requests to send and receive messages from AWS SQS api. So basically, our code is mostly I/O bound due to these HTTP requests.
We are trying to escalate this process by one of the following methods:
Using Python's multiprocessing: this seems like a good idea, but our workers run on small machines, usually with a single core. So creating different process may still give some benefit, since the CPU will probably change process as one or another is stuck at an I/O operation. But still, that seems a lot of overhead of process managing and resources for an operations that doesn't need to run in parallel, but concurrently.
Using Python's threading: since GIL locks all threads at a single core, and threads have less overhead than processes, this seems like a good option. As one thread is stuck waiting for an HTTP response, the CPU can take another thread to process, and so on. This would get us to our desired concurrent execution. But my question is how dos Python's threading know that it can switch some thread for another? Does it knows that some thread is currently on an I/O operation and that he can switch her for another one? Will this approach absolutely maximize CPU usage avoiding busy wait? Do I specifically has to give up control of a CPU inside a thread or is this automatically done in Python?
Recently, I also read about a concept called green-threads, using Eventlet on Python. From what I saw, they seem the perfect match for my project. The have little overhead and don't create OS threads like threading. But will we have the same problems as threading referring to CPU control? Does a green-thread needs to warn the CPU that it may take another one? I saw on some examples that Eventlet offers some built-in libraries like Urlopen, but no Requests.
The last option we considered was using Python's AsyncIo and async libraries such as Aiohttp. I have done some basic experimenting with AsyncIo and wasn't very pleased. But I can understand that most of it comes from the fact that Python is not a naturally asynchronous language. From what I saw, it would behave something like Eventlet.
So what do you think would be the best option here? What library would allow me to maximize performance on a single core machine? Avoiding busy waits as much as possible?

Async python smooth latency

I'm using grpc.aio.server and I am stuck with a problem that if I try to make a load test on my service it will have some requests lagging for 10 seconds, but the requests are similar. The load is stable (200rps) and the latency of almost all requests are almost the same. I'm ok with higher latency, as long as it's stable. I've tried to google something like async task priority, but in my mind it means that something is wrong with the priority of tasks which wait a very long time, but they're finished or the full request task is waiting to start for a long time.
e.g 1000 requests were sent to the gRPC service, they all have the same logic to execute, the same db instance, the same query to the db, the same time to get results from db, etc, everything is the same. I see that e.g. 10th request latency is 10 seconds, but 13th request latency is 5 seconds. I can also see in logs, that the db queries have almost the same execution time.
Any suggestions? maybe I understand something wrong
There are multiple reasons why this behaviour may happen. Here are a few things that you can take a look at:
What type of workload do you have? Is it I/O bound or CPU bound?
Does your code block the event loop at some point? The path for each request is fully asynchronous? The doc states pretty clear that blocking the event loop is costly:
Blocking (CPU-bound) code should not be called directly. For example, if a function performs a CPU-intensive calculation for 1 second, all concurrent asyncio Tasks and IO operations would be delayed by 1 second.
What happens with the memory when you see those big latencies? You can run a memory profiling using this tool and check the memory. There are high chances to see a correlation between the latency and an intense activity of the Python Memory Manager which tries to reclaim the memory. Here is a nice article around memory management that you can check out.
Every server will have latency differences between requests. However, the scale should be much lower then what you experience
Your question does not have the server initialization code so can't know what config is used. I would start by looking at the thread pool size for the server.
According to the docs the thread pool instance is a required argument so maybe try to set a different pool size. My guess is that the threads are exhausted and then the latency goes up since the request is waiting for a thread to free up

Flask: spawning a single async sub-task within a request

I have seen a few variants of my question but not quite exactly what I am looking for, hence opening a new question.
I have a Flask/Gunicorn app that for each request inserts some data in a store and, consequently, kicks off an indexing job. The indexing is 2-4 times longer than the main data write and I would like to do that asynchronously to reduce the response latency.
The overall request lifespan is 100-150ms for a large request body.
I have thought about a few ways to do this, that is as resource-efficient as possible:
Use Celery. This seems the most obvious way to do it, but I don't want to introduce a large library and most of all, a dependency on Redis or other system packages.
Use subprocess.Popen. This may be a good route but my bottleneck is I/O, so threads could be more efficient.
Using threads? I am not sure how and if that can be done. All I know is how to launch multiple processes concurrently with ThreadPoolExecutor, but I only need to spawn one additional task, and return immediately without waiting for the results.
asyncio? This too I am not sure how to apply to my situation. asyncio has always a blocking call.
Launching data write and indexing concurrently: not doable. I have to wait for a response from the data write to launch indexing.
Any suggestions are welcome!
Thanks.
Celery will be your best bet - it's exactly what it's for.
If you have a need to introduce dependencies, it's not a bad thing to have dependencies. Just as long as you don't have unneeded dependencies.
Depending on your architecture, though, more advanced and locked-in solutions might be available. You could, if you're using AWS, launch an AWS Lambda function by firing off an AWS SNS notification, and have that handle what it needs to do. The sky is the limit.
I actually should have perused the Python manual section on concurrency better: the threading module does just what I needed: https://docs.python.org/3.5/library/threading.html
And I confirmed with some dummy sleep code that the sub-thread gets completed even after the Flask request is completed.

How to tell uWSGI to prefer processes to threads for load balancing

I've installed Nginx + uWSGI + Django on a VDS with 3 CPU cores. uWSGI is configured for 6 processes and 5 threads per process. Now I want to tell uWSGI to use processes for load balancing until all processes are busy, and then to use threads if needed. It seems uWSGI prefer threads, and I have not found any config option to change this behaviour. First process takes over 100% CPU time, second one takes about 20%, and another processes are mostly not used.
Our site receives 40 r/s. Actually even having 3 processes without threads is anough to handle all requests usually. But request processing hangs from time to time for various reasons like locked shared resources, etc. In such cases we have -1 process. Users don't like to wait and click the link again and again. As a result all processes hangs and all users have to wait.
I'd add even more threads to make the server more robust. But the problem is probably python GIL. Threads wan't use all CPU cores. So multiple processes work much better for load balancing. But threads may help a lot in case of locked shared resources and i/o wait delays. A process may do much work while one of it's thread is locked.
I don't want to decrease time limits until there is no another solution. It is possible to solve this problem with threads in theory, and I don't want to show error messages to user or to make him waiting on every request until there is no another choice.
So, the solution is:
Upgrade uWSGI to recent stable version (as roberto suggested).
Use --thunder-lock option.
Now I'm running with 50 threads per process and all requests are distributed between processes equally.
Every process is effectively a thread, as threads are execution contexts of the same process.
For such a reason there is nothing like "a process executes it instead of a thread". Even without threads your process has 1 execution context (a thread). What i would investigate is why you get (perceived) poor performances when using multiple threads per process. Are you sure you are using a stable (with solid threading support) uWSGI release ? (1.4.x or 1.9.x)
Have you thought about dynamically spawning more processes when the server is overloaded ? Check the uWSGI cheaper modes, there are various algorithm available. Maybe one will fit your situation.
The GIL is not a problem for you, as from what you describe the problem is the lack of threads for managing new requests (even if from your numbers it looks you may have a too much heavy lock contention on something else)

Threaded vs. asynchronous image processing?

I have a Python function which generates an image once it is accessed. I can either invoke it directly upon a HTTP request, or do it asynchronously using Gearman. There are a lot of requests.
Which way is better:
Inline - create an image inline, will result in many images being generated at once
Asynchronous - queue jobs (with Gearman) and generate images in a worker
Which option is better?
In this case "better" would mean the best speed / load combinations. The image generation example is symbolical, as this can also be applied to Database connections and other things.
I have a Python function which
generates an image once it is
accessed. I can either invoke it
directly upon a HTTP request, or do it
asynchronously using Gearman. There
are a lot of requests.
You should not do it inside you request because then you can't throttle(your server could get overloaded). All big sites use a message queue to do the processing offline.
Which option is better?
In this case "better" would mean the
best speed / load combinations. The
image generation example is
symbolical, as this can also be
applied to Database connections and
other things.
You should do it asynchronous because the most compelling reason to do it besides it speeds up your website is that you can throttle your queue if you are on high load. You could first execute the tasks with the highest priority.
I believe forking processes is expensive. I would create a couple worker processes(maybe do a little threading inside process) to handle the load. I would probably use redis because it is fast, actively developed(antirez/pietern commits almost everyday) and has a very good/stable python client library. blpop/rpush could be used to simulate a queue(job)
If your program is CPU bound in the interpreter then spawning multiple threads will actually slow down the result even if there are enough processors to run them all. This happens because the GIL (global interpreter lock) only allows one thread to run in the interpreter at a time.
If most of the work happens in a C library it's likely the lock is not held and you can productively use multiple threads.
If you are spawning threads yourself you'll need to make sure to not create too many - 10K threads at one would be bad news - so you'd need to setup a work queue that the threads read from instead of just spawning them in a loop.
If I was doing this I'd just use the standard multiprocessing module.

Categories

Resources