grpc-Python max_workers limiting number of simultaneous processes

While using the Python gRPC server, this is the general way the server is instantiated:
server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
But with this running, if I try to run more than 10 instances of the client, which expect server streaming, the 11th one doesn't work (I am running 10 instances of the client, each connecting to this server and getting the stream).
Even if I change max_workers to None, at most it creates 40 threads (8 cores x 5, as per the documentation), so at most 40 clients can be served simultaneously in that case.
Is this the expected behavior?
I was working on my own code, but I also tried the general gRPC Python example documented here:
https://grpc.io/docs/tutorials/basic/python.html
I am able to reproduce the same issue with it.
To reproduce it, just run route_guide_server.py in one window with max_workers=4 and then try to run 4-5 different clients in different windows. The 4th client will have to wait until one of the other clients has finished. (To get a better view, add a time.sleep in the streaming yield.)
If a large number of clients (hundreds or thousands) want to access a gRPC server in Python with streaming (which should be continuous), then the additional clients will never get a chance.

Yes, this is the expected behavior.
After running my own test code: yes, if you supply None for max_workers, then 40 is the maximum. However, if I set the maximum to 100, then sure enough I can have at most 100 concurrent workers. This is expected behavior, because the thread pool is created with a fixed size based on the number of workers requested. You cannot expect that, if you don't supply max_workers, the pool will just scale up and down at run time; not without changing both gRPC and the concurrent.futures thread pool. With the way the interfaces are coupled, Python gRPC currently requires a concurrent.futures thread pool, so we must supply an argument to max_workers if we want more than 40, and it must be set when the server is constructed.
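For example, a minimal sketch of sizing the pool for more concurrent streaming clients, following the route_guide tutorial linked above (the module and servicer names, and the wait_for_termination() call, which only exists in newer grpcio releases, are assumptions about your setup):
# Sketch: size the gRPC thread pool for more concurrent streaming clients.
from concurrent import futures
import grpc

import route_guide_pb2_grpc
from route_guide_server import RouteGuideServicer  # assumption: servicer defined there

def serve():
    # Each in-flight RPC (including a long-lived server-streaming call) occupies one
    # worker thread, so max_workers is effectively the number of simultaneous clients.
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=100))
    route_guide_pb2_grpc.add_RouteGuideServicer_to_server(RouteGuideServicer(), server)
    server.add_insecure_port('[::]:50051')
    server.start()
    server.wait_for_termination()  # newer grpcio; older versions loop on time.sleep()

if __name__ == '__main__':
    serve()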

Related

How to autoscale Python webjob based on Service Bus Queues in Azure?

When a service bus queue contains any messages, I want my python webjob to scale out so the messages are processed faster.
I have a python webjob which feeds off a Service Bus Queue. The queue is populated each day at midnight and can have between 0 and around 400k messages added to it.
The bottleneck in the current processing is where some data needs to be downloaded, which means that scaling up the webjob won't help as much as parallelising it.
I scaled it out to 10 instances from 1, but that doesn't appear to affect the rate at which messages are consumed from the queue, which suggests that this isn't working the way I expect. When it was on 1 instance, it processed ~1.53k messages in an hour. In the hour since scaling out to 10 instances, it processed ~1.5k messages (so basically, no difference).
The code I'm using to interface with the queue is this (if there is a better way of doing this in Python, please let me know!):
import time

from azure.servicebus import ServiceBusService

bus_service = ServiceBusService(
    service_namespace=<namespace>,
    shared_access_key_name='RootManageSharedAccessKey',
    shared_access_key_value=<key>)

while True:
    # peek_lock=False is a destructive receive: the message is removed as soon as it is read
    msg = bus_service.receive_queue_message(<queue>, peek_lock=False, timeout=1)
    if msg.body is None:
        print("No messages in queue")
        time.sleep(5)
    else:
        number = int(msg.body.decode('utf-8'))
        print(number)
I know in C# there is a QueueTrigger attribute for webjobs but I don't know of anything similar for Python.
I would expect that the more instances running in the app service, the faster messages would be processed, so why isn't that what I see?
The bottleneck in the program was the database, which was already at maximum capacity. Adding more instances just increased the number of calls to the database and therefore slowed down each instance.
Scaling up the database and optimising the database calls improved performance, and it also means that multiple instances can now be spun up to further increase throughput.
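Once the database is no longer the limit, the download-bound work mentioned in the question can also be parallelised within a single instance. A minimal sketch, reusing the question's placeholders (<namespace>, <key>, <queue>) and a hypothetical process_message() helper standing in for the real download-and-process work:
# Sketch: process download-bound queue messages in parallel inside one instance.
import time
from concurrent.futures import ThreadPoolExecutor

from azure.servicebus import ServiceBusService

bus_service = ServiceBusService(
    service_namespace=<namespace>,
    shared_access_key_name='RootManageSharedAccessKey',
    shared_access_key_value=<key>)

def process_message(body):
    number = int(body.decode('utf-8'))
    print(number)  # hypothetical: download and process the referenced data here

with ThreadPoolExecutor(max_workers=10) as pool:
    while True:
        msg = bus_service.receive_queue_message(<queue>, peek_lock=False, timeout=1)
        if msg.body is None:
            time.sleep(5)
        else:
            pool.submit(process_message, msg.body)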

How can I handle multiple Python requests on my AWS EC2 instance?

I have a Flask app deployed on Elastic Beanstalk onto an EC2 instance on AWS. If 100 people simultaneously connected to my server, wouldn't that mean that they have to wait in a queue of 100, since the app can only handle one request at a time?
How can I make it handle more requests on the same IP address? Thanks!
The short answer is to use uWSGI or gunicorn.
The longer answer is that your intuition is correct - what you are worrying about is "concurrency", or the number of simultaneous requests your app can handle. And yes, a single Flask app without any application server can handle one request at a time. How do you change that? For most Python apps, the unit of concurrency is a process (there are frameworks that change that, but the majority of app deployments are probably process-based). That is, you run a process for each concurrent request you think you'll need. App servers like uWSGI do the listening for your app, then dispatch the request to a process from a pool. So, how many processes do you need?
The second concept you need is "throughput": how many requests can be served in a specific time, which is influenced by, but different from, "concurrency" and is where your intuition may mislead you. Let's say you have 8 processes. You may think "but I'll have 100 users, 8 is clearly not enough". Let's assume you know that each request completes in 1/8 (0.125) seconds. That means that each process can serve 8 requests a second. Times 8 processes, and your throughput will be (roughly) 64 requests per second. 8 processes get you a lot closer to your 100 users than you may have otherwise expected. Your 100 users probably won't all actually issue requests in that one-second window. Possible, but unlikely. The issue isn't really the concurrency, but whether the user gets a response in a reasonable time.
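For example, a minimal sketch of running the Flask app under gunicorn with 8 worker processes (the module and app names here are assumptions, not taken from the question):
# app.py - a minimal Flask app used only to illustrate the worker count
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "hello"

# Started with 8 worker processes, e.g.:
#   gunicorn --workers 8 --bind 0.0.0.0:8000 app:app
# Each worker handles one request at a time, so this gives a concurrency of 8.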
Hope this helps. Scaling is a wonderful topic - both straightforward and frustratingly nuanced at the same time. As your traffic increases, the above guidance will shift and you'll need more and more advanced techniques. But to get started - keep it simple and focus on the basics.
See How many concurrent requests does a single Flask process receive?

Azure Machine Learning Request Response latency

I have made an Azure Machine Learning Experiment which takes a small dataset (12x3 array) and some parameters and does some calculations using a few Python modules (a linear regression calculation and some more). This all works fine.
I have deployed the experiment and now want to throw data at it from the front-end of my application. The API-call goes in and comes back with correct results, but it takes up to 30 seconds to calculate a simple linear regression. Sometimes it is 20 seconds, sometimes only 1 second. I even got it down to 100 ms one time (which is what I'd like), but 90% of the time the request takes more than 20 seconds to complete, which is unacceptable.
I guess it has something to do with it still being an experiment, or it is still in a development slot, but I can't find the settings to get it to run on a faster machine.
Is there a way to speed up my execution?
Edit: To clarify: the varying timings are obtained with the same test data, simply by sending the same request multiple times. This made me conclude it must have something to do with my request being put in a queue, some start-up latency, or throttling of some kind.
First, I am assuming you are doing your timing test on the published AML endpoint.
When a call is made to AML, the first call must warm up the container. By default a web service has 20 containers. Each container starts cold, and a cold container can cause a large (30 sec) delay. In the string returned by the AML endpoint, only count requests that have the isWarm flag set to true. Hammering the service with MANY requests (relative to how many containers you have running) can get all of your containers warmed.
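For example, a rough warm-up and timing probe could look like the sketch below; the endpoint URL, API key, payload shape, and the exact placement of the isWarm flag in the response are assumptions, not taken from your service:
# Hypothetical warm-up/timing probe for a published AML endpoint.
import json
import time
import urllib.request

URL = "https://<your-region>.services.azureml.net/workspaces/<id>/services/<id>/execute"
API_KEY = "<api key>"
payload = json.dumps({"Inputs": {}, "GlobalParameters": {}}).encode("utf-8")

for i in range(50):  # send many requests to warm all containers
    req = urllib.request.Request(
        URL,
        data=payload,
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer " + API_KEY})
    start = time.time()
    body = urllib.request.urlopen(req).read().decode("utf-8")
    elapsed = time.time() - start
    warm = '"isWarm": true' in body  # assumption about where the flag appears
    print("request %d: %.2fs, warm=%s" % (i, elapsed, warm))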
If you are sending out dozens of requests per instance, the endpoint might be getting throttled. You can adjust the number of calls your endpoint can accept by going to manage.windowsazure.com/:
Azure ML section from the left bar
Select your workspace
Go to the Web Services tab
Select your web service from the list
Adjust the number of calls with the slider
By enabling debugging on your endpoint, you can get logs about the execution time for each of your modules. You can use this to determine whether a module is not running as you intended, which may add to the time.
Overall, there is overhead when using the Execute Python Script module, but I'd expect this request to complete in under 3 seconds.

Console output consuming much CPU? (about 140 lines per second)

I am doing my bachelor's thesis, for which I wrote a program that is distributed over many servers and exchanges messages via IPv6 multicast and unicast. The network usage is relatively high, but I think it is not too high when I have 15 servers in my test, where there are 2 requests every second that go like this:
Server 1 requests information from servers 3-15 via multicast. Every one of 3-15 must respond. If one response is missing after 0.5 sec, the multicast is resent, but only the missing servers must respond (so in most cases this is only one server).
Server 2 does exactly the same. If there are still missing results after 5 retries, the missing servers are marked as dead and the change is synced with the other server (1/2).
So there are 2 multicasts every second and 26 unicasts every second. I don't think this should be too much?
Servers 1 and 2 are running Python web servers, which I use to make the request every second on each server (via a web client).
The whole scenario is running in a Mininet environment, which is running in a VirtualBox Ubuntu VM that has 2 cores (max 2.8 GHz) and 1 GB RAM. While running the test, I see via htop that the CPUs are at 100% while the RAM is at 50%, so the CPU is the bottleneck here.
I noticed that after 2-5 minutes (1 minute = 60 * (2+26) messages = 1680 messages) there are too many missing results, causing too many sending repetitions while new requests are already coming in, so that the "management server" thinks the client servers (3-15) are down and deregisters them. After syncing this with the other management server, all client servers are marked as dead on both management servers, which is not true...
I am wondering if the problem could be my debug output. I am printing 3-5 messages for every message that is sent and received, so that is about (let's guess 5 lines per sent/received message) (26 + 2) * 5 = 140 lines printed on the console per second.
I use Python 2.6 for the servers.
So the question here is: can the console output slow down the whole system so much that simple requests take more than 0.5 seconds to complete 5 times in a row? The request processing is simple in my test, with no complex calculations or anything like that; basically it is something like "return request_param in ["bla", "blaaaa", ...]" (a small list of 5 items).
If yes, how can I disable the output completely without having to comment out every print statement? Or is there even a possibility to output only lines that contain "Error" or "Warning"? (Not via grep, because by the time grep becomes active, all the prints have already finished... I mean directly in Python.)
What else could cause my application to be that slow? I know this is a very generic question, but maybe someone already has some experience with Mininet and network applications...
I finally found the real problem. It was not the prints (removing them improved performance a bit, but not significantly) but a thread that was using a shared lock. This lock was shared across multiple CPU cores, causing the whole thing to be very slow.
It even got slower the more cores I added to the executing VM, which was very strange...
Now the new bottleneck seems to be APScheduler... I always get messages like "event missed" because there is too much load on the scheduler. So that's the next thing to speed up... :)
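As an aside, for the original side question about silencing the debug output without commenting out every print: a minimal sketch using the standard logging module (the logger name here is made up) routes the messages through a logger and raises the level so only warnings and errors reach the console.
# Sketch: replace print() calls with logging so verbosity can be dialed down globally.
import logging

logging.basicConfig(level=logging.WARNING)  # only WARNING and above reach the console
log = logging.getLogger("thesis")           # hypothetical logger name

log.debug("sent multicast to servers 3-15")    # suppressed at WARNING level
log.warning("missing response from server 7")  # still printed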

Django, sleep() pauses all processes, but only if no GET parameter?

Using Django (hosted by Webfaction), I have the following code:
import time

from django.http import HttpResponse

def my_function(request):
    time.sleep(10)
    return HttpResponse("Done")
This is executed via Django when I go to my url, www.mysite.com
I enter the URL twice, immediately one after the other. The way I see it, both of these should finish after 10 seconds. However, the second call waits for the first one and finishes after 20 seconds.
If, however, I enter some dummy GET parameter, www.mysite.com?dummy=1 and www.mysite.com?dummy=2 then they both finish after 10 seconds. So it is possible for both of them to run simultaneously.
It's as though the scope of sleep() is somehow global?? Maybe entering a parameter makes them run as different processes instead of the same???
It is hosted by Webfaction. httpd.conf has:
KeepAlive Off
Listen 30961
MaxSpareThreads 3
MinSpareThreads 1
ServerLimit 1
SetEnvIf X-Forwarded-SSL on HTTPS=1
ThreadsPerChild 5
I do need to be able to use sleep() and trust that it isn't stopping everything. So, what's up and how to fix it?
Edit: Webfaction runs this using Apache.
As Gjordis pointed out, sleep will pause the current thread. I have looked at Webfaction and it looks like they are using WSGI to run the serving instance of Django. This means that every time a request comes in, Apache will look at how many worker processes (processes that each run an instance of Django) are currently running. If there are none, or too few, it will spawn additional workers and hand the requests to them.
Here is what I think is happening in your situation:
first GET request for resource A comes in. Apache uses a running worker (or starts a new one)
the worker sleeps 10 seconds
during this, a new request for resource A comes in. Apache sees it is requesting the same resource and sends it to the same worker as the first request. I guess the assumption here is that a worker that recently processed a request for a specific resource is more likely to have some information cached/preprocessed, so it can handle this request faster
this results in a 20-second block, since there is only one worker and it waits 2 times 10 seconds
This behavior makes complete sense 99% of the time so it's logical to do this by default.
However, if you change the requested resource for the second request (by adding a GET parameter), Apache will assume that this is a different resource and will start another worker, since the first one is already "busy" (Apache cannot know that you are not doing any hard work). Since there are now two workers, both waiting 10 seconds, the total time goes down to 10 seconds.
Additionally, I assume that something is wrong with your design. There are almost no cases I can think of where it would be sensible not to respond to an HTTP request as fast as you can. After all, you want to serve as many requests as possible in the shortest amount of time, so sleeping for 10 seconds is the most counterproductive thing you can do. I would recommend that you create a new question and state what the actual goal is that you are trying to achieve. I'm pretty sure there is a more sensible solution to this!
Assuming you run your Django server just with run(), by default this creates a single-threaded server. If you use sleep in a single-threaded process, the whole application freezes for that sleep time.
It may simply be that your browser is queuing the second request to be performed only after the first one completes. If you are opening your URLs in the same browser, try using two different ones (e.g. Firefox and Chrome), or try performing the requests from the command line using wget or curl instead.
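To check this from outside the browser, a minimal sketch (the URL is a placeholder) that fires two requests concurrently from Python and times them, so the results show whether the server really serializes them:
# Sketch: issue two concurrent requests and time them, to rule out browser queuing.
import threading
import time
import urllib.request

URL = "http://www.mysite.com/"  # placeholder

def fetch(name):
    start = time.time()
    urllib.request.urlopen(URL).read()
    print("%s finished after %.1f s" % (name, time.time() - start))

threads = [threading.Thread(target=fetch, args=("request %d" % i,)) for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# If both report ~10 s, the server handles them in parallel; ~10 s and ~20 s means
# they are being serialized on the server side.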
