Azure Machine Learning Request Response latency - python

I have made an Azure Machine Learning Experiment which takes a small dataset (12x3 array) and some parameters and does some calculations using a few Python modules (a linear regression calculation and some more). This all works fine.
I have deployed the experiment and now want to throw data at it from the front-end of my application. The API-call goes in and comes back with correct results, but it takes up to 30 seconds to calculate a simple linear regression. Sometimes it is 20 seconds, sometimes only 1 second. I even got it down to 100 ms one time (which is what I'd like), but 90% of the time the request takes more than 20 seconds to complete, which is unacceptable.
I guess it has something to do with it still being an experiment, or it is still in a development slot, but I can't find the settings to get it to run on a faster machine.
Is there a way to speed up my execution?
Edit: To clarify: The varying timings are obtained with the same test data, simply by sending the same request multiple times. This made me conclude it must have something to do with my request being put in a queue, there is some start-up latency or I'm throttled in some other way.

First, I am assuming you are doing your timing test on the published AML endpoint.
When a call is made to the AML the first call must warm up the container. By default a web service has 20 containers. Each container is cold, and a cold container can cause a large(30 sec) delay. In the string returned by the AML endpoint, only count requests that have the isWarm flag set to true. By smashing the service with MANY requests(relative to how many containers you have running) can get all your containers warmed.
If you are sending out dozens of requests a instance, the endpoint might be getting throttled. You can adjust the number of calls your endpoint can accept by going to manage.windowsazure.com/
manage.windowsazure.com/
Azure ML Section from left bar
select your workspace
go to web services tab
Select your web service from list
adjust the number of calls with slider
By enabling debugging onto your endpoint you can get logs about the execution time for each of your modules to complete. You can use this to determine if a module is not running as you intended which may add to the time.
Overall, there is an overhead when using the Execute python module, but I'd expect this request to complete in under 3 secs.

Related

Cloud Run suddenly starts timing out when processing any request

We've been running a backend application on Cloud Run for about a year and a half now, and a month ago it suddenly stopped properly handling all requests at seemingly random times (about every couple of days), only working again once we redeploy from the latest image from Cloud Build. The application will actually receive the request, however it just doesn't do anything and eventually the request will just time out (504) after 59m59s (the max timeout), even a test endpoint that just returns 'Hello World' times out without sending a response.
The application is written in Python and uses Flask to handle requests. We have a Cloud SQL instance that is used as its database, however we're confident this is not the source of the issue as even requests that don't involve the DB in any form do not work and the Cloud SQL instance is accessible even when the application stops working. Cloud Run is deployed with the following configuration:
CPU: 2
Memory: 8Gi
Timeout: 59m59s
VPC connector
VPC egress: private-ranges-only
Concurrency: 100
The vast majority of endpoints should produce some form of log when they first start, so we're confident that the application isn't executing any of the code after being triggered. We're not seeing any useful error messages in Logs Explorer either, simply just 504 errors from the requests timing out. It's deployed with a 59m59s timeout, so it's not the case that the timeout has been entered incorrectly and even then, that wouldn't explain why it works again when it's redeployed.
We have a Cloud Scheduler schedule that triggers the application every 15 minutes, which sends to an endpoint in the application that checks if any tasks are due to run and creates Cloud Tasks tasks (which send HTTP requests to an endpoint on the same application) for any tasks that need performing at that point in time. Every time the application stops working, it does seem to be during one of these runs, however we're not certain it's the cause as the Cloud Scheduler schedule is the most frequent trigger anyway. There doesn't seem to be a specific time of day that the crashes take place either.
This is a (heavily redacted) screenshot of the logs. The Cloud Scheduler schedule hits the endpoint at 21:00 and creates a number of tasks but then hits the default 3m Cloud Scheduler timeout limit at 21:03. The tasks it created then hit the default 10m Cloud Tasks timeout limit at 21:10 without their endpoint having done anything. After that point, all requests to the service timeout without doing anything.
The closest post I could find on SO was this one, their problem is also temporarily fixed by redeployment, however ours isn't sending 200 responses when it stops working and is instead just timing out without doing anything. We've tried adding retries to Cloud Scheduler + increasing its timeout limit, and we've also tried increasing the CPU and RAM allocation.
Any help is appreciated!

How can I schedule or queue api calls to maintain rate limit?

I am trying to continuously crawl a large amount of information from a site using the REST api they provide. I have following constraints-
Stay within api limit (5 calls/sec)
Utilising the full limit (making exactly 5 calls per second, 5*60 calls per minute)
Each call will be with different parameters (params will be fetched from db or in-memory cache)
Calls will be made from AWS EC2 (or GAE) and processed data will be stored in AWS RDS/DynamoDB
For now I am just using a scheduled task that runs a python script every minute- and the script makes 10-20 api calls-> processes response-> stores data to DB. I want to scale this procedure (make 5*60= 300 calls per minute) and make it manageable via code (pushing new tasks, pause/resuming them easily, monitoring failures, changing call frequency).
My question is- what are the best available tools to achieve this? Any suggestion/guidance/link is appreciated.
I do know the names of some task queuing frameworks like Celery/RabbitMQ/Redis, but I do not know much about them. However I am wiling to learn one or each of those if these are the best tools to solve my problem, want to hear from SO veterans before jumping in ☺
Also please let me know if there's any other AWS service I should look to use (SQS or AWS Data Pipeline?) to make any step easier.
You needn't add an external dependency just for rate-limiting, as your use case is rather straightforward.
I can think of two options:
Modify the script (that currently wakes up every minute and makes 10-20 API calls) to wake up every second and make 5 calls (sequentially or in parallel).
In your current design, your API calls might not be properly distributed across 1 minute, i.e. you might be making all your 10-20 calls in the first, say, 20 seconds.
If you change that script to run every second, your API call rate will be more balanced.
Change your Python script to a long running daemon, and use a Rate Limiter library, such as this. You can configure the latter to make 1 call per x seconds.

Extreme django performance issues after launching

We have recently launched a django site which amongst other things, has a screen representing all sorts of data. A request to the server is sent every 10 seconds to get new data. The average response size is 10kb.
The site is working on approx. 30 clients, meaning every client sends a get request every 10 seconds.
When locally testing, responses came back after 80ms. After deployment with 30~ users, we're taking up to 20 seconds!!
So the initial thought is that my code sucks. I went through all my queries and did everything i can to optimize then and reduce calls to the database (which was hard, nearly everything is somwething like object.filter(id=num) and my tables have less thab 5k rows atm...)
But then i noticed the same issue occurs in the admin panel! Which is clearly optimized and doesn't have my perhaps inefficient code, since I didn't write it. Opening the users tab takes 30 seconds at certain requests!!
So, what is it? Do I argue with the company sysadmins and demand a better server? They say we dont need better hardware (running on dual core 2.67ghz and 4gb ram, which isnt a lot, but still shouldn't be THAT slow)
Doesn't the fact that the admin site is slow imply that this is a hardware issue?

GAE: how to quantify Frontend Instance Hours usage?

We are developing a Python server on Google App Engine that should be capable of handling incoming HTTP POST requests (around 1,000 to 3,000 per minute in total). Each of the requests will trigger some datastore writing operations. In addition we will write a web-client as a human-usable interface for displaying and analyse stored data.
First we are trying to estimate usage for GAE to have at least an approximation about the costs we would have to cover in future based on the number of requests. As for datastore write operations and data storage size it is fairly easy to come up with an approximate number, though it is not so obvious for the frontend and backend instance hours.
As far as I understood each time a request is coming in, an instance is being started which then is running for 15 minutes. If a request is coming in within these 15 minutes, the same instance would have been used. And now it is getting a bit tricky I think: if two requests are coming in at the very same time (which is not so odd with 3,000 requests per minute), is Google firing up another instance, hence Google would count an addition of (at least) 0.15 instance hours? Also I am not quite sure how a web-client that is constantly performing read operations on the datastore in order to display and analyse data would increase the instance hours.
Does anyone know a reliable way of counting instance hours and creating meaningful estimations? We would use that information to know how expensive it would be to run an application on GAE in comparison to just ordering a web server.
There's no 100% sure way to assess the number of frontend instance hours. An instance can serve more than one request at a time. In addition, the algorithm of the scheduler (the system that starts the instances) is not documented by Google.
Depending on how demanding your code is, I think you can expect a standard F1 instance to hold up to 5 requests in parallel, that's a maximum. 2 is a safer bet.
My recommendation, if possible, would be to simulate standard interaction on your website with limited number of users, and see how the number of instances grow, then extrapolate.
For example, let's say you simulate 100 requests per minute during 2 hours, and you see that GAE spawns 5 instances for that, then you can extrapolate that a continuous load of 3000 requests per minute would require 150 instances during the same 2 hours. Then I would double this number for safety, and end up with an estimate of 300 instances.

Help me pick a webservice platform to scale up an existing python webservice

I have a simple webservice that I need to scale up substantially.
I'm trying to decide where to go amongst the various web frameworks, load balancers, app servers (e.g Mongrel2, Tornado, and nginx, mod_proxy).
I have an existing Python app (currently exposed via BaseHTTPServer) that accepts some JSON data (about 900KB per request), and returns some JSON data (about 1k). The processing is algorithmic and done in a mixture of Python and some C (via Cython).
This is heavily optimized already (down to 1.1 seconds per-job from >1hour). But I can optimise that no further. While I rewrite in something a bit more thread-friendly, I need to scale things out horizontally (ec2 maybe).
There is no session or state, but the startup time of the app is quite slow (even with pickling and cashing). It takes about 3 seconds to load all the source data. Once running it takes about 1.1 seconds per request. I
Maybe I could spin up a number of copies and then reverse proxy them? Maybe I could do some funky worker pool in one of those frameworks? But I'm still in the unknown unknowns here.
First, you should decouple your webservice layer from number crunching. Use external job queue (for example http://celeryproject.org/), to offload web frontend. Then you can scale each part interdependently.
You should look for IaaS-type cloud providers (EC2, Rackspace, Linode, Softlayer etc), where you should be able to add nodes automatically (preferred way would be to spin up some preconfigured image to minimize node setup time).

Categories

Resources