Celery/CloudAMQP error in a Heroku Flask App - python

I'm running a Flask app on Heroku (on the free tier) and running into some trouble when scheduling tasks using apply_async. If I schedule more than two tasks, I get a long stacktrace with the exception:
AccessRefused(403, u"ACCESS_REFUSED - access to exchange 'celeryresults' in vhost 'rthtwchf' refused for user 'rthtwchf'", (40, 10), 'Exchange.declare')
The odd thing is the first two tasks (before restarting all of my processes) always seem to complete with no issue.
A little bit of search engine sleuthing leads me to https://stackoverflow.com/questions/21071906/celery-cannot-connect-remote-worker-with-new-username, which makes it look like a permissions issue, but I'd assume that the Heroku CloudAMQP add-on would have taken care of that already.
Any advice is appreciated!

I think your connections are exceeding 3 (the free plan's limit). Set BROKER_POOL_LIMIT to 1 and it will work.
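For example, a minimal sketch of that setting in a Flask + Celery setup (CLOUDAMQP_URL is the config var the CloudAMQP add-on provides; the rest of the layout is assumed):
import os
from celery import Celery

# broker URL comes from the CloudAMQP add-on's config var
celery = Celery('tasks', broker=os.environ['CLOUDAMQP_URL'])
# keep at most one broker connection per process so the web and worker
# dynos together stay under the free plan's 3-connection cap
celery.conf.update(BROKER_POOL_LIMIT=1)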

Related

Flask Workers on Gunicorn: handling failures

I developed my site on Flask and, as it grows, I am learning and rethinking things along the way. It is a Flask app that runs behind the scenes with Gunicorn, configured with 12 gevent workers, and nginx as a reverse proxy.
As I understand, there are several reasons a worker can fail: lack of resources, sharing of resources, database connections, bad code. So my first question is: do I need to design my site to avoid worker failures at all costs, or are they expected to happen a few times a day anyway?
If there is a failure, is the way to handle the user experience to set an error handler on the server (nginx)? Because, as I understand it, when a worker fails it is not possible to handle the error with Flask.
Thanks in advance, best regards!

Celery Redis ConnectionError('max number of clients reached',)

I have a Django application leveraging Celery for asynchronous tasks. I've been running into an issue where I reach the max number of Redis connections. I am fairly new to both Celery and Redis.
I'm confused because in my config I define CELERY_REDIS_MAX_CONNECTIONS = 20, which is the limit on my Redis plan.
For experimentation, I bumped the plan up and that solved the issue. I am confused, however, that I ran into this problem at all after defining the max number of connections. I have since downgraded the plan and set the limit to the plan's max.
I am wondering if the BROKER_POOL_LIMIT needs to be changed.
Is there anything I am missing that would help solve these connection errors with Celery?
Is it possible to figure out how many connections all of my tasks need? I have 16 jobs running every minute.
Another thought: I noticed that connecting with the redis CLI threw the connection error. Is it possible that I am already at the limit and accessing the CLI is putting me over?
I also can't kill connections, because I cannot connect to the redis CLI while it throws this error.
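For reference, a sketch of the relevant old-style settings (matching the names in the question; safe values depend on how many worker processes share the plan):
# cap the connection pool each Celery process keeps to the broker
BROKER_POOL_LIMIT = 1
# cap the Redis result-backend pool (matches the 20-connection plan)
CELERY_REDIS_MAX_CONNECTIONS = 20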

Production-ready Python apps on Kubernetes

I have been deploying apps to Kubernetes for the last two years, and in my org all our apps (especially the stateless ones) run on Kubernetes. I still have a fundamental question, because very recently we found some issues with a few of our Python apps.
Initially we ran our Python apps (written in Flask and Django) with a plain python app.py. It's known that, because of the GIL, Python threads cannot execute bytecode in parallel, so such an app effectively serves one request at a time; if that one request is CPU-heavy, it cannot process further requests. This sometimes causes the health API to stop responding. We have observed that a single non-I/O request doing heavy work holds the CPU and nothing else is processed in parallel. And since only one process is busy, there is no visible increase in CPU utilization either, which breaks the HorizontalPodAutoscaler: it is unable to scale the pods.
Because of this, we started using uWSGI in our pods. uWSGI can run multiple worker processes under the hood, handle multiple requests in parallel, and automatically spin up new processes on demand. But here comes another problem: we have seen that uWSGI is slow to auto-scale processes to serve incoming requests, which causes HTTP 503 errors. Because of this we are unable to serve a few of our APIs with 100% availability.
At the same time, all our other apps, written in Node.js, Java, and Go, are giving 100% availability.
I am looking for the best way to run a Python app with 100% (99.99%) availability in Kubernetes, with the following requirements:
Having health API and liveness API served by the app
An app running in Kubernetes
If possible, without uWSGI (a single process per pod is the fundamental Docker concept)
If uWSGI is needed, are there any specific configs we can apply for a k8s environment?
We use Twisted's WSGI server with 30 threads and it's been solid for our Django application. It keeps to a single-process-per-pod model, which more closely matches Kubernetes' expectations, as you mentioned. Yes, the GIL means only one of those 30 threads can be running Python code at a time, but as with most webapps, most of those threads are blocked on I/O (usually waiting for a response from the database) the vast majority of the time. Then run multiple replicas on top of that, both for redundancy and to give you true concurrency at whatever level you need (we usually use 4-8 depending on the site's traffic; some big ones are up to 16).
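A minimal sketch of that setup (the mysite.wsgi module, port, and pool size are assumptions, not our exact config):
from twisted.internet import reactor
from twisted.web.server import Site
from twisted.web.wsgi import WSGIResource
from mysite.wsgi import application  # hypothetical Django WSGI entry point

# WSGI requests are dispatched onto the reactor's thread pool
reactor.suggestThreadPoolSize(30)
resource = WSGIResource(reactor, reactor.getThreadPool(), application)
reactor.listenTCP(8000, Site(resource))
reactor.run()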
I have exactly the same problem with a Python deployment running a Flask application. Most API calls are handled in a matter of seconds, but there are some CPU-intensive requests that hold the GIL for 2 minutes... The pod keeps accepting requests, ignores the configured timeouts, ignores a connection closed by the user; then, after 1 minute of failing liveness probes, the pod is restarted by kubelet.
So one fat request can dramatically drop the availability.
I see two different solutions:
have a separate deployment that hosts only the long-running API calls, and configure ingress to route requests between the two deployments;
use multiprocessing to handle liveness/readiness probes in a main process, while every other request is handled in a child process (see the sketch after this list);
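A minimal sketch of the second option (myapp and both ports are hypothetical):
import multiprocessing
from flask import Flask

def run_worker():
    from myapp import app  # hypothetical main Flask app with the heavy endpoints
    app.run(host='0.0.0.0', port=8000)

# tiny probe server kept in the parent process, so a CPU-bound request
# in the child can no longer block the liveness endpoint
probes = Flask(__name__)

@probes.route('/healthz')
def healthz():
    return 'ok'

if __name__ == '__main__':
    worker = multiprocessing.Process(target=run_worker, daemon=True)
    worker.start()
    probes.run(host='0.0.0.0', port=8001)  # point kubelet probes here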
There are pros and cons to each solution; maybe I will need a combination of both. Also, if I need a steady flow of Prometheus metrics, I might need to create a proxy server at the application layer (one more container in the same pod). I also need to configure ingress to keep a single upstream connection to the Python pods, so that long-running requests are queued while short ones are processed concurrently (yep, Python, concurrency, good joke). Not sure, though, that it will scale well with HPA.
So yeah, running a production-ready Python REST API server on Kubernetes is not a piece of cake. Go and Java have a much better ecosystem for microservice applications.
P.S. Here is a good article showing that there is no need to run your app in Kubernetes with WSGI:
https://techblog.appnexus.com/beyond-hello-world-modern-asynchronous-python-in-kubernetes-f2c4ecd4a38d
P.P.S. I'm considering using the Prometheus exporter for Flask; it looks better than running a Python client in a separate thread:
https://github.com/rycus86/prometheus_flask_exporter

Flask application run by Gunicorn hangs after some time

I have been facing a peculiar problem for the past 30 days. After trying a whole lot of things, I am seeking support from the community.
I have made a deep-learning-based web application using Python Flask. The backend (the deep learning code) is written in Python, and the frontend is served using HTML, JS, and Bootstrap. Deployment is done using Gunicorn, and the app is HTTPS-enabled and deployed on GCP.
The application runs fine for some time, after which it hangs. Although the Python process is still running, it stops serving and responding to any API requests; new hits never reach the Python code. This behavior is very random in terms of running time: sometimes it takes 4-5 hours to stop, other times as long as 2 days. Then I manually restart the application with the gunicorn command and it works again.
Things I have tried -
Checked system memory using htop while the application is hung; it seems fine.
Load-tested the APIs using JMeter (multiple requests in a loop, in sequence) and the system did not hang.
Even ran the application under uWSGI + Apache, but the problem persists.
Here is the command being used to run the Gunicorn server:
gunicorn -b 0.0.0.0:443 --threads=4 --certfile=path_to_certificate_file --keyfile=path_to_key_file server:app --max-requests 1000 --access-logfile /var/log/gunicorn/gunicorn-access.log --error-logfile /var/log/gunicorn/gunicorn-error.log --capture-output --log-level debug --logger-class=simple --daemon
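For reference, the same settings can be written as a Python config file and loaded with gunicorn -c gunicorn.conf.py server:app (the file name is an assumption; the variables are Gunicorn's standard config settings):
# gunicorn.conf.py - mirrors the command-line flags above
bind = '0.0.0.0:443'
threads = 4
certfile = 'path_to_certificate_file'
keyfile = 'path_to_key_file'
max_requests = 1000
accesslog = '/var/log/gunicorn/gunicorn-access.log'
errorlog = '/var/log/gunicorn/gunicorn-error.log'
capture_output = True
loglevel = 'debug'
logger_class = 'simple'
daemon = True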
I am still not able to diagnose the exact problem or replicate it. I am looking for any specific direction to explore; feel free to share your hypotheses. Let me know if any other information from my side would help.
I've also been experiencing this issue with a gunicorn/flask app:
Requests load only within the first few minutes after gunicorn starts. When left alone for a few minutes, gunicorn stops responding to requests (the browser perpetually displays the "loading" animation). The gunicorn error log shows nothing.
My current workaround is to create an arbitrary "keep alive" call to the flask app from the client html file.
So in the Flask app file:
import json  # stdlib import at the top of the app file

@app.route('/keep_alive/<val>')
def keep_alive(val):
    return json.dumps({'success': val})
And in the html's JS block:
function keepAlive() {
    var t = setTimeout(keepAlive, 60000);  // re-arm every 60 seconds
    var d = new Date();
    var n = d.getTime();
    $.getJSON(base_url + '/keep_alive/' + n.toString(), function (data) {
        console.log('keep alive return ' + data);
    });
}
keepAlive();
For what it's worth, the gunicorn setting --keep-alive=21600 didn't have any noticeable effect. It would be great if someone here could enlighten me.
Thanks

Django app with long running calculations

I'm creating a Django web app which features potentially very long-running calculations of up to an hour. The calculations are simulation models built in Python. The web app sends inputs to the simulation model and after some time receives the answer. Also, the user should be able to close his browser after starting the simulation, and if he logs in the next day the results should be there.
From my research it seems like I can use Celery together with Redis/RabbitMQ as broker to run the calculation in the background. Ideally I would want to display progress updates using ajax, so that the page updates without a user refresh when the calculation is complete.
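A minimal sketch of that pattern (the broker URL, task body, and progress granularity are all assumptions):
from celery import Celery

app = Celery('simulations',
             broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/0')

@app.task(bind=True)
def run_simulation(self, inputs):
    for step in range(100):
        # ... advance the simulation model one step ...
        # publish progress so an ajax poll can display it without a refresh
        self.update_state(state='PROGRESS', meta={'percent': step + 1})
    return {'status': 'complete'}

The ajax side would poll a view that reads AsyncResult(task_id).state and .info until the task reports SUCCESS.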
I want to host the app on Heroku, so the calculation will also be running on the Heroku server. How hard will it be if I want to move the calculation engine to another server? It might be useful if the calculation engine is on a different server.
So my question is: is the approach above a good one, or what other options can I look at?
I think Celery is a good approach. I'm not sure whether you need Redis/RabbitMQ as a broker or could just use MySQL; it depends on your tasks. Celery workers can be run on different servers, since Celery supports distributed queues.
Another approach: implement some queue engine in Python, with the database as a broker and a cron job for executions. But that could be a dirty path with a lot of pain and bugs.
So I think Celery is the nicer way to do it.
If you are running on Heroku, you want django-rq, not Celery. See https://devcenter.heroku.com/articles/python-rq.
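A minimal django-rq sketch of the same idea (the 'default' queue name, the view, and the long_calculation function are assumptions):
from django.http import JsonResponse
import django_rq

def start_simulation(request):
    # long_calculation is the hypothetical hour-long simulation function;
    # its inputs are assumed to arrive as GET parameters
    queue = django_rq.get_queue('default')
    job = queue.enqueue(long_calculation, dict(request.GET))
    # the client stores job_id and polls another view for the finished result
    return JsonResponse({'job_id': job.id})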
