How do I analyze what is hanging my Flask application

How do I analyze what is hanging my Flask application - python

I have a Python Flask web application, which uses a Postgresql database.
When I put a load on my application, it stops to respond. This only happens when I request pages which uses the database.
My setup:
nginx frontend (although in my test environment, skipping this tier doesn't make a difference), connecting via UNIX socket to:
gunicorn application server with 3 child processes, connecting via UNIX socket to:
pgbouncer, connection pooler for PostgreSQL, connecting via TCP/IP to:
I need pgbouncer, because SQLAlchemy has connection pooling per process. If I don't use pgbouncer, my database get's overloaded with connection requests very quickly.
postgresql 13, the database server.
I have a test environment on Debian Linux (with nginx) and on my iMac, and the application hang occurs on both machines.
I put load on the application with hey, a http load generator. I use the default, which generates 200 requests with 50 workers. The test-page issues two queries to the database.
When I run my load test, I see gunicorn getting worker timeouts. It's killing the timedout processes, and starts up new ones. Eventually (after a lot of timeouts) everything is fine again. For this, I lowered the statement timeout setting of Postgresql. First is was 30 and later I set it to 15 seconds. Gunicorn's worker timeouts happend more quickly now. (I don't understand this behaviour; why would gunicorn recycle a worker, when a query times out?)
When I look at pgbouncer, with the show clients; command I see some waiting clients. I think this is a hint of the problem. My Web application is waiting on pgbouncer, and pgbouncer seems to be waiting for Postgres. When the waiting lines are gone, the application behaves normally again (trying a few requests). Also, when I restart the gunicorn process, everything goes back to normal.
But with my application under stress, when I look at postgresql (querying with a direct connection, by-passing pgbouncer), I can't see anything wrong, or waiting or whatever. When I query pg_stat_activity, all I see are idle connections (except from then connection I use to query the view).
How do I debug this? I'm a bit stuck. pg_stat_activity should show queries running, but this doesn't seem to be the case. Is there something else wrong? How do I get my application to work under load, and how to analyze this.

So, I solved my question.
As it turned out, not being able to see what SqlAlchemy was doing turned out to be the most confusing part. I could see what Postgres was doing (pg_stat_activity), and also what pgbouncer was doing (show clients;).
SqlAlchemy does have an echo and pool_echo setting, but for some reason this didn't help me.
What helped me was the realization that SqlAlchemy uses standard python logging. For me, the best way to check it out was to add the default Flask logging handler to these loggers, something like this:
log_level = "INFO"
app.logger.setLevel(log_level)
for log_name in ["sqlalchemy.dialects", "sqlalchemy.engine", "sqlalchemy.orm", "sqlalchemy.pool"]:
additional_logger = logging.getLogger(log_name)
additional_logger.setLevel(log_level)
additional_logger.addHandler(app.logger.handlers[0])
(of course I can control my solution via a config-file, but I left that part out for clarity)
Now I could see what was actually happening. Still no statistics, like with the other tiers, but this helped.
Eventually I found the problem. I was using two (slightly) different connection strings to the same database. I had them because the first was for authentication (used by Flask-Session and Flask-Login via ORM), and the other for application queries (used by my own queries via PugSQL). In the end, different connection strings were not necessary. However it made SqlAlchemy do strange things when in stress.
I'm still not sure what the actual problem was (probably there were two connection pools which were fighting each other), but this solved it.
Nice benefit: I don't need pg_bouncer in my situation, so that removes a lot of complexity.

Related

Run code on first Django start

I have a Django application written to handle displaying a webpage with data from a model based on the primary key passed in the URL, this all works fine and the Django component is working perfectly for the most part.
My question though is, and I have tried multiple methods such as using an AppConfig, is how I can make it so when the Django server boots up, code is called that would then create a separate thread which would then monitor an external source, logging valid data from that source as a model into the database.
I have the threading code written along with the section that creates the model and saves it in the database, my issue though is that if I try to use an AppConfig to create the thread which would then handle the code, I get an django.core.exceptions.AppRegistryNotReady: Apps aren't loaded yet. error and the server does not boot up.
Where would be appropriate to place the code? Is my approach incorrect to the matter?

Trying to use threading to get around blocking processes like web servers is an exercise in pain. I've done it before and it's fragile and often yields unpredictable results.
A much easier idea is to create a separate worker that runs in a totally different process that you start separately. It would have the same database access and could even use your Django models. This is how hosts like Heroku approach this problem. It comes with the added benefit of being able to be tested separately and doesn't need to run at all while you're working on your main Django application.
These days, with a multitude of virtualization options like Vagrant and containerization options like Docker, running parallel processes and workers is trivial. In the wild they may literally be running on separate servers with your database on yet another server. As was mentioned in the comments, starting a worker process could easily be delegated to a separate Django management command. This, in turn, can be fairly easily turned into separate worker processes by gunicorn on your web server.

sqlite3: avoiding "database locked" collision

I am running two python files on one cpu in parallel, both of which make use of the same sqlite3 database. I am handling the sqlite3 database using sqlalchemy and my understanding is that sqlalchemy handles all the threading database issues within one app. My question is how to handle the access from the two different apps?
One of my two programs is a flask application and the other is a cronjob which updates the database from time to time.
It seems that even read-only tasks on the sqlite database lock the database, meaning that if both apps want to read or write at the same time I get an error.
OperationalError: (sqlite3.OperationalError) database is locked
Lets assume that my cronjob app runs every 5min. How can I make sure that there are no collisions between my two apps? I could write some read flag into a file which I check before accessing the database, but it seems to me there should be a standard way to do this?
Furthermore I am running my app with gunicorn and in principle it is possible to have multiple jobs running... so I eventually want more than 2 parallel jobs for my flask app...
thanks
carl

It's true. Sqlite isn't built for this kind of application. Sqlite is really for lightweight single-threaded, single-instance applications.
Sqlite connections are one per instance, and if you start getting into some kind of threaded multiplexer (see https://www.sqlite.org/threadsafe.html) it'd be possible, but it's more trouble than it's worth. And there are other solutions that provide that function-- take a look at Postgresql or MySQL. Those DB's are open source, are well documented, well supported, and support the kind of concurrency you need.

I'm not sure how SQLAlchemy handles connections, but if you were using Peewee ORM then the solution is quite simple.
When your Flask app initiates a request, you will open a connection to the DB. Then when Flask sends the response, you close the DB.
Similarly, in your cron script, open a connection when you start to use the DB, then close it when the process is finished.
Another thing you might consider is using SQLite in WAL mode. This can improve concurrency. You set the journaling mode with a PRAGMA query when you open your connection.
For more info, see http://charlesleifer.com/blog/sqlite-small-fast-reliable-choose-any-three-/

Do I need celery when I am using gevent?

I am working on a django web app that has functions (say for e.g. sync_files()) that take a long time to return. When I use gevent, my app does not block when sync_file() runs and other clients can connect and interact with the webapp just fine.
My goal is to have the webapp responsive to other clients and not block. I do not expect a zillion users to connect to my webapp (perhaps max 20 connections), and I do not want to set this up to become the next twitter. My app is running on a vps, so I need something light weight.
So in my case listed above, is it redundant to use celery when I am using gevent? Is there a specific advantage to using celery? I prefer not to use celery since it is yet another service that will be running on my machine.
edit: found out that celery can run the worker pool on gevent. I think I am a litle more unsure about the relationship between gevent & celery.

In short you do need a celery.
Even if you use gevent and have concurrency, the problem becomes request timeout. Lets say your task takes 10 minutes to run however the typical request timeout is about up to a minute. So what will happen if you trigger the task directly within a view is that the server will start processing it however after a minute a client (browser) will probably disconnect the connection since it will think the server is offline. As a result, your data can become corrupt since you cannot be guaranteed what will happen when connection will close. Celery solves this because it will trigger a background process which will process the task independent of the view. So the user will get the view response right away and at the same time the server will start processing the task. That is a correct pattern to handle any scenarios which require lots of processing.

A question about sites downtime updates

I've a server when I run a Django application but I've a little problem:
when with mercurial I commit and pushing new changes on the server, there's a micro time (like 1 microsec) where the home page is unreachable.
I have apache on the server.
How can I solve this?

You could run multiple instances of the django app (either on the same machine with different ports or on different machines) and use apache to reverse proxy requests to each instance. It can failover to instance B whilst instance A is rebooting. See mod_proxy.
If the downtime is as short as you say though, it is unlikly to be an issue worth worrying about.
Also note that there are likely to be better (and easier) proxies than Apache. Nginx is popular, as is HAProxy.

If you have any significant traffic in time that is measured in microsecond it's probably best to push new changes to your web servers one at a time, and remove the machine from load balancer rotation for the moment you're doing the upgrade there.

When using apachectl graceful, you minimize the time the website is unavailable when 'restarting' Apache. All children are 'kindly' requested to restart and get their new configuration when they're not doing anything.
The USR1 or graceful signal causes the parent process to advise the children to exit after their current request (or to exit immediately if they're not serving anything). The parent re-reads its configuration files and re-opens its log files. As each child dies off the parent replaces it with a child from the new generation of the configuration, which begins serving new requests immediately.
At a heavy-traffic website, you will notice some performance loss, as some children will temporarily not accept new connections. It's my experience, however, that TCP recovers perfectly from this.
Considering that some websites take several minutes or hours to update, that is completely acceptable. If it is a really big issue, you could use a proxy, running multiple instances and updating them one at a time, or update at an off-peak moment.

If you're at the point of complaining about a 1/1,000,000th of a second outage, then I suggest the following approach:
Front end load balancers pointing to multiple backend servers.
Remove one backend server from the loadbalancer to ensure no traffic will go to it.
Wait for all traffic that the server was processing has been sent.
Shutdown the webserver on that instance.
Update the django instance on that machine.
Add that instance back to the load balancers.
Repeat for every other server.
This will ensure that the 1/1,000,000th of a second gap is removed.

i think it's normal, since django may be needing to restart its server after your update

Postgres 8.4.4 + psycopg2 + python 2.6.5 + Win7 instability

You can see the combination of software components I'm using in the title of the question.
I have a simple 10-table database running on a Postgres server (Win 7 Pro). I have client apps (python using psycopg to connect to Postgres) who connect to the database at random intervals to conduct relatively light transactions. There's only one client app at a time doing any kind of heavy transaction, and those are typically < 500ms. The rest of them spend more time connecting than actually waiting for the database to execute the transaction. The point is that the database is under light load, but the load is evenly split between reads and writes.
My client apps run as servers/services themselves. I've found that it is pretty common for me to be able to (1) take the Postgres server completely down, and (2) ruin the database by killing the client app with a keyboard interrupt.
By (1), I mean that the Postgres process on the server aborts and the service needs to be restarted.
By (2), I mean that the database crashes again whenever a client tries to access the database after it has restarted and (presumably) finished "recovery mode" operations. I need to delete the old database/schema from the database server, then rebuild it each time to return it to a stable state. (After recovery mode, I have tried various combinations of Vacuums to see whether that improves stability; the vacuums run, but the server will still go down quickly when clients try to access the database again.)
I don't recall seeing the same effect when I kill the client app using a "taskkill" - only when using a keyboard interrupt to take the python process down. It doesn't happen all the time, but frequently enough that it's a major concern (25%?).
Really surprised that anything on a client would actually be able to take down an "enterprise class" database. Can anyone share tips on how to improve robustness, and hopefully help me to understand why this is happening in the first place? Thanks, M

If you're having problems with postgresql acting up like this, you should read this page:
http://wiki.postgresql.org/wiki/Guide_to_reporting_problems
For an example of a real bug, and how to ask a question that gets action and answers, read this thread.
http://archives.postgresql.org/pgsql-general/2010-12/msg01030.php

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.