I have a server where I run a Django application, but I have a small problem:
when I commit and push new changes to the server with Mercurial, there is a brief window (something like a microsecond) where the home page is unreachable.
I have Apache on the server.
How can I solve this?
You could run multiple instances of the Django app (either on the same machine on different ports, or on different machines) and use Apache to reverse-proxy requests to each instance. It can fail over to instance B whilst instance A is restarting. See mod_proxy.
If the downtime is as short as you say, though, it is unlikely to be an issue worth worrying about.
Also note that there are likely to be better (and easier) proxies than Apache. Nginx is popular, as is HAProxy.
If you have enough traffic that downtime measured in microseconds matters, it's probably best to push new changes to your web servers one at a time, removing each machine from load-balancer rotation for the moment you're doing the upgrade there.
When using apachectl graceful, you minimize the time the website is unavailable when 'restarting' Apache. All children are 'kindly' requested to restart and get their new configuration when they're not doing anything.
The USR1 or graceful signal causes the parent process to advise the children to exit after their current request (or to exit immediately if they're not serving anything). The parent re-reads its configuration files and re-opens its log files. As each child dies off the parent replaces it with a child from the new generation of the configuration, which begins serving new requests immediately.
At a heavy-traffic website, you will notice some performance loss, as some children will temporarily not accept new connections. It's my experience, however, that TCP recovers perfectly from this.
Considering that some websites take several minutes or hours to update, that is completely acceptable. If it is a really big issue, you could use a proxy, running multiple instances and updating them one at a time, or update at an off-peak moment.
If you're at the point of complaining about a 1/1,000,000th of a second outage, then I suggest the following approach:
Set up front-end load balancers pointing to multiple backend servers.
Remove one backend server from the load balancer so no traffic will go to it.
Wait until all the traffic that server was processing has been handled.
Shut down the webserver on that instance.
Update the django instance on that machine.
Add that instance back to the load balancers.
Repeat for every other server.
This will ensure that the 1/1,000,000th of a second gap is removed.
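For what it's worth, the whole procedure is just a loop. In a Python sketch it might look like this, where every helper is a hypothetical placeholder for whatever API your load balancer and deployment tooling expose:

# Hypothetical rolling-deploy loop. All helpers are placeholders for your
# own load-balancer and deployment tooling.
import time

def remove_from_load_balancer(server): ...   # placeholder: stop routing to it
def has_active_connections(server): return False  # placeholder: poll the LB
def stop_webserver(server): ...              # placeholder
def deploy_new_code(server): ...             # placeholder
def start_webserver(server): ...             # placeholder
def add_to_load_balancer(server): ...        # placeholder: back into rotation

def rolling_deploy(servers):
    for server in servers:
        remove_from_load_balancer(server)
        while has_active_connections(server):
            time.sleep(1)             # wait for in-flight requests to drain
        stop_webserver(server)
        deploy_new_code(server)
        start_webserver(server)
        add_to_load_balancer(server)  # one server at a time: no global outage

rolling_deploy(["app1", "app2", "app3"])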
I think it's normal, since Django may need to restart its server after your update.
I have a Python Flask web application, which uses a PostgreSQL database.
When I put load on my application, it stops responding. This only happens when I request pages that use the database.
My setup:
nginx frontend (although in my test environment, skipping this tier doesn't make a difference), connecting via UNIX socket to:
gunicorn application server with 3 child processes, connecting via UNIX socket to:
pgbouncer, a connection pooler for PostgreSQL, connecting via TCP/IP to:
postgresql 13, the database server.
I need pgbouncer because SQLAlchemy pools connections per process; without it, my database gets overloaded with connection requests very quickly.
I have a test environment on Debian Linux (with nginx) and on my iMac, and the application hang occurs on both machines.
I put load on the application with hey, an HTTP load generator. I use the defaults, which generate 200 requests with 50 workers. The test page issues two queries to the database.
When I run my load test, I see gunicorn worker timeouts: it kills the timed-out processes and starts new ones. Eventually (after a lot of timeouts) everything is fine again. While investigating, I lowered the statement_timeout setting of PostgreSQL, first to 30 and later to 15 seconds; gunicorn's worker timeouts then happened more quickly. (I don't understand this behaviour; why would gunicorn recycle a worker when a query times out?)
When I look at pgbouncer with the show clients; command, I see some waiting clients. I think this is a hint at the problem: my web application is waiting on pgbouncer, and pgbouncer seems to be waiting for Postgres. When the waiting entries are gone, the application behaves normally again (trying a few requests). Also, when I restart the gunicorn process, everything goes back to normal.
But with my application under stress, when I look at PostgreSQL (querying over a direct connection, bypassing pgbouncer), I can't see anything wrong or waiting at all. When I query pg_stat_activity, all I see are idle connections (except for the connection I use to query the view).
How do I debug this? I'm a bit stuck. pg_stat_activity should show running queries, but that doesn't seem to be the case. Is there something else wrong? How do I get my application to work under load, and how do I analyze this?
So, I solved my question.
As it turned out, not being able to see what SQLAlchemy was doing was the most confusing part. I could see what Postgres was doing (pg_stat_activity) and what pgbouncer was doing (show clients;).
SQLAlchemy does have echo and echo_pool settings, but for some reason these didn't help me.
What helped me was the realization that SQLAlchemy uses standard Python logging. For me, the best way to inspect it was to add the default Flask logging handler to these loggers, something like this:
log_level = "INFO"
app.logger.setLevel(log_level)
for log_name in ["sqlalchemy.dialects", "sqlalchemy.engine", "sqlalchemy.orm", "sqlalchemy.pool"]:
additional_logger = logging.getLogger(log_name)
additional_logger.setLevel(log_level)
additional_logger.addHandler(app.logger.handlers[0])
(Of course I can control my solution via a config file, but I left that part out for clarity.)
Now I could see what was actually happening. Still no statistics, like with the other tiers, but this helped.
Eventually I found the problem: I was using two (slightly) different connection strings to the same database. I had them because the first was used for authentication (by Flask-Session and Flask-Login via the ORM), and the other for application queries (my own queries via PugSQL). In the end, the different connection strings were not necessary, but they made SQLAlchemy do strange things under stress.
I'm still not sure what the actual problem was (probably two connection pools fighting each other), but this solved it.
Nice benefit: I don't need pgbouncer in my situation, so that removes a lot of complexity.
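For anyone hitting the same thing: the fix boils down to one engine, and therefore one connection pool, per process. A minimal sketch (the URL and pool sizes are illustrative, not my real values):

# One shared SQLAlchemy engine, hence exactly one pool per process.
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://app:secret@localhost/mydb",  # the single connection string
    pool_size=5,         # connections kept open per gunicorn worker
    max_overflow=10,     # temporary extra connections under load
    pool_pre_ping=True,  # detect and replace dead connections
)
# With 3 gunicorn workers this caps at 3 * (5 + 10) = 45 connections,
# which stays below PostgreSQL's default max_connections of 100.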
I built a real-time collaboration application with ProseMirror that uses a centralised operational transform algorithm (described here by Marijn Haverbeke), with a Python server using Django Channels and prosemirror-py as the central point.
The server creates a DocumentInstance for every document users are collaborating on, keeps it in memory and occasionally stores it in a Redis database. As long as there is only one Dyno, all requests are routed there. Then the server looks up the instance the request belongs to and updates it.
I would like to take advantage of Heroku's horizontal scaling and run more than one dyno. But as I understand it, this would mean requests being routed to any of the running dynos, and since a DocumentInstance can only live on one server, this would not work.
Is there a way to make sure that requests belonging to a specific DocumentInstance are only routed to the machine that keeps it?
Or maybe there is an alternative architecture that I am overlooking?
When building scalable architecture, the horizontally autoscaled layer is typically ephemeral, meaning no state is kept there; it serves either for processing or for IO to other services, since these boxes are shut down and spun up with regularity. You should not be keeping the DocumentInstance data there, as it will be lost when dynos spin up and shut down during autoscaling.
Since you're already using Redis as a centralized datastore, you would be much better off saving to Redis in real time and having all dynos write to and read from Redis to keep in sync. Redis is capable of this, and it moves all state from ephemeral dynos to a non-ephemeral datastore. You may, however, be able to find a middle ground for your case using https://devcenter.heroku.com/articles/session-affinity.
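As a sketch of what "all dynos go through Redis" could look like with redis-py (the transform function below is a hypothetical stand-in for your prosemirror-py step application):

# Sketch: keep DocumentInstance state in Redis so any dyno can serve any
# request. WATCH/MULTI gives optimistic locking, so two dynos updating the
# same document at once cannot silently overwrite each other.
import json
import redis

r = redis.Redis()  # in practice, configured from your REDIS_URL

def transform(doc, steps):
    """Hypothetical placeholder for the prosemirror-py OT step application."""
    return doc

def apply_steps(doc_id, steps):
    key = f"doc:{doc_id}"
    with r.pipeline() as pipe:
        while True:
            try:
                pipe.watch(key)        # invalidated if another dyno writes first
                raw = pipe.get(key)    # immediate-mode read while watching
                doc = json.loads(raw) if raw else {}
                doc = transform(doc, steps)
                pipe.multi()
                pipe.set(key, json.dumps(doc))
                pipe.execute()         # raises WatchError on conflict
                return doc
            except redis.WatchError:
                continue               # another dyno won the race; retry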
I run 123 Dyno, an autoscaling and monitoring add-on for Heroku, which offers more options and faster, configurable autoscaling on Heroku when you get there.
I have a Django application written to handle displaying a webpage with data from a model, based on the primary key passed in the URL. This all works fine, and the Django component is working perfectly for the most part.
My question (and I have tried multiple methods, such as using an AppConfig) is how I can make it so that, when the Django server boots up, code is called that creates a separate thread, which then monitors an external source and logs valid data from that source into the database as a model.
I have the threading code written, along with the section that creates the model instance and saves it to the database. My issue is that if I try to use an AppConfig to create the thread that would then run the code, I get a django.core.exceptions.AppRegistryNotReady: Apps aren't loaded yet. error and the server does not boot up.
Where would be the appropriate place for this code? Or is my approach to the matter incorrect?
Trying to use threading to get around blocking processes like web servers is an exercise in pain. I've done it before and it's fragile and often yields unpredictable results.
A much easier idea is to create a separate worker that runs in a totally different process that you start separately. It would have the same database access and could even use your Django models. This is how hosts like Heroku approach this problem. It comes with the added benefit of being able to be tested separately and doesn't need to run at all while you're working on your main Django application.
These days, with a multitude of virtualization options like Vagrant and containerization options like Docker, running parallel processes and workers is trivial. In the wild, they may literally be running on separate servers, with your database on yet another server. As was mentioned in the comments, starting a worker process can easily be delegated to a separate Django management command. This, in turn, can fairly easily be run as its own worker process alongside gunicorn on your web server.
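To make that concrete, here is a sketch of such a management command. The file path is the Django convention, while "yourapp", the model, and the fetch logic are hypothetical placeholders for your own code:

# Sketch of yourapp/management/commands/monitor_source.py
import time

from django.core.management.base import BaseCommand

from yourapp.models import Reading  # hypothetical model


def fetch_from_external_source():
    """Placeholder for the existing monitoring code."""
    return None


class Command(BaseCommand):
    help = "Monitor the external source and store valid data in the database"

    def handle(self, *args, **options):
        # Safe to touch models here: management commands only run after the
        # app registry is loaded, so no AppRegistryNotReady error.
        while True:
            data = fetch_from_external_source()
            if data is not None:
                Reading.objects.create(value=data)
            time.sleep(5)  # poll interval

You would start it with python manage.py monitor_source, independently of the web server process.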
I have a Python program that I run as a Job on a Kubernetes cluster every 2 hours. I also have a webserver that starts the job whenever a user clicks a button on a page.
I need to ensure that at most only one instance of the Job is running on the cluster at any given time.
Given that I am using Kubernetes to run the job and connecting to PostgreSQL from within the job, the solution should somehow leverage these two. I thought a bit about it and came up with the following ideas:
Find a setting in Kubernetes that would set this limit, so attempts to start a second instance would fail. I was unable to find such a setting.
Create a shared lock or mutex. The disadvantage is that if the job crashes, it may not unlock before quitting.
Kubernetes is running etcd; maybe I can use that.
Create a 'lock' table in PostgreSQL; when a new instance connects, it checks whether it is the only one running. Use transactions somehow so that one wins and proceeds while the others quit. I have not thought this out yet, but it should work.
Query the Kubernetes API for a label I use on the job, and see if any instances are running. This may not be atomic, so more than one instance may slip through.
What are the usual solutions to this problem given the platform choice I made? What should I do, so that I don't reinvent the wheel and have something reliable?
A completely different approach would be to run a (web) server that executes the job functionality. At a high level, the idea is that the webserver can contact this new job server to execute functionality. In addition, this new job server will have an internal cron to trigger the same functionality every 2 hours.
There could be 2 approaches to implementing this:
You can put the checking mechanism inside the jobserver code to ensure that even if 2 API calls happen simultaneously to the job server, only one executes, while the other waits. You could use the language platform's locking features to achieve this, or use a message queue.
You can put the checking mechanism outside the jobserver code (in the database) to ensure that only one API call executes. Similar to what you suggested. If you use a postgres transaction, you don't have to worry about your job crashing and the value of the lock remaining set.
The pros/cons of both approaches are straightforward. The major difference in my mind between 1 & 2, is that if you update the job server code, then you might have a situation where 2 job servers might be running at the same time. This would destroy the isolation property you want. Hence, database might work better, or be more idiomatic in the k8s sense (all servers are stateless so all the k8s goodies work; put any shared state in a database that can handle concurrency).
Addressing your ideas, here are my thoughts:
Find a setting in k8s that will limit this: k8s will not start things with the same name (in the metadata of the spec), but anything else goes, and k8s will start another job.
a) etcd3 supports distributed locking primitives. However, I've never used this and I don't really know what to watch out for.
b) A postgres lock value should work. Even in case of a job crash, you don't have to worry about the value of the lock remaining set (see the sketch after this list).
Querying the k8s API server for things that should be atomic is not a good idea, as you said. I've used a system that reacts to k8s events (like an annotation change on an object spec), but I've had bugs where my 'operator' suddenly stops getting k8s events and needs to be restarted; and again, if I push an update to the event-handler server, there might be 2 event handlers running at the same time.
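To make the Postgres idea concrete, here is a minimal sketch using a session-level advisory lock with psycopg2; the DSN, lock key, and job body are placeholders. Because the lock is tied to the connection, a crashed job releases it automatically:

# Sketch: a session-level Postgres advisory lock as the "at most one job"
# mutex. DSN, lock key, and run_job() are hypothetical placeholders.
import sys
import psycopg2

JOB_LOCK_KEY = 428734  # any fixed integer shared by all job instances


def run_job():
    """Placeholder for the actual job logic."""


conn = psycopg2.connect("dbname=mydb user=job")  # hypothetical DSN
cur = conn.cursor()
cur.execute("SELECT pg_try_advisory_lock(%s)", (JOB_LOCK_KEY,))
if not cur.fetchone()[0]:
    sys.exit(0)  # another instance holds the lock; exit quietly

try:
    run_job()  # the lock is held as long as this connection stays open
finally:
    conn.close()  # releases the lock; a crash has the same effect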
I would recommend sticking with what you are most familiar with. In my case that would be implementing a job server as a k8s deployment that runs as a server and listens to events/API calls.
Using Redis with our Python WSGI application, we are noticing that at certain intervals, Redis stops responding to requests. After some time, however, we are able to fetch the values stored in it again.
When this happens and we check the status of the Redis service, it is still online.
If it is of any help, we are using the redis Python package with StrictRedis as the connection class and a default ConnectionPool. Any ideas on this will be greatly appreciated. If there is more information that would help diagnose the problem, let me know and I'll update ASAP.
Thanks very much!
More data about your Redis setup and the size of the data set would be useful. That said, I would venture to guess that your Redis server is configured to persist data to disk (the default). If so, you could be seeing your Redis node getting a bit overwhelmed when it forks off a copy of itself to save the data set to disk.
If this is the case and you do need to persist to disk, then I would recommend standing up a second instance configured as a slave to the first one, and have it do the persisting to disk. The master you would then configure not to persist to disk. In this configuration you should see the writable node remain fully responsive at all times. You could even go so far as to set up a non-persisting slave for read-only access.
But without more details on your configuration, resources, and usage it is merely an educated guess.
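If you do split the roles that way, the application side is a small change. A sketch with redis-py, where the hostnames are placeholders:

# Sketch: send writes to the non-persisting master and reads to the
# persisting slave. Hostnames are hypothetical placeholders.
import redis

write_client = redis.StrictRedis(host="redis-master", port=6379)
read_client = redis.StrictRedis(host="redis-slave", port=6379)

write_client.set("greeting", "hello")  # writes go to the master
value = read_client.get("greeting")    # reads can be served by the slave
# Note: replication is asynchronous, so a read immediately after a write
# may briefly return stale data.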