Recovering Celery From a Database Outage - python

I have Celeryd/RabbitMQ running on a Fedora box, communicating with a MySQL
database on a separate box. I've noticed that, on rare occasions, if
there's even the slightest problem connecting to the MySQL database
(even for a few seconds), celeryd will crash with the error:
OperationalError: (2003, "Can't connect to MySQL server on
'mydatabasedomain' (111)")
and fail to reconnect even when the database becomes available again.
Currently, I'm forced to manually restart the celeryd service to get
celery running again. Is there a more graceful and automatic way to
recover from this type of event? Is there any feature of celeryd that will just quietly wait, log the OperationalError, and reconnect instead of exiting entirely?

I don't know of any way to fix this with just a config flag, but you could consider running your worker under supervisor (see http://supervisord.org).
This is even mentioned in the Celery docs (http://celery.readthedocs.org/en/latest/tutorials/daemonizing.html#supervisord), including a link to some example config files.
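For reference, a minimal supervisord program entry for a Celery worker might look something like the sketch below. The command, paths, and user are placeholders, and the exact worker invocation depends on your Celery version; the example files linked from the Celery docs are the better starting point.
[program:celeryd]
command=/path/to/virtualenv/bin/celery worker --app=myproject --loglevel=INFO
directory=/path/to/myproject
user=celeryuser
autostart=true
autorestart=true
stdout_logfile=/var/log/celeryd.log
stderr_logfile=/var/log/celeryd_error.log
With autorestart=true, supervisord restarts the worker whenever it dies on an OperationalError, so at worst you lose the few seconds between the crash and the restart.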

Related

Airflow Scheduler Error- MySQL OperationalError 2006, "Can't connect to MySQL server on <HOST IP> (111)"

I am configuring Airflow with the CeleryExecutor on 2 RHEL machines on GCP. Consider one the master and the other the client.
I have mounted the master's /airflow directory on the client and configured the airflow.cfg file on the master with the required changes, such as replacing localhost (including in the SQLAlchemy connection string).
I created RabbitMQ queues on both the master and the client (and configured a RabbitMQ cluster as well). Once the setup was ready, I started the web server, worker, and scheduler services on the master.
All services start up perfectly, and I am able to run airflow initdb as well.
But after a few seconds the scheduler goes down with a MySQL connection error, and then the worker and web server error out and shut down as well.
sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (2006, "Can't connect to MySQL server on '<MASTER IP>' (111)")
{celery_executor.py:264} ERROR - Error fetching Celery task state, ignoring it:OperationalError('(_mysql_exceptions.OperationalError) (2006, "Can\'t connect to MySQL server on \'<MASTER IP>\' (111)")',)
Another error observed in the log:
{celery_executor.py:264} ERROR - Error fetching Celery task state, ignoring it:
sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (2013, 'Lost connection to MySQL server during query')
The master is the machine where the MySQL database is installed.
After the initial setup, when I created the user in MySQL, I tested logins with that user, and they all succeeded from localhost as well as from the client machine. I'm not sure why the above error occurs despite the MySQL user having all the required access to the database.
If anyone has come across a similar issue while configuring Airflow in distributed mode with the CeleryExecutor, kindly share the workaround. It would be a great help.
Note: I have tried other solutions, such as editing /etc/my.cnf with the bind-address parameter and adjusting user- and host-level privileges, but with no success.
Thanks in advance.

flask server on raspberry pi stopped responding

I set up a Flask server on my Raspberry Pi and set up a crontab to start the server on reboot. It was working fine for a while, but for the past couple of days it has stopped responding, even after I rebooted the Pi.
It seems like the server is running, because when I SSH in and try to run another server it says that the address is already in use.
Any ideas why my server is not responding anymore?
Also, this may or may not be related, but when I typed crontab -e I got the following error:
/tmp/crontab.Qqy98c: No space left on device
Creation of temporary crontab file failed - aborting
The "Address already in use" error indicates that some process is occupying the host/port that your Flask application is trying to use.
To resolve this problem, find out which processes are running on that port, then kill them. For example, if you use the default setting of port 5000, you can do the following in the terminal:
lsof -i:5000
This will show you the process that's running on port 5000. Take note of the process ID; let's say it's 12345 for example.
kill 12345
This should free up the address, allowing the Flask application to launch normally.
The error No space left on device / Creation of temporary crontab file failed should be somewhat self-explanatory: you're running very low on disk space and should strongly consider clearing some up. Otherwise, you're bound to run into all kinds of problems.
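If you're not sure where the space went, a quick look with the standard tools usually narrows it down (the paths below are just guesses for a typical Pi setup):
df -h                                                   # how full each filesystem is
du -sh /var/log /tmp /home/pi/* 2>/dev/null | sort -h   # largest of the usual suspects listed last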

Python process suspends on SSH logout after nohup/screen

I have a remote server through Blue Host that's intended to run a server based on Twisted for Python. The only access I have to it is over SSH, so to keep Python running after I log out I tried using nohup python server.py & and screen -dm python server.py, getting the same results for each. Everything works fine until I log out of SSH - even though Python is running in the background as expected, once I've logged out, my client can no longer communicate with the server. The strange part is that if I log back in over SSH and check the running processes with ps aux, I see Python running and my client can successfully communicate with the server again. Even if I don't type anything at all once I log back in, everything works as expected. But, of course, as soon as I log back out, it's as if the server is gone.
I've contacted support for the hosting service in case this is some oddity on their end, but hopefully this is something that can be resolved on my end instead.
Edit: Looks like Blue Host doesn't want me doing server-y stuff without buying the VPS upgrade so it looks like that's the big problem.
Edit 2: Okay, so in case anybody ends up having a similar problem, here's what the main issue turned out to be. I was mistaken in my original description; I was able to connect to the server, but I was getting kicked off immediately because of what turned out to be a MySQL error. I guess trying to connect to the database as localhost when there's no active SSH connection somehow causes problems, so I changed the MySQL connection command to connect to my site's IP address instead, even though it's the same machine as the server. That seemed to do the trick for my main issue.
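For anyone hitting the same thing, the change described above amounts to roughly the following sketch (using MySQLdb; the hostname and credentials here are placeholders, not the actual values):
import MySQLdb

# Connecting via "localhost" was what triggered the immediate disconnects described above;
# pointing at the site's hostname/IP worked, even though it resolves to the same machine.
# db = MySQLdb.connect(host="localhost", user="dbuser", passwd="secret", db="mydb")
db = MySQLdb.connect(host="example.com", user="dbuser", passwd="secret", db="mydb")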
Don't use this method to keep the server process running. Instead, try using supervisor (apt-get install supervisor). It allows you to daemonize your process and gives you the ability to stop/restart it, and so on.
Here's a sample config entry (/etc/supervisor/supervisord.conf):
[program:my_server]
command=python /path/to/server/server.py
directory=/path/to/server/
autostart=true
autorestart=true
stdout_logfile=/var/log/server.log
stderr_logfile=/var/log/server_error.log
user=your_linux_user_name
After you edit your config, do
sudo service supervisor stop
sudo service supervisor start #need to do this - doing a `restart` doesn't reload the config file!
Your server should now be running properly. You can manage its lifecycle via sudo supervisorctl.
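For example (using the program name from the config above):
sudo supervisorctl status my_server    # is it running?
sudo supervisorctl restart my_server   # restart just this program
sudo supervisorctl tail my_server      # show recent stdout from the program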

google compute engine connection keeps disconnecting

I have an instance on Google Compute Engine that I connect to from a terminal with gcutil ssh. On it I have several Django services, and I run the server using python manage.py runserver 0.0.0.0:8000. The services are called from an iPhone application (iOS 6.1).
The problem I'm facing is that every few minutes (between 10 and 15) I get disconnected and have to reconnect and run the server again.
Why is my server being disconnected, and how can I keep it running?
Try using supervisord. It sounds like, for what you're trying to do, supervisor can keep your process up and running. http://supervisord.org/
Here's an example conf:
[program:app]
process_name = app-%(process_num)s
command =python /home/ubuntu/production/current/app/src/app.py --port=%(process_num)s
# Increase numprocs to run multiple processes on different ports.
# Note that the chat demo won't actually work in that configuration
# because it assumes all listeners are in one process.
numprocs = 4
numprocs_start = 8000
This is for running multiple processes of the same program. Just change around the args and it should work for you.
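As a rough adaptation to the Django dev server in the question (the paths and program name are placeholders, and note that runserver itself isn't intended for production use):
[program:django_app]
command=python /path/to/project/manage.py runserver 0.0.0.0:8000
directory=/path/to/project
autostart=true
autorestart=true
stdout_logfile=/var/log/django_app.log
stderr_logfile=/var/log/django_app_error.log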
SSH normally times out after a period of inactivity, and that may be what is happening here. If so, this article might be useful to help configure SSH to send a regular message so connections are less likely to be dropped.
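One common way to do that is a client-side keepalive in your ~/.ssh/config (the interval here is just a reasonable guess):
Host *
    ServerAliveInterval 60
    ServerAliveCountMax 3
This makes the client send a keepalive message every 60 seconds and give up only after three of them go unanswered.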
However, the core issue is that you'd like software you started at the terminal to keep running even when you're logged out. Consider using screen or tmux to host your shell sessions. This will allow your shell software to run even when you are not connected, and for you to pick up right where you left off when you reconnect. Here is a nice getting started post about tmux.
Once you're ready for production, take a look at the Django deployment docs.

Django hideously slow with 8000 model instances (How to drop PostgreSQL database)?

I have a Django installation running on the development server. I recently used python manage.py loaddata 8000models.json. Now everything is super slow. Pages won't load. python manage.py flush does not return. ModelType.objects.count() does not return.
(Or maybe they will return if I wait a sufficiently long time.)
What's going on here? Is Django truly unable to handle that much data? Or is there some other issue?
Update: I observe this issue on PostgreSQL but not SQLite with the same amount of data. Perhaps I just need to wipe the PostgreSQL database and reload the data.
Update 2: How can I wipe the PostgreSQL database and reset it? python manage.py reset appname isn't responding.
Update 3: This is how I'm trying to wipe the PostgreSQL:
#!/bin/bash
sudo -u postgres dropdb mydb
sudo -u postgres createdb mydb
sudo -u postgres psql mydb < ~/my-setup/my-init.sql
python ~/path/to/manage.py syncdb
However, this causes the following errors:
dropdb: database removal failed: ERROR: database "mydb" is being accessed by other users
DETAIL: There are 8 other session(s) using the database.
createdb: database creation failed: ERROR: database "mydb" already exists
ERROR: role "myrole" cannot be dropped because some objects depend on it
DETAIL: owner of table mydb.mytable_mytable
# ... more "owner of table", "owner of sequence" statements, etc
How can I close out these other sessions? I don't have Apache running. I've only been using one instance of the Django development server at a time. However, when it got unresponsive I killed it with Control+Z. Perhaps that caused it to not release a database connection, thereby causing this issue? How can I get around this?
Ctrl-Z just stops (suspends) the process; it does not kill it. (I assume you're using bash.) Type jobs in your terminal, and you should see the old processes still running.
Once you kill all the jobs that are accessing the PostgreSQL database, you should be able to drop, create, and syncdb as you expect.
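Concretely, something along these lines should clear them out (the job and process numbers are only examples):
jobs                      # list stopped/background jobs in this shell
kill %1                   # kill job 1; repeat for each job listed
ps aux | grep manage.py   # if a server was started from another session, find its PID
kill 12345                # ...and kill that PID (12345 is only an example)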
Have you tried checking whether your Postgres database is still replying at all? From the sounds of it, it is probably still processing the data you told it to load.
When loading the data, you should just let it finish first. If pages are still taking a very long time to render afterwards, you should check your queries instead.
Using .count(), even in PostgreSQL (which has rather slow counts), should return within a reasonable amount of time for 8000 items.
