MySQL query errors when connecting from Celery task running on Heroku - python

I'm seeing wrong query results when executing queries against an external MySQL database, but only when connecting from Celery tasks running on Heroku. The same tasks, when run on my own machine, do not show these errors, and the errors only appear about half of the time (although when they fail, all tasks are wrong).
The tasks are managed by Celery via Redis, and the MySQL database does not itself run on Heroku. Both my local machine and Heroku connect to the same MySQL database.
I connect to the database using SQLAlchemy, with the pymysql driver:
DB_URI = 'mysql+pymysql://USER:PW@SERVER/DB'
engine = create_engine(stats_config.DB_URI, convert_unicode=True, echo_pool=True)
db_session = scoped_session(sessionmaker(autocommit=False, autoflush=False, bind=engine))
Base = declarative_base()
Base.query = db_session.query_property()
The tasks are executed one by one.
Here is an example of a task that produces differing results:
@shared_task(bind=True, name="get_gross_revenue_task")
def get_gross_revenue_task(self, g_start_date, g_end_date, START_TIME_FORM):
    db_session.close()
    start_date = datetime.strptime(g_start_date, '%d-%m-%Y')
    end_date = datetime.strptime(g_end_date, '%d-%m-%Y')

    gross_rev_trans_VK = db_session.query(func.sum(UsersTransactionsVK.amount)).filter(
        UsersTransactionsVK.date_added >= start_date,
        UsersTransactionsVK.date_added <= end_date,
        UsersTransactionsVK.payed == 'Yes').scalar()

    gross_rev_trans_Stripe = db_session.query(func.sum(UsersTransactionsStripe.amount)).filter(
        UsersTransactionsStripe.date_added >= start_date,
        UsersTransactionsStripe.date_added <= end_date,
        UsersTransactionsStripe.payed == 'Yes').scalar()

    gross_rev_trans = db_session.query(func.sum(UsersTransactions.amount)).filter(
        UsersTransactions.date_added >= start_date,
        UsersTransactions.date_added <= end_date,
        UsersTransactions.on_hold == 'No').scalar()

    if gross_rev_trans_VK is None:
        gross_rev_trans_VK = 0
    if gross_rev_trans_Stripe is None:
        gross_rev_trans_Stripe = 0
    if gross_rev_trans is None:
        gross_rev_trans = 0

    print('gross', gross_rev_trans_VK, gross_rev_trans_Stripe, gross_rev_trans)

    total_gross_rev = gross_rev_trans_VK + gross_rev_trans_Stripe + gross_rev_trans

    return {'total_rev': str(total_gross_rev / 100),
            'current': 100,
            'total': 100,
            'statistic': 'get_gross_revenue',
            'time_benchmark': (datetime.today() - START_TIME_FORM).total_seconds()}
# Selects gross revenue between selected dates
@app.route('/get-gross-revenue', methods=["POST"])
@basic_auth.required
@check_verified
def get_gross_revenue():
    if request.method == "POST":
        task = get_gross_revenue_task.apply_async([session['g_start_date'], session['g_end_date'], session['START_TIME_FORM']])
        return json.dumps({}), 202, {'Location': url_for('taskstatus_get_gross_revenue', task_id=task.id)}
These are simple and fast tasks, completing within a few seconds.
The tasks fail by producing small differences. For example, for a task where the correct result would be 30111, when things break the task produces 29811 instead. It is always the code that uses `db_session` directly that returns the wrong results.
What I tried:
I am already using the same timezone by executing:
db_session.execute("SET SESSION time_zone = 'Europe/Berlin'")
I checked for errors in the worker logs. Although there are some entries like
2013 Lost connection to MySQL
sqlalchemy.exc.ResourceClosedError: This result object does not return rows. It has been closed automatically
2014 commands out of sync
I haven't found a correlation between SQL errors and wrong results. The wrong tasks results can appear without a lost connection.
A very dirty fix is to hard-code an expected result for one of the tasks, execute that first and then re-submit everything if the result produced is incorrect.
This is probably a cache or isolation level problem with the way I use the SQLAlchemy session. Because I only ever need to use SELECT (no inserts or updates), I also tried different settings for the isolation level, before running tasks, such as
#db_session.close()
#db_session.commit()
#db_session.execute('SET TRANSACTION READ ONLY')
These raise an error when I run them on Heroku, but they work when I run them on my Windows machine.
I also tried altering the connection itself with isolation_level="READ UNCOMMITTED", without any result.
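For reference, this is roughly how I passed the isolation level to the engine (a sketch, using SQLAlchemy's isolation_level argument to create_engine; DB_URI is the same connection string as above):
from sqlalchemy import create_engine

# sketch: set the isolation level on the engine itself rather than issuing
# SET TRANSACTION by hand before each task
engine = create_engine(DB_URI, convert_unicode=True, echo_pool=True,
                       isolation_level="READ UNCOMMITTED")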
I am certain that the workers are not reusing the same db_session.
It seems that only tasks which use db_session in the query can return wrong results. Code using the query attribute on the Base base class (a db_session.query_property() object, e.g. Users.query) does not appear to have issues. I thought this was basically the same thing?

You are re-using sessions between tasks in different workers. Create your session per Celery worker, or even per task.
Note that task objects are instantiated once per worker process and persist between runs. You can use this to cache a session per task, so you don't have to recreate the session each time the task runs. This is easiest done with a custom task class; the Celery documentation uses database connection caching as an example there.
To do this with a SQLAlchemy session, use:
from celery import Task
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

import stats_config  # the config module from the question

Session = scoped_session(sessionmaker(autocommit=True, autoflush=True))

class SQLASessionTask(Task):
    _session = None

    @property
    def session(self):
        if self._session is None:
            engine = create_engine(
                stats_config.DB_URI, convert_unicode=True, echo_pool=True)
            self._session = Session(bind=engine)
        return self._session
Use this as:
@shared_task(base=SQLASessionTask, bind=True, name="get_gross_revenue_task")
def get_gross_revenue_task(self, g_start_date, g_end_date, START_TIME_FORM):
    db_session = self.session
    # ... etc.
This creates a SQLAlchemy session for the current task only if it needs one, the moment you access self.session.
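Alternatively, if you would rather not cache sessions on a task class at all, the scoped session from the question can be torn down after every task run via Celery's task_postrun signal. A minimal sketch, assuming db_session is the scoped_session defined in the question:
from celery.signals import task_postrun

@task_postrun.connect
def cleanup_session(*args, **kwargs):
    # scoped_session.remove() closes the current session and returns its
    # connection to the pool, so no state leaks into the next task
    db_session.remove()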

Related

FastAPI db query stuck. Kubernetes

I have a small app that uses FastAPI. The problem is that when I deploy it to my server and try to make a POST request to a route that contains some database operations, it just gets stuck and gives me a 504 error. On my local machine it works fine.
Here is how my DB connection is set up:
app.add_event_handler("startup", tasks.create_start_app_handler(app))
app.add_event_handler("shutdown", tasks.create_stop_app_handler(app))
To test, I moved the DB connection out of the startup event and instead created (and closed) it inside a route, and that worked. Like this:
#app.get("/")
async def create_item():
engine = create_engine(
DB_URL
)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
Base = declarative_base()
t = engine.execute('SELECT * FROM auth_user').fetchone()
engine.dispose()
return t
How does this depend on the startup events? The PostgreSQL versions are different, but I don't think that's the cause.
Currently I have a deployment with 2 pods running. When I use the psql command I can connect normally, so it only gets stuck in the application, not in the pod.
In case anybody hits the same issue: I fixed it by updating pgpool from 4.2.2 to the latest version.

Acquiring pool connections in Python Gino (async)

I'm using Postgres and Python 3.7 with asyncio + asyncpg + gino (ORM-ish) + aiohttp (routing, web responses).
I created a small postgres table users in my database testdb and inserted a single row:
testdb=# select * from users;
id | nickname
----+----------
1 | fantix
I'm trying to set up my database such that I can make use of the ORM within routes as requests come in.
import os
import asyncio
import gino

DATABASE_URL = os.environ.get('DATABASE_URL')
db = gino.Gino()

class User(db.Model):
    __tablename__ = 'users'
    id = db.Column(db.Integer(), primary_key=True)
    nickname = db.Column(db.Unicode(), default='noname')

kwargs = dict(
    min_size=10,
    max_size=100,
    max_queries=1000,
    max_inactive_connection_lifetime=60 * 5,
    echo=True,
)

async def test_engine_implicit():
    await db.set_bind(DATABASE_URL, **kwargs)
    return await User.query.gino.all()  # this works

async def test_engine_explicit():
    engine = await gino.create_engine(DATABASE_URL, **kwargs)
    db.bind = engine
    async with engine.acquire() as conn:
        return await conn.all(User.select())  # this doesn't work!

users = asyncio.get_event_loop().run_until_complete(test_engine_implicit())
print(f'implicit query: {users}')
users = asyncio.get_event_loop().run_until_complete(test_engine_explicit())
print(f'explicit query: {users}')
The output is:
web_1 | INFO gino.engine._SAEngine SELECT users.id, users.nickname FROM users
web_1 | INFO gino.engine._SAEngine ()
web_1 | implicit query: [<db.User object at 0x7fc57be42410>]
web_1 | INFO gino.engine._SAEngine SELECT
web_1 | INFO gino.engine._SAEngine ()
web_1 | explicit query: [()]
which is strange. The "explicit" code essentially runs a bare SELECT against the database, which is useless.
I can't find in the documentation a way to both 1) use the ORM, and 2) explicitly check out connections from the pool.
Questions I have:
Does await User.query.gino.all() check out a connection from the pool? How is it released?
How would I wrap queries in a transaction? I am uneasy that I am not able to explicitly control when / where I acquire a connection from the pool, and how I release it.
I'd essentially like the explicitness of the style in test_engine_explicit() to work with Gino, but perhaps I'm just not understanding how the Gino ORM works.
I have never used GINO before, but after a quick look into the code:
A GINO connection simply executes the provided clause as is; if you provide a bare User.select(), it adds nothing to it.
If you want to achieve the same result as User.query.gino.all() while maintaining the connection yourself, you could follow the docs and use User.query instead of the plain User.select():
async with engine.acquire() as conn:
    return await conn.all(User.query)
Just tested and it works fine for me.
Regarding the connection pool, I am not sure I understood the question correctly, but Engine.acquire creates a reusable connection by default, which is then added to the pool (which is actually a stack):
:param reusable: Mark this connection as reusable or otherwise. This
has no effect if it is a reusing connection. All reusable connections
are placed in a stack, any reusing acquire operation will always
reuse the top (latest) reusable connection. One reusable connection
may be reused by several reusing connections - they all share one
same underlying connection. Acquiring a connection with
``reusable=False`` and ``reusing=False`` makes it a cleanly isolated
connection which is only referenced once here.
There is also manual transaction control in GINO, so you can, for example, create a non-reusable, non-reusing connection and control the transaction flow manually:
async with engine.acquire(reuse=False, reusable=False) as conn:
    tx = await conn.transaction()
    try:
        await conn.status("INSERT INTO users(nickname) VALUES('e')")
        await tx.commit()
    except Exception:
        await tx.rollback()
        raise
As for connection release, I cannot find any evidence that GINO releases connections itself. I guess the pool is maintained by SQLAlchemy core.
I definitely have not answered your questions directly, but I hope this helps somehow.

Flask + Celery + SQLAlchemy: database connection timeout

I have a Flask application to start long-running Celery tasks (~10-120 min/task, sometimes with slow queries). I use Flask-SQLAlchemy for ORM and connection management. My app looks like this:
app = Flask(__name__)
db = SQLAlchemy(app)
celery = make_celery(app)

@app.route('/start_job')
def start_job():
    task = job.delay()
    return 'Async job started', 202

@celery.task(bind=True)
def job(self):
    db.session.query(... something ...)
    ... do something for hours ...
    db.session.add(... something ...)
    db.session.commit()
    return
Unfortunately the MySQL server I have to use closes connections after a few minutes of inactivity, and the Celery tasks can't handle this, so after a lot of waiting I get (2006, 'MySQL server has gone away') errors. AFAIK the connection pooling should take care of closed connections. I read the docs, but they only mention the SQLALCHEMY_POOL_TIMEOUT and SQLALCHEMY_POOL_RECYCLE parameters, so based on some random internet article I tried changing the recycle value to 3 minutes, but that didn't help.
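Concretely, the settings I tried were along these lines (a sketch; the exact values may have differed):
# Flask-SQLAlchemy pool settings tried (values illustrative); these are
# passed through to SQLAlchemy's connection pool
app.config['SQLALCHEMY_POOL_RECYCLE'] = 180   # recycle connections after 3 minutes
app.config['SQLALCHEMY_POOL_TIMEOUT'] = 20    # seconds to wait for a pooled connection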
How does the connection (session?) handling work with this configuration? What should I do to avoid such errors?
I am not entirely sure about the quality of the solution below, but it seems to solve the problem.
The session initializes a connection before the first query (or insert) statement and starts a transaction. Then it waits for a rollback or commit, but because of the inactivity the MySQL server closes the connection after a few minutes. The solution is to close the session if you will not need it for a long time; SQLAlchemy will open a new one for the next transaction.
@celery.task(bind=True)
def job(self):
    db.session.query(... something ...)
    db.session.close()
    ... do something for hours ...
    db.session.add(... something ...)
    db.session.commit()
    return
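Another option worth trying (not what the solution above uses) is SQLAlchemy's pool_pre_ping, which tests each connection on checkout and transparently replaces ones the server has already dropped. A minimal sketch, assuming a Flask-SQLAlchemy version that supports SQLALCHEMY_ENGINE_OPTIONS (2.4+):
# enable pessimistic pre-ping on the pool so stale connections are replaced
# at checkout instead of raising "MySQL server has gone away" mid-task
app.config['SQLALCHEMY_ENGINE_OPTIONS'] = {
    'pool_pre_ping': True,
    'pool_recycle': 280,   # recycle well below the server's idle timeout
}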

SQLAlchemy connection hangs on AWS MySQL RDS reboot with failover

We have a Python server which uses SQLAlchemy to read/write data from an AWS MySQL MultiAZ RDS instance.
We're experiencing a behavior we'd like to avoid: whenever we trigger a failover reboot, a connection that was already open and then issues a statement hangs indefinitely. While this is to be expected according to the AWS documentation, we would expect the Python MySQL connector to be able to cope with this situation.
The closest case we've found on the web is this google groups thread which talks about the issue and offers a solution regarding a Postgres RDS.
For example, the script below will hang indefinitely when a failover reboot is initiated (adapted from the above-mentioned Google Groups thread).
from datetime import datetime
from time import time, sleep
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.orm.scoping import scoped_session
from sqlalchemy.ext.declarative import declarative_base
import logging

current_milli_time = lambda: int(round(time() * 1000))

Base = declarative_base()

logging.basicConfig(format='%(asctime)s %(filename)s %(lineno)s %(process)d %(levelname)s: %(message)s', level="INFO")

class Message(Base):
    __tablename__ = 'message'
    id = Column(Integer, primary_key=True)
    body = Column(String(450), nullable=False)

engine = create_engine('mysql://<username>:<password>@<db_host>/<db_name>', echo=False, pool_recycle=1800,)
session_maker = scoped_session(sessionmaker(bind=engine, autocommit=False, autoflush=False))
session = session_maker()

while True:
    try:
        ids = ''
        start = current_milli_time()
        for msg in session.query(Message).order_by(Message.id.desc()).limit(5):
            ids += str(msg.id) + ', '
        logging.info('({!s}) (took {!s} ms) fetched ids: {!s}'.format(datetime.now().time().isoformat(), current_milli_time() - start, ids))

        start = current_milli_time()
        m = Message()
        m.body = 'some text'
        session.add(m)
        session.commit()
        logging.info('({!s}) (took {!s} ms) inserted new message'.format(datetime.now().time().isoformat(), current_milli_time() - start))
    except Exception as e:
        logging.exception(e)
        session.rollback()
    finally:
        session_maker.remove()
    sleep(0.25)
We've tried playing with the connection timeouts but it seems the issue is related to an already opened connection which simply hangs once AWS switches to the failover instance.
Our question is - has anyone encountered this issue or has possible directions worthwhile checking?
IMHO, using SQL connector timeouts to handle switchover is like black magic. Each connector acts differently and is difficult to diagnose.
If you read @univerio's comment again, AWS will reassign a new IP address to the SAME RDS endpoint name. While the switch is happening, the RDS endpoint name and the old IP address are still in your server instance's DNS cache. So this is a DNS caching issue, and that's why AWS asks you to "clean up....".
Unless you restart SQLAlchemy so that it reads the DNS again, there is no way for the session to know something has happened and switch over dynamically. Worse, the issue can also occur in the connector used by SQLAlchemy.
IMHO, it isn't worth the effort to deal with switchover inside the code. I would just subscribe to an AWS service like Lambda that can act on switchover events and trigger the app server to restart its connections, which should pick up the new IP address.
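If you do want a code-level mitigation anyway, one direction (a sketch, assuming the MySQLdb/mysqlclient driver; other drivers spell these options differently) is to give the driver socket-level timeouts so a hung statement eventually errors out, then dispose of the pool so new connections re-resolve the endpoint:
from sqlalchemy import create_engine, exc

# socket-level timeouts make a hung statement fail instead of blocking forever;
# the keyword names below are the ones accepted by MySQLdb/mysqlclient
engine = create_engine(
    'mysql://<username>:<password>@<db_host>/<db_name>',
    pool_recycle=1800,
    connect_args={'connect_timeout': 10, 'read_timeout': 30, 'write_timeout': 30},
)

def handle_possible_failover(session):
    # roll back the broken transaction, then drop every pooled connection;
    # replacement connections re-resolve the RDS endpoint name and therefore
    # reach the post-failover IP address
    session.rollback()
    engine.dispose()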

Python and Django OperationalError (2006, 'MySQL server has gone away')

Original: I have recently started getting MySQL OperationalErrors from some of my old code and cannot seem to trace back the problem. Since it was working before, I thought it may have been a software update that broke something. I am using python 2.7 with django runfcgi with nginx. Here is my original code:
views.py
DBNAME = "test"
DBIP = "localhost"
DBUSER = "django"
DBPASS = "password"
db = MySQLdb.connect(DBIP,DBUSER,DBPASS,DBNAME)
cursor = db.cursor()
def list(request):
statement = "SELECT item from table where selected = 1"
cursor.execute(statement)
results = cursor.fetchall()
I have tried the following, but it still does not work:
views.py
class DB:
    conn = None
    DBNAME = "test"
    DBIP = "localhost"
    DBUSER = "django"
    DBPASS = "password"

    def connect(self):
        self.conn = MySQLdb.connect(self.DBIP, self.DBUSER, self.DBPASS, self.DBNAME)

    def cursor(self):
        try:
            return self.conn.cursor()
        except (AttributeError, MySQLdb.OperationalError):
            self.connect()
            return self.conn.cursor()

db = DB()
cursor = db.cursor()

def list(request):
    cursor = db.cursor()
    statement = "SELECT item from table where selected = 1"
    cursor.execute(statement)
    results = cursor.fetchall()
Currently, my only workaround is to do MySQLdb.connect() in each function that uses mysql. Also I noticed that when using django's manage.py runserver, I would not have this problem while nginx would throw these errors. I doubt that I am timing out with the connection because list() is being called within seconds of starting the server up. Were there any updates to the software I am using that would cause this to break/is there any fix for this?
Edit: I realized that I recently wrote a piece of middleware to daemonize a function, and this was the cause of the problem. However, I cannot figure out why. Here is the code for the middleware:
def process_request_handler(sender, **kwargs):
    t = threading.Thread(target=dispatch.execute,
                         args=[kwargs['nodes'], kwargs['callback']],
                         kwargs={})
    t.setDaemon(True)
    t.start()
    return

process_request.connect(process_request_handler)
Sometimes if you see "OperationalError: (2006, 'MySQL server has gone away')", it is because you are issuing a query that is too large. This can happen, for instance, if you're storing your sessions in MySQL, and you're trying to put something really big in the session. To fix the problem, you need to increase the value of the max_allowed_packet setting in MySQL.
The default value is 1048576.
To see the current value, run the following SQL:
select @@max_allowed_packet;
To temporarily set a new value, run the following SQL:
set global max_allowed_packet=10485760;
To fix the problem more permanently, create a /etc/my.cnf file with at least the following:
[mysqld]
max_allowed_packet = 16M
After editing /etc/my.cnf, you'll need to restart MySQL or restart your machine if you don't know how.
As per the MySQL documentation, your error message is raised when the client can't send a question to the server, most likely because the server itself has closed the connection. In the most common case the server will close an idle connection after a default of 8 hours. This is configurable on the server side.
The MySQL documentation gives a number of other possible causes which might be worth looking into to see if they fit your situation.
An alternative to calling connect() in every function (which might end up needlessly creating new connections) would be to investigate using the ping() method on the connection object; this tests the connection with the option of attempting an automatic reconnect. I struggled to find some decent documentation for the ping() method online, but the answer to this question might help.
Note, automatically reconnecting can be dangerous when handling transactions as it appears the reconnect causes an implicit rollback (and appears to be the main reason why autoreconnect is not a feature of the MySQLdb implementation).
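A minimal sketch of that ping-and-reconnect approach (reusing the connection parameters from the question; MySQLdb's ping() accepts a reconnect flag):
import MySQLdb

db = MySQLdb.connect("localhost", "django", "password", "test")

def get_cursor():
    # ping(True) asks the client library to reconnect if the server has
    # dropped the connection; note the implicit-rollback caveat above
    db.ping(True)
    return db.cursor()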
This might be due to DB connections getting copied from the main process into your child processes. I faced the same error when using Python's multiprocessing library to spawn different processes. The connection objects are copied between processes during forking, and this leads to MySQL OperationalErrors when making DB calls in the child process.
Here's a good reference to solve this: Django multiprocessing and database connections
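A sketch of the usual fix for that situation (assuming Django 1.8+, where django.db.connections.close_all() is available): close the inherited connections in the parent before forking, so each child lazily opens its own.
from multiprocessing import Process
from django.db import connections

def spawn_worker(target, *args):
    # close all connections inherited from the parent; each child process
    # will then open a fresh connection on its first query
    connections.close_all()
    proc = Process(target=target, args=args)
    proc.start()
    return proc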
For me this was happening in debug mode.
So I tried persistent connections in debug mode; check out the link: Django - Documentation - Databases - Persistent connections.
In settings:
'default': {
    'ENGINE': 'django.db.backends.mysql',
    'NAME': 'dbname',
    'USER': 'root',
    'PASSWORD': 'root',
    'HOST': 'localhost',
    'PORT': '3306',
    'CONN_MAX_AGE': None,
},
Check if you are allowed to create mysql connection object in one thread and then use it in another.
If it's forbidden, use threading.local for per-thread connections:
class Db(threading.local):
    """ thread-local db object """
    con = None

    def __init__(self, ...options...):
        super(Db, self).__init__()
        self.con = MySQLdb.connect(...options...)

db1 = Db(...)

def test():
    """safe to run from any thread"""
    cursor = db1.con.cursor()
    cursor.execute(...)
This error is mysterious because MySQL doesn't report why it disconnects, it just goes away.
It seems there are many causes of this kind of disconnection. One I just found is that if the query string is too large, the server will disconnect. This probably relates to the max_allowed_packet setting.
I've been struggling with this issue too. I don't like the idea of increasing the timeout on the MySQL server. Autoreconnect with CONN_MAX_AGE doesn't work either, as mentioned. Unfortunately I ended up wrapping every method that queries the database like this:
def do_db(callback, *arg, **args):
    try:
        return callback(*arg, **args)
    except (OperationalError, InterfaceError) as e:  # Connection has gone away; filter by message or error code if you want to catch other errors
        connection.close()
        return callback(*arg, **args)

do_db(User.objects.get, id=123)  # instead of User.objects.get(id=123)
As you can see, I prefer catching the exception to pinging the database before every query, because catching an exception is the rare case. I would expect Django to reconnect automatically, but they seem to have declined that issue.
This error may occur when you try to use the connection after a time-consuming operation that doesn't go to the database. Since the connection is not used for some time, MySQL timeout is hit and the connection is silently dropped.
You can try calling close_old_connections() after the time-consuming non-DB operation so that a new connection is opened if the connection is unusable. Beware, do not use close_old_connections() if you have a transaction.
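A small sketch of that pattern (the model and helper names here are made up for illustration):
from django.db import close_old_connections

def generate_report():
    data = crunch_numbers()              # hypothetical non-DB work that can outlast wait_timeout
    close_old_connections()              # discard connections the server has silently dropped
    Report.objects.create(payload=data)  # hypothetical model; Django opens a fresh connection as needed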
The most common cause of this warning is that your application has reached MySQL's wait_timeout value.
I had the same problem with a Flask app.
Here's how I solved:
$ grep timeout /etc/mysql/mysql.conf.d/mysqld.cnf
# https://support.rackspace.com/how-to/how-to-change-the-mysql-timeout-on-a-server/
# wait = timeout for application session (tdm)
# interactive = timeout for keyboard session (terminal)
# 7 days = 604800s / 4 hours = 14400s
wait_timeout = 604800
interactive_timeout = 14400
Observation: if you query the variables in MySQL batch mode, the values appear exactly as configured. But if you run SHOW VARIABLES LIKE 'wait%'; or SHOW VARIABLES LIKE 'interactive%'; in an interactive session, the value configured for interactive_timeout appears for both variables. I don't know why, but the fact is that the values configured for each variable in '/etc/mysql/mysql.conf.d/mysqld.cnf' are the ones MySQL respects.
How old is this code? Django has had databases defined in settings since at least .96. Only other thing I can think of is multi-db support, which changed things a bit, but even that was 1.1 or 1.2.
Even if you need a special DB for certain views, I think you'd probably be better off defining it in settings.
SQLAlchemy now has a great write-up on how you can use pinging to be pessimistic about your connection's freshness:
http://docs.sqlalchemy.org/en/latest/core/pooling.html#disconnect-handling-pessimistic
From there,
from sqlalchemy import exc
from sqlalchemy import event
from sqlalchemy.pool import Pool

@event.listens_for(Pool, "checkout")
def ping_connection(dbapi_connection, connection_record, connection_proxy):
    cursor = dbapi_connection.cursor()
    try:
        cursor.execute("SELECT 1")
    except:
        # optional - dispose the whole pool
        # instead of invalidating one at a time
        # connection_proxy._pool.dispose()

        # raise DisconnectionError - pool will try
        # connecting again up to three times before raising.
        raise exc.DisconnectionError()
    cursor.close()
And a test to make sure the above works:
from sqlalchemy import create_engine

e = create_engine("mysql://scott:tiger@localhost/test", echo_pool=True)
c1 = e.connect()
c2 = e.connect()
c3 = e.connect()
c1.close()
c2.close()
c3.close()

# pool size is now three.

print "Restart the server"
raw_input()

for i in xrange(10):
    c = e.connect()
    print c.execute("select 1").fetchall()
    c.close()
I had this problem and did not have the option to change my configuration. I finally figured out that the problem was occurring 49500 records into my 50000-record loop, because that was about the time I was trying again (after a long gap) to hit my second database.
So I changed my code so that every few thousand records, I touched the second database again (with a count() of a very small table), and that fixed it. No doubt "ping" or some other means of touching the database would work, as well.
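A sketch of the shape of that change (all names here — records, process_record, TinyTable, 'second_db' — are placeholders):
KEEPALIVE_EVERY = 2000  # every few thousand records, as described above

for i, record in enumerate(records):
    process_record(record)  # long-running work against the first database
    if i % KEEPALIVE_EVERY == 0:
        # a cheap count() on a very small table keeps the second database's
        # connection from idling past its timeout
        TinyTable.objects.using('second_db').count()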
Firstly, you should check the MySQL session and global wait_timeout and interactive_timeout values. Secondly, your client should try to reconnect to the server within those timeout values.
