Python rq: handle success or failure of jobs

I have a fairly basic (so far) queue set up in my app:
Job 1 (backup): back up the SQL table I'm about to replace
Job 2 (update): do the actual table drop/update
very simplified code:
from rq import Queue
from rq.decorators import job

@job('backup')
def backup(db, table, conn_str):
    backup_sql = "SELECT * INTO {}.dbo.{}_backup from {}.dbo.{}".format(db, table, db, table)

@job('update')
def update(db, table, conn_str, keys, data):
    truncate_sql = "TRUNCATE TABLE {}.dbo.{}".format(db, table)
    sql_cursor.execute(truncate_sql)
    for sql_row in data:
        sql = "INSERT INTO {}.dbo.{} ({}) values ({})".format(db, table, ",".join(keys), ",".join(["?"] * len(sql_row)))
        sql_cursor.execute(sql, sql_row)
    sql_cursor.commit()

def update_data():
    ...
    update_queue = Queue('update', connection=redis_conn)
    backup_job = update_queue.enqueue('backup', db, table, conn_str, result_ttl=current_app.config['RESULT_TTL'])
    update_job = update_queue.enqueue('update', db, table, conn_str, result_ttl=current_app.config['RESULT_TTL'])
What I'd like to do, is find a way to watch the update. If it fails, I want to run a job to restore the backup created in the backup job. If it's successful, I want to run a different job to simply remove the backup.
What's the right way to go about this? I'm pretty new to rq and am looking around in the docs, but haven't found either a way to poll update for success/failure or an idiomatic way to handle either outcome.

One option is to create a third job, called "checker" for example, which will decide what to do based on the status of the "update" job. For that, you have to specify a dependency relationship.
depends_on specifies another job (or job id) that must complete before this job will be queued.
def checker(*args, **kwargs):
    pass

checker_job = update_queue.enqueue('checker', *args, depends_on=update_job.id, result_ttl=current_app.config['RESULT_TTL'])
Then check the status of the dependency inside "checker" and, based on that status, restore the backup or delete it.
import rq

def checker(*args, **kwargs):
    job = rq.get_current_job()
    update_job = job.dependency
    if update_job.get_status() == 'failed':
        # restore the backup here
        ...
    else:  # or elif
        # remove the backup here
        ...
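Putting it together with the backup/update jobs from the question, the checker could enqueue a follow-up job for either outcome. This is only a sketch: restore_backup and remove_backup are hypothetical helpers (not part of rq), and the 'cleanup' queue name is arbitrary.

from rq import Queue, get_current_job

def restore_backup(db, table, conn_str):
    # hypothetical helper: copy {table}_backup back over {table}
    ...

def remove_backup(db, table, conn_str):
    # hypothetical helper: drop the {table}_backup table
    ...

def checker(db, table, conn_str):
    job = get_current_job()
    update_job = job.dependency  # the 'update' job this job depends on
    cleanup_queue = Queue('cleanup', connection=job.connection)
    if update_job.get_status() == 'failed':
        cleanup_queue.enqueue(restore_backup, db, table, conn_str)
    else:
        cleanup_queue.enqueue(remove_backup, db, table, conn_str)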

Related

How to dynamically rename a task run based on passed variable from within a Prefect flow for Prefect 2.0?

I have a prefect task that will source a table by name from a database as below. What I would like to do is to have the task run name include the source table variable so that I can easily keep track of it from the Prefect Cloud UI (when I source 10+ tables in the one flow).
@task
def source_table_by_name(id, table_name):
    logger = get_run_logger()
    sql = f"SELECT * FROM {table_name} WHERE upload_id = '{id}'"
    df = pd.read_sql(sql, sd_engine_prod)
    logger.info(f"Source table {table_name} from database")
    return df
What I tried to do initially was put a template in the name to be able to reference the variables passed (I'll be honest, ChatGPT hallucinated this one for me).
@task(name='source_table_by_name_{table_name}')
def source_table_by_name(id, table_name):
    logger = get_run_logger()
    sql = f"SELECT * FROM {table_name} WHERE upload_id = '{id}'"
    df = pd.read_sql(sql, sd_engine_prod)
    logger.info(f"Source table {table_name} from database")
    return df

@flow
def report_flow(upload_id):
    df_table1 = source_table_by_name(upload_id, table1)
    df_table2 = source_table_by_name(upload_id, table2)
I am able to just write a specific task for each table to be sourced so the naming is fixed but clear from the start. However it would be great to have a more DRY approach if possible.
Totally justifiable question. We have an internal WIP to address the issue in a better way, and there is also this open issue, but for now you could use with_options() and pass the variable through a for loop:
from prefect import flow, task
from typing import List
import time

@task
def hello_world(user: str) -> None:
    print(f"✨ Hello from Prefect, {user}! 👋 📚")
    time.sleep(5)

@flow(log_prints=True)
def hi(
    users: List[str] = [
        "Marvin",
        "Anna",
        "Prefect"
    ]
) -> None:
    for user in users:
        hello_world.with_options(name=user).submit(user)

if __name__ == "__main__":
    hi()
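Applied to the original example, the same with_options() trick renames the sourcing task per table inside the flow. A sketch, assuming the table names are passed in as strings and source_table_by_name is the task defined in the question:

from prefect import flow
from typing import List

@flow
def report_flow(upload_id: str, table_names: List[str] = ["table1", "table2"]) -> dict:
    dfs = {}
    for table_name in table_names:
        # rename the task so each run shows up per-table in the Prefect Cloud UI
        dfs[table_name] = source_table_by_name.with_options(
            name=f"source_table_by_name_{table_name}"
        )(upload_id, table_name)
    return dfs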

SQLAlchemy does not update/expire model instances with external changes

Recently I came across strange behavior of SQLAlchemy regarding refreshing/populating model instances with the changes that were made outside of the current session. I created the following minimal working example and was able to reproduce the problem with it.
from time import sleep

from sqlalchemy import orm, create_engine, Column, BigInteger, Integer
from sqlalchemy.ext.declarative import declarative_base

DATABASE_URI = "postgresql://{user}:{password}@{host}:{port}/{name}".format(
    user="postgres",
    password="postgres",
    host="127.0.0.1",
    name="so_sqlalchemy",
    port="5432",
)


class SQLAlchemy:
    def __init__(self, db_url, autocommit=False, autoflush=True):
        self.engine = create_engine(db_url)
        self.session = None
        self.autocommit = autocommit
        self.autoflush = autoflush

    def connect(self):
        session_maker = orm.sessionmaker(
            bind=self.engine,
            autocommit=self.autocommit,
            autoflush=self.autoflush,
            expire_on_commit=True
        )
        self.session = orm.scoped_session(session_maker)

    def disconnect(self):
        self.session.flush()
        self.session.close()
        self.session.remove()
        self.session = None


BaseModel = declarative_base()


class TestModel(BaseModel):
    __tablename__ = "test_models"

    id = Column(BigInteger, primary_key=True, nullable=False)
    field = Column(Integer, nullable=False)


def loop(db):
    while True:
        with db.session.begin():
            t = db.session.query(TestModel).with_for_update().get(1)
            if t is None:
                print("No entry in db, creating...")
                t = TestModel(id=1, field=0)
                db.session.add(t)
                db.session.flush()
            print(f"t.field value is {t.field}")
            t.field += 1
            print(f"t.field value before flush is {t.field}")
            db.session.flush()
            print(f"t.field value after flush is {t.field}")
        print(f"t.field value after transaction is {t.field}")
        print("Sleeping for 2 seconds.")
        sleep(2.0)


def main():
    db = SQLAlchemy(DATABASE_URI, autocommit=True, autoflush=True)
    db.connect()
    try:
        loop(db)
    except KeyboardInterrupt:
        print("Canceled")


if __name__ == '__main__':
    main()
My requirements.txt file looks like this:
alembic==1.0.10
psycopg2-binary==2.8.2
sqlalchemy==1.3.3
If I run the script (I use Python 3.7.3 on my laptop running Ubuntu 16.04), it will nicely increment a value every two seconds as expected:
t.field value is 0
t.field value before flush is 1
t.field value after flush is 1
t.field value after transaction is 1
Sleeping for 2 seconds.
t.field value is 1
t.field value before flush is 2
t.field value after flush is 2
t.field value after transaction is 2
Sleeping for 2 seconds.
...
Now at some point I open postgres database shell and begin another transaction:
so_sqlalchemy=# BEGIN;
BEGIN
so_sqlalchemy=# UPDATE test_models SET field=100 WHERE id=1;
UPDATE 1
so_sqlalchemy=# COMMIT;
COMMIT
As soon as I press Enter after the UPDATE query, the script blocks as expected, since I'm issuing a SELECT ... FOR UPDATE query there. However, when I commit the transaction in the database shell, the script continues from the previous value (say, 27) and does not detect that the external transaction has changed the value of field in the database to 100.
My question is, why does this happen at all? There are several factors that seem to contradict the current behavior:
I'm using expire_on_commit setting set to True, which seems to imply that every model instance that has been used in transaction will be marked as expired after the transaction has been committed. (Quoting documentation, "When True, all instances will be fully expired after each commit(), so that all attribute/object access subsequent to a completed transaction will load from the most recent database state.").
I'm not accessing some old model instance but rather issue a completely new query every time. As far as I understand, this should lead to a direct query to the database rather than accessing a cached instance. I can confirm that this is indeed the case if I turn the sqlalchemy debug log on.
The quick and dirty fix for this problem is to call db.session.expire_all() right after the transaction has begun, but this seems very inelegant and counter-intuitive. I would be very glad to understand what's wrong with the way I'm working with sqlalchemy here.
I ran into a very similar situation with MySQL. I needed to "see" changes to the table that were coming from external sources in the middle of my code's database operations. I ended up having to set autocommit=True in my session call and use the begin() / commit() methods of the session to "see" data that was updated externally.
The SQLAlchemy docs say this is a legacy configuration:
Warning
“autocommit” mode is a legacy mode of use and should not be considered for new projects.
but also say in the next paragraph:
Modern usage of “autocommit mode” tends to be for framework integrations that wish to control specifically when the “begin” state occurs
So it doesn't seem to be clear which statement is correct.
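For reference, a minimal sketch of that pattern using the question's TestModel and DATABASE_URI, assuming a session created in the legacy autocommit mode mentioned above, with each unit of work wrapped in explicit begin()/commit() calls so the next query reads the latest committed state:

from sqlalchemy import create_engine, orm

engine = create_engine(DATABASE_URI)
Session = orm.sessionmaker(bind=engine, autocommit=True)  # legacy autocommit mode
session = Session()

session.begin()  # explicitly start a transaction
try:
    t = session.query(TestModel).with_for_update().get(1)
    t.field += 1
    session.commit()  # ends the transaction; nothing stays cached across it
except Exception:
    session.rollback()
    raise
# a subsequent begin()/query pair now sees rows committed by other sessions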

Google BigQuery - python client - creating/managing jobs

I'm new to the BigQuery world... I'm using the python google.cloud package and I need simply to run a query from Python on a BigQuery table and print the results. This is the part of the query function which creates a query job.
def test():
    query = "SELECT * FROM " + dataset_name + '.' + table_name
    job = bigquery_client.run_async_query('test-job', query)
    job.begin()
    retry_count = 100
    while retry_count > 0 and job.state != 'DONE':
        retry_count -= 1
        sleep(10)
        job.reload()  # API call
    print(job.state)
    print(job.ended)
If I run the test() function multiple times, I get the error:
google.api.core.exceptions.Conflict: 409 POST https://www.googleapis.com/bigquery/v2/projects/myprocject/jobs:
Already Exists: Job myprocject:test-job
Since I have to run the test() function multiple times, do I have to delete the job named 'test-job' each time or do I have to assign a new job-name (e.g. a random one or datetime-based) each time?
do I have to delete the job named 'test-job' each time
You cannot delete a job. The jobs collection stores your project's complete job history, but availability is only guaranteed for jobs created in the past six months. The best you can do is to request automatic deletion of jobs that are more than 50 days old, for which you should contact support.
or do I have to assign a new job-name (e.g. a random one or datetime-based) each time?
Yes, this is the way to go.
As a side recommendation, we usually do it like:
import uuid
job_name = str(uuid.uuid4())
job = bigquery_client.run_async_query(job_name, query)
Notice this is already automatic if you run a synced query.
Also, as of version 0.27.0, you don't have to manage the polling for job completion yourself; if you want, you can use it like:
job = bigquery_client.run_async_query(job_name, query)
job_result = job.result()
query_result = job_result.query_results()
data = list(query_result.fetch_data())
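Putting the two suggestions together, the test() function from the question might look like this. A sketch against the same 0.27.0-era client, reusing dataset_name, table_name and bigquery_client from the question:

import uuid

def test():
    query = "SELECT * FROM " + dataset_name + '.' + table_name
    job_name = str(uuid.uuid4())  # fresh name each run avoids the 409 Already Exists error
    job = bigquery_client.run_async_query(job_name, query)
    job_result = job.result()  # waits for the job to complete
    query_result = job_result.query_results()
    for row in query_result.fetch_data():
        print(row)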

Django: Insert new row with 'order' value of the next highest value avoiding race condition

Say I have some models:
from django.db import models

class List(models.Model):
    name = models.CharField(max_length=32)

class ListElement(models.Model):
    lst = models.ForeignKey(List)
    name = models.CharField(max_length=32)
    the_order = models.PositiveSmallIntegerField()

    class Meta:
        unique_together = (("lst", "the_order"),)
and I want to append a new ListElement onto a List with the next-highest the_order value. How do I do this without creating a race condition whereby another ListElement is inserted between looking up the highest the_order and inserting the new one?
I have looked into select_for_update() but that won't stop a new INSERT from taking place, just stop the existing elements from being changed. I have also thought about using transactions, but that will simply fail if another thread gets there before us, and I don't want to loop until we succeed.
What I was thinking is along the lines of the following MySQL query
INSERT INTO list_elements (name, lists_id, the_order) VALUES ("another element", 1, (SELECT MAX(the_order)+1 FROM list_elements WHERE lists_id = 1));
however, even this is invalid SQL since you're not able to SELECT from the table you're INSERTing into.
Perhaps there is a way using Django's F() expressions, but I haven't been able to get anything working with it.
AUTO_INCREMENT won't help here either since it's table-wide and not per foreign key.
EDIT:
This SQL does seem to do the trick, however, there doesn't appear to be a way to use the INSERT ... SELECT function from Django's ORM.
INSERT INTO list_elements (name, lists_id, the_order) SELECT "another element", 1, MAX(the_order)+1 FROM list_elements WHERE lists_id = 1;
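For reference, that statement can still be issued from Django through a raw database cursor rather than the ORM. A parameterized sketch (append_element_raw is a hypothetical helper, and COALESCE is added here as an assumption so the first element of an empty list gets the_order = 1):

from django.db import connection

def append_element_raw(list_id, name):
    with connection.cursor() as cursor:
        cursor.execute(
            """
            INSERT INTO list_elements (name, lists_id, the_order)
            SELECT %s, %s, COALESCE(MAX(the_order), 0) + 1
            FROM list_elements
            WHERE lists_id = %s
            """,
            [name, list_id, list_id],
        )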
For concurrency problems in Django and relational databases, you could use a table lock to achieve atomic operations. I came across this problem and found a great code snippet at http://shiningpanda.com/mysql-table-lock-django.html. I'm not sure if copy/pasting the code directly here would offend anybody, but since SO discourages link-only answers, I will cite it anyway (thanks to ShiningPanda.com for this):
# -*- coding: utf-8 -*-
import contextlib
from django.db import connection


@contextlib.contextmanager
def acquire_table_lock(read, write):
    '''Acquire read & write locks on tables.

    Usage example:

        from polls.models import Poll, Choice
        with acquire_table_lock(read=[Poll], write=[Choice]):
            pass
    '''
    cursor = lock_table(read, write)
    try:
        yield cursor
    finally:
        unlock_table(cursor)


def lock_table(read, write):
    '''Acquire read & write locks on tables.'''
    # MySQL
    if connection.settings_dict['ENGINE'] == 'django.db.backends.mysql':
        # Get the actual table names
        write_tables = [model._meta.db_table for model in write]
        read_tables = [model._meta.db_table for model in read]
        # Statements
        write_statement = ', '.join(['%s WRITE' % table for table in write_tables])
        read_statement = ', '.join(['%s READ' % table for table in read_tables])
        statement = 'LOCK TABLES %s' % ', '.join([write_statement, read_statement])
        # Acquire the lock
        cursor = connection.cursor()
        cursor.execute(statement)
        return cursor
    # Other databases: not supported
    else:
        raise Exception('This backend is not supported: %s' %
                        connection.settings_dict['ENGINE'])


def unlock_table(cursor):
    '''Release all acquired locks.'''
    # MySQL
    if connection.settings_dict['ENGINE'] == 'django.db.backends.mysql':
        cursor.execute("UNLOCK TABLES")
    # Other databases: not supported
    else:
        raise Exception('This backend is not supported: %s' %
                        connection.settings_dict['ENGINE'])
It works with the models declared in your Django application, by simply providing two lists: the list of models to lock for read purposes, and the list of models to lock for write purposes. For instance, using the Django tutorial's models, you would just call the context manager like this:
with acquire_table_lock(read=[models.Poll], write=[models.Choice]):
    # Do something here
    pass
It basically creates a Python context manager to wrap your ORM statements, issuing LOCK TABLES upon entering the context and UNLOCK TABLES upon exiting it.
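Applied to the models from the question, appending the next element could then look something like this. A sketch: append_element is a hypothetical helper, and both tables are locked because MySQL only lets the session touch tables named in the LOCK TABLES statement:

from django.db.models import Max

def append_element(lst, name):
    # the WRITE lock on ListElement serializes the MAX() lookup and the INSERT
    with acquire_table_lock(read=[List], write=[ListElement]):
        current_max = ListElement.objects.filter(lst=lst).aggregate(
            Max('the_order'))['the_order__max'] or 0
        return ListElement.objects.create(lst=lst, name=name, the_order=current_max + 1)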

Close SQLAlchemy connection

I have the following function in python:
def add_odm_object(obj, table_name, primary_key, unique_column):
    db = create_engine('mysql+pymysql://root:@127.0.0.1/mydb')
    metadata = MetaData(db)
    t = Table(table_name, metadata, autoload=True)
    s = t.select(t.c[unique_column] == obj[unique_column])
    rs = s.execute()
    r = rs.fetchone()
    if not r:
        i = t.insert()
        i_res = i.execute(obj)
        v_id = i_res.inserted_primary_key[0]
        return v_id
    else:
        return r[primary_key]
This function checks whether the object obj is in the database and, if it is not found, saves it to the DB. Now, I have a problem. I call the above function in a loop many times, and after a few hundred iterations I get an error: user root has exceeded the max_user_connections resource (current value: 30). I tried to search for answers; for example, the question How to close sqlalchemy connection in MySQL recommends creating a conn = db.connect() object, where db is the engine, and calling conn.close() after my query is completed.
But, where should I open and close the connection in my code? I am not working with the connection directly, but I'm using the Table() and MetaData functions in my code.
The engine is an expensive-to-create factory for database connections. Your application should call create_engine() exactly once per database server.
Similarly, the MetaData and Table objects describe a fixed schema object within a known database. These are also configurational constructs that in most cases are created once, just like classes, in a module.
In this case, your function seems to want to load up tables dynamically, which is fine; the MetaData object acts as a registry, which has the convenience feature that it will give you back an existing table if it already exists.
Within a Python function and especially within a loop, for best performance you typically want to refer to a single database connection only.
Taking these things into account, your module might look like:
from sqlalchemy import create_engine, MetaData, Table

# module level variable. can be initialized later,
# but generally just want to create this once.
db = create_engine('mysql+pymysql://root:@127.0.0.1/mydb')

# module level MetaData collection.
metadata = MetaData()

def add_odm_object(obj, table_name, primary_key, unique_column):
    with db.begin() as connection:
        # will load table_name exactly once, then store it persistently
        # within the above MetaData
        t = Table(table_name, metadata, autoload=True, autoload_with=connection)
        s = t.select(t.c[unique_column] == obj[unique_column])
        rs = connection.execute(s)
        r = rs.fetchone()
        if not r:
            i_res = connection.execute(t.insert(), obj)
            v_id = i_res.inserted_primary_key[0]
            return v_id
        else:
            return r[primary_key]
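Callers then keep using the function exactly as before, while every call shares the one module-level engine and its connection pool. A usage sketch with hypothetical table and column names:

# hypothetical schema: a 'users' table whose primary key is 'id'
# and which has a unique 'email' column
obj = {"email": "alice@example.com", "name": "Alice"}

for _ in range(500):
    # repeated calls reuse pooled connections instead of creating a new
    # engine each time, so max_user_connections is no longer exhausted
    v_id = add_odm_object(obj, "users", "id", "email")
print(v_id)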
