Detect if a SQLAlchemy model has pending write operations associated with it - python

I have an application in which I query a SQL database and end up with a SQLAlchemy object representing a given row. Then, based on user input and a series of if/then statements, I may perform an update on the SQLAlchemy object, e.g.:
if 'fooinput1' in payload:
    sqla_instance.foo1 = validate_foo1(payload['fooinput1'])
if 'fooinput2' in payload:
    sqla_instance.foo2 = validate_foo2(payload['fooinput2'])
...
I now need to add modified_at and modified_by data to this system. Is it possible to check something on the SQLA instance like sqla_instance.was_modified or sqla_instance.update_pending to determine if a modification was performed?
(I recognize that I could maintain my own was_modified boolean, but since there are many of these if/then clauses that would lead to a lot of boilerplate which I'd like to avoid if possible.)
FWIW: this is a Python 2.7 Pyramid app reading from a MySQL db in the context of a web request.

The Session object the SQLAlchemy ORM provides has two members that can help with what you are trying to do:
1) the Session.is_modified() method
2) the Session.dirty attribute
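For example, a minimal sketch of both, assuming sqla_instance is attached to a session called session:

from datetime import datetime

# Per-instance check: True if this instance has pending attribute changes.
if session.is_modified(sqla_instance):
    sqla_instance.modified_at = datetime.utcnow()

# Session-wide view: session.dirty is the set of all modified instances.
if sqla_instance in session.dirty:
    pass  # a write is pending for this instance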

Depending on how many fields' modifications you want to track, you may achieve what you need using ORM events:

from datetime import datetime
import sqlalchemy as sa

@sa.event.listens_for(Thing.foo, 'set')
def foo_changed(target, value, oldvalue, initiator):
    target.last_modified = datetime.now()

@sa.event.listens_for(Thing.baz, 'set')
def baz_changed(target, value, oldvalue, initiator):
    target.last_modified = datetime.now()

The Session object also provides events, which may help you; in particular, review the object lifecycle events such as the transient-to-pending transition for objects entering a session.
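For instance, a hedged sketch that stamps every modified instance from a single session-level before_flush hook, so no per-field boilerplate is needed (the modified_at attribute follows the question; adapt to your models):

from datetime import datetime
import sqlalchemy as sa
from sqlalchemy.orm import Session

@sa.event.listens_for(Session, 'before_flush')
def stamp_modified(session, flush_context, instances):
    # session.dirty holds instances with pending changes;
    # is_modified() filters out no-op attribute sets.
    for obj in session.dirty:
        if session.is_modified(obj):
            obj.modified_at = datetime.utcnow()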

Related

Unit testing a function that depends on database

I am running tests on some functions. I have a function that uses database queries. I have gone through the blogs and docs that say we have to make an in-memory or test database to use such functions. Below is my function:
def already_exists(story_data, c):
    # TODO(salmanhaseeb): Implement de-dupe functionality by checking if it already
    # exists in the DB.
    c.execute("""SELECT COUNT(*) from posts where post_id = ?""", (story_data.post_id,))
    (number_of_rows,) = c.fetchone()
    if number_of_rows > 0:
        return True
    return False
This function hits the production database. My question is: when testing, I create an in-memory database and populate my values there, and I want to query that database (the test DB). But when I call my already_exists() function from a test, the production DB is hit. How do I make the function hit my test DB while testing?
There are two routes you can take to address this problem:
Make an integration test instead of a unit test and just use a copy of the real database.
Provide a fake to the method instead of the actual connection object.
Which one you should do depends on what you're trying to achieve.
If you want to test that the query itself works, then you should use an integration test. Full stop. The only way to make sure the query works as intended is to run it with test data already in a copy of the database. Running it against a different database technology (e.g., running against SQLite when your production database is PostgreSQL) will not ensure that it works in production.

Needing a copy of the database means you will need some automated deployment process for it that can be easily invoked against a separate database. You should have such an automated process anyway, as it helps ensure that your deployments across environments are consistent, allows you to test them prior to release, and "documents" the process of upgrading the database. Standard solutions to this are migration tools written in your programming language like Alembic, or tools to execute raw SQL like yoyo or Flyway. You would need to invoke the deployment and fill it with test data prior to running the test, then run the test and assert the output you expect to be returned.

If you want to test the code around the query and not the query itself, then you can use a fake for the connection object. The most common solution to this is a mock. Mocks provide stand-ins that can be configured to accept the function calls and inputs and return some output in place of the real object. This would allow you to test that the logic of the method works correctly, assuming that the query returns the results you expect. For your method, such a test might look something like this:
from unittest.mock import Mock
...
def test_already_exists_returns_true_for_positive_count():
    mockConn = Mock(
        execute=Mock(),
        fetchone=Mock(return_value=(5,)),
    )
    story = Story(post_id=10)  # Making some assumptions about what your object might look like.
    result = already_exists(story, mockConn)
    assert result
    # Possibly assert calls on the mock. Value of these asserts is debatable.
    mockConn.execute.assert_called_with("""SELECT COUNT(*) from posts where post_id = ?""", (story.post_id,))
    mockConn.fetchone.assert_called()
The issue is ensuring that your code consistently uses the same database connection. Then you can set it once to whatever is appropriate for the current environment.
Rather than passing the database connection around from method to method, it might make more sense to make it a singleton.
def already_exists(story_data):
    # Here `connection` is a singleton which returns the database connection.
    connection.execute("""SELECT COUNT(*) from posts where post_id = ?""", (story_data.post_id,))
    (number_of_rows,) = connection.fetchone()
    if number_of_rows > 0:
        return True
    return False
Or make connection a method on each class and turn already_exists into a method. It should probably be a method regardless.
def already_exists(self):
    # Here the connection is associated with the object.
    self.connection.execute("""SELECT COUNT(*) from posts where post_id = ?""", (self.post_id,))
    (number_of_rows,) = self.connection.fetchone()
    if number_of_rows > 0:
        return True
    return False
But really you shouldn't be rolling this code yourself. Instead you should use an ORM such as SQLAlchemy which takes care of basic queries and connection management like this for you. It has a single connection, the "session".
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy_declarative import Address, Base, Person
engine = create_engine('sqlite:///sqlalchemy_example.db')
Base.metadata.bind = engine
DBSession = sessionmaker(bind=engine)
session = DBSession()
Then you use that to make queries. For example, the Query object has an exists() method:

q = session.query(Post).filter(Post.post_id == story_data.post_id)
session.query(q.exists()).scalar()
Using an ORM will greatly simplify your code. Here's a short tutorial for the basics, and a longer and more complete tutorial.

Dialect-specific SQLAlchemy declarative Column defaults

Short Version
In SQLAlchemy's ORM column declaration, how can I use server_default=sa.FetchedValue() on one dialect, and default=somePythonFunction on another, so that my real DBMS can populate things with triggers, and my test code can be written against sqlite?
Background
I'm using SQLAlchemy's declarative ORM to work with a Postgres database, but trying to write unit tests against an sqlite:///:memory:, and running into a problem with columns that have computed defaults on their primary keys. For a minimal example:
CREATE TABLE test_table(
    id VARCHAR PRIMARY KEY NOT NULL
        DEFAULT (lower(hex(randomblob(16))))
)
SQLite itself is quite happy with this table definition (sqlfiddle) but SQLAlchemy seems unable to work out the ID of newly created rows.
class TestTable(Base):
    __tablename__ = 'test_table'
    id = sa.Column(
        sa.VARCHAR,
        primary_key=True,
        server_default=sa.FetchedValue())
Definitions like this work just fine in postgres, but die in sqlite (as you can see on Ideone) with a FlushError when I call Session.commit:
sqlalchemy.orm.exc.FlushError: Instance <TestTable at 0x7fc0e0254a10> has a NULL identity key. If this is an auto-generated value, check that the database table allows generation of new primary key values, and that the mapped Column object is configured to expect these generated values. Ensure also that this flush() is not occurring at an inappropriate time, such as within a load() event.
The documentation for FetchedValue warns us that this can happen on dialects that don't support the RETURNING clause on INSERT:
For special situations where triggers are used to generate primary key
values, and the database in use does not support the RETURNING clause,
it may be necessary to forego the usage of the trigger and instead
apply the SQL expression or function as a “pre execute” expression:
t = Table('test', meta,
    Column('abc', MyType, default=func.generate_new_value(),
           primary_key=True)
)
func.generate_new_value is not defined anywhere else in SQLAlchemy, so it seems they intend me to either generate defaults in Python, or else write a separate function that does a SQL query to generate a default value in the DBMS. I can do that, but the problem is that I only want to do it for SQLite, since FetchedValue does exactly what I want on Postgres.
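A minimal sketch of the Python-side fallback the docs suggest, matching the lower(hex(randomblob(16))) default above with uuid (column declaration shown in isolation):

import uuid
import sqlalchemy as sa

id = sa.Column(
    sa.VARCHAR,
    primary_key=True,
    default=lambda: uuid.uuid4().hex)  # 32 lowercase hex chars, like the trigger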
Dead Ends
Subclassing Column probably won't work. Nothing that I can find in the sources ever tells the Column what dialect is being used, and the behavior of the default and server_default fields is defined outside the class.
Writing a python function that calls the triggers by hand on the real DBMS creates a race condition. Avoiding the race condition by changing the isolation level creates a deadlock.
My Current Workaround
Bad because it breaks integration tests that connect to a real postgres.
import sys
import sqlalchemy as sa

def trigger_column(*a, **kw):
    python_default = kw.pop('python_default')
    if 'unittest' in sys.modules:
        return sa.Column(*a, default=python_default, **kw)
    else:
        return sa.Column(*a, server_default=sa.FetchedValue(), **kw)
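A variant sketch of the same workaround that keys off the database URL instead of sys.modules, so integration tests against a real Postgres still take the trigger path; make_engine_url() is a hypothetical helper returning whatever URL the current process will connect with:

import sqlalchemy as sa

def trigger_column(*a, **kw):
    python_default = kw.pop('python_default')
    url = sa.engine.make_url(make_engine_url())  # hypothetical helper
    if url.get_backend_name() == 'sqlite':
        return sa.Column(*a, default=python_default, **kw)
    return sa.Column(*a, server_default=sa.FetchedValue(), **kw)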
Not a direct answer to your question, but hopefully helpful to someone.
My problem was wanting to change the collation depending on the dialect; this was my solution:

from sqlalchemy import Unicode
from sqlalchemy.ext.compiler import compiles

@compiles(Unicode, 'sqlite')
def compile_unicode(element, compiler, **kw):
    element.collation = None
    return compiler.visit_unicode(element, **kw)
This changes the collation for all Unicode columns only for sqlite.
Here's some documentation: http://docs.sqlalchemy.org/en/latest/core/custom_types.html#overriding-type-compilation

Add trigger to SQLAlchemy Base Class

I'm making a SQLAlchemy base class for a new Postgres database and want to have bookkeeping fields incorporated into it. Specifically, I want to have two columns for modified_at and modified_by that are updated automatically. I was able to find out how to do this for individual tables, but it seems like making this part of the base class is trickier.
My first thought was to try and leverage the declared_attr functionality, but I don't actually want to make the triggers an attribute in the model so that seems incorrect. Then I looked at adding the trigger using event.listen:
trigger = """
CREATE TRIGGER update_{table_name}_modified
BEFORE UPDATE ON {table_name}
FOR EACH ROW EXECUTE PROCEDURE update_modified_columns()
"""

def create_modified_trigger(target, connection, **kwargs):
    if hasattr(target, 'name'):
        connection.execute(trigger.format(table_name=target.name))

Base = declarative_base()
event.listen(Base.metadata, 'after_create', create_modified_trigger)
I thought I could find table_name via the target parameter as shown in the docs, but when listening on Base.metadata the target is MetaData(bind=None) rather than a table.
I would strongly prefer to have this functionality as part of the Base rather than including it in migrations or externally to reduce the chance of someone forgetting to add the triggers. Is this possible?
I was able to sort this out with the help of a coworker. The returned MetaData object did in fact have a list of tables. Here is the working code:
modified_trigger = """
CREATE TRIGGER update_{table_name}_modified
BEFORE UPDATE ON {table_name}
FOR EACH ROW EXECUTE PROCEDURE update_modified_columns()
"""

def create_modified_trigger(target, connection, **kwargs):
    """
    This is used to add bookkeeping triggers after a table is created. It
    hooks into the SQLAlchemy event system. It expects the target to be an
    instance of MetaData.
    """
    for key in target.tables:
        table = target.tables[key]
        connection.execute(modified_trigger.format(table_name=table.name))

Base = declarative_base()
event.listen(Base.metadata, 'after_create', create_modified_trigger)
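For completeness, the trigger above assumes an update_modified_columns() procedure already exists in the database. A minimal sketch of one possible definition (the body is an assumption; it only stamps modified_at), installed through the same event system:

create_function = """
CREATE OR REPLACE FUNCTION update_modified_columns()
RETURNS TRIGGER AS $$
BEGIN
    NEW.modified_at = now();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql
"""

def create_modified_function(target, connection, **kwargs):
    connection.execute(create_function)

# Install the function before the per-table triggers are created:
event.listen(Base.metadata, 'before_create', create_modified_function)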

Django. Thread safe update or create.

We know that update is a thread-safe operation.
It means that when you do:

SomeModel.objects.filter(id=1).update(some_field=100)

instead of:

sm = SomeModel.objects.get(id=1)
sm.some_field = 100
sm.save()

your application is relatively thread-safe, and the operation SomeModel.objects.filter(id=1).update(some_field=100) will not overwrite data in other model fields.
My question is: is there any way to do

SomeModel.objects.filter(id=1).update(some_field=100)

but with creation of the object if it does not exist?
from django.db import IntegrityError

def update_or_create(model, filter_kwargs, update_kwargs):
    if not model.objects.filter(**filter_kwargs).update(**update_kwargs):
        kwargs = filter_kwargs.copy()
        kwargs.update(update_kwargs)
        try:
            model.objects.create(**kwargs)
        except IntegrityError:
            if not model.objects.filter(**filter_kwargs).update(**update_kwargs):
                raise  # re-raise IntegrityError
I think the code provided in the question is not very demonstrative: who wants to set the id on a model?
Let's assume we need this, and we have simultaneous operations:

def thread1():
    update_or_create(SomeModel, {'some_unique_field': 1}, {'some_field': 1})

def thread2():
    update_or_create(SomeModel, {'some_unique_field': 1}, {'some_field': 2})

With the update_or_create function, depending on which thread comes first, the object will be created and updated with no exception. This is thread-safe, but obviously of little use: depending on the race, the value of SomeModel.objects.get(some_unique_field=1).some_field could be 1 or 2.
Django provides F objects, so we can upgrade our code:

from django.db.models import F

def thread1():
    update_or_create(SomeModel,
                     {'some_unique_field': 1},
                     {'some_field': F('some_field') + 1})

def thread2():
    update_or_create(SomeModel,
                     {'some_unique_field': 1},
                     {'some_field': F('some_field') + 2})
You want django's select_for_update() method (and a backend that supports row-level locking, such as PostgreSQL) in combination with manual transaction management.
try:
    with transaction.commit_on_success():
        SomeModel.objects.create(pk=1, some_field=100)
except IntegrityError:  # unique id already exists, so update instead
    with transaction.commit_on_success():
        object = SomeModel.objects.select_for_update().get(pk=1)
        object.some_field = 100
        object.save()
Note that if some other process deletes the object between the two queries, you'll get a SomeModel.DoesNotExist exception.
Django 1.7 and above also has atomic operation support and a built-in update_or_create() method.
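A minimal sketch of the built-in method (Django 1.7+), using the field names from the question:

obj, created = SomeModel.objects.update_or_create(
    id=1,
    defaults={'some_field': 100},
)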
You can use Django's built-in get_or_create, but note that it operates on the model itself rather than a queryset, and it returns an (object, created) tuple.
You can use it like this:

me, created = SomeModel.objects.get_or_create(id=1)
me.some_field = 100
me.save()
If you have multiple threads, your app will need to determine which instance of the model is correct. Usually what I do is refresh the model from the database, make changes, and then save it, so there isn't a long window in a disconnected state.
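A sketch of that refresh-then-save flow, assuming Django 1.8+ where refresh_from_db() is available:

me, created = SomeModel.objects.get_or_create(id=1)
# ...later, before modifying, re-read the current DB state:
me.refresh_from_db()
me.some_field = 100
me.save()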
It's impossible in Django to do such an upsert operation with update alone. But the queryset update method returns the number of rows it matched, so you can do:
from django.db import models, router, connections, transaction

class MySuperManager(models.Manager):
    def _lock_table(self, lock='ACCESS EXCLUSIVE'):
        cursor = connections[router.db_for_write(self.model)].cursor()
        cursor.execute(
            'LOCK TABLE %s IN %s MODE' % (self.model._meta.db_table, lock)
        )

    def create_or_update(self, id, **update_fields):
        with transaction.commit_on_success():
            self._lock_table()
            if not self.get_query_set().filter(id=id).update(**update_fields):
                self.model(id=id, **update_fields).save()
This example is for Postgres. You can use it without raw SQL, but then the update-or-insert operation will not be atomic. With a lock on the table you can be sure that two objects will not be created by two concurrent threads.
I think if you have critical demands on atomic operations, you'd better design them at the database level instead of the Django ORM level.
The Django ORM focuses on convenience rather than performance and safety; you sometimes have to optimize the automatically generated SQL.
Transactions in most production databases provide locking and rollback well.
In mashup (hybrid) systems, or when your system adds third-party components such as logging or statistics, applications in different frameworks or even languages may access the database at the same time; making things thread-safe in Django alone is not enough in that case.
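A hedged sketch of pushing the upsert down to the database, as suggested above: a PostgreSQL 9.5+ INSERT ... ON CONFLICT statement issued through Django's cursor (the table name app_somemodel is an assumption based on Django's default naming):

from django.db import connection

def upsert_some_field(pk, value):
    with connection.cursor() as cursor:
        cursor.execute(
            """
            INSERT INTO app_somemodel (id, some_field)
            VALUES (%s, %s)
            ON CONFLICT (id) DO UPDATE SET some_field = EXCLUDED.some_field
            """,
            [pk, value],
        )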

Django: how to do get_or_create() in a threadsafe way?

In my Django app very often I need to do something similar to get_or_create(). E.g., a user submits a tag, and I need to see if that tag is already in the database. If not, create a new record for it. If it is, just update the existing record.
But looking into the doc for get_or_create() it looks like it's not threadsafe. Thread A checks and finds Record X does not exist. Then Thread B checks and finds that Record X does not exist. Now both Thread A and Thread B will create a new Record X.
This must be a very common situation. How do I handle it in a threadsafe way?
Since 2013 or so, get_or_create is atomic, so it handles concurrency nicely:
This method is atomic assuming correct usage, correct database
configuration, and correct behavior of the underlying database.
However, if uniqueness is not enforced at the database level for the
kwargs used in a get_or_create call (see unique or unique_together),
this method is prone to a race-condition which can result in multiple
rows with the same parameters being inserted simultaneously.
If you are using MySQL, be sure to use the READ COMMITTED isolation
level rather than REPEATABLE READ (the default), otherwise you may see
cases where get_or_create will raise an IntegrityError but the object
won’t appear in a subsequent get() call.
From: https://docs.djangoproject.com/en/dev/ref/models/querysets/#get-or-create
Here's an example of how you could do it:
Define a model with a field using unique=True:

class MyModel(models.Model):
    slug = models.SlugField(max_length=255, unique=True)
    name = models.CharField(max_length=255)

MyModel.objects.get_or_create(slug=<user_slug_here>, defaults={"name": <user_name_here>})
... or by using unique_together:

class MyModel(models.Model):
    prefix = models.CharField(max_length=3)
    slug = models.SlugField(max_length=255)
    name = models.CharField(max_length=255)

    class Meta:
        unique_together = ("prefix", "slug")

MyModel.objects.get_or_create(prefix=<user_prefix_here>, slug=<user_slug_here>, defaults={"name": <user_name_here>})
Note how the non-unique fields are in the defaults dict, NOT among the unique fields in get_or_create. This will ensure your creates are atomic.
Here's how it's implemented in Django: https://github.com/django/django/blob/fd60e6c8878986a102f0125d9cdf61c717605cf1/django/db/models/query.py#L466 - Try creating an object, catch an eventual IntegrityError, and return the copy in that case. In other words: handle atomicity in the database.
This must be a very common situation. How do I handle it in a threadsafe way?
Yes.
The "standard" solution in SQL is to simply attempt to create the record. If it works, that's good. Keep going.
If an attempt to create a record gets a "duplicate" exception from the RDBMS, then do a SELECT and keep going.
Django, however, has an ORM layer with its own cache. So the logic is inverted to make the common case work directly and quickly, and to have the uncommon case (the duplicate) raise a rare exception.
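A minimal sketch of that create-first pattern in Django, assuming a hypothetical Tag model whose name field is unique:

from django.db import IntegrityError

def get_or_create_tag(name):
    try:
        return Tag.objects.create(name=name)
    except IntegrityError:
        # Another thread/process inserted it first; fetch the existing row.
        return Tag.objects.get(name=name)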
Try the transaction.commit_on_success decorator on the callable in which you are doing get_or_create(**kwargs):
"Use the commit_on_success decorator to use a single transaction for all the work done in a function. If the function returns successfully, then Django will commit all work done within the function at that point. If the function raises an exception, though, Django will roll back the transaction."
Apart from that, in concurrent calls to get_or_create, both threads try to get the object with the arguments passed to it (except for the "defaults" arg, which is a dict used during the create call in case get() fails to retrieve any object). In case of failure, both threads try to create the object, resulting in duplicate objects unless some unique/unique_together constraint is implemented at the database level on the field(s) used in the get() call.
it is similar to this post
How do I deal with this race condition in django?
So many years have passed, but nobody has written about threading.Lock. If you don't have the opportunity to add a unique_together constraint via migrations, for legacy reasons, you can use locks or threading.Semaphore objects. Here is the pseudocode:
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

_lock = Lock()

def get_staff(data: dict):
    _lock.acquire()
    try:
        staff, created = MyModel.objects.get_or_create(**data)
        return staff
    finally:
        _lock.release()

with ThreadPoolExecutor(max_workers=50) as pool:
    pool.map(get_staff, get_list_of_some_data())
