Graceful way of cleaning objects in sqlalchemy? - python

Currently I am developing a class which abstracts SQLAlchemy. This class will act as a helper tool to verify values from the database, and it will be used in regression/load tests. The test cases will make hundreds of thousands of database queries. The layout of my class is as follows:
from sqlalchemy import create_engine, MetaData
from sqlalchemy.orm import sessionmaker

class MyDBClass:
    def __init__(self, dbName):
        self.dbName = dbName
        self.dbEngines = {}
        self.dbMetaData = {}
        self.dbSession = {}
        self.dbEngines[dbName] = create_engine(...)  # connection URL elided
        self.dbMetaData[dbName] = MetaData()
        self.dbMetaData[dbName].reflect(bind=self.dbEngines[dbName])
        self.dbSession[dbName] = sessionmaker(bind=self.dbEngines[dbName])

    def QueryFunction(self, dbName, tableName, *args):
        session = self.dbSession[dbName]()
        requiredTable = self.dbMetaData[dbName].tables[tableName]
        query = session.query(requiredTable)
        result = query.filter().all()
        session.close()
        return result

    def updateFunction(self, dbName, tableName, *args):
        session = self.dbSession[dbName]()
        requiredTable = self.dbMetaData[dbName].tables[tableName]
        session.query(requiredTable).filter().update()
        session.commit()
        session.close()

    def insertFunction(self, dbName, tableName, columnValuePair):
        connection = self.dbEngines[dbName].connect()
        requiredTable = self.dbMetaData[dbName].tables[tableName]
        connection.execute(requiredTable.insert(values=columnValuePair))
        connection.close()

    def cleanClose(self):
        # Code which will remove the connection/session/object from memory.
        # Do some graceful work to close cleanly.
        pass
I want to write the cleanClose() method so that it removes the objects that might be created by this class. The method should remove all those objects from memory and provide a clean close; this may also help avoid memory leaks.
I am not able to figure out which objects should be removed from memory. Can someone suggest what method calls I need to make here?
Edit1:
Is there any way I can measure the performance of the different methods and their variants?
I was going through the documentation here and realized that I should not create a session in every method; rather, I should create a single session instance and use it throughout. Please provide your feedback on this, and let me know what the best way of doing things here would be.
Any kind of help will be greatly appreciated here.

To remove objects from memory in Python, you just need to stop referencing them. There is usually no need to explicitly write or call any methods to destroy or clean up the objects. So an instance of MyDBClass will be cleaned up automatically when it goes out of scope.
If you are talking about closing down an SQLAlchemy session, then you just need to call the close() method on it.
An SQLAlchemy session is designed for multiple transactions. You don't generally need to create and destroy it multiple times. Create one session in the __init__ function and then use that in QueryFunction, updateFunction, etc.
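A minimal sketch of that idea, plus a cleanClose() that releases what SQLAlchemy actually holds on to (the session and the engine's connection pool). This assumes a single database per instance; the names roughly follow the question's layout:

from sqlalchemy import create_engine, MetaData
from sqlalchemy.orm import sessionmaker

class MyDBClass:
    def __init__(self, dbName, url):
        self.dbName = dbName
        self.engine = create_engine(url)
        self.metadata = MetaData()
        self.metadata.reflect(bind=self.engine)
        # One session reused by all query/update/insert methods.
        self.session = sessionmaker(bind=self.engine)()

    def cleanClose(self):
        # Release the ORM session (returning its connection to the pool),
        # then dispose of the pool itself so no sockets stay open.
        self.session.close()
        self.engine.dispose()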

Related

Unit testing a function that depends on database

I am running tests on some functions. I have a function that uses database queries. I have gone through the blogs and docs that say we have to make an in-memory or test database to use such functions. Below is my function:
def already_exists(story_data, c):
    # TODO(salmanhaseeb): Implement de-dupe functionality by checking if it already
    # exists in the DB.
    c.execute("""SELECT COUNT(*) from posts where post_id = ?""", (story_data.post_id,))
    (number_of_rows,) = c.fetchone()
    if number_of_rows > 0:
        return True
    return False
This function hits the production database. My question is: when testing, I create an in-memory database and populate my values there, so I will be querying that database (the test DB). But when I want to test my already_exists() function, calling already_exists() from the test still hits my production DB. How do I make the test DB get hit while testing this function?
There are two routes you can take to address this problem:
1. Make an integration test instead of a unit test and just use a copy of the real database.
2. Provide a fake to the method instead of an actual connection object.
Which one you should do depends on what you're trying to achieve.
If you want to test that the query itself works, then you should use an integration test. Full stop. The only way to make sure the query works as intended is to run it with test data already in a copy of the database. Running it against a different database technology (e.g., running against SQLite when your production database is PostgreSQL) will not ensure that it works in production. Needing a copy of the database means you will need some automated deployment process for it that can easily be invoked against a separate database. You should have such an automated process anyway, as it helps ensure that your deployments across environments are consistent, allows you to test them prior to release, and "documents" the process of upgrading the database. Standard solutions to this are migration tools written in your programming language like Alembic, or tools that execute raw SQL like yoyo or Flyway. You would need to invoke the deployment and fill it with test data prior to running the test, then run the test and assert the output you expect to be returned.
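For the integration route, the key point is that already_exists() already receives its cursor as a parameter, so a test can simply inject a cursor bound to a seeded test database. A minimal pytest sketch, assuming the sqlite3 driver (the ? placeholders suggest it), with already_exists imported from your module and Story as a hypothetical stand-in for your object:

import sqlite3
from collections import namedtuple

import pytest

# from your_module import already_exists  # wherever the function actually lives

Story = namedtuple("Story", "post_id")  # hypothetical stand-in for the real object

@pytest.fixture
def seeded_cursor():
    # In-memory database that lives only for the duration of one test.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE posts (post_id INTEGER PRIMARY KEY)")
    cur.execute("INSERT INTO posts (post_id) VALUES (?)", (10,))
    conn.commit()
    yield cur
    conn.close()

def test_already_exists_finds_seeded_row(seeded_cursor):
    assert already_exists(Story(post_id=10), seeded_cursor)

def test_already_exists_misses_absent_row(seeded_cursor):
    assert not already_exists(Story(post_id=99), seeded_cursor)

If production runs on a different engine, point the same kind of fixture at a disposable copy of that engine instead of SQLite.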
If you want to test the code around the query and not the query itself, then you can use a fake for the connection object. The most common solution is a mock. Mocks provide stand-ins that can be configured to accept function calls and inputs and return some output in place of the real object. This allows you to test that the logic of the method works correctly, assuming the query returns the results you expect. For your method, such a test might look something like this:
from unittest.mock import Mock

...

def test_already_exists_returns_true_for_positive_count():
    mockConn = Mock(
        execute=Mock(),
        fetchone=Mock(return_value=(5,)),
    )
    story = Story(post_id=10)  # Making some assumptions about what your object might look like.

    result = already_exists(story, mockConn)

    assert result
    # Possibly assert calls on the mock. The value of these asserts is debatable.
    mockConn.execute.assert_called_with("""SELECT COUNT(*) from posts where post_id = ?""", (story.post_id,))
    mockConn.fetchone.assert_called()
The issue is ensuring that your code consistently uses the same database connection. Then you can set it once to whatever is appropriate for the current environment.
Rather than passing the database connection around from method to method, it might make more sense to make it a singleton.
def already_exists(story_data):
    # Here `connection` is a singleton which returns the database connection.
    connection.execute("""SELECT COUNT(*) from posts where post_id = ?""", (story_data.post_id,))
    (number_of_rows,) = connection.fetchone()
    if number_of_rows > 0:
        return True
    return False
Or make the connection an attribute on each class and turn already_exists into a method. It should probably be a method regardless.
def already_exists(self):
    # Here the connection is associated with the object.
    self.connection.execute("""SELECT COUNT(*) from posts where post_id = ?""", (self.post_id,))
    (number_of_rows,) = self.connection.fetchone()
    if number_of_rows > 0:
        return True
    return False
But really you shouldn't be rolling this code yourself. Instead, you should use an ORM such as SQLAlchemy, which takes care of basic queries and connection management like this for you. It has a single connection, the "session".
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy_declarative import Address, Base, Person
engine = create_engine('sqlite:///sqlalchemy_example.db')
Base.metadata.bind = engine
DBSession = sessionmaker(bind=engine)
session = DBSession()
Then you use that to make queries. For example, it has an exists() construct:
q = session.query(Post.id).filter(Post.id == story_data.post_id)
session.query(q.exists()).scalar()
Using an ORM will greatly simplify your code. Here's a short tutorial for the basics, and a longer and more complete tutorial.

Using transactions with peewee without using `atomic()`

We have a file db.py where a peewee database is defined:
db = PostgresqlExtDatabase('mom',
                           user=DB_CONFIG['username'],
                           password=DB_CONFIG['password'],
                           host=DB_CONFIG['host'],
                           port=DB_CONFIG['port'],
                           threadlocals=True,
                           register_hstore=False,
                           autocommit=True,
                           autorollback=True,
                           cursor_factory=DictCursor)
Calling db.execute("SOME RAW SQL UPDATE QUERY") works as expected.
But calling begin() before it does not stop the DB from being modified:
db.begin()
db.execute("SOME RAW SQL UPDATE QUERY")  # <- Does not wait; the DB is updated immediately here
db.commit()
Am I doing this right? I basically need to nest the raw SQL in a transaction if one is already ongoing, or just execute it right away if no begin() has been called.
This works as expected if I do db.set_autocommit(False), then execute_sql(), then commit(). It also works inside the atomic() context manager.
To give some context, I am working on a web application for logistics, and our codebase uses Flask and an SQLAlchemy scoped_session with autocommit set to True.
It does not use the SQLAlchemy ORM (due to... historical reasons) and instead just uses the Session object and its execute(), begin(), begin_nested(), rollback() and remove() methods.
The way it does this is by defining Session = scoped_session(sessionmaker(autocommit=True)) in a file, then calling session = Session() everywhere in the codebase and executing queries with session.execute("SQL").
Sometimes a session.begin() is called, so the query does not execute until the commit (or rollback).
We'd now really like to use peewee. But... the codebase is built on this session, so it has to be spoofed.
Going and changing every file is impossible, and there aren't enough test cases to boot (for... historical reasons).
Also, I had some questions but I don't know where to ask them, so I hope you don't mind if I put them here:
Is this db object (and its connection) bound to the thread it is executing in? Basically, will there be some bug if db is imported from two different files and db.begin() is called from each?
I can see in the ipython shell that the id of the db object above is the same per thread, so am I correct in assuming that unless the psycopg2 connection is recreated, this should be isolated?
To spoof the SQLAlchemy Session, I have created a wrapper class that returns the kind of session required: either the SQLAlchemy Session object, or the wrapper I've written for peewee to spoof it.
class SessionMocker(object):
    # DO NOT make this a singleton. Sessions will break.
    def __init__(self, orm_type=ORM_TYPES.SQLA):
        assert orm_type in ORM_TYPES, "Invalid session constructor type"
        super(SessionMocker, self).__init__()
        self.orm_type = orm_type

    def __call__(self, *args, **kwargs):
        if self.orm_type == ORM_TYPES.SQLA:
            return SQLASession(*args, **kwargs)
        if self.orm_type == ORM_TYPES.PEEWEE:
            # For now let's assume no slave.
            return SessionWrapper(*args, **kwargs)
        raise NotImplementedError

    def __getattr__(self, item):
        """
        Assuming this will never be called without calling Session() first.
        Otherwise there is no way to tell what type of Session class (ORM) is required,
        since that can't be passed in.
        """
        if self.orm_type == ORM_TYPES.SQLA:
            kls = SQLASession
        elif self.orm_type == ORM_TYPES.PEEWEE:
            kls = SessionWrapper
        else:
            raise NotImplementedError
        return getattr(kls, item)


Session = SessionMocker(ORM_TYPES.SQLA)
I figured this would allow the codebase to make a transparent and seamless switch to peewee without having to change it everywhere.
How can I do this in a better way?
The docs explain how to do this: http://docs.peewee-orm.com/en/latest/peewee/transactions.html#autocommit-mode
But, tl;dr, you need to disable autocommit before begin/commit/rollback will work like you expect:
db.set_autocommit(False)
db.begin()
try:
    user.delete_instance(recursive=True)
except:
    db.rollback()
    raise
else:
    try:
        db.commit()
    except:
        db.rollback()
        raise
finally:
    db.set_autocommit(True)
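For comparison, a sketch of the same flow using the atomic() context manager the question already mentions. atomic() opens a transaction if none is active, or a savepoint if one already is, and rolls back automatically when the block raises (user here is the same hypothetical model instance as above):

# Outermost atomic() is a transaction; nested uses become savepoints.
with db.atomic():
    user.delete_instance(recursive=True)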
The default value of autocommit is True, but the default value of autorollback is False. Setting autorollback to True automatically rolls back when an exception occurs while executing a query. I can't be sure, but maybe this is confusing the situation. So, if you want, try it with autorollback set to False.

How to retrieve properties only once from database in django

I have some relationships in my database that I describe like this:
@property
def translations(self):
    """
    :return: QuerySet
    """
    if not hasattr(self, '_translations'):
        self._translations = ClientTranslation.objects.filter(base=self)
    return self._translations
The idea behind the hasattr() check and self._translations is to hit the db only one time and return the stored property on subsequent accesses.
However, after reading the docs, I'm not sure the code is doing that, as querysets only hit the db when the values are really needed, which happens after my code runs.
What would a correct approach look like?
Yes, the DB is hit the first time someone needs the value. But as you pointed out, you save the query, not the results. Wrap the query with list(...) to save the results.
By the way, you can use the cached_property decorator to make it more elegant. It is not a built-in, though; it can be found here. You end up with:
@cached_property
def translations(self):
    return list(ClientTranslation.objects.filter(base=self))
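For what it's worth, recent Django versions ship an equivalent decorator, so a minimal sketch that avoids the external dependency might look like this (the Client model name is hypothetical; the question only shows the property):

from django.db import models
from django.utils.functional import cached_property

class Client(models.Model):
    @cached_property
    def translations(self):
        # Computed once per instance; list() forces the queryset to evaluate immediately.
        return list(ClientTranslation.objects.filter(base=self))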

Django. Thread safe update or create.

We know that update() is a thread-safe operation.
It means that when you do:
SomeModel.objects.filter(id=1).update(some_field=100)
Instead of:
sm = SomeModel.objects.get(id=1)
sm.some_field=100
sm.save()
your application is relatively thread-safe, and the operation SomeModel.objects.filter(id=1).update(some_field=100) will not rewrite data in other model fields.
My question is: is there any way to do
SomeModel.objects.filter(id=1).update(some_field=100)
but with creation of the object if it does not exist?
from django.db import IntegrityError

def update_or_create(model, filter_kwargs, update_kwargs):
    if not model.objects.filter(**filter_kwargs).update(**update_kwargs):
        kwargs = filter_kwargs.copy()
        kwargs.update(update_kwargs)
        try:
            model.objects.create(**kwargs)
        except IntegrityError:
            if not model.objects.filter(**filter_kwargs).update(**update_kwargs):
                raise  # re-raise IntegrityError
I think the code provided in the question is not very demonstrative: who wants to set the id for a model?
Let's assume we need this, and we have simultaneous operations:
def thread1():
    update_or_create(SomeModel, {'some_unique_field': 1}, {'some_field': 1})

def thread2():
    update_or_create(SomeModel, {'some_unique_field': 1}, {'some_field': 2})
With the update_or_create function, depending on which thread comes first, the object will be created and updated with no exception. This is thread-safe, but obviously of little use: depending on the race, the value of SomeModel.objects.get(some_unique_field=1).some_field could be 1 or 2.
Django provides F objects, so we can upgrade our code:
from django.db.models import F

def thread1():
    update_or_create(SomeModel,
                     {'some_unique_field': 1},
                     {'some_field': F('some_field') + 1})

def thread2():
    update_or_create(SomeModel,
                     {'some_unique_field': 1},
                     {'some_field': F('some_field') + 2})
You want django's select_for_update() method (and a backend that supports row-level locking, such as PostgreSQL) in combination with manual transaction management.
try:
    with transaction.commit_on_success():
        SomeModel.objects.create(pk=1, some_field=100)
except IntegrityError:  # unique id already exists, so update instead
    with transaction.commit_on_success():
        object = SomeModel.objects.select_for_update().get(pk=1)
        object.some_field = 100
        object.save()
Note that if some other process deletes the object between the two queries, you'll get a SomeModel.DoesNotExist exception.
Django 1.7 and above also has atomic operation support and a built-in update_or_create() method.
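A minimal sketch of that built-in variant, using the same field names as the examples above; it creates the row if it is missing, otherwise updates the fields passed in defaults:

obj, created = SomeModel.objects.update_or_create(
    some_unique_field=1,
    defaults={'some_field': 100},
)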
You can use Django's built-in get_or_create, but that operates on the model itself, rather than a queryset.
You can use that like this:
me, created = SomeModel.objects.get_or_create(id=1)
me.some_field = 100
me.save()
If you have multiple threads, your app will need to determine which instance of the model is correct. Usually what I do is refresh the model from the database, make changes, and then save it, so you don't have a long time in a disconnected state.
It's impossible in Django to do such an upsert operation with update() alone. But the queryset update() method returns the number of matched rows, so you can do:
from django.db import router, connections, transaction

class MySuperManager(models.Manager):
    def _lock_table(self, lock='ACCESS EXCLUSIVE'):
        cursor = connections[router.db_for_write(self.model)].cursor()
        cursor.execute(
            'LOCK TABLE %s IN %s MODE' % (self.model._meta.db_table, lock)
        )

    def create_or_update(self, id, **update_fields):
        with transaction.commit_on_success():
            self._lock_table()
            if not self.get_query_set().filter(id=id).update(**update_fields):
                self.model(id=id, **update_fields).save()
This example is for Postgres. You can use it without the raw SQL, but then the update-or-insert operation will not be atomic. If you take a lock on the table, you can be sure that two objects will not be created by two other threads.
I think if you have critical demands on atomic operations, you'd better design this at the database level instead of the Django ORM level.
The Django ORM system focuses on convenience rather than performance and safety. You sometimes have to optimize the automatically generated SQL.
"Transactions" in most production databases provide database locking and rollback well.
In mashup (hybrid) systems, or when your system adds some 3rd-party components like logging or statistics, applications in different frameworks or even languages may access the database at the same time; adding thread safety in Django is not enough in that case.
SomeModel.objects.filter(id=1).update(set__some_field=100)
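To illustrate the "do it at the database level" point, here is a sketch assuming PostgreSQL 9.5+ (which has a native atomic upsert via INSERT ... ON CONFLICT); the table and column names are hypothetical:

from django.db import connection

with connection.cursor() as cursor:
    # One atomic statement: insert the row, or update some_field if the unique key already exists.
    cursor.execute(
        """
        INSERT INTO myapp_somemodel (some_unique_field, some_field)
        VALUES (%s, %s)
        ON CONFLICT (some_unique_field) DO UPDATE SET some_field = EXCLUDED.some_field
        """,
        [1, 100],
    )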

In django, how to delete all related objects when deleting a certain type of instances?

I first tried to override the delete() method, but that doesn't work with QuerySet's bulk delete method. It should be related to the pre_delete signal, but I can't figure it out. My code is as follows:
def _pre_delete_problem(sender, instance, **kwargs):
    instance.context.delete()
    instance.stat.delete()
But this method seems to be called infinitely, and the program runs into an infinite loop.
Can someone please help me?
If the class has foreign keys (or related objects), they are deleted by default, like a DELETE CASCADE in SQL.
You can change this behavior using the on_delete argument when defining the ForeignKey in the class, but by default it is CASCADE.
You can check the docs here.
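A sketch of what that argument looks like on a model (the field and model names here are hypothetical, just to show the options):

from django.db import models

class Problem(models.Model):
    # CASCADE (the default) deletes this row when the referenced row is deleted;
    # PROTECT raises ProtectedError instead; SET_NULL requires null=True.
    context = models.ForeignKey('Context', on_delete=models.CASCADE)
    stat = models.ForeignKey('Stat', on_delete=models.SET_NULL, null=True)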
Now, the pre_delete signal works, but it doesn't call the delete() method if you are using a bulk delete, since bulk deletes don't work on an object-by-object basis.
In your case, using the post_delete signal instead of pre_delete should fix the infinite loop issue. With a ForeignKey's default on_delete value of CASCADE, the pre_delete logic above makes deleting instance.context cascade back to delete instance, which fires the signal and deletes instance.context again, and so forth.
Using this approach:
def _post_delete_problem(sender, instance, **kwargs):
    instance.context.delete()
    instance.stat.delete()

post_delete.connect(_post_delete_problem, sender=Foo)
Can do the cleanup you want.
If you'd like a quick one-off to delete an instance and all of its related objects and those related objects' objects and so on without having to change the DB schema, you can do this -
def recursive_delete(to_del):
    """Recursively delete an object, all of its protected related
    instances, those instances' protected instances, and so on.
    """
    from django.db.models import ProtectedError
    while True:
        try:
            to_del_pk = to_del.pk
            if to_del_pk is None:
                return  # unsaved object
            to_del.delete()
            print(f"Deleted {to_del.__class__.__name__} with pk {to_del_pk}: {to_del}")
            return
        except ProtectedError as e:
            for protected_ob in e.protected_objects:
                recursive_delete(protected_ob)
Be careful, though!
I'd only use this to help with debugging in one-off scripts (or on the shell) with test databases that I don't mind wiping. Relationships aren't always obvious and if something is protected, it's probably for a reason.
