How expensive is it to save a Django ORM model without changes? - python

Sometimes we have to call model_instance.save() regardless of whether any field has changed, just for safety and fast development.
How expensive is this with the Django ORM?
Are signals always sent?
Is any SQL query executed?
I tested with Django Debug Toolbar by calling .save() 10 times at points where nothing in the model had changed, and the log does not register any SQL queries.
Is there another way to test this, or an article on the subject?
Thank you in advance.
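One more way to check it yourself (a minimal sketch, not from the original post; it assumes a Blog model with a row whose pk is 1) is to capture the queries Django runs around a single save():

from django.db import connection
from django.test.utils import CaptureQueriesContext

blog = Blog.objects.get(pk=1)
with CaptureQueriesContext(connection) as ctx:
    blog.save()

print(len(ctx.captured_queries))   # how many statements the save issued
for query in ctx.captured_queries:
    print(query["sql"])            # the exact SQL Django ran

CaptureQueriesContext forces query logging for the wrapped block, so it works even when the debug toolbar or connection.queries shows nothing.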

I'm not entirely sure how your application handles this, but I ran a small test:
a = Blog.objects.get(pk=1)
for b in range(1, 100):
    a.save()
This gave me a result of:
87.04 ms (201 queries)
Be aware as well that each save will do two queries:
SELECT ••• FROM `fun_blog` WHERE `fun_blog`.`id` = 1 LIMIT 1
UPDATE `fun_blog` SET `title` = 'This is my testtitle', `body` = 'This is a testbody' WHERE `fun_blog`.`id` = 1
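If the goal is just a cheap defensive save, one option (a sketch, not part of the original answer; it reuses the Blog model from the test above) is to restrict the write with update_fields, which limits the UPDATE to the named columns while pre_save/post_save signals are still sent:

a = Blog.objects.get(pk=1)
a.title = "This is my testtitle"
a.save(update_fields=["title"])   # the UPDATE touches only the `title` column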

Related

How to test SQLAlchemy versioning in unit tests - Python

Note: Using flask_sqlalchemy here
I'm working on adding versioning to multiple services on the same DB. To make sure it works, I'm adding unit tests that confirm I get an error (for this case my error should be StaleDataError). For other services in other languages, I pulled the same object twice from the DB, updated one instance, saved it, updated the other instance, then tried to save that as well.
However, because SQLAlchemy adds a fake-cache layer between the DB and the service, when I update the first object it automatically updates the other object I hold in memory. Does anyone have a way around this? I created a second session (that solution had worked in other languages) but SQLAlchemy knows not to hold the same object in two sessions.
I was able to manually test it by putting time.sleep() halfway through the test and manually changing data in the DB, but I'd like a way to test this using just the unit code.
Example code:
def test_optimistic_locking(self):
    c = Customer(formal_name='John', id=1)
    db.session.add(c)
    db.session.flush()
    cust = Customer.query.filter_by(id=1).first()
    db.session.expire(cust)
    same_cust = Customer.query.filter_by(id=1).first()
    db.session.expire(same_cust)
    same_cust.formal_name = 'Tim'
    db.session.add(same_cust)
    db.session.flush()
    db.session.expire(same_cust)
    cust.formal_name = 'Jon'
    db.session.add(cust)
    with self.assertRaises(StaleDataError):
        db.session.flush()
    db.session.rollback()
It actually is possible; you need to create two separate sessions. See the unit tests of SQLAlchemy itself for inspiration. Here's a code snippet from one of our unit tests, written with pytest:
def test_article__versioning(connection, db_session: Session):
    article = ProductSheetFactory(title="Old Title", version=1)
    db_session.refresh(article)
    assert article.version == 1

    db_session2 = Session(bind=connection)
    article2 = db_session2.query(ProductSheet).get(article.id)
    assert article2.version == 1

    article.title = "New Title"
    article.version += 1
    db_session.commit()
    assert article.version == 2

    with pytest.raises(sqlalchemy.orm.exc.StaleDataError):
        article2.title = "Yet another title"
        assert article2.version == 1
        article2.version += 1
        db_session2.commit()
Hope that helps. Note that we use "version_id_generator": False in the model, which is why we increment the version ourselves. See the docs for details.
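For context, a model wired up for this kind of manual versioning might look roughly like the following (a sketch based on the answer's description, not its actual code; the table and column names are assumptions, and the import path is for SQLAlchemy 1.4+):

from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class ProductSheet(Base):
    __tablename__ = "product_sheet"
    id = Column(Integer, primary_key=True)
    title = Column(String)
    version = Column(Integer, nullable=False)
    __mapper_args__ = {
        # Use the version column for optimistic-locking checks...
        "version_id_col": version,
        # ...but let application code increment it, as in the test above.
        "version_id_generator": False,
    }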
For anyone that comes across this question, my current hypothesis is that it can't be done. SQLAlchemy is incredibly powerful, and given that the functionality is so good that we can't test this line, we should trust that it works as expected.

Django find_each (like RoR)

Is there a way to use find_each in Django?
According to the Rails documentation:
This method is only intended to use for batch processing of large amounts of records that wouldn’t fit in memory all at once. If you just need to loop over less than 1000 records, it’s probably better just to use the regular find methods.
http://apidock.com/rails/ActiveRecord/Batches/ClassMethods/find_each
Thanks.
One possible solution could be to use the built-in Paginator class (could save a lot of hassle).
https://docs.djangoproject.com/en/dev/topics/pagination/
Try something like:
from django.core.paginator import Paginator
from yourapp.models import YourModel

result_query = YourModel.objects.filter(<your find conditions>)
paginator = Paginator(result_query, 1000)  # the desired batch size

for page in range(1, paginator.num_pages + 1):
    for row in paginator.page(page).object_list:
        # here you can add your required code
        pass
Or you could use the limiting options (queryset slicing) as needed to iterate over the results.
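Another batching option worth mentioning (a sketch not taken from the answers here; chunk_size requires Django 2.0+, and MyModel, the filter, and handle() are placeholders) is QuerySet.iterator(), which walks the results in chunks without populating the queryset cache:

for row in MyModel.objects.filter(active=True).iterator(chunk_size=1000):
    handle(row)   # hypothetical per-row processing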
You can query parts of the whole table in a loop by slicing the queryset.
If you're working with DEBUG = True, it's important to flush the stored queries after each loop, since Django keeps a record of every query that was run until the script finishes or dies, and that can cause memory issues.
If you need to restrict the results of the queryset, you can replace ".all()" with the appropriate ".filter(conditions)".
from django import db
from myapp.models import MyModel

# Get the total number of records in the table
total_count = MyModel.objects.all().count()
chunk_size = 1000  # You can change this to any amount you can keep in memory
total_checked = 0

while total_checked < total_count:
    # Query all the objects and slice only the part you need to work
    # with at the moment (only that part will be loaded into memory)
    query_set = MyModel.objects.all()[total_checked:total_checked + chunk_size]
    for item in query_set:
        # Do what you need to do with your results
        pass
    total_checked += chunk_size
    # Clear Django's query cache to avoid a memory leak
    db.reset_queries()

Multi-tenancy with SQLAlchemy

I've got a web-application which is built with Pyramid/SQLAlchemy/Postgresql and allows users to manage some data, and that data is almost completely independent for different users. Say, Alice visits alice.domain.com and is able to upload pictures and documents, and Bob visits bob.domain.com and is also able to upload pictures and documents. Alice never sees anything created by Bob and vice versa (this is a simplified example, there may be a lot of data in multiple tables really, but the idea is the same).
Now, the most straightforward option for organizing the data in the DB backend is to use a single database, where each table (pictures and documents) has a user_id field, so, basically, to get all of Alice's pictures, I can do something like
user_id = _figure_out_user_id_from_domain_name(request)
pictures = session.query(Picture).filter(Picture.user_id==user_id).all()
This is all easy and simple; however, there are some disadvantages:
I need to remember to always use an additional filter condition when making queries, otherwise Alice may see Bob's pictures;
If there are many users, the tables may grow huge;
It may be tricky to split the web application between multiple machines.
So I'm thinking it would be really nice to somehow split the data per-user. I can think of two approaches:
Have separate tables for Alice's and Bob's pictures and documents within the same database (Postgres schemas seem to be the correct approach to use in this case):
documents_alice
documents_bob
pictures_alice
pictures_bob
and then, using some dark magic, "route" all queries to one or to the other table according to the current request's domain:
_use_dark_magic_to_configure_sqlalchemy('alice.domain.com')
pictures = session.query(Picture).all() # selects all Alice's pictures from "pictures_alice" table
...
_use_dark_magic_to_configure_sqlalchemy('bob.domain.com')
pictures = session.query(Picture).all() # selects all Bob's pictures from "pictures_bob" table
Use a separate database for each user:
- database_alice
  - pictures
  - documents
- database_bob
  - pictures
  - documents
which seems like the cleanest solution, but I'm not sure if multiple database connections would require much more RAM and other resources, limiting the number of possible "tenants".
So, the question is, does it all make sense? If yes, how do I configure SQLAlchemy to either modify the table names dynamically on each HTTP request (for option 1) or to maintain a pool of connections to different databases and use the correct connection for each request (for option 2)?
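For option 2, one common pattern (a minimal sketch, not from the answers below; the DSN format, database naming, and pool sizes are assumptions) is to cache one engine per tenant database and bind each request's session to the right one:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

_engines = {}

def session_for_tenant(tenant):
    engine = _engines.get(tenant)
    if engine is None:
        # Keep per-tenant pools small so many tenants don't exhaust Postgres connections
        engine = create_engine(
            "postgresql://user:password@localhost/database_%s" % tenant,
            pool_size=2, max_overflow=0)
        _engines[tenant] = engine
    return sessionmaker(bind=engine)()

# per request, e.g.: session = session_for_tenant("alice")

The answers below focus on option 1 (one schema per tenant via search_path), which sidesteps the per-tenant connection overhead the question worries about.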
After pondering on jd's answer I was able to achieve the same result for PostgreSQL 9.2, SQLAlchemy 0.8, and the Flask 0.9 framework:
from flask import session  # Flask's session holds the current tenant id
from sqlalchemy import event
from sqlalchemy.pool import Pool

@event.listens_for(Pool, 'checkout')
def on_pool_checkout(dbapi_conn, connection_rec, connection_proxy):
    tenant_id = session.get('tenant_id')
    cursor = dbapi_conn.cursor()
    if tenant_id is None:
        cursor.execute("SET search_path TO public, shared;")
    else:
        cursor.execute("SET search_path TO t" + str(tenant_id) + ", shared;")
    dbapi_conn.commit()
    cursor.close()
OK, I ended up modifying search_path at the beginning of every request, using Pyramid's NewRequest event:
from pyramid import events

def on_new_request(event):
    schema_name = _figure_out_schema_name_from_request(event.request)
    DBSession.execute("SET search_path TO %s" % schema_name)

def app(global_config, **settings):
    """ This function returns a WSGI application.
    It is usually called by the PasteDeploy framework during
    ``paster serve``.
    """
    ....
    config.add_subscriber(on_new_request, events.NewRequest)
    return config.make_wsgi_app()
This works really well, as long as you leave transaction management to Pyramid (i.e. do not commit/roll back transactions manually, and let Pyramid do that at the end of the request) - which is fine, since committing transactions manually is not a good approach anyway.
What works very well for me is to set the search path at the connection pool level, rather than in the session. This example uses Flask and its thread-local proxies to pass the schema name, so you'll have to change schema = current_schema._get_current_object() and the try block around it.
from sqlalchemy.interfaces import PoolListener

class SearchPathSetter(PoolListener):
    '''
    Dynamically sets the search path on connections checked out from a pool.
    '''
    def __init__(self, search_path_tail='shared, public'):
        self.search_path_tail = search_path_tail

    @staticmethod
    def quote_schema(dialect, schema):
        return dialect.identifier_preparer.quote_schema(schema, False)

    def checkout(self, dbapi_con, con_record, con_proxy):
        try:
            schema = current_schema._get_current_object()
        except RuntimeError:
            search_path = self.search_path_tail
        else:
            if schema:
                search_path = self.quote_schema(con_proxy._pool._dialect, schema) + ', ' + self.search_path_tail
            else:
                search_path = self.search_path_tail
        cursor = dbapi_con.cursor()
        cursor.execute("SET search_path TO %s;" % search_path)
        dbapi_con.commit()
        cursor.close()
At engine creation time:
engine = create_engine(dsn, listeners=[SearchPathSetter()])

Django: How do I avoid unnecessary SQL statements?

I'm optimizing a slow page load in our (first) Django project. The overall project does test status management, so there are protocols which have cases which have planned executions. Currently the code is:
protocols = Protocol.active.filter(team=team, release=release)
cases = Case.active.filter(protocol__in=protocols)
caseCount = cases.count()
plannedExecs = Planned_Exec.active.filter(case__in=cases, team=team, release=release)

# Start aggregating test suite information
# pgi Model
testSuite['pgi_model'] = []
for pgi in PLM.objects.filter(release=release).values('pgi_model').distinct():
    plmForPgi = PLM.objects.filter(pgi_model=pgi['pgi_model'])
    peresults = plannedExecs.filter(plm__in=plmForPgi).count()
    if peresults > 0:
        try:
            testSuite['pgi_model'].append((pgi['pgi_model'], "", "", peresults, int(peresults/float(testlistCount)*100)))
        except ZeroDivisionError:
            testSuite['pgi_model'].append((pgi['pgi_model'], "", "", peresults, 0))

# Browser
testSuite['browser'] = []
for browser in BROWSER_OPTIONS:
    peresults = plannedExecs.filter(browser=browser[0]).count()
    try:
        testSuite['browser'].append((browser[1], "", "", peresults, int(peresults/float(testlistCount)*100)))
    except ZeroDivisionError:
        testSuite['browser'].append((browser[1], "", "", peresults, 0))

# ... more different categories are aggregated below, then the report is generated...
This code makes a lot of SQL statements. The PLM.objects.filter(release=release).values('pgi_model').distinct() returns a list of 50 strings, and the two filter operations both execute an SQL statement for each string, meaning 100 SQL statements for just this for loop. (Also, it seems like that should use values_list with flat=True.)
Since I want to get information about relevant cases and plannedExecutions, I think I really only need to retrieve those two tables, then perform some analysis on that. Using filter and count() seemed like the obvious solution at the time, but I'm wondering if I wouldn't be better off just building a dict of relevant case and plannedExecution information using .values() and then analyzing that instead, so as to avoid unnecessary SQL statements. Any helpful advice? Thanks!
Edit: In trying to profile this to understand where the time goes, I'm using the Django Debug Toolbar. It shows that there are over 200 queries, each of which runs extremely quickly, so overall they account for very little time. However, could it be that the execution of the SQL is relatively quick, but the building of the ORM objects adds up, given that it happens over 200 times? I refactored a previous page which took 3 minutes to load by using values() instead of the ORM, getting the page load down to 2.7 seconds and 5 SQL statements.
Creating a queryset does not hit the database; only accessing results from it does. Accordingly, merely creating querysets is not your issue.
Note that passing a queryset to another queryset does not create two queries. Accordingly, building dicts will not reduce the number of database hits.
If you can build up dicts, it may be that you manage to create a simpler query than you would otherwise, which would speed up the actual query execution. That is something of a separate issue, however.
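One way to get that simpler query (a sketch not from the original answers; the lookup names are inferred from the question's models and may need adjusting) is to let the database group and count in a single statement instead of one count() per value:

from django.db.models import Count

counts = (plannedExecs
          .values('plm__pgi_model')            # group by the PLM's pgi_model value
          .annotate(peresults=Count('id'))     # count planned executions per group
          .order_by())                         # clear default ordering so grouping stays clean

for row in counts:
    print(row['plm__pgi_model'], row['peresults'])

This replaces the ~50 per-value count() queries in the first loop with a single GROUP BY.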
This strikes me as a case for reverse foreign key lookups. We should be able to reduce the top for loop by getting all pgi_models associated with PLMs in the release. I assume you have a model for PGI, for which the PLM model has a foreign key field named pgi_model. If this is the case, you can find the PGIs in a PLM release with the following. You still have a loop, but the iterations of the loop should be reduced, theoretically:
pgis = PGI.objects.filter(plm__in=PLM.objects.filter(release=release)).distinct()
for pgi in pgis:
    # Count planned executions whose PLM points at this PGI
    peresults = plannedExecs.filter(plm__pgi_model=pgi).count()
    if peresults > 0:
        try:
            testSuite['pgi_model'].append((pgi, "", "", peresults, int(peresults/float(testlistCount)*100)))
        except ZeroDivisionError:
            testSuite['pgi_model'].append((pgi, "", "", peresults, 0))

SQL Alchemy relationships

I've run into another problem with SQLAlchemy. I have a relationship that is supposed to cascade-delete some data from my model, declared like so:
parentProject = relationship(Project, backref=backref("OPERATIONS", cascade="all,delete"))
This works fine as long as the data is from the current session. But if I start a session, add some data, and close it, then start another session and try to delete data from the previous one, the cascade doesn't work. The initializer of the database is as follows:
if isDBEmpty:
    LOGGER.info("Initializing Database")
    session = dao.Session()
    model.Base.metadata.create_all(dao.Engine)
    session.commit()
    LOGGER.info("Database Default Tables created successfully!")
    dao.storeEntity(model.User(administrator_username, md5(administrator_password).hexdigest(), administrator_email, True, model.ROLE_ADMINISTRATOR))
    LOGGER.info("Database Default Generic Values were stored!")
else:
    LOGGER.info("Database already has some data, will not be re-created at this startup!")
I'm guessing I'm missing something very basic here. Any help would be much appreciated.
Regards,
Bogdan
