Indexing SQLAlchemy models on ElasticSearch - python

I'm trying to use SQLAlchemy events to keep an Elasticsearch index updated so that it reflects certain models in my DB.
The problem I'm having is that I access some of the models' relationships to build the ES documents, and there are several points in the event cycle where those relationships cannot be accessed. I was creating the ES document from the model instance in the listeners for after_(insert|update|delete), but those are emitted while the session is flushing, and relationships seem to return None at that point, so the ES document cannot be built.
I changed it so that I just keep a queue of the commands to be emitted, holding a reference to the model instead, with the plan of issuing them on commit. However, the before_commit event is emitted before flushing, and by the time after_commit is emitted the DB can no longer be accessed, so again the relationships are inaccessible.
It seems to me like the right moment to create the Documents and save them is in the after_flush_postexec signal, but I have a feeling that between that and when the commit is actually issued there might be a rollback, and then the ES index would not reflect the DB anymore.
I'm not sure what's the best way to do this.
Here's the code I'm working with right now. ChargeIndexer.upsert takes a SQLAlchemy ORM model and uses the elasticsearch API to insert/update a document built from it; .delete does the same, except of course it deletes the document, and it only depends on the model's id.
class ChargeListener(object):
    ops = []

    @event.listens_for(Charge, 'after_delete', propagate=True)
    def after_delete(mapper, connection, target):
        ChargeListener.add_delete(target)

    @event.listens_for(Charge, 'after_insert', propagate=True)
    def after_insert(mapper, connection, target):
        ChargeListener.add_insert(target)

    @event.listens_for(Charge, 'after_update', propagate=True)
    def after_update(mapper, connection, target):
        ChargeListener.add_insert(target)

    @classmethod
    def execute_ops(cls):
        charges_indexer = ChargesIndexer()
        for op, charge in ChargeListener.ops:
            if op == 'insert':
                charges_indexer.upsert(charge)
            elif op == 'delete':
                charges_indexer.delete(charge)
        ChargeListener.reset()

    @event.listens_for(sess, 'after_flush_postexec')
    def after_flush_postexec(session, flush_context):
        ChargeListener.execute_ops()

    @event.listens_for(sess, 'after_soft_rollback')
    def after_soft_rollback(session, previous_transaction):
        ChargeListener.reset()

    @classmethod
    def add_insert(cls, charge):
        cls.ops.append(('insert', charge))

    @classmethod
    def add_delete(cls, charge):
        cls.ops.append(('delete', charge))

    @classmethod
    def reset(cls):
        cls.ops = []
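One direction that might address the rollback concern is to build the document bodies in after_flush_postexec, while relationships are still loadable, but only talk to Elasticsearch in after_commit, clearing everything on rollback. This is only a sketch, not tested against this codebase; serialize_charge and the by-id indexer methods are hypothetical, since the indexer's upsert as described takes a live model:

class ChargeListener(object):
    ops = []
    pending = []  # (op, charge_id, document_body) tuples, built while the session is usable

    @event.listens_for(sess, 'after_flush_postexec')
    def after_flush_postexec(session, flush_context):
        # Relationships can still be loaded here, so serialize the documents now...
        for op, charge in ChargeListener.ops:
            body = serialize_charge(charge) if op == 'insert' else None  # hypothetical serializer
            ChargeListener.pending.append((op, charge.id, body))
        ChargeListener.ops = []

    @event.listens_for(sess, 'after_commit')
    def after_commit(session):
        # ...but only touch Elasticsearch once the transaction is durable,
        # so a late rollback can't leave the index out of sync with the DB.
        charges_indexer = ChargesIndexer()
        for op, charge_id, body in ChargeListener.pending:
            if op == 'insert':
                charges_indexer.upsert_document(charge_id, body)  # hypothetical: upsert a pre-built body
            elif op == 'delete':
                charges_indexer.delete_by_id(charge_id)           # hypothetical: delete by id only
        ChargeListener.pending = []

    @event.listens_for(sess, 'after_soft_rollback')
    def after_soft_rollback(session, previous_transaction):
        ChargeListener.ops = []
        ChargeListener.pending = []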

Related

how to use peewee's Using as a decorator to dynamically specify a database?

Despite numerous recipes and examples in peewee's documentation, I have not been able to find how to accomplish the following:
For finer-grained control, check out the Using context manager / decorator. This allows you to specify the database to use with a given list of models for the duration of the wrapped block.
I assume it would go something like...
db = MySQLDatabase(None)

class BaseModelThing(Model):
    class Meta:
        database = db

class SubModelThing(BaseModelThing):
    '''imagine all the fields'''
    class Meta:
        db_table = 'table_name'

runtime_db = MySQLDatabase('database_name.db', fields={'''imagine field mappings here'''}, **extra_stuff)

@Using(runtime_db, [SubModelThing])
@runtime_db.execution_context()
def some_kind_of_query():
    '''imagine the queries here'''
but I have not found examples, so an example would be the answer to this question.
Yeah, there's not a great example of using Using or the execution_context decorators, so the first thing is: don't use the two together. It doesn't appear to break anything, just seems to be redundant. Logically that makes sense as both of the decorators cause the specified model calls in the block to run in a single connection/transaction.
The only(/biggest) difference between the two is that Using allows you to specify the particular database that the connection will be using - useful for master/slave (though the Read slaves extension is probably a cleaner solution).
If you run with two databases and try using execution_context on the 'second' database (in your example, runtime_db), nothing will happen with the data. A connection will be opened at the start of the block and closed at the end, but no queries will be executed on it, because the models are still using their original database.
The code below is an example. Every run should result in only 1 row being added to each database.
from peewee import *

db = SqliteDatabase('other_db')
db.connect()
runtime_db = SqliteDatabase('cmp_v0.db')
runtime_db.connect()

class BaseModelThing(Model):
    class Meta:
        database = db

class SubModelThing(Model):
    first_name = CharField()

    class Meta:
        db_table = 'table_name'

db.create_tables([SubModelThing], safe=True)
SubModelThing.delete().where(True).execute()  # Cleaning out previous runs

with Using(runtime_db, [SubModelThing]):
    runtime_db.create_tables([SubModelThing], safe=True)
    SubModelThing.delete().where(True).execute()

@Using(runtime_db, [SubModelThing], with_transaction=True)
def execute_in_runtime(throw):
    SubModelThing(first_name='asdfasdfasdf').save()
    if throw:  # to demo transaction handling in Using
        raise Exception()

# Create an instance in the 'normal' database
SubModelThing.create(first_name='name')

try:  # Try to create but throw during the transaction
    execute_in_runtime(throw=True)
except:
    pass  # Failure is expected, no row should be added

execute_in_runtime(throw=False)  # Create a row in the runtime_db

print 'db row count: {}'.format(len(SubModelThing.select()))
with Using(runtime_db, [SubModelThing]):
    print 'Runtime DB count: {}'.format(len(SubModelThing.select()))

Django Db routing

I am trying to run my Django application with two DBs (1 master, 1 read replica). My problem is that if I try to read right after a write, the code explodes. For example:
p = Product.objects.create()
Product.objects.get(id=p.id)
Or the same thing happens if the user is redirected to the Product's details page.
The code runs faster than the read replica can catch up, so if the read operation uses the replica, it crashes because the row hasn't been replicated in time.
Is there any way to avoid this? For example, could the DB to read from be chosen per request instead of per operation?
My Router is identical to Django's documentation:
import random

class PrimaryReplicaRouter(object):
    def db_for_read(self, model, **hints):
        """
        Reads go to a randomly-chosen replica.
        """
        return random.choice(['replica1', 'replica2'])

    def db_for_write(self, model, **hints):
        """
        Writes always go to primary.
        """
        return 'primary'

    def allow_relation(self, obj1, obj2, **hints):
        """
        Relations between objects are allowed if both objects are
        in the primary/replica pool.
        """
        db_list = ('primary', 'replica1', 'replica2')
        if obj1._state.db in db_list and obj2._state.db in db_list:
            return True
        return None

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        """
        All non-auth models end up in this pool.
        """
        return True
Solved it with:
class Model(models.Model):
    objects = models.Manager()     # objects only accesses the master
    sobjects = ReplicasManager()   # sobjects accesses either the master or the replicas

    class Meta:
        abstract = True  # so Django doesn't create a table for this base model
Make every model extend this one instead of models.Model, and then use objects or sobjects depending on whether I want to access only the master or either the master or the replicas.
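ReplicasManager isn't shown above; a minimal sketch of what such a manager could look like, assuming database aliases named 'primary', 'replica1' and 'replica2' as in the router (the class body here is an illustration, not the original author's code):

import random

from django.db import models

class ReplicasManager(models.Manager):
    """Spread reads across the primary and the replicas."""
    def get_queryset(self):
        # Pick any of the configured aliases for this queryset;
        # writes still go through the router's db_for_write.
        db_alias = random.choice(['primary', 'replica1', 'replica2'])
        return super(ReplicasManager, self).get_queryset().using(db_alias)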
Depending on the size of the data and the application I'd tackle this with either of the following methods:
Database pinning:
Extend your database router to allow pinning functions to specific databases. For example:
from customrouter.pinning import use_master

@use_master
def save_and_fetch_foo():
    ...
A good example of that can be seen in django-multidb-router.
Of course you could just use this package as well.
Use a model manager to route queries to specific databases.
class MyManager(models.Manager):
    def get_queryset(self):
        qs = CustomQuerySet(self.model)
        if self._db is not None:
            qs = qs.using(self._db)
        return qs
Write a middleware that would route your requests to master/slave automatically; see the sketch below.
It's basically the same as the pinning method, except that you wouldn't have to specify when to run GET requests against the master.
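A rough sketch of that middleware idea, assuming the router consults a thread-local flag set per request (the middleware class, the _request_state holder and the alias names are illustrative, not an existing package):

import random
import threading

_request_state = threading.local()

class PrimaryPinningMiddleware(object):
    """Pin non-GET requests to the primary so reads-after-writes see fresh data."""
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        _request_state.use_primary = request.method not in ('GET', 'HEAD')
        try:
            return self.get_response(request)
        finally:
            _request_state.use_primary = False

class PinnedRouter(object):
    def db_for_read(self, model, **hints):
        if getattr(_request_state, 'use_primary', False):
            return 'primary'
        return random.choice(['replica1', 'replica2'])

    def db_for_write(self, model, **hints):
        return 'primary'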
In a master/replica configuration, new data takes a few milliseconds to replicate to the other replica servers/databases,
so whenever you try to read right after a write you won't get the correct result.
Instead of reading from a replica, you can read from the master immediately after a write by chaining .using('primary') onto your get query.
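For the example from the question, that would look something like this (using() is standard Django queryset API; the model name comes from the question):

p = Product.objects.create()
Product.objects.using('primary').get(id=p.id)  # read the just-written row from the primary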

SQLAlchemy ORM Event hook for attribute persisted

I am looking for a way, using SQLAlchemy events, to call an external API when an attribute gets updated and persisted to the database. Here is my context:
A User model with an attribute named birthday. When an instance of the User model gets updated and saved, I want to call an external API to update this user's birthday accordingly.
I've tried Attribute Events; however, they generate too many hits, and there is no way to guarantee that the set/remove attribute event will eventually be persisted (autocommit is set to False and the transaction gets rolled back when errors occur).
Session Events would not work either, because they require a Session/sessionmaker as a parameter and there are just too many places in the code base where sessions are used.
I have been looking at all the possible SQLAlchemy ORM event hooks in the official documentation, but I couldn't find one that satisfies my requirement.
I wonder if anyone else has any insight into how to implement this kind of combined event trigger in SQLAlchemy. Thanks.
You can do this by combining multiple events. The specific events you need to use depend on your particular application, but the basic idea is this:
[InstanceEvents.load] when an instance is loaded, note down the fact that it was loaded and not added to the session later (we only want to save the initial state if the instance was loaded)
[AttributeEvents.set/append/remove] when an attribute changes, note down the fact that it was changed, and, if necessary, what it was changed from (these first two steps are optional if you don't need the initial state)
[SessionEvents.before_flush] when a flush happens, note down which instances are actually being saved
[SessionEvents.before_commit] before a commit completes, note down the current state of the instance (because you may not have access to it anymore after the commit)
[SessionEvents.after_commit] after a commit completes, fire off the custom event handler and clear the instances that you saved
An interesting challenge is the ordering of the events. If you do a session.commit() without doing a session.flush(), you'll notice that the before_commit event fires before the before_flush event, which is different from the scenario where you do a session.flush() before session.commit(). The solution is to call session.flush() in your before_commit call to force the ordering. This is probably not 100% kosher, but it works for me in production.
Here's a (simple) diagram of the ordering of events:
begin
load
(save initial state)
set attribute
...
flush
set attribute
...
flush
...
(save modified state)
commit
(fire off "object saved and changed" event)
Complete Example
from itertools import chain
from weakref import WeakKeyDictionary, WeakSet

from sqlalchemy import Column, String, Integer, create_engine
from sqlalchemy import event
from sqlalchemy.orm import sessionmaker, object_session
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()
engine = create_engine("sqlite://")
Session = sessionmaker(bind=engine)

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    birthday = Column(String)

@event.listens_for(User.birthday, "set", active_history=True)
def _record_initial_state(target, value, old, initiator):
    session = object_session(target)
    if session is None:
        return
    if target not in session.info.get("loaded_instances", set()):
        return
    initial_state = session.info.setdefault("initial_state", WeakKeyDictionary())
    # this is where you save the entire object's state, not necessarily just the birthday attribute
    initial_state.setdefault(target, old)

@event.listens_for(User, "load")
def _record_loaded_instances_on_load(target, context):
    session = object_session(target)
    loaded_instances = session.info.setdefault("loaded_instances", WeakSet())
    loaded_instances.add(target)

@event.listens_for(Session, "before_flush")
def track_instances_before_flush(session, context, instances):
    modified_instances = session.info.setdefault("modified_instances", WeakSet())
    for obj in chain(session.new, session.dirty):
        if session.is_modified(obj) and isinstance(obj, User):
            modified_instances.add(obj)

@event.listens_for(Session, "before_commit")
def set_pending_changes_before_commit(session):
    session.flush()  # IMPORTANT
    initial_state = session.info.get("initial_state", {})
    modified_instances = session.info.get("modified_instances", set())
    del session.info["modified_instances"]
    pending_changes = session.info["pending_changes"] = []
    for obj in modified_instances:
        initial = initial_state.get(obj)
        current = obj.birthday
        pending_changes.append({
            "initial": initial,
            "current": current,
        })
        initial_state[obj] = current

@event.listens_for(Session, "after_commit")
def after_commit(session):
    pending_changes = session.info.get("pending_changes", {})
    del session.info["pending_changes"]
    for changes in pending_changes:
        print(changes)  # this is where you would fire your custom event
    loaded_instances = session.info["loaded_instances"] = WeakSet()
    for v in session.identity_map.values():
        if isinstance(v, User):
            loaded_instances.add(v)

def main():
    engine = create_engine("sqlite://", echo=False)
    Base.metadata.create_all(bind=engine)
    session = Session(bind=engine)

    user = User(birthday="foo")
    session.add(user)
    user.birthday = "bar"
    session.flush()
    user.birthday = "baz"
    session.commit()  # prints: {"initial": None, "current": "baz"}

    user.birthday = "foobar"
    session.commit()  # prints: {"initial": "baz", "current": "foobar"}

    session.close()

if __name__ == "__main__":
    main()
As you can see, it's a little complicated and not very ergonomic. It would be nicer if it were integrated into the ORM, but I also understand there may be reasons for not doing so.

Is this an acceptable way to make threaded SQLAlchemy queries from Twisted?

I've been doing some reading on using SQLAlchemy's ORM in the context of a Twisted application. It's a lot of information to digest, so I'm having a bit of trouble putting all the pieces together. So far, I've gathered the following absolute truths:
One session implies one thread. Always.
scoped_session, by default, provides us with a way of constraining sessions to a given thread. In other words, I am sure that by using scoped_session, I will not pass sessions to other threads (unless I do so explicitly, which I won't).
I also gathered that there are some issues relating to lazy/eager-loading and that one possible approach is to dissociate ORM objects from a session and reattach them to another session when changing threads. I'm quite fuzzy on the details, but I also concluded that scoped_session renders many of these points moot.
My first question is whether or not I am severely mistaken in my above conclusions.
Beyond that, I've crafted this approach, which I hope is satisfactory.
I begin by creating a scoped_session object...
Session = scoped_session(sessionmaker(bind=_my_engine))
... which I will then use from a context manager, in order to handle exceptions and clean-up gracefully:
@contextmanager
def transaction_context():
    session = Session()
    try:
        yield session
        session.commit()
    except:
        session.rollback()
        raise
    finally:
        session.remove()  # dispose of the session
Now all I need to do is to use the above context manager in a function that is deferred to a separate thread. I've thrown together a decorator to make things a bit prettier:
def threaded(fn):
    @wraps(fn)  # functools.wraps
    def wrapper(*args, **kwargs):
        return deferToThread(fn, *args, **kwargs)  # t.i.threads.deferToThread
    return wrapper
Here is an example of how I intend to use the whole shebang. Below is a function that performs a DB lookup using the SQLAlchemy ORM:
@threaded
def get_some_attributes(group):
    with transaction_context() as session:
        return session.query(Attribute).filter(Attribute.group == group)
My second question is whether or not this approach is viable.
Am I making any fundamentally flawed assumptions?
Are there any caveats?
Is there a better way?
Edit: Here is a related question concerning the unexpected error in my context manager.
Right now I'm working on this exact problem, and I think I found a solution.
Indeed, you must defer all database access functions to a thread. But in your solution you remove the session after querying the database, so all the resulting ORM objects will be detached and you won't have access to their fields.
You can't use a plain scoped_session because in Twisted we have only one MainThread (except for things that run in deferToThread). We can, however, use scoped_session with a scopefunc.
In Twisted there is a great thing known as ContextTracker:
provides a way to pass arbitrary key/value data up and down a call
stack without passing them as parameters to the functions on that call
stack.
In my Twisted web app, in the render_GET method, I set a uuid parameter:
call = context.call({"uuid": str(uuid.uuid4())}, self._render, request)
and then I call the _render method to do the actual work (work with db, render html, etc).
I create the scoped_session like this:
scopefunc = functools.partial(context.get, "uuid")
Session = scoped_session(session_factory, scopefunc=scopefunc)
Now, within any function called from _render, I can get the session with:
Session()
and at the end of _render I have to do Session.remove() to remove the session.
It works with my webapp and I think it can work for other tasks too.
This is a completely standalone example showing how it all works together.
from twisted.internet import reactor, threads
from twisted.web.resource import Resource
from twisted.web.server import Site, NOT_DONE_YET
from twisted.python import context

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import sessionmaker, scoped_session
from sqlalchemy.ext.declarative import declarative_base

import uuid
import functools

engine = create_engine(
    'sqlite:///test.sql',
    connect_args={'check_same_thread': False},
    echo=False)
session_factory = sessionmaker(bind=engine)
scopefunc = functools.partial(context.get, "uuid")
Session = scoped_session(session_factory, scopefunc=scopefunc)

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    name = Column(String)

Base.metadata.create_all(bind=engine)

class TestPage(Resource):
    isLeaf = True

    def render_GET(self, request):
        context.call({"uuid": str(uuid.uuid4())}, self._render, request)
        return NOT_DONE_YET

    def render_POST(self, request):
        return self.render_GET(request)

    def work_with_db(self):
        user = User(name="TestUser")
        Session.add(user)
        Session.commit()
        return user

    def _render(self, request):
        print "session: ", id(Session())
        d = threads.deferToThread(self.work_with_db)

        def success(result):
            html = "added user with name - %s" % result.name
            request.write(html.encode('UTF-8'))
            request.finish()
            Session.remove()

        call = functools.partial(context.call, {"uuid": scopefunc()}, success)
        d.addBoth(call)
        return d

if __name__ == "__main__":
    reactor.listenTCP(8888, Site(TestPage()))
    reactor.run()
I print out the id of the session, and you can see that it's different for each request. If you remove scopefunc from the scoped_session constructor and make two simultaneous requests (insert a time.sleep into work_with_db), you will get one common session for those two requests.
The scoped_session object by default uses threading.local() as storage, so that a single Session is maintained for all who call upon the scoped_session registry, but only within the scope of a single thread
A problem here is that in Twisted we have only one thread for all requests. That's why we have to create our own scopefunc, which tells the sessions from different requests apart.
Another problem is that Twisted doesn't pass the context to callbacks, so we have to wrap the callback and pass the current context to it:
call = functools.partial(context.call, {"uuid": scopefunc()}, success)
Still, I don't know how to make this work with defer.inlineCallbacks, which I use everywhere in my code.

How can I use different databases for different models

I have a model called Requests which I want to save in a different database than the default Django database.
The reason is that the table is going to record every request for analytics, so it is going to get populated very heavily. Since I take database backups hourly, I don't want to grow the DB size just for that table.
So I was thinking of putting it in a separate DB that I don't back up as often.
The docs describe routing like this:
https://docs.djangoproject.com/en/dev/topics/db/multi-db/
def db_for_read(self, model, **hints):
    """
    Reads go to a randomly-chosen slave.
    """
    return random.choice(['slave1', 'slave2'])

def db_for_write(self, model, **hints):
    """
    Writes always go to master.
    """
    return 'master'
Now I am not sure how to check that if my model is Requests it chooses database A, and otherwise database B.
Models are just classes, so check whether you have the right class. This example should work for you:
from analytics.models import Requests

def db_for_read(self, model, **hints):
    """
    Reads go to the default database, unless it is about requests
    """
    if model is Requests:
        return 'database_A'
    else:
        return 'database_B'

def db_for_write(self, model, **hints):
    """
    Writes go to the default database, unless it is about requests
    """
    if model is Requests:
        return 'database_A'
    else:
        return 'database_B'
If you wish, though, you can also use some other technique (such as checking model.__name__ or looking at model._meta).
One note, though: the requests should not have foreign keys connecting them to models in other databases. But you probably already know that.
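For completeness, the router still has to be wired up in settings.py. A minimal sketch, assuming the methods above live in a class called AnalyticsRouter in myproject/routers.py and the alias names match the answer (the class/module name, engines and database names here are placeholders, not from the original answer):

# settings.py
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': 'main.sqlite3',
    },
    'database_A': {                      # the analytics DB holding Requests
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': 'analytics.sqlite3',
    },
    'database_B': {                      # everything else
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': 'main.sqlite3',
    },
}

# Tell Django to consult the router for every query
DATABASE_ROUTERS = ['myproject.routers.AnalyticsRouter']

Remember that migrations for the extra database have to be run separately, e.g. manage.py migrate --database=database_A.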
