Is it possible to use a couple of fields not from the primary key to retrieve items (already fetched earlier) from the identity map? For example, I often query a table by (external_id, platform_id) pair, which is a unique key, but not a primary key. And I want to omit unnecessary SQL queries in such cases.
A brief overview of identity_map and get():
An identity map is kept for the lifecycle of a SQLAlchemy Session object; i.e., in the case of a web service or RESTful API, the Session's lifecycle should span no more than a single request (as recommended).
From http://martinfowler.com/eaaCatalog/identityMap.html:
An Identity Map keeps a record of all objects that have been read from
the database in a single business transaction. Whenever you want an object, you check the Identity Map first to see if you already have it.
In SQLAlchemy's ORM there's a special query method, get(). It first looks in the identity map using the primary key (the only allowed argument) and, if the object is present, returns it from the identity map without actually executing a SQL query and hitting the database.
From the docs:
get(ident)
Return an instance based on the given primary key identifier, or None
if not found.
get() is special in that it provides direct access to the identity
map of the owning Session. If the given primary key identifier is
present in the local identity map, the object is returned directly
from this collection and no SQL is emitted, unless the object has been
marked fully expired. If not present, a SELECT is performed in order
to locate the object.
Only get() uses the identity map for lookups - from the official docs:
It’s somewhat used as a cache, in that it implements the identity map
pattern, and stores objects keyed to their primary key. However, it
doesn’t do any kind of query caching. This means, if you say
session.query(Foo).filter_by(name='bar'), even if Foo(name='bar')
is right there, in the identity map, the session has no idea about
that. It has to issue SQL to the database, get the rows back, and
then when it sees the primary key in the row, then it can look in the
local identity map and see that the object is already there. It’s
only when you say query.get({some primary key}) that the Session
doesn’t have to issue a query.
P.S. If you're querying by anything other than the primary key, you aren't hitting the identity map in the first place.
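To make this concrete, here's a minimal sketch of the difference (Company is a hypothetical mapped class with an integer primary key):

company = session.query(Company).get(1)   # first call: emits a SELECT
same = session.query(Company).get(1)      # second call: served from the identity map, no SQL
# filter_by() always emits SQL, even though the object is already present;
# the row that comes back is then matched to the existing instance by primary key.
also = session.query(Company).filter_by(id=1).one()
assert company is same is also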
A few relevant SO questions that help clarify the concept:
Forcing a sqlalchemy ORM get() outside identity map
It's possible to access the whole identity map sequentially:
for obj in session.identity_map.values():
    print(obj)
To get an object by arbitrary attributes, you then have to filter for the object type first and then check your attributes.
It's not a lookup in constant time, but can prevent unnecessary queries.
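As a rough sketch (Company and its attributes are invented for illustration, matching the (external_id, platform_id) pair from the question), such a lookup could look like this:

def find_in_identity_map(session, cls, **attrs):
    # Linear scan of the session's identity map; returns a matching
    # instance or None. Not constant time, but avoids a round trip.
    for obj in session.identity_map.values():
        if isinstance(obj, cls) and all(
            getattr(obj, name) == value for name, value in attrs.items()
        ):
            return obj
    return None

company = find_in_identity_map(session, Company, external_id="ext-1", platform_id=2)
if company is None:
    company = session.query(Company).filter_by(external_id="ext-1", platform_id=2).one()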
There is the argument that objects may have been modified by another process, so the identity map doesn't hold the current state; but this argument is invalid: if your transaction isolation level is read committed (or weaker) - and this is often the case - data may ALWAYS have been changed the moment your query finishes.
Related
The documentation just says
To save an object back to the database, call save()
That does not make it clear. Experimenting, I found that if I include an id, it updates the existing entry, while if I don't, it creates a new row. Does the documentation specify what happens?
It's fully documented here:
https://docs.djangoproject.com/en/2.2/ref/models/instances/#how-django-knows-to-update-vs-insert
You may have noticed Django database objects use the same save()
method for creating and changing objects. Django abstracts the need to
use INSERT or UPDATE SQL statements. Specifically, when you call
save(), Django follows this algorithm:
If the object’s primary key attribute is set to a value that evaluates
to True (i.e., a value other than None or the empty string), Django
executes an UPDATE. If the object’s primary key attribute is not set
or if the UPDATE didn’t update anything (e.g. if primary key is set to
a value that doesn’t exist in the database), Django executes an
INSERT. The one gotcha here is that you should be careful not to
specify a primary-key value explicitly when saving new objects, if you
cannot guarantee the primary-key value is unused. For more on this
nuance, see Explicitly specifying auto-primary-key values above and
Forcing an INSERT or UPDATE below.
As a side note: Django is open source, so when in doubt you can always read the source code ;-)
Depends on how the Model object was created. If it was queried from the database, UPDATE. If it's a new object and has not been saved before, INSERT.
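A minimal sketch of both cases (Article is a hypothetical model with an auto primary key):

from myapp.models import Article  # hypothetical model with an AutoField pk

# No primary key set -> Django issues an INSERT and fills in article.pk.
article = Article(title="New post")
article.save()

# Primary key set (e.g. the object came from a query) -> Django attempts an
# UPDATE, and falls back to an INSERT if no row with that pk exists.
article = Article.objects.get(pk=1)
article.title = "Edited title"
article.save()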
Background
The get() method is special in SQLAlchemy's ORM because it tries to return objects from the identity map before issuing a SQL query to the database (see the documentation).
This is great for performance, but it can cause problems for distributed applications, because an object may have been modified by another process; the local process has no way of knowing that the object is stale and will keep retrieving the outdated object from the identity map whenever get() is called.
Question
How can I force get() to ignore the identity map and issue a call to the DB every time?
Example
I have a Company object defined in the ORM.
I have a price_updater() process which updates the stock_price attribute of all the Company objects every second.
I have a buy_and_sell_stock() process which buys and sells stocks occasionally.
Now, inside this process, I may have loaded a microsoft = Company.query.get(123) object.
A few minutes later, I may issue another call for Company.query.get(123). The stock price has changed since then, but my buy_and_sell_stock() process is unaware of the change because it happened in another process.
Thus, the get(123) call returns the stale version of the Company from the session's identity map, which is a problem.
I've done a search on SO (under the [sqlalchemy] tag) and read the SQLAlchemy docs to try to figure out how to do this, but haven't found a way.
Using session.expire(my_instance) will cause the data to be re-selected on access. However, even if you use expire (or expunge), the next data that is fetched will be based on the transaction isolation level. See the PostgreSQL docs on isolation levels (this applies to other databases as well) and the SQLAlchemy docs on setting isolation levels.
You can test whether an instance is in the session with the in operator: my_instance in session.
You can use filter instead of get to bypass the cache, but it still has the same isolation level restriction.
Company.query.filter_by(id=123).one()
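A small sketch of the expire approach (Company is the model from the question; session is assumed to be the Session the object was loaded with, e.g. db.session in Flask-SQLAlchemy):

microsoft = Company.query.get(123)   # may be served from the identity map
session.expire(microsoft)            # mark the instance's attributes as stale
print(microsoft.stock_price)         # next attribute access re-SELECTs the row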
I am working on a web application based on Google App Engine (Python / Webapp2) and Google NDB Datastore.
I assumed that if I tried to add a new entity using, as its parent key, the key of an entity that no longer exists, an exception would be thrown. Instead, I've found that the entity is actually created.
Am I doing something wrong?
I could check beforehand whether the parent still exists via a keys-only query. Does that consume GAE read quota?
You can create a key for any entity whether this entity exists or not. This is because a key is simply an encoding of an entity kind and either an id or name (and ancestor keys, if any).
This means that you can store a child entity before a parent entity is saved, as long as you know the parent's id or name. You cannot reassign a child from one parent to another, though.
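For example (Person and Shoe are hypothetical ndb models), the following succeeds even if no Person entity with that id was ever stored:

from google.appengine.ext import ndb

class Shoe(ndb.Model):
    size = ndb.IntegerProperty()

# The parent key is just an encoding of kind + id; it need not point
# to a stored entity.
parent_key = ndb.Key('Person', 12345)
Shoe(parent=parent_key, size=42).put()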
As for your second question, the AppEngine pricing page says:
Calls to the datastore API result in the following billable operations. Small datastore operations include calls to allocate datastore ids or keys-only queries. These operations are free.
Complementing andrei's answer to the first question: no key reference in NDB is checked for referring to an existing entity. This is true for keys used as a parent as well as for keys stored in a KeyProperty within an entity.
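If you do want to verify the parent first, a keys-only query (a small, free operation per the pricing quote above) could look roughly like this; Person is again a hypothetical model:

from google.appengine.ext import ndb

class Person(ndb.Model):
    name = ndb.StringProperty()

parent_key = ndb.Key('Person', 12345)
# Filter on the entity key itself and fetch at most one key.
parent_exists = bool(Person.query(Person.key == parent_key).fetch(1, keys_only=True))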
I'm creating a Google App Engine application (python) and I'm learning about the general framework. I've been looking at the tutorial and documentation for the NDB datastore, and I'm having some difficulty wrapping my head around the concepts. I have a large background with SQL databases and I've never worked with any other type of data storage system, so I'm thinking that's where I'm running into trouble.
My current understanding is this: The NDB datastore is a collection of entities (analogous to DB records) that have properties (analogous to DB fields/columns). Entities are created using a Model (analogous to a DB schema). Every entity has a key that is generated for it when it is stored. This is where I run into trouble because these keys do not seem to have an analogy to anything in SQL DB concepts. They seem similar to primary keys for tables, but those are more tightly bound to records, and in fact are fields themselves. These NDB keys are not properties of entities, but are considered separate objects from entities. If an entity is stored in the datastore, you can retrieve that entity using its key.
One of my big questions is where do you get the keys for this? Some of the documentation I saw showed examples in which keys were simply created. I don't understand this. It seemed that when entities are stored, the put() method returns a key that can be used later. So how can you just create keys and define ids if the original keys are generated by the datastore?
Another thing that I seem to be struggling with is the concept of ancestry with keys. You can define parent keys of whatever kind you want. Is there a predefined schema for this? For example, if I had a model subclass called 'Person', and I created a key of kind 'Person', can I use that key as a parent of any other type? Like if I wanted a 'Shoe' key to be a child of a 'Person' key, could I also then declare a 'Car' key to be a child of that same 'Person' key? Or will I be unable to after adding the 'Shoe' key?
I'd really just like a simple explanation of the NDB datastore and its API for someone coming from a primarily SQL background.
I think you're overcomplicating things in your mind. When you create an entity, you can either give it a named key that you've chosen yourself, or leave that out and let the datastore choose a numeric ID. Either way, when you call put(), the datastore will return the key, which is stored in the form [<entity_kind>, <id_or_name>] (actually this also includes the application ID and any namespace, but I'll leave that out for clarity).
You can make entities members of an entity group by giving them an ancestor. That ancestor doesn't actually have to refer to an existing entity, although it usually does. All that happens with an ancestor is that the entity's key includes the key of the ancestor: so it now looks like [<parent_entity_kind>, <parent_id_or_name>, <entity_kind>, <id_or_name>]. You can now only get the entity by including its parent key. So, in your example, the Shoe entity could be a child of the Person, whether or not that Person has previously been created: it's the child that knows about the ancestor, not the other way round.
(Note that the ancestry path can be extended arbitrarily: the child entity can itself be an ancestor, and so on. In this case, the group is determined by the entity at the top of the tree.)
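In ndb terms, a sketch of the two key shapes (kinds and ids invented for illustration):

from google.appengine.ext import ndb

# A key with no ancestor: [<entity_kind>, <id_or_name>]
person_key = ndb.Key('Person', 'alice')

# A child key is the parent's path with the child's kind/id appended,
# i.e. [<parent_kind>, <parent_id_or_name>, <entity_kind>, <id_or_name>]
shoe_key = ndb.Key('Person', 'alice', 'Shoe', 'left-sneaker')
same_shoe_key = ndb.Key('Shoe', 'left-sneaker', parent=person_key)
assert shoe_key == same_shoe_key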
Saving entities as part of a group has advantages in terms of consistency, in that a query inside an entity group is always guaranteed to be fully consistent, whereas queries outside an entity group are only eventually consistent. However, there is also a disadvantage: the write rate for an entity group is limited to about one write per second for the whole group.
Datastore keys are a little more analogous to internal SQL row identifiers, but of course not entirely. Identifiers in App Engine are a bit like SQL primary keys. To support decentralised concurrent creation of new keys by many application instances in a cloud of servers, App Engine internally generates the keys to guarantee uniqueness. Your application defines parameters (application identifier, optional namespace, kind and optional entity identifier) which App Engine uses to seed its key generator. If you do not provide an identifier, App Engine will generate a unique numeric identifier that you can read.
Eventual consistency takes time, so it is occasionally more efficient to request multiple new keys in bulk. App Engine then generates a range of numeric entity identifiers for you. You can read their values from the keys as KeyProperty metadata.
Ancestry is used to group together writes of related entities of all kinds for the purpose of transactions and isolation. There is no predefined schema for this but you are limited to one parent per child.
In your example, one particular Shoe might have a particular Person as parent. Another particular Shoe could have a Horse as parent. And another Shoe might have no parent. Many entities of all kinds can have the same parent, so several Car entities could also have that initial Person as parent. The Datastore is schemaless, so it's up to your application to allow or forbid a Car to have a Horse as parent.
Note that a child knows its parent, but a parent does not know its children, because implementing that would impact scalability.
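Since a parent doesn't track its children, you find them with an ancestor query; a rough sketch (the models are hypothetical):

from google.appengine.ext import ndb

class Shoe(ndb.Model):
    size = ndb.IntegerProperty()

class Car(ndb.Model):
    model = ndb.StringProperty()

person_key = ndb.Key('Person', 'alice')

# Children of different kinds can share the same parent; the datastore
# is schemaless, so nothing stops a Car or a Shoe from using it.
Shoe(parent=person_key, size=42).put()
Car(parent=person_key, model='roadster').put()

# The parent doesn't know its children; an ancestor query finds them.
shoes = Shoe.query(ancestor=person_key).fetch()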
I'm working with SQLAlchemy for the first time and was wondering... generally speaking, is it enough to rely on Python's default equality semantics when working with SQLAlchemy, versus id (primary key) equality?
In other projects I've worked on in the past using ORM technologies like Java's Hibernate, we'd always override .equals() to check for equality of an object's primary key/id, but when I look back I'm not sure this was always necessary.
In most if not all cases I can think of, you only ever had one reference to a given object with a given id. And that object was always the attached object so technically you'd be able to get away with reference equality.
Short question: Should I be overriding __eq__() and __hash__() for my business entities when using SQLAlchemy?
Short answer: No, unless you're working with multiple Session objects.
Longer answer, quoting the awesome documentation:
The ORM concept at work here is known as an identity map and ensures that all operations upon a particular row within a Session operate upon the same set of data. Once an object with a particular primary key is present in the Session, all SQL queries on that Session will always return the same Python object for that particular primary key; it also will raise an error if an attempt is made to place a second, already-persisted object with the same primary key within the session.
I had a few situations where my SQLAlchemy application would load multiple instances of the same object (multithreading / different SQLAlchemy sessions...). It was absolutely necessary to override __eq__() for those objects, or I would get various problems. This could be a problem in my application design, but it probably doesn't hurt to override __eq__() just to be sure.
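If you do decide to override them (e.g. because instances from different sessions need to compare equal), one common sketch is to base equality on the mapped class plus the primary key; this is just one possible convention, not something SQLAlchemy mandates:

from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Company(Base):
    __tablename__ = 'company'

    id = Column(Integer, primary_key=True)
    name = Column(String)

    def __eq__(self, other):
        # Same mapped class and same (already assigned) primary key.
        return (
            isinstance(other, Company)
            and self.id is not None
            and self.id == other.id
        )

    def __hash__(self):
        return hash((type(self), self.id))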