SQLAlchemy - bulk update objects


I'm attempting to do a bulk update with sqlalchemy.
What does work is selecting the objects to update, then setting the attributes inside a with session.begin_nested(): block. The actual save is slow, however.
When I attempt to use a bulk operation instead via session.bulk_save_objects or session.bulk_update_mappings I get the following exception:
A value is required for bind parameter 'schema_table_table_id'
[SQL: u'UPDATE schema.table SET updated_col=%(updated_col)s
WHERE schema.table.table_id = %(schema_table_table_id)s']
[parameters: [{'updated_col': 'some_val'}]]
It looks like bulk_save_objects uses the same logic path as bulk_update_mappings.
I don't actually understand how bulk_update_mappings is supposed to work, because you provide the updated values and a reference class, but the primary key associated with those values is missing from the list. That seems to be the problem here. I tried using bulk_update_mappings and provided the generated dictionary key used for the primary key parameter (schema_table_table_id in my example), and it was simply ignored. If I used the id attribute name instead, it would update the primary key in the generated SQL but still did not provide the needed parameter in the WHERE clause.
This is using SQLAlchemy 1.0.12, which is the latest version on pip.
My suspicion is that this is a bug.
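For comparison, here is a minimal sketch of how bulk_update_mappings is documented to be called: each dictionary is keyed by mapped attribute names and includes the primary key, which is what feeds the WHERE clause. The Row class, the table name and the in-memory SQLite engine are assumptions made for illustration. The sketch also assumes a SQLAlchemy release newer than the 1.0.12 mentioned above, and it does not reproduce the schema-qualified table from the error message.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session

try:
    from sqlalchemy.orm import declarative_base  # SQLAlchemy 1.4+
except ImportError:
    from sqlalchemy.ext.declarative import declarative_base  # older releases

Base = declarative_base()


class Row(Base):
    """Hypothetical mapped class standing in for schema.table."""
    __tablename__ = "example_table"
    table_id = Column(Integer, primary_key=True)
    updated_col = Column(String)


engine = create_engine("sqlite://")  # in-memory database, for the sketch only
Base.metadata.create_all(engine)

session = Session(engine)
session.add_all([
    Row(table_id=1, updated_col="old"),
    Row(table_id=2, updated_col="old"),
])
session.commit()

# Each mapping carries the primary key attribute (table_id) alongside
# the columns to change; the key is used to locate the row.
session.bulk_update_mappings(Row, [
    {"table_id": 1, "updated_col": "some_val"},
    {"table_id": 2, "updated_col": "other_val"},
])
session.commit()

values = [r.updated_col for r in session.query(Row).order_by(Row.table_id)]
```

If the mapped attribute name differs from the column name, the dictionary keys follow the attribute names on the class.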

Related

Calling stored function or procedure won't insert and persist changes

So I am very confused about this weird behaviour I have with SQLAlchemy and PostgreSQL. Let's say I have a table:
create table staging.my_table(
id integer DEFAULT nextval(...),
name text,
...
);
and a stored function:
create or replace function staging.test()
returns void
language plpgsql
as $function$
begin
insert into staging.my_table (name) values ('yay insert');
end;
$function$;
What I want to do now is call this function in Python with SQLAlchemy like this:
from sqlalchemy import create_engine
engine = create_engine('postgresql+psycopg2://foo:bar@localhost:5432/baz')
engine.execute('select staging.test()')
When I run this Python code, nothing gets inserted into my database. That's weird, because when I replace the function call with select 1 and add .fetchall() to it, it gets executed and I see the result in the console when I print it.
Let's say I run this code twice: nothing happens, but the code runs successfully without errors.
If I switch to the database now, run select staging.test(); and then select from my_table, I get: id: 3; name: yay insert.
So the sequence is actually increasing when I run my Python file, but there is no data in my table.
What am I doing wrong? Am I missing something? I googled but didn't find anything.

This particular use case is singled out in "Understanding Autocommit":
Full control of the “autocommit” behavior is available using the generative Connection.execution_options() method provided on Connection, Engine, Executable, using the “autocommit” flag which will turn on or off the autocommit for the selected scope. For example, a text() construct representing a stored procedure that commits might use it so that a SELECT statement will issue a COMMIT:
engine.execute(text("SELECT my_mutating_procedure()").execution_options(autocommit=True))
The way SQLAlchemy autocommit detects data changing operations is that it matches the statement against a pattern, looking for things like UPDATE, DELETE, and the like. It is impossible for it to detect if a stored function/procedure performs mutations, and so explicit control over autocommit is provided.
The sequence is incremented even on failure because nextval() and setval() calls are never rolled back.
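The pattern-matching idea can be illustrated with a toy detector. This is only a sketch: the regular expression and the function name are made up for illustration and are not SQLAlchemy's actual implementation. It shows why a SELECT that calls a mutating function is not recognised as a data-changing statement.

```python
import re

# Hypothetical pattern, loosely modelled on the idea described above.
# It is NOT the regular expression SQLAlchemy itself uses.
MUTATING_STATEMENT = re.compile(
    r"\s*(?:UPDATE|INSERT|DELETE|CREATE|DROP|ALTER)\b", re.IGNORECASE
)


def looks_like_mutation(statement):
    """Return True if the statement text starts with a data-changing keyword."""
    return MUTATING_STATEMENT.match(statement) is not None


plain_update = looks_like_mutation("UPDATE staging.my_table SET name = 'x'")
function_call = looks_like_mutation("select staging.test()")
```

Here plain_update is True while function_call is False: the text of the SELECT gives no hint that staging.test() performs an INSERT, which is why the commit has to be requested explicitly.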

Bulk update with subquery using SQLAlchemy

I'm trying to implement the following MySQL query using SQLAlchemy. The table in question is nested set hierarchy.
UPDATE category
JOIN
(
SELECT
node.cat_id,
(COUNT(parent.cat_id) - 1) AS depth
FROM category AS node, category AS parent
WHERE node.lft BETWEEN parent.lft AND parent.rgt
GROUP BY node.cat_id
) AS depths
ON category.cat_id = depths.cat_id
SET category.depth = depths.depth
This works just fine.
This is where I start pulling my hair out:
from sqlalchemy.orm import aliased
from sqlalchemy import func
from myapp.db import db
node = aliased(Category)
parent = aliased(Category)
stmt = db.session.query(node.cat_id,
                        func.count(parent.cat_id).label('depth_'))\
    .filter(node.lft.between(parent.lft, parent.rgt))\
    .group_by(node.cat_id).subquery()
db.session.query(Category,
                 stmt.c.cat_id,
                 stmt.c.depth_)\
    .outerjoin(stmt,
               Category.cat_id == stmt.c.cat_id)\
    .update({Category.depth: stmt.c.depth_},
            synchronize_session='fetch')
...and I get InvalidRequestError: This operation requires only one Table or entity be specified as the target. It seems to me that Category.depth adequately specifies the target, but of course SQLAlchemy trumps whatever I may think.
Stumped. Any suggestions? Thanks.

I know this question is five years old, but I stumbled upon it today. My answer might be useful to someone else. I understand that my solution is not the perfect one, but I don't have a better way of doing this.
I had to change only the last line to:
db.session.query(Category)\
    .outerjoin(stmt,
               Category.cat_id == stmt.c.cat_id)\
    .update({Category.depth: stmt.c.depth_},
            synchronize_session='fetch')
Then, you have to commit the changes:
db.session.commit()
This gives the following warning:
SAWarning: Evaluating non-mapped column expression '...' onto ORM
instances; this is a deprecated use case. Please make use of the
actual mapped columns in ORM-evaluated UPDATE / DELETE expressions.
"UPDATE / DELETE expressions." % clause
To get rid of it, I used the solution in this post: Turn off a warning in sqlalchemy
Note: For some reason, aliases don't work in SQLAlchemy update statements.

Completing object with its relations and avoiding unnecessary queries in sqlalchemy

I have some database structure; as most of it is irrelevant for us, I'll describe just the relevant pieces. Let's take the Item object as an example:
items_table = Table("invtypes", gdata_meta,
                    Column("typeID", Integer, primary_key = True),
                    Column("typeName", String, index=True),
                    Column("marketGroupID", Integer, ForeignKey("invmarketgroups.marketGroupID")),
                    Column("groupID", Integer, ForeignKey("invgroups.groupID"), index=True))
mapper(Item, items_table,
       properties = {"group" : relation(Group, backref = "items"),
                     "_Item__attributes" : relation(Attribute, collection_class = attribute_mapped_collection('name')),
                     "effects" : relation(Effect, collection_class = attribute_mapped_collection('name')),
                     "metaGroup" : relation(MetaType,
                                            primaryjoin = metatypes_table.c.typeID == items_table.c.typeID,
                                            uselist = False),
                     "ID" : synonym("typeID"),
                     "name" : synonym("typeName")})
I want to achieve some performance improvements in the sqlalchemy/database layer, and have couple of ideas:
1) Requesting the same item twice:
item = session.query(Item).get(11184)
item = None  # reference to item is lost, object is garbage collected
item = session.query(Item).get(11184)
Each request generates and issues an SQL query. To avoid it, I use 2 custom maps for an item object:
itemMapId = {}
itemMapName = {}
#cachedQuery(1, "lookfor")
def getItem(lookfor, eager=None):
    if isinstance(lookfor, (int, float)):
        id = int(lookfor)
        if eager is None and id in itemMapId:
            item = itemMapId[id]
        else:
            item = session.query(Item).options(*processEager(eager)).get(id)
            itemMapId[item.ID] = item
            itemMapName[item.name] = item
    elif isinstance(lookfor, basestring):
        if eager is None and lookfor in itemMapName:
            item = itemMapName[lookfor]
        else:
            # Items have unique names, so we can fetch just first result w/o ensuring its uniqueness
            item = session.query(Item).options(*processEager(eager)).filter(Item.name == lookfor).first()
            itemMapId[item.ID] = item
            itemMapName[item.name] = item
    return item
I believe sqlalchemy does similar object tracking, at least by primary key (item.ID). If it does, I can wipe both maps (although wiping the name map will require minor modifications to the application that uses these queries) to avoid duplicating functionality and use the stock methods. The actual question is: if sqlalchemy has such functionality, how do I access it?
2) Eager loading of relationships often helps to save a lot of requests to the database. Say I'll definitely need the following set of item=Item() properties:
item.group (Group object, according to groupID of our item)
item.group.items (fetch all items from items list of our group)
item.group.items.metaGroup (metaGroup object/relation for every item in the list)
If I have some item ID and no item is loaded yet, I can request it from the database, eagerly loading everything I need: sqlalchemy will join the group, its items and the corresponding metaGroups within a single query. If I accessed them with default lazy loading, sqlalchemy would need to issue 1 query to grab the item + 1 to get the group + 1*#items for all items in the list + 1*#items to get the metaGroup of each item, which is wasteful.
2.1) But what if I already have the Item object fetched, and some of the properties I want to load are already loaded? As far as I understand, when I re-fetch an object from the database, its already loaded relations do not become unloaded. Am I correct?
2.2) If I have the Item object fetched and want to access its group, I can just call getGroup using item.groupID, applying any eager statements I need ("items" and "items.metaGroup"). It should properly load the group and its requested relations without touching the item itself. Will sqlalchemy properly map this fetched group to item.group, so that when I access item.group it won't fetch anything from the underlying database?
2.3) If I have the following things fetched from the database: the original item, item.group and some portion of the items from the item.group.items list, some of which may have metaGroup loaded, what would be the best strategy for completing the data structure to match the eager list above: re-fetch the group with an ("items", "items.metaGroup") eager load, or check each item from the items list individually and, if the item or its metaGroup is not loaded, load them? It seems to depend on the situation, because if everything has already been loaded some time ago, issuing such a heavy query is pointless. Does sqlalchemy provide a way to track whether some object relation is loaded, with the ability to look deeper than just one level?
As an illustration of 2.3: I can fetch the group with ID 83, eagerly fetching "items" and "items.metaGroup". Is there a way to determine from an item (which has a groupID of 83) whether it has "group", "group.items" and "group.items.metaGroup" loaded or not, using sqlalchemy tools (in this case all of them should be loaded)?

To force loading of lazy attributes, just access them. This is the simplest way and it works fine for relations, but it is not as efficient for Columns (you will get a separate SQL query for each column in the same table). You can get a list of all unloaded properties (both relations and columns) from sqlalchemy.orm.attributes.instance_state(obj).unloaded.
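A small sketch of inspecting the unloaded set before and after touching a lazy relation. The Group and Item classes, the table names and the in-memory SQLite engine below are assumptions made for the example and do not mirror the mapping in the question.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, relationship
from sqlalchemy.orm.attributes import instance_state

try:
    from sqlalchemy.orm import declarative_base  # SQLAlchemy 1.4+
except ImportError:
    from sqlalchemy.ext.declarative import declarative_base  # older releases

Base = declarative_base()


class Group(Base):
    __tablename__ = "demo_groups"
    id = Column(Integer, primary_key=True)
    name = Column(String)


class Item(Base):
    __tablename__ = "demo_items"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    group_id = Column(Integer, ForeignKey("demo_groups.id"))
    group = relationship(Group, backref="items")  # lazy-loaded by default


engine = create_engine("sqlite://")  # in-memory database, for the sketch only
Base.metadata.create_all(engine)

# Populate the database with one item that belongs to one group.
setup = Session(engine)
setup.add(Item(id=1, name="item", group=Group(id=83, name="group")))
setup.commit()
setup.close()

session = Session(engine)
item = session.query(Item).filter_by(id=1).one()

# The relation has not been touched yet, so it is reported as unloaded.
unloaded_before = "group" in instance_state(item).unloaded

_ = item.group  # accessing the attribute triggers the lazy load

unloaded_after = "group" in instance_state(item).unloaded
```

In this sketch unloaded_before is True and unloaded_after is False.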
You don't use deferred columns in your example, but I'll describe them here for completeness. The typical scenario for handling deferred columns is the following:
Decorate selected columns with deferred(). Combine them into one or several groups by using group parameter to deferred().
Use undefer() and undefer_group() options in query when desired.
Accessing a deferred column that belongs to a group will load all columns in that group.
Unfortunately this doesn't work in reverse: you can combine columns into groups without deferring them by default with column_property(Column(…), group=…), but the defer() option won't affect them (it works for Columns only, not column properties, at least in 0.6.7).
To force loading of deferred column properties, session.refresh(obj, attribute_names=…), as suggested by Nathan Villaescusa, is probably the best solution. The only disadvantage I see is that it expires the attributes first, so you have to ensure that no already-loaded attributes are among those passed as the attribute_names argument (e.g. by intersecting with state.unloaded).
Update
1) SQLAlchemy does track loaded objects. That's how the ORM works: there must be exactly one object in the session for each identity. Its internal cache is weak by default (use weak_identity_map=False to change this), so the object is expunged from the cache as soon as there is no reference to it in your code. SQLAlchemy won't issue an SQL request for query.get(pk) when the object is already in the session. But this works for the get() method only, so query.filter_by(id=pk).first() will issue an SQL request and refresh the object in the session with the loaded data.
2) Eager loading of relations will lead to fewer requests, but it's not always faster. You have to check this for your database and data.
2.1) Refetching data from the database won't unload objects bound via relations.
2.2) item.group is loaded using the query.get() method, so it won't lead to an SQL request if the object is already in the session.
2.3) Yes, it depends on the situation. In most cases the best option is to hope SQLAlchemy will use the right strategy :). For an already loaded relation, you can check whether the related objects' relations are loaded via state.unloaded, and so on recursively to any depth. But when a relation is not loaded yet, you can't know whether the related objects and their relations are already loaded: even when the relation is not yet loaded, the related object[s] might already be in the session (just imagine you request the first item, load its group and then request another item that has the same group). For your particular example I see no problem with just checking state.unloaded recursively.
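The weak identity map described in 1) can be modelled with the standard library alone. The following is a toy sketch, not SQLAlchemy's implementation: ToySession, its counter and the fake loader are hypothetical names, and the garbage-collection behaviour shown relies on CPython.

```python
import gc
import weakref


class Item:
    """Plain object standing in for a mapped instance."""
    def __init__(self, pk, name):
        self.pk = pk
        self.name = name


class ToySession:
    """Toy identity map: one object per primary key, held only weakly."""
    def __init__(self):
        self._identity_map = weakref.WeakValueDictionary()
        self.queries_issued = 0

    def _load_from_db(self, pk):
        # Stands in for a real SQL round trip.
        self.queries_issued += 1
        return Item(pk, "item-%d" % pk)

    def get(self, pk):
        obj = self._identity_map.get(pk)
        if obj is None:
            obj = self._load_from_db(pk)
            self._identity_map[pk] = obj
        return obj


session = ToySession()
first = session.get(11184)
second = session.get(11184)
same_object = first is second                      # served from the map
queries_while_referenced = session.queries_issued

# Drop every strong reference; the weak map forgets the object.
del first, second
gc.collect()
third = session.get(11184)
queries_after_release = session.queries_issued
```

While a reference is held, the second get() is answered from the map and only one "query" is counted; once all references are dropped, the next get() has to load again. This is the situation described in the question, where item is set to None between two lookups.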

1)
From the Session documentation:
[The Session] is somewhat used as a cache, in that it implements the identity map pattern, and stores objects keyed to their primary key. However, it doesn’t do any kind of query caching. ... It’s only when you say query.get({some primary key}) that the Session doesn’t have to issue a query.
2.1) You are correct, relationships are not modified when you refresh an object.
2.2) Yes, the group will be in the identity map.
2.3) I believe your best bet will be to attempt to reload the entire group.items in a single query. From my experience it is usually much quicker to issue one large request than several smaller ones. The only time it would make sense to reload only a specific group.item is if there was exactly one of them that needed to be loaded. Though in that case you are doing one large query instead of one small one, so you don't actually reduce the number of queries.
I have not tried it, but I believe you should be able to use the sqlalchemy.orm.util.identity_key method to determine whether an object is in sqlalchemy's identity map. I would be interested to find out what calling identity_key(Group, 83) returns.
Initial Question)
If I understand correctly, you have an object that you fetched from the database where some of its relationships were eagerly loaded, and you would like to fetch the rest of the relationships with a single query? I believe you may be able to use the Session.refresh() method, passing in the names of the relationships that you want to load.

Do I reference the session when making any db calls in sqlalchemy?

In this tutorial it says (http://www.rmunn.com/sqlalchemy-tutorial/tutorial.html) to select all rows of an entity like:
s = products.select()
rs = s.execute()
I get an error saying:
This select object is not bound and does not support direct execution ...
Do I need to reference the session object?
I just want to get all rows in my products table (I've already mapped everything, and I've already inserted thousands of rows, so that part works).

That tutorial was written for SQLAlchemy 0.2, and you are most likely not using a version that old. In the latest documentation, the preferred method is to pass the select statement to a connection. Try this instead:
query = users.select()
result = conn.execute(query)
Ref: http://www.sqlalchemy.org/docs/05/sqlexpression.html#selecting
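A self-contained sketch of the same approach, executing the select through a connection rather than on the statement itself. The products table definition, the sample rows and the in-memory SQLite engine are assumptions made for illustration.

```python
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine

metadata = MetaData()

# Hypothetical table standing in for the mapped products table.
products = Table(
    "products", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
)

engine = create_engine("sqlite://")  # in-memory database, for the sketch only
metadata.create_all(engine)

# engine.begin() opens a connection and commits when the block exits.
with engine.begin() as conn:
    conn.execute(products.insert(), [
        {"id": 1, "name": "widget"},
        {"id": 2, "name": "gadget"},
    ])

# The statement is passed to the connection instead of calling s.execute().
with engine.connect() as conn:
    result = conn.execute(products.select().order_by(products.c.id))
    rows = [tuple(row) for row in result]
```

With an ORM session already set up, session.query(Product).all() on the mapped class would be the equivalent way to fetch every row (Product being whatever class is mapped to the table).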

SQLAlchemy - MapperExtension.before_delete not called

I have a question regarding SQLAlchemy. I have a database which contains Items, and every Item has several Records assigned to it (1:n). A Record is partially stored in the database, but it also has an assigned file (1:1) on the filesystem.
What I want to do is to delete the assigned file when the Record is removed from the database. So I wrote the following MapperExtension:
class _StoredRecordEraser(MapperExtension):
    def before_delete(self, mapper, connection, instance):
        instance.erase()
The following code creates an experimental setup (full code is here: test.py):
session = Session()
i1 = Item(id='item1')
r11 = Record(id='record11', attr='1')
i1.records.append(r11)
r12 = Record(id='record12', attr='2')
i1.records.append(r12)
session.add(i1)
session.commit()
And finally, my problem... The following code works OK and the old.erase() method is called:
session = Session()
i1 = session.query(Item).get('item1')
old = i1.records[0]
new = Record(id='record13', attr='3')
i1.records.remove(old)
i1.records.append(new)
session.commit()
But when I change the id of the new Record to record11, which is already in the database but is not the same record (attr=3), old.erase() is not called. Does anybody know why?
Thanks

A delete + insert of two records that ultimately have the same primary key within a single flush is converted into a single update right now. This is not the best behavior: it really should delete then insert, so that the various events assigned to those activities are triggered as expected (not just mapper extension methods, but database level defaults too). But the flush() process is hardwired to perform inserts/updates first, then deletes. As a workaround, you can issue a flush() after the remove/delete operation, then a second one for the add/insert.
As far as the current behavior of flushes goes, I've looked into trying to break this out, but it gets very complicated: inserts which depend on deletes would have to execute after the deletes, but updates which depend on inserts would have to execute beforehand. Ultimately, the unitofwork module would be rewritten (big time) to consider all inserts/updates/deletes in a single stream of dependent actions that would be topologically sorted against each other. This would simplify the methods used to execute statements in the correct order, although all new systems for synchronizing data between rows based on server-level defaults would have to be devised, and it's possible that complexity would be re-introduced if it turned out the "simpler" method spent too much time naively sorting insert statements that are known at the ORM level to not require any sorting against each other. The topological sort works at a more coarse-grained level than that right now.
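The two-flush workaround might look roughly like the sketch below. It is written against the event API of newer SQLAlchemy releases rather than MapperExtension, and the class definitions, the delete-orphan cascade, the erased list (standing in for erase()) and the in-memory SQLite engine are all assumptions, since the original test.py is not shown here.

```python
from sqlalchemy import Column, ForeignKey, String, create_engine, event
from sqlalchemy.orm import Session, relationship

try:
    from sqlalchemy.orm import declarative_base  # SQLAlchemy 1.4+
except ImportError:
    from sqlalchemy.ext.declarative import declarative_base  # older releases

Base = declarative_base()
erased = []  # stands in for deleting the file on disk


class Item(Base):
    __tablename__ = "demo_items"
    id = Column(String, primary_key=True)
    # Assumed cascade: removing a Record from the list deletes it.
    records = relationship("Record", cascade="all, delete-orphan")


class Record(Base):
    __tablename__ = "demo_records"
    id = Column(String, primary_key=True)
    attr = Column(String)
    item_id = Column(String, ForeignKey("demo_items.id"))


@event.listens_for(Record, "before_delete")
def _erase_file(mapper, connection, instance):
    erased.append(instance.id)


engine = create_engine("sqlite://")  # in-memory database, for the sketch only
Base.metadata.create_all(engine)

session = Session(engine)
i1 = Item(id="item1")
i1.records.append(Record(id="record11", attr="1"))
session.add(i1)
session.commit()

old = i1.records[0]
i1.records.remove(old)
session.flush()  # first flush: the DELETE is emitted and before_delete fires

i1.records.append(Record(id="record11", attr="3"))
session.flush()  # second flush: the INSERT for the replacement row
session.commit()

attrs = [r.attr for r in session.query(Record).order_by(Record.id)]
```

Splitting the work into two flushes keeps the delete and the insert from being merged into a single UPDATE, so the delete hook gets a chance to run.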
