How to do bulk update in sqlalchemy ORM


Suppose I have a model
class Foo(...):
    foo = Column...
    bar = Column...
I have a list of pairs t = (foo, bar), each indicating that for the Foo whose foo equals t[0], the new value of bar is t[1].
How can I update many rows in one transaction?
I see session.bulk_update_mappings, but from the docs it's not clear to me where to provide arguments for the WHERE clause.

The SQLAlchemy 1.1 performance FAQ entry "I'm inserting 400,000 rows with the ORM and it's really slow!" (the fourth Google result for "sqlalchemy bulk update") suggests that the ORM is not intended for this kind of work, and lists possible ways to go:
ORMs are basically not intended for high-performance bulk inserts -
this is the whole reason SQLAlchemy offers the Core in addition to the
ORM as a first-class component.
For the use case of fast bulk inserts, the SQL generation and
execution system that the ORM builds on top of is part of the Core.
Using this system directly, we can produce an INSERT that is
competitive with using the raw database API directly.
Alternatively, the SQLAlchemy ORM offers the Bulk Operations suite of
methods, which provide hooks into subsections of the unit of work
process in order to emit Core-level INSERT and UPDATE constructs with
a small degree of ORM-based automation.
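As far as I understand, session.bulk_update_mappings matches rows by the primary key contained in each dictionary, so it has no place for a custom WHERE clause. One possible way to update by a non-key column is a Core UPDATE with bindparam(), executed once with a list of parameter sets. The following is a minimal sketch, assuming SQLAlchemy 1.4 or later and an in-memory SQLite database; the table layout, the id column, and the parameter names b_foo and b_bar are illustrative assumptions, not part of the question.

```python
from sqlalchemy import (
    Column, Integer, MetaData, String, Table, bindparam, create_engine, select,
)

metadata = MetaData()

# Hypothetical stand-in for the table behind the Foo model.
foo_table = Table(
    "foo",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("foo", String, unique=True),
    Column("bar", String),
)

engine = create_engine("sqlite://")  # in-memory database, for illustration only
metadata.create_all(engine)

# (foo, bar) pairs: the row whose foo equals t[0] gets bar = t[1].
pairs = [("a", "new-a"), ("b", "new-b")]

# The bind parameter names must differ from the column names,
# because the column names are reserved for the SET clause.
stmt = (
    foo_table.update()
    .where(foo_table.c.foo == bindparam("b_foo"))
    .values(bar=bindparam("b_bar"))
)

with engine.begin() as conn:  # one transaction, committed on exit
    conn.execute(
        foo_table.insert(),
        [{"foo": "a", "bar": "old"}, {"foo": "b", "bar": "old"}, {"foo": "c", "bar": "old"}],
    )
    # Passing a list of dicts runs the UPDATE once per parameter set (executemany).
    conn.execute(stmt, [{"b_foo": f, "b_bar": b} for f, b in pairs])
    rows = conn.execute(
        select(foo_table.c.foo, foo_table.c.bar).order_by(foo_table.c.foo)
    ).all()
```

With an ORM model, the same statement can be built from Foo.__table__ and run on session.connection() so that it shares the session's transaction.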

Related

SQLAlchemy MetaData.reflect() vs. automap_base.prepare()

It seems to me that MetaData.reflect() and sqlalchemy.ext.automap.prepare() tables should be able to be used interchangeably for many use cases, but they can't be.
Passing metadata.tables['mytable'] to conn.execute(select(...)) returns a sqlalchemy.engine.cursor.CursorResult, and each row exposes the columns directly (e.g. x.columnA).
But passing automap_base().classes.mytable to the same conn.execute(select(...)) returns a sqlalchemy.engine.result.ChunkedIteratorResult, and you need x.mytable.columnA to get at the column.
The sqlalchemy.engine.Result documentation says as much:
New in version 1.4: The Result object provides a completely updated
usage model and calling facade for SQLAlchemy Core and SQLAlchemy ORM.
In Core, it forms the basis of the CursorResult object which replaces
the previous ResultProxy interface. When using the ORM, a higher level
object called ChunkedIteratorResult is normally used.
Can I generically convert one to the other? That is, some wrapper that works for every table without needing the table name?
What's the best futureproof way to do this? I want my code to be forward-looking to sqlalchemy 2.0. Does that mean I should move away from either automap or MetaData?
sqlalchemy 1.4.35

This is the difference between the Core and the ORM.
select() from a Table vs. ORM class
While the SQL generated in these examples looks the same whether we
invoke select(user_table) or select(User), in the more general case
they do not necessarily render the same thing, as an ORM-mapped class
may be mapped to other kinds of “selectables” besides tables. The
select() that’s against an ORM entity also indicates that ORM-mapped
instances should be returned in a result, which is not the case when
SELECTing from a Table object.
Don't hesitate to use the ORM. It's higher level, pythonic, cool, and automap is ORM.
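To make the difference concrete, here is a small sketch, assuming SQLAlchemy 1.4 or later and an in-memory SQLite database. The User class and its columns are hypothetical and are declared by hand instead of through automap, but the result shapes should be the same. It also shows .scalars(), which unwraps the single entity of each row and so works without knowing the table name.

```python
from sqlalchemy import Column, Integer, String, create_engine, select
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class User(Base):  # hypothetical mapped class
    __tablename__ = "user_account"
    id = Column(Integer, primary_key=True)
    name = Column(String)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(User(name="alice"))
    session.commit()

    # Core: selecting from the Table yields rows whose attributes are the columns.
    core_row = session.execute(select(User.__table__)).first()
    core_name = core_row.name

    # ORM: selecting the class yields rows holding one entity, keyed by class name.
    orm_row = session.execute(select(User)).first()
    orm_name = orm_row.User.name

    # .scalars() returns the first element of each row, i.e. the User object itself.
    user = session.execute(select(User)).scalars().first()
    scalar_name = user.name
```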

SQLAlchemy doesn't map reflected class

I have this code:
def advertiser_table(engine):
    return Table('advertiser', metadata, autoload=True, autoload_with=engine)
And later I try this:
advertisers = advertiser_table(engine)
...
session.bulk_insert_mappings(
    advertisers.name,
    missing_advetisers.to_dict('records'),
)
where missing_advetisers is a Pandas DataFrame (but it's not important for this question).
The error this gives me is:
sqlalchemy.orm.exc.UnmappedClassError: Class ''advertiser'' is not mapped
From reading the documentation I could scramble together enough to ask the question, but not much more than that. What is a Mapper and why is it so essential to the functioning of this library? Why isn't "the class" mapped? And what am I to do to "map" it to whatever this library wants it to map?

A Mapper is the M in ORM. It is the thing that maps your table (advertisers in this case) to instances of a class (which you are missing in this case) in order for you to operate on it.
The reason it's confusing for you is because SQLAlchemy is actually two libraries in one -- one is called SQLAlchemy Core, and the other is the SQLAlchemy ORM. Core provides the ability to work with tables and to construct queries that return rows, while the ORM builds on top of Core to provide the ability to work with instances of classes and their relationships as an abstraction. Core roughly corresponds to things you can do on Connection and Engine, while ORM roughly corresponds to things you can do on Session.
So, all of that is to say, session.bulk_insert_mappings is an ORM functionality, and you cannot use it without having a mapped class.
What can you do instead? Use the equivalent Core functionality:
query = advertisers.insert().values(missing_advetisers.to_dict('records'))
engine.execute(query) # or session.execute(query); note that engine.execute() was removed in SQLAlchemy 2.0
Or even use the pandas-provided to_sql function:
missing_advetisers.to_sql("advertiser", engine, if_exists="append")
If you insist on using the ORM, you need to declare a mapped class for your table. The easiest way when using reflection is to use automap. The linked documentation has many examples, so I won't go into detail here.
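For completeness, a rough sketch of the automap route, assuming SQLAlchemy 1.4 or later (where prepare() accepts autoload_with) and an in-memory SQLite database. The advertiser schema created here is a hypothetical stand-in for the real table, and the records list stands in for the DataFrame's to_dict('records') output. Automap only maps tables that have a primary key.

```python
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine, select
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session

engine = create_engine("sqlite://")

# Stand-in for the existing database: create an "advertiser" table to reflect.
setup_metadata = MetaData()
Table(
    "advertiser",
    setup_metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
)
setup_metadata.create_all(engine)

# Reflect the schema and generate mapped classes from it.
Base = automap_base()
Base.prepare(autoload_with=engine)
Advertiser = Base.classes.advertiser  # mapped class named after the table

# Stand-in for missing_advetisers.to_dict('records').
records = [{"name": "Acme"}, {"name": "Globex"}]

with Session(engine) as session:
    # The first argument is now a mapped class, not a Table or a column.
    session.bulk_insert_mappings(Advertiser, records)
    session.commit()
    names = session.execute(
        select(Advertiser.name).order_by(Advertiser.name)
    ).scalars().all()
```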

SqlAlchemy look-ahead caching?

Edit: Main Topic:
Is there a way I can force SqlAlchemy to pre-populate the session as far as it can? That is, synchronize as much state from the database as possible (there will be no DB updates at this point).
I am having some mild performance issues and I believe I have traced it to SqlAlchemy. I'm sure there are changes in my declarative and db-schema that could improve time, but that is not what I am asking about here.
My SqlAlchemy declarative defines 8 classes, my database has 11 tables with only 7 of them holding my real data, and total my database has 800 records (all Integers and UnicodeText). My database engine is sqlite and the actual size is currently 242Kb.
Really, the number of entities is quite small, but many of the table relationships have recursive behavior (5-6 levels deep). My problem starts with the wonderful automagic that SA does for me, and my reluctance to properly extract the data with my own python classes.
I have ORM attribute access scattered across all kinds of iterators, recursive evaluators, right up to my file I/O streams. The access on these attributes is largely non-linear, and every time I do a lookup, my callstack disappears into SqlAlchemy for quite some time, and I am getting lots of singleton queries.
I am using mostly default SA settings (python 2.7.2, sqlalchemy 0.7).
Considering that RAM is not an issue, and that my database is so small (for the time being), is there a way I can just force SqlAlchemy to pre-populate the session as far as it can? I am hoping that if I just load the raw data into memory, then the most I will have to do is chase a few joins dynamically (almost all queries are pretty straightforward).
I am hoping for a 5 minute fix so I can run some reports ASAP. My next month of TODO is likely going to be full of direct table queries and tighter business logic that can pipeline tuples.

A five minute fix for that kind of issue is unlikely, but for many-to-one "singleton" gets there is a simple recipe I use often. Suppose you're loading lots of User objects and they all have many-to-one references to a Category of some kind:
# load all categories, then hold onto them
categories = Session.query(Category).all()
for user in Session.query(User):
    print user, user.category # no SQL will be emitted for the Category
This works because the query.get() that a many-to-one emits will look in the local identity map for the primary key first.
If you're looking for more caching than that (and have a bit more than five minutes to spare), the same concept can be expanded to also cache the results of SELECT statements in a way that the cache is associated only with the current Session - check out the local_session_caching.py recipe included with the distribution examples.
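The many-to-one recipe above can be checked with a statement counter. This is only a sketch, assuming SQLAlchemy 1.4 or later on Python 3 and an in-memory SQLite database; the User and Category models, their columns, and the record_statement listener are illustrative assumptions.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine, event, select
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class Category(Base):  # hypothetical model
    __tablename__ = "category"
    id = Column(Integer, primary_key=True)
    name = Column(String)


class User(Base):  # hypothetical model with a many-to-one to Category
    __tablename__ = "user_account"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    category_id = Column(ForeignKey("category.id"))
    category = relationship(Category)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

statements = []


@event.listens_for(engine, "before_cursor_execute")
def record_statement(conn, cursor, statement, parameters, context, executemany):
    # Record every SQL string sent to the database from here on.
    statements.append(statement)


with Session(engine) as session:
    staff = Category(name="staff")
    session.add_all([staff, User(name="u1", category=staff), User(name="u2", category=staff)])
    session.commit()

with Session(engine) as session:
    # Load all categories and keep a reference; the identity map is weak-referencing.
    categories = session.execute(select(Category)).scalars().all()
    users = session.execute(select(User).order_by(User.name)).scalars().all()

    emitted_before = len(statements)
    # Each user.category lookup is resolved from the identity map by primary key.
    pairs = [(user.name, user.category.name) for user in users]
    lazy_load_statements = len(statements) - emitted_before
```

If the categories list were not loaded first, each distinct category would cost one extra SELECT on first access.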

Select statement with SqlAlchemy

Yes, very basic question.
I've successfully created my db using declarative_base, and can perform inserts into the db too. I just have a few questions about SqlAlchemy sql statements.
I've created a table called Location.
A few issues/questions (see code below):
For the statement "print row", I have to specify each column name that I want to have output, i.e. "print row.name, row.lat", etc. Why? (Otherwise the print statement outputs "<classname.Location at <...>>".)
Also, what is the preferred way to interact with a db and perform queries (select, insert, update, etc.)- there seem to be a bunch of options: using sqlalchemy.orm.select for example, or engine.text(<sql query>).execute().fetchall(), or even conn.execute(<select>). Options are great, but right now they're all just confusing me.
Thanks so much for the tips!
Here's my code:
from sqlalchemy import create_engine
from sqlalchemy.sql import select
from location_db_setup import *
db_path = "sqlite:////volumes/users/shared/programming/python/web/map.db"
engine = create_engine(db_path, echo= True)
Session = sessionmaker(bind= engine)
session = Session()
session.query(Location).fetchall()
for row in locations:
    print row

The code in your sample is incomplete and has errors, so it's impossible to say for sure what Location is here. I assume it's a mapped class, so you are requesting a list of all Location objects, not rows. When you print an object you get its string representation. The string representation of an object can be changed by defining a custom __str__ method.
Although the ORM is the most important part of SQLAlchemy, it's not the only one. SQLAlchemy also exposes a lot of functionality not directly related to the ORM. When you work with objects, the preferred way to create queries is the corresponding session methods. But sometimes you need selectable objects not bound to a particular session (they are not executed directly, but are used in expressions passed to session methods). That's why there are functions in the sqlalchemy.orm package.
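As an illustration of the point about string representations, here is a minimal sketch, assuming SQLAlchemy 1.4 or later on Python 3 and an in-memory SQLite database. The Location columns are guessed from the question (name and lat), and __repr__ is defined here, which print() falls back to when no __str__ is defined.

```python
from sqlalchemy import Column, Float, Integer, String, create_engine, select
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Location(Base):  # assumed shape of the asker's mapped class
    __tablename__ = "location"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    lat = Column(Float)

    def __repr__(self):
        # Without this method, printing shows something like <Location object at 0x...>.
        return f"Location(name={self.name!r}, lat={self.lat!r})"


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Location(name="London", lat=51.5))
    session.commit()
    # The query returns Location objects, not plain rows.
    printed = [repr(loc) for loc in session.execute(select(Location)).scalars()]
```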

The preferred way to interact with a db when using an ORM is not to use queries but to use objects that correspond to the tables you are manipulating, typically in conjunction with the session object. SELECT queries become get() or find() calls in some ORMs, query() calls in others. INSERT becomes creating a new object of the type you want (and maybe explicitly adding it, eg session.add() in sqlalchemy). UPDATE becomes editing such an object, and DELETE becomes deleting an object (eg. session.delete() ). The ORM is meant to handle the hard work of translating these operations into SQL for you.
Have you read the tutorial?
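A short sketch of how those four operations can look with a SQLAlchemy session, assuming SQLAlchemy 1.4 or later and an in-memory SQLite database; the Location model here is a hypothetical minimal version of the asker's class.

```python
from sqlalchemy import Column, Integer, String, create_engine, select
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Location(Base):  # hypothetical minimal model
    __tablename__ = "location"
    id = Column(Integer, primary_key=True)
    name = Column(String)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    # INSERT: create objects and add them to the session.
    session.add_all([Location(name="London"), Location(name="Oslo")])
    session.commit()

    # SELECT: ask the session for objects matching a condition.
    oslo = session.execute(
        select(Location).where(Location.name == "Oslo")
    ).scalar_one()

    # UPDATE: change an attribute; the session emits the UPDATE on flush.
    oslo.name = "Bergen"

    # DELETE: mark the object for deletion.
    london = session.execute(
        select(Location).where(Location.name == "London")
    ).scalar_one()
    session.delete(london)
    session.commit()

    remaining = [loc.name for loc in session.execute(select(Location)).scalars()]
```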

Denis and Kylotan gave you good answers. I'm just gonna focus on point 2.
Sometimes it depends on your taste. There are times when you need database-specific features that an ORM can't provide; that's a case where you should use session.execute(<sql here>) or conn.execute(<sql here>). Another case is when you have a very complex query and you can't find a suitable ORM expression for it.
Usually, using ORM features like select([...]).where(... or Session.query(<Model here>).filter(... (declarative base) are enough. Almost every sql query has an ORM equivalent.
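For the raw SQL case, a minimal sketch, assuming SQLAlchemy 1.4 or later and an in-memory SQLite database. The location table and its columns are made up for illustration; text() wraps the SQL string and the :name placeholders are bound parameters.

```python
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")

with engine.begin() as conn:
    # Plain SQL strings are wrapped in text() before being executed.
    conn.execute(text("CREATE TABLE location (id INTEGER PRIMARY KEY, name TEXT, lat REAL)"))
    conn.execute(
        text("INSERT INTO location (name, lat) VALUES (:name, :lat)"),
        [{"name": "London", "lat": 51.5}, {"name": "Oslo", "lat": 59.9}],
    )
    # Bound parameters keep values out of the SQL string itself.
    northern = conn.execute(
        text("SELECT name FROM location WHERE lat > :min_lat ORDER BY name"),
        {"min_lat": 55},
    ).scalars().all()
```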

Map raw SQL to multiple related Django models

For performance reasons I can't use the ORM query methods of Django, and I have to use raw SQL for some complex queries. I want to find a way to map the results of a SQL query to several models.
I know I can use the following statement to map the query results to one model, but I can't figure how to use it to be able to map to related models (like I can do by using the select_related statement in Django).
model_instance = MyModel(**dict(zip(field_names, row_data)))
Is there a relatively easy way to be able to map fields of related tables that are also in the query result set?

First, can you prove the ORM is stopping your performance? Sometimes performance problems are simply poor database design, or improper indexes. Usually this comes from trying to force-fit Django's ORM onto a legacy database design. Stored procedures and triggers can have adverse impact on performance -- especially when working with Django where the trigger code is expected to be in the Python model code.
Sometimes poor performance is an application issue. This includes needless order-by operations being done in the database.
The most common performance problem is an application that "over-fetches" data. Casually using the .all() method and creating large in-memory collections. This will crush performance. The Django query sets have to be touched as little as possible so that the query set iterator is given to the template for display.
Once you choose to bypass the ORM, you have to fight out the Object-Relational Impedance Mismatch problem. Again. Specifically, relational "navigation" has no concept of "related": it has to be a first-class fetch of a relational set using foreign keys. To assemble a complex in-memory object model via SQL is simply hard. Circular references make this very hard; resolving FK's into collections is hard.
If you're going to use raw SQL, you have two choices.
Eschew "select related" -- it doesn't exist -- and it's painful to implement.
Invent your own ORM-like "select related" features. A common approach is to add stateful getters that (a) check a private cache to see if they've fetched the related object and if the object doesn't exist, (b) fetch the related object from the database and update the cache.
In the process of inventing your own stateful getters, you'll be reinventing Django's, and you'll probably discover that it isn't the ORM layer, but a database design or an application design issue.
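The second choice above can be sketched without any ORM at all. The following uses plain Python classes and the standard library sqlite3 module; the author and book tables, the class names, and load_books_with_authors are all hypothetical. It shows both a stateful getter with a private cache and a hand-written "select related" that splits each joined row across two objects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT,
                       author_id INTEGER REFERENCES author(id));
    INSERT INTO author VALUES (1, 'Ann');
    INSERT INTO book VALUES (10, 'First', 1), (11, 'Second', 1);
    """
)


class Author:
    def __init__(self, id, name):
        self.id = id
        self.name = name


class Book:
    def __init__(self, id, title, author_id):
        self.id = id
        self.title = title
        self.author_id = author_id
        self._author = None  # private cache for the related object

    @property
    def author(self):
        # (a) return the cached object if present, (b) otherwise fetch and cache it.
        if self._author is None:
            row = conn.execute(
                "SELECT id, name FROM author WHERE id = ?", (self.author_id,)
            ).fetchone()
            self._author = Author(*row)
        return self._author


def load_books_with_authors():
    """Hand-rolled 'select related': one JOIN, each row split across two models."""
    sql = (
        "SELECT b.id, b.title, b.author_id, a.id, a.name "
        "FROM book b JOIN author a ON a.id = b.author_id ORDER BY b.id"
    )
    authors = {}  # author id -> Author, so related objects are shared
    books = []
    for row in conn.execute(sql):
        book = Book(*row[:3])
        book._author = authors.setdefault(row[3], Author(*row[3:]))
        books.append(book)
    return books


books = load_books_with_authors()
lazy_book = Book(10, "First", 1)  # no author loaded yet; fetched on first access
```

Even at this size the sketch has to decide how to share related objects and when to invalidate the cache, which is the kind of work the ORM normally does.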
