SQLAlchemy group_by SQLite vs PostgreSQL - python

For the web app we are building we used SQLite for testing purposes. Recently we wanted to migrate to PostgreSQL. That's where the problems started:
We have this SQLAlchemy model (simplified):
class Entity(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    i_want_this = db.Column(db.String)
    some_value = db.Column(db.Integer)
I want to group all Entity rows by some_value, which I did like this (simplified):
db.session.query(Entity, db.func.count()).group_by(Entity.some_value)
In SQLite this worked. In retrospect I see that it does not make sense, but SQLite made sense of it anyway; I can't say for sure which of the entities was returned.
Now in PostgreSQL we get this error:
sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) column "entity.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT entity.id AS entity_id, entity.i_want_this AS entity_not...
^
[SQL: 'SELECT entity.id AS entity_id, entity.i_want_this AS entity_i_want_this, count(*) AS count_1 \nFROM entity GROUP BY entity.some_value']
And that error totally makes sense.
So my first question is: Why does SQLite allow this and how does it do it (what hidden aggregation is used)?
My second question is obvious: How would I do it with PostgreSQL?
I'm actually only interested in the count and the first i_want_this value. So I could do this:
groups = db.session.query(db.func.min(Entity.id), db.func.count()).group_by(Entity.some_value)
[(Entity.query.get(id_), count) for id_, count in groups]
But I don't want these additional get queries.
So I want to select the first entity (the entity with the minimal id) and the number of entities, grouped by some_value; or alternatively, the first i_want_this and the count, grouped by some_value.
EDIT to make it clear:
I want to group by some_value (Done)
I want to get the number of entities in each group (Done)
I want to get the entity with the lowest id in each group (Need help on this)
Alternatively I want to get the i_want_this value of the entity with the lowest id in each group (Need help on this)

Concerning your first question, check the SQLite documentation:
Each expression in the result-set is then evaluated once for each
group of rows. If the expression is an aggregate expression, it is
evaluated across all rows in the group. Otherwise, it is evaluated
against a single arbitrarily chosen row from within the group. If
there is more than one non-aggregate expression in the result-set,
then all such expressions are evaluated for the same row.
Concerning the second question, you'll probably have to explain what you actually want to achieve, considering that your current query returns more or less random results even in SQLite.
EDIT:
To get the entities with minimum id per group, you can use the Query.select_from construct:
import sqlalchemy.sql as sa_sql

# create the aggregate/grouped query
grouped = sa_sql.select([sa_sql.func.min(Entity.id).label('min_id')])\
    .group_by(Entity.some_value)\
    .alias('grouped')

# join it with the full entities table
joined = sa_sql.join(Entity, grouped, grouped.c.min_id == Entity.id)

# and let sqlalchemy pull the entities from this statement:
session.query(Entity).select_from(joined)
This will produce the following SQL:
SELECT entities.id AS entities_id,
       entities.i_want_this AS entities_i_want_this,
       entities.some_value AS entities_some_value
FROM entities
JOIN (SELECT min(entities.id) AS min_id
      FROM entities
      GROUP BY entities.some_value) AS grouped
  ON grouped.min_id = entities.id
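Since you also want the number of entities per group, the same pattern can be extended by adding the count to the grouped subquery and selecting it alongside the entity (a sketch in the same style, not tested against your schema):

grouped = sa_sql.select([sa_sql.func.min(Entity.id).label('min_id'),
                         sa_sql.func.count().label('entity_count')])\
    .group_by(Entity.some_value)\
    .alias('grouped')

joined = sa_sql.join(Entity, grouped, grouped.c.min_id == Entity.id)

# yields (Entity, count) tuples, one per some_value group
session.query(Entity, grouped.c.entity_count).select_from(joined)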

Related

sqlalchemy.exc.IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "product_pkey" [duplicate]

I have a table called products which has the following columns:
id, product_id, data, activity_id
What I am essentially trying to do is bulk-copy existing products, update their activity_id, and create new entries in the products table.
Example:
I already have 70 existing entries in products with activity_id 2
Now I want to create another 70 entries with the same data except for an updated activity_id.
I could have thousands of existing entries that I'd like to make a copy of, updating the copied entries' activity_id to a new id.
products = self.session.query(model.Products).filter(filter1, filter2).all()
This returns all the existing products for a filter.
Then I iterate through products, clone the existing products, and just update the activity_id field.
for product in products:
    product.activity_id = new_id

self.uow.skus.bulk_save_objects(simulation_skus)
self.uow.flush()
self.uow.commit()
What is the best/fastest way to do these bulk inserts so it takes less time? Performance is OK as of now, but is there a better solution?
You don't need to load these objects locally; all you really want to do is have the database create these rows.
You essentially want to run a query that creates the rows from the existing rows:
INSERT INTO product (product_id, data, activity_id)
SELECT product_id, data, 2 -- the new activity_id value
FROM product
WHERE activity_id = old_id
The above query would run entirely on the database server; this is far preferable to loading your query results into Python objects and then sending all that data back to the server to populate INSERT statements for each new row.
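For reference, a hard-coded version of that statement in SQLAlchemy Core might look like this (a sketch that assumes the three column names from your question, an auto-generated primary key, and the model.Products class and new_id/old_id values from your snippets):

from sqlalchemy import select, literal

product_table = model.Products.__table__  # the Core Table behind the ORM model

# select the existing rows, swapping in the new activity_id as a bound literal
src = select([product_table.c.product_id,
              product_table.c.data,
              literal(new_id).label('activity_id')]).\
    where(product_table.c.activity_id == old_id)

stmt = product_table.insert().from_select(['product_id', 'data', 'activity_id'], src)
self.session.execute(stmt)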
Queries like that are something you could do with SQLAlchemy core, the half of the API that deals with generating SQL statements. However, you can use a query built from a declarative ORM model as a starting point. You'd need to:
1. Access the Table instance for the model, as that then lets you create an INSERT statement via the Table.insert() method. (You could also get the same object from a models.Product query, more on that later.)
2. Access the statement that would normally fetch the data for your Python instances for your filtered models.Product query; you can do so via the Query.statement property.
3. Update the statement to replace the included activity_id column with your new value, and remove the primary key (I'm assuming that you have an auto-incrementing primary key column).
4. Apply that updated statement to the Insert object for the table via Insert.from_select().
5. Execute the generated INSERT INTO ... FROM ... query.
Step 1 can be achieved by using the SQLAlchemy introspection API; the inspect() function, applied to a model class, gives you a Mapper instance, which in turn has a Mapper.local_table attribute.
Steps 2 and 3 require a little juggling with the Select.with_only_columns() method to produce a new SELECT statement where we swapped out the column. You can't easily remove a column from a select statement, but we can loop over the existing columns in the query to 'copy' them across to the new SELECT, making our replacement at the same time.
Step 4 is then straightforward: Insert.from_select() needs the columns that are inserted and the SELECT query. We have both, as the SELECT object also gives us its columns.
Here is the code for generating your INSERT; the **replace keyword arguments are the columns you want to replace when inserting:
from sqlalchemy import inspect, literal
from sqlalchemy.sql import ClauseElement

def insert_from_query(model, query, **replace):
    # The SQLAlchemy core definition of the table
    table = inspect(model).local_table

    # and the underlying core select statement to source new rows from
    select = query.statement

    # validate assumptions: make sure the query produces rows from the above table
    assert table in select.froms, f"{query!r} must produce rows from {model!r}"
    assert all(c.name in select.columns for c in table.columns), f"{query!r} must include all {model!r} columns"

    # updated select, replacing the indicated columns
    as_clause = lambda v: literal(v) if not isinstance(v, ClauseElement) else v
    replacements = {name: as_clause(value).label(name) for name, value in replace.items()}
    from_select = select.with_only_columns([
        replacements.get(c.name, c)
        for c in table.columns
        if not c.primary_key
    ])

    return table.insert().from_select(from_select.columns, from_select)
I included a few assertions about the model and query relationship, and the code accepts arbitrary column clauses as replacements, not just literal values. You could use func.max(models.Product.activity_id) + 1 as a replacement value (wrapped as a subselect), for example.
The above function executes steps 1-4, producing the desired INSERT SQL statement when printed (I created a products model and query that I thought might be representative):
>>> print(insert_from_query(models.Product, products, activity_id=2))
INSERT INTO products (product_id, data, activity_id) SELECT products.product_id, products.data, :param_1 AS activity_id
FROM products
WHERE products.activity_id != :activity_id_1
All you have to do is execute it:
insert_stmt = insert_from_query(models.Product, products, activity_id=2)
self.session.execute(insert_stmt)
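As noted above, a column clause works as a replacement too; for example, using the current maximum activity_id plus one could look roughly like this (a sketch; models.Product and the products query are the same hypothetical objects as above):

from sqlalchemy import func, select

# computed by the database at insert time, wrapped as a scalar subselect
next_activity_id = select([func.max(models.Product.activity_id) + 1]).as_scalar()
insert_stmt = insert_from_query(models.Product, products, activity_id=next_activity_id)
self.session.execute(insert_stmt)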

Django: select values with max timestamps or join to the same table

I have these simple Django models:
class Server(models.Model):
    name = models.CharField(max_length=120)

class ServerPropertie(models.Model):
    name = models.CharField(max_length=120)
    value = models.CharField(max_length=120)
    timestamp = models.DateTimeField()
    server = models.ForeignKey(Server)
I want to add a get_properties method to the Server model which returns the latest properties for the current server. That is, it should return the name and value of every distinct property name for the current server, where the value comes from the row with the maximum timestamp for that name.
I can do it in hardcoded raw SQL (I use Postgres):
SELECT t1.name, t1.value FROM environments_serverpropertie t1
JOIN (SELECT max("timestamp") "timestamp", name
FROM environments_serverpropertie
group by name) t2 on t1.name = t2.name and t1.timestamp = t2.timestamp;
or in Python, but I believe there is a more Pythonic solution. Could you please help me?
If you're using PostgreSQL, the usual syntax for that is:
select distinct on (name)
name, value
from environments_serverpropertie
where server = ...
order by name, timestamp desc
From PostgreSQL documentation:
SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of
each set of rows where the given expressions evaluate to equal. The
DISTINCT ON expressions are interpreted using the same rules as for
ORDER BY (see above). Note that the "first row" of each set is
unpredictable unless ORDER BY is used to ensure that the desired row
appears first.
You can see and try it in an SQL Fiddle demo.
It's possible to translate this syntax to Django; from the Django documentation:
On PostgreSQL only, you can pass positional arguments (*fields) in
order to specify the names of fields to which the DISTINCT should
apply. This translates to a SELECT DISTINCT ON SQL query.
So in Django it will be something like:
ServerPropertie.objects.filter(...).order_by('name', '-timestamp').distinct('name')
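Wrapped into the get_properties method you asked about, that could look roughly like this (a sketch):

class Server(models.Model):
    name = models.CharField(max_length=120)

    def get_properties(self):
        # one row per property name, keeping the row with the newest timestamp
        return (ServerPropertie.objects
                .filter(server=self)
                .order_by('name', '-timestamp')
                .distinct('name'))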

SQLalchemy distinct, order_by different column

I am querying two tables with SQLAlchemy. I want to use the distinct feature on my query to get a unique set of customer ids.
I have the following query:
orders[n] = DBSession.query(Order).\
    join(Customer).\
    filter(Order.oh_reqdate == date_q).\
    filter(Order.vehicle_id == vehicle.id).\
    order_by(Customer.id).\
    distinct(Customer.id).\
    order_by(asc(Order.position)).all()
As you can see, I am querying the Order table for all orders out for a specific date and a specific vehicle; this works fine. However, some customers may have more than one order for a single date, so I am trying to filter the results to only list each customer once. This works fine too, but in order to do it I must first order the results by the column that has the distinct() on it. I can add a second order_by for the column I actually want the results ordered by without causing a syntax error, but it gets ignored and the results are simply ordered by Customer.id.
I need to perform my query on the Order table and join to the customer (not the other way round) due to the way the foreign keys have been setup.
Is what I want to do possible within one query? Or will I need to re-loop over my results to get the data I want in the right order?
You never need to "re-loop", if by that you mean loading the rows into Python. You probably want to produce a subquery and select from that, which you can achieve using query.from_self().order_by(asc(Order.position)). More specific scenarios can be handled using subquery().
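Applied to the query from the question, that would be roughly (a sketch, untested against your schema):

orders[n] = DBSession.query(Order).\
    join(Customer).\
    filter(Order.oh_reqdate == date_q).\
    filter(Order.vehicle_id == vehicle.id).\
    order_by(Customer.id).\
    distinct(Customer.id).\
    from_self().\
    order_by(asc(Order.position)).all()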
In this case I can't really tell what you're going for. If a customer has more than one Order with the requested vehicle id and date, you'll get two rows, one for each Order, and each Order row will refer to the Customer. What exactly do you want instead? Just the first Order row within each customer group? I'd do that like this:
highest_order = s.query(Order.customer_id, func.max(Order.position).label('position')).\
    filter(Order.oh_reqdate == date_q).\
    filter(Order.vehicle_id == vehicle.id).\
    group_by(Order.customer_id).\
    subquery()

s.query(Order).\
    join(Customer).\
    join(highest_order, highest_order.c.customer_id == Customer.id).\
    filter(Order.oh_reqdate == date_q).\
    filter(Order.vehicle_id == vehicle.id).\
    filter(Order.position == highest_order.c.position)

How to filter by joinloaded table in SqlAlchemy?

Let's say I have 2 models, Document and Person. Document has a relationship to Person via an "owner" property. Now:
session.query(Document)\
    .options(joinedload('owner'))\
    .filter(Person.is_deleted != True)
This will join the Person table twice. One Person table will be selected (for the eager load), and the doubled one will be filtered, which is not exactly what I want because this way the document rows will not be filtered.
What can I do to apply the filter to the joined-loaded table/model?
You are right, the Person table will be used twice in the resulting SQL, but each of them serves a different purpose:
one is to filter on the condition: filter(Person.is_deleted != True)
the other is to eager-load the relationship: options(joinedload('owner'))
But the reason your query returns wrong results is that your filter condition is not complete. In order to make it produce the right results, you also need to JOIN the two models:
qry = (session.query(Document).
       join(Document.owner).  # THIS IS IMPORTANT
       options(joinedload(Document.owner)).
       filter(Person.is_deleted != True)
       )
This will return the correct rows, even though it still has 2 references (JOINs) to the Person table. The real solution to your query is to use contains_eager instead of joinedload:
qry = (session.query(Document).
       join(Document.owner).  # THIS IS STILL IMPORTANT
       options(contains_eager(Document.owner)).
       filter(Person.is_deleted != True)
       )
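With contains_eager, iterating the query populates Document.owner from that single JOIN, so no extra SELECT is issued per document (a sketch; name is an assumed column on Person):

for document in qry:
    # owner was loaded from the same JOIN above, not lazily
    print(document.owner.name)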

SQLAlchemy filter query by related object

Using SQLAlchemy, I have a one-to-many relation with two tables - users and scores. I am trying to query the top 10 users sorted by their aggregate score over the past X days.
users:
id
user_name
score
scores:
user
score_amount
created
My current query is:
top_users = DBSession.query(User).options(eagerload('scores')).filter_by(User.scores.created > somedate).order_by(func.sum(User.scores).desc()).all()
I know this is clearly not correct, it's just my best guess. However, after looking at the documentation and googling I cannot find an answer.
EDIT:
Perhaps it would help if I sketched what the MySQL query would look like:
SELECT user.*, SUM(scores.amount) as score_increase
FROM user LEFT JOIN scores ON scores.user_id = user.user_id
WHERE scores.created_at > someday
ORDER BY score_increase DESC
The single-joined-row way, with a group_by added in for all user columns although MySQL will let you group on just the "id" column if you choose:
sess.query(User, func.sum(Score.amount).label('score_increase')).\
    join(User.scores).\
    filter(Score.created_at > someday).\
    group_by(User).\
    order_by("score_increase desc")
Or if you just want the users in the result:
sess.query(User).\
    join(User.scores).\
    filter(Score.created_at > someday).\
    group_by(User).\
    order_by(func.sum(Score.amount))
The above two have an inefficiency in that you're grouping on all columns of "user" (or you're using MySQL's "group on only a few columns" thing, which is MySQL only). To minimize that, the subquery approach:
subq = sess.query(Score.user_id, func.sum(Score.amount).label('score_increase')).\
    filter(Score.created_at > someday).\
    group_by(Score.user_id).subquery()

sess.query(User).join((subq, subq.c.user_id == User.user_id)).order_by(subq.c.score_increase)
An example of the identical scenario is in the ORM tutorial at: http://docs.sqlalchemy.org/en/latest/orm/tutorial.html#selecting-entities-from-subqueries
You will need to use a subquery in order to compute the aggregate score for each user. Subqueries are described here: http://www.sqlalchemy.org/docs/05/ormtutorial.html?highlight=subquery#using-subqueries
I am assuming the column (not the relation) you're using for the join is called Score.user_id, so change it if this is not the case.
You will need to do something like this:
DBSession.query(Score.user_id, func.sum(Score.score_amount).label('total_score')).group_by(Score.user_id).filter(Score.created > somedate).order_by('total_score DESC')[:10]
However this will result in tuples of (user_id, total_score). I'm not sure if the computed score is actually important to you, but if it is, you will probably want to do something like this:
users_scores = []
q = DBSession.query(Score.user_id, func.sum(Score.score_amount).label('total_score')).group_by(Score.user_id).filter(Score.created > somedate).order_by('total_score DESC')[:10]
for user_id, total_score in q:
    user = DBSession.query(User).get(user_id)
    users_scores.append((user, total_score))
This will result in 11 queries being executed, however. It is possible to do it all in a single query, but due to various limitations in SQLAlchemy, it will likely create a very ugly multi-join query or subquery (dependent on engine) and it won't be very performant.
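For reference, the single-query variant is essentially the subquery join shown in the first answer, extended to also return the aggregate (a sketch; the User.user_id and Score.user_id column names are assumed as above):

subq = DBSession.query(Score.user_id, func.sum(Score.score_amount).label('total_score')).\
    filter(Score.created > somedate).\
    group_by(Score.user_id).\
    subquery()

top_users = DBSession.query(User, subq.c.total_score).\
    join((subq, subq.c.user_id == User.user_id)).\
    order_by(subq.c.total_score.desc())[:10]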
If you plan on doing something like this often and you have a large amount of scores, consider denormalizing the current score onto the user table. It's more work to maintain, but will result in a single non-join query like:
DBSession.query(User).order_by(User.computed_score.desc())
Hope that helps.
