SQLAlchemy distinct, order_by different column - Python

I am querying two tables with SQLAlchemy, and I want to use the distinct feature on my query to get a unique set of customer IDs.
I have the following query:
orders[n] = DBSession.query(Order).\
    join(Customer).\
    filter(Order.oh_reqdate == date_q).\
    filter(Order.vehicle_id == vehicle.id).\
    order_by(Customer.id).\
    distinct(Customer.id).\
    order_by(asc(Order.position)).all()
If you can see what is going on here, I am querying the Order table for all orders out on a specific date, for a specific vehicle; this works fine. However, some customers may have more than one order for a single date, so I am trying to filter the results to list each customer only once. This also works, but in order to do it I must first order the results by the column that distinct() is applied to. I can add a second order_by on the column I actually want the results ordered by without causing a syntax error, but it gets ignored and the results are simply ordered by Customer.id.
I need to perform my query on the Order table and join to Customer (not the other way round) due to the way the foreign keys have been set up.
Is what I want to do possible within one query? Or will I need to loop over my results again to get the data in the right order?

You never need to "re-loop" - if that means loading the rows into Python and iterating over them again, that is. You probably want to produce a subquery and select from that, which you can achieve using query.from_self().order_by(asc(Order.position)). More specific scenarios can be handled using subquery().
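A minimal sketch of that from_self() route, reusing the names from the question (DISTINCT ON expressions are PostgreSQL-specific, as in the original query):
from sqlalchemy import asc

# inner query keeps one row per customer; the outer SELECT re-orders by position
orders[n] = (
    DBSession.query(Order)
    .join(Customer)
    .filter(Order.oh_reqdate == date_q)
    .filter(Order.vehicle_id == vehicle.id)
    .order_by(Customer.id)
    .distinct(Customer.id)
    .from_self()
    .order_by(asc(Order.position))
    .all()
)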
In this case I can't really tell what you're going for. If a customer has more than one Order with the requested vehicle id and date, you'll get two rows, one for each Order, and each Order row will refer to the Customer. What exactly do you want instead? Just the first Order row within each customer group? I'd do that like this:
highest_order = s.query(Order.customer_id, func.max(Order.position).label('position')).\
    filter(Order.oh_reqdate == date_q).\
    filter(Order.vehicle_id == vehicle.id).\
    group_by(Order.customer_id).\
    subquery()

s.query(Order).\
    join(Customer).\
    join(highest_order, highest_order.c.customer_id == Customer.id).\
    filter(Order.oh_reqdate == date_q).\
    filter(Order.vehicle_id == vehicle.id).\
    filter(Order.position == highest_order.c.position)

Related

SQLAlchemy sqlite3 remove value from JSON column on multiple rows with different JSON values

Say I have an id column that is saved as ids JSON NOT NULL using SQLAlchemy, and now I want to delete an id from this column. I'd like to do several things at once:
query only the rows who have this specific ID
delete this ID from all rows it appears in
a bonus, if possible - delete the row if the ID list is now empty.
For the query, something like this:
db.query(models.X).filter(id in list(models.X.ids)) should work.
Now, I'd rather avoid iterating over the results and sending an update request per row, since there can be many rows. Is there any elegant way to do this?
Thanks!
For the search-and-remove part you can use the json_remove function (one of SQLite's built-in JSON functions):
from sqlalchemy import func
db.query(models.X).update({'ids': func.json_remove(models.X.ids, f'$[{TARGET_ID}]')})
Here, replace TARGET_ID with the position of the targeted id in the array; json_remove takes a JSON path such as $[2] (an array index), not a value, so if you only know the value you will need to locate its index first.
Note that this updates the rows 'silently' (whether or not that element is present in the array).
If you want to check first whether the target id is in the column, you can query all rows containing it (with a json_each/json_extract filter, calling .all()) and then remove the id with an .update() call.
But this will cost you twice as many queries (less performant).
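A hedged sketch of that check-first variant, using SQLite's json_each() through a textual EXISTS filter; it assumes the mapped table is literally named x and that TARGET_ID holds the value you are looking for:
from sqlalchemy import text

TARGET_ID = 42  # hypothetical id value to search for

# json_each() expands the JSON array into rows, so the EXISTS is true for
# every row whose ids array contains the target value
rows_with_target = (
    db.query(models.X)
    .filter(text(
        "EXISTS (SELECT 1 FROM json_each(x.ids) "
        "WHERE json_each.value = :target)"
    ))
    .params(target=TARGET_ID)
    .all()
)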
For the delete part, you can use the json_array_length built-in function
from sqlalchemy import func
db.query(models.X).filter(func.json_array_length(models.X.ids) == 0).delete()
FYI: I am not sure that you can do both in one query, and even if it is possible, I would not do it, for the sake of clean syntax, logging and monitoring.

Deduping a table while keeping record structures

I've got a weekly process which does a full replace operation on a few tables. The process is weekly since there are large amounts of data as a whole. However, we also want to do daily/hourly delta updates, so the system would be more in sync with production.
When we update data, we create duplicate rows (new versions of an existing row), which I want to get rid of. To achieve this, I've written a Python script which runs the following query on a table, inserting the results back into it:
QUERY = """#standardSQL
select {fields}
from (
select *
, max(record_insert_time) over (partition by id) as max_record_insert_time
from {client_name}_{environment}.{table} as a
)
where 1=1
and record_insert_time = max_record_insert_time"""
The {fields} variable is replaced with a list of all the table columns; I can't use * here because that would only work for 1 run (the next will already have a field called max_record_insert_time and that would cause an ambiguity issue).
Everything is working as expected, with one exception - some of the columns in the table are of RECORD datatype; despite not using aliases for them, and selecting their fully qualified name (e.g. record_name.child_name), when the output is written back into the table, the results are flattened. I've added the flattenResults: False config to my code, but this has not changed the outcome.
I would love to hear thoughts about how to resolve this issue using my existing plan, other methods of deduping, or other methods of handling delta updates altogether.
Perhaps, in the outer statement, you can use
SELECT * EXCEPT (max_record_insert_time)
This should keep the exact record structure (for more detailed documentation, see https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#select-except).
An alternative approach would be to include in {fields} only the top-level columns, even when they are not leaves, i.e. just record_name and not record_name.*.
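A hedged sketch of building that {fields} list from the live table schema with the google-cloud-bigquery client and plugging it into the QUERY template from the question (the project, dataset and table names below are placeholders):
from google.cloud import bigquery

client = bigquery.Client()

# keep only the top-level column names and drop the helper column
# added by previous dedup runs
table = client.get_table("my-project.my_dataset.my_table")  # hypothetical table id
fields = ", ".join(
    field.name
    for field in table.schema
    if field.name != "max_record_insert_time"
)

job = client.query(QUERY.format(
    fields=fields,
    client_name="my_client",  # hypothetical template values
    environment="prod",
    table="my_table",
))
job.result()  # wait for the dedup SELECT to finish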
The answer below is definitely not better than the straightforward SELECT * EXCEPT modifier, but I wanted to present an alternative version:
SELECT t.*
FROM (
  SELECT
    id,
    MAX(record_insert_time) AS max_record_insert_time,
    ARRAY_AGG(t) AS all_records_for_id
  FROM yourTable AS t
  GROUP BY id
), UNNEST(all_records_for_id) AS t
WHERE t.record_insert_time = max_record_insert_time
ORDER BY id
What the above query does is: it first groups all records for each id into an array of the respective rows, along with the max insert_time for that id. Then, for each id, it flattens the (previously aggregated) rows and keeps only those whose insert_time matches that max. The result is as expected. No analytic function is involved, just simple aggregation, at the cost of an extra UNNEST ...
Still, at least it is a different option :o)

Python, SQL: How to update multiple rows and columns in a single trip around the database?

Hello StackEx community.
I am implementing a relational database using SQLite interfaced with Python. My table consists of 5 attributes with around a million tuples.
To avoid a large number of database queries, I wish to execute a single query that updates 2 attributes of multiple tuples. These updated values depend on the tuples' Primary Key value and so are different for each tuple.
I am trying something like the following in Python 2.7:
stmt= 'UPDATE Users SET Userid (?,?), Neighbours (?,?) WHERE Username IN (?,?)'
cursor.execute(stmt, [(_id1, _Ngbr1, _name1), (_id2, _Ngbr2, _name2)])
In other words, I am trying to update the rows that have Primary Keys _name1 and _name2 by substituting the Neighbours and Userid columns with corresponding values. The execution of the two statements returns the following error:
OperationalError: near "(": syntax error
I am reluctant to use executemany() because I want to reduce the number of trips across the database.
I have been struggling with this issue for a couple of hours now, but I couldn't figure out either the error or an alternative on the web. Please help.
Thanks in advance.
If the column that is used to look up the row to update is properly indexed, then executing multiple UPDATE statements is likely to be more efficient than a single statement, because in the latter case the database would probably need to scan all rows.
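For what it's worth, SQLite is an embedded database, so there is no network round trip per statement; the per-row path described above can be written with executemany() against a single prepared UPDATE. A minimal sketch with hypothetical values:
import sqlite3

conn = sqlite3.connect("users.db")  # hypothetical database file
rows = [
    (101, "n1,n2", "alice"),  # hypothetical (Userid, Neighbours, Username) triples
    (102, "n3", "bob"),
]
# one prepared UPDATE, bound once per row; with an index on Username each
# execution is an indexed lookup rather than a full-table scan
conn.executemany(
    "UPDATE Users SET Userid = ?, Neighbours = ? WHERE Username = ?",
    rows,
)
conn.commit()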
Anyway, if you really want to do this, you can use CASE expressions (and explicitly numbered parameters, to avoid duplicates):
UPDATE Users
SET Userid = CASE Username
                 WHEN ?5 THEN ?1
                 WHEN ?6 THEN ?2
             END,
    Neighbours = CASE Username
                     WHEN ?5 THEN ?3
                     WHEN ?6 THEN ?4
                 END
WHERE Username IN (?5, ?6);
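For completeness, a sketch of running that statement from Python's sqlite3 module; SQLite's numbered ?N placeholders bind by position, so only six values are supplied even though some placeholders appear more than once (all values below are hypothetical):
import sqlite3

conn = sqlite3.connect("users.db")  # hypothetical database file
stmt = """
    UPDATE Users
    SET Userid = CASE Username WHEN ?5 THEN ?1 WHEN ?6 THEN ?2 END,
        Neighbours = CASE Username WHEN ?5 THEN ?3 WHEN ?6 THEN ?4 END
    WHERE Username IN (?5, ?6)
"""
# ?1..?6 map to the six values below, in order
conn.execute(stmt, (101, 102, "n1,n2", "n3", "alice", "bob"))
conn.commit()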

How to filter by a joinedloaded table in SQLAlchemy?

Let's say I have 2 models, Document and Person. Document has a relationship to Person via the "owner" property. Now:
session.query(Document)\
    .options(joinedload('owner'))\
    .filter(Person.is_deleted != True)
This double-joins the Person table: one Person table is used for the eager load, and the duplicate is the one being filtered, which is not exactly what I want, because this way the Document rows are not filtered.
What can I do to apply the filter to the joinedloaded table/model?
You are right, the Person table will be used twice in the resulting SQL, but each occurrence serves a different purpose:
one is to apply the filter condition: filter(Person.is_deleted != True)
the other is to eager load the relationship: options(joinedload('owner'))
But the reason your query returns wrong results is that the filter condition is not complete. To make it produce the right results, you also need to JOIN the two models:
qry = (session.query(Document).
       join(Document.owner).  # THIS IS IMPORTANT
       options(joinedload(Document.owner)).
       filter(Person.is_deleted != True)
       )
This will return the correct rows, even though it will still have 2 references (JOINs) to the Person table. The real solution to your query is to use contains_eager instead of joinedload:
qry = (session.query(Document).
       join(Document.owner).  # THIS IS STILL IMPORTANT
       options(contains_eager(Document.owner)).
       filter(Person.is_deleted != True)
       )

SQLAlchemy filter query by related object

Using SQLAlchemy, I have a one-to-many relation with two tables - users and scores. I am trying to query the top 10 users sorted by their aggregate score over the past X days.
users:
id
user_name
score
scores:
user
score_amount
created
My current query is:
top_users = DBSession.query(User).options(eagerload('scores')).filter_by(User.scores.created > somedate).order_by(func.sum(User.scores).desc()).all()
I know this is clearly not correct, it's just my best guess. However, after looking at the documentation and googling I cannot find an answer.
EDIT:
Perhaps it would help if I sketched what the MySQL query would look like:
SELECT user.*, SUM(scores.amount) as score_increase
FROM user LEFT JOIN scores ON scores.user_id = user.user_id
WHERE scores.created_at > someday
ORDER BY score_increase DESC
The single-joined-row way, with a group_by added in for all user columns although MySQL will let you group on just the "id" column if you choose:
sess.query(User, func.sum(Score.amount).label('score_increase')).\
    join(User.scores).\
    filter(Score.created_at > someday).\
    group_by(User).\
    order_by("score_increase desc")
Or if you just want the users in the result:
sess.query(User).\
    join(User.scores).\
    filter(Score.created_at > someday).\
    group_by(User).\
    order_by(func.sum(Score.amount))
The above two have an inefficiency in that you're grouping on all columns of "user" (or you're using MySQL's "group on only a few columns" thing, which is MySQL only). To minimize that, the subquery approach:
subq = sess.query(Score.user_id, func.sum(Score.amount).label('score_increase')).\
    filter(Score.created_at > someday).\
    group_by(Score.user_id).subquery()

sess.query(User).\
    join((subq, subq.c.user_id == User.user_id)).\
    order_by(subq.c.score_increase)
An example of the identical scenario is in the ORM tutorial at: http://docs.sqlalchemy.org/en/latest/orm/tutorial.html#selecting-entities-from-subqueries
You will need to use a subquery in order to compute the aggregate score for each user. Subqueries are described here: http://www.sqlalchemy.org/docs/05/ormtutorial.html?highlight=subquery#using-subqueries
I am assuming the column (not the relation) you're using for the join is called Score.user_id, so change it if this is not the case.
You will need to do something like this:
DBSession.query(Score.user_id, func.sum(Score.score_amount).label('total_score')).group_by(Score.user_id).filter(Score.created > somedate).order_by('total_score DESC')[:10]
However this will result in tuples of (user_id, total_score). I'm not sure if the computed score is actually important to you, but if it is, you will probably want to do something like this:
users_scores = []
q = DBSession.query(Score.user_id, func.sum(Score.score_amount).label('total_score')).group_by(Score.user_id).filter(Score.created > somedate).order_by('total_score DESC')[:10]
for user_id, total_score in q:
    user = DBSession.query(User).get(user_id)
    users_scores.append((user, total_score))
This will result in 11 queries being executed, however. It is possible to do it all in a single query, but due to various limitations in SQLAlchemy, it will likely create a very ugly multi-join query or subquery (depending on the engine) and it won't be very performant.
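For reference, a hedged sketch of that single-query form: join the aggregate subquery back to User so each result row is a (User, total_score) tuple (column names follow the assumptions above):
from sqlalchemy import func

subq = (
    DBSession.query(
        Score.user_id,
        func.sum(Score.score_amount).label('total_score'),
    )
    .filter(Score.created > somedate)
    .group_by(Score.user_id)
    .subquery()
)

# single round trip: each row comes back as a (User, total_score) pair
top_users = (
    DBSession.query(User, subq.c.total_score)
    .join(subq, subq.c.user_id == User.id)
    .order_by(subq.c.total_score.desc())
    .limit(10)
    .all()
)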
If you plan on doing something like this often and you have a large number of scores, consider denormalizing the current score onto the user table. It's more work to maintain, but will result in a single non-join query like:
DBSession.query(User).order_by(User.computed_score.desc())
Hope that helps.
