I have a working PostgreSQL statement that joins the question table to itself in order to extract, for each value of the foo column, only the rows with the most recent created_date:
SELECT q.*
FROM question q
INNER JOIN
(SELECT foo, MAX(created_date) AS most_current
FROM question
GROUP BY foo) grouped_q
ON q.foo = grouped_q.foo
AND q.created_date = grouped_q.most_current
I'm interested in whether and how this could be converted to SQLAlchemy, so that I could take advantage of the resulting rows as ORM objects.
And, if known, what kind of performance differences there might be?
FWIW, I'm able to hydrate SQLAlchemy ORM objects by looping through the rows and doing the following...
for row in res.fetchall():
    question = Question(**dict(zip(res.keys(), row)))
but this feels kludgy, and I'd like to keep the query in SQLAlchemy syntax if possible.
Thanks in advance.
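For reference, here is one way the self-join might be expressed with the SQLAlchemy query API (a sketch, assuming a declarative Question model with foo and created_date columns and an active session):

from sqlalchemy import func

# Subquery computing the latest created_date per foo
grouped_q = (session.query(Question.foo,
                           func.max(Question.created_date).label("most_current"))
             .group_by(Question.foo)
             .subquery())

# Join back against the subquery; each result is a fully hydrated
# Question ORM object, no manual zipping of keys required.
questions = (session.query(Question)
             .join(grouped_q,
                   (Question.foo == grouped_q.c.foo) &
                   (Question.created_date == grouped_q.c.most_current))
             .all())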
I have this query:
SELECT COUNT(DISTINCT Serial, DatumOrig, Glucose) FROM values;
I've tried to recreate it with SQLAlchemy this way:
session.query(Value.Serial, Value.DatumOrig, Value.Glucose).distinct().count()
But this translates to this:
SELECT count(*) AS count_1
FROM (SELECT DISTINCT
values.`Serial` AS `values_Serial`,
values.`DatumOrig` AS `values_DatumOrig`,
values.`Glucose` AS `values_Glucose`
FROM values)
AS anon_1
This does not call the count function inline but instead wraps the SELECT DISTINCT in a subquery.
My question is: What are the different ways with SQLAlchemy to count a distinct select on multiple columns and what are they translating into?
Is there any solution which would translate into my original query? Is there any serious difference in performance or memory usage?
First off, I think that COUNT(DISTINCT) supporting more than one expression is a MySQL extension. You can achieve something similar in, for example, PostgreSQL with ROW values, but the behaviour regarding NULL is not the same: in MySQL, if any of the value expressions evaluates to NULL, the row does not qualify. That also explains the difference between the two queries in the question:
If any of Serial, DatumOrig, or Glucose is NULL in the COUNT(DISTINCT) query, that row does not qualify, or in other words is not counted.
COUNT(*) is the cardinality of the subquery anon_1, or in other words its number of rows. SELECT DISTINCT Serial, DatumOrig, Glucose will include (distinct) rows containing NULL. For example, given two identical rows (1, NULL, NULL), the subquery form returns 1, while COUNT(DISTINCT Serial, DatumOrig, Glucose) returns 0.
Looking at the EXPLAIN output for the two queries, it looks like the subquery causes MySQL to use a temporary table. That will likely cause a performance difference, especially if it is materialized on disk.
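For example, one way to check this from SQLAlchemy (a sketch; the exact plan columns vary by MySQL version, but "Using temporary" in the Extra column is the telltale sign):

from sqlalchemy import text

plan = session.execute(text(
    "EXPLAIN SELECT COUNT(*) FROM "
    "(SELECT DISTINCT Serial, DatumOrig, Glucose FROM `values`) AS anon_1"))
# Look for "Using temporary" in the Extra column of the plan rows
for row in plan:
    print(row)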
Producing the multi-valued COUNT(DISTINCT) query in SQLAlchemy is a bit tricky, because count() is a generic function and implemented close to the SQL standard: it accepts only a single expression as its (optional) positional argument, and the same goes for distinct(). If all else fails, you can always fall back to text() fragments, like in this case:
from sqlalchemy import distinct, func, text

# NOTE: text() fragments are included in the query as is, so if the text originates
# from an untrusted source, the query cannot be trusted.
session.query(func.count(distinct(text("`Serial`, `DatumOrig`, `Glucose`")))).\
    select_from(Value).\
    scalar()
which is far from readable and maintainable code, but gets the job done for now. Another option is to write a custom construct that implements the MySQL extension, or to rewrite the query as you have attempted. One way to build a custom construct that produces the required SQL is:
from itertools import count
from sqlalchemy import func, distinct as _distinct

def _comma_list(exprs):
    # NOTE: Magic number alert, the precedence value must be large enough to
    # avoid producing parentheses around the "comma list" when passed to
    # distinct()
    ps = count(10 + len(exprs), -1)
    exprs = iter(exprs)
    cl = next(exprs)
    for p, e in zip(ps, exprs):
        cl = cl.op(',', precedence=p)(e)
    return cl

def distinct(*exprs):
    return _distinct(_comma_list(exprs))

session.query(func.count(distinct(
    Value.Serial, Value.DatumOrig, Value.Glucose))).scalar()
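To sanity-check the construct, you can compile the expression against the MySQL dialect and inspect the SQL it renders (a sketch; the exact quoting depends on your table setup):

from sqlalchemy.dialects import mysql

expr = func.count(distinct(Value.Serial, Value.DatumOrig, Value.Glucose))
# should render roughly: count(DISTINCT `Serial`, `DatumOrig`, `Glucose`)
print(expr.compile(dialect=mysql.dialect()))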
I have two tables, Table A and Table B. I have added one column to Table A, record_id. Table B has record_id and the primary ID for Table A, table_a_id. I am looking to deprecate Table B.
Relationships exist between Table B's table_a_id and Table A's id, if that helps.
Currently, my solution is:
db.execute("UPDATE table_a t
SET record_id = b.record_id
FROM table_b b
WHERE t.id = b.table_a_id")
This is my first time using this ORM -- I'd like to see if there is a way I can use my Python models and the functions SQLAlchemy gives me to be more 'Pythonic', rather than just dumping a Postgres statement that I know works into an execute() call.
My solution ended up being as follows:
(db.query(TableA)
   .filter(TableA.id == TableB.table_a_id,
           TableA.record_id.is_(None))
   .update({TableA.record_id: TableB.record_id},
           synchronize_session=False))
This leverages PostgreSQL's ability to do updates based on implicit references to other tables, which I used in the .filter() call (analogous to a WHERE clause in a JOIN query). The solution was deceptively simple.
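For reference, the ORM call above should emit SQL along these lines (a sketch of the expected statement, not verbatim output):

UPDATE table_a SET record_id = table_b.record_id
FROM table_b
WHERE table_a.id = table_b.table_a_id AND table_a.record_id IS NULL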
In this example, I am using the sample MySQL classicmodels database.
So I have two queries:
products = session.query(Products)
orderdetails = session.query(OrderDetails)
Let's assume I cannot make any more queries to the database after this and I can only join these two queries from this point on.
I want to do an outer join on them to be able to do something like this:
for orderdetail, product in query:
    print product.productName, product.productCode, orderdetail.quantityOrdered
However, whenever I do an outerjoin on this, I can only seem to get a left join.
query = orderdetails.outerjoin(Products)
Code like this yields only orderdetails columns:
for q in query:
    # Only gives orderdetails columns
    print q
And doing something like this:
for orderdetails, product in query:
    print orderdetails, product
Gives me an error: TypeError: 'OrderDetails' object is not iterable.
What am I doing wrong? I just want columns from the Products table as well.
EDIT:
I have found my solution thanks to @univerio's answer. My real goal was to do a join on two existing queries and then do a SUM and COUNT operation on them.
SQLAlchemy basically just transforms a query object into a SQL statement, and with_entities() simply changes the SELECT expression to whatever you pass to it. This is my updated solution, which includes unpacking and reading the join:
for productCode, numOrders, quantityOrdered in orderdetails.with_entities(
        OrderDetails.productCode,
        func.count(OrderDetails.productCode),
        func.sum(OrderDetails.quantityOrdered)).group_by(OrderDetails.productCode):
    print productCode, numOrders, quantityOrdered
You can overwrite the entity list with with_entities():
orderdetails.outerjoin(Products).with_entities(OrderDetails, Products)
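With both entities selected, iterating the query yields (OrderDetails, Products) tuples, so the unpacking loop from the question works (a sketch; product can be None for order details with no matching product in an outer join):

query = orderdetails.outerjoin(Products).with_entities(OrderDetails, Products)

for orderdetail, product in query:
    if product is not None:
        print product.productName, product.productCode, orderdetail.quantityOrdered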
I'm new to sqlalchemy and am wondering how to do a union of two tables that have the same columns. I'm doing the following:
table1_and_table2 = sql.union_all(self.tables['table1'].alias("table1_subquery").select(),
                                  self.tables['table2'].alias("table2_subquery").select())
I'm seeing this error:
OperationalError: (OperationalError) (1248, 'Every derived table must have its own alias')
(Note that self.tables['table1'] returns a sqlalchemy Table with name table1.)
Can someone point out the error or suggest a better way to combine the rows from both tables?
First, can you output the generated SQL that creates the problem? You should be able to do this by setting echo=True in your create_engine() call.
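For instance (a sketch; the connection URL here is a made-up example):

from sqlalchemy import create_engine

# echo=True logs every generated SQL statement to stdout
engine = create_engine("mysql://user:password@localhost/mydb", echo=True)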
Second, and this is just a hunch, try rearranging your subqueries to this:
table1_and_table2 = sql.union_all(self.tables['table1'].select().alias("table1_subquery"),
                                  self.tables['table2'].select().alias("table2_subquery"))
If my hunch is right, the original creates the aliases first, then runs a query over them, and the resulting derived tables are re-aliased and clash.
I am using Python with SQLite 3. I have user entered SQL queries and need to format the results of those for a template language.
So, basically, I need to use .description of the DB API cursor (PEP 249), but I need to get both the column names and the table names, since the users often do joins.
The obvious answer, i.e. to read the table definitions, is not possible -- many of the tables have the same column names.
I also need some intelligent behaviour on the column/table names for aggregate functions like avg(field)...
The only solution I can come up with is to use an SQL parser and analyse the SELECT statement (sigh), but I haven't found any SQL parser for Python that seems really good?
I haven't found anything in the documentation or anyone else with the same problem, so I might have missed something obvious?
Edit: To be clear -- the problem is to find the result of an SQL select, where the select statement is supplied by a user in a user interface. I have no control of it. As I noted above, it doesn't help to read the table definitions.
Python's DB API only specifies column names for cursor.description (and none of the RDBMS implementations of this API return table names for queries... I'll show you why).
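You can see this with the sqlite3 module directly (a sketch):

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.execute("SELECT 1 AS a, 2 AS b")
# Each description entry is a 7-tuple; only the first item (the column
# name) is required to be filled in -- there is no table name anywhere.
print([d[0] for d in cur.description])  # ['a', 'b']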
What you're asking for is very hard, and only approachable with an SQL parser... and even then there are many situations where the concept of which "table" a column is from may not make much sense.
Consider these SQL statements:
Which table is today from?
SELECT DATE('now') AS today FROM TableA FULL JOIN TableB
ON TableA.col1 = TableB.col1;
Which table is myConst from?
SELECT 1 AS myConst;
Which table is myCalc from?
SELECT a + b AS myCalc FROM (SELECT t1.col1 AS a, t2.col2 AS b
    FROM table1 AS t1
    LEFT OUTER JOIN table2 AS t2 ON t1.col2 = t2.col2);
Which table is myCol from?
SELECT SUM(a) as myCol FROM (SELECT a FROM table1 UNION SELECT b FROM table2);
The above were very simple SQL statements for which you either have to make up a "table", or arbitrarily pick one...even if you had an SQL parser!
What SQL gives you back is a set of data as results. The elements of this set cannot necessarily be attributed to specific database tables. You probably need to rethink your approach to this problem.