Trying to move some queries I run daily into an automated script. I have one in Postgres like the one below:
SELECT (regexp_split_to_array(col1, '|'))[1] AS item, COUNT(*) AS itemcount FROM table1 GROUP BY item ORDER BY itemcount
In SqlAlchemy I have this:
session.query((func.regexp_split_to_array(model.table1.col1, "|")[1]).label("item"), func.count().label("itemcount")).group_by("item").order_by("itemcount")
Python can't "get_item" the result since it isn't actually a collection. I've looked through the docs and can't seem to find anything that would let me do this without running raw SQL through execute (which I can do and which works, but I was looking for a proper solution for next time).
SQLAlchemy does support indexing with [...]. If you declare the type of the column to be postgresql.ARRAY, it works:
from sqlalchemy import Column, MetaData, String, Table
from sqlalchemy.dialects import postgresql

meta = MetaData()
table2 = Table("table2", meta, Column("col1", postgresql.ARRAY(String)))
q = session.query(table2.c.col1[1])
print(q.statement.compile(dialect=postgresql.dialect()))
# SELECT table2.col1[%(col1_1)s] AS anon_1
# FROM table2
The reason why your code doesn't work is that SQLAlchemy does not know that func.regexp_split_to_array(...) returns an array, since func.foo produces a generic function for convenience. To make it work, we need to make sure SQLAlchemy knows the return type of the function, by specifying the type_ parameter:
q = session.query(func.regexp_split_to_array(table1.c.col1, "|", type_=postgresql.ARRAY(String))[1].label("item"))
print(q.statement.compile(dialect=postgresql.dialect()))
# SELECT (regexp_split_to_array(table1.col1, %(regexp_split_to_array_1)s))[%(regexp_split_to_array_2)s] AS item
# FROM table1
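Putting it together with the aggregation from the original query (a sketch reusing table1, session, and the imports from above; the compiled SQL in the comment is approximate):
q = (
    session.query(
        func.regexp_split_to_array(
            table1.c.col1, "|", type_=postgresql.ARRAY(String)
        )[1].label("item"),
        func.count().label("itemcount"),
    )
    .group_by("item")
    .order_by("itemcount")
)
# Should compile to roughly:
# SELECT (regexp_split_to_array(table1.col1, ...))[...] AS item, count(*) AS itemcount
# FROM table1 GROUP BY item ORDER BY itemcount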
I have this query:
SELECT COUNT(DISTINCT Serial, DatumOrig, Glucose) FROM values;
I've tried to recreate it with SQLAlchemy this way:
session.query(Value.Serial, Value.DatumOrig, Value.Glucose).distinct().count()
But this translates to this:
SELECT count(*) AS count_1
FROM (SELECT DISTINCT values.`Serial` AS `values_Serial`,
                      values.`DatumOrig` AS `values_DatumOrig`,
                      values.`Glucose` AS `values_Glucose`
      FROM values) AS anon_1
Which does not call the count function inline but wraps the select distinct into a subquery.
My question is: What are the different ways with SQLAlchemy to count a distinct select on multiple columns and what are they translating into?
Is there any solution which would translate into my original query? Is there any serious difference in performance or memory usage?
First off, I think that COUNT(DISTINCT) supporting more than one expression is a MySQL extension. You can achieve something similar in, for example, PostgreSQL with ROW values, but the behaviour is not the same with regard to NULL: in MySQL, if any of the value expressions evaluates to NULL, the row does not qualify. That is also what differs between the two queries in the question:
If any of Serial, DatumOrig, or Glucose is NULL in the COUNT(DISTINCT) query, that row does not qualify, or in other words is not counted.
COUNT(*) is the cardinality of the subquery anon_1, or in other words the count of rows. SELECT DISTINCT Serial, DatumOrig, Glucose will include (distinct) rows with NULL.
Looking at EXPLAIN output for the 2 queries it looks like the subquery causes MySQL to use a temporary table. That will likely cause a performance difference, especially if it is materialized on disk.
Producing the multi-valued COUNT(DISTINCT) query in SQLAlchemy is a bit tricky, because count() is a generic function implemented close to the SQL standard: it accepts only a single expression as its (optional) positional argument, and the same goes for distinct(). If all else fails, you can always fall back to text() fragments, like in this case:
# NOTE: text() fragments are included in the query as is, so if the text
# originates from an untrusted source, the query cannot be trusted.
session.query(func.count(distinct(text("`Serial`, `DatumOrig`, `Glucose`")))).\
    select_from(Value).\
    scalar()
which is far from readable and maintainable code, but gets the job done now. Another option is to write a custom construct that implements the MySQL extension, or rewrite the query as you have attempted. One way to form a custom construct that produces the required SQL would be:
from itertools import count
from sqlalchemy import func, distinct as _distinct

def _comma_list(exprs):
    # NOTE: Magic number alert, the precedence value must be large enough to
    # avoid producing parentheses around the "comma list" when passed to
    # distinct()
    ps = count(10 + len(exprs), -1)
    exprs = iter(exprs)
    cl = next(exprs)
    for p, e in zip(ps, exprs):
        cl = cl.op(',', precedence=p)(e)
    return cl

def distinct(*exprs):
    return _distinct(_comma_list(exprs))

session.query(func.count(distinct(
    Value.Serial, Value.DatumOrig, Value.Glucose))).scalar()
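As a quick sanity check you can compile the statement against the MySQL dialect (assuming the Value model from the question; the rendered SQL in the comment is approximate):
from sqlalchemy.dialects import mysql

q = session.query(func.count(distinct(
    Value.Serial, Value.DatumOrig, Value.Glucose)))
print(q.statement.compile(dialect=mysql.dialect()))
# Expected shape, roughly:
# SELECT count(DISTINCT values.Serial, values.DatumOrig, values.Glucose) AS count_1
# FROM values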
I am trying to join a second table (PageLikes) to a first table (PageVisits) after selecting only distinct values of one column of the first table, using the Python ORM peewee.
In pure SQL I can do this:
SELECT DISTINCT(pagevisits.visitor_id), pagelikes.liked_item FROM pagevisits
INNER JOIN pagelikes on pagevisits.visitor_id = pagelikes.user_id
In peewee with Python I have tried:
query = (Pagevisits
         .select(fn.Distinct(Pagevisits.visitor_id),
                 PageLikes.liked_item)
         .join(PageLikes))
This gives me an error:
distinct() takes 1 positional argument but 2 were given
The only way I have gotten distinct to work with peewee is like this:
query = (Pagevisits
         .select(Pagevisits.visitor_id,
                 PageLikes.liked_item)
         .distinct())
which does not seem to work for my scenario.
So how can I select only distinct values in one table based on one column before I join another table with peewee?
I don't believe you should be encountering an error using fn.DISTINCT() in that way. I'm curious to see the full traceback. In my testing locally, I have no problems running something like:
query = (PageVisits
.select(fn.DISTINCT(PageVisits.visitor_id), PageLikes.liked_item)
.join(PageLikes))
Which produces SQL equivalent to what you're after. I'm using the latest peewee code btw.
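To see exactly what peewee will send to the database, Query.sql() returns the SQL string and its parameters (a quick check, assuming the models above are bound to a database; the output shown is approximate):
sql, params = query.sql()
print(sql)
# Expected shape, roughly:
# SELECT DISTINCT("t1"."visitor_id"), "t2"."liked_item"
# FROM "pagevisits" AS "t1" INNER JOIN "pagelikes" AS "t2" ON (...)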
As Papooch suggested, calling distinct() on the field seems to work:
distinct_visitors = (Pagevisits
                     .select(Pagevisits.visitor_id.distinct().alias("visitor"))
                     .where(Pagevisits.page_id == "Some specific page")
                     .alias("distinct_visitors"))

query = (Pagelikes
         .select(fn.Count(Pagelikes.liked_item))
         .join(distinct_visitors,
               on=(distinct_visitors.c.visitor == Pagelikes.user_id))
         .group_by(Pagelikes.liked_item))
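A hypothetical usage sketch (.dicts() is peewee's dict-row accessor; the models are as assumed above):
# One row per liked_item group, carrying its count
for row in query.dicts():
    print(row)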
I'm using SQLAlchemy's table declarations to avoid writing strings and manually generating SQL statements. This has worked quite well, except for the following use case, where I'm trying to return a statement that creates a merged table with all columns from both tables, joined on an implicit PK/FK relationship that exists between them.
It seems like the statement below only selects columns from the first (Results) table, and I'm not sure how to generate the full select statement?
sql = Query(Results, Details).join(Details)\
.filter(Details.result_type == 'standard')\
.statement
The Query construct isn't the best fit for this use case, as it will return the mapped objects themselves. Instead, this can be done with the select construct as follows:
sql = select([Results, Details])\
.select_from(Results.join(Details))
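If you still need the result_type filter from the original query, it can be carried over with .where() (a sketch assuming Results and Details are Table objects, hence the .c accessor; with ORM models it would be Details.result_type):
sql = select([Results, Details])\
    .select_from(Results.join(Details))\
    .where(Details.c.result_type == 'standard')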
I have the query:
q = Session.query(func.array_agg(Order.col))
The compiled query will be:
SELECT array_agg(order.col) FROM orders
I want to dynamically replace the existing column, so that after the replacement the query becomes:
SELECT group_concat(orders.col) FROM orders
I have to use Session and the model; I can't drop down to SQLAlchemy Core, and I can't use subqueries. And, of course, there can be other columns, but I need to replace only this one. I tried replacing objects in the column_descriptions property, and I tried q.selectable.replace (or something like that, sorry, I don't remember the exact names), but I didn't get the right result.
The right method is with_entities(), which replaces the SELECT list of an existing query and returns a new query:
q = Session.query(func.array_agg(Order.col))
q = q.with_entities(func.group_concat(Order.col))
# SELECT group_concat(orders.col) FROM orders
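Since with_entities() replaces the whole SELECT list, any columns you want to keep have to be repeated (a hypothetical sketch; Order.other stands in for whatever extra columns exist):
q = Session.query(func.array_agg(Order.col), Order.other)
# Keep the other column, swap only the aggregate:
q = q.with_entities(func.group_concat(Order.col), Order.other)
# Roughly: SELECT group_concat(orders.col) AS group_concat_1, orders.other FROM orders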
I am using Python with SQLite 3. I have user entered SQL queries and need to format the results of those for a template language.
So, basically, I need to use .description of the DB API cursor (PEP 249), but I need to get both the column names and the table names, since the users often do joins.
The obvious answer, i.e. to read the table definitions, is not possible -- many of the tables have the same column names.
I also need some intelligent behaviour on the column/table names for aggregate functions like avg(field)...
The only solution I can come up with is to use an SQL parser and analyse the SELECT statement (sigh), but I haven't found any SQL parser for Python that seems really good?
I haven't found anything in the documentation or anyone else with the same problem, so I might have missed something obvious?
Edit: To be clear -- the problem is to find the result of an SQL select, where the select statement is supplied by a user in a user interface. I have no control of it. As I noted above, it doesn't help to read the table definitions.
Python's DB API only specifies column names for the cursor.description (and none of the RDBMS implementations of this API will return table names for queries...I'll show you why).
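To make that concrete, here is a minimal sqlite3 demo (the table and column names are made up): each entry in description is a 7-tuple, and in sqlite3 everything but the name is None.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
cur = conn.execute("SELECT a, b, 1 AS c FROM t")
print([d[0] for d in cur.description])  # ['a', 'b', 'c']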
What you're asking for is very hard, and only even approachable with an SQL parser...and even then there are many situations where even the concept of which "Table" a column is from may not make much sense.
Consider these SQL statements:
Which table is today from?
SELECT DATE('now') AS today FROM TableA FULL JOIN TableB
ON TableA.col1 = TableB.col1;
Which table is myConst from?
SELECT 1 AS myConst;
Which table is myCalc from?
SELECT a+b AS myCalc FROM (select t1.col1 AS a, t2.col2 AS b
FROM table1 AS t1
LEFT OUTER JOIN table2 AS t2 on t1.col2 = t2.col2);
Which table is myCol from?
SELECT SUM(a) as myCol FROM (SELECT a FROM table1 UNION SELECT b FROM table2);
The above were very simple SQL statements for which you either have to make up a "table", or arbitrarily pick one...even if you had an SQL parser!
What SQL gives you is a set of data back as results. The elements in this set can not necessarily be attributed to specific database tables. You probably need to rethink your approach to this problem.
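If all you need is unambiguous names for the template layer, one pragmatic fallback (my suggestion, not something the DB API offers) is to have users alias ambiguous columns, since the alias is exactly what cursor.description reports:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE t1 (id INTEGER, name TEXT);"
    "CREATE TABLE t2 (id INTEGER, name TEXT);"
)
cur = conn.execute(
    "SELECT t1.name AS t1_name, t2.name AS t2_name "
    "FROM t1 JOIN t2 ON t1.id = t2.id"
)
print([d[0] for d in cur.description])  # ['t1_name', 't2_name']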