I have this query:
SELECT COUNT(DISTINCT Serial, DatumOrig, Glucose) FROM values;
I've tried to recreate it with SQLAlchemy this way:
session.query(Value.Serial, Value.DatumOrig, Value.Glucose).distinct().count()
But this translates to this:
SELECT count(*) AS count_1
FROM (SELECT DISTINCT
values.`Serial` AS `values_Serial`,
values.`DatumOrig` AS `values_DatumOrig`,
values.`Glucose` AS `values_Glucose`
FROM values)
AS anon_1
Which does not call the count function inline but wraps the select distinct into a subquery.
My question is: What are the different ways with SQLAlchemy to count a distinct select on multiple columns and what are they translating into?
Is there any solution which would translate into my original query? Is there any serious difference in performance or memory usage?
First off, I think that COUNT(DISTINCT) supporting more than one expression is a MySQL extension. You can achieve something similar in, for example, PostgreSQL with ROW values, but the behaviour is not the same with regard to NULL: in MySQL, if any of the value expressions evaluates to NULL, the row does not qualify. That also explains the difference between the two queries in the question:
If any of Serial, DatumOrig, or Glucose is NULL in the COUNT(DISTINCT) query, that row does not qualify and is not counted.
COUNT(*) is the cardinality of the subquery anon_1, or in other words the count of rows. SELECT DISTINCT Serial, DatumOrig, Glucose will include (distinct) rows with NULL.
Looking at EXPLAIN output for the 2 queries it looks like the subquery causes MySQL to use a temporary table. That will likely cause a performance difference, especially if it is materialized on disk.
Producing the multi-valued COUNT(DISTINCT) query in SQLAlchemy is a bit tricky, because count() is a generic function, implemented close to the SQL standard: it accepts only a single expression as its (optional) positional argument, and the same goes for distinct(). If all else fails, you can always revert to text() fragments, like in this case:
from sqlalchemy import distinct, func, text

# NOTE: text() fragments are included in the query as is, so if the text originates
# from an untrusted source, the query cannot be trusted.
session.query(func.count(distinct(text("`Serial`, `DatumOrig`, `Glucose`")))).\
    select_from(Value).\
    scalar()
which is far from readable and maintainable code, but gets the job done for now. Another option is to write a custom construct that implements the MySQL extension, or to rewrite the query as you have attempted. One way to build a custom construct that produces the required SQL would be:
from itertools import count

from sqlalchemy import func, distinct as _distinct

def _comma_list(exprs):
    # NOTE: Magic number alert, the precedence value must be large enough to avoid
    # producing parentheses around the "comma list" when passed to distinct()
    ps = count(10 + len(exprs), -1)
    exprs = iter(exprs)
    cl = next(exprs)
    for p, e in zip(ps, exprs):
        cl = cl.op(',', precedence=p)(e)
    return cl

def distinct(*exprs):
    return _distinct(_comma_list(exprs))

session.query(func.count(distinct(
    Value.Serial, Value.DatumOrig, Value.Glucose))).scalar()
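To check what the construct renders without hitting the database, you can compile the expression against the MySQL dialect. A quick sanity check along these lines (the commented output is what I would expect, not captured from a run):

from sqlalchemy.dialects import mysql

expr = func.count(distinct(Value.Serial, Value.DatumOrig, Value.Glucose))
print(expr.compile(dialect=mysql.dialect()))
# should print something like:
# count(DISTINCT values.`Serial`, values.`DatumOrig`, values.`Glucose`)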
Trying to move some queries I run daily into an automated script. I have one in Postgres like the below:
SELECT (regexp_split_to_array(col1, '|'))[1] AS item, COUNT(*) AS itemcount FROM table1 GROUP BY item ORDER BY itemcount;
In SQLAlchemy I have this:
session.query(
    (func.regexp_split_to_array(model.table1.col1, "|")[1]).label("item"),
    func.count().label("itemcount"),
).group_by("item").order_by("itemcount")
Python can't "get_item" since it's not actually a collection. I've looked through the docs and can't seem to find something that would let me do this without running raw SQL using execute (which I can do and works, but was looking for a solution for next time).
SQLAlchemy does support indexing with [...]. If you declare the column you are accessing to be of type postgresql.ARRAY, it works:
from sqlalchemy import Column, MetaData, String, Table
from sqlalchemy.dialects import postgresql

meta = MetaData()
table2 = Table("table2", meta, Column("col1", postgresql.ARRAY(String)))
q = session.query(table2.c.col1[1])
print(q.statement.compile(dialect=postgresql.dialect()))
# SELECT table2.col1[%(col1_1)s] AS anon_1
# FROM table2
The reason your code doesn't work is that SQLAlchemy does not know that func.regexp_split_to_array(...) returns an array, since func.foo produces a generic function for convenience. To make it work, tell SQLAlchemy the return type of the function by passing the type_ parameter:
q = session.query(func.regexp_split_to_array(
    table1.c.col1, "|", type_=postgresql.ARRAY(String))[1].label("item"))
print(q.statement.compile(dialect=postgresql.dialect()))
# SELECT (regexp_split_to_array(table1.col1, %(regexp_split_to_array_1)s))[%(regexp_split_to_array_2)s] AS item
# FROM table1
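For completeness, here is a sketch of the full query from the question with that fix applied, assuming the same table1 and session as above (group_by and order_by refer to the labels by name):

item = func.regexp_split_to_array(
    table1.c.col1, "|", type_=postgresql.ARRAY(String))[1].label("item")
itemcount = func.count().label("itemcount")
q = session.query(item, itemcount).group_by("item").order_by("itemcount")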
I have a somewhat unusual scenario: in addition to my SQL parameters, I need to let the user / API define the table column name too. My problem with the params is that the query results in:
SELECT device_id, time, 's0' ...
instead of
SELECT device_id, time, s0 ...
Is there another way to do that through raw or would I need to escape the column by myself?
queryset = Measurement.objects.raw(
'''
SELECT device_id, time, %(sensor)s FROM measurements
WHERE device_id=%(device_id)s AND time >= to_timestamp(%(start)s) AND time <= to_timestamp(%(end)s)
ORDER BY time ASC;
''', {'device_id': device_id, 'sensor': sensor, 'start': start, 'end': end})
As with any potential for SQL injection, be careful.
But essentially this is a fairly common problem with a fairly safe solution. The problem, in general, is that query parameters are "the right way" to handle query values, but they're not designed for schema elements.
To dynamically include schema elements in your query, you generally have to resort to string concatenation. Which is exactly the thing we're all told not to do with SQL queries.
But the good news here is that you don't have to use the actual user input. This is because, while possible query values are infinite, the set of possible valid schema elements is quite finite. So you can validate the user's input against that set.
For example, consider the following process:
User inputs a value as a column name.
Code compares that value (raw string comparison) against a list of known possible values. (This list can be hard-coded, or can be dynamically fetched from the database schema.)
If no match is found, return an error.
If a match is found, use the matched known value directly in the SQL query.
So all you're ever using are the very strings you, as the programmer, put in the code. Which is the same as writing the SQL yourself anyway.
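A minimal sketch of that allow-list approach, based on the query from the question (ALLOWED_SENSORS and the helper name are made up for illustration):

ALLOWED_SENSORS = {'s0', 's1', 's2'}  # hard-coded, or fetched from the schema

def get_measurements(device_id, sensor, start, end):
    if sensor not in ALLOWED_SENSORS:
        raise ValueError('unknown sensor column: %r' % sensor)
    # `sensor` is now one of our own hard-coded strings, so interpolating it
    # into the SQL is safe; the values still travel as query parameters.
    return Measurement.objects.raw(
        '''
        SELECT device_id, time, {} FROM measurements
        WHERE device_id=%(device_id)s
          AND time >= to_timestamp(%(start)s) AND time <= to_timestamp(%(end)s)
        ORDER BY time ASC;
        '''.format(sensor),
        {'device_id': device_id, 'start': start, 'end': end})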
It doesn't look like you need raw() for the example query you posted. I think the following queryset is very similar.
# assuming start and end are datetimes (the raw query converted epoch
# seconds via to_timestamp)
measurements = Measurement.objects.filter(
    device_id=device_id,
    time__gte=start,
    time__lte=end,
).order_by('time')

for measurement in measurements:
    print(getattr(measurement, sensor))
If you need to optimise and avoid loading other fields, you can use values() or only().
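For example (hypothetical, reusing the validated sensor name from earlier):

measurements = (Measurement.objects
                .filter(device_id=device_id, time__gte=start, time__lte=end)
                .order_by('time')
                .values('device_id', 'time', sensor))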
In our system, we have 1000+ tables, each of which has a 'date' column containing a DateTime object. I want to get a list containing every date that exists across all of the tables. I'm sure there should be an easy way to do this, but I have very limited knowledge of either PostgreSQL or SQLAlchemy.
In postgresql, I can do a full join on two tables, but there doesn't seem to be a way to do a join on every table in a schema, for a single common field.
I then tried to solve this programmatically in Python with SQLAlchemy. For each table, I created a SELECT DISTINCT for the 'date' column, set that list of selects as the selects property of a CompoundSelect object, and executed it. As one might expect from an ugly brute-force query, it has been running for an hour or so now, and I am unsure whether it has broken silently somewhere and will never return.
Is there a clean and better way to do this?
You definitely want to do this on the server, not at the application level, due to the many round trips between application and server and likely duplication of data in intermediate results.
Since you need to process 1,000+ tables, you should use the system catalogs and dynamically query the tables. You need a function to do that efficiently:
CREATE FUNCTION get_all_dates() RETURNS SETOF date AS $$
DECLARE
    tbl name;
BEGIN
    FOR tbl IN SELECT 'public.' || tablename FROM pg_tables WHERE schemaname = 'public' LOOP
        RETURN QUERY EXECUTE 'SELECT DISTINCT date::date FROM ' || tbl;
    END LOOP;
END; $$ LANGUAGE plpgsql;
This will process all the tables in the public schema; change as required. If the tables are spread across multiple schemas, add your own logic for locating them, or make the schema name a parameter of the function, call it once per schema, and UNION the results.
Note that you may get duplicate dates from multiple tables. You can weed out these duplicates in the statement that calls the function:
SELECT DISTINCT * FROM get_all_dates() ORDER BY 1;
The function builds its result set in memory, but if the number of distinct dates across the 1,000+ tables is very large, the results will spill to disk. If you expect this to happen, you are probably better off creating a temporary table at the beginning of the function and inserting the dates into that temp table.
Ended up reverting to a previous solution of using SQLAlchemy to run the queries. This allowed me to parallelize things and run a little faster, since it really was a very large query.
I knew a few things about the dataset that helped with this query: I only wanted distinct dates from each table, and the dates were the PK in my set. I ended up using the approach from this wiki page. The code sent in the query looked like the following:
WITH RECURSIVE t AS (
    (SELECT date FROM schema.tablename ORDER BY date LIMIT 1)
    UNION ALL
    SELECT (SELECT date FROM schema.tablename WHERE date > t.date ORDER BY date LIMIT 1)
    FROM t WHERE t.date IS NOT NULL)
SELECT date FROM t WHERE date IS NOT NULL;
I pulled the results of that query into a list, adding each date that wasn't already present, then saved that for use later. It's possible that it takes just as long as running it all in the psql console, but it was easier for me to save the results locally than to have to query a temp table in the db.
I have a general ledger table in my DB with the columns: member_id, is_credit and amount. I want to get the current balance of the member.
Ideally that could be done with two queries, where the first query has is_credit == True and the second is_credit == False, something close to:
credit_amount = session.query(func.sum(Funds.amount).label('Credit_Amount')).filter(Funds.member_id==member_id, Funds.is_credit==True)
debit_amount = session.query(func.sum(Funds.amount).label('Debit_Amount')).filter(Funds.member_id==member_id, Funds.is_credit==False)
balance = credit_amount - debit_amount
and then subtracting the results. Is there a way to run the above as one query that gives the balance?
From the comments you state that hybrids are too advanced right now, so I will propose an easier, though less efficient, solution (still, it's okay):
(session.query(Funds.is_credit, func.sum(Funds.amount).label('Debit_Amount')).
    filter(Funds.member_id == member_id).group_by(Funds.is_credit))
What will this do? You will receive a two-row result: one row holds the credit sum, the other the debit sum, distinguished by the is_credit column; the second column (Debit_Amount) holds the value. You then evaluate the two rows to get the result, with only one query fetching both values.
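Evaluating the two rows could then look something like this (a sketch; the dict trick assumes is_credit only ever holds True or False):

rows = dict(
    session.query(Funds.is_credit, func.sum(Funds.amount))
    .filter(Funds.member_id == member_id)
    .group_by(Funds.is_credit)
    .all()
)
balance = rows.get(True, 0) - rows.get(False, 0)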
If you are unsure what group_by does, I recommend you read up on SQL before doing it in SQLAlchemy. SQLAlchemy offers very easy usage of SQL but it requires that you understand SQL as well. Thus, I recommend: First build a query in SQL and see that it does what you want - then translate it to SQLAlchemy and see that it does the same. Otherwise SQLAlchemy will often generate highly inefficient queries, because you asked for the wrong thing.
I am using Python with SQLite 3. I have user entered SQL queries and need to format the results of those for a template language.
So, basically, I need to use .description of the DB API cursor (PEP 249), but I need to get both the column names and the table names, since the users often do joins.
The obvious answer, i.e. to read the table definitions, is not possible -- many of the tables have the same column names.
I also need some intelligent behaviour on the column/table names for aggregate functions like avg(field)...
The only solution I can come up with is to use an SQL parser and analyse the SELECT statement (sigh), but I haven't found any SQL parser for Python that seems really good.
I haven't found anything in the documentation or anyone else with the same problem, so I might have missed something obvious?
Edit: To be clear -- the problem is to find the result of an SQL select, where the select statement is supplied by a user in a user interface. I have no control of it. As I noted above, it doesn't help to read the table definitions.
Python's DB API only specifies column names for cursor.description (and no RDBMS implementation of this API will return table names for queries... I'll show you why).
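To see this concretely, here is a throwaway sqlite3 session (not from the question) showing that description carries only the output column names:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (a INTEGER, b INTEGER)')
cur = conn.execute('SELECT a, a + b AS mycalc FROM t')
print([d[0] for d in cur.description])  # ['a', 'mycalc']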
What you're asking for is very hard, and only even approachable with an SQL parser...and even then there are many situations where even the concept of which "Table" a column is from may not make much sense.
Consider these SQL statements:
Which table is today from?
SELECT DATE('now') AS today FROM TableA FULL JOIN TableB
ON TableA.col1 = TableB.col1;
Which table is myConst from?
SELECT 1 AS myConst;
Which table is myCalc from?
SELECT a + b AS myCalc FROM (SELECT t1.col1 AS a, t2.col2 AS b
                             FROM table1 AS t1
                             LEFT OUTER JOIN table2 AS t2 ON t1.col2 = t2.col2);
Which table is myCol from?
SELECT SUM(a) as myCol FROM (SELECT a FROM table1 UNION SELECT b FROM table2);
The above were very simple SQL statements for which you either have to make up a "table", or arbitrarily pick one...even if you had an SQL parser!
What SQL gives you back is a set of data as results. The elements in this set cannot necessarily be attributed to specific database tables. You probably need to rethink your approach to this problem.