Inefficient SQL query while excluding results on QuerySet - python

I'm trying to figure out why the Django ORM behaves in what seems to me a strange way. I have two basic models (simplified to get the main idea across):
class A(models.Model):
    pass

class B(models.Model):
    name = models.CharField(max_length=15)
    a = models.ForeignKey(A)
Now I want to select rows from table a that are referenced from table b by rows that don't have a certain value in the name column.
Here is sample SQL I expect Django ORM to produce:
SELECT * FROM inefficient_foreign_key_exclude_a a
INNER JOIN inefficient_foreign_key_exclude_b b ON a.id = b.a_id
WHERE NOT (b.name = '123');
With the filter() method of django.db.models.query.QuerySet it works as expected:
>>> from inefficient_foreign_key_exclude.models import A
>>> print A.objects.filter(b__name='123').query
SELECT `inefficient_foreign_key_exclude_a`.`id`
FROM `inefficient_foreign_key_exclude_a`
INNER JOIN `inefficient_foreign_key_exclude_b` ON (`inefficient_foreign_key_exclude_a`.`id` = `inefficient_foreign_key_exclude_b`.`a_id`)
WHERE `inefficient_foreign_key_exclude_b`.`name` = 123
But if I use the exclude() method (which negates a Q object in the underlying logic), it creates a really strange SQL query:
>>> print A.objects.exclude(b__name='123').query
SELECT `inefficient_foreign_key_exclude_a`.`id`
FROM `inefficient_foreign_key_exclude_a`
WHERE NOT ((`inefficient_foreign_key_exclude_a`.`id` IN (
SELECT U1.`a_id` FROM `inefficient_foreign_key_exclude_b` U1 WHERE (U1.`name` = 123 AND U1.`a_id` IS NOT NULL)
) AND `inefficient_foreign_key_exclude_a`.`id` IS NOT NULL))
Why does ORM make a subquery instead of just JOIN?
UPDATE:
I ran a test to show that using a subquery here is not efficient at all.
I created 500,401 rows in both the a and b tables, and here is what I got:
For join:
mysql> SELECT count(*)
-> FROM inefficient_foreign_key_exclude_a a
-> INNER JOIN inefficient_foreign_key_exclude_b b ON a.id = b.a_id
-> WHERE NOT (b.name = 'abc');
+----------+
| count(*) |
+----------+
| 500401 |
+----------+
1 row in set (0.97 sec)
And for subquery:
mysql> SELECT count(*)
-> FROM inefficient_foreign_key_exclude_a a
-> WHERE NOT ((a.id IN (
-> SELECT U1.`a_id` FROM `inefficient_foreign_key_exclude_b` U1 WHERE (U1.`name` = 'abc' AND U1.`a_id` IS NOT NULL)
-> ) AND a.id IS NOT NULL));
+----------+
| count(*) |
+----------+
| 500401 |
+----------+
1 row in set (3.76 sec)
The join is almost 4 times faster.

It looks like a kind of optimization.
While filter() can express 'any' condition, it performs the join and then applies the restriction.
exclude() is more restrictive, so the ORM is not forced to join the tables and can build the query using subqueries, which I suppose was intended to make the query faster (through index usage).
If you are using MySQL, you can run EXPLAIN on both queries and see whether my suggestion is right.
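For instance (a minimal sketch, reusing the table names from the question; the exact plans depend on your MySQL version and indexes):
EXPLAIN SELECT a.id
FROM inefficient_foreign_key_exclude_a a
INNER JOIN inefficient_foreign_key_exclude_b b ON a.id = b.a_id
WHERE NOT (b.name = '123');

EXPLAIN SELECT a.id
FROM inefficient_foreign_key_exclude_a a
WHERE NOT (a.id IN (
    SELECT U1.a_id FROM inefficient_foreign_key_exclude_b U1
    WHERE U1.name = '123' AND U1.a_id IS NOT NULL
) AND a.id IS NOT NULL);
Comparing the two plans should show where the dependent subquery costs you.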

Get top 1 row of each group - sql-server

I have a table from which I want to get the latest entry for each group. Here's the table:
DocumentStatusLogs Table
| ID | DocumentID | Status | DateCreated |
|  2 | 1          | S1     | 7/29/2011   |
|  3 | 1          | S2     | 7/30/2011   |
|  6 | 1          | S1     | 8/02/2011   |
|  1 | 2          | S1     | 7/28/2011   |
|  4 | 2          | S2     | 7/30/2011   |
|  5 | 2          | S3     | 8/01/2011   |
|  6 | 3          | S1     | 8/02/2011   |
The table will be grouped by DocumentID and sorted by DateCreated in descending order. For each DocumentID, I want to get the latest status.
My preferred output:
| DocumentID | Status | DateCreated |
| 1 | S1 | 8/02/2011 |
| 2 | S3 | 8/01/2011 |
| 3 | S1 | 8/02/2011 |
Is there any aggregate function to get only the top from each group? See pseudo-code GetOnlyTheTop below:
SELECT
DocumentID,
GetOnlyTheTop(Status),
GetOnlyTheTop(DateCreated)
FROM DocumentStatusLogs
GROUP BY DocumentID
ORDER BY DateCreated DESC
If such a function doesn't exist, is there any way I can achieve the output I want?
Or, in the first place, could this be caused by an unnormalized database? I'm thinking, since what I'm looking for is just one row, should that status also be located in the parent table?
Please see the parent table for more information:
Current Documents Table
| DocumentID | Title | Content | DateCreated |
| 1 | TitleA | ... | ... |
| 2 | TitleB | ... | ... |
| 3 | TitleC | ... | ... |
Should the parent table be like this so that I can easily access its status?
| DocumentID | Title | Content | DateCreated | CurrentStatus |
| 1 | TitleA | ... | ... | s1 |
| 2 | TitleB | ... | ... | s3 |
| 3 | TitleC | ... | ... | s1 |
UPDATE
I just learned how to use "apply", which makes it easier to address such problems.
;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1
If you expect 2 entries per day, then this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead:
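A minimal sketch of that variant (same CTE as above, with DENSE_RANK so ties on DateCreated all come back with rank 1):
;WITH cte AS
(
    SELECT *,
        DENSE_RANK() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rnk
    FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rnk = 1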
As for normalised or not, it depends on whether you want to:
maintain status in 2 places
preserve status history
...
As it stands, you preserve status history. If you want the latest status in the parent table too (which is denormalisation), you'd need a trigger to maintain "status" in the parent (a sketch follows), or else drop this status history table.
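A minimal sketch of such a trigger, assuming SQL Server and the hypothetical CurrentStatus column from the question (names are illustrative, not tested here):
CREATE TRIGGER TRG_DocumentStatusLogs_SyncStatus
ON DocumentStatusLogs
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Push the newest inserted status per document into the parent table
    UPDATE d
    SET d.CurrentStatus = i.Status
    FROM Documents AS d
    JOIN (
        SELECT DocumentID, Status,
               ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
        FROM inserted
    ) AS i
      ON i.DocumentID = d.DocumentID AND i.rn = 1;
END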
I just learned how to use cross apply. Here's how to use it in this scenario:
select d.DocumentID, ds.Status, ds.DateCreated
from Documents as d
cross apply
(select top 1 Status, DateCreated
from DocumentStatusLogs
where DocumentID = d.DocumentId
order by DateCreated desc) as ds
I know this is an old thread, but the TOP 1 WITH TIES solution is quite nice and might be helpful to some readers going through the solutions.
select top 1 with ties
DocumentID
,Status
,DateCreated
from DocumentStatusLogs
order by row_number() over (partition by DocumentID order by DateCreated desc)
The select top 1 with ties clause tells SQL Server that you want to return the first row per group. But how does SQL Server know how to group the data? This is where order by row_number() over (partition by DocumentID order by DateCreated desc) comes in. The column(s) after partition by define how SQL Server groups the data. Within each group, the rows are sorted by the order by columns. Once sorted, the top row in each group is returned by the query.
More about the TOP clause can be found here.
I've done some timings over the various recommendations here, and the results really depend on the size of the table involved, but the most consistent solution is using CROSS APPLY. These tests were run against SQL Server 2008 R2, using a table with 6,500 records, and another (identical schema) with 137 million records. The columns being queried are part of the primary key on the table, and the table width is very small (about 30 bytes). The times are reported by SQL Server from the actual execution plan.
Query                                    Time for 6,500 (ms)   Time for 137M (ms)
CROSS APPLY                              17.9                  17.9
SELECT WHERE col = (SELECT MAX(COL)…)    6.6                   854.4
DENSE_RANK() OVER PARTITION              6.6                   907.1
I think the really amazing thing was how consistent the time was for the CROSS APPLY regardless of the number of rows involved.
If you're worried about performance, you can also do this with MAX():
SELECT *
FROM DocumentStatusLogs D
WHERE DateCreated = (SELECT MAX(DateCreated) FROM DocumentStatusLogs WHERE DocumentID = D.DocumentID)
ROW_NUMBER() requires a sort of all the rows in your SELECT statement, whereas MAX does not. This should drastically speed up your query.
SELECT * FROM
DocumentStatusLogs JOIN (
SELECT DocumentID, MAX(DateCreated) DateCreated
FROM DocumentStatusLogs
GROUP BY DocumentID
) max_date USING (DocumentID, DateCreated)
What database server? This code doesn't work on all of them.
Regarding the second half of your question, it seems reasonable to me to include the status as a column. You can leave DocumentStatusLogs as a log, but still store the latest info in the main table.
BTW, if you already have the DateCreated column in the Documents table you can just join DocumentStatusLogs using that (as long as DateCreated is unique in DocumentStatusLogs).
Edit: MsSQL does not support USING, so change it to:
ON DocumentStatusLogs.DocumentID = max_date.DocumentID AND DocumentStatusLogs.DateCreated = max_date.DateCreated
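Putting that together, the SQL Server form of the query would read roughly:
SELECT *
FROM DocumentStatusLogs
JOIN (
    SELECT DocumentID, MAX(DateCreated) AS DateCreated
    FROM DocumentStatusLogs
    GROUP BY DocumentID
) max_date
    ON DocumentStatusLogs.DocumentID = max_date.DocumentID
   AND DocumentStatusLogs.DateCreated = max_date.DateCreated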
This is one of the most easily found questions on the topic, so I wanted to give it a modern answer (both for my reference and to help others out). By using first_value and over you can make short work of the above query:
Select distinct DocumentID
, first_value(status) over (partition by DocumentID order by DateCreated Desc) as Status
, first_value(DateCreated) over (partition by DocumentID order by DateCreated Desc) as DateCreated
From DocumentStatusLogs
This should work in Sql Server 2008 and up. First_value can be thought of as a way to accomplish Select Top 1 when using an over clause. Over allows grouping in the select list so instead of writing nested subqueries (like many of the existing answers do), this does it in a more readable fashion. Hope this helps.
Here are three separate approaches to the problem at hand, along with the best indexing choices for each query (please try the indexes yourself and compare logical reads, elapsed time, and execution plans; these suggestions come from my experience with such queries, without executing them for this specific problem).
Approach 1: Using ROW_NUMBER(). If a rowstore index does not improve performance, you can try a nonclustered/clustered columnstore index: for queries with aggregation and grouping, and for tables that are ordered by different columns all the time, a columnstore index is usually the best choice.
;WITH CTE AS
(
SELECT *,
RN = ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
FROM DocumentStatusLogs
)
SELECT ID
,DocumentID
,Status
,DateCreated
FROM CTE
WHERE RN = 1;
Approach 2: Using FIRST_VALUE. The same indexing advice as Approach 1 applies: if a rowstore index does not improve performance, a nonclustered/clustered columnstore index is usually the best choice for this kind of aggregation and grouping.
SELECT DISTINCT
ID = FIRST_VALUE(ID) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
,DocumentID
,Status = FIRST_VALUE(Status) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
,DateCreated = FIRST_VALUE(DateCreated) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
FROM DocumentStatusLogs;
Approach 3: Using CROSS APPLY. Creating rowstore index on DocumentStatusLogs table covering the columns used in the query should be enough to cover the query without need of a columnstore index.
SELECT DISTINCT
ID = CA.ID
,DocumentID = D.DocumentID
,Status = CA.Status
,DateCreated = CA.DateCreated
FROM DocumentStatusLogs D
CROSS APPLY (
SELECT TOP 1 I.*
FROM DocumentStatusLogs I
WHERE I.DocumentID = D.DocumentID
ORDER BY I.DateCreated DESC
) CA;
This is quite an old thread, but I thought I'd throw my two cents in, as the accepted answer didn't work particularly well for me. I tried gbn's solution on a large dataset and found it to be terribly slow (>45 seconds on 5 million plus records in SQL Server 2012). Looking at the execution plan, it's obvious that the issue is that it requires a SORT operation, which slows things down significantly.
Here's an alternative, lifted from Entity Framework, that needs no SORT operation and does a non-clustered index seek. This reduces the execution time to under 2 seconds on the aforementioned record set.
SELECT
[Limit1].[DocumentID] AS [DocumentID],
[Limit1].[Status] AS [Status],
[Limit1].[DateCreated] AS [DateCreated]
FROM (SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM [dbo].[DocumentStatusLogs] AS [Extent1]) AS [Distinct1]
OUTER APPLY (SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
FROM (SELECT
[Extent2].[ID] AS [ID],
[Extent2].[DocumentID] AS [DocumentID],
[Extent2].[Status] AS [Status],
[Extent2].[DateCreated] AS [DateCreated]
FROM [dbo].[DocumentStatusLogs] AS [Extent2]
WHERE ([Distinct1].[DocumentID] = [Extent2].[DocumentID])
) AS [Project2]
ORDER BY [Project2].[ID] DESC) AS [Limit1]
Now I'm assuming something that isn't entirely specified in the original question, but if your table design is such that your ID column is an auto-increment ID and DateCreated is set to the current date on each insert, then even without my query above you can get a sizable performance boost over gbn's solution (about half the execution time) simply by ordering on ID instead of DateCreated, since this yields an identical sort order and is a faster sort.
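A minimal sketch of that tweak to gbn's CTE (under the stated assumption that ID increases together with DateCreated):
;WITH cte AS
(
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY ID DESC) AS rn
    FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1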
My code to select top 1 from each group
select a.*
from #DocumentStatusLogs a
where datecreated in (
    select top 1 datecreated
    from #DocumentStatusLogs b
    where a.documentid = b.documentid
    order by datecreated desc
)
This solution can be used to get the TOP N most recent rows for each partition (in the example, N is 1 in the WHERE clause, and the partition is doc_id):
SELECT T.doc_id, T.status, T.date_created FROM
(
SELECT a.*, ROW_NUMBER() OVER (PARTITION BY doc_id ORDER BY date_created DESC) AS rnk FROM doc a
) T
WHERE T.rnk = 1;
CROSS APPLY was the method I used for my solution, as it worked for me and for my client's needs. And from what I've read, it should provide the best overall performance should their database grow substantially.
Verifying Clint's awesome and correct answer from above:
The performance of the two queries below is interesting: 52% for the top one and 48% for the second one, a 4% improvement in performance from using DISTINCT instead of ORDER BY. But ORDER BY has the advantage of sorting by multiple columns.
IF (OBJECT_ID('tempdb..#DocumentStatusLogs') IS NOT NULL) BEGIN DROP TABLE #DocumentStatusLogs END
CREATE TABLE #DocumentStatusLogs (
[ID] int NOT NULL,
[DocumentID] int NOT NULL,
[Status] varchar(20),
[DateCreated] datetime
)
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (2, 1, 'S1', '7/29/2011 1:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (3, 1, 'S2', '7/30/2011 2:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 1, 'S1', '8/02/2011 3:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (1, 2, 'S1', '7/28/2011 4:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (4, 2, 'S2', '7/30/2011 5:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (5, 2, 'S3', '8/01/2011 6:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 3, 'S1', '8/02/2011 7:00:00')
Option 1:
SELECT
[Extent1].[ID],
[Extent1].[DocumentID],
[Extent1].[Status],
[Extent1].[DateCreated]
FROM #DocumentStatusLogs AS [Extent1]
OUTER APPLY (
SELECT TOP 1
[Extent2].[ID],
[Extent2].[DocumentID],
[Extent2].[Status],
[Extent2].[DateCreated]
FROM #DocumentStatusLogs AS [Extent2]
WHERE [Extent1].[DocumentID] = [Extent2].[DocumentID]
ORDER BY [Extent2].[DateCreated] DESC, [Extent2].[ID] DESC
) AS [Project2]
WHERE ([Project2].[ID] IS NULL OR [Project2].[ID] = [Extent1].[ID])
Option 2:
SELECT
[Limit1].[DocumentID] AS [ID],
[Limit1].[DocumentID] AS [DocumentID],
[Limit1].[Status] AS [Status],
[Limit1].[DateCreated] AS [DateCreated]
FROM (
SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM #DocumentStatusLogs AS [Extent1]
) AS [Distinct1]
OUTER APPLY (
SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
FROM (
SELECT
[Extent2].[ID] AS [ID],
[Extent2].[DocumentID] AS [DocumentID],
[Extent2].[Status] AS [Status],
[Extent2].[DateCreated] AS [DateCreated]
FROM #DocumentStatusLogs AS [Extent2]
WHERE [Distinct1].[DocumentID] = [Extent2].[DocumentID]
) AS [Project2]
ORDER BY [Project2].[ID] DESC
) AS [Limit1]
In Microsoft SQL Server Management Studio: after highlighting and running the first block, highlight both Option 1 and Option 2, right click -> [Display Estimated Execution Plan]. Then run the entire thing to see the results.
Option 1 Results:
ID DocumentID Status DateCreated
6 1 S1 8/2/11 3:00
5 2 S3 8/1/11 6:00
6 3 S1 8/2/11 7:00
Option 2 Results:
ID DocumentID Status DateCreated
6 1 S1 8/2/11 3:00
5 2 S3 8/1/11 6:00
6 3 S1 8/2/11 7:00
Note:
I tend to use APPLY when I want a join to be 1-to-(1 of many).
I use a JOIN if I want the join to be 1-to-many, or many-to-many.
I avoid CTE with ROW_NUMBER() unless I need to do something advanced and am ok with the windowing performance penalty.
I also avoid EXISTS / IN subqueries in the WHERE or ON clause, as I have experienced this causing some terrible execution plans. But mileage varies. Review the execution plan and profile performance where and when needed!
SELECT o.*
FROM `DocumentStatusLogs` o
LEFT JOIN `DocumentStatusLogs` b
ON o.DocumentID = b.DocumentID AND o.DateCreated < b.DateCreated
WHERE b.DocumentID is NULL ;
If you want only the most recent document, ordered by DateCreated, this will return just the top document per DocumentID.
I believe this can be done just like this. It might need some tweaking, but you can just select the max from the group.
These answers are overkill.
SELECT
    d.DocumentID,
    MAX(d.Status),
    MAX(d.DateCreated)
FROM DocumentStatusLogs d
GROUP BY d.DocumentID
ORDER BY 3 DESC
In scenarios where you want to avoid using ROW_NUMBER(), you can also use a left join:
select ds.DocumentID, ds.Status, ds.DateCreated
from DocumentStatusLogs ds
left join DocumentStatusLogs filter
ON ds.DocumentID = filter.DocumentID
-- Match any row that has another row that was created after it.
AND ds.DateCreated < filter.DateCreated
-- then filter out any rows that matched
where filter.DocumentID is null
For the example schema, you could also use a "not in subquery", which generally compiles to the same output as the left join:
select ds.DocumentID, ds.Status, ds.DateCreated
from DocumentStatusLogs ds
WHERE ds.ID NOT IN (
SELECT filter.ID
FROM DocumentStatusLogs filter
WHERE ds.DocumentID = filter.DocumentID
AND ds.DateCreated < filter.DateCreated)
Note, the subquery pattern wouldn't work if the table didn't have at least one single-column unique key/constraint/index, in this case the primary key "Id".
Both of these queries tend to be more "expensive" than the ROW_NUMBER() query (as measured by Query Analyzer). However, you might encounter scenarios where they return results faster or enable other optimizations.
SELECT documentid,
status,
datecreated
FROM documentstatuslogs dlogs
WHERE status = (SELECT status
FROM documentstatuslogs
WHERE documentid = dlogs.documentid
ORDER BY datecreated DESC
LIMIT 1)
Some database engines* are starting to support the QUALIFY clause, which lets you filter on the result of window functions (which the accepted answer uses).
So the accepted answer can become
SELECT *, ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
QUALIFY rn = 1
See this article for an in depth explanation: https://jrandrews.net/the-joy-of-qualify
You can use this tool to see which database support this clause: https://www.jooq.org/translate/
There is an option to transform the qualify clause when the target dialect does not support it.
*Teradata, BigQuery, H2, Snowflake...
Try this:
SELECT [DocumentID]
,[tmpRez].value('/x[2]', 'varchar(20)') AS [Status]
,[tmpRez].value('/x[3]', 'datetime') AS [DateCreated]
FROM (
SELECT [DocumentID]
,cast('<x>' + max(cast([ID] AS VARCHAR(10)) + '</x><x>' + [Status] + '</x><x>' + cast([DateCreated] AS VARCHAR(20))) + '</x>' AS XML) AS [tmpRez]
FROM DocumentStatusLogs
GROUP BY DocumentID
) AS [tmpQry]

Effectively classifying a DB of event logs by a column

Situation
I am using Python 3.7.2 with its built-in sqlite3 module. (sqlite3.version == 2.6.0)
I have a sqlite database that looks like:
| user_id | action | timestamp |
| ------- | ------ | ---------- |
| Alice | 0 | 1551683796 |
| Alice | 23 | 1551683797 |
| James | 1 | 1551683798 |
| ....... | ...... | .......... |
where user_id is TEXT, action is an arbitrary INTEGER, and timestamp is an INTEGER representing UNIX time.
The database has 200M rows, and there are 70K distinct user_ids.
Goal
I need to make a Python dictionary that looks like:
{
"Alice":[(0, 1551683796), (23, 1551683797)],
"James":[(1, 1551683798)],
...
}
that has user_ids as keys and respective event logs as values, which are lists of tuples (action, timestamp). Hopefully each list will be sorted by timestamp in increasing order, but even if it isn't, I think it can be easily achieved by sorting each list after a dictionary is made.
Effort
I have the following code to query the database. It first queries for the list of users (with user_list_cursor), and then queries for all rows belonging to each user.
import sqlite3

connection = sqlite3.connect("database.db")
user_list_cursor = connection.cursor()
user_list_cursor.execute("SELECT DISTINCT user_id FROM EVENT_LOG")
user_id = user_list_cursor.fetchone()

classified_log = {}
log_cursor = connection.cursor()
while user_id:
    user_id = user_id[0]  # cursor.fetchone() returns a tuple
    query = (
        "SELECT action, timestamp"
        " FROM EVENT_LOG"
        " WHERE user_id = ?"
        " ORDER BY timestamp ASC"
    )
    parameters = (user_id,)
    log_cursor.execute(query, parameters)  # Here is the bottleneck
    classified_log[user_id] = list()
    for row in log_cursor.fetchall():
        classified_log[user_id].append(row)
    user_id = user_list_cursor.fetchone()
Problem
The query execution for each user is too slow. That single line of code (commented as the bottleneck) takes around 10 seconds for each user_id. I think I am taking the wrong approach with the queries. What is the right way to achieve the goal?
I tried searching with keywords "classify db by a column", "classify sql by a column", "sql log to dictionary python", but nothing seems to match my situation. I think this wouldn't be a rare need, so maybe I'm missing the right keyword to search with.
Reproducibility
If anyone is willing to reproduce the situation with a 200M row sqlite database, the following code will create a 5GB database file.
But I hope there is somebody who is familiar with such a situation and knows how to write the right query.
import sqlite3
import random

connection = sqlite3.connect("tmp.db")
cursor = connection.cursor()
cursor.execute(
    "CREATE TABLE IF NOT EXISTS EVENT_LOG (user_id TEXT, action INTEGER, timestamp INTEGER)"
)
query = "INSERT INTO EVENT_LOG VALUES (?, ?, ?)"
parameters = []
for timestamp in range(200_000_000):
    user_id = f"user{random.randint(0, 70000)}"
    action = random.randint(0, 1_000_000)
    parameters.append((user_id, action, timestamp))
cursor.executemany(query, parameters)
connection.commit()
cursor.close()
connection.close()
Big thanks to @Strawberry and @Solarflare for their help given in comments.
The following solution achieved a more than 70x performance increase, so I'm leaving what I did as an answer for completeness' sake.
I used indices and queried the whole table, as they suggested.
import sqlite3
from operator import itemgetter

connection = sqlite3.connect("database.db")

# Creating an index, thanks to @Solarflare
cursor = connection.cursor()
cursor.execute("CREATE INDEX IF NOT EXISTS idx_user_id ON EVENT_LOG (user_id)")
connection.commit()

# Reading the whole table, then building lists by user_id. Thanks to @Strawberry
cursor.execute("SELECT user_id, action, timestamp FROM EVENT_LOG ORDER BY user_id ASC")
previous_user_id = None
log_per_user = list()
classified_log = dict()
for row in cursor:
    user_id, action, timestamp = row
    if user_id != previous_user_id:
        if previous_user_id:
            log_per_user.sort(key=itemgetter(1))
            classified_log[previous_user_id] = log_per_user[:]
        log_per_user = list()
    log_per_user.append((action, timestamp))
    previous_user_id = user_id
# Flush the final user's log once the loop ends
if previous_user_id:
    log_per_user.sort(key=itemgetter(1))
    classified_log[previous_user_id] = log_per_user[:]
So the points are
Indexing by user_id to make ORDER BY user_id ASC execute in acceptable time.
Reading the whole table, then classifying rows by user_id, instead of making individual queries for each user_id.
Iterating over the cursor to read rows one by one, instead of using cursor.fetchall().
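One further refinement worth trying (an assumption on my part, not benchmarked here): a composite index on (user_id, timestamp) lets SQLite hand back rows already sorted by timestamp within each user, so the per-user sort in Python could be dropped:
CREATE INDEX IF NOT EXISTS idx_user_id_ts ON EVENT_LOG (user_id, timestamp);
SELECT user_id, action, timestamp
FROM EVENT_LOG
ORDER BY user_id ASC, timestamp ASC;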

Create a temporary table in python to join with a sql table

I have the following data in a vertica db, Mytable:
+----+-------+
| ID | Value |
+----+-------+
| A | 5 |
| B | 9 |
| C | 10 |
| D | 7 |
+----+-------+
I am trying to create a query in Python to access a Vertica database. In Python I have a list:
ID_list= ['A', 'C']
I would like to create a query that basically inner joins Mytable with ID_list and then applies a WHERE filter.
So it would be basically something like this:
SELECT *
FROM Mytable
INNER JOIN ID_list AS temp_table ON Mytable.ID = temp_table.ID
WHERE Value = 5
I don't have write access to the database, so the table would need to be created locally. Or is there an alternative way of doing this?
If you have a small table, then you can do as Tim suggested and create an in-list.
I kind of prefer to do this the Python way, though. I would probably also make ID_list a set to keep it free of duplicates, etc.
in_list = '(%s)' % ','.join(str(id) for id in ID_list)
or better, use bind variables (this depends on the client you are using, and it's probably not strictly necessary if you are dealing with a set of ints, since I can't imagine a way to inject SQL with those):
in_list = '(%s)' % ','.join(['%d'] * len(ID_list))
and send in your ID_list as a parameter list to your cursor.execute. This method is positional, so you'll need to arrange your bind parameters correctly.
If you have a very, very large list... you could create a local temp and load it before doing your query with join.
CREATE LOCAL TEMP TABLE mytable ( id INTEGER );
COPY mytable FROM STDIN;
-- Or however you need to load the data. Using python, you'll probably need to stream in a list using `cursor.copy`
Then join to mytable.
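For instance, the final query might look like this (a sketch; I've named the temp table id_list here rather than mytable, to avoid the clash with the existing Mytable, and typed the id column for string IDs):
CREATE LOCAL TEMP TABLE id_list ( id VARCHAR(10) );
-- load the IDs into id_list, then:
SELECT t.*
FROM Mytable t
INNER JOIN id_list ids ON ids.id = t.ID
WHERE t.Value = 5;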
I wouldn't bother doing the latter with a very small number of rows, too much overhead.
So I used the approach from Tim:
# create a string of all the IDs in ID_list so they can be inserted into a SQL query
Sql_string = '('
for ss in ID_list:
    Sql_string = Sql_string + " " + str(ss) + ","
Sql_string = Sql_string[:-1] + ")"

"SELECT * FROM"
" (SELECT * FROM Mytable WHERE ID IN " + Sql_string + ") as temp"
" WHERE Value = 5"
works surprisingly fast
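With ID_list = ['A', 'C'], the assembled statement comes out roughly as follows (note that string IDs would also need quoting, which the loop above does not add):
SELECT * FROM
(SELECT * FROM Mytable WHERE ID IN ('A', 'C')) AS temp
WHERE Value = 5;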

The same query returns different results in the shell and in code

Previously, I stumbled across one interesting thing in Oracle -
Oracle: ORA-01722: invalid number. It turned out to be Oracle's natural behaviour (though different from the other major databases I had dealt with before - MySQL, Postgres and SQLite). But now I see another counterintuitive thing - I have a very simple query which returns results in the shell, but returns nothing from Python code. This is the query:
SELECT * FROM TEST_TABLE T0
INNER JOIN TEST_TABLE_2 T1 ON T1.ATTR=T0.ID
INNER JOIN TEST_TABLE_3 T2 ON T2.ID = T1.ID
As you can see, it's a very simple query with just two simple joins. And here is the voodoo magic (screencast omitted): in the shell, the query returns data. Here is another picture of the ghost (screenshot omitted): in Python code, the same query returns nothing. However, it does return rows if we tune this query a little bit - just remove the second join.
So, what is wrong with all that? And how can I trust Oracle? (Right now it seems to me that I can rely more on a file database like SQLite than on Oracle.)
EDIT
Below is schema with data:
SQL> SELECT COLUMN_NAME, DATA_TYPE FROM USER_TAB_COLUMNS
     WHERE TABLE_NAME = 'TEST_TABLE';

COLUMN_NAME                    DATA_TYPE
------------------------------ ------------------------------
ID                             NUMBER

SQL> SELECT * FROM TEST_TABLE;

        ID
----------
         1

SQL> SELECT COLUMN_NAME, DATA_TYPE FROM USER_TAB_COLUMNS
     WHERE TABLE_NAME = 'TEST_TABLE_2';

COLUMN_NAME                    DATA_TYPE
------------------------------ ------------------------------
ID                             NUMBER
TXT                            VARCHAR2
ATTR                           NUMBER

SQL> SELECT * FROM TEST_TABLE_2;

        ID TXT                                       ATTR
---------- ----------------------------------- ----------
         2 hello                                        1

SQL> SELECT COLUMN_NAME, DATA_TYPE FROM USER_TAB_COLUMNS
     WHERE TABLE_NAME = 'TEST_TABLE_3';

COLUMN_NAME                    DATA_TYPE
------------------------------ ------------------------------
ID                             NUMBER

SQL> SELECT * FROM TEST_TABLE_3;

        ID
----------
         2
EDIT
To be more precise, I created my three tables with these statements:
CREATE TABLE test_table(id number(19) default 0 not null)
CREATE TABLE test_table_2(txt varchar(255),id number(19) default 0 not null,attr number(19) default 0 not null)
CREATE TABLE test_table_3(id number(19) default 0 not null)

RIGHT OUTER JOIN in SQLAlchemy

I have two tables beard and moustache defined below:
+--------+---------+------------+-------------+
| person | beardID | beardStyle | beardLength |
+--------+---------+------------+-------------+
+--------+-------------+----------------+
| person | moustacheID | moustacheStyle |
+--------+-------------+----------------+
I have created a SQL query in PostgreSQL which will combine these two tables and generate the following result:
+--------+---------+------------+-------------+-------------+----------------+
| person | beardID | beardStyle | beardLength | moustacheID | moustacheStyle |
+--------+---------+------------+-------------+-------------+----------------+
| bob | 1 | rasputin | 1 | | |
+--------+---------+------------+-------------+-------------+----------------+
| bob | 2 | samson | 12 | | |
+--------+---------+------------+-------------+-------------+----------------+
| bob | | | | 1 | fu manchu |
+--------+---------+------------+-------------+-------------+----------------+
Query:
SELECT * FROM beards LEFT OUTER JOIN mustaches ON (false) WHERE person = 'bob'
UNION ALL
SELECT * FROM beards b RIGHT OUTER JOIN mustaches ON (false) WHERE person = 'bob'
However, I cannot create a SQLAlchemy representation of it. I tried several ways, from from_statement to outerjoin, but none of them really worked. Can anyone help me with it?
In SQL, A RIGHT OUTER JOIN B is the equivalent of B LEFT OUTER JOIN A. So, technically, there is no need for a RIGHT OUTER JOIN API - the same can be achieved by switching the places of the target "selectable" and the joined "selectable". SQLAlchemy provides an API for this:
# this **fictional** API:
query(A).join(B, right_outer_join=True) # right_outer_join doesn't exist in SQLA!
# can be implemented in SQLA like this:
query(A).select_entity_from(B).join(A, isouter=True)
See SQLA Query.join() doc, section "Controlling what to Join From".
From @Francis P's suggestion I came up with this snippet:
q1 = session.\
query(beard.person.label('person'),
beard.beardID.label('beardID'),
beard.beardStyle.label('beardStyle'),
sqlalchemy.sql.null().label('moustachID'),
sqlalchemy.sql.null().label('moustachStyle'),
).\
filter(beard.person == 'bob')
q2 = session.\
query(moustache.person.label('person'),
sqlalchemy.sql.null().label('beardID'),
sqlalchemy.sql.null().label('beardStyle'),
moustache.moustachID,
moustache.moustachStyle,
).\
filter(moustache.person == 'bob')
result = q1.union(q2).all()
This works, but you can't really call it an answer, since it's more of a hack. Which is one more reason why there should be a RIGHT OUTER JOIN in SQLAlchemy.
If A,B are tables, you can achieve:
SELECT * FROM A RIGHT JOIN B ON A.id = B.a_id WHERE B.id = my_id
by:
SELECT A.* FROM B JOIN A ON A.id = B.a_id WHERE B.id = my_id
in sqlalchemy:
from sqlalchemy import select
result = session.query(A).select_entity_from(select([B]))\
.join(A, A.id == B.a_id)\
.filter(B.id == my_id).first()
for example:
# import ...
class User(Base):
    __tablename__ = "user"
    id = Column(Integer, primary_key=True)
    group_id = Column(Integer, ForeignKey("group.id"))

class Group(Base):
    __tablename__ = "group"
    id = Column(Integer, primary_key=True)
    name = Column(String(100))
You can get a user's group name by user id with the following code:
# import ...
from sqlalchemy import select
user_group_name, = session.query(Group.name)\
.select_entity_from(select([User]))\
.join(Group, User.group_id == Group.id)\
.filter(User.id == 1).first()
If you want a outer join, use outerjoin() instead of join().
This answer is a complement to the previous one (Timur's answer).
Here's what I've got, ORM style:
from sqlalchemy.sql import select, false, outerjoin
stmt = (
select([Beard, Moustache])
.select_from(
outerjoin(Beard, Moustache, false())
).apply_labels()
).union_all(
select([Beard, Moustache])
.select_from(
outerjoin(Moustache, Beard, false())
).apply_labels()
)
session.query(Beard, Moustache).select_entity_from(stmt)
This seems to work on its own, but it seems impossible to join with another select expression.
Unfortunately, SQLAlchemy only provides API for LEFT OUTER JOIN as .outerjoin(). As mentioned above, we could get a RIGHT OUTER JOIN by reversing the operands of LEFT OUTER JOIN; eg. A RIGHT JOIN B is the same as B LEFT JOIN A.
In SQL, the following statements are equivalent:
SELECT * FROM A RIGHT OUTER JOIN B ON A.common = B.common;
SELECT * FROM B LEFT OUTER JOIN A ON A.common = B.common;
However, in SQLAlchemy, we need to query on a class then perform join. The tricky part is rewriting the SQLAlchemy statement to reverse the tables. For example, the results of the first two queries below are different as they return different objects.
# No such API (rightouterjoin()) but this is what we want.
# This should return the result of A RIGHT JOIN B in a list of object A
session.query(A).rightouterjoin(B).all() # SELECT A.* FROM A RIGHT OUTER JOIN B ...
# We could reverse A and B but this returns a list of object B
session.query(B).outerjoin(A).all() # SELECT B.* FROM B LEFT OUTER JOIN A ...
# This returns a list of object A by choosing the 'left' side to be B using select_from()
session.query(A).select_from(B).outerjoin(A).all() # SELECT A.* FROM B LEFT OUTER JOIN A ...
# For OP's example, assuming we want to return a list of beard object:
session.query(beard).select_from(moustache).outerjoin(beard).all()
Just adding to the answers: you can find the use of select_from in the SQLAlchemy docs.
