Merging distinct results SQL

Merging distinct results SQL - python

So my problem is the following:
I have a MariaDB database that I am trying to query.
I am looking at experiment data. What I would like, is to summarize several experiments into one results row of an SQL query.
This is what the data looks like:
Experiment_ID | Antibiotic | Strain | Medium | ...
1 Ampicillin E. coli TBauto ...
2 Ampicillin E. coli TB + IPTG
What I would like to get:
Experiment_ID | Antibiotic | Strain | Medium | ...
1 Ampicillin E. coli TBauto, TB + IPTG ...
I don't care about the Experiment_ID, that is just there to make clear the I am talking about two distinct entries.
I already tried
Select
tmp.*
From
((Select * from my_tbl Where ExperimentID = 1)
Union
(Select * from my_tbl Where ExperimentID = 2)) as tmp
There I still end up with 2 rows of results. I could also do it in my Python code. But I would like to not have to modify results that I grab from the DB.
I guess I must have just been looking for the wrong word, to merge those two entries. So please kindly guide me in the right direction.

You seem to want aggregation:
select min(Experiment_ID) as Experiment_ID, Antibiotic, Strain,
group_concat(Medium order by Experiment_ID separator ', ')
from t
where Experiment_ID in (1, 2)
group by Antibiotic, Strain;

Related

Remove similar repeated groups in same column [duplicate]

I have a table which I want to get the latest entry for each group. Here's the table:
DocumentStatusLogs Table
|ID| DocumentID | Status | DateCreated |
| 2| 1 | S1 | 7/29/2011 |
| 3| 1 | S2 | 7/30/2011 |
| 6| 1 | S1 | 8/02/2011 |
| 1| 2 | S1 | 7/28/2011 |
| 4| 2 | S2 | 7/30/2011 |
| 5| 2 | S3 | 8/01/2011 |
| 6| 3 | S1 | 8/02/2011 |
The table will be grouped by DocumentID and sorted by DateCreated in descending order. For each DocumentID, I want to get the latest status.
My preferred output:
| DocumentID | Status | DateCreated |
| 1 | S1 | 8/02/2011 |
| 2 | S3 | 8/01/2011 |
| 3 | S1 | 8/02/2011 |
Is there any aggregate function to get only the top from each group? See pseudo-code GetOnlyTheTop below:
SELECT
DocumentID,
GetOnlyTheTop(Status),
GetOnlyTheTop(DateCreated)
FROM DocumentStatusLogs
GROUP BY DocumentID
ORDER BY DateCreated DESC
If such function doesn't exist, is there any way I can achieve the output I want?
Or at the first place, could this be caused by unnormalized database? I'm thinking, since what I'm looking for is just one row, should that status also be located in the parent table?
Please see the parent table for more information:
Current Documents Table
| DocumentID | Title | Content | DateCreated |
| 1 | TitleA | ... | ... |
| 2 | TitleB | ... | ... |
| 3 | TitleC | ... | ... |
Should the parent table be like this so that I can easily access its status?
| DocumentID | Title | Content | DateCreated | CurrentStatus |
| 1 | TitleA | ... | ... | s1 |
| 2 | TitleB | ... | ... | s3 |
| 3 | TitleC | ... | ... | s1 |
UPDATE
I just learned how to use "apply" which makes it easier to address such problems.

;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1
If you expect 2 entries per day, then this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead
As for normalised or not, it depends if you want to:
maintain status in 2 places
preserve status history
...
As it stands, you preserve status history. If you want latest status in the parent table too (which is denormalisation) you'd need a trigger to maintain "status" in the parent. or drop this status history table.

I just learned how to use cross apply. Here's how to use it in this scenario:
select d.DocumentID, ds.Status, ds.DateCreated
from Documents as d
cross apply
(select top 1 Status, DateCreated
from DocumentStatusLogs
where DocumentID = d.DocumentId
order by DateCreated desc) as ds

I know this is an old thread but the TOP 1 WITH TIES solutions is quite nice and might be helpful to some reading through the solutions.
select top 1 with ties
DocumentID
,Status
,DateCreated
from DocumentStatusLogs
order by row_number() over (partition by DocumentID order by DateCreated desc)
The select top 1 with ties clause tells SQL Server that you want to return the first row per group. But how does SQL Server know how to group up the data? This is where the order by row_number() over (partition by DocumentID order by DateCreated desc comes in. The column/columns after partition by defines how SQL Server groups up the data. Within each group, the rows will be sorted based on the order by columns. Once sorted, the top row in each group will be returned in the query.
More about the TOP clause can be found here.

I've done some timings over the various recommendations here, and the results really depend on the size of the table involved, but the most consistent solution is using the CROSS APPLY These tests were run against SQL Server 2008-R2, using a table with 6,500 records, and another (identical schema) with 137 million records. The columns being queried are part of the primary key on the table, and the table width is very small (about 30 bytes). The times are reported by SQL Server from the actual execution plan.
Query Time for 6500 (ms) Time for 137M(ms)
CROSS APPLY 17.9 17.9
SELECT WHERE col = (SELECT MAX(COL)…) 6.6 854.4
DENSE_RANK() OVER PARTITION 6.6 907.1
I think the really amazing thing was how consistent the time was for the CROSS APPLY regardless of the number of rows involved.

If you're worried about performance, you can also do this with MAX():
SELECT *
FROM DocumentStatusLogs D
WHERE DateCreated = (SELECT MAX(DateCreated) FROM DocumentStatusLogs WHERE ID = D.ID)
ROW_NUMBER() requires a sort of all the rows in your SELECT statement, whereas MAX does not. Should drastically speed up your query.

SELECT * FROM
DocumentStatusLogs JOIN (
SELECT DocumentID, MAX(DateCreated) DateCreated
FROM DocumentStatusLogs
GROUP BY DocumentID
) max_date USING (DocumentID, DateCreated)
What database server? This code doesn't work on all of them.
Regarding the second half of your question, it seems reasonable to me to include the status as a column. You can leave DocumentStatusLogs as a log, but still store the latest info in the main table.
BTW, if you already have the DateCreated column in the Documents table you can just join DocumentStatusLogs using that (as long as DateCreated is unique in DocumentStatusLogs).
Edit: MsSQL does not support USING, so change it to:
ON DocumentStatusLogs.DocumentID = max_date.DocumentID AND DocumentStatusLogs.DateCreated = max_date.DateCreated

This is one of the most easily found question on the topic, so I wanted to give a modern answer to the it (both for my reference and to help others out). By using first_value and over you can make short work of the above query:
Select distinct DocumentID
, first_value(status) over (partition by DocumentID order by DateCreated Desc) as Status
, first_value(DateCreated) over (partition by DocumentID order by DateCreated Desc) as DateCreated
From DocumentStatusLogs
This should work in Sql Server 2008 and up. First_value can be thought of as a way to accomplish Select Top 1 when using an over clause. Over allows grouping in the select list so instead of writing nested subqueries (like many of the existing answers do), this does it in a more readable fashion. Hope this helps.

Here are 3 separate approaches to the problem in hand along with the best choices of indexing for each of those queries (please try out the indexes yourselves and see the logical read, elapsed time, execution plan. I have provided the suggestions from my experience on such queries without executing for this specific problem).
Approach 1: Using ROW_NUMBER(). If rowstore index is not being able to enhance the performance, you can try out nonclustered/clustered columnstore index as for queries with aggregation and grouping and for tables which are ordered by in different columns all the times, columnstore index usually is the best choice.
;WITH CTE AS
(
SELECT *,
RN = ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
FROM DocumentStatusLogs
)
SELECT ID
,DocumentID
,Status
,DateCreated
FROM CTE
WHERE RN = 1;
Approach 2: Using FIRST_VALUE. If rowstore index is not being able to enhance the performance, you can try out nonclustered/clustered columnstore index as for queries with aggregation and grouping and for tables which are ordered by in different columns all the times, columnstore index usually is the best choice.
SELECT DISTINCT
ID = FIRST_VALUE(ID) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
,DocumentID
,Status = FIRST_VALUE(Status) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
,DateCreated = FIRST_VALUE(DateCreated) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
FROM DocumentStatusLogs;
Approach 3: Using CROSS APPLY. Creating rowstore index on DocumentStatusLogs table covering the columns used in the query should be enough to cover the query without need of a columnstore index.
SELECT DISTINCT
ID = CA.ID
,DocumentID = D.DocumentID
,Status = CA.Status
,DateCreated = CA.DateCreated
FROM DocumentStatusLogs D
CROSS APPLY (
SELECT TOP 1 I.*
FROM DocumentStatusLogs I
WHERE I.DocumentID = D.DocumentID
ORDER BY I.DateCreated DESC
) CA;

This is quite an old thread, but I thought I'd throw my two cents in just the same as the accepted answer didn't work particularly well for me. I tried gbn's solution on a large dataset and found it to be terribly slow (>45 seconds on 5 million plus records in SQL Server 2012). Looking at the execution plan it's obvious that the issue is that it requires a SORT operation which slows things down significantly.
Here's an alternative that I lifted from the entity framework that needs no SORT operation and does a NON-Clustered Index search. This reduces the execution time down to < 2 seconds on the aforementioned record set.
SELECT
[Limit1].[DocumentID] AS [DocumentID],
[Limit1].[Status] AS [Status],
[Limit1].[DateCreated] AS [DateCreated]
FROM (SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM [dbo].[DocumentStatusLogs] AS [Extent1]) AS [Distinct1]
OUTER APPLY (SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
FROM (SELECT
[Extent2].[ID] AS [ID],
[Extent2].[DocumentID] AS [DocumentID],
[Extent2].[Status] AS [Status],
[Extent2].[DateCreated] AS [DateCreated]
FROM [dbo].[DocumentStatusLogs] AS [Extent2]
WHERE ([Distinct1].[DocumentID] = [Extent2].[DocumentID])
) AS [Project2]
ORDER BY [Project2].[ID] DESC) AS [Limit1]
Now I'm assuming something that isn't entirely specified in the original question, but if your table design is such that your ID column is an auto-increment ID, and the DateCreated is set to the current date with each insert, then even without running with my query above you could actually get a sizable performance boost to gbn's solution (about half the execution time) just from ordering on ID instead of ordering on DateCreated as this will provide an identical sort order and it's a faster sort.

My code to select top 1 from each group
select a.* from #DocumentStatusLogs a where
datecreated in( select top 1 datecreated from #DocumentStatusLogs b
where
a.documentid = b.documentid
order by datecreated desc
)

This solution can be used to get the TOP N most recent rows for each partition (in the example, N is 1 in the WHERE statement and partition is doc_id):
SELECT T.doc_id, T.status, T.date_created FROM
(
SELECT a.*, ROW_NUMBER() OVER (PARTITION BY doc_id ORDER BY date_created DESC) AS rnk FROM doc a
) T
WHERE T.rnk = 1;

CROSS APPLY was the method I used for my solution, as it worked for me, and for my clients needs. And from what I've read, should provide the best overall performance should their database grow substantially.

Verifying Clint's awesome and correct answer from above:
The performance between the two queries below is interesting. 52% being the top one. And 48% being the second one. A 4% improvement in performance using DISTINCT instead of ORDER BY. But ORDER BY has the advantage to sort by multiple columns.
IF (OBJECT_ID('tempdb..#DocumentStatusLogs') IS NOT NULL) BEGIN DROP TABLE #DocumentStatusLogs END
CREATE TABLE #DocumentStatusLogs (
[ID] int NOT NULL,
[DocumentID] int NOT NULL,
[Status] varchar(20),
[DateCreated] datetime
)
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (2, 1, 'S1', '7/29/2011 1:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (3, 1, 'S2', '7/30/2011 2:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 1, 'S1', '8/02/2011 3:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (1, 2, 'S1', '7/28/2011 4:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (4, 2, 'S2', '7/30/2011 5:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (5, 2, 'S3', '8/01/2011 6:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 3, 'S1', '8/02/2011 7:00:00')
Option 1:
SELECT
[Extent1].[ID],
[Extent1].[DocumentID],
[Extent1].[Status],
[Extent1].[DateCreated]
FROM #DocumentStatusLogs AS [Extent1]
OUTER APPLY (
SELECT TOP 1
[Extent2].[ID],
[Extent2].[DocumentID],
[Extent2].[Status],
[Extent2].[DateCreated]
FROM #DocumentStatusLogs AS [Extent2]
WHERE [Extent1].[DocumentID] = [Extent2].[DocumentID]
ORDER BY [Extent2].[DateCreated] DESC, [Extent2].[ID] DESC
) AS [Project2]
WHERE ([Project2].[ID] IS NULL OR [Project2].[ID] = [Extent1].[ID])
Option 2:
SELECT
[Limit1].[DocumentID] AS [ID],
[Limit1].[DocumentID] AS [DocumentID],
[Limit1].[Status] AS [Status],
[Limit1].[DateCreated] AS [DateCreated]
FROM (
SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM #DocumentStatusLogs AS [Extent1]
) AS [Distinct1]
OUTER APPLY (
SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
FROM (
SELECT
[Extent2].[ID] AS [ID],
[Extent2].[DocumentID] AS [DocumentID],
[Extent2].[Status] AS [Status],
[Extent2].[DateCreated] AS [DateCreated]
FROM #DocumentStatusLogs AS [Extent2]
WHERE [Distinct1].[DocumentID] = [Extent2].[DocumentID]
) AS [Project2]
ORDER BY [Project2].[ID] DESC
) AS [Limit1]
In Microsoft SQL Server Management Studio: after highlighting and running the first block, highlight both Option 1 and Option 2, right click -> [Display Estimated Execution Plan]. Then run the entire thing to see the results.
Option 1 Results:
ID DocumentID Status DateCreated
6 1 S1 8/2/11 3:00
5 2 S3 8/1/11 6:00
6 3 S1 8/2/11 7:00
Option 2 Results:
ID DocumentID Status DateCreated
6 1 S1 8/2/11 3:00
5 2 S3 8/1/11 6:00
6 3 S1 8/2/11 7:00
Note:
I tend to use APPLY when I want a join to be 1-to-(1 of many).
I use a JOIN if I want the join to be 1-to-many, or many-to-many.
I avoid CTE with ROW_NUMBER() unless I need to do something advanced and am ok with the windowing performance penalty.
I also avoid EXISTS / IN subqueries in the WHERE or ON clause, as I have experienced this causing some terrible execution plans. But mileage varies. Review the execution plan and profile performance where and when needed!

SELECT o.*
FROM `DocumentStatusLogs` o
LEFT JOIN `DocumentStatusLogs` b
ON o.DocumentID = b.DocumentID AND o.DateCreated < b.DateCreated
WHERE b.DocumentID is NULL ;
If you want to return only recent document order by DateCreated, it will return only top 1 document by DocumentID

I believe this can be done just like this. This might need some tweaking but you can just select the max from the group.
These answers are overkill..
SELECT
d.DocumentID,
MAX(d.Status),
MAX(d1.DateCreated)
FROM DocumentStatusLogs d, DocumentStatusLogs d1
USING DocumentID
GROUP BY 1
ORDER BY 3 DESC

In scenarios where you want to avoid using row_count(), you can also use a left join:
select ds.DocumentID, ds.Status, ds.DateCreated
from DocumentStatusLogs ds
left join DocumentStatusLogs filter
ON ds.DocumentID = filter.DocumentID
-- Match any row that has another row that was created after it.
AND ds.DateCreated < filter.DateCreated
-- then filter out any rows that matched
where filter.DocumentID is null
For the example schema, you could also use a "not in subquery", which generally compiles to the same output as the left join:
select ds.DocumentID, ds.Status, ds.DateCreated
from DocumentStatusLogs ds
WHERE ds.ID NOT IN (
SELECT filter.ID
FROM DocumentStatusLogs filter
WHERE ds.DocumentID = filter.DocumentID
AND ds.DateCreated < filter.DateCreated)
Note, the subquery pattern wouldn't work if the table didn't have at least one single-column unique key/constraint/index, in this case the primary key "Id".
Both of these queries tend to be more "expensive" than the row_count() query (as measured by Query Analyzer). However, you might encounter scenarios where they return results faster or enable other optimizations.

SELECT documentid,
status,
datecreated
FROM documentstatuslogs dlogs
WHERE status = (SELECT status
FROM documentstatuslogs
WHERE documentid = dlogs.documentid
ORDER BY datecreated DESC
LIMIT 1)

Some database engines* are starting to support the QUALIFY clause that allows to filter the result of window functions (which the accepted answer uses).
So the accepted answer can become
SELECT *, ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
QUALIFY rn = 1
See this article for an in depth explanation: https://jrandrews.net/the-joy-of-qualify
You can use this tool to see which database support this clause: https://www.jooq.org/translate/
There is an option to transform the qualify clause when the target dialect does not support it.
*Teradata, BigQuery, H2, Snowflake...

Try this:
SELECT [DocumentID]
,[tmpRez].value('/x[2]', 'varchar(20)') AS [Status]
,[tmpRez].value('/x[3]', 'datetime') AS [DateCreated]
FROM (
SELECT [DocumentID]
,cast('<x>' + max(cast([ID] AS VARCHAR(10)) + '</x><x>' + [Status] + '</x><x>' + cast([DateCreated] AS VARCHAR(20))) + '</x>' AS XML) AS [tmpRez]
FROM DocumentStatusLogs
GROUP BY DocumentID
) AS [tmpQry]

How to combine Cross Join and String Agg in Bigquery

I am trying to go from the following table
| user_id | path |
| 1 | Impression,Impression,Purchase, Impression, Email, Purchase |
to
| user_id | path |
| 1 | Impression,Impression,Purchase |
| 1 | Impression, Email, Purchase |
In essence I am trying to create a new row for each unique user in the table every time a 'Purchase' is encountered in a comma separated string.
From the little I have gathered I need to use a mixture of cross join and string agg but I tried using a case statement within string agg and was not able to get to the required result.
Is there a better way to do it in SQL (Bigquery).
Thank you

Below is for BigQuery Standard SQL
#standardSQL
with `project.dataset.table` as (
select 1 user_id, 'Impression,Impression,Purchase, Impression, Email, Purchase' path union all
select 2, 'Impression,Purchase,Impression,Purchase, Impression, Email'
)
select user_id, part
from `project.dataset.table`,
unnest(split(regexp_replace(path, r'Purchase,?\s*', r'Purchase|'), '|')) part
where trim(part) != ''
with output

I think this query generates what you need. It uses REGEXP_EXTRACT_ALL and LTRIM
with data as (
select 1 as x, 'Impression, Impression, Purchase, Impression, Email, Purchase, Purchase, Something, Purchase' as y
UNION ALL
select 2 as x, 'Impression, Impression, Purchase, Impression, Email, Purchase, Purchase, Something, Purchase' as y
)
select x, ltrim(q, ', ') from data, unnest(REGEXP_EXTRACT_ALL(y, '.+?Purchase')) as q;

How insert into a new table from another table cross joined with another table

I'm using Python and SQLite to manipulate a database.
I have a SQLite table Movies in database Data that looks like this:
| ID | Country
+----------------+-------------
| 1 | USA, Germany, Mexico
| 2 | Brazil, Peru
| 3 | Peru
I have a table Countries in the same database that looks like this
| ID | Country
+----------------+-------------
| 1 | USA
| 1 | Germany
| 1 | Mexico
| 2 | Brazil
| 2 | Peru
| 3 | Peru
I want to insert from database Data all movies from Peru into a new database PeruData that looks like this
| ID | Country
+----------------+-------------
| 2 | Peru
| 3 | Peru
I'm new to SQL and having trouble programming the right query.
Here's my attempt:
con = sqlite3.connect("PeruData.db")
cur = con.cursor()
cur.execute("CREATE TABLE Movies (ID, Country);")
cur.execute("ATTACH DATABASE 'Data.db' AS other;")
cur.execute("\
INSERT INTO Movies \
(ID, Country) \
SELECT ID, Country
FROM other.Movies CROSS JOIN other.Countries\
WHERE other.Movies.ID = other.Countries.ID AND other.Countries.Country = 'Peru'\
con.commit()
con.close()
Clearly, I'm doing something wrong because I get the error
sqlite3.OperationalError: no such table: other.Countries

Here's a workaround which successfully got the result you wanted.
Instead of having to write con = sqlite3.connect("data.db") and then having to write con.commit() and con.close(), you can shorten your code to written like this:
with sqlite3.connect("Data.db") as connection:
c = connection.cursor()
This way you won't have to commit the changes and close each time you're working with a database. Just a nifty shortcut I learned. Now onto your code...
Personally, I'm unfamiliar with the SQL statement ATTACH DATABASE. I would incorporate your new database at the end of your program instead that way you can avoid any conflicts that you aren't knowledgeable of handling (such as your OperationalError given). So first I would begin to get the desired result and then insert that into your new table. Your third execution statement can be rewritten like so:
c.execute("""SELECT DISTINCT Movies.ID, Countries.Country
FROM Movies
CROSS JOIN Countries
WHERE Movies.ID = Countries.ID AND Countries.Country = 'Peru'
""")
This does the job, but you need to use fetchall() to return your result set in a list of tuples which can then be inserted into your new table. So you'd type this:
rows = c.fetchall()
Now you can open a new connection by creating the "PeruData.db" database, creating the table, and inserting the values.
with sqlite3.connect("PeruData.db") as connection:
c = connection.cursor()
c.execute("CREATE TABLE Movies (ID INT, Country TEXT)")
c.executemany("INSERT INTO Movies VALUES(?, ?)", rows)
That's it.
Hope I was able to answer your question!

The current error is probably caused by a typo or other minor problem. I could create the databases described here and successfully do the insert after fixing minor errors: a missing backslash at an end of line and adding qualifiers for the selected columns.
But I also advise your to use aliases for the tables in multi-table selects. The code that works in my test is:
cur.execute("\
INSERT INTO Movies \
(ID, Country) \
SELECT m.ID, c.Country\
FROM other.Movies m CROSS JOIN other.Countries c \
WHERE m.ID = c.ID AND c.Country = 'Peru'")

SQLite: Transposing results of a GROUP BY and filling in IDs with names

My question is rather specific, if you have a better title please suggest one. Also, formatting is bad - didn't know how to combine lists and codeblocks.
I have an SQLite3 database with the following (relevant parts of the) .schema:
CREATE TABLE users (id INTEGER PRIMARY KEY NOT NULL, user TEXT UNIQUE);
CREATE TABLE locations (id INTEGER PRIMARY KEY NOT NULL, name TEXT UNIQUE);
CREATE TABLE purchases (location_id INTEGER, user_id INTEGER);
CREATE TABLE sales (location_id integer, user_id INTEGER);
purchases has about 4.5mil entries, users about 300k, sales about 100k, and locations about 250 - just to gauge the data volume.
My desired use would be to generate a JSON object to be handed off to another application, very much condensed in volume by doing the following:
-GROUPing both purchases and sales into one common table BY location_id,user_id - IOW, getting the number of "actions" per user per location. That I can do, result is something like
loc | usid | loccount
-----------------------
1 | 1246 | 123
1 | 2345 | 1
13 | 1246 | 46
13 | 8732 | 4
27 | 2345 | 41
(At least it looks good, always hard to tell with such volumes; query:
select location_id,user_id,count(location_id) from
(select location_id,user_id from purchases
union all
select location_id,user_id from sales)
group by location_id,user_id order by user_id`
)
-Then, transposing that giant table such that I would get:
usid | loc1 | loc13 | loc27
---------------------------
1246 | 123 | 46 | 0
2345 | 1 | 0 | 41
8732 | 0 | 4 | 0
That I cannot do, and it's my absolutely crucial point for this question. I tried some things I found online, especially here, but I just started SQLite a little while ago and don't understand many queries.
-Lastly, translate the table into plain text in order to write it to JSON:
user | AAAA | BBBBB | CCCCC
---------------------------
zeta | 123 | 46 | 0
beta | 1 | 0 | 41
iota | 0 | 4 | 0
That I probably could do with quite a bit of experimentation and inner join, although I'm always very unsure what way is the best approach to handle such data volumes, hence I wouldn't mind a pointer.
The whole thing is written in Python's sqlite3 interface, if it matters. In the end, I'd love to have something I could just do a "for" loop per user over in order to generate the JSON, which would then of course be very simple. It doesn't matter if the query takes a long time (<10min would be nice), it's only run twice per day as a sort of backup. I've only got a tiny VPS available, but being limited to a single core the performance is as good as on my reasonably powerful desktop. (i5-3570k running Debian.)
The table headers are just examples because I wasn't quite sure if I could use integers for them (didn't discover the syntax if so), as long as I'm somehow able to look up the numeric part in the locations table I'm fine. Same for translating the user IDs into names. The number of columns is known beforehand - they're after all just INTEGER PRIMARY KEYs and I have a list() of them from some other operation. The number of rows can be determined reasonably quickly, ~3s, if need be.

Consider using subqueries to achieve your desired transposed output:
SELECT DISTINCT m.usid,
IFNULL((SELECT t1.loccount FROM tablename t1
WHERE t1.usid = m.usid AND t1.loc=1),0) AS Loc1,
IFNULL((SELECT t2.loccount FROM tablename t2
WHERE t2.usid = m.usid AND t2.loc=13),0) AS Loc13,
IFNULL((SELECT t3.loccount FROM tablename t3
WHERE t3.usid = m.usid AND t3.loc=27),0) AS Loc27
FROM tablename As m
Alternatively, you can use nested IF statements (or in the case of SQLite that uses CASE/WHEN) as derived table:
SELECT temp.usid, Max(temp.loc1) As Loc1,
Max(temp.loc13) As Loc13, Max(temp.loc27) As Loc27
FROM
(SELECT tablename.usid,
CASE WHEN loc=1 THEN loccount ELSE 0 As Loc1 END,
CASE WHEN loc=13 THEN loccount ELSE 0 As Loc13 END,
CASE WHEN loc=27 THEN loccount ELSE 0 As Loc27 END
FROM tablename) AS temp
GROUP BY temp.usid

Django group by id then select max timestamp

It might be a redundant question, but I have tried previous answers from other related topics and still can't figure it out.
I have a table Board_status looks like this (multiple status and timestamp for each board):
time | board_id | status
-------------------------------
2012-4-5 | 1 | good
2013-6-6 | 1 | not good
2013-6-7 | 1 | alright
2012-6-8 | 2 | good
2012-6-4 | 3 | good
2012-6-10 | 2 | good
Now I want to select all records from Board_status table, group all of them by board_id for distinct board_id, then select the latest status on each board. Basically end up with table like this (only latest status and timestamp for each board):
time | board_id | status
------------------------------
2013-6-7 | 1 | alright
2012-6-4 | 3 | good
2012-6-10 | 2 | good
I have tried:
b = Board_status.objects.values('board_id').annotate(max=Max('time')).values_list('board_id','max','status')
but doesn't seem like it is working. Still give me more than 1 record per board_id.
Which command should I use in Django to do this?

An update, this is the solution I use. Not the best, but it works for now:
b=[]
a = Board_status.objects.values('board_id').distinct()
for i in range(a.count()):
b.append(Board_status.objects.filter(board_id=a[i]['board_id']).latest('time'))
So I got all board_id, store into list a. Then for each board_id, do another query to get the latest time. Any better answer is still welcomed.

How will it work? You neither have filter nor distinct to filter out the duplicates. I am not sure if this can be easily done in a single django query. You should read more on:
https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.distinct
https://docs.djangoproject.com/en/1.4/topics/db/aggregation/
If you can't do it in 1 raw sql query, you can't do it with an OR mapper either as it's built on top of mysql (in your case). Can you tell me how you would do this via raw SQL?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Merging distinct results SQL - python

You seem to want aggregation: select min(Experiment_ID) as Experiment_ID, Antibiotic, Strain, group_concat(Medium order by Experiment_ID separator ', ') from t where Experiment_ID in (1, 2) group by Antibiotic, Strain;

Related

Remove similar repeated groups in same column [duplicate]

How to combine Cross Join and String Agg in Bigquery

How insert into a new table from another table cross joined with another table

SQLite: Transposing results of a GROUP BY and filling in IDs with names

Django group by id then select max timestamp

Categories

Resources