I have a question about SQL.
I have the following sample table called Records:

record_id  subject          start_timestamp      end_timestamp        interval
2          Start Product 2  2021-04-21T16:22:39  2021-04-21T16:23:40  0.97
3          error 1          2021-04-21T16:25:44  2021-04-21T16:25:54  10.0
4          End Product 2    2021-04-21T16:30:13  2021-04-21T16:30:14  0.97
5          Start Product 1  2021-04-21T16:35:13  2021-04-21T16:35:13  0.6
6          End Product 1    2021-04-21T16:36:13  2021-04-21T16:36:13  0.45
First I select all the items that have 'start' in their subject and are not in the table BackupTO (for now, the table BackupTO is not important):
SELECT Records.record_id, Records.start_timestamp, Records.interval FROM Records
LEFT JOIN BackupTO ON BackupTO.record_id = Records.record_id
WHERE BackupTO.record_id IS NULL AND Records.log_type = 1 AND Records.subject LIKE '%start%'
When I run this I get:

record_id  start_timestamp      interval
2          2021-04-21T16:22:39  0.97
5          2021-04-21T16:35:13  0.6
Okay, all good. Now comes my question: I fetch this in Python and loop through the data. First I calculate the product number based on the interval with:
product = round(result[2] / 0.5)
So an interval of 0.97 is product 2, and an interval of 0.6 or 0.45 is product 1. All great!
So I know record_id 2 is product 2, and now I want to execute a SQL query that returns all items starting from record_id 2 until it finds an item that has %end%2 in its subject (the 2 is for product 2; it could also be product 1).
For example, when it finds Start Product 2, I get a list with record_id 3 and 4.
I want to get all items from the start until the end.
So it gets me a list like this; these are all the items found under Start Product 2 until %end%2 was found. For product 1 it would return just record_id 6, because there is nothing between the start and the end.
record_id  start_timestamp      interval
3          2021-04-21T16:25:44  10.0
4          2021-04-21T16:30:13  0.97
I tried OFFSET and FETCH, but I couldn't get it to work. Could somebody help me out here?
Use your query as a CTE and join it to the table Records.
Then with the MIN() window function find the record_id up to which you want the rows returned:
WITH cte AS (
    SELECT r.*
    FROM Records r LEFT JOIN BackupTO b
        ON b.record_id = r.record_id
    WHERE b.record_id IS NULL AND r.log_type = 1 AND r.subject LIKE '%start%'
)
SELECT *
FROM (
    SELECT r.*,
           MIN(CASE WHEN r.subject LIKE '%end%' THEN r.record_id END) OVER () id_end
    FROM Records r INNER JOIN cte c
        ON r.record_id > c.record_id
    WHERE c.record_id = ?
)
WHERE COALESCE(record_id <= id_end, 1)
Change ? to 2 or 5 for each case.
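From Python you can then run this once per start row. A minimal sketch assuming SQLite (the boolean COALESCE in the query suggests it), using the built-in sqlite3 module; the database file name is hypothetical:

import sqlite3

sql = """
WITH cte AS (
    SELECT r.*
    FROM Records r LEFT JOIN BackupTO b ON b.record_id = r.record_id
    WHERE b.record_id IS NULL AND r.log_type = 1 AND r.subject LIKE '%start%'
)
SELECT *
FROM (
    SELECT r.*,
           MIN(CASE WHEN r.subject LIKE '%end%' THEN r.record_id END) OVER () id_end
    FROM Records r INNER JOIN cte c ON r.record_id > c.record_id
    WHERE c.record_id = ?
)
WHERE COALESCE(record_id <= id_end, 1)
"""

conn = sqlite3.connect('records.db')  # hypothetical database file
for start_id in (2, 5):               # the record_ids found by the first query
    rows = conn.execute(sql, (start_id,)).fetchall()
    # rows holds every record after start_id up to and including the matching end row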
If you have the record_ids returned by your query, it is simpler:
SELECT *
FROM (
    SELECT r.*,
           MIN(CASE WHEN r.subject LIKE '%end%' THEN r.record_id END) OVER () id_end
    FROM Records r
    WHERE r.record_id > ?
)
WHERE COALESCE(record_id <= id_end, 1)
If I have two tables. TableA:

primary key  join_id  distance  value
1            1        50        A
2            1        100       B
3            1        150       C
4            2        50        AA
5            2        100       BB
6            2        150       CC
Table B:

join_id  other_value
1        X
2        Y
And I want to form Table C:

join_id  other_value  dist_50  dist_100  dist_150
1        X            A        B         C
2        Y            AA       BB        CC
What is the simplest strategy for this? I'm working with pandas DataFrames, which I convert into a sqlite3 db so I can use pd.read_sql() for queries.
So far, I've tried writing a loop to create a new table for each distance, using a WHERE filter:
distances = [50, 100, 150]
for dist in distances:
    int_tableA_df = pd.read_sql('SELECT * FROM tableA '
                                'WHERE distance == {}'.format(dist), mydb)
    int_tableA_df.to_sql('int_tableA{}'.format(dist), mydb)
This does work, but I can't imagine it's the optimal solution, particularly when scaled up (it yields a lot of tables). Then I tried to loop in a similar way to join with each of those previous intermediary tables (assume tableB_df is previously defined):
for dist in distances:
    tableB_df.to_sql('tableB{}'.format(dist), mydb)
    tableB_df = pd.read_sql('SELECT tableB{0}.*, tableA{0}.distance AS dist{0} '
                            'FROM tableB{0} '
                            'LEFT JOIN tableA{0} '
                            'ON tableB{0}.result_id = tableA{0}.result_id'.format(dist), mydb)
This gives me a duplicate column level 0 error. I'm not sure I understand why I would have duplicate columns when I'm creating a new table each loop and just adding one column (with a distinct name) to it. Any insights into where I'm going wrong would be appreciated!
Also, there has to be a cleaner way to solve this, right? I know I could just write out a series of left joins for this small example, but that won't scale well with a lot more distances.
Thank you!
Consider conditional aggregation for a single SQL query pull (MAX is used because value holds strings; SUM would return 0 on text values in SQLite):
sql = '''SELECT a.join_id,
                b.other_value,
                MAX(CASE WHEN a.distance = 50 THEN a.value END) AS dist_50,
                MAX(CASE WHEN a.distance = 100 THEN a.value END) AS dist_100,
                MAX(CASE WHEN a.distance = 150 THEN a.value END) AS dist_150
         FROM tableA a
         LEFT JOIN tableB b ON a.join_id = b.join_id
         GROUP BY a.join_id, b.other_value
      '''
dist_df = pd.read_sql(sql, mydb)
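To sanity-check it against the sample data, here is a quick sketch with an in-memory SQLite database (illustrative only; it assumes the tableB column is named other_value):

import sqlite3
import pandas as pd

mydb = sqlite3.connect(':memory:')
pd.DataFrame({'join_id': [1, 1, 1, 2, 2, 2],
              'distance': [50, 100, 150, 50, 100, 150],
              'value': ['A', 'B', 'C', 'AA', 'BB', 'CC']}).to_sql('tableA', mydb, index=False)
pd.DataFrame({'join_id': [1, 2],
              'other_value': ['X', 'Y']}).to_sql('tableB', mydb, index=False)

dist_df = pd.read_sql(sql, mydb)  # sql is the query string defined above
# dist_df now matches Table C: one row per join_id with dist_50/dist_100/dist_150 columns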
I have 2 tables in PostgreSQL:
"student" table
student_id name score
1 Adam 10
2 Brian 9
"student_log" table:-
log_id student_id score
1 1 10
2 2 9
I have a Python script which fetches a DataFrame with columns "name" and "score" and then populates the student table with it.
I want to update the student and student_log table whenever the "score" changes for a student. Also, if there is a new student name in the dataframe, I want to add another row for it in the student table as well as maintain its record in the "student_log" table. Can anyone suggest how it can be done?
Let us say the newly fetched DataFrame looks like this:
name score
Adam 7
Lee 5
Then the expected result is:
"student" table
student_id name score
1 Adam 7
2 Brian 9
3 Lee 5
"student_log" table:-
log_id student_id score
1 1 10
2 2 9
3 1 7
4 3 5
I finally found a good answer. I used a trigger, a function, and a CTE.
I created a function to log changes, along with a trigger to handle the updates. Following is the code.
CREATE OR REPLACE FUNCTION log_last_changes()
RETURNS TRIGGER
LANGUAGE PLPGSQL
AS
$$
DECLARE
    serial_num integer;
BEGIN
    IF NEW.name <> OLD.name OR NEW.score <> OLD.score
    THEN
        SELECT SETVAL('log_id_seq', (SELECT MAX(id) FROM log)) INTO serial_num;
        INSERT INTO log(student_id, score)
        VALUES(NEW.id, NEW.score)
        ON CONFLICT DO NOTHING;
    END IF;
    RETURN NEW;
END;
$$;
CREATE TRIGGER log_student
AFTER UPDATE
ON student
FOR EACH ROW
EXECUTE PROCEDURE log_last_changes();
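To see the trigger fire, a quick check (a sketch against the schema assumed above, where the log table is named log and student's primary key is id):

UPDATE student SET score = 7 WHERE id = 1;
-- log_last_changes() runs and inserts (student_id = 1, score = 7) into log
SELECT * FROM log;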
The CTE expression is as follows:
WITH new_values(id, name, score) AS (
    VALUES
        (1, 'Adam', 7),
        (2, 'Brian', 9),
        (3, 'Lee', 5)
),
upsert AS (
    UPDATE student s
    SET name = nv.name,
        score = nv.score
    FROM new_values nv, student s2
    WHERE s.id = nv.id AND s.id = s2.id
    RETURNING s.*
)
INSERT INTO student
SELECT id, name, score
FROM new_values
WHERE NOT EXISTS (
    SELECT 1 FROM upsert up WHERE up.id = new_values.id
);
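As a side note, on PostgreSQL 9.5 and later the same upsert can be written with INSERT ... ON CONFLICT; a sketch, assuming id is the primary key (or at least unique) on student:

INSERT INTO student (id, name, score)
VALUES (1, 'Adam', 7),
       (2, 'Brian', 9),
       (3, 'Lee', 5)
ON CONFLICT (id) DO UPDATE
SET name  = EXCLUDED.name,
    score = EXCLUDED.score;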
I guess you are trying to diff two DataFrames.
Here is an example:
# old student dataframe
old_pd: pd.DataFrame
# new student dataframe
new_pd: pd.DataFrame

# join on the name column (old_pd must be indexed by name for DataFrame.join)
joined_pd = new_pd.join(old_pd.set_index('name'), on='name',
                        lsuffix='_new', rsuffix='_old')
diff_pd = joined_pd[joined_pd['score_new'] != joined_pd['score_old']]
# then insert all rows of diff_pd into the student_log table
# and update the student table
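A rough sketch of applying the diff with psycopg2 (the connection string is a placeholder; brand-new names like Lee would additionally need an INSERT INTO student first, since the UPDATE matches nothing for them):

import psycopg2

conn = psycopg2.connect('dbname=school')  # hypothetical connection string

with conn, conn.cursor() as cur:
    for _, row in diff_pd.iterrows():
        # update the existing student's score
        cur.execute('UPDATE student SET score = %s WHERE name = %s',
                    (row['score_new'], row['name']))
        # append the change to the log
        cur.execute('INSERT INTO student_log (student_id, score) '
                    'SELECT student_id, %s FROM student WHERE name = %s',
                    (row['score_new'], row['name']))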
I am trying to create a rolling stock count from the quantity. The count should reset every time there is a real stock count (TypeOfChange = 3). This should work per ArticleNo and be ordered by date.
I can get a running stock count (lines < 123 in the image) and then take the real stock count when TypeOfChange = 3 (line 123 of the image), but the count doesn't reset; it continues from before the real stock count (ResetRunningTotal in line 124 should be 6293).
The solution should run in SSMS. Alternatively, a Python solution could be used.
My query so far is:
WITH a AS
(
    SELECT DateOfMovement, Quantity, ArticleNo, TypeOfChange,
        CASE
            WHEN TypeOfChange = 3 THEN 0
            ELSE Quantity
        END AS RunningTotal
    FROM Stock_Table
    GROUP BY DateOfMovement, Quantity, ArticleNo, TypeOfChange
)
SELECT *,
    CASE
        WHEN TypeOfChange = 3 THEN Quantity
        ELSE SUM(Quantity) OVER (ORDER BY ArticleNo, DateOfMovement)
    END AS ResetRunningTotal
FROM a
WHERE ArticleNo = 9410
GROUP BY DateOfMovement, ArticleNo, Quantity, TypeOfChange, RunningTotal
ORDER BY DateOfMovement ASC
(An image of the results table was attached; it is not reproduced here.)
OK, so you want running totals for each ArticleNo, ordered by DateOfMovement, that reset whenever you encounter a TypeOfChange value of 3.
To do this you need to create a grouping id (Grp) for each running total. You can do this with a CTE that calculates group ids, then compute the running totals over the CTE results. The conditional SUM increments by one at every TypeOfChange = 3 row, so all rows up to the next real stock count share the same Grp:
with Groups as (
select st.*
, sum(case TypeOfChange when 3 then 1 else 0 end)
over (partition by ArticleNo order by DateOfMovement) Grp
from Stock_Table st
)
select Groups.*
, sum(Quantity) over (partition by ArticleNo, Grp order by DateOfMovement) RunningTotal
from Groups
order by ArticleNo, dateofmovement
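A toy run with made-up rows (hypothetical data, just to show the mechanics):

DECLARE @Stock_Table TABLE (DateOfMovement date, ArticleNo int, TypeOfChange int, Quantity int);
INSERT INTO @Stock_Table VALUES
    ('2023-01-01', 9410, 1, 10),
    ('2023-01-02', 9410, 1, 5),
    ('2023-01-03', 9410, 3, 20),  -- real stock count: Grp increments here
    ('2023-01-04', 9410, 1, 7);

with Groups as (
    select st.*
         , sum(case TypeOfChange when 3 then 1 else 0 end)
             over (partition by ArticleNo order by DateOfMovement) Grp
    from @Stock_Table st
)
select Groups.*
     , sum(Quantity) over (partition by ArticleNo, Grp order by DateOfMovement) RunningTotal
from Groups
order by ArticleNo, DateOfMovement;
-- RunningTotal comes back as 10, 15, 20, 27: the total restarts at the stock-count row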
I know how correlated subqueries work and usually steer away from them, but for what I was trying to do I couldn't find another way to write this query. I was OK with using it because the table didn't have many records, but now I need to do the same thing on tables with more than 100,000 records, which will make this query take a long time. So I need some help rewriting it.
INITIAL QUERY:
SELECT
DATE(m.Created_At) AS m_date,
COUNT(m.id) AS daily_count,
(
SELECT
COUNT(m1.id)
FROM members_joined m1
WHERE m1.has_verified = 1 and DATE(m1.Created_At) <= m_date
) AS member_totals
FROM members_joined AS m
WHERE m.has_verified = 1 and m.Created_At BETWEEN CURDATE() - INTERVAL 30 DAY AND CURDATE()
GROUP BY m_date;
What this query does is get the total count of registered members up to 30 days ago, and then for every day within the last 30 days it computes the running total again:
m_date daily_count member_totals
2014-05-27 4 619
2014-05-29 1 620
2014-05-30 3 623
2014-06-02 4 627
2014-06-03 7 634
2014-06-04 10 644
2014-06-05 12 656
2014-06-06 4 660
2014-06-07 3 663
2014-06-08 3 666
2014-06-09 3 669
2014-06-10 5 674
2014-06-11 3 677
If you look closely, it increments the total by each day's count.
The query itself only takes 0.036 seconds to run, which isn't a big deal,
but on some of the bigger tables it already takes 12 seconds, and they are only going to get bigger.
NOTE: can this be done without using a correlated subquery? Assume you cannot use user-defined variables.
If this can't be done in MySQL, I can hit the database twice and do it in Python, but I was hoping to find a solution in MySQL and not hit the database multiple times.
Thanks for any pointers / help!
This can hardly be done efficiently in pure SQL, but you can use MySQL-specific user-defined variables. You just need to compute the count for each day and accumulate the counts row by row.
The main trick is to force MySQL to update the variable in the correct order.
The query should look like the following (not sure that it is 100% correct):
SELECT m_date, daily_count, @count := (daily_count + @count) AS member_totals
FROM
    (SELECT DATE(Created_At) AS m_date, COUNT(*) AS daily_count
     FROM members_joined
     WHERE DATE(Created_At) >= CURDATE() - INTERVAL 30 DAY AND has_verified = 1
     GROUP BY m_date
    ) AS days,
    (SELECT @count := COUNT(*) AS cnt0
     FROM members_joined
     WHERE DATE(Created_At) < CURDATE() - INTERVAL 30 DAY
       AND has_verified = 1
    ) AS init
ORDER BY m_date;
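On MySQL 8.0 or later the user-variable trick is deprecated and no longer needed; a window function expresses the same accumulation directly. A sketch based on the query above:

SELECT m_date, daily_count,
       SUM(daily_count) OVER (ORDER BY m_date) + init.cnt0 AS member_totals
FROM
    (SELECT DATE(Created_At) AS m_date, COUNT(*) AS daily_count
     FROM members_joined
     WHERE DATE(Created_At) >= CURDATE() - INTERVAL 30 DAY
       AND has_verified = 1
     GROUP BY m_date
    ) AS days
CROSS JOIN
    (SELECT COUNT(*) AS cnt0
     FROM members_joined
     WHERE DATE(Created_At) < CURDATE() - INTERVAL 30 DAY
       AND has_verified = 1
    ) AS init
ORDER BY m_date;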
EDIT: fixed a mistake in the query.
EDIT 2: added a proposed Python approach.
You can rewrite it with an explicit self-join, but I do not think it will be faster:
SELECT
DATE(m.Created_At) AS m_date,
COUNT(m.id) AS daily_count,
COUNT(m1.id) AS member_totals
FROM members_joined AS m
JOIN members_joined m1 ON m1.has_verified = 1 and DATE(m1.Created_At) <= DATE(m.Created_At)
WHERE m.has_verified = 1 and m.Created_At BETWEEN CURDATE() - INTERVAL 30 DAY AND CURDATE()
GROUP BY m_date;
You are asking the database to produce a quadratic number of rows.
As the self-join does not improve performance, I think you should run two simpler queries against your database, and then do the sums in Python.
SELECT
    DATE(m.Created_At) AS m_date,
    COUNT(m.id) AS daily_count
FROM members_joined AS m
WHERE m.has_verified = 1 AND m.Created_At BETWEEN CURDATE() - INTERVAL 30 DAY AND CURDATE()
GROUP BY m_date;
to get daily_count per date for the last 30 days, and
SELECT
    CURDATE() - INTERVAL 30 DAY AS m_date,
    COUNT(m.id) AS member_totals
FROM members_joined AS m
WHERE m.has_verified = 1 AND m.Created_At < CURDATE() - INTERVAL 30 DAY;
to get the cumulative member_totals for the day preceding the first day of the other query.
In Python pseudocode, say the second query initialized member_totals and the first query produced rows (be it a cursor or a list of tuples obtained from fetchall()):
for row in rows:
    dat, daily_count = row
    member_totals += daily_count
    # use dat, daily_count, member_totals
How can I transform this table:
ID ITEM_CODE
--------------------
1 1AB
1 22S
1 1AB
2 67R
2 225
3 YYF
3 1AB
3 UUS
3 F67
3 F67
3 225
......
...into a list of lists, each list being a distinct ID containing its allocated item_codes?
In the form: [[1AB,22S,1AB],[67R,225],[YYF,1AB,UUS,F67,F67,225]]
Using this query:
SELECT ID, ITEM_CODE
FROM table1
ORDER BY ID;
and doing cursor.fetchall() in Python does not return it as a list of lists, nor grouped by ID.
Thank you
You will probably have less post-processing in Python using this query:
SELECT GROUP_CONCAT(ITEM_CODE)
FROM table1
GROUP BY ID
ORDER BY ID;
That will directly produce this result:
1AB,22S,1AB
67R,225
YYF,1AB,UUS,F67,F67,225
After that, cursor.fetchall() will return more or less what you expected, I think.
EDIT:
# each row is a 1-tuple holding the comma-separated string
result = [row[0].split(',') for row in cursor.fetchall()]