Iterative SQL JOINs using pandas + sqlite3 with a for loop?

If I have two tables. TableA:
primary key   join_id   distance   value
-----------------------------------------
1             1         50         A
2             1         100        B
3             1         150        C
4             2         50         AA
5             2         100        BB
6             2         150        CC
Table B:
join_id   other_value
----------------------
1         X
2         Y
And I want to form Table C:
join_id   other_value   dist_50   dist_100   dist_150
-------------------------------------------------------
1         X             A         B          C
2         Y             AA        BB         CC
What is the simplest strategy for this? I'm working with pandas DataFrames, which I convert into a sqlite3 database so I can use pd.read_sql() for queries.
So far, I've tried writing a loop to create a new table for each distance, using a WHERE filter.
distances = [50,100,150]
for dist in distances:
    int_tableA_df = pd.read_sql('SELECT * FROM tableA '
                                'WHERE distance == {}'.format(dist), mydb)
    int_tableA_df.to_sql('int_tableA{}'.format(dist), mydb)
This does work, but I can't imagine it's the optimal solution, particularly when scaled up (it yields a lot of tables). Then, I've tried to loop in a similar way to join with each of those previous intermediary tables. (assume tableB_df is previously defined).
for dist in distances:
    tableB_df.to_sql('tableB{}'.format(dist), mydb)
    tableB_df = pd.read_sql('SELECT tableB{0}.*, tableA{0}.distance AS dist{0} '
                            'FROM tableB{0} '
                            'LEFT JOIN tableA{0} '
                            'ON tableB{0}.result_id = tableA{0}.result_id'.format(dist), mydb)
This gives me a duplicate column level 0 error. I'm not sure I understand why I would have duplicate columns when I'm creating a new table each loop, just adding one column (with a distinct name) to it. Any insights to where I'm going wrong would be appreciated!
Also, there has to be a cleaner way to solve this, right? I know I could just write out a series of left joins for this small example, but that won't scale well with a lot more distances.
Thank you!

Consider conditional aggregation for a single SQL query pull:
sql = '''SELECT a.join_id,
                b.other_value,
                -- MAX rather than SUM: value holds text, which SUM would coerce to a number
                MAX(CASE WHEN a.distance = 50 THEN a.value END) AS dist_50,
                MAX(CASE WHEN a.distance = 100 THEN a.value END) AS dist_100,
                MAX(CASE WHEN a.distance = 150 THEN a.value END) AS dist_150
         FROM tableA a
         LEFT JOIN tableB b ON a.join_id = b.join_id
         GROUP BY a.join_id, b.other_value
      '''
dist_df = pd.read_sql(sql, mydb)
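To scale the conditional aggregation to many distances without typing each CASE expression by hand, the column list can be generated in Python. The following is only a minimal sketch under stated assumptions: the helper name build_pivot_sql and the in-memory sample database are made up for illustration, and it uses MAX because the value column holds text. The distances are cast to int before being formatted into the SQL, since column aliases cannot be passed as bound parameters.

```python
import sqlite3


def build_pivot_sql(distances):
    """Build one conditional-aggregation query with a dist_N column per distance.

    build_pivot_sql is a hypothetical helper, not part of pandas or sqlite3.
    """
    columns = ",\n       ".join(
        "MAX(CASE WHEN a.distance = {0} THEN a.value END) AS dist_{0}".format(int(d))
        for d in distances
    )
    return (
        "SELECT a.join_id, b.other_value,\n       " + columns + "\n"
        "FROM tableA a\n"
        "LEFT JOIN tableB b ON a.join_id = b.join_id\n"
        "GROUP BY a.join_id, b.other_value\n"
        "ORDER BY a.join_id"
    )


# Assumed sample data mirroring Table A and Table B from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tableA (id INTEGER PRIMARY KEY, join_id INTEGER, "
    "distance INTEGER, value TEXT)"
)
conn.execute("CREATE TABLE tableB (join_id INTEGER, other_value TEXT)")
conn.executemany(
    "INSERT INTO tableA VALUES (?, ?, ?, ?)",
    [(1, 1, 50, "A"), (2, 1, 100, "B"), (3, 1, 150, "C"),
     (4, 2, 50, "AA"), (5, 2, 100, "BB"), (6, 2, 150, "CC")],
)
conn.executemany("INSERT INTO tableB VALUES (?, ?)", [(1, "X"), (2, "Y")])

pivot_rows = conn.execute(build_pivot_sql([50, 100, 150])).fetchall()
for row in pivot_rows:
    print(row)
```

With pandas, the same string could be passed as pd.read_sql(build_pivot_sql(distances), mydb), so no intermediate tables are needed.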

Related

SQLite fetch until

I have a question about SQL.
I have the following sample table called Records:
record_id   subject           start_timestamp       end_timestamp         interval
------------------------------------------------------------------------------------
2           Start Product 2   2021-04-21T16:22:39   2021-04-21T16:23:40   0.97
3           error 1           2021-04-21T16:25:44   2021-04-21T16:25:54   10.0
4           End Product 2     2021-04-21T16:30:13   2021-04-21T16:30:14   0.97
5           Start Product 1   2021-04-21T16:35:13   2021-04-21T16:35:13   0.6
6           End Product 1     2021-04-21T16:36:13   2021-04-21T16:36:13   0.45
First I select all the items that have "start" in their subject and are not in the table BackupTO (for now the table BackupTO is not important):
SELECT Records.record_id, Records.start_timestamp, Records.interval FROM Records
LEFT JOIN BackupTO ON BackupTO.record_id = Records.record_id
WHERE BackupTO.record_id IS NULL AND Records.log_type = 1 AND Records.subject LIKE '%start%'
When I run this, I get:
record_id   start_timestamp       interval
-------------------------------------------
2           2021-04-21T16:22:39   0.97
5           2021-04-21T16:35:13   0.6
OK, all good. Now comes my question. I fetch this in Python and loop through the data. First I calculate the product number based on the interval with:
product = round(result[2] / 0.5)
So an interval of 0.97 is product 2, and an interval of 0.6 or 0.45 is product 1. All great!
So I know record_id 2 is product 2, and I want to execute a SQL query that returns all items after record_id 2 until it finds an item whose subject matches %end%2 (the 2 is for product 2; it could also be product 1).
For example, when it finds Start Product 2, I get a list with record_id 3 and 4.
I want to get all items from the start until the end.
So it gives me a list like this: these are all the items found after Start Product 2 until %end%2 was found. For product 1, it would return just record_id 6, because there is nothing between the start and the end.
record_id   start_timestamp       interval
-------------------------------------------
3           2021-04-21T16:22:39   10.0
4           2021-04-21T16:35:13   0.97
I tried OFFSET and FETCH, but I couldn't get them to work. Could somebody help me out here?

Use your query as a CTE and join it to the table Records.
Then with MIN() window function find the record_id up to which you want the rows returned:
WITH cte AS (
  SELECT r.*
  FROM Records r LEFT JOIN BackupTO b
  ON b.record_id = r.record_id
  WHERE b.record_id IS NULL AND r.log_type = 1 AND r.subject LIKE '%start%'
)
SELECT *
FROM (
  SELECT r.*,
         MIN(CASE WHEN r.subject LIKE '%end%' THEN r.record_id END) OVER () id_end
  FROM Records r INNER JOIN cte c
  ON r.record_id > c.record_id
  WHERE c.record_id = ?
)
WHERE COALESCE(record_id <= id_end, 1)
Change ? to 2 or 5 for each case.
If you have the record_ids returned by your query, it is simpler:
SELECT *
FROM (
  SELECT r.*,
         MIN(CASE WHEN r.subject LIKE '%end%' THEN r.record_id END) OVER () id_end
  FROM Records r
  WHERE r.record_id > ?
)
WHERE COALESCE(record_id <= id_end, 1)
See the demo.
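For calling this from the Python loop described in the question, here is a rough sketch. It is a variant, not the query above: it swaps the MIN() window function for a scalar subquery (so it does not depend on window-function support in the bundled SQLite) and it includes the product number in the end pattern. The function name rows_until_end and the in-memory sample data are assumptions for illustration.

```python
import sqlite3

# Assumed sample data copied from the Records table in the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Records (record_id INTEGER PRIMARY KEY, subject TEXT, "
    "start_timestamp TEXT, end_timestamp TEXT, interval REAL)"
)
conn.executemany(
    "INSERT INTO Records VALUES (?, ?, ?, ?, ?)",
    [
        (2, "Start Product 2", "2021-04-21T16:22:39", "2021-04-21T16:23:40", 0.97),
        (3, "error 1", "2021-04-21T16:25:44", "2021-04-21T16:25:54", 10.0),
        (4, "End Product 2", "2021-04-21T16:30:13", "2021-04-21T16:30:14", 0.97),
        (5, "Start Product 1", "2021-04-21T16:35:13", "2021-04-21T16:35:13", 0.6),
        (6, "End Product 1", "2021-04-21T16:36:13", "2021-04-21T16:36:13", 0.45),
    ],
)


def rows_until_end(conn, start_id, interval):
    """Return rows after start_id up to and including the matching end row.

    rows_until_end is a hypothetical helper. If no end row exists, every row
    after start_id is returned, mirroring the COALESCE fallback above.
    """
    product = round(interval / 0.5)  # same product formula as in the question
    sql = """
        SELECT r.record_id, r.subject, r.interval
        FROM Records r
        WHERE r.record_id > :start_id
          AND r.record_id <= COALESCE(
                (SELECT MIN(e.record_id)
                 FROM Records e
                 WHERE e.record_id > :start_id
                   AND e.subject LIKE :end_pattern),
                r.record_id)
        ORDER BY r.record_id
    """
    params = {"start_id": start_id, "end_pattern": "%end%{}".format(product)}
    return conn.execute(sql, params).fetchall()


product_2_rows = rows_until_end(conn, 2, 0.97)
product_1_rows = rows_until_end(conn, 5, 0.6)
print(product_2_rows)
print(product_1_rows)
```

SQLite's LIKE is case-insensitive for ASCII characters by default, which is why the lowercase pattern matches "End Product 2".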

SQL code in python - double indexing pivot table counts

I created my own DB and successfully loaded my CSV data into it via Python. Queries work, and I run them from Python with the sqlite3 package. In short, the table holds: the ID of the user, the type of action performed, and the ID of the object that the action was performed on.
df1 and df2 are the DataFrames.
import pandas as pd
import sqlite3
connection = sqlite3.connect("test.db")
c = connection.cursor()
df1.to_sql('tab1', con = connection, if_exists = 'replace', chunksize = 1000)
df2.to_sql('tab2', con = connection, if_exists = 'replace', chunksize = 1000)
In this SQL command, I need to select the columns, join one table, and keep only data from a certain time range. I include the full command here for completeness, but I won't show all of the columns in the example table, since some are only used for filtering. Note that the subquery without the pivot works.
sql_command = """
SELECT * FROM
(SELECT tab1.action,
tab1.iduser,
tab2.idobject
FROM tab1,
tab2
LEFT JOIN tab2 ON tab1.objectid = tab2.objectid
WHERE tab2.Datetime BETWEEN '2020-01-29' AND '2020-02-04'
) AS source
PIVOT(
COUNT(tab1.iduser)
FOR tab1.action IN (
[a],
[b],
[c])
) AS pivot_table;"""
# execute query
c.execute(sql_command)
This is the table I get from the subquery basically:
iduser action idobject
--------------------------------------
1 a Obj1
1 b Obj2
1 c Obj3
1 a Obj2
2 a Obj1
2 c Obj2
2 a Obj2
2 b Obj1
3 a Obj1
3 c Obj3
Now I want to create a table like the following, i.e. count the number of actions of each type performed on each object.
action idobject Count
----------------------------------------------------------------------
a Obj1 3
Obj2 2
Obj3 0
b Obj1 1
Obj2 1
Obj3 0
c Obj1 0
Obj2 1
Obj3 2
If I run the code above with the PIVOT clause, I get the error "Error: Incorrect syntax near "PIVOT"".
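SQLite does not implement the PIVOT operator (it is SQL Server syntax), so the counts have to be expressed with ordinary joins and GROUP BY. Below is a minimal sketch, assuming the subquery result shown above has been stored in a table called source (a made-up name for this example). The cross join of distinct actions and objects is one way to keep the combinations whose count is 0.

```python
import sqlite3

# Assumed table "source" holding the subquery output shown in the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (iduser INTEGER, action TEXT, idobject TEXT)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?)",
    [
        (1, "a", "Obj1"), (1, "b", "Obj2"), (1, "c", "Obj3"), (1, "a", "Obj2"),
        (2, "a", "Obj1"), (2, "c", "Obj2"), (2, "a", "Obj2"), (2, "b", "Obj1"),
        (3, "a", "Obj1"), (3, "c", "Obj3"),
    ],
)

# Every (action, object) pair is generated first, then matched against the
# actual rows; COUNT(s.iduser) ignores the NULLs of unmatched pairs, giving 0.
sql = """
    SELECT acts.action, objs.idobject, COUNT(s.iduser) AS n
    FROM (SELECT DISTINCT action FROM source) AS acts
    CROSS JOIN (SELECT DISTINCT idobject FROM source) AS objs
    LEFT JOIN source AS s
           ON s.action = acts.action AND s.idobject = objs.idobject
    GROUP BY acts.action, objs.idobject
    ORDER BY acts.action, objs.idobject
"""
counts = conn.execute(sql).fetchall()
for action, idobject, n in counts:
    print(action, idobject, n)
```

If the zero rows are not needed, a plain SELECT action, idobject, COUNT(*) ... GROUP BY action, idobject on the same table would be enough.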

Merging distinct results SQL

So my problem is the following:
I have a MariaDB database that I am trying to query.
I am looking at experiment data. I would like to summarize several experiments into one result row of an SQL query.
This is what the data looks like:
Experiment_ID | Antibiotic | Strain | Medium | ...
1 Ampicillin E. coli TBauto ...
2 Ampicillin E. coli TB + IPTG
What I would like to get:
Experiment_ID | Antibiotic | Strain | Medium | ...
1 Ampicillin E. coli TBauto, TB + IPTG ...
I don't care about the Experiment_ID; it is just there to make clear that I am talking about two distinct entries.
I already tried
Select
tmp.*
From
((Select * from my_tbl Where ExperimentID = 1)
Union
(Select * from my_tbl Where ExperimentID = 2)) as tmp
There I still end up with 2 rows of results. I could also do it in my Python code. But I would like to not have to modify results that I grab from the DB.
I guess I must have just been looking for the wrong word, to merge those two entries. So please kindly guide me in the right direction.

You seem to want aggregation:
select min(Experiment_ID) as Experiment_ID, Antibiotic, Strain,
       group_concat(Medium order by Experiment_ID separator ', ') as Medium
from my_tbl
where Experiment_ID in (1, 2)
group by Antibiotic, Strain;
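To try the aggregation idea without a MariaDB server, the sketch below runs an equivalent query against an in-memory SQLite database from Python. This is an approximation, not MariaDB syntax: SQLite's group_concat takes the separator as a second argument, and this form gives no guarantee about the order of the concatenated values. The table name my_tbl and the two sample rows are taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_tbl (Experiment_ID INTEGER, Antibiotic TEXT, "
    "Strain TEXT, Medium TEXT)"
)
conn.executemany(
    "INSERT INTO my_tbl VALUES (?, ?, ?, ?)",
    [(1, "Ampicillin", "E. coli", "TBauto"),
     (2, "Ampicillin", "E. coli", "TB + IPTG")],
)

# One output row per (Antibiotic, Strain); the media are joined into one string.
merged = conn.execute(
    """
    SELECT MIN(Experiment_ID) AS Experiment_ID, Antibiotic, Strain,
           group_concat(Medium, ', ') AS Medium
    FROM my_tbl
    WHERE Experiment_ID IN (?, ?)
    GROUP BY Antibiotic, Strain
    """,
    (1, 2),
).fetchall()
print(merged)
```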

How to get last record in sqlite3 db with where condition

I am trying to search records in a sqlite3 table to get the last record inserted that matches a WHERE condition, but I can only do it with a single condition, WHERE CODE = df with df = "DS3243". What I want is to use multiple codes, jf = "QS2134", df = "DS3243", sf = "MS5787", so that I get the last record inserted for each of the codes provided.
DEMONSTRATION
CODE POINT
QS2134 1000
DS3244 2000
MS5787 3000
QS2134 130
QS2134 200 # want to get this because it last with such code
DS3244 300
MS5787 4500
DS3244 860 # want to get this because it last with such code
MS5787 567
MS5787 45009 # want to get this because it last with such code
I am able to do this for only one variable with cur.execute("SELECT * FROM PROFILE WHERE CODE=? ORDER BY POINT ASC LIMIT 1 ",(df,)), but I want to do it for multiple variables.
import sqlite3
jf = "QS2134"
df = "DS3243"
sf = "MS5787"
con = sqlite3.connect("TEST.db")
cur = con.cursor()
cur.execute("SELECT * FROM PROFILE WHERE CODE=? ORDER BY POINT ASC LIMIT 1 ",(df,)) # limit one means last one
rows = cur.fetchall()
for row in rows:
    print(row)
con.commit()
con.close()

I'm not sure I understand your question, but is it possible that you meant that you want to group the results?
Is it "group by" clause that you're looking for?
Something like:
select CODE, MAX(POINT) from PROFILE group by CODE;

I think you are simply trying to extend your query, in which case, why don't you try string formatting?
x = "SELECT * FROM my_table where col1 = '{0}' or col2 ='{1}';".format(var_1, var_2)
cur.execute(x)
That way you can extend your query with as many conditions as you like.
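An alternative sketch keeps the values out of the SQL string and asks for the last inserted row of each code in one query. It rests on an assumption that is not stated in the question: that PROFILE is an ordinary rowid table and rows are never deleted, so a larger rowid means a later insert. Only the "?" placeholders are formatted into the string; the codes themselves are passed as parameters. The code list uses DS3244 as it appears in the demonstration table.

```python
import sqlite3

# Assumed sample data in the same insertion order as the demonstration table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PROFILE (CODE TEXT, POINT INTEGER)")
conn.executemany(
    "INSERT INTO PROFILE VALUES (?, ?)",
    [
        ("QS2134", 1000), ("DS3244", 2000), ("MS5787", 3000), ("QS2134", 130),
        ("QS2134", 200), ("DS3244", 300), ("MS5787", 4500), ("DS3244", 860),
        ("MS5787", 567), ("MS5787", 45009),
    ],
)

codes = ["QS2134", "DS3244", "MS5787"]
placeholders = ", ".join("?" for _ in codes)  # "?, ?, ?"

# The inner query finds the highest rowid per code; the outer query fetches
# the full rows that carry those rowids.
sql = (
    "SELECT p.CODE, p.POINT FROM PROFILE p "
    "WHERE p.rowid IN ("
    "    SELECT MAX(rowid) FROM PROFILE WHERE CODE IN ({}) GROUP BY CODE"
    ") ORDER BY p.rowid"
).format(placeholders)

last_rows = conn.execute(sql, codes).fetchall()
for row in last_rows:
    print(row)
```

If the table has an explicit insertion timestamp or autoincrement column, that column would be a safer choice than rowid for deciding which row is last.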

Run checks on Items from tables in Sqlite and python

I have two tables below:
-----------
Items | QTY
-----------
sugar | 14
mango | 10
apple | 50
berry | 1
-----------
Items | QTY
-----------
sugar | 10
mango | 5
apple | 48
berry | 1
I use the following query in python to check difference between the QTY of table one and table two.
cur = conn.cursor()
cur.execute("select s.Items, s.qty - t.qty as quantity from Stock s join Second_table t on s.Items = t.Items;")
remaining_quantity = cur.fetchall()
I'm a bit stuck on how to accomplish what I need. I need to check the difference between the quantity in table one and the quantity in table two. If the difference is under 5, I want to store those Items in another table with the value 1 in a flag column; otherwise the value should be 0 for those Items. How can I go about this?
Edit:
I have attempted this by looping through the rows and, if the column value is less than 5, inserting into the new table with the value below:
for row in remaining_quantity:
    print(row[1])
    if((row[1]) < 5):
        cur.execute('INSERT OR IGNORE INTO check_quantity_tb VALUES (select distinct s.Items, s.qty, s.qty - t.qty as quantity, 1 from Stock s join Second_table t on s.Items = t.Items'), row)
        print(row)
But I get a SQL syntax error, and I'm not sure where the error could be.

First modify your first query so that you retrieve all the relevant information and don't have to issue subqueries later:
readcursor = conn.cursor()
readcursor.execute(
    "select s.Items, s.qty, s.qty - t.qty as remain "
    "from Stock s join Second_table t on s.Items = t.Items;"
)
Then use it to update your third table:
writecursor = conn.cursor()
for items, qty, remain in readcursor:
    print(remain)
    if remain < 5:
        writecursor.execute(
            'INSERT OR IGNORE INTO check_quantity_tb VALUES (?, ?, ?, ?)',
            (items, qty, remain, 1)
        )
conn.commit()
Note the following points:
1/ We use two distinct cursors so we can iterate over the first one while writing with the second one. This avoids fetching all results into memory, which can be a real lifesaver on huge datasets.
2/ When iterating on the first cursor, we unpack the rows into their individual components. This is called "tuple unpacking" (but it actually works for most sequence types):
>>> row = ("1", "2", "3")
>>> a, b, c = row
>>> a
'1'
>>> b
'2'
>>> c
'3'
3/ We let the db-api module do the proper sanitisation and escaping of the values we want to insert. This avoids headaches with escaping / quoting etc and protects your code from SQL injection attacks (not that you might have one here, but that's the correct way to write parameterized queries in Python).
NB: since you did not post your full table definitions or clear specs, not even the full error message and traceback, I only translated your code snippet into something more sensible (avoiding the costly and useless subquery, which might or might not be the cause of your error). I can't guarantee it will work out of the box, but at least it should put you back on track.
NB2: you mentioned you had to set the last column to either 1 or 0 depending on the remain value. If that's the case, you want your loop to be:
writecursor = conn.cursor()
for items, qty, remain in readcursor:
    print(remain)
    flag = 1 if remain < 5 else 0
    writecursor.execute(
        'INSERT OR IGNORE INTO check_quantity_tb VALUES (?, ?, ?, ?)',
        (items, qty, remain, flag)
    )
conn.commit()
If you instead only want to process rows where remain < 5, you can specify it directly in your first query with a where clause.
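Pushing the condition into SQL can go one step further: the flag itself can be computed by the database with a CASE expression, so no Python loop is needed. This is only a sketch under assumptions, because the real table definitions were not posted: the column layout of check_quantity_tb and the column name low_flag are invented for the example, and the sample quantities come from the two tables in the question (with the first one assumed to be Stock).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Stock (Items TEXT PRIMARY KEY, qty INTEGER)")
conn.execute("CREATE TABLE Second_table (Items TEXT PRIMARY KEY, qty INTEGER)")
# Assumed schema for the target table; low_flag is a hypothetical column name.
conn.execute(
    "CREATE TABLE check_quantity_tb (Items TEXT PRIMARY KEY, qty INTEGER, "
    "remain INTEGER, low_flag INTEGER)"
)
conn.executemany(
    "INSERT INTO Stock VALUES (?, ?)",
    [("sugar", 14), ("mango", 10), ("apple", 50), ("berry", 1)],
)
conn.executemany(
    "INSERT INTO Second_table VALUES (?, ?)",
    [("sugar", 10), ("mango", 5), ("apple", 48), ("berry", 1)],
)

# INSERT ... SELECT copies the joined rows in one statement; the CASE
# expression yields 1 when the difference is under 5 and 0 otherwise.
conn.execute(
    """
    INSERT OR IGNORE INTO check_quantity_tb
    SELECT s.Items, s.qty, s.qty - t.qty,
           CASE WHEN s.qty - t.qty < 5 THEN 1 ELSE 0 END
    FROM Stock s JOIN Second_table t ON s.Items = t.Items
    """
)
conn.commit()

flags = dict(conn.execute("SELECT Items, low_flag FROM check_quantity_tb"))
print(flags)
```

To store only the low-stock rows instead, the CASE expression could be replaced by a constant 1 together with WHERE s.qty - t.qty < 5 at the end of the SELECT.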
