Iterative SQL JOINs using pandas + sqlite3 with a for loop? - python

If I have two tables. TableA:

primary key | join_id | distance | value
1           | 1       | 50       | A
2           | 1       | 100      | B
3           | 1       | 150      | C
4           | 2       | 50       | AA
5           | 2       | 100      | BB
6           | 2       | 150      | CC

Table B:

join_id | other value
1       | X
2       | Y

And I want to form Table C:

join_id | other_value | dist_50 | dist_100 | dist_150
1       | X           | A       | B        | C
2       | Y           | AA      | BB       | CC
What is the simplest strategy for this? I'm working with pandas DataFrames, which I convert into a sqlite3 db so I can use pd.read_sql() for queries.
So far, I've tried writing a loop to create a new table for each distance, using a WHERE filter.
distances = [50, 100, 150]
for dist in distances:
    int_tableA_df = pd.read_sql('SELECT * FROM tableA '
                                'WHERE distance == {}'.format(dist), mydb)
    int_tableA_df.to_sql('int_tableA{}'.format(dist), mydb)
This does work, but I can't imagine it's the optimal solution, particularly when scaled up (it yields a lot of tables). Then, I've tried to loop in a similar way to join with each of those previous intermediary tables. (assume tableB_df is previously defined).
for dist in distances:
    tableB_df.to_sql('tableB{}'.format(dist), mydb)
    tableB_df = pd.read_sql('SELECT tableB{0}.*, tableA{0}.distance AS dist{0} '
                            'FROM tableB{0} '
                            'LEFT JOIN tableA{0} '
                            'ON tableB{0}.result_id = tableA{0}.result_id'.format(dist), mydb)
This gives me a duplicate column level 0 error. I'm not sure I understand why I would have duplicate columns when I'm creating a new table each loop, just adding one column (with a distinct name) to it. Any insights to where I'm going wrong would be appreciated!
Also, there has to be a cleaner way to solve this, right? I know I could just write out a series of left joins for this small example, but that won't scale well with a lot more distances.
Thank you!

Consider conditional aggregation for a single SQL query pull:
sql = '''SELECT a.join_id,
                b.other_value,
                MAX(CASE WHEN a.distance = 50 THEN a.value END) AS dist_50,
                MAX(CASE WHEN a.distance = 100 THEN a.value END) AS dist_100,
                MAX(CASE WHEN a.distance = 150 THEN a.value END) AS dist_150
         FROM tableA a
         LEFT JOIN tableB b ON a.join_id = b.join_id
         GROUP BY a.join_id
      '''
dist_df = pd.read_sql(sql, mydb)
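If the goal is to avoid the SQL round-trip entirely, the same reshape can be done in pandas alone with pivot + merge. A minimal sketch, assuming the two tables live in DataFrames named tableA_df and tableB_df (sample data inlined here so it runs standalone):

```python
import pandas as pd

# Sample frames mirroring TableA and TableB from the question
tableA_df = pd.DataFrame({
    'join_id':  [1, 1, 1, 2, 2, 2],
    'distance': [50, 100, 150, 50, 100, 150],
    'value':    ['A', 'B', 'C', 'AA', 'BB', 'CC'],
})
tableB_df = pd.DataFrame({'join_id': [1, 2], 'other_value': ['X', 'Y']})

# Spread each distance into its own column, then attach other_value
wide = (tableA_df
        .pivot(index='join_id', columns='distance', values='value')
        .add_prefix('dist_')       # 50 -> dist_50, etc.
        .reset_index())
tableC_df = tableB_df.merge(wide, on='join_id', how='left')
```

This scales to any number of distances without intermediate tables; each new distance simply becomes one more column.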

Related

SQLite fetch until

I have a question about SQL.
I have the following sample table called Records:

record_id | subject         | start_timestamp     | end_timestamp       | interval
2         | Start Product 2 | 2021-04-21T16:22:39 | 2021-04-21T16:23:40 | 0.97
3         | error 1         | 2021-04-21T16:25:44 | 2021-04-21T16:25:54 | 10.0
4         | End Product 2   | 2021-04-21T16:30:13 | 2021-04-21T16:30:14 | 0.97
5         | Start Product 1 | 2021-04-21T16:35:13 | 2021-04-21T16:35:13 | 0.6
6         | End Product 1   | 2021-04-21T16:36:13 | 2021-04-21T16:36:13 | 0.45
First I select all the items that have "start" in their subject and are not in the table BackupTO (for now the table BackupTO is not important):
SELECT Records.record_id, Records.start_timestamp, Records.interval FROM Records
LEFT JOIN BackupTO ON BackupTO.record_id = Records.record_id
WHERE BackupTO.record_id IS NULL AND Records.log_type = 1 AND Records.subject LIKE '%start%'
When I run this I get:

record_id | start_timestamp     | interval
2         | 2021-04-21T16:22:39 | 0.97
5         | 2021-04-21T16:35:13 | 0.6
Okay, all good. Now comes my question. I fetch this in Python and loop through the data. First I calculate the product number based on the interval with:
product = round(result[2] / 0.5)
So an interval of 0.97 is product 2, and an interval of 0.6 or 0.45 is product 1, all great!
So I know record_id 2 is product 2, and I want to execute a SQL query that returns all items starting from record_id 2 until it finds an item that has %end%2 in its name (the 2 is for product 2; it could also be product 1).
For example, after it finds Start Product 2, I get a list with record_id 3 and 4.
I want to get all items from the start until the end.
So it gets me a list like the one below: these are all the items found under Start Product 2 until %end%2 was found. For product 1, it would just return record_id 6, because there is nothing between the start and stop.
record_id | start_timestamp     | interval
3         | 2021-04-21T16:25:44 | 10.0
4         | 2021-04-21T16:30:13 | 0.97
I tried OFFSET and FETCH, but I couldn't get it to work. Could somebody help me out here?
Use your query as a CTE and join it to the table Records.
Then with the MIN() window function find the record_id up to which you want the rows returned:

WITH cte AS (
    SELECT r.*
    FROM Records r
    LEFT JOIN BackupTO b ON b.record_id = r.record_id
    WHERE b.record_id IS NULL AND r.log_type = 1 AND r.subject LIKE '%start%'
)
SELECT *
FROM (
    SELECT r.*,
           MIN(CASE WHEN r.subject LIKE '%end%' THEN r.record_id END) OVER () AS id_end
    FROM Records r
    INNER JOIN cte c ON r.record_id > c.record_id
    WHERE c.record_id = ?
)
WHERE COALESCE(record_id <= id_end, 1)
Change ? to 2 or 5 for each case.
If you have the record_ids returned by your query, it is simpler:
SELECT *
FROM (
    SELECT r.*,
           MIN(CASE WHEN r.subject LIKE '%end%' THEN r.record_id END) OVER () AS id_end
    FROM Records r
    WHERE r.record_id > ?
)
WHERE COALESCE(record_id <= id_end, 1)
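A self-contained way to sanity-check the second (simpler) query; the table mirrors the Records sample above, and the parameter 2 stands for the start row of product 2. Note that window functions need SQLite 3.25+:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Recreate the sample Records table from the question
conn.executescript("""
CREATE TABLE Records (record_id INT, subject TEXT,
                      start_timestamp TEXT, end_timestamp TEXT, interval REAL);
INSERT INTO Records VALUES
    (2, 'Start Product 2', '2021-04-21T16:22:39', '2021-04-21T16:23:40', 0.97),
    (3, 'error 1',         '2021-04-21T16:25:44', '2021-04-21T16:25:54', 10.0),
    (4, 'End Product 2',   '2021-04-21T16:30:13', '2021-04-21T16:30:14', 0.97),
    (5, 'Start Product 1', '2021-04-21T16:35:13', '2021-04-21T16:35:13', 0.6),
    (6, 'End Product 1',   '2021-04-21T16:36:13', '2021-04-21T16:36:13', 0.45);
""")

# Rows after record 2, up to the first '%end%' row (id_end)
rows = conn.execute("""
    SELECT record_id, start_timestamp, interval
    FROM (
        SELECT r.*,
               MIN(CASE WHEN r.subject LIKE '%end%' THEN r.record_id END) OVER () AS id_end
        FROM Records r
        WHERE r.record_id > ?
    )
    WHERE COALESCE(record_id <= id_end, 1)
""", (2,)).fetchall()
```

With the parameter set to 2 this returns records 3 and 4 (the error row and End Product 2); with 5 it returns only record 6.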

SQL code in python - double indexing pivot table counts

I created my own DB and added my CSV data into it successfully via Python, using the sqlite3 package, so queries work (I use Python to run them). In short, the table is: ID of user, type of action performed, and ID of the object the action was performed on.
df1 and df2 are the DataFrames.
import pandas as pd
import sqlite3
connection = sqlite3.connect("test.db")
c = connection.cursor()
df1.to_sql('tab1', con = connection, if_exists = 'replace', chunksize = 1000)
df2.to_sql('tab2', con = connection, if_exists = 'replace', chunksize = 1000)
In this SQL command, I need to select the columns, join one table and select only data from certain time. I hope it's okay I include it there for completeness but won't include all columns in example table as they're only for the filtering. Note that the subquery without the pivot works.
sql_command = """
SELECT * FROM
    (SELECT tab1.action,
            tab1.iduser,
            tab2.idobject
     FROM tab1
     LEFT JOIN tab2 ON tab1.objectid = tab2.objectid
     WHERE tab2.Datetime BETWEEN '2020-01-29' AND '2020-02-04'
    ) AS source
PIVOT(
    COUNT(tab1.iduser)
    FOR tab1.action IN ([a], [b], [c])
) AS pivot_table;"""
# execute query
c.execute(sql_command)
This is the table I get from the subquery basically:
iduser action idobject
--------------------------------------
1 a Obj1
1 b Obj2
1 c Obj3
1 a Obj2
2 a Obj1
2 c Obj2
2 a Obj2
2 b Obj1
3 a Obj1
3 c Obj3
Now I want to create such table, i.e. count the number of types of actions performed on each object.
action idobject Count
----------------------------------------------------------------------
a Obj1 3
Obj2 2
Obj3 0
b Obj1 1
Obj2 1
Obj3 0
c Obj1 0
Obj2 1
Obj3 2
If I run the code above with the PIVOT table, I get the error "Incorrect syntax near "PIVOT"".
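SQLite has no PIVOT operator (it belongs to the SQL Server / Oracle dialects), which is what the syntax error points at. The counts in the desired table can be produced with a plain GROUP BY instead. A sketch against hypothetical sample data shaped like the subquery output shown above:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
# Hypothetical frame mirroring the subquery result in the question
source = pd.DataFrame({
    'iduser':   [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    'action':   ['a', 'b', 'c', 'a', 'a', 'c', 'a', 'b', 'a', 'c'],
    'idobject': ['Obj1', 'Obj2', 'Obj3', 'Obj2', 'Obj1',
                 'Obj2', 'Obj2', 'Obj1', 'Obj1', 'Obj3'],
})
source.to_sql('source', conn, index=False)

# Count actions per object with GROUP BY -- no PIVOT needed
counts = pd.read_sql(
    "SELECT action, idobject, COUNT(iduser) AS Count "
    "FROM source GROUP BY action, idobject "
    "ORDER BY action, idobject", conn)
```

GROUP BY omits zero-count combinations; if the zeros are wanted too (as in the desired table), pd.crosstab(source.action, source.idobject) produces the full grid with zeros filled in.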

Merging distinct results SQL

So my problem is the following:
I have a MariaDB database that I am trying to query.
I am looking at experiment data. What I would like, is to summarize several experiments into one results row of an SQL query.
This is what the data looks like:

Experiment_ID | Antibiotic | Strain  | Medium    | ...
1             | Ampicillin | E. coli | TBauto    | ...
2             | Ampicillin | E. coli | TB + IPTG | ...

What I would like to get:

Experiment_ID | Antibiotic | Strain  | Medium            | ...
1             | Ampicillin | E. coli | TBauto, TB + IPTG | ...
I don't care about the Experiment_ID, that is just there to make clear the I am talking about two distinct entries.
I already tried
Select tmp.*
From
    ((Select * from my_tbl Where ExperimentID = 1)
     Union
     (Select * from my_tbl Where ExperimentID = 2)) as tmp
There I still end up with 2 rows of results. I could also do it in my Python code. But I would like to not have to modify results that I grab from the DB.
I guess I must have just been looking for the wrong word, to merge those two entries. So please kindly guide me in the right direction.
You seem to want aggregation:
select min(Experiment_ID) as Experiment_ID, Antibiotic, Strain,
       group_concat(Medium order by Experiment_ID separator ', ') as Medium
from my_tbl
where Experiment_ID in (1, 2)
group by Antibiotic, Strain;
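The separator clause above is MariaDB syntax. For anyone testing the same idea with SQLite (as in the other questions on this page), group_concat takes the separator as a second argument and does not guarantee concatenation order. A sketch:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Recreate the two sample experiment rows
conn.executescript("""
CREATE TABLE my_tbl (Experiment_ID INT, Antibiotic TEXT, Strain TEXT, Medium TEXT);
INSERT INTO my_tbl VALUES
    (1, 'Ampicillin', 'E. coli', 'TBauto'),
    (2, 'Ampicillin', 'E. coli', 'TB + IPTG');
""")

# One row per (Antibiotic, Strain), media concatenated
row = conn.execute("""
    SELECT MIN(Experiment_ID), Antibiotic, Strain,
           GROUP_CONCAT(Medium, ', ')
    FROM my_tbl
    WHERE Experiment_ID IN (1, 2)
    GROUP BY Antibiotic, Strain
""").fetchone()
```

Both experiments collapse into a single result row, with the two media joined into one string.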

How to get last record in sqlite3 db with where condition

I am trying to search records in a sqlite3 table to get the last record inserted, with a where condition. I can do it with only one condition, WHERE CODE = df ("DS3243"). But what I want to do is use multiple WHERE conditions (jf = "QS2134", df = "DS3243", sf = "MS5787"), so that I can get the last record inserted for each of the codes provided.
DEMONSTRATION

CODE   | POINT
QS2134 | 1000
DS3244 | 2000
MS5787 | 3000
QS2134 | 130
QS2134 | 200    # want to get this because it is last with such code
DS3244 | 300
MS5787 | 4500
DS3244 | 860    # want to get this because it is last with such code
MS5787 | 567
MS5787 | 45009  # want to get this because it is last with such code
I am able to do it for only one variable, cur.execute("SELECT * FROM PROFILE WHERE CODE=? ORDER BY POINT ASC LIMIT 1", (df,)), but I want to do it for multiple variables.
import sqlite3

jf = "QS2134"
df = "DS3243"
sf = "MS5787"

con = sqlite3.connect("TEST.db")
cur = con.cursor()
cur.execute("SELECT * FROM PROFILE WHERE CODE=? ORDER BY POINT ASC LIMIT 1", (df,))  # limit one means last one
rows = cur.fetchall()
for row in rows:
    print(row)
con.commit()
con.close()
I'm not sure I understand your question, but is it possible that you meant that you want to group the results?
Is it the "group by" clause you're looking for?
Something like:
select CODE, MAX(POINT) from PROFILE group by CODE;
I think you are simply trying to extend your query, in which case, why don't you try string formatting?
x = "SELECT * FROM my_table where col1 = '{0}' or col2 ='{1}';".format(var_1, var_2)
cur.execute(x)
That way you can extend your query with as many conditions as you like.
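If "last" really means most recently inserted, ordering by POINT only finds it by accident. Assuming no timestamp column exists and insertion order is what matters, SQLite's implicit rowid can stand in for it. A sketch that pulls the latest row per code with a parameterized IN list (using DS3244 from the demonstration table, since the question's sample codes vary slightly):

```python
import sqlite3

con = sqlite3.connect(':memory:')
# Recreate the DEMONSTRATION data in insertion order
con.executescript("""
CREATE TABLE PROFILE (CODE TEXT, POINT INT);
INSERT INTO PROFILE VALUES
    ('QS2134', 1000), ('DS3244', 2000), ('MS5787', 3000),
    ('QS2134', 130),  ('QS2134', 200),  ('DS3244', 300),
    ('MS5787', 4500), ('DS3244', 860),  ('MS5787', 567),
    ('MS5787', 45009);
""")

codes = ("QS2134", "DS3244", "MS5787")
placeholders = ",".join("?" * len(codes))
# Latest inserted row per code: rowid tracks insertion order in SQLite
sql = f"""
    SELECT CODE, POINT FROM PROFILE
    WHERE rowid IN (SELECT MAX(rowid) FROM PROFILE
                    WHERE CODE IN ({placeholders})
                    GROUP BY CODE)
"""
last_rows = dict(con.execute(sql, codes).fetchall())
```

This returns exactly the three rows marked "want to get this" in the demonstration, and still binds every value with ? placeholders rather than string formatting.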

Run checks on Items from tables in Sqlite and python

I have two tables below (Stock and Second_table):

Items | QTY
-----------
sugar | 14
mango | 10
apple | 50
berry | 1

Items | QTY
-----------
sugar | 10
mango | 5
apple | 48
berry | 1
I use the following query in python to check difference between the QTY of table one and table two.
cur = conn.cursor()
cur.execute("select s.Items, s.qty - t.qty as quantity from Stock s join Second_table t on s.Items = t.Items;")
remaining_quantity = cur.fetchall()
I'm a bit stuck on how to go about what I need to accomplish. I need to check the difference between the quantity of table one and table two; if the quantity (difference) is under 5, then for those Items I want to store the value 1 in another table column, otherwise 0.
Edit:
I have attempted this by looping through the rows: if the column value is less than 5, then insert into the new table with the value below:
for row in remaining_quantity:
    print(row[1])
    if row[1] < 5:
        cur.execute('INSERT OR IGNORE INTO check_quantity_tb VALUES '
                    '(select distinct s.Items, s.qty, s.qty - t.qty as quantity, 1 '
                    'from Stock s join Second_table t on s.Items = t.Items)', row)
        print(row)

But I get an SQL syntax error and I'm not sure where the error could be :/
First modify your first query so you retrieve all relevant infos and don't have to issue subqueries later:
readcursor = conn.cursor()
readcursor.execute(
    "select s.Items, s.qty, s.qty - t.qty as remain "
    "from Stock s join Second_table t on s.Items = t.Items;"
)
Then use it to update your third table:
writecursor = conn.cursor()
for items, qty, remain in readcursor:
    print(remain)
    if remain < 5:
        writecursor.execute(
            'INSERT OR IGNORE INTO check_quantity_tb VALUES (?, ?, ?, ?)',
            (items, qty, remain, 1)
        )
conn.commit()
Note the following points:
1/ We use two distinct cursors so we can iterate over the first one while writing with the second one. This avoids fetching all results in memory, which can be a real life saver on huge datasets.
2/ When iterating on the first cursor, we unpack the rows into their individual components. This is called "tuple unpacking" (but it actually works for most sequence types):
>>> row = ("1", "2", "3")
>>> a, b, c = row
>>> a
'1'
>>> b
'2'
>>> c
'3'
3/ We let the db-api module do the proper sanitisation and escaping of the values we want to insert. This avoids headaches with escaping / quoting etc and protects your code from SQL injection attacks (not that you might have one here, but that's the correct way to write parameterized queries in Python).
NB: since you did not post your full table definitions nor clear specs (not even the full error message and traceback), I only translated your code snippet to something more sensible (avoiding the costly and useless subquery, which might or might not be the cause of your error). I can't guarantee it will work out of the box, but at least it should put you back on track.
NB2: you mentioned you had to set the last col to either 1 or 0 depending on the remain value. If that's the case, you want your loop to be:
writecursor = conn.cursor()
for items, qty, remain in readcursor:
    print(remain)
    flag = 1 if remain < 5 else 0
    writecursor.execute(
        'INSERT OR IGNORE INTO check_quantity_tb VALUES (?, ?, ?, ?)',
        (items, qty, remain, flag)
    )
conn.commit()
If you instead only want to process rows where remain < 5, you can specify it directly in your first query with a where clause.
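Going one step further, the loop can disappear entirely: the join, the difference, and the flag can all be computed in a single set-based INSERT ... SELECT with a CASE expression. A sketch with an assumed schema for check_quantity_tb (the question never shows it):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Assumed schemas; check_quantity_tb's columns are a guess from the loop above
conn.executescript("""
CREATE TABLE Stock (Items TEXT, qty INT);
CREATE TABLE Second_table (Items TEXT, qty INT);
CREATE TABLE check_quantity_tb (Items TEXT UNIQUE, qty INT, quantity INT, flag INT);
INSERT INTO Stock VALUES ('sugar', 14), ('mango', 10), ('apple', 50), ('berry', 1);
INSERT INTO Second_table VALUES ('sugar', 10), ('mango', 5), ('apple', 48), ('berry', 1);
""")

# One statement: join, compute the difference, and derive the 0/1 flag
conn.execute("""
    INSERT OR IGNORE INTO check_quantity_tb
    SELECT s.Items, s.qty, s.qty - t.qty,
           CASE WHEN s.qty - t.qty < 5 THEN 1 ELSE 0 END
    FROM Stock s JOIN Second_table t ON s.Items = t.Items
""")
flags = dict(conn.execute("SELECT Items, flag FROM check_quantity_tb"))
```

With the sample data, sugar (14-10=4), apple (50-48=2) and berry (1-1=0) get flag 1, while mango (10-5=5) gets flag 0.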
