Python & SQLite3 - Subset one table then join on two other tables

I'm using python and sqlite3. I have 3 tables:
Table 1:
Col A
Table 2:
Col A | Col B
Table 3:
Col B
I want the first 500k rows from Table 1 and any matching rows from Table 2 that have matching rows from Table 3. How do I do this? I was thinking something like this
conn = sqlite3.connect('database.sqlite')
conn.execute('''SELECT * FROM Table1 LIMIT 500000 AS sample
LEFT JOIN Table2
ON sample.A = Table2.A
LEFT JOIN Table3 ON table2.B = Table3.B''')
But I get this error: OperationalError: near "AS": syntax error
The result should be 500k rows with all columns found in all 3 Tables.
Apologies if any of my wording is difficult to understand.

As @furas said, LIMIT has to come at the end of the complete statement.
What you actually want to do is most likely a subquery, like:
SELECT * FROM (SELECT * FROM Table1 LIMIT 500000) AS sample
LEFT JOIN Table2
ON sample.A = Table2.A
LEFT JOIN Table3 ON Table2.B = Table3.B
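Run from Python, a minimal sketch of the subquery version (assuming the column names A and B from the question and an existing database.sqlite) could be:

import sqlite3

conn = sqlite3.connect('database.sqlite')

# Take the first 500k rows of Table1 as a named subquery, then join outward.
query = '''
SELECT *
FROM (SELECT * FROM Table1 LIMIT 500000) AS sample
LEFT JOIN Table2 ON sample.A = Table2.A
LEFT JOIN Table3 ON Table2.B = Table3.B
'''

for row in conn.execute(query):
    print(row)  # each row carries the columns of all three tables

conn.close()

Note that a LEFT JOIN can return more than 500k rows if Table2 has several matches for the same value of A.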

Related

Pyspark SQL WHERE NOT IN?

I'm trying to find everything that is in nodes2 but not in nodes1.
spark.sql("""
SELECT COUNT(*) FROM
(SELECT * FROM nodes2 WHERE NOT IN
(SELECT * FROM nodes1))
""").show()
Getting the following error:
"cannot resolve 'NOT' given input columns: [nodes2.~id, nodes2.~label];
Is it possible to do this sort of set difference operation in Pyspark?
Matching single column with NOT IN:
NOT IN needs a column in the WHERE clause to compare against. If, for example, you want to match on id:
spark.sql("""
SELECT COUNT(*) FROM
(SELECT * FROM nodes2 WHERE id NOT IN
(SELECT id FROM nodes1))
""").show()
Matching multiple columns (or complete row) with NOT IN:
Or, if you really want to match the complete row (all columns), use something like CONCAT over all the columns:
spark.sql("""
SELECT COUNT(*) FROM
(SELECT * FROM nodes2 WHERE CONCAT(id,label) NOT IN (SELECT CONCAT(id,label) FROM nodes1))
""").show()
or with alias
spark.sql("""
SELECT COUNT(*) FROM
(SELECT * FROM nodes2 n2 WHERE CONCAT(n2.id,n2.label) NOT IN (SELECT CONCAT(n1.id,n1.label) FROM nodes1 n1))
""").show()

How to use an SQL query with join on the result of another SQL query executed before?

I am trying to use an SQL query on the result of a previous SQL query but I'm not able to.
I am creating a python script and using postgresql.
I have 3 tables from which I need to match different columns and join the data but using only 2 tables at a time.
For example:
I have table1 with a code column, and table2 has the same column of codes.
Now I am matching the values of both columns and selecting the column 'area' from table2 that corresponds to the codes, together with the column 'pincode' from table1.
For this I used the following query which is working:
'''
select
table1.code,table2.code,table2.area,table1.pincode
from
table1 left join table2
ON
table1.code=table2.code
order by table1.row_num '''
I am getting the result, but in this data some rows come back with the area value as None.
Wherever the area is None when matching the code columns, I need to use the pincode column in table1 and the pincode column in table3 to find the corresponding area from table3.area instead.
So I used the following Query:
'''
select
table1.code,table3.area,table1.pincode
from
table1 left join table3
ON
table1.pincode=table3.pincode
IN (
select
table1.code,table2.code,table2.area,table1.pincode
from
table1 left join table2
ON
table1.code=table2.code
where table2.area is NULL
order by table1.row_num '''
and I got the following error:
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.SyntaxError) subquery has too many columns
My python code is as follows:
import psycopg2
from sqlalchemy import create_engine
engine=create_engine('postgresql+psycopg2://credentials')
conn=engine.connect()
query = '''
select
table1.code,table2.code,table2.area,table1.pincode
from
table1 left join table2
ON
table1.code=table2.code
order by table1.row_num '''
area=conn.execute(query)
area_x=area.fetchall()
for i in area_x:
    print(i)
query2 = '''
select
table1.code,table3.area,table1.pincode
from
table1 left join table3
ON
table1.pincode=table3.pincode
IN (
select
table1.code,table2.code,table2.area,table1.pincode
from
table1 left join table2
ON
table1.code=table2.code
where table2.area is NULL
order by table1.row_num '''
area=conn.execute(query2)
area_x=area.fetchall()
for i in area_x:
    print(i)
This is how my first query is returning the data:
Wherever I am not able to match the code columns I get a None value in the area column from table2, and whenever the area value is None I have to apply another query to find this data.
Now I have to match table1.pincode with table3.pincode to find table3.area and replace the None value with it.
These are the 2 ways to find the area
The desired result should be:
What could be the correct solution?
Thank You
It looks like your query2 needs a WHERE clause, and the subquery, as per the error message, should be reduced to the single column you are comparing in the outer query. query2 should be something like this:
query2 = '''
select
table1.code,table3.area,table1.pincode
from
table1 left join table3
ON
table1.pincode=table3.pincode
WHERE table1.code
IN (
select
table1.code
from
table1 left join table2
ON
table1.code=table2.code
where table2.area is NULL
) '''
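As an alternative, both lookups can be folded into one statement with COALESCE, so table3.area is only used where table2.area comes back NULL. A sketch, assuming pincode is the join key between table1 and table3 as described:

query3 = '''
select
table1.code, coalesce(table2.area, table3.area) as area, table1.pincode
from
table1 left join table2 ON table1.code = table2.code
left join table3 ON table1.pincode = table3.pincode
order by table1.row_num '''

area = conn.execute(query3)
for i in area.fetchall():
    print(i)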

Update one postgres table from another postgres table

I am loading a batch csv file to postgres using python (Say Table A).
I am using pandas to upload the data in chunks, which is quite a bit faster.
for chunk in pd.read_csv(csv_file, sep='|',chunksize=chunk_size,low_memory=False):
Now I want to update another table (say Table B) using A based on following rules
if there are any new records in table A which is not in table B then insert that as a new record in table B (based on Id field)
if the values changes in the Table A for the same ID which exists in Table B then update the records in table B using TableA
(There are several tables which I need to update based on Table A.)
I am able to do that using the code below and then looping through each row, but Table A always has around 1,825,172 records and it becomes extremely slow. Can any forum member help to speed this up or suggest an alternate approach to achieve the same?
cursor.execute(sql)
records = cursor.fetchall()
for row in records:
    id= 0 if row[0] is None else row[0] # Use this to match with Table B and decide insert or update
    id2=0 if row[1] is None else row[1]
    id2=0 if row[2] is None else row[2]
You could leverage Postgres upsert syntax, like:
insert into tableB (id, col1, col2)
select ta.id, ta.col1, ta.col2 from tableA ta
on conflict (id) do update
set col1 = excluded.col1, col2 = excluded.col2
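Note that on conflict (id) requires a unique or primary key constraint on tableB.id, and inside do update the incoming row is referred to as excluded rather than by the source table's alias. Executed from the loading script, a minimal sketch (connection details are placeholders) might be:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder credentials
cur = conn.cursor()

# One set-based statement: insert new ids, update existing ones from tableA.
cur.execute("""
    insert into tableB (id, col1, col2)
    select id, col1, col2 from tableA
    on conflict (id) do update
    set col1 = excluded.col1, col2 = excluded.col2
""")

conn.commit()
cur.close()
conn.close()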
You should do this completely inside the DBMS, not loop through the records inside your python script. That allows your DBMS to better optimize.
UPDATE TableB
SET x=y
FROM TableA
WHERE TableA.id = TableB.id
INSERT INTO TableB(id,x)
SELECT id, y
FROM TableA
WHERE TableA.id NOT IN ( SELECT id FROM TableB )
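If Table A itself is being filled from the CSV chunks shown earlier, a sketch of the whole flow could look like this (table and column names are illustrative, reusing the id/x/y placeholders from the statements above; Postgres folds unquoted identifiers to lower case, so tablea and TableA refer to the same table):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://credentials')  # placeholder

# Load the CSV into Table A chunk by chunk (csv_file and chunk_size as in the question).
for chunk in pd.read_csv(csv_file, sep='|', chunksize=chunk_size, low_memory=False):
    chunk.to_sql('tablea', engine, if_exists='append', index=False)

# Then update Table B with the UPDATE / INSERT pair above, one round trip each,
# instead of looping over rows in Python.
with engine.begin() as conn:
    conn.execute("""
        UPDATE TableB SET x = TableA.y
        FROM TableA
        WHERE TableA.id = TableB.id
    """)
    conn.execute("""
        INSERT INTO TableB (id, x)
        SELECT id, y FROM TableA
        WHERE TableA.id NOT IN (SELECT id FROM TableB)
    """)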

SQLite divide each row by values of the same row from another table

So how would one take each row in a table and divide each column value for that row by the corresponding value in another table?
Say I have a table_one
A B C
0 12 20 12
and table_two
date A B C
01/01/19 6 10 3
02/01/19 12 5 6
and by dividing each cell in each row of table_two by the corresponding value in table_one I want to make the new table
date A B C
01/01/19 0.5 0.5 0.25
02/01/19 1 0.25 0.5
I thought about using an INNER JOIN like
"""INSERT INTO newtable( date, A, B, C)
SELECT t.date,
t.A / s.A
t.B / s.B
t.C / s.C
FROM table_two t
INNER JOIN table_one s
ON ;"""
But I don't think this is correct, as I don't know what I would put after the ON clause.
If you want the calculation to run directly across the rows (with the first element of the second table divided by the first element of the first table, ...) then you could give a number-id (that auto-increments for each row) column to each table and then join on that number-id.
This would call for recreating the table with this ID column, which would look like this at the top:
CREATE TABLE table1 (
id INTEGER PRIMARY KEY AUTOINCREMENT, ...
You have a date column in there as well; if you would rather join on the date, that also works as long as the dates are unique.
For the first scenario, your INSERT statement would end like this:
FROM table_two t
INNER JOIN table_one s
ON t.id = s.id;
If the scenario is the date-column join then it would just be:
FROM table_two t
INNER JOIN table_one s
ON t.date = s.date;
And this is assuming your tables have columns like (id, date, A, B, C)
I hope that at least gives you some idea.
how would one take each row in a table and divide each column value
for that row by the corresponding value in another table?
Corresponding to what?
In the sample data you posted about table_one there is a 0 at the beginning of the row.
Does this mean that there is only 1 row in the table and it will always be just 1 row?
If the answer is yes then you need a CROSS JOIN between the tables which will join the only row of table_one to every row of table_two:
INSERT INTO newtable( date, A, B, C)
SELECT t.date,
t.A / s.A,
t.B / s.B,
t.C / s.C
FROM table_two t CROSS JOIN table_one s
If the answer is no then you need to set a condition in the ON clause of the INNER JOIN:
INSERT INTO newtable( date, A, B, C)
SELECT t.date,
t.A / s.A,
t.B / s.B,
t.C / s.C
FROM table_two t INNER JOIN table_one s
ON s.somecolumn = t.othercolumn
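For the single-row table_one case, a minimal Python sketch of the CROSS JOIN variant might look like the following (the file name data.sqlite and the existence of newtable are assumptions; multiplying by 1.0 forces floating-point division, since dividing two SQLite integers truncates to an integer):

import sqlite3

conn = sqlite3.connect('data.sqlite')  # hypothetical database file

# Divide every row of table_two by the single row of table_one.
conn.execute('''
    INSERT INTO newtable (date, A, B, C)
    SELECT t.date,
           t.A * 1.0 / s.A,
           t.B * 1.0 / s.B,
           t.C * 1.0 / s.C
    FROM table_two t CROSS JOIN table_one s
''')
conn.commit()
conn.close()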

python - pandas dataframe processing

So I am back with another question about python and pandas.
I have table1 with following columns:
ID;COUNT;FOREIGN_ID;OTHER_DATA
1;3;xyz1
2;1;xyz2
3;1;xyz3
table2
ID;FOREIGN_ID;OTHER_DATA
1;xyz1;000001
2;xyz1;000002
3;xyz1;000003
4;xyz1;000004
5;xyz1;000005
6;xyz2;000000
7;xyz2;000000
8;xyz3;000000
9;xyz3;000000
Both tables are stored as CSV files. I load both of them into dataframes and then iterate through TABLE1. For each row I must find all records in table2 with the same FOREIGN_ID and randomly select some of them.
df_result = pd.DataFrame()
df_table1 = pd.read_csv(table1, delimiter=';')
df_table2 = pd.read_csv(table2, delimiter=';')
for index, row in df_table1.iterrows():
    df_candidates = df_table2[df_table2['FOREIGN_ID'] == row['FOREIGN_ID']]
    random_numbers = np.random.choice(len(df_candidates), row['COUNT'], replace=False)
    df_result = df_result.append(df_candidates.iloc[random_numbers])
In an earlier question I was told that using a for loop is a big time waster, but for this problem I can't find a solution that doesn't need one.
EDIT:
I am sorry for editing my question so late, I was busy with other stuff.
As requested, below is the result table. Please note that my real tables are slightly different from those below: in my real use case I am joining the tables on 3 foreign keys, but for demonstration I am using tables with fake data.
So the logic should be something like this:
Read the first line of table1.
1;3;xyz1
Find all records with same FOREIGN_ID in table2
count = 3, foreign_id = xyz1
Rows with foreign_id = xyz1 are rows:
1;xyz1;000001
2;xyz1;000002
3;xyz1;000003
4;xyz1;000004
5;xyz1;000005
Because count = 3 I must randomly choose 3 of those records.
I do this with the following line:
(df_candidates is the table of all suitable records, i.e. the rows above)
random_numbers = np.random.choice(len(df_candidates), row['COUNT'], replace=False)
Then I store the randomly chosen records in df_result, and after processing all rows from table1 I write df_result to a CSV file.
The problem is that my tables are 0.5-1 million rows, so iterating through every row of table1 is really slow. I am sure there is a better way of doing this, but I've been stuck on it for the past 2 days.
To select only the rows containing values from Table1, you can use, for example, pd.merge:
col = "FOREIGN_ID"
left = df_table2
right = df_table1[[col]]
filtered = pd.merge(left=left, right=right, on=col, how="inner")
Or df.isin():
ix = df_table2[col].isin(df_table1[col])
filtered = df_table2[ix]
Then, to select a random sample per group:
def select_random_row(grp):
    choice = np.random.randint(len(grp))
    return grp.iloc[choice]
filtered.groupby(col).apply(select_random_row)
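The question asks for COUNT rows per FOREIGN_ID rather than a single row, so one possible sketch, reusing df_table1 and df_table2 from the question and assuming COUNT never exceeds the number of candidates in a group:

# Attach the requested sample size to every candidate row, then sample per group.
merged = df_table2.merge(df_table1[['FOREIGN_ID', 'COUNT']], on='FOREIGN_ID', how='inner')

df_result = (
    merged.groupby('FOREIGN_ID', group_keys=False)
          .apply(lambda grp: grp.sample(n=int(grp['COUNT'].iloc[0])))
          .drop(columns='COUNT')
)
df_result.to_csv('result.csv', index=False)  # hypothetical output file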
Have you looked into using pd.merge()?
Your call would look something like:
results=pd.merge(table1, table2, how='inner', on='FOREIGN_ID')
