Pyspark SQL WHERE NOT IN? - python

I'm trying to find everything that is in nodes2 but not nodes1.
spark.sql("""
SELECT COUNT(*) FROM
(SELECT * FROM nodes2 WHERE NOT IN
(SELECT * FROM nodes1))
""").show()
Getting the following error:
"cannot resolve 'NOT' given input columns: [nodes2.~id, nodes2.~label];
Is it possible to do this sort of set difference operation in Pyspark?

Matching single column with NOT IN:
The WHERE clause needs a column to apply the NOT IN operator to.
If, for example, you want to match on id:
spark.sql("""
SELECT COUNT(*) FROM
(SELECT * FROM nodes2 WHERE id NOT IN
(SELECT id FROM nodes1))
""").show()
Matching multiple columns (or complete row) with NOT IN:
Or, if you really want to match the complete row (all columns), use something like CONCAT on all the columns to match:
spark.sql("""
SELECT COUNT(*) FROM
(SELECT * FROM nodes2 WHERE CONCAT(id,label) NOT IN (SELECT CONCAT(id,label) FROM nodes1))
""").show()
or with alias
spark.sql("""
SELECT COUNT(*) FROM
(SELECT * FROM nodes2 n2 WHERE CONCAT(n2.id,n2.label) NOT IN (SELECT CONCAT(n1.id,n1.label) FROM nodes1 n1))
""").show()
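Alternatively, Spark SQL supports EXCEPT for a whole-row set difference, which avoids the CONCAT trick entirely. A minimal sketch of the semantics, using sqlite3 (which shares the EXCEPT operator) with toy id/label tables standing in for the real ~id/~label ones:

```python
import sqlite3

# Toy stand-ins for nodes1/nodes2 (the real tables use ~id/~label columns).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE nodes1(id INTEGER, label TEXT);
CREATE TABLE nodes2(id INTEGER, label TEXT);
INSERT INTO nodes1 VALUES (1,'a'),(2,'b');
INSERT INTO nodes2 VALUES (1,'a'),(2,'b'),(3,'c');
""")

# EXCEPT returns the rows of nodes2 that do not appear in nodes1.
count = cur.execute("""
SELECT COUNT(*) FROM
  (SELECT * FROM nodes2 EXCEPT SELECT * FROM nodes1)
""").fetchone()[0]
print(count)
```

In the PySpark DataFrame API the same set difference is available directly as `nodes2_df.exceptAll(nodes1_df)` (or `subtract` for distinct rows).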

Python Regex applying split to a capturing group

I'm trying to derive a create statement from a CTAS statement by adding limit 0 to all the tables, as I don't want to load the table data. I'm using Python for this.
Input :
create table new_table as select
mytable.column1, mytable2.column2
from schema.mytable left join mytable2 ;
Expected Output:
create table new_table IF NOT EXISTS as select
mytable.column1, mytable2.column2
from (select * from mytable limit 0) mytable join (select * from mytable2 limit 0) mytable2
I have to replace every table in the from and join clauses with (select * from tablename limit 0) plus an alias.
However, I'm only able to generate the output below; I'm not able to extract the table name and add it as an alias, and I'm also not able to change the last table name in the join clause. When the input has an alias explicitly mentioned, I am able to handle it. I'm very new to regex and feel very overwhelmed. I'd appreciate support from the experts here.
Output obtained:
create table new_table as select
mytable.column1, mytable2.column2
from (select * from schema.mytable limit 0) join mytable2 ;
Code I tried (I first tried to capture whether there's already an alias and put it in capture group 4; I would like to generate an alias when a table does not have one explicitly mentioned; capture group 2 gets the schema_name.table_name). I'd like to apply Python's split to the capturing group. Also, I'm not able to translate the last table in the SQL:
import re
sql = """
create table new_table as select
mytable.column1, mytable2.column2
from schema.mytable left join mytable2 ;"""
rgxsubtable = re.compile(r"\b((?:from|join)\s+)([\w.\"]+)([\)\s]+)(\bleft|on|cross|join|inner\b)",re.MULTILINE|re.IGNORECASE) # look for table names in from and join clauses
rgxalias = re.compile(r"\b((?:from|join)\s+)([\w.\"]+)(\s+)(\b(?!left|on|cross|join|inner\b)\w+)",re.MULTILINE|re.IGNORECASE) # look for table names in from and join clauses but with aliases
sql = rgxalias.sub(r"\1 (select * from \2 limit 0) \4 ", sql)
sql = rgxsubtable.sub(r"\1 (select * from \2 limit 0) ", sql)
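One way to apply a split to the capturing group is to pass a function as the replacement in re.sub: the function receives the match object, so the alias can be derived by splitting the captured name on ".". A sketch under a deliberately simplified pattern (it wraps every from/join target and ignores pre-existing aliases and nested subqueries):

```python
import re

sql = """
create table new_table as select
mytable.column1, mytable2.column2
from schema.mytable left join mytable2 ;"""

# Simplified pattern: capture "from"/"join" and the (possibly
# schema-qualified) table name that follows it.
rgx = re.compile(r'\b(from|join)\s+([\w."]+)', re.IGNORECASE)

def wrap(match):
    table = match.group(2)
    alias = table.split(".")[-1]  # split the captured name to derive an alias
    return f"{match.group(1)} (select * from {table} limit 0) {alias}"

out = rgx.sub(wrap, sql)
print(out)
```

Because the replacement is computed per match, this also handles the last table in the join clause, which a fixed backreference string cannot do.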

Filter by results of select

I am trying to translate the following query to peewee:
select count(*) from A where
id not in (select distinct package_id FROM B)
What is the correct Python code? So far I have this:
A.select(A.id).where(A.id.not_in(B.select(B.package_id).distinct())).count()
This code is not returning the same result. A and B are large, 10-20M rows each, so I can't build a set of the existing package_id values in memory.
For example, this takes a lot of time:
A.select(A.id).where(A.id.not_in({x.package_id for x in B.select(B.package_id).distinct()})).count()
Maybe a LEFT JOIN?
Update: I ended up calling database.execute_sql()
Your SQL:
select count(*) from A where
id not in (select distinct package_id FROM B)
Equivalent peewee:
q = (A
.select(fn.COUNT(A.id))
.where(A.id.not_in(B.select(B.package_id.distinct()))))
count = q.scalar()
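If NOT IN proves slow at that scale, the LEFT JOIN anti-join the question hints at is worth benchmarking. A sketch of the SQL shape, using sqlite3 in place of the real database (table and column names follow the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE A(id INTEGER PRIMARY KEY);
CREATE TABLE B(package_id INTEGER);
INSERT INTO A(id) VALUES (1),(2),(3);
INSERT INTO B(package_id) VALUES (2),(2);
""")

# Anti-join: count the A rows whose id found no partner in B.
count = cur.execute("""
SELECT COUNT(*) FROM A
LEFT JOIN (SELECT DISTINCT package_id FROM B) b ON A.id = b.package_id
WHERE b.package_id IS NULL
""").fetchone()[0]
print(count)
```

In peewee the equivalent would be roughly `A.select(fn.COUNT(A.id)).join(B, JOIN.LEFT_OUTER, on=(A.id == B.package_id)).where(B.package_id.is_null())` — whether it beats NOT IN depends on how the planner handles each form at 10-20M rows.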

How to use an SQL query with join on the result of another SQL query executed before?

I am trying to use an SQL query on the result of a previous SQL query but I'm not able to.
I am creating a python script and using postgresql.
I have 3 tables from which I need to match different columns and join the data but using only 2 tables at a time.
For example:
I have table1, which has a code column, and there is the same column of codes in table2.
Now I am matching the values of both columns and joining the area column from table2, which corresponds to the codes, along with the pincode column from table1.
For this I used the following query which is working:
'''
select
table1.code,table2.code,table2.area,table1.pincode
from
table1 left join table2
ON
table1.code=table2.code
order by table1.row_num '''
I am getting the result, but some rows come back with None in the area column.
Wherever I get None for area when matching the code columns, I need to use the pincode column in table1 and the pincode column in table3 to look up the corresponding area from table3.area.
So I used the following Query:
'''
select
table1.code,table3.area,table1.pincode
from
table1 left join table3
ON
table1.pincode=table3.pincode
IN (
select
table1.code,table2.code,table2.area,table1.pincode
from
table1 left join table2
ON
table1.code=table2.code
where table2.area is NULL
order by table1.row_num '''
and I got the following error:
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.SyntaxError) subquery has too many columns
My python code is as follows:
import psycopg2
from sqlalchemy import create_engine
engine=create_engine('postgresql+psycopg2://credentials')
conn=engine.connect()
query = '''
select
table1.code,table2.code,table2.area,table1.pincode
from
table1 left join table2
ON
table1.code=table2.code
order by table1.row_num '''
area=conn.execute(query)
area_x=area.fetchall()
for i in area_x:
print(i)
query2 = '''
select
table1.code,table3.area,table1.pincode
from
table1 left join table3
ON
table1.pincode=table3.pincode
IN (
select
table1.code,table2.code,table2.area,table1.pincode
from
table1 left join table2
ON
table1.code=table2.code
where table2.area is NULL
order by table1.row_num '''
area=conn.execute(query2)
area_x=area.fetchall()
for i in area_x:
print(i)
This is how my first query is returning the data:
Wherever I am not able to match the code columns, I get a None value in the area column from table2, and whenever the area value is None I have to apply another query to find this data.
Now I have to match data in table1.pincode with data in table3.pincode to find table3.area and replace the None value with table3.area.
These are the 2 ways to find the area
The desired result should be:
What could be the correct solution??
Thank You
It looks like your query2 needs a WHERE clause and, as per the error message, the subquery should be reduced to the single column you are trying to pass to the outer query. query2 should be something like this:
query2 =
select
table1.code,table3.area,table1.pincode
from
table1 left join table3
ON
table1.pincode=table3.pincode
WHERE table1.code
IN (
select
table1.code
from
table1 left join table2
ON
table1.code=table2.code
where table2.area is NULL
)
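A single-query alternative worth considering: join both lookup tables at once and let COALESCE prefer table2.area, falling back to table3.area when it is NULL. A sketch with sqlite3 and toy data (column names follow the question; the real script would run the same SQL through the SQLAlchemy connection):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE table1(code TEXT, pincode TEXT, row_num INTEGER);
CREATE TABLE table2(code TEXT, area TEXT);
CREATE TABLE table3(pincode TEXT, area TEXT);
INSERT INTO table1 VALUES ('c1','p1',1),('c2','p2',2);
INSERT INTO table2 VALUES ('c1','AreaFromT2');
INSERT INTO table3 VALUES ('p2','AreaFromT3');
""")

# COALESCE picks table2.area when the code matched, else table3.area.
rows = cur.execute("""
SELECT table1.code,
       COALESCE(table2.area, table3.area) AS area,
       table1.pincode
FROM table1
LEFT JOIN table2 ON table1.code = table2.code
LEFT JOIN table3 ON table1.pincode = table3.pincode
ORDER BY table1.row_num
""").fetchall()
print(rows)
```

This avoids the second round trip entirely: every row gets an area from whichever table can supply one.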

insert into a table from another table when a new record is inserted

I have 2 tables. I want to insert all records from the first table into the second one if the records do not exist, and if a new row is added to the first table, it must be inserted into the second one.
I found this query
INSERT INTO Table2 SELECT * FROM Table1 WHERE NOT EXISTS (SELECT * FROM Table2)
but if a new row is added to table1, the row is not inserted into table2.
PS: table1 and table2 have the same fields and contain thousands of records
You could do it all in SQL - if you can run, for example, a batch every 2 minutes ...
-- with these two tables and their contents ...
DROP TABLE IF EXISTS tableA;
CREATE TABLE IF NOT EXISTS tableA(id,name,dob) AS (
SELECT 42,'Arthur Dent',DATE '1957-04-22'
UNION ALL SELECT 43,'Ford Prefect',DATE '1900-08-01'
UNION ALL SELECT 44,'Tricia McMillan',DATE '1959-03-07'
UNION ALL SELECT 45,'Zaphod Beeblebrox',DATE '1900-02-01'
);
ALTER TABLE tableA ADD CONSTRAINT pk_A PRIMARY KEY(id);
DROP TABLE IF EXISTS tableB;
CREATE TABLE IF NOT EXISTS tableB(id,name,dob) AS (
SELECT 43,'Ford Prefect',DATE '1900-08-01'
UNION ALL SELECT 44,'Tricia McMillan',DATE '1959-03-07'
UNION ALL SELECT 45,'Zaphod Beeblebrox',DATE '1900-02-01'
);
ALTER TABLE tableB ADD CONSTRAINT pk_B PRIMARY KEY(id);
-- .. this MERGE statement will add the row with id=42 to tableB ..
MERGE
INTO tableB t
USING tableA s
ON s.id = t.id
WHEN NOT MATCHED THEN INSERT (
id
, name
, dob
) VALUES (
s.id
, s.name
, s.dob
);
So it's an Access database.
Tough luck. They have no triggers.
Well then. I suppose you know how to insert a row into Access using Python, so I won't go through that.
I'll just build a scenario just after having inserted a row.
CREATE TABLE tableA (
id INTEGER
, name VARCHAR(20)
, dob DATE
, PRIMARY KEY (id)
)
;
INSERT INTO tableA(id,name,dob) VALUES(42,'Arthur Dent','1957-04-22');
INSERT INTO tableA(id,name,dob) VALUES(43,'Ford Prefect','1900-08-01');
INSERT INTO tableA(id,name,dob) VALUES(44,'Tricia McMillan','1959-03-07');
INSERT INTO tableA(id,name,dob) VALUES(45,'Zaphod Beeblebrox','1900-02-01');
CREATE TABLE tableB (
id INTEGER
, name VARCHAR(20)
, dob DATE
, PRIMARY KEY (id)
)
;
INSERT INTO tableB(id,name,dob) VALUES(43,'Ford Prefect','1900-08-01');
INSERT INTO tableB(id,name,dob) VALUES(44,'Tricia McMillan','1959-03-07');
INSERT INTO tableB(id,name,dob) VALUES(45,'Zaphod Beeblebrox','1900-02-01');
Ok. Scenario ready.
Merge ....
MERGE
INTO tableB t
USING tableA s
ON s.id = t.id
WHEN NOT MATCHED THEN INSERT (
id
, name
, dob
) VALUES (
s.id
, s.name
, s.dob
);
42000:1:-3500:[Microsoft][ODBC Microsoft Access Driver]
Invalid SQL statement;
expected 'DELETE', 'INSERT', 'PROCEDURE', 'SELECT', or 'UPDATE'.
So, Merge not supported.
Trying something else:
INSERT INTO tableB
SELECT * FROM tableA
WHERE id <> ALL (SELECT id FROM tableB)
;
1 row inserted
Or:
-- UNDO the previous insert
DELETE FROM tableB WHERE id=42;
1 row deleted
-- retry ...
INSERT INTO tableB
SELECT * FROM tableA
WHERE id NOT IN (SELECT id FROM tableB)
;
1 row inserted
You could run it like here above.
Or, if your insert into Table A, from Python, was:
INSERT INTO tableA(id,name,dob) VALUES(?,?,?);
... and you supplied the values for id, name and dob via host variables,
you could continue with:
INSERT INTO tableB
SELECT * FROM tableA a
WHERE id=?
AND NOT EXISTS(
SELECT * FROM tableB WHERE id=a.id
);
You would still have the value 42 in the first host variable, and could just reuse it. It would be faster this way in case of single row inserts.
Should you perform mass inserts, I would insert all the new rows into table A, then run the INSERT ... WHERE ... NOT IN (or the INSERT ... WHERE id <> ALL ...) variant.
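The NOT IN variant above, run end-to-end from Python — sketched here with sqlite3 as a stand-in for the Access connection (with Access you'd open a pyodbc connection instead and run the same SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE tableA(id INTEGER PRIMARY KEY, name TEXT, dob TEXT);
CREATE TABLE tableB(id INTEGER PRIMARY KEY, name TEXT, dob TEXT);
INSERT INTO tableA VALUES (42,'Arthur Dent','1957-04-22');
INSERT INTO tableA VALUES (43,'Ford Prefect','1900-08-01');
INSERT INTO tableB VALUES (43,'Ford Prefect','1900-08-01');
""")

# Copy over only the tableA rows whose id is missing from tableB.
cur.execute("""
INSERT INTO tableB
SELECT * FROM tableA
WHERE id NOT IN (SELECT id FROM tableB)
""")
ids = [r[0] for r in cur.execute("SELECT id FROM tableB ORDER BY id")]
print(ids)
```

Running this after every batch of inserts into table A keeps table B in sync without triggers, at the cost of rescanning B's ids each time.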

Python & Sqlite3 - Subset one table then join on two other tables

I'm using python and sqlite3. I have 3 tables:
Table 1:
Col A
Table 2:
Col A | Col B
Table 3:
Col B
I want the first 500k rows from Table 1 and any matching rows from Table 2 that have matching rows from Table 3. How do I do this? I was thinking something like this
conn = sqlite3.connect('database.sqlite')
conn.execute('''SELECT * FROM Table1 LIMIT 500000 AS sample
LEFT JOIN Table2
ON sample.A = Table2.A
LEFT JOIN Table3 ON table2.B = Table3.B''')
But I get this error: OperationalError: near "AS": syntax error
The result should be 500k rows with all columns found in all 3 Tables.
Apologies if any of my wording is difficult to understand.
As #furas said, LIMIT has to be at the end of the complete statement.
What you actually want to do is most likely a subquery, like:
SELECT * FROM (SELECT * FROM Table1 LIMIT 500000) AS sample
LEFT JOIN Table2
ON sample.A = Table2.A
LEFT JOIN Table3 ON table2.B = Table3.B
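A runnable sketch of that subquery pattern, with toy tables and LIMIT 2 standing in for LIMIT 500000:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Table1(A INTEGER);
CREATE TABLE Table2(A INTEGER, B INTEGER);
CREATE TABLE Table3(B INTEGER);
INSERT INTO Table1(A) VALUES (1),(2),(3);
INSERT INTO Table2 VALUES (1,10),(2,20);
INSERT INTO Table3(B) VALUES (10);
""")

# LIMIT applies inside the subquery, so only the sample rows are joined.
rows = cur.execute("""
SELECT * FROM (SELECT * FROM Table1 LIMIT 2) AS sample
LEFT JOIN Table2 ON sample.A = Table2.A
LEFT JOIN Table3 ON Table2.B = Table3.B
""").fetchall()
print(rows)
```

The LEFT JOINs keep every sampled row, padding with NULLs where Table2 or Table3 has no match — so the result is always exactly the sample size.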
