Creating a new table from query results is slow in Python with SQLite

Using Python, I am trying to create a new SQLite table in a database from the result of a query that runs against two other tables in that database.
For the record, the query is
CREATE TABLE results AS
SELECT table_1.*, table_2.*
FROM table_1
LEFT JOIN table_2
ON table_1.ID_1 = table_2.ID_2
UNION ALL
SELECT table_1.*, table_2.*
FROM table_2
LEFT JOIN table_1
ON table_1.ID_1 = table_2.ID_2
WHERE table_1.ID_1 IS NULL
which is the usual workaround for a FULL OUTER JOIN, since that is not directly available in SQLite (the method can be found in several SO threads on this topic).
However, this operation is slow on my tables with ~1 million rows each... so slow that, judging by how the database's file size grows when I hit refresh in the Explorer window, I get the impression it is going to take hours.
How can I make this faster? I have already done a lot of research on this, and most of the time people talk about using transactions because otherwise each row supposedly opens up a new connection to the database or something like that... however, I could not find a working example of how to use them.
My two approaches so far, both of which are way too slow:
Using Python's sqlite3 module:
import sqlite3

# open sqlite database
conn = sqlite3.connect('Database.sqlite')
# get a cursor
cursor = conn.cursor()
# start query
cursor.execute("""
CREATE TABLE results AS
SELECT table_1.*, table_2.*
FROM table_1
LEFT JOIN table_2
ON table_1.ID = table_2.ID
UNION ALL
SELECT table_1.*, table_2.*
FROM table_2
LEFT JOIN table_1
ON table_1.ID = table_2.NETWORK_ID
WHERE table_1.ID IS NULL
;""")
# make the new table permanent and clean up
conn.commit()
conn.close()
Using SQLAlchemy:
from sqlalchemy import create_engine, text

# create database engine
engine = create_engine('sqlite:///Database.sqlite')
# open sqlite database
connection = engine.connect()
# run the query in a single transaction; the context manager issues
# BEGIN on entry and COMMIT on exit, so no explicit statements are needed
with connection.begin():
    connection.execute(text("""
    CREATE TABLE results AS
    SELECT table_1.*, table_2.*
    FROM table_1
    LEFT JOIN table_2
    ON table_1.ID = table_2.ID
    UNION ALL
    SELECT table_1.*, table_2.*
    FROM table_2
    LEFT JOIN table_1
    ON table_1.ID = table_2.ID
    WHERE table_1.ID IS NULL
    ;"""))

Argh... the problem was not with the code above but with the tables that provide the input for the JOIN. I was not careful enough when creating them, so the ID column of table_1 was of type INTEGER while the ID column of table_2 was of type STRING.
Unfortunately, doing a JOIN on columns of different types does not throw an error. Instead, I suspect something very expensive happens internally, such as casting each STRING to INTEGER, comparing, and then casting back to STRING, or vice versa.
So, the moral of the story: if your CREATE TABLE ... AS SELECT from a JOIN is painfully slow, check that the data types of the JOIN columns match.
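For future readers, a quick way to verify this is PRAGMA table_info, which reports the declared type of every column. A minimal sketch, assuming the database and table names from the question:
import sqlite3

conn = sqlite3.connect('Database.sqlite')
cursor = conn.cursor()

# PRAGMA table_info returns one row per column:
# (cid, name, declared_type, notnull, default_value, pk)
for table in ('table_1', 'table_2'):
    for cid, name, declared_type, notnull, default_value, pk in cursor.execute(
            "PRAGMA table_info({})".format(table)):
        print(table, name, declared_type)

conn.close()
If the declared types of the join columns differ, recreating one of the tables with a matching type (or converting the values before loading them) is usually the cleanest fix.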

Related

Tables being queried with SQL and Sqlalchemy have same column names

I am querying a variety of different tables in a MySQL database with SQLAlchemy and raw SQL query code.
My issue right now is renaming some of the columns being joined. The query results all end up in one dataframe.
SELECT *
FROM original
LEFT JOIN table1
on original.id = table1.t1key
LEFT JOIN table2
on original.id = table2.t2key
LEFT JOIN table3
on original.id = table3.t3key;
All I actually want to get from those tables is a single column added to my query. Each table has a column with the same name. My approach to using aliases has been as below:
table1.columnchange AS 'table1columnchange'
table2.columnchange AS 'table2columnchange'
table3.columnchange AS 'table3columnchange'
But every way I've tried to implement this ends up with annoying errors.
I am querying around 20 different tables as well, so using SELECT * at the beginning, while inefficient, is ideal for the sake of ease.
The output I'm looking for is a dataframe that has each of the columns I need in it (which I then want to filter and build models on with Python). I am fine with managing the query through SQLAlchemy into pandas; the aliasing is what is giving me grief right now.
Thanks in advance
You can use nested queries; note that each derived table must also select the column it is joined on:
SELECT
    original.column1 AS somename,
    table1.column1 AS somename1,
    table2.column1 AS somename2
FROM
    (SELECT
        id,
        column1
    FROM
        original
    ) original
LEFT JOIN (
    SELECT
        t1key,
        column1
    FROM
        table1
    ) table1 ON original.id = table1.t1key
LEFT JOIN (
    SELECT
        t2key,
        column1
    FROM
        table2
    ) table2 ON original.id = table2.t2key
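If it helps, here is a minimal sketch of running such an aliased query straight into a dataframe with SQLAlchemy and pandas; the connection string, driver, and column names are assumptions, not taken from the question:
import pandas as pd
from sqlalchemy import create_engine, text

# placeholder connection string; adjust driver, credentials, host and schema
engine = create_engine("mysql+pymysql://user:password@localhost/mydb")

# the AS aliases become the dataframe's column names
query = text("""
    SELECT original.id,
           original.columnchange AS originalcolumnchange,
           table1.columnchange   AS table1columnchange,
           table2.columnchange   AS table2columnchange
    FROM original
    LEFT JOIN table1 ON original.id = table1.t1key
    LEFT JOIN table2 ON original.id = table2.t2key
""")

df = pd.read_sql(query, engine)
Listing the few columns you need with explicit aliases sidesteps the duplicate-name problem entirely, at the cost of giving up SELECT *.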

Improve performance SQL query in Python Cx_Oracle

I am using the cx_Oracle library in Python to work with my Oracle database.
import cx_Oracle as Cx
# Parameters for the server connection
dsn_tns = Cx.makedsn(_ip, _port, service_name=_service_name)
# Connect to the Oracle database
db = Cx.connect(_user, _password, dsn_tns)
# Obtain a cursor to run SQL queries
cursor = db.cursor()
One of my queries performs an INSERT of a Python dataframe into my Oracle target table, subject to some conditions.
query = """INSERT INTO ORA_TABLE (ID1, ID2)
           SELECT :1, :2
           FROM DUAL
           WHERE (:1 != 'NF' AND :1 NOT IN (SELECT ID1 FROM ORA_TABLE))
           OR (:1 = 'NF' AND :2 NOT IN (SELECT ID2 FROM ORA_TABLE))"""
The goal of this query is to insert only the rows that satisfy the conditions in the WHERE clause.
This query works well when my Oracle target table has few rows. But if the target table has more than 100,000 rows, it is very slow because the WHERE condition scans the whole table for every row inserted.
Is there a way to improve the performance of this query with a join or something else?
End of the code:
# Prepare the SQL statement
cursor.prepare(query)
# Run the query once per row of the Python dataset
cursor.executemany(None, _py_table.values.tolist())
# Commit changes to the Oracle database
db.commit()
# Close the cursor
cursor.close()
# Close the server connection
db.close()
Here is a possible solution that could help: the SQL you have contains an OR condition, and only one branch of it can be true for a given value of :1. So I would divide it into two parts by checking the value in the code and constructing two inserts instead of one; at any point in time, only one of them would execute.
If :1 != 'NF', use the following insert:
INSERT INTO ORA_TABLE (ID1, ID2)
SELECT :1, :2
FROM DUAL
WHERE (:1 NOT IN (SELECT ID1
FROM ORA_TABLE));
and if :1 = 'NF', use the following insert:
INSERT INTO ORA_TABLE (ID1, ID2)
SELECT :1, :2
FROM DUAL
WHERE (:2 NOT IN (SELECT ID2
FROM ORA_TABLE));
So you check in the code what the value of :1 is and, depending on that, use one of the two simplified inserts. Please check that this is functionally the same as the original query and verify whether it improves the response time.
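A minimal sketch of that split on the Python side, reusing the cursor, dataframe and commit from the question; it assumes ID1 is the first column of _py_table, which is an assumption rather than something stated above:
# split the rows by the value that will be bound to :1 (ID1)
rows = _py_table.values.tolist()
rows_nf = [r for r in rows if r[0] == 'NF']        # assumes ID1 is the first column
rows_not_nf = [r for r in rows if r[0] != 'NF']

insert_not_nf = """INSERT INTO ORA_TABLE (ID1, ID2)
                   SELECT :1, :2 FROM DUAL
                   WHERE :1 NOT IN (SELECT ID1 FROM ORA_TABLE)"""
insert_nf = """INSERT INTO ORA_TABLE (ID1, ID2)
               SELECT :1, :2 FROM DUAL
               WHERE :2 NOT IN (SELECT ID2 FROM ORA_TABLE)"""

if rows_not_nf:
    cursor.executemany(insert_not_nf, rows_not_nf)
if rows_nf:
    cursor.executemany(insert_nf, rows_nf)
db.commit()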
Assuming Pandas, consider exporting your data to a staging table for the final migration, so that the subquery runs only once instead of once for every row of the data set. In Pandas, you would interface with SQLAlchemy to run the to_sql export operation. Note: this assumes your connected user has the necessary DROP TABLE and CREATE TABLE privileges.
Also, consider combining both IN subqueries into a single NOT EXISTS subquery. The subquery below inverts your logic: it looks for a matching row, and the outer NOT EXISTS skips the insert when one is found.
import sqlalchemy
...
engine = sqlalchemy.create_engine("oracle+cx_oracle://user:password@dsn")

# EXPORT DATA - ALWAYS REPLACING
pandas_df.to_sql('mytemptable', con=engine, if_exists='replace')

# RUN TRANSACTION
with engine.begin() as cn:
    sql = """INSERT INTO ORA_TABLE (ID1, ID2)
             SELECT t.ID1, t.ID2
             FROM mytemptable t
             WHERE NOT EXISTS
             (
                 SELECT 1 FROM ORA_TABLE sub
                 WHERE (t.ID1 != 'NF' AND t.ID1 = sub.ID1)
                 OR (t.ID1 = 'NF' AND t.ID2 = sub.ID2)
             )"""
    cn.execute(sqlalchemy.text(sql))

Fast insert into table with Foreign Key constraints

I have to periodically insert data into table #1, which contains a foreign key reference to table #2. Table #2 is quite big, about 200,000 rows. I am trying to check the rows that are to be inserted into table #1 against the foreign key constraint by simply removing the rows that definitely cannot be inserted, and my query looks like this:
DELETE FROM temp_table1
WHERE temp_table1.fk NOT IN (SELECT id FROM table2) AND
temp_table1.id_d IS NOT NULL;
The problem is, this method is veeery slow :( So is there a "right" method for inserting rows in such a situation?
I'm using Python3, Postgresql and Psycopg2, if it matters.
You do not need the delete step. Insert directly instead:
insert into table1
select t1.*
from temp_table1 t1
inner join table2 t2 on t1.fk = t2.id
where t1.id_d is not null
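A minimal sketch of running that insert from Python with psycopg2, in a single transaction; the connection parameters are placeholders:
import psycopg2

# placeholder connection parameters
conn = psycopg2.connect(dbname="mydb", user="me", password="secret", host="localhost")

# the outer `with conn` commits on success and rolls back on error
with conn:
    with conn.cursor() as cur:
        cur.execute("""
            insert into table1
            select t1.*
            from temp_table1 t1
            inner join table2 t2 on t1.fk = t2.id
            where t1.id_d is not null
        """)

conn.close()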

Using pyodbc with SQL join statement in Python

I am trying to join 2 tables in Python. (Using Windows, jupyter notebook.)
Table 1 is an excel file read in using pandas.
TABLE_1= pd.read_excel('my_file.xlsx')
Table 2 is a large table in an Oracle database that I can connect to using pyodbc. I can read in the entire table successfully like this, but it takes a very long time to run.
sql = "SELECT * FROM ORACLE.table_2"
cnxn = odbc.connect(##########)
TABLE_2 = pd.read_sql(sql, cnxn)
So I would like to do an inner join as part of the pyodbc import, so that it runs faster and I only pull in the needed records. Table 1 and Table 2 share the same unique identifier/primary key.
sql = "SELECT * FROM ORACLE.TABLE_1 INNER JOIN TABLE_2 ON ORACLE.TABLE1.ID=TABLE_2.ID"
cnxn = odbc.connect(##########)
TABLE_1_2_JOINED = pd.read_sql(sql, cnxn)
But this doesn't work. I get this error:
DatabaseError: Execution failed on sql 'SELECT * FROM ORACLE.TABLE_1
INNER JOIN TABLE_2 ON ORACLE.TABLE1.ID=TABLE_2.ID': ('42S02', '[42S02]
[Oracle][ODBC][Ora]ORA-00942: table or view does not exist\n (942)
(SQLExecDirectW)')
Is there another way I can do this? It seems very inefficient to have to import an entire table with millions of records when I only need to join a few hundred. Thank you.
Something like this might work.
First collect the IDs from your pandas table:
MyIds = set(TABLE_1['ID'])
Then create a table on the database side to hold them:
SQL1 = "CREATE TEMPORARY TABLE MyIds ( ID int );"
Now insert your ids:
SQL2 = "INSERT INTO MyIds (ID) VALUES (?)"
for element in MyIds:
    cursor.execute(SQL2, element)
And lastly:
SQL3 = "SELECT * FROM ORACLE.TABLE_2 WHERE ORACLE.TABLE_2.ID IN (SELECT ID FROM MyIds)"
I have used MySQL, not Oracle, and a different connector to you, but the principles are probably the same. Of course there is a bit more code for the Python/SQL connections etc. Hope it works; otherwise, try making a regular table rather than a temporary one.
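As an alternative sketch, since only a few hundred IDs are involved, the IN list can be built with bind parameters and passed straight to pandas, reusing the cnxn connection and the TABLE_1 dataframe from the question (the key column name 'ID' is an assumption):
import pandas as pd

ids = list(TABLE_1['ID'].unique())            # assumes the shared key column is named 'ID'
placeholders = ", ".join("?" * len(ids))      # one ? bind marker per ID
sql = "SELECT * FROM ORACLE.TABLE_2 WHERE ID IN ({})".format(placeholders)

# pull only the matching rows, then join locally in pandas
TABLE_2_SUBSET = pd.read_sql(sql, cnxn, params=ids)
TABLE_1_2_JOINED = TABLE_1.merge(TABLE_2_SUBSET, on='ID', how='inner')
Keep in mind that Oracle limits an IN list to 1,000 elements, so this only works while the ID list stays below that; for larger sets, the temporary-table approach above is the safer route.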

Postgres query stalls when selecting explicit columns with an order by using a server side cursor with psycopg2

I have a simple query that joins two (reasonably large) tables and iterates over the results with a server side cursor:
select * from tableA a join tableB b on (a.fid = b.id) order by a.id;
This query is fine and I can iterate over the results in chunks with a fetchmany.
However, as soon as I change the query to
select a.id from tableA a join tableB b on (a.fid = b.id) order by a.id
the execute call stalls and I'm unable to iterate over the results.
The id column on both tables is indexed via a BTREE index, and the fid column is also indexed with a BTREE and has a foreign key constraint referencing tableB.id. id and fid are integers.
The reason I need to iterate over all the rows is that I'm creating a database export of sorts, which cannot be accomplished via a COPY command because it requires a join.
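For context, here is a minimal sketch of the server-side cursor setup described above, assuming psycopg2 and placeholder connection parameters:
import psycopg2

conn = psycopg2.connect(dbname="mydb", user="me", password="secret", host="localhost")

# giving the cursor a name makes psycopg2 create a server-side cursor,
# so rows are streamed from the server in chunks instead of all at once
cur = conn.cursor(name="export_cursor")
cur.execute("""
    select a.id
    from tableA a
    join tableB b on (a.fid = b.id)
    order by a.id
""")

while True:
    rows = cur.fetchmany(10000)   # fetch the next chunk of rows
    if not rows:
        break
    for row in rows:
        pass                      # write each row to the export here

cur.close()
conn.close()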
