Best way to merge 3+ tables in SQL - python

I have run into a situation where merging multiple tables takes a long time. I know of two SQL statements that work, but I don't know which one is better or whether there is any difference between them.
SQL1:
SELECT a.code,a.date,b.varaible1,c.varaible2 \
FROM TABLEA a,TABLEB b,TABLEC c \
WHERE a.date = b.date \
AND a.code = b.code \
AND a.date = c.date
SQL2:
SELECT a.code,a.date,b.varaible1,c.varaible2 \
FROM TABLEA a \
JOIN TABLEB b \
ON a.date = b.date \
AND a.code = b.code \
JOIN TABLEC c \
ON a.date = c.date
So is there any difference between the two, or does it depend?
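As far as SQL semantics go, both statements describe the same inner join, so they should return the same rows, and most query planners are expected to treat them alike. The sketch below checks this on an in-memory SQLite database, which is an assumption made only so the example runs anywhere; the sample rows are hypothetical and the real engine's plan should be inspected with its own EXPLAIN. Note also that neither statement matches TABLEC on code, so every a/b pair is combined with every TABLEC row for that date, which may be one reason the result is large and slow.

```python
import sqlite3

# In-memory SQLite stands in for the real database (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TABLEA (code TEXT, date TEXT);
CREATE TABLE TABLEB (code TEXT, date TEXT, varaible1 REAL);
CREATE TABLE TABLEC (date TEXT, varaible2 REAL);
-- hypothetical sample rows
INSERT INTO TABLEA VALUES ('x', '2020-01-01'), ('y', '2020-01-02');
INSERT INTO TABLEB VALUES ('x', '2020-01-01', 1.0), ('y', '2020-01-02', 2.0);
INSERT INTO TABLEC VALUES ('2020-01-01', 10.0), ('2020-01-02', 20.0);
""")

implicit_sql = """
SELECT a.code, a.date, b.varaible1, c.varaible2
FROM TABLEA a, TABLEB b, TABLEC c
WHERE a.date = b.date
  AND a.code = b.code
  AND a.date = c.date
"""

explicit_sql = """
SELECT a.code, a.date, b.varaible1, c.varaible2
FROM TABLEA a
JOIN TABLEB b ON a.date = b.date AND a.code = b.code
JOIN TABLEC c ON a.date = c.date
"""


def rows(sql):
    """Run the query and return its rows in a stable order."""
    return sorted(conn.execute(sql).fetchall())


def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql)]


implicit_rows, explicit_rows = rows(implicit_sql), rows(explicit_sql)
print(implicit_rows == explicit_rows)
print(plan(implicit_sql))
print(plan(explicit_sql))
```

If the two printed plans match on the real database as well, the choice between the forms is mostly about readability; the explicit JOIN syntax makes a forgotten join condition easier to spot.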

Related

mysql.connector.errors.DataError: 1242 (21000): Subquery returns more than 1 row

I have this code:
query = """SELECT sp.customer_surname, sp.amount, cp.amount, sp.monthly, sp.date_ FROM set_payment7777 sp JOIN customers_payments7777 cp ON cp.customer_VAT = sp.customer_VAT WHERE sp.date_ = (SELECT MAX(date_) FROM set_payment7777 GROUP BY customer_VAT) GROUP BY sp.customer_VAT"""
mycursor.execute(query)
for row in mycursor:
    pass  # do something with each row
but I get the error:
mysql.connector.errors.DataError: 1242 (21000): Subquery returns more than 1 row
You have several customer_VAT values, so your subquery returns more than one row. To avoid this, you could join on the subquery instead:
query = """SELECT sp.customer_surname, sp.amount, cp.amount, sp.monthly, sp.date_
FROM set_payment7777 sp
INNER JOIN customers_payments7777 cp ON cp.customer_VAT = sp.customer_VAT
INNER JOIN (
    SELECT customer_VAT, MAX(date_) AS max_date FROM set_payment7777
    GROUP BY customer_VAT
) t ON t.customer_VAT = sp.customer_VAT AND t.max_date = sp.date_
GROUP BY sp.customer_VAT"""
In any case, your main SELECT has no aggregate function, so you should avoid an improper use of GROUP BY. Use DISTINCT instead if you need to eliminate repeated rows:
query = """SELECT DISTINCT sp.customer_surname, sp.amount, cp.amount, sp.monthly, sp.date_
FROM set_payment7777 sp
INNER JOIN customers_payments7777 cp ON cp.customer_VAT = sp.customer_VAT
INNER JOIN (
    SELECT customer_VAT, MAX(date_) AS max_date FROM set_payment7777
    GROUP BY customer_VAT
) t ON t.customer_VAT = sp.customer_VAT AND t.max_date = sp.date_"""
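For the join on the subquery to pick the latest payment per customer, the derived table has to expose both customer_VAT and the maximum date, and the join has to match on both. Here is a minimal sketch of that pattern, assuming an in-memory SQLite database and hypothetical sample rows instead of the asker's MySQL data; the customers_payments7777 join is left out to keep it short.

```python
import sqlite3

# SQLite stands in for MySQL here (assumption); the rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE set_payment7777 (
    customer_VAT TEXT, customer_surname TEXT,
    amount REAL, monthly REAL, date_ TEXT
);
INSERT INTO set_payment7777 VALUES
    ('V1', 'Rossi',   100, 10, '2020-01-01'),
    ('V1', 'Rossi',   120, 12, '2020-03-01'),
    ('V2', 'Bianchi', 200, 20, '2020-02-01');
""")

latest_sql = """
SELECT sp.customer_VAT, sp.customer_surname, sp.amount, sp.date_
FROM set_payment7777 sp
INNER JOIN (
    -- one row per customer: the customer key plus its most recent date
    SELECT customer_VAT, MAX(date_) AS max_date
    FROM set_payment7777
    GROUP BY customer_VAT
) t ON t.customer_VAT = sp.customer_VAT
   AND t.max_date = sp.date_
ORDER BY sp.customer_VAT
"""

latest = conn.execute(latest_sql).fetchall()
for row in latest:
    print(row)
```

Because the derived table returns one row per customer, no scalar subquery is involved and error 1242 cannot occur with this shape of query.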

How to replace IN in an SQL query containing a lot of parameters with Postgresql?

I am trying to retrieve information from a database using a Python tuple containing a set of ids (between 1000 and 10000 ids), but my query uses the IN operator and is consequently very slow.
query = """ SELECT *
FROM table1
LEFT JOIN table2 ON table1.id = table2.id
LEFT JOIN ..
LEFT JOIN ...
WHERE table1.id IN {} """.format(my_tuple)
and then I query the PostgreSQL database to load the result into a Pandas dataframe:
import tempfile
import pandas as pd

with tempfile.TemporaryFile() as tmpfile:
    copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
        query=query, head="HEADER"
    )
    conn = db_engine.raw_connection()
    cur = conn.cursor()
    cur.copy_expert(copy_sql, tmpfile)
    tmpfile.seek(0)
    # read the CSV back while the temporary file is still open
    df = pd.read_csv(tmpfile, low_memory=False)
I know that IN is not very efficient with a high number of parameters, but I have no idea how to optimise this part of the query. Any hints?
You could debug your query using the EXPLAIN statement. You are probably reading a big table sequentially while needing only a few rows. Is the field table1.id indexed?
Or you could try filtering table1 first and then joining:
with t1 as (
select f1,f2, .... from table1 where id in {}
)
select *
from t1
left join ....
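One way to combine the "filter first" idea with a long id list is to load the ids into a temporary table and join against it, so the planner sees an indexed table rather than a huge literal list. The sketch below shows this on an in-memory SQLite database (an assumption so that it runs without a server); the table contents, the wanted_ids name and the f1/f2 columns are hypothetical. With psycopg2, a commonly used alternative is to pass a Python list as a single parameter and write `WHERE table1.id = ANY(%s)`; that variant is not run here because it needs a live PostgreSQL server.

```python
import sqlite3

# SQLite stands in for PostgreSQL (assumption); data is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INTEGER PRIMARY KEY, f1 TEXT);
CREATE TABLE table2 (id INTEGER PRIMARY KEY, f2 TEXT);
""")
conn.executemany("INSERT INTO table1 VALUES (?, ?)",
                 [(i, f"a{i}") for i in range(100)])
conn.executemany("INSERT INTO table2 VALUES (?, ?)",
                 [(i, f"b{i}") for i in range(0, 100, 2)])

my_tuple = (3, 4, 10)  # hypothetical ids; the question has 1000 to 10000

# Load the ids once, as bound parameters, into an indexed temporary table.
conn.execute("CREATE TEMP TABLE wanted_ids (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO wanted_ids VALUES (?)",
                 [(i,) for i in my_tuple])

filtered = conn.execute("""
WITH t1 AS (
    -- restrict table1 to the wanted ids before any other join
    SELECT table1.id, table1.f1
    FROM table1
    JOIN wanted_ids w ON w.id = table1.id
)
SELECT t1.id, t1.f1, table2.f2
FROM t1
LEFT JOIN table2 ON table2.id = t1.id
ORDER BY t1.id
""").fetchall()

print(filtered)
```

Binding the ids as parameters also avoids building the SQL text with str.format, which is fragile for one-element tuples and unsafe for untrusted input.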

Add calculated column in lateral join SQLAlchemy

I am trying to write the following PostgreSQL query in SQLAlchemy:
SELECT DISTINCT user_id
FROM
(SELECT *, (amount * usd_rate) as usd_amount
FROM transactions AS t1
LEFT JOIN LATERAL (
SELECT rate as usd_rate
FROM fx_rates fx
WHERE (fx.ccy = t1.currency) AND (t1.created_date > fx.ts)
ORDER BY fx.ts DESC
LIMIT 1
) t2 On true) AS complete_table
WHERE type = 'CARD_PAYMENT' AND usd_amount > 10
So far, I have the lateral join by using subquery in the following way:
lateral_query = session.query(fx_rates.rate.label('usd_rate')).filter(fx_rates.ccy == transactions.currency,
transactions.created_date > fx_rates.ts).order_by(desc(fx_rates.ts)).limit(1).subquery('rates_lateral').lateral('rates')
task2_query = session.query(transactions).outerjoin(lateral_query,true()).filter(transactions.type == 'CARD_PAYMENT')
print(task2_query)
This produces:
SELECT transactions.currency AS transactions_currency, transactions.amount AS transactions_amount, transactions.state AS transactions_state, transactions.created_date AS transactions_created_date, transactions.merchant_category AS transactions_merchant_category, transactions.merchant_country AS transactions_merchant_country, transactions.entry_method AS transactions_entry_method, transactions.user_id AS transactions_user_id, transactions.type AS transactions_type, transactions.source AS transactions_source, transactions.id AS transactions_id
FROM transactions LEFT OUTER JOIN LATERAL (SELECT fx_rates.rate AS usd_rate
FROM fx_rates
WHERE fx_rates.ccy = transactions.currency AND transactions.created_date > fx_rates.ts ORDER BY fx_rates.ts DESC
LIMIT %(param_1)s) AS rates ON true
WHERE transactions.type = %(type_1)s
This prints the correct lateral query, but so far I don't know how to add the calculated field (amount * usd_rate) so that I can apply the DISTINCT and WHERE clauses.
Add the required entity in the Query, give it a label, and use the result as a subquery as you've done in SQL:
task2_query = session.query(
        transactions,
        (transactions.amount * lateral_query.c.usd_rate).label('usd_amount')).\
    outerjoin(lateral_query, true()).\
    subquery()
task3_query = session.query(task2_query.c.user_id).\
    filter(task2_query.c.type == 'CARD_PAYMENT',
           task2_query.c.usd_amount > 10).\
    distinct()
On the other hand, wrapping it in a subquery should be unnecessary, since you can use the calculated USD amount in a WHERE predicate in the inner query just as well:
task2_query = session.query(transactions.user_id).\
    outerjoin(lateral_query, true()).\
    filter(transactions.type == 'CARD_PAYMENT',
           transactions.amount * lateral_query.c.usd_rate > 10).\
    distinct()

Multiple insertion of one value in sqlalchemy statement to pandas

I have constructed an SQL statement in which I reference the same table as a and b to compare the two geometries with a PostGIS function.
I would like to pass a value into the SQL statement using the %s placeholder and read the result into a pandas dataframe using read_sql and its params keyword argument. Currently my code allows one value to be passed to one %s, but I'm looking to insert the same list of values in multiple places.
I'm connecting to a PostgreSQL database using psycopg2.
Simplified code is below
sql = """
SELECT
st_distance(a.the_geom, b.the_geom, true) AS dist
FROM
(SELECT
table.*
FROM table
WHERE id in %s) AS a,
(SELECT
table.*
FROM table
WHERE id in %s) AS b
WHERE a.nid <> b.nid """
sampList = (14070,11184)
df = pd.read_sql(sql, con=conn, params = [sampList])
Basically I'm looking to replace both %s placeholders with the sampList value. The code as written only fills the first one and then fails with "list index out of range". If I keep a single %s and hard-code the numbers in the second IN clause, the code runs, but ultimately I would like a way to repeat those values.
You don't need the subqueries; just join the table with itself:
SELECT a.*, b.* -- or whatever
, st_distance(a.the_geom, b.the_geom, true) AS dist
FROM ztable a
JOIN ztable b ON a.nid < b.nid
WHERE a.id IN (%s)
AND b.id IN (%s)
;
You can avoid the repetition by using a CTE (this may be non-optimal performance-wise):
WITH zt AS (
SELECT * FROM ztable
WHERE id IN (%s)
)
SELECT a.*, b.* -- or whatever
, st_distance(a.the_geom, b.the_geom, true) AS dist
FROM zt a
JOIN zt b ON a.nid < b.nid
;
Performance-wise, I would just stick to the first version and supply the list argument twice (or refer to it twice, using a FORMAT() construct).
First of all, I would recommend using the updated SQL from @wildplasser; it's a much better and more efficient way to do that.
Now you can do the following:
sql_ = """\
WITH zt AS (
SELECT * FROM ztable
WHERE id IN ({})
)
SELECT a.*, b.* -- or whatever
, st_distance(a.the_geom, b.the_geom, true) AS dist
FROM zt a
JOIN zt b ON a.nid < b.nid
"""
sampList = (14070,11184)
# psycopg2 uses %s placeholders (not ?), one per list element
sql = sql_.format(','.join(['%s'] * len(sampList)))
df = pd.read_sql(sql, con=conn, params=sampList)
This gives dynamically generated SQL with parameters (a.k.a. prepared statements, bind variables, etc.):
In [27]: print(sql)
WITH zt AS (
SELECT * FROM ztable
WHERE id IN (%s,%s)
)
SELECT a.*, b.* -- or whatever
, st_distance(a.the_geom, b.the_geom, true) AS dist
FROM zt a
JOIN zt b ON a.nid < b.nid
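If the same list has to appear in two places, psycopg2's named placeholders may be the simplest route: a `%(name)s` placeholder can be repeated in the statement and is filled from one dictionary entry, and a Python tuple is adapted to a parenthesised list suitable for IN. The sketch below only builds the statement and the parameter dictionary, so it runs without a database; the helper name build_distance_query is hypothetical, and the read_sql call is shown as a comment because it assumes an open psycopg2 connection.

```python
def build_distance_query(ids):
    """Return (sql, params) where one named parameter feeds both IN clauses.

    Hypothetical helper for illustration; table and column names follow
    the answers above.
    """
    sql = """
    SELECT st_distance(a.the_geom, b.the_geom, true) AS dist
    FROM ztable a
    JOIN ztable b ON a.nid < b.nid
    WHERE a.id IN %(ids)s
      AND b.id IN %(ids)s
    """
    # psycopg2 renders a tuple as (v1, v2, ...); an empty tuple would
    # produce invalid SQL, so callers should guard against that case.
    params = {"ids": tuple(ids)}
    return sql, params


sql, params = build_distance_query([14070, 11184])
print(sql.count("%(ids)s"), params)

# With an open psycopg2 connection `conn` (not created here) this would be:
# df = pd.read_sql(sql, con=conn, params=params)
```

Compared with generating one placeholder per element, this keeps the SQL text constant regardless of how many ids are passed.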

SQL left join twice from one table web2py

How can I do this?
I have the following tables:
Table1:
-id
-table2_id_1
-table2_id_2
Table2:
-id
-table3_id
Table3:
-id
-table4_id
-table5_id
-table6_id
Table4, Table5 and Table6:
-id
-name
-date
The main table is Table1:
db(db.Table1).select()
I need to join Table2 twice (once for each of the columns table2_id_1 and table2_id_2), join Table3 to each of them (for both table2_id_1 and table2_id_2 the table3_id field is equal), and then join Table4, Table5 and Table6.
I don't know if I really understood what you are trying to do, but if you just want to join the tables according to the ids, something like this should work:
SELECT *
FROM table1 a JOIN table2 b ON (a.table2_id_1 = b.id) JOIN
table2 c ON (a.table2_id_2 = c.id) JOIN
table3 d ON (b.table3_id = d.id) JOIN
table3 e ON (c.table3_id = e.id) JOIN
table4 f ON (d.table4_id = f.id) JOIN
table5 g ON (d.table5_id = g.id) JOIN
table6 h ON (d.table6_id = h.id) JOIN
table4 i ON (e.table4_id = i.id) JOIN
table5 j ON (e.table5_id = j.id) JOIN
table6 k ON (e.table6_id = k.id)
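The key point in the query above is that each extra use of a table gets its own alias. A reduced version can be tried out as follows, assuming an in-memory SQLite database, hypothetical sample rows, and only Table4 at the end of the chain (Table5 and Table6 would follow the same pattern). How this maps onto the web2py DAL is not shown here.

```python
import sqlite3

# SQLite and the sample rows are assumptions made for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table4 (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE table3 (id INTEGER PRIMARY KEY, table4_id INTEGER);
CREATE TABLE table2 (id INTEGER PRIMARY KEY, table3_id INTEGER);
CREATE TABLE table1 (id INTEGER PRIMARY KEY,
                     table2_id_1 INTEGER, table2_id_2 INTEGER);
INSERT INTO table4 VALUES (1, 'alpha'), (2, 'beta');
INSERT INTO table3 VALUES (10, 1), (11, 2);
INSERT INTO table2 VALUES (100, 10), (101, 11);
INSERT INTO table1 VALUES (1, 100, 101);
""")

joined = conn.execute("""
SELECT a.id, f.name AS name_1, i.name AS name_2
FROM table1 a
JOIN table2 b ON a.table2_id_1 = b.id   -- first use of table2
JOIN table2 c ON a.table2_id_2 = c.id   -- second use, separate alias
JOIN table3 d ON b.table3_id = d.id
JOIN table3 e ON c.table3_id = e.id
JOIN table4 f ON d.table4_id = f.id
JOIN table4 i ON e.table4_id = i.id
""").fetchall()

print(joined)
```

If some of the foreign keys can be NULL, LEFT JOIN would be needed in place of JOIN so that Table1 rows are not dropped.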
