In Spark SQL, I need to cast as_of_date to string and do multiple inner joins across 3 tables, selecting all rows and columns from tables 1, 2, and 3 after the join. Example table schemas are shown below.
Tablename : Table_01 alias t1
Column | Datatype
as_of_date | String
Tablename | String
Credit_Card | String
Tablename : Table_02 alias t2
Column | Datatype
as_of_date | INT
Customer_name | String
tablename | string
Tablename : Table_03 alias t3
Column | Datatype
as_of_date | String
tablename | String
address | String
Join use-case:
t1.as_of_date = t2.as_of_date AND t1.tablename = t2.tablename
t2.as_of_date = t3.as_of_date AND t2.tablename = t3.tablename
The tables are already created in Hive; I am doing a Spark transformation on top of these tables, and I am converting as_of_date in table_02 to string.
There are 2 approaches I have thought of, but I am unsure which is the best approach.
Approach 1:
df = spark.sql("select t1.*,t2.*,t3.* from table_1 t1 where cast(t1.as_of_date as string) inner join table_t2 t2 on t1.as_of_date = t2.as_of_date AND t1.tablename = t2.tablename inner join table_03 t3 on t2.as_of_date = t3.as_of_date and t2.tablename = t3.tablename")
Approach 2:
df_t1 = spark.sql("select * from table_01")
df_t2 = spark.sql("select * from table_02")
df_t3 = spark.sql("select * from table_03")
## Cast as_of_date to string if its current dtype is int
from pyspark.sql.functions import col
from pyspark.sql.types import StringType

if dict(df_t2.dtypes)["as_of_date"] == 'int':
    df_t2 = df_t2.withColumn("as_of_date", col("as_of_date").cast(StringType()))
## Join condition
df = (df_t1.alias('t1')
      .join(df_t2.alias('t2'),
            on=[col('t1.as_of_date') == col('t2.as_of_date'),
                col('t1.tablename') == col('t2.tablename')],
            how='inner')
      .join(df_t3.alias('t3'),
            on=[col('t2.as_of_date') == col('t3.as_of_date'),
                col('t2.tablename') == col('t3.tablename')],
            how='inner')
      .select('t1.*', 't2.*', 't3.*'))
I feel that approach 2 is long-winded. I would like some advice on which approach to go with, for ease of maintenance of the scripts.
I would suggest using Spark SQL directly, as below. You can cast every as_of_date column from all tables to string regardless of its data type: you want to cast integer to string, and casting a string to string does no harm.
df = spark.sql("""
select t1.*, t2.*, t3.*
from t1
join t2 on string(t1.as_of_date) = string(t2.as_of_date) AND t1.tablename = t2.tablename
join t3 on string(t2.as_of_date) = string(t3.as_of_date) AND t2.tablename = t3.tablename
""")
I have some problems with an SQL query in Python that I hope you can help me with; I'm trying to retrieve some data from WordPress/WooCommerce.
My code:
cursor.execute("
SELECT t1.ID, t1.post_date, t2.meta_value AS first_name, t3.meta_value AS last_name
FROM test_posts t1
LEFT JOIN test_postmeta t2
ON t1.ID = t2.post_id
WHERE t2.meta_key = '_billing_first_name' and t2.post_id = t1.ID
LEFT JOIN test_postmeta t3
ON t1.ID = t3.post_id
WHERE t3.meta_key = '_billing_last_name' and t3.post_id = t1.ID
GROUP BY t1.ID
ORDER BY t1.post_date DESC LIMIT 20")
I'm getting the following error:
mysql.connector.errors.ProgrammingError: 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'LEFT JOIN test_postmeta t3 ON t1.ID = t3.post_id WHERE t3.meta_key = '_billing' at line 1
What am I doing wrong?
Thanks in advance.
There should be only one WHERE clause, placed before GROUP BY.
But since you use LEFT joins, putting a condition on the right table (like t2.meta_key = '_billing_first_name') in the WHERE clause gives you an INNER join instead, because you reject the unmatched rows.
So set all the conditions in the ON clauses:
cursor.execute("
SELECT t1.ID, t1.post_date, t2.meta_value AS first_name, t3.meta_value AS last_name
FROM test_posts t1
LEFT JOIN test_postmeta t2
ON t1.ID = t2.post_id AND t2.meta_key = '_billing_first_name'
LEFT JOIN test_postmeta t3
ON t1.ID = t3.post_id AND t3.meta_key = '_billing_last_name'
GROUP BY t1.ID
ORDER BY t1.post_date DESC LIMIT 20")
Although this query may be syntactically correct for MySQL, it does not make sense to use GROUP BY, since you do not do any aggregation.
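For instance, assuming each order has at most one postmeta row per meta key (typical for WooCommerce billing fields), you could drop the GROUP BY entirely; a sketch:
cursor.execute("""
    SELECT t1.ID, t1.post_date, t2.meta_value AS first_name, t3.meta_value AS last_name
    FROM test_posts t1
    LEFT JOIN test_postmeta t2
    ON t1.ID = t2.post_id AND t2.meta_key = '_billing_first_name'
    LEFT JOIN test_postmeta t3
    ON t1.ID = t3.post_id AND t3.meta_key = '_billing_last_name'
    ORDER BY t1.post_date DESC LIMIT 20""")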
Your SQL syntax is incorrect. Try this:
cursor.execute("
SELECT t1.ID, t1.post_date, t2.meta_value AS first_name, t3.meta_value AS last_name
FROM test_posts t1
LEFT JOIN test_postmeta t2 ON t1.ID = t2.post_id
LEFT JOIN test_postmeta t3 ON t1.ID = t3.post_id
WHERE t3.meta_key = '_billing_last_name' and t2.meta_key = '_billing_first_name'
GROUP BY t1.ID
ORDER BY t1.post_date DESC LIMIT 20")
It might be worth reading a little bit about SQL Joins and WHERE statements.
I wrote a raw query for psql and it works fine, but when I write it in SQLAlchemy, my WHERE-clause subquery is duplicated into the FROM clause.
select id from T1 where arr && array(select l.id from T1 as l where l.box && box '((0,0),(50,50))');
In this query I fetch all ids from T1 where the integer array intersects with the result of the subquery.
class T1():
arr = Column(ARRAY(Integer))
...
class T2():
box = Column(Box) # my geometry type
...
1st version:
layers_q = select([T2.id]).where(T2.box.op('&&')(box))  # try to find all T2 that intersect with box
chunks = select([T1.id]).where(T1.arr.overlap(layers_q))  # try to find all T1.id where T1.arr overlaps the result of the first query
SELECT T1.id
FROM T1
WHERE T1.arr && (SELECT T2.id
FROM T2
WHERE T2.box && %(box_1)s)
With this I get a PG error about a type cast, which I understand.
2nd version:
layers_q = select([T2.id]).where(T2.box.op('&&')(box))
chunks = select([T1.id]).where(T1.arr.overlap(func.array(layers_q)))
I added func.array() to cast to an array, but the result is not correct:
SELECT T1.id
FROM T1, (SELECT T2.id AS id
FROM T2
WHERE T2.box && %(box_1)s)
WHERE T1.arr && array((SELECT T2.id
FROM T2
WHERE T2.box && %(box_1)s))
There you can see the duplicate in the FROM clause. How do I do this correctly?
I found the solution!
func.array(select([T2.id]).where(T2.box.op('&&')(box)).as_scalar())
After adding as_scalar() everything works, because in my select all the ids need to be in one array.
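Put together, the working version of the second attempt looks like this (a sketch using the names from the question). as_scalar() turns the select into a scalar subquery expression, so it is rendered inline inside func.array() instead of being pulled into the FROM clause:
from sqlalchemy import func, select

layers_q = select([T2.id]).where(T2.box.op('&&')(box)).as_scalar()
chunks = select([T1.id]).where(T1.arr.overlap(func.array(layers_q)))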
Snowflake DB does not support the recursive WITH clause. I need help on how to achieve the below query, which works well in Teradata.
If anyone can also help me achieve this using Python, that would be great.
WITH RECURSIVE RECURTEMP(ID,KCODE,LVL)
AS(SELECT ID, MIN(KCODE) AS KCODE,1
FROM TABLE_A
GROUP BY 1
UNION ALL
SELECT b.ID, trim(a.KCODE)|| ';'||trim(b.KCODE), LVL+1
FROM TABLE_A a
INNER JOIN RECURTEMP b ON a.ID = b.ID AND a.KCODE > b.KCODE
)
SELECT * FROM RECURTEMP
Result: https://imgur.com/a/ppSRXeT
CREATE TABLE MYTABLE (
ID VARCHAR2(50),
KCODE VARCHAR2(50)
);
INSERT INTO MYTABLE VALUES ('ABCD','K10');
INSERT INTO MYTABLE VALUES ('ABCD','K53');
INSERT INTO MYTABLE VALUES ('ABCD','K55');
INSERT INTO MYTABLE VALUES ('ABCD','K56');
COMMIT;
OUTPUT as below
ID KCODE LEVEL
--------------------------------------
ABCD K10 1
ABCD K53;K10 2
ABCD K55;K10 2
ABCD K56;K10 2
ABCD K55;K53;K10 3
ABCD K56;K53;K10 3
ABCD K56;K55;K10 3
ABCD K56;K55;K53;K10 4
Recursive WITH is now supported in Snowflake.
Your query:
WITH RECURSIVE RECURTEMP(ID,KCODE,LVL) AS(
SELECT
ID,
MIN(KCODE) AS KCODE,
1
FROM
TABLE_A
GROUP BY
1
UNION ALL
SELECT
b.ID,
trim(a.KCODE) || ';' || trim(b.KCODE) AS KCODE,
LVL+1
FROM
TABLE_A a
INNER JOIN RECURTEMP b ON (a.ID = b.ID AND a.KCODE > b.KCODE)
)
SELECT * FROM RECURTEMP
A link to the documentation is below.
https://docs.snowflake.net/manuals/user-guide/queries-cte.html#overview-of-recursive-cte-syntax
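For the Python part of the question, the same recursion can be simulated with a plain loop. A minimal sketch, assuming the rows have already been fetched as (ID, KCODE) tuples; the comparison is plain string comparison, exactly like the SQL predicate a.KCODE > b.KCODE:
from collections import defaultdict

rows = [('ABCD', 'K10'), ('ABCD', 'K53'), ('ABCD', 'K55'), ('ABCD', 'K56')]

kcodes = defaultdict(list)
for id_, kcode in rows:
    kcodes[id_].append(kcode.strip())

# Anchor member: the minimum KCODE per ID, at level 1.
frontier = [(id_, min(codes), 1) for id_, codes in kcodes.items()]
result = list(frontier)

# Recursive member: prepend any KCODE strictly greater than the current chain.
while frontier:
    nxt = [(id_, code + ';' + chain, lvl + 1)
           for id_, chain, lvl in frontier
           for code in kcodes[id_]
           if code > chain]
    result.extend(nxt)
    frontier = nxt

for row in sorted(result, key=lambda r: (r[2], r[1])):
    print(row)
This prints the eight rows shown in the expected output above.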
I am trying to fetch in a single query a fixed set of rows, plus some other rows found by a subquery. My problem is that the query generated by my SQLAlchemy code is incorrect.
The problem is that the query generated by SQLAlchemy is as follows:
SELECT tbl.id AS tbl_id
FROM tbl
WHERE tbl.id IN
(
SELECT t2.id AS t2_id
FROM tbl AS t2, tbl AS t1
WHERE t2.id =
(
SELECT t3.id AS t3_id
FROM tbl AS t3, tbl AS t1
WHERE t3.id < t1.id ORDER BY t3.id DESC LIMIT 1 OFFSET 0
)
AND t1.id IN (4, 8)
)
OR tbl.id IN (0, 8)
while the correct query should not have the second tbl AS t1 (the goal of this query is to select IDs 0 and 8, as well as the IDs just before 4 and 8).
Unfortunately, I can't find how to get SQLAlchemy to generate the correct one (see the code below).
Suggestions to also achieve the same result with a simpler query are also welcome (they need to be efficient though -- I tried a few variants and some were a lot slower on my real use case).
The code producing the query:
from sqlalchemy import create_engine, or_
from sqlalchemy import Column, Integer, MetaData, Table
from sqlalchemy.orm import sessionmaker
engine = create_engine('sqlite:///:memory:', echo=True)
meta = MetaData(bind=engine)
table = Table('tbl', meta, Column('id', Integer))
session = sessionmaker(bind=engine)()
meta.create_all()
# Insert IDs 0, 2, 4, 6, 8.
i = table.insert()
i.execute(*[dict(id=i) for i in range(0, 10, 2)])
print session.query(table).all()
# output: [(0,), (2,), (4,), (6,), (8,)]
# Subquery of interest: look for the row just before IDs 4 and 8.
sub_query_txt = (
'SELECT t2.id '
'FROM tbl t1, tbl t2 '
'WHERE t2.id = ( '
' SELECT t3.id from tbl t3 '
' WHERE t3.id < t1.id '
' ORDER BY t3.id DESC '
' LIMIT 1) '
'AND t1.id IN (4, 8)')
print session.execute(sub_query_txt).fetchall()
# output: [(2,), (6,)]
# Full query of interest: get the rows mentioned above, as well as more rows.
query_txt = (
'SELECT * '
'FROM tbl '
'WHERE ( '
' id IN (%s) '
'OR id IN (0, 8))'
) % sub_query_txt
print session.execute(query_txt).fetchall()
# output: [(0,), (2,), (6,), (8,)]
# Attempt at an SQLAlchemy translation (from innermost sub-query to full query).
t1 = table.alias('t1')
t2 = table.alias('t2')
t3 = table.alias('t3')
q1 = session.query(t3.c.id).filter(t3.c.id < t1.c.id).order_by(t3.c.id.desc()).\
limit(1)
q2 = session.query(t2.c.id).filter(t2.c.id == q1, t1.c.id.in_([4, 8]))
q3 = session.query(table).filter(
or_(table.c.id.in_(q2), table.c.id.in_([0, 8])))
print list(q3)
# output: [(0,), (6,), (8,)]
What you are missing is a correlation between the innermost sub-query and the next level up; without the correlation, SQLAlchemy will include the t1 alias in the innermost sub-query:
>>> print str(q1)
SELECT t3.id AS t3_id
FROM tbl AS t3, tbl AS t1
WHERE t3.id < t1.id ORDER BY t3.id DESC
LIMIT ? OFFSET ?
>>> print str(q1.correlate(t1))
SELECT t3.id AS t3_id
FROM tbl AS t3
WHERE t3.id < t1.id ORDER BY t3.id DESC
LIMIT ? OFFSET ?
Note that tbl AS t1 is now missing from the query. From the .correlate() method documentation:
Return a Query construct which will correlate the given FROM clauses to that of an enclosing Query or select().
Thus, t1 is assumed to be part of the enclosing query, and isn't listed in the query itself.
Now your query works:
>>> q1 = session.query(t3.c.id).filter(t3.c.id < t1.c.id).order_by(t3.c.id.desc()).\
... limit(1).correlate(t1)
>>> q2 = session.query(t2.c.id).filter(t2.c.id == q1, t1.c.id.in_([4, 8]))
>>> q3 = session.query(table).filter(
... or_(table.c.id.in_(q2), table.c.id.in_([0, 8])))
>>> print list(q3)
2012-10-24 22:16:22,239 INFO sqlalchemy.engine.base.Engine SELECT tbl.id AS tbl_id
FROM tbl
WHERE tbl.id IN (SELECT t2.id AS t2_id
FROM tbl AS t2, tbl AS t1
WHERE t2.id = (SELECT t3.id AS t3_id
FROM tbl AS t3
WHERE t3.id < t1.id ORDER BY t3.id DESC
LIMIT ? OFFSET ?) AND t1.id IN (?, ?)) OR tbl.id IN (?, ?)
2012-10-24 22:16:22,239 INFO sqlalchemy.engine.base.Engine (1, 0, 4, 8, 0, 8)
[(0,), (2,), (6,), (8,)]
I'm only kinda sure I understand the query you're asking for. Let's break it down, though:
the goal of this query is to select IDs 0 and 8, as well as the IDs just before 4 and 8.
It looks like you want to query for two kinds of things, and then combine them. The proper operator for that is union. Do the simple queries and add them up at the end. I'll start with the second bit, "ids just before X".
To start with, let's look at all of the ids that come before some given value. For this, we'll join the table to itself with a <:
# select t1.id t1_id, t2.id t2_id from tbl t1 join tbl t2 on t1.id < t2.id;
t1_id | t2_id
-------+-------
0 | 2
0 | 4
0 | 6
0 | 8
2 | 4
2 | 6
2 | 8
4 | 6
4 | 8
6 | 8
(10 rows)
That certainly gives us all of the pairs of rows where the left id is less than the right. Of all of those, we want, for a given t2_id, the row where t1_id is as high as possible; we'll group by t2_id and select the maximum t1_id:
# select max(t1.id), t2.id from tbl t1 join tbl t2 on t1.id < t2.id group by t2.id;
max | id
-----+-------
0 | 2
2 | 4
4 | 6
6 | 8
(4 rows)
Your query, using a limit, could achieve this, but it's usually a good idea to avoid that technique when alternatives exist, because correlated LIMIT subqueries do not have good, portable support across database implementations. SQLite can use the technique, but PostgreSQL doesn't like it; it prefers "analytic queries" (window functions), which are both standardised and more general. MySQL can do neither. The above query, though, works consistently across all SQL database engines.
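For reference, the analytic form of the same "row just before" step would look roughly like this on an engine with window-function support (a sketch, not needed for the portable solution below):
select prev_id
from (select id, lag(id) over (order by id) as prev_id from tbl) t
where id in (4, 8);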
The rest of the work is just using IN or other equivalent filtering queries, which are not difficult to express in SQLAlchemy. The boilerplate...
>>> import sqlalchemy as sa
>>> from sqlalchemy.orm import Query
>>> engine = sa.create_engine('sqlite:///:memory:')
>>> meta = sa.MetaData(bind=engine)
>>> table = sa.Table('tbl', meta, sa.Column('id', sa.Integer))
>>> meta.create_all()
>>> table.insert().execute([{'id':i} for i in range(0, 10, 2)])
>>> t1 = table.alias()
>>> t2 = table.alias()
>>> before_filter = [4, 8]
The first interesting bit is that we give the max(id) expression a name. This is needed so that we can refer to it more than once, and to lift it out of a subquery.
>>> c1 = sa.func.max(t1.c.id).label('max_id')
>>> # ^^^^^^
The 'heavy lifting' portion of the query: join the above aliases, group, and select the max.
>>> q1 = Query([c1, t2.c.id]) \
... .join((t2, t1.c.id < t2.c.id)) \
... .group_by(t2.c.id) \
... .filter(t2.c.id.in_(before_filter))
Because we'll be using a union, we need this to produce the right number of fields: we wrap it in a subquery and project down to the only column we're interested in. This will have the name we gave it in the above label() call.
>>> q2 = Query(q1.subquery().c.max_id)
>>> # ^^^^^^
The other half of the union is much simpler:
>>> t3 = table.alias()
>>> exact_filter = [0, 8]
>>> q3 = Query(t3).filter(t3.c.id.in_(exact_filter))
All that's left is to combine them:
>>> q4 = q2.union(q3)
>>> engine.execute(q4.statement).fetchall()
[(0,), (2,), (6,), (8,)]
The responses here helped me fix my issue, but in my case I had to use both correlate() and subquery():
# ...
# Correlate the subquery to the outer table, then turn it into a selectable
# so that its columns can be referenced:
subquery = subquery.correlate(OuterCorrelationTable).subquery()
filter_query = db.session.query(func.sum(subquery.c.some_count_column))
# Compare the scalar result of the aggregate to the parameter:
filter = filter_query.as_scalar() == as_many_as_some_param
# ...
final_query = db.session.query(OuterCorrelationTable).filter(filter)