In pandas, is there a merge or merge_asof operation that can accomplish the SQL equivalent of:
INNER JOIN number_table as n on n.N <= t.some_integer_field
where n is a number/tally table dataframe with a single column of integers (1 to 1000)
and t is a table with some integer field you would like to "deaggregate"
Any tips would be most appreciated!
In SQL, an INNER JOIN without an equality condition is equivalent to a CROSS JOIN, and the ON clause can be replaced with WHERE (technically this holds even with equality). So your:
INNER JOIN number_table as n ON n.N <= t.some_integer_field
Can be replaced as:
CROSS JOIN number_table as n WHERE n.N <= t.some_integer_field
And because cross joins are Cartesian products, you can run the same process in pandas: assign a column of the same constant value to both dataframes and merge on it. Since every key aligns, the merge returns all possible combinations of rows from both sets.
df_number['key'] = 1   # or: df_number = df_number.assign(key=1)
df_table['key'] = 1

# cross join with conditional filter
pd.merge(df_table, df_number, on='key').query('N <= some_integer_field')
Now, performance of cross joins is another question!
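As a minimal runnable sketch of the above (with a small stand-in tally table of 1 to 5 rather than 1 to 1000, and made-up sample values):

```python
import pandas as pd

# Stand-ins for the tables in the question: a small tally table and a
# table with an integer field to deaggregate
df_number = pd.DataFrame({'N': range(1, 6)})
df_table = pd.DataFrame({'some_integer_field': [2, 3]})

# A constant key on both sides makes the merge a cartesian product,
# which query() then filters down to the non-equi join condition
out = (
    df_table.assign(key=1)
    .merge(df_number.assign(key=1), on='key')
    .query('N <= some_integer_field')
    .drop(columns='key')
)
print(out)
```

On pandas 1.2+ you can skip the dummy key entirely and write `merge(df_table, df_number, how='cross')` before the `query` step.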
If I have two tables, I can easily combine them in SQL using something like
SELECT a.*, b.* FROM table_1 a, table_2 b
WHERE (a.id < 1000 OR b.id > 700)
AND a.date < b.date
AND (a.name = b.first_name OR a.name = b.last_name)
AND (a.location = b.origin OR b.destination = 'home')
and there could be many more conditions. Note that this is just an example and the set of conditions may be anything.
The two easiest solutions in pandas that support any set of conditions are:
Compute a cross product of the tables and then filter one condition at a time.
Loop over one DataFrame (apply, itertuples, ...) and filter the second DataFrame in each iteration. Append the filtered DataFrames from each iteration.
In case of huge datasets (at least a few million rows per DataFrame), the first solution is impossible because of the required memory and the second one is considered an anti-pattern (https://stackoverflow.com/a/55557758/2959697). Either solution will be rather slow.
What is the idiomatic pandas way to proceed in this general case?
Note that I am not only interested in a solution to this particular problem but in the general concept of how to translate these types of statements. Can I use pandas.eval? Is it possible to perform a "conditional merge"? Etc.
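For small inputs, the first approach (cross product, then filter one condition at a time) can be sketched as follows; the miniature tables and their values are made up for illustration, and `how='cross'` requires pandas 1.2+:

```python
import pandas as pd

# Hypothetical miniature versions of table_1 and table_2
table_1 = pd.DataFrame({
    'id': [1, 2],
    'date': pd.to_datetime(['2020-01-01', '2020-06-01']),
    'name': ['ann', 'bob'],
    'location': ['home', 'office'],
})
table_2 = pd.DataFrame({
    'id': [800, 900],
    'date': pd.to_datetime(['2020-03-01', '2020-02-01']),
    'first_name': ['ann', 'zoe'],
    'last_name': ['x', 'bob'],
    'origin': ['office', 'home'],
    'destination': ['home', 'work'],
})

# Cartesian product; overlapping column names get _a/_b suffixes
cross = table_1.merge(table_2, how='cross', suffixes=('_a', '_b'))

# Apply the WHERE clause one condition at a time as boolean masks
mask = (
    ((cross['id_a'] < 1000) | (cross['id_b'] > 700))
    & (cross['date_a'] < cross['date_b'])
    & ((cross['name'] == cross['first_name']) | (cross['name'] == cross['last_name']))
    & ((cross['location'] == cross['origin']) | (cross['destination'] == 'home'))
)
result = cross[mask]
```

As the question notes, this becomes infeasible once the cross product no longer fits in memory.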
I have two dataframe df and buyers.
I need to apply two different joins, inner and left anti, to them, take a 1% sample from the left anti result, and then union the two results. I tried the below:
buyr = df.join(buyers,on=['key'],how='inner')
non_buyr = df.join(buyers,on=['key'],how='leftanti')
onepct = non_buyr.sample(False,0.01,seed=100)
df_final = buyr.unionAll(onepct)
But due to this we have two stages for df, and each stage of 731 partitions takes around 4 hours to complete.
Is there any way to perform the inner and left anti joins in a single step instead of two, or any other efficient method to accomplish the same?
The inner join plus the left anti join together are equivalent to a single left join.
Then you can split the result: rows where F.col("buyers.key").isNotNull() are the inner join, and rows where F.col("buyers.key").isNull() are the left anti join.
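The same single-join-then-split idea can be illustrated outside Spark with pandas (made-up data; `merge(..., indicator=True)` marks each row as 'both' for the inner join or 'left_only' for the left anti):

```python
import pandas as pd

# Made-up stand-ins for df and buyers
df = pd.DataFrame({'key': [1, 2, 3, 4], 'val': ['a', 'b', 'c', 'd']})
buyers = pd.DataFrame({'key': [1, 3]})

# One left join; the _merge indicator tells inner apart from left anti
joined = df.merge(buyers, on='key', how='left', indicator=True)
inner = joined[joined['_merge'] == 'both']
anti = joined[joined['_merge'] == 'left_only']

# Sample the anti side (50% here on tiny data; 1% in the question)
# and union it back with the inner rows
sampled = anti.sample(frac=0.5, random_state=100)
final = pd.concat([inner, sampled]).drop(columns='_merge')
```

In Spark the equivalent is one left join followed by two `filter` calls on the null/non-null right-side key, which avoids scanning df twice.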
Note this is for SQLAlchemy Core, not the ORM.
I hope someone can help me with these two questions:
1) There seem to be outerjoin and a plain join, but how does one do an inner join?
2) What is the syntax for multiple joins? I was able to do one join, but I'm not sure of the syntax for multiple joins.
My first join, which works, looks like this:
select([...]).select_from(outerjoin(a, b))
but this syntax for doing two joins generates errors:
select([...]).select_from(outerjoin(a, b).select_from(outerjoin(ma, tr))
Thanks in advance.
join does INNER JOIN by default. outerjoin calls join with argument isouter=True.
If our desired sql query is
SELECT a.col1, b.col2, c.col3
FROM a
LEFT JOIN b ON a.col1 = b.col1
LEFT JOIN c ON c.col1 = b.col2
Then the sqlalchemy-core statement should be:
select(
    [a.c.col1, b.c.col2, c.c.col3]
).select_from(
    a.outerjoin(
        b, a.c.col1 == b.c.col1
    ).outerjoin(
        c, b.c.col2 == c.c.col1
    )
)
The on clause is not necessary if the relationship has been defined and is not ambiguous.
The outerjoin functions can be nested rather than chained (as you have done for the simple join), i.e.
outerjoin(outerjoin(a, b), c)
but I find that form less readable.
I have this large query I am trying to perform. I perform a series of joins, and then from that resulting relation I want to perform another join and filter out certain tuples.
SELECT *
FROM
(
SELECT *
FROM
market_instrument
inner join exchange_instrument
on market_instrument.id = exchange_instrument.instrument_id
inner join Table1 on market_instrument.id = Table1.instrument_id
left join Table2 on market_instrument.id = Table2.instrument_id
left join `options` on market_instrument.id = `options`.instrument_id
left join Table3 on market_instrument.id = Table3.instrument_id
) as R
inner join Table4 on R.instrument_id = Table4.instrument_id
where Table4.fill_timestamp between CURDATE() - INTERVAL 30 DAY AND NOW();
R is the "series of joins" I'm referring to. I want to inner join R with Table4 and then filter out the resulting relation for the last 30 days (where the date attribute is Table4.fill_timestamp). I'm using SQLAlchemy so I thought about somehow saving R to some result relation variable and performing a separate query on that, but I don't know how SQLAlchemy handles that, so I wanted to try doing the entire query in SQL first.
I keep getting the Duplicate Column Name "instrument_id" error. instrument_id is the primary key for all tables except market_instrument, where it's the same but it's called id instead. What can I do to get around this issue?
The problem is that R has all the columns from several tables, and more than one of those tables has a column named "instrument_id". You have not assigned aliases to any of those column names, so SQL does not know which instrument_id column you mean when you say "R.instrument_id".
If market_instrument is the only table with an id column then you could join on R.id instead of R.instrument_id.
Alternatively, another group of solutions involves assigning different names to some or all of the columns in R. For example,
SELECT
    market_instrument.*,
    exchange_instrument.*,
    Table1.instrument_id AS the_one_true_id,
    Table1.another_column,
    Table1.yet_another_column,
    ...
    Table2.*,
    options.*,
    Table3.*
FROM
    market_instrument
    inner join exchange_instrument
        on market_instrument.id = exchange_instrument.instrument_id
    inner join Table1 on market_instrument.id = Table1.instrument_id
    left join Table2 on market_instrument.id = Table2.instrument_id
    left join `options` on market_instrument.id = `options`.instrument_id
    left join Table3 on market_instrument.id = Table3.instrument_id
With the above, you could then join on R.the_one_true_id. Alternatively, you could leave your current join as it is, and rename all the instrument_id columns but one. It might (or might not) be convenient to do that in the context of replacing R with a full-fledged VIEW in your schema.
Alternatively, your select list could enumerate all the columns of all the tables in the join. That might be tedious, but if you really do need all of them, then you will need to do that to disambiguate the other duplicate names, which include, at least, the various other instrument_id columns. Presented with such a task, however, perhaps you would discover that you don't really need every one of them.
As yet another alternative, you could add more columns instead of renaming existing ones. For example,
SELECT
    *,
    exchange_instrument.instrument_id AS ei_instrument_id,
    Table1.instrument_id AS t1_instrument_id,
    Table2.instrument_id AS t2_instrument_id,
    options.instrument_id AS op_instrument_id,
    Table3.instrument_id AS t3_instrument_id
FROM
...
Then you can access, say, R.t1_instrument_id, whose name is presumably unique.
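The aliasing idea can be demonstrated with a tiny sqlite3 sketch (a hypothetical two-table schema; the question's query is MySQL, but column aliasing works the same way):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE market_instrument (id INTEGER, name TEXT);
    CREATE TABLE exchange_instrument (instrument_id INTEGER, exchange TEXT);
    INSERT INTO market_instrument VALUES (1, 'widget');
    INSERT INTO exchange_instrument VALUES (1, 'main');
""")

# Aliasing gives each instrument_id-like column a unique name, so an
# outer query joining against this result can reference it unambiguously
row = conn.execute("""
    SELECT m.id AS mi_id,
           e.instrument_id AS ei_instrument_id,
           m.name,
           e.exchange
    FROM market_instrument m
    INNER JOIN exchange_instrument e ON m.id = e.instrument_id
""").fetchone()
print(row)
```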
How do I do nested joins in SQLAlchemy? The statement I'm trying to run is
SELECT a.col1, a.col2, c.col3
FROM a
LEFT OUTER JOIN (b INNER JOIN c ON c.col4 = b.col4) ON b.col5 = a.col5
I need to show all records in A, but join them only with those records in B that can JOIN with C.
The code I have so far is
session.query(a.col1, a.col2, c.col3).outerjoin(b, b.col5 == a.col5).all()
This gets me most of what I need, with A records showing null values where it's missing B records; however, too many Bs are getting in, and I need to limit them. However, if I just add another join, i.e.,
session.query(a.col1, a.col2, c.col3).outerjoin(b, b.col5 == a.col5).join(c, b.col4 == c.col4).all()
It drops out all of the A records with null values in B.
I should point out that I can't join A to C directly because the only connection between the two is through B.
This is most easily done using a subquery:
subq = (session.query(b.col5).join(c, c.col4 == b.col4)).subquery("subq")
qry = session.query(a).outerjoin(subq, a.col5 == subq.c.col5)
print(qry)
If you showed more of the model definition and especially the nature of the relationships between the tables, there might be a more elegant solution.