What did I misunderstand about the meaning of inner join for the WRDS database? - python

I learned join methods in SQL, and I know that an inner join returns only the intersection of the two tables being joined.
I thought the concept was the same in Python, but I am having trouble understanding the following code.
crsp1=pd.merge(crsp, crsp_maxme, how='inner', on=['jdate','permco','me'])
crsp1=crsp1.drop(['me'], axis=1)
crsp2=pd.merge(crsp1, crsp_summe, how='inner', on=['jdate','permco'])
If I understood correctly, the first line merges tables crsp and crsp_maxme, taking the intersection on the columns 'jdate', 'permco', and 'me'. So the table crsp1 would have 3 columns.
The second line drops the 'me' column of table crsp1.
The last line would merge the newly adjusted table crsp1 and crsp_summe with an inner join, intersecting on 'jdate' and 'permco', which would leave the newly merged table crsp2 with only 2 columns.
However, the explanation accompanying the code says that the second and third lines drop the 'me' column from crsp1 and then replace it with the 'me' column from the crsp_summe table, which I had trouble understanding.
Could anyone clarify these lines for me?
PS: I thought it wasn't necessary to explain what the tables crsp, crsp_summe, and crsp_maxme contain, since the question is only about how the inner joins work. So please excuse the lack of background info.

The merge() function's on parameter specifies which columns you want to join on. how specifies which type of join to apply (similar to SQL joins: outer, inner, left, right, etc.).
Example:
Suppose there are two tables A and B with columns A['x1','x2','x3'] and B['x2','y1']. Joining them on 'x2' (the column common to both tables) produces a result with columns ['x1','x2','x3','y1'], and which rows appear depends on how you join.
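Here is a minimal sketch of that example with made-up values (the frames and values are placeholders, not from the question):

import pandas as pd

A = pd.DataFrame({'x1': [1, 2, 3], 'x2': ['a', 'b', 'c'], 'x3': [10, 20, 30]})
B = pd.DataFrame({'x2': ['b', 'c', 'd'], 'y1': [100, 200, 300]})

# An inner join on the shared column 'x2' keeps only the rows whose 'x2'
# appears in both frames, but the result carries all columns from both.
print(pd.merge(A, B, how='inner', on='x2'))
#    x1 x2  x3   y1
# 0   2  b  20  100
# 1   3  c  30  200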
In your current code, consider:
A = crsp
B = crsp_maxme
C = crsp_summe
Now, in your program:
The first line merges A and B on the ['jdate','permco','me'] columns and creates a new dataframe A_B containing ['jdate','permco','me', ...plus the remaining columns from both tables (A)(B)], as an inner join (i.e. it keeps the rows that are common to both A and B, matched on the ['jdate','permco','me'] columns). Note that an inner join restricts the rows, not the columns: the result keeps every column from both tables, not just the three join keys.
The second line drops the 'me' column from the A_B dataframe, so its columns become something like
['jdate','permco', ...plus the remaining columns from both tables (A)(B)]
The third line merges A_B and C on ['jdate','permco'] and creates ['jdate','permco', ...plus the remaining columns from both tables (A_B)(C)], as an inner join (i.e. it keeps the rows that are common to both A_B and C, matched on the ['jdate','permco'] columns). Since C = crsp_summe brings along its own 'me' column, that is exactly what the code explanation meant: line 2 drops the old 'me', and line 3 effectively replaces it with the 'me' from crsp_summe.
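To make the replacement concrete, here is a minimal sketch with made-up numbers (the toy values and column meanings are assumptions; only the three merge/drop lines come from the question):

import pandas as pd

# Two securities (permno) of one company (permco) in the same month.
crsp = pd.DataFrame({'jdate': ['2020-06', '2020-06'], 'permco': [1, 1],
                     'permno': [101, 102], 'me': [5.0, 3.0]})
# Largest 'me' per (jdate, permco), and summed 'me' per (jdate, permco).
crsp_maxme = pd.DataFrame({'jdate': ['2020-06'], 'permco': [1], 'me': [5.0]})
crsp_summe = pd.DataFrame({'jdate': ['2020-06'], 'permco': [1], 'me': [8.0]})

crsp1 = pd.merge(crsp, crsp_maxme, how='inner', on=['jdate', 'permco', 'me'])  # keeps only permno 101
crsp1 = crsp1.drop(['me'], axis=1)                                             # the old 'me' (5.0) is gone
crsp2 = pd.merge(crsp1, crsp_summe, how='inner', on=['jdate', 'permco'])       # 'me' (8.0) now comes from crsp_summe
print(crsp2)
#      jdate  permco  permno   me
# 0  2020-06       1     101  8.0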

Related

Two different joins on the same table in PySpark

I have two dataframes, df and buyers.
I need to apply two different types of join, inner and leftanti, take a 1% sample from the leftanti result, and then union the two results. I tried the below:
buyr = df.join(buyers,on=['key'],how='inner')
non_buyr = df.join(buyers,on=['key'],how='leftanti')
onepct = non_buyr.sample(False,0.01,seed=100)
df_final = buyr.unionAll(onepct)
But due to this we have two stages for df, and each stage of 731 partitions takes around 4 hours to complete.
Is there any way to perform the inner and left anti joins in a single step instead of two, or any other efficient method to achieve the same?
An inner join plus a left anti join is equal to a simple left join.
Then you can split the result by filtering: the rows where the buyers side matched are the inner part, and the rows where it did not match are the left anti part.
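A minimal sketch of that idea. Note that after a join with on=['key'] the key column is coalesced, so a nullable marker column from the buyers side is the reliable way to tell the two parts apart; the is_buyer flag below is made up for that purpose, and buyers is assumed to contribute only its key (which the unionAll in the question already implies):

from pyspark.sql import functions as F

# df and buyers are the question's dataframes.
buyers_flagged = buyers.select('key').distinct().withColumn('is_buyer', F.lit(True))

joined = df.join(buyers_flagged, on=['key'], how='left')
joined.cache()  # both branches below reuse the same (single) join

buyr = joined.filter(F.col('is_buyer').isNotNull()).drop('is_buyer')   # the inner part
non_buyr = joined.filter(F.col('is_buyer').isNull()).drop('is_buyer')  # the left anti part
onepct = non_buyr.sample(False, 0.01, seed=100)

df_final = buyr.unionAll(onepct)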

How to correctly merge/join a feature to another dataframe

I have two dataframes, and I want to add 3 features of the first dataframe to the second, but ONLY if they match on a certain key value (TicketNr). This key is not unique and can occur multiple times in both dataframes.
I have tried different versions of concat, merge, and join, but I can't get it the way I need it. I don't want to add any rows to the dataframe, just these three columns.
I think this illustration sums up my question. Who can help me with the right code?
You mentioned that TicketNr is not unique in the training set, but if I am correct in assuming that TicketSurvRate, AllSurvived, and AllDIED are the same as long as TicketNr is the same, we could try the following:
# Drop duplicates to get one row per TicketNr, assuming that
# TicketSurvRate, AllSurvived, AllDIED are uniquely defined by TicketNr
x = engineered_train[
    ['TicketNr', 'TicketSurvRate', 'AllSurvived', 'AllDIED']].drop_duplicates()
# Merge test dataset with these de-duplicated stats.
# The how='left' parameter will keep all records from the test set.
# There will be `NaN`s where no match for TicketNr is found.
engineered_test = engineered_test.merge(x, how='left', on='TicketNr')
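As a sanity check (my addition, not part of the original answer), merge() also accepts a validate argument that raises a MergeError if the de-duplicated side still contains repeated TicketNr values:

# 'm:1' asserts the right side (x) has at most one row per TicketNr.
engineered_test = engineered_test.merge(x, how='left', on='TicketNr', validate='m:1')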

Join all dataframes whose names start with df_

I've multiple dataframes (with exactly the same structure, same variables, ...) and their names all start with "df_".
What I would like to do is to join all these dataframes into one.
I can do it manually, but I have many dataframes and their names can change.
frames = [df_24_10000, df_48_10000, df_64_20000, df_82_30000]
result = pd.concat(frames)
Is it possible to join all dataframes whose names start with "df_"?
This is normally the sign of a design problem. If you find yourself trying to group a number of objects by their names, it means they should have been elements of the same container (list, dict, set, map, or whatever) from the very beginning.
Said differently, if instead of df_24_10000, df_48_10000, df_64_20000, you had dfs['24_10000'], dfs['48_10000'], dfs['64_20000'], the join would simply be:
result = pd.concat(dfs.values())
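If the dataframes already exist as separate variables and renaming them is not an option, one workaround (a sketch, not a recommendation) is to collect them from the module namespace by name:

import pandas as pd

# Gather every global whose name starts with 'df_' and is actually a DataFrame.
frames = [obj for name, obj in globals().items()
          if name.startswith('df_') and isinstance(obj, pd.DataFrame)]
result = pd.concat(frames, ignore_index=True)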

SQL Alchemy using Or_ looping multiple columns (Pandas Dataframes)

SUMMARY:
How to query against values from different dataframe columns, with table.column_name combinations, in SQLAlchemy using the or_() statement.
I'm working on a SQLAlchemy project where I pull down the valid columns of a dataframe and enter them all into SQLAlchemy's filter. I've successfully got it running so that it enters all entries of a column, using the column header, like this:
qry = qry.filter(or_(*[getattr(Query_Tbl, column_head).like(x)
                       for x in df[column_head].dropna().values]))
This produced the pattern I was looking for of (tbl.column1 like a OR tbl.column1 like b...) AND- etc.
However, there are groups of dataframe columns that need to be placed together: the columns are different but still need to go inside the same or_() clause,
i.e. (the desired result)
(tbl.col1 like a OR tbl.col1 like b OR tbl.col2 like c OR tbl.col2 like d OR tbl.col3 like e ...) etc.
My latest attempt was to sub-group the columns I needed grouped together, then repeat the previous style inside those groups like:
qry = qry.filter(or_((*[getattr(Query_Tbl, set_id[0]).like(x) \
for x in (df[set_id[0]].dropna().values)]),
(*[getattr(Query_Tbl, set_id[1]).like(y) \
for y in (df[set_id[1]].dropna().values)]),
(*[getattr(Query_Tbl, set_id[2]).like(z) \
for z in (df[set_id[2]].dropna().values)])
))
where set_id is a list of 3 strings corresponding to column1, column2, and column3, so that I get the designated results. However, this produces simply:
(What I'm actually getting)
(tbl.col1 like a OR tbl.col1 like b..) AND (tbl.col2 like c OR tbl.col2 like d...) AND (tbl.col3 like e OR...)
Is there a better way to go about this in SQLAlchemy to get the result I want, or would it be better to find a way of feeding column values from Pandas directly into getattr() to work it into my existing code?
Thank you for reading and in advance for your help!
It appears I was having issues with the way the dataframe was formatted, and I was reading column names into groups differently. The pattern below works for anyone who wants to process multiple dataframe columns into the same OR statement.
I apologize for the confusion; if anyone has comments or questions on the subject, I will be happy to help with this type of issue.
Alternatively, I found a much cleaner answer. Since SQLAlchemy's or_() function can be used with a variable column if you use Python's built-in getattr() function, you only need to create (column, value) pairs, which you can then unpack in a loop.
for group in [group_2, group_3]:
    # Columns from this group that actually appear in the dataframe.
    set_id = list(set(df.columns.values) & set(group))
    if len(set_id) > 1:
        # Build (column, value) pairs for every non-null value in each column.
        set_tuple = list()
        for column in set_id:
            for value in df[column].dropna().values:
                set_tuple.append((column, value))
        print(set_tuple)
        # Unpack the pairs into a single or_() so all columns share one OR clause.
        qry = qry.filter(or_(*[getattr(Query_Tbl, col).like(x) for col, x in set_tuple]))
        df = df.drop(group, axis=1)
If you know which columns need to be grouped into the same or_() statement, you can put them into lists and iterate through them. Inside each group, you create a list of tuples holding the (column, value) pairs you need. Then, within the or_() call, you unpack the column and value in a loop and assign them accordingly. The code is much easier to read and much more compact. I found this to be a more robust solution than explicitly writing out cases for the group sizes.

Need to retain all columns but getting Duplicate Column Name #1060

I have this large query I am trying to perform. I perform a series of joins, and then from that resulting relation I want to perform another join and filter out certain tuples.
SELECT *
FROM
(
SELECT *
FROM
market_instrument
inner join exchange_instrument
on market_instrument.id = exchange_instrument.instrument_id
inner join Table1 on market_instrument.id = Table1.instrument_id
left join Table2 on market_instrument.id = Table2.instrument_id
left join `options` on market_instrument.id = `options`.instrument_id
left join Table3 on market_instrument.id = Table3.instrument_id
) as R
inner join Table4 on R.instrument_id = Table4.instrument_id
where Table4.fill_timestamp between CURDATE() - INTERVAL 30 DAY AND NOW();
R is the "series of joins" I'm referring to. I want to inner join R with Table4 and then filter out the resulting relation for the last 30 days (where the date attribute is Table4.fill_timestamp). I'm using SQLAlchemy so I thought about somehow saving R to some result relation variable and performing a separate query on that, but I don't know how SQLAlchemy handles that, so I wanted to try doing the entire query in SQL first.
I keep getting the Duplicate Column Name "instrument_id" error. instrument_id is the primary key for all tables except market_instrument, where it's the same but it's called id instead. What can I do to get around this issue?
The problem is that R has all the columns from several tables, and more than one of those tables has a column named "instrument_id". You have not assigned aliases to any of those column names, so SQL does not know which instrument_id column you mean when you say "R.instrument_id".
If market_instrument is the only table with an id column then you could join on R.id instead of R.instrument_id.
Alternatively, another group of solutions involves assigning different names to some or all of the columns in R. For example,
SELECT
market_instrument.*,
exchange_instrument.*,
Table1.instrument_id AS the_one_true_id,
Table1.another_column,
Table1.yet_another_column,
...
Table2.*,
options.*,
Table3.*
FROM
market_instrument
inner join exchange_instrument
on market_instrument.id = exchange_instrument.instrument_id
inner join Table1 on market_instrument.id = Table1.instrument_id
left join Table2 on market_instrument.id = Table2.instrument_id
left join `options` on market_instrument.id = `options`.instrument_id
left join Table3 on market_instrument.id = Table3.instrument_id
With the above, you could then join on R.the_one_true_id. Alternatively, you could leave your current join as it is, and rename all the instrument_id columns but one. It might (or might not) be convenient to do that in the context of replacing R with a full-fledged VIEW in your schema.
Alternatively, your select list could enumerate all the columns of all the tables in the join. That might be tedious, but if you really do need all of them, then you will need to do that to disambiguate the other duplicate names, which include, at least, the various other instrument_id columns. Presented with such a task, however, perhaps you would discover that you don't really need every one of them.
As yet another alternative, you could add more columns instead of renaming existing ones. For example,
SELECT
*,
exchange_instrument.instrument_id AS ei_instrument_id,
Table1.instrument_id AS t1_instrument_id,
Table2.instrument_id AS t2_instrument_id,
`options`.instrument_id AS op_instrument_id,
Table3.instrument_id AS t3_instrument_id
FROM
...
Then you can access, say, R.t1_instrument_id, whose name is presumably unique.
