How to merge two pandas dataframes on column of sets - python

I have columns in two dataframes representing interacting partners in a biological system, so if gene_A interacts with gene_B, the entry in column 'gene_pair' would be {gene_A, gene_B}. I want to do an inner join, but trying:
pd.merge(df1, df2, how='inner', on=['gene_pair'])
throws the error
TypeError: type object argument after * must be a sequence, not itertools.imap
I need to merge on the unordered pair, so as far as I can tell I can't merge on a combination of two individual columns with gene names. Is there another way to achieve this merge?
Some example dfs:
import pandas as pd

gene_pairs1 = [
    set(['gene_A', 'gene_B']),
    set(['gene_A', 'gene_C']),
    set(['gene_D', 'gene_A'])
]
df1 = pd.DataFrame({'r_name': ['r1', 'r2', 'r3'], 'gene_pair': gene_pairs1})

gene_pairs2 = [
    set(['gene_A', 'gene_B']),
    set(['gene_F', 'gene_A']),
    set(['gene_C', 'gene_A'])
]
df2 = pd.DataFrame({'function': ['f1', 'f2', 'f3'], 'gene_pair': gene_pairs2})

pd.merge(df1, df2, how='inner', on=['gene_pair'])
and I would like entry 'r1' to line up with 'f1' and 'r2' to line up with 'f3'.

Pretty simple in the end: I used frozenset, rather than set.
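For reference, here is a minimal sketch of that fix using the example frames above. Unlike sets, frozensets are hashable, so pandas can use them as merge keys:
import pandas as pd

# frozenset is hashable, so a column of frozensets can serve as a merge key
gene_pairs1 = [frozenset(p) for p in [('gene_A', 'gene_B'),
                                      ('gene_A', 'gene_C'),
                                      ('gene_D', 'gene_A')]]
df1 = pd.DataFrame({'r_name': ['r1', 'r2', 'r3'], 'gene_pair': gene_pairs1})

gene_pairs2 = [frozenset(p) for p in [('gene_A', 'gene_B'),
                                      ('gene_F', 'gene_A'),
                                      ('gene_C', 'gene_A')]]
df2 = pd.DataFrame({'function': ['f1', 'f2', 'f3'], 'gene_pair': gene_pairs2})

# frozenset equality ignores element order, so r1 pairs with f1 and r2 with f3
print(pd.merge(df1, df2, how='inner', on='gene_pair'))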

I suggest you add an extra ID column for each pair and then join on that!
For example:
# build an order-independent pair ID; sorting ensures {A, B} and {B, A} produce
# the same ID (this uses the last character of each gene name, so it assumes
# those characters are unique across genes)
df2['gp'] = df2.gene_pair.apply(lambda x: ''.join(sorted(g[-1] for g in x)))
df1['gp'] = df1.gene_pair.apply(lambda x: ''.join(sorted(g[-1] for g in x)))
pd.merge(df1, df2[['function', 'gp']], how='inner', on=['gp']).drop('gp', axis=1)

Joining two dataframes on subvalue of the key column

I am currently trying to join / merge two dfs on a key column, where in df2 the key is a standalone value such as 5, but in df1 the key can consist of multiple values such as [5,6,13].
For example like this:
import pandas as pd

df1 = pd.DataFrame({'key': [["5","6","13"],["10","7"],["6","8"]]})
df2 = pd.DataFrame({'sub_key': ["5","10","6"]})
However, my df are a lot bigger and consist of many columns, so an efficient solution would be great.
As a result I would like to have a table like this:
Key1    Key2
5       5,6,13
10      10,7
and so on ....
I already tried to apply this approach to my code, but it didn't work:
df1['join'] = 1
df2['join'] = 1
merged= df1.merge(df2, on='join').drop('join', axis=1)
df2.drop('join', axis=1, inplace=True)
merged['match'] = merged.apply(lambda x: x.key(x.sub_key), axis=1).ge(0)
I also tried to split and explode the column and to join on single values, but the problem there was that not all column values were split correctly, and I would need to combine everything back into one cell once joined.
Help would be much appreciated!
If you only want to match the first key:
df1['sub_key'] = df1.key.str[0]
df1.merge(df2)
If you want to match ANY key:
df3 = df1.explode('key').rename(columns={'key':'sub_key'})
df3 = df3.join(df1)
df3.merge(df2)
Edit: First version had a small bug, fixed it.
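For reference, the any-key version put together with the example frames from the question (a sketch, assuming string keys as in the example):
import pandas as pd

df1 = pd.DataFrame({'key': [["5", "6", "13"], ["10", "7"], ["6", "8"]]})
df2 = pd.DataFrame({'sub_key': ["5", "10", "6"]})

# one row per individual key; the original row index is preserved
df3 = df1.explode('key').rename(columns={'key': 'sub_key'})
# bring the full key list back alongside each exploded key
df3 = df3.join(df1)
# inner merge keeps only rows whose exploded key appears in df2
print(df3.merge(df2))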

Merge multiple dataframes in Python on key

I have three dataframes that all contain a Date_final column. I want to merge them into a single dataframe, where I first create a new dataframe from any one stock's dataframe. My current code is below.
What would the optimized code be?
df_stockanlaysis = stock1[['Date_final']]
df_stockanlaysis = pd.merge(df_stockanlaysis, stock1, how='left', on ="Date_final")
df_stockanlaysis = pd.merge(df_stockanlaysis, stock2, how='left', on ="Date_final")
df_stockanlaysis = pd.merge(df_stockanlaysis, stock3, how='left', on ="Date_final")
You could use functools.reduce with pandas here.
Put those dataframes in a list like
from functools import reduce

dfs = [df_stockanlaysis1, df_stockanlaysis2, df_stockanlaysis3]
and then, assuming the common column in this case is 'Date_final':
df_final = reduce(lambda left, right: pd.merge(left, right, on='Date_final'), dfs)
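As a self-contained sketch (the stock frames here are hypothetical stand-ins):
from functools import reduce
import pandas as pd

# hypothetical stock frames that share the 'Date_final' column
stock1 = pd.DataFrame({'Date_final': ['2021-01-01', '2021-01-02'], 'price1': [10, 11]})
stock2 = pd.DataFrame({'Date_final': ['2021-01-01', '2021-01-02'], 'price2': [20, 21]})
stock3 = pd.DataFrame({'Date_final': ['2021-01-01', '2021-01-02'], 'price3': [30, 31]})

# fold the list into one frame, merging pairwise on the shared date column
dfs = [stock1, stock2, stock3]
df_final = reduce(lambda left, right: pd.merge(left, right, on='Date_final'), dfs)
print(df_final)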

How to specify hierarchical columns in Pandas merge?

After a serious misconception of how on works in join (spoiler: it is very different from on in merge), here is my example code.
import pandas as pd
index1 = pd.MultiIndex.from_product([["variables"], ["number", "fruit"]])
df1 = pd.DataFrame([["one", "apple"], ["two", "banana"]], columns=index1)
index2 = pd.MultiIndex.from_product([["variables"], ["fruit", "color"]])
df2 = pd.DataFrame([["banana", "yellow"]], columns=index2)
print(df1.merge(df2, on="fruit", how="left"))
I get a KeyError. How do I correctly reference variables.fruit here?
To understand what I am after, consider the same problem without a multi index:
import pandas as pd
df1 = pd.DataFrame([["one", "apple"], ["two", "banana"]], columns=["number", "fruit"])
df2 = pd.DataFrame([["banana", "yellow"]], columns=["fruit", "color"])
# this is obviously incorrect as it uses indexes on `df1` as well as `df2`:
print(df1.join(df2, rsuffix="_"))
# this is *also* incorrect, although I initially thought it should work, but it uses the index on `df2`:
print(df1.join(df2, on="fruit", rsuffix="_"))
# this is correct:
print(df1.merge(df2, on="fruit", how="left"))
The expected and wanted result is this:
  number   fruit   color
0    one   apple     NaN
1    two  banana  yellow
How do I get the same when fruit is part of a multi index?
I think I understand what you are trying to accomplish now, and I don't think join is going to get you there. Both DataFrame.join and DataFrame.merge make a call to pandas.core.reshape.merge.merge, but using DataFrame.merge gives you more control over what defaults are applied.
In your case, you can reference the columns to join on via a list of tuples, where the elements of the tuple are the levels of the multi-indexed columns. I.e. to use the variables / fruit column, you can pass [('variables', 'fruit')].
Using tuples is how you index into multi-index columns (and row indices). You need to wrap it in a list because merge operations can be performed using multiple columns, or multiple multi-indexed columns, like a JOIN statement in SQL. Passing a single string is just a convenience case that gets wrapped in a list for you.
Since you are only joining on 1 column, it is a list of a single tuple.
import pandas as pd
index1 = pd.MultiIndex.from_product([["variables"], ["number", "fruit"]])
df1 = pd.DataFrame([["one", "apple"], ["two", "banana"]], columns=index1)
index2 = pd.MultiIndex.from_product([["variables"], ["fruit", "color"]])
df2 = pd.DataFrame([["banana", "yellow"]], columns=index2)
df1.merge(df2, how='left', on=[('variables', 'fruit')])
# returns:
  variables
     number   fruit   color
0       one   apple     NaN
1       two  banana  yellow

Pandas join dataframes based on different columns

I have been trying to merge multiple dataframes using the reduce() function, as mentioned in this link: pandas three-way joining multiple dataframes on columns.
dfs = [df0, df1, df2, dfN]
df_final = reduce(lambda left,right: pd.merge(left,right,on='name'), dfs)
However, in my case the join columns are different for the related dataframes. Therefore I would need to use different left_on and right_on values on every merge.
I have come up with a workaround, which is not efficient or elegant in any way, but for now it works. I would like to know if the same can be achieved using reduce(), or maybe other efficient alternatives. I am foreseeing that there would be many dataframes I would need to join down the line.
import pandas as pd
...
...
# xml files - table1.xml, table2.xml and table3.xml are converted to <dataframe1>, <dataframe2>, <dataframe3> respectively.
_df = {
    'table1': '<dataframe1>',
    'table2': '<dataframe2>',
    'table3': '<dataframe3>'
}
# variable that tells column1 of table1 is related to column2 of table2,
# which can be used as left_on/right_on while merging dataframes
_relationship = {
    'table1': {
        'table2': ['NAME', 'DIFF_NAME']},
    'table2': {
        'table3': ['T2_ID', 'T3_ID']}
}
def _join_dataframes(_rel_pair):
    # copy
    df_temp = dict(_df)
    for ele in _rel_pair:
        first_table = ele[0]
        second_table = ele[1]
        lefton = _relationship[first_table][second_table][0]
        righton = _relationship[first_table][second_table][1]
        _merged_df = pd.merge(df_temp[first_table], df_temp[second_table],
                              left_on=lefton, right_on=righton, how="inner")
        df_temp[ele[1]] = _merged_df
    return _merged_df
# I have come up with this structure based on _df.keys()
_rel_pair = [['table1', 'table2'], ['table2', 'table3']]
_join_dataframes(_rel_pair)
Why don't you just rename the columns of all the dataframes first?
df0.rename(columns={'old_column_name0': 'commonname'}, inplace=True)
...
dfN.rename(columns={'old_column_nameN': 'commonname'}, inplace=True)
dfs = [df0, df1, df2, ..., dfN]
df_final = reduce(lambda left, right: pd.merge(left, right, on='commonname'), dfs)
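A minimal sketch of that idea, with hypothetical frames whose key columns start out under different names:
from functools import reduce
import pandas as pd

# hypothetical frames whose join columns have different names
df0 = pd.DataFrame({'id_a': [1, 2], 'x': ['a', 'b']})
df1 = pd.DataFrame({'id_b': [1, 2], 'y': ['c', 'd']})

# rename each differing key column to one common name...
df0 = df0.rename(columns={'id_a': 'commonname'})
df1 = df1.rename(columns={'id_b': 'commonname'})

# ...so that a single reduce over the list works
dfs = [df0, df1]
df_final = reduce(lambda left, right: pd.merge(left, right, on='commonname'), dfs)
print(df_final)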
Try using the concat function instead of reduce.
A simple trick I like to use when merging DFs is setting the index on the columns I want to use as a guide when merging. Example:
# note different column names 'B' and 'C'
dfA = pd.read_csv('yourfile_A.csv', index_col=['A', 'B'])
dfB = pd.read_csv('yourfile_B.csv', index_col=['C', 'D'])
df = pd.concat([dfA, dfB], axis=1)
You will need unique indexes / multiindexes for this to work, but I think this should be no problem for most cases. Never tried a large concat, but this approach should theoretically work for N concats.
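A self-contained version of that trick, with toy frames in place of the CSV files (hypothetical names, and it assumes the index values line up across both frames):
import pandas as pd

# toy stand-ins for the CSV files; note the differently named index columns
dfA = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y'], 'valA': [10, 20]}).set_index(['A', 'B'])
dfB = pd.DataFrame({'C': [1, 2], 'D': ['x', 'y'], 'valB': [30, 40]}).set_index(['C', 'D'])

# concat aligns rows by index value, so the matching (1, 'x') / (2, 'y') pairs line up
df = pd.concat([dfA, dfB], axis=1)
print(df)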
Alternatively, you can use merge instead, as it provides left_on and right_on parameters specifically for those situations where column names differ between dataframes. An example:
dfA.merge(dfB, left_on='name', right_on='username')
A more complete explanation on how to merge dfs: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
concat: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
merge: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html

How to use the content of one dataframe to index another multi-level index dataframe?

I have the following dataframes, site_1_df and site_2_df (both are similar; their printed output is omitted here). From their indexes I build the following dataframes:
site_1_index_df = pd.DataFrame(site_1_df.index.values.tolist(), columns=["SiteNumber", "WeekNumber", "PG"])
site_2_index_df = pd.DataFrame(site_2_df.index.values.tolist(), columns=["SiteNumber", "WeekNumber", "PG"])
index_intersection = pd.merge(left=site_1_index_df, right=site_2_index_df,
                              on=["WeekNumber", "PG"], how="inner")[["WeekNumber", "PG"]]
index_intersection: (output omitted)
Consequently, it is clear that site_1_df and site_2_df are multi-level indexed dataframes, and I would like to use index_intersection to index them. If I am indexing from site_1_df, then I want a subset of the rows of that same dataframe; technically, I should get back a dataframe with (8556 rows x 6 columns), i.e., the same number of rows as index_intersection. How can I achieve that efficiently in pandas?
I tried:
index_intersection = pd.merge(left=site_1_index_df, right=site_2_index_df,
                              on=["WeekNumber", "PG"], how="inner")[["SiteNumber_x", "WeekNumber", "PG"]]
index_intersection = index_intersection.rename(columns={"SiteNumber_x": "SiteNumber"})
index_intersection = index_intersection.set_index(["SiteNumber", "WeekNumber", "PG"])
index_intersection
And I got: (output omitted)
However, indexing the dataframe using another dataframe such as:
site_2_df.loc[index_intersection]
# or
site_2_df.loc[index_intersection.index]
# or
site_2_df.loc[index_intersection.index.values]
will give me an error:
NotImplementedError: Indexing a MultiIndex with a DataFrame key is not implemented
Any help is much appreciated!!
So I figured out that I can find the intersection of two dataframes, based on their index, through:
sites_common_rows = pd.merge(left=site_1_df.reset_index([0]), right=site_2_df.reset_index([0]),
                             left_index=True, right_index=True, how="inner")
The reset_index([0]) above is used to ignore the SiteNumber level, since it is totally different from one dataframe to the other. Consequently, I am able to find the inner join between the two dataframes from their indexes.
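A toy version of that pattern, with small hypothetical frames indexed by (SiteNumber, WeekNumber, PG):
import pandas as pd

idx1 = pd.MultiIndex.from_tuples([(1, 1, 'a'), (1, 2, 'b')],
                                 names=['SiteNumber', 'WeekNumber', 'PG'])
site_1_df = pd.DataFrame({'v1': [10, 20]}, index=idx1)

idx2 = pd.MultiIndex.from_tuples([(2, 1, 'a'), (2, 3, 'c')],
                                 names=['SiteNumber', 'WeekNumber', 'PG'])
site_2_df = pd.DataFrame({'v2': [30, 40]}, index=idx2)

# drop SiteNumber into a column so the join runs on (WeekNumber, PG) only
sites_common_rows = pd.merge(left=site_1_df.reset_index([0]), right=site_2_df.reset_index([0]),
                             left_index=True, right_index=True, how="inner")
print(sites_common_rows)  # one matching row: WeekNumber 1, PG 'a'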
