How to merge two pandas dataframes on column of sets - python

I have columns in two dataframes representing interacting partners in a biological system, so if gene_A interacts with gene_B, the entry in column 'gene_pair' would be {gene_A, gene_B}. I want to do an inner join, but trying:
pd.merge(df1, df2, how='inner', on=['gene_pair'])
throws the error
TypeError: type object argument after * must be a sequence, not itertools.imap
I need to merge on the unordered pair, so as far as I can tell I can't merge on a combination of two individual columns with gene names. Is there another way to achieve this merge?
Some example dfs:
import pandas as pd

gene_pairs1 = [
    set(['gene_A', 'gene_B']),
    set(['gene_A', 'gene_C']),
    set(['gene_D', 'gene_A'])
]
df1 = pd.DataFrame({'r_name': ['r1', 'r2', 'r3'], 'gene_pair': gene_pairs1})

gene_pairs2 = [
    set(['gene_A', 'gene_B']),
    set(['gene_F', 'gene_A']),
    set(['gene_C', 'gene_A'])
]
df2 = pd.DataFrame({'function': ['f1', 'f2', 'f3'], 'gene_pair': gene_pairs2})

pd.merge(df1, df2, how='inner', on=['gene_pair'])
and I would like entry 'r1' to line up with 'f1' and 'r2' to line up with 'f3'.

Pretty simple in the end: I used frozenset, rather than set.
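For reference, here is a minimal sketch of that fix using the example frames above. Unlike sets, frozensets are hashable, so pandas can use them as merge keys:
import pandas as pd

# frozenset is hashable, so a column of frozensets can serve as a merge key
gene_pairs1 = [frozenset(p) for p in [('gene_A', 'gene_B'),
                                      ('gene_A', 'gene_C'),
                                      ('gene_D', 'gene_A')]]
df1 = pd.DataFrame({'r_name': ['r1', 'r2', 'r3'], 'gene_pair': gene_pairs1})

gene_pairs2 = [frozenset(p) for p in [('gene_A', 'gene_B'),
                                      ('gene_F', 'gene_A'),
                                      ('gene_C', 'gene_A')]]
df2 = pd.DataFrame({'function': ['f1', 'f2', 'f3'], 'gene_pair': gene_pairs2})

# frozenset equality ignores element order, so r1 pairs with f1 and r2 with f3
print(pd.merge(df1, df2, how='inner', on='gene_pair'))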

I suggest you add an extra ID column for each pair and then join on that!
For example:
# build an order-independent pair ID; sorting ensures {A, B} and {B, A} produce
# the same ID (this uses the last character of each gene name, so it assumes
# those characters are unique across genes)
df2['gp'] = df2.gene_pair.apply(lambda x: ''.join(sorted(g[-1] for g in x)))
df1['gp'] = df1.gene_pair.apply(lambda x: ''.join(sorted(g[-1] for g in x)))
pd.merge(df1, df2[['function', 'gp']], how='inner', on=['gp']).drop('gp', axis=1)

Joining two dataframes on subvalue of the key column

I am currently trying to join / merge two dfs on a key column, where in df2 the key is a standalone value such as 5, but in df1 the key can consist of multiple values such as [5,6,13].
For example like this:
import pandas as pd

df1 = pd.DataFrame({'key': [["5","6","13"],["10","7"],["6","8"]]})
df2 = pd.DataFrame({'sub_key': ["5","10","6"]})
However, my df are a lot bigger and consist of many columns, so an efficient solution would be great.
As a result I would like to have a table like this:
Key1    Key2
5       5,6,13
10      10,7
and so on ....
I already tried to apply this approach to my code, but it didn't work:
df1['join'] = 1
df2['join'] = 1
merged= df1.merge(df2, on='join').drop('join', axis=1)
df2.drop('join', axis=1, inplace=True)
merged['match'] = merged.apply(lambda x: x.key(x.sub_key), axis=1).ge(0)
I also tried to split and explode the column and to join on single values, but the problem there was that not all column values were split correctly, and I would need to combine everything back into one cell once joined.
Help would be much appreciated!
If you only want to match the first key:
df1['sub_key'] = df1.key.str[0]
df1.merge(df2)
If you want to match ANY key:
df3 = df1.explode('key').rename(columns={'key':'sub_key'})
df3 = df3.join(df1)
df3.merge(df2)
Edit: First version had a small bug, fixed it.
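For reference, the any-key version put together with the example frames from the question (a sketch, assuming string keys as in the example):
import pandas as pd

df1 = pd.DataFrame({'key': [["5", "6", "13"], ["10", "7"], ["6", "8"]]})
df2 = pd.DataFrame({'sub_key': ["5", "10", "6"]})

# one row per individual key; the original row index is preserved
df3 = df1.explode('key').rename(columns={'key': 'sub_key'})
# bring the full key list back alongside each exploded key
df3 = df3.join(df1)
# inner merge keeps only rows whose exploded key appears in df2
print(df3.merge(df2))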

Merge multiple dataframes in Python on key

I have three dataframes that all contain a Date_final column. I want to merge them into a single dataframe, where I first create a new dataframe from any one stock's dataframe. My current code is below.
What would the optimized code be?
df_stockanlaysis = stock1[['Date_final']]
df_stockanlaysis = pd.merge(df_stockanlaysis, stock1, how='left', on ="Date_final")
df_stockanlaysis = pd.merge(df_stockanlaysis, stock2, how='left', on ="Date_final")
df_stockanlaysis = pd.merge(df_stockanlaysis, stock3, how='left', on ="Date_final")
You could use functools.reduce with pandas here.
Put those dataframes in a list like
from functools import reduce

dfs = [df_stockanlaysis1, df_stockanlaysis2, df_stockanlaysis3]
and then, assuming the common column in this case is 'Date_final':
df_final = reduce(lambda left, right: pd.merge(left, right, on='Date_final'), dfs)
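As a self-contained sketch (the stock frames here are hypothetical stand-ins):
from functools import reduce
import pandas as pd

# hypothetical stock frames that share the 'Date_final' column
stock1 = pd.DataFrame({'Date_final': ['2021-01-01', '2021-01-02'], 'price1': [10, 11]})
stock2 = pd.DataFrame({'Date_final': ['2021-01-01', '2021-01-02'], 'price2': [20, 21]})
stock3 = pd.DataFrame({'Date_final': ['2021-01-01', '2021-01-02'], 'price3': [30, 31]})

# fold the list into one frame, merging pairwise on the shared date column
dfs = [stock1, stock2, stock3]
df_final = reduce(lambda left, right: pd.merge(left, right, on='Date_final'), dfs)
print(df_final)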

How to specify hierarchical columns in Pandas merge?

After a serious misconception of how on works in join (spoiler: it is very different from on in merge), here is my example code.
import pandas as pd
index1 = pd.MultiIndex.from_product([["variables"], ["number", "fruit"]])
df1 = pd.DataFrame([["one", "apple"], ["two", "banana"]], columns=index1)
index2 = pd.MultiIndex.from_product([["variables"], ["fruit", "color"]])
df2 = pd.DataFrame([["banana", "yellow"]], columns=index2)
print(df1.merge(df2, on="fruit", how="left"))
I get a KeyError. How do I correctly reference variables.fruit here?
To understand what I am after, consider the same problem without a multi index:
import pandas as pd
df1 = pd.DataFrame([["one", "apple"], ["two", "banana"]], columns=["number", "fruit"])
df2 = pd.DataFrame([["banana", "yellow"]], columns=["fruit", "color"])
# this is obviously incorrect as it uses indexes on `df1` as well as `df2`:
print(df1.join(df2, rsuffix="_"))
# this is *also* incorrect, although I initially thought it should work, but it uses the index on `df2`:
print(df1.join(df2, on="fruit", rsuffix="_"))
# this is correct:
print(df1.merge(df2, on="fruit", how="left"))
The expected and wanted result is this:
  number   fruit   color
0    one   apple     NaN
1    two  banana  yellow
How do I get the same when fruit is part of a multi index?
I think I understand what you are trying to accomplish now, and I don't think join is going to get you there. Both DataFrame.join and DataFrame.merge make a call to pandas.core.reshape.merge.merge, but using DataFrame.merge gives you more control over what defaults are applied.
In your case, you can reference the columns to join on via a list of tuples, where the elements of the tuple are the levels of the multi-indexed columns. I.e. to use the variables / fruit column, you can pass [('variables', 'fruit')].
Using tuples is how you index into multi-index columns (and row indices). You need to wrap it in a list because merge operations can be performed using multiple columns, or multiple multi-indexed columns, like a JOIN statement in SQL. Passing a single string is just a convenience case that gets wrapped in a list for you.
Since you are only joining on 1 column, it is a list of a single tuple.
import pandas as pd
index1 = pd.MultiIndex.from_product([["variables"], ["number", "fruit"]])
df1 = pd.DataFrame([["one", "apple"], ["two", "banana"]], columns=index1)
index2 = pd.MultiIndex.from_product([["variables"], ["fruit", "color"]])
df2 = pd.DataFrame([["banana", "yellow"]], columns=index2)
df1.merge(df2, how='left', on=[('variables', 'fruit')])
# returns:
  variables
     number   fruit   color
0       one   apple     NaN
1       two  banana  yellow

Pandas join dataframes based on different columns

I have been trying to merge multiple dataframes using the reduce() function, as mentioned in this link: pandas three-way joining multiple dataframes on columns.
dfs = [df0, df1, df2, dfN]
df_final = reduce(lambda left,right: pd.merge(left,right,on='name'), dfs)
However, in my case the join columns are different for the related dataframes. Therefore I would need to use different left_on and right_on values on every merge.
I have come up with a workaround, which is not efficient or elegant in any way, but for now it works. I would like to know if the same can be achieved using reduce(), or maybe other efficient alternatives. I am foreseeing that there would be many dataframes I would need to join down the line.
import pandas as pd
...
...
# xml files - table1.xml, table2.xml and table3.xml are converted to <dataframe1>, <dataframe2>, <dataframe3> respectively.
_df = {
    'table1': '<dataframe1>',
    'table2': '<dataframe2>',
    'table3': '<dataframe3>'
}
# variable that tells column1 of table1 is related to column2 of table2,
# which can be used as left_on/right_on while merging dataframes
_relationship = {
    'table1': {
        'table2': ['NAME', 'DIFF_NAME']},
    'table2': {
        'table3': ['T2_ID', 'T3_ID']}
}
def _join_dataframes(_rel_pair):
    # copy
    df_temp = dict(_df)
    for ele in _rel_pair:
        first_table = ele[0]
        second_table = ele[1]
        lefton = _relationship[first_table][second_table][0]
        righton = _relationship[first_table][second_table][1]
        _merged_df = pd.merge(df_temp[first_table], df_temp[second_table],
                              left_on=lefton, right_on=righton, how="inner")
        df_temp[ele[1]] = _merged_df
    return _merged_df
# I have come up with this structure based on _df.keys()
_rel_pair = [['table1', 'table2'], ['table2', 'table3']]
_join_dataframes(_rel_pair)
Why don't you just rename the columns of all the dataframes first?
df0.rename(columns={'old_column_name0': 'commonname'}, inplace=True)
...
dfN.rename(columns={'old_column_nameN': 'commonname'}, inplace=True)
dfs = [df0, df1, df2, ..., dfN]
df_final = reduce(lambda left, right: pd.merge(left, right, on='commonname'), dfs)
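A minimal sketch of that idea, with hypothetical frames whose key columns start out under different names:
from functools import reduce
import pandas as pd

# hypothetical frames whose join columns have different names
df0 = pd.DataFrame({'id_a': [1, 2], 'x': ['a', 'b']})
df1 = pd.DataFrame({'id_b': [1, 2], 'y': ['c', 'd']})

# rename each differing key column to one common name...
df0 = df0.rename(columns={'id_a': 'commonname'})
df1 = df1.rename(columns={'id_b': 'commonname'})

# ...so that a single reduce over the list works
dfs = [df0, df1]
df_final = reduce(lambda left, right: pd.merge(left, right, on='commonname'), dfs)
print(df_final)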
Try using the concat function instead of reduce.
A simple trick I like to use when merging DFs is setting the index on the columns I want to use as a guide when merging. Example:
# note different column names 'B' and 'C'
dfA = pd.read_csv('yourfile_A.csv', index_col=['A', 'B'])
dfB = pd.read_csv('yourfile_B.csv', index_col=['C', 'D'])
df = pd.concat([dfA, dfB], axis=1)
You will need unique indexes / multiindexes for this to work, but I think this should be no problem for most cases. Never tried a large concat, but this approach should theoretically work for N concats.
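A self-contained version of that trick, with toy frames in place of the CSV files (hypothetical names, and it assumes the index values line up across both frames):
import pandas as pd

# toy stand-ins for the CSV files; note the differently named index columns
dfA = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y'], 'valA': [10, 20]}).set_index(['A', 'B'])
dfB = pd.DataFrame({'C': [1, 2], 'D': ['x', 'y'], 'valB': [30, 40]}).set_index(['C', 'D'])

# concat aligns rows by index value, so the matching (1, 'x') / (2, 'y') pairs line up
df = pd.concat([dfA, dfB], axis=1)
print(df)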
Alternatively, you can use merge instead, as it provides left_on and right_on parameters specifically for those situations where column names differ between dataframes. An example:
dfA.merge(dfB, left_on='name', right_on='username')
A more complete explanation on how to merge dfs: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
concat: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
merge: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html

How to use the content of one dataframe to index another multi-level index dataframe?

I have the following dataframes, site_1_df and site_2_df (both are similar; their printed output is omitted here). From their indexes I build the following dataframes:
site_1_index_df = pd.DataFrame(site_1_df.index.values.tolist(), columns=["SiteNumber", "WeekNumber", "PG"])
site_2_index_df = pd.DataFrame(site_2_df.index.values.tolist(), columns=["SiteNumber", "WeekNumber", "PG"])
index_intersection = pd.merge(left=site_1_index_df, right=site_2_index_df,
                              on=["WeekNumber", "PG"], how="inner")[["WeekNumber", "PG"]]
index_intersection: (output omitted)
Consequently, it is clear that site_1_df and site_2_df are multi-level indexed dataframes, and I would like to use index_intersection to index them. If I am indexing from site_1_df, then I want a subset of the rows of that same dataframe; technically, I should get back a dataframe with (8556 rows x 6 columns), i.e., the same number of rows as index_intersection. How can I achieve that efficiently in pandas?
I tried:
index_intersection = pd.merge(left=site_1_index_df, right=site_2_index_df,
                              on=["WeekNumber", "PG"], how="inner")[["SiteNumber_x", "WeekNumber", "PG"]]
index_intersection = index_intersection.rename(columns={"SiteNumber_x": "SiteNumber"})
index_intersection = index_intersection.set_index(["SiteNumber", "WeekNumber", "PG"])
index_intersection
And I got: (output omitted)
However, indexing the dataframe using another dataframe such as:
site_2_df.loc[index_intersection]
# or
site_2_df.loc[index_intersection.index]
# or
site_2_df.loc[index_intersection.index.values]
will give me an error:
NotImplementedError: Indexing a MultiIndex with a DataFrame key is not implemented
Any help is much appreciated!!
So I figured out that I can find the intersection of two dataframes, based on their index, through:
sites_common_rows = pd.merge(left=site_1_df.reset_index([0]), right=site_2_df.reset_index([0]),
                             left_index=True, right_index=True, how="inner")
The reset_index([0]) above is used to ignore the SiteNumber level, since it is totally different from one dataframe to the other. Consequently, I am able to find the inner join between the two dataframes from their indexes.
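A toy version of that pattern, with small hypothetical frames indexed by (SiteNumber, WeekNumber, PG):
import pandas as pd

idx1 = pd.MultiIndex.from_tuples([(1, 1, 'a'), (1, 2, 'b')],
                                 names=['SiteNumber', 'WeekNumber', 'PG'])
site_1_df = pd.DataFrame({'v1': [10, 20]}, index=idx1)

idx2 = pd.MultiIndex.from_tuples([(2, 1, 'a'), (2, 3, 'c')],
                                 names=['SiteNumber', 'WeekNumber', 'PG'])
site_2_df = pd.DataFrame({'v2': [30, 40]}, index=idx2)

# drop SiteNumber into a column so the join runs on (WeekNumber, PG) only
sites_common_rows = pd.merge(left=site_1_df.reset_index([0]), right=site_2_df.reset_index([0]),
                             left_index=True, right_index=True, how="inner")
print(sites_common_rows)  # one matching row: WeekNumber 1, PG 'a'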
