Merge multiple dataframes in Python on key

I have three dataframes that all share a Date_final column. I want to merge them into a single dataframe: first I create a new dataframe from one stock's Date_final column, then I merge each stock into it. What should the optimized code be?
df_stockanlaysis = stock1[['Date_final']]
df_stockanlaysis = pd.merge(df_stockanlaysis, stock1, how='left', on="Date_final")
df_stockanlaysis = pd.merge(df_stockanlaysis, stock2, how='left', on="Date_final")
df_stockanlaysis = pd.merge(df_stockanlaysis, stock3, how='left', on="Date_final")

You could use functools.reduce with pd.merge.
Put those dataframes in a list:
dfs = [df_stockanlaysis1, df_stockanlaysis2, df_stockanlaysis3]
Then, assuming the common column is 'Date_final', reduce the list with a single merge expression:
from functools import reduce

df_final = reduce(lambda left, right: pd.merge(left, right, on='Date_final'), dfs)
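If you want to keep the left-join semantics of your original code, a minimal sketch (assuming the stock1, stock2 and stock3 frames from the question) looks like:
from functools import reduce
import pandas as pd

# start from the date spine and left-join each stock onto it, as in the question
dfs = [stock1, stock2, stock3]
df_stockanlaysis = reduce(
    lambda left, right: pd.merge(left, right, how='left', on='Date_final'),
    dfs,
    stock1[['Date_final']])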

Related

Import multiple CSV files into pandas and merge those based on column values

I have 4 dataframes:
import pandas as pd
df_inventory_parts = pd.read_csv('inventory_parts.csv')
df_colors = pd.read_csv('colors.csv')
df_part_categories = pd.read_csv('part_categories.csv')
df_parts = pd.read_csv('parts.csv')
Now I have merged them into one new dataframe like this:
merged = pd.merge(
    left=df_inventory_parts,
    right=df_colors,
    how='left',
    left_on='color_id',
    right_on='id')
merged = pd.merge(
    left=merged,
    right=df_parts,
    how='left',
    left_on='part_num',
    right_on='part_num')
merged = pd.merge(
    left=merged,
    right=df_part_categories,
    how='left',
    left_on='part_cat_id',
    right_on='id')
merged.head(20)
This gives the correct dataset that I'm looking for. However, I was wondering if there's a shorter / faster way of writing this. Using pd.merge three times seems a bit excessive.
You have a pretty clear section of code that does exactly what you want. You want to do three merges, so using merge() three times is adequate rather than excessive.
You can make your code a bit shorter by using the fact that DataFrames have a merge method, so you don't need the left argument. You can also chain the calls, but note that the example below does not look as neat and readable as your longer form:
merged = df_inventory_parts.merge(
    right=df_colors,
    how='left',
    left_on='color_id',
    right_on='id'
).merge(
    right=df_parts,
    how='left',
    left_on='part_num',
    right_on='part_num'
).merge(
    right=df_part_categories,
    how='left',
    left_on='part_cat_id',
    right_on='id')
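If the chain ever grows beyond three merges, one hedged alternative is to drive the merges from a plain list of (right frame, left key, right key) tuples; the merge_specs name below is made up, while the frames and keys are the ones from the question:
merge_specs = [
    (df_colors, 'color_id', 'id'),
    (df_parts, 'part_num', 'part_num'),
    (df_part_categories, 'part_cat_id', 'id'),
]

# left-join each spec onto the running result, same semantics as the chain above
merged = df_inventory_parts
for right, left_on, right_on in merge_specs:
    merged = merged.merge(right, how='left', left_on=left_on, right_on=right_on)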

Join in Pandas Dataframe using conditional join statement

I am trying to join two dataframes, df1 and df2 (both shown as tables in the original post).
I want to join them on the condition that if 'col2' of df2 is blank/NULL, the join should occur only on 'column1' of df1 and 'col1' of df2, but if it is not blank/NULL, the join should occur on two conditions, i.e. 'column1' and 'column2' of df1 with 'col1' and 'col2' of df2 respectively.
The final dataframe that I wish to obtain is also shown in the original post.
My current approach is to slice these two dataframes into four and then join them separately based on the condition. Is there any way to do this without slicing them, or maybe a better way that I'm missing?
The idea is to rename the columns and left join on both columns first, then fill the missing values by matching on column1 alone. Here it is necessary to remove duplicates with DataFrame.drop_duplicates before Series.map, so that the values in col1 are unique:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
s = df2.drop_duplicates('col1').set_index('col1')['col3']
df['col3'] = df['col3'].fillna(df['column1'].map(s))
EDIT: A general solution that works with multiple columns - the first part is the same (a left join), and in the second part a merge on one column is combined with DataFrame.combine_first to replace the missing values:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
df23 = df22.drop_duplicates('column1').drop('column2', axis=1)
df = df.merge(df23, on='column1', how='left', suffixes=('','_'))
cols = df.columns[df.columns.str.endswith('_')]
df = df.combine_first(df[cols].rename(columns=lambda x: x.strip('_'))).drop(cols, axis=1)
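As a quick illustration of the first snippet, on small hypothetical data where df2's second row has a blank col2 (the values here are made up):
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'column1': ['A', 'B'], 'column2': ['X', 'Q']})
df2 = pd.DataFrame({'col1': ['A', 'B'], 'col2': ['X', np.nan], 'col3': ['r1', 'r2']})

df22 = df2.rename(columns={'col1': 'column1', 'col2': 'column2'})
df = df1.merge(df22, on=['column1', 'column2'], how='left')
s = df2.drop_duplicates('col1').set_index('col1')['col3']
df['col3'] = df['col3'].fillna(df['column1'].map(s))
# row A matched on both columns; row B fell back to matching on column1 alone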

Pandas join dataframes based on different columns

I have been trying to merge multiple dataframes using the reduce() function mentioned in this link: pandas three-way joining multiple dataframes on columns.
dfs = [df0, df1, df2, dfN]
df_final = reduce(lambda left,right: pd.merge(left,right,on='name'), dfs)
However, in my case the join columns differ between the related dataframes, so I would need different left_on and right_on values on every merge.
I have come up with a workaround, which is not efficient or elegant in any way, but for now it works. I would like to know whether the same can be achieved using reduce(), or whether there are more efficient alternatives. I foresee that there will be many dataframes to join down the line.
import pandas as pd
...
...
# xml files - table1.xml, table2.xml and table3.xml are converted to
# <dataframe1>, <dataframe2>, <dataframe3> respectively.
_df = {
    'table1': '<dataframe1>',
    'table2': '<dataframe2>',
    'table3': '<dataframe3>'
}
# tells that column1 of table1 is related to column2 of table2, which can be
# used as left_on/right_on while merging dataframes
_relationship = {
    'table1': {
        'table2': ['NAME', 'DIFF_NAME']},
    'table2': {
        'table3': ['T2_ID', 'T3_ID']}
}

def _join_dataframes(_rel_pair):
    # copy, so the original dict is not mutated
    df_temp = dict(_df)
    for ele in _rel_pair:
        first_table = ele[0]
        second_table = ele[1]
        lefton = _relationship[first_table][second_table][0]
        righton = _relationship[first_table][second_table][1]
        _merged_df = pd.merge(df_temp[first_table], df_temp[second_table],
                              left_on=lefton, right_on=righton, how="inner")
        df_temp[second_table] = _merged_df
    return _merged_df

# I have come up with this structure based on _df.keys()
_rel_pair = [['table1', 'table2'], ['table2', 'table3']]
_join_dataframes(_rel_pair)
Why don't you just rename the join columns of all the dataframes to a common name first?
df0.rename(columns={'old_column_name0': 'name'}, inplace=True)
...
dfN.rename(columns={'old_column_nameN': 'name'}, inplace=True)
dfs = [df0, df1, df2, ..., dfN]
df_final = reduce(lambda left, right: pd.merge(left, right, on='name'), dfs)
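A minimal runnable sketch of that pattern, with made-up frames whose key columns start out with different names:
from functools import reduce
import pandas as pd

df0 = pd.DataFrame({'id_a': [1, 2], 'x': ['a', 'b']})
df1 = pd.DataFrame({'id_b': [1, 2], 'y': ['c', 'd']})

# rename the differing key columns to one common name, then reduce-merge
df0.rename(columns={'id_a': 'name'}, inplace=True)
df1.rename(columns={'id_b': 'name'}, inplace=True)

df_final = reduce(lambda left, right: pd.merge(left, right, on='name'), [df0, df1])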
Try using the concat function instead of reduce.
A simple trick I like to use when merging DFs is setting the index on the columns I want to use as a guide when merging. Example:
# note the different column names between the two files
dfA = pd.read_csv('yourfile_A.csv', index_col=['A', 'B'])
dfB = pd.read_csv('yourfile_B.csv', index_col=['C', 'D'])
df = pd.concat([dfA, dfB], axis=1)
You will need unique indexes / multi-indexes for this to work, but I think this should be no problem for most cases. I have never tried a large concat, but this approach should theoretically work for N concats.
Alternatively, you can use merge instead, as it provides the left_on and right_on parameters specifically for situations where column names differ between dataframes. An example:
dfA.merge(dfB, left_on='name', right_on='username')
A more complete explanation on how to merge dfs: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
concat: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
merge: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html
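If you want to keep the reduce() pattern even though the key names differ at each step, one hedged sketch is to carry the join keys alongside each frame; table1_df, table2_df and table3_df below are hypothetical stand-ins for the frames in _df, and the keys mirror the _relationship mapping from the question:
from functools import reduce
import pandas as pd

# hypothetical stand-ins for the three tables
table1_df = pd.DataFrame({'NAME': ['n1'], 'a': [10]})
table2_df = pd.DataFrame({'DIFF_NAME': ['n1'], 'T2_ID': [1]})
table3_df = pd.DataFrame({'T3_ID': [1], 'b': [20]})

# (right frame, left key in the accumulated frame, right key) per merge step
merge_plan = [
    (table2_df, 'NAME', 'DIFF_NAME'),
    (table3_df, 'T2_ID', 'T3_ID'),
]

df_final = reduce(
    lambda left, step: pd.merge(left, step[0],
                                left_on=step[1], right_on=step[2], how='inner'),
    merge_plan,
    table1_df)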

Merge values in next row pandas

I am trying to merge two dataframes (the input tables and the desired result are shown as images in the original post). I want the merged values not to end up on the same row.
When I try to merge like this:
pd.merge(a, b, how='inner', on=['date_install', 'device_os'])
it merges the values into one row.
Please suggest how this can be resolved.
You can use append() (note that DataFrame.append is deprecated since pandas 1.4 and removed in 2.0, so prefer concat on current versions):
df1.append(df2, ignore_index=True)
or concat():
pd.concat([df, df1], ignore_index=True)
or merge():
df.merge(df1, how='outer')
Reference taken from: https://stackoverflow.com/a/24391268/3748167 (probable duplicate)
The only difference here is concatenating both dataframes first and then applying the rest of the logic:
def sjoin(x):
    # join the non-null values in a row into one ';'-separated string
    return ';'.join(x[x.notnull()].astype(str))

pd.concat([a, b]).groupby(level=0, axis=1).apply(lambda x: x.apply(sjoin, axis=1))

How to merge two pandas dataframes on column of sets

I have columns in two dataframes representing interacting partners in a biological system, so if gene_A interacts with gene_B, the entry in column 'gene_pair' would be {gene_A, gene_B}. I want to do an inner join, but trying:
pd.merge(df1, df2, how='inner', on=['gene_pair'])
throws the error
TypeError: type object argument after * must be a sequence, not itertools.imap
I need to merge on the unordered pair, so as far as I can tell I can't merge on a combination of two individual columns with gene names. Is there another way to achieve this merge?
Some example dfs:
gene_pairs1 = [
    set(['gene_A', 'gene_B']),
    set(['gene_A', 'gene_C']),
    set(['gene_D', 'gene_A'])
]
df1 = pd.DataFrame({'r_name': ['r1', 'r2', 'r3'], 'gene_pair': gene_pairs1})
gene_pairs2 = [
    set(['gene_A', 'gene_B']),
    set(['gene_F', 'gene_A']),
    set(['gene_C', 'gene_A'])
]
df2 = pd.DataFrame({'function': ['f1', 'f2', 'f3'], 'gene_pair': gene_pairs2})
pd.merge(df1, df2, how='inner', on=['gene_pair'])
I would like entry 'r1' to line up with 'f1' and 'r2' to line up with 'f3'.
Pretty simple in the end: I used frozenset, rather than set.
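A minimal sketch of that fix, reusing the example frames above (frozenset is hashable and order-insensitive, so merge can use it as a key):
# replace the set column with frozensets, then the merge works as written
df1['gene_pair'] = df1['gene_pair'].map(frozenset)
df2['gene_pair'] = df2['gene_pair'].map(frozenset)
merged = pd.merge(df1, df2, how='inner', on='gene_pair')
# 'r1' now lines up with 'f1', and 'r2' with 'f3'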
I suggest you add an extra id column for each pair and then join on that!
For example:
# build an order-independent key from the last character of each gene name;
# sorting makes the key stable, because set iteration order is arbitrary
df2['gp'] = df2.gene_pair.apply(lambda x: ''.join(sorted(s[-1] for s in x)))
df1['gp'] = df1.gene_pair.apply(lambda x: ''.join(sorted(s[-1] for s in x)))
pd.merge(df1, df2[['function', 'gp']], how='inner', on=['gp']).drop('gp', axis=1)
