What's the Python/pandas way to merge multilevel DataFrames on column "t" under "cell 1" and "cell 2"?
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.arange(4).reshape(2, 2),
                   columns=[['cell 1'] * 2, ['t', 'sb']])
df2 = pd.DataFrame([[1, 5], [2, 6]],
                   columns=[['cell 2'] * 2, ['t', 'sb']])
Now when I try to merge on "t", the Python REPL errors out:
ddf = pd.merge(df1, df2, on='t', how='outer')
What's a good way to handle this?
pd.merge(df1, df2, left_on=[('cell 1', 't')], right_on=[('cell 2', 't')])
One solution is to drop the top level (i.e. 'cell 1' and 'cell 2') from the DataFrames and then merge.
If you want, you can save these columns to reinstate them after the merge.
c1 = df1.columns
c2 = df2.columns
df1.columns = df1.columns.droplevel()
df2.columns = df2.columns.droplevel()
df_merged = df1.merge(df2, on='t', how='outer', suffixes=['_df1', '_df2'])
df1.columns = c1
df2.columns = c2
>>> df_merged
t sb_df1 sb_df2
0 0 1 NaN
1 2 3 6
2 1 NaN 5
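In recent pandas versions you can also merge directly on the MultiIndex column tuples, without dropping any levels. A sketch (the result keeps both two-level "t" columns rather than collapsing them into one):

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.arange(4).reshape(2, 2),
                   columns=[['cell 1'] * 2, ['t', 'sb']])
df2 = pd.DataFrame([[1, 5], [2, 6]],
                   columns=[['cell 2'] * 2, ['t', 'sb']])

# A tuple addresses one column of a MultiIndex
merged = pd.merge(df1, df2,
                  left_on=[('cell 1', 't')],
                  right_on=[('cell 2', 't')],
                  how='outer')
```

The result has three rows (keys 0 and 2 from df1, 1 and 2 from df2) and all four original columns.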
Related
I have a list of DataFrames, and I want to add a new column to each DataFrame containing the name of that DataFrame.
df_all = [df1,df2,df3]
for df in df_all:
    df["Loc"] = df[df].astype(str)
Boolean array expected for the condition, not object
Is this possible to achieve?
You can't do this directly: Python objects have no way of knowing their own name(s).
You could emulate it with:
df_all = [df1, df2, df3]
for i, df in enumerate(df_all, start=1):
    df['Loc'] = f'df{i}'
Alternatively, use a dictionary:
df_all = {'df1': df1, 'df2': df2, 'df3': df3}
for k, df in df_all.items():
    df['Loc'] = k
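For completeness, a self-contained run of the dictionary variant (the DataFrame contents here are made up for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [3, 4]})

df_all = {'df1': df1, 'df2': df2}
for k, df in df_all.items():
    df['Loc'] = k  # the dict key doubles as the "name"
```

After the loop, every row of df1 carries 'df1' in its Loc column, and likewise for df2.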
It can be done using the locals() dictionary, which maps variable names to references, and the is operator to match identities.
df1, df2, df3 = pd.DataFrame([1, 1, 1]), pd.DataFrame([2, 2, 2]), pd.DataFrame([3, 3, 3])
df_all = [df1, df2, df3]
_df = k = v = None
for _df in df_all:
    for k, v in locals().items():
        if v is _df and k != '_df':
            _df["Loc"] = k
print(*df_all, sep='\n\n')
0 Loc
0 1 df1
1 1 df1
2 1 df1
0 Loc
0 2 df2
1 2 df2
2 2 df2
0 Loc
0 3 df3
1 3 df3
2 3 df3
I am using pandas 1.4.3 and Python 3.9.13.
I am creating some identical DataFrames as follows:
d = {'col1': [1, 2], 'col2': [3, 4]}
df_1 = pd.DataFrame(data=d)
df_2 = pd.DataFrame(data=d)
df_3 = pd.DataFrame(data=d)
df_4 = pd.DataFrame(data=d)
datasets = [df_1, df_2, df_3, df_4]
Now I am trying to merge them all into a single DataFrame on col1. So, I do the following:
from functools import reduce
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['col1'], how='outer', suffixes=["_x", "_y"]), datasets)
So, I am trying to basically keep all the columns but just use some suffixes so that they stay unique. However, the issue is that since it is more than two dataframes, this ends up resulting in duplicated columns as:
col1 col2_x col2_y col2_x col2_y
0 1 3 3 3 3
1 2 4 4 4 4
I was wondering what would be the best way to do such a merge while ensuring no columns are dropped and duplicates are preserved properly, with incrementally added suffixes...
EDIT
At the moment, I am now doing it with a loop as:
merged = datasets[0]
for i in range(1, len(datasets)):
    merged = pd.merge(merged, datasets[i], how='outer', on=['col1'], suffixes=[None, f"_{i}"])
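Alternatively (a sketch, not from the original answers): rename the non-key columns up front so each one is already unique, and then the reduce needs no suffixes at all:

```python
import pandas as pd
from functools import reduce

d = {'col1': [1, 2], 'col2': [3, 4]}
datasets = [pd.DataFrame(d) for _ in range(4)]

# Give every non-key column a unique numeric suffix before merging
renamed = [
    df.rename(columns={c: f"{c}_{i}" for c in df.columns if c != 'col1'})
    for i, df in enumerate(datasets, start=1)
]
merged = reduce(lambda l, r: pd.merge(l, r, on='col1', how='outer'), renamed)
```

This yields columns col1, col2_1, col2_2, col2_3, col2_4 with no silent duplicates.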
A slightly cumbersome solution:
import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
df_1 = pd.DataFrame(data=d)
df_1.attrs['name'] = '1'
df_2 = pd.DataFrame(data=d)
df_2.attrs['name'] = '2'
df_3 = pd.DataFrame(data=d)
df_3.attrs['name'] = '3'
df_4 = pd.DataFrame(data=d)
df_4.attrs['name'] = '4'
datasets = [df_1, df_2, df_3, df_4]
from functools import reduce
def mrg(left, right):
    return pd.merge(left, right, on=['col1'], how='outer',
                    suffixes=["_" + str(left.attrs.get('name')), "_" + str(right.attrs.get('name'))])

df_merged = reduce(mrg, datasets)
I have two large DataFrames that I don't want to copy, but I want to apply the same change to both. How can I do this properly? For example, the code below is similar to what I want to do, on a smaller scale. It only creates a temporary variable df holding the filtered result for each DataFrame, but I want the DataFrames themselves to be changed:
import pandas as pd
df1 = pd.DataFrame({'a':[1,2,3]})
df2 = pd.DataFrame({'a':[0,1,5,7]})
for df in [df1, df2]:
    df = df[df['a'] < 3]
We can use DataFrame.query with inplace=True:
df1 = pd.DataFrame({'a':[1,2,3]})
df2 = pd.DataFrame({'a':[0,1,5,7]})
for df in [df1, df2]:
    df.query('a < 3', inplace=True)
df1
a
0 1
1 2
df2
a
0 0
1 1
I don't think this is the best solution, but it should do the job.
import pandas as pd
df1 = pd.DataFrame({'a':[1,2,3]})
df2 = pd.DataFrame({'a':[0,1,5,7]})
dfs = [df1, df2]
for i, df in enumerate(dfs):
    dfs[i] = df[df['a'] < 3]
dfs[0]
a
0 1
1 2
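Another option that mutates the original objects themselves is DataFrame.drop with inplace=True, removing the rows that fail the condition (a sketch, same data as above):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'a': [0, 1, 5, 7]})

for df in (df1, df2):
    # Drop the offending rows from the original object itself
    df.drop(df[df['a'] >= 3].index, inplace=True)
```

Since drop works on the object referenced by df, both df1 and df2 are changed in place.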
I have two dataframes which I am joining like so:
df3 = df1.join(df2.set_index('id'), on='id', how='left')
But I want to replace the values for ids that are present in df1 but not in df2 with NaN (a left join will just leave the values in df1 as they are). What's the easiest way to accomplish this?
I think you need Series.where with Series.isin:
df1['id'] = df1['id'].where(df1['id'].isin(df2['id']))
Or numpy.where:
df1['id'] = np.where(df1['id'].isin(df2['id']), df1['id'], np.nan)
Sample:
df1 = pd.DataFrame({
    'id': list('abc'),
})
df2 = pd.DataFrame({
    'id': list('dmna'),
})
df1['id'] = df1['id'].where(df1['id'].isin(df2['id']))
print (df1)
id
0 a
1 NaN
2 NaN
Or solution with merge and indicator parameter:
df3 = df1.merge(df2, on='id', how='left', indicator=True)
df3['id'] = df3['id'].mask(df3.pop('_merge').eq('left_only'))
print (df3)
id
0 a
1 NaN
2 NaN
I want to merge two DataFrames df1 and df2,
using left_on: column 'X' from df1,
using right_on: the result of a function f applied to column 'Y' from df2.
I could create a dedicated column 'Z' holding f(df2['Y']), but I want to avoid that.
Is it possible?
You can pass left_on and right_on arguments to merge:
In [11]: df1 = pd.DataFrame([[1, 3], [2, 4]], columns=["X", "A"])
In [12]: df2 = pd.DataFrame([[1, 5], [2, 6]], columns=["Z", "B"])
In [13]: df1.merge(df2, left_on=["X"], right_on=["Z"])
Out[13]:
X A Z B
0 1 3 1 5
1 2 4 2 6
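To address the function part of the question directly: left_on/right_on also accept array-likes of the frame's length, so you can pass the transformed values to merge without materializing a 'Z' column. A sketch with a made-up transform f (the rename avoids a name clash with the existing 'Y' column):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 3], [2, 4]], columns=["X", "A"])
df2 = pd.DataFrame([[10, 5], [20, 6]], columns=["Y", "B"])

def f(y):
    # Made-up transform for illustration
    return y // 10

# right_on accepts an array-like aligned with df2's rows
merged = df1.merge(df2, left_on="X", right_on=df2["Y"].map(f).rename("Y_key"))
```

Here X=1 matches f(10)=1 and X=2 matches f(20)=2, so both rows pair up without an intermediate column on df2.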