What's the Python/pandas way to merge multilevel DataFrames on column "t" under "cell 1" and "cell 2"?
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.arange(4).reshape(2, 2),
                   columns=[['cell 1'] * 2, ['t', 'sb']])
df2 = pd.DataFrame([[1, 5], [2, 6]],
                   columns=[['cell 2'] * 2, ['t', 'sb']])
Now when I try to merge on "t", the Python REPL errors out:
ddf = pd.merge(df1, df2, on='t', how='outer')
What's a good way to handle this?
pd.merge(df1, df2, left_on=[('cell 1', 't')], right_on=[('cell 2', 't')])
One solution is to drop the top level (i.e. 'cell 1' and 'cell 2') from the DataFrames and then merge.
If you want, you can save these columns to reinstate them after the merge.
c1 = df1.columns
c2 = df2.columns
df1.columns = df1.columns.droplevel()
df2.columns = df2.columns.droplevel()
df_merged = df1.merge(df2, on='t', how='outer', suffixes=['_df1', '_df2'])
df1.columns = c1
df2.columns = c2
>>> df_merged
t sb_df1 sb_df2
0 0 1 NaN
1 2 3 6
2 1 NaN 5
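In recent pandas versions you can also merge directly on the MultiIndex column tuples, without dropping any levels. A sketch (the result keeps both two-level "t" columns rather than collapsing them into one):

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.arange(4).reshape(2, 2),
                   columns=[['cell 1'] * 2, ['t', 'sb']])
df2 = pd.DataFrame([[1, 5], [2, 6]],
                   columns=[['cell 2'] * 2, ['t', 'sb']])

# A tuple addresses one column of a MultiIndex
merged = pd.merge(df1, df2,
                  left_on=[('cell 1', 't')],
                  right_on=[('cell 2', 't')],
                  how='outer')
```

The result has three rows (keys 0 and 2 from df1, 1 and 2 from df2) and all four original columns.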
Related
I have a list of DataFrames, and I want to add a new column to each DataFrame containing the name of that DataFrame.
df_all = [df1,df2,df3]
for df in df_all:
    df["Loc"] = df[df].astype(str)
Boolean array expected for the condition, not object
Is this possible to achieve?
You can't do this directly: Python objects have no way of knowing their own name(s).
You could emulate it with:
df_all = [df1, df2, df3]
for i, df in enumerate(df_all, start=1):
    df['Loc'] = f'df{i}'
Alternatively, use a dictionary:
df_all = {'df1': df1, 'df2': df2, 'df3': df3}
for k, df in df_all.items():
    df['Loc'] = k
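For completeness, a self-contained run of the dictionary variant (the DataFrame contents here are made up for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [3, 4]})

df_all = {'df1': df1, 'df2': df2}
for k, df in df_all.items():
    df['Loc'] = k  # the dict key doubles as the "name"
```

After the loop, every row of df1 carries 'df1' in its Loc column, and likewise for df2.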
It can be done using the locals() dictionary, which maps variable names to references, and the is operator to match identities.
df1, df2, df3 = pd.DataFrame([1, 1, 1]), pd.DataFrame([2, 2, 2]), pd.DataFrame([3, 3, 3])
df_all = [df1, df2, df3]
_df = k = v = None
for _df in df_all:
    for k, v in locals().items():
        if v is _df and k != '_df':
            _df["Loc"] = k
print(*df_all, sep='\n\n')
0 Loc
0 1 df1
1 1 df1
2 1 df1
0 Loc
0 2 df2
1 2 df2
2 2 df2
0 Loc
0 3 df3
1 3 df3
2 3 df3
I am using pandas 1.4.3 and Python 3.9.13.
I am creating some identical DataFrames as follows:
d = {'col1': [1, 2], 'col2': [3, 4]}
df_1 = pd.DataFrame(data=d)
df_2 = pd.DataFrame(data=d)
df_3 = pd.DataFrame(data=d)
df_4 = pd.DataFrame(data=d)
datasets = [df_1, df_2, df_3, df_4]
Now I am trying to merge them all into a single DataFrame on col1. So, I do the following:
from functools import reduce
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['col1'], how='outer', suffixes=["_x", "_y"]), datasets)
So, I am trying to basically keep all the columns but just use some suffixes so that they stay unique. However, the issue is that since it is more than two dataframes, this ends up resulting in duplicated columns as:
col1 col2_x col2_y col2_x col2_y
0 1 3 3 3 3
1 2 4 4 4 4
I was wondering what would be the best way to do such a merge while ensuring no columns are dropped and duplicates are preserved properly, with incrementally added suffixes...
EDIT
At the moment, I am now doing it with a loop as:
merged = datasets[0]
for i in range(1, len(datasets)):
    merged = pd.merge(merged, datasets[i], how='outer', on=['col1'], suffixes=[None, f"_{i}"])
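Alternatively (a sketch, not from the original answers): rename the non-key columns up front so each one is already unique, and then the reduce needs no suffixes at all:

```python
import pandas as pd
from functools import reduce

d = {'col1': [1, 2], 'col2': [3, 4]}
datasets = [pd.DataFrame(d) for _ in range(4)]

# Give every non-key column a unique numeric suffix before merging
renamed = [
    df.rename(columns={c: f"{c}_{i}" for c in df.columns if c != 'col1'})
    for i, df in enumerate(datasets, start=1)
]
merged = reduce(lambda l, r: pd.merge(l, r, on='col1', how='outer'), renamed)
```

This yields columns col1, col2_1, col2_2, col2_3, col2_4 with no silent duplicates.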
A slightly cumbersome solution:
import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
df_1 = pd.DataFrame(data=d)
df_1.attrs['name'] = '1'
df_2 = pd.DataFrame(data=d)
df_2.attrs['name'] = '2'
df_3 = pd.DataFrame(data=d)
df_3.attrs['name'] = '3'
df_4 = pd.DataFrame(data=d)
df_4.attrs['name'] = '4'
datasets = [df_1, df_2, df_3, df_4]
from functools import reduce
def mrg(left, right):
    return pd.merge(left, right, on=['col1'], how='outer',
                    suffixes=["_" + str(left.attrs.get('name')), "_" + str(right.attrs.get('name'))])

df_merged = reduce(mrg, datasets)
I have two large DataFrames that I don't want to copy, but I want to apply the same change to both. How can I do this properly? For example, the code below is similar to what I want to do, on a smaller scale. It only creates a temporary variable df holding the filtered result for each DataFrame, but I want the DataFrames themselves to be changed:
import pandas as pd
df1 = pd.DataFrame({'a':[1,2,3]})
df2 = pd.DataFrame({'a':[0,1,5,7]})
for df in [df1, df2]:
    df = df[df['a'] < 3]
We can use DataFrame.query with inplace=True:
df1 = pd.DataFrame({'a':[1,2,3]})
df2 = pd.DataFrame({'a':[0,1,5,7]})
for df in [df1, df2]:
    df.query('a < 3', inplace=True)
df1
a
0 1
1 2
df2
a
0 0
1 1
I don't think this is the best solution, but it should do the job.
import pandas as pd
df1 = pd.DataFrame({'a':[1,2,3]})
df2 = pd.DataFrame({'a':[0,1,5,7]})
dfs = [df1, df2]
for i, df in enumerate(dfs):
    dfs[i] = df[df['a'] < 3]
dfs[0]
a
0 1
1 2
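Another option that mutates the original objects themselves is DataFrame.drop with inplace=True, removing the rows that fail the condition (a sketch, same data as above):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'a': [0, 1, 5, 7]})

for df in (df1, df2):
    # Drop the offending rows from the original object itself
    df.drop(df[df['a'] >= 3].index, inplace=True)
```

Since drop works on the object referenced by df, both df1 and df2 are changed in place.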
I have two dataframes which I am joining like so:
df3 = df1.join(df2.set_index('id'), on='id', how='left')
But I want to replace the values for ids that are present in df1 but not in df2 with NaN (a left join will just leave the values in df1 as they are). What's the easiest way to accomplish this?
I think you need Series.where with Series.isin:
df1['id'] = df1['id'].where(df1['id'].isin(df2['id']))
Or numpy.where:
df1['id'] = np.where(df1['id'].isin(df2['id']), df1['id'], np.nan)
Sample:
df1 = pd.DataFrame({
    'id': list('abc'),
})
df2 = pd.DataFrame({
    'id': list('dmna'),
})
df1['id'] = df1['id'].where(df1['id'].isin(df2['id']))
print (df1)
id
0 a
1 NaN
2 NaN
Or solution with merge and indicator parameter:
df3 = df1.merge(df2, on='id', how='left', indicator=True)
df3['id'] = df3['id'].mask(df3.pop('_merge').eq('left_only'))
print (df3)
id
0 a
1 NaN
2 NaN
I want to merge two DataFrames df1 and df2,
using left_on: column 'X' from df1,
using right_on: the result of a function f applied to column 'Y' from df2.
I could create a dedicated column 'Z' holding f(df2['Y']), but I want to avoid that.
Is it possible?
You can pass left_on and right_on arguments to merge:
In [11]: df1 = pd.DataFrame([[1, 3], [2, 4]], columns=["X", "A"])
In [12]: df2 = pd.DataFrame([[1, 5], [2, 6]], columns=["Z", "B"])
In [13]: df1.merge(df2, left_on=["X"], right_on=["Z"])
Out[13]:
X A Z B
0 1 3 1 5
1 2 4 2 6
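To address the function part of the question directly: left_on/right_on also accept array-likes of the frame's length, so you can pass the transformed values to merge without materializing a 'Z' column. A sketch with a made-up transform f (the rename avoids a name clash with the existing 'Y' column):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 3], [2, 4]], columns=["X", "A"])
df2 = pd.DataFrame([[10, 5], [20, 6]], columns=["Y", "B"])

def f(y):
    # Made-up transform for illustration
    return y // 10

# right_on accepts an array-like aligned with df2's rows
merged = df1.merge(df2, left_on="X", right_on=df2["Y"].map(f).rename("Y_key"))
```

Here X=1 matches f(10)=1 and X=2 matches f(20)=2, so both rows pair up without an intermediate column on df2.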