Pandas: merge dataframes on multiple columns based on a condition - python

Given df1 and df2:
import pandas as pd

df1 = pd.DataFrame({
    'Key': ['k1', 'k1', 'k1', 'k2', 'k3'],
    'Num': [1, 2, 3, 1, 2],
    'A': ['a1', 'a2', 'a3', 'a4', 'a5']
})
display(df1)
df2 = pd.DataFrame({
    'Key': ['k1', 'k1', 'k2', 'k3'],
    'Num': [1, 2, 1, 1],
    'X': ['x1', 'x2', 'x3', 'x4']
})
display(df2)
df1:
Key Num A
0 k1 1 a1
1 k1 2 a2
2 k1 3 a3
3 k2 1 a4
4 k3 2 a5
df2:
Key Num X
0 k1 1 x1
1 k1 2 x2
2 k2 1 x3
3 k3 1 x4
Expected Output:
Key Num A X
0 k1 1 a1 x1
1 k1 2 a2 x2
2 k1 3 a3 x1
3 k2 1 a4 x3
4 k3 2 a5 x4
I would like to merge df2 into df1 on columns 'Key' and 'Num', such that when Num doesn't match, the df2 row with the same Key and Num == 1 is used as a fallback, if available.

IIUC, you can merge, then fillna the still-missing X values from a per-Key lookup built with map:
# fallback: the X of each Key's Num == 1 row in df2
s = df1['Key'].map(df2[df2['Num'].eq(1)].set_index('Key')['X'])
df3 = (df1
       .merge(df2, on=['Key', 'Num'], how='left')
       .fillna({'X': s})
      )
output:
Key Num A X
0 k1 1 a1 x1
1 k1 2 a2 x2
2 k1 3 a3 x1
3 k2 1 a4 x3
4 k3 2 a5 x4
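An alternative sketch of the same fallback idea with two explicit merges: one exact merge on both columns, plus a Key-only merge against the Num == 1 rows of df2 to fill the gaps:

```python
import pandas as pd

df1 = pd.DataFrame({'Key': ['k1', 'k1', 'k1', 'k2', 'k3'],
                    'Num': [1, 2, 3, 1, 2],
                    'A': ['a1', 'a2', 'a3', 'a4', 'a5']})
df2 = pd.DataFrame({'Key': ['k1', 'k1', 'k2', 'k3'],
                    'Num': [1, 2, 1, 1],
                    'X': ['x1', 'x2', 'x3', 'x4']})

# exact match on both keys first
out = df1.merge(df2, on=['Key', 'Num'], how='left')
# fallback: the Num == 1 row of df2 for the same Key, aligned to df1's index
fallback = df1.merge(df2.loc[df2['Num'].eq(1), ['Key', 'X']],
                     on='Key', how='left')['X']
out['X'] = out['X'].fillna(fallback)
```

fillna with a Series aligns on the row index, so this only works because both merges preserve df1's row order one-to-one.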

Right-merge df2 into df1 (on the shared columns Key and Num), then merge that result with the Num == 1 rows of df2 on Key alone (note this second merge is inner, so it assumes every Key has a Num == 1 row in df2).
Fill the missing values in X_x from X_y.
Drop the excess columns and restore the naming:
df3 = df2.merge(df1, how='right').merge(df2[df2.Num == 1], on='Key')
df3['X_x'] = df3[['X_x', 'X_y']].bfill(axis=1)['X_x']
df3.drop(['Num_y', 'X_y'], axis=1, inplace=True)
df3.columns = ['Key', 'Num', 'X', 'A']
display(df3)
display(df3)
Output:
Key Num X A
0 k1 1 x1 a1
1 k1 2 x2 a2
2 k1 3 x1 a3
3 k2 1 x3 a4
4 k3 2 x4 a5

pandas DataFrame re-order cells for each group

I have a dataframe made up of groups of 3 rows, like:
group value1 value2 value3
1 A1 A2 A3
1 B1 B2 B3
1 C1 C2 C3
2 D1 D2 D3
2 E1 E2 E3
2 F1 F2 F3
...
I'd like to re-order the cells within each group according to a fixed rule based on their 'positions', repeating the same operation over all groups.
This fixed rule works like below:
Input:
group value1 value2 value3
1 position1 position2 position3
1 position4 position5 position6
1 position7 position8 position9
Output:
group value1 value2 value3
1 position1 position8 position6
1 position4 position2 position9
1 position7 position5 position3
Eventually the dataframe should look like (if this makes sense):
group value1 value2 value3
1 A1 C2 B3
1 B1 A2 C3
1 C1 B2 A3
2 D1 F2 E3
2 E1 D2 F3
2 F1 E2 D3
...
I know how to re-order them if the dataframe has only one group: create a temporary variable to store values, get each cell with .loc, and overwrite each cell with the desired value.
However, even for a single group of 3 rows, that is a tedious way to do it.
My question is: can we
find a general operation to rearrange cells by their relative position within a group, and
repeat this operation over all groups?
Here is a proposal which uses numpy indexing with reshaping on each group.
Setup:
Let's assume your original df and the position dataframe are as below:
d = {'group': [1, 1, 1, 2, 2, 2],
     'value1': ['A1', 'B1', 'C1', 'D1', 'E1', 'F1'],
     'value2': ['A2', 'B2', 'C2', 'D2', 'E2', 'F2'],
     'value3': ['A3', 'B3', 'C3', 'D3', 'E3', 'F3']}
out_d = {'group': [1, 1, 1, 2, 2, 2],
         'value1': ['position1', 'position4', 'position7',
                    'position1', 'position4', 'position7'],
         'value2': ['position8', 'position2', 'position5',
                    'position8', 'position2', 'position5'],
         'value3': ['position6', 'position9', 'position3',
                    'position6', 'position9', 'position3']}
df = pd.DataFrame(d)
out = pd.DataFrame(out_d)
print("Original dataframe :\n\n", df, "\n\n Position dataframe :\n\n", out)
Original dataframe :
group value1 value2 value3
0 1 A1 A2 A3
1 1 B1 B2 B3
2 1 C1 C2 C3
3 2 D1 D2 D3
4 2 E1 E2 E3
5 2 F1 F2 F3
Position dataframe :
group value1 value2 value3
0 1 position1 position8 position6
1 1 position4 position2 position9
2 1 position7 position5 position3
3 2 position1 position8 position6
4 2 position4 position2 position9
5 2 position7 position5 position3
Working Solution:
Method 1: create a function and use it in df.groupby(...).apply
import re
import numpy as np
import pandas as pd

# remove the letters, extract only the position numbers, and subtract 1
# since python indexing starts at 0
o = out.applymap(lambda x: int(''.join(re.findall(r'\d+', x))) - 1
                 if type(x) == str else x)
# align the position columns with the original dataframe by row order
df1 = df.join(o.drop(columns='group').add_suffix('_pos'))
# build a function which rearranges each group based on the position columns
def fun(x):
    c = x.columns.str.contains('_pos')
    vals = x.loc[:, ~c].drop(columns='group', errors='ignore')
    return pd.DataFrame(np.ravel(vals)[np.ravel(x.loc[:, c])]
                        .reshape(vals.shape),
                        columns=vals.columns)
output = (df1.groupby('group').apply(fun).reset_index('group')
          .reset_index(drop=True))
print(output)
print(output)
group value1 value2 value3
0 1 A1 C2 B3
1 1 B1 A2 C3
2 1 C1 B2 A3
3 2 D1 F2 E3
4 2 E1 D2 F3
5 2 F1 E2 D3
Method 2: iterate through each group and re-arrange:
o = out.applymap(lambda x: int(''.join(re.findall(r'\d+', x))) - 1
                 if type(x) == str else x)
df1 = df.join(o.drop(columns='group').add_suffix('_pos')).set_index('group')
idx = df1.index.unique()
l = []
for i in idx:
    v = df1.loc[i]
    c = v.columns.str.contains('_pos')
    l.append(np.ravel(v.loc[:, ~c])[np.ravel(v.loc[:, c])]
             .reshape(v.loc[:, ~c].shape))
final = pd.DataFrame(np.concatenate(l), index=df1.index,
                     columns=df1.columns[~c]).reset_index()
print(final)
print(final)
group value1 value2 value3
0 1 A1 C2 B3
1 1 B1 A2 C3
2 1 C1 B2 A3
3 2 D1 F2 E3
4 2 E1 D2 F3
5 2 F1 E2 D3
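Since the rule is the same 3x3 pattern for every group, another possible sketch is to hard-code it as a NumPy index array and apply it per group, avoiding the merge entirely (perm below encodes the question's rule, 0-based):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'group': [1, 1, 1, 2, 2, 2],
                   'value1': ['A1', 'B1', 'C1', 'D1', 'E1', 'F1'],
                   'value2': ['A2', 'B2', 'C2', 'D2', 'E2', 'F2'],
                   'value3': ['A3', 'B3', 'C3', 'D3', 'E3', 'F3']})

# the rule from the question, 0-based: output cell (r, c) takes the input
# cell at flat position perm[r, c] of the group's 3x3 block
perm = np.array([[0, 7, 5],
                 [3, 1, 8],
                 [6, 4, 2]])

value_cols = ['value1', 'value2', 'value3']
vals = df[value_cols].to_numpy()
shuffled = vals.copy()
for g, idx in df.groupby('group').indices.items():
    block = vals[idx]                                  # this group's 3x3 block
    shuffled[idx] = block.ravel()[perm.ravel()].reshape(3, 3)

result = df[['group']].join(
    pd.DataFrame(shuffled, columns=value_cols, index=df.index))
```

This assumes every group has exactly 3 rows in a stable order; `GroupBy.indices` gives the integer row positions of each group.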

Concat two dataframes with common columns [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
I have two dataframes with the same columns; only one column has different values. I want to concatenate the two without duplicating the shared columns.
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],'cat': ['C0', 'C1', 'C2'],'B': ['B0', 'B1', 'B2']})
df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],'cat': ['C0', 'C1', 'C2'],'B': ['A0', 'A1', 'A2']})
df1
Out[630]:
key cat B
0 K0 C0 A0
1 K1 C1 A1
2 K2 C2 A2
df2
Out[631]:
key cat B
0 K0 C0 B0
1 K1 C1 B1
2 K2 C2 B2
I tried:
result = pd.concat([df1, df2], axis=1)
result
Out[633]:
key cat B key cat B
0 K0 C0 A0 K0 C0 B0
1 K1 C1 A1 K1 C1 B1
2 K2 C2 A2 K2 C2 B2
The desired output:
key cat B_df1 B_df2
0 K0 C0 A0 B0
1 K1 C1 A1 B1
2 K2 C2 A2 B2
NOTE: I could drop duplicates afterwards and rename columns but that doesn't seem efficient
pd.merge will do the job; pass suffixes to get the desired column names:
pd.merge(df1, df2, on=['key', 'cat'], suffixes=('_df1', '_df2'))
Output
key cat B_df1 B_df2
0 K0 C0 A0 B0
1 K1 C1 A1 B1
2 K2 C2 A2 B2
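If you specifically want pd.concat, a possible variant is to move the shared columns into the index first, so concat aligns rows instead of placing the frames side by side; add_suffix produces the desired column names:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'cat': ['C0', 'C1', 'C2'], 'B': ['A0', 'A1', 'A2']})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'cat': ['C0', 'C1', 'C2'], 'B': ['B0', 'B1', 'B2']})

# index on the shared columns so concat aligns on (key, cat) pairs
out = pd.concat([df1.set_index(['key', 'cat']).add_suffix('_df1'),
                 df2.set_index(['key', 'cat']).add_suffix('_df2')],
                axis=1).reset_index()
```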

Show common values in a column once in pandas

I have a dataframe that looks like this:
df = pd.DataFrame({'key': ['K0', 'K0', 'K0', 'K1'],'cat': ['C0', 'C0', 'C1', 'C1'],'B': ['A0', 'A1', 'A2', 'A3']})
df
Out[15]:
key cat B
0 K0 C0 A0
1 K0 C0 A1
2 K0 C1 A2
3 K1 C1 A3
Is it possible to convert it to:
key cat B
0 K0 C0 A0
1 A1
2 K0 C1 A2
3 K1 C1 A3
I want to avoid showing the same key and cat values again and again; key reappears once cat changes.
It's for Excel purposes, so I need the result to stay compatible with:
style.apply(f)
to_excel()
You can use duplicated over a subset of the columns to find the repeated values and blank them out (this relies on repeated key/cat pairs being consecutive, as in your example):
cols = ['key', 'cat']
df.loc[df.duplicated(subset=cols), cols] = ''
key cat B
0 K0 C0 A0
1 A1
2 K0 C1 A2
3 K1 C1 A3
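duplicated flags a repeated key/cat pair anywhere in the frame; if the data isn't sorted and you only want to blank values that repeat consecutively (so a key is shown again whenever cat changes back), one possible variant compares each row with the previous one via shift:

```python
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K0', 'K0', 'K1'],
                   'cat': ['C0', 'C0', 'C1', 'C1'],
                   'B': ['A0', 'A1', 'A2', 'A3']})

cols = ['key', 'cat']
# blank a row's key/cat only when BOTH match the row directly above
mask = (df[cols] == df[cols].shift()).all(axis=1)
df.loc[mask, cols] = ''
```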

How can I find the "set difference" of rows in two dataframes on a subset of columns in Pandas?

I have two dataframes, say df1 and df2, with the same column names.
Example:
df1
C1 | C2 | C3 | C4
A 1 2 AA
B 1 3 A
A 3 2 B
df2
C1 | C2 | C3 | C4
A 1 3 E
B 1 2 C
Q 4 1 Z
I would like to filter out rows in df1 based on values common to a fixed subset of columns in df1 and df2. In the above example, if the columns are C1 and C2, the first two rows should be filtered out, as their values in these columns are identical across df1 and df2.
What would be a clean way to do this in Pandas?
So far, based on this answer, I have been able to find the common rows.
common_df = pandas.merge(df1, df2, how='inner', on=['C1','C2'])
This gives me a new dataframe with only those rows that have common values in the specified columns, i.e., the intersection.
I have also seen this thread, but the answers all seem to assume a difference on all the columns.
The expected result for the above example (rows common on specified columns removed):
C1 | C2 | C3 | C4
A 3 2 B
Maybe not the cleanest, but you could add a key column to df1 to check against.
Setting up the datasets
import pandas as pd
df1 = pd.DataFrame({ 'C1': ['A', 'B', 'A'],
'C2': [1, 1, 3],
'C3': [2, 3, 2],
'C4': ['AA', 'A', 'B']})
df2 = pd.DataFrame({ 'C1': ['A', 'B', 'Q'],
'C2': [1, 1, 4],
'C3': [3, 2, 1],
'C4': ['E', 'C', 'Z']})
Adding a key, then using your code to find the common rows:
df1['key'] = range(1, len(df1) + 1)
common_df = pd.merge(df1, df2, how='inner', on=['C1','C2'])
df_filter = df1[~df1['key'].isin(common_df['key'])].drop('key', axis=1)
You can use an anti-join: do an outer join on the specified columns with an indicator, which records the source of each row. The only downside is that you have to rename and drop the extra columns after the join.
>>> import pandas as pd
>>> df1 = pd.DataFrame({'C1':['A','B','A'],'C2':[1,1,3],'C3':[2,3,2],'C4':['AA','A','B']})
>>> df2 = pd.DataFrame({'C1':['A','B','Q'],'C2':[1,1,4],'C3':[3,2,1],'C4':['E','C','Z']})
>>> df_merged = df1.merge(df2, on=['C1','C2'], indicator=True, how='outer')
>>> df_merged
C1 C2 C3_x C4_x C3_y C4_y _merge
0 A 1 2.0 AA 3.0 E both
1 B 1 3.0 A 2.0 C both
2 A 3 2.0 B NaN NaN left_only
3 Q 4 NaN NaN 1.0 Z right_only
>>> df1_setdiff = df_merged[df_merged['_merge'] == 'left_only'].rename(columns={'C3_x': 'C3', 'C4_x': 'C4'}).drop(['C3_y', 'C4_y', '_merge'], axis=1)
>>> df1_setdiff
C1 C2 C3 C4
2 A 3 2.0 B
>>> df2_setdiff = df_merged[df_merged['_merge'] == 'right_only'].rename(columns={'C3_y': 'C3', 'C4_y': 'C4'}).drop(['C3_x', 'C4_x', '_merge'], axis=1)
>>> df2_setdiff
C1 C2 C3 C4
3 Q 4 1.0 Z
import pandas as pd
df1 = pd.DataFrame({'C1': ['A', 'B', 'A'], 'C2': [1, 1, 3], 'C3': [2, 3, 2], 'C4': ['AA', 'A', 'B']})
df2 = pd.DataFrame({'C1': ['A', 'B', 'Q'], 'C2': [1, 1, 4], 'C3': [3, 2, 1], 'C4': ['E', 'C', 'Z']})
common = pd.merge(df1, df2, on=['C1', 'C2'])
# caveat: isin tests each column independently, so a row can be wrongly
# dropped when its C1 and C2 values occur in common, but in different rows
R1 = df1[~((df1.C1.isin(common.C1)) & (df1.C2.isin(common.C2)))]
R2 = df2[~((df2.C1.isin(common.C1)) & (df2.C2.isin(common.C2)))]
df1:
C1 C2 C3 C4
0 A 1 2 AA
1 B 1 3 A
2 A 3 2 B
df2:
C1 C2 C3 C4
0 A 1 3 E
1 B 1 2 C
2 Q 4 1 Z
common:
C1 C2 C3_x C4_x C3_y C4_y
0 A 1 2 AA 3 E
1 B 1 3 A 2 C
R1:
C1 C2 C3 C4
2 A 3 2 B
R2:
C1 C2 C3 C4
2 Q 4 1 Z
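A more compact form of the indicator idea, as a sketch: left-merge only the key columns of df2, so none of df1's other columns collide and no renaming is needed, then keep the rows marked left_only:

```python
import pandas as pd

df1 = pd.DataFrame({'C1': ['A', 'B', 'A'], 'C2': [1, 1, 3], 'C3': [2, 3, 2], 'C4': ['AA', 'A', 'B']})
df2 = pd.DataFrame({'C1': ['A', 'B', 'Q'], 'C2': [1, 1, 4], 'C3': [3, 2, 1], 'C4': ['E', 'C', 'Z']})

keys = ['C1', 'C2']
# merge only df2's (deduplicated) key columns, keeping one output row per df1 row
m = df1.merge(df2[keys].drop_duplicates(), on=keys, how='left', indicator=True)
left_only = df1[m['_merge'].eq('left_only').to_numpy()]
```

Deduplicating df2's keys first guarantees the merge result has exactly one row per df1 row, so the boolean mask lines up positionally with df1.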

Pandas: How to expand data frame rows containing a dictionary with varying keys in a column?

I'm a little stuck; can you please help me with this? I've simplified the problem I'm facing to the following:
Input:
a b c
0 a0 b0 {'c00': 'v00', 'c01': 'v01'}
1 a1 b1 {'c10': 'v10'}
2 a2 b2 {'c20': 'v20', 'c21': 'v21', 'c22': 'v22'}
Desired Output:
a b key val
0 a0 b0 c00 v00
1 a0 b0 c01 v01
2 a1 b1 c10 v10
3 a2 b2 c20 v20
4 a2 b2 c21 v21
5 a2 b2 c22 v22
I know how to handle the case where the dictionaries in column c all have the same keys.
You can create a DataFrame with the constructor, reshape it with stack, and finally join it back to the original:
df1 = (pd.DataFrame(df.c.values.tolist())
       .stack()
       .reset_index(level=1)
       .rename(columns={0: 'val', 'level_1': 'key'}))
print (df1)
key val
0 c00 v00
0 c01 v01
1 c10 v10
2 c20 v20
2 c21 v21
2 c22 v22
df = df.drop(columns='c').join(df1).reset_index(drop=True)
print (df)
a b key val
0 a0 b0 c00 v00
1 a0 b0 c01 v01
2 a1 b1 c10 v10
3 a2 b2 c20 v20
4 a2 b2 c21 v21
5 a2 b2 c22 v22
Here is one way:
import numpy as np
import pandas as pd
from itertools import chain
df = pd.DataFrame([['a0', 'b0', {'c00': 'v00', 'c01': 'v01'}],
                   ['a1', 'b1', {'c10': 'v10'}],
                   ['a2', 'b2', {'c20': 'v20', 'c21': 'v21', 'c22': 'v22'}]],
                  columns=['a', 'b', 'c'])
# first convert 'c' to a list of (key, value) tuples
df['c'] = df['c'].apply(lambda x: list(x.items()))
lens = list(map(len, df['c']))
# create the dataframe by repeating 'a' and 'b' once per tuple
df_out = pd.DataFrame({'a': np.repeat(df['a'].values, lens),
                       'b': np.repeat(df['b'].values, lens),
                       'c': list(chain.from_iterable(df['c'].values))})
# unpack the tuples into 'key'/'val' columns
df_out = (df_out.join(df_out['c'].apply(pd.Series))
          .rename(columns={0: 'key', 1: 'val'}).drop(columns='c'))
# a b key val
# 0 a0 b0 c00 v00
# 1 a0 b0 c01 v01
# 2 a1 b1 c10 v10
# 3 a2 b2 c20 v20
# 4 a2 b2 c21 v21
# 5 a2 b2 c22 v22
My solution:
import pandas as pd
t = pd.DataFrame([['a0', 'b0', {'c00': 'v00', 'c01': 'v01'}],
                  ['a1', 'b1', {'c10': 'v10'}],
                  ['a2', 'b2', {'c20': 'v20', 'c21': 'v21', 'c22': 'v22'}]],
                 columns=['a', 'b', 'c'])
l2 = []
for i in t.index:
    for j in t.loc[i, 'c']:
        l2 += [[t.loc[i, 'a'], t.loc[i, 'b'], j, t.loc[i, 'c'][j]]]
t2 = pd.DataFrame(l2, columns=['a', 'b', 'key', 'val'])
where t is your input DataFrame, built however you obtain it.
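On pandas 0.25+ another possible route is Series.explode: turn each dict into (key, value) tuples, explode to one tuple per row, and unpack, as a sketch:

```python
import pandas as pd

t = pd.DataFrame([['a0', 'b0', {'c00': 'v00', 'c01': 'v01'}],
                  ['a1', 'b1', {'c10': 'v10'}],
                  ['a2', 'b2', {'c20': 'v20', 'c21': 'v21', 'c22': 'v22'}]],
                 columns=['a', 'b', 'c'])

# one (key, value) tuple per row, keeping the original row index
s = t['c'].apply(lambda d: list(d.items())).explode()
kv = pd.DataFrame(s.tolist(), index=s.index, columns=['key', 'val'])
# join replicates the 'a'/'b' values for every tuple sharing their index
out = t.drop(columns='c').join(kv).reset_index(drop=True)
```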
