merge two dataframes based on specific column information - python

I am trying to handle dataframes in several ways,
and now I'd like to merge two dataframes based on specific column information and delete the duplicated rows.
Is it possible?
I tried to use the concat function but failed...
For example, I want to merge df1 and df2 into d3 with these
conditions:
if the c1 & c2 information is the same, drop the duplicated row (keep only the df1 row, even if the c3 data differs between df1 and df2)
if the c1 & c2 information is different, keep both rows (one from df1, one from df2)
before:
df1
c1 c2 c3
0 0 x {'a':1 ,'b':2}
1 0 y {'a':3 ,'b':4}
2 2 z {'a':5 ,'b':6}
df2
c1 c2 c3
0 0 x {'a':11 ,'b':12}
1 0 y {'a':13 ,'b':14}
2 3 z {'a':15 ,'b':16}
expected result d3:
c1 c2 c3
0 0 x {'a':1 ,'b':2}
1 0 y {'a':3 ,'b':4}
2 2 z {'a':5 ,'b':6}
3 3 z {'a':15 ,'b':16}

You can do this by first determining which rows are only in df2, using merge with how='right' and indicator=True, then concatenating the result with df1:
In [125]:
merged = df1.merge(df2, on=['c1','c2'], how='right', indicator=True)
merged = merged[merged['_merge']=='right_only']
merged = merged.rename(columns={'c3_y':'c3'})
merged
Out[125]:
c1 c2 c3_x c3 _merge
2 3 z NaN {'a':15 ,'b':16} right_only
In [126]:
combined = pd.concat([df1, merged[df1.columns]])
combined
Out[126]:
c1 c2 c3
0 0 x {'a':1 ,'b':2}
1 0 y {'a':3 ,'b':4}
2 2 z {'a':5 ,'b':6}
2 3 z {'a':15 ,'b':16}
If we break down the above:
In [128]:
merged = df1.merge(df2, on=['c1','c2'], how='right', indicator=True)
merged
Out[128]:
c1 c2 c3_x c3_y _merge
0 0 x {'a':1 ,'b':2} {'a':11 ,'b':12} both
1 0 y {'a':3 ,'b':4} {'a':13 ,'b':14} both
2 3 z NaN {'a':15 ,'b':16} right_only
In [129]:
merged = merged[merged['_merge']=='right_only']
merged
Out[129]:
c1 c2 c3_x c3_y _merge
2 3 z NaN {'a':15 ,'b':16} right_only
In [130]:
merged = merged.rename(columns={'c3_y':'c3'})
merged
Out[130]:
c1 c2 c3_x c3 _merge
2 3 z NaN {'a':15 ,'b':16} right_only
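For reference, here is the whole recipe as a self-contained sketch (the frame construction is assumed from the question's tables; ignore_index=True is an addition so the result gets the fresh 0..3 index of the expected d3):
import pandas as pd

# Rebuild the question's frames.
df1 = pd.DataFrame({'c1': [0, 0, 2], 'c2': ['x', 'y', 'z'],
                    'c3': [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}, {'a': 5, 'b': 6}]})
df2 = pd.DataFrame({'c1': [0, 0, 3], 'c2': ['x', 'y', 'z'],
                    'c3': [{'a': 11, 'b': 12}, {'a': 13, 'b': 14}, {'a': 15, 'b': 16}]})

# Rows that exist only in df2 on the (c1, c2) keys...
merged = df1.merge(df2, on=['c1', 'c2'], how='right', indicator=True)
right_only = merged[merged['_merge'] == 'right_only'].rename(columns={'c3_y': 'c3'})

# ...appended to df1, keeping df1's c3 wherever the keys collide.
d3 = pd.concat([df1, right_only[df1.columns]], ignore_index=True)
print(d3)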

Related

Adjust column position after split

I have a column that is positioned in the middle of a dataframe. I need to split it into multiple columns, and replace it with the new columns. I'm able to do it with the following code:
df = df.join(df[col_to_split].str.split(', ', expand=True).add_prefix(col_to_split + '_'))
However, the new columns are placed at the end of the dataframe, rather than replacing the original column. I need a way to place the new columns at the same position of original columns.
Note that I don't want to manually order ALL the columns (i.e. df = df[[c1, c2, c3 ... cn]]) for many reasons, e.g. it's not known how many new columns will be generated, and the dataframe contains hundreds of columns.
Sample data:
c1 c2 c3 col_to_split c4 c5 ... cn
1 a b 1,5,3 1 1 ... 1
2 a c 5,10 3 3 ... 4
3 z c 3 2 3 ... 4
Desired output:
c1 c2 c3 col_to_split_0 col_to_split_1 col_to_split_2 c4 c5 ... cn
1 a b 1 5 3 1 1 ... 1
2 a c 5 10 3 3 ... 4
3 z c 3 2 3 ... 4
The idea is to use your solution and dynamically insert df1.columns into the original column list with the cols[pos:pos] trick; the position of the original column is computed by Index.get_loc:
col_to_split = 'col_to_split'
cols = df.columns.tolist()
pos = df.columns.get_loc(col_to_split)
df1 = df[col_to_split].str.split(',', expand=True).fillna("").add_prefix(col_to_split + '_')
cols[pos:pos] = df1.columns.tolist()
cols.remove(col_to_split)
print (cols)
['c1', 'c2', 'c3', 'col_to_split_0', 'col_to_split_1', 'col_to_split_2',
'c4', 'c5', 'cn']
df = df.join(df1).reindex(cols, axis=1)
print (df)
c1 c2 c3 col_to_split_0 col_to_split_1 col_to_split_2 c4 c5 cn
0 1 a b 1 5 3 1 1 1
1 2 a c 5 10 3 3 4
2 3 z c 3 2 3 4
A similar solution that joins the column-name lists with slicing:
col_to_split = 'col_to_split'
pos = df.columns.get_loc(col_to_split)
df1 = df[col_to_split].str.split(",", expand=True).fillna("").add_prefix(col_to_split + '_')
cols = df.columns.tolist()
cols = cols[:pos] + df1.columns.tolist() + cols[pos+1:]
print(cols)
['c1', 'c2', 'c3', 'col_to_split_0', 'col_to_split_1', 'col_to_split_2',
'c4', 'c5', 'cn']
df = df.join(df1).reindex(cols, axis=1)
print (df)
c1 c2 c3 col_to_split_0 col_to_split_1 col_to_split_2 c4 c5 cn
0 1 a b 1 5 3 1 1 1
1 2 a c 5 10 3 3 4
2 3 z c 3 2 3 4
We can wrap this operation in a function:
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO("""c1 c2 c3 col_to_split c4 c5 cn
1 a b 1,5,3 1 1 1
2 a c 5,10 3 3 4
3 z c 3 2 3 4"""), sep=r"\s+")

def split_by_col(df, colname):
    # Position of the column to replace.
    pos = df.columns.tolist().index(colname)
    df_tmp = df[colname].str.split(",", expand=True).fillna("")
    df_tmp.columns = [colname + "_" + str(i) for i in range(len(df_tmp.columns))]
    # Reassemble: columns before, the new split columns, columns after.
    return pd.concat([df.iloc[:, :pos], df_tmp, df.iloc[:, pos+1:]], axis=1)
For example:
>>> split_by_col(df, "col_to_split")
c1 c2 c3 col_to_split_0 col_to_split_1 col_to_split_2 c4 c5 cn
0 1 a b 1 5 3 1 1 1
1 2 a c 5 10 3 3 4
2 3 z c 3 2 3 4
Try this:
df = df.join(df[col_to_split].str.split(', ', expand=True).add_prefix(col_to_split + '_'))
df = df[["c1", "c2", "c3" "col_to_split_0" "col_to_split_1" "col_to_split_2" "c4" "c5" ... "cn"]]

Pandas merge on multiple columns ignoring NaN

I am trying to do the same as this answer, but with the difference that I'd like to ignore NaN in some cases. For instance:
#df1
c1 c2 c3
0 a b 1
1 a c 2
2 a nan 1
3 b nan 3
4 c d 1
5 d e 3
#df2
c1 c2 c4
0 a nan 1
1 a c 2
2 a x 1
3 b nan 3
4 z y 2
#merged output based on [c1, c2], dropping instances
#with `NaN` unless both dataframes have `NaN`.
c1 c2 c3 c4
0 a b 1 1 #c1,c2 from df1 because df2 has a nan in c2
1 a c 2 2 #in both
2 a x 1 1 #c1,c2 from df2 because df1 has a nan in c2
3 b nan 3 3 #c1,c2 as found in both
4 c d 1 nan #from df1
5 d e 3 nan #from df1
6 z y nan 2 #from df2
NaNs may come from either c1 or c2, but for this example I kept it simpler.
I'm not sure what the cleanest way to do this is. I was thinking of merging based on [c1, c2] and then looping over the rows with NaN, but this would not be very direct. Do you see a better way to do it?
Edit - clarifying conditions
1. No duplicates are found anywhere.
2. No combination is performed between two rows if they both have full values. A c1 value may not be combined with a c2 value, so column order must be respected.
3. For the cases where one of the 2 dfs has a nan in either c1 or c2, find the rows in the other dataframe that don't have a full match on both c1+c2, and use it. For instance:
(a,c) has a match in both so it is no longer discussed.
(a,b) is only in df1. No b is found in df2.c2. The only row in df2 with a known key and a NaN is row 0, so it is combined with this one. Note that order must be respected; this is why (a,b) from df1 cannot be combined with any other row of df2 that also contains a b.
(a,x) is only in df2. No x is found in df1.c2. The only row in df1 with one of the known keys and a NaN is the row with index 2.
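Given the clarified conditions (and since only c2 carries NaNs in this example), one literal way is a row-wise matching pass that consumes each row at most once: exact matches first, then NaN fallbacks, then leftovers. A non-vectorized sketch, with the frames rebuilt from the tables above (row order differs from the expected listing):
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'c1': ['a', 'a', 'a', 'b', 'c', 'd'],
                    'c2': ['b', 'c', np.nan, np.nan, 'd', 'e'],
                    'c3': [1, 2, 1, 3, 1, 3]})
df2 = pd.DataFrame({'c1': ['a', 'a', 'a', 'b', 'z'],
                    'c2': [np.nan, 'c', 'x', np.nan, 'y'],
                    'c4': [1, 2, 1, 3, 2]})

used1, used2, rows = set(), set(), []

def take(frame, mask, used):
    # First still-unused row index matching the boolean mask, else None.
    ok = mask.to_numpy() & ~frame.index.isin(used)
    idx = frame.index[ok]
    return idx[0] if len(idx) else None

# Pass 1: exact matches on (c1, c2), NaN keys excluded.
for i, r in df1.dropna(subset=['c2']).iterrows():
    j = take(df2, (df2['c1'] == r['c1']) & (df2['c2'] == r['c2']), used2)
    if j is not None:
        used1.add(i); used2.add(j)
        rows.append((r['c1'], r['c2'], r['c3'], df2.at[j, 'c4']))

# Pass 2: an unmatched non-NaN row falls back to an unused same-c1 row
# with NaN in the other frame; the keys come from the non-NaN side.
for i, r in df1.dropna(subset=['c2']).iterrows():
    if i in used1:
        continue
    j = take(df2, (df2['c1'] == r['c1']) & df2['c2'].isna(), used2)
    if j is not None:
        used1.add(i); used2.add(j)
        rows.append((r['c1'], r['c2'], r['c3'], df2.at[j, 'c4']))
for j, r in df2.dropna(subset=['c2']).iterrows():
    if j in used2:
        continue
    i = take(df1, (df1['c1'] == r['c1']) & df1['c2'].isna(), used1)
    if i is not None:
        used1.add(i); used2.add(j)
        rows.append((r['c1'], r['c2'], df1.at[i, 'c3'], r['c4']))

# Pass 3: remaining NaN rows pair with a NaN row under the same c1.
for i, r in df1[df1['c2'].isna()].iterrows():
    if i in used1:
        continue
    j = take(df2, (df2['c1'] == r['c1']) & df2['c2'].isna(), used2)
    if j is not None:
        used1.add(i); used2.add(j)
        rows.append((r['c1'], np.nan, r['c3'], df2.at[j, 'c4']))

# Leftovers keep NaN for the missing side.
rows += [(r['c1'], r['c2'], r['c3'], np.nan)
         for i, r in df1.iterrows() if i not in used1]
rows += [(r['c1'], r['c2'], np.nan, r['c4'])
         for j, r in df2.iterrows() if j not in used2]

print(pd.DataFrame(rows, columns=['c1', 'c2', 'c3', 'c4']))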

How can I find the "set difference" of rows in two dataframes on a subset of columns in Pandas?

I have two dataframes, say df1 and df2, with the same column names.
Example:
df1
C1 | C2 | C3 | C4
A 1 2 AA
B 1 3 A
A 3 2 B
df2
C1 | C2 | C3 | C4
A 1 3 E
B 1 2 C
Q 4 1 Z
I would like to filter out rows in df1 based on common values in a fixed subset of columns between df1 and df2. In the above example, if the columns are C1 and C2, I would like the first two rows to be filtered out, as their values in both df1 and df2 for these columns are identical.
What would be a clean way to do this in Pandas?
So far, based on this answer, I have been able to find the common rows.
common_df = pandas.merge(df1, df2, how='inner', on=['C1','C2'])
This gives me a new dataframe with only those rows that have common values in the specified columns, i.e., the intersection.
I have also seen this thread, but the answers all seem to assume a difference on all the columns.
The expected result for the above example (rows common on specified columns removed):
C1 | C2 | C3 | C4
A 3 2 B
Maybe not the cleanest, but you could add a key column to df1 to check against.
Setting up the datasets
import pandas as pd
df1 = pd.DataFrame({ 'C1': ['A', 'B', 'A'],
'C2': [1, 1, 3],
'C3': [2, 3, 2],
'C4': ['AA', 'A', 'B']})
df2 = pd.DataFrame({ 'C1': ['A', 'B', 'Q'],
'C2': [1, 1, 4],
'C3': [3, 2, 1],
'C4': ['E', 'C', 'Z']})
Adding a key and using your code to find the common rows:
df1['key'] = range(1, len(df1) + 1)
common_df = pd.merge(df1, df2, how='inner', on=['C1','C2'])
df_filter = df1[~df1['key'].isin(common_df['key'])].drop('key', axis=1)
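For the sample frames above, df_filter then keeps just the df1 row with no (C1, C2) match in df2:
  C1  C2  C3  C4
2  A   3   2   B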
You can use an anti-join: do an outer join on the specified columns while recording the source of each row with an indicator. The only downside is that you have to rename and drop the extra columns after the join.
>>> import pandas as pd
>>> df1 = pd.DataFrame({'C1':['A','B','A'],'C2':[1,1,3],'C3':[2,3,2],'C4':['AA','A','B']})
>>> df2 = pd.DataFrame({'C1':['A','B','Q'],'C2':[1,1,4],'C3':[3,2,1],'C4':['E','C','Z']})
>>> df_merged = df1.merge(df2, on=['C1','C2'], indicator=True, how='outer')
>>> df_merged
C1 C2 C3_x C4_x C3_y C4_y _merge
0 A 1 2.0 AA 3.0 E both
1 B 1 3.0 A 2.0 C both
2 A 3 2.0 B NaN NaN left_only
3 Q 4 NaN NaN 1.0 Z right_only
>>> df1_setdiff = df_merged[df_merged['_merge'] == 'left_only'].rename(columns={'C3_x': 'C3', 'C4_x': 'C4'}).drop(['C3_y', 'C4_y', '_merge'], axis=1)
>>> df1_setdiff
C1 C2 C3 C4
2 A 3 2.0 B
>>> df2_setdiff = df_merged[df_merged['_merge'] == 'right_only'].rename(columns={'C3_y': 'C3', 'C4_y': 'C4'}).drop(['C3_x', 'C4_x', '_merge'], axis=1)
>>> df2_setdiff
C1 C2 C3 C4
3 Q 4 1.0 Z
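If the (C1, C2) pairs in df2 are unique, a variant (a sketch, not from the answers above) skips the rename/drop cleanup by merging only the key columns and filtering the original frame with the indicator:
# Merge just the key columns; with unique key pairs in df2 the result
# aligns row-for-row with df1, so the indicator can filter df1 directly.
ind = df1[['C1', 'C2']].merge(df2[['C1', 'C2']], on=['C1', 'C2'],
                              how='left', indicator=True)['_merge']
df1_setdiff = df1[ind.eq('left_only').to_numpy()]
print(df1_setdiff)
#   C1  C2  C3 C4
# 2  A   3   2  B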
import pandas as pd
df1 = pd.DataFrame({'C1':['A','B','A'],'C2':[1,1,3],'C3':[2,3,2],'C4':['AA','A','B']})
df2 = pd.DataFrame({'C1':['A','B','Q'],'C2':[1,1,4],'C3':[3,2,1],'C4':['E','C','Z']})
common = pd.merge(df1, df2, on=['C1','C2'])
# Note: isin checks each column independently, so this can also drop rows
# whose C1 and C2 values each appear in `common` but never as a pair.
R1 = df1[~((df1.C1.isin(common.C1)) & (df1.C2.isin(common.C2)))]
R2 = df2[~((df2.C1.isin(common.C1)) & (df2.C2.isin(common.C2)))]
df1:
C1 C2 C3 C4
0 A 1 2 AA
1 B 1 3 A
2 A 3 2 B
df2:
C1 C2 C3 C4
0 A 1 3 E
1 B 1 2 C
2 Q 4 1 Z
common:
C1 C2 C3_x C4_x C3_y C4_y
0 A 1 2 AA 3 E
1 B 1 3 A 2 C
R1:
C1 C2 C3 C4
2 A 3 2 B
R2:
C1 C2 C3 C4
2 Q 4 1 Z

how to convert column names into column values in pandas - python

df=pd.DataFrame(index=['x','y'], data={'a':[1,2],'b':[3,4]})
How can I convert the column names into the values of a column? This is my desired output:
c1 c2
x 1 a
x 3 b
y 2 a
y 4 b
You can use:
print (df.T.unstack().reset_index(level=1, name='c1')
.rename(columns={'level_1':'c2'})[['c1','c2']])
c1 c2
x 1 a
x 3 b
y 2 a
y 4 b
Or:
print (df.stack().reset_index(level=1, name='c1')
.rename(columns={'level_1':'c2'})[['c1','c2']])
c1 c2
x 1 a
x 3 b
y 2 a
y 4 b
Try this:
In [279]: df.stack().reset_index().set_index('level_0').rename(columns={'level_1':'c2',0:'c1'})
Out[279]:
c2 c1
level_0
x a 1
x b 3
y a 2
y b 4
Try:
df1 = df.stack().reset_index(-1).iloc[:, ::-1]
df1.columns = ['c1', 'c2']
df1
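For the example frame this yields:
   c1 c2
x   1  a
x   3  b
y   2  a
y   4  b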
In [62]: (pd.melt(df.reset_index(), var_name='c2', value_name='c1', id_vars='index')
.set_index('index'))
Out[62]:
c2 c1
index
x a 1
y a 2
x b 3
y b 4
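On newer pandas (1.1+, where melt gained an ignore_index argument), the index can be kept directly, avoiding the reset_index/set_index round trip; a small sketch:
out = df.melt(var_name='c2', value_name='c1', ignore_index=False)[['c1', 'c2']]
print(out)
   c1 c2
x   1  a
y   2  a
x   3  b
y   4  b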

Concatenate pandas dataframes with varying rows per index

I have two dataframes df1 and df2 with key as index.
dict_1={'key':[1,1,1,2,2,3], 'col1':['a1','b1','c1','d1','e1','f1']}
df1 = pd.DataFrame(dict_1).set_index('key')
dict_2={'key':[1,1,2], 'col2':['a2','b2','c2']}
df2 = pd.DataFrame(dict_2).set_index('key')
df1:
col1
key
1 a1
1 b1
1 c1
2 d1
2 e1
3 f1
df2
col2
key
1 a2
1 b2
2 c2
Note that there are unequal rows for each index. I want to concatenate these two dataframes such that I have the following dataframe (say df3).
df3
col1 col2
key
1 a1 a2
1 b1 b2
2 d1 c2
i.e., concatenate the two columns so that the new dataframe has, for each index, the lesser number of rows of df1 and df2.
I tried
pd.concat([df1,df2],axis=1)
but I get the following error:
ValueError: Shape of passed values is (2, 17), indices imply (2, 7)
My question: How can I concatenate df1 and df2 to get df3? Should I use DataFrame.merge instead? If so, how?
Merge/join alone will get you a lot of (hard to get rid of) duplicates. But a little trick will help:
df1['count1'] = 1
df1['count1'] = df1['count1'].groupby(df1.index).cumsum()
df1
Out[198]:
col1 count1
key
1 a1 1
1 b1 2
1 c1 3
2 d1 1
2 e1 2
3 f1 1
The same thing for df2:
df2['count2'] = 1
df2['count2'] = df2['count2'].groupby(df2.index).cumsum()
And finally:
df_aligned = df1.reset_index().merge(df2.reset_index(), left_on = ['key','count1'], right_on = ['key', 'count2'])
df_aligned
Out[199]:
key col1 count1 col2 count2
0 1 a1 1 a2 1
1 1 b1 2 b2 2
2 2 d1 1 c2 1
Now you can restore the index with set_index('key') and drop the no-longer-needed countn columns.
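Putting the cleanup together (a sketch building on df_aligned above; the groupby line at the end is an equivalent one-step way to build the counter):
# Restore the key index and drop the helper counters.
df3 = df_aligned.set_index('key').drop(columns=['count1', 'count2'])
print(df3)
#     col1 col2
# key
# 1     a1   a2
# 1     b1   b2
# 2     d1   c2

# Equivalent one-liner for the per-key counter used above:
# df1['count1'] = df1.groupby(level=0).cumcount() + 1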
The biggest problem, and the reason you are not going to be able to line up the two in the way that you want, is that your keys are duplicated. How are you going to line up the a1 value in df1 with the a2 value in df2 when a1, a2, b1, b2, and c1 all share the same key?
Using merge is what you'll want if you can resolve the key issues:
df3 = df1.merge(df2, left_index=True, right_index=True, how='inner')
You can use inner, outer, left or right for how.
