I have two dataframes df1 and df2 with key as index.
dict_1={'key':[1,1,1,2,2,3], 'col1':['a1','b1','c1','d1','e1','f1']}
df1 = pd.DataFrame(dict_1).set_index('key')
dict_2={'key':[1,1,2], 'col2':['a2','b2','c2']}
df2 = pd.DataFrame(dict_2).set_index('key')
df1:
col1
key
1 a1
1 b1
1 c1
2 d1
2 e1
3 f1
df2
col2
key
1 a2
1 b2
2 c2
Note that the number of rows per index is unequal. I want to concatenate these two dataframes such that I get the following dataframe (say df3).
df3
col1 col2
key
1 a1 a2
1 b1 b2
2 d1 c2
i.e. concatenate the two columns so that, for each index, the new dataframe has the lesser of df1's and df2's row counts.
I tried
pd.concat([df1,df2],axis=1)
but I get the following error:
ValueError: Shape of passed values is (2,17), indices imply (2,7)
My question: How can I concatenate df1 and df2 to get df3? Should I use DataFrame.merge instead? If so, how?
Merge/join alone will get you a lot of (hard to get rid of) duplicates. But a little trick will help:
df1['count1'] = 1
df1['count1'] = df1['count1'].groupby(df1.index).cumsum()
df1
Out[198]:
col1 count1
key
1 a1 1
1 b1 2
1 c1 3
2 d1 1
2 e1 2
3 f1 1
The same thing for df2:
df2['count2'] = 1
df2['count2'] = df2['count2'].groupby(df2.index).cumsum()
And finally:
df_aligned = df1.reset_index().merge(df2.reset_index(), left_on = ['key','count1'], right_on = ['key', 'count2'])
df_aligned
Out[199]:
key col1 count1 col2 count2
0 1 a1 1 a2 1
1 1 b1 2 b2 2
2 2 d1 1 c2 1
Now you can restore the index with set_index('key') and drop the no-longer-needed countn columns.
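Putting the trick together end to end, a minimal sketch (using `groupby().cumcount()`, which is equivalent to assigning 1 and taking a grouped cumsum, just 0-based):

```python
import pandas as pd

df1 = pd.DataFrame({'key': [1, 1, 1, 2, 2, 3],
                    'col1': ['a1', 'b1', 'c1', 'd1', 'e1', 'f1']}).set_index('key')
df2 = pd.DataFrame({'key': [1, 1, 2],
                    'col2': ['a2', 'b2', 'c2']}).set_index('key')

# Number the rows within each key so duplicates become distinguishable.
df1['count1'] = df1.groupby(df1.index).cumcount()
df2['count2'] = df2.groupby(df2.index).cumcount()

# Inner merge on (key, position) pairs keeps only as many rows per key
# as the smaller of the two frames has.
df3 = (df1.reset_index()
          .merge(df2.reset_index(),
                 left_on=['key', 'count1'], right_on=['key', 'count2'])
          .set_index('key')
          .drop(columns=['count1', 'count2']))
print(df3)
```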
The biggest problem, and the reason you are not going to be able to line up the two frames the way you want, is that your keys are duplicated. How would you line up the a1 value in df1 with the a2 value in df2 when a1, b1, c1 (in df1) and a2, b2 (in df2) all share the same key?
Using merge is what you'll want if you can resolve the key issues:
df3 = df1.merge(df2, left_index=True, right_index=True, how='inner')
You can use inner, outer, left or right for how.
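To see concretely why duplicate keys are the problem, a quick sketch with the sample data: a merge on the index produces a cartesian product within each duplicated key.

```python
import pandas as pd

df1 = pd.DataFrame({'key': [1, 1, 1, 2, 2, 3],
                    'col1': ['a1', 'b1', 'c1', 'd1', 'e1', 'f1']}).set_index('key')
df2 = pd.DataFrame({'key': [1, 1, 2],
                    'col2': ['a2', 'b2', 'c2']}).set_index('key')

df3 = df1.merge(df2, left_index=True, right_index=True, how='inner')
# key 1 appears 3 times in df1 and 2 times in df2, so the inner merge
# yields 3 * 2 = 6 rows for key 1, plus 2 * 1 = 2 rows for key 2
# (key 3 has no match on the right and is dropped).
print(len(df3))  # 8
```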
I have three data frames as follows:
df1
col1 CAND_SNP
1 a1
1 a2
1 a3
1 a4
2 b1
3 c1
3 c2
3 c3
df2
col1 LEAD_SNP
1 a1
2 b1
3 c1
df3
snp col2
a3 x1
a21 x2
a31 x3
a41 x4
b11 x5
c11 x6
c21 x7
c31 x8
I need to match CAND_SNP of df1 with snp of df3 to populate a new column in df2 with values "yes" or "no". The match needs to be group-wise over col1 of df1. In the example above, there are 3 groups in col1 of df1. If any of a group's CAND_SNP values matches an snp of df3, then the new column of df2 should be "yes" for that group, as below. Any help?
df2
col1 LEAD_SNP col3
1 a1 Yes
2 b1 No
3 c1 No
If I understand correctly, you can group df1 by col1 and check whether each group's CAND_SNP values exist in the snp column of df3. Then merge with df2:
df1['col3'] = df1.groupby('col1')['CAND_SNP'].apply(lambda s: s.isin(df3['snp']))
df2 = df2.merge(df1.groupby('col1')['col3'].any(), left_on='col1', right_index=True, how='left')
And if you need 'Yes'/'No' as values, use
df2.col3.map({True: 'Yes', False: 'No'})
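Put together as a runnable sketch (note that `isin` is row-wise, so the per-row flag doesn't actually need the groupby; only the `any()` aggregation does):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 1, 1, 1, 2, 3, 3, 3],
                    'CAND_SNP': ['a1', 'a2', 'a3', 'a4', 'b1', 'c1', 'c2', 'c3']})
df2 = pd.DataFrame({'col1': [1, 2, 3], 'LEAD_SNP': ['a1', 'b1', 'c1']})
df3 = pd.DataFrame({'snp': ['a3', 'a21', 'a31', 'a41', 'b11', 'c11', 'c21', 'c31'],
                    'col2': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8']})

# Flag, per row, whether CAND_SNP appears in df3['snp'].
df1['col3'] = df1['CAND_SNP'].isin(df3['snp'])

# Collapse to one True/False per group and attach it to df2.
df2 = df2.merge(df1.groupby('col1')['col3'].any(),
                left_on='col1', right_index=True, how='left')
df2['col3'] = df2['col3'].map({True: 'Yes', False: 'No'})
print(df2)
```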
I have N dataframes:
df1:
time data
1.0 a1
2.0 b1
3.0 c1
df2:
time data
1.0 a2
2.0 b2
3.0 c2
df3:
time data
1.0 a3
2.0 b3
3.0 c3
I want to merge all of them on time, thus getting
time data1 data2 data3
1.0 a1 a2 a3
2.0 b1 b2 b3
3.0 c1 c2 c3
I can assure you the time values are the same in all dataframes.
How can I do this in pandas?
One idea is to use concat on a list of DataFrames; it is only necessary to set time as the index of each DataFrame first. Also, to avoid duplicated column names, the keys parameter is added, but it creates a MultiIndex in the output, so map with format is added to flatten it:
dfs = [df1, df2, df3]
dfs = [x.set_index('time') for x in dfs]
df = pd.concat(dfs, axis=1, keys=range(1, len(dfs) + 1))
df.columns = df.columns.map('{0[1]}{0[0]}'.format)
df = df.reset_index()
print (df)
time data1 data2 data3
0 1.0 a1 a2 a3
1 2.0 b1 b2 b3
2 3.0 c1 c2 c3
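With the sample data filled in, a runnable sketch (flattening the MultiIndex with an f-string instead of map, which does the same thing):

```python
import pandas as pd

df1 = pd.DataFrame({'time': [1.0, 2.0, 3.0], 'data': ['a1', 'b1', 'c1']})
df2 = pd.DataFrame({'time': [1.0, 2.0, 3.0], 'data': ['a2', 'b2', 'c2']})
df3 = pd.DataFrame({'time': [1.0, 2.0, 3.0], 'data': ['a3', 'b3', 'c3']})

dfs = [x.set_index('time') for x in (df1, df2, df3)]
out = pd.concat(dfs, axis=1, keys=range(1, len(dfs) + 1))

# keys= builds MultiIndex columns like (1, 'data'); flatten them
# into 'data1', 'data2', ... by joining the two levels.
out.columns = [f'{col}{key}' for key, col in out.columns]
out = out.reset_index()
print(out)
```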
Probably a duplicate, but I'm not even sure what to search for.
If I have a pandas dataframe like so:
index RH LH Data1 Data2 . . .
1 A1 A2 A B
2 B1 NaN C D
3 NaN C2 E F
And I want to re-index as so:
index Data1 Data2
A1 A B
A2 A B
B1 C D
C2 E F
Is there a simple-ish way to do this? Or should I just do a pair of for loops?
You can set all columns except RH and LH as the index with DataFrame.set_index, reshape with DataFrame.stack (which also drops the NaNs), remove the last index level with DataFrame.reset_index and drop=True, convert the remaining levels back to columns, and finally create the index with DataFrame.set_index:
cols = df.columns.difference(['RH','LH']).tolist()
df = (df.set_index(cols)
.stack()
.reset_index(len(cols), drop=True)
.reset_index(name='idx')
.set_index('idx'))
print (df)
Data1 Data2
idx
A1 A B
A2 A B
B1 C D
C2 E F
Or use DataFrame.melt with DataFrame.dropna, drop the variable column, and finally set the idx column as the index:
df = (df.melt(cols, value_name='idx')
.dropna(subset=['idx'])
.drop('variable', axis=1)
.set_index('idx'))
print (df)
Data1 Data2
idx
A1 A B
B1 C D
A2 A B
C2 E F
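A runnable sketch of the melt variant with the sample data (None stands in for the NaNs):

```python
import pandas as pd

df = pd.DataFrame({'RH': ['A1', 'B1', None],
                   'LH': ['A2', None, 'C2'],
                   'Data1': ['A', 'C', 'E'],
                   'Data2': ['B', 'D', 'F']})

# Melt RH/LH into one value column, drop the missing entries,
# and use the remaining values as the new index.
cols = df.columns.difference(['RH', 'LH']).tolist()
out = (df.melt(cols, value_name='idx')
         .dropna(subset=['idx'])
         .drop('variable', axis=1)
         .set_index('idx'))
print(out)
```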
df1=
A B C D
a1 b1 c1 1
a2 b2 c2 2
a3 b3 c3 4
df2=
A B C D
a1 b1 c1 2
a2 b2 c2 1
I want to compare the values of column 'D' in both dataframes. If both dataframes had the same number of rows, I would just do this:
newDF = df1['D']-df2['D']
However, there are times when the numbers of rows differ. I want a result dataframe like this:
resultDF=
A B C D_df1 D_df2 Diff
a1 b1 c1 1 2 -1
a2 b2 c2 2 1 1
EDIT: if the 1st rows of A, B, C from df1 and df2 are the same, then and only then compare the 1st rows of column D of each dataframe. Similarly, repeat for all the rows.
Use merge and DataFrame.eval:
df1.merge(df2, on=['A','B','C'], suffixes=['_df1','_df2']).eval('Diff=D_df1 - D_df2')
Out[314]:
A B C D_df1 D_df2 Diff
0 a1 b1 c1 1 2 -1
1 a2 b2 c2 2 1 1
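With the sample frames filled in, a runnable sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['a1', 'a2', 'a3'], 'B': ['b1', 'b2', 'b3'],
                    'C': ['c1', 'c2', 'c3'], 'D': [1, 2, 4]})
df2 = pd.DataFrame({'A': ['a1', 'a2'], 'B': ['b1', 'b2'],
                    'C': ['c1', 'c2'], 'D': [2, 1]})

# The (default inner) merge keeps only rows where A, B and C all match,
# then eval() computes the difference of the two D columns in one step.
result = (df1.merge(df2, on=['A', 'B', 'C'], suffixes=['_df1', '_df2'])
             .eval('Diff = D_df1 - D_df2'))
print(result)
```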
I am handling dataframes in several ways, and now I'd like to merge two dataframes based on specific columns and delete the rows that are duplicated.
Is it possible? I tried to use pd.concat but failed...
For example, I want to merge df1 and df2 into df3 with these conditions:
if the c1 & c2 information is the same, delete the duplicated rows (keep only df1's row, even if the c3 data differs between df1 and df2)
if the c1 & c2 information is different, use both rows (df1 and df2)
before:
df1
c1 c2 c3
0 0 x {'a':1 ,'b':2}
1 0 y {'a':3 ,'b':4}
2 2 z {'a':5 ,'b':6}
df2
c1 c2 c3
0 0 x {'a':11 ,'b':12}
1 0 y {'a':13 ,'b':14}
2 3 z {'a':15 ,'b':16}
expected result df3:
c1 c2 c3
0 0 x {'a':1 ,'b':2}
1 0 y {'a':3 ,'b':4}
2 2 z {'a':5 ,'b':6}
3 3 z {'a':15 ,'b':16}
You can do this by first determining which rows are only in df2, using merge with how='right' and indicator=True, then concatenating the result with df1:
In [125]:
merged = df1.merge(df2, left_on=['c1','c2'], right_on=['c1','c2'], how='right', indicator=True)
merged = merged[merged['_merge']=='right_only']
merged = merged.rename(columns={'c3_y':'c3'})
merged
Out[125]:
c1 c2 c3_x c3 _merge
2 3 z NaN {'a':15 ,'b':16} right_only
In [126]:
combined = pd.concat([df1, merged[df1.columns]])
combined
Out[126]:
c1 c2 c3
0 0 x {'a':1 ,'b':2}
1 0 y {'a':3 ,'b':4}
2 2 z {'a':5 ,'b':6}
2 3 z {'a':15 ,'b':16}
If we break down the above:
In [128]:
merged = df1.merge(df2, left_on=['c1','c2'], right_on=['c1','c2'], how='right', indicator=True)
merged
Out[128]:
c1 c2 c3_x c3_y _merge
0 0 x {'a':1 ,'b':2} {'a':11 ,'b':12} both
1 0 y {'a':3 ,'b':4} {'a':13 ,'b':14} both
2 3 z NaN {'a':15 ,'b':16} right_only
In [129]:
merged = merged[merged['_merge']=='right_only']
merged
Out[129]:
c1 c2 c3_x c3_y _merge
2 3 z NaN {'a':15 ,'b':16} right_only
In [130]:
merged = merged.rename(columns={'c3_y':'c3'})
merged
Out[130]:
c1 c2 c3_x c3 _merge
2 3 z NaN {'a':15 ,'b':16} right_only
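End to end, the steps above can be sketched as:

```python
import pandas as pd

df1 = pd.DataFrame({'c1': [0, 0, 2], 'c2': ['x', 'y', 'z'],
                    'c3': [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}, {'a': 5, 'b': 6}]})
df2 = pd.DataFrame({'c1': [0, 0, 3], 'c2': ['x', 'y', 'z'],
                    'c3': [{'a': 11, 'b': 12}, {'a': 13, 'b': 14}, {'a': 15, 'b': 16}]})

# indicator=True adds a '_merge' column telling which side each row came
# from; 'right_only' marks df2 rows with no (c1, c2) match in df1.
merged = df1.merge(df2, on=['c1', 'c2'], how='right', indicator=True)
right_only = (merged[merged['_merge'] == 'right_only']
              .rename(columns={'c3_y': 'c3'}))

# Keep all of df1 and append only the df2 rows that had no match.
combined = pd.concat([df1, right_only[df1.columns]], ignore_index=True)
print(combined)
```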