Addition-merging dataframes - python

What is the best way to add the contents of two dataframes, which have mostly equivalent indices:
df1:
A B C
A 0 3 1
B 3 0 2
C 1 2 0
df2:
A B C D
A 0 1 1 0
B 1 0 3 2
C 1 3 0 0
D 0 2 0 0
df1 + df2 =
A B C D
A 0 4 2 0
B 4 0 5 2
C 2 5 0 0
D 0 2 0 0

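You can also concat the two dataframes. By default, pd.concat stacks the rows and aligns the columns by name, so sorting the index afterwards interleaves the rows (this example uses disjoint index labels, so nothing is summed):
# sample dataframe
import pandas as pd
df1 = pd.DataFrame({'a': [1,2,3], 'b':[2,3,4]}, index=['a','c','e'])
df2 = pd.DataFrame({'a': [10,20], 'b':[11,22]}, index=['b','d'])
new_df = pd.concat([df1, df2]).sort_index()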
print(new_df)
a b
a 1 2
b 10 11
c 2 3
d 20 22
e 3 4
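When the index labels overlap, as in the question, concat alone keeps duplicate rows instead of adding them. A minimal sketch of one way to turn the concat idea into an addition, assuming a groupby on the index is acceptable (the variable names `stacked` and `summed` are only illustrative):

```python
import pandas as pd

labels3 = ["A", "B", "C"]
labels4 = ["A", "B", "C", "D"]

# The two frames from the question
df1 = pd.DataFrame([[0, 3, 1], [3, 0, 2], [1, 2, 0]],
                   index=labels3, columns=labels3)
df2 = pd.DataFrame([[0, 1, 1, 0], [1, 0, 3, 2], [1, 3, 0, 0], [0, 2, 0, 0]],
                   index=labels4, columns=labels4)

# Stack the rows; df1 has no column D, so its rows get NaN there
stacked = pd.concat([df1, df2])

# Sum rows that share an index label; sum() skips NaN by default
summed = stacked.groupby(level=0).sum()
print(summed)
```

Column D comes back as float because it held NaN after the concat; the values themselves match the table asked for in the question.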

I think you can just add:
In [625]: df1.add(df2,fill_value=0)
Out[625]:
A B C D
A 0.0 4.0 2.0 0.0
B 4.0 0.0 5.0 2.0
C 2.0 5.0 0.0 0.0
D 0.0 2.0 0.0 0.0
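For reference, a self-contained sketch of the same call using the data from the question. It also shows why fill_value matters: a plain `df1 + df2` leaves NaN wherever a label exists in only one frame.

```python
import pandas as pd

labels3 = ["A", "B", "C"]
labels4 = ["A", "B", "C", "D"]

df1 = pd.DataFrame([[0, 3, 1], [3, 0, 2], [1, 2, 0]],
                   index=labels3, columns=labels3)
df2 = pd.DataFrame([[0, 1, 1, 0], [1, 0, 3, 2], [1, 3, 0, 0], [0, 2, 0, 0]],
                   index=labels4, columns=labels4)

# Plain addition aligns on labels but yields NaN for row D and column D
plain = df1 + df2

# fill_value=0 treats a value missing from one frame as 0 before adding
total = df1.add(df2, fill_value=0)

print(plain)
print(total)
```

If integer output is preferred, `total.astype(int)` should be safe here because no NaN remains.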

Related

merge dataframes and replace existing column

df1 = pd.DataFrame({'A':[3,5,2,5], 'B':['w','x','y','z'], 'C':['0','0','0','0']})
df2 = pd.DataFrame({'B':['w','x','y','z'],'C':['1','2','3','4'], 'D':[10,20,30,40]})
I'm trying to merge df1 and df2 on B and keep all A B C and D columns:
A B C D
0 3.0 w 1 10.0
1 5.0 x 2 20.0
2 2.0 y 3 30.0
3 5.0 z 4 40.0
I've tried df1.merge(df2, how='outer', on='B')
A B C_x C_y D
0 3 w 0 1 10
1 5 x 0 2 20
2 2 y 0 3 30
3 5 z 0 4 40
which is almost what I want, but I need C from df2 to replace C from df1. How can I achieve that?
If you don't want C from the lefthand side at all you could simply drop it before the merge:
df1 = pd.DataFrame({'A':[3,5,2,5], 'B':['w','x','y','z'], 'C':['0','0','0','0']})
df2 = pd.DataFrame({'B':['w','x','y','z'],'C':['1','2','3','4'], 'D':[10,20,30,40]})
result = pd.merge(
    df1.drop('C', axis=1),
    df2,
    how='outer',
    on='B')
A B C D
0 3 w 1 10
1 5 x 2 20
2 2 y 3 30
3 5 z 4 40
Edit:
However, if you want to fall back to C from df1 wherever df2 has no value for C, you could use combine_first():
df1 = pd.DataFrame({'A':[3,5,2,5,6], 'B':['w','x','y','z','q'], 'C':['0','0','0','0','88']})
df2 = pd.DataFrame({'B':['w','x','y','z'],'C':['1','2','3','4'], 'D':[10,20,30,40]})
result_2 = pd.merge(df1, df2, how='outer', on='B')
result_2['C'] = result_2['C_y'].combine_first(result_2['C_x'])
result_2.drop(['C_x', 'C_y'], axis=1, inplace=True)
A B D C
0 3 w 10.0 1
1 5 x 20.0 2
2 2 y 30.0 3
3 5 z 40.0 4
4 6 q NaN 88
Here is one way to do it: select only the columns of df1 that you want to keep before merging
df1[['A','B']].merge(df2,
                     on='B',
                     how='outer')
A B C D
0 3 w 1 10
1 5 x 2 20
2 2 y 3 30
3 5 z 4 40
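Another option, sketched here as an alternative rather than taken from the answers above, is to let merge rename only the left-hand C through its suffixes parameter and then drop that column. The suffix `_old` is an arbitrary name chosen for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [3, 5, 2, 5],
                    'B': ['w', 'x', 'y', 'z'],
                    'C': ['0', '0', '0', '0']})
df2 = pd.DataFrame({'B': ['w', 'x', 'y', 'z'],
                    'C': ['1', '2', '3', '4'],
                    'D': [10, 20, 30, 40]})

# The left C becomes C_old; the right C keeps its name (empty suffix)
merged = df1.merge(df2, on='B', how='outer', suffixes=('_old', ''))

# Discard the left-hand values, leaving A, B, C, D
result = merged.drop(columns='C_old')
print(result)
```

This keeps C_old available until the drop, which can be handy for checking which values were replaced.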

How to insert list of values into null values of a column in python?

I am new to pandas and am facing an issue with null values. I have a list of 3 values that has to be inserted into the missing values of a column. How do I do that?
In [57]: df
Out[57]:
a b c d
0 0 1 2 3
1 0 NaN 0 1
2 0 NaN 3 4
3 0 1 2 5
4 0 NaN 2 6
In [58]: list = [11,22,44]
The output I want
Out[57]:
a b c d
0 0 1 2 3
1 0 11 0 1
2 0 22 3 4
3 0 1 2 5
4 0 44 2 6
If your list has the same length as the number of NaN values:
l=[11,22,44]
df.loc[df['b'].isna(),'b'] = l
print(df)
a b c d
0 0 1.0 2 3
1 0 11.0 0 1
2 0 22.0 3 4
3 0 1.0 2 5
4 0 44.0 2 6
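A self-contained sketch of that assignment, with an explicit length check added as a precaution (the check and the names `values` and `mask` are additions for illustration, not part of the answer above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 0, 0],
                   'b': [1, np.nan, np.nan, 1, np.nan],
                   'c': [2, 0, 3, 2, 2],
                   'd': [3, 1, 4, 5, 6]})
values = [11, 22, 44]  # avoid naming this "list", which shadows the built-in

# Boolean mask marking the rows where b is missing
mask = df['b'].isna()

# Fail early if the list does not line up with the gaps
if mask.sum() != len(values):
    raise ValueError("number of values must equal the number of NaN entries")

# Values are assigned to the masked rows in order, top to bottom
df.loc[mask, 'b'] = values
print(df)
```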
Try stack, assign the values, then unstack back:
s = df.stack(dropna=False)
s.loc[s.isna()] = l # the list is named l here, because reusing the name list shadows the Python built-in and can cause problems later
df = s.unstack()
df
Out[178]:
a b c d
0 0.0 1.0 2.0 3.0
1 0.0 11.0 0.0 1.0
2 0.0 22.0 3.0 4.0
3 0.0 1.0 2.0 5.0
4 0.0 44.0 2.0 6.0

Pandas insert empty row at 0th position

Suppose I have the following data frame:
A B
1 2 3 4 5
4 5 6 7 8
I want to check whether df(0,0) is NaN and, if so, insert pd.Series(np.nan) at the 0th position. So in the above case it will be
A B
1 2 3 4 5
4 5 6 7 8
I am able to check the (0,0) element, but how do I insert an empty row at the first position?
Use concat with a DataFrame containing one empty row (DataFrame.append was removed in pandas 2.0):
df1 = pd.DataFrame([[np.nan] * len(df.columns)], columns=df.columns)
df = pd.concat([df1, df], ignore_index=True)
print (df)
A B C D E
0 NaN NaN NaN NaN NaN
1 1.0 2.0 3.0 4.0 5.0
2 4.0 5.0 6.0 7.0 8.0
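A different sketch of the same result, offered as an alternative that does not build a second frame: reindex on a label that does not exist, which produces an all-NaN row, then renumber. It assumes the original row labels do not need to be preserved; `with_blank` is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5], [4, 5, 6, 7, 8]],
                  columns=list('ABCDE'))

# Make sure the index is 0..n-1 so that -1 is guaranteed to be a new label
renumbered = df.reset_index(drop=True)

# Label -1 is missing, so reindex fills that row with NaN and puts it first
with_blank = renumbered.reindex(range(-1, len(renumbered)))

# Renumber the rows from 0 again
with_blank = with_blank.reset_index(drop=True)
print(with_blank)
```

As with the other approaches, the integer columns become float because NaN cannot be stored in an integer column.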
Perhaps you can first append a row of zeros, shift all rows down by one, and overwrite the first row with 0:
df
A B C D E
0 1 2 3 4 5
1 4 5 6 7 8
df.loc[len(df)] = 0
df
A B C D E
0 1 2 3 4 5
1 4 5 6 7 8
2 0 0 0 0 0
df = df.shift()
df.loc[0] = 0
df
A B C D E
0 0.0 0.0 0.0 0.0 0.0
1 1.0 2.0 3.0 4.0 5.0
2 4.0 5.0 6.0 7.0 8.0

creating dictionary from multiple columns in pandas

For the following data frame df1:
sentence A B C D F G
dizzy 1 1 0 0 k 1
Head 0 0 1 0 l 1
nausea 0 0 0 1 fd 1
zap 1 0 1 0 g 1
dizziness 0 0 0 1 V 1
I need to create a dictionary from column sentence with columns A, B, C, and D.
In the next step, I need to map the sentences column in data frame df2 to the values of A, B, C, and D. The output should look like this:
sentences A B C D
dizzy 1 1 0 0
happy
Head 0 0 1 0
nausea 0 0 0 1
fill out
zap 1 0 1 0
dizziness 0 0 0 1
This is my code, but it handles just one column; I do not know how to do it for several columns:
equiv = df1.set_index('sentence')['A'].to_dict()
df2['A'] = df2['sentences'].apply(lambda x: equiv.get(x, np.nan))
Thanks.
IIUC:
Setup:
In [164]: df1
Out[164]:
sentence A B C D F G
0 dizzy 1 1 0 0 k 1
1 Head 0 0 1 0 l 1
2 nausea 0 0 0 1 fd 1
3 zap 1 0 1 0 g 1
4 dizziness 0 0 0 1 V 1
In [165]: df2
Out[165]:
sentences
0 dizzy
1 happy
2 Head
3 nausea
4 fill out
5 zap
6 dizziness
Solution:
In [174]: df2[['sentences']].merge(df1[['sentence','A','B','C','D']],
                                   left_on='sentences',
                                   right_on='sentence',
                                   how='outer')
Out[174]:
sentences sentence A B C D
0 dizzy dizzy 1.0 1.0 0.0 0.0
1 happy NaN NaN NaN NaN NaN
2 Head Head 0.0 0.0 1.0 0.0
3 nausea nausea 0.0 0.0 0.0 1.0
4 fill out NaN NaN NaN NaN NaN
5 zap zap 1.0 0.0 1.0 0.0
6 dizziness dizziness 0.0 0.0 0.0 1.0
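If an actual dictionary is wanted, or the duplicate sentence column is unwelcome, the lookup can also be sketched with set_index plus join. This is an alternative under the assumption that sentence values in df1 are unique; `lookup`, `equiv` and `mapped` are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({
    'sentence': ['dizzy', 'Head', 'nausea', 'zap', 'dizziness'],
    'A': [1, 0, 0, 1, 0],
    'B': [1, 0, 0, 0, 0],
    'C': [0, 1, 0, 1, 0],
    'D': [0, 0, 1, 0, 1],
    'F': ['k', 'l', 'fd', 'g', 'V'],
    'G': [1, 1, 1, 1, 1],
})
df2 = pd.DataFrame({'sentences': ['dizzy', 'happy', 'Head', 'nausea',
                                  'fill out', 'zap', 'dizziness']})

cols = ['A', 'B', 'C', 'D']

# Lookup table keyed by sentence, restricted to the wanted columns
lookup = df1.set_index('sentence')[cols]

# Nested dictionary: sentence -> {column name -> value}
equiv = lookup.to_dict('index')

# Left join of df2's sentences column against the lookup index;
# sentences missing from df1 get NaN in A..D
mapped = df2.join(lookup, on='sentences')
print(mapped)
```

Because join is a left join by default, the rows stay in df2's order.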

Pandas: solve a crosstab issue

I have a situation where a user belongs to multiple categories:
UserID Category
1 A
1 B
2 A
3 A
4 C
2 C
4 A
A = 1,2,3,4
B = 1
C = 2,4
I want the crosstab which shows data like this using pandas:
A B C
A 4 1 2
B 1 2 0
C 2 0 2
I tried:
df.groupby(UserID).agg(countDistinct('Category'))
I did the above but it returns 0 for elements not on the diagonal.
You can first create a DataFrame from the lists a, b, c, then stack it and merge it with the original. Finally, use crosstab:
a = [1,2,3,4]
b = [1]
c = [2,4]
df1 = pd.DataFrame({'A':pd.Series(a), 'B':pd.Series(b), 'C':pd.Series(c)})
print (df1)
A B C
0 1 1.0 2.0
1 2 NaN 4.0
2 3 NaN NaN
3 4 NaN NaN
# wrap the chain in parentheses so it can span several lines
df2 = (df1.stack()
          .reset_index(drop=True, level=0)
          .reset_index(name='UserID')
          .rename(columns={'index':'newCat'}))
print (df2)
newCat UserID
0 A 1.0
1 B 1.0
2 C 2.0
3 A 2.0
4 C 4.0
5 A 3.0
6 A 4.0
df3 = pd.merge(df, df2, on='UserID')
print (pd.crosstab(df3.newCat, df3.Category))
Category A B C
newCat
A 4 1 2
B 1 1 0
C 2 0 2
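The same co-occurrence table can also be sketched without the hand-written lists, by merging the original frame with itself on UserID. This is an alternative, not the method from the answer above, and it assumes each (UserID, Category) pair appears only once; the suffixes `_row` and `_col` are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({'UserID': [1, 1, 2, 3, 4, 2, 4],
                   'Category': ['A', 'B', 'A', 'A', 'C', 'C', 'A']})

# Self-merge: one row per pair of categories that share a user
pairs = df.merge(df, on='UserID', suffixes=('_row', '_col'))

# Count how many users fall into each pair of categories
table = pd.crosstab(pairs['Category_row'], pairs['Category_col'])
print(table)
```

The diagonal holds the number of users in each category, and the off-diagonal cells hold the number of users shared by two categories.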
