Pandas: solve a crosstab issue - python

I have a situation where a user belongs to multiple categories:
UserID Category
1 A
1 B
2 A
3 A
4 C
2 C
4 A
A = 1,2,3,4
B = 1
C = 2,4
I want the crosstab which shows data like this using pandas:
A B C
A 4 1 2
B 1 1 0
C 2 0 2
I tried:
df.groupby(UserID).agg(countDistinct('Category'))
but it returns 0 for the elements that are not on the diagonal.

First create a DataFrame from the lists a, b and c. Then stack it and merge the result with the original DataFrame. Finally, use crosstab:
a = [1,2,3,4]
b = [1]
c = [2,4]
df1 = pd.DataFrame({'A':pd.Series(a), 'B':pd.Series(b), 'C':pd.Series(c)})
print (df1)
A B C
0 1 1.0 2.0
1 2 NaN 4.0
2 3 NaN NaN
3 4 NaN NaN
df2 = (df1.stack()
    .reset_index(drop=True, level=0)
    .reset_index(name='UserID')
    .rename(columns={'index':'newCat'}))
print (df2)
newCat UserID
0 A 1.0
1 B 1.0
2 C 2.0
3 A 2.0
4 C 4.0
5 A 3.0
6 A 4.0
df3 = pd.merge(df, df2, on='UserID')
print (pd.crosstab(df3.newCat, df3.Category))
Category A B C
newCat
A 4 1 2
B 1 1 0
C 2 0 2
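A shorter route to the same table, sketched here on the assumption that df holds the UserID/Category pairs from the question, is to merge df with itself on UserID so that every pair of categories sharing a user becomes one row, and then crosstab that pair. The names pairs and cooc and the suffixes are illustrative only.

```python
import pandas as pd

# Data from the question: one row per (user, category) membership.
df = pd.DataFrame({
    'UserID':   [1, 1, 2, 3, 4, 2, 4],
    'Category': ['A', 'B', 'A', 'A', 'C', 'C', 'A'],
})

# Self-merge: each user contributes one row per (category, category) pair.
pairs = df.merge(df, on='UserID', suffixes=('_row', '_col'))

# Count the pairs; the diagonal holds the number of users in each category.
cooc = pd.crosstab(pairs['Category_row'], pairs['Category_col'])
print(cooc)
```

This assumes each (UserID, Category) pair appears only once in df; if duplicates are possible, calling df.drop_duplicates() first keeps the counts as distinct users.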

Related

merge dataframes and replace existing column

df1 = pd.DataFrame({'A':[3,5,2,5], 'B':['w','x','y','z'], 'C':['0','0','0','0']})
df2 = pd.DataFrame({'B':['w','x','y','z'],'C':['1','2','3','4'], 'D':[10,20,30,40]})
I'm trying to merge df1 and df2 on B and keep all A B C and D columns:
A B C D
0 3.0 w 1 10.0
1 5.0 x 2 20.0
2 2.0 y 3 30.0
3 5.0 z 4 40.0
I've tried df1.merge(df2, how='outer', on='B')
A B C_x C_y D
0 3 w 0 1 10
1 5 x 0 2 20
2 2 y 0 3 30
3 5 z 0 4 40
which is almost what I want, but I need the C column from df2 to replace the C column from df1. How can I achieve that?
If you don't want C from the left-hand side at all, you could simply drop it before the merge:
df1 = pd.DataFrame({'A':[3,5,2,5], 'B':['w','x','y','z'], 'C':['0','0','0','0']})
df2 = pd.DataFrame({'B':['w','x','y','z'],'C':['1','2','3','4'], 'D':[10,20,30,40]})
result = pd.merge(
df1.drop('C', axis=1),
df2,
how='outer',
on='B')
A B C D
0 3 w 1 10
1 5 x 2 20
2 2 y 3 30
3 5 z 4 40
Edit:
However, if you want to fall back to C from df1 for rows that have no match in df2, you could use combine_first():
df1 = pd.DataFrame({'A':[3,5,2,5,6], 'B':['w','x','y','z','q'], 'C':['0','0','0','0','88']})
df2 = pd.DataFrame({'B':['w','x','y','z'],'C':['1','2','3','4'], 'D':[10,20,30,40]})
result_2 = pd.merge(df1, df2, how='outer', on='B')
result_2['C'] = result_2['C_y'].combine_first(result_2['C_x'])
result_2.drop(['C_x', 'C_y'], axis=1, inplace=True)
A B D C
0 3 w 10.0 1
1 5 x 20.0 2
2 2 y 30.0 3
3 5 z 40.0 4
4 6 q NaN 88
Here is one way to do it: select only the columns you need from df1 before the merge
df1[['A','B']].merge(df2,
on='B',
how='outer')
A B C D
0 3 w 1 10
1 5 x 2 20
2 2 y 3 30
3 5 z 4 40
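As a variation on the combine_first answer above, the suffixed C_x / C_y columns can be avoided by indexing both frames on B before combining. This is only a sketch; the names left, right and result are illustrative, and the data is the five-row example from that answer.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [3, 5, 2, 5, 6],
                    'B': ['w', 'x', 'y', 'z', 'q'],
                    'C': ['0', '0', '0', '0', '88']})
df2 = pd.DataFrame({'B': ['w', 'x', 'y', 'z'],
                    'C': ['1', '2', '3', '4'],
                    'D': [10, 20, 30, 40]})

# Index both frames on the join key so combine_first aligns rows by B.
left = df1.set_index('B')
right = df2.set_index('B')

# Values from df2 win; gaps (such as C for 'q') are filled from df1.
result = right.combine_first(left).reset_index()
print(result)
```

Note that combine_first returns the rows sorted by the index (here B) and may change integer columns to float where NaN is introduced.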

Merge two DataFrames by combining duplicates and concatenating nonduplicates

I have two DataFrames:
df = pd.DataFrame({'A':[1,2],
'B':[3,4]})
A B
0 1 3
1 2 4
df2 = pd.DataFrame({'A':[3,2,1],
'C':[5,6,7]})
A C
0 3 5
1 2 6
2 1 7
I want to merge them so that column 'A' keeps the values that appear in only one DataFrame, while rows whose 'A' value appears in both are merged into one.
Desired output:
A B C
0 3 NaN 5
1 2 4 6
2 1 3 7
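You can use combine_first. It aligns on the index, so set A as the index of both DataFrames first:
df2 = df2.set_index('A').combine_first(df.set_index('A')).reset_index()
Output:
A B C
0 1 3.0 7
1 2 4.0 6
2 3 NaN 5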
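An outer merge on A is another way to get this table, since it keeps every value of A from either frame and leaves NaN where one side has no match. A minimal sketch using the question's data (the name merged is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [3, 2, 1], 'C': [5, 6, 7]})

# Outer join on A: rows present in only one frame get NaN in the other's columns.
merged = (df.merge(df2, on='A', how='outer')
            .sort_values('A')           # make the row order explicit
            .reset_index(drop=True))
print(merged)
```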

compare two columns row by row and nan duplicate values pandas

I have a df
a b c
0 3 0
1 1 4
2 3 3
4 4 1
I want to compare a and b to c. If a value in the same row is equal to c I want 'nan' in a and/or b.
Like that:
a b c
nan 3 0
1 1 4
2 nan 3
4 4 1
We can use to_numpy with DataFrame.mask for this:
eqs = df.loc[:, :'b'].eq(df['c'].to_numpy()[:, None])
df.loc[:, :'b'] = df.loc[:, :'b'].mask(eqs)
a b c
0 NaN 3.0 0
1 1.0 1.0 4
2 2.0 NaN 3
3 4.0 4.0 1
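The same comparison can be sketched without to_numpy by passing axis=0 to eq, which aligns c with the rows. Assigning the masked result back as whole columns (rather than through loc) lets a and b change from int to float to hold the NaN values. The variable cols is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2, 4],
                   'b': [3, 1, 3, 4],
                   'c': [0, 4, 3, 1]})

cols = ['a', 'b']

# axis=0 compares each row's a and b with that same row's c.
eqs = df[cols].eq(df['c'], axis=0)

# mask() puts NaN wherever eqs is True; the columns are replaced as floats.
df[cols] = df[cols].mask(eqs)
print(df)
```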

fill NaN values with mean based on another column specific value

I want to fill the NaN values in column c of my dataframe with the mean, but only for the rows whose category is B, and leave the other rows alone.
print (df)
Category b c
0 A 1 5.0
1 C 1 NaN
2 A 1 4.0
3 B 2 NaN
4 A 2 1.0
5 B 2 NaN
6 C 1 3.0
7 C 1 2.0
8 B 1 NaN
What I'm doing at the moment is:
df.c = df.c.fillna(df.c.mean())
But it fills all the NaN values, while I only want to fill the rows at index 3, 5 and 8, whose category is B.
Combine fillna with slicing assignment
df.loc[df.Category.eq('B'), 'c'] = (df.loc[df.Category.eq('B'), 'c'].
fillna(df.c.mean()))
Out[736]:
Category b c
0 A 1 5.0
1 C 1 NaN
2 A 1 4.0
3 B 2 3.0
4 A 2 1.0
5 B 2 3.0
6 C 1 3.0
7 C 1 2.0
8 B 1 3.0
Or a direct assignment with 2 masks
pandas.DataFrame.eq is the element wise equality operator.
df.loc[df.Category.eq('B') & df.c.isna(), 'c'] = df.c.mean()
Out[745]:
Category b c
0 A 1 5.0
1 C 1 NaN
2 A 1 4.0
3 B 2 3.0
4 A 2 1.0
5 B 2 3.0
6 C 1 3.0
7 C 1 2.0
8 B 1 3.0
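The two-mask idea can also be written with Series.mask, which returns a new column instead of assigning through loc. This is a sketch on the question's data; the name to_fill is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'C', 'A', 'B', 'A', 'B', 'C', 'C', 'B'],
    'b': [1, 1, 1, 2, 2, 2, 1, 1, 1],
    'c': [5.0, np.nan, 4.0, np.nan, 1.0, np.nan, 3.0, 2.0, np.nan],
})

# True only where the row is category B and c is missing.
to_fill = df['Category'].eq('B') & df['c'].isna()

# Replace those positions with the overall mean of c (NaN is skipped by mean()).
df['c'] = df['c'].mask(to_fill, df['c'].mean())
print(df)
```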
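A row-wise apply also works, although it is slower than the masked assignments. Each row['c'] is a scalar, so test it with pd.isna rather than calling fillna on it:
mean_c = df.c.mean()
df.c = df.apply(
lambda row: mean_c if row['Category'] == 'B' and pd.isna(row['c']) else row['c'], axis=1)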

Addition-merging dataframes

What is the best way to add the contents of two dataframes, which have mostly equivalent indices:
df1:
A B C
A 0 3 1
B 3 0 2
C 1 2 0
df2:
A B C D
A 0 1 1 0
B 1 0 3 2
C 1 3 0 0
D 0 2 0 0
df1 + df2 =
A B C D
A 0 4 2 0
B 4 0 5 2
C 2 5 0 0
D 0 2 0 0
If the two dataframes share no index labels, you can also concat them, since concatenation (by default) stacks rows along the index. Note that this does not sum cells that appear in both.
# sample dataframe
df1 = pd.DataFrame({'a': [1,2,3], 'b':[2,3,4]}, index=['a','c','e'])
df2 = pd.DataFrame({'a': [10,20], 'b':[11,22]}, index=['b','d'])
new_df= pd.concat([df1, df2]).sort_index()
print(new_df)
a b
a 1 2
b 10 11
c 2 3
d 20 22
e 3 4
I think you can just add:
In [625]: df1.add(df2,fill_value=0)
Out[625]:
A B C D
A 0.0 4.0 2.0 0.0
B 4.0 0.0 5.0 2.0
C 2.0 5.0 0.0 0.0
D 0.0 2.0 0.0 0.0
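For reference, here is a self-contained sketch of that add call on the frames from the question. The variable names labels3, labels4 and total are illustrative, and the final astype(int) is an optional step that assumes no NaN remains in the result.

```python
import pandas as pd

labels3 = ['A', 'B', 'C']
labels4 = ['A', 'B', 'C', 'D']

df1 = pd.DataFrame([[0, 3, 1],
                    [3, 0, 2],
                    [1, 2, 0]], index=labels3, columns=labels3)
df2 = pd.DataFrame([[0, 1, 1, 0],
                    [1, 0, 3, 2],
                    [1, 3, 0, 0],
                    [0, 2, 0, 0]], index=labels4, columns=labels4)

# fill_value=0 treats a cell missing from one frame as 0 before adding.
total = df1.add(df2, fill_value=0)

# Every cell exists in df2 here, so no NaN remains and the cast is safe.
total = total.astype(int)
print(total)
```

A cell that is missing from both frames would still come out as NaN, in which case the cast to int would fail and should be skipped.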
