compare two columns row by row and nan duplicate values pandas - python

I have a df
a b c
0 3 0
1 1 4
2 3 3
4 4 1
I want to compare a and b to c. If a value in the same row equals c, I want NaN in a and/or b, like this:
a b c
nan 3 0
1 1 4
2 nan 3
4 4 1

We can use to_numpy with DataFrame.mask for this:
# Mark cells in columns a..b that equal c in the same row
eqs = df.loc[:, :'b'].eq(df['c'].to_numpy()[:, None])
# Replace those cells with NaN
df.loc[:, :'b'] = df.loc[:, :'b'].mask(eqs)
a b c
0 NaN 3.0 0
1 1.0 1.0 4
2 2.0 NaN 3
3 4.0 4.0 1
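Alternatively, the same masking can be done without the NumPy detour by letting eq align on the index with axis=0 (a sketch, equivalent to the answer above):
# Compare each of columns a..b against c, row by row, aligned on the index
eqs = df.loc[:, :'b'].eq(df['c'], axis=0)
df.loc[:, :'b'] = df.loc[:, :'b'].mask(eqs)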


How to remove or drop all rows after first occurrence of `NaN` from the entire DataFrame

I want to remove/drop all rows after the first occurrence of NaN in any DataFrame column.
I have created two sample DataFrames, illustrated below. In the first DataFrame the dtypes of the first two columns are object and the last is int, while in the second they are float, object and int.
First:
>>> df = pd.DataFrame({"A": (1,2,3,4,5,6,7,'NaN','NaN','NaN','NaN'),"B": (1,2,3,'NaN',4,5,6,7,'NaN',"9","10"),"C": range(11)})
>>> df
A B C
0 1 1 0
1 2 2 1
2 3 3 2
3 4 NaN 3
4 5 4 4
5 6 5 5
6 7 6 6
7 NaN 7 7
8 NaN NaN 8
9 NaN 9 9
10 NaN 10 10
Dtypes:
>>> df.dtypes
A object
B object
C int64
dtype: object
The index-based approach below, applied to a particular column, works just fine as long as the dtypes are object and int, but I'm looking for a DataFrame-level operation, not one limited to a single column.
>>> df[:df[df['A'] == 'NaN'].index[0]]
A B C
0 1 1 0
1 2 2 1
2 3 3 2
3 4 NaN 3
4 5 4 4
5 6 5 5
6 7 6 6
>>> df[:df[df['B'] == 'NaN'].index[0]]
A B C
0 1 1 0
1 2 2 1
2 3 3 2
Second:
Another interesting fact: when the DataFrame is created with np.nan we get different dtypes, and then the index-based approach fails even for a single-column operation.
>>> df = pd.DataFrame({"A": (1,2,3,4,5,6,7,np.nan,np.nan,np.nan,np.nan),"B": (1,2,3,np.nan,4,5,6,7,np.nan,"9","10"),"C": range(11)})
>>> df
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
3 4.0 NaN 3
4 5.0 4 4
5 6.0 5 5
6 7.0 6 6
7 NaN 7 7
8 NaN NaN 8
9 NaN 9 9
10 NaN 10 10
Dtypes:
>>> df.dtypes
A float64
B object
C int64
dtype: object
Error:
>>> df[:df[df['B'] == 'NaN'].index[0]]
IndexError: index 0 is out of bounds for axis 0 with size 0
>>> df[:df[df['A'] == 'NaN'].index[0]]
IndexError: index 0 is out of bounds for axis 0 with size 0
Expected output for the second DataFrame:
>>> df
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
So I am looking for a way to check across the entire DataFrame, regardless of dtype, and drop all rows from the first occurrence of NaN onward.
You can try:
out = df.iloc[:df.isna().any(axis=1).idxmax()]
Or via replace() convert your string 'NaN's to real NaNs, then check for missing values and filter rows:
df = df.replace({'NaN': float('NaN'), 'nan': float('NaN')})
out = df.iloc[:df.isna().any(axis=1).idxmax()]
Output of out:
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
Just for posterity ...
>>> df.iloc[:df.isna().any(axis=1).argmax()]
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
>>> df.iloc[:df.isnull().any(axis=1).argmax()]
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
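One caveat: idxmax returns an index label while argmax returns a position, so with a non-default index only the argmax variants are safe to pass to iloc. A minimal sketch with a hypothetical string index:
import numpy as np
import pandas as pd

# Hypothetical frame whose index is not the default RangeIndex
df = pd.DataFrame({"A": [1.0, 2.0, np.nan, 4.0]}, index=list("wxyz"))

# argmax yields the position of the first row containing NaN, which iloc understands
out = df.iloc[:df.isna().any(axis=1).argmax()]
print(out)  # keeps rows 'w' and 'x'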

set entire group to NaN if containing a single NaN and combine columns

I have a df
a b c d
0 1 nan 1
0 2 2 nan
0 2 3 4
1 3 1 nan
1 1 nan 3
1 1 2 3
1 1 2 4
I need to group by a and b and then, if c or d contains one or more NaNs within a group, I want the entire group in that specific column to be NaN:
a b c d
0 1 nan 1
0 2 2 nan
0 2 3 nan
1 3 1 nan
1 1 nan 3
1 1 nan 3
1 1 nan 4
and then combine c and d into e so that there are no NaNs anymore:
a b c d e
0 1 nan 1 1
0 2 2 nan 2
0 2 3 nan 3
1 3 1 nan 1
1 1 nan 3 3
1 1 nan 3 3
1 1 nan 4 4
You will want to check each group for whether it contains a NaN, set the appropriate value (NaN or the existing values), and then use combine_first() to combine the columns.
from io import StringIO
import pandas as pd
import numpy as np
df = pd.read_csv(StringIO("""
a b c d
0 1 nan 1
0 2 2 nan
0 2 3 4
1 3 1 nan
1 1 nan 3
1 1 2 3
1 1 2 4
"""), sep=' ')
for col in ['c', 'd']:
    df[col] = df.groupby(['a', 'b'])[col].transform(lambda x: np.nan if any(x.isna()) else x)
df['e'] = df['c'].combine_first(df['d'])
df
a b c d e
0 0 1 NaN 1.0 1.0
1 0 2 2.0 NaN 2.0
2 0 2 3.0 NaN 3.0
3 1 3 1.0 NaN 1.0
4 1 1 NaN 3.0 3.0
5 1 1 NaN 3.0 3.0
6 1 1 NaN 4.0 4.0
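On larger frames the per-group lambda can be slow; a vectorized sketch of the same idea, replacing the lambda with a boolean transform('any') plus mask, might look like this:
for col in ['c', 'd']:
    # True for every row whose (a, b) group contains at least one NaN in col
    has_nan = df[col].isna().groupby([df['a'], df['b']]).transform('any')
    df[col] = df[col].mask(has_nan)
df['e'] = df['c'].combine_first(df['d'])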

fill NaN values with mean based on another column specific value

I want to fill the NaN values in column c of my DataFrame with the mean, but only for rows whose Category is B, ignoring the others.
print (df)
Category b c
0 A 1 5.0
1 C 1 NaN
2 A 1 4.0
3 B 2 NaN
4 A 2 1.0
5 B 2 NaN
6 C 1 3.0
7 C 1 2.0
8 B 1 NaN
So what I'm doing for the moment is :
df.c = df.c.fillna(df.c.mean())
But it fills all the NaN values, while I only want to fill rows 3, 5 and 8, whose Category value is B.
Combine fillna with slicing assignment
df.loc[df.Category.eq('B'), 'c'] = df.loc[df.Category.eq('B'), 'c'].fillna(df.c.mean())
Out[736]:
Category b c
0 A 1 5.0
1 C 1 NaN
2 A 1 4.0
3 B 2 3.0
4 A 2 1.0
5 B 2 3.0
6 C 1 3.0
7 C 1 2.0
8 B 1 3.0
Or a direct assignment with two masks. pandas.Series.eq is the element-wise equality operator.
df.loc[df.Category.eq('B') & df.c.isna(), 'c'] = df.c.mean()
Out[745]:
Category b c
0 A 1 5.0
1 C 1 NaN
2 A 1 4.0
3 B 2 3.0
4 A 2 1.0
5 B 2 3.0
6 C 1 3.0
7 C 1 2.0
8 B 1 3.0
A row-wise apply would also answer your question. Note that row['c'] is a scalar, so test it with pd.isna rather than calling fillna on it:
df.c = df.apply(
    lambda row: df.c.mean() if row['Category'] == 'B' and pd.isna(row['c']) else row['c'],
    axis=1)
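For reference, if the goal were instead to fill each category's NaNs with that category's own mean, a groupby-based sketch would be the following; note that in this sample every B row of c is NaN, so B's group mean is itself NaN and this variant would leave those rows unfilled (which is why the answers above use the overall column mean):
# Hypothetical variant: fill NaNs per category with that category's own mean.
# Caveat: here all B rows of c are NaN, so the B group mean is NaN and these
# rows would stay NaN.
df['c'] = df['c'].fillna(df.groupby('Category')['c'].transform('mean'))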

Pandas create new column based on first unique values of existing column

I'm trying to add a new column to a dataframe containing only the unique values of an existing column, with np.nan values where the duplicates would have been.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2,3,4,5], 'b':[3,4,3,4,5]})
df
a b
0 1 3
1 2 4
2 3 3
3 4 4
4 5 5
Goal:
a b c
0 1 3 3
1 2 4 4
2 3 3 nan
3 4 4 nan
4 5 5 5
I've tried:
df['c'] = np.where(df['b'].unique(), df['b'], np.nan)
It throws: operands could not be broadcast together with shapes (3,) (5,) ()
mask + duplicated
You can use Pandas methods for masking a series:
df['c'] = df['b'].mask(df['b'].duplicated())
print(df)
a b c
0 1 3 3.0
1 2 4 4.0
2 3 3 NaN
3 4 4 NaN
4 5 5 5.0
Use duplicated with np.where:
df['c'] = np.where(df['b'].duplicated(),np.nan,df['b'])
Or:
df['c'] = df['b'].where(~df['b'].duplicated(),np.nan)
print(df)
a b c
0 1 3 3.0
1 2 4 4.0
2 3 3 NaN
3 4 4 NaN
4 5 5 5.0
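A side effect visible in both outputs is that introducing NaN casts c to float. If you need to keep integers, a sketch using the nullable Int64 dtype (available since pandas 0.24) would be:
# Nullable integer dtype keeps c integral, with <NA> marking the duplicates
df['c'] = df['b'].mask(df['b'].duplicated()).astype('Int64')
print(df)
#    a  b     c
# 0  1  3     3
# 1  2  4     4
# 2  3  3  <NA>
# 3  4  4  <NA>
# 4  5  5     5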

Pandas: solve a crosstab issue

I have a situation where a user belongs to multiple categories:
UserID Category
1 A
1 B
2 A
3 A
4 C
2 C
4 A
A = 1,2,3,4
B = 1
C = 2,4
I want the crosstab which shows data like this using pandas:
A B C
A 4 1 2
B 1 2 0
C 2 0 2
I tried:
df.groupby(UserID).agg(countDistinct('Category'))
but it returns 0 for elements not on the diagonal.
You can first create a DataFrame from the lists a, b and c, then stack it and merge with the original, and finally use crosstab:
a = [1,2,3,4]
b = [1]
c = [2,4]
df1 = pd.DataFrame({'A':pd.Series(a), 'B':pd.Series(b), 'C':pd.Series(c)})
print (df1)
A B C
0 1 1.0 2.0
1 2 NaN 4.0
2 3 NaN NaN
3 4 NaN NaN
df2 = (df1.stack()
          .reset_index(drop=True, level=0)
          .reset_index(name='UserID')
          .rename(columns={'index': 'newCat'}))
print (df2)
newCat UserID
0 A 1.0
1 B 1.0
2 C 2.0
3 A 2.0
4 C 4.0
5 A 3.0
6 A 4.0
df3 = pd.merge(df, df2, on='UserID')
print (pd.crosstab(df3.newCat, df3.Category))
Category A B C
newCat
A 4 1 2
B 1 1 0
C 2 0 2
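For completeness, the same co-occurrence table can be built without hand-writing the a, b and c lists by self-merging df on UserID and cross-tabulating the category pairs (a sketch; like the answer above it yields 1 in the B/B cell, since only user 1 belongs to B):
# Every pair of categories sharing a user becomes one row of m
m = df.merge(df, on='UserID')
print(pd.crosstab(m.Category_x, m.Category_y))
# Category_y  A  B  C
# Category_x
# A           4  1  2
# B           1  1  0
# C           2  0  2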
