I have the following dataframe:
case c1 c2
1 x x
2 NaN y
3 x NaN
4 y x
5 NaN NaN
I would like to get a column "match" which shows whether the values in "c1" and "c2" are equal, treating two NaNs as equal:
case c1 c2 match
1 x x True
2 NaN y False
3 x NaN False
4 y x False
5 NaN NaN True
I tried the following based on another Stack Overflow question: Comparing two columns and keeping NaNs
However, I can't get both cases 4 and 5 correct.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'case': [1, 2, 3, 4, 5],
'c1': ['x', np.nan,'x','y', np.nan],
'c2': ['x', 'y',np.nan,'x', np.nan],
})
cond1 = df['c1'] == df['c2']
cond2 = (df['c1'].isnull()) == (df['c2'].isnull())
df['c3'] = np.select([cond1, cond2], [True, True], False)
df
Use eq with isna:
df.c1.eq(df.c2) | df.iloc[:, 1:].isna().all(1)
# or
df.c1.eq(df.c2) | df.loc[:, ['c1', 'c2']].isna().all(1)
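A minimal end-to-end sketch (assuming the sample frame built in the question) that attaches the result as the requested match column:

df['match'] = df.c1.eq(df.c2) | df[['c1', 'c2']].isna().all(1)
print(df)

   case   c1   c2  match
0     1    x    x   True
1     2  NaN    y  False
2     3    x  NaN  False
3     4    y    x  False
4     5  NaN  NaN   True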
import pandas as pd
import numpy as np
df = pd.DataFrame({
'case': [1, 2, 3, 4, 5],
'c1': ['x', np.nan,'x','y', np.nan],
'c2': ['x', 'y',np.nan,'x', np.nan],
})
# str(np.nan) == 'nan', so two NaN cells compare equal here
df['c3'] = df.apply(lambda row: str(row.c1) == str(row.c2), axis=1)
print(df)
Output
case c1 c2 c3
0 1 x x True
1 2 NaN y False
2 3 x NaN False
3 4 y x False
4 5 NaN NaN True
Use nunique with fillna:
import numpy as np
df.fillna(np.inf)[['c1','c2']].nunique(1) < 2
Or nunique with option dropna=False
df[['c1','c2']].nunique(1, dropna=False) < 2
Out[13]:
0 True
1 False
2 False
3 False
4 True
dtype: bool
Related
So I have two columns, for example A and B, and they look like this:
A B
1 4
2 5
3 6
NaN NaN
NaN NaN
NaN NaN
and I want it like this:
A
1
2
3
4
5
6
Any ideas?
I'm assuming your data is in two columns of a DataFrame. You can append the B values to the end of the A values, then drop the NaN values using the np.nan != np.nan trick. Here's an example:
import pandas as pd
import numpy as np
d = {
'A': [1,2,3, np.nan, np.nan, np.nan],
'B': [4,5,6, np.nan, np.nan, np.nan]
}
df = pd.DataFrame(d)
>>> df
     A    B
0  1.0  4.0
1  2.0  5.0
2  3.0  6.0
3  NaN  NaN
4  NaN  NaN
5  NaN  NaN
# np.nan != np.nan trick: comparing a column with itself is False exactly where it is NaN
>>> df['A'] == df['A']
0 True
1 True
2 True
3 False
4 False
5 False
Name: A, dtype: bool
x = pd.concat([df['A'], df['B']])
>>> x
0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
5 NaN
0 4.0
1 5.0
2 6.0
3 NaN
4 NaN
5 NaN
dtype: float64
x = x[x == x]
>>> x
0    1.0
1    2.0
2    3.0
0    4.0
1    5.0
2    6.0
dtype: float64
Using numpy, it could be something like:
import numpy as np
A = np.array([1, 2, 3, np.nan, np.nan, np.nan])
B = np.array([4, 5, 6, np.nan, np.nan, np.nan])
C = np.hstack([A[A < np.inf], B[B < np.inf]])
print(C) # [1. 2. 3. 4. 5. 6.]
What you might want is:
import pandas as pd
a = pd.Series([1, 2, 3, None, None, None])
b = pd.Series([4, 5, 6, None, None, None])
print(pd.concat([a.iloc[:3], b.iloc[:3]]))
And if you are just looking for non-NaN values feel free to use .dropna() in Series.
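For completeness, a sketch that does not hard-code the slice length (assuming the same a and b as above):

combined = pd.concat([a, b]).dropna().reset_index(drop=True)
print(combined)

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
dtype: float64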
I have a pandas dataframe like the one below: user is an id column which can contain duplicates, and C1, C2, C3 are further columns.
I want to delete only those rows which have a duplicated user and NaN for all values in the C1, C2, C3 columns.
Expected output for this example:
delete the 1st row (user 1) as it is all NaN and user 1 is duplicated, but don't delete row 3 (user 2) even though it is all NaN, as user 2 has only one instance (no duplicates). How can I accomplish this across all such rows?
user C1 C2 C3
1 NaN NaN NaN
1 NaN x y
2 NaN NaN NaN
3 a b c
You can do it like this:
# Getting the count of each user id
res = dict(df['user'].value_counts())

def check(idx):
    '''
    If the row at the given index has all of C1, C2, C3 NULL and the
    occurrence of that user id is greater than 1, we return False (we don't
    want this row); otherwise we return True (we want this row).
    '''
    if df.loc[idx, 'temp'] == 3 and res[df.loc[idx, 'user']] > 1:
        return False
    else:
        return True

# temp column: number of NaN values per row (3 == all of C1, C2, C3)
df['temp'] = np.sum(df.isna(), axis=1)
df['temp and dup'] = df.index.map(check)

# Now we just select the rows we want.
df = df[df['temp and dup'] == True]
df.drop(columns=['temp', 'temp and dup'], inplace=True)
df
We can create a mask that is True for the rows to drop: rows where the user id is duplicated (duplicated with keep=False) and columns C1, C2, and C3 are all NaN (isna). Negating it keeps everything else:
df = df[~(df['user'].duplicated(keep=False) &
df[['C1', 'C2', 'C3']].isna().all(axis=1))]
df:
user C1 C2 C3
1 1 NaN x y
2 2 NaN NaN NaN
3 3 a b c
If there are lots of columns, loc can be used to slice them instead of listing them all.
All columns from C1 onwards (inclusive):
df = df[~(df['user'].duplicated(keep=False) &
df.loc[:, 'C1':].isna().all(axis=1))]
All columns between C1 and C3 (inclusive):
df = df[~(df['user'].duplicated(keep=False) &
df.loc[:, 'C1':'C3'].isna().all(axis=1))]
Breakdown of the mask creation as a DataFrame:
breakdown_df = df.join(
pd.DataFrame({
'duplicated_id': df['user'].duplicated(keep=False),
'all_nan': df[['C1', 'C2', 'C3']].isna().all(axis=1),
'neither_nor': ~(df['user'].duplicated(keep=False) &
df[['C1', 'C2', 'C3']].isna().all(axis=1))
})
)
breakdown_df:
user C1 C2 C3 duplicated_id all_nan neither_nor
0 1 NaN NaN NaN True True False
1 1 NaN x y True False True
2 2 NaN NaN NaN False True True
3 3 a b c False False True
The True neither_nor rows are the rows that are kept.
Something like dropna then reindex
out = df.set_index('user').dropna(how = 'all',axis = 0).reindex(df.user.unique()).reset_index()
Out[437]:
user C1 C2 C3
0 1 NaN x y
1 2 NaN NaN NaN
2 3 a b c
df = pd.DataFrame({
'user': [1, 1, 2, 3],
'C1': [np.nan, np.nan, np.nan, 'a'],
'C2': [np.nan, 'x', np.nan, 'b'],
'C3': [np.nan, 'y', np.nan, 'c']
})
nanrows = df.iloc[:, 1:].isna().all(axis=1)
counts = df.user.value_counts()
dupes = df.user.map(counts) > 1
df[~(dupes & nanrows)]
user C1 C2 C3
1 1 NaN x y
2 2 NaN NaN NaN
3 3 a b c
EDIT:
The above solution has a subtle bug. If a user only exists in all NaN rows, they get dropped. You can see it with the following modified dataframe:
df = pd.DataFrame(
{
"user": [1, 1, 2, 3, 4, 4],
"C1": [np.nan, np.nan, np.nan, "a", np.nan, np.nan],
"C2": [np.nan, "x", np.nan, "b", np.nan, np.nan],
"C3": [np.nan, "y", np.nan, "c", np.nan, np.nan],
}
)
You can resolve this by finding the first occurrence of this kind of user and setting it to "not duplicate".
todrop = df.user[(dupes & nanrows)].drop_duplicates()
tokeep = df.user[~(dupes & nanrows)].drop_duplicates()

if todrop.isin(tokeep).all() == False:
    notinkeep = todrop[todrop.isin(tokeep) == False]
    duped_user = df.user[dupes]
    for user in notinkeep:
        # mark the first row of this user as "not a duplicate" so one row is kept
        for i, val in duped_user.items():
            if val == user:
                dupes.loc[i] = False
                break
The nested for-loop is a bit nasty, but I think it could be simplified with some work.
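One possible loop-free alternative (a sketch against the modified frame above, not exhaustively tested): flag users whose rows are all all-NaN and keep only their first row.

nanrows = df.iloc[:, 1:].isna().all(axis=1)
dupes = df.user.map(df.user.value_counts()) > 1

# users for which every row is all-NaN
all_nan_user = df['user'].map(nanrows.groupby(df['user']).all())
# keep the first row of such users so they do not vanish entirely
keep_anchor = all_nan_user & ~df['user'].duplicated()

df[~(dupes & nanrows) | keep_anchor]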
I have growth data. I would like to calibrate all the columns to a certain (arbitrary) cutoff by removing all values below this threshold and "shifting" the values up in each individual column.
To illustrate:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=list('AB'))
result:
A B
0 1 2
1 3 4
2 5 6
Removing all values of 3 or below:
df = df.where(df > 3, np.nan)
result:
A B
0 NaN NaN
1 NaN 4
2 5 6
What I'd finally want is the following dataframe (in a sense, cutting and pasting the values greater than 3 to the top of the df):
A B
0 5 4
1 NaN 6
2 NaN NaN
Any idea how I would be able to do so?
Use the justify function (defined below) for better performance:
df = pd.DataFrame([[1, 2], [3, 4],[5, 6]], columns=list('AB'))
df = df.where(df > 3, np.nan)
arr = justify(df.to_numpy(), invalid_val=np.nan, axis=0, side='up')
# older pandas versions
arr = justify(df.values, invalid_val=np.nan, axis=0, side='up')
df = pd.DataFrame(arr, columns=df.columns)
print (df)
A B
0 5.0 4.0
1 NaN 6.0
2 NaN NaN
Function by divakar:
def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    invalid_val : scalar
        Value to be treated as invalid (e.g. np.nan)
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
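A small illustration of the mask trick inside justify: sorting the boolean validity mask pushes the True values to one end of each column, and flipping moves them to the top; the valid values are then scattered into those positions.

import numpy as np

mask = np.array([[False, False],
                 [False,  True],
                 [ True,  True]])
print(np.flip(np.sort(mask, axis=0), axis=0))
# [[ True  True]
#  [False  True]
#  [False False]]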
I would do it using built-in Python sorted in the following way:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 6],[5, 4]], columns=list('AB'))
df = df.where(df > 3, np.nan)
print(df)
Output:
A B
0 NaN NaN
1 NaN 6.0
2 5.0 4.0
Then just do:
for col in df.columns:
    df[col] = sorted(df[col], key=pd.isnull)
print(df)
Output:
A B
0 5.0 6.0
1 NaN 4.0
2 NaN NaN
I harness the fact that built-in sorted is stable (note that I slightly changed the input, putting 6 before 4, to show that). The isnull function produces False for all non-NaN values, which is treated as 0 during sorting, and True for the rest, which is treated as 1.
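For reference, a loop-free sketch of the same idea using dropna and concat (assuming the df defined above; not benchmarked against the other answers):

shifted = pd.concat(
    [df[col].dropna().reset_index(drop=True) for col in df.columns],
    axis=1
).reindex(df.index)
print(shifted)

     A    B
0  5.0  6.0
1  NaN  4.0
2  NaN  NaN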
I have a dataframe similar to
df = pd.DataFrame({'A': [1, np.nan,2,3, np.nan,4], 'B': [np.nan, 1,np.nan,2, 3, np.nan]})
df
A B
0 1.0 NaN
1 NaN 1.0
2 2.0 NaN
3 3.0 2.0
4 NaN 3.0
5 4.0 NaN
How do I count the number of rows where A is np.nan but B is not, where A is not np.nan but B is, and where both A and B are not np.nan?
I tried df.groupby(['A', 'B']).count() but it doesn't read the rows with np.nan.
Using
df.isnull().groupby(['A','B']).size()
Out[541]:
A      B
False  False    1
       True     3
True   False    2
dtype: int64
You can use DataFrame.isna with crosstab to count the True values:
df1 = df.isna()
df2 = pd.crosstab(df1.A, df1.B)
print (df2)
B False True
A
False 1 3
True 2 0
For scalar:
print (df2.loc[False, False])
1
df2 = pd.crosstab(df1.A, df1.B).add_prefix('B_').rename(lambda x: 'A_' + str(x))
print (df2)
B B_False B_True
A
A_False 1 3
A_True 2 0
Then for scalar use indexing:
print (df2.loc['A_False', 'B_False'])
1
Another solution is to use DataFrame.dot with the column names, together with Series.replace and Series.value_counts:
df = pd.DataFrame({'A': [1, np.nan,2,3, np.nan,4, np.nan],
'B': [np.nan, 1,np.nan,2, 3, np.nan, np.nan]})
s = df.isna().dot(df.columns).replace({'':'no match'}).value_counts()
print (s)
B 3
A 2
no match 1
AB 1
dtype: int64
If we are dealing with two columns only, there's a very simple solution that involves assigning simple weights to columns A and B, then summing them.
v = df.isna().mul([1, 2]).sum(1).value_counts()
v.index = v.index.map({2: 'only B', 1: 'only A', 0: 'neither'})
v
only B 3
only A 2
neither 1
dtype: int64
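If rows where both A and B are NaN can occur, the same weighting extends by mapping 3 as well (a small sketch, same weights as above):

v = df.isna().mul([1, 2]).sum(1).value_counts()
v.index = v.index.map({0: 'neither', 1: 'only A', 2: 'only B', 3: 'both'})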
Another alternative, using pivot_table and stack:
df.isna().pivot_table(index='A', columns='B', aggfunc='size').stack()
A      B
False  False    1.0
       True     3.0
True   False    2.0
dtype: float64
I think you need:
df = pd.DataFrame({'A': [1, np.nan,2,3, np.nan,4], 'B': [np.nan, 1,np.nan,2, 3, np.nan]})
count1 = len(df[(~df['A'].isnull()) & (df['B'].isnull())])
count2 = len(df[(~df['A'].isnull()) & (~df['B'].isnull())])
count3 = len(df[(df['A'].isnull()) & (~df['B'].isnull())])
print(count1, count2, count3)
Output:
3 1 2
To get rows where either A or B is null, we can do:
bool_df = df.isnull()
df[bool_df['A'] ^ bool_df['B']].shape[0]
To get rows where both are null values:
df[bool_df['A'] & bool_df['B']].shape[0]
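A quick check against the six-row df from the question:

bool_df = df.isnull()
print(df[bool_df['A'] ^ bool_df['B']].shape[0])  # 5 rows where exactly one of A, B is null
print(df[bool_df['A'] & bool_df['B']].shape[0])  # 0 rows where both are null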
I used to drop the rows which have at least one cell with a NaN value with this command:
pos_data = df.iloc[:,[5,6,2]].dropna()
Now I want to know how I can keep the rows with NaN and remove all the other rows which do not have NaN in any of their columns.
My data is a pandas dataframe.
Thanks.
Use boolean indexing: find all rows that have at least one NaN in the selected columns and use the mask to filter.
df[df.iloc[:, [5, 6, 2]].isna().any(1)]
The De Morgan equivalent of this is:
df[~df.iloc[:, [5, 6, 2]].notna().all(1)]
df = pd.DataFrame({'A': ['x', 'x', np.nan, np.nan], 'B': ['y', np.nan, 'y', 'y'], 'C': list('zzz') + [np.nan]})
df
A B C
0 x y z
1 x NaN z
2 NaN y z
3 NaN y NaN
If we're only considering columns "A" and "C", then our solution will look like
df[['A', 'C']]
A C
0 x z
1 x z
2 NaN z
3 NaN NaN
# Check which cells are NaN
df[['A', 'C']].isna()
A C
0 False False
1 False False
2 True False
3 True True
# Use `any` along the first axis to perform a logical OR across columns
df[['A', 'C']].isna().any(axis=1)
0 False
1 False
2 True
3 True
dtype: bool
# Now, we filter
df[df[['A', 'C']].isna().any(axis=1)]
A B C
2 NaN y z
3 NaN y NaN
As mentioned, the inverse of this is using notna + all(axis=1):
df[['A', 'C']].notna().all(1)
0 True
1 True
2 False
3 False
dtype: bool
# You'll notice this is the logical inverse of what we need,
# so we invert using bitwise NOT `~` operator
~df[['A', 'C']].notna().all(1)
0 False
1 False
2 True
3 True
dtype: bool
This should remove all rows that do not have at least 1 NaN value:
df[df.isna().any(axis=1)]
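With the example df built a few lines above, this keeps rows 1, 2 and 3, each of which has at least one NaN:

df[df.isna().any(axis=1)]

     A    B    C
1    x  NaN    z
2  NaN    y    z
3  NaN    y  NaN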