Drop all rows that *do not* contain any NaNs in their columns - python

I use to drop the rows which has one cell with NAN value with this command:
pos_data = df.iloc[:,[5,6,2]].dropna()
No I want to know how can I keep the rows with NAN and remove all other rows which do not have NAN in one of their columns.
my data is Pandas dataframe.
Thanks.

Use boolean indexing, find all columns that have at least one NaN in their rows and use the mask to filter.
df[df.iloc[:, [5, 6, 2]].isna().any(1)]
The DeMorgan equivalent of this is:
df[~df.iloc[:, [5, 6, 2]].notna().all(1)]
df = pd.DataFrame({'A': ['x', 'x', np.nan, np.nan], 'B': ['y', np.nan, 'y', 'y'], 'C': list('zzz') + [np.nan]})
df
A B C
0 x y z
1 x NaN z
2 NaN y z
3 NaN y NaN
If we're only considering columns "A" and "C", then our solution will look like
df[['A', 'C']]
A C
0 x z
1 x z
2 NaN z
3 NaN NaN
# Check which cells are NaN
df[['A', 'C']].isna()
A C
0 False False
1 False False
2 True False
3 True True
# Use `any` along the first axis to perform a logical OR across columns
df[['A', 'C']].isna().any(axis=1)
0 False
1 False
2 True
3 True
dtype: bool
# Now, we filter
df[df[['A', 'C']].isna().any(axis=1)]
A B C
2 NaN y z
3 NaN y NaN
As mentioned, the inverse of this is using notna + all(axis=1):
df[['A', 'C']].notna().all(1)
0 True
1 True
2 False
3 False
dtype: bool
# You'll notice this is the logical inverse of what we need,
# so we invert using bitwise NOT `~` operator
~df[['A', 'C']].notna().all(1)
0 False
1 False
2 True
3 True
dtype: bool

This should remove all rows that do no have at least 1 na value:
df[df.isna().any(axis=1)]

Related

Is there a way to fill NaNs at the end of a column with values from another column at the beginning in Python?

So I have two columns for example A & B and they look like this:
A B
1 4
2 5
3 6
NaN NaN
NaN NaN
NaN NaN
and I want it like this:
A
1
2
3
4
5
6
Any ideas?
I'm assuming your data is in two columns in a DataFrame, you can append B values to the end of A values, then drop the NA values with np.nan != np.nan trick. Here's an example
import pandas as pd
import numpy as np
d = {
'A': [1,2,3, np.nan, np.nan, np.nan],
'B': [4,5,6, np.nan, np.nan, np.nan]
}
df = pd.DataFrame(d)
>>> df
A B
1 4
2 5
3 6
NaN NaN
NaN NaN
NaN NaN
# np.nan == np.nan trick
>>> df['A'] == df['A']
0 True
1 True
2 True
3 False
4 False
5 False
Name: A, dtype: bool
x = pd.concat([df['A'], df['B']])
>>> x
0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
5 NaN
0 4.0
1 5.0
2 6.0
3 NaN
4 NaN
5 NaN
dtype: float64
x = x[x == x]
>>> x
A
1
2
3
4
5
6
Using numpy, it could be something like:
import numpy as np
A = np.array([1, 2, 3, np.nan, np.nan, np.nan])
B = np.array([4, 5, 6, np.nan, np.nan, np.nan])
C = np.hstack([A[A < np.infty], B[B < np.infty]])
print(C) # [1. 2. 3. 4. 5. 6.]
What you might want is:
import pandas as pd
a = pd.Series([1, 2, 3, None, None, None])
b = pd.Series([4, 5, 6, None, None, None])
print(pd.concat([a.iloc[:3], b.iloc[:3]]))
And if you are just looking for non-NaN values feel free to use .dropna() in Series.

Compare two columns with NaNs in Pandas and get differences

I have a following dataframe:
case c1 c2
1 x x
2 NaN y
3 x NaN
4 y x
5 NaN NaN
I would like to get a column "match" which will show which records with values in "c1" and "c2" are equal or different:
case c1 c2 match
1 x x True
2 NaN y False
3 x NaN False
4 y x False
5 NaN NaN True
I tried the following based on another Stack Overflow question: Comparing two columns and keeping NaNs
However, I can't get both cases 4 and 5 correct.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'case': [1, 2, 3, 4, 5],
'c1': ['x', np.nan,'x','y', np.nan],
'c2': ['x', 'y',np.nan,'x', np.nan],
})
cond1 = df['c1'] == df['c2']
cond2 = (df['c1'].isnull()) == (df['c2'].isnull())
df['c3'] = np.select([cond1, cond2], [True, True], False)
df
Use eq with isna:
df.c1.eq(df.c2)|df.iloc[:, 1:].isna().all(1)
#or
df.c1.eq(df.c2)|df.loc[:, ['c1','c2']].isna().all(1)
import pandas as pd
import numpy as np
df = pd.DataFrame({
'case': [1, 2, 3, 4, 5],
'c1': ['x', np.nan,'x','y', np.nan],
'c2': ['x', 'y',np.nan,'x', np.nan],
})
df['c3'] = df.apply(lambda row: True if str(row.c1) == str(row.c2) else False, axis=1)
print(df)
Output
case c1 c2 c3
0 1 x x True
1 2 NaN y False
2 3 x NaN False
3 4 y x False
4 5 NaN NaN True
Use nuquine with fillna
import numpy as np
df.fillna(np.inf)[['c1','c2']].nunique(1) < 2
Or nunique with option dropna=False
df[['c1','c2']].nunique(1, dropna=False) < 2
Out[13]:
0 True
1 False
2 False
3 False
4 True
dtype: bool

Use a custom function to apply on a df column if a condition is satisfied

I have a DataFrame like
A B
1 2
2 -
5 -
4 5
I want to apply a function func() on column B (but the function gives an error if - is passed). I cannot modify the func() function. I need something like:
df['B']=df['B'].apply(func) only if value not equal to -
Use a custom function to apply on a df column if a condition is satisfied:
def func(a):
return a + 10
#new pandas dataframe with four rows and 2 columns. 3rd row having a nan
df = pd.DataFrame([[1, 2], [3, 4], [5, pd.np.nan], [7, 8]], columns=["A", "B"])
print(df)
#coerce column named B to numeric
s = pd.to_numeric(df['B'], errors='coerce')
#a mask has true for numeric rows, false for non numeric rows
mask = s.notna()
#mask
print(mask)
#run function named func across the B column
df.loc[mask, 'B'] = s[mask].apply(func)
print(df)
Which prints:
A B
0 1 2.0
1 3 4.0
2 5 NaN
3 7 8.0
0 True
1 True
2 False
3 True
A B
0 1 12.0
1 3 14.0
2 5 NaN
3 7 18.0
Try:
df['B'] = df[df['B']!='-']['B'].apply(func)
Or when the - is actaully nan you can use:
df['B'] = df[pd.notnull(df['B'])]['B'].apply(func)

Pandas count different combinations of 2 columns with nan

I have a dataframe similar to
df = pd.DataFrame({'A': [1, np.nan,2,3, np.nan,4], 'B': [np.nan, 1,np.nan,2, 3, np.nan]})
df
A B
0 1.0 NaN
1 NaN 1.0
2 2.0 NaN
3 3.0 2.0
4 NaN 3.0
5 4.0 NaN
How do I count the number of occurrences of A is np.nan but B not np.nan, A not np.nan but B is np.nan, and A and B both not np.nan?
I tried df.groupby(['A', 'B']).count() but it doesn't read the rows with np.nan.
Using
df.isnull().groupby(['A','B']).size()
Out[541]:
A B
False False 1
True 3
True False 2
dtype: int64
You can use DataFrame.isna with crosstab for count Trues values:
df1 = df.isna()
df2 = pd.crosstab(df1.A, df1.B)
print (df2)
B False True
A
False 1 3
True 2 0
For scalar:
print (df2.loc[False, False])
1
df2 = pd.crosstab(df1.A, df1.B).add_prefix('B_').rename(lambda x: 'A_' + str(x))
print (df2)
B B_False B_True
A
A_False 1 3
A_True 2 0
Then for scalar use indexing:
print (df2.loc['A_False', 'B_False'])
1
Another solution is use DataFrame.dot by columns names with Series.replace and Series.value_counts:
df = pd.DataFrame({'A': [1, np.nan,2,3, np.nan,4, np.nan],
'B': [np.nan, 1,np.nan,2, 3, np.nan, np.nan]})
s = df.isna().dot(df.columns).replace({'':'no match'}).value_counts()
print (s)
B 3
A 2
no match 1
AB 1
dtype: int64
If we are dealing with two columns only, there's a very simple solution that involves assigning simple weights to columns A and B, then summing them.
v = df.isna().mul([1, 2]).sum(1).value_counts()
v.index = v.index.map({2: 'only B', 1: 'only A', 0: 'neither'})
v
only B 3
only A 2
neither 1
dtype: int64
Another alternative with pivot_table and stack can be achieved by,
df.isna().pivot_table(index='A', columns='B', aggfunc='size').stack()
A B
False False 1.0
True 3.0
True False 2.0
dtype: float64
I think you need:
df = pd.DataFrame({'A': [1, np.nan,2,3, np.nan,4], 'B': [np.nan, 1,np.nan,2, 3, np.nan]})
count1 = len(df[(~df['A'].isnull()) & (df['B'].isnull())])
count2 = len(df[(~df['A'].isnull()) & (~df['B'].isnull())])
count3 = len(df[(df['A'].isnull()) & (~df['B'].isnull())])
print(count1, count2, count3)
Output:
3 1 2
To get rows where either A or B is null, we can do:
bool_df = df.isnull()
df[bool_df['A'] ^ bool_df['B']].shape[0]
To get rows where both are null values:
df[bool_df['A'] & bool_df['B']].shape[0]

pandas - filter on groups which have at least one column containing non-null values in a groupby

I have the following python pandas dataframe:
df = pd.DataFrame({'Id': ['1', '1', '1', '2', '2', '3'], 'A': ['TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE'], 'B': [np.nan, np.nan, 'abc', np.nan, np.nan, 'def'],'C': [np.nan, np.nan, np.nan, np.nan, np.nan, '456']})
>>> print(df)
Id A B C
0 1 TRUE NaN NaN
1 1 TRUE NaN NaN
2 1 TRUE abc NaN
3 2 TRUE NaN NaN
4 2 TRUE NaN NaN
5 3 FALSE def 456
I want to end up with the following dataframe:
>>> print(dfout)
Id A B C
0 1 TRUE abc NaN
The same Id value can appear on multiple rows. Each Id will either have the value TRUE or FALSE in column A consistently on all its rows. Columns B and C can have any value, including NaN.
I want one row in dfout for each Id that has A=TRUE and show the max value seen in columns B and C. But if the only values seen in columns B and C = NaN for all of an Id's rows, then that Id is to be excluded from dfout.
Id 1 has A=TRUE, and has B=abc in its third row, so it meets
the requirements.
Id 2 has A=TRUE, but columns B and C are NaN for
both its rows, so it does not.
Id 3 has A=FALSE, so it does not
meet requirements.
I created a groupby df on Id, then applied a mask to only include rows with A=TRUE. But having trouble understanding how to remove the rows with NaN for all rows in columns B and C.
grouped = df.groupby(['Id'])
mask = grouped['A'].transform(lambda x: 'TRUE' == x.max()).astype(bool)
df.loc[mask].reset_index(drop=True)
Id A B C
0 1 TRUE NaN NaN
1 1 TRUE NaN NaN
2 1 TRUE abc NaN
3 2 TRUE NaN NaN
4 2 TRUE NaN NaN
Then I tried several things along the lines of:
df.loc[mask].reset_index(drop=True).all(['B'],['C']).isnull
But getting errors, like:
" TypeError: unhashable type: 'list' ".
Using python 3.6, pandas 0.23.0; Looked here for help: keep dataframe rows meeting a condition into each group of the same dataframe grouped by
The solution has three parts to it.
Filter dataframe to keep rows where column A is True
Groupby Id and use first which will return first not null value
Use dropna on the resulting dataframe on columns B and C with how = 'all'
df.loc[df['A'] == True].groupby('Id', as_index = False).first().dropna(subset = ['B', 'C'], how = 'all')
Id A B C
0 1 True abc NaN

Categories

Resources