Pandas count different combinations of 2 columns with nan - python

I have a dataframe similar to
df = pd.DataFrame({'A': [1, np.nan,2,3, np.nan,4], 'B': [np.nan, 1,np.nan,2, 3, np.nan]})
df
     A    B
0  1.0  NaN
1  NaN  1.0
2  2.0  NaN
3  3.0  2.0
4  NaN  3.0
5  4.0  NaN
How do I count the number of rows where A is np.nan but B is not, where A is not np.nan but B is, and where both A and B are not np.nan?
I tried df.groupby(['A', 'B']).count() but it does not include the rows with np.nan.

Using
df.isnull().groupby(['A','B']).size()
Out[541]:
A      B
False  False    1
       True     3
True   False    2
dtype: int64
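This works because grouping the boolean mask by its own columns makes every (A is NaN, B is NaN) combination one group, and size() counts the rows in each group. A minimal, self-contained sketch of the same idea (the variable name counts is mine):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 2, 3, np.nan, 4],
                   'B': [np.nan, 1, np.nan, 2, 3, np.nan]})

# Group the NaN mask by its own columns and count rows per combination
counts = df.isnull().groupby(['A', 'B']).size()
print(counts)
# A      B
# False  False    1
#        True     3
# True   False    2
# dtype: int64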

You can use DataFrame.isna with crosstab to count the True values:
df1 = df.isna()
df2 = pd.crosstab(df1.A, df1.B)
print (df2)
B      False  True
A
False      1     3
True       2     0
For scalar:
print (df2.loc[False, False])
1
df2 = pd.crosstab(df1.A, df1.B).add_prefix('B_').rename(lambda x: 'A_' + str(x))
print (df2)
B        B_False  B_True
A
A_False        1       3
A_True         2       0
Then select the scalar with label-based indexing:
print (df2.loc['A_False', 'B_False'])
1
Another solution is to use DataFrame.dot with the column names, together with Series.replace and Series.value_counts:
df = pd.DataFrame({'A': [1, np.nan,2,3, np.nan,4, np.nan],
'B': [np.nan, 1,np.nan,2, 3, np.nan, np.nan]})
s = df.isna().dot(df.columns).replace({'':'no match'}).value_counts()
print (s)
B 3
A 2
no match 1
AB 1
dtype: int64
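If more descriptive labels are wanted, the concatenated column names produced by dot can be remapped afterwards; a small sketch (the label text is my own):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 2, 3, np.nan, 4, np.nan],
                   'B': [np.nan, 1, np.nan, 2, 3, np.nan, np.nan]})

# dot concatenates the names of the NaN columns per row: '', 'A', 'B' or 'AB'
s = df.isna().dot(df.columns).value_counts()
s = s.rename({'': 'no NaN', 'A': 'only A NaN', 'B': 'only B NaN', 'AB': 'both NaN'})
print(s)
# only B NaN: 3, only A NaN: 2, no NaN: 1, both NaN: 1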

If we are dealing with two columns only, there's a very simple solution that involves assigning simple weights to columns A and B, then summing them.
v = df.isna().mul([1, 2]).sum(axis=1).value_counts()
v.index = v.index.map({2: 'only B', 1: 'only A', 0: 'neither'})
v
only B 3
only A 2
neither 1
dtype: int64
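The weighted sum is effectively a two-bit code for the NaN pattern, so the same trick extends naturally to the case where both columns are NaN. A sketch with the extra code 3 added (the labels are my own):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 2, 3, np.nan, 4, np.nan],
                   'B': [np.nan, 1, np.nan, 2, 3, np.nan, np.nan]})

# Weight A's NaN flag by 1 and B's by 2 so every combination gets a unique code:
# 0 = neither NaN, 1 = only A NaN, 2 = only B NaN, 3 = both NaN
codes = df.isna().mul([1, 2]).sum(axis=1)
v = codes.value_counts()
v.index = v.index.map({0: 'neither', 1: 'only A', 2: 'only B', 3: 'both'})
print(v)
# only B: 3, only A: 2, neither: 1, both: 1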
Another alternative can be achieved with pivot_table and stack:
df.isna().pivot_table(index='A', columns='B', aggfunc='size').stack()
A      B
False  False    1.0
       True     3.0
True   False    2.0
dtype: float64

I think you need:
df = pd.DataFrame({'A': [1, np.nan,2,3, np.nan,4], 'B': [np.nan, 1,np.nan,2, 3, np.nan]})
count1 = len(df[(~df['A'].isnull()) & (df['B'].isnull())])
count2 = len(df[(~df['A'].isnull()) & (~df['B'].isnull())])
count3 = len(df[(df['A'].isnull()) & (~df['B'].isnull())])
print(count1, count2, count3)
Output:
3 1 2

To get rows where either A or B is null, we can do:
bool_df = df.isnull()
df[bool_df['A'] ^ bool_df['B']].shape[0]
To get rows where both are null values:
df[bool_df['A'] & bool_df['B']].shape[0]
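For completeness, a runnable sketch of this boolean-indexing approach against the example frame, counting each case (the variable names are mine):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 2, 3, np.nan, 4],
                   'B': [np.nan, 1, np.nan, 2, 3, np.nan]})

bool_df = df.isnull()
only_one_null = (bool_df['A'] ^ bool_df['B']).sum()   # exactly one of A/B is NaN
both_null = (bool_df['A'] & bool_df['B']).sum()       # both A and B are NaN
neither_null = (~bool_df['A'] & ~bool_df['B']).sum()  # neither is NaN
print(only_one_null, both_null, neither_null)  # 5 0 1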

Related

Is there a way to fill NaNs at the end of a column with values from another column at the beginning in Python?

So I have two columns for example A & B and they look like this:
A B
1 4
2 5
3 6
NaN NaN
NaN NaN
NaN NaN
and I want it like this:
A
1
2
3
4
5
6
Any ideas?
I'm assuming your data is in two columns of a DataFrame. You can append the B values to the end of the A values, then drop the NA values using the np.nan != np.nan trick. Here's an example:
import pandas as pd
import numpy as np
d = {
'A': [1,2,3, np.nan, np.nan, np.nan],
'B': [4,5,6, np.nan, np.nan, np.nan]
}
df = pd.DataFrame(d)
>>> df
     A    B
0  1.0  4.0
1  2.0  5.0
2  3.0  6.0
3  NaN  NaN
4  NaN  NaN
5  NaN  NaN
# np.nan != np.nan, so x == x is False only where x is NaN
>>> df['A'] == df['A']
0 True
1 True
2 True
3 False
4 False
5 False
Name: A, dtype: bool
x = pd.concat([df['A'], df['B']])
>>> x
0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
5 NaN
0 4.0
1 5.0
2 6.0
3 NaN
4 NaN
5 NaN
dtype: float64
x = x[x == x]
>>> x
0    1.0
1    2.0
2    3.0
0    4.0
1    5.0
2    6.0
dtype: float64
Using numpy, it could be something like:
import numpy as np
A = np.array([1, 2, 3, np.nan, np.nan, np.nan])
B = np.array([4, 5, 6, np.nan, np.nan, np.nan])
C = np.hstack([A[A < np.inf], B[B < np.inf]])
print(C) # [1. 2. 3. 4. 5. 6.]
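Filtering with np.isnan expresses the same idea more directly; a small sketch:
import numpy as np

A = np.array([1, 2, 3, np.nan, np.nan, np.nan])
B = np.array([4, 5, 6, np.nan, np.nan, np.nan])

# Keep only the non-NaN entries of each array before stacking them
C = np.hstack([A[~np.isnan(A)], B[~np.isnan(B)]])
print(C)  # [1. 2. 3. 4. 5. 6.]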
What you might want is:
import pandas as pd
a = pd.Series([1, 2, 3, None, None, None])
b = pd.Series([4, 5, 6, None, None, None])
print(pd.concat([a.iloc[:3], b.iloc[:3]]))
And if you are just looking for non-NaN values, feel free to use .dropna() on the Series.
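Putting those pieces together, the whole operation fits in one chained expression; reset_index just renumbers the result and is my addition:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, np.nan, np.nan, np.nan],
                   'B': [4, 5, 6, np.nan, np.nan, np.nan]})

# Stack B under A, drop the NaNs, and renumber the index from 0
out = pd.concat([df['A'], df['B']]).dropna().reset_index(drop=True).rename('A')
print(out)
# 0    1.0
# 1    2.0
# 2    3.0
# 3    4.0
# 4    5.0
# 5    6.0
# Name: A, dtype: float64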

Use a custom function to apply on a df column if a condition is satisfied

I have a DataFrame like
A B
1 2
2 -
5 -
4 5
I want to apply a function func() on column B (but the function gives an error if '-' is passed). I cannot modify the func() function. I need something like:
df['B'] = df['B'].apply(func) only if the value is not equal to '-'
To apply a custom function on a df column only where a condition is satisfied:
import pandas as pd
import numpy as np

def func(a):
    return a + 10

# new pandas DataFrame with four rows and 2 columns, 3rd row having a NaN
df = pd.DataFrame([[1, 2], [3, 4], [5, np.nan], [7, 8]], columns=["A", "B"])
print(df)
# coerce column B to numeric; non-numeric values become NaN
s = pd.to_numeric(df['B'], errors='coerce')
# the mask is True for numeric rows, False for non-numeric rows
mask = s.notna()
print(mask)
# run func across the numeric part of column B only
df.loc[mask, 'B'] = s[mask].apply(func)
print(df)
Which prints:
A B
0 1 2.0
1 3 4.0
2 5 NaN
3 7 8.0
0 True
1 True
2 False
3 True
A B
0 1 12.0
1 3 14.0
2 5 NaN
3 7 18.0
Try:
df['B'] = df[df['B'] != '-']['B'].apply(func)
Or, when the '-' is actually NaN, you can use:
df['B'] = df[pd.notnull(df['B'])]['B'].apply(func)
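If func itself cannot be modified, another option is to wrap it at the call site so the placeholder passes through untouched; a sketch against the example data (the lambda wrapper is my own):
import pandas as pd

def func(a):
    return a + 10  # stand-in for the real function, which fails on '-'

df = pd.DataFrame({'A': [1, 2, 5, 4], 'B': [2, '-', '-', 5]})

# Apply func only where B is not the placeholder; '-' rows are left unchanged
df['B'] = df['B'].apply(lambda v: v if v == '-' else func(v))
print(df)
#    A   B
# 0  1  12
# 1  2   -
# 2  5   -
# 3  4  15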

Drop all rows that *do not* contain any NaNs in their columns

I used to drop the rows that have at least one cell with a NaN value with this command:
pos_data = df.iloc[:,[5,6,2]].dropna()
Now I want to know how I can keep the rows with NaN and remove all other rows, i.e. those that do not have a NaN in any of their columns.
My data is a pandas DataFrame.
Thanks.
Use boolean indexing: find all rows that have at least one NaN in the selected columns and use that mask to filter.
df[df.iloc[:, [5, 6, 2]].isna().any(axis=1)]
The De Morgan equivalent of this is:
df[~df.iloc[:, [5, 6, 2]].notna().all(axis=1)]
df = pd.DataFrame({'A': ['x', 'x', np.nan, np.nan], 'B': ['y', np.nan, 'y', 'y'], 'C': list('zzz') + [np.nan]})
df
A B C
0 x y z
1 x NaN z
2 NaN y z
3 NaN y NaN
If we're only considering columns "A" and "C", then our solution will look like
df[['A', 'C']]
A C
0 x z
1 x z
2 NaN z
3 NaN NaN
# Check which cells are NaN
df[['A', 'C']].isna()
A C
0 False False
1 False False
2 True False
3 True True
# Use `any` along the first axis to perform a logical OR across columns
df[['A', 'C']].isna().any(axis=1)
0 False
1 False
2 True
3 True
dtype: bool
# Now, we filter
df[df[['A', 'C']].isna().any(axis=1)]
A B C
2 NaN y z
3 NaN y NaN
As mentioned, the inverse of this is using notna + all(axis=1):
df[['A', 'C']].notna().all(axis=1)
0 True
1 True
2 False
3 False
dtype: bool
# You'll notice this is the logical inverse of what we need,
# so we invert using bitwise NOT `~` operator
~df[['A', 'C']].notna().all(axis=1)
0 False
1 False
2 True
3 True
dtype: bool
This keeps only the rows that have at least one NaN value, removing all rows without any NaN:
df[df.isna().any(axis=1)]
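An equivalent way to say "keep only the rows dropna would remove" is to take the index difference against dropna; a sketch using the same example columns (the subset list is an assumption):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', np.nan, np.nan],
                   'B': ['y', np.nan, 'y', 'y'],
                   'C': ['z', 'z', 'z', np.nan]})

# dropna(subset=...) keeps the fully non-NaN rows; everything else
# is a row with at least one NaN in the chosen columns
keep = df.index.difference(df.dropna(subset=['A', 'C']).index)
print(df.loc[keep])
#      A  B    C
# 2  NaN  y    z
# 3  NaN  y  NaN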

Pandas improvement

I currently have a pandas DataFrame in which I'm performing comparisons between columns. I found a case where both compared columns are empty, and the comparison then falls through to the else value. I added an extra statement to reset those rows to an empty string. I'm looking to see if I can simplify this into a single statement.
df['doc_type'].loc[(df['a_id'].isnull() & df['b_id'].isnull())] = ''
Code
df = pd.DataFrame({
'a_id': ['A', 'B', 'C', 'D', '', 'F', ''],
'a_score': [1, 2, 3, 4, '', 6, ''],
'b_id': ['a', 'b', 'c', 'd', 'e', 'f', ''],
'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, ''],
})
print(df)
# Replace empty string with NaN
df = df.apply(lambda x: x.str.strip() if isinstance(x, str) else x).replace('', np.nan)
# Calculate higher score
df['doc_id'] = df.apply(lambda df: df['a_id'] if df['a_score'] >= df['b_score'] else df['b_id'], axis=1)
# Select type based on higher score
df['doc_type'] = df.apply(lambda df: 'a' if df['a_score'] >= df['b_score'] else 'b', axis=1)
print(df)
# Update type when both ids are empty
df['doc_type'].loc[(df['a_id'].isnull() & df['b_id'].isnull())] = ''
print(df)
You can use numpy.where instead of apply. Also, for setting values selected by boolean indexing on a column, it is better to use:
df.loc[mask, 'colname'] = val
# Replace empty string with NaN
df = df.apply(lambda x: x.str.strip() if isinstance(x, str) else x).replace('', np.nan)
# Calculate higher score
df['doc_id'] = np.where(df['a_score'] >= df['b_score'], df['a_id'], df['b_id'])
# Select type based on higher score
df['doc_type'] = np.where(df['a_score'] >= df['b_score'], 'a', 'b')
print (df)
# Update type when both ids are empty
df.loc[(df['a_id'].isnull() & df['b_id'].isnull()), 'doc_type'] = ''
print (df)
a_id a_score b_id b_score doc_id doc_type
0 A 1.0 a 0.10 A a
1 B 2.0 b 0.20 B a
2 C 3.0 c 3.10 c b
3 D 4.0 d 4.10 d b
4 NaN NaN e 5.00 e b
5 F 6.0 f 5.99 F a
6 NaN NaN NaN NaN NaN
An alternative mask uses DataFrame.all to check whether all values in a row are True (axis=1):
print (df[['a_id', 'b_id']].isnull())
a_id b_id
0 False False
1 False False
2 False False
3 False False
4 True False
5 False False
6 True True
print (df[['a_id', 'b_id']].isnull().all(axis=1))
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
df.loc[df[['a_id', 'b_id']].isnull().all(axis=1), 'doc_type'] = ''
print (df)
a_id a_score b_id b_score doc_id doc_type
0 A 1.0 a 0.10 A a
1 B 2.0 b 0.20 B a
2 C 3.0 c 3.10 c b
3 D 4.0 d 4.10 d b
4 NaN NaN e 5.00 e b
5 F 6.0 f 5.99 F a
6 NaN NaN NaN NaN NaN
But it is better to use a double numpy.where:
# Replace empty string with NaN
df = df.apply(lambda x: x.str.strip() if isinstance(x, str) else x).replace('', np.nan)
# create the masks once as Series - avoid comparing twice
mask = df['a_score'] >= df['b_score']
mask1 = (df['a_id'].isnull() & df['b_id'].isnull())
# alternative solution for mask1
# mask1 = df[['a_id', 'b_id']].isnull().all(axis=1)
# Calculate higher score
df['doc_id'] = np.where(mask, df['a_id'], df['b_id'])
# Select type based on higher score
df['doc_type'] = np.where(mask, 'a', np.where(mask1, '', 'b'))
print (df)
a_id a_score b_id b_score doc_id doc_type
0 A 1.0 a 0.10 A a
1 B 2.0 b 0.20 B a
2 C 3.0 c 3.10 c b
3 D 4.0 d 4.10 d b
4 NaN NaN e 5.00 e b
5 F 6.0 f 5.99 F a
6 NaN NaN NaN NaN NaN
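The same three-way choice can also be written with a single numpy.select, which evaluates its conditions in order, so the both-empty case takes precedence over the score comparison; a sketch under that assumption:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a_id': ['A', 'B', 'C', 'D', np.nan, 'F', np.nan],
    'a_score': [1, 2, 3, 4, np.nan, 6, np.nan],
    'b_id': ['a', 'b', 'c', 'd', 'e', 'f', np.nan],
    'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, np.nan],
})

both_empty = df['a_id'].isnull() & df['b_id'].isnull()
a_wins = df['a_score'] >= df['b_score']

df['doc_id'] = np.where(a_wins, df['a_id'], df['b_id'])
# conditions are checked in order: empty rows get '', then the score comparison decides
df['doc_type'] = np.select([both_empty, a_wins], ['', 'a'], default='b')
print(df)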

Make sure Column B = a certain value when Column A is Null - Python

I want to make sure that when Column A is NULL (in csv), or NaN (in dataframe), Column B is "Cash".
I've tried this:
check = df[df['A'].isnull()]['B']
check = check.to_string(index=False)
if "Cash" not in check:
    print("Column A Fail")
else:
    print("Column A Pass!")
But it is not working.
Any suggestions?
I also need to make sure that it doesn't treat '0' as NaN
UPDATE:
my goal is not to assign 'Cash', but rather to make sure that it's
already there as a quality check
In [40]: df
Out[40]:
A B
0 NaN a
1 1.0 b
2 2.0 c
3 NaN Cash
In [41]: df.query("A != A and B != 'Cash'")
Out[41]:
A B
0 NaN a
or using boolean indexing:
In [42]: df.loc[df.A.isnull() & (df.B != 'Cash')]
Out[42]:
A B
0 NaN a
OLD answer:
Alternative solution:
In [23]: df.B = np.where(df.A.isnull(), 'Cash', df.B)
In [24]: df
Out[24]:
A B
0 NaN Cash
1 1.0 b
2 2.0 c
3 NaN Cash
another solution:
In [31]: df = df.mask(df.A.isnull(), df.assign(B='Cash'))
In [32]: df
Out[32]:
A B
0 NaN Cash
1 1.0 b
2 2.0 c
3 NaN Cash
Use loc to assign where A is null.
df.loc[df['A'].isnull(), 'B'] = 'Cash'
example
df = pd.DataFrame(dict(
A=[np.nan, 1, 2, np.nan],
B=['a', 'b', 'c', 'd']
))
print(df)
A B
0 NaN a
1 1.0 b
2 2.0 c
3 NaN d
Then do
df.loc[df['A'].isnull(), 'B'] = 'Cash'
print(df)
A B
0 NaN Cash
1 1.0 b
2 2.0 c
3 NaN Cash
Check if all B are 'Cash' where A is null:
(df.loc[df.A.isnull(), 'B'] == 'Cash').all()
According to logic rules, P=>Q is (not P) or Q. So
(~df.A.isnull()|(df.B=="Cash")).all()
checks all the rows.
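As a quality check, this condenses to a single assertion; a minimal sketch (the example frame is my own):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 1.0, 2.0, np.nan],
                   'B': ['Cash', 'b', 'c', 'Cash']})

# "A is NaN implies B == 'Cash'" is equivalent to "(A is not NaN) or (B == 'Cash')"
ok = (df['A'].notnull() | (df['B'] == 'Cash')).all()
print('Column A Pass!' if ok else 'Column A Fail')  # Column A Pass!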
