Count number of rows with NaN in a pandas DataFrame?

I have the following working code:
import datetime as dt
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
my_funds = [1, 2, 5, 7, 9, 11]
my_time = ['2020-01', '2019-12', '2019-11', '2019-10', '2019-09', '2019-08']
df = pd.DataFrame({'TIME': my_time, 'FUNDS':my_funds})
for x in range(2, 3):
    df.insert(len(df.columns), f'x**{x}', df["FUNDS"]**x)
df = df.replace([1, 7, 9, 25],float('nan'))
print(df.isnull().values.ravel().sum()) #5 (obviously counting NaNs in total)
print(sum(map(any, df.isnull()))) #3 (I guess counting the NaNs in the left column)
I want to count the rows that contain one or more NaN, which in my case is 4 (rows [0, 2, 3, 4]).

Use:
print (df.isna().any(axis=1).sum())
4
Explanation: first, flag the missing values with DataFrame.isna:
print (df.isna())
    TIME  FUNDS   x**2
0  False   True   True
1  False  False  False
2  False  False   True
3  False   True  False
4  False   True  False
5  False  False  False
Then test whether at least one value per row is True with DataFrame.any(axis=1):
print (df.isna().any(axis=1))
0 True
1 False
2 True
3 True
4 True
5 False
dtype: bool
Finally, count the True values with sum (each True counts as 1).
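The three steps above can be put together in a self-contained sketch. The frame is rebuilt from the values in the question with the NaNs already in place; the names `mask`, `rows_with_nan`, and `n_rows_with_nan` are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question, with NaN already in place.
df = pd.DataFrame({
    "TIME": ["2020-01", "2019-12", "2019-11", "2019-10", "2019-09", "2019-08"],
    "FUNDS": [np.nan, 2, 5, np.nan, np.nan, 11],
    "x**2": [np.nan, 4, np.nan, 49, 81, 121],
})

mask = df.isna()                             # element-wise True where a value is missing
rows_with_nan = mask.any(axis=1)             # one bool per row: True if any column is NaN
n_rows_with_nan = int(rows_with_nan.sum())   # True counts as 1

print(n_rows_with_nan)                       # 4
print(df.index[rows_with_nan].tolist())      # [0, 2, 3, 4]
print(int(mask.to_numpy().sum()))            # 5, the total number of NaN cells
```

Keeping `rows_with_nan` around is handy because the same boolean Series can also select the affected rows, for example `df[rows_with_nan]`.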

Another option:
nan_rows = len(df[df["FUNDS"].isna() | df["x**2"].isna()])

Another option uses Series.clip: count the NaNs per row, cap each count at 1 so that a row with several NaNs is counted only once, then sum:
df.isna().sum(axis=1).clip(upper=1).sum()
#4
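As a quick sanity check, the sketch below runs the approaches from the answers side by side on the question's data, plus a `dropna`-based count that is not in the original answers and is added here only for comparison. It also shows why the `sum(map(any, ...))` attempt in the question prints 3: iterating a DataFrame yields its column labels, not its rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "TIME": ["2020-01", "2019-12", "2019-11", "2019-10", "2019-09", "2019-08"],
    "FUNDS": [np.nan, 2, 5, np.nan, np.nan, 11],
    "x**2": [np.nan, 4, np.nan, 49, 81, 121],
})

by_any = int(df.isna().any(axis=1).sum())
by_columns = len(df[df["FUNDS"].isna() | df["x**2"].isna()])  # only checks the named columns
by_clip = int(df.isna().sum(axis=1).clip(upper=1).sum())
by_dropna = len(df) - len(df.dropna())  # rows that dropna() would discard

print(by_any, by_columns, by_clip, by_dropna)  # 4 4 4 4

# Iterating a DataFrame yields its column labels, so map(any, ...) tests the
# strings "TIME", "FUNDS" and "x**2"; every non-empty string is truthy.
print(list(df.isna()))            # ['TIME', 'FUNDS', 'x**2']
print(sum(map(any, df.isna())))   # 3, the number of columns
```

Note that the column-by-column variant has to be updated by hand whenever a column is added, while the `any(axis=1)` version covers every column automatically.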

Related

Pandas IF statement referencing other column value

I can't for the life of me find an example of this operation in pandas.... I am trying to write an IF statement saying IF my Check column is true then pull the value from my Proficiency_Value column, and if it's False, then default to 1000.
report_skills['Check'] = report_skills.apply(lambda x: x.Skill_Specialization in x.Specialization, axis=1)
report_skills = report_skills.loc[report_skills['Check'] == True, 'Proficiency_Value'] = 1000
Any ideas why this is not working? I'm sure this is an easy fix

Let's create a small example DataFrame like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Check': [True, False, False, True],
                   'Proficiency_Value': range(4)})
>>> df
Check Proficiency_Value
0 True 0
1 False 1
2 False 2
3 True 3
If you now use the np.where() function, you get the result you are asking for: the existing value is kept where Check is True, and 1000 is used otherwise.
df['Proficiency_Value'] = np.where(df['Check']==True, df['Proficiency_Value'], 1000)
>>> df
Check Proficiency_Value
0 True 0
1 False 1000
2 False 1000
3 True 3
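For comparison, here is a minimal sketch of two pandas-only ways to get the same result, using the example frame from this answer. The names `with_where` and `with_loc` are illustrative. As a side note, the chained assignment in the question (`report_skills = report_skills.loc[...] = 1000`) assigns to its targets from left to right, so it first rebinds `report_skills` itself to 1000, which is probably why it fails.

```python
import pandas as pd

df = pd.DataFrame({"Check": [True, False, False, True],
                   "Proficiency_Value": range(4)})

# Series.where keeps values where the condition holds and substitutes 1000 elsewhere.
with_where = df["Proficiency_Value"].where(df["Check"], 1000)

# Boolean indexing with .loc: assign in place to the rows where Check is False.
with_loc = df.copy()
with_loc.loc[~with_loc["Check"], "Proficiency_Value"] = 1000

print(with_where.tolist())                     # [0, 1000, 1000, 3]
print(with_loc["Proficiency_Value"].tolist())  # [0, 1000, 1000, 3]
```

The `.loc` form is a single assignment statement with no leading `report_skills =`, which is the shape the original attempt was aiming for.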

output for a value index in pandas, returns a boolean array

When I search for the index of a unique value in a pandas DataFrame, it works perfectly. But when the value is not unique, the output is an array of booleans.
Load the file:
import pandas as pd
df = pd.read_csv('test_file.csv')
print(df.head(10))
Let's say this is the file I am dealing with:
test
0 10
1 20
2 10
3 20
4 20
5 20
6 10
7 10
8 10
9 30
Now, when I try to get the index of a value that is not unique in the column:
output_index = df.set_index('test').index.get_loc(10)
print(output_index)
output:
[ True False True False False False True True True False]
But the same code works just fine for a unique value:
output_index = df.set_index('test').index.get_loc(30)
print(output_index)
output:
9
So what is the correct way to get the index (or indexes) of a value that occurs more than once in a DataFrame?

try this:
output_index = df[df['test'] == 10].index
if you need a list:
output_index = df[df['test'] == 10].index.to_list()

df.index holds the row labels; indexing it with a boolean mask returns the labels of the matching rows.
If the value is repeated in the DataFrame, this returns every matching label, like below:
df.index[df['test'].eq(10)].tolist()
Output:
[0, 2, 6, 7, 8]
If the value is unique, it will return:
df.index[df['test'].eq(30)].tolist()
Output:
[9]
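The lookup above can be wrapped in a small helper so that unique, repeated, and missing values all come back as a list. This is only a sketch: `labels_where` is a hypothetical name, and the frame is typed in by hand instead of being read from `test_file.csv`. For background, `Index.get_loc` returns an integer for a unique label but a slice or boolean mask for a repeated one, which is why the question saw two different kinds of output.

```python
import numpy as np
import pandas as pd

# Same values as the file in the question, entered directly.
df = pd.DataFrame({"test": [10, 20, 10, 20, 20, 20, 10, 10, 10, 30]})

def labels_where(frame, column, value):
    """Return the index labels of all rows where frame[column] equals value."""
    return frame.index[frame[column].eq(value)].tolist()

print(labels_where(df, "test", 10))  # [0, 2, 6, 7, 8]
print(labels_where(df, "test", 30))  # [9]
print(labels_where(df, "test", 99))  # []

# Integer positions instead of labels; the two only coincide for a default 0..n-1 index.
positions = np.flatnonzero(df["test"].to_numpy() == 10).tolist()
print(positions)                     # [0, 2, 6, 7, 8]
```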

Comparing two columns in pandas dataframe

I have a pandas dataframe where I would like to verify that column A is greater than column B (row wise). I am doing something like
tmp=df['B']-df['A']
if any([v for v in tmp if v > 0]):
    ....
I was wondering if there is a better (more concise) way of doing this, or whether pandas has a built-in routine for it.

df = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 1, 1]})
temp = df['B'] - df['A']
print(temp)
0 2
1 -1
2 -2
Now you can create a Boolean series using temp > 0:
print(temp > 0)
0 True
1 False
2 False
dtype: bool
This Boolean series can be passed to any, so you can use:
if any(temp > 0):
    print('juhu!')
Or simply (which avoids temp):
if any(df['B'] > df['A']):
    print('juhu')
This uses the same logic of creating a Boolean series first:
print(df['B'] > df['A'])
0 True
1 False
2 False
dtype: bool

df['B']>df['A'] is a pandas Series of boolean dtype.
>>> (df['B']>df['A']).dtype
dtype('bool')
For example
>>> df['B']>df['A']
0 True
1 False
2 False
3 True
4 True
dtype: bool
The any() function returns True if any item in an iterable is true:
>>> if any(df['B']>df['A']):
...     print(True)
...
True

I guess you want to check whether df['B'] > df['A'] holds for any row and then do something.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 0, 6, 3]})
if np.where(df['B'] > df['A'], 1, 0).sum():
    print('do something')
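The answers above use Python's built-in `any`. A minimal sketch with the Series methods `any` and `all` is shown below, using the three-row frame from the first answer; the variable names are illustrative. `Series.all` maps directly onto the original goal of verifying that A is greater than B in every row.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 1, 1]})

b_exceeds_a = df["B"] > df["A"]                     # Boolean Series, one entry per row

any_violation = bool(b_exceeds_a.any())             # True: row 0 has B > A
a_always_greater = bool((df["A"] > df["B"]).all())  # False: row 0 breaks the rule
offending_rows = df.index[b_exceeds_a].tolist()     # labels of the rows where B > A

print(any_violation, a_always_greater, offending_rows)  # True False [0]
```

Keeping the Boolean Series in a variable makes it easy to report which rows fail the check, not just whether any do.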

Python | count number of False statements in 3 rows

Columns L, M, and N of my dataframe are populated with 'true' and 'false' statements (1000 rows). I would like to create a new column 'count_false' that returns the number of times a 'false' statement occurs in columns L, M, and N.
Any tips appreciated!
Thank you.

You can negate your dataframe and sum over axis=1:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 2, (5, 3)), columns=list('LMN')).astype(bool)
df['Falses'] = (~df).sum(axis=1)
print(df)
L M N Falses
0 True False True 1
1 True False False 2
2 True True True 0
3 False True False 2
4 False False True 2
If you have additional columns, you can filter accordingly:
df['Falses'] = (~df[list('LMN')]).sum(axis=1)

Try this, which counts the False values in each row: df[df == False].count(axis=1)

As explained in another Stack question, True and False are equal to 1 and 0 in Python, so something like the third line of the following example should solve your problem:
import pandas as pd
df = pd.DataFrame([[True, False, True],[False, False, False],[True, False, True],[False, False, True]], columns=['L','M','N'])
df['count_false'] = 3 - (df['L']*1 + df['M']*1 + df['N']*1)
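A minimal sketch of the same count without hard-coding the number of columns is shown below. It reuses the frame from the last answer. The second half assumes the columns hold the strings 'true' and 'false' and not real booleans, which the question's wording leaves open; `text_df` and `cols` are hypothetical names.

```python
import pandas as pd

cols = ["L", "M", "N"]

# Boolean columns: compare against False and sum the matches across each row.
df = pd.DataFrame(
    [[True, False, True], [False, False, False], [True, False, True], [False, False, True]],
    columns=cols,
)
df["count_false"] = df[cols].eq(False).sum(axis=1)
print(df["count_false"].tolist())       # [1, 3, 1, 2]

# Assumed variant: the same data stored as the strings 'true' / 'false'.
text_df = pd.DataFrame({
    "L": ["true", "false", "true", "false"],
    "M": ["false", "false", "false", "false"],
    "N": ["true", "false", "true", "true"],
})
text_df["count_false"] = text_df[cols].eq("false").sum(axis=1)
print(text_df["count_false"].tolist())  # [1, 3, 1, 2]
```

Selecting `df[cols]` before comparing keeps the new `count_false` column, and any other columns, out of the count.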

Is there a `in` like statement for a Pandas Series?

I can easily check whether each is equal to a number:
In [20]: s = pd.Series([1, 2, 3])
In [21]: s == 1
Out[21]:
0 True
1 False
2 False
My problem is, is there a function like s._in([1, 2]) and output something like
0 True
1 True
2 False

Yes, it is called isin. Do s.isin([1, 2]).
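A short sketch of `isin` on the Series from the question, including its use as a filter and its negation with `~`:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

in_set = s.isin([1, 2])      # element-wise membership test, returns a Boolean Series

print(in_set.tolist())       # [True, True, False]
print(s[in_set].tolist())    # [1, 2]  values that are in the list
print(s[~in_set].tolist())   # [3]     values that are not in the list
```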
