Finding NaN Values in a Pandas MultiIndex

I'm trying to find the difference between two Pandas MultiIndex objects of different shapes. I've used:

df1.index.difference(df2)

and receive:

TypeError: '<' not supported between instances of 'float' and 'str'

My indices are str and datetime, but I suspect there are NaNs hiding in there (the floats). Hence my question: what's the best way to find the NaNs somewhere in the MultiIndex? How does one iterate through the levels and names? Can I use something like isna()?

Many functions are not implemented for MultiIndex. You need to convert the MultiIndex to a DataFrame with MultiIndex.to_frame first:
# W-B's sample
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples([(np.nan, 1), (1, 1), (1, 2)])
print (idx.to_frame())
         0  1
NaN 1  NaN  1
1   1  1.0  1
    2  1.0  2

print (idx.to_frame().isnull())
           0      1
NaN 1   True  False
1   1  False  False
    2  False  False
Or use the DataFrame constructor:

print (pd.DataFrame(idx.tolist()))
     0  1
0  NaN  1
1  1.0  1
2  1.0  2
This is necessary because isna is not defined for a MultiIndex directly:

print (pd.isnull(idx))
NotImplementedError: isna is not defined for MultiIndex
EDIT: To check for at least one True per row, use any with boolean indexing:

df = idx.to_frame()
print (df[df.isna().any(axis=1)])
         0  1
NaN 1  NaN  1
It is also possible to filter the MultiIndex itself, but it is necessary to add MultiIndex.remove_unused_levels:

print (idx[idx.to_frame().isna().any(axis=1)].remove_unused_levels())
MultiIndex(levels=[[], [1]],
           labels=[[-1], [0]])
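Applied back to the original question, a minimal sketch (df1 and df2 here are assumed stand-in frames, not the asker's data): drop the NaN rows from each index before calling difference, so the sort inside difference never compares a float NaN against a str:

import numpy as np
import pandas as pd

def dropna_index(df):
    # Keep only rows whose MultiIndex contains no NaN in any level.
    mask = df.index.to_frame().notna().all(axis=1)
    return df[mask.values]

df1 = pd.DataFrame({'x': [1, 2, 3]},
                   index=pd.MultiIndex.from_tuples(
                       [('a', 1), (np.nan, 2), ('b', 3)]))
df2 = pd.DataFrame({'x': [1]},
                   index=pd.MultiIndex.from_tuples([('a', 1)]))

print (dropna_index(df1).index.difference(dropna_index(df2).index))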

We can use reset_index, then isna:

idx = pd.MultiIndex.from_tuples([(np.nan, 1), (1, 1), (1, 2)])
df = pd.DataFrame([1, 2, 3], index=idx)
df.reset_index().filter(like='level_').isna()
Out[304]:
   level_0  level_1
0     True    False
1    False    False
2    False    False
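The question also asks how to iterate through the levels and names. While isna is not implemented for the MultiIndex as a whole, it does work on each flat level returned by get_level_values; a small sketch (the level names here are assumptions for illustration):

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples([(np.nan, 1), (1, 1), (1, 2)],
                                names=['key', 'num'])

# get_level_values returns a flat Index, where isna is implemented.
for i, name in enumerate(idx.names):
    print (name, idx.get_level_values(i).isna().any())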

Related

Pandas: change NaN column values to True or False

I need to change a column to either True or False based on the NaN value.
Here is the df:

  missing
0     NaN
1       b
2     NaN
4       y
5     NaN

which would become:

  missing
0   False
1    True
2   False
4    True
5   False

Yes, I could do a loop, but there has to be a simple way to do it in a single line of code.
Thank you.
You can do:

df['missing'].notna() # or notnull()

You need to overwrite the column values with the boolean result applied to the same column, which can be achieved with notna():

df['missing'] = df['missing'].notna()
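A self-contained sketch reproducing the frame from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({'missing': [np.nan, 'b', np.nan, 'y', np.nan]},
                  index=[0, 1, 2, 4, 5])

# notna is True where a value is present, False where it is NaN.
df['missing'] = df['missing'].notna()
print (df)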

Why do we sometimes have to add .values when doing elementwise operations in pandas?

Suppose I have a dataframe that looks like:

   A
0  0
1  1
2  2
3  3

and when I run:

a = df.loc[np.arange(0, 2)] / df.loc[np.arange(2, 4)]

I get:

    A
0 NaN
1 NaN
2 NaN
3 NaN

I know I could get the right result by writing:

a = df.loc[np.arange(0, 2)].values / df.loc[np.arange(2, 4)]
b = df.loc[np.arange(0, 2)] / df.loc[np.arange(2, 4)].values

Can anyone explain why?
Pandas is index- and column-sensitive: when you do a calculation, those hidden keys get matched first. If you only want the values to be matched positionally, removing the impact of index and columns, add .values or to_numpy(); however, index alignment also brings advantages.
Example 1: the indices do not match, so every value comes back NaN:
s1 = pd.Series([1], index=[1])
s2 = pd.Series([1], index=[999])

s1/s2
1     NaN
999   NaN
dtype: float64

s1.values/s2.values
array([1.])
Example 2: the indices partially match, so pandas returns a value wherever they do:
s1 = pd.Series([1], index=[1])
s2 = pd.Series([1, 999], index=[1, 999])

s1/s2
1      1.0
999    NaN
dtype: float64
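For the original division, an alternative to .values is to realign one side's index instead of dropping to raw arrays; a small sketch built on the question's frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3]})

top = df.loc[np.arange(0, 2)]
bottom = df.loc[np.arange(2, 4)]

# reset_index(drop=True) gives the denominator the labels 0 and 1,
# so the two halves align positionally.
print (top / bottom.reset_index(drop=True))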

Comparing two columns and replacing NaN with numbers

for i in range(len(df1)-1):
    if (df1['overall_rating'][i]==np.nan) and (df1['recommended'][i]==0):
        df1['overall_rating']=df1['overall_rating'][i].replace(np.nan,1)
    else:
        df1['overall_rating']
print(df1['overall_rating'])

I am comparing the overall_rating and recommended columns in a pandas DataFrame. If both conditions happen to be true, then I should replace the NaN in the rating column with 1. But I am getting neither an answer nor an error. Can anyone please let me know where I am going wrong?
Use DataFrame.loc to set 1 by the two conditions; to test for missing values, use the Series.isna function (comparing with ==np.nan is always False, which is why your loop never matches):

df1 = pd.DataFrame({'overall_rating':[np.nan, 2, 4, np.nan],
                    'recommended':[0, 0, 1, 1]})

df1.loc[df1['overall_rating'].isna() & (df1['recommended']==0), 'overall_rating'] = 1
print (df1)
   overall_rating  recommended
0             1.0            0
1             2.0            0
2             4.0            1
3             NaN            1
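An equivalent sketch with numpy.where (the same logic, not from the original answer):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'overall_rating': [np.nan, 2, 4, np.nan],
                    'recommended': [0, 0, 1, 1]})

# Where the rating is missing and recommended is 0, fill with 1;
# otherwise keep the existing rating.
df1['overall_rating'] = np.where(
    df1['overall_rating'].isna() & df1['recommended'].eq(0),
    1,
    df1['overall_rating'])
print (df1)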

How can I get the index of rows having null values in all columns?

I'd like to get the index of rows which have only null values, straight in pandas, Python 3.
Thanks.
Use:

i = df.index[df.isna().all(axis=1)]

For a large DataFrame, this is the slower alternative:

i = df[df.isna().all(axis=1)].index

Sample:

df = pd.DataFrame({"a": [np.nan, 0, 1],
                   "b": [np.nan, 1, np.nan]})
print (df)
     a    b
0  NaN  NaN
1  0.0  1.0
2  1.0  NaN

i = df.index[df.isna().all(axis=1)]
print (i)
Int64Index([0], dtype='int64')
Explanation:
First, compare for missing values with DataFrame.isna:

print (df.isna())
       a      b
0   True   True
1  False  False
2  False   True

Then check whether all values per row are True with DataFrame.all:

print (df.isna().all(axis=1))
0     True
1    False
2    False
dtype: bool

And finally, filter the index values by boolean indexing.
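If the end goal is to drop those all-null rows rather than just locate them, dropna can do it in one step; a small sketch on the same sample:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [np.nan, 0, 1],
                   "b": [np.nan, 1, np.nan]})

# how='all' drops only rows in which every column is NaN.
print (df.dropna(how='all'))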

What is the Right Syntax When Using .notnull() in Pandas?

I want to use .notnull() on several columns of a dataframe to eliminate the rows which contain NaN values.
Let's say I have the following df:

     A    B    C
0    1    1    1
1    1  NaN    1
2    1  NaN  NaN
3  NaN    1    1

I tried to use this syntax, but it does not work. Do you know what I am doing wrong?

df[[df.A.notnull()],[df.B.notnull()],[df.C.notnull()]]

I get this error:

TypeError: 'Series' objects are mutable, thus they cannot be hashed

What should I do to get the following output?

   A  B  C
0  1  1  1

Any idea?
You can first select the subset of columns with df[['A','B','C']], then apply notnull and check whether all values in the mask per row are True:

print (df[['A','B','C']].notnull())
       A      B      C
0   True   True   True
1   True  False   True
2   True  False  False
3  False   True   True

print (df[['A','B','C']].notnull().all(1))
0     True
1    False
2    False
3    False
dtype: bool

print (df[df[['A','B','C']].notnull().all(1)])
     A    B    C
0  1.0  1.0  1.0
Another solution, from Ayhan's comment, uses dropna:

print (df.dropna(subset=['A', 'B', 'C']))
     A    B    C
0  1.0  1.0  1.0

which is the same as:

print (df.dropna(subset=['A', 'B', 'C'], how='any'))

and means: drop all rows where there is at least one NaN value among those columns.
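As a related sketch (an extension, not part of the original answer): dropna also accepts a thresh parameter, which keeps rows with at least a given number of non-NaN values among the subset:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, np.nan],
                   'B': [1, np.nan, np.nan, 1],
                   'C': [1, 1, np.nan, 1]})

# Keep rows with at least 2 non-NaN values among A, B and C.
print (df.dropna(subset=['A', 'B', 'C'], thresh=2))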
You can apply multiple conditions by combining them with the & operator (this works not only for the notnull() function):

df[(df.A.notnull() & df.B.notnull() & df.C.notnull())]
     A    B    C
0  1.0  1.0  1.0
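If the column list grows, the chained & conditions can be built programmatically; a small sketch (the frame is rebuilt here so the snippet stands alone):

import operator
from functools import reduce

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, np.nan],
                   'B': [1, np.nan, np.nan, 1],
                   'C': [1, 1, np.nan, 1]})

# Build one combined mask instead of chaining & by hand.
cols = ['A', 'B', 'C']
mask = reduce(operator.and_, (df[c].notnull() for c in cols))
print (df[mask])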
Alternatively, you can just drop all rows which contain NaN. The original DataFrame is not modified; instead a copy is returned.

df.dropna()
