numpy.isnan(value) not the same as value == numpy.nan?

Why am I getting the following:
>>> v
nan
>>> type(v)
<type 'numpy.float64'>
>>> v == np.nan
False
>>> np.isnan(v)
True
I would have thought the two should be equivalent?

nan != nan. That's how equality comparisons on nan are defined by the IEEE 754 floating-point standard: nan compares unequal to everything, including itself. That result was judged more convenient for numerical algorithms than the alternative, and it is specifically why isnan exists.
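To make the takeaway concrete, here is a minimal sketch of the reliable NaN checks (standard library and NumPy functions only):

import math
import numpy as np

v = np.float64('nan')

print(v == np.nan)    # False: every comparison with nan is False
print(np.isnan(v))    # True:  works on NumPy scalars and arrays
print(math.isnan(v))  # True:  stdlib equivalent for plain floats
print(v != v)         # True:  nan is the only value unequal to itself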


Numpy/Pandas clean way to check if a specific value is NaN

How can I check if a given value is NaN?
e.g. if (a == np.NaN) (doesn't work)
Please note that:
- NumPy's isnan function raises a TypeError on non-numeric data types such as strings.
- The Pandas docs only describe methods for dropping rows containing NaNs, or for checking if/where a DataFrame contains NaNs. I'm asking about checking whether a specific value is NaN.
- Relevant Stack Overflow questions and Google search results seem to be about checking "if any value is NaN" or "which values in a DataFrame are NaN".
There must be a clean way to check if a given value is NaN?
You can use the innate property that NaN != NaN:
a == a will return False if a is NaN.
This works even when the series contains strings.
Example:
In[52]:
s = pd.Series([1, np.NaN, '', 1.0])
s
Out[52]:
0      1
1    NaN
2
3      1
dtype: object
for val in s:
    print(val == val)
True
False
True
True
This can be done in a vectorised manner:
In[54]:
s==s
Out[54]:
0 True
1 False
2 True
3 True
dtype: bool
but you can also use the isnull method on the whole series:
In[55]:
s.isnull()
Out[55]:
0 False
1 True
2 False
3 False
dtype: bool
UPDATE
As noted by @piRSquared, None == None returns True, but pd.isnull(None) also returns True. So the a == a trick will not flag None as missing, whereas pd.isnull will. Depending on whether you want to treat None as NaN, you can stick with == for the comparison or use pd.isnull instead.
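A quick sketch of that distinction (my illustration, using only standard pandas calls):

import pandas as pd

print(None == None)             # True: the a == a trick does not flag None
print(pd.isnull(None))          # True: pandas treats None as missing
print(pd.isnull(float('nan')))  # True: and NaN as well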
Pandas has isnull, notnull, isna, and notna
These functions work for arrays or scalars.
Setup
import numpy as np
import pandas as pd

a = np.array([[1, np.nan],
              [None, '2']])
Pandas functions
pd.isna(a)
# same as
# pd.isnull(a)
array([[False,  True],
       [ True, False]])
pd.notnull(a)
# same as
# pd.notna(a)
array([[ True, False],
       [False,  True]])
DataFrame (or Series) methods
b = pd.DataFrame(a)
b.isnull()
# same as
# b.isna()
       0      1
0  False   True
1   True  False
b.notna()
# same as
# b.notnull()
       0      1
0   True  False
1  False   True
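Since these functions also accept scalars, here is a short sketch (my addition, standard pandas calls only) of checking individual values:

import numpy as np
import pandas as pd

print(pd.isna(np.nan))  # True
print(pd.isna(None))    # True: None also counts as missing
print(pd.isna('a'))     # False: safe on strings, unlike np.isnan
print(pd.notna(2.0))    # True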

Comparing lists containing NaNs

I am trying to compare two different lists to see if they are equal, and was going to remove NaNs first, only to discover that my list comparisons still work, despite NaN == NaN evaluating to False.
Could someone explain why the following expressions evaluate to True or False? I find this behavior unexpected.
I have read the following which don't seem to resolve the issue:
Why in numpy nan == nan is False while nan in [nan] is True?
Why is NaN not equal to NaN? [duplicate]
(Python 2.7.3, numpy-1.9.2)
I have marked the surprising evaluations with *** at the end.
>>> nan = np.nan
>>> [1,2,3]==[3]
False
>>> [1,2,3]==[1,2,3]
True
>>> [1,2,nan]==[1,2,nan]
True ***
>>> nan == nan
False
>>> [nan] == [nan]
True ***
>>> [nan, nan] == [nan for i in range(2)]
True ***
>>> [nan, nan] == [float(nan) for i in range(2)]
True ***
>>> float(nan) is (float(nan) + 1)
False
>>> float(nan) is float(nan)
True ***
To understand what happens here, simply replace nan = np.nan with foo = float('nan'); you will get exactly the same results. Why?
>>> foo = float('nan')
>>> foo is foo # This is obviously True!
True
>>> foo == foo # This is False per the standard (nan != nan).
False
>>> bar = float('nan') # foo and bar are two different objects.
>>> foo is bar
False
>>> foo is float(foo) # "Tricky", but float(x) is x if type(x) == float.
True
Now think that numpy.nan is just a variable name that holds a float('nan').
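You can confirm this with a quick sanity check (my addition, not part of the original answer):

import numpy as np

print(type(np.nan) is float)  # True: np.nan is a plain Python float
print(np.nan is np.nan)       # True: the same object every time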
As for why [nan] == [nan] is True: list comparison tests identity between items first, and only falls back to comparing by value. Think of it as:
def equals(l1, l2):
    if len(l1) != len(l2):
        return False
    for u, v in zip(l1, l2):
        # Identity is checked first, so nan passes when both sides are
        # the very same object, even though nan != nan by value.
        if u is not v and u != v:
            return False
    return True
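The equals function above is an illustrative sketch, not CPython's actual list comparison; a quick check that it matches the built-in behaviour:

import numpy as np

nan = np.nan
print(equals([1, 2, nan], [1, 2, nan]))  # True:  same nan object, identity hit
print(equals([nan], [float('nan')]))     # False: two distinct nan objects
print([nan] == [float('nan')])           # False: the built-in comparison agrees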

contract of pandas.DataFrame.equals

I have a simple test case of a function which returns a df that can potentially contain NaN. I was testing if the output and expected output were equal.
>>> output
Out[1]:
      r   t  ts  tt  ttct
0  2048  30   0  90     1
1  4096  90   1  30     1
2     0  70   2  65     1
[3 rows x 5 columns]
>>> expected
Out[2]:
      r   t  ts  tt  ttct
0  2048  30   0  90     1
1  4096  90   1  30     1
2     0  70   2  65     1
[3 rows x 5 columns]
>>> output == expected
Out[3]:
      r     t    ts    tt  ttct
0  True  True  True  True  True
1  True  True  True  True  True
2  True  True  True  True  True
However, I can't simply rely on the == operator because of NaNs. I was under the impression that the appropriate way to resolve this was by using the equals method. From the documentation:
pandas.DataFrame.equals
DataFrame.equals(other)
Determines if two NDFrame objects contain the same elements. NaNs in the same location are considered equal.
Nonetheless:
>>> expected.equals(output)
Out[4]: False
A little digging around reveals the difference in the frames:
>>> output._data
Out[5]:
BlockManager
Items: Index([u'r', u't', u'ts', u'tt', u'ttct'], dtype='object')
Axis 1: Int64Index([0, 1, 2], dtype='int64')
FloatBlock: [r], 1 x 3, dtype: float64
IntBlock: [t, ts, tt, ttct], 4 x 3, dtype: int64
>>> expected._data
Out[6]:
BlockManager
Items: Index([u'r', u't', u'ts', u'tt', u'ttct'], dtype='object')
Axis 1: Int64Index([0, 1, 2], dtype='int64')
IntBlock: [r, t, ts, tt, ttct], 5 x 3, dtype: int64
Force that output float block to int, or force the expected int block to float, and the test passes.
Obviously, there are different senses of equality, and the sort of test that DataFrame.equals performs could be useful in some cases. Nonetheless, the disparity between == and DataFrame.equals is frustrating to me and seems like an inconsistency. In pseudo-code, I would expect its behavior to match:
(self.index == other.index).all() \
and (self.columns == other.columns).all() \
and (self.values.fillna(SOME_MAGICAL_VALUE) == other.values.fillna(SOME_MAGICAL_VALUE)).all().all()
However, it doesn't. Am I wrong in my thinking, or is this an inconsistency in the Pandas API? Moreover, what IS the test I should be performing for my purposes, given the possible presence of NaN?
.equals() does just what it says. It tests for exact equality among elements, the positions of NaNs (and NaTs), dtype equality, and index equality. Think of this as a df is df2 type of test, except that they don't have to actually be the same object; in other words, df.equals(df.copy()) is always True.
Your example fails because different dtypes are not equal (though they may be equivalent). So you can use com.array_equivalent for this, or (df == df2).all().all() if you don't have NaNs.
This is a replacement for np.array_equal, which is broken for NaN positional detection (and for object dtypes).
It is mostly used internally. That said, if you would like an enhancement for equivalence (e.g. the elements are equivalent in the == sense and the NaN positions match), please open an issue on GitHub (or, even better, submit a PR!).
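For test code, a reasonable option today (my suggestion, not part of the original answer) is pandas' own testing helper, which can ignore dtype differences while still treating NaNs in the same locations as equal:

import pandas as pd

output = pd.DataFrame({'r': [2048.0, 4096.0, 0.0]})  # float64 column
expected = pd.DataFrame({'r': [2048, 4096, 0]})      # int64 column

# Raises AssertionError on mismatch; check_dtype=False compares values
# while ignoring the float64/int64 difference.
pd.testing.assert_frame_equal(output, expected, check_dtype=False)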
I used a workaround, digging into the MagicMock instance:

# Verify the mocked callable was invoked exactly once, then unpack the
# positional and keyword arguments of that call.
assert mock_instance.call_count == 1
call_args = mock_instance.call_args[0]
call_kwargs = mock_instance.call_args[1]

# assert_frame_equal treats NaNs in the same locations as equal.
pd.testing.assert_frame_equal(call_kwargs['dataframe'], pd.DataFrame())

Understand any() and nan in Pandas

I have the same problem as in: Pandas series.all() returns nan
In [88]: pd.Series([False, np.nan]).any()
Out[88]: nan
where as:
In [84]: np.any([False, np.nan])
Out[84]: True
and also:
In [99]: pd.DataFrame([False, np.nan]).any()
Out[99]:
0 False
dtype: bool
I am curious: what is the explanation for these three different behaviors?
The difference here has nothing to do with the two different types implementing any differently. In fact, the docs for pandas.Series.any and numpy.ndarray.any both explicitly say "Refer to numpy.any for full documentation", because they both effectively just call numpy.any.
The difference is that you have different dtypes in the two cases. Creating a NumPy ndarray, implicitly or explicitly, from mixed numeric types coerces them to a common type where possible, so you end up with float64; a Pandas Series keeps the mixed values as Python objects, so you end up with object dtype.
If you specify the dtype explicitly, you can see that they do the same thing:
>>> a = np.array([False, np.nan])
>>> a
array([ 0., nan])
>>> a.dtype
float64
>>> a.any()
True
>>> a = np.array([False, np.nan], dtype=object)
>>> a
array([False, nan], dtype=object)
>>> a.any()
nan
>>> p = pd.Series([False, np.nan])
>>> p
0 False
1 NaN
>>> p.dtype
dtype('O')
>>> p.any()
nan
>>> p = pd.Series([False, np.nan], dtype=np.float64)
>>> p
0 0
1 NaN
>>> p.any()
True
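As a mental model for the object-dtype result (my illustration, not from the original answer): the reduction behaves like a fold with Python's or operator, which returns one of its operands rather than a strict bool, so False or nan yields nan:

import functools
import numpy as np

def object_any(values):
    # Illustrative only: fold with `or`, which returns an operand,
    # not necessarily a bool. `False or nan` evaluates to nan.
    return functools.reduce(lambda x, y: x or y, values)

print(object_any([False, np.nan]))                    # nan
print(np.array([False, np.nan], dtype=object).any())  # nan on the versions quoted above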

Python: max/min builtin functions depend on parameter order

max(float('nan'), 1) evaluates to nan
max(1, float('nan')) evaluates to 1
Is it the intended behavior?
Thanks for the answers.
max raises an exception when the iterable is empty. Why wouldn't Python's max raise an exception when nan is present? Or at least do something useful, like return nan or ignore nan. The current behavior is very unsafe and seems completely unreasonable.
I found an even more surprising consequence of this behavior, so I just posted a related question.
In [19]: 1>float('nan')
Out[19]: False
In [20]: float('nan')>1
Out[20]: False
The float nan is neither bigger nor smaller than the integer 1.
max starts by choosing the first element, and only replaces it when it finds an element which is strictly larger.
In [31]: max(1,float('nan'))
Out[31]: 1
Since nan is not larger than 1, 1 is returned.
In [32]: max(float('nan'),1)
Out[32]: nan
Since 1 is not larger than nan, nan is returned.
PS. Note that np.max treats float('nan') differently:
In [36]: import numpy as np
In [91]: np.max([1,float('nan')])
Out[91]: nan
In [92]: np.max([float('nan'),1])
Out[92]: nan
but if you wish to ignore np.nans, you can use np.nanmax:
In [93]: np.nanmax([1,float('nan')])
Out[93]: 1.0
In [94]: np.nanmax([float('nan'),1])
Out[94]: 1.0
I haven't seen this before, but it makes sense. Notice that nan is a very weird object:
>>> x = float('nan')
>>> x == x
False
>>> x > 1
False
>>> x < 1
False
I would say that the behaviour of max is undefined in this case -- what answer would you expect? The only sensible behaviour is to assume that the operations are antisymmetric.
Notice that you can reproduce this behaviour by making a broken class:
>>> class Broken(object):
...     __le__ = __ge__ = __eq__ = __lt__ = __gt__ = __ne__ = \
...         lambda self, other: False
...
>>> x = Broken()
>>> x == x
False
>>> x < 1
False
>>> x > 1
False
>>> max(x, 1)
<__main__.Broken object at 0x024B5B50>
>>> max(1, x)
1
max works the following way:
The first item is set as maxval, and each subsequent item is compared to this value. With nan, the comparison always returns False:
>>> float('nan') < 1
False
>>> float('nan') > 1
False
So if the first value is nan, then (since the comparison returns False) it is never replaced in a later step.
On the other hand, if 1 comes first, the comparisons still return False: but in this case, since 1 was set first, it remains the maximum.
You can verify this in the CPython source: look up the function min_max in Python/bltinmodule.c.
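Here is an illustrative Python sketch of that algorithm (my addition; the real implementation is the C function min_max in Python/bltinmodule.c):

def my_max(iterable):
    it = iter(iterable)
    try:
        maxval = next(it)      # an empty input raises, like the real max
    except StopIteration:
        raise ValueError('my_max() arg is an empty sequence')
    for item in it:
        if item > maxval:      # every comparison with nan is False, so nan
            maxval = item      # never replaces maxval and is never replaced
    return maxval

print(my_max([1, float('nan')]))  # 1
print(my_max([float('nan'), 1]))  # nan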
