Pandas - Comparing None values

Consider this snippet:
a = pd.DataFrame([[None]])
b = pd.DataFrame([[None]])
Now I want to validate that both of them contain exactly the same values:
int((a == b).sum()) # should be 1
but it's not 1. Instead, it returns 0. This behavior is causing me trouble, especially in assert_frame_equal, which reports that None is not None even though the values are equal:
a.iloc[0,0] == b.iloc[0,0] # True
Why is that and how can I fix this?

pandas special-cases None so that it is interpreted as NaN (since NaN != NaN, and pd.isnull treats both consistently... this may be one explanation).
Not a solution, but a workaround: np.array_equal works if the values are None and not NaN.
>>> np.array_equal(a, b)
True
If you want the count and not a bool result, use np.equal;
>>> np.equal(a, b).sum().item()
1
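
As a complement to the workaround above, here is a minimal sketch of two null-aware comparisons that stay inside pandas. It assumes that treating two missing values in the same position as equal is what you want; the names elementwise, same and null_aware are only illustrative.

```python
import pandas as pd

a = pd.DataFrame([[None]])
b = pd.DataFrame([[None]])

# Elementwise == treats missing values (None/NaN) as unequal.
elementwise = a == b

# DataFrame.equals treats missing values in the same location as equal.
same = a.equals(b)

# Null-aware elementwise comparison: equal values, or both missing.
null_aware = (a == b) | (a.isna() & b.isna())
matches = int(null_aware.sum().sum())
```

Here same gives the bool result and matches gives the count, without leaving pandas.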

Related

How to check if all values in a dataframe are True

pd.DataFrame.all and pd.DataFrame.any convert all values to bool and then check whether all (or any) of them are True. This is fine as long as we accept that non-empty lists and strings evaluate to True. However, let's assume that this is not acceptable.
>>> pd.DataFrame([True, 'a']).all().item()
True # Wrong
A workaround is to assert equality with True, but a comparison to True does not sound pythonic.
>>> (pd.DataFrame([True, 'a']) == True).all().item()
False # Right
Question: can we assert identity with True without using == True?

First of all, I do not advise this. Please do not use mixed dtypes inside your dataframe columns - that defeats the purpose of dataframes and they are no more useful than lists and no more performant than loops.
Now, addressing your actual question: spoiler alert, you can't get away from the ==. But you can hide it using the eq function. You may use
df.eq(True).all()
Or,
df.where(df.eq(True), False).all()
Note that
df.where(df.eq(True), False)
       0
0   True
1  False
Which you may find useful if you want to convert non-"True" values to False for any other reason.

I would actually use
(pd.DataFrame([True, 'a']) == True).all().item()
This way, you're checking for the value of the object, not just checking the "truthy-ness" of it.
This seems perfectly pythonic to me because you're explicitly checking for the value of the object, not just whether or not it's a truthy value.
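
To make the value check from the answers above reusable, here is a small sketch. The helper name all_exactly_true is hypothetical, and it reduces over the whole frame rather than a single column.

```python
import pandas as pd


def all_exactly_true(df):
    """Return True only if every cell compares equal to True."""
    # eq(True) is the method form of == True; all() is applied twice to
    # reduce first over each column and then over the resulting Series.
    return bool(df.eq(True).all().all())


mixed = pd.DataFrame([True, "a"])
pure = pd.DataFrame([True, True])

mixed_result = all_exactly_true(mixed)
pure_result = all_exactly_true(pure)
```

Keep in mind that this is still an equality test, so the integer 1 also passes, because 1 == True in Python.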

Why does comparing to nan yield False (Python)?

Here, I have the following:
>>> import numpy as np
>>> q = np.nan
>>> q == np.nan
False
>>> q is np.nan
True
>>> q in (np.nan, )
True
So, the question is: why is nan not equal to nan, even though nan is nan?
(UNIQUE) And why does 'in' return True?
I don't seem to be able to trace down the implementation of nan. It leads me to C:\Python33\lib\site-packages\numpy\core\umath.pyd (row NAN = nan), but from there there is no traceable way to find out what nan actually is.

Following the IEEE 754 floating-point standard, numpy (like Python's own float, of which np.nan is an instance) makes most comparisons to nan, including ==, yield False. You can get similar behaviour for your own objects in Python by defining an __eq__(self, other) method. This behaviour is useful for various purposes. After all, the fact that one entry has a missing value, and another entry also has a missing value, does not imply that those two entries are equal. It just implies that you don't know whether they are equal or not, and it's therefore best not to treat them as if they are (e.g. when you join two tables together by pairing up corresponding rows).
is, on the other hand, is a Python keyword whose behaviour cannot be overridden by numpy. It tests whether two objects are the same object, and np.nan is the same object as np.nan. Note that this only identifies that one particular object: to get rid of all entries which don't have a value, use a function such as np.isnan rather than is not nan, because other NaN values need not be the same object.
nan in (nan,) returns True because as you probably know, (nan,) is a tuple with only one element, nan, and when Python checks if an object is in a tuple, it is checking whether that object is or == any object in the tuple.
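
The three checks can be compared side by side in a short sketch using only the standard library. It assumes CPython's behaviour that each float("nan") call creates a new object; the names nan and other are illustrative.

```python
import math

nan = float("nan")
other = float("nan")  # a second, distinct NaN object

results = {
    "equal_to_itself": nan == nan,          # == follows IEEE 754
    "identical_to_itself": nan is nan,      # same object
    "same_object_in_tuple": nan in (nan,),  # identity is checked before ==
    "other_object_in_tuple": other in (nan,),
    "isnan_other": math.isnan(other),       # works for any NaN object
}
```

Only math.isnan (or np.isnan for arrays) answers the question "is this value a NaN?" regardless of which object it is.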

Checking for NaN presence in a container

NaN is handled perfectly when I check for its presence in a list or a set. But I don't understand how. [UPDATE: no it's not; it is reported as present if the identical instance of NaN is found; if only non-identical instances of NaN are found, it is reported as absent.]
I thought presence in a list is tested by equality, so I expected NaN to not be found since NaN != NaN.
hash(NaN) and hash(0) are both 0. How do dictionaries and sets tell NaN and 0 apart?
Is it safe to check for NaN presence in an arbitrary container using in operator? Or is it implementation dependent?
My question is about Python 3.2.1; but if there are any changes existing/planned in future versions, I'd like to know that too.
NaN = float('nan')
print(NaN != NaN) # True
print(NaN == NaN) # False
list_ = (1, 2, NaN)
print(NaN in list_) # True; works fine but how?
set_ = {1, 2, NaN}
print(NaN in set_) # True; hash(NaN) is some fixed integer, so no surprise here
print(hash(0)) # 0
print(hash(NaN)) # 0
set_ = {1, 2, 0}
print(NaN in set_) # False; works fine, but how?
Note that if I add an instance of a user-defined class to a list, and then check for containment, the instance's __eq__ method is called (if defined) - at least in CPython. That's why I assumed that list containment is tested using operator ==.
EDIT:
Per Roman's answer, it would seem that __contains__ for list, tuple, set, dict behaves in a very strange way:
def __contains__(self, x):
    for element in self:
        if x is element:
            return True
        if x == element:
            return True
    return False
I say 'strange' because I didn't see it explained in the documentation (maybe I missed it), and I think this is something that shouldn't be left as an implementation choice.
Of course, one NaN object may not be identical (in the sense of id) to another NaN object. (This is not really surprising; Python doesn't guarantee such identity. In fact, I never saw CPython share an instance of NaN created in different places, even though it shares an instance of a small number or a short string.) This means that the result of testing for NaN presence in a built-in container depends on which NaN object you happen to hold.
This is very dangerous, and very subtle. Someone might run the very code I showed above, and incorrectly conclude that it's safe to test for NaN membership using in.
I don't think there is a perfect workaround to this issue. One very safe approach is to ensure that NaNs are never added to built-in containers. (It's a pain to check for that all over the code...)
Another alternative is to watch out for cases where in might have NaN on the left side, and in such cases, test for NaN membership separately, using math.isnan(). In addition, other operations (e.g., set intersection) also need to be avoided or rewritten.
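
A minimal sketch of the second alternative, testing for NaN membership separately with math.isnan(). The helper name contains_nan is hypothetical, and it only considers float elements (which covers float subclasses such as numpy.float64).

```python
import math


def contains_nan(items):
    """Return True if any float element of the iterable is a NaN."""
    # isinstance guards against non-numeric elements such as strings,
    # for which math.isnan would raise TypeError.
    return any(isinstance(x, float) and math.isnan(x) for x in items)


with_nan = contains_nan([1, 2, float("nan")])
without_nan = contains_nan({1, 2, 0})
mixed_types = contains_nan(("a", None, 1.5))
```

Unlike the in operator, this does not depend on which NaN object happens to be stored in the container.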

Question #1: why is NaN found in a container when it's an identical object?
From the documentation:
For container types such as list, tuple, set, frozenset, dict, or
collections.deque, the expression x in y is equivalent to any(x is e
or x == e for e in y).
This is precisely what I observe with NaN, so everything is fine. Why this rule? I suspect it's because a dict/set wants to honestly report that it contains a certain object if that object is actually in it (even if __eq__() for whatever reason chooses to report that the object is not equal to itself).
Question #2: why is the hash value for NaN the same as for 0?
From the documentation:
Called by built-in function hash() and for operations on members of
hashed collections including set, frozenset, and dict. hash()
should return an integer. The only required property is that objects
which compare equal have the same hash value; it is advised to somehow
mix together (e.g. using exclusive or) the hash values for the
components of the object that also play a part in comparison of
objects.
Note that the requirement is only in one direction; objects that have the same hash do not have to be equal! At first I thought it's a typo, but then I realized that it's not. Hash collisions happen anyway, even with default __hash__() (see an excellent explanation here). The containers handle collisions without any problem. They do, of course, ultimately use the == operator to compare elements, hence they can easily end up with multiple values of NaN, as long as they are not identical! Try this:
>>> nan1 = float('nan')
>>> nan2 = float('nan')
>>> d = {}
>>> d[nan1] = 1
>>> d[nan2] = 2
>>> d[nan1]
1
>>> d[nan2]
2
So everything works as documented. But... it's very very dangerous! How many people knew that multiple values of NaN could live alongside each other in a dict? How many people would find this easy to debug?..
I would recommend to make NaN an instance of a subclass of float that doesn't support hashing and hence cannot be accidentally added to a set/dict. I'll submit this to python-ideas.
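
Until something like that exists, one possible workaround is to map every NaN to a single shared object before using it as a key. This is only a sketch: CANONICAL_NAN and canonical_key are hypothetical names, not part of any library.

```python
import math

CANONICAL_NAN = float("nan")  # hypothetical module-level sentinel


def canonical_key(key):
    """Map any float NaN to one shared object; leave other keys alone."""
    if isinstance(key, float) and math.isnan(key):
        return CANONICAL_NAN
    return key


counts = {}
for raw in (float("nan"), float("nan"), 1.0):
    k = canonical_key(raw)
    counts[k] = counts.get(k, 0) + 1
```

Because every NaN key is now the same object, the identity check in the dict lookup finds it, and the two NaN entries are counted together instead of living side by side.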
Finally, I found a mistake in the documentation here:
For user-defined classes which do not define __contains__() but do
define __iter__(), x in y is true if some value z with x == z is
produced while iterating over y. If an exception is raised during the
iteration, it is as if in raised that exception.
Lastly, the old-style iteration protocol is tried: if a class defines
__getitem__(), x in y is true if and only if there is a non-negative
integer index i such that x == y[i], and all lower integer indices do
not raise IndexError exception. (If any other exception is raised, it
is as if in raised that exception).
You may notice that there is no mention of is here, unlike with built-in containers. I was surprised by this, so I tried:
>>> nan1 = float('nan')
>>> nan2 = float('nan')
>>> class Cont:
...     def __iter__(self):
...         yield nan1
...
>>> c = Cont()
>>> nan1 in c
True
>>> nan2 in c
False
As you can see, the identity is checked first, before == - consistent with the built-in containers. I'll submit a report to fix the docs.

I can't reproduce your tuple/set cases using float('nan') instead of NaN.
So I assume that it worked only because id(NaN) == id(NaN), i.e. there is no interning for NaN objects:
>>> NaN = float('NaN')
>>> id(NaN)
34373956456
>>> id(float('NaN'))
34373956480
And
>>> NaN is NaN
True
>>> NaN is float('NaN')
False
I believe tuple/set lookups have some optimization related to comparison of the same objects.
Answering your question: it seems to be unsafe to rely on the in operator when checking for the presence of NaN. I'd recommend using None, if possible.
Just a comment. __eq__ has nothing to do with the is operator, and during lookups the comparison of objects' ids seems to happen prior to any value comparisons:
>>> class A(object):
...     def __eq__(*args):
...         print('__eq__')
...
>>> A() == A()
__eq__ # as expected
>>> A() is A()
False # `is` checks only ids
>>> A() in [A()]
__eq__ # as expected
False
>>> a = A()
>>> a in [a]
True # surprise!

NaNs as key in dictionaries

Can anyone explain the following behaviour to me?
>>> import numpy as np
>>> {np.nan: 5}[np.nan]
5
>>> {np.float64(np.nan): 5}[np.float64(np.nan)]
KeyError: nan
Why does it work in the first case, but not in the second?
Additionally, I found that the following DOES work:
>>> a = np.float64(np.nan)
>>> {a: 5}[a]
5

The problem here is that NaN is not equal to itself, as defined in the IEEE standard for floating point numbers:
>>> float("nan") == float("nan")
False
When a dictionary looks up a key, it roughly does this:
1. Compute the hash of the key to be looked up.
2. For each key in the dict with the same hash, check if it matches the key to be looked up. This check consists of
a. Checking for object identity: If the key in the dictionary and the key to be looked up are the same object as indicated by the is operator, the key was found.
b. If the first check failed, checking for equality using the __eq__ method.
The first example succeeds, since np.nan and np.nan are the same object, so it does not matter they don't compare equal:
>>> numpy.nan is numpy.nan
True
In the second case, np.float64(np.nan) and np.float64(np.nan) are not the same object -- the two constructor calls create two distinct objects:
>>> numpy.float64(numpy.nan) is numpy.float64(numpy.nan)
False
Since the objects also do not compare equal, the dictionary concludes the key is not found and throws a KeyError.
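
The lookup steps described above can be modelled with a toy function. This is only an illustration of the matching rule, not how CPython stores dictionaries: find_value is a hypothetical helper and entries is a plain list of (key, value) pairs.

```python
def find_value(entries, probe):
    """Toy model of a dict lookup over a list of (key, value) pairs."""
    probe_hash = hash(probe)
    for key, value in entries:
        # Step 1: only keys with the same hash are candidates.
        if hash(key) != probe_hash:
            continue
        # Step 2: identity first, then equality.
        if key is probe or key == probe:
            return value
    raise KeyError(probe)


nan = float("nan")
entries = [(nan, 5), (1, "one")]

same_object = find_value(entries, nan)   # found through identity
equal_key = find_value(entries, 1.0)     # found through ==, since 1 == 1.0

try:
    find_value(entries, float("nan"))    # a different NaN object
    other_nan_found = True
except KeyError:
    other_nan_found = False
```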
You can even do this:
>>> a = float("nan")
>>> b = float("nan")
>>> {a: 1, b: 2}
{nan: 1, nan: 2}
In conclusion, it seems a saner idea to avoid NaN as a dictionary key.

Please note this is not the case anymore in Python 3.6:
>>> d = float("nan") #object nan
>>> d
nan
>>> c = {"a": 3, d: 4}
>>> c["a"]
3
>>> c[d]
4
In this example c is a dictionary that contains the value 3 associated to the key "a" and the value 4 associated to the key NaN.
The way Python 3.6 internally looks up in the dictionary has changed. Now, the first thing it does is compare the two pointers that represent the underlying variables. If they point to the same object, then the two objects are considered the same (well, technically we are comparing one object with itself). Otherwise, their hash is compared, if the hash is different, then the two objects are considered different. If at this point the equality of the objects has not been decided, then their comparators are called (they are "manually" compared, so to speak).
This means that although IEEE754 specifies that NAN isn't equal to itself:
>>> d == d
False
When looking up a key in a dictionary, the underlying pointers of the variables are the first thing to be compared. Because they point to the same NaN object, the dictionary returns 4.
Note also that not all NaN objects are exactly the same:
>>> e = float("nan")
>>> e == d
False
>>> c[e]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: nan
>>> c[d]
4
So, to summarize: dictionaries prioritize performance by first checking whether the underlying objects are the same. They fall back to hash comparison and then to equality comparison. Moreover, not every NaN represents the same underlying object.
One has to be very careful when dealing with NaNs as dictionary keys: adding such a key makes the associated value impossible to reach unless you depend on the property described here. This property may change in the future (somewhat unlikely, but possible). Proceed with care.

False or None vs. None or False

In [20]: print None or False
-------> print(None or False)
False
In [21]: print False or None
-------> print(False or None)
None
This behaviour confuses me. Could someone explain to me why is this happening like this? I expected them to both behave the same.

The expression x or y evaluates to x if x is true, or y if x is false.
Note that "true" and "false" in the above sentence are talking about "truthiness", not the fixed values True and False. Something that is "true" makes an if statement succeed; something that's "false" makes it fail. "false" values include False, None, 0 and [] (an empty list).
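
The rule can be written out as a small function to see why the two expressions differ. This is a sketch only: or_like is a hypothetical name, and unlike the real or operator it evaluates both arguments before the call.

```python
def or_like(x, y):
    """Mimic the result of `x or y`: x if x is truthy, otherwise y."""
    return x if x else y


first = or_like(None, False)   # None is falsy, so the result is False
second = or_like(False, None)  # False is falsy, so the result is None
third = or_like("x", "y")      # "x" is truthy, so the result is "x"
```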

The or operator returns the value of its first operand, if that value is true in the Pythonic boolean sense (aka its "truthiness"), otherwise it returns the value of its second operand, whatever it happens to be. See the subsection titled Boolean operations in the section on Expressions in the current online documentation.
In both your examples, the first operand is considered false, so the value of the second one becomes the result of evaluating the expression.

EXPLANATION
You must realize that True, False, and None are all singletons in Python, which means that exactly one instance of each exists and can ever exist, hence the name singleton. Also, you can't modify a singleton object because its state is set in stone, if I may use that phrase.
Now, let me explain how those Python singletons are meant to be used.
Let's have a Python object named foo that has the value None. Then if foo is not None tests whether foo has a value other than None. This is related to, but not the same as, if foo, which tests the truthiness of foo, i.e. bool(foo): values such as 0 or an empty list are not None but are still falsy.
So, in a boolean context, not None and True work the same way, as do None and False.
>>> foo = not None
>>> bool(foo)
True
>>> foo = 5 # Giving an arbitrary value here
>>> bool(foo)
True
>>> foo = None
>>> bool(foo)
False
>>> foo = 5 # Giving an arbitrary value here
>>> bool(foo)
True
The critical thing to realize and be aware of when coding is that when comparing two objects, None needs is, but True and False need ==. Avoid using if foo == None, only use if foo is None. Also, avoid using if foo != None and only use if foo is not None.
WARNING
If you are using if foo or if not foo when the value of foo happens to be None, beware of potential bugs in your code. So, don't check for a potential None value in conditional statements this way. Be on the safe side by checking for it as explained above, i.e., if foo is None or if foo is not None. It is very important to follow best practices shared by Python experts.
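
A short sketch of the kind of bug this warning is about: a plain truthiness test cannot tell None apart from other falsy values such as 0 or an empty list. The function name classify is only illustrative.

```python
def classify(foo):
    """Distinguish a missing value from a present-but-falsy one."""
    if foo is None:
        return "missing"
    if not foo:
        return "present but falsy"
    return "present and truthy"


labels = [classify(v) for v in (None, 0, [], "", 5)]
```

With only if not foo, the first four values would all take the same branch.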
REMINDER
True is a 1 and False is a 0.
In the old days of Python, we only had the integer 1 to represent a truthy value and we had the integer 0 to represent a falsy value. However, it is more understandable and human-friendly to say True instead of 1 and it is more understandable and human-friendly to say False instead of 0.
GOOD TO KNOW
The bool type (i.e., the values True and False) is a subtype of the int type. So, if you use type hints and you annotate that a function/method returns either an int or a bool (i.e., -> int | bool or -> Union[int, bool]), then mypy (or any other static type checker) won't be able to correctly determine the return type of such a function/method. That's something you need to be aware of. There's no fix for this.
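
The subtype relationship is easy to check at runtime. The following sketch only uses built-ins; the dictionary name checks is illustrative.

```python
checks = {
    "bool_is_subclass_of_int": issubclass(bool, int),
    "true_is_instance_of_int": isinstance(True, int),
    "true_equals_one": True == 1,
    "false_equals_zero": False == 0,
    "type_is_still_bool": type(True) is bool,
    # Arithmetic treats True as 1, which is handy for counting.
    "count_of_trues": sum([True, True, False]),
}
```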

A closely related topic: Python's or and and short-circuit. In a logical or operation, if any argument is true, then the whole thing will be true and nothing else needs to be evaluated; Python promptly returns that "true" value. If it finishes and nothing was true, it returns the last argument it handled, which will be a "false" value.
and is the opposite: if it sees any false value, it promptly exits with that "false" value; if it gets through all of them, it returns the final "true" value.
>>> 1 or 2 # first value TRUE, second value doesn't matter
1
>>> 1 and 2 # first value TRUE, second value might matter
2
>>> 0 or 0.0 # first value FALSE, second value might matter
0.0
>>> 0 and 0.0 # first value FALSE, second value doesn't matter
0
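
To see that the skipped operand is really never evaluated, here is a small sketch that records each evaluation. The helper record and the list calls are hypothetical names used only for this illustration.

```python
calls = []


def record(name, value):
    """Log that this operand was evaluated, then return its value."""
    calls.append(name)
    return value


r1 = record("a", 1) or record("b", 2)     # "b" is never evaluated
r2 = record("c", 0) and record("d", 3)    # "d" is never evaluated
r3 = record("e", 0) or record("f", 0.0)   # both are evaluated
```

After running this, calls contains only the operands that were actually evaluated: a, c, e and f.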

From a boolean point of view they both behave the same, both return a value that evaluates to false.
or just "reuses" the values that it is given, returning the left one if that was true and the right one otherwise.

Condition1 or Condition2
If Condition1 is false, then evaluate and return Condition2.
None evaluates to False.
