Understand any() and nan in Pandas

I have the same problem as in: Pandas series.all() returns nan
In [88]: pd.Series([False, np.nan]).any()
Out[88]: nan
whereas:
In [84]: np.any([False, np.nan])
Out[84]: True
and also:
In [99]: pd.DataFrame([False, np.nan]).any()
Out[99]:
0 False
dtype: bool
I was curious what the explanation is for these three different behaviors.

The difference here has nothing to do with the two different types implementing any differently. In fact, the docs for pandas.Series.any and numpy.ndarray.any both explicitly say "Refer to numpy.any for full documentation", because they both effectively just call numpy.any.
The difference is that you have different dtypes in the two cases. Creating a NumPy ndarray, implicitly or explicitly, from mixed numeric types coerces them to a common type if possible, so you end up with float64, while a Pandas Series built from the same values keeps the original Python objects, which means you end up with dtype object.
If you specify the dtype explicitly, you can see that they do the same thing:
>>> a = np.array([False, np.nan])
>>> a
array([ 0., nan])
>>> a.dtype
dtype('float64')
>>> a.any()
True
>>> a = np.array([False, np.nan], dtype=object)
>>> a
array([False, nan], dtype=object)
>>> a.any()
nan
>>> p = pd.Series([False, np.nan])
>>> p
0 False
1 NaN
>>> p.dtype
dtype('O')
>>> p.any()
nan
>>> p = pd.Series([False, np.nan], dtype=np.float64)
>>> p
0 0.0
1 NaN
dtype: float64
>>> p.any()
True
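One way to sidestep the dtype-dependent behavior entirely is to decide up front how a missing value should count before calling any(). A minimal sketch (fillna and astype are standard pandas methods; the choice of False or True as the fill value is yours):
>>> s = pd.Series([False, np.nan])
>>> s.fillna(False).astype(bool).any()   # treat NaN as False
False
>>> s.fillna(True).astype(bool).any()    # treat NaN as True
True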


Python language construct when filtering an array

I can see many questions on SO about how to filter an array (usually using pandas or numpy). The examples are very simple:
df = pd.DataFrame({ 'val': [1, 2, 3, 4, 5, 6, 7] })
a = df[df.val > 3]
Intuitively I understand the df[df.val > 3] statement, but it confuses me from a syntax point of view. In other languages, I would expect a lambda function instead of df.val > 3.
The question is: why is this style possible, and what is going on under the hood?
Update 1:
To be clearer about the confusing part: in other languages I have seen the following syntax:
df.filter(row => row.val > 3)
There I understand what is going on under the hood: on every iteration the lambda function is called with the row as an argument and returns a boolean value. But here df.val > 3 doesn't make sense to me, because df.val looks like a column.
Moreover, I can write df[df > 3] and it executes successfully, which drives me crazy because I don't understand how a DataFrame object can be compared to a number.
Create an array and dataframe from it:
In [104]: arr = np.arange(1,8); df = pd.DataFrame({'val':arr})
In [105]: arr
Out[105]: array([1, 2, 3, 4, 5, 6, 7])
In [106]: df
Out[106]:
val
0 1
1 2
2 3
3 4
4 5
5 6
6 7
numpy arrays have methods and operators that operate on the whole array. For example you can multiply the array by a scalar, add a scalar to all elements, or, as in this case, compare each element to a scalar. That's all implemented by the class (numpy.ndarray), not by special Python syntax.
In [107]: arr>3
Out[107]: array([False, False, False, True, True, True, True])
Similarly, pandas implements these methods (or uses numpy methods on the underlying arrays). Select a column of the frame with df['val'] or df.val:
In [108]: df.val
Out[108]:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
Name: val, dtype: int32
This is a pandas Series (slight difference in display).
It can be compared to a scalar - as with the array:
In [110]: df.val>3
Out[110]:
0 False
1 False
2 False
3 True
4 True
5 True
6 True
Name: val, dtype: bool
And the boolean array can be used to index the frame:
In [111]: df[arr>3]
Out[111]:
val
3 4
4 5
5 6
6 7
The boolean Series also works:
In [112]: df[df.val>3]
Out[112]:
val
3 4
4 5
5 6
6 7
Boolean array indexing works the same way:
In [113]: arr[arr>3]
Out[113]: array([4, 5, 6, 7])
Here I use the indexing to fetch values; setting values is analogous.
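To make it concrete that this is ordinary operator overloading rather than special syntax, here is a minimal sketch of a toy class (TinyFrame is a made-up name for illustration) that supports the same df[df > 3] pattern:
import numpy as np

class TinyFrame:
    """Toy container: __gt__ builds a boolean mask, __getitem__ filters with it."""
    def __init__(self, values):
        self.values = np.asarray(values)
    def __gt__(self, other):           # called for tf > 3
        return self.values > other     # returns a boolean numpy array
    def __getitem__(self, mask):       # called for tf[mask]
        return TinyFrame(self.values[mask])
    def __repr__(self):
        return f"TinyFrame({self.values.tolist()})"

tf = TinyFrame([1, 2, 3, 4, 5, 6, 7])
print(tf > 3)       # [False False False  True  True  True  True]
print(tf[tf > 3])   # TinyFrame([4, 5, 6, 7])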

Why does a nan of type <class 'numpy.float64'> return -9223372036854775808 as an int64?

I came across some behavior that I find odd and replicated it. Simply put, why does
np.int64(np.float64(np.nan))
output
-9223372036854775808
(as pointed out in the comments, this is -2^63, the most negative value a signed int64 can hold)
In case it is relevant or of interest, my original use case was looking at dataframe indices of type np.float64 and converting to np.int64 (I don't normally nest types for no reason as in the simplified example above).
Starting with an example dataframe:
     0  1
NaN  1  2
1.0  3  4
NaN  5  6
then running:
print(df.index.values[0])
print(type(df.index.values[0]))
print(df.index.values[0].astype(np.int64))
print(type(df.index.values[0].astype(np.int64)))
prints:
nan
<class 'numpy.float64'>
-9223372036854775808
<class 'numpy.int64'>
However, using plain Python types you can't convert the float nan this way:
print(np.nan)
print(type(np.nan))
print(np.nan.astype(np.int64))
out:
nan
<class 'float'>
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-130-0d779433eac7> in <module>
1 print(np.nan)
2 print(type(np.nan))
----> 3 print(np.nan.astype(np.int64))
AttributeError: 'float' object has no attribute 'astype'
Although in practice I was able to just change the NaNs to a value I knew would not be a key (0), I was curious: why do np.float64 objects behave this way?
Your df.index.values is a numpy array. Take a similar float array:
In [34]: a = np.array([np.nan, 1.0, np.inf]); a
Out[34]: array([nan, 1., inf])
In [35]: a.dtype
Out[35]: dtype('float64')
Arrays have an astype method, and the developers chose to convert special floats like nan to some integer value (or, as discussed, to let the compiler/processor do it). The alternative would have been to raise an error.
In [36]: b=a.astype(int)
In [37]: b
Out[37]: array([-9223372036854775808, 1, -9223372036854775808])
In [38]: b.dtype
Out[38]: dtype('int64')
np.int32, np.uint16, etc. produce different values.
An object created with the np.float64 function is a lot like a 0d array: it has many of the same attributes and methods, including astype:
In [39]: np.float64(np.nan)
Out[39]: nan
In [40]: np.array(np.nan)
Out[40]: array(nan)
In [41]: Out[39].astype(int)
Out[41]: -9223372036854775808
In [42]: Out[40].astype(int)
Out[42]: array(-9223372036854775808)
np.nan, on the other hand, is a plain Python float object and does not have an astype method.
And Python's int doesn't like doing the conversion either:
In [52]: int(np.nan)
Traceback (most recent call last):
File "<ipython-input-52-03e21f51ddd3>", line 1, in <module>
int(np.nan)
ValueError: cannot convert float NaN to integer
astype is a method of NumPy arrays and NumPy scalar types, not of plain Python floats, which is why np.nan.astype(np.int64) raises an AttributeError; and, as shown above, int(np.nan) raises a ValueError, so check for NaN before converting.
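If the goal is simply to avoid that surprising sentinel, here is a small sketch of two safer options (assuming a float array like the index above; the sentinel 0 and the nullable "Int64" dtype are choices, not requirements):
import numpy as np
import pandas as pd

a = np.array([np.nan, 1.0, np.nan])

# Option 1: pick the sentinel yourself instead of relying on the undefined NaN -> int cast
ints = np.where(np.isnan(a), 0, a).astype(np.int64)
print(ints)                      # [0 1 0]

# Option 2: keep the missing values with pandas' nullable integer dtype (recent pandas versions)
s = pd.Series(a).astype("Int64")
print(s)                         # 0    <NA>   1    1   2    <NA>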

Numpy/Pandas clean way to check if a specific value is NaN

How can I check if a given value is NaN?
e.g. if (a == np.NaN) (doesn't work)
Please note that:
Numpy's isnan method throws errors with data types like string
Pandas docs only provide methods to drop rows containing NaNs, or ways to check if/when DataFrame contains NaNs. I'm asking about checking if a specific value is NaN.
Relevant Stackoverflow questions and Google search results seem to be about checking "if any value is NaN" or "which values in a DataFrame"
There must be a clean way to check if a given value is NaN?
You can use the innate property that NaN != NaN
so a == a will return False if a is NaN
This will work even for strings
Example:
In[52]:
s = pd.Series([1, np.NaN, '', 1.0])
s
Out[52]:
0 1
1 NaN
2
3 1
dtype: object
for val in s:
    print(val == val)
True
False
True
True
This can be done in a vectorised manner:
In[54]:
s==s
Out[54]:
0 True
1 False
2 True
3 True
dtype: bool
but you can still use the method isnull on the whole series:
In[55]:
s.isnull()
Out[55]:
0 False
1 True
2 False
3 False
dtype: bool
UPDATE
As noted by @piRSquared, None == None returns True, yet pd.isnull(None) also returns True; so the == trick does not flag None as missing, while pd.isnull does. Depending on whether you want None treated as NaN, you can still use == for the comparison, or pd.isnull if you want None counted as NaN.
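Putting the two scalar checks side by side (is_missing is a hypothetical helper name; pd.isna also accepts scalars):
import numpy as np
import pandas as pd

def is_missing(x):
    # x != x is True only for NaN; pd.isna additionally treats None as missing
    return x != x

for value in [1, np.nan, '', None]:
    print(repr(value), is_missing(value), pd.isna(value))
# 1     False  False
# nan   True   True
# ''    False  False
# None  False  True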
Pandas has isnull, notnull, isna, and notna
These functions work for arrays or scalars.
Setup
a = np.array([[1, np.nan],
[None, '2']])
Pandas functions
pd.isna(a)
# same as
# pd.isnull(a)
array([[False, True],
[ True, False]])
pd.notnull(a)
# same as
# pd.notna(a)
array([[ True, False],
[False, True]])
DataFrame (or Series) methods
b = pd.DataFrame(a)
b.isnull()
# same as
# b.isna()
0 1
0 False True
1 True False
b.notna()
# same as
# b.notnull()
0 1
0 True False
1 False True

How do numpy functions operate on pandas objects internally?

Numpy functions, eg np.mean(), np.var(), etc, accept an array-like argument, like np.array, or list, etc.
But passing in a pandas dataframe also works. This means that a pandas dataframe can indeed disguise itself as a numpy array, which I find a little strange (despite knowing that the underlying values of a df are indeed numpy arrays).
For an object to be array-like, I thought it should be sliceable using integer indexing the way a numpy array is sliced. So, for instance, df[1:3, 2:3] should work, but it leads to an error.
So possibly a dataframe gets converted into a numpy array when it goes inside the function. But if that is the case, why does np.mean(numpy_array) give a different result than np.mean(df)?
a = np.random.rand(4,2)
a
Out[13]:
array([[ 0.86688862, 0.09682919],
[ 0.49629578, 0.78263523],
[ 0.83552411, 0.71907931],
[ 0.95039642, 0.71795655]])
np.mean(a)
Out[14]: 0.68320065182041034
gives a different result than the following...
df = pd.DataFrame(data=a, index=range(np.shape(a)[0]),
columns=range(np.shape(a)[1]))
df
Out[18]:
0 1
0 0.866889 0.096829
1 0.496296 0.782635
2 0.835524 0.719079
3 0.950396 0.717957
np.mean(df)
Out[21]:
0 0.787276
1 0.579125
dtype: float64
The former output is a single number, whereas the latter is a column-wise mean. How does a numpy function know about the structure of a dataframe?
If you step through this:
--Call--
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2796)mean()
-> def mean(a, axis=None, dtype=None, out=None, keepdims=False):
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2877)mean()
-> if type(a) is not mu.ndarray:
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2878)mean()
-> try:
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2879)mean()
-> mean = a.mean
You can see that the type is not an ndarray, so it tries to call a.mean, which in this case is df.mean():
In [6]:
df.mean()
Out[6]:
0 0.572999
1 0.468268
dtype: float64
This is why the output is different.
Code to reproduce above:
In [3]:
a = np.random.rand(4,2)
a
Out[3]:
array([[ 0.96750329, 0.67623187],
[ 0.44025179, 0.97312747],
[ 0.07330062, 0.18341157],
[ 0.81094166, 0.04030253]])
In [4]:
np.mean(a)
Out[4]:
0.52063384885403818
In [5]:
df = pd.DataFrame(data=a, index=range(np.shape(a)[0]),
columns=range(np.shape(a)[1]))
​
df
Out[5]:
0 1
0 0.967503 0.676232
1 0.440252 0.973127
2 0.073301 0.183412
3 0.810942 0.040303
numpy output:
In [7]:
np.mean(df)
Out[7]:
0 0.572999
1 0.468268
dtype: float64
If you call .values to get the underlying np array, the output is the same:
In [8]:
np.mean(df.values)
Out[8]:
0.52063384885403818
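The same delegation can be seen with a toy object that exposes its own mean method; np.mean hands the work off to it instead of flattening anything (a sketch based on the dispatch shown in the debugger trace above; exact keyword handling varies between NumPy versions):
import numpy as np

class HasMean:
    # np.mean sees this is not an ndarray, finds .mean, and calls it
    def mean(self, axis=None, dtype=None, out=None, **kwargs):
        return "HasMean.mean was called"

print(np.mean(HasMean()))   # HasMean.mean was called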

Python: max/min builtin functions depend on parameter order

max(float('nan'), 1) evaluates to nan
max(1, float('nan')) evaluates to 1
Is it the intended behavior?
Thanks for the answers.
max raises an exception when the iterable is empty. Why wouldn't Python's max raise an exception when nan is present? Or at least do something useful, like return nan or ignore nan. The current behavior is very unsafe and seems completely unreasonable.
I found an even more surprising consequence of this behavior, so I just posted a related question.
In [19]: 1>float('nan')
Out[19]: False
In [20]: float('nan')>1
Out[20]: False
The float nan is neither bigger nor smaller than the integer 1.
max starts by choosing the first element, and only replaces it when it finds an element which is strictly larger.
In [31]: max(1,float('nan'))
Out[31]: 1
Since nan is not larger than 1, 1 is returned.
In [32]: max(float('nan'),1)
Out[32]: nan
Since 1 is not larger than nan, nan is returned.
PS. Note that np.max treats float('nan') differently:
In [36]: import numpy as np
In [91]: np.max([1,float('nan')])
Out[91]: nan
In [92]: np.max([float('nan'),1])
Out[92]: nan
but if you wish to ignore np.nans, you can use np.nanmax:
In [93]: np.nanmax([1,float('nan')])
Out[93]: 1.0
In [94]: np.nanmax([float('nan'),1])
Out[94]: 1.0
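If you want to stay in plain Python, a small sketch that filters NaNs with math.isnan before calling max (assuming a list of floats):
import math

data = [1.0, float('nan'), 3.0]
print(max(x for x in data if not math.isnan(x)))   # 3.0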
I haven't seen this before, but it makes sense. Notice that nan is a very weird object:
>>> x = float('nan')
>>> x == x
False
>>> x > 1
False
>>> x < 1
False
I would say that the behaviour of max is undefined in this case -- what answer would you expect? The only sensible behaviour is to assume that the operations are antisymmetric.
Notice that you can reproduce this behaviour by making a broken class:
>>> class Broken(object):
...     __le__ = __ge__ = __eq__ = __lt__ = __gt__ = __ne__ = \
...         lambda self, other: False
...
>>> x = Broken()
>>> x == x
False
>>> x < 1
False
>>> x > 1
False
>>> max(x, 1)
<__main__.Broken object at 0x024B5B50>
>>> max(1, x)
1
max works the following way:
The first item is set as the running maximum, and then each subsequent item is compared to it. With nan, the comparison will always return False:
>>> float('nan') < 1
False
>>> float('nan') > 1
False
So if the first value is nan, then (since the comparison returns False) it will not be replaced in the next step.
On the other hand, if 1 comes first, the same thing happens: the comparison returns False, so 1 is kept and ends up being returned as the maximum.
You can verify this in the CPython source; just look up the function min_max in Python/bltinmodule.c.
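A minimal Python sketch of that logic (not the actual C implementation, just the comparison pattern it describes):
def naive_max(iterable):
    it = iter(iterable)
    best = next(it)          # the first item seeds the running maximum
    for item in it:
        if item > best:      # every comparison involving nan is False,
            best = item      # so nan is never replaced and never replaces anything
    return best

print(naive_max([float('nan'), 1]))   # nan
print(naive_max([1, float('nan')]))   # 1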
