Different pandas DataFrame logical operation result when changing the order - python

My code is like:
import numpy as np
import pandas as pd

a = pd.DataFrame([np.nan, True])
b = pd.DataFrame([True, np.nan])
c = a | b
print(c)
I don't know what the result of a logical operation should be when one element is np.nan, but I expect it to be the same whatever the order. Instead I got this result:
       0
0  False
1   True
Why? Is this about short-circuiting in pandas? I searched the pandas docs but did not find an answer.
My pandas version is 1.1.3

This is behaviour that is tied to np.nan, not pandas. Take the following examples:
print(True or np.nan)
print(np.nan or True)
Output:
True
nan
When the operation is performed, the dtype ends up mattering, and the way np.nan behaves within the numpy library is what leads to this strange behaviour.
To get around this quirk, you can fill NaN values with False (or some other sentinel value that evaluates to False) using pandas.DataFrame.fillna().
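For example, a minimal sketch of that workaround (reusing a and b from the question); filling both sides with False before the | makes the result the same regardless of order:
import numpy as np
import pandas as pd

a = pd.DataFrame([np.nan, True])
b = pd.DataFrame([True, np.nan])

# Treat missing values as False before combining, so the OR is symmetric.
c = a.fillna(False) | b.fillna(False)
print(c)
#       0
# 0  True
# 1  True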

Related

Pandas comparing dataframe with series containing datetime

I'm trying to compare a DataFrame with a Series to check if one of the rows in the df is equal to the series, e.g.
import pandas as pd
import datetime as dt
d = pd.DataFrame([[1, dt.datetime(1990,12,10)],
                  [2, dt.datetime(1990,12,11)]])
s = d.loc[0].copy()
print(d == s) # or d.gt(s) which should do the same
This breaks with the following error
TypeError: int() argument must be a string, a bytes-like object or a number, not 'Timestamp'
Comparing the values yields the expected results:
d.values == s.values
array([[ True,  True],
       [False, False]], dtype=bool)
Also this error isn't raised using strings:
d = pd.DataFrame([[1, "a"], [2, "b"]])
s = d.loc[0].copy()
print(s == d)
#        0      1
# 0   True   True
# 1  False  False
Is this a bug in pandas or am I doing something wrong?
EDIT:
I'm using Python 3.6 with pandas 0.20.3.
I opened an issue on the pandas GitHub: 17411
As mentioned in the comments (and probably worth adding to the question), this works for strings, so I don't see why it should not work for datetimes.
The discussion on GitHub here suggests that there is an ongoing debate regarding whether a datetime should compare as False against a number or not.
If you print d and s you get the following:
d:
   0          1
0  1 1990-12-10
1  2 1990-12-11
s:
0                      1
1    1990-12-10 00:00:00
Name: 0, dtype: object
In s, the numbers 0, 1 on the left are the index (which is the key by which s == d aligns), so your code is comparing 1 to 1 and then 2 against 1990-12-10 00:00:00 - which is why you get your error.
As to why this works with values: .values gives back the numpy array without the indexes, so the comparison is done on the shape you were expecting rather than being aligned on the indexes.
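Building on that, a minimal sketch (reusing d and s from the question) for checking which rows of d match the row held in s while sidestepping index alignment:
import datetime as dt
import pandas as pd

d = pd.DataFrame([[1, dt.datetime(1990, 12, 10)],
                  [2, dt.datetime(1990, 12, 11)]])
s = d.loc[0].copy()

# Compare on the raw numpy values (no index alignment), then collapse
# each row into a single True/False with all(axis=1).
row_matches = pd.Series((d.values == s.values).all(axis=1), index=d.index)
print(row_matches)
# 0     True
# 1    False
# dtype: bool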
I think the problem is that you're comparing two objects which don't support comparison.
I'd try something like this:
d[d[1] == dt.datetime(1990,12,10)]
The problem was that I had version 0.20.3, which is the latest available version through pip or conda.
Version 0.21, the latest dev version on GitHub, seems to have solved the issue.
I'll delete the question as soon as version 0.21 is on PyPI.

Comparing logical values to NaN in pandas/numpy

I want to do an element-wise OR operation on two pandas Series of boolean values. np.nans are also included.
I have tried three approaches and realized that the expression "np.nan or False" can be evaluated to True, False, or np.nan depending on the approach.
These are my example Series:
series_1 = pd.Series([True, False, np.nan])
series_2 = pd.Series([False, False, False])
Approach #1
Using the | operator of pandas:
In [5]: series_1 | series_2
Out[5]:
0 True
1 False
2 False
dtype: bool
Approach #2
Using the logical_or function from numpy:
In [6]: np.logical_or(series_1, series_2)
Out[6]:
0 True
1 False
2 NaN
dtype: object
Approach #3
I define a vectorized version of logical_or which is supposed to be evaluated row-by-row over the arrays:
@np.vectorize
def vectorized_or(a, b):
    return np.logical_or(a, b)
I use vectorized_or on the two series and convert its output (which is a numpy array) into a pandas Series:
In [8]: pd.Series(vectorized_or(series_1, series_2))
Out[8]:
0 True
1 False
2 True
dtype: bool
Question
I am wondering about the reasons for these results.
This answer explains np.logical_or and says np.logical_or(np.nan, False) is True, but why does this only work when vectorized and not in Approach #2? And how can the results of Approach #1 be explained?
First difference: | is np.bitwise_or. This explains the difference between #1 and #2.
Second difference: since series_1.dtype is object (non-homogeneous data), operations are done row by row in the first two cases.
When using vectorize ( #3):
The data type of the output of vectorized is determined by calling
the function with the first element of the input. This can be avoided
by specifying the otypes argument.
For vectorized operations, you leave object mode: the data are first converted according to the first element (bool here, and bool(nan) is True), and the operations are done afterwards.
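A minimal sketch of that conversion step (reusing series_1 from the question); casting the object Series to bool mirrors what the vectorized call effectively does, so the NaN entry becomes True:
import numpy as np
import pandas as pd

series_1 = pd.Series([True, False, np.nan])

print(bool(np.nan))           # True: nan is a non-zero float, hence truthy
print(series_1.astype(bool))  # the NaN entry turns into True after the cast
# 0     True
# 1    False
# 2     True
# dtype: bool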

Python pandas check if dataframe is not empty

I have an if statement where it checks if the data frame is not empty. The way I do it is the following:
if dataframe.empty:
    pass
else:
    #do something
But really I need:
if dataframe is not empty:
    #do something
My question is: is there a method .not_empty() to achieve this? I also wanted to ask whether the second version is better in terms of performance. Otherwise maybe it makes sense for me to leave it as it is, i.e. the first version?
Just do
if not dataframe.empty:
    # insert code here
The reason this works is because dataframe.empty returns True if dataframe is empty. To invert this, we can use the negation operator not, which flips True to False and vice-versa.
.empty returns a boolean value
>>> df_empty.empty
True
So "if not empty" can be written as
if not df.empty:
    #Your code
Check pandas.DataFrame.empty, it might help someone.
You can use the attribute dataframe.empty to check whether it's empty or not:
if not dataframe.empty:
    #do something
Or
if len(dataframe) != 0:
    #do something
Or
if len(dataframe.index) != 0:
    #do something
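A quick sanity check of the three variants on a freshly constructed, completely empty DataFrame (a minimal sketch):
import pandas as pd

df = pd.DataFrame()          # no rows, no columns

print(df.empty)              # True
print(len(df) != 0)          # False, so the "do something" branch is skipped
print(len(df.index) != 0)    # False, same conclusion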
As already clearly explained by other commentators, you can negate a boolean expression in Python by simply prepending the not operator, hence:
if not df.empty:
    # do something
does the trick.
I only want to clarify the meaning of "empty" in this context, because it was a bit confusing for me at first.
According to the pandas documentation, the DataFrame.empty attribute returns True if any of the axes in the DataFrame are of length 0.
As a consequence, "empty" doesn't mean zero rows and zero columns, like someone might expect. A dataframe with zero rows (axis 0 is empty) but non-zero columns (axis 1 is not empty) is still considered empty:
> df = pd.DataFrame(columns=["A", "B", "C"])
> df.empty
True
Another interesting point highlighted in the documentation is that a DataFrame that only contains NaNs is not considered empty.
> df = pd.DataFrame(columns=["A", "B", "C"], index=['a', 'b', 'c'])
> df
     A    B    C
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c  NaN  NaN  NaN
> df.empty
False
No doubt that the use of empty is the most comprehensive in this case (explicit is better than implicit).
However, the most efficient in terms of computation time is through the use of len:
if not len(df.index) == 0:
    # insert code here
Source: this answer.
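If you want to verify that claim on your own data, a minimal timing sketch (the absolute numbers depend on your machine and pandas version):
import timeit
import pandas as pd

df = pd.DataFrame({"A": range(1000)})

print(timeit.timeit(lambda: df.empty, number=100_000))
print(timeit.timeit(lambda: len(df.index) == 0, number=100_000))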
Another way:
if dataframe.empty == False:
    #do something

Erratic NaN behaviour in numpy/pandas

I've been trying to replace missing values in a Pandas dataframe, but without success. I tried the .fillna method and also tried to loop through the entire data set, checking each cell and replacing NaNs with a chosen value. However, in both cases, Python executes the script without throwing up any errors, but the NaN values remain.
When I dug a bit deeper, I discovered behaviour that seems erratic to me, best demonstrated with an example:
In[ ] X['Smokinginpregnancy'].head()
Out[ ]
Index
E09000002 NaN
E09000003 5.216126
E09000004 10.287496
E09000005 3.090379
E09000006 6.080041
Name: Smokinginpregnancy, dtype: float64
I know for a fact that the first item in this column is missing and pandas recognises it as NaN. In fact, if I call this item on its own, python tells me it's NaN:
In [ ] X['Smokinginpregnancy'][0]
Out [ ]
nan
However, when I test whether it's NaN, python returns False.
In [ ] X['Smokinginpregnancy'][0] == np.nan
Out [ ] False
I suspect that when .fillna is being executed, python checks whether the item is NaN but gets back a False, so it continues, leaving the cell alone.
Does anyone know what's going on? Any solutions? (apart from opening the csv file in excel and then manually replacing the values.)
I'm using Anaconda's Python 3 distribution.
You are doing:
X['Smokinginpregnancy'][0] == np.nan
This is guaranteed to return False because all NaNs compare unequal to everything under the IEEE 754 standard:
>>> x = float('nan')
>>> x == x
False
>>> x == 1
False
>>> x == float('nan')
False
See also here.
You have to use math.isnan to check for NaNs:
>>> math.isnan(x)
True
Or numpy.isnan
So use:
numpy.isnan(X['Smokinginpregnancy'][0])
Regarding pandas.fillna note that this function returns the filled array. Maybe you did something like:
X.fillna(...)
without reassigning X? Alternatively you must pass inplace=True to mutate the dataframe on which you are calling the method.
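A minimal sketch of both options, using a small made-up frame with the same column name as the question:
import numpy as np
import pandas as pd

X = pd.DataFrame({'Smokinginpregnancy': [np.nan, 5.216126, 10.287496]})

# Option 1: fillna returns a new object, so reassign the result.
X = X.fillna(0)

# Option 2: mutate the existing frame instead of reassigning.
# X.fillna(0, inplace=True)

print(X)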
NaN values in pandas can be checked with the function pandas.isnull. I create a boolean mask and return the subset with NaN values.
The function fillna can be used for the single column Smokinginpregnancy (more info in the docs):
X['Smokinginpregnancy'] = X['Smokinginpregnancy'].fillna('100')
or
X['Smokinginpregnancy'].fillna('100', inplace=True)
Warning:
Sometimes inplace=True can be silently ignored, so it is better not to use it - link, github, github 3 comments.
All together:
print(X['Smokinginpregnancy'].head())
#Index
#E09000002 NaN
#E09000003 5.216126
#E09000004 10.287496
#E09000005 3.090379
#E09000006 6.080041
#check NaN in column Smokinginpregnancy by boolean mask
mask = pd.isnull(X['Smokinginpregnancy'])
XNaN = X[mask]
print(XNaN)
# Smokinginpregnancy
#Index
#E09000002 NaN
#use function fillna for column Smokinginpregnancy
#X['Smokinginpregnancy'] = X['Smokinginpregnancy'].fillna('100')
X['Smokinginpregnancy'].fillna('100', inplace=True)
print(X)
# Smokinginpregnancy
#Index
#E09000002 100
#E09000003 5.216126
#E09000004 10.2875
#E09000005 3.090379
#E09000006 6.080041
More information on why the comparison doesn't work:
One has to be mindful that in Python (and numpy), NaNs don't compare equal, but Nones do. Note that pandas/numpy uses the fact that np.nan != np.nan, and treats None like np.nan. More info in Bakuriu's answer.
In [11]: None == None
Out[11]: True
In [12]: np.nan == np.nan
Out[12]: False

Proper way to use "opposite boolean" in Pandas data frame boolean indexing

I wanted to use boolean indexing, checking for rows of my data frame where a particular column does not have NaN values. So, I did the following:
import pandas as pd
my_df.loc[pd.isnull(my_df['col_of_interest']) == False].head()
to see a snippet of that data frame, including only the values that are not NaN (most values are NaN).
It worked, but seems less-than-elegant. I'd want to type:
my_df.loc[!pd.isnull(my_df['col_of_interest'])].head()
However, that generated an error. I also spend a lot of time in R, so maybe I'm confusing things. In Python, I usually use "not" where I can, for instance if x is not None:, but I couldn't really do it here. Is there a more elegant way? I don't like having to put in a senseless comparison.
In general with pandas (and numpy), we use the bitwise NOT ~ instead of ! or not (whose behaviour can't be overridden by types).
While in this case we have notnull, ~ can come in handy in situations where there's no special opposite method.
>>> df = pd.DataFrame({"a": [1, 2, np.nan, 3]})
>>> df.a.isnull()
0 False
1 False
2 True
3 False
Name: a, dtype: bool
>>> ~df.a.isnull()
0 True
1 True
2 False
3 True
Name: a, dtype: bool
>>> df.a.notnull()
0 True
1 True
2 False
3 True
Name: a, dtype: bool
(For completeness I'll note that -, the unary negative operator, will also work on a boolean Series but ~ is the canonical choice, and - has been deprecated for numpy boolean arrays.)
Instead of using pandas.isnull(), you should use pandas.notnull() to find the rows where the column has non-null values. Example -
import pandas as pd
my_df.loc[pd.notnull(my_df['col_of_interest'])].head()
pandas.notnull() is the boolean inverse of pandas.isnull(), as given in the documentation:
See also: pandas.notnull - boolean inverse of pandas.isnull
