Proper way to use "opposite boolean" in Pandas data frame boolean indexing - python

I wanted to use boolean indexing to check for rows of my data frame where a particular column does not have NaN values. So I did the following:
import pandas as pd
my_df.loc[pd.isnull(my_df['col_of_interest']) == False].head()
to see a snippet of that data frame, including only the values that are not NaN (most values are NaN).
It worked, but it seems less than elegant. Ideally, I'd type:
my_df.loc[!pd.isnull(my_df['col_of_interest'])].head()
However, that generated an error. I also spend a lot of time in R, so maybe I'm confusing things. In Python, I usually use not where I can, for instance if x is not None:, but I couldn't really do that here. Is there a more elegant way? I don't like having to put in a senseless comparison.

In general with pandas (and NumPy), we use the bitwise NOT ~ instead of ! (which is not a Python operator) or not (whose behaviour can't be overridden by types).
While in this case we have notnull, ~ can come in handy in situations where there's no special opposite method.
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, np.nan, 3]})
>>> df.a.isnull()
0    False
1    False
2     True
3    False
Name: a, dtype: bool
>>> ~df.a.isnull()
0     True
1     True
2    False
3     True
Name: a, dtype: bool
>>> df.a.notnull()
0     True
1     True
2    False
3     True
Name: a, dtype: bool
(For completeness I'll note that -, the unary negative operator, will also work on a boolean Series but ~ is the canonical choice, and - has been deprecated for numpy boolean arrays.)
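Applied back to the original question, either spelling filters out the NaN rows. A minimal, self-contained sketch (the frame and column here are stand-ins for the asker's my_df):

import numpy as np
import pandas as pd

my_df = pd.DataFrame({"col_of_interest": [1.0, np.nan, 3.0]})

# Both keep only the rows where the column is not NaN:
print(my_df.loc[~my_df["col_of_interest"].isnull()].head())
print(my_df.loc[my_df["col_of_interest"].notnull()].head())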

Instead of using pandas.isnull(), you should use pandas.notnull() to find the rows where the column has non-null values. Example -
import pandas as pd
my_df.loc[pd.notnull(my_df['col_of_interest'])].head()
pandas.notnull() is the boolean inverse of pandas.isnull(), as given in the documentation:
See also: pandas.notnull, boolean inverse of pandas.isnull

Related

Different pandas DataFrame logical operation result when changing the order

My code is like:
import numpy as np
import pandas as pd

a = pd.DataFrame([np.nan, True])
b = pd.DataFrame([True, np.nan])
c = a | b
print(c)
I don't know what the result of a logical operation should be when one element is np.nan, but I expect it to be the same regardless of the order. However, I got this result:
       0
0  False
1   True
Why? Is this about short-circuiting in pandas? I searched the pandas docs but did not find an answer.
My pandas version is 1.1.3.
This behaviour is tied to np.nan, not to pandas. Take the following examples:
print(True or np.nan)
print(np.nan or True)
Output:
True
nan
When performing the operation, the dtype ends up mattering, and the way np.nan behaves within NumPy is what leads to this strange result.
To get around this quirk, you can fill NaN values with False (or some other token value that evaluates to False) using pandas.DataFrame.fillna().
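For example, a minimal sketch of that workaround using the frames from the question:

import numpy as np
import pandas as pd

a = pd.DataFrame([np.nan, True])
b = pd.DataFrame([True, np.nan])

# Fill NaN with False first, so the result no longer depends on the order:
print(a.fillna(False) | b.fillna(False))
print(b.fillna(False) | a.fillna(False))  # same output both ways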

How can I drop several rows from my Dataframe?

I have a dataframe (called my_df1) and want to drop several rows based on certain dates. How can I create a new dataframe (my_df2) without the dates '2020-05-01' and '2020-05-04'?
I tried the following which did not work as you can see below:
my_df2 = mydf_1[(mydf_1['Date'] != '2020-05-01') | (mydf_1['Date'] != '2020-05-04')]
my_df2.head()
The problem is with your logical operator.
You should be using AND (&) here instead of OR (|), since you want the rows whose date is neither 2020-05-01 nor 2020-05-04. With |, every row is kept, because any date differs from at least one of the two values.
Note that the bitwise operators are evaluated elementwise and do not short-circuit, hence the result.
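In code, the corrected condition from the question would look like this (a sketch using the asker's names):

my_df2 = mydf_1[(mydf_1['Date'] != '2020-05-01') & (mydf_1['Date'] != '2020-05-04')]
my_df2.head()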
You can use isin with negation ~ sign:
dates=['2020-05-01', '2020-05-04']
my_df2 = mydf_1[~mydf_1['Date'].isin(dates)]
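A self-contained sketch with toy data (frame and column names taken from the question):

import pandas as pd

mydf_1 = pd.DataFrame({'Date': ['2020-05-01', '2020-05-02', '2020-05-04'],
                       'value': [1, 2, 3]})

dates = ['2020-05-01', '2020-05-04']
my_df2 = mydf_1[~mydf_1['Date'].isin(dates)]
print(my_df2)
#          Date  value
# 1  2020-05-02      2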
The short explanation of your mistake with AND and OR was addressed by kanmaytacker.
Following are a few additional recommendations:
Indexing in pandas:
By label: .loc
By integer position: .iloc
By label also works without .loc, but it is slower, as it is composed of chained operations instead of a single internal operation consisting of nested loops (see here). Also, with .loc you can select on more than one axis at a time.
# Example with rows. The same logic applies to columns or additional axes.
df.loc[(df['a'] != 4) & (df['a'] != 1), :]  # ".loc" is the only addition
>>>
   a  b  c
2  0  4  6
Your index is a boolean mask. This is true for NumPy and, as a consequence, for pandas too.
(df['a'] != 4) & (df['a'] != 1)
>>>
0    False
1    False
2     True
Name: a, dtype: bool

Bool and missing values in pandas

I am trying to figure out whether a column in a pandas dataframe is boolean (and if so, whether it has missing values, and so on).
In order to test the function that I created, I tried to create a dataframe with a boolean column containing missing values. However, missing values seem to be handled as 'untyped' in Python, and there are some weird behaviours:
> boolean = pd.Series([True, False, None])
> print(boolean)
0     True
1    False
2     None
dtype: object
So the moment you put None into the list, the Series is regarded as object, because Python cannot mix the types bool and type(None) = NoneType into a single bool dtype. The same thing happens with math.nan and numpy.nan. The weirdest things happen when you try to force pandas into an area it does not want to go :-)
> boolean = pd.Series([True, False, np.nan]).astype(bool)
> print(boolean)
0     True
1    False
2     True
dtype: bool
So np.nan is cast to True?
Questions:
Given a data table where one column is of type 'object' but is in fact a boolean column with missing values: how do I figure that out? After filtering for the non-missing values, it is still of type 'object'... do I need to try casting every column into every imaginable data type in order to see the true nature of the columns?
I guess there is a logical explanation for why np.nan is cast to True, but this is unwanted behaviour of pandas/Python itself, right? So should I file a bug report?
Q1: I would start by combining
np.any(pd.isna(boolean))
to identify whether there are any missing values in the column, with
set(boolean)
to identify whether only True, False and missing values are inside. Combined with filtering (and, if you prefer, typecasting), you should be done.
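Put together, a minimal sketch of that check (dropping the missing values before building the set is an assumption to keep the comparison clean):

import numpy as np
import pandas as pd

boolean = pd.Series([True, False, None])

has_missing = np.any(pd.isna(boolean))   # True: there is a missing value
values = set(boolean.dropna())           # {True, False}
is_bool_col = values <= {True, False}    # True: nothing but booleans besides NaN
print(has_missing, is_bool_col)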
Q2: see the comment by @WeNYoBen.
I've hit the same problem. I came up with the following solution:
from pandas import Series

def is_boolean_series(col: Series):
    # look at the first non-missing value and check its type
    val = col[~col.isna()].iloc[0]
    return type(val) == bool
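A quick usage check of this helper (a sketch; it assumes the column has at least one non-missing value):

import numpy as np
import pandas as pd

s = pd.Series([True, False, None])                   # dtype: object
print(is_boolean_series(s))                          # True
print(is_boolean_series(pd.Series([1.0, np.nan])))   # False: first value is a float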

How do I "re-group" my Series after performing an apply() on a SeriesGroupBy?

I need to adapt an existing function, that essentially performs a Series.str.contains and returns the resulting Series, to be able to handle SeriesGroupBy as input.
As suggested by the pandas error message
Cannot access attribute 'str' of 'SeriesGroupBy' objects, try using the 'apply' method
I have tried to use apply() on the SeriesGroupBy object, which works in a way but results in a plain Series object. I would now like to apply the same grouping as before to this Series.
Original function
def contains(series, expression):
    return series.str.contains(expression)
My attempt so far
>>> import pandas as pd
... from functools import partial
...
... def _f(series, expression):
...     return series.str.contains(expression)
...
... def contains(grouped_series, expression):
...     result = grouped_series.apply(partial(_f, expression=expression))
...     return result
>>> df = pd.DataFrame(zip([1,1,2,2], ['abc', 'def', 'abq', 'bcq']), columns=['group', 'text'])
>>> gdf = df.groupby('group')
>>> gs = gdf['text']
>>> type(gs)
<class 'pandas.core.groupby.generic.SeriesGroupBy'>
>>> r = contains(gdf['text'], 'b')
>>> r
0     True
1    False
2     True
3     True
Name: text, dtype: bool
>>> type(r)
<class 'pandas.core.series.Series'>
The desired result would be a boolean Series grouped by the same indices as the original grouped_series.
The actual result is a Series object without any grouping.
EDIT / CLARIFICATION:
The initial answers make me think I didn't stress the core of the problem enough. For the sake of the question, let's assume I cannot change anything outside of the contains(grouped_series, expression) function.
I think I know how to solve my problem if I approach it from another angle, and if I don't that would then become another question. The real world context makes it very complicated to change code outside of that one function. So I would really appreciate suggestions that work within that constraint.
So, let me rephrase the question as follows:
I'm looking for a function contains(grouped_series, expression), so that the following code works:
>>> df = pd.DataFrame(zip([1,1,2,2], ['abc', 'def', 'abq', 'bcq']), columns=['group', 'text'])
>>> grouped_series = contains(df.groupby('group')['text'], 'b')
>>> grouped_series.sum()
group
1    1.0
2    2.0
Name: text, dtype: float64
groupby is not needed unless you want to do something with the "group", like calculating its sum or checking whether all rows in the group contain the letter b. When you call apply on a GroupBy object, you can pass additional arguments to the applied function as keywords:
def contains(frame, expression):
    return frame['text'].str.contains(expression).all()

df.groupby('group').apply(contains, expression='b')
Result:
group
1    False
2     True
dtype: bool
I like to think that the first parameter to the function being applied (frame) is a smaller view of the original dataframe, being chopped up by the groupby clause.
That said, apply is pretty slow compared to specialized aggregate functions like min, max or sum. Use those as much as possible and save apply for complex cases.
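If the goal is the per-group counts from the question's expected output rather than an all() check, the same apply pattern works with sum() (a sketch reusing df from above):

def count_contains(frame, expression):
    # number of rows in this group whose text contains the pattern
    return frame['text'].str.contains(expression).sum()

df.groupby('group').apply(count_contains, expression='b')
# group
# 1    1
# 2    2
# dtype: int64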
Following the advice of the error message, you could use apply:
df.groupby('group').apply(lambda x : x.text.str.contains('b'))
Out[10]:
group
1      0     True
       1    False
2      2     True
       3     True
Name: text, dtype: bool
If you want to put these indices into your data set and return a DataFrame, use reset_index:
df.groupby('group').apply(lambda x : x.text.str.contains('b')).reset_index()
Out[11]:
   group  level_1   text
0      1        0   True
1      1        1  False
2      2        2   True
3      2        3   True
_f has absolutely no relationship to the groups. The way to deal with this is to instead define a column prior to grouping (not a separate function), then group. Now that column (called 'to_sum') is part of your SeriesGroupBy object.
df.assign(to_sum = _f(df['text'], 'b')).groupby('group').to_sum.sum()
#group
#1    1.0
#2    2.0
#Name: to_sum, dtype: float64
If you don't need the entire DataFrame for your subsequent operations, you can sum the Series returned by _f, using df['group'] to group (as they share the same index):
_f(df['text'], 'b').groupby(df['group']).sum()
You can just do this; there is no need to group first:
df['eval'] = df['text'].str.contains('b')
Here eval is the name of the column you want to add; you can name it whatever you like.
df.groupby('group')['eval'].sum()
Run this after the first line. The result is:
group
1    1.0
2    2.0

Comparing logical values to NaN in pandas/numpy

I want to do an element-wise OR operation on two pandas Series of boolean values; np.nans are also included.
I have tried three approaches and realized that the expression "np.nan or False" can be evaluated to True, False, or np.nan depending on the approach.
These are my example Series:
import numpy as np
import pandas as pd

series_1 = pd.Series([True, False, np.nan])
series_2 = pd.Series([False, False, False])
Approach #1
Using the | operator of pandas:
In [5]: series_1 | series_2
Out[5]:
0     True
1    False
2    False
dtype: bool
Approach #2
Using the logical_or function from numpy:
In [6]: np.logical_or(series_1, series_2)
Out[6]:
0     True
1    False
2      NaN
dtype: object
Approach #3
I define a vectorized version of logical_or, which is supposed to be evaluated row by row over the arrays:
@np.vectorize
def vectorized_or(a, b):
    return np.logical_or(a, b)
I use vectorized_or on the two series and convert its output (which is a numpy array) into a pandas Series:
In [8]: pd.Series(vectorized_or(series_1, series_2))
Out[8]:
0     True
1    False
2     True
dtype: bool
Question
I am wondering about the reasons for these results.
This answer explains np.logical_or and says that np.logical_or(np.nan, False) is True, but why does this only work when vectorized and not in Approach #2? And how can the results of Approach #1 be explained?
First difference: | is np.bitwise_or. That explains the difference between #1 and #2.
Second difference: since series_1.dtype is object (non-homogeneous data), operations are done row by row in the first two cases.
When using vectorize (#3):
The data type of the output of vectorized is determined by calling the function with the first element of the input. This can be avoided by specifying the otypes argument.
For vectorized operations, you leave object mode: the data are first converted according to the first element (bool here, and bool(nan) is True), and the operations are done afterwards.
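A minimal illustration of the NaN truthiness driving that conversion:

import numpy as np

print(bool(np.nan))                   # True: NaN is truthy as a scalar
print(np.logical_or(np.nan, False))   # True: the same coercion applies elementwise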
