I want to do an element-wise OR operation on two pandas Series of boolean values. The Series also contain np.nan values.
I have tried three approaches and realized that the expression "np.nan or False" can evaluate to True, False, or np.nan depending on the approach.
These are my example Series:
series_1 = pd.Series([True, False, np.nan])
series_2 = pd.Series([False, False, False])
Approach #1
Using the | operator of pandas:
In [5]: series_1 | series_2
Out[5]:
0 True
1 False
2 False
dtype: bool
Approach #2
Using the logical_or function from numpy:
In [6]: np.logical_or(series_1, series_2)
Out[6]:
0 True
1 False
2 NaN
dtype: object
Approach #3
I define a vectorized version of logical_or which is supposed to be evaluated row-by-row over the arrays:
@np.vectorize
def vectorized_or(a, b):
    return np.logical_or(a, b)
I use vectorized_or on the two series and convert its output (which is a numpy array) into a pandas Series:
In [8]: pd.Series(vectorized_or(series_1, series_2))
Out[8]:
0 True
1 False
2 True
dtype: bool
Question
I am wondering about the reasons for these results.
This answer explains np.logical_or and says np.logical_or(np.nan, False) should be True, but why does that only work when vectorized (Approach #3) and not in Approach #2? And how can the result of Approach #1 be explained?
First difference: | is np.bitwise_or, not np.logical_or; that explains the difference between #1 and #2.
Second difference: since series_1.dtype is object (non-homogeneous data), the operations are done row by row in the first two cases.
When using vectorize (#3):
The data type of the output of vectorized is determined by calling
the function with the first element of the input. This can be avoided
by specifying the otypes argument.
With vectorize you leave object mode: the data is first converted according to the first element (bool here, and bool(np.nan) is True), and the operation is applied afterwards.
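A minimal sketch of the two facts this explanation relies on, using plain scalars and a one-element object array (the arrays are just illustrative):
import numpy as np

bool(np.nan)                  # True: np.nan is a non-zero float, hence truthy
np.logical_or(np.nan, False)  # True: on float scalars the ufunc only cares about truthiness

# On object dtype the result is computed with Python-level logic,
# so the NaN itself can come back, as seen in Approach #2
np.logical_or(np.array([np.nan], dtype=object), np.array([False]))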
Related
My code is like:
a = pd.DataFrame([np.nan, True])
b = pd.DataFrame([True, np.nan])
c = a|b
print(c)
I don't know what the result of a logical operation should be when one element is np.nan, but I expect it to be the same whatever the order. However, I got this result:
0
0 False
1 True
Why? Is this about short-circuiting in pandas? I searched the pandas docs but did not find an answer.
My pandas version is 1.1.3
This is behaviour that is tied to np.nan, not pandas. Take the following examples:
print(True or np.nan)
print(np.nan or True)
Output:
True
nan
Python's or short-circuits and returns the first truthy operand; np.nan is a non-zero float, so it is truthy, which is why np.nan or True returns nan while True or np.nan returns True. When pandas performs the element-wise operation on an object-dtype frame, it falls back to this kind of Python-level logic, and the way np.nan behaves there is what leads to the order-dependent result above.
To get around this quirk, you can fill the NaN values with False (or some other value that evaluates to False) using pandas.DataFrame.fillna().
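For example, with the frames from the question, filling both sides before the OR gives an order-independent result:
import numpy as np
import pandas as pd

a = pd.DataFrame([np.nan, True])
b = pd.DataFrame([True, np.nan])
print(a.fillna(False) | b.fillna(False))
#       0
# 0  True
# 1  True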
I have a dataframe (called my_df1) and want to drop several rows based on certain dates. How can I create a new dataframe (my_df2) without the dates '2020-05-01' and '2020-05-04'?
I tried the following which did not work as you can see below:
my_df2 = mydf_1[(mydf_1['Date'] != '2020-05-01') | (mydf_1['Date'] != '2020-05-04')]
my_df2.head()
The problem is with your logical operator.
You should be using & here instead of |, since you want the rows whose date is not 2020-05-01 and not 2020-05-04. With |, every row passes, because a date cannot equal both values, so at least one of the two != conditions is always True.
Note that the bitwise operators & and | do not short-circuit the way Python's and/or do.
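A minimal sketch of the corrected filter; the frame below is made up, only the 'Date' column name is taken from the question:
import pandas as pd

mydf_1 = pd.DataFrame({'Date': ['2020-05-01', '2020-05-02', '2020-05-04'],
                       'value': [10, 20, 30]})

# keep rows whose Date is neither of the two excluded dates
my_df2 = mydf_1[(mydf_1['Date'] != '2020-05-01') & (mydf_1['Date'] != '2020-05-04')]
print(my_df2)
#          Date  value
# 1  2020-05-02     20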
You can use isin with the negation operator ~:
dates=['2020-05-01', '2020-05-04']
my_df2 = mydf_1[~mydf_1['Date'].isin(dates)]
The short explanation about your AND/OR mistake was addressed by kanmaytacker.
Following are a few additional recommendations:
Indexing in pandas:
By label: .loc
By position: .iloc
Indexing by label also works without .loc, but it is slower because it is composed of chained operations instead of a single internal operation (see here). Also, with .loc you can select on more than one axis at a time.
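For illustration, assume a small frame like the following (the row values are made up so that they match the output shown below):
import pandas as pd

df = pd.DataFrame({'a': [4, 1, 0], 'b': [1, 2, 4], 'c': [3, 5, 6]})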
# example with rows. Same logic for columns or additional axis.
df.loc[(df['a']!=4) & (df['a']!=1),:] # ".loc" is the only addition
>>>
a b c
2 0 4 6
Your indexer is a boolean mask. This is true for numpy and, as a consequence, for pandas too.
(df['a']!=4) & (df['a']!=1)
>>>
0 False
1 False
2 True
Name: a, dtype: bool
I wanted to use boolean indexing, checking for rows of my data frame where a particular column does not have NaN values. So, I did the following:
import pandas as pd
my_df.loc[pd.isnull(my_df['col_of_interest']) == False].head()
to see a snippet of that data frame, including only the values that are not NaN (most values are NaN).
It worked, but seems less-than-elegant. I'd want to type:
my_df.loc[!pd.isnull(my_df['col_of_interest'])].head()
However, that generated an error. I also spend a lot of time in R, so maybe I'm confusing things. In Python I usually use not where I can, for instance if x is not None:, but I couldn't really do that here. Is there a more elegant way? I don't like having to write a senseless comparison.
In general with pandas (and numpy), we use the bitwise NOT ~ instead of ! or not (whose behaviour can't be overridden by types).
While in this case we have notnull, ~ can come in handy in situations where there's no special opposite method.
>>> df = pd.DataFrame({"a": [1, 2, np.nan, 3]})
>>> df.a.isnull()
0 False
1 False
2 True
3 False
Name: a, dtype: bool
>>> ~df.a.isnull()
0 True
1 True
2 False
3 True
Name: a, dtype: bool
>>> df.a.notnull()
0 True
1 True
2 False
3 True
Name: a, dtype: bool
(For completeness I'll note that -, the unary negative operator, will also work on a boolean Series but ~ is the canonical choice, and - has been deprecated for numpy boolean arrays.)
Instead of using pandas.isnull(), you should use pandas.notnull() to find the rows where the column has non-null values. Example -
import pandas as pd
my_df.loc[pd.notnull(my_df['col_of_interest'])].head()
pandas.notnull() is the boolean inverse of pandas.isnull(), as stated in the documentation:
See also: pandas.notnull - boolean inverse of pandas.isnull
I am aware that AND corresponds to & and NOT to ~. What is the element-wise logical OR operator? I know or itself is not what I am looking for.
The corresponding operator is |:
df[(df < 3) | (df == 5)]
would element-wise check whether the value is less than 3 or equal to 5.
If you need a function to do this, we have np.logical_or. For two conditions, you can use
df[np.logical_or(df<3, df==5)]
Or, for multiple conditions, use logical_or.reduce:
df[np.logical_or.reduce([df<3, df==5])]
Since each condition is passed as a separate element of the list, no extra parentheses grouping is needed.
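A quick demonstration with a made-up one-column frame (the column name and values are assumptions):
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 3, 5, 7]})
df[np.logical_or.reduce([df['x'] < 3, df['x'] == 5, df['x'] == 7])]
#    x
# 0  1
# 2  5
# 3  7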
More information on logical operations with pandas can be found here.
To take the element-wise logical OR of two Series a and b just do
a | b
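For instance, with two small example Series (the values are made up):
import pandas as pd

a = pd.Series([True, False, True])
b = pd.Series([False, False, True])
a | b
# 0     True
# 1    False
# 2     True
# dtype: bool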
If you operate on the columns of a single dataframe, eval and query are options where or works element-wise. You don't need to worry about parentheses either, because comparison operators have higher precedence than boolean/bitwise operators. For example, the following query call returns rows where column A values are > 1 or column B values are > 2.
df = pd.DataFrame({'A': [1,2,0], 'B': [0,1,2]})
df.query('A > 1 or B > 2') # == df[(df['A']>1) | (df['B']>2)]
# A B
# 1 2 1
Or, with eval you can return a boolean Series (again, or works just fine as an element-wise operator).
df.eval('A > 1 or B > 2')
# 0 False
# 1 True
# 2 False
# dtype: bool
So I am operating on a rather large set of data. I am using a pandas DataFrame to handle this data and am stuck on an efficient way to parse it into two formatted lists.
Here is my DataFrame object:
fet1 fet2 fet3 fet4 fet5
stim1 True True False False False
stim2 True False False False True
stim3 ...................................
stim4 ...................................
stim5 ............................. so on
I am trying to parse each row and create two lists. List one should have the column names of all the True values. List two should have the column names of the False values.
example for stim 1:
list_1=[fet1,fet2]
list_2=[fet3,fet4,fet5]
I know I can brute-force this and iterate over the rows, or I can transpose, convert to a dictionary, and parse that way. I can also create sparse Series objects and then create sets, but then I have to reference the column names separately.
The only problem I am running into is that I am always getting quadratic O(n^2) run time.
Is there a more efficient way to do this as a built in functionality from Pandas?
Thanks for your help.
Is this what you want?
>>> df
fet1 fet2 fet3 fet4 fet5
stim1 True True False False False
stim2 True False False False True
>>> def func(row):
...     return [row.index[row == True],
...             row.index[row == False]]
>>> df.apply(func, axis=1)
stim1 [[fet1, fet2], [fet3, fet4, fet5]]
stim2 [[fet1, fet5], [fet2, fet3, fet4]]
dtype: object
This may or may not be faster. I do not think a more succinct solution is possible.
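If you then need plain Python lists per row, here is a small usage sketch on top of the result above:
result = df.apply(func, axis=1)
true_cols, false_cols = result['stim1']
list(true_cols)    # ['fet1', 'fet2']
list(false_cols)   # ['fet3', 'fet4', 'fet5']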
Fast (not row-by-row) operations can get this far.
In [126]: (np.array(df.columns)*~df)[~df]
Out[126]:
fet1 fet2 fet3 fet4 fet5
stim1 NaN NaN fet3 fet4 fet5
stim2 NaN fet2 fet3 fet4 NaN
But at this point, because the rows might have variable length, the array structure must be broken and each row must be considered individually.
In [122]: (np.array(df.columns)*df)[df].apply(lambda x: Series([x.dropna()]), 1)
Out[122]:
0
stim1 [fet1, fet2]
stim2 [fet1, fet5]
In [125]: (np.array(df.columns)*~df)[~df].apply(lambda x: Series([x.dropna()]), 1)
Out[125]:
0
stim1 [fet3, fet4, fet5]
stim2 [fet2, fet3, fet4]
The slowest step is probably the Series constructor. I'm pretty sure there's no way around it though.