I am aware that AND corresponds to & and NOT to ~. What is the element-wise logical OR operator? I know "or" itself is not what I am looking for.
The corresponding operator is |:
df[(df < 3) | (df == 5)]
would check, element-wise, whether each value is less than 3 or equal to 5.
If you need a function to do this, we have np.logical_or. For two conditions, you can use
df[np.logical_or(df<3, df==5)]
Or, for multiple conditions, use logical_or.reduce:
df[np.logical_or.reduce([df<3, df==5])]
Since the conditions are passed as separate arguments (or as a list), no parentheses grouping is needed, unlike with |.
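For example, a minimal runnable sketch (the dataframe and thresholds are made up for illustration) combining three conditions on one column to select rows:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 4, 5, 7]})
# reduce accepts a list, so any number of conditions can be combined
mask = np.logical_or.reduce([df['x'] < 3, df['x'] == 5, df['x'] == 7])
df[mask]
#    x
# 0  1
# 2  5
# 3  7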
More information on logical operations with pandas can be found in the pandas documentation on boolean indexing.
To take the element-wise logical OR of two Series a and b just do
a | b
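A minimal sketch (the values are invented for illustration):

import pandas as pd

a = pd.Series([True, False, False])
b = pd.Series([False, False, True])
a | b
# 0     True
# 1    False
# 2     True
# dtype: bool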
If you operate on the columns of a single dataframe, eval and query are options where or works element-wise. You don't need to worry about parentheses either, because in these expression strings comparison operators have higher precedence than boolean operators. For example, the following query call returns rows where column A values are > 1 or column B values are > 2.
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 0], 'B': [0, 1, 2]})
df.query('A > 1 or B > 2')  # == df[(df['A'] > 1) | (df['B'] > 2)]
# A B
# 1 2 1
Or with eval you can return a boolean Series (again, or works just fine as an element-wise operator).
df.eval('A > 1 or B > 2')
# 0 False
# 1 True
# 2 False
# dtype: bool
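The resulting boolean Series can then be used as an ordinary mask, e.g. with the df defined above:

df[df.eval('A > 1 or B > 2')]
#    A  B
# 1  2  1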
Related
I have a pandas dataframe with about 50 columns and >100 rows. I want to select columns 'col_x', 'col_y' where 'col_z' < m. Is there a simple way to do this, similar to df[df['col_z'] < m] and df[['col_x', 'col_y']] but combined?
Let's break down your problem. You want to
1. filter rows based on some boolean condition, and
2. select a subset of columns from the result.
For the first point, the condition you'd need is -
df["col_z"] < m
For the second requirement, you'd want to specify the list of columns that you need -
["col_x", "col_y"]
How would you combine these two to produce an expected output with pandas? The most straightforward way is using loc -
df.loc[df["col_z"] < m, ["col_x", "col_y"]]
The first argument selects rows, and the second argument selects columns.
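Putting it together, a minimal runnable sketch (the data and the threshold m are made up for illustration):

import pandas as pd

df = pd.DataFrame({'col_x': [1, 2, 3],
                   'col_y': [4, 5, 6],
                   'col_z': [10, 20, 30]})
m = 25
df.loc[df['col_z'] < m, ['col_x', 'col_y']]
#    col_x  col_y
# 0      1      4
# 1      2      5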
More About loc
Think of this in terms of the relational algebra operations - selection and projection. If you're from the SQL world, this would be a relatable equivalent. The above operation, in SQL syntax, would look like this -
SELECT col_x, col_y   -- projection on columns
FROM df
WHERE col_z < m       -- selection on rows
pandas loc allows you to specify index labels for selecting rows. For example, if you have a dataframe -
col_x col_y
a 1 4
b 2 5
c 3 6
To select indices a and c, and column col_x, you'd use -
df.loc[['a', 'c'], ['col_x']]
col_x
a 1
c 3
Alternatively, to select by a boolean condition (using a series/array of bool values, as your original question asks), here selecting the rows where col_x is odd -
df.loc[(df.col_x % 2).ne(0), ['col_y']]
col_y
a 4
c 6
For details: df.col_x % 2 computes the remainder of each value when divided by 2. ne(0) then compares each result to 0 and returns True where it is not equal, which selects exactly the odd numbers. Here's what that expression results in -
(df.col_x % 2).ne(0)
a True
b False
c True
Name: col_x, dtype: bool
Further Reading
10 Minutes to Pandas - Selection by Label
Indexing and selecting data
Boolean indexing
Selection with .loc in python
pandas loc vs. iloc vs. ix vs. at vs. iat?
I have a dataframe (called mydf_1) and want to drop several rows based on certain dates. How can I create a new dataframe (my_df2) without the dates '2020-05-01' and '2020-05-04'?
I tried the following which did not work as you can see below:
my_df2 = mydf_1[(mydf_1['Date'] != '2020-05-01') | (mydf_1['Date'] != '2020-05-04')]
my_df2.head()
The problem is with your logical operator.
You should be using & (and) here instead of | (or), since you want the rows whose date is neither '2020-05-01' nor '2020-05-04'. With |, every row satisfies at least one of the two != conditions (no date can equal both), so nothing gets filtered out.
Note also that the element-wise operators & and | are not short-circuiting; both conditions are always evaluated over the whole Series.
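Concretely, the corrected version of your line would be:

my_df2 = mydf_1[(mydf_1['Date'] != '2020-05-01') & (mydf_1['Date'] != '2020-05-04')]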
You can use isin with the negation operator ~:
dates=['2020-05-01', '2020-05-04']
my_df2 = mydf_1[~mydf_1['Date'].isin(dates)]
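A minimal runnable sketch (with made-up data) showing the effect:

import pandas as pd

mydf_1 = pd.DataFrame({'Date': ['2020-05-01', '2020-05-02', '2020-05-03', '2020-05-04'],
                       'value': [1, 2, 3, 4]})
dates = ['2020-05-01', '2020-05-04']
my_df2 = mydf_1[~mydf_1['Date'].isin(dates)]
my_df2
#          Date  value
# 1  2020-05-02      2
# 2  2020-05-03      3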
The short explanation of your AND/OR mistake was already addressed by kanmaytacker.
Following are a few additional recommendations:
Indexing in pandas:
By label .loc
By position .iloc
By label also works without .loc, but it is slower because it results in chained operations rather than a single internal operation. Also, with .loc you can select on more than one axis at a time.
# example with rows. Same logic for columns or additional axis.
df.loc[(df['a']!=4) & (df['a']!=1),:] # ".loc" is the only addition
>>>
a b c
2 0 4 6
Your indexer is a boolean mask. This is true for numpy and, as a consequence, for pandas too.
(df['a']!=4) & (df['a']!=1)
>>>
0 False
1 False
2 True
Name: a, dtype: bool
I just can't figure out what "==" means in the line subset = data[data.categ == cat]:
- It is not a test, there is no if statement...
- It is not a variable declaration...
I've never seen this before; the thing is data.categ == cat is a pandas Series and not a test...
import matplotlib.pyplot as plt

for cat in data["categ"].unique():
    subset = data[data.categ == cat]  # Create the sub-sample
    print("-"*20)
    print('Category: ' + cat)
    print("mean:\n", subset['montant'].mean())
    print("median:\n", subset['montant'].median())
    print("mode:\n", subset['montant'].mode())
    print("variance:\n", subset['montant'].var())
    print("std dev:\n", subset['montant'].std())
    plt.figure(figsize=(5,5))
    subset["montant"].hist(bins=30)  # Create the histogram
    plt.show()  # Display the histogram
It is testing each element of data.categ for equality with cat. That produces a vector of True/False values. This is passed as an indexer to data[], which returns the rows from data that correspond to the True values in the vector.
To summarize, the whole expression returns the subset of rows from data where the value of data.categ equals cat.
(It seems possible the whole operation could be done more elegantly using data.groupby('categ').apply(someFunc).)
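For instance, the per-category summary statistics from the loop could be computed in a single call (a sketch; it assumes the same categ and montant columns as above):

data.groupby('categ')['montant'].agg(['mean', 'median', 'var', 'std'])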
It creates a boolean series whose entries are True at the indexes where data.categ is equal to cat. With this boolean mask you can filter your dataframe; in other words, subset will contain all records where categ equals the value stored in cat.
This is an example using numeric data
import numpy as np
import pandas as pd

np.random.seed(0)
a = np.random.choice(np.arange(2), 5)
b = np.random.choice(np.arange(2), 5)
df = pd.DataFrame(dict(a = a, b = b))
df[df.a == 0].head()
# a b
# 0 0 0
# 2 0 0
# 4 0 1
df[df.a == df.b].head()
# a b
# 0 0 0
# 2 0 0
# 3 1 1
Yes, it is a test. Boolean expressions are not restricted to if statements.
It looks as if data is a data frame (pandas). The expression used as a data frame index is how pandas denotes a selector or filter. This says to select every row in which the field categ matches the variable cat (apparently a pre-defined variable). This collection of rows becomes a new data frame, subset.
data.categ == cat will return a boolean Series that is used to filter your dataframe, keeping only the rows where the boolean is True.
Booleans are used in many situations, not only in if statements.
Here you are comparing data.categ with cat, the current element of the loop over the unique values of data["categ"].
The rows where they are equal are the ones kept in subset on each iteration.
I have this large dataframe I've imported into pandas and I want to chop it down via a filter. Here is my basic sample code:
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
df = DataFrame({'A':[12345,0,3005,0,0,16455,16454,10694,3005],'B':[0,0,0,1,2,4,3,5,6]})
df2= df[df["A"].map(lambda x: x > 0) & (df["B"] > 0)]
Basically this displays the bottom 4 results, which is semi-correct. But I need to display everything BUT these results. So essentially, I'm looking for a way to use this filter but in a "not" version, if that's possible. So if column A is greater than 0 AND column B is greater than 0, then we want to disqualify these values from the dataframe. Thanks
No need for the map function call on Series "A".
Apply De Morgan's Law:
"not (A and B)" is the same as "(not A) or (not B)"
df2 = df[~(df.A > 0) | ~(df.B > 0)]
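You can verify in a console that this agrees with the direct negation of the original condition:

lhs = df[~(df.A > 0) | ~(df.B > 0)]
rhs = df[~((df.A > 0) & (df.B > 0))]
lhs.equals(rhs)
# True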
There is no need for the map implementation. You can just invert the comparisons, like ...
df.loc[(df.A <= 0) | (df.B <= 0), :]
Or use plain boolean indexing (note that df.ix is deprecated in modern pandas, so .loc is used above):
df[(df.A <= 0) | (df.B <= 0)]
Try
df2 = df[df["A"].map(lambda x: x <= 0) | (df["B"] <= 0)]
You could wrap everything in parentheses and use a ~ (tilde) outside, in place of not:
df[~((df['A'] >0) & (df['B']>0))]
Output:
A B
0 12345 0
1 0 0
2 3005 0
3 0 1
4 0 2
For a df table like below,
A B C D
0 0 1 1 1
1 2 3 5 7
3 3 1 2 8
why are the double brackets needed for selecting specific columns after boolean indexing?
the [['A','C']] part of
df[df['A'] < 3][['A','C']]
For pandas objects (Series, DataFrame), the indexing operator [] only accepts
- a colname or list of colnames, to select column(s)
- a slice or Boolean array, to select row(s)
i.e. it only refers to one dimension of the dataframe.
For df[[colname(s)]], the interior brackets are for a list, and the outer brackets are the indexing operator, i.e. you must use double brackets if you select two or more columns. With one column name, a single pair of brackets returns a Series, while double brackets return a dataframe.
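A minimal sketch of the Series-versus-dataframe distinction (using made-up data):

import pandas as pd

df = pd.DataFrame({'A': [0, 2, 3], 'B': [1, 3, 1], 'C': [1, 5, 2]})
type(df['A'])    # <class 'pandas.core.series.Series'>
type(df[['A']])  # <class 'pandas.core.frame.DataFrame'>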
Also, df.loc[df['A'] < 3, ['A','C']] is better than the chained selection, because it avoids the ambiguity of returning a copy versus a view of the dataframe. (df.ix, which used to be an option here, is deprecated and removed in modern pandas.)
Please refer to the pandas documentation for details.
Because there is no single column named ('A', 'C'). Passing 'A','C' without the inner brackets looks up that tuple as a column name, which raises a KeyError, so you have to use an iterable (a list) to sub-select columns from the df.
So
df[df['A'] < 3]['A','C']
raises
KeyError: ('A', 'C')
Which is different to
In [261]:
df[df['A'] < 3][['A','C']]
Out[261]:
A C
0 0 1
1 2 5
This is no different to trying:
df['A','C']
hence why you need double square brackets:
df[['A','C']]
Note that the modern way is to use .loc (the .ix indexer from older pandas versions is deprecated and has been removed):
In [264]:
df.loc[df['A'] < 3,['A','C']]
Out[264]:
A C
0 0 1
1 2 5
This way you perform a single indexing operation rather than chained indexing, which can otherwise operate on a copy rather than a view of the dataframe
Because the inner brackets are just Python syntax (a literal) for a list.
The outer brackets are the indexer operation of the pandas dataframe object.
In this use case, the inner ['A', 'C'] defines the list of columns to pass as a single argument to the indexer operation, which is denoted by the outer brackets.
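Equivalently, you can make the list explicit by binding it to a name first:

cols = ['A', 'C']
df[cols]  # same as df[['A', 'C']]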
Adding to the previous responses, you could also use the df.iloc accessor if you need to select by index position. It also makes the code more reproducible, which is nice.
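For example, a short sketch (note that .iloc needs a plain boolean array rather than a boolean Series for row selection, hence the .to_numpy() call):

df.iloc[(df['A'] < 3).to_numpy(), [0, 2]]  # rows where A < 3, columns at positions 0 and 2
#    A  C
# 0  0  1
# 1  2  5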