Filtering data in pandas - python

I am working with pandas 0.13.0
I have a data frame(a) with 2.5 million records
I want to exclude some hundreds of records applying two conditions simoultaneusly: only the records that fulfill the 2 conditions at the same time.
I want to see how many records I will exclude when applying both conditions:
len(a)
2523250
b=a[(a.cond1=='120.A') & (a.cond2==2012)]
len(b)
6010
But when I apply the conditions to obtain the final dataframe:
c=a[(a.cond1!='120.A') & (a.cond2!=2012)]
len(c)
2214968
In the second case '&' is working like and 'OR'
What I am doing wrong?

Review De Morgan's laws. The logical negation of & is not simply switching the == with !=, you must also swap & with |, because you want the rows where either cond1 != '120.A' or cond2 != 2012, i.e., you want to exclude a row if ONE of the != conditions is true because that makes the original & statement False.
#EdChum's comment above is equivalent to
c=a[(a.cond1!='120.A') | (a.cond2!=2012)]

Related

How can I drop several rows from my Dataframe?

I have a dataframe (called my_df1) and want to drop several rows based on certain dates. How can I create a new dataframe (my_df2) without the dates '2020-05-01' and '2020-05-04'?
I tried the following which did not work as you can see below:
my_df2 = mydf_1[(mydf_1['Date'] != '2020-05-01') | (mydf_1['Date'] != '2020-05-04')]
my_df2.head()
The problem seems to be with your logical operator.
You should be using and here instead of or since you have to select all the rows which are not 2020-05-01 and 2020-05-04.
The bitwise operators will not be short circuiting and hence the result.
You can use isin with negation ~ sign:
dates=['2020-05-01', '2020-05-04']
my_df2 = mydf_1[~mydf_1['Date'].isin(dates)]
The short explanation about your mistake AND and OR was addressed by kanmaytacker.
Following a few additional recommendations:
Indexing in pandas:
By label .loc
By index .iloc
By label also works without .loc but it's slower as it's composed of chained operations instead of a single internal operation consisting on nested loops (see here). Also, with .loc you can select on more than one axis at a time.
# example with rows. Same logic for columns or additional axis.
df.loc[(df['a']!=4) & (df['a']!=1),:] # ".loc" is the only addition
>>>
a b c
2 0 4 6
Your index is a boolean set. This is true for numpy and as a consecuence, pandas too.
(df['a']!=4) & (df['a']!=1)
>>>
0 False
1 False
2 True
Name: a, dtype: bool

Pandas column selection with many conditions becomes unwieldy

Quick Pandas question:
I cleaning up the values in individual columns of a dataframe by using an apply on a series:
# For all values in col 'Rate' over 1, divide by 100
df['rate'][df['rate']>1] = df['rate'][df['rate']>1].apply(lambda x: x/100)
This is fine when the selection criteria is simple, such as df['rate']>1. This however gets very long when you start adding multiple selection criteria:
df['rate'][(df['rate']>1) & (~df['rate'].isnull()) & (df['rate_type']=='fixed) & (df['something']<= 'nothing')] = df['rate'][(df['rate']>1) & (df['rate_type']=='fixed) & (df['something']<= 'nothing')].apply(lambda x: x/100)
What's the most concise way to:
1. Split a column off (as a Series) from a DataFrame
2. Apply a function to the items of the Series
3. Update the DataFrame with the modified series
I've tried using df.update(), but that didn't seem to work. I've also tried using the Series as a selector, e.g. isin(Series), but I wasn't able to get that to work either.
Thank you!
When there are multiple conditions, you can keep things simple using eval:
mask = df.eval("rate > 1 & rate_type == 'fixed' & something <= 'nothing'")
df.loc[mask, 'rate'] = df['rate'].apply(function)
Read more about evaluating expressions dynamically here. Of course, this particular function can be vectorized as
df.loc[mask, 'rate'] /= 100
It will work with update
con=(df['rate']>1) & (df['rate_type']=='fixed') & (df['something']<= 'nothing')
df.update(df.loc[con,['rate']].apply(lambda x: x/100))

Pandas: Efficient way to select rows from a dataframe using multiple criteria

I am selecting/filtering a DataFrame using multiple criteria (comparsion with variables), like so:
results = df1[
(df1.Year == Year) &
(df1.headline == text) &
(df1.price > price1) &
(df1.price < price2) &
(df1.promo > promo1) &
(df1.promo < promo2)
]
While this approach works, it is very slow. Hence I wonder, is there any more efficient way of filtering/selecting rows based on multiple criteria using pandas?
Your current approach is pretty by-the-book as fair as Pandas syntax goes, in my personal opinion.
One way to optimize, if you really need to do so, is to use the underlying NumPy arrays for generating the boolean masks. Generally speaking, Pandas may come with a bit of additional overhead in how it overloads operators versus NumPy. (With the tradeoff being arguably greater flexibility and intrinsically smooth handling of NaN data.)
price = df1.price.values
promo = df1.promo.values
# Note: this is a view to a slice of df1
results = df1.loc[
(df1.Year.values == Year) &
(df1.headline.values == text) &
(price > price1) &
(price < price2) &
(promo > promo1) &
(promo < promo2)
]
Secondly, check that you are already taking advantage of numexpr, which Pandas is enabled to do:
>>> import pandas as pd
>>> pd.get_option('compute.use_numexpr') # use `pd.set_option()` if False
True

Run logical Expressions against pandas dataframe

I m trying to select rows from a pandas dataframe by applying condition to a column (in form of logical expression).
Sample data frame looks like:
id userid code
0 645382311 12324234234
1 645382311 -2434234242
2 645382312 32536365654
3 645382312 12324234234
...
For example, I expect next result by applying logical expressions for column 'code':
case 1: (12324234234 OR -2434234242) AND NOT 32536365654
case 2: (12324234234 AND -2434234242) OR NOT 32536365654
must give a result for both cases:
userid: 645382311
The logic above says:
For case 1 - give me only those userid who has at least one of the values (12324234234 OR -2434234242) and doesn't have 32536365654 in the whole data frame.
For case 2 - I need only those userid who has either both codes in data frame (12324234234 AND -2434234242) or any codes but not 32536365654.
The statement like below returns empty DataFrame:
flt = df[(df.code == 12324234234) & (df.code == -2434234242)]
print("flt: ", flt)
Result (and it make sens):
flt: Empty DataFrame
Would be appreciate for any hints to handle such cases.
As a simple approach, I would transform your sample table into a boolean presence matrix, which would then allow you to perform the logic you need:
import pandas
sample = pandas.DataFrame([[645382311, 12324234234], [645382311, -2434234242], [645382312, 32536365654], [645382312, 12324234234]], columns=['userid', 'code'])
# Add a column of True values
sample['value'] = True
# Pivot to boolean presence matrix and remove MultiIndex
presence = sample.pivot(index='userid', columns='code').fillna(False)['value']
# Perform desired boolean tests
case1 = (presence[12324234234] | presence[-2434234242]) & ~(presence[32536365654])
case2 = (presence[12324234234] & presence[-2434234242]) | ~(presence[32536365654])
The case variables will contain the boolean test result for each userid.

Element-wise logical OR in Pandas

I am aware that AND corresponds to & and NOT, ~. What is the element-wise logical OR operator? I know "or" itself is not what I am looking for.
The corresponding operator is |:
df[(df < 3) | (df == 5)]
would elementwise check if value is less than 3 or equal to 5.
If you need a function to do this, we have np.logical_or. For two conditions, you can use
df[np.logical_or(df<3, df==5)]
Or, for multiple conditions use the logical_or.reduce,
df[np.logical_or.reduce([df<3, df==5])]
Since the conditions are specified as individual arguments, parentheses grouping is not needed.
More information on logical operations with pandas can be found here.
To take the element-wise logical OR of two Series a and b just do
a | b
If you operate on the columns of a single dataframe, eval and query are options where or works element-wise. You don't need to worry about parenthesis either because comparison operators have higher precedence than boolean/bitwise operators. For example, the following query call returns rows where column A values are >1 and column B values are > 2.
df = pd.DataFrame({'A': [1,2,0], 'B': [0,1,2]})
df.query('A > 1 or B > 2') # == df[(df['A']>1) | (df['B']>2)]
# A B
# 1 2 1
or with eval you can return a boolean Series (again or works just fine as element-wise operator).
df.eval('A > 1 or B > 2')
# 0 False
# 1 True
# 2 False
# dtype: bool

Categories

Resources