I am selecting/filtering a DataFrame using multiple criteria (comparisons with variables), like so:
results = df1[
    (df1.Year == Year) &
    (df1.headline == text) &
    (df1.price > price1) &
    (df1.price < price2) &
    (df1.promo > promo1) &
    (df1.promo < promo2)
]
While this approach works, it is very slow. Hence I wonder: is there a more efficient way of filtering/selecting rows based on multiple criteria in pandas?
Your current approach is pretty by-the-book as far as pandas syntax goes, in my opinion.
One way to optimize, if you really need to do so, is to use the underlying NumPy arrays for generating the boolean masks. Generally speaking, Pandas may come with a bit of additional overhead in how it overloads operators versus NumPy. (With the tradeoff being arguably greater flexibility and intrinsically smooth handling of NaN data.)
price = df1.price.values
promo = df1.promo.values

# Note: boolean-mask indexing returns a copy of the matching rows, not a view of df1
results = df1.loc[
    (df1.Year.values == Year) &
    (df1.headline.values == text) &
    (price > price1) &
    (price < price2) &
    (promo > promo1) &
    (promo < promo2)
]
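If the list of conditions keeps growing, the same NumPy-level masks can also be combined in a single call with np.logical_and.reduce instead of chaining &. A short sketch, reusing the arrays defined above:
import numpy as np

# Combine any number of boolean masks at once (all must have the same length)
mask = np.logical_and.reduce([
    df1.Year.values == Year,
    df1.headline.values == text,
    price > price1, price < price2,
    promo > promo1, promo < promo2,
])
results = df1.loc[mask]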
Secondly, check that you are already taking advantage of numexpr, which pandas can use under the hood for this kind of expression evaluation:
>>> import pandas as pd
>>> pd.get_option('compute.use_numexpr') # use `pd.set_option()` if False
True
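If numexpr is available, DataFrame.query can also evaluate the whole filter as a single expression. This is only a sketch, assuming the same column names as above and that Year, text, price1, price2, promo1 and promo2 are local variables (referenced with @); whether it beats the mask-based version is worth timing on your data:
# Same filter expressed with DataFrame.query (backed by numexpr when installed)
results = df1.query(
    "Year == @Year and headline == @text "
    "and price > @price1 and price < @price2 "
    "and promo > @promo1 and promo < @promo2"
)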
I am new to Python and am converting SQL to Python and want to learn the most efficient way to process a large dataset (rows > 1 million and columns > 100). I need to create multiple new columns based on other columns in the DataFrame. I have recently learned how to use pd.concat for new boolean columns, but I also have some non-boolean columns that rely on the values of other columns.
In SQL I would use a single case statement (case when age > 1000 then sample_id else 0 end as custom1, etc...). In Python I can achieve the same result in 2 steps (pd.concat + loc find & replace) as shown below. I have seen references in other posts to using the apply method, but I have also read that apply can be inefficient.
My question is then, for the code shown below, is there a more efficient way to do this? Can I do it all in one step within the pd.concat (so far I haven't been able to get that to work)? I am okay doing it in 2 steps if necessary. I need to be able to handle large integers (100 billion) in my custom1 element and have decimals in my custom2 element.
And finally, I tried using multiple separate np.where statements but received a warning that my DataFrame was fragmented and that I should try to use concat. So I am not sure which approach overall is most efficient or recommended.
Update - after receiving a comment and an answer pointing me towards use of np.where, I decided to test the approaches. Using a data set with 2.7 million rows and 80 columns, I added 25 new columns. First approach was to use the concat + df.loc replace as shown in this post. Second approach was to use np.where. I ran the test 10 times and np.where was faster in all 10 trials. As noted above, I think repeated use of np.where in this way can cause fragmentation, so I suppose now my decision comes down to faster np.where with potential fragmentation vs. slower use of concat without risk of fragmentation. Any further insight on this final update is appreciated.
df = pd.DataFrame({'age': [120, 4000],
                   'weight': [505.31, 29.01],
                   'sample_id': [999999999999, 555555555555]},
                  index=['rock1', 'rock2'])
# step 1: efficiently create starting custom columns using concat
df = pd.concat(
    [
        df,
        (df["age"] > 1000).rename("custom1").astype(int),
        (df["weight"] < 100).rename("custom2").astype(float),
    ],
    axis=1,
)
# step 2: assign final values to custom columns based on other column values
df.loc[df.custom1 == 1, 'custom1'] = df['sample_id']
df.loc[df.custom2 == 1, 'custom2'] = df['weight'] / 2
Thanks for any feedback you can provide...I appreciate your time helping me.
The standard way to do this is with NumPy's where:
import numpy as np
df['custom1'] = np.where(df.age.gt(1000), df.sample_id, 0)
df['custom2'] = np.where(df.weight.lt(100), df.weight / 2, 0)
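Regarding the fragmentation warning mentioned in the update: it comes from inserting many new columns one at a time. One way to keep the speed of np.where without that risk is to build all the derived columns first and attach them in a single concat. A sketch only; the two columns below mirror the answer above, and your remaining columns would follow the same pattern:
import numpy as np
import pandas as pd

# Build every derived column up front
new_cols = pd.DataFrame({
    'custom1': np.where(df['age'].gt(1000), df['sample_id'], 0),
    'custom2': np.where(df['weight'].lt(100), df['weight'] / 2, 0.0),
    # ... further np.where-derived columns ...
}, index=df.index)

# One concat call: no repeated column insertion, so no fragmentation warning
df = pd.concat([df, new_cols], axis=1)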
I have a dataframe containing counts of two things, which I've put in columns numA and numB. I want to find the rows where numA < x and numB < y, which can be done like so:
filtered_df = df[(df.numA < x) & (df.numB < y)]
This works when both numA and numB are present. However neither column is guaranteed to appear in the dataframe. If only one column exists, I would still like to filter the rows based on it. This could be easily coded with something along the lines of
if "numA" in df.columns:
filtered_df = df[df.numA < x]
if "numB" in df.columns:
filtered_df = filtered_df[filtered_df.numB < y]
But this seems very inefficient, especially since in reality I have 9 columns like this, and each of these requires the same check. Is there a way to achieve the same thing but with code that is more readable, easier to maintain and less tedious to write out?
If you want an all-or-nothing type comparison, I think a fairly easy way is to use set comparisons:
if set(list_of_cols_to_check).issubset(df.columns):
    filtered_df = df[(df.numA < x) & ... & (df.numB < y)]
If you want to perform the comparisons for all the columns that do exist, it gets a bit more complicated. It is not very different from what you have, but I'd probably do it as follows:
mask = pd.Series(True, index=df.index)  # always True, works for any index type
mask = mask & (df.numA < 4) if 'numA' in df else mask
mask = mask & (df.numB < 2) if 'numB' in df else mask
mask = mask & (df.numC < 1) if 'numC' in df else mask
df[mask]
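With nine or more such columns, a more data-driven variant can cut the repetition. This is only a sketch: the limits dict below is a hypothetical mapping from column name to upper bound, not something from the question:
import pandas as pd

# Hypothetical thresholds: column name -> upper bound
limits = {'numA': 4, 'numB': 2, 'numC': 1}

# Start from an all-True mask and AND in each condition whose column exists
mask = pd.Series(True, index=df.index)
for col, upper in limits.items():
    if col in df.columns:
        mask &= df[col] < upper

filtered_df = df[mask]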
Quick Pandas question:
I am cleaning up the values in individual columns of a dataframe by using an apply on a series:
# For all values in col 'rate' over 1, divide by 100
df['rate'][df['rate']>1] = df['rate'][df['rate']>1].apply(lambda x: x/100)
This is fine when the selection criterion is simple, such as df['rate']>1. However, this gets very long when you start adding multiple selection criteria:
df['rate'][(df['rate']>1) & (~df['rate'].isnull()) & (df['rate_type']=='fixed') & (df['something']<= 'nothing')] = df['rate'][(df['rate']>1) & (df['rate_type']=='fixed') & (df['something']<= 'nothing')].apply(lambda x: x/100)
What's the most concise way to:
1. Split a column off (as a Series) from a DataFrame
2. Apply a function to the items of the Series
3. Update the DataFrame with the modified series
I've tried using df.update(), but that didn't seem to work. I've also tried using the Series as a selector, e.g. isin(Series), but I wasn't able to get that to work either.
Thank you!
When there are multiple conditions, you can keep things simple using eval:
mask = df.eval("rate > 1 & rate_type == 'fixed' & something <= 'nothing'")
df.loc[mask, 'rate'] = df['rate'].apply(function)
Read more about evaluating expressions dynamically here. Of course, this particular function can be vectorized as
df.loc[mask, 'rate'] /= 100
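If the thresholds live in Python variables rather than string literals, eval and query can reference them with the @ prefix. A small sketch; min_rate and rate_kind are hypothetical names, not from the question:
# Hypothetical local variables
min_rate = 1
rate_kind = 'fixed'

# @ pulls local variables into the expression
mask = df.eval("rate > @min_rate and rate_type == @rate_kind")
df.loc[mask, 'rate'] /= 100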
It will also work with update:
con = (df['rate']>1) & (df['rate_type']=='fixed') & (df['something']<= 'nothing')
df.update(df.loc[con, ['rate']].apply(lambda x: x/100))
I am working with pandas 0.13.0
I have a DataFrame (a) with 2.5 million records.
I want to exclude a few hundred records by applying two conditions simultaneously: only the records that fulfill both conditions at the same time should be excluded.
I want to see how many records I will exclude when applying both conditions:
len(a)
2523250
b=a[(a.cond1=='120.A') & (a.cond2==2012)]
len(b)
6010
But when I apply the conditions to obtain the final dataframe:
c=a[(a.cond1!='120.A') & (a.cond2!=2012)]
len(c)
2214968
In the second case, '&' is working like an 'OR'.
What am I doing wrong?
Review De Morgan's laws. The logical negation of & is not simply switching the == with !=; you must also swap & with |, because you want the rows where either cond1 != '120.A' or cond2 != 2012. In other words, a row should be kept whenever at least ONE of the != conditions is true, since that makes the original & statement False for that row.
#EdChum's comment above is equivalent to
c=a[(a.cond1!='120.A') | (a.cond2!=2012)]
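Equivalently, you can negate the exact mask you already verified with len(b), which makes the De Morgan step explicit; a small sketch using the same columns:
# Keep every row that does NOT satisfy both conditions at once
exclude = (a.cond1 == '120.A') & (a.cond2 == 2012)
c = a[~exclude]

# Sanity check: the number of dropped rows matches len(b) from above
assert len(a) - len(c) == exclude.sum()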
I have a 2-dimensional array in numpy and need to apply a mathematical formula just to some values of the array that match certain criteria. This could be done using a for loop and if conditions; however, I think using numpy's where() method works faster.
My code so far is this, but it doesn't work:
cond2 = np.where((SPN >= -alpha) & (SPN <= 0))
SPN[cond2] = -1*math.cos((SPN[cond2]*math.pi)/(2*alpha))
The values in the original array need to be replaced with the corresponding value after applying the formula.
Any ideas of how to make this work? I'm working with big arrays, so I need an efficient way of doing it.
Thanks
Try this:
cond2 = (SPN >= -alpha) & (SPN <= 0)
SPN[cond2] = -np.cos(SPN[cond2]*np.pi/(2*alpha))
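For reference, the original version fails because math.cos only accepts scalars, while np.cos is the vectorized equivalent. If you would rather not modify SPN in place, a hedged alternative (assuming the same SPN and alpha) builds a new array with np.where instead:
import numpy as np

# Out-of-place variant: keep original values where the condition is False
cond = (SPN >= -alpha) & (SPN <= 0)
SPN_new = np.where(cond, -np.cos(SPN * np.pi / (2 * alpha)), SPN)
# Note: np.where evaluates the cosine over the whole array, which is harmless
# but does a little extra work compared to the masked assignment above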