Suppose I have data for 50K shoppers and the products they bought. I want to count the number of times each user purchased product "a". value_counts seems to be the fastest way to calculate these types of numbers for a grouped pandas data frame. However, I was surprised at how much slower it was to calculate the purchase frequency for just one specific product (e.g., "a") using agg or apply. I could select a specific column from a data frame created using value_counts but that could be rather inefficient on very large data sets with lots of products.
Below is a simulated example where each customer purchases 10 times from a set of three products. Even at this size, apply and agg are noticeably slower than value_counts. Is there a better/faster way to extract information like this from a grouped pandas data frame?
import pandas as pd
import numpy as np
df = pd.DataFrame({
"col1": [f'c{j}' for i in range(10) for j in range(50000)],
"col2": np.random.choice(["a", "b", "c"], size=500000, replace=True)
})
dfg = df.groupby("col1")
# value_counts is fast
dfg["col2"].value_counts().unstack()
# apply and agg are (much) slower
dfg["col2"].apply(lambda x: (x == "a").sum())
dfg["col2"].agg(lambda x: (x == "a").sum())
# much faster to do
dfg["col2"].value_counts().unstack()["a"]
EDIT:
Two great responses to this question. Given the starting point of an already grouped data frame, it seems there may not be a better/faster way to count the number of occurrences of a single level in a categorical variable than using (1) apply or agg with a lambda function or (2) using value_counts to get the counts for all levels and then selecting the one you need.
The groupby/size approach is an excellent alternative to value_counts. With a minor edit to Cainã Max Couto-Silva's answer, this would give:
dfg = df.groupby(['col1', 'col2'])
dfg.size().unstack(fill_value=0)["a"]
I assume there would be a trade-off at some point where if you have many levels apply/agg or value_counts on an already grouped data frame may be faster than the groupby/size approach which requires creating a newly grouped data frame. I'll post back when I have some time to look into that.
Thanks for the comments and answers!
This is still faster:
dfg = df.groupby(['col1','col2'])
dfg.size().unstack()
Tests:
%%timeit
pd.crosstab(df.col1, df.col2)
# > 712 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
dfg = df.groupby("col1")
dfg["col2"].value_counts().unstack()
# > 165 ms ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
dfg = df.groupby(['col1','col2'])
dfg.size().unstack()
# > 131 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If we expand the dataframe to 5 million rows:
df = pd.concat([df for _ in range(10)])
print(f'df.shape = {df.shape}')
# > df.shape = (5000000, 2)
print(f'{df.shape[0]:,} rows.')
# > 5,000,000 rows.
%%timeit
pd.crosstab(df.col1, df.col2)
# > 1.58 s ± 33.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
dfg = df.groupby("col1")
dfg["col2"].value_counts().unstack()
# > 1.27 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
dfg = df.groupby(['col1','col2'])
dfg.size().unstack()
# > 847 ms ± 53.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
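As a quick sanity check, here is a minimal sketch showing that the three approaches timed above produce the same table. The tiny purchase log is made up for illustration, and fill_value=0 is used so that a user who never bought a product shows 0 instead of NaN.

```python
import pandas as pd

# Tiny hypothetical purchase log; user c2 never buys product "a".
df = pd.DataFrame({
    "col1": ["c0", "c0", "c0", "c1", "c1", "c2", "c2"],
    "col2": ["a", "a", "b", "a", "c", "b", "c"],
})

# 1) groupby on the user, value_counts on the product
by_value_counts = (
    df.groupby("col1")["col2"].value_counts()
    .unstack(fill_value=0)
    .sort_index()
    .sort_index(axis=1)
)

# 2) groupby on both columns, then size
by_size = df.groupby(["col1", "col2"]).size().unstack(fill_value=0)

# 3) crosstab
by_crosstab = pd.crosstab(df["col1"], df["col2"])

print(by_size["a"].tolist())  # [2, 1, 0]
```

All three return one row per user and one column per product, so selecting ["a"] afterwards works the same way on each.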
Filter before value_counts
df.loc[df.col2=='a','col1'].value_counts()['c0']
Also I think crosstab is 'faster' than groupby + value_counts
pd.crosstab(df.col1, df.col2)
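A minimal sketch of the "filter first" idea, assuming you only need the counts for product "a". One caveat: after filtering, users who never bought "a" disappear from the result, so the sketch reindexes to bring them back as 0. The second variant (a boolean mask summed per user) is an alternative I am adding for comparison, not something taken from the answers above. The data is a smaller version of the simulated frame from the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "col1": [f"c{j}" for i in range(10) for j in range(1000)],
    "col2": rng.choice(["a", "b", "c"], size=10_000),
})

# Option 1: filter first, then count per user.
# Users without any "a" are missing after the filter, so add them back as 0.
users = df["col1"].unique()
counts_filter = (
    df.loc[df["col2"] == "a", "col1"]
    .value_counts()
    .reindex(users, fill_value=0)
    .sort_index()
)

# Option 2: build a boolean mask once, then sum it per user (True counts as 1).
counts_mask = df["col2"].eq("a").groupby(df["col1"]).sum().sort_index()

# Reference: the slow lambda version from the question.
counts_ref = (
    df.groupby("col1")["col2"].agg(lambda x: (x == "a").sum()).sort_index()
)
```

Both options avoid calling a Python lambda once per group, which is where apply and agg lose most of their time.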
I have this code that does some analysis on a DataFrame. both_profitable is True if and only if both long_profitable and short_profitable in that row are True. However, the DataFrame is quite large and using pandas apply on axis=1 is more taxing than I'd like.
output["long_profitable"] = (
df[[c for c in df.columns if "long_profit" in c]].ge(target).any(axis=1)
)
output["short_profitable"] = (
df[[c for c in df.columns if "short_profit" in c]].ge(target).any(axis=1)
)
output["both_profitable"] = output.apply(
lambda x: True if x["long_profitable"] and x["short_profitable"] else False,
axis=1,
)
Is there a simpler/more optimized way to achieve this same goal?
Since both columns are boolean, use the bitwise & operator:
output["both_profitable"] = output["long_profitable"] & output["short_profitable"]
Note that the eq method is not equivalent here: it is also True for rows where both columns are False, so it only matches the requirement if that case never occurs:
output["both_profitable"] = output["long_profitable"].eq(output["short_profitable"])
Also FYI, you could use str.contains + loc, instead of a list comprehension to select columns of df:
output["long_profitable"] = df.loc[:, df.columns.str.contains('long_profit')].ge(target).any(axis=1)
output["short_profitable"] = df.loc[:, df.columns.str.contains('short_profit')].ge(target).any(axis=1)
both_profitable is True if and only if both long_profitable and short_profitable in that row are True
In other words, both_profitable is the result of boolean AND on the two columns.
This can be achieved in several ways:
output['long_profitable'] & output['short_profitable']
# for any number of boolean columns, all of which we want to AND
cols = ['long_profitable', 'short_profitable']
output[cols].all(axis=1)
# same logic, using prod() -- this is just for fun; use all() instead
output[cols].prod(axis=1).astype(bool)
Of course, you can assign any of the above to a new column:
output_modified = output.assign(both_profitable=...)
Note: the 2nd and 3rd forms are of particular interest if you are AND-ing many columns.
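To make the difference between the options concrete, here is a small truth-table sketch on a four-row frame (the data is made up to cover every combination). It shows that & and all(axis=1) agree, while eq differs on the row where both columns are False.

```python
import pandas as pd

# Hypothetical frame covering all four True/False combinations.
output = pd.DataFrame({
    "long_profitable":  [True, True, False, False],
    "short_profitable": [True, False, True, False],
})
cols = ["long_profitable", "short_profitable"]

both_and = output["long_profitable"] & output["short_profitable"]
both_all = output[cols].all(axis=1)
both_eq = output["long_profitable"].eq(output["short_profitable"])

print(both_and.tolist())  # [True, False, False, False]
print(both_all.tolist())  # [True, False, False, False]
print(both_eq.tolist())   # [True, False, False, True]
```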
Timing
n = 10_000_000
np.random.seed(0)
output = pd.DataFrame({
'long_profitable': np.random.randint(0, 2, n, dtype=bool),
'short_profitable': np.random.randint(0, 2, n, dtype=bool),
})
%timeit output['long_profitable'] & output['short_profitable']
# 4.52 ms ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit output[cols].all(axis=1)
# 18.6 ms ± 53 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit output[cols].prod(axis=1).astype(bool)
# 71.6 ms ± 375 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Question:
Hi,
When searching for methods to select part of a dataframe (being relatively inexperienced with Pandas), I had the following question:
What is faster for large datasets - .isin() or .query()?
Query is somewhat more intuitive to read, so it is my preferred approach given my line of work. However, testing it on a very small example dataset, query seems to be much slower.
Is there anyone who has tested this properly before? If so, what were the outcomes? I searched the web, but could not find another post on this.
See the sample code below, which works for Python 3.8.5.
Thanks a lot in advance for your help!
Code:
# Packages
import pandas as pd
import timeit
import numpy as np
# Create dataframe
df = pd.DataFrame({'name': ['Foo', 'Bar', 'Faz'],
'owner': ['Canyon', 'Endurace', 'Bike']},
index=['Frame', 'Type', 'Kind'])
# Show dataframe
df
# Create filter
selection = ['Canyon']
# Filter dataframe using 'isin' (type 1)
df_filtered = df[df['owner'].isin(selection)]
%timeit df_filtered = df[df['owner'].isin(selection)]
213 µs ± 14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Filter dataframe using 'isin' (type 2)
df[np.isin(df['owner'].values, selection)]
%timeit df_filtered = df[np.isin(df['owner'].values, selection)]
128 µs ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Filter dataframe using 'query'
df_filtered = df.query("owner in @selection")
%timeit df_filtered = df.query("owner in @selection")
1.15 ms ± 9.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The best test is on real data. Here is a quick comparison for 3k, 300k and 3M rows with this sample data:
selection = ['Hedge']
df = pd.concat([df] * 1000, ignore_index=True)
In [139]: %timeit df[df['owner'].isin(selection)]
449 µs ± 58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [140]: %timeit df.query("owner in @selection")
1.57 ms ± 33.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df = pd.concat([df] * 100000, ignore_index=True)
In [142]: %timeit df[df['owner'].isin(selection)]
8.25 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [143]: %timeit df.query("owner in @selection")
13 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
df = pd.concat([df] * 1000000, ignore_index=True)
In [145]: %timeit df[df['owner'].isin(selection)]
94.5 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [146]: %timeit df.query("owner in @selection")
112 ms ± 499 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
According to the docs:
DataFrame.query() using numexpr is slightly faster than Python for large frames
Conclusion: the best test is on real data, because the result depends on the number of rows, the number of matched values, and the length of the selection list.
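If you want to repeat this comparison on your own data, a minimal harness could look like the sketch below. The helper names filter_isin and filter_query are made up for illustration, and the plain timeit module stands in for the %timeit magic so it also runs outside IPython. It first checks that both filters return the same rows, then times them.

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    "name": ["Foo", "Bar", "Faz"],
    "owner": ["Canyon", "Endurace", "Bike"],
})
df = pd.concat([df] * 1000, ignore_index=True)  # 3k rows
selection = ["Canyon"]


def filter_isin(frame, values):
    # Boolean mask built by Series.isin
    return frame[frame["owner"].isin(values)]


def filter_query(frame, values):
    # "@values" refers to the local variable `values`
    return frame.query("owner in @values")


by_isin = filter_isin(df, selection)
by_query = filter_query(df, selection)

# Timings depend on the machine, so no expected numbers are given here.
for label, func in [("isin", filter_isin), ("query", filter_query)]:
    seconds = timeit.timeit(lambda: func(df, selection), number=20)
    print(f"{label}: {seconds / 20 * 1e3:.3f} ms per call")
```

Swap in your real frame and selection list to see where the crossover is for your case.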
A perfplot over some generated data:
Assuming some hypothetical data, as well as a proportionally increasing selection size (10% of frame size).
Sample data for n=10:
df:
name owner
0 Constant JoVMq
1 Constant jiKNB
2 Constant WEqhm
3 Constant pXNqB
4 Constant SnlbV
5 Constant Euwsj
6 Constant QPPbs
7 Constant Nqofa
8 Constant qeUKP
9 Constant ZBFce
Selection:
['ZBFce']
Performance reflects the docs. At smaller frames the overhead of query is significant compared with isin. However, at frames around 200k rows the performance is comparable to isin, and at frames around 10m rows query starts to become more performant.
I agree with @jezrael that this is, as with most pandas runtime problems, very data dependent, and the best test would be to test on real datasets for a given use case and make a decision based on that.
Edit: Included @AlexanderVolkovsky's suggestion to convert selection to a set and use apply + in:
Perfplot Code:
import string
import numpy as np
import pandas as pd
import perfplot
charset = list(string.ascii_letters)
np.random.seed(5)
def gen_data(n):
    df = pd.DataFrame({'name': 'Constant',
                       'owner': [''.join(np.random.choice(charset, 5))
                                 for _ in range(n)]})
    selection = df['owner'].sample(frac=.1).tolist()
    return df, selection, set(selection)
def test_isin(params):
    df, selection, _ = params
    return df[df['owner'].isin(selection)]
def test_query(params):
    df, selection, _ = params
    return df.query("owner in @selection")
def test_apply_over_set(params):
    df, _, set_selection = params
    return df[df['owner'].apply(lambda x: x in set_selection)]
if __name__ == '__main__':
    out = perfplot.bench(
        setup=gen_data,
        kernels=[
            test_isin,
            test_query,
            test_apply_over_set
        ],
        labels=[
            'test_isin',
            'test_query',
            'test_apply_over_set'
        ],
        n_range=[2 ** k for k in range(25)],
        equality_check=None
    )
    out.save('perfplot_results.png', transparent=False)
I am a newbie and I need some insight. Say I have a pandas dataframe as follows:
temp = pd.DataFrame()
temp['A'] = np.random.rand(100)
temp['B'] = np.random.rand(100)
temp['C'] = np.random.rand(100)
I need to write a function where I replace every value in column "C" with 0's if the value of "A" is bigger than 0.5 in the corresponding row. Otherwise I need to multiply A and B in the same row element-wise and write down the output at the corresponding row on column "C".
What I did so far, is:
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B
It works just as I want it to. However, I am not sure whether there's a faster way to implement this. I am especially skeptical about the slicing; using that many slices feels redundant. Still, I couldn't find any other solution, since I have to write 0's for C rows where A is bigger than 0.5.
Or, is there a way to slice the part that is needed only, perform calculations, then somehow remember the indices so you could put the required values back to the original data-frame on the corresponding rows?
One way using numpy.where:
temp["C"] = np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0)
Benchmark (about 4x faster in sample, and keeps on increasing):
# With given sample of 100 rows
%%timeit
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B
# 819 µs ± 2.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0)
# 174 µs ± 455 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Benchmark on larger data (about 7x faster)
temp = pd.DataFrame()
temp['A'] = np.random.rand(1000000)
temp['B'] = np.random.rand(1000000)
temp['C'] = np.random.rand(1000000)
%%timeit
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B
# 35.2 ms ± 345 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0)
# 5.16 ms ± 188 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Validation
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B
np.array_equal(temp["C"], np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0))
# True
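On the second part of the question (slice, compute, then write back to the right rows): a minimal sketch, assuming the same column names as above. The idea is to build the boolean mask once and reuse it. Series.where keeps the product where the mask is True and writes 0 elsewhere, and a .loc assignment with the same mask puts values back on the matching rows because pandas aligns on the index.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
temp = pd.DataFrame(rng.random((100, 3)), columns=["A", "B", "C"])

# Build the condition once and reuse it everywhere.
mask = temp["A"] < 0.5

# Variant 1: numpy.where, as in the answer above.
c_where = np.where(mask, temp["A"] * temp["B"], 0)

# Variant 2: Series.where keeps values where mask is True, else writes 0.
c_series = (temp["A"] * temp["B"]).where(mask, 0)

# Variant 3: compute only on the selected rows and write them back.
# The index of the right-hand side tells pandas which rows to fill.
c_loc = pd.Series(0.0, index=temp.index)
c_loc[mask] = temp.loc[mask, "A"] * temp.loc[mask, "B"]

temp["C"] = c_where
```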
I am extracting maximum rainfall intensity for different durations using data with 5-minute rainfall totals. The code produces a list of max rainfall intensity for each duration (DURS). The code works but is slow when using data sets with 1,000,000+ rows. I am new to Pandas and I understand the apply() method is much faster than using a For loop but I do not know how to re-write a For loop using the apply() method.
Example of dataframe:
Value[mm] State of value
Date_Time
2020-01-01 00:00:00 1.0 5
2020-01-01 00:05:00 0.5 5
2020-01-01 00:10:00 4.0 5
2020-01-01 00:15:00 2.0 5
2020-01-01 00:20:00 2.0 5
2020-01-01 00:25:00 0.5 5
Example of Code:
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter
import math, numpy, array, glob
import pandas as pd
import numpy as np
pluvi_file = "rain.csv"
DURS = [5,6,10,15,20,25,30,45,60,90,120,180,270,360,540,720,1440,2880,4320]
df = pd.read_csv(pluvi_file, delimiter=',',parse_dates=[['Date','Time']])
df['Date_Time'] = pd.to_datetime(df['Date_Time'], dayfirst=True)
df.index = df['Date_Time']
del df['Date_Time']
lista = []
for DUR in DURS:
    x = str(DUR)+' Min'
    df1 = df.groupby(pd.Grouper(freq=x)).sum()
    a = df1['Value[mm]'].max()/DUR*60
    print(a)
    lista.append(a)
Output (Max rainfall intensity for each duration in mm/hr):
5 66.0
6 60.0
10 54.0
15 40.0
20 40.5
25 30.0
30 34.0
45 26.666666666666664
60 26.5
90 20.666666666666668
120 23.0
180 12.166666666666666
270 8.11111111111111
360 9.416666666666666
540 6.444444444444445
720 4.708333333333333
1440 3.8958333333333335
2880 2.7708333333333335
4320 2.1597222222222223
How would I re-write this using the apply() method?
Solution
It looks like apply doesn't fit here, since the functions you are applying to the groups are already vectorised methods from Essential Basic Functionality. Removing the for loop also doesn't look like a promising route to better performance, since there aren't many durations in your DURS list. The main cost is the grouping operation and the calculations on the groups, and there isn't much room for optimization there, at least in my opinion.
Create artificial data
import pandas as pd
df = pd.DataFrame({'Date_Time' : ["2020-01-01 00:00:00",
"2020-01-01 00:05:00",
"2020-01-01 00:10:00",
"2020-01-01 00:15:00",
"2020-01-01 00:20:00",
"2020-01-01 00:25:00"],
'Value[mm]' : [1.0,0.5,4.0,2.0,2.0,0.5],
'State of value': [5,5,5,5,5,5]
})
df = df.sample(3900875, replace=True).reset_index(drop=True)
Now, let's set Date_Time as the index and keep just the series we need to calculate our values
df['Date_Time'] = pd.to_datetime(df['Date_Time'], dayfirst=True)
df = df.set_index('Date_Time', drop = True)
df = df['Value[mm]']
Compare the performance of different approaches
Grouping and looping
%%timeit
lista = []
for DUR in DURS:
    x = str(DUR)+' Min'
    df1 = df.groupby(pd.Grouper(freq=x)).sum()
    a = df1.max()/DUR*60
    lista.append(a)
19.6 s ± 439 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Resampling
The speed-up is probably random here, since it looks like the same thing is happening under the hood.
%%timeit
def get_max_by_dur(DUR):
    return df.resample(str(DUR)+"Min").sum().max()
l_a = [get_max_by_dur(dur)/dur*60 for dur in DURS]
17.2 s ± 559 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
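For reference, here is a small self-contained sketch of the resample version on synthetic data. It checks that resample and groupby(pd.Grouper(...)) give the same maxima. The synthetic series (one week of 5-minute totals drawn from a gamma distribution) and the shortened DURS list are assumptions made only so the example runs quickly, and the frequency is written as "min".

```python
import numpy as np
import pandas as pd

# Synthetic 5-minute rainfall totals for one week (made-up values).
idx = pd.date_range("2020-01-01", periods=12 * 24 * 7, freq="5min")
rng = np.random.default_rng(0)
rain = pd.Series(
    rng.gamma(0.3, 1.0, size=len(idx)).round(1), index=idx, name="Value[mm]"
)

DURS = [5, 10, 15, 30, 60, 120, 1440]  # shortened list for the sketch


def max_intensity(series, dur):
    # Sum rainfall in fixed windows of `dur` minutes, take the wettest
    # window, and convert its total to mm/hr.
    return series.resample(f"{dur}min").sum().max() / dur * 60


intensity = pd.Series({dur: max_intensity(rain, dur) for dur in DURS})

# Same calculation with groupby + Grouper, as in the question.
intensity_grouper = pd.Series({
    dur: rain.groupby(pd.Grouper(freq=f"{dur}min")).sum().max() / dur * 60
    for dur in DURS
})
print(intensity)
```

The result is indexed by duration, which is a bit easier to work with than a bare list.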
Resampling + Dask
Despite the fact that there's no way to properly vectorize - you still can make some parallelization and optimization with Dask.
!python3 -m pip install "dask[dataframe]" --upgrade
import dask.dataframe as dd
%%timeit
dd_df = dd.from_pandas(df, npartitions = 1)
def get_max_by_dur(DUR):
    return dd_df.resample(str(DUR)+"Min").sum().max()
l_a = [(get_max_by_dur(dur)/dur*60).compute() for dur in DURS]
2.21 s ± 110 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Few words on apply and optimization
Usually, you use apply to apply a function along an axis of a DataFrame. It is a substitute for looping through the rows or columns of the DataFrame itself, but in reality apply is just a glorified loop with some extra functionality. So, when performance matters, you usually want to try the following options, roughly in this order:
Vectorization or Essential Basic Functionality (as you made)
Cython routines or numba
List comprehension.
Apply method
Iteration
Illustration
Let's say you want to get the product of two columns
1). Vectorization or basic methods.
Basic methods:
df["product"] = df.prod(axis=1)
162 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Vectorization:
import numpy as np
def multiply(Value,State): # you may use lambda here as well
    return Value*State
%timeit df["new_column"] = np.vectorize(multiply) (df["Value[mm]"], df["State of value"])
853 ms ± 42.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2). Cython or numba
It can be very useful if you have already written some loops. You can often just decorate the function with @numba.jit and achieve a significant performance boost. It's also very helpful when you want to compute some iterative value that is difficult to vectorize.
Since the function we chose is a simple multiplication, you won't see a benefit compared to the usual apply.
%%cython
cdef double cython_multiply(double Value, double State):
    return Value * State
%timeit df["new_column"] = df.apply(lambda row:multiply(row["Value[mm]"], row["State of value"]), axis = 1)
1min 38s ± 4 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
3). List comprehension.
It's pythonic and also quite similar to a for loop.
%timeit df["new_column"] = [x*y for x, y in zip(df["Value[mm]"], df["State of value"])]
1.56 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4). Apply method
Notice, how slow it is.
%timeit df["new_column"] = df.apply(lambda row:row["Value[mm]"]*row["State of value"], axis = 1)
1min 37s ± 4.76 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
5). Looping thru rows
itertuples:
%%timeit
list_a = []
for row in df.itertuples():
    list_a.append(row[2]*row[3])
df['product'] = list_a
9.81 s ± 831 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
iterrows (you probably shouldn't use that):
%%timeit
list_a = []
for row in df.iterrows():
    list_a.append(row[1][1]*row[1][2])
df['product'] = list_a
6min 40s ± 1min 8s per loop (mean ± std. dev. of 7 runs, 1 loop each)
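To tie the illustration together, here is a minimal sketch on the six-row sample frame from the question showing that the approaches return the same numbers, so the choice between them is only about speed. The plain column arithmetic (value * state) is added here as the usual vectorized form. As a side note, the NumPy documentation describes np.vectorize as a convenience wrapper that is essentially a for loop, which is consistent with it landing near the list comprehension in the timings above.

```python
import pandas as pd

df = pd.DataFrame({
    "Value[mm]": [1.0, 0.5, 4.0, 2.0, 2.0, 0.5],
    "State of value": [5, 5, 5, 5, 5, 5],
})
value = df["Value[mm]"]
state = df["State of value"]

# Vectorized: arithmetic on whole columns at once.
vectorized = value * state

# Same thing on the underlying NumPy arrays.
via_numpy = value.to_numpy() * state.to_numpy()

# List comprehension over paired values.
via_listcomp = [v * s for v, s in zip(value, state)]

# Row-wise apply: one Python call per row.
via_apply = df.apply(
    lambda row: row["Value[mm]"] * row["State of value"], axis=1
)

print(vectorized.tolist())  # [5.0, 2.5, 20.0, 10.0, 10.0, 2.5]
```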
I've a dataframe having around 16,000 rows and I'm performing max aggregation of one column and grouping it by another one.
df.groupby(['col1']).agg({'col2': 'max'}).reset_index()
It takes 1.97s. I'd like to improve its performance. Please suggest an approach along the lines of NumPy or vectorization.
The datatype of both columns is object.
%%timeit
df.groupby(['col1']).agg({'col2': 'max'}).reset_index()
1.97 s ± 42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I played around with datatypes and changed datatype of col2 to integer and it reduced the run time significantly.
%%timeit
df['col2'] = df['col2'].astype(int)
df.groupby(['col1']).agg({'col2': 'max'}).reset_index()
6.58 ms ± 74.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Aggregating integers is faster than aggregating strings.
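One thing worth checking beyond speed, sketched below on made-up data: if col2 holds numbers stored as strings (an assumption, based on astype(int) working in the question), the max of the object column is a string comparison, so the result itself can differ from the numeric max. pd.to_numeric is used here as one way to convert.

```python
import pandas as pd

# Hypothetical data: numbers stored as strings in col2.
df = pd.DataFrame({
    "col1": ["a", "a", "b", "b"],
    "col2": ["9", "10", "3", "25"],
})

# Max on the string column compares text, so "9" beats "10".
max_as_str = df.groupby("col1")["col2"].max()

# Convert once, then aggregate the numeric column.
df["col2_num"] = pd.to_numeric(df["col2"])
max_as_num = df.groupby("col1")["col2_num"].max()

print(max_as_str.tolist())  # ['9', '3']
print(max_as_num.tolist())  # [10, 25]
```

So converting the column before grouping can be a correctness fix as well as a speed-up.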