Question:
Hi,
When searching for methods to make a selection of a dataframe (being relatively unexperienced with Pandas), I had the following question:
What is faster for large datasets - .isin() or .query()?
Query is somewhat more intuitive to read, so my preferred approach due to my line of work. However, testing it on a very small example dataset, query seems to be much slower.
Is there anyone who has tested this properly before? If so, what were the outcomes? I searched the web, but could not find another post on this.
See the sample code below, which works for Python 3.8.5.
Thanks a lot in advance for your help!
Code:
# Packages
import pandas as pd
import timeit
import numpy as np
# Create dataframe
df = pd.DataFrame({'name': ['Foo', 'Bar', 'Faz'],
'owner': ['Canyon', 'Endurace', 'Bike']},
index=['Frame', 'Type', 'Kind'])
# Show dataframe
df
# Create filter
selection = ['Canyon']
# Filter dataframe using 'isin' (type 1)
df_filtered = df[df['owner'].isin(selection)]
%timeit df_filtered = df[df['owner'].isin(selection)]
213 µs ± 14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Filter dataframe using 'isin' (type 2)
df[np.isin(df['owner'].values, selection)]
%timeit df_filtered = df[np.isin(df['owner'].values, selection)]
128 µs ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Filter dataframe using 'query'
df_filtered = df.query("owner in #selection")
%timeit df_filtered = df.query("owner in #selection")
1.15 ms ± 9.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The best test in real data, here fast comparison for 3k, 300k,3M rows with this sample data:
selection = ['Hedge']
df = pd.concat([df] * 1000, ignore_index=True)
In [139]: %timeit df[df['owner'].isin(selection)]
449 µs ± 58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [140]: %timeit df.query("owner in #selection")
1.57 ms ± 33.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df = pd.concat([df] * 100000, ignore_index=True)
In [142]: %timeit df[df['owner'].isin(selection)]
8.25 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [143]: %timeit df.query("owner in #selection")
13 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
df = pd.concat([df] * 1000000, ignore_index=True)
In [145]: %timeit df[df['owner'].isin(selection)]
94.5 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [146]: %timeit df.query("owner in #selection")
112 ms ± 499 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
If check docs:
DataFrame.query() using numexpr is slightly faster than Python for large frames
Conclusion - The best test in real data, because depends of number of rows, number of matched values and also by length of list selection.
A perfplot over some generated data:
Assuming some hypothetical data, as well as a proportionally increasing selection size (10% of frame size).
Sample data for n=10:
df:
name owner
0 Constant JoVMq
1 Constant jiKNB
2 Constant WEqhm
3 Constant pXNqB
4 Constant SnlbV
5 Constant Euwsj
6 Constant QPPbs
7 Constant Nqofa
8 Constant qeUKP
9 Constant ZBFce
Selection:
['ZBFce']
Performance reflects the docs. At smaller frames the overhead of query is significant over isin However, at frames around 200k rows the performance is comparable to isin and at frames around 10m rows query starts to become more performant.
I agree with #jezrael that, this is, as with most pandas runtime problems, very data dependent, and the best test would be to test on real datasets for a given use case and make a decision based on that.
Edit: Included #AlexanderVolkovsky's suggestion to convert selection to a set and use apply + in:
Perfplot Code:
import string
import numpy as np
import pandas as pd
import perfplot
charset = list(string.ascii_letters)
np.random.seed(5)
def gen_data(n):
df = pd.DataFrame({'name': 'Constant',
'owner': [''.join(np.random.choice(charset, 5))
for _ in range(n)]})
selection = df['owner'].sample(frac=.1).tolist()
return df, selection, set(selection)
def test_isin(params):
df, selection, _ = params
return df[df['owner'].isin(selection)]
def test_query(params):
df, selection, _ = params
return df.query("owner in #selection")
def test_apply_over_set(params):
df, _, set_selection = params
return df[df['owner'].apply(lambda x: x in set_selection)]
if __name__ == '__main__':
out = perfplot.bench(
setup=gen_data,
kernels=[
test_isin,
test_query,
test_apply_over_set
],
labels=[
'test_isin',
'test_query',
'test_apply_over_set'
],
n_range=[2 ** k for k in range(25)],
equality_check=None
)
out.save('perfplot_results.png', transparent=False)
Related
I have a dataframe with a few million rows where I want to calculate the difference on a daily basis between two columns which are in datetime format.
There are stack overflow questions which answer this question computing the difference on a timestamp basis (see here
Doing it on the timestamp basis felt quite fast:
df["Differnce"] = (df["end_date"] - df["start_date"]).dt.days
But doing it on a daily basis felt quite slow:
df["Differnce"] = (df["end_date"].dt.date - df["start_date"].dt.date).dt.days
I was wondering if there is a easy but better/faster way to achieve the same result?
Example Code:
import pandas as pd
import numpy as np
data = {'Condition' :["a", "a", "b"],
'start_date': [pd.Timestamp('2022-01-01 23:00:00.000000'), pd.Timestamp('2022-01-01 23:00:00.000000'), pd.Timestamp('2022-01-01 23:00:00.000000')],
'end_date': [pd.Timestamp('2022-01-02 01:00:00.000000'), pd.Timestamp('2022-02-01 23:00:00.000000'), pd.Timestamp('2022-01-02 01:00:00.000000')]}
df = pd.DataFrame(data)
df["Right_Difference"] = np.where((df["Condition"] == "a"), ((df["end_date"].dt.date - df["start_date"].dt.date).dt.days), np.nan)
df["Wrong_Difference"] = np.where((df["Condition"] == "a"), ((df["end_date"] - df["start_date"]).dt.days), np.nan)
Use Series.dt.to_period, faster is Series.dt.normalize or Series.dt.floor :
#300k rows
df = pd.concat([df] * 100000, ignore_index=True)
In [286]: %timeit (df["end_date"].dt.date - df["start_date"].dt.date).dt.days
1.14 s ± 135 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [287]: %timeit df["end_date"].dt.to_period('d').astype('int') - df["start_date"].dt.to_period('d').astype('int')
64.1 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [288]: %timeit (df["end_date"].dt.normalize() - df["start_date"].dt.normalize()).dt.days
27.7 ms ± 316 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [289]: %timeit (df["end_date"].dt.floor('d') - df["start_date"].dt.floor('d')).dt.days
27.7 ms ± 937 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Suppose that we have a large dataframe with duplicate index,
# IPython
In [1]: import pandas as pd
In [2]: from numpy.random import randint
In [3]: df = pd.DataFrame({'a': randint(1, 10, 10000)}, index=randint(1, 10, 10000))
and we want to group by column 'a' and do some operations using apply, such as (just do nothing in apply)
In [4]: df.groupby('a').apply(lambda x: x)
which will take a long time:
In [5]: %timeit df.groupby('a').apply(lambda x: x)
19.9 s ± 322 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If the size goes bigger, it becames unbearable. However, if we reset_index first, it goes fast.
In [6]: %timeit df.reset_index(drop=True).groupby('a').apply(lambda x: x)
2.24 ms ± 60.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Then my question is how apply handle duplicate index, and why it's so slow.
Thanks for any help.
I have a pandas dataframe with two columns, where I need to check where the value at each row of column A is a string that starts with the value of the corresponding row at column B or viceversa.
It seems that the Series method .str.startswith cannot deal with vectorized input, so I needed to zip over the two columns in a list comprehension and create a new pd.Series with the same index as any of the two columns.
I would like this to be a vectorized operation with the .str accessor available to operate on iterables, but something like this returns NaN:
df = pd.DataFrame(data={'a':['x','yy'], 'b':['xyz','uvw']})
df['a'].str.startswith(df['b'])
while my working solution is the following:
pd.Series(index=df.index, data=[a.startswith(b) or b.startswith(a) for a,b in zip(df['a'],df['b'])])
I suspect that there may be a better way to tackle this issue as it also would benefit all string methods on series.
Is there any more beautiful or efficient method to do this?
One idea is use np.vecorize, but because working with strings performance is only a bit better like your solution:
def fun (a,b):
return a.startswith(b) or b.startswith(a)
f = np.vectorize(fun)
a = pd.Series(f(df['a'],df['b']), index=df.index)
print (a)
0 True
1 False
dtype: bool
df = pd.DataFrame(data={'a':['x','yy'], 'b':['xyz','uvw']})
df = pd.concat([df] * 10000, ignore_index=True)
In [132]: %timeit pd.Series(index=df.index, data=[a.startswith(b) or b.startswith(a) for a,b in df[['a', 'b']].to_numpy()])
42.3 ms ± 516 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [133]: %timeit pd.Series(f(df['a'],df['b']), index=df.index)
9.81 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [134]: %timeit pd.Series(index=df.index, data=[a.startswith(b) or b.startswith(a) for a,b in zip(df['a'],df['b'])])
14.1 ms ± 262 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#sammywemmy solution
In [135]: %timeit pd.Series([any((a.startswith(b), b.startswith(a))) for a, b in df.to_numpy()], index=df.index)
46.3 ms ± 683 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I am a newbie and I need some insight. Say I have a pandas dataframe as follows:
temp = pd.DataFrame()
temp['A'] = np.random.rand(100)
temp['B'] = np.random.rand(100)
temp['C'] = np.random.rand(100)
I need to write a function where I replace every value in column "C" with 0's if the value of "A" is bigger than 0.5 in the corresponding row. Otherwise I need to multiply A and B in the same row element-wise and write down the output at the corresponding row on column "C".
What I did so far, is:
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B
It works just as I desire it to work HOWEVER I am not sure if there's a faster way to implement this. I am very skeptical especially in the slicings that I feel like it's abundant to use those many slices. Though, I couldn't find any other solutions since I have to write 0's for C rows where A is bigger than 0.5.
Or, is there a way to slice the part that is needed only, perform calculations, then somehow remember the indices so you could put the required values back to the original data-frame on the corresponding rows?
One way using numpy.where:
temp["C"] = np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0)
Benchmark (about 4x faster in sample, and keeps on increasing):
# With given sample of 100 rows
%%timeit
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B
# 819 µs ± 2.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0)
# 174 µs ± 455 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Benchmark on larger data (about 7x faster)
temp = pd.DataFrame()
temp['A'] = np.random.rand(1000000)
temp['B'] = np.random.rand(1000000)
temp['C'] = np.random.rand(1000000)
%%timeit
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B
# 35.2 ms ± 345 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0)
# 5.16 ms ± 188 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Validation
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B
np.array_equal(temp["C"], np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0))
# True
I've run across some legacy code with data stored as a single-row pd.DataFrame.
My intuition would be that working with a pd.Series would be faster in this case - I don't know how they do optimization, but I know that they can and do so.
Is my intuition correct? Or is there no significant difference for most actions?
(to clarify - obviously the best practice would not be a single row DataFrame, but I'm asking about performance)
Yes for a large number of columns there will be a noticeable impact on performance.
You should consider that a DataFrame is a dict of Series so when you perform an operation on the single row, pandas has to coalesce all the column values first before performing the operation.
Even for 100 elements you can see there is a hit:
s = pd.Series(np.random.randn(100))
df = pd.DataFrame(np.random.randn(1,100))
%timeit s.sum()
%timeit df.sum(axis=1)
104 µs ± 5.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
194 µs ± 2.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In my opinion there is no reason to have a single row df that couldn't be achieved with a Series where the index values are the same as the column names for that df
The performance degradation isn't linear as for a 10k array it's not quite 2x worse:
s = pd.Series(np.random.randn(10000))
df = pd.DataFrame(np.random.randn(1,10000))
%timeit s.sum()
%timeit df.sum(axis=1)
149 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
253 µs ± 36.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)