I've identified as the bottleneck of my code the following operation on a given Pandas DataFrame df.
df.corr()
I was wondering whether there exist some drop-in replacements to speed this step up?
Thank you!
You might try numpy.corrcoef instead:
pd.DataFrame(np.corrcoef(df.values, rowvar=False), columns=df.columns)
Example Timings
# Setup
np.random.seed(0)
df = pd.DataFrame(np.random.randn(1000, 1000))
df.corr()
# 15 s ± 225 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
pd.DataFrame(np.corrcoef(df.values, rowvar=False), columns=df.columns)
# 24.4 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Related
I have a dataframe with a few million rows where I want to calculate the difference on a daily basis between two columns which are in datetime format.
There are stack overflow questions which answer this question computing the difference on a timestamp basis (see here
Doing it on the timestamp basis felt quite fast:
df["Differnce"] = (df["end_date"] - df["start_date"]).dt.days
But doing it on a daily basis felt quite slow:
df["Differnce"] = (df["end_date"].dt.date - df["start_date"].dt.date).dt.days
I was wondering if there is a easy but better/faster way to achieve the same result?
Example Code:
import pandas as pd
import numpy as np
data = {'Condition' :["a", "a", "b"],
'start_date': [pd.Timestamp('2022-01-01 23:00:00.000000'), pd.Timestamp('2022-01-01 23:00:00.000000'), pd.Timestamp('2022-01-01 23:00:00.000000')],
'end_date': [pd.Timestamp('2022-01-02 01:00:00.000000'), pd.Timestamp('2022-02-01 23:00:00.000000'), pd.Timestamp('2022-01-02 01:00:00.000000')]}
df = pd.DataFrame(data)
df["Right_Difference"] = np.where((df["Condition"] == "a"), ((df["end_date"].dt.date - df["start_date"].dt.date).dt.days), np.nan)
df["Wrong_Difference"] = np.where((df["Condition"] == "a"), ((df["end_date"] - df["start_date"]).dt.days), np.nan)
Use Series.dt.to_period, faster is Series.dt.normalize or Series.dt.floor :
#300k rows
df = pd.concat([df] * 100000, ignore_index=True)
In [286]: %timeit (df["end_date"].dt.date - df["start_date"].dt.date).dt.days
1.14 s ± 135 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [287]: %timeit df["end_date"].dt.to_period('d').astype('int') - df["start_date"].dt.to_period('d').astype('int')
64.1 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [288]: %timeit (df["end_date"].dt.normalize() - df["start_date"].dt.normalize()).dt.days
27.7 ms ± 316 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [289]: %timeit (df["end_date"].dt.floor('d') - df["start_date"].dt.floor('d')).dt.days
27.7 ms ± 937 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Question:
Hi,
When searching for methods to make a selection of a dataframe (being relatively unexperienced with Pandas), I had the following question:
What is faster for large datasets - .isin() or .query()?
Query is somewhat more intuitive to read, so my preferred approach due to my line of work. However, testing it on a very small example dataset, query seems to be much slower.
Is there anyone who has tested this properly before? If so, what were the outcomes? I searched the web, but could not find another post on this.
See the sample code below, which works for Python 3.8.5.
Thanks a lot in advance for your help!
Code:
# Packages
import pandas as pd
import timeit
import numpy as np
# Create dataframe
df = pd.DataFrame({'name': ['Foo', 'Bar', 'Faz'],
'owner': ['Canyon', 'Endurace', 'Bike']},
index=['Frame', 'Type', 'Kind'])
# Show dataframe
df
# Create filter
selection = ['Canyon']
# Filter dataframe using 'isin' (type 1)
df_filtered = df[df['owner'].isin(selection)]
%timeit df_filtered = df[df['owner'].isin(selection)]
213 µs ± 14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Filter dataframe using 'isin' (type 2)
df[np.isin(df['owner'].values, selection)]
%timeit df_filtered = df[np.isin(df['owner'].values, selection)]
128 µs ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Filter dataframe using 'query'
df_filtered = df.query("owner in #selection")
%timeit df_filtered = df.query("owner in #selection")
1.15 ms ± 9.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The best test in real data, here fast comparison for 3k, 300k,3M rows with this sample data:
selection = ['Hedge']
df = pd.concat([df] * 1000, ignore_index=True)
In [139]: %timeit df[df['owner'].isin(selection)]
449 µs ± 58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [140]: %timeit df.query("owner in #selection")
1.57 ms ± 33.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df = pd.concat([df] * 100000, ignore_index=True)
In [142]: %timeit df[df['owner'].isin(selection)]
8.25 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [143]: %timeit df.query("owner in #selection")
13 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
df = pd.concat([df] * 1000000, ignore_index=True)
In [145]: %timeit df[df['owner'].isin(selection)]
94.5 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [146]: %timeit df.query("owner in #selection")
112 ms ± 499 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
If check docs:
DataFrame.query() using numexpr is slightly faster than Python for large frames
Conclusion - The best test in real data, because depends of number of rows, number of matched values and also by length of list selection.
A perfplot over some generated data:
Assuming some hypothetical data, as well as a proportionally increasing selection size (10% of frame size).
Sample data for n=10:
df:
name owner
0 Constant JoVMq
1 Constant jiKNB
2 Constant WEqhm
3 Constant pXNqB
4 Constant SnlbV
5 Constant Euwsj
6 Constant QPPbs
7 Constant Nqofa
8 Constant qeUKP
9 Constant ZBFce
Selection:
['ZBFce']
Performance reflects the docs. At smaller frames the overhead of query is significant over isin However, at frames around 200k rows the performance is comparable to isin and at frames around 10m rows query starts to become more performant.
I agree with #jezrael that, this is, as with most pandas runtime problems, very data dependent, and the best test would be to test on real datasets for a given use case and make a decision based on that.
Edit: Included #AlexanderVolkovsky's suggestion to convert selection to a set and use apply + in:
Perfplot Code:
import string
import numpy as np
import pandas as pd
import perfplot
charset = list(string.ascii_letters)
np.random.seed(5)
def gen_data(n):
df = pd.DataFrame({'name': 'Constant',
'owner': [''.join(np.random.choice(charset, 5))
for _ in range(n)]})
selection = df['owner'].sample(frac=.1).tolist()
return df, selection, set(selection)
def test_isin(params):
df, selection, _ = params
return df[df['owner'].isin(selection)]
def test_query(params):
df, selection, _ = params
return df.query("owner in #selection")
def test_apply_over_set(params):
df, _, set_selection = params
return df[df['owner'].apply(lambda x: x in set_selection)]
if __name__ == '__main__':
out = perfplot.bench(
setup=gen_data,
kernels=[
test_isin,
test_query,
test_apply_over_set
],
labels=[
'test_isin',
'test_query',
'test_apply_over_set'
],
n_range=[2 ** k for k in range(25)],
equality_check=None
)
out.save('perfplot_results.png', transparent=False)
I've run across some legacy code with data stored as a single-row pd.DataFrame.
My intuition would be that working with a pd.Series would be faster in this case - I don't know how they do optimization, but I know that they can and do so.
Is my intuition correct? Or is there no significant difference for most actions?
(to clarify - obviously the best practice would not be a single row DataFrame, but I'm asking about performance)
Yes for a large number of columns there will be a noticeable impact on performance.
You should consider that a DataFrame is a dict of Series so when you perform an operation on the single row, pandas has to coalesce all the column values first before performing the operation.
Even for 100 elements you can see there is a hit:
s = pd.Series(np.random.randn(100))
df = pd.DataFrame(np.random.randn(1,100))
%timeit s.sum()
%timeit df.sum(axis=1)
104 µs ± 5.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
194 µs ± 2.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In my opinion there is no reason to have a single row df that couldn't be achieved with a Series where the index values are the same as the column names for that df
The performance degradation isn't linear as for a 10k array it's not quite 2x worse:
s = pd.Series(np.random.randn(10000))
df = pd.DataFrame(np.random.randn(1,10000))
%timeit s.sum()
%timeit df.sum(axis=1)
149 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
253 µs ± 36.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
As was proposed in a question I posed last week, a memory efficient way to store a column with values in the range [True, False, NaN] would be to use the int8-datatype to denote True as 1, False as 0 and NaN as -1.
If I do this, what would be good practice to "redefine" pandas' isnull() methods to also take into account that, if a column in a dataframe has dtype int8, -1 should be considered a null-value. I could think of defining a new function def isnull(v), that returns if a value is NaN, or -1 in case of dtype int8, but I can imagine this will not be a very fast and efficient solution (given that the dataframe I am working with is multiple gigabytes big, and I want to be able to count the amount of "null"-values in a column/dataframe).
it should be pretty fast...
Timing for 100.000.000 rows series.
In [84]: s = pd.Series(np.random.choice([1,0,-1], 10**8), dtype=np.int8)
In [85]: s.shape
Out[85]: (100000000,)
simulating series.isnull():
In [86]: %timeit s==-1
87 ms ± 3.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [87]: %timeit s.values==-1
84.1 ms ± 2.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [88]: %timeit np.where(s==-1)
546 ms ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [89]: %timeit np.where(s.values==-1)
531 ms ± 2.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
simulating: series.isnull().sum():
In [90]: %timeit (s==-1).sum()
1.39 s ± 38.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [91]: %timeit (s.values==-1).sum()
181 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
PS please pay attention that for counting (summing) them the difference between (s==-1).sum() and (s.values==-1).sum() is pretty noticeable
I have a pandas series series. If I want to get the element-wise floor or ceiling, is there a built in method or do I have to write the function and use apply? I ask because the data is big so I appreciate efficiency. Also this question has not been asked with respect to the Pandas package.
You can use NumPy's built in methods to do this: np.ceil(series) or np.floor(series).
Both return a Series object (not an array) so the index information is preserved.
I am the OP, but I tried this and it worked:
np.floor(series)
UPDATE: THIS ANSWER IS WRONG, DO NOT DO THIS
Explanation: using Series.apply() with a native vectorized Numpy function makes
no sense in most cases as it will run the Numpy function in a Python loop, leading to much worse performance. You'd be much better off using
np.floor(series) directly, as suggested by several other answers.
You could do something like this using NumPy's floor, for instance, with a dataframe:
floored_data = data.apply(np.floor)
Can't test it right now but an actual and working solution might not be far from it.
With pd.Series.clip, you can set a floor via clip(lower=x) or ceiling via clip(upper=x):
s = pd.Series([-1, 0, -5, 3])
print(s.clip(lower=0))
# 0 0
# 1 0
# 2 0
# 3 3
# dtype: int64
print(s.clip(upper=0))
# 0 -1
# 1 0
# 2 -5
# 3 0
# dtype: int64
pd.Series.clip allows generalised functionality, e.g. applying and flooring a ceiling simultaneously, e.g. s.clip(-1, 1)
NOTE: Answer originally referred to clip_lower / clip_upper which were removed in pandas 1.0.0.
The pinned answer already the fastest. Here's I provide some alternative to do ceiling and floor using pure pandas and compare it with the numpy approach.
series = pd.Series(np.random.normal(100,20,1000000))
Floor
%timeit np.floor(series) # 1.65 ms ± 18.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit series.astype(int) # 2.2 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit (series-0.5).round(0) # 3.1 ms ± 47 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit round(series-0.5,0) # 2.83 ms ± 60.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Why astype int works? Because in Python, when converting to integer, that it always get floored.
Ceil
%timeit np.ceil(series) # 1.67 ms ± 21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit (series+0.5).round(0) # 3.15 ms ± 46.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit round(series+0.5,0) # 2.99 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So yeah, just use the numpy function.