pandas to_datetime is very slow - python

I have a decent-sized sparse data frame with several date/time columns in string format. I am trying to convert them to datetime (or Timestamp) objects using the standard pandas to_datetime() method, but it is too slow.
I ended up writing a "fast" to_datetime function (below). It is significantly faster, but it still seems slow. Profiling tells me all of the time is spent on the last line.
Am I off the deep end? Is there a different (and faster) way to do this?
In [98]: df.shape
Out[98]: (2497977, 79)
In [117]: len(df.reference_date.dropna())
Out[117]: 2004185
In [118]: len(df.reference_date.dropna().unique())
Out[118]: 157
In [119]: %time df.reference_date = pandas.to_datetime(df.reference_date)
CPU times: user 3min 2s, sys: 434 ms, total: 3min 2s
Wall time: 3min 2s
In [123]: %time fast_to_datetime(dataframe=df, column='reference_date', date_format='%Y%m%d')
CPU times: user 3.58 s, sys: 343 ms, total: 3.92 s
Wall time: 3.92 s
def fast_to_datetime(dataframe, column, date_format=None):
    # Convert only the unique, non-null date strings (157 here, vs ~2M rows) ...
    tmp_dates = dataframe[column].dropna().unique()
    unique_dates = pandas.DataFrame(tmp_dates, columns=['orig_date'])
    unique_dates.set_index(keys=['orig_date'], drop=False, inplace=True, verify_integrity=True)
    unique_dates[column] = pandas.to_datetime(unique_dates.orig_date, format=date_format)
    # ... then broadcast them back onto the full column via index-aligned assignment
    dataframe.set_index(keys=column, drop=False, inplace=True)
    dataframe[column] = unique_dates[column]
In [126]: sys.version
Out[126]: '2.7.5 (default, Nov 20 2015, 02:00:19) \n[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)]'
In [127]: pandas.__version__
Out[127]: u'0.17.0'
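For reference, the same unique-then-map idea can be expressed as a plain dictionary lookup, and newer pandas releases (0.23+) build this caching into to_datetime itself via cache=True. A minimal sketch, assuming the same df and date format as above:
import pandas

# convert each of the ~157 distinct strings once, then map the result over the column;
# NaN values are not in the dict and simply stay missing
lookup = {s: pandas.to_datetime(s, format='%Y%m%d')
          for s in df['reference_date'].dropna().unique()}
df['reference_date'] = df['reference_date'].map(lookup)

# on pandas >= 0.23 the caching is built in:
# df['reference_date'] = pandas.to_datetime(df['reference_date'], format='%Y%m%d', cache=True)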

Related

Joining dataframes using rust polars in Python

I am experimenting with polars and would like to understand why using polars is slower than using pandas on a particular example:
import pandas as pd
import polars as pl
n=10_000_000
df1 = pd.DataFrame(range(n), columns=['a'])
df2 = pd.DataFrame(range(n), columns=['b'])
df1p = pl.from_pandas(df1.reset_index())
df2p = pl.from_pandas(df2.reset_index())
# takes ~60 ms
df1.join(df2)
# takes ~950 ms
df1p.join(df2p, on='index')
A pandas join uses the indexes, which are cached.
Here is a comparison where both do the same work:
# pandas
# CPU times: user 1.64 s, sys: 867 ms, total: 2.5 s
# Wall time: 2.52 s
df1.merge(df2, left_on="a", right_on="b")
# polars
# CPU times: user 5.59 s, sys: 199 ms, total: 5.79 s
# Wall time: 780 ms
df1p.join(df2p, left_on="a", right_on="b")

Pandas grouping extremely slow for aggregations min and max

I have a dataframe with a datetime index and shape:
df.shape
(311885, 38)
Aggregate functions .sum(), .mean() and .median() work fine:
%%time
df.groupby(pd.Grouper(freq='D')).mean()
CPU times: user 77.6 ms, sys: 16 ms, total: 93.7 ms
Wall time: 92.7 ms
However, .min() and .max() are extremely slow:
%%time
df.groupby(pd.Grouper(freq='D')).min()
CPU times: user 51.1 s, sys: 377 ms, total: 51.5 s
Wall time: 51.1 s
I also tried resample, with an equally bad result:
%%time
df.resample('D').min()
CPU times: user 52.2 s, sys: 478 ms, total: 52.7 s
Wall time: 52.2 s
Installed versions:
pd.__version__
'0.25.2'
print(sys.version)
3.6.8 (default, Jan 14 2019, 11:02:34)
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]
Is this expected behaviour? Can timings of .min() and .max() be improved?
As Quang Hoang pointed out in their comment, I had a string column which caused .min() and .max() to be slow. Without it, everything is fast.
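A sketch of that workaround, assuming df is the frame above and the slow column has object (string) dtype; anything non-numeric is dropped before aggregating:
import pandas as pd

numeric = df.select_dtypes(include='number')        # drop the object/string column(s)
daily_min = numeric.groupby(pd.Grouper(freq='D')).min()
daily_max = numeric.groupby(pd.Grouper(freq='D')).max()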

Pandas groupby resample poor performance

My problem
I'm having trouble with the performance of the resample function in combination with a groupby. The operation I'm doing currently takes 8+ seconds on a data sample of 5000 rows, which is totally unreasonable for my requirements.
Sample data (500 rows)
Pastebin with data as dict: https://pastebin.com/RPNdhXsy
The logic
I have data with dates in a quarterly interval which I want to group by a column and then resample the dates within the groups on a monthly basis.
Input:
isin   report_date  val
SE001  2018-12-31   1
SE001  2018-09-30   2
SE001  2018-06-30   3
US001  2018-10-31   4
US001  2018-07-31   5
Output:
isin   report_date  val
SE001  2018-12-31   1
       2018-11-30   NaN
       2018-10-31   NaN
       2018-09-30   2
       2018-08-31   NaN
       2018-07-31   NaN
       2018-06-30   3
US001  2018-10-31   4
       2018-09-30   NaN
       2018-08-31   NaN
       2018-07-31   5
I used to have this operation:
df.groupby('isin').resample('M', on="report_date").first()[::-1]
Since it seems that asfreq() has slightly better performance than using on= in resample, I currently do the following instead. It's still slow, though.
I reverse the result because resample unavoidably sorts the dates in ascending order.
df.set_index('report_date').groupby('isin').resample('M').asfreq()[::-1]
As stated, with 5000 rows and around 16 columns this takes 15 seconds to run since I need to do it on two separate dataframes.
With the sample data in the pastebin (500 rows) the operation takes 0.7 s, which is way too long given that my final data will have 800k rows.
EDIT: Timing of the different operations
Current way
setindex --- 0.001055002212524414 seconds ---
groupby --- 0.00033092498779296875 seconds ---
resample --- 0.004662036895751953 seconds ---
asfreq --- 0.8990700244903564 seconds ---
[::-1] --- 0.0013098716735839844 seconds ---
= 0.9056s
Old way
groupby --- 0.0005779266357421875 seconds ---
resample --- 0.0044629573822021484 seconds ---
first --- 1.6829369068145752 seconds ---
[::-1] --- 0.001600027084350586 seconds ---
= 1.6894s
Judging by this, it seems that converting from the pandas.core.resample.DatetimeIndexResamplerGroupby back to a dataframe is what takes so long. Now what?
EDIT2: Using reindex
df.set_index('report_date').groupby('isin').apply(lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max(), freq='M'), fill_value=0))[::-1]
This takes 0.28s which is a vast improvement. Still not very good though.
How can I speed this up? Is there another way to do the same thing?
I cut execution time for a 25k row test data set from 850 ms to 320 ms. I wrapped the reindex logic in a function, to make timing easier:
def orig_pipeline(df):
    return (df
            .set_index('report_date')
            .groupby('isin')
            .apply(lambda x: x.reindex(pd.date_range(x.index.min(),
                                                     x.index.max(),
                                                     freq='M'),
                                       fill_value=0))
            [::-1])
Then, I created new functions to make date arithmetic and reindexing faster:
def create_params(df):
    return (df.groupby('isin')['report_date']
              .agg(['min', 'max']).sort_index().reset_index())

def create_multiindex(df, params):
    all_dates = pd.date_range(start='1999-12-31', end='2020-12-31', freq='M')
    midx = (
        (row.isin, d)
        for row in params.itertuples()
        for d in all_dates[(row.min <= all_dates) & (all_dates <= row.max)])
    return pd.MultiIndex.from_tuples(midx, names=['isin', 'report_date'])

def apply_multiindex(df, midx):
    return df.set_index(['isin', 'report_date']).reindex(midx)

def new_pipeline(df):
    params = create_params(df)
    midx = create_multiindex(df, params)
    return apply_multiindex(df, midx)
Old and new pipelines give the same results (except possibly sort order):
v1 = orig_pipeline(df).drop(columns='isin').sort_index()
v2 = new_pipeline(df).sort_index().fillna(0)
assert (v1 == v2).all().all()
Timing results:
%%timeit
v1 = orig_pipeline(df_big)
854 ms ± 2.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
v2 = new_pipeline(df_big)
322 ms ± 5.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I would like to share the experiments I ran to figure out which solution yields the best performance; they show that #jsmart's is the fastest.
My dataset looks like the following (sorry for the screenshot, I could not manage to paste a pretty table):
My goal is to have, for each (orgacom, client) pair, the indicators resampled to business-day frequency.
Solution 1: groupby / apply asfreq
%%time
sol1 = (
    to_process.groupby(['orgacom', 'client'], observed=True)
              .apply(lambda x: x.asfreq('B', fill_value=np.nan))
)
CPU times: user 4min 6s, sys: 2.91 s, total: 4min 9s
Wall time: 4min 9s
Solution 2: groupby / apply reindex (as of #jokab EDIT2)
%%time
sol2 = (
    to_process.groupby(['orgacom', 'client'], observed=True)
              .apply(lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max(), freq='B'),
                                         fill_value=np.nan))
)
CPU times: user 4min 13s, sys: 2.16 s, total: 4min 15s
Wall time: 4min 15s
Solution 3: recoding resample (as of #jsmart answer)
def create_params(df):
    return (df.reset_index().groupby(['orgacom', 'client'], observed=True)['date']
              .agg(['min', 'max']).sort_index().reset_index())

def create_multiindex(df, params):
    all_dates = pd.date_range(start='2016-12-31', end='2020-12-31', freq='B')
    midx = (
        (row.orgacom, row.client, d)
        for row in params.itertuples()
        for d in all_dates[(row.min <= all_dates) & (all_dates <= row.max)])
    return pd.MultiIndex.from_tuples(midx, names=['orgacom', 'client', 'date'])

def apply_multiindex(df, midx):
    return df.set_index(['orgacom', 'client', 'date']).reindex(midx)

def new_pipeline(df):
    params = create_params(df)
    midx = create_multiindex(df, params)
    return apply_multiindex(df, midx)
%%time
sol3 = new_pipeline(to_process.reset_index())
CPU times: user 1min 46s, sys: 4.93 s, total: 1min 51s
Wall time: 1min 51s
Solution 4: groupby / resample asfreq (as of #jokab first solution)
%%time
sol4 = to_process.groupby(['orgacom', 'client']).resample('B').asfreq()
CPU times: user 4min 22s, sys: 8.01 s, total: 4min 30s
Wall time: 4min 30s
I also noticed that resampling on a groupby can be slow. In my case, I used data reshaping for a speed-up:
df.set_index(['isin', 'report_date'])['val'].unstack(0).resample('M')
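As written, that expression stops at the resampler object; to materialize a result you would chain an aggregation such as asfreq() and, if needed, stack back to long form. A rough sketch of that idea, assuming report_date has already been parsed to datetime (note it pads every isin over the full date range rather than each group's own min/max):
wide = df.set_index(['isin', 'report_date'])['val'].unstack(0)   # one column per isin, dates as index
monthly = wide.resample('M').asfreq()                            # insert the missing month-ends
tidy = monthly.stack(dropna=False).swaplevel().sort_index()      # back to an (isin, report_date) Series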
There is another way of doing this: use itertools.groupby() and a list comprehension (note this produces one summed value per key rather than a monthly resample):
import time
from itertools import groupby

start = time.time()
data = (
    ('SE001', '2018-12-31', 1),
    ('SE001', '2018-09-30', 2),
    ('SE001', '2018-06-30', 3),
    ('US001', '2018-10-31', 4),
    ('US001', '2018-07-31', 5),
)
# sort so rows with the same isin are adjacent, then sum val per key
aggr = [(key, sum(g[2] for g in grp)) for key, grp in groupby(sorted(data), key=lambda x: x[0])]
print(aggr)
print(time.time() - start)
# 100,000 records
# 2.5 seconds

Why does it take so long to create a SparseDataFrame (Python pandas)?

Given the following code (executed in a Jupyter notebook):
In [1]: import pandas as pd
%time df=pd.SparseDataFrame(index=range(0,1000), columns=range(0,1000));
CPU times: user 3.89 s, sys: 30.3 ms, total: 3.92 s
Wall time: 3.92 s
Why does it take so long to create a sparse data frame?
Note that it seems to be irrelevant if I increase the number of rows. But when I increase the number of columns from 1000 to, say, 10000, the code seems to take forever and I always had to abort it.
Compare this with scipy's sparse matrix:
In [2]: from scipy.sparse import lil_matrix
%time m=lil_matrix((1000, 1000))
CPU times: user 1.09 ms, sys: 122 µs, total: 1.21 ms
Wall time: 1.18 ms
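As an aside, pd.SparseDataFrame was deprecated in pandas 0.25 and removed in 1.0; the replacement is an ordinary DataFrame holding sparse-dtype columns, which can be built directly from a SciPy matrix. A rough sketch, assuming pandas >= 0.25:
import pandas as pd
from scipy.sparse import csr_matrix

# build an all-zero 1000x1000 frame of sparse columns from a SciPy matrix
# instead of constructing the frame column by column
%time df = pd.DataFrame.sparse.from_spmatrix(csr_matrix((1000, 1000)))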

Cumsum in pandas.groupby is slow

I would like to know if there is a faster way to run a cumsum in pandas.
For example:
import numpy as np
import pandas as pd
n = 10000000
values = np.random.randint(1, 100000, n)
ids = values.astype("S10")
df = pd.DataFrame({"ids": ids, "val": values})
Now I want to group by ids and get some stats.
The max for example is pretty fast:
time df.groupby("ids").val.max()
CPU times: user 5.08 s, sys: 131 ms, total: 5.21 s
Wall time: 5.22 s
However, the cumsum is very slow:
time df.groupby("ids").val.cumsum()
CPU times: user 26.8 s, sys: 707 ms, total: 27.5 s
Wall time: 27.6 s
My problem is that I need the cumsum grouped by a key in a large dataset, almost as shown here, but it takes minutes. Is there a way to make it faster?
Thanks!
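One possible workaround is to compute the grouped cumulative sum directly in NumPy: factorize the key, stable-sort by it, take one global cumsum, subtract each group's starting offset, and scatter the result back to the original row order. A minimal sketch (the grouped_cumsum helper is hypothetical, assumes no missing ids, and needs NumPy >= 1.16 for np.diff's prepend argument):
import numpy as np
import pandas as pd

def grouped_cumsum(keys, vals):
    codes, _ = pd.factorize(keys)                  # integer code per id
    order = np.argsort(codes, kind="stable")       # rows of the same id become contiguous
    sorted_vals = np.asarray(vals)[order]
    csum = np.cumsum(sorted_vals)
    starts = np.flatnonzero(np.diff(codes[order], prepend=-1))      # first row of each group
    sizes = np.diff(np.append(starts, len(codes)))
    offsets = np.repeat(csum[starts] - sorted_vals[starts], sizes)  # total before each group started
    out = np.empty_like(csum)
    out[order] = csum - offsets                    # scatter back to original row order
    return out

# df["val_cumsum"] = grouped_cumsum(df["ids"], df["val"])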
