I have a dataframe with a datetime index and the following shape:
df.shape
(311885, 38)
Aggregate functions .sum(), .mean() and .median() work fine:
%%time
df.groupby(pd.Grouper(freq='D')).mean()
CPU times: user 77.6 ms, sys: 16 ms, total: 93.7 ms
Wall time: 92.7 ms
However, .min() and .max() are extremely slow:
%%time
df.groupby(pd.Grouper(freq='D')).min()
CPU times: user 51.1 s, sys: 377 ms, total: 51.5 s
Wall time: 51.1 s
I also tried resample, with an equally bad result:
%%time
df.resample('D').min()
CPU times: user 52.2 s, sys: 478 ms, total: 52.7 s
Wall time: 52.2 s
Installed versions:
pd.__version__
'0.25.2'
print(sys.version)
3.6.8 (default, Jan 14 2019, 11:02:34)
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]
Is this expected behaviour? Can the timings of .min() and .max() be improved?
As Quang Hoang pointed out in their comment, I had a string column which caused .min() and .max() to be slow. Without it, everything is fast.
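A sketch of that workaround, assuming the string column is the only object-dtype column in the frame, is to restrict the aggregation to the numeric columns:
# aggregate only the numeric columns; object (string) columns fall back to a
# much slower Python-level min/max path
numeric = df.select_dtypes(exclude="object")
numeric.groupby(pd.Grouper(freq="D")).min()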
I am experimenting with polars and would like to understand why polars is slower than pandas on a particular example:
import pandas as pd
import polars as pl
n=10_000_000
df1 = pd.DataFrame(range(n), columns=['a'])
df2 = pd.DataFrame(range(n), columns=['b'])
df1p = pl.from_pandas(df1.reset_index())
df2p = pl.from_pandas(df2.reset_index())
# takes ~60 ms
df1.join(df2)
# takes ~950 ms
df1p.join(df2p, on='index')
A pandas join uses the indexes, which are cached.
Here is a comparison where both libraries do the same work, joining on a column:
# pandas
# CPU times: user 1.64 s, sys: 867 ms, total: 2.5 s
# Wall time: 2.52 s
df1.merge(df2, left_on="a", right_on="b")
# polars
# CPU times: user 5.59 s, sys: 199 ms, total: 5.79 s
# Wall time: 780 ms
df1p.join(df2p, left_on="a", right_on="b")
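To make the point about the index explicit, the index-based pandas join can also be spelled as an index-to-index merge; a small sketch (the assertion is only there to show both paths produce the same frame):
# assumes df1 and df2 from the setup above
joined = df1.join(df2)                                                  # aligns on the (cached) index
merged = df1.merge(df2, left_index=True, right_index=True, how="left")  # explicit index-to-index merge
pd.testing.assert_frame_equal(joined, merged)                           # same result either way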
I have a list of tuples with more than 0.8 million entries; a sample is shown below:
[('r', 'b', '>', 'ins'),
('r', 'ba', '>', 'ins'),
('r', 'a', '>', 'del'),
('ar', 'b', '>', 'ins'),
('ar', 'ba', '>', 'del')]
The above list is converted into a pandas dataframe, and then some grouping and aggregation operations are performed on it, as shown below:
conDF = pd.DataFrame.from_records(tripleeList)
conDF.columns = ["lc","obj","rc","insOrDel"]
conDF["coun"] = 1
groupedConDF = conDF.groupby(["lc","rc","obj","insOrDel"]).count()
If the final element in each tuple is represented by an integer (say 1 and 0) instead of the strings "ins" and "del", the processing takes about 10 times longer than it currently does. I used the %%time magic in a Jupyter notebook to obtain the measurements.
The reported time when using the strings
CPU times: user 459 ms, sys: 24.4 ms, total: 484 ms
Wall time: 481 ms
The reported time when using the integers
CPU times: user 5.77 s, sys: 410 ms, total: 6.18 s
Wall time: 604 ms
Updated
When modifying the code as suggested by @jezrael, the following is reported
Strings:
CPU times: user 966 ms, sys: 45.9 ms, total: 1.01 s
Wall time: 431 ms
Integers:
CPU times: user 5.72 s, sys: 375 ms, total: 6.09 s
Wall time: 558 ms
Modified code:
conDF = pd.DataFrame.from_records(tripleeList)
conDF.columns = ["lc","obj","rc","insOrDel"]
conDF["coun"] = 1
groupedConDF = conDF.groupby(["lc","rc","obj","insOrDel"], sort=False).size()
What could be the possible reason for this?
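For reference, a self-contained way to reproduce the comparison with synthetic data (the value pools and sizes below are made up, not the real data):
import random
import pandas as pd

# build ~0.8 million synthetic tuples; the last element is either a string or an int
n = 800_000
lcs, objs = ["r", "ar", "bar"], ["a", "b", "ba"]
triples_str = [(random.choice(lcs), random.choice(objs), ">", random.choice(["ins", "del"]))
               for _ in range(n)]
triples_int = [(lc, obj, rc, 1 if op == "ins" else 0) for lc, obj, rc, op in triples_str]

def grouped_size(records):
    df = pd.DataFrame.from_records(records, columns=["lc", "obj", "rc", "insOrDel"])
    return df.groupby(["lc", "rc", "obj", "insOrDel"], sort=False).size()

# %time grouped_size(triples_str)   # last column holds strings
# %time grouped_size(triples_int)   # last column holds integers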
Given the following code (executed in a Jupyter notebook):
In [1]: import pandas as pd
%time df=pd.SparseDataFrame(index=range(0,1000), columns=range(0,1000));
CPU times: user 3.89 s, sys: 30.3 ms, total: 3.92 s
Wall time: 3.92 s
Why does it take so long to create a sparse data frame?
Note that increasing the number of rows seems to make no difference, but when I increase the number of columns from 1000 to, say, 10000, the code seems to take forever and I always had to abort it.
Compare this with scipy's sparse matrix:
In [2]: from scipy.sparse import lil_matrix
%time m=lil_matrix((1000, 1000))
CPU times: user 1.09 ms, sys: 122 µs, total: 1.21 ms
Wall time: 1.18 ms
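If the goal is simply an empty sparse frame, one possible workaround on pandas 0.25+ is to build it from a scipy sparse matrix instead (a sketch; whether it avoids the slow column-by-column construction above is an assumption I have not benchmarked):
import pandas as pd
from scipy.sparse import csr_matrix

# build a 1000 x 1000 all-zero frame with sparse columns from a scipy matrix
m = csr_matrix((1000, 1000))
sdf = pd.DataFrame.sparse.from_spmatrix(m)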
I have a decent-sized sparse data frame with several date/time columns in string format. I am trying to convert them to datetime (or Timestamp) objects using the standard pandas to_datetime() method, but it is too slow.
I ended up writing a "fast" to_datetime function (below). It is significantly faster, but it still seems slow. Profiling tells me all of the time is spent on the last line.
Am I off the deep end? Is there a different (and faster) way to do this?
In [98]: df.shape
Out[98]: (2497977, 79)
In [117]: len(df.reference_date.dropna())
Out[117]: 2004185
In [118]: len(df.reference_date.dropna().unique())
Out[118]: 157
In [119]: %time df.reference_date = pandas.to_datetime(df.reference_date)
CPU times: user 3min 2s, sys: 434 ms, total: 3min 2s
**Wall time: 3min 2s**
In [123]: %time fast_to_datetime(dataframe=df, column='reference_date', date_format='%Y%m%d')
CPU times: user 3.58 s, sys: 343 ms, total: 3.92 s
**Wall time: 3.92 s**
def fast_to_datetime(dataframe, column, date_format=None):
    # parse each distinct date string only once
    tmp_dates = dataframe[column].dropna().unique()
    unique_dates = pandas.DataFrame(tmp_dates, columns=['orig_date'])
    unique_dates.set_index(keys=['orig_date'], drop=False, inplace=True, verify_integrity=True)
    unique_dates[column] = pandas.to_datetime(unique_dates.orig_date, format=date_format)
    # align the parsed values back onto the original frame via the index
    dataframe.set_index(keys=column, drop=False, inplace=True)
    dataframe[column] = unique_dates[column]
In [126]: sys.version
Out[126]: '2.7.5 (default, Nov 20 2015, 02:00:19) \n[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)]'
In [127]: pandas.__version__
Out[127]: u'0.17.0'
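For comparison, a map-based sketch of the same unique-then-convert idea that avoids mutating the frame's index (assumed equivalent in behaviour; not benchmarked here):
def fast_to_datetime_map(dataframe, column, date_format=None):
    # parse each distinct string once, then map the parsed Timestamps back onto the column
    uniques = dataframe[column].dropna().unique()
    lookup = dict(zip(uniques, pandas.to_datetime(uniques, format=date_format)))
    dataframe[column] = dataframe[column].map(lookup)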
I would like to know if there is a faster way to run a cumsum in pandas.
For example:
import numpy as np
import pandas as pd
n = 10000000
values = np.random.randint(1, 100000, n)
ids = values.astype("S10")
df = pd.DataFrame({"ids": ids, "val": values})
Now I want to group by ids and get some stats.
The max, for example, is pretty fast:
time df.groupby("ids").val.max()
CPU times: user 5.08 s, sys: 131 ms, total: 5.21 s
Wall time: 5.22 s
However, the cumsum is very slow:
time df.groupby("ids").val.cumsum()
CPU times: user 26.8 s, sys: 707 ms, total: 27.5 s
Wall time: 27.6 s
My problem is that I need the cumsum grouped by a key in a large dataset, almost as shown here, but it takes minutes. Is there a way to make it faster?
Thanks!
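One direction that is sometimes faster is to compute the grouped cumsum directly in numpy after a stable sort on the key; a sketch (whether it actually beats groupby().cumsum() here is an assumption, and the extra sort must be acceptable):
import numpy as np

# assumes the `ids` and `values` arrays from the setup above
order = np.argsort(ids, kind="stable")            # stable sort keeps within-group row order
s_ids, s_vals = ids[order], values[order]

csum = s_vals.cumsum()                            # running total over all sorted rows
new_group = np.r_[True, s_ids[1:] != s_ids[:-1]]  # True at each group's first row
group_no = np.cumsum(new_group) - 1               # 0-based group label per sorted row
offset = (csum - s_vals)[new_group]               # total accumulated before each group starts
grouped_sorted = csum - offset[group_no]          # per-group running total, in sorted order

grouped = np.empty_like(grouped_sorted)           # scatter back to the original row order
grouped[order] = grouped_sorted
# sanity check (values only): np.array_equal(grouped, df.groupby("ids").val.cumsum().values)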