I have a dataframe with a datetime index and the following shape:
df.shape
(311885, 38)
Aggregate functions .sum(), .mean() and .median() work fine:
%%time
df.groupby(pd.Grouper(freq='D')).mean()
CPU times: user 77.6 ms, sys: 16 ms, total: 93.7 ms
Wall time: 92.7 ms
However, .min() and .max() are extremely slow:
%%time
df.groupby(pd.Grouper(freq='D')).min()
CPU times: user 51.1 s, sys: 377 ms, total: 51.5 s
Wall time: 51.1 s
I also tried resample, with an equally bad result:
%%time
df.resample('D').min()
CPU times: user 52.2 s, sys: 478 ms, total: 52.7 s
Wall time: 52.2 s
Installed versions:
pd.__version__
'0.25.2'
print(sys.version)
3.6.8 (default, Jan 14 2019, 11:02:34)
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]
Is this expected behaviour? Can the timings of .min() and .max() be improved?
As Quang Hoang pointed out in their comment, I had a string column which caused .min() and .max() to be slow. Without it, everything is fast.
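A sketch of that workaround, assuming the string column is the only object-dtype column in the frame, is to restrict the aggregation to the numeric columns:
# aggregate only the numeric columns; object (string) columns fall back to a
# much slower Python-level min/max path
numeric = df.select_dtypes(exclude="object")
numeric.groupby(pd.Grouper(freq="D")).min()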
I am experimenting with polars and would like to understand why polars is slower than pandas on a particular example:
import pandas as pd
import polars as pl
n=10_000_000
df1 = pd.DataFrame(range(n), columns=['a'])
df2 = pd.DataFrame(range(n), columns=['b'])
df1p = pl.from_pandas(df1.reset_index())
df2p = pl.from_pandas(df2.reset_index())
# takes ~60 ms
df1.join(df2)
# takes ~950 ms
df1p.join(df2p, on='index')
A pandas join uses the indexes, which are cached.
Here is a comparison where both libraries do the same work, joining on a column:
# pandas
# CPU times: user 1.64 s, sys: 867 ms, total: 2.5 s
# Wall time: 2.52 s
df1.merge(df2, left_on="a", right_on="b")
# polars
# CPU times: user 5.59 s, sys: 199 ms, total: 5.79 s
# Wall time: 780 ms
df1p.join(df2p, left_on="a", right_on="b")
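To make the point about the index explicit, the index-based pandas join can also be spelled as an index-to-index merge; a small sketch (the assertion is only there to show both paths produce the same frame):
# assumes df1 and df2 from the setup above
joined = df1.join(df2)                                                  # aligns on the (cached) index
merged = df1.merge(df2, left_index=True, right_index=True, how="left")  # explicit index-to-index merge
pd.testing.assert_frame_equal(joined, merged)                           # same result either way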
I have a list of tuples with more than 0.8 million entries; a sample is shown below:
[('r', 'b', '>', 'ins'),
('r', 'ba', '>', 'ins'),
('r', 'a', '>', 'del'),
('ar', 'b', '>', 'ins'),
('ar', 'ba', '>', 'del')]
The above list is converted into a pandas dataframe, and then some grouping and aggregation operations are performed on it, as shown below:
conDF = pd.DataFrame.from_records(tripleeList)
conDF.columns = ["lc","obj","rc","insOrDel"]
conDF["coun"] = 1
groupedConDF = conDF.groupby(["lc","rc","obj","insOrDel"]).count()
If the final element in each tuple is represented by an integer (say 1 and 0) instead of the strings "ins" and "del", the processing takes about 10 times longer than it currently does. I used the %%time magic in a Jupyter notebook to obtain the measurements.
The reported time when using the strings
CPU times: user 459 ms, sys: 24.4 ms, total: 484 ms
Wall time: 481 ms
The reported time when using the integers
CPU times: user 5.77 s, sys: 410 ms, total: 6.18 s
Wall time: 604 ms
Updated
When modifying the code as suggested by @jezrael, the following is reported
Strings:
CPU times: user 966 ms, sys: 45.9 ms, total: 1.01 s
Wall time: 431 ms
Integers:
CPU times: user 5.72 s, sys: 375 ms, total: 6.09 s
Wall time: 558 ms
Modified code:
conDF = pd.DataFrame.from_records(tripleeList)
conDF.columns = ["lc","obj","rc","insOrDel"]
conDF["coun"] = 1
groupedConDF = conDF.groupby(["lc","rc","obj","insOrDel"], sort=False).size()
What could be the possible reason for this?
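For reference, a self-contained way to reproduce the comparison with synthetic data (the value pools and sizes below are made up, not the real data):
import random
import pandas as pd

# build ~0.8 million synthetic tuples; the last element is either a string or an int
n = 800_000
lcs, objs = ["r", "ar", "bar"], ["a", "b", "ba"]
triples_str = [(random.choice(lcs), random.choice(objs), ">", random.choice(["ins", "del"]))
               for _ in range(n)]
triples_int = [(lc, obj, rc, 1 if op == "ins" else 0) for lc, obj, rc, op in triples_str]

def grouped_size(records):
    df = pd.DataFrame.from_records(records, columns=["lc", "obj", "rc", "insOrDel"])
    return df.groupby(["lc", "rc", "obj", "insOrDel"], sort=False).size()

# %time grouped_size(triples_str)   # last column holds strings
# %time grouped_size(triples_int)   # last column holds integers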
Given the following code (executed in a Jupyter notebook):
In [1]: import pandas as pd
%time df=pd.SparseDataFrame(index=range(0,1000), columns=range(0,1000));
CPU times: user 3.89 s, sys: 30.3 ms, total: 3.92 s
Wall time: 3.92 s
Why does it take so long to create a sparse data frame?
Note that increasing the number of rows seems to make no difference, but when I increase the number of columns from 1000 to, say, 10000, the code seems to take forever and I always had to abort it.
Compare this with scipy's sparse matrix:
In [2]: from scipy.sparse import lil_matrix
%time m=lil_matrix((1000, 1000))
CPU times: user 1.09 ms, sys: 122 µs, total: 1.21 ms
Wall time: 1.18 ms
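If the goal is simply an empty sparse frame, one possible workaround on pandas 0.25+ is to build it from a scipy sparse matrix instead (a sketch; whether it avoids the slow column-by-column construction above is an assumption I have not benchmarked):
import pandas as pd
from scipy.sparse import csr_matrix

# build a 1000 x 1000 all-zero frame with sparse columns from a scipy matrix
m = csr_matrix((1000, 1000))
sdf = pd.DataFrame.sparse.from_spmatrix(m)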
I have a decent-sized sparse data frame with several date/time columns in string format. I am trying to convert them to datetime (or Timestamp) objects using the standard pandas to_datetime() method, but it is too slow.
I ended up writing a "fast" to_datetime function (below). It is significantly faster, but it still seems slow. Profiling tells me all of the time is spent on the last line.
Am I off the deep end? Is there a different (and faster) way to do this?
In [98]: df.shape
Out[98]: (2497977, 79)
In [117]: len(df.reference_date.dropna())
Out[117]: 2004185
In [118]: len(df.reference_date.dropna().unique())
Out[118]: 157
In [119]: %time df.reference_date = pandas.to_datetime(df.reference_date)
CPU times: user 3min 2s, sys: 434 ms, total: 3min 2s
**Wall time: 3min 2s**
In [123]: %time fast_to_datetime(dataframe=df, column='reference_date', date_format='%Y%m%d')
CPU times: user 3.58 s, sys: 343 ms, total: 3.92 s
**Wall time: 3.92 s**
def fast_to_datetime(dataframe, column, date_format=None):
    # parse each distinct date string only once
    tmp_dates = dataframe[column].dropna().unique()
    unique_dates = pandas.DataFrame(tmp_dates, columns=['orig_date'])
    unique_dates.set_index(keys=['orig_date'], drop=False, inplace=True, verify_integrity=True)
    unique_dates[column] = pandas.to_datetime(unique_dates.orig_date, format=date_format)
    # align the parsed values back onto the original frame via the index
    dataframe.set_index(keys=column, drop=False, inplace=True)
    dataframe[column] = unique_dates[column]
In [126]: sys.version
Out[126]: '2.7.5 (default, Nov 20 2015, 02:00:19) \n[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)]'
In [127]: pandas.__version__
Out[127]: u'0.17.0'
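For comparison, a map-based sketch of the same unique-then-convert idea that avoids mutating the frame's index (assumed equivalent in behaviour; not benchmarked here):
def fast_to_datetime_map(dataframe, column, date_format=None):
    # parse each distinct string once, then map the parsed Timestamps back onto the column
    uniques = dataframe[column].dropna().unique()
    lookup = dict(zip(uniques, pandas.to_datetime(uniques, format=date_format)))
    dataframe[column] = dataframe[column].map(lookup)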
I would like to know if there is a faster way to run a cumsum in pandas.
For example:
import numpy as np
import pandas as pd
n = 10000000
values = np.random.randint(1, 100000, n)
ids = values.astype("S10")
df = pd.DataFrame({"ids": ids, "val": values})
Now I want to group by ids and get some stats.
The max, for example, is pretty fast:
time df.groupby("ids").val.max()
CPU times: user 5.08 s, sys: 131 ms, total: 5.21 s
Wall time: 5.22 s
However, the cumsum is very slow:
time df.groupby("ids").val.cumsum()
CPU times: user 26.8 s, sys: 707 ms, total: 27.5 s
Wall time: 27.6 s
My problem is that I need the cumsum grouped by a key in a large dataset, almost as shown here, but it takes minutes. Is there a way to make it faster?
Thanks!
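One direction that is sometimes faster is to compute the grouped cumsum directly in numpy after a stable sort on the key; a sketch (whether it actually beats groupby().cumsum() here is an assumption, and the extra sort must be acceptable):
import numpy as np

# assumes the `ids` and `values` arrays from the setup above
order = np.argsort(ids, kind="stable")            # stable sort keeps within-group row order
s_ids, s_vals = ids[order], values[order]

csum = s_vals.cumsum()                            # running total over all sorted rows
new_group = np.r_[True, s_ids[1:] != s_ids[:-1]]  # True at each group's first row
group_no = np.cumsum(new_group) - 1               # 0-based group label per sorted row
offset = (csum - s_vals)[new_group]               # total accumulated before each group starts
grouped_sorted = csum - offset[group_no]          # per-group running total, in sorted order

grouped = np.empty_like(grouped_sorted)           # scatter back to the original row order
grouped[order] = grouped_sorted
# sanity check (values only): np.array_equal(grouped, df.groupby("ids").val.cumsum().values)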