I am experimenting with polars and would like to understand why using polars is slower than using pandas on a particular example:
import pandas as pd
import polars as pl
n=10_000_000
df1 = pd.DataFrame(range(n), columns=['a'])
df2 = pd.DataFrame(range(n), columns=['b'])
df1p = pl.from_pandas(df1.reset_index())
df2p = pl.from_pandas(df2.reset_index())
# takes ~60 ms
df1.join(df2)
# takes ~950 ms
df1p.join(df2p, on='index')
A pandas join uses the indexes, which are cached.
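For reference, a minimal sketch of what that pandas call is doing implicitly (an equi-join on the shared RangeIndex; the merge form below is my own illustration, not taken from the question):
# equivalent to df1.join(df2): align the two frames on their (identical) RangeIndex
joined = pd.merge(df1, df2, left_index=True, right_index=True, how='left')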
Here is a comparison in which both libraries do the same work, joining on a column:
# pandas
# CPU times: user 1.64 s, sys: 867 ms, total: 2.5 s
# Wall time: 2.52 s
df1.merge(df2, left_on="a", right_on="b")
# polars
# CPU times: user 5.59 s, sys: 199 ms, total: 5.79 s
# Wall time: 780 ms
df1p.join(df2p, left_on="a", right_on="b")
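Note the split between CPU time and wall time: polars burns more CPU in total but finishes sooner, which is consistent with the join running on multiple threads. If you also want the query optimizer involved, the same join can go through the lazy API; a minimal sketch using the frames defined above:
# lazy variant: build the query plan first, then execute it with collect()
out = (
    df1p.lazy()
        .join(df2p.lazy(), left_on="a", right_on="b")
        .collect()
)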
I have a list of tuples with more than 0.8 million entries; a sample is shown below:
[('r', 'b', '>', 'ins'),
('r', 'ba', '>', 'ins'),
('r', 'a', '>', 'del'),
('ar', 'b', '>', 'ins'),
('ar', 'ba', '>', 'del')]
The above list is converted into a pandas DataFrame, and then some grouping and aggregation operations are performed on it, as shown below:
conDF = pd.DataFrame.from_records(tripleeList)
conDF.columns = ["lc","obj","rc","insOrDel"]
conDF["coun"] = 1
groupedConDF = conDF.groupby(["lc","rc","obj","insOrDel"]).count()
If the final element in each tuple is represented with an integer (say, 1 and 0) instead of the strings "ins" and "del", the processing takes roughly 10 times as long as it currently does. I used the %%time magic in a Jupyter notebook to obtain the measurements.
The reported time when using the strings
CPU times: user 459 ms, sys: 24.4 ms, total: 484 ms
Wall time: 481 ms
The reported time when using the integers
CPU times: user 5.77 s, sys: 410 ms, total: 6.18 s
Wall time: 604 ms
Updated
When the code is modified as suggested by @jezrael, the following is reported:
Strings:
CPU times: user 966 ms, sys: 45.9 ms, total: 1.01 s
Wall time: 431 ms
Integers:
CPU times: user 5.72 s, sys: 375 ms, total: 6.09 s
Wall time: 558 ms
Modified code:
conDF = pd.DataFrame.from_records(tripleeList)
conDF.columns = ["lc","obj","rc","insOrDel"]
conDF["coun"] = 1
groupedConDF = conDF.groupby(["lc","rc","obj","insOrDel"], sort=False).size()
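For reference, a self-contained way to build comparable input data (the value pools below are hypothetical stand-ins for my real tuples, just to make the comparison reproducible):
import random
import pandas as pd

random.seed(0)
lcs = ["r", "ar", "bar", "ra"]     # hypothetical left-context pool
objs = ["a", "b", "ba", "ab"]      # hypothetical object pool
tags = ["ins", "del"]

# string variant of the last element
tripleeList = [(random.choice(lcs), random.choice(objs), ">", random.choice(tags))
               for _ in range(800_000)]

# integer variant of the last element (1 for "ins", 0 for "del")
tripleeListInt = [(lc, obj, rc, 1 if tag == "ins" else 0)
                  for lc, obj, rc, tag in tripleeList]
Running the grouping code above on tripleeList and then on tripleeListInt should reproduce the kind of gap reported in the timings.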
What could be the possible reason for this?
I have a dataframe with a datetime index and the following shape:
df.shape
(311885, 38)
Aggregate functions .sum(), .mean() and .median() work fine:
%%time
df.groupby(pd.Grouper(freq='D')).mean()
CPU times: user 77.6 ms, sys: 16 ms, total: 93.7 ms
Wall time: 92.7 ms
However, .min() and .max() are extremely slow:
%%time
df.groupby(pd.Grouper(freq='D')).min()
CPU times: user 51.1 s, sys: 377 ms, total: 51.5 s
Wall time: 51.1 s
I also tried resample, with an equally bad result:
%%time
df.resample('D').min()
CPU times: user 52.2 s, sys: 478 ms, total: 52.7 s
Wall time: 52.2 s
Installed versions:
pd.__version__
'0.25.2'
print(sys.version)
3.6.8 (default, Jan 14 2019, 11:02:34)
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]
Is this expected behaviour? Can timings of .min() and .max() be improved?
As Quang Hoang pointed out in their comment, I had a string column which caused .min() and .max() to be slow. Without it, everything is fast.
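If the string column cannot simply be dropped, restricting the aggregation to the numeric columns is one workaround; a minimal sketch, assuming the same df as above:
# aggregate only the numeric columns; object/string columns are excluded up front
numeric_df = df.select_dtypes(include='number')
daily_min = numeric_df.groupby(pd.Grouper(freq='D')).min()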
Given the following code (executed in a Jupyter notebook):
In [1]: import pandas as pd
%time df=pd.SparseDataFrame(index=range(0,1000), columns=range(0,1000));
CPU times: user 3.89 s, sys: 30.3 ms, total: 3.92 s
Wall time: 3.92 s
Why does it take so long to create a sparse data frame?
Note that it seems to be irrelevant if I increase the number of rows. But when I increase the number of columns from 1000 to, say, 10000, the code seems to take forever and I always have to abort it.
Compare this with scipy's sparse matrix:
In [2]: from scipy.sparse import lil_matrix
%time m=lil_matrix((1000, 1000))
CPU times: user 1.09 ms, sys: 122 µs, total: 1.21 ms
Wall time: 1.18 ms
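As an aside, newer pandas versions (0.25+) deprecate SparseDataFrame in favour of a sparse accessor that can wrap a SciPy sparse matrix directly; a minimal sketch of that route (not a fix for the old constructor itself):
from scipy.sparse import csr_matrix
import pandas as pd

# build a sparse-backed DataFrame from an existing SciPy sparse matrix
m = csr_matrix((1000, 1000))
df_sparse = pd.DataFrame.sparse.from_spmatrix(m)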
I have a pandas dataframe holding more than a million records. One of its columns is a datetime. A sample of my data looks like the following:
time,x,y,z
2015-05-01 10:00:00,111,222,333
2015-05-01 10:00:03,112,223,334
...
I need to efficiently get the records within a specific time period. The following naive approach is very time-consuming.
new_df = df[(df["time"] > start_time) & (df["time"] < end_time)]
I know that on DBMS like MySQL the indexing by the time field is effective for getting records by specifying the time period.
My questions are:
1. Does pandas indexing, such as df.index = df.time, make the slicing process faster?
2. If the answer to Q1 is 'No', what is the common, efficient way to get records within a specific time period in pandas?
Let's create a dataframe with 1 million rows and time the performance. The index is a pandas DatetimeIndex of Timestamps.
df = pd.DataFrame(np.random.randn(1000000, 3),
                  columns=list('ABC'),
                  index=pd.DatetimeIndex(start='2015-1-1', freq='10s', periods=1000000))
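(Aside: recent pandas releases no longer accept the range-generation arguments of DatetimeIndex; the equivalent index would be built with pd.date_range, roughly as follows.)
# newer-pandas equivalent of the index construction above
idx = pd.date_range(start='2015-1-1', freq='10s', periods=1000000)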
Here are the results sorted from fastest to slowest (tested on the same machine with both v. 0.14.1 (don't ask...) and the most recent version 0.17.1):
%timeit df2 = df['2015-2-1':'2015-3-1']
1000 loops, best of 3: 459 µs per loop (v. 0.14.1)
1000 loops, best of 3: 664 µs per loop (v. 0.17.1)
%timeit df2 = df.ix['2015-2-1':'2015-3-1']
1000 loops, best of 3: 469 µs per loop (v. 0.14.1)
1000 loops, best of 3: 662 µs per loop (v. 0.17.1)
%timeit df2 = df.loc[(df.index >= '2015-2-1') & (df.index <= '2015-3-1'), :]
100 loops, best of 3: 8.86 ms per loop (v. 0.14.1)
100 loops, best of 3: 9.28 ms per loop (v. 0.17.1)
%timeit df2 = df.loc['2015-2-1':'2015-3-1', :]
1 loops, best of 3: 341 ms per loop (v. 0.14.1)
1000 loops, best of 3: 677 µs per loop (v. 0.17.1)
Here are the timings with the Datetime index as a column:
df.reset_index(inplace=True)
%timeit df2 = df.loc[(df['index'] >= '2015-2-1') & (df['index'] <= '2015-3-1')]
100 loops, best of 3: 12.6 ms per loop (v. 0.14.1)
100 loops, best of 3: 13 ms per loop (v. 0.17.1)
%timeit df2 = df.loc[(df['index'] >= '2015-2-1') & (df['index'] <= '2015-3-1'), :]
100 loops, best of 3: 12.8 ms per loop (v. 0.14.1)
100 loops, best of 3: 12.7 ms per loop (v. 0.17.1)
All of the above indexing techniques produce the same dataframe:
>>> df2.shape
(250560, 3)
It appears that either of the first two methods is best in this situation, and the fourth method works just as well with the latest version of pandas.
I've never dealt with a data set that large, but maybe you can try recasting the time column as a datetime index and then slicing directly. Something like this:
timedata.txt (extended from your example):
time,x,y,z
2015-05-01 10:00:00,111,222,333
2015-05-01 10:00:03,112,223,334
2015-05-01 10:00:05,112,223,335
2015-05-01 10:00:08,112,223,336
2015-05-01 10:00:13,112,223,337
2015-05-01 10:00:21,112,223,338
df = pd.read_csv('timedata.txt')
df.time = pd.to_datetime(df.time)
df = df.set_index('time')
print(df['2015-05-01 10:00:02':'2015-05-01 10:00:14'])
x y z
time
2015-05-01 10:00:03 112 223 334
2015-05-01 10:00:05 112 223 335
2015-05-01 10:00:08 112 223 336
2015-05-01 10:00:13 112 223 337
Note that in the example the times used for slicing are not in the column, so this will work for the case where you only know the time interval.
If your data has a fixed time period you can create a datetime index which may provide more options. I didn't want to assume your time period was fixed so constructed this for a more general case.
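For that fixed-period case, a minimal sketch (assuming, purely for illustration, a regular 3-second sampling interval and the df built above):
# replace the parsed timestamps with a fixed-frequency DatetimeIndex
df.index = pd.date_range(start='2015-05-01 10:00:00', periods=len(df), freq='3s')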
I would like to know if there is a faster way to run a cumsum in pandas.
For example:
import numpy as np
import pandas as pd
n = 10000000
values = np.random.randint(1, 100000, n)
ids = values.astype("S10")
df = pd.DataFrame({"ids": ids, "val": values})
Now, I want to group by ids and get some stats.
The max for example is pretty fast:
time df.groupby("ids").val.max()
CPU times: user 5.08 s, sys: 131 ms, total: 5.21 s
Wall time: 5.22 s
However, the cumsum is very slow:
time df.groupby("ids").val.cumsum()
CPU times: user 26.8 s, sys: 707 ms, total: 27.5 s
Wall time: 27.6 s
My problem is that I need the cumsum grouped by a key in a large dataset, almost as shown here, but it takes minutes. Is there a way to make it faster?
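For context, here is a sort-based NumPy sketch of the kind of alternative I am after (a rough illustration only, not verified to match groupby().cumsum() in every case):
import numpy as np

# stable sort by key keeps the original within-group order
order = np.argsort(ids, kind="mergesort")
vals_sorted = values[order]
ids_sorted = ids[order]

# first row of each group in the sorted arrays, and the group sizes
starts = np.r_[0, np.flatnonzero(ids_sorted[1:] != ids_sorted[:-1]) + 1]
sizes = np.diff(np.r_[starts, len(vals_sorted)])

# one global cumsum, then subtract the running total accumulated before each group
csum = np.cumsum(vals_sorted)
offsets = np.repeat(np.r_[0, csum[starts[1:] - 1]], sizes)
cumsum_sorted = csum - offsets

# scatter the per-group cumulative sums back to the original row order
result = np.empty_like(cumsum_sorted)
result[order] = cumsum_sorted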
Thanks!