When I call df.groupby([...]).apply(lambda x: ...) the performance is horrible. Is there a faster / more direct way to do this simple query?
To demonstrate my point, here is some code to set up the DataFrame:
import pandas as pd
df = pd.DataFrame(data=
{'ticker': ['AAPL','AAPL','AAPL','IBM','IBM','IBM'],
'side': ['B','B','S','S','S','B'],
'size': [100, 200, 300, 400, 100, 200],
'price': [10.12, 10.13, 10.14, 20.3, 20.2, 20.1]})
price side size ticker
0 10.12 B 100 AAPL
1 10.13 B 200 AAPL
2 10.14 S 300 AAPL
3 20.30 S 400 IBM
4 20.20 S 100 IBM
5 20.10 B 200 IBM
Now here is the part that is extremely slow that I need to speed up:
%timeit avgpx = df.groupby(['ticker','side']) \
.apply(lambda group: (group['size'] * group['price']).sum() / group['size'].sum())
3.23 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This produces the correct result but as you can see above, takes super long (3.23ms doesn't seem like much but this is only 6 rows... When I use this on a real dataset it takes forever).
ticker side
AAPL B 10.126667
S 10.140000
IBM B 20.100000
S 20.280000
dtype: float64
You can save some time by precomputing the product and getting rid of the apply.
df['scaled_size'] = df['size'] * df['price']
g = df.groupby(['ticker', 'side'])
g['scaled_size'].sum() / g['size'].sum()
ticker side
AAPL B 10.126667
S 10.140000
IBM B 20.100000
S 20.280000
dtype: float64
100 loops, best of 3: 2.58 ms per loop
Sanity Check
df.groupby(['ticker','side']).apply(
lambda group: (group['size'] * group['price']).sum() / group['size'].sum())
ticker side
AAPL B 10.126667
S 10.140000
IBM B 20.100000
S 20.280000
dtype: float64
100 loops, best of 3: 5.02 ms per loop
Getting rid of apply appears to result in a 2X speedup on my machine.
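For reference, the same idea can be written without adding a permanent column to df. This is only a sketch on the sample data from the question; the column name notional is made up for illustration and is not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({
    'ticker': ['AAPL', 'AAPL', 'AAPL', 'IBM', 'IBM', 'IBM'],
    'side': ['B', 'B', 'S', 'S', 'S', 'B'],
    'size': [100, 200, 300, 400, 100, 200],
    'price': [10.12, 10.13, 10.14, 20.3, 20.2, 20.1],
})

# Build the size * price column on a temporary copy ('notional' is a
# hypothetical name), then sum both columns in a single grouped pass.
sums = (
    df.assign(notional=df['size'] * df['price'])
      .groupby(['ticker', 'side'])[['notional', 'size']]
      .sum()
)

# Size-weighted average price per (ticker, side): two vectorized sums and
# one division, with no Python-level function called per group.
avgpx = sums['notional'] / sums['size']
print(avgpx)
```

Because assign returns a new DataFrame, the original df keeps its four columns.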
I have a dataframe like this
df = pd.DataFrame({'id': [205,205,205, 211, 211, 211]
, 'date': pd.to_datetime(['2019-12-01','2020-01-01', '2020-02-01'
,'2019-12-01' ,'2020-01-01', '2020-03-01'])})
df
id date
0 205 2019-12-01
1 205 2020-01-01
2 205 2020-02-01
3 211 2019-12-01
4 211 2020-01-01
5 211 2020-03-01
where the date column consists of consecutive months for id 205 but not for id 211.
I want to keep only the observations (id) for which I have monthly data without jumps. In this example I want:
id date
0 205 2019-12-01
1 205 2020-01-01
2 205 2020-02-01
Here I am collecting the id to keep:
keep_id = []
for num in df['id'].unique():  # iterate over the ids, not over the row index
    dates = df.loc[df['id'] == num, 'date']
    prev = dates.shift(1)
    # difference in months from the previous date
    temp = (dates.dt.year - prev.dt.year) * 12 + dates.dt.month - prev.dt.month
    temp.iloc[0] = 1.0  # here I correct the first entry
    if (temp == 1.).all():
        keep_id.append(num)
where temp is the difference in months from the previous date, computed for every id as (year - previous year) * 12 + (month - previous month).
This seems to work when tested on a small portion of df, but I'm sure there is a better way of doing this, maybe using the .groupby() method.
Since df is made of millions of observations my code takes too much time (and I'd like to learn a more efficient and pythonic way of doing this)
What you want to do is use groupby-filter rather than a groupby apply.
df.groupby('id').filter(lambda x: not (x.date.diff() > pd.Timedelta(days=32)).any())
provides exactly:
id date
0 205 2019-12-01
1 205 2020-01-01
2 205 2020-02-01
And indeed, I would keep the index unique; there are too many useful characteristics to retain.
Both this response and Michael's above are correct in terms of output. In terms of performance, they are very similar as well:
%timeit df.groupby('id').filter(lambda x: not (x.date.diff() > pd.Timedelta(days=32)).any())
1.48 ms ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
and
%timeit df[df.groupby('id')['date'].transform(lambda x: x.diff().max() < pd.Timedelta(days=32))]
1.7 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For most operations, this difference is negligible.
You can use the following approach. Only ~3x faster in my tests.
df[df.groupby('id')['date'].transform(lambda x: x.diff().max() < pd.Timedelta(days=32))]
Out:
date
id
205 2019-12-01
205 2020-01-01
205 2020-02-01
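If the per-group lambda becomes the bottleneck on millions of rows, the same check can probably be phrased with built-in groupby operations only. A minimal sketch, assuming the rows are already sorted by date within each id; month_no, step and has_gap are names made up for illustration. Like the 32-day test, it only flags steps larger than one month.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [205, 205, 205, 211, 211, 211],
    'date': pd.to_datetime(['2019-12-01', '2020-01-01', '2020-02-01',
                            '2019-12-01', '2020-01-01', '2020-03-01']),
})

# Count months on one integer scale so year boundaries need no special case.
month_no = df['date'].dt.year * 12 + df['date'].dt.month

# Step from the previous row within the same id (NaN on each id's first row).
step = month_no.groupby(df['id']).diff()

# Mark every row of an id that contains at least one step above one month.
has_gap = step.gt(1).groupby(df['id']).transform('any')

result = df[~has_gap]
print(result)
```

Working in whole months also avoids relying on a day-count threshold such as 32 days.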
Here's the thing, I have the dataset below where date is the index:
date value
2020-01-01 100
2020-02-01 140
2020-03-01 156
2020-04-01 161
2020-05-01 170
.
.
.
And I want to transform it in this other dataset:
value_t0 value_t1 value_t2 value_t3 value_t4 ...
100 NaN NaN NaN NaN ...
140 100 NaN NaN NaN ...
156 140 100 NaN NaN ...
161 156 140 100 NaN ...
170 161 156 140 100 ...
First I thought about using pandas.pivot_table, but that would just provide a different layout grouped by some column, which is not exactly what I want. Later, I thought about using pandasql and applying 'case when', but that wouldn't work because I would have to type dozens of lines of code. So I'm stuck here.
try this:
new_df = pd.DataFrame({f"value_t{i}": df['value'].shift(i) for i in range(len(df))})
The Series .shift(n) method gives you a single column of your desired output by shifting everything down n rows and filling in NaNs above. So we're building a new dataframe by feeding it a dictionary of the form {column name: column data, ...}, using a dictionary comprehension over the shift amounts 0 to len(df) - 1.
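Since one column per row makes the result grow quadratically, it may be enough to keep only the first few lags. Here is a small sketch of that variation on the five sample rows; max_lag is a hypothetical cap that is not part of the original question.

```python
import pandas as pd

df = pd.DataFrame(
    {'value': [100, 140, 156, 161, 170]},
    index=pd.to_datetime(['2020-01-01', '2020-02-01', '2020-03-01',
                          '2020-04-01', '2020-05-01']),
)

max_lag = 3  # hypothetical cap: keep value_t0 .. value_t2 only

# One shifted copy of the column per lag; the dict keys become column names.
lagged = pd.concat(
    {f'value_t{i}': df['value'].shift(i) for i in range(max_lag)},
    axis=1,
)
print(lagged)
```

The date index is preserved, so each row of lagged still lines up with the original observation.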
I think the best approach is to use numpy:
import numpy as np
values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0], 1)] = np.nan
new_df = pd.DataFrame(new_values).add_prefix('value_t')
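One thing worth double-checking with the snippet above: np.repeat([values], n, axis=0).T puts values[r] into every column of row r, so the lower triangle ends up holding repeated copies of the current value rather than the lagged ones. Below is a sketch of an index-based numpy construction that does match the .shift() output; the lag array is my own helper and not part of the original answer, and its speed relative to the timings reported here is untested.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [100, 140, 156, 161, 170]})

values = df['value'].to_numpy(dtype=float)
n = len(values)

# lag[r, c] = r - c is the row that should feed column c of row r;
# negative entries mean "no earlier value yet" and become NaN.
lag = np.arange(n)[:, None] - np.arange(n)[None, :]
new_values = np.where(lag >= 0, values[np.clip(lag, 0, None)], np.nan)

new_df = pd.DataFrame(new_values, index=df.index).add_prefix('value_t')

# Cross-check against the shift-based construction.
expected = pd.DataFrame({f'value_t{i}': df['value'].shift(i) for i in range(n)})
assert np.allclose(new_df.to_numpy(), expected.to_numpy(dtype=float), equal_nan=True)
print(new_df)
```

Both versions still materialize an n x n array, so memory use grows quadratically with the number of rows.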
Times for 5000 rows
%%timeit
values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0],1)] = np.nan
new_df = pd.DataFrame(new_values).add_prefix('value_t')
556 ms ± 35.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
new_df = pd.DataFrame({f"value_t{i}": df['value'].shift(i) for i in range(len(df))})
1.31 s ± 36.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Time without add_prefix
%%timeit
values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0],1)] = np.nan
new_df = pd.DataFrame(new_values)
357 ms ± 8.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Resampling a dataframe can take the dataframe to either a higher or lower temporal resolution. Most of the time this is used to go to a lower resolution (e.g. resample 1-minute data to monthly values). When the dataset is sparse (for example, no data were collected in Feb-2020), the Feb-2020 row in the resampled dataframe will be filled with NaNs. The problem is that when the data record is long AND sparse there are a lot of NaN rows, which makes the dataframe unnecessarily large and takes a lot of CPU time. For example, consider this dataframe and resample operation:
import numpy as np
import pandas as pd
freq1 = pd.date_range("20000101", periods=10, freq="S")
freq2 = pd.date_range("20200101", periods=10, freq="S")
index = np.hstack([freq1.values, freq2.values])
data = np.random.randint(0, 100, (20, 10))
cols = list("ABCDEFGHIJ")
df = pd.DataFrame(index=index, data=data, columns=cols)
# now resample to daily average
df = df.resample(rule="1D").mean()
Most of the data in this dataframe is useless and can be removed via:
df.dropna(how="all", axis=0, inplace=True)
however, this is sloppy. Is there another method to resample the dataframe that does not fill all of the data gaps with NaN (i.e. in the example above, the resultant dataframe would have only two rows)?
Updating my original answer with what I think is an improvement, plus updated times.
Use groupby
There are a couple of ways you can use groupby instead of resample. In the case of a day ("1D") resampling, you can just use the date property of the DatetimeIndex:
df = df.groupby(df.index.date).mean()
This is in fact faster than the resample for your data:
%%timeit
df.resample(rule='1D').mean().dropna()
# 2.08 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.groupby(df.index.date).mean()
# 666 µs ± 15.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The more general approach would be to use the floor of the timestamps to do the groupby operation:
rule = '1D'
f = df.index.floor(rule)
df.groupby(f).mean()
# A B C D E F G H I J
# 2000-01-01 50.5 33.5 62.7 42.4 46.7 49.2 64.0 53.3 71.0 38.0
# 2020-01-01 50.4 56.3 57.4 46.2 55.0 60.2 60.3 57.8 63.5 47.3
This will work with more irregular frequencies as well. The main snag here is that by default the floor is calculated relative to a fixed origin (the Unix epoch, 1970-01-01) rather than your first timestamp, which can cause weird results (see my post):
rule = '7D'
f = df.index.floor(rule)
df.groupby(f).mean()
# A B C D E F G H I J
# 1999-12-30 50.5 33.5 62.7 42.4 46.7 49.2 64.0 53.3 71.0 38.0
# 2019-12-26 50.4 56.3 57.4 46.2 55.0 60.2 60.3 57.8 63.5 47.3
The major issue is that the resampling doesn't start on the earliest timestamp within your data. However, it is fixable using this solution to the above post:
# custom function for flooring relative to a start date
def floor(x, freq):
    offset = x[0].ceil(freq) - x[0]
    return (x + offset).floor(freq) - offset
rule = '7D'
f = floor(df.index, rule)
df.groupby(f).mean()
# A B C D E F G H I J
# 2000-01-01 50.5 33.5 62.7 42.4 46.7 49.2 64.0 53.3 71.0 38.0
# 2019-12-28 50.4 56.3 57.4 46.2 55.0 60.2 60.3 57.8 63.5 47.3
# the cycle of 7 days is now starting from 01-01-2000
Just note here that the custom floor() function is relatively slow compared to pandas' built-in floor (df.index.floor() here). So it is best to use the latter if you can, but both are better than the original resample (in your example):
%%timeit
df.groupby(df.index.floor('1D')).mean()
# 1.06 ms ± 6.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df.groupby(floor(df.index, '1D')).mean()
# 1.42 ms ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
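As a quick sanity check, the grouped result can be compared with the resample-then-dropna result. This is a sketch on a smaller frame than the one in the question (three columns, fixed seed); the variable names by_floor and by_resample are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
seconds = pd.to_timedelta(np.arange(10), unit='s')

# Ten one-second samples in 2000 and ten more in 2020, nothing in between.
index = (pd.Timestamp('2000-01-01') + seconds).append(
    pd.Timestamp('2020-01-01') + seconds
)
df = pd.DataFrame(rng.integers(0, 100, (20, 3)), index=index, columns=list('ABC'))

# Grouping on the floored timestamps only creates bins that contain data.
by_floor = df.groupby(df.index.floor('1D')).mean()

# resample creates one row per calendar day, most of them all-NaN.
by_resample = df.resample('1D').mean().dropna(how='all')

assert list(by_floor.index) == list(by_resample.index)
assert np.allclose(by_floor.to_numpy(), by_resample.to_numpy())
print(by_floor)
```

The groupby version never builds the intermediate frame with one row per empty day, which is where the savings in the timings come from.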
I want to generate a summary table from a tidy pandas DataFrame. I now use groupby and two for loops, which does not seem efficient. Seems stacking and unstacking would get me there, but I have failed.
Sample data
import pandas as pd
import numpy as np
import copy
import random
# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
rows = []
for _ in range(10):
    rows.append(
        {
            'Stage': random.choice(['OP', 'FUEL', 'EOL']),
            'Exc': str(np.random.randint(low=0, high=1000)),
            'Cat': random.choice(['CC', 'HT', 'PM']),
            'Score': np.random.random(),
        }
    )
df_tidy = pd.DataFrame(rows, columns=['Stage', 'Exc', 'Cat', 'Score'])
df_tidy
returns
Stage Exc Cat Score
0 OP 929 HT 0.946234
1 OP 813 CC 0.829522
2 FUEL 114 PM 0.868605
3 OP 896 CC 0.382077
4 FUEL 10 CC 0.832246
5 FUEL 515 HT 0.632220
6 EOL 970 PM 0.532310
7 FUEL 198 CC 0.209856
8 FUEL 848 CC 0.479470
9 OP 968 HT 0.348093
I would like a new DataFrame with Stages as columns, Cats as rows and sum of Scores as values. I achieve it this way:
Working but probably inefficient approach
new_df = pd.DataFrame(columns=list(df_tidy['Stage'].unique()))
for cat, small_df in df_tidy.groupby('Cat'):
    for lcs, smaller_df in small_df.groupby('Stage'):
        new_df.loc[cat, lcs] = smaller_df['Score'].sum()
new_df['Total'] = new_df.sum(axis=1)
new_df
Which returns what I want:
OP FUEL EOL Total
CC 1.2116 1.52157 NaN 2.733170
HT 1.29433 0.63222 NaN 1.926548
PM NaN 0.868605 0.53231 1.400915
But I cannot believe this is the simplest or most efficient path.
Question
What pandas magic am I missing out on?
Update - Timing the proposed solutions
To understand the differences between pivot_table and crosstab proposed below, I timed the three solutions with a 100,000 row dataframe built exactly as above:
groupby solution, that I thought was inefficient:
%%timeit
new_df = pd.DataFrame(columns=list(df_tidy['Stage'].unique()))
for cat, small_df in df_tidy.groupby('Cat'):
    for lcs, smaller_df in small_df.groupby('Stage'):
        new_df.loc[cat, lcs] = smaller_df['Score'].sum()
new_df['Total'] = new_df.sum(axis=1)
41.2 ms ± 3.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
crosstab solution, that requires a creation of a DataFrame in the background, even if the passed data is already in DataFrame format:
%%timeit
pd.crosstab(index=df_tidy.Cat,columns=df_tidy.Stage, values=df_tidy.Score, aggfunc='sum', margins = True, margins_name = 'Total').iloc[:-1,:]
67.8 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
pivot_table solution:
%%timeit
pd.pivot_table(df_tidy, index=['Cat'], columns=["Stage"], margins=True, margins_name='Total', aggfunc=np.sum).iloc[:-1,:]
713 ms ± 20.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, it would appear that the clunky groupby solution is the quickest.
A simple solution using crosstab:
pd.crosstab(index=df_tidy.Cat, columns=df_tidy.Stage, values=df_tidy.Score, aggfunc='sum', margins=True, margins_name='Total').iloc[:-1,:]
Out[342]:
Stage EOL FUEL OP Total
Cat
CC NaN 1.521572 1.211599 2.733171
HT NaN 0.632220 1.294327 1.926547
PM 0.53231 0.868605 NaN 1.400915
I wonder whether a simpler solution than pd.crosstab would be to use pd.pivot_table:
pd.pivot_table(df_tidy, index=['Cat'], columns=["Stage"], margins=True, margins_name='Total', aggfunc=np.sum).iloc[:-1,:]
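Since the question mentions stacking and unstacking, here is a sketch of that route: a single groupby on both keys followed by unstack. It uses the ten sample rows printed in the question (hard-coded so the numbers are reproducible); the variable name summary is made up, and I have not timed it against the other solutions.

```python
import pandas as pd

df_tidy = pd.DataFrame({
    'Stage': ['OP', 'OP', 'FUEL', 'OP', 'FUEL', 'FUEL', 'EOL', 'FUEL', 'FUEL', 'OP'],
    'Exc': ['929', '813', '114', '896', '10', '515', '970', '198', '848', '968'],
    'Cat': ['HT', 'CC', 'PM', 'CC', 'CC', 'HT', 'PM', 'CC', 'CC', 'HT'],
    'Score': [0.946234, 0.829522, 0.868605, 0.382077, 0.832246,
              0.632220, 0.532310, 0.209856, 0.479470, 0.348093],
})

# Sum Score per (Cat, Stage) pair, then pivot the Stage level into columns.
# Combinations that never occur come out as NaN.
summary = df_tidy.groupby(['Cat', 'Stage'])['Score'].sum().unstack('Stage')

# Row totals; sum skips NaN by default.
summary['Total'] = summary.sum(axis=1)
print(summary)
```

Selecting the Score column before summing keeps the non-numeric Exc column out of the aggregation.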
I am trying to use groupby, nlargest, and sum functions in Pandas together, but having trouble making it work.
State County Population
Alabama a 100
Alabama b 50
Alabama c 40
Alabama d 5
Alabama e 1
...
Wyoming a.51 180
Wyoming b.51 150
Wyoming c.51 56
Wyoming d.51 5
I want to use groupby to select by state, then get the top 2 counties by population. Then use only those top 2 county population numbers to get a sum for that state.
In the end, I'll have a list that will have the state and the population (of its top 2 counties).
I can get the groupby and nlargest to work, but getting the sum of the nlargest(2) is a challenge.
The line I have right now is simply: df.groupby('State')['Population'].nlargest(2)
You can use apply after performing the groupby:
df.groupby('State')['Population'].apply(lambda grp: grp.nlargest(2).sum())
I think the issue you're having is that df.groupby('State')['Population'].nlargest(2) returns a plain Series (indexed by State and the original row label) rather than a grouped object, so you can no longer do group level operations on it. In general, if you want to perform multiple operations in a group, you'll need to use apply/agg.
The resulting output:
State
Alabama 150
Wyoming 330
EDIT
A slightly cleaner approach, as suggested by @cs95:
df.groupby('State')['Population'].nlargest(2).sum(level=0)
This is slightly slower than using apply on larger DataFrames though.
Using the following setup:
import numpy as np
import pandas as pd
from string import ascii_letters
n = 10**6
df = pd.DataFrame({'A': np.random.choice(list(ascii_letters), size=n),
'B': np.random.randint(10**7, size=n)})
I get the following timings:
In [3]: %timeit df.groupby('A')['B'].apply(lambda grp: grp.nlargest(2).sum())
103 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: %timeit df.groupby('A')['B'].nlargest(2).sum(level=0)
147 ms ± 3.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The slower performance is potentially caused by the level kwarg in sum performing a second groupby under the hood.
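On recent pandas versions (2.0 and later, as far as I know) sum no longer accepts the level argument, so the second variant has to spell out the groupby on the index level explicitly. The sketch below shows that form plus a sort-based alternative on a cut-down version of the sample data; via_level and via_sort are names made up for illustration, and I have not timed either against the numbers above.

```python
import pandas as pd

df = pd.DataFrame({
    'State': ['Alabama'] * 5 + ['Wyoming'] * 4,
    'County': ['a', 'b', 'c', 'd', 'e', 'a.51', 'b.51', 'c.51', 'd.51'],
    'Population': [100, 50, 40, 5, 1, 180, 150, 56, 5],
})

# nlargest on the grouped column returns a Series indexed by
# (State, original row label); level 0 of that index is the State.
top2 = df.groupby('State')['Population'].nlargest(2)
via_level = top2.groupby(level=0).sum()

# Alternative: sort once, keep the first two rows of each State, then sum.
via_sort = (
    df.sort_values('Population', ascending=False)
      .groupby('State')
      .head(2)
      .groupby('State')['Population']
      .sum()
)

assert via_level.equals(via_sort)
print(via_level)
```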
Using agg, the grouping logic looks like:
df.groupby('State').agg({'Population': lambda x: x.nlargest(2).sum()})
This results in another DataFrame, which you could query to find the most populous states, etc.
Population
State
Alabama 150
Wyoming 330
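As an example of querying that result, here is a sketch using named aggregation so that the output column gets an explicit label, followed by a sort to rank the states. The column name top2_population is hypothetical, and the data is a cut-down version of the sample from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'State': ['Alabama'] * 5 + ['Wyoming'] * 4,
    'County': ['a', 'b', 'c', 'd', 'e', 'a.51', 'b.51', 'c.51', 'd.51'],
    'Population': [100, 50, 40, 5, 1, 180, 150, 56, 5],
})

# Named aggregation: output_column=(input_column, function).
# 'top2_population' is a made-up label for the result.
top2 = df.groupby('State').agg(
    top2_population=('Population', lambda x: x.nlargest(2).sum())
)

# Rank the states by the combined population of their two largest counties.
ranked = top2.sort_values('top2_population', ascending=False)
print(ranked)
```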