The normal groupby mean is easy:
df.groupby(['col_a','col_b']).mean()[col_i_want]
However, if I want to apply a winsorized mean (default limits of 0.05 and 0.95), which is equivalent to clipping the dataset at those quantiles and then taking the mean, there suddenly seems to be no easy way to do it. I would have to:
winsorized_mean = []
col_i_want = 'col_c'
for entry in df['col_a'].unique():
    for entry2 in df['col_b'].unique():
        sub_df = df[(df['col_a'] == entry) & (df['col_b'] == entry2)]
        vals = sub_df[col_i_want]
        m = vals.clip(lower=vals.quantile(0.05), upper=vals.quantile(0.95)).mean()
        winsorized_mean.append([entry, entry2, m])
Is there a function I'm not aware of to do this automatically?
You can use scipy.stats.trim_mean:
import pandas as pd
from scipy.stats import trim_mean
# label 'a' will exhibit different means depending on trimming
label = ['a'] * 20 + ['b'] * 80 + ['c'] * 400 + ['a'] * 100
data = list(range(100)) + list(range(500, 1000))
df = pd.DataFrame({'label': label, 'data': data})
grouped = df.groupby('label')
# trim 5% off both ends
print(grouped['data'].apply(trim_mean, 0.05))
# trim 10% off both ends
print(grouped['data'].apply(trim_mean, 0.1))
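Note that trim_mean drops the extreme observations entirely, whereas a winsorized mean clips them to the limit values, so the two can differ. For a true winsorized mean you could use scipy.stats.mstats.winsorize; a minimal sketch with the question's column names:
from scipy.stats.mstats import winsorize
# clip the bottom and top 5% of each group to its 5th/95th percentile, then average
result = df.groupby(['col_a', 'col_b'])['col_c'].apply(
    lambda s: winsorize(s, limits=[0.05, 0.05]).mean())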
I have a large multi-indexed (y,t) single-valued DataFrame df. Currently I select a subset via df.loc[(Y,T), :] and create a dictionary out of it. The following MWE works, but the selection is very slow for large subsets.
import numpy as np
import pandas as pd
# Full DataFrame
y_max = 50
Y_max = range(1, y_max+1)
t_max = 100
T_max = range(1, t_max+1)
idx_max = tuple((y,t) for y in Y_max for t in T_max)
df = pd.DataFrame(np.random.sample(y_max*t_max), index=idx_max, columns=['Value'])
# Create Dictionary of Subset of Data
y1 = 4
yN = 10
Y = range(y1, yN+1)
t1 = 5
tN = 9
T = range(t1, tN+1)
idx_sub = tuple((y,t) for y in Y for t in T)
data_sub = df.loc[(Y,T), :] #This is really slow
dict_sub = dict(zip(idx_sub, data_sub['Value']))
# result, e.g. (y,t) = (5,7)
dict_sub[5,7] == df.loc[(5,7), 'Value']
I was thinking of using df.loc[(y1,t1):(yN,tN), :], but it does not work properly, as the second index is only bounded in the final year yN.
One idea is to use Index.isin with itertools.product in boolean indexing:
from itertools import product
idx_sub = tuple(product(Y, T))
dict_sub = df.loc[df.index.isin(idx_sub),'Value'].to_dict()
print (dict_sub)
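If the MultiIndex is sorted, plain label slicing with pd.IndexSlice may be faster still, because it avoids materializing the full index product; a sketch assuming a lexsorted index:
df_sorted = df.sort_index()  # range slicing needs a lexsorted MultiIndex
idx = pd.IndexSlice
dict_sub = df_sorted.loc[idx[y1:yN, t1:tN], 'Value'].to_dict()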
How do I print the dataframe where the population is within 5% of the mean (2.5% below and 2.5% above)?
Here is what I've tried:
mean = df['population'].mean()
minimum = mean - (0.025*mean)
maximum = mean + (0.025*mean)
df[df.population < maximum]
Use:
df.loc[(df['population'] > minimum) & (df['population'] < maximum)]
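Equivalently, Series.between does the same filter in one call; note it includes the bounds by default, while the comparison above is strict:
df[df['population'].between(minimum, maximum)]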
import pandas as pd
df = pd.read_csv("fileName.csv")
# suppose this DataFrame contains the population as integers
mean = df['population'].mean()
minimum = mean - (0.025*mean)
maximum = mean + (0.025*mean)
ans = df.loc[(df['population']>minimum) & (df['population'] <maximum)]
ans
You can use this.
I built this dataframe for testing.
import numpy as np
import pandas as pd
random_data = np.random.randint(1_000_000, 100_000_000, 200)
random_df = pd.DataFrame(random_data, columns=['population'])
random_df
Here's the answer to specifically what you were asking for.
pop = random_df.population
top_boundary = pop.mean() + pop.mean() * 0.025
low_boundary = pop.mean() - pop.mean() * 0.025
criteria_boundary_limits = random_df.population.between(low_boundary, top_boundary)
criteria_boundary_df = random_df.loc[criteria_boundary_limits]
criteria_boundary_df
But another answer could be had by using quantiles. I used 40 quantiles because 1/40 = 0.025; note that groups 20 and 21 bracket the median rather than the mean.
groups_list = list(range(1,41))
random_df['groups'] = pd.qcut(random_df['population'], 40, labels = groups_list)
criteria_groups_limits = random_df.groups.between(20,21)
criteria_groups_df = random_df.loc[criteria_groups_limits]
criteria_groups_df
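If what you actually want is the middle 5% of the distribution (rather than values within 5% of the mean), Series.quantile gives the cut points directly; a sketch:
low, high = random_df['population'].quantile([0.475, 0.525])
middle_df = random_df[random_df['population'].between(low, high)]
middle_df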
I want to build up a DataFrame from scratch, where each value is calculated from the value before it (for a barrier option). I know that I can use a Monte Carlo simulation to solve it, but it just won't work the way I want it to.
The formula is:
Value in row before * np.exp((r-sigma**2/2)*T/TradingDays+sigma*np.sqrt(T/TradingDays)*z)
The first code I wrote just calculates the first column. I know that I need a second loop but can't really manage it.
The result should be that for each simulation a new value is calculated from the previous one, for 500 days, meaning the columns should run from S_1 up to S_500, with a total of 1000 simulations. (I need to generate new columns based on the previous value using the formula.)
Similar to this: for the 1st simulation 500 days, for the 2nd simulation 500 days, and so on...
import numpy as np
import pandas as pd
from scipy.stats import norm
import random as rd
import math
simulation = 0
S_0 = 42
T = 2
r = 0.02
sigma = 0.20
TradingDays = 500
df = pd.DataFrame()
for i in range(0, TradingDays):
    z = norm.ppf(rd.random())
    simulation = simulation + 1
    S_1 = S_0*np.exp((r-sigma**2/2)*T/TradingDays+sigma*np.sqrt(T/TradingDays)*z)
    df = df.append({
        'S_1': S_1,
        'S_0': S_0
    }, ignore_index=True)
df = df.round({'S_0': 2,
               'S_1': 2})
df.index += 1
df.index.name = 'Simulation'
print(df)
I found another possible approach here, and it does solve the problem, but just for one row; the next row is just the same calculation: Generate a Dataframe that follow a mathematical function for each column / row
If I just replace it with my formula I get the same problem.
replacing:
exp(r - q * sqrt(sigma))*T+ (np.random.randn(nrows) * sqrt(deltaT)))
with:
exp((r-sigma**2/2)*T/nrows+sigma*np.sqrt(T/nrows)*z))
import numpy as np
import pandas as pd
from scipy.stats import norm
import random as rd
import math
S_0 = 42
T = 2
r = 0.02
sigma = 0.20
TradingDays = 50
Simulation = 100
df = pd.DataFrame({'s0': [S_0] * Simulation})
for i in range(1, TradingDays):
    z = norm.ppf(rd.random())
    df[f's{i}'] = df.iloc[:, -1] * np.exp((r-sigma**2/2)*T/TradingDays+sigma*np.sqrt(T/TradingDays)*z)
print(df)
I would most likely work with the last code and try to solve the problem with it.
How about just overwriting the value of S_0 by the new value of S_1 while you loop and keeping all simulations in a list?
Like this:
import numpy as np
import pandas as pd
import random
from scipy.stats import norm
S_0 = 42
T = 2
r = 0.02
sigma = 0.20
trading_days = 50
output = []
for i in range(trading_days):
    z = norm.ppf(random.random())
    value = S_0*np.exp((r - sigma**2 / 2) * T / trading_days + sigma * np.sqrt(T/trading_days) * z)
    output.append(value)
    S_0 = value
df = pd.DataFrame({'simulation': output})
Perhaps I'm missing something, but I don't see the need for a second loop.
Also, this eliminates calling df.append() in a loop, which should be avoided. (See here)
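If you also need the 1000 independent simulations from the question, the whole grid can be built without any Python loop; a vectorized sketch of my own (assuming each day of each simulation gets its own independent draw):
import numpy as np
import pandas as pd

n_sims, n_days = 1000, 500
S_0, T, r, sigma = 42, 2, 0.02, 0.20
dt = T / n_days
# one independent standard-normal draw per day and per simulation
z = np.random.standard_normal((n_sims, n_days))
# cumulative product of the per-step growth factors yields every path at once
paths = S_0 * np.cumprod(np.exp((r - sigma**2 / 2) * dt + sigma * np.sqrt(dt) * z), axis=1)
df = pd.DataFrame(paths, columns=[f'S_{i}' for i in range(1, n_days + 1)])
df.index = pd.RangeIndex(1, n_sims + 1, name='Simulation')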
Solution based on the answer of bartaelterman, thank you very much!
import numpy as np
import pandas as pd
from scipy.stats import norm
import random as rd
import math
# dividing the list into chunks to later append it to the dataframe in the right order
def chunk_list(lst, chunk_size):
    for i in range(0, len(lst), chunk_size):
        yield lst[i:i + chunk_size]
def blackscholes():
    # d1/d2 divide by sigma*np.sqrt(T) (equals sqrt(2) here since T = 2)
    d1 = (math.log(S_0/K)+(r+sigma**2/2)*T)/(sigma*np.sqrt(T))
    d2 = (math.log(S_0/K)+(r-sigma**2/2)*T)/(sigma*np.sqrt(T))
    preis_call_option = S_0*norm.cdf(d1)-K*np.exp(-r*T)*norm.cdf(d2)
    return preis_call_option
K = 40
S_0 = 42
T = 2
r = 0.02
sigma = 0.2
U = 38
simulation = 10000
trading_days = 500
trading_days = trading_days -1
#creating 2 lists for the first and second loop
loop_simulation = []
loop_trading_days = []
#first loop calculates the first column in a list
for j in range(0, simulation):
    print("Progressbar_1_2 {:2.2%}".format(j / simulation), end="\n\r")
    S_Tag_new = 0
    NORM_S_INV = norm.ppf(rd.random())
    S_Tag = S_0*np.exp((r-sigma**2/2)*T/trading_days+sigma*np.sqrt(T/trading_days)*NORM_S_INV)
    S_Tag_new = S_Tag
    loop_simulation.append(S_Tag)
    # second loop calculates the rows for the columns in a list
    for i in range(0, trading_days):
        NORM_S_INV = norm.ppf(rd.random())
        S_Tag = S_Tag_new*np.exp((r-sigma**2/2)*T/trading_days+sigma*np.sqrt(T/trading_days)*NORM_S_INV)
        loop_trading_days.append(S_Tag)
        S_Tag_new = S_Tag
# values from the second loop are split into chunks of trading_days per simulation
loop_trading_days_chunked = list(chunk_list(loop_trading_days, trading_days))
# first dataframe with just the first result from the first loop for each simulation
df1 = pd.DataFrame({'S_Tag 1': loop_simulation})
# appending the chunked list from the second loop to a second dataframe
df2 = pd.DataFrame(loop_trading_days_chunked)
# merging both dataframes into one
df3 = pd.concat([df1, df2], axis=1)
I would like to do a regression with a rolling window, but I got only one parameter back after the regression:
rolling_beta = sm.OLS(X2, X1, window_type='rolling', window=30).fit()
rolling_beta.params
The result:
X1 5.715089
dtype: float64
What could be the problem?
Thanks in advance, Roland
I think the problem is that the parameters window_type='rolling' and window=30 simply do not do anything. First I'll show you why, and at the end I'll provide a setup I've got lying around for linear regressions on rolling windows.
1. The problem with your function:
Since you haven't provided some sample data, here's a function that returns a dataframe of a desired size with some random numbers:
# Function to build synthetic data
import numpy as np
import pandas as pd
import statsmodels.api as sm
from collections import OrderedDict
def sample(rSeed, periodLength, colNames):
    np.random.seed(rSeed)
    date = pd.to_datetime("1st of Dec, 2018")
    cols = OrderedDict()
    for col in colNames:
        cols[col] = np.random.normal(loc=0.0, scale=1.0, size=periodLength)
    dates = date + pd.to_timedelta(np.arange(periodLength), 'D')
    df = pd.DataFrame(cols, index=dates)
    return df
df = sample(rSeed=123, periodLength=50, colNames=['X1', 'X2'])
Output:
X1 X2
2018-12-01 -1.085631 -1.294085
2018-12-02 0.997345 -1.038788
2018-12-03 0.282978 1.743712
2018-12-04 -1.506295 -0.798063
2018-12-05 -0.578600 0.029683
.
.
.
2019-01-17 0.412912 -1.363472
2019-01-18 0.978736 0.379401
2019-01-19 2.238143 -0.379176
Now, try:
rolling_beta = sm.OLS(df['X2'], df['X1'], window_type='rolling', window=30).fit()
rolling_beta.params
Output:
X1 -0.075784
dtype: float64
And this at least matches the structure of your output: you're expecting an estimate for each of your sample windows, but instead you get a single estimate. So I looked around for other examples using the same function online and in the statsmodels docs, but I was unable to find any that actually worked. What I did find were a few discussions about how this functionality was deprecated a while ago. So I tested the same function with some bogus input for the parameters:
rolling_beta = sm.OLS(df['X2'], df['X1'], window_type='amazing', window=3000000).fit()
rolling_beta.params
Output:
X1 -0.075784
dtype: float64
And as you can see, the estimates are the same, and no error message is returned for the bogus input. So I suggest you take a look at the function below, which is something I've put together to perform rolling regression estimates.
2. A function for regressions on rolling windows of a pandas dataframe
df = sample(rSeed = 123, colNames = ['X1', 'X2', 'X3'], periodLength = 50)
def RegressionRoll(df, subset, dependent, independent, const, win, parameters):
    """
    RegressionRoll takes a dataframe, makes a subset of the data if you like,
    runs a series of regressions with a specified window length, and
    returns a dataframe with BETA or R^2 for each window split of the data.

    Parameters:
    ===========
    df: pandas dataframe
    subset: integer - has to be smaller than the size of the df
    dependent: string that specifies name of dependent variable
    independent: LIST of strings that specifies names of independent variables
    const: boolean - whether or not to include a constant term
    win: integer - window length of each model
    parameters: string that specifies which model parameters to return:
                BETA or R^2

    Example:
    ========
    RegressionRoll(df=df, subset=50, dependent='X1', independent=['X2'],
                   const=True, parameters='beta', win=30)
    """
    # Data subset
    if subset != 0:
        df = df.tail(subset)

    # Loop info
    end = df.shape[0]
    rng = np.arange(start=win, stop=end, step=1)

    # Subset and store dataframes
    frames = {}
    n = 1
    for i in rng:
        df_temp = df.iloc[:i].tail(win)
        newname = 'df' + str(n)
        frames.update({newname: df_temp})
        n += 1

    # Analysis on subsets
    df_results = pd.DataFrame()
    for frame in frames:
        # Rolling data frames
        dfr = frames[frame]
        y = dependent
        x = independent
        if const:
            x = sm.add_constant(dfr[x])
            model = sm.OLS(dfr[y], x).fit()
        else:
            model = sm.OLS(dfr[y], dfr[x]).fit()
        if parameters == 'beta':
            theParams = model.params[0:]
            coefs = theParams.to_frame()
            df_temp = pd.DataFrame(coefs.T)
            indx = dfr.tail(1).index[-1]
            df_temp['Date'] = indx
            df_temp = df_temp.set_index(['Date'])
        if parameters == 'R2':
            theParams = model.rsquared
            df_temp = pd.DataFrame([theParams])
            indx = dfr.tail(1).index[-1]
            df_temp['Date'] = indx
            df_temp = df_temp.set_index(['Date'])
            df_temp.columns = [', '.join(independent)]
        df_results = pd.concat([df_results, df_temp], axis=0)
    return df_results
df_rolling = RegressionRoll(df=df, subset=50, dependent='X1', independent=['X2'],
                            const=True, parameters='beta', win=30)
Output: A dataframe with beta estimates for OLS of X2 on X1 for each 30 period window of the data.
const X2
Date
2018-12-30 0.044042 0.032680
2018-12-31 0.074839 -0.023294
2019-01-01 -0.063200 0.077215
.
.
.
2019-01-16 -0.075938 -0.215108
2019-01-17 -0.143226 -0.215524
2019-01-18 -0.129202 -0.170304
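For completeness: statsmodels 0.11+ ships statsmodels.regression.rolling.RollingOLS, which does this directly; a minimal sketch on the synthetic df from above (assuming a recent statsmodels version):
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS

exog = sm.add_constant(df[['X2']])
rres = RollingOLS(df['X1'], exog, window=30).fit()
print(rres.params.dropna())  # one (const, X2) row per 30-observation window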
I have this code, which works fine and gives me the result I am looking for. It loops through a list of window sizes to create rolling aggregates for each metric in sum_metrics_list, min_metrics_list and max_metrics_list.
# create the rolling aggregations for each window
for window in constants.AGGREGATION_WINDOW:
    # get the sum and count sums
    sum_metrics_names_list = [x[6:] + "_1_" + str(window) for x in sum_metrics_list]
    adt_df[sum_metrics_names_list] = adt_df.groupby('athlete_id')[sum_metrics_list].apply(lambda x: x.rolling(center=False, window=window, min_periods=1).sum())
    # get the min of mins
    min_metrics_names_list = [x[6:] + "_1_" + str(window) for x in min_metrics_list]
    adt_df[min_metrics_names_list] = adt_df.groupby('athlete_id')[min_metrics_list].apply(lambda x: x.rolling(center=False, window=window, min_periods=1).min())
    # get the max of max
    max_metrics_names_list = [x[6:] + "_1_" + str(window) for x in max_metrics_list]
    adt_df[max_metrics_names_list] = adt_df.groupby('athlete_id')[max_metrics_list].apply(lambda x: x.rolling(center=False, window=window, min_periods=1).max())
It works well on small datasets but as soon as I run it on my full data with >3000 metrics and 40 windows it becomes very slow. Is there any way to optimise this code?
The benchmark (and code) below suggests that you can save a significant amount of time by using
df.groupby(...).rolling()
instead of
df.groupby(...)[col].apply(lambda x: x.rolling(...))
The main time-saving idea here is to try to apply vectorized functions (such as sum) to the largest possible array (or DataFrame) at one time (with one function call) instead of many tiny function calls.
df.groupby(...).rolling().sum() calls sum on each (grouped) sub-DataFrame. It
can compute the rolling sums for all the columns with one call.
You could use df[sum_metrics_list+[key]].groupby(key).rolling().sum() to compute the rolling/sum on the sum_metrics_list columns.
In contrast, df.groupby(...)[col].apply(lambda x: x.rolling(...)) calls sum on a single column of each (grouped) sub-DataFrame. Since you have >3000 metrics you end up calling df.groupby(...)[col].rolling().sum() (or min or max) 3000 times.
Of course, this pseudo-logic of counting the number of calls is only a heuristic which may guide you in the direction of faster code. The proof is in the pudding:
import collections
import timeit
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def make_df(nrows=100, ncols=3):
    seed = 2018
    np.random.seed(seed)
    df = pd.DataFrame(np.random.randint(10, size=(nrows, ncols)))
    df['athlete_id'] = np.random.randint(10, size=nrows)
    return df

def orig(df, key='athlete_id'):
    columns = list(df.columns.difference([key]))
    result = pd.DataFrame(index=df.index)
    for window in range(2, 4):
        for col in columns:
            colname = 'sum_col{}_winsize{}'.format(col, window)
            result[colname] = df.groupby(key)[col].apply(lambda x: x.rolling(
                center=False, window=window, min_periods=1).sum())
            colname = 'min_col{}_winsize{}'.format(col, window)
            result[colname] = df.groupby(key)[col].apply(lambda x: x.rolling(
                center=False, window=window, min_periods=1).min())
            colname = 'max_col{}_winsize{}'.format(col, window)
            result[colname] = df.groupby(key)[col].apply(lambda x: x.rolling(
                center=False, window=window, min_periods=1).max())
    result = pd.concat([df, result], axis=1)
    return result

def alt(df, key='athlete_id'):
    """
    Call rolling on the whole DataFrame, not each column separately
    """
    columns = list(df.columns.difference([key]))
    result = [df]
    for window in range(2, 4):
        rolled = df.groupby(key, group_keys=False).rolling(
            center=False, window=window, min_periods=1)
        new_df = rolled.sum().drop(key, axis=1)
        new_df.columns = ['sum_col{}_winsize{}'.format(col, window) for col in columns]
        result.append(new_df)
        new_df = rolled.min().drop(key, axis=1)
        new_df.columns = ['min_col{}_winsize{}'.format(col, window) for col in columns]
        result.append(new_df)
        new_df = rolled.max().drop(key, axis=1)
        new_df.columns = ['max_col{}_winsize{}'.format(col, window) for col in columns]
        result.append(new_df)
    df = pd.concat(result, axis=1)
    return df
timing = collections.defaultdict(list)
ncols = [3, 10, 20, 50, 100]
for n in ncols:
    df = make_df(ncols=n)
    timing['orig'].append(timeit.timeit(
        'orig(df)',
        'from __main__ import orig, alt, df',
        number=10))
    timing['alt'].append(timeit.timeit(
        'alt(df)',
        'from __main__ import orig, alt, df',
        number=10))
plt.plot(ncols, timing['orig'], label='using groupby/apply (orig)')
plt.plot(ncols, timing['alt'], label='using groupby/rolling (alternative)')
plt.legend(loc='best')
plt.xlabel('number of columns')
plt.ylabel('seconds')
print(pd.DataFrame(timing, index=pd.Series(ncols, name='ncols')))
plt.show()
and yields these timeit benchmarks
alt orig
ncols
3 0.871695 0.996862
10 0.991617 3.307021
20 1.168522 6.602289
50 1.676441 16.558673
100 2.521121 33.261957
The speed advantage of alt compared to orig seems to increase as the number of columns increases.
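Applied to the question's own names (adt_df, athlete_id, sum_metrics_list, constants.AGGREGATION_WINDOW, all assumed from the question), the rolling sums might look like the sketch below; min and max are analogous. Depending on your pandas version, groupby().rolling() prepends the group key to the result index, so it is dropped before assigning back:
for window in constants.AGGREGATION_WINDOW:
    sum_names = [x[6:] + "_1_" + str(window) for x in sum_metrics_list]
    rolled = adt_df.groupby('athlete_id')[sum_metrics_list].rolling(
        window=window, min_periods=1, center=False).sum()
    # drop the prepended athlete_id index level to realign with adt_df
    adt_df[sum_names] = rolled.reset_index(level=0, drop=True)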