Efficient way to accumulate in Pandas - python

I would like to calculate an exponential moving average on a Pandas time series.
I have two columns, one called 'value' and one called 'time'. Time is nondecreasing. I can also set 'time' as the index and work with the value Series.
The built-in ewma treats the observations as evenly spaced, i.e. it uses only integer "time" values.
Instead of calculating
ewma[i+1] = value[i+1] + alpha * ewma[i]
like ewma, I would like to do:
ewma[i+1] = value[i+1] + exp(alpha * (time[i] - time[i+1]) ) * ewma[i]
What is the most efficient way to do it?
numpy's accumulate is a method of numpy.ufunc (numpy.ufunc.accumulate), so it requires a ufunc.
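One possible approach, offered as a sketch rather than a definitive answer: unrolling the recurrence gives ewma[i] = sum over j <= i of value[j] * exp(alpha * (time[j] - time[i])), which can be written with a single cumulative sum. The function names and sample data below are made up for illustration, and 'time' is assumed to be numeric (for example seconds). Note that exp(alpha * t) can overflow when alpha times the time span is large, in which case the plain loop (possibly compiled) is the safer choice.

```python
import numpy as np
import pandas as pd


def decayed_sum_loop(values, times, alpha):
    """Reference implementation: the recurrence from the question, as a loop."""
    out = np.empty(len(values), dtype=float)
    out[0] = values[0]
    for i in range(1, len(values)):
        decay = np.exp(alpha * (times[i - 1] - times[i]))
        out[i] = values[i] + decay * out[i - 1]
    return out


def decayed_sum_cumsum(values, times, alpha):
    """Same recurrence, vectorized with a cumulative sum.

    Times are shifted to start at zero to delay overflow; this can still
    overflow if alpha * (times[-1] - times[0]) is large.
    """
    t = times - times[0]
    growth = np.exp(alpha * t)
    return np.cumsum(values * growth) / growth


# Hypothetical sample data with irregular, nondecreasing timestamps.
df = pd.DataFrame({
    "time": [0.0, 0.5, 2.0, 2.0, 3.5],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})
values = df["value"].to_numpy(dtype=float)
times = df["time"].to_numpy(dtype=float)
df["ewma"] = decayed_sum_cumsum(values, times, alpha=0.7)
print(df)
```

Comparing the result against decayed_sum_loop on a small sample is a cheap way to check that the vectorized version matches the recurrence.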

Related

Referencing time and (time+10 seconds) to calc normalized price return in Pandas Dataframe

I am trying to normalize price at a certain point in time with respect to price 10 seconds later using this formula: ((price(t+10seconds) - price(t)) / price(t)) / spread(t)
Both price and spread are columns in my dataframe. And I have indexed my dataframe by timestamp (pd.datetime object) because I figured that would make calculating price(t+10sec) easier.
What I've tried so far:
pos['timestamp'] = pd.to_datetime(pos['timestamp'])
pos.set_index('timestamp')
def normalize_data(pos):
    t0 = pd.to_datetime('2021-10-27 09:30:13.201')
    x = pos['mid_price']
    y = ((x[t0 + pd.Timedelta('10 sec')] - x)/x) / (spread)
    return y
pos['norm_price'] = normalize_data(pos)
This gives me an error because I'm indexing x[t0+pd.Timedelta('10sec')] but not the other x's in the equation. I also don't think I'm using pd.Timedelta or x[t0+pd.Time...] correctly, and I'm unsure how to fix all this or define a better function.
Any input would be much appreciated
Your problem is here:
pos.set_index('timestamp')
This line of code will return a new dataframe, and leave your original dataframe unchanged. So, your function normalize_data is working on the original version of pos, which does not have the index you want, and neither will x. Change your code to this:
pos = pos.set_index('timestamp')
And that should get things working.
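Beyond the set_index fix, here is one way the whole calculation might look, as a sketch only. It assumes that 'spread' is a column of pos, that the timestamps are unique and sorted, and that a row exists exactly 10 seconds after each timestamp (rows without one get NaN). The sample data is invented for illustration.

```python
import pandas as pd

# Hypothetical sample data on a regular 5-second grid.
pos = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-10-27 09:30:00",
        "2021-10-27 09:30:05",
        "2021-10-27 09:30:10",
        "2021-10-27 09:30:15",
        "2021-10-27 09:30:20",
    ]),
    "mid_price": [100.0, 101.0, 102.0, 103.0, 104.0],
    "spread": [0.5, 0.5, 0.5, 0.5, 0.5],
})

# set_index returns a new DataFrame, so the result must be assigned back.
pos = pos.set_index("timestamp").sort_index()

# Look up the price exactly 10 seconds after each row's timestamp.
# Timestamps with no matching row come back as NaN.
future_times = pos.index + pd.Timedelta("10s")
future_price = pos["mid_price"].reindex(future_times).to_numpy()

pos["norm_price"] = (
    (future_price - pos["mid_price"]) / pos["mid_price"]
) / pos["spread"]
print(pos)
```

If the timestamps are irregular, reindex also accepts a method argument (for example method="ffill") to take the last known price at or before t + 10 seconds instead of requiring an exact match.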

Welles Wilder's moving average with pandas

I'm trying to calculate Welles Wilder's type of moving average in a pandas DataFrame (also called cumulative moving average).
The method to calculate the Wilder's moving average for 'n' periods of series 'A' is:
Calculate the mean of the first 'n' values in 'A' and set as the mean for the 'n' position.
For the following values use the previous mean weighed by (n-1) and the current value of the series weighed by 1 and divide all by 'n'.
My question is: how to implement this in a vectorized way?
I tried to do it by iterating over the dataframe (which I have read is not recommended because it is slow). It works and the values are correct, but I get a warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
and it's probably not the most efficient way of doing it.
My code so far:
import pandas as pd
import numpy as np
#Building Random sample:
datas = pd.date_range('2020-01-01','2020-01-31')
np.random.seed(693)
A = np.random.randint(40,60, size=(31,1))
df = pd.DataFrame(A,index = datas, columns = ['A'])
period = 12 # Main parameter
initial_mean = A[0:period].mean() # Equation for the first value.
size = len(df.index)
df['B'] = np.full(size, np.nan)
df.B[period-1] = initial_mean
for x in range(period, size):
    df.B[x] = ((df.A[x] + (period-1)*df.B[x-1]) / period) # Equation for the following values.
print(df)
You can use the Pandas ewm() method, which behaves exactly as you described when adjust=False:
When adjust is False, weighted averages are calculated recursively as:
weighted_average[0] = arg[0];
weighted_average[i] = (1-alpha)*weighted_average[i-1] + alpha*arg[i]
If you want to do the simple average of the first period items, you can do that first and apply ewm() to the result.
You can calculate a series with the average of the first period items, followed by the other items repeated verbatim, with the formula:
# Series.append was removed in pandas 2.0; pd.concat is the equivalent.
pd.concat([
    pd.Series(
        data=[df['A'].iloc[:period].mean()],
        index=[df['A'].index[period-1]],
    ),
    df['A'].iloc[period:],
])
So in order to calculate the Wilder moving average and store it in a new column 'C', you can use:
# Series.append was removed in pandas 2.0; pd.concat is the equivalent.
df['C'] = pd.concat([
    pd.Series(
        data=[df['A'].iloc[:period].mean()],
        index=[df['A'].index[period-1]],
    ),
    df['A'].iloc[period:],
]).ewm(
    alpha=1.0 / period,
    adjust=False,
).mean()
At this point, you can calculate df['B'] - df['C'] and you'll see that the difference is almost zero (there's some rounding error with float numbers.) So this is equivalent to your calculation using a loop.
You might want to consider skipping the direct average between the first period items and simply start applying ewm() from the start, which will assume the first row is the previous average in the first calculation. The results will be slightly different but once you've gone through a couple of periods then those initial values will hardly influence the results.
That would be a much simpler calculation:
df['D'] = df['A'].ewm(
alpha=1.0 / period,
adjust=False,
).mean()
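To check the equivalence claimed above, here is a self-contained sketch that rebuilds the loop from the question on plain NumPy arrays (which also avoids the chained assignment that triggers SettingWithCopyWarning) and compares it with the seeded ewm() version. The random data and the variable names are illustrative only, not the asker's exact sample.

```python
import numpy as np
import pandas as pd

# Illustrative random sample (not the exact numbers from the question).
rng = np.random.default_rng(693)
dates = pd.date_range("2020-01-01", "2020-01-31")
df = pd.DataFrame({"A": rng.integers(40, 60, size=len(dates))}, index=dates)
period = 12

# Loop version on a NumPy array: no chained assignment on the DataFrame.
a = df["A"].to_numpy(dtype=float)
ref = np.full(len(a), np.nan)
ref[period - 1] = a[:period].mean()
for i in range(period, len(a)):
    ref[i] = (a[i] + (period - 1) * ref[i - 1]) / period
df["B"] = ref

# Vectorized version: seed with the simple mean, then apply ewm(adjust=False).
seed = pd.Series([a[:period].mean()], index=[df.index[period - 1]])
seeded = pd.concat([seed, df["A"].iloc[period:].astype(float)])
df["C"] = seeded.ewm(alpha=1.0 / period, adjust=False).mean()

# The two columns should agree up to floating-point rounding.
print((df["B"] - df["C"]).abs().max())
```

The first period - 1 rows of both columns stay NaN, because the assignment to df['C'] aligns on the index.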

Pandas: Calculate the percentage between two rows and add the value as a column

I have a dataset structured like this:
"Date","Time","Open","High","Low","Close","Volume"
This time series represents the values of a generic stock market.
I want to calculate the percentage difference between two consecutive rows of the column "Close" (in other words, I want to know how much the value of the stock increased or decreased; each row represents a day).
I've done this with a for loop (which is terrible with pandas on a big data problem). It produces the right results, but in a different DataFrame:
rows_number = df_stock.shape[0]
# The first row will be 1, because is calculated in percentage. If haven't any yesterday the value must be 1
percentage_df = percentage_df.append({'Date': df_stock.iloc[0]['Date'], 'Percentage': 1}, ignore_index=True)
# Foreach days, calculate the market trend in percentage
for index in range(1, rows_number):
    # n_yesterday : 100 = (n_today - n_yesterday) : x
    n_today = df_stock.iloc[index]['Close']
    n_yesterday = df_stock.iloc[index-1]['Close']
    difference = n_today - n_yesterday
    percentage = (100 * difference) / n_yesterday
    percentage_df = percentage_df.append({'Date': df_stock.iloc[index]['Date'], 'Percentage': percentage}, ignore_index=True)
How could I refactor this taking advantage of dataFrame api, thus removing the for loop and creating a new column in place?
df['Change'] = df['Close'].pct_change()
or, if you want to calculate the change in reverse order:
df['Change'] = df['Close'].pct_change(-1)
I would suggest first making the Date column a datetime index. For this you can use:
df_stock = df_stock.set_index(['Date'])
df_stock.index = pd.to_datetime(df_stock.index, dayfirst=True)
Then you can access any row and column through the datetime index and do whatever operation you need, for example calculating the percentage difference between two rows of the "Close" column:
percentage = ((df_stock.loc['2019-07-15', 'Close'] - df_stock.loc['2019-07-14', 'Close']) / df_stock.loc['2019-07-14', 'Close']) * 100
You can also use a for loop to do the operation for each date or row:
for Dt in df_stock.index:
Using diff
df['Close'].diff() / df['Close'].shift()
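For reference, a small sketch of the pct_change() approach on made-up data, next to the manual difference-over-previous-value formula. The column name 'Percentage' follows the question; the prices are invented. pct_change() returns a fraction, so it is multiplied by 100 here, and the first row is NaN because it has no previous day.

```python
import pandas as pd

# Hypothetical daily closing prices.
df_stock = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2019-07-12", "2019-07-13", "2019-07-14", "2019-07-15"]
    ),
    "Close": [100.0, 110.0, 99.0, 99.0],
})

# Day-over-day change in percent, added as a new column in place.
df_stock["Percentage"] = df_stock["Close"].pct_change() * 100

# The same quantity written out by hand: (today - yesterday) / yesterday.
manual = df_stock["Close"].diff() / df_stock["Close"].shift() * 100

print(df_stock)
```

If the first row should hold a fixed value instead of NaN, as in the loop from the question, chaining .fillna(...) onto the new column would do that.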

How to apply previous row result in pandas

I'm trying to understand how to go around this in python pandas. My objective is to fill column "RESULT" with the initial investment and apply the profit on top of the previous result.
So if I would use an excel spreadsheet I would do this:
Ask what's the initial_investment (in this example $350)
Compute the first row as profit/100*initial_investment + initial_investment
The 2nd row onward is computed the same way, except that "initial_investment" is replaced by the result in the row above.
my initial python code is this
import pandas as pd
df = pd.DataFrame({"DATE":[2009,2010,2011,2012,2013,2014,2015,2016],"PROFIT":[10,4,5,7,-10,5,-5,3],"RESULT":[350,350,350,350,350,350,350,350]})
print(df)
You can use the cumulative product function cumprod():
df['RESULT'] = ((df.PROFIT + 100) / 100.).cumprod() * 350
First you transform df.PROFIT into a proportion of the previous value. Then cumprod() multiplies each row by the previous rows. You can then just multiply this by whatever your initial value is.
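As a quick sanity check, the sketch below compares cumprod() with the row-by-row spreadsheet logic from the question. The variable names (initial_investment, growth, expected) are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "DATE": [2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016],
    "PROFIT": [10, 4, 5, 7, -10, 5, -5, 3],
})
initial_investment = 350

# Each year's profit as a growth factor, e.g. 10% -> 1.10.
growth = 1 + df["PROFIT"] / 100

# Running product of the growth factors, scaled by the starting amount.
df["RESULT"] = initial_investment * growth.cumprod()

# Spreadsheet-style check: apply each profit to the previous result.
expected = []
balance = initial_investment
for profit in df["PROFIT"]:
    balance = balance + profit / 100 * balance
    expected.append(balance)

print(df)
```

The loop and the cumprod() column should agree up to floating-point rounding; the loop is only there to show what the one-liner computes.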

Calculate z_score for a column grouped by another column

Suppose I have a DataFrame with columns person_id and mean_act, where every row is a numerical value for a specific person. I want to calculate the zscore for all the values at a person level. That is, I want a new column mean_act_person_zscore that is computed as the zscore of mean_act using the mean and std of mean_act for that person only (and not the whole dataset).
My first approach is something like this:
person_ids = df['person_id'].unique()
for pid in person_ids:
    person_df = df[df['person_id'] == pid]
    person_df = (person_df['mean_act'] - person_df['mean_act'].mean())/person_df['mean_act'].std()
At every iteration, it computes the right zscore output series, but the problem is that since the selection is by reference, not by value, the original df ends up without having the mean_act_person_zscore column.
Thoughts as to how to do this?
Should be straightforward:
df['mean_act_person_zscore'] = df.groupby('person_id').mean_act.transform(lambda x: (x - x.mean()) / x.std())
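A small worked example on invented data may help. It uses the string aliases "mean" and "std" with transform(), which is an alternative to the lambda above and is often faster on large frames. Note that pandas' std() uses the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Hypothetical data: three observations for each of two people.
df = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 2, 2],
    "mean_act": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

grouped = df.groupby("person_id")["mean_act"]

# transform() returns a result aligned with the original rows, so the
# per-person mean and std can be combined with the column directly.
df["mean_act_person_zscore"] = (
    df["mean_act"] - grouped.transform("mean")
) / grouped.transform("std")

print(df)
```

Both people end up with z-scores of -1, 0 and 1, even though their raw values are on different scales.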
