numpy function to aggregate a signal for time?

numpy function to aggregate a signal for time? - python

I want to compute the aggregated average of a signal over time, in a certain period. I don't know how this is called scientifically.
Example: I have an electricity consumption for a full year in 15 minute values. I want to know my average consumption by hour of the day (24 values). But it is more complex: there are more measurements in between the 15-minute steps, and I cannot foresee where they are. However, they should be taken into account, with a correct 'weight'.
I wrote a function that works, but it is extremely slow. Here is a test setup:
import numpy as np
signal = np.arange(6)
time = np.array([0, 2, 3.5, 4, 6, 8])
period = 4
interval = 2
def aggregate(signal, time, period, interval):
pass
aggregated = aggregate(signal, time, period, interval)
# This should be the result: aggregated = array([ 2. , 3.125])
aggregated should have period/interval values. This is the manual computation:
aggregated[0] = (np.trapz(y=np.array([0, 1]), x=np.array([0, 2]))/interval + \
np.trapz(y=np.array([3, 4]), x=np.array([4, 6]))/interval) / (period/interval)
aggregated[1] = (np.trapz(y=np.array([1, 2, 3]), x=np.array([2, 3.5, 4]))/interval + \
np.trapz(y=np.array([4, 5]), x=np.array([6, 8]))/interval) / (period/interval)
One last detail: it has to be efficient, thats why my own solution is not useful. Maybe I'm overlooking a numpy or scipy method? Or is this something pandas can do?
Thanks a lot for your help.

I would strongly recommend using Pandas. Here I'm using version 0.8 (soon to be released). I think this is close to what you want.
import pandas as p
import numpy as np
import matplotlib as plt
# Make up some data:
time = p.date_range(start='2011-05-23', end='2012-05-23', freq='min')
watts = np.linspace(0, 3.14 * 365, time.size)
watts = 38 * (1.5 + np.sin(watts)) + 8 * np.sin(5 * watts)
# Create a time series
ts = p.Series(watts, index=time, name='watts')
# Resample down to 15 minute pieces, using mean values
ts15 = ts.resample('15min', how='mean')
ts15.plot()
Pandas can easily do many other things with your data (like determine your average weekly energy profile). Check out p.read_csv() for reading in your data.

I think this is pretty close to what you need. I'm not sure I interpreted interval and period correctly, but I think I got it write within some constant factor.
import numpy as np
def aggregate(signal, time, period, interval):
assert (period % interval) == 0
ipp = period / interval
midpoint = np.r_[time[0], (time[1:] + time[:-1])/2., time[-1]]
cumsig = np.r_[0, (np.diff(midpoint) * signal).cumsum()]
grid = np.linspace(0, time[-1], np.floor(time[-1]/period)*ipp + 1)
cumsig = np.interp(grid, midpoint, cumsig)
return np.diff(cumsig).reshape(-1, ipp).sum(0) / period

I worked out a function that does exactly what I wanted based on the previous answers and on pandas.
def aggregate_by_time(signal, time, period=86400, interval=900, label='left'):
"""
Function to calculate the aggregated average of a timeseries by
period (typical a day) in bins of interval seconds (default = 900s).
label = 'left' or 'right'. 'Left' means that the label i contains data from
i till i+1, 'right' means that label i contains data from i-1 till i.
Returns an array with period/interval values, one for each interval
of the period.
Note: the period has to be a multiple of the interval
"""
def make_datetimeindex(array_in_seconds, year):
"""
Create a pandas DateIndex from a time vector in seconds and the year.
"""
start = pandas.datetime(year, 1, 1)
datetimes = [start + pandas.datetools.timedelta(t/86400.) for t in array_in_seconds]
return pandas.DatetimeIndex(datetimes)
interval_string = str(interval) + 'S'
dr = make_datetimeindex(time, 2012)
df = pandas.DataFrame(data=signal, index=dr, columns=['signal'])
df15min = df.resample(interval_string, closed=label, label=label)
# now create bins for the groupby() method
time_s = df15min.index.asi8/1e9
time_s -= time_s[0]
df15min['bins'] = np.mod(time_s, period)
df_aggr = df15min.groupby(['bins']).mean()
# if you only need the numpy array: take df_aggr.values
return df_aggr

Related

Calculating RSI for recent days

I'm working on improving my algo bot and one thing that I have implemented absolutely awful is RSI. Since RSI is a lagging indicator I can't get recent data, the last date I get a value for is 8 days ago. I'm therefore looking to calculate it somehow by using previous values and looking for ideas on how to do so.
My data points:
[222.19000244140625, nan]
[222.19000244140625, nan]
[215.47000122070312, nan]
[212.25, nan]
[207.97000122070312, nan]
[206.3300018310547, nan]
[205.88999938964844, nan]
[208.36000061035156, nan]
[204.08999633789062, 10.720487433358727]
[197.00999450683594, 7.934105468501102]
[194.6699981689453, 7.224811311424375]
[190.66000366210938, 6.148330770309926]
[191.6300048828125, 9.861218420857213]
[189.13999938964844, 8.835726925023536]
[189.02000427246094, 8.785409465194874]
[187.02000427246094, 7.925663008903896]
[195.69000244140625, 37.989974096922204]
[196.9199981689453, 41.10776671337689]
[194.11000061035156, 36.33757785797855]
As you can see 10.720487433358727 is my most recent value but I'm sure bigger brains than mine can figure out a way to calculate it up until today.
Thanks for your help!

It is important to note that there are various ways of defining the RSI. It is commonly defined in at least two ways: using a simple moving average (SMA) as above, or using an exponential moving average (EMA). Here's a code snippet that calculates both definitions of RSI and plots them for comparison. I'm discarding the first row after taking the difference, since it is always NaN by definition.
import pandas
import pandas_datareader.data as web
import datetime
import matplotlib.pyplot as plt
# Window length for moving average
window_length = 14
# Dates
start = '2020-12-01'
end = '2021-01-27'
# Get data
data = web.DataReader('AAPL', 'yahoo', start, end)
# Get just the adjusted close
close = data['Adj Close']
# Get the difference in price from previous step
delta = close.diff()
# Get rid of the first row, which is NaN since it did not have a previous
# row to calculate the differences
delta = delta[1:]
# Make the positive gains (up) and negative gains (down) Series
up, down = delta.copy(), delta.copy()
up[up < 0] = 0
down[down > 0] = 0
# Calculate the EWMA
roll_up1 = up.ewm(span=window_length).mean()
roll_down1 = down.abs().ewm(span=window_length).mean()
# Calculate the RSI based on EWMA
RS1 = roll_up1 / roll_down1
RSI1 = 100.0 - (100.0 / (1.0 + RS1))
# Calculate the SMA
roll_up2 = up.rolling(window_length).mean()
roll_down2 = down.abs().rolling(window_length).mean()
# Calculate the RSI based on SMA
RS2 = roll_up2 / roll_down2
RSI2 = 100.0 - (100.0 / (1.0 + RS2))
# Compare graphically
plt.figure(figsize=(8, 6))
RSI1.plot()
RSI2.plot()
plt.legend(['RSI via EWMA', 'RSI via SMA'])
plt.show()

Generating data associated with a trend

I want to create 3 different datasets with a column each having dates (dd/mm/yyyy). These dates need to be in a range of 3 months like January 2019 to April 2019. The count for each date needs to represent the number of searches. The dataset should have 2000 entries and dates can be repititive as well. All 3 datasets are to be created such that one has a upward trend to the count, one has a downward trend to the count, and one is normally distributed.
Upward trend with the time, i.e. increasing entries with time ( lower count in beginning and increasing moving forward.)
Declining trend with time i.e. decreasing entries with time (higher count in the beginning and decreasing moving forward)
I am able to generate a normal distribution using datagenerator plugin of
www.generatedata.com
I am now interested in the other 2 use cases i.e. upward trend and declining trend. Can anyone advise me how to do the same. For random distribution, I was able to achieve using the faker library as well.
from faker import Factory
import random
import numpy as np
faker = Factory.create()
def date_between(d1, d2):
f = '%b%d-%Y'
return faker.date_time_between_dates(datetime.strptime(d1, f), datetime.strptime(d2, f))
def fakerecord():
return {'ID': faker.numerify('######'),
'S_date': date_between('jan01-2019', 'apr01-2019')
}
Can anyone advise how can I incorporate trends to the dataset.
Thanks

you can do it like below.
trend function defines your trend if start is higher than end it is downward trend and vice versa. you can also control the rate of trend by changing difference between start and end
import numpy as np
import pandas as pd
dates = pd.date_range("2019-1-1", "2019-4-1", freq="D")
def trend(count, start_weight=1, end_weight=3):
lin_sp = np.linspace(start_weight, end_weight, count)
return lin_sp/sum(lin_sp)
date_trends = np.random.choice(dates,size=20000, p=trend(len(dates)))
print("Total dates", len(date_trends))
print("counts of each dates")
print(np.unique(date_trends, return_counts=True)[1])

I edited my first answer to make it more clear.
With the function below you can set the relative probabilities of generating a search on the start and end dates of your choice.
Ex. if starting_prob = 0.1 and ending_prob = 1.0, then the probability of seeing a
search on the start date is 1/10 of the probability of seeing a search on the
end date
If starting_prob = 1.0 and ending_prob = 0.1, then the probability of seeing a
search on the end date is 1/10 of the probability of seeing a search on the
start date
import datetime
import numpy as np
def random_dates(start, end, starting_prob = 0.1, ending_prob = 1.0, num_samples = 2000):
"""
Generate increasing or decreasing counts of datetimes between `start` and `end`
Parameters:
start: string in format'%b%d-%Y' (i.e. 'Sep19-2019')
end : string in format'%b%d-%Y'. must be after start
starting_prob: (float) relative probability of seeing a search on the first day
ending_prob: (float) relative probability of seeing a search on the last day
num_samples: number of dates in the list
"""
start_date = datetime.datetime.strptime(start, '%b%d-%Y')
end_date = datetime.datetime.strptime(end, '%b%d-%Y')
# Get days between `start` and `end`
num_days = (end_date - start_date).days
linear_probabilities = np.linspace(starting_prob, ending_prob, num_days)
# normalize probabilities so they add up to 1
linear_probabilities /= np.sum(linear_probabilities)
rand_days = np.random.choice(num_days, size = num_samples, replace = True,
p = linear_probabilities)
rand_date = [(start_date + datetime.timedelta(int(rand_days[ii]))).strftime('%b%d-%Y')
for ii in range(num_samples)]
# return list of date strings
return rand_date
You could use the function to generate different sets of dates (each with 20000 samples):
rdates_decreasing = random_dates("Jan01-2019", "Apr30-2019",
starting_prob = 1.0, ending_prob = 0.1,
num_samples = 20000)
rdates_increasing = random_dates("Jan01-2019", "Apr30-2019",
starting_prob = 0.1, ending_prob = 1.0,
num_samples = 20000)
rdates_random = random_dates("Jan01-2019", "Apr30-2019",
starting_prob = 1.0, ending_prob = 1.0,
num_samples = 20000)
You can use pandas to save a csv file. Each column will have a list of dates.
import pandas as pd
pd.DataFrame({'dates_decreasing': rdates_decreasing,
'dates_increasing': rdates_increasing,
'dates_random': rdates_random,
}).to_csv("path\to\datefile.csv", index = False)
You could convert your dates to counts in a data frame like this:
from collections import Counter
import matplotlib.pyplot as plt
# create dataframe with counts
df1 = pd.DataFrame({"dates_decreasing": list(Counter(rdates_decreasing).keys()),
"counts_decreasing": list(Counter(rdates_decreasing).values()),
"dates_increasing": list(Counter(rdates_increasing).keys()),
"counts_increasing": list(Counter(rdates_increasing).values()),
"dates_random": list(Counter(rdates_random).keys()),
"counts_random": list(Counter(rdates_random).values()),
})
# convert to datetime
df1['dates_decreasing']= pd.to_datetime(df1['dates_decreasing'])
df1['dates_increasing']= pd.to_datetime(df1['dates_increasing'])
df1['dates_random']= pd.to_datetime(df1['dates_random'])
# plot
fig, ax = plt.subplots()
ax.plot(df1.dates_decreasing, df1.counts_decreasing, "o", label = "decreasing")
ax.plot(df1.dates_increasing, df1.counts_increasing, "o", label = "increasing")
ax.plot(df1.dates_random, df1.counts_random, "o", label = "random")
ax.set_ylabel("count")
ax.legend()
fig.autofmt_xdate()
plt.show()

How to create spacing of points per decade for logarithmic plot

Intro
I have some range of frequencies that goes from freq_start_hz = X to freq_stop_hz = Y.
I am trying to logarithmically (base 10) space out samples between the range [freq_start_hz, freq_stop_hz], based on a number of samples per decade (num_samp_per_decade), inclusive of the endpoint.
I noticed numpy has a method logspace (link) which enables you to create logarithmic divisions of some range base ** start to base ** stop based on a total number of samples, num.
Can you help me create Python code that will create even logarithmic spacing per decade?
Example
freq_start_hz = 10, freq_stop_hz = 100, num_samp_per_decade = 5
This is easy, since it's just one decade. So one could create it using the following:
import numpy as np
from math import log10
freq_start_hz = 10
freq_stop_hz = 100
num_samp_per_decade = 5
freq_list = np.logspace(
start=log10(freq_start_hz),
stop=log10(freq_stop_hz),
num=num_samp_per_decade,
endpoint=False,
base=10,
)
freq_list = np.append(freq_list, freq_stop_hz) # Appending end
print(freq_list.tolist())
Output is [10.0, 17.78279410038923, 31.622776601683793, 56.23413251903491, 100.0]
Note: this worked nicely because I designed it this way. If freq_start_hz = 8, this method no longer works since it now spans multiple decades.
Conclusion
I am hoping somewhere out there, there's a premade method in math, numpy, another scipy library, or some other library that my internet searching hasn't turned up.

Calculate the number of points based on the number of decades in the range.
from math import log10
import numpy as np
start = 10
end = 1500
samples_per_decade = 5
ndecades = log10(end) - log10(start)
npoints = int(ndecades) * samples_per_decade
#a = np.linspace(log10(start), log10(end), num = npoints)
#points = np.power(10, a)
points = np.logspace(log10(start), log10(end), num=npoints, endpoint=True, base=10)
print(points)

Python pandas most pythonic way to calculate rolling betas on a DataFrame [duplicate]

I have many (4000+) CSVs of stock data (Date, Open, High, Low, Close) which I import into individual Pandas dataframes to perform analysis. I am new to python and want to calculate a rolling 12month beta for each stock, I found a post to calculate rolling beta (Python pandas calculate rolling stock beta using rolling apply to groupby object in vectorized fashion) however when used in my code below takes over 2.5 hours! Considering I can run the exact same calculations in SQL tables in under 3 minutes this is too slow.
How can I improve the performance of my below code to match that of SQL? I understand Pandas/python has that capability. My current method loops over each row which I know slows performance but I am unaware of any aggregate way to perform a rolling window beta calculation on a dataframe.
Note: the first 2 steps of loading the CSVs into individual dataframes and calculating daily returns only takes ~20seconds. All my CSV dataframes are stored in the dictionary called 'FilesLoaded' with names such as 'XAO'.
Your help would be much appreciated!
Thank you :)
import pandas as pd, numpy as np
import datetime
import ntpath
pd.set_option('precision',10) #Set the Decimal Point precision to DISPLAY
start_time=datetime.datetime.now()
MarketIndex = 'XAO'
period = 250
MinBetaPeriod = period
# ***********************************************************************************************
# CALC RETURNS
# ***********************************************************************************************
for File in FilesLoaded:
FilesLoaded[File]['Return'] = FilesLoaded[File]['Close'].pct_change()
# ***********************************************************************************************
# CALC BETA
# ***********************************************************************************************
def calc_beta(df):
np_array = df.values
m = np_array[:,0] # market returns are column zero from numpy array
s = np_array[:,1] # stock returns are column one from numpy array
covariance = np.cov(s,m) # Calculate covariance between stock and market
beta = covariance[0,1]/covariance[1,1]
return beta
#Build Custom "Rolling_Apply" function
def rolling_apply(df, period, func, min_periods=None):
if min_periods is None:
min_periods = period
result = pd.Series(np.nan, index=df.index)
for i in range(1, len(df)+1):
sub_df = df.iloc[max(i-period, 0):i,:]
if len(sub_df) >= min_periods:
idx = sub_df.index[-1]
result[idx] = func(sub_df)
return result
#Create empty BETA dataframe with same index as RETURNS dataframe
df_join = pd.DataFrame(index=FilesLoaded[MarketIndex].index)
df_join['market'] = FilesLoaded[MarketIndex]['Return']
df_join['stock'] = np.nan
for File in FilesLoaded:
df_join['stock'].update(FilesLoaded[File]['Return'])
df_join = df_join.replace(np.inf, np.nan) #get rid of infinite values "inf" (SQL won't take "Inf")
df_join = df_join.replace(-np.inf, np.nan)#get rid of infinite values "inf" (SQL won't take "Inf")
df_join = df_join.fillna(0) #get rid of the NaNs in the return data
FilesLoaded[File]['Beta'] = rolling_apply(df_join[['market','stock']], period, calc_beta, min_periods = MinBetaPeriod)
# ***********************************************************************************************
# CLEAN-UP
# ***********************************************************************************************
print('Run-time: {0}'.format(datetime.datetime.now() - start_time))

Generate Random Stock Data
20 Years of Monthly Data for 4,000 Stocks
dates = pd.date_range('1995-12-31', periods=480, freq='M', name='Date')
stoks = pd.Index(['s{:04d}'.format(i) for i in range(4000)])
df = pd.DataFrame(np.random.rand(480, 4000), dates, stoks)
df.iloc[:5, :5]
Roll Function
Returns groupby object ready to apply custom functions
See Source
def roll(df, w):
# stack df.values w-times shifted once at each stack
roll_array = np.dstack([df.values[i:i+w, :] for i in range(len(df.index) - w + 1)]).T
# roll_array is now a 3-D array and can be read into
# a pandas panel object
panel = pd.Panel(roll_array,
items=df.index[w-1:],
major_axis=df.columns,
minor_axis=pd.Index(range(w), name='roll'))
# convert to dataframe and pivot + groupby
# is now ready for any action normally performed
# on a groupby object
return panel.to_frame().unstack().T.groupby(level=0)
Beta Function
Use closed form solution of OLS regression
Assume column 0 is market
See Source
def beta(df):
# first column is the market
X = df.values[:, [0]]
# prepend a column of ones for the intercept
X = np.concatenate([np.ones_like(X), X], axis=1)
# matrix algebra
b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(df.values[:, 1:])
return pd.Series(b[1], df.columns[1:], name='Beta')
Demonstration
rdf = roll(df, 12)
betas = rdf.apply(beta)
Timing
Validation
Compare calculations with OP
def calc_beta(df):
np_array = df.values
m = np_array[:,0] # market returns are column zero from numpy array
s = np_array[:,1] # stock returns are column one from numpy array
covariance = np.cov(s,m) # Calculate covariance between stock and market
beta = covariance[0,1]/covariance[1,1]
return beta
print(calc_beta(df.iloc[:12, :2]))
-0.311757542437
print(beta(df.iloc[:12, :2]))
s0001 -0.311758
Name: Beta, dtype: float64
Note the first cell
Is the same value as validated calculations above
betas = rdf.apply(beta)
betas.iloc[:5, :5]
Response to comment
Full working example with simulated multiple dataframes
num_sec_dfs = 4000
cols = ['Open', 'High', 'Low', 'Close']
dfs = {'s{:04d}'.format(i): pd.DataFrame(np.random.rand(480, 4), dates, cols) for i in range(num_sec_dfs)}
market = pd.Series(np.random.rand(480), dates, name='Market')
df = pd.concat([market] + [dfs[k].Close.rename(k) for k in dfs.keys()], axis=1).sort_index(1)
betas = roll(df.pct_change().dropna(), 12).apply(beta)
for c, col in betas.iteritems():
dfs[c]['Beta'] = col
dfs['s0001'].head(20)

Using a generator to improve memory efficiency
Simulated data
m, n = 480, 10000
dates = pd.date_range('1995-12-31', periods=m, freq='M', name='Date')
stocks = pd.Index(['s{:04d}'.format(i) for i in range(n)])
df = pd.DataFrame(np.random.rand(m, n), dates, stocks)
market = pd.Series(np.random.rand(m), dates, name='Market')
df = pd.concat([df, market], axis=1)
Beta Calculation
def beta(df, market=None):
# If the market values are not passed,
# I'll assume they are located in a column
# named 'Market'. If not, this will fail.
if market is None:
market = df['Market']
df = df.drop('Market', axis=1)
X = market.values.reshape(-1, 1)
X = np.concatenate([np.ones_like(X), X], axis=1)
b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(df.values)
return pd.Series(b[1], df.columns, name=df.index[-1])
roll function
This returns a generator and will be far more memory efficient
def roll(df, w):
for i in range(df.shape[0] - w + 1):
yield pd.DataFrame(df.values[i:i+w, :], df.index[i:i+w], df.columns)
Putting it all together
betas = pd.concat([beta(sdf) for sdf in roll(df.pct_change().dropna(), 12)], axis=1).T
Validation
OP beta calc
def calc_beta(df):
np_array = df.values
m = np_array[:,0] # market returns are column zero from numpy array
s = np_array[:,1] # stock returns are column one from numpy array
covariance = np.cov(s,m) # Calculate covariance between stock and market
beta = covariance[0,1]/covariance[1,1]
return beta
Experiment setup
m, n = 12, 2
dates = pd.date_range('1995-12-31', periods=m, freq='M', name='Date')
cols = ['Open', 'High', 'Low', 'Close']
dfs = {'s{:04d}'.format(i): pd.DataFrame(np.random.rand(m, 4), dates, cols) for i in range(n)}
market = pd.Series(np.random.rand(m), dates, name='Market')
df = pd.concat([market] + [dfs[k].Close.rename(k) for k in dfs.keys()], axis=1).sort_index(1)
betas = pd.concat([beta(sdf) for sdf in roll(df.pct_change().dropna(), 12)], axis=1).T
for c, col in betas.iteritems():
dfs[c]['Beta'] = col
dfs['s0000'].head(20)
calc_beta(df[['Market', 's0000']])
0.0020118230147777435
NOTE:
The calculations are the same

While efficient subdivision of the input data set into rolling windows is important to the optimization of the overall calculations, the performance of the beta calculation itself can also be significantly improved.
The following optimizes only the subdivision of the data set into rolling windows:
def numpy_betas(x_name, window, returns_data, intercept=True):
if intercept:
ones = numpy.ones(window)
def lstsq_beta(window_data):
x_data = numpy.vstack([window_data[x_name], ones]).T if intercept else window_data[[x_name]]
beta_arr, residuals, rank, s = numpy.linalg.lstsq(x_data, window_data)
return beta_arr[0]
indices = [int(x) for x in numpy.arange(0, returns_data.shape[0] - window + 1, 1)]
return DataFrame(
data=[lstsq_beta(returns_data.iloc[i:(i + window)]) for i in indices]
, columns=list(returns_data.columns)
, index=returns_data.index[window - 1::1]
)
The following also optimizes the beta calculation itself:
def custom_betas(x_name, window, returns_data):
window_inv = 1.0 / window
x_sum = returns_data[x_name].rolling(window, min_periods=window).sum()
y_sum = returns_data.rolling(window, min_periods=window).sum()
xy_sum = returns_data.mul(returns_data[x_name], axis=0).rolling(window, min_periods=window).sum()
xx_sum = numpy.square(returns_data[x_name]).rolling(window, min_periods=window).sum()
xy_cov = xy_sum - window_inv * y_sum.mul(x_sum, axis=0)
x_var = xx_sum - window_inv * numpy.square(x_sum)
betas = xy_cov.divide(x_var, axis=0)[window - 1:]
betas.columns.name = None
return betas
Comparing the performance of the two different calculations, you can see that as the window used in the beta calculation increases, the second method dramatically outperforms the first:
Comparing the performance to that of #piRSquared's implementation, the custom method takes roughly 350 millis to evaluate compared to over 2 seconds.

Further optimizing on #piRSquared's implementation for both speed and memory. the code is also simplified for clarity.
from numpy import nan, ndarray, ones_like, vstack, random
from numpy.lib.stride_tricks import as_strided
from numpy.linalg import pinv
from pandas import DataFrame, date_range
def calc_beta(s: ndarray, m: ndarray):
x = vstack((ones_like(m), m))
b = pinv(x.dot(x.T)).dot(x).dot(s)
return b[1]
def rolling_calc_beta(s_df: DataFrame, m_df: DataFrame, period: int):
result = ndarray(shape=s_df.shape, dtype=float)
l, w = s_df.shape
ls, ws = s_df.values.strides
result[0:period - 1, :] = nan
s_arr = as_strided(s_df.values, shape=(l - period + 1, period, w), strides=(ls, ls, ws))
m_arr = as_strided(m_df.values, shape=(l - period + 1, period), strides=(ls, ls))
for row in range(period, l):
result[row, :] = calc_beta(s_arr[row - period, :], m_arr[row - period])
return DataFrame(data=result, index=s_df.index, columns=s_df.columns)
if __name__ == '__main__':
num_sec_dfs, num_periods = 4000, 480
dates = date_range('1995-12-31', periods=num_periods, freq='M', name='Date')
stocks = DataFrame(data=random.rand(num_periods, num_sec_dfs), index=dates,
columns=['s{:04d}'.format(i) for i in
range(num_sec_dfs)]).pct_change()
market = DataFrame(data=random.rand(num_periods), index=dates, columns=
['Market']).pct_change()
betas = rolling_calc_beta(stocks, market, 12)
%timeit betas = rolling_calc_beta(stocks, market, 12)
335 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

HERE'S THE SIMPLEST AND FASTEST SOLUTION
The accepted answer was too slow for what I needed and the I didn't understand the math behind the solutions asserted as faster. They also gave different answers, though in fairness I probably just messed it up.
I don't think you need to make a custom rolling function to calculate beta with pandas 1.1.4 (or even since at least .19). The below code assumes the data is in the same format as the above problems--a pandas dataframe with a date index, percent returns of some periodicity for the stocks, and market values are located in a column named 'Market'.
If you don't have this format, I recommend joining the stock returns to the market returns to ensure the same index with:
# Use .pct_change() only if joining Close data
beta_data = stock_data.join(market_data), how = 'inner').pct_change().dropna()
After that, it's just covariance divided by variance.
ticker_covariance = beta_data.rolling(window).cov()
# Limit results to the stock (i.e. column name for the stock) vs. 'Market' covariance
ticker_covariance = ticker_covariance.loc[pd.IndexSlice[:, stock], 'Market'].dropna()
benchmark_variance = beta_data['Market'].rolling(window).var().dropna()
beta = ticker_covariance / benchmark_variance
NOTES: If you have a multi-index, you'll have to drop the non-date levels to use the rolling().apply() solution. I only tested this for one stock and one market. If you have multiple stocks, a modification to the ticker_covariance equation after .loc is probably needed. Last, if you want to calculate beta values for the periods before the full window (ex. stock_data begins 1 year ago, but you use 3yrs of data), then you can modify the above to and expanding (instead of rolling) window with the same calculation and then .combine_first() the two.

Created a simple python package finance-calculator based on numpy and pandas to calculate financial ratios including beta. I am using the simple formula (as per investopedia):
beta = covariance(returns, benchmark returns) / variance(benchmark returns)
Covariance and variance are directly calculated in pandas which makes it fast. Using the api in the package is also simple:
import finance_calculator as fc
beta = fc.get_beta(scheme_data, benchmark_data, tail=False)
which will give you a dataframe of date and beta or the last beta value if tail is true.

but these would be blockish when you require beta calculations across the dates(m) for multiple stocks(n) resulting (m x n) number of calculations.
Some relief could be taken by running each date or stock on multiple cores, but then you will end up having huge hardware.
The major time requirement for the solutions available is finding the variance and co-variance and also NaN should be avoided in (Index and stock) data for a correct calculation as per pandas==0.23.0.
Thus running again would result stupid move unless the calculations are cached.
numpy variance and co-variance version also happens to miss-calculate the beta if NaN are not dropped.
A Cython implementation is must for huge set of data.

How to split dataframe according to intersection point in Python?

I am working on a project which is aiming to show difference between good form and bad form of an exercise. To do this we collected the acceleration data with wrist based accelerometer. The image above shows 2 set of a fitness execise (bench press). Each set has 10 repetitions. And the image below shows 10 repetitions of 1 set.I have a raw data set which consist of 10 set of an execises. What I want to do is splitting the raw data to 10 parts which will contain the part between 2 black line in the image above so I can analyze the data easily. My supervisor gave me a starting point which is choosing cutpoint in the each set. He said take a cutpoint, find the first interruption time start cutting at 3 sec before that time and count to 10 and finish cutting.
This an idea that I don't know how to apply. At least, if you can tell how to cut a dataframe according to cutpoint I would be greatful.

Well, I found another way to detect periodic parts of my accelerometer data. So, Here is my code:
import numpy as np
from peakdetect import peakdetect
import datetime as dt
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from matplotlib import style
from pandas import DataFrame as df
style.use('ggplot')
def get_periodic(path):
periodics = []
data_frame = df.from_csv(path)
data_frame.columns = ['z', 'y', 'x']
if path.__contains__('1'):
if path.__contains__('bench'):
bench_press_1_week = data_frame.between_time('11:24', '11:52')
peak_indexes = get_peaks(bench_press_1_week.y, lookahead=3000)
for i in range(0, len(peak_indexes)):
time_indexes = bench_press_1_week.index.tolist()
start_time = time_indexes[0]
periodic_start = start_time.to_datetime() + dt.timedelta(0, peak_indexes[i] / 100)
periodic_end = periodic_start + dt.timedelta(0, 60)
periodic = bench_press_1_week.between_time(periodic_start.time(), periodic_end.time())
periodics.append(periodic)
return periodics
def get_peaks(data, lookahead):
peak_indexes = []
correlation = np.correlate(data, data, mode='full')
realcorr = correlation[correlation.size / 2:]
maxpeaks, minpeaks = peakdetect(realcorr, lookahead=lookahead)
for i in range(0, len(maxpeaks)):
peak_indexes.append(maxpeaks[i][0])
return peak_indexes
def show_segment_plot(data, periodic_area, exercise_name):
plt.figure(8)
gs = gridspec.GridSpec(7, 2)
ax = plt.subplot(gs[:2, :])
plt.title(exercise_name)
ax.plot(data)
k = 0
for i in range(2, 7):
for j in range(0, 2):
ax = plt.subplot(gs[i, j])
title = "{} {}".format(k + 1, ".Set")
plt.title(title)
ax.plot(periodic_area[k])
k = k + 1
plt.show()
Firstly, this question gave me another perspective for my problem. The image below shows the raw accelerometer data of bench press with 10 sets. Here it has 3 axis(x,y,z) and it's major axis is y(Blue on the image).
I used autocorrelation function for detecting the periodic parts, In the image above every peak represents 1 set of execises. With this peak detection algorithm I found each peak's x-axis value,
In[196]: maxpeaks
Out[196]:
[[16204, 32910.14013671875],
[32281, 28726.95849609375],
[48515, 24583.898681640625],
[64436, 22088.130859375],
[80335, 19582.248291015625],
[96699, 16436.567626953125],
[113081, 12100.027587890625],
[129027, 8098.98486328125],
[145184, 5387.788818359375]]
Basically, each x-value represent samples. My sampling frequency was 100Hz so 16204/100 = 162,04 seconds. To find the time of periodic part I added 162,04 sec to started time. Each bench press took aproximatelly 1 min and in this example, exercise's starting time was 11:24, for first periodic part's start time is 11:26 and ending time is 1 min after. There is some lag but yes best solution that I found is this.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

numpy function to aggregate a signal for time? - python

Related

Calculating RSI for recent days

Generating data associated with a trend

How to create spacing of points per decade for logarithmic plot

Python pandas most pythonic way to calculate rolling betas on a DataFrame [duplicate]

How to split dataframe according to intersection point in Python?

Categories

Resources