I am currently trying to compute half-life results for multiple columns of data. I have tried to incorporate the code I got from 'pythonforfinance.com' Link.
However, I seem to have missed a few edits, which is resulting in errors being thrown.
This is what my df looks like: Link
and the code I am running:
import pandas as pd
import numpy as np
import statsmodels.api as sm
df1 = pd.read_excel(r'C:\Users\Sai\Desktop\Test\Spreads.xlsx')
Halflife_results = {}
for col in df1.columns.values:
    spread_lag = df1.shift(periods=1, axis=1)
    spread_lag.ix([0]) = spread_lag.ix([1])
    spread_ret = df1.columns - spread_lag
    spread_ret.ix([0]) = spread_ret.ix([1])
    spread_lag2 = sm.add_constant(spread_lag)
    md = sm.OLS(spread_ret, spread_lag2)
    mdf = md.fit()
    half_life = round(-np.log(2) / mdf.params[1], 0)
    print('half life:', half_life)
The error that is being thrown is:
File "C:/Users/Sai/Desktop/Test/Half life test 2.py", line 12
spread_lag.ix([0]) = spread_lag.ix([1])
^
SyntaxError: can't assign to function call
Based on the error message, I seem to have made a very basic mistake, but since I am a beginner I am not able to fix the issue. If not a solution to this code, an explanation of these lines of the code would be of great help:
spread_lag = df1.shift(periods=1, axis=1)
spread_lag.ix([0]) = spread_lag.ix([1])
spread_ret = df1.columns - spread_lag
spread_ret.ix([0]) = spread_ret.ix([1])
spread_lag2 = sm.add_constant(spread_lag)
As explained by the error message, pd.Series.ix isn't callable: you should change spread_lag.ix([0]) to spread_lag.ix[0] (or, in modern pandas where .ix has been removed, spread_lag.iloc[0]).
Also, you shouldn't shift along axis=1 (across columns), since you're interested in differences down each column (axis=0, the default).
Defining a get_halflife function then lets you apply it directly to each column, removing the need for a loop.
def get_halflife(s):
    s_lag = s.shift(1)
    s_lag.iloc[0] = s_lag.iloc[1]
    s_ret = s - s_lag
    s_ret.iloc[0] = s_ret.iloc[1]
    s_lag2 = sm.add_constant(s_lag)
    model = sm.OLS(s_ret, s_lag2)
    res = model.fit()
    halflife = round(-np.log(2) / res.params.iloc[1], 0)
    return halflife
df1.apply(get_halflife)
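To unpack the lines the question asks about: s.shift(1) builds the one-period-lagged series (its first entry is NaN, which the next line backfills with the second value), s - s_lag is the one-period change, and the OLS fit of that change against the lagged level estimates a mean-reversion coefficient b, from which the half-life is -ln(2)/b. A quick sanity check on synthetic data (a hypothetical example, not from the original post):

import numpy as np
import pandas as pd
import statsmodels.api as sm

# AR(1) series x_t = 0.9 * x_{t-1} + noise: the regression slope b should come
# out near 0.9 - 1 = -0.1, giving a half-life of roughly -ln(2)/(-0.1) = 6.9,
# which get_halflife's rounding turns into 7.0.
rng = np.random.default_rng(0)
x = np.zeros(1000)
for t in range(1, 1000):
    x[t] = 0.9 * x[t - 1] + rng.normal()

print(get_halflife(pd.Series(x)))  # expect roughly 7.0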
I'm using auto_arima via pmdarima to fit multiple time series via a groupby. That is to say, I have a pd.DataFrame of stacked time-indexed data, grouped by the column variable, and have successfully applied transform(pm.auto_arima) to each group. The reproducible example finds boring best ARIMA models, but the idea seems to work. I now want to apply .predict() similarly, but cannot get it to play nice with apply / lambda(x) / their combinations.
The code below works until the # Forecasting - help! section. I'm having trouble catching the correct object (apparently) in the apply. How might I adapt one of test1, test2, or test3 to get what I want? Or, is there some other best-practice construct to consider? Is it better across columns (without a melt)? Or via a loop?
Ultimately, I hope that test1, say, is a stacked pd.DataFrame (or pd.Series at least) with 8 rows: 4 forecasted values for each of the 2 time series in this example, with an identifier column variable (possibly tacked on after the fact).
import pandas as pd
import pmdarima as pm
import itertools
# Get data - this is OK.
url = 'https://raw.githubusercontent.com/nickdcox/learn-airline-delays/main/delays_2018.csv'
keep = ['arr_flights', 'arr_cancelled']
# Setup data - this is OK.
df = pd.read_csv(url, index_col=0)
df.index = pd.to_datetime(df.index, format = "%Y-%m")
df = df[keep]
df = df.sort_index()
df = df.loc['2018']
df = df.groupby(df.index).sum()
df.reset_index(inplace = True)
df = df.melt(id_vars = 'date', value_vars = df.columns.to_list()[1:])
# Fit auto.arima for each time series - this is OK.
fit = df.groupby('variable')['value'].transform(pm.auto_arima).drop_duplicates()
fit = fit.to_frame(name = 'model')
fit['variable'] = keep
fit.reset_index(drop = True, inplace = True)
# Setup forecasts - this is OK.
max_date = df.date.max()
dr = pd.to_datetime(pd.date_range(max_date, periods = 4 + 1, freq = 'MS').tolist()[1:])
yhat = pd.DataFrame(list(itertools.product(keep, dr)), columns = ['variable', 'date'])
yhat.set_index('date', inplace = True)
# Forecasting - help! - Can't get any of these to work.
def predict_fn(obj):
    return obj.loc[0].predict(4)
predict_fn(fit.loc[fit['variable'] == 'arr_flights']['model']) # Appears to work!
test1 = fit.groupby('variable')['model'].apply(lambda x: x.predict(n_periods = 4)) # Try 1: 'Series' object has no attribute 'predict'.
test2 = fit.groupby('variable')['model'].apply(lambda x: x.loc[0].predict(n_periods = 4)) # Try 2: KeyError
test3 = fit.groupby('variable')['model'].apply(predict_fn) # Try 3: KeyError
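Not a posted answer, but one plausible fix for the failed tries (a sketch, untested against this data): inside the groupby, each group is a one-row pandas Series holding a fitted model. test1 fails because the Series itself has no .predict, and test2/test3 raise KeyError because only the first group still carries the label 0 after reset_index, so .loc[0] misses the others. Selecting the model positionally avoids both problems:

# Pull the single fitted model out of each one-row group with .iloc[0], then
# wrap the 4-step forecast in a Series indexed by the forecast dates in dr.
test1 = fit.groupby('variable')['model'].apply(
    lambda s: pd.Series(s.iloc[0].predict(n_periods=4), index=dr))
# test1 should then be a stacked Series with a (variable, date) MultiIndex
# and 8 rows, matching the shape described above.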
I can't figure out why I am getting this error. If you can figure it out, or provide specific instructions, I'd appreciate it. This code is in one module; there are 7 modules in total.
Python 3.7, macOS, code from www.finrl.org
# Perform Feature Engineering:
df = FeatureEngineer(df.copy(),
                     use_technical_indicator=True,
                     use_turbulence=False).preprocess_data()

# add covariance matrix as states
df = df.sort_values(['date','tic'], ignore_index=True)
df.index = df.date.factorize()[0]

cov_list = []
# look back is one year
lookback = 252
for i in range(lookback, len(df.index.unique())):
    data_lookback = df.loc[i-lookback:i, :]
    price_lookback = data_lookback.pivot_table(index='date', columns='tic', values='close')
    return_lookback = price_lookback.pct_change().dropna()
    covs = return_lookback.cov().values
    cov_list.append(covs)

df_cov = pd.DataFrame({'date': df.date.unique()[lookback:], 'cov_list': cov_list})
df = df.merge(df_cov, on='date')
df = df.sort_values(['date','tic']).reset_index(drop=True)
df.head()
The function definition statement for FeatureEngineer.__init__ is:
def __init__(
    self,
    use_technical_indicator=True,
    tech_indicator_list=config.TECHNICAL_INDICATORS_LIST,
    use_turbulence=False,
    user_defined_feature=False,
):
As you can see, there is no argument (other than self, which you should not provide) before use_technical_indicator, so you should remove the df.copy() from before use_technical_indicator in your line 2.
Checking the current FeatureEngineer class, you must instead pass df.copy() to the preprocess_data() method.
So your code should look like:
# Perform Feature Engineering:
df = FeatureEngineer(use_technical_indicator=True,
                     tech_indicator_list=config.TECHNICAL_INDICATORS_LIST,
                     use_turbulence=True,
                     user_defined_feature=False).preprocess_data(df.copy())
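If the installed FinRL version differs from the docs, checking the signatures directly avoids guessing (a sketch; the import path below is an assumption matching older FinRL releases and may differ in yours):

import inspect
# hypothetical import path - adjust to wherever FeatureEngineer lives in your FinRL version
from finrl.preprocessing.preprocessors import FeatureEngineer

print(inspect.signature(FeatureEngineer.__init__))
print(inspect.signature(FeatureEngineer.preprocess_data))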
Like others before me (e.g., questions like this), I am attempting to use statsmodels OLS within a pandas groupby. However, in trying to send the results' residuals to a column in the extant dataframe, I run up against either indexing ValueErrors (if I use apply) or else KeyErrors (if I use transform).
My current code is:
import pandas as pd
import statsmodels.api as sm

def regression_residuals(df, **kwargs):
    X = df[kwargs['x_column']]
    y = df[kwargs['y_column']]
    regr_ols = sm.OLS(y, X).fit()
    resid = regr_ols.resid.reset_index(drop=True)
    return resid
df['residuals'] = df.groupby(['year_and_month']).apply(
    regression_residuals, x_column='x_var', y_column='y_var')
As is, the code yields "ValueError: Wrong number of items passed 4, placement implies 1", while changing apply to transform results in "KeyError: ('x_var', 'occurred at index item_label')". From debug output it appears that the residuals themselves are computed correctly, but pandas is having a hard time placing the residual series back into the frame with the correct indexing. However, it's not apparent what would do that correctly.
If I try to use the for-loop iteration through the DataFrameGroupBy's as in the question I had cited, the original frame remains unmodified. As a result, things like
grps = df.groupby(['year_and_month'])
for year_month, grp in grps:
    grp['residuals'] = regression_residuals(grp, x_column='x_var', y_column='y_var')
are of no use here, as they do nothing to the original df.
What should I more properly be doing?
Thanks all for any help.
EDIT:
Hi all, I'm apparently unable to post an answer to my own question, but I think I've found the solution. Using:
def regression_residuals(df, **kwargs):
    X = df[kwargs.pop('x_column')].values
    y = df[kwargs.pop('y_column')].values
    X = sm.add_constant(X, prepend=False)
    regr_ols = sm.OLS(y, X).fit()
    # re-attach the group's original index so the residuals align with df
    resid = pd.Series(regr_ols.resid, index=df.index)
    return resid
seems to solve the problem.
I am able to answer my own question. It is:

def regression_residuals(df, **kwargs):
    X = df[kwargs.pop('x_column')]
    y = df[kwargs.pop('y_column')]
    X = sm.add_constant(X, prepend=False)
    regr_ols = sm.OLS(y, X).fit()
    # regr_ols.resid is already a Series carrying the group's index,
    # so it aligns with the original frame when assigned back
    resid = regr_ols.resid
    return resid
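For completeness, one way to attach the result back onto the original frame (a sketch, assuming the x_var, y_var and year_and_month names from the question): with group_keys=False, the concatenated result keeps the original row index, so a plain assignment aligns.

df['residuals'] = df.groupby('year_and_month', group_keys=False).apply(
    regression_residuals, x_column='x_var', y_column='y_var')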
I want to do a simple normalization of the data in a numpy ndarray, specifically (X - mu) / sigma. I tried the exact code I found in earlier questions and kept getting "TypeError: cannot perform reduce with flexible type". I gave up and tried a simpler normalization method, (X - X.min) / X.ptp, and got the same error.
import csv
import numpy as np
import urllib.request

# Import comma separated data from the UCI repository
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
urllib.request.urlretrieve(url, 'F:/Python/Wine Dataset/wine_data')

# open file
filename = 'F:/Python/Wine Dataset/wine_data'
raw_data = open(filename, 'rt')

# Put raw_data into a numpy.ndarray
reader = csv.reader(raw_data)
x = list(reader)
data = np.array(x)

# First column is classification, other columns are features
y = data[:, 0]
X_raw = data[:, 1:13]

# Attempt at normalizing data - really wanted (X - mu) / sigma, gave up;
# even this simplified version doesn't work
# latest error is: TypeError: cannot perform reduce with flexible type
X = (X_raw - X_raw.min(0)) / X_raw.ptp(0)
print(X)
Finally figured it out. The line data = np.array(x) returned an array containing string data.
was:
data = np.array(x)
changed to:
data = np.array(x).astype(float)
(The original fix used .astype(np.float), which worked at the time, but np.float has since been removed from NumPy; plain float or np.float64 is the safe spelling now.)
after that everything worked - a simple issue that cost me hours
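For reference, a minimal alternative that sidesteps the problem entirely (a sketch, assuming the same local copy of the UCI wine file): np.loadtxt parses the CSV straight into a float array, so the string dtype never appears.

import numpy as np

data = np.loadtxt('F:/Python/Wine Dataset/wine_data', delimiter=',')
y = data[:, 0]        # first column is the class label
X_raw = data[:, 1:]   # remaining columns are the features
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)  # z-score: (X - mu) / sigma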
import pandas as pd
census_df = pd.read_csv('census.csv')
#census_df.head()

def answer_seven():
    census_df_1 = census_df[(census_df['SUMLEV'] == 50)].set_index('CTYNAME')
    census_df_1['highest'] = census_df_1[['POPESTIAMTE2010','POPESTIAMTE2011','POPESTIAMTE2012','POPESTIAMTE2013','POPESTIAMTE2014','POPESTIAMTE2015']].max()
    census_df_1['lowest'] = census_df_1[['POPESTIAMTE2010','POPESTIAMTE2011','POPESTIAMTE2012','POPESTIAMTE2013','POPESTIAMTE2014','POPESTIAMTE2015']].min()
    x = abs(census_df_1['highest'] - census_df_1['lowest']).tolist()
    return x[0]

answer_seven()
This is trying to use the data from census.csv to find the county that has the largest absolute change in population within 2010-2015 (the POPESTIMATE columns). I wanted to simply find the difference between the absolute values of the max and min for each year column. The function must return a string. Also, [(census_df['SUMLEV'] == 50)] means only counties are taken, as they are the rows with SUMLEV set to 50. But the code gives an error that ends with:
KeyError: "['POPESTIAMTE2010' 'POPESTIAMTE2011' 'POPESTIAMTE2012'
'POPESTIAMTE2013'\n 'POPESTIAMTE2014' 'POPESTIAMTE2015'] not in index"
Am I indexing the wrong data structure? I'm really new to data science and coding.
I think the column names in the code have a typo. The pattern is 'POPESTIMATE201?', not 'POPESTIAMTE201?'.
Any help with shortening the code will be appreciated. Here is the code that works -
census_df = pd.read_csv('census.csv')

def answer_seven():
    cdf = census_df[(census_df['SUMLEV'] == 50)].set_index('CTYNAME')
    columns = ['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012', 'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']
    cdf['big'] = cdf[columns].max(axis=1)
    cdf['sml'] = cdf[columns].min(axis=1)
    cdf['change'] = cdf['big'] - cdf['sml']
    return cdf['change'].idxmax()
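Since the question also asks about shortening: a more compact variant (a sketch against the same census.csv schema, intended to be behaviorally equivalent):

def answer_seven():
    cdf = census_df[census_df['SUMLEV'] == 50].set_index('CTYNAME')
    cols = ['POPESTIMATE' + str(year) for year in range(2010, 2016)]
    # row-wise spread between the largest and smallest yearly estimate,
    # then the county (index label) where that spread is biggest
    return (cdf[cols].max(axis=1) - cdf[cols].min(axis=1)).idxmax()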