I'm building an OLS model but can't make any predictions.
Can you explain what I'm doing wrong?
Building the model:
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
d = {'City': ['Tokyo','Tokyo','Lisbon','Tokyo','Madrid','New York','Madrid','London','Tokyo','London','Tokyo'],
'Card': ['Visa','Visa','Visa','Master Card','Bitcoin','Master Card','Bitcoin','Visa','Master Card','Visa','Bitcoin'],
'Colateral':['Yes','Yes','No','No','Yes','No','No','Yes','Yes','No','Yes'],
'Client Number':[1,2,3,4,5,6,7,8,9,10,11],
'Total':[100,100,200,300,10,20,40,50,60,100,500]}
d = pd.DataFrame(data=d).set_index('Client Number')
df = pd.get_dummies(d,prefix='', prefix_sep='')
X = df[['Lisbon','London','Madrid','New York','Tokyo','Bitcoin','Master Card','Visa','No','Yes']]
Y = df['Total']
X1 = sm.add_constant(X)
reg = sm.OLS(Y, X1).fit()
reg.summary()
Prediction:
d1 = {'City': ['Tokyo','Tokyo','Lisbon'],
'Card': ['Visa','Visa','Visa'],
'Colateral':['Yes','Yes','No'],
'Client Number':[11,12,13],
'Total':[0,0,0]}
df1 = pd.DataFrame(data=d1).set_index('Client Number')
df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
y_new = df1[['Lisbon','Tokyo','Visa','No','Yes']]
x_new = df1['Total']
mod = sm.OLS(y_new, x_new)
mod.predict(reg.params)
Then it shows: ValueError: shapes (3,1) and (11,) not aligned: 1 (dim 1) != 11 (dim 0)
What am I doing wrong?
Here is the fixed prediction part of the code, with my comments:
d1 = {'City': ['Tokyo','Tokyo','Lisbon'],
'Card': ['Visa','Visa','Visa'],
'Colateral':['Yes','Yes','No'],
'Client Number':[11,12,13],
'Total':[0,0,0]}
df1 = pd.DataFrame(data=d1).set_index('Client Number')
df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
x_new = df1.drop(columns='Total')
The main problem is the different number of dummy columns in the training dataset X1 and in the new dataset x_new.
Below I add the missing dummy columns and fill them with zeros:
x_new = x_new.reindex(columns = X1.columns, fill_value=0)
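Note that reindex both adds the missing dummy columns (filled with zeros) and puts them in the same order as X1. It also creates the const column filled with 0; to apply the model's intercept you would additionally set x_new['const'] = 1, matching the training design matrix (the output below keeps it at 0).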
Now x_new has the proper number of columns, equal to the training dataset X1:
const Lisbon London Madrid ... Master Card Visa No Yes
Client Number ...
11 0 0 0 0 ... 0 1 0 1
12 0 0 0 0 ... 0 1 0 1
13 0 1 0 0 ... 0 1 1 0
[3 rows x 11 columns]
Finally, predict on the new dataset x_new using the previously trained model reg:
reg.predict(x_new)
result:
Client Number
11 35.956284
12 35.956284
13 135.956284
dtype: float64
APPENDIX
As requested, I enclose below fully reproducible code for both the training and prediction tasks:
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
d = {'City': ['Tokyo','Tokyo','Lisbon','Tokyo','Madrid','New York','Madrid','London','Tokyo','London','Tokyo'],
'Card': ['Visa','Visa','Visa','Master Card','Bitcoin','Master Card','Bitcoin','Visa','Master Card','Visa','Bitcoin'],
'Colateral':['Yes','Yes','No','No','Yes','No','No','Yes','Yes','No','Yes'],
'Client Number':[1,2,3,4,5,6,7,8,9,10,11],
'Total':[100,100,200,300,10,20,40,50,60,100,500]}
d = pd.DataFrame(data=d).set_index('Client Number')
df = pd.get_dummies(d,prefix='', prefix_sep='')
X = df[['Lisbon','London','Madrid','New York','Tokyo','Bitcoin','Master Card','Visa','No','Yes']]
Y = df['Total']
X1 = sm.add_constant(X)
reg = sm.OLS(Y, X1).fit()
reg.summary()
###
d1 = {'City': ['Tokyo','Tokyo','Lisbon'],
'Card': ['Visa','Visa','Visa'],
'Colateral':['Yes','Yes','No'],
'Client Number':[11,12,13],
'Total':[0,0,0]}
df1 = pd.DataFrame(data=d1).set_index('Client Number')
df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
x_new = df1.drop(columns='Total')
x_new = x_new.reindex(columns = X1.columns, fill_value=0)
reg.predict(x_new)
The biggest issue is that you are not applying the same dummy transformation: some categories present in the training data are absent from df1, so get_dummies produces fewer columns there. You can add the missing columns with the following code:
d1 = {'City': ['Tokyo','Tokyo','Lisbon'],
'Card': ['Visa','Visa','Visa'],
'Colateral':['Yes','Yes','No'],
'Client Number':[11,12,13],
'Total':[0,0,0]}
df1 = pd.DataFrame(data=d1).set_index('Client Number')
df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
print(df1.shape) # Shape is 3x6 but it has to be 3x11
# Get the dummy columns present in the training set but missing from the test set
missing_cols = set(df.columns) - set(df1.columns)
# Add each missing column to the test set with default value 0
for c in missing_cols:
    df1[c] = 0
# Ensure the columns of the test set are in the same order as in the train set
df1 = df1[df.columns]
print(df1.shape) # Shape is 3x11
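(The reindex approach from the previous answer achieves the same in one line: df1 = df1.reindex(columns=df.columns, fill_value=0).)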
Further, you mixed up x_new and y_new. So it should be:
x_new = df1.drop(['Total'], axis=1)
x_new = sm.add_constant(x_new, has_constant='add').values  # reg was trained with a constant
y_new = df1['Total'].values
mod = sm.OLS(y_new, x_new)
mod.predict(reg.params)
Note that I used x_new = df1.drop(['Total'], axis=1) instead of df1[['Lisbon','Tokyo','Visa','No','Yes']] because it is 1) less prone to typing errors and 2) less code.
First, you need to either string-index all the words or one-hot encode the values; ML models don't accept words, only numbers. Next, you want your X and y to be:
X = d.iloc[:,:-1]
y = d.iloc[:,-1]
This way X has a shape of (11, 3) and y has a shape of (11,), which are the proper shapes.
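For example, a minimal sketch of the one-hot route (using the DataFrame d and the imports from the question):
X = pd.get_dummies(d.iloc[:, :-1])  # one-hot encode City, Card, Colateral
y = d.iloc[:, -1]                   # Total
reg = sm.OLS(y, sm.add_constant(X.astype(float))).fit()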
Related
The issue I have is with a rather simple approach to forecasting time series in Python using a SARIMAX model and two variables:
endogenous: the one of interest.
exogenous: the one assumed to have some influence on the endogenous variable.
The example uses the daily values of BTC and ETH, where ETH is endogenous and BTC is exogenous.
import datetime
import numpy
import numpy as np
import matplotlib.pyplot as plt
import math
import pandas as pd
import pmdarima as pm
import statsmodels.api as sm
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from datetime import date
from math import sqrt
from dateutil.relativedelta import relativedelta
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from statsmodels.tsa.statespace.sarimax import SARIMAX
import itertools
from random import random
import yfinance as yf
plt.style.use('ggplot')
The method of fetching the data is quite simple, using the Yahoo Finance API via yfinance (imported as yf):
today = datetime.datetime.today()
ticker = input('Enter your ticker: ')
df1 = yf.download(ticker, period = 'max', interval = '1d')
df1.reset_index(inplace = True)
df1
This needs to be done manually: insert the name of the coin by hand (this gives the user more freedom in terms of which coins are combined).
Enter your ticker: BTC-USD
[*********************100%***********************] 1 of 1 completed
Date Open High Low Close Adj Close Volume
0 2014-09-17 465.864014 468.174011 452.421997 457.334015 457.334015 21056800
1 2014-09-18 456.859985 456.859985 413.104004 424.440002 424.440002 34483200
2 2014-09-19 424.102997 427.834991 384.532013 394.795990 394.795990 37919700
3 2014-09-20 394.673004 423.295990 389.882996 408.903992 408.903992 36863600
4 2014-09-21 408.084991 412.425995 393.181000 398.821014 398.821014 26580100
... ... ... ... ... ... ... ...
2677 2022-01-15 43101.898438 43724.671875 42669.035156 43177.398438 43177.398438 18371348298
2678 2022-01-16 43172.039062 43436.808594 42691.023438 43113.878906 43113.878906 17902097845
2679 2022-01-17 43118.121094 43179.390625 41680.320312 42250.550781 42250.550781 21690904261
2680 2022-01-18 42250.074219 42534.402344 41392.214844 42375.632812 42375.632812 22417209227
2681 2022-01-19 42365.046875 42462.070312 41248.902344 42142.539062 42142.539062 24763551744
2682 rows × 7 columns
So df1 is our exogenous data. Then the endogenous data are fetched in the same manner.
today = datetime.datetime.today()
ticker = input('Enter your ticker: ')
df2 = yf.download(ticker, period = 'max', interval = '1d')
df2.reset_index(inplace = True)
df2
Enter your ticker: ETH-USD
[*********************100%***********************] 1 of 1 completed
Date Open High Low Close Adj Close Volume
0 2017-11-09 308.644989 329.451996 307.056000 320.884003 320.884003 893249984
1 2017-11-10 320.670990 324.717987 294.541992 299.252991 299.252991 885985984
2 2017-11-11 298.585999 319.453003 298.191986 314.681000 314.681000 842300992
3 2017-11-12 314.690002 319.153015 298.513000 307.907990 307.907990 1613479936
4 2017-11-13 307.024994 328.415009 307.024994 316.716003 316.716003 1041889984
... ... ... ... ... ... ... ...
1528 2022-01-15 3309.844238 3364.537842 3278.670898 3330.530762 3330.530762 9619999078
1529 2022-01-16 3330.387207 3376.401123 3291.563721 3350.921875 3350.921875 9505934874
1530 2022-01-17 3350.947266 3355.819336 3157.224121 3212.304932 3212.304932 12344309617
1531 2022-01-18 3212.287598 3236.016113 3096.123535 3164.025146 3164.025146 13024154091
1532 2022-01-19 3163.054932 3170.838135 3055.951416 3123.905762 3123.905762 14121734144
1533 rows × 7 columns
Next comes a merging step where the two datasets are aligned:
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
data = df2.merge(df1, on = 'Date', how = 'left')
which looks like this:
Date Open High Low Close_x Adj Close Volume Close_y
0 2017-11-09 308.644989 329.451996 307.056000 320.884003 320.884003 893249984 7143.580078
1 2017-11-10 320.670990 324.717987 294.541992 299.252991 299.252991 885985984 6618.140137
2 2017-11-11 298.585999 319.453003 298.191986 314.681000 314.681000 842300992 6357.600098
3 2017-11-12 314.690002 319.153015 298.513000 307.907990 307.907990 1613479936 5950.069824
4 2017-11-13 307.024994 328.415009 307.024994 316.716003 316.716003 1041889984 6559.490234
... ... ... ... ... ... ... ... ...
1528 2022-01-15 3309.844238 3364.537842 3278.670898 3330.530762 3330.530762 9619999078 43177.398438
1529 2022-01-16 3330.387207 3376.401123 3291.563721 3350.921875 3350.921875 9505934874 43113.878906
1530 2022-01-17 3350.947266 3355.819336 3157.224121 3212.304932 3212.304932 12344309617 42250.550781
1531 2022-01-18 3212.287598 3236.016113 3096.123535 3164.025146 3164.025146 13024154091 42375.632812
1532 2022-01-19 3163.054932 3170.838135 3055.951416 3123.905762 3123.905762 14121734144 42142.539062
1533 rows × 8 columns
I want to focus solely on the closing price of BTC and ETH:
X = data[['Close_y', 'Date']]
y = data['Close_x']
X = pd.get_dummies(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 42, shuffle = False)
# grid search
X_train = X_train.drop('Date', axis = 1)
X_test = X_test.drop('Date', axis = 1)
Look for the best grid:
# Define the p, d and q parameters to take any value in range(0, 1)
p = d = q = range(0, 1)
# Generate all different combinations of p, d and q triplets
pdq = list(itertools.product(p, d, q))
# Generate all different combinations of seasonal p, d and q triplets,
# with 12 in the 's' position as the seasonal period
pdqs = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
### Run Grid Search ###
def sarimax_gridsearch(pdq, pdqs, maxiter=5):
ans = []
for comb in pdq:
for combs in pdqs:
try:
mod = SARIMAX(y_train, exog=X_train, order=comb, seasonal_order=combs)
output = mod.fit(maxiter=maxiter)
ans.append([comb, combs, output.bic])
print('SARIMAX {} x {}12 : BIC Calculated ={}'.format(comb, combs, output.bic))
except:
continue
# Find the parameters with minimal BIC value
# Convert into dataframe
ans_df = pd.DataFrame(ans, columns=['pdq', 'pdqs', 'bic'])
# Sort and return top 5 combinations
ans_df = ans_df.sort_values(by=['bic'], ascending=True)
print(ans_df)
ans_df = ans_df.iloc[0]
return ans_df['pdq'], ans_df['pdqs']
o, s = sarimax_gridsearch(pdq, pdqs)
Make the predictions:
# future predictions
# create Exogenous variables
df1 = df1.reset_index()
df1 = df1.set_index('Date')
df1 = df1.sort_index()
li = []
ys = ['Close']
for i in ys:
a = df1[i]
train_set, test_set = np.split(a, [int(.80 * len(a))])
model = pm.auto_arima(train_set, stepwise=True, error_action='ignore',seasonal=True, m=7)
b = model.get_params()
order = b.get('order')
s_order = b.get('seasonal_order')
model = sm.tsa.statespace.SARIMAX(a,
order=order,
seasonal_order=s_order
)
model_fit = model.fit()
start_index = data.index.max().date()+ relativedelta(days=1)
end_index = date(start_index.year, start_index.month , start_index.day+10)
forecast = model_fit.predict(start=start_index, end=end_index)
#start_index = data.shape[0]
#end_index = start_index + 12
#forecast = model_fit.predict(start=start_index, end=end_index)
li.append(forecast)
df = pd.DataFrame(li)
df = df.transpose()
df.columns = ys
df = df.reset_index()
exo = df[['Close', 'index']]
exo = exo.set_index('index')
But when I try to make the future predictions based on exo, like this:
#fit the model
print(b, s)
model_best = SARIMAX(y,exog=X.drop(['Date'],1), order=o, seasonal_order=s)
model_fit = model_best.fit()
model_fit.summary()
model_fit.plot_diagnostics(figsize=(15,12))
start_index = data.shape[0]
end_index = start_index + 12
pred_uc = model_fit.forecast(steps=13, start_index = start_index, end_index = end_index, exog = exo)
future_df = pd.DataFrame({'pred' : pred_uc})
print('Forecast:')
print(future_df)
plt.rcParams["figure.figsize"] = (8, 5)
#data = data.set_index('time')
plt.plot(data['Close_x'],color = 'blue', label = 'Actual')
plt.plot(pred_uc, color = 'orange',label = 'Predicted')
plt.show()
I get this annoying error:
ValueError Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\tsa\statespace\mlemodel.py in _validate_out_of_sample_exog(self, exog, out_of_sample)
1757 try:
-> 1758 exog = exog.reshape(required_exog_shape)
1759 except ValueError:
ValueError: cannot reshape array of size 11 into shape (13,1)
ValueError: Provided exogenous values are not of the appropriate shape. Required (13, 1), got (11, 1).
Can someone explain where I am wrong or what steps I missed in this module?
Check the shape of the exo variable. If you are forecasting 13 steps, then you need to provide exog variables for each of those 13 steps. The error message is saying that you only provided exog variables for 11 steps. You can either provide a larger array to the exog argument, or you can change the forecast to be for 11 steps.
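For example (a minimal sketch, assuming the exo frame and the fitted model_fit from the question):
# Option 1: shorten the horizon to the exogenous rows that are available
steps = len(exo)                                   # 11 here
pred_uc = model_fit.forecast(steps=steps, exog=exo)
# Option 2: keep the 13-step horizon by building 13 exogenous forecasts instead
# of 11, e.g. extend end_index by two more days when creating exo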
I have this data:
and I am trying to fit a simple linear regression model to it.
Here is my code:
from sklearn.linear_model import LinearRegression
X = df[['Date']]
y = df['ACP Cleaning']
model = LinearRegression()
model.fit(X, y)
X_predict = [['2021-1-1']]
y_predict = model.predict(X_predict)
and this is my error:
ValueError: Unable to convert array of bytes/strings into decimal numbers with dtype='numeric'
Linear regression works with numbers, not strings.
You must pre-process your data to match the input the model expects.
One way to do it is to parse the string and convert it to a timestamp:
import datetime
def process_date(date_str):
d = datetime.datetime.strptime(date_str, '%Y-%m-%d')
return d.timestamp()
X = df['Date'].apply(process_date).to_frame()  # apply to the Series so each date string is parsed; keep X 2-D
The same must be done to the data you want to predict.
Update: If your dataset's datatype is correct, then the problem is with the data you are trying to use for prediction (you cannot predict a string).
The following is a complete working example. Pay close attention to the processing done to the X_predict variable.
import datetime
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
rng = pd.date_range('2015-02-24', periods=5, freq='3A')
df = pd.DataFrame({ 'Date': rng, 'Val' : np.random.randn(len(rng))})
print(df.head())
X = np.array(df['Date']).reshape(-1,1)
y = df['Val']
model = LinearRegression()
model.fit(X, y)
def process_date(date_str):
d = datetime.datetime.strptime(date_str, '%Y-%m-%d')
# return array
return [d.timestamp()]
X_predict = ['2021-1-1']
X_predict = list(map(process_date, X_predict))
y_predict = model.predict(X_predict)
y_predict
Returns:
Date Val
0 2015-12-31 -0.110503
1 2018-12-31 -0.621394
2 2021-12-31 -1.030068
3 2024-12-31 1.221146
4 2027-12-31 -0.327685
array([-2.6149628])
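Note that np.array(df['Date']) yields datetime64 values (nanoseconds under the hood), while timestamp() returns seconds, so the fitted X and X_predict above are on different scales; converting both sides with the same process_date function, as in the CSV example below, is the safer pattern.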
Update: I used your data to create a csv file:
Date,Val
1-1-2020, 90404.71
2-1-2020, 69904.71
...
And then I loaded it with pandas. Everything looks good to me:
def process_date(date_str):
# the date format is month-day-year
d = datetime.datetime.strptime(date_str, '%m-%d-%Y')
return d.timestamp()
df = pd.read_csv("test.csv")
df['Date'] = df['Date'].apply(process_date)
df.head()
Output:
Date Val
0 1.577848e+09 90404.710
1 1.580526e+09 69904.710
2 1.583032e+09 98934.112
3 1.585710e+09 77084.430
4 1.588302e+09 35877.420
Extracting features:
# must reshape 'cause we have only one feature
X = df['Date'].to_numpy().reshape(-1,1)
y = df['Val'].to_numpy()
model = LinearRegression()
model.fit(X, y)
Predicting:
X_predict = ['1-1-2021', '2-1-2021']
X_predict = np.array(list(map(process_date, X_predict)))
X_predict = X_predict.reshape(-1, 1)
y_predict = model.predict(X_predict)
y_predict
Output:
array([55492.2660361 , 53516.12292932])
This is a good prediction. You can use matplotlib to plot your data and convince yourself:
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(df['Date'], df['Val'])
plt.show()
Linear regression needs your arrays to be of numeric type. Since the dates in your X array are stored as strings, linear regression won't work as you expect.
You can convert the X array to numeric type by counting the number of days since the beginning date. You can try something like this in your DataFrame:
df.Date = (df.Date - df.Date[0]).dt.days
And then you can continue as you were doing.
I have assumed that the dates in your Date column are already in datetime format; otherwise you would need to convert them first.
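A minimal end-to-end sketch, assuming df is the question's DataFrame with string dates in 'Date' and values in 'ACP Cleaning':
import pandas as pd
from sklearn.linear_model import LinearRegression
df['Date'] = pd.to_datetime(df['Date'])                  # strings -> datetime
df['Days'] = (df['Date'] - df['Date'].iloc[0]).dt.days   # days since first date
model = LinearRegression()
model.fit(df[['Days']], df['ACP Cleaning'])
# predict for a new date converted the same way
new_days = (pd.to_datetime('2021-01-01') - df['Date'].iloc[0]).days
print(model.predict([[new_days]]))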
I have two DataFrames, both with X and Y coordinates, but DF1 is much denser than DF2. I want to downsample DF1 according to the X, Y coordinates in DF2: for each X/Y pair in DF2, I select the DF1 data between X +/- delta and Y +/- delta and calculate the average value of Z. New_DF1 will have the same X, Y coordinates as DF2, but with the average value of Z from the downsampling.
Here are some data examples and a function I made for this purpose. My problem is that it is too slow for a large dataset. It would be highly appreciated if anyone has a better idea for vectorizing the operation instead of crude looping.
Create data examples:
DF1 = pd.DataFrame({'X':[0.6,0.7,0.9,1.1,1.3,1.8,2.1,2.8,2.9,3.0,3.3,3.5],"Y":[0.6,0.7,0.9,1.1,1.3,1.8,2.1,2.8,2.9,3.0,3.3,3.5],'Z':[1,2,3,4,5,6,7,8,9,10,11,12]})
DF2 = pd.DataFrame({'X':[1,2,3],'Y':[1,2,3],'Z':[10,20,30]})
Function:
def DF1_match_DF2_target(half_range, DF2, DF1):
    ### half_range: scalar, defines the selection area around each target
    ### DF2: the target data
    ### DF1: the raw (dense) pixel map
DF2_X =DF2.loc[:,["X"]]
DF2_Y =DF2.loc[:,['Y']]
results = list()
for i in DF2.index:
#Select target XY from DF2
x= DF2_X.at[i,'X']
y= DF2_Y.at[i,'Y']
#Select X,Y range for DF1
upper_lmt_X = x+half_range
lower_lmt_X = x-half_range
upper_lmt_Y = y+half_range
lower_lmt_Y = y-half_range
#Select data from DF1 according to X,Y range, calculate average Z
subset_X = DF1.loc[(DF1['X']>lower_lmt_X) & (DF1['X']<upper_lmt_X)]
subset_XY = subset_X.loc[(subset_X['Y']>lower_lmt_Y) & (subset_X['Y']<upper_lmt_Y)]
result = subset_XY.mean(axis=0,skipna=True)
result[0] = x #set X,Y in new_DF1 the same as the X,Y in DF2
result[1] = y #set X,Y in new_DF1 the same as the X,Y in DF2
results.append(result)
results = pd.DataFrame(results)
return results
Test and Result:
new_DF1 = DF1_match_DF2_target(0.5,DF2,DF1)
new_DF1
How about using the pandas.cut() function to aggregate using the boundary values?
half_range = 0.5
# create bin edges centered on DF2's X values: [0.5, 1.5, 2.5, 3.5]
# (one set of bins suffices here because X and Y coincide in the example data)
x_bins = [0] + list(DF2.X)
tmp = [half_range]*(len(DF2)+1)
x_bins = [a + b for a, b in zip(x_bins, tmp)]
key = pd.cut(DF1.X, bins=x_bins, right=False, precision=1)
DF3 = DF1.groupby(key).mean()
DF2.Z = DF3.Z.values
DF2
   X  Y    Z
0  1  1  3.0
1  2  2  6.5
2  3  3  9.5
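pd.cut only bins along one axis, which suffices here because X and Y coincide. For the general 2-D case, here is a vectorized sketch with NumPy broadcasting (assuming the DF1, DF2 and half_range defined above; memory grows as len(DF1) x len(DF2), so for very large data a KD-tree such as scipy.spatial.cKDTree would be the next step):
import numpy as np
import pandas as pd
# pairwise |difference| between every DF1 point and every DF2 point, per axis
dx = np.abs(DF1['X'].to_numpy()[:, None] - DF2['X'].to_numpy()[None, :])
dy = np.abs(DF1['Y'].to_numpy()[:, None] - DF2['Y'].to_numpy()[None, :])
mask = (dx < half_range) & (dy < half_range)        # shape: (len(DF1), len(DF2))
z = DF1['Z'].to_numpy()[:, None]
new_Z = (mask * z).sum(axis=0) / mask.sum(axis=0)   # mean Z per DF2 point
new_DF1 = pd.DataFrame({'X': DF2['X'], 'Y': DF2['Y'], 'Z': new_Z})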
I would like to do a regression with a rolling window, but I got only one parameter back after the regression:
rolling_beta = sm.OLS(X2, X1, window_type='rolling', window=30).fit()
rolling_beta.params
The result:
X1 5.715089
dtype: float64
What could be the problem?
Thanks in advance, Roland
I think the problem is that the parameters window_type='rolling' and window=30 simply do not do anything. First I'll show you why, and at the end I'll provide a setup I've got lying around for linear regressions on rolling windows.
1. The problem with your function:
Since you haven't provided any sample data, here's a function that returns a dataframe of a desired size with some random numbers:
# Function to build synthetic data
import numpy as np
import pandas as pd
import statsmodels.api as sm
from collections import OrderedDict
def sample(rSeed, periodLength, colNames):
np.random.seed(rSeed)
    date = pd.to_datetime("2018-12-01")
cols = OrderedDict()
for col in colNames:
cols[col] = np.random.normal(loc=0.0, scale=1.0, size=periodLength)
dates = date+pd.to_timedelta(np.arange(periodLength), 'D')
df = pd.DataFrame(cols, index = dates)
return(df)
Calling df = sample(rSeed=123, periodLength=50, colNames=['X1', 'X2']) gives:
X1 X2
2018-12-01 -1.085631 -1.294085
2018-12-02 0.997345 -1.038788
2018-12-03 0.282978 1.743712
2018-12-04 -1.506295 -0.798063
2018-12-05 -0.578600 0.029683
.
.
.
2019-01-17 0.412912 -1.363472
2019-01-18 0.978736 0.379401
2019-01-19 2.238143 -0.379176
Now, try:
rolling_beta = sm.OLS(df['X2'], df['X1'], window_type='rolling', window=30).fit()
rolling_beta.params
Output:
X1 -0.075784
dtype: float64
And this at least represents the structure of your output too, meaning that you're expecting an estimate for each of your sample windows, but instead you get a single estimate. So I looked around for some other examples using the same function online and in the statsmodels docs, but I was unable to find specific examples that actually worked. What I did find were a few discussions talking about how this functionality was deprecated a while ago. So then I tested the same function with some bogus input for the parameters:
rolling_beta = sm.OLS(df['X2'], df['X1'], window_type='amazing', window=3000000).fit()
rolling_beta.params
Output:
X1 -0.075784
dtype: float64
And as you can see, the estimates are the same, and no error messages are returned for the bogus input. So I suggest that you take a look at the function below. This is something I've put together to perform rolling regression estimates.
2. A function for regressions on rolling windows of a pandas dataframe
df = sample(rSeed = 123, colNames = ['X1', 'X2', 'X3'], periodLength = 50)
def RegressionRoll(df, subset, dependent, independent, const, win, parameters):
"""
RegressionRoll takes a dataframe, makes a subset of the data if you like,
and runs a series of regressions with a specified window length, and
returns a dataframe with BETA or R^2 for each window split of the data.
Parameters:
===========
df: pandas dataframe
subset: integer - has to be smaller than the size of the df
    dependent: string that specifies the name of the dependent variable
    independent: LIST of strings that specifies the names of the independent variables
const: boolean - whether or not to include a constant term
win: integer - window length of each model
parameters: string that specifies which model parameters to return:
BETA or R^2
Example:
========
RegressionRoll(df=df, subset = 50, dependent = 'X1', independent = ['X2'],
const = True, parameters = 'beta', win = 30)
"""
# Data subset
if subset != 0:
df = df.tail(subset)
else:
df = df
# Loopinfo
end = df.shape[0]
win = win
rng = np.arange(start = win, stop = end, step = 1)
# Subset and store dataframes
frames = {}
n = 1
for i in rng:
df_temp = df.iloc[:i].tail(win)
newname = 'df' + str(n)
frames.update({newname: df_temp})
n += 1
# Analysis on subsets
df_results = pd.DataFrame()
for frame in frames:
#print(frames[frame])
# Rolling data frames
dfr = frames[frame]
y = dependent
x = independent
if const == True:
x = sm.add_constant(dfr[x])
model = sm.OLS(dfr[y], x).fit()
else:
model = sm.OLS(dfr[y], dfr[x]).fit()
if parameters == 'beta':
theParams = model.params[0:]
coefs = theParams.to_frame()
df_temp = pd.DataFrame(coefs.T)
indx = dfr.tail(1).index[-1]
df_temp['Date'] = indx
df_temp = df_temp.set_index(['Date'])
if parameters == 'R2':
theParams = model.rsquared
df_temp = pd.DataFrame([theParams])
indx = dfr.tail(1).index[-1]
df_temp['Date'] = indx
df_temp = df_temp.set_index(['Date'])
df_temp.columns = [', '.join(independent)]
df_results = pd.concat([df_results, df_temp], axis = 0)
return(df_results)
df_rolling = RegressionRoll(df=df, subset=50, dependent='X1', independent=['X2'], const=True, parameters='beta', win=30)
Output: A dataframe with beta estimates for the OLS regression of X1 on X2 for each 30-period window of the data.
const X2
Date
2018-12-30 0.044042 0.032680
2018-12-31 0.074839 -0.023294
2019-01-01 -0.063200 0.077215
.
.
.
2019-01-16 -0.075938 -0.215108
2019-01-17 -0.143226 -0.215524
2019-01-18 -0.129202 -0.170304
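As an aside, newer statsmodels versions (0.11+) also provide RollingOLS, which computes this directly; a minimal sketch, assuming the df produced by sample() above:
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS
# one row of (const, X2) estimates per 30-observation window
rres = RollingOLS(df['X1'], sm.add_constant(df['X2']), window=30).fit()
print(rres.params.tail())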
I have the following code, which I am writing as part of a simple movie recommender in Python so I can mimic the results I get in Coursera's Machine Learning course taught by Andrew Ng.
I want to modify the numpy.ndarray that I get after calling as_matrix() on the pandas dataframe and add a column vector to it, like we can in MATLAB:
Y = [ratings Y]
Following is my Python code:
dataFile='/filepath/'
userItemRatings = pd.read_csv(dataFile, sep="\t", names=['userId', 'movieId', 'rating','timestamp'])
movieInfoFile = '/filepath/'
movieInfo = pd.read_csv(movieInfoFile, sep="|", names=['movieId','Title','Release Date','Video Release Date','IMDb URL','Unknown','Action','Adventure','Animation','Childrens','Comedy','Crime','Documentary','Drama','Fantasy','Film-Noir','Horror','Musical','Mystery','Romance','Sci-Fi','Thriller','War','Western'], encoding = "ISO-8859-1")
userMovieMatrix=pd.merge(userItemRatings, movieInfo, left_on='movieId', right_on='movieId')
userMovieSubMatrix = userMovieMatrix[['userId', 'movieId', 'rating','timestamp','Title']]
Y = pd.pivot_table(userMovieSubMatrix, values='rating', index=['movieId'], columns=['userId'])
Y.fillna(0,inplace=True)
movies = Y.shape[0]
users = Y.shape[1] +1
ratings = np.zeros((1682, 1))
ratings[0] = 4
ratings[6] = 3
ratings[11] = 5
ratings[53] = 4
ratings[63] = 5
ratings[65] = 3
ratings[68] = 5
ratings[97] = 2
ratings[182] = 4
ratings[225] = 5
ratings[354] = 5
features = 10
theta = pd.DataFrame(np.random.rand(users,features))# users 943*3
X = pd.DataFrame(np.random.rand(movies,features))# movies 1682 * 3
X = X.as_matrix()
theta = theta.as_matrix()
Y = Y.as_matrix()
"""want to insert a column vector into this Y to get a new Y of dimension
1682*944, but only seeing 1682*943 after the following statement
"""
np.insert(Y, 0, ratings, axis=1)
R = Y.copy()
R[R!=0] = 1
Ymean = np.zeros((movies, 1))
Ynorm = np.zeros((movies, users))
for i in range(movies):
idx = np.where(R[i,:] == 1)[0]
Ymean[i] = Y[i,idx].mean()
Ynorm[i,idx] = Y[i,idx] - Ymean[i]
print(type(Ymean), type(Ynorm), type(Y), Y.shape)
Ynorm[np.isnan(Ynorm)] = 0.
Ymean[np.isnan(Ymean)] = 0.
There is an inline comment about this above, but my problem is: when I create a new numpy array and call insert, it works just fine. However, it doesn't work on the numpy array I get after calling as_matrix() on the pandas dataframe produced by pivot_table(). Is there any alternative?
np.insert does not operate in place; you need to assign the output to a variable. Also, because ratings has shape (1682, 1), flatten it so that a single column is inserted (with a scalar insert position, np.insert broadcasts a 2-D values array along the wrong axis). Try:
Y = np.insert(Y, 0, ratings.ravel(), axis=1)
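A quick shape check (a minimal sketch with zero-filled stand-ins for the real data):
import numpy as np
Y = np.zeros((1682, 943))                     # stand-in for the pivoted ratings matrix
ratings = np.zeros((1682, 1))
Y = np.insert(Y, 0, ratings.ravel(), axis=1)  # assign the result back
print(Y.shape)                                # (1682, 944)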