Do you know how to calculate a linear regression between two points in time, for example between two prices for Amazon?
I am asking because all the simple examples use plain numbers on the x axis and values on the y axis, so the solution from:
How to calculate the coordinates of the line between two points in python?
will not work, because here there are timestamps on the x axis.
Can we use some numpy function? Is there a trick?
If we have code:
import yfinance as yf
import datetime
start = datetime.datetime(2021,9,1)
end = datetime.datetime(2021,9,2)
test_point = datetime.datetime(2021,10,15)
Amazon = yf.Ticker("AMZN")
df = Amazon.history(start=start, end=end)
print(df)
The code above gives us this df:
Open High Low Close Volume Dividends Stock Splits
Date
2021-08-31 3424.800049 3472.580078 3395.590088 3470.790039 4356400 0 0
2021-09-01 3496.399902 3527.000000 3475.239990 3479.000000 3629900 0 0
Let's say I want to calculate the slope and intercept of the line between the low prices.
Python 3.10 will have https://docs.python.org/3/library/statistics.html#statistics.linear_regression for linear regression, and its documentation includes an example with years. I am looking for something similar for Python 3.8.
The easiest approach would be to choose a "goalpost" date and create a feature time_elapsed, "the number of days since the goalpost"; then you can easily do least squares on that variable.
goalpost = datetime.datetime(2021, 8, 31)
# df.index - goalpost yields a TimedeltaIndex, so use .days directly
df["time_elapsed"] = (df.index - goalpost).days
E.g. if your goalpost is 2021-08-31 then the time_elapsed column would be
time_elapsed other_cols...
Date
2021-08-31 0 ...
2021-09-01 1 ...
From here, you can use a linear regression package (something from sklearn, statsmodels, or even the one you shared), or you can code your own. In the case of simple linear regression (one variable Y, one variable X), the formulas are quite simple:
https://en.wikipedia.org/wiki/Ordinary_least_squares#Simple_linear_regression_model
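Putting the pieces together, a minimal sketch using np.polyfit on the time_elapsed feature (the two Low prices are taken from the question; everything else is illustrative):

```python
import datetime

import numpy as np
import pandas as pd

# Toy frame standing in for the yfinance result (Low prices from the question)
df = pd.DataFrame(
    {"Low": [3395.590088, 3475.239990]},
    index=pd.to_datetime(["2021-08-31", "2021-09-01"]),
)

goalpost = datetime.datetime(2021, 8, 31)
# Subtracting a datetime from a DatetimeIndex yields a TimedeltaIndex,
# whose .days attribute gives integer day counts
df["time_elapsed"] = (df.index - goalpost).days

# Degree-1 least-squares fit returns (slope, intercept)
slope, intercept = np.polyfit(df["time_elapsed"], df["Low"], 1)

# Evaluate the fitted line at any later date the same way
test_point = datetime.datetime(2021, 10, 15)
predicted_low = slope * (test_point - goalpost).days + intercept
```

With only two points this reduces to the line through them: the slope is the price difference per day and the intercept is the price at the goalpost date.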
I'm doing a Prophet prediction for a dataframe of call counts, and I tried to fit a logistic growth model to it.
The dataframe has a ds column (30-minute datetime bins) and a y column (the call count for each bin).
Here's the prediction code:
from prophet import Prophet  # on older installs: from fbprophet import Prophet
import matplotlib.pyplot as plt

# logistic growth requires cap and floor columns on both frames
dataframe['cap'] = 10000
dataframe['floor'] = 0
m = Prophet(growth='logistic')
m.fit(dataframe)
future = m.make_future_dataframe(periods=periods, freq='H')
future['cap'] = 10000
future['floor'] = 0
forecast = m.predict(future)
fig1 = m.plot(forecast, xlabel='Date-time DD-MM-YY:HH-MM-SS', ylabel='Call Count')
plt.title('Forecast Prediction')
plt.show()
and here is the resulting graph.
Even so, both the detected pattern and the forecast go below zero.
Why is this happening, and what can I do?
Generally speaking, a common technique for handling negative values in prediction models is the logarithmic transformation.
To transform your target variable, you can use Y = log(x + c), where c is a constant.
People usually choose something like Y = log(x + 1) or another "very small" positive number for c.
After this transformation, the back-transformed forecasts are bounded below by -c, which effectively prevents negative predictions.
To come back to the original scale, just apply the inverse of y = log(x + 1), which is y = exp(x) - 1.
I have the following price data on a particular stock. Using df.head(), I have:
Adj Close pct_change log_ret
Date
2018-01-02 167.701889 0.017905 0.017746
2018-01-03 167.672668 -0.000174 -0.000174
2018-01-04 168.451508 0.004645 0.004634
2018-01-05 170.369385 0.011385 0.011321
2018-01-08 169.736588 -0.003714 -0.003721
The pct_change column is obtained using df['pct_change'] = df['Adj Close'].pct_change(), while log_ret is derived using df['log_ret'] = np.log(df['Adj Close']) - np.log(df['Adj Close'].shift(1)).
First, I show that the daily returns of the stock are normally distributed:
plt.hist(df['log_ret'].loc[df.index >= '2018-01-01'], 100)
plt.show()
Since the daily returns of the stock are normally distributed, the price of the stock should follow a lognormal distribution. If I interpret it correctly, this means log(df['Adj Close']) ~ N(mean, var). Hence, I created a new column log_price and tried to plot the following:
df['log_price'] = np.log(df['Adj Close'])
plt.hist(df['log_price'].loc[df.index >= '2018-01-01'], 100)
plt.show()
However, the prices do not seem to follow a lognormal distribution. I want to show that the prices follow a lognormal distribution, so what am I doing incorrectly?
Source: the Wikipedia article on the log-normal distribution:
if Y has a normal distribution, then the exponential function of Y, X = exp(Y), has a log-normal distribution
But in your code you have taken the log, which goes in the wrong direction; correct the code, and see the Wikipedia article for more details.
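To see the definitional relationship in code, here is a small simulation (synthetic data, not the stock prices from the question): exponentiating a normal sample gives a log-normal one, and taking the log of that sample recovers normality exactly:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Y ~ N(0, 0.5): a stand-in for normally distributed log returns
y = rng.normal(loc=0.0, scale=0.5, size=100_000)

# X = exp(Y) is log-normal by definition
x = np.exp(y)

# log(X) recovers the normal sample exactly, while X itself is
# right-skewed; a normality test flags X but not log(X)
stat_raw, p_raw = stats.normaltest(x)        # tiny p-value: not normal
stat_log, p_log = stats.normaltest(np.log(x))
```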
I am a newbie to time series forecasting. I have a non-stationary time series, which I converted to a stationary one by first taking the log and then subtracting the exponentially weighted mean. I then fitted an ARIMA model to the series and forecast 12 months ahead. Now I have to convert back to the original scale; in other words, how do I rescale the forecasted values?
I tried rescaling the forecast of one month, then recalculating the EWMA of the last 6 months and adding that value to the forecast of the next month, and so on, which I don't think is very efficient. Code to convert the series to stationary data:
airline_data_log = np.log(airline_data)
expweighted_avg = airline_data_log.ewm(halflife=0.5).mean()
stationary_data = airline_data_log - expweighted_avg
model_ARIMA = sm.tsa.arima_model.ARIMA(stationary_data, order=(2, 0, 2))
result = model_ARIMA.fit()  # fit the model before forecasting
p = result.forecast(steps=12)
output = p[0]
ewma = expweighted_avg
Now I want to rescale the output. I tried something like this:
for i in range(0, 12):
    # add the last EWMA value back to the i-th stationary forecast
    forecasted_values.iloc[i][0] = forecasted_values.iloc[i][0] + ewma.iloc[-1][0]
    # extend the log series with the reconstructed value, then recompute the EWMA
    airline_data_log = airline_data_log.append(forecasted_values.iloc[i])
    ewma = airline_data_log.ewm(halflife=0.5).mean()
which I think is inefficient and probably even wrong. Is there an efficient (right) way?
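For what it's worth, here is a self-contained sketch of that iterative inversion; the monthly series and the stationary forecasts are toy stand-ins for the real data and the ARIMA output:

```python
import numpy as np
import pandas as pd

# Toy monthly log series and 12 stationary forecasts (illustrative values)
idx = pd.date_range("2020-01-01", periods=24, freq="MS")
airline_data_log = pd.Series(np.linspace(4.0, 5.0, 24), index=idx)
stationary_forecast = np.full(12, 0.01)  # placeholder for result.forecast()

log_series = airline_data_log.copy()
reconstructed = []
for f in stationary_forecast:
    # last EWMA value of the log series seen so far
    ewma_last = log_series.ewm(halflife=0.5).mean().iloc[-1]
    # undo the subtraction: stationary value + EWMA = log value
    next_log = f + ewma_last
    reconstructed.append(next_log)
    # extend the log series so the next EWMA includes this forecast
    log_series.loc[log_series.index[-1] + pd.DateOffset(months=1)] = next_log

# undo the log to return to the original scale
forecast_original_scale = np.exp(pd.Series(reconstructed))
```

The loop is hard to avoid here because each step's EWMA depends on the previously reconstructed value, but the work per step is tiny, so it is not a performance problem for 12 steps.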
I am new to Python's ARIMA implementation. I have data at 15-minute frequency for a few months. In my attempt to follow the Box-Jenkins method to fit a time series model, I ran into an issue towards the end. The ACF-PACF graphs for the time series (ts) and the differenced series (ts_diff) are given. I used ARIMA(5,1,2), and finally I plotted the fitted values (green) against the original values (blue). As you can see from the figure, there is a clear shift (by one) in the values. What am I doing wrong?
Is the prediction bad? Any insight will be helpful.
This is a standard property of one-step ahead prediction or forecasting.
The information used for the forecast is the history up to and including the previous period. A peak, for example, at a period will affect the forecast for the next period, but cannot influence the forecast for the peak period. This makes the forecasts appear shifted in the plot.
A two-step ahead forecast would give the impression of a shift by two periods.
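You can see the same visual effect with the simplest possible one-step-ahead forecaster, the naive "predict the previous value" rule (a toy stand-in for the ARIMA fit):

```python
import numpy as np

# A short series with a single peak (value 9) at index 2
y = np.array([2.0, 6.0, 9.0, 7.0, 1.0, 4.0])

# Naive one-step-ahead forecast: each prediction is the previous observation
forecast = np.full_like(y, np.nan)
forecast[1:] = y[:-1]

# The peak at index 2 appears in the forecast at index 3: information
# about the peak can only influence the *next* prediction, so a plot of
# forecast against y looks shifted right by one period
```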
Just to confirm, I am doing this right then? Here is the code I used.
import pandas as pd
import statsmodels.api as sm
model = sm.tsa.ARIMA(ts, order=(5, 1, 2))
model = model.fit()
results_ARIMA=model.predict(typ='levels')
concatenated = pd.concat([ts, results_ARIMA], axis=1, keys=['original', 'predicted'])
concatenated.head(10)
original predicted
login_time
1970-01-01 20:00:00 2 NaN
1970-01-01 20:15:00 6 2.000186
1970-01-01 20:30:00 9 4.552971
1970-01-01 20:45:00 7 7.118973
1970-01-01 21:00:00 1 7.099769
1970-01-01 21:15:00 4 3.624975
1970-01-01 21:30:00 0 3.867454
1970-01-01 21:45:00 4 1.618120
1970-01-01 22:00:00 9 2.997275
1970-01-01 22:15:00 8 6.300015
In the model you specify, (5, 1, 2), you set d = 1. This means that you are differencing the data once, i.e. fitting the model to the period-to-period changes rather than to the levels, so as to minimize the residuals of the fitted model.
Sometimes, setting d to 1 results in an ACF/PACF plot with fewer and/or less dramatic spikes (i.e. less extreme residuals). In such cases, if you use the fitted model to predict future values, your predictions will deviate less dramatically from your observations when you apply differencing.
Differencing is accomplished through Y(differenced) = Y(t) - Y(t-d), where Y(t) refers to the observed value Y at time index t, and d refers to the order of differencing you apply. When you use differencing, your entire range of observations effectively shifts to the right: you lose some data at the left edge of your time series, and how many time points you lose depends on the order of differencing d. This is where your observed shift comes from.
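Differencing itself is a one-line operation; a quick illustration of d = 1 with numpy (made-up values):

```python
import numpy as np

y = np.array([10.0, 12.0, 15.0, 11.0, 14.0])

# First-order differencing: Y(t) - Y(t-1); the result is one value shorter
d1 = np.diff(y)  # -> [2., 3., -4., 3.]

# The pandas equivalent, pd.Series(y).diff(), keeps the original length
# by placing a NaN at the first position instead
```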
This page may offer a more elaborate explanation (make sure to click around a bit and explore the other pages on there if you want a treatment of the whole process of fitting an ARIMA model).
Hope this helps (or at least puts your mind at ease about the shift)!
Bests,
Evert
My data: I have two seasonal patterns in my hourly data, daily and weekly. For example, each day in my dataset has roughly the same shape based on the hour of the day. However, certain days, like Saturday and Sunday, exhibit increases in my data and also slightly different hourly shapes.
(I am using Holt-Winters, as I discovered here: https://gist.github.com/andrequeiroz/5888967)
I ran the algorithm using 24 as my periods per season and forecasting out 7 seasons (1 week). I noticed that it over-forecasts the weekdays and under-forecasts the weekend, since it estimates Saturday's curve from Friday's curve rather than from a combination of Friday's curve and the previous Saturday's curve. What would be a good way to include a secondary period in my data, i.e. both 24 and 7? Is there a different algorithm I should be using?
One obvious way to account for different shapes would be to use just one sort of period, but make it have a periodicity of 7*24, so you would be forecasting the entire week as a single shape.
Have you tried linear regression, in which the predicted value is a linear trend plus a contribution from dummy variables? The simplest example to explain is a trend plus only a daily contribution. Then you would have
Y = X*t + c + A*D1 + B*D2 + ... + F*D6 (+ noise)
Here you use linear regression to find the best-fitting values of X, c, and A...F. t is the time, counting up 0, 1, 2, 3, ... indefinitely, so the fitted value of X gives you a trend. c is a constant value, so it moves all the predicted Ys up or down. D1 is set to 1 on Tuesdays and 0 otherwise, D2 is set to 1 on Wednesdays and 0 otherwise, ... D6 is set to 1 on Sundays and 0 otherwise, so the A...F terms give contributions for days other than Mondays. We don't fit a term for Mondays because if we did we could not identify the c term: if you added 1 to c and subtracted 1 from each of A...F, the predictions would be unchanged.
Hopefully you can now see that we could add 23 terms to account for a shape over the 24 hours of each day, and a total of 46 terms to account for one shape over the 24 hours of each weekday and a different one over the 24 hours of each weekend day.
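Here's a sketch of that dummy-variable regression on hourly data, solved with ordinary least squares via np.linalg.lstsq; the series, the true coefficients, and the column counts are all illustrative:

```python
import numpy as np
import pandas as pd

# Three weeks of hourly data: trend + daily shape + weekend bump + noise
idx = pd.date_range("2024-01-01", periods=3 * 7 * 24, freq="h")
hours = idx.hour.to_numpy()
dow = idx.dayofweek.to_numpy()
rng = np.random.default_rng(1)
y = (
    0.01 * np.arange(len(idx))            # X*t: linear trend
    + np.sin(2 * np.pi * hours / 24)      # hour-of-day shape
    + (dow >= 5) * 2.0                    # weekend contribution
    + rng.normal(0, 0.1, len(idx))        # noise
)

# Design matrix: trend, constant, 6 day-of-week dummies (Monday dropped)
# and 23 hour-of-day dummies (hour 0 dropped), exactly as described above
t = np.arange(len(idx))
day_dummies = pd.get_dummies(dow, drop_first=True)
hour_dummies = pd.get_dummies(hours, drop_first=True)
X = np.column_stack([t, np.ones(len(idx)), day_dummies, hour_dummies])

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
trend, const = coef[0], coef[1]  # trend should recover roughly 0.01
```

Dropping one dummy per group (Monday, hour 0) is what keeps the constant term identifiable, for the reason given above.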
You would do best to look for a statistical package to handle this for you, such as the free R package (http://www.r-project.org/). It has a bit of a learning curve, but you can probably find books or articles that take you through using it for just this sort of prediction.
Whatever you do, I would keep on checking forecasting methods against your historical data - people have found that the most accurate forecasting methods in practice are often surprisingly simple.