Python - Calculate ongoing 1 Standard Deviation from linear regression line

I have managed to get a linear regression line for time series data, thanks in large part to earlier Stack Overflow answers. So I have the following plot, with the regression line drawn over the data in Python:
I got this regression line with the following code, originally importing price/time series data from a csv file:
f4 = open(r'C:\Users\cost9\OneDrive\Documents\PYTHON\TEST-ASSURANCE FILES\LINEAR REGRESSION MULTI TREND IDENTIFICATION\ES_1H.CSV')  # raw string so the backslashes aren't treated as escape sequences
ES_1H = pd.read_csv(f4)
ES_1H.rename(columns={'Date/Time': 'Date'}, inplace=True)
ES_1H['Date'] = ES_1H['Date'].reset_index()
ES_1H.Date.values.astype('M8[D]')
ES_1H_Last_300_Periods = ES_1H[-300:]
x = ES_1H_Last_300_Periods['Date']
y = ES_1H_Last_300_Periods['Close']
x = sm.add_constant(x)
ES_1H_LR = pd.ols(y=ES_1H_Last_300_Periods['Close'], x=ES_1H_Last_300_Periods['Date'])  # note: pd.ols has since been removed from pandas; see the statsmodels sketch below
plt.scatter(y=ES_1H_LR.y_fitted.values, x=ES_1H_Last_300_Periods['Date'])
What I'm looking for is to be able to plot/identify the band 1 standard deviation away from the regression line (shown in the picture above). Most of the above code is just to get the data into a form that plots the regression line successfully: converting the Date/Time data so it works in the ols formula, cutting the data down to the last 300 periods, and so on. But I am not sure how to get 1 standard deviation from the line that is drawn via linear regression.
So ideally what I'm looking for would look something like this:
...with the yellow lines being 1 standard deviation away from the regression line. Does anyone know how to get 1 standard deviation from the linear regression line here? (For reference, the linear regression stats were shown in an image that isn't reproduced here.)
Edit: for reference, here's what I ended up doing:
plt.scatter(y = ES_1D_LR.y_fitted.values, x = ES_1D_Last_30_Periods['Date'])
plt.scatter(y = ES_1D_Last_30_Periods.Close, x = ES_1D_Last_30_Periods.Date)
plt.scatter(y = ES_1D_LR.y_fitted.values - np.std(ES_1D_LR.y_fitted.values), x = ES_1D_Last_30_Periods.Date)
plt.scatter(y = ES_1D_LR.y_fitted.values + np.std(ES_1D_LR.y_fitted.values), x = ES_1D_Last_30_Periods.Date)
plt.show()
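(A note for later readers: pd.ols has since been removed from pandas. Below is a minimal sketch of the same idea using statsmodels, with synthetic stand-in data since the original CSV isn't available; the variable names are illustrative. One judgment call worth flagging: np.std of the fitted values, as used above, measures the spread of the trend itself, whereas the standard deviation of the residuals is usually what a "1 SD from the line" band means.)
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Synthetic stand-in for the CSV's last 300 closes (hypothetical data)
x = np.arange(300)
y = pd.Series(2000 + np.cumsum(np.random.randn(300)))

X = sm.add_constant(x)              # adds the intercept column
model = sm.OLS(y, X).fit()
fitted = model.fittedvalues
resid_sd = np.std(model.resid)      # standard deviation of the residuals

plt.scatter(x, y, s=4)
plt.plot(x, fitted, 'k-')
plt.plot(x, fitted + resid_sd, 'y-')  # 1 SD above the line
plt.plot(x, fitted - resid_sd, 'y-')  # 1 SD below the line
plt.show()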

IIUC you can do it this way:
In [185]: x = np.arange(100)
In [186]: y = x*0.6
In [187]: plt.scatter(x, y, c='b')
Out[187]: <matplotlib.collections.PathCollection at 0xc512390>
In [188]: plt.scatter(x, y - np.std(y), c='y')
Out[188]: <matplotlib.collections.PathCollection at 0xc683940>
In [189]: plt.scatter(x, y + np.std(y), c='y')
Out[189]: <matplotlib.collections.PathCollection at 0xc69a550>
Result:

I just wanted to achieve the same thing. Here's how I did it.
import matplotlib.pyplot as plt
import numpy as np
Given this data:
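(The original data isn't reproduced here; as a stand-in so the example runs, something like the following synthetic series can be used. The names time, price and predicted_price are the answer's own; the values are hypothetical.)
time = np.arange(200)
price = 0.5 * time + 5 * np.random.randn(200)   # noisy synthetic series
coeffs = np.polyfit(time, price, 1)             # linear fit
predicted_price = np.poly1d(coeffs)(time)       # regression line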
plt.plot(time, price)
plt.plot(time, predicted_price)
plt.show()
Plot a window around the predicted_price regression line:
sq_dis = (price - predicted_price) ** 2        # squared distance from the regression line
limit = (sq_dis.mean() + sq_dis.std()) * 0.3   # <- adjust the window width here
mask = sq_dis < limit                          # renamed from `filter` to avoid shadowing the builtin
plt.plot(time, price)
plt.plot(time, predicted_price)
plt.plot(time[mask], price[mask])
plt.show()

I found this method closer to the way I had planned to plot my regressions, so maybe you will find it interesting as well:
Use plt.fill_between to shade the area between the fitted line and (fit ± standard deviation), as in the following link:
https://jakevdp.github.io/PythonDataScienceHandbook/04.03-errorbars.html
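A minimal sketch of that idea applied here, assuming a simple linear fit and shading the band at ±1 residual standard deviation (synthetic data, illustrative names):
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(100)
y = 0.6 * x + 3 * np.random.randn(100)   # synthetic data
fit = np.poly1d(np.polyfit(x, y, 1))(x)  # fitted line
sd = np.std(y - fit)                     # residual standard deviation

plt.scatter(x, y, s=8)
plt.plot(x, fit, 'k-')
plt.fill_between(x, fit - sd, fit + sd, color='gray', alpha=0.3)  # ±1 SD band
plt.show()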

Related

How do I plot 2 separate average lines on my residuals between different x values

In the residual plot resulting from the code below, there is a substantial drop in values around the halfway point. I would like to help visualise this for those less statistically inclined by plotting 2 average lines over the residuals:
one from x = 0 to 110
and the second from x = 110 to 240
Here is the code:
# FINAL LINEAR MODEL
x = merged[['Imp_Col_LNG', 'AveSH_LNG']].values
y = merged['Unproductive_LNG'].values
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x,y)
# plt.scatter(x, y)
yp=reg.predict(x)
# plt.plot(xp,yp)
# plt.text(x.max()*0.7, y.max()*0.1, '$R^2$ = {score:.4f}'.format(score=reg.score(x, y)))
# plt.show()
plt.scatter(yp, y)
s = yp.argsort()
plt.plot(yp[s], yp[s],color='k',ls='--')
from scipy.stats import norm
# res (the residuals) is defined under the RESIDUAL PLOTS section below
ub = yp + norm.ppf(0.5 + 0.95/2) * res.std(ddof=1)
lb = yp - norm.ppf(0.5 + 0.95/2) * res.std(ddof=1)
plt.plot(yp[s], ub[s],color='k',ls='--')
plt.plot(yp[s], lb[s],color='k',ls='--')
plt.text(x.max()*0.7, y.max()*0.1, '$R^2$ = {score:.4f}'.format(score=reg.score(x, y)))
plt.xlabel('Predicted Values')
plt.ylabel('Observed Values')
plt.title('LNG_Shuffles')
plt.show()
# RESIDUAL PLOTS
res = pd.Series(y - yp)
checkresiduals(res)  # presumably the questioner's own diagnostic helper
plt.plot(res)
Since we are trying to plot the average of the residuals from (0, 110) and (110, 240), we first have to calculate the averages for each section.
Here, res stores the residual data in the form of a pd.Series object. To get the array information from it, we can use the to_numpy method of the pd.Series objects.
res_data = res.to_numpy()
Now, let's calculate the mean for each part.
first_average = res_data[:110].mean()
second_average = res_data[110:].mean()
Now, since we are going to plot this over two different ranges, we build a constant array over each range before plotting.
plt.plot(np.arange(110), np.ones(110) * first_average)
plt.plot(np.arange(110, 240), np.ones(130) * second_average)
This should give you the piecewise residual average plot.
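As a more compact alternative (my suggestion, not part of the original answer), plt.hlines can draw both horizontal segments in one call:
plt.hlines([first_average, second_average], [0, 110], [110, 240])  # y-values, x-starts, x-ends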

Python: two normal distributions

I have two data sets in which two values were measured. I am interested in the difference between the values and in the standard deviation of that difference. I made a histogram to which I would like to fit two normal distributions, in order to calculate the difference between the maxima. I would also like to evaluate the effect that, in one data set, I have much less data on one value. I've already looked at this link but it is not really what I need:
Python: finding the intersection point of two gaussian curves
for ii in range(2, 8):
    # Kanal = ii - 1
    file = filepath + r'\Mappe1.txt'  # raw string so the backslash isn't read as an escape
    data = np.loadtxt(file, delimiter='\t', skiprows=1)
    data = data[:, ii]
    plt.hist(data, bins=100)
    plt.xlabel("bins")
    plt.ylabel("Counts")
    plt.tight_layout()
    plt.grid()
    plt.figure()
plt.show()
Quick and dirty fitting can be readily achieved using scipy:
import numpy as np
from scipy.optimize import curve_fit  # non-linear curve fitting tool
from matplotlib import pyplot as plt

def func2fit(x, m_1, m_2, std_1, std_2, height1, height2):  # sum of two simple Gauss curves on the same x grid
    return (height1 * np.exp(-(x - m_1)**2 / (2 * std_1**2))
            + height2 * np.exp(-(x - m_2)**2 / (2 * std_2**2)))

init_guess = (-.3, .3, .5, .5, 3000, 3000)
# initial guesses for the parameters (m_1, m_2, std_1, std_2, height1, height2), read off your first figure

# do the fitting
fit_pars, pcov = curve_fit(func2fit, xdata, ydata, p0=init_guess)
# fit_pars contains the means, SDs and heights; pcov contains the estimated covariance of these parameters
plt.plot(xdata, func2fit(xdata, *fit_pars), label='fit')  # plot the fit
For further reference consult the scipy manual page:
curve_fit
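To make the snippet self-contained, here is one hedged way to produce xdata and ydata from a histogram and run the fit (synthetic bimodal data; the numbers are illustrative, not from the question):
samples = np.concatenate([np.random.normal(-0.3, 0.5, 30000),
                          np.random.normal(0.3, 0.5, 30000)])   # hypothetical two-peak data
counts, edges = np.histogram(samples, bins=100)
xdata = (edges[:-1] + edges[1:]) / 2    # bin centres
ydata = counts
fit_pars, pcov = curve_fit(func2fit, xdata, ydata, p0=init_guess)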
Assuming that the two samples are independent there is no need to handle this problem using curve fitting. It's basic statistics. Here's some code that does the calculations required, with the source attributed in a comment.
## adapted from http://onlinestatbook.com/2/estimation/difference_means.html
from random import gauss
from numpy import sqrt
sample_1 = [ gauss(0,1) for _ in range(10) ]
sample_2 = [ gauss(1,.5) for _ in range(20) ]
n_1 = len(sample_1)
n_2 = len(sample_2)
mean_1 = sum(sample_1)/n_1
mean_2 = sum(sample_2)/n_2
SSE = sum([(_-mean_1)**2 for _ in sample_1]) + sum([(_-mean_2)**2 for _ in sample_2])
df = (n_1-1) + (n_2-1)
MSE = SSE/df
n_h = 2 / ( 1/n_1 + 1/n_2 )
s_mean_diff = sqrt( 2* MSE / n_h )
print ( 'difference between means', abs(mean_1 - mean_2) )
print ( 'std dev of this difference', s_mean_diff )
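As a sanity check (my addition, under the same equal-variance assumption), this pooled standard error is exactly the one used by the two-sample t-test, so scipy can confirm the calculation:
from scipy.stats import ttest_ind
t_manual = (mean_1 - mean_2) / s_mean_diff
t_scipy, p_value = ttest_ind(sample_1, sample_2, equal_var=True)
print('manual t:', t_manual, ' scipy t:', t_scipy)  # the two t values should agree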

Linear regression with pandas time series

I have a dataframe object which contains 1-second intervals of the EUR_USD currency pair. But in theory it could be any interval, and in that case it could look like this:
2015-11-10 01:00:00+01:00 1.07616
2015-11-10 01:01:00+01:00 1.07605
2015-11-10 01:02:00+01:00 1.07590
2015-11-10 01:03:00+01:00 1.07592
2015-11-10 01:04:00+01:00 1.07583
I'd like to use linear regression to draw a trend line from the data in the dataframe, but I'm not sure what the best way is to do that with time series, especially such a small interval of time series.
So far I've messed around by replacing the time (and this is just to show where I'd like to go with it) with a list ranging from 0 to the length of the time series.
x = list(range(0, len(df.index.tolist()), 1))
y = df["closeAsk"].tolist()
Using numpy to do the math magic
fit = np.polyfit(x,y,1)
fit_fn = np.poly1d(fit)
Lastly I draw the function along with the df["closeAsk"] to make sense of the trend.
plt.plot(x,df["closeAsk"], '-')
plt.plot(x,y, 'yo', x, fit_fn(x), '--k')
plt.show()
However, now the x-axis is just meaningless numbers; instead, I'd like it to show the time series timestamps.
To elaborate on my comment:
Say you have some evenly spaced time series data, time, and some correlated data, data, as you've laid out in your question.
time = pd.date_range('9:00', '10:00', freq='1s')
data = np.cumsum(np.random.randn(time.size))
df = pd.DataFrame({'time': time,
                   'data': data})
As you've shown, you can do a linear fit of the data with np.polyfit and create the trend line with np.poly1d.
x = np.arange(time.size) # = array([0, 1, 2, ..., 3598, 3599, 3600])
fit = np.polyfit(x, df['data'], 1)
fit_fn = np.poly1d(fit)
Then plot the data and the fit with df['time'] as the x-axis.
plt.plot(df['time'], fit_fn(x), 'k-')
plt.plot(df['time'], df['data'], 'go', ms=2)
Maybe you will be happy with seaborn? Please try
seaborn.regplot
You can create a numpy linspace for the x-values with the same length as your datapoints, like so:
y = df["closeAsk"].dropna()  # or .fillna(method='bfill')
x = np.linspace(1, len(y), num=len(y))
import seaborn as sb
sb.regplot(x, y)
Building on the accepted answer, here's a neat way to plot both trend and data from any pd.Series, including time series:
trend.plot(df['data'])
Where trend.plot is defined as follows (generalized from the accepted answer):
def trend(s):
    x = np.arange(len(s))
    z = np.polyfit(x, s, 1)
    p = np.poly1d(z)
    t = pd.Series(p(x), index=s.index)
    return t

trend.plot = lambda s: [s.plot(), trend(s).plot()]
If you need just the trend data (not the plot):
trendline = trend(df['data'])

Time series forecasting with support vector regression

I'm trying to perform a simple time series prediction using support vector regression.
I am trying to understand the answer provided here.
I adapted Tom's code to reflect the answer provided:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.svm import SVR
X = np.arange(0,100)
Y = np.sin(X)
a = 0
b = 10
x = []
y = []
while b <= 100:
    x.append(Y[a:b])
    a += 1
    b += 1
b = 10
while b <= 90:
    y.append(Y[b])
    b += 1
svr_rbf = SVR(kernel='rbf', C=1e5, gamma=1e5)
y_rbf = svr_rbf.fit(x[:81], y).predict(x)
figure = plt.figure()
tick_plot = figure.add_subplot(1, 1, 1)
tick_plot.plot(X, Y, label='data', color='green', linestyle='-')
tick_plot.axvline(x=X[-10], alpha=0.2, color='gray')
tick_plot.plot(X[10:], y_rbf[:-1], label='data', color='blue', linestyle='--')
plt.show()
However, I still get the same behavior -- the prediction just returns the value from the last known step. Strangely, if I set the kernel to linear the result is much better. Why doesn't the rbf kernel prediction work as intended?
Thank you.
I understand this is an old question, but I will answer it as other people might benefit from the answer.
The values you are using for C and gamma are most likely the issue if your example works with a linear kernel and not with rbf.
C and gamma are SVM parameters used for nonlinear kernels. For a good explanation of what C and gamma mean intuitively, have a look here: http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html
In order to predict the values of a sinusoid, try C = 1 and gamma = 0.1. It performs much better than with the values you have.
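Rather than hand-picking values, a grid search is another option. Here is a rough sketch reusing the question's own x/y windowing (my addition, not from the original answer); note that plain k-fold cross-validation ignores temporal order, so sklearn's TimeSeriesSplit is used instead:
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.svm import SVR

param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVR(kernel='rbf'), param_grid, cv=TimeSeriesSplit(n_splits=5))
search.fit(x[:81], y)           # x and y as built in the question's code above
print(search.best_params_)      # best C and gamma found on the grid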

turn scatter data into binned data with errors bars equal to standard deviation

I have a bunch of scattered data x, y. If I want to bin these according to x and put error bars equal to the standard deviation on them, how would I go about doing that?
The only way I know of in Python is to loop over the data in x and group it into bins of width (max(x) - min(x))/nbins, then loop over those blocks to find the std. I'm sure there are faster ways of doing this with numpy.
I want it to look similar to "vert symmetric" in: http://matplotlib.org/examples/pylab_examples/errorbar_demo.html
You can bin your data with np.histogram. I'm reusing code from this other answer to calculate the mean and standard deviation of the binned y:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.rand(100)
y = np.sin(2*np.pi*x) + 2 * x * (np.random.rand(100)-0.5)
nbins = 10
n, _ = np.histogram(x, bins=nbins)
sy, _ = np.histogram(x, bins=nbins, weights=y)
sy2, _ = np.histogram(x, bins=nbins, weights=y*y)
mean = sy / n
std = np.sqrt(sy2/n - mean*mean)
plt.plot(x, y, 'bo')
plt.errorbar((_[1:] + _[:-1])/2, mean, yerr=std, fmt='r-')  # _ holds the bin edges from the last np.histogram call
plt.show()
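Alternatively (a variant I'm adding, assuming scipy is available), scipy.stats.binned_statistic computes the per-bin mean and standard deviation directly:
from scipy.stats import binned_statistic
mean, edges, _ = binned_statistic(x, y, statistic='mean', bins=nbins)
std, _, _ = binned_statistic(x, y, statistic='std', bins=nbins)
centers = (edges[1:] + edges[:-1]) / 2   # bin centres
plt.errorbar(centers, mean, yerr=std, fmt='r-')
plt.show()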
No loops! NumPy lets you avoid explicit looping almost entirely.
I am not sure I understand everything: you have the same x vector for all data and many y vectors corresponding to different measurements, no? And you want to plot your data like the "vert symmetric" example, with the mean value of y for each x and the standard deviation for each x as an error bar?
Then it is easy. I assume you have an M-long x vector and an N*M array of your N sets of y data already loaded into variables named x and y.
import numpy as np
from matplotlib import pyplot as pl

error = np.std(y, axis=0)   # std over the N measurement sets, for each x (axis=0 given the N*M layout)
ymean = np.mean(y, axis=0)  # mean over the N sets, for each x
pl.errorbar(x, ymean, error)
pl.show()
I hope it helps. Let me know if you have any question or if it is not clear.
