Time series forecasting with support vector regression - python

I'm trying to perform a simple time series prediction using support vector regression.
I am trying to understand the answer provided here.
I adapted Tom's code to reflect the answer provided:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.svm import SVR
X = np.arange(0,100)
Y = np.sin(X)
a = 0
b = 10
x = []
y = []
while b <= 100:
x.append(Y[a:b])
a += 1
b += 1
b = 10
while b <= 90:
y.append(Y[b])
b += 1
svr_rbf = SVR(kernel='rbf', C=1e5, gamma=1e5)
y_rbf = svr_rbf.fit(x[:81], y).predict(x)
figure = plt.figure()
tick_plot = figure.add_subplot(1, 1, 1)
tick_plot.plot(X, Y, label='data', color='green', linestyle='-')
tick_plot.axvline(x=X[-10], alpha=0.2, color='gray')
tick_plot.plot(X[10:], y_rbf[:-1], label='data', color='blue', linestyle='--')
plt.show()
However, I still get the same behavior -- the prediction just returns the value from the last known step. Strangely, if I set the kernel to linear the result is much better. Why doesn't the rbf kernel prediction work as intended?
Thank you.

I understand this is an old question, but I will answer it as other people might benefit from the answer.
The values you are using for C and gamma are most likely the issue if your example works with a linear kernel and not with rbf.
C and gamma are SVM parameters used for nonlinear kernel. For a goodexplanation of what C and gamma are intuitively, have a look here: http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html?
In order to predict the values of a sinusoid, try C = 1 and gamma = 0.1. It performs much better than with the values you have.

Related

Python Logistic Regression Produces Wrong Coefficients

I am trying to use LogisticRegression model fro scikit-learn to solve Excercise 2 from Machine Learning course by Andrew Ng on Coursera. But the result, which I get is wrong:
1) Outcome coefficients doen't match with answers:
What I get with a model
What I should get according to answers
[-25.16, 0.21, 0.20]
you can see on the plot (wrong graph), that decision boundary seems to be a little bit below the decision boundary by intuition.
2) Graph outcome seems wrong
As you can see, decision boundary is below
LogisticRegression
Answer
MY CODE:
% matplotlib notebook
# IMPORT DATA
ex2_folder = 'machine-learning-ex2/ex2'
input_1 = pd.read_csv(folder + ex2_folder +'/ex2data1.txt', header = None)
X = input_1[[0,1]]
y = input_1[2]
# IMPORT AND FIT MODEL
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(fit_intercept = True)
model.fit(X,y)
print('Intercept (Theta 0: {}). Coefficients: {}'.format(model.intercept_, model.coef_))
# CALCULATE GRID
n = 5
xx1, xx2 = np.mgrid[25:101:n, 25:101:n]
grid = np.c_[xx1.ravel(), xx2.ravel()]
probs = model.predict_proba(grid)[:, 1]
probs = probs.reshape(xx1.shape)
# PLOTTING
f = plt.figure()
ax = plt.gca()
for outcome in [0,1]:
xo = 'yo' if outcome == 0 else 'k+'
selection = y == outcome
plt.plot(X.loc[selection, 0],X.loc[selection,1],xo, mec = 'k')
plt.xlim([25,100])
plt.ylim([25,100])
plt.xlabel('Exam 1 Score')
plt.ylabel('Exam 2 Score')
plt.title('Exam 1 & 2 and admission outcome')
contour = ax.contourf(xx1,xx2, probs, 100, cmap="RdBu",
vmin=0, vmax=1)
ax_c = f.colorbar(contour)
ax_c.set_label("$P(y = 1)$")
ax_c.set_ticks([0, .25, .5, .75, 1])
plt.contour(xx1, xx2, probs, [0.5], linewidths=1, colors='b', alpha = 0.3);
plt.plot(xx1[probs > 0.5], xx2[probs > 0.5],'.b', alpha = 0.3)
LINKS
DataFile in txt
PDF Tasks and Solutions in Octave
To get identical results you need to create identical testing conditions.
One obvious difference at a glance is the iteration count. Sklearn LogisticRegression classifier default iteration count is 100, while Andrew NG's sample code runs for 400 iterations. That will certainly give you a different result from Nguyen's course.
I am not sure anymore which cost function Nguyen is using for the exercise, but I am pretty sure it's cross entropy and not the L2 that is default function for LogisticRecression classifier in scikit learn.
And the last note, before you implement higher-level solutions (scikitlearn/tensorflow/keras), you should first try to implement them in pure python to understand how they work. It will be easier (and more fun) to try and make higher-level packages to work for you.

matplotlib exact solution to a differential equation

I'm trying to plot the exact solution to a differential equation (a radioactive leak model) in python2.7 with matplotlib. When plotting the graph with Euler methods OR SciPy I get the expected results, but with the exact solution the output is a straight-line graph (should be logarithmic curve).
Here is my code:
import math
import numpy as np
import matplotlib.pyplot as plt
#define parameters
r = 1
beta = 0.0864
x0 = 0
maxt = 100.0
tstep = 1.0
#Make arrays for time and radioactivity
t = np.zeros(1)
#Implementing model with Exact solution where Xact = Exact Solution
Xact = np.zeros(1)
e = math.exp(-(beta/t))
while (t[-1]<maxt):
t = np.append(t,t[-1]+tstep)
Xact = np.append(Xact,Xact[-1] + ((r/beta)+(x0-r/beta)*e))
#plot results
plt.plot(t, Xact,color="green")
I realise that my problem may be due to an incorrect equation, but I'd be very grateful if someone could point out an error in my code. Cheers.
You probably want e to depend on t, as in
def e(t): return np.exp(-t/beta)
and then use
Xact.append( (r/beta)+(x0-r/beta)*e(t[-1]) )
But you can have that all shorter as
t = np.arange(0, maxt+tstep/2, tstep)
plt.plot(t, (r/beta)+(x0-r/beta)*np.exp(-t/beta), color="green" )

Python: scikit-learn isomap results seem random, but no possibility to set random_state

I am using Isomap from scikit-learn manifold learning. I reduce to two dimension, and observe that with every run of the algorthm on the same data set without any changes the resulting vectors change. I assume there are some random numbers used in the algorithm, but there is no way to set a seed. Random_state is not a variable to pass in Isomap. Am I missing something?
The random you've seen is about the sign of your result. The sign is not (in my opinion) 100% random. Signs within each component are consistent so that the relative relation is consistent in your result. Signs between components are random. In other words, which component got multiplied by -1 or 1 are random. This behavior comes from the KernelPCA function used by Isomap when the arpack kernel is used.
To give you a solution first, you can use eigen_solver='dense' when using Isomap. That may slow down your algorithm but should remove this randomness. I know this explanation above might be confusing. Let me give more details and show this by plot.
First, what is a visualized consequence of the "sign randomness"? Using the following code (modified from this official example) with eigen_solver = 'arpack', you can see two fit_transform using the same Isomap class may (or may not) give you different results. However, as you can see in the plot, the relative location maintains. It's just the whole plot getting flipped. If you use eigen_solver='dense' and run the code multiple times, you won't see this randomness:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import offsetbox
from sklearn import (manifold, datasets, decomposition, ensemble,
discriminant_analysis, random_projection)
digits = datasets.load_digits(n_class=6)
X = digits.data
y = digits.target
n_samples, n_features = X.shape
n_neighbors = 30
def plot_embedding(X, ax, title=None):
x_min, x_max = np.min(X, 0), np.max(X, 0)
X = (X - x_min) / (x_max - x_min)
for i in range(X.shape[0]):
ax.text(X[i, 0], X[i, 1], str(digits.target[i]),
color=plt.cm.Set1(y[i] / 10.),
fontdict={'weight': 'bold', 'size': 9})
eigen_solver = 'arpack'
#eigen_solver = 'dense'
iso = manifold.Isomap(n_neighbors, n_components=2, eigen_solver=eigen_solver)
X_iso1 = iso.fit_transform(X)
X_iso2 = iso.fit_transform(X)
fig = plt.figure(figsize=(16, 6))
ax1 = fig.add_subplot(121)
plot_embedding(X_iso1, ax1)
ax2 = fig.add_subplot(122)
plot_embedding(X_iso2, ax2)
plt.show()
Secondly, is there a way to set a seed to "stabilize" the random state? No, there is currently no way to set a seed for KernelPCA from Isomap. With KernelPCA, however, there is a kwarg random_state which is "A pseudo random number generator used for the initialization of the residuals when eigen_solver == ‘arpack’". Play with the following code (modified from this official test code) and you can see this randomness is gone (blue dots cover red dots) even with eigen_solver = 'arpack':
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import KernelPCA
X_fit = np.random.rand(100, 4)
X = np.dot(X_fit, X_fit.T)
eigen_solver = 'arpack'
#eigen_solver = 'dense'
#random_state = None
random_state = 0
kpca = KernelPCA(n_components=2, kernel='precomputed',
eigen_solver=eigen_solver, random_state=random_state)
X_kpca1 = kpca.fit_transform(X)
X_kpca2 = kpca.fit_transform(X)
plt.plot(X_kpca1[:,0], X_kpca1[:,1], 'ro')
plt.plot(X_kpca2[:,0], X_kpca2[:,1], 'bo')
plt.show()

Python - Calculate ongoing 1 Standard Deviation from linear regression line

I have managed to get a linear regression line for time series data, much thanks to stackoverflow prior. So I have the following plots/line drawn over from python:
I got this regression line with the following code, originally importing price/time series data from a csv file:
f4 = open('C:\Users\cost9\OneDrive\Documents\PYTHON\TEST-ASSURANCE FILES\LINEAR REGRESSION MULTI TREND IDENTIFICATION\ES_1H.CSV')
ES_1H = pd.read_csv(f4)
ES_1H.rename(columns={'Date/Time': 'Date'}, inplace=True)
ES_1H['Date'] = ES_1H['Date'].reset_index()
ES_1H.Date.values.astype('M8[D]')
ES_1H_Last_300_Periods = ES_1H[-300:]
x = ES_1H_Last_300_Periods['Date']
y = ES_1H_Last_300_Periods['Close']
x = sm.add_constant(x)
ES_1H_LR = pd.ols(y = ES_1H_Last_300_Periods['Close'], x = ES_1H_Last_300_Periods['Date'])
plt.scatter(y = ES_1H_LR.y_fitted.values, x = ES_1H_Last_300_Periods['Date'])
What I'm looking for is to be able to plot/identify 1 standard deviation from the regression line (shown in the picture above). Most of the above code is just to conform the data to successfully be able to plot the regression line - change the Date/Time data so it will work in the ols formula, cut off the data to the last 300 periods and so on. But I am not sure how to grab 1 standard deviation from the line that is drawn via linear regression.
So ideally what I'm looking for would look something like this:
...with the yellow lines being 1 standard deviation away from the regression line. Does anyone know how to get 1 standard deviation from the linear regression line here? For reference, here are the stats for linear regression:
edit: For reference here's what I ended up doing:
plt.scatter(y = ES_1D_LR.y_fitted.values, x = ES_1D_Last_30_Periods['Date'])
plt.scatter(y = ES_1D_Last_30_Periods.Close, x = ES_1D_Last_30_Periods.Date)
plt.scatter(y = ES_1D_LR.y_fitted.values - np.std(ES_1D_LR.y_fitted.values), x = ES_1D_Last_30_Periods.Date)
plt.scatter(y = ES_1D_LR.y_fitted.values + np.std(ES_1D_LR.y_fitted.values), x = ES_1D_Last_30_Periods.Date)
plt.show()
IIUC you can do it this way:
In [185]: x = np.arange(100)
In [186]: y = x*0.6
In [187]: plt.scatter(x, y, c='b')
Out[187]: <matplotlib.collections.PathCollection at 0xc512390>
In [188]: plt.scatter(x, y - np.std(y), c='y')
Out[188]: <matplotlib.collections.PathCollection at 0xc683940>
In [189]: plt.scatter(x, y + np.std(y), c='y')
Out[189]: <matplotlib.collections.PathCollection at 0xc69a550>
Result:
I just wanted to achieve the same thing. Here's how I did it.
import matplotlib.pyplot as plt
import numpy as np
Given this data:
plt.plot(time, price)
plt.plot(time, predicted_price)
plt.show()
Plot a window around the predicted_price regression line:
sq_dis = (price - predicted_price) ** 2
limit = (sq_dis.mean() + sq_dis.std()) * 0.3 # < - adjust window here
filter = np.abs(sq_dis) < limit
plt.plot(time, price)
plt.plot(time, predicted_price)
plt.plot(time[filter], price[filter])
plt.show()
I found this method closer to the way I had planned to plot my regression plots, so maybe you will find it interesting as well:
Use the function "plt.fill_between" to gray the area between mean and (mean+-standard deviation) like the following link:
https://jakevdp.github.io/PythonDataScienceHandbook/04.03-errorbars.html

How to overplot fit results for discrete values in pymc3?

I am completely new to pymc3, so please excuse the fact that this is likely trivial. I have a very simple model where I am predicting a binary response function. The model is almost a verbatim copy of this example: https://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/gelman_bioassay.py
I get back the model parameters (alpha, beta, and theta), but I can't seem to figure out how to overplot the predictions of the model vs. the input data. I tried doing this (using the parlance of the bioassay model):
from scipy.stats import binom
mean_alpha = mean(trace['alpha'])
mean_beta = mean(trace['beta'])
pred_death = binom.rvs(n, 1./(1.+np.exp(-(mean_alpha + mean_beta * dose))))
and then plotting dose vs. pred_death, but this is manifestly not correct as I get different draws of the binomial distribution every time.
Related to this is another question, how do I evaluate the goodness of fit? I couldn't seem to find anything to that effect in the "getting started" pymc3 tutorial.
Thanks very much for any advice!
Hi a simple way to do it is as follows:
from pymc3 import *
from numpy import ones, array
# Samples for each dose level
n = 5 * ones(4, dtype=int)
# Log-dose
dose = array([-.86, -.3, -.05, .73])
def invlogit(x):
return np.exp(x) / (1 + np.exp(x))
with Model() as model:
# Logit-linear model parameters
alpha = Normal('alpha', 0, 0.01)
beta = Normal('beta', 0, 0.01)
# Calculate probabilities of death
theta = Deterministic('theta', invlogit(alpha + beta * dose))
# Data likelihood
deaths = Binomial('deaths', n=n, p=theta, observed=[0, 1, 3, 5])
start = find_MAP()
step = NUTS(scaling=start)
trace = sample(2000, step, start=start, progressbar=True)
import matplotlib.pyplot as plt
death_fit = np.percentile(trace.theta,50,axis=0)
plt.plot(dose, death_fit,'g', marker='.', lw='1.25', ls='-', ms=5, mew=1)
plt.show()
If you want to plot dose vs pred_death, where pred_death is computed from the mean estimated values of alpha and beta, then do:
pred_death = 1./(1. + np.exp(-(mean_alpha + mean_beta * dose)))
plt.plot(dose, pred_death)
instead if you want to plot dose vs pred_death, where pred_death is computed taking into account the uncertainty in posterior for alpha and beta. Then probably the easiest way is to use the function sample_ppc:
May be something like
ppc = pm.sample_ppc(trace, samples=100, model=pmmodel)
for i in range(100):
plt.plot(dose, ppc['deaths'][i], 'bo', alpha=0.5)
Using Posterior Predictive Checks (ppc) is a way to check how well your model behaves by comparing the predictions of the model to your actual data. Here you have an example of sample_ppc
Other options could be to plot the mean value plus some interval of interest.

Categories

Resources