I'm looking for a way to generalize regression using pykalman from 1 to N regressors. I won't bother with online regression initially; I just want a toy example showing how to set up the Kalman filter for 2 regressors instead of 1, i.e. Y = c1 * x1 + c2 * x2 + const.
For the single regressor case, the following code works. My question is how to change the filter setup so it works for two regressors:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pykalman import KalmanFilter
if __name__ == "__main__":
    file_name = r'<path>\KalmanExample.txt'
    df = pd.read_csv(file_name, index_col=0)
    prices = df[['ETF', 'ASSET_1']]  # , 'ASSET_2']]

    delta = 1e-5
    trans_cov = delta / (1 - delta) * np.eye(2)
    obs_mat = np.vstack([prices['ETF'],
                         np.ones(prices['ETF'].shape)]).T[:, np.newaxis]

    kf = KalmanFilter(
        n_dim_obs=1,
        n_dim_state=2,
        initial_state_mean=np.zeros(2),
        initial_state_covariance=np.ones((2, 2)),
        transition_matrices=np.eye(2),
        observation_matrices=obs_mat,
        observation_covariance=1.0,
        transition_covariance=trans_cov
    )

    state_means, state_covs = kf.filter(prices['ASSET_1'].values)

    # Draw slope and intercept...
    pd.DataFrame(
        dict(
            slope=state_means[:, 0],
            intercept=state_means[:, 1]
        ), index=prices.index
    ).plot(subplots=True)
    plt.show()
The example file KalmanExample.txt contains the following data:
Date,ETF,ASSET_1,ASSET_2
2007-01-02,176.5,136.5,141.0
2007-01-03,169.5,115.5,143.25
2007-01-04,160.5,111.75,143.5
2007-01-05,160.5,112.25,143.25
2007-01-08,161.0,112.0,142.5
2007-01-09,155.5,110.5,141.25
2007-01-10,156.5,112.75,141.25
2007-01-11,162.0,118.5,142.75
2007-01-12,161.5,117.0,142.5
2007-01-15,160.0,118.75,146.75
2007-01-16,156.5,119.5,146.75
2007-01-17,155.0,120.5,145.75
2007-01-18,154.5,124.5,144.0
2007-01-19,155.5,126.0,142.75
2007-01-22,157.5,124.5,142.5
2007-01-23,161.5,124.25,141.75
2007-01-24,164.5,125.25,142.75
2007-01-25,164.0,126.5,143.0
2007-01-26,161.5,128.5,143.0
2007-01-29,161.5,128.5,140.0
2007-01-30,161.5,129.75,139.25
2007-01-31,161.5,131.5,137.5
2007-02-01,164.0,130.0,137.0
2007-02-02,156.5,132.0,128.75
2007-02-05,156.0,131.5,132.0
2007-02-06,159.0,131.25,130.25
2007-02-07,159.5,136.25,131.5
2007-02-08,153.5,136.0,129.5
2007-02-09,154.5,138.75,128.5
2007-02-12,151.0,136.75,126.0
2007-02-13,151.5,139.5,126.75
2007-02-14,155.0,169.0,129.75
2007-02-15,153.0,169.5,129.75
2007-02-16,149.75,166.5,128.0
2007-02-19,150.0,168.5,130.0
The single-regressor case produces the output below; for the two-regressor case I want a second "slope" plot representing c2.
Answer edited to reflect my revised understanding of the question.
If I understand correctly, you wish to model an observable output variable Y = ETF as a linear combination of two observable values, ASSET_1 and ASSET_2.
The coefficients of this regression are to be treated as the system states, i.e. ETF = x1*ASSET_1 + x2*ASSET_2 + x3, where x1 and x2 are the coefficients of assets 1 and 2 respectively, and x3 is the intercept. These coefficients are assumed to evolve slowly.
Code implementing this is given below; note that it just extends the existing example with one more regressor.
Note also that you can get quite different results by playing with the delta parameter. If it is set large (far from zero), the coefficients will change more rapidly and the reconstruction of the regressand will be near-perfect. If it is set small (very close to zero), the coefficients will evolve more slowly and the reconstruction of the regressand will be less perfect. You might want to look into the Expectation Maximisation algorithm, which pykalman supports.
CODE:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pykalman import KalmanFilter
if __name__ == "__main__":
    file_name = 'KalmanExample.txt'
    df = pd.read_csv(file_name, index_col=0)
    prices = df[['ETF', 'ASSET_1', 'ASSET_2']]

    delta = 1e-3
    trans_cov = delta / (1 - delta) * np.eye(3)
    obs_mat = np.vstack([prices['ASSET_1'], prices['ASSET_2'],
                         np.ones(prices['ASSET_1'].shape)]).T[:, np.newaxis]

    kf = KalmanFilter(
        n_dim_obs=1,
        n_dim_state=3,
        initial_state_mean=np.zeros(3),
        initial_state_covariance=np.ones((3, 3)),
        transition_matrices=np.eye(3),
        observation_matrices=obs_mat,
        observation_covariance=1.0,
        transition_covariance=trans_cov
    )

    # state_means, state_covs = kf.em(prices['ETF'].values).smooth(prices['ETF'].values)
    state_means, state_covs = kf.filter(prices['ETF'].values)

    # Reconstruct ETF from the coefficients and the ASSET_1 / ASSET_2 values:
    ETF_est = np.array([a.dot(b) for a, b in zip(np.squeeze(obs_mat), state_means)])

    # Draw slopes and intercept...
    pd.DataFrame(
        dict(
            slope1=state_means[:, 0],
            slope2=state_means[:, 1],
            intercept=state_means[:, 2],
        ), index=prices.index
    ).plot(subplots=True)
    plt.show()

    # Draw actual y and estimated y:
    pd.DataFrame(
        dict(
            ETF_est=ETF_est,
            ETF_act=prices['ETF'].values
        ), index=prices.index
    ).plot()
    plt.show()
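Instead of hand-tuning delta, you could also let pykalman estimate the noise covariances from the data via EM, as hinted at in the commented-out line above. A minimal sketch, reusing the kf and prices objects from the code above (the n_iter value is just illustrative):
# Sketch: estimate the noise covariances by EM, then smooth.
kf_em = kf.em(prices['ETF'].values, n_iter=5,
              em_vars=['transition_covariance', 'observation_covariance'])
state_means, state_covs = kf_em.smooth(prices['ETF'].values)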
PLOTS:
Related
I'm trying to code a demonstration of compressed sensing for my final year project but am getting poor image reconstruction when using the Lasso algorithm. I've relied on the following as a reference: http://www.pyrunner.com/weblog/2016/05/26/compressed-sensing-python/
However my code has some differences:
I use scikit-learn to perform a lasso optimisation (basis pursuit) as opposed to using cvxpy to perform an l_1 minimisation with an equality constraint as in the article.
I construct psi differently/more simply, testing seems to show that it's correct.
I use a different package to read and write the image.
import numpy as np
import scipy.fftpack as spfft
import scipy.ndimage as spimg
import imageio
from sklearn.linear_model import Lasso
x_orig = imageio.imread('gt40.jpg', pilmode='L') # read in grayscale
x = spimg.zoom(x_orig, 0.2) #zoom for speed
ny,nx = x.shape
k = round(nx * ny * 0.5) #50% sample
ri = np.random.choice(nx * ny, k, replace=False)
y = x.T.flat[ri] #y is the measured sample
# y = np.expand_dims(y, axis=1) ---- this doesn't seem to make a difference, was presumably required with cvxpy
psi = spfft.idct(np.identity(nx*ny), norm='ortho', axis=0) #my construction of psi
# psi = np.kron(
# spfft.idct(np.identity(nx), norm='ortho', axis=0),
# spfft.idct(np.identity(ny), norm='ortho', axis=0)
# )
# psi = 2*np.random.random_sample((nx*ny,nx*ny)) - 1
theta = psi[ri,:] #equivalent to phi*psi
lasso = Lasso(alpha=0.001, max_iter=10000)
lasso.fit(theta, y)
s = np.array(lasso.coef_)
x_recovered = psi @ s
x_recovered = x_recovered.reshape(nx, ny).T
x_recovered_final = x_recovered.astype('uint8') #recovered image is float64 and has negative values..
imageio.imwrite('gt40_recovered.jpg', x_recovered_final)
Unfortunately I'm not allowed to post images yet so here is a link to the original zoomed image, the image recovered with lasso and the image recovered with cvxpy (described later):
https://imgur.com/a/LROSug6
As you can see, not only is the recovery poor but the image is completely corrupted: the colours appear to be negative and the detail from the 50% sample is lost. I think I've tracked the problem down to the Lasso regression: it returns a vector that, when inverse transformed, has values that are not necessarily in the 0-255 range expected for the image. So the conversion from dtype float64 to uint8 is rather arbitrary (e.g. -55 wraps around to roughly 200).
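One idea I haven't fully verified: add back the intercept (sklearn's Lasso fits one by default, so the reconstruction should arguably include lasso.intercept_) and clip to the valid range before casting:
# Tentative fix, not verified on this image: include the fitted intercept
# and clip to [0, 255] before converting to uint8.
x_recovered = psi @ s + lasso.intercept_
x_recovered = x_recovered.reshape(nx, ny).T
x_recovered_final = np.clip(x_recovered, 0, 255).astype('uint8')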
Following this I tried swapping out lasso for the same optimisation as in the article (minimising the l_1 norm subject to theta*s=y using cvxpy):
import cvxpy as cvx
x_orig = imageio.imread('gt40.jpg', pilmode='L') # read in grayscale
x = spimg.zoom(x_orig, 0.2)
ny,nx = x.shape
k = round(nx * ny * 0.5)
ri = np.random.choice(nx * ny, k, replace=False)
y = x.T.flat[ri]
psi = spfft.idct(np.identity(nx*ny), norm='ortho', axis=0)
theta = psi[ri,:] #equivalent to phi*psi
#NEW CODE STARTS:
vx = cvx.Variable(nx * ny)
objective = cvx.Minimize(cvx.norm(vx, 1))
constraints = [theta @ vx == y]
prob = cvx.Problem(objective, constraints)
result = prob.solve(verbose=True)
s = np.array(vx.value).squeeze()
x_recovered = psi @ s
x_recovered = x_recovered.reshape(nx, ny).T
x_recovered_final = x_recovered.astype('uint8')
imageio.imwrite('gt40_recovered_altopt.jpg', x_recovered_final)
This took nearly 6 hours but finally I got a somewhat satisfactory result. However, I would like to demonstrate lasso if possible. Any help in getting the lasso to return appropriate values, or in converting its result appropriately, would be very much appreciated.
I am trying to fit a quadratic function to some data, and I'm trying to do this without using numpy's polyfit function.
Mathematically I tried to follow this website https://neutrium.net/mathematics/least-squares-fitting-of-a-polynomial/ but somehow I don't think I'm doing it right. If anyone could assist me that would be great, or if you could suggest another way to do it, that would also be awesome.
What I've tried so far:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
ones = np.ones(3)
A = np.array( ((0,1),(1,1),(2,1)))
xfeature = A.T[0]
squaredfeature = A.T[0] ** 2
b = np.array( (1,2,0), ndmin=2 ).T
b = b.reshape(3)
features = np.concatenate((np.vstack(ones), np.vstack(xfeature), np.vstack(squaredfeature)), axis = 1)
featuresc = features.copy()
print(features)
m_det = np.linalg.det(features)
print(m_det)
determinants = []
for i in range(3):
    featuresc.T[i] = b
    print(featuresc)
    det = np.linalg.det(featuresc)
    determinants.append(det)
    print(det)
    featuresc = features.copy()
determinants = determinants / m_det
print(determinants)
plt.scatter(A.T[0],b)
u = np.linspace(0,3,100)
plt.plot(u, u**2*determinants[2] + u*determinants[1] + determinants[0] )
p2 = np.polyfit(A.T[0],b,2)
plt.plot(u, np.polyval(p2,u), 'b--')
plt.show()
As you can see, my curve doesn't compare well to numpy's polyfit curve.
Update:
I went through my code and removed the silly mistakes, and now it works when I fit over 3 points, but I have no idea how to fit over more than three points.
This is the new code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
ones = np.ones(3)
A = np.array( ((0,1),(1,1),(2,1)))
xfeature = A.T[0]
squaredfeature = A.T[0] ** 2
b = np.array( (1,2,0), ndmin=2 ).T
b = b.reshape(3)
features = np.concatenate((np.vstack(ones), np.vstack(xfeature), np.vstack(squaredfeature)), axis = 1)
featuresc = features.copy()
print(features)
m_det = np.linalg.det(features)
print(m_det)
determinants = []
for i in range(3):
    featuresc.T[i] = b
    print(featuresc)
    det = np.linalg.det(featuresc)
    determinants.append(det)
    print(det)
    featuresc = features.copy()
determinants = determinants / m_det
print(determinants)
plt.scatter(A.T[0],b)
u = np.linspace(0,3,100)
plt.plot(u, u**2*determinants[2] + u*determinants[1] + determinants[0] )
p2 = np.polyfit(A.T[0],b,2)
plt.plot(u, np.polyval(p2,u), 'r--')
plt.show()
Instead of using Cramer's Rule, actually solve the system using least squares. Remember that Cramer's Rule will only work if the total number of points you have equals the desired order of polynomial plus 1.
If you don't have this, then Cramer's Rule will not work, as you're trying to find an exact solution to the problem. If you have more points, the method is unsuitable because you will create an overdetermined system of equations.
To adapt this to more points, numpy.linalg.lstsq would be a better fit, as it solves Ax = b by computing the vector x that minimizes the Euclidean norm of the residual. Therefore, remove the y values from the last column of the features matrix and use numpy.linalg.lstsq to solve for the coefficients:
import numpy as np
import matplotlib.pyplot as plt
ones = np.ones(4)
xfeature = np.asarray([0,1,2,3])
squaredfeature = xfeature ** 2
b = np.asarray([1,2,0,3])
features = np.concatenate((np.vstack(ones),np.vstack(xfeature),np.vstack(squaredfeature)), axis = 1) # Change - remove the y values
determinants = np.linalg.lstsq(features, b)[0] # Change - use least squares
plt.scatter(xfeature,b)
u = np.linspace(0,3,100)
plt.plot(u, u**2*determinants[2] + u*determinants[1] + determinants[0] )
plt.show()
I get this plot now, which matches what the dashed curve is in your graph, also matching what numpy.polyfit gives you:
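If you want to stay closer to the hand-rolled spirit of the question, the same coefficients can be obtained from the normal equations; a small sketch (equivalent to lstsq for a well-conditioned features matrix):
# Sketch: solve the normal equations (A^T A) x = A^T b directly.
# lstsq is more robust numerically, but this shows the underlying algebra.
coeffs = np.linalg.solve(features.T @ features, features.T @ b)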
The question:
I'm building a model on three time series where Y is the dependent variable, and X1 and X2 are the explanatory variables. Let's say there is strong reason to believe that the impact of X1 on Y increases compared to X2 as time goes by. How can you account for this in a multiple regression model?
(I'll show some code snippets as my question progresses, and you'll find a complete code section at the end.)
The details - a visual approach:
Here are three synthetic series where it seems that the impact of X1 on Y is very strong at the end of the period:
A basic model could be:
model = smf.ols(formula='Y ~ X1 + X2')
And if you plot the fitted values against the observed Y values, you'd get this:
And sticking to a visual evaluation of the model, it seems that it performs OK in the majority of the period, but very poorly after August sets in.
How can I account for this in a multiple regression model? With the help from this post I've tried to introduce an interaction term with both a linear and squared timestep in these models:
mod_timestep = Y ~ X1 + X2:timestep
mod_timestep2 = Y ~ X1 + X2:timestep2
By the way, these are the timesteps:
Results:
It seems that both approaches perform a bit better in the end, but considerably worse in the beginning.
Any other suggestions? I know there's a multitude of possibilities with lagged terms of the dependent variable and other models such as ARIMA or GARCH. But for a number of reasons I'd like to stay within the boundaries of multiple linear regression, and without lagged terms if possible.
Here's the whole thing for an easy copy&paste:
#%%
# imports
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.dates as mdates
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
###############################################################################
# Synthetic Data and plot
###############################################################################
# Function to build synthetic data
def sample():
    np.random.seed(26)
    date = pd.to_datetime("1st of Dec, 1999")
    nPeriod = 250
    dates = date + pd.to_timedelta(np.arange(nPeriod), 'D')
    # ppt = np.random.rand(1900)
    Y = np.random.normal(loc=0.0, scale=1.0, size=nPeriod).cumsum()
    X1 = np.random.normal(loc=0.0, scale=1.0, size=nPeriod).cumsum()
    X2 = np.random.normal(loc=0.0, scale=1.0, size=nPeriod).cumsum()
    df = pd.DataFrame({'Y': Y,
                       'X1': X1,
                       'X2': X2}, index=dates)
    # Adjust level of series
    df = df + 100
    # A subset
    df = df.tail(50)
    return df
# Function to make a couple of plots
def plot1(df, names, colors):
    # Plot
    fig, ax = plt.subplots(1)
    ax.set_facecolor('white')
    # Plot series (label=name so the legend has entries)
    counter = 0
    for name in names:
        print(name)
        ax.plot(df.index, df[name], lw=0.5, color=colors[counter], label=name)
        counter = counter + 1
    fig = ax.get_figure()
    # Assign months to X axis
    locator = mdates.MonthLocator()  # every month
    # Specify the X format
    fmt = mdates.DateFormatter('%b')
    X = plt.gca().xaxis
    X.set_major_locator(locator)
    X.set_major_formatter(fmt)
    ax.legend(loc='upper left', fontsize='x-small')
    fig.show()
# Build sample data
df = sample()
# PLot of input variables
plot1(df = df, names = ['Y', 'X1', 'X2'], colors = ['red', 'blue', 'green'])
###############################################################################
# Models
###############################################################################
# Add timesteps to original df
timestep = pd.Series(np.arange(1, len(df)+1), index = df.index)
timestep2 = timestep**2
newcols2 = list(df)
df = pd.concat([df, timestep, timestep2], axis = 1)
newcols2.extend(['timestep', 'timestep2'])
df.columns = newcols2
def add_models_to_df(df, models, modelNames):
    df_temp = df.copy()
    counter = 0
    for model in models:
        df_temp[modelNames[counter]] = smf.ols(formula=model, data=df).fit().fittedvalues
        counter = counter + 1
    return df_temp

# Models
df_models = add_models_to_df(df, models=['Y ~ X1 + X2', 'Y ~ X1 + X2:timestep', 'Y ~ X1 + X2:timestep2'],
                             modelNames=['mod_regular', 'mod_timestep', 'mod_timestep2'])
# Plots of models
plot1(df = df_models,
names = ['Y', 'mod_regular', 'mod_timestep', 'mod_timestep2'],
colors = ['red', 'black', 'green', 'grey'])
Edit 1 - screenshot from suggestion:
Using the other options on the link you provided would seem to be the better approach.
Using your functions with Y ~ X1 + X2*timestep and Y ~ X1 + X2*timestep2 will at least "catch" the increase in Y at the beginning, the decrease in the middle of the period, and the sudden increase at the end.
I can't post images yet, so you'll have to try yourself.
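If it helps, here is a minimal sketch of what I mean, reusing the add_models_to_df and plot1 helpers from your code (the model names are just illustrative):
# Sketch: full interactions; X2*timestep expands to X2 + timestep + X2:timestep
df_models2 = add_models_to_df(
    df,
    models=['Y ~ X1 + X2*timestep', 'Y ~ X1 + X2*timestep2'],
    modelNames=['mod_interact1', 'mod_interact2'])
plot1(df=df_models2,
      names=['Y', 'mod_interact1', 'mod_interact2'],
      colors=['red', 'black', 'green'])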
I have clustered my data (12000, 3) using sklearn's Gaussian mixture model algorithm (GMM). I have 3 clusters, and each point of my data represents a molecular structure. I would like to know how I could sample each cluster separately. I have tried with the function:
gmm = GMM(n_components=3).fit(Data)
gmm.sample(n_samples=20)
but it samples from the whole mixture distribution, whereas I need samples from each individual component.
Well, this is not that easy, since you need to calculate the eigenvectors of all covariance matrices. Here is some example code for a problem I studied:
import numpy as np
from scipy.stats import multivariate_normal
import random
from operator import truediv
import itertools
from scipy import linalg
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn import mixture
#import some data which can be used for gmm
mix = np.loadtxt("mixture.txt", usecols=(0,1), unpack=True)
#print(mix.shape)
color_iter = itertools.cycle(['navy', 'c', 'cornflowerblue', 'gold',
'darkorange'])
def plot_results(X, Y_, means, covariances, index, title):
    # function for plotting the gaussians
    splot = plt.subplot(2, 1, 1 + index)
    for i, (mean, covar, color) in enumerate(zip(
            means, covariances, color_iter)):
        v, w = linalg.eigh(covar)
        v = 2. * np.sqrt(2.) * np.sqrt(v)
        u = w[0] / linalg.norm(w[0])
        # as the DP will not use every component it has access to
        # unless it needs it, we shouldn't plot the redundant
        # components.
        if not np.any(Y_ == i):
            continue
        plt.scatter(X[Y_ == i, 0], X[Y_ == i, 1], .8, color=color)
        # Plot an ellipse to show the Gaussian component
        angle = np.arctan(u[1] / u[0])
        angle = 180. * angle / np.pi  # convert to degrees
        ell = mpl.patches.Ellipse(mean, v[0], v[1], 180. + angle, color=color)
        ell.set_clip_box(splot.bbox)
        ell.set_alpha(0.5)
        splot.add_artist(ell)
    plt.xlim(-4., 3.)
    plt.ylim(-4., 2.)
gmm = mixture.GaussianMixture(n_components=3, covariance_type='full').fit(mix.T)
print(gmm.predict(mix.T))
plot_results(mix.T, gmm.predict(mix.T), gmm.means_, gmm.covariances_, 0,
'Gaussian Mixture')
So for my problem the resulting plot looked like this:
Edit: here is the answer to your comment. I would use pandas for this. Assume X is your feature matrix and y are your labels; then:
import pandas as pd
y_pred = gmm.predict(X)
# wrap in pandas objects in case X, y and y_pred are plain numpy arrays
df_all_info = pd.concat([pd.DataFrame(X), pd.Series(y, name='y'),
                         pd.Series(y_pred, name='y_pred')], axis=1)
In the resulting dataframe you can check all the information you want; you can even exclude the samples the algorithm misclassified with:
df_wrong = df_all_info[df_all_info['name of y-column'] != df_all_info['name of y_pred column']]
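Regarding the original question of drawing from one component at a time: a minimal sketch that should work when covariance_type='full', using numpy directly since gmm.sample draws from the whole mixture (the component index k and sample count are illustrative):
# Sketch: draw 20 samples from component k of the fitted mixture.
# Assumes covariance_type='full', so gmm.covariances_[k] is a full matrix.
k = 0
samples_k = np.random.multivariate_normal(mean=gmm.means_[k],
                                          cov=gmm.covariances_[k],
                                          size=20)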
I use sklearn.linear_model.ElasticNetCV and I would like to get a figure similar to what Matlab provides with lassoPlot with plottype='CV', or R's plot(cv.glmnet(x,y)), i.e., a plot of the cross-validation errors over various alphas (note: in Matlab and R this parameter is called lambda). Here is an example:
import sklearn.linear_model as lm
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import scipy.stats as stats
# toy example
# generate 200 samples of five-dimensional artificial data X from a
# exponential distributions with various means:
X = np.zeros((200, 5))
for col in range(5):
    # draw 200 samples per column (size argument needed, otherwise a single value is broadcast)
    X[:, col] = stats.expon.rvs(scale=1.0/(col+1), size=200)
# generate response data Y = X*r + eps where r has just two nonzero
# components, and the noise eps is normal with standard deviation 0.1:
r = np.array( [ 0, 2, 0, -3, 0 ] )
Y = np.dot(X, r) + np.random.randn(200) * 0.1  # np.random.randn instead of the removed sp.randn alias
enet = lm.ElasticNetCV()
alphas,coefs, _ = enet.path( X, Y )
# plot regulization paths
plt.plot( -np.log10(alphas), coefs.T, linestyle='-' )
plt.show()
I would also like to plot, in a separate figure, the cross-validation error for each alpha. But it seems that ElasticNetCV.path() does not return an MSE vector. Is there functionality in sklearn similar to Matlab's lassoPlot with plottype='CV' (see http://de.mathworks.com/help/stats/lasso-and-elastic-net.html) or R's cv.glmnet(x,y) (https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html)? Alternatively, I would implement it using sklearn.cross_validation. Do you have any suggestions?
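To be concrete, this is roughly the kind of plot I'm after; a sketch based on the mse_path_ and alphas_ attributes that a fitted ElasticNetCV exposes (if I read the docs correctly):
# Sketch: mean CV error per alpha from a fitted ElasticNetCV
enet = lm.ElasticNetCV(cv=5)
enet.fit(X, Y)
mean_mse = enet.mse_path_.mean(axis=1)  # average over folds
plt.plot(-np.log10(enet.alphas_), mean_mse, marker='o')
plt.axvline(-np.log10(enet.alpha_), linestyle='--', color='k')  # chosen alpha
plt.xlabel('-log10(alpha)')
plt.ylabel('mean CV MSE')
plt.show()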