Fitting a quadratic function in python without numpy polyfit - python

I am trying to fit a quadratic function to some data, and I'm trying to do this without using numpy's polyfit function.
Mathematically I tried to follow this website https://neutrium.net/mathematics/least-squares-fitting-of-a-polynomial/ but somehow I don't think that I'm doing it right. If anyone could assist me that would be great, or If you could suggest another way to do it that would also be awesome.
What I've tried so far:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
ones = np.ones(3)
A = np.array( ((0,1),(1,1),(2,1)))
xfeature = A.T[0]
squaredfeature = A.T[0] ** 2
b = np.array( (1,2,0), ndmin=2 ).T
b = b.reshape(3)
features = np.concatenate((np.vstack(ones), np.vstack(xfeature), np.vstack(squaredfeature)), axis = 1)
featuresc = features.copy()
print(features)
m_det = np.linalg.det(features)
print(m_det)
determinants = []
for i in range(3):
featuresc.T[i] = b
print(featuresc)
det = np.linalg.det(featuresc)
determinants.append(det)
print(det)
featuresc = features.copy()
determinants = determinants / m_det
print(determinants)
plt.scatter(A.T[0],b)
u = np.linspace(0,3,100)
plt.plot(u, u**2*determinants[2] + u*determinants[1] + determinants[0] )
p2 = np.polyfit(A.T[0],b,2)
plt.plot(u, np.polyval(p2,u), 'b--')
plt.show()
As you can see my curve doesn't compare well to nnumpy's polyfit curve.
Update:
I went through my code and removed all the stupid mistakes and now it works, when I try to fit it over 3 points, but I have no idea how to fit over more than three points.
This is the new code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
ones = np.ones(3)
A = np.array( ((0,1),(1,1),(2,1)))
xfeature = A.T[0]
squaredfeature = A.T[0] ** 2
b = np.array( (1,2,0), ndmin=2 ).T
b = b.reshape(3)
features = np.concatenate((np.vstack(ones), np.vstack(xfeature), np.vstack(squaredfeature)), axis = 1)
featuresc = features.copy()
print(features)
m_det = np.linalg.det(features)
print(m_det)
determinants = []
for i in range(3):
featuresc.T[i] = b
print(featuresc)
det = np.linalg.det(featuresc)
determinants.append(det)
print(det)
featuresc = features.copy()
determinants = determinants / m_det
print(determinants)
plt.scatter(A.T[0],b)
u = np.linspace(0,3,100)
plt.plot(u, u**2*determinants[2] + u*determinants[1] + determinants[0] )
p2 = np.polyfit(A.T[0],b,2)
plt.plot(u, np.polyval(p2,u), 'r--')
plt.show()

Instead using Cramer's Rule, actually solve the system using least squares. Remember that Cramer's Rule will only work if the total number of points you have equals the desired order of polynomial plus 1.
If you don't have this, then Cramer's Rule will not work as you're trying to find an exact solution to the problem. If you have more points, the method is unsuitable as we will create an overdetermined system of equations.
To adapt this to more points, numpy.linalg.lstsq would be a better fit as it solves the solution to the Ax = b by computing the vector x that minimizes the Euclidean norm using the matrix A. Therefore, remove the y values from the last column of the features matrix and solve for the coefficients and use numpy.linalg.lstsq to solve for the coefficients:
import numpy as np
import matplotlib.pyplot as plt
ones = np.ones(4)
xfeature = np.asarray([0,1,2,3])
squaredfeature = xfeature ** 2
b = np.asarray([1,2,0,3])
features = np.concatenate((np.vstack(ones),np.vstack(xfeature),np.vstack(squaredfeature)), axis = 1) # Change - remove the y values
determinants = np.linalg.lstsq(features, b)[0] # Change - use least squares
plt.scatter(xfeature,b)
u = np.linspace(0,3,100)
plt.plot(u, u**2*determinants[2] + u*determinants[1] + determinants[0] )
plt.show()
I get this plot now, which matches what the dashed curve is in your graph, also matching what numpy.polyfit gives you:

Related

Obtaining an isosurface from 3D data and the corresponding indices

I have a 3D numpy array of temperature values on a grid. From this I can compute the gradients using dTdx, dTdy, dYdz = np.gradient(T). Now I'm only interested in the values of the gradients on the isosurface where the temperature is 900. What I want to do is something like (pseudo-codish):
import nympy as np
def regular(x,y,z,q=100,k=175,a=7.1e-5):
R = np.sqrt(x**2+y**2+z**2)
return 100 / (2*np.pi*k) * (1/R) * np.exp(-0.5/a*(R+x))
x = np.arange(-1.5,0.5+res/2,res)*1e-3
y = np.arange(-1.0,1.0+res/2,res)*1e-3
z = np.arange(0.0,0.5+res/2,res)*1e-3
Y,X,Z = np.meshgrid(y,x,z)
T = regular(X,Y,Z)
dTdx, dTdy, dYdz = np.gradient(T)
(xind,yind,zind) = <package>.get_contour_indices(X,Y,Z,T,value=900)
x_gradients_at_isosurface = dTdx[xind,yind,zind]
...
I've tried:
import numpy as np
from skimage import measure
contour_data = measure.find_contours(T[:,:,0],900)
contour_data = np.int_(np.round(contour_data[0]))
xs,ys = contour_data[:,0],contour_data[:,1]
gradients_of_interest = np.array([G[x,y,0] for x,y in zip( xs,ys )])
which works fine, but only works for 2D data. I'm looking for the 3D equivalent. I've found the following:
import plotly.graph_objects as go
surf = go.Isosurface(x=X.flatten(),y=Y.flatten(),z=Z.flatten(),value=T.flatten(),isomin=900,isomax=900)
fig = go.Figure(data=surf)
plt.show()
But I'm not interested in plotting it. I want to know the indices where the temperature is T=900 so I can use it on the gradients. Any ideas?
You need skimage.measure.marching_cubes.

Multiple regression with pykalman?

I'm looking for a way to generalize regression using pykalman from 1 to N regressors. We will not bother about online regression initially - I just want a toy example to set up the Kalman filter for 2 regressors instead of 1, i.e. Y = c1 * x1 + c2 * x2 + const.
For the single regressor case, the following code works. My question is how to change the filter setup so it works for two regressors:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pykalman import KalmanFilter
if __name__ == "__main__":
file_name = '<path>\KalmanExample.txt'
df = pd.read_csv(file_name, index_col = 0)
prices = df[['ETF', 'ASSET_1']] #, 'ASSET_2']]
delta = 1e-5
trans_cov = delta / (1 - delta) * np.eye(2)
obs_mat = np.vstack( [prices['ETF'],
np.ones(prices['ETF'].shape)]).T[:, np.newaxis]
kf = KalmanFilter(
n_dim_obs=1,
n_dim_state=2,
initial_state_mean=np.zeros(2),
initial_state_covariance=np.ones((2, 2)),
transition_matrices=np.eye(2),
observation_matrices=obs_mat,
observation_covariance=1.0,
transition_covariance=trans_cov
)
state_means, state_covs = kf.filter(prices['ASSET_1'].values)
# Draw slope and intercept...
pd.DataFrame(
dict(
slope=state_means[:, 0],
intercept=state_means[:, 1]
), index=prices.index
).plot(subplots=True)
plt.show()
The example file KalmanExample.txt contains the following data:
Date,ETF,ASSET_1,ASSET_2
2007-01-02,176.5,136.5,141.0
2007-01-03,169.5,115.5,143.25
2007-01-04,160.5,111.75,143.5
2007-01-05,160.5,112.25,143.25
2007-01-08,161.0,112.0,142.5
2007-01-09,155.5,110.5,141.25
2007-01-10,156.5,112.75,141.25
2007-01-11,162.0,118.5,142.75
2007-01-12,161.5,117.0,142.5
2007-01-15,160.0,118.75,146.75
2007-01-16,156.5,119.5,146.75
2007-01-17,155.0,120.5,145.75
2007-01-18,154.5,124.5,144.0
2007-01-19,155.5,126.0,142.75
2007-01-22,157.5,124.5,142.5
2007-01-23,161.5,124.25,141.75
2007-01-24,164.5,125.25,142.75
2007-01-25,164.0,126.5,143.0
2007-01-26,161.5,128.5,143.0
2007-01-29,161.5,128.5,140.0
2007-01-30,161.5,129.75,139.25
2007-01-31,161.5,131.5,137.5
2007-02-01,164.0,130.0,137.0
2007-02-02,156.5,132.0,128.75
2007-02-05,156.0,131.5,132.0
2007-02-06,159.0,131.25,130.25
2007-02-07,159.5,136.25,131.5
2007-02-08,153.5,136.0,129.5
2007-02-09,154.5,138.75,128.5
2007-02-12,151.0,136.75,126.0
2007-02-13,151.5,139.5,126.75
2007-02-14,155.0,169.0,129.75
2007-02-15,153.0,169.5,129.75
2007-02-16,149.75,166.5,128.0
2007-02-19,150.0,168.5,130.0
The single regressor case provides the following output and for the two-regressor case I want a second "slope"-plot representing C2.
Answer edited to reflect my revised understanding of the question.
If I understand correctly you wish to model an observable output variable Y = ETF, as a linear combination of two observable values; ASSET_1, ASSET_2.
The coefficients of this regression are to be treated as the system states, i.e. ETF = x1*ASSET_1 + x2*ASSET_2 + x3, where x1 and x2 are the coefficients assets 1 and 2 respectively, and x3 is the intercept. These coefficients are assumed to evolve slowly.
Code implementing this is given below, note that this is just extending the existing example to have one more regressor.
Note also that you can get quite different results by playing with the delta parameter. If this is set large (far from zero), then the coefficients will change more rapidly, and the reconstruction of the regressand will be near-perfect. If it is set small (very close to zero) then the coefficients will evolve more slowly and the reconstruction of the regressand will be less perfect. You might want to look into the Expectation Maximisation algorithm - supported by pykalman.
CODE:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pykalman import KalmanFilter
if __name__ == "__main__":
file_name = 'KalmanExample.txt'
df = pd.read_csv(file_name, index_col = 0)
prices = df[['ETF', 'ASSET_1', 'ASSET_2']]
delta = 1e-3
trans_cov = delta / (1 - delta) * np.eye(3)
obs_mat = np.vstack( [prices['ASSET_1'], prices['ASSET_2'],
np.ones(prices['ASSET_1'].shape)]).T[:, np.newaxis]
kf = KalmanFilter(
n_dim_obs=1,
n_dim_state=3,
initial_state_mean=np.zeros(3),
initial_state_covariance=np.ones((3, 3)),
transition_matrices=np.eye(3),
observation_matrices=obs_mat,
observation_covariance=1.0,
transition_covariance=trans_cov
)
# state_means, state_covs = kf.em(prices['ETF'].values).smooth(prices['ETF'].values)
state_means, state_covs = kf.filter(prices['ETF'].values)
# Re-construct ETF from coefficients and 'ASSET_1' and ASSET_2 values:
ETF_est = np.array([a.dot(b) for a, b in zip(np.squeeze(obs_mat), state_means)])
# Draw slope and intercept...
pd.DataFrame(
dict(
slope1=state_means[:, 0],
slope2=state_means[:, 1],
intercept=state_means[:, 2],
), index=prices.index
).plot(subplots=True)
plt.show()
# Draw actual y, and estimated y:
pd.DataFrame(
dict(
ETF_est=ETF_est,
ETF_act=prices['ETF'].values
), index=prices.index
).plot()
plt.show()
PLOTS:

Filtering 1D numpy arrays in Python

Explanation:
I have two numpy arrays: dataX and dataY, and I am trying to filter each array to reduce the noise. The image shown below shows the actual input data (blue dots) and an example of what I want it to be like(red dots). I do not need the filtered data to be as perfect as in the example but I do want it to be as straight as possible. I have provided sample data in the code.
What I have tried:
Firstly, you can see that the data isn't 'continuous', so I first divided them into individual 'segments' ( 4 of them in this example), and then applied a filter to each 'segment'. Someone suggested that I use a Savitzky-Golay filter. The full, run-able code is below:
import scipy as sc
import scipy.signal
import numpy as np
import matplotlib.pyplot as plt
# Sample Data
ydata = np.array([1,0,1,2,1,2,1,0,1,1,2,2,0,0,1,0,1,0,1,2,7,6,8,6,8,6,6,8,6,6,8,6,6,7,6,5,5,6,6, 10,11,12,13,12,11,10,10,11,10,12,11,10,10,10,10,12,12,10,10,17,16,15,17,16, 17,16,18,19,18,17,16,16,16,16,16,15,16])
xdata = np.array([1,2,3,1,5,4,7,8,6,10,11,12,13,10,12,13,17,16,19,18,21,19,23,21,25,20,26,27,28,26,26,26,29,30,30,29,30,32,33, 1,2,3,1,5,4,7,8,6,10,11,12,13,10,12,13,17,16,19,18,21,19,23,21,25,20,26,27,28,26,26,26,29,30,30,29,30,32])
# Used a diff array to find where there is a big change in Y.
# If there's a big change in Y, then there must be a change of 'segment'.
diffy = np.diff(ydata)
# Create empty numpy arrays to append values into
filteredX = np.array([])
filteredY = np.array([])
# Chose 3 to be the value indicating the change in Y
index = np.where(diffy >3)
# Loop through the array
start = 0
for i in range (0, (index[0].size +1) ):
# Check if last segment is reached
if i == index[0].size:
print xdata[start:]
partSize = xdata[start:].size
# Window length must be an odd integer
if partSize % 2 == 0:
partSize = partSize - 1
filteredDataX = sc.signal.savgol_filter(xdata[start:], partSize, 3)
filteredDataY = sc.signal.savgol_filter(ydata[start:], partSize, 3)
filteredX = np.append(filteredX, filteredDataX)
filteredY = np.append(filteredY, filteredDataY)
else:
print xdata[start:index[0][i]]
partSize = xdata[start:index[0][i]].size
if partSize % 2 == 0:
partSize = partSize - 1
filteredDataX = sc.signal.savgol_filter(xdata[start:index[0][i]], partSize, 3)
filteredDataY = sc.signal.savgol_filter(ydata[start:index[0][i]], partSize, 3)
start = index[0][i]
filteredX = np.append(filteredX, filteredDataX)
filteredY = np.append(filteredY, filteredDataY)
# Plots
plt.plot(xdata,ydata, 'bo', label = 'Input Data')
plt.plot(filteredX, filteredY, 'ro', label = 'Filtered Data')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Result')
plt.legend()
plt.show()
This is my result:
When each point is connected, the result looks as follows.
I have played around with the order, but it seems like a third order gave the best result.
I have also tried these filters, among a few others:
scipy.signal.medfilt
scipy.ndimage.filters.uniform_filter1d
But so far none of the filters I have tried were close to what I really wanted. What is the best way to filter data such as this? Looking forward to your help.
One way to get something looking close to your ideal would be clustering + linear regression.
Note that you have to provide the number of clusters and I also cheated a bit in scaling up y before clustering.
import numpy as np
from scipy import cluster, stats
ydata = np.array([1,0,1,2,1,2,1,0,1,1,2,2,0,0,1,0,1,0,1,2,7,6,8,6,8,6,6,8,6,6,8,6,6,7,6,5,5,6,6, 10,11,12,13,12,11,10,10,11,10,12,11,10,10,10,10,12,12,10,10,17,16,15,17,16, 17,16,18,19,18,17,16,16,16,16,16,15,16])
xdata = np.array([1,2,3,1,5,4,7,8,6,10,11,12,13,10,12,13,17,16,19,18,21,19,23,21,25,20,26,27,28,26,26,26,29,30,30,29,30,32,33, 1,2,3,1,5,4,7,8,6,10,11,12,13,10,12,13,17,16,19,18,21,19,23,21,25,20,26,27,28,26,26,26,29,30,30,29,30,32])
def split_to_lines(x, y, k):
yo = np.empty_like(y, dtype=float)
# get the cluster centers and the labels for each point
centers, map_ = cluster.vq.kmeans2(np.array((x, y * 2)).T.astype(float), k)
# for each cluster, use the labels to select the points belonging to
# the cluster and do a linear regression
for i in range(k):
slope, interc, *_ = stats.linregress(x[map_==i], y[map_==i])
# use the regression parameters to construct y values on the
# best fit line
yo[map_==i] = x[map_==i] * slope + interc
return yo
import pylab
pylab.plot(xdata, ydata, 'or')
pylab.plot(xdata, split_to_lines(xdata, ydata, 4), 'ob')
pylab.show()

How can I sample the different components of a GMM distribution?

I have clustered my data (12000, 3) using sklearn Gaussian mixture model algorithm (GMM). I have 3 clusters. Each point of my data represents a molecular structure. I would like to know how could I sampled each cluster. I have tried with the function:
gmm = GMM(n_components=3).fit(Data)
gmm.sample(n_samples=20)
but it does preform a sampling of the whole distribution, but I need a sample of each one of the components.
Well this is not that easy since you need to calculate the eigenvectors of all covariance matrices. Here is some example code for a problem I studied
import numpy as np
from scipy.stats import multivariate_normal
import random
from operator import truediv
import itertools
from scipy import linalg
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn import mixture
#import some data which can be used for gmm
mix = np.loadtxt("mixture.txt", usecols=(0,1), unpack=True)
#print(mix.shape)
color_iter = itertools.cycle(['navy', 'c', 'cornflowerblue', 'gold',
'darkorange'])
def plot_results(X, Y_, means, covariances, index, title):
#function for plotting the gaussians
splot = plt.subplot(2, 1, 1 + index)
for i, (mean, covar, color) in enumerate(zip(
means, covariances, color_iter)):
v, w = linalg.eigh(covar)
v = 2. * np.sqrt(2.) * np.sqrt(v)
u = w[0] / linalg.norm(w[0])
# as the DP will not use every component it has access to
# unless it needs it, we shouldn't plot the redundant
# components.
if not np.any(Y_ == i):
continue
plt.scatter(X[Y_ == i, 0], X[Y_ == i, 1], .8, color=color)
# Plot an ellipse to show the Gaussian component
angle = np.arctan(u[1] / u[0])
angle = 180. * angle / np.pi # convert to degrees
ell = mpl.patches.Ellipse(mean, v[0], v[1], 180. + angle, color=color)
ell.set_clip_box(splot.bbox)
ell.set_alpha(0.5)
splot.add_artist(ell)
plt.xlim(-4., 3.)
plt.ylim(-4., 2.)
gmm = mixture.GaussianMixture(n_components=3, covariance_type='full').fit(mix.T)
print(gmm.predict(mix.T))
plot_results(mix.T, gmm.predict(mix.T), gmm.means_, gmm.covariances_, 0,
'Gaussian Mixture')
So for my problem the resulting plot looked like this:
Edit: here the answer to your comment. I would use pandas to do this. Assume X is your feature matrix and y are your labels, then
import pandas as pd
y_pred = gmm.predict(X)
df_all_info = pd.concat([X,y,y_pred], axis=1)
In the resulting dataframe you can check all the information you want, you can even just exclude the samples the algorithm misclassified with:
df_wrong = df_all_info[df_all_info['name of y-column'] != df_all_info['name of y_pred column']]

SKLearn ElasticNetCV: Looking for a similar cross-validation-error-plot to Matlab's lassoPlot or R's plot(cv.glmnet(x,y))

I use sklearn.linear_model.ElasticNetCV and I would like to get a similar figure as Matlab provides with lassoPlot with plottype=CV or R's plot(cv.glmnet(x,y)), i.e., a plot of the cross validations errors over various alphas (note, in Matlab and R this parameter is called lambda). Here is an example:
import sklearn.linear_model as lm
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import scipy.stats as stats
# toy example
# generate 200 samples of five-dimensional artificial data X from a
# exponential distributions with various means:
X = np.zeros( (200,5 ) )
for col in range(5):
X[ :, col ] = stats.expon.rvs( scale=1.0/(col+1) )
# generate response data Y = X*r + eps where r has just two nonzero
# components, and the noise eps is normal with standard deviation 0.1:
r = np.array( [ 0, 2, 0, -3, 0 ] )
Y = np.dot(X,r) + sp.randn( 200 )*0.1
enet = lm.ElasticNetCV()
alphas,coefs, _ = enet.path( X, Y )
# plot regulization paths
plt.plot( -np.log10(alphas), coefs.T, linestyle='-' )
plt.show()
I would like to plot also in a separate figure the cross validation error for each alpha. But It seems that ElasticNetCV.path() does not return a mse vector. Is there a simliar functionality in sklearn to Matlab.lassoPlot with plottype='CV' see: http://de.mathworks.com/help/stats/lasso-and-elastic-net.html or R's cv.glmnet(x,y) https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html. Alternatively, I would implement it using sklearn.cross_validation. Do you have any suggestions?

Categories

Resources