Least squares not working for a set of y's - python

I am trying to run a least-squares algorithm using numpy and am having trouble. Can someone please tell me what I am doing wrong in the code below? When I set y to y = np.power(X, 1) + np.random.rand(20)*3 or some other reasonable function of X, everything works fine. But for the particular y defined by the values given below, the plot I get makes no sense.
Is this some kind of numerical problem?
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
X = np.arange(1,21)
y = np.array([-0.00454712, -0.00457764, -0.0045166 , -0.00442505, -0.00427246,
-0.00411987, -0.00378418, -0.003479 , -0.00314331, -0.00259399,
-0.00213623, -0.00146484, -0.00082397, -0.00030518, 0.00027466,
0.00076294, 0.00146484, 0.00192261, 0.00247192, 0.00314331])
#y = np.power(X, 1) + np.random.rand(20)*3
w = np.linalg.lstsq(X.reshape(20, 1), y)[0]
plt.plot(X, y, 'red')
plt.plot(X, X*w[0], 'blue')
plt.show()

Are you sure there is a linear relationship between what you are fitting and the y variable data?
Using the code (y = np.power(X, 1) + np.random.rand(20)*3) from your example, you have a linear relationship built into the y variable itself (with some noise), which allows your plot to track relatively well with the linear fit.
X = np.arange(1,21)
y = np.power(X, 1) + np.random.rand(20)*3
w = np.linalg.lstsq(X.reshape(20, 1), y)[0]
plt.plot(X, y, 'red')
plt.plot(X, X*w[0], 'blue')
plt.show()
However, when you switch to something like your y variable
y = np.array([-0.00454712, -0.00457764, -0.0045166 , -0.00442505, -0.00427246,
-0.00411987, -0.00378418, -0.003479 , -0.00314331, -0.00259399,
-0.00213623, -0.00146484, -0.00082397, -0.00030518, 0.00027466,
0.00076294, 0.00146484, 0.00192261, 0.00247192, 0.00314331])
You end up with something less easy to fit.
Looking at the documentation, if you are attempting to fit this set of values, you will need to build in a constant (intercept) term, which lstsq does not do by default.
The docs for lstsq state:
Return the least-squares solution to a linear matrix equation.
Solves the equation a x = b
If you really want to fit the data to a linear equation, running code like the below will give you something that almost matches your original data. However, the process behind this data seems to have a polynomial/exponential driver, which would make polyfit a better choice.
X = np.arange(1,21)
y = np.array([-0.00454712, -0.00457764, -0.0045166 , -0.00442505, -0.00427246,
-0.00411987, -0.00378418, -0.003479 , -0.00314331, -0.00259399,
-0.00213623, -0.00146484, -0.00082397, -0.00030518, 0.00027466,
0.00076294, 0.00146484, 0.00192261, 0.00247192, 0.00314331])
#y = np.power(X, 1) + np.random.rand(20)*3
X2 = np.vstack([X, np.ones(len(X))]).T
w = np.linalg.lstsq(X2, y)[0]
plt.plot(X, y, 'red')
plt.plot(X, X*w[0] + w[1], 'blue')
plt.show()
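If you want to follow the polyfit route instead, a minimal sketch might look like the below; the degree of 2 is an assumption here, picked only because the data curve gently upward, so adjust it to your data.
# fit and evaluate a quadratic (degree is an assumption, tune as needed)
coeffs = np.polyfit(X, y, 2)   # highest-order coefficient first
plt.plot(X, y, 'red')
plt.plot(X, np.polyval(coeffs, X), 'blue')
plt.show()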

Related

Python solving equation and graphing the results

I am trying to solve Kepler's Equation in Python with known 'x' and 'e' values, trying to find 'y'. The equation is x = y - (e*sin(y)). I need to step through an array of x with a range of min=0 and max=pi, with 1000 steps, and a value of e=0.1, solve for y, and plot the graph. I am getting an error that 'y' is undefined, but 'y' is what I am trying to find, so I am stuck.
x = np.linspace(0, math.pi, 1000)
e = 0.1
y = Symbol('y')
Solve(x = y-(e*math.sin(y)))
FIG1, MA = plt.plots(figsize=(4, 3))
MA.plot(x, y)
MA.set_xlabel('Mean Anomely')
MA.set_ylabel('Mean Eccentricity')
MA.set_title('Keplers equation')
plt.show()
You are looking for the inverse function of x = y - (e*sin(y)) to get y(x). You will not find a symbolic solution, so you need to solve it numerically. A standard trick for this is to compute values of x for given y and then interpolate. This works because the function is monotonic and continuous.
import numpy as np
import matplotlib.pyplot as plt
e = 0.1
# select many points for interpolation, e.g. 2000
E_values = np.linspace(0, np.pi, 2000)
M_values = E_values - e*np.sin(E_values)
# do the interpolation on your selected points for M
M_interp = np.linspace(0, np.pi, 1000)
E_interp = np.interp(M_interp, M_values, E_values)
# plot the stuff
fig, ax = plt.subplots(figsize=(4, 3))
ax.plot(M_interp, E_interp)
ax.set_xlabel('Mean Anomaly')
ax.set_ylabel('Eccentric Anomaly')
Note that I used the customary symbols M and E for the mean anomaly and eccentric anomaly.
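As a quick sanity check (a minimal sketch, assuming the arrays from the snippet above are still in scope), plugging the interpolated E back into Kepler's equation should reproduce M up to a small interpolation error:
# residual of Kepler's equation at the interpolated points; should be tiny
residual = np.max(np.abs(M_interp - (E_interp - e*np.sin(E_interp))))
print(residual)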

Linear regression minimizing errors only above the linear

I have a dataset that resembles the data created in the MWE below:
from matplotlib import pyplot as plt
import numpy as np
sz=100
x = np.linspace(-1, 1, sz)
mean = -np.sign(x)
noise = np.random.randn(*x.shape)
K = -2
y_true = K*x
y = y_true + mean + noise
plt.scatter(x, y, label="Data with error")
plt.plot(x, y_true, "-", label="True line")
plt.grid()
That is, the errors around the line I want are mostly negative for x>0 and mostly positive for x<0. What I'm looking for is a way to estimate the coefficient K (which in this case is -2).
Really I think the way to do it would be to minimize the error only of the points that fall above the line for x<0 and below the line for x>0, but I'm not sure how to go about it effectively in Python, since everything I can think of involves iterative processes which are slow in Python.
Basically you want to include something that can account for the mean variable in your data generating model. You can do this by modeling a discontinuity at the point x=0 by including a variable in your model that is 0 where x < 0 and 1 where x > 0.
We can even just include the "mean" variable itself and get the same model (with a different interpretation for the second coefficient). Here is a linear model that recovers the correct value for the slope of this discontinuous line. Note that this assumes the slope is the same on the right side of 0 as the left side.
from sklearn.linear_model import LinearRegression
X = np.array([x, mean]).T
reg = LinearRegression().fit(X, y)
print(reg.coef_)
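With this setup, reg.coef_ should hold two entries: the coefficient on x, which should come out close to the true slope K = -2, and the coefficient on the step variable, which should be close to 1, since y was generated as K*x + mean + noise.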
Here is my attempt where I (A) fit all the data to a straight line, then (B) separate the data by two criteria: whether x is greater or less than zero, and whether the y value lies above or below that straight line, and finally (C) fit the separated data. The slope here is -2.417 and will vary from run to run depending on the random data.
from matplotlib import pyplot as plt
import numpy as np
sz=100
x = np.linspace(-1, 1, sz)
mean = -np.sign(x)
noise = np.random.randn(*x.shape)
K = -2
y_true = K*x
y = y_true + mean + noise
plt.scatter(x, y, label="Data with error")
plt.plot(x, y_true, "-", label="True line")
###############################
# new section for calculating the new line
allDataFirstOrderParameters = np.polyfit(x, y, 1)
allDataFirstOrderErrors = y - np.polyval(allDataFirstOrderParameters, x)
newX = []
newY = []
for i in range(len(x)):
    if x[i] < 0 and allDataFirstOrderErrors[i] < 0:
        newX.append(x[i])
        newY.append(y[i])
    if x[i] > 0 and allDataFirstOrderErrors[i] > 0:
        newX.append(x[i])
        newY.append(y[i])
newX = np.array(newX)
newY = np.array(newY)
newFirstOrderParameters = np.polyfit(newX, newY, 1)
print("New Parameters", newFirstOrderParameters)
plotNewX = np.linspace(min(x), max(x))
plotNewY = np.polyval(newFirstOrderParameters, plotNewX)
plt.plot(plotNewX, plotNewY, label="New line")
plt.legend()
plt.show()

lambda: what is the output when a lambda function is applied to a numpy array?

I am learning ML with Python. I read the code below in a book.
x, y = np.array(x), np.array(y)
x = (x - x.mean()) / x.std()
x0 = np.linspace(-2, 4, 100)
def get_model(deg):
    return lambda input_x=x0: np.polyval(np.polyfit(x, y, deg), input_x)
def get_cost(deg, input_x, input_y):
    return 0.5 * ((get_model(deg)(input_x) - input_y) ** 2).sum()
I'm not sure why, in the get_cost function, the author applies get_model(deg) to input_x (which is x). In my understanding, get_model(deg) already returns the predicted y based on x0.
When I tried to understand what was happening, I typed get_model(4) and it returned <function __main__.get_model.<locals>.<lambda>>. To my surprise, it didn't return the predicted y based on x0 but a function?! I was totally confused.
When I typed get_model(4)(x), it returned the predicted y based on x. I don't get it. Could someone please help me figure this out?
The function get_model(deg), as you noticed, does not return predictions but a model for predicting.
If you execute get_model(1), it returns a linear model, i.e. a function that fits your values with a degree-1 polynomial and evaluates it:
import numpy as np
import matplotlib.pyplot as plt
fig = plt.gcf()
fig.set_size_inches(10, 5)
x = np.linspace(-2, 4, 200)
y = x**2
y += np.random.rand(len(x)) * 10
x0= x
def get_model(deg):
    return lambda input_x=x0: np.polyval(np.polyfit(x, y, deg), input_x)
linear_model = get_model(1)
plt.scatter(x, y)
plt.scatter(x, linear_model(), c='red')
plt.show()
If you want to try another model, you can do this by changing the degree of the model:
plt.scatter(x, y)
plt.scatter(x, get_model(2)(), c='red')
plt.scatter(x, get_model(19)(), c='yellow')
plt.show()
I hope this helps you understand the code a bit better.
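To tie this back to the original confusion: get_model(deg) returns a lambda, and the trailing (input_x) in get_cost calls that lambda; it is a function call, not a multiplication. A small sketch (assuming the get_cost definition from the question is also in scope):
model_4 = get_model(4)     # a function (the lambda), not predictions
preds_default = model_4()  # evaluates the degree-4 fit at the default x0
preds_train = model_4(x)   # evaluates the same fit at the training x
# get_cost(deg, input_x, input_y) does get_model(deg)(input_x) in one step:
# build the model, call it on input_x, then compare the result with input_y
print(get_cost(4, x, y))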

How does one implement a subsampled RBF (Radial Basis Function) in Numpy?

I was trying to implement a Radial Basis Function in Python and NumPy as described in the Caltech lecture here. The mathematics seems clear to me, so I find it strange that it's not working (or seems not to work). The idea is simple: one chooses a subsampled set of centers, forms the Gaussian kernel (Gram) matrix K from them, and tries to find the best coefficients, i.e. solve Kc = y with least squares. For that I did:
beta = 0.5*np.power(1.0/stddev,2)
Kern = np.exp(-beta*euclidean_distances(X=X,Y=subsampled_data_points,squared=True))
#(C,_,_,_) = np.linalg.lstsq(K,Y_train)
C = np.dot( np.linalg.pinv(Kern), Y )
but when I plot my interpolation against the original data they don't look at all alike. That is with 100 random centers (from the data set). I also tried 10 centers, which produces essentially the same graph, as does using every data point in the training set. I assumed that using every data point should more or less perfectly copy the curve (overfit), but it didn't; that result doesn't seem correct either. I will provide the full code (which runs without error):
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from scipy.interpolate import Rbf
import matplotlib.pyplot as plt
## Data sets
def get_labels_improved(X,f):
    N_train = X.shape[0]
    Y = np.zeros( (N_train,1) )
    for i in range(N_train):
        Y[i] = f(X[i])
    return Y
def get_kernel_matrix(x,W,S):
    beta = get_beta_np(S)
    #beta = 0.5*tf.pow(tf.div( tf.constant(1.0,dtype=tf.float64),S), 2)
    Z = -beta*euclidean_distances(X=x,Y=W,squared=True)
    K = np.exp(Z)
    return K
N = 5000
low_x =-2*np.pi
high_x=2*np.pi
X = low_x + (high_x - low_x) * np.random.rand(N,1)
# f(x) = 2*(2*cos(x)^2 - 1)^2 - 1
f = lambda x: 2*np.power( 2*np.power( np.cos(x) ,2) - 1, 2) - 1
Y = get_labels_improved(X , f)
K = 2 # number of centers for RBF
indices=np.random.choice(a=N,size=K) # choose K random indices from 0 to N-1 (with replacement)
subsampled_data_points=X[indices,:] # M_sub x D
stddev = 100
beta = 0.5*np.power(1.0/stddev,2)
Kern = np.exp(-beta*euclidean_distances(X=X,Y=subsampled_data_points,squared=True))
#(C,_,_,_) = np.linalg.lstsq(K,Y_train)
C = np.dot( np.linalg.pinv(Kern), Y )
Y_pred = np.dot( Kern , C )
plt.plot(X, Y, 'o', label='Original data', markersize=1)
plt.plot(X, Y_pred, 'r', label='Fitted line', markersize=1)
plt.legend()
plt.show()
Since the plots look strange, I decided to read the docs for the plotting functions, but I couldn't find anything obviously wrong.
Scaling of interpolating functions
The main problem is unfortunate choice of standard deviation of the functions used for interpolation:
stddev = 100
The features of your function (its humps) have a size of about 1. So, use
stddev = 1
Order of X values
The mess of red lines is there because plt from matplotlib connects consecutive data points, in the order given. Since your X values are in random order, this results in chaotic left-right movements. Use sorted X:
X = np.sort(low_x + (high_x - low_x) * np.random.rand(N,1), axis=0)
Efficiency issues
Your get_labels_improved method is inefficient, looping over the elements of X. Use Y = f(X), leaving the looping to low-level NumPy internals.
Also, the least-squares solution of an overdetermined system should be computed with lstsq instead of forming the pseudoinverse (computationally expensive) and multiplying by it.
Here is the cleaned-up code; using 30 centers gives a good fit.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
import matplotlib.pyplot as plt
N = 5000
low_x =-2*np.pi
high_x=2*np.pi
X = np.sort(low_x + (high_x - low_x) * np.random.rand(N,1), axis=0)
f = lambda x: 2*np.power( 2*np.power( np.cos(x) ,2) - 1, 2) - 1
Y = f(X)
K = 30 # number of centers for RBF
indices=np.random.choice(a=N,size=K) # choose K random indices from 0 to N-1 (with replacement)
subsampled_data_points=X[indices,:] # M_sub x D
stddev = 1
beta = 0.5*np.power(1.0/stddev,2)
Kern = np.exp(-beta*euclidean_distances(X=X, Y=subsampled_data_points,squared=True))
C = np.linalg.lstsq(Kern, Y)[0]
Y_pred = np.dot(Kern, C)
plt.plot(X, Y, 'o', label='Original data', markersize=1)
plt.plot(X, Y_pred, 'r', label='Fitted line', markersize=1)
plt.legend()
plt.show()
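As a cross-check, scipy's own Rbf interpolator (already imported in the original code but never used) can be fit to a subset of the points; this is just a sketch, and the subset size of 500 is an arbitrary choice to keep the dense solve small.
from scipy.interpolate import Rbf
# fit scipy's Gaussian RBF on a random subset of the data
idx = np.random.choice(N, size=500, replace=False)
rbf = Rbf(X[idx].ravel(), Y[idx].ravel(), function='gaussian')
plt.plot(X.ravel(), Y.ravel(), 'o', label='Original data', markersize=1)
plt.plot(X.ravel(), rbf(X.ravel()), 'g', label='scipy Rbf')
plt.legend()
plt.show()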

How can I place a best fit line to the plotted points?

I have a simple plot containing two datasets in arrays, and am trying to use regression to calculate a best fit line through the points.
However, the line I am getting is way off to the left of and above the data points.
How can I get the line into the right place, and are there any other tips or suggestions for my code?
from pylab import *
Is = array([-13.74,-13.86,-13.32,-18.41,-23.83])
gra = array([31.98,29.41,28.12,34.28,40.09])
plot(gra,Is,'kx')
(m,b) = polyfit(Is,gra,1)
print(b)
print(m)
z = polyval([m,b],Is)
plot(Is,z,'k--')
If anyone is curious, the data is the bandgap of a silicon transistor at various temperatures.
You have to be careful as to which of your arrays you pass as x coordinates and which as y coordinates. Consider that you have data values y at positions x. Then you have to evaluate the polynomial wrt. x too.
from pylab import*
Is = array([-13.74,-13.86,-13.32,-18.41,-23.83])
gra = array([31.98,29.41,28.12,34.28,40.09])
# rename the variables for clarity
x = gra
y = Is
plot(x, y, 'kx')
(m,b) = polyfit(x, y, 1)
print(b)
print(m)
z = polyval([m,b], x)
plot(x, z, 'k--')
show()
