Linear fit constrained to go through the first point in Python

I have two numpy 1-d arrays of equal length, x and a response variable y. I'd like to do a least-squares fit that forces the resulting line to go through the point (x_0, y_0). What's the easiest way to pull this off?
Note: some optimization solutions have been proposed. These are good in theory, but the line must go exactly through the point – even if the fit for the rest of the points is worse.

One way to do it would be to center your data on (x_0, y_0), run linear regression (specifying no intercept), and then transform the predictions back to the original scale.
Here's an example:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.randn(100)
y = np.random.randn(100) + x
plt.plot(x, y, "o")
plt.plot(x[0], y[0], "o") # x_0, y_0 is the orange dot
Next, use linear regression without an intercept to fit a line through the origin of the centered data.
from sklearn.linear_model import LinearRegression
lm = LinearRegression(fit_intercept=False)
# center the data on (x_0, y_0)
y2 = y - y[0]
x2 = x - x[0]
# fit the model
lm.fit(x2.reshape(-1, 1), y2)
Finally, predict the line and plot it back on the original scale:
# predict line
preds = lm.predict(np.arange(-5, 5, 0.1).reshape(-1,1))
# plot on original scale
plt.plot(x, y, "o")
plt.plot(x[0], y[0], "o")
# add x_0 and y_0 back to the predictions
plt.plot(np.arange(-5, 5, 0.1) + x[0], preds + y[0])
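As a quick cross-check (a sketch of mine, not part of the original answer): the constrained slope also has a closed form, since minimizing sum((y - y0 - m*(x - x0))**2) over m gives m = sum((x - x0)*(y - y0)) / sum((x - x0)**2). It should agree with the sklearn coefficient:
# closed-form slope of the least-squares line forced through (x[0], y[0])
dx = x - x[0]
dy = y - y[0]
m = np.dot(dx, dy) / np.dot(dx, dx)
print(m, lm.coef_[0])  # the two slopes should agree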

Related

Intersection point between line and surface in 3D python

I want to be able to find the intersection between a line and a three-dimensional surface.
Mathematically, I have done this by taking the following steps:
Define the (x, y, z) coordinates of the line in a parametric manner. e.g. (x, y, z) = (1+t, 2+3t, 1-t)
Define the surface as a function. e.g. z = f(x, y)
Substitute the values of x, y, and z from the line into the surface function.
Solving the resulting equation gives the intersection of the surface and the line.
I want to know if there is a method for doing this in Python. I am also open to suggestions on simpler ways to solve for the intersection.
You can use the following code:
import numpy as np
import scipy as sc
import scipy.optimize
from matplotlib import pyplot as plt

def f(x, y):
    """Function of the surface."""
    # example equation
    z = x**2 + y**2 - 10
    return z

p0 = np.array([1, 2, 1])          # starting point for the line
direction = np.array([1, 3, -1])  # direction vector

def line_func(t):
    """Function of the straight line.
    :param t: curve parameter of the line
    :returns: xyz-value as array"""
    return p0 + t*direction

def target_func(t):
    """Function that will be minimized by fmin.
    :param t: curve parameter of the straight line
    :returns: (z_line(t) - z_surface(t))**2 -- this is zero
              at intersection points"""
    p_line = line_func(t)
    z_surface = f(*p_line[:2])
    return np.sum((p_line[2] - z_surface)**2)

t_opt = sc.optimize.fmin(target_func, x0=-10)
intersection_point = line_func(t_opt)
The main idea is to reformulate the algebraic condition for intersection, point_of_line = point_of_surface, as a minimization problem: |point_of_line - point_of_surface| → min. Because the surface is represented as z_surface = f(x, y), it is convenient to compute the distance for a given t-value using only the z-values; this is done in target_func(t). The optimal t-value is then found by fmin.
The correctness and plausibility of the result can be checked with some plotting:
from mpl_toolkits.mplot3d import Axes3D
ax = plt.subplot(projection='3d')
X = np.linspace(-5, 5, 10)
Y = np.linspace(-5, 5, 10)
tt = np.linspace(-5, 5, 100)
XX, YY = np.meshgrid(X, Y)
ZZ = f(XX, YY)
ax.plot_wireframe(XX, YY, ZZ, zorder=0)
LL = np.array([line_func(t) for t in tt])
ax.plot(*LL.T, color="orange", zorder=10)
# mark the computed intersection point
ax.plot([intersection_point[0]], [intersection_point[1]], [intersection_point[2]],
        "o", color="red", ms=10, zorder=20)
Note that this combination of wireframe and line plots does not handle occlusion well: matplotlib cannot reliably determine which parts of the orange line should be hidden behind the blue wireframe of the surface.
Also note that for this type of problem there can be any number of solutions, from 0 up to infinitely many, depending on the actual surface. fmin finds a local optimum, which is a true intersection only if target_func(t_opt) = 0; changing the initial guess x0 may change which local optimum fmin finds.
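If several intersections are expected, one simple extension (a sketch of mine, not part of the original answer) is to restart fmin from a grid of initial guesses, keep only the t-values whose residual is numerically zero, and deduplicate; the grid and tolerances below are assumptions that may need tuning:
# restart the minimization from several starting points
candidates = []
for t0 in np.linspace(-10, 10, 21):
    t_opt = sc.optimize.fmin(target_func, x0=t0, xtol=1e-8, ftol=1e-12, disp=False)
    if target_func(t_opt) < 1e-8:  # keep only (numerical) zeros of the residual
        candidates.append(float(t_opt[0]))
# deduplicate nearly identical optima and map back to 3-D points
points = {tuple(np.round(line_func(t), 6)) for t in candidates}
print(points)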

Linear regression minimizing errors only above the line

I have a dataset that resembles the data created in the MWE below:
from matplotlib import pyplot as plt
import numpy as np
sz=100
x = np.linspace(-1, 1, sz)
mean = -np.sign(x)
noise = np.random.randn(*x.shape)
K = -2
y_true = K*x
y = y_true + mean + noise
plt.scatter(x, y, label="Data with error")
plt.plot(x, y_true, "-", label="True line")
plt.grid()
That is, the errors around the line I want are mostly negative for x>0 and mostly positive for x<0. What I'm looking for is a way to estimate the coefficient K (which in this case is -2).
Really, I think the way to do it would be to minimize the error only over the points that fall above the line for x < 0 and below the line for x > 0, but I'm not sure how to go about that efficiently in Python, since everything I can think of involves iterative processes, which are slow in Python.
Basically, you want your model to account for the mean variable in the data-generating process. You can do this by modeling the discontinuity at x = 0: include a variable in your model that is 0 where x < 0 and 1 where x > 0.
We can even include the "mean" variable itself and get the same model (with a different interpretation for the second coefficient). Here is a linear model that recovers the correct value for the slope of this discontinuous line. Note that it assumes the slope is the same on the right side of 0 as on the left.
from sklearn.linear_model import LinearRegression
X = np.array([x, mean]).T
reg = LinearRegression().fit(X, y)
print(reg.coef_)
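For completeness, here is a sketch (my illustration, not the answer's code) of the explicit 0/1 indicator described above; the coefficient on x comes out the same, only the second coefficient's interpretation changes:
from sklearn.linear_model import LinearRegression
step = (x > 0).astype(float)         # 0 where x < 0, 1 where x > 0
X_step = np.column_stack([x, step])  # design matrix with the indicator
reg_step = LinearRegression().fit(X_step, y)
print(reg_step.coef_)                # first entry estimates K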
Here is my attempt, where I (A) fit all the data to a straight line, then (B) separate the data by two criteria: whether x is greater or less than zero, and whether y falls above or below that fitted line, and finally (C) fit the separated data. The slope here is -2.417 and will vary from run to run with the random data.
from matplotlib import pyplot as plt
import numpy as np
sz=100
x = np.linspace(-1, 1, sz)
mean = -np.sign(x)
noise = np.random.randn(*x.shape)
K = -2
y_true = K*x
y = y_true + mean + noise
plt.scatter(x, y, label="Data with error")
plt.plot(x, y_true, "-", label="True line")
###############################
# new section for calculating the new line
allDataFirstOrderParameters = np.polyfit(x, y, 1)
allDataFirstOrderErrors = y - np.polyval(allDataFirstOrderParameters, x)
newX = []
newY = []
for i in range(len(x)):
    if x[i] < 0 and allDataFirstOrderErrors[i] < 0:
        newX.append(x[i])
        newY.append(y[i])
    if x[i] > 0 and allDataFirstOrderErrors[i] > 0:
        newX.append(x[i])
        newY.append(y[i])
newX = np.array(newX)
newY = np.array(newY)
newFirstOrderParameters = np.polyfit(newX, newY, 1)
print("New Parameters", newFirstOrderParameters)
plotNewX = np.linspace(min(x), max(x))
plotNewY = np.polyval(newFirstOrderParameters, plotNewX)
plt.plot(plotNewX, plotNewY, label="New line")
plt.legend()
plt.show()

Curve fit in a log-log plot in matplotlib and getting the equation of line

I am trying to draw a best-fit curve for my data. It is a terribly bad sample of data, but for simplicity's sake let's say I expect to draw a straight line as a best fit in log-log scale.
I think I already did that with regression, and it returns a reasonable fit line. But I want to double-check it with the curve_fit function in scipy. I also want to extract the equation of the fit line.
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import scipy.stats
from scipy.optimize import curve_fit

x = np.array([1.72724547e-08, 1.81960233e-08, 1.68093027e-08, 2.22839973e-08,
              2.23090589e-08, 4.28020801e-08, 2.30004711e-08, 2.48543008e-08,
              1.08633065e-07, 3.24417303e-08, 3.22946248e-08, 3.82328031e-08,
              3.97713860e-08, 3.44080732e-08, 3.81526816e-08, 3.30756706e-08])
y = np.array([4.18793565e+12, 4.40554864e+12, 4.48745390e+12, 4.50816705e+12,
              4.57088190e+12, 4.60256574e+12, 4.66659380e+12, 4.79733449e+12,
              7.31139083e+12, 7.53355564e+12, 8.03526122e+12, 8.14704284e+12,
              8.47227414e+12, 8.62978548e+12, 8.81048873e+12, 9.46237161e+12])

# Regression Function
def regress(x, y):
    """Return a tuple of predicted y values and parameters for linear regression."""
    p = sp.stats.linregress(x, y)
    b1, b0, r, p_val, stderr = p
    y_pred = np.polyval([b1, b0], x)
    return y_pred, p

# plotting
allx, ally = x, y  # data, non-transformed
y_pred, _ = regress(np.log(allx), np.log(ally))  # regression on log-transformed input
plt.loglog(allx, ally, marker='$\\star$', color='g', markersize=5, linestyle='None')
plt.loglog(allx, np.exp(y_pred), "c--", label="regression")  # back-transformed output

# Fit a power-law function; it appears as a straight line on a log-log plot.
def myExpFunc(x, a, b):
    return a * np.power(x, b)

popt, pcov = curve_fit(myExpFunc, x, y, maxfev=1000)
plt.plot(x, myExpFunc(x, *popt), 'r:',
         label="({0:.3f}*x**{1:.3f})".format(*popt))
print("Power-law Fit: y = a*(x**b)")
print("\ta = popt[0] = {0}\n\tb = popt[1] = {1}".format(*popt))
plt.show()
Again, I apologize for the bad dataset; any help will be much appreciated.
My plot looks like this: (plot omitted)
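As a cross-check (my addition, not from the post): the same power law a*x**b can be recovered by a straight-line fit in log space, which yields the equation directly:
# fit log(y) = b*log(x) + log(a); polyfit returns [slope, intercept]
b, log_a = np.polyfit(np.log(x), np.log(y), 1)
a = np.exp(log_a)
print("y = {0:.3f} * x**{1:.3f}".format(a, b))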

Least squares not working for a set of y's

I am trying to run a least-squares algorithm using numpy and am having trouble. Can someone please tell me what I am doing wrong in the given code? When I set y to y = np.power(X, 1) + np.random.rand(20)*3 or some other reasonable function of X, everything works fine. But for the particular y defined by the given y values, the plot I get makes no sense.
Is this some kind of numerical problem?
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
X = np.arange(1,21)
y = np.array([-0.00454712, -0.00457764, -0.0045166 , -0.00442505, -0.00427246,
-0.00411987, -0.00378418, -0.003479 , -0.00314331, -0.00259399,
-0.00213623, -0.00146484, -0.00082397, -0.00030518, 0.00027466,
0.00076294, 0.00146484, 0.00192261, 0.00247192, 0.00314331])
#y = np.power(X, 1) + np.random.rand(20)*3
w = np.linalg.lstsq(X.reshape(20, 1), y)[0]
plt.plot(X, y, 'red')
plt.plot(X, X*w[0], 'blue')
plt.show()
Are you sure there is a linear relationship between what you are fitting and the y variable data?
Using the code (y = np.power(X, 1) + np.random.rand(20)*3) from your example, you have a linear relationship built into the y variable itself (with some noise) which allows your plot to track relatively well with the linear equation.
X = np.arange(1,21)
#y = np.power(X, 1) + np.random.rand(20)*3
w = np.linalg.lstsq(X.reshape(20, 1), y)[0]
plt.plot(X, y, 'red')
plt.plot(X, X*w[0], 'blue')
plt.show()
However, when you switch to something like your y variable
y = np.array([-0.00454712, -0.00457764, -0.0045166 , -0.00442505, -0.00427246,
-0.00411987, -0.00378418, -0.003479 , -0.00314331, -0.00259399,
-0.00213623, -0.00146484, -0.00082397, -0.00030518, 0.00027466,
0.00076294, 0.00146484, 0.00192261, 0.00247192, 0.00314331])
You end up with something less easy to fit.
Looking at the documentation: if you are attempting to fit this set of values, you will need to build in a constant (intercept) term, which lstsq does not include by default.
The docs state for lstsq
Return the least-squares solution to a linear matrix equation.
Solves the equation a x = b
If you really want to fit the data to a linear equation, running code like the below will give you something that almost matches your original data. However, the data seems to have a polynomial/exponential driver behind it, which would make polyfit a better choice.
X = np.arange(1,21)
y = np.array([-0.00454712, -0.00457764, -0.0045166 , -0.00442505, -0.00427246,
-0.00411987, -0.00378418, -0.003479 , -0.00314331, -0.00259399,
-0.00213623, -0.00146484, -0.00082397, -0.00030518, 0.00027466,
0.00076294, 0.00146484, 0.00192261, 0.00247192, 0.00314331])
#y = np.power(X, 1) + np.random.rand(20)*3
X2 = np.vstack([X, np.ones(len(X))]).T
w = np.linalg.lstsq(X2, y)[0]
plt.plot(X, y, 'red')
plt.plot(X, X*w[0] + w[1], 'blue')
plt.show()
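If the polynomial behaviour suggested above is real, a minimal polyfit sketch (my illustration; degree 2 is an assumption) would be:
coeffs = np.polyfit(X, y, 2)  # quadratic least-squares fit
plt.plot(X, y, 'red')
plt.plot(X, np.polyval(coeffs, X), 'blue')
plt.show()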

How does one implement a subsampled RBF (Radial Basis Function) in Numpy?

I was trying to implement a Radial Basis Function in Python and NumPy as described in the Caltech lecture here. The mathematics seems clear to me, so I find it strange that it's not working (or seems not to work). The idea is simple: one chooses a subsampled set of centers for the Gaussians, forms the kernel matrix, and finds the best coefficients, i.e. solves Kc = y with least squares, where K is the Gaussian kernel (Gram) matrix. For that I did:
beta = 0.5*np.power(1.0/stddev,2)
Kern = np.exp(-beta*euclidean_distances(X=X,Y=subsampled_data_points,squared=True))
#(C,_,_,_) = np.linalg.lstsq(K,Y_train)
C = np.dot( np.linalg.pinv(Kern), Y )
but when I plot my interpolation against the original data, they don't look at all alike:
with 100 random centers (from the data set). I also tried 10 centers, which produces essentially the same graph, as does using every data point in the training set. I assumed that using every data point should more or less perfectly reproduce the curve (overfit), but it didn't. It produces:
which doesn't seem correct. Here is the full code (it runs without error):
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from scipy.interpolate import Rbf
import matplotlib.pyplot as plt
## Data sets
def get_labels_improved(X, f):
    N_train = X.shape[0]
    Y = np.zeros((N_train, 1))
    for i in range(N_train):
        Y[i] = f(X[i])
    return Y

def get_kernel_matrix(x, W, S):
    beta = get_beta_np(S)  # defined elsewhere in the original code
    #beta = 0.5*tf.pow(tf.div( tf.constant(1.0,dtype=tf.float64),S), 2)
    Z = -beta*euclidean_distances(X=x, Y=W, squared=True)
    K = np.exp(Z)
    return K
N = 5000
low_x =-2*np.pi
high_x=2*np.pi
X = low_x + (high_x - low_x) * np.random.rand(N,1)
# f(x) = 2*(2*cos(x)^2 - 1)^2 - 1
f = lambda x: 2*np.power( 2*np.power( np.cos(x) ,2) - 1, 2) - 1
Y = get_labels_improved(X , f)
K = 2 # number of centers for RBF
indices=np.random.choice(a=N,size=K) # choose numbers from 0 to D^(1)
subsampled_data_points=X[indices,:] # M_sub x D
stddev = 100
beta = 0.5*np.power(1.0/stddev,2)
Kern = np.exp(-beta*euclidean_distances(X=X,Y=subsampled_data_points,squared=True))
#(C,_,_,_) = np.linalg.lstsq(K,Y_train)
C = np.dot( np.linalg.pinv(Kern), Y )
Y_pred = np.dot( Kern , C )
plt.plot(X, Y, 'o', label='Original data', markersize=1)
plt.plot(X, Y_pred, 'r', label='Fitted line', markersize=1)
plt.legend()
plt.show()
Since the plots looked strange, I read the docs for the plotting functions, but I couldn't find anything obviously wrong.
Scaling of interpolating functions
The main problem is an unfortunate choice of the standard deviation of the functions used for interpolation:
stddev = 100
The features of your function (its humps) are of size about 1. So, use
stddev = 1
Order of X values
The mess of red lines is there because plt from matplotlib connects consecutive data points, in the order given. Since your X values are in random order, this results in chaotic left-right movements. Use sorted X:
X = np.sort(low_x + (high_x - low_x) * np.random.rand(N,1), axis=0)
Efficiency issues
Your get_labels_improved method is inefficient, looping over the elements of X. Use Y = f(X), leaving the looping to low-level NumPy internals.
Also, the least-squares solution of an overdetermined system should be computed with lstsq instead of forming the pseudoinverse (computationally expensive) and multiplying by it.
Here is the cleaned-up code; using 30 centers gives a good fit.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
import matplotlib.pyplot as plt
N = 5000
low_x =-2*np.pi
high_x=2*np.pi
X = np.sort(low_x + (high_x - low_x) * np.random.rand(N,1), axis=0)
f = lambda x: 2*np.power( 2*np.power( np.cos(x) ,2) - 1, 2) - 1
Y = f(X)
K = 30 # number of centers for RBF
indices=np.random.choice(a=N,size=K) # choose numbers from 0 to D^(1)
subsampled_data_points=X[indices,:] # M_sub x D
stddev = 1
beta = 0.5*np.power(1.0/stddev,2)
Kern = np.exp(-beta*euclidean_distances(X=X, Y=subsampled_data_points,squared=True))
C = np.linalg.lstsq(Kern, Y)[0]
Y_pred = np.dot(Kern, C)
plt.plot(X, Y, 'o', label='Original data', markersize=1)
plt.plot(X, Y_pred, 'r', label='Fitted line', markersize=1)
plt.legend()
plt.show()
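For comparison (a sketch of mine, not part of the answer): scipy.interpolate.Rbf, which the question already imports, performs the same kind of interpolation but uses every data point as a center, so it scales as O(N^2); a smaller sample keeps it fast:
from scipy.interpolate import Rbf
N_small = 500  # Rbf uses all points as centers; keep the sample modest
X_small = np.sort(low_x + (high_x - low_x) * np.random.rand(N_small))
Y_small = f(X_small)
rbf = Rbf(X_small, Y_small, function='gaussian')  # epsilon is estimated from the data
plt.plot(X_small, Y_small, 'o', markersize=1, label='Original data')
plt.plot(X_small, rbf(X_small), 'r', label='scipy Rbf')
plt.legend()
plt.show()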
