Linear Regression - mean squared error coming out too large - Python

I have a house-sales dataset, and I am applying linear regression to it. After computing the slope and y-intercept, I plot the line and compute the cost, and the result looks a little odd to me, because:
- the line from the parameters fits the data well,
- but the cost value from the same parameters is huge.
Here's the code for plotting the straight line:
def plotLine(slope, yIntercept, X, y):
    abline_values = [slope * i + yIntercept for i in X]
    plt.scatter(X, y)
    plt.plot(X, abline_values, 'black')
    plt.title(slope)
    plt.show()
And here is the function for computing the cost:
def computeCost(m, parameters, x, y):
    [yIntercept, slope] = parameters
    hypothesis = yIntercept - np.dot(x, slope)
    loss = hypothesis - y
    cost = np.sum(loss ** 2) / (2 * m)
    return cost
And the following lines of code give me the x vs. y plot with the line from the computed parameters (for the sake of simplicity of this question, I've set the parameters manually) and the cost value:
yIntercept = -70000
slope = 0.85
print("Starting gradient descent at b = %d, m = %f, error = %f" % (yIntercept, slope, computeCost(m, parameters, X, y)))
plotLine(slope, yIntercept, X, y)
And the output of the above snippet is:
[plot of the data with the fitted line, plus the printed error value]
So, my questions are:
1. Is this the right way to plot the straight line over the x vs. y plot?
2. Why is the cost value so big, and is it possible for the cost to be this big even when the parameters fit the data well?
Edit 1
The m in the print statement is the slope value, not the size of X, i.e., len(X).

1. Your way of plotting seems right; you can probably simplify
abline_values = [slope * i + yIntercept for i in X]
to
abline_values = slope * X + yIntercept
2. Did you set m = 0.85 in your example? It seems so, but I cannot tell, since you did not provide the call to the cost function. Shouldn't m be the size of the sample? If you add up all the squared errors and divide them by 2 * 0.85, the size of the error depends on your sample size. And since it is not a relative error and the values are rather big, it is plausible that all these errors add up to that huge number. Try setting m to the size of your sample.
In addition, there is a sign error in the computation of the hypothesis value: it should be a +. Otherwise you would have a negative slope, which explains the large errors as well.
def computeCost(parameters, x, y):
    [yIntercept, slope] = parameters
    hypothesis = yIntercept + np.dot(x, slope)
    loss = hypothesis - y
    cost = np.sum(loss ** 2) / (2 * len(x))
    return cost
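As a quick sanity check, here is a hypothetical call with the manually chosen parameters from the question (assuming X and y are the question's data arrays):
parameters = [-70000, 0.85]            # [yIntercept, slope], as in the question
print(computeCost(parameters, X, y))   # should shrink substantially with the sign and m fixed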

The error value is large due to the unnormalized input data. According to your code, x varies from 0 to 250k. In this case, I would suggest normalizing x to lie in [0, 1]. With that, I would expect the loss to be small, and so would be the learned parameters (slope and intercept).
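A minimal sketch of that normalization, assuming X is a 1-D NumPy array as in the question:
X_norm = (X - X.min()) / (X.max() - X.min())   # min-max scaling into [0, 1]
Note that a slope and intercept learned on X_norm live on the normalized scale and would need to be mapped back before plotting against the raw X.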

Related

Scaling error and information criteria in lmfit

I am using lmfit to solve a non-linear optimization problem. It works fine up to the point where I try to implement a measurement error as the standard deviation of the dependent variable y (sigma_y). My problem is that I cannot interpret the information criteria properly. When implementing return (model - y)/(sigma_y), they jump from really low to very high values.
i.e. [left: return (model - y) -weighting-> right: return (model - y)/(sigma_y)]:
chi-square            0.00159805  ->  47.3184972
reduced chi-square    1.7756e-04  ->  5.25761080   (expectation value is 1; see the SO discussion)
Akaike info crit     -93.2055413  ->  20.0490661   (the more negative, the better)
Bayesian info crit   -92.4097507  ->  20.8448566   (the more negative, the better)
My guess is that this is somehow connected to bad usage of lmfit (wrong calculation of the information criteria, bad error scaling) or to a general lack of understanding of these criteria (to me, a reduced chi-square of 5.258 (under-estimated) or 1.776e-4 (over-estimated) sounds like a really poor fit, but the plot of the residuals etc. looks okay to me...).
Here is my example code that reproduces the problem:
import lmfit
import numpy as np
import matplotlib.pyplot as plt

y = np.array([1.42774324, 1.45919099, 1.58891696, 1.78432768, 1.97403125,
              2.17091161, 2.02065766, 1.83449279, 1.64412613, 1.47265405,
              1.4507])
sigma_y = np.array([0.00312633, 0.00391546, 0.00873894, 0.01252675, 0.00639111,
                    0.00452488, 0.0050566, 0.01127185, 0.01081748, 0.00227587,
                    0.00190191])
x = np.array([0.02372331, 0.07251188, 0.50685822, 1.30761963, 2.10535442,
              2.90597497, 2.30733552, 1.50906939, 0.7098041, 0.09580686,
              0.04777082])
offset = np.array([1.18151707, 1.1815602, 1.18202847, 1.18187962, 1.18047493,
                   1.17903493, 1.17962602, 1.18141625, 1.18216907, 1.18222051,
                   1.18238949])

parameter = lmfit.Parameters()
parameter.add('par_1', value=1e-5)
parameter.add('par_2', value=0.18)

def U_reg(parameter, x, offset):
    par_1 = parameter['par_1']
    par_2 = parameter['par_2']
    model = offset + 0.03043211217 * np.arcsinh(x / (2 * par_1)) + par_2 * x
    return (model - y) / (sigma_y)

mini = lmfit.Minimizer(U_reg, parameter, fcn_args=(x, offset), nan_policy='omit', scale_covar=False)
regr1 = mini.minimize(method='least_sq')  # Levenberg-Marquardt
print("Levenberg-Marquardt:")
print(lmfit.fit_report(regr1, min_correl=0.9))

def U_plot(parameter, x, offset):
    par_1 = parameter['par_1']
    par_2 = parameter['par_2']
    model = offset + 0.03043211217 * np.arcsinh(x / (2 * par_1)) + par_2 * x
    return model - y

plt.title("Model vs Data")
plt.plot(x, y, 'b')
plt.plot(x, U_plot(regr1.params, x, offset) + y, 'r--', label='Levenberg-Marquardt')
plt.ylabel("y")
plt.xlabel("x")
plt.legend(loc='best')
plt.show()
I know that it might be more convenient to implement the weighting via lmfit.Model, but due to the multiple regressors I want to keep the lmfit.Minimizer method. My Python version is 3.8.5, lmfit is 1.0.2.
Well, in order for the magnitude of chi-square to be meaningful (for example, for it to be around N_data - N_varys), the scale of the uncertainties has to be correct: each sigma_y must give the 1-sigma standard deviation for that observation.
It is not really possible for lmfit to detect whether the sigma you use has the right scale.
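To illustrate with a synthetic sketch (random residuals, not your data): if every sigma_y is off by a common factor s, chi-square is scaled by 1/s^2, which produces exactly the kind of jump you observed:
import numpy as np

rng = np.random.default_rng(0)
resid = rng.normal(0.0, 1.0, size=50)    # residuals whose true 1-sigma scale is 1.0
sigma = np.ones(50)

chisqr_true = np.sum((resid / sigma) ** 2)           # ~ 50, i.e. ~ N_data
chisqr_off = np.sum((resid / (0.01 * sigma)) ** 2)   # ~ 500000: sigmas 100x too small
print(chisqr_true, chisqr_off)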

Neural Network Backpropagation code not working

I need to write a simple neural network that consists of 1 output node, one hidden layer of 3 nodes, and 1 input layer (variable size). For now I am just trying to train it on the XOR data, so let's presume that there are 3 input nodes (one node represents the bias and is always 1). The data is labeled 0 or 1.
I worked out the equations for backpropagation and found that, despite being so simple, my code does not converge to correct predictions on the XOR data.
Let W be the 3x3 matrix of weights connecting the input and hidden layers, and w the 1x3 matrix that connects the hidden layer to the output. Here are some helper functions for my method:
def feed_forward_predict(x, W, w):
    sigmoid = lambda x: 1/(1+np.exp(-x))
    z = np.array(list(map(sigmoid, np.matmul(W, x))))
    L = sigmoid(np.matmul(w, z))
    return [L, z, x]
This just takes in a value and makes a prediction using the formula sig(w * sig(W * x)). We also have
def calculate_objective(data, labels, W, w):
    obj = 0
    for point, label in zip(data, labels):
        L, z, x = feed_forward_predict(point, W, w)
        obj += (label - L)**2
    return obj
which calculates the mean squared error for a bunch of given data points. Both of these functions should work, as I checked them by hand. Now the problem comes in with the backpropagation algorithm:
def back_prop(traindata, trainlabels):
    sigmoid = lambda x: 1/(1+np.exp(-x))
    sigmoid_prime = lambda x: np.exp(-x)/((1+np.exp(-x))**2)
    W = np.random.rand(3, len(traindata[0]))
    w = np.random.rand(1, 3)
    obj = calculate_objective(traindata, trainlabels, W, w)
    print(obj)
    epochs = 10_000
    eta = .01
    prevobj = np.inf
    i = 0
    while i < epochs:
        prevobj = obj
        dellw = np.zeros((1, 3))
        for point, label in zip(traindata, trainlabels):
            y, z, x = feed_forward_predict(point, W, w)
            dellw += 2*(y - label) * sigmoid_prime(np.dot(w, z)) * z
        w -= eta * dellw
        for point, label in zip(traindata, trainlabels):
            y, z, x = feed_forward_predict(point, W, w)
            temp = 2 * (y - label) * sigmoid_prime(np.dot(w, z))
            # Note that s, u, v represent the hidden node weights. My professor required it this way
            dells = temp * w[0][0] * sigmoid_prime(np.matmul(W[0,:], x)) * x
            dellu = temp * w[0][1] * sigmoid_prime(np.matmul(W[1,:], x)) * x
            dellv = temp * w[0][2] * sigmoid_prime(np.matmul(W[2,:], x)) * x
        dellW = np.array([dells, dellu, dellv])
        W -= eta*dellW
        obj = calculate_objective(traindata, trainlabels, W, w)
        i = i + 1
        print("i=", i, " Objective=", obj)
    return [W, w]
However this code, despite seemingly being correct in terms of the matrix multiplications and the derivatives I took, does not converge to anything. In fact, the error consistently bounces: it will fall, then rise, then fall back to the same spot, then rise again. I believe the problem lies with the gradient of the W matrix, but I do not know what exactly it is.
If you'd like to see for yourself what is happening, the input data I used is:
0: 0 0 1
0: 1 1 1
1: 1 0 1
1: 0 1 1
where the first number represents the label. I also set the random seed to np.random.seed(0) just so that I could be consistent with the matrices I'm dealing with.
It appears you are attempting to set up a manual version of stochastic gradient descent with a fixed learning rate (a classic NN problem).
Some notes on your code. It is very difficult to follow all the steps you are doing with so many loops and inconsistencies. In general, it defeats the purpose of using np.array() if you then loop over the arrays. Likewise, you should know that np.matmul() is the @ operator, and np.dot() behaves the same as @ for 2-D arrays. It is unclear how you are using the derivative: you have it explicitly stated at the start for the activation function and then partially derived in the middle of your loop for the MSE. Ugh.
Some other pointers. Explicitly state all your functions and your data; those should be globals, defined all at once from your fixed data as np.array(). In particular, note that while traditional statistics (like finding the line of best fit) means solving for a fixed set of weights given a random variable, in stochastic gradient descent we do the opposite: we fix the random variable to our data and optimize the weights. Hence, your functions should have only your weights as "free variables"; everything else is fixed. It is important to follow what is fixed and what is free to update, and your code does not reflect that you know which is which.
SGD algorithm outline:
1. Random params.
2. Update the params by moving them a small percentage in the direction of steepest descent.
3. Run step (2) for a specified amount of time.
4. Print your params.
Example of SGD code (here is an example of performing SGD to find the line of best fit for some data):
import numpy as np

# Data
X = np.random.random((100,))                     # Random points
Y = (2.3*X + 8) + 0.1*np.random.random((100,))   # Linear model + noise

# Functions (only free variable is the params) (we want the F of best fit under MSE)
F = lambda p: p[0]*X + p[1]
dF = lambda p: np.array([X, np.ones(X.shape)])
MSE = lambda p: (1/Y.shape[0])*((Y - F(p))**2).sum(0)
dMSE = lambda p: (1/Y.shape[0])*(-2*(Y - F(p))*dF(p)).sum(1)

# SGD loop
lr = 0.05
epochs = 1000
params = np.array([0.0, 0.0])
for i in range(epochs):
    params -= lr*dMSE(params)

print(params)
Hopefully, written this way, it is super clear exactly where the subtraction of the gradient occurs and exactly how it is calculated. Note also, in case it wasn't clear, that the derivative in both dF and dMSE is with respect to the params. Obviously this is a toy problem that can be solved explicitly with the scipy module, so SGD is a clearly useless way to optimize two variables here:
from scipy.stats import linregress
params = linregress(X,Y)
print(params)
I think I figured it out: in my code I was not summing the hidden node weight derivatives, but assigning them anew at every loop iteration. The correct version would be as follows (with the accumulators initialized to zero before the loop):
dells = np.zeros(len(traindata[0]))   # accumulators must start at zero each epoch
dellu = np.zeros(len(traindata[0]))
dellv = np.zeros(len(traindata[0]))
for point, label in zip(traindata, trainlabels):
    y, z, x = feed_forward_predict(point, W, w)
    temp = 2 * (y - label) * sigmoid_prime(np.dot(w, z))
    # Note that s, u, v represent the hidden node weights. My professor required it this way
    dells += temp * w[0][0] * sigmoid_prime(np.matmul(W[0,:], x)) * x
    dellu += temp * w[0][1] * sigmoid_prime(np.matmul(W[1,:], x)) * x
    dellv += temp * w[0][2] * sigmoid_prime(np.matmul(W[2,:], x)) * x
W -= eta * np.array([dells, dellu, dellv])

Trying to fit a trig function to data with scipy

I am trying to fit some data using scipy.optimize.curve_fit. I have read the documentation and also this StackOverflow post, but neither seems to answer my question.
I have some simple 2D data which looks approximately like a trig function, and I want to fit it with a general trig function using scipy.
My approach is as follows:
from __future__ import division
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt  # needed for the plotting below

# Load the data
data = np.loadtxt('example_data.txt')
t = data[:,0]
y = data[:,1]

# Define the function to fit
def func_cos(t, A, omega, dphi, C):
    # A is the amplitude, omega the frequency, dphi and C the horizontal/vertical shifts
    return A*np.cos(omega*t + dphi) + C

# Do a scipy fit
popt, pcov = curve_fit(func_cos, t, y)

# Plot fit data and original data
fig = plt.figure(figsize=(14,10))
ax1 = plt.subplot2grid((1,1), (0,0))
ax1.plot(t, y)
ax1.plot(t, func_cos(t, *popt))
This outputs:
[plot of the data and the fitted curve]
where blue is the data and orange is the fit. Clearly I am doing something wrong. Any pointers?
If no values are provided for the initial guess of the parameters p0, then a value of 1 is assumed for each of them. From the docs:
p0 : array_like, optional
Initial guess for the parameters (length N). If None, then the initial values will all be 1 (if the number of parameters for the function can be determined using introspection, otherwise a ValueError is raised).
Since your data has very large x-values and very small y-values, an initial guess of 1 is far from the actual solution, and hence the optimizer does not converge. You can help the optimizer by providing suitable initial parameter values that can be guessed/approximated from the data:
- Amplitude: A = (y.max() - y.min()) / 2
- Offset: C = (y.max() + y.min()) / 2
- Frequency: here we can estimate the number of zero crossings by multiplying consecutive y-values and checking which products are smaller than zero. This number divided by the total x-range gives the frequency, and in order to get it in units of pi we can multiply that number by pi: y_shifted = y - offset; omega = np.pi * np.sum(y_shifted[:-1] * y_shifted[1:] < 0) / (t.max() - t.min())
- Phase shift: can be set to zero, dphi = 0
So in summary, the following initial parameter guess can be used:
offset = (y.max() + y.min()) / 2
y_shifted = y - offset
p0 = (
    (y.max() - y.min()) / 2,
    np.pi * np.sum(y_shifted[:-1] * y_shifted[1:] < 0) / (t.max() - t.min()),
    0,
    offset
)
popt, pcov = curve_fit(func_cos, t, y, p0=p0)
Which gives me the following fit:
[plot of the data with the fitted cosine]

Cost value doesn't decrease when using gradient descent

I have data pairs (x, y) which are created by a cubic function
y = g(x) = ax^3 - bx^2 - cx + d
plus some random noise. Now I want to fit a model (parameters a, b, c, d) to this data using gradient descent.
My implementation:
param = {}
param["a"] = 0.02
param["b"] = 0.001
param["c"] = 0.002
param["d"] = -0.04

def model(param, x, y, derivative=False):
    x2 = np.power(x, 2)
    x3 = np.power(x, 3)
    y_hat = param["a"]*x3 + param["b"]*x2 + param["c"]*x + param["d"]
    if derivative == False:
        return y_hat
    derv = {}  # of the cost function w.r.t. the parameters
    m = len(y_hat)
    derv["a"] = (2/m)*np.sum((y_hat-y)*x3)
    derv["b"] = (2/m)*np.sum((y_hat-y)*x2)
    derv["c"] = (2/m)*np.sum((y_hat-y)*x)
    derv["d"] = (2/m)*np.sum((y_hat-y))
    return derv

def cost(y_hat, y):
    assert len(y) == len(y_hat)
    return np.sum(np.power(y_hat-y, 2)) / len(y)

def optimizer(param, x, y, lr=0.01, epochs=100):
    for i in range(epochs):
        y_hat = model(param, x, y)
        derv = model(param, x, y, derivative=True)
        param["a"] = param["a"] - lr*derv["a"]
        param["b"] = param["b"] - lr*derv["b"]
        param["c"] = param["c"] - lr*derv["c"]
        param["d"] = param["d"] - lr*derv["d"]
        if i % 10 == 0:
            #print(y, y_hat)
            #print(param, derv)
            print(cost(y_hat, y))

X = np.array(x)
Y = np.array(y)
optimizer(param, X, Y, 0.01, 100)
When run, the cost seems to be increasing:
36.140028646153525
181.88127675295928
2045.7925570171055
24964.787906199843
306448.81623701524
3763271.7837247783
46215271.5069297
567552820.2134454
6969909237.010273
85594914704.25394
Did I compute the gradients wrong? I don't know why the cost is exploding.
Here is the data: https://pastebin.com/raw/1VqKazUV.
If I run your code with e.g. lr=1e-4, the cost decreases.
Check your gradients (just print the result of model(..., derivative=True)); you will see that they are quite large. Since your learning rate is also not too small, you are likely oscillating away from the minimum (see any ML textbook for example plots of this; you should also be able to see it if you print your parameters after every iteration).
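For example, re-running the question's optimizer with the smaller learning rate (this reuses the asker's own functions and data; note that optimizer mutates param in place, so reset it to the initial values first):
optimizer(param, X, Y, lr=1e-4, epochs=100)   # the printed cost now decreases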

MLE Log-likelihood for logistic regression gives divide by zero error

I want to compute the log-likelihood of a logistic regression model.
def sigma(x):
    return 1 / (1 + np.exp(-x))

def logll(y, X, w):
    """
    Parameters
    y : ndarray of shape (N,)
        Binary labels (either 0 or 1).
    X : ndarray of shape (N, D)
        Design matrix.
    w : ndarray of shape (D,)
        Weight vector.
    """
    p = sigma(X @ w)
    y_1 = y @ np.log(p)
    y_0 = (1 - y) @ (1 - np.log(1 - p))
    return y_1 + y_0

logll(y, Xz, np.linspace(-5, 5, D))
Applying this function results in
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:16:
RuntimeWarning: divide by zero encountered in log
app.launch_new_instance()
I would expect y_0 to be a negative float. How can I avoid this error, and is there a bug somewhere in the code?
Edit 1
X @ w statistics:
Max: 550.775133944
Min: -141.972597608
Sigma(max): 1.0 => throws the error in y_0, at np.log(1 - 1.0)
Sigma(min): 2.19828642169e-62
Edit 2
I also have access to this logsigma function that computes sigma in log space:
def logsigma(x):
    return np.vectorize(np.log)(sigma(x))
Unfortunately, I don't see a way to rewrite y_0 with it. The following is my approach, but it is obviously not correct:
def l(y, X, w):
    y_1 = np.dot(y, logsigma(X @ w))
    y_0 = (1 - y) @ (1 - np.log(1 - logsigma(X @ w)))
    return y_1 + y_0
First of all, I think you've made a mistake in your log-likelihood formula: it should be a plain sum of y_0 and y_1, not a sum of exponentials,
log L = sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]
so the "1 -" wrapped around np.log(1 - p) in your y_0 does not belong there.
Division by zero can be caused by large negative values (I mean large by absolute value) in X @ w: e.g. sigma(-800) is exactly 0.0 on my machine, so the log of it results in "RuntimeWarning: divide by zero encountered in log".
Make sure you initialize your network with small values near zero and you don't have exploding gradients after several iterations of backprop.
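For instance, a hypothetical small-scale initialization (the 0.01 factor is an arbitrary choice, not from the question):
D = X.shape[1]                  # number of features in the design matrix
w = 0.01 * np.random.randn(D)   # small weights keep X @ w near 0, so sigma(X @ w) stays away from 0 and 1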
By the way, here's the code I use for cross-entropy loss, which also works in multi-class problems:
def softmax_loss(x, y):
    """
    - x: Input data, of shape (N, C), where x[i, j] is the score for the jth class
      for the ith input.
    - y: Vector of labels, of shape (N,), where y[i] is the label for x[i] and
      0 <= y[i] < C.
    """
    probs = np.exp(x - np.max(x, axis=1, keepdims=True))
    probs /= np.sum(probs, axis=1, keepdims=True)
    N = x.shape[0]
    return -np.sum(np.log(probs[np.arange(N), y])) / N
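A hypothetical usage example (scores and labels invented for illustration):
scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3],
                   [1.2, 0.2, 3.1]])   # N = 3 samples, C = 3 classes
labels = np.array([0, 1, 2])           # correct class index for each sample
print(softmax_loss(scores, labels))    # average cross-entropy, a small positive number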
UPD: When nothing else helps, there is one more numerical trick (discussed in the comments): compute log(p+epsilon) and log(1-p+epsilon) with a small positive epsilon value. This ensures that log(0.0) never happens.
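A minimal sketch of that trick applied to the corrected log-likelihood above (safe_logll and eps are illustrative names, not from the original code):
def safe_logll(y, X, w, eps=1e-15):
    p = sigma(X @ w)                       # sigma as defined in the question
    y_1 = y @ np.log(p + eps)              # eps guards against log(0.0)
    y_0 = (1 - y) @ np.log(1 - p + eps)
    return y_1 + y_0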
