I implemented linear regression with gradient descent in Python. To see how well it is doing, I compared it with scikit-learn's LinearRegression() class. For some reason, sklearn always outperforms my program by an MSE of about 3 on average (I am using the Boston Housing dataset for testing). I understand that I am currently not doing gradient checking to check for convergence, but I am allowing many iterations and have set the learning rate low enough that it SHOULD converge. Is there any clear bug in my learning algorithm implementation? Here is my code:
import numpy as np
from sklearn.linear_model import LinearRegression
def getWeights(x):
    lenWeights = len(x[1, :])
    weights = np.random.rand(lenWeights)
    bias = np.random.random()
    return weights, bias

def train(x, y, weights, bias, maxIter):
    converged = False
    iterations = 1
    m = len(x)
    alpha = 0.001
    while not converged:
        for i in range(len(x)):
            # Dot product of weights and training sample
            hypothesis = np.dot(x[i, :], weights) + bias
            # Calculate gradient
            error = hypothesis - y[i]
            grad = (alpha * 1/m) * (error * x[i, :])
            # Update weights and bias
            weights = weights - grad
            bias = bias - alpha * error
            iterations = iterations + 1
            if iterations > maxIter:
                converged = True
                break
    return weights, bias

def predict(x, weights, bias):
    return np.dot(x, weights) + bias

if __name__ == '__main__':
    data = np.loadtxt('housing.txt')
    x = data[:, :-1]
    y = data[:, -1]
    for i in range(len(x[1, :])):
        x[:, i] = (x[:, i] - np.min(x[:, i])) / (np.max(x[:, i]) - np.min(x[:, i]))
    initialWeights, initialBias = getWeights(x)
    weights, bias = train(x, y, initialWeights, initialBias, 55000)
    pred = predict(x, weights, bias)
    MSE = np.mean(abs(pred - y))
    print("This Program MSE: " + str(MSE))
    sklearnModel = LinearRegression()
    sklearnModel = sklearnModel.fit(x, y)
    sklearnModel = sklearnModel.predict(x)
    skMSE = np.mean(abs(sklearnModel - y))
    print("Sklearn MSE: " + str(skMSE))
First, make sure that you are computing the correct objective function value. The linear regression objective should be .5*np.mean((pred-y)**2), rather than np.mean(abs(pred - y)).
You are actually running a stochastic gradient descent (SGD) algorithm (running a gradient iteration on individual examples), which should be distinguished from "gradient descent".
SGD is a good learning method, but a bad optimization method - it can take many iterations to converge to a minimum of the empirical error (http://leon.bottou.org/publications/pdf/nips-2007.pdf).
For SGD to converge, the learning rate must be restricted. Typically, the learning rate is set to the base learning rate divided by the number of iterations, something like alpha/(iterations+1), using the variables in your code.
You also include a multiple of 1/m in your gradient, which is typically not used in SGD updates.
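Putting those points together, a minimal sketch of what this suggests might look like the following (illustrative code, not the poster's original; base_alpha is a hypothetical base learning rate):

def train_sgd(x, y, weights, bias, maxIter, base_alpha=0.01):
    iterations = 1
    while iterations <= maxIter:
        for i in range(len(x)):
            alpha = base_alpha / (iterations + 1)        # decaying learning rate
            error = np.dot(x[i, :], weights) + bias - y[i]
            weights = weights - alpha * error * x[i, :]  # no 1/m factor in the SGD update
            bias = bias - alpha * error
            iterations += 1
            if iterations > maxIter:
                break
    return weights, bias

def squared_error_objective(pred, y):
    # the squared-error objective recommended above, used to monitor convergence
    return 0.5 * np.mean((pred - y) ** 2)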
To test your SGD implementation, rather than evaluating the error on the dataset that you trained with, split the dataset into a training set and a test set, and evaluate the error on this test set after training with both methods. The training/test set split will allow you to estimate the performance of your algorithm as a learning algorithm (estimate the expected error) rather than as an optimization algorithm (minimize the empirical error).
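For example, with scikit-learn's train_test_split helper (a sketch; variable names follow the question's code):

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
# train both models on (x_train, y_train), then compare their errors on (x_test, y_test)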
Try increasing your iteration value. This should allow your algorithm to, hopefully, converge to a value that is closer to the global minimum. Keep in mind that you are not using L-BFGS, which can converge much faster than plain gradient descent or even SGD.
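If you want to try L-BFGS, scipy offers it; here is a sketch (not part of the original answer) that assumes x and y are the scaled features and targets from the question:

from scipy.optimize import minimize

def squared_error(params, x, y):
    w, b = params[:-1], params[-1]
    pred = np.dot(x, w) + b
    return 0.5 * np.mean((pred - y) ** 2)

init = np.zeros(x.shape[1] + 1)
result = minimize(squared_error, init, args=(x, y), method='L-BFGS-B')
weights_lbfgs, bias_lbfgs = result.x[:-1], result.x[-1]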
Also try using the normal equation as another way to do Linear Regression.
http://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression/.
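A sketch of that approach (assuming x and y are the arrays loaded in the question's main block, and that X^T X is invertible): the normal equation theta = (X^T X)^(-1) X^T y can be solved directly.

X = np.hstack([np.ones((x.shape[0], 1)), x])              # prepend a column of ones for the bias
theta = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))   # solve instead of forming an explicit inverse
bias_ne, weights_ne = theta[0], theta[1:]
pred_ne = np.dot(X, theta)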
Related
I am trying to implement logistic regression from scratch using the binary cross-entropy loss function. The loss function implemented below follows the standard formula: loss = -(1/N) * sum( y*log(yhat) + (1-y)*log(1-yhat) ).
def binary_crossentropy(y, yhat):
    no_of_samples = len(y)
    numerator_1 = y * np.log(yhat)
    numerator_2 = (1 - y) * np.log(1 - yhat)
    loss = -(np.sum(numerator_1 + numerator_2) / no_of_samples)
    return loss
And below is how I implement the training using gradient descent.
L = 0.01
epochs = 40000
no_of_samples = len(x)

# Keeping track of the loss
loss = []

for _ in range(epochs):
    yhat = sigmoid(x * weight + bias)
    # Finding out the loss of each iteration
    loss.append(binary_crossentropy(y, yhat))
    d_weight = np.sum(x * (yhat - y)) / no_of_samples
    d_bias = np.sum(yhat - y) / no_of_samples
    weight = weight - L * d_weight
    bias = bias - L * d_bias
The training above goes fine, since the weight and bias are properly adjusted. But my question is: why does the loss graph fluctuate so much?
I have previously implemented linear regression, and there the loss decreased steadily.
Is there anything incorrect in my logistic regression implementation? If my implementation is already correct, why does it fluctuate that way?
You need to tune the hyperparameters to see whether that solves the problem. One thing you can do is change the optimizer: for instance, you can use fmin_tnc (from scipy.optimize) instead of plain gradient descent.
Besides, you can tune the number of epochs, the learning rate L, and the type of solver ('newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga') if you use sklearn's LogisticRegression instead.
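For instance, a quick check with scikit-learn's LogisticRegression (a sketch; it assumes x is a 1-D NumPy array holding the single feature and y holds the 0/1 labels, as in the question's code):

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver='lbfgs', max_iter=10000)
clf.fit(x.reshape(-1, 1), y)
print(clf.coef_, clf.intercept_)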
Given a neural network with weights theta and inputs x, I am interested in calculating the partial derivatives of the neural network's output w.r.t. x, so that I can use the result when training the weights theta using a loss that depends both on the output and on the partial derivatives of the output. I figured out how to calculate the partial derivatives following this post. I also found this post that explains how to use sympy to achieve something similar; however, adapting it to a neural network context within pytorch seems like a huge amount of work and a recipe for very slow code.
Thus, I tried something different, which failed. As a minimal example, I created a function (substituting my neural network)
theta = torch.ones([3], requires_grad=True, dtype=torch.float32)

def trainable_function(time):
    return theta[0]*time**3 + theta[1]*time**2 + theta[2]*time
Then, I defined a second function to give me partial derivatives:
def trainable_derivative(time):
    deriv_time = torch.tensor(time, requires_grad=True)
    fun_value = trainable_function(deriv_time)
    gradient = torch.autograd.grad(fun_value, deriv_time, create_graph=True, retain_graph=True)
    deriv_time.requires_grad = False
    return gradient
Given some noisy observations of the derivatives, I now try to train theta. For simplicity, I create a loss that only depends on the derivatives. In this minimal example, the derivatives are used directly as observations, not as regularization, to avoid complicated loss functions that are besides the point.
def objective(train_times, observations):
    predictions = torch.squeeze(torch.tensor([trainable_derivative(a) for a in train_times]))
    return torch.sum((predictions - observations)**2)

optimizer = Adam([theta], lr=0.1)

for iteration in range(200):
    optimizer.zero_grad()
    loss = objective(data_times, noisy_targets)
    loss.backward()
    optimizer.step()
Unfortunately, when running this code, I get the error
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
I suppose that when calculating the partial derivatives the way I do, I do not really create a computational graph through which autodiff could differentiate. Thus, the connection to the parameters theta somehow gets lost, and now it looks to the optimizer as if the loss is completely independent of the parameters theta. However, I could be totally wrong.
Does anyone know how to fix this?
Is it possible to include this type of derivatives in the loss function in pytorch?
And if so, what would be the most pytorch-style way of doing this?
Many thanks for your help and advice, it is much appreciated.
For completeness:
To run the above code, some training data needs to be generated. I used the following code, which works perfectly and has been tested against the analytical derivatives:
true_a = 1
true_b = 1
true_c = 1
def true_function(time):
    return true_a*time**3 + true_b*time**2 + true_c*time

def true_derivative(time):
    deriv_time = torch.tensor(time, requires_grad=True)
    fun_value = true_function(deriv_time)
    return torch.autograd.grad(fun_value, deriv_time)
data_times = torch.linspace(0, 1, 500)
true_targets = torch.squeeze(torch.tensor([true_derivative(a) for a in data_times]))
noisy_targets = torch.tensor(true_targets) + torch.randn_like(true_targets)*0.1
Your approach to the problem appears overly complicated.
I believe that what you're trying to achieve is within reach in PyTorch.
I include here a simple code snippet that I believe showcases what you would like to do:
import torch
import torch.nn as nn
# Data and Function
torch.manual_seed(0)
input_dim = 1
output_dim = 2
n = 10 # batchsize
simple_function = nn.Sequential(nn.Linear(1, 2), nn.Sigmoid())
t = (torch.arange(n).float() / n).view(n, 1)
x = torch.randn(n, output_dim)
t.requires_grad = True
# Actual computation
xhat = simple_function(t)
jac = torch.autograd.functional.jacobian(simple_function, t, create_graph=True)
# jac has shape (n, output_dim, n, 1); keep only the "diagonal" entries, i.e. the
# gradient of each sample's outputs with respect to its own input
grad = jac[torch.arange(n), :, torch.arange(n), 0]
loss = (x - xhat).pow(2).sum() + grad.pow(2).sum()
loss.backward()
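For the question's one-dimensional polynomial example specifically, the same idea can be wired up with torch.autograd.grad instead of the full Jacobian. This is a sketch (not code from the answer above); the key point is that create_graph=True keeps the derivative computation differentiable with respect to theta:

import torch
from torch.optim import Adam

theta = torch.ones(3, requires_grad=True)

def trainable_function(t):
    return theta[0]*t**3 + theta[1]*t**2 + theta[2]*t

data_times = torch.linspace(0, 1, 500).view(-1, 1)
# true derivative of t**3 + t**2 + t, plus noise (mirrors the question's data generation)
noisy_targets = 3*data_times**2 + 2*data_times + 1 + 0.1*torch.randn_like(data_times)

optimizer = Adam([theta], lr=0.1)
for _ in range(200):
    optimizer.zero_grad()
    t = data_times.clone().requires_grad_(True)
    values = trainable_function(t)
    # d(values)/dt, computed with create_graph=True so it stays differentiable w.r.t. theta
    derivs, = torch.autograd.grad(values, t, grad_outputs=torch.ones_like(values), create_graph=True)
    loss = ((derivs - noisy_targets) ** 2).sum()
    loss.backward()
    optimizer.step()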
The Hamming Loss counts the number of labels for which our prediction is wrong and normalizes it.
The standard implementation of the Hamming Loss as a metric relies on counting the wrong predictions, with something along these lines (in TensorFlow):
count_non_zero = tf.math.count_nonzero(actuals - predictions)
return tf.reduce_mean(count_non_zero / actuals.get_shape()[-1])
Implementing the Hamming Loss as an actual loss requires it to be differentiable, which is not the case here due to tf.math.count_nonzero.
An alternative (and approximate) method would be counting the non-zero labels in this way, but unfortunately the NN doesn't seem to improve:
def hamming_loss(y_true, y_pred):
    y_true = tf.convert_to_tensor(y_true, name="y_true")
    y_pred = tf.convert_to_tensor(y_pred, name="y_pred")
    diff = tf.cast(tf.math.abs(y_true - y_pred), dtype=tf.float32)
    # Counting non-zeros in a differentiable way
    epsilon = K.epsilon()
    nonzero = tf.reduce_mean(tf.math.abs(diff / (tf.math.abs(diff) + epsilon)))
    return tf.reduce_mean(nonzero / K.int_shape(y_pred)[-1])
To conclude: what is the correct implementation of the Hamming Loss for TensorFlow?
[1] https://hal.archives-ouvertes.fr/hal-01044994/document
Your network doesn't converge because
diff / (tf.math.abs(diff) + epsilon)
yields an (approximately) 0/1 vector, which kills the gradients both at the zeros and at the ones.
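To see this concretely, here is a small sketch (not part of the original answer) that checks the gradient of the proposed surrogate with tf.GradientTape: when the predictions are far from the targets, each per-element gradient is on the order of epsilon / diff**2, i.e. essentially zero.

import tensorflow as tf

y_true = tf.constant([[1.0, 0.0, 1.0]])
y_pred = tf.Variable([[0.2, 0.9, 0.4]])
epsilon = 1e-7

with tf.GradientTape() as tape:
    diff = tf.abs(y_true - y_pred)
    nonzero = tf.reduce_mean(diff / (diff + epsilon))

print(tape.gradient(nonzero, y_pred))  # entries around 1e-7: almost no learning signal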
I have implemented a gradient boosting decision tree to do a multiclass classification. My custom loss functions look like this:
import numpy as np
from sklearn.preprocessing import OneHotEncoder
def softmax(mat):
    res = np.exp(mat)
    res = np.multiply(res, 1/np.sum(res, axis=1, keepdims=True))
    return res

def custom_asymmetric_objective(y_true, y_pred_encoded):
    pred = y_pred_encoded.reshape((-1, 3), order='F')
    pred = softmax(pred)
    y_true = OneHotEncoder(sparse=False, categories='auto').fit_transform(y_true.reshape(-1, 1))
    grad = (pred - y_true).astype("float")
    hess = 2.0 * pred * (1.0 - pred)
    return grad.flatten('F'), hess.flatten('F')

def custom_asymmetric_valid(y_true, y_pred_encoded):
    y_true = OneHotEncoder(sparse=False, categories='auto').fit_transform(y_true.reshape(-1, 1)).flatten('F')
    margin = (y_true - y_pred_encoded).astype("float")
    loss = margin*10
    return "custom_asymmetric_eval", np.mean(loss), False
Everything works, but now I want to adjust my loss function in the following way: it should "penalize" if an item is classified incorrectly, and a penalty should be added for a certain constraint (this is calculated beforehand; let's just say the penalty is e.g. 0.05, so just a real number).
Is there any way to consider both, the misclassification and the penalty value?
Try L2 regularization: add a penalty term lambda * w^2 to the objective, so each weight is updated by subtracting the learning rate times (error times x) plus the derivative of the penalty term:
w := w - alpha * (error * x + 2 * lambda * w)
The effect is that large weights are shrunk towards zero on every update.
ADDED: The penalization term (on the right of the equation) increases the generalization power of your model. If you overfit the training set, performance will be poor on the test set. So you penalize those "right" classifications on the training set that generate error on the test set and compromise generalization.
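As an illustration only, here is one way to read that suggestion for the custom objective above: add an L2-style penalty lambda * z^2 on the raw scores z, which contributes 2*lambda*z to the gradient and 2*lambda to the hessian. This is a sketch, not code from the original post, and lambda_reg is a hypothetical hyperparameter:

def custom_objective_with_penalty(y_true, y_pred_encoded, lambda_reg=0.05):
    pred_raw = y_pred_encoded.reshape((-1, 3), order='F')
    pred = softmax(pred_raw)
    y_true = OneHotEncoder(sparse=False, categories='auto').fit_transform(y_true.reshape(-1, 1))
    # gradient/hessian of the original softmax objective
    grad = (pred - y_true).astype("float")
    hess = 2.0 * pred * (1.0 - pred)
    # L2 penalty lambda * z^2 on the raw scores adds 2*lambda*z to the gradient and 2*lambda to the hessian
    grad = grad + 2.0 * lambda_reg * pred_raw
    hess = hess + 2.0 * lambda_reg
    return grad.flatten('F'), hess.flatten('F')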
I am not sure whether this is the right place to ask this question; feel free to tell me if I need to remove the post.
I am quite new to PyTorch and am currently working with CycleGAN (the PyTorch implementation) as part of my project, and I understand most of the implementation of CycleGAN.
I read the paper 'CycleGAN with better Cycles' and I am trying to apply the modifications mentioned in it. One of the modifications is cycle consistency weight decay, which I don't know how to apply.
optimizer_G.zero_grad()
# Identity loss
loss_id_A = criterion_identity(G_BA(real_A), real_A)
loss_id_B = criterion_identity(G_AB(real_B), real_B)
loss_identity = (loss_id_A + loss_id_B) / 2
# GAN loss
fake_B = G_AB(real_A)
loss_GAN_AB = criterion_GAN(D_B(fake_B), valid)
fake_A = G_BA(real_B)
loss_GAN_BA = criterion_GAN(D_A(fake_A), valid)
loss_GAN = (loss_GAN_AB + loss_GAN_BA) / 2
# Cycle consistency loss
recov_A = G_BA(fake_B)
loss_cycle_A = criterion_cycle(recov_A, real_A)
recov_B = G_AB(fake_A)
loss_cycle_B = criterion_cycle(recov_B, real_B)
loss_cycle = (loss_cycle_A + loss_cycle_B) / 2
# Total loss
loss_G = (loss_GAN
          + lambda_cyc * loss_cycle    # lambda_cyc is 10
          + lambda_id * loss_identity)  # lambda_id is 0.5 * lambda_cyc
loss_G.backward()
optimizer_G.step()
My question is how can I gradually decay the weight of cycle consistency loss?
Any help in implementing this modification would be appreciated.
This is from the paper:
Cycle consistency loss helps to stabilize training a lot in early stages but becomes an obstacle towards realistic images in later stages. We propose to gradually decay the weight of cycle consistency loss λ as training progress. However, we should still make sure that λ is not decayed to 0 so that generators won't become unconstrained and go completely wild.
Thanks in advance.
Below is a prototype function you can use!
def loss(other_params, decay_params, step):
    # ... compute the ordinary (non-cycle) losses -> base_loss ...
    # ... compute the cycle consistency loss -> cyclic_loss ...
    # compute the current lambda for this step
    cur_lambda = compute_lambda(step, decay_params)
    final_loss = base_loss + cur_lambda * cyclic_loss
    return final_loss
compute_lambda function for linearly decaying from 10 to 1e-5 in 50 steps
def compute_lambda(step, decay_params):
    final_lambda = decay_params["final"]
    initial_lambda = decay_params["initial"]
    total_step = decay_params["total_step"]
    start_step = decay_params["start_step"]
    if step < start_step:
        # decay has not started yet
        return initial_lambda
    elif step < start_step + total_step:
        # linear interpolation between initial_lambda and final_lambda
        return initial_lambda + (step - start_step) * (final_lambda - initial_lambda) / total_step
    else:
        # decay finished
        return final_lambda

# Usage:
compute_lambda(i, {"final": 1e-5, "initial": 10, "total_step": 50, "start_step": 50})