How do you calculate the L1 and L2 regularization loss in Python, where w is the vector of weights of a linear model?
The regularizers should compute the loss without considering the bias term in the weights.
def l1_reg(w):
    # TO-DO: Add your code here
    return None

def l2_reg(w):
    # TO-DO: Add your code here
    return None
Why Use Regularization
While training your model you would like to get as high an accuracy as possible, so you might choose to use all correlated features [columns, predictors, vectors]. But if your dataset is not big enough (i.e. the number of features n is much larger than the number of examples m), this causes what's called overfitting. Overfitting means that your model performs very well on the training set but fails on the test set (i.e. the training accuracy is much better than the test-set accuracy). You can think of it as being able to solve a problem you have solved before, but failing on a similar problem because you are overthinking it [not the same problem, but a similar one]. This is where regularization comes in to solve the problem.
Regularization
Let's first explain the logic behind regularization.
Regularization is the process of adding information.
[You can think of it like this: before giving you another problem, I add more information to the first one; you categorize it, so you don't overthink when you encounter a similar problem.]
This image shows an overfitted model next to an accurate model.
L1 and L2 are the types of information (penalty terms) added to your model's equation.
L1 Regularization
In L1 regularization, the information you add to the model equation is the sum of the absolute values of the parameter vector θ, multiplied by the regularization parameter λ (which can be any large number) divided by the size of the data m, i.e. (λ/m) · (|θ₁| + |θ₂| + … + |θₙ|), where n is the number of features.
L2 Regularization
In L2 regularization, the information you add to the model equation is the sum of the squared entries of θ, multiplied by the regularization parameter λ (again, possibly a large number) divided by the size of the data m, i.e. (λ/m) · (θ₁² + θ₂² + … + θₙ²), where n is the number of features.
In case you use the Normal Equation
Then the L2 regularization term becomes an (n+1)×(n+1) diagonal matrix with a zero in the upper-left entry and ones on the remaining diagonal entries, multiplied by the regularization parameter λ.
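As a rough illustration (my own sketch, not part of the original explanation), this is what the L2-regularized normal equation could look like in NumPy, assuming X already carries a leading column of 1s for the bias:

import numpy as np

def ridge_normal_equation(X, y, lam):
    # X: (m, n+1) design matrix whose first column is all 1s (the bias column)
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0  # zero in the upper-left entry, so the bias is not regularized
    # theta = (X^T X + lambda * L)^(-1) X^T y
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

# Example usage with random data:
X = np.hstack([np.ones((50, 1)), np.random.randn(50, 3)])
y = np.random.randn(50)
theta = ridge_normal_equation(X, y, lam=1.0)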
I think it is important to clarify this before answering: the L1 and L2 regularization terms aren't loss functions. They help to control the weights in the vector so that they don't become too large and can reduce overfitting.
L1 regularization term is the sum of absolute values of each element. For a length N vector, it would be |w[1]| + |w[2]| + ... + |w[N]|.
L2 regularization term is the sum of squared values of each element. For a length N vector, it would be w[1]² + w[2]² + ... + w[N]². I hope this helps!
def calculateL1(self, vector):
    vector = np.abs(vector)
    return np.sum(vector)

def calculateL2(self, vector):
    return np.dot(vector, vector.T)
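To tie this back to the original question, here is a minimal sketch of the two requested functions, under the assumption that the bias term is stored as the first element w[0] and should be left out of the penalty:

import numpy as np

def l1_reg(w):
    # sum of absolute values, skipping the (assumed) bias term w[0]
    return np.sum(np.abs(w[1:]))

def l2_reg(w):
    # sum of squares, skipping the (assumed) bias term w[0]
    return np.sum(np.square(w[1:]))

w = np.array([0.5, -1.0, 2.0, -3.0])
print(l1_reg(w))  # 6.0
print(l2_reg(w))  # 14.0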
Related
I want to calculate the L1 loss in a neural network. I came across this example at https://discuss.pytorch.org/t/simple-l2-regularization/139/2, but there are some errors in this code.
Is this really how to calculate L1 Loss in a NN or is there a simpler way?
l1_crit = nn.L1Loss()
reg_loss = 0
for param in model.parameters():
    reg_loss += l1_crit(param)

factor = 0.0005
loss += factor * reg_loss
Is this equivalent in any way to simply doing:
loss = torch.nn.L1Loss()
I assume not, because I am not passing along any network parameters. I'm just checking whether there is an existing function to do this.
If I am understanding correctly, you want to compute the L1 loss of your model (as you say in the beginning). However, I think you might have gotten confused with the discussion in the PyTorch forum.
From what I understand of the PyTorch forum thread and the code you posted, the author is trying to normalize the network weights with L1 regularization, i.e. to enforce that the weight values fall in a sensible range (not too big, not too small). That is weight normalization using the L1 norm (that is why it uses model.parameters()). Normalization takes a value as input and produces a normalized value as output.
Check this for weights normalization: https://pytorch.org/docs/master/generated/torch.nn.utils.weight_norm.html
On the other hand, the L1 loss is just a way to determine how two values differ from each other, so the "loss" is just a measure of this difference. In the case of the L1 loss, this error is computed as the mean absolute error, loss = |x - y|, where x and y are the values to compare. So the loss computation takes two values as input and produces one value as output.
Check this for loss computing: https://pytorch.org/docs/master/generated/torch.nn.L1Loss.html
To answer your question: no, the two snippets above are not equivalent, since the first one is trying to do weight normalization while in the second one you are trying to compute a loss. This is what the loss computation would look like with some context:
sample, target = dataset[i]
target_predicted = model(sample)
loss = torch.nn.L1Loss()
loss_value = loss(target_predicted, target)
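For the weight-penalty side of the original snippet, here is a minimal self-contained sketch (my own, using a stand-in nn.Linear model for illustration) of how an L1 penalty over the parameters is typically added to a task loss without going through nn.L1Loss:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                 # stand-in model for illustration
x, y = torch.randn(8, 10), torch.randn(8, 1)

data_loss = nn.functional.mse_loss(model(x), y)    # the task loss
factor = 0.0005
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = data_loss + factor * l1_penalty              # add the L1 weight penalty
loss.backward()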
I have two tensors from which I am calculating Spearman's rank correlation, and I would like PyTorch to automatically adjust the values in these tensors in a way that increases my Spearman's rank correlation as much as possible.
I have explored autograd, but nothing I've found explains it simply enough.
Initialized tensors:
a=Var(torch.randn(20,1),requires_grad=True)
psfm_s=Var(torch.randn(12,20),requires_grad=True)
How can I set up a loop that keeps adjusting the values in these two tensors so as to get the highest Spearman's rank correlation from the two lists I build from them, while having PyTorch do the work? I just need a pointer on where to go. Thank you!
I'm not familiar with Spearman's Rank Correlation, but if I understand your question you're asking how to use PyTorch to solve problems other than deep networks?
If that's the case then I'll provide a simple least squares example which I believe should be informative to your effort.
Consider a set of 200 measurements of 10 dimensional vectors x and y. Say we want to find a linear transform from x to y.
The least squares approach dictates that we can accomplish this by finding the matrix M and vector b which minimize ||y - (Mx + b)||².
The following example code generates some example data and then uses pytorch to perform this minimization. I believe the comments are sufficient to help you understand what is occurring here.
import torch
from torch.nn.parameter import Parameter
from torch import optim
# define some fake data
M_true = torch.randn(10, 10)
b_true = torch.randn(10, 1)
x = torch.randn(200, 10, 1)
noise = torch.matmul(M_true, 0.05 * torch.randn(200, 10, 1))
y = torch.matmul(M_true, x) + b_true + noise
# begin optimization
# define the parameters we want to optimize (using random starting values in this case)
M = Parameter(torch.randn(10, 10))
b = Parameter(torch.randn(10, 1))
# define the optimizer and provide the parameters we want to optimize
optimizer = optim.SGD((M, b), lr=0.1)
for i in range(500):
    # compute loss that we want to minimize
    y_hat = torch.matmul(M, x) + b
    loss = torch.mean((y - y_hat)**2)
    # zero the gradients of the parameters referenced by the optimizer (M and b)
    optimizer.zero_grad()
    # compute new gradients
    loss.backward()
    # update parameters M and b
    optimizer.step()
    if (i + 1) % 100 == 0:
        # scale learning rate by factor of 0.9 every 100 steps
        optimizer.param_groups[0]['lr'] *= 0.9
        print('step', i + 1, 'mse:', loss.item())
# final parameter values (data contains a torch.tensor)
print('Resulting parameters:')
print(M.data)
print(b.data)
print('Compare to the "real" values')
print(M_true)
print(b_true)
Of course this problem has a simple closed-form solution, but this numerical approach is just to demonstrate how to use PyTorch's autograd to solve problems that are not necessarily neural-network related. I also chose to explicitly define the matrix M and vector b here rather than using an equivalent nn.Linear layer, since I think that would just confuse things.
In your case you want to maximize something, so make sure to negate your objective function before calling backward.
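To make that last point concrete, here is a small sketch (my own, not part of the answer) of the negate-and-minimize pattern. Since Spearman's rank itself is not differentiable, a differentiable Pearson correlation stands in as the objective here:

import torch

def pearson(x, y):
    # differentiable Pearson correlation between two 1-D tensors
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / (xc.norm() * yc.norm() + 1e-8)

a = torch.randn(20, requires_grad=True)
b = torch.randn(20, requires_grad=True)
opt = torch.optim.SGD([a, b], lr=0.1)

for _ in range(200):
    loss = -pearson(a, b)  # maximize the correlation by minimizing its negative
    opt.zero_grad()
    loss.backward()
    opt.step()

print('final correlation:', pearson(a, b).item())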
This is my first post here, so I hope it complies with the guidelines and is also interesting for people other than myself.
I am building a CNN autoencoder that takes matrices of fixed size as input, with the goal of getting a lower-dimensional representation of them (I call them hashes here). I want these hashes to be similar when the matrices are similar. Since only a few of my data points are labeled, I want to make the loss function a combination of two separate functions. One part will be the reconstruction error of the autoencoder (this part is working correctly). The other part is for the labeled data: since I will have three different classes, on each batch I want to calculate the distance between hash values belonging to the same class (this is the part I am having trouble implementing).
My effort so far:
X = tf.placeholder(shape=[None, 512, 128, 1], dtype=tf.float32)
class1_indices = tf.placeholder(shape=[None], dtype=tf.int32)
class2_indices = tf.placeholder(shape=[None], dtype=tf.int32)
hashes, reconstructed_output = self.conv_net(X, weights, biases_enc, biases_dec, keep_prob)
class1_hashes = tf.gather(hashes, class1_indices)
class1_cost = self.calculate_within_class_loss(class1_hashes)
class2_hashes = tf.gather(hashes, class2_indices)
class2_cost = self.calculate_within_class_loss(class2_hashes)
loss_all = tf.reduce_sum(tf.square(reconstructed_output - X))
loss_labeled = class1_cost + class2_cost
loss_op = loss_all + loss_labeled
optimizer = tf.train.AdagradOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(loss_op)
Here, calculate_within_class_loss is a separate function that I created. I have currently implemented it only for the difference between the first hash of a class and the other hashes of that class in the same batch; however, I am not happy with this implementation and it looks like it is not working.
def calculate_within_class_loss(self, hash_values):
    first_hash = tf.slice(hash_values, [0, 0], [1, 256])
    total_loss = tf.foldl(lambda d, e: d + tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(e, first_hash)))), hash_values, initializer=0.0)
    return total_loss
So, I have two questions / issues:
Is there an easy way to calculate the distance of every row to all other rows in a tensor?
My current implementation of the within-class distance, even though it only compares the first element with the other elements, gives me a 'nan' when I try to optimize it.
Thanks for your time and help :)
In the sample code, you are calculating the sum of the Euclidean distances between the points.
For this, you would have to loop over the entire dataset and do O(n^2 * m) calculations, requiring O(n^2 * m) space, i.e. TensorFlow graph operations.
Here, n is the number of vectors and m is the size of the hash, i.e. 256.
However, if you could change your objective to the sum of squared Euclidean distances, i.e. the sum over all pairs i < j of ||h_i - h_j||^2, then you can use the nifty relationship between the squared Euclidean distance and the variance and rewrite the same calculation as n * sum_k sum_i (h_i[k] - mu_k)^2, where mu_k is the average value of the kth coordinate for the cluster.
This will allow you to compute the value in O(n * m) time and O(n * m) Tensorflow operations.
This would be the way to go if you think this change (i.e. from Euclidean distance to squared Euclidean distance) will not adversely affect your loss function.
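Here is a sketch of what that could look like in your TensorFlow 1.x graph code (my own rewrite of calculate_within_class_loss, assuming the switch to squared Euclidean distances is acceptable):

import tensorflow as tf

def calculate_within_class_loss(hash_values):
    # hash_values: [n, 256] tensor holding the hashes of one class in the batch.
    # The sum over all pairs i < j of ||h_i - h_j||^2 equals n * sum_i ||h_i - mu||^2,
    # which needs only O(n * m) operations and avoids taking sqrt of an exact zero
    # (the first hash compared with itself), a likely source of the 'nan' gradient.
    n = tf.cast(tf.shape(hash_values)[0], tf.float32)
    mu = tf.reduce_mean(hash_values, axis=0)
    return n * tf.reduce_sum(tf.square(hash_values - mu))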
Here is the code:
import numpy as np
# sigmoid function
def nonlin(x, deriv=False):
    if deriv:
        return x*(1-x)
    return 1/(1+np.exp(-x))
# input dataset
X = np.array([ [0,0,1],
[0,1,1],
[1,0,1],
[1,1,1] ])
# output dataset
y = np.array([[0,0,1,1]]).T
# seed random numbers to make calculation
# deterministic (just a good practice)
np.random.seed(1)
# initialize weights randomly with mean 0
syn0 = 2*np.random.random((3,1)) - 1
for iter in range(10000):
    # forward propagation
    l0 = X
    l1 = nonlin(np.dot(l0, syn0))
    # how much did we miss?
    l1_error = y - l1
    # multiply how much we missed by the
    # slope of the sigmoid at the values in l1
    l1_delta = l1_error * nonlin(l1, True)
    # update weights
    syn0 += np.dot(l0.T, l1_delta)

print("Output After Training:")
print(l1)
Here is the website: http://iamtrask.github.io/2015/07/12/basic-python-network/
In the line l1_delta = l1_error * nonlin(l1, True), the l1 error is multiplied by the derivative of the sigmoid of the input dotted with the weights. I have no idea why this is done and have been spending hours trying to figure it out. I just reached the conclusion that this is wrong, but something is telling me that's probably not right, considering how many people recommend and use this tutorial as a starting point for learning neural networks.
In the article, they say that
Look at the sigmoid picture again! If the slope was really shallow
(close to 0), then the network either had a very high value, or a very
low value. This means that the network was quite confident one way or
the other. However, if the network guessed something close to (x=0,
y=0.5) then it isn't very confident.
I cannot seem to wrap my head around why the highness or lowness of the input into the sigmoid function has anything to do with the confidence. Surely it doesn't matter how high it is, because if the predicted output is low, then it will be really UNconfident, unlike what they said about it being confident just because it's high.
Surely it would just be better to cube the l1_error if you wanted to emphasize the error?
This is a real letdown, considering that up to that point it finally looked like I had found a promising way to really start learning about neural networks intuitively, but yet again I was wrong. If you have a good place to start learning where I can understand things really easily, it would be appreciated.
Look at this image. If the sigmoid function gives you a HIGH or LOW value (pretty good confidence), the derivative of that value is LOW. If you get a value at the steepest slope (0.5), the derivative of that value is HIGH.
When the function gives us a bad prediction, we want to change our weights by a higher number, and on the contrary, if the prediction is good(High confidence), we do NOT want to change our weights much.
First of all, this line is correct:
l1_delta = l1_error * nonlin(l1, True)
The total error from the next layer, l1_error, is multiplied by the derivative of the current layer (here I consider the sigmoid a separate layer to simplify the backpropagation flow). This is called the chain rule.
The quote about "network confidence" may indeed be confusing for a novice learner. What they mean here is probabilistic interpretation of the sigmoid function. Sigmoid (or in general softmax) is very often the last layer in classification problems: sigmoid outputs a value between [0, 1], which can be seen as a probability or confidence of class 0 or class 1.
In this interpretation, sigmoid=0.001 is high confidence of class 0, which corresponds to small gradient and small update to the network, sigmoid=0.999 is high confidence of class 1 and sigmoid=0.499 is low confidence of any class.
Note that in your example, sigmoid is the last layer, so you can look at this network as doing binary classification, hence the interpretation above makes sense.
If you consider a sigmoid activation in the hidden layers, confidence interpretation is more questionable (though one can ask, how confident a particular neuron is). But error propagation formula still holds, because the chain rule holds.
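A quick numeric check (my own, not from the post) of how the sigmoid derivative used in nonlin(l1, True) shrinks for confident outputs:

import numpy as np

l1 = np.array([0.001, 0.5, 0.999])   # confident class 0, unsure, confident class 1
print((l1 * (1 - l1)).round(6))      # [0.000999 0.25     0.000999]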
Surely it would just be better to cube the l1_error if you wanted to
emphasise the error?
Here's an important note. The big success of neural networks over the last several years is, at least partially, due to the use of ReLU instead of sigmoid in the hidden layers, exactly because it's better not to saturate the gradient. This is known as the vanishing gradient problem. So, on the contrary, you generally don't want to emphasise the error in backprop.
I am trying to implement a solution to Ridge regression in Python using Stochastic gradient descent as the solver. My code for SGD is as follows:
import numpy as np
import pandas as pd
from random import shuffle  # assuming shuffle comes from the random module

def fit(self, X, Y):
    # Convert to data frame in case X is a numpy matrix
    X = pd.DataFrame(X)
    # Define a function to calculate the error given a weight vector beta and a training example xi, yi
    # Prepend a column of 1s to the data for the intercept
    X.insert(0, 'intercept', np.array([1.0]*X.shape[0]))
    # Find dimensions of train
    m, d = X.shape
    # Initialize weights to random
    beta = self.initializeRandomWeights(d)
    beta_prev = None
    epochs = 0
    prev_error = None
    while (beta_prev is None or epochs < self.nb_epochs):
        print("## Epoch: " + str(epochs))
        indices = list(range(0, m))
        shuffle(indices)
        for i in indices:  # Pick a training example from a randomly shuffled set
            beta_prev = beta
            xi = X.iloc[i]
            errori = sum(beta*xi) - Y[i]  # Error[i] = sum(beta*x) - y = error of ith training example
            gradient_vector = xi*errori + self.l*beta_prev
            beta = beta_prev - self.alpha*gradient_vector
        epochs += 1
The data I'm testing this on is not normalized, and my implementation always ends up with all the weights being infinity, even though I initialize the weight vector to low values. Only when I set the learning rate alpha to a very small value (~1e-8) does the algorithm end up with valid values for the weight vector.
My understanding is that normalizing/scaling input features only helps reduce convergence time. But the algorithm should not fail to converge as a whole if the features are not normalized. Is my understanding correct?
You can check in scikit-learn's Stochastic Gradient Descent documentation that one of the disadvantages of the algorithm is that it is sensitive to feature scaling. In general, gradient-based optimization algorithms converge faster on normalized data.
Also, normalization is advantageous for regression methods.
The updates to the coefficients during each step depend on the ranges of each feature. In addition, the regularization term is heavily affected by large feature values.
SGD may converge without data normalization, but that depends on the data at hand. Therefore, your assumption is not correct.
Your assumption is not correct.
It's hard to answer this because there are so many different methods/environments, but I will try to mention some points.
Normalization
When a method is not scale-invariant (and I think every linear regression is not), you really should normalize your data.
I take it that you are only skipping this for debugging/analysis purposes.
Normalizing your data is not only relevant for convergence time; the results will differ too (think about the effect within the loss function: features with big values may contribute much more loss than small ones)! A quick sketch of feature standardization follows below.
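As a minimal illustration (my own sketch, not from the answer) of what normalizing the data means here, standardize each feature to zero mean and unit variance before running SGD:

import numpy as np

def standardize(X, eps=1e-12):
    # scale each column (feature) to zero mean and unit variance
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / (sigma + eps)

X = np.random.rand(100, 3) * np.array([1.0, 1e3, 1e6])  # wildly different feature scales
X_scaled = standardize(X)
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))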
Convergence
There is probably a lot to say about the convergence of many methods on normalized vs. non-normalized data, but your case is special:
SGD's convergence theory only guarantees convergence to some local minimum (= the global minimum in your convex optimization problem) for certain choices of hyper-parameters (learning rate and learning schedule/decay).
Even optimizing normalized data can fail with SGD when those parameters are badly chosen!
This is one of the most important downsides of SGD: its dependence on hyper-parameters.
As SGD is based on gradients and step sizes, non-normalized data can have a huge effect on whether this convergence is achieved!
In order for SGD to converge in linear regression, the step size should be smaller than 2/s, where s is the largest singular value of the matrix (see the "Convergence and stability in the mean" section in https://en.m.wikipedia.org/wiki/Least_mean_squares_filter); in the case of ridge regression it should be less than 2*(1 + p/s^2)/s, where p is the ridge penalty.
Normalizing the rows of the matrix (or the gradients) changes the loss function to give each sample an equal weight, and it changes the singular values of the matrix such that you can choose a step size near 1 (see the NLMS section in https://en.m.wikipedia.org/wiki/Least_mean_squares_filter). Depending on your data, this might require smaller step sizes or allow for larger ones; it all depends on whether the normalization increases or decreases the largest singular value of the matrix.
Note that when deciding whether or not to normalize the rows, you shouldn't just think about the convergence rate (which is determined by the ratio between the largest and smallest singular values) or stability in the mean, but also about how it changes the loss function and whether that fits your needs. Sometimes it makes sense to normalize, but sometimes (for example, when you want to give different importance to different samples, or when you think that a larger signal energy means a better SNR) it doesn't.