Simple L1 loss in PyTorch

I want to calculate L1 loss in a neural network. I came across this example at https://discuss.pytorch.org/t/simple-l2-regularization/139/2, but there are some errors in the code.
Is this really how to calculate L1 loss in a NN, or is there a simpler way?
l1_crit = nn.L1Loss()
reg_loss = 0
for param in model.parameters():
    reg_loss += l1_crit(param)
factor = 0.0005
loss += factor * reg_loss
Is this equivalent in any way to simply doing:
loss = torch.nn.L1Loss()
I assume not, because I am not passing along any network parameters. I am just checking whether there is an existing function to do this.

If I understand correctly, you want to compute the L1 loss of your model (as you say in the beginning). However, I think you might have been confused by the discussion in the PyTorch forum.
From what I understand of the PyTorch forum thread and the code you posted, the author is trying to regularize the network weights with an L1 penalty. The penalty pushes the weight values to stay in a sensible range (not too large) by adding the sum of their absolute values to the loss, which is why the code iterates over model.parameters(). A regularization term like this takes a single input (the weights) and produces a penalty value as output.
For a related but different technique, weight normalization, check this: https://pytorch.org/docs/master/generated/torch.nn.utils.weight_norm.html
On the other hand, L1 loss is just a way to measure how two values differ from each other, so the "loss" is just a measure of this difference. For L1 loss this error is the mean absolute error, loss = mean(|x - y|), where x and y are the values to compare. So computing the error takes two values as input and produces one value as output.
Check this for loss computing: https://pytorch.org/docs/master/generated/torch.nn.L1Loss.html
To answer your question: no, the two snippets are not equivalent. The first one acts on the network weights (regularization), while the second one only creates a loss function that compares predictions with targets. This would be the loss computation with some context:
sample, target = dataset[i]
target_predicted = model(sample)
loss_fn = torch.nn.L1Loss()  # create the criterion once
# Convention: prediction first, target second
loss_value = loss_fn(target_predicted, target)
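To connect the two ideas discussed above, here is a minimal sketch of how an L1 weight penalty could be added to an L1 data loss in PyTorch. The toy model, the random data, and the helper name l1_penalty are assumptions made for illustration; only factor = 0.0005 comes from the question. Note that this version also penalizes the bias parameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model and data, only for illustration.
model = nn.Linear(4, 1)
sample = torch.randn(8, 4)
target = torch.randn(8, 1)

loss_fn = nn.L1Loss()   # mean absolute error between prediction and target
factor = 0.0005         # regularization strength, taken from the question


def l1_penalty(module):
    """Sum of absolute values of all parameters (the L1 regularization term)."""
    return sum(p.abs().sum() for p in module.parameters())


data_loss = loss_fn(model(sample), target)
reg_loss = l1_penalty(model)
loss = data_loss + factor * reg_loss
loss.backward()   # gradients now include the penalty's contribution
```

The penalty needs no target because it depends only on the parameters, which is why nn.L1Loss (a two-argument criterion) is not the natural tool for it.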

Related

Is there a way to improve DNN for linear regression?

I'm creating a Deep Neural Network for linear regression. The net has 3 hidden layers with 256 units per layer. Here is the model:
Each unit has ReLU as activation function. I also used Early Stopping to make sure it doesn't overfit.
The target is an integer, and in the training set its values range from 0 to 7860.
After training I got the following losses:
train_MSE = 33640.5703, train_MAD = 112.6294,
val_MSE = 53932.8125, val_MAD = 138.7836,
test_MSE = 52595.9414, test_MAD= 137.2564
I've tried many different configurations of the net (different optimizers, loss functions, normalizations, regularizers...), but nothing seems to reduce the loss any further. Even when the training error decreases, the test error never goes below MAD = 130.
Here's the behavior of my net:
My question is whether there is a way to improve my DNN to make more accurate predictions, or whether this is the best I can achieve with my dataset.

If your problem is linear by nature, meaning the real function behind your data is of the form y = a*x + b + epsilon, where the last term is just random noise, you won't do any better than fitting the underlying function y = a*x + b. Fitting epsilon would only result in a loss of generalization on new data.
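The point about the noise term can be illustrated with a small simulation. This is only a sketch: the values of a_true, b_true and sigma are made up, and the fit uses closed-form least squares rather than a neural network. Once the line is recovered, the remaining mean squared error sits close to the noise variance, and no model can push the error on new data below that.

```python
import random

random.seed(0)
a_true, b_true, sigma = 2.0, 1.0, 3.0   # hypothetical values for illustration

xs = [random.uniform(0, 10) for _ in range(5000)]
ys = [a_true * x + b_true + random.gauss(0, sigma) for x in xs]

# Closed-form ordinary least squares for a single feature.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
var_x = sum((x - mean_x) ** 2 for x in xs)
a_hat = cov_xy / var_x
b_hat = mean_y - a_hat * mean_x

# Error that remains after fitting the line: roughly the noise variance.
mse = sum((a_hat * x + b_hat - y) ** 2 for x, y in zip(xs, ys)) / n
print(f"a={a_hat:.3f} b={b_hat:.3f} mse={mse:.3f} noise variance={sigma ** 2:.3f}")
```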

You can try many different things to improve a DNN:
Increase the number of hidden layers
Scale or normalize your data
Try the rectified linear unit (ReLU) as the activation
Collect more data
Change learning algorithm parameters such as the learning rate

Compute the Loss of L1 and L2 regularization

How do I calculate the L1 and L2 regularization losses in Python, where w is the vector of weights of a linear model?
The regularizers should compute the loss without considering the bias term in the weights.
def l1_reg(w):
    # TO-DO: Add your code here
    return None
def l2_reg(w):
    # TO-DO: Add your code here
    return None

Why use regularization
While training a model you want the highest accuracy possible, so you might be tempted to use every correlated feature (column, predictor, vector). But when the dataset is not big enough relative to the number of features (i.e. the number of features n is much larger than the number of samples m), this causes what is called overfitting. Overfitting means that the model performs very well on the training set but fails on the test set (training accuracy is much better than test accuracy). Think of it as being able to solve a problem you have solved before, but failing on a similar (not identical) problem because you are overthinking it. Regularization is meant to solve this problem.
Regularization
Let's first explain the logic behind regularization.
Regularization is the process of adding information to the model.
[Think of it this way: before giving you another problem, I add more information to the first one so that you can categorize it, and you no longer overthink when you meet a similar problem.]
This image shows an overfitted model and an accurate model.
L1 and L2 are the types of information added to the model equation.
L1 Regularization
In L1, the term added to the model equation is the sum of the absolute values of the parameter vector (θ), multiplied by the regularization parameter (λ), a positive number whose size controls how strongly the term is applied, and divided by the size of the data (m), where n is the number of features.
L2 Regularization
In L2, the term added to the model equation is the sum of the squared elements of the vector (θ), multiplied by the regularization parameter (λ) and divided by the size of the data (m), where n is the number of features.
When using the normal equation
The L2 regularization term becomes an (n+1)x(n+1) diagonal matrix with a zero in the upper-left entry and ones in the other diagonal entries, multiplied by the regularization parameter (λ).
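The formulas that accompanied this answer did not survive, so the following is a reconstruction from the description above, not the author's original notation. It assumes that θ_0 is the bias and is left out of the sums, that X is the design matrix with a leading column of ones, and that y is the target vector. Some texts divide the L2 term by 2m instead of m.

```latex
% L1 penalty added to the cost
\frac{\lambda}{m} \sum_{j=1}^{n} \lvert \theta_j \rvert

% L2 penalty added to the cost
\frac{\lambda}{m} \sum_{j=1}^{n} \theta_j^{2}

% Normal equation with L2 regularization
\theta = \left( X^{\top} X + \lambda M \right)^{-1} X^{\top} y,
\qquad
M = \operatorname{diag}(0, 1, \dots, 1) \in \mathbb{R}^{(n+1) \times (n+1)}
```

The zero in the upper-left entry of M is what keeps the bias term unregularized.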

I think it is important to clarify this before answering: the L1 and L2 regularization terms aren't loss functions. They help to control the weights in the vector so that they don't become too large and can reduce overfitting.
L1 regularization term is the sum of absolute values of each element. For a length N vector, it would be |w[1]| + |w[2]| + ... + |w[N]|.
L2 regularization term is the sum of squared values of each element. For a length N vector, it would be w[1]² + w[2]² + ... + w[N]². I hope this helps!

import numpy as np
def calculateL1(vector):
    # L1 term: sum of the absolute values
    return np.sum(np.abs(vector))
def calculateL2(vector):
    # L2 term: sum of the squared values (assumes a 1-D vector)
    return np.dot(vector, vector)
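The question also asks for the bias term to be left out, which the answers above do not cover. A dependency-free sketch could look like the following. It assumes that the bias is stored as the first element of w; that layout is not stated in the question, so adjust the slice if yours differs.

```python
def l1_reg(w):
    """L1 term: sum of absolute values of the weights, skipping the bias."""
    # Assumption: w[0] is the bias term.
    return sum(abs(x) for x in w[1:])


def l2_reg(w):
    """L2 term: sum of squared weights, skipping the bias."""
    # Assumption: w[0] is the bias term.
    return sum(x * x for x in w[1:])


w = [0.5, -1.0, 2.0, -3.0]   # hypothetical weights, bias 0.5 comes first
print(l1_reg(w))  # 6.0
print(l2_reg(w))  # 14.0
```

The same functions also accept a 1-D NumPy array, since they only rely on slicing and iteration.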

How can I predict the expected value and the variance simultaneously with a neural network?

I'd like to use a neural network to predict a scalar value which is the sum of a function of the input values and a random value (I'm assuming a Gaussian distribution) whose variance also depends on the input values. Now I'd like to have a neural network that has two outputs: the first output should approximate the deterministic part (the function), and the second output should approximate the variance of the random part, depending on the input values. What loss function do I need to train such a network?
(It would be nice if there was an example in Python for TensorFlow, but I'm also interested in general answers. I'm also not quite clear on how I could write something like that in Python code: none of the examples I have found so far show how to address individual outputs from the loss function.)

You can use dropout for that. With a dropout layer kept active at prediction time, you can make several different predictions for the same input, each with a different random set of dropped nodes. You can then collect the outcomes and interpret their spread as a measure of uncertainty.
For details, read:
Gal, Yarin, and Zoubin Ghahramani. "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning." International Conference on Machine Learning, 2016.
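The sampling idea can be sketched without any framework. The snippet below is only an illustration: the "network" is a single hypothetical linear unit with made-up weights, dropout is applied to its inputs, and keep_prob is an arbitrary choice. The mean of the sampled predictions serves as the estimate and their variance as a rough uncertainty measure.

```python
import random
import statistics

random.seed(0)

weights = [0.8, -0.5, 1.2, 0.3]   # hypothetical trained weights
x = [1.0, 2.0, 0.5, -1.0]         # hypothetical input
keep_prob = 0.8                   # probability of keeping each unit


def predict_with_dropout(x, weights, keep_prob):
    """One stochastic forward pass: each input unit is dropped with probability 1 - keep_prob."""
    total = 0.0
    for xi, wi in zip(x, weights):
        if random.random() < keep_prob:
            total += xi * wi / keep_prob   # inverted dropout scaling
    return total


# Repeat the stochastic pass many times and summarize the outcomes.
samples = [predict_with_dropout(x, weights, keep_prob) for _ in range(2000)]
mean = statistics.mean(samples)
variance = statistics.variance(samples)
print(f"mean={mean:.3f} variance={variance:.3f}")
```

In a real framework the equivalent step is to run the model several times with dropout still enabled and compute the same two statistics per output.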

Since I found nothing simple to implement, I wrote something myself that models this explicitly: here is a custom loss function that tries to predict mean and variance. It seems to work, but I'm not quite sure how well it works out in practice, and I'd appreciate feedback. This is my loss function:
import tensorflow as tf
def meanAndVariance(y_true: tf.Tensor, y_pred: tf.Tensor) -> tf.Tensor:
    """Loss function that has the values of the last axis in y_pred
    approximate the mean and variance of each value in the last axis of y_true."""
    y_pred = tf.convert_to_tensor(y_pred)
    y_true = tf.cast(y_true, y_pred.dtype)
    mean = y_pred[..., 0::2]      # even positions: predicted means
    variance = y_pred[..., 1::2]  # odd positions: predicted variances
    squared_error = tf.square(mean - y_true)
    res = squared_error + tf.square(variance - squared_error)
    return tf.reduce_mean(res, axis=-1)
The output dimension is twice the label dimension: a mean and a variance for each value in the label. The loss function consists of two parts: a squared error that makes the predicted mean approximate the label value, and a second squared error that makes the predicted variance approximate the squared difference between the label and the predicted mean.
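For comparison, a commonly used alternative to the squared-error construction above is the Gaussian negative log-likelihood. This is a different technique from the author's loss, shown here only as a plain-Python sketch for a single scalar. The function name and the choice to predict the log-variance (which keeps the variance positive) are assumptions for illustration.

```python
import math


def gaussian_nll(y_true, mean, log_var):
    """Negative log-likelihood of y_true under N(mean, exp(log_var)), constant term dropped."""
    # Predicting the log-variance keeps the variance positive without extra constraints.
    return 0.5 * (log_var + (y_true - mean) ** 2 / math.exp(log_var))


# For a fixed error, the loss is smallest when the variance equals the squared error.
y, mu = 3.0, 1.0   # hypothetical label and predicted mean, squared error = 4
for var in (1.0, 4.0, 16.0):
    print(var, round(gaussian_nll(y, mu, math.log(var)), 4))
```

Too small a variance is punished by the error term and too large a variance by the log term, so the second output is pulled toward the actual spread of the data.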

When using dropout (or any other stochastic regularization method) to estimate the uncertainty, make sure to also check out our recent work on a sampling-free approximation of Monte-Carlo dropout.
https://arxiv.org/pdf/1908.00598.pdf
We essentially follow your idea: treat the activations as random variables, then propagate mean and variance to the output layer using error propagation. Consequently, we obtain two outputs: the mean and the variance.

Incorporate side conditions into Keras neural network

I want to train my neural network (in Keras) with an additional condition on the output elements.
An example:
Minimize my loss function MSE between network output y_pred and y_true.
Additionally, ensure that the norm of y_pred is less than or equal to 1.
Without the condition, the task is straightforward.
Note: The condition is not necessarily the vector norm of y_pred.
How can I implement the additional condition/restriction in a Keras (or maybe Tensorflow) model?

In principle, TensorFlow (and Keras) don't allow you to add hard constraints to your model.
You have to convert your invariant (norm <= 1) into a penalty function, which is added to the loss. This could look like this:
y_norm = tf.norm(y_pred)
# Penalize only when the norm exceeds 1; otherwise contribute 0
norm_loss = tf.where(y_norm > 1, y_norm, tf.zeros_like(y_norm))
total_loss = mse + norm_loss
See the documentation of tf.where. If your prediction has a norm bigger than one, backpropagation tries to minimize the norm. If it is less than or equal to one, this part of the loss is simply 0 and produces no gradient.
But this can be very hard to optimize. Your predictions could oscillate around a norm of 1. It is also possible to add a factor: total_loss = mse + 1000 * norm_loss. Be very careful with this, as it makes optimization even harder.
In the example above, a norm above one contributes linearly to the loss, similar to L1 regularization. You could also square it, which would be similar to L2 regularization.
In your specific case, you could get creative. Why not normalize your predictions and the targets to a norm of one (just a suggestion, might be a bad idea)?
loss = mse(y_pred / tf.norm(y_pred), y_target / np.linalg.norm(y_target))
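The penalty idea can also be written so that the extra loss starts at zero exactly on the boundary instead of jumping to the full norm. The following plain-Python sketch shows such a hinge-style variant; it is not the snippet from the answer, and the names norm_penalty, total_loss, limit and weight are hypothetical.

```python
import math


def norm_penalty(y_pred, limit=1.0):
    """Hinge-style penalty: zero inside the constraint, grows linearly outside it."""
    norm = math.sqrt(sum(v * v for v in y_pred))
    return max(0.0, norm - limit)


def total_loss(y_pred, y_true, weight=1.0):
    """Mean squared error plus the weighted constraint penalty."""
    mse = sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_pred)
    return mse + weight * norm_penalty(y_pred)


print(norm_penalty([0.6, 0.0]))   # 0.0, norm 0.6 satisfies the constraint
print(norm_penalty([3.0, 4.0]))   # 4.0, norm 5.0 exceeds the limit by 4.0
```

In TensorFlow the same penalty could presumably be expressed as tf.maximum(tf.norm(y_pred) - 1.0, 0.0), and squaring it gives the quadratic variant mentioned above.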

Cross entropy loss suddenly increases to infinity

I am attempting to replicate a deep convolutional neural network from a research paper. I have implemented the architecture, but after 10 epochs my cross entropy loss suddenly increases to infinity. This can be seen in the chart below. You can ignore what happens to the accuracy after the problem occurs.
Here is the github repository with a picture of the architecture
After doing some research, I think using AdamOptimizer or ReLU might be the problem.
x = tf.placeholder(tf.float32, shape=[None, 7168])
y_ = tf.placeholder(tf.float32, shape=[None, 7168, 3])
#Many Convolutions and Relus omitted
final = tf.reshape(final, [-1, 7168])
keep_prob = tf.placeholder(tf.float32)
W_final = weight_variable([7168,7168,3])
b_final = bias_variable([7168,3])
final_conv = tf.tensordot(final, W_final, axes=[[1], [1]]) + b_final
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=final_conv))
train_step = tf.train.AdamOptimizer(1e-5).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(final_conv, 2), tf.argmax(y_, 2))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
EDIT
If anyone is interested, the solution was that I was basically feeding in incorrect data.

Solution: Control the solution space. This might mean using smaller datasets when training, using fewer hidden nodes, or initializing your weights and biases differently. Your model is reaching a point where the loss is undefined, which might be due to the gradient or the final_conv signal being undefined.
Why: Sometimes a numerical instability is reached no matter what. Eventually, adding a machine epsilon to prevent division by zero (in the cross entropy loss here) just won't help, because even then the number cannot be accurately represented at the precision you are using. (Ref: https://en.wikipedia.org/wiki/Round-off_error and https://floating-point-gui.de/basic/)
Considerations:
1) When tweaking epsilons, be sure to be consistent with your data type. Use the machine epsilon of the precision you are using; for float32, as in your case, it is about 1.19e-7 (ref: https://en.wikipedia.org/wiki/Machine_epsilon and python numpy machine epsilon).
2) Just in case others reading this are confused: the value in the constructor for AdamOptimizer is the learning rate, but you can also set the epsilon value (ref: How does paramater epsilon affects AdamOptimizer? and https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer)
3) Numerical instability in TensorFlow is real and difficult to get around. Yes, there is tf.nn.softmax_cross_entropy_with_logits, but this is too specific (what if you don't want a softmax?). Refer to Vahid Kazemi's 'Effective Tensorflow' for an insightful explanation: https://github.com/vahidk/EffectiveTensorflow#entropy
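The machine epsilon mentioned in the first consideration can be checked directly. This sketch uses only the standard library and rounds values to float32 through struct; the helper name to_float32 is made up for illustration.

```python
import struct
import sys


def to_float32(value):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


eps32 = 2.0 ** -23              # spacing between 1.0 and the next float32
eps64 = sys.float_info.epsilon  # same quantity for float64, equal to 2 ** -52

print(f"float32 epsilon: {eps32:.3e}")
print(f"float64 epsilon: {eps64:.3e}")

# Adding a full epsilon to 1.0 is visible in float32, adding half of it is not.
print(to_float32(1.0 + eps32) > 1.0)        # True
print(to_float32(1.0 + eps32 / 2) == 1.0)   # True
```

With NumPy available, np.finfo(np.float32).eps reports the same float32 value.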

That jump in your loss graph is very weird...
I would like you to focus on a few points:
If your images are not normalized between 0 and 1, then normalize them
If you have normalized your values between -1 and 1, then use a sigmoid layer instead of softmax, because softmax squashes the values between 0 and 1
Before using softmax, add a sigmoid layer to squash your values (highly recommended)
Another thing you can do is add dropout to every layer
I would also suggest gradient clipping (for example tf.clip_by_norm or tf.clip_by_global_norm) so that your gradients do not explode
You can also use L2 regularization
Experiment with the learning rate and epsilon of AdamOptimizer
I would also suggest using TensorBoard to keep track of the weights, so that you can see where the weights are exploding
You can also use TensorBoard to keep track of loss and accuracy
See the softmax formula: softmax(x_i) = exp(x_i) / sum_j exp(x_j)
Probably x in the term e to the power of x is a very large number, because of which softmax gives infinity and hence the loss is infinity
Use TensorBoard heavily to debug, and print the values of the softmax so that you can figure out where things go wrong
One more thing I noticed: you are not using any kind of activation function after the convolution layers... I would suggest using leaky ReLU after every convolution layer
Your network is a humongous network, and it is important to use leaky ReLU as the activation function so that it adds non-linearity and hence improves the performance
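The overflow in the exponential described above can be reproduced and avoided in a few lines. The sketch below is plain Python and not the TensorFlow implementation: it compares a naive cross entropy with a version that uses the log-sum-exp trick (subtracting the largest logit before exponentiating). The function names and the extreme logits are chosen for illustration.

```python
import math


def naive_cross_entropy(logits, label):
    """Cross entropy computed directly from the softmax formula."""
    exps = [math.exp(z) for z in logits]   # overflows for large logits
    return -math.log(exps[label] / sum(exps))


def stable_cross_entropy(logits, label):
    """Cross entropy from logits using the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


logits = [1000.0, 0.0, -1000.0]   # hypothetical extreme logits
print(stable_cross_entropy(logits, 1))   # 1000.0
try:
    naive_cross_entropy(logits, 1)
except OverflowError as err:
    print("naive version failed:", err)
```

The stable version returns a large but finite loss, which is easier to debug than an infinite or undefined one.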

You may want to use a different value for epsilon in the Adam optimizer (e.g. 0.1 to 1.0). This is mentioned in the documentation:
The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1.
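To see why epsilon matters, here is a sketch of a single Adam update for one scalar parameter, following the update rule from the Adam paper. It is not TensorFlow's implementation, which uses a slightly different but related formulation, and the gradient value is made up. With a tiny gradient, the default epsilon lets the step stay close to the full learning rate, while a larger epsilon damps it strongly.

```python
import math


def adam_step(grad, m, v, t, lr=1e-5, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update for a single scalar parameter; returns (step, m, v)."""
    m = beta1 * m + (1 - beta1) * grad           # first moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad    # second moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    step = lr * m_hat / (math.sqrt(v_hat) + epsilon)
    return step, m, v


tiny_grad = 1e-6   # hypothetical very small gradient
for eps in (1e-8, 0.1, 1.0):
    step, _, _ = adam_step(tiny_grad, m=0.0, v=0.0, t=1, epsilon=eps)
    print(f"epsilon={eps:g} step={step:.3e}")
```

Because Adam divides by the square root of the second moment, very small gradients are amplified unless epsilon is large enough to dominate the denominator.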
