I am trying to implement a solution to Ridge regression in Python using Stochastic gradient descent as the solver. My code for SGD is as follows:
def fit(self, X, Y):
# Convert to data frame in case X is numpy matrix
X = pd.DataFrame(X)
# Define a function to calculate the error given a weight vector beta and a training example xi, yi
# Prepend a column of 1s to the data for the intercept
X.insert(0, 'intercept', np.array([1.0]*X.shape[0]))
# Find dimensions of train
m, d = X.shape
# Initialize weights to random
beta = self.initializeRandomWeights(d)
beta_prev = None
epochs = 0
prev_error = None
while (beta_prev is None or epochs < self.nb_epochs):
print("## Epoch: " + str(epochs))
indices = range(0, m)
shuffle(indices)
for i in indices: # Pick a training example from a randomly shuffled set
beta_prev = beta
xi = X.iloc[i]
errori = sum(beta*xi) - Y[i] # Error[i] = sum(beta*x) - y = error of ith training example
gradient_vector = xi*errori + self.l*beta_prev
beta = beta_prev - self.alpha*gradient_vector
epochs += 1
The data I'm testing this on is not normalized and my implementation always ends up with all the weights being Infinity, even though I initialize the weights vector to low values. Only when I set the learning rate alpha to a very small value ~1e-8, the algorithm ends up with valid values of the weights vector.
My understanding is that normalizing/scaling input features only helps reduce convergence time. But the algorithm should not fail to converge as a whole if the features are not normalized. Is my understanding correct?
You can check from scikit-learn's Stochastic Gradient Descent documentation that one of the disadvantages of the algorithm is that it is sensitive to feature scaling. In general, gradient based optimization algorithms converge faster on normalized data.
Also, normalization is advantageous for regression methods.
The updates to the coefficients during each step will depend on the ranges of each feature. Also, the regularization term will be affected heavily by large feature values.
SGD may converge without data normalization, but that is subjective to the data at hand. Therefore, your assumption is not correct.
Your assumption is not correct.
It's hard to answer this, because there are so many different methods/environments but i will try to mention some points.
Normalization
When some method is not scale-invariant (i think every linear-regression is not) you really should normalize your data
I take it that you are just ignoring this because of debugging / analyzing
Normalizing your data is not only relevant for convergence-time, the results will differ too (think about the effect within the loss-function; big values might effect in much more loss to small ones)!
Convergence
There is probably much to tell about convergence of many methods on normalized/non-normalized data, but your case is special:
SGD's convergence theory only guarantees convergence to some local-minimum (= global-minimum in your convex-opt problem) for some chosings of hyper-parameters (learning-rate and learning-schedule/decay)
Even optimizing normalized data can fail with SGD when those params are bad!
This is one of the most important downsides of SGD; dependence on hyper-parameters
As SGD is based on gradients and step-sizes, non-normalized data has a possibly huge effect on not achieving this convergence!
In order for sgd to converge in linear regression the step size should be smaller than 2/s where s is the largest singular value of the matrix (see the Convergence and stability in the mean section in https://en.m.wikipedia.org/wiki/Least_mean_squares_filter), in the case of ridge regression it should be less than 2*(1+p/s^2)/s where p is the ridge penalty.
Normalizing rows of the matrix (or gradients) changes the loss function to give each sample an equal weight and it changes the singular values of the matrix such that you can choose a step size near 1 (see the NLMS section in https://en.m.wikipedia.org/wiki/Least_mean_squares_filter). Depending on your data it might require smaller step sizes or allow for larger step sizes. It all depends on whether or not the normalization increases or deacreses the largest singular value of the matrix.
Note that when deciding whether or not to normalize the rows you shouldn't just think about the convergence rate (which is determined by the ratio between the largest and smallest singular values) or stability in the mean, but also about how it changes the loss function and whether or not it fits your needs because of that, sometimes it makes sense to normalize but sometimes (for example when you want to give different importance for different samples or when you think that a larger energy for the signal means better snr) it doesn't make sense to normalize.
Related
What will happen when I use batch normalization but set batch_size = 1?
Because I am using 3D medical images as training dataset, the batch size can only be set to 1 because of GPU limitation. Normally, I know, when batch_size = 1, variance will be 0. And (x-mean)/variance will lead to error because of division by 0.
But why did errors not occur when I set batch_size = 1? Why my network was trained as good as I expected? Could anyone explain it?
Some people argued that:
The ZeroDivisionError may not be encountered because of two cases. First, the exception is caught in a try catch block. Second, a small rational number is added ( 1e-19 ) to the variance term so that it is never zero.
But some people disagree. They said that:
You should calculate mean and std across all pixels in the images of the batch. (So even batch_size = 1, there are still a lot of pixels in the batch. So the reason why batch_size=1 can still work is not because of 1e-19)
I have checked the Pytorch source code, and from the code I think the latter one is right.
Does anyone have different opinion???
variance will be 0
No, it won't; BatchNormalization computes statistics only with respect to a single axis (usually the channels axis, =-1 (last) by default); every other axis is collapsed, i.e. summed over for averaging; details below.
More importantly, however, unless you can explicitly justify it, I advise against using BatchNormalization with batch_size=1; there are strong theoretical reasons against it, and multiple publications have shown BN performance degrade for batch_size under 32, and severely for <=8. In a nutshell, batch statistics "averaged" over a single sample vary greatly sample-to-sample (high variance), and BN mechanisms don't work as intended.
Small mini-batch alternatives: Batch Renormalization -- Layer Normalization -- Weight Normalization
Implementation details: from source code:
reduction_axes = list(range(len(input_shape)))
del reduction_axes[self.axis]
Eventually, tf.nn.monents is called with axes=reduction_axes, which performs a reduce_sum to compute variance. Then, in the TensorFlow backend, mean and variance are passed to tf.nn.batch_normalization to return train- or inference-normalized inputs.
In other words, if your input is (batch_size, height, width, depth, channels), or (1, height, width, depth, channels), then BN will run calculations over the 1, height, width, and depth dimensions.
Can variance ever be zero? - yes, if every single datapoint for any given channel slice (along every dimension) is the same. But this should be near-impossible for real data.
Other answers: first one is misleading:
a small rational number is added (1e-19) to the variance
This doesn't happen in computing variance, but it is added to variance when normalizing; nonetheless, it is rarely necessary, as variance is far from zero. Also, the epsilon term is actually defaulted to 1e-3 by Keras; it serves roles in regularizing, beyond mere avoiding zero-division.
Update: I failed to address an important piece of intuition with suspecting variance to be 0; indeed, the batch statistics variance is zero, since there is only one statistic - but the "statistic" itself concerns the mean & variance of the channel + spatial dimensions. In other words, the variance of the mean & variance (of the single train sample) is zero, but the mean & variance themselves aren't.
when batch_size = 1, variance will be 0
No, because when you compute mean and variance for BN (for example using tf.nn.monents) you will be computing it over axis [0, 1, 2] (assuming you have NHWC tensor channels order).
From "Group Normalization" paper:
https://arxiv.org/pdf/1803.08494.pdf
With batch_size=1 batch normalization is equal to instance normalization and it can be helpful in some tasks.
But if you are using sort of encoder-decoder and in some layer you have tensor with spatial size of 1x1 it will be a problem, because each channel only have only one value and mean of value will be equal to this value, so BN will zero out information.
I'd like to use a neural network to predict a scalar value which is the sum of a function of the input values and a random value (I'm assuming gaussian distribution) whose variance also depends on the input values. Now I'd like to have a neural network that has two outputs - the first output should approximate the deterministic part - the function, and the second output should approximate the variance of the random part, depending on the input values. What loss function do I need to train such a network?
(It would be nice if there was an example with Python for Tensorflow, but I'm also interested in general answers. I'm also not quite clear how I could write something like in Python code - none of the examples I found so far show how to address individual outputs from the loss function.)
You can use dropout for that. With a dropout layer you can make several different predictions based on different settings of which nodes dropped out. Then you can simply count the outcomes and interpret the result as a measure for uncertainty.
For details, read:
Gal, Yarin, and Zoubin Ghahramani. "Dropout as a bayesian approximation: Representing model uncertainty in deep learning." international conference on machine learning. 2016.
Since I've found nothing simple to implement, I wrote something myself, that models that explicitly: here is a custom loss function that tries to predict mean and variance. It seems to work but I'm not quite sure how well that works out in practice, and I'd appreciate feedback. This is my loss function:
def meanAndVariance(y_true: tf.Tensor , y_pred: tf.Tensor) -> tf.Tensor :
"""Loss function that has the values of the last axis in y_true
approximate the mean and variance of each value in the last axis of y_pred."""
y_pred = tf.convert_to_tensor(y_pred)
y_true = math_ops.cast(y_true, y_pred.dtype)
mean = y_pred[..., 0::2]
variance = y_pred[..., 1::2]
res = K.square(mean - y_true) + K.square(variance - K.square(mean - y_true))
return K.mean(res, axis=-1)
The output dimension is twice the label dimension - mean and variance of each value in the label. The loss function consists of two parts: a mean squared error that has the mean approximate the mean of the label value, and the variance that approximates the difference of the value from the predicted mean.
When using dropout to estimate the uncertainty (or any other stochastic regularization method), make sure to also checkout our recent work on providing a sampling-free approximation of Monte-Carlo dropout.
https://arxiv.org/pdf/1908.00598.pdf
We essentially follow ur idea. Treat the activations as random variables and then propagate mean and variance using error propagation to the output layer. Consequently, we obtain two outputs - the mean and the variance.
I want to train my neural network (in Keras) with an additional condition on the output elements.
An example:
Minimize my loss function MSE between network output y_pred and y_true.
Additionally, ensure that the norm of y_pred is less or equal 1.
Without the condition, the task is straightforward.
Note: The condition is not necessarily the vector norm of y_pred.
How can I implement the additional condition/restriction in a Keras (or maybe Tensorflow) model?
In principle, tensorflow (and keras) don't allow you to add hard constraints to your model.
You have to convert your invarient (norm <= 1) to a penalty function, which is added to the loss. This could look like this:
y_norm = tf.norm(y_pred)
norm_loss = tf.where(y_norm > 1, y_norm, 0)
total_loss = mse + norm_loss
Look at the docs of where. If your prediction has a norm bigger than one, backpropagation tries to minimize the norm. If it is less than or equal, this part of the loss is simply 0. No gradient is produced.
But this can be very hard to optimize. Your predictions could oscillate around a norm of 1. It is also possible to add a factor: total_loss = mse + 1000* norm_loss. Be very careful with this, it makes optimization even harder.
In the example above, the norm above one contributes linearly to the loss. This is called l1-regularization. You could also square it, which would become l2-regularization.
In your specific case, you could get creative. Why not normalize your predictions and the targets to one (just a suggestion, might be a bad idea)?
loss = mse(y_pred / tf.norm(y_pred), y_target / np.linalg.norm(y_target)
I have a multi-label problem with ~1000 classes, yet only a handful are selected at a time. When using tf.nn.sigmoid_cross_entropy_with_logits this causes the loss to very quickly approach 0 because there are 990+ 0's being predicted.
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits, labels))
Is it mathematically possible to just multiple the loss by a large constant (say 1000) just so that I can plot loss numbers in tensorboard that I can actually distinguish between? I realize that I could simply multiple the values that I am plotting (without affecting the value that I pass to the train_op) but I am trying to gain a better understanding for whether multiplying the train_op by a constant would have any real effect. For example, I could implement any of the following choices and am trying to think through the potential consequences:
loss = tf.reduce_mean(tf.multiply(tf.nn.sigmoid_cross_entropy_with_logits(logits, labels), 1000.0))
loss = tf.multiply(tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits, labels)), 1000.0)
Would you expect the training results to differ if a constant is introduced like this?
The larger your loss is, the bigger your gradient will be. Therefore, if you multiply your loss by 1000, your gradient step will be big and can lead to divergence. Look into gradient descent and backpropagation to understand this better.
Moreover, reduce_mean compute the mean of all the elements of your tensor. Multiplying before the mean or after is mathematically identical. Your two lines are therefore doing the same thing.
If you want to multiply your loss just to manipulate bigger number to plot them, just create another tensor and multiply it. You'll use your loss for training and multiplied_loss for plotting.
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits, labels))
multiplied_loss = tf.multiply(loss, 1000.0)
optimizer.minimize(loss)
tf.summary.scalar('loss*1000', multiplied_loss)
This code is not enough of course, adapt it to your case.
In Neural Networks, the number of samples used for training data is 5000 and before the data is given for training it was normalized using the formula
y - mean(y)
y' = -----------
stdev(y)
Now I want to de-normalise the data after getting the predicted output. Generally for prediction a test data data is used which is 2000 samples. In order to de-normalize, following formula is used
y = y' * stdev(y) + mean(y)
This approach is taken from the following thread
[How to denormalise (de-standardise) neural net predictions after normalising input data
Could anyone explain me how the same mean and standard deviation used in normalizing the training data(5000*2100) could be used in de-normalizing the predicted data as you know for prediction test data(2000*2100) is used,both the counts are different.
The denormalization equation is simple algebra: it's the same equation as normalization, but solved for y instead of y'. The function is to reverse the normalization process, recovering the "shape" of the original data; that's why you have to use the original stdev and mean.
Normalization is a process of shifting the data to center on 0 (using the mean), and then squeezing the distribution to a standard normal curve (for a new stdev of 1.0). To return to the original shape, you have to un-shift and un-squeeze the same amounts as the original distribution.
Note that we expect the predicted data to have a mean of 0 and a stdev around 1.0 (with some change in variations due to the central tendency theorem). Your worry is not silly: we do have a different population count for the stdev.