Tensorflow Optimizers - multiple loss values passed to minimize()? - python

The first time I used TensorFlow on the MNIST dataset, I had a really simple bug: I forgot to take the mean of my error values before passing them to the optimizer.
In other words, instead of
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y, labels=y_))
I accidentally used
loss = tf.nn.softmax_cross_entropy_with_logits(logits=y, labels=y_)
Not taking the mean or sum of the error values threw no errors when training the network, however. This led me to wonder: is there actually a case where someone would need to pass multiple loss values into an optimizer? And what was happening when I passed a Tensor that was not of size [1] into minimize()?

They are being added up. This is a by-product of TensorFlow using reverse-mode automatic differentiation, which requires the loss to be a scalar.
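As a minimal TF 1.x sketch of that behaviour (the variable and its values are made up), differentiating a non-scalar tensor gives the same gradients as differentiating its sum:
import tensorflow as tf
x = tf.Variable([1.0, 2.0])
loss_vec = x * x  # non-scalar "loss", shape [2]
# tf.gradients sums over the elements of ys, so these two gradients are identical
g_vec = tf.gradients(loss_vec, x)[0]
g_sum = tf.gradients(tf.reduce_sum(loss_vec), x)[0]
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([g_vec, g_sum]))  # both print [2. 4.]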

Related

Custom metrics in tensorflow

I'm running a sequence-to-sequence model in TensorFlow, for which I need to extensively pad my input data (the samples' lengths vary). As a result, any metric I calculate is heavily biased: if the true sample is ~10% of the input to the model, the errors appearing there are somewhat hidden by "correct" predictions on the padded part.
Thus, I'd like to calculate "true" metrics (accuracy, AUC or whatever) that take into account the real sample only. In numpy-ish code, I'd like to do something like this:
def adjusted_metrics(y_true, y_pred):
    # y is padded with 0 and there is another (non-zero) value at the end of the real y
    last_index = np.nonzero(y_true)[0][-1]
    return AUC(y_true[:last_index], y_pred[:last_index])
But, I'm pretty new to tensorflow and:
I can't do that in TensorFlow code. In particular, I'm not able to find the index of the last non-zero element of y_true when it is a Tensor. I tried casting to numpy using tensorflow.experimental.numpy (no effect, it still appears as a Tensor?) and calling .numpy() on the tensor (not working, despite the fact that I don't have eager execution disabled). I tried masking, but it's hard for me to find the mask dimension, also because of the following point:
All my attempts also seem inappropriate in the context of batches: y_true and y_pred are of shape (None, max_length). I suppose the calculation in batches is governed by my model, but I have no idea how (and whether it's possible) to change the metric calculation to be done per sample while keeping the whole learning process in batches.
Any advice? :)
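One rough sketch of the masking idea from the question, in TF 2.x (assumptions: 0 is the padding value, each batch is flattened rather than handled per sample, and tf.keras.metrics.AUC is just an example metric):
import tensorflow as tf
def masked_auc(y_true, y_pred, pad_value=0):
    # keep only the positions where the target is not padding
    mask = tf.not_equal(y_true, pad_value)
    auc = tf.keras.metrics.AUC()
    auc.update_state(tf.boolean_mask(y_true, mask),
                     tf.boolean_mask(y_pred, mask))
    return auc.result()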

Is there any way to copy all parameters of one Pytorch model to another specially Batch Normalization mean and std?

I have found many correct ways online to copy one PyTorch model's parameters to another, but somehow the copy-paste operation always misses the batch normalization parameters. Everything works fine as long as I only use modules such as Conv2d, Linear, Dropout, MaxPool etc. in my model. But as soon as I add batch normalization to the PyTorch model, the script below stops working and the accuracy at test time is different:
net = model()
copy_net = model()
copy_param = []
for param in net.module.parameters():
    copy_param.append(param.clone().detach())
count = 0
for param in copy_net.module.parameters():
    param.data = copy_param[count]
    param.requires_grad = False
    count = count + 1
Can anybody give me a possible solution for copying the batch normalization parameters as well?
net.load_state_dict(copy_net.state_dict()) should work.
As per #dxtx: in PyTorch's philosophy, the state dict should cover all the state in a 'module'. E.g. in the batch norm module, the running mean and var, if I remember correctly, should be part of the state dict.
But if you wrote a module like batch norm yourself, you would have to override the 'state_dict' method.
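A minimal sketch of the state_dict approach (the module layout here is made up; the point is that running_mean and running_var are buffers, so they appear in state_dict() but not in parameters()):
import torch.nn as nn
net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8))
copy_net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8))
# copies weights, biases, and the batch norm running statistics in one call
copy_net.load_state_dict(net.state_dict())
# parameters() would miss these buffers:
print([name for name, _ in net.named_buffers()])
# e.g. ['1.running_mean', '1.running_var', '1.num_batches_tracked']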

How to skip redundant forward prop during RL training step in TensorFlow

I have a TensorFlow question regarding reinforcement learning. I have everything working and training, but there is something that feels redundant. I want to point it out and hear your thoughts:
Let's assume something simple like episodic REINFORCE. Given this standard setup:
state -> network -> logits
When I want to train (when the episode is complete), I need to:
pass in an array of states (saved from running the episode) to a TF Placeholder
do a forward pass with those saved states to produce logits
compute log_probs (using saved array of actions)
compute loss (using saved array of advantages)
This works fine. However, what seems redundant is steps 1 & 2. I'd prefer to calculate log_probs during each step of the episode while it is being run. That way I wouldn't have to do steps 1, 2, 3 during training, and the forward pass would only be performed once (during the episode). I'd have my log_probs calculated by the time the episode was over.
However, if I create placeholders for log_probs and advantages, and don't pass in states (for the redundant forward prop), then I don't know how to get TF to know where the variables are for backprop. I get the error:
ValueError: No gradients provided for any variable
So my questions are:
if I'm passing in states, is it true that forward prop is being run again during training?
can I prevent this by my method above, finding some way to tell TF where the gradients are?
In case anyone wants to see actual code (I tried to be clear enough not to need it), here is a gist of the script in question
EDIT: I think the answer has something to do with using optimizer.compute_gradients (where I can pass in variables) and optimizer.apply_gradients, but I'm not sure how yet...
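For reference, the split mentioned in the edit looks roughly like this in TF 1.x (the optimizer choice and learning rate are placeholders); it computes the same thing as minimize(), just with the gradient list exposed in between:
optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
grads_and_vars = optimizer.compute_gradients(loss)  # list of (gradient, variable) pairs
train_op = optimizer.apply_gradients(grads_and_vars)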

What is the meaning of 'self.diff' in 'forward' of a custom python loss layer for Caffe training?

I am trying to use a custom Python loss layer. When I checked several examples online, such as the Euclidean loss layer and the Dice loss layer, I noticed that a variable 'self.diff' is always assigned in 'forward'. For example, in the Dice loss layer,
self.diff[...] = bottom[1].data
I wonder if there is any reason that this variable has to be introduced in forward, or whether I can just use bottom[1].data to access the ground-truth label.
In addition, what is the point of top[0].reshape(1) in reshape, since by the definition in forward the loss output is a scalar anyway?
You need to set the diff attribute of the layer for overall consistency and the data communication protocol; it is available in other places in the class (most importantly, it is reused in backward to compute the gradient) and anywhere the loss layer object appears. bottom is a local parameter and is not available elsewhere in the same form.
In general, the code is expandable for a variety of applications and more complex computations; the reshaping is part of this, ensuring that the returned value is scalar, even if someone expands the inputs to work with vectors or matrices.
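For reference, the Euclidean loss Python layer from the Caffe examples looks roughly like this (reproduced from memory, so treat the details as approximate); note how the diff stored in forward is reused in backward:
import caffe
import numpy as np
class EuclideanLossLayer(caffe.Layer):
    def setup(self, bottom, top):
        if len(bottom) != 2:
            raise Exception("Need two inputs (prediction and label).")
    def reshape(self, bottom, top):
        # the difference blob has the same shape as the inputs
        self.diff = np.zeros_like(bottom[0].data, dtype=np.float32)
        # the loss output is a single scalar blob
        top[0].reshape(1)
    def forward(self, bottom, top):
        self.diff[...] = bottom[0].data - bottom[1].data
        top[0].data[...] = np.sum(self.diff ** 2) / bottom[0].num / 2.
    def backward(self, top, propagate_down, bottom):
        # the diff cached in forward is what makes the gradient cheap here
        for i in range(2):
            if not propagate_down[i]:
                continue
            sign = 1 if i == 0 else -1
            bottom[i].diff[...] = sign * self.diff / bottom[i].num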

NaN loss when training regression network

I have a data matrix in "one-hot encoding" (all ones and zeros) with 260,000 rows and 35 columns. I am using Keras to train a simple neural network to predict a continuous variable. The code to make the network is the following:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD, RMSprop
from keras.callbacks import EarlyStopping
model = Sequential()
model.add(Dense(1024, input_shape=(n_train,)))
model.add(Activation('relu'))
model.add(Dropout(0.1))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.1))
model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.1))
model.add(Dense(1))
sgd = SGD(lr=0.01, nesterov=True)
#rms = RMSprop()
#model.compile(loss='categorical_crossentropy', optimizer=rms, metrics=['accuracy'])
model.compile(loss='mean_absolute_error', optimizer=sgd)
model.fit(X_train, Y_train, batch_size=32, nb_epoch=3, verbose=1,
          validation_data=(X_test, Y_test),
          callbacks=[EarlyStopping(monitor='val_loss', patience=4)])
However, during the training process, I see the loss decrease nicely, but during the middle of the second epoch, it goes to nan:
Train on 260000 samples, validate on 64905 samples
Epoch 1/3
260000/260000 [==============================] - 254s - loss: 16.2775 - val_loss: 13.4925
Epoch 2/3
88448/260000 [=========>....................] - ETA: 161s - loss: nan
I tried using RMSProp instead of SGD, I tried tanh instead of relu, I tried with and without dropout, all to no avail. I tried with a smaller model, i.e. with only one hidden layer, and had the same issue (it becomes nan at a different point). However, it does work with fewer features, i.e. if there are only 5 columns, and gives quite good predictions. It seems like there is some kind of overflow, but I can't imagine why; the loss is not unreasonably large at all.
Python version 2.7.11, running on a Linux machine, CPU only. I tested it with the latest version of Theano and also got NaNs, so I tried going back to Theano 0.8.2 and had the same problem. The latest version of Keras has the same problem, and so does the 0.3.2 version.
Regression with neural networks is hard to get working because the output is unbounded, so you are especially prone to the exploding gradients problem (the likely cause of the nans).
Historically, one key solution to exploding gradients was to reduce the learning rate, but with the advent of per-parameter adaptive learning rate algorithms like Adam, you no longer need to set a learning rate to get good performance. There is very little reason to use SGD with momentum anymore unless you're a neural network fiend and know how to tune the learning schedule.
Here are some things you could potentially try:
Normalize your outputs by quantile normalizing or z scoring (a sketch of the z-scoring variant follows this list). To be rigorous, compute this transformation on the training data, not on the entire dataset. For example, with quantile normalization, if an example is in the 60th percentile of the training set, it gets a value of 0.6. (You can also shift the quantile normalized values down by 0.5 so that the 0th percentile is -0.5 and the 100th percentile is +0.5.)
Add regularization, either by increasing the dropout rate or adding L1 and L2 penalties to the weights. L1 regularization is analogous to feature selection, and since you said that reducing the number of features to 5 gives good performance, L1 may help here too.
If these still don't help, reduce the size of your network. This is not always the best idea since it can harm performance, but in your case you have a large number of first-layer neurons (1024) relative to input features (35) so it may help.
Increase the batch size from 32 to 128. 128 is fairly standard and could potentially increase the stability of the optimization.
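As a rough sketch of the z-scoring suggestion above, with variable names taken from the question's fit() call and the statistics computed on the training split only:
import numpy as np
y_mean, y_std = np.mean(Y_train), np.std(Y_train)
Y_train_scaled = (Y_train - y_mean) / y_std
Y_test_scaled = (Y_test - y_mean) / y_std  # reuse the training statistics
# predictions can be mapped back with: y_pred * y_std + y_mean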
The answer by 1" is quite good. However, all of the fixes seem to address the issue indirectly rather than directly. I would recommend using gradient clipping, which will clip any gradients that are above a certain value.
In Keras you can use clipnorm=1 (see https://keras.io/optimizers/) to simply clip all gradients with a norm above 1.
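With the optimizer from the question, that might look like this (a sketch using the old keras.optimizers API):
from keras.optimizers import SGD
sgd = SGD(lr=0.01, nesterov=True, clipnorm=1.)  # clip gradient norms at 1
model.compile(loss='mean_absolute_error', optimizer=sgd)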
I faced the same problem before. I searched and found this question and its answers. All the tricks mentioned above are important for training a deep neural network. I tried them all, but still got NaN.
I also found this issue: https://github.com/fchollet/keras/issues/2134.
I cited the author's summary as follows:
I wanted to point this out so that it's archived for others who may experience this problem in the future. I was running into my loss function suddenly returning a nan after it got so far into the training process. I checked the relus, the optimizer, the loss function, my dropout in accordance with the relus, the size of my network and the shape of the network. I was still getting loss that eventually turned into a nan and I was getting quite frustrated.
Then it dawned on me. I may have some bad input. It turns out, one of the images that I was handing to my CNN (and doing mean normalization on) was nothing but 0's. I wasn't checking for this case when I subtracted the mean and normalized by the std deviation, and thus I ended up with an exemplar matrix which was nothing but nan's. Once I fixed my normalization function, my network now trains perfectly.
I agree with the above viewpoint: the network is sensitive to its input. In my case, I use the log value of a density estimate as an input. Its absolute value can be very large, which may result in NaN after several gradient steps. I think an input check is necessary: first, make sure the input does not include -inf or inf, or extremely large numbers in absolute value.
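A quick check of that kind might look like this (a sketch; X_train stands for whatever array is fed to the model):
import numpy as np
assert np.all(np.isfinite(X_train)), "input contains NaN or +/-inf"
print("max abs value:", np.max(np.abs(X_train)))  # watch for extremely large values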
I faced the same problem using LSTM layers: my data had some NaN values after standardization. We should therefore check the model's input data after standardization; if you print the checks below you will see whether you have NaN values:
print(np.any(np.isnan(X_test)))
print(np.any(np.isnan(y_test)))
You can solve this by adding a small value (0.000001) to the std, like this:
def standardize(train, test):
    mean = np.mean(train, axis=0)
    std = np.std(train, axis=0) + 0.000001
    X_train = (train - mean) / std
    X_test = (test - mean) / std
    return X_train, X_test
To sum up the different solutions mentioned here and from this github discussion, which would depend of course on your particular situation:
Add regularization, i.e. l1 or l2 penalties on the weights. Otherwise, try a smaller l2 regularization, e.g. l2(0.001), or remove it if it already exists.
Try a smaller Dropout rate.
Clip the gradients to prevent their explosion. For instance in Keras you could use clipnorm=1. or clipvalue=1. as parameters for your optimizer.
Check the validity of inputs (no NaNs and, in some cases, no 0s), e.g. with df.isnull().any().
Replace the optimizer with Adam, which is easier to handle (see the sketch after this list). Sometimes replacing sgd with rmsprop also helps.
Use RMSProp with heavy regularization to prevent gradient explosion.
Try normalizing your data, or inspect your normalization process for any bad values introduced.
Verify that you are using the right activation function (e.g. using a softmax instead of sigmoid for multiple class classification).
Try to increase the batch size (e.g. 32 to 64 or 128) to increase the stability of your optimization.
Try decreasing your learning rate.
Check the size of your last batch which may be different from the batch size.
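A sketch of the "replace the optimizer with Adam" and "decrease your learning rate" items, using the old keras.optimizers API (the learning rate value is just an example):
from keras.optimizers import Adam
model.compile(loss='mean_absolute_error', optimizer=Adam(lr=1e-4))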
I faced a very similar problem, and this is how I got it to run.
The first thing you can try is changing your activation to LeakyReLU instead of using ReLU or Tanh. The reason is that often many of the nodes within your layers have an activation of zero, and backpropagation doesn't update the weights for these nodes because their gradient is also zero. This is also called the 'dying ReLU' problem (you can read more about it here: https://datascience.stackexchange.com/questions/5706/what-is-the-dying-relu-problem-in-neural-networks).
To do this, you can import the LeakyReLU activation using:
from keras.layers.advanced_activations import LeakyReLU
and incorporate it within your layers like this:
model.add(Dense(800,input_shape=(num_inputs,)))
model.add(LeakyReLU(alpha=0.1))
Additionally, it is possible that the output feature (the continuous variable you are trying to predict) is an imbalanced data set and has too many 0s. One way to fix this issue is to use smoothing. You can do this by adding 1 to the numerator of all your values in this column and dividing each of the values in this column by 1/(average of all the values in this column)
This essentially shifts all the values from 0 to a value greater than 0 (which may still be very small). This prevents the curve from predicting 0s and minimizing the loss (eventually making it NaN). Smaller values are more greatly impacted than larger values, but on the whole, the average of the data set remains the same.
I had the same problem, I was using Keras for a Multivariate regression problem. What I later realised was that some values in my dataset were nan and that led to a nan loss.
I used the command:
df=df.dropna()
And it resolved my issue.
I tried every suggestion on this page and many others to no avail. We were importing csv files with pandas, then using keras Tokenizer with text input to create vocabularies and word vector matrices. After noticing some CSV files led to nan while others worked, suddenly we looked at the encoding of the files and realized that ascii files were NOT working with keras, leading to nan loss and accuracy of 0.0000e+00; however, utf-8 and utf-16 files were working! Breakthrough.
If you're performing textual analysis and getting nan loss after trying these suggestions, use file -i {input} (linux) or file -I {input} (osx) to discover your file type. If you have ISO-8859-1 or us-ascii, try converting to utf-8 or utf-16le. Haven't tried the latter but I'd imagine it would work as well. Hopefully this helps someone very very frustrated!
I was getting the loss as nan in the very first epoch, as soon as the training started. A solution as simple as removing the NaNs from the input data worked for me (df.dropna()).
I hope this helps someone encountering a similar problem.
I had the same problem with my RNN with Keras LSTM layers, so I tried each solution from above. I had already scaled my data (with sklearn.preprocessing.MinMaxScaler), and there were no NaN values in my data after scaling. Solutions like using LeakyReLU or changing the learning rate didn't help.
So I decided to change the scaler from MinMaxScaler to StandardScaler. Even though I had no NaN values, and I found it odd, it worked!
In my case the issue was that I copy-pasted my previous work for binary classification and used sigmoid activation on the output layer instead of softmax (the new network was about multiclass classification).
I had a similar problem using keras. Loss turned into NAN after the second batch was input.
I tried to:
Use softmax as activation of output dense layer
Drop nan in the input
Normalize the input
However, that didn't work. So, then I tried to:
Decrease the learning rate
Problem solved.
Check your data for NaN values. Removing the NaN values solved the problem for me.
I had this issue when one of my training data entries contained a NaN.
I had a similar issue with my logloss, MAE and other metrics all being NAs. I looked into the data and found I had a few features with NAs in them. I imputed the NAs with approximate values and was able to solve the issue.
I had the same problem with my Keras CNN. Like others, I tried all of the above solutions: decreasing the learning rate, dropping NaNs from the training data, normalizing the data, adding a dropout layer, and so on, but none of them solved the NaN problem. Then I tried changing the activation function of the classifier (last) layer from sigmoid to softmax. It worked!
Try changing the activation function of the last layer to softmax!
I was getting the same thing when I tried creating a bounding box regressor.
My neural net had a larger layer than yours. I increased the dropout value and got suitable results.
I was getting NaN for my classification network. Answering here as it might help someone.
I had made a blunder:
the number of classes in the training labels was 5, i.e. from 0 to 4,
but the last dense layer of the classifier had 4 nodes, which means 4 classes, which was the issue.
Changing the number of nodes in the last layer of the network to 5 solved the issue for me.
I had a similar issue and I tried changing my activations from Sigmoid to Softmax and from ReLU to LeakyReLU, and the problem was resolved. So I guess, as long as there is no NaN in the input to begin with and you have tried lowering your learning rate, the viable solution is to play with your activations!
My situation:
Train Loss: nan, Train Accuracy: 0.0, Validation Loss: nan, Validation Accuracy: 0.0
Later I found out it was because my labels were 1, 2, 3, 4, not starting from 0.
So I relabeled them, using 0, 1, 2, 3 instead of 1, 2, 3, 4 as labels.
Problem solved!
Hope my answer helps!
In Keras, class labels begin from 0. If, for example, you have 7 classes, either label them from 0 through 6 and give the last dense layer (with the softmax activation function) units=7, or, if you label your data from 1 through 7, set units=8 (in the last dense layer).
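A sketch of the first option, assuming integer labels 0 through 6 (the hidden layer size and n_features are placeholders):
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(n_features,)))  # n_features: your input dimension
model.add(Dense(7, activation='softmax'))  # 7 classes labeled 0..6
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')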
I was getting NaN values for binary classification; then I changed the loss function from categorical cross-entropy to binary cross-entropy and it worked fine.
BTW, it seems to be a dying gradient, NOT an exploding one.
A neuron dies when its input is negative for all training instances.
Here the 'adam' optimizer helped against NaNs.
But concerning your situation: make sure you have a scaled dataset and loss='mean_squared_error' (as opposed to yours):
model.compile(optimizer='adam', loss=keras.losses.mean_squared_error, metrics=[keras.metrics.mean_absolute_error])
This error may occur when your data has problems like:
Input contains NaN, infinity or a value too large for dtype('float64').
In these cases, you can solve it by removing the NaN values, e.g.:
df = df.dropna()
or with any other fillna() method.
Note: these methods work for pandas data frames only.
Remove the rows containing NaN values, or replace the NaN values with some value from the dataset or dataframe. That will solve the error.
I got the same issue. You can successfully use Keras for regression. Converting all my data into rounded numbers solved my issue, e.g. 23.43 to 23.
I had the same problem. Examining the data, I realized that an error had occurred during data acquisition.
