I'm currently learning neural networks and have attempted to train an MLP to learn XOR using Back-propagation in Python. The network has two hidden layers (using Sigmoid Activation) and one output layer (also Sigmoid).
After around 20,000 epochs with a learning rate of 0.1, the network outputs numbers close to the original class labels:
prediction: 0.11428432952745145
original class output was: 0
prediction: 0.8230114358069576
original class output was: 1
prediction: 0.8229532575410421
original class output was: 1
prediction: 0.23349671680470516
original class output was: 0
When I plot the error for every epoch, the graph shows a steep decline followed by a slight 'bump'. I was under the impression that the error would decrease gradually:
Errors (summed) vs Epoch
Would this be classed as converging? I've tried to adjust the learning rate, with no luck.
Thanks!
Not necessarily. The NN solves an optimization problem by changing the weights, and the error is not guaranteed to fall at every step; some gradient descent updates may pick "worse" values.
I would recommend training for more epochs; it should eventually converge. If you want, post your code for more specific tips.
Yes -- definitely converging! You're getting the characteristic XOR learning curve for MLP with sigmoid activations -- you could put that in a textbook. And there's nothing faster than expected with that number of epochs. In fact, you could probably set the learning rate higher and maybe the step-size as well.
Assessing convergence statistically (not as a closed-form limit, nor graphically) can be a bit difficult. But that graph is pretty good evidence of convergence.
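For reference, the setup described in the question can be sketched in NumPy roughly as follows. This is a minimal sketch, not the asker's code (which was not posted): the hidden width of 4, the weight initialisation, the seed, and the name train_xor are all assumptions. It records the summed squared error once per epoch so that the curve can be plotted and inspected.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_xor(epochs=20000, lr=0.1, hidden=4, seed=0):
    """Full-batch backprop on XOR with two sigmoid hidden layers (assumed sizes)."""
    rng = np.random.default_rng(seed)
    X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
    y = np.array([[0.0], [1.0], [1.0], [0.0]])

    sizes = [2, hidden, hidden, 1]
    W = [rng.normal(0.0, 1.0, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
    b = [np.zeros((1, n)) for n in sizes[1:]]

    errors = []
    for _ in range(epochs):
        # Forward pass, keeping every layer's activations for backprop.
        acts = [X]
        for Wi, bi in zip(W, b):
            acts.append(sigmoid(acts[-1] @ Wi + bi))

        diff = acts[-1] - y
        errors.append(float(np.sum(diff ** 2)))  # summed squared error per epoch

        # Backward pass for loss = 0.5 * sum(diff^2), so dLoss/dOutput = diff.
        delta = diff * acts[-1] * (1.0 - acts[-1])
        for i in reversed(range(len(W))):
            grad_W = acts[i].T @ delta
            grad_b = delta.sum(axis=0, keepdims=True)
            if i > 0:
                # Propagate through W[i] before it is updated.
                delta = (delta @ W[i].T) * acts[i] * (1.0 - acts[i])
            W[i] -= lr * grad_W
            b[i] -= lr * grad_b
    return errors, acts[-1]
```

Plotting the returned errors list against the epoch index gives the kind of curve discussed above. How long any plateau lasts depends on the seed and the learning rate.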
I'm using a relatively simple neural network with fully connected layers in keras. For some reason, the accuracy drastically increases basically to its final value after only one training epoch (likewise, the loss sharply decreases). I've tried architectures with larger and smaller numbers of hidden layers too. This network also performs poorly on the testing data, so I am trying to find a more optimal architecture or improve my training set accordingly.
It is trained on a set of 6500 1D array-like samples, and I'm using a batch size of 512.
As Murilo said, it is hard to say much without more information, but this can come from several things:
Your network learns through the batches of each epoch, meaning that
your ~13 batches per epoch (6500/512 ≈ 12.7, rounded up) are already
enough to learn a good bit of classification.
Your weights are not really well initialized, and produce a huge
loss for the first epoch. The massive decrease in the loss is
actually the solver 'squishing' the weights. The best explanation I
found for this comes from A. Karpathy in his 'MakeMore' tutorial:
https://youtu.be/P6sfmUTpUmc?t=260
Now this sudden decrease of the loss is not extreme here (from 0.5 to 0.2) so I would not care much. I agree with Murilo that low accuracy in validation can come from too few samples in your validation set, or a bad shuffling between train and validation sets.
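The initialization point can be illustrated with a small NumPy sketch. It is not taken from the question's network: the single linear layer, the random data, the 10 classes, and the name initial_cross_entropy are assumptions made for illustration. With small initial weights the first cross-entropy loss sits near ln(number of classes), while large weights give confidently wrong predictions and a much larger first loss, which the optimiser then removes very quickly.

```python
import numpy as np


def initial_cross_entropy(weight_scale, n_samples=512, n_features=32,
                          n_classes=10, seed=0):
    """Cross-entropy of an untrained linear classifier on random data."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_features))
    labels = rng.integers(0, n_classes, size=n_samples)
    W = rng.normal(scale=weight_scale, size=(n_features, n_classes))

    logits = X @ W
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(n_samples), labels].mean())


for scale in (0.01, 1.0):
    print(f"weight scale {scale}: initial loss {initial_cross_entropy(scale):.3f}")
print(f"ln(10) = {np.log(10):.3f}")
```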
I am using tensorflow.keras to train a CNN in an image recognition problem, using the Adam minimiser to minimise a custom loss (some code is at the bottom of the question). I am experimenting with how much data I need to use in my training set, and thought I should look into whether each of my models has properly converged. However, when plotting loss vs number of epochs of training for different training set fractions, I noticed approximately periodic spikes in the loss function, as in the plot below. Here, the different lines show different training set sizes as a fraction of my total dataset.
As I decrease the size of the training set (blue -> orange -> green), the frequency of these spikes appears to decrease, though the amplitude appears to increase. Intuitively, I would associate this kind of behaviour with a minimiser jumping out of a local minimum, but I am not experienced enough with TensorFlow/CNNs to know if that is the correct way to interpret this behaviour. Equally, I can't quite understand the variation with training set size.
Can anyone help me to understand this behaviour? And should I be concerned by these features?
from quasarnet.models import QuasarNET, custom_loss
from tensorflow.keras.optimizers import Adam
...
model = QuasarNET(
    X[0,:,None].shape,
    nlines=len(args.lines)+len(args.lines_bal)
)

loss = []
for i in args.lines:
    loss.append(custom_loss)
for i in args.lines_bal:
    loss.append(custom_loss)

adam = Adam(decay=0.)
model.compile(optimizer=adam, loss=loss, metrics=[])

box, sample_weight = io.objective(z,Y,bal,lines=args.lines,
                                  lines_bal=args.lines_bal)

print( "starting fit")
history = model.fit(X[:,:,None], box,
                    epochs = args.epochs,
                    batch_size = 256,
                    sample_weight = sample_weight)
Following some discussion with a colleague, I believe that we have solved this problem. By default, the Adam minimiser uses an adaptive step size that is inversely proportional to the square root of a running average of the squared gradient (roughly, the variance of the gradient in its recent history). When the loss starts to flatten out, this average decreases, and so the minimiser increases the effective learning rate. This can happen quite drastically, causing the minimiser to "jump" to a higher loss point in parameter space.
You can avoid this by setting amsgrad=True when initialising the minimiser (http://www.satyenkale.com/papers/amsgrad.pdf). This prevents the learning rate from increasing in this way, and thus results in better convergence. The (somewhat basic) plot below shows loss vs number of training epochs for the normal setup, as in the original question (norm loss) compared to the loss when setting amsgrad=True in the minimiser (amsgrad loss).
Clearly, the loss function is much better behaved with amsgrad=True, and, with more epochs of training, should result in a stable convergence.
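The mechanism can be sketched for a single parameter in plain Python. This is a simplified illustration, not the Keras implementation: real optimisers differ in details such as where the bias correction is applied, and the function name, the gradient sequence, and the default values used here are assumptions. The sketch tracks only the scaling factor lr / (sqrt(v) + eps) that multiplies the momentum term.

```python
import math


def effective_step_sizes(grads, lr=0.001, beta2=0.999, eps=1e-7, amsgrad=False):
    """Per-step scaling lr / (sqrt(v_hat) + eps) for one scalar parameter."""
    v = 0.0
    v_max = 0.0
    sizes = []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1.0 - beta2) * g * g  # running average of g^2
        v_hat = v / (1.0 - beta2 ** t)         # bias correction
        if amsgrad:
            v_max = max(v_max, v_hat)          # never let the denominator shrink
            v_hat = v_max
        sizes.append(lr / (math.sqrt(v_hat) + eps))
    return sizes


# Hypothetical gradient history: large gradients, then a flat region.
grads = [1.0] * 100 + [0.01] * 2000
adam = effective_step_sizes(grads)
ams = effective_step_sizes(grads, amsgrad=True)
print("Adam    step scale after large grads / at end:", adam[99], adam[-1])
print("AMSGrad step scale after large grads / at end:", ams[99], ams[-1])
```

In this sketch the plain Adam scaling grows once the gradients become small, while the AMSGrad variant keeps the largest second-moment estimate seen so far, so its scaling can only stay the same or shrink.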
I am building a predictive model where I want to predict whether a package will be delivered on time (binary Yes / No). In the event that the package is not delivered on time, I wish to be able to predict by when it will be delivered, in categories of <7 days, <14 days, <21 days, >28 days after the expected date.
I have built and tested a model for binary classification and have got an F-score of 0.92, which is satisfactory for my needs. However, when I train my categorical model, I start to see training accuracy and validation accuracy diverge (training accuracy is much better than validation accuracy). This is a sign of overfitting.
However, I have tried regularization with different values, plus dropout with different values, and the validation accuracy never gets above 0.7. My total training set is ~10k examples, with ~3k for validation, and whilst the categorical spread is not equal, there are sufficient examples of each category (I think). I am using a NN and have increased / decreased both layers and activations, and still no joy.
Any thoughts on where to go next? Thanks
Because you are using a NN, introduce dropout layers and see if that helps to reduce the overfitting problem. Also check out this question: How to choose the number of hidden layers and nodes in a feedforward neural network?
The more complex the network (more hidden layers, more neurons in them), the more it contributes to the overfitting problem.
The approach we have chosen is to carry out a linear regression with the expected duration as the target variable. We excluded some outliers, and then took the differences between the actual and predicted days. We then took the maximum and minimum of these differences, and we now have a prediction with a tolerable range. We will keep working on the other techniques to see if we can improve. Thanks to everyone who suggested ideas.
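A rough sketch of that approach in NumPy, under stated assumptions: the data below is synthetic, the two features are placeholders, outlier removal is left out, and the names fit_with_tolerance and predict_range are hypothetical. It fits ordinary least squares with an intercept and uses the smallest and largest training residuals as the tolerance band around each prediction.

```python
import numpy as np


def fit_with_tolerance(X, y):
    """Least-squares fit plus the min/max residual on the training data."""
    A = np.column_stack([X, np.ones(len(X))])  # add an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef                   # actual minus predicted days
    return coef, float(residuals.min()), float(residuals.max())


def predict_range(coef, low, high, x_new):
    """Point prediction with the tolerance band (low, point, high)."""
    point = float(np.append(x_new, 1.0) @ coef)
    return point + low, point, point + high


# Synthetic stand-in data (NOT real delivery data).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 * X_demo[:, 0] - X_demo[:, 1] + 5.0 + rng.normal(scale=0.5, size=50)

coef, low, high = fit_with_tolerance(X_demo, y_demo)
print(predict_range(coef, low, high, X_demo[0]))
```

Because the band comes from the extreme residuals, it is very sensitive to outliers, which is presumably why they were excluded first.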
I am using Tensorflow DNNRegressor Estimator model for making a neural network. But calling estimator.train() function is giving output as follows:
That is, my loss varies a lot from step to step, but as far as I know it should decrease with the number of iterations. Also, please find attached a screenshot of the TensorBoard visualisation of the loss function:
The doubts I'm not able to figure out are:
Whether it is overall loss function value (combined loss for every step processed till now) or just that step's loss value?
If it is that step's loss value, then how do I get the value of the overall loss function and see its trend, which I feel should decrease with increasing no of iterations? To my knowledge, that is the value we should look at while training on a dataset.
If this is overall loss value, then why is it fluctuating so much? Am I missing something?
First of all, let me point out that tf.contrib.learn.DNNRegressor uses a linear regression head with mean_squared_loss, i.e. simple L2 loss.
Whether it is overall loss function value (combined loss for every
step processed till now) or just that step's loss value?
Each point on a chart is the value of a loss function on the last step after learning so far.
If it is that step's loss value, then how to get value of overall loss
function and see its trend, which I feel should decrease with
increasing no of iterations?
There's no overall loss function; you probably mean a chart of how the loss changed after each step. That's exactly what TensorBoard is showing you. You are right that its trend is not downwards, as it should be. This indicates that your neural network is not learning.
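If the per-step values are too noisy to read a trend from, one option is to smooth them before plotting. The following is a minimal sketch of an exponential moving average in plain Python, similar in spirit to a chart smoothing control but not a reproduction of any tool's exact formula. The function name and the default weight are assumptions.

```python
def smooth(values, weight=0.9):
    """Exponential moving average of a list of per-step loss values."""
    if not values:
        return []
    smoothed = []
    last = values[0]
    for v in values:
        last = weight * last + (1.0 - weight) * v
        smoothed.append(last)
    return smoothed


# Hypothetical noisy per-step losses.
noisy = [10.0, 0.0] * 10
print(smooth(noisy)[:4])
```

A smoothed curve that still does not trend downwards supports the conclusion that the network is not learning.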
If this is overall loss value, then why is it fluctuating so much? Am I missing something?
A common reason for the neural network not learning is poor choice of hyperparameters (though there are many more mistakes you can possibly make). For example:
the learning rate is too large
it's also possible that the learning rate is too small, which means that the neural network is learning, but very very slowly, so that you can't see it
weights initialization is probably too large, try to decrease it
batch size may be too large as well
you're passing wrong labels for the inputs
the training data contains missing values, or is not normalized
...
What I usually do to check if the neural network is at least somehow working is reduce the training set to few examples and try to overfit the network. This experiment is very fast, so I can try various learning rates, initialization variance and other parameters to find a sweet spot. Once I have a steady decreasing loss chart, I go on with a bigger set.
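A minimal sketch of that sanity check, with a linear model trained by gradient descent standing in for the network so that the example stays short. The data is random, and the function name, subset size, and learning rates are assumptions. The idea is the same: take a handful of examples, give the model enough capacity to fit them exactly, and compare loss curves across learning rates.

```python
import numpy as np


def overfit_tiny_subset(X, y, lr, steps=2000):
    """Gradient descent on mean squared error; returns the loss per step."""
    w = np.zeros(X.shape[1])
    losses = []
    for _ in range(steps):
        err = X @ w - y
        loss = float(np.mean(err ** 2))
        losses.append(loss)
        if not np.isfinite(loss):
            break  # diverged: this learning rate is too large
        w -= lr * 2.0 * (X.T @ err) / len(y)
    return losses


# 5 examples, 10 parameters: the model can fit this subset exactly.
rng = np.random.default_rng(0)
X_small = rng.normal(size=(5, 10))
y_small = rng.normal(size=5)

for lr in (1e-4, 1e-3, 1e-2):
    losses = overfit_tiny_subset(X_small, y_small, lr)
    print(f"lr={lr:g}: first loss {losses[0]:.4f}, last loss {losses[-1]:.6f}")
```

If no learning rate drives the loss on such a tiny subset steadily towards zero, the problem is more likely in the data pipeline or the model code than in the hyperparameters.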
Though the previous answer is very informative and good, it doesn't quite address your issue. When you instantiate DNNRegressor, add:
loss_reduction=tf.losses.Reduction.MEAN
in the constructor, and you'll see that your average loss converges.
estimator = tf.estimator.DNNRegressor(
    feature_columns=feat_clmns,
    hidden_units=[32, 64, 32],
    weight_column=weight_clmn,
    loss_reduction=tf.losses.Reduction.MEAN
)
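To see why the reduction matters, here is a small plain-Python illustration. It assumes that the reported loss is otherwise summed over the examples in the batch, which is my understanding of the default and is worth checking against your TensorFlow version. The function name and the numbers are made up. A summed loss scales with the number of examples in the batch, so it is harder to compare across steps than a per-example mean.

```python
def batch_losses(y_true, y_pred):
    """Return (summed, mean) squared error for one batch."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared), sum(squared) / len(squared)


# Same per-example error (0.5) in a batch of 4 and in a batch of 8.
sum_4, mean_4 = batch_losses([1.0] * 4, [0.5] * 4)
sum_8, mean_8 = batch_losses([1.0] * 8, [0.5] * 8)
print(sum_4, mean_4)
print(sum_8, mean_8)
```

The summed value doubles with the batch size even though the model is equally good in both cases, whereas the mean stays the same.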
I'm running some TensorFlow code. In the terminal it prints values for training and test accuracy and also the step size. Can someone please explain these terms, or point me to any material that I can read to understand them, and also the stochastic gradient descent method for convolutional neural networks?
From what you have displayed in the terminal, you are actually using tflearn. This should also display the LOSS or COST, which is how far your prediction is from the actual output. Low loss and high accuracy = better model.
The Stochastic Gradient Descent (SGD) allows learning rate decay. There is a good explanation here http://tflearn.org/optimizers/#stochastic-gradient-descent
In the menu on the left side you will find everything about Loss, Training, Accuracy, Layers etc.
And you can actually choose how often you want to display these things (I mean at what step).
As for batch size, learning rate, number of iterations, number of layers and number of nodes, you can play around with all of these and see which works better for your dataset.
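As a rough illustration of the learning rate decay mentioned above, here is a plain-Python sketch of an exponential decay schedule. It is not tflearn's code: the function and parameter names are illustrative, and the exact formula a given library uses should be checked in its documentation.

```python
def decayed_learning_rate(base_lr, decay_rate, step, decay_step, staircase=False):
    """Exponential decay: base_lr * decay_rate ** (step / decay_step)."""
    if staircase:
        exponent = step // decay_step  # decay in discrete jumps
    else:
        exponent = step / decay_step   # decay smoothly at every step
    return base_lr * decay_rate ** exponent


for step in (0, 50, 100, 200):
    print(step, decayed_learning_rate(0.1, 0.5, step, decay_step=100))
```

With these example values the learning rate halves every 100 steps, so early updates are large and later ones become progressively smaller.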