I have trained a [50, 500, 500, 5] neural network: the input layer has 50 neurons and the output layer has 5 neurons. The biases in layer 2 change like this.
Why does the distribution of the biases in layer 2 change so dramatically?
(The distribution of the biases in layer 1 stays nearly the same at every step.)
Can anyone explain this phenomenon?
What you're seeing is almost certainly overfitting. This isn't a bug in your implementation, but rather an issue of interpretation. This 1,055-neuron multi-layer perceptron (MLP) has roughly 280,000 trainable parameters (about 277,500 weights plus 1,005 biases, depending on your implementation). That's enough capacity to memorize almost any pattern. What you're seeing in your histogram is the migration of your bias parameters as they fit your data. It's only occurring in the later layer because the bias terms there alone are more than enough to help memorize the data. Judging by the very clear parameter concentration, I'm guessing you're training on a fairly small dataset; this is just what it looks like when an MLP is memorizing data points. Consult your train and validation loss vs. epoch curves to confirm this hypothesis.
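For reference, here is a minimal sketch of where that parameter count comes from (the layer sizes are the ones stated in the question; the arithmetic itself is only illustrative):

    # Rough parameter count for a [50, 500, 500, 5] fully connected MLP.
    # Each pair of adjacent layers contributes a weight matrix; every
    # non-input neuron also carries one bias term.
    layer_sizes = [50, 500, 500, 5]

    weights = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
    biases = sum(layer_sizes[1:])

    print(f"weights: {weights:,}")           # 277,500
    print(f"biases:  {biases:,}")            # 1,005
    print(f"total:   {weights + biases:,}")  # 278,505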
I'm using a relatively simple neural network with fully connected layers in Keras. For some reason, the accuracy jumps almost to its final value after only one training epoch (and likewise, the loss sharply decreases). I've also tried architectures with larger and smaller numbers of hidden layers. The network also performs poorly on the testing data, so I am trying to find a better architecture or improve my training set accordingly.
It is trained on a set of 6,500 1D array-like samples, and I'm using a batch size of 512.
As Murilo said, it's hard to say much without more information, but it can come from multiple things:
- Your network learns through the batches of each epoch, meaning that your ~12 batches (6500/512) are already enough to learn a good bit of the classification.

- Your weights are not really well initialized and produce a huge loss for the first epoch. The massive decrease in the loss is actually the solver 'squishing' the weights. The best explanation I found for this comes from A. Karpathy in his 'MakeMore' tutorial: https://youtu.be/P6sfmUTpUmc?t=260
That said, this sudden decrease of the loss is not extreme here (from 0.5 to 0.2), so I would not worry about it much. I agree with Murilo that low validation accuracy can come from too few samples in your validation set, or from bad shuffling between the train and validation sets.
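If you want to check the second point, here is a small sanity check in the spirit of the linked Karpathy explanation (a sketch with assumed numbers, not code from the question):

    import numpy as np

    # With reasonably scaled initial weights, the first-batch cross-entropy loss
    # should sit near -log(1 / num_classes); a much larger initial loss suggests
    # the sharp first-epoch drop is mostly the optimizer rescaling bad weights.
    num_classes = 10                      # hypothetical output size
    print("expected initial loss ~", -np.log(1.0 / num_classes))

    # Also note how many weight updates a single epoch already gives you here:
    steps_per_epoch = 6500 // 512         # 12 full batches (plus one partial)
    print("updates per epoch:", steps_per_epoch)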
I am building a two-layer neural network from scratch on the Fashion MNIST dataset. In between, I am using ReLU as the activation, and on the last layer I am using softmax cross-entropy. I am getting the learning curves below for train and validation accuracy, which are obviously wrong. But if you look at my loss curve, it is decreasing, yet my model is not learning. I am not able to wrap my head around where I am going wrong. Could anyone explain these two graphs, i.e. where I could possibly be going wrong?
I don't know exactly what you are doing, and I don't know anything about your architecture, but it's wrong to use ReLU on the last layer.
Usually you leave the last layer linear (no activation). This produces the logits that enter the softmax, and the output of the softmax then approximates a probability distribution over the classes.
This could be a reason for your results.
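As a minimal illustration of that layout (sketched in Keras rather than from scratch, with made-up layer sizes): keep the output layer linear and let the loss apply the softmax to the logits.

    import tensorflow as tf

    # Sketch only: the hidden layer uses ReLU, the output layer has no activation,
    # and the softmax is folded into the loss via from_logits=True.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),   # hidden layer
        tf.keras.layers.Dense(10),                       # linear output -> logits
    ])

    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )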
I am building a predictive model where I want to know whether I can predict if a package will be delivered on time (binary Yes/No). In the event that the package is not delivered on time, I want to be able to predict when it will be delivered, in categories of <7 days, <14 days, <21 days, or >28 days after the expected date.
I have built and tested a model for binary classification and got an F-score of 0.92, which is satisfactory for my needs. However, when I train my categorical model, I start to see the training accuracy and validation accuracy diverge (training accuracy is much better than validation accuracy). This is a sign of overfitting.
However, I have tried regularization with different values, plus dropout with different rates, and the validation accuracy never gets above 0.7. My total training set is ~10k examples with ~3k for validation, and while the categorical spread is not equal, there are sufficient examples of each category (I think). I am using an NN and have increased/decreased both layers and activations, and still no joy.
Any thoughts on where to go next? Thanks.
Because you are using an NN, introduce dropout layers and see if they help reduce the overfitting problem. Also check out How to choose the number of hidden layers and nodes in a feedforward neural network?
The more complex the network (more hidden layers, more neurons in them), the more it also contributes to the overfitting problem.
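A minimal sketch of what adding dropout could look like in Keras (layer sizes, dropout rates, and input/output dimensions are made up, not taken from the question):

    import tensorflow as tf

    # Sketch only: Dropout randomly zeroes a fraction of activations during training,
    # which is a common way to fight the kind of overfitting described above.
    num_features, num_classes = 20, 4        # hypothetical sizes

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(num_features,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])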
The approach we have chosen is to carry out a linear regression with the expected duration as the target variable. We excluded some outliers and then took the differences between the actual and predicted days. We then took the max and min of those differences, and we now have a prediction with a tolerable range. We will keep working on the other techniques to see if we can improve. Thanks to everyone who suggested ideas.
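A rough sketch of that residual-range idea (using scikit-learn, with placeholder variable names; this is not the author's actual code):

    from sklearn.linear_model import LinearRegression

    # Sketch only: fit a linear regression on delivery duration, then use the spread
    # of the residuals (actual - predicted days) to attach a tolerance range to each
    # new prediction. X_train, y_train and X_new are placeholders for the real data.
    def predict_with_range(X_train, y_train, X_new):
        model = LinearRegression().fit(X_train, y_train)
        residuals = y_train - model.predict(X_train)
        lo, hi = residuals.min(), residuals.max()   # the min'd and max'd differences
        pred = model.predict(X_new)
        return pred + lo, pred, pred + hi           # lower bound, estimate, upper bound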
I wanted to know if the squared error also depends on the number of hidden layers and the number of neurons in each hidden layer. I've created a neural network with one hidden layer, but I can't reach a small squared error, so maybe the function is not convex? Can I optimize the weights by adding more hidden layers?
The more neurons (and layers) you add to your model, the better you can approximate arbitrary functions. If the loss on your training data is not decreasing any further, you are underfitting. This can be solved by making the model more complex, i.e. by adding more trainable parameters. But you have to be careful not to overdo it and end up overfitting.
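As a rough sketch of what "adding more trainable parameters" means in practice (Keras, with made-up sizes; purely illustrative):

    import tensorflow as tf

    # Sketch only: more layers and/or more units per layer means more trainable
    # parameters, i.e. more capacity to fit the training data.
    def make_mlp(hidden_sizes, num_features=10):
        layers = [tf.keras.Input(shape=(num_features,))]
        layers += [tf.keras.layers.Dense(n, activation="relu") for n in hidden_sizes]
        layers.append(tf.keras.layers.Dense(1))       # linear output for regression
        return tf.keras.Sequential(layers)

    small = make_mlp([16])            # one small hidden layer
    larger = make_mlp([64, 64])       # deeper and wider
    print(small.count_params(), larger.count_params())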
Though this is not a programming question, I'll try my best to answer it here.
The squared error, i.e. the 'loss' of your neural network, depends on your network's prediction and on the ground truth. As a function of the prediction it is convex by definition, but as a function of the network's weights it generally is not.
The reasons you're not getting a low loss could be:
You're not normalizing your inputs. For example, if you feed in a series of house prices, which are around 500k to 1m, without normalizing them, the pre-activations will be linear combinations of those prices, i.e. values of roughly the same order of magnitude, before passing through the activation function. This can result in very large losses. (See the sketch at the end of this answer.)
You're not initializing your weights and biases correctly. Similar to above, you could have large weights/biases which lead to large prediction values.
You didn't choose a proper activation function. When you're doing classification, your labels are generally one-hot encoded, so your activation function should limit the predictions to [0, 1] or similar, which means ReLU is not a proper option. Likewise, you don't want sigmoid as the activation for regression problems.
Your labels are not predictable or have too much noise. Or maybe your network is not complex enough to capture the important patterns; in that case, you could try adding more layers and more nodes per layer.
Your learning rate is too small, which leads to slow convergence.
That's all I have in mind. You probably need to dig deeper to find the reason for your problem.
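Here is the normalization sketch mentioned above (with made-up numbers; standardization is one common choice):

    import numpy as np

    # Sketch only: standardizing inputs to roughly zero mean and unit variance keeps
    # the pre-activations in a sensible range, which usually makes the squared-error
    # loss much easier to drive down.
    X = np.array([[500_000.0], [750_000.0], [1_000_000.0]])   # e.g. raw house prices

    mean, std = X.mean(axis=0), X.std(axis=0)
    X_norm = (X - mean) / std            # feed X_norm to the network instead of X
    print(X_norm.ravel())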
A scalar loss makes perfect sense to me, and that is generally what is fed as the loss function in standard NN architectures. However, there is a provision for a non-scalar loss; for example, I passed in a loss of shape ([batch_size,]) and it did not throw an error.
Does it reduce the loss to a sum/mean internally? That is not coming out very clearly to me from the source code.
Any help is highly appreciated. Thanks. :)
Imagine you are trying to classify the MNIST data set. In this case, you have 10 output neurons. Therefore, after propagating a training sample through it, you get 10 activations. These have to be compared to the desired output via a cost function for each neuron (something like cross-entropy).
What you are trying to do here is minimize these costs over all the neurons. In the MNIST example, you do something like
    xent = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y),
        name="xent")
You can see that the per-example costs are then reduced to a single scalar by taking the mean across the batch (reduce_mean).
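A small sketch (assuming TensorFlow 2.x, with made-up logits and labels) showing the shapes involved: the raw cross-entropy comes out per example, and reduce_mean collapses it to the scalar the optimizer expects.

    import tensorflow as tf

    # Sketch only: 4 hypothetical examples, 10 classes.
    logits = tf.random.normal([4, 10])
    labels = tf.one_hot([3, 1, 4, 1], depth=10)

    per_example = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
    print(per_example.shape)    # (4,)  -> one loss value per example
    scalar_loss = tf.reduce_mean(per_example)
    print(scalar_loss.shape)    # ()    -> single scalar fed to the optimizer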