Does the squared error depends on the number of hidden layers? - python

I wanted to know if the squared error depends also on the number of hidden layers and the number of neurons on each hidden layer , because I've created a neuron network with one hidden layer but I can't reach a small squared error , so maybe the function is not convex ? Can I optimize weights by adding more hidden layers ?

The more neurons (e.g. layers) you add to your model, the better you can approximate arbitrary functions. If your loss on your training data is not decreasing any further you are underfitting. This can be solved by making the model more complex, i.e. adding more trainable parameters. But you have to be careful, that you do not overdo it and end up overfitting.

Though this is not a programming question, I'll try my best to answer it here.
The squared error, i.e. the 'loss' of your neural network, depends on your neural network prediction and the ground truth. And it is convex from its definition.
The reasons that you're not getting low losses could be:
You're not normalizing your inputs. For example, if you got a series of house prices as input, which is around 500k to 1m, and you didn't normalize them, your prediction will be the linear combination of the prices, which is about the same order of magnitude, then pass through the activation function. This could result in large losses.
You're not initializing your weights and biases correctly. Similar to above, you could have large weights/biases which lead to large prediction values.
You didn't choose the proper activation function. When you're doing classification, your labels are generally one hot encoded, so your activation functions should limit the prediction to [0,1] or similar, so relu won't be a proper option. Also you don't want sigmoid as activation for regression problems.
Your labels are not predictable or have too much noise. Or maybe your network is not complex enough to capture important patterns, in that case you could try adding more layers and more nodes per layer.
Your learning rate is too small, this leads to slow convergence.
That's all I have in mind. You probably need more work to find out the reason to your problem.

Related

How to train rebel neurons?

I'm training a pretty basic NN over mmnist fashion dataset. I'm using my own code, which is not important. I use a rather simplified algorith similar to ADAM and a cuadratic formula (train_value - real_value)**2 for training and error calculation. I apply a basic back propagation algorith for each weight, and I analyse 1/5 of the network weights for each trainning image. I use only a 128 layer as in the basic example for begginers in tensorflow, plus the input and output layers (the last with softmax and the first normalized to 0-1)
I'm not an expert at all, and I've only been able to train my network up to 77% accuracy over the test set.
As shown in the image bellow, I detected that the gradients of the weights for most of my neurons converge to cero after a few epochs. But there are few remarkable exceptions that just remain rebel (vertical lines at the first image divide the weights by neuron).
Could you recommend me some general techniques to train rogue neurons without affecting others?
You could add a constraint to the given kernel (the weight matrix in the Dense layer).
With one of those constraints, the weights can be normalized to a given, user-defined range.
See: TensorFlow.Keras Constraints
In addition you can try to use regularizers, to prevent overfitting of the mode, which may be indicated by some very large (absolute) weight values.
For that see for example L1 or L2 reguralizers: TensorFlow.Keras Regularizers

Why does my TensorFlow NN model's predicted values have upper limit?

I have a neural network with three layers. I've tried using tanh and sigmoid functions for my activations and then the output layer is just a simple linear function (I'm trying to model a regression problem).
For some reason my model seems to have a hard cut off where it will never predict a value above some threshold (even though it should). What reason could there be for this?
Here is what predictions from the model look like (with sigmoid activations):
update:
With relu activation, and switching from gradient descent to Adam, and adding L2 regularization... the model predicts same value for every input...
A linear layer regressing a single value will have outputs of the form
output = bias + sum(kernel * inputs)
If inputs comes from a tanh, then -1 <= inputs <= 1, and hence
bias - sum(abs(kernel)) <= output <= bias + sum(abs(kernel))
If you want an unbounded output, consider using an unbounded activation on all intermediate layers, e.g. relu.
I think your problem concerns the generalization/expressiveness of the model. Regression is a basic task, there should be no problem with the method itself, but problem with the execution. #DomJack explained how output is restricted for a specific set of parameters, but that only happens for anomaly data. In general, when training parameters would be tuned so that it will predict output correctly.
So first point is about the quality of training data. Make sure you have large enough training data (and it is split randomly if you split train/test from one dataset). Also, maybe trivial, but make sure you didn't mess up input/output value in preprocessing.
Another point is about the size of the network. Make sure you use large enough hidden layer.

In Tensorflow, a loss can be a tensor of more than one dimension. What does that mean?

A scalar loss makes perfect sense to me which is in general what is fed as Loss Function in standard NN architectures. However there is a provision to make your loss not scalar, for example I had input a loss of size ([batch_size,]) and it did not throw an error.
Does it summarize the loss as sum/mean internally? It is not coming out very clear to me from the source code.
Any help is highly appreciated. Thanks. :)
Imagine you are trying to classify the MNIST data set. In this case, you have 10 output neurons. Therefore, after propagating a training sample through it, you get 10 activations. These have to be compared to the desired output via a cost function for each neuron (something like cross-entropy).
What you are trying to do here, is to minimize these costs among all the neurons. In the MNIST example you have to do something like
xent = tf.reduce_mean(
tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y), name="xent")
You see that the cost is calculated across all neurons as mean (reduce_mean).

What is the key feature in MNIST Dataset that is used to classify images

I was recently learning about neural networks and came across MNIST data set. i understood that a sigmoid cost function is used to reduce the loss. Also, weights and biases gets adjusted and an optimum weights and biases are found after the training. the thing i did not understand is, on what basis the images are classified. For example, to classify whether a patient has cancer or not, data like age, location, etc., becomes features. in MNIST dataset, i did not find any of that. Am i missing something here. Please help me with this
First of all the Network pipeline consists of 3 main parts:
Input Manipulation:
Parameters that effect the finding of minimum:
Parameters like your descission function in your interpretation
layer (often fully connected layer)
In contrast to your regular machine learning pipeline where you have to extract features manually a CNN uses filters. (Filters like in edge detection or viola and jones).
If a filter runs across the images and is convolved with pixels it Produces an output.
This output is then interpreted by a neuron. If the output is above a threshold it is considered as valid (Step function counts 1 if valid or in case of Sigmoid it has a value on the sigmoid function).
The next steps are the same as before.
This is progressed until the interpretation layer (often softmax). This layer interprets your computation (if the filters are good adapted to your problem you will get a good predicted label) which means you have a low difference between (y_guess - y_true_label).
Now you can see that for the guess of y we have multiplied the input x with many weights w and also used functions on it. This can be seen like a chain rule in analysis.
To get better results the effect of a single weight on the input must be known. Therefore, you use Backpropagation which is a derivative of the Error with respect to all w. The Trick is that you can reuse derivatives which is more or less Backpropagation and it becomes easier since you can use Matrix vector notation.
If you have your gradient, you can use the normal concept of minimization where you walk along the steepest descent. (There are also many other gradient methods like adagrad or adam etc).
The steps will repeat until convergence or until you reach the maximum epochs.
So the answer is: THE COMPUTED WEIGHTS (FILTERS) ARE THE KEY TO DETECT NUMBERS AND DIGITS :)

Train a feed forward neural network indirectly

I am faced with this problem:
I have to build an FFNN that has to approximate an unknown function f:R^2 -> R^2. The data in my possession to check the net is a one-dimensional R vector. I know the function g:R^2->R that will map the output of the net into the space of my data. So I would use the neural network as a filter against bias in the data. But I am faced with two problems:
Firstly, how can I train my network in this way?
Secondly, I am thinking about adding an extra hidden layer that maps R^2->R and lets the net train itself to find the correct maps and then remove the extra layer. Would this algorithm be correct? Namely, would the output be the same that I was looking for?
Your idea with additional layer is good, although the problem is, that your weights in this layer have to be fixed. So in practise, you have to compute the partial derivatives of your R^2->R mapping, which can be used as the error to propagate through your network during training. Unfortunately, this may lead to the well known "vanishing gradient problem" which stopped the development of NN for many years.
In short - you can either manually compute the partial derivatives, and given expected output in R, simply feed the computed "backpropagated" errors to the network looking for R^2->R^2 mapping or as you said - create additional layer, and train it normally, but you will have to make the upper weights constant (which will require some changes in the implementation).

Categories

Resources