I am using a feedforward, backpropagation, multilayer neural network with a sigmoid activation function, which has a range of -1 to 1. But the minimum error will not drop below 5.8, and I want it much lower; you can see the output after 100,000 iterations.
I think this is because my output range is above 1, while the sigmoid's range is only -1 to 1. Can anybody suggest how I can overcome this problem, given that my desired output range is 0 to 2.5? Which activation function would be best for this range?
If you are seeking to reduce output error, there are a couple of things to look at before tweaking a node's activation function.
First, do you have a bias node? Bias nodes have several implications, but - most relevant to this discussion - they allow the network output to be translated to the desired output range. As this reference states:
The use of biases in a neural network increases the capacity of the network to solve problems by allowing the hyperplanes that separate individual classes to be offset for superior positioning.
This post provides a very good discussion:
Role of Bias in Neural Networks.
This one is good, too: Why the BIAS is necessary in ANN? Should we have separate BIAS for each layer?
Second method: it often helps to normalize your inputs and outputs. As you note, your sigmoid offers a range of +/- 1. This small range can be problematic when trying to learn functions that have a range of 0 to 1000 (for example). To aid learning, it's common to scale and translate inputs to accommodate the node activation functions. In this example, one might divide the range by 500, yielding a 0 to 2 range, and then subtract 1 from this range. In this manner, the inputs have been normalized to a range of -1 to 1, which better fits the activation function. Note that network output should be denormalized: first, add +1 to the output, then multiply by 500.
In your case, you might consider scaling the target values by 0.8, then subtracting 1 from the result. You would then add 1 to the network output and multiply by 1.25 to recover the desired 0 to 2.5 range. Note that this method may be the easiest to implement, since it does not directly change your network topology the way adding a bias would.
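As a concrete sketch of that scaling (a minimal example, assuming targets in the range 0 to 2.5 and an activation with range -1 to 1; not the asker's code):

def normalize_target(y):
    # [0, 2.5] -> [0, 2] -> [-1, 1]
    return y * 0.8 - 1.0

def denormalize_output(o):
    # [-1, 1] -> [0, 2] -> [0, 2.5]
    return (o + 1.0) * 1.25

assert abs(denormalize_output(normalize_target(2.5)) - 2.5) < 1e-9
assert abs(denormalize_output(normalize_target(0.0)) - 0.0) < 1e-9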
Finally, have you experimented with changing the number of hidden nodes? Although I believe the first two options are better candidates for improving performance, you might give this one a try. (Just as a point of reference, I can't recall an instance in which modifying the activation function's shape improved network response more than options 1 and 2.)
Here are some good discussions of hidden layer/node configuration:
multi-layer perceptron (MLP) architecture: criteria for choosing number of hidden layers and size of the hidden layer?
How to choose number of hidden layers and nodes in neural network?
24 inputs make your problem a high-dimensional one. Ensure that your training dataset adequately covers the input state space, and ensure that your test data and training data are drawn from similarly representative populations. (Take a look at the discussions of cross-validation when training neural networks.)
The vanilla sigmoid function is:
import math

def sigmoid(x):
    return 1/(1+math.e**-x)
You could transform that to:
def mySigmoid(x):
    return 2.5/(1+math.e**-x)
in order to produce the output range that you want.
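For example, a quick sanity check of the new range (assuming the mySigmoid defined above):

# mySigmoid stays within (0, 2.5):
for x in (-10, 0, 10):
    print(x, mySigmoid(x))   # ~0.0001, 1.25, ~2.4999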
Related
I wanted to know whether the squared error also depends on the number of hidden layers and the number of neurons in each hidden layer, because I've created a neural network with one hidden layer but I can't reach a small squared error. So maybe the function is not convex? Can I optimize the weights by adding more hidden layers?
The more neurons (and layers) you add to your model, the better you can approximate arbitrary functions. If the loss on your training data is not decreasing any further, you are underfitting. This can be solved by making the model more complex, i.e. adding more trainable parameters. But be careful that you do not overdo it and end up overfitting.
Though this is not a programming question, I'll try my best to answer it here.
The squared error, i.e. the 'loss' of your neural network, depends on the network's prediction and the ground truth. By definition it is convex with respect to the prediction, though not, in general, with respect to the network's weights.
The reasons that you're not getting low losses could be:
You're not normalizing your inputs. For example, if your inputs are a series of house prices around 500k to 1m and you don't normalize them, your prediction will be a linear combination of the prices, which is about the same order of magnitude, before passing through the activation function. This can result in large losses. (See the sketch at the end of this answer.)
You're not initializing your weights and biases properly. Similar to the above, large weights/biases lead to large prediction values.
You didn't choose the proper activation function. When you're doing classification, your labels are generally one-hot encoded, so your output activation should limit the prediction to [0, 1] or similar; relu is therefore not a proper option there. Likewise, you usually don't want a sigmoid as the output activation for regression problems.
Your labels are not predictable or have too much noise. Or maybe your network is not complex enough to capture important patterns, in that case you could try adding more layers and more nodes per layer.
Your learning rate is too small, which leads to slow convergence.
That's all I have in mind. You probably need more experimentation to find out the reason for your problem.
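To illustrate the first two points above (a rough sketch with made-up house-price numbers, not the asker's data):

import numpy as np

rng = np.random.default_rng(0)

# Made-up raw inputs in the 500k-1m range.
prices = rng.uniform(500_000, 1_000_000, size=(100, 1))
X = (prices - prices.mean()) / prices.std()      # zero mean, unit variance

# Small random initial weights and zero biases keep pre-activations moderate,
# so a sigmoid/tanh layer does not saturate on the first forward pass.
n_in, n_hidden = 1, 16
W = rng.normal(0.0, 0.1, size=(n_in, n_hidden))
b = np.zeros(n_hidden)

pre_activation = X @ W + b
print(np.abs(pre_activation).max())              # stays small, unlike with raw prices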
I have a neural network with three layers. I've tried using tanh and sigmoid functions for my activations and then the output layer is just a simple linear function (I'm trying to model a regression problem).
For some reason my model seems to have a hard cut off where it will never predict a value above some threshold (even though it should). What reason could there be for this?
Here is what predictions from the model look like (with sigmoid activations):
update:
With relu activations, switching from gradient descent to Adam, and adding L2 regularization... the model predicts the same value for every input.
A linear layer regressing a single value will have outputs of the form
output = bias + sum(kernel * inputs)
If inputs comes from a tanh, then -1 <= inputs <= 1, and hence
bias - sum(abs(kernel)) <= output <= bias + sum(abs(kernel))
If you want an unbounded output, consider using an unbounded activation on all intermediate layers, e.g. relu.
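A tiny numeric illustration of that bound (the kernel and bias values here are hypothetical):

import numpy as np

kernel = np.array([0.5, -1.5, 2.0])          # hypothetical trained weights
bias = 0.3

lo = bias - np.abs(kernel).sum()             # -3.7
hi = bias + np.abs(kernel).sum()             #  4.3

inputs = np.tanh(np.random.randn(1000, 3))   # anything coming out of tanh lies in [-1, 1]
outputs = inputs @ kernel + bias
assert outputs.min() >= lo and outputs.max() <= hi   # no input can escape the bound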
I think your problem concerns the generalization/expressiveness of the model. Regression is a basic task, so there should be no problem with the method itself, only with the execution. @DomJack explained how the output is restricted for a specific set of parameters, but that should only happen in anomalous cases; in general, training tunes the parameters so that the network predicts the output correctly.
So the first point is about the quality of the training data. Make sure you have enough training data (and that it is split randomly if you split train/test from one dataset). Also, it may be trivial, but make sure you didn't mix up the input/output values in preprocessing.
Another point is the size of the network. Make sure the hidden layers are large enough.
I'm trying to understand exactly how to implement a basic neural network in Python that will use genetic algorithms for unsupervised learning, and I have run into a small problem that the literature I've been able to pull up hasn't solved.
Let's say I have an input of 2 values that are passed to a 3-neuron hidden layer with all weights/biases applied. After I determine whether a neuron fired, what exactly do I send on? Do I send the output from my sigmoid, or do I send a full stop/start? In other words, is my output into hidden layer 2 going to be binary or non-binary?
Can anyone explain this, along with the reasoning behind choosing one or the other?
This really depends on your network design, but there is no restriction that inputs have to be binary; in fact, that is not a case you will face often. For the output layer, the type of output can be determined easily and clearly, e.g. if you have something like a classifier that decides whether an answer is spam or not, then the output (of a single-neuron output layer) will be binary. If you have a neural network that recognises handwritten digits, then it's probably better to have a 10-neuron output layer, each neuron giving the probability of the input image being one of the digits 0 to 9.
For other layers (hidden and input), the output can be anything, most of the time it won't be binary.
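For instance, such a 10-neuron output layer is usually driven by a softmax, which produces probabilities rather than binary values (a small sketch with hypothetical logits, independent of any particular framework):

import numpy as np

# Hypothetical raw scores (logits) from a 10-neuron output layer, one per digit 0-9.
logits = np.array([1.2, -0.3, 0.4, 3.1, 0.0, -1.0, 0.7, 2.2, -0.5, 0.1])

probs = np.exp(logits - logits.max())
probs /= probs.sum()          # softmax: each value in (0, 1), all summing to 1

print(probs.argmax())         # predicted digit (3 here); none of the outputs is exactly 0 or 1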
EDIT:
I think I misunderstood your question a bit, and you probably aren't talking about fuzzy neural networks either.
So if you are not considering those (which covers most cases), then when you say a neuron has fired, you mean its output is 1 (binary high) and 0 otherwise, so yes, it's binary.
Do I send the output from my sigmoid or do I send a full stop/start
The way the sigmoid function is used in neural networks (together with the weights), it effectively pushes the computation toward a binary result, so the two options mean roughly the same thing. There is a difference, but networks usually end up avoiding the region where the sigmoid outputs a value that cannot be nicely approximated as 0 or 1: the weights feeding into that neuron are adjusted until it gives a clear 0 or 1.
Also note that, while it is still worth knowing sigmoid (and tanh), for practical purposes ReLU, Leaky ReLU, or maxout are usually better choices.
Suggested: http://cs231n.github.io/neural-networks-1/
You may also find the lectures (videos and notes) by Andrew Ng, Andrej Karpathy, etc. helpful.
I was recently learning about neural networks and came across the MNIST data set. I understood that a sigmoid cost function is used to reduce the loss, and that the weights and biases get adjusted until optimum weights and biases are found during training. The thing I did not understand is: on what basis are the images classified? For example, to classify whether a patient has cancer or not, data like age, location, etc. become features. In the MNIST dataset, I did not find any of that. Am I missing something here? Please help me with this.
First of all, the network pipeline consists of 3 main parts:
input manipulation,
parameters that affect the finding of the minimum, and
parameters like your decision function in your interpretation layer (often a fully connected layer).
In contrast to a regular machine learning pipeline, where you have to extract features manually, a CNN uses filters (filters like those used in edge detection or Viola-Jones).
As a filter runs across the image and is convolved with the pixels, it produces an output.
This output is then interpreted by a neuron. If the output is above a threshold, it is considered valid (a step function counts 1 if valid; in the case of a sigmoid, the output is the corresponding value on the sigmoid curve).
The next steps are the same as before.
This proceeds until the interpretation layer (often a softmax). This layer interprets your computation: if the filters are well adapted to your problem, you will get a good predicted label, which means a small difference between y_guess and y_true_label.
Now you can see that, to obtain the guess for y, we have multiplied the input x by many weights w and also applied functions to it. This composition of functions is exactly what the chain rule from calculus handles.
To get better results, the effect of each single weight on the error must be known. For that you use backpropagation, which computes the derivative of the error with respect to every w. The trick is that intermediate derivatives can be reused, which is more or less what backpropagation is, and it becomes easier still with matrix-vector notation.
Once you have the gradient, you can use the usual idea of minimization, where you walk along the steepest descent. (There are also many other gradient-based methods, like Adagrad or Adam.)
These steps repeat until convergence, or until you reach the maximum number of epochs.
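As a rough illustration of one such descent step (a minimal sketch of plain gradient descent on a squared error, not the poster's code; the data here are made up):

import numpy as np

# One linear neuron trained with plain gradient descent on a squared error.
# Backpropagation generalizes the gradient computation to many layers via the chain rule.
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 3))                  # a small batch of inputs
y = x @ np.array([1.0, -2.0, 0.5])            # made-up targets

w = np.zeros(3)                               # weights to be learned
lr = 0.1                                      # learning rate
for epoch in range(100):                      # repeat until convergence / max epochs
    y_guess = x @ w
    grad = 2 * x.T @ (y_guess - y) / len(y)   # dE/dw for E = mean((y_guess - y)**2)
    w -= lr * grad                            # walk along the steepest descent

print(w)                                      # approaches [1.0, -2.0, 0.5]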
So the answer is: THE COMPUTED WEIGHTS (FILTERS) ARE THE KEY TO DETECT NUMBERS AND DIGITS :)
Graduate student here, new to Keras and neural networks; I was trying to fit a very simple feedforward neural network to a one-dimensional sine.
Below are three examples of the best fit I can get. In the plots, you can see the output of the network vs. the ground truth.
The complete code, just a few lines, is posted here: example Keras.
I have played with the number of layers, different activation functions, different initializations, different loss functions, the batch size, and the number of training samples. It seems that none of those improved the results beyond the above examples.
I would appreciate any comments and suggestions. Is sine a hard function for a neural network to fit? I suspect that the answer is not, so I must be doing something wrong...
There is a similar question here from 5 years ago, but the OP there didn't provide the code and it is still not clear what went wrong or how he was able to resolve this problem.
In order to make your code work, you need to:
scale the input values to the [-1, +1] range (neural networks don't like big values)
scale the output values as well, as the tanh activation doesn't work too well close to +/-1
use the relu activation instead of tanh in all but the last layer (converges way faster)
With these modifications, I was able to run your code with two hidden layers of 10 and 25 neurons; a sketch of this setup follows.
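A minimal Keras sketch of those three changes (the data range, epoch count, and scaling factors here are assumptions, not taken from the question; the layer sizes of 10 and 25 follow this answer):

import numpy as np
from tensorflow import keras

# Assumed toy data: one period of a sine sampled on [0, 2*pi].
x = np.linspace(0.0, 2.0 * np.pi, 1000).reshape(-1, 1)
y = np.sin(x)

x_scaled = (x - x.mean()) / (x.max() - x.min()) * 2.0   # roughly [-1, +1]
y_scaled = 0.8 * y                                      # keep targets away from +/-1

model = keras.Sequential([
    keras.Input(shape=(1,)),
    keras.layers.Dense(10, activation="relu"),
    keras.layers.Dense(25, activation="relu"),
    keras.layers.Dense(1),                              # linear output layer
])
model.compile(optimizer="adam", loss="mse")
model.fit(x_scaled, y_scaled, epochs=500, verbose=0)

y_pred = model.predict(x_scaled) / 0.8                  # undo the output scaling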
Since there is already an answer that provides a workaround I'm going to focus on problems with your approach.
Input data scale
As others have stated, your input data's value range of 0 to 1000 is quite big. This problem can be easily solved by scaling your input data to zero mean and unit variance (X = (X - X.mean())/X.std()), which will result in improved training performance. For tanh this improvement can be explained by saturation: tanh maps to [-1, 1] and will therefore return either -1 or 1 for almost any sufficiently big (>3) x, i.e. it saturates. In saturation the gradient of tanh is close to zero and nothing is learned. Of course, you could also use ReLU instead, which won't saturate for values > 0; however, you would have a similar problem, as the gradients would then depend (almost) solely on x, so later inputs would always have a higher impact than earlier inputs (among other things).
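A quick way to see that saturation numerically (just an illustration, unrelated to the question's actual data):

import numpy as np

x = np.array([0.5, 3.0, 10.0, 1000.0])
print(np.tanh(x))           # [0.46, 0.995, ~1.0, 1.0]  -- the output flattens quickly
print(1 - np.tanh(x) ** 2)  # tanh gradient: [0.79, 0.0099, ~0.0, 0.0]  -- vanishes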
While re-scaling or normalization may be one solution, another is to treat your input as categorical and map your discrete values to a one-hot encoded vector, so instead of
>>> X = np.arange(T)
>>> X.shape
(1000,)
you would have
>>> X = np.eye(len(X))
>>> X.shape
(1000, 1000)
Of course this might not be desirable if you want to learn continuous inputs.
Modeling
You are currently trying to model a mapping from a linear function to a non-linear function: you map f(x) = x to g(x) = sin(x). While I understand that this is a toy problem, this way of modeling is limited to this one curve, because f(x) is in no way related to g(x). As soon as you try to model different curves, say both sin(x) and cos(x), with the same network, you will have a problem with your X, since it has exactly the same values for both curves. A better approach to modeling this problem is to predict the next value of the curve, i.e. instead of
X = range(T)
Y = sin(X)
you want
X = sin(range(T))[:-1]
Y = sin(range(T))[1:]
so at time-step 2 you get the y value of time-step 1 as input, and the loss is computed against the y value of time-step 2. This way you implicitly model time.
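Here is a small runnable sketch of that data layout (the sampling step and the reshape for a dense model are my assumptions):

import numpy as np

T = 1000
s = np.sin(np.arange(T) * 0.02)    # sampled sine curve; the step size 0.02 is arbitrary

X = s[:-1].reshape(-1, 1)          # curve value at time-step t (the input)
Y = s[1:].reshape(-1, 1)           # curve value at time-step t + 1 (the target)

# The same construction works for cos(x) or any other curve, because the input
# is now the previous curve value rather than the raw time index.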