Neural Network Becomes Unruly with Large Layers - python

This is a higher-level question about the performance of a neural network. The issue I'm having is that with larger numbers of neurons per layer, the network has frequent rounds of complete stupidity. These failures are not consistent; it seems that the probability of general success vs. failure is about 50/50 once layers get larger than 60 neurons (always 3 layers).
I tested this by teaching the same function to networks with input and hidden layers of sizes from 10 to 200. The success rate is either 0-1% or 90+%, but nothing in between. To help visualize this, I graphed it; "Failures" is the total count of incorrect responses on 200 data sets after 5k training iterations.
I think it's also important to note that the numbers at which the network succeeds or fails change for each run of the experiment. The only possible culprit I've come up with is local minima (but don't let this influence your answers, I'm new to this, and initial attempts to minimize the chance of local minima seem to have no effect).
So, the ultimate question is, what could cause this behavior? Why is this thing so wildly inconsistent?
The Python code is on Github and the code that generated this graph is the testHugeNetwork method in test.py (line 172). If any specific parts of the network algorithm would be helpful I'm glad to post relevant snippets.

My guess is that your network is oscillating heavily across a jagged error surface. Trying a lower learning rate might help. But first of all, there are a few things you can do to better understand what your network is doing:
plot the output error over training epochs. This will show you when in the training process things go wrong.
have a graphical representation (an image) of your weight matrices and of your outputs. This makes it much easier to spot irregularities (both diagnostics are sketched right below).
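A quick matplotlib sketch of both diagnostics (the arrays here are placeholders for whatever your training loop actually records):
import numpy as np
import matplotlib.pyplot as plt

# placeholders: per-iteration error history and one weight matrix from your network
errors = np.exp(-np.linspace(0, 3, 5000)) + 0.05 * np.random.rand(5000)  # replace with recorded errors
W_hidden = np.random.randn(60, 60)                                       # replace with an actual weight matrix

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(errors)                         # shows when in training things go wrong
ax1.set_xlabel("training iteration")
ax1.set_ylabel("output error")

im = ax2.imshow(W_hidden, cmap="coolwarm")  # each pixel is one weight
fig.colorbar(im, ax=ax2)
ax2.set_title("hidden-layer weights")
plt.show()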
A major problem with ANN training is saturation of the sigmoid function. Towards the asymptotes of both the logistic function and tanh, the derivative is close to 0; numerically it may even be exactly zero. As a result, the network will only learn very slowly or not at all. This problem occurs when the inputs to the sigmoid are too big. Here's what you can do about it:
initialize your weights in proportion to the number of inputs a neuron receives. The standard literature suggests drawing them from a distribution with mean 0 and standard deviation 1/sqrt(m), where m is the number of input connections.
scale your teacher signals (targets) so that they lie where the network can learn the most, that is, where the activation function is steepest: at the maximum of its first derivative. For tanh you can alternatively scale the function to f(x) = 1.7159 * tanh(2/3 * x) and keep the teachers in [-1, 1]. However, don't forget to adjust the derivative to f'(x) = 2/3 * 1.7159 * (1 - tanh^2(2/3 * x)). Both suggestions are sketched right below.
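A minimal numpy sketch of both points (the layer size m is made up for illustration):
import numpy as np

m, n = 60, 60                      # fan-in / fan-out of one layer (illustrative sizes)
rng = np.random.default_rng(0)

# weights with mean 0 and standard deviation 1/sqrt(m), so pre-activations stay
# in the steep region of the sigmoid/tanh instead of the saturated tails
W = rng.normal(loc=0.0, scale=1.0 / np.sqrt(m), size=(m, n))

# LeCun-style scaled tanh and its derivative, with teacher values kept in [-1, 1]
def f(x):
    return 1.7159 * np.tanh(2.0 / 3.0 * x)

def f_prime(x):
    return 2.0 / 3.0 * 1.7159 * (1.0 - np.tanh(2.0 / 3.0 * x) ** 2)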
Let me know if you need additional clarification.

Related

Why is Normalization causing my network to have exploding gradients in training?

I've built a network (In Pytorch) that performs well for image restoration purposes. I'm using an autoencoder with a Resnet50 encoder backbone, however, I am only using a batch size of 1. I'm experimenting with some frequency domain stuff that only allows me to process one image at a time.
I have found that my network performs reasonably well, however, it only behaves well if I remove all batch normalization from the network. Now of course batch norm is useless for a batch size of 1 so I switched over to group norm, designed for this purpose. However, even with group norm, my gradient explodes. The training can go very well for 20 - 100 epochs and then game over. Sometimes it recovers and explodes again.
I should also say that in training, every new image fed in is given a wildly different amount of noise to train for random noise amounts. This has been done before but perhaps coupled with a batch size of 1 it could be problematic.
I'm scratching my head at this one and I'm wondering if anyone has suggestions. I've dialed in my learning rate and clipped the max gradients but this isn't really solving the actual issue. I can post some code but I'm not sure where to start and hoping someone could give me a theory. Any ideas? Thanks!
To answer my own question: my network was unstable in training because a batch size of 1 makes the data too different from batch to batch, or, as the papers like to put it, because the internal covariate shift was too high.
Not only were my images drawn from a very large and varied dataset, they were also rotated and flipped randomly. On top of that, a random amount of Gaussian noise between 0 and 30 was chosen for each image, so one image might have little to no noise while the next was barely distinguishable in some cases.
In the above question I mentioned group norm - my network is complex and some of the code is adapted from other work. There were still batch norm functions hidden in my code that I missed. I removed them. I'm still not sure why BN made things worse.
Following this, I reimplemented group norm with groups of size 32, and things are training much more nicely now.
In short, removing the extra BN and adding group norm helped; a sketch of the swap follows.
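A minimal sketch of that swap (the channel count is illustrative and the original network isn't reproduced; 32 is read here as the number of groups, matching nn.GroupNorm's signature):
import torch.nn as nn

channels = 256  # illustrative; must be divisible by the number of groups

# before: BatchNorm2d estimates statistics per batch, which is unreliable at batch size 1
# norm = nn.BatchNorm2d(channels)

# after: GroupNorm normalizes over channel groups within each sample,
# so it is independent of the batch size
norm = nn.GroupNorm(num_groups=32, num_channels=channels)

block = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=3, padding=1),
    norm,
    nn.ReLU(inplace=True),
)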

Why does scaling down the parameters many times during training help keep the learning speed the same for all weights in Progressive GAN?

Equalized learning rate is one of the special techniques in Progressive GAN, a paper by the NVIDIA team. Introducing this method, they state that
Our approach ensures that the dynamic range, and thus the learning speed, is the same for all weights.
In detail, they initialize all learnable parameters from a normal distribution N(0, 1). Then, during training, on each forward pass they scale the result with the per-layer normalization constant from He's initializer.
I reproduced the code from the pytorch GAN zoo GitHub repo:
import math
from numpy import prod  # imports assumed by the snippet

def forward(self, x, equalized):
    # generate the He constant from the size of the wrapped module's weight tensor W
    size = self.module.weight.size()
    fan_in = prod(size[1:])                # number of input connections
    self.weight = math.sqrt(2.0 / fan_in)  # per-layer He constant
    '''
    A module example:
    import torch.nn as nn
    module = nn.Conv2d(nChannelsPrevious, nChannels, kernelSize, padding=padding, bias=bias)
    '''
    x = self.module(x)
    if equalized:
        x *= self.weight                   # rescale the output on every forward pass
    return x
At first, I thought the He constant would be c = sqrt(2 / fan_in), as in He's paper.
Normally, W would be scaled up by c so that the gradients in backpropagation are increased, which, per the formula in ProGAN's paper, prevents vanishing gradients.
However, the code shows that fan_in is typically much larger than 2, so c = sqrt(2 / fan_in) < 1 and the result is actually scaled down on every forward pass.
In summary, I can't understand why scaling down the parameters many times during training helps keep the learning speed stable.
I asked this question in some communities, e.g. Artificial Intelligence and Mathematics, and still haven't had an answer.
Please help me explain it, thank you!
There is already an explanation in the paper for the reason for scaling down the parameters in every single pass:
The benefit of doing this dynamically instead of during initialization is somewhat subtle and relates to the scale-invariance in commonly used adaptive stochastic gradient descent methods such as RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015). These methods normalize a gradient update by its estimated standard deviation, thus making the update independent of the scale of the parameter. As a result, if some parameters have a larger dynamic range than others, they will take longer to adjust. This is a scenario modern initializers cause, and thus it is possible that a learning rate is both too large and too small at the same time.
I believe multiplying by He's constant in every pass ensures that the range of the parameters will not be too wide at any point during backpropagation, so they will not take long to adjust. If, for example, the discriminator at some point in the learning process adjusts faster than the generator, it will not take the generator long to catch up, and consequently their learning speeds will equalize.
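To make the "scaling down" concrete, here is a quick check of the per-layer constant computed in the code above, for an illustrative conv weight shape:
import math
from numpy import prod

# illustrative weight shape: out_channels x in_channels x kH x kW
size = (512, 512, 3, 3)
fan_in = prod(size[1:])        # 512 * 3 * 3 = 4608
c = math.sqrt(2.0 / fan_in)    # ~0.0208, i.e. well below 1

print(c, c < 1)                # the runtime multiplication scales the signal *down*
Because this multiplication happens at runtime, the stored parameters all remain at the N(0, 1) scale that Adam/RMSProp normalize over, which is exactly the dynamic-range argument in the quoted paragraph.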

GAN Generator output layer is [way] out of target range - potentially due to loss func

The TL;DR is that my G(z) (z is normally distributed noise) is way out of the expected range; see the picture comparing real against generated data.
(It's also a terrible model, being just a few dense layers - but I can sort that later!)
I've thrown s**t at the wall for the past week to see what sticks and now it's time to ask for an informed opinion! :)
Has anyone else encountered this problem before? Suggested fixes?
What is your generic loss function for your regressive Generator (with a categorical discriminator)?
As it's a regression problem (with target data normalised between 0 and 1), my G output has a linear activation. However, I'm trying to follow Soumith's GAN Hacks to get it working, and we conflict over the activation: he says it should be tanh. One could assume that with my expected target range I could choose sigmoid, but, as I say, I infer the output should be linear since this is a regression problem. Opinions?
Plus there isn't much clarity on the loss function. He Says:-
In GAN papers, the loss function to optimize G is min (log 1-D), but in practice folks practically use max log D
Does this mean one should aim to minimise the error log(1 - D), or pick the minimum of log(1 - D)? Similarly with max log(D). (I'm using class definitions of true/false, so a maximum and minimum could be chosen from the two classes.)
In the Goodfellow et al. 2014 paper on GANs, I can see how part of the loss function for the Discriminator resolves to 0 when applied to the Generator. However, I'm obviously missing something, as the equation states I should take log(D(x)), which will sooner or later hit a NaN.
Clarification about my confused state is thanked many times in advance.
Update:
This tutorial on O'Reilly says
Take a look at the last line of our discriminator: there's no softmax or sigmoid layer at the end. GANs can fail if their discriminators "saturate," or become confident enough to return exactly 0 when they're given a generated image; that leaves the discriminator without a useful gradient to descend.
Which sounds a little like my problem, especially as I was using Softmax. Using no activation doesn't solve the problem, but does make my GAN compete in interesting new ways!
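For what it's worth, "min (log 1-D)" and "max log D" in the quote refer to the training objective for G, not to picking a minimum or maximum between the two class outputs: the non-saturating trick minimises -log(D(G(z))), i.e. it trains G against "real" labels on its own samples. A minimal PyTorch-style sketch (D and G are placeholders, assumed to output raw logits and generated samples respectively):
import torch
import torch.nn.functional as F

def generator_loss(D, G, z):
    # non-saturating loss: maximise log D(G(z)) == minimise BCE against "real" labels
    fake = G(z)
    logits = D(fake)                      # raw logits, no sigmoid/softmax in D's last layer
    real_labels = torch.ones_like(logits)
    return F.binary_cross_entropy_with_logits(logits, real_labels)

def discriminator_loss(D, G, x_real, z):
    # standard GAN discriminator loss on a real and a generated batch
    logits_real = D(x_real)
    logits_fake = D(G(z).detach())
    loss_real = F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real))
    loss_fake = F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake))
    return loss_real + loss_fake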

What is the key feature in MNIST Dataset that is used to classify images

I was recently learning about neural networks and came across the MNIST data set. I understood that a sigmoid cost function is used to reduce the loss, and that weights and biases get adjusted until optimum weights and biases are found during training. The thing I did not understand is on what basis the images are classified. For example, to classify whether a patient has cancer or not, data like age, location, etc. become features. In the MNIST dataset, I did not find any of that. Am I missing something here? Please help me with this.
First of all, the network pipeline consists of 3 main parts:
1. input manipulation
2. parameters that affect finding the minimum
3. parameters like your decision function in the interpretation layer (often a fully connected layer)
In contrast to your regular machine learning pipeline, where you have to extract features manually, a CNN uses filters (filters like those in edge detection or Viola-Jones).
When a filter runs across the image and is convolved with the pixels, it produces an output.
This output is then interpreted by a neuron. If the output is above a threshold it is considered valid (a step function counts 1 if valid; in the case of a sigmoid, it takes a value on the sigmoid curve).
The next steps are the same as before.
This proceeds until the interpretation layer (often softmax). This layer interprets your computation: if the filters are well adapted to your problem, you will get a good predicted label, which means a low difference between y_guess and y_true_label.
Now you can see that to get the guess of y we have multiplied the input x with many weights w and also applied functions to it. This can be seen as a chain of functions, so the chain rule from calculus applies.
To get better results, the effect of each single weight on the error must be known. Therefore you use backpropagation, which computes the derivative of the error with respect to all weights w. The trick is that you can reuse intermediate derivatives, which is more or less what backpropagation is, and it becomes easier since you can use matrix-vector notation.
If you have your gradient, you can use the normal concept of minimization where you walk along the steepest descent. (There are also many other gradient methods like adagrad or adam etc).
The steps will repeat until convergence or until you reach the maximum epochs.
So the answer is: THE COMPUTED WEIGHTS (FILTERS) ARE THE KEY TO DETECT NUMBERS AND DIGITS :)
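To make that pipeline concrete, here is a minimal tf.keras sketch (hyperparameters are illustrative, not taken from the question) in which the convolutional filters are exactly the learned "key features" for MNIST:
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0   # scale pixels to [0, 1], add channel dimension
x_test = x_test[..., None] / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(28, 28, 1)),  # learned filters
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),  # interpretation layer
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2, validation_data=(x_test, y_test))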

Neural network toy model to fit sine function fails, what's wrong?

I'm a graduate student, new to Keras and neural networks, and I was trying to fit a very simple feedforward neural network to a one-dimensional sine function.
Below are three examples of the best fit that I can get. On the plots, you can see the output of the network vs. the ground truth.
The complete code, just a few lines, is posted here: example Keras.
I was playing with the number of layers, different activation functions, different initializations, and different loss functions, batch size, number of training samples. It seems that none of those were able to improve the results beyond the above examples.
I would appreciate any comments and suggestions. Is sine a hard function for a neural network to fit? I suspect that the answer is not, so I must be doing something wrong...
There is a similar question here from 5 years ago, but the OP there didn't provide the code and it is still not clear what went wrong or how he was able to resolve this problem.
In order to make your code work, you need to:
scale the input values in the [-1, +1] range (neural networks don't like big values)
scale the output values as well, as the tanh activation doesn't work too well close to +/-1
use the relu activation instead of tanh in all but the last layer (converges way faster)
With these modifications, I was able to run your code with two hidden layers of 10 and 25 neurons; a sketch of that recipe follows.
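A minimal sketch of that recipe (the data range from the linked gist isn't reproduced here, so a single sine period is used; the 10/25 layer sizes come from the sentence above):
import numpy as np
import tensorflow as tf

# input scaled to [-1, 1]; target pulled slightly away from the tanh asymptotes
x = np.linspace(-np.pi, np.pi, 2000)
x_scaled = x / np.pi
y = 0.9 * np.sin(x)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dense(25, activation="relu"),
    tf.keras.layers.Dense(1, activation="tanh"),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x_scaled, y, epochs=200, batch_size=32, verbose=0)
pred = model.predict(x_scaled)   # compare against y to see the fit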
Since there is already an answer that provides a workaround I'm going to focus on problems with your approach.
Input data scale
As others have stated, your input data value range from 0 to 1000 is quite big. This problem can be easily solved by scaling your input data to zero mean and unit variance (X = (X - X.mean())/X.std()), which will result in improved training performance. For tanh this improvement can be explained by saturation: tanh maps to [-1, 1] and will therefore return either -1 or 1 for almost all sufficiently big (> 3) x, i.e. it saturates. In saturation the gradient of tanh will be close to zero and nothing will be learned. Of course, you could also use ReLU instead, which won't saturate for values > 0; however, you will then have a similar problem, as the gradients depend (almost) solely on x, so later inputs will always have a higher impact than earlier inputs (among other things).
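A quick numpy check of the saturation argument:
import numpy as np

x = np.array([0.5, 3.0, 10.0, 1000.0])
t = np.tanh(x)
grad = 1.0 - t ** 2        # derivative of tanh

print(t)     # ~[0.46, 0.995, 1.0, 1.0]  -> outputs stuck near +/-1 for big inputs
print(grad)  # ~[0.79, 0.0099, 8e-09, 0.0] -> gradient numerically vanishes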
While re-scaling or normalization may be a solution, another solution would be to treat your input as a categorical input and map your discrete values to a one-hot encoded vector, so instead of
>>> X = np.arange(T)
>>> X.shape
(1000,)
you would have
>>> X = np.eye(len(X))
>>> X.shape
(1000, 1000)
Of course this might not be desirable if you want to learn continuous inputs.
Modeling
You are currently trying to model a mapping from a linear function to a non-linear function: you map f(x) = x to g(x) = sin(x). While I understand that this is a toy problem, this way of modeling is limited to only this one curve, as f(x) is in no way related to g(x). As soon as you try to model different curves, say both sin(x) and cos(x), with the same network, you will have a problem with your X, as it has exactly the same values for both curves. A better approach to modeling this problem is to predict the next value of the curve, i.e. instead of
X = range(T)
Y = sin(X)
you want
S = sin(range(T))
X = S[:-1]
Y = S[1:]
so for time-step 2 you will get the y value of time-step 1 as input, and your loss expects the y value of time-step 2. This way you implicitly model time; a short numpy sketch of this layout follows.
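A short numpy sketch of this data layout (the sampling rate is illustrative):
import numpy as np

T = 1000
s = np.sin(0.02 * np.arange(T))   # the curve we actually care about

X = s[:-1]   # value at time-step k ...
Y = s[1:]    # ... is used to predict the value at time-step k+1

# the same construction works unchanged for cos or any other curve, because the
# input is now a point on the curve itself rather than the raw index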
