Issue NaN with Adam solver - python

I'm training networks with the Adam solver and ran into the problem that the optimization hits NaN at some point, even though the loss seems to decrease nicely up to that point. It happens only for some specific configurations and after a couple of thousand iterations. For example, the network with batch size 5 has the problem, while with a batch size of one it works. So I started debugging my code:
1) The first thing that came to my mind was to check the inputs when the net hits NaN, but they look reasonable (correctly labeled ground truth and input with an okay-ish value range).
2) While searching I discovered tf.verify_tensor_all_finite(..) and put it all over my code to see which tensor first becomes NaN.
I could narrow down the problem to the following lines:
kernel = tf.verify_tensor_all_finite(kernel, 'kernel')
in_tensor = tf.verify_tensor_all_finite(in_tensor, 'in_tensor')
tmp_result = tf.nn.conv2d_transpose(value=in_tensor, filter=kernel, output_shape=output_shape,
strides=strides, padding='SAME')
tmp_result = tf.verify_tensor_all_finite(tmp_result, 'convres')
These throw an error that reads:
InvalidArgumentError (see above for traceback): convres : Tensor had NaN values
[[Node: upconv_logits5_fs/VerifyFinite_2/CheckNumerics = CheckNumerics[T=DT_FLOAT, _class=["loc:#upconv_logits5_fs/conv2d_transpose"], message="convres", _device="/job:localhost/replica:0/task:0/gpu:0"](upconv_logits5_fs/conv2d_transpose)]]
[[Node: Adam/update/_2794 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_154_Adam/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Now I'm not sure what happened here.
I guess that during the forward pass everything went well, because the scalar loss didn't trigger an error and kernel and input were still valid numbers. It seems as if some Adam update node modifies the value of my upconv_logits5_fs towards NaN. This transposed convolution op is the very last one of my network and therefore the first one to be updated.
I'm working with a tf.nn.softmax_cross_entropy_with_logits() loss and put tf.verify_tensor_all_finite() on all its inputs and outputs, but they don't trigger errors. The only conclusion I can draw is that there might be a numerical issue with the Adam solver.
What do you think about that conclusion?
Does anybody have any idea how to proceed or what I could try?
Your help is very much appreciated.
EDIT:
I was able to solve my problem by increasing the solver's epsilon parameter from 1e-8 to 1e-4. It seemed like some of my parameters tend to have very little to zero variance, and that resulted in tf.sqrt(0.0 + epsilon), which caused numerical problems.
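For reference, a minimal TF 1.x-style sketch of that fix (the learning rate here is just a placeholder and loss stands for the existing loss tensor):
import tensorflow as tf

# A larger epsilon keeps Adam's denominator sqrt(v_hat) + epsilon away from
# zero for parameters whose second-moment estimate is (near) zero.
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4, epsilon=1e-4)
train_op = optimizer.minimize(loss)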

I had run into the same issue multiple times. The reason behind this problem is the combination of softmax and cross-entropy: when you compute the gradient and divide by zero or inf, you get NaN, which then propagates through all your parameters.
A few pieces of advice to avoid this problem:
If the error starts increasing and NaN appears afterwards: divergence due to a too-high learning rate.
If NaNs appear suddenly: saturating units yielding a non-differentiable gradient.
NaN computation due to log(0).
NaN due to floating-point issues (too high weights) or activations on the output.
0/0, inf/inf, inf*weight...
Solutions:
Reduce the learning rate.
Change the weight initialization.
Use L2 regularization.
Safe softmax (add a small value inside log(x)); see the sketch after this list.
Gradient clipping.
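To illustrate the safe-softmax point, here is a rough TF 1.x-style sketch for the case where you compute the cross-entropy from probabilities yourself; note that fused ops such as tf.nn.softmax_cross_entropy_with_logits already handle this internally:
import tensorflow as tf

def safe_cross_entropy(labels, probs, eps=1e-8):
    # Clamp probabilities away from 0 and 1 so log() stays finite.
    probs = tf.clip_by_value(probs, eps, 1.0 - eps)
    return -tf.reduce_sum(labels * tf.log(probs), axis=-1)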
In my case the learning rate solved the issue, but I'm still working to optimize it more.

One additional step which was not included in Feras's answer and cost me a day of debugging:
Increase the precision of your variables. I had a network where a lot of variables were defined as float16. The network worked fine for all the optimizers except Adam and Adadelta. After hours of debugging I switched to tf.float64 and it worked.

This is probably quite particular to my case but might still help someone else.
My loss would suddenly turn to NaN without reaching particularly large values beforehand. I checked that my data wasn't corrupted, and tried playing with the learning rate, adding clipnorm, batch normalization layers and so on, without any success.
I was actually adding a random epsilon to my denominator somewhere in the model (to avoid division by 0) but didn't pay attention to its range. Changing the minval from 0 to 1e-18 got rid of the problem for me.
rand_num = Lambda(lambda input: K.random_uniform(tf.shape(input), minval = 1e-18, maxval=1e-17))(s_p)
I guess some of the randomly picked values were too small to serve their purpose and fix potential division by zero.

This is a numerical stability issue. I would suggest trying a lower learning rate to see if that resolves your problem.

Related

How do I better process my data and set parameters for my Neural Network?

When I run my NN, the only way to get any training to occur is if I divide X by 1000. The network also needs to be trained for under 70000 iterations with a 0.03 learning rate, and if those values are larger the NN gets worse. I think this is due to bad processing of the data and maybe the lack of biases, but I don't really know.
Code on Google Colab
In short: all of the problems you mentioned and more.
Scaling is essential, typically to 0 mean and a variance of 1. Otherwise, you will quickly saturate the hidden units, their gradients will be near zero and (almost) no learning will be possible.
Bias is mandatory for such an ANN. It's like an offset when fitting a linear function. If you drop it, getting a good fit will be very difficult.
You seem to be checking accuracy on your training data.
You have very few training samples.
Sigmoid has proven to be a poor choice. Use ReLU and check e.g. here for an explanation.
Also, I'd recommend spending some time on learning Python before going into this. For starters, avoid using global; it can give you unforeseen behaviour if you're not careful.

Why is my Deep Q Net and Double Deep Q Net unstable?

I am trying to implement DQN and DDQN (both with experience replay) to solve the OpenAI Gym CartPole environment. Both approaches are able to learn and solve this problem sometimes, but not always.
My network is simply a feed-forward network (I've tried using 1 and 2 hidden layers). I created one network for DQN and two networks for DDQN: a target network to evaluate the Q value and a primary network to choose the best action; I train the primary network and copy it to the target network after some episodes.
The problem in DQN is:
Sometimes it can achieve the perfect score of 200 within 100 episodes, but sometimes it gets stuck and only achieves a score of 10 no matter how long it is trained.
Also, in the case of successful learning, the learning speed differs.
The problem in DDQN is:
It can learn to achieve a score of 200, but then it seems to forget what it has learned and the score drops dramatically.
I've tried tuning batch size, learning rate, number of neurons in the hidden layer, the number of hidden layers, exploration rate, but instability persists.
Are there any rules of thumb on the size of the network and the batch size? I think a reasonably larger network and a larger batch size will increase stability.
Is it possible to make the learning stable? Any comments or references are appreciated!
These kinds of problems happen pretty often and you shouldn't give up. First, of course, you should do another one or two checks that the code is all right: try to compare your code to other implementations, see how the loss function behaves, etc. If you are pretty sure your code is fine (and, as you say, the model can learn the task from time to time, so it probably is), you should start experimenting with the hyper-parameters.
Your problems seem to be connected to hyper-parameters like exploration technique, learning rate, the way you are updating the target networks and to the experience replay memory. I would not play around with the hidden layer sizes - find the values for which the model learned once and keep them fixed.
Exploration technique: I assume you use an epsilon-greedy strategy. My advice would be to start with a high epsilon value (I usually start with 1.0) and decay it after each step or episode, but define an epsilon_min too. Starting with a low epsilon value may be the cause of the different learning speeds and success rates: if you go fully random, you populate your memory with a varied kind of transitions at the beginning. With lower epsilon rates at the start, there is a bigger chance that your model does not explore enough before the exploitation phase begins.
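A minimal sketch of such a schedule (the decay factor and floor are arbitrary example values):
epsilon = 1.0          # start fully random
epsilon_min = 0.05     # keep a little exploration forever
epsilon_decay = 0.995  # multiplicative decay applied once per episode

def decay_epsilon(epsilon):
    # Call this at the end of every episode (or step).
    return max(epsilon_min, epsilon * epsilon_decay)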
Learning rate: Make sure it is not too big. A smaller rate may lower the learning speed, but it helps a learned model not to escape from a global minimum back to some local, worse one. Also, adaptive learning rates such as those calculated by Adam might help you. Of course the batch size has an impact as well, but I would keep it fixed and worry about it only if the other hyper-parameter changes don't work.
Target network update (rate and value): This is an important one as well. You have to experiment a bit: not only how often you perform the update, but also how much of the primary values you copy into the target ones. People often do a hard update each episode or so, but try doing soft updates instead if the first technique does not work.
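A sketch of the soft-update variant, assuming Keras-style models with identical architectures (tau is a small interpolation factor, e.g. 0.01):
def soft_update(primary_model, target_model, tau=0.01):
    # Blend a small fraction of the primary weights into the target weights.
    new_weights = [tau * w_p + (1.0 - tau) * w_t
                   for w_p, w_t in zip(primary_model.get_weights(),
                                       target_model.get_weights())]
    target_model.set_weights(new_weights)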
Experience replay: Do you use it? You should. How big is your memory size? This is a very important factor, and the memory size can influence the stability and success rate (A Deeper Look at Experience Replay). Basically, if you notice instability in your algorithm, try a bigger memory size, and if it affects your learning curve a lot, try out the technique proposed in the mentioned paper.
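If you do not have a replay memory yet, a minimal version is just a bounded deque; the capacity is the memory size discussed above (a sketch, not tied to any particular DQN implementation):
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100000):
        self.memory = deque(maxlen=capacity)  # old transitions drop out automatically

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)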
Maybe this can help you with your problem on this environment.
Cartpole problem with DQN algorithm from Udacity
I was also thinking that the problem was the unstable (D)DQN, or that "CartPole" was bugged or "not stably solvable"!
After searching for a few weeks, I have checked my code several times, changed every setting but one...
The discount factor: setting it to 1.0 (really) stabilized my training much more on CartPole-v1 with 500 max steps.
CartPole-v1 was stable in training with a simple Q-Learner (reduce min-alpha and min-epsilon to 0.001): https://github.com/sanjitjain2/q-learning-for-cartpole/blob/master/qlearning.py
The creator has Gamma at 1.0 (I read about it on reddit), so I tested it with a simple DQN (double_q = False) from here: https://github.com/adventuresinML/adventures-in-ml-code/blob/master/double_q_tensorflow2.py
I also removed 1 line: # reward = np.random.normal(1.0, RANDOM_REWARD_STD)
This way it gets the normal +1 reward per step and was "stable" 7 out of 10 runs.
I spent a whole day solving this problem. The return climbs to above 400, and suddenly falls to 9.x.
In my case I think it's due to unstable gradients. The L2 norm of the gradients varies from 1 or 2 to several thousand.
I finally solved it. See whether it helps.
Clip the gradients before applying them, and use a learning rate decay schedule:
import math
import tensorflow as tf

variables = model.trainable_variables
grads = tape.gradient(loss, variables)
# Rescale the gradients so that their global L2 norm is at most 30
grads, grads_norm = tf.clip_by_global_norm(grads, 30.0)
# Decay the learning rate as training progresses
learning_rate = 0.1 / (math.sqrt(total_steps) + 1)
for g, var in zip(grads, variables):
    var.assign_sub(g * learning_rate)
Use an exploration rate decay schedule:
epsilon = 0.85 ** math.log(total_steps + 1, 2)

GAN Generator output layer is [way] out of target range - potentially due to loss func

The TL;DR is that my G(z) (z is normally distributed noise) is way out of the expected range. Picture: see the comparison of real against generated data.
(It's also a terrible model, being just a few dense layers - but I can sort that later!)
I've thrown s**t at the wall for the past week to see what sticks and now it's time to ask for an informed opinion! :)
Has anyone else encountered this problem before? Suggested fixes?
What is your generic loss function for your regressive Generator (with a categorical discriminator)?
As it's a regression problem (with target data normalised between 0 and 1), my G output has a linear activation. However, I'm trying to follow Soumith's GAN Hacks to get it working, where we're conflicting over the activation: he says it should be tanh. One could assume that with my expected target range I could choose sigmoid, but as I say, I infer this should be linear since it's a regression problem to solve. Opinions?
Plus there isn't much clarity on the loss function. He says:
In GAN papers, the loss function to optimize G is min (log 1-D), but in practice folks practically use max log D
Does this mean one should aim to minimise the error log(1-D), or pick the minimum of log(1-D)? Similarly with max log(D). (I'm using class definitions of true/false, so a maximum and minimum can be chosen from the two classes.)
In the Goodfellow et al. 2014 paper on GANs, I can see how some of the loss function for the Discriminator resolves to 0 when applied to the Generator. However, I'm obviously missing something, as the equation states I should take log(D(x)), which will sooner or later hit a NaN.
Clarification about my confused state is thanked many times in advance.
Update:
This tutorial on O'Reilly says
Take a look at the last line of our discriminator: there's no softmax or sigmoid layer at the end. GANs can fail if their discriminators "saturate," or become confident enough to return exactly 0 when they're given a generated image; that leaves the discriminator without a useful gradient to descend.
Which sounds a little like my problem, especially as I was using Softmax. Using no activation doesn't solve the problem, but does make my GAN compete in interesting new ways!

NaN loss when training regression network

I have a data matrix in "one-hot encoding" (all ones and zeros) with 260,000 rows and 35 columns. I am using Keras to train a simple neural network to predict a continuous variable. The code to make the network is the following:
model = Sequential()
model.add(Dense(1024, input_shape=(n_train,)))
model.add(Activation('relu'))
model.add(Dropout(0.1))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.1))
model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.1))
model.add(Dense(1))
sgd = SGD(lr=0.01, nesterov=True);
#rms = RMSprop()
#model.compile(loss='categorical_crossentropy', optimizer=rms, metrics=['accuracy'])
model.compile(loss='mean_absolute_error', optimizer=sgd)
model.fit(X_train, Y_train, batch_size=32, nb_epoch=3, verbose=1, validation_data=(X_test,Y_test), callbacks=[EarlyStopping(monitor='val_loss', patience=4)] )
However, during the training process, I see the loss decrease nicely, but during the middle of the second epoch, it goes to nan:
Train on 260000 samples, validate on 64905 samples
Epoch 1/3
260000/260000 [==============================] - 254s - loss: 16.2775 - val_loss: 13.4925
Epoch 2/3
88448/260000 [=========>....................] - ETA: 161s - loss: nan
I tried using RMSProp instead of SGD, I tried tanh instead of relu, I tried with and without dropout, all to no avail. I tried with a smaller model, i.e. with only one hidden layer, and had the same issue (it becomes NaN at a different point). However, it does work with fewer features, i.e. if there are only 5 columns, and gives quite good predictions. There seems to be some kind of overflow, but I can't imagine why; the loss is not unreasonably large at all.
Python version 2.7.11, running on a Linux machine, CPU only. I tested it with the latest version of Theano, and I also get NaNs, so I tried going to Theano 0.8.2 and have the same problem. The latest version of Keras has the same problem, and so does the 0.3.2 version.
Regression with neural networks is hard to get working because the output is unbounded, so you are especially prone to the exploding gradients problem (the likely cause of the nans).
Historically, one key solution to exploding gradients was to reduce the learning rate, but with the advent of per-parameter adaptive learning rate algorithms like Adam, you no longer need to set a learning rate to get good performance. There is very little reason to use SGD with momentum anymore unless you're a neural network fiend and know how to tune the learning schedule.
Here are some things you could potentially try:
Normalize your outputs by quantile normalizing or z-scoring. To be rigorous, compute this transformation on the training data, not on the entire dataset. For example, with quantile normalization, if an example is in the 60th percentile of the training set, it gets a value of 0.6. (You can also shift the quantile normalized values down by 0.5 so that the 0th percentile is -0.5 and the 100th percentile is +0.5.) See the sketch after this list.
Add regularization, either by increasing the dropout rate or adding L1 and L2 penalties to the weights. L1 regularization is analogous to feature selection, and since you said that reducing the number of features to 5 gives good performance, L1 may also help.
If these still don't help, reduce the size of your network. This is not always the best idea since it can harm performance, but in your case you have a large number of first-layer neurons (1024) relative to input features (35) so it may help.
Increase the batch size from 32 to 128. 128 is fairly standard and could potentially increase the stability of the optimization.
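A sketch of the first suggestion, z-scoring the targets with statistics computed on the training set only (using the X/Y names from the question and assuming numpy arrays):
y_mean = Y_train.mean()
y_std = Y_train.std()

Y_train_scaled = (Y_train - y_mean) / y_std
Y_test_scaled = (Y_test - y_mean) / y_std

# After training, map predictions back to the original scale:
# predictions = model.predict(X_test).ravel() * y_std + y_mean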
The answer by 1" is quite good. However, all of the fixes seem to fix the issue indirectly rather than directly. I would recommend using gradient clipping, which will clip any gradients that are above a certain value.
In Keras you can use clipnorm=1 (see https://keras.io/optimizers/) to simply clip all gradients with a norm above 1.
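Applied to the optimizer from the question above, that would look roughly like this (Keras 1/2-era API, matching the question's code):
from keras.optimizers import SGD

# Rescale any gradient whose L2 norm exceeds 1.0
sgd = SGD(lr=0.01, nesterov=True, clipnorm=1.)
model.compile(loss='mean_absolute_error', optimizer=sgd)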
I faced the same problem before. I searched and found this question and its answers. All the tricks mentioned above are important for training a deep neural network. I tried them all, but still got NaN.
I also found this question here: https://github.com/fchollet/keras/issues/2134.
I cite the author's summary as follows:
I wanted to point this out so that it's archived for others who may experience this problem in the future. I was running into my loss function suddenly returning a NaN after it got so far into the training process. I checked the relus, the optimizer, the loss function, my dropout in accordance with the relus, the size of my network and the shape of the network. I was still getting loss that eventually turned into a NaN and I was getting quite frustrated.
Then it dawned on me. I may have some bad input. It turns out, one of the images that I was handing to my CNN (and doing mean normalization on) was nothing but 0's. I wasn't checking for this case when I subtracted the mean and normalized by the std deviation and thus I ended up with an exemplar matrix which was nothing but NaNs. Once I fixed my normalization function, my network now trains perfectly.
I agree with the viewpoint above: the network is sensitive to its input. In my case, I use the log value of a density estimation as an input. The absolute values can be very large, which may result in NaN after several gradient steps. I think an input check is necessary: first, make sure the input does not include -inf or inf, or some extremely large numbers in absolute value.
I faced the same problem using LSTM. The problem was that my data had some NaN values after standardization, so check the model input data after standardization; if you look, you will see the NaN values:
print(np.any(np.isnan(X_test)))
print(np.any(np.isnan(y_test)))
You can solve this by adding a small value (0.000001) to the std, like this:
import numpy as np

def standardize(train, test):
    mean = np.mean(train, axis=0)
    # Add a small constant so zero-variance features do not cause division by zero
    std = np.std(train, axis=0) + 0.000001
    X_train = (train - mean) / std
    X_test = (test - mean) / std
    return X_train, X_test
To sum up the different solutions mentioned here and in this GitHub discussion (which of course depend on your particular situation):
Add regularization to add l1 or l2 penalties to the weights. Otherwise, try a smaller l2 regularization, e.g. l2(0.001), or remove it if it already exists.
Try a smaller Dropout rate.
Clip the gradients to prevent their explosion. For instance in Keras you could use clipnorm=1. or clipvalue=1. as parameters for your optimizer.
Check the validity of the inputs (no NaNs or, sometimes, 0s), e.g. df.isnull().any().
Replace the optimizer with Adam, which is easier to handle. Sometimes replacing sgd with rmsprop also helps (a combined sketch follows this list).
Use RMSProp with heavy regularization to prevent gradient explosion.
Try normalizing your data, or inspect your normalization process for any bad values introduced.
Verify that you are using the right activation function (e.g. using a softmax instead of sigmoid for multiple class classification).
Try to increase the batch size (e.g. 32 to 64 or 128) to increase the stability of your optimization.
Try decreasing your learning rate.
Check the size of your last batch which may be different from the batch size.
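A combined sketch of a few of these points, assuming a pandas DataFrame df and an already-built Keras model (the loss here is just an example):
from keras.optimizers import Adam

# Check the inputs for missing values before training.
assert not df.isnull().any().any(), "NaNs in the input data"

# Adam with gradient clipping and a modest learning rate.
opt = Adam(lr=1e-4, clipvalue=1.0)
model.compile(loss='mean_squared_error', optimizer=opt)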
I faced a very similar problem, and this is how I got it to run.
The first thing you can try is changing your activation to LeakyReLU instead of using ReLU or tanh. The reason is that often many of the nodes within your layers have an activation of zero, and backpropagation doesn't update the weights for these nodes because their gradient is also zero. This is also called the 'dying ReLU' problem (you can read more about it here: https://datascience.stackexchange.com/questions/5706/what-is-the-dying-relu-problem-in-neural-networks).
To do this, you can import the LeakyReLU activation using:
from keras.layers.advanced_activations import LeakyReLU
and incorporate it within your layers like this:
model.add(Dense(800,input_shape=(num_inputs,)))
model.add(LeakyReLU(alpha=0.1))
Additionally, it is possible that the output feature (the continuous variable you are trying to predict) is imbalanced and has too many 0s. One way to fix this issue is to use smoothing: add 1 to each value in this column and divide each value by 1/(the average of all the values in this column).
This essentially shifts all the values from 0 to a value greater than 0 (which may still be very small). This prevents the curve from predicting 0s and minimizing the loss (eventually making it NaN). Smaller values are more greatly impacted than larger values, but on the whole, the average of the data set remains the same.
I had the same problem, I was using Keras for a Multivariate regression problem. What I later realised was that some values in my dataset were nan and that led to a nan loss.
I used the command:
df=df.dropna()
And it resolved my issue.
I tried every suggestion on this page and many others to no avail. We were importing csv files with pandas, then using keras Tokenizer with text input to create vocabularies and word vector matrices. After noticing some CSV files led to nan while others worked, suddenly we looked at the encoding of the files and realized that ascii files were NOT working with keras, leading to nan loss and accuracy of 0.0000e+00; however, utf-8 and utf-16 files were working! Breakthrough.
If you're performing textual analysis and getting nan loss after trying these suggestions, use file -i {input} (linux) or file -I {input} (osx) to discover your file type. If you have ISO-8859-1 or us-ascii, try converting to utf-8 or utf-16le. Haven't tried the latter but I'd imagine it would work as well. Hopefully this helps someone very very frustrated!
I was getting the loss as NaN in the very first epoch, as soon as the training started. A solution as simple as removing the NaNs from the input data worked for me (df.dropna()).
I hope this helps someone encountering a similar problem.
I had the same problem with my RNN with Keras LSTM layers, so I tried each solution from above. I had already scaled my data (with sklearn.preprocessing.MinMaxScaler); there were no NaN values in my data after scaling. Solutions like using LeakyReLU or changing the learning rate didn't help.
So I decided to change the scaler from MinMaxScaler to StandardScaler, even though I had no NaN values. I found it odd, but it worked!
In my case the issue was that I copy-pasted my previous work for binary classification and used sigmoid activation on the output layer instead of softmax (the new network was about multiclass classification).
I had a similar problem using Keras. Loss turned into NaN after the second batch was input.
I tried to:
Use softmax as activation of output dense layer
Drop nan in the input
Normalize the input
However, that didn't work. So, then I tried to:
Decrease the learning rate
Problem solved.
Try checking whether your data has NaN values. Removing the NaN values solved the problem for me.
I had this issue when one of my training data entries contained a NaN.
I had a similar issue with my logloss, MAE and others being all NAs. I looked into the data and found that I had a few features with NAs in them. I imputed the NAs with approximate values and was able to solve the issue.
I had the same problem with my Keras CNN. Like others, I tried all the above solutions: decreasing the learning rate, dropping nulls from the training data, normalizing the data, adding a dropout layer, and ...
but none of these could solve the NaN problem. I tried changing the activation function in the classifier (last) layer from sigmoid to softmax. It worked!
Try changing the activation function of the last layer to softmax!
I was getting the same thing when I tried creating a bounding box regressor.
My neural net had a larger layer than yours. I increased the dropout value and got suitable results.
I was getting NaN for my classification network.
Answering here as it might help someone.
I had made a blunder:
The number of classes in the training labels was 5, i.e. from 0 to 4.
The last dense layer of the classifier had 4 nodes, which means 4 classes, which was the issue.
Changing the number of nodes in the last layer of the network to 5 solved the issue for me.
I had a similar issue, and I tried changing my activations from sigmoid to softmax and from ReLU to LeakyReLU, and the problem was resolved. So I guess as long as there is no NaN in the input to begin with, and you have tried lowering your learning rate, the viable solution is to play with your activations!
My situation:
Train Loss: nan, Train Accuracy: 0.0, Validation Loss: nan, Validation Accuracy: 0.0
Later I found out it was because my labels were 1, 2, 3, 4, not starting with 0.
So I relabeled them, using 0, 1, 2, 3 instead of 1, 2, 3, 4 as labels.
Problem solved!
Hope my answer helps!
In Keras, class labels begin from 0. If, for example, you have 7 classes, either label them from 0 through 6 and give the last dense layer (with the softmax activation function) units=7, or, if you must label your data from 1 through 7, set units=8 in the last dense layer.
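For instance, a minimal Keras sketch with 7 classes labelled 0 through 6 (n_features is a placeholder for your input dimension):
from keras.models import Sequential
from keras.layers import Dense

num_classes = 7  # integer labels must be 0..6

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(n_features,)))
model.add(Dense(num_classes, activation='softmax'))
# sparse_categorical_crossentropy works directly with integer labels
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')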
I was getting NaN values for binary classification; then I changed the loss function from categorical cross-entropy to binary cross-entropy and it worked fine.
BTW, it seems to be a dying gradient, NOT an exploding one.
A neuron dies when its input is negative for all training instances.
Here the 'adam' optimizer helped against NaNs.
But concerning your situation: be sure you have a scaled dataset and loss='mean_squared_error' (as opposed to yours):
model.compile(optimizer='adam', loss=keras.losses.mean_squared_error, metrics=[keras.metrics.mean_absolute_error])
This error may occur when your data have problems like:
Input contains NaN, infinity or a value too large for dtype('float64').
In these cases, you can solve it by removing the NaN values, like:
`df = df.dropna()`
or any other fillna() method.
Note: These methods work for pandas' data frames only.
Remove the rows containing NaN values, or replace the NaN values with some value from the dataset or dataframe. It will solve the error.
I got the same issue. You can successfully use Keras for regression. Converting all my data into rounded numbers solved my issue, e.g. 23.43 to 23.
I had the same problem. Examining the data, I realized that an error had occurred during data acquisition.

Neural Network Becomes Unruly with Large Layers

This is a higher-level question about the performance of a neural network. The issue I'm having is that with larger numbers of neurons per layer, the network has frequent rounds of complete stupidity. They are not consistent; it seems that the probability of general success vs failure is about 50/50 when layers get larger than 60 neurons (always 3 layers).
I tested this by teaching the same function to networks with input and hidden layers of sizes from 10-200. The success rate is either 0-1% or 90+%, but nothing in between. To help visualize this, I graphed it: failures is the total count of incorrect responses on 200 data sets after 5k training iterations.
I think it's also important to note that the numbers at which the network succeeds or fails change for each run of the experiment. The only possible culprit I've come up with is local minima (but don't let this influence your answers, I'm new to this, and initial attempts to minimize the chance of local minima seem to have no effect).
So, the ultimate question is, what could cause this behavior? Why is this thing so wildly inconsistent?
The Python code is on Github and the code that generated this graph is the testHugeNetwork method in test.py (line 172). If any specific parts of the network algorithm would be helpful I'm glad to post relevant snippets.
My guess is that your network is oscillating heavily across a jagged error surface. Trying a lower learning rate might help. But first of all, there are a few things you can do to better understand what your network is doing:
Plot the output error over training epochs. This will show you when in the training process things go wrong.
Have a graphical representation (an image) of your weight matrices and of your outputs. It makes it much easier to spot irregularities.
A major problem with ANN training is saturation of the sigmoid function. Towards the asymptotes of both the logistic function and tanh, the derivative is close to 0; numerically it probably even is zero. As a result, the network will only learn very slowly or not at all. This problem occurs when the input to the sigmoid is too big; here's what you can do about it:
Initialize your weights proportionally to the number of inputs a neuron receives. Standard literature suggests drawing them from a distribution with mean = 0 and standard deviation 1/sqrt(m), where m is the number of input connections (see the sketch after this list).
Scale your teachers so that they lie where the network can learn the most, that is, where the activation function is the steepest: the maximum of the first derivative. For tanh you can alternatively scale the function to f(x) = 1.7159 * tanh(2/3 * x) and keep the teachers at [-1, 1]. However, don't forget to adjust the derivative to f'(x) = 2/3 * 1.7159 * (1 - tanh^2(2/3 * x)).
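A small numpy sketch of both points (illustrative only; names are hypothetical):
import numpy as np

def init_weights(m, n, rng=np.random):
    # m inputs per neuron, n neurons: mean 0, standard deviation 1/sqrt(m)
    return rng.normal(loc=0.0, scale=1.0 / np.sqrt(m), size=(m, n))

def scaled_tanh(x):
    # f(x) = 1.7159 * tanh(2/3 * x), with teachers kept in [-1, 1]
    return 1.7159 * np.tanh(2.0 / 3.0 * x)

def scaled_tanh_deriv(x):
    # f'(x) = 2/3 * 1.7159 * (1 - tanh^2(2/3 * x))
    return 2.0 / 3.0 * 1.7159 * (1.0 - np.tanh(2.0 / 3.0 * x) ** 2)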
Let me know if you need additional clarification.
