Why are my Deep Q Net and Double Deep Q Net unstable? - python

I am trying to implement DQN and DDQN (both with experience replay) to solve the OpenAI Gym CartPole environment. Both approaches are able to learn and solve this problem sometimes, but not always.
My network is a simple feed-forward network (I've tried 1 and 2 hidden layers). In DQN I create a single network; in DDQN I create two: a target network to evaluate the Q value and a primary network to choose the best action. I train the primary network and copy its weights to the target network every few episodes.
The problem in DQN is:
Sometimes it achieves the perfect score of 200 within 100 episodes, but sometimes it gets stuck and only reaches a score of around 10 no matter how long it is trained.
Also, when learning does succeed, the learning speed differs between runs.
The problem in DDQN is:
It can learn to achieve 200 score, but then it seems to forget what's learned and the score drops dramatically.
I've tried tuning batch size, learning rate, number of neurons in the hidden layer, the number of hidden layers, exploration rate, but instability persists.
Are there any rules of thumb for the network size and batch size? I think a reasonably larger network and a larger batch size would increase stability.
Is it possible to make the learning stable? Any comments or references are appreciated!

These kinds of problems happen pretty often and you shouldn't give up. First, of course, you should do another check or two that the code is all right - try to compare your code to other implementations, see how the loss function behaves, etc. If you are pretty sure your code is fine - and, since you say the model can learn the task from time to time, it probably is - you should start experimenting with the hyper-parameters.
Your problems seem to be connected to hyper-parameters: the exploration technique, the learning rate, the way you update the target network, and the experience replay memory. I would not play around with the hidden layer sizes - find the values for which the model learned once and keep them fixed.
Exploration technique: I assume you use an epsilon-greedy strategy. My advice would be to start with a high epsilon value (I usually start with 1.0) and decay it after each step or episode, but define an epsilon_min too. Starting with a low epsilon value may explain the different learning speeds and success rates - if you go fully random at the beginning, every run populates the memory with a similar mix of transitions, so runs behave more consistently. With a low epsilon at the start, there is a bigger chance that your model does not explore enough before the exploitation phase begins.
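As a rough sketch (epsilon_min, epsilon_decay and num_episodes are illustrative placeholders, not names from your code), an epsilon schedule could look like this:
epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995   # start fully random
for episode in range(num_episodes):                      # num_episodes is a placeholder
    # ... act epsilon-greedily, store transitions, train ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)  # decay once per episode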
Learning rate: Make sure it is not too big. A smaller rate may lower the learning speed, but it helps a learned model not to escape from a good minimum back to a worse local one. Adaptive learning rates, such as those computed by Adam, might also help. Of course the batch size has an impact as well, but I would keep it fixed and worry about it only if the other hyper-parameter changes don't work.
Target network update (rate and value): This is an important one as well. You have to experiment a bit - not only with how often you perform the update, but also with how much of the primary network's weights you copy into the target network. People often do a hard update every episode or so, but try soft updates instead if the first technique does not work.
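As a sketch of a soft update, assuming primary_net and target_net are two Keras models with identical architecture (the names and the tau value are placeholders):
tau = 0.01   # fraction of the primary weights blended into the target each update
target_net.set_weights([
    tau * w_primary + (1.0 - tau) * w_target
    for w_primary, w_target in zip(primary_net.get_weights(), target_net.get_weights())
])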
Experience replay: Do you use it? You should. How big is your memory? This is a very important factor and the memory size can influence stability and the success rate (A Deeper Look at Experience Replay). Basically, if you notice instability in your algorithm, try a bigger memory size, and if it affects your learning curve a lot, try the technique proposed in the mentioned paper.
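If you need a starting point, a minimal uniform replay buffer is just a bounded deque (a sketch, not tied to your implementation):
import random
from collections import deque

memory = deque(maxlen=100_000)                  # the memory size itself is a key hyper-parameter

def remember(state, action, reward, next_state, done):
    memory.append((state, action, reward, next_state, done))

def sample_batch(batch_size):
    return random.sample(memory, batch_size)    # uniform random minibatch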

Maybe this can help you with your problem on this environment.
Cartpole problem with DQN algorithm from Udacity

I also thought that the problem was the instability of (D)DQN, or that "CartPole" was bugged or not stably solvable!
After searching for a few weeks, I have checked my code several times, changed every setting but one...
The discount factor: setting it to 1.0 (really) stabilized my training much more on CartPole-v1 at 500 max steps.
CartPole-v1 was stable in training with a simple Q-Learner (reduce min-alpha and min-epsilon to 0.001): https://github.com/sanjitjain2/q-learning-for-cartpole/blob/master/qlearning.py
The creator has Gamma at 1.0 (I read about it on reddit), so I tested it with a simple DQN (double_q = False) from here: https://github.com/adventuresinML/adventures-in-ml-code/blob/master/double_q_tensorflow2.py
I also removed 1 line: # reward = np.random.normal(1.0, RANDOM_REWARD_STD)
This way it gets the normal +1 reward per step, and training was "stable" in 7 out of 10 runs.

I spent a whole day solving this problem. The return climbs to above 400 and then suddenly falls to 9.x.
In my case I think it was due to unstable gradients: the L2 norm of the gradients varied from 1 or 2 to several thousand.
I finally solved it. See whether this helps.
Clip the gradients before applying them, and use a learning rate decay schedule:
import math
import tensorflow as tf

variables = model.trainable_variables
grads = tape.gradient(loss, variables)                   # gradients recorded by tf.GradientTape
grads, grads_norm = tf.clip_by_global_norm(grads, 30.0)  # clip the global gradient norm to 30
learning_rate = 0.1 / (math.sqrt(total_steps) + 1)       # decay the learning rate with the step count
for g, var in zip(grads, variables):
    var.assign_sub(g * learning_rate)                    # manual SGD update
Use an exploration rate decay schedule:
epsilon = 0.85 ** math.log(total_steps + 1, 2)           # decays towards 0 as total_steps grows

Related

Why is Normalization causing my network to have exploding gradients in training?

I've built a network (in PyTorch) that performs well for image restoration purposes. I'm using an autoencoder with a ResNet50 encoder backbone; however, I am only using a batch size of 1. I'm experimenting with some frequency domain stuff that only allows me to process one image at a time.
I have found that my network performs reasonably well, however, it only behaves well if I remove all batch normalization from the network. Now of course batch norm is useless for a batch size of 1 so I switched over to group norm, designed for this purpose. However, even with group norm, my gradient explodes. The training can go very well for 20 - 100 epochs and then game over. Sometimes it recovers and explodes again.
I should also say that in training, every new image fed in is given a wildly different amount of noise to train for random noise amounts. This has been done before but perhaps coupled with a batch size of 1 it could be problematic.
I'm scratching my head at this one and I'm wondering if anyone has suggestions. I've dialed in my learning rate and clipped the max gradients but this isn't really solving the actual issue. I can post some code but I'm not sure where to start and hoping someone could give me a theory. Any ideas? Thanks!
To answer my own question, my network was unstable in training because a batch size of 1 makes the data too different from batch to batch - or, as the papers like to put it, the internal covariate shift was too high.
Not only were my images drawn from a very large, varied dataset, but they were also rotated and flipped randomly. On top of this, a random Gaussian noise level between 0 and 30 was chosen for each image, so one image may have little to no noise while the next may be barely distinguishable.
In the above question I mentioned group norm - my network is complex and some of the code is adapted from other work. There were still batch norm functions hidden in my code that I missed. I removed them. I'm still not sure why BN made things worse.
Following this I reimplemented group norm with groups of size=32 and things are training much more nicely now.
In short, removing the extra BN and adding group norm helped.
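For reference, a minimal sketch of the swap in PyTorch (the channel and group counts here are placeholders, not the actual settings from this network):
import torch.nn as nn

num_channels = 64
# BatchNorm2d has nothing meaningful to average over at batch size 1;
# GroupNorm normalizes over groups of channels within each sample instead.
norm = nn.GroupNorm(num_groups=32, num_channels=num_channels)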

Reinforcement Learning Agent Policy Won't Update

I am currently training a deep reinforcement learning model in a continuous environment using Ray.
The environment I am using was coded up in OpenAI Gym using baselines by another person whose research I am trying to replicate in Ray so that it may be parallelized. The model converges about 5% of the time in baselines after thousands of episodes, though I have not been able to replicate this.
My problem is that the AI stops changing its policy very quickly and the mean reward then stays roughly the same for the rest of the run. This is independent of every parameter I've tried to vary. So far I've tried changing:
Learning Algorithm: I've used PPO, A3C, and DDPG
Learning Rate: I've done hyperparameter sweeps from 1e-6 up to 10
Episode Length: I've used complete_episodes as well as truncated_episodes in batch_mode. I've varied rollout_fragment_length and made sure train_batch_size was always equal to rollout_fragment_length times the number of workers (see the config sketch after this list)
entropy_coeff in A3C
kl_target, kl_coeff, and sgd_minibatch_size in PPO
Network Architecture: I've tried both wide and deep networks. The default Ray network is 2x256 layers. I've tried changing it to 4x2048 and 10x64. The baselines network that succeeded was 8x64.
Changing activation function: ReLU returns NaNs but tanh lets the program run
Implementing L2 loss to prevent connection weights from exploding. This kept tanh activation from returning NaNs after a certain amount of time (the higher the learning rate the faster it got to the point it returned NaNs)
Normalization: I normalized the state that is returned to the agent from 0 to 1. Actions are not normalized because they both go from -6 to 6.
Scaling the reward: I've had rewards in the range of 1e-5 to 1e5. This had no effect on the agent's behavior.
Changing reward functions
Making the job easier (more lenient conditions)
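For reference, here is a sketch of the kind of config I have been varying, using the classic RLlib config-dict API (all values are illustrative placeholders; newer Ray versions use AlgorithmConfig instead):
from ray import tune

config = {
    "env": "MyCustomEnv-v0",               # placeholder for the custom continuous environment
    "num_workers": 4,
    "batch_mode": "complete_episodes",     # or "truncate_episodes"
    "rollout_fragment_length": 200,
    "train_batch_size": 200 * 4,           # rollout_fragment_length * num_workers
    "lr": 1e-4,
    "model": {
        "fcnet_hiddens": [64] * 8,         # e.g. the 8x64 baselines architecture
        "fcnet_activation": "tanh",        # ReLU was returning NaNs
    },
}
tune.run("PPO", config=config)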
Note: I am running this in a docker image and have verified that this occurs on another machine so it's not a hardware issue.
Here is a list of things I believe are red flags that could be interrelated:
I was originally getting NaNs (Ray returned NaN actions) when using ReLU, but this stopped after switching my activation function to tanh.
The KL divergence coefficient cur_kl_coeff in PPO goes to zero almost immediately when I include L2 loss. From what I understand, this means that the model is not making meaningful weight changes after the first few iterations.
However, when the loss is just policy loss, cur_kl_coeff varies normally.
Loss doesn't change substantially on average and never converges.
Reward doesn't change substantially: reward_mean always converges to roughly the same value. This occurs even when reward_mean starts above average, or initially increases to above average due to random weight initialization.
Note: This is not an issue of the agent not finding a better path or a poor reward function not valuing good actions.
Zooming out shows why mean reward peaked previously. The agent performed phenomenally but was never able to repeat anywhere near its success.
And, I cannot stress this enough, this is not an issue of me not letting the agent run for long enough. I have observed this same behavior across dozens of runs and cumulatively tens of thousands of iterations. I understand that it can take models a long time to converge, but the models consistently make zero progress.
tldr; My model doesn't change its policy in any meaningful way despite getting very high rewards on certain runs.

Loss and learning rate scaling strategies for Tensorflow distributed training when using TF Estimator

For those who don't want to read the whole story:
TL; DR: When using TF Estimator, do we have to scale learning rate by the factor by which we increase batch size (I know this is the right way, I am not sure if TF handles this internally)? Similarly, do we have to scale per example loss by global batch size (batch_size_per_replica * number of replicas)?
Documentation on Tensorflow distributed learning is confusing. I need clarification on below points.
It is now understood that if you increase the batch size by a factor of k, then you need to increase the learning rate by k (see this and this paper). However, the official Tensorflow page on distributed learning makes no clarifying comment about this. They do mention here that the learning rate needs to be adjusted. Do they handle the learning rate scaling themselves? To make matters more complicated, the behavior is different in Keras and tf.Estimator (see the next point). Any suggestions on whether I should increase the LR by a factor of k or not when I am using tf.Estimator?
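For reference, the linear scaling rule from those papers boils down to multiplying the base learning rate by the same factor k; a tiny sketch (names and values are placeholders):
base_lr = 0.001
k = 8                       # e.g. 8 replicas, so the global batch is 8x larger
scaled_lr = base_lr * k     # linear scaling rule: lr grows with the batch size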
It is widely accepted that the per-example loss should be scaled by global_batch_size = batch_size_per_replica * number of replicas. Tensorflow mentions it here, but then when illustrating how to achieve this with a tf.Estimator, they either forget it or the scaling by global_batch_size is not required. See here; in the code snippet, loss is defined as follows.
loss = tf.reduce_sum(loss) * (1. / BATCH_SIZE)
and BATCH_SIZE to the best of my understanding is defined above as per replica batch size.
To complicate things further, the scaling is handled automatically if you are using Keras (for reasons I will never understand, it would have been better to keep everything consistent).
The learning rate is not automatically scaled by the global batch size. As you said, they even suggest that you might need to adjust the learning rate, but then again only in some cases, so that's not the default. I suggest that you do increase the learning rate manually.
If we take a look at a simple tf.Estimator, tf.estimator.DNNClassifier (link), the default loss_reduction is losses_utils.ReductionV2.SUM_OVER_BATCH_SIZE. If we go to that Reduction (found here), we see that it's a policy for how to combine losses from individual samples. On a single machine we would just use tf.reduce_mean, but you can't use that in a distributed setting (as mentioned in the next link). The Reduction leads us here, which shows you 1) how you would implement the scaling by the global batch size yourself and 2) why. As they tell you to implement this yourself, this implies it's not handled by tf.Estimator. Also note that on the Reduction page you can find some explanations of the differences between Keras and Estimators regarding these parameters.
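For completeness, a minimal sketch of the recommended loss scaling under tf.distribute, assuming per_example_loss is an unreduced per-example loss tensor and strategy is your tf.distribute strategy (BATCH_SIZE_PER_REPLICA is a placeholder):
import tensorflow as tf

GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

def compute_loss(per_example_loss):
    # Sum over the per-replica batch and divide by the *global* batch size,
    # so that adding up the per-replica losses yields the true mean loss.
    return tf.nn.compute_average_loss(per_example_loss,
                                      global_batch_size=GLOBAL_BATCH_SIZE)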

How do I better process my data and set parameters for my Neural Network?

When I run my NN, the only way to get any training to occur is if I divide X by 1000. The network also needs to be trained for fewer than 70000 iterations with a 0.03 learning rate; if those values are larger, the NN gets worse. I think this is due to bad processing of the data and maybe the lack of biases, but I don't really know.
Code on Google Colab
In short: all of the problems you mentioned and more.
Scaling is essential, typically to zero mean and unit variance (see the sketch after this list). Otherwise you will quickly saturate the hidden units, their gradients will be near zero and (almost) no learning will be possible.
Bias is mandatory for such an ANN. It's like the offset when fitting a linear function; if you drop it, getting a good fit will be very difficult.
You seem to be checking accuracy on your training data.
You have very few training samples.
Sigmoid has proven to be a poor choice. Use ReLU and check e.g. here for an explanation.
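As mentioned for the scaling point above, a minimal NumPy sketch (X is assumed to be your feature matrix):
import numpy as np

mean = X.mean(axis=0)
std = X.std(axis=0) + 1e-8         # avoid division by zero for constant features
X_scaled = (X - mean) / std        # zero mean, unit variance per feature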
Also, I'd recommend spending some time on learning Python before going into this. For starters, avoid using global; it can cause unforeseen behaviour if you're not careful.

Neural Network Becomes Unruly with Large Layers

This is a higher-level question about the performance of a neural network. The issue I'm having is that with larger numbers of neurons per layer, the network has frequent rounds of complete stupidity. They are not consistent; it seems that the probability of general success vs failure is about 50/50 when layers get larger than 60 neurons (always 3 layers).
I tested this by teaching the same function to networks with input and hidden layers of sizes from 10 to 200. The success rate is either 0-1% or 90+%, but nothing in between. To help visualize this, I graphed it: failures is the total count of incorrect responses on 200 data sets after 5k training iterations.
I think it's also important to note that the numbers at which the network succeeds or fails change for each run of the experiment. The only possible culprit I've come up with is local minima (but don't let this influence your answers, I'm new to this, and initial attempts to minimize the chance of local minima seem to have no effect).
So, the ultimate question is, what could cause this behavior? Why is this thing so wildly inconsistent?
The Python code is on Github and the code that generated this graph is the testHugeNetwork method in test.py (line 172). If any specific parts of the network algorithm would be helpful I'm glad to post relevant snippets.
My guess is that your network is oscillating heavily across a jagged error surface. Trying a lower learning rate might help. But first of all, there are a few things you can do to better understand what your network is doing:
plot the output error over training epochs. This will show you when in the training process things go wrong.
have a graphical representation (an image) of your weight matrices and of your outputs. Makes it much easier to spot irregularities.
A major problem with ANN training is saturation of the sigmoid function. Towards the asymptotes of both the logistic function and tanh, the derivative is close to 0, and numerically it probably even is zero. As a result, the network will only learn very slowly or not at all. This problem occurs when the inputs to the sigmoid are too big. Here's what you can do about it:
initialize your weights based on the number of inputs a neuron receives. The standard literature suggests drawing them from a distribution with mean = 0 and standard deviation 1/sqrt(m), where m is the number of input connections.
scale your teacher signals so that they lie where the network can learn the most, that is, where the activation function is steepest: at the maximum of its first derivative. For tanh you can alternatively scale the function to f(x) = 1.7159 * tanh(2/3 * x) and keep the teachers in [-1, 1]. However, don't forget to adjust the derivative to f'(x) = 2/3 * 1.7159 * (1 - tanh^2(2/3 * x))
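A small NumPy sketch of the suggested initialization (the layer sizes m and n are placeholders):
import numpy as np

m, n = 64, 32    # m inputs per neuron, n neurons in the layer
# mean 0, standard deviation 1/sqrt(m), as suggested by the standard literature
W = np.random.normal(loc=0.0, scale=1.0 / np.sqrt(m), size=(m, n))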
Let me know if you need additional clarification.
