Reinforcement Learning Agent Policy Won't Update - python

I am currently training a deep reinforcement learning model in a continuous environment using Ray.
The environment was written in OpenAI Gym by another person, whose research (done with baselines) I am trying to replicate in Ray so that it can be parallelized. In baselines the model converges about 5% of the time after thousands of episodes, though I have not been able to replicate even that.
My problem is that the agent stops changing its policy very quickly, and the mean reward then stays roughly the same for the rest of the run. This happens regardless of every parameter I've tried to vary. So far I've tried changing:
Learning Algorithm: I've used PPO, A3C, and DDPG
Learning Rate: I've done hyperparameter sweeps from 1e-6 up to 10
Episode Length: I've used both complete_episodes and truncate_episodes for batch_mode. I've varied rollout_fragment_length and made sure train_batch_size was always equal to rollout_fragment_length times the number of workers (see the config sketch after this list)
entropy_coeff in A3C
kl_target, kl_coeff, and sgd_minibatch_size in PPO
Network Architecture: I've tried both wide and deep networks. The default Ray network is 2x256 layers. I've tried changing it to 4x2048 and 10x64. The baselines network that succeeded was 8x64.
Changing activation function: ReLU returns NaNs but tanh lets the program run
Implementing an L2 loss to prevent connection weights from exploding. This kept the tanh activation from returning NaNs after a certain amount of time (the higher the learning rate, the sooner it reached the point of returning NaNs)
Normalization: I normalized the state returned to the agent to the range 0 to 1. Actions are not normalized because both already range from -6 to 6.
Scaling the reward: I've had rewards ranging from 1e-5 to 1e5. This had no effect on the agent's behavior.
Changing reward functions
Making the job easier (more lenient conditions)
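For context, here is a minimal sketch of the kind of RLlib config I mean (the environment name, worker count, and fragment length below are placeholders for illustration, not my actual values):

import ray
from ray import tune

ray.init()

num_workers = 8                 # placeholder
rollout_fragment_length = 200   # placeholder

config = {
    "env": "Pendulum-v0",  # placeholder continuous-action env, not my environment
    "num_workers": num_workers,
    "batch_mode": "truncate_episodes",  # or "complete_episodes"
    "rollout_fragment_length": rollout_fragment_length,
    "train_batch_size": num_workers * rollout_fragment_length,
    "lr": 1e-4,
    "kl_coeff": 0.2,
    "kl_target": 0.01,
    "sgd_minibatch_size": 128,
    "model": {"fcnet_hiddens": [64] * 8, "fcnet_activation": "tanh"},
}

tune.run("PPO", config=config, stop={"training_iteration": 100})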
Note: I am running this in a Docker image and have verified that this occurs on another machine, so it's not a hardware issue.
Here is a list of things I believe are red flags that could be interrelated:
I was originally getting NaNs (Ray returned NaN actions) when using ReLU; switching my activation function to tanh made them go away.
The KL divergence coefficient cur_kl_coeff in PPO goes to zero almost immediately when I include L2 loss. From what I understand, this means that the model is not making meaningful weight changes after the first few iterations.
However, when the loss is just policy loss, cur_kl_coeff varies normally.
The loss doesn't substantially change on average and never converges.
The reward doesn't change substantially either: reward_mean always converges to roughly the same value. This occurs even when reward_mean starts above average, or initially rises above average, due to random weight initialization.
Note: This is not an issue of the agent not finding a better path or a poor reward function not valuing good actions.
Zooming out shows why mean reward peaked previously. The agent performed phenomenally but was never able to repeat anywhere near its success.
And, I cannot stress this enough, this is not an issue of me not letting the agent run for long enough. I have observed this same behavior across dozens of runs and cumulatively tens of thousands of iterations. I understand that it can take models a long time to converge, but the models consistently make zero progress.
tl;dr: My model doesn't change its policy in any meaningful way despite getting very high rewards on certain runs.

Related

Is it safe to ignore the "ConvergenceWarning" in sklearn?

I'm using LassoCV in sklearn. There are always a lot of ConvergenceWarnings. I checked the results and they look good, so I was wondering if it is safe to simply ignore those warnings. A few thoughts I have are:
Does the warning happen because the magnitude of my response is too big? That would make the loss bigger, too.
In most cases, would it be solved by increasing the number of iterations? I'm hesitant to do that because it sometimes takes longer to run without improving the results.
tl;dr: It is almost always fine; to be sure, watch the learning curve.
So, LassoCV implements Lasso regression, whose parameters are optimized via a form of gradient descent (coordinate descent, to be precise, which is an even simpler method), and like all gradient methods this approach requires defining:
step size
stop criterion
The most popular stopping criteria are probably:
a) a fixed number of steps (a good choice time-wise, since 1000 steps take exactly 1000x the time of one step, so it is easy to budget the time spent on training);
b) a fixed delta (difference) between the values of the loss function at step n and step n-1 (possibly better classification/regression quality).
The warning you observe is raised because LassoCV uses the first criterion (a fixed number of steps) but also checks the second (the delta): once the fixed number of steps is reached the algorithm stops, and the default delta is too small to have been reached on most real datasets.
To be sure that you train your model long enough, you can plot a learning curve: the loss value after every 10/20/50 steps of training. Once it reaches a plateau, you are good to stop.
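To make both knobs concrete, here is a minimal scikit-learn sketch; the synthetic data and the particular max_iter/tol values are only placeholders for illustration:

from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# stand-in data, only for illustration
X, y = make_regression(n_samples=200, n_features=50, noise=10.0, random_state=0)

# max_iter is the fixed step budget, tol is the delta criterion;
# raising max_iter and/or loosening tol usually silences the warning
model = LassoCV(max_iter=10000, tol=1e-3, cv=5)
model.fit(X, y)
print(model.alpha_, model.n_iter_)  # chosen regularization strength, iterations used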

How do I better process my data and set parameters for my Neural Network?

When I run my NN, the only way to get any training to occur is to divide X by 1000. The network also needs to be trained fewer than 70000 times with a 0.03 learning rate; if those values are larger, the NN gets worse. I think this is due to poor processing of the data and maybe the lack of biases, but I don't really know.
Code on Google Colab
In short: all of the problems you mentioned and more.
Scaling is essential, typically to zero mean and unit variance (see the sketch at the end of this answer). Otherwise you will quickly saturate the hidden units, their gradients will be near zero, and (almost) no learning will be possible.
A bias is mandatory for such an ANN. It acts like an offset when fitting a linear function; if you drop it, getting a good fit will be very difficult.
You seem to be checking accuracy on your training data.
You have very few training samples.
Sigmoid has proven to be a poor choice. Use ReLU, and check e.g. here for an explanation.
Also, I'd recommend spending some time learning Python before going further into this. For starters, avoid using global; it can lead to unforeseen behaviour if you're not careful.
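To illustrate the scaling point above, here is a minimal sketch using sklearn's StandardScaler; the random matrix X is just a stand-in for your data:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 3) * 1000.0  # stand-in for unscaled inputs
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # per-feature zero mean, unit variance
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))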

Why is my Deep Q Net and Double Deep Q Net unstable?

I am trying to implement DQN and DDQN (both with experience replay) to solve the OpenAI Gym CartPole environment. Both approaches are able to learn and solve this problem sometimes, but not always.
My network is simply a feed-forward network (I've tried 1 and 2 hidden layers). I created one network for DQN and two networks for DDQN: a target network to evaluate the Q value and a primary network to choose the best action. I train the primary network and copy it to the target network after some episodes.
The problem in DQN is:
Sometimes it can achieve the perfect score of 200 within 100 episodes, but sometimes it gets stuck and only achieves a score of 10 no matter how long it is trained.
Also, in the cases where learning succeeds, the learning speed differs.
The problem in DDQN is:
It can learn to achieve a score of 200, but then it seems to forget what it has learned and the score drops dramatically.
I've tried tuning batch size, learning rate, number of neurons in the hidden layer, the number of hidden layers, exploration rate, but instability persists.
Are there any rules of thumb for network size and batch size? I think a reasonably larger network and a larger batch size would increase stability.
Is it possible to make the learning stable? Any comments or references are appreciated!
These kinds of problems happen pretty often and you shouldn't give up. First, of course, you should check once or twice more that the code is all right - try comparing your code to other implementations, see how the loss function behaves, etc. If you are pretty sure your code is fine - and since, as you say, the model can learn the task from time to time, it probably is - you should start experimenting with the hyper-parameters.
Your problems seem to be connected to hyper-parameters like the exploration technique, the learning rate, the way you update the target networks, and the experience replay memory. I would not play around with the hidden layer sizes - find values for which the model learned once and keep them fixed.
Exploration technique: I assume you use an epsilon-greedy strategy. My advice would be to start with a high epsilon value (I usually start with 1.0) and decay it after each step or episode, but define an epsilon_min too. Starting with a low epsilon value may be the cause of the different learning speeds and success rates: if you go fully random, you always populate your memory with a similar kind of transitions at the beginning, whereas with low epsilon values at the start there is a bigger chance that your model does not explore enough before the exploitation phase begins.
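Purely as an illustration (the q_network callable, action count, and decay values below are assumptions, not taken from your code), an epsilon-greedy schedule could look like this:

import random
import numpy as np

epsilon, epsilon_min, decay_rate = 1.0, 0.01, 0.995  # assumed values

def select_action(q_network, state, n_actions, epsilon):
    # explore with probability epsilon, otherwise exploit the current Q estimates
    if random.random() < epsilon:
        return random.randrange(n_actions)
    q_values = q_network(np.expand_dims(state, axis=0))
    return int(np.argmax(q_values))

# after each step or episode, decay epsilon but never below epsilon_min
epsilon = max(epsilon_min, epsilon * decay_rate)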
Learning rate: Make sure it is not too big. A smaller rate may lower the learning speed, but it helps a learned model not to escape from the global minimum back to some local, worse one. Also, adaptive learning rates such as those computed by Adam might help you. Of course the batch size has an impact as well, but I would keep it fixed and worry about it only if the other hyper-parameter changes don't work.
Target network update (rate and value): This is an important one as well. You have to experiment a bit - not only with how often you perform the update, but also with how much of the primary values you copy into the target ones. People often do a hard update every episode or so, but try soft updates instead if the first technique does not work.
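For example, assuming primary_network and target_network are two tf.keras models with identical architecture (an assumption for illustration only), a hard versus soft update could be sketched as:

# hard update: copy the primary network into the target network every N episodes
target_network.set_weights(primary_network.get_weights())

# soft update: blend a small fraction tau of the primary weights in at every step
tau = 0.005  # assumed blending factor
blended = [tau * w_p + (1.0 - tau) * w_t
           for w_p, w_t in zip(primary_network.get_weights(),
                               target_network.get_weights())]
target_network.set_weights(blended)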
Experience replay: Do you use it? You should. How big is your memory? This is a very important factor: the memory size can influence stability and the success rate (A Deeper Look at Experience Replay). Basically, if you notice instability in your algorithm, try a bigger memory size, and if that affects your learning curve a lot, try the technique proposed in the mentioned paper.
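A minimal replay memory sketch, with the capacity as the knob discussed above (the class name and default size are arbitrary):

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100000):
        # older transitions are dropped automatically once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)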
Maybe this can help you with your problem on this environment.
Cartpole problem with DQN algorithm from Udacity
I also thought the problem was the unstable (D)DQN, or that "CartPole" was bugged or "not stably solvable"!
After searching for a few weeks, I have checked my code several times, changed every setting but one...
The discount factor: setting it to 1.0 (really) stabilized my training on CartPole-v1 (500 max steps) much more.
CartPole-v1 was stable in training with a simple Q-Learner (reduce min-alpha and min-epsilon to 0.001): https://github.com/sanjitjain2/q-learning-for-cartpole/blob/master/qlearning.py
The creator set gamma to 1.0 (I read about it on reddit), so I tested it with a simple DQN (double_q = False) from here: https://github.com/adventuresinML/adventures-in-ml-code/blob/master/double_q_tensorflow2.py
I also removed 1 line: # reward = np.random.normal(1.0, RANDOM_REWARD_STD)
This way it gets the normal +1 reward per step and was "stable" in 7 out of 10 runs.
I spent a whole day solving this problem. The return climbs to above 400, and suddenly falls to 9.x.
In my case I think it's due to unstable gradients: the L2 norm of the gradients varies from 1 or 2 up to several thousand.
I finally solved it. See whether this helps.
Clip the gradients before applying them, and use a learning rate decay schedule:
import math
import tensorflow as tf

# inside the training step, with `loss` computed under a tf.GradientTape `tape`
variables = model.trainable_variables
grads = tape.gradient(loss, variables)
grads, grads_norm = tf.clip_by_global_norm(grads, 30.0)  # bound the gradient norm
learning_rate = 0.1 / (math.sqrt(total_steps) + 1)       # decaying step size
for g, var in zip(grads, variables):
    var.assign_sub(g * learning_rate)                    # manual SGD update
Use an exploration rate decay schedule:
epsilon = 0.85 ** math.log(total_steps + 1, 2)

GAN Generator output layer is [way] out of target range - potentially due to loss func

The TL;DR is that my G(z) (where z is normally distributed noise) is way out of the expected range. Picture - see the comparison of real against generated data.
(It's also a terrible model, being just a few dense layers - but I can sort that later!)
I've thrown s**t at the wall for the past week to see what sticks and now it's time to ask for an informed opinion! :)
Has anyone else encountered this problem before? Suggested fixes?
What is your generic loss function for your regressive Generator (with a categorical discriminator)?
As it's a regression problem (with target data normalised between 0 and 1), my G output has a linear activation. However, I'm trying to follow Soumith's GAN Hacks to get it working, and we conflict over the activation - he says it should be tanh. One could assume that, given my expected target range, I could choose sigmoid, but as I say, I infer it should be linear since this is a regression problem. Opinions?
Plus there isn't much clarity on the loss function. He says:
In GAN papers, the loss function to optimize G is min (log 1-D), but in practice folks practically use max log D
Does this mean one should aim to minimise the error log(1-D), or pick the minimum of log(1-D)? Similarly with max log(D). (I'm using class definitions of true/false, so a maximum and minimum can be chosen from the two classes.)
In the Goodfellow et al. 2014 paper on GANs, I can see how part of the Discriminator's loss function resolves to 0 when applied to the Generator. However, I'm obviously missing something, as the equation says I should take log(D(x)), which will sooner or later hit a NaN.
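To pin down my reading of that quoted line, here is how I currently understand the two generator objectives, written as a TensorFlow sketch (d_of_gz stands for the discriminator's probability output D(G(z)) and is just a placeholder; the epsilon only avoids log(0)). Is this reading correct?

import tensorflow as tf

eps = 1e-8  # avoid log(0)

def generator_loss_saturating(d_of_gz):
    # "min log(1 - D(G(z)))": minimize this quantity directly
    return tf.reduce_mean(tf.math.log(1.0 - d_of_gz + eps))

def generator_loss_non_saturating(d_of_gz):
    # "max log D(G(z))": maximize log D by minimizing its negative
    return -tf.reduce_mean(tf.math.log(d_of_gz + eps))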
Clarification about my confused state is thanked many times in advance.
Update:
This tutorial on O'Reilly says
Take a look at the last line of our discriminator: there's no softmax or sigmoid layer at the end. GANs can fail if their discriminators "saturate," or become confident enough to return exactly 0 when they're given a generated image; that leaves the discriminator without a useful gradient to descend.
Which sounds a little like my problem, especially as I was using Softmax. Using no activation doesn't solve the problem, but does make my GAN compete in interesting new ways!

Neural Network Becomes Unruly with Large Layers

This is a higher-level question about the performance of a neural network. The issue I'm having is that with larger numbers of neurons per layer, the network has frequent rounds of complete stupidity. They are not consistent; it seems that the probability of general success vs failure is about 50/50 when layers get larger than 60 neurons (always 3 layers).
I tested this by teaching the same function to networks with input and hidden layers of sizes from 10 to 200. The success rate is either 0-1% or 90+%, but nothing in between. To help visualize this, I graphed it: failures are the total count of incorrect responses on 200 data sets after 5k training iterations.
I think it's also important to note that the numbers at which the network succeeds or fails change for each run of the experiment. The only possible culprit I've come up with is local minima (but don't let this influence your answers, I'm new to this, and initial attempts to minimize the chance of local minima seem to have no effect).
So, the ultimate question is, what could cause this behavior? Why is this thing so wildly inconsistent?
The Python code is on Github and the code that generated this graph is the testHugeNetwork method in test.py (line 172). If any specific parts of the network algorithm would be helpful I'm glad to post relevant snippets.
My guess is that your network is oscillating heavily across a jagged error surface. Trying a lower learning rate might help. But first of all, there are a few things you can do to better understand what your network is doing:
plot the output error over training epochs. This will show you when in the training process things go wrong.
have a graphical representation (an image) of your weight matrices and of your outputs. Makes it much easier to spot irregularities.
A major problem with ANN training is saturation of the sigmoid function. Towards the asymptotes of both the logistic function and tanh, the derivative is close to 0, and numerically it probably even is zero. As a result, the network will learn only very slowly or not at all. This problem occurs when the inputs to the sigmoid are too large; here's what you can do about it:
initialize your weights in proportion to the number of inputs a neuron receives. Standard literature suggests drawing them from a distribution with mean 0 and standard deviation 1/sqrt(m), where m is the number of input connections.
scale your teacher (target) values so that they lie where the network can learn the most, that is, where the activation function is steepest: the maximum of the first derivative. For tanh you can alternatively scale the function to f(x) = 1.7159 * tanh(2/3 * x) and keep the targets in [-1, 1]. However, don't forget to adjust the derivative to f'(x) = 2/3 * 1.7159 * (1 - tanh^2(2/3 * x)).
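A minimal numpy sketch of both suggestions (the layer sizes are arbitrary placeholders):

import numpy as np

m, n = 60, 60  # inputs per neuron and number of neurons (placeholders)

# draw the initial weights from a distribution with mean 0 and std 1/sqrt(m)
W = np.random.normal(loc=0.0, scale=1.0 / np.sqrt(m), size=(n, m))

def scaled_tanh(x):
    # scaled tanh; steepest around the [-1, 1] target range
    return 1.7159 * np.tanh(2.0 / 3.0 * x)

def scaled_tanh_prime(x):
    # matching derivative
    return 2.0 / 3.0 * 1.7159 * (1.0 - np.tanh(2.0 / 3.0 * x) ** 2)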
Let me know if you need additional clarification.
