I'm implementing PPO2 reinforcement learning on tasks I built myself, and I keep running into situations where the agent seems to be nearly mature and then suddenly, catastrophically loses its performance and never regains a stable level. I don't know the right word for it.
I'm just wondering what could cause such a catastrophic drop in performance. Any hints or tips?
Many thanks
[Learning curve screenshots: learningprocess1, learningprocess2]
I would guess that your reward function is not capped and can produce extremely large negative rewards in some edge cases.
Two things to prevent this are:
Limit the values your reward function can return (see the sketch below)
Make sure that you can handle situations where your learning environment is unstable, e.g. the process crashed, froze, or hit a bug. For example, if you give your agent a negative reward when it falls (a robot trying to walk) and the environment fails to detect the fall because of some rare bug, then your reward function keeps handing out negative rewards until the episode stops.
Most of the time this is not that big of a deal, but if you are unlucky your environment could even produce NaN values, and those would corrupt your network.
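One low-tech way to cover both points is to sanitize the reward before it reaches the learner. This is only a sketch, assuming a Gym-style training loop; the bounds and names here are placeholders, not anything PPO2 itself provides:

import numpy as np

REWARD_MIN, REWARD_MAX = -10.0, 10.0   # assumed bounds; tune for your task

def sanitize_reward(raw_reward):
    # Guard against NaN/inf coming out of a broken environment step.
    if not np.isfinite(raw_reward):
        return REWARD_MIN
    # Clip extreme values so a rare edge case cannot dominate an update.
    return float(np.clip(raw_reward, REWARD_MIN, REWARD_MAX))

# Usage inside the training loop: reward = sanitize_reward(env_reward)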
Related
How can you prevent the agent from endlessly repeating the same action cycle?
Of course, this somehow comes down to changes in the reward system. But are there general rules you could follow, or things you could include in your code, to prevent such a problem?
To be more precise, my actual problem is this one:
I'm trying to teach an ANN to learn Doodle Jump using Q-learning. After only a few generations the agent keeps jumping on one and the same platform/stone over and over again, non-stop. Increasing the length of the random-exploration phase doesn't help.
My reward system is the following:
+1 when the agent is living
+2 when the agent jumps on a platform
-1000 when it dies
An idea would be to give a negative reward, or at least 0, when the agent hits the same platform it hit before. But to do so, I'd have to pass a lot of new input parameters to the ANN: the x, y coordinates of the agent and the x, y coordinates of the last visited platform.
Furthermore, the ANN would then also have to learn that a platform is 4 blocks thick, and so on.
Therefore, I'm sure the idea I just mentioned wouldn't solve the problem; on the contrary, I believe the ANN would simply stop learning well in general, because there would be too many uninformative and hard-to-interpret inputs.
This is not a direct answer to the very general question asked.
I found a workaround for my particular Doodle Jump example; perhaps someone doing something similar will find it useful:
While training: let every platform the agent jumped on disappear afterwards, and spawn a new one somewhere else (see the sketch below).
While testing/presenting: you can disable the new "disappear" feature (so the game behaves as it did before) and the player will play well and won't hop on one and the same platform all the time.
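Roughly, the training-time tweak looks like this; every name here (env, platforms, make_platform, and so on) is a hypothetical stand-in for whatever your own game code exposes:

import random

def on_platform_landed(env, platform, training=True):
    # Training-time tweak: the platform the agent just landed on disappears
    # and a fresh one is spawned elsewhere, so bouncing in place pays nothing.
    if training:
        env.platforms.remove(platform)
        env.platforms.append(spawn_random_platform(env))

def spawn_random_platform(env):
    # Hypothetical helper: place a new platform at a random reachable spot.
    x = random.uniform(0, env.width)
    y = random.uniform(env.agent_y, env.agent_y + env.max_jump_height)
    return env.make_platform(x, y)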
For my first project in reinforcement learning I'm trying to train an agent to play a real-time game. This means that the environment constantly moves and changes, so the agent needs to be precise about its timing. In order to get a correct sequence, I figured the agent will have to work at a certain frequency. By that I mean that if the agent runs at 10 Hz, it has to read its inputs and make a decision every 0.1 s. However, I couldn't find any sources on this problem, probably because I'm not using the correct terminology in my searches. Is this a valid way to approach the matter? If so, what can I use? I'm working with Python 3 on Windows (the game only runs on Windows); are there any libraries that could help? I'm guessing time.sleep() is not a viable way out, since it isn't very precise (when using high frequencies) and since it just freezes the agent.
EDIT: So my main questions are:
a) Should I run the agent at a fixed frequency? Is this a normal way to operate a reinforcement learning agent?
b) If so what libraries do you suggest?
There isn't a clear answer to this question, as it is influenced by a variety of factors, such as the inference time of your model, the maximum control rate the environment accepts, and the control rate required to solve the environment.
As you are trying to play a game, I am assuming that your eventual goal might be to compare the performance of the agent with the performance of a human.
If so, a good approach would be to select a control rate that is similar to what humans might use in the same game, which is most likely lower than 10 Hertz.
You could try to measure how many actions per second you take when playing the game yourself to get a good estimate.
However, any reasonable frequency, such as the 10 Hz you suggested, should be a good starting point for working on your agent.
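If you do settle on a fixed rate, one simple pattern in Python is to schedule each step against a deadline instead of sleeping a constant amount, so timing drift does not accumulate. This is a minimal sketch; agent, env and their methods are placeholders for your own code:

import time

def run_control_loop(agent, env, hz=10.0):
    period = 1.0 / hz
    next_tick = time.perf_counter()
    while not env.done:
        observation = env.read_state()   # grab the current game state
        action = agent.act(observation)  # decide
        env.apply(action)                # act
        # Sleep only for the time left in this tick.
        next_tick += period
        remaining = next_tick - time.perf_counter()
        if remaining > 0:
            time.sleep(remaining)

Keep in mind that time.sleep() on Windows has fairly coarse resolution (on the order of milliseconds, historically around 15 ms), which is usually fine at 10 Hz but can matter at much higher frequencies.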
So I am sort of an amateur when it comes to machine learning, and I am trying to program the Baum-Welch algorithm, which is a derivation of the EM algorithm for hidden Markov models. Inside my program I test for convergence using the probability of each observation sequence under the new model, terminating once the new model scores less than or equal to the old one. However, when I run the algorithm it seems to converge somewhat and gives results that are far better than random, but on the last iteration, just as it is converging, the score goes down. Is this a sign of a bug, or am I doing something wrong?
It seems to me that I should have been using the sum of the log of each observation's probability for the comparison instead, since that appears to be the function I am maximizing. However, the paper I read said to use the log of the sum of probabilities of the observations (which I am pretty sure gives the same comparison, since the log is monotonic) (https://www.cs.utah.edu/~piyush/teaching/EM_algorithm.pdf).
I fixed this on another project, where I implemented backpropagation with feed-forward neural nets, by using a for loop with a pre-set number of epochs instead of a while loop that required each new iteration to be strictly greater than the last, but I am wondering if this is bad practice.
My code is at https://github.com/icantrell/Natural-Language-Processing
inside the nlp.py file.
Any advice would be appreciated.
Thank You.
For EM iterations, or any other iteration proved to be non-decreasing, you should see increases until the size of the increases becomes small compared with floating-point error. At that point floating-point error violates the assumptions in the proof, and you may see not only a failure to increase but a very small decrease - but it should only ever be very small.
One good way to check these sorts of probability-based calculations is to create a small test problem where the right answer is glaringly obvious - so obvious that you can tell at a glance whether the answers from the code under test are correct.
It might be worth comparing the paper you reference with https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm#Proof_of_correctness. I think equations such as (11) and (12) are not intended for you to actually calculate, but serve as arguments to motivate and prove the final result. I think the equation corresponding to the traditional EM step, which you do calculate, is equation (15), which says that you change the parameters at each step to increase the expected log-likelihood, where the expectation is taken under the distribution of hidden states calculated with the old parameters - the standard EM step. In fact, turning the page, I see this is stated explicitly at the top of p. 8.
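In code, the usual stopping rule compares total log-likelihoods against a small tolerance rather than demanding a strict increase, which absorbs the tiny floating-point-induced decrease near convergence. A minimal sketch, where model, em_step and log_prob are placeholders for your own Baum-Welch routines:

import math

def train_em(model, sequences, max_iters=200, tol=1e-6):
    prev_ll = -math.inf
    for _ in range(max_iters):
        model.em_step(sequences)                            # one E-step + M-step
        ll = sum(model.log_prob(seq) for seq in sequences)  # total log-likelihood
        # Stop once the improvement falls below tol; this also tolerates the
        # tiny decrease that floating-point error can cause near convergence.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return model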
This may seem like (or even be) a stupid question: when I build something self-tuning like Python with PGO (or ATLAS, or I believe FFTW also does it), does the computer have to be otherwise idle (so as not to interfere with the measurements), or can I pass the time playing Doom?
The linked README from the python source distribution seems to deem this too trivial a matter to mention, but I'm genuinely unsure about this.
What you do on your computer while it is performing the PGO measurements should have no impact whatsoever on the result of the optimization. What PGO does is use measurements to find the hot paths in the code for a given data set and then use that information to make the program as fast as possible for that data set; which paths are hot and which are not is independent of other programs running on the computer.
To explain things a bit: when optimizing code there are trade-offs. The improvement will be higher in some parts of the code and lower in others, depending on which code transforms are used and where they are applied. To get a better final result you want big improvements in code that is executed a lot (hot code, in compiler lingo), while you can live with smaller improvements in code that is executed less frequently (cold code). Normally a set of heuristics is used to identify the hot parts of the program and apply optimizations in a way that makes those parts as fast as possible. The problem with this approach is that the heuristics do not know anything about how the program will be used in practice and may misidentify hot code as cold.
Profile-guided optimization (PGO) is a method to help the compiler locate the hot parts of the code using data from real executions. As a first step you tell the compiler to build an instrumented version of the program that measures how the code is executed in practice, typically by adding counters that record the number of iterations in loops and which branch is taken in if-statements. The second step is to run the instrumented program on real data. At the end of execution the program outputs the values of all the added counters, and by matching counters with the code it is possible to see which parts of the program are hot (high counts) and which are cold (low counts). Finally the program is compiled again, this time augmented with the profile. The compiler no longer needs to guess which parts are hot and which are cold; it can look that up in the profile.
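As a toy illustration of the counting step only (hypothetical Python, not what the compiler actually inserts), the instrumented build is conceptually just tallying how often each loop body and branch runs:

from collections import Counter

counts = Counter()   # stand-in for the counters a compiler would insert

def process(item):
    if item < 0:                          # branch A
        counts["negative_branch"] += 1
        return -item
    counts["positive_branch"] += 1        # branch B
    return item * 2

# "Run on real data": feed the instrumented program a representative data set.
for x in [3, 7, -1, 9, 12, 5]:
    counts["loop_body"] += 1
    process(x)

print(counts)   # high counts mark the hot paths the final build should favour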
Disclaimer: This is probably a difficult question to answer, but I would greatly appreciate any insights, thoughts or suggestions. If this has already been answered elsewhere and I simply haven't managed to find it, please let me know. Also, I'm somewhat new to algorithm engineering in general, specifically to using Python/NumPy for implementing and evaluating non-trivial algorithm prototypes, and picking it all up as I go. I may be missing something fundamental to scientific computing with NumPy.
To moderator: Feel free to move this thread if there is a better forum for it. I'm not sure this strictly qualifies for the Computational Science forum as it potentially boils down to an implementation/coding issue.
Note: If you want to understand the problem and context, read everything.
I'm using NumPy's std() function to calculate standard deviations in the implementation of an algorithm that finds optimal combinations of nodes in a graph, loosely based on minimum spanning trees. A selection function contains the call to std(). This is an early single-threaded prototype of the algorithm; the algorithm was originally designed for multithreading (which will likely be implemented in C or C++, so not relevant here).
Edge weights are dependent on properties of nodes previously selected and so are calculated when the algorithm examines an available node. The graph may contain several thousand nodes. At this stage searches are exhaustive, potentially requiring hundreds of thousands of calculations.
Additionally, evaluations of the algorithm are automated and may run hundreds of consecutive trials depending on user input. I haven't seen the errors (below) crop up during searches of smaller graphs (e.g., 100 nodes); however, scaling the size of the graph to around 1000 nodes guarantees that they make an appearance.
The relevant code can be reduced to roughly the following:
# graph.available = [{'id': (int), 'distr': {'dim1': (int), ...}}, ...]
# accumulatedDistr = {'dim1': (int), ...}
# Note: dicts as nodes, etc. used here for readability
edges = []
for node in graph.available:
    intersection = my.intersect(node['distr'], accumulatedDistr)  # Returns list
    edgeW = numpy.std(intersection)
    edges.append((node['id'], edgeW))
# Perform comparisons, select and combine into accumulatedDistr
Distributions are guaranteed to contain non-negative, non-zero integer values, and the lists returned from my.intersect() are likewise guaranteed never to be empty.
When I run a series of trials I'll occasionally see the following messages in the terminal:
C:\Program Files\Python36\lib\site-packages\numpy\core\_methods.py:135: RuntimeWarning: Degrees of freedom <= 0 for slice
keepdims=keepdims)
C:\Program Files\Python36\lib\site-packages\numpy\core\_methods.py:105: RuntimeWarning: invalid value encountered in true_divide
arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
C:\Program Files\Python36\lib\site-packages\numpy\core\_methods.py:127: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
They typically only appear a few times during a set of trials, so perhaps once every few million calculations. However, one bad calculation can (at the least!) subtly alter the results of an entire trial, so I'd still like to prevent this if at all possible. I assume there must be some internal state that's accumulating errors, but I've found precious little to suggest how I might address the problem. This concerns me greatly, because if errors really are accumulating, it casts doubt on all of the subsequent calculations.
I suppose I may have to look for a proprietary library (e.g., Python wrappers for Intel's kernel math libraries) to guarantee the kind of extreme-volume (pardon the abuse of terminology) computational stability that I want. But first: is there a way to prevent these warnings (in NumPy)? Also, just in case: is the problem as serious as I'm afraid it could be?
I've confirmed that this is indeed a bug in my own code, despite never catching it in the tests. Granted, on the evidence I would have had to run a few million or so consecutive randomized-input tests. On reflection, that might not be a bad idea as general practice for critical sections of code, despite the amount of time it takes. Better to catch it early on than after you've built an entire library around the affected code. Lesson learned.
Thanks to BrenBarn for putting me on the right track! I've run across open source libraries in the past that did have rare, hard to hit bugs. I'm relieved NumPy isn't one of them. At least not this time. Still, I think there's room to complain about vague, unhelpful error messages. Welcome to NumPy, I suppose.
So, to answer my own question: NumPy wasn't the problem. np.std() is very stable. Hopefully my experiences here will help others rule out what isn't causing their code to collapse.
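For anyone landing here with the same warnings: the posts above don't say exactly what the bug turned out to be, but one plausible way to trigger precisely these RuntimeWarnings is to hand np.std() an empty sequence, e.g. if my.intersect() ever returns an empty list despite the stated guarantee. A quick check:

import numpy as np

# np.std() on an empty input emits warnings of exactly this kind:
#   RuntimeWarning: Degrees of freedom <= 0 for slice
#   RuntimeWarning: invalid value encountered in ...
print(np.std([]))   # -> nan, plus the RuntimeWarnings above

# A cheap guard to use while debugging (hypothetical; adapt to your own loop):
def guarded_std(values):
    if len(values) == 0:
        raise ValueError("intersection was unexpectedly empty")
    return np.std(values)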