Agent repeats the same action cycle non-stop, Q-learning - Python

How can you prevent an agent from repeating the same cycle of actions non-stop?
Presumably the fix lies somewhere in the reward system, but are there general rules you could follow, or techniques you could build into your code, to prevent such a problem?
To be more precise, my actual problem is this one:
I'm trying to teach an ANN to play Doodle Jump using Q-learning. After only a few generations the agent keeps jumping onto one and the same platform/stone over and over again, non-stop. Increasing the length of the random-exploration phase doesn't help.
My reward system is the following:
+1 when the agent is living
+2 when the agent jumps on a platform
-1000 when it dies
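As a small sketch, that reward scheme as a function (the flag names are hypothetical placeholders for whatever the game loop exposes):
# Minimal sketch of the reward scheme above (flag names are hypothetical)
def compute_reward(is_dead, landed_on_platform):
    if is_dead:
        return -1000              # dying is punished heavily
    reward = 1                    # +1 for every step the agent stays alive
    if landed_on_platform:
        reward += 2               # +2 for landing on a platform
    return reward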
An idea would be to give a negative reward, or at least 0, when the agent lands on the same platform it landed on before. But to do that, I'd have to pass a lot of new input parameters to the ANN: the x, y coordinates of the agent and the x, y coordinates of the last visited platform.
Furthermore, the ANN would then also have to learn that a platform is 4 blocks thick, and so on.
Therefore, I'm fairly sure this idea wouldn't solve the problem; on the contrary, I believe the ANN would simply stop learning well in general, because there would be too many inputs that are unhelpful and hard to interpret.

This is not a direct answer to the very generally asked question.
I found a workaround for my particular Doodle Jump example; perhaps someone doing something similar will find it useful:
While training: make every platform the agent has jumped on disappear afterwards, and spawn a new one somewhere else.
While testing/presenting: disable this new "disappear" feature (so the game behaves as it did before) and the player will play well and won't hop on one and the same platform all the time.
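As a minimal sketch of that training-only tweak, assuming platforms are stored as (x, y) tuples and the screen-width constant is made up:
import random

SCREEN_W = 400                       # assumed screen width in pixels

def handle_landing(platforms, platform, training=True):
    # Training-only tweak: the platform the agent just landed on disappears
    # and a replacement spawns elsewhere, so the agent cannot farm one platform.
    if training and platform in platforms:
        platforms.remove(platform)
        new_platform = (random.uniform(0, SCREEN_W), platform[1])   # new x, same height
        platforms.append(new_platform)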

Related

Understanding Depth-First Branch and Bound implementation to StarCraft 2

The problem is that I'm finding it difficult to understand how DFBB works, what the parameters and output should be for this case.
I'm working on creating an AI for the game StarCraft 2 that will handle the build order in the game (for team Terran). I was planning to follow the approach described in the link (see below), which does something very similar to what I was going for. To summarize what I'm planning to do:
A list of different type of buildings that need to be built will be given to me. Buildings cost minerals and gas (this is the currency in the game), some buildings have prerequisites (meaning other buildings need to be built before it's possible to build it) and they take a certain amount of time to build.
In the article they used Depth-First Branch and Bound to figure out the optimal build order, meaning the fastest way possible to build the buildings in that list. This was their pseudocode:
Where the state S is represented by S = (current game time, resources available, actions in progress but not completed, worker income data). How S′ is derived is described in the article; it is done through three functions, so that part I understand.
As mentioned earlier I'm struggling to understand what the starting status S, goal G, time limit t and bound b should be represented by in the pseudocode that they are describing.
I only know three things for sure: the list of buildings that need to be built, what consumables I have at the moment (minerals and gas), and what resources I have (that is, buildings already present in the game). This should then be applied to the algorithm somehow, but it is unclear what the input to the function should be. The output should be a list sorted in the right order, so that if I were to build the buildings in the order they come in, it would all work out in the optimal possible time.
For example, should I iterate through the list of buildings and run DFBB on every element, with the goal then being to see whether the building can be built? But what should the time limit be set to, and what does bound mean in this case? Is it simply the cost?
Please explain how this function should be run on the list in order to find the optimal path of building it. The article is fairly easy to read, but I need some help understanding how it is meant to work and how I can apply it to my problem.
Link to article: https://ai.dmi.unibas.ch/research/reading_group/churchill-buro-aiide2011.pdf
Starting status S is the initial state at the start of the game. I believe you start with 100 minerals, a Command Center and 12(?) SCVs, so that's your start.
The goal G here is the list of buildings you want to have. The "satisfies" condition is whether every building in the goal is also in S.
The time limit is the amount of time you are willing to spend to get the result. If you set it to 5 seconds it will probably give you a sub-optimal solution, but it will do it in 5 seconds. If the algorithm finishes the search sooner, it will return earlier. If you don't care, leave it out, but make sure you write solutions to a file in case something happens.
Bound b is the in-game time limit for building everything. You initially set it to infinity or some obvious value (like 10 minutes?). When you find a solution, b gets updated, so every new solution you find MUST be faster (in-game) than the previous one.
A few notes. Make sure that the possible actions (the children in step 9) include doing nothing (waiting for more resources) and building an SCV.
Another thing that might be missing is a correct modelling of SCV movement speed. The units need to move to a place to build something and it also takes time for them to get back to mining.
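Not the article's exact pseudocode, but a rough generic sketch of how the pieces above (state S, goal G, wall-clock limit t, bound b) fit together; the satisfies, expand, cost and lower_bound callbacks are placeholders for your own game model:
import time

def dfbb(state, goal, satisfies, expand, cost, lower_bound,
         deadline, bound=float("inf"), best=None):
    # Generic depth-first branch and bound, sketched for the build-order search.
    # `bound` is the in-game finish time of the best plan found so far;
    # `deadline` is the wall-clock time limit t for the search itself.
    if time.time() > deadline:
        return best, bound                     # out of real time: return best so far
    if satisfies(state, goal):                 # every goal building is present in state
        if cost(state) < bound:                # found a faster build order
            return state, cost(state)
        return best, bound
    for child in expand(state, goal):          # children should include "wait" and "build SCV"
        if lower_bound(child, goal) < bound:   # prune branches that cannot beat the bound
            best, bound = dfbb(child, goal, satisfies, expand, cost,
                               lower_bound, deadline, bound, best)
    return best, bound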

Time step in reinforcement learning

For my first project in reinforcement learning I'm trying to train an agent to play a real-time game. This means the environment constantly moves and changes, so the agent needs to be precise about its timing. In order to have a correct sequence, I figured the agent will have to work at a certain frequency. By that I mean that if the agent runs at 10 Hz, it will have to take inputs every 0.1 s and make a decision. However, I couldn't find any sources on this problem, probably because I'm not using the correct terminology in my searches. Is this a valid way to approach the matter? If so, what can I use? I'm working with Python 3 on Windows (the game only runs on Windows); are there any libraries that could be used? I'm guessing time.sleep() is not a viable way out, since it isn't very precise (when using high frequencies) and since it just freezes the agent.
EDIT: So my main questions are:
a) Should I use a fixed frequency? Is this a normal way to operate a reinforcement learning agent?
b) If so what libraries do you suggest?
There isn't a clear answer to this question, as it is influenced by a variety of factors, such as the inference time of your model, the maximum control rate the environment accepts, and the control rate required to solve the environment.
As you are trying to play a game, I am assuming that your eventual goal might be to compare the performance of the agent with the performance of a human.
If so, a good approach would be to select a control rate that is similar to what humans might use in the same game, which is most likely lower than 10 Hertz.
You could try to measure how many actions per second you take when playing yourself to get a good estimate.
However, any reasonable frequency, such as the 10 Hz you suggested, should be a good starting point for your agent.
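As an illustration of one common approach (no special library needed), here is a small fixed-rate loop in plain Python that subtracts the time spent observing and deciding from the sleep, so the control rate doesn't drift; get_observation, choose_action and apply_action are placeholders for your own code:
import time

CONTROL_HZ = 10                 # decisions per second; tune to your game
PERIOD = 1.0 / CONTROL_HZ

def control_loop(get_observation, choose_action, apply_action, run_for_s=60.0):
    # Fixed-rate agent loop: the sleep is shortened by however long the
    # observation + decision took, so the rate doesn't drift over time.
    next_tick = time.perf_counter()
    end_time = next_tick + run_for_s
    while time.perf_counter() < end_time:
        obs = get_observation()        # read the game state
        action = choose_action(obs)    # agent inference
        apply_action(action)           # send the input to the game
        next_tick += PERIOD
        remaining = next_tick - time.perf_counter()
        if remaining > 0:
            time.sleep(remaining)
        else:
            next_tick = time.perf_counter()   # fell behind; resynchronise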

How to give an AI controls in a video game?

So I made Pong using PyGame and I want to use genetic algorithms to have an AI learn to play the game. I want it to know only the location of its paddle, the location of the ball, and the controls. I just don't know how to have the AI move the paddle on its own. I don't want to do something like: "If the ball is above you, go up." I want it to just try random stuff until it learns what to do.
So my question is, how do I get the AI to try controls and see what works?
Learning Atari Pong has become a standard task in reinforcement learning. For example, there is the OpenAI baselines GitHub repo implementing RL algorithms that can be plugged into various tasks.
You definitely don't need those advanced algorithms just to learn Pong the way you describe, but you can learn from the API they use to separate the task ("environment" in reinforcement learning terms) from the AI part (the "controller" or "agent"). For this, I suggest reading the OpenAI Gym documentation on how you would add a new environment.
In short, you could either use a few float numbers as inputs (position and velocity of the ball, or two successive positions instead of the velocity, plus the position of the paddle), or use discrete inputs (integers, or raw pixels, which are much harder to learn from). Those inputs could be connected to a small neural network.
For the command output, the simplest thing to do is to predict a probability for moving up or down. This is a good idea because when you evaluate your controller, it will have some non-zero chance of scoring points, so your genetic algorithm can compare different controllers (with different weights) against each other. Just use the sigmoid function on your neural net output, and interpret it as probability.
If you initialize all your neural network weights to a good random range, you probably can get a pong player that doesn't completely suck just by trying random weights for long enough (even without a GA).
PS: if you didn't plan to use a neural network: they are really simple to implement from scratch if you only have to implement the forward-pass. E.g. if you don't implement back-propagation training, and use a GA instead to learn the weights (or an evolution strategy, or just random weights). The hardest part is to find a good range for the initial random weights.
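To make that concrete, here is a minimal forward-pass-only network with one hidden layer and a sigmoid output interpreted as the probability of pressing "up"; the layer sizes are arbitrary, and the weight tuple is what your GA would evolve:
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(weights, inputs):
    # weights = (W1, b1, W2, b2); returns the probability of pressing "up"
    W1, b1, W2, b2 = weights
    hidden = np.tanh(W1 @ inputs + b1)
    return sigmoid(W2 @ hidden + b2)[0]

def random_weights(n_inputs=5, n_hidden=6, scale=0.5):
    # small random initial weights; the GA mutates/recombines these
    return (np.random.randn(n_hidden, n_inputs) * scale,
            np.zeros(n_hidden),
            np.random.randn(1, n_hidden) * scale,
            np.zeros(1))

# Example inputs: [paddle_y, ball_x, ball_y, ball_vx, ball_vy], normalised
p_up = forward(random_weights(), np.array([0.5, 0.4, 0.6, -0.1, 0.2]))
move_up = np.random.rand() < p_up    # sample the action from the probability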
One design consideration which may be helpful: expose some minimal set of display details through another interface, and conversely allow commands to be sent to the player paddle. For example, you could send a simple structure describing the ball position and both paddles with each frame update, out through a socket to another process. Following the same pattern, you could define a structure that is sent as a reply to that message, describing how to move the player paddle. For example:
# Pong game program
import socket
import struct
# Set up server or client socket
# ... Into game loop
state = (p1_paddle_y, p2_paddle_y, ball_x, ball_y, victory_state)
# assuming pixel locations, and victory_state is -1:Loss, 0:InProgress, 1:Win
# '>LLLLh' = four unsigned longs + one short, big-endian, so all five fields are packed
myGameStateMsg = struct.pack('>LLLLh', state[0], state[1], state[2], state[3], state[4])
sock.send(myGameStateMsg)  # Sending game state to player
playerMsg = sock.recv(4)   # Get player command (4 bytes)
playerCmd = struct.unpack('>i', playerMsg)[0]  # unpack returns a tuple; take the single int
# playerCmd is an integer describing direction & speed of paddle motion
# ... Process game state update, repeat loop
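On the other end, the player/agent process would mirror this: receive the packed state, decide, and send an integer command back. A minimal sketch, where the port number and the decide() function are hypothetical:
# Player / agent process (sketch; port number and decide() are hypothetical)
import socket
import struct

MSG_SIZE = struct.calcsize('>LLLLh')            # 18 bytes per game-state message

sock = socket.create_connection(("localhost", 9999))
while True:
    msg = sock.recv(MSG_SIZE)                   # a real client should loop until MSG_SIZE bytes arrive
    if not msg:
        break                                   # game closed the connection
    p1_y, p2_y, ball_x, ball_y, victory = struct.unpack('>LLLLh', msg)
    cmd = decide(p1_y, ball_x, ball_y)          # your controller/agent goes here
    sock.send(struct.pack('>i', cmd))           # integer paddle command back to the game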
You could accomplish the same effect using threads and a transacted structure, but you'll need to consider properly guarding those structures (read-while-write problems, etc.)
Personally, I prefer the first approach (sockets & multi-processing) for stability reasons. Suppose there's some sort of bug that causes a crash; if you've already got process separation, it becomes easier to identify the source of the crash. At the thread-level, it's still possible but a little more challenging. One of the other benefits of the multi-processing approach is that you can easily set up multiple players and have the game expand (1vInGameAI, 1v1, 3v3, 4v4). Especially when you expand, you could test out different algorithms, like Q-Learning, adaptive dynamic programming, etc. and have them play each other!
Addendum: Sockets 101
Sockets are a mechanism for getting more than one process (i.e., a running program) to send messages to one another. These processes can be running on the same machine or across the network. In a sense, using them is like reading and writing a file that is constantly being modified (that's the abstraction sockets provide), but they also provide blocking calls that make the process wait for information to become available.
There is a lot more detail that can be discussed about sockets (like file-sockets vs network-sockets (FD vs IP); UDP vs TCP, etc.) that could easily fill multiple pages. Instead, please refer to the following tutorial about a basic setup: https://docs.python.org/3/howto/sockets.html. With that, you'll have a basic understanding of what they can provide and where to go for more advanced techniques with them.
You may also want to consult the struct tutorial as well for introductory message packing: https://docs.python.org/3/library/struct.html. There are better ways of doing this, but you won't understand much about how they work and break down without understanding structs.
So you'd want the AI's inputs to be the position of the paddle and the position of the ball. The AI's output is two boolean values indicating whether it should press the up or down button on the next simulation step.
I'd also suggest adding another input value: the ball's velocity. Otherwise, you would likely need to add yet another input giving the ball's location in the previous simulation step, plus a much more complicated middle layer for the AI to learn the concept of velocity.

Use machine learning for simple robot control

I'd like to improve my little robot with machine learning.
Up to now it uses simple while loops and if/then decisions in its main function to act as a lawn-mowing robot.
My idea is to use SKLearn for that purpose.
Please help me to find the right first steps.
I have a few sensors that tell it about the world outside:
World ={yaw, pan, tilt, distance_to_front_obstacle, ground_color}
I have a state vector
State = {left_motor, right_motor, cutter_motor}
that controls the 3 actuators of the robot.
I'd like to build a dataset of input and output values to teach sklearn the desired behaviour; after that, the input values should yield the correct output values for the actuators.
One example: if the motors are on and the robot should be moving forward, but the distance sensor keeps reporting constant values, the robot is probably blocked. It should then decide to back up, turn, and move in another direction.
First of all, do you think this is possible with sklearn, and second, how should I start?
My (simple) robot control code is here: http://github.com/bgewehr/RPiMower
Please help me with the first steps!
I would suggest using reinforcement learning. Here is a tutorial on Q-learning that fits your problem well.
If you want code in Python: right now I don't think there is an implementation of Q-learning in scikit-learn. However, I can give you some examples of Python code that you could use: 1, 2 and 3.
Also keep in mind that reinforcement learning aims to maximize the sum of all future rewards, so you have to focus on the long-term picture.
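If you take that route, the core of tabular Q-learning is just a table of state-action values and one update rule. A minimal sketch (the action names are hypothetical, and you would still need to discretise your sensor readings into states and define a reward):
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1                   # learning rate, discount, exploration
ACTIONS = ["forward", "back_and_turn", "stop_cutter"]   # hypothetical action set
Q = defaultdict(float)                                  # Q[(state, action)] -> expected return

def choose_action(state):
    if random.random() < EPSILON:                       # explore occasionally
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    # standard Q-learning update: move Q towards reward + discounted best future value
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])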
Good luck :-)
The sklearn package contains a lot of useful tools for machine learning, so I don't think that's a problem; if it is, there are definitely other useful Python packages. I think collecting data for the supervised learning phase will be the challenging part. I wonder if it would be smart to make a track with tape within a grid system; that would make it easier to translate the track into labels (x, y positions in the grid). Each cell in the grid should be small if you want to make complex tracks later on. It may also be worth checking how they did it with the self-driving Google car.
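If you try the supervised sklearn route first, a minimal sketch could look like the following; the feature rows and command labels below are made up purely for illustration:
from sklearn.tree import DecisionTreeClassifier

# Each row: [yaw, pan, tilt, distance_to_front_obstacle, ground_color]
X = [
    [0.0, 0, 0, 2.5, 1],   # clear path on grass
    [0.0, 0, 0, 0.2, 1],   # blocked: obstacle right in front
    [0.1, 0, 0, 1.8, 0],   # off the lawn (ground_color != grass)
]
# Each label: a discrete command for (left_motor, right_motor, cutter_motor)
y = ["forward_cut", "reverse_turn", "stop_cutter"]

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[0.0, 0, 0, 0.15, 1]]))   # blocked -> expect "reverse_turn"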

Get away from an object in a 2D-grid

I'm developing a small game in python. I am using a 2D rectangular grid. I know that for pathfinding I can use A* and the likes, I know how this works, but the problem I have is a bit different.
Let's say we have a computer-controlled human and some computer-controlled zombies. When the human spots a zombie, it should get as far away from it as it can. At the moment, just to test everything, I simply turn around 180° and run away until I spot another zombie, then repeat.
Obviously this is not very smart (and can cause problems if there is a zombie on both sides).
I was wondering if there is a smarter way to do this? Something like using Dijkstra to find a "safe zone" I can run to? Alternatives are always welcome; I can't seem to figure it out.
You could suppose that the zombies can see everything within a particular range (a radius, or perhaps something more clever) and then have the human look for a spot that it thinks the zombies can't see. Pick the closest spot the zombies can't see and use the A* algorithm to find a path, if one exists; otherwise try a different spot. Watch out for the case where there's nowhere to run. Alternatively, you could weight all of the spots in your visibility region with a value based on how far away you would be from the zombies if you chose that spot.
Just off the top of my head, you'll probably be able to do some vector math and have the human run in the direction pointing away from the zombies.
I don't know how well this will work (or how it will scale to the number of zombies you have), but you could do something like:
For each zombie, compute the distance to the human and the direction it is from the human.
Create a vector for each zombie (or some subset of the close zombies), using the direction and the inverse of the distance, since the closer the zombie is the more important it is to run away.
Find the sum of all the vectors.
Make the human run in the direction opposite to the resulting vector.
I'm not sure how resource intensive this would be, but it seems like the most logical way to prioritize where to run.
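A small sketch of that weighted-sum idea in plain Python, with positions as (x, y) tuples; the human then runs in the direction opposite to the summed zombie vector:
import math

def flee_direction(human, zombies):
    # Sum unit vectors from the human towards each zombie, weighted by 1/distance,
    # then run the opposite way. `human` and each zombie are (x, y) tuples.
    sx = sy = 0.0
    for zx, zy in zombies:
        dx, dy = zx - human[0], zy - human[1]
        dist = math.hypot(dx, dy) or 1e-6           # avoid division by zero
        weight = 1.0 / dist                         # closer zombies matter more
        sx += weight * dx / dist
        sy += weight * dy / dist
    length = math.hypot(sx, sy) or 1e-6
    return (-sx / length, -sy / length)             # normalised "run away" direction

# Example: with zombies on both sides, the nearer one dominates
print(flee_direction((0, 0), [(5, 0), (-1, 0)]))    # points mostly towards +x, away from the near zombie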
