Deep Q-learning modification - python

#Edit:
I'm trying to create an agent to play the game of Tetris, using a convolutional nnet that takes the board state + current piece as input. From what I've read, Deep Q-learning is not very good at this, which I just confirmed.
#end Edit
Suppose that an agent is learning a policy to play a game, where each game step can be represented as
s, a, r, s', done
representing
state, action, reward, next state, game over
In the Deep Q-learning algorithm, the agent is in state s and takes some action a (following an epsilon-greedy policy), observes a reward r and gets to the next state s'.
The agent acts like this:
# returns an action index
def get_action(state, epsilon):
    if random() < epsilon:
        return random_action_index
    else:
        return argmax(nnet.predict(state))
The parameters are updated by greedily observing the max Q-value in state s', so we have
# action taken was 'a' in state 's' leading to 's_'
prediction = nnet.predict(s)
if done:
    target = reward
else:
    target = reward + gamma * max(nnet.predict(s_))
prediction[a] = target
The [prediction, target] pair is fed to the nnet for a weight update. So this nnet takes a state s as input and outputs a vector of Q-values with dimension n_actions. This is all clear to me.
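For concreteness, here is a minimal sketch of that update step, assuming nnet is a Keras-style model exposing predict and fit and that states are 1-D NumPy arrays (those details are assumptions, not something given above):
import numpy as np

def q_learning_update(nnet, s, a, r, s_, done, gamma=0.99):
    # current Q-value estimates for state s (one value per action)
    prediction = nnet.predict(s[np.newaxis, :])[0]
    if done:
        target = r
    else:
        # bootstrap from the greedy value of the next state
        target = r + gamma * np.max(nnet.predict(s_[np.newaxis, :])[0])
    # only the taken action's output is pushed towards the target
    prediction[a] = target
    nnet.fit(s[np.newaxis, :], prediction[np.newaxis, :], verbose=0)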
Now, suppose that my state-actions are so noisy that this approach will simply not work. So, instead of outputting a vector of dimension n_actions, my nnet output is a single value, representing the "state quality" (how desirable that state is).
Now my agent acts like this:
# returns an action based on how good the next state is
def get_action(state, epsilon):
    actions = []
    for action in possible_actions(state):
        game_copy = game.deepcopy()   # simulate on a copy, not on the live game
        game_copy.apply(action)
        action.value = nnet.predict(game_copy.get_state())
        actions.append(action)
    if random() < epsilon:
        return randomChoice(actions)
    else:
        return max(actions, key=lambda a: a.value)
And my [prediction, target] is like this:
# action taken was 'a' in state 's' leading to 's_'
prediction = nnet.predict(s)
if done:
    target = reward
else:
    target = reward + gamma * nnet.predict(s_)
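In runnable form, that update could look like this (again assuming a Keras-style nnet, now with a single scalar output; the method names and array shapes are assumptions):
import numpy as np

def state_value_update(nnet, s, r, s_, done, gamma=0.99):
    if done:
        target = r
    else:
        # bootstrap from the predicted quality of the next state
        target = r + gamma * float(nnet.predict(s_[np.newaxis, :])[0])
    # the network is trained to map state s to the scalar target
    nnet.fit(s[np.newaxis, :], np.array([[target]]), verbose=0)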
I have some questions regarding this second algorithm:
a) Does it make sense to act non-greedily sometimes?
Intuitively no, because if I land in a bad state, it was probably because of a bad random action, not because the previous state was 'bad'. The Q-learning update will adjust the bad action, but this second algorithm will wrongly adjust the previous state.
b) What kind of algorithm is this? Where does it fit in Reinforcement Learning?
c) In the case of Tetris, the state almost never repeats, so what can I do in this case? Is that the reason Deep Q-learning fails here?
This may look confusing, but the algorithm actually works. I can provide extra details if necessary, thank you!

Now, suppose that my state-actions are so noisy that this approach will simply not work. So, instead of outputting a vector of dimension n_actions, my nnet output is a single value, representing the "state quality" (how desirable that state is).
Now my agent acts like this:
# returns an action based on how good the next state is
def get_action(state, epsilon):
    actions = []
    for action in possible_actions(state):
        game.apply(action)
        val = nnet.predict(game.get_state())
        action.value = val
        actions.append(action)
    if random() < epsilon:
        return randomChoice(actions)
    else:
        return max(actions, key=lambda a: a.value)
First a quick note on that pseudocode: I don't think that would work, because you don't simulate the effects of the different actions on copies of the game state, but just on the game state directly. You'd probably want to create separate copies of the game state first, and run each action once on a different copy.
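To make the fix concrete, here is a minimal sketch of what I mean (deepcopy, apply, and get_state come from your pseudocode; possible_actions and the scalar output of nnet.predict are assumptions):
import random

def get_action(game, nnet, epsilon):
    # evaluate every action on its own copy of the game, never on the live state
    scored = []
    for action in possible_actions(game.get_state()):    # assumed helper
        sim = game.deepcopy()                            # independent copy per action
        sim.apply(action)
        value = float(nnet.predict(sim.get_state()))     # scalar "state quality"
        scored.append((value, action))
    if random.random() < epsilon:
        return random.choice(scored)[1]                  # explore
    return max(scored, key=lambda pair: pair[0])[1]      # act greedily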
Anyway, this kind of algorithm is generally assumed not to be possible in Reinforcement Learning settings. In RL, we generally operate under the assumption that we don't have a "simulator" or "forward model" or anything like that. We normally assume that we have an agent situated in a real environment in which we can generate experience that can be used to learn from. Under this assumption, we cannot implement that kind of for each action possible in state loop which simulates what would happen if we were to execute different actions in the same game state. The assumption is that we first have to pick one action, execute it, and then learn from that particular trajectory of experience; we can no longer "go back", imagine that we had selected a different action, and also learn from that trajectory.
Of course, in practice doing this is often possible, because we often do actually have a simulator (for example a robot simulator, or a game, etc.). The majority of research in RL still makes the assumption that we do not have such a simulator, because this leads to algorithms that may eventually be useable in "real-world" situations (e.g., real-world physical robots). Implementing the idea you describe above actually means you're moving more towards search algorithms (such as Monte-Carlo Tree Search), rather than Reinforcement Learning algorithms. It means that your approach is limited to scenarios in which you do have a simulator available (e.g., games).
a) Does it make sense to act non-greedily sometimes?
Even though you include the search-like process of looping through all actions and simulating all their effects, I suspect you'll still have to have some form of exploration if you want to converge to good policies, so you'll have to act non-greedily sometimes. Yes, it looks like this will cause your algorithm to converge to something different from the traditional interpretation of an "optimal policy". This is not too much of a problem if your epsilon is fairly low though. In practice, it will likely tend to be a slightly "safer" policy that is learned. See also my answer to this other question.
b) What kind of algorithm is this? Where does it fit in Reinforcement Learning?
Aside from my discussion above on how this actually moves a bit towards the domain of Search algorithms rather than RL algorithms, it also looks to me like this would be an on-policy algorithm rather than an off-policy algorithm (standard Q-learning is off-policy, because it learns about the greedy policy whilst generating experience through a non-greedy policy). This distinction is also discussed in detail by most of the answers to the question I linked to above.

Related

How to set your own value function in Reinforcement learning?

I am new to using reinforcement learning; I have only read the first few chapters of R. Sutton (so I have a small theoretical background).
I try to solve a combinatorial optimization problem which can be broken down to:
I am looking for the optimal configuration of points (qubits) on a grid (quantum computer).
I already have a cost function to qualify a configuration. I also have a reward function.
Right now I am using simulated annealing, where I randomly move a qubit or swap two qubits.
However, this ansatz is not working well for more than 30 qubits.
That's why I thought to use a policy, which tells me which qubit to move/swap instead of doing it randomly.
Reading the gym documentation, I couldn't find what option I should use. I don't need Q-Learning or deep reinforcement learning as far as I understood since I only need to learn a policy?
I would also be fine using PyTorch or whatever. With this little amount of information, what do you recommend I choose? More importantly, how can I set my own value function?
There are two categories of RL algorithms.
One category, like Q-learning, Deep Q-learning, and others, learns a value function that, for a state and an action, predicts the estimated reward you will get. Then, once you know for each state and each action what that reward is, your policy is simply to select in each state the action that provides the biggest reward. Thus, in the case of these algorithms, even though you learn a value function, the policy depends on this value function.
Then you have other deep RL algorithms where you learn a policy directly, such as REINFORCE, Actor-Critic algorithms, and others. You still learn a value function, but at the same time you also learn a policy with the help of the value function. The value function helps the system learn the policy during training, but during testing you no longer use the value function, only the policy.
Thus, in the first case you learn a value function and act greedily on it, and in the second case you learn both a value function and a policy and then use the policy to navigate the environment.
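To make the distinction concrete, here is a small sketch (the network objects, their predict method, and their output shapes are assumptions):
import numpy as np

# value-based (e.g. Deep Q-learning): the policy is derived from the Q-values
def act_value_based(q_net, state):
    q_values = q_net.predict(state[np.newaxis, :])[0]    # one value per action
    return int(np.argmax(q_values))                      # act greedily on them

# policy-based (e.g. REINFORCE, Actor-Critic): the policy is its own network
def act_policy_based(policy_net, state):
    probs = policy_net.predict(state[np.newaxis, :])[0]  # action probabilities
    return int(np.random.choice(len(probs), p=probs))    # sample an action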
In the end, both kinds of algorithms should work for your problem, and since you say you are new to RL, maybe you could try Deep Q-learning first, starting from the gym documentation.

Reinforcement Learning with MDP for revenue optimization

I want to model the service of selling seats on an airplane as an MDP (Markov decision process) in order to use reinforcement learning for airline revenue optimization. For that I need to define the states, actions, policy, value, and reward. I have thought a little about it, but I think there is still something missing.
I model my system this way:
States = (r, c), where r is the number of passengers and c is the number of seats bought, so r >= c.
Actions = (p1, p2, p3), which are the 3 possible prices. The objective is to decide which one of them generates more revenue.
Reward: revenue.
Could you please tell me what you think and help me?
After the modelling, I have to implement all of that with reinforcement learning. Is there a package that does the work?
I think the biggest thing missing in your formulation is the sequential part. Reinforcement learning is useful when applied sequentially, where the next state depends on the current state (hence "Markovian"). In this formulation, you have not specified any Markovian behavior at all. Also, the reward is a scalar which depends on either the current state or the combination of current state and action. In your case, the revenue depends on the price (the action), but it has no correlation to the state (the seat). These are the two big problems that I see with your formulation; there are others as well. I suggest you go through the RL theory (online courses and such) and work through a few sample problems before trying to formulate your own.
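For illustration, here is one possible way to make such a formulation sequential; everything here, including the toy demand model and the numbers, is a made-up assumption rather than part of the original question:
import random

class SeatSellingMDP:
    """Toy episodic MDP: at each step we quote one of three prices; the state
    tracks remaining seats and elapsed time, so the next state depends on the
    current one."""

    PRICES = [50.0, 100.0, 200.0]   # the three actions p1, p2, p3 (assumed values)

    def __init__(self, seats=100, horizon=50):
        self.seats, self.horizon = seats, horizon
        self.reset()

    def reset(self):
        self.remaining_seats = self.seats
        self.t = 0
        return (self.remaining_seats, self.t)

    def step(self, action):
        price = self.PRICES[action]
        # hypothetical demand model: higher prices sell less often
        sold = self.remaining_seats > 0 and random.random() < 100.0 / (100.0 + price)
        reward = price if sold else 0.0      # revenue earned this step
        self.remaining_seats -= int(sold)
        self.t += 1
        done = self.t >= self.horizon or self.remaining_seats == 0
        return (self.remaining_seats, self.t), reward, done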

Reinforcement Learning - How do we decide the reward for the agent when the input to the game is only pixels?

I am new to RL and the best I've done is CartPole in OpenAI Gym. In CartPole, the API automatically provides the reward given the action taken. How am I supposed to decide the reward when all I have is pixel data and no "magic function" that could tell me the reward for a certain action?
Say I want to make a self-driving bot in GTA San Andreas. The input I have access to is raw pixels. How am I supposed to figure out the reward for a certain action it takes?
You need to make up a reward that proxies the behavior you want - and that is actually no trivial business.
If there are some numbers on a fixed part of the screen representing the score, then you can use old-fashioned image processing techniques to read them and let those be your reward function.
If there is a minimap in a fixed part of the screen with fixed scale and orientation, then you could use minus the distance of your character to a target as reward.
If there are no fixed elements in the UI you can use to proxy the reward, then you are going to have a bad time, unless you can somehow access the internal variables of the console to proxy the reward (using the position coordinates of your PC, for example).
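For the score-reading case above, a rough sketch of such a reward proxy could look like this (the crop coordinates are hypothetical and OCR via pytesseract is just one option; the exact Tesseract settings would need tuning):
from PIL import Image
import pytesseract  # requires the Tesseract binary to be installed

# hypothetical coordinates of the on-screen score box; these depend entirely
# on the game's UI and would have to be measured by hand
SCORE_BOX = (20, 20, 220, 60)   # (left, top, right, bottom)

def reward_from_frame(frame: Image.Image, previous_score: int):
    score_img = frame.crop(SCORE_BOX).convert("L")           # grayscale crop
    text = pytesseract.image_to_string(score_img, config="--psm 7")
    digits = "".join(ch for ch in text if ch.isdigit())
    if not digits:
        return 0.0, previous_score                           # OCR failed, no reward
    score = int(digits)
    return float(score - previous_score), score              # reward = score delta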

Genetic Algorithm in Optimization of Events

I'm a data analysis student and I'm starting to explore Genetic Algorithms at the moment. I'm trying to solve a problem with GA but I'm not sure about the formulation of the problem.
Basically I have a variable whose state is 0 or 1 (0 means it's in the normal range of values, 1 means it's in a critical state). When the state is 1 I can apply 3 solutions (let's call them Solutions A, B, and C), and for each solution I know the time at which the solution was applied and the time at which the state of the variable goes back to 0.
So for the problem I have a set of data records, each containing a critical event (state 1), the solution applied, the time interval (in minutes) from the critical event to the application of the solution, and the time interval (in minutes) from the application of the solution until the state goes back to 0.
With a genetic algorithm I want to know which solution is the best and fastest for a critical event and, if possible, to rank the solutions obtained, so that if in the future one solution can't be applied I can always apply the second best, for example.
I'm thinking of developing the solution in Python since I'm new to GA.
Edit: Specifying the problem (responding to AMack)
Yes, it is more or less that, but with some nuances. For example, function A can be more suitable for making the variable go to F (false), but because other problems exist with the variable, more than one solution gets applied. So in the data I receive for an event of V, sometimes 3 or 4 functions are applied, but only 1 or 2 of them are specialized for the problem I want to analyze. My objective is to provide decision support on which solution to use when a given problem appears. But there can be more than one optimal solution, because for some events function A acts very fast, while in other cases of the same event function A doesn't produce a fast response and function C is better. So in the end I want a result that indicates the best solutions for the problem, and not only the fastest, because the solution that is fastest in the majority of cases is sometimes not the fastest for the same issue with a different background.
I'm unsure of what your question is, but here are the elements you need for any GA (a minimal sketch follows the list):
A population of initial "genomes"
A ranking function
Some form of mutation, crossing over within the genome, and reproduction
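Putting those pieces together, a minimal sketch in Python could look like this (the string genome and the fitness function are placeholders you would replace with your own encoding and ranking of solutions):
import random

GENES = "ABC"          # one gene per candidate solution (A, B or C)
GENOME_LEN = 10
POP_SIZE = 30

def fitness(genome):
    # placeholder ranking function: replace with your own score, e.g. based on
    # how quickly the chosen solutions clear the critical event
    return genome.count("A")

def crossover(a, b):
    cut = random.randrange(1, GENOME_LEN)
    return a[:cut] + b[cut:]

def mutate(genome, rate=0.05):
    return "".join(random.choice(GENES) if random.random() < rate else g
                   for g in genome)

def evolve(generations=50):
    # initial population of random genomes
    pop = ["".join(random.choice(GENES) for _ in range(GENOME_LEN))
           for _ in range(POP_SIZE)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: POP_SIZE // 2]                        # selection
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(POP_SIZE - len(parents))]  # reproduction
        pop = parents + children
    return max(pop, key=fitness)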
If a critical event is always the same, your GA should work very well. That being said, if you have a different critical event but the same genome, you will run into trouble. GAs evolve functions towards the best possible solution for a set of conditions. If you constantly run the GA so that it may adapt to each unique situation, you will find a greater degree of adaptability, but you will have a speed issue.
You have a distinct advantage using Python because string manipulation (which you'll probably use for the genome) is easy, however...
Python is slow.
If the genome is short, the initial population is small, and there are very few generations, this shouldn't be a problem. You may lose better solutions that way, but it will be significantly faster.
have fun...
You should take a look at GARAGe at Michigan State. They are a GA research group with a fair number of resources in terms of theory, papers, and software that should provide inspiration.
To start, let's make sure I understand your problem.
You have a set of sample data, each element containing a time series of a binary variable (we'll call it V). When V is set to True, a function (A, B, or C) is applied which returns V to its False state. You would like to apply a genetic algorithm to determine which function (or solution) will return V to False in the least amount of time.
If this is the case, I would stay away from GAs. GAs are typically used for some kind of function optimization / tuning. In general, the underlying assumption is that what you permute is under your control during the algorithm's application (i.e., you are modifying parameters used by the algorithm that are independent of the input data). In your case, my impression is that you just want to find out which of your (I assume) static functions perform best in a wide variety of cases. If you don't feel your current dataset provides a decent approximation of your true input distribution, you can always sample from it and permute the values to see what happens; however, this would not be a GA.
Having said all of this, I could be wrong. If anyone has used GAs in verification like this, please let me know. I'd certainly be interested in learning about it.

Applying Hidden Markov Models in Python

I recently had a homework assignment in my computational biology class in which I had to apply an HMM. Although I think I understand HMMs, I couldn't manage to apply them to my code. Basically, I had to apply the Fair Bet Casino problem to CpG islands in DNA--using observed outcomes to predict hidden states via a transition matrix. I managed to come up with a solution that worked correctly... but unfortunately in exponential time. My solution pseudocode looked like this:
*notes: trans_matrix contains both the probabilities of going from one state to the other state, and the probabilities of going from one observation to the other observation.
def HMM(trans_matrix, observations):
    movements = {}
    for i in range(0, len(observations)):
        movements[i] = ...  # list of 4 probabilities, all possible state changes
    poss_paths = []
    for m in movements.keys():
        # add two new paths for every path currently in poss_paths
        # (state either changes or doesn't)
        # store a running probability for each path
        ...
    correct_path = poss_paths[index_of_highest_probability]
    return correct_path
I realize that I go into exponential time when I look at all the possible paths (the "add two new paths for every path currently in poss_paths" step). I'm not sure how to find the path of highest probability without looking at all the paths, though.
Thank you!
EDIT:
I did find a lot of info about the Viterbi algorithm when I was doing the assignment, but I was confused as to how it actually gives the best answer. It seems like Viterbi (I was looking at the forward algorithm specifically, I think) looks at a specific position, moves forward a position or two, and then decides the "correct" next path increment having looked at only a few subsequent probabilities. I may be understanding this wrong; is this how Viterbi works? Pseudocode would be helpful. Thank you!
One benefit of Hidden Markov Models is that you can generally do what you need without considering all possible paths one by one. What you are trying to do looks like an expensive way of finding the single most probable path, which you can do by dynamic programming under the name of the Viterbi algorithm - see e.g. http://cs.brown.edu/research/ai/dynamics/tutorial/Documents/HiddenMarkovModels.html. There are other interesting things covered in documents like this which are not quite the same, such as working out the probabilities for the hidden state at a single position, or at all single positions. Very often this involves something called alpha and beta passes, which are a good search term, along with Hidden Markov Models.
There is a large description at http://en.wikipedia.org/wiki/Viterbi_algorithm with mathematical pseudo-code and what I think is Python as well. Like most of these algorithms, it uses the Markov property that once you know the hidden state at a point you know everything you need to answer questions about that point in time - you don't need to know the past history. As in dynamic programming, you work from left to right along the data, using answers computed for output k-1 to work out answers for output k.
What you want to work out at point k, for each state j, is the probability of the observed data up to and including that point, along the most likely path that ends up in state j at point k. That probability is the product of the probability of the observed data at k given state j, times the probability of the transition from some previous state at time k-1 to j, times the probability of all of the observed data up to and including point k-1 given that you ended up at that previous state at time k-1 - this last bit is something you have just computed for time k-1. You consider all possible previous states and pick the one that gives you the highest combined probability. That gives you the answer for state j at time k, and you save the previous state that gave you the best answer.
This may look like you are just fiddling around with outputs for k and k-1, but you now have an answer for time k that reflects all the data up to and including time k. You carry this on until k is the last point in your data, at which point you have answers for the probabilities of each final state given all the data. Pick the state at this time which gives you the highest probability, and then trace all the way back, using the info you saved about which previous state at time k-1 you used to compute the probability for the data up to point k ending in state j at time k.
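A compact version of that dynamic program in Python, assuming the model is given as start, transition, and emission probability dictionaries (one common representation among several):
def viterbi(observations, states, start_p, trans_p, emit_p):
    # best[k][j]: probability of the most likely path ending in state j at step k
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for k in range(1, len(observations)):
        best.append({})
        back.append({})
        for j in states:
            prob, prev = max(
                (best[k - 1][i] * trans_p[i][j] * emit_p[j][observations[k]], i)
                for i in states)
            best[k][j] = prob       # answer for state j at step k
            back[k][j] = prev       # remember which previous state gave it
    # pick the best final state, then trace back through the saved pointers
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for k in range(len(observations) - 1, 0, -1):
        path.insert(0, back[k][path[0]])
    return path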
