I am new to reinforcement learning; I have only read the first few chapters of Sutton & Barto, so my theoretical background is limited.
I am trying to solve a combinatorial optimization problem that boils down to this:
I am looking for the optimal configuration of points (qubits) on a grid (quantum computer).
I already have a cost function to evaluate a configuration, and I also have a reward function.
Right now I am using simulated annealing, where I randomly move a qubit or swap two qubits.
However, this approach does not work well for more than 30 qubits.
That's why I thought of learning a policy that tells me which qubit to move or swap, instead of picking randomly.
Reading the Gym documentation, I couldn't figure out which option I should use. As far as I understand, I don't need Q-learning or deep reinforcement learning, since I only need to learn a policy?
I would also be fine using PyTorch or anything else. With this little information, what do you recommend choosing? More importantly, how can I define my own value function?
There are two broad categories of RL algorithms.
The first category, which includes Q-learning, Deep Q-learning and related methods, learns a value function that, for a given state and action, predicts the expected return you will get. Once you know, for each state and each action, what that value is, the policy is simply to select in each state the action with the highest value. So with these algorithms, even though you learn a value function, the policy is derived from that value function.
Then there are the algorithms that learn a policy directly, such as REINFORCE or Actor-Critic methods. You may still learn a value function, but at the same time you also learn a policy with its help. The value function helps the agent learn the policy during training, but at test time you no longer use the value function, only the policy.
So in the first case you learn a value function and act greedily on it, and in the second case you learn both a value function and a policy, and then use the policy to navigate in the environment.
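To make the distinction concrete, here is a minimal sketch of the two decision rules in PyTorch; the tiny linear networks and the random state are made-up placeholders, not anything from the question.

import torch
import torch.nn as nn

n_features, n_actions = 8, 4
state = torch.randn(n_features)                 # dummy state, only for illustration
q_net = nn.Linear(n_features, n_actions)        # stands in for a trained Q-network
policy_net = nn.Linear(n_features, n_actions)   # stands in for a trained policy network

# Value-based: the network outputs one Q-value per action;
# the policy is simply a greedy argmax over those values.
q_values = q_net(state)
greedy_action = torch.argmax(q_values).item()

# Policy-based: the network outputs action preferences;
# the agent samples an action from the resulting distribution.
probs = torch.softmax(policy_net(state), dim=-1)
sampled_action = torch.multinomial(probs, num_samples=1).item()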
In the end, both families of algorithms should work for your problem, and since you say you are new to RL, you could start with the Deep Q-learning example from the Gym documentation.
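Whichever algorithm you pick, most libraries expect the problem to be wrapped as a Gym-style environment, so the reward (built from your cost function) lives inside it. The following is only a sketch under assumptions: the grid is flattened, an action moves one qubit to one cell (swapping if the cell is occupied), my_cost stands in for your existing cost function, and the reward is the per-move improvement in cost. Depending on your Gym/Gymnasium version, reset() and step() have slightly different signatures.

import numpy as np
import gym
from gym import spaces

def my_cost(positions):
    # placeholder for the existing cost function from the question
    return float(np.sum(positions))

class QubitPlacementEnv(gym.Env):
    # state: cell index of each qubit; action: (qubit, target cell)
    def __init__(self, n_qubits=30, grid_size=6):
        super().__init__()
        self.n_qubits, self.grid_size = n_qubits, grid_size
        n_cells = grid_size * grid_size
        self.action_space = spaces.MultiDiscrete([n_qubits, n_cells])
        self.observation_space = spaces.MultiDiscrete([n_cells] * n_qubits)

    def reset(self):
        n_cells = self.grid_size * self.grid_size
        self.positions = np.random.choice(n_cells, self.n_qubits, replace=False)
        return self.positions.copy()

    def step(self, action):
        qubit, cell = action
        old_cost = my_cost(self.positions)
        if cell in self.positions:                     # target cell occupied: swap
            other = int(np.where(self.positions == cell)[0][0])
            self.positions[other] = self.positions[qubit]
        self.positions[qubit] = cell
        reward = old_cost - my_cost(self.positions)    # improvement in cost
        done = False                                   # or stop after a fixed move budget
        return self.positions.copy(), reward, done, {}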
Related
For my first project in reinforcement learning I'm trying to train an agent to play a real-time game. The environment changes constantly, so the agent needs to be precise about its timing. To get a consistent sequence of decisions, I figured the agent will have to operate at a fixed frequency. By that I mean that if the agent runs at 10 Hz, it has to take inputs every 0.1 s and make a decision. However, I couldn't find any sources on this problem, probably because I'm not using the correct terminology in my searches. Is this a valid way to approach the matter? If so, what can I use? I'm working with Python 3 on Windows (the game only runs on Windows); are there any libraries that could be used? I'm guessing time.sleep() is not a viable way out, since it isn't very precise at high frequencies and it just freezes the agent.
EDIT: So my main questions are:
a) Should I use a fixed frequency? Is this a normal way to operate a reinforcement learning agent?
b) If so, what libraries do you suggest?
There isn't a clear answer to this question; it depends on several factors, such as the inference time of your model, the maximum control rate the environment accepts, and the control rate required to solve the environment.
As you are trying to play a game, I am assuming that your eventual goal might be to compare the performance of the agent with the performance of a human.
If so, a good approach would be to select a control rate similar to what humans use in the same game, which is most likely lower than 10 Hz.
You could measure how many actions per second you take when playing yourself to get a good estimate.
However, any reasonable frequency, such as the 10 Hz you suggested, should be a good starting point for your agent.
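As for the timing itself, a common pattern is to schedule each step against a monotonic clock instead of sleeping a fixed amount, so small delays do not accumulate. This is only a sketch; get_observation and apply_action are hypothetical stand-ins for whatever interface you have to the game.

import time

CONTROL_HZ = 10
PERIOD = 1.0 / CONTROL_HZ

def run_agent(agent, get_observation, apply_action, n_steps=1000):
    # run the agent at a (roughly) fixed control rate
    next_tick = time.monotonic()
    for _ in range(n_steps):
        obs = get_observation()           # read the current game state
        action = agent(obs)               # let the agent decide
        apply_action(action)              # send the action to the game
        next_tick += PERIOD
        delay = next_tick - time.monotonic()
        if delay > 0:
            time.sleep(delay)             # sleep only for the remainder of this tick
        else:
            next_tick = time.monotonic()  # running late: resynchronize instead of drifting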
#Edit:
I'm trying to create an agent to play the game of Tetris, using a convolutional nnet that takes the board state + current piece as input. From what I've read, Deep Q-learning is not very good at this, which I just confirmed.
#end Edit
Suppose that an agent is learning a policy to play a game, where each game step can be represented as
s, a, r, s', done
representing
state, action, reward, next state, game over
In the Deep Q-learning algorithm, the agent is in state s and takes some action a (following an epsilon-greedy policy), observes a reward r and gets to the next state s'.
The agent acts like this:
# returns an action index
get_action(state, epsilon):
    if random() < epsilon:
        return random_action_index
    else:
        return argmax(nnet.predict(state))
The parameters are updated by greedily observing the max Q-value in state s', so we have
# action taken was 'a' in state 's' leading to 's_'
prediction = nnet.predict(s)
if done:
    target = reward
else:
    target = reward + gamma * max(nnet.predict(s_))
prediction[a] = target
The [prediction, target] pair is fed to the nnet for a weight update. So this nnet gets a state s as input and outputs a vector of Q-values with dimension n_actions. This is all clear to me.
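For reference, the same update written out in PyTorch could look roughly like this; the small network, the hyperparameters and the single-transition update (no replay buffer or target network) are illustrative assumptions, not the asker's code.

import torch
import torch.nn as nn

n_features, n_actions, gamma = 16, 4, 0.99
q_net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(s, a, r, s_, done):
    # one Q-learning step on a single transition
    # s, s_ : float tensors of shape (n_features,); a : int; r : float
    q_pred = q_net(s)[a]                                   # Q(s, a)
    with torch.no_grad():
        target = torch.as_tensor(r, dtype=torch.float32)
        if not done:
            target = target + gamma * q_net(s_).max()
    loss = nn.functional.mse_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()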
Now, suppose that my state-action values are so noisy that this approach simply will not work. So, instead of outputting a vector of dimension n_actions, my nnet outputs a single value, representing the "state quality" (how desirable that state is).
Now my agent acts like this:
# returns an action based on how good the next state is
get_action(state, epsilon):
    actions = []
    for each action possible in state:
        copy = game.deepcopy()               # simulate on a copy, not on the live game
        copy.apply(action)
        val = nnet.predict(copy.get_state())
        action.value = val
        actions.append(action)
    if random() < epsilon:
        return randomChoice(actions)
    else:
        return action with_max_value from actions
And my [prediction, target] is like this:
# action taken was 'a' in state 's' leading to 's_'
prediction = nnet.predict(s)
if done:
    target = reward
else:
    target = reward + gamma * nnet.predict(s_)
I have some questions regarding this second algorithm:
a) Does it make sense to act non-greedily sometimes?
Intuitively no, because if I land in a bad state, it was probably because of a bad random action, not because the previous state was 'bad'. The Q-learning update will adjust the bad action, but this second algorithm will wrongly penalize the previous state.
b) What kind of algorithm is this? Where does it fit in reinforcement learning?
c) In the case of Tetris, the state almost never repeats, so what can I do in this case? Is that the reason Deep Q-learning fails here?
This may look confusing, but the algorithm actually works. I can provide extra details if necessary. Thank you!
Now, suppose that my state-action values are so noisy that this approach simply will not work. So, instead of outputting a vector of dimension n_actions, my nnet outputs a single value, representing the "state quality" (how desirable that state is).
Now my agent acts like this:
# returns an action based on how good the next state is
get_action(state, epsilon):
    actions = []
    for each action possible in state:
        game.apply(action)
        val = nnet.predict(game.get_state())
        action.value = val
        actions.append(action)
    if random() < epsilon:
        return randomChoice(actions)
    else:
        return action with_max_value from actions
First a quick note on that pseudocode: I don't think that would work, because you don't simulate the effects of the different actions on copies of the game state, but just on the game state directly. You'd probably want to create separate copies of the game state first, and run each action once on a different copy.
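Concretely, that would mean something like the sketch below; legal_actions, apply and get_state are hypothetical stand-ins for whatever your Tetris implementation provides.

import copy

def evaluate_actions(game, nnet):
    # score every legal action by simulating it on its own copy of the game
    scores = {}
    for action in game.legal_actions():
        sim = copy.deepcopy(game)          # leave the real game untouched
        sim.apply(action)
        scores[action] = nnet.predict(sim.get_state())
    return scores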
Anyway, this kind of algorithm is generally assumed not to be possible in Reinforcement Learning settings. In RL, we generally operate under the assumption that we don't have a "simulator" or "forward model" or anything like that. We normally assume that we have an agent situated in a real environment in which we can generate experience that can be used to learn from. Under this assumption, we cannot implement that kind of "for each action possible in state" loop, which simulates what would happen if we were to execute different actions in the same game state. The assumption is that we first have to pick one action, execute it, and then learn from that particular trajectory of experience; we can no longer "go back", imagine that we had selected a different action, and also learn from that trajectory.
Of course, in practice doing this is often possible, because we often do actually have a simulator (for example a robot simulator, or a game, etc.). The majority of research in RL still makes the assumption that we do not have such a simulator, because this leads to algorithms that may eventually be useable in "real-world" situations (e.g., real-world physical robots). Implementing the idea you describe above actually means you're moving more towards search algorithms (such as Monte-Carlo Tree Search), rather than Reinforcement Learning algorithms. It means that your approach is limited to scenarios in which you do have a simulator available (e.g., games).
a) Does it make sense to act non-greedily sometimes?
Even though you include the search-like process of looping through all actions and simulating their effects, I suspect you'll still need some form of exploration if you want to converge to good policies, so you'll have to act non-greedily sometimes. Yes, this will cause your algorithm to converge to something different from the traditional interpretation of an "optimal policy". That is not much of a problem if your epsilon is fairly low, though; in practice the learned policy will likely just tend to be slightly "safer". See also my answer to this other question.
b) What kind of algorithm is this? Where does it fit in reinforcement learning?
Aside from my discussion above on how this actually moves a bit towards the domain of Search algorithms rather than RL algorithms, it also looks to me like this would be an on-policy algorithm rather than an off-policy algorithm (standard Q-learning is off-policy, because it learns about the greedy policy whilst generating experience through a non-greedy policy). This distinction is also discussed in detail by most of the answers to the question I linked to above.
I want to model the process of selling seats on an airplane as an MDP (Markov decision process) and use reinforcement learning for airline revenue optimization. For that I need to define the states, actions, policy, value and reward. I have thought about it a bit, but I think there is still something missing.
I model my system this way:
States = (r, c), where r is the number of passengers and c the number of seats bought, so r >= c.
Actions = (p1, p2, p3), the three possible prices. The objective is to decide which of them gives more revenue.
Reward: revenue.
Could you please tell me what you think and help me?
After the modelling, I have to implement all of that with reinforcement learning. Is there a package that does the work?
I think the biggest thing missing from your formulation is the sequential part. Reinforcement learning is useful for sequential problems, where the next state depends on the current state (hence "Markovian"). In your formulation, you have not specified any Markovian behaviour at all. Also, the reward is a scalar that depends on the current state, or on the combination of current state and action. In your case, the revenue depends on the price (the action), but it has no relation to the state (the seats). These are the two big problems I see with your formulation; there are others as well. I suggest you go through some RL theory (online courses and such) and work through a few sample problems before trying to formulate your own.
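To illustrate what a sequential formulation could look like, here is a toy sketch; the state (seats left, days until departure), the three prices and the demand curve are all made-up assumptions, purely to show that the next state follows from the current state and the action.

import numpy as np

PRICES = [100.0, 150.0, 200.0]          # assumed candidate prices

def step(state, action, rng):
    # one day of sales in a toy dynamic-pricing MDP
    # state = (seats_left, days_left); action = index into PRICES
    seats_left, days_left = state
    price = PRICES[action]
    expected_demand = max(0.0, 10.0 - 0.04 * price)   # assumed demand curve
    sold = min(rng.poisson(expected_demand), seats_left)
    reward = sold * price                             # revenue earned today
    next_state = (seats_left - sold, days_left - 1)   # depends on the current state
    done = next_state[0] == 0 or next_state[1] == 0
    return next_state, reward, done

# example: one step from 50 free seats, 30 days before departure, at the middle price
print(step((50, 30), 1, np.random.default_rng(0)))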
I am currently reading "Reinforcement Learning" from Sutton & Barto and I am attempting to write some of the methods myself.
Policy iteration is the one I am currently working on. I am trying to use OpenAI Gym for a simple problem, such as CartPole or continuous mountain car.
However, for policy iteration I need both the transition matrix between states and the reward matrix.
Are these available from the 'environment' that you build in OpenAI Gym?
I am using python.
If not, how do I calculate these values, and use the environment?
No, OpenAI Gym environments will not provide you with that information in that form. To collect it you will need to explore the environment by sampling, i.e. selecting actions and receiving observations and rewards. From these samples you can estimate the transition probabilities and expected rewards.
One basic way to approximate these values is LSPI (least-squares policy iteration); as far as I remember, you will find more about this in Sutton too.
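For a small discrete environment you can also build empirical estimates yourself by counting transitions. A rough sketch, assuming the classic Gym API where step() returns (obs, reward, done, info); newer Gym/Gymnasium versions return five values.

import numpy as np

def estimate_model(env, n_samples=100_000):
    # estimate P[s, a, s'] and R[s, a] for a discrete env by random sampling
    nS, nA = env.observation_space.n, env.action_space.n
    counts = np.zeros((nS, nA, nS))
    reward_sum = np.zeros((nS, nA))
    s = env.reset()
    for _ in range(n_samples):
        a = env.action_space.sample()
        s_, r, done, _ = env.step(a)
        counts[s, a, s_] += 1
        reward_sum[s, a] += r
        s = env.reset() if done else s_
    visits = counts.sum(axis=2, keepdims=True)               # times (s, a) was tried
    P = np.divide(counts, visits, out=np.zeros_like(counts), where=visits > 0)
    R = np.divide(reward_sum, visits[:, :, 0],
                  out=np.zeros_like(reward_sum), where=visits[:, :, 0] > 0)
    return P, R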
See these comments at toy_text/discrete.py:
P: transitions (*)
(*) dictionary dict of dicts of lists, where
P[s][a] == [(probability, nextstate, reward, done), ...]
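So for the built-in discrete environments you can read the model directly instead of estimating it. For example, with FrozenLake (you may need env.unwrapped depending on the Gym version, and the exact env id may differ):

import numpy as np
import gym

env = gym.make("FrozenLake-v1").unwrapped
nS, nA = env.observation_space.n, env.action_space.n

# transition tensor P[s, a, s'] and expected-reward matrix R[s, a] for policy iteration
P = np.zeros((nS, nA, nS))
R = np.zeros((nS, nA))
for s in range(nS):
    for a in range(nA):
        for prob, s_, reward, done in env.P[s][a]:
            P[s, a, s_] += prob
            R[s, a] += prob * reward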
I'd like to improve my little robot with machine learning.
Up to now it uses simple while loops and if/then decisions in its main function to act as a lawn-mowing robot.
My idea is to use scikit-learn (sklearn) for that purpose.
Please help me find the right first steps.
I have a few sensors that tell it about the world outside:
World ={yaw, pan, tilt, distance_to_front_obstacle, ground_color}
I have a state vector
State = {left_motor, right_motor, cutter_motor}
that controls the 3 actuators of the robot.
I'd like to build a dataset of input and output values to teach sklearn the desired behaviour; after that, the input values should produce the correct output values for the actuators.
One example: if the motors are on and the robot should be moving forward, but the distance sensor reports constant values, the robot is probably blocked. It should then decide to back up, turn, and move off in another direction.
First of all, do you think this is possible with sklearn, and second, how should I start?
My (simple) robot control code is here: http://github.com/bgewehr/RPiMower
Please help me with the first steps!
I would suggest using reinforcement learning. Here is a tutorial on Q-learning that fits your problem well.
If you want code in Python: right now I don't think there is an implementation of Q-learning in scikit-learn. However, I can give you some examples of Python code that you could use: 1, 2 and 3.
Also keep in mind that reinforcement learning tries to maximize the sum of all future rewards, so you have to focus on the long-term picture.
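As a rough idea of what tabular Q-learning could look like for a robot like this, here is a sketch; the sensor discretization, the action set and the hyperparameters are assumptions for illustration, not something from your repository.

import random
from collections import defaultdict

ACTIONS = ["forward", "turn_left", "turn_right", "backward"]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
Q = defaultdict(float)                       # Q[(state, action)] -> estimated value

def discretize(distance_to_front_obstacle, ground_color):
    # collapse raw sensor readings into a small discrete state
    blocked = distance_to_front_obstacle < 20
    on_grass = ground_color == "green"
    return (blocked, on_grass)

def choose_action(state):
    # epsilon-greedy action selection
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])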
Good luck :-)
The sklearn package contains a lot of useful tools for machine learning, so I don't think that's a problem; if it were, there are definitely other useful Python packages. I think collecting data for the supervised learning phase will be the challenging part, and I wonder whether it would be smart to lay out a track with tape inside a grid system. That would make it easier to translate the track into labels (x, y positions in the grid). Each cell of the grid should be small if you want to build complex tracks later on. It may also be worth checking how the self-driving Google car project did this.
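If you go the supervised route, the sklearn side is straightforward once you have logged (sensor reading, chosen command) pairs. A minimal sketch; the feature layout, the example rows and the command labels are assumptions, not data from the repository.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# each row: [yaw, pan, tilt, distance_to_front_obstacle, ground_color_id]
X = np.array([
    [0.0, 0, 0, 120, 1],    # clear path ahead
    [0.0, 0, 0,  15, 1],    # obstacle close in front
    [0.5, 0, 0,  80, 0],    # off the grass
])
y = ["forward", "back_up_and_turn", "turn_back_to_grass"]   # logged commands

clf = DecisionTreeClassifier().fit(X, y)

# at runtime, feed the current sensor vector to get a motor command
print(clf.predict([[0.0, 0, 0, 10, 1]]))    # likely "back_up_and_turn"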