Implementing policy iteration methods in OpenAI Gym - Python

I am currently reading "Reinforcement Learning" by Sutton & Barto and I am attempting to implement some of the methods myself.
Policy iteration is the one I am currently working on. I am trying to use OpenAI Gym for a simple problem, such as CartPole or continuous mountain car.
However, for policy iteration, I need both the transition matrix between states and the reward matrix.
Are these available from the 'environment' that you build in OpenAI Gym?
I am using Python.
If not, how do I calculate these values, and use the environment?

No, OpenAI Gym environments will not provide you with the information in that form. In order to collect that information you will need to explore the environment via sampling, i.e. selecting actions and receiving observations and rewards. From these samples you can estimate the transition probabilities and expected rewards.
One basic way to approximate these values is LSPI (least-squares policy iteration); as far as I remember, you will find more about this in Sutton & Barto too.
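A rough sketch of that idea for a small discrete environment, assuming FrozenLake and the classic Gym reset/step API (environment IDs and return signatures vary across Gym/Gymnasium versions):
import numpy as np
import gym

env = gym.make('FrozenLake-v1')
n_states = env.observation_space.n
n_actions = env.action_space.n

counts = np.zeros((n_states, n_actions, n_states))
reward_sums = np.zeros((n_states, n_actions, n_states))

for episode in range(5000):
    s = env.reset()                      # classic API; newer versions return (obs, info)
    done = False
    while not done:
        a = env.action_space.sample()    # explore with random actions
        s_next, r, done, info = env.step(a)
        counts[s, a, s_next] += 1
        reward_sums[s, a, s_next] += r
        s = s_next

# Empirical model: P[s, a, s'] transition probabilities and R[s, a, s'] mean rewards
visits = counts.sum(axis=2, keepdims=True)
P = np.divide(counts, visits, out=np.zeros_like(counts), where=visits > 0)
R = np.divide(reward_sums, counts, out=np.zeros_like(reward_sums), where=counts > 0)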

See these comments at toy_text/discrete.py:
P: transitions (*)
(*) dictionary dict of dicts of lists, where
P[s][a] == [(probability, nextstate, reward, done), ...]
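Since the toy-text environments expose this model directly (often via env.unwrapped.P when the environment is wrapped), exact policy iteration is possible. A rough sketch, assuming FrozenLake and the dictionary layout quoted above:
import numpy as np
import gym

env = gym.make('FrozenLake-v1')
P = env.unwrapped.P                      # P[s][a] == [(prob, next_state, reward, done), ...]
n_states = env.observation_space.n
n_actions = env.action_space.n
gamma = 0.99

policy = np.zeros(n_states, dtype=int)   # arbitrary initial deterministic policy
V = np.zeros(n_states)

while True:
    # Policy evaluation: iterate the Bellman expectation backup to convergence
    while True:
        delta = 0.0
        for s in range(n_states):
            v = sum(p * (r + gamma * V[s2] * (not done))
                    for p, s2, r, done in P[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < 1e-8:
            break

    # Policy improvement: act greedily with respect to the current value function
    stable = True
    for s in range(n_states):
        q = [sum(p * (r + gamma * V[s2] * (not done)) for p, s2, r, done in P[s][a])
             for a in range(n_actions)]
        best = int(np.argmax(q))
        if best != policy[s]:
            policy[s] = best
            stable = False
    if stable:
        break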

Related

How to set your own value function in Reinforcement Learning?

I am new to reinforcement learning; I have only read the first few chapters of R. Sutton's book (so I have a small theoretical background).
I am trying to solve a combinatorial optimization problem which can be broken down as follows:
I am looking for the optimal configuration of points (qubits) on a grid (quantum computer).
I already have a cost function to qualify a configuration. I also have a reward function.
Right now I am using simulated annealing, where I randomly move a qubit or swap two qubits.
However, this ansatz is not working well for more than 30 qubits.
That's why I thought of using a policy that tells me which qubit to move/swap instead of doing it randomly.
Reading the gym documentation, I couldn't find which option I should use. As far as I understood, I don't need Q-learning or deep reinforcement learning, since I only need to learn a policy?
I would also be fine using PyTorch or whatever. With this small amount of information, what do you recommend choosing? More importantly, how can I set my own value function?
There are two categories of RL algorithms.
One category, such as Q-learning, Deep Q-learning and others, learns a value function that, for a given state and action, estimates the return you will get. Once you know this estimate for every state-action pair, the policy is simply to select, in each state, the action with the highest estimated value. Thus, with these algorithms, even though you learn a value function, the policy is derived from that value function.
Then there are other deep RL algorithms where you learn a policy directly, like REINFORCE, Actor-Critic algorithms and others. In many of these you still learn a value function (the critic), but at the same time you also learn a policy with the help of that value function. The value function helps the system learn the policy during training, but at test time you no longer use the value function, only the policy.
Thus, in the first case you learn a value function and act greedily on it, and in the second case you learn both a value function and a policy and then use the policy to navigate the environment.
In the end, both families should work for your problem, and since you say you are new to RL, maybe you could try the Deep Q-learning example from the gym documentation.
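To make the distinction concrete, here is a tiny sketch of how each family turns what it has learned into an action; the table and probabilities below are made up and stand in for learned networks:
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
state = 2

# Value-based: a table Q[s, a] of estimated returns; the policy is greedy.
Q = rng.normal(size=(n_states, n_actions))        # stand-in for a learned Q-function
greedy_action = int(np.argmax(Q[state]))

# Policy-based: the policy itself outputs action probabilities; sample from them.
policy_probs = np.array([0.2, 0.5, 0.3])          # stand-in for a policy network's output
sampled_action = int(rng.choice(n_actions, p=policy_probs))

print(greedy_action, sampled_action)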

Can I create a contextual Multi-Armed Bandit Agent in SB3?

I wonder if it is possible to create an agent equivalent to a contextual Multi-Armed Bandit using the SB3 library.
It seems to me to be a much simpler agent, but the library documentation says they don't cover that kind of algorithm, and I wonder if it is possible to create a similar agent (without trajectory interpretation) by tuning one of the existing agents.
My first approach was to use any agent by assigning a value of gamma=0, but I think that would not be mathematically correct.

Understanding Gym Environment

This isn't specifically about troubleshooting code, but about helping me understand the gym Environment. I am inheriting from gym.Env to create my own environment, but I am having a difficult time understanding the flow. I have looked through the documentation, but there are still questions and concepts that are unclear.
I am still a little foggy on how the agent actually knows what action to take. I know that when you __init__ the class you have to specify whether your actions are Discrete or Box, but how does the agent know what parameters are in its control?
When setting the lower and upper limits for spaces.Box, does that tell the agent how big of a step it can take? For example, if my limits are [-1, 1], can it take any value within that range?
I saw that the limits can be [a, b], (-oo, a], [b, oo), or (-oo, oo). If I need an unbounded observation space, do I just use np.inf?
If there is any documentation that you would recommend, that would be much appreciated.
1.
The agent does not know what the action does; that is where reinforcement learning comes in. To clarify, inside the environment's step(action) method you should verify that the action is valid within the environment and return a reward and new environment state conditional on that action.
If you want to reference these values outside of the environment, however, you can do so and control the available actions the agent can pass in like so:
import gym
env = gym.make('CartPole-v0')
actions = env.action_space.n  # number of discrete actions (2 for CartPole)
Now you can create a network with an output size of 2, use a softmax activation, and take the action with the highest probability as the agent's action.
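A minimal sketch of such a network, assuming PyTorch (the framework, layer sizes and the greedy choice here are just illustrative assumptions):
import gym
import torch
import torch.nn as nn

env = gym.make('CartPole-v0')
n_actions = env.action_space.n            # 2 for CartPole
obs_dim = env.observation_space.shape[0]  # 4 for CartPole

# Arbitrary small policy network: observation in, action probabilities out.
policy_net = nn.Sequential(
    nn.Linear(obs_dim, 32),
    nn.ReLU(),
    nn.Linear(32, n_actions),
    nn.Softmax(dim=-1),
)

obs = torch.as_tensor(env.reset(), dtype=torch.float32)  # classic Gym API; newer versions return (obs, info)
probs = policy_net(obs)
action = int(torch.argmax(probs))         # greedy choice for illustration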
2.
The spaces are used for internal environment validation. For example, observation_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32) means that the maximum value the agent will see for any variable is 1 and the minimum is -1. You should also use these bounds inside the step() method to make sure the returned observations stay within them.
This is primarily important so that others who use your environment can identify at a glance what kind of network they will need in order to interface with it.
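For example, a custom environment's step() might validate the action and clip what it returns to the declared bounds. A hypothetical minimal sketch (the dynamics and reward are made up):
import numpy as np
import gym
from gym import spaces

class MyEnv(gym.Env):
    """Hypothetical minimal environment used only to illustrate bound checking."""

    def __init__(self):
        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
        self._state = np.zeros(1, dtype=np.float32)

    def reset(self):
        self._state = np.zeros(1, dtype=np.float32)
        return self._state

    def step(self, action):
        assert self.action_space.contains(action), "invalid action"
        # Made-up dynamics; the point is clipping the observation to the declared Box.
        self._state = self._state + (0.1 if action == 1 else -0.1)
        self._state = np.clip(self._state, self.observation_space.low,
                              self.observation_space.high).astype(np.float32)
        reward = float(-abs(self._state[0]))
        done = False
        return self._state, reward, done, {}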
3.
Yes. np.inf and -np.inf
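For instance, an unbounded observation space can be declared like this (the shape is arbitrary):
import numpy as np
from gym import spaces

# Unbounded in every dimension; the shape here is arbitrary.
observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32)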

How should I define the observation_space for my custom OpenAI environment?

I'm currently working on a custom Gym environment that represents a network graph (with nodes and links), and I am struggling to determine what the observation_space variable of my environment should look like. I don't plan on using a graphical representation of my environment (meaning that the render() method will only use the terminal).
I looked for answers on the OpenAI GitHub page, and I found this issue. However, I still don't understand what my observation_space variable should look like.
My gym environment is currently looking like this.
TL;DR:
the current state is in fact the node on which the agent is located
the current state is a character
the list of the possible states is listed explicitly in the constructor
Moreover, I plan on using Q-learning algorithms to exploit this graph: should I discretise the observation_space? I plan on using an RL algorithm like this one.
How should I represent my observation_space?
Thanks in advance!
In a Gym environment, the observation space represents all the possible observations that can be returned by the step() method. I took a look at your environment code and it looks to me like your observation space is the list of nodes of your graph. In this case, you would have to extend the gym.spaces.Space class, since there is no "list" space in default Gym.
Your observation is just one discrete piece of information: which node the agent is on. So you could use a discrete space that has one number for every possible node.
Could look something like this:
self.observation_space = spaces.Discrete(len(node_array))
It sounds like you need a mapping from numbers to characters but that shouldn't be hard.
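A hypothetical sketch of such a mapping (the node labels are invented):
from gym import spaces

# Invented node labels standing in for the nodes of the question's graph.
nodes = ['a', 'b', 'c', 'd']
node_to_index = {node: i for i, node in enumerate(nodes)}
index_to_node = {i: node for node, i in node_to_index.items()}

observation_space = spaces.Discrete(len(nodes))

# The environment works with characters internally but returns indices:
current_node = 'c'
observation = node_to_index[current_node]   # 2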
I guess it gets more complicated if the number of nodes changes or you need a VecEnv with environments of different node counts. But it sounds like you know the node count, and DQN doesn't use parallelized environments as far as I remember, so it should be OK.
The action space could be more problematic when not all nodes have the same number of neighbors. In principle you again just have a discrete space, with one number per path to take. But the number of possible paths changes, and I don't know how to handle dynamic spaces with the Gym Env interface and standard RL libraries. A possible workaround could be an upper bound on the number of actions, perhaps combined with an array that masks the impossible actions, as sketched below.
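One way to realise that workaround, sketched with invented numbers: size the action space to the maximum node degree and keep a boolean mask of the currently valid actions.
import numpy as np
from gym import spaces

# Invented example: the action space is sized to the maximum degree in the graph.
max_degree = 4
action_space = spaces.Discrete(max_degree)

# Suppose the current node only has 2 outgoing edges; mask the rest.
valid_actions = np.array([True, True, False, False])

def masked_argmax(q_values, mask):
    """Pick the highest-valued action among the valid ones."""
    q = np.where(mask, q_values, -np.inf)
    return int(np.argmax(q))

q_values = np.array([0.1, 0.5, 0.9, 0.3])          # made-up Q estimates
action = masked_argmax(q_values, valid_actions)    # 1, since action 2 is masked out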

Useful packages to create online prediction tool with Python and R (example provided)

I am building a Cox PH statistical model to predict the probability of relapse for breast cancer patients. I would like to create an online interface for patients or doctors to use that would allow them to input the relevant patient characteristics, and compute the probability of relapse. Here is a perfect example, albeit for prostate cancer:
http://nomograms.mskcc.org/Prostate/PostRadicalProstatectomy.aspx
My basic plan is to create the tool with python, and compute the probability with R based on the user's inputs and from my previously fitted Cox PH model. The tool would require both drop-down boxes and user-inputted numerical values. The problem is I've never done any web programming or GUI programming with Python. My experience is with the scientific programming side of Python (e.g. Pylab, etc). My questions are:
What relevant packages for Python and R will I need? From some Googling I've done it seems that RPy and Tkinter are clear choices.
How can I store the statistical model such that the tool doesn't have to compute the model from my data set every time someone uses it? In the case of a Cox PH model, this would require storing the baseline hazard and the model formula.
Any useful tips from your experience?
I really appreciate your help!
Basically you need to learn WebDev, which is a pretty massive topic. If you are serious about making this a webapp, Django is one of the easiest places to start, and it's also fantastically documented. So essentially my answer would be:
http://djangobook.com/en/2.0/
start reading.
Apart from using R through RPy or equivalent, there are a number of survival analysis routines in the statsmodels (formerly scipy.statsmodels) Python library. They are in the "sandbox" package though, meaning they aren't considered ready for production right now.
E.g. you have the Cox model of proportional hazard coded here.
See also this question on CrossValidated.
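For what it's worth, current statsmodels versions expose a Cox proportional hazards model as PHReg; a minimal sketch with made-up data and column names:
import pandas as pd
import statsmodels.api as sm

# Made-up survival data; the column names are invented for the example.
df = pd.DataFrame({
    'time':       [5.0, 8.2, 3.1, 12.7, 6.4, 9.8, 4.5, 11.2],  # follow-up time
    'event':      [1, 0, 1, 1, 0, 1, 0, 1],                    # 1 = relapse observed, 0 = censored
    'age':        [61, 45, 70, 52, 66, 58, 49, 73],
    'tumor_size': [2.1, 1.4, 3.3, 2.8, 1.9, 2.5, 1.1, 3.0],
})

# Cox proportional hazards via statsmodels; status marks observed events.
model = sm.PHReg(df['time'], df[['age', 'tumor_size']], status=df['event'])
result = model.fit()
print(result.summary())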
I suggest fitting your model in R and using the 'DynNom' package (its 'DNbuilder' function). It will create an interactive tool for your prediction model that you can share as a webpage without any web programming or GUI programming skills. It amounts to one line of code after fitting your model, for example:
library(survival)  # provides coxph() and the lung dataset
fit1 <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)
library(DynNom)
DNbuilder(fit1)
You can easily share it via an account on http://shinyapps.io/ or host it on your own website (which needs more effort).
