Understanding Gym Environment - python

This isn't specifically about troubleshooting code, but about helping me understand the gym Environment. I am inheriting from gym.Env to create my own environment, but I am having a difficult time understanding the flow. I have looked through the documentation, but there are still questions and concepts that are unclear.
I am still a little foggy on how the agent actually knows which action to control. I know that when you __init__ the class you have to specify whether your actions are Discrete or Box, but how does the agent know which parameters are under its control?
When determining the lower and upper limits for the spaces.Box command, does that tell the agent how big of a step it can take? For example, if my limits are [-1,1], can it apply any value within that domain?
I saw that the limits can be [a,b], (-oo,a], [b,oo), or (-oo,oo). If I need unbounded limits for my observation space, do I just use the np.inf command?
If there is any documentation that you would recommend, that would be much appreciated.

1.
The agent does not know what the action does; that is where reinforcement learning comes in. To clarify, whenever you use the environment's step(action) method, you should verify that the action is valid within the environment and return a reward and environment state conditional on that action.
If you want to reference these values outside of the environment, however, you can do so and control the available actions the agent can pass in like so:
import gym
env = gym.make('CartPole-v0')
actions = env.action_space.n #Number of discrete actions (2 for cartpole)
Now you can create a network with an output shape of 2, using a softmax activation and taking the maximum probability to determine the agent's action.
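For concreteness, here is a minimal sketch of that idea, assuming PyTorch for the network (any framework works; the hidden layer size is arbitrary):
import gym
import torch
import torch.nn as nn

env = gym.make('CartPole-v0')
n_obs = env.observation_space.shape[0]   # 4 for CartPole
n_actions = env.action_space.n           # 2 for CartPole

policy = nn.Sequential(
    nn.Linear(n_obs, 32),
    nn.ReLU(),
    nn.Linear(32, n_actions),
    nn.Softmax(dim=-1),                  # output shape matches the action space
)

obs = env.reset()
probs = policy(torch.as_tensor(obs, dtype=torch.float32))
action = int(torch.argmax(probs))        # take the maximum-probability action
obs, reward, done, info = env.step(action)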
2.
The spaces are used for internal environment validation. For example, observation_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32) means that the maximum value the agent will see for any variable is 1, and the minimum is -1. So you should also use these inside the step() method to make sure the environment stays within these bounds.
This is primarily important so that others who use your environment can identify at a glance what kind of network they will need in order to interface with your environment.
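Here is a hypothetical minimal environment sketching that pattern: the Box bounds declared in __init__ are also enforced inside step() (the dynamics and reward are placeholders, not a real model):
import gym
from gym import spaces
import numpy as np

class MyEnv(gym.Env):
    def __init__(self):
        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
        self.state = np.zeros(1, dtype=np.float32)

    def reset(self):
        self.state = np.zeros(1, dtype=np.float32)
        return self.state

    def step(self, action):
        assert self.action_space.contains(action)    # validate the action
        delta = 0.1 if action == 1 else -0.1
        self.state = np.clip(self.state + delta,
                             self.observation_space.low,
                             self.observation_space.high).astype(np.float32)  # stay in bounds
        reward = float(-abs(self.state[0]))           # placeholder reward
        done = False
        return self.state, reward, done, {}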
3.
Yes. Use np.inf and -np.inf as the bounds.
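For example, finite and infinite bounds can be mixed per component:
import numpy as np
from gym import spaces

# first component unbounded, second in [0, oo), third in (-oo, 1]
observation_space = spaces.Box(low=np.array([-np.inf, 0.0, -np.inf]),
                               high=np.array([np.inf, np.inf, 1.0]),
                               dtype=np.float32)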

Related

What is the best approach to obtain initial values for a GEKKO optimisation problem?

I have a dynamic control problem with about 35 independent variables, many intermediate variables and a few equations, mainly to enforce a sensible mass balance by limiting certain dependent variables (representing stream flows) to be positive.
Initially the variables were declared using the m.Var() constructor, but I subsequently upgraded them to MV and CV variables to capitalize on the flexibility of tuning attributes such as COST, DCOST, etc. that these classes add.
I noticed that IPOPT (3.12) does not produce a solution (error reported: EXIT: Converged to a point of local infeasibility. Problem may be infeasible) when a set of variables is configured as MVs, yet when one is instantiated as a Var a successful solution is returned. I re-instantiated the variable as an MV and systematically removed constraints to try to pinpoint the constraining equation. I discovered that the set of initial conditions I provided for the independent variables constituted an infeasibility (it resulted in a value of -0.02 on a stream that has a positivity constraint on it). Although RTOL could probably be used to solve the problem in this case, I do not think it is the correct general approach. I have tried COLDSTART=2 but do not know how to interpret the presolve.txt file that is generated.
Firstly, is there some standard functionality to assist with this situation, or should one make sure the initial guesses represent a feasible solution?
Secondly, why would the inability to produce a successful solution only manifest when the variable is declared as an MV as opposed to the less decorated Var?
The m.Var() creates an additional degree of freedom for the optimizer while m.Param() creates a quantity that is determined by the user. The m.Var() types can be upgraded to m.SV() as state variables or m.CV() as controlled variables. The m.Param() type is upgraded to m.FV() for fixed values or m.MV() for manipulated variables. If the STATUS parameter is turned on for those types then they also become degrees of freedom. More information on FV, MV, SV, and CV types is given in the APMonitor documentation and Gekko Documentation. The problem likely becomes feasible because of the additional degree of freedom. Try setting the m.MV() to an m.SV() to retain the degree of freedom from the m.Var() declaration.
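As an illustration of the type differences, here is a small Gekko sketch (the variable names, bounds, equation, and objective are hypothetical, not taken from the question):
from gekko import GEKKO

m = GEKKO(remote=False)
u = m.MV(value=0, lb=0, ub=10)     # manipulated variable: a degree of freedom
u.STATUS = 1                       # only when STATUS is turned on
u.COST = 0.01                      # tuning attributes (COST, DCOST, ...) live on MV/CV types
x = m.SV(value=1)                  # state variable: keeps the free-variable behaviour of m.Var()
flow = m.Var(lb=0)                 # plain variable with a positivity bound
m.Equation(flow == x + u)          # placeholder mass-balance style equation
m.Minimize((flow - 5)**2)          # placeholder objective
m.options.IMODE = 3                # steady-state optimization
m.solve(disp=False)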
Initial solutions are often easiest to obtain from a steady-state simulation. There are additional details in this paper:
Here is a flowchart that I typically use:
There are additional details in the paper on how COLDSTART options work. If the solver reports a successful solution, then there should be no constraint violations.
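Sketched below is a typical initialization sequence in Gekko: a steady-state simulation first, then the dynamic problem, optionally with a COLDSTART pre-solve. The model, horizon, and objective here are hypothetical placeholders:
import numpy as np
from gekko import GEKKO

m = GEKKO(remote=False)
u = m.MV(value=1, lb=0, ub=10)
x = m.Var(value=1, lb=0)
m.Equation(x.dt() == -x + u)

m.options.IMODE = 1                # steady-state simulation for a consistent starting point
m.solve(disp=False)

m.time = np.linspace(0, 10, 21)    # time horizon is required for dynamic modes
u.STATUS = 1                       # let the optimizer move u
m.Minimize((x - 2)**2)             # placeholder tracking objective
m.options.IMODE = 6                # dynamic control
m.options.COLDSTART = 2            # initialization pre-solve (this writes the presolve diagnostics)
m.solve(disp=False)
m.options.COLDSTART = 0            # then solve the actual problem from the initialized point
m.options.TIME_SHIFT = 0           # keep the initialized values
m.solve(disp=False)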

How to set your own value function in Reinforcement learning?

I am new to reinforcement learning; I have only read the first few chapters of R. Sutton (so I have a small theoretical background).
I try to solve a combinatorial optimization problem which can be broken down to:
I am looking for the optimal configuration of points (qubits) on a grid (quantum computer).
I already have a cost function to qualify a configuration. I also have a reward function.
Right now I am using simulated annealing, where I randomly move a qubit or swap two qubits.
However, this ansatz is not working well for more than 30 qubits.
That's why I thought of using a policy, which tells me which qubit to move/swap instead of doing it randomly.
Reading the gym documentation, I couldn't find which option I should use. As far as I understood, I don't need Q-learning or deep reinforcement learning, since I only need to learn a policy?
I would also be fine using PyTorch or whatever. With this small amount of information, what do you recommend choosing? More importantly, how can I set my own value function?
There are two categories of RL algorithms.
One category, like Q-learning, Deep Q-learning and others, learns a value function that, for a state and an action, predicts the expected return you will get. Then, once you know for each state and each action what that return is, your policy is simply to select, in each state, the action with the largest value. Thus, with these algorithms, even though you learn a value function, the policy is derived from that value function.
Then you have other deep RL algorithms where you learn a policy directly, like REINFORCE, Actor-Critic algorithms and others. You typically still learn a value function, but at the same time you also learn a policy with the help of that value function. The value function helps the system learn the policy during training, but at test time you no longer use the value function, only the policy.
Thus, in the first case you learn a value function and act greedily on it, and in the second case you learn both a value function and a policy and then use the policy to navigate the environment.
In the end, both families should work for your problem, and since you are new to RL, maybe you could start with the Deep Q-learning example from the gym documentation.
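To make the distinction concrete, here is a tiny illustration with made-up numbers (not a working agent): value-based methods act greedily on Q(s, a), while policy-based methods sample from a learned distribution pi(a|s).
import numpy as np

q_values = np.array([0.2, 1.3, 0.7])       # Q(s, a) for 3 actions in some state s
greedy_action = int(np.argmax(q_values))   # value-based: pick the highest-value action

policy_probs = np.array([0.1, 0.6, 0.3])   # pi(a | s) from a policy network
sampled_action = np.random.choice(len(policy_probs), p=policy_probs)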

How could I define the observation_space for my custom OpenAI environment?

I'm currently working on a custom Gym environment that represents a network graph (with nodes and links), and I am struggling to determine what the observation_space variable of my environment should look like. I don't plan on using a graphical representation of my environment (meaning that the render() method will only use the terminal).
I looked for answers on the OpenAI GitHub page, and I found this issue. However, I still don't understand what my observation_space variable should look like.
My gym environment is currently looking like this.
TL;DR:
the current state is in fact the node on which the agent is located
the current state is a character
the list of the possible states is given explicitly in the constructor
Moreover, I plan on using Q-learning algorithms to exploit this graph: should I discretise the observation_space? I plan on using an RL algorithm like this one.
How should I represent my observation_space?
Thanks in advance!
In a Gym environment, the observation space represents all the possible observations that can be returned by the step() method. I took a look at your environment code, and to me it looks like your observation space is the list of nodes of your graph. In that case, you would have to extend the gym.spaces.Space class, since there is no "list" space in default Gym.
Your observation is just one discrete piece of information: which node. So you could use a Discrete space that has one number for every possible node.
It could look something like this:
self.observation_space = spaces.Discrete(len(node_array))
It sounds like you need a mapping from numbers to characters but that shouldn't be hard.
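A mapping between node characters and indices could be as simple as this (node names are hypothetical):
nodes = ['a', 'b', 'c', 'd']
node_to_id = {name: i for i, name in enumerate(nodes)}   # 'c' -> 2
id_to_node = {i: name for i, name in enumerate(nodes)}   # 2 -> 'c'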
I guess it gets more complicated if the number of nodes changes or you need a VecEnv with Envs of different node counts. But it sounds like you know the node count, and DQN doesn't use parallelized Envs as far as I remember, so it should be OK.
The action space could be more problematic when not all nodes have the same number of neighbors. In principle you again have just a discrete space, with each number standing for one path to take. But the number of possible paths changes, and I don't know how to handle dynamic spaces using a Gym Env and standard RL libraries. A possible workaround is an upper bound on the number of actions, possibly combined with an array that masks impossible actions.
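A rough sketch of that upper-bound-plus-mask workaround (all sizes and values are made up):
import numpy as np
from gym import spaces

max_degree = 5                            # assumed upper bound on neighbors per node
action_space = spaces.Discrete(max_degree)

neighbors = ['b', 'd']                    # neighbors of the current node
mask = np.zeros(max_degree, dtype=bool)
mask[:len(neighbors)] = True              # only the first len(neighbors) actions are valid

q_values = np.random.rand(max_degree)     # e.g. Q-values predicted by a DQN
q_values[~mask] = -np.inf                 # masked actions can never be selected
action = int(np.argmax(q_values))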

Implementing Policy iteration methods in Open AI Gym

I am currently reading "Reinforcement Learning" from Sutton & Barto and I am attempting to write some of the methods myself.
Policy iteration is the one I am currently working on. I am trying to use OpenAI Gym for a simple problem, such as CartPole or continuous mountain car.
However, for policy iteration, I need both the transition matrix between states and the Reward matrix.
Are these available from the 'environment' that you build in OpenAI Gym?
I am using Python.
If not, how do I calculate these values, and use the environment?
No, OpenAI Gym environments will not provide you with the information in that form. In order to collect that information you will need to explore the environment via sampling: i.e. selecting actions and receiving observations and rewards. With these samples you can estimate them.
One basic way to approximate these values is to use LSPI (least-squares policy iteration); as far as I remember, you will find more about this in Sutton too.
See these comments at toy_text/discrete.py:
P: transitions (*)
(*) dictionary dict of dicts of lists, where
P[s][a] == [(probability, nextstate, reward, done), ...]
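For the discrete toy_text environments that expose this table (FrozenLake does; CartPole and the mountain car environments do not), a sketch of reading P and running one sweep of policy evaluation for a uniform random policy could look like this:
import gym
import numpy as np

env = gym.make('FrozenLake-v0')
P = env.unwrapped.P              # P[s][a] == [(probability, nextstate, reward, done), ...]

n_states = env.observation_space.n
n_actions = env.action_space.n
gamma = 0.99
V = np.zeros(n_states)

for s in range(n_states):        # one policy-evaluation sweep
    V[s] = sum((1.0 / n_actions) * prob * (reward + gamma * V[s2])
               for a in range(n_actions)
               for prob, s2, reward, done in P[s][a])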

Modelica Parameter studies with python

I want to run parameter studies in different Modelica building libraries (Buildings, IDEAS) with Python, for example changing the infiltration rate.
I tried: simulateModel and simulateExtendedModel(..."zone.n50", [value])
My questions: Why is it not possible to translate the model and then change the parameter? I get: Warning: Setting zone.n50 has no effect in model. After translation you can only set literal start-values and non-evaluated parameters.
It is also not possible to run simulateExtendedModel. When I go to the command line in Dymola and query zone.n50, I get the actual value (that I have defined in Python), but in the result file (and the plotted variable) it is always the standard n50 value. So my question: how can I change values before running (and translating?) the simulation?
The value for the parameter is also not visible in the variable browser.
Kind regards
It might be a structural parameter; these are also evaluated at translation. It should work if you explicitly set Evaluate=False for the parameter that you want to study.
Is it not visible in the variable browser, or is it just greyed out and constant? If it is not visible at all, you should check whether it is protected.
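As a rough illustration, a sweep through the Dymola Python interface could look like the sketch below; the file path, model name, and stop time are placeholders, and it assumes zone.n50 carries annotation(Evaluate=false) in the Modelica code so that it remains tunable after translation:
from dymola.dymola_interface import DymolaInterface

# Modelica side (in the zone model): parameter Real n50 = 0.6 annotation(Evaluate=false);
dymola = DymolaInterface()
dymola.openModel("path/to/package.mo")                 # placeholder path

for n50 in [0.3, 0.6, 1.0, 2.0]:
    result = dymola.simulateExtendedModel(
        "MyPackage.MyBuildingModel",                   # placeholder model name
        stopTime=86400,
        initialNames=["zone.n50"],
        initialValues=[n50],
        resultFile="result_n50_" + str(n50))
    if not result[0]:                                  # first element is the success flag
        print(dymola.getLastErrorLog())

dymola.close()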
Some parameters cannot be changed after compilation, even with Evaluate=False. This is the case for parameters that influence the structure of the model, for example parameters that influence a discretization scheme and therefore influence the number of equations.
Changing such parameters requires recompiling the model. You can still do this in a parametric study, though; I think you can use ModelicaRes to achieve this (http://kdavies4.github.io/ModelicaRes/modelicares.exps.html).
