I've implemented the gridworld example from the book "Reinforcement Learning: An Introduction", second edition, by Richard S. Sutton and Andrew G. Barto, Chapter 4, sections 4.1 and 4.2, page 80.
Here is my implementation:
https://github.com/ozrentk/dynamic-programming-gridworld-playground
The original algorithm seems to have a bug, since the value function (mapping) is updated one value at a time, in place, in the source mapping structure. Why is that incorrect? It means that inside the loop over each s (of set S), within the same evaluation pass, a later element (e.g. s_2 of set S) is evaluated from an element already updated in that pass (e.g. s_1 of set S), instead of from the values as they were at the start of the iteration. I avoid this problem using the double-buffering technique: an additional buffer holds the new values for set S. It also means the program uses more memory because of that buffer.
I must admit that I'm not 100% sure if this is a bug, or if I misinterpreted the algorithm.
Generally, this is the code I'm using:
...
while True:
    delta = 0
    # NOTE: algorithm modified a bit, an additional buffer new_values is introduced.
    # Barto & Sutton seem to have a bug in their algorithm (the iterative estimation does not fit figure 4.1).
    # Instead of updating one state value at a time inside the loop, we swap the entire state value
    # function mapping outside that loop. Also note that after this change the algorithm consumes more memory.
    new_values = [0] * states_n
    for s in non_terminal_states:
        # Evaluate the state value under the current policy
        next_states = get_next_states(s, policy[s], world_size)
        sum_items = [p * (reward + gamma * values[s_next]) for s_next, p in next_states.items()]
        new_values[s] = sum(sum_items)
        # Track delta
        delta = max(delta, abs(values[s] - new_values[s]))
    # (now we swap the whole state value function buffer, instead of updating single state values in the loop)
    values = new_values
    if use_policy_improvement:
        # Policy improvement is done inside improve_policy(); if the new policy is no better than the
        # old one, the returned is_policy_stable is True
        is_policy_stable, improved_policy = improve_policy()
        if is_policy_stable:
            print("Policy is stable.")
            break
        else:
            print("- Improving policy... ----------------")
            policy = improved_policy
            visualize_policy(policy, states, world_size)
    # In case we don't track policy improvement, we need to track delta for convergence
    if delta < theta:
        break
    # Track iteration count
    k += 1
...
Am I wrong, or is there a problem with the policy evaluation part of the algorithm in the book?
The original algorithm is the "asynchronous" (in-place) version of policy evaluation, and your implementation using two buffers is the "synchronous" version. Both are correct.
The asynchronous version also converges to the optimal solution (you can find the proof in the book Parallel and Distributed Computation: Numerical Methods), and as you may find in the book, it "usually converges faster".
I find that this link provides a good explanation.
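For contrast, here is a minimal sketch of the in-place (asynchronous) sweep, reusing the names from your code (values, non_terminal_states, get_next_states, policy, world_size, reward, gamma); states updated later in the sweep already see the new values of earlier states, so no second buffer is needed:

# In-place (asynchronous) sweep: values is updated directly
delta = 0
for s in non_terminal_states:
    v_old = values[s]
    next_states = get_next_states(s, policy[s], world_size)
    values[s] = sum(p * (reward + gamma * values[s_next])
                    for s_next, p in next_states.items())
    delta = max(delta, abs(v_old - values[s]))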
I'm making a Connect 4 AI in Python, and I'm using minimax with iterative deepening and alpha-beta pruning for this. For greater depths it's still quite slow, so I wanted to implement a transposition table. After reading up on it I think I get the general idea, but I haven't been able to quite make it work. Here's part of my code (the maximizing part of the minimax):
if isMaximizing:
    maxEval = -99999999999
    bestMove = None
    # cache.get(hash(board))  Here's where I'd check to see if the hash is already in the table;
    # if so, I fetched the best move that was stored for that board before.
    # loop through possible moves
    for move in [3, 2, 4, 1, 5, 0, 6]:
        if moves[move] > -1:
            # check if the time limit has been reached, for iterative deepening
            if startTime - time.time() <= -10:
                timeout = True
                return (maxEval, bestMove, timeout)
            if timeout == False:
                board = makeMove((moves[move], move), True, board)  # make the move
                eval = minimax(depth - 1, board, False, alpha, beta, cache, zobTable, startTime, timeout)[0]
                if eval > maxEval:
                    maxEval = eval
                    bestMove = (moves[move] + 1, move)
                board[moves[move] + 1][move] = '_'  # undo the move on the board
                moves[move] = moves[move] + 1  # undo the move in the list of legal moves
                alpha = max(alpha, maxEval)
                if alpha >= beta:
                    break
    # cache.set(hash(board), (eval, value))  Here's where I would store the value and best move for the current board state
    return (maxEval, bestMove, timeout)
Right now I'm hashing the board with the Zobrist hashing method, and I'm using an ordered dict to add the hashed boards to. Under this hash key I've stored the value for the board and the bestMove for that board. Unfortunately this seems to make the algorithm pick bad moves (it worked before). Does anyone know where you should put the board states in the cache, and where you should get them from the cache?
A few points on your approach:
If you want things to be fast, writing efficient code in C or C++ is going to be much faster than Python. I've seen 10-100x performance improvements in this sort of search code by switching away from Python to a good C/C++ implementation. Either way, you should try to write code that avoids allocating memory during search, as this is very expensive. That is to say, you could see better returns from coding more efficiently than from adding a transposition table.
When using Zobrist hashing for a transposition table in game-tree search, you typically do not store the state explicitly; you only check whether the hashes are equal. While there is a small chance of error, it requires far less memory to store just the hash, and with a 64-bit hash the chance of collisions is probably vanishingly small for the types of searches you are doing. (The chances of errors resulting are even lower.)
When you store values in the transposition table, you also need to store the bound information from the search. When you get a value back at a node mid-search, it is either a lower bound on the true value (because the search failed high, with value >= beta), an upper bound on the true value (because the search failed low, with value <= alpha), or the actual value of the node (alpha < value < beta). You need to store this in your transposition table. Then, when you want to re-use the value, you have to check that you can use it given your current alpha and beta bounds. (You can validate this by actually doing the search after finding the value in the transposition table, to see if you get the same value from search that you got in the table.)
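As a concrete illustration of the last two points, here is a minimal sketch in Python. The helper names (tt_store, tt_lookup), the tuple-valued dict cache, and the 6x7 board with 'X'/'O' pieces and '_' for empty are my assumptions, not from the question; the search depth is stored with each entry so a shallower result does not override a deeper one:

import random
from enum import Enum

random.seed(0)
# One 64-bit random number per (row, column, player) triple for a 6x7 board
ZOB_TABLE = [[[random.getrandbits(64) for _ in range(2)]
              for _ in range(7)] for _ in range(6)]

def zobrist_hash(board, players=('X', 'O')):
    # The hash alone is used as the cache key; the board itself is never stored
    h = 0
    for r in range(6):
        for c in range(7):
            if board[r][c] in players:
                h ^= ZOB_TABLE[r][c][players.index(board[r][c])]
    return h

class Bound(Enum):
    EXACT = 0  # alpha < value < beta: exact minimax value
    LOWER = 1  # fail-high (value >= beta): true value is at least `value`
    UPPER = 2  # fail-low (value <= alpha): true value is at most `value`

def tt_store(cache, key, depth, value, alpha_orig, beta, best_move):
    # Classify the result using the window that was in effect at this node
    if value <= alpha_orig:
        bound = Bound.UPPER
    elif value >= beta:
        bound = Bound.LOWER
    else:
        bound = Bound.EXACT
    cache[key] = (depth, value, bound, best_move)

def tt_lookup(cache, key, depth, alpha, beta):
    # Return a value usable under the current window, or None
    entry = cache.get(key)
    if entry is None or entry[0] < depth:
        return None
    _, value, bound, _ = entry
    if bound == Bound.EXACT:
        return value
    if bound == Bound.LOWER and value >= beta:
        return value  # still causes a cutoff with the current window
    if bound == Bound.UPPER and value <= alpha:
        return value
    return None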
I have the following first-order differential equation (example):
dn/dt=A*n; n(0)=28
When A is constant, it is perfectly solved with python odeint.
But I have an array of different values of A from a .txt file (not a function, just an array of values):
A = [0.1,0.2,0.3,-0.4,0.7,...,0.0028]
And I want each iteration (or each moment of time t) of the ODE solver to use a new value of A from the array.
I mean that:
First iteration (or t=0): A=0.1
Second iteration (or t=1): A=0.2, and so on through the array.
How can I do this using Python's odeint?
Yes, you can do that, but not directly in odeint, as it has no event mechanism, and what you propose needs an event-action mechanism.
But you can split your problem into steps, solve each step with odeint using the now-constant parameter A, and then join the steps at the end.
import numpy as np
from scipy.integrate import odeint

n0 = 28.0  # initial condition from the question: n(0) = 28
T = [np.array([0.0])]
N = [np.array([n0])]
for k in range(len(A)):
    t = np.linspace(k, k + 1, 11)
    n = odeint(lambda u, t: A[k] * u, n0, t).flatten()
    n0 = n[-1]  # final value of this step is the initial value of the next
    T.append(t[1:])
    N.append(n[1:])
T = np.concatenate(T)
N = np.concatenate(N)
If you are satisfied with less efficiency, both in the evaluation of the ODE and in the number of internal steps, you can also implement the parameter as a piecewise constant function.
from scipy.interpolate import interp1d

tA = np.arange(len(A))
A_func = interp1d(tA, A, kind="zero", fill_value="extrapolate")
T = np.linspace(0, len(A) + 1, 10 * len(A) + 11)
N = odeint(lambda u, t: A_func(t) * u, [n0], T)
The internal step-size controller works on the assumption that the ODE function is differentiable to 5th or higher order. The jumps are then seen, via the implicit numerical differentiation inherent in the step error calculation, as highly oscillatory events, requiring a very small step size. There is some mitigation inside the code that usually allows the solver to eventually step over such a jump, but it will require many more internal steps, and thus function evaluations, than the first variant above.
I am given transport costs, per single unit of delivery, for a supermarket, from three distribution centers to ten separate stores.
Note: please look in the #Data section of my code to see the data that I'm not allowed to post in photo form. Also note that while my costs are a vector with 30 entries, each distribution centre only uses 10 of them: DC1 costs = entries 1-10, DC2 costs = entries 11-20, etc.
I want to minimize the transport cost subject to each of the ten stores' demands (in units of delivery).
This can be done by inspection, the minimum cost being $150313. The problem is implementing the solution with Python and Gurobi and producing the same result.
What I've tried is a somewhat sloppy model of the problem in Gurobi so far. I'm not sure how to correctly index and iterate through my sets that are required to produce a result.
This is my main problem: the objective function I define to minimize transport costs is not correct, as I produce a non-answer.
The code "runs", though. If I change to maximization I just get an unbounded problem. So I feel like I am definitely not bringing the correct data/iterations over the sets into play.
My solution so far is quite small, so I feel like I can format it into the question and comment along the way.
from gurobipy import *
#Sets
Distro = ["DC0","DC1","DC2"]
Stores = ["S0", "S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9"]
D = range(len(Distro))
S = range(len(Stores))
Here I define my sets of distribution centres and set of stores. I am not sure where or how to exactly define the D and S iteration variables to get a correct answer.
#Data
Demand = [10,16,11,8,8,18,11,20,13,12]
Costs = [1992,2666,977,1761,2933,1387,2307,1814,706,1162,
2471,2023,3096,2103,712,2304,1440,2180,2925,2432,
1642,2058,1533,1102,1970,908,1372,1317,1341,776]
Just a block of my relevant data. I am not sure if my cost data should be 3 separate sets, considering each distribution centre only has access to 10 costs and not 30, or if there is a way to keep my costs as one set while making sure each centre only accesses the costs relevant to itself.
m = Model("WonderMarket")
#Variables
X = {}
for d in D:
    for s in S:
        X[d, s] = m.addVar()
Declaring my decision variables. Again, I'm blindly iterating at this point to produce something that works. I've never programmed before, but I'm learning and putting as much thought into this question as possible.
#set objective
m.setObjective(quicksum(Costs[s] * X[d, s] * Demand[s] for d in D for s in S), GRB.MINIMIZE)
My objective function is attempting to multiply the cost of each delivery from a centre to a store, weighted by the store's demand, and then make that total as small as possible. I do not have a non-zero constraint yet. I will need one eventually?! But right now I have bigger fish to fry.
m.optimize()
I produce a model with 0 rows, 30 columns, and 0 nonzero entries, which gives me a solution of 0. I need to set up my program so that I get the value that can be calculated easily by hand. I believe the issue is my general declaring of variables, low knowledge of iteration, and general "what goes where" issues. A lot of thinking for just a study exercise!
Appreciate anyone who has read all the way through. Thank you for any tips or help in advance.
Your objective is 0 because you have not defined any constraints. By default all variables have a lower bound of 0, and hence minimizing an unconstrained problem puts all variables at this lower bound.
A few comments:
Unless you need the names for the distribution centers and stores, you could define them as follows:
D = 3
S = 10
Distro = range(D)
Stores = range(S)
You could define the costs as a 2-dimensional array, e.g.
Costs = [[1992,2666,977,1761,2933,1387,2307,1814,706,1162],
[2471,2023,3096,2103,712,2304,1440,2180,2925,2432],
[1642,2058,1533,1102,1970,908,1372,1317,1341,776]]
Then the cost of transportation from distribution center d to store s are stored in Costs[d][s].
You can add all variables at once and I assume you want them to be binary:
X = m.addVars(D, S, vtype=GRB.BINARY)
(or use Distro and Stores instead of D and S if you need to use the names).
Your definition of the objective function then becomes:
m.setObjective(quicksum(Costs[d][s] * X[d, s] * Demand[s] for d in Distro for s in Stores), GRB.MINIMIZE)
(This is all assuming that each store can only be delivered from one distribution center, but since your distribution centers do not have a maximal capacity this seems to be a fair assumption.)
You need constraints ensuring that the stores' demands are actually satisfied. For this it suffices to ensure that each store is being delivered from one distribution center, i.e., that for each s one X[d, s] is 1.
m.addConstrs(quicksum(X[d, s] for d in Distro) == 1 for s in Stores)
When I optimize this, I indeed get an optimal solution with value 150313.
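Putting these pieces together, a complete model might look like this (a sketch assembling the snippets above with the data from the question; the final print is my addition):

from gurobipy import Model, GRB, quicksum

D, S = 3, 10
Distro, Stores = range(D), range(S)

Demand = [10, 16, 11, 8, 8, 18, 11, 20, 13, 12]
Costs = [[1992, 2666, 977, 1761, 2933, 1387, 2307, 1814, 706, 1162],
         [2471, 2023, 3096, 2103, 712, 2304, 1440, 2180, 2925, 2432],
         [1642, 2058, 1533, 1102, 1970, 908, 1372, 1317, 1341, 776]]

m = Model("WonderMarket")
X = m.addVars(D, S, vtype=GRB.BINARY)

# Each store is served by exactly one distribution center
m.addConstrs(quicksum(X[d, s] for d in Distro) == 1 for s in Stores)

# Minimize total cost: per-unit cost times the store's demand
m.setObjective(quicksum(Costs[d][s] * Demand[s] * X[d, s]
                        for d in Distro for s in Stores), GRB.MINIMIZE)
m.optimize()
print(m.objVal)  # 150313, per the answer above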
I have started using boost::odeint in my C++ code, and I think I'm missing a simple feature available in other integrators, namely Scipy's odeint.
scipy.odeint lets the user specify the times at which the state must be added to the output state history. scipy.odeint is a variable-timestep integrator whose one-liner call looks like this (the state is integrated from the initial condition X0 and interpolated/stored at the times specified in times):
X = scipy.odeint(dxdt,X0,times,atol = 1e-13,rtol = 1e-13)
where X is a matrix that has as many rows as there are elements in times.
Basically, I am looking for a similar feature in boost::odeint in order to do two things:
Propagate a state from t0 to tf, but only retrieve the final value of the state. I think I could write an observer that only stores the state if the internal time satisfies t == tf, but this looks like a rather ugly hack. If I want to let the integrator choose the proper internal timestep to meet the tolerance values, storing intermediate states is an unnecessary burden.
Propagate a state from t0 to tf, but store the state at times specified in advance, which are not necessarily evenly distributed, in a similar fashion to the call to scipy.odeint above, while also letting the integrator choose the proper internal timestep.
The closest I've been to achieving that is with the following:
size_t steps = integrate_adaptive( make_controlled< error_stepper_type >( 1.0e-10 , 1.0e-16 ) ,
dynamics , x , 0.0 , 10.0 , 1. , push_back_state_and_time( x_vec , times ));
The tolerances are met, but all the states are stored into x_vec by the observer, without letting me specify what the storage times should be.
How should I proceed?
It seems you are looking for the integrate_times function:
It lets you specify a range of exact times for which the observer will be invoked, adjusting the step size to reach every time step exactly, if necessary.
Especially for adaptive methods, this is quite useful as it computes the solution at the exact times you specified while still controlling the time step size to not exceed the error bounds.
So your current call could be modified to something like
auto stepper = make_controlled<error_stepper_type>( 1.0e-10 , 1.0e-16 );
// std::vector<time> times;
// std::vector<state> x_vec;
// time tstep;
auto tbegin = times.begin();
auto tend = times.end();
integrate_times(stepper, dynamics, x, tbegin, tend, tstep, push_back_state(x_vec));
Recently I read a problem to practice DP. I wasn't able to come up with a DP solution, so I tried a recursive solution, which I later modified to use memoization. The problem statement is as follows:
Making Change. You are given n types of coin denominations of values
v(1) < v(2) < ... < v(n) (all integers). Assume v(1) = 1, so you can
always make change for any amount of money C. Give an algorithm which
makes change for an amount of money C with as few coins as possible.
[on problem set 4]
I got the question from here
My solution was as follows :-
def memoized_make_change(L, index, cost, d):
    if index == 0:
        return cost
    if (index, cost) in d:
        return d[(index, cost)]
    count = cost // L[index]  # integer division (plain / would yield a float in Python 3)
    val1 = memoized_make_change(L, index-1, cost % L[index], d) + count
    val2 = memoized_make_change(L, index-1, cost, d)
    x = min(val1, val2)
    d[(index, cost)] = x
    return x
This is how I've understood my solution to the problem. Assume that the denominations are stored in L in ascending order. As I iterate from the end to the beginning, I have a choice to either choose a denomination or not choose it. If I choose it, I then recurse to satisfy the remaining amount with lower denominations. If I do not choose it, I recurse to satisfy the current amount with lower denominations.
Either way, at a given function call, I find the best(lowest count) to satisfy a given amount.
Could I have some help in bridging the thought process from here onward to reach a DP solution? I'm not doing this as any HW, this is just for fun and practice. I don't really need any code either, just some help in explaining the thought process would be perfect.
[EDIT]
I recall reading that function calls are expensive, and that this is the reason why bottom-up (iteration-based) solutions might be preferred. Is that possible for this problem?
Here is a general approach for converting memoized recursive solutions to "traditional" bottom-up DP ones, in cases where this is possible.
First, let's express our general "memoized recursive solution". Here, x represents all the parameters that change on each recursive call. We want this to be a tuple of positive integers - in your case, (index, cost). I omit anything that's constant across the recursion (in your case, L), and I suppose that I have a global cache. (But FWIW, in Python you should just use the lru_cache decorator from the standard library functools module rather than managing the cache yourself.)
To solve for(x):
    If x in cache: return cache[x]
    Handle base cases, i.e. where one or more components of x is zero
    Otherwise:
        Make one or more recursive calls
        Combine those results into `result`
        cache[x] = result
        return result
The basic idea in dynamic programming is simply to evaluate the base cases first and work upward:
To solve for(x):
    For y starting at (0, 0, ...) and increasing towards x:
        Do all the stuff from above
However, two neat things happen when we arrange the code this way:
As long as the order of y values is chosen properly (this is trivial when there's only one vector component, of course), we can arrange that the results for the recursive call are always in cache (i.e. we already calculated them earlier, because y had that value on a previous iteration of the loop). So instead of actually making the recursive call, we replace it directly with a cache lookup.
Since every component of y will use consecutively increasing values, and will be placed in the cache in order, we can use a multidimensional array (nested lists, or else a Numpy array) to store the values instead of a dictionary.
So we get something like:
To solve for(x):
    cache = multidimensional array sized according to x
    for i in range(first component of x):
        for j in ...:
            (as many loops as needed; better yet, use `itertools.product`)
            If this is a base case, write the appropriate value to cache
            Otherwise, compute "recursive" index values to use, look up
            the values, perform the computation and store the result
    return the appropriate ("last") value from cache
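Applied to the coin-change problem from the question, the recipe might look like this (a sketch; it uses the same recurrence as the memoized version, i.e. for each denomination either take as many as possible or none):

def bottom_up_make_change(L, C):
    """Minimum coins for amount C from denominations L (ascending, L[0] == 1).

    cache[index][cost] mirrors the (index, cost) keys of the memoized version.
    """
    n = len(L)
    cache = [[0] * (C + 1) for _ in range(n)]
    # Base case: with only L[0] == 1 available, `cost` coins are needed
    for cost in range(C + 1):
        cache[0][cost] = cost
    for index in range(1, n):
        for cost in range(C + 1):
            # Either take as many of L[index] as possible, or skip it entirely
            val1 = cache[index - 1][cost % L[index]] + cost // L[index]
            val2 = cache[index - 1][cost]
            cache[index][cost] = min(val1, val2)
    return cache[n - 1][C]

For example, bottom_up_make_change([1, 5, 10, 25], 63) returns 6 (two 25s, one 10, three 1s), matching memoized_make_change([1, 5, 10, 25], 3, 63, {}).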
I suggest considering the relationship between the value you are constructing and the values you need for it.
In this case you are constructing a value for index, cost based on:
index-1 and cost
index-1 and cost%L[index]
What you are searching for is a way of iterating over the choices such that you will always have precalculated everything you need.
In this case you can simply change the code to the iterative approach:
for each choice of index, from 0 upwards:
    for each choice of cost:
        compute the value corresponding to index, cost
In practice, I find that the iterative approach can be significantly faster (perhaps 4x) for simple problems, as it avoids the overhead of function calls and of checking the cache for pre-existing values.