I'm making a Connect 4 AI in Python, using minimax with iterative deepening and alpha-beta pruning. For greater depths it's still quite slow, so I wanted to implement a transposition table. After reading up on it I think I get the general idea, but I haven't quite been able to make it work. Here's part of my code (the maximizing part of the minimax):
if (isMaximizing):
    maxEval = -99999999999
    bestMove = None
    # cache.get(hash(board))  <- here's where I'd check whether the hash is already in the table;
    # if so, I'd use the best move that was stored for that board before.

    # loop through possible moves
    for move in [3, 2, 4, 1, 5, 0, 6]:
        if moves[move] > -1:
            # check if the time limit has been reached for iterative deepening
            if startTime - time.time() <= -10:
                timeout = True
                return (maxEval, bestMove, timeout)
            if timeout == False:
                board = makeMove((moves[move], move), True, board)  # make the move
                eval = minimax(depth - 1, board, False, alpha, beta, cache, zobTable, startTime, timeout)[0]
                if eval > maxEval:
                    maxEval = eval
                    bestMove = (moves[move] + 1, move)
                board[moves[move] + 1][move] = '_'   # undo the move on the board
                moves[move] = moves[move] + 1        # undo the move in the list of legal moves
                alpha = max(alpha, maxEval)
                if alpha >= beta:
                    break
    # cache.set(hash(board), (eval, bestMove))  <- here's where I would store the value and best move for the current board state
    return (maxEval, bestMove, timeout)
Right now I'm hashing the board with the Zobrist hashing method, and I'm adding the hashed boards to an ordered dict. Under that hash key I store the value for the board and the bestMove for that board. Unfortunately this seems to make the algorithm pick bad moves (it worked before). Does anyone know where you should put the board states in the cache, and where you should get them from the cache?
A few points on your approach:
If you want things to be fast, writing efficient code in C or C++ is going to be much faster than Python. I've seen 10-100x performance improvements in this sort of search code by switching from Python to a good C/C++ implementation. Either way, you should try to write code that avoids allocating memory during search, as this is very expensive. That is to say, you may see better returns from coding more efficiently than from adding a transposition table.
When using Zobrist hashing for a transposition table in game tree search, you typically do not store the state explicitly; you only check whether the hashes are equal. While there is a small chance of error, it requires far less memory to store just the hash, and with a 64-bit hash the chance of a collision is probably vanishingly small for the kinds of searches you are doing. (The chance of an error actually resulting from one is even lower.)
When you store values in the transposition table, you also need to record how the value relates to the alpha and beta bounds used during the search. The value you get back at a node mid-search is either a lower bound on the true value (the search failed high, value >= beta), an upper bound on the true value (the search failed low, value <= alpha), or the exact value of the node (alpha < value < beta). You need to store that bound type in your transposition table. Then, when you want to re-use the value, you have to check that it is actually usable given your current alpha and beta bounds. (You can validate this by doing the search anyway after finding the entry in the transposition table and checking that the searched value agrees with what the table told you.)
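As a minimal sketch of that bookkeeping (the Flag enum, the entry layout, and the tt_store/tt_probe helpers are my own illustration, not taken from your code; the table is just a dict keyed by the Zobrist hash):

from enum import Enum

class Flag(Enum):
    EXACT = 0        # alpha < value < beta: exact minimax value
    LOWERBOUND = 1   # value >= beta: fail-high, value is a lower bound
    UPPERBOUND = 2   # value <= alpha: fail-low, value is an upper bound

def tt_store(cache, key, depth, value, best_move, alpha_orig, beta):
    # Record the bound type together with the depth, value and best move.
    if value <= alpha_orig:
        flag = Flag.UPPERBOUND
    elif value >= beta:
        flag = Flag.LOWERBOUND
    else:
        flag = Flag.EXACT
    cache[key] = (depth, value, flag, best_move)

def tt_probe(cache, key, depth, alpha, beta):
    # Return a usable (value, best_move) pair, or None if the entry can't be trusted here.
    entry = cache.get(key)
    if entry is None:
        return None
    entry_depth, value, flag, best_move = entry
    if entry_depth < depth:
        return None                      # stored result came from a shallower search
    if flag is Flag.EXACT:
        return (value, best_move)
    if flag is Flag.LOWERBOUND and value >= beta:
        return (value, best_move)        # already good enough to cause a cutoff here
    if flag is Flag.UPPERBOUND and value <= alpha:
        return (value, best_move)        # already refuted at this node
    return None

In your snippet, the probe would go where your first comment is (before the move loop, remembering alpha as it was on entry to the node), and the store would go where your second comment is (just before the return).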
I've implemented the gridworld example from the book "Reinforcement Learning: An Introduction" (second edition) by Richard S. Sutton and Andrew G. Barto, Chapter 4, sections 4.1 and 4.2, page 80.
Here is my implementation:
https://github.com/ozrentk/dynamic-programming-gridworld-playground
The original algorithm seems to have a bug, since the value function (mapping) is updated value by value, in place, in the source mapping structure. Why do I think that's incorrect? It means that inside the loop over each s in S, within the same evaluation sweep, a later element (e.g. s_2 of set S) will be evaluated using values that were already updated in that same sweep (e.g. the new value of s_1), instead of the values as they were at the start of the sweep. I avoid this problem here using a double-buffering technique: an additional buffer holds the new values of S. It also means that the program uses more memory because of that buffer.
I must admit that I'm not 100% sure if this is a bug, or if I misinterpreted the algorithm.
Generally, this is the code I'm using:
...
while True:
    delta = 0
    # NOTE: algorithm modified a bit, additional buffer new_values introduced
    # Barto & Sutton seem to have a bug in their algorithm (iterative estimation does not fit figure 4.1)
    # Instead of tracking one state value inside a loop, we track entire state value function mapping
    # outside that loop. Also note that after that change algorithm consumes more memory.
    new_values = [0] * states_n
    for s in non_terminal_states:
        # Evaluate state value under current policy
        next_states = get_next_states(s, policy[s], world_size)
        sum_items = [p * (reward + gamma * values[s_next]) for s_next, p in next_states.items()]
        new_values[s] = sum(sum_items)

        # Track delta
        delta = max(delta, abs(values[s] - new_values[s]))

    # (now we switch state value function buffer, instead of switching single state value in the loop)
    values = new_values

    if use_policy_improvement:
        # Policy_improvement is done inside improve_policy(), and if new policy is no better than the
        # old one, return value of is_policy_stable is True
        is_policy_stable, improved_policy = improve_policy()
        if is_policy_stable:
            print("Policy is stable.")
            break
        else:
            print("- Improving policy... ----------------")
            policy = improved_policy
            visualize_policy(policy, states, world_size)

    # In case we don't track policy improvement, we need to track delta for the convergence sake
    if delta < theta:
        break

    # Track iteration count
    k += 1
...
Am I wrong, or is there a problem with the policy evaluation part of the algorithm in the book?
The original algorithm is the "asynchronous version" of policy evaluation, and your implementation using two buffers is the "synchronous version". Both are correct.
The "asynchronous version" also converges to the optimal solution (you can find the proof in the book Parallel and Distributed Computation: Numerical Methods). And, as you may find in the book, it "usually converges faster".
I find that this link provides a good explanation.
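To make the difference concrete, here is a minimal sketch of the two variants for a generic finite MDP (the array shapes P[s, a, s'] and R[s, a, s'] are my own assumptions, not taken from the linked repository):

import numpy as np

def evaluate_in_place(P, R, policy, gamma=0.9, theta=1e-6):
    # Asynchronous ("in-place") evaluation: each new value is written back
    # immediately, so later states in the same sweep already see it.
    n = P.shape[0]
    V = np.zeros(n)
    while True:
        delta = 0.0
        for s in range(n):
            a = policy[s]
            v_new = sum(P[s, a, s2] * (R[s, a, s2] + gamma * V[s2]) for s2 in range(n))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                              # overwrite immediately
        if delta < theta:
            return V

def evaluate_two_buffers(P, R, policy, gamma=0.9, theta=1e-6):
    # Synchronous ("double-buffered") evaluation: every update in a sweep reads
    # from the old value function and writes into a fresh buffer.
    n = P.shape[0]
    V = np.zeros(n)
    while True:
        V_new = np.zeros(n)
        for s in range(n):
            a = policy[s]
            V_new[s] = sum(P[s, a, s2] * (R[s, a, s2] + gamma * V[s2]) for s2 in range(n))
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < theta:
            return V

Both converge to the same value function; the in-place version just tends to get there in fewer sweeps.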
I have a normal Negamax algorithm with alpha-beta pruning which is driven by iterative deepening (ID). I thought that to really benefit from ID I could save the calculated valid moves from depth 1 in a table, so that the next time I go to depth 2 and the same original position arises, I can just grab the valid moves from the table instead, to save time. However, I find that this idea doesn't really save any time at all, which makes me think:
I have never seen anyone do this; is it not worth it for some reason?
Is my implementation of this wrong?
Am I confused by how Negamax works, and is this perhaps impossible to do in the first place?
Here is the original iterative call, along with a snippet of the Negamax function itself:
self.valid_moves_history = {}

for depth in range(1, s.max_search_depth):
    move, evaluation = self.negamax(gamestate, depth, -math.inf, math.inf, s.start_color)

# ----------------------------------------------------------------------------

def negamax(self, gamestate, depth, alpha, beta, color):

    if self.zobrist_key in self.valid_moves_history:
        children = self.valid_moves_history[self.zobrist_key]
    else:
        children = gamestate.get_valid_moves()
        self.valid_moves_history[self.zobrist_key] = children

    if depth == 0 or gamestate.is_check_mate or gamestate.is_stale_mate:
        return None, e.evaluate(gamestate, depth) * color

    # Negamax loop
    max_eval = -math.inf
    for child in reversed(children):
        gamestate.make_move(child[0], child[1])
        score = -self.negamax(gamestate, depth - 1, -beta, -alpha, -color)[1]
        gamestate.unmake_move()
        if score > max_eval:
            max_eval = score
            best_move = child
        alpha = max(alpha, max_eval)
        if beta <= alpha:
            break
The most time-consuming tasks of my complete program are distributed something like this (% of total runtime for a game):
Calculate valid moves: 60%
Evaluation function (medium complexity at the moment): 25%
Negamax itself with lookups, table saves etc: 5%
Make/unmake moves: 4%
Is it normal/reasonable for the valid-move calculation to take this large a share of the time? This is the main reason why I thought to save valid moves in a table in the first place.
Or can someone please explain why this is a good/bad idea and what I should do instead? Thank you for any input.
I know this thread is quite old at this point, but I think this could still be useful to some people. The topic you are describing is called transposition tables in minimax, and you can find many links on it. Negamax is the same as minimax, except that you do not have separate functions for the max and min players; instead you always call a max function and negate the returned value. I think it is probably more useful for you to implement move ordering first, as it can roughly double the speed of your program. You can also look for a more efficient way to generate valid moves to speed up the program.
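As a sketch of what simple move ordering can look like in a negamax loop (order_moves, tt_best_move and score_move are hypothetical names for illustration, not part of your engine):

def order_moves(children, tt_best_move=None, score_move=None):
    # Search the most promising moves first so that alpha-beta cuts off earlier.
    def key(move):
        if tt_best_move is not None and move == tt_best_move:
            return 1_000_000             # previously-best (hash) move goes first
        if score_move is not None:
            return score_move(move)      # e.g. score captures above quiet moves
        return 0
    return sorted(children, key=key, reverse=True)

# In the negamax loop, instead of "for child in reversed(children):",
# you would iterate over order_moves(children, ...).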
For the following problem, I used a dictionary to track values while the provided answer used a list. Is there a quick way to determine the most efficient data structures for problems like these?
A robot moves in a plane starting from the original point (0,0). The robot can move UP, DOWN, LEFT and RIGHT by a given number of steps. The trace of the robot's movement is shown as the following: UP 5, DOWN 3, LEFT 3, RIGHT 2. The numbers after the directions are steps. Please write a program to compute the distance between the current position after the sequence of movements and the original point. If the distance is a float, just print the nearest integer. Example: if the following tuples are given as input to the program: UP 5, DOWN 3, LEFT 3, RIGHT 2, then the output of the program should be: 2
My answer uses a dictionary (origin["y"] for y and origin["x"] for x):
direction = 0
steps = 0
command = (direction, steps)
command_list = []
origin = {"x": 0, "y": 0}

while direction != '':
    direction = input("Direction (U, D, L, R):")
    steps = input("Number of steps:")
    command = (direction, steps)
    command_list.append(command)

print(command_list)

while len(command_list) > 0:
    current = command_list[-1]
    if current[0] == 'U':
        origin["y"] += int(current[1])
    elif current[0] == 'D':
        origin["y"] -= int(current[1])
    elif current[0] == 'L':
        origin["x"] -= int(current[1])
    elif current[0] == 'R':
        origin["x"] += int(current[1])
    command_list.pop()

distance = ((origin["x"])**2 + (origin["y"])**2)**0.5
print(distance)
The provided answer uses a list (pos[0] for y, and pos[1] for x):
import math
pos = [0, 0]
while True:
    s = raw_input()
    if not s:
        break
    movement = s.split(" ")
    direction = movement[0]
    steps = int(movement[1])
    if direction == "UP":
        pos[0] += steps
    elif direction == "DOWN":
        pos[0] -= steps
    elif direction == "LEFT":
        pos[1] -= steps
    elif direction == "RIGHT":
        pos[1] += steps
    else:
        pass
print int(round(math.sqrt(pos[1]**2 + pos[0]**2)))
I'll offer a few points on your question because I strongly disagree with the close recommendations. There's much in your question that's not opinion.
In general, your choice of dictionary wasn't appropriate. For a toy program like this it doesn't make much difference, but I assume you're interested in best practice for serious programs. In production software, you wouldn't make this choice. Why?
Error-proneness. A typo in future code, e.g. origin["t"] = 3 when you meant origin["y"] = 3, is a nasty bug that may be difficult to find, because the dictionary silently accepts the new key. With a plain variable, a typo like t = 3 when you meant y = 3 is more likely to cause a "fast failure", since the intended name then turns up unassigned. (In a statically typed language like C++ or Java, it's a sure compile-time error.)
Space overhead. A simple scalar variable requires essentially no space beyond the value itself. An array has a fixed overhead for the "dope vector" that tracks its location and its current and maximum size. A dictionary requires yet more extra space for open addressing, unused hash buckets, and fill tracking.
Speed.
Accessing a scalar variable is very fast: just a few processor instructions.
Accessing a tuple or array element when you know its index is also very fast, though not as fast as variable access. Extra instructions are needed to check array bounds. Adding one element to an array may take O(current array size) to copy current contents into a larger block of memory. The advantage of tuples and arrays is that you can access elements quickly based on a computed integer index. Scalar variables don't do this. Choose an array/tuple when you need integer index access. Favor tuples when you know the exact size and it's unlikely to change. Their immutability tends to make code more understandable (and thread safe).
Accessing a dictionary element is more expensive still, because a hash value must be computed and buckets traversed, with possible collision resolution. Adding a single element can also trigger a table reorganization, which is O(table size) with a constant factor much bigger than for list reorganization, because all the elements must be rehashed. The big advantage of dictionaries is that looking up any stored pair by its key takes roughly the same (near-constant) time no matter how many pairs are stored. You should choose a dict only when you need that capability: to store a "map" from arbitrary keys to values.
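A rough way to see the difference yourself is with timeit (my own example; exact numbers vary by machine and Python version, so treat it only as an illustration of relative cost):

import timeit

setup = "x = 3; t = (3, 4); d = {'x': 3, 'y': 4}"
print(timeit.timeit("x",      setup=setup))   # scalar variable access
print(timeit.timeit("t[0]",   setup=setup))   # tuple indexing
print(timeit.timeit("d['x']", setup=setup))   # dictionary lookup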
The conclusion from all of the above is that the best choice for your origin coordinates would have been simple variables. If you later enhance the program in a way that requires passing (x, y) pairs to and from methods, then you'd consider a Point class.
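For instance (a sketch of those two options, not code from your program):

# Simple variables: essentially zero overhead for a pair of coordinates.
x, y = 0, 0
x += 2
y -= 3
distance = (x ** 2 + y ** 2) ** 0.5

# If the pair later needs to travel together between functions, a tiny class
# documents intent better than an anonymous dict.
from dataclasses import dataclass

@dataclass
class Point:
    x: int = 0
    y: int = 0

origin = Point()
origin.y += 5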
I am learning about abstract data types here. Lately I have been reading about hashing with a Map (or some data structure like a dict).
Here is what the code looks like:
class HashTable:
    def __init__(self):
        self.size = 11
        self.slots = [None] * self.size
        self.data = [None] * self.size

    def put(self, key, data):
        hashvalue = self.hashfunction(key, len(self.slots))

        if self.slots[hashvalue] == None:
            self.slots[hashvalue] = key
            self.data[hashvalue] = data
        else:
            if self.slots[hashvalue] == key:
                self.data[hashvalue] = data  # replace
            else:
                nextslot = self.rehash(hashvalue, len(self.slots))
                while self.slots[nextslot] != None and \
                        self.slots[nextslot] != key:
                    nextslot = self.rehash(nextslot, len(self.slots))

                if self.slots[nextslot] == None:
                    self.slots[nextslot] = key
                    self.data[nextslot] = data
                else:
                    self.data[nextslot] = data  # replace

    def hashfunction(self, key, size):
        return key % size

    def rehash(self, oldhash, size):
        return (oldhash + 1) % size

    def get(self, key):
        startslot = self.hashfunction(key, len(self.slots))

        data = None
        stop = False
        found = False
        position = startslot
        while self.slots[position] != None and \
                not found and not stop:
            if self.slots[position] == key:
                found = True
                data = self.data[position]
            else:
                position = self.rehash(position, len(self.slots))
                if position == startslot:
                    stop = True
        return data

    def __getitem__(self, key):
        return self.get(key)

    def __setitem__(self, key, data):
        self.put(key, data)
Now within the textbook, the author states that the size of the hashtable is arbitrary. See here:
Note that the initial size for the hash table has been chosen to be 11. Although this is arbitrary, it is important that the size be a prime number so that the collision resolution algorithm can be as efficient as possible.
Why is this arbitrary? It would seem that the number of slots given is directly correlated to how many values can be stored. I know that other hash tables may be more flexible and able to store more than one value per slot, but in THIS specific example it isn't just 'arbitrary'. It is exactly how many values can be stored.
Am I missing something here?
Why is this arbitrary?
Because he could have chosen any other small prime.
It would seem that the number of slots is directly correlated with […] how many values can be stored
Yep, and that's irrelevant. If you need to grow your hash table, you resize (reallocate) and re-hash it. This is not what the author is talking about.
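As a sketch, a resize step for the HashTable class above could look roughly like this (resize is my own addition for illustration, not part of the textbook code; new_size should again be a prime larger than the number of stored items):

class ResizableHashTable(HashTable):
    def resize(self, new_size):
        # Re-insert (re-hash) every existing key/value pair into a larger table;
        # every position can change, because key % size changes with the new size.
        old_slots, old_data = self.slots, self.data
        self.size = new_size
        self.slots = [None] * new_size
        self.data = [None] * new_size
        for key, data in zip(old_slots, old_data):
            if key is not None:
                self.put(key, data)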
The Paramagnetic Croiss answered your main question. The number 11 does of course mean that you can't fit more than 11 elements without reallocating your table and rehashing all your elements, so obviously it's not arbitrary in that sense. But it's arbitrary in the sense that as long as the number is prime (and, yes, larger than the number of inserts you're going to do), everything the author intends to demonstrate will work out the same.*
* In particular, if your elements are natural numbers, and your table size is prime, and small enough compared to the largest integer, % size makes a great hash function.
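A tiny illustration of that point (my own example, not the book's): if the keys share a common factor with the table size they pile up in a few slots, while a prime size spreads them out.

keys = [10, 20, 30, 40, 50, 60]   # keys sharing a common factor of 10
print([k % 10 for k in keys])     # size 10: [0, 0, 0, 0, 0, 0]  -- every key collides
print([k % 11 for k in keys])     # size 11: [10, 9, 8, 7, 6, 5] -- nicely spread out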
But for your followup question:
It would seem though, that making a table with a bigger prime number would allow for you to have more available slots and require less need for you to rehash, and have less items to search through in each slot (if you extended the data slots to hold more than one value). The items would be spread thinner in general. Is this not correct?
If I understand you right, you're not using the right words, which is why you're getting confusing answers. Your example code uses a function called rehash, but that's misleading: rehashing is one way to do probing, but it's not the way you're doing it; you're essentially just doing linear probing.* More commonly, when people talk about rehashing, they're talking about what you do after you grow the hash table and have to rehash every value from the old table into the new one.
* When your hash function is as simple as key%size, the distinction is ambiguous…
Anyway, yes, more load (if you have N elements in M buckets, you have N/M load) means more probing, which is bad. To take the most extreme case, at load 1.0 the average operation will have to probe through half the table to find the right bucket, making the hash table as inefficient as brute-force searching an array.
However, as you decrease load, the returns drop off pretty fast. You can draw the exact curve for any particular hash implementation, but the rule of thumb you usually use (for closed hashes like this) is that getting the load below 2/3 is usually not worth it. And keep in mind that a larger hash table has costs as well as benefits. Let's say you're on a 32-bit machine with a 64-byte cache line. So, 11 pointers fit in a single cache line; after any hash operation, the next one is guaranteed to be a cache hit. But 17 pointers are split across two cache lines; after any hash operation, the next one only has a 50% chance of being a cache hit.*
* Of course, realistically, there's plenty of room inside your loop to spend 2 cache lines on a hash table; that's why people don't usually worry about performance at all when N is in the single digits. But you can see how, with larger hash tables, keeping too much empty space can mean more L1 cache misses, more L2 cache misses, and in the worst case even more VM page misses.
Well, nobody can predict the future, as you never know how many values the data structure user will actually put in the container.
So you start with something small, not to eat too much memory, and then increase and rehash as needed.
I am trying to solve Project Euler problem 18, where I am required to find the maximum total from top to bottom of a triangle. I am trying to use recursion, but am stuck with this.
I guess I didn't state my problem clearly earlier. What I am trying to achieve with recursion is to find the sum of the maximum-number path. I start from the top of the triangle, and then check whether 7 + findsum() or 4 + findsum() is bigger; findsum() is supposed to find the sum of the numbers beneath it. I am storing the sum in the variable result.
The problem is that I don't know the base case that should end this recursion. I know it should stop when it reaches the bottom row, but I don't know how to write this logic in the program.
pyramid = [[0, 0, 0, 3, 0, 0, 0],
           [0, 0, 7, 0, 4, 0, 0],
           [0, 2, 0, 4, 0, 6, 0],
           [8, 0, 5, 0, 9, 0, 3]]

pos = [0, 3]

def downleft(pyramid, pos):   # returns down-left child
    try:
        return pyramid[pos[0]+1][pos[1]-1]
    except:
        return 0

def downright(pyramid, pos):  # returns down-right child
    try:
        return pyramid[pos[0]+1][pos[1]+1]
    except:
        return 0

result = 0

def find_max(pyramid, pos):
    global result
    if downleft(pyramid, pos) + find_max(pyramid, [pos[0]+1, pos[1]-1]) > downright(pyramid, pos) + find_max(pyramid, [pos[0]+1, pos[1]+1]):
        new_pos = [pos[0]+1, pos[1]-1]
        result += downleft(pyramid, pos) + find_max(pyramid, [pos[0]+1, pos[1]-1])
    elif downright(pyramid, pos) + find_max(pyramid, [pos[0]+1, pos[1]+1]) > downleft(pyramid, pos) + find_max(pyramid, [pos[0]+1, pos[1]-1]):
        new_pos = [pos[0]+1, pos[1]+1]
        result += downright(pyramid, pos) + find_max(pyramid, [pos[0]+1, pos[1]+1])
    else:
        return result

find_max(pyramid, pos)
A big part of your problem is that you're recursing a lot more than you need to. You should really only ever call find_max twice recursively, and you need some base-case logic to stop after the last row.
Try this code:
def find_max(pyramid, x, y):
    if y >= len(pyramid):  # base case, we're off the bottom of the pyramid
        return 0           # so, return 0 immediately, without recursing
    left_value = find_max(pyramid, x - 1, y + 1)   # first recursive call
    right_value = find_max(pyramid, x + 1, y + 1)  # second recursive call
    if left_value > right_value:
        return left_value + pyramid[y][x]
    else:
        return right_value + pyramid[y][x]
I changed the call signature to have separate values for the coordinates rather than using a tuple, as this made the indexing much easier to write. Call it with find_max(pyramid, 3, 0), and get rid of the global pos list. I also got rid of the result global (the function returns the result).
This algorithm could benefit greatly from memoization, as on bigger pyramids you'll calculate the values of the lower-middle areas many times. Without memoization, the code may be impractically slow for large pyramid sizes.
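For example, a memoized version could be sketched like this (find_max_memoized is my own variant; it assumes the same zero-padded pyramid layout and (x, y) coordinates as above):

from functools import lru_cache

def find_max_memoized(pyramid, x, y):
    frozen = tuple(tuple(row) for row in pyramid)   # lru_cache needs hashable arguments

    @lru_cache(maxsize=None)
    def best(x, y):
        if y >= len(frozen):                        # off the bottom of the pyramid
            return 0
        return frozen[y][x] + max(best(x - 1, y + 1), best(x + 1, y + 1))

    return best(x, y)

# find_max_memoized(pyramid, 3, 0) -> 23 for the small example triangle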
Edit: I see that you are having trouble with the logic of the code. So let's have a look at that.
At each position in the tree you want to make a choice: select the path from this point on that has the highest value. So what you do is calculate the score of the left path and the score of the right path. I see this is something you try in your current code, only there are some inefficiencies: you calculate everything twice (first in the if, then in the elif), which is very expensive. You should only calculate the values of the children once.
You ask for the stopping condition. Well, if you reach the bottom of the tree, what is the score of the path starting at this point? It's just the value in the tree. And that is what you should return at that point.
So the structure should look something like this:
function getScoreAt(x, y):
    if at the end: return valueInTree(x, y)
    valueLeft = getScoreAt(x - 1, y + 1)
    valueRight = getScoreAt(x + 1, y + 1)
    valueHere = max(valueLeft, valueRight) + valueInTree(x, y)
    return valueHere
Extra hint:
Are you aware that in Python negative indices wrap around to the back of the array? So if you do pyramid[pos[0]+1][pos[1]-1] you may actually end up at an element like pyramid[1][-1], which is at the other end of that row of the pyramid. What you probably expect is that this raises an error, but it does not.
To fix your problem, you should add explicit bounds checks rather than relying on try blocks (using try blocks for this is not a nice programming style anyway).
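A bounds-checked version of one of the child helpers might look like this (same pyramid/pos layout as in the question):

def downleft(pyramid, pos):
    row, col = pos[0] + 1, pos[1] - 1
    # Explicit bounds check: no reliance on try/except, and no accidental
    # wrap-around to the other end of the row via a negative index.
    if row < len(pyramid) and 0 <= col < len(pyramid[row]):
        return pyramid[row][col]
    return 0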