PyBrain Q-Learning maze example. State values and the global policy - python

I am trying out the PyBrain maze example.
My setup is:
envmatrix = [[...]]
env = Maze(envmatrix, (1, 8))
task = MDPMazeTask(env)
table = ActionValueTable(states_nr, actions_nr)
table.initialize(0.)
learner = Q()
agent = LearningAgent(table, learner)
experiment = Experiment(task, agent)
for i in range(1000):
    experiment.doInteractions(N)
    agent.learn()
    agent.reset()
Now, I am not confident in the results that I am getting.
The bottom-right corner (1, 8) is the absorbing state.
I have put an additional punishment state at (1, 7) in mdp.py:
def getReward(self):
    """ compute and return the current reward (i.e. corresponding to the last action performed) """
    if self.env.goal == self.env.perseus:
        self.env.reset()
        reward = 1
    elif self.env.perseus == (1, 7):
        reward = -1000
    else:
        reward = 0
    return reward
Now, I do not understand how, after 1000 runs with 200 interactions each, the agent can think that my punishment state is a good state (you can see the square is white).
I would like to see the values for every state and the policy after the final run. How do I do that? I have found that the line table.params.reshape(81, 4).max(1).reshape(9, 9) returns some values, but I am not sure whether those correspond to the values of the value function.

I then added another constraint - the agent always starts from the same position, (1, 1), by adding self.initPos = [(1, 1)] in maze.py - and now I get this behaviour after 1000 runs with 200 interactions each:
This makes sense now - the robot tries to go around the wall from the other side, avoiding the state (1, 7).
So I was getting weird results because the agent used to start from random positions, which also included the punishing state.
EDIT:
Another point: if it is desirable to spawn the agent randomly, make sure it is not spawned in a punishing state:
def _freePos(self):
    """ produce a list of the free positions. """
    res = []
    for i, row in enumerate(self.mazeTable):
        for j, p in enumerate(row):
            if p == False:
                if self.punishing_states is not None:
                    if (i, j) not in self.punishing_states:
                        res.append((i, j))
                else:
                    res.append((i, j))
    return res
Also, it seems that table.params.reshape(81, 4).max(1).reshape(9, 9) does return the value of every state under the learned value function.
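The reshaping logic can be checked with plain NumPy, independent of PyBrain. This is a minimal sketch: the random array here merely stands in for table.params (PyBrain stores one Q-value per (state, action) pair as a flat array), so the values are meaningless - only the shapes and the max/argmax reductions are the point.

```python
import numpy as np

# Stand-in for table.params: a flat array of Q-values for a 9x9 maze
# (81 states) with 4 actions. In PyBrain this would be table.params.
rng = np.random.default_rng(0)
q_flat = rng.random(81 * 4)

q = q_flat.reshape(81, 4)  # rows = states, columns = actions

# V(s) = max_a Q(s, a): the state-value function, one value per cell
state_values = q.max(axis=1).reshape(9, 9)

# pi(s) = argmax_a Q(s, a): the greedy policy, one action index per cell
greedy_policy = q.argmax(axis=1).reshape(9, 9)

print(state_values)   # value of every maze cell, as a 9x9 grid
print(greedy_policy)  # best action index (0-3) for every cell
```

So max(1) over the (81, 4) table gives the state values, while argmax(1) gives the greedy policy in the same layout.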

Related

Minesweeper AI labelling mines as safe spots

Background:
I have been working on the Minesweeper AI project for the HarvardX CS50AI online course for a few days. The goal is to implement an AI for the Minesweeper game. The problem set can be accessed here: https://cs50.harvard.edu/ai/2020/projects/1/minesweeper/
Implementation:
My task is to implement two classes, MinesweeperAI and Sentence. The Sentence class is a logical statement about a Minesweeper game that consists of a set of board cells and a count of the number of those cells which are mines. The MinesweeperAI class is the main handler of the AI.
Issue:
Although the program runs without any errors, the AI makes bad decisions and is thus unable to complete the Minesweeper game successfully. From my observations, the AI labels potential mines as safe spaces and consequently makes suicidal moves.
Debugging
I have tried classical debugging, printing, even talking to myself about the code. For some reason, the AI labels cells that are mines as safe spaces - I cannot detect the reason behind it. I have documented the code with comments, and I cannot see any breakdown in the implemented logic. However, there must be one - I am inserting the code below with some additional materials.
Sentence class, the logical representation of in-game knowledge:
class Sentence():
    """
    Logical statement about a Minesweeper game
    A sentence consists of a set of board cells,
    and a count of the number of those cells which are mines.
    """

    def __init__(self, cells, count):
        self.cells = set(cells)
        self.count = count

    def __eq__(self, other):
        return self.cells == other.cells and self.count == other.count

    def __str__(self):
        return f"{self.cells} = {self.count}"

    def known_mines(self):
        """
        Returns the set of all cells in self.cells known to be mines.
        """
        # Because we are eliminating safe cells from the statement, we are
        # looking for statements whose number of cells is equal to (or smaller
        # than) the number of mines. When that condition holds, all remaining
        # cells are known to be mines.
        if len(self.cells) <= self.count:
            return self.cells
        else:
            return None

    def known_safes(self):
        """
        Returns the set of all cells in self.cells known to be safe.
        """
        # There is only one case when the cells are known to be safe - when the count is 0.
        if self.count == 0:
            return self.cells
        else:
            return None

    def mark_mine(self, cell):
        """
        Updates internal knowledge representation given the fact that
        a cell is known to be a mine.
        """
        # Marking a mine implies two logical consequences:
        # a) the count must decrease by one (n - 1);
        # b) the cell marked as a mine must be discarded from the sentence
        #    (we keep track only of the cells that are still unknown to be
        #    mines or safes).
        if cell in self.cells:
            self.cells.discard(cell)
            self.count -= 1
            if self.count < 0:  # safeguard against any improper inference
                self.count = 0

    def mark_safe(self, cell):
        """
        Updates internal knowledge representation given the fact that
        a cell is known to be safe.
        """
        # Marking a cell safe implies one logical consequence:
        # a) the cell marked as safe must be discarded from the sentence.
        if cell in self.cells:
            self.cells.discard(cell)
MinesweeperAI class, the primary AI module:
import copy

class MinesweeperAI():
    """
    Minesweeper game player
    """

    def __init__(self, height=8, width=8):
        # Set initial height and width
        self.height = height
        self.width = width
        # Keep track of which cells have been clicked on
        self.moves_made = set()
        # Keep track of cells known to be safe or mines
        self.mines = set()
        self.safes = set()
        # List of sentences about the game known to be true
        self.knowledge = []

    def mark_mine(self, cell):
        """
        Marks a cell as a mine, and updates all knowledge
        to mark that cell as a mine as well.
        """
        self.mines.add(cell)
        for sentence in self.knowledge:
            sentence.mark_mine(cell)

    def mark_safe(self, cell):
        """
        Marks a cell as safe, and updates all knowledge
        to mark that cell as safe as well.
        """
        self.safes.add(cell)
        for sentence in self.knowledge:
            sentence.mark_safe(cell)

    def add_knowledge(self, cell, count):
        """
        Called when the Minesweeper board tells us, for a given
        safe cell, how many neighboring cells have mines in them.
        This function should:
            1) mark the cell as a move that has been made
            2) mark the cell as safe
            3) add a new sentence to the AI's knowledge base
               based on the value of `cell` and `count`
            4) mark any additional cells as safe or as mines
               if it can be concluded based on the AI's knowledge base
            5) add any new sentences to the AI's knowledge base
               if they can be inferred from existing knowledge
        """
        # 1) mark the cell as a move that has been made.
        self.moves_made.add(cell)
        # 2) mark the cell as safe. This also updates our internal knowledge base.
        self.mark_safe(cell)
        # 3) add a new sentence to the AI's knowledge base based on `cell` and `count`.
        sentence_prep = set()
        # The sentence must include all the adjacent tiles, but not:
        # a) the revealed cell itself;
        # b) the cells that are known to be mines;
        # c) the cells that are known to be safe.
        for i in range(cell[0] - 1, cell[0] + 2):
            for j in range(cell[1] - 1, cell[1] + 2):  # these two loops cover all adjacent tiles
                if (i, j) != cell:
                    if (i, j) not in self.moves_made and (i, j) not in self.mines and (i, j) not in self.safes:
                        if 0 <= i < self.height and 0 <= j < self.width:  # the cell must be within the game frame
                            sentence_prep.add((i, j))
        new_knowledge = Sentence(sentence_prep, count)  # adding the newly formed knowledge to the KB
        self.knowledge.append(new_knowledge)
        # 4) mark any additional cells as safe or as mines,
        #    if it can be concluded based on the AI's knowledge base.
        # 5) add any new sentences to the AI's knowledge base
        #    if they can be inferred from existing knowledge.
        while True:  # iterate over the knowledge base in search of new conclusions on safes or mines
            amended = False  # flag indicating that we changed the knowledge, so another pass is required
            knowledge_copy = copy.deepcopy(self.knowledge)  # create a copy of the database
            for sentence in knowledge_copy:  # clean empty sets from the database
                if len(sentence.cells) == 0:
                    self.knowledge.remove(sentence)
            knowledge_copy = copy.deepcopy(self.knowledge)  # copy once again, now without empty sets
            for sentence in knowledge_copy:
                mines_check = sentence.known_mines()  # a set of cells known to be mines, or None
                safes_check = sentence.known_safes()  # a set of cells known to be safe, or None
                if mines_check is not None:
                    for cell in mines_check:
                        self.mark_mine(cell)  # mark the cell as a mine and update internal knowledge
                        amended = True  # raise the flag
                if safes_check is not None:
                    for cell in safes_check:
                        self.mark_safe(cell)  # mark the cell as safe and update internal knowledge
                        amended = True  # raise the flag
            # The algorithm should infer new knowledge using the reasoning:
            # (B.cells - A.cells) = (B.count - A.count), if A is a subset of B.
            knowledge_copy = copy.deepcopy(self.knowledge)  # copy once again, with updated checks
            for sentence_one in knowledge_copy:
                for sentence_two in knowledge_copy:
                    if len(sentence_one.cells) != 0 and len(sentence_two.cells) != 0:  # skip empty sets
                        if sentence_one.cells != sentence_two.cells:  # compare sentences (if not the same)
                            if sentence_one.cells.issubset(sentence_two.cells):  # sentence one is a subset of sentence two
                                new_set = sentence_two.cells.difference(sentence_one.cells)
                                if len(new_set) != 0:  # the new set is not empty (guard against bugs)
                                    new_counts = sentence_two.count - sentence_one.count
                                    if new_counts >= 0:  # the count is non-negative (guard against bugs)
                                        new_sentence = Sentence(new_set, new_counts)
                                        if new_sentence not in self.knowledge:  # not already in the KB
                                            self.knowledge.append(new_sentence)
                                            amended = True  # raise the flag
            if not amended:
                break  # no amendments in this pass means nothing more can be inferred from the KB

    def make_safe_move(self):
        """
        Returns a safe cell to choose on the Minesweeper board.
        The move must be known to be safe, and not already a move
        that has been made.
        This function may use the knowledge in self.mines, self.safes
        and self.moves_made, but should not modify any of those values.
        """
        for cell in self.safes:
            if cell not in self.moves_made:
                return cell
        return None

    def make_random_move(self):
        """
        Returns a move to make on the Minesweeper board.
        Should choose randomly among cells that:
            1) have not already been chosen, and
            2) are not known to be mines
        """
        for i in range(self.height):
            for j in range(self.width):
                cell = (i, j)
                if cell not in self.moves_made and cell not in self.mines:
                    return cell
        return None
Documentation of the issue:
Documentation of the issue - the AI makes a "safe" move on a cell that it should not have labelled as safe.
Some comments:
Generally speaking, a cell is known to be safe when sentence.count is zero (meaning all the cells in the sentence are known to be safe). On the other hand, a cell is known to be a mine if the number of cells is equal to sentence.count. The logic behind it is rather straightforward; still, I am missing something big in the implementation.
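The two inference rules in that paragraph can be checked in isolation. This is a minimal sketch that re-defines a stripped-down Sentence purely for illustration (it is not the full class from the question):

```python
# Minimal illustration of the two inference rules described above.
class Sentence:
    def __init__(self, cells, count):
        self.cells = set(cells)
        self.count = count

    def known_safes(self):
        # If the count is 0, every remaining cell must be safe.
        return self.cells if self.count == 0 else None

    def known_mines(self):
        # If exactly as many cells remain as there are mines, every cell is a mine.
        return self.cells if len(self.cells) == self.count else None

# {A, B} = 0  ->  both A and B are safe
assert Sentence([(0, 1), (1, 0)], 0).known_safes() == {(0, 1), (1, 0)}
# {A, B} = 2  ->  both A and B are mines
assert Sentence([(0, 1), (1, 0)], 2).known_mines() == {(0, 1), (1, 0)}
# {A, B} = 1  ->  nothing can be concluded yet
assert Sentence([(0, 1), (1, 0)], 1).known_safes() is None
assert Sentence([(0, 1), (1, 0)], 1).known_mines() is None
```

The rules themselves are sound; as the accepted diagnosis below the question shows, the bug is in how the count is computed when a sentence is first built, not in these checks.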
Thank you for all your help. Please do not be too harsh on my code - I am still learning, and to be honest, it's the first time I have struggled this hard with a piece of code I have written. It gives me little rest because I just cannot work out what I am doing wrong. If there is anything more I could provide (any additional data) - please, just let me know!
OK, after a lot of debugging I found the root of the issue: when new knowledge is added via add_knowledge, the AI only half accounts for cells it knows to be mines. It correctly leaves those out of the new Sentence, but it also needs to reduce the count by one for each neighbour already known to be a mine:
for i in range(cell[0] - 1, cell[0] + 2):
    for j in range(cell[1] - 1, cell[1] + 2):  # these two loops cover all adjacent tiles
        if (i, j) != cell:
            if (i, j) not in self.moves_made and (i, j) not in self.mines and (i, j) not in self.safes:
                if 0 <= i < self.height and 0 <= j < self.width:  # the cell must be within the game frame
                    sentence_prep.add((i, j))
            elif (i, j) in self.mines:  # one of the neighbours is a known mine: reduce the count
                count -= 1
new_knowledge = Sentence(sentence_prep, count)  # adding the newly formed knowledge to the KB
self.knowledge.append(new_knowledge)
This should now work (unless there is another edge case somewhere).
Here is a bit about my journey. I wrote these tools to help with debugging:
from itertools import product
from random import shuffle

def get_neighbours(size, x, y):
    for i in range(x - 1, x + 2):
        for j in range(y - 1, y + 2):  # these two loops cover all adjacent tiles
            if (i, j) != (x, y):
                if 0 <= i < size[0] and 0 <= j < size[1]:
                    yield i, j

class SimpleBoard:
    def __init__(self, size, grid):
        self.size = size
        self.grid = grid
        self.calc()

    def calc(self):
        for x in range(self.size[0]):
            for y in range(self.size[1]):
                if self.grid[x][y] != 9:
                    self.grid[x][y] = sum(1 for i, j in get_neighbours(self.size, x, y) if self.grid[i][j] == 9)

    @classmethod
    def random(cls, size, count):
        self = cls(size, [[0] * size[1] for _ in range(size[0])])
        options = list(product(range(size[0]), range(size[1])))
        shuffle(options)
        mines = options[:count]
        for x, y in mines:
            self.grid[x][y] = 9
        self.calc()
        return self
from collections import defaultdict

def build_ai_view(ai: MinesweeperAI, board: SimpleBoard):
    out = []
    for x in range(ai.height):
        out.append(l := [])
        for y in range(ai.width):
            cell = x, y
            if cell in ai.mines:
                assert cell not in ai.safes
                l.append("X" if board.grid[x][y] == 9 else "%")
            elif cell in ai.safes:
                l.append(str(board.grid[x][y]) if cell in ai.moves_made else "_")
            else:
                l.append("?")
    cells_to_sentence = defaultdict(list)
    for i, sentence in enumerate(ai.knowledge):
        for c in sentence.cells:
            cells_to_sentence[c].append(sentence)
    unique_groups = []
    for c, ss in cells_to_sentence.items():
        if ss not in unique_groups:
            unique_groups.append(ss)
    labels = "abcdefghijklmnopqrstuvxyz"
    for (x, y), ss in cells_to_sentence.items():
        i = unique_groups.index(ss)
        l = labels[i]
        assert out[x][y] == "?"
        out[x][y] = l
    for i, ss in enumerate(unique_groups):
        out.append(l := [labels[i]])
        if len(ss) > 1:
            l.append("overlap of")
            for s in ss:
                if [s] not in unique_groups:
                    unique_groups.append([s])
                l.append(labels[unique_groups.index([s])])
            # l.extend(labels[unique_groups.index([s])] for s in ss)
        else:
            l.append(str(ss[0].count))
    out.append([repr(ai)])
    return "\n".join(map(str, out))
It might not be pretty code, but it works and displays all relevant information from the AI's perspective. I then used it together with the given failing case:
board = SimpleBoard((8, 8), [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 9, 0, 0, 0, 9, 0, 0],
    [0, 0, 0, 9, 0, 0, 0, 0],
    [0, 0, 0, 9, 0, 0, 0, 0],
    [0, 9, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 9, 0, 9, 0, 9, 0, 0],
])
and this simple loop:
from pprint import pprint

pprint(board.grid)
start = next((x, y) for x in range(board.size[0]) for y in range(board.size[1]) if board.grid[x][y] == 0)
ai = MinesweeperAI(*board.size)
ai.add_knowledge(start, 0)
print(build_ai_view(ai, board))
while True:
    target = ai.make_safe_move()
    print(target)
    x, y = target
    if board.grid[x][y] == 9:
        print("FOUND MINE", x, y)
        break
    else:
        ai.add_knowledge((x, y), board.grid[x][y])
    print(build_ai_view(ai, board))
to be able to work backwards and figure out at which point the AI starts to make false assumptions.
This came in multiple steps: figure out when the first % (i.e. a wrongly marked mine) appears, figure out which Sentences lead to that conclusion, figure out which of those is wrong, and finally figure out why that assumption is made.

How to check if a pointer can traverse a possible path given a fixed number of moves

I am trying to solve this problem on HackerRank. I have this input:
........#....#..#..#....#...#..#.#.#.#.#.#..#.....
..#..#..#.#....#..#.....#......#..##...........##.
.........#.###.##...#.....##......###............#
....##........#..#.#.#......#...#.##.......##.....
.................###...#.#...#......#.#.#.#.#...#.
.........#.....#...........##....#.#.#.##.....##..
.....#.##............#....#......#..#..#...##.....
.#.......###....#.#..##.##.#...##...#........#...#
..#.##..##..........#..........##.....##..........
#.#..##......#.#.#..##.###...#.........###..#...#.
.#..#..............#...#........#..#...#....#..#..
##..#..#........#....#........#...#.#......#.....#
#.#.......#.#..#...###..#..#.##...#.##.......#...#
#.#...#...#.....#....#......##.#.#.........#....#.
.#..........#......##..#....#....#.#.#..#..###....
#.#............#.##..#.##.##......###......#..#..#
.#..#.##...#.#......................#........#....
.....#....#.#..........##.#.#................#....
##............#.#......####...#.........#..##..#..
....#..##..##...#.........##..##....#..#.##...#...
.#........#...#..#...........#.###.....##.....##..
.......#..#..##...#..###.....#..##.........#......
...#......#..#...........###...............#......
...##.###.#.#....#...#..#.#.#....#....#.##.#...#..
..........#.......#..#..#...###....##.....#..#....
.............##.##.#.......#.#....#.......#..#..#.
.......#........#.....#....##...#...#.#...#.#.##..
.....#..#..#........#..#.....#...#.##.#....#...#..
....................#.#...#....###...###...##...#.
##.#.....##.....#..#.#.#...........#.#.##...#..#.#
#...........#....#.##...#.#.....#...#.....#.#.....
..#..##...#........#.##..#.....##.......#...#.#.#.
......#....#...##...........#..#.......#.##.......
......#..#..#.###..........#...#...........#..#...
....#.#..#..#.#.#...#.......#...#.##......#.......
....#.......#..#........#...#.#...#......#.......#
.#....##...#.#..#....#.#.##........#..#.#.........
#....#.......#..##......##...............#..#.##..
...#..##.......#.....#....#...#.#......#..##..###.
.....#...#...#...#...#...#..##...#..#.............
....##......#...#..#...#...#.#....#.....#..#.##...
...##.......#..##.....#........#.#....#...#.......
..#...#....#...#.###......#................#......
...#...###...#..##...###.....................#....
.....#....#....#...#.#.#.##....##......#....##....
...#.###...##.........#..........#.##.#.....#.....
##..#...#.........#.......#......##...........####
...###.#..........#.....#####........#..#.#.#...#.
...#..#.....#..##.##.#.....##...#...#.#.....#...##
.##.......#.##....#............#..................
#.....#.........#.#.........#..###....##...##.....
#....#.....#...#.....#.##...##...####........#....
#...........#..#...#........#.##..##..#...#.#.....
..#.#................#......###..##.#.#...##...#..
.#.#....#..#............#....#......#............#
..#..#...#.#.#...#...........#.......##.#...#.#...
#..........#.....#.....#......#.......#.#...##....
.......#...........#...........#....#............#
...####.#.....#.##.....#.......##.#..#......#.....
.#..#.....#..#......#.............#.#.#..##...#...
..#.#.#.........#...#..#.......#................##
.#..##.#.#...#.............#..#..........#..#...#.
....#........#......#...###..#.#..................
#..#..#.....#.#.#...##....##........#........#....
.....#.#.......##.......#.....#........#..##..#...
#.#.##........#..##.#..#.#...#........#.#......#..
....#.#.#.......#.##.##...##...#..#.###...#.#.#...
.....##.#....#........#....#.#........#.#.#.....#.
.....#..##..#.#....#.......#...#.#.###.........#.#
#.....#.##..#.......###.........#..##..#......##..
There are 70 rows, and maxTime is 2244.
This is my strategy, but it only works for some test cases:
import math
import collections

def reachTheEnd(grid, maxTime):
    # Write your code here
    grid = [i for i in grid]
    yes = 'Yes'
    no = 'No'
    counter = 0
    for i in grid:
        counter += i.count('.')
        if maxTime <= counter:
            return yes
        elif i != i[::-1]:
            return no
        else:
            return no
    print(counter)
This is a BFS problem but I couldn't figure out the logic. I appreciate all your help.
The idea of breadth-first search is that you (1) don't visit the same node twice, and (2) continuously maintain a list of nodes you haven't visited, split up by timeslice. Eventually, you visit all visitable nodes, and for any node you do visit, you visit it in as few steps as possible.
In the following algorithm, we create a cache to satisfy (1) and a queue to satisfy (2). Each timeslice, we examine each element of the queue, and replace the queue entirely with a new queue, composed of elements discovered during that timeslice. Whenever we discover a new element, we add it to the cache along with the timeslice on which it was first found - this, then, must be the quickest route to that new element.
If we either run out of time or run out of new cells to explore before reaching the destination, then we fail. Otherwise, we succeed. We can check that by simply encoding our exit conditions and checking if we've visited the destination by the time we exit the while loop.
def reachTheEnd(grid, maxTime):
    # mark what the coordinates of the destination are
    destination = (len(grid) - 1, len(grid[-1]) - 1)
    # initialize
    # - a step counter
    # - a cache of visited cells on the grid
    # - a queue of not-yet-visited cells that are adjacent to visited cells
    counter = 0
    cache = {(0, 0): 0}
    queue = [(0, 0)]
    # perform breadth-first search on the current queue, moving forward one
    # timeslice. On each timeslice, we take one 'step' forward in any direction
    # towards a newly-accessible tile on the grid.
    # our 'exit' conditions are
    # - we run out of time
    # - there is no path to the end of the maze
    # - we reach the end
    while counter < maxTime and len(queue) > 0 and destination not in cache:
        counter += 1
        new_queue = []
        for (x, y) in queue:
            # check adding to path in all directions (up, down, left, right).
            # If the step is
            # - not out-of-bounds
            # - not a wall
            # - not already visited
            # then add it to the cache with the current step count, as this is
            # the most quickly we can reach it. Also add it to the queue of
            # cells to investigate on the next timeslice.
            for (dx, dy) in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
                (nx, ny) = (x + dx, y + dy)
                if 0 <= nx < len(grid) \
                        and 0 <= ny < len(grid[nx]) \
                        and grid[nx][ny] == '.' \
                        and (nx, ny) not in cache:
                    cache[(nx, ny)] = counter
                    new_queue.append((nx, ny))
        queue = new_queue
    # by the time the loop exits, either we've failed to reach the end,
    # or the end is in the cache with the quickest possible path to it.
    if destination in cache:
        return "Yes"
    else:
        return "No"
A possible optimization would be to move the "have we reached destination yet" to inside the innermost for loop, which would save processing the rest of the queue on that timeslice. However, this would make the code slightly more complicated (and thus less useful for explaining the concept of BFS) and provide only a minimal time save.
Note that for the big 70x50 grid you've provided, there's no way to actually reach the lower-right square (it's a small island surrounded by walls). It can reach cell (67, 49) by timeslice 117, which is as close as it gets, but can't get around the wall.
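For completeness, the early-exit variation mentioned above could look like this. It is a sketch under my own naming (reach_the_end_early_exit is not part of the HackerRank harness): the only change is returning the moment the destination is discovered, instead of finishing the current timeslice first.

```python
def reach_the_end_early_exit(grid, max_time):
    """BFS that returns 'Yes' as soon as the destination is discovered,
    skipping the rest of the current timeslice's queue."""
    destination = (len(grid) - 1, len(grid[-1]) - 1)
    if destination == (0, 0):
        return "Yes"
    counter = 0
    cache = {(0, 0)}          # visited cells; distances are no longer needed
    queue = [(0, 0)]
    while counter < max_time and queue:
        counter += 1
        new_queue = []
        for x, y in queue:
            for dx, dy in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
                nx, ny = x + dx, y + dy
                if (0 <= nx < len(grid) and 0 <= ny < len(grid[nx])
                        and grid[nx][ny] == '.' and (nx, ny) not in cache):
                    if (nx, ny) == destination:
                        return "Yes"  # stop immediately: the rest of the queue is irrelevant
                    cache.add((nx, ny))
                    new_queue.append((nx, ny))
        queue = new_queue
    return "No"

# A tiny open 3x3 grid whose far corner is 4 steps away:
print(reach_the_end_early_exit(["...", "...", "..."], 4))  # -> Yes
print(reach_the_end_early_exit(["...", "...", "..."], 3))  # -> No
```

As the answer notes, this saves at most part of one timeslice of work, at the cost of a slightly less regular loop.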

Am I missing a check for the actions in the given state?

The problem:
Three traditional, but jealous, couples need to cross a river. Each couple consists of a husband and a wife. They find a small boat that can contain no more than two persons. Find the simplest schedule of crossings that will permit all six people to cross the river so that none of the women shall be left in company with any of the men, unless her husband is present. It is assumed that all passengers disembark from the boat before the next trip, and at least one person has to be in the boat for each crossing.
Had to edit out code, was requested by professor.
I've been working on this problem for 6 hours and I am stumped. My professor is busy and cannot help.
I took a careful look at your code. It is indeed a very interesting problem and quite complex. After a while I realized that what may be causing your problem is that you are checking the conditions before the crossing is made, and not afterwards. I saw the template you provided, and I suggest sticking to the logic it proposes: 1) make the actions() method return all possible crossings (without checking the states yet); 2) given each action, get the corresponding new state and check whether that state is valid; 3) make the value() method check whether we are making progress on the optimization.
class Problem:
    def __init__(self, initial_state, goal):
        self.goal = goal
        self.record = [[0, initial_state, "LEFT", []]]
        # list of results [score][state][boat_side][listActions]

    def actions(self, state, boat_side):
        side = 0 if boat_side == 'LEFT' else 1
        boat_dir = 'RIGHT' if boat_side == 'LEFT' else 'LEFT'
        group = [i for i, v in enumerate(state) if v == side]
        onboard_2 = [[boat_dir, a, b] for a in group for b in group if
                     a < b and (                      # not the same person, and each pair only once
                         (a % 2 == 0 and b - a == 1) or  # wife and husband
                         (a % 2 == 0 and b % 2 == 0) or  # two wives
                         (a % 2 == 1 and b % 2 == 1)     # two husbands
                     )]
        onboard_1 = [[boat_dir, a] for a in group]
        return onboard_1 + onboard_2

    def result(self, state, action):
        new_boat_side = action[0]
        new_state = []
        for i, v in enumerate(state):
            if i in action[1:]:
                new_state.append(1 if v == 0 else 0)
            else:
                new_state.append(v)
        # check if invalid
        for p, side in enumerate(new_state):
            if p % 2 == 0:  # is a woman
                if side != new_state[p + 1]:  # not with her husband
                    if any(man == side for man in new_state[1::2]):
                        new_state = False
                        break
        return new_state, new_boat_side

    def goal_test(self, state):
        return state == self.goal

    def value(self, state):
        # how many people have already crossed
        return state.count(1)
# optimization process
initial_state = [0] * 6
goal = [1] * 6
task = Problem(initial_state, goal)
while True:
    batch_result = []
    for score, state, side, l_a in task.record:
        possible_actions = task.actions(state, side)
        for a in possible_actions:
            new_state, new_boat_side = task.result(state, a)
            if new_state:  # is a valid state
                batch_result.append([
                    task.value(new_state),
                    new_state,
                    new_boat_side,
                    l_a + a,
                ])
    batch_result.sort(key=lambda x: x[0], reverse=True)
    # sort the results so the most people crossed comes first
    task.record = batch_result[:5]
    # keeping only the best 5 results, but
    # any number should be fine for this problem
    if task.goal_test(task.record[0][1]):
        break
    # for i in task.record[:5]:  # uncomment these lines to see full progress
    #     print(i)
    # x = input()  # press any key to continue
print(task.record[0][3])
I hope it helped. Please feel free to say if anything is still unclear.

How do I find shortest path in maze with BFS?

I am trying to find a way to solve a maze. My teacher said I have to use BFS as a way to learn. So I made the algorithm itself, but I don't understand how to get the shortest path out of it. I have looked at other people's code, and they said that backtracking is the way to do it. How does this backtracking work, and what do you backtrack?
I will give my code as well, because I would like some feedback on it, in case I made mistakes:
def main(self, r, c):
    running = True
    self.queue.append((r, c))
    while running:
        if len(self.queue) > 0:
            self.current = self.queue[0]
            if self.maze[self.current[0] - 1][self.current[1]] == ' ' and not (self.current[0] - 1, self.current[1])\
                    in self.visited and not (self.current[0] - 1, self.current[1]) in self.queue:
                self.queue.append((self.current[0] - 1, self.current[1]))
            elif self.maze[self.current[0] - 1][self.current[1]] == 'G':
                return self.path
            if self.maze[self.current[0]][self.current[1] + 1] == ' ' and not (self.current[0], self.current[1] + 1) in self.visited\
                    and not (self.current[0], self.current[1] + 1) in self.queue:
                self.queue.append((self.current[0], self.current[1] + 1))
            elif self.maze[self.current[0]][self.current[1] + 1] == 'G':
                return self.path
            if self.maze[self.current[0] + 1][self.current[1]] == ' ' and not (self.current[0] + 1, self.current[1]) in self.visited\
                    and not (self.current[0] + 1, self.current[1]) in self.queue:
                self.queue.append((self.current[0] + 1, self.current[1]))
            elif self.maze[self.current[0] + 1][self.current[1]] == 'G':
                return self.path
            if self.maze[self.current[0]][self.current[1] - 1] == ' ' and not (self.current[0], self.current[1] - 1) in self.visited\
                    and not (self.current[0], self.current[1] - 1) in self.queue:
                self.queue.append((self.current[0], self.current[1] - 1))
            elif self.maze[self.current[0]][self.current[1] - 1] == 'G':
                return self.path
            self.visited.append((self.current[0], self.current[1]))
            del self.queue[0]
            self.path.append(self.queue[0])
As maze I use something like this:
############
# S #
##### ######
# #
######## ###
# #
## ##### ###
# G#
############
Which is stored in a matrix
What I eventually want is just the shortest path inside a list as output.
Since this is a coding assignment, I'll leave the code to you and simply explain the general algorithm here.
You have an n by m grid. I am assuming this is provided to you. You can store it in a two-dimensional array.
Step 1) Create a new two dimensional array the same size as the grid and populate each entry with an invalid coordinate (up to you, maybe use None or another value you can use to indicate that a path to that coordinate has not yet been discovered). I will refer to this two dimensional array as your path matrix and the maze as your grid.
Step 2) Enqueue the starting coordinate and update the path matrix at that position (for example, update matrix[1,1] if coordinate (1,1) is your starting position).
Step 3) If not at the final coordinate, dequeue an element from the queue. For each possible direction from the dequeued coordinate, check if it is valid (no walls AND the coordinate does not exist in the matrix yet), and enqueue all valid coordinates.
Step 4) Repeat Step 3.
If there is a path to your final coordinate, you will not only find it with this algorithm but it will also be a shortest path. To backtrack, check your matrix at the location of your final coordinate. This should lead you to another coordinate. Continue this process and backtrack until you arrive at the starting coordinate. If you store this list of backtracked coordinates then you will have a path in reverse.
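The four steps above can be sketched as follows. This is a minimal illustration on a small hard-coded maze; the names (shortest_path, prev) are my own, not part of the assignment:

```python
from collections import deque

def shortest_path(grid, start, goal):
    """BFS that records, for every reached cell, the cell it was reached
    from (Step 1's 'path matrix'), then backtracks from the goal."""
    rows, cols = len(grid), len(grid[0])
    prev = [[None] * cols for _ in range(rows)]  # Step 1: path matrix, None = undiscovered
    prev[start[0]][start[1]] = start             # Step 2: mark the starting position
    queue = deque([start])
    while queue:                                 # Steps 3-4: expand until done
        r, c = queue.popleft()
        if (r, c) == goal:
            break
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            # valid = in bounds, not a wall, not yet in the path matrix
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] != '#' and prev[nr][nc] is None:
                prev[nr][nc] = (r, c)
                queue.append((nr, nc))
    if prev[goal[0]][goal[1]] is None:
        return None                              # goal unreachable
    # backtrack: follow the path matrix from the goal to the start, then reverse
    path, cell = [], goal
    while cell != start:
        path.append(cell)
        cell = prev[cell[0]][cell[1]]
    path.append(start)
    path.reverse()
    return path

maze = ["#####",
        "#S..#",
        "#.#.#",
        "#..G#",
        "#####"]
print(shortest_path(maze, (1, 1), (3, 3)))  # -> [(1, 1), (2, 1), (3, 1), (3, 2), (3, 3)]
```

The path matrix doubles as the visited set: a cell with a non-None entry has already been discovered, and because BFS discovers cells in order of distance, the recorded predecessor always lies on a shortest path.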
The main problem in your code is this line:
self.path.append(self.queue[0])
This will just keep adding to the path while you go in all possible directions in a BFS way. This path will end up getting all coordinates that you visit, which is not really a "path", because with BFS you continually switch to a different branch in the search, and so you end up collecting positions that are quite unrelated.
You need to build the path in a different way. A memory efficient way of doing this is to track where you come from when visiting a node. You can use the visited variable for that, but then make it a dictionary, which for each r,c pair stores the r,c pair from which the cell was visited. It is like building a linked list. From each newly visited cell you'll be able to find back where you came from, all the way back to the starting cell. So when you find the target, you can build the path from this linked list.
Some other less important problems in your code:
You don't check whether a coordinate is valid. If the grid is bounded completely by # characters, this is not really a problem, but if you would have a gap at the border, you'd get an exception
There is code repetition for each of the four directions. Try to avoid such repetition, and store recurrent expressions like self.current[1] - 1 in a variable, and create a loop over the four possible directions.
The variable running makes no sense: it never becomes False. Instead make your loop condition what currently is your next if condition. As long as the queue is not empty, continue. If the queue becomes empty then that means there is no path to the target.
You store every bit of information in attributes on self. You should only do that for information that is still relevant after the search. Instead, create local variables for queue, visited, current, ...etc.
Here is how the code could look:
class Maze():
    def __init__(self, maze_str):  # renamed from str to avoid shadowing the builtin
        self.maze = maze_str.splitlines()

    def get_start(self):
        row = next(i for i, line in enumerate(self.maze) if "S" in line)
        col = self.maze[row].index("S")
        return row, col

    def main(self, r, c):
        queue = []  # use a local variable, not a member
        visited = {}  # use a dict: key = coordinate tuple, value = previous location
        visited[(r, c)] = (-1, -1)
        queue.append((r, c))
        while len(queue) > 0:  # don't use running as variable
            # no need to use current; just reuse r and c:
            r, c = queue.pop(0)  # you can remove immediately from queue
            if self.maze[r][c] == 'G':
                # build path by walking backwards through the visited information
                path = []
                while r != -1:
                    path.append((r, c))
                    r, c = visited[(r, c)]
                path.reverse()
                return path
            # avoid repetition of code: make a loop
            for dr, dc in ((-1, 0), (0, -1), (1, 0), (0, 1)):
                new_r = r + dr
                new_c = c + dc
                if (0 <= new_r < len(self.maze) and
                        0 <= new_c < len(self.maze[new_r]) and
                        (new_r, new_c) not in visited and
                        self.maze[new_r][new_c] != '#'):
                    visited[(new_r, new_c)] = (r, c)
                    queue.append((new_r, new_c))
maze = Maze("""############
#  S       #
##### ######
#          #
######## ###
#          #
## ##### ###
#         G#
############""")
path = maze.main(*maze.get_start())
print(path)
See it run on repl.it

Why is shallow copy needed for my values dictionary to correctly update?

I am working on an Agent class in Python 2.7.11 that uses a Markov Decision Process (MDP) to search for an optimal policy π in a GridWorld. I am implementing a basic value iteration for 100 iterations of all GridWorld states using the following Bellman Equation:
Vk+1(s) = max_a Σ_s' T(s,a,s')[R(s,a,s') + γ·Vk(s')]
where:
T(s,a,s') is the probability function of successfully transitioning to successor state s' from current state s by taking action a.
R(s,a,s') is the reward for transitioning from s to s'.
γ (gamma) is the discount factor, where 0 ≤ γ ≤ 1.
Vk(s') is the previous iteration's value for the successor state s', applied recursively.
Vk+1(s) expresses that, after enough iterations k have occurred, the value Vk converges and becomes equivalent to Vk+1.
This equation is derived from taking the maximum of a Q value function, which is what I am using within my program:
Q(s,a) = Σ_s' T(s,a,s')[R(s,a,s') + γ·Vk(s')]
When constructing my Agent, it is passed an MDP, which is an abstract class containing the following methods:
# Returns all states in the GridWorld
def getStates()
# Returns all legal actions the agent can take given the current state
def getPossibleActions(state)
# Returns all possible successor states to transition to from the current state
# given an action, and the probability of reaching each with that action
def getTransitionStatesAndProbs(state, action)
# Returns the reward of going from the current state to the successor state
def getReward(state, action, nextState)
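For illustration only, a minimal hypothetical MDP satisfying this interface could be a deterministic three-state chain (this class and its numbers are made up, not from the actual GridWorld):

```python
class ToyMDP:
    """A tiny deterministic chain: states 0 -> 1 -> 2, where 2 is terminal."""
    def getStates(self):
        return [0, 1, 2]

    def getPossibleActions(self, state):
        # only one legal action until the terminal state is reached
        return ['right'] if state < 2 else []

    def getTransitionStatesAndProbs(self, state, action):
        # deterministic: a single successor, reached with probability 1
        return [(state + 1, 1.0)]

    def getReward(self, state, action, nextState):
        # reward only for entering the terminal state
        return 1.0 if nextState == 2 else 0.0
```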
My Agent is also passed a discount factor, and a number of iterations. I am also making use of a dictionary to keep track of my values. Here is my code:
class IterationAgent:
    def __init__(self, mdp, discount=0.9, iterations=100):
        self.mdp = mdp
        self.discount = discount
        self.iterations = iterations
        self.values = util.Counter()  # a Counter is a dictionary with default 0
        for i in range(0, self.iterations, 1):
            states = self.mdp.getStates()
            valuesCopy = self.values.copy()
            for state in states:
                legalMoves = self.mdp.getPossibleActions(state)
                convergedValue = 0
                for move in legalMoves:
                    value = self.computeQValueFromValues(state, move)
                    if convergedValue <= value or convergedValue == 0:
                        convergedValue = value
                valuesCopy.update({state: convergedValue})
            self.values = valuesCopy

    def computeQValueFromValues(self, state, action):
        successors = self.mdp.getTransitionStatesAndProbs(state, action)
        qValue = 0
        for successor, probability in successors:
            # the reward depends on the particular successor state
            reward = self.mdp.getReward(state, action, successor)
            # The Q value equation: Q(s,a) = sum over s' of T(s,a,s')[R(s,a,s') + gamma * V(s')]
            qValue += probability * (reward + (self.discount * self.values[successor]))
        return qValue
This implementation is correct, though I am unsure why I need valuesCopy to accomplish a successful update to my self.values dictionary. I have tried the following to omit the copying, but it does not work since it returns slightly incorrect values:
for i in range(0, self.iterations, 1):
    states = self.mdp.getStates()
    for state in states:
        legalMoves = self.mdp.getPossibleActions(state)
        convergedValue = 0
        for move in legalMoves:
            value = self.computeQValueFromValues(state, move)
            if convergedValue <= value or convergedValue == 0:
                convergedValue = value
        self.values.update({state: convergedValue})
My question is why is including a copy of my self.values dictionary necessary to update my values correctly when valuesCopy = self.values.copy() makes a copy of the dictionary anyways every iteration? Shouldn't updating the values in the original result in the same update?
There's an algorithmic difference in having or not having the copy:
# You update your copy here, so the original will be used unchanged, which is not the
# case if you don't have the copy
valuesCopy.update({state: convergedValue})
# If you have the copy, you'll be using the old value stored in self.value here,
# not the updated one
qValue += probability * (reward + (self.discount * self.values[successor]))
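The difference can be shown with a tiny made-up example (two states, each update reading the other state's value; the rewards and the missing discount are just for illustration):

```python
# Sweep with a copy: every update reads the values from the *previous* sweep.
values = {0: 0.0, 1: 0.0}
copy = dict(values)
copy[0] = 1.0 + values[1]   # reads the old values[1] == 0.0
copy[1] = 1.0 + values[0]   # also reads the old values[0] == 0.0
print(copy)                 # {0: 1.0, 1: 1.0}

# Sweep in place: later updates see the results of earlier ones.
values = {0: 0.0, 1: 0.0}
values[0] = 1.0 + values[1]  # 1.0
values[1] = 1.0 + values[0]  # reads the *new* values[0], so 2.0
print(values)                # {0: 1.0, 1: 2.0}
```

Both variants still converge in ordinary value iteration, but after a fixed number of iterations they produce different numbers, which matches the "slightly incorrect values" you observed without the copy.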
