Traceback in Needleman-Wunsch global alignment without storing pointer

Traceback in Needleman-Wunsch global alignment without storing pointer - python

My understanding is that while basically every discussion of dynamic programming I can find has one store the pointers as the matrix is populated, it is faster to instead to re-calculate the previous cells during the traceback step instead.
I have my dynamic programming algorithm to build the matrix correctly as far as I can tell, but I am confused on how to do the traceback calculations. I also have been told that it is necessary to recalculate the values (instead of just looking them up) but I don't see how that will come up with different numbers.
The version of SW I am implementing includes an option for gaps in both sequences to open up, so the recurrence relation for each matrix has three options. Below is the current version of my global alignment class. From my hand calculations I believe that score_align properly generates the matrix, but obviously traceback_col_seq does not work.
INF = 2147483647 #max size of int32
class global_aligner():
def __init__(self, subst, open=10, extend=2, double=3):
self.extend, self.open, self.double, self.subst = extend, open, double, subst
def __call__(self, row_seq, col_seq):
#add alphabet error checking?
score_align(row_seq, col_seq)
return traceback_col_seq()
def init_array(self):
self.M = zeros((self.maxI, self.maxJ), int)
self.Ic = zeros((self.maxI, self.maxJ), int)
self.Ir = zeros((self.maxI, self.maxJ), int)
for i in xrange(self.maxI):
self.M[i][0], self.Ir[i][0], self.Ic[i][0] = \
-INF, -INF, -(self.open+self.extend*i)
for j in xrange(self.maxJ):
self.M[0][j], self.Ic[0][j], self.Ir[0][j] = \
-INF, -INF, -(self.open+self.extend*j)
self.M[0][0] = 0
self.Ic[0][0] = -self.open
def score_cell(self, i, j, chars):
thisM = [self.Ic[i-1][j-1]+self.subst[chars], self.M[i-1][j-1]+\
self.subst[chars], self.Ir[i-1][j-1]+self.subst[chars]]
thisC = [self.Ic[i][j-1]-self.extend, self.M[i][j-1]-self.open, \
self.Ir[i][j-1]-self.double]
thisR = [self.M[i-1][j]-self.open, self.Ir[i-1][j]-self.extend, \
self.Ic[i-1][j]-self.double]
return max(thisM), max(thisC), max(thisR)
def score_align(self, row_seq, col_seq):
self.row_seq, self.col_seq = list(row_seq), list(col_seq)
self.maxI, self.maxJ = len(self.row_seq)+1, len(self.col_seq)+1
self.init_array()
for i in xrange(1, self.maxI):
row_char = self.row_seq[i-1]
for j in xrange(1, self.maxJ):
chars = row_char+self.col_seq[j-1]
self.M[i][j], self.Ic[i][j], self.Ir[i][j] = \
self.score_cell(i, j, chars)
def traceback_col_seq(self):
self.traceback = list()
i, j = self.maxI-1, self.maxJ-1
while i > 1 and j > 1:
cell = [self.M[i][j], self.Ic[i][j], self.Ir[i][j]]
cellMax = max(cell)
chars = self.row_seq[i-1]+self.col_seq[j-1]
if cell.index(cellMax) == 0: #M
diag = [diagM, diagC, diagR] = self.score_cell(i-1, j-1, chars)
diagMax = max(diag)
if diag.index(diagMax) == 0: #match
self.traceback.append(self.col_seq[j-1])
elif diag.index(diagMax) == 1: #insert column (open)
self.traceback.append('-')
elif diag.index(diagMax) == 2: #insert row (open other)
self.traceback.append(self.col_seq[j-1].lower())
i, j = i-1, j-1
elif cell.index(cellMax) == 1: #Ic
up = [upM, upC, upR] = self.score_cell(i-1, j, chars)
upMax = max(up)
if up.index(upMax) == 0: #match (close)
self.traceback.append(self.col_seq[j-1])
elif up.index(upMax) == 1: #insert column (extend)
self.traceback.append('-')
elif up.index(upMax) == 2: #insert row (double)
self.traceback.append('-')
i -= 1
elif cell.index(cellMax) == 2: #Ir
left = [leftM, leftC, leftR] = self.score_cell(i, j-1, chars)
leftMax = max(left)
if left.index(leftMax) == 0: #match (close)
self.traceback.append(self.col_seq[j-1])
elif left.index(leftMax) == 1: #insert column (double)
self.traceback.append('-')
elif left.index(leftMax) == 2: #insert row (extend other)
self.traceback.append(self.col_seq[j-1].lower())
j -= 1
for j in xrange(0,j,-1):
self.traceback.append(self.col_seq[j-1])
for i in xrange(0,i, -1):
self.traceback.append('-')
return ''.join(self.traceback[::-1])
test = global_aligner(blosumMatrix)
test.score_align('AA','AAA')
test.traceback_col_seq()

I think the main problem is that you aren't taking the matrix that you're currently in into account when generating the cells that you could potentially have come from. cell = [self.M[i][j], self.Ic[i][j], self.Ir[i][j]] is right for the first time through the while loop, but after that you can't just choose the matrix that has the highest score. Your options are constrained by where you're coming from. I'm having a bit of trouble following your code, but I think you're taking that into account in the if statements in the while loop. If that's the case, then I think changes along the lines of these should be sufficient:
cell = [self.M[i][j], self.Ic[i][j], self.Ir[i][j]]
cellIndex = cell.index(max(cell))
while i > 1 and j > 1:
chars = self.row_seq[i-1]+self.col_seq[j-1]
if cellIndex == 0: #M
diag = [diagM, diagC, diagR] = self.score_cell(i-1, j-1, chars)
diagMax = max(diag)
...
cellIndex = diagMax
i, j = i-1, j-1
elif cell.index(cellMax) == 1: #Ic
up = [upM, upC, upR] = self.score_cell(i-1, j, chars)
upMax = max(up)
...
cellIndex = upMax
i -= 1
elif cell.index(cellMax) == 2: #Ir
left = [leftM, leftC, leftR] = self.score_cell(i, j-1, chars)
leftMax = max(left)
...
cellIndex = leftMax
j -= 1
Like I said, I'm not positive that I'm following your code correctly, but see if that helps.

Related

How to generate alternating substrings using recursion

I have a practice question that requires me to generate x number of alternating substrings, namely "#-" & "#--" using both recursion as well as iteration. Eg.string_iteration(3) generates "#-#--#-".
I have successfully implemented the solution for the iterative method,
but I'm having trouble getting started on the recursive method. How can I proceed?
Iterative method
def string_iteration(x):
odd_block = '#-'
even_block = '#--'
current_block = ''
if x == 0:
return ''
else:
for i in range(1,x+1):
if i % 2 != 0:
current_block += odd_block
elif i % 2 == 0:
current_block += even_block
i += 1
return current_block

For recursion, you almost always just need a base case and everything else. Here, your base case it pretty simple — when x < 1, you can return an empty string:
if x < 1:
return ''
After than you just need to return the block + the result of string_iteration(x-1). After than it's just a matter of deciding which block to choose. For example:
def string_iteration(x):
# base case
if x < 1:
return ''
blocks = ('#--', '#-')
# recursion
return string_iteration(x-1) + blocks[x % 2]
string_iteration(5)
# '#-#--#-#--#-'
This boils down to
string_iteration(1) + string_iteration(2) ... string_iteration(x)

The other answer doesn't give the same result as your iterative method. If you always want it to start with the odd block, you should add the block on the right of the recursive call instead of the left:
def string_recursion(x):
odd_block = '#-'
even_block = '#--'
if x == 0:
return ''
if x % 2 != 0:
return string_recursion(x - 1) + odd_block
elif x % 2 == 0:
return string_recursion(x - 1) + even_block

For recursive solution, you need a base case and calling the function again with some other value so that at the end you will have the desired output. Here, we can break this problem recursively like - string_recursive(x) = string_recursive(x-1) + string_recursive(x-2) + ... + string_recursive(1).
def string_recursion(x, parity):
final_str = ''
if x == 0:
return ''
if parity == -1: # when parity -1 we will add odd block
final_str += odd_block
elif parity == 1:
final_str += even_block
parity *= -1 # flip the parity every time
final_str += string_recursion(x-1, parity)
return final_str
odd_block = '#-'
even_block = '#--'
print(string_recursion(3, -1)) # for x=1 case we have odd parity, hence -1
# Output: #-#--#-

Slow performance in agent based model python

I originally posted this on code-review (hence the lengthy code) but failed to get an answer.
My model is based on this game https://en.wikipedia.org/wiki/Ultimatum_game . I won't go into the intuition behind it but generally speaking it functions as follows:
The game consists of a n x n lattice on which an agent is placed at each node.
During each time step, each player on each node plays against a random neighbour by playing a particular strategy.
Each of their strategies (a value between 1-9) has a propensity attached to it (which is randomly assigned and is just some number). The propensity then in turn determines the probability of playing that strategy. The probability is calculated as the propensity of that strategy over the sum of the propensities of all strategies.
If a game results in a positive payoff, then the payoffs from that game get added to the propensities for those strategies.
These propensities then determine the probabilities for their strategies in the next time step, and so on.
The simulation ends after time step N is reached.
For games with large lattices and large time steps, my code runs really really slowly. I ran cProfiler to check where the bottleneck(s) are, and as I suspected the update_probabilitiesand play_rounds functions seem to be taking up a lot time. I want to be able to run the game with gridsize of about 40x40 for about 100000+ time steps, but right now that is not happening.
So what would be a more efficient way to calculate and update the probabilities/propensities of each player in the grid? I've considered implementing NumPy arrays but I am not sure if it would be worth the hassle here?
import numpy as np
import random
from random import randint
from numpy.random import choice
from numpy.random import multinomial
import cProfile
mew = 0.001
error = 0.05
def create_grid(row, col):
return [[0 for j in range(col)] for i in range(row)]
def create_random_propensities():
propensities = {}
pre_propensities = [random.uniform(0, 1) for i in range(9)]
a = np.sum(pre_propensities)
for i in range(1, 10):
propensities[i] = (pre_propensities[i - 1]/a) * 10 # normalize sum of propensities to 10
return propensities
class Proposer:
def __init__(self):
self.propensities = create_random_propensities()
self.probabilites = []
self.demand = 0 # the amount the proposer demands for themselves
def pick_strat(self, n_trials): # gets strategy, an integer in the interval [1, 9]
results = multinomial(n_trials, self.probabilites)
i, = np.where(results == max(results))
if len(i) > 1:
return choice(i) + 1
else:
return i[0] + 1
def calculate_probability(self, dict_data, index, total_sum): # calculates probability for particular strat, taking propensity
return dict_data[index]/total_sum # of that strat as input
def calculate_sum(self, dict_data):
return sum(dict_data.values())
def initialize(self):
init_sum = self.calculate_sum(self.propensities)
for strategy in range(1, 10):
self.probabilites.append(self.calculate_probability(self.propensities, strategy, init_sum))
self.demand = self.pick_strat(1)
def update_strategy(self):
self.demand = self.pick_strat(1)
def update_probablities(self):
for i in range(9):
self.propensities[1 + i] *= 1 - mew
pensity_sum = self.calculate_sum(self.propensities)
for i in range(9):
self.probabilites[i] = self.calculate_probability(self.propensities, 1 + i, pensity_sum)
def update(self):
self.update_probablities()
self.update_strategy()
class Responder: # methods same as proposer class, can skip-over
def __init__(self):
self.propensities = create_random_propensities()
self.probabilites = []
self.max_thresh = 0 # the maximum demand they are willing to accept
def pick_strat(self, n_trials):
results = multinomial(n_trials, self.probabilites)
i, = np.where(results == max(results))
if len(i) > 1:
return choice(i) + 1
else:
return i[0] + 1
def calculate_probability(self, dict_data, index, total_sum):
return dict_data[index]/total_sum
def calculate_sum(self, dict_data):
return sum(dict_data.values())
def initialize(self):
init_sum = self.calculate_sum(self.propensities)
for strategy in range(1, 10):
self.probabilites.append(self.calculate_probability(self.propensities, strategy, init_sum))
self.max_thresh = self.pick_strat(1)
def update_strategy(self):
self.max_thresh = self.pick_strat(1)
def update_probablities(self):
for i in range(9):
self.propensities[1 + i] *= 1 - mew # stops sum of propensites from growing without bound
pensity_sum = self.calculate_sum(self.propensities)
for i in range(9):
self.probabilites[i] = self.calculate_probability(self.propensities, 1 + i, pensity_sum)
def update(self):
self.update_probablities()
self.update_strategy()
class Agent:
def __init__(self):
self.prop_side = Proposer()
self.resp_side = Responder()
self.prop_side.initialize()
self.resp_side.initialize()
def update_all(self):
self.prop_side.update()
self.resp_side.update()
class Grid:
def __init__(self, rowsize, colsize):
self.rowsize = rowsize
self.colsize = colsize
def make_lattice(self):
return [[Agent() for j in range(self.colsize)] for i in range(self.rowsize)]
#staticmethod
def von_neumann_neighbourhood(array, row, col, wrapped=True): # gets up, bottom, left, right neighbours of some node
neighbours = set([])
if row + 1 <= len(array) - 1:
neighbours.add(array[row + 1][col])
if row - 1 >= 0:
neighbours.add(array[row - 1][col])
if col + 1 <= len(array[0]) - 1:
neighbours.add(array[row][col + 1])
if col - 1 >= 0:
neighbours.add(array[row][col - 1])
#if wrapped is on, conditions for out of bound points
if row - 1 < 0 and wrapped == True:
neighbours.add(array[-1][col])
if col - 1 < 0 and wrapped == True:
neighbours.add(array[row][-1])
if row + 1 > len(array) - 1 and wrapped == True:
neighbours.add(array[0][col])
if col + 1 > len(array[0]) - 1 and wrapped == True:
neighbours.add(array[row][0])
return neighbours
def get_error_term(pay, strategy):
index_strat_2, index_strat_8 = 2, 8
if strategy == 1:
return (1 - (error/2)) * pay, error/2 * pay, index_strat_2
if strategy == 9:
return (1 - (error/2)) * pay, error/2 * pay, index_strat_8
else:
return (1 - error) * pay, error/2 * pay, 0
class Games:
def __init__(self, n_rows, n_cols, n_rounds):
self.rounds = n_rounds
self.rows = n_rows
self.cols = n_cols
self.lattice = Grid(self.rows, self.cols).make_lattice()
self.lookup_table = np.full((self.rows, self.cols), False, dtype=bool) # if player on grid has updated their strat, set to True
def reset_look_tab(self):
self.lookup_table = np.full((self.rows, self.cols), False, dtype=bool)
def run_game(self):
n = 0
while n < self.rounds:
for r in range(self.rows):
for c in range(self.cols):
if n != 0:
self.lattice[r][c].update_all()
self.lookup_table[r][c] = True
self.play_rounds(self.lattice, r, c)
self.reset_look_tab()
n += 1
def play_rounds(self, grid, row, col):
neighbours = Grid.von_neumann_neighbourhood(grid, row, col)
neighbour = random.sample(neighbours, 1).pop()
neighbour_index = [(ix, iy) for ix, row in enumerate(self.lattice) for iy, i in enumerate(row) if i == neighbour]
if self.lookup_table[neighbour_index[0][0]][neighbour_index[0][1]] == False: # see if neighbour has already updated their strat
neighbour.update_all()
player = grid[row][col]
coin_toss = randint(0, 1) # which player acts as proposer or responder in game
if coin_toss == 1:
if player.prop_side.demand <= neighbour.resp_side.max_thresh: # postive payoff
payoff, adjacent_payoff, index = get_error_term(player.prop_side.demand, player.prop_side.demand)
if player.prop_side.demand == 1 or player.prop_side.demand == 9: # extreme strategies get bonus payoffs
player.prop_side.propensities[player.prop_side.demand] += payoff
player.prop_side.propensities[index] += adjacent_payoff
else:
player.prop_side.propensities[player.prop_side.demand] += payoff
player.prop_side.propensities[player.prop_side.demand - 1] += adjacent_payoff
player.prop_side.propensities[player.prop_side.demand + 1] += adjacent_payoff
else:
return 0 # if demand > max thresh -> both get zero
if coin_toss != 1:
if neighbour.prop_side.demand <= player.resp_side.max_thresh:
payoff, adjacent_payoff, index = get_error_term(10 - neighbour.prop_side.demand, player.resp_side.max_thresh)
if player.resp_side.max_thresh == 1 or player.resp_side.max_thresh == 9:
player.resp_side.propensities[player.resp_side.max_thresh] += payoff
player.resp_side.propensities[index] += adjacent_payoff
else:
player.resp_side.propensities[player.resp_side.max_thresh] += payoff
player.resp_side.propensities[player.resp_side.max_thresh - 1] += adjacent_payoff
player.resp_side.propensities[player.resp_side.max_thresh + 1] += adjacent_payoff
else:
return 0
#pr = cProfile.Profile()
#pr.enable()
my_game = Games(10, 10, 2000) # (rowsize, colsize, n_steps)
my_game.run_game()
#pr.disable()
#pr.print_stats(sort='time')
(For those who might be wondering, the get_error_term just returns the propensities for strategies that are next to strategies that receive a positive payoff, for example if the strategy 8 works, then 7 and 9's propensities also get adjusted upwards and this is calculated by said function. And the first for loop inside update_probabilities just makes sure that the sum of propensities don't grow without bound).

Python: N Puzzle (8-Puzzle) Solver Heruistics: How to iterate?

I have written a selection of functions to try and solve a N-puzzle / 8-puzzle.
I am quite content with my ability to manipulate the puzzle but am struggling with how to iterate and find the best path. My skills are not in OOP either and so the functions are simple.
The idea is obviously to reduce the heruistic distance and place all pieces in their desired locations.
I have read up a lot of other questions regarding this topic but they're often more advanced and OOP focused.
When I try and iterate through there are no good moves. I'm not sure how to perform the A* algorithm.
from math import sqrt, fabs
import copy as cp
# Trial puzzle
puzzle1 = [
[3,5,4],
[2,1,0],
[6,7,8]]
# This function is used minimise typing later
def starpiece(piece):
'''Checks the input of a *arg and returns either tuple'''
if piece == ():
return 0
elif isinstance(piece[0], (str, int)) == True:
return piece[0]
elif isinstance(piece[0], (tuple, list)) and len(piece[0]) == 2:
return piece[0]
# This function creates the goal puzzle layout
def goal(puzzle):
'''Input a nested list and output an goal list'''
n = len(puzzle) * len(puzzle)
goal = [x for x in range(1,n)]
goal.append(0)
nested_goal = [goal[i:i+len(puzzle)] for i in range(0, len(goal), len(puzzle))]
return nested_goal
# This fuction gives either the coordinates (as a tuple) of a piece in the puzzle
# or the piece in the puzzle at give coordinates
def search(puzzle, *piece):
'''Input a puzzle and piece value and output a tuple of coordinates.
If no piece is selected 0 is chosen by default. If coordinates are
entered the piece value at those coordinates are outputed'''
piece = starpiece(piece)
if isinstance(piece, (tuple, list)) == True:
return puzzle[piece[0]][piece[1]]
for slice1, sublist in enumerate(puzzle):
for slice2, item in enumerate(sublist):
if puzzle[slice1][slice2] == piece:
x, y = slice1, slice2
return (x, y)
# This function gives the neighbours of a piece at a given position as a list of coordinates
def neighbours(puzzle, *piece):
'''Input a position (as a tuple) or piece and output a list
of adjacent neighbours. Default are the neighbours to 0'''
length = len(puzzle) - 1
return_list = []
piece = starpiece(piece)
if isinstance(piece, tuple) != True:
piece = search(puzzle, piece)
if (piece[0] - 1) >= 0:
x_minus = (piece[0] - 1)
return_list.append((x_minus, piece[1]))
if (piece[0] + 1) <= length:
x_plus = (piece[0] + 1)
return_list.append((x_plus, piece[1]))
if (piece[1] - 1) >= 0:
y_minus = (piece[1] - 1)
return_list.append((piece[0], y_minus))
if (piece[1] + 1) <= length:
y_plus = (piece[1] + 1)
return_list.append((piece[0], y_plus))
return return_list
# This function swaps piece values of adjacent cells
def swap(puzzle, cell1, *cell2):
'''Moves two cells, if adjacent a swap occurs. Default value for cell2 is 0.
Input either a cell value or cell cooridinates'''
cell2 = starpiece(cell2)
if isinstance(cell1, (str, int)) == True:
cell1 = search(puzzle, cell1)
if isinstance(cell2, (str, int)) == True:
cell2 = search(puzzle, cell2)
puzzleSwap = cp.deepcopy(puzzle)
if cell1 == cell2:
print('Warning: no swap occured as both cell values were {}'.format(search(puzzle,cell1)))
return puzzleSwap
elif cell1 in neighbours(puzzleSwap, cell2):
puzzleSwap[cell1[0]][cell1[1]], puzzleSwap[cell2[0]][cell2[1]] = puzzleSwap[cell2[0]][cell2[1]], puzzleSwap[cell1[0]][cell1[1]]
return puzzleSwap
else:
print('''Warning: no swap occured as cells aren't adjacent''')
return puzzleSwap
# This function gives true if a piece is in it's correct position
def inplace(puzzle, p):
'''Ouputs bool on whether a piece is in it's correct position'''
if search(puzzle, p) == search(goal(puzzle), p):
return True
else:
return False
# These functions give heruistic measurements
def heruistic(puzzle):
'''All returns heruistic (misplaced, total distance) as a tuple. Other
choices are: heruistic misplaced, heruistic distance or heruistic list'''
heruistic_misplaced = 0
heruistic_distance = 0
heruistic_distance_total = 0
heruistic_list = []
for sublist in puzzle:
for item in sublist:
if inplace(puzzle, item) == False:
heruistic_misplaced += 1
for sublist in puzzle:
for item in sublist:
a = search(puzzle, item)
b = search(goal(puzzle), item)
heruistic_distance = int(fabs(a[0] - b[0]) + fabs(a[1] - b[1]))
heruistic_distance_total += heruistic_distance
heruistic_list.append(heruistic_distance)
return (heruistic_misplaced, heruistic_distance_total, heruistic_list)
def hm(puzzle):
'''Outputs heruistic misplaced'''
return heruistic(puzzle)[0]
def hd(puzzle):
'''Outputs total heruistic distance'''
return heruistic(puzzle)[1]
def hl(puzzle):
'''Outputs heruistic list'''
return heruistic(puzzle)[2]
def hp(puzzle, p):
'''Outputs heruistic distance at a given location'''
x, y = search(puzzle, p)[0], search(puzzle, p)[1]
return heruistic(puzzle)[2][(x * len(puzzle)) + y]
# This is supposted to iterate along a route according to heruistics but doesn't work
def iterMove(puzzle):
state = cp.deepcopy(puzzle)
while state != goal(puzzle):
state_hd = hd(state)
state_hm = hm(state)
moves = neighbours(state)
ok_moves = []
good_moves = []
for move in moves:
maybe_state = swap(state, move)
if hd(maybe_state) < state_hd and hm(maybe_state) < state_hm:
good_moves.append(move)
elif hd(maybe_state) < state_hd:
ok_moves.append(move)
elif hm(maybe_state) < state_hm:
ok_moves.append(move)
if good_moves != []:
print(state)
state = swap(state, good_moves[0])
elif ok_moves != []:
print(state)
state = swap(state, ok_moves[0])
>> iterMove(puzzle1)
'no good moves'

To implement A* in Python you can use https://docs.python.org/3/library/heapq.html for a priority queue. You put possible positions into the queue with a priority of "cost so far + heuristic for remaining cost". When you take them out of the queue you check a set of already seen positions. Skip this one if you've seen the position, else add it to the set and then process.
An untested version of the critical piece of code:
queue = [(heuristic(starting_position), 0, starting_position, None)]
while 0 < len(queue):
(est_moves, cur_moves, position, history) = heapq.heappop(queue)
if position in seen:
continue
elif position = solved:
return history
else:
seen.add(position)
for move in possible_moves(position):
next_position = position_after_move(position, move)
est_moves = cur_moves + 1 + heuristic(next_position)
heapq.heappush(queue,
(est_moves, cur_moves+1,
next_position, (move, history)))
return None

Implementing Knuth-Morris-Pratt (KMP) algorithm for string matching with Python

I am following Cormen Leiserson Rivest Stein (clrs) book and came across "kmp algorithm" for string matching. I implemented it using Python (as-is).
However, it doesn't seem to work for some reason. where is my fault?
The code is given below:
def kmp_matcher(t,p):
n=len(t)
m=len(p)
# pi=[0]*n;
pi = compute_prefix_function(p)
q=-1
for i in range(n):
while(q>0 and p[q]!=t[i]):
q=pi[q]
if(p[q]==t[i]):
q=q+1
if(q==m):
print "pattern occurs with shift "+str(i-m)
q=pi[q]
def compute_prefix_function(p):
m=len(p)
pi =range(m)
pi[1]=0
k=0
for q in range(2,m):
while(k>0 and p[k]!=p[q]):
k=pi[k]
if(p[k]==p[q]):
k=k+1
pi[q]=k
return pi
t = 'brownfoxlazydog'
p = 'lazy'
kmp_matcher(t,p)

This is a class I wrote based on CLRs KMP algorithm, which contains what you are after. Note that only DNA "characters" are accepted here.
class KmpMatcher(object):
def __init__(self, pattern, string, stringName):
self.motif = pattern.upper()
self.seq = string.upper()
self.header = stringName
self.prefix = []
self.validBases = ['A', 'T', 'G', 'C', 'N']
#Matches the motif pattern against itself.
def computePrefix(self):
#Initialize prefix array
self.fillPrefixList()
k = 0
for pos in range(1, len(self.motif)):
#Check valid nt
if(self.motif[pos] not in self.validBases):
self.invalidMotif()
#Unique base in motif
while(k > 0 and self.motif[k] != self.motif[pos]):
k = self.prefix[k]
#repeat in motif
if(self.motif[k] == self.motif[pos]):
k += 1
self.prefix[pos] = k
#Initialize the prefix list and set first element to 0
def fillPrefixList(self):
self.prefix = [None] * len(self.motif)
self.prefix[0] = 0
#An implementation of the Knuth-Morris-Pratt algorithm for linear time string matching
def kmpSearch(self):
#Compute prefix array
self.computePrefix()
#Number of characters matched
match = 0
found = False
for pos in range(0, len(self.seq)):
#Check valid nt
if(self.seq[pos] not in self.validBases):
self.invalidSequence()
#Next character is not a match
while(match > 0 and self.motif[match] != self.seq[pos]):
match = self.prefix[match-1]
#A character match has been found
if(self.motif[match] == self.seq[pos]):
match += 1
#Motif found
if(match == len(self.motif)):
print(self.header)
print("Match found at position: " + str(pos-match+2) + ':' + str(pos+1))
found = True
match = self.prefix[match-1]
if(found == False):
print("Sorry '" + self.motif + "'" + " was not found in " + str(self.header))
#An invalid character in the motif message to the user
def invalidMotif(self):
print("Error: motif contains invalid DNA nucleotides")
exit()
#An invalid character in the sequence message to the user
def invalidSequence(self):
print("Error: " + str(self.header) + "sequence contains invalid DNA nucleotides")
exit()

You might want to try out my code:
def recursive_find_match(i, j, pattern, pattern_track):
if pattern[i] == pattern[j]:
pattern_track.append(i+1)
return {"append":pattern_track, "i": i+1, "j": j+1}
elif pattern[i] != pattern[j] and i == 0:
pattern_track.append(i)
return {"append":pattern_track, "i": i, "j": j+1}
else:
i = pattern_track[i-1]
return recursive_find_match(i, j, pattern, pattern_track)
def kmp(str_, pattern):
len_str = len(str_)
len_pattern = len(pattern)
pattern_track = []
if len_pattern == 0:
return
elif len_pattern == 1:
pattern_track = [0]
else:
pattern_track = [0]
i = 0
j = 1
while j < len_pattern:
data = recursive_find_match(i, j, pattern, pattern_track)
i = data["i"]
j = data["j"]
pattern_track = data["append"]
index_str = 0
index_pattern = 0
match_from = -1
while index_str < len_str:
if index_pattern == len_pattern:
break
if str_[index_str] == pattern[index_pattern]:
if index_pattern == 0:
match_from = index_str
index_pattern += 1
index_str += 1
else:
if index_pattern == 0:
index_str += 1
else:
index_pattern = pattern_track[index_pattern-1]
match_from = index_str - index_pattern

Try this:
def kmp_matcher(t, d):
n=len(t)
m=len(d)
pi = compute_prefix_function(d)
q = 0
i = 0
while i < n:
if d[q]==t[i]:
q=q+1
i = i + 1
else:
if q != 0:
q = pi[q-1]
else:
i = i + 1
if q == m:
print "pattern occurs with shift "+str(i-q)
q = pi[q-1]
def compute_prefix_function(p):
m=len(p)
pi =range(m)
k=1
l = 0
while k < m:
if p[k] <= p[l]:
l = l + 1
pi[k] = l
k = k + 1
else:
if l != 0:
l = pi[l-1]
else:
pi[k] = 0
k = k + 1
return pi
t = 'brownfoxlazydog'
p = 'lazy'
kmp_matcher(t, p)

KMP stands for Knuth-Morris-Pratt it is a linear time string-matching algorithm.
Note that in python, the string is ZERO BASED, (while in the book the string starts with index 1).
So we can workaround this by inserting an empty space at the beginning of both strings.
This causes four facts:
The len of both text and pattern is augmented by 1, so in the loop range, we do NOT have to insert the +1 to the right interval. (note that in python the last step is excluded);
To avoid accesses out of range, you have to check the values of k+1 and q+1 BEFORE to give them as index to arrays;
Since the length of m is augmented by 1, in kmp_matcher, before to print the response, you have to check this instead: q==m-1;
For the same reason, to calculate the correct shift you have to compute this instead: i-(m-1)
so the correct code, based on your original question, and considering the starting code from Cormen, as you have requested, would be the following:
(note : I have inserted a matching pattern inside, and some debug text that helped me to find logical errors):
def compute_prefix_function(P):
m = len(P)
pi = [None] * m
pi[1] = 0
k = 0
for q in range(2, m):
print ("q=", q, "\n")
print ("k=", k, "\n")
if ((k+1) < m):
while (k > 0 and P[k+1] != P[q]):
print ("entered while: \n")
print ("k: ", k, "\tP[k+1]: ", P[k+1], "\tq: ", q, "\tP[q]: ", P[q])
k = pi[k]
if P[k+1] == P[q]:
k = k+1
print ("Entered if: \n")
print ("k: ", k, "\tP[k]: ", P[k], "\tq: ", q, "\tP[q]: ", P[q])
pi[q] = k
print ("Outside while or if: \n")
print ("pi[", q, "] = ", k, "\n")
print ("---next---")
print ("---end for---")
return pi
def kmp_matcher(T, P):
n = len(T)
m = len(P)
pi = compute_prefix_function(P)
q = 0
for i in range(1, n):
print ("i=", i, "\n")
print ("q=", q, "\n")
print ("m=", m, "\n")
if ((q+1) < m):
while (q > 0 and P[q+1] != T[i]):
q = pi[q]
if P[q+1] == T[i]:
q = q+1
if q == m-1:
print ("Pattern occurs with shift", i-(m-1))
q = pi[q]
print("---next---")
print("---end for---")
txt = " bacbababaabcbab"
ptn = " ababaab"
kmp_matcher(txt, ptn)
(so this would be the correct accepted answer...)
hope that it helps.

KMP algorithm pattern calculation issue

Here is my code and test case. My question is, it seems the value of KMP pattern will never increase, since in last iteration, we checked pattern[i] != pattern[j], and in current round if (j == -1) or (pattern[j] == pattern[i]) cannot be true unless j == -1?
def findPattern(pattern):
j = -1
next = [-1] * len(pattern)
i = 0 # next[0] is always -1, by KMP definition
while (i+1 < len(pattern)):
if (j == -1) or (pattern[j] == pattern[i]):
i += 1
j += 1
if pattern[i] != pattern[j]:
next[i] = j
else:
next[i] = next[j]
else:
j = next[j]
return next
if __name__ == "__main__":
# print findPattern("aaaab")
print findPattern("abaabc")
thanks in advance,
Lin

You have already increased i and j, so you are actually checking pattern[i+1] != pattern[j+1] after if (j == -1) or (pattern[j] == pattern[i])

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Traceback in Needleman-Wunsch global alignment without storing pointer - python

Related

How to generate alternating substrings using recursion

Slow performance in agent based model python

Python: N Puzzle (8-Puzzle) Solver Heruistics: How to iterate?

Implementing Knuth-Morris-Pratt (KMP) algorithm for string matching with Python

KMP algorithm pattern calculation issue

Categories

Resources