Implementing Smith-Waterman algorithm for local alignment in python - python

I have created a sequence alignment tool to compare two strands of DNA (X and Y) to find the best alignment of substrings from X and Y. The algorithm is summarized here (https://en.wikipedia.org/wiki/Smith–Waterman_algorithm). I have been able to generate a lists of lists, filling them all with zeros, to represent my matrix. I created a scoring algorithm to return a numerical score for each kind of alignment between bases (eg. plus 4 for a match). Then I created an alignment algorithm that should put a score in each coordinate of my "matrix". However, when I go to print the matrix, it only returns the original with all zeros (rather than actual scores).
I know there are other methods of implementing this method (with numpy for example), so could you please tell me why this specific code (below) does not work? Is there a way to modify it, so that it does work?
code:
def zeros(X: int, Y: int):
lenX = len(X) + 1
lenY = len(Y) + 1
matrix = []
for i in range(lenX):
matrix.append([0] * lenY)
def score(X, Y):
if X[n] == Y[m]: return 4
if X[n] == '-' or Y[m] == '-': return -4
else: return -2
def SmithWaterman(X, Y, score):
for n in range(1, len(X) + 1):
for m in range(1, len(Y) + 1):
align = matrix[n-1, m-1] + (score(X[n-1], Y[m-1]))
indelX = matrix[n-1, m] + (score(X[n-1], Y[m]))
indelY = matrix[n, m-1] + (score(X[n], Y[m-1]))
matrix[n, m] = max(align, indelX, indelY, 0)
print(matrix)
zeros("ACGT", "ACGT")
output:
[[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]

The reason it's just printing out the zeroed out matrix is that the SmithWaterman function is never called, so the matrix is never updated.
You would need to do something like
# ...
SmithWaterman(X, Y, score)
print(matrix)
# ...
However, If you do this, you will find that this code is actually quite broken in many other ways. I've gone through and annotated some of the syntax errors and other issues with the code:
def zeros(X: int, Y: int):
# ^ ^ incorrect type annotations. should be str
lenX = len(X) + 1
lenY = len(Y) + 1
matrix = []
for i in range(lenX):
matrix.append([0] * lenY)
# A more "pythonic" way of expressing the above would be:
# matrix = [[0] * len(Y) + 1 for _ in range(len(x) + 1)]
def score(X, Y):
# ^ ^ shadowing variables from outer scope. this is not a bug per se but it's considered bad practice
if X[n] == Y[m]: return 4
# ^ ^ variables not defined in scope
if X[n] == '-' or Y[m] == '-': return -4
# ^ ^ variables not defined in scope
else: return -2
def SmithWaterman(X, Y, score): # this function is never called
# ^ unnecessary function passed as parameter. function is defined in scope
for n in range(1, len(X) + 1):
for m in range(1, len(Y) + 1):
align = matrix[n-1, m-1] + (score(X[n-1], Y[m-1]))
# ^ invalid list lookup. should be: matrix[n-1][m-1]
indelX = matrix[n-1, m] + (score(X[n-1], Y[m]))
# ^ out of bounds error when m == len(Y)
indelY = matrix[n, m-1] + (score(X[n], Y[m-1]))
# ^ out of bounds error when n == len(X)
matrix[n, m] = max(align, indelX, indelY, 0)
# this should be nested in the inner for-loop. m, n, indelX, and indelY are not defined in scope here
print(matrix)
zeros("ACGT", "ACGT")

Related

Substituting values in SymPy summation

When substituting values into a SymPy sum, it doesn't seem to recognise that the variables are indexed, and simply factors out all the indexed variables, like so:
# Define variables.
z_tilde_i = sympy.IndexedBase('\\tilde{z}')
rho_i = sympy.IndexedBase('\\rho')
M = sympy.symbols('M')
n = sympy.symbols('n', integer = True)
i = sympy.Idx('i', n)
# Define equation M = sum(rho * deltaZ).
eq_total_mass = sympy.Eq(M, sympy.Sum(rho_i[i] * (z_tilde_i[i + 1] - z_tilde_i[i]), (i, 0, n - 1)))
# Try to substitute values.
print(eq_total_mass.rhs.subs(n, 3).doit())
>>> 3*(\tilde{z}[i + 1] - \tilde{z}[i])*\rho[i]
How to make the SymPy sum recognise the indexed variables?
For a workaround:
There is no need to define i as Idx:
>>> i = var('i')
>>> Sum(rho_i[i] * (z_tilde_i[i + 1] - z_tilde_i[i]), (i, 0, 1)).doit()
(-\tilde{z}[0] + \tilde{z}[1])*\rho[0] + (-\tilde{z}[1] + \tilde{z}[2])*\rho[1]
Or if you do, don't use the integer=True when defining n:
>>> n = var('n')
>>> i = sympy.Idx('i', n)
>>> Sum(rho_i[i] * (z_tilde_i[i + 1] - z_tilde_i[i]), (i, 0, 1)).doit()
(-\tilde{z}[0] + \tilde{z}[1])*\rho[0] + (-\tilde{z}[1] + \tilde{z}[2])*\rho[1]

How to make nested list behave like numpy array?

I'm trying to implements an algorithm to count subsets with given sum in python which is
import numpy as np
maxN = 20
maxSum = 1000
minSum = 1000
base = 1000
dp = np.zeros((maxN, maxSum + minSum))
v = np.zeros((maxN, maxSum + minSum))
# Function to return the required count
def findCnt(arr, i, required_sum, n) :
# Base case
if (i == n) :
if (required_sum == 0) :
return 1
else :
return 0
# If the state has been solved before
# return the value of the state
if (v[i][required_sum + base]) :
return dp[i][required_sum + base]
# Setting the state as solved
v[i][required_sum + base] = 1
# Recurrence relation
dp[i][required_sum + base] = findCnt(arr, i + 1, required_sum, n) + findCnt(arr, i + 1, required_sum - arr[i], n)
return dp[i][required_sum + base]
arr = [ 2, 2, 2, 4 ]
n = len(arr)
k = 4
print(findCnt(arr, 0, k, n))
And it gives the expected result, but I was asked to not use numpy, so I replaced numpy arrays with nested lists like this :
#dp = np.zeros((maxN, maxSum + minSum)) replaced by
dp = [[0]*(maxSum + minSum)]*maxN
#v = np.zeros((maxN, maxSum + minSum)) replaced by
v = [[0]*(maxSum + minSum)]*maxN
but now the program always gives me 0 in the output, I think this is because of some behavior differences between numpy arrays and nested lists, but I don't know how to fix it
EDIT :
thanks to #venky__ who provided this solution in the comments :
[[0 for i in range( maxSum + minSum)] for i in range(maxN)]
and it worked, but I still don't understand what is the difference between it and what I was doing before, I tried :
print( [[0 for i in range( maxSum + minSum)] for i in range(maxN)] == [[0]*(maxSum + minSum)]*maxN )
And the result is True, so how this was able to fix the problem ?
It turns out that I was using nested lists the wrong way to represent 2d arrays, since python was not crating separate objets, but the same sub list indexes was referring to the same integer object, for better explanation please read this.

AttributeError: Trying to access an object's attribute in a 2D list in python

I'm having trouble with accessing objects inside my 2D list in Python. Basically what I'm doing is that I have a 2D list called recueil (french for collection) that stores Reflection objects. The Reflection class has a valeur(=value) and an indice(=index) attribute, were the value is what I get when I apply the ref function (don't have it currently set up) to the reflection with a lower index (the first row of Reflections have one index, and the nth row of Reflections have n indexes, which explains the indice[]).
def main():
a = [1, 3, 5]
b = [2, 4, 6]
n = 3
recueil = [[Reflection([None],0) for colonne in range(N(ligne))] for ligne in range(n)]
recueil[0][0].valeur = b[0] - a[0]
recueil[0][1].valeur = b[1] - a[1]
recueil[0][2].valeur = b[2] - a[2]
for i in range(n):
print(recueil[0][i].valeur)
print(recueil[0][i].indice)
init_indices(recueil, n)
remplir_indices(recueil, n)
def N(n):
return 3 * (2 ** n)
def init_indices(recueil,n):
for i in range(n):
for j in range(N(i)):
recueil[i][j] = [0] * (i+1)
class Reflection(object):
indice = [0]
valeur = 0
def __init__(self, indice, valeur):
self.indice = indice
self.valeur = valeur
The problem I'm having is that in remplir_indices(=fill_indexes), I'm trying to make it so the initial sets have the indexes from 1 to n (so recueil[0][0].indice = 1, ..., recueil[0][n-1].indice = n) but it's treating recueil[][] as a list instead of the object stored in it. Any ideas as to what my error is?
def remplir_indices(recueil, n):
for i in range(n):
if i == 0:
for j in range(n):
recueil[i][j].indice[0] = j
else:
return 0
This looks like an indentation issue:
def remplir_indices(recueil, n):
for i in range(n):
if i == 0:
for j in range(n):
recueil[i][j].index[0] = j
# recueil[i][j].indice[0] = j
return 0

Swap float values only using tuples [duplicate]

In Python, I've seen two variable values swapped using this syntax:
left, right = right, left
Is this considered the standard way to swap two variable values or is there some other means by which two variables are by convention most usually swapped?
Python evaluates expressions from left to right. Notice that while
evaluating an assignment, the right-hand side is evaluated before the
left-hand side.
Python docs: Evaluation order
That means the following for the expression a,b = b,a :
The right-hand side b,a is evaluated, that is to say, a tuple of two elements is created in the memory. The two elements are the objects designated by the identifiers b and a, that were existing before the instruction is encountered during the execution of the program.
Just after the creation of this tuple, no assignment of this tuple object has still been made, but it doesn't matter, Python internally knows where it is.
Then, the left-hand side is evaluated, that is to say, the tuple is assigned to the left-hand side.
As the left-hand side is composed of two identifiers, the tuple is unpacked in order that the first identifier a be assigned to the first element of the tuple (which is the object that was formerly b before the swap because it had name b)
and the second identifier b is assigned to the second element of the tuple (which is the object that was formerly a before the swap because its identifiers was a)
This mechanism has effectively swapped the objects assigned to the identifiers a and b
So, to answer your question: YES, it's the standard way to swap two identifiers on two objects.
By the way, the objects are not variables, they are objects.
That is the standard way to swap two variables, yes.
I know three ways to swap variables, but a, b = b, a is the simplest. There is
XOR (for integers)
x = x ^ y
y = y ^ x
x = x ^ y
Or concisely,
x ^= y
y ^= x
x ^= y
Temporary variable
w = x
x = y
y = w
del w
Tuple swap
x, y = y, x
I would not say it is a standard way to swap because it will cause some unexpected errors.
nums[i], nums[nums[i] - 1] = nums[nums[i] - 1], nums[i]
nums[i] will be modified first and then affect the second variable nums[nums[i] - 1].
Does not work for multidimensional arrays, because references are used here.
import numpy as np
# swaps
data = np.random.random(2)
print(data)
data[0], data[1] = data[1], data[0]
print(data)
# does not swap
data = np.random.random((2, 2))
print(data)
data[0], data[1] = data[1], data[0]
print(data)
See also Swap slices of Numpy arrays
To get around the problems explained by eyquem, you could use the copy module to return a tuple containing (reversed) copies of the values, via a function:
from copy import copy
def swapper(x, y):
return (copy(y), copy(x))
Same function as a lambda:
swapper = lambda x, y: (copy(y), copy(x))
Then, assign those to the desired names, like this:
x, y = swapper(y, x)
NOTE: if you wanted to you could import/use deepcopy instead of copy.
That syntax is a standard way to swap variables. However, we need to be careful of the order when dealing with elements that are modified and then used in subsequent storage elements of the swap.
Using arrays with a direct index is fine. For example:
def swap_indexes(A, i1, i2):
A[i1], A[i2] = A[i2], A[i1]
print('A[i1]=', A[i1], 'A[i2]=', A[i2])
return A
A = [0, 1, 2, 3, 4]
print('For A=', A)
print('swap indexes 1, 3:', swap_indexes(A, 1, 3))
Gives us:
('For A=', [0, 1, 2, 3, 4])
('A[i1]=', 3, 'A[i2]=', 1)
('swap indexes 1, 3:', [0, 3, 2, 1, 4])
However, if we change the left first element and use it in the left second element as an index, this causes a bad swap.
def good_swap(P, i2):
j = P[i2]
#Below is correct, because P[i2] is modified after it is used in P[P[i2]]
print('Before: P[i2]=', P[i2], 'P[P[i2]]=', P[j])
P[P[i2]], P[i2] = P[i2], P[P[i2]]
print('Good swap: After P[i2]=', P[i2], 'P[P[i2]]=', P[j])
return P
def bad_swap(P, i2):
j = P[i2]
#Below is wrong, because P[i2] is modified and then used in P[P[i2]]
print('Before: P[i2]=', P[i2], 'P[P[i2]]=', P[j])
P[i2], P[P[i2]] = P[P[i2]], P[i2]
print('Bad swap: After P[i2]=', P[i2], 'P[P[i2]]=', P[j])
return P
P = [1, 2, 3, 4, 5]
print('For P=', P)
print('good swap with index 2:', good_swap(P, 2))
print('------')
P = [1, 2, 3, 4, 5]
print('bad swap with index 2:', bad_swap(P, 2))
('For P=', [1, 2, 3, 4, 5])
('Before: P[i2]=', 3, 'P[P[i2]]=', 4)
('Good swap: After P[i2]=', 4, 'P[P[i2]]=', 3)
('good swap with index 2:', [1, 2, 4, 3, 5])
('Before: P[i2]=', 3, 'P[P[i2]]=', 4)
('Bad swap: After P[i2]=', 4, 'P[P[i2]]=', 4)
('bad swap with index 2:', [1, 2, 4, 4, 3])
The bad swap is incorrect because P[i2] is 3 and we expect P[P[i2]] to be P[3]. However, P[i2] is changed to 4 first, so the subsequent P[P[i2]] becomes P[4], which overwrites the 4th element rather than the 3rd element.
The above scenario is used in permutations. A simpler good swap and bad swap would be:
#good swap:
P[j], j = j, P[j]
#bad swap:
j, P[j] = P[j], j
You can combine tuple and XOR swaps: x, y = x ^ x ^ y, x ^ y ^ y
x, y = 10, 20
print('Before swapping: x = %s, y = %s '%(x,y))
x, y = x ^ x ^ y, x ^ y ^ y
print('After swapping: x = %s, y = %s '%(x,y))
or
x, y = 10, 20
print('Before swapping: x = %s, y = %s '%(x,y))
print('After swapping: x = %s, y = %s '%(x ^ x ^ y, x ^ y ^ y))
Using lambda:
x, y = 10, 20
print('Before swapping: x = %s, y = %s' % (x, y))
swapper = lambda x, y : ((x ^ x ^ y), (x ^ y ^ y))
print('After swapping: x = %s, y = %s ' % swapper(x, y))
Output:
Before swapping: x = 10 , y = 20
After swapping: x = 20 , y = 10

Navigating a grid

I stumbled upon a problem at Project Euler, https://projecteuler.net/problem=15
. I solved this by combinatorics but was left wondering if there is a dynamic programming solution to this problem or these kinds of problems overall. And say some squares of the grid are taken off - is that possible to navigate? I am using Python. How should I do that? Any tips are appreciated. Thanks in advance.
You can do a simple backtrack and explore an implicit graph like this: (comments explain most of it)
def explore(r, c, n, memo):
"""
explore right and down from position (r,c)
report a rout once position (n,n) is reached
memo is a matrix which saves how many routes exists from each position to (n,n)
"""
if r == n and c == n:
# one path has been found
return 1
elif r > n or c > n:
# crossing the border, go back
return 0
if memo[r][c] is not None:
return memo[r][c]
a= explore(r+1, c, n, memo) #move down
b= explore(r, c+1, n, memo) #move right
# return total paths found from this (r,c) position
memo[r][c]= a + b
return a+b
if __name__ == '__main__':
n= 20
memo = [[None] * (n+1) for _ in range(n+1)]
paths = explore(0, 0, n, memo)
print(paths)
Most straight-forwardly with python's built-in memoization util functools.lru_cache. You can encode missing squares as a frozenset (hashable) of missing grid points (pairs):
from functools import lru_cache
#lru_cache(None)
def paths(m, n, missing=None):
missing = missing or frozenset()
if (m, n) in missing:
return 0
if (m, n) == (0, 0):
return 1
over = paths(m, n-1, missing=missing) if n else 0
down = paths(m-1, n, missing=missing) if m else 0
return over + down
>>> paths(2, 2)
6
# middle grid point missing: only two paths
>>> paths(2, 2, frozenset([(1, 1)]))
2
>>> paths(20, 20)
137846528820
There is also a mathematical solution (which is probably what you used):
def factorial(n):
result = 1
for i in range(1, n + 1):
result *= i
return result
def paths(w, h):
return factorial(w + h) / (factorial(w) * factorial(h))
This works because the number of paths is the same as the number of ways to choose to go right or down over w + h steps, where you go right w times, which is equal to w + h choose w, or (w + h)! / (w! * h!).
With missing grid squares, I think there is a combinatoric solution, but it's very slow if there are many missing squares, so dynamic programming would probably be better there.
For example, the following should work:
missing = [
[0, 1],
[0, 0],
[0, 0],
]
def paths_helper(x, y, path_grid, missing):
if path_grid[x][y] is not None:
return path_grid[x][y]
if missing[x][y]:
path_grid[x][y] = 0
return 0
elif x < 0 or y < 0:
return 0
else:
path_count = (paths_helper(x - 1, y, path_grid, missing) +
paths_helper(x, y - 1, path_grid, missing))
path_grid[x][y] = path_count
return path_count
def paths(missing):
arr = [[None] * w for _ in range(h)]
w = len(missing[0])
h = len(missing)
return paths_helper(w, h, arr, missing)
print paths()

Categories

Resources