Needleman-Wunsch algorithm - Python

I'm trying to understand the Hirschberg algorithm and I came across this piece of pseudocode on Wikipedia. I do not understand how the NeedlemanWunsch() function works.
function Hirschberg(X, Y)
    Z = ""
    W = ""
    if length(X) == 0 or length(Y) == 0
        if length(X) == 0
            for i = 1 to length(Y)
                Z = Z + '-'
                W = W + Yi
            end
        else if length(Y) == 0
            for i = 1 to length(X)
                Z = Z + Xi
                W = W + '-'
            end
        end
    else if length(X) == 1 or length(Y) == 1
        (Z, W) = NeedlemanWunsch(X, Y)
    else
        xlen = length(X)
        xmid = length(X)/2
        ylen = length(Y)
        ScoreL = NWScore(X1:xmid, Y)
        ScoreR = NWScore(rev(Xxmid+1:xlen), rev(Y))
        ymid = PartitionY(ScoreL, ScoreR)
        (Z, W) = Hirschberg(X1:xmid, Y1:ymid) + Hirschberg(Xxmid+1:xlen, Yymid+1:ylen)
    end
    return (Z, W)
Can someone explain the Needleman-Wunsch algorithm and how it can be implemented in Python? Thank you very much!

This looks like a homework/coursework question so I won't give you the full solution. However, I will guide you into producing a working solution.
Needleman-Wunsch Algorithm
The Needleman-Wunsch algorithm is a method used to align sequences. It is essentially made up of two components:
A similarity matrix, F.
A linear gap penalty, d.
When aligning sequences there can be many possibilities; what this matrix allows you to do is find an optimal alignment and discard all the other candidates.
What you will have to do is:
Create a 2-dimensional array to hold the matrix, F.
A method to initialise matrix F with the scores.
A method to compute the optimal sequence.
Creating a 2-dimensional array to hold the matrix, F
You can either use numpy for this, or you could just generate the matrix as follows. Assume you have two sequences A and B:
F = [[0 for j in range(len(B) + 1)] for i in range(len(A) + 1)]
A method to initialise matrix F with the scores.
Create a method which takes as parameters the length of each sequence, the linear gap penalty, and the matrix F:
def createSimilarityMatrix(lengthOfA, lengthOfB, penaltyGap, F):
You then need to implement the following pseudo-code:
for i = 0 to length(A)
    F(i,0) ← d*i
for j = 0 to length(B)
    F(0,j) ← d*j
for i = 1 to length(A)
    for j = 1 to length(B)
    {
        Match ← F(i-1,j-1) + S(Ai, Bj)
        Delete ← F(i-1,j) + d
        Insert ← F(i,j-1) + d
        F(i,j) ← max(Match, Insert, Delete)
    }
Hint: Research ways to write this algorithm in idiomatic Python. Also note that the body of the double for-loop at the bottom can be collapsed into a one-liner.
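For reference, a minimal sketch of that fill step in Python, assuming a simple scoring function S and the matrix F created above (the names are illustrative, not a full solution):
def fill_matrix(A, B, d, F, S):
    # Initialise the first row and column with cumulative gap penalties
    for i in range(len(A) + 1):
        F[i][0] = d * i
    for j in range(len(B) + 1):
        F[0][j] = d * j
    # Fill the rest of the matrix from the recurrence
    for i in range(1, len(A) + 1):
        for j in range(1, len(B) + 1):
            match = F[i-1][j-1] + S(A[i-1], B[j-1])
            delete = F[i-1][j] + d
            insert = F[i][j-1] + d
            F[i][j] = max(match, delete, insert)
    return F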
A method to compute the optimal sequence
Once you have the similarity matrix done, you can implement the main algorithm to compute the optimal sequence. For this, create a method which takes your two sequences A and B as parameters:
def needlemanWunsch (a, b):
You will then need to implement this method using the following pseudocode:
AlignmentA ← ""
AlignmentB ← ""
i ← length(A)
j ← length(B)
while (i > 0 or j > 0)
{
if (i > 0 and j > 0 and F(i,j) == F(i-1,j-1) + S(Ai, Bj))
{
AlignmentA ← Ai + AlignmentA
AlignmentB ← Bj + AlignmentB
i ← i - 1
j ← j - 1
}
else if (i > 0 and F(i,j) == F(i-1,j) + d)
{
AlignmentA ← Ai + AlignmentA
AlignmentB ← "-" + AlignmentB
i ← i - 1
}
else (j > 0 and F(i,j) == F(i,j-1) + d)
{
AlignmentA ← "-" + AlignmentA
AlignmentB ← Bj + AlignmentB
j ← j - 1
}
}
The pseudo-code has been taken from this page on Wikipedia. For more information on the Needleman-Wunsch algorithm, please have a look at this presentation.
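If you want to check your traceback against a reference, here is a minimal Python sketch of that loop, assuming F, S and d from the previous section (again, names are illustrative rather than a complete drop-in solution):
def traceback(A, B, d, F, S):
    alignmentA, alignmentB = "", ""
    i, j = len(A), len(B)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i-1][j-1] + S(A[i-1], B[j-1]):
            alignmentA = A[i-1] + alignmentA   # match/mismatch
            alignmentB = B[j-1] + alignmentB
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i-1][j] + d:
            alignmentA = A[i-1] + alignmentA   # gap in B
            alignmentB = "-" + alignmentB
            i -= 1
        else:
            alignmentA = "-" + alignmentA      # gap in A
            alignmentB = B[j-1] + alignmentB
            j -= 1
    return alignmentA, alignmentB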

Related

Problem with exec() function in KKT conditions using SymPy

I'm writing a not yet fully implemented Python function using the SymPy library which looks for the critical points of a mathematical function f through the KKT conditions, as follows:
def KKT(f: str, h=[], g=[], max=True):
    # NOTE: The expressions contained in g must be such that g <= 0 and the ones contained in h must be such that h = 0. Both g and h are string lists
    import sympy as sp  # Importing the SymPy library
    n = len(h)  # Quantity of equality constraints
    m = len(g)  # Quantity of inequality constraints
    f = sp.parse_expr(f)  # Constructing the f function by passing a string as an argument
    vars = f.free_symbols  # Getting the variables set
    pars = list(vars)  # Parameters list
    if n > 0:
        for i in range(n):
            exec(f'h{i+1} = sp.Symbol("h_{i+1}")')
            exec(f'h{i+1} = sp.parse_expr(h[{i}])')  # Create the equality constraint h_i
            exec(f'l{i+1} = sp.Symbol("\\lambda_{i+1}")')  # Create the parameter lambda_i
            exec(f'pars.append(l{i+1})')  # Adding lambda_i to parameters list
    if m > 0:
        for j in range(m):
            exec(f'g_{j+1} = sp.Symbol("g_{j+1}")')
            exec(f'g{j+1} = sp.parse_expr(g[{j}])')  # Create the inequality constraint g_i
            exec(f'u{j+1} = sp.Symbol("\\mu_{j+1}", negative=False)')  # Create the parameter mu_i
            exec(f'pars.append(u{j+1})')  # Add mu_i to parameters list
            exec(f'p{j+1} = sp.Symbol("p_{j+1}", negative=False)')  # Create fill portion p_i
            exec(f'pars.append(p{j+1})')  # Add p_i to parameters list
    # Creating the Lagrangean
    L = f
    if n > 0:
        for i in range(n):
            exec(f'L = L - l{i+1} * h{i+1}')  # Adding lambda_j * h_j to the Lagrangean
    if m > 0:
        for j in range(m):
            exec(f'L = L - u{j+1} * g{j+1}')  # <- THIS LINE IS NOT WORKING
            # Adding mu_i * g_i to the Lagrangean
            print(f'j: {j}')
            print(f'{L}\n')
    # Creating the KKT conditions from the Lagrangean
    R = []  # Constraints set
    for var in vars:
        R.append(sp.diff(L, var))  # Add the Lagrangean's partial derivative with respect to var
    if n > 0:
        for i in range(n):
            exec(f'R.append(h{i+1})')
    if m > 0:
        for j in range(m):
            exec(f'R.append(u{j+1} * g{j+1})')
            exec(f'R.append(g{j+1} + p{j+1})')
    # Solving the KKT conditions
    sols_lagr = sp.solve(R, pars, dict=True)  # Lagrangian solutions
    critical_points = [{var: sol.get(var) for var in sol if var in vars} for sol in sols_lagr]
    return critical_points

KKT('24*x_1 - x_1**2 + 10*x_2 - 2*x_2**2',
    [],
    ['x_1 - 8', 'x_2 - 7', '-x_1', '-x_2']
    )
However, for some reason, the following line of code doesn't work:
exec(f'L = L - u{j+1} * g{j+1}')
because I'm getting this result when I execute this block of code:
j: 0
-x_1**2 + 24*x_1 - 2*x_2**2 + 10*x_2
j: 1
-x_1**2 + 24*x_1 - 2*x_2**2 + 10*x_2
j: 2
-x_1**2 + 24*x_1 - 2*x_2**2 + 10*x_2
j: 3
-x_1**2 + 24*x_1 - 2*x_2**2 + 10*x_2
which shows the Lagrangean is not accumulating the terms. I'd be grateful if someone could help me.
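A likely cause, for what it's worth: inside a function, exec runs against a copy of the local namespace, so a statement like exec(f'L = L - u{j+1} * g{j+1}') rebinds L only inside that copy and the real local L never changes. A minimal sketch of the same bookkeeping without exec, using plain lists (the function and variable names here are illustrative):
import sympy as sp

def lagrangean(f_str, g_strs):
    f = sp.parse_expr(f_str)
    # Build the multipliers and constraints as lists instead of exec-created names
    mus = [sp.Symbol(f'mu_{j+1}', negative=False) for j in range(len(g_strs))]
    gs = [sp.parse_expr(s) for s in g_strs]
    L = f
    for mu_j, g_j in zip(mus, gs):
        L = L - mu_j * g_j  # this assignment updates the real local L
    return L, mus, gs

L, mus, gs = lagrangean('24*x_1 - x_1**2 + 10*x_2 - 2*x_2**2',
                        ['x_1 - 8', 'x_2 - 7', '-x_1', '-x_2'])
print(L)  # now includes the -mu_j * g_j terms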

Hairpin structure easy algorithm

I have a structure with letters A, Z, F, G.
A pairs with Z, and F pairs with G.
For example, I have AAAFFZZGGFFGGAAAAZZZZ. I need to "fold" the structure, and the fold creates pairs:
A A A F F Z Z G G F F -
* * * * * * |
Z Z Z Z A A A A G G -
6 pairs in the example above.
But we can also create a structure like this:
A A A F F Z Z G G -
* * * |
Z Z Z Z A A A A G G F F -
The head is a fragment of the structure where pairs cannot exist. For example, with head = 2:
F F
A A A F F Z Z G G / \
* * * * |
Z Z Z Z A A A A \ /
G G
I don't know how to find the maximal number of pairs that can be created in this structure.
To make the matching go faster you can prepare a second string (R) with the letters flipped to their corresponding values. This will allow direct comparisons of paired positions instead of going through an indirection n^2 times:
S = "AAAFFZZGGFFGGAAAAZZZZ"
match = str.maketrans("AZFG","ZAGF")
R = S.translate(match) # flipped matching letters
mid = len(S)//2
maxCount,maxPos = 0,0
for d in range(mid+1): # moving away from middle
for p in {mid+d,mid-d}: # right them left
pairs = sum(a==b for a,b in zip(S[p-1::-1],R[p:])) # match fold
if pairs>maxCount:
maxCount,maxPos = pairs,p # track best so far and fold position
if p <= maxCount: break # stop when impossible to improve
print(maxCount) # 6 matches
print(maxPos) # folding at 11
print(S[:maxPos].rjust(len(S))) # AAAFFZZGGFF
print(S[maxPos:][::-1].rjust(len(S))) # ZZZZAAAAGG
# ** ** **
You could start with testing the folding point in the center, and then fan out (in zig-zag manner) putting the folding point further from the center.
Then for a given folding point, count the allowed pairs. You can use slicing to create the two strips, one in reversed order, and then zip those to get the pairs.
The outer iteration (determining the folding point) can stop when the size of the shortest strip is shorter than the size of the best answer found so far.
Here is a solution, which also returns the actual fold, so it can be printed for verification:
def best_fold(chain):
allowed = set(("AZ","ZA","FG","GF"))
revchain = chain[::-1]
maxnumpairs = 0
fold = chain # This represents "no solution"
n = len(chain)
head = n // 2
for diff in range(n):
head += diff if diff % 2 else -diff
if head - 2 < maxnumpairs or n - head - 2 < maxnumpairs:
break
numpairs = sum(a+b in allowed
for a, b in zip(revchain[-head+2:], chain[head+2:])
)
if numpairs > maxnumpairs:
maxnumpairs = numpairs
fold = chain[:head].rjust(n) + "\n" + revchain[:-head].rjust(n)
return maxnumpairs, fold
Here is how to run it on the example string:
numpairs, fold = best_fold("AAAFFZZGGFFGGAAAAZZZZ")
print(numpairs) # 5
print(fold) # AAAFFZZGGF
# ZZZZAAAAGGF
I would first start with a suitable data type to represent your creasable string. As you mention in your comment, we can use a list of characters, which is better than a string; a tuple of arrays like [[Char],[Char]] would be sufficient.
Also, creasing from left or right shouldn't matter, so for simplicity we start with:
[ ["A","A","F","F","Z","Z","G","G","F","F","G","G","A","A","A","A","Z","Z","Z","Z"]
, ["A"]
]
Then in every step we can map our list of characters into a tuple of creases like
chars.map((_,i,a) => [a.slice(i+1),a.slice(0,i+1).reverse()])
Compare and count for pairs.
In order to check efficiently whether corresponding items form a valid pair, a simple lookup table like the one below can be used:
{ "A": "Z"
, "Z": "A"
, "G": "F"
, "F": "G"
}
Finally we can filter the longest one(s)
An implementation of the creases could look like this:
function pairs(cs){
  var lut = { "A": "Z"
            , "Z": "A"
            , "G": "F"
            , "F": "G"
            };
  return cs.map(function(_,i,a){
             var crs = [a.slice(i+1), a.slice(0,i+1).reverse()]; // console.log(crs) to see all creases
             return crs[0].reduce( (ps,c,j) => lut[c] === crs[1][j] ? ( ps.res.push([c,crs[1][j],j])
                                                                      , ps.crs ??= crs.concat(i+1)
                                                                      , ps
                                                                      )
                                                                    : ps
                                 , { res: []
                                   , crs: null
                                   }
                                 );
           })
           .reduce( (r,d) => r.length ? r[r.length-1].res.length > d.res.length ? r :
                                        r[r.length-1].res.length < d.res.length ? [d] : r.concat(d)
                                      : [d]
                  , []
                  );
}

var chars = ["A","A","A","F","F","Z","Z","G","G","F","F","G","G","A","A","A","A","Z","Z","Z","Z"],
    result = pairs(chars);
console.log(result);
Now this is a straightforward algorithm and possibly not the most efficient one. It could be made faster by using modular arithmetic to test pairs without any additional arrays; however, I believe that would be overkill and very hard to explain.

VRP heterogeneous site-dependency

In my code, I managed to implement different vehicle types (I think) and to indicate the site-dependency. However, it seems that in the output of my optimization, vehicles can drive more than one route. I would like to implement that once a vehicle returns to the depot (node 0), a new vehicle is assigned to perform another route. Could you help me with that? :)
I'm running Python in a Jupyter notebook with the docplex solver.
import numpy as np
from pandas import DataFrame

all_units = [0,1,2,3,4,5,6,7,8,9]
ucp_raw_unit_data = {
    "customer": all_units,
    "loc_x": [40,45,45,42,42,42,40,40,38,38],
    "loc_y": [50,68,70,66,68,65,69,66,68,70],
    "demand": [0,10,30,10,10,10,20,20,20,10],
    "req_vehicle": [[0,1,2], [0], [0], [0], [0], [0], [0], [0], [0], [0]],
}
df_units = DataFrame(ucp_raw_unit_data, index=all_units)
# Display the 'df_units' Data Frame
df_units

Q = 50
N = list(df_units.customer[1:])
V = [0] + N
k = 15
# n.o. vehicles
K = range(1, k+1)
# vehicle 1 = type 1, vehicle 6 = type 2 and vehicle 11 = type 0
vehicle_types = {1:[1], 2:[1], 3:[1], 4:[1], 5:[2], 6:[2], 7:[2], 8:[2], 9:[2],
                 10:[2], 11:[0], 12:[0], 13:[0], 14:[0], 15:[0]}
lf = 0.5
R = range(1, 11)

# Create arcs and costs
A = [(i,j,k,r) for i in V for j in V for k in K for r in R if i != j]
Y = [(k,r) for k in K for r in R]
c = {(i,j): np.hypot(df_units.loc_x[i]-df_units.loc_x[j],
                     df_units.loc_y[i]-df_units.loc_y[j]) for i,j,k,r in A}

from docplex.mp.model import Model
import docplex
mdl = Model('SDCVRP')

# decision variables
x = mdl.binary_var_dict(A, name='x')
u = mdl.continuous_var_dict(df_units.customer, ub=Q, name='u')
y = mdl.binary_var_dict(Y, name='y')

# objective function
mdl.minimize(mdl.sum(c[i,j]*x[i,j,k,r] for i,j,k,r in A))

# constraint 1 -- each node only visited once
mdl.add_constraints(mdl.sum(x[i,j,k,r] for k in K for r in R for j in V
                            if j != i and vehicle_types[k][0] in df_units.req_vehicle[j]) == 1
                    for i in N)

# constraint 2 -- each node only exited once
mdl.add_constraints(mdl.sum(x[i,j,k,r] for k in K for r in R for i in V
                            if i != j and vehicle_types[k][0] in df_units.req_vehicle[j]) == 1
                    for j in N)

# constraint 3 -- vehicle type constraint (site-dependency)
mdl.add_constraints(mdl.sum(x[i,j,k,r] for k in K for r in R for i in V
                            if i != j and vehicle_types[k][0] not in df_units.req_vehicle[j]) == 0
                    for j in N)

# constraint 4 -- flow constraint
mdl.add_constraints((mdl.sum(x[i,j,k,r] for j in V if j != i) -
                     mdl.sum(x[j,i,k,r] for j in V if i != j)) == 0
                    for i in N for k in K for r in R)

# constraint 5 -- cumulative load of visited nodes
mdl.add_indicator_constraints([mdl.indicator_constraint(x[i,j,k,r], u[i] + df_units.demand[j] == u[j])
                               for i,j,k,r in A if i != 0 and j != 0])

# constraint 6 -- one vehicle to one route
mdl.add_constraints(mdl.sum(y[k,r] for r in R) <= 1 for k in K)
mdl.add_indicator_constraints([mdl.indicator_constraint(x[i,j,k,r], y[k,r] == 1)
                               for i,j,k,r in A if i != 0 and j != 0])

# constraint 7 -- cumulative load must be equal or higher than the demand in this node
mdl.add_constraints(u[i] >= df_units.demand[i] for i in N)

# constraint 8 -- minimum load factor
mdl.add_indicator_constraints([mdl.indicator_constraint(x[j,0,k,r], u[j] >= lf*Q)
                               for j in N for k in K for r in R if j != 0])

mdl.parameters.timelimit = 15
solution = mdl.solve(log_output=True)
print(solution)
I expect every route to be served by a different vehicle; however, the same vehicles perform multiple routes. Also, the cumulative load is currently calculated per visited node; I would like to have it per vehicle on its route, so that the last constraint (minimum load factor) can be enforced.
I understand K indices are for vehicles and R are for routes. I ran your code and got the following assignments:
y_11_9=1
y_12_4=1
y_13_7=1
y_14_10=1
y_15_10=1
which seem to show many vehicles sharing the same route.
This is not forbidden by the sum(y[k,r] for r in R) <= 1 constraint,
as it forbids one vehicle from working several routes.
Do you want to limit the number of vehicles assigned to one route to 1? This would be the symmetrical counterpart of constraint #6.
If I got it wrong, please send the solution you get and the constraint you want to add.
If I add the symmetrical constraint, that is, limit the assignment of vehicles to routes to 1 (no two vehicles on the same route), by:
mdl.add_constraints(mdl.sum(y[k, r] for r in R) <= 1 for k in K)
mdl.add_constraints(mdl.sum(y[k, r] for k in K) <= 1 for r in R)
I get a solution with the same cost, and only three vehicle-route assignments:
y_11_3=1
y_12_7=1
y_15_9=1
Still, I guess the best solution would be to add some fixed cost for using a vehicle and introduce it into the final objective. This might also reduce the symmetries in the problem.
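For instance, a sketch of that idea (vehicle_cost is an illustrative value, not part of the original model):
vehicle_cost = 100  # illustrative fixed cost per vehicle-route assignment
mdl.minimize(mdl.sum(c[i,j] * x[i,j,k,r] for i,j,k,r in A)
             + vehicle_cost * mdl.sum(y[k,r] for k,r in Y))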
Philippe.

How can I implement this point in polygon code in Python?

So, for my Computer Graphics class I was tasked with writing a polygon filler; my software renderer is currently being coded in Python. Right now, I want to test the pointInPolygon code I found at How can I determine whether a 2D Point is within a Polygon? so I can base my own method on it later.
The code is:
int pnpoly(int nvert, float *vertx, float *verty, float testx, float testy)
{
    int i, j, c = 0;
    for (i = 0, j = nvert-1; i < nvert; j = i++) {
        if ( ((verty[i]>testy) != (verty[j]>testy)) &&
             (testx < (vertx[j]-vertx[i]) * (testy-verty[i]) / (verty[j]-verty[i]) + vertx[i]) )
            c = !c;
    }
    return c;
}
And my attempt to recreate it in Python is as following:
def pointInPolygon(self, nvert, vertx, verty, testx, testy):
    c = 0
    j = nvert-1
    for i in range(nvert):
        if(((verty[i]>testy) != (verty[j]>testy)) and (testx < (vertx[j]-vertx[i]) * (testy-verty[i]) / (verty[j]-verty[i] + vertx[i]))):
            c = not c
        j += 1
    return c
But this obviously will raise an index-out-of-range error in the second iteration, because j becomes nvert and it will crash.
Thanks in advance.
You're reading the tricky C code incorrectly. The point of j = i++ is to both increment i by one and assign the old value to j. Similar python code would do j = i at the end of the loop:
j = nvert - 1
for i in range(nvert):
    ...
    j = i
The idea is that for nvert == 3, the values would go
j | i
---+---
2 | 0
0 | 1
1 | 2
Another way to achieve this is to set j to (i - 1) % nvert:
for i in range(nvert):
    j = (i - 1) % nvert
    ...
i.e. it is lagging one behind, and the indices form a ring (like the vertices do)
More Pythonic code would use itertools and iterate over the coordinates themselves. You'd have a list of pairs (tuples) called vertices, and two iterators, one of which is one vertex ahead of the other, cycling back to the beginning thanks to itertools.cycle, something like:
import itertools

# make one iterator that goes one ahead and wraps around at the end
next_ones = itertools.cycle(vertices)
next(next_ones)
for ((x1, y1), (x2, y2)) in zip(vertices, next_ones):
    # unchecked...
    if (((y1 > testy) != (y2 > testy))
            and (testx < (x2 - x1) * (testy - y1) / (y2 - y1) + x1)):
        c = not c
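Putting the pieces together, a minimal self-contained sketch (the wrapper name and test values are mine, not from the original answer), assuming vertices is a list of (x, y) tuples:
import itertools

def point_in_polygon(vertices, testx, testy):
    """Ray-casting test: returns True if (testx, testy) is inside the polygon."""
    c = False
    next_ones = itertools.cycle(vertices)
    next(next_ones)  # shift by one so each vertex is paired with its successor
    for (x1, y1), (x2, y2) in zip(vertices, next_ones):
        if (y1 > testy) != (y2 > testy) and \
                testx < (x2 - x1) * (testy - y1) / (y2 - y1) + x1:
            c = not c
    return c

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(square, 2, 2))   # True
print(point_in_polygon(square, 5, 2))   # False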

Can someone explain how this if statement is working?

Can someone explain how the if statement works, or the meaning of the code, in the following?
I have two lists, A and B, and I need to see if there exists a pair of elements, one from A and the other from B, such that swapping them will make the sums of both lists equal.
My O(n^2) method is to find sumOfA and sumOfB,
find halfdiff = (sumOfA - sumOfB)/2, and
for each element in A, see if there's a B[i] such that (A[j] - B[i]) == halfdiff.
But the following code does it in O(n+m), and I don't understand the meaning of the "if" statement (LINE 11) here. Does it guarantee that if it is true we have the required pair?
 1  def fast_solution(A, B, m):
 2      n = len(A)
 3      sum_a = sum(A)
 4      sum_b = sum(B)
 5      d = sum_b - sum_a
 6      if d % 2 == 1:
 7          return False
 8      d //= 2
 9      count = counting(A, m)
10      for i in xrange(n):
11          if 0 <= B[i] - d and B[i] - d <= m and count[B[i] - d] > 0:
12              return True
13      return False
You have to find i, j such that sum(A) - a[i] + b[j] = sum(B) - b[j] + a[i], or equivalently, sum(A) - 2*a[i] = sum(B) - 2*b[j].
You can do this by calculating all possible results of the right-hand-side, and then searching through possible i values.
def exists_swap(A, B):
    sumA = sum(A)
    sumB = sum(B)
    bVals = set(sumB - 2 * bj for bj in B)
    return any(sumA - 2 * ai in bVals for ai in A)
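For example (illustrative values):
A = [1, 2, 3]   # sum 6
B = [1, 5, 2]   # sum 8
print(exists_swap(A, B))  # True: swapping A's 1 with B's 2 gives sums 7 and 7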
The partial code in your question is doing a similar thing, except d = (sum(B)-sum(A))/2 and count is collections.Counter(A) (that is, it's a dict that maps any x to the number of times it appears in A). Then count[B[i] - d] > 0 is equivalent to there being a j such that B[i] - d = A[j], or B[i] - A[j] = (sum(B) - sum(A))/2.
It may be that instead of using sets or dicts, the value m is the maximum value allowed in A and B. Then counting could be defined like this:
def counting(xs, m):
    r = [0] * (m + 1)
    for x in xs:
        r[x] += 1
    return r
This is a simple but inefficient way to represent a set of integers, but it makes sense of the missing parts of your question and explains the bounds checking 0 <= B[i] - d and B[i] - d <= m which is unnecessary if you use a set or dict, but necessary if counting returns an array.
Actually, it's not strictly O(n+m); the linear estimate is only amortized because of the hash-map style count lookup. This may help you see that your code is an obfuscated version of:
bool solve(A, B) {
    sum_a = sum(A)
    sum_b = sum(B)
    sort(B)
    for (val in A)
        if (binary_search(B, val + (sum_b - sum_a)/2))
            return true
    return false
}
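In Python, a sketch of that same sort-and-binary-search approach using bisect (illustrative, not the original poster's code):
from bisect import bisect_left

def solve(A, B):
    sum_a, sum_b = sum(A), sum(B)
    if (sum_b - sum_a) % 2:
        return False                      # the difference must be even to split
    d = (sum_b - sum_a) // 2
    B_sorted = sorted(B)                  # O(m log m)
    for val in A:                         # need some b in B with b == val + d
        i = bisect_left(B_sorted, val + d)
        if i < len(B_sorted) and B_sorted[i] == val + d:
            return True
    return False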
As Paul pointed out, 0 <= B[i] - d and B[i] - d <= m is just validation of the count argument. BTW his solution is purely linear, well implemented, and much simpler to understand.
