I'm trying to create a three-index matrix that contains one value (V) for every node of a numerical spatial mesh (x, y, z) (real-world problem: the electrostatic potential created by finite parallel plates at a point in space). This matrix initially has to be filled with zeros except at some specific points (where the plates and the space limits are), and then the value at each node is iteratively updated according to the following 7-point stencil (j, k and l index the x, y and z coordinates respectively):
V[j,k,l] = (V[j+1, k, l] + V[j-1, k, l] + V[j, k+1, l] + V[j, k-1, l] + V[j, k, l+1] +V[j, k, l-1])/6
(i.e., replace the value of a node with the average of its 6 neighbouring nodes)
I've tried np.zeros and np.meshgrid, but I think I may simply have a serious conceptual gap regarding arrays, since nothing seems to do what I want. Any orientation would be really appreciated, and sorry if I did not explain myself correctly. Here is some code I've tried:
V1 = 10
V2 = -5
Mx = 101
My = 151
Mz = 301
V = np.zeros([Mx, My, Mz]).astype(int)
V[46, 51:101, 101:201] = V1 #the values of these nodes should stay fixed throughout iteration
V[56, 51:101, 101:201] = V2 #the values of these nodes should stay fixed throughout iteration
V[1,:,:] =V[100,:,:] =V[:,1,:] =V[:,150,:] =V[:,:,1] =V[:,:,300] = 0 #the values of these nodes should stay fixed throughout iteration
for j in V:
    for k in j:
        for l in k:
            V[j, k, l] = (V[j+1, k, l] + V[j-1, k, l] + V[j, k+1, l] + V[j, k-1, l] + V[j, k, l+1] + V[j, k, l-1])/6
(Update after help from user kcw78)
Implementing the proposed code and trying to add a while loop that keeps going until the error falls below a tolerance, or until the error in two consecutive cycles is the same. The assignment statement says, more specifically:
"As many of these cycles will be completed as needed for the error to fall below a certain prescribed tolerance, rtol. And what is a good measure of the error here? We will use the maximum value of the local residual, defined as the (absolute value of the) difference between the potential value at the central node and the arithmetic average of the other values in the stencil. As a extra safeguard, we will also compare the errors of any two successive cycles and stop the relaxation if they become equal. A better solution is no longer possible."
Now I'm trying the code below, but I'm not sure whether it's trapped in an infinite while loop or just takes a very long time, since I had to stop it after 20 minutes without any output (I'm also not sure whether I should use .all() instead of .any()):
import numpy as np
V1 = 10
V2 = -5
Mx = 101
My = 151
Mz = 301
rtol = 10**-2
V1_set = { (46,k,l) for k in range(51,101,1) for l in range(101,201,1) }
V2_set = { (56,k,l) for k in range(51,101,1) for l in range(101,201,1) }
V = np.zeros((Mx, My, Mz))
Vnew = np.copy(V)
V[46, 51:101, 101:201] = V1
V[56, 51:101, 101:201] = V2
V[1,:,:] =V[100,:,:] =V[:,1,:] =V[:,150,:] =V[:,:,1] =V[:,:,300] = 0
check_set = set().union(V1_set,V2_set)
error = np.zeros((Mx, My, Mz))
errornew = np.zeros((Mx, My, Mz))
while float(errornew.any()) < rtol or error.any() != errornew.any():
    V = Vnew
    error = errornew
    for j in range(1, V.shape[0]-1):
        for k in range(1, V.shape[1]-1):
            for l in range(1, V.shape[2]-1):
                if (j, k, l) not in check_set:
                    Vnew[j, k, l] = (V[j+1, k, l] + V[j-1, k, l] + V[j, k+1, l] + V[j, k-1, l] + V[j, k, l+1] + V[j, k, l-1])/6
                    errornew[j, k, l] = abs(Vnew[j, k, l] - V[j, k, l])
If I understand your question, you will need 2 changes:
First, you need additional variables to track the positions that are fixed through the iteration. I added sets of (j,k,l) tuples to do this. So you can follow my logic, I initially created 3 sets, one each for these indices: 1) fixed V1 (V1_set), 2) fixed V2 (V2_set) and 3) boundary (zero_set), then unioned all 3 sets into a single set (called check_set). You could start with a single set and update it as you add. Side note: your code has V[1,:,:] = 0, but I think you really want V[0,:,:] = 0. Let me know if I interpreted that incorrectly.
Second, you need to loop over the axis length in each direction (the attributes are V.shape[0], V.shape[1], V.shape[2]). Inside the loop I check each (j,k,l) against check_set, and only calculate a new V[j, k, l] value if it is NOT in the set.
See code below:
V1 = 10
V2 = -5
Mx = 101
My = 151
Mz = 301
V1_set = { (46,k,l) for k in range(51,101,1) for l in range(101,201,1) }
V2_set = { (56,k,l) for k in range(51,101,1) for l in range(101,201,1) }
zero_set = set()
zero_set.update( { (0,k,l) for k in range(My) for l in range(Mz) } )
zero_set.update( { (100,k,l) for k in range(My) for l in range(Mz) } )
zero_set.update( { (j,0,l) for j in range(Mx) for l in range(Mz) } )
zero_set.update( { (j,150,l) for j in range(Mx) for l in range(Mz) } )
zero_set.update( { (j,k,0) for j in range(Mx) for k in range(My) } )
zero_set.update( { (j,k,300) for j in range(Mx) for k in range(My) } )
check_set = set().union(V1_set,V2_set,zero_set)
V = np.zeros((Mx, My, Mz)).astype(int)
V[46, 51:101, 101:201] = V1 #the values of these nodes should stay fixed throughout iteration
V[56, 51:101, 101:201] = V2 #the values of these nodes should stay fixed throughout iteration
V[1,:,:] =V[100,:,:] =V[:,1,:] =V[:,150,:] =V[:,:,1] =V[:,:,300] = 0 #the values of these nodes should stay fixed throughout iteration
for j in range(V.shape[0]):
    for k in range(V.shape[1]):
        for l in range(V.shape[2]):
            if (j, k, l) not in check_set:
                V[j, k, l] = (V[j+1, k, l] + V[j-1, k, l] + V[j, k+1, l] + V[j, k-1, l] + V[j, k, l+1] + V[j, k, l-1])/6
After posting the solution above, it occurred to me that the ranges used in zero_set are intended to avoid the first/last (array boundary) indices. If so, there is no need for zero_set. You can handle this by modifying the range arguments as shown below:
check_set = set().union(V1_set,V2_set)
for j in range(1, V.shape[0]-1):
    for k in range(1, V.shape[1]-1):
        for l in range(1, V.shape[2]-1):
            if (j, k, l) not in check_set:
                V[j, k, l] = (V[j+1, k, l] + V[j-1, k, l] + V[j, k+1, l] + V[j, k-1, l] + V[j, k, l+1] + V[j, k, l-1])/6
Additional observations to consider:
I noticed you created array V with .astype(int). Are you sure that's what you want (and not floats)? In general, your calculations will not return integer values.
The way your code is written, you are changing the values of V[j,k,l] as you go. So you are using updated values of V[j,k,l] for j,k,l less than the current j,k,l, and previous V[j,k,l] values for j,k,l greater than the current j,k,l.
Finally, I assume you are going to iterate through this calculation until the change between 2 cycles is "acceptably small". If so, you need two copies of the array ("old" and "new") to take the difference. Take care to use .copy() when copying, so you create a new/different np.array object.
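A quick illustration of the difference (plain assignment only binds a second name to the same array, while .copy() makes an independent one):
import numpy as np
a = np.zeros(3)
b = a            # b is the SAME object: changing b changes a
b[0] = 1
print(a[0])      # 1.0
c = a.copy()     # c is an independent array
c[1] = 5
print(a[1])      # 0.0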
This is an updated answer based on new information and code added to the initial post. You have at least one problem with your logic: the if (j,k,l) not in check_set: block skips over the (j,k,l) values that you want to hold constant. As a result, you don't calculate Vnew at these points. That will cause problems when calculating the change with each iteration (and will give the wrong result).
Also, I think you need V = Vnew.copy(). Otherwise, V and Vnew reference the same object.
Here is my simple approach to iterate with a hardcoded error tolerance.
check_set = set().union(V1_set, V2_set)
Vi = V.copy()
Vn = np.zeros((Mx, My, Mz))
diff = max(abs(V1), abs(V2))
i = 1
print('Start Cycle#', i, '; diff =', diff)
while diff > 0.25:
    for j in range(1, V.shape[0]-1):
        for k in range(1, V.shape[1]-1):
            for l in range(1, V.shape[2]-1):
                if (j, k, l) in check_set:
                    Vn[j, k, l] = Vi[j, k, l]
                else:
                    Vn[j, k, l] = (Vi[j+1, k, l] + Vi[j-1, k, l] + Vi[j, k+1, l] + Vi[j, k-1, l] + Vi[j, k, l+1] + Vi[j, k, l-1])/6
    diff = max(abs(np.amax(Vn - Vi)), abs(np.amin(Vn - Vi)))
    print('Cycle#', i, 'completed; diff =', diff)
    i += 1
    Vi = Vn.copy()
This implementation will "converge" in 10 iterations. However, it only checks that the change between two successive cycles is less than a hard-coded tolerance (similar to the second part of the desired error check).
I did NOT implement the first error check: "use the maximum value of the local residual, defined as the (absolute value of the) difference between the potential value at the central node and the arithmetic average of the other values in the stencil." I am not 100% sure of the intent. Is the stencil the 6 points around [j,k,l]? If so, I think you need a similar calculation AFTER you calculate the new Vn values, something like this:
error[j, k, l] = abs(Vn[j, k, l] - (Vn[j+1, k, l] + Vn[j-1, k, l] + Vn[j, k+1, l] + Vn[j, k-1, l] + Vn[j, k, l+1] +Vn[j, k, l-1])/6 )
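As an aside, the triple Python loop is the main cost here. Because each cycle only reads values from the previous cycle, the same Jacobi-style update can be written with NumPy slicing, using a boolean mask to restore the fixed nodes. A minimal sketch with the same grid, plates and stopping rules as above (not the assignment's required loop structure, and it may still need many cycles to converge):
import numpy as np

Mx, My, Mz = 101, 151, 301
V1, V2 = 10, -5
rtol = 10**-2

V = np.zeros((Mx, My, Mz))
V[46, 51:101, 101:201] = V1
V[56, 51:101, 101:201] = V2

# Boolean mask of nodes whose values must stay fixed (plates + outer boundaries).
fixed = np.zeros(V.shape, dtype=bool)
fixed[46, 51:101, 101:201] = True
fixed[56, 51:101, 101:201] = True
fixed[0, :, :] = fixed[-1, :, :] = True
fixed[:, 0, :] = fixed[:, -1, :] = True
fixed[:, :, 0] = fixed[:, :, -1] = True

err_prev = float('inf')
while True:
    Vn = V.copy()
    # Average of the 6 neighbours for every interior node at once.
    Vn[1:-1, 1:-1, 1:-1] = (V[2:, 1:-1, 1:-1] + V[:-2, 1:-1, 1:-1] +
                            V[1:-1, 2:, 1:-1] + V[1:-1, :-2, 1:-1] +
                            V[1:-1, 1:-1, 2:] + V[1:-1, 1:-1, :-2]) / 6
    Vn[fixed] = V[fixed]           # restore the fixed nodes
    err = np.max(np.abs(Vn - V))   # max local residual: |centre - neighbour average|
    V = Vn
    if err < rtol or err == err_prev:
        break
    err_prev = err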
I have a Python anonymisation mechanism that relies on generating fake data from existing attributes.
Those attributes are accessible in the domain D, which is an array of 16 sets, each set holding the possible values of one attribute.
The attributes are ['uid', 'trans_id', 'trans_date', 'trans_type', 'operation', 'amount', 'balance', 'k_symbol', 'bank', 'acct_district_id', 'frequency', 'acct_date', 'disp_type', 'cli_district_id', 'gender', 'zip'].
Some attributes have very few values (gender is M or F), while others are unique (uid) and can have 1,260,000 different values.
The fake data is generated as tuples of attribute values selected at random from the domain.
I have to generate nearly 2 million tuples.
The first implementation of this was:
def beta_step(I, V, beta, n, m, D):
    r = approx_binomial(m - n, beta)
    print("r = " + str(r))
    i = 0
    while i < r:
        t = []
        for attribute in D:
            a_j = choice(list(attribute))
            t.append(a_j)
        if t not in V + I:
            V.append(t)
            i += 1
This took around 0.5 s per tuple.
Note that I and V are existing lists (with initially 1,200,000 and 800,000 tuples respectively).
I already found out that I could speed things up by converting D to a 2D list (a list of lists) once and for all, so the sets aren't converted to lists on every pass:
for attribute in D:
    a_j = choice(attribute)
    t.append(a_j)
This gets me down to 0.2 s per tuple.
I also tried looping fewer times and generating multiple tuples at a time, like so:
def beta_step(I, V, beta, n, m, D):
    D = [list(attr) for attr in D]  # Convert D to a 2D list
    r = approx_binomial(m - n, beta)
    print("r = " + str(r))
    i = 0
    NT = 1000  # Number of tuples generated at a time
    while i < r:
        T = [[] for j in range(NT)]
        for attribute in D:
            a_j = choices(attribute, k=min(NT, r-i))
            for j in range(len(a_j)):
                T[j].append(a_j[j])
        for t in T:
            if t not in V + I:
                V.append(t)
                i += 1
But this takes around 220 s for 1000 tuples, so it is not faster than before.
I have timed the different parts, and it seems that the last for loop takes most of the time (around 217 s).
Is there any way I could speed things up so that it doesn't run for 50 hours?
=======================
EDIT: I implemented @Larri's suggestion like this:
def beta_step(I, V, beta, n, m, D):
    D = [list(attr) for attr in D]  # Convert D to a list of lists
    I = set(tuple(t) for t in I)
    V = set(tuple(t) for t in V)
    r = approx_binomial(m - n, beta)
    print("r = " + str(r))
    i = 0
    print('SIZE I', len(I))
    print('SIZE V', len(V))
    NT = 1000  # Number of tuples to generate at each pass
    while i < r:
        T = [[] for j in range(min(NT, r-i))]
        for attribute in D:
            a_j = choices(attribute, k=min(NT, r-i))
            for j in range(len(a_j)):
                T[j].append(a_j[j])
        new_T = set(tuple(t) for t in T) - I
        size_V_before = len(V)
        V.update(new_T)
        size_V_after = len(V)
        delta_V = size_V_after - size_V_before
        i += delta_V
    return [list(t) for t in V]
It now takes about 0 s to add elements to V.
In total, adding the 1,680,000 tuples took 91 s.
However, converting back to a 2D list takes 200 s. Is there a way to make that faster without rewriting the whole program to work on sets?
For the last for loop at least, consider converting to sets instead of using lists. That allows you to use the set.update() method without having to check whether t is already included in V. This is assuming that you can incorporate the A in the logic somehow; from the given code I can't see any reference to A.
So you can change it to something like V.update(T). The i would then be the delta of len(V) before and after the operation.
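A minimal sketch of the idea (the tiny domain and the names existing and batch are made up for illustration):
from random import choices

# Hypothetical tiny domain: one list of candidate values per attribute.
D = [list(range(10)), list('MF'), list(range(100))]
existing = {(1, 'M', 5), (2, 'F', 42)}   # stands in for I: tuples to exclude
V = set()                                # accepted tuples

NT = 4                                         # batch size
columns = [choices(attr, k=NT) for attr in D]  # one column of values per attribute
batch = set(zip(*columns))                     # NT candidate tuples

V.update(batch - existing)   # set difference + update: no per-tuple membership test
print(len(V), 'tuples accepted')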
I have a matrix and I need to find the max element and its index. How can I rewrite this without operator (with a plain for loop)?
for j in range(size - 1):
    i, val = max(enumerate(copy[j::, j]), key=operator.itemgetter(1))
    copy = change_rows(copy, j, i)
    P = change_rows(P, j, i)
And actually, maybe you can explain what this line means?
i, val = max(enumerate(copy[j::, j]), key=operator.itemgetter(1))
Let's decompose this line.
i, val = max(enumerate(copy[j::, j]), key=operator.itemgetter(1))
First, enumerate() creates an iterator over copy[j::,j] that yields index-value pairs. For example,
>>> for i, val in enumerate("abcd"):
... print(i, val)
...
0 a
1 b
2 c
3 d
Next, the max() function finds the largest item in a sequence. But we want it to compare the values from copy[j::, j], not the indices that we are also getting from enumerate(). Specifying key=operator.itemgetter(1) tells max() to look at the (i, val) pairs and find the one with the largest val.
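For example:
>>> import operator
>>> max(enumerate([3, 1, 4]), key=operator.itemgetter(1))
(2, 4)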
This is probably better done with np.argmax(), especially because val goes unused.
>>> import numpy as np
>>> for j in range(size - 1):
...     i = np.argmax(copy[j::, j])  # Changed this line.
...     copy = change_rows(copy, j, i)
...     P = change_rows(P, j, i)
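And if the goal is specifically to avoid the operator module and use a plain for loop, that one line can be rewritten like this (a sketch keeping the same variable names):
# Plain-loop equivalent of:
#   i, val = max(enumerate(copy[j::, j]), key=operator.itemgetter(1))
column = copy[j::, j]
i, val = 0, column[0]
for idx, v in enumerate(column):
    if v > val:  # strict > keeps the first maximum, just like max() does
        i, val = idx, v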
I have to make a program that takes a list of numbers as input and returns the sum of the subsequence that starts and ends with the same number and has the maximum sum (including the two equal numbers at the beginning and end of the subsequence). It also has to return the positions of the start and end of the subsequence, that is, their index + 1. The problem is that my current code runs smoothly only while the list is short; when the list length reaches 5000, the program does not give an answer.
The input is the following:
6
3 2 4 3 5 6
The first line gives the length of the list. The second line is the list itself, with the items separated by spaces. The output will be 12, 1, 4 because there is one pair of equal numbers (3), the first and fourth elements, so the sum of the elements between them is 3 + 2 + 4 + 3 = 12, and their positions are first and fourth.
Here is my code.
length = int(input())
mass = raw_input().split()
for i in range(length):
    mass[i] = int(mass[i])
value = -10000000000
b = 1
e = 1
for i in range(len(mass)):
    if mass[i:].count(mass[i]) != 1:
        for j in range(i, len(mass)):
            if mass[j] == mass[i]:
                f = mass[i:j+1]
                if sum(f) > value:
                    value = sum(f)
                    b = i+1
                    e = j+1
    else:
        if mass[i] > value:
            value = mass[i]
            b = i+1
            e = i+1
print value
print b, e
This should be faster than your current approach.
Rather than searching through mass looking for pairs of matching numbers, we pair each number in mass with its index and sort those pairs. We can then use groupby to find groups of equal numbers. If there are more than 2 of the same number, we use the first and last, since they will have the greatest sum between them.
from operator import itemgetter
from itertools import groupby

raw = '3 5 6 3 5 4'
mass = [int(u) for u in raw.split()]
result = []
a = sorted((u, i) for i, u in enumerate(mass))
for _, g in groupby(a, itemgetter(0)):
    g = list(g)
    if len(g) > 1:
        u, v = g[0][1], g[-1][1]
        result.append((sum(mass[u:v+1]), u+1, v+1))
print(max(result))
output
(19, 2, 5)
Note that this code will not necessarily give the maximum sum between equal elements if the list contains negative numbers. It will still work correctly with negative numbers provided no group of equal numbers has more than two members; for example, with [3, -10, 3, 5, 3] the first and last 3 bound a sum of 4, while the last two 3s bound a sum of 11. If that case can occur, we need a slower algorithm that tests every pair within a group of equal numbers.
Here's a more efficient version. Instead of using the sum function, we build a list of the cumulative sums of the whole list. This doesn't make much of a difference for small lists, but it's much faster when the list is large; e.g., for a list of 10,000 elements this approach is about 10 times faster. To test it, I create a list of random positive integers.
from operator import itemgetter
from itertools import groupby
from random import seed, randrange

seed(42)

def maxsum(seq):
    total = 0
    sums = [0]
    for u in seq:
        total += u
        sums.append(total)
    result = []
    a = sorted((u, i) for i, u in enumerate(seq))
    for _, g in groupby(a, itemgetter(0)):
        g = list(g)
        if len(g) > 1:
            u, v = g[0][1], g[-1][1]
            result.append((sums[v+1] - sums[u], u+1, v+1))
    return max(result)

num = 25000
hi = num // 2
mass = [randrange(1, hi) for _ in range(num)]
print(maxsum(mass))
output
(155821402, 21, 24831)
If you're using a recent version of Python you can use itertools.accumulate to build the list of cumulative sums. This is around 10% faster.
from itertools import accumulate

def maxsum(seq):
    sums = [0] + list(accumulate(seq))
    result = []
    a = sorted((u, i) for i, u in enumerate(seq))
    for _, g in groupby(a, itemgetter(0)):
        g = list(g)
        if len(g) > 1:
            u, v = g[0][1], g[-1][1]
            result.append((sums[v+1] - sums[u], u+1, v+1))
    return max(result)
Here's a faster version, derived from code by Stefan Pochmann, which uses a dict, instead of sorting & groupby. Thanks, Stefan!
def maxsum(seq):
    total = 0
    sums = [0]
    for u in seq:
        total += u
        sums.append(total)
    where = {}
    for i, x in enumerate(seq, 1):
        where.setdefault(x, [i, i])[1] = i
    return max((sums[j] - sums[i - 1], i, j)
               for i, j in where.values())
If the list contains no duplicate items (and hence no subsequences bounded by duplicate items), it returns the maximum item in the list.
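For instance, on the sample input from the question:
>>> maxsum([3, 2, 4, 3, 5, 6])
(12, 1, 4)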
Here are two more variations. These can handle negative items correctly, and if there are no duplicate items they return None. In Python 3 that could be handled elegantly by passing default=None to max, but that option isn't available in Python 2, so instead I catch the ValueError exception that's raised when you attempt to find the max of an empty iterable.
The first version, maxsum_combo, uses itertools.combinations to generate all pairs of positions within a group of equal numbers and then finds the pair that gives the maximum sum. The second version, maxsum_kadane, uses a variation of Kadane's algorithm to find the maximum subsequence within a group.
If there aren't many duplicates in the original sequence, so the average group size is small, maxsum_combo is generally faster. But if the groups are large, then maxsum_kadane is much faster than maxsum_combo. The code below tests these functions on random sequences of 15000 items, firstly on sequences with few duplicates (and hence small mean group size) and then on sequences with lots of duplicates. It verifies that both versions give the same results, and then it performs timeit tests.
from __future__ import print_function
from itertools import groupby, combinations
from random import seed, randrange
from timeit import Timer

seed(42)

def maxsum_combo(seq):
    total = 0
    sums = [0]
    for u in seq:
        total += u
        sums.append(total)
    where = {}
    for i, x in enumerate(seq, 1):
        where.setdefault(x, []).append(i)
    try:
        return max((sums[j] - sums[i - 1], i, j)
                   for v in where.values() for i, j in combinations(v, 2))
    except ValueError:
        return None

def maxsum_kadane(seq):
    total = 0
    sums = [0]
    for u in seq:
        total += u
        sums.append(total)
    where = {}
    for i, x in enumerate(seq, 1):
        where.setdefault(x, []).append(i)
    try:
        return max(max_sublist([(sums[j] - sums[i-1], i, j)
                                for i, j in zip(v, v[1:])], k)
                   for k, v in where.items() if len(v) > 1)
    except ValueError:
        return None

# Kadane's algorithm to find the maximum sublist
# From https://en.wikipedia.org/wiki/Maximum_subarray_problem
def max_sublist(seq, k):
    max_ending_here = max_so_far = seq[0]
    for x in seq[1:]:
        y = max_ending_here[0] + x[0] - k, max_ending_here[1], x[2]
        max_ending_here = max(x, y)
        max_so_far = max(max_so_far, max_ending_here)
    return max_so_far

def test(num, hi, loops):
    print('\nnum = {0}, hi = {1}, loops = {2}'.format(num, hi, loops))
    print('Verifying...')
    for k in range(5):
        mass = [randrange(-hi // 2, hi) for _ in range(num)]
        a = maxsum_combo(mass)
        b = maxsum_kadane(mass)
        print(a, b, a == b)
    print('\nTiming...')
    for func in maxsum_combo, maxsum_kadane:
        t = Timer(lambda: func(mass))
        result = sorted(t.repeat(3, loops))
        result = ', '.join([format(u, '.5f') for u in result])
        print('{0:14} : {1}'.format(func.__name__, result))

loops = 20
num = 15000
hi = num // 4
test(num, hi, loops)

loops = 10
hi = num // 100
test(num, hi, loops)
output
num = 15000, hi = 3750, loops = 20
Verifying...
(13983131, 44, 14940) (13983131, 44, 14940) True
(13928837, 27, 14985) (13928837, 27, 14985) True
(14057416, 40, 14995) (14057416, 40, 14995) True
(13997395, 65, 14996) (13997395, 65, 14996) True
(14050007, 12, 14972) (14050007, 12, 14972) True
Timing...
maxsum_combo : 1.72903, 1.73780, 1.81138
maxsum_kadane : 2.17738, 2.22108, 2.22394
num = 15000, hi = 150, loops = 10
Verifying...
(553789, 21, 14996) (553789, 21, 14996) True
(550174, 1, 14992) (550174, 1, 14992) True
(551017, 13, 14991) (551017, 13, 14991) True
(554317, 2, 14986) (554317, 2, 14986) True
(558663, 15, 14988) (558663, 15, 14988) True
Timing...
maxsum_combo : 7.29226, 7.34213, 7.36688
maxsum_kadane : 1.07532, 1.07695, 1.10525
This code runs on both Python 2 and Python 3. The above results were generated on an old 32 bit 2GHz machine running Python 2.6.6 on a Debian derivative of Linux. The speeds for Python 3.6.0 are similar.
If you want to include groups that consist of a single non-repeated number, and also want to include the numbers that are in groups as a "subsequence" of length 1, you can use this version:
def maxsum_kadane(seq):
    if not seq:
        return None
    total = 0
    sums = [0]
    for u in seq:
        total += u
        sums.append(total)
    where = {}
    for i, x in enumerate(seq, 1):
        where.setdefault(x, []).append(i)
    # Find the maximum of the single items
    m_single = max((k, v[0], v[0]) for k, v in where.items())
    # Find the maximum of the subsequences
    try:
        m_subseq = max(max_sublist([(sums[j] - sums[i-1], i, j)
                                    for i, j in zip(v, v[1:])], k)
                       for k, v in where.items() if len(v) > 1)
        return max(m_single, m_subseq)
    except ValueError:
        # No subsequences
        return m_single
I haven't tested it extensively, but it should work. ;)
I am a beginner in both programming and bioinformatics, so I would appreciate your understanding. I tried to develop a Python script for motif search using Gibbs sampling, as explained in the Coursera class "Finding Hidden Messages in DNA". The pseudocode provided in the course is:
GIBBSSAMPLER(Dna, k, t, N)
    randomly select k-mers Motifs = (Motif1, …, Motift) in each string from Dna
    BestMotifs ← Motifs
    for j ← 1 to N
        i ← Random(t)
        Profile ← profile matrix constructed from all strings in Motifs except for Motifi
        Motifi ← Profile-randomly generated k-mer in the i-th sequence
        if Score(Motifs) < Score(BestMotifs)
            BestMotifs ← Motifs
    return BestMotifs
Problem description:
CODE CHALLENGE: Implement GIBBSSAMPLER.
Input: Integers k, t, and N, followed by a collection of strings Dna.
Output: The strings BestMotifs resulting from running GIBBSSAMPLER(Dna, k, t, N) with
20 random starts. Remember to use pseudocounts!
Sample Input:
8 5 100
CGCCCCTCTCGGGGGTGTTCAGTAACCGGCCA
GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG
TAGTACCGAGACCGAAAGAAGTATACAGGCGT
TAGATCAAGTTTCAGGTGCACGTCGGTGAACC
AATCCACCAGCTCCACGTGCAATGTTGGCCTA
Sample Output:
TCTCGGGG
CCAAGGTG
TACAGGCG
TTCAGGTG
TCCACGTG
I followed the pseudocode to the best of my knowledge. Here is my code:
def BuildProfileMatrix(dnamatrix):
    ProfileMatrix = [[1 for x in xrange(len(dnamatrix[0]))] for x in xrange(4)]
    indices = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    for seq in dnamatrix:
        for i in xrange(len(dnamatrix[0])):
            ProfileMatrix[indices[seq[i]]][i] += 1
    ProbMatrix = [[float(x)/sum(zip(*ProfileMatrix)[0]) for x in y] for y in ProfileMatrix]
    return ProbMatrix

def ProfileRandomGenerator(profile, dna, k, i):
    indices = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    score_list = []
    for x in xrange(len(dna[i]) - k + 1):
        probability = 1
        window = dna[i][x : k + x]
        for y in xrange(k):
            probability *= profile[indices[window[y]]][y]
        score_list.append(probability)
    rnd = uniform(0, sum(score_list))
    current = 0
    for z, bias in enumerate(score_list):
        current += bias
        if rnd <= current:
            return dna[i][z : k + z]

def score(motifs):
    ProfileMatrix = [[0 for x in xrange(len(motifs[0]))] for x in xrange(4)]
    indices = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    for seq in motifs:
        for i in xrange(len(motifs[0])):
            ProfileMatrix[indices[seq[i]]][i] += 1
    score = len(motifs)*len(motifs[0]) - sum([max(x) for x in zip(*ProfileMatrix)])
    return score

from random import randint, uniform

def GibbsSampler(k, t, N):
    dna = ['CGCCCCTCTCGGGGGTGTTCAGTAACCGGCCA',
           'GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG',
           'TAGTACCGAGACCGAAAGAAGTATACAGGCGT',
           'TAGATCAAGTTTCAGGTGCACGTCGGTGAACC',
           'AATCCACCAGCTCCACGTGCAATGTTGGCCTA']
    Motifs = []
    for i in [randint(0, len(dna[0])-k) for x in range(len(dna))]:
        j = 0
        kmer = dna[j][i : k+i]
        j += 1
        Motifs.append(kmer)
    BestMotifs = []
    s_best = float('inf')
    for i in xrange(N):
        x = randint(0, t-1)
        Motifs.pop(x)
        profile = BuildProfileMatrix(Motifs)
        Motif = ProfileRandomGenerator(profile, dna, k, x)
        Motifs.append(Motif)
        s_motifs = score(Motifs)
        if s_motifs < s_best:
            s_best = s_motifs
            BestMotifs = Motifs
    return [s_best, BestMotifs]

k, t, N = 8, 5, 100
best_motifs = [float('inf'), None]
# Repeat the Gibbs sampler search 20 times.
for repeat in xrange(20):
    current_motifs = GibbsSampler(k, t, N)
    if current_motifs[0] < best_motifs[0]:
        best_motifs = current_motifs
# Print and save the answer.
print '\n'.join(best_motifs[1])
Unfortunately, my code never gives the same output as the solved example. Besides, while trying to debug the code I found that I get weird scores for the mismatches between motifs. However, when I tried to run the score function separately, it worked perfectly.
Each time I run the script the output changes, but here is an example of one of the outputs for the input hard-coded above:
Example output of my code
TATGTGTA
TATGTGTA
TATGTGTA
GGTGTTCA
TATACAGG
Could you please help me debug this code? I spent the whole day trying to find out what's wrong with it; I know it might be some silly mistake I made, but my eye failed to catch it.
Thank you all!!
Finally, I found out what was wrong in my code! It was in line 54:
Motifs.append(Motif)
After randomly removing one of the motifs, building a profile from the remaining motifs and then randomly selecting a new motif based on this profile, I should have inserted the selected motif at the same position it was removed from, NOT appended it to the end of the motif list.
Now, the correct code is:
Motifs.insert(x, Motif)
The new code worked as expected.
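For context, this is the relevant part of the GibbsSampler loop with the fix applied:
for i in xrange(N):
    x = randint(0, t-1)
    Motifs.pop(x)                                       # remove the motif of sequence x
    profile = BuildProfileMatrix(Motifs)                # build a profile from the rest
    Motif = ProfileRandomGenerator(profile, dna, k, x)  # sample a new motif for sequence x
    Motifs.insert(x, Motif)                             # put it back at the same position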
I need to fill a dictionary with key-value pairs, generated by the following code:
for i in range(1, n+1):
    d = {}
    Ri = Vector([#SomeCoordinates])
    for k in range(1, n+1):
        Rk = Vector([#SomeCoordinates])
        if i != k:
            d['R'+str(i)+str(k)] = (Rk-Ri).mod  # Distance between Ri and Rk
        else:
            None
""" Since (Rk-Ri).mod gives me the distance between two points (i and k),
it's meaningless to calc the distance if i == k. """
Here's the problem:
'Rik' represents the same distance as 'Rki', and I don't want to add a distance twice.
Then I tried this code:
if i != k and (('R'+str(i)+str(k)) and ('R'+str(k)+str(i))) not in d:
    d['R'+str(i)+str(k)] = (Rk-Ri).mod
else:
    None
but the problem is still there.
When I print d I get R12 but also R21 (and the same with every pair of numbers i, k).
What can I do?
You could use the following:
d = {}
for i in range(1, n + 1):
    Ri = Vector([#SomeCoordinates])
    for k in range(i + 1, n + 1):
        Rk = Vector([#SomeCoordinates])
        d[i, k] = d[k, i] = (Rk - Ri).mod
This way we ensure we take each pair only once (by enforcing k > i), and then we assign the distance to the dictionary for both (i, k) and (k, i).
I used d[i, k] instead of d['R' + str(i) + str(k)] because the latter has the following disadvantage: we can't infer, for example, whether d['R123'] refers to (12, 3) or (1, 23).
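A quick demonstration of the collision:
>>> d = {}
>>> d[1, 23] = 'a'
>>> d[12, 3] = 'b'    # tuple keys stay distinct
>>> len(d)
2
>>> s = {}
>>> s['R' + str(1) + str(23)] = 'a'
>>> s['R' + str(12) + str(3)] = 'b'    # both keys are 'R123': the first entry is overwritten
>>> len(s)
1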
Also, I moved the dictionary initialisation (d = {}) outside both loops, because your code re-initialises it for every i.
If I understand you correctly, you are looking for all the combinations of two elements. You can use itertools.combinations to automatically generate all such combinations with no duplicates.
d = {}
for i, k in itertools.combinations(range(1, n+1), 2):
    Ri = Vector([SomeCoordinates])
    Rk = Vector([SomeCoordinates])
    d['R'+str(i)+str(k)] = (Rk-Ri).mod
You could even make it a dict comprehension (although it may be a bit long):
d = {'R'+str(i)+str(k): (Vector([SomeCoordinates]) - Vector([SomeCoordinates])).mod
     for i, k in itertools.combinations(range(1, n+1), 2)}
Or, to do the (possibly expensive) calculation of Vector([SomeCoordinates]) just once for each value of i or k, try this (thanks to JuniorCompressor for pointing this out):
R = {i: Vector([SomeCoordinates]) for i in range(1, n+1)}
d = {(i, k): (R[i] - R[k]).mod for i, k in itertools.combinations(range(1, n+1), 2)}
Also, as others have noted, 'R'+str(i)+str(k) is not a good key, as it will be impossible to distinguish between e.g. (1,23) and (12,3), as both end up as 'R123'. I suggest you just use the tuple (i,k) instead.
You might always put the smaller value first so that a previous entry is automatically overwritten:
if i != k:
    key = str(i) + "," + str(k) if i < k else str(k) + "," + str(i)
    d['R'+key] = (Rk-Ri).mod
(I assume that your script only needs the distance values, not information from the current keys.)