How to slice array quickly based on conditions? - python

I have a giant nested for loop: 10 levels in all, but for illustration I am including 6 here. I am doing a summation over multiple indices, and the indices are not independent. The index in any inner for loop depends on the index of the outer loop (except for one instance). The innermost loop contains an operation where I slice an array (named 'w') based on 8 different conditions, all combined using '&' and '|'. There is also an 'HB' function that takes this sliced array (named 'wrange') as an argument, performs some operations on it, and returns an array of the same size.
The slicing takes 300-400 microseconds and the 'HB' function takes about 100 microseconds per call. I need to bring both down drastically, ideally to nanoseconds.
I tried using a dictionary instead of the array that I am slicing; it is much slower. I tried storing the sliced array for all possible values, but that is a very large computation in its own right, since there are many possible combinations of the conditions (these conditions depend indirectly on the indices of the for loops).
s goes from 1 to 49
t goes from -s to s
and there are 641 combinations of l,n
Here, I have posted one value of s,t and one l,n combination for illustration.
s = 7
t = -7
l = 72
n = 12
Nl = Dictnorm[n,l]
Gamma_l = Dictfwhm[n,l]
Dictc1 = {}
Dictc2 = {}
Dictwrange = {}
DictH = {}
DictG = {}
product = []
startm = max(-l-t,-l)
endm = min(l-t,l)+1
sum5 = 0
for sp in range(s-2,s+3): #s'
    sum4 = 0
    for tp in range(-sp,-sp+1): #t'
        #print(tp)
        sum3 = 0
        integral = 1
        for lp in range(l-2,l+3): #l'
            sum2 = 0
            if (n,lp) in Dictknl2.keys():
                N1 = Dictnorm[n,lp]
                Gamma_1 = Dictfwhm[n,lp]
                for lpp in range(l-2,l+3): #l"
                    sum1 = 0
                    if ((sp+lpp-lp)%2 == 1 and sp>=abs(lpp-lp) and
                            lp>=abs(sp-lpp) and lpp>=abs(sp-lp) and
                            (n,lpp) in Dictknl2.keys()):
                        F = f(lpp,lp,sp)
                        N2 = Dictnorm[n,lpp]
                        Gamma_2 = Dictfwhm[n,lpp]
                        for m in range(startm, endm): #m
                            sum0 = 0
                            L1 = LKD(n,l,m,l,m)
                            L2 = LKD(n,l,m+t,l,m+t)
                            for mp in range(max(m+t-tp-5,m-5),
                                            min(m+5,m+t-tp+5)+1): #m'
                                if (abs(mp)<=lp and abs(mp)<=lpp and
                                        abs(mp+tp)<=lp and abs(mp+tp)<=lpp
                                        and LKD(n,l,m,lp,mp)!=0
                                        and LKD(n,l,m+t,lpp,mp+tp)!=0):
                                    c3 = Dictomega[n,lp,mp+tp]
                                    c4 = Dictomega[n,lpp,mp]
                                    wrange = np.unique(np.concatenate
                                        ((Dictwrange[m],
                                          w[((w>=(c3-Gamma_1))&
                                             ((c3+Gamma_1)>=w))|
                                            ((w>=(c4-Gamma_2))&
                                             ((c4+Gamma_2)>=w))])))
                                    factor = (sum(
                                        HB(Dictc1[n,l,m+t],
                                           Dictc2[n,l,m],Nl,
                                           Nl,Gamma_l,
                                           Gamma_l,wrange,
                                           Sigma).conjugate()
                                        *HB(c3,c4,N1,N2,Gamma_1,
                                            Gamma_2,wrange,0)*L1*L2)
                                        *LKD(n,l,m,lp,mp)
                                        *LKD(n,l,m+t,lpp,mp+tp) *DictG[m]
                                        *gamma(lpp,sp,lp,tp,mp)
                                        *F)
                                    sum0 = sum0 + factor #sum over m'
                            sum1 = sum1 + sum0 #sum over m
                    sum2 = sum2 + sum1 #sum over l"
            sum3 = sum3 + sum2 #sum over l'
        sum4 = sum4 + sum3*integral #sum over t'
    sum5 = sum5 + sum4 #sum over s'
z = (1/(sum(product)))*sum5
print(z.real,z.imag,l,n)
TL;DR
def HB(a,...f,array1): #########timesucker
    perform_some_operations_on_array1_using_a_b_c_d
    return operated_on_array1
for i in ():
    for j in ():
        ...
        ...
            for o in ():
                array1 = w[w>some_function1(i,j,..k) &
                           w<some_function2(i,j,..k) |.....] #########timesucker
                factor = HB(a,....f,array1) * HB(g,...k,array1) *
                         alpha*beta*gamma....
It takes about 30 seconds to run this whole section once. I need to bring it down to as low as possible. 1 second is the minimum target
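One possible way to cut the slicing cost, sketched here under the assumption that 'w' is a sorted 1-D numpy array: each condition pair of the form (w >= c - Gamma) & (c + Gamma >= w) selects a contiguous window, so its bounds can be found by binary search with np.searchsorted instead of building boolean masks over the whole array. The helper name window_union and its argument names are hypothetical, and the sketch only covers the two-window case shown in the question, not all 8 conditions or the HB function.

```python
import numpy as np

def window_union(w_sorted, c3, g1, c4, g2):
    """Elements of sorted w within [c3-g1, c3+g1] or [c4-g2, c4+g2] (g1, g2 >= 0)."""
    # Binary search for the index range of each window.
    lo1 = np.searchsorted(w_sorted, c3 - g1, side="left")
    hi1 = np.searchsorted(w_sorted, c3 + g1, side="right")
    lo2 = np.searchsorted(w_sorted, c4 - g2, side="left")
    hi2 = np.searchsorted(w_sorted, c4 + g2, side="right")
    if lo2 < lo1:  # order the two windows by their start index
        lo1, hi1, lo2, hi2 = lo2, hi2, lo1, hi1
    if lo2 <= hi1:  # overlapping or touching windows: one contiguous slice (a view)
        return w_sorted[lo1:max(hi1, hi2)]
    # Disjoint windows: join the two slices, which stay in sorted order.
    return np.concatenate((w_sorted[lo1:hi1], w_sorted[lo2:hi2]))

# Example values chosen only for illustration.
w = np.arange(0.0, 100.0)
print(window_union(w, 20.0, 3.5, 60.0, 2.0))
```

If 'w' is sorted and has no duplicates, the result is also sorted and duplicate-free, so this part of the selection would not need np.unique. Whether that holds for the real 'w' is an assumption to check.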

Related

Question on Calculation Speed Difference in the two essentially same codes

Here are two pieces of code that handle the same data and return essentially the same result.
1.
for j in range(np.shape(I)[0]):
    if (j%int(np.shape(I)[0]/10) ==0):
        print(str(j/np.shape(I)[0]*100)+'% ........... is done')
    for k in range(np.shape(I)[1]):
        for i in range(np.shape(I)[1]):
            if (abs(time_resc_array[j,k]-time_tar[i]) < t_toler):
                I_pp[i] = I_pp[i]+ I[j,k]
                count[i]=count[i]+1
                norm[i]=norm[i]+1
                break
Here I and time_resc_array are 290*10000 numpy arrays, and count, I_pp, and time_tar are numpy arrays of length 290.
2.
trial = int(n_rep * N/10)
freq11 = freq1 * 10**6
average = 100*10**6
tau = np.zeros(trial)
pp_seq = int(n_rep2*(t_unit2 * 10 **-9) * 10 * average )
Narray = np.arange(0,pp_seq)
pp_tk = Narray * 1/(10*average) # divide 1 period of average freq by 10
pp_data = np.zeros(pp_seq)
pp_cnt = np.zeros(pp_seq)
Narray = np.arange(1, n_rep2+1)
oper_tk = Narray * (t_unit2 * 10 **-9)
for i in range(0,A):
    if (i%int(trial)==0):
        print(str(i/trial*100)+'.......... % is done')
    ptr = i%n_rep2
    tau[i] = oper_tk[ptr] * freq11[i//n_rep2][i%n_rep2] / average
    for j in range(0, pp_seq):
        if ptr == 0:
            break
        elif tau[i] < pp_tk[j]:
            pp_data[j] += I[i//n_rep2][i%n_rep2]
            pp_cnt[j] += 1
            break
where freq1 and I are 290*10000 arrays. The first code is approximately 4-5 times slower than the second one, and I don't understand why. Could somebody please help me understand what I am doing wrong with the first one?
P.S. The second code is not mine, so it may be deleted sooner or later.

any tip to improve performance when using nested loops with python

So, I had this exercise where I would receive a list of integers and had to find how many pairs have a sum that is a multiple of 60.
example:
input: list01 = [10,90,50,40,30]
result = 2
explanation: 10 + 50, 90 + 30
example2:
input: list02 = [60,60,60]
result = 3
explanation: list02[0] + list02[1], list02[0] + list02[2], list02[1] + list02[2]
seems pretty easy, so here is my code:
def getPairCount(numbers):
    total = 0
    cont = 0
    for n in numbers:
        cont+=1
        for n2 in numbers[cont:]:
            if (n + n2) % 60 == 0:
                total += 1
    return total
It works; however, for a big input with 100k+ numbers it takes too long to run, and I need it to run in under 8 seconds. Any tips on how to solve this issue,
either with a library that I'm unaware of or by solving it without a nested loop?
Here's a simple solution that should be extremely fast (it runs in O(n) time). It makes use of the following observation: We only care about each value mod 60. E.g. 23 and 143 are effectively the same.
So rather than making an O(n**2) nested pass over the list, we instead count how many of each value we have, mod 60, so each value we count is in the range 0 - 59.
Once we have the counts, we can consider the pairs that sum to 0 or 60. The pairs that work are:
0 + 0
1 + 59
2 + 58
...
29 + 31
30 + 30
After this, the order is reversed, but we only
want to count each pair once.
There are two cases where the values are the same:
0 + 0 and 30 + 30. For each of these, the number
of pairs is (count * (count - 1)) // 2. Note that
this works when count is 0 or 1, since in both cases
we're multiplying by zero.
If the two values are different, then the number of
cases is simply the product of their counts.
Here's the code:
def getPairCount(numbers):
    # Count how many of each value we have, mod 60
    count_list = [0] * 60
    for n in numbers:
        n2 = n % 60
        count_list[n2] += 1
    # Now find the total
    total = 0
    c0 = count_list[0]
    c30 = count_list[30]
    total += (c0 * (c0 - 1)) // 2
    total += (c30 * (c30 - 1)) // 2
    for i in range(1, 30):
        j = 60 - i
        total += count_list[i] * count_list[j]
    return total
This runs in O(n) time, due to the initial one-time pass we make over the list of input values. The loop at the end is just iterating from 1 through 29 and isn't nested, so it should run almost instantly.
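The remainder-counting idea can be cross-checked against a brute-force count on small inputs. The sketch below is only an illustration: both function names are hypothetical, and the second function is a compact restatement of the counting approach described above, not a replacement for it.

```python
from itertools import combinations

def pair_count_bruteforce(numbers):
    # O(n**2) reference: test every unordered pair directly.
    return sum((a + b) % 60 == 0 for a, b in combinations(numbers, 2))

def pair_count_by_remainder(numbers):
    # O(n): bucket the values by remainder mod 60, then combine the buckets.
    counts = [0] * 60
    for n in numbers:
        counts[n % 60] += 1
    # Remainders 0 and 30 pair with themselves: choose 2 from each bucket.
    total = counts[0] * (counts[0] - 1) // 2
    total += counts[30] * (counts[30] - 1) // 2
    # Remainder i pairs with remainder 60 - i for i = 1..29.
    total += sum(counts[i] * counts[60 - i] for i in range(1, 30))
    return total

for sample in ([10, 90, 50, 40, 30], [60, 60, 60]):
    assert pair_count_by_remainder(sample) == pair_count_bruteforce(sample)
    print(sample, pair_count_by_remainder(sample))
```

The two sample lists are the ones from the question, for which the expected results are 2 and 3.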
Below is a translation of Tom Karzes's answer but using numpy. I benchmarked it and it is only faster if the input is already a numpy array, not a list. I still want to write it here because it nicely shows how loops in python can be one-liners in numpy.
import numpy as np

def get_pairs_count(numbers, /):
    # Count how many of each value we have, modulo 60.
    numbers_mod60 = np.mod(numbers, 60)
    # bincount with minlength=60 keeps one slot per remainder,
    # even for remainders that never occur in the input.
    counts = np.bincount(numbers_mod60, minlength=60)
    # Now find the total.
    total = 0
    c0 = counts[0]
    c30 = counts[30]
    total += (c0 * (c0 - 1)) // 2
    total += (c30 * (c30 - 1)) // 2
    total += np.dot(counts[1:30:+1], counts[59:30:-1])  # Notice the slicing indices used.
    return total

Summing results from a monte carlo

I am trying to sum the values in the 'Callpayoff' list however am unable to do so, print(Callpayoff) returns a vertical list:
0
4.081687878300656
1.6000410648454846
0.5024316862043037
0
so I wonder if it's a special sublist ? sum(Callpayoff) does not work unfortunately. Any help would be greatly appreciated.
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt

def Generate_asset_price(S,v,r,dt):
    return (1 + r * dt + v * sqrt(dt) * np.random.normal(0,1))
def Call_Poff(S,T):
    return max(stream[-1] - S,0)
# initial values
S = 100
v = 0.2
r = 0.05
T = 1
N = 2 # number of steps
dt = 0.00396825
simulations = 5
for x in range(simulations):
    stream = [100]
    Callpayoffs = []
    t = 0
    for n in range(N):
        s = stream[t] * Generate_asset_price(S,v,r,dt)
        stream.append(s)
        t += 1
    Callpayoff = Call_Poff(S,T)
    print(Callpayoff)
    plt.plot(stream)
Right now you're not appending values to a list, you're just replacing the value of Callpayoff at each iteration and printing it. At each iteration, it's printed on a new line so it looks like a "vertical list".
What you need to do is use Callpayoffs.append(Call_Poff(S,T)) instead of Callpayoff = Call_Poff(S,T).
Now a new element will be added to Callpayoffs at every iteration of the for loop.
Then you can print the list with print(Callpayoffs) or the sum with print(sum(Callpayoffs))
All in all the for loop should look like this:
for x in range(simulations):
    stream = [100]
    Callpayoffs = []
    t = 0
    for n in range(N):
        s = stream[t] * Generate_asset_price(S,v,r,dt)
        stream.append(s)
        t += 1
        Callpayoffs.append(Call_Poff(S,T))
    print(Callpayoffs,"sum:",sum(Callpayoffs))
Output:
[2.125034975231003, 0] sum: 2.125034975231003
[0, 0] sum: 0
[0, 0] sum: 0
[0, 0] sum: 0
[3.2142923036024342, 4.1390018820809615] sum: 7.353294185683396
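If the goal is one payoff per simulation and a single total across all simulations, the list has to be created once, before the outer loop, and appended to after each path is finished. The sketch below illustrates that layout under a few assumptions: the function name and its parameters are hypothetical, the standard library's random.Random.gauss stands in for np.random.normal, and a fixed seed is used only to make runs repeatable.

```python
import math
import random

def simulate_call_payoffs(s0, strike, v, r, dt, n_steps, n_sims, seed=0):
    rng = random.Random(seed)  # seeded generator, so results are repeatable
    payoffs = []               # created once, outside the simulation loop
    for _ in range(n_sims):
        price = s0
        for _ in range(n_steps):
            # Same one-step growth factor as Generate_asset_price in the question.
            price *= 1 + r * dt + v * math.sqrt(dt) * rng.gauss(0, 1)
        # One call payoff per finished path.
        payoffs.append(max(price - strike, 0))
    return payoffs

payoffs = simulate_call_payoffs(100, 100, 0.2, 0.05, 0.00396825, 2, 5)
print(payoffs, "sum:", sum(payoffs))
```

Returning the list from a function also avoids relying on a global 'stream' variable inside the payoff calculation.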

Summation from sub list

Given n = 4 and m = 3, I have to select n elements (here 4) from the start of a list and n elements from its end. In the example below, these sublists are [17,12,10,2] and [2,11,20,8].
From these two sublists I have to select the highest-valued element, and that element then has to be deleted from the original list.
This step has to be performed m times, and the selected highest-valued elements are summed.
A = [17,12,10,2,7,2,11,20,8], n = 4, m = 3
O/P: 20+17+12=49
I have written the following code. However, its performance is not good, and it times out for larger lists. Could you please help?
A = [17,12,10,2,7,2,11,20,8]
m = 3
n = 4
scoreSum = 0
count = 0
firstGrp = []
lastGrp = []
while(count<m):
    firstGrp = A[:n]
    lastGrp = A[-n:]
    maxScore = max(max(firstGrp), max(lastGrp))
    scoreSum = scoreSum + maxScore
    if(maxScore in firstGrp):
        A.remove(maxScore)
    else:
        # index of the last occurrence of maxScore in A
        ai = len(A) - 1 - A[::-1].index(maxScore)
        A.pop(ai)
    count = count + 1
    firstGrp.clear()
    lastGrp.clear()
print(scoreSum )
I would do it this way; you can generalize it later:
a = [17,12,10,2,7,2,11,20,8]
a.sort(reverse=True)
sums=0
for i in range(3):
    sums +=a[i]
print(sums)
If you are concerned about performance, consider a numerical library such as numpy, which can be much faster on large arrays.
A = [17,12,10,2,7,11,20,8]
n = 4
m = 3
score = 0
for _ in range(m):
    sublist = A[:n] + A[-n:]
    subidx = [x for x in range(n)] + [x for x in range(len(A) - n, len(A))]
    sub = zip(sublist, subidx)
    maxval = max(sub, key=lambda x: x[0])
    score += maxval[0]
    del A[maxval[1]]
print(score)
Your method uses a lot of max() calls. Combining the front and back slices into one list reduces those searches to a single max() pass, and pairing each value with its index in A means the same pass also yields the position to delete.

Rosalind: Mendel's first law

I'm trying to solve the problem at http://rosalind.info/problems/iprb/
Given: Three positive integers k, m, and n, representing a population
containing k+m+n organisms: k individuals are homozygous dominant for
a factor, m are heterozygous, and n are homozygous recessive.
Return: The probability that two randomly selected mating organisms
will produce an individual possessing a dominant allele (and thus
displaying the dominant phenotype). Assume that any two organisms can
mate.
My solution works for the sample, but not for any problems generated. After further research it seems that I should find the probability of choosing any one organism at random, find the probability of choosing the second organism, and then the probability of that pairing producing offspring with a dominant allele.
My question is then: what does my code below find the probability of? Does it find the percentage of offspring with a dominant allele for all possible matings -- so rather than the probability of one random mating, my code is solving for the percentage of offspring with dominant alleles if all pairs were tested?
f = open('rosalind_iprb.txt', 'r')
r = f.read()
s = r.split()
############# k = # homozygotes dominant, m = #heterozygotes, n = # homozygotes recessive
k = float(s[0])
m = float(s[1])
n = float(s[2])
############# Counts for pairing between each group and within groups
k_k = 0
k_m = 0
k_n = 0
m_m = 0
m_n = 0
n_n = 0
##############
if k > 1:
    k_k = 1.0 + (k-2) * 2.0
k_m = k * m
k_n = k * n
if m > 1:
    m_m = 1.0 + (m-2) * 2.0
m_n = m * n
if n> 1:
    n_n = 1.0 + (n-2) * 2.0
#################
dom = k_k + k_m + k_n + 0.75*m_m + 0.5*m_n
total = k_k + k_m + k_n + m_m + m_n + n_n
chance = dom/total
print chance
Looking at your code, I'm having a hard time figuring out what it's supposed to do. I'll work through the problem here.
Let's simplify the wording. There are n1 type 1, n2 type 2, and n3 type 3 items.
How many ways are there to choose a set of size 2 out of all the items? (n1 + n2 + n3) choose 2.
Every pair of items will have item types corresponding to one of the six following unordered multisets: {1,1}, {2,2}, {3,3}, {1,2}, {1,3}, {2,3}
How many multisets of the form {i,i} are there? ni choose 2.
How many multisets of the form {i,j} are there, where i != j? ni * nj.
The probabilities of the six multisets are thus the following:
P({1,1}) = [n1 choose 2] / [(n1 + n2 + n3) choose 2]
P({2,2}) = [n2 choose 2] / [(n1 + n2 + n3) choose 2]
P({3,3}) = [n3 choose 2] / [(n1 + n2 + n3) choose 2]
P({1,2}) = [n1 * n2] / [(n1 + n2 + n3) choose 2]
P({1,3}) = [n1 * n3] / [(n1 + n2 + n3) choose 2]
P({2,3}) = [n2 * n3] / [(n1 + n2 + n3) choose 2]
These sum to 1. Note that [X choose 2] is just [X * (X - 1) / 2] for X > 1 and 0 for X = 0 or 1.
Return: The probability that two randomly selected mating organisms will produce an individual possessing a dominant allele (and thus displaying the dominant phenotype).
To answer this question, you simply need to identify which of the six multisets correspond to this event. Lacking the genetics knowledge to answer that question, I'll leave that to you.
For example, suppose that a dominant allele results if either of the two parents was type 1. Then the events of interest are {1,1}, {1,2}, {1,3} and the probability of the event is P({1,1}) + P({1,2}) + P({1,3}).
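The multiset probabilities above can be combined into the requested probability once each multiset is weighted by its chance of producing a dominant-phenotype offspring. The sketch below assumes the standard Mendelian weights that other answers in this thread use (1 for any pairing involving a homozygous dominant parent, 0.75 for two heterozygotes, 0.5 for heterozygous with recessive, 0 for two recessives). The function name is hypothetical, and math.comb requires Python 3.8 or later.

```python
from math import comb

def dominant_probability(k, m, n):
    # Number of unordered pairs that can be drawn from the whole population.
    total_pairs = comb(k + m + n, 2)
    # (number of pairs of this type, P(offspring shows the dominant phenotype))
    weighted = [
        (comb(k, 2), 1.0),   # {dominant, dominant}
        (k * m,      1.0),   # {dominant, heterozygous}
        (k * n,      1.0),   # {dominant, recessive}
        (comb(m, 2), 0.75),  # {heterozygous, heterozygous}
        (m * n,      0.5),   # {heterozygous, recessive}
        (comb(n, 2), 0.0),   # {recessive, recessive}
    ]
    return sum(count * p for count, p in weighted) / total_pairs

print(round(dominant_probability(2, 2, 2), 5))  # 0.78333
```

For k = m = n = 2 there are 15 possible pairs, and the weighted count is 1 + 4 + 4 + 0.75 + 2 = 11.75, which gives 11.75 / 15.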
I spent some time on this question, so, to clarify in Python:
lst = ['2', '2', '2']
k, m, n = map(float, lst)
t = sum(map(float, lst))
# build a list of: count for the first organism * count for the second organism * probability of a dominant offspring
# multiplications by one were omitted
# remember to subtract one from the group size when the second organism is chosen from the same group as the first
couples = [
k*(k-1), # AA x AA
k*m, # AA x Aa
k*n, # AA x aa
m*k, # Aa x AA
m*(m-1)*0.75, # Aa x Aa
m*n*0.5, # Aa x aa
n*k, # aa x AA
n*m*0.5, # aa x Aa
n*(n-1)*0 # aa x aa
]
# (t-1) accounts for the first organism having already been selected
print(round(sum(couples)/t/(t-1), 5))
If you are interested, I just found a solution and put it in C#.
public double mendel(double k, double m, double n)
{
    double prob;
    prob = ((k*k - k) + 2*(k*m) + 2*(k*n) + (.75*(m*m - m)) + 2*(.5*m*n))/((k + m + n)*(k + m + n -1));
    return prob;
}
Our parameters are k (dominant), m (hetero), & n (recessive).
First I found the probability of each possible breeding pair selection as a fraction of the population. So, a first-round choice of k would look like k/(k+m+n), and a second-round choice of k after a first-round choice of k would look like (k-1)/(k+m+n-1), since one organism has already been removed. Then multiply these two to get the probability of that outcome. Since there were three identified populations, there were nine possible outcomes.
Then I multiplied each outcome by its dominance probability: 100% for anything with k, 75% for m&m, 50% for m&n and n&m, and 0% for n&n. Now add the outcomes together, and you have your solution.
http://rosalind.info/problems/iprb/
Here is the code I did in python:
We don't want the offspring to be completely recessive, so we build the probability tree and look at the cases in which that can happen, together with their probabilities.
Then the probability that we want is 1 - p_recessive. More explanation is provided in the comment section of the following code.
"""
Let d: dominant, h: hetero, r: recessive
Let a = k+m+n
Let X = the r.v. associated with the first person randomly selected
Let Y = the r.v. associated with the second person randomly selected without replacement
Then:
k = f_d  =>  p(X=d) = k/a  =>  p(Y=d| X=d) = (k-1)/(a-1) ,
                               p(Y=h| X=d) = (m)/(a-1) ,
                               p(Y=r| X=d) = (n)/(a-1)
m = f_h  =>  p(X=h) = m/a  =>  p(Y=d| X=h) = (k)/(a-1) ,
                               p(Y=h| X=h) = (m-1)/(a-1) ,
                               p(Y=r| X=h) = (n)/(a-1)
n = f_r  =>  p(X=r) = n/a  =>  p(Y=d| X=r) = (k)/(a-1) ,
                               p(Y=h| X=r) = (m)/(a-1) ,
                               p(Y=r| X=r) = (n-1)/(a-1)
Now the joint would be:
                          |  offspring possibilities given X and Y choice
-------------------------------------------------------------------------
X  Y   P(X,Y)             |  d(dominant)   h(hetero)   r(recessive)
-------------------------------------------------------------------------
d  d   k/a*(k-1)/(a-1)    |  1             0           0
d  h   k/a*(m)/(a-1)      |  1/2           1/2         0
d  r   k/a*(n)/(a-1)      |  0             1           0
                          |
h  d   m/a*(k)/(a-1)      |  1/2           1/2         0
h  h   m/a*(m-1)/(a-1)    |  1/4           1/2         1/4
h  r   m/a*(n)/(a-1)      |  0             1/2         1/2
                          |
r  d   n/a*(k)/(a-1)      |  0             1           0
r  h   n/a*(m)/(a-1)      |  0             1/2         1/2
r  r   n/a*(n-1)/(a-1)    |  0             0           1
Here what we don't want are the entries in the very last column, where the offspring is completely recessive,
so P = 1 - (the sum of those cases), as follows
"""
path = 'rosalind_iprb.txt'
with open(path, 'r') as file:
    lines = file.readlines()
k, m, n = [int(i) for i in lines[0].split(' ')]
a = k + m + n
p_recessive = (1/4*m*(m-1) + 1/2*m*n + 1/2*m*n + n*(n-1))/(a*(a-1))
p_wanted = 1 - p_recessive
p_wanted = round(p_wanted, 5)
print(p_wanted)
