Rosalind "Mendel's First Law" IPRB - python

As preparation for an upcoming bioinformatics course, I am doing some assignments from rosalind.info. I am currently stuck on the assignment "Mendel's First Law".
I think I could brute-force my way through this, but somehow my thinking must be too convoluted. My approach would be this:
Build a tree of probabilities with three levels. There are two creatures that mate, creature A and creature B. The first level is: what is the probability that creature A is homozygous dominant (k), heterozygous (m) or homozygous recessive (n)? For homozygous dominant, for example, since there are (k+m+n) creatures in total and k of them are homozygous dominant, the probability is k/(k+m+n).
Then in this tree, under each of these would come the probability of creature B being k / m / n, given what creature A was picked as. For example, if creature A was picked as heterozygous (m), then the probability that creature B is also heterozygous is (m-1)/(k+m+n-1), because there is now one fewer heterozygous creature left.
This would give the two levels of probabilities, and would involve a lot of code just to get this far, as I would literally be building a tree structure with manually written code for each branch.
Now after choosing creatures A and B, each of them has two chromosomes, and one chromosome is picked at random from each. So for A, chromosome 1 or 2 can be picked, and the same for B, giving 4 different combinations, each with probability 1/4. So finally this tree would have these leaf probabilities.
Then from there, somehow by magic, I would add up all of these probabilities to get the probability that the two organisms produce a creature with a dominant allele.
I doubt that this assignment was designed to take hours to solve. Where am I overthinking this?
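For what it's worth, the tree described above can be flattened into nested loops rather than a literal tree structure. A minimal sketch (the dom_fraction values are the Punnett-square fractions I would still have to derive by hand, and the names are my own):

```python
# dom_fraction maps an ordered pair of parent genotype groups to the
# Punnett-square probability that their child carries a dominant allele
# (k = homozygous dominant, m = heterozygous, n = homozygous recessive).
dom_fraction = {('k', 'k'): 1.0, ('k', 'm'): 1.0, ('k', 'n'): 1.0,
                ('m', 'k'): 1.0, ('m', 'm'): 0.75, ('m', 'n'): 0.5,
                ('n', 'k'): 1.0, ('n', 'm'): 0.5, ('n', 'n'): 0.0}

def p_dominant(k, m, n):
    counts = {'k': k, 'm': m, 'n': n}
    total = k + m + n
    p = 0.0
    for a, ca in counts.items():          # level 1: pick creature A
        for b, cb in counts.items():      # level 2: pick creature B
            cb_left = cb - 1 if a == b else cb  # one fewer of A's group remains
            p += (ca / total) * (cb_left / (total - 1)) * dom_fraction[(a, b)]
    return p

print(round(p_dominant(2, 2, 2), 5))  # 0.78333
```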
Update:
Solved this in the most ridiculous brute-force way possible: I just ran thousands of simulated matings and measured the proportion of offspring that ended up with a dominant allele, until there was enough precision to pass the grader.
import random

k = 26
m = 18
n = 25
trials = 0
dominants = 0
while True:
    s = ['AA'] * k + ['Aa'] * m + ['aa'] * n
    first = random.choice(s)
    s.remove(first)
    second = random.choice(s)
    has_dominant_allele = 'A' in [random.choice(first), random.choice(second)]
    trials += 1
    if has_dominant_allele:
        dominants += 1
    print("%.5f" % (dominants / float(trials)))

Organisms with a dominant allele are either AA or Aa.
Your total population (k + m + n) consists of k (hom) homozygous dominant organisms with AA, m (het) heterozygous organisms with Aa, and n (rec) homozygous recessive organisms with aa. Each of these can mate with any other.
The probability for organisms with the dominant allele is:
P_dom = n_dominant/n_total or 1 - n_recessive/n_total
Doing the Punnett squares for each of these combinations is not a bad idea:
hom + het
    | A  | a
  --------------
  A | AA | Aa
  a | Aa | aa

het + rec
    | a  | a
  --------------
  A | Aa | Aa
  a | aa | aa
Apparently, mating of two organisms results in four possible children. hom + het yields 1 of 4 organisms with the recessive allele, het + rec yields 2 of 4 organisms with the recessive allele.
You might want to do that for the other combinations as well.
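If it helps, the squares can be generated rather than typed out; a small sketch (the punnett helper is a made-up name, not part of any library):

```python
from itertools import product

def punnett(parent1, parent2):
    # Cross every allele of parent1 with every allele of parent2,
    # normalising each child genotype so 'aA' is written 'Aa'.
    return [''.join(sorted(pair)) for pair in product(parent1, parent2)]

print(punnett('AA', 'Aa'))  # ['AA', 'Aa', 'AA', 'Aa']
print(punnett('Aa', 'aa'))  # ['Aa', 'Aa', 'aa', 'aa']
```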
Since we're not just mating the organisms one on one, but throwing together a whole bunch of k + m + n of them, the total number of offspring and the number of 'children' with a particular allele would be nice to know.
If you don't mind a bit of Python, comb from scipy.misc (scipy.special.comb in newer SciPy versions) might be helpful here. In the calculation, don't forget (a) that you get 4 children from each combination and (b) that you need a factor (from the Punnett squares) to determine the recessive (or dominant) offspring from the combinations.
Update
from scipy.special import comb  # scipy.misc.comb in older SciPy versions

# total population
pop_total = 4 * comb(hom + het + rec, 2)
# use PUNNETT squares!
# dominant organisms
dom_total = 4*comb(hom, 2) + 4*hom*het + 4*hom*rec + 3*comb(het, 2) + 2*het*rec
# probability for dominant organisms
phom = dom_total/pop_total
print(phom)
# probability for dominant organisms +
# probability for recessive organisms should be 1
# let's check that:
rec_total = 4*comb(rec, 2) + 2*rec*het + comb(het, 2)
prec = rec_total/pop_total
print(1 - prec)

This is more a probability/counting question than a coding one. It's easier to calculate the probability of an offspring having only recessive traits first. Let me know if you have any trouble understanding anything. I ran the following code and my output passed the Rosalind grader.
def mendel(x, y, z):
    # calculate the probability of recessive traits only
    total = x + y + z
    twoRecess = (z/total)*((z-1)/(total-1))
    twoHetero = (y/total)*((y-1)/(total-1))
    heteroRecess = (z/total)*(y/(total-1)) + (y/total)*(z/(total-1))
    recessProb = twoRecess + twoHetero*1/4 + heteroRecess*1/2
    return 1 - recessProb  # take the complement

# mendel(2, 2, 2)
with open("rosalind_iprb.txt", "r") as file:
    line = file.readline().split()
    x, y, z = [int(n) for n in line]
print(x, y, z)
print(mendel(x, y, z))

Klaus's solution has most of it correct; however, the error occurs when calculating the number of combinations that have at least one dominant allele. This part is incorrect, because while there are 4 possibilities when combining 2 alleles to form an offspring, only one possibility is actually executed. Therefore, Klaus's solution calculates a percentage that is markedly higher than it should be.
The correct way to calculate the number of combos of organisms with at least one dominant allele is the following:
# k = number of homozygous dominant organisms
# m = number of heterozygous organisms
# n = number of homozygous recessive organisms
dom_total = comb(k, 2) + k*m + k*n + .5*m*n + .75*comb(m, 2)
# Instead of:
# 4*comb(k,2) + 4*k*m + 4*k*n + 3*comb(m,2) + 2*m*n
The above code segment works for calculating the total number of dominant combos because it multiplies each part by the probability (with 1 being 100%) that it will produce a dominant offspring. You can think of each part as the number of Punnett squares for combos of each type (k&k, k&m, k&n, m&n, m&m).
So the entire correct code segment would look like this:
# Import comb (combination operation) from the scipy library
from scipy.special import comb

def calculateProbability(k, m, n):
    # Calculate total number of organisms in the population:
    totalPop = k + m + n
    # Calculate the number of combos that could be made (valid or not):
    totalCombos = comb(totalPop, 2)
    # Calculate the number of combos that have a dominant allele and are therefore valid:
    validCombos = comb(k, 2) + k*m + k*n + .5*m*n + .75*comb(m, 2)
    probability = validCombos/totalCombos
    return probability

# Example Call:
print(calculateProbability(2, 2, 2))
# Example Output: 0.783333333333

You don't need to run thousands of simulations in a while loop. You can run one exhaustive pass and calculate the probability from its results.
from itertools import product

k = 2  # AA homozygous dominant
m = 2  # Aa heterozygous
n = 2  # aa homozygous recessive

population = (['AA'] * k) + (['Aa'] * m) + (['aa'] * n)
all_children = []
for parent1 in population:
    # remove selected parent from population
    chosen = population[:]
    chosen.remove(parent1)
    for parent2 in chosen:
        # get all possible children from 2 parents (Punnett square)
        children = product(parent1, parent2)
        all_children.extend([''.join(c) for c in children])

dominants = filter(lambda c: 'A' in c, all_children)
print(len(list(dominants)) / len(all_children))
# 0.7833333

Here I am adding my answer to explain it more clearly:
We don't want the offspring to be completely recessive, so we should build the probability tree and look at the cases, and at the probabilities of those cases happening.
Then the probability that we want is 1 - p_recessive. More explanation is provided in the comment section of the following code.
"""
Let d: dominant, h: hetero, r: recessive
Let a = k+m+n
Let X = the r.v. associated with the first person randomly selected
Let Y = the r.v. associated with the second person randomly selected without replacement
Then:
k = f_d => p(X=d) = k/a => p(Y=d| X=d) = (k-1)/(a-1) ,
p(Y=h| X=d) = (m)/(a-1) ,
p(Y=r| X=d) = (n)/(a-1)
m = f_h => p(X=h) = m/a => p(Y=d| X=h) = (k)/(a-1) ,
p(Y=h| X=h) = (m-1)/(a-1)
p(Y=r| X=h) = (n)/(a-1)
n = f_r => p(X=r) = n/a => p(Y=d| X=r) = (k)/(a-1) ,
p(Y=h| X=r) = (m)/(a-1) ,
p(Y=r| X=r) = (n-1)/(a-1)
Now the joint would be:
| offspring possibilites given X and Y choice
-------------------------------------------------------------------------
X Y | P(X,Y) | d(dominant) h(hetero) r(recessive)
-------------------------------------------------------------------------
d d k/a*(k-1)/(a-1) | 1 0 0
d h k/a*(m)/(a-1) | 1/2 1/2 0
d r k/a*(n)/(a-1) | 0 1 0
|
h d m/a*(k)/(a-1) | 1/2 1/2 0
h h m/a*(m-1)/(a-1) | 1/4 1/2 1/4
h r m/a*(n)/(a-1) | 0 1/2 1/2
|
r d n/a*(k)/(a-1) | 0 0 0
r h n/a*(m)/(a-1) | 0 1/2 1/2
r r n/a*(n-1)/(a-1) | 0 0 1
Here what we don't want is the element in the very last column where the offspring is completely recessive.
so P = 1 - those situations as follow
"""
path = 'rosalind_iprb.txt'
with open(path, 'r') as file:
    lines = file.readlines()

k, m, n = [int(i) for i in lines[0].split(' ')]
a = k + m + n
p_recessive = (1/4*m*(m-1) + 1/2*m*n + 1/2*m*n + n*(n-1))/(a*(a-1))
p_wanted = 1 - p_recessive
p_wanted = round(p_wanted, 5)
print(p_wanted)

I just found the formula for the answer. You have 8 possible mating interactions that can yield a dominant offspring:
DDxDD, DDxDd, DdxDD, DdxDd, DDxdd, ddxDD, Ddxdd, ddxDd
With the respective probabilities of producing dominant offspring of:
1.0, 1.0, 1.0, 0.75, 1.0, 1.0, 0.5, 0.5
Initially it seemed odd to me that DDxdd and ddxDD were two separate mating events, but if you think about it they are slightly different conceptually. The probability of DDxdd is k/(k+m+n) * n/((k-1)+m+n) and the probability of ddxDD is n/(k+m+n) * k/(k+m+(n-1)). Mathematically these are identical, but from a probability standpoint they are two separate events. So your total probability is the sum of the probabilities of each of these mating events, each multiplied by the probability of that mating event producing a dominant offspring. I won't simplify it step by step here, but that gives you the code:
total_pop = k + m + n
total_probability = ((k ** 2 - k) + (2 * k * m) + (3 / 4 * (m ** 2 - m)) + (2 * k * n) + (m * n)) / (total_pop ** 2 - total_pop)
All you need to do is plug in your values of k, m, and n and you'll get the probability they ask for.
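A self-contained version of that formula might look like this (the function name is just for illustration):

```python
def dominant_probability(k, m, n):
    # Numerator: ordered pairings weighted by their chance of a dominant child;
    # denominator: all ordered ways to draw two distinct parents.
    total_pop = k + m + n
    return ((k**2 - k) + (2*k*m) + (3/4 * (m**2 - m)) + (2*k*n) + (m*n)) \
           / (total_pop**2 - total_pop)

print(round(dominant_probability(2, 2, 2), 5))  # 0.78333
```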

I doubt that this assignment was designed to take hours to solve. Where am I overthinking this?
I also had the same question. After reading the whole thread, I came up with the code.
I hope the code itself will explain the probability calculation:
def get_prob_of_dominant(k, m, n):
    # A - dominant factor
    # a - recessive factor
    # k - amount of organisms with AA factors (homozygous dominant)
    # m - amount of organisms with Aa factors (heterozygous)
    # n - amount of organisms with aa factors (homozygous recessive)
    events = ['AA+Aa', 'AA+aa', 'Aa+aa', 'AA+AA', 'Aa+Aa', 'aa+aa']
    # get the probability of dominant traits (set up Punnett square)
    punnett_probabilities = {
        'AA+Aa': 1,
        'AA+aa': 1,
        'Aa+aa': 1 / 2,
        'AA+AA': 1,
        'Aa+Aa': 3 / 4,
        'aa+aa': 0,
    }
    event_probabilities = {}
    totals = k + m + n
    # Event: AA+Aa -> P(X=k, Y=m) + P(X=m, Y=k):
    P_km = k / totals * m / (totals - 1)
    P_mk = m / totals * k / (totals - 1)
    event_probabilities['AA+Aa'] = P_km + P_mk
    # Event: AA+aa -> P(X=k, Y=n) + P(X=n, Y=k):
    P_kn = k / totals * n / (totals - 1)
    P_nk = n / totals * k / (totals - 1)
    event_probabilities['AA+aa'] = P_kn + P_nk
    # Event: Aa+aa -> P(X=m, Y=n) + P(X=n, Y=m):
    P_mn = m / totals * n / (totals - 1)
    P_nm = n / totals * m / (totals - 1)
    event_probabilities['Aa+aa'] = P_mn + P_nm
    # Event: AA+AA -> P(X=k, Y=k):
    P_kk = k / totals * (k - 1) / (totals - 1)
    event_probabilities['AA+AA'] = P_kk
    # Event: Aa+Aa -> P(X=m, Y=m):
    P_mm = m / totals * (m - 1) / (totals - 1)
    event_probabilities['Aa+Aa'] = P_mm
    # Event: aa+aa -> will be multiplied by a Punnett probability of 0 anyway
    event_probabilities['aa+aa'] = 0
    # Total probability is the sum of (prob of dominant factor * prob of the event)
    total_probability = 0
    for event in events:
        total_probability += punnett_probabilities[event] * event_probabilities[event]
    return round(total_probability, 5)
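As a sanity check, the same number can be brute-forced by enumerating every unordered parent pair in a concrete population and averaging each pair's Punnett fraction (a sketch; check is an illustrative name):

```python
from itertools import combinations

# Punnett fraction of dominant offspring per (sorted) parent-genotype pair.
punnett = {('AA', 'AA'): 1.0, ('AA', 'Aa'): 1.0, ('AA', 'aa'): 1.0,
           ('Aa', 'Aa'): 0.75, ('Aa', 'aa'): 0.5, ('aa', 'aa'): 0.0}

def check(k, m, n):
    # Enumerate all unordered pairs drawn from a concrete population list.
    pop = ['AA'] * k + ['Aa'] * m + ['aa'] * n
    pairs = list(combinations(pop, 2))
    return sum(punnett[tuple(sorted(pair))] for pair in pairs) / len(pairs)

print(round(check(2, 2, 2), 5))  # 0.78333
```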

Related

any tip to improve performance when using nested loops with python

so, I had this exercise where I receive a list of integers and have to find how many pairs sum to a multiple of 60
example:
input: list01 = [10,90,50,40,30]
result = 2
explanation: 10 + 50, 90 + 30
example2:
input: list02 = [60,60,60]
result = 3
explanation: list02[0] + list02[1], list02[0] + list02[2], list02[1] + list02[2]
seems pretty easy, so here is my code:
def getPairCount(numbers):
    total = 0
    cont = 0
    for n in numbers:
        cont += 1
        for n2 in numbers[cont:]:
            if (n + n2) % 60 == 0:
                total += 1
    return total
it's working; however, for a big input with over 100k numbers it takes too long to run, and I need it to finish in under 8 seconds. Any tips on how to solve this, either with a library I'm unaware of or by avoiding the nested loop?
Here's a simple solution that should be extremely fast (it runs in O(n) time). It makes use of the following observation: We only care about each value mod 60. E.g. 23 and 143 are effectively the same.
So rather than making an O(n**2) nested pass over the list, we instead count how many of each value we have, mod 60, so each value we count is in the range 0 - 59.
Once we have the counts, we can consider the pairs that sum to 0 or 60. The pairs that work are:
0 + 0
1 + 59
2 + 58
...
29 + 31
30 + 30
After this, the order is reversed, but we only want to count each pair once.
There are two cases where the two values are the same: 0 + 0 and 30 + 30. For each of these, the number of pairs is (count * (count - 1)) // 2. Note that this works when count is 0 or 1, since in both cases we're multiplying by zero.
If the two values are different, then the number of cases is simply the product of their counts.
Here's the code:
def getPairCount(numbers):
    # Count how many of each value we have, mod 60
    count_list = [0] * 60
    for n in numbers:
        n2 = n % 60
        count_list[n2] += 1
    # Now find the total
    total = 0
    c0 = count_list[0]
    c30 = count_list[30]
    total += (c0 * (c0 - 1)) // 2
    total += (c30 * (c30 - 1)) // 2
    for i in range(1, 30):
        j = 60 - i
        total += count_list[i] * count_list[j]
    return total
This runs in O(n) time, due to the initial one-time pass we make over the list of input values. The loop at the end is just iterating from 1 through 29 and isn't nested, so it should run almost instantly.
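The counting logic can be cross-checked against the question's own examples with a direct O(n^2) brute force (brute is just an illustrative name):

```python
from itertools import combinations

def brute(nums):
    # Directly count unordered pairs whose sum is a multiple of 60.
    return sum((a + b) % 60 == 0 for a, b in combinations(nums, 2))

print(brute([10, 90, 50, 40, 30]))  # 2 (10+50 and 90+30)
print(brute([60, 60, 60]))          # 3
```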
Below is a translation of Tom Karzes's answer but using numpy. I benchmarked it and it is only faster if the input is already a numpy array, not a list. I still want to write it here because it nicely shows how loops in python can be one-liners in numpy.
import numpy as np

def get_pairs_count(numbers, /):
    # Count how many of each value we have, modulo 60.
    # np.bincount with minlength=60 guarantees a slot for every residue
    # 0-59, even residues that never occur in the input (np.unique would
    # only return counts for values that are present).
    counts = np.bincount(np.mod(numbers, 60), minlength=60)
    # Now find the total.
    total = 0
    c0 = counts[0]
    c30 = counts[30]
    total += (c0 * (c0 - 1)) // 2
    total += (c30 * (c30 - 1)) // 2
    total += np.dot(counts[1:30:+1], counts[59:30:-1])  # Notice the slicing indices used.
    return total

How to slice array quickly based on conditions?

I have a giant nested for loop (10 levels in all; for illustration I include 6 here). I am doing a summation over multiple indices, and the indices are not independent: the index of each inner loop depends on the index of the outer loop (except in one instance). The innermost loop contains an operation where I slice an array named 'w' based on 8 different conditions combined with '&' and '|'. There is also an 'HB' function that takes this sliced array ('wrange') as an argument, performs some operations on it, and returns an array of the same size.
The slicing and the 'HB' call take 300-400 microseconds and 100 microseconds respectively to execute. I need to bring that down drastically, to nanoseconds.
I tried using a dictionary instead of the array I am slicing; it is much slower. I also tried precomputing and storing the sliced array for all possible values, but that is a huge computation in its own right, since there are many possible combinations of the conditions (which depend indirectly on the loop indices).
s goes from 1 to 49
t goes from -s to s
and there are 641 combinations of l,n
Here, i have posted one value of s,t and an l,n combination for illustration.
s = 7
t = -7
l = 72
n = 12
Nl = Dictnorm[n,l]
Gamma_l = Dictfwhm[n,l]
Dictc1 = {}
Dictc2 = {}
Dictwrange = {}
DictH = {}
DictG = {}
product = []
startm = max(-l-t, -l)
endm = min(l-t, l) + 1
sum5 = 0
for sp in range(s-2, s+3):  # s'
    sum4 = 0
    for tp in range(-sp, -sp+1):  # t'
        sum3 = 0
        integral = 1
        for lp in range(l-2, l+3):  # l'
            sum2 = 0
            if (n, lp) in Dictknl2.keys():
                N1 = Dictnorm[n,lp]
                Gamma_1 = Dictfwhm[n,lp]
                for lpp in range(l-2, l+3):  # l"
                    sum1 = 0
                    if ((sp+lpp-lp) % 2 == 1 and sp >= abs(lpp-lp) and
                            lp >= abs(sp-lpp) and lpp >= abs(sp-lp) and
                            (n, lpp) in Dictknl2.keys()):
                        F = f(lpp, lp, sp)
                        N2 = Dictnorm[n,lpp]
                        Gamma_2 = Dictfwhm[n,lpp]
                        for m in range(startm, endm):  # m
                            sum0 = 0
                            L1 = LKD(n, l, m, l, m)
                            L2 = LKD(n, l, m+t, l, m+t)
                            for mp in range(max(m+t-tp-5, m-5),
                                            min(m+5, m+t-tp+5)+1):  # m'
                                if (abs(mp) <= lp and abs(mp) <= lpp and
                                        abs(mp+tp) <= lp and abs(mp+tp) <= lpp
                                        and LKD(n, l, m, lp, mp) != 0
                                        and LKD(n, l, m+t, lpp, mp+tp) != 0):
                                    c3 = Dictomega[n,lp,mp+tp]
                                    c4 = Dictomega[n,lpp,mp]
                                    wrange = np.unique(np.concatenate(
                                        (Dictwrange[m],
                                         w[((w >= (c3-Gamma_1)) &
                                            ((c3+Gamma_1) >= w)) |
                                           ((w >= (c4-Gamma_2)) &
                                            ((c4+Gamma_2) >= w))])))
                                    factor = (sum(
                                        HB(Dictc1[n,l,m+t],
                                           Dictc2[n,l,m], Nl,
                                           Nl, Gamma_l,
                                           Gamma_l, wrange,
                                           Sigma).conjugate()
                                        * HB(c3, c4, N1, N2, Gamma_1,
                                             Gamma_2, wrange, 0) * L1 * L2)
                                        * LKD(n, l, m, lp, mp)
                                        * LKD(n, l, m+t, lpp, mp+tp) * DictG[m]
                                        * gamma(lpp, sp, lp, tp, mp)
                                        * F)
                                    sum0 = sum0 + factor  # sum over m'
                            sum1 = sum1 + sum0  # sum over m
                    sum2 = sum2 + sum1  # sum over l"
            sum3 = sum3 + sum2  # sum over l'
        sum4 = sum4 + sum3*integral  # sum over t'
    sum5 = sum5 + sum4  # sum over s'
z = (1/(sum(product)))*sum5
print(z.real, z.imag, l, n)
TL;DR
def HB(a, ..., f, array1):  ######### timesucker
    perform_some_operations_on_array1_using_a_b_c_d
    return operated_on_array1

for i in ():
    for j in ():
        ...
        for o in ():
            array1 = w[(w > some_function1(i, j, ..., k)) &
                       (w < some_function2(i, j, ..., k)) | .....]  ######### timesucker
            factor = (HB(a, ..., f, array1) * HB(g, ..., k, array1) *
                      alpha*beta*gamma....)
It takes about 30 seconds to run this whole section once. I need to bring that down as far as possible; 1 second is the most I can afford.

Unexpected complex numbers in Python

I'm trying to calculate a total score for a decathlon participant. There are two formulas given, one for track events and one for field events:
Points = INT(A(B - P)^C) for track events (a faster time produces a better score)
Points = INT(A(P - B)^C) for field events (a greater distance or height produces a better score)
A, B and C are given constants for these formulas, and P is the athlete's performance measured in seconds (running), metres (throwing), or centimetres (jumping).
When I try to calculate the score I get a result that is a complex number, which I cannot convert into an integer.
These are the constants for A,B and C : https://en.wikipedia.org/wiki/Decathlon#Points_system
These are my values for athlete's performance (a list that I will try somehow, after adding the total score, convert into a JSON file):
splited_info = ['Lehi Poghos', '13.04', '4.53', '7.79', '1.55', '64.72', '18.74', '24.20', '2.40', '28.20', '6.50.76']
Could someone give me some feedback on what I am doing wrong?
def split(info):
    with open(info.filename, "r") as f:
        csv_reader = csv.reader(f, delimiter="\n")
        for line in csv_reader:
            splited_info = line[0].split(";")
            score = 0
            score += int(25.4347 * ((18 - float(splited_info[1])) ** 1.81))
            score += int(0.14354 * ((float(splited_info[2]) - 220) ** 1.4))
            score += int(51.39 * ((float(splited_info[3]) - 1.5) ** 1.05))
            score += int(0.8465 * ((float(splited_info[4]) - 75) ** 1.42))
            score += int(1.53775 * ((82 - float(splited_info[5])) ** 1.81))
            score += int(5.74352 * ((28.5 - float(splited_info[6])) ** 1.92))
            score += int(12.91 * ((float(splited_info[7]) - 4) ** 1.1))
            score += int(0.2797 * ((float(splited_info[8]) - 100) ** 1.35))
            score += int(10.14 * ((float(splited_info[9]) - 7) ** 1.08))
            score += int(0.03768 * ((480 - float(splited_info[10])) ** 1.85))
            print(score)
I'm just hardcoding all the calculations since all the calculations are going to be different with all different values of A,B,C and P.
The problem is a mix-up of metres and centimetres. The Wikipedia page is slightly inaccurate in its account of the formulae: throws are measured in metres, but jumps should be measured in centimetres. This is why you're getting fractional powers of negative numbers.
See the original source for more info:
IAAF Scoring Tables for Combined Events, p. 24.
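A minimal reproduction of that failure mode, using the long jump constants (A = 0.14354, B = 220, C = 1.4, with P expected in centimetres): feeding the mark in metres makes (P - B) negative, and in Python 3 a negative base raised to a fractional power yields a complex number.

```python
A, B, C = 0.14354, 220, 1.4  # long jump scoring constants (P in centimetres)

score_metres = A * ((4.53 - B) ** C)   # mark wrongly left in metres
score_cm = A * ((453.0 - B) ** C)      # mark converted to centimetres

print(isinstance(score_metres, complex))  # True: (-215.47) ** 1.4 is complex
print(int(score_cm) > 0)                  # True: a sensible integer score
```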

Rosalind: Mendel's first law

I'm trying to solve the problem at http://rosalind.info/problems/iprb/
Given: Three positive integers k, m, and n, representing a population
containing k+m+n organisms: k individuals are homozygous dominant for
a factor, m are heterozygous, and n are homozygous recessive.
Return: The probability that two randomly selected mating organisms
will produce an individual possessing a dominant allele (and thus
displaying the dominant phenotype). Assume that any two organisms can
mate.
My solution works for the sample, but not for any problems generated. After further research it seems that I should find the probability of choosing any one organism at random, find the probability of choosing the second organism, and then the probability of that pairing producing offspring with a dominant allele.
My question is then: what does my code below find the probability of? Does it find the percentage of offspring with a dominant allele for all possible matings -- so rather than the probability of one random mating, my code is solving for the percentage of offspring with dominant alleles if all pairs were tested?
f = open('rosalind_iprb.txt', 'r')
r = f.read()
s = r.split()
############# k = # homozygotes dominant, m = # heterozygotes, n = # homozygotes recessive
k = float(s[0])
m = float(s[1])
n = float(s[2])
############# Counts for pairing between each group and within groups
k_k = 0
k_m = 0
k_n = 0
m_m = 0
m_n = 0
n_n = 0
##############
if k > 1:
    k_k = 1.0 + (k-2) * 2.0
    k_m = k * m
    k_n = k * n
if m > 1:
    m_m = 1.0 + (m-2) * 2.0
    m_n = m * n
if n > 1:
    n_n = 1.0 + (n-2) * 2.0
#################
dom = k_k + k_m + k_n + 0.75*m_m + 0.5*m_n
total = k_k + k_m + k_n + m_m + m_n + n_n
chance = dom/total
print(chance)
Looking at your code, I'm having a hard time figuring out what it's supposed to do. I'll work through the problem here.
Let's simplify the wording. There are n1 type 1, n2 type 2, and n3 type 3 items.
How many ways are there to choose a set of size 2 out of all the items? (n1 + n2 + n3) choose 2.
Every pair of items will have item types corresponding to one of the six following unordered multisets: {1,1}, {2,2}, {3,3}, {1,2}, {1,3}, {2,3}
How many multisets of the form {i,i} are there? ni choose 2.
How many multisets of the form {i,j} are there, where i != j? ni * nj.
The probabilities of the six multisets are thus the following:
P({1,1}) = [n1 choose 2] / [(n1 + n2 + n3) choose 2]
P({2,2}) = [n2 choose 2] / [(n1 + n2 + n3) choose 2]
P({3,3}) = [n3 choose 2] / [(n1 + n2 + n3) choose 2]
P({1,2}) = [n1 * n2] / [(n1 + n2 + n3) choose 2]
P({1,3}) = [n1 * n3] / [(n1 + n2 + n3) choose 2]
P({2,3}) = [n2 * n3] / [(n1 + n2 + n3) choose 2]
These sum to 1. Note that [X choose 2] is just [X * (X - 1) / 2] for X > 1 and 0 for X = 0 or 1.
Return: The probability that two randomly selected mating organisms will produce an individual possessing a dominant allele (and thus displaying the dominant phenotype).
To answer this question, you simply need to identify which of the six multisets correspond to this event. Lacking the genetics knowledge to answer that question, I'll leave that to you.
For example, suppose that a dominant allele results if either of the two parents was type 1. Then the events of interest are {1,1}, {1,2}, {1,3} and the probability of the event is P({1,1}) + P({1,2}) + P({1,3}).
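The six probabilities above can be sketched for a concrete population, say n1 = n2 = n3 = 2, to confirm they sum to 1 (a small example using math.comb):

```python
from math import comb

n1, n2, n3 = 2, 2, 2
total = comb(n1 + n2 + n3, 2)  # all size-2 sets out of 6 items: 15

probs = {
    '{1,1}': comb(n1, 2) / total,
    '{2,2}': comb(n2, 2) / total,
    '{3,3}': comb(n3, 2) / total,
    '{1,2}': n1 * n2 / total,
    '{1,3}': n1 * n3 / total,
    '{2,3}': n2 * n3 / total,
}
print(sum(probs.values()))  # 1.0
```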
I spent some time on this question, so, to clarify in Python:
lst = ['2', '2', '2']
k, m, n = map(float, lst)
t = sum(map(float, lst))
# organize a list with: allele one * allele two (possible pairings) * dominant probability
# multiplications by one were omitted
# remember to subtract one from a group's count when the second haplotype
# chosen comes from the same group as the first
couples = [
    k*(k-1),       # AA x AA
    k*m,           # AA x Aa
    k*n,           # AA x aa
    m*k,           # Aa x AA
    m*(m-1)*0.75,  # Aa x Aa
    m*n*0.5,       # Aa x aa
    n*k,           # aa x AA
    n*m*0.5,       # aa x Aa
    n*(n-1)*0      # aa x aa
]
# (t-1) indicates that the first haplotype was already selected
print(round(sum(couples)/t/(t-1), 5))
If you are interested, I just found a solution and put it in C#.
public double mendel(double k, double m, double n)
{
    double prob;
    prob = ((k*k - k) + 2*(k*m) + 2*(k*n) + (.75*(m*m - m)) + 2*(.5*m*n)) / ((k + m + n)*(k + m + n - 1));
    return prob;
}
Our parameters are k (dominant), m (hetero), & n (recessive).
First I found the probability for each possible breeding pair selection in terms of percentage of the population. So, a first round choice for k would look like k/(k+m+n), and a second round choice of k after a first round choice of k would look like (k-1)/(k+m+n). Then multiply these two to get the outcome. Since there were three identified populations, there were nine possible outcomes.
Then I multiplied each outcome by its probability of producing dominant offspring: 100% for anything with k, 75% for m&m, 50% for m&n and n&m, and 0% for n&n. Now add the outcomes together, and you have your solution.
http://rosalind.info/problems/iprb/

Trying to create a heatmap for a planet wars bot which shows which army has the most influence

I'm trying to create a heatmap for a planetwars bot which indicates whose influence each planet is under.
The initial map looks like: http://imgur.com/a/rPVnl#0
Ideally the Red planet should have a value of -1, the Blue planet should have a value of 1, and the planet marked 1 should have a value of 0. (Or 0 to 1, mean of 0.5 would work)
My initial analysis code is below, but the results it outputs are between 0.13 and 7.23.
for p in gameinfo.planets:  # gameinfo.planets returns {pid: planet_object}
    planet = gameinfo.planets[p]
    own_value = 1
    for q in gameinfo.my_planets.values():
        if q != planet:
            q_value = q.num_ships / planet.distance_to(q)
            own_value = own_value + q_value
    enemy_value = 1
    for q in gameinfo.enemy_planets.values():
        if q != planet:
            q_value = q.num_ships / planet.distance_to(q)
            enemy_value = enemy_value + q_value
    self.heatmap[p] = own_value/enemy_value
I've also tried to add some code to curb the range from 0 to 1
highest = self.heatmap.keys()[0]
lowest = self.heatmap.keys()[0]
for p in gameinfo.planets.keys():
    if self.heatmap[p] > highest:
        highest = self.heatmap[p]
    elif self.heatmap[p] < lowest:
        lowest = self.heatmap[p]
map_range = highest - lowest
for p in gameinfo.planets.keys():
    self.heatmap[p] = self.heatmap[p]/map_range
self.heatmap_mean = sum(self.heatmap.values(), 0.0) / len(self.heatmap)
The values ended up between 0 and 1, but the mean was 0.245? (Also the values actually ranged from 0.019 to 1.019).
I've solved my problem, this is what the solution looks like.
#HEATMAP ANALYSIS
for p in gameinfo.planets:
    ave_self_value = 0
    for q in gameinfo.my_planets:
        if q != p:
            ave_self_value = ave_self_value + (self.planet_distances[p][q] * gameinfo.planets[q].num_ships / self.own_strength)
    ave_enemy_value = 0
    for q in gameinfo.enemy_planets:
        if q != p:
            ave_enemy_value = ave_enemy_value + (self.planet_distances[p][q] * gameinfo.planets[q].num_ships / self.enemy_strength)
    self.heatmap[p] = ave_enemy_value - ave_self_value

hmin, hmax = min(self.heatmap.values()), max(self.heatmap.values())
for h in self.heatmap.keys():
    self.heatmap[h] = 2 * (self.heatmap[h] - hmin) / (hmax - hmin) - 1
self.heatmap_mean = sum(self.heatmap.values(), 0.0) / len(self.heatmap)
#END HEATMAP ANALYSIS
for p in foo:
    ...
    for q in bar:
        ...
        if q != p:
            q_value = some_value / another_value
            own_value = own_value + q_value
Apologies for the gross simplification. Say foo is [1, 2, 3, 4, 5] and bar is [1, 5].
First time through, p is 1. q takes 1, so q==p. Next, q takes 5, now q!=p, own_value accumulates the q_value which I presume is some positive number less than one.
But the second time through, p is 2. q takes 1, so q != p, and own_value goes up by some fraction of one. Then q takes 5, so q != p still, and own_value goes up by that same fraction again. This is where the problem lies: (some_value / another_value) + (some_value / another_value) breaks the -1 to 1 scale. You can get values like 7.23 because own_value accumulates once for every q that did not equal p.
There is nothing in the
for x in foo:
    for y in bar:
construction that normalises across the x in foo loop; each x just accumulates over y in bar.
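A toy run of those simplified loops, with a constant 0.5 standing in for some_value / another_value, makes the accumulation visible:

```python
foo = [1, 2, 3, 4, 5]
bar = [1, 5]

accumulated = {}
for p in foo:
    own_value = 0.0
    for q in bar:
        if q != p:
            own_value += 0.5  # stand-in for some_value / another_value
    accumulated[p] = own_value

print(accumulated)  # {1: 0.5, 2: 1.0, 3: 1.0, 4: 1.0, 5: 0.5}
```

Only p = 1 and p = 5 stay at a single increment; every other p accumulates once per element of bar, which is exactly why the original heatmap values drifted outside the intended scale.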
