I know how to do a standard binomial distribution in Python where the probability of each trial is the same. My question is what to do if the trial probability changes each time. I'm drafting an algorithm based on the paper below, but thought I should check on here to see whether there's already a standard way to do it.
http://www.tandfonline.com/doi/abs/10.1080/00949658208810534#.UeVnWT6gk6w
Thanks in advance,
James
Is this kind of what you are looking for?
import numpy as np

def random_MN_draw(n, probs):
    """Do one multinomial experiment of n trials with category probabilities 'probs'."""
    # with probs = [0.5, 0.5] and n = 1, this is a single fair coin flip
    return np.random.multinomial(n, probs)

def simulate(sim_probabilities):
    len_sim = len(sim_probabilities)
    simulated_flips = np.zeros((2, len_sim), dtype=int)
    for i in range(len_sim):
        # one flip per trial, each with its own [P(H), P(T)] pair
        simulated_flips[:, i] = random_MN_draw(1, sim_probabilities[i])
    # At the end of the simulation you can count the number of heads in
    # 'simulated_flips' to get your MLEs of P(H) and P(T).
    return simulated_flips
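For example (the probability pairs here are just an illustration):

probs = [[0.1, 0.9], [0.5, 0.5], [0.9, 0.1]]  # [P(H), P(T)] for each flip
flips = simulate(probs)
print(flips)           # row 0 is a heads indicator per flip, row 1 is tails
print(flips[0].sum())  # total number of heads observed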
Suppose you want to do 9 coin tosses, and P(H) on each flip is 0.1 .. 0.9, respectively: a 10% chance of a head on the first flip, 90% on the last.
For E(H), the expected number of heads, you can just sum the 9 individual expectations.
For a distribution, you could enumerate the ordered possible outcomes (itertools.product(["H", "T"], repeat=9))
(HHH HHH HHH)
(HHH HHH HHT)
...
(TTT TTT TTT)
and calculate a probability for each ordered outcome in a straightforward manner.
For each ordered outcome, increment a defaultdict(float), indexed by the number of heads, by the calculated p.
When done, compute the sum of the dictionary values, then divide every value in the dictionary by that sum.
You'll have 10 values that correspond to the chances of observing 0 .. 9 heads.
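A minimal sketch of that enumeration, using the 0.1 .. 0.9 example above (the final normalization is only a safeguard; the probabilities should already sum to 1):

import itertools
from collections import defaultdict

# example from above: P(H) on flip k is 0.1, 0.2, ..., 0.9
p_heads = [k / 10 for k in range(1, 10)]

dist = defaultdict(float)
for outcome in itertools.product("HT", repeat=len(p_heads)):
    p = 1.0
    for flip, p_h in zip(outcome, p_heads):
        p *= p_h if flip == "H" else (1.0 - p_h)
    dist[outcome.count("H")] += p

total = sum(dist.values())
probs = [dist[k] / total for k in range(len(p_heads) + 1)]
print(probs)  # chances of observing 0 .. 9 heads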
Gerry
Well, the question is old and I can't answer it since I don't know Python's math libraries well enough.
However, it might be helpful to other readers to know that this distribution often runs under the name
Poisson Binomial Distribution
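For completeness, the PMF can be computed directly from the definition with a short dynamic-programming recurrence, without any special math library (just a sketch; the probability list is only an example):

def poisson_binomial_pmf(p):
    # pmf[k] = probability of exactly k successes among the trials processed so far
    pmf = [1.0]
    for p_i in p:
        new = [0.0] * (len(pmf) + 1)
        for k, q in enumerate(pmf):
            new[k] += q * (1 - p_i)  # this trial fails: count stays at k
            new[k + 1] += q * p_i    # this trial succeeds: count moves to k + 1
        pmf = new
    return pmf

print(poisson_binomial_pmf([0.1, 0.5, 0.9]))  # P(0), P(1), P(2), P(3) successes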
I start off with a population. I also have properties each individual in the population can have. If an individual DOES have the property, its score goes up by 5. If it DOESN'T have it, its score increases by 0.
Example code using length as a property:
score = 0
for x in individual:
    if len(x) < 5:
        score += 5
    else:
        score += 0  # individuals without the property add nothing
Then I add up the total score and select the individuals I want to continue. Is this a fitness function?
Anything can be a fitness function as long as it awards better scores to better DNA. The code you wrote looks like a gene of a DNA rather than a constraint. If it were a constraint, you would give it a growing score penalty (is this a minimization of score?) depending on the distance to the constraint point, so that the selection/crossover stage could prefer failing values that are closer to 5 over those far from it. But currently it looks like "anything > 5 works fine", so there will be a lot of random solutions with high diversity rather than values like 4.9, 4.99, etc., even if you apply elitism.
If there are many variables like "len" with equal score, then one gene's failure could be shadowed by another gene's success. To stop this, you can give them different scores such as 5, 10, 20, 40, ..., so that selection and crossover can tell whether actual progress was made without any failure.
If you meant that 5 as a constraint, then you should tell the selection that "failed" values closer to 5 (i.e. 4, 4.5, 4.9, 4.99) are better than distant ones, by applying a variable score like this:
if gene < constraint_value:
    score += (constraint_value - gene) ** 2
# if you meant to add zero in the other case, there is no need to add it explicitly
In the comments, you said molecular computations. Molecules have floating-point coordinates and masses, so if you are optimizing them, a constraint with a variable penalty will make it easier for the selection to form better groups of DNAs for future generations, provided the mutation adds onto the current value of a gene rather than setting it to a totally random value.
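As a concrete, purely hypothetical example combining both ideas (made-up gene values and weights, the 5 treated as a target to reach, and a higher score treated as better):

def fitness(genes, weights, constraint_value=5):
    """Hypothetical fitness: reward a gene that reaches the constraint with its own
    weight, and penalize misses by their squared distance to the constraint."""
    score = 0.0
    for gene, weight in zip(genes, weights):
        if gene >= constraint_value:
            score += weight                          # distinct weights: 5, 10, 20, 40, ...
        else:
            score -= (constraint_value - gene) ** 2  # closer misses lose less
    return score

print(fitness([4.9, 6.0, 2.0], weights=[5, 10, 20]))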
I have seen several posts on this subject already; however they all seem unnecessarily complicated, or wrong --- the following proposal does not suffer from the former problem (it is simple), but possibly from the latter (it may be wrong).
My goal is to generate s whole numbers, i.e., positive integers, uniformly at random, such that their sum is n. To me, the following solution, which generates n random numbers between 1 and s and then outputs the frequencies, gets what we want:
import random
from collections import defaultdict

s = 10   # how many numbers to produce
n = 100  # their required sum

samples = list()
for i in range(n):
    samples.append(random.randint(1, s))

hist = defaultdict(int)
for sample in samples:
    hist[sample] += 1

freq = list()
for j in range(s):
    freq.append(hist[j + 1])

print('list:', freq)
print('sum:', sum(freq))
So, for example, if we wanted s=10 random whole numbers which sum up to n=100, this procedure might give us
list: [11, 7, 9, 12, 16, 13, 9, 10, 8, 5]
sum: 100
Since I am no statistician by any means, I fear that this generates numbers which are not truly uniformly distributed. Any comments/analysis would be greatly appreciated.
Well, what you present here is the multinomial distribution, I believe. Directly from Wikipedia: "it models the probability of counts for rolling an s-sided die n times", with probability vector p = (1/s, ..., 1/s).
however they all seem unnecessarily complicated, or wrong
Not sure what you had in mind, but in the Python world, sampling from a multinomial means you use NumPy, and then it is a one-liner:
import numpy as np
result = np.random.multinomial(n, [1.0/s for _ in range(s)])
And it is likely to be faster, well tested, and correct for all possible combinations of parameters.
If you find that your approach suits you better, so be it, but inventing a new way to sample a well-known distribution is quite a job in itself. Please note that there are a lot of distributions where the sum of the outcomes is equal to a fixed number, e.g. the Dirichlet-multinomial. And they have a lot of parameters which you could vary wildly, achieving statistically different results.
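For example, drawing from a Dirichlet-multinomial is also only a couple of NumPy lines (the concentration vector alpha below is an arbitrary choice):

import numpy as np

s, n = 10, 100
alpha = np.ones(s)                    # concentration parameters, chosen arbitrarily
p = np.random.dirichlet(alpha)        # a random probability vector
result = np.random.multinomial(n, p)  # the counts still sum to n
print(result, result.sum())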
Imagine s = 10 and n = 1_000_000. Then all the numbers would tend to cluster around 100_000. I'm pretty sure that what you actually have is a Poisson distribution with lambda = n/s.
If you want something more like a uniform distribution, you can try something like this:
- Generate s random numbers between 0 and 1 and let sum denote their sum.
- Multiply each number by (n / sum), and let us name these decimal numbers d_1, …, d_s.
- Round down to the nearest integer and call the numbers i_1, …, i_s.
Now, the sum of these integers is some n_i which may be less than n because of the rounding. Let rest = n - n_i. Sort i_1, …, i_s by the fractional parts d_1 % 1, …, d_s % 1, with the smallest fractional parts at the lowest indices. Then:
for j in range(rest):
    i_(s - j) += 1
In other words, add 1 to the rest entries with the largest fractional parts. This will give you s identically distributed random numbers scaled such that i_1 + … + i_s = n.
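A rough sketch of this procedure (the function and variable names are mine):

import math
import random

def spread_sum(s, n):
    d = [random.random() for _ in range(s)]
    scale = n / sum(d)
    d = [x * scale for x in d]      # scaled so the d's sum to n (up to rounding)
    i = [math.floor(x) for x in d]  # round each one down
    rest = n - sum(i)               # what the rounding lost
    # hand the remaining units to the entries with the largest fractional parts
    for k in sorted(range(s), key=lambda k: d[k] % 1, reverse=True)[:rest]:
        i[k] += 1
    return i

vals = spread_sum(10, 100)
print(vals, sum(vals))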
I hope this helps.
Let's say I have this code
import random
def fooFunc():
    return 1
What is the overall chance of fooFunc being executed when using the code below?
if random.randrange(4096) == 1:
    fooFunc()

if random.randrange(256) == 1:
    fooFunc()
I'd suggest this isn't really a Python problem; questions about the probabilities themselves are better suited to https://math.stackexchange.com/.
As random.randrange(x) produces a number between 0 and x (including 0, but NOT including x), you have a 1/x probability of any specific number being produced.
Please see Neil Slater's answer for calculating the specific probability in your situation.
(Please see here if you want to look at the internals of random.randrange(): How does a randrange() function work?)
Each call to random.randrange can be treated as an independent random selection, provided you don't know the seed and are happy to treat the output of a PRNG as a random variable.
What's the overall chance of fooFunc being executed?
Assuming you don't care about tracking whether fooFunc is called twice?
This is just the normal probability calculation, similar to "what is the chance of rolling at least one 6 when I roll two dice?". It is easier to re-formulate the question as "what is the probability that I don't roll any 6?" and subtract that from 1.0, because there is only one combination of failing both checks, whilst there are three combinations of succeeding on one, the other, or both.
So p = 1 - ((4095/4096) * (255/256))
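You can check this quickly in Python itself, both by direct calculation and by a small simulation:

import random

p = 1 - (4095 / 4096) * (255 / 256)
print(p)  # roughly 0.00415

# quick Monte Carlo sanity check
trials = 1_000_000
hits = sum(1 for _ in range(trials)
           if random.randrange(4096) == 1 or random.randrange(256) == 1)
print(hits / trials)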
I was given the following assignment by my Algorithms professor:
Write a Python program that implements Euclid’s extended algorithm. Then perform the following experiment: run it on a random selection of inputs of a given size, for sizes bounded by some parameter N; compute the average number of steps of the algorithm for each input size n ≤ N, and use gnuplot to plot the result. What does f(n) which is the “average number of steps” of Euclid’s extended algorithm on input size n look like? Note that size is not the same as value; inputs of size n are inputs with a binary representation of n bits.
The programming of the algorithm was the easy part, but I just want to make sure that I understand where to go from here. I can fix N to be some arbitrary value. I generate a set of random values of a and b to feed into the algorithm, whose lengths in binary (n) are bounded above by N. While the algorithm is running, I have a counter that keeps track of the number of steps (ignoring trivial linear operations) taken for that particular a and b.
At the end of this, I sum the lengths of the binary representations of the individual inputs a and b, and that sum is a single x value on the graph. My single y value would be the counter variable for that particular a and b. Is this a correct way to think about it?
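For concreteness, here is a rough sketch of the experiment I have in mind (the step counting and the way I force inputs to have exactly n bits are my own choices):

import random

def extended_gcd(a, b):
    """Extended Euclid: returns (g, x, y) with a*x + b*y = g, plus the step count."""
    x0, y0, x1, y1 = 1, 0, 0, 1
    steps = 0
    while b:
        steps += 1
        q, a, b = a // b, b, a % b
        x0, x1 = x1, x0 - q * x1
        y0, y1 = y1, y0 - q * y1
    return a, x0, y0, steps

N = 64         # largest input size in bits
TRIALS = 1000  # random pairs per size
avg_steps = {}
for n in range(2, N + 1):
    total = 0
    for _ in range(TRIALS):
        a = random.getrandbits(n) | (1 << (n - 1))  # force exactly n bits
        b = random.getrandbits(n) | (1 << (n - 1))
        total += extended_gcd(a, b)[3]
    avg_steps[n] = total / TRIALS
# avg_steps can then be written to a file and plotted with gnuplot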
As a follow up question, I also know that the best case for this algorithm is θ(1) and worst case is O(log(n)) so my "average" graph should lie between those two. How would I manually calculate average running time to verify that my end graph is correct?
Thanks.
I'm looking for a python/sklearn/lifelines/whatever implementation of Harrell's c-index (concordance index), which is mentioned in random survival forests.
The C-index is calculated using the following steps:
- Form all possible pairs of cases over the data.
- Omit those pairs whose shorter survival time is censored. Omit pairs i and j if Ti = Tj unless at least one is a death. Let Permissible denote the total number of permissible pairs.
- For each permissible pair where Ti and Tj are not equal, count 1 if the shorter survival time has the worse predicted outcome; count 0.5 if the predicted outcomes are tied. For each permissible pair where Ti = Tj and both are deaths, count 1 if the predicted outcomes are tied; otherwise, count 0.5. For each permissible pair where Ti = Tj but not both are deaths, count 1 if the death has the worse predicted outcome; otherwise, count 0.5. Let Concordance denote the sum over all permissible pairs.
- The C-index, C, is defined by C = Concordance / Permissible.
Note: nltk has a ConcordanceIndex method with a different meaning :(
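To make the definition concrete, here is a naive O(n^2) sketch of those steps as I read them (the argument names are mine, and I assume a higher prediction means a worse expected outcome):

def c_index(times, predictions, events):
    """Harrell's C. times: survival times; predictions: predicted risk (higher = worse);
    events: 1 if the death was observed, 0 if censored."""
    concordance = 0.0
    permissible = 0
    n = len(times)
    for i in range(n):
        for j in range(i + 1, n):
            ti, tj, ei, ej = times[i], times[j], events[i], events[j]
            pi, pj = predictions[i], predictions[j]
            if ti != tj:
                shorter_event = ei if ti < tj else ej
                if not shorter_event:
                    continue                      # shorter time is censored: omit pair
                permissible += 1
                shorter_pred, longer_pred = (pi, pj) if ti < tj else (pj, pi)
                if shorter_pred > longer_pred:    # shorter survival, worse prediction
                    concordance += 1
                elif shorter_pred == longer_pred:
                    concordance += 0.5
            else:
                if not (ei or ej):
                    continue                      # tied times, neither a death: omit pair
                permissible += 1
                if ei and ej:                     # tied times, both deaths
                    concordance += 1 if pi == pj else 0.5
                else:                             # tied times, exactly one death
                    death_pred, other_pred = (pi, pj) if ei else (pj, pi)
                    concordance += 1 if death_pred > other_pred else 0.5
    return concordance / permissible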
The lifelines package now has this implemented as the c-index, or concordance index. To install it:
pip install lifelines
or
conda install -c conda-forge lifelines
Example:
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

# df is a pandas DataFrame with a duration column 'T' and an event column 'E'
cph = CoxPHFitter().fit(df, 'T', 'E')
concordance_index(df['T'], -cph.predict_partial_hazard(df), df['E'])