I am faced with the following programming problem: I need to generate n (a, b) tuples for which the sum of all the a's is a given A, the sum of all the b's is a given B, and for each tuple the ratio a / b is in the range (c_min, c_max). A / B is within the same range, too. I am also trying to make sure there is no bias in the result other than what is introduced by the constraints, and that the a / b values are more or less uniformly distributed over the given range.
Some clarifications and meta-constraints:
A, B, c_min, and c_max are given.
The ratio A / B is in the (c_min, c_max) range. This has to be so if the problem is to have a solution given the other constraints.
a and b are > 0 and real-valued (not integers).
I am trying to implement this in Python but ideas in any language (English included) are much appreciated.
We look for tuples a_i and b_i such that
(a_1, ... a_n) and (b_1, ... b_n) have a distribution which is invariant under permutation of indices (what you would call "unbiased")
the ratios a_i / b_i are uniformly distributed on [cmin, cmax]
sum(a_i) = A, sum(b_i) = B
If c_min and c_max are not too ill-conditioned (i.e. they are not very close to one another), and n is not very large, the following works:
Generate a_i "uniformly" such that sum a_i = A:
Draw n samples aa_i (i = 1..n) from some distribution (eg. uniform)
Divide them by their sum and multiply by A: a_i = A * aa_i / sum(aa_i) has desired properties.
Generate b_i such that sum b_i = B by the same method.
If there exists i such that a_i / b_i is not in the interval [cmin, cmax], throw away all the a_i and b_i and try again from the beginning.
It doesn't scale well with n, because the set of a_i and b_i satisfying the constraints gets more and more narrow as n increases (and so you reject more candidates).
To be honest, I don't see any other simple solution. If n gets large and cmin ~ cmax, then you will have to use a sledgehammer (eg. MCMC) to generate samples from your distribution, unless there is some trick we did not see.
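For concreteness, here is a minimal sketch of the rejection loop described above (the function name and the retry cap are my own choices, purely illustrative):

import random

def rejection_sample(n, A, B, c_min, c_max, max_tries=100000):
    """Draw a's and b's on their scaled simplices; keep the draw only if
    every ratio a_i / b_i lands inside [c_min, c_max]."""
    for _ in range(max_tries):
        aa = [random.random() for _ in range(n)]
        bb = [random.random() for _ in range(n)]
        a = [A * x / sum(aa) for x in aa]
        b = [B * x / sum(bb) for x in bb]
        if all(c_min <= ai / bi <= c_max for ai, bi in zip(a, b)):
            return list(zip(a, b))
    raise RuntimeError("too many rejections; constraints too tight for this method")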
If you really want to use MCMC algorithms, note that you can change cmin to cmin * B / A (likewise for cmax) and assume A == B == 1. The problem is then to draw uniformly on the product of two unit n-simplices (u_1...u_n, v_1...v_n) such that
u_i / v_i \in [cmin, cmax].
So you have to use an MCMC algorithm (Metropolis-Hastings seems better suited) on the product of two unit n-simplices with the density
f(u_1, ..., u_n, v_1, ..., v_n) = \prod indicator_{u_i/v_i \in [cmin, cmax]}
which is definitely doable (albeit involved).
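A rough illustration of one such move (my own sketch, not a tuned sampler): with a symmetric pairwise proposal the acceptance probability reduces to the indicator, i.e. you simply reject any move that breaks a constraint. Here cmin and cmax are assumed to be already rescaled by B / A as above, so u = v = [1/n]*n is a feasible start, and (a_i, b_i) = (A*u_i, B*v_i) maps a sample back to the original problem.

import random

def mh_step(u, v, c_min, c_max, eps=0.01):
    """One in-place MH move on the product of two unit simplices with density
    prod_i indicator(c_min <= u_i / v_i <= c_max)."""
    target = u if random.random() < 0.5 else v   # perturb one simplex at a time
    i, j = random.sample(range(len(u)), 2)
    delta = random.uniform(-eps, eps)
    new_i, new_j = target[i] + delta, target[j] - delta
    if new_i <= 0 or new_j <= 0:
        return                                   # left the simplex: reject
    old_i, old_j = target[i], target[j]
    target[i], target[j] = new_i, new_j
    if not all(c_min <= u[k] / v[k] <= c_max for k in (i, j)):
        target[i], target[j] = old_i, old_j      # violated a ratio: reject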
Start by generating as many identical tuples, n, as you need:
(A/n, B/n)
Now pick two tuples at random. Make a random change to the a value of one, and a compensating change to the a value of the other, keeping everything within the given constraints. Put the two tuples back.
Now pick another random pair. This time twiddle the b values.
Lather, rinse, repeat.
I think the simplest thing is to
Use your favorite method to throw n-1 values such that \sum_{i=1}^{n-1} a_i < A, and set a_n to get the right total. There are several SO questions about doing that, though I've never seen an answer I'm really happy with yet. Maybe I'll write a paper or something.
Get the first n-1 b's by throwing the c_i uniformly on the allowed range, and set the final b to get the right total; then check the final c (I think it must be OK, but I haven't proven it yet).
Note that since we have 2 hard constraints we should expect to throw 2n-2 random numbers, and this method does exactly that (on the assumption that you can do step 1 with n-1 throws).
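A sketch of this recipe (the particular simplex draw in step 1 and the retry loop are my own choices, not part of the answer):

import random

def two_stage_sample(n, A, B, c_min, c_max, max_tries=100000):
    for _ in range(max_tries):
        # Step 1: n positive a_i with sum(a) == A, here via the gaps between
        # n-1 sorted uniform points on (0, A).
        cuts = sorted(random.uniform(0, A) for _ in range(n - 1))
        a = [hi - lo for lo, hi in zip([0.0] + cuts, cuts + [A])]
        # Step 2: draw n-1 ratios uniformly and derive b_i = a_i / c_i.
        c = [random.uniform(c_min, c_max) for _ in range(n - 1)]
        b = [ai / ci for ai, ci in zip(a, c)]
        # Step 3: the last b is fixed by the total; keep the draw only if the
        # implied final ratio is admissible.
        b_last = B - sum(b)
        if b_last > 0 and c_min <= a[-1] / b_last <= c_max:
            return list(zip(a, b + [b_last]))
    raise RuntimeError("no admissible draw found")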
Blocked Gibbs sampling is pretty simple and converges to the right distribution (this is along the lines of what Alexandre is proposing).
For all i, initialize ai = A / n and bi = B / n.
Select i ≠ j uniformly at random. With probability 1/2, update ai and aj with uniform random values satisfying the constraints. The rest of the time, do the same for bi and bj.
Repeat Step 2 as many times as seems to be necessary for your application. I have no idea what the convergence rate is.
Lots of good ideas here. Thanks! Rossum's idea seemed the most straightforward implementation-wise so I went for it. Here is the code for posterity:
import random

c_min = 0.25
c_max = 0.75
a_sum = 100.0
b_sum = 200.0
n = 1000
a = [a_sum / n] * n
b = [b_sum / n] * n
while not good_enough(a, b):   # good_enough(a, b) is the stopping test discussed below
    # Move an amount q from a[j] to a[i]; q must keep a[i]/b[i] in [c_min, c_max]
    # (q in [li, ui]) and a[j]/b[j] in range (q in [uj, lj]; note lj >= uj here).
    i, j = random.sample(range(n), 2)
    li, ui = c_min * b[i] - a[i], c_max * b[i] - a[i]
    lj, uj = a[j] - c_min * b[j], a[j] - c_max * b[j]
    llim = max((li, uj))
    ulim = min((ui, lj))
    q = random.uniform(llim, ulim)
    a[i] += q
    a[j] -= q
    # The same move on the b side, with bounds derived from a[i]/c and a[j]/c.
    i, j = random.sample(range(n), 2)
    li, ui = a[i] / c_max - b[i], a[i] / c_min - b[i]
    lj, uj = b[j] - a[j] / c_max, b[j] - a[j] / c_min
    llim = max((li, uj))
    ulim = min((ui, lj))
    q = random.uniform(llim, ulim)
    b[i] += q
    b[j] -= q
The good_enough(a, b) function can be a lot of things. I tried:
Standard deviation, which is hit or miss, as you don't know what is a good enough value.
Kurtosis, where a large negative value would be nice. However, it is relatively slow to calculate and is undefined with the seed values of (a_sum / n, b_sum / n) (though that's trivial to fix).
Skewness, where a value close to 0 is desirable. But it has the same drawbacks as kurtosis.
A number of iterations that grows with n. 2n sometimes wasn't enough, while n ^ 2 is a little bit of overkill and is, well, quadratic, so it gets expensive quickly.
Ideally, a heuristic using a combination of skewness and kurtosis would be best but I settled for making sure each value has been changed from the initial (again, as rossum suggested in a comment). Though there is no theoretical guarantee that the loop will complete, it seemed to work well enough for me.
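For reference, a bare-bones version of that stopping test might look like this (the closure and names are mine; the exact-equality check is purely a heuristic):

def make_good_enough(n, a0, b0):
    """Return a good_enough(a, b) that becomes True once every element has
    moved away from its seed value (a0, b0) at least once."""
    remaining = set(range(n))
    def good_enough(a, b):
        for i in list(remaining):
            if a[i] != a0 or b[i] != b0:
                remaining.discard(i)
        return not remaining
    return good_enough

# created before the loop above, e.g.:
# good_enough = make_good_enough(n, a_sum / n, b_sum / n)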
So here's what I think from a mathematical point of view. We have sequences a_i and b_i such that the sum of the a_i is A and the sum of the b_i is B. Furthermore, A/B is in (x,y) and so is a_i/b_i for each i, and you want the a_i/b_i to be uniformly distributed in (x,y).
So do it starting from the end. Choose c_i from (x,y) such that they are uniformly distributed. Then we want the equality a_i/b_i = c_i to hold, so a_i = b_i*c_i.
Therefore we only need to find b_i. But we have the following system of linear equations:
A = sum_i b_i*c_i
B = sum_i b_i
where b_i are variables. Solve it (some fancy linear algebra tricks) and you're done!
Note that for large enough n this system will have lots of solutions. They will be dependent on some parameters which you can choose randomly.
Enough of the theoretical approach; let's see a practical solution.
// EDIT 1: Here's some hard core Python code :D
import random

c_min = 0.0   # renamed from `min`/`max` to avoid shadowing the built-ins
c_max = 10.0
A = 500.0
B = 100.0

def generate(n):
    # Evenly spaced ratios c_i in (c_min, c_max)
    C = [c_min + i*(c_max - c_min)/(n+1) for i in range(1, n+1)]
    Y = [0]
    for i in range(1, n-1):
        # This line should be changed in order to always get positive numbers
        # It should be relatively easy to figure out some good random generator
        Y.append(random.random())
    # Y[n-1] is chosen so that sum(Y[i]*C[i]) == A
    val = A - C[0]*B
    for i in range(1, n-1):
        val -= Y[i] * (C[i] - C[0])
    val /= (C[n-1] - C[0])
    Y.append(val)
    # Y[0] is chosen so that sum(Y) == B
    val = B
    for i in range(1, n):
        val -= Y[i]
    Y[0] = val
    # Each pair is (a_i, b_i) = (Y[i]*C[i], Y[i])
    result = []
    for i in range(0, n):
        result.append([Y[i]*C[i], Y[i]])
    return result
The result is a list of pairs (X, Y) satisfying your conditions, with the exception that they may be negative (see the random-generator line in the code), i.e. the first and the last pair may contain negative numbers.
// EDIT 2:
To ensure that they are positive, you may try something like
Y.append(random.random() * B / n)
instead of
Y.append(random.random())
I'm not sure though.
// EDIT 3:
In order to have better results try something like this:
avrg = B / n
ran = avrg / 20
for i in range(1, n-1):
    Y.append(random.gauss(avrg, ran))
instead of
for i in range(1, n-1):
    Y.append(random.random())
This will make all the b_i be near B / n. Unfortunately, the last term will still sometimes jump high. I'm sorry, but there is no way to avoid this (it's forced by the mathematics), since the last and the first terms depend on the others. For small n (~100) it looks good, though. Unfortunately, some negative values may still appear.
The choice of a correct generator is not so simple if you additionally want b_i to be uniformly distributed.
Let me present my problem in mathematical notation before diving into the programming aspect.
Let (a_i) be the sequence whose ith term is defined as a_i = i^2 - (i-1)^2. It is easy to see that a_i = 2i - 1. Hence (in mathematical notation) we have (a_i) = {2·1 - 1, 2·2 - 1, ..., 2n - 1} = {1, 3, 5, ..., 2n - 1}, the sequence of all odd integers in the range [1, 2n].
On HackerRank, an exercise is to define a function that computes the sum S_n = a_1 + a_2 + ... + a_n and then finds x in the congruence S_n ≡ x (mod 10^9 + 7) (still using math notation). So we need to find the residue of the sum of all odd integers up to 2n modulo 10^9 + 7.
Now, to go into the programming aspect, here's what I attempted:
def summingSeries(n):
    # Firstly, an anonymous function computing the ith term in the sequence.
    a_i = lambda i: 2*i - 1
    # Now, let us sum all elements in the list consisting of
    # every a_i for i in the range [1, n].
    s_n = sum([a_i(x) for x in range(1, n + 1)])
    # Lastly, return the required modulo.
    return s_n % (10**9 + 7)
This function passes some, but not all tests in HackerRank. However, I am oblivious as to what might be wrong with it. Any clues?
Thanks in advance.
The solution ended up being quite simple. The tests were failing not because the logic of the code was wrong, but because it was taking too long to compute. Observing that 1 + 3 + ... + (2n - 1) = n^2, we can write the function as
def summingSeries(n):
    return n**2 % (10**9 + 7)
which is clearly a major simplification. The runtime is now acceptable according to HackerRank's criteria and all tests were passed.
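For the record, the identity follows by telescoping: S_n = sum_{i=1}^{n} a_i = sum_{i=1}^{n} (i^2 - (i-1)^2) = n^2 - 0^2 = n^2, so only the last square survives.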
I am using scipy.stats.binom to work with the binomial distribution. Given n and p, the probability mass function is P(X = k) = C(n, k) * p^k * (1 - p)^(n - k).
A sum over k ranging from 0 to n should (and indeed does) give 1. Fixing a point x_0, we can add the probabilities from either side of it, and the two sums ought to add up to 1. However, the code below yields two different answers for the upper tail when x_0 is close to n.
from scipy.stats import binom

n = 9
p = 0.006985
b = binom(n=n, p=p)
x_0 = 8

# Method 1
cprob = 0
for k in range(x_0, n+1):
    cprob += b.pmf(k)
print('cumulative prob with method 1:', cprob)

# Method 2
cprob = 1
for k in range(0, x_0):
    cprob -= b.pmf(k)
print('cumulative prob with method 2:', cprob)
I expect the outputs from both methods to agree. For x_0 < 7 they agree, but for x_0 >= 8, as above, I get
>> cumulative prob with method 1: 5.0683768775504006e-17
>> cumulative prob with method 2: 1.635963929799698e-16
The precision error in the two methods propagates through my code (later) and gives vastly different answers. Any help is appreciated.
Roundoff errors of the order of the machine epsilon are expected and are inevitable. That these propagate and later blow up means that your problem is very poorly conditioned. You'd need to rethink the algorithm or the implementation, depending on where the bad conditioning comes from.
In your specific example you can get by with either np.sum (which tries to be careful with roundoff) or even math.fsum from the standard library.
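For example, reusing the numbers from the question (just a sketch of that suggestion):

import math
from scipy.stats import binom

b = binom(n=9, p=0.006985)
x_0 = 8

# Method 1 with careful summation of the pmf values
tail = math.fsum(b.pmf(k) for k in range(x_0, 9 + 1))

# Method 2 the same way; the result is still limited by the rounding
# already baked into the individual pmf values
head = 1.0 - math.fsum(b.pmf(k) for k in range(0, x_0))

print(tail, head)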
I want to solve an LP problem in Python with the PuLP library (or any other).
I want to express a constraint stating that all my variables must take different values (their domain is {0, 1, 2, 3, ..., N} for a given integer N), that is, x_1 != x_2 != x_3 != ... != x_N.
The solver gives me a solution when I do not add any constraint related to what I mentioned above, but when I do, it tells me that the system is not feasible even though it has a solution.
In order to add the "all different" constraint, I did the following:
for x_i in variables:
    for x_j in variables:
        if the following constraint has not been already added and x_i != x_j:
            my_problem += x_i - x_j >= 1, "unique name for the constraint"
The previous code does not work. When I try to add the constraint x_i != x_j directly, it just doesn't work. So, as I am working with a bounded set of integers, I can (I think) rewrite the "not equals" as x_i - x_j > 0. PuLP tells me that it does not handle the ">" operator between an int and an LpAffineExpression, so I wrote x_i - x_j >= 1. It runs, but it does not seem to work and I can't figure out why.
There are a few ways of doing this, depending on the exact situation.
You seem to have n variables x[i]. They can assume values {0,...,n} and must be all different.
As an aside: your notation x[1] ≠ x[2] ≠ x[3].. is not completely correct. E.g. x[1]=1, x[2]=2, x[3]=1 would satisfy x[1] ≠ x[2] ≠ x[3].
The all-different constraint can be written pairwise as x[i] ≠ x[j] for all i < j (we don't want to check i and j twice). This inequality can be restated as: x[i] ≤ x[j]-1 OR x[i] ≥ x[j]+1. The OR condition can be implemented in a MIP model as:
x[i] ≤ x[j]-1 + M δ[i,j] ∀ i < j
x[i] ≥ x[j]+1 - M (1-δ[i,j]) ∀ i < j
δ[i,j] ∈ {0,1}
where M=n+1. We added extra binary variables δ[i,j].
This is the most direct formulation of a "not-equal" construct. It also has relatively few binary variables: about n^2/2. Other formulations are also possible. For more information see link.
Note that constraint programming solvers often have built-in facilities for the all-different constraint, so it may be easier to use a CP solver (they can also be more efficient for models with all-different constraints).
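For illustration, a small PuLP sketch of this big-M construction (the value of n, the dummy objective, and the variable names are mine):

import pulp

n = 4
M = n + 1

prob = pulp.LpProblem("all_different_bigM", pulp.LpMinimize)
x = [pulp.LpVariable(f"x_{i}", lowBound=0, upBound=n, cat="Integer")
     for i in range(n)]
prob += pulp.lpSum(x)                       # arbitrary objective; we only need feasibility

for i in range(n):
    for j in range(i + 1, n):               # each unordered pair once
        d = pulp.LpVariable(f"delta_{i}_{j}", cat="Binary")
        prob += x[i] <= x[j] - 1 + M * d
        prob += x[i] >= x[j] + 1 - M * (1 - d)

prob.solve()
print([pulp.value(v) for v in x])           # distinct values for all x_i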
The reason your constraint does not work is that you are requiring x_i to be at least 1 greater than x_j, for every i and j. So you are requiring x_1 > x_2 and x_2 > x_1. You can probably fix this issue by replacing x_i != x_j with i > j or something like that, in your if statement.
But in your case, I would consider using binary variables to indicate which value each x_i takes. For example, let y_{i,n} = 1 if x_i = n. Then you have a constraint like
sum {i=1,...,N} y_{i,n} <= 1 for all n = 0,...,N
(i.e., each value of n can be used at most once) and another like
sum {n=0,...,N} y_{i,n} = 1 for all i = 1,...,N
(every i must be assigned some value n).
Then, in your formulation, replace all the x_i variables with sum {n=0,...,N} y_{i,n}.
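A possible PuLP sketch of this assignment formulation (N and the objective are placeholders of my own choosing):

import pulp

N = 4                                        # x_1..x_N take values in {0, ..., N}
prob = pulp.LpProblem("all_different_assignment", pulp.LpMinimize)
y = pulp.LpVariable.dicts("y", (range(1, N + 1), range(0, N + 1)), cat="Binary")

for v in range(0, N + 1):                    # each value used at most once
    prob += pulp.lpSum(y[i][v] for i in range(1, N + 1)) <= 1
for i in range(1, N + 1):                    # each variable gets exactly one value
    prob += pulp.lpSum(y[i][v] for v in range(0, N + 1)) == 1

# wherever the original model used x_i, substitute this expression:
x = {i: pulp.lpSum(v * y[i][v] for v in range(0, N + 1)) for i in range(1, N + 1)}

prob += pulp.lpSum(x.values())               # placeholder objective
prob.solve()
print({i: pulp.value(x[i]) for i in x})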
Say I have two matrices, A and B.
I want to calculate the magnitude of the difference between the two matrices, ideally without using iteration.
Here's what I have so far:
import numpy as np

def mymagn(A, B):
    i = 0
    j = 0
    x = np.shape(A)
    y = np.shape(B)
    while i < x[1]:
        while j < y[1]:
            np.sum((A[i][j] - B[i][j]) * (A[i][j] - B[i][j]))
            j += 1
        i += 1
As I understand it, the value should generally be small for two similar matrices, but I'm not getting that. Can anyone help? Is there any way to get rid of the need to iterate?
This should do it:
def mymagn(A, B):
    return np.sum((B - A) ** 2)
For arrays/matrices of the same size, addition/subtraction are element-wise (like in MATLAB). Exponentiation with a scalar exponent is also element-wise. And np.sum will by default sum all elements (along all axes).
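For example, with some made-up numbers:

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[1.1, 2.0], [2.9, 4.2]])
print(mymagn(A, B))    # 0.1**2 + 0.0**2 + 0.1**2 + 0.2**2 = 0.06 (up to float rounding)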
In python, I would like to find the roots of equations of the form:
-x*log(x) + (1-x)*log(n) - (1-x)*log(1 - x) - k = 0
where n and k are parameters that will be specified.
An additional constraint on the roots is that x >= (1-x)/n. So just for what it's worth, I'll be filtering out roots that don't satisfy that.
My first attempt was to use scipy.optimize.fsolve (note that I'm just setting k and n to be 0 and 1 respectively):
from numpy import log  # I tried both math.log and numpy.log (see below)
from scipy.optimize import fsolve

def f(x):
    return -x*log(x) + (1-x)*log(1) - (1-x)*log(1-x)

fsolve(f, 1)
Using math.log, I got ValueErrors because I was supplying bad input to log. Using numpy.log gave me some divide-by-zero and invalid-value-in-multiply warnings.
I adjusted f as so, just to see what it would do:
def f(x):
    if x <= 0:
        return 1000
    if x >= 1:
        return 2000
    return -x*log(x) + (1-x)*log(1) - (1-x)*log(1-x)
Now I get
/usr/lib/python2.7/dist-packages/scipy/optimize/minpack.py:221: RuntimeWarning: The iteration is not making good progress, as measured by the
improvement from the last ten iterations.
warnings.warn(msg, RuntimeWarning)
Using python, how can I solve for x for various n and k parameters in the original equation?
fsolve also allows a guess to be supplied for where to start. My suggestion would be to plot the equation and have the user pick an initial guess, either with the mouse or via text. You may also want to change the out-of-bounds values:
if x <= 0:
    return 1000 + abs(x)
if x >= 1:
    return 2000 + abs(x)
This way the function has a slope outside of the region of interest that will guide the solver back into the interesting region.
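Putting both suggestions together might look like this (the values of n and k and the starting guess are made up; in practice the guess would come from the plot):

import numpy as np
from scipy.optimize import fsolve

n, k = 2.0, 0.5                     # example parameters only

def f(x):
    x = np.atleast_1d(x)[0]
    if x <= 0:                      # sloped penalties steer the solver back
        return 1000 + abs(x)
    if x >= 1:
        return 2000 + abs(x)
    return -x*np.log(x) + (1 - x)*np.log(n) - (1 - x)*np.log(1 - x) - k

root = fsolve(f, 0.9)[0]            # 0.9 stands in for the user-picked guess
print(root, f(root))                # residual should be ~0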