Any effective solver for a selection problem with constraints - python

I am working on a package selection problem and have to put a constraint on the final result, like 'the top 3 brands among the selected products should account for less than 50%'.
I tried to implement that in PuLP, but it seems the CBC solver does not support such a constraint directly. How can I add this constraint? Or should I switch to another solver?

It depends a bit on what exactly "top 3 brands" means. If you mean the sum of the three largest x[i] should be less than 50% of the total, then this can be linearized as follows:
y1 >= x[i] for all i
y2 >= x[i]-M*delta1[i] for all i
y3 >= x[i]-M*delta2[i] for all i
y1+y2+y3 <= sum(i,x[i])/2
sum(i, delta1[i]) = 1
sum(i, delta2[i]) = 2
delta1[i],delta2[i] ∈ {0,1}
This can be solved with CBC. Here M is a large enough constant (called big-M) and delta1,delta2 are binary variables. Note that y1,y2,y3 are bounds on the largest three values, so interpreting these values may not always be obvious. Basically, the definition of these values is:
y1 : variable at least as large as the largest x[i]
y2 : variable at least as large as the second largest x[i]
y3 : variable at least as large as the third largest x[i]
Other and better formulations exist (but this one is a bit more intuitive [for me that is]).
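For concreteness, here is a minimal PuLP sketch of this big-M formulation. The number of brands n, the value of M, the capacity constraint and the objective are illustrative placeholders, not part of the original question:
import pulp

n, M = 10, 1000                     # number of brands and big-M (assumed)
prob = pulp.LpProblem("top3_share", pulp.LpMaximize)

x = [pulp.LpVariable(f"x_{i}", lowBound=0) for i in range(n)]
y1, y2, y3 = (pulp.LpVariable(name) for name in ("y1", "y2", "y3"))
d1 = [pulp.LpVariable(f"d1_{i}", cat="Binary") for i in range(n)]
d2 = [pulp.LpVariable(f"d2_{i}", cat="Binary") for i in range(n)]

prob += pulp.lpSum(x)               # placeholder objective: maximize total selection
prob += pulp.lpSum(x) <= 100        # placeholder capacity constraint

for i in range(n):
    prob += y1 >= x[i]              # y1 is at least the largest x[i]
    prob += y2 >= x[i] - M * d1[i]  # y2 is at least the 2nd largest (one i excluded)
    prob += y3 >= x[i] - M * d2[i]  # y3 is at least the 3rd largest (two i excluded)
prob += pulp.lpSum(d1) == 1
prob += pulp.lpSum(d2) == 2
prob += y1 + y2 + y3 <= 0.5 * pulp.lpSum(x)   # top-3 share at most 50%

prob.solve(pulp.PULP_CBC_CMD(msg=False))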
A simpler approach would be to use sum_largest(x,k) in CVXPY, e.g.
sum_largest(x,3) <= sum(x)/2
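A minimal CVXPY sketch of that constraint (the profit vector and the capacity constraint are assumptions made just to have a complete model):
import numpy as np
import cvxpy as cp

n = 10
profit = np.random.rand(n)               # hypothetical per-brand profit
x = cp.Variable(n, nonneg=True)          # quantity selected per brand

constraints = [
    cp.sum(x) <= 100,                         # capacity (assumed)
    cp.sum_largest(x, 3) <= 0.5 * cp.sum(x),  # top-3 brands at most 50% of the total
]
prob = cp.Problem(cp.Maximize(profit @ x), constraints)
prob.solve()
print(x.value)
CVXPY reformulates sum_largest into linear constraints internally, so any LP-capable solver can handle this.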


Minimize the number of outputs

For a linear optimization problem, I would like to include a penalty. The penalty of every option (penalties[(i)]) should be 1 if the sum is larger than 0 and 0 if the sum is zero. Is there a way to do this?
The penalty is defined as:
penalties = {}
for i in A:
    penalties[(i)] = (lpSum(choices[i][k] for k in B)) / len(C)
prob += Objective Function + sum(penalties)
For example:
penalties[(0)]=0
penalties[(1)]=2
penalties[(3)]=6
penalties[(4)]=0
The sum of the penalties should then be:
sum(penalties)=0+1+1+0= 2
Yes. What you need to do is create binary variables use_ith_row. The interpretation of this variable is: use_ith_row[i] == 1 if any of the choices[i][k] are > 0 for row i (and 0 otherwise).
The penalty term in your objective function simply needs to be sum(use_ith_row[i] for i in A).
The last thing you need is the set of constraints which enforce the rule described above:
for i in A:
    prob += lpSum(choices[i][k] for k in B) <= use_ith_row[i] * M
Finally, you need to choose M large enough so that the constraint above has no limiting effect when use_ith_row[i] is 1 (you can normally work out this bound quite easily). Choosing an M which is way too large will also work, but will tend to make your problem solve more slowly.
p.s. I don't know what C is or why you divide by its length - but typically if this penalty is secondary to your other/primary objective, you would weight it so that improvement in your primary objective is always given greater weight.
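Putting it together, a minimal PuLP sketch of the indicator-variable penalty (the sets A and B, the binary choices variables, M and the extra requirement constraint are illustrative assumptions):
import pulp

A, B = range(4), range(3)
M = len(B)                           # each row sum is at most |B| if choices are binary

prob = pulp.LpProblem("min_used_rows", pulp.LpMinimize)
choices = pulp.LpVariable.dicts("choice", (A, B), cat="Binary")
use_ith_row = pulp.LpVariable.dicts("use_row", A, cat="Binary")

# primary objective would normally go here; the penalty term counts used rows
prob += pulp.lpSum(use_ith_row[i] for i in A)

# if any choices[i][k] is 1, use_ith_row[i] is forced to 1
for i in A:
    prob += pulp.lpSum(choices[i][k] for k in B) <= use_ith_row[i] * M

# an assumed requirement so the optimum is not trivially zero
prob += pulp.lpSum(choices[i][k] for i in A for k in B) >= 5

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([pulp.value(use_ith_row[i]) for i in A])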

my code is giving me the wrong output sometimes, how to solve it?

I am trying to solve this problem: 'Your task is to construct a building which will be a pile of n cubes. The cube at the bottom will have a volume of n^3, the cube above will have the volume of (n-1)^3 and so on until the top which will have a volume of 1^3.
You are given the total volume m of the building. Being given m can you find the number n of cubes you will have to build?
The parameter of the function findNb (find_nb, find-nb, findNb) will be an integer m and you have to return the integer n such as n^3 + (n-1)^3 + ... + 1^3 = m if such a n exists or -1 if there is no such n.'
I tried to first create an arithmetic sequence, then transform it into a sigma sum using the nth term of the arithmetic sequence, then get a formula whose value I can compare with m.
I used this code and it works 70-80% of the time: most of the calculations it does are correct, but some aren't.
import math

def find_nb(m):
    n = 0
    while n < 100000:
        if math.pow(math.pow(n, 2) + n, 2) == 4 * m:
            return n
            break
        n += 1
    return -1
print(find_nb(4183059834009))
>>> output 2022, which is correct
print(find_nb(24723578342962))
>>> output -1, which is also correct
print(find_nb(4837083252765022010))
>>> real output -1, which is incorrect
>>> expected output 57323
As mentioned, this is mainly a math problem, which is what I am better at :).
Sorry for the in-line mathematical formula as I cannot do any math formula rendering (in SO).
I do not see any problem with your code and I believe your sample test case is wrong. However, I'll still give optimisation "tricks" below for your code to run quicker
Firstly as you know, sum of the cubes between 1^3 and n^3 is n^2(n+1)^2/4. Therefore we want to find integer solutions for the equation
n^2(n+1)^2/4 == m i.e. n^4+2n^3+n^2 - 4m=0
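A quick sanity check of that closed form (not from the answer, just to make the identity concrete):
# sum of the first n cubes equals (n*(n+1)/2)**2 = n^2*(n+1)^2/4
for n in range(1, 20):
    assert sum(k**3 for k in range(1, n + 1)) == n * n * (n + 1) * (n + 1) // 4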
Running a loop for n from 0 up to 100000 is inefficient. Firstly, if m is a large number (1e100+), the complexity of your code is O(m^0.25). Considering Python's runtime, you can run your code in time only if m is less than around 1e32.
To optimise your code, you have two approaches.
1) Use Binary Search. I will not get into the details here, but basically you can halve the search range with a single comparison at each step (a short sketch is given at the end of this answer). For the initial bounds you can use lower = 0 & upper = k. A better bound for k will be given below, but let's use k = m for now.
Complexity: O(log(k)) = O(log(m))
Feasible range for m: m < 10^(3e7)
2) Use the almighty Newton-Raphson!! Using the iteration formula x_(n+1) = x_n - f(x_n) / f'(x_n), where f'(x) can be calculated explicitly, and a reasonable initial guess, let's say k = m again, the complexity is (I believe) O(log(k)) + O(1) = O(log(m)).
Complexity: O(log(k)) = O(log(m))
Feasible range for m: m < 10^(3e7)
Finally, I'll give a better initial guess for k in the above methods, also given in Ian's answer to this question. Since n^4+2n^3+n^2 = O(n^4), we can actually take k ~ m^0.25 = (m^0.5)^0.5. To calculate this, we can take k = 2^(log(m)/4) where log is base 2. The log should be O(1), but I'm not sure for big numbers/dynamic sizes (int in Python); I'm not a theorist. Using this better guess and Newton-Raphson, since the guess is within a constant range of the result, the algorithm is nearly O(1). Again, check out the links for a better understanding.
Finally
Since your goal is to find whether an n exists such that the equation is exactly satisfied, use Newton-Raphson and iterate until the next guess is within 0.5 of the current guess. If your implementation is a bit loose, you can also check a range of +/- 10 around the guess to make sure you find the solution.
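Here is the promised sketch of the binary-search variant from point 1, kept in exact integer arithmetic so huge values of m cause no floating-point trouble (the crude upper bound k = m is used, as above):
def find_nb(m):
    # return n with 1^3 + 2^3 + ... + n^3 == m, or -1 if no such n exists
    def total(n):
        return (n * (n + 1) // 2) ** 2   # closed form for the sum of cubes

    lo, hi = 0, m
    while lo <= hi:
        mid = (lo + hi) // 2
        s = total(mid)
        if s == m:
            return mid
        if s < m:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

print(find_nb(4183059834009))    # 2022
print(find_nb(24723578342962))   # -1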
I think this is a Math question rather than a programming question.
Firstly, I would advise you to start iterating from a value derived from your input m. Right now you are initialising your n value arbitrarily (though of course it might be a requirement of the question), but I think there are ways to optimise it. Maybe, just maybe, you can iterate from the cube root, so that if n reaches zero or if at any point the sum becomes smaller than m, you can safely assume there is no possible building that can be built.
Secondly, the equation you derived from your summation doesn't seem to be correct. I substituted your expected n and input m into the condition in your if clause and they don't match. So either 1) your equation is wrong or 2) the expected output is wrong. I suggest that you look again at your derivation of the condition. Are you using the sum-of-cubes factorisation? There might be some edge cases that you neglected (maybe odd n), but my math is rusty so I can't help much.
Of course, as mentioned, the break is unnecessary and will never be executed.

Sparsity reduction

I have to factorize a big sparse matrix (6.5 million rows representing users × 6.5 million columns representing items) to find latent vectors for users and items. I chose the ALS algorithm in the Spark framework (PySpark).
To boost the quality I have to reduce the sparsity of my matrix to 98% (the current value is 99.99% because I have only 356 million filled entries).
I can do it by dropping rows or columns, but I must find the optimal solution that maximizes the number of rows (users).
The main problem is that I must find some subsets of the user and item sets, and dropping some rows can drop some columns and vice versa; the second problem is that the function that evaluates sparsity is not linear.
How can I solve this problem? Which Python libraries can help me with it?
Thank you.
This is a combinatorial problem. There is no easy way to drop an optimal set of columns to achieve max number of users while reducing sparsity. A formal approach would be to formulate it as a mixed-integer program. Consider the following 0-1 matrix, derived from your original matrix C.
A(i,j) = 1 if C(i,j) is nonzero,
A(i,j) = 0 if C(i,j) is zero
Parameters:
M : a sufficiently big numeric value, e.g. number of columns of A (NCOLS)
N : total number of nonzeros in C (aka in A)
Decision vars are
x(j) : 0 or 1, indicating whether column j is kept (1) or dropped (0)
nr(i): number of nonzeros covered in row i
y(i) : 0 or 1, indicating whether row i is kept (1) or dropped (0)
Constraints:
A(i,:) · x = nr(i)           for i = 1..NROWS
nr(i) <= y(i) * M            for i = 1..NROWS
sum(nr(i)) + e = 0.98 * N    (e is an explicit slack to be penalized in the objective)
y(i) and x(j) are 0-1 (binary) variables for all i, j
Objective:
maximize sum(y(i)) - N*e
Such a model would be extremely difficult to solve as an integer model. However, barrier methods should be able to solve the linear programming (LP) relaxation. Possible solvers are COIN/CLP (open source), Lindo (commercial), etc. It may then be possible to use the LP solution to compute approximate integer solutions by simple rounding.
In the end, you will definitely require an iterative approach, solving the matrix-factorization problem several times, each time factoring a different submatrix of C computed with the above approach, until you are satisfied with the solution.
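As an illustration only, here is a toy PuLP sketch of the LP relaxation of the above model on a tiny 0-1 matrix; real instances of this size would need sparse data structures and a barrier solver, and the 0.98 target and the data are placeholders:
import numpy as np
import pulp

A = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 1]])
nrows, ncols = A.shape
N = int(A.sum())                     # total number of nonzeros
M = ncols                            # big-M
target = 0.98                        # fraction of nonzeros to cover (assumed)

prob = pulp.LpProblem("sparsity_reduction", pulp.LpMaximize)
x = [pulp.LpVariable(f"x_{j}", 0, 1) for j in range(ncols)]   # relaxed column-keep vars
y = [pulp.LpVariable(f"y_{i}", 0, 1) for i in range(nrows)]   # relaxed row-keep vars
nr = [pulp.LpVariable(f"nr_{i}", lowBound=0) for i in range(nrows)]
e = pulp.LpVariable("e", lowBound=0)                          # slack on the coverage target

prob += pulp.lpSum(y) - N * e        # objective: many rows kept, small slack

for i in range(nrows):
    prob += pulp.lpSum(int(A[i, j]) * x[j] for j in range(ncols)) == nr[i]
    prob += nr[i] <= y[i] * M
prob += pulp.lpSum(nr) + e == target * N

prob.solve(pulp.PULP_CBC_CMD(msg=False))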

Work with logarithms of very large numbers

I have written a program that must output a file with the values of a function. This function produces very large values, but I only need the logarithm, which can go up to values of around 10000 or even a million (large, but manageable with 32-bit integer variables).
Now, obviously, the function itself will be of the order of exp(10000), and that's huge. So I'm looking for any tricks to calculate the logarithm. I'm using Python since I thought its native support for very large numbers would be useful, but it does not help for numbers this large.
The value of the function is calculated as:
a*(x1+x2+x3+x4)
and I have to take the logarithm of that. I already preprocess the logarithms of all factors and then sum them all, but I can't do anything (at least anything simple) with log(x1+x2+x3+x4).
The results from Python are NaN because the x1, x2, x3, x4 variables grow too much. They are calculated as:
x = [1, 1, 1, 1]
for i in range(1, K):
    x[j] *= a * cosh(b * g[i])   # even values of i
    x[j] *= a * sinh(b * g[i])   # odd values of i
for some constants a, b and a vector g[]. That's just pseudocode; I actually write out each x[1], x[2], ... separately.
Is there any trick by which I could calculate the logarithm of that sum without running into the NaN problem?
Thank you very much
P.S.: I was using Python because of what I said; if there's any special library for C(++) or something like that to deal with very large numbers, I would really appreciate it.
P.S.: The constant b inside the cosh can be of the order of 100, and that can make things blow up, so if there's anything to be done by taking that constant out somehow...
I see that in your loop you are each time multiplying each x by a constant a. If you skip that factor and take log(x1+x2+x3+x4), which may then be manageable, you just add log(a) to that to get the final result. Or n*log(a) if you're multiplying by a several times.
That idea is language independent. :-)
Scaling the summands like this:
scale_factor = max(b*g[i] for i in range(1, K))
x = [cosh(b*g[i] - scale_factor) for i in range(1, K)]
result = log(a)*K + sum(log(xi) for xi in x) + log(scale_factor)
edit:
uh-oh, one detail is wrong:
result = log(a)*K + sum(log(xi) for xi in x) + scale_factor
The last term is just the factor, not its log. Sorry.
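Not from either answer above, but a common way to stay in log space end to end is to compute log(cosh(u)) directly with a stable formula and then combine the four per-term logarithms with the log-sum-exp trick; a, b, g and K below are illustrative placeholders (sinh terms can be handled analogously):
import math

def log_cosh(u):
    # log(cosh(u)) = |u| - log(2) + log(1 + exp(-2|u|)); stable for large |u|
    u = abs(u)
    return u - math.log(2.0) + math.log1p(math.exp(-2.0 * u))

def log_sum_exp(logs):
    # log(sum_j exp(L_j)) = Lmax + log(sum_j exp(L_j - Lmax))
    m = max(logs)
    return m + math.log(sum(math.exp(L - m) for L in logs))

# toy data (assumed); each log_x[j] is the log of a product of a*cosh(b*g[i]) terms
a, b, K = 1.5, 100.0, 50
g = [0.1 * i for i in range(K)]
log_x = [sum(math.log(a) + log_cosh(b * g[i]) for i in range(1, K)) for _ in range(4)]

log_result = math.log(a) + log_sum_exp(log_x)   # log( a*(x1+x2+x3+x4) )
print(log_result)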

Optimization on a set of data using python

The following data sets are available:
x, y, f(x), f(y).
Function to be optimized (maximize):
f(x,y) = f(x)*y - f(y)*x
subject to the following constraints:
V >= sqrt(f(x)^2 + f(y)^2)
I >= sqrt(x^2 + y^2)
where V and I are constants.
Can anyone please let me know which optimization module I need to use? From what I understand, I need to perform a discrete optimization, as I have fixed sets of values for x, y, f(x) and f(y).
Using complex optimizers (http://docs.scipy.org/doc/scipy/reference/optimize.html) for such a problem is rather a bad idea.
It looks like a problem which can be quite easily solved in under O(n^2) where n=max(|x|,|y|), simply:
1) Sort x, y, f(x), f(y), creating sorted(x), sorted(y), sorted(f(x)), sorted(f(y)).
2) For each x, find the positions in sorted(y) for which I^2 >= x^2 + y^2 holds, and similarly for f(x) and sorted(f(y)) with V^2 >= f(x)^2 + f(y)^2. These are two binary searches: since I^2 >= x^2 + y^2 <=> |y| <= sqrt(I^2 - x^2), you can find the "barrier" in constant time and then use binary searches to find the actual data points closest to it on the right side of the inequality.
3) Iterate through sorted(x) and for each x:
- iterate simultaneously through the elements of y and f(y) and discard (in this loop) points which are not in both intervals found in step 2 (linear complexity);
- record the argument pair x_max, y_max for which f(x_max, y_max) is maximized.
4) Return x_max, y_max.
Total complexity is subquadratic: step 1 takes O(n log n), each iteration of the loop in step 2 is O(log n) so the whole of step 2 takes O(n log n), the loop in step 3 is O(n) and the loop in its first substep is O(n) (though in practice it should be almost constant due to the constraints), which makes the whole algorithm O(n^2) in the worst case (and in most cases it will behave as O(n log n)). It also does not depend on the definition of f(x,y) (which is used as a black box), so you can optimize an arbitrary function in this way.
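For reference, a hedged brute-force sketch of the same idea without the sorting and pruning (the data, V and I_max are made-up placeholders); the binary-search filtering described above would then be used to shrink the inner loop:
import math

def maximize(x_vals, fx_vals, y_vals, fy_vals, V, I_max):
    best, best_pair = -math.inf, None
    for x, fx in zip(x_vals, fx_vals):
        for y, fy in zip(y_vals, fy_vals):
            # keep only pairs satisfying both constraints
            if math.hypot(fx, fy) <= V and math.hypot(x, y) <= I_max:
                val = fx * y - fy * x          # f(x, y) = f(x)*y - f(y)*x
                if val > best:
                    best, best_pair = val, (x, y)
    return best_pair, best

# toy usage
x_vals, fx_vals = [0.0, 1.0, 2.0], [0.5, 1.5, 2.5]
y_vals, fy_vals = [0.0, 1.0, 2.0], [0.2, 0.8, 1.6]
print(maximize(x_vals, fx_vals, y_vals, fy_vals, V=3.0, I_max=2.5))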
