Conditional average of variables in a PuLP optimization problem (Python)

Suppose I have two PuLP variables, x1 and x2, which represent the water temperatures inside two different water pipes. At a certain point these two pipes merge into a single pipe and the two water flows mix together. Because the flow rates are equal, the water temperature after mixing is the average of the two temperatures.
If the flow in one pipe is zero, there is no mixing and the output temperature equals the temperature of the pipe with non-zero flow.
This final water temperature is then used in the objective function of the PuLP problem to calculate a cost.
This means I have to calculate the average of these two variables, but each variable should only be included in the average if it is greater than 0.
Here is a reproducible example that calculates the average without the > 0 condition.
from pulp import *
# Define the variables
x1 = LpVariable("x1", 0, None)
x2 = LpVariable("x2", 0, None)
avg = LpVariable("avg",0,None)
# Define the problem
prob = LpProblem("average_problem", LpMinimize)
# Define the objective function
prob += 0, "objective function"
# Calculate avg value
prob += avg==(x1+x2)/2, "average_constraint"
# Set x1 and x2 value just as example
prob += x1==100
prob += x2==50
cost_of_engine = (105-avg)*3/0.2
total_production_cost = lpSum(cost_of_engine+10)
prob.setObjective(total_production_cost)
# Solve the problem
prob.solve()
This example works if x1 and x2 are both higher than zero.
However, if for instance x1=0 and x2=100, then avg=50.
What I need, instead, is to discard the x1 variable from the calculation of the average so that avg=100.
This is clearly a non-linear problem, because the denominator in the calculation of the average is dynamic and depends on the values of the variables x1 and x2.
Do you have any idea how to solve this problem? Maybe using the Big M technique?

There are several approaches that might be reasonable, depending on characteristics of your problem that are not described.... As noted, if you are trying to minimize an average in the objective and both the numerator and the denominator are variables, the resulting expression is non-linear, and you'll need to consider a substitute objective or move outside of pulp and look at non-linear formulations and non-linear solvers.
Idea #1: Use a penalty for the number of items used.
You can introduce (and properly constrain with a big-M constraint) a new binary variable y_i ∈ {0, 1} that is 1 if x_i is used, together with a sensibly chosen weight w, and use an objective like:
obj = ∑ x_i + w * ∑ y_i ; minimized
which might work OK if the x_i are in a range such that a logical w can be generated.
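A minimal PuLP sketch of this idea. The three variables, the big-M value, the weight w, and the total-requirement constraint are all illustrative assumptions, not taken from the question:
from pulp import LpProblem, LpVariable, LpMinimize, lpSum

M = 1000           # big-M: an assumed upper bound on any x_i
w = 50             # assumed penalty weight per item used
idx = range(3)

prob = LpProblem("penalty_for_items_used", LpMinimize)
x = LpVariable.dicts("x", idx, lowBound=0)
y = LpVariable.dicts("y", idx, cat="Binary")

for i in idx:
    prob += x[i] <= M * y[i]             # x_i can be positive only if y_i = 1

prob += lpSum(x[i] for i in idx) >= 120  # assumed requirement so something must be used

prob += lpSum(x[i] for i in idx) + w * lpSum(y[i] for i in idx)
prob.solve()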
Or...
Idea #2: Use a mini-max or maxi-min constraint
If you are seeking an aggregate total of the x_i used while minimizing the number that are used, and there is some "trade space" in the model, you can set the objective to "maximize the minimum used x_i value". Whether this works again depends on the other characteristics of your model, but it should have a similar effect by encouraging the model to pick larger x_i to reach the target value. In pseudocode:
Introduce y and z ...
y_i ∈ {0, 1}
x_i <= y_i * M
z ∈ Reals
z <= x_i + (1-y_i)*M # constrain z to the lowest x_i used...
obj = max(z)
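The same pseudocode translated into a PuLP sketch, again with an assumed big-M, an assumed number of variables, and an assumed side constraint so the model has something to do:
from pulp import LpProblem, LpVariable, LpMaximize, lpSum

M = 1000                        # big-M: assumed upper bound on any x_i
idx = range(3)

prob = LpProblem("maximin_used_x", LpMaximize)
x = LpVariable.dicts("x", idx, lowBound=0)
y = LpVariable.dicts("y", idx, cat="Binary")
z = LpVariable("z")             # will end up as the minimum of the x_i that are used

for i in idx:
    prob += x[i] <= M * y[i]                # x_i can be positive only if it is "used"
    prob += z <= x[i] + (1 - y[i]) * M      # z lies below every used x_i

prob += lpSum(x[i] for i in idx) >= 120     # assumed requirement

prob += z                                   # objective: maximize the smallest used x_i
prob.solve()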

Related

Maximise a groups probability of reaching a score within PuLP

Using Python, I have a linear programming solution in PuLP which selects 6 players within a budget constraint while maximising a specified parameter.
However, I want to maximise a probability parameter for each team of 6 players.
Namely, I want to input a mean and standard deviation for each player and then maximise the percentage chance of the team reaching a predetermined score. This requires summing the means and standard deviations of the 6 players in each team, then calculating the percentage chance of them surpassing the score (I have been using scipy.stats.norm to do this).
The problem I am having is that I am not sure whether it is possible to maximise this parameter within a linear programming module like PuLP. I cannot get it to maximise the probability after summing each team's mean and standard deviation.
I have tried estimating this value by multiplying each individual's mean and standard deviation by 6, thus creating a dummy team, calculating the probability of reaching the predetermined score, then scaling back down and maximising the sum of these values in each team. This gets close but is not as accurate as I want. This is the code I have so far:
lineup dataframe:

index               mu          std         Salary
Rory McIlroy        73.450198   10.455766   11100.
Scottie Scheffler   72.652175    9.477475   11000.
Jon Rahm            73.033862   10.293721   10800.
Justin Thomas       73.886648   10.426305   10500.
Collin Morikawa     68.409628   10.588617   10300.
from pulp import LpProblem, LpVariable, LpMaximize, LpInteger
from scipy.stats import norm

target_score = 600
limit = 50000
# lineup2 is the dataframe shown above
lineup_im = lineup2.set_index('index')
w = lineup_im.Salary
v = lineup_im.mu
z = lineup_im['std']
items = list(sorted(lineup_im.index))
# Create model
m = LpProblem("Knapsack", LpMaximize)
# Variables
x = LpVariable.dicts('p', items, lowBound=0, upBound=1, cat=LpInteger)
# Objective
m += sum((((1-(norm(loc=v[i]*6, scale=z[i]*6).cdf(target_score)))))*x[i] for i in items)/6
# Constraint
m += sum(w[i]*x[i] for i in items) <= limit
m += sum(x[i] for i in items) == 6
# Optimize
m.solve()
Is there a way to do this within Pulp or another LP module in python?
Welcome to the site and nice post w/ data!
You have a chicken vs. egg problem here... Let me explain...
The parameter you want to get at is the CDF of the team score which, if you assume it is normally distributed, has a mean equal to the sum of the players' means and a variance equal to the sum of the players' variances... That's how it works for the normal distribution, right?
So all of those things are known values (parameters) in your problem, based on the player data. You just haven't calculated the team's CDF for all of the possible teams. The problem is that you cannot do that as some kind of callback after using the optimizer to pick team membership; it must be done in advance. The pulp solver does not have the ability/linkages to call out to scipy to get the CDF "on the fly". So you have a couple of options...
You could reformulate your problem in terms of the teams, expand your data set, and just have a binary variable for which team is selected. But that seems like a waste, because you are essentially having the solver pick the single best team with only one constraint (total salary), which makes me think you should just brute force this (see below).
You could just brute force this. If you are considering 100 players and you are choosing 6, that is C(100, 6) ≈ 1.2 billion combinations. So I would put the data into dictionaries for fast lookup, use itertools to run through the combinations, first screen for the total salary cap, and if that passes, compute the team CDF/p-value for the target score and keep track of the max value.
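A rough sketch of that brute-force approach, assuming the lineup_im dataframe, target_score, and limit from the question (with only the five example rows shown above there are no 6-player teams, so this only makes sense on the full player pool):
from itertools import combinations
from scipy.stats import norm

# dictionaries for fast lookup, keyed by player name
mu = lineup_im.mu.to_dict()
sd = lineup_im['std'].to_dict()
sal = lineup_im.Salary.to_dict()

best_team, best_p = None, -1.0
for team in combinations(lineup_im.index, 6):
    if sum(sal[p] for p in team) > limit:         # cheap salary screen first
        continue
    team_mu = sum(mu[p] for p in team)
    team_sd = sum(sd[p]**2 for p in team) ** 0.5  # variances add, not standard deviations
    p_exceed = 1 - norm(loc=team_mu, scale=team_sd).cdf(target_score)
    if p_exceed > best_p:
        best_team, best_p = team, p_exceed

print(best_team, best_p)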
Define decision variable x_i to indicate whether player i is selected for the team. From the basics of independent random variables, if we define mu_i to be the mean for each player i and sd_i to be their standard deviation, then:
mu_team = \sum_i mu_i*x_i
var_team = \sum_i sd_i^2*x_i
sd_team = sqrt(var_team)
You seek to maximize the probability that a normal random variable with mean mu_team and standard deviation sd_team exceeds some target score S. Conveniently, this is equivalent to minimizing the Z-score of the value S for that random variable:
z_team = (S-mu_team) / sd_team
It's now clear that you could reformulate your optimization model as minimizing z_team subject to your budget and team size constraints. However, z_team is non-linear --- it's a linear function of the decision variables divided by the square root of another linear function of the decision variables. In general mixed integer optimization problems with non-linear objective functions are not so trivial to solve; you won't be able to do it "out of the box" with pulp.
Not all is lost, though! Notice that we're basically balancing quantity S-mu_team with quantity sd_team. If we can construct teams with mu_team > S, then we'd ideally like teams with large mu_team and small sd_team, which enables as negative a z_team value as possible. If we could build a tradeoff curve between achievable mu_team and sd_team values, we could quickly identify the best achievable z_team value. Similarly, if all teams have mu_team < S, then we'd ideally like teams with large mu_team and large sd_team to get a z_team value as close as possible to 0; again, a tradeoff curve would be helpful.
This leads us to a simple solution:
Maximize mu_team subject to budget and team size constraints. Call the best achievable mu_team value M. In the special case of M=S, the best achievable z_team value is 0, and you are done. Otherwise, continue.
Build an efficient frontier trading off mu_team and sd_team:
If M > S, then maximize mu_team - alpha*var_team for various constants alpha >= 0
If M < S, then maximize mu_team + alpha*var_team for various constants alpha >= 0
Compute z_team for each solution in your efficient frontier, and select the one with the smallest z_team value.
Note that each of the optimization problems in steps 1 and 2 now have a linear objective value (both mu_team and var_team are linear in the decision variables), so they will be easily solvable with pulp.
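A sketch of step 2 in PuLP, assuming the same lineup_im dataframe, limit, and team size as the question, that step 1 found M > S (so we subtract alpha*var_team), and a hand-picked grid of alpha values:
from pulp import LpProblem, LpVariable, LpMaximize, LpBinary, lpSum, value

items = list(lineup_im.index)
mu = lineup_im.mu.to_dict()
var = (lineup_im['std'] ** 2).to_dict()
sal = lineup_im.Salary.to_dict()
S = target_score

frontier = []
for alpha in [0.0, 0.05, 0.1, 0.2, 0.5, 1.0]:      # assumed tradeoff weights
    m = LpProblem("frontier_point", LpMaximize)
    x = LpVariable.dicts('p', items, cat=LpBinary)
    m += lpSum((mu[i] - alpha * var[i]) * x[i] for i in items)   # M > S case
    m += lpSum(sal[i] * x[i] for i in items) <= limit
    m += lpSum(x[i] for i in items) == 6
    m.solve()
    team = [i for i in items if value(x[i]) > 0.5]
    mu_team = sum(mu[i] for i in team)
    sd_team = sum(var[i] for i in team) ** 0.5
    frontier.append(((S - mu_team) / sd_team, team))

best_z, best_team = min(frontier, key=lambda point: point[0])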

How to define the objective function for integer optimization task?

I need to find the k in the range [1, 10] that is the least positive integer such that binomial(k, 2) ≥ m, where m ≥ 3 is an integer. The binomial() function is the binomial coefficient.
My attempt is:
After some algebraic steps I arrived at the minimization task: minimize k subject to k(k-1) - 2m ≥ 0, with m ≥ 3. I have defined the objective function and its gradient. In the objective function I fixed m = 3, and my problem is how to restrict the variable to an integer domain.
from numpy import arange
from scipy.optimize import line_search

# objective function
def objective(k):
    m = 3
    return k*(k-1) - 2*m

# gradient for the objective function
def gradient(k):
    return 2.0*k - 1

# define range
r_min, r_max = 1, 11
# prepare inputs
inputs = arange(r_min, r_max, 1)
# compute targets
targets = [objective(k) for k in inputs]
# define the starting point
point = 1.0
# define the direction to move
direction = 1.0
# print the initial conditions
print('start=%.1f, direction=%.1f' % (point, direction))
# perform the line search
result = line_search(objective, gradient, point, direction)
print(result)
I get the warning:
LineSearchWarning: The line search algorithm did not converge
Question: how should the objective function be defined in Python?
You are looking for the smallest k such that k(k-1) - 2m ≥ 0, with the additional constraint on k (1 ≤ k ≤ 10) that we'll come back to later. You can solve this inequality explicitly by first solving the corresponding equation, that is, finding the roots of P := X² - X - 2m. The quadratic formula gives the roots (1 ± √(1 + 8m))/2. Since P(x) → ∞ as x → ±∞, the x that satisfy your inequality are the ones above the greatest root or below the smallest root. Since you are only interested in positive solutions, and since (1 - √(1 + 8m))/2 < 0, the set of relevant solutions is [(1 + √(1 + 8m))/2, ∞). Among these solutions, the smallest integer is the ceiling of (1 + √(1 + 8m))/2, which is strictly greater than 1. Let k = ceil((1 + sqrt(1 + 8*m))/2) be that integer. If k ≤ 10, then your problem has a solution, namely k. Otherwise, your problem has no solution. In Python, you get the following:
import math

def solve(m):
    k = math.ceil((1 + math.sqrt(1 + 8*m)) / 2)
    return k if k <= 10 else None
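For example, with the corrected formula:
solve(3)     # -> 3, since binomial(3, 2) = 3 >= 3 but binomial(2, 2) = 1 < 3
solve(45)    # -> 10, since binomial(10, 2) = 45
solve(46)    # -> None: the smallest valid k would be 11, which is outside [1, 10]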

How to minimize the error to a given dataset

Let's assume a function
f(x,y) = z
Now I want to choose x so that the output of f matches real data, while y decreases in equidistant steps from 1 to zero. The output of f is calculated by a set of differential equations.
How can I select x so that the error with respect to the real outputs is as small as possible? Assume I know a set of z-values, namely
f(x,1) = z_1
f(x,0.9) = z_2
f(x,0.8) = z_3
Now find x such that the error with respect to the real data z_1, z_2, z_3 is minimal.
How can one do this?
A common method of optimizing is least-squares fitting, in which you try to find params such that the sum of squares sum_i (f(params, xdata_i) - ydata_i)^2 is minimized for the given xdata and ydata. In your case, params would be x, the xdata_i would be 1, 0.9 and 0.8, and the ydata_i would be z_1, z_2 and z_3.
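A minimal sketch of this with scipy, where f is a placeholder (in the real problem it would integrate the differential equations) and the z values are made up:
import numpy as np
from scipy.optimize import least_squares

y_values = np.array([1.0, 0.9, 0.8])
z_data = np.array([2.1, 1.9, 1.7])            # measured z_1, z_2, z_3 (made-up numbers)

def f(x, y):
    # placeholder model; replace with the ODE-based computation of z
    return x * y

def residuals(params):
    x = params[0]
    return np.array([f(x, y) - z for y, z in zip(y_values, z_data)])

result = least_squares(residuals, x0=[1.0])   # initial guess for x
print(result.x)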
You should consider the package scipy.optimize. It's used in finding parameters for a function. I think this page gives quite a good example on how to use it.

How to find the maximum of a prob in PuLP

I am trying to solve a linear problem in PuLP that minimizes a cost function. The cost function itself depends on the maximum value of the daily cost: I have a daily cost, and I am trying to minimize the monthly cost, which is the sum of the daily costs plus the maximum daily cost in the month. I don't think I'm capturing the maximum value in the final solution, and I'm not sure how to go about troubleshooting this issue. The basic outline of the code is below:
from pulp import *

# Initialize the problem to be solved
prob = LpProblem("monthly_cost", LpMinimize)
# The number of time steps
# price is a pre-existing array of variable prices
tmax = len(price)
# Time range
time = list(range(tmax))
# Price reduction at every time step
d = LpVariable.dict("d", (time), 0, 5)
# Price increase at every time step
c = LpVariable.dict("c", (time), 0, 5)
# Define revenues = price increase - price reduction + initial price
revenue = ([(c[t] - d[t] + price[t]) for t in time])
# Find maximum revenue
max_revenue = max(revenue)
# Initialize the problem
prob += sum([revenue[t]*0.0245 for t in time]) + max_revenue
# Solve the problem
prob.solve()
The variable max_revenue always equals c_0 - d_0 + price[0] even though price[0] is not the maximum of price and c_0 and d_0 both equal 0. Does anyone know how to ensure the dynamic maximum is being inserted into the problem? Thanks!
I don't think you can do the following in PuLP or any other standard LP solvers:
max_revenue = max(revenue)
This is because determining the maximum will require the solver to evaluate revenue equations; so in this case, I don't think you can extract a standard LP model. Such models are in fact non-smooth.
In such situations, you can easily reformulate the problem as follows:
max_revenue >= c[t] - d[t] + price[t]    for every t in time
This works because max_revenue is then at least as large as every revenue term, and minimizing the objective pushes it down onto the largest one. This in turn lets a standard LP model be extracted from the equations. So the original problem formulation gets extended with additional inequality constraints (the equality constraints and the objective function stay the same as before). It could look something like this (word of caution: I have not tested this):
# Define the variable
max_revenue = LpVariable("max_revenue", 0)
# Define other variables, revenues, etc.
# Add the inequality constraints
for item in revenue:
    prob += max_revenue >= item
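Putting it together, a minimal end-to-end sketch of this reformulation (the price list here is made up, since the real price data isn't shown):
from pulp import LpProblem, LpVariable, LpMinimize, lpSum, value

price = [30.0, 45.0, 25.0, 50.0]                 # made-up example prices
time = list(range(len(price)))

prob = LpProblem("monthly_cost", LpMinimize)
d = LpVariable.dicts("d", time, 0, 5)            # price reduction at every time step
c = LpVariable.dicts("c", time, 0, 5)            # price increase at every time step
revenue = [c[t] - d[t] + price[t] for t in time]

max_revenue = LpVariable("max_revenue", 0)
for item in revenue:
    prob += max_revenue >= item                  # max_revenue bounds every daily term

prob += lpSum(r * 0.0245 for r in revenue) + max_revenue
prob.solve()
print(value(max_revenue))                        # equals the true maximum once minimized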
I would also suggest that you have a look at scipy.optimize.linprog. PuLP writes the model to an intermediary file and then calls the installed solver to solve the model, whereas in scipy.optimize.linprog it is all done in Python and should be faster. However, if your problem cannot be solved using the simplex algorithm, or you require other professional solvers (e.g. CPLEX, Gurobi, etc.), then PuLP is a good choice.
Also, see the discussion on Data Fitting (page 19) in Introduction to Linear Optimisation by Bertsimas.
Hope this helps. Cheers.

Fitting curve: why small numbers are better?

I spent some time these days on a problem. I have a set of data:
y = f(t), where y is a very small concentration (about 10^-7) and t is in seconds. t varies from 0 to around 12000.
The measurements follow an established model:
y = Vs * t - ((Vs - Vi) * (1 - np.exp(-k * t)) / k)
And I need to find Vs, Vi, and k. So I used curve_fit, which returns the best fitting parameters, and I plotted the curve.
And then I used a similar model:
y = (Vs * t/3600 - ((Vs - Vi) * (1 - np.exp(-k * t/3600)) / k)) * 10**7
By doing that, t is a number of hours and y is a number between 0 and about 10. The parameters returned are of course different. But when I plot each curve, here is what I get:
http://i.imgur.com/XLa4LtL.png
The green fit is the first model, the blue one the "normalized" model, and the red dots are the experimental values.
The fitted curves are different. I think that's not expected, and I don't understand why. Are the calculations more accurate if the numbers are "reasonable"?
The docstring for optimize.curve_fit says,
p0 : None, scalar, or M-length sequence
Initial guess for the parameters. If None, then the initial
values will all be 1 (if the number of parameters for the function
can be determined using introspection, otherwise a ValueError
is raised).
Thus, to begin with, the initial guess for the parameters is by default 1.
Moreover, curve fitting algorithms have to sample the function for various values of the parameters. The "various values" are initially chosen with an initial step size on the order of 1. The algorithm will work better if your data varies somewhat smoothly with changes in the parameter values that are on the order of 1.
If the function varies wildly with parameter changes on the order of 1, then the algorithm may tend to miss the optimum parameter values.
Note that even if the algorithm uses an adaptive step size when it tweaks the parameter values, if the initial tweak is so far off the mark as to produce a big residual, and if tweaking in some other direction happens to produce a smaller residual, then the algorithm may wander off in the wrong direction and miss the local minimum. It may find some other (undesired) local minimum, or simply fail to converge. So using an algorithm with an adaptive step size won't necessarily save you.
The moral of the story is that scaling your data can improve the algorithm's chances of finding the desired minimum.
Numerical algorithms in general all tend to work better when applied to data whose magnitude is on the order of 1. This bias enters into the algorithm in numerous ways. For instance, optimize.curve_fit relies on optimize.leastsq, and the call signature for optimize.leastsq is:
def leastsq(func, x0, args=(), Dfun=None, full_output=0,
            col_deriv=0, ftol=1.49012e-8, xtol=1.49012e-8,
            gtol=0.0, maxfev=0, epsfcn=None, factor=100, diag=None):
Thus, by default, the tolerances ftol and xtol are on the order of 1e-8. If finding the optimum parameter values requires much smaller tolerances, then these hard-coded default numbers will cause optimize.curve_fit to miss the optimum parameter values.
To make this more concrete, suppose you were trying to minimize f(x) = 1e-100*x**2. The factor of 1e-100 squashes the y-values so much that a wide range of x-values (the parameter values mentioned above) will fit within the tolerance of 1e-8. So, with un-ideal scaling, leastsq will not do a good job of finding the minimum.
Another reason to use floats on the order of 1 is because there are many more (IEEE754) floats in the interval [-1,1] than there are far away from 1. For example,
import struct

def floats_between(x, y):
    """
    http://stackoverflow.com/a/3587987/190597 (jsbueno)
    """
    a = struct.pack("<dd", x, y)
    b = struct.unpack("<qq", a)
    return b[1] - b[0]
In [26]: floats_between(0,1) / float(floats_between(1e6,1e7))
Out[26]: 311.4397707054894
This shows there are over 300 times as many floats representing numbers between 0 and 1 than there are in the interval [1e6, 1e7].
Thus, all else being equal, you'll typically get a more accurate answer if working with small numbers than very large numbers.
I would imagine it has more to do with the initial parameter estimates you are passing to curve fit. If you are not passing any I believe they all default to 1. Normalizing your data makes those initial estimates closer to the truth. If you don't want to use normalized data just pass the initial estimates yourself and give them reasonable values.
Others have already mentioned that you probably need a good starting guess for your fit. In cases like this, I usually try to find some quick and dirty tricks to get at least a ballpark estimate of the parameters. In your case, for large t, the exponential decays pretty quickly to zero, so for large t you have
y == Vs * t - (Vs - Vi) / k
Doing a first-order linear fit like
[slope1, offset1] = polyfit(t[t > 2000], y[t > 2000], 1)
you will get slope1 == Vs and offset1 == (Vi - Vs) / k.
Subtracting this straight line from all the points you have, you get the exponential
residual == y - slope1 * t - offset1 == (Vs - Vi) * exp(-t * k) / k
Taking the log of both sides, you get
log(residual) == log((Vs - Vi) / k) - t * k
So doing a second fit
[slope2, offset2] = polyfit(t, log(y - slope1 * t - offset1), 1)
will give you slope2 == -k and offset2 == log((Vs - Vi) / k), which should be solvable for Vi since you already know Vs and, from slope2, k. You might have to limit the second fit to small values of t, otherwise you might be taking the log of negative numbers. Collect all the parameters you obtained with these fits and use them as the starting points for your curve_fit.
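A rough sketch of this two-stage estimate, assuming t and y are the measured numpy arrays and using the model from the question (the t > 2000 cutoff is the one suggested above and may need tuning):
import numpy as np
from scipy.optimize import curve_fit

def model(t, Vs, Vi, k):
    # the measurement model from the question
    return Vs * t - (Vs - Vi) * (1 - np.exp(-k * t)) / k

# first fit: straight line in the region where the exponential has died out
late = t > 2000
slope1, offset1 = np.polyfit(t[late], y[late], 1)     # slope1 ~ Vs, offset1 ~ (Vi - Vs)/k

# second fit: log of the residual against t, restricted to early, positive residuals
residual = y - slope1 * t - offset1                    # ~ (Vs - Vi) * exp(-k*t) / k
early = (t < 2000) & (residual > 0)
slope2, offset2 = np.polyfit(t[early], np.log(residual[early]), 1)

k0 = -slope2
Vs0 = slope1
Vi0 = Vs0 - k0 * np.exp(offset2)                       # from offset2 ~ log((Vs - Vi)/k)

popt, pcov = curve_fit(model, t, y, p0=[Vs0, Vi0, k0])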
Finally, you might want to look into doing some sort of weighted fit. The information about the exponential part of your curve is contained in just the first few points, so maybe you should give those a higher weight. Doing this in a statistically correct way is not trivial.
