I'm trying to find the best possible combination that will maximize my sum value, but it has to be under 2 specific constraints, therefore I am assuming Linear programming will be the best fit.
The problem goes like this:
An international educational event wishes to gather the world's smartest teenage students.
Every state tested 100K students on the following exams: 'MATH', 'ENGLISH', 'COMPUTERS', 'HISTORY', 'PHYSICS', and each student was graded 0-100 on EACH exam.
Every state was requested to send their best 10K from the tested 100K students for the event.
You, as the French representative, were requested to choose the top 10K students from the 100K tested students from your country. For that, you'll need to choose them so that you get the best possible TOTAL SCORE.
BUT there are 2 main constraints:
1- Each of the 10K chosen students must be allocated to exactly 1 of the 5 subjects mentioned above, and will be tested at the event on that subject only.
The allocation needed is: ['MATH': 4000, 'ENGLISH':3000,'COMPUTERS':2000, 'HISTORY':750,'PHYSICS':250]
2- Each exam subject's score is weighted differently. For example, a 97 in Math is worth more than a 97 in History.
The weights are: ['MATH': 1.9, 'ENGLISH':1.7,'COMPUTERS':1.5, 'HISTORY':1.3,'PHYSICS':1.1]
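To make the objective and the two constraints concrete, here is a minimal brute-force sketch on a toy instance. The scores, the two-subject setup and all names are made up for illustration, and the approach only works for a handful of students; it is not meant as a way to solve the 100K case.

```python
from itertools import permutations

# Toy instance (hypothetical numbers): 4 students, 2 subjects.
scores = [
    [90, 60],   # student 0: MATH, HISTORY
    [80, 95],   # student 1
    [70, 50],   # student 2
    [85, 88],   # student 3
]
weights = [1.9, 1.3]   # a MATH point counts more than a HISTORY point
demand = [2, 1]        # 2 students for MATH, 1 for HISTORY

# One slot per required seat: [0, 0, 1] means two MATH seats and one HISTORY seat.
slots = [subj for subj, d in enumerate(demand) for _ in range(d)]

def total(assignment):
    """Weighted score when slot k is given to student assignment[k]."""
    return sum(weights[subj] * scores[stu][subj]
               for stu, subj in zip(assignment, slots))

# Every ordered choice of distinct students for the slots,
# so each student is used at most once.
best = max(permutations(range(len(scores)), len(slots)), key=total)
best_value = total(best)
print(best, best_value)
```

Here students 0 and 3 take the MATH seats and student 1 takes the HISTORY seat, even though student 1 also has a decent MATH score: the weighted HISTORY score gains more overall.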
MY SOLUTION:
I tried to use PuLP (Python) as an LP library and it solved the problem correctly, but it took more than 2 HOURS to run.
Can you find a better (faster, simpler...) way to solve it?
Are there NumPy LP functions that could be used instead and that might be faster?
It is supposed to be a simple OPTIMIZATION problem, but I made it too slow and complex.
--The solution needs to be in Python only please
For example, let's look at the same problem on a small scale:
there are 30 students and you need to choose only 15 of them, so that they give the best combination with respect to the following subject allocation demand.
the allocation needed is- ['MATH': 5, 'ENGLISH':4,'COMPUTERS':3, 'HISTORY':2,'PHYSICS':1]
this is all the 30 students and their grades:
after running the algorithm, the output solution will be:
here is my full code for the ORIGINAL question (100K students):
import pandas as pd
import numpy as np
import pulp as p
import time
t0=time.time()
demand = [4000, 3000, 2000, 750,250]
weight = [1.9,1.7, 1.5, 1.3, 1.1]
original_data= pd.read_csv('GRADE_100K.csv') #created simple csv file with random scores
data_c=original_data.copy()
data_c.index = np.arange(1, len(data_c)+1)
data_c.columns
data_c=data_c[['STUDENT_ID', 'MATH', 'ENGLISH', 'COMPUTERS', 'HISTORY','PHYSICS']]
#DataFrame Shape
m=data_c.shape[1]
n=data_c.shape[0]
data=[]
sublist=[]
for j in range(0,n):
    for i in range(1,m):
        sublist.append(data_c.iloc[j,i])
    data.append(sublist)
    sublist=[]
def _get_num_students(data):
    return len(data)
def _get_num_subjects(data):
    return len(data[0])
def _get_weighted_data(data, weight):
    return [
        [a*b for a, b in zip(row, weight)]
        for row in data
    ]
data = _get_weighted_data(data, weight)
num_students = _get_num_students(data)
num_subjects = _get_num_subjects(data)
# Create an LP maximization problem
Lp_prob = p.LpProblem('Problem', p.LpMaximize)
# Create problem Variables
variables_matrix = [[0 for i in range(num_subjects)] for j in range(num_students)]
for i in range(0, num_students):
    for j in range(0, num_subjects):
        variables_matrix[i][j] = p.LpVariable(f"X({i+1},{j+1})", 0, 1, cat='Integer')
df_d=pd.DataFrame(data=data)
df_v=pd.DataFrame(data=variables_matrix)
ml=df_d.mul(df_v)
ml['coeff'] = ml.sum(axis=1)
coefficients=ml['coeff'].tolist()
# DEALING WITH TARGET FUNCTION VALUE
suming=0
k=0
sumsum=[]
for z in range(len(coefficients)):
    suming +=coefficients[z]
    if z % 2000==0:
        sumsum.append(suming)
        suming=0
if z<2000:
    sumsum.append(suming)
sumsuming=0
for s in range(len(sumsum)):
    sumsuming=sumsuming+sumsum[s]
Lp_prob += sumsuming
# DEALING WITH the 2 CONSTRAINS
# 1-subject constraints
con1_suming=0
for e in range(num_subjects):
    L=df_v.iloc[:,e].to_list()
    for t in range(len(L)):
        con1_suming +=L[t]
    Lp_prob += con1_suming <= demand[e]
    con1_suming=0
# 2- students constraints
con2_suming=0
for e in range(num_students):
    L=df_v.iloc[e,:].to_list()
    for t in range(len(L)):
        con2_suming +=L[t]
    Lp_prob += con2_suming <= 1
    con2_suming=0
print("time taken for TARGET+CONSTRAINS %8.8f seconds" % (time.time()-t0) )
t1=time.time()
status = Lp_prob.solve() # Solver
print("time taken for SOLVER %8.8f seconds" % (time.time()-t1) ) # 632 SECONDS
print(p.LpStatus[status]) # The solution status
print(p.value(Lp_prob.objective))
df_v=pd.DataFrame(data=variables_matrix)
# Printing the final solution
lst=[]
val=[]
for i in range(0, num_students):
    lst.append([p.value(variables_matrix[i][j]) for j in range(0, num_subjects)])
    val.append([sum([p.value(variables_matrix[i][j]) for j in range(0, num_subjects)]),i])
ones_places=[]
for i in range (0, len(val)):
    if val[i][0]==1:
        ones_places.append(i+1)
len(ones_places)
data_once=data_c[data_c['STUDENT_ID'].isin(ones_places)]
IDs=[]
for i in range(len(ones_places)):
    IDs.append(data_once['STUDENT_ID'].to_list()[i])
course=[]
sub_course=[]
for i in range(len(lst)):
    j=0
    sub_course='x'
    while j<len(lst[i]):
        if lst[i][j]==1:
            sub_course=j
        j=j+1
    course.append(sub_course)
coures_ones=[]
for i in range(len(course)):
    if course[i]!= 'x':
        coures_ones.append(course[i])
# adding the COURSE name to the final table
# NUMBER OF DICTIONARY KEYS based on number of COURSES
col=original_data.columns.values[1:].tolist()
dic = {0:col[0], 1:col[1], 2:col[2], 3:col[3], 4:col[4]}
cc_name=[dic.get(n, n) for n in coures_ones]
one_c=[]
if len(IDs)==len(cc_name):
    for i in range(len(IDs)):
        one_c.append([IDs[i],cc_name[i]])
prob=[]
if len(IDs)==len(cc_name):
    for i in range(len(IDs)):
        prob.append([IDs[i],cc_name[i], data_once.iloc[i][one_c[i][1]]])
scoring_table = pd.DataFrame(prob,columns=['STUDENT_ID','COURES','SCORE'])
scoring_table.sort_values(by=['COURES', 'SCORE'], ascending=[False, False], inplace=True)
scoring_table.index = np.arange(1, len(scoring_table)+1)
print(scoring_table)
I think you're close on this. It is a fairly standard Integer Linear Program (ILP) assignment problem. It's gonna be a bit slow because of the structure of the problem.
You didn't say in your post what the breakdown of the setup & solve times were. I see you are reading from a file and using pandas. I think pandas gets clunky pretty quick with optimization problems, but that is just a personal preference.
I coded your problem up in pyomo, using the cbc solver, which I'm pretty sure is the same one used by pulp for comparison. (see below). I think you have it right with 2 constraints and a dual-indexed binary decision variable.
If I chop it down to 10K students (no slack...just 1-for-1 pairing) it solves in 14sec for comparison. My setup is a 5 year old iMac w/ lots of ram.
Running with 100K students in the pool, it solves in about 25min with 10sec "setup" time before the solver is invoked. So I'm not really sure why your encoding is taking 2hrs. If you can break down your solver time, that would help. The rest should be trivial. I didn't poke around too much in the output, but the OBJ function value of about 990K seems reasonable.
Other ideas:
If you can get the solver options to configure properly and set a mip gap of 0.05 or so, it should speed things way up, if you can accept a slightly non-optimal solution. I've only had decent luck with solver options with the paid-for solvers like Gurobi. I haven't had much luck with that using the freebie solvers, YMMV.
import pyomo.environ as pyo
from random import randint
from time import time
# start setup clock
tic = time()
# exam types
subjects = ['Math', 'English', 'Computers', 'History', 'Physics']
# make set of students...
num_students = 100_000
students = [f'student_{s}' for s in range(num_students)]
# make 100K fake scores in "flat" format
student_scores = { (student, subj) : randint(0,100)
                   for student in students
                   for subj in subjects}
assignments = { 'Math': 4000, 'English': 3000, 'Computers': 2000, 'History': 750, 'Physics': 250}
weights = {'Math': 1.9, 'English': 1.7, 'Computers': 1.5, 'History': 1.3, 'Physics': 1.1}
# Set up model
m = pyo.ConcreteModel('exam assignments')
# Sets
m.subjects = pyo.Set(initialize=subjects)
m.students = pyo.Set(initialize=students)
# Parameters
m.assignments = pyo.Param(m.subjects, initialize=assignments)
m.weights = pyo.Param(m.subjects, initialize=weights)
m.scores = pyo.Param(m.students, m.subjects, initialize=student_scores)
# Variables
m.x = pyo.Var(m.students, m.subjects, within=pyo.Binary) # binary selection of pairing student to test
# Objective
# NOTE: m.weights is defined above but not applied here; multiply each term
# by m.weights[subject] to weight the scores as described in the question.
m.OBJ = pyo.Objective(expr=sum(m.scores[student, subject] * m.x[student, subject]
                               for student in m.students
                               for subject in m.subjects), sense=pyo.maximize)
### Constraints ###
# fill all assignments
def fill_assignments(m, subject):
    return sum(m.x[student, subject] for student in m.students) == assignments[subject]
m.C1 = pyo.Constraint(m.subjects, rule=fill_assignments)
# use each student at most 1 time
def limit_student(m, student):
    return sum(m.x[student, subject] for subject in m.subjects) <= 1
m.C2 = pyo.Constraint(m.students, rule=limit_student)
toc = time()
print (f'setup time: {toc-tic:0.3f}')
tic = toc
# solve it..
solver = pyo.SolverFactory('cbc')
solution = solver.solve(m)
print(solution)
toc = time()
print (f'solve time: {toc-tic:0.3f}')
Output
setup time: 10.835
Problem:
- Name: unknown
Lower bound: -989790.0
Upper bound: -989790.0
Number of objectives: 1
Number of constraints: 100005
Number of variables: 500000
Number of binary variables: 500000
Number of integer variables: 500000
Number of nonzeros: 495094
Sense: maximize
Solver:
- Status: ok
User time: -1.0
System time: 1521.55
Wallclock time: 1533.36
Termination condition: optimal
Termination message: Model was solved to optimality (subject to tolerances), and an optimal solution is available.
Statistics:
Branch and bound:
Number of bounded subproblems: 0
Number of created subproblems: 0
Black box:
Number of iterations: 0
Error rc: 0
Time: 1533.8383190631866
Solution:
- number of solutions: 0
number of solutions displayed: 0
solve time: 1550.528
Here are some more details on my idea of using min cost flows.
We model this problem by taking a directed graph with 4 layers, where each layer is fully connected to the next.
Nodes
First layer: A single node s that will be our source.
Second layer: One node for each student.
Third layer: One node for each subject.
Fourth layer: A single node t that will be our drain.
Edge Capacities
First -> Second: All edges have capacity 1.
Second -> Third: All edges have capacity 1.
Third -> Fourth: Each edge has a capacity equal to the number of students that have to be assigned to that subject.
Edge Costs
First -> Second: All edges have cost 0.
Second -> Third: Remember that edges in this layer connect a student with a subject. The cost on each of these edges is the negative of the weighted score the student has on that subject.
cost = -subject_weight*student_subject_score.
Third -> Fourth: All edges have cost 0.
Then we demand a flow from s to t equal to the number of students we have to choose.
Why does this work?
A solution to the min cost flow problem corresponds to a solution of your problem by taking all the edges between the second and third layer that carry flow as assignments.
Each student can be chosen for at most one subject, as the corresponding node has only one incoming edge, with capacity 1.
Each subject gets exactly the required number of students, as the outgoing capacity equals the number of students we have to choose for this subject, and we have to use the full capacity of these edges, since we cannot fulfil the flow demand otherwise.
A minimal solution to the MCF problem corresponds to a maximal solution of your problem, as the costs are the negated values the assignments give.
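As a sanity check of the construction, here is a small pure-Python sketch that builds the four-layer graph for a toy instance and solves it with a textbook successive-shortest-path routine. The toy data, the integer-scaled weights and the helper `min_cost_flow` are assumptions made for illustration; this is far too slow for 100K students and is not how ortools solves the problem.

```python
def min_cost_flow(n, edges, source, sink, flow_needed):
    """Successive shortest paths with Bellman-Ford; fine for toy graphs only."""
    # Residual graph: each arc is [to, capacity, cost, index_of_reverse_arc].
    graph = [[] for _ in range(n)]
    for u, v, cap, cost in edges:
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])
    total_cost = 0
    while flow_needed > 0:
        dist = [float("inf")] * n
        prev = [None] * n          # (node, arc position) used to reach each node
        dist[source] = 0
        for _ in range(n - 1):     # Bellman-Ford copes with the negative costs
            for u in range(n):
                if dist[u] == float("inf"):
                    continue
                for k, (v, cap, cost, _) in enumerate(graph[u]):
                    if cap > 0 and dist[u] + cost < dist[v]:
                        dist[v] = dist[u] + cost
                        prev[v] = (u, k)
        if dist[sink] == float("inf"):
            raise ValueError("demanded flow is infeasible")
        # Every arc on the path has capacity >= 1, so push one unit along it.
        v = sink
        while v != source:
            u, k = prev[v]
            graph[u][k][1] -= 1
            graph[v][graph[u][k][3]][1] += 1
            v = u
        total_cost += dist[sink]
        flow_needed -= 1
    return total_cost, graph

scores = [[90, 60], [80, 95], [70, 50], [85, 88]]   # hypothetical toy data
weights_x10 = [19, 13]    # weights 1.9 and 1.3 scaled to integers
demand = [2, 1]
n_stu, n_sub = len(scores), len(demand)
s, t = 0, n_stu + n_sub + 1                          # source and drain
edges = [(s, 1 + i, 1, 0) for i in range(n_stu)]     # first -> second layer
edges += [(1 + i, 1 + n_stu + j, 1, -weights_x10[j] * scores[i][j])
          for i in range(n_stu) for j in range(n_sub)]  # second -> third layer
edges += [(1 + n_stu + j, t, demand[j], 0) for j in range(n_sub)]  # third -> fourth

cost, graph = min_cost_flow(t + 1, edges, s, t, sum(demand))
# A student -> subject arc whose capacity dropped to 0 carries flow.
assignment = {i: arc[0] - 1 - n_stu
              for i in range(n_stu)
              for arc in graph[1 + i]
              if arc[0] > n_stu and arc[1] == 0}
best_value = -cost / 10
print(assignment, best_value)
```

The assignments are read off the student-to-subject edges, and the optimal value is the negated cost divided by the scaling factor.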
As you asked for a solution in python I implemented the min cost flow problem with ortools. Finding a solution took less than a second in my colab notebook. What takes "long" is the extraction of the solution. But including setup and solution extraction I am still having a runtime of less than 20s for the full 100000 student problem.
Code
# imports
from ortools.graph import pywrapgraph
import numpy as np
import pandas as pd
import time
t_start = time.time()
# setting given problem parameters
num_students = 100000
subjects = ['MATH', 'ENGLISH', 'COMPUTERS', 'HISTORY','PHYSICS']
num_subjects = len(subjects)
demand = [4000, 3000, 2000, 750, 250]
weight = [1.9,1.7, 1.5, 1.3, 1.1]
# generating student scores
student_scores_raw = np.random.randint(101, size=(num_students, num_subjects))
# setting up graph nodes
source_nodes = [0]
student_nodes = list(range(1, num_students+1))
subject_nodes = list(range(num_students+1, num_subjects+num_students+1))
drain_nodes = [num_students+num_subjects+1]
# setting up the min cost flow edges
start_nodes = [int(c) for c in (source_nodes*num_students + [i for i in student_nodes for _ in subject_nodes] + subject_nodes)]
end_nodes = [int(c) for c in (student_nodes + subject_nodes*num_students + drain_nodes*num_subjects)]
capacities = [int(c) for c in ([1]*num_students + [1]*num_students*num_subjects + demand)]
unit_costs = [int(c) for c in ([0.]*num_students + list((-student_scores_raw*np.array(weight)*10).flatten()) + [0.]*num_subjects)]
assert len(start_nodes) == len(end_nodes) == len(capacities) == len(unit_costs)
# setting up the min cost flow demands
supplies = [sum(demand)] + [0]*(num_students + num_subjects) + [-sum(demand)]
# initialize the min cost flow problem instance
min_cost_flow = pywrapgraph.SimpleMinCostFlow()
for i in range(0, len(start_nodes)):
    min_cost_flow.AddArcWithCapacityAndUnitCost(start_nodes[i], end_nodes[i], capacities[i], unit_costs[i])
for i in range(0, len(supplies)):
    min_cost_flow.SetNodeSupply(i, supplies[i])
# solve the problem
t_solver_start = time.time()
if min_cost_flow.Solve() == min_cost_flow.OPTIMAL:
    print('Best Value:', -min_cost_flow.OptimalCost()/10)
    print('Solver time:', str(time.time()-t_solver_start)+'s')
    print('Total Runtime until solution:', str(time.time()-t_start)+'s')
    #extracting the solution
    solution = []
    for i in range(min_cost_flow.NumArcs()):
        if min_cost_flow.Flow(i) > 0 and min_cost_flow.Tail(i) in student_nodes:
            student_id = min_cost_flow.Tail(i)-1
            subject_id = min_cost_flow.Head(i)-1-num_students
            solution.append([student_id, subjects[subject_id], student_scores_raw[student_id, subject_id]])
    assert(len(solution) == sum(demand))
    solution = pd.DataFrame(solution, columns = ['student_id', 'subject', 'score'])
    print(solution.head())
else:
    print('There was an issue with the min cost flow input.')
print('Total Runtime:', str(time.time()-t_start)+'s')
Replacing the for-loop for the solution extraction in the above code with the following list comprehension (which also avoids a list lookup in every iteration) improves the runtime significantly. For readability reasons I will leave the old solution here as well. Here is the new one:
solution = [[min_cost_flow.Tail(i)-1,
             subjects[min_cost_flow.Head(i)-1-num_students],
             student_scores_raw[min_cost_flow.Tail(i)-1, min_cost_flow.Head(i)-1-num_students]
             ]
            for i in range(min_cost_flow.NumArcs())
            if (min_cost_flow.Flow(i) > 0 and
                min_cost_flow.Tail(i) <= num_students and
                min_cost_flow.Tail(i)>0)
            ]
The following output is giving the runtimes for the new faster implementation.
Output
Best Value: 1675250.7
Solver time: 0.542395830154419s
Total Runtime until solution: 1.4248979091644287s
student_id subject score
0 3 ENGLISH 99
1 5 MATH 98
2 17 COMPUTERS 100
3 22 COMPUTERS 100
4 33 ENGLISH 100
Total Runtime: 1.752336025238037s
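One detail that may be worth checking: the unit costs are built as floats (score * weight * 10) and then converted with int(), which truncates toward zero. If a product lands a hair away from the exact integer because of floating-point rounding, the cost may be off by one unit. A minimal sketch of one way around this, assuming it is acceptable to scale the weights to integers first (`weights_x10` and `unit_cost` are illustrative names, not part of the code above):

```python
weights = [1.9, 1.7, 1.5, 1.3, 1.1]

# Scale the weights once, so every arc cost is an exact integer product.
weights_x10 = [round(w * 10) for w in weights]

def unit_cost(score, subject_index):
    """Negated weighted score, scaled by 10, computed in integer arithmetic."""
    return -score * weights_x10[subject_index]

print(weights_x10)
print(unit_cost(97, 0))
```

With integer scores and integer weights there is no rounding at all, and dividing the optimal cost by 10 at the end still recovers the weighted total.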
Please point out any mistakes I might have made.
I hope this helps. ;)
I'm trying to find the global minimum of the function from the hundred digit hundred dollars challenge, question #4 as an exercise for simulated annealing.
As the basis of my understanding and approach to writing the code, I refer to the global optimization algorithms version 3 book which is found for free online.
Consequently, I've initially come up with the following code:
The noisy func:
def noisy_func(x, y):
    return (math.exp(math.sin(50*x)) +
            math.sin(60*math.exp(y)) +
            math.sin(70*math.sin(x)) +
            math.sin(math.sin(80*y)) -
            math.sin(10*(x + y)) +
            0.25*(math.pow(x, 2) +
            math.pow(y, 2)))
The function used to mutate the values:
def mutate(X_Value, Y_Value):
    mutationResult_X = X_Value + randomNumForInput()
    mutationResult_Y = Y_Value + randomNumForInput()
    while mutationResult_X > 4 or mutationResult_X < -4:
        mutationResult_X = X_Value + randomNumForInput()
    while mutationResult_Y > 4 or mutationResult_Y < -4:
        mutationResult_Y = Y_Value + randomNumForInput()
    mutationResults = [mutationResult_X, mutationResult_Y]
    return mutationResults
randomNumForInput simply returns a random number between 4 and -4. (Interval Limits for the search.) Hence it is equivalent to random.uniform(-4, 4).
This is the central function of the program.
def simulated_annealing(f):
    """Performs simulated annealing to find a solution"""
    #Start by initializing the current state with the initial state
    #acquired by a random generation of a number and then using it
    #in the noisy func, also set solution(best_state) as current_state
    #for a start
    pCurSelect = [randomNumForInput(),randomNumForInput()]
    current_state = f(pCurSelect[0],pCurSelect[1])
    best_state = current_state
    #Begin time monitoring, this will represent the
    #Number of steps over time
    TimeStamp = 1
    #Init current temp via the func, using such values as to get the initial temp
    initial_temp = 100
    final_temp = .1
    alpha = 0.001
    num_of_steps = 1000000
    #calculates by how much the temperature should be tweaked
    #each iteration
    #suppose the number of steps is linear, we'll send in 100
    temp_Delta = calcTempDelta(initial_temp, final_temp, num_of_steps)
    #set current_temp via initial temp
    current_temp = getTemperature(initial_temp, temp_Delta)
    #max_iterations = 100
    #initial_temp = get_Temperature_Poly(TimeStamp)
    #current_temp > final_temp
    while current_temp > final_temp:
        #get a mutated value from the current value
        #hence being a 'neighbour' value
        #with it, acquire the neighbouring state
        #to the current state
        neighbour_values = mutate(pCurSelect[0], pCurSelect[1])
        neighbour_state = f(neighbour_values[0], neighbour_values[1])
        #calculate the difference between the newly mutated
        #neighbour state and the current state
        delta_E_Of_States = neighbour_state - current_state
        # Check if neighbor_state is the best state so far
        # if the new solution is better (lower), accept it
        if delta_E_Of_States <= 0:
            pCurSelect = neighbour_values
            current_state = neighbour_state
            if current_state < best_state:
                best_state = current_state
        # if the new solution is not better, accept it with a probability of e^(-cost/temp)
        else:
            if random.uniform(0, 1) < math.exp(-(delta_E_Of_States) / current_temp):
                pCurSelect = neighbour_values
                current_state = neighbour_state
        # Here, we'd decrement the temperature or increase the timestamp, normally
        """current_temp -= alpha"""
        #print("Run number: " + str(TimeStamp) + " current_state = " + str(current_state) )
        #increment TimeStamp
        TimeStamp = TimeStamp + 1
        # calc temp for next iteration
        current_temp = getTemperature(current_temp, temp_Delta)
    #print("Iteration Count: " + str(TimeStamp))
    return best_state
alpha is not used for this implementation, however temperature is moderated linearly using the following funcs:
def calcTempDelta(T_Initial, T_Final, N):
    return((T_Initial-T_Final)/N)
def getTemperature(T_old, T_new):
    return (T_old - T_new)
This is how I implemented the solution described on page 245 of the book. However, this implementation does not return the global minimum of the noisy function, but rather one of its nearby local minima.
The reasons I implemented the solution in this way are twofold:
It has been provided to me as a working example of a linear temperature moderation, and thus a working template.
Although I have tried to understand the other forms of temperature moderation laid out in the book in pages 248-249, it is not entirely clear to me how the variable "Ts" is calculated, and even after trying to look through some of the cited sources the book references, it remains esoteric for me still. Thus I figured, I'd rather try to make this "simple" solution work correctly first, before proceeding to attempt other approaches of temperature quenching (logarithmic, exponential, etc).
Since then I have tried in numerous ways to acquire the global minimum of the noisy func through various different iterations of the code, which would be too much to post here all at once. I've tried different rewrites of this code:
Decrease the randomly rolled number over each iteration in order to search within a smaller scope every time. This resulted in more consistent but still incorrect results.
Mutate by different increments, say between -1 and 1, etc. Same effect.
Rewrite mutate to examine the neighbouring points of the current point using some step size: add/subtract the step size to/from the current point's x/y values, compute the difference between each newly generated point and the current point (the delta of E's, basically), and return whichever neighbour produced the smallest difference, i.e. the closest neighbour.
Reduce the intervals limits over which the search occurs.
It is in these, the solutions involving step-size/reducing limits/checking neighbours by quadrants that I have used movements comprised of some constant alpha times the time_stamp.
These and other solutions which I've attempted have not worked, either producing even less accurate results (albeit in some cases more consistent results) or in one case, not working at all.
Therefore I must be missing something, whether it has to do with the temperature moderation or with the precise way (formula) by which I'm supposed to make the next step (mutate) in the algorithm.
I know it's a lot to take in and look at, but I'd appreciate any constructive criticism/help/advice you can provide me.
If it will be of any help to showcase code bits of the other solution attempts, I'll post them if asked.
It is important that you keep track of what you are doing.
I have put a few important tips on frigidum
The alpha cooling generally works well, it makes sure you don't speed through the interesting sweet-spot, where about 0.1 of the proposals are accepted.
Make sure your proposals are not too coarse. I have put an example where I only change x or y, but never both. The idea is that annealing will take what's best, or take a tour, and let the scheme decide.
I use the package frigidum for the algo, but it's pretty much the same as your code. Also notice I have 2 proposals, a large change and a small change; combinations usually work well.
Finally, I noticed it's hopping a lot. A small variation would be to pick the best-so-far before you go into the last 5% of your cooling.
I use/install frigidum
!pip install frigidum
And made a small change to make use of numpy arrays;
import math
def noisy_func(X):
    x, y = X
    return (math.exp(math.sin(50*x)) +
            math.sin(60*math.exp(y)) +
            math.sin(70*math.sin(x)) +
            math.sin(math.sin(80*y)) -
            math.sin(10*(x + y)) +
            0.25*(math.pow(x, 2) +
            math.pow(y, 2)))
import frigidum
import numpy as np
import random
def random_start():
    return np.random.random( 2 ) * 4
def random_small_step(x):
    if np.random.random() < .5:
        return np.clip( x + np.array( [0, 0.02 * (random.random() - .5)] ), -4,4)
    else:
        return np.clip( x + np.array( [0.02 * (random.random() - .5), 0] ), -4,4)
def random_big_step(x):
    if np.random.random() < .5:
        return np.clip( x + np.array( [0, 0.5 * (random.random() - .5)] ), -4,4)
    else:
        return np.clip( x + np.array( [0.5 * (random.random() - .5), 0] ), -4,4)
local_opt = frigidum.sa(random_start=random_start,
                        neighbours=[random_small_step, random_big_step],
                        objective_function=noisy_func,
                        T_start=10**2,
                        T_stop=0.00001,
                        repeats=10**4,
                        copy_state=frigidum.annealing.copy)
The output of the above was
---
Neighbour Statistics:
(proportion of proposals which got accepted *and* changed the objective function)
random_small_step : 0.451045
random_big_step : 0.268002
---
(Local) Minimum Objective Value Found:
-3.30669277
With the above code I sometimes get below -3, but I also noticed that sometimes it has found something around -2, and then it is stuck in the last phase.
So a small tweak would be to re-anneal the last phase of the annealing, with the best-found-so-far.
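For readers who want to try these ideas without frigidum, here is a rough standard-library sketch, not the frigidum implementation. It combines geometric ("alpha") cooling, single-coordinate small and large proposals, and a reset to the best point found before the last 5% of the cooling. The names `propose` and `anneal` and all parameter values are assumptions chosen for illustration.

```python
import math
import random

def noisy_func(x, y):
    return (math.exp(math.sin(50 * x)) + math.sin(60 * math.exp(y)) +
            math.sin(70 * math.sin(x)) + math.sin(math.sin(80 * y)) -
            math.sin(10 * (x + y)) + 0.25 * (x * x + y * y))

def propose(point, rng):
    """Move one coordinate only, by a small or a large step, clipped to [-4, 4]."""
    scale = 0.02 if rng.random() < 0.5 else 0.5
    new = list(point)
    axis = rng.randrange(2)
    new[axis] = min(4.0, max(-4.0, new[axis] + scale * (rng.random() - 0.5)))
    return new

def anneal(f, t_start=100.0, t_stop=1e-5, alpha=0.99, repeats=100, seed=0):
    rng = random.Random(seed)
    current = [rng.uniform(-4, 4), rng.uniform(-4, 4)]
    current_f = f(*current)
    start_f = current_f
    best, best_f = list(current), current_f      # keep the point, not only the value
    n_levels = math.ceil(math.log(t_stop / t_start) / math.log(alpha))
    temp = t_start
    for level in range(n_levels):
        if level == int(0.95 * n_levels):
            # Re-anneal the last phase from the best point found so far.
            current, current_f = list(best), best_f
        for _ in range(repeats):
            cand = propose(current, rng)
            cand_f = f(*cand)
            delta = cand_f - current_f
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current, current_f = cand, cand_f
                if current_f < best_f:
                    best, best_f = list(current), current_f
        temp *= alpha                            # geometric cooling step
    return best, best_f, start_f

best, best_f, start_f = anneal(noisy_func)
print(best, best_f)
```

Like any annealing run, this may still end in a local minimum; running it with several seeds and keeping the lowest value is a cheap extra safeguard.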
Hope that helps, let me know if any questions.
The following is the objective function:
The idea is that a mean-variance optimization has already been done on a universe of securities. This gives us the weights for a target portfolio. Now suppose the investor already is holding a portfolio and does not want to change their entire portfolio to the target one.
Let w_0 = [w_0(1),w_0(2),...,w_0(N)] be the initial portfolio, where w_0(i) is the fraction of the portfolio invested in
stock i = 1,...,N. Let w_t = [w_t(1), w_t(2),...,w_t(N)] be the target portfolio, i.e., the portfolio
that it is desirable to own after rebalancing. This target portfolio may be constructed using quadratic optimization techniques such as variance minimization.
The objective is to decide the final portfolio w_f = [w_f (1), w_f (2),..., w_f(N)] that satisfies the
following characteristics:
(1) The final portfolio is close to our target portfolio
(2) The number of transactions from our initial portfolio is sufficiently small
(3) The return of the final portfolio is high
(4) The final portfolio does not hold many more securities than our initial portfolio
An objective function which is to be minimized is created by summing together the characteristic terms 1 through 4.
The first term is captured by summing the absolute difference in weights from the final and the target portfolio.
The second term is captured by the sum of an indicator function multiplied by a user specified penalty. The indicator function is y_{transactions}(i) where it is 1 if the weight of security i is different in the initial portfolio and the final portfolio, and 0 otherwise.
The third term is captured by the total final portfolio return multiplied by a negative user specified penalty since the objective is minimization.
The final term is the count of assets in the final portfolio (ie. sum of an indicator function counting the number of positive weights in the final portfolio), multiplied by a user specified penalty.
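The four terms can be written out as a plain Python function that scores one candidate final portfolio. This only evaluates the objective and does not optimize anything; the function name, the tolerance `tol` and the example numbers are assumptions for illustration.

```python
def rebalancing_objective(w_final, w_initial, w_target, returns,
                          pen_trans, pen_returns, pen_count, tol=1e-12):
    # (1) distance between the final and the target portfolio
    closeness = sum(abs(f - t) for f, t in zip(w_final, w_target))
    # (2) number of securities whose weight differs from the initial portfolio
    n_trans = sum(1 for f, i in zip(w_final, w_initial) if abs(f - i) > tol)
    # (3) return of the final portfolio (enters with a negative sign)
    port_return = sum(r * f for r, f in zip(returns, w_final))
    # (4) number of securities held in the final portfolio
    n_held = sum(1 for f in w_final if f > tol)
    return (closeness + pen_trans * n_trans
            - pen_returns * port_return + pen_count * n_held)

value = rebalancing_objective(
    w_final=[0.5, 0.25, 0.25],
    w_initial=[0.5, 0.5, 0.0],
    w_target=[0.25, 0.25, 0.5],
    returns=[0.08, 0.04, 0.12],
    pen_trans=1.2, pen_returns=0.6, pen_count=1.0)
print(value)
```

In this example the terms are 0.5 for closeness, 2 transactions, a portfolio return of 0.08 and 3 securities held, so the value is 0.5 + 2.4 - 0.048 + 3.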
Assuming that we already have the target weights as target_w how do I setup this optimization problem in docplex python library? Or if anyone is familiar with mixed integer programming in NAG it would be helpful to know how to setup such a problem there as well.
final_w = [0.]*n
final_w = np.array(final_w)
obj1 = np.sum(np.absolute(final_w - target_w))
pen_trans = 1.2
def ind_trans(final,inital):
    list_trans = []
    for i in range(len(final)):
        if abs(final[i]-inital[i]) == 0:
            list_trans.append(0)
        else:
            list_trans.append(1)
    return list_trans
obj2 = pen_trans*sum(ind_trans(final_w,initial_w))
pen_returns = 0.6
returns_np = np.array(df_secs['Return'])
# the return term is multiplied by its penalty, as described above
obj3 = (-1)*pen_returns*np.dot(returns_np,final_w)
pen_count = 1.
def ind_count(final):
    list_count = []
    for i in range(len(final)):
        if final[i] == 0:
            list_count.append(0)
        else:
            list_count.append(1)
    return list_count
# the count term is multiplied by its penalty, as described above
obj4 = pen_count*sum(ind_count(final_w))
objective = obj1 + obj2 + obj3 + obj4
The main issue in your code is that final_w is not an array of variables but an array of data. So there will be nothing to optimize. To create an array of variables in docplex you have to do something like this:
from docplex.mp.model import Model
with Model() as m:
    final = m.continuous_var_list(n, 0.0, 1.0)
That creates n variables that can take values between 0 and 1. With that in hand you can start things. For example:
obj1 = m.sum(m.abs(target_w[i] - final[i]) for i in range(n))
For the next objective things become harder since you need indicator constraints. To simplify definition of these constraints first define a helper variable delta that gives the absolute difference between stocks:
delta = m.continuous_var_list(n, 0.0, 1.0)
m.add_constraints(delta[i] == m.abs(initial[i] - final[i]) for i in range(n))
Next you need an indicator variable that is 1 if a transaction is required to adjust stock i:
needtrans = m.binary_var_list(n)
for i in range(n):
    # If needtrans[i] is 0 then delta[i] must be 0.
    # Since needtrans[i] is penalized in the objective, the solver will
    # try hard to set it to 0. It will only set it to 1 if delta[i] != 0.
    # That is exactly what we want
    m.add_indicator(needtrans[i], delta[i] == 0, 0)
With that you can define the second objective:
obj2 = pen_trans * m.sum(needtrans)
Once all objectives have been defined, you can add their sum to the model:
m.minimize(obj1 + obj2 + obj3 + obj4)
and then solve the model and display its solution:
m.solve()
print(m.solution.get_values(final))
If any of the above is not (yet) clear to you then I suggest you take a look at the many examples that ship with docplex and also at the (reference) documentation.
I am working on code to find the optimum combination of diameter sizes for a number of pipelines. The objective is to find the least sum of pressure drops in six pipelines.
As I have 15 choices of discrete diameter sizes which are [2,4,6,8,12,16,20,24,30,36,40,42,50,60,80] that can be used for any of the six pipelines that I have in the system, the list of possible solutions becomes 15^6 which is equal to 11,390,625
To solve the problem, I am using Mixed-Integer Linear Programming with the PuLP package. I am able to find the solution for combinations where all diameters are the same (e.g. [2,2,2,2,2,2] or [4,4,4,4,4,4]), but what I need is to go through all combinations (e.g. [2,4,2,2,4,2] or [4,2,4,2,4,2]) to find the minimum. I attempted to do this, but the process takes a very long time to go through all combinations. Is there a faster way to do this?
Note that I cannot calculate the pressure drop for each pipeline as the choice of diameter will affect the total pressure drop in the system. Therefore, at anytime, I need to calculate the pressure drop of each combination in the system.
I also need to constrain the problem such that rate / cross-sectional area of the pipeline > 2.
Your help is much appreciated.
The first attempt for my code is the following:
from pulp import *
import random
import itertools
import numpy
rate = 5000
numberOfPipelines = 15
def pressure(diameter):
    diameterList = numpy.tile(diameter,numberOfPipelines)
    pressure = 0.0
    for pipeline in range(numberOfPipelines):
        pressure += rate/diameterList[pipeline]
    return pressure
diameterList = [2,4,6,8,12,16,20,24,30,36,40,42,50,60,80]
pipelineIds = range(0,numberOfPipelines)
pipelinePressures = {}
for diameter in diameterList:
    pressures = []
    for pipeline in range(numberOfPipelines):
        pressures.append(pressure(diameter))
    pressureList = dict(zip(pipelineIds,pressures))
    pipelinePressures[diameter] = pressureList
print('pipepressure', pipelinePressures)
prob = LpProblem("Warehouse Allocation",LpMinimize)
use_diameter = LpVariable.dicts("UseDiameter", diameterList, cat=LpBinary)
use_pipeline = LpVariable.dicts("UsePipeline", [(i,j) for i in pipelineIds for j in diameterList], cat = LpBinary)
## Objective Function:
prob += lpSum(pipelinePressures[j][i] * use_pipeline[(i,j)] for i in pipelineIds for j in diameterList)
## At least each pipeline must be connected to a diameter:
for i in pipelineIds:
    prob += lpSum(use_pipeline[(i,j)] for j in diameterList) ==1
## The diameter is activated if at least one pipeline is assigned to it:
for j in diameterList:
    for i in pipelineIds:
        prob += use_diameter[j] >= lpSum(use_pipeline[(i,j)])
## run the solution
prob.solve()
print("Status:", LpStatus[prob.status])
for i in diameterList:
    if use_diameter[i].varValue> pressureTest:
        print("Diameter Size",i)
for v in prob.variables():
    print(v.name,"=",v.varValue)
This is what I did for the combination part, which took a really long time:
xList = numpy.array(list(itertools.product(diameterList,repeat = numberOfPipelines)))
print(len(xList))
for combination in xList:
    pressures = []
    for pipeline in range(numberOfPipelines):
        pressures.append(pressure(combination))
    pressureList = dict(zip(pipelineIds,pressures))
    # arrays are unhashable, so use a tuple as the dictionary key
    pipelinePressures[tuple(combination)] = pressureList
print('pipelinePressures',pipelinePressures)
I would iterate through the combinations rather than model them all at once; I think you would run into memory problems trying to model ALL combinations in a MIP.
If you iterate through the combinations, perhaps using the multiprocessing library to use all cores, it shouldn't take long. Just remember to hold only the best combination found so far, and not to generate all combinations at once and evaluate them afterwards.
If the problem gets bigger, you should consider dynamic programming algorithms or use pulp with column generation.
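A minimal sketch of the iterate-and-keep-the-best approach, assuming a deliberately small instance (4 pipelines, 5 candidate diameters) and a made-up cost, the sum of rate/diameter, standing in for the real pressure-drop calculation. The names `total_pressure`, `is_feasible` and `best_combination` are illustrative, and the feasibility check reads the asker's constraint as rate / cross-sectional area > 2, which is an assumption about the intended units. `itertools.product` is consumed lazily, so only the best combination seen so far is held in memory.

```python
import itertools
import math

RATE = 5000.0
DIAMETERS = [20, 30, 40, 50, 60]   # illustrative subset of candidate diameters
N_PIPELINES = 4                     # small on purpose: 5**4 = 625 combinations


def total_pressure(combination):
    """Placeholder system pressure drop: sum of rate/diameter (hypothetical model)."""
    return sum(RATE / d for d in combination)


def is_feasible(combination):
    """Assumed reading of the constraint: rate / cross-sectional area > 2 per pipeline."""
    return all(RATE / (math.pi * (d / 2.0) ** 2) > 2 for d in combination)


def best_combination(diameters, n_pipelines):
    best, best_cost = None, float("inf")
    # product() yields one tuple at a time; nothing is materialised as a list.
    for combination in itertools.product(diameters, repeat=n_pipelines):
        if not is_feasible(combination):
            continue
        cost = total_pressure(combination)
        if cost < best_cost:
            best, best_cost = combination, cost
    return best, best_cost


best, cost = best_combination(DIAMETERS, N_PIPELINES)
print(best, cost)
```

The number of combinations is len(diameters) ** n_pipelines, so with the 15 diameters and 15 pipelines from the question there are 15**15 (roughly 4.4e17) candidates. Exhaustive iteration of this kind is therefore only realistic for small instances or after pruning infeasible diameters first.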
I'm taking my first course in programming; the course is meant for physics applications and the like. We have an exam coming up, and my professor published a practice exam with the following question.
The Maxwell distribution in speed v for an ideal gas consisting of particles of mass m at Kelvin temperature T is given by:
Stack Overflow doesn't use MathJax for formulas, and I can't quite figure out how to write a formula on this site. So, here is a link to WolframAlpha:
where k is Boltzmann's constant, k = 1.3806503 × 10^-23 J/K.
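For reference, the linked formula is not reproduced in this post. Reconstructed from the two scripts further down (so treat it as an inference from the code, not a quote from the exam), the distribution appears to be:

```latex
f(v) = 4\pi \left( \frac{m}{2\pi k T} \right)^{3/2} v^{2} \exp\!\left( -\frac{m v^{2}}{2 k T} \right)
```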
Write a Python script called maxwell.py which prints v and f(v) to standard output in two column format. For the particle mass, choose the mass of a proton, m = 1.67262158 × 10^-27 kg. For the gas temperature, choose the temperature at the surface of the sun, T = 5778 K.
Your output should consist of 300 data points, ranging from v = 100 m/s to v = 30,000 m/s in steps of size dv = 100 m/s.
So, here is my attempt at the code.
import math as m
import sys
def f(v):
    n = 1.67262158e-27 #kg
    k = 1.3806503e-23 #J/K
    T = 5778 #Kelvin
    return (4*m.pi)*((n/(2*m.pi*k*T))**(3/2))*(v**2)*m.exp((-n*v**2)/(2*k*T))
v = 100 #m/s
for i in range(300):
    a = float(f(v))
    print (v, a)
    v = v + 100
But my professor's solution is:
import numpy as np
def f(v):
    m = 1.67262158e-27 # kg
    T = 5778. # K
    k = 1.3806503e-23 # J/K
    return 4.*np.pi * (m/(2.*np.pi*k*T))**1.5 * v**2 * np.exp(-m*v**2/(2.*k*T))
v = np.linspace(100.,30000.,300)
fv = f(v)
vfv = zip(v,fv)
for x in vfv:
    print "%5.0f %.3e"%x
# print np.sum(fv*100.)
So, these are very different pieces of code. From what I can tell, they produce the same result. I guess my question is, simply, why is my code incorrect?
Thank you!
EDIT:
So, I asked my professor about it and this was his response.
I think your code is fine. It would run much faster using numpy, but the problem didn't specify that you needed numpy. I might have docked one point for not looping through a list of v's (your variable i doesn't do anything). Also, you should have used v += 100. Almost, these two things together would have been one point out of 10.
1st: Is there any better syntax for doing the range in my code, since my variable i doesn't do anything?
2nd: What is the purpose of v += 100?
One thing to be careful about when dealing with numbers is integer arithmetic being applied where you expected floats.
One instance I could find in your code is that you use (3/2), which evaluates to 1 under Python 2's integer division, while the other code uses 1.5 directly.
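A short sketch of both points, written for Python 3, where `/` is always true division and `//` reproduces the truncating behaviour that `3/2` had in Python 2. It also shows one way to loop over the speeds directly with `range(start, stop, step)` instead of an unused counter, and what `v += 100` does. The variable names are illustrative.

```python
# Python 2: 3/2 == 1 (integer division). Python 3: 3/2 == 1.5, and 3//2 == 1.
exponent_truncated = 3 // 2     # what (3/2) produced under Python 2
exponent_intended = 3 / 2.0     # 1.5 in both Python 2 and Python 3

base = 4.0
print(base ** exponent_truncated)  # 4.0
print(base ** exponent_intended)   # 8.0

# Loop over the speeds themselves: 100, 200, ..., 30000 (300 values),
# so no separate counter variable is needed.
speeds = list(range(100, 30001, 100))
print(len(speeds), speeds[0], speeds[-1])

# v += 100 is shorthand for v = v + 100.
v = 100
v += 100
print(v)
```

Writing the exponent as `1.5` or `3/2.0` gives the same result regardless of the Python version, which is why the professor's script is not affected.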
"""Some simulations to predict the future portfolio value based on past distribution. x is
a numpy array that contains past returns. The interpolated_returns are the returns
generated from the cdf of the past returns to simulate future returns. The portfolio
starts with a value of 100. portfolio_value is filled up progressively as
the program goes through every loop. The value is multiplied by the returns in that
period and a dollar is removed."""
portfolio_final = []
for i in range(10000):
    portfolio_value = [100]
    rand_values = np.random.rand(600)
    interpolated_returns = np.interp(rand_values,cdf_values,x)
    interpolated_returns = np.add(interpolated_returns,1)
    for j in range(1,len(interpolated_returns)+1):
        portfolio_value.append(interpolated_returns[j-1]*portfolio_value[j-1])
        portfolio_value[j] = portfolio_value[j]-1
    portfolio_final.append(portfolio_value[-1])
print (np.mean(portfolio_final))
I couldn't find a way to write this code using numpy. I was having a look at iterations using nditer but I was unable to move ahead with that.
I guess the easiest way to figure out how to vectorize this is to look at the equations that govern the evolution of your portfolio and see how it actually iterates, looking for patterns that can be vectorized, instead of trying to vectorize the code you already have. You will then notice that the cumprod actually appears quite often in your iterations.
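As a worked version of that advice, here is the recurrence the loop implements and its unrolled form. The symbols are introduced here for illustration only: V_n is the portfolio value after step n, r_n is one plus the interpolated return in step n, and R_n is the cumulative product of the r_i.

```latex
V_0 = 100, \qquad V_n = r_n V_{n-1} - 1, \qquad R_n = \prod_{i=1}^{n} r_i

\frac{V_n}{R_n} = \frac{V_{n-1}}{R_{n-1}} - \frac{1}{R_n}
\quad\Longrightarrow\quad
V_n = R_n \left( V_0 - \sum_{k=1}^{n} \frac{1}{R_k} \right)
```

Dividing the recurrence by R_n makes it telescope, which is where one cumulative product (for R_n) and one cumulative sum (for the sum of 1/R_k) come from.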
Nevertheless, you can find the semi-vectorized code below. I included your code as well so that you can compare the results. I also included a simple loop version of your code, which is much easier to read and to translate into mathematical equations. So if you share this code with somebody else, I would definitely use the simple loop option. If you want some fancy-pants vectorizing, you can use the vector version. In case you need to keep track of the individual steps, you can also add an array to the simple loop option and append pv to it at every step.
Hope that helps.
Edit: I have not tested anything for speed. That's something you can easily do yourself with timeit.
import numpy as np
from scipy.special import erf

# Prepare simple return model - Normal distributed with mu and sigma = 0.01
x = np.linspace(-10,10,100)
cdf_values = 0.5*(1+erf((x-0.01)/(0.01*np.sqrt(2))))

# Prepare setup such that every code snippet uses the same number of steps
# and the same random numbers
nSteps = 600
nIterations = 1
rnd = np.random.rand(nSteps)

# Your code - Gives the (supposedly) correct results
portfolio_final = []
for i in range(nIterations):
    portfolio_value = [100]
    rand_values = rnd
    interpolated_returns = np.interp(rand_values,cdf_values,x)
    interpolated_returns = np.add(interpolated_returns,1)
    for j in range(1,len(interpolated_returns)+1):
        portfolio_value.append(interpolated_returns[j-1]*portfolio_value[j-1])
        portfolio_value[j] = portfolio_value[j]-1
    portfolio_final.append(portfolio_value[-1])
print (np.mean(portfolio_final))

# Using vectors
portfolio_final = []
for i in range(nIterations):
    portfolio_values = np.ones(nSteps)*100.0
    rcp = np.cumprod(np.interp(rnd,cdf_values,x) + 1)
    portfolio_values = rcp * (portfolio_values - np.cumsum(1.0/rcp))
    portfolio_final.append(portfolio_values[-1])
print (np.mean(portfolio_final))

# Simple loop
portfolio_final = []
for i in range(nIterations):
    pv = 100
    rets = np.interp(rnd,cdf_values,x) + 1
    for i in range(nSteps):
        pv = pv * rets[i] - 1
    portfolio_final.append(pv)
print (np.mean(portfolio_final))
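If the goal is also to drop the outer loop over simulations, the same cumprod/cumsum idea can be applied to a 2-D array with one row per simulation. This is a sketch under assumptions: the helper names `simulate_loop` and `simulate_vectorized` are made up here, the synthetic gross returns stand in for `np.interp(rand_values, cdf_values, x) + 1`, and no speed claim is made.

```python
import numpy as np


def simulate_loop(rets, start=100.0):
    """Reference: step-by-step recurrence pv = pv * r - 1 for each row of rets."""
    finals = []
    for row in rets:
        pv = start
        for r in row:
            pv = pv * r - 1
        finals.append(pv)
    return np.array(finals)


def simulate_vectorized(rets, start=100.0):
    """Final value per row: R_n * (start - sum_k 1/R_k), with R = cumprod of returns."""
    rcp = np.cumprod(rets, axis=1)
    return rcp[:, -1] * (start - np.sum(1.0 / rcp, axis=1))


# Synthetic gross returns (1 + return), one row per simulated path (assumption).
rng = np.random.default_rng(0)
rets = 1.0 + rng.normal(0.01, 0.01, size=(50, 600))

loop_result = simulate_loop(rets)
vec_result = simulate_vectorized(rets)
print(np.allclose(loop_result, vec_result, rtol=1e-8, atol=1e-6))
print(np.mean(vec_result))
```

Only the final value of each path is computed here. If the intermediate values are needed, `np.cumsum(1.0 / rcp, axis=1)` can replace the row sum, as in the single-path vector version above.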
Forget about np.nditer. It does not improve the speed of iterations. Only use it if you intend to go on and use the C version (via Cython).
I'm puzzled about that inner loop. What is it supposed to be doing that is special? Why the loop?
In tests with simulated values these 2 blocks of code produce the same thing:
interpolated_returns = np.add(interpolated_returns,1)
for j in range(1,len(interpolated_returns)+1):
    portfolio_value.append(interpolated_returns[j-1]*portfolio[j-1])
    portfolio_value[j] = portfolio_value[j]-1

interpolated_returns = (interpolated_returns+1)*portfolio - 1
portfolio_value = portfolio_value + interpolated_returns.tolist()
I'm assuming that interpolated_returns and portfolio are 1d arrays of the same length.
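A quick check of that claim, assuming as stated that `interpolated_returns` and `portfolio` are 1-D arrays of equal length. The random test data and the function names `loop_version` and `array_version` are made up for illustration. Note that `portfolio` is treated here as a separate, already-known array, whereas the loop in the question multiplies by the previous entry of `portfolio_value` itself, so this equivalence may not carry over to the original problem.

```python
import numpy as np

rng = np.random.default_rng(1)
returns = rng.normal(0.0, 0.01, size=10)       # simulated returns (assumption)
portfolio = rng.uniform(90.0, 110.0, size=10)  # separate 1-D array (assumption)


def loop_version(returns, portfolio):
    """First block: append and adjust one element at a time."""
    portfolio_value = [100]
    r = np.add(returns, 1)
    for j in range(1, len(r) + 1):
        portfolio_value.append(r[j - 1] * portfolio[j - 1])
        portfolio_value[j] = portfolio_value[j] - 1
    return portfolio_value


def array_version(returns, portfolio):
    """Second block: one array expression, then list concatenation."""
    portfolio_value = [100]
    r = (returns + 1) * portfolio - 1
    return portfolio_value + r.tolist()


same = np.allclose(loop_version(returns, portfolio), array_version(returns, portfolio))
print(same)
```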