My team is building a CP-SAT solver that schedules assignments (think homework) over a period of days with variable availability (time available to do assignments). We're trying to speed up our model.
We've tried num_search_workers and other parameter tuning but want to check for other speed increases. The aim is to solve problems with ~100 days and up to 2000 assignments in 5-10 seconds (benchmarked on an M1 Mac). Any ideas?
Problem Description: Place a assignments across d days (a = number of assignments, d = number of days), respecting these requirements:
Assignment time on a day must not exceed that day's time available
Assignment dependencies should be respected (if A needs B then B should not occur after A)
Assignments can be split (in order to better fit across days with little time)
Optimize for diversity of assignment types on a day
Solving slows dramatically with # days and # assignments. This is expected but we'd like to know if you can suggest possible speedups
Here's an example unit test; hopefully it shows the splitting, ordering, and time constraints.
days = [{"secondsAvailable": 1200}, {"secondsAvailable": 1200}, {"secondsAvailable": 1200}, {"secondsAvailable": 1200}]
assignments = [
{"id": 1, "resourceType": "Type0", "seconds": 2400, "deps": [], "instances": 2},
{"id": 2, "resourceType": "Type0", "seconds": 1200, "deps": [1], "instances": 1},
{"id": 3, "resourceType": "Type0", "seconds": 1200, "deps": [1, 2], "instances": 1},
]
result = cp_sat.CP_SAT_FAST.schedule(days, assignments, options=solver_options)
# expect a list of lists where each inner list is a day with the included assignments
expected = shared.SolverOutput(feasible=True, solution=[
[{"id": 1, "resourceType": "Type0", "time": 1200, "instances": 2}],
[{"id": 1, "resourceType": "Type0", "time": 1200, "instances": 2}],
[{"id": 2, "resourceType": "Type0", "time": 1200, "instances": 1}],
[{"id": 3, "resourceType": "Type0", "time": 1200, "instances": 1}],
])
self.assertEqual(result, expected)
And here's the solver:
import math
from typing import List, Dict
from ortools.sat.python import cp_model
import numpy as np
import planner.solvers as solvers
from planner.shared import SolverOutput, SolverOptions
class CP_SAT_FAST(solvers.Solver):
"""
CP_SAT_FAST is a CP_SAT solver with speed optimizations and a time limit (passed in through options).
"""
@staticmethod
def schedule(days: List[Dict], assignments: List[Dict], options: SolverOptions) -> SolverOutput:
"""
Schedules a list of assignments on a studyplan of days
Arguments:
days: list of dicts containing available time for that day
assignments: list of assignments to place on schedule
"""
model = cp_model.CpModel()
num_assignments = len(assignments)
num_days = len(days)
# x[d, a] indicates whether assignment a is on day d
x = np.zeros((num_days, num_assignments), cp_model.IntVar)
# used for resource diversity optimization
total_resource_types = 4
unique_today = []
# upper and lower bounds used for dependency ordering (if a needs b then b must be before or on the day of a)
day_ub = {}
day_lb = {}
# track assignment splitting
instances = {}
assignment_times = {}
id_to_assignment = {}
for a, asm in enumerate(assignments):
# track upper and lower bounds
day_ub[a] = model.NewIntVar(0, num_days, "day_ub")
day_lb[a] = model.NewIntVar(0, num_days, "day_lb")
asm["ub"] = day_ub[a]
asm["lb"] = day_lb[a]
id_to_assignment[asm["id"]] = asm
max_instances = min(num_days, asm.get("instances", num_days))
# each assignment must occur at least once
instances[a] = model.NewIntVar(1, max_instances, f"instances_{a}")
model.AddHint(instances[a], max_instances)
# when split keep a decision variable of assignment time
assignment_times[a] = model.NewIntVar(asm.get("seconds") // max_instances, asm.get("seconds"), f"assignment_time_{a}")
model.AddDivisionEquality(assignment_times[a], asm.get("seconds"), instances[a])
for d in range(num_days):
time_available = days[d].get("secondsAvailable", 0)
if time_available <= 0:
# no assignments on zero-time days
model.Add(sum(x[d]) == 0)
else:
# track resource diversity on this day
type0_today = model.NewBoolVar(f"type0_on_{d}")
type1_today = model.NewBoolVar(f"type1_on_{d}")
type2_today = model.NewBoolVar(f"type2_on_{d}")
type3_today = model.NewBoolVar(f"type3_on_{d}")
types_today = model.NewIntVar(0, total_resource_types, f"unique_on_{d}")
task_times = []
for a, asm in enumerate(assignments):
# x[d, a] = True if assignment a is on day d
x[d, a] = model.NewBoolVar(f"x[{d},{a}]")
# set assignment upper and lower bounds for ordering
model.Add(day_ub[a] >= d).OnlyEnforceIf(x[d, a])
model.Add(day_lb[a] >= (num_days - d)).OnlyEnforceIf(x[d, a])
# track if a resource type is on a day for resource diversity optimization
resourceType = asm.get("resourceType")
if resourceType == "Type0":
model.AddImplication(x[d, a], type0_today)
elif resourceType == "Type1":
model.AddImplication(x[d, a], type1_today)
elif resourceType == "Type2":
model.AddImplication(x[d, a], type2_today)
elif resourceType == "Type3":
model.AddImplication(x[d, a], type3_today)
else:
raise RuntimeError(f"Unknown resource type {asm.get('resourceType')}")
# keep track of task time (considering splitting), for workload requirements
task_times.append(model.NewIntVar(0, asm.get("seconds"), f"time_{a}_on_{d}"))
model.Add(task_times[a] == assignment_times[a]).OnlyEnforceIf(x[d, a])
# time assigned to day d cannot exceed the day's available time
model.Add(time_available >= sum(task_times))
# sum the unique resource types on this day for later optimization
model.Add(sum([type0_today, type1_today, type2_today, type3_today]) == types_today)
unique_today.append(types_today)
"""
Resource Diversity:
Keeps track of what instances of a resource type appear on each day
and the minimum number of unique resource types on any day. (done above ^)
Then the model objective is set to maximize that minimum
"""
total_diversity = model.NewIntVar(0, num_days * total_resource_types, "total_diversity")
model.Add(sum(unique_today) == total_diversity)
avg_diversity = model.NewIntVar(0, total_resource_types, "avg_diversity")
model.AddDivisionEquality(avg_diversity, total_diversity, num_days)
# Set objective
model.Maximize(avg_diversity)
# Assignment Occurrence/Splitting and Dependencies
for a, asm in enumerate(assignments):
# track how many times an assignment occurs (since we can split)
model.Add(instances[a] == sum(x[d, a] for d in range(num_days)))
# Dependencies
for needed_asm in asm.get("deps", []):
needed_ub = id_to_assignment[needed_asm]["ub"]
# this asm's lower bound must be greater than or equal to the upper bound of the dependency
model.Add(num_days - asm["lb"] >= needed_ub)
# Solve
solver = cp_model.CpSolver()
# set time limit
solver.parameters.max_time_in_seconds = float(options.time_limit)
solver.parameters.preferred_variable_order = 1
solver.parameters.initial_polarity = 0
# solver.parameters.stop_after_first_solution = True
# solver.parameters.num_search_workers = 8
intermediate_printer = SolutionPrinter()
status = solver.Solve(model, intermediate_printer)
print("\nStats")
print(f" - conflicts : {solver.NumConflicts()}")
print(f" - branches : {solver.NumBranches()}")
print(f" - wall time : {solver.WallTime()}s")
print()
if status == cp_model.OPTIMAL or status == cp_model.FEASIBLE:
sp = []
for i, d in enumerate(days):
day_time = 0
days_tasks = []
for a, asm in enumerate(assignments):
if solver.Value(x[i, a]) >= 1:
asm_time = math.ceil(asm.get("seconds") / solver.Value(instances[a]))
day_time += asm_time
days_tasks.append({"id": asm["id"], "resourceType": asm.get("resourceType"), "time": asm_time, "instances": solver.Value(instances[a])})
sp.append(days_tasks)
return SolverOutput(feasible=True, solution=sp)
else:
return SolverOutput(feasible=False, solution=[])
class SolutionPrinter(cp_model.CpSolverSolutionCallback):
def __init__(self):
cp_model.CpSolverSolutionCallback.__init__(self)
self.__solution_count = 0
def on_solution_callback(self):
print(f"Solution {self.__solution_count} objective value = {self.ObjectiveValue()}")
self.__solution_count += 1
Before answering your actual question I want to point out a few things in your model that I suspect are not working as you intended.
The constraints on the assignment types present on a given day
model.AddImplication(x[d, a], type0_today)
etc., do enforce that type0_today == 1 if there is an assignment of that type on that day. However, it does not enforce that type0_today == 0 if there are no assignments of that type on that day. The solver is still free to choose type0_today == 1, and it will do so, because that fulfills this constraint and also directly increases the objective function. You will probably discover in the optimal solution to the test case you gave that all the type0_today to type3_today variables are 1 and that avg_diversity == 4 in the optimal solution, even though there are no assignments of any type but 0 in the input data. In the early stages of modelling, it's always a good idea to check the value of all the variables in the model for plausibility.
Since I don't have a Python installation, I translated your model to C# to be able to do some experiments. Sorry, you'll have to translate it into the equivalent Python code. I reformulated the constraint on the type0_today variables to use an array type_today[d, t] (for day d and type t) and the AddMaxEquality constraint, which for Boolean variables is equivalent to the logical OR of all the participating variables:
// For each day...
for (int d = 0; d < num_days; d++)
{
// ... make a list for each assignment type of all x[d, a] where a has that type.
List<IntVar>[] assignmentsByType = new List<IntVar>[total_resource_types];
for (int t = 0; t < total_resource_types; t++)
{
assignmentsByType[t] = new List<IntVar>();
}
for (int a = 0; a < num_assignments; a++)
{
int t = getType(assignments[a].resourceType);
assignmentsByType[t].Add(x[d, a]);
}
// Constrain the types present on the day to be the logical OR of assignments with that type on that day
for (int t = 0; t < total_resource_types; t++)
{
if (assignmentsByType[t].Count > 0)
{
model.AddMaxEquality(type_today[d, t], assignmentsByType[t]);
}
else
{
model.Add(type_today[d, t] == 0);
}
}
}
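In Python / OR-Tools, a rough translation of that reformulation could look like this (just a sketch, assuming the x[d, a] BoolVars and the assignments list from your model exist for every day/assignment pair):
# Sketch: type_today[d, t] == 1 iff at least one assignment of type t is scheduled on day d
type_names = ["Type0", "Type1", "Type2", "Type3"]
type_today = {}
for d in range(num_days):
    for t, type_name in enumerate(type_names):
        type_today[d, t] = model.NewBoolVar(f"type_{t}_on_{d}")
        members = [x[d, a] for a, asm in enumerate(assignments)
                   if asm.get("resourceType") == type_name]
        if members:
            # for Boolean variables, AddMaxEquality is the logical OR of the members
            model.AddMaxEquality(type_today[d, t], members)
        else:
            model.Add(type_today[d, t] == 0)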
You compute the average diversity as
avg_diversity = model.NewIntVar(0, total_resource_types, "avg_diversity")
model.AddDivisionEquality(avg_diversity, total_diversity, num_days)
Since the solver only works with integer variables, avg_diversity will be exactly one of the values 0, 1, 2, 3 or 4 with no fractional part. The constraint AddDivisionEquality will also ensure that total_diversity is an exact integer multiple of both avg_diversity and num_days. This is a very strong restriction on the solutions and will lead to infeasibility in many cases that I don't think you intended.
For example, avg_diversity == 3, num_days == 20 and total_diversity == 60 would be an allowed solution, but total_diversity == 63 would not be allowed, although there are three days in that solution with higher diversity than in the one with total_diversity == 60.
Instead, I recommend that you eliminate the variable avg_diversity and its constraint and simply use total_diversity as your objective function. Since the number of days is a fixed constant during the solution, maximizing the total diversity will be equivalent without introducing artificial infeasibilities.
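In Python this change is just a couple of lines (a sketch, reusing the unique_today list your model already builds):
# Sketch: maximize the summed per-day diversity directly instead of an integer "average"
total_diversity = sum(unique_today)
model.Maximize(total_diversity)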
That said, here is my answer.
Generic constraint satisfaction problems are NP-complete in general and should not be expected to scale well. Although many specific problem formulations can actually be solved quickly, small changes in the input data or the formulation can push the problem into a black hole of exponentiality. There is really no other approach than trying out various methods to see what works best with your exact problem.
Although it sounds paradoxical, it is easier for the solver to find optimal solutions for strongly constrained problems than for lightly constrained ones (assuming they are feasible!). The search space in a strongly constrained problem is smaller than in the lightly constrained one, so the solver has fewer choices about what to experiment with to optimize and therefore completes the job faster.
First suggestion
In your problem, you have variables day_ub and day_lb for each assignment. These have a range from 0 to num_days. The constraints on them
model.Add(day_ub[a] >= d).OnlyEnforceIf(x[d, a])
model.Add(day_lb[a] >= (num_days - d)).OnlyEnforceIf(x[d, a])
allow the solver freedom to choose any value between 0 and the largest d, or the largest (num_days - d), respectively (inclusive). During the optimization, the solver probably spends time trying out different values for these variables but rarely discovers that this leads to an improvement; that would happen only when the placement of a dependent assignment changes.
You can eliminate the variables day_ub and day_lb and their constraints and instead formulate the dependencies directly with the x variables.
In my c# model I reformulated the assignment dependency constraint as follows:
for (int a = 0; a < num_assignments; a++)
{
Assignment assignment = assignments[a];
foreach (int predecessorIndex in getPredecessorAssignmentIndicesFor(assignment))
{
for (int d1 = 0; d1 < num_days; d1++)
{
for (int d2 = 0; d2 < d1; d2++)
{
model.AddImplication(x[d1, predecessorIndex], x[d2, a].Not());
}
}
}
}
In words: if an assignment B (predecessorIndex) on which assignment A (a) depends is placed on day d1, then all of x[0..d1-1, a] must be false. This relates the dependencies directly through the x variables instead of introducing helper variables with additional freedom that bog down the solver. This change reduces the number of variables in the problem and increases the number of constraints, both of which help the solver.
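A rough Python translation of that loop (a sketch; assignment_index is a hypothetical id-to-index lookup built from your assignments list):
# Sketch: forbid placing assignment a on any day strictly before the day of its prerequisite
assignment_index = {asm["id"]: i for i, asm in enumerate(assignments)}  # hypothetical helper
for a, asm in enumerate(assignments):
    for dep_id in asm.get("deps", []):
        pred = assignment_index[dep_id]
        for d1 in range(num_days):
            for d2 in range(d1):
                # if the prerequisite sits on day d1, a cannot sit on any earlier day d2
                model.AddImplication(x[d1, pred], x[d2, a].Not())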
In an experiment I did with 25 days and 35 assignments, checking the model stats showed
Original:
#Variables: 2020
#kIntDiv: 35
#kIntMax: 100
#kLinear1: 1750
#kLinear2: 914
#kLinearN: 86
Total constraints 2885
New formulation:
#Variables: 1950
#kBoolOr: 11700
#kIntDiv: 35
#kIntMax: 100
#kLinear2: 875
#kLinearN: 86
Total constraints 12796
So the new formulation has fewer variables but far more constraints.
The solution times in the experiment improved: the solver took only 2.6 s to reach total_diversity == 68 instead of over 90 s.
Original formulation
Time Objective
0.21 56
0.53 59
0.6 60
0.73 61
0.75 62
0.77 63
2.9 64
3.82 65
3.84 66
91.01 67
91.03 68
91.05 69
New formulation
Time Objective
0.2347 41
0.3066 42
0.4252 43
0.4602 44
0.5014 49
0.6437 50
0.6777 51
0.6948 52
0.7108 53
0.9593 54
1.0178 55
1.1535 56
1.2023 57
1.2351 58
1.2595 59
1.2874 60
1.3097 61
1.3325 62
1.388 63
1.5698 64
2.4948 65
2.5993 66
2.6198 67
2.6431 68
32.5665 69
Of course, the solution times you get will be strongly dependent on the input data.
Second suggestion
During my experiments I observed that solutions are found much more quickly when the assignments have a lot of dependencies. This is consistent with more highly constrained models being easier to solve.
If you often have assignments of the same type and duration (like numbers 2 and 3 in your test data), and they both have instances == 1 and either no dependencies or the same ones, then exchanging their positions in the solution will not improve the objective.
In a pre-processing step you could look for such duplicates and make one of them dependent on the other. This is essentially a symmetry-breaking constraint. This will prevent the solver from wasting time with an attempt to see if exchanging their positions would improve the objective.
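A pre-processing sketch of that idea, assuming the same assignment dicts as in your test data (what exactly counts as a "duplicate" is up to you):
# Sketch: chain interchangeable single-instance assignments with artificial dependencies
seen = {}
for asm in assignments:
    if asm.get("instances", 1) != 1:
        continue
    key = (asm["resourceType"], asm["seconds"], tuple(sorted(asm.get("deps", []))))
    if key in seen:
        # symmetry breaking: this duplicate must not come before the previous one
        asm.setdefault("deps", []).append(seen[key])
    seen[key] = asm["id"]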
Third suggestion
The solution needs to deal with determining how many instances of each assignment will be present in a solution. That requires two variables for each assignment, instances[a] and assignment_times[a], with an associated constraint.
Instead of doing this, you could get rid of the variables instances[a] and assignment_times[a] and split assignments with instances > 1 into multiple assignments in a preprocessing step. For example, in your test data, assignment 1 would be split into two assignments 1_1 and 1_2, each having instances == 1 and seconds == 1200. For this test case, where instances == 2 for assignment 1, this has no effect on the final solution: maybe the solver schedules 1_1 and 1_2 on the same day, maybe not, but the result is equivalent to splitting within the model while avoiding the extra variables.
In the preprocessing step, when an assignment is split, you should add symmetry breaking constraints to make 1_2 dependent on 1_1, etc., for the reasons mentioned above.
When an assignment has instances > 2, splitting it into multiple assignments before the run is actually a change to the model. For example, if instances == 3 and seconds = 2400 you cannot get a solution in which the assignment is split over two days with 1200 s each; the solver will always be scheduling 3 assignments of 800 s each.
So this suggestion is actually a change to the model and you'll have to determine if that is acceptable or not.
The total diversity will usually be helped by having more instances of an assignment to place, so the change may not have large practical consequences. It would also allow scheduling 2/3 of an assignment on one day and the remaining 1/3 on another day, so it even adds some flexibility.
But this may or may not be acceptable in terms of your overall requirements.
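If you decide the changed behaviour is acceptable, the splitting itself is easy to do before building the model. A sketch (the derived string ids like '1_2' are a hypothetical naming scheme, so dependency ids would have to be kept consistent with it):
# Sketch: replace each multi-instance assignment by `instances` single-instance parts,
# chained with dependencies for symmetry breaking.
def split_assignments(assignments):
    result = []
    for asm in assignments:
        n = asm.get("instances", 1)
        if n <= 1:
            result.append(asm)
            continue
        prev_id = None
        for k in range(1, n + 1):
            part = dict(asm)
            part["id"] = f'{asm["id"]}_{k}'
            part["instances"] = 1
            part["seconds"] = asm["seconds"] // n
            part["deps"] = list(asm.get("deps", [])) + ([prev_id] if prev_id else [])
            result.append(part)
            prev_id = part["id"]
    return result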
In all cases, you'll have to test changes with your exact data to see if they really result in an improvement or not.
I hope this helps (and that this is a real-world problem and not a homework assignment, as I did spend a few hours investigating...).
Related
I can't manage to find a way to create a constraint that sounds like this: for example, I have 2 variables, one is a regular product and the other one is a super rare product. In order to have a super rare product, you need to already have 25 of the regular version of that product. This is stackable (e.g. if the algorithm selects 75 of that regular product, it can have 3 super rare). The reason for this is that the super rare is more profitable, so if I place it without any constraints, it will select only the super rare ones. Any ideas on how to write such a constraint?
Thanks in advance!
Part of the code:
hwProblem = LpProblem("HotWheels", LpMaximize)
# Variables
jImportsW_blister = LpVariable("HW J-Imports w/ blister", lowBound=20, cat=LpInteger) # regular product
jImportsTH = LpVariable("HW J-Imports treasure hunt", lowBound=None, cat=LpInteger) # super rare product
# Objective Function
hwProblem += 19 * jImportsW_blister + 350 * jImportsTH # profit for each type of product
# Constraints
hwProblem += jImportsW_blister <= 50, "HW J-Imports maximum no. of products"
hwProblem += jImportsTH <= jImportsW_blister / 25
# ^this is where the error is happening
There are a few "missing pieces" here regarding the structure of your model, but in general, you can limit the "super rare" (SR) by doing something like:
prob += SR <= R / 25
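One way to avoid dividing a variable at all is to multiply the ratio out: 25 * SR <= R is equivalent. A minimal self-contained sketch (the variable names are mine, not yours):
from pulp import LpProblem, LpMaximize, LpVariable, LpInteger

prob = LpProblem("HotWheels", LpMaximize)
regular = LpVariable("regular", lowBound=20, cat=LpInteger)       # regular product
super_rare = LpVariable("super_rare", lowBound=0, cat=LpInteger)  # super rare product

prob += 19 * regular + 350 * super_rare    # profit objective
prob += regular <= 50                      # stock limit on regulars
prob += 25 * super_rare <= regular         # every super rare needs 25 regulars

prob.solve()
print(regular.value(), super_rare.value())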
Hey guys I have a script that compares each possible user and checks how similar their text is:
dictionary = {
t.id: (
t.text,
t.set,
t.compare_string
)
for t in dataframe.itertuples()
}
highly_similar = []
for a, b in itertools.combinations(dictionary.items(), 2):
if a[1][2] == b[1][2] and not a[1][1].isdisjoint(b[1][1]):
similarity_score = fuzz.ratio(a[1][0], b[1][0])
if (similarity_score >= 95 and len(a[1][0]) >= 10) or similarity_score == 100:
highly_similar.append([a[0], b[0], a[1][0], b[1][0], similarity_score])
This script takes around 15 minutes to run. The dataframe contains 120k users, so comparing each possible combination takes quite a bit of time; if I just write pass in the for loop, it takes 2 minutes to loop through all values.
I tried using filter() and map() for the if statements and fuzzy score but the performance was worse. I tried improving the script as much as I could but I don't know how I can improve this further.
Would really appreciate some help!
It is slightly complicated to reason about the data since you have not attached it, but we can see multiple places that might provide an improvement:
First, let's rewrite the code in a way which is easier to reason about than using the indices:
dictionary = {
t.id: (
t.text,
t.set,
t.compare_string
)
for t in dataframe.itertuples()
}
highly_similar = []
for a, b in itertools.combinations(dictionary.items(), 2):
a_id, (a_text, a_set, a_compare_string) = a
b_id, (b_text, b_set, b_compare_string) = b
if (a_compare_string == b_compare_string
and not a_set.isdisjoint(b_set)):
similarity_score = fuzz.ratio(a_text, b_text)
if ((similarity_score >= 95 and len(a_text) >= 10)
or similarity_score == 100):
highly_similar.append(
[a_id, b_id, a_text, b_text, similarity_score])
You seem to only care about pairs having the same compare_string values. Therefore, and assuming this is not something that all pairs share, we can key by that value to cover far fewer pairs.
To put some numbers on it, let's say your 120K inputs split into 120 compare_string bins of 1K items each. Then instead of covering 120K * 120K ≈ 14 * 10^9 combinations, you would have 120 bins (where in each bin we need to check all pairs), i.e. 120 * 1K * 1K = 120 * 10^6, which is about 1000 times faster. And it would be even faster if each bin has fewer than 1K elements.
import collections
# Create a dictionary from compare_string to all items
# with the same compare_string
items_by_compare_string = collections.defaultdict(list)
for item in dictionary.items():
compare_string = item[1][2]
items_by_compare_string[compare_string].append(item)
# Iterate over each group of items that have the same
# compare string
for item_group in items_by_compare_string.values():
# Check pairs only within that group
for a, b in itertools.combinations(item_group, 2):
a_id, (a_text, a_set, _) = a
b_id, (b_text, b_set, _) = b
# No need to compare the compare_strings!
if not a_set.isdisjoint(b_set):
similarity_score = fuzz.ratio(a_text, b_text)
if ((similarity_score >= 95 and len(a_text) >= 10)
or similarity_score == 100):
highly_similar.append(
[a_id, b_id, a_text, b_text, similarity_score])
But, what if we want more speed? Let's look at the remaining operations:
We have a check to find if two sets share at least one item
This seems like an obvious candidate for optimization if we have any knowledge about these sets (to allow us to determine which pairs are even relevant to compare)
Without additional knowledge, and just looking at every two pairs and trying to speed this up, I doubt we can do much - this is probably highly optimized using internal details of Python sets, I don't think it's likely to optimize it further
We have a fuzz.ratio computation, which is an external function that I'm going to assume is heavy
If you are using this from the FuzzyWuzzy package, make sure to install python-Levenshtein to get the speedups detailed here
We have some comparisons which we are unlikely to be able to speed up
We might be able to cache the length of a_text by nesting the two loops, but that's negligible
We have appends to a list, which run in amortized constant time per operation, so we can't really speed that up
Therefore, I don't think we can reasonably suggest any more speedups without additional knowledge. If we know something about the sets that can help optimize which pairs are relevant we might be able to speed things up further, but I think this is about it.
EDIT: As pointed out in other answers, you can obviously run the code in multi-threading. I assumed you were looking for an algorithmic change that would possibly reduce the number of operations significantly, instead of just splitting these over more CPUs.
Essentially, from the Python programming side, I see two things that can improve your processing time:
multi-threading and vectorized operations
From the fuzzy score side, here is a list of tips you can use to improve your processing time (open in a new anonymous tab to avoid the paywall):
https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536
Using multiple threads you can speed your operation up by up to N times, where N is the number of threads on your CPU. You can check it with:
import multiprocessing
multiprocessing.cpu_count()
Using vectorized operations you can process your data in parallel at a low level with SIMD (single instruction, multiple data) operations, or with GPU tensor operations (like those in TensorFlow/PyTorch).
Here is a small comparison of results for each case:
import numpy as np
import time
A = [np.random.rand(512) for i in range(2000)]
B = [np.random.rand(512) for i in range(2000)]
high_similarity = []
def measure(i,j,a,b,high_similarity):
d = ((a-b)**2).sum()
if d>12:
high_similarity.append((i,j,d))
start_single_thread = time.time()
for i in range(len(A)):
for j in range(len(B)):
if i<j:
measure(i,j,A[i],B[j],high_similarity)
finis_single_thread = time.time()
print("single thread time:",finis_single_thread-start_single_thread)
out[0] single thread time: 147.64517450332642
running on multi thread:
from threading import Thread
high_similarity = []
def measure(a = None,b= None,high_similarity = None):
d = ((a-b)**2).sum()
if d > 12:
high_similarity.append(d)
start_multi_thread = time.time()
for i in range(len(A)):
for j in range(len(B)):
if i<j:
thread = Thread(target=measure,kwargs= {'a':A[i],'b':B[j],'high_similarity':high_similarity} )
thread.start()
thread.join()
finish_multi_thread = time.time()
print("time to run on multi threads:",finish_multi_thread - start_multi_thread)
out[1] time to run on multi-threads: 11.946279764175415
A_array = np.array(A)
B_array = np.array(B)
start_vectorized = time.time()
for i in range(len(A_array)):
# vectorized distance operation against the current alignment of A and B
dists = ((A_array - B_array)**2).sum(axis=1)
high_similarity += dists[dists > 12].tolist()
# rotate B_array so that each row of A is eventually compared with every row of B
B_array = np.roll(B_array, 1, axis=0)
finish_vectorized = time.time()
print("time to run vectorized operations:",finish_vectorized-start_vectorized)
out[2] time to run vectorized operations: 2.302949905395508
Note that you can't guarantee any order of execution, so you will also need to store the index of each result. The snippet of code is just to illustrate that you can use parallel processing, but I highly recommend using a pool of workers, dividing your dataset into N subsets (one per worker), and joining the final results (instead of creating a thread for each function call as I did).
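A rough sketch of that pool-of-workers pattern, using processes rather than threads to sidestep the GIL (the chunking, the threshold of 12 and the distance function are assumptions matching the toy example above):
import itertools
from multiprocessing import Pool, cpu_count

import numpy as np

def compare_chunk(args):
    rows, A, B, threshold = args
    found = []
    for i in rows:
        dists = ((A[i] - B) ** 2).sum(axis=1)   # squared distance of A[i] to every row of B
        for j in np.nonzero(dists > threshold)[0]:
            found.append((i, int(j), float(dists[j])))
    return found

if __name__ == "__main__":
    A = np.random.rand(2000, 512)
    B = np.random.rand(2000, 512)
    # split the row indices of A into one chunk per CPU and compare each chunk in its own process
    chunks = np.array_split(np.arange(len(A)), cpu_count())
    with Pool() as pool:
        partial = pool.map(compare_chunk, [(rows, A, B, 12.0) for rows in chunks])
    high_similarity = list(itertools.chain.from_iterable(partial))
    print(len(high_similarity), "pairs above the threshold")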
I'm trying to find the best possible combination that will maximize my sum value, but it has to be under 2 specific constraints, therefore I am assuming Linear programming will be the best fit.
The problem goes like this:
Some educational world-event wish to gather the world's smartest teen students.
Every state tested 100K students on the following exams: 'MATH', 'ENGLISH', 'COMPUTERS', 'HISTORY', 'PHYSICS', and students were graded 0-100 on EACH exam.
Every state was requested to send their best 10K from the tested 100K students for the event.
You, as the French representative, were requested to choose the top 10K students from the tested 100K students in your country. For that, you'll need to optimize the MAX VALUE from them in order to get the best possible TOTAL SCORE.
BUT there are 2 main constraints:
1 - from the total of 10K chosen students, you need to allocate specific students who will be tested at the event on exactly 1 of the 5 mentioned subjects.
the allocation needed is: ['MATH': 4000, 'ENGLISH':3000,'COMPUTERS':2000, 'HISTORY':750,'PHYSICS':250]
2 - Each exam subject's score has to be weighted differently, e.g. a 97 in Math is worth more than a 97 in History.
the weights are: ['MATH': 1.9, 'ENGLISH':1.7,'COMPUTERS':1.5, 'HISTORY':1.3,'PHYSICS':1.1]
MY SOLUTION:
I tried to use PuLP (Python) as an LP library and solved it correctly, but it took more than 2 HOURS to run.
can you find a better (faster, simpler..) way to solve it?
there are some NUMPY LP functions that could be used instead; maybe they will be faster?
it is supposed to be a simple OPTIMIZATION problem but I made it too slow and complex.
--The solution needs to be in Python only please
for example, let's look at the same problem on a small scale:
there are 30 students and you need to choose only 15 students that will give us the best combination in relation to the following subject allocation demand.
the allocation needed is- ['MATH': 5, 'ENGLISH':4,'COMPUTERS':3, 'HISTORY':2,'PHYSICS':1]
this is all the 30 students and their grades:
after running the algorithm, the output solution will be:
here is my full code for the ORIGINAL question (100K students):
import pandas as pd
import numpy as np
import pulp as p
import time
t0=time.time()
demand = [4000, 3000, 2000, 750,250]
weight = [1.9,1.7, 1.5, 1.3, 1.1]
original_data= pd.read_csv('GRADE_100K.csv') #created simple csv file with random scores
data_c=original_data.copy()
data_c.index = np.arange(1, len(data_c)+1)
data_c.columns
data_c=data_c[['STUDENT_ID', 'MATH', 'ENGLISH', 'COMPUTERS', 'HISTORY','PHYSICS']]
#DataFrame Shape
m=data_c.shape[1]
n=data_c.shape[0]
data=[]
sublist=[]
for j in range(0,n):
for i in range(1,m):
sublist.append(data_c.iloc[j,i])
data.append(sublist)
sublist=[]
def _get_num_students(data):
return len(data)
def _get_num_subjects(data):
return len(data[0])
def _get_weighted_data(data, weight):
return [
[a*b for a, b in zip(row, weight)]
for row in data
]
data = _get_weighted_data(data, weight)
num_students = _get_num_students(data)
num_subjects = _get_num_subjects(data)
# Create a LP Minimization problem
Lp_prob = p.LpProblem('Problem', p.LpMaximize)
# Create problem Variables
variables_matrix = [[0 for i in range(num_subjects)] for j in range(num_students)]
for i in range(0, num_students):
for j in range(0, num_subjects):
variables_matrix[i][j] = p.LpVariable(f"X({i+1},{j+1})", 0, 1, cat='Integer')
df_d=pd.DataFrame(data=data)
df_v=pd.DataFrame(data=variables_matrix)
ml=df_d.mul(df_v)
ml['coeff'] = ml.sum(axis=1)
coefficients=ml['coeff'].tolist()
# DEALING WITH TARGET FUNCTION VALUE
suming=0
k=0
sumsum=[]
for z in range(len(coefficients)):
suming +=coefficients[z]
if z % 2000==0:
sumsum.append(suming)
suming=0
if z<2000:
sumsum.append(suming)
sumsuming=0
for s in range(len(sumsum)):
sumsuming=sumsuming+sumsum[s]
Lp_prob += sumsuming
# DEALING WITH the 2 CONSTRAINS
# 1-subject constraints
con1_suming=0
for e in range(num_subjects):
L=df_v.iloc[:,e].to_list()
for t in range(len(L)):
con1_suming +=L[t]
Lp_prob += con1_suming <= demand[e]
con1_suming=0
# 2- students constraints
con2_suming=0
for e in range(num_students):
L=df_v.iloc[e,:].to_list()
for t in range(len(L)):
con2_suming +=L[t]
Lp_prob += con2_suming <= 1
con2_suming=0
print("time taken for TARGET+CONSTRAINS %8.8f seconds" % (time.time()-t0) )
t1=time.time()
status = Lp_prob.solve() # Solver
print("time taken for SOLVER %8.8f seconds" % (time.time()-t1) ) # 632 SECONDS
print(p.LpStatus[status]) # The solution status
print(p.value(Lp_prob.objective))
df_v=pd.DataFrame(data=variables_matrix)
# Printing the final solution
lst=[]
val=[]
for i in range(0, num_students):
lst.append([p.value(variables_matrix[i][j]) for j in range(0, num_subjects)])
val.append([sum([p.value(variables_matrix[i][j]) for j in range(0, num_subjects)]),i])
ones_places=[]
for i in range (0, len(val)):
if val[i][0]==1:
ones_places.append(i+1)
len(ones_places)
data_once=data_c[data_c['STUDENT_ID'].isin(ones_places)]
IDs=[]
for i in range(len(ones_places)):
IDs.append(data_once['STUDENT_ID'].to_list()[i])
course=[]
sub_course=[]
for i in range(len(lst)):
j=0
sub_course='x'
while j<len(lst[i]):
if lst[i][j]==1:
sub_course=j
j=j+1
course.append(sub_course)
coures_ones=[]
for i in range(len(course)):
if course[i]!= 'x':
coures_ones.append(course[i])
# adding the COURSE name to the final table
# NUMBER OF DICTIONARY KEYS based on number of COURSES
col=original_data.columns.values[1:].tolist()
dic = {0:col[0], 1:col[1], 2:col[2], 3:col[3], 4:col[4]}
cc_name=[dic.get(n, n) for n in coures_ones]
one_c=[]
if len(IDs)==len(cc_name):
for i in range(len(IDs)):
one_c.append([IDs[i],cc_name[i]])
prob=[]
if len(IDs)==len(cc_name):
for i in range(len(IDs)):
prob.append([IDs[i],cc_name[i], data_once.iloc[i][one_c[i][1]]])
scoring_table = pd.DataFrame(prob,columns=['STUDENT_ID','COURES','SCORE'])
scoring_table.sort_values(by=['COURES', 'SCORE'], ascending=[False, False], inplace=True)
scoring_table.index = np.arange(1, len(scoring_table)+1)
print(scoring_table)
I think you're close on this. It is a fairly standard Integer Linear Program (ILP) assignment problem. It's gonna be a bit slow because of the structure of the problem.
You didn't say in your post what the breakdown of the setup & solve times were. I see you are reading from a file and using pandas. I think pandas gets clunky pretty quick with optimization problems, but that is just a personal preference.
I coded your problem up in pyomo, using the cbc solver, which I'm pretty sure is the same one used by pulp for comparison. (see below). I think you have it right with 2 constraints and a dual-indexed binary decision variable.
If I chop it down to 10K students (no slack...just 1-for-1 pairing) it solves in 14 sec for comparison. My setup is a 5 year old iMac w/ lots of ram.
Running with 100K students in the pool, it solves in about 25 min with 10 sec "setup" time before the solver is invoked. So I'm not really sure why your encoding is taking 2 hrs. If you can break down your solver time, that would help. The rest should be trivial. I didn't poke around too much in the output, but the OBJ function value of 980K seems reasonable.
Other ideas:
If you can get the solver options to configure properly and set a mip gap of 0.05 or so, it should speed things way up, if you can accept a slightly non-optimal solution. I've only had decent luck with solver options with the paid-for solvers like Gurobi. I haven't had much luck with that using the freebie solvers, YMMV.
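For example, with CBC through pyomo, setting a relative gap applied to the model m built below might look roughly like this (hedged: ratioGap is the option name the CBC executable accepts; other solvers use different names):
# Sketch: accept any solution proven within 5% of the best bound
solver = pyo.SolverFactory('cbc')
solver.options['ratioGap'] = 0.05   # assumption: CBC's relative MIP gap option
solution = solver.solve(m, tee=True)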
import pyomo.environ as pyo
from random import randint
from time import time
# start setup clock
tic = time()
# exam types
subjects = ['Math', 'English', 'Computers', 'History', 'Physics']
# make set of students...
num_students = 100_000
students = [f'student_{s}' for s in range(num_students)]
# make 100K fake scores in "flat" format
student_scores = { (student, subj) : randint(0,100)
for student in students
for subj in subjects}
assignments = { 'Math': 4000, 'English': 3000, 'Computers': 2000, 'History': 750, 'Physics': 250}
weights = {'Math': 1.9, 'English': 1.7, 'Computers': 1.5, 'History': 1.3, 'Physics': 1.1}
# Set up model
m = pyo.ConcreteModel('exam assignments')
# Sets
m.subjects = pyo.Set(initialize=subjects)
m.students = pyo.Set(initialize=students)
# Parameters
m.assignments = pyo.Param(m.subjects, initialize=assignments)
m.weights = pyo.Param(m.subjects, initialize=weights)
m.scores = pyo.Param(m.students, m.subjects, initialize=student_scores)
# Variables
m.x = pyo.Var(m.students, m.subjects, within=pyo.Binary) # binary selection of pairing student to test
# Objective
m.OBJ = pyo.Objective(expr=sum(m.scores[student, subject] * m.x[student, subject]
for student in m.students
for subject in m.subjects), sense=pyo.maximize)
### Constraints ###
# fill all assignments
def fill_assignments(m, subject):
return sum(m.x[student, subject] for student in m.students) == assignments[subject]
m.C1 = pyo.Constraint(m.subjects, rule=fill_assignments)
# use each student at most 1 time
def limit_student(m, student):
return sum(m.x[student, subject] for subject in m.subjects) <= 1
m.C2 = pyo.Constraint(m.students, rule=limit_student)
toc = time()
print (f'setup time: {toc-tic:0.3f}')
tic = toc
# solve it..
solver = pyo.SolverFactory('cbc')
solution = solver.solve(m)
print(solution)
toc = time()
print (f'solve time: {toc-tic:0.3f}')
Output
setup time: 10.835
Problem:
- Name: unknown
Lower bound: -989790.0
Upper bound: -989790.0
Number of objectives: 1
Number of constraints: 100005
Number of variables: 500000
Number of binary variables: 500000
Number of integer variables: 500000
Number of nonzeros: 495094
Sense: maximize
Solver:
- Status: ok
User time: -1.0
System time: 1521.55
Wallclock time: 1533.36
Termination condition: optimal
Termination message: Model was solved to optimality (subject to tolerances), and an optimal solution is available.
Statistics:
Branch and bound:
Number of bounded subproblems: 0
Number of created subproblems: 0
Black box:
Number of iterations: 0
Error rc: 0
Time: 1533.8383190631866
Solution:
- number of solutions: 0
number of solutions displayed: 0
solve time: 1550.528
Here are some more details on my idea of using min cost flows.
We model this problem by taking a directed graph with 4 layers, where each layer is fully connected to the next.
Nodes
First layer: A single node s that will be our source.
Second layer: One node for each student.
Third layer: One node for each subject.
Fourth layer: A single node t that will be our drain.
Edge Capacities
First -> Second: All edges have capacity 1.
Second -> Third: All edges have capacity 1.
Third -> Fourth: All edges have a capacity corresponding to the number of students that have to be assigned to that subject.
Edge Costs
First -> Second: All edges have cost 0.
Second -> Third: Remember that edges in this layer connect a student with a subject. The cost on each of these edges is the negated weighted score the student has on that subject:
cost = -subject_weight*student_subject_score.
Third -> Fourth: All edges have cost 0.
Then we demand a flow from s to t equal to the number of students we have to choose.
Why does this work?
A solution to the min cost flow problem will correspond to a solution of your problem by taking all the edges between the third and fourth layer as assignments.
Each student can be chosen for at most one subject, as the corresponding node has only one incoming edge.
Each subject has exactly the number of required students, as the outgoing capacity corresponds to the number of students we have to choose for this subject and we have to use the full capacity of these edges, as we can not fulfil the flow demand otherwise.
A minimum-cost solution to the MCF problem corresponds to a maximum-value solution of your problem, as the costs are the negated values the assignments contribute.
As you asked for a solution in python I implemented the min cost flow problem with ortools. Finding a solution took less than a second in my colab notebook. What takes "long" is the extraction of the solution. But including setup and solution extraction I am still having a runtime of less than 20s for the full 100000 student problem.
Code
# imports
from ortools.graph import pywrapgraph
import numpy as np
import pandas as pd
import time
t_start = time.time()
# setting given problem parameters
num_students = 100000
subjects = ['MATH', 'ENGLISH', 'COMPUTERS', 'HISTORY','PHYSICS']
num_subjects = len(subjects)
demand = [4000, 3000, 2000, 750, 250]
weight = [1.9,1.7, 1.5, 1.3, 1.1]
# generating student scores
student_scores_raw = np.random.randint(101, size=(num_students, num_subjects))
# setting up graph nodes
source_nodes = [0]
student_nodes = list(range(1, num_students+1))
subject_nodes = list(range(num_students+1, num_subjects+num_students+1))
drain_nodes = [num_students+num_subjects+1]
# setting up the min cost flow edges
start_nodes = [int(c) for c in (source_nodes*num_students + [i for i in student_nodes for _ in subject_nodes] + subject_nodes)]
end_nodes = [int(c) for c in (student_nodes + subject_nodes*num_students + drain_nodes*num_subjects)]
capacities = [int(c) for c in ([1]*num_students + [1]*num_students*num_subjects + demand)]
unit_costs = [int(c) for c in ([0.]*num_students + list((-student_scores_raw*np.array(weight)*10).flatten()) + [0.]*num_subjects)]
assert len(start_nodes) == len(end_nodes) == len(capacities) == len(unit_costs)
# setting up the min cost flow demands
supplies = [sum(demand)] + [0]*(num_students + num_subjects) + [-sum(demand)]
# initialize the min cost flow problem instance
min_cost_flow = pywrapgraph.SimpleMinCostFlow()
for i in range(0, len(start_nodes)):
min_cost_flow.AddArcWithCapacityAndUnitCost(start_nodes[i], end_nodes[i], capacities[i], unit_costs[i])
for i in range(0, len(supplies)):
min_cost_flow.SetNodeSupply(i, supplies[i])
# solve the problem
t_solver_start = time.time()
if min_cost_flow.Solve() == min_cost_flow.OPTIMAL:
print('Best Value:', -min_cost_flow.OptimalCost()/10)
print('Solver time:', str(time.time()-t_solver_start)+'s')
print('Total Runtime until solution:', str(time.time()-t_start)+'s')
#extracting the solution
solution = []
for i in range(min_cost_flow.NumArcs()):
if min_cost_flow.Flow(i) > 0 and min_cost_flow.Tail(i) in student_nodes:
student_id = min_cost_flow.Tail(i)-1
subject_id = min_cost_flow.Head(i)-1-num_students
solution.append([student_id, subjects[subject_id], student_scores_raw[student_id, subject_id]])
assert(len(solution) == sum(demand))
solution = pd.DataFrame(solution, columns = ['student_id', 'subject', 'score'])
print(solution.head())
else:
print('There was an issue with the min cost flow input.')
print('Total Runtime:', str(time.time()-t_start)+'s')
Replacing the for-loop for the solution extraction in the above code with the following list comprehension (which also avoids list lookups on every iteration), the runtime can be improved significantly. For readability reasons I will leave the old solution above as well. Here is the new one:
solution = [[min_cost_flow.Tail(i)-1,
subjects[min_cost_flow.Head(i)-1-num_students],
student_scores_raw[min_cost_flow.Tail(i)-1, min_cost_flow.Head(i)-1-num_students]
]
for i in range(min_cost_flow.NumArcs())
if (min_cost_flow.Flow(i) > 0 and
min_cost_flow.Tail(i) <= num_students and
min_cost_flow.Tail(i)>0)
]
The following output is giving the runtimes for the new faster implementation.
Output
Best Value: 1675250.7
Solver time: 0.542395830154419s
Total Runtime until solution: 1.4248979091644287s
student_id subject score
0 3 ENGLISH 99
1 5 MATH 98
2 17 COMPUTERS 100
3 22 COMPUTERS 100
4 33 ENGLISH 100
Total Runtime: 1.752336025238037s
Please point out any mistakes I might have made.
I hope this helps. ;)
I am trying to minimize a function multiple times. I developed a class called BlackScholesModel; this class contains the function to be minimized which is the method difference from the BlackScholesModel class.
I created a nested dictionary to save each class object. The code of the nested dictionary is:
expirations = ('2020-12-17', '2021-12-16')
today = datetime.now()
stock_price = stockquotes.Stock("^GSPC").current_price
BSM = {name:name for name in expirations}
for i, a in enumerate(expirations):
strikeP = {count:count for count in range(0,len(expirations))}
for j in range(0,len(strike[a])):
strikeP[j] = BlackScholesModel(datetime.strptime(expirations[i],"%Y-%m-%d"),\
today,stock_price,strike[a][j],\
premium=call_premium[a][j])
BSM[a] = strikeP
Output:
{'2020-12-17': {0: <BlackScholesMerton.BlackScholesModel at 0x22deb7f5708>,
1: <BlackScholesMerton.BlackScholesModel at 0x22debc805c8>,
2: <BlackScholesMerton.BlackScholesModel at 0x22dec1312c8>},
'2021-12-16': {0: <BlackScholesMerton.BlackScholesModel at 0x22debd324c8>,
1: <BlackScholesMerton.BlackScholesModel at 0x22debd36088>,
2: <BlackScholesMerton.BlackScholesModel at 0x22debd36fc8>,}}
Having this nested dictionary, I want to loop through each element and minimize each class method difference; however, the len(BSM['2020-12-17']) is 93 and the len(BSM['2021-12-16']) is 50. I have the following code:
implied_vol = {name:name for name in expirations}
x0 = 1
for i, a in enumerate(expirations):
strikeP = {count:count for count in range(0,len(expirations))}
for j in range(0,len(BSM[a])):
strikeP[j] = minimize(BSM[a][j].difference,x0)['x']
implied_vol[a] = strikeP
Due to the high volume of transactions, the computer cannot complete this operation. I am trying to find a way to make my code more efficient. I want to store each minimum result in a format similar to the nested dictionary BSM. Any help or thought is more than welcome.
Mauricio
Have you tried to estimate the average execution time of the function minimize(BlackScholesModel.difference, x0)? With that estimate, is it possible to perform the number of minimizations you need?
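A quick way to get that estimate (a sketch, reusing the BSM dictionary, x0 and scipy's minimize from your question):
import time

# time a single minimization and extrapolate to the full number of calls
t0 = time.time()
_ = minimize(BSM[expirations[0]][0].difference, x0)
per_call = time.time() - t0
total_calls = sum(len(BSM[a]) for a in expirations)
print(f"~{per_call:.3f} s per minimization, ~{per_call * total_calls / 60:.1f} min for all {total_calls}")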
Assuming it is possible to perform this in a reasonable time, you can do it more quickly by parallelizing the execution:
from joblib import Parallel, delayed
def minimize_expiration(expiration_dict):
strikeP = {}
for i, bs in expiration_dict.items():
strikeP[i] = minimize(bs.difference,x0)['x']
return strikeP
N_JOBS = -1 # number of cores to use (-1 to use all available cores)
implied_vol_list = Parallel(n_jobs=N_JOBS, verbose=1)(delayed(minimize_expiration)(BSM[expiration]) for expiration in expirations)
IV = {expiration: iv for expiration, iv in zip(expirations, implied_vol_list)}
You may have issues with this approach if the implementation of BlackScholesModel is not serializable.
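If pickling the model objects turns out to be the problem, one possible workaround (a sketch) is to switch joblib to its threading backend, which avoids serialization at the cost of running the minimizations under the GIL:
# Sketch: same call as above, but with the threading backend so nothing needs to be pickled
implied_vol_list = Parallel(n_jobs=N_JOBS, backend="threading", verbose=1)(
    delayed(minimize_expiration)(BSM[expiration]) for expiration in expirations)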
For example, given the following 4 constraints, where a and x are ints and b is an array mapping int to int:
a >= 0
b[0] == 10
x == 0
b[x] >= a
find_max(a) => 10
find_min(a) => 0
Can z3py do something like this?
Yeah, sure.
You can either do it incrementally, via multiple single-objective optimization searches, or use the more efficient boxed (a.k.a. Multi-Independent) combination offered by z3 for dealing with multi-objective optimization.
Definition 4.6.3. (Multiple-Independent OMT [LAK+14, BP14, BPF15, ST15b, ST15c]).
Let <φ,O> be a multi-objective OMT problem, where φ
is a ground SMT formula and O = {obj_1 , ..., obj_N},
is a sorted list of N objective functions.
We call Multiple-Independent OMT problem,
a.k.a Boxed OMT problem [BP14, BPF15],
the problem of finding in one single run a set of
models {M_1, ...,M_N} such that each M_i makes
obj_i minimum on the common formula φ.
Remark 4.6.3. Solving a Multiple-Independent
OMT problem <φ, {obj_1, ..., obj_N }> is akin to
independently solving N single-objective OMT
problems <φ, obj_1>, ..., <φ, obj_N>.
However, the former allows for factorizing the search
and thus obtaining a significant performance boost
when compared to the latter approach [LAK+14, BP14, ST15c].
[source, pag. 104]
Example:
from z3 import *
a = Int('a')
x = Int('x')
b = Array('I', IntSort(), IntSort())
opt = Optimize()
opt.add(a >= 0)
opt.add(x == 0)
opt.add(Select(b, 0) == 10)
opt.add(Select(b, x) >= a)
obj1 = opt.maximize(a)
obj2 = opt.minimize(a)
opt.set('priority', 'box') # Setting Boxed Multi-Objective Optimization
is_sat = opt.check()
assert is_sat
print("Max(a): " + str(obj1.value()))
print("Min(a): " + str(obj2.value()))
Output:
~$ python test.py
Max(a): 10
Min(a): 0
See publications on the topic like, e.g.
1. Nikolaj Bjorner and Anh-Dung Phan. νZ - Maximal Satisfaction with Z3. In Proc International Symposium on Symbolic Computation in Software Science, Gammart, Tunisia, December 2014. EasyChair Proceedings in Computing (EPiC). [PDF]
2. Nikolaj Bjorner, Anh-Dung Phan, and Lars Fleckenstein. Z3 - An Optimizing SMT Solver. In Proc. TACAS, volume 9035 of LNCS. Springer, 2015. [Springer] [PDF]