OR-tools routing optimization node compatibility - python

I am trying to solve a capacitated routing problem where I have a set of nodes which require different amounts and different types of items.
In addition I want to allow node drops, because the nodes requiring a single item type might already exceed the vehicle capacity on their own, which would otherwise lead to no solution.
However, eventually all nodes should be served, so I currently use an iterative approach that treats each item type as an individual routing problem.
But I was wondering whether one could use disjunctions or something similar to solve the 'global' routing problem. Any help on whether this is possible is appreciated.
Example:
Node 1 - item A - demand 10
Node 2 - item A - demand 10
Node 3 - item A - demand 12
Node 4 - item B - demand 10
Node 5 - item B - demand 10
vehicle I - capacity 20
vehicle II - capacity 10
My approach:
First solve for item A: vehicle I serves node 1 & 2, node 3 is dropped, save dropped nodes for later iteration
Then solve for item B: vehicle I serves nodes 4 & 5, vehicle II is idle
Solve for remaining node 3: vehicle I serves node 3
EDIT
I adjusted my approach to fit Mizux's answer. The code is below.
EDIT 2
Fixed a bug where the demand callbacks registered in earlier loop iterations would still reference the shared product_index variable and thus return the wrong demand. Fixed by binding the value with functools.partial.
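The underlying issue is Python's late binding of loop variables in closures; a minimal illustration independent of OR-Tools (the names are only for demonstration):

import functools

callbacks_closure = []
callbacks_partial = []
for product_index in range(3):
    # late binding: each lambda reads product_index when it is *called*, so all of them see the final value 2
    callbacks_closure.append(lambda node: (node, product_index))
    # functools.partial freezes the current value of product_index at definition time
    callbacks_partial.append(functools.partial(lambda node, product_index: (node, product_index), product_index=product_index))

print([cb(0) for cb in callbacks_closure])  # [(0, 2), (0, 2), (0, 2)]
print([cb(0) for cb in callbacks_partial])  # [(0, 0), (0, 1), (0, 2)]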
import functools
from ortools.constraint_solver import pywrapcp, routing_enums_pb2


class CVRP():
    def __init__(self, data):
        # assert all(data['demands'] < max(data['vehicle_capacities']))  # if any demand exceeds the capacity, no solution is possible
        self.data = data
        self.vehicle_names_internal = [f'{i}:{j}' for j in data['products'] for i in data['vehicle_names']]
        self.manager = pywrapcp.RoutingIndexManager(len(data['distance_matrix']), len(self.vehicle_names_internal), data['depot'])
        self.routing = pywrapcp.RoutingModel(self.manager)
        transit_callback_id = self.routing.RegisterTransitCallback(self._dist_callback)
        self.routing.SetArcCostEvaluatorOfAllVehicles(transit_callback_id)

        # set up a dimension for each product type for the vehicle capacity constraint
        for product_index, product in enumerate(data['products']):
            dem_product_callback = functools.partial(self._dem_callback_generic, product_index=product_index)
            dem_callback_id = self.routing.RegisterUnaryTransitCallback(dem_product_callback)
            vehicle_product_capacity = [0 for i in range(len(self.vehicle_names_internal))]
            vehicle_product_capacity[product_index*data['num_vehicles']:product_index*data['num_vehicles']+data['num_vehicles']] = data['vehicle_capacities']
            print(product_index, product)
            print(self.vehicle_names_internal)
            print(vehicle_product_capacity)
            self.routing.AddDimensionWithVehicleCapacity(
                dem_callback_id,
                0,
                vehicle_product_capacity,
                True,
                f'capacity_{product}',
            )

        # disjunctions (allow node drops)
        penalty = int(self.data['distance_matrix'].sum() + 1)  # penalty needs to be higher than the total travel distance so locations are only dropped if no other feasible solution exists
        for field_pos_idx_arr in self.data['disjunctions']:
            self.routing.AddDisjunction([self.manager.NodeToIndex(i) for i in field_pos_idx_arr], penalty)

    def _dist_callback(self, i, j):
        return self.data['distance_matrix'][self.manager.IndexToNode(i)][self.manager.IndexToNode(j)]

    def _dem_callback_generic(self, i, product_index):
        node = self.manager.IndexToNode(i)
        if node == self.data['depot']:
            return 0
        else:
            return self.data['demands'][node, product_index]

    def solve(self, verbose=False):
        search_parameters = pywrapcp.DefaultRoutingSearchParameters()
        search_parameters.first_solution_strategy = (
            routing_enums_pb2.FirstSolutionStrategy.AUTOMATIC)
        search_parameters.local_search_metaheuristic = (
            routing_enums_pb2.LocalSearchMetaheuristic.AUTOMATIC)
        search_parameters.time_limit.FromSeconds(30)
        self.solution = self.routing.SolveWithParameters(search_parameters)
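For reference, a minimal usage sketch of the class above. The data layout is my assumption from how the callbacks index into it (a numpy distance matrix, a 2D demands array indexed by [node, product], one disjunction per customer); the numbers mirror the example at the top of the question:

import numpy as np

rng = np.random.default_rng(0)
dist = rng.integers(1, 50, size=(6, 6))  # placeholder distances: depot + 5 customers
np.fill_diagonal(dist, 0)

data = {
    'distance_matrix': dist,
    # rows: depot + 5 customers, columns: products A and B
    'demands': np.array([[0, 0], [10, 0], [10, 0], [12, 0], [0, 10], [0, 10]]),
    'products': ['A', 'B'],
    'vehicle_names': ['I', 'II'],
    'num_vehicles': 2,
    'vehicle_capacities': [20, 10],
    'depot': 0,
    'disjunctions': [[i] for i in range(1, 6)],  # every customer may be dropped individually
}

cvrp = CVRP(data)
cvrp.solve()
print(cvrp.solution is not None)  # True if the solver found a (possibly partial) solution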

You should create two capacity dimensions, one for each item type.
At each location you increase the relevant dimension.
You can duplicate your vehicles for each item type, e.g.:
v0, Vehicle 1 Type A with: capacity A: 20, capacity B: 0
v1, Vehicle 1 Type B with: capacity A: 0, capacity B: 20
v2, Vehicle 2 Type A with: capacity A: 10, capacity B: 0
v3, Vehicle 2 Type B with: capacity A: 0, capacity B: 10
note: you can replicate it to allow multi-trips
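In terms of the API, the duplication just shows up as one capacity list per dimension, indexed by virtual vehicle; a sketch for the four virtual vehicles listed above (the demand callback ids are placeholders):

# virtual vehicles in order: v0 = I/type A, v1 = I/type B, v2 = II/type A, v3 = II/type B
capacity_A = [20, 0, 10, 0]  # vector for the "capacity A" dimension
capacity_B = [0, 20, 0, 10]  # vector for the "capacity B" dimension
# routing.AddDimensionWithVehicleCapacity(demand_A_callback_id, 0, capacity_A, True, 'capacity_A')
# routing.AddDimensionWithVehicleCapacity(demand_B_callback_id, 0, capacity_B, True, 'capacity_B')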
You can create a "gate" node to allow only one vehicle configuration.
e.g. To only allow v0 or v1 to do some visit
v0_start = routing.Start(0)
v0_end = routing.End(0)
v1_start = routing.Start(1)
v1_end = routing.End(1)
gate_index = manager.NodeToIndex(gate_node)  # gate_node: the location chosen to act as the "gate"
routing.NextVar(v0_start).SetValues([gate_index, v0_end])
routing.NextVar(v1_start).SetValues([gate_index, v1_end])
Since a node can only be visited once, only one vehicle among v0 and v1 can pass through the gate node, while the other has no choice but to go straight to its end node, i.e. an empty route you can remove when post-processing the assignment.
You can also add a vehicle fixed cost to incentivize the solver to use vehicle II when it is cheaper than vehicle I, etc.
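For example (a sketch; the cost values are arbitrary and the vehicle indices follow the v0-v3 listing above):

routing.SetFixedCostOfVehicle(100, 0)  # v0: vehicle I, type A
routing.SetFixedCostOfVehicle(100, 1)  # v1: vehicle I, type B
routing.SetFixedCostOfVehicle(10, 2)   # v2: vehicle II, type A
routing.SetFixedCostOfVehicle(10, 3)   # v3: vehicle II, type B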
Add each location to a disjunction so the solver can drop them if needed
location_index = manager.NodeToIndex(location_id)
routing.AddDisjunction(
[location_index], # locations
penalty,
max_cardinality=1 # you can omit it since it is already 1 by default
)

Related

Pulp Python linear programming problem seems to ignore my constraints

I have a Python script (PuLP library) for allocating funds among a number of clients depending on their current level of funding (gap/requirements) and their membership in priority groups. However, I am not getting the expected results.
In particular, I want:
All allocations must be positive and their sum should be equal to the total available money I have.
I want to minimize the target funding gap for the most vulnerable group (group A), and then I want the target gap % of each less vulnerable group to increase by 10% relative to the previous one (for group B = funding gap of A × 1.1, for group C = funding gap of B × 1.1, ...).
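Formally, my reading of the intended model is the following (the cascade constraint is my interpretation of the description above):

$$\min\; g_A \quad \text{s.t.}\quad 0 \le a_i \le \text{gap}_i,\qquad \sum_i a_i = F,\qquad g_{\text{group}(i)} \ge \frac{\text{gap}_i - a_i}{\text{req}_i}\ \ \forall i,\qquad g_B = 1.1\, g_A,\ \ g_C = 1.1\, g_B,\ \ldots$$

where $a_i$ is the allocation to project $i$, $F$ is the additional funding, and $g_G$ is the target gap % of group $G$.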
I have tried this:
"""
DECISION VARIABLES
"""
# Create a continuous Decision Variable and Affine Expression for the amount of additional funding received by each
# project
allocation = {}
allocation_expr = LpAffineExpression()
for z in range(n):
if priority[z] == 'X' or (requirements[z] == 0 and skip_zero_requirements):
# Projects in Priority Group 'X' don't get any allocation
allocation[project_names[z]] = pulp.LpVariable(f'allocation_{project_names[z]}', lowBound=0, upBound=0)
else:
# allocation is non negative and cannot be greater than the initial gap
allocation[project_names[z]] = pulp.LpVariable(f'allocation_{project_names[z]}', lowBound=0, upBound=(gap[z]))
allocation_expr += allocation[project_names[z]]
# Create a continuous Decision Variable and Affine Expression for the maximum GAP% within each priority group
target_group_A_expr = LpAffineExpression()
target_group_A = pulp.LpVariable(f'allocation', lowBound=0 )
target_group_A_expr += target_group_A
"""
LINEAR PROGRAMMING PROBLEM
"""
# Create the linear programming problem object
lp_prob = pulp.LpProblem('Multi-Objective Optimization', pulp.LpMaximize)
"""
OBJECTIVE FUNCTIONS
"""
# Define the objective function as an LpAffineExpression
obj = LpAffineExpression()
# MAXIMIZE the portion of additional funding allocated to projects
obj += allocation_expr
# MINIMIZE the Max GAP% within each group [actually Maximizing the -(Max GAP%)]
obj += -target_group_A_expr
# Set the Objective Function
lp_prob += obj
"""
CONSTRAINTS
"""
# Additional funding allocations to individual projects must be non-negative and not greater than the project's gap
#for v in range(n):
# lp_prob += allocation[project_names[v]] <= gap[v]
# lp_prob += allocation[project_names[v]] >= 0
# The sum of allocations to individual projects cannot be greater than the additional funding
lp_prob += pulp.lpSum([allocation[project_names[u]] for u in range(n)]) <= additional_funding
# The Max GAP % within each group >= of the GAP % of all projects in the group (proxy for dynamic max calculation)
for i, (p, group) in enumerate(priority_groups.items()):
# Get the indices of the projects in the group
group_indices = priority_groups[p] #selects the indices matching with the rows of the projects belonging to that group
# Iterate over the indices of the projects in the group
for index in group:
# Create an LpAffineExpression for the GAP% of the project
project_gap_percentage = LpAffineExpression()
if requirements[index] == 0:
project_gap_percentage += 0
else:
project_gap_percentage += (gap[index] - allocation[project_names[index]]) / requirements[index]
# Add constraint to the model
lp_prob += target_group_A == (project_gap_percentage/pow(delta_gap, i))
"""
PROGRAMMING MODEL SOLVER
"""
# Solve the linear programming problem
lp_prob.solve()
delta_gap and the additional_funding are external parameters.
I even receive negative allocations, and the constraints are not always met; e.g. in groups B and C I reach funding gap levels much lower than the level of group A, and sometimes they randomly go to zero. How is this possible?
I am considering using another library; any suggestions?

I'm trying to write code that finds the topological order for a network. When I run the program it usually gets stuck and keeps running forever.

from link import Link
from node import Node
from path import Path
from od import OD
import sys
import traceback
import utils
class BadNetworkOperationException(Exception):
    """
    You can raise this exception if you try a network action which is invalid
    (e.g., trying to find a topological order on a network with cycles.)
    """
    pass
class Network:
    """
    This is the class used for transportation networks. It uses the following
    dictionaries to store the network; the keys are IDs for the network elements,
    and the values are objects of the relevant type:
        node -- network nodes; see node.py for description of this class
        link -- network links; see link.py for description of this class
        ODpair -- origin-destination pairs; see od.py
        path -- network paths; see path.py. Paths are NOT automatically generated
                when the network is initialized (you probably wouldn't want this,
                the number of paths is exponential in network size.)
    The network topology is expressed both in links (through the tail and head
    nodes) and in nodes (forwardStar and reverseStar are Node attributes storing
    the IDs of entering and leaving links in a list).
    numNodes, numLinks, numZones -- self-explanatory
    firstThroughNode -- in the TNTP data format, transiting through nodes with
                        low IDs can be prohibited (typically for centroids; you
                        may not want vehicles to use these as "shortcuts").
                        When implementing shortest path or other routefinding,
                        you should prevent trips from using nodes with lower
                        IDs than firstThroughNode, unless it is the destination.
    """

    def __init__(self, networkFile="", demandFile=""):
        """
        Class initializer; if both a network file and demand file are specified,
        will read these files to fill the network data structure.
        """
        self.numNodes = 0
        self.numLinks = 0
        self.numZones = 0
        self.firstThroughNode = 0
        self.node = dict()
        self.link = dict()
        self.ODpair = dict()
        self.path = dict()
        if len(networkFile) > 0 and len(demandFile) > 0:
            self.readFromFiles(networkFile, demandFile)
    def findLeastEnteringLinks(self):
        """
        This method should return the ID of the node with the least number
        of links entering the node. Ties can be broken arbitrarily.
        """
        # *** YOUR CODE HERE ***
        # Replace the following statement with your code.
        # raise utils.NotYetAttemptedException
        minvalue = 1000
        for i in self.node:
            if (minvalue > len(self.node[i].reverseStar)):
                minvalue = len(self.node[i].reverseStar)
                idnode = i
        return idnode
    def formAdjacencyMatrix(self):
        """
        This method should produce an adjacency matrix, with rows and columns
        corresponding to each node, and entries of 1 if there is a link connecting
        the row node to the column node, and 0 otherwise. This matrix should
        be stored in self.adjacencyMatrix, which is a dictionary of dictionaries:
        the first key is the "row" (tail) node, and the second key is the "column"
        (head) node.
        """
        self.adjacencyMatrix = dict()
        for i in self.node:
            self.adjacencyMatrix[i] = dict()
        # *** YOUR CODE HERE ***
        # Replace the following statement with your code.
        # raise utils.NotYetAttemptedException
        for i in self.node:
            for j in self.node:
                self.adjacencyMatrix[i][j] = 0
        for ij in self.link:
            self.adjacencyMatrix[self.link[ij].tail][self.link[ij].head] = 1
        return self.adjacencyMatrix
    def findTopologicalOrder(self):
        """
        This method should find a topological order for the network, storing
        the order in the 'order' attribute of the nodes, i.e.:
            self.node[5].order
        should store the topological label for node 5.
        The topological order is generally not unique, this method can return any
        valid order. The nodes should be labeled 1, 2, 3, ... up through numNodes.
        If the network has cycles, a topological order does not exist. The presence
        of cycles can be detected in the algorithm for finding a topological order,
        and you should raise an exception if this is detected.
        """
        # *** YOUR CODE HERE ***
        # Replace the following statement with your code.
        # raise utils.NotYetAttemptedException
        for i in self.node:
            self.node[i].order = []
        node_list = list()
        for i in self.node:
            if len(self.node[i].reverseStar) == 0:
                node_list.append(i)
                break
        if len(node_list) == 0:
            raise BadNetworkOperationException()
        while (len(node_list) != self.numNodes):
            for i in [x for x in self.node if x not in node_list]:
                incoming_nodes = [int(j[1]) for j in self.node[i].reverseStar if j not in node_list]
                if (len(incoming_nodes) == 0):
                    node_list.append(i)
                    break
        for k in range(len(node_list)):
            self.node[node_list[k]].order = k + 1
        return self.node[i].order

Linear Programming - Max value optimization

I'm trying to find the best possible combination that will maximize my sum value, but it has to be under 2 specific constraints, therefore I am assuming Linear programming will be the best fit.
The problem goes like this:
An educational world event wishes to gather the world's smartest teen students.
Every state tested 100K students on the following exams: 'MATH', 'ENGLISH', 'COMPUTERS', 'HISTORY', 'PHYSICS', and each student was graded 0-100 on EACH exam.
Every state was requested to send its best 10K of the tested 100K students to the event.
You, as the French representative, were requested to choose the top 10K students from the 100K tested students of your country. For that, you'll need to optimize the MAX VALUE from them in order to get the best possible TOTAL SCORE.
BUT there are 2 main constraints:
1- From the total 10K chosen students you need to allocate specific students that will be tested at the event on 1 specific subject only, out of the 5 mentioned subjects.
The allocation needed is: ['MATH': 4000, 'ENGLISH': 3000, 'COMPUTERS': 2000, 'HISTORY': 750, 'PHYSICS': 250]
2- Each exam subject's score has to be weighted differently, for example: 97 in Math is worth more than 97 in History.
The weights are: ['MATH': 1.9, 'ENGLISH': 1.7, 'COMPUTERS': 1.5, 'HISTORY': 1.3, 'PHYSICS': 1.1]
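Written out, this is a standard assignment-style integer program (my formalization; here $x_{ij} \in \{0,1\}$ selects student $i$ for subject $j$, $s_{ij}$ is the raw score, $w_j$ the subject weight and $d_j$ the required quota):

$$\max \sum_{i}\sum_{j} w_j \, s_{ij} \, x_{ij} \quad \text{s.t.} \quad \sum_{i} x_{ij} = d_j \;\; \forall j, \qquad \sum_{j} x_{ij} \le 1 \;\; \forall i, \qquad x_{ij} \in \{0,1\}.$$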
MY SOLUTION:
I tried using PuLP (Python) as an LP library and solved it correctly, but it took more than 2 HOURS to run.
Can you find a better (faster, simpler..) way to solve it?
Are there some NumPy LP functions that could be used instead, which might be faster?
It is supposed to be a simple OPTIMIZATION problem, but I made it too slow and complex.
--The solution needs to be in Python only please
For example, let's look at the same problem on a small scale:
there are 30 students and you need to choose only 15 students that will give us the best combination in relation to the following subject allocation demand.
the allocation needed is- ['MATH': 5, 'ENGLISH':4,'COMPUTERS':3, 'HISTORY':2,'PHYSICS':1]
this is all the 30 students and their grades:
after running the algorithm, the output solution will be:
here is my full code for the ORIGINAL question (100K students):
import pandas as pd
import numpy as np
import pulp as p
import time
t0=time.time()
demand = [4000, 3000, 2000, 750,250]
weight = [1.9,1.7, 1.5, 1.3, 1.1]
original_data= pd.read_csv('GRADE_100K.csv') #created simple csv file with random scores
data_c=original_data.copy()
data_c.index = np.arange(1, len(data_c)+1)
data_c.columns
data_c=data_c[['STUDENT_ID', 'MATH', 'ENGLISH', 'COMPUTERS', 'HISTORY','PHYSICS']]
#DataFrame Shape
m=data_c.shape[1]
n=data_c.shape[0]
data=[]
sublist=[]
for j in range(0, n):
    for i in range(1, m):
        sublist.append(data_c.iloc[j, i])
    data.append(sublist)
    sublist = []
def _get_num_students(data):
    return len(data)

def _get_num_subjects(data):
    return len(data[0])

def _get_weighted_data(data, weight):
    return [
        [a*b for a, b in zip(row, weight)]
        for row in data
    ]
data = _get_weighted_data(data, weight)
num_students = _get_num_students(data)
num_subjects = _get_num_subjects(data)
# Create a LP Minimization problem
Lp_prob = p.LpProblem('Problem', p.LpMaximize)
# Create problem Variables
variables_matrix = [[0 for i in range(num_subjects)] for j in range(num_students)]
for i in range(0, num_students):
    for j in range(0, num_subjects):
        variables_matrix[i][j] = p.LpVariable(f"X({i+1},{j+1})", 0, 1, cat='Integer')
df_d=pd.DataFrame(data=data)
df_v=pd.DataFrame(data=variables_matrix)
ml=df_d.mul(df_v)
ml['coeff'] = ml.sum(axis=1)
coefficients=ml['coeff'].tolist()
# DEALING WITH TARGET FUNCTION VALUE
suming = 0
k = 0
sumsum = []
for z in range(len(coefficients)):
    suming += coefficients[z]
    if z % 2000 == 0:
        sumsum.append(suming)
        suming = 0
    if z < 2000:
        sumsum.append(suming)
sumsuming = 0
for s in range(len(sumsum)):
    sumsuming = sumsuming + sumsum[s]
Lp_prob += sumsuming

# DEALING WITH the 2 CONSTRAINTS
# 1 - subject constraints
con1_suming = 0
for e in range(num_subjects):
    L = df_v.iloc[:, e].to_list()
    for t in range(len(L)):
        con1_suming += L[t]
    Lp_prob += con1_suming <= demand[e]
    con1_suming = 0

# 2 - student constraints
con2_suming = 0
for e in range(num_students):
    L = df_v.iloc[e, :].to_list()
    for t in range(len(L)):
        con2_suming += L[t]
    Lp_prob += con2_suming <= 1
    con2_suming = 0
print("time taken for TARGET+CONSTRAINS %8.8f seconds" % (time.time()-t0) )
t1=time.time()
status = Lp_prob.solve() # Solver
print("time taken for SOLVER %8.8f seconds" % (time.time()-t1) ) # 632 SECONDS
print(p.LpStatus[status]) # The solution status
print(p.value(Lp_prob.objective))
df_v=pd.DataFrame(data=variables_matrix)
# Printing the final solution
lst=[]
val=[]
for i in range(0, num_students):
    lst.append([p.value(variables_matrix[i][j]) for j in range(0, num_subjects)])
    val.append([sum([p.value(variables_matrix[i][j]) for j in range(0, num_subjects)]), i])
ones_places = []
for i in range(0, len(val)):
    if val[i][0] == 1:
        ones_places.append(i+1)
len(ones_places)
data_once = data_c[data_c['STUDENT_ID'].isin(ones_places)]
IDs = []
for i in range(len(ones_places)):
    IDs.append(data_once['STUDENT_ID'].to_list()[i])
course = []
sub_course = []
for i in range(len(lst)):
    j = 0
    sub_course = 'x'
    while j < len(lst[i]):
        if lst[i][j] == 1:
            sub_course = j
        j = j+1
    course.append(sub_course)
coures_ones = []
for i in range(len(course)):
    if course[i] != 'x':
        coures_ones.append(course[i])
# adding the COURSE name to the final table
# NUMBER OF DICTIONARY KEYS based on number of COURSES
col=original_data.columns.values[1:].tolist()
dic = {0:col[0], 1:col[1], 2:col[2], 3:col[3], 4:col[4]}
cc_name=[dic.get(n, n) for n in coures_ones]
one_c = []
if len(IDs) == len(cc_name):
    for i in range(len(IDs)):
        one_c.append([IDs[i], cc_name[i]])
prob = []
if len(IDs) == len(cc_name):
    for i in range(len(IDs)):
        prob.append([IDs[i], cc_name[i], data_once.iloc[i][one_c[i][1]]])
scoring_table = pd.DataFrame(prob,columns=['STUDENT_ID','COURES','SCORE'])
scoring_table.sort_values(by=['COURES', 'SCORE'], ascending=[False, False], inplace=True)
scoring_table.index = np.arange(1, len(scoring_table)+1)
print(scoring_table)
I think you're close on this. It is a fairly standard Integer Linear Program (ILP) assignment problem. It's gonna be a bit slow because of the structure of the problem.
You didn't say in your post what the breakdown of the setup & solve times were. I see you are reading from a file and using pandas. I think pandas gets clunky pretty quick with optimization problems, but that is just a personal preference.
I coded your problem up in pyomo, using the cbc solver, which I'm pretty sure is the same one used by pulp for comparison. (see below). I think you have it right with 2 constraints and a dual-indexed binary decision variable.
If I chop it down to 10K students (no slack...just 1-for-1 pairing) it solves in 14sec for comparison. My setup is a 5 year old iMac w/ lots of ram.
Running with 100K students in the pool, it solves in about 25min with 10sec "setup" time before the solver is invoked. So I'm not really sure why your encoding is taking 2hrs. If you can break down your solver time, that would help. The rest should be trivial. I didn't poke around too much in the output, but the OBJ function value of 980K seems reasonable.
Other ideas:
If you can get the solver options to configure properly and set a mip gap of 0.05 or so, it should speed things way up, if you can accept a slightly non-optimal solution. I've only had decent luck with solver options with the paid-for solvers like Gurobi. I haven't had much luck with that using the freebie solvers, YMMV.
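For instance, with pyomo + CBC something like the following should work (a sketch; the exact option name for the relative gap depends on the CBC version, so treat 'ratioGap' as an assumption to verify):

solver = pyo.SolverFactory('cbc')
solver.options['ratioGap'] = 0.05  # stop once within 5% of the proven bound (option name assumed)
# solution = solver.solve(m)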
import pyomo.environ as pyo
from random import randint
from time import time
# start setup clock
tic = time()
# exam types
subjects = ['Math', 'English', 'Computers', 'History', 'Physics']
# make set of students...
num_students = 100_000
students = [f'student_{s}' for s in range(num_students)]
# make 100K fake scores in "flat" format
student_scores = {(student, subj): randint(0, 100)
                  for student in students
                  for subj in subjects}
assignments = { 'Math': 4000, 'English': 3000, 'Computers': 2000, 'History': 750, 'Physics': 250}
weights = {'Math': 1.9, 'English': 1.7, 'Computers': 1.5, 'History': 1.3, 'Physics': 1.1}
# Set up model
m = pyo.ConcreteModel('exam assignments')
# Sets
m.subjects = pyo.Set(initialize=subjects)
m.students = pyo.Set(initialize=students)
# Parameters
m.assignments = pyo.Param(m.subjects, initialize=assignments)
m.weights = pyo.Param(m.subjects, initialize=weights)
m.scores = pyo.Param(m.students, m.subjects, initialize=student_scores)
# Variables
m.x = pyo.Var(m.students, m.subjects, within=pyo.Binary) # binary selection of pairing student to test
# Objective
m.OBJ = pyo.Objective(expr=sum(m.scores[student, subject] * m.x[student, subject]
                               for student in m.students
                               for subject in m.subjects), sense=pyo.maximize)

### Constraints ###

# fill all assignments
def fill_assignments(m, subject):
    return sum(m.x[student, subject] for student in m.students) == assignments[subject]
m.C1 = pyo.Constraint(m.subjects, rule=fill_assignments)

# use each student at most 1 time
def limit_student(m, student):
    return sum(m.x[student, subject] for subject in m.subjects) <= 1
m.C2 = pyo.Constraint(m.students, rule=limit_student)
toc = time()
print (f'setup time: {toc-tic:0.3f}')
tic = toc
# solve it..
solver = pyo.SolverFactory('cbc')
solution = solver.solve(m)
print(solution)
toc = time()
print (f'solve time: {toc-tic:0.3f}')
Output
setup time: 10.835
Problem:
- Name: unknown
  Lower bound: -989790.0
  Upper bound: -989790.0
  Number of objectives: 1
  Number of constraints: 100005
  Number of variables: 500000
  Number of binary variables: 500000
  Number of integer variables: 500000
  Number of nonzeros: 495094
  Sense: maximize
Solver:
- Status: ok
  User time: -1.0
  System time: 1521.55
  Wallclock time: 1533.36
  Termination condition: optimal
  Termination message: Model was solved to optimality (subject to tolerances), and an optimal solution is available.
  Statistics:
    Branch and bound:
      Number of bounded subproblems: 0
      Number of created subproblems: 0
    Black box:
      Number of iterations: 0
  Error rc: 0
  Time: 1533.8383190631866
Solution:
- number of solutions: 0
  number of solutions displayed: 0
solve time: 1550.528
Here are some more details on my idea of using min cost flows.
We model this problem by taking a directed graph with 4 layers, where each layer is fully connected to the next.
Nodes
First layer: A single node s that will be our source.
Second layer: One node for each student.
Third layer: One node for each subject.
Fourth layer: A single node t that will be our drain.
Edge Capacities
First -> Second: All edges have capacity 1.
Second -> Third: All edges have capacity 1.
Third -> Fourth: Each edge has a capacity corresponding to the number of students that have to be assigned to that subject.
Edge Costs
First -> Second: All edges have cost 0.
Second -> Third: Remember that edges in this layer connect a student with a subject. The cost of such an edge is the negated weighted score of that student on that subject:
cost = -subject_weight*student_subject_score.
Third -> Fourth: All edges have cost 0.
Then we demand a flow from s to t equal to the number of students we have to choose.
Why does this work?
A solution to the min cost flow problem corresponds to a solution of your problem by taking all saturated edges between the second and third layer as assignments.
Each student can be chosen for at most one subject, as the corresponding node has only one incoming edge.
Each subject gets exactly the required number of students, as its outgoing capacity equals the number of students we have to choose for this subject, and we have to use the full capacity of these edges since we cannot fulfill the flow demand otherwise.
A minimal solution of the MCF problem corresponds to a maximal solution of your problem, as the costs are the negated values the assignments contribute.
As you asked for a solution in Python, I implemented the min cost flow problem with ortools. Finding a solution took less than a second in my Colab notebook. What takes "long" is the extraction of the solution. Including setup and solution extraction, I still get a runtime of less than 20s for the full 100,000 student problem.
Code
# imports
from ortools.graph import pywrapgraph
import numpy as np
import pandas as pd
import time
t_start = time.time()
# setting given problem parameters
num_students = 100000
subjects = ['MATH', 'ENGLISH', 'COMPUTERS', 'HISTORY','PHYSICS']
num_subjects = len(subjects)
demand = [4000, 3000, 2000, 750, 250]
weight = [1.9,1.7, 1.5, 1.3, 1.1]
# generating student scores
student_scores_raw = np.random.randint(101, size=(num_students, num_subjects))
# setting up graph nodes
source_nodes = [0]
student_nodes = list(range(1, num_students+1))
subject_nodes = list(range(num_students+1, num_subjects+num_students+1))
drain_nodes = [num_students+num_subjects+1]
# setting up the min cost flow edges
start_nodes = [int(c) for c in (source_nodes*num_students + [i for i in student_nodes for _ in subject_nodes] + subject_nodes)]
end_nodes = [int(c) for c in (student_nodes + subject_nodes*num_students + drain_nodes*num_subjects)]
capacities = [int(c) for c in ([1]*num_students + [1]*num_students*num_subjects + demand)]
unit_costs = [int(c) for c in ([0.]*num_students + list((-student_scores_raw*np.array(weight)*10).flatten()) + [0.]*num_subjects)]
assert len(start_nodes) == len(end_nodes) == len(capacities) == len(unit_costs)
# setting up the min cost flow demands
supplies = [sum(demand)] + [0]*(num_students + num_subjects) + [-sum(demand)]
# initialize the min cost flow problem instance
min_cost_flow = pywrapgraph.SimpleMinCostFlow()
for i in range(0, len(start_nodes)):
    min_cost_flow.AddArcWithCapacityAndUnitCost(start_nodes[i], end_nodes[i], capacities[i], unit_costs[i])
for i in range(0, len(supplies)):
    min_cost_flow.SetNodeSupply(i, supplies[i])
# solve the problem
t_solver_start = time.time()
if min_cost_flow.Solve() == min_cost_flow.OPTIMAL:
    print('Best Value:', -min_cost_flow.OptimalCost()/10)
    print('Solver time:', str(time.time()-t_solver_start)+'s')
    print('Total Runtime until solution:', str(time.time()-t_start)+'s')

    # extracting the solution
    solution = []
    for i in range(min_cost_flow.NumArcs()):
        if min_cost_flow.Flow(i) > 0 and min_cost_flow.Tail(i) in student_nodes:
            student_id = min_cost_flow.Tail(i)-1
            subject_id = min_cost_flow.Head(i)-1-num_students
            solution.append([student_id, subjects[subject_id], student_scores_raw[student_id, subject_id]])
    assert(len(solution) == sum(demand))
    solution = pd.DataFrame(solution, columns=['student_id', 'subject', 'score'])
    print(solution.head())
else:
    print('There was an issue with the min cost flow input.')
print('Total Runtime:', str(time.time()-t_start)+'s')
Replacing the for-loop for the solution extraction in the above code with the following list comprehension (which also avoids list lookups on every iteration), the runtime can be improved significantly. For readability I will leave the old version above as well. Here is the new one:
solution = [[min_cost_flow.Tail(i)-1,
             subjects[min_cost_flow.Head(i)-1-num_students],
             student_scores_raw[min_cost_flow.Tail(i)-1, min_cost_flow.Head(i)-1-num_students]
             ]
            for i in range(min_cost_flow.NumArcs())
            if (min_cost_flow.Flow(i) > 0 and
                min_cost_flow.Tail(i) <= num_students and
                min_cost_flow.Tail(i) > 0)
            ]
The following output is giving the runtimes for the new faster implementation.
Output
Best Value: 1675250.7
Solver time: 0.542395830154419s
Total Runtime until solution: 1.4248979091644287s
student_id subject score
0 3 ENGLISH 99
1 5 MATH 98
2 17 COMPUTERS 100
3 22 COMPUTERS 100
4 33 ENGLISH 100
Total Runtime: 1.752336025238037s
Please point out any mistakes I might have made.
I hope this helps. ;)

Improving BFS performance with some kind of memoization

I'm trying to build an algorithm which will find distances from one vertex to the others in a graph.
Let's say, with a really simple example, that my network looks like this:
network = [[0,1,2],[2,3,4],[4,5,6],[6,7]]
I created BFS code which is supposed to find the lengths of the paths from the specified source to the other vertices of the graph:
from itertools import chain
import numpy as np
n = 8
graph = {}
for i in range(0, n):
    graph[i] = []
for communes in network:
    for vertice in communes:
        work = communes.copy()
        work.remove(vertice)
        graph[vertice].append(work)
for k, v in graph.items():
    graph[k] = list(chain(*v))
def bsf3(graph, s):
    matrix = np.zeros([n, n])
    dist = {}
    visited = []
    queue = [s]
    dist[s] = 0
    visited.append(s)
    matrix[s][s] = 0
    while queue:
        v = queue.pop(0)
        for neighbour in graph[v]:
            if neighbour in visited:
                pass
            else:
                matrix[s][neighbour] = matrix[s][v] + 1
                queue.append(neighbour)
                visited.append(neighbour)
    return matrix
bsf3(graph,2)
First I'm creating the graph (a dictionary) and then using the function to find the distances.
What I'm concerned about is that this approach doesn't work with larger networks (say with 1000 people). So I'm thinking about using some kind of memoization (that's actually why I made a matrix instead of a list). The idea is that when the algorithm calculates the path from, say, 0 to 3 (which it already does), it should also keep track of other routes, so that e.g. matrix[1][3] = 1, etc.
Then, when I call the function as bsf3(graph, 1), it would not calculate everything from scratch, but would be able to reuse some values from the matrix.
Thanks in advance!
I know this doesn't fully answer your question, but here is another approach you can try.
In networks you have a routing table for each node in the network: for every destination node you simply store which neighbouring node you have to go to next. Example of the routing table of node D:
A -> B
B -> B
C -> E
D -> D
E -> E
You need to run BFS from each node to build all routing tables, which takes O(|V|·(|V|+|E|)). The space complexity is quadratic, but that is needed because you have to cover all possible source-destination pairs.
Once you have built all this information, you can simply start from a node, look up your destination node in its table, and follow the next node to go to. This gives a much better query time (if you use the right data structure for the table).
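A minimal sketch of that precomputation, assuming the same adjacency-dict graph as in the question (node -> list of neighbours); build_routing_tables is just an illustrative name:

from collections import deque

def build_routing_tables(graph):
    """For every source node, store the first hop to take towards each reachable destination."""
    tables = {}
    for src in graph:
        first_hop = {src: src}
        dist = {src: 0}
        queue = deque([src])
        while queue:
            v = queue.popleft()
            for neighbour in graph[v]:
                if neighbour not in dist:
                    dist[neighbour] = dist[v] + 1
                    # first hop towards `neighbour`: the neighbour itself if we are at the source,
                    # otherwise whatever first hop led us to v
                    first_hop[neighbour] = neighbour if v == src else first_hop[v]
                    queue.append(neighbour)
        tables[src] = first_hop
    return tables

# usage: tables = build_routing_tables(graph); tables[2][7] is the neighbour node 2 should forward to in order to reach node 7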

Compute reachability of elements in a list of tuples

I have a list of tuples like this.
a = [(1,2),(1,3),(1,4),(2,5),(6,5),(7,8)]
In this list, 1 relates to 2, 2 relates to 5, and 5 relates to 6; therefore 1 relates to 6. Similarly, I need to find the relations between the other elements in the tuples. I need a function that takes two input values and outputs as follows:
input = (1,6) #output = True
input = (5,3) #output = True
input = (2,8) #output = False
I do not have knowledge of itertools or map functions. Can they be used to solve these types of problems?
And for the sake of curiosity and interest where can I find these types of questions to solve and where are these types of problems encountered in real life situations?
This can be easily done by considering the tuples as edges in a graph. The question is then reduced to checking if there is a path between the two nodes.
There exist lots of nice libraries for this; see e.g. networkx:
import networkx as nx
a = [(1,2),(1,3),(1,4),(2,5),(6,5),(7,8)]
G = nx.Graph(a)
nx.has_path(G, 1, 6) # True
nx.has_path(G, 5, 3) # True
nx.has_path(G, 2, 8) # False
This answer here nicely states your problem as a graph problem, where every time you need to run your algorithm you need to check for the existence of a path between your input vertices. The time complexity for every query then depends on the size, order, diameter, degree of the underlying graph.
However, if you intend to run this algorithm many times with the same array a, it may be worth doing some preprocessing on the input graph to find the connected components (Wikipedia : connected components) first. In that case you can get constant time for every query. Here is the code I suggest :
# NOTE : tested using python 3.6.1
# WARNING : no input sanitization
a = [(1,2),(1,3),(1,4),(2,5),(6,5),(7,8)]
n = 8 # order of the underlying graph
# prepare graph as lists of neighbors for every vertex, i.e. adjacency lists (extra unused vertex '0', just to match the value range of the problem)
graph = [[] for i in range(n+1)]
for edge in a:
    graph[edge[0]].append(edge[1])
    graph[edge[1]].append(edge[0])
print( "graph : " + str(graph) )
# list of unprocessed vertices : contains all of them at the beginning
unprocessed_vertices = {i for i in range(1,n+1)}
# subroutine to discover the connected component of a vertex
def build_component():
    component = []  # current connected component
    curr_vertices = {unprocessed_vertices.pop()}  # locally unprocessed vertices, initialize with one of the globally unprocessed vertices
    while len(curr_vertices) > 0:
        curr_vertex = curr_vertices.pop()  # vertex to be processed
        # add unprocessed neighbours of current vertex to the set of vertices to process
        for neighbour in graph[curr_vertex]:
            if neighbour in unprocessed_vertices:
                curr_vertices.add(neighbour)
                unprocessed_vertices.remove(neighbour)
        component.append(curr_vertex)
    return component
# main algorithm : graph traversal on multiple connected components
components = []
while len(unprocessed_vertices) > 0:
    components.append(build_component())
print( "components : " + str(components) )
# assign a number to each component
component_numbers = [None] * (n+1)
curr_number = 1
for comp in components:
    for vertex in comp:
        component_numbers[vertex] = curr_number
    curr_number += 1
print( "component_numbers : " + str(component_numbers) )
# main functionality
def is_connected(pair):
    return component_numbers[pair[0]] == component_numbers[pair[1]]

# run main functionality on inputs: every call is executed in constant time now, regardless of the size of the graph
print( is_connected( (1,6) ) )
print( is_connected( (5,3) ) )
print( is_connected( (2,8) ) )
I don't really know the most likely situations where this problem could be encountered, but I suppose it can have applications in some clustering tasks, or when you want to know whether it is possible to go from one place to another. If the edges of the graph represent dependencies between modules, this problem would tell you whether two parts depend on each other, so there may be potential applications in compiling or the management of large projects. The underlying problem is a "connected components" problem, which is among the problems we know polynomial algorithms for.
It is generally very useful to model these kinds of problems with graphs, as these objects have a very simple structure, and most of the time we can reduce the original problem to a well-known problem on graphs.
