I saw many solutions for generating random floats within a specific range (like this), which actually helped me, and solutions for generating random floats summing to 1 (like this). Each set of solutions works perfectly on its own, but I can't figure out how to merge them.
Currently my code is:
import random

def sample_floats(low, high, k=1):
    """Return a k-length list of unique random floats
    in the range low <= x <= high.
    """
    result = []
    seen = set()
    for i in range(k):
        x = random.uniform(low, high)
        while x in seen:
            x = random.uniform(low, high)
        seen.add(x)
        result.append(x)
    return result
And still, applying
weights = sample_floats(0.055, 1.0, 11)
weights /= np.sum(weights)
returns a weights array in which some floats are less than 0.055.
Should I somehow incorporate np.random.dirichlet into the function above, or should the solution be built on np.random.dirichlet with the condition > 0.055 applied afterwards? I can't figure out a solution.
Thank you in advance!
IIUC, you want to generate an array of k values with a minimum value of low = 0.055.
It is easier to generate numbers starting from 0 that sum up to 1 - low*k, and then to add low to each so that the final array sums to 1. This guarantees both the lower bound and the sum.
Regarding high, I am pretty sure it is mathematically impossible to add this constraint: once you fix the lower bound and the sum, there are not enough degrees of freedom left to choose an upper bound. The upper bound will be 1 - low*(k-1) (here 0.505).
Also, be aware that with a minimum value you necessarily enforce a maximum k of 1//low (here 18 values). If you set k higher, the lower bound won't be respected.
import numpy as np

# parameters
low = 0.055
k = 10

a = np.random.rand(k)
a = a / a.sum() * (1 - low * k)
weights = a + low

# checking that the sum is 1
assert np.isclose(weights.sum(), 1)
Example output:
array([0.13608635, 0.06796974, 0.07444545, 0.1361171 , 0.07217206,
0.09223554, 0.12713463, 0.11012871, 0.1107402 , 0.07297022])
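If you would rather build on np.random.dirichlet, as the question suggests, the same shift-and-scale idea applies, since a Dirichlet sample already sums to 1. A minimal sketch:

import numpy as np

low, k = 0.055, 10
# np.ones(k) gives a flat Dirichlet; the sample sums to 1 by construction
weights = low + np.random.dirichlet(np.ones(k)) * (1 - low * k)
assert np.isclose(weights.sum(), 1) and weights.min() >= low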
You could generate the first k-1 numbers iteratively by varying the lower and upper bounds of the uniform random number generator; the constraint at any iteration is that the number generated must leave enough room for the remaining numbers to be at least low.
import random

def sample_floats(low, high, k=1):
    result = []
    generated = 0
    while generated < k - 1:
        current_higher_bound = max(low, 1 - (k - 1 - generated) * low - sum(result))
        next_num = random.uniform(low, current_higher_bound)
        result.append(next_num)
        generated += 1
    last_num = 1 - sum(result)
    result.append(last_num)
    return result
print(sample_floats(0.01, 1, k=15))
#[0.08878760926151083,
# 0.17897435239586243,
# 0.5873150041878156,
# 0.021487776792166513,
# 0.011234379498998357,
# 0.012408564286727042,
# 0.015391011259745103,
# 0.01264921242128719,
# 0.010759267284382326,
# 0.010615007333002748,
# 0.010288605412288477,
# 0.010060487014659121,
# 0.010027216923973544,
# 0.010000064276203318,
# 0.010001441651377285]
The samples are correlated, so I believe you can't generate them in an IID way. You can, however, do it in an iterative manner, as I show in the code below. There are a few more special cases to check, such as the user passing low > high or high*k < x_sum, but I figured you can find and account for them using my modification of your code.
import random
import warnings

def sample_floats(low=0.055, high=1., x_sum=1., k=1):
    """Return a k-length list of unique random floats
    in the range low <= x <= high, summing up to x_sum.
    """
    sum_i = 0
    xs = []
    if x_sum - (k - 1) * low < high:
        warnings.warn(f'high = {high} is too high to be generated under the'
                      f' conditions set by k = {k}, sum = {x_sum}, and low = {low}.'
                      f' high automatically set to {x_sum - (k - 1) * low}.')
        high = x_sum - (k - 1) * low  # clamp high, as announced in the warning
    if k == 1:
        if high < x_sum:
            raise ValueError(f'The parameter combination k = {k}, sum = {x_sum},'
                             f' and high = {high} is impossible.')
        else:
            return [x_sum]
    high_i = high
    for i in range(k - 1):
        x = random.uniform(low, high_i)
        xs.append(x)
        sum_i = sum_i + x
        # tighten the upper bound so the remaining numbers can all be >= low
        high_i = min(high, x_sum - sum_i - (k - 1 - i) * low)
    xs.append(x_sum - sum_i)
    return xs
For example:
random.seed(0)
xs = sample_floats(low = 0.055, high = 0.5, x_sum = 1., k = 5)
print(xs)
print(sum(xs))
Output:
[0.43076772392864643, 0.27801464913542906, 0.08495210994346317, 0.06568433355884717, 0.14058118343361425]
1.0
There are a number of jobs to be assigned to a number of resources, each with a score (performance indicator) and a cost. The resource assignment problem (RAP) objective is to maximize the assignment scores while staying within the budget. Constraints: each resource can handle at most one job, and each job, if it is filled, should be done by exactly one resource. Also, there is a limited budget to spend.
I have tackled the problem in two ways: CVXPY using the Gurobi solver, and the Gurobi package itself. My challenge is that I can't program it in a memory-efficient way with CVXPY. There are hundreds of constraint list comprehensions! How can I improve the efficiency of my code in CVXPY? For example, is there a better way to define dictionary variables in CVXPY similar to Gurobi?
ms is a dictionary of the form {('firstName lastName', 'job'): score_value}
cst is a dictionary of the form {('firstName lastName', 'job'): cost_value}
job is the set of jobs
res is the set of resources {'firstName lastName'}
G (or g in the Gurobi implementation) is a dictionary with jobs as keys and values of 0 or 1 indicating whether a job is left unfilled due to the budget limit (0 if filled and 1 if not)
Thanks!
GitHub link including the code and a memory profiling comparison
Gurobi implementation:
import gurobipy as gp
from gurobipy import GRB

m = gp.Model("RAP")
assign = m.addVars(ms.keys(), vtype=GRB.BINARY, name="assign")
g = m.addVars(job, name="gap")
m.addConstrs((assign.sum("*", j) + g[j] == 1 for j in job), name="demand")
m.addConstrs((assign.sum(r, "*") <= 1 for r in res), name="supply")
m.addConstr(assign.prod(cst) <= budget, name="Budget")
job_gap_penalty = 101  # penalty for not filling a job
m.setObjective(assign.prod(ms) - job_gap_penalty * g.sum(), GRB.MAXIMIZE)
m.optimize()
CVXPY implementation:
import cvxpy as cp
import numpy as np

X = {}
for a in ms.keys():
    X[a] = cp.Variable(boolean=True, name="assign")
G = {}
for g in job:
    G[g] = cp.Variable(boolean=True, name="gap")

constraints = []
for j in job:
    X_r = 0
    for r in res:
        X_r += X[r, j]
    constraints += [X_r + G[j] == 1]
for r in res:
    X_j = 0
    for j in job:
        X_j += X[r, j]
    constraints += [X_j <= 1]
constraints += [
    np.array(list(cst.values())) @ np.array(list(X.values())) <= budget,
]

obj = cp.Maximize(np.array(list(ms.values())) @ np.array(list(X.values()))
                  - job_gap_penalty * cp.sum(list(G.values())))
prob = cp.Problem(obj, constraints)
prob.solve(solver=cp.GUROBI, verbose=False)
Here is the memory profiling comparison:
[image: memory profiling for CVXPY]
[image: memory profiling for Gurobi]
Previously, I tried defining dictionary variables similar to Gurobi, but that is not available in CVXPY, and the code was not efficient when scaling up. Now I have solved it with matrix variables, converting them to dictionary variables afterwards, which is super fast!
import cvxpy as cp
import numpy as np

assign_scores = np.array(list(ms.values())).reshape(len(res), len(job))
assign_cost = np.array(list(cst.values())).reshape(len(res), len(job))

# make a boolean matrix variable with shape (number of resources, number of jobs)
x = cp.Variable(shape=(len(res), len(job)), boolean=True, name="assign")
# make a boolean vector variable with the shape of the number of jobs
g = cp.Variable(shape=(len(job),), boolean=True, name="gap")

constraints = []
# each job is assigned to exactly one resource or remains unfilled due to the budget cap
constraints += [cp.sum(x[:, j]) + g[j] == 1 for j in range(len(job))]
# each resource can be assigned to at most one job
constraints += [cp.sum(x[r, :]) <= 1 for r in range(len(res))]
# budget cap
constraints += [cp.sum(cp.multiply(assign_cost, x)) <= budget]

# penalty if a job is not filled
job_gap_penalty = 101
# objective is to maximize the performance score
obj = cp.Maximize(cp.sum(cp.multiply(assign_scores, x)) - job_gap_penalty * cp.sum(g))
prob = cp.Problem(obj, constraints)
prob.solve(solver=cp.GUROBI, verbose=True)
Problem
I'm implementing a generalized assignment problem using LINGO (in which I have experience modeling mathematical problems) and OR-Tools, but the results were different.
Brief explanation of my assignment problem
I have a set of houses (called 'objects' in the model) that need to be built. Each house needs a set of resources. To supply these resources, there are 3 suppliers. The resource cost varies by supplier.
The model should assign those suppliers to the houses in order to minimize the total cost of assignments.
Model
Parameters
resource_cost_per_supplier[i,j]: cost of resource i of supplier j.
resource_cost_factor_per_object[i,j]: matrix that signals the resources demanded by the objects (cost factor > 0). In addition, it contains the cost factor of resource i demanded by object j. This factor is calculated based on the duration of use of the resource during the construction of the object and also on other contractual factors.
supplier_budget_limit[j]: budget limit of supplier j. Each supplier has a budget limit that should not be exceeded (it's in the contract).
supplier_budget_tolerance_margin_limit[j]: budget tolerance margin limit of supplier j. For the model to work, I had to create this tolerance margin, which is applied to the supplier budget limit to create an acceptable range of supplier cost.
object_demand_attended_per_supplier[i,j]: binary matrix that signals whether supplier i has all the resources required by object j.
Variables
x[i,j]: binary variable that indicates whether supplier i will (1) or will not (0) be assigned to object j.
supplier_cost[j]: variable that represents the cost of supplier j in the market share. Reconstructed from the code below, its value is given by:
supplier_cost[j] = sum over resources i and objects k of resource_cost_per_supplier[i,j] * resource_cost_factor_per_object[i,k] * x[j,k]
total_cost: variable that represents the total cost of the market share. Its value is given by:
total_cost = sum over suppliers j of supplier_cost[j]
Objective function
min Z = total_cost
Constraints
1 - Ensure that each object j will have exactly one supplier i.
2 - For each supplier j, the sum of the cost of all its assignments must be greater than or equal to its budget limit minus the tolerance margin.
3 - For each supplier j, the sum of the cost of all its assignments must be less than or equal to its budget limit plus the tolerance margin.
4 - Ensure that a supplier i will not be assigned to an object j if supplier i cannot provide all the resources of object j.
5 - Ensure that variable x is binary for every supplier i and object j.
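For reference, reconstructed from the code below, the constraints in formula form are:
1. sum over suppliers i of x[i,j] = 1, for every object j
2. supplier_cost[j] >= total_cost * (supplier_budget_limit[j] - supplier_budget_tolerance_margin_limit[j]), for every supplier j
3. supplier_cost[j] <= total_cost * (supplier_budget_limit[j] + supplier_budget_tolerance_margin_limit[j]), for every supplier j
4. x[i,j] <= object_demand_attended_per_supplier[i,j], for every supplier i and object j
5. x[i,j] in {0, 1}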
Code
OR-Tools (Python)
from __future__ import print_function
from ortools.linear_solver import pywraplp
import pandas as pd
import numpy

###### [START] parameters ######
num_objects = 252   # Number of objects
num_resources = 35  # Number of resources (not every object will use all resources; it depends on the type of the object, among other things)
num_suppliers = 3   # Number of suppliers
resource_cost_per_supplier = pd.read_csv('https://raw.githubusercontent.com/hrassis/divisao-mercado/master/input_prototype/resource_cost_per_supplier.csv', index_col=0).to_numpy()
resource_cost_factor_per_object = pd.read_csv('https://raw.githubusercontent.com/hrassis/divisao-mercado/master/input_prototype/resource_cost_factor_per_object.csv', index_col=0).to_numpy()
object_demand_attended_per_supplier = pd.read_csv('https://raw.githubusercontent.com/hrassis/divisao-mercado/master/input_prototype/object_demand_attended_per_supplier.csv', index_col=0).to_numpy()
supplier_budget_limit = pd.read_csv('https://raw.githubusercontent.com/hrassis/divisao-mercado/master/input_prototype/supplier_budget_limit.csv', index_col=0)['budget_limit'].values
supplier_budget_tolerance_margin_limit = pd.read_csv('https://raw.githubusercontent.com/hrassis/divisao-mercado/master/input_prototype/supplier_budget_tolerance_margin_limit.csv', index_col=0)['tolerance_margin'].values
###### [END] parameters ######

###### [START] variables ######
# Assignment variable
x = {}
supplier_cost = []
# Total cost of market share
total_cost = 0
###### [END] variables ######

def main():
    # Declare the solver
    solver = pywraplp.Solver('GeneralizedAssignmentProblem', pywraplp.Solver.CBC_MIXED_INTEGER_PROGRAMMING)

    # Assignment variable
    # x = {}
    # Ensure that the assignment variable is binary
    for i in range(num_suppliers):
        for j in range(num_objects):
            x[i, j] = solver.BoolVar('x[%i,%i]' % (i, j))

    # Assign an expression to each supplier_cost element
    for j in range(num_suppliers):
        supplier_cost.append(solver.Sum(solver.Sum(resource_cost_per_supplier[i, j] * resource_cost_factor_per_object[i, k] * x[j, k] for k in range(num_objects)) for i in range(num_resources)))

    # Total cost of market share
    total_cost = solver.Sum(supplier_cost[j] for j in range(num_suppliers))

    # Objective function
    solver.Minimize(total_cost)

    ###### [START] constraints ######
    # 1 - Ensure that each object will have only one supplier
    for j in range(num_objects):
        solver.Add(solver.Sum([x[i, j] for i in range(num_suppliers)]) == 1)

    # 2 - For each supplier j, the sum of the cost of all its allocations must be greater than or equal to its budget limit minus the tolerance margin
    for j in range(num_suppliers):
        solver.Add(supplier_cost[j] >= total_cost * (supplier_budget_limit[j] - supplier_budget_tolerance_margin_limit[j]))

    # 3 - For each supplier j, the sum of the cost of all its allocations must be less than or equal to its budget limit plus the tolerance margin
    for j in range(num_suppliers):
        solver.Add(supplier_cost[j] <= total_cost * (supplier_budget_limit[j] + supplier_budget_tolerance_margin_limit[j]))

    # 4 - Ensure that a supplier i will not be assigned to an object j if supplier i cannot supply all resources demanded by object j
    for i in range(num_suppliers):
        for j in range(num_objects):
            solver.Add(x[i, j] - object_demand_attended_per_supplier[i, j] <= 0)
    ###### [END] constraints ######

    solution = solver.Solve()

    # Print the result
    if solution == pywraplp.Solver.OPTIMAL:
        print('------- Solution -------')
        print('Total cost =', round(total_cost.solution_value(), 2))
        for i in range(num_suppliers):
            print('-----')
            print('Supplier', i)
            print('-> cost:', round(supplier_cost[i].solution_value(), 2))
            print('-> cost percentage:', format(supplier_cost[i].solution_value() / total_cost.solution_value(), '.2%'))
            print('-> supplier budget limit:', format(supplier_budget_limit[i], '.0%'))
            print('-> supplier budget tolerance margin limit:', format(supplier_budget_tolerance_margin_limit[i], '.0%'))
            print('-> acceptable range: {0} <= cost percentage <= {1}'.format(format(supplier_budget_limit[i] - supplier_budget_tolerance_margin_limit[i], '.0%'), format(supplier_budget_limit[i] + supplier_budget_tolerance_margin_limit[i], '.0%')))
            # print('-> objects: {0}'.format(i))
    else:
        print('The problem does not have an optimal solution.')

    # Generate a result to consult
    assignment_result = pd.DataFrame(columns=['object', 'supplier', 'cost', 'assigned'])
    for i in range(num_suppliers):
        for j in range(num_objects):
            assignment_result = assignment_result.append({'object': j, 'supplier': i, 'cost': get_object_cost(j, i), 'assigned': x[i, j].solution_value()}, ignore_index=True)
    assignment_result.to_excel('assignment_result.xlsx')

def get_object_cost(object_index, supplier_index):
    object_cost = 0.0
    for i in range(num_resources):
        object_cost = object_cost + resource_cost_factor_per_object[i, object_index] * resource_cost_per_supplier[i, supplier_index]
    return object_cost

# Run main
main()
LINGO
model:
title: LINGO;
data:
!Number of objects;
num_objects = #OLE('LINGO_input.xlsx',num_objects);
!Number of resources (not every object will use all resources. It depends of the type of the object and other things);
num_resources = #OLE('LINGO_input.xlsx',num_resources);
!Number of suppliers;
num_suppliers = #OLE('LINGO_input.xlsx',num_suppliers);
enddata
sets:
suppliers/1..num_suppliers/:supplier_budget_limit,supplier_tolerance_margin_limit,supplier_cost;
resources/1..num_resources/:;
objects/1..num_objects/:;
resources_suppliers(resources,suppliers):resource_cost_per_supplier;
resources_objects(resources,objects):resource_cost_factor_per_object;
suppliers_objects(suppliers,objects):x,object_demand_attended_supplier;
endsets
data:
resource_cost_per_supplier = #OLE('LINGO_input.xlsx',resource_cost_per_supplier[cost]);
resource_cost_factor_per_object = #OLE('LINGO_input.xlsx',resource_cost_factor_per_object[cost_factor]);
supplier_budget_limit = #OLE('LINGO_input.xlsx',supplier_budget_limit[budget_limit_percentage]);
supplier_tolerance_margin_limit = #OLE('LINGO_input.xlsx',supplier_budget_tolerance_margin_limit[budget_tolerance_percentage]);
object_demand_attended_supplier = #OLE('LINGO_input.xlsx',object_demand_attended_per_supplier[supply_all_resources]);
enddata
!The array 'supplier_cost' was created to store the total cost of each supplier;
#FOR(suppliers(j):supplier_cost(j)= #SUM(resources(i):#SUM(objects(k):resource_cost_per_supplier(i,j)*resource_cost_factor_per_object(i,k)*x(j,k))));
!Total cost of market share;
total_cost = #SUM(suppliers(i):supplier_cost(i));
!Objective function;
min = total_cost;
!Ensure that each object will have only one supplier;
#FOR(objects(j):#SUM(suppliers(i):x(i,j))=1);
!For each supplier j, the sum of the cost of all your assignments must be greater than or equal to your budget limit minus the tolerance margin;
#FOR(suppliers(j):supplier_cost(j) >= total_cost*(supplier_budget_limit(j)-supplier_tolerance_margin_limit(j)));
!For each supplier j, the sum of the cost of all your assignments must be less than or equal to your budget limit plus the tolerance margin;
#FOR(suppliers(j):supplier_cost(j) <= total_cost*(supplier_budget_limit(j)+supplier_tolerance_margin_limit(j)));
!Ensure that a supplier j will not assigned to an object k if the supplier j can not supply all resources demanded by object k;
#FOR(suppliers(j):#FOR(objects(k):x(j,k)-object_demand_attended_supplier(j,k)<=0));
!Ensure that the assignment variable is binary;
#FOR(suppliers(i):#FOR(objects(j):#BIN(x(i,j))));
data:
#OLE('LINGO_input.xlsx',output[assigned])=x;
#OLE('LINGO_input.xlsx',objective_function_value)=total_cost;
#OLE('LINGO_input.xlsx',supplier_cost)=supplier_cost;
enddata
Results
The picture below shows the comparative result between OR-Tools and LINGO. I emphasize that the data used by the two implementations were exactly the same, and I checked all the data several times.
Note that there is a difference of 1.876,20 between the two implementations. LINGO, which uses a branch-and-bound algorithm, found a better solution than OR-Tools. The difference is caused by the assignment inconsistencies shown below.
Regarding the processing time of the algorithms, LINGO took around 14 min and OR-Tools less than 1 min.
All the data used in the two implementations are in this repository: https://github.com/hrassis/divisao-mercado. The data used by LINGO is in the folder input_lingo and the data used by OR-Tools is in the folder input_prototype. In addition, I uploaded the validation report.
After "cheating" a bit:
solver.Add(x[1, 177] == 1)
solver.Add(x[0, 186] == 1)
solver.Add(x[0, 205] == 1)
solver.Add(x[2, 206] == 1)
solver.Add(x[2, 217] == 1)
solver.Add(x[2, 66] == 1)
solver.Add(x[2, 115] == 1)
solver.Add(x[1, 237] == 1)
The solver returns a better objective, so I believe there is a bug either in the CBC binary or in the OR-Tools interface to it (it sounds like the former).
Can you try using the CP-SAT solver?
There have been quite a few problems with CBC:
https://github.com/google/or-tools/issues/1450
https://github.com/google/or-tools/issues/1525
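For reference, here is a minimal sketch (my own, not taken from the question) of how the assignment core could be ported to CP-SAT. CP-SAT only accepts integer coefficients, so the float costs would have to be scaled (e.g. to cents) beforehand; the budget-share constraints could be rewritten in the same scaled, linear form.

from ortools.sat.python import cp_model

def solve_with_cpsat(int_cost, attended, num_suppliers, num_objects):
    # int_cost[i][j]: integer (pre-scaled) cost of assigning supplier i to object j
    # attended[i][j]: 1 if supplier i can supply everything object j demands
    model = cp_model.CpModel()
    x = {(i, j): model.NewBoolVar('x[%i,%i]' % (i, j))
         for i in range(num_suppliers) for j in range(num_objects)}
    # each object gets exactly one supplier
    for j in range(num_objects):
        model.Add(sum(x[i, j] for i in range(num_suppliers)) == 1)
    # a supplier that cannot meet an object's full demand cannot be assigned to it
    for i in range(num_suppliers):
        for j in range(num_objects):
            if attended[i][j] == 0:
                model.Add(x[i, j] == 0)
    model.Minimize(sum(int_cost[i][j] * x[i, j]
                       for i in range(num_suppliers)
                       for j in range(num_objects)))
    solver = cp_model.CpSolver()
    status = solver.Solve(model)
    return status, solver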
I am looking for an optimization algorithm that takes a text file encoded with 0s, 1s, and -1s:
1's denoting target cells that require Wi-Fi coverage
0's denoting cells that are walls
-1's denoting cells that are void (they do not require Wi-Fi coverage)
Example of text file:
I have created a solution function along with other helper functions, but I can't seem to get the optimal positions at which to place the routers to ensure proper coverage. There is another file that does the printing; I am struggling with finding the optimal locations. I basically need to change the get_random_position function to get the optimal one, but I am unsure how to do that. The areas covered by the various routers are:
Each router covers a square area of at most (2S+1)^2
Type 1: S=5; Cost=180
Type 2: S=9; Cost=360
Type 3: S=15; Cost=480
This is the kind of output I am getting:
My code is as follows:
import numpy as np
import time
from random import randint

def is_taken(taken, i, j):
    for coords in taken:
        if coords[0] == i and coords[1] == j:
            return True
    return False

def get_random_position(floor, taken, nrows, ncols):
    i = randint(0, nrows - 1)
    j = randint(0, ncols - 1)
    while floor[i][j] == 0 or floor[i][j] == -1 or is_taken(taken, i, j):
        i = randint(0, nrows - 1)
        j = randint(0, ncols - 1)
    return (i, j)

def solution(floor):
    start_time = time.time()
    router_types = [1, 2, 3]
    nrows, ncols = floor.shape
    ratio = 0.1
    router_scale = int(nrows * ncols * 0.0001)
    if router_scale == 0:
        router_scale = 1
    row_ratio = int(nrows * ratio)
    col_ratio = int(ncols * ratio)
    print('Row : ', nrows, ', Col: ', ncols, ', Router scale :', router_scale)
    global_best = [0, ([], [], [])]
    taken = []
    while True:
        found_better = False
        best = [global_best[0], (list(global_best[1][0]), list(global_best[1][1]), list(global_best[1][2]))]
        for times in range(0, row_ratio + col_ratio):
            if time.time() - start_time > 27.0:
                print('Time ran out! Using what I got : ', time.time() - start_time)
                return global_best[1]
            fit = []
            for rtype in router_types:
                interim = (list(global_best[1][0]), list(global_best[1][1]), list(global_best[1][2]))
                for i in range(0, router_scale):
                    pos = get_random_position(floor, taken, nrows, ncols)
                    interim[0].append(pos[0])
                    interim[1].append(pos[1])
                    interim[2].append(rtype)
                fit.append((fitness(floor, interim), interim))
            highest_fitness = fit[0]
            for index in range(1, len(fit)):
                if fit[index][0] > highest_fitness[0]:
                    highest_fitness = fit[index]
            if highest_fitness[0] > best[0]:
                best[0] = highest_fitness[0]
                best[1] = (highest_fitness[1][0], highest_fitness[1][1], highest_fitness[1][2])
                found_better = True
                global_best = best
                taken.append((best[1][0][-1], best[1][1][-1]))
                break
        if found_better == False:
            break
    print('Best:')
    print(global_best)
    end_time = time.time()
    run_time = end_time - start_time
    print("Run Time:", run_time)
    return global_best[1]

def available_cells(floor):
    available = 0
    for i in range(0, len(floor)):
        for j in range(0, len(floor[i])):
            if floor[i][j] != 0:
                available += 1
    return available

def fitness(building, args):
    render = np.array(building, dtype=int, copy=True)
    cov_factor = 220
    cost_factor = 22
    router_types = {  # type: {size (S), cost}
        1: {'size': 5, 'cost': 180},
        2: {'size': 9, 'cost': 360},
        3: {'size': 15, 'cost': 480},
    }
    routers_used = args[-1]
    for r, c, t in zip(*args):
        size = router_types[t]['size']
        nrows, ncols = render.shape
        rows = range(max(0, r - size), min(nrows, r + size + 1))
        cols = range(max(0, c - size), min(ncols, c + size + 1))
        walls = []
        for ri in rows:
            for ci in cols:
                if building[ri, ci] == 0:
                    walls.append((ri, ci))

        def blocked(ri, ci):
            for w in walls:
                if min(r, ri) <= w[0] and max(r, ri) >= w[0]:
                    if min(c, ci) <= w[1] and max(c, ci) >= w[1]:
                        return True
            return False

        for ri in rows:
            for ci in cols:
                if blocked(ri, ci):
                    continue
                if render[ri, ci] == 2:
                    render[ri, ci] = 4
                if render[ri, ci] == 1:
                    render[ri, ci] = 2
        render[r, c] = 5
    return (
        cov_factor * np.sum(render > 1) -
        cost_factor * np.sum([router_types[x]['cost'] for x in routers_used])
    )
Here's a suggestion on how to solve the problem; however, I don't claim this is the best approach, and it's certainly not the only one.
Main idea
Your problem can be modeled as a weighted minimum set cover problem.
Good news: this is a well-known optimization problem.
It is easy to find descriptions of algorithms for approximate solutions.
A quick search on the web shows many implementations of approximation algorithms in Python.
Bad news: this is an NP-hard optimization problem.
If you need an exact solution, algorithms will work only for "small" problems in a reasonable amount of time (in your case, the size of the problem <=> the number of "1" cells).
Approximate (a.k.a. greedy) algorithms are a trade-off between computation requirements and the risk of delivering solutions that are far from optimal in certain cases.
Note that the above does not prove that your problem is NP-hard. The general minimum set cover problem is NP-hard; in your case the subsets have several properties that might help in designing a better algorithm. I have no idea how, though.
Translating into a set cover problem
Let's define some sets:
U: the set of "1" cells (those requiring Wi-Fi).
P(U): the power set of U (the set of all subsets of U).
P: the set of cells on which you can place a router (not sure whether P = U in your original post).
T: the set of router types (3 values in your case).
R+: the positive real numbers (used to describe prices).
Let's define a function (pseudo-Python):
# Domain of definition: T x P --> R+ x P(U)
# This function takes a router type and a position, and returns
# a tuple containing:
# - the price of a router of the given type.
# - the subset of U containing all the positions covered by a router
#   of the given type placed at the given position.
def weighted_subset(routerType, position):
    pass  # TODO: implementation
Now, we define one last set as the image of the function we've just described: S = weighted_subset(T, P). Each element of this set is a subset of U, weighted by a price in R+.
With all this formalism, finding the router types & positions that:
give coverage to all the desirable locations,
minimize the cost,
is equivalent to finding a sub-collection of S:
whose union is equal to U,
which minimizes the sum of the associated weights,
which is the weighted minimum set cover problem.
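To make that last step concrete, here is a sketch of the classic greedy approximation for weighted set cover. It assumes S has been materialized as a list of (price, covered) pairs produced by weighted_subset over all router types and positions; the names are illustrative, not from the original post.

def greedy_weighted_set_cover(U, S):
    # U: set of "1" cells; S: list of (price, covered) pairs,
    # where covered is the set of cells a candidate router reaches
    uncovered = set(U)
    chosen = []
    while uncovered:
        # keep only candidates that still cover something new
        useful = [s for s in S if s[1] & uncovered]
        if not useful:
            break  # the remaining cells cannot be covered at all
        # pick the candidate with the lowest price per newly covered cell
        best = min(useful, key=lambda s: s[0] / len(s[1] & uncovered))
        chosen.append(best)
        uncovered -= best[1]
    return chosen

At each step the candidate with the best price-to-new-coverage ratio is taken, which is the standard ln(n)-factor approximation for this problem.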
I have a set of data, each entry of which has an ID, a timestamp, and identifiers. I have to go through it, calculate the entropy, and save some other links for the data. At each step more identifiers are added to the identifiers dictionary, and I have to re-compute the entropy and append it. I have a really large amount of data, and the program gets stuck due to the growing number of identifiers and their entropy calculation after each step. I read the following solution, but it is about data consisting of numbers.
Incremental entropy computation
I have copied two functions from that page, but the incremental calculation of entropy gives different values than the classical full entropy calculation at every step.
Here is the code I have:
from math import log

# ---------------------------------------------------------------------#
# Functions copied from https://stackoverflow.com/questions/17104673/incremental-entropy-computation

# maps x to -x*log2(x) for x > 0, and to 0 otherwise
h = lambda p: -p * log(p, 2) if p > 0 else 0

# entropy of the union of two samples with entropies H1 and H2
def update(H1, S1, H2, S2):
    S = S1 + S2
    return 1.0 * H1 * S1 / S + h(1.0 * S1 / S) + 1.0 * H2 * S2 / S + h(1.0 * S2 / S)

# compute entropy using the classic equation
def entropy(L):
    n = 1.0 * sum(L)
    return sum([h(x / n) for x in L])
# ---------------------------------------------------------------------#

# Below is the input data (actually I read it from a csv file)
input_data = [["1", "2008-01-06T02:13:38Z", "foo,bar"], ["2", "2008-01-06T02:12:13Z", "bar,blup"], ["3", "2008-01-06T02:13:55Z", "foo,bar"],
              ["4", "2008-01-06T02:12:28Z", "foo,xy"], ["5", "2008-01-06T02:12:44Z", "foo,bar"], ["6", "2008-01-06T02:13:00Z", "foo,bar"],
              ["7", "2008-01-06T02:13:00Z", "x,y"]]

total_identifiers = {}   # To store the occurrences of identifiers. Values show the number of occurrences
all_entropies = []       # Classical way of calculating entropy at every step
updated_entropies = []   # Incremental way of calculating entropy at every step

for item in input_data:
    temp = item[2].split(",")
    identifiers_sum = sum(total_identifiers.values())  # Sum of all identifiers
    old_entropy = 0 if all_entropies[-1:] == [] else all_entropies[-1]  # Get the previous entropy calculation
    for identifier in temp:
        S_new = len(temp)  # sum of new samples
        temp_dictionary = {a: 1 for a in temp}  # Store current identifiers and their occurrence
        if identifier not in total_identifiers:
            total_identifiers[identifier] = 1
        else:
            total_identifiers[identifier] += 1
    current_entropy = entropy(total_identifiers.values())  # Entropy for the current set of identifiers
    updated_entropy = update(old_entropy, identifiers_sum, current_entropy, S_new)
    updated_entropies.append(updated_entropy)
    entropy_value = entropy(total_identifiers.values())  # Classical entropy calculation for comparison. This step becomes too expensive with big data
    all_entropies.append(entropy_value)

print(total_identifiers)
print('Sum of Total Identifiers: ', identifiers_sum)  # Gives 12 while the sum is 14 ???
print("All Classical Entropies: ", all_entropies)  # print for comparison
print("All Updated Entropies: ", updated_entropies)
The other issue is that when I print 'Sum of Total Identifiers', it gives 12 instead of 14! (Due to the very large amount of data, I read the actual file line by line and write the results directly to disk, storing nothing in memory apart from the dictionary of identifiers.)
The code above uses Theorem 4; it seems to me that you want to use Theorem 5 instead (from the paper linked in the next paragraph).
Note, however, that if the number of identifiers is really the problem, then the incremental approach below isn't going to work either; at some point the dictionaries will get too large.
Below you can find a proof-of-concept Python implementation that follows the description from Updating Formulas and Algorithms for Computing Entropy and Gini Index from Time-Changing Data Streams.
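In formula form, the update that EntropyHolder.update below performs (my transcription of the code, not a quotation from the paper) is, with S the old total count and r the total count change:

H_new = (S / (S + r)) * (H_old - log2(S / (S + r))) - sum over changed labels of (p_new * log2(p_new) - p_old * log2(p_old))

where p_new = (count + change) / (S + r) and p_old = count / (S + r).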
import collections
import math
import random

def log2(p):
    return math.log(p, 2) if p > 0 else 0

CountChange = collections.namedtuple('CountChange', ('label', 'change'))

class EntropyHolder:
    def __init__(self):
        self.counts_ = collections.defaultdict(int)
        self.entropy_ = 0
        self.sum_ = 0

    def update(self, count_changes):
        r = sum([change for _, change in count_changes])
        residual = self._compute_residual(count_changes)
        self.entropy_ = self.sum_ * (self.entropy_ - log2(self.sum_ / (self.sum_ + r))) / (self.sum_ + r) - residual
        self._update_counts(count_changes)
        return self.entropy_

    def _compute_residual(self, count_changes):
        r = sum([change for _, change in count_changes])
        residual = 0
        for label, change in count_changes:
            p_new = (self.counts_[label] + change) / (self.sum_ + r)
            p_old = self.counts_[label] / (self.sum_ + r)
            residual += p_new * log2(p_new) - p_old * log2(p_old)
        return residual

    def _update_counts(self, count_changes):
        for label, change in count_changes:
            self.sum_ += change
            self.counts_[label] += change

    def entropy(self):
        return self.entropy_

def naive_entropy(counts):
    s = sum(counts)
    return sum([-(r / s) * log2(r / s) for r in counts])

if __name__ == '__main__':
    print(naive_entropy([1, 1]))
    print(naive_entropy([1, 1, 1, 1]))

    entropy = EntropyHolder()
    freq = collections.defaultdict(int)
    for _ in range(100):
        index = random.randint(0, 5)
        entropy.update([CountChange(index, 1)])
        freq[index] += 1

    print(naive_entropy(freq.values()))
    print(entropy.entropy())
Thanks @blazs for providing the EntropyHolder class. That solves the problem. The idea is to import entropy_holder.py (from https://gist.github.com/blazs/4fc78807a96976cc455f49fc0fb28738) and use it to store the previous entropy and update it at every step when new identifiers come in.
So the minimum working code would look like this:
import entropy_holder

input_data = [["1", "2008-01-06T02:13:38Z", "foo,bar"], ["2", "2008-01-06T02:12:13Z", "bar,blup"], ["3", "2008-01-06T02:13:55Z", "foo,bar"],
              ["4", "2008-01-06T02:12:28Z", "foo,xy"], ["5", "2008-01-06T02:12:44Z", "foo,bar"], ["6", "2008-01-06T02:13:00Z", "foo,bar"],
              ["7", "2008-01-06T02:13:00Z", "x,y"]]

entropy = entropy_holder.EntropyHolder()  # This class holds the current entropy and the counts of identifiers
for item in input_data:
    for identifier in item[2].split(","):
        entropy.update([entropy_holder.CountChange(identifier, 1)])

print(entropy.entropy())
The entropy computed using blazs's incremental formulas is very close to the entropy calculated the classical way, and it saves iterating over all the data again and again.