Generating random floats, summing to 1, with minimum value - python

I have seen many solutions for generating random floats within a specific range (like this), which actually helps me, and solutions for generating random floats summing to 1 (like this). Each works perfectly on its own, but I can't figure out how to merge them.
Currently my code is:
import random
def sample_floats(low, high, k=1):
    """Return a k-length list of unique random floats
    in the range low <= x <= high.
    """
    result = []
    seen = set()
    for i in range(k):
        x = random.uniform(low, high)
        while x in seen:
            x = random.uniform(low, high)
        seen.add(x)
        result.append(x)
    return result
And still, applying
weights = sample_floats(0.055, 1.0, 11)
weights /= np.sum(weights)
returns a weights array in which some floats are less than 0.055.
Should I somehow implement np.random.dirichlet in the function above, or should the whole thing be built on np.random.dirichlet with the condition x > 0.055 enforced afterwards? I can't figure out a solution.
Thanks in advance!

IIUC, you want to generate an array of k values, with minimum value of low=0.055.
It is easier to generate k numbers starting from 0 that sum up to 1-low*k, and then add low to each so that the final array sums to 1. This guarantees both the lower bound and the sum.
Regarding high, I am pretty sure it is mathematically impossible to add this constraint: once you fix the lower bound and the sum, there are not enough degrees of freedom left to choose an upper bound. The effective upper bound will be 1-low*(k-1) (here 0.505).
Also, be aware that, with a minimum value, you necessarily enforce a maximum k of 1//low (here 18 values). If you set k higher, the lower bound can no longer hold.
import numpy as np

# parameters
low = 0.055
k = 10

a = np.random.rand(k)
a = a / a.sum() * (1 - low*k)
weights = a + low

# checking that the sum is 1
assert np.isclose(weights.sum(), 1)
Example output:
array([0.13608635, 0.06796974, 0.07444545, 0.1361171 , 0.07217206,
0.09223554, 0.12713463, 0.11012871, 0.1107402 , 0.07297022])
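A quick sanity check (my addition, reusing low, k, and weights from the snippet above):
assert np.isclose(weights.sum(), 1)  # total is (numerically) 1
assert (weights >= low).all()        # every weight respects the lower bound
assert k <= 1 // low                 # otherwise k*low > 1 and no solution exists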

You could generate the first k-1 numbers iteratively by varying the lower and upper bounds of the uniform random number generator, the constraint at any iteration being that the number generated leaves enough room for the remaining numbers to each be at least low. The final number is then whatever remains to reach 1:
import random

def sample_floats(low, high, k=1):
    result = []
    generated = 0
    while generated < k - 1:
        # leave at least `low` for each number still to be generated
        current_higher_bound = max(low, 1 - (k - 1 - generated)*low - sum(result))
        next_num = random.uniform(low, current_higher_bound)
        result.append(next_num)
        generated += 1
    last_num = 1 - sum(result)
    result.append(last_num)
    return result
print(sample_floats(0.01, 1, k=15))
#[0.08878760926151083,
# 0.17897435239586243,
# 0.5873150041878156,
# 0.021487776792166513,
# 0.011234379498998357,
# 0.012408564286727042,
# 0.015391011259745103,
# 0.01264921242128719,
# 0.010759267284382326,
# 0.010615007333002748,
# 0.010288605412288477,
# 0.010060487014659121,
# 0.010027216923973544,
# 0.010000064276203318,
# 0.010001441651377285]

The samples are correlated, so I believe you can't generate them in an IID way. You can, however, do it in an iterative manner. For example, you can do it as I show in the code below. There are a few more special cases to check, like what happens if the user inputs low > high or high*k < x_sum, but I figured you can find and account for them using my modification to your code.
import random
import warnings

def sample_floats(low=0.055, high=1., x_sum=1., k=1):
    """ Return a k-length list of unique random floats
    in the range 'low' <= x <= 'high' summing up to 'x_sum'.
    """
    sum_i = 0
    xs = []
    if x_sum - (k - 1)*low < high:
        warnings.warn(f'high = {high} is too high to be reached under the'
                      f' conditions set by k = {k}, x_sum = {x_sum}, and low = {low}.'
                      f' high automatically set to {x_sum - (k - 1)*low}.')
    if k == 1:
        if high < x_sum:
            raise ValueError(f'The parameter combination k = {k}, x_sum = {x_sum},'
                             f' and high = {high} is impossible.')
        else:
            return [x_sum]
    high_i = min(high, x_sum - (k - 1)*low)  # cap the first draw as well
    for i in range(k - 1):
        x = random.uniform(low, high_i)
        xs.append(x)
        sum_i = sum_i + x
        # leave at least `low` for each of the k-1-i numbers still to come
        high_i = min(high, x_sum - sum_i - (k - 1 - i)*low)
    xs.append(x_sum - sum_i)
    return xs
For example:
random.seed(0)
xs = sample_floats(low = 0.055, high = 0.5, x_sum = 1., k = 5)
print(xs)
print(sum(xs))
Output:
[0.43076772392864643, 0.27801464913542906, 0.08495210994346317, 0.06568433355884717, 0.14058118343361425]
1.0

Related

Sorting numbers in range

I am working on a Python project and I need help sorting a list of elements.
It is a list of investment portfolio returns (10000 items). I want to sort them into intervals so that I can then assign probabilities to them.
Basically, I want to say that, for example, I'll have:
Could you help me out with a library/code that could do this sorting?
Thanks!
TL;DR
import random
import numpy as np
ranges = [2e6, 2.5e6, 3e6, 3.5e6, 4e6, 4.5e6, 5e6]
returns = [random.randint(int(2e6), int(5e6)) for _ in range(10000)]
_, counts = np.unique(np.digitize(returns, bins=ranges), return_counts=True)
Long Answer
If I understand your question right, you want to count the number of elements in a list that satisfy a condition. Assuming you pick the ranges yourself, a pure-Python solution could look like this:
import random

returns = [random.randint(int(2e6), int(5e6)) for _ in range(10000)]
ranges = [[2e6, 2.5e6], [2.5e6, 3e6], [3e6, 3.5e6], [3.5e6, 4e6], [4e6, 4.5e6], [4.5e6, 5e6]]

print(f"Probability \t Return Range")
for lower, upper in ranges:
    returns_in_range = sum(1 for r in returns if lower <= r < upper)
    probability = 100 * (returns_in_range / len(returns))
    print(f"{probability :.2f}% \t\t {lower / 1e6 :.1f}M - {upper / 1e6 :.1f}M")
Which outputs something like:
Probability     Return Range
17.36%          2.0M - 2.5M
16.18%          2.5M - 3.0M
16.63%          3.0M - 3.5M
17.13%          3.5M - 4.0M
16.35%          4.0M - 4.5M
16.35%          4.5M - 5.0M
This is a simple solution; however, it iterates over the returns once for every range, which is costly in your case. A better way is to iterate over the returns once and keep a counter for every range:
import random

ranges = [[2e6, 2.5e6], [2.5e6, 3e6], [3e6, 3.5e6], [3.5e6, 4e6], [4e6, 4.5e6], [4.5e6, 5e6]]
returns = [random.randint(int(2e6), int(5e6)) for _ in range(10000)]

counts = [0] * len(ranges)
for ret in returns:
    for idx, (lower, upper) in enumerate(ranges):
        if lower <= ret < upper:
            counts[idx] += 1
            break  # each return falls into exactly one range

# print the results
print(f"Probability \t Return Range")
for idx, (lower, upper) in enumerate(ranges):
    returns_in_range = counts[idx]
    probability = 100 * (returns_in_range / len(returns))
    print(f"{probability :.2f}% \t\t {lower / 1e6 :.1f}M - {upper / 1e6 :.1f}M")
Using a library like numpy may increase performance significantly, as its core is mostly written in C. The operation you are trying to do is called binning and can be implemented like this:
import random
import numpy as np

ranges = [2e6, 2.5e6, 3e6, 3.5e6, 4e6, 4.5e6, 5e6]
returns = [random.randint(int(2e6), int(5e6)) for _ in range(10000)]
_, counts = np.unique(np.digitize(returns, bins=ranges), return_counts=True)

# print the results
print(f"Probability \t Return Range")
for idx, (lower, upper) in enumerate(zip(ranges, ranges[1:])):
    probability = 100 * (counts[idx] / len(returns))
    print(f"{probability :.2f}% \t\t {lower / 1e6 :.1f}M - {upper / 1e6 :.1f}M")

Find minimum number of elements in a list that covers specific values

A recruiter wants to form a team with different skills and wants to pick the minimum number of persons who can cover all the required skills.
N represents the number of persons and K is the number of distinct skills that need to be covered. The list spec_skill = [[1,3],[0,1,2],[0,2,4]] provides information about the skills of each person, e.g. person 0 has skills 1 and 3, person 1 has skills 0, 1 and 2, and so on.
The code should output the size of the smallest team the recruiter could find (the minimum number of persons) and the IDs of the people to recruit onto the team.
I implemented a brute-force solution below, but since some datasets contain thousands of persons, it seems the problem needs to be solved with heuristic approaches; an approximate answer is acceptable in this case.
Any suggestion on how to solve it with heuristic methods will be appreciated.
import itertools

N, K = 3, 5
spec_skill = [[1,3],[0,1,2],[0,2,4]]

A = list(range(K))
set_a = set(A)
solved = False
for L in range(0, len(spec_skill)+1):
    for subset in itertools.combinations(spec_skill, L):
        s = set(item for sublist in subset for item in sublist)
        if set_a.issubset(s):
            print(str(len(subset)) + '\n' + ' '.join([str(spec_skill.index(item)) for item in subset]))
            solved = True
            break
    if solved: break
Here is my way of doing this. There might be potential optimization possibilities in the code, but the base idea should be understandable.
import random
import time

def man_power(lst, K, iterations=None, period=0):
    """
    Specify a fixed number of iterations
    or a period in seconds to limit the total computation time.
    """
    # mapping each sublist into a (sublist, original_index) tuple
    lst2 = [(lst[i], i) for i in range(len(lst))]
    mini_sample = [0]*(len(lst)+1)
    if period < 0 or (period == 0 and iterations is None):
        raise AttributeError("You must specify iterations or a positive period")

    def shuffle_and_pick(lst, iterations):
        mini = [0]*len(lst)
        for _ in range(iterations):
            random.shuffle(lst2)
            skillset = set()
            chosen_ones = []
            idx = 0
            fullset = True
            # Breaks from the loop when all skills are found
            while len(skillset) < K:
                # No need to go further: we didn't find a better combination,
                # or we ran out of candidates
                if len(chosen_ones) >= len(mini) or idx >= len(lst2):
                    fullset = False
                    break
                before = len(skillset)
                skillset.update(lst2[idx][0])
                after = len(skillset)
                if after > before:
                    # We append with the original index of the sublist
                    chosen_ones.append(lst2[idx][1])
                idx += 1
            if fullset:
                mini = chosen_ones.copy()
        return mini

    # Estimates how many iterations we can do in the specified period
    if iterations is None:
        t0 = time.perf_counter()
        mini_sample = shuffle_and_pick(lst, 1)
        iterations = int(period / (time.perf_counter() - t0)) - 1
    mini_result = shuffle_and_pick(lst, iterations)
    if len(mini_sample) < len(mini_result):
        return mini_sample, len(mini_sample)
    else:
        return mini_result, len(mini_result)
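A possible invocation (my sketch, using the data from the question and the signature of the function above):
spec_skill = [[1, 3], [0, 1, 2], [0, 2, 4]]
team, size = man_power(spec_skill, K=5, iterations=1000)
print(size, team)   # e.g. 2 [0, 2]; persons 0 and 2 together cover skills 0..4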

Fill order from smaller packages?

The input is an integer that specifies the amount to be ordered.
There are predefined package sizes that have to be used to create that order.
e.g.
Packs
3 for $5
5 for $9
9 for $16
for an input order 13 the output should be:
2x5 + 1x3
So far I have the following approach:
remaining_order = 13
package_numbers = [9, 5, 3]
required_packages = []

while remaining_order > 0:
    found = False
    for pack_num in package_numbers:
        if pack_num <= remaining_order:
            required_packages.append(pack_num)
            remaining_order -= pack_num
            found = True
            break
    if not found:
        break
But this will lead to the wrong result:
1x9 + 1x3
remaining: 1
So, you need to fill the order with the packages such that the total price is maximal? This is known as the knapsack problem. In that Wikipedia article you'll find several solutions written in Python.
To be more precise, you need a solution for the unbounded knapsack problem, in contrast to the popular 0/1 knapsack problem (where each item can be packed only once). Here is working code from Rosetta Code:
from itertools import product

NAME, SIZE, VALUE = range(3)
items = (
    # NAME, SIZE, VALUE
    ('A', 3, 5),
    ('B', 5, 9),
    ('C', 9, 16))
capacity = 13

def knapsack_unbounded_enumeration(items, C):
    # find max count of any one item
    max1 = [int(C / item[SIZE]) for item in items]
    itemsizes = [item[SIZE] for item in items]
    itemvalues = [item[VALUE] for item in items]

    # def totvalue(itemscount, itemsizes=itemsizes, itemvalues=itemvalues, C=C):
    def totvalue(itemscount):
        # nonlocal itemsizes, itemvalues, C
        totsize = sum(n * size for n, size in zip(itemscount, itemsizes))
        totval = sum(n * val for n, val in zip(itemscount, itemvalues))
        return (totval, -totsize) if totsize <= C else (-1, 0)

    # Try all combinations of items from 0 up to max1
    bagged = max(product(*[range(n + 1) for n in max1]), key=totvalue)
    numbagged = sum(bagged)
    value, size = totvalue(bagged)
    size = -size
    # convert to (item, count) pairs in name order
    bagged = ['%dx%d' % (n, items[i][SIZE]) for i, n in enumerate(bagged) if n]
    return value, size, numbagged, bagged

if __name__ == '__main__':
    value, size, numbagged, bagged = knapsack_unbounded_enumeration(items, capacity)
    print(value)
    print(bagged)
Output is:
23
['1x3', '2x5']
Keep in mind that this is an NP-hard problem, so it will blow up as you enter larger values :)
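The enumeration above tries every combination of counts, which blows up quickly. For completeness, here is a sketch of the standard pseudo-polynomial dynamic program for the unbounded knapsack (my own illustration, not from the Rosetta page), using the same items and capacity:
items = (('A', 3, 5), ('B', 5, 9), ('C', 9, 16))   # (name, size, value)
capacity = 13

best = [0] * (capacity + 1)        # best[c] = max value achievable with capacity c
choice = [None] * (capacity + 1)   # item that achieved best[c]
for c in range(1, capacity + 1):
    for name, size, value in items:
        if size <= c and best[c - size] + value > best[c]:
            best[c] = best[c - size] + value
            choice[c] = (name, size)

# Walk the choices back to reconstruct the chosen packs.
c, packs = capacity, []
while c > 0 and choice[c]:
    packs.append(choice[c])
    c -= choice[c][1]
print(best[capacity], packs)   # 23 [('A', 3), ('B', 5), ('B', 5)]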
You can use itertools.product:
import itertools

remaining_order = 13
package_numbers = [9, 5, 3]
required_packages = []

# try every combination of up to remaining_order//min(package_numbers) packages
# and keep the one whose sum is closest to the order
a = min([x for i in range(1, remaining_order // min(package_numbers) + 1)
         for x in itertools.product(package_numbers, repeat=i)],
        key=lambda x: abs(sum(x) - remaining_order))
remaining_order -= sum(a)
print(a)
print(remaining_order)
Output:
(5, 5, 3)
0
This simply does the following steps:
Get the value closest to 13 from the list of all product combinations.
Then subtract it from remaining_order.
If you want it output with 'x':
import itertools
from collections import Counter

remaining_order = 13
package_numbers = [9, 5, 3]
required_packages = []

a = min([x for i in range(1, remaining_order // min(package_numbers) + 1)
         for x in itertools.product(package_numbers, repeat=i)],
        key=lambda x: abs(sum(x) - remaining_order))
remaining_order -= sum(a)
print(' + '.join(['{0}x{1}'.format(v, k) for k, v in Counter(a).items()]))
print(remaining_order)
Output:
2x5 + 1x3
0
For your problem, I tried two implementations depending on what you want. In both solutions I assumed you absolutely need the remaining order to reach 0; otherwise the algorithm returns -1. If you need something else, tell me and I can adapt the algorithm.
As the algorithm is implemented via dynamic programming, it handles large inputs well, at least more than 130 packages!
In the first solution, I assumed we fill with the biggest package each time.
In the second solution, I try to minimize the price, while the remaining order must still end at 0.
remaining_order = 13
package_numbers = sorted([9, 5, 3], reverse=True)  # To make sure the biggest package is the first element
prices = {9: 16, 5: 9, 3: 5}
required_packages = []

# First solution, using the biggest package each time, and making the total order remaining reach 0
ans = [[] for _ in range(remaining_order + 1)]
ans[0] = [0, 0, 0]
for i in range(1, remaining_order + 1):
    for index, package_number in enumerate(package_numbers):
        if i - package_number > -1:
            tmp = ans[i - package_number]
            if tmp != -1:
                ans[i] = [tmp[x] if x != index else tmp[x] + 1 for x in range(len(tmp))]
                break
    else:  # Using for/else instead of a boolean flag `found`
        ans[i] = -1  # -1 marks the not-found combinations

print(ans[13])  # [0, 2, 1]
print(ans[9])   # [1, 0, 0]
print(ans[9]) # [1, 0, 0]
# Second solution, minimizing the price with the remaining order at 0
def price(x):
    return 16*x[0] + 9*x[1] + 5*x[2]

ans = [[] for _ in range(remaining_order + 1)]
ans[0] = ([0, 0, 0], 0)  # combination + price
for i in range(1, remaining_order + 1):
    # The not-found combinations will be (-1, float('inf'))
    minimal_price = float('inf')
    minimal_combinations = -1
    for index, package_number in enumerate(package_numbers):
        if i - package_number > -1:
            tmp = ans[i - package_number]
            if tmp != (-1, float('inf')):
                tmp_price = price(tmp[0]) + prices[package_number]
                if tmp_price < minimal_price:
                    minimal_price = tmp_price
                    minimal_combinations = [tmp[0][x] if x != index else tmp[0][x] + 1 for x in range(len(tmp[0]))]
    ans[i] = (minimal_combinations, minimal_price)

print(ans[13])  # ([0, 2, 1], 23)
print(ans[9])   # ([0, 0, 3], 15) because three packages of 3 are cheaper than one package of 9
In case you need a solution for a small number of possible package_numbers but a possibly very big remaining_order, in which case all the other solutions would fail, you can use this to reduce remaining_order:
import numpy as np

remaining_order = 13
package_numbers = [9, 5, 3]
required_packages = []

sub_max = np.sum([(np.prod(package_numbers)/i - 1)*i for i in package_numbers])
while remaining_order > sub_max:
    remaining_order -= np.prod(package_numbers)
    required_packages.append([max(package_numbers)] * (np.prod(package_numbers)//max(package_numbers)))
This works because if any package i appeared in required_packages more often than (np.prod(package_numbers)/i - 1) times, its sum would equal np.prod(package_numbers). In case max(package_numbers) isn't the package with the smallest price per unit, take the one with the smallest price per unit instead.
Example:
remaining_order = 100
package_numbers = [5,3]
Any part of remaining_order bigger than 5*2 plus 3*4 = 22 can be sorted out by adding 5 three times to the solution and subtracting 5*3 from remaining_order.
So the remaining order that actually needs to be calculated is 10, which can then be solved as 2 times 5. The rest is filled with 6 times 15, which is 18 times 5.
In case the number of possible package_numbers is bigger than just a handful, I recommend building a lookup table (with one of the other answers' code) for all numbers below sub_max, which will make this immensely fast for any input; a sketch follows below.
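A minimal sketch of that lookup-table idea (my own illustration; the small table here uses a simple first-fit fill rather than the price-minimizing DP from the other answer):
import numpy as np

package_numbers = [5, 3]
prod = int(np.prod(package_numbers))                          # 15
sub_max = sum((prod // i - 1) * i for i in package_numbers)   # 22

# Exact table for all small orders: table[n] is a list of packages summing
# to exactly n, or None if n cannot be hit.
table = {0: []}
for n in range(1, sub_max + 1):
    table[n] = None
    for p in package_numbers:
        if n - p >= 0 and table[n - p] is not None:
            table[n] = table[n - p] + [p]
            break

def fill_order(order):
    # Shave off whole multiples of prod first, then finish via the table.
    packages = []
    while order > sub_max:
        packages += [max(package_numbers)] * (prod // max(package_numbers))
        order -= prod
    return packages + (table[order] or [])

print(fill_order(100))   # twenty 5s: 18x5 from the reduction + 2x5 from the table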
Since no objective function is declared, I assume your goal is to maximize the package value within the order's capacity.
Explanation: the time complexity is driven by the size of the search space, which is fixed. The optimal solution may not be filling with the highest-valued item as many times as possible; you have to search all possible combinations. However, you can reuse the possible optimal solutions you have already searched to save space. For example, [5,5,3] is derived from adding 3 to a previous [5,5] try, so the intermediate result can be "cached". You may either use an array or a set to store possible solutions. The code below runs with the same performance as the Rosetta code, but I think it's clearer.
To further optimize, use a priority queue for opts.
costs = [3, 5, 9]
value = [5, 9, 16]
volume = 130

# solutions
opts = set()
opts.add(tuple([0]))

# calc total value
cost_val = dict(zip(costs, value))
def total_value(opt):
    return sum([cost_val.get(cost, 0) for cost in opt])

def possible_solutions():
    solutions = set()
    for opt in opts:
        for cost in costs:
            if cost + sum(opt) > volume:
                continue
            cnt = (volume - sum(opt)) // cost
            for n in range(1, cnt + 1):
                sol = tuple(list(opt) + [cost] * n)
                solutions.add(sol)
    return solutions

def optimize_max_return(opts):
    if not opts:
        return tuple([])
    cur = list(opts)[0]
    for sol in opts:
        if total_value(sol) > total_value(cur):
            cur = sol
    return cur

while sum(optimize_max_return(opts)) <= volume - min(costs):
    opts = opts.union(possible_solutions())
print(optimize_max_return(opts))
If your requirement is "just fill the pack" it'll be even simpler using the volume for each item instead.

Finding the optimal location for router placement

I am looking for an optimization algorithm that takes a text file encoded with 0s, 1s, and -1s:
1's denoting target cells that require Wi-Fi coverage
0's denoting cells that are walls
-1's denoting cells that are void (do not require Wi-Fi coverage)
Example of text file:
I have created a solution function along with other helper functions, but I can't seem to get the optimal positions of the routers to ensure proper coverage. Another file does the printing; my struggle is with finding the optimal locations. I basically need to change the get_random_position function into one that returns optimal positions, but I am unsure how to do that. The areas covered by the various routers are:
Each router covers a square area of at most (2S+1)^2
Type 1: S=5; Cost=180
Type 2: S=9; Cost=360
Type 3: S=15; Cost=480
My code is as follows:
import numpy as np
import time
from random import randint

def is_taken(taken, i, j):
    for coords in taken:
        if coords[0] == i and coords[1] == j:
            return True
    return False

def get_random_position(floor, taken, nrows, ncols):
    i = randint(0, nrows-1)
    j = randint(0, ncols-1)
    while floor[i][j] == 0 or floor[i][j] == -1 or is_taken(taken, i, j):
        i = randint(0, nrows-1)
        j = randint(0, ncols-1)
    return (i, j)

def solution(floor):
    start_time = time.time()
    router_types = [1, 2, 3]
    nrows, ncols = floor.shape
    ratio = 0.1
    router_scale = int(nrows*ncols*0.0001)
    if router_scale == 0:
        router_scale = 1
    row_ratio = int(nrows*ratio)
    col_ratio = int(ncols*ratio)
    print('Row : ', nrows, ', Col: ', ncols, ', Router scale :', router_scale)
    global_best = [0, ([], [], [])]
    taken = []
    while True:
        found_better = False
        best = [global_best[0], (list(global_best[1][0]), list(global_best[1][1]), list(global_best[1][2]))]
        for times in range(0, row_ratio + col_ratio):
            if time.time() - start_time > 27.0:
                print('Time ran out! Using what I got : ', time.time() - start_time)
                return global_best[1]
            fit = []
            for rtype in router_types:
                interim = (list(global_best[1][0]), list(global_best[1][1]), list(global_best[1][2]))
                for i in range(0, router_scale):
                    pos = get_random_position(floor, taken, nrows, ncols)
                    interim[0].append(pos[0])
                    interim[1].append(pos[1])
                    interim[2].append(rtype)
                fit.append((fitness(floor, interim), interim))
            highest_fitness = fit[0]
            for index in range(1, len(fit)):
                if fit[index][0] > highest_fitness[0]:
                    highest_fitness = fit[index]
            if highest_fitness[0] > best[0]:
                best[0] = highest_fitness[0]
                best[1] = (highest_fitness[1][0], highest_fitness[1][1], highest_fitness[1][2])
                found_better = True
                global_best = best
                taken.append((best[1][0][-1], best[1][1][-1]))
                break
        if found_better == False:
            break
    print('Best:')
    print(global_best)
    end_time = time.time()
    run_time = end_time - start_time
    print("Run Time:", run_time)
    return global_best[1]

def available_cells(floor):
    available = 0
    for i in range(0, len(floor)):
        for j in range(0, len(floor[i])):
            if floor[i][j] != 0:
                available += 1
    return available

def fitness(building, args):
    render = np.array(building, dtype=int, copy=True)
    cov_factor = 220
    cost_factor = 22
    router_types = {  # type: {'size': coverage radius, 'cost': price}
        1: {'size': 5, 'cost': 180},
        2: {'size': 9, 'cost': 360},
        3: {'size': 15, 'cost': 480},
    }
    routers_used = args[-1]
    for r, c, t in zip(*args):
        size = router_types[t]['size']
        nrows, ncols = render.shape
        rows = range(max(0, r-size), min(nrows, r+size+1))
        cols = range(max(0, c-size), min(ncols, c+size+1))
        walls = []
        for ri in rows:
            for ci in cols:
                if building[ri, ci] == 0:
                    walls.append((ri, ci))
        def blocked(ri, ci):
            for w in walls:
                if min(r, ri) <= w[0] and max(r, ri) >= w[0]:
                    if min(c, ci) <= w[1] and max(c, ci) >= w[1]:
                        return True
            return False
        for ri in rows:
            for ci in cols:
                if blocked(ri, ci):
                    continue
                if render[ri, ci] == 2:
                    render[ri, ci] = 4
                if render[ri, ci] == 1:
                    render[ri, ci] = 2
        render[r, c] = 5
    return (
        cov_factor * np.sum(render > 1) -
        cost_factor * np.sum([router_types[x]['cost'] for x in routers_used])
    )
Here's a suggestion on how to solve the problem; however, I don't claim this is the best approach, and it's certainly not the only one.
Main idea
Your problem can be modeled as a weighted minimum set cover problem.
Good news: this is a well-known optimization problem.
It is easy to find algorithm descriptions for approximate solutions.
A quick search on the web turns up many implementations of approximation algorithms in Python.
Bad news: this is an NP-hard optimization problem.
If you need an exact solution, algorithms will finish in a reasonable amount of time only for "small" problems (in your case, the size of the problem is the number of "1" cells).
Approximate (a.k.a. greedy) algorithms are a trade-off between computation requirements and the risk of delivering far-from-optimal solutions in certain cases.
Note that the following part does not prove that your problem is NP-hard; the general minimum set cover problem is. In your case the subsets have several properties that might help to design a better algorithm, though I have no idea how.
Translating into a set cover problem
Let's define some sets:
U: the set of "1" cells (requiring Wi-Fi).
P(U): the power set of U (the set of subsets of U).
P: the set of cells on which you can place a router (not sure whether P = U in your original post).
T: the set of router types (3 values in your case).
R+: the positive real numbers (used to describe prices).
Let's define a function (pseudo Python):
# Domain of definition: T,P --> R+,P(U)
# This function takes a router type and a position, and returns
# a tuple containing:
# - the price of a router of the given type.
# - the subset of U containing all the positions covered by a router
#   of the given type placed at the given position.
def weighted_subset(routerType, position):
    pass  # TODO: implementation
Now, we define one last set as the image of the function we've just described: S = weighted_subset(T, P). Each element of this set is a subset of U, weighted by a price in R+.
With all this formalism, finding the router types and positions that:
give coverage to all the desirable locations
minimize the cost
is equivalent to finding a sub-collection of S:
whose union of subsets equals U
which minimizes the sum of the associated weights
which is exactly the weighted minimum set cover problem.
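As a concrete illustration (my own sketch, not part of the formalism above): the standard greedy approximation repeatedly picks the candidate with the lowest price per newly covered cell. Here candidates stands for the output of weighted_subset over every (router type, position) pair, represented as (price, covered_cells) tuples:
def greedy_set_cover(universe, candidates):
    """universe: set of '1' cells; candidates: list of (price, set_of_cells)."""
    uncovered = set(universe)
    chosen, total_price = [], 0
    while uncovered:
        # Among still-useful candidates, pick the best price per newly covered cell.
        useful = [c for c in candidates if c[1] & uncovered]
        if not useful:          # some cells cannot be covered at all
            break
        price, cells = min(useful, key=lambda c: c[0] / len(c[1] & uncovered))
        chosen.append((price, cells))
        total_price += price
        uncovered -= cells
    return chosen, total_price

# Tiny usage example with made-up data:
universe = {(0, 0), (0, 1), (1, 0), (1, 1)}
candidates = [(180, {(0, 0), (0, 1)}), (180, {(1, 0), (1, 1)}), (480, universe)]
print(greedy_set_cover(universe, candidates))   # picks the two cheap routers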

Calculating Incremental Entropy for Data that is not real numbers

I have a set of data which has an ID, a timestamp, and identifiers. I have to go through it, calculate the entropy, and save some other links for the data. At each step more identifiers are added to the identifiers dictionary and I have to re-compute the entropy and append it. I have a really large amount of data and the program gets stuck due to the growing number of identifiers and their entropy calculation after each step. I read the following solution but it is about data consisting of numbers:
Incremental entropy computation
I have copied two functions from that page, but the incremental calculation of entropy gives different values than the classical full entropy calculation at every step.
Here is the code I have:
from math import log

# ---------------------------------------------------------------------#
# Functions copied from https://stackoverflow.com/questions/17104673/incremental-entropy-computation

# maps x to -x*log2(x) for x > 0, and to 0 otherwise
h = lambda p: -p*log(p, 2) if p > 0 else 0

# entropy of the union of two samples with entropies H1 and H2
def update(H1, S1, H2, S2):
    S = S1 + S2
    return 1.0*H1*S1/S + h(1.0*S1/S) + 1.0*H2*S2/S + h(1.0*S2/S)

# compute entropy using the classic equation
def entropy(L):
    n = 1.0*sum(L)
    return sum([h(x/n) for x in L])

# ---------------------------------------------------------------------#
# Below is the input data (actually I read it from a csv file)
input_data = [["1","2008-01-06T02:13:38Z","foo,bar"], ["2","2008-01-06T02:12:13Z","bar,blup"], ["3","2008-01-06T02:13:55Z","foo,bar"],
              ["4","2008-01-06T02:12:28Z","foo,xy"], ["5","2008-01-06T02:12:44Z","foo,bar"], ["6","2008-01-06T02:13:00Z","foo,bar"],
              ["7","2008-01-06T02:13:00Z","x,y"]]

total_identifiers = {}   # To store the occurrences of identifiers; values show the number of occurrences
all_entropies = []       # Classical way of calculating entropy at every step
updated_entropies = []   # Incremental way of calculating entropy at every step

for item in input_data:
    temp = item[2].split(",")
    identifiers_sum = sum(total_identifiers.values())  # Sum of all identifiers
    old_entropy = 0 if all_entropies[-1:] == [] else all_entropies[-1]  # Get previous entropy calculation
    for identifier in temp:
        S_new = len(temp)  # number of new samples
        temp_dictionary = {a: 1 for a in temp}  # Store current identifiers and their occurrence
        if identifier not in total_identifiers:
            total_identifiers[identifier] = 1
        else:
            total_identifiers[identifier] += 1
    current_entropy = entropy(total_identifiers.values())  # Entropy for the current set of identifiers
    updated_entropy = update(old_entropy, identifiers_sum, current_entropy, S_new)
    updated_entropies.append(updated_entropy)
    entropy_value = entropy(total_identifiers.values())  # Classical entropy calculation for comparison; this step becomes too expensive with big data
    all_entropies.append(entropy_value)

print(total_identifiers)
print('Sum of Total Identifiers: ', identifiers_sum)  # Gives 12 while the sum is 14 ???
print("All Classical Entropies: ", all_entropies)     # print for comparison
print("All Updated Entropies: ", updated_entropies)
The other issue is that when I print "Sum of Total Identifiers", it gives 12 instead of 14! (Due to the very large amount of data, I read the actual file line by line, write the results directly to disk, and store nothing in memory apart from the dictionary of identifiers.)
The code above uses Theorem 4; it seems to me that you want to use Theorem 5 instead (from the paper referenced in the next paragraph).
Note, however, that if the number of identifiers is really the problem, then the incremental approach below isn't going to work either: at some point the dictionaries will get too large.
Below you can find a proof-of-concept Python implementation that follows the description from Updating Formulas and Algorithms for Computing Entropy and Gini Index from Time-Changing Data Streams.
import collections
import math
import random

def log2(p):
    return math.log(p, 2) if p > 0 else 0

CountChange = collections.namedtuple('CountChange', ('label', 'change'))

class EntropyHolder:
    def __init__(self):
        self.counts_ = collections.defaultdict(int)
        self.entropy_ = 0
        self.sum_ = 0

    def update(self, count_changes):
        r = sum([change for _, change in count_changes])
        residual = self._compute_residual(count_changes)
        self.entropy_ = self.sum_ * (self.entropy_ - log2(self.sum_ / (self.sum_ + r))) / (self.sum_ + r) - residual
        self._update_counts(count_changes)
        return self.entropy_

    def _compute_residual(self, count_changes):
        r = sum([change for _, change in count_changes])
        residual = 0
        for label, change in count_changes:
            p_new = (self.counts_[label] + change) / (self.sum_ + r)
            p_old = self.counts_[label] / (self.sum_ + r)
            residual += p_new * log2(p_new) - p_old * log2(p_old)
        return residual

    def _update_counts(self, count_changes):
        for label, change in count_changes:
            self.sum_ += change
            self.counts_[label] += change

    def entropy(self):
        return self.entropy_

def naive_entropy(counts):
    s = sum(counts)
    return sum([-(r/s) * log2(r/s) for r in counts])

if __name__ == '__main__':
    print(naive_entropy([1, 1]))
    print(naive_entropy([1, 1, 1, 1]))

    entropy = EntropyHolder()
    freq = collections.defaultdict(int)
    for _ in range(100):
        index = random.randint(0, 5)
        entropy.update([CountChange(index, 1)])
        freq[index] += 1

    print(naive_entropy(freq.values()))
    print(entropy.entropy())
Thanks @blazs for providing the EntropyHolder class; that solves the problem. The idea is to import entropy_holder.py from https://gist.github.com/blazs/4fc78807a96976cc455f49fc0fb28738 and use it to store the previous entropy and update it at every step when new identifiers come in.
So the minimum working code looks like this:
import entropy_holder

input_data = [["1","2008-01-06T02:13:38Z","foo,bar"], ["2","2008-01-06T02:12:13Z","bar,blup"], ["3","2008-01-06T02:13:55Z","foo,bar"],
              ["4","2008-01-06T02:12:28Z","foo,xy"], ["5","2008-01-06T02:12:44Z","foo,bar"], ["6","2008-01-06T02:13:00Z","foo,bar"],
              ["7","2008-01-06T02:13:00Z","x,y"]]

entropy = entropy_holder.EntropyHolder()  # This class holds the current entropy and the counts of identifiers
for item in input_data:
    for identifier in item[2].split(","):
        entropy.update([entropy_holder.CountChange(identifier, 1)])

print(entropy.entropy())
The entropy computed with blazs's incremental formulas is very close to the entropy calculated the classical way, and it saves iterating over all the data again and again.
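As a quick sanity check (my addition, reusing input_data from above), you can recompute the classical entropy once at the end and compare it with the incrementally maintained value:
import collections
from math import log2

counts = collections.Counter(
    identifier for item in input_data for identifier in item[2].split(",")
)
total = sum(counts.values())
classical = -sum((c / total) * log2(c / total) for c in counts.values())

print(classical)   # should be very close to entropy.entropy()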
