I have a data set in which each record has an ID, a timestamp, and a list of identifiers. I have to go through it, calculate the entropy, and save some other links for the data. At each step more identifiers are added to the identifiers dictionary, and I have to recompute the entropy and append it. I have a really large amount of data, and the program gets stuck because the number of identifiers keeps growing and the entropy is recalculated from scratch after each step. I read the following solution, but it deals with data consisting of numbers.
Incremental entropy computation
I copied two functions from that page, but the incremental calculation of entropy gives different values than the classical full entropy calculation at every step.
Here is the code I have:
from math import log
# ---------------------------------------------------------------------#
# Functions copied from https://stackoverflow.com/questions/17104673/incremental-entropy-computation
# maps x to -x*log2(x) for x>0, and to 0 otherwise
h = lambda p: -p*log(p, 2) if p > 0 else 0
# entropy of union of two samples with entropies H1 and H2
def update(H1, S1, H2, S2):
    S = S1+S2
    return 1.0*H1*S1/S+h(1.0*S1/S)+1.0*H2*S2/S+h(1.0*S2/S)
# compute entropy using the classic equation
def entropy(L):
    n = 1.0*sum(L)
    return sum([h(x/n) for x in L])
# ---------------------------------------------------------------------#
# Below is the input data (Actually I read it from a csv file)
input_data = [["1","2008-01-06T02:13:38Z","foo,bar"], ["2","2008-01-06T02:12:13Z","bar,blup"], ["3","2008-01-06T02:13:55Z","foo,bar"],
              ["4","2008-01-06T02:12:28Z","foo,xy"], ["5","2008-01-06T02:12:44Z","foo,bar"], ["6","2008-01-06T02:13:00Z","foo,bar"],
              ]
total_identifiers = {} # To store the occurrences of identifiers. Values shows the number of occurrences
all_entropies = [] # Classical way of calculating entropy at every step
updated_entropies = [] # Incremental way of calculating entropy at every step
for item in input_data:
    temp = item[2].split(",")
    identifiers_sum = sum(total_identifiers.values()) # Sum of all identifiers
    old_entropy = 0 if all_entropies[-1:] == [] else all_entropies[-1] # Get previous entropy calculation
    for identifier in temp:
        S_new = len(temp) # sum of new samples
        temp_dictionaty = {a:1 for a in temp} # Store current identifiers and their occurrence
        if identifier not in total_identifiers:
            total_identifiers[identifier] = 1
        else:
            total_identifiers[identifier] += 1
    current_entropy = entropy(total_identifiers.values()) # Entropy for current set of identifiers
    updated_entropy = update(old_entropy, identifiers_sum, current_entropy, S_new)
    updated_entropies.append(updated_entropy)
    entropy_value = entropy(total_identifiers.values()) # Classical entropy calculation for comparison. This step becomes too expensive with big data
    all_entropies.append(entropy_value)
print('Sum of Total Identifiers: ', identifiers_sum) # Gives 12 while the sum is 14 ???
print("All Classical Entropies: ", all_entropies) # print for comparison
print("All Updated Entropies: ", updated_entropies)
The other issue is that when I print "Sum of total_identifiers", it gives 12 instead of 14! (Due to very large amount of data, I read the actual file line by line and write the results directly to the disk and do not store it in the memory apart from the dictionary of identifiers).
The code above uses Theorem 4; it seems to me that you want to use Theorem 5 instead (from the paper in the next paragraph).
Note, however, that if the number of identifiers is really the problem, then the incremental approach below isn't going to work either: at some point the dictionaries are going to get too large.
Below you can find a proof-of-concept Python implementation that follows the description from Updating Formulas and Algorithms for Computing Entropy and Gini Index from Time-Changing Data Streams.
import collections
import math
import random
def log2(p):
    return math.log(p, 2) if p > 0 else 0
CountChange = collections.namedtuple('CountChange', ('label', 'change'))
class EntropyHolder:
    def __init__(self):
        self.counts_ = collections.defaultdict(int)
        self.entropy_ = 0
        self.sum_ = 0
    def update(self, count_changes):
        r = sum([change for _, change in count_changes])
        residual = self._compute_residual(count_changes)
        self.entropy_ = self.sum_ * (self.entropy_ - log2(self.sum_ / (self.sum_ + r))) / (self.sum_ + r) - residual
        # Apply the changes only now: the entropy update above needs the old counts
        self._update_counts(count_changes)
        return self.entropy_
    def _compute_residual(self, count_changes):
        r = sum([change for _, change in count_changes])
        residual = 0
        for label, change in count_changes:
            p_new = (self.counts_[label] + change) / (self.sum_ + r)
            p_old = self.counts_[label] / (self.sum_ + r)
            residual += p_new * log2(p_new) - p_old * log2(p_old)
        return residual
    def _update_counts(self, count_changes):
        for label, change in count_changes:
            self.sum_ += change
            self.counts_[label] += change
    def entropy(self):
        return self.entropy_
def naive_entropy(counts):
    s = sum(counts)
    return sum([-(r/s) * log2(r/s) for r in counts])
if __name__ == '__main__':
    print(naive_entropy([1, 1]))
    print(naive_entropy([1, 1, 1, 1]))
    entropy = EntropyHolder()
    freq = collections.defaultdict(int)
    for _ in range(100):
        index = random.randint(0, 5)
        entropy.update([CountChange(index, 1)])
        freq[index] += 1
    # Compare the classical result with the incremental one
    print(naive_entropy(freq.values()))
    print(entropy.entropy())
Thanks @blazs for providing the entropy_holder class. That solves the problem. The idea is to import entropy_holder.py from https://gist.github.com/blazs/4fc78807a96976cc455f49fc0fb28738 and use it to store the previous entropy, updating it at every step as new identifiers arrive.
So the minimum working code would look like this:
import entropy_holder
input_data = [["1","2008-01-06T02:13:38Z","foo,bar"], ["2","2008-01-06T02:12:13Z","bar,blup"], ["3","2008-01-06T02:13:55Z","foo,bar"],
              ["4","2008-01-06T02:12:28Z","foo,xy"], ["5","2008-01-06T02:12:44Z","foo,bar"], ["6","2008-01-06T02:13:00Z","foo,bar"],
              ]
entropy = entropy_holder.EntropyHolder() # This class will hold the current entropy and counts of identifiers
for item in input_data:
    for identifier in item[2].split(","):
        entropy.update([entropy_holder.CountChange(identifier, 1)])
The entropy computed with Blaz's incremental formulas is very close to the entropy calculated the classical way, and it avoids iterating over all the data again and again.
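For readers who cannot fetch the gist, here is a minimal self-contained sketch of the same idea for the case where one identifier arrives at a time. The names plogp, add_one and naive are made up for this illustration and are not part of entropy_holder.py; the rows are the first four from the question, and Python 3 is assumed.

```python
import math
from collections import Counter

def plogp(p):
    # p * log2(p), with the usual convention that 0 * log2(0) = 0
    return p * math.log2(p) if p > 0 else 0.0

def add_one(H, n, counts, label):
    """Return (new entropy, new total) after observing one more `label`.

    H and n are the entropy and total count before the observation.
    `counts` is updated in place.
    """
    new_n = n + 1
    old_c = counts[label]
    # Rescale the old entropy to the new total (nothing to carry over when n == 0)
    carried = (n / new_n) * (H - math.log2(n / new_n)) if n > 0 else 0.0
    # Swap the old contribution of `label` for its new one
    H_new = carried - (plogp((old_c + 1) / new_n) - plogp(old_c / new_n))
    counts[label] += 1
    return H_new, new_n

def naive(counts):
    # Classical entropy over all counts, used here only as a cross-check
    total = sum(counts.values())
    return -sum(plogp(c / total) for c in counts.values())

rows = ["foo,bar", "bar,blup", "foo,bar", "foo,xy"]
counts, H, n = Counter(), 0.0, 0
for row in rows:
    for label in row.split(","):
        H, n = add_one(H, n, counts, label)
    # The incremental value should agree with the full recomputation
    assert math.isclose(H, naive(counts), abs_tol=1e-9)
```

Each update touches only the identifier that changed, so the cost per step no longer depends on the number of distinct identifiers, although the dictionary of counts still has to fit in memory.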
I'm working on a project to tell real text from gibberish. I found this code on GitHub and made some slight edits to better fit my needs. When testing, I keep getting a TypeError on line 17, return [c.lower() for c in line if c.lower() in accepted_chars].
import math
import pickle
accepted_chars = 'abcdefghijklmnopqrstuvwxyz '
pos = dict([(char, idx) for idx, char in enumerate(accepted_chars)])
def normalize(line):
    """Return only the subset of chars from accepted_chars.
    This helps keep the model relatively small by ignoring punctuation,
    infrequent symbols, etc. """
    return [c.lower() for c in line if c.lower() in accepted_chars]
def ngram(n, l):
    """Return all n grams from l after normalizing """
    filtered = normalize(l)
    for start in range(0, len(filtered) - n + 1):
        yield ''.join(filtered[start:start + n])
def train():
    """ Write a simple model as a pickle file"""
    k = len(accepted_chars)
    # Assume we have seen 10 of each character pair. This acts as a kind of
    # prior or smoothing factor. This way, if we see a character transition
    # live that we've never observed in the past, we won't assume the entire
    # string has 0 probability.
    counts = [[10 for i in range(k)] for i in range(k)]
    # Count transitions from big text file, taken
    # from http://norvig.com/spell-correct.html
    for line in open('big.txt'):
        for a,b in ngram(2, line):
            counts[pos[a]][pos[b]] += 1
    # Normalize the counts so that they become log probabilities.
    # We use log probabilities rather than straight probabilities to avoid
    # numeric underflow issues with long texts.
    # This contains a justification:
    # http://squarecog.wordpress.com/2009/01/10/dealing-with-underflow-in-joint-probability-calculations/
    for i, row in enumerate(counts):
        s = float(sum(row))
        for j in range(len(row)):
            row[j] = math.log(row[j] / s)
    # Find the probability of generating a few arbitrarily chosen good and
    # bad phrases.
    good_probs = [avg_transition_prob(l, counts) for l in open('good.txt')]
    bad_probs = [avg_transition_prob(l, counts) for l in open('bad.txt')]
    # Assert that we actually are capable of detecting the junk.
    assert min(good_probs) > max(bad_probs)
    # And pick a threshold halfway between the worst good and best bad inputs.
    thresh = (min(good_probs) + max(bad_probs)) / 2
    pickle.dump({'mat': counts, 'thresh': thresh}, open('gib.model.pki', 'wb'))
def avg_transition_prob(l, log_prob_mat):
    """ Return the average transition prob from l through log_prob_mat """
    log_prob = 0.0
    transition_ct = 0
    for a, b in ngram(2,1):
        log_prob += log_prob_mat[pos[a]][pos[b]]
        transition_ct += 1
    return math.exp(log_prob / (transition_ct or 1))
if __name__ == '__main__':
    train()
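avg_transition_prob calls ngram(2,1), which passes its second argument, the integer 1, to normalize.
normalize then does this:
return [c.lower() for c in line if c.lower() in accepted_chars]
Here line is 1, and you can't iterate over an integer with for c in 1, hence the TypeError.
Maybe you meant to put l (the lowercase letter L) there instead of 1?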
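To see the difference in isolation, here is a small sketch that reuses normalize and ngram from the question. The sample string and the variable names text, error_message and bigrams are only for illustration.

```python
accepted_chars = 'abcdefghijklmnopqrstuvwxyz '

def normalize(line):
    return [c.lower() for c in line if c.lower() in accepted_chars]

def ngram(n, l):
    filtered = normalize(l)
    for start in range(0, len(filtered) - n + 1):
        yield ''.join(filtered[start:start + n])

text = "Hi, Bob"

# Passing the digit 1: normalize() tries to iterate over an int and fails.
# ngram is a generator, so the error only shows up once it is consumed.
try:
    list(ngram(2, 1))
except TypeError as exc:
    error_message = str(exc)

# Passing the string: the comma is dropped and the bigrams are produced.
bigrams = list(ngram(2, text))
```

Renaming the parameter l to something like line or text makes this kind of mix-up much harder to write in the first place.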
I have a piece of code that computes the partitions of a set of (potentially duplicated) integers. What I am interested in is the set of distinct partitions and their multiplicities.
You can, for example, run the following code:
import math
import numpy as np
from collections import Counter
import pandas as pd
def _B(i):
    # for a given multi-index i, we define _B(i) as the list of integers containing i_j times the number j:
    if len(i) != 1:
        B = []
        for j in range(len(i)):
            B.extend(i[j] * [j])
    else:
        B = i[0] * [0]
    return B
def _partition(collection):
    # from here: https://stackoverflow.com/a/62532969/8425270
    if len(collection) == 1:
        yield (collection,)
        return
    first = collection[0]
    for smaller in _partition(collection[1:]):
        # insert `first` in each of the subpartition's subsets
        for n, subset in enumerate(smaller):
            yield smaller[:n] + ((first,) + subset,) + smaller[n + 1 :]
        # put `first` in its own subset
        yield ((first,),) + smaller
def to_list(tpl):
    # the final hierarchy is
    return list(list(i) if isinstance(i, tuple) else i for i in tpl)
def _Pi(inst_B):
    # inst_B must be a tuple
    if type(inst_B) != tuple:
        inst_B = tuple(inst_B)
    pp = [tuple(sorted(p)) for p in _partition(inst_B)]
    c = Counter(pp)
    Pi = c.keys()
    N = list()
    for pi in Pi:
        # number of times this partition was generated
        N.append(c[pi])
    Pi = [to_list(pi) for pi in Pi]
    return Pi, N
if __name__ == "__main__":
    import cProfile
    pr = cProfile.Profile()
    pr.enable()
    sh = (3, 3, 3)
    rez = list()
    rez_sorted= list()
    rez_ref = list()
    for idx in np.ndindex(sh):
        if sum(idx) > 0:
            Pi, N = _Pi(_B(idx))
            print(pd.DataFrame({'Pi': Pi, 'N': N * np.array([math.factorial(len(pi) - 1) for pi in Pi])}))
    # after your program ends
    pr.disable()
    pr.print_stats()
This code computes, for several example tuples of integers (generated by np.ndindex), the partitions and counts I need. Everything happens in the _partition and _Pi functions; this is where you should look.
If you look closely at how these two functions work, you'll see that they compute every potential partition and THEN count how many times each one appeared. For small problems this is fine, but as the size of the problem increases it starts to take a lot of time. Try setting sh = (5,5,5) and you'll see what I mean.
So the problem is the following:
Is there a way to directly compute the partitions and their numbers of occurrences instead?
Edit: I cross-posted on mathoverflow there, and they propose a solution in this article, in corollary 2.10 (page 10 of the pdf). The problem could be solved by implementing the sets p(v,r) in this corollary.
I was hoping that, as in the univariate case, those sets would have a nice recursive expression, but I could not find one yet.
More edit: This problem is equivalent to finding all (multiset) partitions of a multiset. If the solution for finding (set) partitions of a set is given by partial Bell polynomials, here we need a multivariate version of these polynomials.
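One possible direct approach, offered as a sketch and not as the method from the article: describe the multiset by its multi-index v (the same idx that np.ndindex produces), describe each block by a count vector, and generate the blocks in non-increasing lexicographic order so that every multiset partition appears exactly once. The multiplicity is then taken from a closed form that comes from a counting argument of my own (permutations of equal elements, divided by those that leave the partition unchanged), so it is worth checking against the brute-force code on small inputs. The names vector_partitions and multiplicity are made up for this illustration, and math.prod needs Python 3.8 or later.

```python
from collections import Counter
from itertools import product
from math import factorial, prod

def vector_partitions(v, bound=None):
    """Yield each multiset partition of the multi-index v exactly once.

    A partition is a tuple of nonzero count vectors (blocks) that sum to v,
    listed in non-increasing lexicographic order.
    """
    if not any(v):
        yield ()
        return
    if bound is None:
        bound = v
    for block in product(*(range(m, -1, -1) for m in v)):
        # Skip the empty block and any block that would break the ordering
        if not any(block) or block > bound:
            continue
        rest = tuple(a - b for a, b in zip(v, block))
        for tail in vector_partitions(rest, block):
            yield (block,) + tail

def multiplicity(v, blocks):
    """Number of set partitions of the labelled elements that collapse to `blocks`."""
    numerator = prod(factorial(m) for m in v)
    # Equal elements inside one block can be permuted freely...
    denominator = prod(factorial(b) for block in blocks for b in block)
    # ...and so can identical blocks among themselves
    denominator *= prod(factorial(r) for r in Counter(blocks).values())
    return numerator // denominator

v = (2, 1)  # the multiset {0, 0, 1}
result = [(blocks, multiplicity(v, blocks)) for blocks in vector_partitions(v)]
```

Note that multiplicity returns the raw count N, before the factorial(len(pi) - 1) factor that the question applies when building the DataFrame. The inner loop still enumerates every candidate block, so this avoids the duplicate partitions but is not tuned for speed.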
Let's say I wanted to call a function to do some calculation, but I also wanted to use that calculated value in a later function. When I return the value of the first function can I not just send it to my next function? Here is an example of what I am talking about:
def add(x,y):
    addition = x + y
    return addition
def square(a):
    result = a * a
    return result
sum = add(1,4)
product = square(addition)
If I call the add function, it'll return 5 as the addition result. But I want to use that number 5 in the next function, can I just send it to the next function as shown? In the main program I am working on it does not work like this.
Edit: This is a sample of the code I am actually working on which will give a better idea of what the problem is. The problem is when I send the mean to the calculateStdDev function.
#import libraries to be used
import time
import StatisticsCalculations
#global variables
mean = 0
stdDev = 0
#get file from user
fileChoice = input("Enter the .csv file name: ")
inputFile = open(fileChoice)
headers = inputFile.readline().strip('\n').split(',') #create headers for columns and strips unnecessary characters
#create a list with header-number of lists in it
dataColumns = []
for i in headers:
    dataColumns.append([]) #fills initial list with as many empty lists as there are columns
#counts how many rows there are and adds a column of data into each empty list
rowCount = 0
for row in inputFile:
    rowCount = rowCount + 1
    comps = row.strip().split(',') #components of data
    for j in range(len(comps)):
        dataColumns[j].append(float(comps[j])) #appends the jth entry into the jth column, separating data into categories
k = 0
for entry in dataColumns:
    print("{:>11}".format(headers[k]),"|", "{:>10.2f}".format(StatisticsCalculations.findMax(dataColumns[k])),"|",
          "{:>10.2f}".format(StatisticsCalculations.findMin(dataColumns[k])),"|","{:>10.2f}".format(StatisticsCalculations.calculateMean(dataColumns[k], rowCount)),"|","{:>10.2f}".format()) #format each data entry to be right aligned and be correctly spaced in its column
    #printing break line for each row
    k = k + 1 #counting until dataColumns is exhausted
And the StatisticsCalculations module:
import math
def calculateMean(data, rowCount):
    sumForMean = 0
    for entry in data:
        sumForMean = sumForMean + entry
    mean = sumForMean/rowCount
    return mean
def calculateStdDev(data, mean, rowCount, entry):
    stdDevSum = 0
    for x in data:
        stdDevSum = float(stdDevSum) + ((float(entry[x]) - mean)** 2) #getting sum of squared difference to be used in std dev formula
    stdDev = math.sqrt(stdDevSum / rowCount) #using the stdDevSum for the remaining parts of std dev formula
    return stdDev
def findMin(data):
    lowestNum = 1000
    for component in data:
        if component < lowestNum:
            lowestNum = component
    return lowestNum
def findMax(data):
    highestNum = -1
    for number in data:
        if number > highestNum:
            highestNum = number
    return highestNum
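First of all, sum is the name of a built-in function (not a reserved word), so you shouldn't reuse it as a variable name: doing so hides the built-in. Second, addition only exists inside add; outside the function you have to use the name you assigned the returned value to.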
You can do it this way:
def add(x,y):
    addition = x + y
    return addition
def square(a):
    result = a * a
    return result
s = add(1, 4)
product = square(s)
Or directly:
product = square(add(1, 4))
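The same pattern applies to the mean and standard deviation in your edit. Below is a minimal sketch under my own assumptions: the snake_case function names, the two-argument signature and the sample column are illustrative and are not your StatisticsCalculations module.

```python
import math

def calculate_mean(data):
    # Average of a list of numbers
    return sum(data) / len(data)

def calculate_std_dev(data, mean):
    # Population standard deviation, using a mean computed elsewhere
    squared_diffs = sum((x - mean) ** 2 for x in data)
    return math.sqrt(squared_diffs / len(data))

column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean = calculate_mean(column)              # keep the returned value...
std_dev = calculate_std_dev(column, mean)  # ...and pass it on as an argument
```

Inside the loop over the values, each x is already an element of the list, so it is used directly and not as an index.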
I am looking for an optimization algorithm that takes a text file encoded with 0s, 1s, and -1s:
1's denoting target cells that require Wi-Fi coverage
0's denoting cells that are walls
-1's denoting cells that are void (do not require Wi-Fi coverage)
Example of text file:
I have created a solution function along with other helper functions, but I can't seem to get the optimal positions for the routers to ensure proper coverage. Another file does the printing; what I am struggling with is finding the optimal locations. I basically need to change the get_random_position function so that it returns an optimal position, but I am unsure how to do that. The areas covered by the various routers are:
Each router covers a square area of at most (2S+1)^2
Type 1: S=5; Cost=180
Type 2: S=9; Cost=360
Type 3: S=15; Cost=480
This is the kind of output I am getting:
My code is as follows:
import numpy as np
import time
from random import randint
def is_taken(taken, i, j):
    for coords in taken:
        if coords[0] == i and coords[1] == j:
            return True
    return False
def get_random_position(floor, taken , nrows, ncols):
    i = randint(0, nrows-1)
    j = randint(0, ncols-1)
    while floor[i][j] == 0 or floor[i][j] == -1 or is_taken(taken, i, j):
        i = randint(0, nrows-1)
        j = randint(0, ncols-1)
    return (i, j)
def solution(floor):
    start_time = time.time()
    router_types = [1,2,3]
    nrows, ncols = floor.shape
    ratio = 0.1
    router_scale = int(nrows*ncols*0.0001)
    if router_scale == 0:
        router_scale = 1
    row_ratio = int(nrows*ratio)
    col_ratio = int(ncols*ratio)
    print('Row : ',nrows, ', Col: ', ncols, ', Router scale :', router_scale)
    global_best = [0, ([],[],[])]
    taken = []
    while True:
        found_better = False
        best = [global_best[0], (list(global_best[1][0]), list(global_best[1][1]), list(global_best[1][2]))]
        for times in range(0, row_ratio+col_ratio):
            if time.time() - start_time > 27.0:
                print('Time ran out! Using what I got : ', time.time() - start_time)
                return global_best[1]
            fit = []
            for rtype in router_types:
                interim = (list(global_best[1][0]), list(global_best[1][1]), list(global_best[1][2]))
                for i in range(0, router_scale):
                    pos = get_random_position(floor, taken, nrows, ncols)
                    # add the candidate router (row, column, type) to the trial layout
                    interim[0].append(pos[0])
                    interim[1].append(pos[1])
                    interim[2].append(rtype)
                fit.append((fitness(floor, interim), interim))
            highest_fitness = fit[0]
            for index in range(1, len(fit)):
                if fit[index][0] > highest_fitness[0]:
                    highest_fitness = fit[index]
            if highest_fitness[0] > best[0]:
                best[0] = highest_fitness[0]
                best[1] = (highest_fitness[1][0],highest_fitness[1][1], highest_fitness[1][2])
                found_better = True
        global_best = best
        if found_better == False:
            end_time = time.time()
            run_time = end_time - start_time
            print("Run Time:", run_time)
            return global_best[1]
def available_cells(floor):
    available = 0
    for i in range(0, len(floor)):
        for j in range(0, len(floor[i])):
            if floor[i][j] != 0:
                available += 1
    return available
def fitness(building, args):
    render = np.array(building, dtype=int, copy=True)
    cov_factor = 220
    cost_factor = 22
    router_types = { # type: [coverage, cost]
        1: {'size' : 5, 'cost' : 180},
        2: {'size' : 9, 'cost' : 360},
        3: {'size' : 15, 'cost' : 480},
    }
    routers_used = args[-1]
    for r, c, t in zip(*args):
        size = router_types[t]['size']
        nrows, ncols = render.shape
        rows = range(max(0, r-size), min(nrows, r+size+1))
        cols = range(max(0, c-size), min(ncols, c+size+1))
        walls = []
        for ri in rows:
            for ci in cols:
                if building[ri, ci] == 0:
                    walls.append((ri, ci))
        def blocked(ri, ci):
            # a cell is blocked if a wall lies in the rectangle between it and the router
            for w in walls:
                if min(r, ri) <= w[0] and max(r, ri) >= w[0]:
                    if min(c, ci) <= w[1] and max(c, ci) >= w[1]:
                        return True
            return False
        for ri in rows:
            for ci in cols:
                if blocked(ri, ci):
                    continue
                if render[ri, ci] == 2:
                    render[ri, ci] = 4
                if render[ri, ci] == 1:
                    render[ri, ci] = 2
        render[r, c] = 5
    return (
        cov_factor * np.sum(render > 1) -
        cost_factor * np.sum([router_types[x]['cost'] for x in routers_used])
    )
Here's a suggestion on how to approach the problem. I don't claim this is the best approach, and it's certainly not the only one.
Main idea
Your problem can be modeled as a weighted minimum set cover problem.
Good news, this is a well-known optimization problem:
It is easy to find descriptions of algorithms that give approximate solutions.
A quick search on the web turns up many Python implementations of approximation algorithms.
Bad news, this is an NP-hard optimization problem:
If you need an exact solution, algorithms will only finish in a reasonable amount of time for "small" problems (in your case, the size of the problem is the number of "1" cells).
Approximation algorithms (greedy ones, for example) trade lower computation requirements for the risk of delivering far-from-optimal solutions in certain cases.
Note that this does not prove that your particular problem is NP-hard. The general minimum set cover problem is NP-hard, but in your case the subsets have several properties that might help in designing a better algorithm. I have no idea how, though.
Translating into a set cover problem
Let's define some sets:
U: the set of "1" cells (requiring Wi-Fi).
P(U): the power set of U (the set of subsets of U).
P: the set of cells on which you can place a router (not sure if P=U in your original post).
T: the set of router types (3 values in your case).
R+: the positive real numbers (used to describe prices).
Let's define a function (pseudo Python):
# Domain of definition : T,P --> R+,P(U)
# This function takes a router type and a position, and returns
# a tuple containing:
# - the price of a router of the given type.
# - the subset of U containing all the position covered by a router
# of the given type placed at the given position.
def weighted_subset(routerType, position):
    pass # TODO: implementation
Now, we define a last set, as the image of the function we've just described: S=weighted_subset(T,P). Each element of this set is a subset of U, weighted by a price in R+.
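As a rough illustration of what weighted_subset could look like, here is a sketch under simplifying assumptions: it takes the floor grid as an extra third argument (not part of the signature above), it treats coverage as the full square of side 2S+1 clipped to the grid, and it ignores walls entirely. The sizes and costs are the ones given in the question; ROUTER_TYPES and the tiny example floor are made up for the example.

```python
# type -> (S, cost), values taken from the question
ROUTER_TYPES = {1: (5, 180), 2: (9, 360), 3: (15, 480)}

def weighted_subset(router_type, position, floor):
    """Return (cost, cells) for a router of `router_type` placed at `position`.

    `cells` is the set of "1" cells inside the router's square.
    Assumption: walls do not block the signal in this sketch.
    """
    size, cost = ROUTER_TYPES[router_type]
    r, c = position
    nrows, ncols = len(floor), len(floor[0])
    covered = frozenset(
        (i, j)
        for i in range(max(0, r - size), min(nrows, r + size + 1))
        for j in range(max(0, c - size), min(ncols, c + size + 1))
        if floor[i][j] == 1
    )
    return cost, covered

floor = [[1, 1, 0],
         [1, -1, 1]]
cost, covered = weighted_subset(1, (0, 0), floor)
```

A wall-aware version would filter the cells with a visibility test such as the blocked helper in the question's fitness function.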
With all this formalism, finding the router types and positions that:
give coverage to all the desired locations
minimize the cost
is equivalent to finding a sub-collection of S:
whose subsets of U have a union equal to U
which minimizes the sum of the associated weights
This is the weighted minimum set cover problem.
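For reference, the classic greedy heuristic for weighted set cover repeatedly picks the subset with the lowest cost per newly covered element. The sketch below is one way to write it; the function name, the dictionary layout and the toy data are assumptions made for this example, and the result is an approximation, not a guaranteed optimum.

```python
def greedy_weighted_set_cover(universe, candidates):
    """Greedy approximation for weighted set cover.

    universe:   iterable of elements that must be covered
    candidates: dict mapping a name to (cost, set of covered elements)
    Returns the list of chosen names, in the order they were picked.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best_name, best_ratio = None, None
        for name, (cost, cells) in candidates.items():
            gain = len(cells & uncovered)
            if gain == 0:
                continue
            ratio = cost / gain  # cost per newly covered element
            if best_ratio is None or ratio < best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:
            raise ValueError("some elements cannot be covered by any candidate")
        chosen.append(best_name)
        uncovered -= candidates[best_name][1]
    return chosen

universe = {1, 2, 3, 4, 5, 6}
candidates = {
    "A": (180, {1, 2, 3}),
    "B": (360, {1, 2, 3, 4, 5}),
    "C": (180, {4, 5, 6}),
    "D": (480, {1, 2, 3, 4, 5, 6}),
}
chosen = greedy_weighted_set_cover(universe, candidates)
total_cost = sum(candidates[name][0] for name in chosen)
```

In the router setting, each candidate would be one (router type, position) pair with the cost and the set of cells it covers.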
I am trying to simulate a biological gene network by updating the probabilities of the genes at each time step in Python. The results for each time step are then plotted. I can do it by copying and pasting over and over again, but that stops being practical once the number of time steps grows beyond 10. Here is exactly what I have done so far.
Pgt1, Pkr1, Pkni1, Phb1 = test.simUpdateFun(PgtI, PkrI, PkniI, PhbI)
Pgt2, Pkr2, Pkni2, Phb2 = test.simUpdateFun(Pgt1, Pkr1, Pkni1, Phb1)
Pgt3, Pkr3, Pkni3, Phb3 = test.simUpdateFun(Pgt2, Pkr2, Pkni2, Phb2)
Pgt4, Pkr4, Pkni4, Phb4 = test.simUpdateFun(Pgt3, Pkr3, Pkni3, Phb3)
Pgt5, Pkr5, Pkni5, Phb5 = test.simUpdateFun(Pgt4, Pkr4, Pkni4, Phb4)
data.dataPlot(PgtI, PkrI, PkniI, PhbI)
data.dataPlot(Pgt1, Pkr1, Pkni1, Phb1)
data.dataPlot(Pgt2, Pkr2, Pkni2, Phb2)
data.dataPlot(Pgt3, Pkr3, Pkni3, Phb3)
data.dataPlot(Pgt4, Pkr4, Pkni4, Phb4)
data.dataPlot(Pgt5, Pkr5, Pkni5, Phb5)
simUpdateFun is a function I wrote within a class to implement the genes' interactions and update the probabilities. Each variable is a list that contains around 20 data points.
At first I thought about writing a recursive function for the update. Unfortunately, my knowledge of Python is rather limited (self-taught during a summer), and the only recursive functions I know are simple cases such as the factorial function and the Fibonacci sequence. The biggest problem for me is that I can't work out how to write a loop, or even a recursive function, given that the inputs and outputs of simUpdateFun are lists.
The simUpdateFun is as follows:
def simUpdateFun(self, PgtPre, PkrPre, PkniPre, PhbPre):
    """A function to update the gap gene probabilities based on the mutual repression relationship from the previous ones"""
    PgtS = []
    PkrS = []
    PkniS = []
    PhbS = []
    #PgtS = list(i for i in PgtPre)
    # implement the mutual strong repression interaction
    for i in range(self.xlen):
        pass # the loop body that fills PgtS, PkrS, PkniS and PhbS is not shown in the post
    # implementation of the weak overlapping repression interaction
    PgtR = []
    PkrR = []
    PkniR = []
    PhbR = []
    x = len(PgtPre)
    P1 = 0
    P2 = int(x/4 )
    P3 = int(x * 2/4 )
    P4 = int(x * 3/4 )
    P5 = int(x )
    # first calculate the repressor probability function for the gap genes
    for i in PgtPre:
        PgtR.append(self.repressor(i, 1))
    for i in PkrPre:
        PkrR.append(self.repressor(i, 1))
    for i in PkniPre:
        PkniR.append(self.repressor(i, 1))
    for i in PhbPre:
        PhbR.append(self.repressor(i, 1))
    # implement the interactions of weak repression of overlapping genes
    # Try to avoid altering the initial condition values by copying them into new list objects.
    PgtW = list(i for i in PgtPre) # using generator expression simplified the codes and also minimize # of lines
    PkrW = list(i for i in PkrPre)
    PkniW = list(i for i in PkniPre)
    PhbW= list(i for i in PhbPre)
    for i in range(P4, P5): # quadrant 4 regulation
        PgtW[i] = PgtPre[i] * PhbR[i]
    for i in range(P3, P4): # quadrant 3 regulation
        PkniW[i] = PkniPre[i] * PgtR[i]
    for i in range(P2, P3): # quadrant 2 regulation
        PkrW[i] = PkrPre[i] * PkniR[i]
    for i in range(P1, P2): # quadrant 1 regulation
        PhbW[i] = PhbPre[i] * PkrR[i]
        PkrW[i] = PkrPre[i] * PhbR[i]
    # determine the final probabilities of the two effects by multiplying, since they take place simultaneously.
    Pgtf = []
    Pkrf = []
    Pknif = []
    Phbf = []
    for i in range(len(PgtPre)):
        Pgtf.append(PgtS[i] * PgtW[i])
        Pkrf.append(PkrS[i] * PkrW[i])
        Pknif.append(PkniS[i] * PkniW[i])
        Phbf.append(PhbS[i] * PhbW[i])
    return(Pgtf, Pkrf, Pknif, Phbf)
The function basically takes in a set of lists of probability values and outputs the updated versions of those lists.
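Because the function returns its four lists as one tuple, a plain loop that feeds each result back in is enough; no recursion is required. Here is a minimal sketch: run_simulation and toy_update are hypothetical names, and toy_update is only a stand-in for test.simUpdateFun so that the example runs on its own.

```python
def run_simulation(update, initial_state, n_steps):
    """Apply `update` repeatedly and keep every intermediate state.

    `initial_state` is a tuple of lists; `update` takes those lists as
    separate arguments and returns a tuple of lists of the same shape.
    """
    states = [initial_state]
    for _ in range(n_steps):
        # Unpack the most recent state into the update function's arguments
        states.append(update(*states[-1]))
    return states

# Hypothetical stand-in for test.simUpdateFun: halves every probability
def toy_update(Pgt, Pkr, Pkni, Phb):
    return tuple([p * 0.5 for p in probs] for probs in (Pgt, Pkr, Pkni, Phb))

initial = ([1.0, 0.8], [0.6, 0.4], [0.2, 0.1], [0.5, 0.5])
states = run_simulation(toy_update, initial, 5)
```

With the real objects this would read states = run_simulation(test.simUpdateFun, (PgtI, PkrI, PkniI, PhbI), 5), followed by a loop such as for state in states: data.dataPlot(*state). Changing the number of time steps then means changing a single argument.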