I am trying to implement the Brown clustering algorithm in Python.
My data structure is clusters = List[List].
At any given time the outer list has a length of at most 40 or 41.
Each inner list contains English words such as 'the', 'hello', etc.
I have 8000 words in total (my vocabulary), and initially the first 40 words are put into clusters.
I iterate over my vocabulary from 41 to 8000:
# do some computation; this takes very little time
# merge 2 items in the list and delete one item from the list
# e.g. if c1 and c2 are indices of items in clusters, then:
for i in range(41, 8000):
    clusters.append(vocabulary[i])
    c1 = computation_1()  # cheap
    c2 = computation_2()  # cheap
    clusters[c1] = clusters[c1] + clusters[c2]
    del clusters[c2]
But the time taken by the line clusters[c1] = clusters[c1] + clusters[c2] grows gradually as I iterate over my vocabulary. Initially, for items 41-50, it is about 1 second, but for every further 20 items in the vocabulary the time grows by about 1 second.
When I comment out just clusters[c1] = clusters[c1] + clusters[c2] in my code, I observe that all iterations take constant time. I am not sure how I can speed up this process.
for i in range(41, 8000):
    clusters.append(vocabulary[i])
    c1 = computation_1()
    c2 = computation_2()
    # clusters[c1] = clusters[c1] + clusters[c2]
    del clusters[c2]
I am new to Stack Overflow, so please excuse any incorrect formatting here.
Thanks
The problem you're running into is that list concatenation is a linear time operation. Thus, your entire loop is O(n^2) (and that's prohibitively slow for n much larger than 1000). This is ignoring how copying such large lists can be bad for cache performance, etc.
Disjoint Set data structure
The solution I recommend is to use a disjoint-set data structure. This is a tree-based data structure that "self-flattens" as you perform queries, resulting in very fast runtimes for "merging" clusters.
The basic idea is that each word starts off as its own "singleton" tree, and merging clusters consists of making the root of one tree the child of another. This repeats (with some care for balancing) until you have as many clusters as desired.
I've written an example implementation (GitHub link) that assumes elements of each set are numbers. As long as you have a mapping from vocabulary terms to integers, it should work just fine for your purposes. (Note: I've done some preliminary testing, but I wrote it in 5 minutes right now so I'd recommend checking my work. ;) )
To use it in your code, I would do something like the following:
clusters = DisjointSet(8000)
# ... some code to merge the first 40 words into clusters ...

for i in range(41, 8000):
    c1 = some_computation()  # assuming c1 is a number
    c2 = some_computation()  # assuming c2 is a number
    clusters.join(c1, c2)

# Now, if you want to determine if some word with number k is
# in the same cluster as a word with number j:
print("{} and {} are in the same cluster? {}".format(j, k, clusters.query(j, k)))
Regarding Sets vs Lists
While sets provide faster membership tests than lists, they actually have worse runtime when copying. This makes sense in theory, because a set object has to allocate and assign more memory than a list in order to maintain an appropriate load factor. Also, inserting that many items can trigger "rehashes" of the entire hash table, each of which is a linear-time operation.
However, practice is what we're concerned with here, so I ran a quick experiment to determine exactly how much worse off sets are than lists.
Code for performing this test, in case anyone is interested, is below. I'm using the Intel distribution of Python, so my timings may be slightly faster than on your machine.
import time
import random

import numpy as np
import matplotlib.pyplot as plt

data = []
for trial in range(5):
    trial_data = []
    for N in range(0, 20000, 50):
        l1 = random.sample(range(1000000), N)
        l2 = random.sample(range(1000000), N)
        s1 = set(l1)
        s2 = set(l2)

        # Time to concatenate two lists of length N
        # (time.perf_counter() is used instead of the removed time.clock())
        start_lst = time.perf_counter()
        l3 = l1 + l2
        stop_lst = time.perf_counter()

        # Time to union two sets of length N
        start_set = time.perf_counter()
        s3 = s1 | s2
        stop_set = time.perf_counter()

        trial_data.append([N, stop_lst - start_lst, stop_set - start_set])
    data.append(trial_data)

# average the trials and plot
data_array = np.array(data)
avg_data = np.average(data_array, 0)

fig = plt.figure()
ax = plt.gca()
ax.plot(avg_data[:, 0], avg_data[:, 1], label='Lists')
ax.plot(avg_data[:, 0], avg_data[:, 2], label='Sets')
ax.set_xlabel('Length of set or list (N)')
ax.set_ylabel('Seconds to union or concat (s)')
plt.legend(loc=2)
plt.show()
Related
I am looking for a time-efficient solution to the problem below that exploits the fact that I want to perform the same operation many times over. I have implemented two methods below and observe that one of them is significantly faster. I am wondering if there is a more efficient alternative to both.
Input: a matrix mat of dimension m*n populated with nonnegative integers (0 <= each integer <= b). Also given are p nonnegative integers q1, q2, ..., qp (each <= b) and p vectors v1, v2, ..., vp, where each vector vj contains d row indices of mat.
I am interested in cases where m, p, and d are large (~10^6), n is small (~10), and b is small (~100).
Output: for each pair (vj, qj), return the sub-list of the rows of mat among vj[0], vj[1], ..., vj[d-1] that contain the integer qj.
My approach: because p can be large, I preprocess mat to record which of the numbers between 0 and b each row contains. Then I go through the vectors vj to determine whether the rows of mat indexed by their entries contain qj. I tried two different layouts for storing whether each row of mat contains a given integer between 0 and b. To my surprise, I found that Method 1 performs significantly faster than Method 2.
Question: I am wondering if there is a better (practical) way to preprocess mat so that the operations for each pair (vj,qj) are as fast as possible.
Edit: Defining a tmp variable as tmp = isPresent[qs[j]] and iterating through the elements of tmp yielded a faster solution, but I'm hoping I can do something even faster.
Note: Ordering of elements in result is not important.
# Python code
import random
import numpy
import time

m = 1000000  # number of rows of mat
n = 10       # number of columns of mat
b = 255      # upper bound on entries of mat
d = 10000    # dimension of vec (containing row indices of mat)
p = 100      # number of vecs

# random specification of mat
# mat, vec, and q will be inputs from another part of the project
mat = []
for i in range(m):
    tmp = (numpy.random.permutation(b+1))[0:n]
    mat.append(tmp)

# random specification of vec and q
vecs = []
qs = []
for i in range(p):
    qs.append(random.randrange(0, b+1, 1))
    vecs.append((numpy.random.permutation(m))[0:d])

# METHOD 1
# store the rows where each integer occurs
# not too worried about time taken by this step
isPresent = [[False]*m for i in range(b+1)]
for i in range(m):
    for j in mat[i]:
        isPresent[j][i] = True

# mainly care about reducing time from hereon
time1 = 0.0
for j in range(p):
    st1 = time.time()
    result1 = []
    for i in vecs[j]:
        if isPresent[qs[j]][i]:
            result1.append(i)
    time1 += time.time() - st1

# METHOD 2
# store the rows where each integer occurs
# not too worried about time taken by this step
isPresent = [[False]*(b+1) for i in range(m)]
for i in range(m):
    for j in mat[i]:
        isPresent[i][j] = True

# mainly care about reducing time from hereon
time2 = 0.0
for j in range(p):
    st2 = time.time()
    result2 = []
    for i in vecs[j]:
        if isPresent[i][qs[j]]:
            result2.append(i)
    time2 += time.time() - st2

print('time1: ', time1, ' time2: ', time2)
Note: I observe time1 = 0.46 seconds and time2 = 0.69 seconds on my laptop
TL;DR: Yes, there is a much better way to compute that using numpy. However, please note that there is a 2D random memory indirection pattern which is generally slow and known to be difficult to optimize.
Useful information:
Random memory accesses are slow. Indeed, it is difficult for the processor to predict which memory addresses will be fetched, and thus to hide the memory latency. This is not too bad as long as the data fit in the caches and are reused several times. Random memory accesses over a huge memory area are much slower and should be avoided like the plague (when possible).
Analysis:
Both methods perform a random memory indirection when executing the expressions isPresent[qs[j]][i] and isPresent[i][qs[j]].
Such indirections are slow. Method 2 is slower because the average distance between fetched addresses tends to be much bigger than in method 1, causing an effect called cache thrashing.
Faster solution: NumPy can be used to strongly increase the performance of the first method (thanks to "vectorized" native routines).
Indeed, the original method uses plain Python loops, which are generally very slow, and recomputes isPresent[qs[j]] several times.
Here is the faster implementation:
# Assume vecs is a list of np.array rather than a list of lists
isPresent = [numpy.array([False]*m) for i in range(b+1)]
for i in range(m):
    for j in mat[i]:
        isPresent[j][i] = True

time3 = 0.0
for j in range(p):
    st3 = time.time()
    tmp = isPresent[qs[j]]
    result3 = numpy.extract(tmp[vecs[j]], vecs[j])
    time3 += time.time() - st3
Performance results:
time1: 0.165357
time2: 0.309095
time3: 0.007201
The new version is 23 times faster than the first method and 43 times faster than the second.
Note that one can do this significantly faster by computing the j-loop in parallel, but this is a bit more complex.
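As a hedged sketch of that parallel variant (not part of the original answer), the p queries are independent, so they can be farmed out with concurrent.futures. Whether threads actually help depends on how much of the NumPy work releases the GIL; a process pool avoids the GIL but has to copy isPresent to the workers.

# Hedged sketch: run the independent (vecs[j], qs[j]) queries in a thread pool.
from concurrent.futures import ThreadPoolExecutor
import numpy

def query(j):
    tmp = isPresent[qs[j]]                       # boolean mask for the value qs[j]
    return numpy.extract(tmp[vecs[j]], vecs[j])  # rows of vecs[j] containing qs[j]

with ThreadPoolExecutor() as pool:
    results = list(pool.map(query, range(p)))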
I'm trying to solve a problem involving splitting a set into subsets.
The input data is a list and an integer.
The task is to divide a set into N-element subsets whose sums of elements are almost equal. As this is an NP-hard problem, I have tried two approaches:
a) iterate over all possibilities and distribute the work with mpi4py to many machines (for a list of more than 100 elements and 20-element subsets this takes too long);
b) use mpi4py to send the list to different nodes with different random seeds, but then I potentially evaluate the same subsets many times. For instance, for 100 numbers split into 5 subsets of 20 elements each, within 60 s my result could easily be beaten by a human simply looking at the table.
Finally, I'm looking for a heuristic algorithm that can run on a distributed system and creates N-element subsets of a bigger set whose sums are almost equal.
a = list(range(1, 13))
k = 3
One possible solution:
[1,2,11,12] [3,4,9,10] [5,6,7,8]
because the sums are 26, 26, 26.
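A quick check of that example (assuming a is the integers 1 through 12 and k = 3, as above):

a = list(range(1, 13))                           # the numbers 1..12
subsets = [[1, 2, 11, 12], [3, 4, 9, 10], [5, 6, 7, 8]]
print([sum(s) for s in subsets])                 # [26, 26, 26]
print(sum(a) // 3)                               # 26, the target sum per subset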
It is not always possible to make the sums or the number of elements exactly equal. The difference between the maximum and minimum number of elements in the sets can be 0 (if len(a)/k is an integer) or 1.
edit 1:
I am investigating two options: 1. The parent generates all combinations and then sends them to the parallel algorithm (but this is too slow for me). 2. The parent sends the list, and each node generates its own subsets and calculates the subset sums within a restricted time, then sends its best result to the parent. The parent receives these results and chooses the one that minimizes the difference between the subset sums. I think the second option has the potential to be faster.
Best regards,
Szczepan
I think you're trying to do something more complicated than necessary - do you actually need an exact solution (global optimum)? Regarding the heuristic solution, I had to do something along these lines in the past so here's my take on it:
Reformulate the problem as follows: You have a vector with given mean ('global mean') and you want to break it into chunks such that means of each individual chunk will be as close as possible to the 'global mean'.
Just divide it into chunks randomly and then iteratively swap elements between the chunks until you get acceptable results. You can experiment with different ways of doing this; here I'm just reshuffling the elements of the two chunks with the minimum and maximum 'chunk mean'.
In general, the bigger the chunk is, the easier it becomes, because the first random split would already give you not-so-bad solution (think sample means).
How big is your input list? I tested this with a 100000-element input (uniformly distributed integers). With 50 chunks of 2000 elements you get the result instantly; with 2000 chunks of 50 elements you need to wait <1 min.
import numpy as np

my_numbers = np.random.randint(10000, size=100000)
chunks = 50
iter_limit = 10000
desired_mean = my_numbers.mean()
acceptable_range = 0.1

split = np.array_split(my_numbers, chunks)

for i in range(iter_limit):
    split_means = np.array([array.mean() for array in split])  # this can be optimized, some of the means are known
    current_min = split_means.min()
    current_max = split_means.max()
    mean_diff = split_means.ptp()
    if i % 100 == 0 or mean_diff <= acceptable_range:
        print("Iter: {}, Desired: {}, Min {}, Max {}, Range {}".format(i, desired_mean, current_min, current_max, mean_diff))
    if mean_diff <= acceptable_range:
        print('Acceptable solution found')
        break
    min_index = split_means.argmin()
    max_index = split_means.argmax()
    # pop the higher index first so the lower index stays valid
    if max_index < min_index:
        merged = np.hstack((split.pop(min_index), split.pop(max_index)))
    else:
        merged = np.hstack((split.pop(max_index), split.pop(min_index)))
    reshuffle_range = mean_diff + 1
    while reshuffle_range > mean_diff:
        # this while just ensures that you're not getting a worse split, either the same or better
        np.random.shuffle(merged)
        modified_arrays = np.array_split(merged, 2)
        reshuffle_range = np.array([array.mean() for array in modified_arrays]).ptp()
    split += modified_arrays
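A short usage note (my addition, not part of the answer above): once the loop finishes, the quality of the split can be inspected directly from split, for example:

final_means = np.array([chunk.mean() for chunk in split])
print("chunk means:", final_means)
print("spread:", np.ptp(final_means), "target mean:", desired_mean)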
I'm trying to do some calculations on over 1000 arrays of shape (100, 100, 1000). But as I might have expected, it doesn't take more than about 150-200 arrays before my memory is used up and it all fails (at least with my current code).
This is what I currently have now:
import numpy as np
toxicity_data_path = open("data/toxicity.txt", "r")
toxicity_data = np.array(toxicity_data_path.read().split("\n"), dtype=int)
patients = range(1, 1000, 1)
The above is just a list of 1's and 0's (indicating toxicity or not) for each array (in this case one array is data for one patient). So in this case roughly 1000 patients.
I then create two lists from the above code so I have one list with patients having toxicity and one where they have not.
# note: astype(str) is used instead of the removed np.str alias
patients_no_tox = [i for i, e in enumerate(toxicity_data.astype(str)) if e in set("0")]
patients_with_tox = [i for i, e in enumerate(toxicity_data.astype(str)) if e in set("1")]
I then write the following function, which takes each patient's already saved-to-disk (100, 100, 1000) array and removes some indices (also loaded from a saved file) that will not work later on or simply need to be removed, so this step is essential. The result is a final list of all patients and their flattened arrays of data. This is where things start to eat memory, when the function is used in the list comprehension.
def log_likely_list(patient, remove_index_list):
    array_data = np.load("data/{}/array.npy".format(patient)).ravel()
    return np.delete(array_data, remove_index_list)

remove_index_list = np.load("data/remove_index_list.npy")
final_list = [log_likely_list(patient, remove_index_list) for patient in patients]
The next step is to create the two lists I need for my calculations: from the final list with all patients, I remove either the patients with toxicity or those without, respectively.
patients_no_tox_list = np.column_stack(np.delete(final_list, patients_with_tox, 0))
patients_with_tox_list = np.column_stack(np.delete(final_list, patients_no_tox, 0))
The last piece of the puzzle is to use these two lists in the following equation, with the non-tox list on the right side of the equation and the tox list on the left. For each individual index of the flattened arrays (i.e. the same index in each patient's array) it sums over all patients, so I end up with one large array of values.
log_likely = np.sum(np.log(patients_with_tox_list), axis=1) + \
             np.sum(np.log(1 - patients_no_tox_list), axis=1)
My problem, as stated, is that when I get to around 150-200 patients (in the patients range), my memory is used up and it shuts down.
I have obviously tried saving things to disk to load later (that's why I load so many files), but that didn't help me much. I'm thinking maybe I could go one array at a time into the log_likely function, but in the end, before summing, I would probably have just as large an array; plus, the computation might be a lot slower if I can't use numpy's sum feature and such.
So is there any way I could optimize/improve on this, or is the only way to buy a hell of a lot more RAM?
Each time you use a list comprehension, you build the whole result list in memory at once. So this line:
final_list = [log_likely_list(patient, remove_index_list) for patient in patients]
contains the complete data for all 1000 patients!
The better choice is to use generator expressions, which process items one at a time. To form a generator, surround your for ... in ... expression with parentheses instead of brackets. This might look something like:
import itertools
import functools
import numpy as np

with_tox_data = (log_likely_list(patient, remove_index_list) for patient in patients_with_tox)
with_tox_log = (np.log(data) for data in with_tox_data)  # np.log is elementwise, so no axis argument is needed

no_tox_data = (log_likely_list(patient, remove_index_list) for patient in patients_no_tox)
no_tox_log = (np.log(1 - data) for data in no_tox_data)

final_data = itertools.chain(with_tox_log, no_tox_log)
Note that no computations have actually been performed yet: generators don't do anything until you iterate over them. The fastest way to aggregate all the results in this case is to use reduce:
log_likely = functools.reduce(np.add, final_data)
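If functools.reduce feels opaque, an equivalent explicit accumulation (a sketch of the same idea, used in place of the reduce call above, processing one patient-sized array at a time) is:

log_likely = None
for arr in final_data:
    log_likely = arr if log_likely is None else log_likely + arr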
I have a list of start-end positions with ~280,000 elements, covering about 73,000,000 positions in total.
For performance reasons I have already split them into parts in a dictionary (keyed by a subsetting factor), where each part contains a list of (start, end) tuples.
Finally, I get a list of positions which I want to test for whether they fall within the regions spanned by start and end.
posit = (start, end)
dict[subset].append(posit)

for posit in dict[subset]:
    if posit[0] < varpos < posit[1]:
        # do some stuff here
Currently those lookups take a long time. But due to memory considerations I also don't want to build a faster set containing all positions between start and stop.
Do you have any pointers how to create a fast start,end position data structure or a better look up strategy?
My assumptions are that the ranges do not overlap and that the 280,000 range objects do not change on a regular basis. My first instinct is to use a sorted list of lists instead of the dictionary of lists. Then I would take the list of positions and pass each one into a findRange method.
To test my implementation I generated a sorted list of 280,000 lists and then passed 1000 random 'possiblePositionMatches' into findRange for matching.
This implementation took 7.260579 seconds for 100 'possiblePositionMatches' and 71.96268 seconds for 1000 'possiblePositionMatches'.
import random
import time

values = list()
for a in range(0, 73000000, 250):
    values.append([a, a + 200])

possiblePositionMatches = list()
count = 1000
while count:
    count = count - 1
    possiblePositionMatches.append(random.randint(0, 73000000))

matches = []

def findRange(value):
    for x in range(len(values)):
        if (value >= values[x][0]) and (value < values[x][1]):
            matches.append([value, values[x]])

def main():
    t1 = time.process_time()
    for y in possiblePositionMatches:
        findRange(y)
    print(matches)
    t2 = time.process_time() - t1
    print("Total Time: {0} seconds".format(t2))

main()
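Since the ranges are sorted and non-overlapping, a further improvement (a hedged sketch, not part of the implementation above) is to binary-search the start positions with the bisect module, which turns each lookup into O(log n) instead of a full scan:

# Hedged sketch: binary search over the sorted, non-overlapping ranges in `values`.
import bisect

starts = [v[0] for v in values]  # start positions, already sorted

def find_range_bisect(value):
    idx = bisect.bisect_right(starts, value) - 1  # last range starting at or before value
    if idx >= 0 and values[idx][0] <= value < values[idx][1]:
        return values[idx]
    return None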
I have been fighting against a function giving me a memory error, and thanks to your support (Python: how to split and return a list from a function to avoid memory error) I managed to sort out the issue; however, since I am not a professional programmer, I would like to ask for your opinion on my method and how to improve its performance (if possible).
The function is a generator function returning all cycles from an n-node digraph. However, for a 12-node digraph there are about 115 million cycles (each defined as a list of nodes, e.g. [0,1,2,0] is a cycle). I need all the cycles available for further processing even after I have extracted some of their properties when they were first generated, so they need to be stored somewhere. So the idea is to cut the result array every 10 million cycles to avoid a memory error (when an array is too big, Python runs out of RAM) and create a new array to store the following results. For the 12-node digraph I would then have 12 result arrays: 11 full ones (containing 10 million cycles each) and the last one containing 5 million cycles.
However, splitting the result array is not enough since the variables stay in RAM. So, I still need to write each one to the disk and delete it afterwards to clear the RAM.
As stated in 'How do I create a variable number of variables?', using exec to create variable names dynamically is not very "clean", and dictionary solutions are better. However, in my case, if I store the results in a single dictionary, it will run out of memory due to the size of the arrays. Hence, I went for the exec way. I would be grateful if you could comment on that decision.
Also, to store the arrays I use numpy.savez_compressed, which gives me a 43 MB file for each 10-million-cycle array. Without compression it creates a 500 MB file, but using the compressed version slows down the writing process. Any ideas on how to speed up the writing and/or compression process?
A simplified version of the code I wrote is as follows:
nbr_result_arrays = 0
result_array_0 = []
result_length = 10000000
tmp = result_array_0  # I use tmp to avoid using exec within the for loop (exec slows down code execution)

for cycle in generator:
    tmp.append(cycle)
    if len(tmp) == result_length:
        exec('np.savez_compressed(\'results_' + str(nbr_result_arrays) + '\', tmp)')
        exec('del result_array_' + str(nbr_result_arrays))
        nbr_result_arrays += 1
        exec('result_array_' + str(nbr_result_arrays) + ' = []')
        exec('tmp = result_array_' + str(nbr_result_arrays))
Thanks for reading,
Aleix
How about using itertools.islice?
import itertools
import numpy as np

for i in itertools.count():
    tmp = list(itertools.islice(generator, 10000000))
    if not tmp:
        break
    np.savez_compressed('results_{}'.format(i), tmp)
    del tmp
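A possible follow-up (my assumption about the file layout, based on the names written above): the chunks can later be streamed back one at a time, so the full result never has to sit in memory at once.

# Hypothetical read-back of the chunks written as 'results_{i}.npz' above.
import numpy as np

i = 0
while True:
    try:
        # allow_pickle is needed because variable-length cycle lists are stored as an object array
        chunk = np.load('results_{}.npz'.format(i), allow_pickle=True)['arr_0']
    except FileNotFoundError:
        break
    # ... process chunk here ...
    i += 1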
Thanks to all for your suggestions.
As suggested by #Aya, I believe that to improve performance (and avoid possible space issues) I should avoid storing the results on the HD, because storing them takes about half as long as creating them, so loading and processing them again would come very close to creating the results again. Additionally, if I do not store any results, I save space, which can become a big issue for bigger digraphs (a 12-node complete digraph has about 115 million cycles, but a 29-node one has about 848E27 cycles... increasing at a factorial rate).
The idea is that I first need to go through all cycles passing through the weakest arc to find the total probability of all cycles going through it. Then, with this total probability, I must go through all those cycles again to subtract them from the original array according to their weighted probability (I needed the total probability to be able to calculate the weighted probability: weighted_prob = prob_of_this_cycle / total_prob_through_this_edge).
Thus, I believe that this is the best approach to do that (but I am open to more discussions! :) ).
However, I have a doubt regarding the processing speed of two sub-functions:
1st: finding whether a sequence contains a specific (smaller) sequence. I am doing that with the function "contains_sequence", which relies on the generator function "window" (as suggested in "Is there a Python builtin for determining if an iterable contained a certain sequence?"). However, I have been told that doing it with a deque could be up to 33% faster (a sketch follows after this list). Any other ideas?
2nd: I am currently finding the probability of a cycle by sliding through the cycle's nodes (the cycle is represented by a list) to find, at each arc, the probability of staying within the cycle at that node's output, and then multiplying them all to get the cycle probability (the function is named find_cycle_probability; a vectorized sketch follows it in the code below). Any performance suggestions on this function would be appreciated, since I need to run it for each cycle, i.e. countless times.
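Here is a hedged sketch of the deque-based window mentioned in the 1st point (my rewrite, not the original code): collections.deque with maxlen avoids rebuilding a list at every step. Note that it yields tuples, so the arc is converted to a tuple before comparing.

from collections import deque
from itertools import islice

def window_deque(seq, n=2):
    it = iter(seq)
    win = deque(islice(it, n), maxlen=n)
    if len(win) == n:
        yield tuple(win)
    for elem in it:
        win.append(elem)   # the oldest element drops off automatically
        yield tuple(win)

def contains_sequence(all_values, seq):
    target = tuple(seq)
    return any(target == w for w in window_deque(all_values, len(seq)))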
Any other tips/suggestion/comments will be most welcome! And thanks again for your help.
Aleix
Below follows the simplified code:
import itertools
import sys

import numpy
import networkx

import new_cycles  # my own module providing simple_cycles_generator

def simple_cycles_generator_w_filters(working_array_digraph, arc):
    '''Generator function generating all cycles containing a specific arc.'''
    generator = new_cycles.simple_cycles_generator(working_array_digraph)
    for cycle in generator:
        if contains_sequence(cycle, arc):
            yield cycle
    return
def find_smallest_arc_with_cycle(working_array, working_array_digraph):
    '''Find the smallest arc through which at least one cycle flows.
    Returns:
    - if such an arc exists:
        smallest_arc_with_cycle = [a, b] where a is the start of the arc and b the end
        smallest_arc_with_cycle_value = x where x is the weight of the arc
    - if such an arc does not exist:
        smallest_arc_with_cycle = []
        smallest_arc_with_cycle_value = 0 '''
    smallest_arc_with_cycle = []
    smallest_arc_with_cycle_value = 0
    sparse_array = []
    for i in range(numpy.shape(working_array)[0]):
        for j in range(numpy.shape(working_array)[1]):
            if working_array[i][j] != 0:
                sparse_array.append([i, j, working_array[i][j]])
    sorted_array = sorted(sparse_array, key=lambda x: x[2])
    for i in range(len(sorted_array)):
        smallest_arc = [sorted_array[i][0], sorted_array[i][1]]
        generator = simple_cycles_generator_w_filters(working_array_digraph, smallest_arc)
        if any(generator):
            smallest_arc_with_cycle = smallest_arc
            smallest_arc_with_cycle_value = sorted_array[i][2]
            break
    return smallest_arc_with_cycle, smallest_arc_with_cycle_value
def window(seq, n=2):
    """Returns a sliding window (of width n) over data from the iterable
    s -> (s0, s1, ..., s[n-1]), (s1, s2, ..., sn), ..."""
    it = iter(seq)
    result = list(itertools.islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + [elem]
        yield result

def contains_sequence(all_values, seq):
    return any(seq == current_seq for current_seq in window(all_values, len(seq)))
def find_cycle_probability(cycle, working_array, total_outputs):
    '''Finds the cycle probability of a given cycle within a given array'''
    output_prob_of_each_arc = []
    for i in range(len(cycle)-1):
        weight_of_the_arc = working_array[cycle[i]][cycle[i+1]]
        output_probability_of_the_arc = float(weight_of_the_arc) / float(total_outputs[cycle[i]])  # NOTE: total_outputs is an array, hence the float
        output_prob_of_each_arc.append(output_probability_of_the_arc)
    circuit_probabilities_of_the_cycle = numpy.prod(output_prob_of_each_arc)
    return circuit_probabilities_of_the_cycle
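# Hedged alternative (my addition, not part of the original code): the same cycle
# probability can be computed without the explicit Python loop by fancy-indexing
# all arcs of the cycle at once.
def find_cycle_probability_vectorized(cycle, working_array, total_outputs):
    nodes = numpy.asarray(cycle)
    arc_weights = working_array[nodes[:-1], nodes[1:]]         # weight of each arc in the cycle
    return numpy.prod(arc_weights / total_outputs[nodes[:-1]])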
def clean_negligible_values(working_array):
    '''Cleans the array by rounding negligible values to 0 according to a
    pre-defined threshold.'''
    zero_threshold = 0.000001
    for i in range(numpy.shape(working_array)[0]):
        for j in range(numpy.shape(working_array)[1]):
            if working_array[i][j] == 0:
                continue
            elif 0 < working_array[i][j] < zero_threshold:
                working_array[i][j] = 0
            elif -zero_threshold <= working_array[i][j] < 0:
                working_array[i][j] = 0
            elif working_array[i][j] < -zero_threshold:
                sys.exit('Error')
    return working_array
original_array = 1000 * numpy.random.random_sample((5, 5))
total_outputs = numpy.sum(original_array, axis=0) + 100 * numpy.random.random_sample(5)
working_array = original_array.__copy__()
straight_array = working_array.__copy__()
cycle_array = numpy.zeros(numpy.shape(working_array))
iteration_counter = 0

working_array_digraph = networkx.DiGraph(working_array)

[smallest_arc_with_cycle, smallest_arc_with_cycle_value] = find_smallest_arc_with_cycle(working_array, working_array_digraph)

while smallest_arc_with_cycle:  # using the implicit truth value of a non-empty list
    cycle_flows_to_be_subtracted = numpy.zeros(numpy.shape(working_array))

    # FIRST run of the generator to calculate each cycle probability
    # note: the cycle generator ONLY provides the cycles going through
    # the specified weakest arc
    generator = simple_cycles_generator_w_filters(working_array_digraph, smallest_arc_with_cycle)
    nexus_total_probs = 0
    for cycle in generator:
        cycle_prob = find_cycle_probability(cycle, working_array, total_outputs)
        nexus_total_probs += cycle_prob

    # SECOND run of the generator
    # using the nexus_total_probs calculated before, I can allocate the weight of the
    # weakest arc to each cycle going through it
    generator = simple_cycles_generator_w_filters(working_array_digraph, smallest_arc_with_cycle)
    for cycle in generator:
        cycle_prob = find_cycle_probability(cycle, working_array, total_outputs)
        allocated_cycle_weight = cycle_prob / nexus_total_probs * smallest_arc_with_cycle_value
        # build the array to be subtracted
        for i in range(len(cycle)-1):
            cycle_flows_to_be_subtracted[cycle[i]][cycle[i+1]] += allocated_cycle_weight

    working_array = working_array - cycle_flows_to_be_subtracted
    clean_negligible_values(working_array)
    cycle_array = cycle_array + cycle_flows_to_be_subtracted
    straight_array = straight_array - cycle_flows_to_be_subtracted
    clean_negligible_values(straight_array)

    # find the next weakest arc with cycles
    working_array_digraph = networkx.DiGraph(working_array)
    [smallest_arc_with_cycle, smallest_arc_with_cycle_value] = find_smallest_arc_with_cycle(working_array, working_array_digraph)