I'm trying to generate block-bootstrap samples for a Monte Carlo simulation and need to build a large array of index values (integers) arranged in blocks, in Python. I need this to be very fast but cannot figure out how to vectorize it.
I want to generate a large number of paths, where each path contains a sequence of integers of length L. Suppose I have an array of integers (representing an index) from 0 to N, from which I will sample randomly to construct each path. When I sample, I choose a random integer i from 0 to N and then populate the path with i, i+1, i+2, ..., i+w for some window length w. I then choose another random starting index and continue to populate the path with the new window, repeating until the path is fully populated. I do this for all paths.
I'm wondering if there is a way to speed this method up without having to loop over each path, since I intend to generate a very large number of paths (millions).
An example of my for-loop method is below:
import random
import numpy as np

paths = 10000
path_length = 500
window_length = 5
index = np.arange(0, 5000)
simulated_values = np.zeros([paths, path_length])
n_windows = int(np.ceil(path_length / window_length))
for i in range(0, paths):
    temp = []
    for n in range(0, n_windows):
        random_start = random.randint(0, len(index) - path_length)
        temp.extend(range(random_start, random_start + window_length))
    simulated_values[i, :] = temp
print(simulated_values)
I found a solution in a Python package called recombinator. It seems to be fast enough, and there is GPU support for further speed-ups:
https://pypi.org/project/recombinator/
import numpy as np
from recombinator.block_bootstrap import circular_block_bootstrap

index = np.arange(0, 5000)
path_length = 500
window_length = 5
temp = circular_block_bootstrap(index, block_length=window_length,
                                replications=1000000, replace=True,
                                sub_sample_length=path_length)
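For comparison, a pure-NumPy vectorized sketch of the windowed sampling is also possible. This is my own illustration, not the recombinator implementation, and the bound used for the random starts (so a window never runs past the end of the index) is an assumption:
import numpy as np

paths = 10000
path_length = 500
window_length = 5
n_windows = path_length // window_length  # assumes path_length is a multiple of window_length
N = 5000

rng = np.random.default_rng()
# one random window start per window per path, drawn all at once
starts = rng.integers(0, N - window_length + 1, size=(paths, n_windows))
# broadcast the in-window offsets 0..window_length-1 against the starts
offsets = np.arange(window_length)
simulated_values = (starts[:, :, None] + offsets).reshape(paths, path_length)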
I am working with a very large range of values (0 to approx. 10^6128) and I need a way in Python to perform two-way lookups with a random permutation of the range.
Example with a smaller dataset:
import random

values = list(range(10))  # the actual range is too large to do this
random.shuffle(values)

def map_value(n):
    return values[n]

def unmap_value(n):
    return values.index(n)
I need a way to implement the map_value and unmap_value methods with values in the very large range above.
Creating a fixed permutation of 10**6128 values is costly, memory-wise.
You can create values from your range on the fly and store them in one / two dictionaries.
If you only draw comparatively few values, one dict might be enough; if you have lots of values, you might need two for faster lookups.
Essentially you:
- look up a value; if it is not present, generate an index, store it and return it
- look up an index; if it is not present, generate a value, store it and return it
Using a fixed random seed should lead to the same sequences:
import random

class big_range():

    random.seed(42)
    pos_value = {}
    value_pos = {}

    def map_value(self, n):
        p = big_range.value_pos.get(n)
        while p is None:
            p = random.randrange(10**6128)  # works, can't use random.choice(range(10**6128))
            if p in big_range.pos_value:
                p = None
            else:
                big_range.pos_value[p] = n
                big_range.value_pos[n] = p
        return p

    def unmap_value(self, n):
        p = big_range.pos_value.get(n)
        while p is None:
            p = random.randrange(10**6128)  # works, can't use random.choice(range(10**6128))
            if p in big_range.pos_value:
                p = None
            else:
                big_range.pos_value[n] = p
                big_range.value_pos[p] = n
        return p

br = big_range()

for i in range(10):
    print(br.map_value(i))

print(big_range.pos_value)
print(big_range.value_pos)
Output:
Gibberish: humongous numbers ... but it works.
You store each number twice (once as pos:number, once as number:pos) for lookup reasons. You might want to check how many numbers you can generate before you run out of memory.
You can use only one dict, but then the reverse lookup is not O(1) but O(n), because you need to traverse dict.items() to find the value and return the index.
The repeatability breaks if you do other random things in between, because you alter the state of random. You might need to do some more encapsulating, and maybe keep state inside your class using random.getstate() / random.setstate() to store the last state after each new random number is generated.
If you end up drawing most of your values, it will take longer and longer to produce one that is not already present if you simply keep generating numbers from 0 to 10**6128...
random.getstate()
random.setstate()
random.randrange()
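A minimal sketch of the state-keeping idea mentioned above, assuming one wants the class to own its own random state so that unrelated random calls elsewhere cannot break repeatability (the class and method names here are made up for illustration):
import random

class seeded_draw():
    def __init__(self, seed=42):
        random.seed(seed)
        self._state = random.getstate()   # remember this object's own RNG state

    def draw(self, upper=10**6128):
        random.setstate(self._state)      # restore our state before drawing
        value = random.randrange(upper)
        self._state = random.getstate()   # save the state right after the draw
        return value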
This is kind of fragile and more of a thought experiment - I have no clue what one would need a 10**6128 range of numbers for...
I'm trying to solve a problem of splitting a set into subsets.
The input data is a list and an integer.
The task is to divide a set into N-element subsets whose element sums are almost equal. As this is an NP-hard problem, I tried two approaches:
a) iterate over all possibilities and distribute them using mpi4py to many machines (for a list above 100 elements and 20-element subsets this takes too long)
b) use mpi4py to send the list to nodes with different seeds, but in this case I potentially calculate the same set many times. For instance, with 100 numbers and 5 subsets of 20 elements each, after 60 s my result could easily be beaten by a human simply looking at the table.
Finally, I'm looking for a heuristic algorithm that can run on a distributed system and creates N-element subsets, from a bigger set, whose sums are almost equal.
a = list(range(1, 13))
k = 3
One possible solution:
[1,2,11,12] [3,4,9,10] [5,6,7,8]
because the sums are 26, 26 and 26.
It is not always possible to make the sums or the numbers of elements exactly equal. The difference between the maximum and minimum number of elements in the sets can be 0 (if len(a)/k is an integer) or 1.
edit 1:
I am investigating two options: 1. The parent generates all iterations and then sends them to the parallel algorithm (but this is too slow for me). 2. The parent sends the list and each node generates its own subsets and calculates the subset sums within a restricted time, then sends its best result back to the parent. The parent receives these results and picks the one that minimizes the difference between subset sums (a sketch of this option is shown below). I think the second option has the potential to be faster.
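A rough sketch of how option 2 could look with mpi4py (illustration only; the time limit, the per-rank seeding and the round-robin chunking are just assumptions):
from mpi4py import MPI
import random
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

k = 3                                  # number of subsets
a = list(range(1, 13)) if rank == 0 else None
a = comm.bcast(a, root=0)              # every node gets the full list

def spread(partition):
    """Difference between the largest and smallest subset sum."""
    sums = [sum(chunk) for chunk in partition]
    return max(sums) - min(sums)

random.seed(rank)                      # a different seed on every node
best, best_spread = None, float('inf')
deadline = time.time() + 5             # restricted search time in seconds
while time.time() < deadline:
    shuffled = a[:]
    random.shuffle(shuffled)
    candidate = [shuffled[i::k] for i in range(k)]   # subset sizes differ by at most 1
    s = spread(candidate)
    if s < best_spread:
        best, best_spread = candidate, s

# the parent (rank 0) collects every node's best partition and keeps the best one
results = comm.gather((best_spread, best), root=0)
if rank == 0:
    print(min(results)[1])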
Best regards,
Szczepan
I think you're trying to do something more complicated than necessary - do you actually need an exact solution (global optimum)? Regarding the heuristic solution, I had to do something along these lines in the past so here's my take on it:
Reformulate the problem as follows: You have a vector with given mean ('global mean') and you want to break it into chunks such that means of each individual chunk will be as close as possible to the 'global mean'.
Just divide it into chunks randomly and then iteratively swap elements between the chunks until you get an acceptable result. You can experiment with different ways of doing this; here I'm just reshuffling the elements of the chunks with the minimum and the maximum 'chunk mean'.
In general, the bigger the chunks are, the easier it becomes, because the first random split already gives you a not-so-bad solution (think sample means).
How big is your input list? I tested this with a 100000-element input (uniformly distributed integers). With 50 chunks of 2000 elements you get the result instantly; with 2000 chunks of 50 elements you need to wait <1 min.
import numpy as np

my_numbers = np.random.randint(10000, size=100000)
chunks = 50
iter_limit = 10000
desired_mean = my_numbers.mean()
acceptable_range = 0.1

split = np.array_split(my_numbers, chunks)

for i in range(iter_limit):
    split_means = np.array([array.mean() for array in split])  # this can be optimized, some of the means are known
    current_min = split_means.min()
    current_max = split_means.max()
    mean_diff = split_means.ptp()
    if i % 100 == 0 or mean_diff <= acceptable_range:
        print("Iter: {}, Desired: {}, Min {}, Max {}, Range {}".format(i, desired_mean, current_min, current_max, mean_diff))
    if mean_diff <= acceptable_range:
        print('Acceptable solution found')
        break
    min_index = split_means.argmin()
    max_index = split_means.argmax()
    if max_index < min_index:
        merged = np.hstack((split.pop(min_index), split.pop(max_index)))
    else:
        merged = np.hstack((split.pop(max_index), split.pop(min_index)))
    reshuffle_range = mean_diff + 1
    while reshuffle_range > mean_diff:
        # this while just ensures that you're not getting a worse split, either the same or better
        np.random.shuffle(merged)
        modified_arrays = np.array_split(merged, 2)
        reshuffle_range = np.array([array.mean() for array in modified_arrays]).ptp()
    split += modified_arrays
I have a number of data files, each containing a large number of data points.
After loading the file with numpy, I get a numpy array:
f=np.loadtxt("...-1.txt")
How do I randomly select a run of length x, keeping the order of the numbers unchanged?
For example:
f = [1,5,3,7,4,8]
if I wanted to select a random length of 3 data points, the output should be:
1,5,3, or
3,7,4, or
5,3,7, etc.
Pure logic will get you there.
For a list f and a max length x, the valid starting points of your random slices are limited to 0 through len(f)-x:
     0 1 2 3
f = [1,5,3,7,4,8]
So all valid starting points can be selected with random.randrange(len(f)-x+1) (where the +1 is because randrange works like range).
Store the random starting point in a variable start and slice your array with [start:start+x], or be creative and use another slice after the first:
result = f[random.randrange(len(f)-x+1):][:x]
Building on usr2564301's answer, you can take out only the elements you need in one go by indexing with a range, so you avoid building a potentially very large intermediate array:
start = random.randrange(len(f) - x + 1)
result = f[range(start, start + x)]
Using a range also avoids building a large index array when your length x becomes larger.
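Putting the two answers together, a short self-contained version might look like this (f is assumed to be a NumPy array, as returned by np.loadtxt):
import random
import numpy as np

f = np.array([1, 5, 3, 7, 4, 8])
x = 3

start = random.randrange(len(f) - x + 1)  # valid starts are 0 .. len(f) - x
print(f[start:start + x])                 # e.g. [3 7 4]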
I'm a newbie at Python and I'm trying to do something like binning the data of a NumPy array. I'm really struggling with it, though!
My array is a simulation of a simple particle-diffusion model, given the particles' probabilities of walking forward or backward. It can have an arbitrary number of particle species and an arbitrary total number of particles; that information is encoded in the key vector, which is composed of numbers ranging from 0 to nSpecies. Each of these numbers appears according to a proportion chosen by the user, and the size of the vector is chosen by the user as well.
import numpy as np

def walk(diff, key, progressProbability, recessProbability, nSpecies):
    """
    Returns an array with the positions of the particles weighted by their
    walk probabilities.
    """
    random = np.random.rand(len(key))
    forward = key.astype(float)
    backward = key.astype(float)
    for i in range(nSpecies):
        forward[key == i] = progressProbability[i]
        backward[key == i] = recessProbability[i]

    diff = np.add(diff, random < forward)
    diff = np.subtract(diff, random > 1 - backward)
    return diff
To add time to this simulation, I run the walk function above many times. The values in diff after running this function many times therefore represent how far each particle has gone.
def probability_diffusion(time, progressProbability, recessProbability,
                          changeProbability, key, nSpecies, nBins):
    populationSize = len(key)
    diff = np.zeros(populationSize, dtype=int)
    for t in range(time):
        diff = walk(diff, key, progressProbability, recessProbability, nSpecies)
    return diff
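For context, a hypothetical way to call these functions might look like the following; all parameter values here are made up for illustration:
import numpy as np

nSpecies = 2
key = np.random.choice(nSpecies, size=1000, p=[0.7, 0.3])  # 70% / 30% species mix
progressProbability = [0.6, 0.4]   # forward-step probability per species
recessProbability = [0.3, 0.5]     # backward-step probability per species

diff = probability_diffusion(time=500,
                             progressProbability=progressProbability,
                             recessProbability=recessProbability,
                             changeProbability=None, key=key,
                             nSpecies=nSpecies, nBins=381)
print(diff.shape)  # (1000,)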
My goal is to turn this diff array into an array of size 381 without losing the information coded in it. I thought about doing so by binning the data and averaging within each bin.
I've tried using the scipy binned_statistic function but I can't really wrap my head around how it works.
Any thoughts? Thank you.
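For what it's worth, a minimal sketch of how scipy.stats.binned_statistic could be applied here: bin the particles by their index into 381 bins and average diff inside each bin (binning over the particle index is my assumption; diff below is just a stand-in array):
import numpy as np
from scipy.stats import binned_statistic

diff = np.random.randint(-50, 50, size=10000)  # stand-in for the real diff array
nBins = 381

positions = np.arange(len(diff))               # bin particles by their index
bin_means, bin_edges, bin_numbers = binned_statistic(positions, diff,
                                                     statistic='mean',
                                                     bins=nBins)
print(bin_means.shape)  # (381,)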
I have been fighting a function that was giving me a memory error, and thanks to your support (Python: how to split and return a list from a function to avoid memory error) I managed to sort out the issue; however, since I am not a professional programmer, I would like to ask for your opinion on my method and on how to improve its performance (if possible).
The function is a generator function returning all cycles from an n-node digraph. However, for a 12-node digraph there are about 115 million cycles (each defined as a list of nodes, e.g. [0,1,2,0] is a cycle). I need all cycles available for further processing even after I have extracted some of their properties when they were first generated, so they need to be stored somewhere. The idea is therefore to cut the result array every 10 million cycles to avoid a memory error (when an array is too big, Python runs out of RAM) and create a new array to store the following results. For the 12-node digraph, I would then have 12 result arrays: 11 full ones (containing 10 million cycles each) and the last one containing 5 million cycles.
However, splitting the result array is not enough, since the variables stay in RAM. So I still need to write each one to disk and delete it afterwards to free the RAM.
As stated in How do I create a variable number of variables?, using 'exec' to create variable variable names is not very "clean", and dictionary solutions are better. However, in my case, if I store the results in a single dictionary, it will run out of memory due to the size of the arrays. Hence, I went the 'exec' way. I would be grateful if you could comment on that decision.
Also, to store the arrays I use numpy.savez_compressed, which gives me a 43 MB file for each 10-million-cycle array. Without compression it creates a 500 MB file, but using the compressed version slows down the writing process. Any idea how to speed up the writing and/or compressing process?
A simplified version of the code I wrote is as follows:
nbr_result_arrays = 0
result_array_0 = []
result_length = 10000000
tmp = result_array_0  # I use tmp to avoid using exec within the for loop (exec slows down code execution)
for cycle in generator:
    tmp.append(cycle)
    if len(tmp) == result_length:
        exec 'np.savez_compressed(\'results_' + str(nbr_result_arrays) + '\', tmp)'
        exec 'del result_array_' + str(nbr_result_arrays)
        nbr_result_arrays += 1
        exec 'result_array_' + str(nbr_result_arrays) + '=[]'
        exec 'tmp=result_array_' + str(nbr_result_arrays)
Thanks for reading,
Aleix
How about using itertools.islice?
import itertools
import numpy as np

for i in itertools.count():
    tmp = list(itertools.islice(generator, 10000000))
    if not tmp:
        break
    np.savez_compressed('results_{}'.format(i), tmp)
    del tmp
Thanks to all for your suggestions.
As suggested by #Aya, I believe that to improve performance (and possibly fix space issues) I should avoid storing the results on the HD, because storing them adds about half as much time again as creating the results, so loading and processing them again would get very close to the cost of creating the results from scratch. Additionally, if I do not store any results, I save space, which can become a big issue for bigger digraphs (a 12-node complete digraph has about 115 million cycles, but a 29-node one has about 848E27 cycles... and the count increases at a factorial rate).
The idea is that I first need to go through all cycles passing through the weakest arc to find the total probability of all cycles going through it. Then, with this total probability, I must go through all those cycles again to subtract them from the original array according to their weighted probability (I needed the total probability to be able to calculate the weighted probability: weighted_prob = prob_of_this_cycle / total_prob_through_this_edge).
Thus, I believe this is the best approach to do that (but I am open to more discussion! :) ).
However, I have doubts regarding the processing speed of two sub-functions:
1st: finding whether a sequence contains a specific (smaller) sequence. I am doing that with the function "contains_sequence", which relies on the generator function "window" (as suggested in Is there a Python builtin for determining if an iterable contained a certain sequence?). However, I have been told that doing it with a deque would be up to 33% faster (see the sketch after this list). Any other ideas?
2nd: I am currently finding the probability of a cycle by sliding through the cycle's nodes (the cycle is represented as a list), finding at each arc the probability of staying within the cycle, and then multiplying all of them to get the cycle probability (the function is find_cycle_probability; a possible vectorized variant is also sketched below). Any performance suggestions on this function would be appreciated, since I need to run it for each cycle, i.e. countless times.
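On the 1st point, a sketch of the deque-based variant that was suggested might look like this (my interpretation, not benchmarked):
import itertools
from collections import deque

def contains_sequence_deque(all_values, seq):
    seq = list(seq)
    n = len(seq)
    it = iter(all_values)
    win = deque(itertools.islice(it, n), maxlen=n)
    if list(win) == seq:
        return True
    for elem in it:
        win.append(elem)           # the leftmost element falls out automatically
        if list(win) == seq:
            return True
    return False
On the 2nd point, one option (again just a sketch, under the assumption that working_array and total_outputs are NumPy arrays) is to replace the Python loop in find_cycle_probability with fancy indexing and a single product:
import numpy

def find_cycle_probability_vec(cycle, working_array, total_outputs):
    nodes = numpy.asarray(cycle)
    starts, ends = nodes[:-1], nodes[1:]        # consecutive arcs of the cycle
    arc_weights = working_array[starts, ends]   # gather all arc weights at once
    return numpy.prod(arc_weights / total_outputs[starts])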
Any other tips/suggestion/comments will be most welcome! And thanks again for your help.
Aleix
Below follows the simplified code:
import itertools
import sys
import numpy
import networkx
import new_cycles  # the asker's own module providing simple_cycles_generator

def simple_cycles_generator_w_filters(working_array_digraph, arc):
    '''Generator function generating all cycles containing a specific arc.'''
    generator = new_cycles.simple_cycles_generator(working_array_digraph)
    for cycle in generator:
        if contains_sequence(cycle, arc):
            yield cycle
    return
def find_smallest_arc_with_cycle(working_array, working_array_digraph):
    '''Find the smallest arc through which at least one cycle flows.
    Returns:
    - if such an arc exists:
      smallest_arc_with_cycle = [a,b] where a is the start of the arc and b the end
      smallest_arc_with_cycle_value = x where x is the weight of the arc
    - if such an arc does not exist:
      smallest_arc_with_cycle = []
      smallest_arc_with_cycle_value = 0 '''
    smallest_arc_with_cycle = []
    smallest_arc_with_cycle_value = 0
    sparse_array = []
    for i in range(numpy.shape(working_array)[0]):
        for j in range(numpy.shape(working_array)[1]):
            if working_array[i][j] != 0:
                sparse_array.append([i, j, working_array[i][j]])
    sorted_array = sorted(sparse_array, key=lambda x: x[2])
    for i in range(len(sorted_array)):
        smallest_arc = [sorted_array[i][0], sorted_array[i][1]]
        generator = simple_cycles_generator_w_filters(working_array_digraph, smallest_arc)
        if any(generator):
            smallest_arc_with_cycle = smallest_arc
            smallest_arc_with_cycle_value = sorted_array[i][2]
            break
    return smallest_arc_with_cycle, smallest_arc_with_cycle_value
def window(seq, n=2):
    """Returns a sliding window (of width n) over data from the iterable:
    s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ... """
    it = iter(seq)
    result = list(itertools.islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + [elem]
        yield result
def contains_sequence(all_values, seq):
    return any(seq == current_seq for current_seq in window(all_values, len(seq)))
def find_cycle_probability(cycle, working_array, total_outputs):
    '''Finds the cycle probability of a given cycle within a given array'''
    output_prob_of_each_arc = []
    for i in range(len(cycle) - 1):
        weight_of_the_arc = working_array[cycle[i]][cycle[i + 1]]
        output_probability_of_the_arc = float(weight_of_the_arc) / float(total_outputs[cycle[i]])  # NOTE: total_outputs is an array, thus the float
        output_prob_of_each_arc.append(output_probability_of_the_arc)
    circuit_probabilities_of_the_cycle = numpy.prod(output_prob_of_each_arc)
    return circuit_probabilities_of_the_cycle
def clean_negligible_values(working_array):
    '''Cleans the array by rounding negligible values to 0 according to a
    pre-defined threshold.'''
    zero_threshold = 0.000001
    for i in range(numpy.shape(working_array)[0]):
        for j in range(numpy.shape(working_array)[1]):
            if working_array[i][j] == 0:
                continue
            elif 0 < working_array[i][j] < zero_threshold:
                working_array[i][j] = 0
            elif -zero_threshold <= working_array[i][j] < 0:
                working_array[i][j] = 0
            elif working_array[i][j] < -zero_threshold:
                sys.exit('Error')
    return working_array
original_array = 1000 * numpy.random.random_sample((5, 5))
total_outputs = numpy.sum(original_array, axis=0) + 100 * numpy.random.random_sample(5)
working_array = original_array.__copy__()
straight_array = working_array.__copy__()
cycle_array = numpy.zeros(numpy.shape(working_array))
iteration_counter = 0

working_array_digraph = networkx.DiGraph(working_array)

[smallest_arc_with_cycle, smallest_arc_with_cycle_value] = find_smallest_arc_with_cycle(working_array, working_array_digraph)

while smallest_arc_with_cycle:  # using the implicit truth value of a non-empty list
    cycle_flows_to_be_subtracted = numpy.zeros(numpy.shape((working_array)))

    # FIRST run of the generator to calculate each cycle probability
    # note: the cycle generator ONLY provides all cycles going through
    # the specified weakest arc
    generator = simple_cycles_generator_w_filters(working_array_digraph, smallest_arc_with_cycle)
    nexus_total_probs = 0
    for cycle in generator:
        cycle_prob = find_cycle_probability(cycle, working_array, total_outputs)
        nexus_total_probs += cycle_prob

    # SECOND run of the generator
    # using the nexus_prob_sum calculated before, I can allocate the weight of the
    # weakest arc to each cycle going through it
    generator = simple_cycles_generator_w_filters(working_array_digraph, smallest_arc_with_cycle)
    for cycle in generator:
        cycle_prob = find_cycle_probability(cycle, working_array, total_outputs)
        allocated_cycle_weight = cycle_prob / nexus_total_probs * smallest_arc_with_cycle_value
        # create the array to be subtracted
        for i in range(len(cycle) - 1):
            cycle_flows_to_be_subtracted[cycle[i]][cycle[i + 1]] += allocated_cycle_weight

    working_array = working_array - cycle_flows_to_be_subtracted
    clean_negligible_values(working_array)
    cycle_array = cycle_array + cycle_flows_to_be_subtracted
    straight_array = straight_array - cycle_flows_to_be_subtracted
    clean_negligible_values(straight_array)

    # find the next weakest arc with cycles.
    working_array_digraph = networkx.DiGraph(working_array)
    [smallest_arc_with_cycle, smallest_arc_with_cycle_value] = find_smallest_arc_with_cycle(working_array, working_array_digraph)