Index of element in random permutation for very large range - python

I am working with a very large range of values (0 to approx. 10^6128) and I need a way in Python to perform two-way lookups with a random permutation of the range.
Example with a smaller dataset:
import random

values = list(range(10))  # the actual range is too large to do this
random.shuffle(values)

def map_value(n):
    return values[n]

def unmap_value(n):
    return values.index(n)
I need a way to implement the map_value and unmap_value methods with values in the very large range above.

Creating a fixed permutation of 10**6128 values is prohibitively costly memory-wise.
You can create values from your range on the fly and store them in one or two dictionaries.
If you only draw comparatively few values, one dict might be enough; if you have lots of values, you may need two for faster lookup.
Essentially you
look up a value; if not present, generate an index, store it and return it
look up an index; if not present, generate a value, store it and return it
Using a fixed random seed should lead to the same sequences:
import random

class big_range():
    random.seed(42)
    pos_value = {}
    value_pos = {}

    def map_value(self, n):
        p = big_range.value_pos.get(n)
        while p is None:
            p = random.randrange(10**6128)  # works; can't use random.choice(range(10**6128))
            if p in big_range.pos_value:
                p = None  # collision: this position is already taken, draw again
            else:
                big_range.pos_value[p] = n
                big_range.value_pos[n] = p
        return p

    def unmap_value(self, n):
        p = big_range.pos_value.get(n)
        while p is None:
            p = random.randrange(10**6128)  # works; can't use random.choice(range(10**6128))
            if p in big_range.value_pos:
                p = None  # collision: this value is already mapped, draw again
            else:
                big_range.pos_value[n] = p
                big_range.value_pos[p] = n
        return p

br = big_range()

for i in range(10):
    print(br.map_value(i))

print(big_range.pos_value)
print(big_range.value_pos)
Output:
Gibberish humongous numbers ... but it works.
You store each number twice (once as pos:number, once as number:pos) for lookup reasons. You might want to check how many numbers you can generate before you run out of memory.
You can use one dict only, but then looking up the index for a value is not O(1) but O(n), because you need to traverse dict.items() to find the value and return its index.
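For illustration, a single-dict reverse lookup could look like this (a hypothetical helper, not part of the class above):

def unmap_value_single_dict(pos_value, value):
    # O(n) reverse lookup when only the pos -> value dict is kept
    for pos, val in pos_value.items():
        if val == value:
            return pos
    return None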
Repeatability breaks if you do other random things in between, because you alter the internal state of random. You might need to do some more encapsulation and state-keeping inside your class, using random.getstate() / random.setstate() to save the state after each generation of a new random number and restore it before the next one.
If you end up mapping most of your values, producing a position that is not already present will take longer and longer, because the rejection loop hits more and more collisions.
random.getstate()
random.setstate()
random.randrange()
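A minimal sketch of that state-keeping idea (the class and attribute names are made up for illustration): save the module state after each of our draws and restore it before the next one, so unrelated random calls elsewhere don't disturb the sequence. A dedicated random.Random(seed) instance would give the same isolation with less bookkeeping.

import random

class StatefulDraw:
    def __init__(self, seed=42):
        random.seed(seed)
        self._state = random.getstate()    # remember where "our" stream left off

    def draw(self):
        outside_state = random.getstate()  # whatever state other code left behind
        random.setstate(self._state)       # restore our stream
        value = random.randrange(10**6128)
        self._state = random.getstate()    # save our stream's new position
        random.setstate(outside_state)     # hand the module state back unchanged
        return value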
This is kind of fragile and more of a thought experiment - I have no clue what one would need a range of 10**6128 numbers for...

Related

Generate the n-th random number with Python

I am trying to generate random numbers that are used to generate a part of a world (I am working on world generation for a game). I could create these with something like [random.randint(0, 100) for n in range(1000)] to generate 1000 random numbers from 0 to 100, but I don't know how many numbers in a list I need. What I want is to be able to say something like random.nth_randint(0, 100, 5) which would generate the 5th random number from 0 to 100. (The same number every time as long as you use the same seed) How would I go about doing this? And if there is no way to do this, how else could I get the same behavior?
Python's random module produces deterministic pseudo-random values.
In simpler words, it behaves as if it generated a list of predetermined values when a seed is provided (or when the default seed is taken from the OS), and those values will always be the same for a given seed.
Which is basically what we want here.
So to get the nth random value you either need to remember the generator's state for each generated value (probably just keeping track of the values themselves would be less memory-hungry), or you need to reset (reseed) the generator and produce n random numbers each time to get yours.
import random

def randgen(a, b, n, seed=4):
    # our default seed is random in itself as evidenced by https://xkcd.com/221/
    random.seed(seed)
    for i in range(n - 1):
        x = random.random()  # burn the first n-1 draws
    return random.randint(a, b)
If I understood your question correctly, you want the same n-th number every time. You may create a class that keeps track of the generated numbers (using the same seed).
The main idea is that, when you ask for the nth number, it generates all the previous ones as well, so the sequence is always the same for the whole run of the program.
import random

class myRandom():
    def __init__(self):
        self.generated = []
        # your own instance of random.Random()
        self.rand = random.Random(99)

    def generate(self, nth):
        if nth < len(self.generated) + 1:
            return self.generated[nth - 1]
        else:
            for _ in range(len(self.generated), nth):
                self.generated.append(self.rand.randint(1, 100))
            return self.generated[nth - 1]

r = myRandom()
print(r.generate(1))
print(r.generate(5))
print(r.generate(10))
Using a defaultdict, you can have a structure that generates a new number on the first access of each key.
from collections import defaultdict
from random import randint

random_numbers = defaultdict(lambda: randint(0, 100))

random_numbers[5]  # 42
random_numbers[5]  # 42
random_numbers[0]  # 63
Numbers are thus lazily generated on access.
Since you are working on a game, it is likely you will then need to preserve random_numbers through interruptions of your program. You can use pickle to save your data.
import pickle

random_numbers[0]  # 24

# Save the current state
with open('random', 'wb') as f:
    pickle.dump(dict(random_numbers), f)

# Load the last saved state
with open('random', 'rb') as f:
    opened_random_numbers = defaultdict(lambda: randint(0, 100), pickle.load(f))

opened_random_numbers[0]  # 24
Numpy's new random BitGenerator interface provides a method advance(delta) on some of the BitGenerator implementations (including PCG64, the default). This method allows you to seed and then advance the generator to get the n-th random number.
From the docs:
Advance the underlying RNG as-if delta draws have occurred.
https://numpy.org/doc/stable/reference/random/bit_generators/generated/numpy.random.PCG64.advance.html#numpy.random.PCG64.advance
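A minimal sketch of the idea (assuming the PCG64 bit generator; advance() counts raw bit-generator draws, which should line up one-to-one with simple calls such as Generator.random(), but not necessarily with every distribution method):

import numpy as np

def nth_random(n, seed=12345):
    # Re-create the bit generator from the seed, skip ahead n draws, take the next one
    bit_gen = np.random.PCG64(seed).advance(n)
    return np.random.Generator(bit_gen).random()

print(nth_random(5))  # same value on every run with the same seed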

Python random sample generator (comfortable with huge population sizes)

As you might know, random.sample(population, sample_size) quickly returns a random sample, but what if you don't know the size of the sample in advance? You end up sampling the entire population, or shuffling it, which is the same thing. But this can be wasteful (if most sample sizes turn out to be small compared to the population size) or even unfeasible (if the population size is huge, you run out of memory). Also, what if your code needs to jump around before picking the next element of the sample?
P.S. I bumped into the need to optimize random sampling while working on simulated annealing for TSP. In my code, sampling is restarted hundreds of thousands of times, and each time I don't know whether I will need to pick 1 element or 100% of the elements of the population.
First, I would split the population into blocks. The function that does the block sampling can easily be a generator, able to process samples of arbitrary size.
Imagine an infinite population, a population block size of 512 and a sample size of 8. This means you can gather as many samples as you need, and for further reduction sample the already sampled space again (for 1024 blocks this means 8192 samples from which you can sample again).
At the same time, this allows for parallel processing, which may be worthwhile for very large samples.
Example considering an in-memory population:
import random

population = [random.randint(0, 1000) for i in range(0, 150000)]

def sample_block(population, block_size, sample_size):
    block_number = 0
    while True:
        try:
            yield random.sample(population[block_number * block_size:(block_number + 1) * block_size], sample_size)
            block_number += 1
        except ValueError:
            break

sampler = sample_block(population, 512, 8)
samples = []

try:
    while True:
        samples.extend(next(sampler))
except StopIteration:
    pass

print(random.sample(samples, 200))
If the population were external to the script (file, block device), the only modification is that you would have to load the appropriate chunk into memory. Proof of concept of how sampling an infinite population could look:
import random
import time

def population():
    while True:
        yield random.randint(0, 10000)

def reduced_population(samples):
    for sample in samples:
        yield sample

def sample_block(generator, block_size, sample_size):
    block_number = 0
    block = []
    for item in generator:
        block.append(item)
        if len(block) == block_size:
            s = random.sample(block, sample_size)
            block_number += 1
            block = []
            print('Sampled block {} with result {}.'.format(block_number, s))
            yield s

samples = []
result = []
reducer = sample_block(population(), 512, 12)

try:
    while True:
        samples.append(next(reducer))
        if len(samples) == 1000:
            sampler = sample_block(reduced_population(samples), 1000, 15)
            result.append(list(sampler))
            time.sleep(5)
except StopIteration:
    pass
Ideally, you'd also gather the samples and then sample them again.
That's what generators are for, I believe. Here is an example of Fisher-Yates-Knuth sampling via generator/yield: you get events one by one and stop when you want to.
Code updated
import random
import numpy
import array

class populationFYK(object):
    """
    Implementation of the Fisher-Yates-Knuth shuffle
    """
    def __init__(self, population):
        self._population = population      # reference to the population
        self._length = len(population)     # length of the sequence
        self._index = len(population) - 1  # last unsampled index
        self._popidx = array.array('i', range(0, self._length))

        # array module vs numpy
        #self._popidx = numpy.empty(self._length, dtype=numpy.int32)
        #for k in range(0, self._length):
        #    self._popidx[k] = k

    def swap(self, idx_a, idx_b):
        """
        Swap two elements in the population
        """
        temp = self._popidx[idx_a]
        self._popidx[idx_a] = self._popidx[idx_b]
        self._popidx[idx_b] = temp

    def sample(self):
        """
        Yield one sampled case from the population
        """
        while self._index >= 0:
            idx = random.randint(0, self._index)  # index of the sampled event
            if idx != self._index:
                self.swap(idx, self._index)
            sampled = self._population[self._popidx[self._index]]  # yield it
            self._index -= 1  # one less to be sampled
            yield sampled

    def index(self):
        return self._index

    def restart(self):
        self._index = self._length - 1
        for k in range(0, self._length):
            self._popidx[k] = k

if __name__ == "__main__":
    population = [1, 3, 6, 8, 9, 3, 2]
    gen = populationFYK(population)
    for k in gen.sample():
        print(k)
You can get a sample of size K out of a population of size N by picking K non-repeating random numbers in the range [0, N) and treating them as indexes.
Option a)
You could generate such an index sample using the well-known sample method.
random.sample(range(N), K)
From the Python docs about random.sample:
To choose a sample from a range of integers, use a range() object as an argument. This is especially fast and space efficient for sampling from a large population.
Option b)
If you don't like the fact that random.sample already returns a list instead of a lazy generator of non-repeating random numbers, you can go fancy with Format-Preserving Encryption to encrypt a counter.
This way you get a real generator of random indexes, and you can pick as many as you want and stop at any time, without getting any duplicates, which gives you dynamically sized sample sets.
The idea is to construct an encryption scheme to encrypt the numbers from 0 to N. Now, for each time you want to get a sample from your population, you pick a random key for your encryption and start to encrypt the numbers from 0, 1, 2, ... onwards (this is the counter). Since every good encryption creates a random-looking 1:1 mapping you end up with non-repeating random integers you can use as indexes.
The storage requirement during this lazy generation is just the initial key plus the current value of the counter.
The idea was already discussed in Generating non-repeating random numbers in Python. There even is a python snippet linked: formatpreservingencryption.py
A sample code using this snippet could be implemented like this:
import random
import string

# FPEInteger comes from the formatpreservingencryption.py snippet linked above
def itersample(population):
    # Get the size of the population
    N = len(population)
    # Get the number of bits needed to represent this number
    bits = (N - 1).bit_length()
    # Generate some random key
    key = ''.join(random.choice(string.ascii_letters + string.digits) for _ in range(32))
    # Create a new crypto instance that encrypts binary blocks of width <bits>
    # Thus, being able to encrypt all numbers up to the nearest power of two
    crypter = FPEInteger(key=key, radix=2, width=bits)

    # Count up
    for i in range(1 << bits):
        # Encrypt the current counter value
        x = crypter.encrypt(i)
        # If it is bigger than our population size, just skip it
        # Since we generate numbers up to the nearest power of 2,
        # we have to skip up to half of them, and on average up to one at a time
        if x < N:
            # Return the randomly chosen element
            yield population[x]
I wrote (in Python 2.7.9) a random sampler generator (of indexes) whose speed depends only on the sample size (it should be O(ns log(ns)), where ns is the sample size). So it is fast when the sample size is small compared to the population size, because it does NOT depend at all on the population size. It doesn't build any population collection; it just picks random indexes and uses a kind of bisect method on the sampled indexes to avoid duplicates and keep them sorted. Given an iterable population, here's how to use the itersample generator:
import random
sampler = itersample(len(population))
next_pick = next(sampler)  # pick the next random (index of) element
or
import random
sampler = itersample(len(population))
sample = []
for index in sampler:
    # do something with (index of) picked element
    sample.append(index)  # build a sample
    if some_condition:    # stop sampling when needed
        break
If you need the actual elements and not just the indexes, just apply the population iterable to the index when needed (population[next(sampler)] and population[index], respectively, for the first and second example).
The results of some tests show that speed does NOT depend on population size, so if you need to randomly pick only 10 elements from a population of 100 billion, you pay only for 10 (remember, we don't know in advance how many elements we'll pick, otherwise you'd be better off using random.sample).
Sampling 1000 from 1000000
Using itersample 0.0324 s
Sampling 1000 from 10000000
Using itersample 0.0304 s
Sampling 1000 from 100000000
Using itersample 0.0311 s
Sampling 1000 from 1000000000
Using itersample 0.0329 s
Other tests confirm that running time is slightly more than linear with sample size:
Sampling 100 from 1000000000
Using itersample 0.0018 s
Sampling 1000 from 1000000000
Using itersample 0.0294 s
Sampling 10000 from 1000000000
Using itersample 0.4438 s
Sampling 100000 from 1000000000
Using itersample 8.8739 s
Finally, here is the generator function itersample:
import random

def itersample(c):  # c: population size
    sampled = []

    def fsb(a, b):  # free spaces before middle of interval a, b
        fsb.idx = a + (b + 1 - a) // 2
        fsb.last = sampled[fsb.idx] - fsb.idx if len(sampled) > 0 else 0
        return fsb.last

    while len(sampled) < c:
        sample_index = random.randrange(c - len(sampled))
        a, b = 0, len(sampled) - 1
        if fsb(a, a) > sample_index:
            yielding = sample_index
            sampled.insert(0, yielding)
            yield yielding
        elif fsb(b, b) < sample_index + 1:
            yielding = len(sampled) + sample_index
            sampled.insert(len(sampled), yielding)
            yield yielding
        else:  # sample_index falls inside the sampled list
            while a + 1 < b:
                if fsb(a, b) < sample_index + 1:
                    a = fsb.idx
                else:
                    b = fsb.idx
            yielding = a + 1 + sample_index
            sampled.insert(a + 1, yielding)
            yield yielding
Here is another idea. For a huge population we would like to keep some information about which records have already been selected. In your case that means one integer index per selected record - a 32-bit or 64-bit integer - plus some code to do a reasonable selected/not-selected lookup. For a large number of selected records this bookkeeping can become prohibitive. What I would propose is to use a Bloom filter for the set of selected indices. False positive matches are possible, but false negatives are not, so there is no risk of getting duplicated records. It does introduce a slight bias - records that come up as false positives will be excluded from sampling. But the memory efficiency is good: fewer than 10 bits per element are required for a 1% false positive probability. So if you select 5% of the population with a 1% false positive rate, you miss 0.0005 (0.05%) of your population, which depending on requirements might be OK. If you want a lower false positive rate, use more bits. The memory efficiency would be a lot better, though I expect there is more code to execute per sampled record.
Sorry, no code
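A minimal sketch of the Bloom-filter idea (the filter size, hash scheme and parameter values below are illustrative only, not tuned):

import hashlib
import random

class BloomFilter:
    # Illustrative double-hashing Bloom filter; not production quality.
    def __init__(self, size_bits, num_hashes):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], 'big')
        h2 = int.from_bytes(digest[8:16], 'big')
        return [(h1 + i * h2) % self.size for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def bloom_sample(population_size, filter_bits=10_000_000, hashes=7):
    # Rejection-sample indices; a false positive just skips a valid index (the slight bias mentioned above)
    seen = BloomFilter(filter_bits, hashes)
    while True:
        idx = random.randrange(population_size)
        if idx not in seen:
            seen.add(idx)
            yield idx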

Fast lookup in list of intervals

I have a list of start-end positions with ~280,000 elements, in total covering about 73,000,000 positions.
For performance reasons I already split them into parts in a dictionary (keyed by a subsetting factor), which in turn contains a list of (start, end) tuples.
Finally, I get a list of positions that I want to test for being located in one of the regions spanned by start and end.
posit = (start, end)
dict[subset].append(posit)

for posit in dict[subset]:
    if posit[0] < varpos < posit[1]:
        # do some stuff here
        pass
Currently those lookups take a long time. But due to memory considerations I also don't want to produce a set containing all positions between start and stop, even though lookups in it would be faster.
Do you have any pointers on how to create a fast (start, end) position data structure, or a better lookup strategy?
My assumptions are that the ranges do not overlap and that the 280,000 range objects are not changing on a regular basis. My first instinct is to use a sorted list of lists instead of the dictionary of lists, then import the list of positions and pass each one into a findRange method.
To test my implementation I generated a sorted list of 280,000 lists, then passed 1000 random possiblePositionMatches into findRange for matching.
This implementation took 7.260579 seconds for 100 'possiblePositionMatches' and 71.96268 seconds for 1000 'possiblePositionMatches'.
import random
import time

values = list()
for a in range(0, 73000000, 250):
    values.append([a, a + 200])

possiblePositionMatches = list()
count = 1000
while count:
    count = count - 1
    possiblePositionMatches.append(random.randint(0, 73000000))

matches = []

def findRange(value):
    for x in range(len(values)):
        if (value >= values[x][0]) and (value < values[x][1]):
            matches.append([value, values[x]])

def main():
    t1 = time.process_time()
    for y in possiblePositionMatches:
        findRange(y)
    print(matches)
    t2 = time.process_time() - t1
    print("Total Time: {0} seconds".format(t2))

main()
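Since the ranges in this test are sorted and non-overlapping, a bisect-based findRange would avoid the linear scan; a minimal sketch of that variant (not part of the timed test above):

import bisect
import random

# Same synthetic data as above: sorted, non-overlapping [start, end) ranges
values = [[a, a + 200] for a in range(0, 73000000, 250)]
starts = [start for start, end in values]  # parallel list of starts for bisect

def findRangeBisect(position):
    # Return the range containing position, or None; O(log n) per lookup
    i = bisect.bisect_right(starts, position) - 1
    if i >= 0 and values[i][0] <= position < values[i][1]:
        return values[i]
    return None

for position in (random.randint(0, 73000000) for _ in range(1000)):
    findRangeBisect(position)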

python: improve performance and/or method to avoid memory error creating, saving and deleting variable variables

I have been fighting a function that gives me a memory error, and thanks to your support (Python: how to split and return a list from a function to avoid memory error) I managed to sort out the issue; however, since I am not a pro programmer I would like to ask for your opinion on my method and how to improve its performance (if possible).
The function is a generator function returning all cycles from an n-node digraph. However, for a 12-node digraph there are about 115 million cycles (each defined as a list of nodes, e.g. [0,1,2,0] is a cycle). I need all cycles available for further processing even after I have extracted some of their properties when they were first generated, so they need to be stored somewhere. So the idea is to cut the result array every 10 million cycles to avoid a memory error (when an array is too big, Python runs out of RAM) and create a new array to store the following results. In the 12-node digraph I would then have 12 result arrays: 11 full ones (containing 10 million cycles each) and the last one containing 5 million cycles.
However, splitting the result array is not enough since the variables stay in RAM. So, I still need to write each one to the disk and delete it afterwards to clear the RAM.
As stated in How do I create a variable number of variables?, using 'exec' to create variable variable names is not very "clean" and dictionary solutions are better. However, in my case, if I store the results in a single dictionary, it will run out of memory due to the size of the arrays. Hence, I went for the 'exec' way. I would be grateful if you could comment on that decision.
Also, to store the arrays I use numpy.savez_compressed, which gives me a 43 MB file for each 10-million-cycle array. Uncompressed it creates a 500 MB file, but using the compressed version slows down the writing process. Any idea how to speed up the writing and/or compressing process?
A simplified version of the code I wrote is as follows:
import numpy as np
# 'generator' is the cycle generator described above

nbr_result_arrays = 0
result_array_0 = []
result_length = 10000000
tmp = result_array_0  # I use tmp to avoid using exec within the for loop (exec slows down code execution)

for cycle in generator:
    tmp.append(cycle)
    if len(tmp) == result_length:
        exec('np.savez_compressed(\'results_' + str(nbr_result_arrays) + '\', tmp)')
        exec('del result_array_' + str(nbr_result_arrays))
        nbr_result_arrays += 1
        exec('result_array_' + str(nbr_result_arrays) + '=[]')
        exec('tmp=result_array_' + str(nbr_result_arrays))
Thanks for reading,
Aleix
How about using itertools.islice?
import itertools
import numpy as np

for i in itertools.count():
    tmp = list(itertools.islice(generator, 10000000))
    if not tmp:
        break
    np.savez_compressed('results_{}'.format(i), tmp)
    del tmp
Thanks to all for your suggestions.
As suggested by @Aya, I believe that to improve performance (and avoid possible space issues) I should avoid storing the results on the HD, because storing them takes about half as much time as creating them, so loading and processing them again would come very close to creating the results again. Additionally, if I do not store any results, I save space, which can become a big issue for bigger digraphs (a 12-node complete digraph has about 115 million cycles but a 29-node one has about 848E27 cycles... increasing at a factorial rate).
The idea is that I first need to go through all cycles passing through the weakest arc to find the total probability of all cycles going through it. Then, with this total probability, I must go through all those cycles again to subtract them from the original array according to the weighted probability (I needed the total probability to be able to calculate the weighted probability: weighted_prob = prob_of_this_cycle / total_prob_through_this_edge).
Thus, I believe that this is the best approach to do that (but I am open to more discussions! :) ).
However, I have doubts regarding the processing speed of two sub-functions:
1st: finding whether a sequence contains a specific (smaller) sequence. I am doing that with the function contains_sequence, which relies on the generator function window (as suggested in Is there a Python builtin for determining if an iterable contained a certain sequence?). However, I have been told that doing it with a deque would be up to 33% faster (a deque-based sketch is included after the code listing below). Any other ideas?
2nd: I am currently finding the probability of a cycle by sliding through the cycle nodes (represented as a list) to find, for each arc, the probability of staying within the cycle, and then multiplying them all to find the cycle probability (the function name is find_cycle_probability). Any performance suggestions on this function would be appreciated, since I need to run it for each cycle, i.e. countless times.
Any other tips/suggestion/comments will be most welcome! And thanks again for your help.
Aleix
Below follows the simplified code:
import sys
import itertools
import numpy
import networkx
import new_cycles  # the question author's own module providing simple_cycles_generator

def simple_cycles_generator_w_filters(working_array_digraph, arc):
    '''Generator function generating all cycles containing a specific arc.'''
    generator = new_cycles.simple_cycles_generator(working_array_digraph)
    for cycle in generator:
        if contains_sequence(cycle, arc):
            yield cycle
    return
def find_smallest_arc_with_cycle(working_array, working_array_digraph):
    '''Find the smallest arc through which at least one cycle flows.
    Returns:
    - if such an arc exists:
        smallest_arc_with_cycle = [a,b] where a is the start of the arc and b the end
        smallest_arc_with_cycle_value = x where x is the weight of the arc
    - if such an arc does not exist:
        smallest_arc_with_cycle = []
        smallest_arc_with_cycle_value = 0 '''
    smallest_arc_with_cycle = []
    smallest_arc_with_cycle_value = 0
    sparse_array = []
    for i in range(numpy.shape(working_array)[0]):
        for j in range(numpy.shape(working_array)[1]):
            if working_array[i][j] != 0:
                sparse_array.append([i, j, working_array[i][j]])
    sorted_array = sorted(sparse_array, key=lambda x: x[2])
    for i in range(len(sorted_array)):
        smallest_arc = [sorted_array[i][0], sorted_array[i][1]]
        generator = simple_cycles_generator_w_filters(working_array_digraph, smallest_arc)
        if any(generator):
            smallest_arc_with_cycle = smallest_arc
            smallest_arc_with_cycle_value = sorted_array[i][2]
            break
    return smallest_arc_with_cycle, smallest_arc_with_cycle_value
def window(seq, n=2):
    """Returns a sliding window (of width n) over data from the iterable
    s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ... """
    it = iter(seq)
    result = list(itertools.islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + [elem]
        yield result

def contains_sequence(all_values, seq):
    return any(seq == current_seq for current_seq in window(all_values, len(seq)))
def find_cycle_probability(cycle, working_array, total_outputs):
    '''Finds the cycle probability of a given cycle within a given array.'''
    output_prob_of_each_arc = []
    for i in range(len(cycle) - 1):
        weight_of_the_arc = working_array[cycle[i]][cycle[i + 1]]
        output_probability_of_the_arc = float(weight_of_the_arc) / float(total_outputs[cycle[i]])  # NOTE: total_outputs is an array, thus the float
        output_prob_of_each_arc.append(output_probability_of_the_arc)
    circuit_probabilities_of_the_cycle = numpy.prod(output_prob_of_each_arc)
    return circuit_probabilities_of_the_cycle
def clean_negligible_values(working_array):
    '''Cleans the array by rounding negligible values to 0 according to a
    pre-defined threshold.'''
    zero_threshold = 0.000001
    for i in range(numpy.shape(working_array)[0]):
        for j in range(numpy.shape(working_array)[1]):
            if working_array[i][j] == 0:
                continue
            elif 0 < working_array[i][j] < zero_threshold:
                working_array[i][j] = 0
            elif -zero_threshold <= working_array[i][j] < 0:
                working_array[i][j] = 0
            elif working_array[i][j] < -zero_threshold:
                sys.exit('Error')
    return working_array
original_array = 1000 * numpy.random.random_sample((5, 5))
total_outputs = numpy.sum(original_array, axis=0) + 100 * numpy.random.random_sample(5)
working_array = original_array.copy()
straight_array = working_array.copy()
cycle_array = numpy.zeros(numpy.shape(working_array))
iteration_counter = 0

working_array_digraph = networkx.DiGraph(working_array)
[smallest_arc_with_cycle, smallest_arc_with_cycle_value] = find_smallest_arc_with_cycle(working_array, working_array_digraph)

while smallest_arc_with_cycle:  # using the implicit truth value of a non-empty list
    cycle_flows_to_be_subtracted = numpy.zeros(numpy.shape(working_array))

    # FIRST run of the generator to calculate each cycle probability
    # note: the cycle generator ONLY provides all cycles going through
    # the specified weakest arc
    generator = simple_cycles_generator_w_filters(working_array_digraph, smallest_arc_with_cycle)
    nexus_total_probs = 0
    for cycle in generator:
        cycle_prob = find_cycle_probability(cycle, working_array, total_outputs)
        nexus_total_probs += cycle_prob

    # SECOND run of the generator
    # using the nexus_total_probs calculated before, I can allocate the weight of the
    # weakest arc to each cycle going through it
    generator = simple_cycles_generator_w_filters(working_array_digraph, smallest_arc_with_cycle)
    for cycle in generator:
        cycle_prob = find_cycle_probability(cycle, working_array, total_outputs)
        allocated_cycle_weight = cycle_prob / nexus_total_probs * smallest_arc_with_cycle_value
        # create the array to be subtracted
        for i in range(len(cycle) - 1):
            cycle_flows_to_be_subtracted[cycle[i]][cycle[i + 1]] += allocated_cycle_weight

    working_array = working_array - cycle_flows_to_be_subtracted
    clean_negligible_values(working_array)
    cycle_array = cycle_array + cycle_flows_to_be_subtracted
    straight_array = straight_array - cycle_flows_to_be_subtracted
    clean_negligible_values(straight_array)

    # find the next weakest arc with cycles.
    working_array_digraph = networkx.DiGraph(working_array)
    [smallest_arc_with_cycle, smallest_arc_with_cycle_value] = find_smallest_arc_with_cycle(working_array, working_array_digraph)
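Regarding the 1st point above, a minimal sketch of the deque-based variant of contains_sequence (illustrative only, not benchmarked here):

from collections import deque
from itertools import islice

def contains_sequence_deque(all_values, seq):
    # Slide a fixed-size deque over all_values and compare it to seq
    seq = list(seq)
    n = len(seq)
    it = iter(all_values)
    win = deque(islice(it, n), maxlen=n)
    if list(win) == seq:
        return True
    for elem in it:
        win.append(elem)  # the leftmost element drops off automatically
        if list(win) == seq:
            return True
    return False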

Python speeding up the search for a value in a dictionary of ranges

I have a file with a column of values I would like to compare against a dictionary that contains two values that together form a range.
for instance:
File A:
Chr1 200 ....
Chr3 300
File B:
Chr1 200 300 ...
Chr2 300 350 ...
For now I created a dictionary of values for File B:
for Line in FileB:
    LineB = Line.strip('\n').split('\t')
    Chr = LineB[0]
    Ranges[Chr].append(LineB)
For the comparison:
for Line in MethylationFile:
Line = Line.strip("\n")
Info = Line.split("\t")
Chr = Info[0]
Location = int(Info[1])
Annotation = ""
for i, r in enumerate(Ranges[Chr]):
n = i + 1
while (n < len(Ranges[Chr])):
if (int(Ranges[Chr][i][1]) <= Location <= int(Ranges[Chr][i][2])):
Annotation = '\t'.join(Ranges[Chr][i][4:])
n +=1
OutFile.write(Line + '\t' + Annotation + '\n')
If I leave the while loop in, the program does not seem to finish (or is probably running too slowly to get results), since I have over 7,000 values in each dictionary entry. If I change the while loop to an if statement, the program runs, but at an incredibly slow pace.
I'm looking for a way to make this program faster and more efficient.
Dictionaries are great when you want to look up a key by exact match. In particular, the hash of the lookup key has to be the same as the hash of the stored key.
If your ranges are consistent, you could fake this by writing a hash function that returns the same value for a range, and for every value within that range. But if they're not, this hash function would have to keep track of all of the known ranges, which takes you back to the same problem you're starting with.
In that case, the right data structure here is probably some kind of sorted collection. If you only need to build up the collection, and then use it many times without ever modifying it, just sorting a list and using the bisect module will do it for you. If you need to modify the collection after creation, you'll want something built around a binary tree or B-tree variant of some kind, like blist or bintrees.
This will reduce the time to find a range from N/2 to log2(N). So, if you've got 10000 ranges, instead of 5000 comparisons, you'll do 14.
While we're at it, it would help to convert the range start and stop values to ints once, instead of doing it each time. Also, if you want to use the stdlib bisect, you unfortunately can't pass a key to most functions, so let's reorganize the ranges into comparable order too. So:
for Line in FileB:
    LineB = Line.strip('\n').split('\t')
    Chr = LineB[0]
    # store (start, stop, rest) so the tuples sort and compare by position
    Ranges[Chr].append((int(LineB[1]), int(LineB[2]), LineB))

for r in Ranges.values():
    r.sort()
Now, instead of this loop:
for i, r in enumerate(Ranges[Chr]):
    # ...
Do this:
import bisect

i = bisect.bisect(Ranges[Chr], (Location, Location, None))
if i:
    r = Ranges[Chr][i - 1]
    if r[0] <= Location < r[1]:
        # do whatever you wanted with r
        pass
    else:
        # there is no range that includes Location
        pass
else:
    # Location is before all ranges
    pass
You have to be careful thinking about bisect, and it's possible I've got this wrong on the first attempt, so… read the docs on what it does, and experiment with your data (printing out the results of the bisect function), before trusting this.
If your ranges can overlap, and you want to be able to find all ranges that contain a value rather than just one, you'll need a bit more than this to keep things efficient. There's no way to fully-order overlapping ranges, so bisect won't cut it.
If you're expecting more than log N matches per average lookup, you can do it with two sorted lists and bisect.
But otherwise, you need a more complex data structure, and more complex code. For example, if you can spare N^2 space, you can keep the time at log N by having, for each range in the first list, a second list, sorted by end, of all the values with a matching start.
And at this point, I think it's getting complex enough that you want to look for a library to do it for you.
However, you might want to consider a different solution.
If you use numpy or a database instead of pure Python, this can't cut the algorithmic complexity from N to log N… but it can cut the constant overhead by a factor of 10 or so, which may be good enough. In fact, if you're doing tons of searches on a medium-small list, it may even be better.
Plus, it looks a lot simpler, and once you get used to array operations or SQL, it may even be more readable. So:
import numpy as np

RangeArrays = [np.array([a[:2] for a in value]) for value in Ranges]
… or, if Ranges is a dict mapping strings to values, instead of a list:
RangeArrays = {key: np.array([a[:2] for a in value]) for key, value in Ranges.items()}
Then, instead of this:
for i, r in enumerate(Ranges[Chr]):
    # ...
Do:
comparisons = Location < RangeArrays[Chr]
matches = comparisons[:, 0] < comparisons[:, 1]
indices = matches.nonzero()[0]
for index in indices:
    r = Ranges[Chr][index]
    # Do stuff with r
(You can of course make things more concise, but it's worth doing it this way and printing out all of the intermediate steps to see why it works.)
Or, using a database:
cur = db.execute('''SELECT Start, Stop, Chr FROM Ranges
                    WHERE Start <= ? AND Stop > ?''', (Location, Location))
for (Start, Stop, Chr) in cur:
    # do stuff
    pass
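A minimal setup sketch for that query, assuming SQLite and the three-column layout of File B shown in the question (file, database and table names here are illustrative):

import sqlite3

db = sqlite3.connect('ranges.db')
db.execute('CREATE TABLE IF NOT EXISTS Ranges (Chr TEXT, Start INTEGER, Stop INTEGER)')
db.execute('CREATE INDEX IF NOT EXISTS idx_start_stop ON Ranges (Start, Stop)')
with open('FileB.txt') as f:
    rows = (line.rstrip('\n').split('\t') for line in f)
    db.executemany('INSERT INTO Ranges (Chr, Start, Stop) VALUES (?, ?, ?)',
                   ((cols[0], int(cols[1]), int(cols[2])) for cols in rows))
db.commit()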
