Random non-repeating number generator - Python

Is it possible to generate, in pseudo-random ORDER, all the numbers from 0 .. N, without repeating any number AND without keeping track of which numbers were already generated?
For example, the opposite, a non-random rule, would be:
- generate all ODD values
- generate all EVEN values
does:
np.random.choice(range(1000000),1000000,replace=False)
materialize the range?

Yes, it's possible.
You could create a custom LCG for your given N or the next power of two greater than your N, but the quality of the random numbers is quite bad.
A better method is to create a seeded hash function that is reversible for every power of two, and hash all numbers from 0 to next_pow_2(N), while rejecting numbers greater than N. This article explains it quite well: https://andrew-helmer.github.io/permute/
The above method works best if N isn't that small (N > 2^14 would be advisable for the implementation in the linked article), because creating good hash functions for a small input width is very hard.
Note that while these methods work, you should really consider just shuffling an array of numbers 0 to N, as that is usually faster than the above methods.
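For reference, here is a minimal sketch of the LCG variant mentioned above, assuming a power-of-two modulus and cycle-walking; the constants are my own example choices, and the statistical quality is indeed poor:
def lcg_permutation(n, seed=12345):
    # Modulus m is the next power of two >= n; with c odd and a % 4 == 1 the
    # LCG has full period m, so it visits every value in [0, m) exactly once.
    m = 1 << max(1, (n - 1).bit_length())
    a = 4 * seed + 1          # a % 4 == 1
    c = 2 * seed + 1          # c odd
    x = seed % m
    for _ in range(m):
        x = (a * x + c) % m
        if x < n:             # cycle-walking: skip values outside 0..n-1
            yield x

print(sorted(lcg_permutation(10)) == list(range(10)))  # True: each value exactly once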

Shuffle all the numbers in the range, then pick them off in the shuffled order.
More work would be to develop a format-preserving encryption that only produces numbers in the required range, and then encrypt the numbers 0, 1, 2, 3, ... Because encryption is a one-to-one mapping for a given key, different inputs are guaranteed to produce different outputs.
Whatever method you use, you will obviously only be able to output as many unique numbers as there are in the initial range. After that, the numbers will start to repeat.
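For completeness, a minimal sketch of the shuffle approach, which is the simplest option when the range fits in memory:
import random

def shuffled_range(n):
    # Every value 0..n-1 exactly once, in random order; the shuffled list is
    # the only state kept.
    values = list(range(n))
    random.shuffle(values)
    yield from values

for value in shuffled_range(10):
    print(value)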

Related

How to generate unique(!) arrays/lists/sequences of uniformly distributed random numbers

Let's say I generate a pack, i.e., a one-dimensional array of 10 random numbers, with a random generator. Then I generate another array of 10 random numbers. I do this X times. How can I generate unique arrays, so that even after a trillion generations there is no array which is equal to another?
Within one array, the elements can be duplicates. Each array just has to differ from every other array in at least one element.
Is there any numpy method for this? Is there some special algorithm which works differently, by exploring some space for the random generation? I don't know.
One easy answer would be to write the arrays to a file and check if they were already generated, but the I/O operations on an ever-growing file need way too much time.
This is a difficult request, since one of the properties of an RNG is that it will repeat sequences randomly.
You also have the problem of trying to record terabytes of prior results. One thing you could try is to form a hash table (for search speed) of the existing arrays. Using this depends heavily on whether you have sufficient RAM to hold the entire list.
If not, you might consider disk-mapping a fast search structure of some sort. For instance, you could implement an on-disk binary tree of hash keys, re-balancing whenever you double the size of the tree (with insertions). This lets you keep the file open and find entries via seek, rather than needing to represent the full file in memory.
You could also maintain an in-memory index to the table, using that to drive your seek to the proper file section, then reading only a small subset of the file for the final search.
Does that help focus your implementation?
Assume that the 10 numbers in a pack are each in the range [0..max]. Each pack can then be considered as a 10 digit number in base max+1. Obviously, the size of max determines how many unique packs there are. For example, if max=9 there are 10,000,000,000 possible unique packs from [0000000000] to [9999999999].
The problem then comes down to generating unique numbers in the correct range.
Given your "trillions", the best way to generate guaranteed unique numbers in the range is probably to use an encryption with the correct size output. Unless you want 64-bit (DES) or 128-bit (AES) output, you will need some sort of format-preserving encryption to get output in the range you want.
For input, just encrypt the numbers 0, 1, 2, ... in turn. Encryption guarantees that, given the same key, the output is unique for each unique input. You just need to keep track of how far you have got with the input numbers. Given that, you can generate more unique packs as needed, within the limit imposed by max. After that point the output will start repeating.
Obviously as a final step you need to convert the encryption output to a 10 digit base max+1 number and put it into an array.
Important caveat:
This will not allow you to generate "arbitrarily" many unique packs. Please see limits as highlighted by #Prune.
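A minimal sketch of that final conversion step, assuming pack_length digits in base max+1; the names and the example integer are mine, and the integer would come from the encryption step:
def int_to_pack(value, pack_length=10, max_value=9):
    # Repeatedly take the remainder modulo the base to peel off digits.
    base = max_value + 1
    pack = []
    for _ in range(pack_length):
        value, digit = divmod(value, base)
        pack.append(digit)
    return pack[::-1]   # most significant digit first

print(int_to_pack(1234567890))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]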
Note that as the number of requested packs approaches the number of unique packs, it takes longer and longer to find a pack. I also put in a safety so that after a certain number of tries it just gives up.
Feel free to adjust:
import random
## -----------------------
## Build a unique pack generator
## -----------------------
def build_pack_generator(pack_length, min_value, max_value, max_attempts):
    existing_packs = set()

    def _generator():
        # Draw candidate packs until one hashes to a value not seen before.
        pack = tuple(random.randint(min_value, max_value) for _ in range(pack_length))
        pack_hash = hash(pack)
        attempts = 1
        while pack_hash in existing_packs:
            if attempts >= max_attempts:
                raise KeyError("Unable to find a valid pack")
            pack = tuple(random.randint(min_value, max_value) for _ in range(pack_length))
            pack_hash = hash(pack)
            attempts += 1
        existing_packs.add(pack_hash)
        return list(pack)

    return _generator

generate_unique_pack = build_pack_generator(2, 1, 9, 1000)
## -----------------------
for _ in range(50):
    print(generate_unique_pack())
The Birthday problem suggests that at some point you don't need to bother checking for duplicates. For example, if each value in a 10 element "pack" can take on more than ~250 values, then you only have a 50% chance of seeing a duplicate after generating 1e12 packs. The more distinct values each element can take on, the lower this probability becomes.
You've not specified what these random values are in this question (other than being uniformly distributed) but your linked question suggests they are Python floats. Hence each number has 2**53 distinct values it can take on, and the resulting probability of seeing a duplicate is practically zero.
There are a few ways of rearranging this calculation:
1. for a given amount of state and number of iterations, what's the probability of seeing at least one collision
2. for a given amount of state, how many iterations can you generate to stay below a given probability of seeing at least one collision
3. for a given number of iterations and probability of seeing a collision, what state size is required
The below Python code calculates option 3 as it seems closest to your question. The other options are available on the birthday attack page.
from math import log2, log1p

def birthday_state_size(size, p):
    # -log1p(-p) is a numerically stable version of log(1/(1-p))
    return size**2 / (2*-log1p(-p))

log2(birthday_state_size(1e12, 1e-6))  # => ~100
So as long as you have more than 100 uniform bits of state in each pack, everything should be fine. For example, two or more Python floats are OK (2 * 53), as are 10 integers with >= 1000 distinct values (10*log2(1000)).
You can of course reduce the probability down even further, but as noted in the Wikipedia article going below 1e-15 quickly approaches the reliability of a computer. This is why I say "practically zero" given the 530 bits of state provided by 10 uniformly distributed floats.

Generating an NxM array of uniformly distributed random numbers over a stated interval (not [0,1)) in numpy

I am aware of the numpy.random.rand() command, however there don't seem to be any arguments allowing you to adjust the uniform interval in which the numbers are chosen to something other than [0,1).
I considered using a for loop, i.e. initiating a zero array of the needed size, and using numpy.random.uniform(a,b,N) to generate N random numbers in the interval (a,b), and then putting these into the initiated array. I am not aware of this module being able to create an array of arbitrary dimension, like rand above. This is clearly inelegant, although my main concern is the run time. I presume this method would have a much higher run time than using the appropriate random number generator from the start.
Edit and additional thought: the interval I am working in is [0, pi/8), which is less than 1. Strictly speaking, I won't be affecting the randomness of the generated numbers if I just rescale, but multiplying each generated random number would clearly add computational time, I presume by a factor on the order of the number of elements.
np.random.uniform accepts a low and a high:
In [11]: np.random.uniform(-3, 3, 7) # 7 numbers between -3 and 3
Out[11]: array([ 2.68365104, -0.97817374, 1.92815971, -2.56190434, 2.48954842, -0.16202127, -0.37050593])
numpy.random.uniform accepts a size argument where you can just pass the size of your array as tuple. For generating an MxN array use
np.random.uniform(low,high, size=(M,N))
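For the specific interval in the question, a minimal sketch (the 3x4 shape is just an assumed example):
import numpy as np

# 3x4 array of uniform samples in [0, pi/8)
arr = np.random.uniform(0.0, np.pi / 8, size=(3, 4))
print(arr.shape, arr.min() >= 0.0, arr.max() < np.pi / 8)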

How can I verify that this hash function is not going to give me the same result for two different strings?

Consider two different strings of the same length.
I am implementing the Rabin-Karp algorithm and using the hash function below:
def hs(pat):
    l = len(pat)
    pathash = 0
    for x in range(l):
        pathash += ord(pat[x])*prime**x  # prime is a global variable equal to 101
    return pathash
It's a hash. By definition, there's no guarantee that there will be no collisions - otherwise, the hash would have to be at least as long as the hashed value.
The idea behind what you're doing is based in number theory: powers of a number that is coprime to the size of your finite group (which the original author probably meant to be something like 2^N) can give you any number in that finite group, and it's hard to tell which ones these were.
Sadly, the interesting part of this hash function, namely the size-limiting/modulo operation, has been left out of this code, which makes one wonder where your code comes from. As far as I can immediately see, it has little to do with Rabin-Karp.
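To illustrate the missing piece, a minimal sketch of the same polynomial hash reduced modulo a fixed modulus; the prime and modulus values here are just assumed examples, not from the question:
prime = 101
modulus = 2**31 - 1   # example modulus; the hash value now stays bounded

def hs_mod(pat):
    pathash = 0
    for x, ch in enumerate(pat):
        # pow(prime, x, modulus) also keeps the intermediate power small
        pathash = (pathash + ord(ch) * pow(prime, x, modulus)) % modulus
    return pathash

print(hs_mod("abc"))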

Avoid generating duplicate values from random

I want to generate random numbers and store them in a list as the following:
alist = [random.randint(0, 2 ** mypower - 1) for _ in range(total)]
My concern is the following: I want to generate total = 40 million values in the range (0, 2 ** mypower - 1). If mypower = 64, then alist will be of size ~20GB (40M*64*8), which is very large for my laptop's memory. My idea is to iteratively generate chunks of values, say 5 million at a time, and save them to a file so that I don't have to generate all 40M values at once. My concern is: if I do that in a loop, is it guaranteed that random.randint(0, 2 ** mypower - 1) will not generate values that were already generated in a previous iteration? Something like this:
for i in range(num_of_chunks):
    alist = [random.randint(0, 2 ** mypower - 1) for _ in range(chunk)]
    # save to file
Well, since efficiency/speed doesn't matter, I think this will work:
s = set()
while len(s) < total:
    s.add(random.randint(0, 2 ** mypower - 1))
alist = list(s)
Since sets can only contain unique elements, I think this will work well enough.
To guarantee unique values you should avoid using random. Instead you should use an encryption. Because encryption is reversible, unique inputs guarantee unique outputs, given the same key. Encrypt the numbers 0, 1, 2, 3, ... and you will get guaranteed unique random-seeming outputs back providing you use a secure encryption. Good encryption is designed to give random-seeming output.
Keep track of the key (essential) and how far you have got. For your first batch encrypt integers 0..5,000,000. For the second batch encrypt 5,000,001..10,000,000 and so on.
You want 64 bit numbers, so use DES in ECB mode. DES is a 64-bit cipher, so the output from each encryption will be 64 bits. ECB mode does have a weakness, but that only applies with identical inputs. You are supplying unique inputs so the weakness is not relevant for your particular application.
If you need to regenerate the same numbers, just re-encrypt them with the same key. If you need a different set of random numbers (which will duplicate some from the first set) then use a different key. The guarantee of uniqueness only applies with a fixed key.
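A minimal sketch of this approach, assuming the third-party pycryptodome package is installed (pip install pycryptodome); the key and batch size are just example values:
from Crypto.Cipher import DES

key = b"8bytekey"                     # keep this key to regenerate the same sequence
cipher = DES.new(key, DES.MODE_ECB)

def unique_random_64bit(start, count):
    # Encrypt the counters start, start+1, ...; a fixed key plus distinct
    # inputs guarantees distinct 64-bit outputs.
    for i in range(start, start + count):
        block = i.to_bytes(8, "big")
        yield int.from_bytes(cipher.encrypt(block), "big")

first_batch = list(unique_random_64bit(0, 5))
print(first_batch)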
One way to generate random values that don't repeat is first to create a list of contiguous values
l = list(range(1000))
then shuffle it:
import random
random.shuffle(l)
You could do that several times and save the results to a file, but you'll be limited to small ranges, since you'll never cover the whole space with your limited memory (it's like trying to sort a big list without having the memory for it).
As someone noted, to get a wide span of random numbers you'll need a lot of memory, so this is simple but not very efficient.
Another hack I just thought of: do the same as above, but generate the range using a step. Then, in a second pass, add a random offset to the values. Even if the offset values repeat, it's guaranteed to never generate the same number twice:
import random
step = 10
l = list(range(0,1000-step,step))
random.shuffle(l)
newlist = [x+random.randrange(0,step) for x in l]
With the required max value and number of iterations, that gives:
import random
number_of_iterations = 40*10**6
max_number = 2**64
step = max_number//number_of_iterations
l = list(range(0,max_number-step,step))
random.shuffle(l)
newlist = [x+random.randrange(0,step) for x in l]
print(len(newlist),len(set(newlist)))
runs in 1-2 minutes on my laptop, and gives 40000000 distinct values (evenly scattered across the range)
Usually random number generators are not really random at all. In fact, this is quite helpful in some situations. If you want the values to be unique after the second iteration, send it a different seed value.
random.seed()
The same seed will generate the same list, so if you want the next iteration to be the same, use the same seed. If you want it to be different, use a different seed.
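A minimal sketch of the reproducibility part of this (the seed value is just an example):
import random

random.seed(42)
first = [random.randint(0, 9) for _ in range(5)]
random.seed(42)
second = [random.randint(0, 9) for _ in range(5)]
print(first == second)   # True: the same seed reproduces the same sequence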
This may need a lot of CPU and physical memory!
I suggest you classify your data.
For example, you can save:
All numbers starting with 10 that are 5 digits long (example: 10365) to 10-5.txt
All numbers starting with 11 that are 6 digits long (example: 114567) to 11-6.txt
Then, to check a new number:
For example, my number is 9256547.
It starts with 92 and is 7 digits long.
So I search 92-7.txt for this number, and if it isn't a duplicate, I add it to 92-7.txt.
Finally, you can join all the files together.
Sorry if I have made mistakes; my main language isn't English.
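A rough sketch of this bucketing scheme; the file naming and helper names are my own assumptions:
def bucket_name(n):
    s = str(n)
    return f"{s[:2]}-{len(s)}.txt"   # e.g. 9256547 -> "92-7.txt"

def add_if_new(n):
    name = bucket_name(n)
    try:
        with open(name) as f:
            existing = set(f.read().split())
    except FileNotFoundError:
        existing = set()
    if str(n) in existing:
        return False                 # duplicate, don't add
    with open(name, "a") as f:
        f.write(f"{n}\n")
    return True

print(add_if_new(9256547))   # True the first time, False afterwards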

Plotting average number of steps for Euclid's extended algorithm

I was given the following assignment by my Algorithms professor:
Write a Python program that implements Euclid’s extended algorithm. Then perform the following experiment: run it on a random selection of inputs of a given size, for sizes bounded by some parameter N; compute the average number of steps of the algorithm for each input size n ≤ N, and use gnuplot to plot the result. What does f(n) which is the “average number of steps” of Euclid’s extended algorithm on input size n look like? Note that size is not the same as value; inputs of size n are inputs with a binary representation of n bits.
The programming of the algorithm was the easy part, but I just want to make sure that I understand where to go from here. I can fix N to be some arbitrary value. I generate a set of random values of a and b to feed into the algorithm, whose lengths in binary (n) are bounded above by N. While the algorithm is running, I have a counter that keeps track of the number of steps (ignoring trivial linear operations) taken for that particular a and b.
At the end of this, I sum the lengths of the binary representations of the individual inputs a and b, and that sum represents a single x value on the graph. My single y value would be the counter variable for that particular a and b. Is this a correct way to think about it?
As a follow-up question, I also know that the best case for this algorithm is Θ(1) and the worst case is O(log(n)), so my "average" graph should lie between those two. How would I manually calculate the average running time to verify that my final graph is correct?
Thanks.
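For what it's worth, a minimal sketch of one way to run the experiment described above; the step counting, the interpretation of "size" as both inputs having exactly n bits, and all names are my own assumptions:
import random

def extended_gcd(a, b):
    # Iterative extended Euclid; returns (g, x, y, steps) with a*x + b*y == g.
    steps = 0
    x0, x1, y0, y1 = 1, 0, 0, 1
    while b:
        q = a // b
        a, b = b, a - q * b
        x0, x1 = x1, x0 - q * x1
        y0, y1 = y1, y0 - q * y1
        steps += 1
    return a, x0, y0, steps

N = 16          # bound on the input size in bits
TRIALS = 1000   # random samples per size

for n in range(2, N + 1):
    total_steps = 0
    for _ in range(TRIALS):
        # Random n-bit inputs (top bit forced so the size is exactly n).
        a = random.getrandbits(n) | (1 << (n - 1))
        b = random.getrandbits(n) | (1 << (n - 1))
        total_steps += extended_gcd(a, b)[3]
    print(n, total_steps / TRIALS)   # two columns, ready for gnuplot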
