I need to fill a file with a lot of records identified by a number (test data). The number of records is very big, and the ids should be unique and the order of records should be random (or pseudo-random).
I tried this:
# coding: utf-8
import random
COUNT = 100000000
random.seed(0)
file_1 = open('file1', 'w')
for i in random.sample(xrange(COUNT), COUNT):
    file_1.write('ID{0},A{0}\n'.format(i))
file_1.close()
But it's eating all of my memory.
Is there a way to generate a big shuffled sequence of consecutive (not necessarily, but it would be nice; otherwise just unique) integer numbers, using a generator and without keeping the whole sequence in RAM?
If you have 100 million numbers like in the question, then this is actually manageable in-memory (it takes about 0.5 GB).
As DSM pointed out, this can be done with the standard modules in an efficient way:
>>> import array
>>> a = array.array('I', xrange(10**8)) # a.itemsize indicates 4 bytes per element => about 0.5 GB
>>> import random
>>> random.shuffle(a)
It is also possible to use the third-party NumPy package, which is the standard Python tool for managing arrays in an efficient way:
>>> import numpy
>>> ids = numpy.arange(100000000, dtype='uint32') # 32 bits is enough for numbers up to about 4 billion
>>> numpy.random.shuffle(ids)
(this is only useful if your program already uses NumPy, as the standard module approach is about as efficient).
Both methods take about the same amount of time on my machine (maybe 1 minute for the shuffling), but the 0.5 GB they use is not too big for current computers.
PS: There are too many elements for the shuffling to be really random because there are way too many permutations possible, compared to the period of the random generators used. In other words, there are fewer Python shuffles than the number of possible shuffles!
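Putting this together with the question's record format, a minimal sketch of the whole job (Python 2, like the original code; the file name and seed are taken from the question):

import array
import random

COUNT = 10**8
random.seed(0)

# ~0.5 GB: 4 bytes per element for 100 million unsigned ints
ids = array.array('I', xrange(COUNT))
random.shuffle(ids)

with open('file1', 'w') as f:
    for i in ids:
        f.write('ID{0},A{0}\n'.format(i))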
Maybe something like (won't be consecutive, but will be unique):
from uuid import uuid4

def unique_nums():  # Not strictly unique, but *practically* unique
    while True:
        yield int(uuid4().hex, 16)
        # alternative: yield uuid4().int

unique_num = unique_nums()
next(unique_num)
next(unique_num)  # etc...
You can fetch random ints easily by reading (on Linux) from /dev/urandom, or by using os.urandom() and struct.unpack():
Return a string of n random bytes suitable for cryptographic use.
This function returns random bytes from an OS-specific randomness source. The returned data should be unpredictable enough for cryptographic applications, though its exact quality depends on the OS implementation. On a UNIX-like system this will query /dev/urandom, and on Windows it will use CryptGenRandom. If a randomness source is not found, NotImplementedError will be raised.
>>> import os, struct
>>> for i in range(4): print(hex(struct.unpack('<L', os.urandom(4))[0]))
...
0xbd7b6def
0xd3ecf2e6
0xf570b955
0xe30babb6
The random module, on the other hand, is documented as follows:
However, being completely deterministic, it is not suitable for all purposes, and is completely unsuitable for cryptographic purposes.
If you really need unique records, you should go with this or the answer provided by EOL.
But assuming a truly random source, with repeats allowed, you have a 1/N chance (where N = 2**(sizeof(int)*8) = 2**32) of hitting any given item on the first draw, and therefore (2**32)**length possible outputs.
On the other hand, when requiring unique results you have at most

(2**32) * (2**32 - 1) * ... * (2**32 - length + 1) = N! / (N - length)! = (2**32)! / (2**32 - length)!

possible outputs, where ! is the factorial, not logical negation. So requiring uniqueness just decreases the randomness of the result.
This one will keep your memory OK but will probably kill your disk :)
It generates a file with the sequence of numbers from 0 to 99,999,999 (each padded to 8 digits), then randomly picks positions in it and writes them to another file. The numbers are re-organized in the first file to "delete" the ones that have already been chosen.
import random

COUNT = 100000000

# Feed the file with the numbers 0 .. COUNT-1, each stored as a
# fixed-width 8-digit record (99999999 is the largest, so 8 digits fit)
with open('file1', 'w') as f:
    i = 0
    while i < COUNT:
        f.write("{0:08d}".format(i))
        i += 1

# Fisher-Yates-style shuffle performed on disk: pick a random position,
# write its value out, then overwrite it with the current last value
with open('file1', 'r+') as f1:
    i = COUNT - 1
    with open('file2', 'w') as f2:
        while i >= 0:
            f1.seek(i * 8)
            # Read the last value
            last_val = f1.read(8)
            random_pos = random.randint(0, i)
            # Read the value at the random position
            f1.seek(random_pos * 8)
            random_val = f1.read(8)
            f2.write('ID{0},A{0}\n'.format(random_val))
            # Write the last value to this position
            f1.seek(random_pos * 8)
            f1.write(last_val)
            i -= 1

print "Done"
Related
Let's say I generate a pack, i.e., a one-dimensional array of 10 random numbers, with a random generator. Then I generate another array of 10 random numbers, and I do this X times. How can I generate unique arrays, so that even after a trillion generations no array is equal to another?
Within one array, the elements may be duplicates. The array just has to differ from every other array in at least one element.
Is there any numpy method for this? Is there some special algorithm which works differently by exploring some space for the random generation? I don’t know.
One easy answer would be to write the arrays to a file and check whether each new one was generated already, but the I/O operations on an ever-growing file take far too much time.
This is a difficult request, since one of the properties of a RNG is that it should repeat sequences randomly.
You also have the problem of trying to record terabytes of prior results. One thing you could try is to form a hash table (for search speed) of the existing arrays. Using this depends heavily on whether you have sufficient RAM to hold the entire list.
If not, you might consider disk-mapping a fast search structure of some sort. For instance, you could implement an on-disk binary tree of hash keys, re-balancing whenever you double the size of the tree (with insertions). This lets you keep the file open and find entries via seek, rather than needing to represent the full file in memory.
You could also maintain an in-memory index to the table, using that to drive your seek to the proper file section, then reading only a small subset of the file for the final search.
Does that help focus your implementation?
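To make the hash-table idea concrete, here is a minimal sketch that uses the standard sqlite3 module as the on-disk index (the database file name and the SHA-256 keying are assumptions, not part of the question):

import hashlib
import random
import sqlite3

# sqlite3's B-tree-backed UNIQUE constraint stands in for a
# hand-rolled on-disk search structure
conn = sqlite3.connect('seen_packs.db')
conn.execute('CREATE TABLE IF NOT EXISTS seen (digest BLOB PRIMARY KEY)')

def is_new_pack(pack):
    # Hash the pack so every index entry is a fixed-size key
    digest = hashlib.sha256(repr(pack).encode()).digest()
    try:
        conn.execute('INSERT INTO seen (digest) VALUES (?)', (digest,))
        conn.commit()
        return True   # never seen before
    except sqlite3.IntegrityError:
        return False  # duplicate pack

pack = [random.randint(0, 100) for _ in range(10)]
print(is_new_pack(pack))  # True the first time, False if repeated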
Assume that the 10 numbers in a pack are each in the range [0..max]. Each pack can then be considered as a 10 digit number in base max+1. Obviously, the size of max determines how many unique packs there are. For example, if max=9 there are 10,000,000,000 possible unique packs from [0000000000] to [9999999999].
The problem then comes down to generating unique numbers in the correct range.
Given your "trillions", the best way to generate guaranteed unique numbers in the range is probably to use an encryption with the correct size output. Unless you want 64-bit (DES) or 128-bit (AES) output, you will need some sort of format-preserving encryption to get output in the range you want.
For input, just encrypt the numbers 0, 1, 2, ... in turn. Encryption guarantees that, given the same key, the output is unique for each unique input. You just need to keep track of how far you have got with the input numbers. Given that, you can generate more unique packs as needed, within the limit imposed by max. After that point the output will start repeating.
Obviously as a final step you need to convert the encryption output to a 10 digit base max+1 number and put it into an array.
Important caveat:
This will not allow you to generate "arbitrarily" many unique packs. Please see the limits as highlighted by @Prune.
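To make the encryption idea concrete, here is a hypothetical toy sketch (Python 3, standard library only). The 4-round SHA-256 Feistel network is a stand-in for a real format-preserving cipher, and cycle-walking keeps the output inside the pack space; the key, round count, and pack size are all assumptions:

import hashlib

def feistel64(x, key=b'secret', rounds=4):
    # Toy balanced Feistel over 64-bit values: always a permutation,
    # whatever the round function, so unique inputs give unique outputs
    left, right = x >> 32, x & 0xFFFFFFFF
    for r in range(rounds):
        h = hashlib.sha256(key + bytes([r]) + right.to_bytes(4, 'big')).digest()
        left, right = right, left ^ int.from_bytes(h[:4], 'big')
    return (left << 32) | right

def counter_to_pack(counter, max_value=9, length=10):
    space = (max_value + 1) ** length
    x = feistel64(counter)
    while x >= space:          # cycle-walk until we land in the pack space
        x = feistel64(x)
    digits = []
    for _ in range(length):    # convert to `length` base-(max_value+1) digits
        x, d = divmod(x, max_value + 1)
        digits.append(d)
    return digits

for n in range(3):             # encrypt the counters 0, 1, 2, ...
    print(counter_to_pack(n))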
Note that as the number of requested packs approaches the number of unique packs this takes longer and longer to find a pack. I also put in a safety so that after a certain number of tries it just gives up.
Feel free to adjust:
import random

## -----------------------
## Build a unique pack generator
## -----------------------
def build_pack_generator(pack_length, min_value, max_value, max_attempts):
    existing_packs = set()

    def _generator():
        pack = tuple(random.randint(min_value, max_value) for _ in range(pack_length))
        pack_hash = hash(pack)
        attempts = 1
        while pack_hash in existing_packs:
            if attempts >= max_attempts:
                raise KeyError("Unable to find a valid pack")
            pack = tuple(random.randint(min_value, max_value) for _ in range(pack_length))
            pack_hash = hash(pack)
            attempts += 1
        existing_packs.add(pack_hash)
        return list(pack)
    return _generator

generate_unique_pack = build_pack_generator(2, 1, 9, 1000)
## -----------------------

for _ in range(50):
    print(generate_unique_pack())
The Birthday problem suggests that at some point you don't need to bother checking for duplicates. For example, if each value in a 10 element "pack" can take on more than ~250 values then you only have a 50% chance of seeing a duplicate after generating 1e12 packs. The more distinct values each element can take on the lower this probability.
You've not specified what these random values are in this question (other than being uniformly distributed) but your linked question suggests they are Python floats. Hence each number has 2**53 distinct values it can take on, and the resulting probability of seeing a duplicate is practically zero.
There are a few ways of rearranging this calculation:
1. for a given amount of state and number of iterations, what's the probability of seeing at least one collision;
2. for a given amount of state, how many iterations you can generate while staying below a given probability of seeing at least one collision;
3. for a given number of iterations and probability of seeing a collision, what state size is required.
The below Python code calculates option 3 as it seems closest to your question. The other options are available on the birthday attack page.
from math import log2, log1p

def birthday_state_size(size, p):
    # -log1p(-p) is a numerically stable version of log(1/(1-p))
    return size**2 / (2 * -log1p(-p))

log2(birthday_state_size(1e12, 1e-6))  # => ~100
So as long as you have more than 100 uniform bits of state in each pack everything should be fine. For example, two or more Python floats is OK (2 * 53), as is 10 integers with >= 1000 distinct values (10*log2(1000)).
You can of course reduce the probability down even further, but as noted in the Wikipedia article going below 1e-15 quickly approaches the reliability of a computer. This is why I say "practically zero" given the 530 bits of state provided by 10 uniformly distributed floats.
I am trying to generate random numbers that are used to generate a part of a world (I am working on world generation for a game). I could create these with something like [random.randint(0, 100) for n in range(1000)] to generate 1000 random numbers from 0 to 100, but I don't know how many numbers in a list I need. What I want is to be able to say something like random.nth_randint(0, 100, 5) which would generate the 5th random number from 0 to 100. (The same number every time as long as you use the same seed) How would I go about doing this? And if there is no way to do this, how else could I get the same behavior?
Python's random module produces deterministic pseudo random values.
In simpler words, it behaves as if it had generated a list of predetermined values when a seed is provided (or when the default seed is taken from the OS), and those values will always be the same for a given seed.
Which is basically what we want here.
So to get the nth random value, you either need to remember the generator's state for each value generated (probably just keeping track of the values themselves would be less memory-hungry), or you need to reset (reseed) the generator each time and draw n random numbers to reach yours.
import random

def randgen(a, b, n, seed=4):
    # our default seed is random in itself, as evidenced by https://xkcd.com/221/
    random.seed(seed)
    for i in range(n - 1):
        x = random.random()
    return random.randint(a, b)
If I understood your question correctly, you want the same nth number every time. You may create a class that keeps track of the generated numbers (if you use the same seed).
The main idea is that, when you ask for the nth number, it generates all the previous ones, so they are always the same across runs of the program.
import random

class myRandom():
    def __init__(self):
        self.generated = []
        # your instance of random.Random()
        self.rand = random.Random(99)

    def generate(self, nth):
        if nth < len(self.generated) + 1:
            return self.generated[nth - 1]
        else:
            for _ in range(len(self.generated), nth):
                self.generated.append(self.rand.randint(1, 100))
            return self.generated[nth - 1]
r = myRandom()
print(r.generate(1))
print(r.generate(5))
print(r.generate(10))
Using a defaultdict, you can have a structure that generates a new number on the first access of each key.
from collections import defaultdict
from random import randint
random_numbers = defaultdict(lambda: randint(0, 100))

random_numbers[5]  # 42
random_numbers[5]  # 42
random_numbers[0]  # 63
Numbers are thus lazily generated on access.
Since you are working on a game, it is likely you will then need to preserve random_numbers through interruptions of your program. You can use pickle to save your data.
import pickle

random_numbers[0]  # 24

# Save the current state
with open('random', 'wb') as f:
    pickle.dump(dict(random_numbers), f)

# Load the last saved state
with open('random', 'rb') as f:
    opened_random_numbers = defaultdict(lambda: randint(0, 100), pickle.load(f))

opened_random_numbers[0]  # 24
NumPy's new random BitGenerator interface provides an advance(delta) method on some of the BitGenerator implementations (including PCG64, the default). This function allows you to seed and then advance the generator to get the n-th random number.
From the docs:
Advance the underlying RNG as-if delta draws have occurred.
https://numpy.org/doc/stable/reference/random/bit_generators/generated/numpy.random.PCG64.advance.html#numpy.random.PCG64.advance
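A minimal usage sketch (assuming NumPy >= 1.17). Note that advance() skips raw draws of the underlying bit generator, and a bounded integers() call may consume more than one raw draw, so this indexes the raw stream rather than the exact sequence of integers() outputs:

import numpy as np

def nth_random(low, high, n, seed=0):
    # Seed PCG64, skip ahead as if n-1 draws had occurred, take one value
    bg = np.random.PCG64(seed).advance(n - 1)
    rng = np.random.Generator(bg)
    return rng.integers(low, high, endpoint=True)

print(nth_random(0, 100, 5))  # same value on every run for the same seed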
I want to generate all possible sequences of alternating digits and lowercase letters. For example:
5j1c6l2d4p9a9h9q
6d5m7w4c8h7z4s0i
3z0v5w1f3r6b2b1z
The pattern is Number-Smallletter repeated eight times (16 characters in total).
I can do it using 16 nested loops, but it will take 30+ hours (rough estimate). Is there any efficient way? I hope there is one in Python.
You can use itertools.product to generate all of the 16 long cases:
import string, itertools
i = itertools.product(string.digits, string.ascii_lowercase, repeat=8)
j = (''.join(p) for p in i)
As i is an iterator of tuples, we need to convert these all to strings (so they are in the format that you want). This is relatively straightforward, as we can just pass each tuple into a generator and join the elements together into one string.
We can see that the iterator (j) is working by calling next() on it a couple of times:
>>> next(j)
'0a0a0a0a0a0a0a0a'
>>> next(j)
'0a0a0a0a0a0a0a0b'
>>> next(j)
'0a0a0a0a0a0a0a0c'
>>> next(j)
'0a0a0a0a0a0a0a0d'
>>> next(j)
'0a0a0a0a0a0a0a0e'
>>> next(j)
'0a0a0a0a0a0a0a0f'
>>> next(j)
'0a0a0a0a0a0a0a0g'
There is no "efficient" way to do this. With 8 digit positions and 8 letter positions, there are 10**8 * 26**8 ≈ 2.09e+19 different possible combinations, or about 20.9 quintillion. If each combination is 16 characters long, storing this all in a raw text file would take up about 334 billion gigabytes, i.e. 334 million terabytes, or roughly 334 exabytes. The largest hard drive available to the average consumer is 16 TB (Samsung PM1633a). They cost 12 thousand US dollars apiece, and you would need about 21 million of them, putting the total cost of storing all of this data at roughly 250 billion US dollars (a few times Bill Gates' net worth). Even ignoring the amount of space all of these drives would take up, ignoring the cost of the hardware you would need to connect them, and generously assuming each drive could sustain 1 GB/s in a lossless RAID array, the aggregate write speed would be about 21 million gigabytes a second. Sounds like a lot, doesn't it? Even at that write speed, the file would take over four hours to write, assuming that there were no bottlenecks from CPU power, RAM speed, or the RAID controller itself.
Sources - a calculator.
import sys

n = 5
ans = [0 for i in range(26)]
all = ['a', 'b', 'A', 'Z', '0', '1']

def rec(pos, prev):
    if pos == n:
        for i in range(n):
            sys.stdout.write(str(ans[i]))
        sys.stdout.flush()
        print ""
        return
    for i in all:
        if i != prev:
            ans[pos] = i
            rec(pos + 1, i)
    return

for i in all:
    ans[0] = i
    rec(1, i)
The basic idea is backtracking. It is too slow, but the code is short. You can modify the characters in 'all' and the length of the sequences n.
If the code isn't clear to you, try simulating it by hand on some small cases.
I want to generate random numbers and store them in a list as the following:
alist = [random.randint(0, 2 ** mypower - 1) for _ in range(total)]
My concern is the following: I want to generate total = 40 million values in the range (0, 2 ** mypower - 1). If mypower = 64, then alist will be of size ~20GB (40M*64*8), which is very large for my laptop's memory. My idea is to iteratively generate chunks of values, say 5 million at a time, and save them to a file so that I don't have to generate all 40M values at once. My concern is: if I do that in a loop, is it guaranteed that random.randint(0, 2 ** mypower - 1) will not generate values that were already generated in a previous iteration? Something like this:
for i in range(num_of_chunks):
    alist = [random.randint(0, 2 ** mypower - 1) for _ in range(chunk)]
    # save to file
Well, since efficiency/speed doesn't matter, I think this will work:
s = set()
while len(s) < total:
    s.add(random.randint(0, 2 ** mypower - 1))
alist = list(s)

Since sets can only contain unique elements, I think this will work well enough.
To guarantee unique values you should avoid using random. Instead you should use an encryption. Because encryption is reversible, unique inputs guarantee unique outputs, given the same key. Encrypt the numbers 0, 1, 2, 3, ... and you will get guaranteed unique random-seeming outputs back providing you use a secure encryption. Good encryption is designed to give random-seeming output.
Keep track of the key (essential) and how far you have got. For your first batch encrypt integers 0..5,000,000. For the second batch encrypt 5,000,001..10,000,000 and so on.
You want 64 bit numbers, so use DES in ECB mode. DES is a 64-bit cipher, so the output from each encryption will be 64 bits. ECB mode does have a weakness, but that only applies with identical inputs. You are supplying unique inputs so the weakness is not relevant for your particular application.
If you need to regenerate the same numbers, just re-encrypt them with the same key. If you need a different set of random numbers (which will duplicate some from the first set) then use a different key. The guarantee of uniqueness only applies with a fixed key.
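A sketch of this scheme, assuming the third-party pycryptodome package for DES (the key and batch size here are placeholders):

import struct
from Crypto.Cipher import DES   # pip install pycryptodome

key = b'8bytekey'               # keep this key to regenerate the same sequence
cipher = DES.new(key, DES.MODE_ECB)

def unique_random_64bit(n):
    # Encrypting the counter n gives a unique 64-bit output per unique input
    block = cipher.encrypt(struct.pack('>Q', n))
    return struct.unpack('>Q', block)[0]

# The first batch would be counters 0 .. 4,999,999; shown truncated here
batch = [unique_random_64bit(n) for n in range(5)]
print(batch)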
One way to generate random values that don't repeat is first to create a list of contiguous values
l = list(range(1000))
then shuffle it:
import random
random.shuffle(l)
You could do that several times and save the results to a file, but you'll be limited to small ranges, since you'll never see the whole picture with limited memory (it's like trying to sort a big list without having the memory for it).
As someone noted, to get a wide span of random numbers you'll need a lot of memory, so it's simple but not so efficient.
Another hack I just thought of: do the same as above, but generate the range using a step. Then, in a second pass, add a random offset to each value. Even if the offsets repeat, the result is guaranteed to never contain the same number twice:
import random
step = 10
l = list(range(0,1000-step,step))
random.shuffle(l)
newlist = [x+random.randrange(0,step) for x in l]
with the required max value and number of iterations that gives:
import random
number_of_iterations = 40*10**6
max_number = 2**64
step = max_number//number_of_iterations
l = list(range(0,max_number-step,step))
random.shuffle(l)
newlist = [x+random.randrange(0,step) for x in l]
print(len(newlist),len(set(newlist)))
runs in 1-2 minutes on my laptop, and gives 40000000 distinct values (evenly scattered across the range)
Usually random number generators are not really random at all; in fact, this is quite helpful in some situations. If you want the values to be different on the second iteration, send it a different seed value.
random.seed()
The same seed will generate the same list, so if you want the next iteration to be the same, use the same seed. If you want it to be different, use a different seed.
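A quick demonstration of this determinism:

import random

random.seed(42)
first = [random.randint(0, 100) for _ in range(5)]
random.seed(42)
second = [random.randint(0, 100) for _ in range(5)]
print(first == second)  # True: same seed, same list

random.seed(43)
third = [random.randint(0, 100) for _ in range(5)]
print(first == third)   # almost certainly False: different seed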
It may need a lot of CPU and physical memory!
I suggest you classify your data.
For example, you can save:
all numbers starting with 10 that are 5 digits long (example: 10365) to 10-5.txt
all numbers starting with 11 that are 6 digits long (example: 114567) to 11-6.txt
Then, to check a new number:
For example, my number is 9256547.
It starts with 92 and is 7 digits long.
So I search 92-7.txt for this number, and if it isn't a duplicate, I add it to 92-7.txt.
Finally, you can join all the files together.
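A minimal Python 3 sketch of this bucketing idea (the directory name and the two-digit-prefix file layout are assumptions):

import os

def bucket_name(number):
    # The file name encodes the first two digits and the length, so a
    # lookup only ever touches one small file
    s = str(number)
    return "{0}-{1}.txt".format(s[:2], len(s))

def add_if_new(number, directory='buckets'):
    path = os.path.join(directory, bucket_name(number))
    seen = set()
    if os.path.exists(path):
        with open(path) as f:
            seen = set(line.strip() for line in f)
    if str(number) in seen:
        return False              # duplicate
    with open(path, 'a') as f:
        f.write(str(number) + '\n')
    return True

os.makedirs('buckets', exist_ok=True)
print(add_if_new(9256547))  # True on first insert
print(add_if_new(9256547))  # False: already present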
Sorry if I make mistakes; my main language isn't English.
I'm looking for an optimal string sequence generator that could be used as a user id generator. As far as I can tell, it should have the following features:
Length is restricted to 8 characters.
Consists of Latin letters and digits.
Public, and thus should be obfuscated.
For now I came up with the following algorithm:
from itertools import izip

def idgen(begin, end):
    assert type(begin) is int
    assert type(end) is int
    allowed = reduce(
        lambda L, ri: L + map(chr, range(ord(ri[0]), ord(ri[1]) + 1)),
        (('a', 'z'), ('0', '9')), list()
    )
    shift = lambda c, i: allowed[(allowed.index(c) + i) % len(allowed)]
    for cur in xrange(begin, end):
        cur = str(cur).zfill(8)
        cur = izip(xrange(0, len(cur)), iter(cur))
        cur = ''.join([shift(c, i) for i, c in cur])
        yield cur
But it would give pretty similar ids; for 0-100, for example:
'01234567', '01234568', '01234569', '0123456a' etc.
So what are the best practices? I suppose URL shorteners must use some kind of similar algorithm?
Since you need "obfuscated" ids, you want something that looks random but isn't. Fortunately, almost all computer-generated randomness fulfills this criterion, including that produced by the Python random module.
You can generate a fixed sequence of numbers by setting the PRNG seed, like so:
import random
import string

valid_characters = string.lowercase + string.digits

def get_random_id():
    return ''.join(random.choice(valid_characters) for x in range(8))

random.seed(0)
for x in range(10):
    print get_random_id()
This will always print the same sequence of 10 "random" ids.
You probably want to generate ids on demand, instead of all at once. To do so, you need to persist the PRNG state:
random.setstate(get_persisted())
get_random_id()
persist(random.getstate())
# repeat ad infinitum
This is somewhat expensive, so you'll want to generate a couple of random ids at once, and keep them on a queue. You could also use random.jumpahead, but I suggest you try both techniques and see which one is faster.
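For instance, here is a minimal sketch of one way to implement that persistence loop, assuming pickle and a local state file (the file name is an assumption; get_random_id() is defined above):

import os
import pickle
import random

STATE_FILE = 'prng_state.pkl'

def next_id():
    # Restore the PRNG exactly where the previous run left off
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE, 'rb') as f:
            random.setstate(pickle.load(f))
    else:
        random.seed(0)
    new_id = get_random_id()
    with open(STATE_FILE, 'wb') as f:
        pickle.dump(random.getstate(), f)
    return new_id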
Avoiding collisions for this task is not trivial. Even using hash functions will not save you from collisions.
The random module's PRNG has a period of 2**19937-1. You need a period of at least (26+10)**8, which is about 2**41. However, the fact that the period you need is lower than the period of the PRNG does not guarantee that there will be no collisions, because the size of your output (8 bytes, or 64 bits) may not match the output size of the PRNG. Also, depending on the particular implementation of random.choice, each call may advance the PRNG by more than one step, even if the sizes matched.
You could try to iterate over the keyspace, checking if something repeats:
random.seed(0)
space = len(valid_characters)**8
found = set()
x = 0
while x < space:
    id = get_random_id()
    if id in found:
        print "cycle found after", x, "iterations"
    found.add(id)
    if not x % 1000000:
        print "progress:", (float(x)/space)*100, "%"
    x += 1
But this will take a long, long while.
You could also handcraft a PRNG with the desired period, but that is a bit out of scope for Stack Overflow; you could try Math Stack Exchange.
Ultimately, and depending on your purpose, I think your best bet may be simply keeping track of already generated ids, and skipping repeated ones.
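A minimal sketch of that last approach, reusing get_random_id() from the snippets above:

seen = set()

def get_unique_id():
    while True:
        candidate = get_random_id()
        if candidate not in seen:   # skip anything generated before
            seen.add(candidate)
            return candidate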