Sequenced string PKs in Python

I'm looking for an optimal string sequence generator that could be used as a user ID generator. As far as I understand, it should have the following features:
Length is restricted to 8 characters.
Consists of Latin letters and digits.
IDs are public and thus should be obfuscated.
For now I came up with the following algorithm:
from itertools import izip

def idgen(begin, end):
    assert type(begin) is int
    assert type(end) is int
    # allowed characters: 'a'-'z' followed by '0'-'9'
    allowed = reduce(
        lambda L, ri: L + map(chr, range(ord(ri[0]), ord(ri[1]) + 1)),
        (('a', 'z'), ('0', '9')), list()
    )
    # shift a character i positions forward within the allowed alphabet
    shift = lambda c, i: allowed[(allowed.index(c) + i) % len(allowed)]
    for cur in xrange(begin, end):
        cur = str(cur).zfill(8)
        cur = izip(xrange(0, len(cur)), iter(cur))
        cur = ''.join([shift(c, i) for i, c in cur])
        yield cur
But it gives pretty similar ids; for 0-100, for example:
'01234567', '01234568', '01234569', '0123456a', etc.
So what are the best practices here? I suppose URL shorteners use some kind of similar algorithm?

Since you need "obfuscated" ids, you want something that looks random but isn't. Fortunately, almost all computer-generated randomness fulfills this criterion, including that produced by Python's random module.
You can generate a fixed sequence of numbers by setting the PRNG seed, like so:
import random
import string

valid_characters = string.lowercase + string.digits

def get_random_id():
    return ''.join(random.choice(valid_characters) for x in range(8))

random.seed(0)
for x in range(10):
    print get_random_id()
This will always print the same sequence of 10 "random" ids.
You probably want to generate ids on demand, instead of all at once. To do so, you need to persist the PRNG state:
random.setstate(get_persisted())   # get_persisted()/persist() are placeholders for your storage layer
get_random_id()
persist(random.getstate())
# repeat ad infinitum
This is somewhat expensive, so you'll want to generate a couple of random ids at once, and keep them on a queue. You could also use random.jumpahead, but I suggest you try both techniques and see which one is faster.
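For illustration, here is one possible way to combine those two ideas (persisting the state and generating a batch of ids at a time). The file name and the pickle-based storage are my own assumptions, not part of the answer above:

import pickle
import random
import string

valid_characters = string.ascii_lowercase + string.digits
STATE_FILE = 'prng_state.pickle'   # hypothetical location for the persisted PRNG state

def next_id_batch(count=1000):
    # Restore the PRNG state (or seed it on the very first run),
    # generate a batch of ids, then persist the new state.
    try:
        with open(STATE_FILE, 'rb') as f:
            random.setstate(pickle.load(f))
    except IOError:
        random.seed(0)
    ids = [''.join(random.choice(valid_characters) for _ in range(8))
           for _ in range(count)]
    with open(STATE_FILE, 'wb') as f:
        pickle.dump(random.getstate(), f)
    return ids

Each call returns the next batch of ids in the same fixed pseudo-random sequence, and the state only has to be read and written once per batch.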
Avoiding collisions for this task is not trivial. Even using hash functions will not save you from collisions.
The random module's PRNG has a period of 2**19937 - 1. You need a period of at least (26+10)**8, which is about 2**42. However, the fact that the period you need is lower than the period of the PRNG does not guarantee that there will be no collisions, because the size of your output (8 characters from a 36-symbol alphabet, about 2**41 possibilities) need not match the output size of the PRNG. Also, depending on the particular implementation of choice, it's possible that, even if the sizes matched, each call would simply advance the PRNG, with no guarantee of cycling through every value before a repeat.
You could try to iterate over the keyspace, checking if something repeats:
random.seed(0)
space = len(valid_characters)**8
found = set()
x = 0
while x < space:
    id = get_random_id()
    if id in found:
        print "cycle found after", x, "iterations"
    found.add(id)   # add(), not update(): update() would add the individual characters
    if not x % 1000000:
        print "progress:", (float(x)/space)*100, "%"
    x += 1
But this will take a long, long while.
You could also handcraft a PRNG with the desired period, but that's a bit out of scope for Stack Overflow; you could try Math Stack Exchange.
Ultimately, and depending on your purpose, I think your best bet may be simply keeping track of already generated ids, and skipping repeated ones.
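A minimal sketch of that last suggestion, assuming the set of issued ids fits in memory (in practice it would live in a database table with a unique index):

import random
import string

valid_characters = string.ascii_lowercase + string.digits

def unique_id_generator(seed=0):
    random.seed(seed)
    seen = set()   # stand-in for a persistent store of already issued ids
    while True:
        candidate = ''.join(random.choice(valid_characters) for _ in range(8))
        if candidate not in seen:
            seen.add(candidate)
            yield candidate

ids = unique_id_generator()
first_ten = [next(ids) for _ in range(10)]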

Related

How can I partition `itertools.combinations` such that I can process the results in parallel?

I have a massive quantity of combinations (86 choose 10, which yields 3.5 trillion results) and I have written an algorithm which is capable of processing 500,000 combinations per second. I would not like to wait 81 days to see the final results, so naturally I am inclined to separate this into many processes to be handled by my many cores.
Consider this naive approach:
import itertools
from concurrent.futures import ProcessPoolExecutor

def algorithm(combination):
    # returns a boolean in roughly 1/500000th of a second on average
    ...

def process(combinations):
    for combination in combinations:
        if algorithm(combination):
            # will be very rare (a few hundred times out of trillions) if that matters
            print("Found matching combination!", combination)

combination_generator = itertools.combinations(eighty_six_elements, 10)

# My system will have 64 cores and 128 GiB of memory
with ProcessPoolExecutor(max_workers=63) as executor:
    # assign 1,000,000 combinations to each process
    # it may be more performant to use larger batches (to avoid process startup overhead)
    # but eventually I need to start worrying about running out of memory
    group = []
    for combination in combination_generator:
        group.append(combination)
        if len(group) >= 1_000_000:
            executor.submit(process, group)
            group = []
This code "works", but it offers virtually no performance gain over a single-threaded approach, because it is bottlenecked by generating the combinations in the parent process (the for combination in combination_generator loop).
How can I pass this computation off to the child processes so that it can be parallelized? How can each process generate a specific subset of itertools.combinations?
p.s. I found this answer, but it only deals with generating single specified elements, whereas I need to efficiently generate millions of specified elements.
I'm the author of one answer to the question you already found for generating the combination at a given index. I'd start with that: Compute the total number of combinations, divide that by the number of equally sized subsets you want, then compute the cut-over for each of them. Then do your subprocess tasks with these combinations as bounds. Within each subprocess you'd do the iteration yourself, not using itertools. It's not hard:
def next_combination(n: int, c: list[int]):
    """Compute next combination, in lexicographical order.

    Args:
      n: the number of items to choose from.
      c: a list of integers in strictly ascending order,
         each of them between 0 (inclusive) and n (exclusive).
         It will get modified by the call.
    Returns: the list c after modification,
      or None if this was the last combination.
    """
    i = len(c)
    while i > 0:
        i -= 1
        n -= 1
        if c[i] == n: continue
        c[i] += 1
        for j in range(i + 1, len(c)):
            c[j] = c[j - 1] + 1
        return c
    return None
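The "combination at a given index" part lives in my other answer; purely to illustrate the idea of computing cut-overs, here is my own rough unranking sketch (not the code from that answer), which walks the lexicographic order with math.comb:

import math

def combination_at_index(n: int, k: int, index: int) -> list[int]:
    """Return the index-th (0-based) k-combination of range(n) in lexicographic order."""
    result, start = [], 0
    for slot in range(k):
        for x in range(start, n):
            count = math.comb(n - x - 1, k - slot - 1)
            if index < count:
                result.append(x)
                start = x + 1
                break
            index -= count
    return result

# cut-over bounds for (roughly) equally sized batches:
total = math.comb(86, 10)
batches = 63
starts = [combination_at_index(86, 10, total * b // batches) for b in range(batches)]
# each worker starts from its entry in `starts` and advances with next_combination
# until it reaches the next worker's starting combination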
Note that both the code above and the one from my other answer assume that you are looking for combinations of elements from range(n). If you want to combine other elements, do combinations for this range then use the elements of the found combinations as indices into your sequence of actual things you want to combine.
The main advantage of the approach above is that it ensures equal batch size, which might be useful if processing time is expected to be mostly determined by batch size. If processing time still varies greatly even for batches of the same size, that might be too much effort. I'll post an alternative answer addressing that.
You can do a recursive divide-and-conquer approach, where you make a decision based on the expected number of combinations. If it is small, use itertools. If it is large, handle the case of the first element being included and of it being excluded both in recursive calls.
The result does not ensure batches of equal size, but it does give you an upper bound on the size of each batch. If processing time of each batch is somewhat varied anyway, that might be good enough.
import collections.abc
import itertools
import math
import typing

T = typing.TypeVar('T')

def combination_batches(
        seq: collections.abc.Sequence[T],
        r: int,
        max_batch_size: int,
        prefix: tuple[T, ...] = ()
) -> collections.abc.Iterator[collections.abc.Iterator[tuple[T, ...]]]:
    """Compute batches of combinations.

    Each yielded value is itself a generator over some of the combinations.
    Taken together they produce all the combinations.

    Args:
      seq: The sequence of elements to choose from.
      r: The number of elements to include in each combination.
      max_batch_size: How many elements each returned iterator
        is allowed to iterate over.
      prefix: Used during recursive calls, prepended to each returned tuple.
    Yields: generators which each generate a subset of all the combinations,
      in a way that the generators together yield every combination exactly once.
    """
    if math.comb(len(seq), r) > max_batch_size:
        # One option: first element taken.
        yield from combination_batches(
            seq[1:], r - 1, max_batch_size, prefix + (seq[0],))
        # Other option: first element not taken.
        yield from combination_batches(
            seq[1:], r, max_batch_size, prefix)
        return
    yield (prefix + i for i in itertools.combinations(seq, r))
See https://ideone.com/GD6WYl for a more complete demonstration.
Note that I don't know how well the process pool executor deals with generators as arguments, whether it is able to just forward a short description of each. There is a chance that in order to ship the generator to a subprocess it will actually generate all the values. So instead of yielding the generator expression the way I did, you might want to yield some object which pickles more nicely but still offers iteration over the same values.
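One possible shape for such an object (an assumption on my part, not something I have tested against a real executor) is a small picklable dataclass that re-creates the generator on the worker side:

import itertools
from dataclasses import dataclass
from typing import Any, Iterator, Tuple

@dataclass(frozen=True)
class CombinationBatch:
    seq: Tuple[Any, ...]          # remaining elements to choose from
    r: int                        # how many elements still to pick
    prefix: Tuple[Any, ...] = ()  # elements already fixed by the recursion

    def __iter__(self) -> Iterator[Tuple[Any, ...]]:
        for tail in itertools.combinations(self.seq, self.r):
            yield self.prefix + tail

In combination_batches you would then yield CombinationBatch(tuple(seq), r, prefix) instead of the generator expression; the instance pickles cheaply because it only carries the sequence, r, and the prefix, yet iterating it on the worker produces exactly the same combinations.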

Generate all possible sequences of alternate digits and alphabets

I want to generate all possible sequences of alternating digits and lowercase letters. For example:
5j1c6l2d4p9a9h9q
6d5m7w4c8h7z4s0i
3z0v5w1f3r6b2b1z
That is, the 16-character pattern is Number Smallletter Number Smallletter ... repeated 8 times.
I can do it using 16 nested loops, but that would take 30+ hours (rough estimate). Is there any efficient way? I hope there is one in Python.
You can use itertools.product to generate all of the 16 long cases:
import string, itertools
i = itertools.product(string.digits, string.ascii_lowercase, repeat=8)
j = (''.join(p) for p in i)
As i is an iterator of tuples, we need to convert these all to strings (so they are in the format that you want). This is relatively straightforward, as we can just pass each tuple through a generator expression and join the elements together into one string.
We can see that the iterator (j) is working by calling next() on it a couple of times:
>>> next(j)
'0a0a0a0a0a0a0a0a'
>>> next(j)
'0a0a0a0a0a0a0a0b'
>>> next(j)
'0a0a0a0a0a0a0a0c'
>>> next(j)
'0a0a0a0a0a0a0a0d'
>>> next(j)
'0a0a0a0a0a0a0a0e'
>>> next(j)
'0a0a0a0a0a0a0a0f'
>>> next(j)
'0a0a0a0a0a0a0a0g'
There is no "efficient" way to do this. There are 10**8 * 26**8, i.e. roughly 2.09e+19 (about 21 quintillion), different possible strings. If each one is 16 characters long, storing them all in a raw text file would take about 3.3e+20 bytes, which is roughly 334 exabytes (334 million terabytes), ignoring newlines. The largest hard drive available to the average consumer is 16TB (Samsung PM1633a). They cost 12 thousand US dollars each, and you would need around 21 million of them, putting the cost of storing all of this data somewhere around 250 billion US dollars. Even ignoring the amount of space all of these drives would take up, ignoring the cost of the hardware you would need to connect them, and assuming they could all run at full performance together in a lossless RAID array with no bottlenecks from CPU power, RAM speed, or the RAID controller itself, writing the file would still take a substantial amount of time. You can iterate over the sequences lazily, but you cannot realistically materialize them all.
Sources - a calculator.
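For reference, the arithmetic behind the figures above (just a back-of-the-envelope check):

combinations = 10**8 * 26**8      # 8 digits alternating with 8 lowercase letters
                                  # = 20,882,706,457,600,000,000  (~2.1e19)
bytes_needed = combinations * 16  # 16 characters at 1 byte each, ignoring newlines
                                  # ~3.34e20 bytes, i.e. about 334 exabytes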
import sys

n = 5
ans = [0 for i in range(26)]
all = ['a', 'b', 'A', 'Z', '0', '1']

def rec(pos, prev):
    if (pos == n):
        for i in range(n):
            sys.stdout.write(str(ans[i]))
        sys.stdout.flush()
        print ""
        return
    for i in all:
        if (i != prev):
            ans[pos] = i
            rec(pos + 1, i)
    return

for i in all:
    ans[0] = i
    rec(1, i)
The basic idea is backtracking. It is too slow, but the code is short. You can change the characters in all and the length n of the sequences.
If the code isn't clear, try tracing it by hand on some small cases.

Spark: Generate Map Word to List of Similar Words - Need Better Performance

I am working with DNA sequence alignment, and I have a performance issue.
I need to create a dict that maps a word (a sequence of a set length) to a list of all words that are similar as decided by a separate function.
Right now, I am doing the following:
all_words_rdd = sc.parallelize([''.join(word) for word in itertools.product(all_letters, repeat=WORD_SIZE)], PARALLELISM)
all_similar_word_pairs_map = (all_words_rdd.cartesian(all_words_rdd)
                              .filter(lambda (word1, word2), scoring_matrix=scoring_matrix, threshold_value=threshold_value: areWordsSimilar((word1, word2), scoring_matrix, threshold_value))
                              .groupByKey()
                              .mapValues(set)
                              .collectAsMap())
Where areWordsSimilar obviously calculates whether the words reach a set similarity threshold.
However, this is horribly slow. It works fine with words of length 3, but once I go any higher it slows down exponentially (as you might expect). It also starts complaining about the task size being too big (again, not surprising)
I know the cartesian join is a really inefficient way to do this, but I'm not sure how to approach it otherwise.
I was thinking of starting with something like this:
all_words_rdd = (sc.parallelize(xrange(0, len(all_letters) ** WORD_SIZE))
                 .repartition(PARALLELISM)
                 ...
                 )
This would let me split the calculation across multiple nodes. However, how do I calculate this? I was thinking about doing something with bases, i.e. treating each number as a base-len(all_letters) value and reading the letters off its digits with the modulo operator.
However, this sounds horribly complicated, so I was wondering if anybody had a better way.
Thanks in advance.
EDIT
I understand that I cannot reduce the exponential complexity of the problem, that is not my goal. My goal is to break up the complexity across multiple nodes of execution by having each node perform part of the calculation. However, to do this I need to be able to derive a DNA word from a number using some process.
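For what it's worth, here is a minimal sketch of the base-conversion idea mentioned above; the helper name, alphabet, and word size are placeholders for illustration:

ALPHABET = "ATGC"   # stand-in for all_letters
WORD_SIZE = 8       # stand-in for the real word size

def word_from_number(num, alphabet=ALPHABET, word_size=WORD_SIZE):
    # Treat num as a base-len(alphabet) number; each digit selects one letter.
    letters = []
    for _ in range(word_size):
        num, digit = divmod(num, len(alphabet))
        letters.append(alphabet[digit])
    return ''.join(reversed(letters))

# word_from_number(0) == 'AAAAAAAA', word_from_number(1) == 'AAAAAAAT', ...
# so sc.parallelize(xrange(len(ALPHABET) ** WORD_SIZE)).map(word_from_number)
# would give one word per number without materializing the words on the driver.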
Generally speaking, even leaving the driver-side code aside, this looks like a hopeless task: the size of the sequence set grows exponentially, and you simply cannot win against that. Depending on how you plan to use this data, there is most likely a better approach out there.
If you still want to go ahead with this, you can start by splitting the kmer generation between the driver and the workers:
from itertools import product

def extend_kmer(n, kmer="", alphabet="ATGC"):
    """
    >>> list(extend_kmer(2))[:4]
    ['AA', 'AT', 'AG', 'AC']
    """
    tails = product(alphabet, repeat=n)
    for tail in tails:
        yield kmer + "".join(tail)

def generate_kmers(k, seed_size, alphabet="ATGC"):
    """
    >>> kmers = generate_kmers(6, 3, "ATGC").collect()
    >>> len(kmers)
    4096
    >>> sorted(kmers)[0]
    'AAAAAA'
    """
    seed = sc.parallelize([x for x in extend_kmer(seed_size, "", alphabet)])
    return seed.flatMap(lambda kmer: extend_kmer(k - seed_size, kmer, alphabet))

k = ...          # Integer
seed_size = ...  # Integer <= k

kmers = generate_kmers(k, seed_size)  # RDD of kmers
The simplest optimization you can do when it comes to searching is to drop cartesian and use a local generation:
from difflib import SequenceMatcher

def is_similar(x, y):
    """Dummy similarity check

    >>> is_similar("AAAAA", "AAAAT")
    True
    >>> is_similar("AAAAA", "TTTTTT")
    False
    """
    return SequenceMatcher(None, x, y).ratio() > 0.75

def find_similar(kmer, f=is_similar, alphabet="ATGC"):
    """
    >>> kmer, similar = find_similar("AAAAAA")
    >>> sorted(similar)[:5]
    ['AAAAAA', 'AAAAAC', 'AAAAAG', 'AAAAAT', 'AAAACA']
    """
    candidates = ("".join(x) for x in product(alphabet, repeat=len(kmer)))
    return (kmer, {x for x in candidates if f(kmer, x)})

similar_map = kmers.map(find_similar)  # one (kmer, set-of-similar-kmers) pair per kmer
It is still an extremely naive approach, but it doesn't require expensive data shuffling.
The next thing you can try is to improve the search strategy. It can be done either locally, as above, or globally using joins.
In both cases you need a smarter approach than checking all possible kmers. The first thing that comes to mind is to use seed kmers taken from a given word. In the local mode these can be used as a starting point for candidate generation; in the global mode, as a join key (optionally combined with hashing).
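For example, if "similar" can be expressed as a small Hamming distance (an assumption about areWordsSimilar, not something stated in the question), the candidates for each kmer can be generated directly instead of filtering all len(alphabet)**k words:

from itertools import combinations, product

def hamming_neighbors(kmer, d=1, alphabet="ATGC"):
    # All words differing from kmer in at most d positions (kmer itself included).
    neighbors = set()
    for positions in combinations(range(len(kmer)), d):
        for letters in product(alphabet, repeat=d):
            candidate = list(kmer)
            for pos, letter in zip(positions, letters):
                candidate[pos] = letter
            neighbors.add(''.join(candidate))
    return neighbors

similar_map = kmers.map(lambda kmer: (kmer, hamming_neighbors(kmer)))

For d=1 this is len(kmer) * (len(alphabet) - 1) + 1 candidates per kmer instead of len(alphabet)**len(kmer).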

Influence of choosing string as seed of random on the output

In Python, in order to permute the letters of a string one can write
import random

random.seed(str_key)
length = range(len(original_str))
random.shuffle(length)
permuted = "".join([original_str[idx] for idx in length])
I wonder what seed does with the key string and how the permutation is produced from it (or rather, how it tells shuffle what to do). For example, why does the key 'hGd' give one specific output while another key like 'AGd' gives a different one?
EDIT: The decryption algorithm I tried to use on that code is:
for key in itertools.product(*3*[string.ascii_letters]):
    indices = range(len(enc_msg))
    list_encrypted_msg = list(enc_msg)
    random.seed(key)
    random.shuffle(indices)
    decrypted = ""
    for idx in indices[::-1]:
        decrypted += list_encrypted_msg[idx]
    try:
        if not decrypted.index("The"):
            print decrypted
    except ValueError:
        continue
return "not found!"
What seed() does with its argument is to pass it to the built-in hash() function, which converts it to a 32 bit signed integer, in other words a number in the range -2,147,483,648 to 2,147,483,647. That number is then used as the starting number by the pseudo-random integer generator (by default, the Mersenne Twister algorithm) that is the heart of the standard random functions.
Each time a pseudo-random number generator (PRNG) is called it does a particular arithmetic operation on its current number to produce a new number. It may return that number as is, or it may return a modified version of that number. See Wikipedia for a simple type of PRNG.
With a good PRNG it is very hard to predict what the next number in the sequence is going to be, and Mersenne Twister is quite good. So it's not easy to predict the effect that different seeds will have on the output.
BTW, you can pass seed() any kind of hashable object. So it can be passed an int, string, tuple, etc, but not a list. But as I said above, whatever you pass it, it gets converted to a number.
Update: In recent versions of Python, random.seed takes an optional version arg: version 1 works as I described above, but with version 2 (the default in Python 3.2+) a str, bytes, or bytearray object gets converted to an int and all of its bits are used.
And I guess I should mention that if you call seed() without a seed value it uses the system entropy pool to generate a seed value, and if the system doesn't provide an entropy pool (which is unlikely, except for extremely tiny or old embedded systems), it uses the current time as its seed value.
The Mersenne Twister algorithm has a period of 2**19937 - 1, which is a number of about 6000 decimal digits. So it takes a very long time before the cycle of integers it produces repeats exactly. Of course, individual integers and sub-sequences of integers will repeat much sooner. And a cryptographic attack on it only needs 624 (full) outputs to determine the position in the cycle. Python's version of Mersenne Twister doesn't actually return the integers it calculates, it converts them to 53 bit floating-point numbers.
Please see the Wikipedia article on the Mersenne Twister if you're curious to know how it works. Mersenne Twister was very impressive when it was first published, but there are now superior RNGs that are faster, more efficient, and have better statistical properties, eg the PCG family. We don't have PCG in the Python standard library yet, but PCG is now the default PRNG in Numpy.
FWIW, here's a slightly improved version of your program.
import random
#Convert string to list
msg = list(original_str)
random.seed(str_key)
random.shuffle(msg)
print "".join(msg)
Now, onto your decryption problem. :) Is the message you have to decrypt merely scrambled, as by the program above, or does it use some other form of encryption? If it's merely scrambled, it will be relatively easy to unscramble. So unless you tell me otherwise, I shall assume that to be the case.
You said that the key length is 3. Is the key purely alphabetic, or can the 3 characters in the key be anything in the range chr(0) to chr(255)? Either way, that's not really a lot of keys to check, and a Python program will be able to unscramble the message using a brute-force search of all keys in well under a minute.
To iterate over all possible keys, you can do this:
from itertools import product
from string import ascii_letters
for k in product(*3*[ascii_letters]):
str_key = ''.join(k)
I used product() in that code because we want to generate all possible strings of 3 ascii letters, so we want the Cartesian product of 3 copies of ascii_letters. 3*[ascii_letters] is equivalent to [ascii_letters, ascii_letters, ascii_letters] and putting * in front unpacks that list so that product() gets 3 separate args. If we use permutations() then we don't get any strings with repeated characters. To illustrate:
>>> import itertools
>>> s='abc'
>>> [''.join(u) for u in itertools.permutations(s, 3)]
['abc', 'acb', 'bac', 'bca', 'cab', 'cba']
>>> [''.join(u) for u in itertools.product(*3*[s])]
['aaa', 'aab', 'aac', 'aba', 'abb', 'abc', 'aca', 'acb', 'acc',
'baa', 'bab', 'bac', 'bba', 'bbb', 'bbc', 'bca', 'bcb', 'bcc',
'caa', 'cab', 'cac', 'cba', 'cbb', 'cbc', 'cca', 'ccb', 'ccc']
Update: product takes a repeat keyword arg, so we can simplify that to itertools.product(s, repeat=3).
......
I thought you said that the string to be decoded has 42 chars, but there are only 40 chars in euTtSa:0 kty1h a0 nlradstara atlot 5wtic. Also, the appearance of the digits 0 & 5 in that string is a bit of a worry, although I guess the original unscrambled version could have digits in it...
Anyway, I just tried unscrambling that string using the shuffle algorithm with all possible 140608 3 letter keys and printing the permutations produced that begin with The. There are only 5 of them, and only one of those had a space after The. But in every case the rest of the unscrambled string is garbage. My guess is that you've misunderstood the encryption algorithm that your lecturer used.
Just in case you were wondering how random.shuffle() works, you can see Python source code here; the C code for the random module is here.
It's the Fisher-Yates shuffle, which is like a randomized version of one pass through a selection sort.
Another cute method that's sometimes seen is to sort the list using a random comparison function, or a random key function. Eg (here s is a list of the letters 'a' to 'g' from earlier in the session):
>>> import random
>>> random.seed(42)
>>> for i in range(10):
... s.sort(key=lambda i:random.random())
... print ''.join(s)
...
gabecdf
dbgfeac
agbfdce
cebdgaf
fgedbca
afbecgd
bcaegfd
aebcdfg
bacgfed
fgdebca
However, this shuffling technique is relatively slow, and it has bad statistical properties. So please do not use it! The Fisher-Yates technique used by random.shuffle() is (pretty much) the optimal shuffling algorithm.
Reversing the shuffling procedure for a given key
Let's look at what happens when we shuffle a simple range.
from random import seed, shuffle
r = range(5)
key = 'zap'
seed(key)
shuffle(r)
After the shuffle, r will be
[2, 4, 1, 3, 0]
So to unshuffle r we need to build this list:
[r[4], r[2], r[0], r[3], r[1]]
Can you see how to do that? If you can't figure it out, I'm happy to post my code, but I think you should spend a little bit of time trying to figure it out first. Hint: Don't try to do it in a list comprehension, just use a for loop.
Ok. You've struggled with this long enough. Here's my decoder.
#! /usr/bin/env python

''' Unscramble a string of text by brute force

    From http://stackoverflow.com/questions/26248379/influence-of-choosing-string-as-seed-of-random-on-the-output
'''

import sys
from random import seed, shuffle
from itertools import product
from string import ascii_letters

def scramble(seq, key):
    seed(key)
    msg = seq[:]
    shuffle(msg)
    return msg

def scramble_old(seq, key):
    seed(key)
    r = range(len(seq))
    shuffle(r)
    return [seq[i] for i in r]

def unscramble(seq, key):
    seed(key)
    r = range(len(seq))
    shuffle(r)
    newseq = len(seq) * [None]
    for i, j in enumerate(r):
        newseq[j] = seq[i]
    return newseq

def test():
    key = 'zap'
    #orig = 'quickbrownfox'
    orig = '01234'
    print 'orig:       ', orig
    shuf = scramble(list(orig), key)
    print 'shuffled:   ', ''.join(shuf)
    unshuf = unscramble(shuf, key)
    print 'unshuffled: ', ''.join(unshuf)

def decode(seq, begin):
    count = 0
    begin_len = len(begin)
    for k in product(*3*[ascii_letters]):
        key = ''.join(k)
        dup = seq[:]
        newseq = unscramble(dup, key)
        if newseq[:begin_len] == begin:
            count += 1
            print '%s: [%s] %s' % (key, ''.join(newseq), count)
            #print ' [%s]\n' % ''.join(scramble(newseq, key))

def main():
    original_str = 'euTtSa:0 kty1h a0 nlradstara atlot 5wtic'.lower()
    original_list = list(original_str)
    print ' [%s], %d\n' % (original_str, len(original_str))
    decode(original_list, begin=list('the'))

if __name__ == '__main__':
    #test()
    main()
I'm going to start by noting that the code you posted is a little confusing! message is never defined, but I assume you mean the original string by that. So we're good. Your question is also confusing. Are you asking what random number generator seeds are? Because that's easy enough. But if you're asking how to get the output of an 'AGd'-seeded generator using an 'hGd'-seeded generator, that should probably be impossible, because 'hGd' is not a permutation of 'AGd' and vice versa; they're simply not in the same set of permutations! But supposing they were, if you're asking how many iterations you would need to get the same output (a collision, in other words), that would depend on the implementation and the algorithm and whatnot. Maybe it'd be worth looking into the details of Python's random module; I'll admit I don't personally know it.
But as for seeding: we could, for example, write a pseudorandom number generator by iteratively applying the equation y = 5*x + 77 mod 100 (where of course this one would be pretty garbage as far as random number generators go). The output on each call is the equation applied to the previous value. But obviously this specifies a whole class of generators, depending on what the initial value of x is! That's all a random seed usually is: the value of x that starts off the whole process. Now, checking the documentation here: https://docs.python.org/2/library/random.html I see that random seeds can actually be any hashable object. Consequently, if you feed in a string, the first thing it does is apply some hash function to the string in order to get a suitable seed for the pseudorandom number generator, where, of course, "suitable" is relative to the implementation of the specific pseudorandom number generator.
If I've misunderstood your question, please accept my humble apologies.
EDIT: PM 2Ring beat me to it, and his answer is better. See his.

Generate big random sequence of unique numbers [duplicate]

This question already has answers here:
Unique (non-repeating) random numbers in O(1)?
(22 answers)
Closed 9 years ago.
I need to fill a file with a lot of records identified by a number (test data). The number of records is very big, and the ids should be unique and the order of records should be random (or pseudo-random).
I tried this:
# coding: utf-8
import random
COUNT = 100000000
random.seed(0)
file_1 = open('file1', 'w')
for i in random.sample(xrange(COUNT), COUNT):
file_1.write('ID{0},A{0}\n'.format(i))
file_1.close()
But it's eating all of my memory.
Is there a way to generate a big shuffled sequence of integers that are consecutive (not strictly necessary, but it would be nice; otherwise just unique), using a generator, without keeping the whole sequence in RAM?
If you have 100 million numbers like in the question, then this is actually manageable in-memory (it takes about 0.5 GB).
As DSM pointed out, this can be done with the standard modules in an efficient way:
>>> import array
>>> a = array.array('I', xrange(10**8)) # a.itemsize indicates 4 bytes per element => about 0.5 GB
>>> import random
>>> random.shuffle(a)
It is also possible to use the third-party NumPy package, which is the standard Python tool for managing arrays in an efficient way:
>>> import numpy
>>> ids = numpy.arange(100000000, dtype='uint32') # 32 bits is enough for numbers up to about 4 billion
>>> numpy.random.shuffle(ids)
(this is only useful if your program already uses NumPy, as the standard module approach is about as efficient).
Both methods take about the same amount of time on my machine (maybe one minute for the shuffling), but the 0.5 GB they use is not too big for current computers.
PS: There are too many elements for the shuffling to be truly random, because the number of possible permutations is far larger than the period of the random generators used. In other words, Python can produce fewer distinct shuffles than the number of possible orderings!
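If even the 0.5 GB is too much, the idea from the linked duplicate ("Unique (non-repeating) random numbers in O(1)?") can be sketched with a full-period linear congruential generator. This is only an illustration, not something from the answers above: the constants are a common textbook choice, and the ordering it produces is only weakly scrambled.

def shuffled_range(n, a=1664525, c=1013904223, start=0):
    # Full-period LCG modulo the next power of two >= n, skipping values >= n,
    # so every integer in range(n) is yielded exactly once without storing them.
    m = 1
    while m < n:
        m <<= 1
    x = start % m
    for _ in range(m):
        x = (a * x + c) % m
        if x < n:
            yield x

with open('file1', 'w') as f:
    for i in shuffled_range(100000000):
        f.write('ID{0},A{0}\n'.format(i))

Because the LCG has full period modulo a power of two (c is odd and a - 1 is divisible by 4), each value in range(n) appears exactly once, and the generator needs only O(1) memory.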
Maybe something like (won't be consecutive, but will be unique):
from uuid import uuid4

def unique_nums():  # Not strictly unique, but *practically* unique
    while True:
        yield int(uuid4().hex, 16)
        # alternative: yield uuid4().int

unique_num = unique_nums()
next(unique_num)
next(unique_num)  # etc...
You can fetch random ints easily by reading (on Linux) /dev/urandom, or by using os.urandom() and struct.unpack():
Return a string of n random bytes suitable for cryptographic use.
This function returns random bytes from an OS-specific randomness source. The returned data should be unpredictable enough for cryptographic applications, though its exact quality depends on the OS implementation. On a UNIX-like system this will query /dev/urandom, and on Windows it will use CryptGenRandom. If a randomness source is not found, NotImplementedError will be raised.
>>> import os, struct
>>> for i in range(4): print( hex( struct.unpack('<L', os.urandom(4))[0]))
...
0xbd7b6def
0xd3ecf2e6
0xf570b955
0xe30babb6
The random module, on the other hand, warns in its documentation:
However, being completely deterministic, it is not suitable for all purposes, and is completely unsuitable for cryptographic purposes.
If you really need unique records, you should go with this or the answer provided by EOL.
But assuming a truly random source that may repeat values, you have a 1/N chance (where N = 2 ** (sizeof(int)*8) = 2 ** 32) of hitting any given item on the first guess, and there are (2**32) ** length possible output sequences.
If, on the other hand, you require unique values, the number of possible output sequences is at most
the product over i = 0 .. length-1 of (2**32 - i)
= n! / (n - length)!  with n = 2**32
= (2**32)! / (2**32 - length)!
where ! is the factorial, not logical negation. So requiring uniqueness just decreases the randomness of the result.
This one will keep your memory OK but will probably kill your disk :)
It generates a file with the numbers from 0 to COUNT-1 as fixed-width records, then randomly picks positions in it and writes the values to another file. The numbers are re-organized in the first file to "delete" the ones that have already been chosen.
import random

COUNT = 100000000

# Feed the file with the numbers 0 .. COUNT-1, 8 digits each
with open('file1', 'w') as f:
    i = 0
    while i < COUNT:
        f.write("{0:08d}".format(i))
        i += 1

with open('file1', 'r+') as f1:
    i = COUNT - 1
    with open('file2', 'w') as f2:
        while i >= 0:
            f1.seek(i*8)
            # Read the last val
            last_val = f1.read(8)
            random_pos = random.randint(0, i)
            # Read random pos
            f1.seek(random_pos*8)
            random_val = f1.read(8)
            f2.write('ID{0},A{0}\n'.format(random_val))
            # Write the last value to this position
            f1.seek(random_pos*8)
            f1.write(last_val)
            i -= 1
print "Done"
