Influence of choosing string as seed of random on the output - python

In Python, in order to permute the letters of a string one can write

import random

random.seed(str_key)
indices = range(len(original_str))
random.shuffle(indices)
shuffled = "".join([original_str[idx] for idx in indices])
I wonder what does seed do with the key string and how it produces from it the permutation (or says shuffle how to do that). For example if I take the key to be 'hGd' how I get that specific output while if I write another key like 'AGd' I get another output?
EDIT: The decryption algorithm I tried to use on that code is:

for key in itertools.product(*3*[string.ascii_letters]):
    indices = range(len(enc_msg))
    list_encrypted_msg = list(enc_msg)
    random.seed(key)
    random.shuffle(indices)
    decrypted = ""
    for idx in indices[::-1]:
        decrypted += list_encrypted_msg[idx]
    try:
        if not decrypted.index("The"):
            print decrypted
    except ValueError:
        continue
return "not found!"

What seed() does with its argument is to pass it to the built-in hash() function, which converts it to a 32 bit signed integer, in other words a number in the range -2,147,483,648 to 2,147,483,647. That number is then used as the starting number by the pseudo-random integer generator (by default, the Mersenne Twister algorithm) that is the heart of the standard random functions.
Each time a pseudo-random number generator (PRNG) is called it does a particular arithmetic operation on its current number to produce a new number. It may return that number as is, or it may return a modified version of that number. See Wikipedia for a simple type of PRNG.
With a good PRNG it is very hard to predict what the next number in the sequence is going to be, and Mersenne Twister is quite good. So it's not easy to predict the effect that different seeds will have on the output.
BTW, you can pass seed() any kind of hashable object. So it can be passed an int, string, tuple, etc, but not a list. But as I said above, whatever you pass it, it gets converted to a number.
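A quick way to see that determinism in action (I'm deliberately not showing the actual numbers, since they differ between Python versions):

```python
import random

# Same seed => same sequence; a different seed (almost always) diverges.
random.seed('hGd')
a = [random.random() for _ in range(3)]
random.seed('hGd')
b = [random.random() for _ in range(3)]
random.seed('AGd')
c = [random.random() for _ in range(3)]
assert a == b
assert a != c
```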
Update: In recent versions of Python, random.seed takes an optional version arg: with version 1 it works as I described above, but with version 2 (the default in Python 3.2+) a str, bytes, or bytearray object gets converted to an int using all of its bits.
And I guess I should mention that if you call seed() without a seed value it uses the system entropy pool to generate a seed value, and if the system doesn't provide an entropy pool (which is unlikely, except for extremely tiny or old embedded systems), it uses the current time as its seed value.
The Mersenne Twister algorithm has a period of 2**19937 - 1, which is a number of about 6000 decimal digits. So it takes a very long time before the cycle of integers it produces repeats exactly. Of course, individual integers and sub-sequences of integers will repeat much sooner. And a cryptographic attack on it only needs 624 (full) outputs to determine the position in the cycle. Python's version of Mersenne Twister doesn't actually return the integers it calculates, it converts them to 53 bit floating-point numbers.
Please see the Wikipedia article on the Mersenne Twister if you're curious to know how it works. Mersenne Twister was very impressive when it was first published, but there are now superior RNGs that are faster, more efficient, and have better statistical properties, eg the PCG family. We don't have PCG in the Python standard library yet, but PCG is now the default PRNG in Numpy.
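For example, with NumPy (assuming a version >= 1.17, where the Generator API exists), the modern entry point hands you a PCG64-backed generator by default:

```python
import numpy as np

# default_rng returns a Generator backed by PCG64, not Mersenne Twister.
rng = np.random.default_rng(12345)
samples = rng.random(3)  # three floats in [0.0, 1.0)
```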
FWIW, here's a slightly improved version of your program.
import random
#Convert string to list
msg = list(original_str)
random.seed(str_key)
random.shuffle(msg)
print "".join(msg)
Now, onto your decryption problem. :) Is the message you have to decrypt merely scrambled, as by the program above, or does it use some other form of encryption? If it's merely scrambled, it will be relatively easy to unscramble. So unless you tell me otherwise, I shall assume that to be the case.
You said that the key length is 3. Is the key purely alphabetic, or can the 3 characters in the key be anything in the range chr(0) to chr(255)? Either way, that's not really a lot of keys to check, and a Python program will be able to unscramble the message using a brute-force search of all keys in less than a minute.
To iterate over all possible keys, you can do this:
from itertools import product
from string import ascii_letters

for k in product(*3*[ascii_letters]):
    str_key = ''.join(k)
I used product() in that code because we want to generate all possible strings of 3 ascii letters, so we want the Cartesian product of 3 copies of ascii_letters. 3*[ascii_letters] is equivalent to [ascii_letters, ascii_letters, ascii_letters] and putting * in front unpacks that list so that product() gets 3 separate args. If we use permutations() then we don't get any strings with repeated characters. To illustrate:
>>> import itertools
>>> s='abc'
>>> [''.join(u) for u in itertools.permutations(s, 3)]
['abc', 'acb', 'bac', 'bca', 'cab', 'cba']
>>> [''.join(u) for u in itertools.product(*3*[s])]
['aaa', 'aab', 'aac', 'aba', 'abb', 'abc', 'aca', 'acb', 'acc',
'baa', 'bab', 'bac', 'bba', 'bbb', 'bbc', 'bca', 'bcb', 'bcc',
'caa', 'cab', 'cac', 'cba', 'cbb', 'cbc', 'cca', 'ccb', 'ccc']
Update: product takes a repeat keyword arg, so we can simplify that to itertools.product(s, repeat=3).
......
I thought you said that the string to be decoded has 42 chars, but there are only 40 chars in euTtSa:0 kty1h a0 nlradstara atlot 5wtic. Also, the appearance of the digits 0 & 5 in that string is a bit of a worry, although I guess the original unscrambled version could have digits in it...
Anyway, I just tried unscrambling that string using the shuffle algorithm with all possible 140608 3 letter keys and printing the permutations produced that begin with The. There are only 5 of them, and only one of those had a space after The. But in every case the rest of the unscrambled string is garbage. My guess is that you've misunderstood the encryption algorithm that your lecturer used.
Just in case you were wondering how random.shuffle() works, you can see Python source code here; the C code for the random module is here.
It's the Fisher-Yates shuffle, which is like a randomized version of one pass through a selection sort.
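If you want the idea without reading the source, here's a minimal sketch of Fisher-Yates (the function name is mine; random.shuffle's actual implementation differs in details):

```python
import random

def fisher_yates(seq):
    # Walk backwards; swap each slot with a uniformly random slot at or
    # before it. This makes all len(seq)! orderings equally likely.
    for i in range(len(seq) - 1, 0, -1):
        j = random.randrange(i + 1)
        seq[i], seq[j] = seq[j], seq[i]
    return seq
```

Calling fisher_yates(list('abcdefg')) gives back some permutation of those letters, in place.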
Another cute method that's sometimes seen is to sort the list using a random comparison function, or a random key function. Eg

>>> import random
>>> random.seed(42)
>>> s = list('abcdefg')
>>> for i in range(10):
...     s.sort(key=lambda i: random.random())
...     print ''.join(s)
...
gabecdf
dbgfeac
agbfdce
cebdgaf
fgedbca
afbecgd
bcaegfd
aebcdfg
bacgfed
fgdebca
However, this shuffling technique is relatively slow, and it has bad statistical properties. So please do not use it! The Fisher-Yates technique used by random.shuffle() is (pretty much) the optimal shuffling algorithm.
Reversing the shuffling procedure for a given key
Let's look at what happens when we shuffle a simple range.
from random import seed, shuffle
r = range(5)
key = 'zap'
seed(key)
shuffle(r)
After the shuffle, r will be
[2, 4, 1, 3, 0]
So to unshuffle r we need to build this list:
[r[4], r[2], r[0], r[3], r[1]]
Can you see how to do that? If you can't figure it out, I'm happy to post my code, but I think you should spend a little bit of time trying to figure it out first. Hint: Don't try to do it in a list comprehension, just use a for loop.
Ok. You've struggled with this long enough. Here's my decoder.
#! /usr/bin/env python

''' Unscramble a string of text by brute force

From http://stackoverflow.com/questions/26248379/influence-of-choosing-string-as-seed-of-random-on-the-output
'''

import sys
from random import seed, shuffle
from itertools import product
from string import ascii_letters

def scramble(seq, key):
    seed(key)
    msg = seq[:]
    shuffle(msg)
    return msg

def scramble_old(seq, key):
    seed(key)
    r = range(len(seq))
    shuffle(r)
    return [seq[i] for i in r]

def unscramble(seq, key):
    seed(key)
    r = range(len(seq))
    shuffle(r)
    newseq = len(seq) * [None]
    for i, j in enumerate(r):
        newseq[j] = seq[i]
    return newseq

def test():
    key = 'zap'
    #orig = 'quickbrownfox'
    orig = '01234'
    print 'orig:       ', orig
    shuf = scramble(list(orig), key)
    print 'shuffled:   ', ''.join(shuf)
    unshuf = unscramble(shuf, key)
    print 'unshuffled: ', ''.join(unshuf)

def decode(seq, begin):
    count = 0
    begin_len = len(begin)
    for k in product(*3*[ascii_letters]):
        key = ''.join(k)
        dup = seq[:]
        newseq = unscramble(dup, key)
        if newseq[:begin_len] == begin:
            count += 1
            print '%s: [%s] %s' % (key, ''.join(newseq), count)
            #print ' [%s]\n' % ''.join(scramble(newseq, key))

def main():
    original_str = 'euTtSa:0 kty1h a0 nlradstara atlot 5wtic'.lower()
    original_list = list(original_str)
    print ' [%s], %d\n' % (original_str, len(original_str))
    decode(original_list, begin=list('the'))

if __name__ == '__main__':
    #test()
    main()

I'm going to start by noting that the code you posted is a little confusing! message is never defined, but I assume you mean the original string, so we're good. Your question is also confusing. Are you asking what random number generator seeds are? Because that's easy enough. But if you're asking how to get the output of an 'AGd'-seeded generator using an 'hGd'-seeded generator, that should probably be impossible, because 'hGd' is not a permutation of 'AGd' and vice versa; they're simply not in the same set of permutations! But supposing they were, if you're asking how many iterations you would need to get the same output (or a collision, in other words), that would depend on the implementation and the algorithm and whatnot. Maybe it'd be worth looking into the details of Python's random module; I'll admit I don't personally know that.
But as for seeding: we could, for example, write a pseudorandom number generator by iteratively applying the equation y = 5*x + 77 (mod 100) (where of course this one would be pretty garbage as far as random number generators go). The output on each call will be the equation applied to the input. But obviously this specifies a whole class of generators, depending on what the initial value of x is! That's all a random seed usually is: the value of x that starts off the whole process. Now, checking the documentation here: https://docs.python.org/2/library/random.html I see that random seeds can actually be any hashable object. Consequently, if you feed in a string, the first thing it does is apply some hash function to the string in order to get a seed suitable for the pseudorandom number generator, where, of course, "suitable" is relative to the implementation of the specific pseudorandom number generator.
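That toy generator, sketched as code (the name lcg and the packaging into a function are mine):

```python
def lcg(seed, n, a=5, c=77, m=100):
    # Iterate x -> (a*x + c) % m, collecting n outputs.
    # Same seed, same sequence: that's the whole point of seeding.
    values = []
    x = seed
    for _ in range(n):
        x = (a * x + c) % m
        values.append(x)
    return values
```

So lcg(1, 3) gives [82, 87, 12], and calling it again with seed 1 gives exactly the same list, while a different seed starts the cycle elsewhere.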
If I've misunderstood your question, please accept my humble apologies.
EDIT: PM 2Ring beat me to it, and his answer is better. See his.

Related

why is my iterator implementation very inefficient?

I wrote the following python script to count the number of occurrences of a character (a) in the first n characters of an infinite string.
from itertools import cycle

def count_a(str_, n):
    count = 0
    str_ = cycle(str_)
    for i in range(n):
        if next(str_) == 'a':
            count += 1
    return count
My understanding of iterators is that they are supposed to be efficient, but this approach is super slow for very large n. Why is this so?
The cycle iterator might not be as efficient as you think; the documentation says:
Make an iterator returning elements from the iterable and saving a
copy of each.
When the iterable is exhausted, return elements from the saved copy.
Repeats indefinitely
...Note, this member of the toolkit may require significant auxiliary
storage (depending on the length of the iterable).
Why not simplify and just not use the iterator at all? It adds unnecessary overhead and gives you no benefit. For n <= len(str_) you can count the occurrences with a simple str_[:n].count('a'); for larger n you'd first need to account for the repeated copies of str_.
The first problem here is that despite using itertools, you're still doing explicit python-level for loop. To gain the C level speed boost when using itertools you want to keep all the iteration in the high speed itertools.
So let's do this step by step, first we want to get the number of characters in a finite string. To do this, you can use the itertools.islice method to get the first n characters in the string:
str_first_n_chars = islice(cycle(str_), n)
You next want to count the number of occurrences of the letter (a), to do this you can do some variation of either of these (you may want to experiment which variants is faster):
count_a = sum(1 for c in str_first_n_chars if c == 'a')
count_a = len(tuple(filter('a'.__eq__, str_first_n_chars)))
This is all well and good, but it is still slow for really large n (like n = 10**10000), because you need to iterate through str_ many, many times. In other words, this algorithm is O(n).
There's one last improvement we could make. Notice that the number of (a)'s in str_ never changes from one cycle to the next. Rather than iterating through str_ multiple times for large n, we can be a little smarter with a bit of math, so that we only need to examine str_ twice. First we count the number of (a)'s in a single stretch of str_:
count_a_single = str_.count('a')
Then we find out how many full copies of str_ fit into length n by using the divmod function:
iter_count, remainder = divmod(n, len(str_))
then we can just multiply iter_count by count_a_single and add the number of (a)'s in the remaining length. We don't need cycle or islice here, because remainder < len(str_):
count_a = iter_count * count_a_single + str_[:remainder].count('a')
With this method, the runtime performance of the algorithm grows only on the length of a single cycle of str_ rather than n. In other words, this algorithm is O(len(str_)).
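Putting those pieces together, the whole thing might look like this (a sketch, keeping the original function name):

```python
def count_a(str_, n):
    # Occurrences of 'a' in the first n characters of str_ repeated forever:
    # count the full copies with divmod, then the leftover prefix.
    full_cycles, remainder = divmod(n, len(str_))
    return full_cycles * str_.count('a') + str_[:remainder].count('a')
```

For example, count_a("abca", 10) counts the a's in "abcaabcaab", which is 5, without ever building that string.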

Use Random.random [30,35] to print random numbers between 30,35. Seed(70)

Probably a simple answer, not sure what I am missing. For a homework assignment I have to use random.random() to generate numbers between 30 and 35. The seed has to be set to 70 to match the pseudo-random numbers with the grader. This wasn't in my lecture so I am a little stumped as to what to do.
I have:
import random
def problem2_4():
print(random.random(30,35))
But this is clearly wrong.
The assignment says the output should look like (note: for the problem i use def problem2_4() just for the assignment grading system)
problem2_4()
[34.54884618961936, 31.470395203793395, 32.297169396656095, 30.681793552717807,
34.97530360173135, 30.773219981037737, 33.36969776732032, 32.990127772708405,
33.57311858494461, 32.052629620057274]
The output [blah, blah, blah] indicates that it is a list of numbers rather than a series of numbers printed one-by-one.
In addition, if you want random floating point values, you'll need to transform the numbers from random.random (which are zero to one) into that range.
That means you'll probably need something like:
import random                               # Need this module.

def problem2_4():
    random.seed(70)                         # Set initial seed.
    nums = []                               # Start with empty list.
    for _ in range(10):                     # Will add ten values.
        nums += [random.random() * 5 + 30]  # Add one value in desired range.
    print(nums)                             # Print resultant list.
Of course, the Pythonic way to do this would be:
import random
random.seed(70)
print([random.random() * 5 + 30 for _ in range(10)])
But that might be a bit ahead of where your educator is working. Still, it's good to learn this stuff as early as possible, since you'll never be a Pythonista until you do :-)
The function random.randint returns an integer (whole number) between the specified minimum and maximum values:
random.randint(30, 35)
Note, though, that the expected output shown above consists of floats, so for this particular assignment randint won't match the grader; you'll still need random.random().
Python Docs

how can I verify that this hash function is not gonna give me same result for two diiferent strings?

Consider two different strings to be of same length.
I am implementing the Rabin-Karp algorithm and using the hash function below:
def hs(pat):
    l = len(pat)
    pathash = 0
    for x in range(l):
        pathash += ord(pat[x]) * prime**x  # prime is a global variable equal to 101
    return pathash
It's a hash. There's, by definition, no guarantee there will be no collisions - otherwise, the hash would have to be as long as the hashed value, at least.
The idea behind what you're doing is based in number theory: powers of a number that is coprime to the size of your finite group (which probably the original author meant to be something like 2^N) can give you any number in that finite group, and it's hard to tell which one these were.
Sadly, the interesting part of this hash function, namely the size-limiting/modulo operation of the hash, has been left out of this code – which makes one wonder where your code comes from. As far as I can immediately see, it has little to do with Rabin-Karp.
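For reference, the usual shape of that hash with the missing modulo step included (the modulus 2**31 - 1 here is just an illustrative assumption, not something from your code):

```python
def hs(pat, prime=101, mod=2**31 - 1):
    # Polynomial hash reduced mod a large prime so it stays bounded,
    # which is what makes the rolling-update trick in Rabin-Karp work.
    pathash = 0
    for x, ch in enumerate(pat):
        pathash = (pathash + ord(ch) * pow(prime, x, mod)) % mod
    return pathash
```

Distinct strings can still collide, of course; the modulus only keeps the value a fixed size, so Rabin-Karp always re-checks candidate matches character by character.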

Spark: Generate Map Word to List of Similar Words - Need Better Performance

I am working with DNA sequence alignment, and I have a performance issue.
I need to create a dict that maps a word (a sequence of a set length) to a list of all words that are similar as decided by a separate function.
Right now, I am doing the following:
all_words_rdd = sc.parallelize([''.join(word) for word in itertools.product(all_letters, repeat=WORD_SIZE)], PARALLELISM)
all_similar_word_pairs_map = (all_words_rdd.cartesian(all_words_rdd)
                              .filter(lambda (word1, word2), scoring_matrix=scoring_matrix, threshold_value=threshold_value: areWordsSimilar((word1, word2), scoring_matrix, threshold_value))
                              .groupByKey()
                              .mapValues(set)
                              .collectAsMap())
Where areWordsSimilar obviously calculates whether the words reach a set similarity threshold.
However, this is horribly slow. It works fine with words of length 3, but once I go any higher it slows down exponentially (as you might expect). It also starts complaining about the task size being too big (again, not surprising)
I know the cartesian join is a really inefficient way to do this, but I'm not sure how to approach it otherwise.
I was thinking of starting with something like this:
all_words_rdd = (sc.parallelize(xrange(0, len(all_letters) ** WORD_SIZE))
                 .repartition(PARALLELISM)
                 ...
                 )
This would let me split the calculation across multiple nodes. However, how do I calculate this? I was thinking about doing something with bases: treating each number as a base-len(all_letters) numeral and mapping each digit (obtained with the modulo operator) to a letter of the alphabet.
However, this sounds horribly complicated, so I was wondering if anybody had a better way.
Thanks in advance.
EDIT
I understand that I cannot reduce the exponential complexity of the problem, that is not my goal. My goal is to break up the complexity across multiple nodes of execution by having each node perform part of the calculation. However, to do this I need to be able to derive a DNA word from a number using some process.
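For the "derive a DNA word from a number" part, a base-conversion sketch (the helper name num_to_word is mine, not from the question):

```python
def num_to_word(num, alphabet, word_size):
    # Treat num as a base-len(alphabet) numeral, least significant digit
    # first; each digit indexes a letter of the alphabet.
    letters = []
    for _ in range(word_size):
        num, digit = divmod(num, len(alphabet))
        letters.append(alphabet[digit])
    return ''.join(letters)
```

Every integer in range(len(alphabet) ** word_size) then maps to a distinct word, so each partition can generate its own slice of the keyspace from a range of integers.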
Generally speaking even without driver side code it looks like a hopeless task. Size of the sequence set is growing exponentially and you simply cannot win with that. Depending on how you plan to use this data there is most likely a better approach out there.
If you still want to go with this, you can start by splitting kmer generation between the driver and the workers:
from itertools import product

def extend_kmer(n, kmer="", alphabet="ATGC"):
    """
    >>> list(extend_kmer(2))[:4]
    ['AA', 'AT', 'AG', 'AC']
    """
    tails = product(alphabet, repeat=n)
    for tail in tails:
        yield kmer + "".join(tail)

def generate_kmers(k, seed_size, alphabet="ATGC"):
    """
    >>> kmers = generate_kmers(6, 3, "ATGC").collect()
    >>> len(kmers)
    4096
    >>> sorted(kmers)[0]
    'AAAAAA'
    """
    seed = sc.parallelize([x for x in extend_kmer(seed_size, "", alphabet)])
    return seed.flatMap(lambda kmer: extend_kmer(k - seed_size, kmer, alphabet))

k = ...          # Integer
seed_size = ...  # Integer <= k
kmers = generate_kmers(k, seed_size)  # RDD of kmers
The simplest optimization you can do when it comes to searching is to drop cartesian and use a local generation:
from difflib import SequenceMatcher

def is_similar(x, y):
    """Dummy similarity check

    >>> is_similar("AAAAA", "AAAAT")
    True
    >>> is_similar("AAAAA", "TTTTTT")
    False
    """
    return SequenceMatcher(None, x, y).ratio() > 0.75

def find_similar(kmer, f=is_similar, alphabet="ATGC"):
    """
    >>> kmer, similar = find_similar("AAAAAA")
    >>> sorted(similar)[:5]
    ['AAAAAA', 'AAAAAC', 'AAAAAG', 'AAAAAT', 'AAAACA']
    """
    candidates = ("".join(x) for x in product(alphabet, repeat=len(kmer)))
    return (kmer, {x for x in candidates if f(kmer, x)})

similar_map = kmers.map(find_similar)
It is still an extremely naive approach but it doesn't require expensive data shuffling.
Next thing you can try is to improve search strategy. It can be done either locally like above or globally using joins.
In both cases you need a smarter approach than checking all possible kmers. The first thing that comes to mind is to use seed kmers taken from a given word. In local mode these can be used as a starting point for candidate generation; in global mode, as a join key (optionally combined with hashing).

Sequenced string PK's in python

I'm looking for an optimal string sequence generator that could be used as a user id generator. As far as I can tell, it should have the following features:
Length is restricted to 8 characters.
Consists of Latin letters and digits.
Ids are public and thus should be obfuscated.
For now i came up with following algorithm:
from itertools import izip

def idgen(begin, end):
    assert type(begin) is int
    assert type(end) is int
    allowed = reduce(
        lambda L, ri: L + map(chr, range(ord(ri[0]), ord(ri[1]) + 1)),
        (('a', 'z'), ('0', '9')), list()
    )
    shift = lambda c, i: allowed[(allowed.index(c) + i) % len(allowed)]
    for cur in xrange(begin, end):
        cur = str(cur).zfill(8)
        cur = izip(xrange(0, len(cur)), iter(cur))
        cur = ''.join([shift(c, i) for i, c in cur])
        yield cur
But it would give pretty similar ids, 0-100 example:
'01234567', '01234568', '01234569', '0123456a' etc.
So what are the best practices? I suppose URL shorteners must use some similar kind of algorithm?
Since you need "obfuscated" ids, you want something that looks random, but isn't. Fortunately, almost all computer-generated randomness fulfils this criterion, including that produced by Python's random module.
You can generate a fixed sequence of numbers by setting the PRNG seed, like so:
import random
import string

valid_characters = string.lowercase + string.digits

def get_random_id():
    return ''.join(random.choice(valid_characters) for x in range(8))

random.seed(0)
for x in range(10):
    print get_random_id()
This will always print the same sequence of 10 "random" ids.
You probably want to generate ids on demand, instead of all at once. To do so, you need to persist the PRNG state:
random.setstate( get_persisted() )
get_random_id()
persist( random.getstate() )
#repeat ad infinitum
This is somewhat expensive, so you'll want to generate a couple of random ids at once, and keep them on a queue. You could also use random.jumpahead, but I suggest you try both techniques and see which one is faster.
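The getstate/setstate round trip itself looks like this in plain Python (the persistence layer is elided; get_persisted and persist above are placeholders):

```python
import random

random.seed(0)
state = random.getstate()  # this is the blob you would persist somewhere
next_a = random.random()
random.setstate(state)     # later: restore the state and continue
next_b = random.random()
assert next_a == next_b    # the restored generator repeats the sequence
```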
Avoiding collisions for this task is not trivial. Even using hash functions will not save you from collisions.
The random module's PRNG has a period of 2**19937-1. You need a period of at least (26+10)**8, which is about 2**42. However, the fact that the period you need is lower than the period of the PRNG does not guarantee that there will be no collisions, because the size of your output (8 bytes, or 64 bits) may not match the output size of the PRNG. Also, depending on the particular implementation of choice, it's possible that, even if the sizes matched, each call to choice simply advances the PRNG, so your id sequence wouldn't mirror the PRNG's cycle directly.
You could try to iterate over the keyspace, checking if something repeats:
random.seed(0)
space=len(valid_characters)**8
found=set()
x=0
while x<space:
id= get_random_id()
if id in found:
print "cycle found after",x,"iterations"
found.update(id)
if not x% 1000000:
print "progress:",(float(x)/space)*100,"%"
x+=1
But this will take a long, long while.
You could also handcraft a PRNG with the desired period, but it's a bit out of scope on stackoverflow. You could try the math stackexchange.
Ultimately, and depending on your purpose, I think your best bet may be simply keeping track of already generated ids, and skipping repeated ones.
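A minimal sketch of that bookkeeping (the names are mine; in production the seen set would typically live in your database as a uniqueness constraint rather than in memory):

```python
import random
import string

valid_characters = string.ascii_lowercase + string.digits
seen = set()

def unique_random_id(length=8):
    # Draw candidate ids until we hit one that hasn't been issued,
    # then record it so it can never be issued again.
    while True:
        candidate = ''.join(random.choice(valid_characters)
                            for _ in range(length))
        if candidate not in seen:
            seen.add(candidate)
            return candidate
```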
