Generating random functions (as opposed to random numbers) - python

I would like to create a function that takes a string and returns a number between 0 and 1. The function should consistently return the same number when given the same string, but other than that the results should have no discernible pattern. The output numbers for any large set of input strings should follow a uniform distribution.
Moreover, I need to generate more than one such function, i.e. when given the string "abc", function A might consistently return 0.593927 while function B consistently returns 0.0162524. I need it to be fast (it's for a numerical simulation), and have reasonably good statistics.
I'm using Python and will settle for answers of the form "here is an easy way to do it using a Python library" or "here is an algorithm that you can implement." If there's no fast way to do it in Python I'll just switch to C instead.
I realise that either of the following two methods will work, but each of them has drawbacks that make me want to look for a more elegant solution.
Store a dictionary
I could just calculate a new random number every time I'm given a new string, and store it in a dictionary to be retrieved if I receive the same string again. However, my application is likely to generate a lot of strings that only appear once, which will eventually result in having to store a very large dictionary in memory. It also makes repeatability more difficult, since even if I use the same seed, I'll generate a different function if I receive the same strings in a different order. For these reasons, it would be much better to consistently compute the random numbers "on the fly".
Use a hash function
I could just call a hash function on the string and then convert the result to a number. The issue of generating multiple functions could be solved by, for example, appending a "seed" string to every input string. However, then I'm stuck with trying to find a hash function with the appropriate speed and statistics. Python's built-in hash is fast but implementation-dependent, and I don't know how good the statistics will be since it's not designed for this type of purpose. On the other hand I could use a secure hash algorithm such as md5, which will have good statistics, but this would be too slow for my application. Hash functions aimed at data storage applications are typically much faster than cryptographically secure ones like md5, but they're designed with the aim of avoiding collisions, rather than producing uniformly distributed output, and these aren't necessarily the same in all cases.
A further note about hash functions
To illustrate the point that avoiding collisions and producing uniform results are different things, consider the following example using Python's built-in hash function:
>>> hash("aaa") % 1000
340
>>> hash("aab") % 1000
343
>>> hash("aac") % 1000
342
>>> hash("aad") % 1000
337
>>> hash("aae") % 1000
336
>>> hash("aaf") % 1000
339
>>> hash("aag") % 1000
338
>>> hash("aah") % 1000
349
>>> hash("aai") % 1000
348
>>> hash("aaj") % 1000
351
>>> hash("aak") % 1000
350
There are no collisions in the above output, but they are also clearly not uniformly distributed, since they are all between 336 and 351, and there is also a definite pattern in the third digit. I realise I could probably get better statistics by doing (hash("aaa")/HASH_MAX)*1000 (assuming I can work out what HASH_MAX should be), but this should help to illustrate that the requirements for a good hash function are not the same as the requirements for the function I'm looking for.
Some relevant information about the problem
I don't know exactly what the strings are that this algorithm will need to work on, because the strings will be generated by the simulation, but the following are likely to be the case:
They will have a very restricted character set (perhaps just 4 or 5 different symbols).
There will be a lot of unique or rare strings and a few very common ones, of varying length.
There is no upper bound on the lengths of the strings, but short ones are likely to be much more common than long ones. I wouldn't be surprised if I never see one longer than 100 characters, but I don't know for sure. Many of them will just have one to three characters, so it's important that the algorithm is fast for short strings. (But I guess I could use a lookup table for strings less than a certain length.)
Typically, the strings will have large substrings in common - often two strings will differ only by a single character appended to the beginning or end. It's important that the algorithm doesn't tend to give similar output values when the strings are similar.

Use a good random number generator and seed it with the string.
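One way this could look in Python, assuming that seeding random.Random with a string is fast enough and statistically good enough for your purposes (the salt strings below are hypothetical; they just distinguish the different functions):
import random

def make_random_function(salt):
    def f(s):
        # A fresh generator seeded with salt + input returns the same value
        # every time it sees the same (salt, s) pair.
        return random.Random(salt + s).random()
    return f

f_a = make_random_function("A:")
f_b = make_random_function("B:")
# f_a("abc") is always the same number, and unrelated to f_b("abc")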

There's an algorithm in the section on "hashing strings" in the Wikipedia article on universal hashing.
Alternatively, you could just use some built-in hash function; each of your random functions prepends a random (but fixed) prefix to the string before hashing.
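A minimal sketch of that suggestion (note that since Python 3.3 str hashing is randomized per process unless PYTHONHASHSEED is fixed, which matters if you need results to be reproducible across runs; the prefix value is arbitrary):
PREFIX_A = "function-A:"   # fixed, randomly chosen prefix for this function

def rand_a(s):
    # Mask the signed, platform-sized hash down to 64 bits, then scale to [0, 1).
    h = hash(PREFIX_A + s) & 0xFFFFFFFFFFFFFFFF
    return h / 2.0**64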

Try using a fingerprint such as Rabin fingerprinting.
http://en.wikipedia.org/wiki/Fingerprint_(computing).
If you choose an N-bit fingerprint, you just need to divide the result by 2^N.
Fingerprints are a kind of hash function that is usually very fast to compute (compared with cryptographic hash functions like MD5) but not suitable for cryptographic applications (the key value may be recoverable somehow from its fingerprint).

Lookup3 is reputed to have very good collision properties, which ought to imply uniform distribution of results, and it's also fast. Should be simple to put this in a Python extension.
More generally, if you find a function that does a good job of minimizing hash table collisions and has the speed properties you need, a final conversion from a 32- or 64-bit integer to float is all that's needed. There are many sources on the web and elsewhere of string hashing functions. Check Knuth, for starters.
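That final conversion could look like the following sketch, with hashlib.blake2b standing in here for whatever fast, well-distributed string hash you settle on (the key argument just distinguishes the different functions):
import hashlib
import struct

def hash_to_unit(s, key=b"function-A"):
    # Take an 8-byte (64-bit) digest, read it as an unsigned integer,
    # and scale it into [0, 1).
    digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8, key=key).digest()
    (h,) = struct.unpack("<Q", digest)
    return h / 2.0**64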
Addition
One other thing that might be worth trying is to encrypt the string first with a fast one-to-one algorithm like RC4 (not secure, but still close enough to pseudorandom) and then run a trivial hash (h = h + a * c[i] + b) over the cipher text. The RC4 key is the uniquifier.

Related

Is there a generator version of `sample` in Python?

NetLogo argues that one of its important features is that it activates agents from an agentset in pseudo-random order. If one wanted to do something similar in Python one might do the following.
from random import sample
for agent in sample(agentset, len(agentset)):
    < do something with agent >
I believe that would work fine. The problem is that sample returns a list. If agentset is large, one is essentially duplicating it. (I don't want to use shuffle or pop since these modify the original agentset.)
Ideally, I would like a version of sample that acts as a generator and yields values when requested. Is there such a function? If not, any thoughts about how to write one--without either modifying the original set or duplicating it?
Thanks.
The algorithms underlying sample require memory proportional to the size of the sample. (One algorithm is rejection sampling, and the other is a partial shuffle.) Neither can do what you're looking for.
What you're looking for requires different techniques, such as format-preserving encryption. A format-preserving cipher is essentially a keyed bijection from [0, n) to [0, n) (or equivalently, from any finite set to itself). Using format-preserving encryption, your generator would look like (pseudocode)
def shuffle_generator(sequence):
    key = get_random_key()
    cipher = FormatPreservingCipher(key, len(sequence))
    for i in range(len(sequence)):
        yield sequence[cipher.encrypt(i)]
This would be a lot slower than a traditional shuffle, but it would achieve your stated goal.
I am not aware of any good format-preserving encryption libraries for Python. (pyffx exists, but testing shows that either it's terrible, or it has severe limitations that aren't clearly documented. Either way, it doesn't seem to be usable for this.) Your best option may be to wrap a library implemented in a different language.
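Failing that, here is a rough, self-contained sketch of the same idea: a keyed bijection on range(n) built from a small Feistel network with cycle-walking. The round function, the number of rounds and the key handling are illustrative choices, not a vetted format-preserving scheme like FF1/FF3.
import hashlib

class FeistelPermutation:
    # Keyed bijection on range(n): a balanced Feistel network over an even
    # number of bits, with cycle-walking to stay inside [0, n).
    def __init__(self, key, n, rounds=4):
        self.key = key
        self.n = n
        bits = max(2, (n - 1).bit_length())
        if bits % 2:
            bits += 1
        self.half = bits // 2
        self.mask = (1 << self.half) - 1
        self.rounds = rounds

    def _round(self, r, value):
        # Pseudorandom function of (round index, half-block, key).
        data = b"%d:%d:" % (r, value) + self.key
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "big") & self.mask

    def _permute(self, x):
        left, right = x >> self.half, x & self.mask
        for r in range(self.rounds):
            left, right = right, left ^ self._round(r, right)
        return (left << self.half) | right

    def encrypt(self, i):
        x = self._permute(i)
        while x >= self.n:          # cycle-walk back into the domain
            x = self._permute(x)
        return x

def shuffle_generator(sequence, key=b"some fixed key"):
    cipher = FeistelPermutation(key, len(sequence))
    for i in range(len(sequence)):
        yield sequence[cipher.encrypt(i)]
Each key gives a different visiting order, every index is produced exactly once, and the sequence is only ever read by index, so nothing is copied or modified.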

Optimizing Python Dictionary Lookup Speeds by Shortening Key Size?

I'm not clear on what goes on behind the scenes of a dictionary lookup. Does key size factor into the speed of lookup for that key?
Current dictionary keys are between 10-20 long, alphanumeric.
I need to do hundreds of lookups a minute.
If I replace those with smaller key IDs of between 1 & 4 digits will I get faster lookup times? This would mean I would need to add another value in each item the dictionary is holding. Overall the dictionary will be larger.
Also I'll need to change the program to lookup the ID then get the URL associated with the ID.
Am I likely just adding complexity to the program with little benefit?
Dictionaries are hash tables, so looking up a key consists of:
Hash the key.
Reduce the hash to the table size.
Index the table with the result.
Compare the looked-up key with the input key.
Normally, this is amortized constant time, and you don't care about anything more than that. There are two potential issues, but they don't come up often.
Hashing the key takes linear time in the length of the key. For, e.g., huge strings, this could be a problem. However, if you look at the source code for most of the important types, including [str/unicode](https://hg.python.org/cpython/file/default/Objects/unicodeobject.c), you'll see that they cache the hash the first time. So, unless you're inputting (or randomly creating, or whatever) a bunch of strings to look up once and then throw away, this is unlikely to be an issue in most real-life programs.
On top of that, 20 characters is really pretty short; you can probably do millions of such hashes per second, not hundreds.
From a quick test on my computer, hashing 20 random letters takes 973ns, hashing a 4-digit number takes 94ns, and hashing a value I've already hashed takes 77ns. Yes, that's nanoseconds.
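If you want to reproduce that kind of measurement, something along these lines works; the encode/decode round trip is just one way to defeat the cached hash and adds some overhead of its own, so treat the numbers as rough:
import random
import string
import timeit

long_key = "".join(random.choices(string.ascii_letters, k=20))

# Same string object every call: after the first call the hash is cached.
t_cached = timeit.timeit("hash(k)", globals={"k": long_key}, number=1_000_000)

# A fresh string object every call: the hash must be recomputed each time.
t_fresh = timeit.timeit("hash(k.encode().decode())",
                        globals={"k": long_key}, number=1_000_000)

# A small integer: hashing it is essentially free.
t_int = timeit.timeit("hash(k)", globals={"k": 1234}, number=1_000_000)

print(t_cached, t_fresh, t_int)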
Meanwhile, "Index the table with the result" is a bit of a cheat. What happens if two different keys hash to the same index? Then "compare the looked-up key" will fail, and… what happens next? CPython's implementation uses probing for this. The exact algorithm is explained pretty nicely in the source. But you'll notice that given really pathological data, you could end up doing a linear search for every single element. This is never going to come up—unless someone can attack your program by explicitly crafting pathological data, in which case it will definitely come up.
Switching from 20-character strings to 4-digit numbers wouldn't help here either. If I'm crafting keys to DoS your system via dictionary collisions, I don't care what your actual keys look like, just what they hash to.
More generally, premature optimization is the root of all evil. This is sometimes misquoted to overstate the point; Knuth was arguing that the most important thing to do is find the 3% of the cases where optimization is important, not that optimization is always a waste of time. But either way, the point is: if you don't know in advance where your program is too slow (and if you think you know in advance, you're usually wrong…), profile it, and then find the part where you get the most bang for your buck. Optimizing one arbitrary piece of your code is likely to have no measurable effect at all.
Python dictionaries are implemented as hash maps in the background. The key length might have some impact on performance if, for example, the hash function's complexity depends on the key length. But in general the performance impact will be definitely negligible.
So I'd say there is little to no benefit for the added complexity.

The best way to calculate the position of a number in the prime list?

for example
f(2)->1
f(3)->2
f(4)->-1 //4 is not a prime
f(5)->3
...
Generally: make a prime generator and count the primes until it reaches x.
def f(x):
    p = primeGenerator()
    count = 1
    while True:
        y = next(p)
        if y > x:
            return -1
        elif y == x:
            return count
        else:
            count += 1
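The snippet assumes a primeGenerator() that yields 2, 3, 5, 7, ... in order. A minimal trial-division version might look like this; it is easy to write but slow for large inputs:
from itertools import count

def primeGenerator():
    primes = []
    for n in count(2):
        # n is prime if no smaller prime up to sqrt(n) divides it
        if all(n % p for p in primes if p * p <= n):
            primes.append(n)
            yield n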
Isn't that too slow, though? I can cache the list for the next call, and if I guarantee that the input MUST be a prime I don't have to test whether the input number is a prime. Is there a faster formula to get the answer?
The best method depends on what inputs you get, and whether the function will be called many times or just once or a few times.
If it will be called often, and all inputs you are going to receive are small, not larger than 10^7 say, the best method is to create a lookup table in advance, and just look up the input.
If it will not be called often, and all inputs are small, just generating the primes not exceeding the input and counting them is certainly good enough. It might be an enhancement to remember what you already have for the next call, so that when the first argument is 19394489, and the next is 20889937, you don't need to start from 0 again, but only need to find the primes between them. But whether the extra storage is worthwhile depends on the arguments passed.
If it will be called often and the arguments are not too large, not exceeding 10^13 say, the best method is to precompute the values of π(n) for some select values of n, and for each argument look up the value for the next smaller precomputed point, and then generate and count the primes between that point and the target value (or if the target is closer to the next larger precomputed point, count the primes between the target and that).
If you calculate e.g. π(n) for all multiples of 10^7 not exceeding 10^13, you get a lookup table with one million entries, that's not very taxing on the memory nowadays, and you never need to sieve a range larger than five million, which doesn't take long.
You could also have the lookup table as a file or database on disk, which would allow much shorter intervals between the precomputed points. That would also eliminate the time for reading in the precomputed table on startup, but the lookup would now involve an access to the file system, which takes much longer than a memory read. What would be the best strategy depends on the expected inputs and the system it's run on.
Computing the lookup table will however take rather long if the upper limit isn't small, but that's a one-time cost.
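As a scaled-down sketch of that checkpoint idea (a step of 10^4 and a limit of 10^6 here instead of 10^7 and 10^13, so it builds in well under a second):
STEP = 10_000          # checkpoint spacing
LIMIT = 1_000_000      # demo upper bound

def sieve_count(lo, hi):
    # Count primes in the half-open interval [lo, hi) with a segmented sieve.
    lo = max(lo, 2)
    if hi <= lo:
        return 0
    is_prime = [True] * (hi - lo)
    p = 2
    while p * p < hi:
        start = max(p * p, ((lo + p - 1) // p) * p)
        for m in range(start, hi, p):
            is_prime[m - lo] = False
        p += 1
    return sum(is_prime)

# pi_table[k] holds the number of primes below k * STEP.
pi_table = [0]
for k in range(1, LIMIT // STEP + 1):
    pi_table.append(pi_table[-1] + sieve_count((k - 1) * STEP, k * STEP))

def prime_pi(x):
    # Look up the nearest checkpoint below x, then sieve only the remaining gap.
    k = x // STEP
    return pi_table[k] + sieve_count(k * STEP, x + 1)

# For a prime p, prime_pi(p) is its position in the prime list:
# prime_pi(2) == 1, prime_pi(3) == 2, prime_pi(999983) == 78498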
If the expected inputs are larger, up to 10^16 say, and you're not willing to spend the time necessary for precomputing a lookup table for that range, your best bet is to implement one of the better algorithms for the prime counting function. Meissel's method as refined by Lehmer is relatively easy to implement (not so easy that I'll give an example implementation here, though, but here's a Haskell implementation that might help). Better, but more complicated, is the method as improved by Miller et al.
Beyond that, you'd need to research the current state-of-the-art, and probably should use a lower-level language than Python.
You have to check all preceding candidates for primality. There are no shortcuts. As you say, you can cache the result of a prior calculation and start from there, but that's really the best you can do.

What's more random, hashlib or urandom?

I'm working on a project with a friend where we need to generate a random hash. Before we had time to discuss, we both came up with different approaches and because they are using different modules, I wanted to ask you all what would be better--if there is such a thing.
hashlib.sha1(str(random.random())).hexdigest()
or
os.urandom(16).encode('hex')
Typing this question out has got me thinking that the second method is better. Simple is better than complex. If you agree, how reliable is this for 'randomly' generating hashes? How would I test this?
This solution:
os.urandom(16).encode('hex')
is the best since it uses the OS to generate randomness which should be usable for cryptographic purposes (depends on the OS implementation).
random.random() generates pseudo-random values.
Hashing a random value does not add any new randomness.
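As a side note, the snippets in the question are Python 2; in Python 3 the recommended version could be written as:
import os
import secrets

token = os.urandom(16).hex()       # Python 3 spelling of os.urandom(16).encode('hex')
token2 = secrets.token_hex(16)     # secrets (3.6+) draws from the same OS source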
random.random() is a pseudo-random generator, which means the numbers are generated from a deterministic sequence: if you call random.seed(some_number), the sequence generated after that will always be the same.
os.urandom() gets its random bytes from the OS's RNG, which uses an entropy pool that collects real randomness, usually from random hardware events; there even exist dedicated hardware entropy generators for systems where a lot of random numbers are needed.
On Unix systems there are traditionally two random number generators: /dev/random and /dev/urandom. Reads from the first block if there is not enough entropy available, whereas /dev/urandom falls back to a pseudo-RNG instead of blocking when the entropy pool runs low.
So the choice usually depends on what you need: if you need a few, uniformly distributed random numbers, the built-in PRNG should be sufficient; for cryptographic use it's always better to use real random numbers.
The second solution clearly has more entropy than the first. Assuming the quality of the source of the random bits would be the same for os.urandom and random.random:
In the second solution you are fetching 16 bytes = 128 bits worth of randomness
In the first solution you are fetching a floating point value which has roughly 52 bits of randomness (IEEE 754 double, ignoring subnormal numbers, etc...). Then you hash it around, which, of course, doesn't add any randomness.
More importantly, the quality of the randomness coming from os.urandom is expected and documented to be much better than the randomness coming from random.random. os.urandom's docstring says "suitable for cryptographic use".
Testing randomness is notoriously difficult; however, I would choose the second method, but ONLY (at least, as far as comes to mind) for this case, where the hash is seeded by a random number.
The whole point of hashes is to create a number that is vastly different based on slight differences in input. For your use case, the randomness of the input should do. If, however, you wanted to hash a file and detect one eensy byte's difference, that's when a hash algorithm shines.
I'm just curious, though: why use a hash algorithm at all? It seems that you're looking for a purely random number, and there are lots of libraries that generate uuid's, which have far stronger guarantees of uniqueness than random number generators.
if you want a unique identifier (uuid), then you should use
import uuid
uuid.uuid4().hex
https://docs.python.org/3/library/uuid.html

Why is collections.deque slower than collections.defaultdict?

Forgive me for asking in such a general way, as I'm sure their performance depends on how one uses them, but in my case collections.deque was way slower than collections.defaultdict when I wanted to verify the existence of a value.
I used the spelling correction from Peter Norvig in order to verify a user's input against a small set of words. As I had no use for a dictionary with word frequencies I used a simple list instead of defaultdict at first, but replaced it with deque as soon as I noticed that a single word lookup took about 25 seconds.
Surprisingly, that wasn't faster than using a list so I returned to using defaultdict which returned results almost instantaneously.
Can someone explain this difference in performance to me?
Thanks in advance
PS: If one of you wants to reproduce what I was talking about, change the following lines in Norvig's script.
-NWORDS = train(words(file('big.txt').read()))
+NWORDS = collections.deque(words(file('big.txt').read()))
-return max(candidates, key=NWORDS.get)
+return candidates
These three data structures aren't interchangeable; they serve very different purposes and have very different characteristics:
Lists are dynamic arrays. You use them to store items sequentially for fast random access, as a stack (adding and removing at the end), or just to store something and later iterate over it in the same order.
Deques are sequences too, but optimized for adding and removing elements at both ends rather than for random access or stack-like growth.
Dictionaries (providing a default value is just a relatively simple and convenient, but for this question irrelevant, extension) are hash tables. They associate fully-featured keys (instead of an index) with values and provide very fast access to a value by its key and, necessarily, very fast checks for key existence. They don't maintain order and require the keys to be hashable, but well, you can't make an omelette without breaking eggs.
All of these properties are important; keep them in mind whenever you choose one over the other. What breaks your neck in this particular case is a combination of the last property of dictionaries and the number of possible corrections that have to be checked. Some simple combinatorics would give you a concrete formula for the number of edits this code generates for a given word, but everyone who has mispredicted such things often enough will know it's going to be a surprisingly large number even for average words.
For each of these edits, there is a check, edit in NWORDS, to weed out edits that result in unknown words. Not a big problem in Norvig's program, since in checks (key existence checks) are, as mentioned before, very fast. But you swapped the dictionary for a sequence (a deque)! For sequences, in has to iterate over the whole sequence and compare each item with the value searched for (it can stop when it finds a match, but since few of the edits are known words sitting at the beginning of the deque, it usually still searches all or most of the deque). Since there are quite a few words and the test is done for every edit generated, you end up spending 99% of your time doing a linear search in a sequence, where you could just hash a string and compare it once (or at most, in case of collisions, a few times).
If you don't need weights, you can conceptually use bogus values you never look at and still get the performance boost of an O(1) in check. Practically, you should just use a set, which uses pretty much the same algorithms as dictionaries and just cuts away the part where it stores the value (it was actually first implemented like that; I don't know how far the two have diverged since sets were re-implemented in a dedicated, separate C module).
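A quick way to see the size of the effect described above (absolute numbers vary by machine; the ratio is what matters):
import collections
import random
import string
import timeit

words = ["".join(random.choices(string.ascii_lowercase, k=5)) for _ in range(50_000)]
as_deque = collections.deque(words)
as_set = set(words)
probe = "not-in-there"   # a miss forces the deque scan to touch every element

t_deque = timeit.timeit("probe in as_deque", globals=globals(), number=1_000)
t_set = timeit.timeit("probe in as_set", globals=globals(), number=1_000)
print(t_deque, t_set)    # linear scan vs. hash lookup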
