Is there a generator version of `sample` in Python?

NetLogo touts as one of its important features that it activates agents from an agentset in pseudo-random order. If one wanted to do something similar in Python, one might do the following.
from random import sample

for agent in sample(agentset, len(agentset)):
    < do something with agent >
I believe that would work fine. The problem is that sample returns a list. If agentset is large, one is essentially duplicating it. (I don't want to use shuffle or pop since these modify the original agentset.)
Ideally, I would like a version of sample that acts as a generator and yields values when requested. Is there such a function? If not, any thoughts about how to write one--without either modifying the original set or duplicating it?
Thanks.

The algorithms underlying sample require memory proportional to the size of the sample. (One algorithm is rejection sampling, and the other is a partial shuffle.) Neither can do what you're looking for.
What you're looking for requires different techniques, such as format-preserving encryption. A format-preserving cipher is essentially a keyed bijection from [0, n) to [0, n) (or equivalently, from any finite set to itself). Using format-preserving encryption, your generator would look like (pseudocode)
def shuffle_generator(sequence):
    key = get_random_key()
    cipher = FormatPreservingCipher(key, len(sequence))
    for i in range(len(sequence)):
        yield sequence[cipher.encrypt(i)]
This would be a lot slower than a traditional shuffle, but it would achieve your stated goal.
I am not aware of any good format-preserving encryption libraries for Python. (pyffx exists, but testing shows that either it's terrible, or it has severe limitations that aren't clearly documented. Either way, it doesn't seem to be usable for this.) Your best option may be to wrap a library implemented in a different language.
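If an approximate stand-in is acceptable, a small keyed Feistel network combined with cycle-walking can play the role of the format-preserving cipher in the pseudocode above. A minimal sketch (the round function and round count are illustrative, not a vetted cipher, and `sequence` is assumed to support len() and indexing):

import hashlib
from random import randrange

def _feistel(x, k_bits, key, rounds=4):
    # Keyed Feistel permutation of [0, 2**(2*k_bits)); each round swaps the
    # halves and XORs one half with a keyed hash of the other.
    mask = (1 << k_bits) - 1
    left, right = x >> k_bits, x & mask
    for r in range(rounds):
        h = hashlib.blake2b(f"{key}:{r}:{right}".encode(), digest_size=8)
        left, right = right, left ^ (int.from_bytes(h.digest(), "big") & mask)
    return (left << k_bits) | right

def shuffle_generator(sequence, key=None):
    # Yields the items of `sequence` in a pseudo-random order without copying it.
    # Cycle-walking re-encrypts until the index falls back inside range(n), which
    # terminates because _feistel is a bijection on its domain.
    n = len(sequence)
    if n == 0:
        return
    if key is None:
        key = randrange(1 << 64)
    k_bits = max(1, ((n - 1).bit_length() + 1) // 2)
    for i in range(n):
        j = _feistel(i, k_bits, key)
        while j >= n:
            j = _feistel(j, k_bits, key)
        yield sequence[j]

Used as for agent in shuffle_generator(agentset): ..., this visits every item exactly once in a key-dependent order, with only O(1) extra memory per item yielded.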

Why is np.random.default_rng().permutation(n) preferred over the original np.random.permutation(n)?

The NumPy documentation for np.random.permutation suggests that all new code use np.random.default_rng() from the Random Generator package. I see in the documentation that the Random Generator package has standardized the generation of a wide variety of random distributions around a BitGenerator, rather than the Mersenne Twister, which I'm vaguely familiar with.
I see one downside: what used to be a single line of code for a simple permutation:
np.random.permutation(10)
turns into two lines of code now, which feels a little awkward for such a simple task:
rng = np.random.default_rng()
rng.permutation(10)
Why is this new approach an improvement over the previous approach?
And why wouldn't existing methods like np.random.permutation just wrap this new preferred method?
Is there a good reason not to use this new method as a one-liner np.random.default_rng().permutation(10), assuming it's not being called at high volumes?
Is there an argument for switching existing code to this method?
Some context:
Does numpy.random.seed() always give the same random number every time?
NumPy: Decide on new PRNG BitGenerator default
To your questions, in a logical order:
And why wouldn't existing methods like np.random.permutation just wrap this new preferred method?
Probably because of backwards compatibility concerns. Even if the "top-level" API would not change, its internals would change significantly enough to be deemed a break in compatibility.
Why is this new approach an improvement over the previous approach?
"By default, Generator uses bits provided by PCG64 which has better statistical properties than the legacy MT19937 used in RandomState." (source). The PCG64 docstring provides more technical detail.
Is there a good reason not to use this new method as a one-liner np.random.default_rng().permutation(10), assuming it's not being called at high volumes?
I very much agree that it's a slightly awkward added line of code if it's done at the module-start. I would only point out that the NumPy docs do directly use this form in docstring examples, such as:
n = np.random.default_rng().standard_exponential((3, 8000))
The slight difference would be that one is instantiating a class at module load/import time, whereas in your form it might come later. But that should be a minuscule difference (again, assuming it's only used once or a handful of times). If you look at the default_rng(seed) source, when called with None, it just returns Generator(PCG64(seed)) after a few quick checks on seed.
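For what it's worth, a minimal side-by-side sketch of the two styles (the seed value 42 is arbitrary, and the two APIs use different algorithms, so they will not produce the same permutation):

import numpy as np

# Legacy API: module-level global state, MT19937 under the hood
np.random.seed(42)
print(np.random.permutation(10))

# New API: an explicit Generator backed by PCG64
rng = np.random.default_rng(42)
print(rng.permutation(10))

# The one-liner form also works; it just builds a fresh Generator on each call
print(np.random.default_rng(42).permutation(10))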
Is there an argument for switching existing code to this method?
Going to pass on this one, since I don't have anywhere near the depth of technical knowledge to give a good comparison of the algorithms, and also because it depends on other variables, such as whether you're concerned about keeping your downstream code compatible with older versions of NumPy, where default_rng() simply doesn't exist.

GPyOpt - how to run a physical experiment?

I'm trying to do some physical experiments to find a formulation that optimizes some parameters. By physical experiments I mean I have a chemistry bench, I'm mixing stuff together, then measuring the properties of that formulation. Historically I've used traditional DOEs, but I need to speed up my time to getting to the ideal formulation. I'm aware of simplex optimization, but I'm interested in trying out Bayesian optimization. I found GPyOpt which claims (even in the SO Tag description) to support physical experiments. However, it's not clear how to enable this kind of behavior.
One thing I've tried is to collect user input via input, and I suppose I could pickle off the optimizer and function, but this feels kludgy. In the example code below, I use the function from the GPyOpt example but I have to type in the actual value.
from GPyOpt.methods import BayesianOptimization
import numpy as np

# --- Define your problem
def f(x):
    return (6*x-2)**2*np.sin(12*x-4)

def g(x):
    print(f(x))
    return float(input("Result?"))

domain = [{'name': 'var_1', 'type': 'continuous', 'domain': (0, 1)}]

myBopt = BayesianOptimization(f=g,
                              domain=domain,
                              X=np.array([[0.745], [0.766], [0], [1], [0.5]]),
                              Y=np.array([[f(0.745)], [f(0.766)], [f(0)], [f(1)], [f(0.5)]]),
                              acquisition_type='LCB')
myBopt.run_optimization(max_iter=15, eps=0.001)
So, my question is: what is the intended way of using GPyOpt for physical experimentation?
A few things.
First, set f=None. Note that this has the side effect of causing the BO object to ignore maximize=True, if you happen to be using that argument.
Second, rather than use run_optimization, you want suggest_next_locations. The former runs the entire optimization, whereas the latter just runs a single iteration. This method returns a vector with parameter combinations ("locations") to go test in the lab.
Third, you'll need to make some decisions regarding batch size. The number of combinations/locations that you get is controlled by the batch_size parameter that you use to initialize the BayesianOptimization object. The choice of acquisition function is important here, because some are closely tied to a batch_size of 1. If you need larger batches, you'll need to read the docs for combinations suitable to your situation (e.g. acquisition_type='EI' and evaluator_type='local_penalization').
Fourth, you'll need to explicitly manage the data between iterations. There are at least two ways to approach this. One is to pickle the BO object and add more data to it. An alternative that I think is more elegant is to instead create a completely fresh BO object each time. When instantiating it, you concatenate the new data to the old data and just run a single iteration on the whole set (again, using suggest_next_locations); see the sketch below. This might be kind of insane if you were using BO to optimize a function in silico, but considering how slow the chemistry steps are likely to be, this is probably the cleanest approach (and it makes mid-course corrections easier).
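Putting those pieces together, a rough sketch of that loop (run_experiment is a hypothetical stand-in for going to the bench, and the initial Y values are placeholders for measurements you already have):

import numpy as np
from GPyOpt.methods import BayesianOptimization

domain = [{'name': 'var_1', 'type': 'continuous', 'domain': (0, 1)}]

# Measurements collected so far: parameter settings tried and their lab results
X = np.array([[0.745], [0.766], [0.0], [1.0], [0.5]])
Y = np.array([[1.2], [1.5], [3.0], [16.6], [0.9]])   # illustrative values only

for _ in range(15):                        # one loop pass per trip to the bench
    bo = BayesianOptimization(f=None, domain=domain, X=X, Y=Y,
                              acquisition_type='LCB')
    x_next = bo.suggest_next_locations()   # locations to test next
    y_next = run_experiment(x_next)        # hypothetical: mix it, measure it
    X = np.vstack([X, x_next])
    Y = np.vstack([Y, np.atleast_2d(y_next)])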
Hope this helps!

Use APIs for sorting or algorithm?

In a programming language like Python, which will be more efficient: using a sorting algorithm like merge sort that I implement myself to sort an array, or using a built-in API like sort()? If algorithms are independent of programming languages, then what is the advantage of hand-written algorithms over built-in methods or APIs?
Why use public APIs:
The built-in methods were written and reviewed by many very experienced coders, and a lot of effort was invested in optimizing them to be as efficient as possible.
Since the built-in methods are public APIs, they are also in constant use, which means you get massive "free" testing. Issues are much more likely to be detected in public APIs than in private ones, and once something is discovered, it will be fixed for you.
Don't reinvent the wheel. Someone already programmed it for you, use it. If your profiler says there is a problem, think about replacing it. Not before.
Why use custom-made methods:
That said, the public APIs handle the general case. If you need something very specific to your scenario, you might find a solution that is more efficient, but it will take you quite some time to actually do better than the already-optimized, general-purpose public API.
tl;dr: Use public APIs unless you:
Need it and can afford a lot of time to replace it.
Know what you are doing pretty well.
Intend to maintain it and do robust testing for it.
The libraries normally use well-tested and carefully optimized algorithms. For example, Python uses Timsort, which:
is a stable sort (order of elements that compare equal is preserved)
in the worst case takes O(n log n) comparisons to sort an array of n elements
in the best case (when the input is already sorted) runs in linear time
Unless you have special requirements, and know that for your particular data sets a specific sort algorithm will give better results, you should use the standard library implementation.
The other reason to build a sort by hand is, evidently, for academic purposes...
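As a rough illustration of the gap, you could time the built-in sort against a textbook merge sort yourself (exact numbers will depend on your machine and Python version):

import random
import timeit

def merge_sort(a):
    # Textbook top-down merge sort, pure Python, for comparison only
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out

data = [random.random() for _ in range(100_000)]
print("built-in sorted():      ", timeit.timeit(lambda: sorted(data), number=10))
print("hand-written merge sort:", timeit.timeit(lambda: merge_sort(data), number=10))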

Generating random functions (as opposed to random numbers)

I would like to create a function that takes a string and returns a number between 0 and 1. The function should consistently return the same number when given the same string, but other than that the results should have no discernible pattern. The output numbers for any large set of input strings should follow a uniform distribution.
Moreover, I need to generate more than one such function, i.e. when given the string "abc", function A might consistently return 0.593927 while function B consistently returns 0.0162524. I need it to be fast (it's for a numerical simulation), and have reasonably good statistics.
I'm using Python and will settle for answers of the form "here is an easy way to do it using a Python library" or "here is an algorithm that you can implement." If there's no fast way to do it in Python I'll just switch to C instead.
I realise that either of the following two methods will work, but each of them have drawbacks that make me want to look for a more elegant solution.
Store a dictionary
I could just calculate a new random number every time I'm given a new string, and store it in a dictionary to be retrieved if I receive the same string again. However, my application is likely to generate a lot of strings that only appear once, so this will eventually result in having to store a very large dictionary in memory. It also makes repeatability more difficult, since even if I use the same seed, I'll generate a different function if I receive the same strings in a different order. For these reasons, it would be much better to consistently compute the random numbers "on the fly".
Use a hash function
I could just call a hash function on the string and then convert the result to a number. The issue of generating multiple functions could be solved by, for example, appending a "seed" string to every input string. However, then I'm stuck with trying to find a hash function with the appropriate speed and statistics. Python's built-in hash is fast but implementation-dependent, and I don't know how good the statistics will be since it's not designed for this type of purpose. On the other hand I could use a secure hash algorithm such as md5, which will have good statistics, but this would be too slow for my application. Hash functions aimed at data storage applications are typically much faster than cryptographically secure ones like md5, but they're designed with the aim of avoiding collisions, rather than producing uniformly distributed output, and these aren't necessarily the same in all cases.
A further note about hash functions
To illustrate the point that avoiding collisions and producing uniform results are different things, consider the following example using Python's built-in hash function:
>>> hash("aaa") % 1000
340
>>> hash("aab") % 1000
343
>>> hash("aac") % 1000
342
>>> hash("aad") % 1000
337
>>> hash("aae") % 1000
336
>>> hash("aaf") % 1000
339
>>> hash("aag") % 1000
338
>>> hash("aah") % 1000
349
>>> hash("aai") % 1000
348
>>> hash("aaj") % 1000
351
>>> hash("aak") % 1000
350
There are no collisions in the above output, but they are also clearly not uniformly distributed, since they are all between 336 and 351, and there is also a definite pattern in the third digit. I realise I could probably get better statistics by doing (hash("aaa")/HASH_MAX)*1000 (assuming I can work out what HASH_MAX should be), but this should help to illustrate that the requirements for a good hash function are not the same as the requirements for the function I'm looking for.
Some relevant information about the problem
I don't know exactly what the strings are that this algorithm will need to work on, because the strings will be generated by the simulation, but the following are likely to be the case:
They will have a very restricted character set (perhaps just 4 or 5 different symbols).
There will be a lot of unique or rare strings and a few very common ones, of varying length.
There is no upper bound on the lengths of the strings, but short ones are likely to be much more common than long ones. I wouldn't be surprised if I never see one longer than 100 characters, but I don't know for sure. Many of them will just have one to three characters, so it's important that the algorithm is fast for short strings. (But I guess I could use a lookup table for strings less than a certain length.)
Typically, the strings will have large substrings in common - often two strings will differ only by a single character appended to the beginning or end. It's important that the algorithm doesn't tend to give similar output values when the strings are similar.
Use a good random number generator and seed it with the string.
There's an algorithm in the section on "hashing strings" in the Wikipedia article on universal hashing.
Alternatively, you could just use some built-in hash function; each of your random functions prepends a random (but fixed) prefix to the string before hashing.
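A concrete way to realise that idea with only the standard library: key a hash with a per-function seed and scale the digest to [0, 1). This sketch uses hashlib.blake2b, which is heavier than the simulation strictly needs but shows the construction; a faster non-cryptographic hash could be swapped in if profiling demands it:

import hashlib

def make_random_function(seed):
    # Each distinct seed string yields a different deterministic mapping
    # from strings to floats in [0, 1).
    def f(s):
        h = hashlib.blake2b(s.encode('utf-8'),
                            key=seed.encode('utf-8'),
                            digest_size=8)
        return int.from_bytes(h.digest(), 'big') / 2**64
    return f

f_a = make_random_function('function A')
f_b = make_random_function('function B')
print(f_a('abc'), f_b('abc'))    # same input, two unrelated-looking outputs
print(f_a('abc') == f_a('abc'))  # True: each function is deterministic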
Try to use a Fingerprint such as Rabin fingerprinting.
http://en.wikipedia.org/wiki/Fingerprint_(computing).
If you choose an N-bit fingerprint, you just need to divide the result by 2^N.
Fingerprints are a kind of hash function that is usually very fast to compute (compared to cryptographic hash functions like MD5) but not suitable for cryptographic applications (the key value may be recoverable somehow from its fingerprint).
Lookup3 is reputed to have very good collision properties, which ought to imply uniform distribution of results, and it's also fast. Should be simple to put this in a Python extension.
More generally, if you find a function that does a good job of minimizing hash table collisions and has the speed properties you need, a final conversion from a 32- or 64-bit integer to float is all that's needed. There are many sources on the web and elsewhere of string hashing functions. Check Knuth, for starters.
Addition
One other thing that might be worth trying is to encrypt the string first with a fast 1-1 algorithm like RC4 (not secure, but still close enough to pseudorandom) and then run a trivial hash (h = h + a * c[i] + b) over the cipher text. The RC4 key is the uniquifier.

Data structure in python for 2d range counting queries

I need a data structure for doing 2d range counting queries (i.e. how many points are in a given rectangle).
I think my best bet is range tree (it can count in log^2, or even log after some optimizations). Does it sound like a good choice? Does anybody know about a python implementation or will I have to write one myself?
See scipy.spatial.KDTree for one implementation.
There's also a less generic (but occasionally more useful, particularly with regards to what you have in mind) implementation using shapelib's quadtree. See this blog and the corresponding package in PyPi.
There are probably other implementations, too, but those are the two that I've used...
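For the KDTree route, rectangle counting isn't built in, but you can query a Chebyshev (p=inf) ball that covers the rectangle and then filter the candidates exactly. A rough sketch (count_in_rect is my own helper name, not part of SciPy):

import numpy as np
from scipy.spatial import cKDTree

def count_in_rect(tree, pts, x0, x1, y0, y1):
    # Query a p=inf ball large enough to cover the rectangle [x0,x1] x [y0,y1],
    # then keep only the candidate points that actually fall inside it.
    centre = [(x0 + x1) / 2.0, (y0 + y1) / 2.0]
    half = max(x1 - x0, y1 - y0) / 2.0
    idx = tree.query_ball_point(centre, half, p=np.inf)
    cand = pts[idx]
    inside = ((cand[:, 0] >= x0) & (cand[:, 0] <= x1) &
              (cand[:, 1] >= y0) & (cand[:, 1] <= y1))
    return int(inside.sum())

pts = np.random.rand(100_000, 2)
tree = cKDTree(pts)
print(count_in_rect(tree, pts, 0.2, 0.4, 0.1, 0.7))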
