Python's random: What happens if I don't use seed(someValue)?

a) In this case, does the random number generator use the system's clock (making the seed change) on each run?
b) Is the seed used to generate the pseudo-random values of expovariate(lambd)?

"Use the Source, Luke!"...;-). Studying https://svn.python.org/projects/python/trunk/Lib/random.py will rapidly reassure you;-).
What happens when seed isn't set (that's the "a is None" case):
if a is None:
    try:
        a = long(_hexlify(_urandom(16)), 16)
    except NotImplementedError:
        import time
        a = long(time.time() * 256) # use fractional seconds
and the expovariate:
random = self.random
u = random()
while u <= 1e-7:
    u = random()
return -_log(u)/lambd
obviously uses the same underlying random generator as every other method, and so is identically affected by the seeding or lack thereof (really, how else would it have been done?-)
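You can check this for yourself in a couple of lines (a small sketch; the exact value drawn depends on your Python version, but the equality always holds):
import random

random.seed(42)
a = random.expovariate(1.5)
random.seed(42)
b = random.expovariate(1.5)
print(a == b)  # True: expovariate draws from the same seeded generator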

a) It typically uses the system clock; on some systems the clock may only have millisecond precision, so seeding twice in quick succession may produce the same value.
seed(self, a=None)
    Initialize internal state from hashable object.

    None or no argument seeds from current time or from an operating
    system specific randomness source if available.
http://pydoc.org/2.5.1/random.html#Random-seed
b) I would imagine expovariate does, but I can't find any proof. It would be silly if it didn't.

current system time is used; current system time is also used to initialize the generator when the module is first imported. If randomness sources are provided by the operating system, they are used instead of the system time (see the os.urandom() function for details on availability).
Random Docs
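You can poke at that OS randomness source directly (a small sketch in Python 3; the output differs on every run):
import os

raw = os.urandom(16)                   # 16 random bytes from the OS
print(raw)
print(int.from_bytes(raw, "big"))      # the kind of large integer used as a seed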


process_time() function in python not working as expected

Can someone help me understand how process_time() works?
My code is
from time import process_time

t = process_time()

def fibonacci_of(n):
    if n in cache:  # Base case
        return cache[n]
    # Compute and cache the Fibonacci number
    cache[n] = fibonacci_of(n - 1) + fibonacci_of(n - 2)  # Recursive case
    return cache[n]

cache = {0: 0, 1: 1}
fib = [fibonacci_of(n) for n in range(1500)]
print(fib[-1])
print(process_time() - t)
And last print is always 0.0.
My expected result is something like 0.764891862869
Docs at https://docs.python.org/3/library/time.html#time.process_time don't help newbie me :(
I tried some other functions and reading docs. But without success.
I'd assume this is OS dependent. Linux lets me get down to ~5 microseconds using process_time, but other operating systems may not handle differences this small and return zero instead.
It's for this reason that Python exposes other timers that are designed to be more accurate over shorter time scales. Specifically, perf_counter is specified as using:
the highest available resolution to measure a short duration
Using this lets me measure down to ~80 nanoseconds, whether I'm using perf_counter or perf_counter_ns.
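For illustration, a minimal sketch timing the same kind of CPU-bound work with both timers (the resolution you see will depend on your OS):
from time import perf_counter, process_time

t_pc = perf_counter()
t_pt = process_time()
total = sum(i * i for i in range(1500))   # small CPU-bound workload
print(f"perf_counter elapsed: {perf_counter() - t_pc:.9f} s")
print(f"process_time elapsed: {process_time() - t_pt:.9f} s")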
As the documentation says:
time.process_time() → float
Return the value (in fractional seconds) of the sum of the system and user
CPU time of the current process. It does not include time
elapsed during sleep. It is process-wide by definition. The reference
point of the returned value is undefined, so that only the difference
between the results of two calls is valid.
Use process_time_ns() to avoid the precision loss caused by the float type.
This last sentence is the most important: it distinguishes the higher-precision process_time_ns() from process_time(), which is less precise but more appropriate for long-running processes.
It turns out that when you measure a few nanoseconds (nano means 10**-9) and express them in seconds by dividing by 10**9, the result can fall below the precision of a 64-bit float relative to the large timestamps involved, so the difference rounds to zero. The float limitations are described in the Python documentation.
To learn more, you can also read a general introduction to precision in floating-point arithmetic (ambitious) and its perils (caveats).
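A small illustration of that rounding (the values are chosen for demonstration):
t = 1_700_000_000.0   # a Unix-epoch-sized timestamp, in seconds
dt = 5e-9             # a 5 ns duration, expressed in seconds
print(t + dt == t)    # True: near 1.7e9 the float grid is ~5e-7 s wide,
                      # so a few nanoseconds vanish entirely
print((t + dt) - t)   # 0.0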

Running something in a method takes much longer depending on the data types

Introduction
Today I found some weird behaviour in Python while running experiments with exponentiation, and I was wondering if someone here knows what's happening. I was trying to check which is faster in Python: int**int or float**float. To check that, I ran some small snippets and found a really weird behaviour.
Weird results
My first approach was just to write some for loops with prints to check which one is faster. The snippet I used is this one:
import time

EXPERIMENTS = 1_000_000  # iteration count (not defined in the original snippet)

# Run powers outside a method
ti = time.time()
for i in range(EXPERIMENTS):
    x = 2**2
tf = time.time()
print(f"int**int took {tf-ti:.5f} seconds")

ti = time.time()
for i in range(EXPERIMENTS):
    x = 2.**2.
tf = time.time()
print(f"float**float took {tf-ti:.5f} seconds")
After running it I got:
int**int took 0.03004 seconds
float**float took 0.03070 seconds
Cool, it seems that data types do not affect the execution time. However, since I try to be a clean coder, I refactored the repeated logic into a function power_time:
import time

# Run powers in a method
def power_time(base, exponent):
    ti = time.time()
    for i in range(EXPERIMENTS):
        x = base ** exponent
    tf = time.time()
    return tf - ti

print(f"int**int took {power_time(2, 2):.5f} seconds")
print(f"float**float took {power_time(2., 2.):.5f} seconds")
And what a surprise of mine when I got these results
int**int took 0.20140 seconds
float**float took 0.05051 seconds
The refactor barely affected the float case, but it multiplied the time required for the int case by roughly 7.
Conclusions and questions
Apparently, running something in a method can slow down your process depending on your data types, which is really weird to me.
Also, if I run the same experiments but change ** to * or +, the weird results disappear and all the approaches give more or less the same results.
Does someone know why this is happening? Am I missing something?
Apparently, running something in a method can slow down your process depending on your data types, which is really weird to me.
It would be really weird if it were not like this! You can write your own class with its own ** operator (by implementing the __pow__(self, other) method), and you could, for example, sleep for 1 s in there. Why should that take as long as raising a float to the power of another?
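A minimal sketch of that point: any class can define its own ** behaviour, so there is no reason to expect every exponentiation to cost the same.
import time

class Slow:
    def __pow__(self, other):
        time.sleep(1)          # arbitrary user-defined work
        return 42

print(Slow() ** 3)             # takes ~1 second, unlike float ** float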
So, yeah, Python is a dynamically typed language. The operations performed on data depend on the type of that data, and different types can generally take different times.
In your first example, the difference never arises, because a) most probably the result gets constant-folded: right after parsing, it's clear that 2**2 is a constant and does not need to be evaluated on every loop iteration. Even if that were not the case, b) the time it costs to run a loop in Python is hundreds of times what it takes to actually execute the math here; again, dynamically typed, dynamically named.
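You can check the constant-folding claim yourself with the dis module (a sketch; the exact bytecode varies by CPython version):
import dis

def f():
    return 2**2

# On recent CPython versions the disassembly shows a single LOAD_CONST 4,
# i.e. 2**2 was already folded to 4 at compile time.
dis.dis(f)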
base**exponent is a whole different story. Nothing about it is constant. So there's actually going to be a calculation on every iteration.
Now, the ** operator (__pow__ in the Python data model) for Python's built-in float type is specified to do floating-point exponentiation (which is implemented in highly optimized C and assembly), as exponentiation can be done elegantly on floating-point numbers. Look for nb_power in CPython's floatobject.c. So, for the float case, the actual calculation is "free" for all that matters: your loop is limited by how much effort it takes to resolve all the names, types and functions to call, not by doing the actual math, which is trivial.
The ** operator on Python's built-in int type is not as neatly optimized. It's a lot more complicated: it needs to do checks like "if the exponent is negative, return a float", and it does not do elementary math that your computer can execute with a single instruction; it handles arbitrary-length integers (remember, a Python integer uses as many bytes as it needs; you can store numbers larger than 64 bits in a Python integer!), which comes with allocations and deallocations. (I encourage you to read long_pow in CPython's longobject.c; it's about 200 lines.)
All in all, integer exponentiation is expensive in Python because of Python's type system.
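You can confirm the asymmetry without hand-rolled timing loops by using timeit, which keeps the operands in variables so nothing gets constant-folded (a sketch; absolute numbers vary by machine):
import timeit

print(timeit.timeit("b ** e", setup="b, e = 2, 2", number=1_000_000))      # int ** int
print(timeit.timeit("b ** e", setup="b, e = 2.0, 2.0", number=1_000_000))  # float ** float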

How does Python's random use system time?

I've searched around but couldn't find an explanation. Please help me. Thanks.
I understand that Python will use the system's time if a seed is not provided for random (to the best of my knowledge). My question is: how does Python use this time? Is it the timestamp or some other format?
I ran the following code:
from time import time
import random

t1 = time()  # this gave 1590236721.1549928
data = [random.randint(0, 100) for x in range(10)]
t2 = time()  # this also gave 1590236721.1549928
Since t1 == t2, I guessed that if the UNIX timestamp is used as the seed, it should be t1, but after trying it like so:
random.seed(t1)
data1 = [random.randint(0, 100) for x in range(10)]
I got different values: data != data1.
I need more explanation/clarification. Thanks.
Python 2
In this Q&A (for Python 2.7), random: what is the default seed?, you can see that Python does not use the result of the time() function "as-is" to get the initial seed (and usually it first tries to get urandom values from the OS; see https://docs.python.org/2/library/os.html#os.urandom):
try:
    # Seed with enough bytes to span the 19937 bit
    # state space for the Mersenne Twister
    a = long(_hexlify(_urandom(2500)), 16)
except NotImplementedError:
    import time
    a = long(time.time() * 256) # use fractional seconds
Python 3
a) As in Python 2, if your OS provides random numbers (with urandom), as on *nix systems, Python will try to use those (see https://docs.python.org/3/library/random.html#bookkeeping-functions). On Windows it uses the Win32 API's CryptGenRandom.
b) Even if it were using time(), it might use the time at the start of your program, which may differ from your first time() call. Note also that the fallback shown above multiplies the time by 256 and truncates it to an integer, so seeding with the raw float t1 would not reproduce the sequence anyway. I don't think you can rely on your method of testing.
Last word of general advice: if you want reproducibility, you should explicitly set the seed yourself, and not rely on those implementation details.
When you run the program more than once without setting the seed, the seed is derived from the current time, so you get different results on each run; every nanosecond gives you a different time value.
try this:
import random

print("Random number with seed 42")
random.seed(42)
print("first Number ", random.randint(1, 99))
random.seed(42)
print("Second Number ", random.randint(1, 99))
random.seed(42)
print("Third Number ", random.randint(1, 99))

PyGMO Batch fitness evaluation

My goal is to perform a parameter estimation (model calibration) using PyGMO. My model will be an external "black box" model (C code) outputting the objective function J to be minimized (J in this case will be the "Normalized Root Mean Square Error" (NRMSE) between model outputs and measured data). To speed up the optimization (calibration) I would like to run my models/simulations on multiple cores/threads in parallel. Therefore I would like to use a batch fitness evaluator (bfe) in PyGMO. I prepared a minimal example using a simple problem class, but using pure Python (no external model) and the Rosenbrock problem:
#!/usr/bin/env python
# coding: utf-8
import numpy as np
from fmpy import read_model_description, extract, simulate_fmu, freeLibrary
from fmpy.fmi2 import FMU2Slave
import pygmo as pg
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
from matplotlib import cm
import time

#-------------------------------------------------------
def main():
    # Optimization
    # Define problem
    class my_problem:
        def __init__(self, dim):
            self.dim = dim

        def fitness(self, x):
            J = np.zeros((1,))
            for i in range(len(x) - 1):
                J[0] += 100.*(x[i + 1] - x[i]**2)**2 + (1. - x[i])**2
            return J

        def get_bounds(self):
            return (np.full((self.dim,), -5.), np.full((self.dim,), 10.))

        def get_name(self):
            return "My implementation of the Rosenbrock problem"

        def get_extra_info(self):
            return "\nDimensions: " + str(self.dim)

        def batch_fitness(self, dvs):
            J = [123] * len(dvs)
            return J

    prob = pg.problem(my_problem(30))
    print('\n----------------------------------------------')
    print('\nProblem description: \n')
    print(prob)

    #-------------------------------------------------------
    dvs = pg.batch_random_decision_vector(prob, 1)
    print('\n----------------------------------------------')
    print('\nBatch fitness evaluation:')
    print('\ndvs length:' + str(len(dvs)))
    print('\ndvs:')
    print(dvs)
    udbfe = pg.default_bfe()
    b = pg.bfe(udbfe=udbfe)
    print('\nudbfe:')
    print(udbfe)
    print('\nbfe:')
    print(b)
    fvs = b(prob, dvs)
    print(fvs)

    #-------------------------------------------------------
    pop_size = 50
    gen_size = 1000
    algo = pg.algorithm(pg.sade(gen=gen_size))  # The algorithm (a self-adaptive form of Differential Evolution, sade - jDE variant)
    algo.set_verbosity(int(gen_size/10))  # Set the verbosity to 100 (i.e. every 100 gens there will be a log line)
    print('\n----------------------------------------------')
    print('\nOptimization:')
    start = time.time()
    pop = pg.population(prob, size=pop_size)  # The initial population
    pop = algo.evolve(pop)  # The actual optimization process
    best_fitness = pop.get_f()[pop.best_idx()]  # Getting the best individual in the population
    print('\n----------------------------------------------')
    print('\nResult:')
    print('\nBest fitness: ', best_fitness)  # Get the best parameter set
    best_parameterset = pop.get_x()[pop.best_idx()]
    print('\nBest parameter set: ', best_parameterset)
    print('\nTime elapsed for optimization: ', time.time() - start, ' seconds\n')

if __name__ == '__main__':
    main()
When I try to run this code I get the following error:
Exception has occurred: ValueError
function: bfe_check_output_fvs
where: C:\projects\pagmo2\src\detail\bfe_impl.cpp, 103
what: An invalid result was produced by a batch fitness evaluation: the number of produced fitness vectors, 30, differs from the number of input decision vectors, 1
By deleting or commenting out these two lines:
fvs = b(prob, dvs)
print(fvs)
the script can be run without errors.
My questions:
How do I use the batch fitness evaluation? (I know this is a new capability of PyGMO and they are still working on the documentation...) Can anybody give a minimal example of how to implement this?
Is this the right way to go to speed up my model calibration problem? Or should I use islands and archipelagos? If I got it right, the islands in an archipelago do not communicate with each other, right? So if one performs e.g. a Particle Swarm Optimization and wants to evaluate several objective function calls simultaneously (in parallel), then the batch fitness evaluator is the right choice?
Do I need to care about archipelagos and islands in this example? What exactly are they meant for? Is it worth running several optimizations with different initial x (input to the objective function) and then taking the best solution? Is this a common approach in optimization with GAs?
I am very new to the field of optimization and PyGMO, so thanks for helping!
Is this the right way to go to speed up my model calibration problem? Or should I use islands and archipelagos? If I got it right, the islands in an archipelago do not communicate with each other, right? So if one performs e.g. a Particle Swarm Optimization and wants to evaluate several objective function calls simultaneously (in parallel), then the batch fitness evaluator is the right choice?
There are 2 modes of parallelization in pagmo, the island model (i.e., coarse-grained parallelization) and the BFE machinery (i.e., fine-grained parallelization).
The island model works on any problem/algorithm combination, and it is based on the idea that multiple optimisations are run in parallel while exchanging information to accelerate the global convergence to a solution.
The BFE machinery, instead, parallelizes a single optimisation, and it requires explicit support in the solver to work. Currently in pagmo only a handful of solvers are able to take advantage of the BFE machinery. The BFE machinery can also be used to parallelise the initialisation of a population of individuals, which can be useful if your fitness function is particularly heavyweight.
Which parallelisation method is best for you depends on the properties of your problem. In my experience, users tend to prefer the BFE machinery (fine-grained parallelisation) if the fitness function is very heavy (e.g., it takes minutes or more to compute), because in such a situation fitness evaluations are so costly that in order to take advantage of the island model one would have to wait too long. The BFE is also in some sense easier to understand because you don't have to delve into the details of archipelagos, topologies, etc. On the other hand, the BFE works only with certain solvers (although we are trying to extend BFE support to other solvers as time goes by).
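For contrast, a minimal island-model sketch (assuming pygmo2's archipelago API; topology and migration details are covered in the pagmo docs):
import pygmo as pg

prob = pg.problem(pg.rosenbrock(dim=30))
algo = pg.algorithm(pg.sade(gen=100))
archi = pg.archipelago(n=8, algo=algo, prob=prob, pop_size=20)  # 8 islands evolving in parallel
archi.evolve()   # runs asynchronously, one optimisation per island
archi.wait()     # block until all islands are done
best = min(archi.get_champions_f(), key=lambda f: f[0])
print(best)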
How do I use the batch fitness evaluation? (I know this is a new capability of PyGMO and they are still working on the documentation...) Can anybody give a minimal example of how to implement this?
One way of using the BFE is what you did in your example, i.e., via the implementation of a batch_fitness() method in your problem. However, my suggestion would be to comment out the batch_fitness() method and try using one of the general-purpose batch fitness evaluators provided with pagmo. The easiest thing to do is to just default-construct an instance of the bfe class and then pass it to one of the algorithms that can use the BFE machinery. One such algorithm is nspso:
https://esa.github.io/pygmo2/algorithms.html#pygmo.nspso
So, something like this:
b = pg.bfe() # Construct a default BFE
uda = pg.nspso(gen = gen_size) # Construct the algorithm
uda.set_bfe(b) # Tell the UDA to use the BFE machinery
algo = pg.algorithm(uda) # Construct a pg.algorithm from the UDA
new_pop = algo.evolve(pop) # Evolve the population
This should use multiple processes to evaluate your fitness functions in parallel within the loop of the nspso algorithm.
If you need more help, please come over to our public users/devs chat room, where you should get assistance rather quickly (normally):
https://gitter.im/pagmo2/Lobby

Python: where is random.random() seeded?

Say I have some python code:
import random
r=random.random()
Where is the value of r seeded from in general?
And what if my OS has no random, then where is it seeded?
Why isn't this recommended for cryptography? Is there some way to know what the random number is?
Follow da code.
To see where the random module "lives" in your system, you can just do in a terminal:
>>> import random
>>> random.__file__
'/usr/lib/python2.7/random.pyc'
That gives you the path to the .pyc ("compiled") file, which is usually located side by side with the original .py where the readable code can be found.
Let's see what's going on in /usr/lib/python2.7/random.py:
You'll see that it creates an instance of the Random class and then (at the bottom of the file) "promotes" that instance's methods to module functions. Neat trick. When the random module is imported anywhere, a new instance of that Random class is created, its values are then initialized and the methods are re-assigned as functions of the module, making it quite random on a per-import (erm... or per-python-interpreter-instance) basis.
_inst = Random()
seed = _inst.seed
random = _inst.random
uniform = _inst.uniform
triangular = _inst.triangular
randint = _inst.randint
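As an aside, a consequence of this design is that you can also construct your own Random instance with its own independent state (a small sketch):
import random

rng = random.Random(1234)   # a private generator with its own seed
print(rng.random())         # independent of the module-level instance
print(random.random())      # the module-level singleton, seeded at import time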
The only thing that this Random class does in its __init__ method is seeding it:
class Random(_random.Random):
    ...
    def __init__(self, x=None):
        self.seed(x)
    ...

_inst = Random()
seed = _inst.seed
So... what happens if x is None (no seed has been specified)? Well, let's check that self.seed method:
def seed(self, a=None):
    """Initialize internal state from hashable object.

    None or no argument seeds from current time or from an operating
    system specific randomness source if available.

    If a is not None or an int or long, hash(a) is used instead.
    """
    if a is None:
        try:
            a = long(_hexlify(_urandom(16)), 16)
        except NotImplementedError:
            import time
            a = long(time.time() * 256) # use fractional seconds
    super(Random, self).seed(a)
    self.gauss_next = None
The docstring already tells you what's going on: this method tries to use the default random generator provided by the OS, and if there's none, it falls back to the current time as the seed value.
But, wait... What the heck is that _urandom(16) thingy then?
Well, the answer lies at the beginning of this random.py file:
from os import urandom as _urandom
from binascii import hexlify as _hexlify
Tadaaa... The seed is a 16-byte number that comes from os.urandom.
Let's say we're in a civilized OS, such as Linux (with a real random number generator). The seed used by the random module is the same as doing:
>>> long(binascii.hexlify(os.urandom(16)), 16)
46313715670266209791161509840588935391L
The reason why specifying a seed value is considered not so great is that the random functions are not really "random"; they're just a very weird sequence of numbers, and that sequence will be the same given the same seed. You can try this yourself:
>>> import random
>>> random.seed(1)
>>> random.randint(0,100)
13
>>> random.randint(0,100)
85
>>> random.randint(0,100)
77
No matter when or how or even where you run that code (as long as the algorithm used to generate the random numbers remains the same), if your seed is 1, you will always get the integers 13, 85, 77... which kind of defeats the purpose (see this about pseudorandom number generation). On the other hand, there are use cases where this is actually a desirable feature.
That's why it's considered "better" to rely on the operating system's random number generator. Those values are usually calculated based on hardware interrupts, which are very, very random (including interrupts from hard drive reads, keystrokes typed by the human user, moving a mouse around...). In Linux, that OS generator is /dev/random. Or, being a tad picky, /dev/urandom (that's what Python's os.urandom actually uses internally). The difference is that (as mentioned before) /dev/random uses hardware interrupts to generate the random sequence. If there are no interrupts, /dev/random could be exhausted and you might have to wait a little bit until you can get the next random number. /dev/urandom uses /dev/random internally, but it guarantees that it will always have random numbers ready for you.
If you're using Linux, just do cat /dev/random in a terminal (and prepare to hit Ctrl+C, because it will start outputting really, really random stuff):
borrajax#borrajax:/tmp$ cat /dev/random
_+�_�?zta����K�����q�ߤk��/���qSlV��{�Gzk`���#p$�*C�F"�B9��o~,�QH���ɭ�f�޺�̬po�2o𷿟�(=��t�0�p|m�e
���-�5�߁ٵ�ED�l�Qt�/��,uD�w&m���ѩ/��;��5Ce�+�M����
~ �4D��XN��?ס�d��$7Ā�kte▒s��ȿ7_���- �d|����cY-�j>�
�b}#�W<դ���8���{�1»
. 75���c4$3z���/̾�(�(���`���k�fC_^C
Python uses the OS random generator or the time as a seed. This means that the only place where I could imagine a potential weakness in Python's random module is when it's used:
In an OS without an actual random number generator, and
In a device where time.time is always reporting the same time (has a broken clock, basically)
If you are concerned about the actual randomness of the random module, you can either go directly to os.urandom or use the random number generator in the pycrypto cryptographic library. Those are probably more random. I say more random because...
