Currently, I am programming a large simulation that draws random variates from multiple distributions, some from numpy and some from scipy.stats, and the distributions should be independent of each other. Looking for a way to ensure reproducibility, I luckily stumbled upon Abhinav's response here, where they provide a great example. However, it only seeds a single distribution from scipy, whereas my code has multiple scipy distributions. Is there a way to seed all scipy distributions at once (while still seeding the numpy distributions)? If not all at once, is it at least possible to seed all of the continuous distributions? (It just seems inefficient to seed every single distribution separately.) Thank you very much!
Edit: A minimal reproducible example can be found below (it is similar to Abhinav's example):
from numpy.random import Generator, PCG64
from scipy.stats import binom, norm
n, p, size, seed = 10, 0.5, 10, 12345
numpy_randomGen = Generator(PCG64(seed))
scipy_randomGen = binom
scipy_randomGen2 = norm
# this is the part I want to simplify, as I have many distributions from scipy
# maybe there is a convention that simplifies it?
scipy_randomGen.random_state = numpy_randomGen
scipy_randomGen2.random_state = numpy_randomGen
print(scipy_randomGen.rvs(n, p, size=size))
print(scipy_randomGen2.rvs(size=size))
print(numpy_randomGen.binomial(n, p, size))
Not sure what you're after. You still need to seed each distribution separately, so there's unlikely to be anything simpler than providing a random_state argument to each rvs call, as in
.rvs(n, p, size=size, random_state=...)
Here the random_state argument can be a Generator or an integer seed (in the latter case, however, it constructs a legacy RandomState object and seeds it under the hood).
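That said, if the goal is just to avoid repeating yourself, you can assign the same Generator to every scipy distribution in a loop; a minimal sketch (the particular set of distributions here is illustrative):
import numpy as np
from scipy.stats import binom, norm, gamma

rng = np.random.default_rng(12345)

# Point every scipy distribution at the same Generator in one pass.
for dist in (binom, norm, gamma):
    dist.random_state = rng

print(binom.rvs(10, 0.5, size=10))
print(norm.rvs(size=10))
print(gamma.rvs(0.5, size=10))
Sharing one Generator keeps the draws from the different distributions independent while making the whole sequence reproducible from a single seed.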
I am trying to simulate possible paths of a stochastic process that is not anchored to any particular point, e.g. fit a SARIMAX model to weather temperature data and then use the model to simulate the temperature.
Here I use the standard demonstration from the statsmodels page as a simpler example:
import numpy as np
import pandas as pd
from scipy.stats import norm
import statsmodels.api as sm
import matplotlib.pyplot as plt
from datetime import datetime
import requests
from io import BytesIO
Fitting the model:
wpi1 = requests.get('https://www.stata-press.com/data/r12/wpi1.dta').content
data = pd.read_stata(BytesIO(wpi1))
data.index = data.t
# Set the frequency
data.index.freq="QS-OCT"
# Fit the model
mod = sm.tsa.statespace.SARIMAX(data['wpi'], trend='c', order=(1,1,1))
res = mod.fit(disp=False)
print(res.summary())
Creating simulation:
res.simulate(len(data), repetitions=10).plot();
Here is the history: (plot omitted)
Here is the simulation: (plot omitted)
The simulated curves are so widely distributed and so far apart from each other that this cannot make sense; the historical process doesn't have nearly that much variance. What am I misunderstanding? How do I perform the simulation correctly?
When you don't pass an initial state, it uses the first predicted state to start the simulation along with its predicted covariance. Since there is no information available to make the first prediction, it uses a diffuse prior with a variance of 1,000,000. This is why you are getting the wide range in your time series. A simple solution is to pass your own initial state using the smoothed_state.
Taking your code above, but using
initial = res.smoothed_state[:, 0]
res.simulate(len(data),
             repetitions=10,
             initial_state=initial).plot()
I get a plot that looks like this: (plot omitted)
The first value is what really matters in this model, and is 30.6. You could add some randomness here directly by drawing the initial state from another (sensible) distribution. The default distribution is not sensible for simulation since it has a diffuse prior (it is, however, very sensible for estimation).
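As a minimal sketch of that idea (assuming res and data from the code above; using the smoothed state covariance here is my assumption, not something prescribed by the answer):
import numpy as np

rng = np.random.default_rng(12345)
# Draw the initial state from a normal centred on the smoothed state,
# using its smoothed covariance rather than the diffuse prior.
mean = res.smoothed_state[:, 0]
cov = res.smoothed_state_cov[:, :, 0]
initial = rng.multivariate_normal(mean, cov)
res.simulate(len(data), repetitions=10, initial_state=initial).plot()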
Other Notes
One other small note: You should not use trend="c" with d=1. You should instead use trend="t" when d=1 so that the model includes a drift. The model you estimate should be
mod = sm.tsa.statespace.SARIMAX(data["wpi"], trend="t", order=(1, 1, 1))
I used this model in the picture above to capture the positive trend in the data.
Can someone help me generate random numbers from the gamma distribution in Python? I have tried these two possibilities, but I'm still wondering about the main difference between them.
The first one is:
import numpy as np

shape, scale = 0.5, 1
size = (1024, 10)
np.random.gamma(shape, scale, size)
and the second one is:
from scipy.stats import gamma

gamma.rvs(0.5, scale=1, size=(1024, 10))
I think both of them are used to generate random samples following the gamma distribution, so what is the difference between these syntaxes? When should we use the first method and when the second one?
There is no difference between the two except that one is from the numpy library and the other from scipy. The probability density function used to create the gamma distribution is the same in both cases.
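A quick way to convince yourself is to drive both APIs from identically seeded generators; a minimal sketch (scipy happens to delegate gamma sampling to the underlying numpy generator, so the draws may even match exactly, though that is an implementation detail rather than a guarantee):
import numpy as np
from scipy.stats import gamma

shape, scale = 0.5, 1
a = np.random.default_rng(42).gamma(shape, scale, size=(1024, 10))
b = gamma.rvs(shape, scale=scale, size=(1024, 10),
              random_state=np.random.default_rng(42))
print(a.mean(), b.mean())  # both close to shape * scale = 0.5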
I have some data that I have to test to see if it comes from a Weibull distribution with unknown parameters. In R I could use https://cran.r-project.org/web/packages/KScorrect/index.html but I can't find anything in Python.
Using scipy.stats I can fit parameters with:
scipy.stats.weibull_min.fit(values)
However, in order to turn this into a test, I think I need to perform some Monte-Carlo simulation (e.g. https://en.m.wikipedia.org/wiki/Lilliefors_test), and I am not sure exactly what to do.
How can I make such a test in Python?
The Lilliefors test is implemented in OpenTURNS. All you have to do is use the Factory that corresponds to the distribution you want to fit.
In the following script, I simulate a Weibull sample of size 10 and perform the Kolmogorov-Smirnov test using a sampling size equal to 1000. This means that the KS statistic is simulated 1000 times.
import openturns as ot

sample = ot.WeibullMin().getSample(10)
ot.ResourceMap.SetAsUnsignedInteger("FittingTest-KolmogorovSamplingSize", 1000)
distributionFactory = ot.WeibullMinFactory()
dist, result = ot.FittingTest.Kolmogorov(sample, distributionFactory, 0.01)
print('Conclusion=', result.getBinaryQualityMeasure())
print('P-value=', result.getPValue())
More details can be found at:
http://openturns.github.io/openturns/latest/examples/data_analysis/kolmogorov_test.html
http://openturns.github.io/openturns/latest/examples/data_analysis/kolmogorov_distribution.html
One way around this: estimate the distribution parameters, draw data from the estimated distribution, and run a KS test to check that both samples come from the same distribution.
Let's create some "original" data:
>>> values = scipy.stats.weibull_min.rvs( 0.33, size=1000)
Now,
>>> args = scipy.stats.weibull_min.fit(values)
>>> print(args)
(0.32176317627928856, 1.249788665927261e-09, 0.9268793667654682)
>>> scipy.stats.kstest(values, 'weibull_min', args=args, N=100000)
KstestResult(statistic=0.033808945722737016, pvalue=0.19877935361964738)
The last line is equivalent to:
scipy.stats.ks_2samp(values, scipy.stats.weibull_min.rvs(*args, size=100000))
So, once you estimate the parameters of the distribution, you can test it pretty reliably. But the scipy estimator is not very good; it took me several runs to get even "close" to the original distribution.
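If you want the Monte-Carlo correction from the Lilliefors link in the question (refit the parameters on each simulated sample, so that the null distribution of the KS statistic accounts for the estimation step), a minimal sketch could look like this (the function name and defaults are mine):
import numpy as np
import scipy.stats

def lilliefors_weibull(values, n_sim=1000, seed=0):
    rng = np.random.default_rng(seed)
    args = scipy.stats.weibull_min.fit(values)
    stat = scipy.stats.kstest(values, 'weibull_min', args=args).statistic
    sim_stats = np.empty(n_sim)
    for i in range(n_sim):
        # Simulate under the fitted null, then refit before computing the KS statistic.
        sample = scipy.stats.weibull_min.rvs(*args, size=len(values), random_state=rng)
        sim_args = scipy.stats.weibull_min.fit(sample)
        sim_stats[i] = scipy.stats.kstest(sample, 'weibull_min', args=sim_args).statistic
    # p-value: fraction of simulated statistics at least as extreme as the observed one
    return stat, np.mean(sim_stats >= stat)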
I get different results (test accuracy) every time I run the imdb_lstm.py example from Keras framework (https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py)
The code contains np.random.seed(1337) in the top, before any keras imports. It should prevent it from generating different numbers for every run. What am I missing?
UPDATE: How to repro:
Install Keras (http://keras.io/)
Execute https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py a few times. It will train the model and output test accuracy.
Expected result: Test accuracy is the same on every run.
Actual result: Test accuracy is different on every run.
UPDATE2: I'm running it on Windows 8.1 with MinGW/msys, module versions:
theano 0.7.0
numpy 1.8.1
scipy 0.14.0c1
UPDATE3: I narrowed the problem down a bit. If I run the example on the GPU (with the theano flag device=gpu0) then I get a different test accuracy every time, but if I run it on the CPU then everything works as expected. My graphics card is an NVIDIA GeForce GT 635.
You can find the answer at the Keras docs: https://keras.io/getting-started/faq/#how-can-i-obtain-reproducible-results-using-keras-during-development.
In short, to be absolutely sure that you will get reproducible results with your Python script on one machine's CPU, you have to do the following:
Set the PYTHONHASHSEED environment variable at a fixed value
Set the python built-in pseudo-random generator at a fixed value
Set the numpy pseudo-random generator at a fixed value
Set the tensorflow pseudo-random generator at a fixed value
Configure a new global tensorflow session
Following the Keras link at the top, the source code I am using is the following:
# Seed value
# Apparently you may use different seed values at each stage
seed_value= 0
# 1. Set the `PYTHONHASHSEED` environment variable at a fixed value
import os
os.environ['PYTHONHASHSEED']=str(seed_value)
# 2. Set the `python` built-in pseudo-random generator at a fixed value
import random
random.seed(seed_value)
# 3. Set the `numpy` pseudo-random generator at a fixed value
import numpy as np
np.random.seed(seed_value)
# 4. Set the `tensorflow` pseudo-random generator at a fixed value
import tensorflow as tf
tf.random.set_seed(seed_value)
# for later versions:
# tf.compat.v1.set_random_seed(seed_value)
# 5. Configure a new global `tensorflow` session
from keras import backend as K
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
K.set_session(sess)
# for later versions:
# session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
# sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(), config=session_conf)
# tf.compat.v1.keras.backend.set_session(sess)
Needless to say, you do not have to specify any seed or random_state for the numpy, scikit-learn or tensorflow/keras functions used in your script, precisely because the source code above sets their pseudo-random generators globally to fixed values.
Theano's documentation talks about the difficulties of seeding random variables and why they seed each graph instance with its own random number generator.
Sharing a random number generator between different RandomOp
instances makes it difficult to produce the same stream regardless
of other ops in the graph, and to keep RandomOps isolated.
Therefore, each RandomOp instance in a graph will have its very
own random number generator. That random number generator is an input
to the function. In typical usage, we will use the new features of
function inputs (value, update) to pass and update the rng
for each RandomOp. By passing RNGs as inputs, it is possible to
use the normal methods of accessing function inputs to access each
RandomOp's rng. In this approach there is no pre-existing
mechanism to work with the combined random number state of an entire
graph. So the proposal is to provide the missing functionality (the
last three requirements) via auxiliary functions: seed, getstate,
setstate.
They also provide examples on how to seed all the random number generators.
You can also seed all of the random variables allocated by a
RandomStreams object by that object’s seed method. This seed will be
used to seed a temporary random number generator, that will in turn
generate seeds for each of the random variables.
>>> srng.seed(902340) # seeds rv_u and rv_n with different seeds each
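Putting that together, a minimal sketch (assuming a Theano installation; the variable names mirror the docs):
from theano.tensor.shared_randomstreams import RandomStreams

srng = RandomStreams(seed=234)
rv_u = srng.uniform((2, 2))  # each random variable gets its own rng
rv_n = srng.normal((2, 2))
srng.seed(902340)            # reseeds rv_u and rv_n with different seeds each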
I finally got reproducible results with my code. It's a combination of answers I saw around the web. The first thing is to do what @alex says:
Set numpy.random.seed;
Use PYTHONHASHSEED=0 for Python 3.
Then you have to address the cuDNN issue noted by @user2805751 by calling your Keras code with the following additional THEANO_FLAGS:
dnn.conv.algo_bwd_filter=deterministic,dnn.conv.algo_bwd_data=deterministic
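If you would rather set these from inside Python than on the command line, a sketch (the flags must be in the environment before theano is imported):
import os
# THEANO_FLAGS is read when theano is imported, so set it first.
os.environ['THEANO_FLAGS'] = ('dnn.conv.algo_bwd_filter=deterministic,'
                              'dnn.conv.algo_bwd_data=deterministic')
import theano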
And finally, you have to patch your Theano installation as per this comment, which basically consists of:
replacing all calls to the *_dev20 operators with their regular versions in theano/sandbox/cuda/opt.py.
This should get you the same results for the same seed.
Note that there might be a slowdown. I saw a running time increase of about 10%.
In Tensorflow 2.0 you can set random seed like this:
import tensorflow as tf
tf.random.set_seed(221)
from tensorflow import keras
from tensorflow.keras import layers
model = keras.Sequential([
    layers.Dense(2, name='one'),
    layers.Dense(3, activation='sigmoid', name='two'),
    layers.Dense(2, name='three'),
])
x = tf.random.uniform((12,12))
model(x)
The problem is now solved in Tensorflow 2.0! I had the same issue with TF 1.x (see If Keras results are not reproducible, what's the best practice for comparing models and choosing hyper parameters?), but the following setup gives me reproducible results:
import os
####*IMPORTANT*: Have to do this line *before* importing tensorflow
os.environ['PYTHONHASHSEED']=str(1)
import tensorflow as tf
import tensorflow.keras as keras
import tensorflow.keras.layers
import random
import pandas as pd
import numpy as np
def reset_random_seeds():
    os.environ['PYTHONHASHSEED'] = str(1)
    tf.random.set_seed(1)
    np.random.seed(1)
    random.seed(1)
#make some random data
reset_random_seeds()
NUM_ROWS = 1000
NUM_FEATURES = 10
random_data = np.random.normal(size=(NUM_ROWS, NUM_FEATURES))
df = pd.DataFrame(data=random_data, columns=['x_' + str(ii) for ii in range(NUM_FEATURES)])
y = df.sum(axis=1) + np.random.normal(size=(NUM_ROWS))
def run(x, y):
    reset_random_seeds()
    model = keras.Sequential([
        keras.layers.Dense(40, input_dim=df.shape[1], activation='relu'),
        keras.layers.Dense(20, activation='relu'),
        keras.layers.Dense(10, activation='relu'),
        keras.layers.Dense(1, activation='linear')
    ])
    NUM_EPOCHS = 500
    model.compile(optimizer='adam', loss='mean_squared_error')
    model.fit(x, y, epochs=NUM_EPOCHS, verbose=0)
    predictions = model.predict(x).flatten()
    loss = model.evaluate(x, y)  # this prints out the loss as a side-effect
#With Tensorflow 2.0 this is now reproducible!
run(df, y)
run(df, y)
run(df, y)
This works for me:
SEED = 123456
import os
import random as rn
import numpy as np
from tensorflow import set_random_seed
os.environ['PYTHONHASHSEED']=str(SEED)
np.random.seed(SEED)
set_random_seed(SEED)
rn.seed(SEED)
It is easier than it seems. Just this works:
import numpy as np
import tensorflow as tf
import random as python_random
def reset_seeds():
    np.random.seed(123)
    python_random.seed(123)
    tf.random.set_seed(1234)
reset_seeds()
The KEY point, and it is very important, is to call the function reset_seeds() every time before running the model. Doing that, you will obtain reproducible results, as I checked in Google Colab.
I would like to add something to the previous answers. If you use Python 3 and want reproducible results on every run, you have to:
set numpy.random.seed at the beginning of your code
set PYTHONHASHSEED=0 in the environment when launching the Python interpreter
I have trained and tested Sequential() neural networks using Keras, performing nonlinear regression on noisy speech data. I used the following code to seed the random generator:
import numpy as np
seed = 7
np.random.seed(seed)
I get the exact same results of val_loss each time I train and test on the same data.
I agree with the previous comment, but reproducible results sometimes also need the same environment (e.g. installed packages, machine characteristics and so on). Hence, I recommend copying your environment somewhere else in order to have reproducible results. Try one of the following technologies:
Docker. If you are on Linux, this makes it very easy to move your environment elsewhere. You can also use DockerHub.
Binder. This is a cloud platform for reproducing scientific experiments.
Everware. This is yet another cloud platform for "reusable science". See the project repository on Github.
Unlike what has been said before, only the Tensorflow seed has an effect on the random generation of weights (latest versions: Tensorflow 2.6.0 and Keras 2.6.0).
Here is a small test you can run to check the influence of each seed (with np being numpy, tf being tensorflow and random the Python random library):
# Testing how seeds influence results
# -----------------------------------
import os
import random
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Example shapes; the original snippet left these undefined
input_shape = (4,)
output_nb = 3

print("Seed specification")
my_seed = 36
# To vary the python hash, numpy random, python random and tensorflow random seeds
a, b, c, d = 0, 0, 0, 0
os.environ['PYTHONHASHSEED'] = str(my_seed + a)  # Has no effect
np.random.seed(my_seed + b)                      # Has no effect
random.seed(my_seed + c)                         # Has no effect
tf.random.set_seed(my_seed + d)                  # Has an effect

print("Making ML model")
keras.mixed_precision.set_global_policy('float64')
model = keras.Sequential([
    layers.Dense(2, input_shape=input_shape),  # optionally: activation='relu'
    layers.Dense(output_nb, activation=None),
])

weights_save = model.get_weights()
print("Some weights:", weights_save[0].flatten())
We notice that the variables a, b and c have no effect on the results; only d does.
So, in the latest versions of Tensorflow, only tensorflow random seed has an influence on the random choice of weights.