Creating same random number sequence in Python, NumPy and R - python

Python, NumPy and R all use the same algorithm (Mersenne Twister) for generating random number sequences. Thus, theoretically speaking, setting the same seed should result in same random number sequences in all 3. This is not the case. I think the 3 implementations use different parameters causing this behavior.
R
>set.seed(1)
>runif(5)
[1] 0.2655087 0.3721239 0.5728534 0.9082078 0.2016819
Python
In [3]: random.seed(1)
In [4]: [random.random() for x in range(5)]
Out[4]:
[0.13436424411240122,
0.8474337369372327,
0.763774618976614,
0.2550690257394217,
0.49543508709194095]
NumPy
In [23]: import numpy as np
In [24]: np.random.seed(1)
In [25]: np.random.rand(5)
Out[25]:
array([ 4.17022005e-01, 7.20324493e-01, 1.14374817e-04,
3.02332573e-01, 1.46755891e-01])
Is there some way, where NumPy and Python implementation could produce the same random number sequence? Ofcourse as some comments and answers point out, one could use rpy. What I am specifically looking for is to fine tune the parameters in the respective calls in Python and NumPy to get the sequence.
Context: The concern comes from an EDX course offering in which R is used. In one of the forums, it was asked if Python could be used and the staff replied that some assignments would require setting specific seeds and submitting answers.
Related:
Comparing Matlab and Numpy code that uses random number generation From this it seems that the underlying NumPy and Matlab implementation are similar.
python vs octave random generator: This question does come fairly close to the intended answer. Some sort of wrapper around the default state generator is required.

use rpy2 to call r in python, here is a demo, the numpy array data is sharing memory with x in R:
import rpy2.robjects as robjects
data = robjects.r("""
set.seed(1)
x <- runif(5)
""")
print np.array(data)
data[1] = 1.0
print robjects.r["x"]

I realize this is an old question, but I've stumbled upon the same problem recently, and created a solution which can be useful to others.
I've written a random number generator in C, and linked it to both R and Python. This way, the random numbers are guaranteed to be the same in both languages since they are generated using the same C code.
The program is called SyncRNG and can be found here: https://github.com/GjjvdBurg/SyncRNG.

Related

reproduce numpy random numbers with numpy rng

I have a script using np.random.randint that was seeded withnp.random.seed(0). Now I'd like to find how I can get the same numbers by creating a separate rng object rng = np.random.default_rng(0). Naively I would have expected these two ways to be identical, but apparently this is not the case. The reason is that I'm trying to reproduce some error from some script that used the built in method, but unfortunately now another script also influences the state of that built in RNG, which is why I'd like to decouple these.
Can anyone tell me what I have to do to get the same numbers form the rng as I do get from np.random's built in RNG?
The reason
import numpy as np
def builtin(seed=0):
np.random.seed(seed)
print(np.random.randint(0, 10, 5))
def default(seed=0):
rng = np.random.default_rng(seed)
print(rng.integers(0, 10, 5))
# unfortunately do not produce the same results:
builtin()
default()
numpy.random.default_rng uses the new-style Generator API. This is mostly superior to the old functionality, and should be preferred in new code unless you have a specific reason not to use it, but it is not backward compatible with the old API.
The old API's way of creating a new RNG object is numpy.random.RandomState:
rng = np.random.RandomState(0)
print(rng.randint(0, 10, 5))

My python codes in general are very slow, is this normal?

I recently began self-learning python, and have been using this language for an online course in algorithms. For some reason, many of my codes I created for this course are very slow (relatively to C/C++ Matlab codes I have created in the past), and I'm starting to worry that I am not using python properly.
Here is a simple python and matlab code to compare their speed.
MATLAB
for i = 1:100000000
a = 1 + 1
end
Python
for i in list(range(0, 100000000)):
a=1 + 1
The matlab code takes about 0.3 second, and the python code takes about 7 seconds. Is this normal? My python codes for much complex problems are very slow. For example, as a HW assignment, I'm running depth first search on a graph with about 900000 nodes, and this is taking forever. Thank you.
Performance is not an explicit design goal of Python:
Don’t fret too much about performance--plan to optimize later when
needed.
That's one of the reasons why Python integrated with a lot of high performance calculating backend engines, such as numpy, OpenBLAS and even CUDA, just to name a few.
The best way to go foreward if you want to increase performance is to let high-performance libraries do the heavy lifting for you. Optimizing loops within Python (by using xrange instead of range in Python 2.7) won't get you very dramatic results.
Here is a bit of code that compares different approaches:
Your original list(range())
The suggestes use of xrange()
Leaving the i out
Using numpy to do the addition using numpy array's (vector addition)
Using CUDA to do vector addition on the GPU
Code:
import timeit
import matplotlib.pyplot as mplplt
iter = 100
testcode = [
"for i in list(range(1000000)): a = 1+1",
"for i in xrange(1000000): a = 1+1",
"for _ in xrange(1000000): a = 1+1",
"import numpy; one = numpy.ones(1000000); a = one+one",
"import pycuda.gpuarray as gpuarray; import pycuda.driver as cuda; import pycuda.autoinit; import numpy;" \
"one_gpu = gpuarray.GPUArray((1000000),numpy.int16); one_gpu.fill(1); a = (one_gpu+one_gpu).get()"
]
labels = ["list(range())", "i in xrange()", "_ in xrange()", "numpy", "numpy and CUDA"]
timings = [timeit.timeit(t, number=iter) for t in testcode]
print labels, timings
label_idx = range(len(labels))
mplplt.bar(label_idx, timings)
mplplt.xticks(label_idx, labels)
mplplt.ylabel('Execution time (sec)')
mplplt.title('Timing of integer addition in python 2.7\n(smaller value is better performance)')
mplplt.show()
Results (graph) ran on Python 2.7.13 on OSX:
The reason that Numpy performs faster than the CUDA solution is that the overhead of using CUDA does not beat the efficiency of Python+Numpy. For larger, floating point calculations, CUDA does even better than Numpy.
Note that the Numpy solution performs more that 80 times faster than your original solution. If your timings are correct, this would even be faster than Matlab...
A final note on DFS (Depth-afirst-Search): here is an interesting article on DFS in Python.
Try using xrange instead of range.
The difference between them is that **xrange** generates the values as you use them instead of range, which tries to generate a static list at runtime.
Unfortunately, python's amazing flexibility and ease comes at the cost of being slow. And also, for such large values of iteration, I suggest using itertools module as it has faster caching.
The xrange is a good solution however if you want to iterate over dictionaries and such, it's better to use itertools as in that, you can iterate over any type of sequence object.

Speeding up a big a loop with a logarithm in Python

I'm trying to speed up the following code:
from math import log
from random import random
def logtest1(N):
tr=0
for i in range(1,N):
T= 40 + 10*random()
tr += -log(random())/T
I'm fairly new to python (coming from matlab)... and this same code runs 5x slower in python than matlab (and Julia), which got my attention.
I tried using a numba and a parakeet wrapper, and numpy functions instead of python ones, but didn't get any improvement at all.
I'd appreciate any help.
Thanks.
edit: the whole thing is a Monte Carlo simulation, so N is very large... 10e6 for testing purposes
You should really be looking into numpy. And Scipy, while you're at it. Numpy is a speed-optimized package for N-dimensional array numerics, and Scipy is a collection of scientific computing stuff built upon numpy.
If you write the function using numpy arrays, it looks like this:
def logtest2(N):
T = 40. + 10. * np.random.rand(N)
return np.sum(-1*np.log(np.random.rand(N)) / T)
It's also a lot faster. Testing with N = 1000000 gave me a runtime of 500ms for your version and 75ms for this one.
Right off the bat id say use xrange if you are using Python 2.7x
So in 2.7 it would be:
def logtest1(N):
tr=0
for i in xrange(N):
a = random() # Just generate the random number once
T= 40 + 10*a
tr += -log(a)/T
Here is a summary on why xrange is better: Should you always favor xrange() over range()?

Python NumPy vs Octave/MATLAB precision

This question is about precision of computation using NumPy vs. Octave/MATLAB (the MATLAB code below has only been tested with Octave, however). I am aware of a similar question on Stackoverflow, namely this, but that seems somewhat far from what I'm asking below.
Setup
Everything is running on Ubuntu 14.04.
Python version 3.4.0.
NumPy version 1.8.1 compiled against OpenBLAS.
Octave version 3.8.1 compiled against OpenBLAS.
Sample Code
Sample Python code.
import numpy as np
from scipy import linalg as la
def build_laplacian(n):
lap=np.zeros([n,n])
for j in range(n-1):
lap[j+1][j]=1
lap[j][j+1]=1
lap[n-1][n-2]=1
lap[n-2][n-1]=1
return lap
def evolve(s, lap):
wave=la.expm(-1j*s*lap).dot([1]+[0]*(lap.shape[0]-1))
for i in range(len(wave)):
wave[i]=np.linalg.norm(wave[i])**2
return wave
We now run the following.
np.min(evolve(2, build_laplacian(500)))
which gives something on the order of e-34.
We can produce similar code in Octave/MATLAB:
function lap=build_laplacian(n)
lap=zeros(n,n);
for i=1:(n-1)
lap(i+1,i)=1;
lap(i,i+1)=1;
end
lap(n,n-1)=1;
lap(n-1,n)=1;
end
function result=evolve(s, lap)
d=zeros(length(lap(:,1)),1); d(1)=1;
result=expm(-1i*s*lap)*d;
for i=1:length(result)
result(i)=norm(result(i))^2;
end
end
We then run
min(evolve(2, build_laplacian(500)))
and get 0. In fact, evolve(2, build_laplacian(500)))(60) gives something around e-100 or less (as expected).
The Question
Does anyone know what would be responsible for such a large discrepancy between NumPy and Octave (again, I haven't tested the code with MATLAB, but I'd expect to see similar results).
Of course, one can also compute the matrix exponential by first diagonalizing the matrix. I have done this and have gotten similar or worse results (with NumPy).
EDITS
My scipy version is 0.14.0. I am aware that Octave/MATLAB use the Pade approximation scheme, and am familiar with this algorithm. I am not sure what scipy does, but we can try the following.
Diagonalize the matrix with numpy's eig or eigh (in our case the latter works fine since the matrix is Hermitian). As a result we get two matrices: a diagonal matrix D, and the matrix U, with D consisting of eigenvalues of the original matrix on the diagonal, and U consists of the corresponding eigenvectors as columns; so that the original matrix is given by U.T.dot(D).dot(U).
Exponentiate D (this is now easy since D is diagonal).
Now, if M is the original matrix and d is the original vector d=[1]+[0]*n, we get scipy.linalg.expm(-1j*s*M).dot(d)=U.T.dot(numpy.exp(-1j*s*D).dot(U.dot(d)).
Unfortunately, this produces the same result as before. Thus this probably has something to do either with the way numpy.linalg.eig and numpy.linalg.eigh work, or with the way numpy does arithmetic internally.
So the question is: how do we increase numpy's precision? Indeed, as mentioned above, Octave seems to do a much finer job in this case.
The following code
import numpy as np
from scipy import linalg as la
import scipy
print np.__version__
print scipy.__version__
def build_laplacian(n):
lap=np.zeros([n,n])
for j in range(n-1):
lap[j+1][j]=1
lap[j][j+1]=1
lap[n-1][n-2]=1
lap[n-2][n-1]=1
return lap
def evolve(s, lap):
wave=la.expm(-1j*s*lap).dot([1]+[0]*(lap.shape[0]-1))
for i in range(len(wave)):
wave[i]=la.norm(wave[i])**2
return wave
r = evolve(2, build_laplacian(500))
print np.min(abs(r))
print r[59]
prints
1.8.1
0.14.0
0
(2.77560227344e-101+0j)
for me, with OpenBLAS 0.2.8-6ubuntu1.
So it appears your problem is not immediately reproduced. Your code examples above are not runnable as-is (typos).
As mentioned in scipy.linalg.expm documentation, the algorithm is from Al-Mohy and Higham (2009), which is different from the simpler scale-and-square-Pade in Octave.
As a consequence, the results also I get from Octave are slightly different, although the results are eps-close in matrix norms (1,2,inf). MATLAB uses the Pade approach from Higham (2005), which seems to give the same results as Scipy above.

Python NumPy log2 vs MATLAB

I'm a Python newbie coming from using MATLAB extensively. I was converting some code that uses log2 in MATLAB and I used the NumPy log2 function and got a different result than I was expecting for such a small number. I was surprised since the precision of the numbers should be the same (i.e. MATLAB double vs NumPy float64).
MATLAB Code
a = log2(64);
--> a=6
Base Python Code
import math
a = math.log2(64)
--> a = 6.0
NumPy Code
import numpy as np
a = np.log2(64)
--> a = 5.9999999999999991
Modified NumPy Code
import numpy as np
a = np.log(64) / np.log(2)
--> a = 6.0
So the native NumPy log2 function gives a result that causes the code to fail a test since it is checking that a number is a power of 2. The expected result is exactly 6, which both the native Python log2 function and the modified NumPy code give using the properties of the logarithm. Am I doing something wrong with the NumPy log2 function? I changed the code to use the native Python log2 for now, but I just wanted to know the answer.
No. There is nothing wrong with the code, it is just because floating points cannot be represented perfectly on our computers. Always use an epsilon value to allow a range of error while checking float values. Read The Floating Point Guide and this post to know more.
EDIT - As cgohlke has pointed out in the comments,
Depending on the compiler used to build numpy np.log2(x) is either computed by the C library or as 1.442695040888963407359924681001892137*np.log(x) See this link.
This may be a reason for the erroneous output.

Categories

Resources