Is it possible to reproduce randn() of MATLAB with NumPy? - python

I wonder if it is possible to exactly reproduce the whole sequence of randn() of MATLAB with NumPy. I coded my own routine with Python/NumPy, and it gives slightly different results from the MATLAB code somebody else wrote. I am having a hard time finding out where the difference comes from because the random draws differ.
I have found the numpy.random.seed value which produces the same number for the first draw, but from the second draw on, the numbers are completely different. I'm making multivariate normal draws about 20,000 times, so I don't want to just save the MATLAB draws and read them in Python.

The user asked if it was possible to reproduce the output of randn() of Matlab, not rand. I have not been able to set the algorithm or seed to reproduce the exact number for randn(), but the solution below works for me.
In MATLAB: generate your normally distributed random numbers as follows:
rng(1);
norminv(rand(1,5),0,1)
ans =
-0.2095 0.5838 -3.6849 -0.5177 -1.0504
In Python: generate your normally distributed random numbers as follows:
import numpy as np
from scipy.stats import norm
np.random.seed(1)
norm.ppf(np.random.rand(1,5))
array([[-0.2095, 0.5838, -3.6849, -0.5177,-1.0504]])
It is quite convenient to have functions that reproduce the same random numbers when moving from MATLAB to Python or vice versa.

If you seed both random number generators identically, they should in theory produce the same numbers. I am not quite sure how best to do it, but this seems to work. In MATLAB do:
rand('twister', 5489)
and correspondingly in NumPy:
np.random.seed(5489)
to (re)initialize your random number generators. For me this gives the same numbers for rand() and np.random.random(), but not for randn; I am not sure if there is an easy method for that.
With newer MATLAB versions you can probably set up a RandStream with the same properties as NumPy; for older versions you can reproduce NumPy's randn in MATLAB (or vice versa). NumPy uses the polar form to create normal numbers from the uniform numbers of np.random.random() (the second algorithm given here: http://www.taygeta.com/random/gaussian.html). You could write that algorithm in MATLAB to create the same randn numbers as NumPy does, starting from MATLAB's rand function.
If you don't need a huge amount of random numbers, just save them in a .mat file and read them with scipy.io though...
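To make the polar-form idea concrete, here is a minimal Python sketch that builds normal draws from uniform draws. The function name polar_randn is made up for illustration, and the order in which the two values of each pair are returned is my assumption about how NumPy's legacy generator behaves, so check it against np.random.randn on your own installation before relying on it.

```python
import math
import numpy as np

def polar_randn(n, seed):
    """Draw n normal variates from uniform draws using the polar method."""
    rs = np.random.RandomState(seed)  # legacy Mersenne Twister generator
    out = []
    while len(out) < n:
        # Rejection step: keep drawing until the point lies inside the unit circle.
        while True:
            x1 = 2.0 * rs.random_sample() - 1.0
            x2 = 2.0 * rs.random_sample() - 1.0
            r2 = x1 * x1 + x2 * x2
            if 0.0 < r2 < 1.0:
                break
        f = math.sqrt(-2.0 * math.log(r2) / r2)
        # Assumed ordering: f * x2 comes first, f * x1 is kept for the next draw.
        out.extend([f * x2, f * x1])
    return np.array(out[:n])

draws = polar_randn(5, 5489)
```

The same loop translated to MATLAB, fed by rand after rand('twister', 5489), should then give the same values as long as the uniform streams agree.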

Just wanted to further clarify the twister/seeding method: MATLAB and NumPy generate the same sequence with this seeding, but they fill matrices in a different order.
MATLAB fills a matrix column by column (column-major), while NumPy fills it row by row (row-major). So in order to get the same matrices in both, you have to transpose:
MATLAB:
rand('twister', 1337);
A = rand(3,5)
A =
Columns 1 through 2
0.262024675015582 0.459316887214567
0.158683972154466 0.321000540520167
0.278126519494360 0.518392820597537
Columns 3 through 4
0.261942925565145 0.115274226683149
0.976085284877434 0.386275068634359
0.732814552690482 0.628501179539712
Column 5
0.125057926335599
0.983548605143641
0.443224868645128
python:
import numpy as np
np.random.seed(1337)
A = np.random.random((5,3))
A.T
array([[ 0.26202468, 0.45931689, 0.26194293, 0.11527423, 0.12505793],
[ 0.15868397, 0.32100054, 0.97608528, 0.38627507, 0.98354861],
[ 0.27812652, 0.51839282, 0.73281455, 0.62850118, 0.44322487]])
Note: I also placed this answer on this similar question: Comparing Matlab and Numpy code that uses random number generation
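As an alternative to the transpose, a small sketch: draw the numbers as a flat vector and reshape with order='F', which fills column by column the way MATLAB does. The variable names are only illustrative.

```python
import numpy as np

np.random.seed(1337)
a_transposed = np.random.random((5, 3)).T

np.random.seed(1337)
# order='F' (Fortran order) fills the 3x5 result column by column.
a_fortran = np.random.random(15).reshape((3, 5), order='F')

same = np.array_equal(a_transposed, a_fortran)
```

Both routes consume the same 15 uniform draws, so they should produce the same 3x5 matrix.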

Related

How to deal with large integers in NumPy?

I'm doing a data analysis project where I'm working with really large numbers. I originally did everything in pure python but I'm now trying to do it with numpy and pandas. However it seems like I've hit a roadblock, since it is not possible to handle integers larger than 64 bits in numpy (if I use python ints in numpy they max out at 9223372036854775807). Do I just throw away numpy and pandas completely or is there a way to use them with python-style arbitrary large integers? I'm okay with a performance hit.
By default NumPy stores elements in a fixed-width numeric dtype (such as int64).
But you can force the dtype to object, like below:
import numpy as np
x = np.array([10,20,30,40], dtype=object)
x_exp2 = 1000**x
print(x_exp2)
the output is
[1000000000000000000000000000000
1000000000000000000000000000000000000000000000000000000000000
1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000]
The drawback is that the execution is much slower.
Later edit to show that np.sum() works too. There could be some limitations, of course.
import numpy as np
x = np.array([10,20,30,40], dtype=object)
x_exp2 = 1000**x
print(x_exp2)
print(np.sum(x_exp2))
print(np.prod(x_exp2))
and the output is:
[1000000000000000000000000000000
1000000000000000000000000000000000000000000000000000000000000
1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000]
1000000000000000000000000000001000000000000000000000000000001000000000000000000000000000001000000000000000000000000000000
1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
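To show the difference between the two dtypes side by side, here is a small sketch with made-up variable names: the same multiplication is done once in int64, where the result no longer fits, and once with dtype=object, where each element is an ordinary Python int.

```python
import numpy as np

fixed = np.array([2**62], dtype=np.int64)
wrapped = fixed * 4   # 2**64 does not fit in int64, so the result is not exact

big = np.array([2**62], dtype=object)
exact = big * 4       # element-wise Python int arithmetic, no overflow

print(wrapped[0] == 2**64)
print(exact[0] == 2**64)
```

The object array keeps full precision because NumPy simply calls the Python operators on each element, which is also why it is slower.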

Why is the result of numpy fft different from matlab fft?

I was using parameters and formulations below to generate signals.
python code:
import numpy as np
fs=15e6
dt=1/fs
f0=1e6
pri=400e-6
t=np.arange(0,pri,dt)
i=64
fd=5/(i*pri)
xt=0.1*np.exp(2j*np.pi*f0*t)
xf=np.fft.fft(xt)
matlab code is very similar with python code:
fs=15e6
dt=1/fs
f0=1e6
pri=400e-6
t=0:dt:pri-dt
i=64
fd=5/(i*pri)
xt=0.1*exp(2j*pi*f0*t)
xf=fft(xt)
This code generates an array of length 6000 on which to perform the FFT. I calculate the result in MATLAB using the same method. The results are identical when the FFT length is less than 6000, but they differ slightly when the FFT length is 6000.
The result of xf in python is:
xf[:5] = [4.68819428e-12-2.53650626e-12j,
6.55886345e-12+4.51937973e-13j,
5.91758655e-12+4.48215898e-12j,
2.07297400e-12+6.37992397e-12j,
-1.44454940e-12+5.60550355e-12j]
The result of xf in matlab is:
xf(1:5) = 5.165829569664382e-12+1.503743771929872e-12j
4.389776854811194e-12+5.127317569216533e-12j
1.067288620484369e-12+7.191186166371298e-12j
-3.058138112418996e-12+6.189531470616248e-12j
-5.288313073640339e-12+2.908982377132765e-12j
If I use length 5999 for the FFT, like this in Python:
xf=np.fft.fft(xt, 5999)
or in matlab:
xf=fft(xt, 5999)
then the results are identical.
In python:
xf[:5] = [-0.09135455+0.04067366j,
-0.09160153+0.04072616j,
-0.09184974+0.04077892j,
-0.09209917+0.04083194j,
-0.09234986+0.04088522j]
In matlab:
xf(1:5) = -9.135455e-02+4.067366e-2j
-9.160153e-02+4.072616e-2j
-9.184974e-02+4.077892e-2j
-9.209917e-02+4.083194e-2j
-9.234986e-02+4.088522e-2j
I am confused. Can anybody explain this phenomenon? Thanks for your help.
PS: python 3.8.5, numpy 1.19.2, matlab 2014
I think the different values you are getting are due to floating-point rounding errors. Values of the order of 1e-15 are effectively zero, and the error you see is of the same order as the value being rounded. The same happens for really big values. You can see a related post with a pretty good explanation of this at: https://es.mathworks.com/matlabcentral/answers/475494-unexpected-results-due-to-floating-point-rounding-errors-by-performing-arithmetic-calculations-on-la.
It is also worth noting that, although these floating-point rounding errors always occur, you have to determine whether they are significant given your data and the result you expect. Sometimes the absolute differences do not mean anything because the relative differences are marginal. If you wish to avoid this behavior in MATLAB, you can use the sym function, which makes MATLAB use a symbolic representation; among other things, numbers are then represented more accurately. More on this subject can be found here: https://es.mathworks.com/help/symbolic/create-symbolic-numbers-variables-and-expressions.html#buyfu27.
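A quick way to see that the differing numbers are rounding residue is to compare them with the largest value in the spectrum. With 6000 samples, f0/fs*6000 = 400, so the tone sits exactly on one bin and every other bin is zero in exact arithmetic. The sketch below uses np.arange(n)/fs for the time axis (my substitution, to guarantee exactly 6000 samples) and made-up variable names.

```python
import numpy as np

fs = 15e6
f0 = 1e6
n = 6000                      # f0 / fs * n = 400, an exact FFT bin
t = np.arange(n) / fs
xt = 0.1 * np.exp(2j * np.pi * f0 * t)
xf = np.fft.fft(xt)

mag = np.abs(xf)
peak_bin = int(np.argmax(mag))
peak = mag[peak_bin]
# Largest value among all the other bins, relative to the peak.
relative_noise = np.delete(mag, peak_bin).max() / peak

print(peak_bin, peak, relative_noise)
```

The bins that differ between NumPy and MATLAB are many orders of magnitude below the peak, so which tiny number each FFT implementation leaves there depends on its internal order of operations. With length 5999 the tone no longer lands on a bin, the energy leaks into all bins, and both libraries report the same values of ordinary size.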

Combining p values using scipy

I have to combine p values and get one p value.
I'm using scipy.stats.combine_pvalues function, but it is giving very small combined p value, is it normal?
e.g.:
>>> import scipy.stats
>>> p_values_list=[8.017444955844044e-06, 0.1067379119652372, 5.306374345615846e-05, 0.7234201655194492, 0.13050605094545614, 0.0066989543716175, 0.9541246420333787]
>>> test_statistic, combined_p_value = scipy.stats.combine_pvalues(p_values_list, method='fisher',weights=None)
>>> combined_p_value
4.331727536209026e-08
As you see, combined_p_value is smaller than any given p value in the p_values_list?
How can it be?
Thanks in advance,
Burcak
It is correct, because the null hypothesis you are testing is that all of your p-values come from a uniform distribution, i.e. that every individual null hypothesis is true. The alternative hypothesis is that at least one of the individual alternatives is true, which in your case is very plausible.
We can simulate this by drawing from a uniform distribution 1000 times, each draw having the length of your p-value list:
import numpy as np
from scipy.stats import combine_pvalues
from matplotlib import pyplot as plt
random_p = np.random.uniform(0,1,(1000,len(p_values_list)))
res = np.array([tuple(combine_pvalues(i,method='fisher',weights=None)) for i in random_p])
# column 0 holds the chi-square statistics, column 1 the combined p-values
fisher_stat = res[:,0]
plt.hist(fisher_stat)
From your results, the chi-square statistic is 62.456, which is really huge and nowhere near the simulated chi-square values above.
One thing to note is that the combination you did here does not take directionality into account. If that matters in your test, you might want to consider using Stouffer's Z along with weights. Another sanity check is to run a simulation like the one above, generating lists of p-values under the null hypothesis, and see how they differ from what you observed.
Interesting paper but maybe a bit on the statistics side
I am by no means an expert in this field, but I am interested in your question. After some reading on the wiki, it seems to me that the combined_p_value tells you the likelihood of all p-values in the list having been obtained under the same null hypothesis, which is very unlikely considering the two extremely small values.
Your set has two extremely small values: 1st and 3rd. If the thought process I described is correct, removing any of them should yield a much higher p-value, which is indeed the case:
remove 1st: p-value of 0.00010569305282803985
remove 3rd: p-value of 2.4713196031837724e-05
In conclusion, I think that this is a correct way of interpreting the meta-analysis that combine_pvalues actually describes.
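For reference, Fisher's method can be written out by hand: the statistic is -2 times the sum of the log p-values, and under the null hypothesis it follows a chi-square distribution with 2k degrees of freedom for k p-values. The sketch below recomputes it for the list from the question and compares it with combine_pvalues; the variable names are only illustrative.

```python
import numpy as np
from scipy.stats import chi2, combine_pvalues

p_values_list = [8.017444955844044e-06, 0.1067379119652372,
                 5.306374345615846e-05, 0.7234201655194492,
                 0.13050605094545614, 0.0066989543716175,
                 0.9541246420333787]

# Fisher's statistic: -2 * sum(log(p_i)), chi-square with 2k degrees of freedom.
statistic = -2.0 * np.sum(np.log(p_values_list))
dof = 2 * len(p_values_list)
combined_p = chi2.sf(statistic, dof)

# Reference values from SciPy for comparison.
ref_stat, ref_p = combine_pvalues(p_values_list, method='fisher')
print(statistic, combined_p, ref_stat, ref_p)
```

Because the logs are summed, two very small p-values contribute most of the 62.456, which is how the combined p-value can end up smaller than every individual one.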

Mandelbrot set on python using matplotlib + need some advices

This is my first post here, so I'm sorry if I didn't follow the rules.
I recently learned Python. I know the basics and I like writing famous sets and plotting them. I've written code for the Hofstadter sequence and a logistic sequence, and succeeded in both.
Now I've tried writing the Mandelbrot sequence without any complex-number types, doing it "by hand" instead.
For example, if Z(n) is my complex variable (x+iy) and C is my complex parameter (c+ik),
I write the sequence as {x(n)=x(n-1)^2-y(n-1)^2+c ; y(n)=2.x(n-1).y(n-1)+k}
from math import *
import matplotlib.pyplot as plt
def mandel(p,u):
    # lists collecting the points that stay bounded
    X=[]
    Y=[]
    c=5
    k=5
    for i in range(p):
        c=5
        k=k-10/p
        for n in range(p):
            c=c-10/p
            x=0
            y=0
            for m in range (u):
                x=x*x-y*y + c
                y=2*x*y + k
                if sqrt(x*x+y*y)>2:
                    break
            if sqrt(x*x+y*y)<2:
                X=X+[c]
                Y=Y+[k]
        print (round((i/p)*100),"%")
    return (plt.plot(X,Y,'.')),(plt.show())
p is the width and number of complex parameters I want, u is the number of iterations.
This is what I get as a result:
I think it's just a bit close to what I want.
Now for my questions: how can I make the function faster? And how can I make it better?
Thanks a lot!
A good place to start would be to profile your code.
https://docs.python.org/2/library/profile.html
Using the cProfile module or the command line profiler, you can find the inefficient parts of your code and try to optimize them. If I had to guess without personally profiling it, your array appending is probably inefficient.
You can either use a numpy array that is premade at an appropriate size, or in pure python you can make an array with a given size (like 50) and work through that entire array. When it fills up, append that array to your main array. This reduces the number of times the array has to be rebuilt. The same could be done with a numpy array.
Quick things you could do though
if sqrt(x*x+y*y)>2:
should become this
if x*x+y*y>4:
Remove calls to sqrt if you can; it's faster to square the other side instead. Multiplication is cheaper than finding roots.
Another thing you could do is this.
print (round((i/p)*100),"%")
should become this
# print (round((i/p)*100),"%")
You want faster code? Remove things not related to actually plotting it.
Also, you break out of the for loop after a comparison and then make the same comparison again. Do what you want to do at the comparison and then break; there is no need to compute it twice.
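Going one step further than the tips above, the whole grid can be iterated at once with NumPy arrays instead of three nested Python loops. This is only a sketch with made-up names (mandelbrot_mask, lo, hi), not a tuned implementation. Note that it computes the new x and y from the previous x and y together, as the recurrence requires; in the question's loop, y is computed from the already-updated x, which changes the shape of the set.

```python
import numpy as np

def mandelbrot_mask(p, u, lo=-5.0, hi=5.0):
    """Return (C, K, inside) for a p-by-p grid after at most u iterations."""
    c = np.linspace(lo, hi, p)
    k = np.linspace(lo, hi, p)
    C, K = np.meshgrid(c, k)          # C[i, j] = c[j], K[i, j] = k[i]
    x = np.zeros_like(C)
    y = np.zeros_like(C)
    inside = np.ones(C.shape, dtype=bool)
    for _ in range(u):
        # Both new values are computed from the previous x and y.
        x_new = x * x - y * y + C
        y_new = 2.0 * x * y + K
        # Points that already escaped are frozen so they cannot overflow.
        x = np.where(inside, x_new, x)
        y = np.where(inside, y_new, y)
        # Compare the squared modulus with 4 instead of calling sqrt.
        inside &= (x * x + y * y <= 4.0)
    return C, K, inside

C, K, inside = mandelbrot_mask(200, 50)
```

The points can then be drawn with plt.plot(C[inside], K[inside], '.'), the same plotting call style as in the question.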

unknown vector size python

I have some MATLAB code that I'm trying to translate to Python.
I'm new to Python, but I have been able to answer a lot of questions by googling a little bit.
But now, I'm trying to figure out the following:
I have a for loop in which I apply different things to each column, but the number of columns is not known in advance. For example:
In matlab, nothing easier than this:
for n = 1:size(x,2); y(n) = mean(x(:,n)); end
But I have no idea how to do it in Python when, for example, the number of columns is 1, because I can't do x[:,1] in Python.
Any idea?
Thanx
Yes, if you use NumPy you can write x[:, 0] for the first column (indexing is zero-based, so x[:, 1] is the second column). You also get different data structures (arrays instead of lists). The main difference between MATLAB and NumPy is that MATLAB works with matrices by default while NumPy works with n-dimensional arrays, but you get used to it. I think this guide will help you out.
Try NumPy. It provides Python bindings to a high-performance math library written in C. I believe it has the same concepts of matrix slice operations, and it is significantly faster than the same code written in pure Python (in most cases).
Regarding your example, I think the closest would be something using numpy.mean.
In pure Python it takes a little more work to calculate the mean of a column, but if you transpose the matrix first you can do it with something like this:
# there is no builtin avg function
def avg(lst):
    return sum(lst)/len(lst)
# zip(*a) transposes the list of rows a, so each item is one column
cols = [avg(col) for col in zip(*a)]
This is one way to do it:
import numpy as np
x = np.array([[1,2,3],[2,3,4]])
[np.mean(x[:,n]) for n in range(x.shape[1])]
# [1.5, 2.5, 3.5]
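If the goal is simply the mean of every column, the loop can be dropped entirely by passing axis=0 to mean. The sketch below wraps this in a hypothetical helper, column_means, that also accepts a plain 1-D vector by treating it as a single column, which covers the one-column case from the question.

```python
import numpy as np

def column_means(x):
    """Mean of each column; a 1-D input is treated as one column."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, np.newaxis]   # shape (n,) becomes (n, 1)
    return x.mean(axis=0)

two_d = column_means([[1, 2, 3], [2, 3, 4]])
one_col = column_means([1, 2, 3])
```

The result always has one entry per column, so the calling code does not need to know the number of columns beforehand.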
