I am porting some matlab code to python using scipy and got stuck with the following line:
Matlab/Octave code
[Pxx, f] = periodogram(x, [], 512, 5)
Python code
f, Pxx = signal.periodogram(x, 5, nfft=512)
The problem is that I get different output on the same data. More specifically, the Pxx vectors are different. I tried different windows for signal.periodogram, yet no luck (and it seems that scipy's default boxcar window is the same as MATLAB's default rectangular window). Another strange behavior is that in Python, the first element of Pxx is always 0, no matter what the input data is.
Am I missing something? Any advice would be greatly appreciated!
Simple Matlab/Octave code with actual data: http://pastebin.com/czNeyUjs
Simple Python+scipy code with actual data: http://pastebin.com/zPLGBTpn
After researching Octave's and scipy's periodogram source code, I found that they use different algorithms to calculate the power spectral density estimate. Octave (and MATLAB) use the FFT directly, whereas scipy's periodogram uses the Welch method.
As @georgesl mentioned, the output looks quite alike, but it still differs, and for porting purposes that was critical. In the end, I simply wrote a small function to calculate the PSD estimate using the FFT, and now the output is the same. According to timeit testing, it also runs ~50% faster (1.9006 s vs 2.9176 s on a loop with 10,000 iterations). I think it's because the FFT is faster than the Welch method in scipy's implementation, or is simply a faster approach.
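For reference, a direct FFT-based one-sided PSD estimate can be sketched as below. This is not the poster's original function, just a minimal reconstruction of the idea; the test signal, sample rate, and nfft are made up to mirror the MATLAB call above.

```python
import numpy as np

def fft_periodogram(x, fs=1.0, nfft=None):
    """One-sided PSD estimate via a plain FFT (rectangular window),
    mirroring MATLAB/Octave's periodogram(x, [], nfft, fs)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if nfft is None:
        nfft = n
    X = np.fft.rfft(x, nfft)              # zero-pads when nfft > n
    pxx = (np.abs(X) ** 2) / (fs * n)     # scale by fs and data length
    pxx[1:] *= 2                          # one-sided: double all bins...
    if nfft % 2 == 0:
        pxx[-1] /= 2                      # ...except DC and Nyquist
    f = np.fft.rfftfreq(nfft, d=1.0 / fs)
    return f, pxx

# Example with the same nfft/fs as the MATLAB snippet above
x = np.sin(2 * np.pi * 1.25 * np.arange(400) / 5.0)
f, pxx = fft_periodogram(x, fs=5.0, nfft=512)
```

A quick sanity check on such an estimate is Parseval's relation: integrating the one-sided PSD over frequency should recover the mean power of the signal.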
Thanks to everyone who showed interest.
I faced the same problem, but then I came across the documentation of scipy's periodogram.
As you can see there, detrend='constant' is the default argument. This means that Python automatically subtracts the mean of the input data from each point (read here), while MATLAB/Octave do no such thing. I believe that is the reason why the outputs are different. Try specifying detrend=False when calling scipy's periodogram; you should then get the same output as MATLAB.
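To illustrate the effect (a minimal sketch; the signal, its mean, and the seed are made up), compare the DC bin with and without detrending:

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=400)   # data with a nonzero mean

# Default: detrend='constant' subtracts the mean, zeroing the DC bin.
f, pxx_default = signal.periodogram(x, fs=5, nfft=512)

# detrend=False leaves the data untouched, matching MATLAB/Octave.
f, pxx_raw = signal.periodogram(x, fs=5, nfft=512, detrend=False)
```

With the default, pxx_default[0] is essentially zero, which also explains the "first element of Pxx is always 0" observation in the question above.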
After reading the Matlab and Scipy documentation, another contribution to the different values could be that they use different default window functions. Matlab uses a Hamming window, and Scipy uses a Hanning. The two window functions are similar but not identical.
Did you look at the results?
The slight differences between the two results may come from optimizations, default windows, implementation details, etc.
Related
I am doing feature scaling on my data, and R and Python give different answers for many of the statistical values:
Median:
Numpy gives 14.948499999999999 with this code: np.percentile(X[:, 0], 50, interpolation='midpoint').
The built in Statistics package in Python gives the same answer with the following code: statistics.median(X[:, 0]).
On the other hand, R gives this result: 14.9632, with this code: median(X[, 1]). Interestingly, the summary() function in R gives 14.960 as the median.
A similar difference occurs when computing the mean of this same data. R gives 13.10936 using the built-in mean() function and both Numpy and the Python Statistics package give 13.097945407088607.
Again, the same thing happens when computing the Standard Deviation. R gives 7.390328 and Numpy (with DDOF = 1) gives 7.3927612774052083. With DDOF = 0, Numpy gives 7.3927565984408936.
The IQR also gives different results. Using the built-in IQR() function in R, the result is 12.3468. Using Numpy with this code: np.percentile(X[:, 0], 75) - np.percentile(X[:, 0], 25), the result is 12.358700000000002.
What is going on here? Why are Python and R always giving different results? It may help to know that my data has 795066 rows and is being treated as an np.array() in Python. The same data is being treated as a matrix in R.
tl;dr there are a few potential differences in algorithms even for such simple summary statistics, but given that you're seeing differences across the board and even in relatively simple computations such as the median, I think the problem is more likely that the values are getting truncated/modified/losing precision somehow in the transfer between platforms.
(This is more of an extended comment than an answer, but it was getting awkwardly long.)
you're unlikely to get much farther without a reproducible example; there are various ways to create examples to test hypotheses for the differences, but it's better if you do so yourself rather than making answerers do it.
how are you transferring data to/from Python/R? Is there some rounding in the representation used in the transfer? (What do you get for max/min, which should be based on a single number with no floating-point computations? How about if you drop one value to get an odd-length vector and take the median?)
medians: I was originally going to say that this could be a function of different ways to define quantile interpolation for an even-length vector, but the definition of the median is somewhat simpler than general quantiles, so I'm not sure. The differences you're reporting above seem way too big to be driven by floating-point computation in this case (since the computation is just an average of two values of similar magnitude).
IQRs: similarly, there are different possible definitions of percentiles/quantiles: see ?quantile in R.
median() vs summary(): R's summary() reports values at reduced precision (often useful for a quick overview); this is a common source of confusion.
mean/sd: there are some possible subtleties in the algorithm here -- for example, R sorts the vector before summing and uses extended precision internally to reduce instability; I don't know whether Python does the same. However, this shouldn't make as big a difference as you're seeing unless the data are a bit weird:
x <- rnorm(1000000,mean=0,sd=1)
> mean(x)
[1] 0.001386724
> sum(x)/length(x)
[1] 0.001386724
> mean(x)-sum(x)/length(x)
[1] -1.734723e-18
Similarly, there are more- and less-stable ways to compute a variance/standard deviation.
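The two points above -- differing quantile definitions and more- vs less-stable variance formulas -- can be sketched in NumPy (the data here are made up for illustration):

```python
import numpy as np

# Different quantile definitions give different "medians":
# for an even-length vector the 50th percentile depends on the
# interpolation method chosen.
x = np.array([1.0, 2.0, 3.0, 10.0])
mid = np.percentile(x, 50, method="midpoint")   # average of middle two
low = np.percentile(x, 50, method="lower")      # lower of the middle two

# Stable vs unstable variance: the naive E[x^2] - E[x]^2 formula
# cancels catastrophically when the data have a large mean; the
# two-pass formula (subtract the mean first) does not.
rng = np.random.default_rng(1)
y = rng.normal(loc=1e9, scale=1.0, size=100_000)
naive = np.mean(y**2) - np.mean(y) ** 2
two_pass = np.mean((y - np.mean(y)) ** 2)
```

Here two_pass lands near the true variance of 1, while the naive formula is off by orders of magnitude more than floating-point noise should allow.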
I was trying to port some code from Python to Matlab, but I encountered an inconsistency between numpy's fft2 and matlab's fft2:
peak =
4.377491037053e-223 3.029446976068e-216 ...
1.271610790463e-209 3.237410810582e-203 ...
(The data is too large to list directly; it can be accessed here: https://drive.google.com/file/d/0Bz1-hopez9CGTFdzU0t3RDAyaHc/edit?usp=sharing)
Matlab:
fft2(peak) --(sample result)
12.5663706143590 -12.4458341615690
-12.4458341615690 12.3264538927637
Python:
np.fft.fft2(peak) --(sample result)
12.56637061 +0.00000000e+00j -12.44583416 +3.42948517e-15j
-12.44583416 +3.35525358e-15j 12.32645389 -6.78073635e-15j
Please help me understand why, and suggest how to fix it.
The Fourier transform of a real, even function is real and even (ref). Therefore, it appears that your FFT should be real? Numpy is probably just struggling with the numerics while MATLAB may outright check for symmetry and force the solution to be real.
MATLAB uses FFTW3, while my research indicates Numpy uses a library called FFTPACK. FFTW is one of the standards for FFT performance and uses a number of tricks to work quickly and perform calculations to the best possible precision. Your data contains incredibly tiny numbers, and this poses numerical challenges that any library will be hard pressed to resolve.
You might consider executing the Python code against an FFTW3 wrapper like pyFFTW3 and see if you get similar results.
It appears that your input data is a real, even Gaussian, in which case we do expect the FFT2 of the signal to be real and even. If all your inputs are like this, you could just take the real part, or round to a certain precision. I would trust MATLAB's FFTW code over the Python code.
Or you could just ignore it. The differences are quite small and a value of 3e-15i is effectively zero for most applications. If you have automated the comparison, consider calling them equivalent if the mean square error of all the entries is less than some threshold (say 1e-8 or 1e-15 or 1e-20).
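To see how small the imaginary residue typically is, here is a toy sketch (the array below is synthetic, not the linked data): a real, circularly even array has a real DFT in exact arithmetic, so any imaginary part from fft2 is pure rounding noise.

```python
import numpy as np

# Build a real, circularly even 2-D array (x[(n-k) % n] == x[k]),
# whose DFT is real in exact arithmetic.
n = 8
i = np.arange(n)
g = np.exp(-0.5 * (np.minimum(i, n - i) / 2.0) ** 2)  # even 1-D profile
peak = np.outer(g, g)                                 # even 2-D array

F = np.fft.fft2(peak)
imag_residue = np.abs(F.imag).max()   # tiny: numerical noise only
F_real = F.real                       # safe to drop the imaginary part
```

Comparing imag_residue against a tolerance (as suggested above) is usually all that's needed before calling the MATLAB and NumPy results equivalent.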
Are there functions in Python that will fill in missing values in a matrix for you, using collaborative filtering (e.g. the alternating minimization algorithm), or does one need to implement such functions from scratch?
[EDIT]: This isn't a matrix-completion example, but just to illustrate a similar situation: I know there is an svd() function in Matlab that takes a matrix as input and automatically outputs its singular value decomposition (SVD). I'm looking for something like that in Python, hopefully a built-in function, but even a good library out there would be great.
Check out numpy's linalg library to find a python SVD implementation
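For instance, a minimal sketch (the matrix A here is made up for illustration):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Thin SVD: U is (3, 2), s holds the singular values, Vt is (2, 2).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Reconstruct: A == U @ diag(s) @ Vt up to rounding.
A_rec = U @ np.diag(s) @ Vt
```

This plays the role of Matlab's svd(); the singular values come back sorted in decreasing order.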
There is a library called fancyimpute for this. Also, sklearn's NMF can be used.
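If you do end up implementing it from scratch, the core idea behind SVD-based completion can be sketched in a few lines. This is a toy hard-impute-style sketch, not the fancyimpute API, and the rank-1 test matrix is made up:

```python
import numpy as np

def svd_impute(M, mask, rank=1, n_iter=500):
    """Toy iterative SVD imputation: fill missing entries with the mean
    of the observed ones, then alternate a truncated-SVD reconstruction
    with re-imposing the observed entries."""
    X = np.where(mask, M, np.mean(M[mask]))
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s[rank:] = 0.0                 # keep only the top `rank` components
        X_low = (U * s) @ Vt           # low-rank reconstruction
        X = np.where(mask, M, X_low)   # restore the observed entries
    return X

# Example: a rank-1 matrix with one entry hidden.
M = np.outer([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mask = np.ones_like(M, dtype=bool)
mask[2, 2] = False                     # pretend entry (2, 2) is missing
X = svd_impute(M, mask, rank=1)
```

For this easy rank-1 case the iteration recovers the hidden entry (9.0); real matrix-completion libraries add regularization and better stopping rules.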
I am currently trying to move from Matlab to Python and succeeded in several aspects. However, one function in Matlab's Signal Processing Toolbox that I use quite regularly is the impinvar function to calculate a digital filter from its analogue version.
In Scipy.signal I only found the bilinear function to do something similar. But, in contrast to the Matlab bilinear function, it does not take an optional argument to do some pre-warping of the frequencies. I did not find any impinvar (impulse invariance) function in Scipy.
Before I now start to code it myself I'd like to ask whether there is something that I simply overlooked? Thanks.
There is a Python translation of the Octave impinvar function in the PyDynamic package which should be equivalent to the Matlab version.
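Regarding the missing pre-warping argument of scipy's bilinear: it can be reproduced by hand before the transform. A minimal sketch, assuming a made-up sample rate and match frequency (the Butterworth prototype is just for illustration):

```python
import numpy as np
from scipy import signal

fs = 100.0   # sample rate in Hz (assumed)
f0 = 10.0    # frequency we want mapped exactly (assumed)

# The bilinear transform warps the frequency axis (analog
# Omega = 2*fs*tan(omega/2)). Designing the analog prototype at the
# warped frequency makes f0 land exactly where intended, which is
# what MATLAB's optional fp argument does internally.
warped = 2 * fs * np.tan(np.pi * f0 / fs)          # rad/s
b_a, a_a = signal.butter(2, warped, analog=True)   # analog prototype
b_z, a_z = signal.bilinear(b_a, a_a, fs=fs)        # digital filter

# The digital response at f0 now sits exactly at the -3 dB point.
w, h = signal.freqz(b_z, a_z, worN=[2 * np.pi * f0 / fs])
```

Without the pre-warp (i.e. designing at 2*pi*f0 directly), the -3 dB point of the digital filter would land slightly below f0.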
I'm trying to perform a constrained least-squares estimation using Scipy such that all of the coefficients are in the range (0,1) and sum to 1 (this functionality is implemented in Matlab's LSQLIN function).
Does anybody have tips for setting up this calculation using Python/Scipy. I believe I should be using scipy.optimize.fmin_slsqp(), but am not entirely sure what parameters I should be passing to it.[1]
Many thanks for the help,
Nick
[1] The one example in the documentation for fmin_slsqp is a bit difficult for me to parse without the referenced text -- and I'm new to using Scipy.
scipy-optimize-leastsq-with-bound-constraints on SO gives leastsq_bounds, which is leastsq with bound constraints such as 0 <= x_i <= 1. The constraint that they sum to 1 can be added in the same way.
(I've found leastsq_bounds / MINPACK to be good on synthetic test functions in 5d, 10d, 20d; how many variables do you have?)
Have a look at this tutorial; it seems pretty clear.
Since MATLAB's lsqlin is a bounded linear least squares solver, you would want to check out scipy.optimize.lsq_linear.
Non-negative least squares optimization using scipy.optimize.nnls is a robust way of doing it. Note that, if the coefficients are constrained to be non-negative and to sum to unity, they are automatically limited to the interval [0, 1]; that is, one need not additionally constrain them from above.
scipy.optimize.nnls enforces non-negativity using the Lawson-Hanson algorithm, whereas the sum constraint can be taken care of as discussed in this thread and this one.
Scipy's nnls uses an old Fortran backend, which is apparently widely reused in equivalent nnls implementations in other software.
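As a concrete alternative matching the original question, scipy.optimize.minimize with the SLSQP method (the solver the question mentions) handles both the box bounds and the sum-to-one equality constraint directly. A minimal sketch with synthetic data (the design matrix, true coefficients, and seed are made up):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))           # synthetic design matrix
x_true = np.array([0.2, 0.3, 0.5])     # coefficients in (0, 1), summing to 1
b = A @ x_true                         # noiseless target for illustration

res = minimize(
    lambda x: np.sum((A @ x - b) ** 2),            # least-squares objective
    x0=np.full(3, 1.0 / 3.0),                      # feasible starting point
    method="SLSQP",
    bounds=[(0.0, 1.0)] * 3,                       # 0 <= x_i <= 1
    constraints={"type": "eq",
                 "fun": lambda x: np.sum(x) - 1},  # sum(x) == 1
)
```

This mirrors what Matlab's LSQLIN does for this problem; for purely bound-constrained linear least squares, scipy.optimize.lsq_linear (mentioned above) is the more direct tool.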