I use scipy.odr to make a fit with uncertainties on both x and y, following this question: Correct fitting with scipy curve_fit including errors in x?
After the fit I would like to compute the uncertainties on the parameters, so I look at the square roots of the diagonal elements of the covariance matrix. I get:
>>> print(np.sqrt(np.diag(output.cov_beta)))
[ 0.17516591 0.33020487 0.27856021]
But the Output also contains output.sd_beta, which according to the odr documentation is
Standard errors of the estimated parameters, of shape (p,).
But it does not give me the same result:
>>> print(output.sd_beta)
[ 0.19705029 0.37145907 0.31336217]
EDIT
This is an example on a notebook : https://nbviewer.jupyter.org/github/gvallverdu/cookbook/blob/master/fit_odr.ipynb
With least squares
stop reason: ['Sum of squares convergence']
params: [ -1.94792946 11.03369235 -5.43265555]
info: 1
sd_beta: [ 0.26176284 0.49877962 0.35510071]
sqrt(diag(cov): [ 0.25066236 0.47762805 0.34004208]
With ODR
stop reason: ['Sum of squares convergence']
params: [-1.93538595 6.141885 -3.80784384]
info: 1
sd_beta: [ 0.6941821 0.88909997 0.17292514]
sqrt(diag(cov): [ 0.01093697 0.01400794 0.00272447]
The reason for the discrepancy is that sd_beta is scaled by the residual variance, whereas cov_beta isn't.
scipy.odr is an interface for the ODRPACK FORTRAN library, which is thinly wrapped in __odrpack.c. sd_beta and cov_beta are recovered by indexing into the work vector that's used internally by the FORTRAN routine. The indices of their first elements in work are variables named sd and vcv (see here).
From the ODRPACK documentation (p.85):
WORK(SDI) is the first element of a p × 1 array SD containing the standard deviations σ̂_βK of the function parameters β, i.e., the square roots of the diagonal entries of the covariance matrix, where

WORK(SDI-1+K) = SD(K) = V̂_β^(1/2)(K, K) = σ̂_βK

for K = 1, ..., p.

WORK(VCVI) is the first element of a p × p array VCV containing the values of the covariance matrix of the parameters β prior to scaling by the residual variance, where

WORK(VCVI-1+I+(J-1)*(NP)) = VCV(I, J) = σ̂^(-2) V̂_β(I, J)

for I = 1, ..., p and J = 1, ..., p.
In other words, np.sqrt(np.diag(output.cov_beta * output.res_var)) will give you the same result as output.sd_beta.
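A quick way to check this (a minimal sketch with a toy linear model and made-up data and uncertainties, not the notebook's example):

import numpy as np
from scipy import odr

def linear(beta, x):
    return beta[0] * x + beta[1]

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 20)
y = 3.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

data = odr.RealData(x, y, sx=np.full_like(x, 0.1), sy=np.full_like(y, 0.5))
out = odr.ODR(data, odr.Model(linear), beta0=[1.0, 0.0]).run()

print(out.sd_beta)
print(np.sqrt(np.diag(out.cov_beta)))                 # differs from sd_beta
print(np.sqrt(np.diag(out.cov_beta * out.res_var)))   # matches sd_beta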
I've opened a bug report here.
I am developing a search algorithm and I am struggling to understand how to actually use the results of a singular value decomposition (u, w, vt = svd(a)) of a term-document matrix.
For example, say I have an M x N matrix as follows where each column represents a document vector (number of terms in each document)
a = [[ 0, 0, 1 ],
[ 0, 1, 2 ],
[ 1, 1, 1 ],
[ 0, 2, 3 ]]
Now, I could run a tf-idf function on this matrix to generate a score for each term/document value, but for the sake of clarity, I will ignore that.
SVD Results
Upon running SVD on this matrix, I end up with the following diagonal vector for 'w'
import svd
u,w,vt = svd.svd(a)
print w
// [4.545183973611469, 1.0343228430392626, 0.5210363733873331]
I understand more or less what this represents (thanks to a lot of reading, and particularly this article https://simonpaarlberg.com/post/latent-semantic-analyses/), but I can't figure out how to relate this resulting 'approximation' matrix back to the original documents. What do these weights represent? How can I use this result in my code to find documents related to a term query?
Basically... How do I use this?
The rank-r SVD reduces a rank-R MxN matrix A into r orthogonal rank-1 MxN matrices (u_n * s_n * v_n'). If you use these singular values and vectors to reconstruct the original matrix, you will obtain the best rank-r approximation of A.
Instead of storing the full matrix A, you just store the u_n, s_n, and v_n. (A is MxN, but U is Mxr, S can be stored in one dimension as r elements, and V' is rxN).
To approximate A * x, you simply compute (U * (S * (V' * x))) [Mxr x rxr x rxN x Nx1]. You can speed this up further by storing (U * S) instead of U and S separately.
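To make this concrete, here is a minimal sketch with numpy (not the asker's svd module) on the example matrix above; the rank r = 2 is just an assumption for illustration:

import numpy as np

a = np.array([[0, 0, 1],
              [0, 1, 2],
              [1, 1, 1],
              [0, 2, 3]], dtype=float)            # M x N term-document matrix (M=4, N=3)

u, s, vt = np.linalg.svd(a, full_matrices=False)  # u: M x N, s: length N, vt: N x N here

r = 2                                # keep the two largest singular values
us = u[:, :r] * s[:r]                # store U*S instead of U and S separately
a_r = us @ vt[:r]                    # best rank-r approximation of a

x = np.array([0.0, 1.0, 0.0])        # a length-N vector (e.g. per-document weights)
print(us @ (vt[:r] @ x))             # approximates a @ x using only the rank-r factors
print(np.allclose(a, u @ np.diag(s) @ vt))   # the full SVD reconstructs a exactly -> True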
So what do the singular values represent? In a way, they represent the energy of each rank-1 matrix. The higher a singular value is, the more its associated rank-1 matrix contributes to the original matrix, and the worse your reconstruction will be if that rank-1 matrix is truncated (left out).
Note that this procedure is closely related to Principal Component Analysis, which is performed on covariance matrices and is commonly used in machine learning to reduce the dimensionality of measured N-dimensional variables.
Additionally, it should be noted that the SVD is useful for many other applications in signal processing.
More information is on the Wikipedia article.
I don't understand why curve_fit isn't able to estimate the covariance of the parameters, thus raising the OptimizeWarning below. The following MCVE explains my problem:
MCVE python snippet
from scipy.optimize import curve_fit
func = lambda x, a: a * x
popt, pcov = curve_fit(f = func, xdata = [1], ydata = [1])
print(popt, pcov)
Output
\python-3.4.4\lib\site-packages\scipy\optimize\minpack.py:715:
OptimizeWarning: Covariance of the parameters could not be estimated
category=OptimizeWarning)
[ 1.] [[ inf]]
For a = 1 the function fits xdata and ydata exactly. Why isn't the error/variance 0, or something close to 0, but inf instead?
There is this quote from the curve_fit SciPy Reference Guide:
If the Jacobian matrix at the solution doesn’t have a full rank, then ‘lm’ method returns a matrix filled with np.inf, on the other hand ‘trf’ and ‘dogbox’ methods use Moore-Penrose pseudoinverse to compute the covariance matrix.
So, what's the underlying problem? Why doesn't the Jacobian matrix at the solution have a full rank?
The formula for the covariance of the parameters (Wikipedia) has the number of degrees of freedom in the denominator. The degrees of freedom are computed as (number of data points) - (number of parameters), which is 1 - 1 = 0 in your example. And this is where SciPy checks the number of degrees of freedom before dividing by it.
With xdata = [1, 2], ydata = [1, 2] you would get zero covariance (note that the model still fits exactly: exact fit is not the problem).
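For example, a quick check of that two-point case (minimal sketch):

from scipy.optimize import curve_fit

func = lambda x, a: a * x
popt, pcov = curve_fit(f=func, xdata=[1, 2], ydata=[1, 2])
print(popt, pcov)   # a ~= 1 with a (near-)zero covariance: 2 points - 1 parameter = 1 dof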
This is the same sort of issue as the sample variance being undefined when the sample size N is 1 (the formula for sample variance has N-1 in the denominator). If we take only a single sample from the population, we don't estimate the variance to be zero; we simply know nothing about the variance.
I'm working to implement a basic Monte Carlo simulator in Python for some project management risk modeling I'm trying to do (basically Crystal Ball / #Risk, but in Python).
I have a set of n random variables (all scipy.stats instances). I know that I can use rv.rvs(size=k) to generate k independent observations from each of these n variables.
I'd like to introduce correlations among the variables by specifying an n x n positive semi-definite correlation matrix.
Is there a clean way to do this in scipy?
What I've Tried
This answer and this answer seem to indicate that "copulas" would be an answer, but I don't see any reference in scipy to them.
This link seems to implement what I'm looking for, but I'm not sure if scipy has this functionality implemented already. I'd also like it to work for non-normal variables.
It seems that the Iman and Conover paper describes the standard method.
If you just want correlation through a Gaussian Copula (*), then it can be calculated in a few steps with numpy and scipy.
create multivariate normal random variables with the desired covariance using numpy.random.multivariate_normal, giving an (nobs, k_variables) array
apply scipy.stats.norm.cdf to each column/variable to transform the normal draws into uniform marginal distributions
apply dist.ppf to transform the uniform margins to the desired distribution, where dist can be one of the distributions in scipy.stats (see the sketch below)
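A minimal sketch of these three steps (the margins, correlation value, and sample size are just assumptions for illustration):

import numpy as np
from scipy import stats

corr = np.array([[1.0, 0.7],
                 [0.7, 1.0]])                 # desired correlation of the normals

# step 1: correlated multivariate normal draws, shape (nobs, k_variables)
z = np.random.multivariate_normal([0.0, 0.0], corr, size=10000)

# step 2: transform each column to uniform margins with the normal cdf
u = stats.norm.cdf(z)

# step 3: transform the uniform margins to the target distributions via their ppf
x0 = stats.expon.ppf(u[:, 0], scale=2.0)      # exponential margin
x1 = stats.beta.ppf(u[:, 1], a=2.0, b=5.0)    # beta margin

print(np.corrcoef(x0, x1)[0, 1])              # the dependence is (approximately) carried over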
(*) Gaussian copula is only one choice and it is not the best when we are interested in tail behavior, but it is the easiest to work with
for example http://archive.wired.com/techbiz/it/magazine/17-03/wp_quant?currentPage=all
two references
https://stats.stackexchange.com/questions/37424/how-to-simulate-from-a-gaussian-copula
http://www.mathworks.com/products/demos/statistics/copulademo.html
(I might have done this a while ago in python, but don't have any scripts or function right now.)
It seems like a rejection-based sampling method such as the Metropolis-Hastings algorithm is what you want. Scipy can implement such methods with its scipy.optimize.basinhopping function.
Rejection-based sampling methods allow you to draw samples from any given probability distribution. The idea is that you draw random samples from another "proposal" pdf that is easy to sample from (such as uniform or gaussian distributions) and then use a random test to decide if this sample from the proposal distribution should be "accepted" as representing a sample of the desired distribution.
The remaining tricks will then be:
Figure out the form of the joint N-dimensional probability density function which has marginals of the form you want along each dimension, but with the correlation matrix that you want. This is easy to do for the Gaussian distribution, where the desired correlation matrix and mean vector is all you need to define the distribution. If your marginals have a simple expression, you can probably find this pdf with some straightforward-but-tedious algebra. This paper cites several others which do what you are talking about, and I'm certain that there are many more.
Formulate a function for basinhopping to minimize such that its accepted "minima" amount to samples of the pdf you have defined.
Given the results of (1), (2) should be straightforward.
If you already have a positive semi-definite correlation matrix R [n x n], it's easy to build a NormalCopula taking R as input. I'll show you an example with n = 3. The code is based on the OpenTURNS library.
import openturns as ot
# you can replace this part with your own matrix
dim = 3
R = ot.CorrelationMatrix(dim)
R[0,1] = 0.25
R[0,2] = 0.6
R[1,2] = 0.9
copula = ot.NormalCopula(R)
Should you like to draw a sample of a given size, just write
size = 5
print(copula.getSample(size))
>>> [ X0 X1 X2 ]
0 : [ 0.355353 0.76205 0.632379 ]
1 : [ 0.902567 0.984443 0.989552 ]
2 : [ 0.423219 0.811016 0.754304 ]
3 : [ 0.303776 0.471557 0.450188 ]
4 : [ 0.746168 0.918729 0.891347 ]
EDIT - Following the comment of @Michael_Baudin
Of course, if you want to set the marginal distributions as e.g. Beta and LogNormal marginals, it's also possible:
X0 = ot.LogNormal(0.1, 1, 0)
X1 = ot.Beta()
X2 = ot.Uniform(1.0, 2.0)
distribution = ot.ComposedDistribution([X0, X1, X2], copula)
print(distribution.getSample(size))
>>> [ X0 X1 X2 ]
0 : [ 3.97678 0.158823 1.75635 ]
1 : [ 1.18929 -0.554092 1.18952 ]
2 : [ 2.59542 0.0751359 1.68599 ]
3 : [ 1.33363 -0.18407 1.42241 ]
4 : [ 1.34084 0.198019 1.6553 ]
import typing
import numpy as np
import scipy.stats
def run_gaussian_copula_simulation_and_get_samples(
    ppfs: typing.List[typing.Callable[[np.ndarray], np.ndarray]],  # list of $num_dims percentile point functions
    cov_matrix: np.ndarray,  # covariance matrix, shape ($num_dims, $num_dims)
    num_samples: int,  # number of random samples to draw
) -> np.ndarray:
    num_dims = len(ppfs)

    # Draw random samples from the multidimensional normal distribution -> shape ($num_samples, $num_dims)
    ran = np.random.multivariate_normal(np.zeros(num_dims), cov_matrix, (num_samples,), check_valid="raise")

    # Transform back into a uniform distribution, i.e. the space [0, 1]^$num_dims
    U = scipy.stats.norm.cdf(ran)

    # Apply the ppfs to transform the samples into the desired distributions
    # Each row of the returned array represents one random sample -> access with a[i]
    return np.array([ppfs[i](U[:, i]) for i in range(num_dims)]).T  # shape ($num_samples, $num_dims)
# Example 1. Uncorrelated data, i.e. both distributions are independent
f1 = run_gaussian_copula_simulation_and_get_samples(
[lambda x: scipy.stats.norm.ppf(x, loc=100, scale=15), scipy.stats.norm.ppf],
[[1, 0], [0, 1]],
6
)
# Example 2. Completely correlated data, i.e. both percentiles match
f2 = run_gaussian_copula_simulation_and_get_samples(
[lambda x: scipy.stats.norm.ppf(x, loc=100, scale=15), scipy.stats.norm.ppf],
[[1, 1], [1, 1]],
6
)
np.set_printoptions(suppress=True) # suppress scientific notation
print(f1)
print(f2)
A few notes on this function. np.random.multivariate_normal does a lot of the heavy lifting for us; note in particular that we do not need to decompose the correlation matrix ourselves.
ppfs is passed as a list of functions which each have one input and one return value.
In my particular use case I needed to generate multivariate-t-distributed random variables (in addition to normal-distributed ones),
consult this answer on how to do that: https://stackoverflow.com/a/41967819/2111778.
Additionally, I used scipy.stats.t.cdf for the back-transform part.
In my particular use case the desired distributions were empirical distributions representing expected financial loss.
The final data points then had to be added together to get a total financial loss across all
of the individual-but-correlated financial events.
Thus, np.array(...).T is actually replaced by sum(...) in my code base.
I want to do least-squares polynomial fits on data sets (X,Y,Yerr) and obtain the covariance matrices of the fit parameters. Also, since I have many data sets, CPU-time is an issue, so I'm seeking an analytical (=fast) solution. I found the following (non-ideal) options:
numpy.polyfit does the fit, but doesn't take into account the errors Yerr, nor does it return the covariance;
numpy.polynomial.polynomial.polyfit does accept Yerr as an input (in the form of weights), but doesn't return covariance either;
scipy.optimize.curve_fit and scipy.optimize.leastsq can be tailored to fit polynomials and return the covariance matrix, but - being iterative methods - these are much slower than the polyfit routines (which yield an analytical solution);
Does Python provide an analytical polynomial fit routine that returns the covariance of the fit parameters (or do I have to write one myself :-) ?
Update:
It appears that in NumPy 1.7.0, numpy.polyfit not only accepts weights but can also return the covariance matrix of the coefficients (via cov=True) ... So, issue resolved! :-)
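A minimal sketch of that route, with made-up data (as I recall, numpy's convention for Gaussian uncertainties is w = 1/sigma, not 1/sigma**2):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
yerr = np.full_like(x, 0.5)                       # 1-sigma uncertainties on Y
y = 1.0 + 2.0 * x + 3.0 * x**2 + yerr * rng.normal(size=x.size)

coeffs, cov = np.polyfit(x, y, deg=2, w=1.0 / yerr, cov=True)
print(coeffs)                   # highest power first
print(np.sqrt(np.diag(cov)))    # 1-sigma uncertainties of the coefficients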
You want a fast weighted least squares model that returns the covariance matrix without additional overhead? In general, the right covariance matrix will depend on the data generating process (DGP), because different DGPs (say, heteroscedasticity of errors) imply different distributions of the parameter estimates (think White vs. OLS standard errors). But if you can assume WLS is the right way to do it, then I believe you would use the asymptotic variance estimate for beta under WLS, (1/n X'V^-1 X)^-1, where V is the weighting matrix created from the Yerrs. That's a pretty simple formula if numpy.polynomial.polynomial.polyfit is working for you.
I looked for an online reference but couldn't find one. But see Fumio Hayashi's Econometrics, 2000, Princeton University Press, p. 133 - 137 for a derivation and discussion.
Update 12/4/12:
There is another Stack Overflow question that comes close, numpy.polyfit has no keyword 'cov', which has a nice explanation (with code) of how to use scikits.statsmodels to do what you want. I believe you'll want to replace the line:
result = sm.OLS(Y,reg_x_data).fit()
with
result = sm.WLS(Y,reg_x_data, weights).fit()
where you define weights as a function of Yerr, as before with numpy.polynomial.polynomial.polyfit. More details on using WLS with statsmodels are over at
the statsmodels website.
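For completeness, a minimal sketch of that statsmodels route, with made-up data (I'm assuming the usual statsmodels convention that weights is proportional to 1/Yerr**2):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
yerr = np.full_like(x, 0.5)
y = 1.0 + 2.0 * x + 3.0 * x**2 + yerr * rng.normal(size=x.size)

reg_x_data = np.vander(x, 3)                                # columns: x**2, x, 1
result = sm.WLS(y, reg_x_data, weights=1.0 / yerr**2).fit()
print(result.params)            # polynomial coefficients, highest power first
print(result.cov_params())      # covariance matrix of the coefficients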
Here it is using scipy.linalg.lstsq
import numpy as np
import scipy.linalg

# generate the test data
N = 100
xs = np.random.uniform(size=N)
errs = np.random.uniform(0, 0.1, size=N)  # 1-sigma errors on y
ys = 1 + 2 * xs + 3 * xs ** 2 + errs * np.random.normal(size=N)

# do the fit
polydeg = 2
A = np.vstack([1 / errs] + [xs ** _ / errs for _ in range(1, polydeg + 1)]).T
result = scipy.linalg.lstsq(A, ys / errs)[0]
covar = np.linalg.inv(A.T @ A)  # covariance of the fit parameters (errors taken as absolute)
print(result, '\n', covar)
>> [ 0.99991811 2.00009834 3.00195187]
[[ 4.82718910e-07 -2.82097554e-06 3.80331414e-06]
[ -2.82097554e-06 1.77361434e-05 -2.60150367e-05]
[ 3.80331414e-06 -2.60150367e-05 4.22541049e-05]]
Picking up from where we left...
So I can use linalg.eig or linalg.svd to compute the PCA. Each one returns different Principal Components/Eigenvectors and Eigenvalues when they're fed the same data (I'm currently using the Iris dataset).
Looking here or any other tutorial with the PCA applied to the Iris dataset, I'll find that the Eigenvalues are [2.9108 0.9212 0.1474 0.0206]. The eig method gives me a different set of eigenvalues/vectors to work with which I don't mind, except that these eigenvalues, once summed, equal the number of dimensions (4) and can be used to find how much each component contributes to the total variance.
Taking the eigenvalues returned by linalg.eig, I can't do that. For example, the values returned are [9206.53059607 314.10307292 12.03601935 3.53031167]. The proportion of variance in this case would be [0.96542969 0.03293797 0.00126214 0.0003702]. This other page says: "The proportion of the variation explained by a component is just its eigenvalue divided by the sum of the eigenvalues."
Since the variance explained by each dimension should be constant (I think), these proportions are wrong. So, if I use the values returned by svd(), which are the values used in all tutorials, I can get the correct percentage of variation from each dimension, but I'm wondering why the values returned by eig can't be used like that.
I assume the results returned are still a valid way to project the variables, so is there a way to transform them so that I can get the correct proportion of variance explained by each variable? In other words, can I use the eig method and still have the proportion of variance for each variable? Additionally, could this mapping be done only in the eigenvalues so that I can have both the real eigenvalues and the normalized ones?
Sorry for the long writeup btw. Here's a (::) for having gotten this far. Assuming you didn't just read this line.
Taking Doug's answer to your previous question and implementing the following two functions, I get the output shown below:
import numpy as np

def pca_eig(orig_data):
    data = np.array(orig_data)
    data = (data - data.mean(axis=0)) / data.std(axis=0)
    C = np.corrcoef(data, rowvar=0)
    w, v = np.linalg.eig(C)
    print("Using numpy.linalg.eig")
    print(w)
    print(v)

def pca_svd(orig_data):
    data = np.array(orig_data)
    data = (data - data.mean(axis=0)) / data.std(axis=0)
    C = np.corrcoef(data, rowvar=0)
    u, s, v = np.linalg.svd(C)
    print("Using numpy.linalg.svd")
    print(u)
    print(s)
    print(v)
Output:
Using numpy.linalg.eig
[ 2.91081808 0.92122093 0.14735328 0.02060771]
[[ 0.52237162 -0.37231836 -0.72101681 0.26199559]
[-0.26335492 -0.92555649 0.24203288 -0.12413481]
[ 0.58125401 -0.02109478 0.14089226 -0.80115427]
[ 0.56561105 -0.06541577 0.6338014 0.52354627]]
Using numpy.linalg.svd
[[-0.52237162 -0.37231836 0.72101681 0.26199559]
[ 0.26335492 -0.92555649 -0.24203288 -0.12413481]
[-0.58125401 -0.02109478 -0.14089226 -0.80115427]
[-0.56561105 -0.06541577 -0.6338014 0.52354627]]
[ 2.91081808 0.92122093 0.14735328 0.02060771]
[[-0.52237162 0.26335492 -0.58125401 -0.56561105]
[-0.37231836 -0.92555649 -0.02109478 -0.06541577]
[ 0.72101681 -0.24203288 -0.14089226 -0.6338014 ]
[ 0.26199559 -0.12413481 -0.80115427 0.52354627]]
In both cases, I get the desired eigenvalues.
Are you sure the data for both cases are the same and in the correct order of dimensions (you're not sending in the rotated array, are you)? I bet you'll find they both give the same results if you use them right ;)
There are three ways I know of to do PCA: from an eigenvalue decomposition of the correlation matrix, of the covariance matrix, or of the unscaled and uncentered data. It sounds like linalg.eig is working on the unscaled data. Anyway, that is just a guess. A better place for your question is stats.stackexchange.com. The folks on math.stackexchange.com don't use actual numbers. :)
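To illustrate that guess, a minimal sketch (scikit-learn is assumed here only to load the same Iris data):

import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data                                       # 150 x 4, raw/unscaled

w_raw, _ = np.linalg.eig(X.T @ X)                          # unscaled, uncentered data
w_corr, _ = np.linalg.eig(np.corrcoef(X, rowvar=False))    # correlation matrix

print(np.sort(w_raw)[::-1])     # large, unit-dependent values on the scale of the ones you got
print(np.sort(w_corr)[::-1])    # ~ [2.91, 0.92, 0.15, 0.02], sums to 4 as in the tutorials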
I'd suggest using SVD, singular value decomposition, for PCA, because
1) it directly gives you the values and matrices you need
2) it's robust.
See principal-component-analysis-in-python on SO for an example with (surprise) iris data.
Running it gives
read iris.csv: (150, 4)
Center -= A.mean: [ 5.84 3.05 3.76 1.2 ]
Center /= A.std: [ 0.83 0.43 1.76 0.76]
SVD: A (150, 4) -> U (150, 4) x d diagonal x Vt (4, 4)
d^2: 437 138 22.1 3.09
% variance: [ 72.77 95.8 99.48 100. ]
PC 0 weights: [ 0.52 -0.26 0.58 0.57]
PC 1 weights: [-0.37 -0.93 -0.02 -0.07]
You see that the diagonal matrix d from SVD, squared,
gives the proportion of total variance from PC 0, PC 1 ...
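For reference, a minimal sketch of that computation (scikit-learn is assumed here only to load the Iris data):

import numpy as np
from sklearn.datasets import load_iris

A = load_iris().data.astype(float)       # 150 x 4
A -= A.mean(axis=0)
A /= A.std(axis=0)

U, d, Vt = np.linalg.svd(A, full_matrices=False)
print(np.cumsum(d**2) / np.sum(d**2) * 100)   # cumulative % variance, roughly [73, 96, 99.5, 100]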
Does this help?