I have been struggling for the last few days trying to compute the degrees of freedom of two paired vectors (x and y) following Chelton (1983), which gives:
[equation image: degrees of freedom according to Chelton (1983)]
and I can't find a proper way to calculate the normalized cross-correlation function using np.correlate;
I always get an output that isn't between -1 and 1.
Is there any easy way to get the cross correlation function normalized in order to compute the degrees of freedom of two vectors?
Nice question. There is no direct way, but you can "normalize" the input vectors before using np.correlate, like this, and reasonable values within the range [-1, 1] will be returned:
Here I define the correlation as it is generally defined in signal processing textbooks:
c'_{ab}[k] = sum_n a[n] conj(b[n+k])
If a and b are the vectors:
a = (a - np.mean(a)) / (np.std(a) * len(a))  # remove mean, divide by std and length
b = (b - np.mean(b)) / np.std(b)             # remove mean, divide by std
c = np.correlate(a, b, 'full')               # values are now bounded to [-1, 1]
References:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.correlate.html
https://en.wikipedia.org/wiki/Cross-correlation
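As a quick sanity check of this normalization (a minimal sketch with made-up random inputs), the zero-lag value should match the Pearson coefficient from np.corrcoef, and every lag should stay within [-1, 1]:
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = 0.5 * a + rng.normal(size=100)      # correlated with a

a_n = (a - np.mean(a)) / (np.std(a) * len(a))
b_n = (b - np.mean(b)) / np.std(b)
c = np.correlate(a_n, b_n, 'full')

# for equal-length inputs, zero lag sits at index len(a) - 1 in 'full' mode
print(c[len(a) - 1], np.corrcoef(a, b)[0, 1])   # these should agree
print(c.min(), c.max())                          # all lags stay within [-1, 1]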
MATLAB equivalent: xcorr(a, b, 'normalized');
A MATLAB-style normalized cross-correlation implementation in Python:
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([2, 4, 6, 8])

a = a / np.linalg.norm(a)   # scale a to unit norm
b = b / np.linalg.norm(b)   # scale b to unit norm

c = np.correlate(a, b, mode='full')
If you are interested in the normalized correlation when the sequences are aligned (not the correlation as a function of time offset), the function numpy.corrcoef does this directly: it computes the covariance of x and y and normalizes it by the product of the standard deviation of x and the standard deviation of y.
https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html#numpy.corrcoef
This is the Pearson correlation coefficient and will have a range of +/-1.
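A minimal usage sketch (the x and y arrays here are just illustrative):
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.0])

r = np.corrcoef(x, y)[0, 1]   # off-diagonal entry of the 2x2 correlation matrix
print(r)                      # a value in [-1, 1]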
This is my idea, but it will normalize the result to the range 0-1:
a = np.dot(np.abs(var1), np.abs(var2))
b = np.correlate(var1, var2, 'full')
c = b / a
Related
I want to generate some samples from a distribution and use them to estimate its expectation and variance.
Given the probability density function: f(x) = {2x, 0 <= x <= 1; 0 otherwise}
I have already found that E(X) = 2/3 and Var(X) = 1/18; my detailed solution is here: https://math.stackexchange.com/questions/4430163/simulating-expectation-of-continuous-random-variable
But here is what I have when simulating using python:
import numpy as np
N = 100_000
X = np.random.uniform(size=N, low=0, high=1)
Y = [2*x for x in X]
np.mean(Y) # 1.00221 <- not equal to 2/3
np.var(Y) # 0.3323 <- not equal to 1/18
What am I doing wrong here? Thank you in advance.
You are generating the mean and variance of Y = 2X, when you want the mean and variance of the X's themselves. You know the density, but the CDF is more useful for random variate generation than the PDF. For your problem, the density is f(x) = 2x for 0 <= x <= 1,
so the CDF is F(x) = x^2 for 0 <= x <= 1.
Given that the CDF is an easily invertible function on the range [0,1], you can use inverse transform sampling to generate X values by setting F(X) = U, where U is a Uniform(0,1) random variable, and inverting the relationship to solve for X. For your problem, this yields X = U^(1/2) = sqrt(U).
In other words, you can generate X values with
import numpy as np
N = 100_000
X = np.sqrt(np.random.uniform(size = N))
and then do anything you want with the data, such as calculate mean and variance, plot histograms, use in simulation models, or whatever.
A histogram will confirm that the generated data have the desired density:
import matplotlib.pyplot as plt
plt.hist(X, bins = 100, density = True)
plt.show()
produces a histogram that closely follows the target density f(x) = 2x on [0, 1].
The mean and variance estimates can then be calculated directly from the data:
print(np.mean(X), np.var(X)) # => 0.6661509538922444 0.05556962913014367
But wait! There’s more...
Margin of error
Simulation generates random data, so estimates of mean and variance will be variable across repeated runs. Statisticians use confidence intervals to quantify the magnitude of the uncertainty in statistical estimates. When the sample size is sufficiently large to invoke the central limit theorem, an interval estimate of the mean is calculated as (x-bar ± half-width), where x-bar is the estimate of the mean. For a so-called 95% confidence interval, the half-width is 1.96 * s / sqrt(n) where:
s is the estimated standard deviation;
n is the number of samples used in the estimates of mean and standard deviation; and
1.96 is a scaling constant derived from the normal distribution and the desired level of confidence.
The half-width is a quantitative measure of the margin of error, a.k.a. precision, of the estimate. Note that as n gets larger, the estimate has a smaller margin of error and becomes more precise, but there are diminishing returns to increasing the sample size due to the square root. Increasing the precision by a factor of 2 would require 4 times the sample size if independent sampling is used.
In Python:
var = np.var(X)
print(np.mean(X), var, 1.96 * np.sqrt(var / N))
produces results such as
0.6666763186360812 0.05511848269208021 0.0014551397290634852
where the third column is the confidence interval half-width.
Improving precision
Inverse transform sampling can yield greater precision for a given sample size if we use a clever trick based on fundamental properties of expectation and variance. In intro prob/stats courses you probably were told that Var(X + Y) = Var(X) + Var(Y). The true relationship is actually Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y), where Cov(X,Y) is the covariance between X and Y. If they are independent, the covariance is 0 and the general relationship becomes the one we learn/teach in intro courses, but if they are not independent the more general equation must be used. Variance is always a positive quantity, but covariance can be either positive or negative. Consequently, it’s easy to see that if X and Y have negative covariance the variance of their sum will be less than when they are independent. Negative covariance means that when X is above its mean Y tends to be below its mean, and vice-versa.
So how does that help? It helps because we can use the inverse transform, along with a technique known as antithetic variates, to create pairs of random variables which are identically distributed but have negative covariance. If U is a random variable with a Uniform(0,1) distribution, U' = 1 - U also has a Uniform(0,1) distribution. (In fact, flipping any symmetric distribution will produce the same distribution.) As a result, X = F⁻¹(U) and X' = F⁻¹(U') are identically distributed since they're defined by the same CDF, but will have negative covariance because they fall on opposite sides of their shared median and thus strongly tend to fall on opposite sides of their mean. If we average each pair to get A = (F⁻¹(u_i) + F⁻¹(1 - u_i)) / 2, the expected value is E[A] = E[(X + X')/2] = 2E[X]/2 = E[X], while the variance is Var(A) = [Var(X) + Var(X') + 2Cov(X,X')]/4 = 2[Var(X) + Cov(X,X')]/4 = [Var(X) + Cov(X,X')]/2. In other words, we get a random variable A whose average is an unbiased estimate of the mean of X but which has less variance.
To fairly compare antithetic results head-to-head with independent sampling, we take the original sample size and allocate it with half the data being generated by the inverse transform of the U’s, and the other half generated by antithetic pairing using 1-U’s. We then average the paired values and generate statistics as before. In Python:
U = np.random.uniform(size = N // 2)
antithetic_avg = (np.sqrt(U) + np.sqrt(1.0 - U)) / 2
anti_var = np.var(antithetic_avg)
print(np.mean(antithetic_avg), anti_var, 1.96*np.sqrt(anti_var / (N / 2)))
which produces results such as
0.6667222935263972 0.0018911848781598295 0.0003811869837216061
Note that the half-width produced with independent sampling is nearly 4 times as large as the half-width produced using antithetic variates. To put it another way, we would need more than an order of magnitude more data for independent sampling to achieve the same precision.
To approximate the integral of some function of x, say, g(x), over S = [0, 1], using Monte Carlo simulation, you
generate N random numbers in [0, 1] (i.e. draw from the uniform distribution U[0, 1])
calculate the arithmetic mean of g(x_i) over i = 1 to i = N where x_i is the ith random number: i.e. (1 / N) times the sum from i = 1 to i = N of g(x_i).
The result of step 2 is the approximation of the integral.
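A minimal sketch of those two steps (the choice of g and the sample size are just illustrative):
import numpy as np

def mc_integral(g, N=100_000, seed=0):
    # step 1: N draws from Uniform(0, 1)
    x = np.random.default_rng(seed).uniform(size=N)
    # step 2: arithmetic mean of g over the draws
    return np.mean(g(x))

# example: integral of sin(x) over [0, 1] is 1 - cos(1) ≈ 0.4597
print(mc_integral(np.sin))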
The expected value of continuous random variable X with pdf f(x) and set of possible values S is the integral of x * f(x) over S. The variance of X is the expected value of X-squared minus the square of the expected value of X.
Expected value: to approximate the integral of x * f(x) over S = [0, 1] (i.e. the expected value of X), set g(x) = x * f(x) and apply the method outlined above.
Variance: to approximate the integral of (x * x) * f(x) over S = [0, 1] (i.e. the expected value of X-squared), set g(x) = (x * x) * f(x) and apply the method outlined above. Subtract the result of this by the square of the estimate of the expected value of X to obtain an estimate of the variance of X.
Adapting your method:
import numpy as np
N = 100_000
X = np.random.uniform(size = N, low = 0, high = 1)
Y = [x * (2 * x) for x in X]
E = [(x * x) * (2 * x) for x in X]
# mean
print((a := np.mean(Y)))
# variance
print(np.mean(E) - a * a)
Output
0.6662016482614397
0.05554821798023696
Instead of making Y and E lists, a much better approach is
Y = X * (2 * X)
E = (X * X) * (2 * X)
Y, E in this case are numpy arrays. This approach is much more efficient. Try making N = 100_000_000 and compare the execution times of both methods. The second should be much faster.
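A rough timing sketch of that comparison (the size and the timing approach here are just illustrative):
import time
import numpy as np

N = 10_000_000
X = np.random.uniform(size=N)

t0 = time.perf_counter()
Y_list = [x * (2 * x) for x in X]   # list comprehension over a NumPy array
t1 = time.perf_counter()
Y_arr = X * (2 * X)                 # vectorized NumPy expression
t2 = time.perf_counter()

print(f"list comprehension: {t1 - t0:.2f} s, vectorized: {t2 - t1:.3f} s")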
The scipy.linalg.eigh function can take two matrices as arguments: first the matrix a, of which we will find eigenvalues and eigenvectors, but also the matrix b, which is optional and chosen as the identity matrix in case it is left blank.
In what scenario would someone like to use this b matrix?
Some more context: I am trying to use xdawn covariances from the pyRiemann package. This uses the scipy.linalg.eigh function with a covariance matrix a and a baseline covariance matrix b. You can find the implementation here. This yields an error, as the b matrix in my case is not positive definite and thus not usable in the scipy.linalg.eigh function. Removing this matrix and just using the identity matrix, however, solves this problem and yields relatively nice results... The problem is that I do not really understand what I changed, and maybe I am doing something I should not be doing.
This is the code from the pyRiemann package I am using (modified to avoid using functions defined in other parts of the package):
# X are samples (EEG data), y are labels
# shape of X is (1000, 64, 2459)
# shape of y is (1000,)
import numpy as np
import sklearn.covariance
from scipy.linalg import eigh

Ne, Ns, Nt = X.shape
tmp = X.transpose((1, 2, 0))
b = np.matrix(sklearn.covariance.empirical_covariance(tmp.reshape(Ne, Ns * Nt).T))

for c in self.classes_:
    # Prototyped response for each class
    P = np.mean(X[y == c, :, :], axis=0)
    # Covariance matrix of the prototyped response & signal
    a = np.matrix(sklearn.covariance.empirical_covariance(P.T))
    # Spatial filters
    evals, evecs = eigh(a, b)
    # and I am now using the following, disregarding the b matrix:
    # evals, evecs = eigh(a)
If A and B are both symmetric matrices, that doesn't necessarily imply that inv(A)*B is a symmetric matrix. So, if I had to solve a generalised eigenvalue problem Ax = lambda Bx, I would use eig(A, B) rather than eig(inv(A)*B), so that the symmetry isn't lost.
One practical application is in finding the natural frequencies of a dynamic mechanical system from differential equations of the form M (d²x/dt²) + Kx = 0, where M is a positive definite matrix known as the mass matrix, K is the stiffness matrix, x is the displacement vector and d²x/dt² is the acceleration vector (the second derivative of the displacement vector). To find the natural frequencies, x can be substituted with x0 sin(ωt), where ω is the natural frequency. The equation reduces to Kx = ω²Mx. Now, one could use eig(inv(M)*K), but that might break the symmetry of the resultant matrix, so I would use eig(K, M) instead.
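A minimal sketch of this with SciPy (the K and M below are made-up 2x2 symmetric matrices, not from any real system):
import numpy as np
from scipy.linalg import eigh

M = np.array([[2.0, 0.0],
              [0.0, 1.0]])            # mass matrix (symmetric positive definite)
K = np.array([[6.0, -2.0],
              [-2.0, 4.0]])           # stiffness matrix (symmetric)

w2, modes = eigh(K, M)                # solves K x = w^2 M x, keeping symmetry
print(np.sqrt(w2))                    # natural frequencies

# same eigenvalues from the non-symmetric formulation, for comparison
w2_alt = np.sort(np.linalg.eigvals(np.linalg.inv(M) @ K).real)
print(np.allclose(np.sort(w2), w2_alt))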
In the generalised problem (A - lambda B)x = 0, x is not expressed in the same basis as the covariance matrix.
If the matrix B is not positive definite, it means there are vectors that can be flipped by your B (vectors v with vᵀBv ≤ 0).
I hope this is helpful.
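Related to the error in the question above, here is a quick check of whether the baseline matrix is positive definite (a sketch; B stands for the b computed earlier, and the ridge workaround is a common heuristic, not something pyRiemann does):
import numpy as np

# B is the baseline covariance matrix computed above (assumed symmetric)
B = np.asarray(b)
print(np.linalg.eigvalsh(B).min())    # strictly positive => positive definite

# a common heuristic fix: add a small ridge to the diagonal so that eigh accepts it
B_reg = B + 1e-8 * np.trace(B) / B.shape[0] * np.eye(B.shape[0])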
Goal:
calculate a vector from an underdetermined (2x3) linear system Ax = b, where the third equation should be the unit-norm condition (x^2 + y^2 + z^2 = 1).
I have the correct matrix coefficients, but can't get the correct result;
I am trying to solve Ax = b in this way:
The function below returns the null space of an operator. Then I set up the matrices and try to solve the system.
import numpy as np
from scipy.linalg import svd

def null(A, eps=1e-17):
    # returns a matrix whose columns form a basis of the null space of A
    u, s, vh = svd(A)
    padding = max(0, np.shape(A)[1] - np.shape(s)[0])
    null_mask = np.concatenate(((s <= eps), np.ones((padding,), dtype=bool)), axis=0)
    null_basis = np.compress(null_mask, vh, axis=0)
    return np.transpose(null_basis)
We have 3 vertices that set a triangle:
vh0 = [0., -1., 0.]
vh1 = [-0.03806, -0.98078501, -0.191341]
vh2 = [-0.074658, -0.98078501, -0.18024001]
# normal vector of vh0
normal_vec = [ 0., -0.23760592, 0.]
cap_vec10 = np.subtract(vh1, vh0)
cap_vec20 = np.subtract(vh2, vh0)
a1 = np.array(np.subtract(cap_vec20, cap_vec10))
a2 = np.array(np.dot(-1, cap_vec10))
# orientation bit of the normal vector
ob = np.sign(np.linalg.det([x_k, x_k1, normal_vec]))
# normal vector of vertex vh1 that I want to get solving the system
normal_vec1 = [-0.04744975, -0.97674069, -0.209108]
Lm = np.dot(np.subtract(vh2, vh1), normal_vec1)
Lm_1 = np.dot(np.subtract(vh0, vh1), normal_vec1)
# solving under determined system
A = np.array([a1, a2])
b = np.array([Lm, Lm_1])
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]
wanted_norm = np.sqrt(abs(1 - (np.linalg.norm(x_lstsq)*np.linalg.norm(x_lstsq))))
Z = null(A)*wanted_norm
new_normal_vec = np.add(Z[:, 0], x_lstsq)
if np.sign(np.linalg.det([x_k, x_k1, Z[:, 0]])) != ob:
new_normal_vec[list(np.abs(x_lstsq)).index(min(np.abs(x_lstsq)))] *= ob
print("should_be: {}\ncounted_nv: {}".format(np.round(normal_vec1, 3), np.round(new_normal_vec, 3)))
normal_vec1 is the vector that I need. And for both vectors Z*vector == 1.
Coefficients in the code: L_m = <vector, normal_vector>, where <,> denotes the scalar (dot) product.
As I understand it, the two equations define a line, and the normalization condition defines the unit sphere, so my solution is one of the intersection points of that line and the unit sphere. But I also can't work out how to get both solutions.
Try using the pseudoinverse np.linalg.pinv function: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.linalg.pinv.html
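A minimal sketch of what pinv gives you here (the A and b below are illustrative, not the ones from the question): the pseudoinverse returns the minimum-norm least-squares solution, i.e. the same particular solution that lstsq produces.
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 4.0]])       # 2x3, underdetermined
b = np.array([1.0, 2.0])

x_pinv = np.linalg.pinv(A) @ b
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x_pinv, x_lstsq))   # True: both give the minimum-norm solution
print(np.allclose(A @ x_pinv, b))     # and it satisfies Ax = b exactly here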
The goal was: basically, solve the underdetermined 2x3 linear system with the constraint that the resulting vector should be normalized.
What I did, which gives one of the solutions (because there are from 0 to 2 such vectors in 3D space):
1. Calculated least squared solution:
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]
2. Calculated null space (kernel of the operator A)
wanted_norm = np.sqrt(abs(1 - (np.linalg.norm(x_lstsq)*np.linalg.norm(x_lstsq))))
Z = null(A)*wanted_norm
3. Calculated the resulting vector
result = np.add(Z[:, 0], x_lstsq)
In this way I was getting one of the two vectors; it wasn't the correct one for my project, but it was correct for this particular linear system. So my question was: how to get the second one doing the same steps (through the null space).
While looking for the solution I realized another approach:
basically, solve this linear system by hand, using the normalization condition as the third equation.
Geometrically, the first two equations define two planes, and the intersection of these planes gives a line. The equation x^2 + y^2 + z^2 = 1 defines the unit sphere, so the intersection of this line and the sphere gives up to two points. Solving the resulting quadratic equation in one of the coordinates therefore gives from 0 to 2 roots (from 0 to 2 vectors).
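A sketch of that line-sphere intersection, which yields both unit-norm solutions when they exist (the A and b below are illustrative; scipy.linalg.null_space is used instead of the hand-rolled null() above):
import numpy as np
from scipy.linalg import null_space

A = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 4.0]])
b = np.array([0.3, 0.2])

x_ls = np.linalg.lstsq(A, b, rcond=None)[0]   # particular (minimum-norm) solution
n = null_space(A)[:, 0]                       # unit direction of the solution line

# |x_ls + t*n|^2 = 1  =>  t^2 + 2 (x_ls . n) t + (|x_ls|^2 - 1) = 0
p = 2.0 * np.dot(x_ls, n)
q = np.dot(x_ls, x_ls) - 1.0
disc = p * p - 4.0 * q
if disc >= 0:
    for s in (1.0, -1.0):
        t = (-p + s * np.sqrt(disc)) / 2.0
        x = x_ls + t * n
        print(x, np.linalg.norm(x))           # both roots have unit norm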
Based on this post, I can get the covariance between two vectors using np.cov((x, y), rowvar=0). I have an MxN matrix and an Mx1 vector. I want to find the covariance between each column of the matrix and the given vector. I know I can write this with a for loop; I was wondering if I can somehow use np.cov() to get the result directly.
As Warren Weckesser said, numpy.cov(X, Y) is a poor fit for the job because it will simply join the arrays into one M by (N+1) array and find the huge (N+1) by (N+1) covariance matrix. But we always have the definition of covariance, and it is easy to use:
import numpy as np
A = np.sqrt(np.arange(12).reshape(3, 4))  # some 3 by 4 array
b = np.array([[2], [4], [5]])             # some 3 by 1 vector
cov = np.dot(b.T - b.mean(), A - A.mean(axis=0)) / (b.shape[0] - 1)
This returns the covariances of each column of A with b.
array([[ 2.21895142, 1.53934466, 1.3379221 , 1.20866607]])
The formula I used is for sample covariance (which is what numpy.cov computes, too), hence the division by (b.shape[0] - 1). If you divide by b.shape[0] you get the unadjusted population covariance.
For comparison, the same computation using np.cov:
import numpy as np
A = np.sqrt(np.arange(12).reshape(3, 4))
b = np.array([[2], [4], [5]])
np.cov(A, b, rowvar=False)[-1, :-1]
Same output, but it takes about twice as long (and for large matrices, the difference will be much larger). The slicing at the end is needed because np.cov computes a 5 by 5 matrix, in which only the first 4 entries of the last row are what you wanted. The rest is the covariance of A with itself, or of b with itself.
Correlation coefficient
The correlation coefficient is obtained by dividing by the square roots of the variances. Watch out for the -1 adjustment mentioned earlier: numpy.var does not make it by default; to make it happen you need the ddof=1 parameter.
corr = cov / np.sqrt(np.var(b, ddof=1) * np.var(A, axis=0, ddof=1))
Check that the output is the same as that of the less efficient version:
np.corrcoef(A, b, rowvar=False)[-1, :-1]
I'm using this code to get the zeros of a nonlinear function.
Most certainly, the function should have 1 or 3 zeros
import numpy as np
import matplotlib.pylab as plt
from scipy.optimize import fsolve
[a, b, c] = [5, 10, 0]
def func(x):
return -(x+a) + b / (1 + np.exp(-(x + c)))
x = np.linspace(-10, 10, 1000)
print(fsolve(func, [-10, 0, 10]))
plt.plot(x, func(x))
plt.show()
In this case the code gives the 3 expected roots without any problem.
But with c = -1.5 the code misses a root, and with c = -3 it finds a non-existent root.
I want to calculate the roots for many different parameter combinations, so changing the initial guesses manually is not a practical solution.
I would appreciate any solution, trick or advice.
What you need is an automatic way to obtain good initial estimates of the roots of the function. This is in general a difficult task, however, for univariate, continuous functions, it is rather simple. The idea is to note that (a) this class of functions can be approximated to an arbitrary precision by a polynomial of appropriately large order, and (b) there are efficient algorithms for finding (all) the roots of a polynomial. Fortunately, Numpy provides functions for both performing polynomial approximation and finding polynomial roots.
Let's consider a specific function
[a, b, c] = [5, 10, -1.5]
def func(x):
return -(x+a) + b / (1 + np.exp(-(x + c)))
The following code uses polyfit and poly1d to approximate func over the range of interest (-10<x<10) by a polynomial function f_poly of order 10.
x_range = np.linspace(-10,10,100)
y_range = func(x_range)
pfit = np.polyfit(x_range,y_range,10)
f_poly = np.poly1d(pfit)
Plotting f_poly against func over this range shows that it is indeed a good approximation. Even greater accuracy can be obtained by increasing the order. However, there is no point in pursuing extreme accuracy in the polynomial approximation, since we are only looking for approximate estimates of the roots that will later be refined by fsolve.
The roots of the polynomial approximation can be simply obtained as
roots = np.roots(pfit)
roots
array([-10.4551+1.4893j, -10.4551-1.4893j, 11.0027+0.j ,
8.6679+2.482j , 8.6679-2.482j , -5.7568+3.2928j,
-5.7568-3.2928j, -4.9269+0.j , 4.7486+0.j , 2.9158+0.j ])
As expected, Numpy returns 10 complex roots. However, we are only interested for real roots within the interval [-10,10]. These can be extracted as follows:
x0 = roots[np.where(np.logical_and(np.logical_and(roots.imag==0, roots.real>-10), roots.real<10))].real
x0
array([-4.9269, 4.7486, 2.9158])
Array x0 can serve as the initialization for fsolve:
fsolve(func, x0)
array([-4.9848, 4.5462, 2.7192])
Remark: The pychebfun package provides a function that directly gives all the roots of a function within an interval. It is also based on the idea of performing polynomial approximation, however, it uses a more sophisticated (yet, more efficient) approach. It automatically chooses the best polynomial order of the approximation (no user input), with the polynomial roots being practically equal to the true ones (no need to refine them via fsolve).
This simple code gives the same roots as those by fsolve
import pychebfun
f_cheb = pychebfun.Chebfun.from_function(func, domain = (-10,10))
f_cheb.roots()
Between two stationary points (i.e., df/dx=0), you have one or zero roots. In your case it is possible to calculate the two stationary points analytically:
[-c + log(1/(b - sqrt(b*(b - 4)) - 2)) + log(2),
-c + log(1/(b + sqrt(b*(b - 4)) - 2)) + log(2)]
So you have three intervals where you need to look for a zero. Using Sympy saves you from doing the calculations by hand. Its sy.nsolve() allows you to robustly find a zero in an interval:
import sympy as sy
a, b, c, x = sy.symbols("a, b, c, x", real=True)
# The function:
f = -(x+a) + b / (1 + sy.exp(-(x + c)))
df = f.diff(x) # calculate f' = df/dx
xxs = sy.solve(df, x) # Solving for f' = 0 gives two solutions
# numerical values:
pp = {a: 5, b: 10, c: .5} # values for a, b, c
fpp = f.subs(pp)
xxs_pp = [xpr.subs(pp).evalf() for xpr in xxs] # numerical stationary points
xxs_pp.sort() # in ascending order
# resulting intervals:
xx_low = [-1e9, xxs_pp[0], xxs_pp[1]]
xx_hig = [xxs_pp[0], xxs_pp[1], 1e9]
# calculate roots for each interval:
xx0 = []
for xl_, xh_ in zip(xx_low, xx_hig):
try:
x0 = sy.nsolve(fpp, (xl_, xh_), solver="bisect") # calculate zero
except ValueError: # no solution found
continue
xx0.append(x0)
print("The zeros are:")
print(xx0)
sy.plot(fpp) # plot function