I have two arrays: x is an array of corrected values, and y is an array of the original values (before a correction was applied). I know that if I want to do a two-tailed t-test to get the two-tailed p-value I need to do this:
t_statistic, pvalue = scipy.stats.ttest_ind(x, y, nan_policy='omit')
However, this only tells me whether the two arrays are significantly different from each other. I want to show that the corrected values, x, are significantly less than y. To do this it seems like I need the one-tailed p-value, but I can't seem to find a function that does this. Any ideas?
Consider these two arrays:
import scipy.stats as ss
import numpy as np
prng = np.random.RandomState(0)
x, y = prng.normal([1, 2], 1, size=(10, 2)).T
An independent sample t-test returns:
t_stat, p_val = ss.ttest_ind(x, y, nan_policy='omit')
print('t stat: {:.4f}, p value: {:4f}'.format(t_stat, p_val))
# t stat: -1.1052, p value: 0.283617
This p-value is actually calculated from the cumulative distribution function (CDF):
ss.t.cdf(-abs(t_stat), len(x) + len(y) - 2) * 2
# 0.28361693716176473
Here, len(x) + len(y) - 2 is the number of degrees of freedom.
Notice the multiplication by 2. If the test is one-tailed, you don't multiply by 2. That's all. So your p-value for a left-tailed test is
ss.t.cdf(t_stat, len(x) + len(y) - 2)
# 0.14180846858088236
If the test were right-tailed, you would use the survival function
ss.t.sf(t_stat, len(x) + len(y) - 2)
# 0.85819153141911764
which is the same as 1 - ss.t.cdf(...).
I assumed that the arrays have the same length. If not, you need to modify the degrees of freedom.
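As a side note (my own addition, not part of the original answer): if your SciPy version is 1.6.0 or newer, ttest_ind accepts an alternative argument, so you can request the one-tailed p-value directly instead of working with the CDF by hand:
# Uses x, y and ss as defined above; requires SciPy >= 1.6.0 for `alternative`
t_stat, p_val = ss.ttest_ind(x, y, nan_policy='omit', alternative='less')
# 'less' tests whether the mean of x is less than the mean of y;
# the p-value matches ss.t.cdf(t_stat, len(x) + len(y) - 2) from above.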
I want to generate some samples and use them to estimate the expectation and variance of a random variable.
Given the probability density function: f(x) = {2x, 0 <= x <= 1; 0 otherwise}
I already found that E(X) = 2/3 and Var(X) = 1/18; my detailed solution is here: https://math.stackexchange.com/questions/4430163/simulating-expectation-of-continuous-random-variable
But here is what I have when simulating using python:
import numpy as np
N = 100_000
X = np.random.uniform(size=N, low=0, high=1)
Y = [2*x for x in X]
np.mean(Y) # 1.00221 <- not equal to 2/3
np.var(Y) # 0.3323 <- not equal to 1/18
What am I doing wrong here? Thank you in advance.
You are generating the mean and variance of Y = 2X, when you want the mean and variance of the X's themselves. You know the density, but the CDF is more useful for random variate generation than the PDF. For your problem, the density is f(x) = 2x for 0 <= x <= 1 (and 0 otherwise), so the CDF is F(x) = x^2 for 0 <= x <= 1.
Given that the CDF is an easily invertible function on the range [0, 1], you can use inverse transform sampling to generate X values by setting F(X) = U, where U is a Uniform(0,1) random variable, and inverting the relationship to solve for X. For your problem, this yields X = U^(1/2) = sqrt(U).
In other words, you can generate X values with
import numpy as np
N = 100_000
X = np.sqrt(np.random.uniform(size = N))
and then do anything you want with the data, such as calculate mean and variance, plot histograms, use in simulation models, or whatever.
A histogram will confirm that the generated data have the desired density:
import matplotlib.pyplot as plt
plt.hist(X, bins = 100, density = True)
plt.show()
produces a histogram that closely follows the target density f(x) = 2x on [0, 1] (figure omitted).
The mean and variance estimates can then be calculated directly from the data:
print(np.mean(X), np.var(X)) # => 0.6661509538922444 0.05556962913014367
But wait! There’s more...
Margin of error
Simulation generates random data, so estimates of mean and variance will be variable across repeated runs. Statisticians use confidence intervals to quantify the magnitude of the uncertainty in statistical estimates. When the sample size is sufficiently large to invoke the central limit theorem, an interval estimate of the mean is calculated as (x-bar ± half-width), where x-bar is the estimate of the mean. For a so-called 95% confidence interval, the half-width is 1.96 * s / sqrt(n) where:
s is the estimated standard deviation;
n is the number of samples used in the estimates of mean and standard deviation; and
1.96 is a scaling constant derived from the normal distribution and the desired level of confidence.
The half-width is a quantitative measure of the margin of error, a.k.a. precision, of the estimate. Note that as n gets larger, the estimate has a smaller margin of error and becomes more precise, but there are diminishing returns to increasing the sample size due to the square root. Increasing the precision by a factor of 2 would require 4 times the sample size if independent sampling is used.
In Python:
var = np.var(X)
print(np.mean(X), var, 1.96 * np.sqrt(var / N))
produces results such as
0.6666763186360812 0.05511848269208021 0.0014551397290634852
where the third column is the confidence interval half-width.
Improving precision
Inverse transform sampling can yield greater precision for a given sample size if we use a clever trick based on fundamental properties of expectation and variance. In intro prob/stats courses you probably were told that Var(X + Y) = Var(X) + Var(Y). The true relationship is actually Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y), where Cov(X,Y) is the covariance between X and Y. If they are independent, the covariance is 0 and the general relationship becomes the one we learn/teach in intro courses, but if they are not independent the more general equation must be used. Variance is always a positive quantity, but covariance can be either positive or negative. Consequently, it’s easy to see that if X and Y have negative covariance the variance of their sum will be less than when they are independent. Negative covariance means that when X is above its mean Y tends to be below its mean, and vice-versa.
So how does that help? It helps because we can use the inverse transform, along with a technique known as antithetic variates, to create pairs of random variables which are identically distributed but have negative covariance. If U is a random variable with a Uniform(0,1) distribution, U' = 1 - U also has a Uniform(0,1) distribution. (In fact, flipping any symmetric distribution will produce the same distribution.) As a result, X = F^-1(U) and X' = F^-1(U') are identically distributed since they're defined by the same CDF, but will have negative covariance because they fall on opposite sides of their shared median and thus strongly tend to fall on opposite sides of their mean. If we average each pair to get A = (F^-1(u_i) + F^-1(1 - u_i)) / 2, the expected value is E[A] = E[(X + X')/2] = 2E[X]/2 = E[X], while the variance is Var(A) = [Var(X) + Var(X') + 2Cov(X,X')]/4 = 2[Var(X) + Cov(X,X')]/4 = [Var(X) + Cov(X,X')]/2. In other words, we get a random variable A whose average is an unbiased estimate of the mean of X but which has less variance.
To fairly compare antithetic results head-to-head with independent sampling, we take the original sample size and allocate it with half the data being generated by the inverse transform of the U’s, and the other half generated by antithetic pairing using 1-U’s. We then average the paired values and generate statistics as before. In Python:
U = np.random.uniform(size = N // 2)
antithetic_avg = (np.sqrt(U) + np.sqrt(1.0 - U)) / 2
anti_var = np.var(antithetic_avg)
print(np.mean(antithetic_avg), anti_var, 1.96*np.sqrt(anti_var / (N / 2)))
which produces results such as
0.6667222935263972 0.0018911848781598295 0.0003811869837216061
Note that the half-width produced with independent sampling is nearly 4 times as large as the half-width produced using antithetic variates. To put it another way, we would need more than an order of magnitude more data for independent sampling to achieve the same precision.
To approximate the integral of some function of x, say, g(x), over S = [0, 1], using Monte Carlo simulation, you
generate N random numbers in [0, 1] (i.e. draw from the uniform distribution U[0, 1])
calculate the arithmetic mean of g(x_i) over i = 1 to i = N where x_i is the ith random number: i.e. (1 / N) times the sum from i = 1 to i = N of g(x_i).
The result of step 2 is the approximation of the integral.
The expected value of continuous random variable X with pdf f(x) and set of possible values S is the integral of x * f(x) over S. The variance of X is the expected value of X-squared minus the square of the expected value of X.
Expected value: to approximate the integral of x * f(x) over S = [0, 1] (i.e. the expected value of X), set g(x) = x * f(x) and apply the method outlined above.
Variance: to approximate the integral of (x * x) * f(x) over S = [0, 1] (i.e. the expected value of X-squared), set g(x) = (x * x) * f(x) and apply the method outlined above. Subtract the result of this by the square of the estimate of the expected value of X to obtain an estimate of the variance of X.
Adapting your method:
import numpy as np
N = 100_000
X = np.random.uniform(size = N, low = 0, high = 1)
Y = [x * (2 * x) for x in X]
E = [(x * x) * (2 * x) for x in X]
# mean
print((a := np.mean(Y)))
# variance
print(np.mean(E) - a * a)
Output
0.6662016482614397
0.05554821798023696
Instead of making Y and E lists, a much better approach is
Y = X * (2 * X)
E = (X * X) * (2 * X)
Y, E in this case are numpy arrays. This approach is much more efficient. Try making N = 100_000_000 and compare the execution times of both methods. The second should be much faster.
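As a rough illustration (a sketch of my own, not from the answer above; exact timings depend on the machine), you can compare the two approaches directly with a smaller N:
import time
import numpy as np

N = 10_000_000                             # large enough to show the gap without exhausting memory
X = np.random.uniform(size=N)

t0 = time.perf_counter()
Y_list = [x * (2 * x) for x in X]          # pure-Python loop over every element
t1 = time.perf_counter()
Y_vec = X * (2 * X)                        # single vectorised NumPy expression
t2 = time.perf_counter()

print(f"list comprehension: {t1 - t0:.2f}s, vectorised: {t2 - t1:.2f}s")
print(np.allclose(Y_list, Y_vec))          # True: both give the same values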
I am performing component-wise regression on time series data. Instead of regressing y against x1, x2, ..., xN, we regress y against x1 only, y against x2 only, ..., take the regression that reduces the sum of squared residuals the most, and add it as a base learner. This is repeated M times, so the final model is the sum of many simple linear regressions of the form y against xi (one exogenous variable only); this is basically gradient boosting using linear regression as the base learners.
The problem is that since I am performing a rolling window regression on the time series data, I have to do N × M × T regressions, which is more than a million OLS fits. Though each OLS fit is very fast, it takes a few hours to run on my weak laptop.
Currently, I am using statsmodels.OLS.fit() to get the parameters for each y-against-xi linear regression, as follows. The z_matrix is the data matrix and i represents the ith column to slice for the regression. The number of rows is about 100 and z_matrix is about 100 × 500.
ols_model = sm.OLS(endog=endog, exog=self.z_matrix[:, i][..., None]).fit()
return ols_model.params, ols_model.ssr, ols_model.fittedvalues[..., None]
I have read in a previous post from 2016, Fastest way to calculate many regressions in python?, that using repeated calls to statsmodels is not efficient, and I tried one of the answers, which suggested NumPy's pinv. Unfortunately it is slower:
# slower: 40sec vs 30sec for statsmodel for 100 repeated runs of 150 linear regressions
params = np.linalg.pinv(self.z_matrix[:, [i]]).dot(endog)
y_hat = self.z_matrix[:, [i]] @ params
ssr = sum((y_hat-endog)**2)
return params, ssr, y_hat
Does anyone have any better suggestions to speed up the computation of these linear regressions? I just need the estimated parameters, the sum of squared residuals, and the predicted ŷ values. Thank you!
Here is one way since you are always running regressions without a constant. This code runs around 900K models in about 0.5s. It retains the sse, the predicted values for each of the 900K regressions, and the estimated parameters.
The big idea is to exploit the math behind a regression of one variable on another when the model does not contain a constant: the slope is just the ratio of a cross-product to an inner product. This could be modified to also include a constant by using a moving-window demean to estimate the intercept (see the sketch after the code below).
import numpy as np
from statsmodels.regression.linear_model import OLS
import datetime
gen = np.random.default_rng(20210514)
# Number of observations
n = 1000
# Number of predictors
m = 1000
# Window size
w = 100
# Simulate data
y = gen.standard_normal((n, 1))
x = gen.standard_normal((n, m))
now = datetime.datetime.now()
# Compute rolling covariance and variance-like terms
# These assume the model is y = x*b + e w/o a constant
c = np.r_[np.zeros((1, m)), np.cumsum(x * y, axis=0)]
v = np.r_[np.zeros((1, m)), np.cumsum(x * x, axis=0)]
c_trimmed = c[w:] - c[:-w]
v_trimmed = v[w:] - v[:-w]
# Parameters are just the ratio
params = c_trimmed / v_trimmed
# Build a selector array to quickly reshape y and the columns of x
step = np.arange(n - w + 1)
sel = np.arange(w)
locs = step[:, None] + sel
# Get the blocked reshape of y. It has n - w + 1 rows with window observations
# and looks like
# [[y[0],y[1],...,y[99]],
# [y[1],y[2],...,y[100]],
# ...,
# [y[900],y[901],...,y[999]],
y_block = y[locs, 0]
# Storage for the predicted values and the sse
y_pred = np.empty((x.shape[1],) + y_block.shape)
sse = np.empty((n - w + 1, m))
# Easiest to loop over columns.
# Could do broadcasting tricks, but not worth the trouble since the number of columns is modest
for i in range(x.shape[1]):
# Reshape a columns of x like y
x_block = x[locs, i]
# Get the parameters and make sure it is 2d with shape (m-w+1, 1)
# so the broadcasting works
p = params[:, i][:, None]
# Get the predicted value
y_pred[i] = x_block * p
# And the sse
sse[:, i] = ((y_block - y_pred[i]) ** 2).sum(1)
print(f"Time: {(datetime.datetime.now() - now).total_seconds()}s")
# Some test code
# Test any single observation
start = 124
assert start <= n - w
column = 342
assert column < x.shape[1]
res = OLS(y[start : start + 100], x[start : start + 100, [column]]).fit()
np.testing.assert_allclose(res.params[0], params[start, column])
np.testing.assert_allclose(res.fittedvalues, y_pred[column, start])
np.testing.assert_allclose(res.ssr, sse[start, column])
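As mentioned above, the same cumulative-sum trick can be extended to a model with a constant by demeaning within each window. Here is a minimal sketch of that idea (my own addition, not part of the original answer), computing rolling slopes and intercepts from rolling sums:
import numpy as np

gen = np.random.default_rng(0)
n, m, w = 1000, 50, 100
y = gen.standard_normal((n, 1))
x = gen.standard_normal((n, m))

def rolling_sum(a, w):
    # Rolling window sums via cumulative sums, one row per window
    c = np.r_[np.zeros((1, a.shape[1])), np.cumsum(a, axis=0)]
    return c[w:] - c[:-w]

sxy = rolling_sum(x * y, w)   # rolling sum of x*y for every column
sx = rolling_sum(x, w)        # rolling sum of x
sy = rolling_sum(y, w)        # rolling sum of y, shape (n - w + 1, 1)
sxx = rolling_sum(x * x, w)   # rolling sum of x**2

# Within-window OLS with an intercept: b = Cov(x, y) / Var(x), a = ybar - b * xbar
slope = (sxy - sx * sy / w) / (sxx - sx ** 2 / w)
intercept = (sy - slope * sx) / w
Each row of slope and intercept corresponds to one window and each column to one predictor; a single window can be checked against statsmodels with a constant added to confirm the estimates.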
Allow me to separate this to increasing difficulty questions:
1.
I have some 1d curve, given as a (n,) point array.
I would like to have it re-sampled k times, and have the results come from a cubic spline that passes through all points.
This can be done with interp1d
2.
The curve is given at non-same-interval samples as an array of shape (n, 2) where (:, 0) represents the sample time, and (:, 1) represent the sample values.
I want to re-sample the curve at k same-time-intervals.
How can this be done?
I thought I could do t_sampler = interp1d(np.arange(0,k), arr[:, 0]) for the time, then interp1d(t_sampler(np.arange(0,k)), arr[:, 1])
Am I missing something with this?
3.
How can I re-sample the curve at equal distance intervals? (question 2 was equal time intervals)
4.
What if the curve is 3d given by an array of shape (n, 4), where (:,0) are the (non uniform) sampling times, and the rest are the locations sampled?
Sorry for the many-questions-in-a-single-question post; they seemed too similar to open a new question for each one.
Partial answer; for 1 and 2 I would do this:
from scipy.interpolate import interp1d
import numpy as np
import matplotlib.pyplot as plt
# dummy data
x = np.arange(-100,100,10)
y = x**2 + np.random.normal(0,1, len(x))
# interpolate:
f = interp1d(x,y, kind='cubic')
# resample at k intervals, with k = 100:
k = 100
# generate x axis:
xnew = np.linspace(np.min(x), np.max(x), k)
# call f on xnew to sample y values:
ynew = f(xnew)
plt.scatter(x,y)
plt.plot(xnew, ynew)
plt.show()
I've written some Python code to emulate MATLAB's xcorr function for cross-correlations:
def xcorr(x, y, scale='none'):
# Pad shorter array if signals are different lengths
if x.size > y.size:
pad_amount = x.size - y.size
y = np.append(y, np.repeat(0, pad_amount))
elif y.size > x.size:
pad_amount = y.size - x.size
x = np.append(x, np.repeat(0, pad_amount))
corr = np.correlate(x, y, mode='full') # scale = 'none'
lags = np.arange(-(x.size - 1), x.size)
if scale == 'biased':
corr = corr / x.size
elif scale == 'unbiased':
corr /= (x.size - abs(lags))
elif scale == 'coeff':
corr /= np.sqrt(np.dot(x, x) * np.dot(y, y))
I get the same values when comparing the values of the different scale types to MATLAB's implementation, so this seems correct.
One additional thing I'd like to add is the ability to normalize the cross-correlation values so peaks don't exceed 1.0 and valleys don't drop below -1.0.
coeff is already normalized, so I'm not worried about that. However, the other scale types can exceed the -1/1 bounds.
I've tried a couple of things:
Adding corr /= max(corr) to the end of my function to normalize corr regardless of which scale option is chosen. This keeps the upper bound in check, but I'm not sure if it correctly handles the lower bound.
Adding corr /= np.sqrt(np.dot(x, x) * np.dot(y, y)) to the end of my function for all options, but this seems to squash my values far away from 1.0
What's the correct way to normalize the none, biased, and unbiased scale options? MATLAB doesn't have this functionality, and Google doesn't turn up any results for normalization of biased/unbiased cross-correlation estimates.
I'm confused. none implies no normalization, and biased and unbiased imply the appropriate normalization so that samples of the output correspond to the appropriate estimators. It doesn't make sense to ask "what normalization should I apply to a biased estimate of correlation so that it's bounded to [-1, 1]", because then the estimate wouldn't be a biased estimate any more; it'd be something else. The only estimator (among this bunch) that has this property is the correlation coefficient (the signal-processing variant of Pearson's coefficient), which is what coeff corresponds to.
This implementation is fine as it is. Anyone seeking numbers in the [-1, 1] interval knows they should ask for the correlation coefficients via np.corrcoef().
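As a quick check (my own sketch, not part of the answer above): the coeff scaling is bounded by the Cauchy-Schwarz inequality, so every lag lands in [-1, 1]:
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(500)
y = rng.standard_normal(500)

# 'coeff'-style scaling: divide the full cross-correlation by sqrt(energy_x * energy_y)
corr = np.correlate(x, y, mode='full') / np.sqrt(np.dot(x, x) * np.dot(y, y))
print(corr.min() >= -1.0 and corr.max() <= 1.0)   # True for any pair of signals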
The following should do what you seek, though I am not sure if it is statistically valid:
corr /= max(np.abs(corr))
I am trying to use NumPy's fft function; however, when I give the function a simple Gaussian the FFT of that Gaussian is not a Gaussian. It's close, but it's split in half so that each half sits at either end of the x-axis.
The Gaussian function I'm calculating is
y = exp(-x^2)
Here is my code:
from cmath import *
from numpy import multiply
from numpy.fft import fft
from pylab import plot, show
""" Basically the standard range() function but with float support """
def frange (min_value, max_value, step):
value = float(min_value)
array = []
while value < float(max_value):
array.append(value)
value += float(step)
return array
N = 256.0 # number of steps
y = []
x = frange(-5, 5, 10/N)
# fill array y with values of the Gaussian function
cache = -multiply(x, x)
for i in cache: y.append(exp(i))
Y = fft(y)
# plot the FFT of the Gaussian function
plot(x, abs(Y))
show()
The result is not quite right, because the FFT of a Gaussian function should be a Gaussian function itself...
np.fft.fft returns a result in so-called "standard order": (from the docs)
If A = fft(a, n), then A[0] contains the zero-frequency term (the mean of the signal), which is always purely real for real inputs. Then A[1:n/2] contains the positive-frequency terms, and A[n/2+1:] contains the negative-frequency terms, in order of decreasingly negative frequency.
The function np.fft.fftshift rearranges the result into the order most humans expect (and which is good for plotting):
The routine np.fft.fftshift(A) shifts transforms and their frequencies to put the zero-frequency components in the middle...
So using np.fft.fftshift:
import matplotlib.pyplot as plt
import numpy as np
N = 128
x = np.arange(-5, 5, 10./(2 * N))
y = np.exp(-x * x)
y_fft = np.fft.fftshift(np.abs(np.fft.fft(y))) / np.sqrt(len(y))
plt.plot(x,y)
plt.plot(x,y_fft)
plt.show()
Your result is not even close to a Gaussian, not even one split into two halves.
To get the result you expect, you will have to position your own Gaussian with the center at index 0, and the result will also be positioned that way. Try the following code:
from pylab import *
N = 128
x = r_[arange(0, 5, 5./N), arange(-5, 0, 5./N)]
y = exp(-x*x)
y_fft = fft(y) / sqrt(2 * N)
plot(r_[y[N:], y[:N]])
plot(r_[y_fft[N:], y_fft[:N]])
show()
The plot commands split the arrays into two halves and swap them to get a nicer picture.
It is being displayed with the center (i.e. mean) at coefficient index zero. That is why it appears that the right half is on the left, and vice versa.
EDIT: Explore the following code:
import numpy as np
import scipy.signal as sig
import pylab
x = sig.windows.gaussian(2048, 10)
X = np.abs(np.fft.fft(x))
pylab.plot(x)
pylab.plot(X)
pylab.plot(np.r_[X[1024:], X[:1024]])
pylab.show()
The last line will plot X starting from the center of the vector, then wrap around to the beginning.
A Fourier transform implicitly repeats indefinitely, since it is a transform of a signal that implicitly repeats indefinitely. Note that when you pass y to be transformed, the x values are not supplied, so the Gaussian that is actually transformed is one centred on the middle of the index range 0 to 256, i.e. on index 128.
Remember also that a translation of f(x) corresponds to a phase change of its transform F(k).
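To see that concretely, here is a small sketch of my own (not part of the answer above): circularly shifting the input leaves the FFT magnitude unchanged but alters the phase.
import numpy as np

N = 256
n = np.arange(N)
y = np.exp(-((n - N // 2) / 10.0) ** 2)     # Gaussian centred mid-array
Y = np.fft.fft(y)
Y_shifted = np.fft.fft(np.roll(y, 37))      # same signal, translated by 37 samples

print(np.allclose(np.abs(Y), np.abs(Y_shifted)))       # True: magnitudes agree
print(np.allclose(np.angle(Y), np.angle(Y_shifted)))   # False: phases differ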
Following on from Sven Marnach's answer, a simpler version would be this:
from pylab import *
N = 128
x = ifftshift(arange(-5,5,5./N))
y = exp(-x*x)
y_fft = fft(y) / sqrt(2 * N)
plot(fftshift(y))
plot(fftshift(y_fft))
show()
This yields a plot identical to the above one.
The key (and this seems strange to me) is that NumPy's assumed data ordering, in both frequency and time domains, is to have the "zero" value first. This is not what I'd expect from other implementations of FFT, such as the FFTW3 libraries in C.
This was slightly fudged in the answers from unutbu and Steve Tjoa above, because they're taking the absolute value of the FFT before plotting it, thus wiping away the phase issues resulting from not using the "standard order" in time.
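For completeness, here is a short sketch of my own (under the same sampling choices as the answers above): using ifftshift before the FFT and fftshift after it keeps the phase clean, and the result matches the analytic transform of exp(-x^2), which is sqrt(pi) * exp(-k^2 / 4).
import numpy as np

N = 256
dx = 10.0 / N
x = np.arange(-5, 5, dx)       # symmetric grid with x[N // 2] == 0
y = np.exp(-x * x)

# ifftshift puts x = 0 at index 0 before the FFT; fftshift re-centres the spectrum
Y = np.fft.fftshift(np.fft.fft(np.fft.ifftshift(y))) * dx
k = 2 * np.pi * np.fft.fftshift(np.fft.fftfreq(N, d=dx))   # angular frequency axis

print(np.max(np.abs(Y.imag)))                                     # ~0: no phase artefacts
print(np.allclose(Y.real, np.sqrt(np.pi) * np.exp(-k ** 2 / 4)))  # should print True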