Numpy Polynomial Regression, error arising at higher powers - python

I have the following code for finding the coefficients for polynomial regression:
import numpy as np

x = [-1,0,1,2,3,5,7,9,4,6,8,2,4,6,8,5,2]
y = [-1,3,2.5,5,4,2,5,4,4,7,3,2,7,9,6,2,6]
degree = 4

x = np.asarray(x)
x = np.vstack((np.ones(len(x)), x))   # rows: 1, x
for i in range(2, degree + 1):
    xs = np.power(x[1], i)            # next power of the original data row
    x = np.vstack((x, xs))
y = np.asarray(y)
y = np.matrix(y).T
x = np.matrix(x).T
parameters = (x.T * x).I * x.T * y
This code agrees with numpy.polyfit(x, y, degree) at lower degrees, but begins to return different results at around the 16th degree.
Is my code incorrect? Why does its result differ from the built-in function at higher powers?

Given n data points, you can only fit a polynomial of degree up to n - 1; beyond that there is insufficient data to determine a unique polynomial, and the matrix X^T X that you invert becomes singular. Effectively, you are dividing by zero. With your 17 points, that limit is degree 16. Even before that limit, X^T X becomes severely ill-conditioned at high degrees, so explicitly inverting it loses precision, while numpy.polyfit solves the least-squares problem directly and degrades more gracefully.
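A quick check of that conditioning claim (my own sketch, not part of the original answer; np.vander builds the same design matrix the question constructs by hand):

import numpy as np

x = np.asarray([-1,0,1,2,3,5,7,9,4,6,8,2,4,6,8,5,2], dtype=float)
y = np.asarray([-1,3,2.5,5,4,2,5,4,4,7,3,2,7,9,6,2,6], dtype=float)

degree = 8
X = np.vander(x, degree + 1, increasing=True)  # columns: 1, x, ..., x**degree

# cond(X.T @ X) is roughly cond(X)**2, so the normal equations lose
# about twice as many digits as a direct least-squares solve
print(np.linalg.cond(X))
print(np.linalg.cond(X.T @ X))

# normal equations vs. the lstsq-based solve that polyfit uses internally
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.abs(beta_normal - beta_lstsq).max())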

Simulating expectation of continuous random variable

I want to generate some samples and estimate the expectation and variance of the distribution.
Given the probability density function f(x) = 2x for 0 <= x <= 1 (0 otherwise),
I have already found that E(X) = 2/3 and Var(X) = 1/18; my detailed solution is here: https://math.stackexchange.com/questions/4430163/simulating-expectation-of-continuous-random-variable
But here is what I have when simulating using python:
import numpy as np
N = 100_000
X = np.random.uniform(size=N, low=0, high=1)
Y = [2*x for x in X]
np.mean(Y) # 1.00221 <- not equal to 2/3
np.var(Y) # 0.3323 <- not equal to 1/18
What am I doing wrong here? Thank you in advance.
You are generating the mean and variance of Y = 2X, when you want the mean and variance of the X's themselves. You know the density, but the CDF is more useful for random variate generation than the PDF. For your problem, the density is:
f(x) = 2x for 0 <= x <= 1, and 0 otherwise,
so the CDF is:
F(x) = x^2 for 0 <= x <= 1 (0 below that range, 1 above it).
Given that the CDF is an easily invertible function for the range [0,1], you can use inverse transform sampling to generate X values by setting F(X) = U, where U is a Uniform(0,1) random variable, and inverting the relationship to solve for X. For your problem, this yields X = U^(1/2), i.e. X = sqrt(U).
In other words, you can generate X values with
import numpy as np
N = 100_000
X = np.sqrt(np.random.uniform(size = N))
and then do anything you want with the data, such as calculate mean and variance, plot histograms, use in simulation models, or whatever.
A histogram will confirm that the generated data have the desired density:
import matplotlib.pyplot as plt
plt.hist(X, bins = 100, density = True)
plt.show()
The resulting histogram rises linearly from 0 at x = 0 to about 2 at x = 1, matching the target density f(x) = 2x.
The mean and variance estimates can then be calculated directly from the data:
print(np.mean(X), np.var(X)) # => 0.6661509538922444 0.05556962913014367
But wait! There’s more...
Margin of error
Simulation generates random data, so estimates of mean and variance will be variable across repeated runs. Statisticians use confidence intervals to quantify the magnitude of the uncertainty in statistical estimates. When the sample size is sufficiently large to invoke the central limit theorem, an interval estimate of the mean is calculated as (x-bar ± half-width), where x-bar is the estimate of the mean. For a so-called 95% confidence interval, the half-width is 1.96 * s / sqrt(n) where:
s is the estimated standard deviation;
n is the number of samples used in the estimates of mean and standard deviation; and
1.96 is a scaling constant derived from the normal distribution and the desired level of confidence.
The half-width is a quantitative measure of the margin of error, a.k.a. precision, of the estimate. Note that as n gets larger, the estimate has a smaller margin of error and becomes more precise, but there are diminishing returns to increasing the sample size due to the square root. Increasing the precision by a factor of 2 would require 4 times the sample size if independent sampling is used.
In Python:
var = np.var(X)
print(np.mean(X), var, 1.96 * np.sqrt(var / N))
produces results such as
0.6666763186360812 0.05511848269208021 0.0014551397290634852
where the third value printed is the confidence interval half-width.
Improving precision
Inverse transform sampling can yield greater precision for a given sample size if we use a clever trick based on fundamental properties of expectation and variance. In intro prob/stats courses you probably were told that Var(X + Y) = Var(X) + Var(Y). The true relationship is actually Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y), where Cov(X,Y) is the covariance between X and Y. If they are independent, the covariance is 0 and the general relationship reduces to the one we learn/teach in intro courses, but if they are not independent the more general equation must be used. Variance is always nonnegative, but covariance can be either positive or negative. Consequently, it's easy to see that if X and Y have negative covariance, the variance of their sum will be less than when they are independent. Negative covariance means that when X is above its mean, Y tends to be below its mean, and vice versa.
So how does that help? It helps because we can use the inverse transform, along with a technique known as antithetic variates, to create pairs of random variables which are identically distributed but have negative covariance. If U is a random variable with a Uniform(0,1) distribution, U' = 1 - U also has a Uniform(0,1) distribution. (In fact, flipping any symmetric distribution will produce the same distribution.) As a result, X = F^-1(U) and X' = F^-1(U') are identically distributed since they're defined by the same CDF, but will have negative covariance because they fall on opposite sides of their shared median and thus strongly tend to fall on opposite sides of their mean. If we average each pair to get A = (F^-1(u_i) + F^-1(1 - u_i)) / 2, the expected value is E[A] = E[(X + X')/2] = 2E[X]/2 = E[X], while the variance is Var(A) = [Var(X) + Var(X') + 2Cov(X,X')]/4 = 2[Var(X) + Cov(X,X')]/4 = [Var(X) + Cov(X,X')]/2. In other words, we get a random variable A whose average is an unbiased estimate of the mean of X but which has less variance.
To fairly compare antithetic results head-to-head with independent sampling, we take the original sample size and allocate it with half the data being generated by the inverse transform of the U’s, and the other half generated by antithetic pairing using 1-U’s. We then average the paired values and generate statistics as before. In Python:
U = np.random.uniform(size = N // 2)
antithetic_avg = (np.sqrt(U) + np.sqrt(1.0 - U)) / 2
anti_var = np.var(antithetic_avg)
print(np.mean(antithetic_avg), anti_var, 1.96*np.sqrt(anti_var / (N / 2)))
which produces results such as
0.6667222935263972 0.0018911848781598295 0.0003811869837216061
Note that the half-width produced with independent sampling is nearly 4 times as large as the half-width produced using antithetic variates. To put it another way, we would need more than an order of magnitude more data for independent sampling to achieve the same precision.
To approximate the integral of some function of x, say g(x), over S = [0, 1] using Monte Carlo simulation, you:
1. generate N random numbers in [0, 1] (i.e. draw from the uniform distribution U[0, 1]);
2. calculate the arithmetic mean of g(x_i) over i = 1 to i = N, where x_i is the ith random number: i.e. (1 / N) times the sum from i = 1 to i = N of g(x_i).
The result of step 2 is the approximation of the integral.
The expected value of continuous random variable X with pdf f(x) and set of possible values S is the integral of x * f(x) over S. The variance of X is the expected value of X-squared minus the square of the expected value of X.
Expected value: to approximate the integral of x * f(x) over S = [0, 1] (i.e. the expected value of X), set g(x) = x * f(x) and apply the method outlined above.
Variance: to approximate the integral of (x * x) * f(x) over S = [0, 1] (i.e. the expected value of X-squared), set g(x) = (x * x) * f(x) and apply the method outlined above. Subtract the square of the estimate of the expected value of X from this result to obtain an estimate of the variance of X.
Adapting your method:
import numpy as np
N = 100_000
X = np.random.uniform(size = N, low = 0, high = 1)
Y = [x * (2 * x) for x in X]
E = [(x * x) * (2 * x) for x in X]
# mean
print((a := np.mean(Y)))
# variance
print(np.mean(E) - a * a)
Output
0.6662016482614397
0.05554821798023696
Instead of making Y and E lists, a much better approach is
Y = X * (2 * X)
E = (X * X) * (2 * X)
Y, E in this case are numpy arrays. This approach is much more efficient. Try making N = 100_000_000 and compare the execution times of both methods. The second should be much faster.
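A quick timing harness, if you want to check this yourself (a sketch of mine, not from the answer; a smaller N keeps the list version tolerable):

import timeit
import numpy as np

N = 10_000_000
X = np.random.uniform(size=N)

t_list = timeit.timeit(lambda: [x * (2 * x) for x in X], number=1)
t_array = timeit.timeit(lambda: X * (2 * X), number=1)
print(f"list comprehension: {t_list:.2f}s, numpy arrays: {t_array:.3f}s")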

Python: Fastest way to perform millions of simple linear regression with 1 exogenous variable only

I am performing component-wise regression on time series data. Instead of regressing y against x1, x2, ..., xN all at once, we regress y against x1 only, y against x2 only, ..., take the regression that reduces the sum of squared residuals the most, and add it as a base learner. This is repeated M times, so the final model is the sum of many simple linear regressions of the form y against xi (1 exogenous variable only); this is basically gradient boosting using linear regression as the base learners.
The problem is that since I am performing a rolling window regression on the time series data, I have to do N x M x T regressions, which is more than a million OLS fits. Though each OLS is very fast, it takes a few hours to run on my weak laptop.
Currently, I am using statsmodels.OLS.fit() as the way to get my parameters for each y against xi linear regression as such. The z_matrix is the data matrix and the i represents the ith column to slice for the regression. The number of rows is about 100 and z_matrix is about size 100 × 500.
ols_model = sm.OLS(endog=endog, exog=self.z_matrix[:, i][..., None]).fit()
return ols_model.params, ols_model.ssr, ols_model.fittedvalues[..., None]
I have read in a previous post from 2016, Fastest way to calculate many regressions in python?, that repeated calls to statsmodels are not efficient, so I tried one of the answers, which suggested numpy's pinv; unfortunately it is slower:
# slower: 40 sec vs 30 sec for statsmodels, for 100 repeated runs of 150 linear regressions
params = np.linalg.pinv(self.z_matrix[:, [i]]).dot(endog)
y_hat = self.z_matrix[:, [i]] @ params
ssr = sum((y_hat - endog) ** 2)
return params, ssr, y_hat
Does anyone have any better suggestions to speed up the computation of the linear regressions? I just need the estimated parameters, the sum of squared residuals, and the predicted ŷ values. Thank you!
Here is one way, since you are always running regressions without a constant. This code runs around 900K models in about 0.5s. It retains the sse, the predicted values for each of the 900K regressions, and the estimated parameters.
The big idea is to exploit the math behind a regression of one variable on another without a constant: the slope estimate is just the ratio of a cross-product to an inner product, sum(x*y) / sum(x*x). This could be modified to also include a constant by using moving-window demeaning to estimate the intercept.
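As a quick illustration of that identity (my own check, not part of the original answer):

import numpy as np
from statsmodels.regression.linear_model import OLS

rng = np.random.default_rng(0)
x1 = rng.standard_normal(100)
y1 = rng.standard_normal(100)

# No-constant OLS slope equals the cross-product over the inner product
slope_ratio = (x1 @ y1) / (x1 @ x1)
slope_ols = OLS(y1, x1[:, None]).fit().params[0]
print(np.isclose(slope_ratio, slope_ols))  # True

The full rolling-window version follows.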
import numpy as np
from statsmodels.regression.linear_model import OLS
import datetime

gen = np.random.default_rng(20210514)
# Number of observations
n = 1000
# Number of predictors
m = 1000
# Window size
w = 100
# Simulate data
y = gen.standard_normal((n, 1))
x = gen.standard_normal((n, m))

now = datetime.datetime.now()
# Compute rolling covariance and variance-like terms.
# These assume the model is y = x*b + e w/o a constant.
c = np.r_[np.zeros((1, m)), np.cumsum(x * y, axis=0)]
v = np.r_[np.zeros((1, m)), np.cumsum(x * x, axis=0)]
c_trimmed = c[w:] - c[:-w]
v_trimmed = v[w:] - v[:-w]
# Parameters are just the ratio
params = c_trimmed / v_trimmed
# Build a selector array to quickly reshape y and the columns of x
step = np.arange(n - w + 1)
sel = np.arange(w)
locs = step[:, None] + sel
# Get the blocked reshape of y. It has n - w + 1 rows, each holding w
# consecutive observations, and looks like
# [[y[0],   y[1],   ..., y[99]],
#  [y[1],   y[2],   ..., y[100]],
#  ...,
#  [y[900], y[901], ..., y[999]]]
y_block = y[locs, 0]
# Storage for the predicted values and the sse
y_pred = np.empty((x.shape[1],) + y_block.shape)
sse = np.empty((n - w + 1, m))
# Easiest to loop over columns. Could do broadcasting tricks, but not
# worth the trouble since the number of columns is modest.
for i in range(x.shape[1]):
    # Reshape a column of x like y
    x_block = x[locs, i]
    # Get the parameters and make sure they are 2d with shape (n-w+1, 1)
    # so the broadcasting works
    p = params[:, i][:, None]
    # Get the predicted values
    y_pred[i] = x_block * p
    # And the sse
    sse[:, i] = ((y_block - y_pred[i]) ** 2).sum(1)
print(f"Time: {(datetime.datetime.now() - now).total_seconds()}s")

# Some test code: check any single window/column pair against statsmodels
start = 124
assert start <= n - w
column = 342
assert column < x.shape[1]
res = OLS(y[start : start + w], x[start : start + w, [column]]).fit()
np.testing.assert_allclose(res.params[0], params[start, column])
np.testing.assert_allclose(res.fittedvalues, y_pred[column, start])
np.testing.assert_allclose(res.ssr, sse[start, column])

Python: How to resample a 3d curve given by points as spline in equal distances?

Allow me to separate this into questions of increasing difficulty:
1.
I have some 1d curve, given as a (n,) point array.
I would like to have it re-sampled k times, and have the results come from a cubic spline that passes through all points.
This can be done with interp1d
2.
The curve is given at non-same-interval samples as an array of shape (n, 2) where (:, 0) represents the sample time, and (:, 1) represent the sample values.
I want to re-sample the curve at k same-time-intervals.
How can this be done?
I thought I could do t_sampler = interp1d(np.arange(0, k), arr[:, 0]) for the time, then interp1d(t_sampler(np.arange(0, k)), arr[:, 1]).
Am I missing something with this?
3.
How can I re-sample the curve at equal distance intervals? (question 2 was equal time intervals)
4.
What if the curve is 3d given by an array of shape (n, 4), where (:,0) are the (non uniform) sampling times, and the rest are the locations sampled?
Sorry for the many-questions-in-a-single-question format; they seemed too similar to open a new question for each one.
Partial answer; for 1 and 2 I would do this:
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt
import numpy as np
# dummy data
x = np.arange(-100,100,10)
y = x**2 + np.random.normal(0,1, len(x))
# interpolate:
f = interp1d(x,y, kind='cubic')
# resample at k intervals, with k = 100:
k = 100
# generate x axis:
xnew = np.linspace(np.min(x), np.max(x), k)
# call f on xnew to sample y values:
ynew = f(xnew)
plt.scatter(x,y)
plt.plot(xnew, ynew)
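For question 3 (and, by extension, the 3d case of question 4), a common approach is to parameterize the curve by cumulative chord length, which approximates arc length, and then interpolate at equally spaced arc-length values. A sketch of my own (not from the original answer), under the assumption that chord length is an acceptable stand-in for true arc length:

import numpy as np
from scipy.interpolate import interp1d

# dummy 2d curve of shape (n, 2); the same idea extends to (n, 3)
t = np.linspace(0, 2 * np.pi, 40)
points = np.column_stack([np.cos(t), np.sin(2 * t)])

# cumulative chord length approximates arc length along the curve
seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
s = np.concatenate([[0.0], np.cumsum(seg)])

# interpolate each coordinate against arc length, then sample arc length uniformly
k = 100
s_uniform = np.linspace(0, s[-1], k)
resampled = interp1d(s, points, axis=0, kind='cubic')(s_uniform)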

Closed Form Ridge Regression

I am having trouble understanding the output of my function to implement multiple-ridge regression. I am doing this from scratch in Python for the closed form of the method. This closed form is shown below:
w = (X^T X + lambda * I)^(-1) X^T y
I have a training set X that is 100 rows x 10 columns and a vector y that is 100x1.
My attempt is as follows:
def ridgeRegression(xMatrix, yVector, lambdaRange):
    wList = []
    for i in range(1, lambdaRange + 1):
        lambVal = i
        # compute the inner values (X.T X + lambda I)
        xTranspose = np.transpose(xMatrix)
        xTx = xTranspose @ xMatrix
        lamb_I = lambVal * np.eye(xTx.shape[0])
        # invert inner, e.g. (inner)**(-1)
        inner_matInv = np.linalg.inv(xTx + lamb_I)
        # compute outer (X.T y)
        outer_xTy = np.dot(xTranspose, yVector)
        # multiply together
        w = inner_matInv @ outer_xTy
        wList.append(w)
    print(wList)
For testing, I am running it with the first 5 lambda values.
wList becomes 5 numpy.arrays each of length 10 (I'm assuming for the 10 coefficients).
Here is the first of those 5 arrays:
array([ 0.29686755, 1.48420319, 0.36388528, 0.70324668, -0.51604451,
2.39045735, 1.45295857, 2.21437745, 0.98222546, 0.86124358])
My questions, for clarification:
Shouldn't there be 11 coefficients (1 for the y-intercept + 10 slopes)?
How do I get the minimum square error (MSE) from this computation?
What comes next if I wanted to plot this line?
I think I am just really confused as to what I'm looking at, since I'm still working on my linear algebra.
Thanks!
First, I would modify your ridge regression to look like the following:
import numpy as np

def ridgeRegression(X, y, lambdaRange):
    wList = []
    # Get normal form of `X`
    A = X.T @ X
    # Get identity matrix
    I = np.eye(A.shape[0])
    # Get right-hand side
    c = X.T @ y
    for lambVal in range(1, lambdaRange + 1):
        # Set up equations Bw = c
        lamb_I = lambVal * I
        B = A + lamb_I
        # Solve for w
        w = np.linalg.solve(B, c)
        wList.append(w)
    return wList
Notice that I replaced your inv call to compute the matrix inverse with an implicit solve. This is much more numerically stable, which is an important consideration for these types of problems especially.
I've also taken the A = X.T @ X computation, the identity matrix I generation, and the right-hand side vector c = X.T @ y computation out of the loop; these don't change within the loop and are relatively expensive to compute.
As was pointed out by @qwr, the number of columns of X will determine the number of coefficients you have. You have not described your model, so it's not clear how the underlying domain, x, is structured into X.
Traditionally, one might use polynomial regression, in which case X is the Vandermonde matrix. In that case, the first coefficient would be associated with the y-intercept. However, based on the context of your question, you seem to be interested in multivariate linear regression. In any case, the model needs to be clearly defined. Once it is, then the returned weights may be used to further analyze your data.
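(For illustration, a small sketch of mine: numpy can build such a Vandermonde design matrix directly, with the intercept column first:)

import numpy as np

x = np.linspace(0, 1, 5)
X = np.vander(x, 3, increasing=True)  # columns: 1, x, x**2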
Typically, to make notation more compact, the matrix X contains a column of ones for an intercept, so if you have p predictors, the matrix has dimensions n by (p + 1). See the Wikipedia article on linear regression for an example.
To compute the in-sample MSE, use the definition of MSE: the average of squared residuals. To compute the generalization error, you need cross-validation.
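For instance, a minimal sketch (mine, assuming X, y, and a fitted weight vector w from the function above are in scope):

# In-sample MSE: average of squared residuals
residuals = y - X @ w
mse = np.mean(residuals ** 2)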
Also, you shouldn't take lambVal as an integer. It can be small (close to 0) if the aim is just to avoid numerical error when xTx is ill-conditioned.
I would advise you to use a logarithmic range instead of a linear one, starting from 0.001 and going up to 100 or more if you want to. For instance, you can change your code to this:
powerMin = -3
powerMax = 3
for i in range(powerMin, powerMax):
    lambVal = 10**i
    print(lambVal)
And then you can try a smaller or linear range once you figure out the correct order of magnitude of lambVal for your data from cross-validation.
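Equivalently, assuming numpy is imported as np, np.logspace builds the same grid in one call:

lambdas = np.logspace(-3, 2, num=6)  # 0.001, 0.01, 0.1, 1, 10, 100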

Density estimation in Python

I'm new to Python and I have to estimate a density from a 2D sample. My first idea was a simple histogram estimator, which I implemented as below:
import numpy as np

num = 10**4
sp = 0.01
subsetf1 = np.random.uniform(0, 1, size=(num, 2))  # I created this set to test the estimator

def f_est(x, y, h=sp, subset=subsetf1):
    indicator = np.zeros(num)
    for i in range(num):
        if (x <= subset[i][0] <= x + h) and (y <= subset[i][1] <= y + h):
            indicator[i] = 1
        else:
            indicator[i] = 0
    return sum(indicator) / (num * h**2)

# f_est should yield values close to 1 for 0 < x, y < 1, because subsetf1
# contains uniformly distributed values on [0, 1).
The problem is that f_est often yields values greater than 1 which leads me to believe that my code is wrong, but I don't know where the bug could be. I also tried kernel density estimation but if I try something like:
from scipy import stats

xmin = subsetf1[:, 0].min()
xmax = subsetf1[:, 0].max()
ymin = subsetf1[:, 1].min()
ymax = subsetf1[:, 1].max()
X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([X.ravel(), Y.ravel()])
values = np.vstack([subsetf1[:, 0], subsetf1[:, 1]])
gkde = stats.gaussian_kde(values)
f = np.reshape(gkde(positions).T, X.shape)
f yields strange values which can't be right.
It would be great if someone could tell me where the bug in my code is and maybe how to use kde in Python, because I did not find good tutorials on this topic.
Kernel density estimates can produce values greater than 1 because they return not a probability but a probability density. A density in your case gives the probability per unit area, and it can locally be much larger than 1.
The good news is that since your results match those of scipy or statsmodels, it sounds like your code is correct.
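A quick sanity check along those lines (a sketch of mine, not from the original answer): a density that locally exceeds 1 still integrates to roughly 1 over its support:

import numpy as np
from scipy import stats

# 2d sample concentrated in a small square, so the true density is 25
sample = np.random.uniform(0, 0.2, size=(2, 10_000))
kde = stats.gaussian_kde(sample)

# evaluate the estimated density on a grid covering the support
grid = np.linspace(-0.2, 0.4, 200)
X, Y = np.meshgrid(grid, grid)
density = kde(np.vstack([X.ravel(), Y.ravel()]))

cell_area = (grid[1] - grid[0]) ** 2
print(density.max())              # far greater than 1
print(density.sum() * cell_area)  # approximately 1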
