getting the standard error of linear regression coefficient using bootstrap


I would like to calculate the standard error of linear regression coefficient using bootstrap technique (100 resamples) but the result I got is zero, which is not normal. I think something is wrong with the bootstrap part of the code. Do you know how to fix my code?
x, y = np.genfromtxt("input.txt", unpack=True)
#regression part
slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
print std_err
#bootstrap part of the code
A = np.random.choice(x, size=100, replace=True)
B = np.random.choice(y, size=100, replace=True)
slope2, intercept2, r_value2, p_value2, std_err2 = stats.linregress(A,B)
print std_err2
input.txt:
-1.08 -1.07
-2.62 -2.56
-2.84 -2.79
-2.22 -2.16
-3.47 -3.55
-2.81 -2.79
-2.86 -2.71
-3.41 -3.42
-4.18 -4.21
-3.50 -3.48
-5.67 -5.55
-6.83 -6.95
-6.13 -6.13
-8.34 -8.19
-7.82 -7.83
-9.86 -9.58
-8.67 -8.62
-9.81 -9.81
-8.39 -8.30

I had no issues with your above code running in Python 3.6.1. Maybe check that your scipy version is current?
from scipy import stats
import numpy as np
x, y = np.genfromtxt("./input.txt", unpack=True)
slope_1, intercept_1, r_val_1, p_val_1, stderr_1 = stats.linregress(x, y)
print(slope_1) # 0.9913080927081567
print(stderr_1) # 0.007414734102169809
A = np.random.choice(x, size=100, replace=True)
B = np.random.choice(y, size=100, replace=True)
slope_2, intercept_2, r_val_2, p_val_2, stderr_2 = stats.linregress(A, B)
print(slope_2) # 0.11429903085322253
print(stderr_2) # 0.10158283281966374
Correctly Bootstrapping the Data
The correct way to do this is to resample x and y together, for example with the resample function from sklearn.utils, which draws the same random indices from every array passed to it. Since your data are (x, y) pairs, each y value depends on its x value. If you sample x and y independently you lose that dependency, and the resampled data no longer represent your population.
from scipy import stats
from sklearn.utils import resample
import numpy as np
x, y = np.genfromtxt("./input.txt", unpack=True)
slope_1, intercept_1, r_val_1, p_val_1, stderr_1 = stats.linregress(x, y)
print(slope_1) # 0.9913080927081567
print(stderr_1) # 0.007414734102169809
A, B = resample(x, y, n_samples=100) # defaults to w/ replacement
slope_2, intercept_2, r_val_2, p_val_2, stderr_2 = stats.linregress(A, B)
print(slope_2) # 0.9864339054638176
print(stderr_2) # 0.002669659193615103
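A single paired resample gives only one bootstrap replicate of the slope. The usual bootstrap standard error is the standard deviation of the slope across many such replicates. The following is a minimal sketch of that idea, not the answer author's code: it assumes NumPy only, swaps in np.polyfit for stats.linregress to obtain the slope, embeds the data from input.txt directly, and uses a hypothetical seed and replicate count (n_boot).

```python
import numpy as np

# Data copied from input.txt in the question (column 1 = x, column 2 = y).
data = np.array([
    [-1.08, -1.07], [-2.62, -2.56], [-2.84, -2.79], [-2.22, -2.16],
    [-3.47, -3.55], [-2.81, -2.79], [-2.86, -2.71], [-3.41, -3.42],
    [-4.18, -4.21], [-3.50, -3.48], [-5.67, -5.55], [-6.83, -6.95],
    [-6.13, -6.13], [-8.34, -8.19], [-7.82, -7.83], [-9.86, -9.58],
    [-8.67, -8.62], [-9.81, -9.81], [-8.39, -8.30],
])
x, y = data[:, 0], data[:, 1]

rng = np.random.default_rng(0)  # assumption: fixed seed, only for reproducibility
n_boot = 1000                   # assumption: number of replicates (the question used 100)
n = len(x)

slopes = np.empty(n_boot)
for i in range(n_boot):
    # Draw row indices with replacement so each (x, y) pair stays together.
    idx = rng.integers(0, n, size=n)
    # np.polyfit returns [slope, intercept] for deg=1.
    slopes[i], _ = np.polyfit(x[idx], y[idx], deg=1)

# Bootstrap standard error of the slope: spread of the replicate slopes.
boot_se = slopes.std(ddof=1)
print("bootstrap SE of slope:", boot_se)
```

Each replicate here has the same size as the original sample (19 pairs), which is the conventional choice. Drawing 100 points from 19, as in the code above, would make the fit look more precise than the data support.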

Related

Trouble calculating slope and intercept in Numpy/Scypy using linear regression

I'm new to this forum.
I'm having trouble understanding how to calculate the slope and intercept from values stored in a CSV file.
This is my working code (the program is called minquadbasso.py):
import numpy as np
import matplotlib.pyplot as plt # To visualize
import pandas as pd # To read data
from sklearn.linear_model import LinearRegression
data = pd.read_csv('TelefonoverticaleAsseY.csv') # load data set
X = data.iloc[:, 0].values.reshape(-1, 1) # values converts it into a numpy array
Y = data.iloc[:, 1].values.reshape(-1, 1) # -1 means that calculate the dimension of rows, but have 1 column
linear_regressor = LinearRegression() # create object for the class
linear_regressor.fit(X, Y) # perform linear regression
Y_pred = linear_regressor.predict(X) # make predictions
plt.scatter(X, Y)
plt.plot(X, Y_pred, color='black')
plt.show()
If I use:
from scipy.stats import linregress
linregress(X, Y)
I get this error:
Traceback (most recent call last):
File "minquadbasso.py", line 11, in <module>
linregress(X, Y)
File "/usr/local/lib/python3.7/dist-packages/scipy/stats/_stats_mstats_common.py", line 116, in linregress
ssxm, ssxym, ssyxm, ssym = np.cov(x, y, bias=1).flat
ValueError: too many values to unpack (expected 4)
Can you help me understand what I'm doing wrong and suggest what to change in order to successfully calculate the slope and intercept?

My go-to for linear regression is np.polyfit. If you have an array (or list) of x data, and an array or list of y data just use
coeff = np.polyfit(x,y, deg = 1)
coeff is now an array of least-squares coefficients for your data, with the highest degree of x first. So for a first-degree fit y = ax + b,
a = coeff[0] and b = coeff[1]. Here deg is the degree of the polynomial you want to fit to your data. To evaluate your regression (predict), you can use np.polyval:
y_prediction = np.polyval(coeff, x)
If you want the covariance matrix for the fit
coeff, cov = np.polyfit(x,y, deg = 1, cov = True)
See the numpy.polyfit documentation for more details.
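The traceback in the question is consistent with passing column vectors of shape (n, 1), as produced by reshape(-1, 1) for sklearn, to a function that expects 1-D inputs. np.polyfit also requires a 1-D x. A minimal sketch of flattening first, assuming synthetic data in place of the CSV file (the data, seed, and noise level below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)               # assumption: synthetic stand-in for the CSV
X = np.linspace(0, 10, 50).reshape(-1, 1)    # column vector, the shape sklearn expects
Y = 2.0 * X + 1.0 + rng.normal(scale=0.1, size=X.shape)

# Flatten to 1-D before calling polyfit (or scipy.stats.linregress).
x = X.ravel()
y = Y.ravel()

coeff, cov = np.polyfit(x, y, deg=1, cov=True)
a, b = coeff                          # slope, intercept
a_err, b_err = np.sqrt(np.diag(cov))  # standard errors from the covariance matrix
print(a, b, a_err, b_err)
```

The same x and y can be passed to linregress(x, y) unchanged; only the sklearn calls need the 2-D form.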

Understanding Sklearn's Linear Regression Weighting

I'm having difficulty getting the weighting array in sklearn's Linear Regression to affect the output.
Here's an example with no weighting.
import numpy as np
import seaborn as sns
from sklearn import linear_model
x = np.arange(0,100.)
y = (x**2.0)
xr = np.array(x).reshape(-1, 1)
yr = np.array(y).reshape(-1, 1)
regr = linear_model.LinearRegression()
regr.fit(xr, yr)
y_pred = regr.predict(xr)
sns.scatterplot(x=x, y = y)
sns.lineplot(x=x, y = y_pred.T[0].tolist())
Now when adding weights, I get the same best fit line back. I expected to see the regression favor the steeper part of the curve. What am I doing wrong?
w = [p**2 for p in x.reshape(-1)]
wregr = linear_model.LinearRegression()
wregr.fit(xr,yr, sample_weight=w)
yw_pred = regr.predict(xr)
wregr = linear_model.LinearRegression(fit_intercept=True)
wregr.fit(xr,yr, sample_weight=w)
yw_pred = regr.predict(xr)
sns.scatterplot(x=x, y = y) #plot curve
sns.lineplot(x=x, y = y_pred.T[0].tolist()) #plot non-weighted best fit line
sns.lineplot(x=x, y = yw_pred.T[0].tolist()) #plot weighted best fit line

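This is due to a bug in your code: you fit the weighted model wregr but then predict with the unweighted model regr. The weighted prediction should be:
yw_pred = wregr.predict(xr)
rather than
yw_pred = regr.predict(xr)
With this change, the weighted line no longer coincides with the unweighted one.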
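The effect of the weights can be checked numerically without plotting. The sketch below is a NumPy-only stand-in, not the sklearn code from the question: np.polyfit applies its w argument to the unsquared residuals, so passing the square root of the question's weights minimizes the same weighted sum of squares that sample_weight describes.

```python
import numpy as np

x = np.arange(0, 100.0)
y = x ** 2.0
sample_weight = x ** 2  # same weights as in the question

# Unweighted least-squares line.
slope_u, intercept_u = np.polyfit(x, y, deg=1)

# Weighted line: polyfit weights multiply the residuals before squaring,
# so sqrt(sample_weight) corresponds to weighting squared residuals by sample_weight.
slope_w, intercept_w = np.polyfit(x, y, deg=1, w=np.sqrt(sample_weight))

print("unweighted slope:", slope_u)
print("weighted slope:  ", slope_w)
```

The weighted slope comes out larger than the unweighted one, which matches the expectation that the fit favors the steeper part of the curve.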

python: setting width to fit parameters

I have been trying to fit a data file with unknown fit parameters "ga" and "MA". What I want to do is set a range within which the value of "MA" must lie and then fit the data. For example, I want the fitted value of MA to be in the range [0.5, 0.8] while keeping "ga" as a free fit parameter. I am not sure how to do this. I am copying the Python code here:
#!/usr/bin/env python3
# to the data in "data_file", each line of which contains the data for one point, x_i, y_i, sigma_i.
import numpy as np
from pylab import *
from scipy.optimize import curve_fit
from scipy.stats import chi2
fname = sys.argv[1] if len(sys.argv) > 1000 else 'data.txt'
x, y, err = np.loadtxt(fname, unpack = True)
n = len(x)
p0 = [-1,1]
f = lambda x, ga, MA: ga/((1+x/(MA*MA))*(1+x/(MA*MA)))
p, covm = curve_fit(f, x, y, p0, err)
ga, MA = p
chisq = sum(((f(x, ga, MA) -y)/err)**2)
ndf = n -len(p)
Q = 1. -chi2.cdf(chisq, ndf)
chisq = chisq / ndf
gaerr, MAerr = sqrt(diag(covm)/chisq) # correct the error bars
print('ga = %10.4f +/- %7.4f' % (ga, gaerr))
print('MA = %10.4f +/- %7.4f' % (MA, MAerr))
print('chi squared / NDF = %7.4f' % chisq)
print (covm)

You might consider using lmfit (https://lmfit.github.io/lmfit-py) for this problem. Lmfit provides a higher-level interface to optimization and curve fitting, including treating Parameters as python objects that have bounds.
Your script might be translated to use lmfit as
import sys
import numpy as np
from lmfit import Model
fname = sys.argv[1] if len(sys.argv) > 1000 else 'data.txt'
x, y, err = np.loadtxt(fname, unpack = True)
# define the fitting model function, similar to your `f`:
def f(x, ga, ma):
    return ga/((1+x/(ma*ma))*(1+x/(ma*ma)))
# turn this model function into a Model:
mymodel = Model(f)
# now create parameters for this model, giving initial values
# note that the parameters will be *named* from the arguments of your model function:
params = mymodel.make_params(ga=-1, ma=1)
# params is now an ordered dict with parameter names ('ga', 'ma') as keys.
# you can set min/max values for any parameter:
params['ma'].min = 0.5
params['ma'].max = 2.0
# you can fix the value to not be varied in the fit:
# params['ga'].vary = False
# you can also constrain it to be a simple mathematical expression of other parameters
# now do the fit to your `y` data with `params` and your `x` data
# note that you pass in weights for the residual, so 1/err:
result = mymodel.fit(y, params, x=x, weights=1./err)
# print out fit report with fit statistics and best fit values
# and uncertainties and correlations for variables:
print(result.fit_report())
You can get access to the best-fit parameters as result.params; the initial params will not be changed by the fit. There are also routines to plot the best-fit result and/or residual.
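If you would rather stay with scipy, curve_fit also accepts a bounds argument. The following is a minimal sketch under stated assumptions: the data are synthetic (the true values 1.27 and 0.7, the noise level, and the seed are hypothetical), and the initial guess for MA is moved inside the allowed range, because curve_fit rejects a starting point that lies outside the bounds.

```python
import numpy as np
from scipy.optimize import curve_fit

def f(x, ga, MA):
    # Same model as in the question, written with a single square.
    return ga / (1.0 + x / MA**2) ** 2

# Assumption: synthetic data standing in for data.txt (x_i, y_i, sigma_i).
rng = np.random.default_rng(2)
x = np.linspace(0.1, 2.0, 30)
err = np.full_like(x, 0.01)
y = f(x, 1.27, 0.7) + rng.normal(scale=err)

p0 = [1.0, 0.65]                          # initial guess; MA must start inside its bounds
bounds = ([-np.inf, 0.5], [np.inf, 0.8])  # ga unbounded, 0.5 <= MA <= 0.8

p, covm = curve_fit(f, x, y, p0=p0, sigma=err, bounds=bounds)
ga, MA = p
print("ga =", ga, "MA =", MA)
```

When a fitted value ends up sitting exactly on a bound, the uncertainties derived from covm are usually not meaningful, so it is worth checking where MA lands.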

scipy.stats.linregress - get p-value of intercept

scipy.stats.linregress returns a p-value corresponding to the slope, but no p-value for the intercept. Consider the following example from the docs:
>>> from scipy import stats
>>> import numpy as np
>>> x = np.random.random(10)
>>> y = np.random.random(10)
>>> slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
>>> p_value
0.40795314163864016
According to the docs, p-value is the "two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero." I would like to get the same statistics, but for the intercept instead of the slope.
statsmodels.regression.linear_model.OLS returns p-values for both coefficients out of the box:
>>> import numpy as np
>>> import statsmodels.api as sm
>>> X = sm.add_constant(x)
>>> model = sm.OLS(y,X)
>>> results = model.fit()
>>> results.pvalues
array([ 0.00297559, 0.40795314])
Using only scipy, how can I get the p-value for the intercept (0.00297559 in the statsmodels output above)?

To compute the p-value for the intercept:
compute the t-value as the intercept estimate divided by its standard error (see function tvalue below)
then compute the two-sided p-value from the survival function of the t distribution with n - 2 degrees of freedom (see function pvalue below)
Python code for the scipy case:
import scipy.stats
from scipy import stats
import numpy as np
def tvalue(mean, stderr):
    return mean / stderr
def pvalue(tvalue, dof):
    return 2*scipy.stats.t.sf(abs(tvalue), dof)
np.random.seed(42)
x = np.random.random(10)
y = np.random.random(10)
scipy_results = stats.linregress(x,y)
print(scipy_results)
dof = 1.0*len(x) - 2
print("degrees of freedom = ", dof)
tvalue_intercept = tvalue(scipy_results.intercept, scipy_results.intercept_stderr)
tvalue_slope = tvalue(scipy_results.slope, scipy_results.stderr)
pvalue_intercept = pvalue(tvalue_intercept, dof)
pvalue_slope = pvalue(tvalue_slope, dof)
print(f"""tvalues(intercept, slope) = {tvalue_intercept, tvalue_slope}
pvalues(intercept, slope) = {pvalue_intercept, pvalue_slope}
""")
output:
LinregressResult(slope=0.6741948478345656, intercept=0.044594333294114996, rvalue=0.7042846127289285, pvalue=0.02298486740535295, stderr=0.24027039310814322, intercept_stderr=0.14422953722007206)
degrees of freedom = 8.0
tvalues(intercept, slope) = (0.30919001858870915, 2.8059838713924172)
pvalues(intercept, slope) = (0.7650763497698203, 0.02298486740535295)
compare with the result you obtain with statsmodels:
import statsmodels.api as sm
import math
X = sm.add_constant(x)
model = sm.OLS(y,X)
statsmodels_results = model.fit()
print(f"""intercept, slope = {statsmodels_results.params}
rvalue = {math.sqrt(statsmodels_results.rsquared)}
tvalues(intercept, slope) = {statsmodels_results.tvalues}
pvalues(intercept, slope) = {statsmodels_results.pvalues}""")
output:
intercept, slope = [0.04459433 0.67419485]
rvalue = 0.7042846127289285
tvalues(intercept, slope) = [0.30919002 2.80598387]
pvalues(intercept, slope) = [0.76507635 0.02298487]
notes
fixing a random seed to have reproducible results
using LinregressResult object which contains also intercept_stderr
references
how to compute tvalue and pvalue: https://online.stat.psu.edu/stat501/lesson/2/2.12
how to compute pvalue from tvalue in python: https://www.statology.org/p-value-from-t-score-python/

From SciPy.org documents:
https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.linregress.html
print("r-squared:", r_value**2)
output
r-squared: 0.15286643777
For other parameters, try:
print ('Intercept is: ', (intercept))
print ('Slope is: ', (slope))
print ('R-Value is: ', (r_value))
print ('Std Error is: ', (std_err))
print ('p-value is: ', (p_value))

Linear regression with matplotlib / numpy

I'm trying to generate a linear regression on a scatter plot I have generated, however my data is in list format, and all of the examples I can find of using polyfit require using arange. arange doesn't accept lists though. I have searched high and low about how to convert a list to an array and nothing seems clear. Am I missing something?
Following on, how best can I use my list of integers as inputs to the polyfit?
Here is the polyfit example I am following:
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(data)
y = np.arange(data)
m, b = np.polyfit(x, y, 1)
plt.plot(x, y, 'yo', x, m*x+b, '--k')
plt.show()

arange generates lists (well, numpy arrays); type help(np.arange) for the details. You don't need to call it on existing lists.
>>> x = [1,2,3,4]
>>> y = [3,5,7,9]
>>>
>>> m,b = np.polyfit(x, y, 1)
>>> m
2.0000000000000009
>>> b
0.99999999999999833
I should add that I tend to use poly1d here rather than write out "m*x+b" and the higher-order equivalents, so my version of your code would look something like this:
import numpy as np
import matplotlib.pyplot as plt
x = [1,2,3,4]
y = [3,5,7,10] # 10, not 9, so the fit isn't perfect
coef = np.polyfit(x,y,1)
poly1d_fn = np.poly1d(coef)
# poly1d_fn is now a function which takes in x and returns an estimate for y
plt.plot(x,y, 'yo', x, poly1d_fn(x), '--k') #'--k'=black dashed line, 'yo' = yellow circle marker
plt.xlim(0, 5)
plt.ylim(0, 12)

This code:
from scipy.stats import linregress
linregress(x,y) #x and y are arrays or lists.
returns a result object (a named tuple) with the following fields:
slope : float
    slope of the regression line
intercept : float
    intercept of the regression line
rvalue : float
    correlation coefficient
pvalue : float
    two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero
stderr : float
    standard error of the estimated slope

Use statsmodels.api.OLS to get a detailed breakdown of the fit/coefficients/residuals:
import statsmodels.api as sm
df = sm.datasets.get_rdataset('Duncan', 'carData').data
y = df['income']
x = df['education']
model = sm.OLS(y, sm.add_constant(x))
results = model.fit()
print(results.params)
# const 10.603498 <- intercept
# education 0.594859 <- slope
# dtype: float64
print(results.summary())
# OLS Regression Results
# ==============================================================================
# Dep. Variable: income R-squared: 0.525
# Model: OLS Adj. R-squared: 0.514
# Method: Least Squares F-statistic: 47.51
# Date: Thu, 28 Apr 2022 Prob (F-statistic): 1.84e-08
# Time: 00:02:43 Log-Likelihood: -190.42
# No. Observations: 45 AIC: 384.8
# Df Residuals: 43 BIC: 388.5
# Df Model: 1
# Covariance Type: nonrobust
# ==============================================================================
# coef std err t P>|t| [0.025 0.975]
# ------------------------------------------------------------------------------
# const 10.6035 5.198 2.040 0.048 0.120 21.087
# education 0.5949 0.086 6.893 0.000 0.421 0.769
# ==============================================================================
# Omnibus: 9.841 Durbin-Watson: 1.736
# Prob(Omnibus): 0.007 Jarque-Bera (JB): 10.609
# Skew: 0.776 Prob(JB): 0.00497
# Kurtosis: 4.802 Cond. No. 123.
# ==============================================================================
New in matplotlib 3.5.0
To plot the best-fit line, just pass the slope m and intercept b into the new plt.axline:
import matplotlib.pyplot as plt
# extract intercept b and slope m
b, m = results.params
# plot y = m*x + b
plt.axline(xy1=(0, b), slope=m, label=f'$y = {m:.1f}x {b:+.1f}$')
Note that the slope m and intercept b can be easily extracted from any of the common regression methods:
numpy.polyfit
import numpy as np
m, b = np.polyfit(x, y, deg=1)
plt.axline(xy1=(0, b), slope=m, label=f'$y = {m:.1f}x {b:+.1f}$')
scipy.stats.linregress
from scipy import stats
m, b, *_ = stats.linregress(x, y)
plt.axline(xy1=(0, b), slope=m, label=f'$y = {m:.1f}x {b:+.1f}$')
statsmodels.api.OLS
import statsmodels.api as sm
b, m = sm.OLS(y, sm.add_constant(x)).fit().params
plt.axline(xy1=(0, b), slope=m, label=f'$y = {m:.1f}x {b:+.1f}$')
sklearn.linear_model.LinearRegression
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(x[:, None], y)
b = reg.intercept_
m = reg.coef_[0]
plt.axline(xy1=(0, b), slope=m, label=f'$y = {m:.1f}x {b:+.1f}$')

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
x = np.array([1.5,2,2.5,3,3.5,4,4.5,5,5.5,6])
y = np.array([10.35,12.3,13,14.0,16,17,18.2,20,20.7,22.5])
gradient, intercept, r_value, p_value, std_err = stats.linregress(x,y)
mn=np.min(x)
mx=np.max(x)
x1=np.linspace(mn,mx,500)
y1=gradient*x1+intercept
plt.plot(x,y,'ob')
plt.plot(x1,y1,'-r')
plt.show()
Use this.

George's answer goes together quite nicely with matplotlib's axline which plots an infinite line.
from scipy.stats import linregress
import matplotlib.pyplot as plt
reg = linregress(x, y)
plt.axline(xy1=(0, reg.intercept), slope=reg.slope, linestyle="--", color="k")

from pylab import *
import numpy as np
x1 = arange(data) #for example this is a list
y1 = arange(data) #for example this is a list
x = np.array(x1) # this will convert a list into an array
y = np.array(y1)
m,b = polyfit(x, y, 1)
plot(x, y, 'yo', x, m*x+b, '--k')
show()

Another quick and dirty answer is that you can just convert your list to an array using:
import numpy as np
arr = np.asarray(listname)

Linear regression is a good starting point for artificial intelligence.
Here is an example of a machine learning algorithm, multiple linear regression, in Python:
##### Predicting House Prices Using Multiple Linear Regression - #Y_T_Akademi
#### In this project we are gonna see how machine learning algorithms help us predict house prices. Linear Regression is a model of predicting new future data by using the existing correlation between the old data. Here, machine learning helps us identify this relationship between feature data and output, so we can predict future values.
import pandas as pd
##### we use sklearn library in many machine learning calculations..
from sklearn import linear_model
##### we import our dataset: housepricesdataset.csv
df = pd.read_csv("housepricesdataset.csv",sep = ";")
##### The following is our feature set:
##### The following is the output(result) data:
##### we define a linear regression model here:
reg = linear_model.LinearRegression()
reg.fit(df[['area', 'roomcount', 'buildingage']], df['price'])
# Since our model is ready, we can make predictions now:
# lets predict a house with 230 square meters, 4 rooms and 10 years old building..
reg.predict([[230,4,10]])
# Now lets predict a house with 230 square meters, 6 rooms and 0 years old building - its new building..
reg.predict([[230,6,0]])
# Now lets predict a house with 355 square meters, 3 rooms and 20 years old building
reg.predict([[355,3,20]])
# You can make as many prediction as you want..
reg.predict([[230,4,10], [230,6,0], [355,3,20], [275, 5, 17]])
And my dataset is below:
