Trouble calculating slope and intercept in NumPy/SciPy using linear regression - python

I'm new to this forum.
I'm having trouble understanding how to calculate the slope and intercept from values stored in a CSV file.
This is my working code (minquadbasso.py is the program's name):
import numpy as np
import matplotlib.pyplot as plt # To visualize
import pandas as pd # To read data
from sklearn.linear_model import LinearRegression
data = pd.read_csv('TelefonoverticaleAsseY.csv') # load data set
X = data.iloc[:, 0].values.reshape(-1, 1) # .values converts the column into a numpy array
Y = data.iloc[:, 1].values.reshape(-1, 1) # -1 means the number of rows is inferred, with 1 column
linear_regressor = LinearRegression() # create object for the class
linear_regressor.fit(X, Y) # perform linear regression
Y_pred = linear_regressor.predict(X) # make predictions
plt.scatter(X, Y)
plt.plot(X, Y_pred, color='black')
plt.show()
If I use:
from scipy.stats import linregress
linregress(X, Y)
the interpreter gives me this error:
Traceback (most recent call last):
File "minquadbasso.py", line 11, in <module>
linregress(X, Y)
File "/usr/local/lib/python3.7/dist-packages/scipy/stats/_stats_mstats_common.py", line 116, in linregress
ssxm, ssxym, ssyxm, ssym = np.cov(x, y, bias=1).flat
ValueError: too many values to unpack (expected 4)
Can you help me understand what I'm doing wrong and suggest what to change in order to calculate the slope and intercept successfully?

My go-to for linear regression is np.polyfit. If you have an array (or list) of x data and an array (or list) of y data, just use
coeff = np.polyfit(x,y, deg = 1)
coeff is now an array of least-squares coefficients for your data, with the highest power of x first. So for a first-degree fit y = ax + b,
a = coeff[0] and b = coeff[1]. 'deg' is the degree of the polynomial you want to fit to your data. To evaluate your regression (predict) you can use np.polyval:
y_prediction = np.polyval(coeff, x)
If you want the covariance matrix for the fit
coeff, cov = np.polyfit(x,y, deg = 1, cov = True)
you can find more on it here.
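As for the original scipy.stats.linregress error: it most likely comes from passing the reshaped (n, 1) column arrays, since linregress expects one-dimensional x and y (np.cov then builds a much larger covariance matrix than the 2x2 it expects). A minimal sketch of that route, reusing the data DataFrame loaded in your script:
from scipy.stats import linregress
x = data.iloc[:, 0].values   # 1-D array, no reshape
y = data.iloc[:, 1].values   # or use X.ravel(), Y.ravel() on the reshaped arrays
result = linregress(x, y)
print(result.slope, result.intercept)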

Related

Why isn't this Linear Regression line a straight line?

I have points with x and y coordinates that I want to fit a straight line to with linear regression, but I get a jagged-looking line.
I am attempting to use LinearRegression from sklearn.
To create the points, a for loop randomly creates one hundred points in an array that is 100 x 2 in shape. I slice the left column of it for the xs and the right column for the ys.
I expect to see a straight line when I plot m.predict.
import numpy as np
import matplotlib.pyplot as plt
import random
from sklearn.linear_model import LinearRegression
X = []
adder = 0
for z in range(100):
    r = random.random() * 20
    r2 = random.random() * 15
    X.append([r+adder-0.4, r2+adder])
    adder += 0.6
X = np.array(X)
plt.scatter(X[:,0], X[:,1], s=10)
plt.show()
m = LinearRegression()
m.fit(X[:,0].reshape(1, -1), X[:,1].reshape(1, -1))
plt.plot(m.predict(X[:,0].reshape(1, -1))[0])
I am not great with numpy, but I think the problem is how reshape() is used to convert X[:,0] and X[:,1] from 1D to 2D: reshape(1, -1) produces a 2D array with a single row (one sample holding all 100 values as features) instead of a column of len(X[:,0]) and len(X[:,1]) samples respectively, which results in an undesired regressor.
I was able to recreate this model using pandas and plot the desired result. Code as follows:
import numpy as np
import matplotlib.pyplot as plt
import random
from sklearn.linear_model import LinearRegression
import pandas as pd
X = []
adder = 0
for z in range(100):
    r = random.random() * 20
    r2 = random.random() * 15
    X.append([r+adder-0.4, r2+adder])
    adder += 0.6
X = np.array(X)
y_train = pd.DataFrame(X[:,1],columns=['y'])
X_train = pd.DataFrame(X[:,0],columns=['X'])
#plt.scatter(X_train, y_train, s=10)
#plt.show()
m = LinearRegression()
m.fit(X_train, y_train)
plt.scatter(X_train,y_train)
plt.plot(X_train,m.predict(X_train),color='red')
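For reference, the jagged line can also be fixed without pandas: sklearn expects X shaped as (n_samples, n_features), so reshaping the x column with reshape(-1, 1) and leaving the y column one-dimensional is enough. A minimal sketch, reusing the X array built above:
m = LinearRegression()
m.fit(X[:, 0].reshape(-1, 1), X[:, 1])  # 100 samples, 1 feature
plt.scatter(X[:, 0], X[:, 1], s=10)
plt.plot(X[:, 0], m.predict(X[:, 0].reshape(-1, 1)), color='red')
plt.show()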

How to interpret the coefficients returned from a multivariate cubic regression (polynomial degree 3) when using linearRegression().coef_?

I am trying to fit a hyperplane to a dataset which includes 2 features and 1 target variable. I processed the features using PolynomialFeatures.fit_transform() with PolynomialFeatures(degree = 3), and then fitted those features and the target variable into a LinearRegression() model. When I use LinearRegression().coef_ to get the coefficients in order to write out a function for the hyperplane (I want the written-out function itself), 10 coefficients are returned and I don't know how to interpret them into a function. I know that for a PolynomialFeatures(degree = 2) model, 6 coefficients are returned and the function looks like m[0] + x1*m[1] + x2*m[2] + (x1**2)*m[3] + (x2**2)*m[4] + x1*x2*m[5] where m is the list of coefficients returned in that order. How would I interpret the cubic one?
Here is what my code for the cubic model looks like:
poly = polyF(degree = 3)
x_poly = poly.fit_transform(x)
model = linR()
model.fit(x_poly, y)
model.coef_
(returns):
array([ 0.00000000e+00, -1.50603348e+01, 2.33283686e+00, 6.73172519e-01,
-1.93686431e-01, -7.30930307e-02, -9.31687047e-03, 3.48729458e-03,
1.63718406e-04, 2.26682333e-03])
So if, for degree 2, (X1, X2) transforms to (1, X1, X2, X1^2, X1*X2, X2^2),
then for degree 3, (X1, X2) transforms to
(1,
X1, X2,
X1^2, X1*X2, X2^2,
X1^3, X1^2*X2, X1*X2^2, X2^3)
which matches the 10 coefficients returned, in that order.
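If you want to confirm the ordering rather than work it out by hand, PolynomialFeatures can report the term names directly; a quick sketch (assuming your two input columns are called x1 and x2):
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3)
poly.fit([[0.0, 0.0]])  # only the number of columns matters here
print(poly.get_feature_names_out(['x1', 'x2']))
# expected: ['1' 'x1' 'x2' 'x1^2' 'x1 x2' 'x2^2' 'x1^3' 'x1^2 x2' 'x1 x2^2' 'x2^3']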
I was facing the same question and developed the following code block to print the fit equation. To do so, it was necessary to include_bias=True in PolynomialFeatures and to set fit_intercept=False in LinearRegression, as opposed to conventional use:
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

def polyReg():
    seed=12341
    df=pd.read_csv("input.txt", delimiter=', ', engine='python')
    X=df[["x1","x2","x3"]]
    y=df["y"]
    poly=PolynomialFeatures(degree=2,include_bias=True)
    poly_X=poly.fit_transform(X)
    X_train,X_test,y_train,y_test=train_test_split(poly_X,y,test_size=0.5,random_state=seed)
    regression=linear_model.LinearRegression(fit_intercept=False)
    fit=regression.fit(X_train,y_train)
    variable_names=poly.get_feature_names_out(X.columns)
    variable_names=np.core.defchararray.replace(variable_names.astype(str),' ','*')
    fit_coeffs=["{:0.5g}".format(x) for x in fit.coef_]
    arr_list=[fit_coeffs,variable_names]
    fit_equation=np.apply_along_axis(join_txt, 0, arr_list)
    fit_equation='+'.join(fit_equation)
    fit_equation=fit_equation.replace("*1+","+")
    fit_equation=fit_equation.replace("+-","-")
    print("Fit equation:")
    print(fit_equation)

def join_txt(text,delim='*'):
    return np.asarray(delim.join(text),dtype=object)

Plotting classification results with confusion matrices on python

I am performing least-squares classification on my data. I was able to obtain my weights, and I decided to plot a decision boundary line. However, I need to use a confusion matrix to show my classification results. I was going to use from sklearn.metrics import confusion_matrix and assign t as my prediction, but I am not sure how to obtain my actual results to work out the matrix. I have never plotted one, so I might be getting all this wrong.
import numpy as np
import matplotlib.pyplot as plt
data=np.loadtxt("MyData_A.txt")
x=data[:,0:2] #the data points
t=data[:,2] #class which data points belong to either 1s or 0s
x0=np.ones((len(x),1)) # create an array of ones as an (n x 1) matrix, where n is the number of points
X=np.append(x, x0, axis=1) # add column x0 to data
# w= ( (((X^T)X)^-1 )X^T )t
XT_X=np.dot(X.T, X) # (X^T)X
inv_XT_X=np.linalg.inv(XT_X) # ((X^T)X)^-1
X_tot=np.dot(inv_XT_X, X.T) # (((X^T)X)^-1)X^T
w=np.dot(X_tot, t) # ( (((X^T)X)^-1 )X^T )t
x1_line = np.array([-1, 2])
x2_line = -w[2] / w[1] - (w[0] / w[1]) * x1_line
color_cond=['r' if t==1 else 'b' for t in t]
plt.scatter(x[:,0],x[:,1],color=color_cond)
plt.plot(x1_line,x2_line,color='k')
plt.xlabel('X1')
plt.ylabel('X2')
plt.ylim(-2,2)
plt.title('Training Data (X1,X2)')
plt.show()
The following is the plot obtained.
from sklearn.metrics import confusion_matrix
import seaborn as sns
def predict(x1_line, x2_line, x):
    d = (x[0] - x1_line[0]) * (x2_line[1] - x2_line[0]) - (x[1] - x2_line[0]) * (x1_line[1] - x1_line[0])
    pred = 0 if d > 0 else 1
    return pred
preds = np.array([predict(x1_line, x2_line, x12) for x12 in x])
conf_mat = confusion_matrix(t, preds)
sns.heatmap(conf_mat, annot=True);
plt.show()
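A simpler alternative (a sketch, not part of the original answer) is to threshold the least-squares output directly instead of testing which side of the plotted line each point falls on, reusing the X, w and t computed in the question:
from sklearn.metrics import confusion_matrix
scores = X.dot(w)                    # model output w[0]*x1 + w[1]*x2 + w[2] for each point
preds = (scores >= 0.5).astype(int)  # 0.5 is assumed here as the midpoint of the 0/1 labels
print(confusion_matrix(t.astype(int), preds))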
LogisticRegression, confusion_matrix and ConfusionMatrixDisplay get the job done:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
data = np.loadtxt("MyData_A.txt")
X = data[:, :-1]
y = data[:, -1].astype(int)
clf = LogisticRegression().fit(X, y)
pred = clf.predict(X)
cm = confusion_matrix(y, pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
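On scikit-learn 1.0 and later, the last three lines can also be collapsed into a single call:
ConfusionMatrixDisplay.from_predictions(y, pred)  # computes the matrix and plots it in one step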

Understanding Sklearn's Linear Regression Weighting

I'm having difficulty getting the weighting array in sklearn's Linear Regression to affect the output.
Here's an example with no weighting.
import numpy as np
import seaborn as sns
from sklearn import linear_model
x = np.arange(0,100.)
y = (x**2.0)
xr = np.array(x).reshape(-1, 1)
yr = np.array(y).reshape(-1, 1)
regr = linear_model.LinearRegression()
regr.fit(xr, yr)
y_pred = regr.predict(xr)
sns.scatterplot(x=x, y = y)
sns.lineplot(x=x, y = y_pred.T[0].tolist())
Now when adding weights, I get the same best fit line back. I expected to see the regression favor the steeper part of the curve. What am I doing wrong?
w = [p**2 for p in x.reshape(-1)]
wregr = linear_model.LinearRegression()
wregr.fit(xr,yr, sample_weight=w)
yw_pred = regr.predict(xr)
wregr = linear_model.LinearRegression(fit_intercept=True)
wregr.fit(xr,yr, sample_weight=w)
yw_pred = regr.predict(xr)
sns.scatterplot(x=x, y = y) #plot curve
sns.lineplot(x=x, y = y_pred.T[0].tolist()) #plot non-weighted best fit line
sns.lineplot(x=x, y = yw_pred.T[0].tolist()) #plot weighted best fit line
This is due to a typo in your code. Prediction with your weighted model should be:
yw_pred = wregr.predict(xr)
rather than
yw_pred = regr.predict(xr)
(as written, you predict with the unweighted regr both times, so the two lines coincide). With this change the weighted best-fit line is pulled toward the steeper, large-x part of the curve, as you expected.

python: setting width to fit parameters

I have been trying to fit a data file with unknown fit parameters "ga" and "MA". What I want to do is set a range within which the fitted value of "MA" must reside, for example [0.5, 0.8], while keeping "ga" as an unconstrained fit parameter. I am not sure how to do it. I am copying the python code here:
#!/usr/bin/env python3
# Fit the model to the data file, each line of which contains the data for one point: x_i, y_i, sigma_i.
import sys
import numpy as np
from pylab import *
from scipy.optimize import curve_fit
from scipy.stats import chi2
fname = sys.argv[1] if len(sys.argv) > 1 else 'data.txt'
x, y, err = np.loadtxt(fname, unpack = True)
n = len(x)
p0 = [-1,1]
f = lambda x, ga, MA: ga/((1+x/(MA*MA))*(1+x/(MA*MA)))
p, covm = curve_fit(f, x, y, p0, err)
ga, MA = p
chisq = sum(((f(x, ga, MA) -y)/err)**2)
ndf = n -len(p)
Q = 1. -chi2.cdf(chisq, ndf)
chisq = chisq / ndf
gaerr, MAerr = sqrt(diag(covm)/chisq) # correct the error bars
print('ga = %10.4f +/- %7.4f' % (ga, gaerr))
print('MA = %10.4f +/- %7.4f' % (MA, MAerr))
print('chi squared / NDF = %7.4f' % chisq)
print(covm)
You might consider using lmfit (https://lmfit.github.io/lmfit-py) for this problem. Lmfit provides a higher-level interface to optimization and curve fitting, including treating Parameters as python objects that have bounds.
Your script might be translated to use lmfit as
import sys
import numpy as np
from lmfit import Model
fname = sys.argv[1] if len(sys.argv) > 1 else 'data.txt'
x, y, err = np.loadtxt(fname, unpack = True)
# define the fitting model function, similar to your `f`:
def f(x, ga, ma):
    return ga/((1+x/(ma*ma))*(1+x/(ma*ma)))
# turn this model function into a Model:
mymodel = Model(f)
# now create parameters for this model, giving initial values
# note that the parameters will be *named* from the arguments of your model function:
params = mymodel.make_params(ga=-1, ma=1)
# params is now an ordered dict with parameter names ('ga', 'ma') as keys.
# you can set min/max values for any parameter:
params['ma'].min = 0.5
params['ma'].max = 2.0
# you can fix the value to not be varied in the fit:
# params['ga'].vary = False
# you can also constrain it to be a simple mathematical expression of other parameters
# now do the fit to your `y` data with `params` and your `x` data
# note that you pass in weights for the residual, so 1/err:
result = mymodel.fit(y, params, x=x, weights=1./err)
# print out fit report with fit statistics and best fit values
# and uncertainties and correlations for variables:
print(result.fit_report())
You can get access to the best-fit parameters as result.params; the initial params will not be changed by the fit. There are also routines to plot the best-fit result and/or residual.
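If you would rather stay with scipy.optimize.curve_fit, it also supports box constraints directly through its bounds argument; a minimal sketch of your original fit with MA restricted to [0.5, 0.8] and ga left unbounded (reusing f, x, y and err from your script; note the initial MA guess must lie inside the bounds):
p0 = [-1, 0.65]  # initial guess: ga = -1, MA inside [0.5, 0.8]
p, covm = curve_fit(f, x, y, p0, sigma=err,
                    bounds=([-np.inf, 0.5], [np.inf, 0.8]))
ga, MA = p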
