I have a DataFrame (df) with two columns and three rows.
Column X = [137,270,344]
Column Y = [51, 121, 136]
I want to get the slope of the linear regression considering the intercept = 0.
I have tried to add a point (0,0) but it doesn´t work.
EX.
Column X = [0, 137,270,344]
Column Y = [0, 51, 121, 136]
The code that I am using.
Code:
X= df [“Column X”].astype(float)
Y = df [“Column Y”].astype(float)
slope, intercept, r_value, p_value, std_err = stats.linregress(X, Y)
intercept_desv = slope
coef_desv = intercept
I expected intercept = 0 but is less than 0.
In standard linear regression, all data points implicitly have a weight of 1.0. In any software that allows linear regression using weights, the regression can effectively be made to pass through any single point - such as the origin - by assigning that data point an extremely large weight. Numpy's polyfit() allows weights. Here is a graphing example with your data using this technique to make the fitted line pass through the 0,0 point.
import numpy, matplotlib
import matplotlib.pyplot as plt
xData = numpy.array( [0.0, 137.0, 270.0, 344.0])
yData = numpy.array([0.0, 51.0, 121.0, 136.0])
weights = numpy.array([1.0E10, 1.0, 1.0, 1.0]) # heavily weight the 0,0 point
#weights = None # use this for "no weights"
polynomialOrder = 1 # example straight line
# curve fit the test data
fittedParameters = numpy.polyfit(xData, yData, polynomialOrder, w=weights)
print('Fitted Parameters:', fittedParameters)
modelPredictions = numpy.polyval(fittedParameters, xData)
absError = modelPredictions - yData
SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print()
print('Predicted value at x=0:', modelPredictions[0])
print()
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
axes = f.add_subplot(111)
# first the raw data as a scatter plot
axes.plot(xData, yData, 'D')
# create data for the fitted equation plot
xModel = numpy.linspace(min(xData), max(xData))
yModel = numpy.polyval(fittedParameters, xModel)
# now the model as a line plot
axes.plot(xModel, yModel)
axes.set_xlabel('X Data') # X axis data label
axes.set_ylabel('Y Data') # Y axis data label
plt.show()
plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
Related
I currently have a scatter plot of data points, and I want to draw a line that captures the general pattern of the data. I believe that this is also known as an ordinary least squares regression method, but I may be wrong as I'm not completely familiar with the literature.
For example, if I had a plot like the following:
I just want a line that goes through the data points, that captures the most general trend.
I've tried methods like using Scikit-Learn's LinearRegression module, but I'll have to split my data into train and test sets and perform regression. Is there a way that I can just capture the general trend without having to do this?
Thank you.
Here is an example polynomial fitter that does this, if you convert your date format to a numeric type such as "elapsed days" you can directly substitute your data into the example. Here I use a curved second-order polynomial (quadratic) equation, set at the top of the code, because to my eye the trend of your data appears to have some curvature rather than a straight line.
import numpy, matplotlib
import matplotlib.pyplot as plt
xData = numpy.array([1.1, 2.2, 3.3, 4.4, 5.0, 6.6, 7.7, 0.0])
yData = numpy.array([1.1, 20.2, 30.3, 40.4, 50.0, 60.6, 70.7, 0.1])
polynomialOrder = 2 # example quadratic
# curve fit the test data
fittedParameters = numpy.polyfit(xData, yData, polynomialOrder)
print('Fitted Parameters:', fittedParameters)
modelPredictions = numpy.polyval(fittedParameters, xData)
absError = modelPredictions - yData
SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print()
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
axes = f.add_subplot(111)
# first the raw data as a scatter plot
axes.plot(xData, yData, 'D')
# create data for the fitted equation plot
xModel = numpy.linspace(min(xData), max(xData))
yModel = numpy.polyval(fittedParameters, xModel)
# now the model as a line plot
axes.plot(xModel, yModel)
axes.set_xlabel('X Data') # X axis data label
axes.set_ylabel('Y Data') # Y axis data label
plt.show()
plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
For calibration purposes I am making N measurements of water flow, each of which is time-intensive. I want to reduce the number of measurements. It sounds like this is part of feature selection as I am reducing the number of columns I have. BUT - I need to predict the measurements I will be dropping.
Here is a sample of the data:
SerialNumber val speed
0 193604048 1.350254 105.0
1 193604048 1.507517 3125.0
2 193604048 1.455142 525.0
6 193604048 1.211184 12.8
7 193604048 1.238835 20.0
For each serial number I have a complete set of speed-val measurements. Ideally I would like a model whose output is the vector of all N val measurements, but it seems the options are all neural networks, which I am trying to avoid for now. Are there are any other options?
If I feed this data into a regression model, how do I differentiate between each serialNumber dataset?
To make sure my goal is clear - I want to learn the historical measurements I have of N measurements and find which speed-val I can drop to still accurately predict all N output values.
Thank you!
I tried to find the simplest equation that would give a good fit to the example data you posted, and from my equation search the Harris Yield Density equation, "y = 1.0 / (a + b * pow(x, c))", is an good candidate. Here is a graphical Python fitter using that equation and your data, with initial parameter estimates for the non-linear fitter calculated directly from the data max and min values. Note that SerialNumber itself is unrelated to the data and would not be used in regressions.
My hope is that you might find this equation generally useful in your work, and it might be possible that after performing similar regressions on several different data sets that parameters a, b, and c are very similar in all cases - that is the best outcome. If your measurement accuracy is high, I personally would expect that with this three-parameter equation it should be possible to use a minimum of four data points per calibration, with max, min and two other well-spaced points along the expected calibration curve.
Note that here the fitted parameters a = -1.91719091e-03. b = 1.11357103e+00, and c = -1.51294798e+01 yield RMSE = 3.191 and R-squared = 0.9999
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
xData = numpy.array([1.350254, 1.507517, 1.455142, 1.211184, 1.238835])
yData = numpy.array([105.0, 3125.0, 525.0, 12.8, 20.0])
def func(x, a, b, c): # Harris yield density equation
return 1.0 / (a + b*numpy.power(x, c))
initialParameters = numpy.array([0.0, min(xData), -10.0 * max(xData)])
# curve fit the test data
fittedParameters, pcov = curve_fit(func, xData, yData, initialParameters)
modelPredictions = func(xData, *fittedParameters)
absError = modelPredictions - yData
SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
print('Parameters:', fittedParameters)
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print()
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
axes = f.add_subplot(111)
# first the raw data as a scatter plot
axes.plot(xData, yData, 'D')
# create data for the fitted equation plot
xModel = numpy.linspace(min(xData), max(xData))
yModel = func(xModel, *fittedParameters)
# now the model as a line plot
axes.plot(xModel, yModel)
axes.set_title('Harris Yield Density Equation') # title
axes.set_xlabel('Val') # X axis data label
axes.set_ylabel('Speed') # Y axis data label
plt.show()
plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
UPDATE using reversed X and Y
Per the comments, here is a three-parameter equation Mixed Power and Eponential "a * pow(x, b) * exp(c * x)" graphical fitter with X and Y reversed from the previous code. Here the fitted parameters a = 1.05910664e+00, b = 5.26304345e-02, and -2.25604946e-05 yield RMSE = 0.0003602 and R-squared= 0.9999
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
xData = numpy.array([105.0, 3125.0, 525.0, 12.8, 20.0])
yData = numpy.array([1.350254, 1.507517, 1.455142, 1.211184, 1.238835])
def func(x, a, b, c): # mixed power and exponential equation
return a * numpy.power(x, b) * numpy.exp(c * x)
initialParameters = [1.0, 0.01, -0.01]
# curve fit the test data
fittedParameters, pcov = curve_fit(func, xData, yData, initialParameters)
modelPredictions = func(xData, *fittedParameters)
absError = modelPredictions - yData
SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
print('Parameters:', fittedParameters)
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print()
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
axes = f.add_subplot(111)
# first the raw data as a scatter plot
axes.plot(xData, yData, 'D')
# create data for the fitted equation plot
xModel = numpy.linspace(min(xData), max(xData))
yModel = func(xModel, *fittedParameters)
# now the model as a line plot
axes.plot(xModel, yModel)
axes.set_title('Mixed Power and Exponential Equation') # title
axes.set_xlabel('Speed') # X axis data label
axes.set_ylabel('Val') # Y axis data label
plt.show()
plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
I would like to see the regression equation for a polynomial regression in python.
I am new to python, in R the analogous command I am looking for is "summary." I have tried the print function in python.
x = (LIST)
y = (LIST)
x = x[:, np.newaxis]
y = y[:, np.newaxis]
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x)
poly.fit(x_poly,y)
lin = LinearRegression()
lin.fit(x_poly,y)
y_poly_pred = lin.predict(x_poly)
print(lin)
print(poly)
print(lin.predict)
print(poly.fit_transform)
I would like the output to give me the ax^2 + bx + c equation, or at least the info to figure out that equation. Instead, I get (below) for my 4 print statements.
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
PolynomialFeatures(degree=2, include_bias=True, interaction_only=False,
order='C')
<bound method LinearModel.predict of LinearRegression(copy_X=True,
fit_intercept=True, n_jobs=None, normalize=False)>
<bound method TransformerMixin.fit_transform of
PolynomialFeatures(degree=2, include_bias=True, interaction_only=False,
order='C')>
Here is an example graphical polynomial fitter using numpy.polyfit for fitting and numpy.polyval for evaluation. This example has eight data points, and making polynomialOrder = 7 shows Runge's phenomenon rather nicely.
import numpy, matplotlib
import matplotlib.pyplot as plt
xData = numpy.array([1.1, 2.2, 3.3, 4.4, 5.0, 6.6, 7.7, 0.0])
yData = numpy.array([1.1, 20.2, 30.3, 40.4, 50.0, 60.6, 70.7, 0.1])
polynomialOrder = 2 # example quadratic
# curve fit the test data
fittedParameters = numpy.polyfit(xData, yData, polynomialOrder)
print('Fitted Parameters:', fittedParameters)
modelPredictions = numpy.polyval(fittedParameters, xData)
absError = modelPredictions - yData
SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print()
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
axes = f.add_subplot(111)
# first the raw data as a scatter plot
axes.plot(xData, yData, 'D')
# create data for the fitted equation plot
xModel = numpy.linspace(min(xData), max(xData))
yModel = numpy.polyval(fittedParameters, xModel)
# now the model as a line plot
axes.plot(xModel, yModel)
axes.set_xlabel('X Data') # X axis data label
axes.set_ylabel('Y Data') # Y axis data label
plt.show()
plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
I have done my linear regression and the best fit line, but would like to have also a line connecting the real points (the ones in blue) to the predicted points (the ones i red x) representing the predictions error, or the so called residuals. The plot should look in a similar way:
And what I have until now is:
# draw the plot
xx=X[:,np.newaxis]
yy=y[:,np.newaxis]
slr=LinearRegression()
slr.fit(xx,yy)
y_pred=slr.predict(xx)
plt.scatter(xx,yy)
plt.plot(xx,y_pred,'r')
plt.plot(X,y_pred,'rx') #add the prediction points
plt.show()
Thank you very much in advance!
Here is example code with the vertical lines
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
xData = numpy.array([1.1, 2.2, 3.3, 4.4, 5.0, 6.6, 7.7])
yData = numpy.array([1.1, 20.2, 30.3, 60.4, 50.0, 60.6, 70.7])
def func(x, a, b): # simple linear example
return a * x + b
initialParameters = numpy.array([1.0, 1.0])
# curve fit the test data
fittedParameters, pcov = curve_fit(func, xData, yData, initialParameters)
modelPredictions = func(xData, *fittedParameters)
absError = modelPredictions - yData
SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print()
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
axes = f.add_subplot(111)
# first the raw data as a scatter plot
axes.plot(xData, yData, 'D')
# create data for the fitted equation plot
xModel = numpy.linspace(min(xData), max(xData))
yModel = func(xModel, *fittedParameters)
# now the model as a line plot
axes.plot(xModel, yModel)
# now add individual line for each point
for i in range(len(xData)):
lineXdata = (xData[i], xData[i]) # same X
lineYdata = (yData[i], modelPredictions[i]) # different Y
plt.plot(lineXdata, lineYdata)
axes.set_xlabel('X Data') # X axis data label
axes.set_ylabel('Y Data') # Y axis data label
plt.show()
plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
I have four features and a dependent(X). I want to plot a graph with the predicted regression line and the feature values. I went through the documentation but I can't figure out how to represent everything in a scatter plot.
Here is some example code to get you started, it fits a simple quadratic and scatterplots the raw data and fitted curve along with calculation of RMSE and R-squared. The example uses a non-linear fit in case you would like to try fitting non-linear equations.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import scipy.stats
xData = numpy.array([1.1, 2.2, 3.3, 4.4, 5.0, 6.6, 7.7])
yData = numpy.array([1.1, 20.2, 30.3, 40.4, 50.0, 60.6, 70.7])
def func(x, a, b, c): # simple quadratic example
return (a * numpy.square(x)) + b * x + c
initialParameters = numpy.array([1.0, 1.0, 1.0])
# curve fit the test data
fittedParameters, pcov = curve_fit(func, xData, yData, initialParameters)
modelPredictions = func(xData, *fittedParameters)
absError = modelPredictions - yData
SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print()
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
axes = f.add_subplot(111)
# first the raw data as a scatter plot
axes.plot(xData, yData, 'D')
# create data for the fitted equation plot
xModel = numpy.linspace(min(xData), max(xData))
yModel = func(xModel, *fittedParameters)
# now the model as a line plot
axes.plot(xModel, yModel)
axes.set_xlabel('X Data') # X axis data label
axes.set_ylabel('Y Data') # Y axis data label
plt.show()
plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)