Sklearn polynomial regression flat with datetime x values - python

I am getting a flat regression even with a 10th-degree regressor. But if I change the date values to numeric, the regression works! Does anybody know why?
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

## RESHAPE DATA ##
X = transformed_data.ds.values.reshape(-1, 1)
y = transformed_data.y
# X = data.fecha.dt.day.values.reshape(-1, 1)

## PLOT ##
fig, ax = plt.subplots(figsize=(15, 8))
ax.plot(X, y, 'o', label="data")
for i in range(1, 10):
    polyreg = make_pipeline(PolynomialFeatures(i), LinearRegression())
    polyreg.fit(X, y)
    mse = round(np.mean((y - polyreg.predict(X))**2))
    mae = round(np.mean(abs(y - polyreg.predict(X))))
    ax.plot(X, polyreg.predict(X),
            label='Degree: ' + str(i) + ' MSE: ' + f'{mse:,}' + ' MAE: ' + f'{mae:,}')
ax.legend()
Datetime Data
ds y
0 2019-01-10 3658.0
1 2019-01-11 2952.0
2 2019-01-12 2855.0
3 2019-01-13 3904.0
Flat regressions
Numeric Data
ds y
0 10 3658.0
1 11 2952.0
2 12 2855.0
3 13 3904.0
Curved regressions

Linear regression works by assigning a calculated coefficient to each numerical input; the input values are multiplied by those coefficients to produce the output used for predictions.
BUT, in your case, one of the variables is a date and, as explained above, the regression model doesn't know what to do with it. As you noticed, you need to convert it to numerical data.
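A minimal sketch of one common conversion (assuming transformed_data.ds holds datetime values, as in the question; the ds_num column name is just for illustration): map each date to its ordinal day number and fit on that instead.
import pandas as pd

# assumption: transformed_data.ds is a datetime column, as in the question
transformed_data['ds_num'] = pd.to_datetime(transformed_data['ds']).map(pd.Timestamp.toordinal)
X = transformed_data.ds_num.values.reshape(-1, 1)  # numeric day numbers instead of datetimes
y = transformed_data.y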

Related

How to interpret the coefficients returned from a multivariate cubic regression (polynomial degree 3) when using LinearRegression().coef_?

I am trying to fit a hyperplane to a dataset with 2 features and 1 target variable. I transformed the features with PolynomialFeatures(degree=3).fit_transform() and then fitted the transformed features and the target variable with a LinearRegression() model. When I use LinearRegression().coef_ to get the coefficients so that I can write out the fitted function explicitly (I want the written-out function itself), 10 coefficients are returned and I don't know how to interpret them. I know that for a PolynomialFeatures(degree=2) model, 6 coefficients are returned and the function looks like m[0] + x1*m[1] + x2*m[2] + (x1**2)*m[3] + (x2**2)*m[4] + x1*x2*m[5], where m is the list of coefficients returned in that order. How would I interpret the cubic one?
Here is what my code for the cubic model looks like:
# (imports implied by the aliases used below)
from sklearn.preprocessing import PolynomialFeatures as polyF
from sklearn.linear_model import LinearRegression as linR

poly = polyF(degree=3)
x_poly = poly.fit_transform(x)
model = linR()
model.fit(x_poly, y)
model.coef_
(returns):
array([ 0.00000000e+00, -1.50603348e+01, 2.33283686e+00, 6.73172519e-01,
-1.93686431e-01, -7.30930307e-02, -9.31687047e-03, 3.48729458e-03,
1.63718406e-04, 2.26682333e-03])
So if (X1, X2) at degree 2 transforms to
(1, X1, X2, X1^2, X1*X2, X2^2),
then (X1, X2) at degree 3 transforms to
(1,
X1, X2,
X1^2, X1*X2, X2^2,
X1^3, X1^2*X2, X1*X2^2, X2^3)
which is again 10 terms, matching the 10 coefficients returned (in that order).
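If you would rather not work out the ordering by hand, PolynomialFeatures can report it directly; a quick check (assumes scikit-learn 1.0+, where the method is named get_feature_names_out):
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=3)
poly.fit([[0, 0]])  # two dummy features, only to set the input dimension
print(poly.get_feature_names_out(['x1', 'x2']))
# ['1' 'x1' 'x2' 'x1^2' 'x1 x2' 'x2^2' 'x1^3' 'x1^2 x2' 'x1 x2^2' 'x2^3']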
I was facing the same question and developed the following code block to print the fit equation. To do so, it was necessary to set include_bias=True in PolynomialFeatures and fit_intercept=False in LinearRegression, as opposed to conventional use:
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

def join_txt(text, delim='*'):
    return np.asarray(delim.join(text), dtype=object)

def polyReg():
    seed = 12341
    df = pd.read_csv("input.txt", delimiter=', ', engine='python')
    X = df[["x1", "x2", "x3"]]
    y = df["y"]
    poly = PolynomialFeatures(degree=2, include_bias=True)
    poly_X = poly.fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(poly_X, y, test_size=0.5, random_state=seed)
    regression = linear_model.LinearRegression(fit_intercept=False)
    fit = regression.fit(X_train, y_train)
    variable_names = poly.get_feature_names_out(X.columns)
    variable_names = np.char.replace(variable_names.astype(str), ' ', '*')
    fit_coeffs = ["{:0.5g}".format(x) for x in fit.coef_]
    arr_list = [fit_coeffs, variable_names]
    fit_equation = np.apply_along_axis(join_txt, 0, arr_list)
    fit_equation = '+'.join(fit_equation)
    fit_equation = fit_equation.replace("*1+", "+")
    fit_equation = fit_equation.replace("+-", "-")
    print("Fit equation:")
    print(fit_equation)

Plotting classification results with confusion matrices on python

I am performing least squares classification on my data. I was able to obtain my weights, and I decided to plot a decision boundary line. However, I need to use a confusion matrix to show my classification results. I was going to use from sklearn.metrics import confusion_matrix and assign t as my prediction, but I am not sure how to obtain my actual results to work out the matrix. I have never plotted one, so I might be getting all this wrong.
import numpy as np
import matplotlib.pyplot as plt
data=np.loadtxt("MyData_A.txt")
x=data[:,0:2] #the data points
t=data[:,2] #class which data points belong to either 1s or 0s
x0=np.ones((len(x),1)) # create an array of ones as an (n x 1) matrix, where n is the number of points
X=np.append(x, x0, axis=1) # add column x0 to data
# w= ( (((X^T)X)^-1 )X^T )t
XT_X=np.dot(X.T, X) # (X^T)X
inv_XT_X=np.linalg.inv(XT_X) # (X^T)X)^-1
X_tot=np.dot(inv_XT_X, X.T) # ((X^T)X)^-1 )X^T
w=np.dot(X_tot, t) # ( (((X^T)X)^-1 )X^T )t
x1_line = np.array([-1, 2])
x2_line = -w[2] / w[1] - (w[0] / w[1]) * x1_line
color_cond=['r' if t==1 else 'b' for t in t]
plt.scatter(x[:,0],x[:,1],color=color_cond)
plt.plot(x1_line,x2_line,color='k')
plt.xlabel('X1')
plt.ylabel('X2')
plt.ylim(-2,2)
plt.title('Training Data (X1,X2)')
plt.show()
The following is the plot obtained.
from sklearn.metrics import confusion_matrix
import seaborn as sns

def predict(x1_line, x2_line, x):
    # sign of the cross product tells which side of the decision line the point falls on
    d = (x[0] - x1_line[0]) * (x2_line[1] - x2_line[0]) - (x[1] - x2_line[0]) * (x1_line[1] - x1_line[0])
    pred = 0 if d > 0 else 1
    return pred

preds = np.array([predict(x1_line, x2_line, x12) for x12 in x])
conf_mat = confusion_matrix(t, preds)
sns.heatmap(conf_mat, annot=True)
plt.show()
LogisticRegression, confusion_matrix and ConfusionMatrixDisplay get the job done:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

data = np.loadtxt("MyData_A.txt")
X = data[:, :-1]
y = data[:, -1].astype(int)
clf = LogisticRegression().fit(X, y)
pred = clf.predict(X)
cm = confusion_matrix(y, pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
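If you are on scikit-learn 1.0 or newer, the same plot can be produced in one call without building the matrix yourself; a small sketch reusing y and pred from above:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_predictions(y, pred)
plt.show()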

How to predict y=1/x values in Python? [closed]

I have a data frame named df:
import pandas as pd
df = pd.DataFrame({'p': [15-x for x in range(14)]
, 'x': [x for x in range(14)]})
df['y'] = 1000 * (10 / df['p'])
x is only for plotting purposes.
I'm trying to predict the y value based on the p values. I am using SVR from sklearn:
from sklearn.svm import SVR
nlm = SVR(kernel='poly').fit(df[['p']], df['y'])
df['nml'] = nlm.predict(df[['p']])
I have already tried all of the kernels, but it still doesn't fit well enough.
p x y nml
0 15 0 666.666667 524.669572
1 14 1 714.285714 713.042459
2 13 2 769.230769 876.338765
3 12 3 833.333333 1016.349674
Do you know which sklearn model or other library I should use to get a better fit?
You missed a fundamental step: normalizing the data.
Fix
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm import SVR

df = pd.DataFrame({'p': [15-x for x in range(14)],
                   'x': [x for x in range(14)]})
df['y'] = 1000 * (10 / df['p'])

# Normalize the data: (x - mean(x))/std(x)
s_p = np.std(df['p'])
m_p = np.mean(df['p'])
s_y = np.std(df['y'])
m_y = np.mean(df['y'])
df['p_'] = (df['p'] - m_p)/s_p
df['y_'] = (df['y'] - m_y)/s_y

# Fit and make prediction
nlm = SVR(kernel='rbf').fit(df[['p_']], df['y_'])
df['nml'] = nlm.predict(df[['p_']])

# Plot the normalized data and prediction
plt.plot(df['p_'], df['y_'], 'r')
plt.plot(df['p_'], df['nml'], 'g')
plt.show()

# Rescale back and plot
plt.plot(df['p_']*s_p + m_p, df['y_']*s_y + m_y, 'r')
plt.plot(df['p_']*s_p + m_p, df['nml']*s_y + m_y, 'g')
plt.show()
As @mujjiga pointed out, scaling is an important part of the process.
I would like to draw your attention to two other key points:
model selection, which determines your ability to solve this class of problem;
the newer scikit-learn API, which helps you standardize solution development.
Let's start with your dataset:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
x = np.arange(14)
df = pd.DataFrame({'x': x, 'p': 15-x})
df['y'] = 1e4/df['p']
Then we import some sklearn API objects of interest:
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler, FunctionTransformer
First we create a scaler for the target values:
ysc = StandardScaler()
Notice that we can use different scalers, or build a custom transformation.
# Scaler robust against outliers:
ysc = RobustScaler()
# Logarithmic Transformation:
ysc = FunctionTransformer(func=np.log, inverse_func=np.exp, check_inverse=True)
We scale the target using the scaler of our choice:
ysc.fit(df[['y']])
df['yn'] = ysc.transform(df[['y']])
We also build a pipeline with a feature standardizer and the selected model (parameters adjusted to improve the fit), and fit it to your dataset:
reg = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=1e3, epsilon=1e-3))
reg.fit(df[['p']], df['yn'])
At this point we can predict values and transform them back to the original scale:
df['ynhat'] = reg.predict(df[['p']])
df['yhat'] = ysc.inverse_transform(df[['ynhat']])
We check the fit score:
reg.score(df[['p']], df['yn']) # 0.9999646718755011
We can also compute absolute and relative error for each point:
df['yaerr'] = df['yhat'] - df['y']
df['yrerr'] = df['yaerr']/df['y']
Final result is:
x p y yn ynhat yhat yaerr yrerr
0 0 15 666.666667 -0.834823 -0.833633 668.077018 1.410352 0.002116
1 1 14 714.285714 -0.794636 -0.795247 713.562403 -0.723312 -0.001013
2 2 13 769.230769 -0.748267 -0.749627 767.619013 -1.611756 -0.002095
3 3 12 833.333333 -0.694169 -0.693498 834.128425 0.795091 0.000954
4 4 11 909.090909 -0.630235 -0.629048 910.497550 1.406641 0.001547
5 5 10 1000.000000 -0.553514 -0.555029 998.204445 -1.795555 -0.001796
6 6 9 1111.111111 -0.459744 -0.460002 1110.805275 -0.305836 -0.000275
7 7 8 1250.000000 -0.342532 -0.341099 1251.697707 1.697707 0.001358
8 8 7 1428.571429 -0.191830 -0.193295 1426.835676 -1.735753 -0.001215
9 9 6 1666.666667 0.009105 0.010458 1668.269984 1.603317 0.000962
10 10 5 2000.000000 0.290414 0.291060 2000.764717 0.764717 0.000382
11 11 4 2500.000000 0.712379 0.690511 2474.088446 -25.911554 -0.010365
12 12 3 3333.333333 1.415652 1.416874 3334.780642 1.447309 0.000434
13 13 2 5000.000000 2.822199 2.821420 4999.076799 -0.923201 -0.000185
Graphically it leads to:
fig, axe = plt.subplots()
axe.plot(df['p'], df['y'], label='$y(p)$')
axe.plot(df['p'], df['yhat'], 'o', label=r'$\hat{y}(p)$')
axe.set_title(r"SVR Fit for $y(x) = \frac{k}{x-a}$")
axe.set_xlabel('$p = x-a$')
axe.set_ylabel(r'$y, \hat{y}$')
axe.legend()
axe.grid()
Linearization
In the example above we could not use the poly kernel; we had to use the rbf kernel instead. This is because, if we aim to fit a rational function with polynomials, we are better off transforming the data before fitting, using a p = x/(x-b) substitution in the first place. In that case the problem merely boils down to a linear regression. The example below shows that it works:
Scaler and transformation can be composed into a pipeline as well. We define a pipeline that linearizes and scales the target:
# Rational Fraction Substitution with consecutive Standardization
ysc = make_pipeline(
FunctionTransformer(func=lambda x: x/(x+1),
inverse_func=lambda x: x/(1-x),
check_inverse=True),
StandardScaler()
)
Then we refit the target with this new scaler and regress the data using classical OLS:
df['yn'] = ysc.fit_transform(df[['y']])
reg = make_pipeline(StandardScaler(), LinearRegression())
reg.fit(df[['p']], df['yn'])
Which provides the correct result:
reg.score(df[['p']], df['yn']) # 0.9999998722172933
This second solution takes advantage of a known linearization and thus removes the need to parametrize the model.
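For this particular dataset there is an even simpler linearization: y = 10000/p exactly, so y is linear in 1/p and a plain LinearRegression on the transformed feature recovers the relationship; a minimal sketch reusing the df built above:
from sklearn.linear_model import LinearRegression

X_inv = 1.0 / df[['p']]              # y is linear in 1/p
lin = LinearRegression().fit(X_inv, df['y'])
print(lin.coef_, lin.intercept_)     # close to [10000.] and 0.0
print(lin.score(X_inv, df['y']))     # close to 1.0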

Linear Regression Returns Different Results Than Synthetic Parameters

Trying this code:
from sklearn import linear_model
import numpy as np
x1 = np.arange(0, 10, 0.1)
x2 = x1*10
y = 2*x1 + 3*x2
X = np.vstack((x1, x2)).transpose()
reg_model = linear_model.LinearRegression()
reg_model.fit(X, y)
print(reg_model.coef_)
# should be [2,3]
print(reg_model.predict([[5, 6]]))
# should be 2*5 + 3*6 = 28
print(reg_model.intercept_)
# perfectly at the expected value of 0
print(reg_model.score(X, y))
# seems to be rather confident to be right
The results are
[ 0.31683168 3.16831683]
20.5940594059
0.0
1.0
and therefore not what I expected - they are not the same as the parameters used to synthesize the data. Why is this so?
Your problem is with the uniqueness of the solution: because x2 is just a scaled copy of x1, the two columns are perfectly collinear, so there are infinitely many coefficient combinations that fit your data equally well. (Indeed, the fit you got is one of them: 0.3168*x1 + 3.1683*x2 = 0.3168*x1 + 31.683*x1 = 32*x1, which is exactly the same function as 2*x1 + 3*(10*x1).) If you apply a non-linear transformation to the second dimension, you will see the desired output.
from sklearn import linear_model
import numpy as np
x1 = np.arange(0, 10, 0.1)
x2 = x1**2
X = np.vstack((x1, x2)).transpose()
y = 2*x1 + 3*x2
reg_model = linear_model.LinearRegression()
reg_model.fit(X, y)
print(reg_model.coef_)
# should be [2,3]
print(reg_model.predict([[5, 6]]))
# should be 2*5 + 3*6 = 28
print(reg_model.intercept_)
# perfectly at the expected value of 0
print(reg_model.score(X, y))
Outputs are
[ 2. 3.]
[ 28.]
-2.84217094304e-14
1.0
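One quick way to see the collinearity in the original setup is to check the rank of the feature matrix; a small sketch:
import numpy as np

x1 = np.arange(0, 10, 0.1)
X_collinear = np.vstack((x1, x1 * 10)).transpose()    # the original, collinear features
X_independent = np.vstack((x1, x1 ** 2)).transpose()  # the transformed, independent features
print(np.linalg.matrix_rank(X_collinear))    # 1 -> columns are linearly dependent
print(np.linalg.matrix_rank(X_independent))  # 2 -> columns are independent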

Linear regression with matplotlib / numpy

I'm trying to generate a linear regression on a scatter plot I have generated; however, my data is in list format, and all of the examples I can find of using polyfit use arange. arange doesn't accept lists, though. I have searched high and low for how to convert a list to an array and nothing seems clear. Am I missing something?
Following on, how can I best use my list of integers as input to polyfit?
Here is the polyfit example I am following:
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(data)
y = np.arange(data)
m, b = np.polyfit(x, y, 1)
plt.plot(x, y, 'yo', x, m*x+b, '--k')
plt.show()
arange generates lists (well, numpy arrays); type help(np.arange) for the details. You don't need to call it on existing lists.
>>> x = [1,2,3,4]
>>> y = [3,5,7,9]
>>>
>>> m,b = np.polyfit(x, y, 1)
>>> m
2.0000000000000009
>>> b
0.99999999999999833
I should add that I tend to use poly1d here rather than write out "m*x+b" and the higher-order equivalents, so my version of your code would look something like this:
import numpy as np
import matplotlib.pyplot as plt
x = [1,2,3,4]
y = [3,5,7,10] # 10, not 9, so the fit isn't perfect
coef = np.polyfit(x,y,1)
poly1d_fn = np.poly1d(coef)
# poly1d_fn is now a function which takes in x and returns an estimate for y
plt.plot(x,y, 'yo', x, poly1d_fn(x), '--k') #'--k'=black dashed line, 'yo' = yellow circle marker
plt.xlim(0, 5)
plt.ylim(0, 12)
This code:
from scipy.stats import linregress
linregress(x,y) #x and y are arrays or lists.
returns a result object with the following fields:
slope : float
slope of the regression line
intercept : float
intercept of the regression line
r-value : float
correlation coefficient
p-value : float
two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero
stderr : float
Standard error of the estimate
Source: the scipy.stats.linregress documentation.
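A minimal sketch of how those values are typically unpacked and used to draw the fitted line (the x and y lists here are the same example data used above):
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress

x = [1, 2, 3, 4]
y = [3, 5, 7, 10]
slope, intercept, r_value, p_value, stderr = linregress(x, y)
x_arr = np.asarray(x)
plt.plot(x, y, 'yo', x_arr, slope * x_arr + intercept, '--k')
plt.show()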
Use statsmodels.api.OLS to get a detailed breakdown of the fit/coefficients/residuals:
import statsmodels.api as sm
df = sm.datasets.get_rdataset('Duncan', 'carData').data
y = df['income']
x = df['education']
model = sm.OLS(y, sm.add_constant(x))
results = model.fit()
print(results.params)
# const 10.603498 <- intercept
# education 0.594859 <- slope
# dtype: float64
print(results.summary())
# OLS Regression Results
# ==============================================================================
# Dep. Variable: income R-squared: 0.525
# Model: OLS Adj. R-squared: 0.514
# Method: Least Squares F-statistic: 47.51
# Date: Thu, 28 Apr 2022 Prob (F-statistic): 1.84e-08
# Time: 00:02:43 Log-Likelihood: -190.42
# No. Observations: 45 AIC: 384.8
# Df Residuals: 43 BIC: 388.5
# Df Model: 1
# Covariance Type: nonrobust
# ==============================================================================
# coef std err t P>|t| [0.025 0.975]
# ------------------------------------------------------------------------------
# const 10.6035 5.198 2.040 0.048 0.120 21.087
# education 0.5949 0.086 6.893 0.000 0.421 0.769
# ==============================================================================
# Omnibus: 9.841 Durbin-Watson: 1.736
# Prob(Omnibus): 0.007 Jarque-Bera (JB): 10.609
# Skew: 0.776 Prob(JB): 0.00497
# Kurtosis: 4.802 Cond. No. 123.
# ==============================================================================
New in matplotlib 3.3
To plot the best-fit line, just pass the slope m and intercept b into the new plt.axline:
import matplotlib.pyplot as plt
# extract intercept b and slope m
b, m = results.params
# plot y = m*x + b
plt.axline(xy1=(0, b), slope=m, label=f'$y = {m:.1f}x {b:+.1f}$')
Note that the slope m and intercept b can be easily extracted from any of the common regression methods:
numpy.polyfit
import numpy as np
m, b = np.polyfit(x, y, deg=1)
plt.axline(xy1=(0, b), slope=m, label=f'$y = {m:.1f}x {b:+.1f}$')
scipy.stats.linregress
from scipy import stats
m, b, *_ = stats.linregress(x, y)
plt.axline(xy1=(0, b), slope=m, label=f'$y = {m:.1f}x {b:+.1f}$')
statsmodels.api.OLS
import statsmodels.api as sm
b, m = sm.OLS(y, sm.add_constant(x)).fit().params
plt.axline(xy1=(0, b), slope=m, label=f'$y = {m:.1f}x {b:+.1f}$')
sklearn.linear_model.LinearRegression
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(x[:, None], y)
b = reg.intercept_
m = reg.coef_[0]
plt.axline(xy1=(0, b), slope=m, label=f'$y = {m:.1f}x {b:+.1f}$')
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
x = np.array([1.5,2,2.5,3,3.5,4,4.5,5,5.5,6])
y = np.array([10.35,12.3,13,14.0,16,17,18.2,20,20.7,22.5])
gradient, intercept, r_value, p_value, std_err = stats.linregress(x,y)
mn=np.min(x)
mx=np.max(x)
x1=np.linspace(mn,mx,500)
y1=gradient*x1+intercept
plt.plot(x,y,'ob')
plt.plot(x1,y1,'-r')
plt.show()
Use this.
George's answer goes together quite nicely with matplotlib's axline, which plots an infinite line.
from scipy.stats import linregress
import matplotlib.pyplot as plt
reg = linregress(x, y)
plt.axline(xy1=(0, reg.intercept), slope=reg.slope, linestyle="--", color="k")
from pylab import *
import numpy as np

x1 = [1, 2, 3, 4]   # for example, this is a list
y1 = [3, 5, 7, 10]  # for example, this is a list
x = np.array(x1)    # this will convert a list into an array
y = np.array(y1)
m, b = polyfit(x, y, 1)
plot(x, y, 'yo', x, m*x+b, '--k')
show()
Another quick and dirty answer is that you can just convert your list to an array using:
import numpy as np
arr = np.asarray(listname)
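Putting that together with polyfit, a minimal sketch (the list contents are placeholders):
import numpy as np
import matplotlib.pyplot as plt

listname = [3, 5, 7, 10]        # placeholder y data in list form
xs = np.asarray([1, 2, 3, 4])   # placeholder x values
arr = np.asarray(listname)      # list converted to an array
m, b = np.polyfit(xs, arr, 1)
plt.plot(xs, arr, 'yo', xs, m*xs + b, '--k')
plt.show()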
Linear regression is a good starting example for artificial intelligence.
Here is an example of a multiple linear regression machine learning algorithm in Python:
##### Predicting House Prices Using Multiple Linear Regression - #Y_T_Akademi
#### In this project we will see how a machine learning algorithm helps us predict house prices. Linear regression predicts new data from the correlations present in existing data; machine learning helps us identify the relationship between the feature data and the output, so we can predict future values.
import pandas as pd
##### we use the sklearn library for many machine learning calculations
from sklearn import linear_model
##### we import our dataset: housepricesdataset.csv
df = pd.read_csv("housepricesdataset.csv",sep = ";")
##### our feature set is the area, roomcount and buildingage columns; the output (result) is the price column
##### we define a linear regression model here:
reg = linear_model.LinearRegression()
reg.fit(df[['area', 'roomcount', 'buildingage']], df['price'])
# Since our model is ready, we can make predictions now:
# lets predict a house with 230 square meters, 4 rooms and 10 years old building..
reg.predict([[230,4,10]])
# Now lets predict a house with 230 square meters, 6 rooms and 0 years old building - its new building..
reg.predict([[230,6,0]])
# Now lets predict a house with 355 square meters, 3 rooms and 20 years old building
reg.predict([[355,3,20]])
# You can make as many prediction as you want..
reg.predict([[230,4,10], [230,6,0], [355,3,20], [275, 5, 17]])
My dataset, housepricesdataset.csv, contains the area, roomcount, buildingage and price columns used above.
