Sklearn Linear Regression seems to not fit the Data - python

I'm trying to train a linear regression model with scikit-learn, but the resulting regression does not seem to fit the data very well. I would have expected the fitted line to have a slope of roughly 30°, whereas this regression shows no correlation at all (a horizontal line). Does anyone have an idea how I can modify my model to get a more appropriate prediction?
[Data plot]
This is the corresponding code:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Define x & y:
x = pd.DataFrame({'patientweight': df['patientweight']})
y = pd.DataFrame({'rate': df['rate']})
# Split the data into train/test sets:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
# Train the linear regression model:
lin_reg = LinearRegression()
lin_reg.fit(x_train, y_train)
# Predict on the training set:
y_pred = lin_reg.predict(x_train)
# Plot the data:
plt.figure(figsize=(15,10))
plt.xlabel('Patientweight')
plt.ylabel('Rate')
plt.title('Cohort 1: Relation between Patientweight and Rate')
plt.xlim(0,200)
plt.ylim(0,4000)
plt.scatter(df['patientweight'], df['rate'], alpha=0.25)
plt.scatter(x_train, y_pred)
plt.show()
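One way to sanity-check what the model has actually learned (a minimal sketch, reusing the lin_reg, x_train and df names from the snippet above) is to print the fitted slope and intercept and draw the regression line explicitly over the raw scatter:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Inspect the learned parameters; a near-zero coefficient would explain a flat line.
print('slope:', lin_reg.coef_, 'intercept:', lin_reg.intercept_)

# Draw the fitted line over the raw data instead of a second scatter of predictions.
x_line = pd.DataFrame({'patientweight': np.linspace(0, 200, 100)})
plt.scatter(df['patientweight'], df['rate'], alpha=0.25)
plt.plot(x_line, lin_reg.predict(x_line), color='red')
plt.xlabel('Patientweight')
plt.ylabel('Rate')
plt.show()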

Related

Multivariate Linear Regression, coefficients don't match

I'm facing a problem with different linear models from scikit-learn. Here is my code:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_train).reshape(-1)
print(f"R2 on train set:{reg.score(X_train, y_train)}")
print(f"R2 on test set:{reg.score(X_test, y_test)}")
print(f"MSE on train set:{mean_squared_error(y_train, y_pred)}")
print(f"MSE on test set:{mean_squared_error(y_test, reg.predict(X_test))}")
output:
>R2 on train set:0.5810258473777401
>R2 on test set:0.5908396388537969
>MSE on train set:0.023576848498732563
>MSE on test set:0.02378699441936436
The model is fitted; now I want to get the slope coefficients and the intercept from my model:
A, B = reg.coef_[0], reg.intercept_[0]
A, B
output:
>(array([ 0.14373081, -1.8211677 , 1.81493948, 1.39041689, -0.14027746]),
> 0.060286931992710735)
Since I used 5 features to fit the model, I also have 5 slope coefficients, which makes sense.
But when I try to visualize y_true, y_pred and the regression line (ax + b) for each feature, the line for the second feature (total rooms) looks wrong. Given its slope coefficient of -1.81 the plotted line is consistent with that number, but if the model's predictions look fine, how can the regression line for that feature look so bad? It makes no sense, right?
My suspicion is that reg.coef_ is not returned in the same order as the features the model was fitted with, but as far as I can see it should be the same order, so I'm not sure.
Here is also the part of the code that plots the regression, just in case:
sns.lineplot(x=X[:, i], y=(a[i]*X[:, i])+b, label="regression", color=c3, alpha=1, ci=None, ax=axes[i])
Any idea?
I keep in mind that there may be no problem at all, but visually it hurts a bit.
y_pred is a quantity produced by the regression. We introduce N as another variable which cannot be quantified directly or treated as consecutive.
N = ax/k - b from the scatter plot. This helps to find the total shape or size of the bedroom; here b = l.
Having 5 coefficients is right: there is one per independent variable, independently of the regression.
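For what it's worth, reg.coef_ does follow the column order of the matrix passed to fit, and a per-feature line a[i]*x + b ignores the contribution of the other four features, which is usually why it looks off against the raw scatter even when predictions are good. A minimal sketch to check the alignment (the feature names and generating weights are placeholders, not taken from the original post):
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # 5 features, columns in a fixed order
# Synthetic target built from weights similar to the coefficients in the question.
y = X @ np.array([0.14, -1.82, 1.81, 1.39, -0.14]) + 0.06 + rng.normal(scale=0.1, size=100)

reg = LinearRegression().fit(X, y)

# coef_[i] is the partial slope of column i, holding the other features fixed.
feature_names = ['median_income', 'total_rooms', 'total_bedrooms', 'population', 'households']
for name, coef in zip(feature_names, reg.coef_):
    print(f"{name}: {coef:.3f}")

# Plotting coef_[i] * X[:, i] + intercept_ against y ignores the other features,
# so that single-feature line can sit far from the point cloud even for a good model.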

What is the difference between these two ways of specifying training/testing data for sklearn GPR

This is somewhat of a follow-up to my previous question about evaluating my scikit-learn Gaussian process regressor. I am very new to GPRs and I think I may be making a methodological mistake in how I am using training vs. testing data.
Essentially I'm wondering what the difference is between specifying training data by splitting the input between test and training data like this:
import numpy as np
from sklearn import gaussian_process
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel
from sklearn.model_selection import train_test_split

X = np.atleast_2d(some_data).T
Y = np.atleast_2d(other_data).T
X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    test_size=0.33,
                                                    random_state=0)
kernel = ConstantKernel() + Matern() + WhiteKernel(noise_level=1)
gp = gaussian_process.GaussianProcessRegressor(
    alpha=1e-10,
    copy_X_train=True,
    kernel=kernel,
    n_restarts_optimizer=10,
    normalize_y=False,
    random_state=None)
gp.fit(X_train, y_train)
score = gp.score(X_test, y_test)
print(score)
x_pred = np.atleast_2d(np.linspace(0, 10, 1000)).T
y_pred, sigma = gp.predict(x_pred, return_std=True)
vs. using the full data set to train, like this:
X = np.atleast_2d(some_data).T
Y = np.atleast_2d(other_data).T
kernel = ConstantKernel() + Matern() + WhiteKernel(noise_level=1)
gp = gaussian_process.GaussianProcessRegressor(
    alpha=1e-10,
    copy_X_train=True,
    kernel=kernel,
    n_restarts_optimizer=10,
    normalize_y=False,
    random_state=None)
gp.fit(X, Y)
score = gp.score(X, Y)
print(score)
x_pred = np.atleast_2d(np.linspace(0, 10, 1000)).T
y_pred, sigma = gp.predict(x_pred, return_std=True)
Is one of these options going to result in incorrect predictions?
You split the training data off from the test data to evaluate your model, because otherwise you have no idea whether you are overfitting the data. For example, just place data in Excel and plot it with a smooth line. Technically, that spline function from Excel is a perfect model, but useless for predicting new values.
In your example, your predictions are over a uniform space so you can visualize what your model thinks the underlying function is, but that tells you nothing about how well the model generalizes. Sometimes you can get very high accuracy (> 95%) on training data and less than chance on testing data, which means the model is overfitting.
In addition to plotting a uniform prediction space to visualize the model, you should also predict values from the test set, then look at accuracy metrics for both the testing and training data.
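A minimal sketch of that last step, reusing the gp, X_train/X_test names from the first snippet (the choice of metrics is just an illustration):
from sklearn.metrics import mean_squared_error, r2_score

# Score on both splits; a large gap between them is the usual sign of overfitting.
y_train_pred = gp.predict(X_train)
y_test_pred = gp.predict(X_test)

print("R2  train:", r2_score(y_train, y_train_pred))
print("R2  test: ", r2_score(y_test, y_test_pred))
print("MSE train:", mean_squared_error(y_train, y_train_pred))
print("MSE test: ", mean_squared_error(y_test, y_test_pred))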

How to plot a graph which depicts the performance of an ML model?

I have done binary classification on a dataset to determine whether there is a leak or no leak. I have applied 3 ML algorithms separately to compare performance, namely Naive Bayes, random forest and decision tree. For the decision tree I have written the following code, where s1 to s20 are sensor values. How can I plot an error analysis graph, given that the predicted output is either 0 or 1?
from sklearn import tree, metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Creating features and labels
n_features = list(zip(s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s20))
n_samples = status
# Decision tree regression
clf = tree.DecisionTreeRegressor()
# Splitting of data
X_train, X_test, y_train, y_test = train_test_split(n_features, n_samples, test_size=0.5, random_state=0)
# Scale the features (fit the scaler on the training set only)
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
# Train model on the scaled training data
clf.fit(X_train_std, y_train)
# Prediction
y_pred = clf.predict(X_test_std)
print('percentage Accuracy:', 100*metrics.accuracy_score(y_test, y_pred))
Create a dataframe called model_performance_df. Use the machine learning algorithms you have applied (Naive Bayes, RandomForest and DecisionTree) as the column names of the dataframe, and add the performance metrics for each algorithm as the rows.
Then use a visualization library such as matplotlib or seaborn to plot the graph as you like; for example, try a histogram or a distribution plot.
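A minimal sketch of that idea (the metric values below are placeholders, not results from the question; in practice fill them in from metrics.accuracy_score, precision_score, recall_score, etc.):
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder scores for each model, one row per metric.
model_performance_df = pd.DataFrame(
    {
        "NaiveBayes":   [0.91, 0.89, 0.90],
        "RandomForest": [0.95, 0.94, 0.93],
        "DecisionTree": [0.92, 0.90, 0.91],
    },
    index=["accuracy", "precision", "recall"],
)

# Grouped bar chart comparing the models on each metric.
model_performance_df.plot(kind="bar")
plt.ylabel("Score")
plt.title("Model performance comparison")
plt.tight_layout()
plt.show()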

Linear Regression - Predict ŷ

I'm trying to plot a scatter plot of the values of actual sales (y) and predicted sales (ŷ).
I have imported the csv file, and currently the code I have for the linear regression model is:
result = smf.ols('sales ~ discount + holiday + product', data=data).fit()
print(result.summary())
Since I only have the actual sales values, how do I find the predicted sales (ŷ) values to plot the scatter plot? I have tried researching and found lm.predict() and result.predict(). Is there a difference? (Here lm = LinearRegression().)
Thank you in advance!
Without the data it is hard to help, but I assume you have X and y from the dataset, since you want to perform linear regression. You can split the data into training and test sets using scikit-learn:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3)
Then you need to fit linear regression to the training set:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
and afterwards predict test set results:
y_pred = regressor.predict(X_test)
Finally, you can plot your test or training results:
# Visualising the Training set results
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Discount vs Sales (Training set)')
plt.xlabel('Discount percentage')
plt.ylabel('Sales')
plt.show()
# Visualising the Test set results
plt.scatter(X_test, y_test, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Discount vs Sales (Test set)')
plt.xlabel('Discount percentage')
plt.ylabel('Sales')
plt.show()
(In this scenario we want to predict how many sales there will be if we set a specific value of, e.g., the discount percentage.) If you have more than one X parameter, things are more complicated and you will need to use dummy variables, perform statistical analysis, etc.
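As a side note on the statsmodels part of the question: with the result object from smf.ols above, the in-sample predictions ŷ are available directly, so a scatter of actual vs. predicted sales can be drawn without scikit-learn (a minimal sketch, assuming the same data frame and formula as in the question):
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

result = smf.ols('sales ~ discount + holiday + product', data=data).fit()

# result.fittedvalues (equivalently result.predict(data)) are the predicted sales ŷ.
y_hat = result.fittedvalues

plt.scatter(data['sales'], y_hat, alpha=0.5)
plt.xlabel('Actual sales (y)')
plt.ylabel('Predicted sales (ŷ)')
plt.title('Actual vs. predicted sales')
plt.show()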

How to generate predictions after doing PCA in Python

If I have a training set trainX, trainY, I know that you can run PCA with
from sklearn.decomposition import PCA
pca = PCA(n_components=5)
Xred = pca.fit(trainX).transform(trainX)
If I want to run a model, say Linear Regression, do I then run PCA on the testX?
Like this:
clf = linear_model.LinearRegression()
clf.fit(trainX, trainY)
testXred = pca.fit(testX).transform(testX)
predictions = clf.predict(testXred)
Or do I only run PCA on the training set, so the Linear Regression prediction should be this instead?
predictions = clf.predict(testX)
or this?
testXred = pca.fit(trainX).transform(testX)
predictions = clf.predict(testXred)
If you mean you want to reduce noise using PCA before doing the linear regression, here's an example, which might help:
Using PCA on linear regression
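For reference, the usual pattern is to fit the PCA on the training data only and then reuse that fitted transformer on the test data, so that both sets live in the same reduced space (a minimal sketch, using the variable names from the question):
from sklearn.decomposition import PCA
from sklearn import linear_model

# Fit PCA on the training data only, then apply the same transformation to the test data.
pca = PCA(n_components=5)
trainXred = pca.fit_transform(trainX)
testXred = pca.transform(testX)          # no refitting on testX

# Train the regression on the reduced training features...
clf = linear_model.LinearRegression()
clf.fit(trainXred, trainY)

# ...and predict using the test features projected with the *same* PCA.
predictions = clf.predict(testXred)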
