Sklearn training data and test data is not same size - python

I'm trying to fit a linear model in sklearn, and I want to test the model that I have implemented using some error functions.
First I chose the features for X and the target for y.
#Predict the average parking rates per month
X = df[['Number of weekly riders', 'Price per week',
'Population of city', 'Monthly income of riders']]
y = df['Average parking rates per month']
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
#only 20% test size because we are working with a small dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
lm = LinearRegression()
lm.fit(X_train, y_train)
After fitting the model, I try to use some of the error functions from sklearn's metrics package,
but apparently I can't use any of them, because the test and train sets do not contain the same number of samples:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_train))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_train))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_train)))
ValueError: Found input variables with inconsistent numbers of samples: [6, 21]
Is it really true that you need train and test data of the same size in order to run the error functions?

When you use a train/test split, you want to divide the data into a training set and a test set:
The idea is that you train your algorithm on the training data and then test it on unseen data. So none of the metrics make sense when called with y_train and y_test. What you want to compare is the prediction for the test set against y_test, which works like this:
y_pred_test = lm.predict(X_test)
metrics.mean_absolute_error(y_test, y_pred_test)
It is also possible to get an idea on the training scores; you can do that by predicting on the training data:
y_pred_train = lm.predict(X_train)
metrics.mean_absolute_error(y_train, y_pred_train)
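As a minimal, self-contained sketch of the point above: the data here is synthetic (an assumption, since the original DataFrame is not shown), with 27 rows so that a 20% split gives the same 21/6 sizes as in the error message. The three metrics are computed between y_test and the predictions for X_test, never between y_test and y_train.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the parking data (hypothetical): 27 rows, 4 features
rng = np.random.default_rng(101)
X = rng.normal(size=(27, 4))
y = X @ np.array([1.5, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=27)

# 20% test size -> 21 training rows and 6 test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)

lm = LinearRegression()
lm.fit(X_train, y_train)

# Predictions have one value per test row, so they line up with y_test
y_pred_test = lm.predict(X_test)

mae = mean_absolute_error(y_test, y_pred_test)
mse = mean_squared_error(y_test, y_pred_test)
rmse = np.sqrt(mse)

print("Mean Absolute Error:", mae)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
```

The train and test sets can have any sizes; only the two arrays passed to a metric need matching lengths.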

You want to compare y_test and y_predict, which is the output of passing X_test through your regressor.

The following will raise an inconsistent numbers of samples error.
metrics.mean_absolute_error(y_test, y_train)
The reason is that the training set and the testing set have different numbers of rows.
In the rare case of them having the same number of rows, the above statement still doesn't make sense: there's no use in comparing the test set labels to the training set labels.
Instead, you should obtain the predictions for your testing features (X_test) by passing X_test to lm:
y_hat = lm.predict(X_test) # y_hat: predictions
Then, these metrics would make sense:
metrics.mean_absolute_error(y_test, y_hat)

Related

Evaluate Polynomial regression using cross_val_score

I am trying to use cross_val_score to evaluate my regression model (with PolynomialFeatures(degree = 2)). I noted from different blog posts that I should use cross_val_score with the original X, y values, not X_train and y_train.
r_squareds = cross_val_score(pipe, X, y, cv=10)
r_squareds
>>> array([ 0.74285583, 0.78710331, -1.67690578, 0.68890253, 0.63120873,
0.74753825, 0.13937611, 0.18794756, -0.12916661, 0.29576638])
which indicates my model doesn't perform really well with the mean r2 of only 0.241. Is this supposed to be a correct interpretation?
However, I came across a Kaggle code working on the same data and the guy performed cross_val_score on X_train and y_train. I gave this a try and the average r2 was better.
r_squareds = cross_val_score(pipe, X_train, y_train, cv=10)
r_squareds.mean()
>>> 0.673
Is this supposed to be a problem?
Here is the code for my model:
X = df[['CHAS', 'RM', 'LSTAT']]
y = df['MEDV']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
pipe = Pipeline(
steps=[('poly_feature', PolynomialFeatures(degree=2)),
('model', LinearRegression())]
)
## fit the model
pipe.fit(X_train, y_train)
Your first interpretation is correct. The first cross_val_score call trains 10 models, each with 90% of your data as the training set and the remaining 10% as the validation set. We can see from these results that the variance of the estimator's R^2 is quite high. Sometimes the model performs even worse than a horizontal line that always predicts the mean (that is what a negative R^2 means).
From this result we can safely say that the model is not performing well on this dataset.
It is possible for the result obtained using only the train set in cross_val_score to be higher, but this score is most likely not representative of your model's performance, as the dataset might be too small to capture all of its variance. (Each training fold in the second cross_val_score is only 54% of your dataset: 90% of the 60% kept by train_test_split.)
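A hedged sketch of how the per-fold spread can be inspected: the data below is synthetic (an assumption, not the housing data from the question), and the shuffled KFold is my own choice. Passing cv=10 for a regressor uses unshuffled folds, which can produce very uneven scores when the rows are ordered.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic term (hypothetical stand-in for X, y)
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 3))
y = X[:, 0] ** 2 + 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=200)

pipe = Pipeline(
    steps=[
        ("poly_feature", PolynomialFeatures(degree=2)),
        ("model", LinearRegression()),
    ]
)

# Shuffle before splitting so every fold sees a similar mix of rows
cv = KFold(n_splits=10, shuffle=True, random_state=0)
r_squareds = cross_val_score(pipe, X, y, cv=cv)

# Report the spread along with the mean; a large std signals unstable folds
print("mean R^2:", r_squareds.mean())
print("std  R^2:", r_squareds.std())
```

Looking at the standard deviation next to the mean makes it easier to tell whether a low average comes from a few bad folds or from a uniformly weak model.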

Regarding increase in MSE of Cross-Validation model with increasing dataset for regression

I have the following experimental setup for a regression problem.
Using the following routine, a data set of about 1800 entries is separated into three groups, validation, test, and training.
X_train, X_test, y_train, y_test = train_test_split(inputs, targets, test_size=0.2,
random_state=42, shuffle=True)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25,
random_state=42, shuffle=True)
So in essence, training size ~ 1100, validation and test size ~ 350 each, and each subset has a unique set of data points that is not seen in the other subsets.
With these subsets, I can perform a fit using any of the regression models available from scikit-learn, using the following routine:
model = LinearRegression()
clf = make_pipeline(StandardScaler(), model)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
Doing this I then calculate the RMSE of the predictions, which in the case of the linear regressor, is about ~ 0.948.
Now, I could instead use cross-validation and not worry about splitting the data instead, using the following routine:
model = LinearRegression()
clf = make_pipeline(StandardScaler(), model)
predictions2 = cross_val_predict(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=42))
However, when I calculate the RMSE of these predictions, it is about ~2.4! To compare, I tried using a similar routine, but switched X for X_train, and y for y_train, i.e.,
model = LinearRegression()
clf = make_pipeline(StandardScaler(), model)
predictions3 = cross_val_predict(clf, X_train, y_train, cv=KFold(n_splits=10, shuffle=True, random_state=42))
and received a RMSE of about ~ 0.956.
I really do not understand why that when using the entire data set, the RMSE for the cross-validation is so much higher, and that the predictions are terrible in comparison to that with reduced data set.
Additional Notes
Additionally, I have tried out running the above routine, this time using the reduced subset X_val, y_val as inputs for the cross validation, and still receive small RMSE. Additionally, when I simply fit a model on the reduced subset X_val, y_val, and then make predictions on X_train, y_train, the RMSE is still better (lower) than that of the cross-validation RMSE!
This does not only happen for LinearRegression, but also for RandomForestRegressor, and others. I have additionally tried to change the random state in the splitting, as well as completely shuffling the data around before handing it to train_test_split, but still, the same outcome occurs.
Edit 1.)
I tested this out on a make_regression data set from scikit-learn and did not get the same results; rather, all the RMSEs are small and similar. My guess is that it has to do with my data set.
If anyone could help me out in understanding this, I would greatly appreciate it.
Edit 2.)
Hi thank you (#desertnaut) for the suggestions, the solution was actually quite easy, and the fact was that in my routine to process the data, I was using (targets, inputs) = (X, y), which is really wrong. I swapped that with (targets, inputs) = (y, X), and now the RMSE is about the same as the other profiles. I made a histogram profile of the data and found that problem. Thanks! I'll save the question for about 1 hour, then delete it.
You're overfitting.
Imagine you had 10 data points and 10 parameters, then RMSE would be zero because the model could perfectly fit the data, now increase the data points to 100 and the RMSE will increase (assuming there is some variance in the data you are adding of course) because your model is not perfectly fitting the data anymore.
RMSE being low (or R-squared high) more often than not doesn't mean jack, you need to consider the standard errors of your parameter estimates . . . If you are just increasing the number of parameters (or conversely, in your case, decreasing the number of observations) you are just chewing away your degrees of freedom.
I'd wager that your standard error estimates for the X model's parameter estimates are smaller than your standard error estimates in the X_train model, even though RMSE is "lower" in the X_train model.
Edit: I'll add that your dataset exhibits high multicollinearity.
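For reference, the two evaluation routes from the question can be placed side by side on a make_regression data set, as the asker did in Edit 1. This is only a sketch under assumed settings (5 features, noise standard deviation 1.0, both my choices); on well-formed data like this the hold-out RMSE and the cross-validated RMSE should come out close to each other, so a large gap is a hint to check how X and y were built.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data set of the same size as in the question (assumed parameters)
X, y = make_regression(n_samples=1800, n_features=5, noise=1.0, random_state=42)

# Route 1: fit on a training split, score on a held-out test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True
)
clf = make_pipeline(StandardScaler(), LinearRegression())
clf.fit(X_train, y_train)
rmse_holdout = np.sqrt(mean_squared_error(y_test, clf.predict(X_test)))

# Route 2: out-of-fold predictions for every row of the full data set
cv = KFold(n_splits=10, shuffle=True, random_state=42)
cv_predictions = cross_val_predict(
    make_pipeline(StandardScaler(), LinearRegression()), X, y, cv=cv
)
rmse_cv = np.sqrt(mean_squared_error(y, cv_predictions))

print("hold-out RMSE:", rmse_holdout)
print("cross-validated RMSE:", rmse_cv)
```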

Machinelearning, how to make a forecast from learning and training data

I've tried to do some machine learning in Python with pandas. My goal was to estimate the insurance costs of people based on their lifestyle. I got a nice database from Kaggle. Training and testing on my dataset went quite well, but now I want to make a forecast for a person and I don't know how to start.
I post what I have done so far with training and testing with a linear regression (I also did a lot of other stuff like Monte Carlo, k-nearest neighbors, ...)
the result is
Accuracy on training set: 0.735
Accuracy on test set: 0.795
so how would you recommend to continue estimating the insurance cost of another person?
#Linear Regression
linreg = LinearRegression()
linreg.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(linreg.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(linreg.score(X_test, y_test)))
As you have already 'fit' the algorithm on X_train and y_train dataset, you can make predictions for X_test as follows:
predictions = linreg.predict(X_test)
Basically, linreg.fit(X_train, y_train) means fitting/training using X_train as inputs and y_train as (targeted) labels. On the other hand, linreg.predict(X_test) means using X_test as inputs to produce predictions, and linreg.score(X_test, y_test) means making predictions using X_test as inputs, then comparing the predictions with the (targeted) y_test to get a score. For LinearRegression this score is the R^2 coefficient of determination, not a classification accuracy.
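To estimate the cost for one new person, the same predict call works on a single row, as long as that row is a 2D array with the same columns, in the same order, as the training data. The sketch below uses made-up features (age, BMI, smoker flag) and synthetic costs; none of these come from the Kaggle data set in the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: columns are age, bmi, smoker (0 or 1)
rng = np.random.default_rng(0)
X_train = np.column_stack(
    [
        rng.integers(18, 65, 200),
        rng.uniform(18, 40, 200),
        rng.integers(0, 2, 200),
    ]
)
y_train = (
    250 * X_train[:, 0]
    + 300 * X_train[:, 1]
    + 20000 * X_train[:, 2]
    + rng.normal(scale=1000, size=200)
)

linreg = LinearRegression()
linreg.fit(X_train, y_train)

# One new person: a single row, shape (1, n_features), same column order
new_person = np.array([[35, 27.5, 0]])
estimated_cost = linreg.predict(new_person)[0]

print("Estimated insurance cost:", round(estimated_cost, 2))
```

If the training features were preprocessed (scaled, one-hot encoded, and so on), the new row needs to go through exactly the same transformations before predict is called.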

SKLearn Predicting using new Data

I've tried out Linear Regression using SKLearn. I have data something along the lines of: Calories Eaten | Weight.
150 | 150
300 | 190
350 | 200
Basically made up numbers but I've fit the dataset into the linear regression model.
What I'm confused on is, how would I go about predicting with new data, say I got 10 new numbers of Calories Eaten, and I want it to predict Weight?
regressor = LinearRegression()
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test) ??
But how would I go about making only my 10 new data numbers of Calories Eaten and make it the Test Set I want the regressor to predict?
You are correct, you simply call the predict method of your model and pass in the new unseen data for prediction. Now it also depends on what you mean by new data. Are you referencing data that you do not know the outcome of (i.e. you do not know the weight value), or is this data being used to test the performance of your model?
For new data (to predict on):
Your approach is correct. You can access all predictions by simply printing the y_pred variable.
You know the respective weight values and you want to evaluate model:
Make sure that you have two separate data sets: x_test (containing the features) and y_test (containing the labels). Generate the predictions as you are doing with the y_pred variable, then you can calculate the model's performance using a number of performance metrics. The most common one is the root mean squared error, and you simply pass y_test and y_pred as parameters. Here is a list of all the regression performance metrics supplied by sklearn.
If you do not know the weight value of the 10 new data points:
Use train_test_split to split your initial data set into 2 parts: training and testing. You would have 4 datasets: x_train, y_train, x_test, y_test.
from sklearn.model_selection import train_test_split
# random state can be any number (to ensure same split), and test_size indicates a 25% cut
x_train, x_test, y_train, y_test = train_test_split(calories_eaten, weight, test_size = 0.25, random_state = 42)
Train the model by fitting it on x_train and y_train. Then evaluate the model's performance by predicting on x_test and comparing these predictions with the actual results from y_test. This way you would have an idea of how the model performs. Furthermore, you can then predict the weight values for the 10 new data points accordingly.
It is also worth reading further on the topic as a beginner. This is a simple tutorial to follow.
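You can split the data using train_test_split from sklearn.model_selection, then fit the model on the training set.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# train_test_split returns the two feature sets first, then the two label sets
X_train, X_test, y_train, y_test = train_test_split(eaten, weight)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)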
What I'm confused on is, how would I go about predicting with new
data, say I got 10 new numbers of Calories Eaten, and I want it to
predict Weight?
Yes, Calories Eaten represents the independent variable while Weight represents the dependent variable.
After you split the data into a training set and a test set, the next step is to fit the regressor using the X_train and y_train data.
After the model is trained, you can predict the results for X_test with the predict method, which gives you y_pred.
Now you can compare y_pred (predicted data) with y_test which is real data.
You can also use score method for your created linear model in order to get the performance of your model.
score is calculated using R^2(R squared) metric or Coefficient of determination.
score = regressor.score(X_test, y_test)
For splitting the data you can use train_test_split method.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(eaten, weight, test_size = 0.2, random_state = 0)
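Putting the pieces together for the original question, here is a small sketch with synthetic calorie and weight values (the numbers and the linear relationship are assumptions for illustration only). It shows the unpacking order that train_test_split actually returns, and how 10 new Calories Eaten values can be passed to predict after being reshaped into a single-column 2D array.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): weight grows roughly linearly with calories
rng = np.random.default_rng(0)
calories_eaten = rng.uniform(100, 400, size=50)
weight = 120 + 0.25 * calories_eaten + rng.normal(scale=5, size=50)

# sklearn expects a 2D feature array: one row per sample, one column per feature
X = calories_eaten.reshape(-1, 1)

# Order of the returned values: X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(
    X, weight, test_size=0.2, random_state=0
)

regressor = LinearRegression()
regressor.fit(X_train, y_train)
print("R^2 on the test set:", regressor.score(X_test, y_test))

# The 10 new values have no known weight, so they are only used for prediction
new_calories = np.array(
    [120, 150, 180, 210, 240, 270, 300, 330, 360, 390]
).reshape(-1, 1)
new_weights = regressor.predict(new_calories)
print(new_weights)
```

The new values do not need to be part of any train/test split; the split only exists to measure how well the model generalizes before it is used on data without labels.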

Do I have to use fit() again after training in sklearn?

I am using LinearRegression(). Below you can see what I have already done to predict new features:
lm = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=say)
lm.fit(X_train, y_train)
lm.predict(X_test)
scr = lm.score(X_test, y_test)
lm.fit(X, y)
pred = lm.predict(X_real)
Do I really need the line lm.fit(X, y) or can I just go without using it? Also, If I don't need to calculate accuracy, do you think the following approach is better instead using training and testing? (In case I don't want to test):
lm.fit(X, y)
pred = lm.predict(X_real)
Even though I am getting 0.997 accuracy, the predicted value is not close, or is shifted. Are there ways to make the prediction more accurate?
You don't need to fit multiple times for predicting a value by given features since your algorithm already learned your train set. Check the codes below.
# Split your data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=0)
# Teach your data to your algorithm with train set
lr = LinearRegression()
lr.fit(X_train, y_train)
# Now it can predict
y_pred = lr.predict(X_test)
# Use test set to see how accurate it predicts
lr_score = lr.score(X_test, y_test)
The reason you are getting almost 100% accuracy score is a data leakage, caused by the following line of code:
lm.fit(X, y)
In the line above you gave your model ALL the data, and then you are testing predictions on a subset of data that your model has already seen.
This causes a very high accuracy score on the already seen data, but the model usually performs badly on unseen data.
When do you want / need to fit your model multiple times?
If you are getting new training data and want to improve your model by training it on that new portion of data, then you may want to choose one of the regression algorithms that support incremental learning (LinearRegression itself does not provide partial_fit; SGDRegressor, for example, does).
In this case you will use the model.partial_fit() method instead of model.fit()...
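A minimal sketch of incremental learning, assuming SGDRegressor as the estimator (my choice for illustration; the answer does not name one) and synthetic batches that stand in for newly arriving training data. Each call to partial_fit updates the existing coefficients instead of starting over, which is what distinguishes it from calling fit again.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
true_coef = np.array([2.0, -1.0, 0.5])  # hypothetical ground truth

model = SGDRegressor(random_state=0)

# Simulate 50 batches of new data arriving over time
for _ in range(50):
    X_batch = rng.normal(size=(100, 3))
    y_batch = X_batch @ true_coef + rng.normal(scale=0.1, size=100)
    # Update the current model with this batch only
    model.partial_fit(X_batch, y_batch)

# Evaluate on fresh data the model has never seen
X_new = rng.normal(size=(200, 3))
y_new = X_new @ true_coef + rng.normal(scale=0.1, size=200)
r2_new = model.score(X_new, y_new)

print("learned coefficients:", model.coef_)
print("R^2 on unseen data:", r2_new)
```

SGD-based models are sensitive to feature scale, so with real data the features would normally be standardized first; the synthetic features here are already on a unit scale.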
