I've tried to do some machine learning in Python with pandas. My goal is to estimate people's insurance costs based on their lifestyle. I got a nice dataset from Kaggle. Training and testing on my dataset went quite well, but now I want to make a forecast for a new person and I don't know how to start.
Below is what I have done so far for training and testing with a linear regression (I also tried other approaches such as Monte Carlo and k-nearest neighbors).
The result is:
Accuracy on training set: 0.735
Accuracy on test set: 0.795
So how would you recommend proceeding to estimate the insurance cost of another person?
# Linear regression
from sklearn.linear_model import LinearRegression

linreg = LinearRegression()
linreg.fit(X_train, y_train)
# Note: for a regressor, .score() returns R^2, not classification accuracy.
print("Accuracy on training set: {:.3f}".format(linreg.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(linreg.score(X_test, y_test)))
Since you have already fit the model on the X_train and y_train data, you can make predictions for X_test as follows:
predictions = linreg.predict(X_test)
Basically, linreg.fit(X_train, y_train) trains the model using X_train as inputs and y_train as target labels. linreg.predict(X_test) uses X_test as inputs to produce predictions, and linreg.score(X_test, y_test) makes predictions from X_test and compares them with the target y_test to compute a score (for a regressor such as LinearRegression, this score is R^2, not classification accuracy).
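To estimate the cost for a new person, build a single-row input with the same columns and the same preprocessing/encoding as X_train, and pass it to predict. A minimal sketch (the feature names and values below are only placeholders; use whatever columns your model was actually trained on):
import pandas as pd

# Hypothetical person; the columns must match X_train exactly,
# including any encoding of categorical features (e.g. smoker as 0/1).
new_person = pd.DataFrame([{
    'age': 35,
    'bmi': 27.5,
    'children': 2,
    'smoker': 0,
}])

estimated_cost = linreg.predict(new_person)
print('Estimated insurance cost:', estimated_cost[0])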
Related
I am trying to do multivariate time series forecasting using a linear regression model.
In the code below I first split the data in an 80:20 ratio for training and testing.
Then I train the model, use it to predict on the test set, and compute the relevant performance metrics.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df[['EMA_10']], df[['close']], test_size=.2)
# Create regression model
model = LinearRegression()
# Train the model
model.fit(X_train, y_train)
# Use model to make predictions
y_pred = model.predict(X_test)
# Print relevant metrics
print("Model Coefficients:", model.coef_)
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))
print("Coefficient of Determination:", r2_score(y_test, y_pred))
Now how do I predict the next, i.e. future, value?
To predict unseen y, you can simply use .predict(<new x here>).
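For example, assuming the most recent EMA_10 value can stand in for the next time step's input (in practice you would need a future or forecast value of EMA_10), a minimal sketch would be:
# Take the last available EMA_10 row as the input for the next step (kept 2-D, as the model expects).
next_X = df[['EMA_10']].iloc[[-1]]
next_close = model.predict(next_X)
print('Predicted next close:', next_close[0, 0])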
However, why are you using linear regression to tackle the time series problem? It makes the data lose the time dimension. It's important to note that when performing time series forecasting, it's generally a good idea to use a model specifically designed for time series data, such as an autoregressive model (e.g., ARIMA) or advanced DL (e.g., RNN). These kinds of models are able to account for the temporal dependencies that are present in time series data, which can help improve the accuracy of the forecasts.
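As a rough illustration of the autoregressive route, here is a minimal ARIMA sketch with statsmodels (the order (1, 1, 1) is just a placeholder; you would normally choose it via ACF/PACF plots or a criterion such as AIC):
from statsmodels.tsa.arima.model import ARIMA

# Fit directly on the target series and forecast the next 5 steps.
arima_fit = ARIMA(df['close'], order=(1, 1, 1)).fit()
print(arima_fit.forecast(steps=5))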
There are many good resources on this, for example:
https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/
https://towardsdatascience.com/temporal-loops-intro-to-recurrent-neural-networks-for-time-series-forecasting-in-python-b0398963dc1f
I have the following experimental setup for a regression problem.
Using the routine below, a dataset of about 1800 entries is split into three groups: training, validation, and test.
X_train, X_test, y_train, y_test = train_test_split(inputs, targets, test_size=0.2,
random_state=42, shuffle=True)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25,
random_state=42, shuffle=True)
So in essence, the training set has ~1100 entries and the validation and test sets have ~350 each, and each subset contains a unique set of data points not seen in the other subsets.
With these subsets, I can perform a fit using any of the regression models available in scikit-learn, using the following routine:
model = LinearRegression()
clf = make_pipeline(StandardScaler(), model)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
Doing this, I then calculate the RMSE of the predictions, which for the linear regressor is about 0.948.
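For reference, the RMSE here is computed roughly as follows (a sketch, assuming scikit-learn's mean_squared_error):
import numpy as np
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, predictions))
print('RMSE:', rmse)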
Now, I could instead use cross-validation and not worry about splitting the data myself, using the following routine:
model = LinearRegression()
clf = make_pipeline(StandardScaler(), model)
predictions2 = cross_val_predict(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=42))
However, when I calculate the RMSE of these predictions, it is about 2.4! To compare, I tried a similar routine but swapped X for X_train and y for y_train, i.e.,
model = LinearRegression()
clf = make_pipeline(StandardScaler(), model)
predictions3 = cross_val_predict(clf, X_train, y_train, cv=KFold(n_splits=10, shuffle=True, random_state=42))
and received an RMSE of about 0.956.
I really do not understand why, when using the entire dataset, the cross-validation RMSE is so much higher and the predictions are so much worse compared to the reduced dataset.
Additional Notes
I have also tried running the above routine using the reduced subset X_val, y_val as the input to the cross-validation, and still get a small RMSE. Furthermore, when I simply fit a model on X_val, y_val and then make predictions on X_train, y_train, the RMSE is still lower than the cross-validation RMSE on the full data!
This happens not only for LinearRegression but also for RandomForestRegressor and others. I have also tried changing the random state in the splitting, as well as completely shuffling the data before handing it to train_test_split, but the same outcome occurs.
Edit 1.)
I tested this on a make_regression dataset from scikit-learn and did not get the same results; instead, all the RMSEs were small and similar. My guess is that it has to do with my dataset.
If anyone could help me out in understanding this, I would greatly appreciate it.
Edit 2.)
Thank you (@desertnaut) for the suggestions. The solution was actually quite simple: in my data-processing routine I was using (targets, inputs) = (X, y), which is wrong. I swapped that for (targets, inputs) = (y, X), and now the RMSE is about the same as in the other runs. I found the problem by making a histogram profile of the data. Thanks! I'll leave the question up for about an hour and then delete it.
You're overfitting.
Imagine you had 10 data points and 10 parameters: the RMSE would be zero because the model could fit the data perfectly. Now increase the data to 100 points and the RMSE will increase (assuming there is some variance in the data you are adding, of course), because your model can no longer fit the data perfectly.
A low RMSE (or a high R-squared) more often than not doesn't mean much on its own; you need to consider the standard errors of your parameter estimates. If you are just increasing the number of parameters (or, conversely, in your case, decreasing the number of observations), you are simply eating away at your degrees of freedom.
I'd wager that your standard error estimates for the X model's parameter estimates are smaller than your standard error estimates in the X_train model, even though RMSE is "lower" in the X_train model.
Edit: I'll add that your dataset exhibits high multicollinearity.
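If you want to check the multicollinearity yourself, one common approach is the variance inflation factor; a rough sketch with statsmodels (assuming X is the DataFrame of input features):
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# A VIF well above ~10 is a common rule of thumb for problematic multicollinearity.
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs)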
I'm trying to build a linear model in sklearn, and I want to test the model I have implemented using some error functions.
First I chose the features for my X and the target for my y.
#Predict the average parking rates per month
X = df[['Number of weekly riders', 'Price per week',
'Population of city', 'Monthly income of riders']]
y = df['Average parking rates per month']
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
#only 20% test size because we are working with a small dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
lm = LinearRegression()
lm.fit(X_train, y_train)
After fitting the model I try to use some of the error functions from sklearn's metrics package,
but apparently I can't use any of them, because the test and train sets do not have the same number of samples:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_train))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_train))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_train)))
ValueError: Found input variables with inconsistent numbers of samples: [6, 21]
Is it really true that you need the same amount of train and test data in order to run the error functions?
When you use train/test split you want to divide the training and test data:
The idea is that you train your algorithm on your training data and then test it on unseen data. So none of the metrics make sense when comparing y_train and y_test. What you want to compare is the prediction and y_test, which works like this:
y_pred_test = lm.predict(X_test)
metrics.mean_absolute_error(y_test, y_pred_test)
It is also possible to get an idea of the training scores by predicting on the training data:
y_pred_train = lm.predict(X_train)
metrics.mean_absolute_error(y_train, y_pred_train)
You want to compare y_test with y_pred, which is the output of your regressor on X_test.
The following will raise an inconsistent numbers of samples error.
metrics.mean_absolute_error(y_test, y_train)
The reason is that the training set and the test set have different numbers of rows.
Even in the rare case where they have the same number of rows, the statement still doesn't make sense: there is no point in comparing the test set labels to the training set labels.
Instead, you should obtain the predictions for your test features (X_test) by passing X_test to lm:
y_hat = lm.predict(X_test) # y_hat: predictions
Then, these metrics would make sense:
metrics.mean_absolute_error(y_test, y_hat)
I am doing Isolation Forest clustering on the Mulcross database with 2 classes. I split my data into training and test sets and try to calculate the accuracy score, roc_auc_score, and confusion_matrix on my test set. But there are two problems: first, in a clustering method I should not use the labels during the training phase, which means "y_train" should not be passed in, but I did not find another way to evaluate my model. Moreover, the results I get are wrong.
My question is how to evaluate a clustering model like Isolation Forest.
Here is my code:
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix
df = pd.read_csv('db.csv')
y_true = df['Target']
df_data = df.drop('Target', axis=1)
X_train, X_test, y_train, y_test = train_test_split(df_data, y_true, test_size=0.3, random_state=42)
# behaviour="new" only applies to older scikit-learn versions; drop it on recent releases.
alg = IsolationForest(n_estimators=100, max_samples=256, contamination=0.1, max_features=1.0, bootstrap=False, n_jobs=-1, random_state=42, verbose=0, behaviour="new")
model = alg.fit(X_train, y_train)
preds = alg.predict(X_test)
print("#############################\n#############################")
print(accuracy_score(y_test, preds))
print(roc_auc_score(y_test, preds))
cm = confusion_matrix(y_test, preds)
print(cm)
print("#############################\n#############################")
I do not understand why you are clustering and dividing the data into training/testing sets. It seems to me like you are mixing up classification and clustering. If you have labels, try a supervised method; easy wins are XGBoost, random forest, GLM, logistic regression, etc.
If you want to evaluate clustering methods, you can investigate the inter- and intra-cluster distances. At the end of the day, you want to have small and well-separated clusters. You can look at a metric called silhouette too.
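For example, a minimal sketch of the silhouette metric with scikit-learn (assuming X_test are your features and preds are the labels the model assigned, as in the code above):
from sklearn.metrics import silhouette_score

# Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
print('Silhouette:', silhouette_score(X_test, preds))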
You can also try
print("Accuracy:", list(y_pred_test).count(1)/y_pred_test.shape[0])
Also, look here for some more details.
While working with a linear regression model I split the data into a training set and test set. I then calculated R^2, RMSE, and MAE using the following:
lm.fit(X_train, y_train)
R2 = lm.score(X,y)
y_pred = lm.predict(X_test)
RMSE = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
MAE = metrics.mean_absolute_error(y_test, y_pred)
I thought that I was calculating R^2 for the entire dataset (rather than comparing the training data and the original data). However, I learned that you must fit the model before you score it, so I'm not sure if I'm scoring the original data (as passed to score) or the data that I used to fit the model (X_train and y_train). When I run:
lm.fit(X_train, y_train)
lm.score(X_train, y_train)
I get a different result than when I scored X and y. So my question is: are the inputs to .score compared against the model that was fitted (so that lm.fit(X, y); lm.score(X, y) gives the R^2 for the original data, and lm.fit(X_train, y_train); lm.score(X, y) gives the R^2 for the original data based on the model fitted in .fit), or is something else entirely happening?
fit() only fits the model on the given data, which is synonymous with training: fitting the data means training on the data.
score() is more like testing or evaluating: it makes predictions and compares them with the reference labels you pass in.
So one should use different datasets for training the classifier and testing its accuracy.
One can do it like this:
from sklearn import neighbors
from sklearn.model_selection import train_test_split

# train_test_split now lives in sklearn.model_selection (the old cross_validation module was removed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
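To tie this back to the question: score(X, y) always evaluates the most recently fitted model, using X as inputs and y as the reference values, so which data you pass to score() determines which data the result refers to. For a regressor such as LinearRegression the score is R^2, so (as a sketch) it is equivalent to:
from sklearn.metrics import r2_score

# For a fitted regressor, lm.score(X_test, y_test) equals r2_score(y_test, lm.predict(X_test)).
print(lm.score(X_test, y_test))
print(r2_score(y_test, lm.predict(X_test)))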