I am using pandas and scikit-learn to build a price prediction model. I split the dataset into train and test sets, then fit the model and predict.
X and y are pandas DataFrames.
X_train, X_test, y_train, y_test = train_test_split(X, y)
y_pred = model.predict(X_test)
difference = np.abs(np.subtract(y_pred, y_test))
# define my own way of calculating accuracy as a percentage, other than MAE
accuracy = np.divide(np.abs(np.subtract(y_pred, y_test)), y_test)
But how can I filter the rows with the worst accuracy, so that I can explore the badly predicted data in pandas?
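One way to do that (a minimal sketch, assuming y_test is a single-column Series; call y_test.squeeze() first if it is a DataFrame) is to collect actuals, predictions, and the relative error from above in one frame, then sort:
results = pd.DataFrame({'actual': y_test, 'predicted': y_pred}, index=y_test.index)
results['rel_error'] = np.abs(results['predicted'] - results['actual']) / results['actual']
worst = results.nlargest(20, 'rel_error')  # the 20 worst-predicted rows
X_test.loc[worst.index]                    # the matching feature rows, for exploration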
As part of a classification model, I initially used seed 101 to generate my train and test data and calibrate my model. Then I printed the classification report and got a recall of 0.46 for class 1.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
model = RandomForestClassifier(n_estimators=200)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
The next piece of code takes the model calibrated in the previous step and tests its performance against different permutations of the test data. On each iteration I use a different random_state to create the test data.
results = []
for i in range(0, n):
    # Note: random_state is different on each iteration
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=i)
    predictions = model.predict(X_test)
    class_report = classification_report(y_test, predictions, output_dict=True)  # produce the classification report as a dict
    recall_precision = get_recall_precision(class_report)  # extract recall and precision for the dependent variable = 1
    results.append(recall_precision)  # append recall and precision to the list on each iteration

# Divide recall and precision of dependent variable = 1 into two different lists
recall = [x[0] for x in results]
precision = [x[1] for x in results]

# Plot recall against precision in a scatter plot
plt.scatter(precision, recall)
plt.xlabel('Precision', fontsize=16)
plt.ylabel('Recall', fontsize=16)
See below the scatter plot of the results for the precision and recall of the positive class (dependent variable = 1). Note how my initial prediction sits at the bottom left of the plot, while all the rest appear to perform better. This does not seem right. How can this be happening?
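One thing worth checking here (a diagnostic sketch, not part of the original code, assuming X keeps a stable pandas index): because the model was fitted on the seed-101 training split, every other seed's "test" set will largely consist of rows the model already saw during training, which would inflate the scores.
X_train_101, _, _, _ = train_test_split(X, y, test_size=0.3, random_state=101)
for i in range(5):
    _, X_test_i, _, _ = train_test_split(X, y, test_size=0.3, random_state=i)
    overlap = X_test_i.index.isin(X_train_101.index).mean()
    print(f"seed {i}: {overlap:.0%} of test rows were in the original training set")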
I am doing a machine learning project in Python and trying out all models on my data.
I am really confused about one thing, for both classification and regression:
should I apply normalization (z-score / standardization) to the whole data set and then take the features (X) and output (y) from it, like this:
def normalize(df):
    from sklearn.preprocessing import MaxAbsScaler
    scaler = MaxAbsScaler()
    scaler.fit(df)
    scaled = scaler.transform(df)
    scaled_df = pd.DataFrame(scaled, columns=df.columns)
    return scaled_df

data = normalize(data)
X = data.drop(columns=['col'])
y = data['col']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Or should I apply it only to the features (X)?
X = data.drop(columns=['col'])
y = data['col']

def normalize(df):
    from sklearn.preprocessing import MaxAbsScaler
    scaler = MaxAbsScaler()
    scaler.fit(df)
    scaled = scaler.transform(df)
    scaled_df = pd.DataFrame(scaled, columns=df.columns)
    return scaled_df

X = normalize(X)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
TL;DR: normalize the input data, but don't normalize the output.
Whether and how to normalize depends on both the algorithm and the features.
Some algorithms do not require any normalization (like decision trees).
Applying normalization to the dataset: normalization should be computed per feature (each feature scaled by its own statistics) across all examples, whenever the dataset has more than one feature.
For example, let's say you have two features X and Y. Feature X is always a decimal in the range [0, 10], while Y is in the range [100K, 1M]. If you compare normalizing X and Y each on their own scale against normalizing them with one shared scale, you will see how, in the shared case, the values of feature X become insignificant.
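A toy illustration of that point (made-up numbers; MaxAbsScaler scales each column by its own maximum, while the second variant uses one global maximum for both columns):
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

df = pd.DataFrame({'X': [1.0, 5.0, 10.0], 'Y': [1e5, 5e5, 1e6]})
per_feature = pd.DataFrame(MaxAbsScaler().fit_transform(df), columns=df.columns)
combined = df / df.abs().max().max()  # one shared maximum for both columns

print(per_feature)  # X spans 0.1 .. 1.0 -- still informative
print(combined)     # X collapses to 1e-6 .. 1e-5 next to Y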
For Output (labels):
Generally, there is no need to normalize the output or labels for regression or classification tasks. But make sure the normalization is fitted on the training data during training, and that the same fitted transformation is applied at inference time.
If the task is classification, the common approach is to just encode the class labels as numbers (if you have the classes dog and cat, you assign 0 to one and 1 to the other).
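A minimal sketch of both points (reusing the MaxAbsScaler from the question; LabelEncoder stands in here for the class-number encoding):
from sklearn.preprocessing import MaxAbsScaler, LabelEncoder
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = MaxAbsScaler().fit(X_train)      # statistics come from the training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # same fitted transformation at inference time

# For classification labels: encode classes as integers instead of scaling them
le = LabelEncoder().fit(y_train)          # e.g. ['cat', 'dog'] -> [0, 1]
y_train_encoded = le.transform(y_train)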
I have the following experimental setup for a regression problem.
Using the following routine, a data set of about 1800 entries is separated into three groups, validation, test, and training.
X_train, X_test, y_train, y_test = train_test_split(inputs, targets, test_size=0.2,
random_state=42, shuffle=True)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25,
random_state=42, shuffle=True)
So in essence: training size ~1100, validation and test size ~350 each, and each subset has a unique set of data points not seen in the other subsets.
With these subsets, I can perform a fit using any of the regression models available in scikit-learn, with the following routine:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = LinearRegression()
clf = make_pipeline(StandardScaler(), model)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
Doing this, I then calculate the RMSE of the predictions, which in the case of the linear regressor is about 0.948.
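(The RMSE computation itself is not shown in the question; a standard way of getting it from the predictions would be:)
import numpy as np
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, predictions))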
Now, I could instead use cross-validation and not worry about splitting the data at all, using the following routine:
from sklearn.model_selection import cross_val_predict, KFold

model = LinearRegression()
clf = make_pipeline(StandardScaler(), model)
predictions2 = cross_val_predict(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=42))
However, when I calculate the RMSE of these predictions, it is about 2.4! To compare, I tried a similar routine, but switched X for X_train and y for y_train, i.e.,
model = LinearRegression()
clf = make_pipeline(StandardScaler(), model)
predictions3 = cross_val_predict(clf, X_train, y_train, cv=KFold(n_splits=10, shuffle=True, random_state=42))
and received an RMSE of about 0.956.
I really do not understand why, when using the entire data set, the RMSE for cross-validation is so much higher, and why the predictions are so much worse than with the reduced data set.
Additional Notes
Additionally, I have tried running the above routine using the reduced subset X_val, y_val as input to the cross-validation, and I still receive a small RMSE. Likewise, when I simply fit a model on the reduced subset X_val, y_val and then make predictions on X_train, y_train, the RMSE is still better (lower) than the cross-validation RMSE!
This does not only happen for LinearRegression, but also for RandomForestRegressor and others. I have additionally tried changing the random state in the splitting, as well as completely shuffling the data before handing it to train_test_split, but the same outcome still occurs.
Edit 1.)
I tested this on a make_regression data set from scikit-learn and did not get the same results; rather, all the RMSEs are small and similar. My guess is that it has to do with my data set.
If anyone could help me out in understanding this, I would greatly appreciate it.
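(A sketch of that synthetic check, under assumed settings, since the exact parameters used are not given in the question:)
from sklearn.datasets import make_regression

X_syn, y_syn = make_regression(n_samples=1800, n_features=10, noise=1.0, random_state=42)
predictions_syn = cross_val_predict(clf, X_syn, y_syn, cv=KFold(n_splits=10, shuffle=True, random_state=42))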
Edit 2.)
Hi, thank you (@desertnaut) for the suggestions; the solution was actually quite easy. The fact was that in my routine to process the data, I was using (targets, inputs) = (X, y), which is really wrong. I swapped that to (targets, inputs) = (y, X), and now the RMSE is about the same as in the other profiles. I made a histogram profile of the data and found the problem. Thanks! I'll keep the question up for about an hour, then delete it.
You're overfitting.
Imagine you had 10 data points and 10 parameters; then the RMSE would be zero, because the model could fit the data perfectly. Now increase the data points to 100 and the RMSE will increase (assuming there is some variance in the added data, of course), because your model no longer fits the data perfectly.
RMSE being low (or R-squared high) more often than not doesn't mean much: you need to consider the standard errors of your parameter estimates. If you are just increasing the number of parameters (or, conversely, in your case decreasing the number of observations), you are just chewing away your degrees of freedom.
I'd wager that the standard errors of the parameter estimates in the X model are smaller than those in the X_train model, even though RMSE is "lower" in the X_train model.
Edit: I'll add that your dataset exhibits high multicollinearity.
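One way to check that multicollinearity claim (a sketch using statsmodels' variance inflation factor, which is not part of the original posts; VIFs well above ~10 usually signal trouble):
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.sort_values(ascending=False))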
I'm trying to fit a linear model in sklearn, and I want to test the model I've implemented using some error functions.
First I chose the features for X and the target y.
#Predict the average parking rates per month
X = df[['Number of weekly riders', 'Price per week',
'Population of city', 'Monthly income of riders']]
y = df['Average parking rates per month']
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
#only 20% test size because we are working with a small dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
lm = LinearRegression()
lm.fit(X_train, y_train)
After I fitted the model, I tried to use some of the error functions from sklearn's metrics package,
but apparently I can't use any of them, because there is not an equal amount of test and train data:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_train))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_train))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_train)))
ValueError: Found input variables with inconsistent numbers of samples: [6, 21]
Is it really true that you need the same amount of train and test data in order to run the error functions?
When you use train/test split, you want to divide the training and test data:
the idea is that you train your algorithm with your training data and then test it with unseen data. So none of the metrics make sense with y_train versus y_test. What you want to compare is the prediction and y_test, which works like this:
y_pred_test = lm.predict(X_test)
metrics.mean_absolute_error(y_test, y_pred_test)
It is also possible to get an idea of the training scores; you can do that by predicting on the training data:
y_pred_train = lm.predict(X_train)
metrics.mean_absolute_error(y_train, y_pred_train)
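Comparing the two gives a quick overfitting check (a small addition, not in the original answer): a training error much lower than the test error suggests the model is memorizing rather than generalizing.
train_mae = metrics.mean_absolute_error(y_train, y_pred_train)
test_mae = metrics.mean_absolute_error(y_test, y_pred_test)
print('train MAE:', train_mae, ' test MAE:', test_mae)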
You want to compare y_test with the predictions, which are the output of X_test run through your regressor.
The following will raise an inconsistent numbers of samples error.
metrics.mean_absolute_error(y_test, y_train)
The reason is that the training set and the testing set have different numbers of rows.
In the rare case of them having the same number of rows, the statement above still doesn't make sense: there is no use in comparing the test-set labels to the training-set labels.
Instead, you should obtain predictions for your testing features (X_test) by feeding X_test to lm:
y_hat = lm.predict(X_test) # y_hat: predictions
Then, these metrics would make sense:
metrics.mean_absolute_error(y_test, y_hat)
I've tried out Linear Regression using SKLearn. I have data something along the lines of: Calories Eaten | Weight.
150 | 150
300 | 190
350 | 200
These are basically made-up numbers, but I've fit the dataset to the linear regression model.
What I'm confused on is, how would I go about predicting with new data, say I got 10 new numbers of Calories Eaten, and I want it to predict Weight?
regressor = LinearRegression()
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test) ??
But how would I go about taking only my 10 new Calories Eaten numbers and making them the test set I want the regressor to predict on?
You are correct, you simply call the predict method of your model and pass in the new unseen data for prediction. Now it also depends on what you mean by new data. Are you referencing data that you do not know the outcome of (i.e. you do not know the weight value), or is this data being used to test the performance of your model?
For new data (to predict on):
Your approach is correct. You can access all predictions by simply printing the y_pred variable.
If you know the respective weight values and want to evaluate the model:
Make sure that you have two separate data sets: x_test (containing the features) and y_test (containing the labels). Generate the predictions as you are doing with the y_pred variable, then you can calculate performance using any of a number of metrics. The most common one is root mean squared error, for which you simply pass y_test and y_pred as parameters. Here is a list of all the regression performance metrics supplied by sklearn.
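For example (a minimal sketch of that metric call):
import numpy as np
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, y_pred))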
If you do not know the weight value of the 10 new data points:
Use train_test_split to split your initial data set into 2 parts: training and testing. You would have 4 datasets: x_train, y_train, x_test, y_test.
from sklearn.model_selection import train_test_split
# random state can be any number (to ensure the same split), and test_size indicates a 25% cut
x_train, x_test, y_train, y_test = train_test_split(calories_eaten, weight, test_size=0.25, random_state=42)
Train the model by fitting it to x_train and y_train. Then evaluate its performance by predicting on x_test and comparing those predictions with the actual results in y_test. This way you get an idea of how the model performs. Furthermore, you can then predict the weight values for the 10 new data points, as in the sketch below.
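A minimal sketch of that last step (the calorie values are made up; reshape(-1, 1) puts the 10 numbers into the 2-D shape the regressor was trained on):
import numpy as np

new_calories = np.array([120, 250, 310, 400, 180, 220, 90, 500, 330, 275]).reshape(-1, 1)
predicted_weights = regressor.predict(new_calories)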
It is also worth reading further on the topic as a beginner. This is a simple tutorial to follow.
You have to split the data using train_test_split from sklearn's model_selection module, then train and fit the model on the training set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(eaten, weight)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
What I'm confused on is, how would I go about predicting with new
data, say I got 10 new numbers of Calories Eaten, and I want it to
predict Weight?
Yes, Calories Eaten represents the independent variable, while Weight represents the dependent variable.
After you split the data into training and test sets, the next step is to fit the regressor using X_train and y_train.
After the model is trained, you can predict the results for X_test, which gives you y_pred.
Now you can compare y_pred (the predicted data) with y_test (the real data).
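One convenient way to eyeball that comparison (a small sketch, not in the original answer):
import pandas as pd

comparison = pd.DataFrame({'actual': y_test, 'predicted': y_pred})
print(comparison.head(10))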
You can also use the score method on your linear model to get a measure of its performance.
score is calculated using the R^2 (R squared) metric, i.e. the coefficient of determination.
score = regressor.score(X_test, y_test)
For splitting the data you can use the train_test_split method.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(eaten, weight, test_size=0.2, random_state=0)