I want to compare my prediction value with original train data - python

I am trying to learn how to use a decision tree regressor, and I have written the code below.
X_train, X_test, y_train, y_test = train_test_split(
x, y, test_size = 0.3, random_state = 100)
model = DecisionTreeRegressor(random_state=1)
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
I want to create a dataframe that includes X_test, y_test and y_pred.
Is there a method or function for that?

Append the code below to the end of your prediction code:
final_df = X_test.copy()
final_df["Y_original"] = y_test
final_df["Y_predicted"] = y_pred
Here we create a new dataframe, final_df, and put all the values you need into it. I would not suggest adding the columns directly to X_test, because you may need it again for prediction.
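For reference, here is a minimal self-contained sketch of the same idea. The toy data (columns `f1`, `f2` and the target formula) and the extra `error` column are assumptions made for illustration, not the asker's data. It relies on X_test being a pandas DataFrame: y_test is then aligned on the index, while y_pred (a NumPy array) is aligned by position.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data standing in for the asker's x and y
rng = np.random.default_rng(0)
x = pd.DataFrame({"f1": rng.normal(size=50), "f2": rng.normal(size=50)})
y = 2 * x["f1"] - x["f2"]

X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=100)

model = DecisionTreeRegressor(random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Copy X_test so the original feature frame stays untouched
final_df = X_test.copy()
final_df["Y_original"] = y_test    # Series: aligned on the index
final_df["Y_predicted"] = y_pred   # ndarray: aligned by position
# Illustrative extra column: signed prediction error per row
final_df["error"] = final_df["Y_predicted"] - final_df["Y_original"]
print(final_df.head())
```

Because final_df is a copy, X_test still contains only the feature columns and can be passed to model.predict again.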

Related

Best practice for train, validation and test set

I want to assign a sample class to each instance in a dataframe - 'train', 'validation' and 'test'. If I use sklearn train_test_split(), twice, I can get the indices for a train, validation and test set like this:
X = df.drop(['target'], axis=1)
y=df[['target']]
X_train, X_test, y_train, y_test, indices_train, indices_test=train_test_split(X, y, df.index,
test_size=0.2,
random_state=10,
stratify=y,
shuffle=True)
df_=df.loc[indices_train]  # indices_train holds index labels, so select with .loc
X_ = df_.drop(['target'], axis=1)
y_=df_[['target']]
X_train, X_val, y_train, y_val, indices_train, indices_val=train_test_split(X_, y_, df_.index,
test_size=0.15,
random_state=10,
stratify=y_,
shuffle=True)
df['sample']=['train' if i in indices_train else 'test' if i in indices_test else 'val' for i in df.index]
What is the best practice for getting a train, validation and test set? Are there any problems with my approach above, and can it be phrased better?
A faster solution for a large dataset would be to use numpy; see:
How to split data into 3 sets (train, validation and test)?
Otherwise, the simpler way is your own solution, but you could just feed the X_train and y_train obtained in the first step into the second train/validation split. Storing the indices and re-selecting rows from df feels unnecessary.
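The suggestion above could be sketched roughly as follows. This is only one possible way to do it: the toy dataframe (columns `a`, `b`, `target`) and the intermediate names `X_trainval` / `y_trainval` are assumptions for illustration. The splits keep the original index labels, so the `sample` column can be filled from the index of each split without passing df.index through train_test_split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, balanced binary target
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.normal(size=100),
    "b": rng.normal(size=100),
    "target": np.tile([0, 1], 50),
})

X = df.drop(columns="target")
y = df["target"]

# First split: hold out the test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10, stratify=y)

# Second split: feed the first-step training data straight back in
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15, random_state=10,
    stratify=y_trainval)

# Label every row using the index labels carried by each split
df["sample"] = "train"
df.loc[X_val.index, "sample"] = "val"
df.loc[X_test.index, "sample"] = "test"
print(df["sample"].value_counts())
```

Assigning through `df.loc[...]` also avoids the per-row membership test in the list comprehension, which can get slow on large frames.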
So, I made a dummy dataset of 100 points.
I separated the features from the target and did the first split:
X = df.drop('target', axis=1)
y = df['target']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
Note that test_size is 0.3, which means 70 data points go to training and the remaining 30 are held out, to be shared between validation and test.
X_train.shape # Output (70, 3)
X_test.shape # Output (30, 3)
Now you need to split again for validation, so you can do it like this:
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5)
Notice how I name the outputs and that test_size is now 0.5, which means the 30 held-out points are split evenly between validation and test. So the shapes of the validation and test sets will be:
X_val.shape # Output (15, 3)
X_test.shape # Output (15, 3)
In the end you have 70 points for training, 15 for validation and 15 for testing.
Now, think of validation as a "double check" of your training: you use it to evaluate and tune the model while developing it, and keep the test set untouched for the final evaluation.
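The two-step split described above could be wrapped in a small helper. The function name `train_val_test_split`, its parameters and the dummy data below are hypothetical, made up for this sketch; scikit-learn itself only provides `train_test_split`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def train_val_test_split(X, y, val_size=0.15, test_size=0.15, random_state=None):
    """Hypothetical helper: split off a holdout set, then halve it into val/test."""
    holdout = val_size + test_size
    # Step 1: training data vs. everything held out
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=holdout, random_state=random_state)
    # Step 2: divide the held-out rows between validation and test
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=test_size / holdout, random_state=random_state)
    return X_train, X_val, X_test, y_train, y_val, y_test


# Dummy dataset of 100 points with 3 features (assumed for illustration)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
df["target"] = rng.normal(size=100)

X = df.drop("target", axis=1)
y = df["target"]
X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(
    X, y, random_state=0)
print(X_train.shape, X_val.shape, X_test.shape)
```

Passing a fixed random_state makes the split reproducible, which the snippets above leave out.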

Converting predicted random forest results into dataframe

This question may look silly, but I could not work out how to do it, so I need your help.
I used a random forest to predict the result and wrote the following code:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
from sklearn.ensemble import RandomForestRegressor
# create regressor object
regressor = RandomForestRegressor(n_estimators = 100, random_state = 0)
# fit the regressor with x and y data
regressor.fit(X_train, y_train)
Y_pred = regressor.predict(X_test)
Y_pred is the result for a given X_test. Now, I would like to create a data frame of my Y_pred and y_test data and save it into CSV format.
Any idea how to do this?
It turned out to be quite simple; it just clicked. It can be done this way:
df_new = pd.DataFrame({'x':Y_pred, 'y':y_test})
df_new.head()
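The answer above builds the dataframe but does not show the CSV step. A possible end-to-end sketch follows; the toy data, the column names `y_test` / `y_pred` and the file name `predictions.csv` (written to the system temp directory) are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the asker's X and y
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(60, 3)), columns=["a", "b", "c"])
y = X["a"] * 3 + X["b"] + rng.normal(scale=0.1, size=60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

regressor = RandomForestRegressor(n_estimators=100, random_state=0)
regressor.fit(X_train, y_train)
Y_pred = regressor.predict(X_test)

# y_test supplies the index; Y_pred is matched to it by position
df_new = pd.DataFrame({"y_test": y_test, "y_pred": Y_pred})

# Save to CSV, keeping the original row labels as the first column
csv_path = os.path.join(tempfile.gettempdir(), "predictions.csv")
df_new.to_csv(csv_path, index=True)
print(df_new.head())
```

Keeping the index in the file makes it possible to trace each prediction back to its row in the original data.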

Relate the predicted value to its index/identification number

I am training a model to predict true or false based on some data. I drop the product number from the list of features when training and testing the model.
X = df.drop(columns=['Product Number', 'result'])  # also drop the target so it is not used as a feature
y = df['result']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
SVC = LinearSVC(max_iter = 1200)
SVC.fit(X_train, y_train)
y_pred = SVC.predict(X_test)
Is there any way for me to recover the product number and its features for the item that has passed or failed? How do I get/relate the results of y_pred to which product number it corresponds to?
I also plan on using cross validation, so the data gets shuffled. Would there still be a way for me to recover the product number for each test item?
I realised I'm using cross validation only to evaluate my model's performance so I decided to just run my code without shuffling the data to see the results for each datapoint.
Edit: For evaluation without cross validation, I drop the irrelevant columns only when I pass it to the classifier as shown below:
cols = ['id', 'label']
X = train_data.copy()
y = train_data['label']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=2)
knn = make_pipeline(StandardScaler(),KNeighborsClassifier(n_neighbors=10))
y_val_pred = knn.fit(X_train.drop(columns=cols), y_train).predict(X_val.drop(columns=cols))
X_val['y_val_pred'] = y_val_pred
I join the y_val_pred after prediction to check which datapoints have been misclassified.
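Another way to relate predictions to product numbers, sketched here under assumptions, is to use the index that train_test_split preserves: the rows of X_test keep their original labels, so they can be looked up in the full dataframe. The toy data (columns `f1`, `f2`, the `P000`-style product numbers) and the names `clf`, `results` and `cv_predicted` are made up for illustration. For cross validation, `cross_val_predict` returns one out-of-fold prediction per row in the original row order.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.svm import LinearSVC

# Hypothetical toy data with an identifier column
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "Product Number": [f"P{i:03d}" for i in range(50)],
    "f1": rng.normal(size=50),
    "f2": rng.normal(size=50),
})
df["result"] = (df["f1"] + df["f2"] > 0).astype(int)

X = df.drop(columns=["Product Number", "result"])
y = df["result"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = LinearSVC(max_iter=1200)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# X_test keeps the original index labels, so look the rows up in df
results = df.loc[X_test.index].copy()
results["predicted"] = y_pred
print(results[["Product Number", "result", "predicted"]])

# Cross validation: out-of-fold predictions, returned in the row order of X
df["cv_predicted"] = cross_val_predict(LinearSVC(max_iter=1200), X, y, cv=5)
```

This keeps the identifier out of the features while still making every prediction traceable to a product.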

How do I properly fit a scikit-learn model using a pandas dataframe?

I am trying to create a machine learning program in scikit-learn. I am using a CSV file to store data, and have decided to use a pandas data frame to import and format this data. I cannot figure out how to fit the model with this data frame.
My CSV file has one feature, age, and one target, weight. I am using a linear regression algorithm to predict the weight using the age. I do realize this isn't the best algorithm to use with this data.
When I run this code I get the error "ValueError: Found input variables with inconsistent numbers of samples: [10, 40]"
Here is my code:
# Imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Load And Split Data
data = pd.read_csv("awd.csv")
feature_cols = ['Ages']
X = data.loc[:, feature_cols]
y = data.loc[:, "Weights"]
X_train, y_train, X_test, y_test = train_test_split(X, y, random_state=0, train_size=0.2)
# Train Model
lr = LinearRegression()
lr.fit(X_train, y_train)
# Scores
print(f"Test set score: {round(lr.score(X_test, y_test), 3)}")
print(f"Training set score: {round(lr.score(X_train, y_train), 3)}")
The first 5 lines of my CSV file:
Ages,Weights
1,19
1,21
2,26
2,32
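You're assigning the return values incorrectly. train_test_split returns the splits in the order X_train, X_test, y_train, y_test, but your code unpacks them as X_train, y_train, X_test, y_test. As a result y_train actually holds X_test (40 rows) while X_train has 10 rows, which is exactly what the error reports. Unpack them in the right order:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, train_size=0.2)
You should correct the order of X_train, X_test, y_train and y_test. Note also that train_size=0.2 trains on only 20% of the data; if you meant to hold out 20% for testing, use test_size=0.2 instead, like this:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)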
See the relevant documentation for details.
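A runnable sketch of the corrected version is given below. Since awd.csv is not available, the dataframe is a synthetic stand-in with the same two columns (`Ages`, `Weights`); the 50 rows and the weight formula are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for awd.csv (the real file is not available here)
rng = np.random.default_rng(0)
ages = np.repeat(np.arange(1, 26), 2)  # 50 rows, each age appears twice
data = pd.DataFrame({
    "Ages": ages,
    "Weights": 15 + 5 * ages + rng.normal(scale=3, size=ages.size),
})

X = data.loc[:, ["Ages"]]   # 2D: DataFrame with one feature column
y = data.loc[:, "Weights"]  # 1D: target Series

# Order of the return values: X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

lr = LinearRegression()
lr.fit(X_train, y_train)

print(f"Test set score: {round(lr.score(X_test, y_test), 3)}")
print(f"Training set score: {round(lr.score(X_train, y_train), 3)}")
```

With the splits unpacked in this order, X_train and y_train have the same number of rows, so fit no longer raises the inconsistent-samples error.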

How do you get a prediction for a specific value in linear regression in python

I have trained my ML model with linear regression using this code:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
How do I get a prediction for a single row in the data set?
You can get predictions for the whole test set like this:
y_pred = regressor.predict(X_test)
If you want to run inference on a single row only, say the row at position 2, keep it two-dimensional, because predict expects a 2D input:
y_pred = regressor.predict(X_test.iloc[[2]])  # for a NumPy array: X_test[2].reshape(1, -1)
Scikit-learn models all have a predict method you can use.
Just pass it your row as a 2D array (one row, n_features columns) and you'll be fine:
regressor.predict(x_val)
Since you only want the first row, you could do
first_row = X_test.iloc[[0]]  # assuming X_test is a DataFrame holding your test data
y_pred = regressor.predict(first_row)
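Here is a small self-contained sketch of single-row prediction. The toy data (columns `x1`, `x2` and the target formula) is assumed for illustration. The key point is that predict expects a 2D input, so a single row is selected with a list inside `.iloc`, which returns a one-row DataFrame rather than a 1D Series.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the asker's X and y
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=40), "x2": rng.normal(size=40)})
y = 1.5 * X["x1"] - 2.0 * X["x2"] + 0.5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Double brackets keep the row 2D: shape (1, n_features)
single_row = X_test.iloc[[2]]
single_pred = regressor.predict(single_row)

# Equivalent: predict everything, then pick the entry at position 2
all_preds = regressor.predict(X_test)
print(single_pred[0], all_preds[2])
```

For a brand-new observation, a one-row DataFrame with the same column names as the training data can be passed in the same way.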
