Collate probabilities, predictions, coefficients from multiple samples of data using sklearn - python

I would like to combine model probabilities for class 1 predictions for ALL rows from multiple (random) splits/samples of data into a single dataframe in python.
I realize that not all rows will be selected in each split, but if data sampling is replicated enough times, each row will have been selected a few times at least and model probabilities generated.
My current approach basically creates multiple test-train splits (5 in example below), and collates probabilities from each training instance into a single dataframe as shown in below code with a mock dataset:
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
####Instantiate logistic regression objects
log = LogisticRegression(class_weight='balanced', random_state = 1)
#### import some data
iris = datasets.load_iris()
X = pd.DataFrame(iris.data[:100, :], columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"])
y = iris.target[:100,]
# start by creating the first column of probs table
probs_table = pd.DataFrame(X.index, columns=["members"])
# iterate over random states while keeping track of `i`
for i, state in enumerate([11, 444, 21, 109, 1900]):
train_x, test_x, train_y, test_y = train_test_split(
X, y, stratify=y, test_size=0.2, random_state=state)
pd.DataFrame(log.predict_proba(test_x)[:, 1]) #fit final model
probs_table[f"iter_{i+1}"] = pd.DataFrame(log.predict_proba(test_x)[:, 1])
probs_table
Unfortunately, I am not getting probabilities for all rows in the dataframe. Can somebody please guide me to the solution to this problem? And it would be ideal to include additional model outputs such as predictions, coefficientts for each iteration/data row.
Any other way to sample the data (i.e., other than test-train splitting) is fine as well as long as probabilities can be assembled for all dataframe rows.

There are a couple problems with the code as is:
.fit() is never called here. I'm assuming you'd like it fit right after the train/test split line and before the predict_proba() call?
When you place the values into the dataframe, you're creating a new column and I assume you want one column for all iterations while keeping track of which iteration it came from in each column?
Here is code that I believe accomplishes what you'd like. It 1) loops over each random state integer, 2) creates a new train/test split, 3) fits a new model each time, and 4) predicts on each test set row.
I also have it keep track of the original index so you can see how many times each original row ends up in the prediction data frame:
EDIT: Include the coefficients as a column
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
####Instantiate logistic regression objects
log = LogisticRegression(class_weight='balanced', random_state = 1)
#### import some data
iris = datasets.load_iris()
X = pd.DataFrame(iris.data[:100, :], columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"])
y = iris.target[:100,]
dfs = []
# iterate over random states while keeping track of `i`
for i, state in enumerate([11, 444, 21, 109, 1900]):
train_x, test_x, train_y, test_y = train_test_split(
X, y, stratify=y, test_size=0.2, random_state=state)
log.fit(train_x, train_y)
preds = log.predict_proba(test_x)[:, 1]
orig_indices = test_x.index
df = pd.DataFrame(data={
"orig_index": orig_indices,
"prediction": preds,
"iteration": f"iter_{i+1}",
"coefficients": [log.coef_[0]] * len(preds)})
dfs.append(df)
probs_table = pd.concat(dfs)
probs_table

Related

Sales Order Delivery time Prediction Using Random Forest

This is a very noob question. But I have implemented Random forest algorithm to predict number of days taken for delivery depending on origin, destination, vendor, etc.
I already implemented RF using the past 12 month's data(80% Train,20% Test data) and got good results
My question is that for implementing RF I already had no. of days taken for delivery but for the future In my dataset, I will not have that column. How am I suppose to use this already trained model for future predictions using origin, destination, dates, etc?
This is my randomforest, as you can see i split the dataset in 2 pieces: y and x. y is the predicted value or column and x is the whole dataset minus y. This way you can use your training set to predict in your case the delivery time.
NOTE: this code is for a forest REGRESSOR, if you need the classifier code, let me know!
Just the dataframe definitions:
y = df[targetkolom] #predicted column or target column
x = df.drop(targetkolom, 1) #Whole dataset minus target column
Whole code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
df = pd.read_csv('Dataset Carprices.csv')
df.head()
df = df.drop(['car_ID', 'highwaympg', 'citympg'], 1)
targetkolom = 'price'
#Preperation on CarName
i =0
while i < len(df.CarName):
df.CarName[i] = df.CarName[i].split()[0]
i += 1
pd.set_option('display.max_columns', 200)
#(df.describe())
#Dataset standardization
df = pd.get_dummies(df, columns=['CarName','fueltype','aspiration','doornumber','carbody',
'drivewheel','enginelocation','enginetype','cylindernumber',
'fuelsystem'], prefix="", prefix_sep="")
#print(df.info())
y = df[targetkolom]
x = df.drop(targetkolom, 1)
#Normalisation
x = (x-x.min())/(x.max()-x.min())
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.3 ,random_state=7)
model = RandomForestRegressor(n_estimators=10000, random_state=1)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('R2 score:', r2_score(y_test,y_pred))

Why does a column of 1s impact the results of a decision tree classifier?

I was testing sklearn's Pipeline on a randomly generated classification problem:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
x, y = make_classification(n_samples=100, n_features=5, random_state=10)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=0)
model = DecisionTreeClassifier(random_state=0)
pipe = Pipeline(steps=[('scale', StandardScaler()),
('poly', PolynomialFeatures(degree=2, include_bias=False)),
('model', model)])
pipe.fit(x_train, y_train)
pipe_pred = pipe.predict(x_test)
accuracy_score(y_test, pipe_pred)
This results in an accuracy score of .85. However, when I change the PolynomialFeatures argument include_bias to True, which just inserts a single column of 1s into the array, the accuracy score becomes .90. For visualization, below I have plotted the individual trees for the results when bias is True and when False:
When include_bias=True: True
When include_bias=False: False
These images were generated by plot_tree(pipe['model']).
The datasets are the same except when include_bias=True an additional column of 1s is inserted into column 0. So the column indexes for the include_bias=True data correspond to the i + 1 column index in the include_bias=False data. (e.g. with_bias[:, 5] == without_bias[:, 4])
Based on my understanding, the column of 1s shouldn't have an impact on the Decision Tree. What am I missing?
From the documentation for DecisionTreeClassifier:
random_state : int, RandomState instance, default=None
Controls the randomness of the estimator. The features are always randomly permuted at each split, even if splitter is set to "best". When max_features < n_features, the algorithm will select max_features at random at each split before finding the best split among them. But the best found split may vary across different runs, even if max_features=n_features. That is the case, if the improvement of the criterion is identical for several splits and one split has to be selected at random. To obtain a deterministic behaviour during fitting, random_state has to be fixed to an integer. See Glossary for details.
You've set the random_state, but having a different number of columns will nevertheless make those random shuffles different. Note that the value of gini is the same for both of your trees at each node, even though different features are making the splits.

add column to data set in python

I am trying to add predicted data back to my original dataset in Python. I think I'm supposed to use Pandas and ASSIGN and pd.DataFrame but I have no clue how to write this after reading all the documentation (sorry I'm new to all this and just started learning coding recently). I've written my code below and just need help with the code for adding my predictions back to the dataset. Thanks for the help!
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the Training set and Test set
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25,
random_state = 0)
# Feature Scaling X_train and X_test
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#Feature scaling the all independent variables used to build the model
whole_dataset = sc.transform(X)
# Fitting classifier to the Training set
# Create your Naive Bayes here
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict_proba(X_test)
# Predicting the results for the whole dataset
y_pred2 = classifier.predict_proba(whole_dataset)
# Add y_pred2 predictions back to the dataset
???
You can just do dataset['prediction'] = y_pred to add a new column.
Pandas supports a simple syntax for adding new columns, here it will add a new column and probably take a view on the numpy array returned from sklearn so it should be nice and fast.
EDIT
Looking at your code and the data, you're misunderstanding what train_test_split does, this is splitting the data into 3/4 1/4 splits of your original dataset which has 400 rows, your X train data contains 300 rows, the test data is 100 rows. You're then trying to assign back to your original dataset which is 400 rows. Firstly the number of rows don't match, secondly what is returned from predict_proba is a matrix of the predicted classes as a percentage. So what you want to do after training is to predict on the original dataset and assign this back as 2 columns by sub-selecting each column:
y_pred = classifier.predict_proba(X)
now assign this back :
dataset['predict_class_1'],dataset['predict_class_2'] = y_pred[:,0],y_pred[:,1]
There are several solutions. The answer of EdChurm had mentioned one.
As far as I know, pandas has other 2 methods to work with it.
df.insert()
df.assign()
Since you didn't provide the data in use, here's a pretty simple example.
import pandas as pd
import numpy as np
np.random.seed(1)
df = pd.DataFrame(np.random.randn(10), columns=['raw'])
df = df.assign(cube_raw=df['raw']**2)
df.insert(1,'square_raw',df['raw']**3)
df
raw square_raw cube_raw
0 1.624345 2.638498 4.285832
1 -0.611756 0.374246 -0.228947
2 -0.528172 0.278965 -0.147342
3 -1.072969 1.151262 -1.235268
4 0.865408 0.748930 0.648130
5 -2.301539 5.297080 -12.191435
6 1.744812 3.044368 5.311849
7 -0.761207 0.579436 -0.441071
8 0.319039 0.101786 0.032474
9 -0.249370 0.062186 -0.015507
Just keep in mind that df.assign() doesn't work inplace, so you should reassign to your previous variable.
In my opinion, I prefer df.insert() the most, for it allows you to assign which location you want to insert. (with parameter loc)

scikit-learn error: The least populated class in y has only 1 member

I'm trying to split my dataset into a training and a test set by using the train_test_split function from scikit-learn, but I'm getting this error:
In [1]: y.iloc[:,0].value_counts()
Out[1]:
M2 38
M1 35
M4 29
M5 15
M0 15
M3 15
In [2]: xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85, stratify=y)
Out[2]:
Traceback (most recent call last):
File "run_ok.py", line 48, in <module>
xtrain,xtest,ytrain,ytest = train_test_split(X,y,test_size=1/3,random_state=85,stratify=y)
File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1700, in train_test_split
train, test = next(cv.split(X=arrays[0], y=stratify))
File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 953, in split
for train, test in self._iter_indices(X, y, groups):
File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1259, in _iter_indices
raise ValueError("The least populated class in y has only 1"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
However, all classes have at least 15 samples. Why am I getting this error?
X is a pandas DataFrame which represents the data points, y is a pandas DataFrame with one column that contains the target variable.
I cannot post the original data because it's proprietary, but it is fairly reproducible by creating a random pandas DataFrame (X) with 1k rows x 500 columns, and a random pandas DataFrame (y) with the same number of rows (1k) of X, and, for each row the target variable (a categorical label).
The y pandas DataFrame should have different categorical labels (e.g. 'class1', 'class2'...) and each labels should have at least 15 occurrences.
The problem was that train_test_split takes as input 2 arrays, but the y array is a one-column matrix. If I pass only the first column of y it works.
train, xtest, ytrain, ytest = train_test_split(X, y.iloc[:,1], test_size=1/3,
random_state=85, stratify=y.iloc[:,1])
The main point is if you use stratified CV, then you will get this warning if the number of splits cannot produce all CV splits with the same ratio of all classes in the data. E.g. if you have 2 samples of one class, there will be 2 CV sets with 2 samples of this class, and 3 CV sets with 0 samples, hence the ratio samples for this class does not equal in all CV sets. But the problem is only if there is 0 samples in any of the sets, so if you have at least as many samples as the number of CV splits, i.e. 5 in this case, this warning won't appear.
See https://stackoverflow.com/a/48314533/2340939.
I have the same problem. Some of class has one or two items.(My problem is multi class problem). You can remove or union classes that has less items. I solve my problem like that.
Continuing with user2340939's answer. If you really need your train-test splits to be stratified despite the less number of rows in certain class, you can try using the following method. I generally use the same, where I'll make a copy of all the rows of such classes to both the train and test datasets..
from sklearn.model_selection import train_test_split
def get_min_required_rows(test_size=0.2):
return 1 / test_size
def make_stratified_splits(df, y_col="label", test_size=0.2):
"""
for any class with rows less than min_required_rows corresponding to the input test_size,
all the rows associated with the specific class will have a copy in both the train and test splits.
example: if test_size is 0.2 (20% otherwise),
min_required_rows = 5 (which is obtained from 1 / test_size i.e., 1 / 0.2)
where the resulting splits will have 4 train rows (80%), 1 test row (20%)..
"""
id_col = "id"
temp_col = "same-class-rows"
class_to_counts = df[y_col].value_counts()
df[temp_col] = df[y_col].apply(lambda y: class_to_counts[y])
min_required_rows = get_min_required_rows(test_size)
copy_rows = df[df[temp_col] < min_required_rows].copy(deep=True)
valid_rows = df[df[temp_col] >= min_required_rows].copy(deep=True)
X = valid_rows[id_col].tolist()
y = valid_rows[y_col].tolist()
# notice, this train_test_split is a stratified split
X_train, X_test, _, _ = train_test_split(X, y, test_size=test_size, random_state=43, stratify=y)
X_test = X_test + copy_rows[id_col].tolist()
X_train = X_train + copy_rows[id_col].tolist()
df.drop([temp_col], axis=1, inplace=True)
test_df = df[df[id_col].isin(X_test)].copy(deep=True)
train_df = df[df[id_col].isin(X_train)].copy(deep=True)
print (f"number of rows in the original dataset: {len(df)}")
test_prop = round(len(test_df) / len(df) * 100, 2)
train_prop = round(len(train_df) / len(df) * 100, 2)
print (f"number of rows in the splits: {len(train_df)} ({train_prop}%), {len(test_df)} ({test_prop}%)")
return train_df, test_df
I had this issue because some of my things to be split were lists, and some were arrays. When I converted the arrays to a list, it worked.
from sklearn.model_selection import train_test_split
all_keys = df['Key'].unique().tolist()
t_df = pd.DataFrame()
c_df = pd.DataFrame()
for key in all_keys:
print(key)
if df.loc[df['Key']==key].shape[0] < 2 :
t_df = t_df.append(df.loc[df['Key']==key])
else:
df_t, df_c = train_test_split(df.loc[df['Key']==key],test_size=0.2,stratify=df.loc[df['Key']==key]['Key'])
t_df = t_df.append(df_t)
c_df = c_df.append(df_c)
when you use stratify=y, combine the less number of categories under one category
for example: filter the labels less than 50 and label them as one single category like "others" or any name then the least populated class error will be solved.
Do you like "functional" programming? Like confusing your co-workers, and writing everything in one line of code? Are you the type of person who loves nested ternary operators, instead of 2 'if' statements? Are you an Elixir programmer trapped in a Python programmer's body?
If so, the following solution may work for you. It allows you to discover how many members the least-populated class has, in real-time, then adjust your cross-validation value on the fly:
""" Let's say our dataframe is like this, for example:
dogs weight size
---- ---- ----
Poodle 14 small
Maltese 13 small
Shepherd 45 big
Retriever 41 big
Burmese 43 big
The 'least populated class' would be 'small', as it only has 2 members.
If we tried doing more than 2-fold cross validation on this, the results
would be skewed.
"""
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
X = df['weight']
y = df['size']
# Random forest classifier, to classify dogs into big or small
model = RandomForestClassifier()
# Find the number of members in the least-populated class, THIS IS THE LINE WHERE THE MAGIC HAPPENS :)
leastPopulated = [x for d in set(list(y)) for x in list(y) if x == d].count(min([x for d in set(list(y)) for x in list(y) if x == d], key=[x for d in set(list(y)) for x in list(y) if x == d].count))
# I want to know the F1 score at each fold of cross validation.
# This 'fOne' variable will be a list of the F1 score from each fold
fOne = cross_val_score(model, X, y, cv=leastPopulated, scoring='f1_weighted')
# We print the F1 score here
print(f"Average F1 score during cross-validation: {np.mean(fOne)}")
Try this way, It worked for me which also mentioned here:
x_train, x_test, y_train, y_test = train_test_split(data_x,data_y,test_size=0.33, random_state=42) .
remove stratify=y while splitting train and test data
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85)
Remove stratify.
stratify=y
should only be used in case of classification problems, so that various output classes (say 'good', 'bad') can get equally distributed among train and test data. It is a sampling method in statistics. We should avoid using stratify in regression problems. The below code should work
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85)

Merging results from model.predict() with original pandas DataFrame?

I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.
from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_iris()
# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)
# add outcome variable
df['class'] = data.target
X = np.matrix(df.loc[:, [0, 1, 2, 3]])
y = np.array(df['class'])
# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# I've got my predictions now
y_hats = model.predict(X_test)
To merge these predictions back with the original df, I try this:
df['y_hats'] = y_hats
But that raises:
ValueError: Length of values does not match length of index
I know I could split the df into train_df and test_df and this problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the appropriate rows in my df, since the y_hats array is zero-indexed and seemingly all information about which rows were included in the X_test and y_test is lost? Or will I be relegated to splitting dataframes into train-test first, and then building feature matrices? I'd like to just fill the rows included in train with np.nan values in the dataframe.
your y_hats length will only be the length on the test data (20%) because you predicted on X_test. Once your model is validated and you're happy with the test predictions (by examining the accuracy of your model on the X_test predictions compared to the X_test true values), you should rerun the predict on the full dataset (X). Add these two lines to the bottom:
y_hats2 = model.predict(X)
df['y_hats'] = y_hats2
EDIT per your comment, here is an updated result the returns the dataset with the prediction appended where they were in the test datset
from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_iris()
# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)
# add outcome variable
df_class = pd.DataFrame(data = data.target)
# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# I've got my predictions now
y_hats = model.predict(X_test)
y_test['preds'] = y_hats
df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
I have the same problem (almost)
I fixed it this way
...
.
.
.
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_hats = model.predict(X_test)
y_hats = pd.DataFrame(y_hats)
df_out = X_test.reset_index()
df_out["Actual"] = y_test.reset_index()["Columns_Name"]
df_out["Prediction"] = y_hats.reset_index()[0]
y_test['preds'] = y_hats
df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
You can create a y_hat dataframe copying indices from X_test then merge with the original data.
y_hats_df = pd.DataFrame(data = y_hats, columns = ['y_hats'], index = X_test.index.copy())
df_out = pd.merge(df, y_hats_df, how = 'left', left_index = True, right_index = True)
Note, left join will include train data rows. Omitting 'how' parameter will result in just test data.
Try this:
y_hats2 = model.predict(X)
df[['y_hats']] = y_hats2
You can probably make a new dataframe and add to it the test data along with the predicted values:
data['y_hats'] = y_hats
data.to_csv('data1.csv')
predicted = m.predict(X_valid)
predicted_df = pd.DataFrame(data=predicted, columns=['y_hat'],
index=X_valid.index.copy())
df_out = pd.merge(X_valid, predicted_df, how ='left', left_index=True,
right_index=True)
This worked well for me. It maintains the indexing positions.
pred_prob = model.predict(X_test) # calculate prediction probabilities
pred_class = np.where(pred_prob >0.5, "Yes", "No") #for binary(Yes/No) category
predictions = pd.DataFrame(pred_class, columns=['Prediction'])
my_new_df = pd.concat([my_old_df, predictions], axis =1)
Here is a solution that worked for me:
It consists of building, for each of your folds/iterations, one dataframe which includes observed and predicted values for your test set; this way, you make use of the index (ID) contained in y_true, which should correspond to your subjects' IDs (in my code: 'SubjID').
You then concatenate the DataFrames that you generated (through 5 folds of test data in my case) and paste them back into your original dataset.
I hope this helps!
FoldNr = 0
for train_index, test_index in skf.split(X, y):
FoldNr = FoldNr + 1
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]
# [...] your model
# performance is measured on test set
y_true, y_pred = y_test, clf.predict(X_test)
# Save predicted values for each test set
a = pd.DataFrame(y_true).reset_index()
b = pd.Series(y_pred, name = 'y_pred')
globals()['ObsPred_df' + str(FoldNr)] = a.join(b)
globals()['ObsPred_df' + str(FoldNr)].set_index('SubjID', inplace=True)
# Create dataframe with observed and predicted values for all subjects
ObsPred_Concat = pd.concat([ObsPred_df1, ObsPred_df2, ObsPred_df3, ObsPred_df4, ObsPred_df5])
original_df['y_pred'] = ObsPred_Concat['y_pred']
First you need to convert y_val or y_test data into the DataFrame.
compare_df = pd.DataFrame(y_val)
then just create a new column with predicted data.
compare_df['predicted_res'] = y_pred_val
After that, you can easily filter the data that shows you which data is matching with original prediction based on a simple condition.
test_df = compare_df[compare_df['y_val'] == compare_df['predicted_res'] ]
you can also use
y_hats = model.predict(X)
df['y_hats'] = y_hats.reset_index()['name of the target column']

Categories

Resources