Fitting MultinomialNB on multiple columns of data - python

Given a table of data containing 100 rows, such as:
Place | Text | Value | Text_Two
europe | some random text | 3.2 | some more random text
america | the usa | 4.1 | the white house
...
I am trying to classify with the following:
df = pd.read_csv('data.csv')
mnb = MultinomialNB()
tf = TfidfVectorizer()
df.loc[df['Place'] == 'europe','Place'] = 0
df.loc[df['Place'] == 'america','Place'] = 1
X = df[['Text', 'Value', 'Text_Two']]
y = df['Place']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
X_train_tf = tf.fit_transform(X_train)
mnb.fit(X_train_tf, y_train)
The above produces the following error:
ValueError: Found input variables with inconsistent numbers of
samples: [3, 100]
So from what I understand it's only seeing the categories that were set with X = df[['Text', 'Value', 'Text_Two']], not the data within those categories.
The code above works if I only specify X for one category, such as:
X = df['Text']
Is it possible to fit the MultinomialNB on multiple categories of data?

This has nothing to do with MultinomialNB. It can handle multiple columns fine. The problem is TfidfVectorizer.
TfidfVectorizer only works on a one-dimensional iterable of strings (a single column of your dataframe) and will not do any kind of check on the shape or type of the input data.
It will only do this:
for doc in raw_documents:
...
...
When you pass a DataFrame to it (be it a single column or multiple columns), iterating with for doc in raw_documents: over a DataFrame only yields the column names, not the actual data. The X you pass has three columns, so only those three column names are used as documents, hence the error
ValueError: Found input variables with inconsistent numbers of samples: [3, 100]
because your y has length 100, while your X (even though it has 100 rows) effectively has length 3 after going through TfidfVectorizer.
So to solve this, you have two options:
1) You need to do individual tf-idf vectorization for each text column (Text, Text_Two) and then combine the resultant matrices to form the feature matrix to be used with MultinomialNB.
2) You can combine the two text columns into a single column, as another answer here suggests, and then do tf-idf on that single column.
Both options will result in different feature vectors, so you need to first understand what each one does and choose what you want.
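A minimal sketch of option 1, with made-up rows standing in for data.csv: vectorize each text column with its own TfidfVectorizer and stack the resulting sparse matrices side by side before fitting MultinomialNB.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

df = pd.DataFrame({
    'Text': ['some random text', 'the usa', 'more text here', 'another usa line'],
    'Text_Two': ['some more random text', 'the white house', 'extra words', 'capitol hill'],
    'Place': [0, 1, 0, 1],
})

# One vectorizer per text column, stacked horizontally: rows stay aligned
tf_one = TfidfVectorizer()
tf_two = TfidfVectorizer()
X = hstack([tf_one.fit_transform(df['Text']),
            tf_two.fit_transform(df['Text_Two'])])

mnb = MultinomialNB()
mnb.fit(X, df['Place'])
print(X.shape[0])   # one row per sample, not one per column
```

The numeric Value column could be appended to the stacked matrix the same way, keeping in mind that MultinomialNB requires non-negative features.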

I would rather combine the columns Text and Text_Two into one column and then build the classifier from there, since the TfidfVectorizer feeding MultinomialNB works on only one column of text. Below is code that combines the columns Text and Text_Two into one.
You might be interested in multi-class or multi-label classification, but that refers to the target variable (y) rather than the input variables (X):
http://scikit-learn.org/stable/modules/multiclass.html. Hope it helps.
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
df = pd.read_csv('data.csv', header=0, sep='|')
df.columns = [x.strip() for x in df.columns]
mnb = MultinomialNB()
tf = TfidfVectorizer()
#df.loc[df['Place'] == 'europe','Place'] = 0
#df.loc[df['Place'] == 'america','Place'] = 1
#X = df[['Text', 'Value', 'Text_Two']]
X = df.Text + " " + df.Text_Two  # add a space so the last word of Text doesn't fuse with the first word of Text_Two
y = df['Place']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipe.fit(X_train, y_train)
pipe.predict(X_test)


How can I split a dataframe using sklearn train test split such that there are equal proportions for each category?

I have a dataset with n independent variables and a categorical variable that I would like to perform a regression analysis on. The number of rows of data is different for each category. I would like to split the dataset into test and train data sets such that each category has an equivalent train test split, e.g. 80% to 20%. Here is a simplified reproducible example of what I'm doing.
import pandas as pd
import string
import numpy as np
from sklearn.model_selection import train_test_split
nrows=1000
cat_values = ['A','B','C','D']
# defining the category names
cats = np.random.choice(cat_values, size=(nrows))
# creating a random dataframe
df = pd.DataFrame(np.random.randint(0,1000,size=(nrows, 3)), columns=['variable 1','variable 2','variable 3'])
df['category'] = cats
y = np.random.rand(nrows)
# using sklearn to split into training and test datasets
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size = .2, random_state =0)
# printing the number of rows in the output training data set for each category
for i in range(len(cat_values)):
    print("number of rows in category " + str(cat_values[i]) + ": " + str(len(X_train[X_train['category'] == cat_values[i]])))
Output:
number of rows in category A: 221
number of rows in category B: 188
number of rows in category C: 179
number of rows in category D: 212
I would like the rows to be split e.g. 80:20 train:test for each categorical variable. I've looked at using StratifiedShuffleSplit (Train/test split preserving class proportions in each split) but there doesn't seem to be an option for specifying which column to stratify the split on (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html).
Is there a package that can split the data this way, or would I have to divide my dataframe into n categorical dataframes and perform a different train test split on each one before rejoining them?
Thanks for any assistance with this.
Use train_test_split with the stratify parameter, passing it the column you want the proportions preserved on:
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=.2, random_state=0, stratify=df['category']
)
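To confirm that stratifying on the category column keeps each category at roughly the 80:20 ratio, here is a minimal, self-contained sketch (random data standing in for the original dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
nrows = 1000

# Same shape as the example above: three numeric columns plus a category
df = pd.DataFrame(rng.integers(0, 1000, size=(nrows, 3)),
                  columns=['variable 1', 'variable 2', 'variable 3'])
df['category'] = rng.choice(['A', 'B', 'C', 'D'], size=nrows)
y = rng.random(nrows)

# Stratify on the categorical column, not on the continuous target
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=0, stratify=df['category'])

# Each category now lands ~80% in train
for cat in ['A', 'B', 'C', 'D']:
    frac = (X_train['category'] == cat).sum() / (df['category'] == cat).sum()
    print(cat, round(frac, 3))
```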

Preserving the index when selecting a slice of a pandas dataframe

So I am creating my training and test sets for use in a Multiple Linear Regression model using sklearn.
My dataset contains 182 features and looks like the following:
id feature1 feature2 .... feature182 Target
D24352 145 8 7 1
G09340 10 24 0 0
E40988 6 42 8 1
H42093 238 234 2 1
F32093 12 72 1 0
I have then have the following code;
import pandas as pd
dataset = pd.read_csv('C:\\mylocation\\myfile.csv')
dataset0 = dataset.set_index('t1.id')
dataset2 = pd.get_dummies(dataset0)
y = dataset0.iloc[:, 31:32].values
dataset2.pop('Target')
X = dataset2.iloc[:, :180].values
Once I use dataframe.iloc, however, I lose my indexes (which I have set to be my IDs). I would like to keep these, as I currently have no way of telling which records in my results relate to which records in my original dataset when I do the following step:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
It looks like your data is stored as object type. You should convert it to float64 (assuming all your data is numeric; otherwise convert only the columns that should be numeric). Since your index turns out to be of type string, you need to set the dtype of your dataframe after setting the index (and generating the dummies). Again, assuming the rest of your data is of numeric type:
dataset = pd.read_csv('C:\\mylocation\\myfile.csv')
dataset0 = dataset.set_index('t1.id')
dataset2 = pd.get_dummies(dataset0)
dataset0 = dataset0.astype(np.float64) # add this line to explicitly set the dtype
Now you should be able to just leave out values when slicing the DataFrame:
y = dataset0.iloc[:, 31:32]
dataset2.pop('Target')
X = dataset2.iloc[:, :180]
With .values you access the underlying numpy arrays of the DataFrame. These do not have an index column. Since sklearn is, in most cases, compatible with pandas, you can simply pass a pandas DataFrame to sklearn.
If this does not work, you can still apply reset_index to your DataFrame. This will add the index as a new column, which you will have to drop when passing the training data to sklearn:
dataset0.reset_index(inplace=True)
dataset2.reset_index(inplace=True)
y = dataset0.iloc[:, 31:32].values
dataset2.pop('Target')
X = dataset2.iloc[:, :180].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train.drop('index', axis=1), y_train.drop('index', axis=1))
y_pred = regressor.predict(X_test.drop('index', axis=1))
In this case you'll still have to change the slicing [:, 31:32] and [:, :180] to the correct columns, so that the index will be included in the slice.
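For completeness, here is a minimal sketch (made-up IDs in the style of the question) showing that if you slice DataFrames without .values, the string index travels through train_test_split untouched:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'feature1': [145, 10, 6, 238, 12],
                   'feature2': [8, 24, 42, 234, 72],
                   'Target': [1, 0, 1, 1, 0]},
                  index=['D24352', 'G09340', 'E40988', 'H42093', 'F32093'])

X = df[['feature1', 'feature2']]   # DataFrame slice: the id index survives
y = df['Target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
print(list(X_test.index))   # original string ids, not 0-based positions
```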

scikit learn logistic regression model tfidfvectorizer

I am trying to create a logistic regression model using scikit-learn with the code below. I am using 9 columns for the features (X) and one for the label (Y). When trying to fit, I get the error "ValueError: Found input variables with inconsistent numbers of samples: [9, 560000]", even though the lengths of X and Y are the same beforehand. If I use x.transpose() I get a different error: "AttributeError: 'int' object has no attribute 'lower'". I am assuming this has to do with the TfidfVectorizer; I am using it because 3 of the columns contain single words, and it wasn't working otherwise. Is this the right way to be doing this, or should I convert the words in those columns separately and then use train_test_split? If not, why am I getting the errors and how can I fix them? Here's an example of the csv.
df = pd.read_csv("UNSW-NB15_1.csv",header=None, names=cols, encoding = "UTF-8",low_memory=False)
df.to_csv('netraf.csv')
csv = 'netraf.csv'
my_df = pd.read_csv(csv)
x_features = my_df.columns[1:10]
x_data = my_df[x_features]
Y = my_df["Label"]
x_train, x_validation, y_train, y_validation = model_selection.train_test_split(
    x_data, Y, test_size=0.2, random_state=7)
tfidf_vectorizer = TfidfVectorizer()
lr = LogisticRegression()
tfidf_lr_pipe = Pipeline([('tfidf', tfidf_vectorizer), ('lr', lr)])
tfidf_lr_pipe.fit(x_train, y_train)
What you are trying to do is unusual, because TfidfVectorizer is designed to extract numerical features from text. But if you don't really care and just want to make your code work, one way to do it is to convert your numerical data to strings and configure TfidfVectorizer to accept tokenized data:
import pandas as pd
from sklearn import model_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
cols = ['srcip','sport','dstip','dsport','proto','service','smeansz','dmeansz','attack_cat','Label']
df = pd.read_csv("UNSW-NB15_1.csv",header=None, names=cols, encoding = "UTF-8",low_memory=False)
df.to_csv('netraf.csv')
csv = 'netraf.csv'
my_df = pd.read_csv(csv)
# convert all columns to string like we don't care
for col in my_df.columns:
    my_df[col] = my_df[col].astype(str)
# replace nan with empty string like we don't care
for col in my_df.columns[my_df.isna().any()].tolist():
    my_df.loc[:, col].fillna('', inplace=True)
x_features = my_df.columns[1:10]
x_data = my_df[x_features]
Y = my_df["Label"]
x_train, x_validation, y_train, y_validation = model_selection.train_test_split(
    x_data.values, Y.values, test_size=0.2, random_state=7)
# configure TfidfVectorizer to accept tokenized data
# reference http://www.davidsbatista.net/blog/2018/02/28/TfidfVectorizer/
tfidf_vectorizer = TfidfVectorizer(
    analyzer='word',
    tokenizer=lambda x: x,
    preprocessor=lambda x: x,
    token_pattern=None)
lr = LogisticRegression()
tfidf_lr_pipe = Pipeline([('tfidf', tfidf_vectorizer), ('lr', lr)])
tfidf_lr_pipe.fit(x_train, y_train)
That being said, I'd recommend you use another method to do feature engineering on your dataset. For example, you can try to encode your nominal data (e.g. IP, port) to numerical values.
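As a hedged sketch of that recommendation (column names borrowed from the question, data made up), nominal columns can be one-hot encoded with a ColumnTransformer while the numeric columns pass through unchanged:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'proto': ['tcp', 'udp', 'tcp', 'udp'],
    'service': ['http', 'dns', 'ftp', 'dns'],
    'smeansz': [120, 80, 300, 64],
    'dmeansz': [500, 90, 1200, 70],
    'Label': [0, 1, 0, 1],
})

# One-hot encode the nominal columns; numeric columns pass through unchanged
pre = ColumnTransformer(
    [('nominal', OneHotEncoder(handle_unknown='ignore'), ['proto', 'service'])],
    remainder='passthrough')

pipe = Pipeline([('pre', pre), ('lr', LogisticRegression())])
pipe.fit(df.drop(columns='Label'), df['Label'])
preds = pipe.predict(df.drop(columns='Label'))
print(preds)
```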

scikit-learn error: The least populated class in y has only 1 member

I'm trying to split my dataset into a training and a test set by using the train_test_split function from scikit-learn, but I'm getting this error:
In [1]: y.iloc[:,0].value_counts()
Out[1]:
M2 38
M1 35
M4 29
M5 15
M0 15
M3 15
In [2]: xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85, stratify=y)
Out[2]:
Traceback (most recent call last):
File "run_ok.py", line 48, in <module>
xtrain,xtest,ytrain,ytest = train_test_split(X,y,test_size=1/3,random_state=85,stratify=y)
File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1700, in train_test_split
train, test = next(cv.split(X=arrays[0], y=stratify))
File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 953, in split
for train, test in self._iter_indices(X, y, groups):
File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1259, in _iter_indices
raise ValueError("The least populated class in y has only 1"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
However, all classes have at least 15 samples. Why am I getting this error?
X is a pandas DataFrame which represents the data points, y is a pandas DataFrame with one column that contains the target variable.
I cannot post the original data because it's proprietary, but it is fairly reproducible by creating a random pandas DataFrame (X) with 1k rows x 500 columns, and a random pandas DataFrame (y) with the same number of rows (1k) of X, and, for each row the target variable (a categorical label).
The y pandas DataFrame should have different categorical labels (e.g. 'class1', 'class2'...) and each labels should have at least 15 occurrences.
The problem was that train_test_split takes 2 arrays as input, but the y array is a one-column matrix. If I pass only the first column of y, it works:
xtrain, xtest, ytrain, ytest = train_test_split(X, y.iloc[:, 0], test_size=1/3,
    random_state=85, stratify=y.iloc[:, 0])
The main point is that if you use stratified CV, you will get this warning whenever the number of splits cannot produce CV splits with the same ratio of all classes in the data. E.g. if you have 2 samples of one class and 5 splits, there will be 2 CV sets with 1 sample of this class and 3 CV sets with 0 samples, so the ratio of samples for this class is not equal across all CV sets. But the problem arises only when there are 0 samples in one of the sets, so if you have at least as many samples as the number of CV splits (5 in this case), this warning won't appear.
See https://stackoverflow.com/a/48314533/2340939.
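The 2-members-per-class rule can be checked directly; this minimal sketch reproduces the error with a singleton class and shows the split succeeding once every class has at least two members:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)
y = np.array(['a'] * 9 + ['b'])          # class 'b' has a single member

err = None
try:
    train_test_split(X, y, test_size=0.3, stratify=y)
except ValueError as exc:
    err = exc
print(err)  # The least populated class in y has only 1 member...

# With at least two members per class the stratified split succeeds
y2 = np.array(['a'] * 8 + ['b'] * 2)
parts = train_test_split(X, y2, test_size=0.3, stratify=y2)
print(len(parts))  # 4: X_train, X_test, y_train, y_test
```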
I had the same problem. Some of my classes had only one or two items (my problem is a multi-class problem). You can remove, or merge, the classes that have too few items. That is how I solved my problem.
Continuing with user2340939's answer: if you really need your train-test splits to be stratified despite the low number of rows in certain classes, you can try the following method. I generally use the same approach, where I make a copy of all the rows of such classes in both the train and test datasets.
from sklearn.model_selection import train_test_split

def get_min_required_rows(test_size=0.2):
    return 1 / test_size

def make_stratified_splits(df, y_col="label", test_size=0.2):
    """
    for any class with rows less than min_required_rows corresponding to the input test_size,
    all the rows associated with the specific class will have a copy in both the train and test splits.
    example: if test_size is 0.2 (20% otherwise),
    min_required_rows = 5 (which is obtained from 1 / test_size i.e., 1 / 0.2)
    where the resulting splits will have 4 train rows (80%), 1 test row (20%)..
    """
    id_col = "id"
    temp_col = "same-class-rows"
    class_to_counts = df[y_col].value_counts()
    df[temp_col] = df[y_col].apply(lambda y: class_to_counts[y])
    min_required_rows = get_min_required_rows(test_size)
    copy_rows = df[df[temp_col] < min_required_rows].copy(deep=True)
    valid_rows = df[df[temp_col] >= min_required_rows].copy(deep=True)
    X = valid_rows[id_col].tolist()
    y = valid_rows[y_col].tolist()
    # notice, this train_test_split is a stratified split
    X_train, X_test, _, _ = train_test_split(X, y, test_size=test_size, random_state=43, stratify=y)
    X_test = X_test + copy_rows[id_col].tolist()
    X_train = X_train + copy_rows[id_col].tolist()
    df.drop([temp_col], axis=1, inplace=True)
    test_df = df[df[id_col].isin(X_test)].copy(deep=True)
    train_df = df[df[id_col].isin(X_train)].copy(deep=True)
    print(f"number of rows in the original dataset: {len(df)}")
    test_prop = round(len(test_df) / len(df) * 100, 2)
    train_prop = round(len(train_df) / len(df) * 100, 2)
    print(f"number of rows in the splits: {len(train_df)} ({train_prop}%), {len(test_df)} ({test_prop}%)")
    return train_df, test_df
I had this issue because some of my things to be split were lists, and some were arrays. When I converted the arrays to a list, it worked.
from sklearn.model_selection import train_test_split

all_keys = df['Key'].unique().tolist()
t_df = pd.DataFrame()
c_df = pd.DataFrame()
for key in all_keys:
    print(key)
    if df.loc[df['Key'] == key].shape[0] < 2:
        t_df = t_df.append(df.loc[df['Key'] == key])
    else:
        df_t, df_c = train_test_split(df.loc[df['Key'] == key], test_size=0.2, stratify=df.loc[df['Key'] == key]['Key'])
        t_df = t_df.append(df_t)
        c_df = c_df.append(df_c)
When you use stratify=y, combine the categories with too few members into one category.
For example: filter the labels with fewer than 50 occurrences and relabel them as one single category such as "others"; the least populated class error will then be resolved.
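A sketch of that idea, with a hypothetical threshold of 2 and label names in the style of the question: relabel every class below the threshold as "others" before stratifying.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

labels = pd.Series(['M1'] * 40 + ['M2'] * 35 + ['M6', 'M7'])  # M6, M7 occur once each
X = pd.DataFrame({'feat': range(len(labels))})

# Merge any class with fewer than 2 members into a single "others" bucket
counts = labels.value_counts()
merged = labels.where(labels.map(counts) >= 2, 'others')

# Stratifying on the merged labels no longer raises the error
X_train, X_test, y_train, y_test = train_test_split(
    X, merged, test_size=1/3, random_state=85, stratify=merged)
print(merged.value_counts().to_dict())
```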
Do you like "functional" programming? Like confusing your co-workers, and writing everything in one line of code? Are you the type of person who loves nested ternary operators, instead of 2 'if' statements? Are you an Elixir programmer trapped in a Python programmer's body?
If so, the following solution may work for you. It allows you to discover how many members the least-populated class has, in real-time, then adjust your cross-validation value on the fly:
""" Let's say our dataframe is like this, for example:
dogs weight size
---- ---- ----
Poodle 14 small
Maltese 13 small
Shepherd 45 big
Retriever 41 big
Burmese 43 big
The 'least populated class' would be 'small', as it only has 2 members.
If we tried doing more than 2-fold cross validation on this, the results
would be skewed.
"""
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
X = df['weight']
y = df['size']
# Random forest classifier, to classify dogs into big or small
model = RandomForestClassifier()
# Find the number of members in the least-populated class, THIS IS THE LINE WHERE THE MAGIC HAPPENS :)
leastPopulated = [x for d in set(list(y)) for x in list(y) if x == d].count(min([x for d in set(list(y)) for x in list(y) if x == d], key=[x for d in set(list(y)) for x in list(y) if x == d].count))
# I want to know the F1 score at each fold of cross validation.
# This 'fOne' variable will be a list of the F1 score from each fold
fOne = cross_val_score(model, X, y, cv=leastPopulated, scoring='f1_weighted')
# We print the F1 score here
print(f"Average F1 score during cross-validation: {np.mean(fOne)}")
Try this way; it worked for me (as also mentioned here):
x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.33, random_state=42)
Remove stratify=y while splitting the train and test data:
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85)
Remove stratify.
stratify=y
should only be used for classification problems, so that the various output classes (say, 'good' and 'bad') are distributed equally among the train and test data. It is a sampling method from statistics. Avoid using stratify in regression problems. The code below should work:
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85)

Merging results from model.predict() with original pandas DataFrame?

I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_iris()
# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)
# add outcome variable
df['class'] = data.target
X = np.matrix(df.loc[:, [0, 1, 2, 3]])
y = np.array(df['class'])
# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# I've got my predictions now
y_hats = model.predict(X_test)
To merge these predictions back with the original df, I try this:
df['y_hats'] = y_hats
But that raises:
ValueError: Length of values does not match length of index
I know I could split the df into train_df and test_df and this problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the appropriate rows in my df, since the y_hats array is zero-indexed and seemingly all information about which rows were included in the X_test and y_test is lost? Or will I be relegated to splitting dataframes into train-test first, and then building feature matrices? I'd like to just fill the rows included in train with np.nan values in the dataframe.
Your y_hats length will only be the length of the test data (20%), because you predicted on X_test. Once your model is validated and you're happy with the test predictions (by examining the accuracy of your model on the X_test predictions compared to the X_test true values), you should rerun the prediction on the full dataset (X). Add these two lines to the bottom:
y_hats2 = model.predict(X)
df['y_hats'] = y_hats2
EDIT: per your comment, here is an updated version that returns the dataset with the predictions appended where they were in the test dataset:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_iris()
# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)
# add outcome variable
df_class = pd.DataFrame(data = data.target)
# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# I've got my predictions now
y_hats = model.predict(X_test)
y_test['preds'] = y_hats
df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
I have the same problem (almost)
I fixed it this way
...
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_hats = model.predict(X_test)
y_hats = pd.DataFrame(y_hats)
df_out = X_test.reset_index()
df_out["Actual"] = y_test.reset_index()["Columns_Name"]
df_out["Prediction"] = y_hats.reset_index()[0]
y_test['preds'] = y_hats
df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
You can create a y_hat dataframe copying indices from X_test then merge with the original data.
y_hats_df = pd.DataFrame(data = y_hats, columns = ['y_hats'], index = X_test.index.copy())
df_out = pd.merge(df, y_hats_df, how = 'left', left_index = True, right_index = True)
Note, left join will include train data rows. Omitting 'how' parameter will result in just test data.
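Put together as a runnable sketch on the iris data from the question (same train_size and model as above), the index-copy approach looks like this:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
df = pd.DataFrame(data.data)
df['class'] = data.target

# Keep everything as DataFrames/Series so the index survives the split
X_train, X_test, y_train, y_test = train_test_split(
    df[[0, 1, 2, 3]], df['class'], train_size=0.8, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_hats = model.predict(X_test)

# Reuse X_test's index so predictions line up with the original rows
y_hats_df = pd.DataFrame(y_hats, columns=['y_hats'], index=X_test.index.copy())
df_out = df.merge(y_hats_df, how='left', left_index=True, right_index=True)

print(df_out['y_hats'].notna().sum())  # only the test rows carry predictions
```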
Try this:
y_hats2 = model.predict(X)
df[['y_hats']] = y_hats2
You can probably make a new dataframe and add to it the test data along with the predicted values:
data['y_hats'] = y_hats
data.to_csv('data1.csv')
predicted = m.predict(X_valid)
predicted_df = pd.DataFrame(data=predicted, columns=['y_hat'],
index=X_valid.index.copy())
df_out = pd.merge(X_valid, predicted_df, how ='left', left_index=True,
right_index=True)
This worked well for me. It maintains the indexing positions.
pred_prob = model.predict(X_test) # calculate prediction probabilities
pred_class = np.where(pred_prob >0.5, "Yes", "No") #for binary(Yes/No) category
predictions = pd.DataFrame(pred_class, columns=['Prediction'])
my_new_df = pd.concat([my_old_df, predictions], axis =1)
Here is a solution that worked for me:
It consists of building, for each of your folds/iterations, one dataframe which includes observed and predicted values for your test set; this way, you make use of the index (ID) contained in y_true, which should correspond to your subjects' IDs (in my code: 'SubjID').
You then concatenate the DataFrames that you generated (through 5 folds of test data in my case) and paste them back into your original dataset.
I hope this helps!
FoldNr = 0
for train_index, test_index in skf.split(X, y):
    FoldNr = FoldNr + 1
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # [...] your model
    # performance is measured on test set
    y_true, y_pred = y_test, clf.predict(X_test)
    # Save predicted values for each test set
    a = pd.DataFrame(y_true).reset_index()
    b = pd.Series(y_pred, name='y_pred')
    globals()['ObsPred_df' + str(FoldNr)] = a.join(b)
    globals()['ObsPred_df' + str(FoldNr)].set_index('SubjID', inplace=True)

# Create dataframe with observed and predicted values for all subjects
ObsPred_Concat = pd.concat([ObsPred_df1, ObsPred_df2, ObsPred_df3, ObsPred_df4, ObsPred_df5])
original_df['y_pred'] = ObsPred_Concat['y_pred']
First, convert your y_val or y_test data into a DataFrame.
compare_df = pd.DataFrame(y_val)
then just create a new column with predicted data.
compare_df['predicted_res'] = y_pred_val
After that, you can easily filter for the rows where the prediction matches the original value, based on a simple condition.
test_df = compare_df[compare_df['y_val'] == compare_df['predicted_res'] ]
you can also use
y_hats = model.predict(X)
df['y_hats'] = y_hats.reset_index()['name of the target column']
