Python: Combine predicted y-variable labels with the original dataframe

I have a multi-class label prediction problem, identifying, let's say, fruits as an example. I am able to get predictions from the model's fit and predict functions, and I have also trained and tested the model. Below is the code. I am trying to merge my "y predictions" from the variable "forest_y_pred" back into my original data set so that I can compare the original target variable to the predicted target variable in a data frame. I have 2 questions:
1) Is y_test the same as forest_y_pred = forest.predict(X_test)? I am getting exactly the same results when I compare them. Am I getting this wrong? I am a bit confused here: predict() is supposed to predict X_test, not generate exactly the same results as y_test.
2) I am trying to merge forest_y_pred = forest.predict(X_test) back into df. Here is what I tried, based on: Merging results from model.predict() with original pandas DataFrame?
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
# Load Data
df = pd.read_excel('../data/file.xlsx', converters={'col1': str})
df = df.set_index('INDEX_ID') # Setting index id
df
# Doing it this way because of the index; INDEX_ID is a column in the df
X_train, X_test, y_train, y_test = train_test_split(df.loc[:, ~df.columns.isin(['Target'])], df.Target, train_size=0.5)
print(y_test[:5])
type(y_test) #pandas.core.series.Series
ID
12 Apples
124 Oranges
345 Apples
123 Oranges
232 Kiwi
forest = RandomForestClassifier()
# Training
forest_model = forest.fit(X_train, y_train)
print(forest_model)
# Predictions
forest_y_pred = forest.predict(X_test)
print("forest_y_pred:\n",forest_y_pred[:5])
['Apples','Oranges','Apples','Oranges','Kiwi']
y_test['preds'] = forest_y_pred
print(y_test['preds'][:5])
['Apples','Oranges','Apples','Oranges','Kiwi']
df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
# ValueError: can not merge DataFrame with instance of type <class 'pandas.core.series.Series'>
# How do I fix this? I tried a ton of ways to convert between ndarray, Series, and DataFrame... nothing I tried is working so far. Thanks a bunch!!
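A minimal sketch of the usual fix, assuming X_test kept the INDEX_ID index (it does when splitting the indexed DataFrame as above): wrap the predictions in a DataFrame that reuses that index, then join it back.
# build a predictions frame aligned on X_test's index, then left-join onto df;
# rows that were in the training split get NaN in 'preds'
preds = pd.DataFrame({'preds': forest_y_pred}, index=X_test.index)
df_out = df.join(preds, how='left')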

Related

How to safely resolve Setting With Copy Warning on assigning over a Pandas DataFrame [duplicate]

I am getting a SettingWithCopyWarning from Pandas when performing the below operation. I understand what the warning means and I know I can turn the warning off but I am curious if I am performing this type of standardization incorrectly using a pandas dataframe (I have mixed data with categorical and numeric columns). My numbers seem fine after checking but I would like to clean up my syntax to make sure I am using Pandas correctly.
I am curious if there is a better workflow for this type of operation when dealing with data sets that have mixed data types like this.
My process is as follows with some toy data:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from typing import List
# toy data with categorical and numeric data
df: pd.DataFrame = pd.DataFrame([['0',100,'A', 10],
['1',125,'A',15],
['2',134,'A',20],
['3',112,'A',25],
['4',107,'B',35],
['5',68,'B',50],
['6',321,'B',10],
['7',26,'B',27],
['8',115,'C',64],
['9',100,'C',72],
['10',74,'C',18],
['11',63,'C',18]], columns = ['id', 'weight','type','age'])
df.dtypes
id object
weight int64
type object
age int64
dtype: object
# select categorical data for later operations
cat_cols: List = df.select_dtypes(include=['object']).columns.values.tolist()
# select numeric columns for later operations
numeric_cols: List = df.columns[df.dtypes.apply(lambda x: np.issubdtype(x, np.number))].values.tolist()
# prepare data for modeling by splitting into train and test
# use only standardization means/standard deviations from the TRAINING SET only
# and apply them to the testing set as to avoid information leakage from training set into testing set
X: pd.DataFrame = df.copy()
y: pd.Series = df.pop('type')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# perform standardization of numeric variables using the mean and standard deviations of the training set only
X_train_numeric_tmp: pd.DataFrame = X_train[numeric_cols].values
X_train_scaler = preprocessing.StandardScaler().fit(X_train_numeric_tmp)
X_train[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_train[numeric_cols])
X_test[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_test[numeric_cols])
<ipython-input-15-74f3f6c70f6a>:10: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Your X_train, X_test are still slices of the original dataframe. Modifying a slice triggers the warning and often doesn't work.
You can either transform before train_test_split, or do X_train = X_train.copy() after the split and then transform.
The second approach prevents the information leakage mentioned in your comments. So something like this:
# these 2 lines don't look good to me
# X: pd.DataFrame = df.copy() # don't you drop the label?
# y: pd.Series = df.pop('type') # y = df['type']
# pass them directly instead
features = [c for c in df if c!='type']
X_train, X_test, y_train, y_test = train_test_split(df[features], df['type'],
test_size = 0.2,
random_state = 0)
# now copy what we want to transform
X_train = X_train.copy()
X_test = X_test.copy()
## Code below should work without warning
############
# perform standardization of numeric variables using the mean and standard deviations of the training set only
# you don't need to copy the data to fit
# X_train_numeric_tmp: pd.DataFrame = X_train[numeric_cols].values
X_train_scaler = preprocessing.StandardScaler().fit(X_train[numeric_cols])
X_train[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_train[numeric_cols])
X_test[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_test[numeric_cols])
I'll try to explain both pd.get_dummies() and OneHotEncoder() for transforming categorical data into dummy columns, but I do recommend using the OneHotEncoder() transformer, because it's a sklearn transformer that you can use in a Pipeline later if you want.
First, OneHotEncoder(): it does the same job as pandas' pd.get_dummies function, but this class returns a NumPy ndarray or a sparse matrix. You can read more about it in the scikit-learn documentation:
from sklearn.preprocessing import OneHotEncoder
X_train_cat = X_train[["type"]]
cat_encoder = OneHotEncoder(sparse=False) # renamed to sparse_output in sklearn >= 1.2
X_train_cat_1hot = cat_encoder.fit_transform(X_train_cat) # this is a numpy ndarray!
#If you want to make a DataFrame again, you can do so like below:
#X_train_cat_1hot = pd.DataFrame(X_train_cat_1hot, columns=cat_encoder.categories_[0])
#You can also concatenate this transformed dataframe with your numerical transformed one.
Second method, pd.get_dummies():
df_dummies = pd.get_dummies(X_train[["type"]])
X_train = pd.concat([X_train, df_dummies], axis=1).drop("type", axis=1)
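Since OneHotEncoder() is recommended above for Pipeline use, here is a minimal sketch of that workflow, assuming the numeric_cols/cat_cols lists and the train/test split from the question: a ColumnTransformer fitted on the training split only handles the mixed dtypes and avoids both the copy warning and the leakage.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# categorical *feature* columns: the object columns minus the label
# (in this toy data only 'id' remains, which you would normally drop as an identifier)
feature_cat_cols = [c for c in cat_cols if c != 'type']
preprocess = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),                            # scaled with train statistics
    ('cat', OneHotEncoder(handle_unknown='ignore'), feature_cat_cols),  # dummy columns
])
X_train_t = preprocess.fit_transform(X_train)  # fit on train only: no leakage
X_test_t = preprocess.transform(X_test)        # reuse the train means/stds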

Preserving the index when selecting a slice of a pandas dataframe

So I am creating my training and test sets for use in a Multiple Linear Regression model using sklearn.
My dataset contains 182 features and looks like the following:
id feature1 feature2 .... feature182 Target
D24352 145 8 7 1
G09340 10 24 0 0
E40988 6 42 8 1
H42093 238 234 2 1
F32093 12 72 1 0
I then have the following code:
import pandas as pd
dataset = pd.read_csv('C:\\mylocation\\myfile.csv')
dataset0 = dataset.set_index('t1.id')
dataset2 = pd.get_dummies(dataset0)
y = dataset0.iloc[:, 31:32].values
dataset2.pop('Target')
X = dataset2.iloc[:, :180].values
Once I use dataframe.iloc, however, I lose my indexes (which I have set to be my IDs). I would like to keep these, as I currently have no way of telling which records in my results relate to which records in my original dataset when I do the following step:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
It looks like your data is stored as object type. You should convert it to float64 (assuming that all of your data is numeric; otherwise convert only the columns you want to be numeric). Since your index is of type string, you need to set the dtype of your dataframe after setting the index (and generating the dummies). Again, assuming the rest of your data is numeric:
import numpy as np
dataset = pd.read_csv('C:\\mylocation\\myfile.csv')
dataset0 = dataset.set_index('t1.id')
dataset2 = pd.get_dummies(dataset0)
dataset0 = dataset0.astype(np.float64) # add this line to explicitly set the dtype
Now you should be able to just leave out .values when slicing the DataFrame:
y = dataset0.iloc[:, 31:32]
dataset2.pop('Target')
X = dataset2.iloc[:, :180]
With .values you access the underlying numpy arrays of the DataFrame. These do not have an index column. Since sklearn is, in most cases, compatible with pandas, you can simply pass a pandas DataFrame to sklearn.
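For example, a sketch using the imports from the question and assuming X and y are the DataFrame slices above (not .values arrays):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression().fit(X_train, y_train)
# X_test is still a DataFrame, so its index carries the t1.id values;
# the predictions can be paired with those ids directly
preds = pd.Series(regressor.predict(X_test).ravel(), index=X_test.index, name='y_pred')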
If this does not work, you can still apply reset_index to your DataFrame. This will add the index as a new column, which you will have to drop when passing the training data to sklearn:
dataset0.reset_index(inplace=True)
dataset2.reset_index(inplace=True)
y = dataset0.iloc[:, 31:32] # keep these as DataFrames (no .values) so .drop('index') below works
dataset2.pop('Target')
X = dataset2.iloc[:, :180]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train.drop('index', axis=1), y_train.drop('index', axis=1))
y_pred = regressor.predict(X_test.drop('index', axis=1))
In this case you'll still have to change the slicing [:, 31:32] and [:, :180] to the correct columns, so that the index will be included in the slice.

Add column to data set in Python

I am trying to add predicted data back to my original dataset in Python. I think I'm supposed to use pandas with assign and pd.DataFrame, but I have no clue how to write this after reading all the documentation (sorry, I'm new to all this and just started learning to code recently). I've written my code below and just need help with the code for adding my predictions back to the dataset. Thanks for the help!
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature Scaling X_train and X_test
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Feature scaling all the independent variables used to build the model
whole_dataset = sc.transform(X)
# Fitting classifier to the Training set
# Create your Naive Bayes here
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict_proba(X_test)
# Predicting the results for the whole dataset
y_pred2 = classifier.predict_proba(whole_dataset)
# Add y_pred2 predictions back to the dataset
???
You can just do dataset['prediction'] = y_pred to add a new column.
Pandas supports a simple syntax for adding new columns, here it will add a new column and probably take a view on the numpy array returned from sklearn so it should be nice and fast.
EDIT
Looking at your code and the data, you're misunderstanding what train_test_split does: it splits your original 400-row dataset into 3/4 and 1/4 splits, so your X train data contains 300 rows and the test data 100 rows. You're then trying to assign back to your original dataset, which is 400 rows. Firstly, the number of rows doesn't match; secondly, what predict_proba returns is a matrix of predicted class probabilities. So what you want to do after training is to predict on the original dataset and assign this back as 2 columns by sub-selecting each column:
y_pred = classifier.predict_proba(X)
Now assign this back:
dataset['predict_class_1'],dataset['predict_class_2'] = y_pred[:,0],y_pred[:,1]
There are several solutions; EdChurm's answer mentioned one. As far as I know, pandas has 2 other methods for this:
df.insert()
df.assign()
Since you didn't provide the data in use, here's a pretty simple example.
import pandas as pd
import numpy as np
np.random.seed(1)
df = pd.DataFrame(np.random.randn(10), columns=['raw'])
df = df.assign(cube_raw=df['raw']**3)
df.insert(1,'square_raw',df['raw']**2)
df
raw square_raw cube_raw
0 1.624345 2.638498 4.285832
1 -0.611756 0.374246 -0.228947
2 -0.528172 0.278965 -0.147342
3 -1.072969 1.151262 -1.235268
4 0.865408 0.748930 0.648130
5 -2.301539 5.297080 -12.191435
6 1.744812 3.044368 5.311849
7 -0.761207 0.579436 -0.441071
8 0.319039 0.101786 0.032474
9 -0.249370 0.062186 -0.015507
Just keep in mind that df.assign() doesn't work in place, so you should reassign the result to your previous variable.
In my opinion, I prefer df.insert(), since it allows you to specify the location to insert at (with the loc parameter).

Map predictions back to IDs - Python Scikit Learn DecisionTreeClassifier

I have a dataset that has a unique identifier and other features. It looks like this
ID LenA TypeA LenB TypeB Diff Score Response
123-456 51 M 101 L 50 0.2 0
234-567 46 S 49 S 3 0.9 1
345-678 87 M 70 M 17 0.7 0
I split it up into training and test data. I am trying to classify test data into two classes using a classifier trained on the training data. I want the identifier in the training and testing dataset so I can map the predictions back to the IDs. Is there a way I can mark the identifier column as an ID or non-predictor, like we can do in Azure ML Studio or SAS?
I am using the DecisionTreeClassifier from Scikit-Learn. This is the code I have for the classifier.
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(traindata, trainlabels)
If I just include the ID in the traindata, the code throws an error:
ValueError: invalid literal for float(): 123-456
Not knowing how you made your split, I would suggest just making sure the ID column is not included in your training data. Something like this perhaps:
X_train, X_test, y_train, y_test = train_test_split(df.loc[:, ~df.columns.isin(['ID', 'Response'])].values, df.Response)
That will split only the values from the DataFrame not in ID or Response for the X values, and split Response for the y values.
But you will still not be able to use the DecisionTreeClassifier with this data, as it contains strings. You will need to convert any column with categorical data, i.e. TypeA and TypeB, to a numerical representation. The best way to do this in sklearn, in my opinion, is with the LabelEncoder. Using this will convert the categorical string labels ['M', 'S'] into integer codes like [0, 1], which can be used with the DecisionTreeClassifier. If you need an example, take a look at Passing categorical data to sklearn decision tree.
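A minimal sketch of that encoding step, assuming the data sits in a DataFrame df as shown above:
from sklearn.preprocessing import LabelEncoder

# replace each categorical feature column with integer codes the tree can consume
for col in ['TypeA', 'TypeB']:
    df[col] = LabelEncoder().fit_transform(df[col])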
Update
Per your comment I now understand that you need to map back to the ID. In this case you can leverage pandas to your advantage. Set ID as the index of your data and then do the split, that way you will retain the ID value for all of your train and test data. Let's assume your data are already in a pandas dataframe.
df = df.set_index('ID')
X_train, X_test, y_train, y_test = train_test_split(df.loc[:, ~df.columns.isin(['Response'])], df.Response)
print(X_train)
LenA TypeA LenB TypeB Diff Score
ID
345-678 87 M 70 M 17 0.7
234-567 46 S 49 S 3 0.9
A pandas dataframe keeps its row order when you apply transformations (except join/merge, which can create or drop rows).
So, here it is step by step (a consolidated sketch follows the list):
1. Create the df_test dataframe, which contains the 'id' column.
2. Create df_test2 without the 'id' column: df_test2 = df_test.drop(["id"], axis=1)
3. Feed df_test2 into the model for prediction: pred = model.predict(df_test2)
4. Create df_pred_final from the 'id' column of df_test: df_pred_final = df_test[["id"]]
5. Add a 'target' column to df_pred_final; each id-target pair will be mapped correctly: df_pred_final["target"] = pred
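Put together, a minimal sketch of those steps, assuming df_test and a fitted model exist as described:
df_test2 = df_test.drop(["id"], axis=1)  # features only, identifier removed
pred = model.predict(df_test2)           # row order is preserved
df_pred_final = df_test[["id"]].copy()   # start from the id column
df_pred_final["target"] = pred           # ids and predictions stay paired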
Please take a look at my Kaggle notebook; you might get the idea:
https://www.kaggle.com/tthien/20210412-complex-drop-c10-c2

Merging results from model.predict() with original pandas DataFrame?

I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_iris()
# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)
# add outcome variable
df['class'] = data.target
X = np.matrix(df.loc[:, [0, 1, 2, 3]])
y = np.array(df['class'])
# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# I've got my predictions now
y_hats = model.predict(X_test)
To merge these predictions back with the original df, I try this:
df['y_hats'] = y_hats
But that raises:
ValueError: Length of values does not match length of index
I know I could split the df into train_df and test_df and this problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the appropriate rows in my df, since the y_hats array is zero-indexed and seemingly all information about which rows were included in the X_test and y_test is lost? Or will I be relegated to splitting dataframes into train-test first, and then building feature matrices? I'd like to just fill the rows included in train with np.nan values in the dataframe.
Your y_hats length will only be the length of the test data (20%) because you predicted on X_test. Once your model is validated and you're happy with the test predictions (judged by comparing your model's predictions on X_test to the X_test true values), you should rerun the predict on the full dataset (X). Add these two lines to the bottom:
y_hats2 = model.predict(X)
df['y_hats'] = y_hats2
EDIT: per your comment, here is an updated version that returns the dataset with the predictions appended where they were in the test dataset:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_iris()
# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)
# add outcome variable
df_class = pd.DataFrame(data = data.target)
# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# I've got my predictions now
y_hats = model.predict(X_test)
y_test['preds'] = y_hats
df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
I had almost the same problem, and I fixed it this way:
...
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_hats = model.predict(X_test)
y_hats = pd.DataFrame(y_hats)
df_out = X_test.reset_index()
df_out["Actual"] = y_test.reset_index()["Columns_Name"]
df_out["Prediction"] = y_hats.reset_index()[0]
You can create a y_hats dataframe that copies its index from X_test, then merge it with the original data.
y_hats_df = pd.DataFrame(data = y_hats, columns = ['y_hats'], index = X_test.index.copy())
df_out = pd.merge(df, y_hats_df, how = 'left', left_index = True, right_index = True)
Note: a left join will include the train data rows; omitting the 'how' parameter will result in just the test data.
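For instance, with the y_hats_df built above:
# how='left' keeps all 150 iris rows; train rows get NaN in 'y_hats'
df_all = pd.merge(df, y_hats_df, how='left', left_index=True, right_index=True)
# the default how='inner' keeps only the test rows present in y_hats_df
df_test_only = pd.merge(df, y_hats_df, left_index=True, right_index=True)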
Try this:
y_hats2 = model.predict(X)
df['y_hats'] = y_hats2
You can probably make a new dataframe and add the test data to it along with the predicted values:
data['y_hats'] = y_hats
data.to_csv('data1.csv')
predicted = m.predict(X_valid)
predicted_df = pd.DataFrame(data=predicted, columns=['y_hat'],
index=X_valid.index.copy())
df_out = pd.merge(X_valid, predicted_df, how ='left', left_index=True,
right_index=True)
This worked well for me. It maintains the indexing positions.
pred_prob = model.predict(X_test) # prediction scores; this assumes the model outputs probabilities (e.g. Keras) -- for sklearn classifiers use predict_proba
pred_class = np.where(pred_prob > 0.5, "Yes", "No") # for a binary (Yes/No) category
predictions = pd.DataFrame(pred_class, columns=['Prediction'])
my_new_df = pd.concat([my_old_df, predictions], axis =1)
Here is a solution that worked for me:
It consists of building, for each of your folds/iterations, one dataframe which includes observed and predicted values for your test set; this way, you make use of the index (ID) contained in y_true, which should correspond to your subjects' IDs (in my code: 'SubjID').
You then concatenate the DataFrames that you generated (through 5 folds of test data in my case) and paste them back into your original dataset.
I hope this helps!
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5) # assuming stratified 5-fold CV, as described above
FoldNr = 0
for train_index, test_index in skf.split(X, y):
    FoldNr = FoldNr + 1
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # [...] your model
    # performance is measured on the test set
    y_true, y_pred = y_test, clf.predict(X_test)
    # save observed and predicted values for each test set
    a = pd.DataFrame(y_true).reset_index()
    b = pd.Series(y_pred, name='y_pred')
    globals()['ObsPred_df' + str(FoldNr)] = a.join(b)
    globals()['ObsPred_df' + str(FoldNr)].set_index('SubjID', inplace=True)

# create a dataframe with observed and predicted values for all subjects
ObsPred_Concat = pd.concat([ObsPred_df1, ObsPred_df2, ObsPred_df3, ObsPred_df4, ObsPred_df5])
original_df['y_pred'] = ObsPred_Concat['y_pred']
First you need to convert your y_val or y_test data into a DataFrame:
compare_df = pd.DataFrame(y_val)
Then just create a new column with the predicted data:
compare_df['predicted_res'] = y_pred_val
After that, you can easily filter for the rows where the prediction matches the original value, based on a simple condition:
test_df = compare_df[compare_df['y_val'] == compare_df['predicted_res'] ]
You can also use:
y_hats = model.predict(X)
df['y_hats'] = y_hats # predict(X) returns an array aligned with df's row order, so direct assignment works
