Map predictions back to IDs - Python Scikit Learn DecisionTreeClassifier

I have a dataset that has a unique identifier and other features. It looks like this:
ID       LenA  TypeA  LenB  TypeB  Diff  Score  Response
123-456  51    M      101   L      50    0.2    0
234-567  46    S      49    S      3     0.9    1
345-678  87    M      70    M      17    0.7    0
I split it up into training and test data. I am trying to classify the test data into two classes using a classifier trained on the training data. I want to keep the identifier in both the training and testing datasets so I can map the predictions back to the IDs. Is there a way to mark the identifier column as an ID or non-predictor, like we can do in Azure ML Studio or SAS?
I am using the DecisionTreeClassifier from Scikit-Learn. This is the code I have for the classifier.
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(traindata, trainlabels)
If I just include the ID into the traindata, the code throws an error:
ValueError: invalid literal for float(): 123-456

Not knowing how you made your split, I would suggest just making sure the ID column is not included in your training data. Something like this, perhaps:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.loc[:, ~df.columns.isin(['ID', 'Response'])].values, df.Response)
That splits only the DataFrame columns other than ID and Response as the X values, and splits Response as the y values.
But you will still not be able to use the DecisionTreeClassifier with this data, as it contains strings. You will need to convert any column with categorical data, i.e. TypeA and TypeB, to a numerical representation. The best way to do this for sklearn, in my opinion, is with the LabelEncoder. It converts the categorical string labels ['M', 'S'] into the integer codes [0, 1], which the DecisionTreeClassifier can work with. If you need an example, take a look at Passing categorical data to sklearn decision tree.
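As a minimal sketch of the encoding step described above: the frame below simply re-types the three rows from the question, and keeping one LabelEncoder per column in a dict named encoders is an illustrative choice, not something the answer prescribes. Note that LabelEncoder assigns codes from 0 upward in sorted label order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# The three example rows from the question.
df = pd.DataFrame({
    "ID": ["123-456", "234-567", "345-678"],
    "LenA": [51, 46, 87],
    "TypeA": ["M", "S", "M"],
    "LenB": [101, 49, 70],
    "TypeB": ["L", "S", "M"],
    "Diff": [50, 3, 17],
    "Score": [0.2, 0.9, 0.7],
    "Response": [0, 1, 0],
})

# One encoder per categorical column, kept so codes can be mapped back later
# with encoders[col].inverse_transform(...).
encoders = {}
for col in ["TypeA", "TypeB"]:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

print(df[["TypeA", "TypeB"]])
print(encoders["TypeB"].classes_)  # labels in the order of their integer codes
```

The integer codes carry an artificial ordering; a tree can usually cope with that, but one-hot encoding may be a safer choice for other estimators.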
Update
Per your comment I now understand that you need to map back to the ID. In this case you can leverage pandas to your advantage. Set ID as the index of your data and then do the split, that way you will retain the ID value for all of your train and test data. Let's assume your data are already in a pandas dataframe.
df = df.set_index('ID')
X_train, X_test, y_train, y_test = train_test_split(df.loc[:, ~df.columns.isin(['Response'])], df.Response)
print(X_train)
         LenA TypeA  LenB TypeB  Diff  Score
ID
345-678    87     M    70     M    17    0.7
234-567    46     S    49     S     3    0.9
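To show where the index approach leads, here is a rough end-to-end sketch: the ID index travels through the split, so predictions can be attached to it afterwards. The extra rows, the choice of numeric-only columns, and the name results are made up for illustration; the categorical columns are left out so the example does not depend on an encoding step.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: the first three rows follow the question, the rest are invented.
df = pd.DataFrame({
    "ID": ["123-456", "234-567", "345-678", "456-789", "567-890", "678-901"],
    "LenA": [51, 46, 87, 60, 45, 90],
    "Diff": [50, 3, 17, 40, 5, 20],
    "Score": [0.2, 0.9, 0.7, 0.3, 0.8, 0.6],
    "Response": [0, 1, 0, 0, 1, 0],
}).set_index("ID")

X = df.drop(columns=["Response"])
y = df["Response"]

# The ID index is carried along into every piece of the split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# predict() returns a plain array in the row order of X_test,
# so reusing X_test.index ties each prediction to its ID.
results = pd.DataFrame(
    {"Actual": y_test, "Predicted": clf.predict(X_test)},
    index=X_test.index,
)
print(results)
```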

A pandas dataframe keeps its row order when you transform it (except for join/merge operations, which can create or drop rows), so predictions come back in the same order as the rows they were made from.
So, here it is step by step:
Create the df_test dataframe with the 'id' column.
Create df_test2 without the 'id' column:
df_test2 = df_test.drop(["id"], axis=1)
Pass df_test2 to the model for prediction:
pred = model.predict(df_test2)
Create df_pred_final from the 'id' column of df_test:
df_pred_final = df_test[["id"]].copy()
Add a 'target' column to df_pred_final. Each id is then paired with its own prediction:
df_pred_final["target"] = pred
Please take a look at my kaggle notebook. You might get the idea.
https://www.kaggle.com/tthien/20210412-complex-drop-c10-c2
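The steps above can be sketched as one small script. The training frame, the single feature f1, and the DecisionTreeClassifier are stand-ins chosen only to make the example runnable; any fitted model with a predict method would do.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training and test frames; only the test frame lacks 'target'.
df_train = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "f1": [0.1, 0.9, 0.2, 0.8],
    "target": [0, 1, 0, 1],
})
df_test = pd.DataFrame({"id": [10, 11, 12], "f1": [0.85, 0.15, 0.95]})

# Train without the identifier so it is never used as a feature.
model = DecisionTreeClassifier(random_state=0)
model.fit(df_train.drop(["id", "target"], axis=1), df_train["target"])

# Predict on the test features only.
df_test2 = df_test.drop(["id"], axis=1)
pred = model.predict(df_test2)

# Row order is unchanged, so position i of pred belongs to row i of df_test.
df_pred_final = df_test[["id"]].copy()
df_pred_final["target"] = pred
print(df_pred_final)
```

This relies purely on row position, so it only holds as long as nothing reorders or filters df_test between the drop and the assignment.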

Related

How to build a dataframe comparing predicted and actual results from a regression model

I would like to build a dataframe that compares the predicted results of a regression model (y_hat) with the test data (y_test). I have two access methods for selecting the test data. Access method 1 works but Access method 2 doesn't when I try to build the comparison dataframe.
Access method 1:
X_data = df_scores[['Hours']]
y_data = df_scores['Scores']
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.20, random_state=0)
lm = LinearRegression()
lm.fit(X_train, y_train)
y_hat = lm.predict(X_test)
This dataframe works:
df_scores_comp = pd.DataFrame({'Actual':y_test, 'Predicted':y_hat})
df_scores_comp
Access method 2:
But if I want to use the following way to access X_data and y_data ...
X_data = df_scores.loc[:, ['Hours']]
y_data = df_scores.loc[:, ['Scores']]
I get the following error ...
If using all scalar values, you must pass an index
When using either access method, y_hat is an array and X_data is a dataframe. But y_data is a series using the first access method and a dataframe in the second access method. I thought the clue might be in there somewhere with lm.predict but I can't figure it out.
When I tried the solution offered here (Constructing pandas dataframes...) by wrapping the dictionary in a list, I don't get an error. But the result isn't right: the y_hat (predicted) elements are in the correct column, but are squeezed into one row. And the y_test (Actual) elements and the index values are mixed up in the wrong columns and are squeezed into one row as well. Something like this:
Actual Predicted
0 Scores 5 20 2 27 19 69 16... [[16.884144762398048], [33.73226077948985], [7...
It should look like this (which it does using the first access method):
Actual Predicted
5 20 16.884145
2 27 33.732261
19 69 75.357018
16 30 26.794801
11 62 60.491033
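One possible way to make access method 2 work, offered as a sketch rather than a definitive answer: when y_data is a one-column dataframe, y_test is a dataframe too and lm.predict returns a 2-D array of shape (n, 1), so both need to be reduced to one dimension before building the comparison frame. The Hours and Scores values below are made up.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for df_scores.
df_scores = pd.DataFrame({
    "Hours": [2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5, 8.3, 2.7],
    "Scores": [21, 47, 27, 75, 30, 20, 88, 60, 81, 25],
})

# Access method 2: both selections are DataFrames.
X_data = df_scores.loc[:, ["Hours"]]
y_data = df_scores.loc[:, ["Scores"]]

X_train, X_test, y_train, y_test = train_test_split(
    X_data, y_data, test_size=0.20, random_state=0
)

lm = LinearRegression()
lm.fit(X_train, y_train)
y_hat = lm.predict(X_test)  # 2-D here, because y_train was 2-D

# Pull out the single column as a Series and flatten the predictions.
df_scores_comp = pd.DataFrame({
    "Actual": y_test["Scores"],
    "Predicted": y_hat.ravel(),
})
print(df_scores_comp)
```

The original row labels of the test rows are kept as the index, because the Actual column is a Series that still carries them.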

Combine unique ID with prediction results to csv pandas python Modelling

I am building a model (let's say logistic regression) and need to save the results, i.e. the predictions together with a unique ID, in a dataframe.
Code for predictions
from sklearn.linear_model import LogisticRegression
lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)
predictions = lr_clf.predict(test_data)
Along with the predictions, I want a column in the predictions dataframe that holds a unique identifier from X_train (right now predictions is a numpy array). Let's say the unique ID is the ID column in X_train.
Expected output
predictions ID
11 1000
123 1001
and so on
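You can include the unique ID along with the predictions as shown below. Because the predictions are made on test_data, the IDs have to come from test_data as well (assuming it carries the same ID column as X_train); otherwise the lengths and rows will not match.
# Modelling
import pandas as pd
from sklearn.linear_model import LogisticRegression
lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)
predictions = lr_clf.predict(test_data)
# Add the ID alongside each prediction and save the pandas dataframe
predictions_df = pd.DataFrame(data={"ID": test_data["ID"], "Predictions": predictions})
predictions_df.to_csv("predictions_df.csv", index=False, quoting=3, sep=';')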

Logistic Regression - How to use model on another dataset and get probability values

I'm making my first ML model and I need some help with using the model on a second dataset.
I have two sets: "train_full.csv" and "test_full.csv". Both sets have exactly the same structure.
The only difference is that in "train_full.csv" the column "target" is filled with 0s and 1s, while in "test_full.csv" this column is empty and I want to predict its values.
Below you can find my model based on "train_full.csv". I have skipped the whole data cleaning part for clarity:
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
df2 = pd.read_csv("train_full.csv", sep=';')
test_set = pd.read_csv("test_full.csv", sep=';')
# Dataset cleaning (omitted)
# y is the column named "target"; the X's are the remaining columns
X_train, X_test, y_train, y_test = train_test_split(df2.drop('target', axis=1),
                                                    df2['target'], test_size=0.35,
                                                    random_state=101)
# Creating the Logistic Regression model
logmodel = LogisticRegression()
result = logmodel.fit(X_train, y_train)
# Making predictions
Predictions = logmodel.predict(X_test)
print(metrics.confusion_matrix(y_test, Predictions))
print(metrics.classification_report(y_test, Predictions))  # Accuracy: 78%
y_pred_proba = logmodel.predict_proba(X_test)[:, 1]  # probability of class 1
auc = metrics.roc_auc_score(y_test, y_pred_proba)  # AUC: ~0.695
Now I want to use that model on the second dataset, which I imported in the second line of code. However, I don't need to split that dataset into training and testing subsets anymore: I want to apply the model above to the entire "test_full.csv" set. How can I do that?
Also, is there a way to add a column with calculated probability? So my output would be a pandas dataframe that would look like this:
Id probability target
0 0.75 1
1 0.78 1
2 0.34 0
3 0.84 1
4 0.13 0
5 0.34 0
Kind regards
It is pretty simple.
You just need to drop the target column from test_set and then use logmodel.predict() for the classification and logmodel.predict_proba() for the probability. Here is an example:
features = test_set.drop(['target'], axis=1)
# The two lines below add the probability and the predicted class as new columns of test_set.
# predict_proba returns one column per class, so [:, 1] keeps the probability of class 1.
test_set['prob'] = logmodel.predict_proba(features)[:, 1]
test_set['classification'] = logmodel.predict(features)
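For illustration, a small self-contained sketch of producing the Id / probability / target layout asked for in the question. The feature columns x1 and x2, their values, and the names features, proba and output are all invented; using the row index as Id is also an assumption, since the question does not say where Id comes from.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins for train_full.csv and test_full.csv.
train = pd.DataFrame({
    "x1": [0.1, 0.4, 0.5, 0.9, 1.2, 1.5],
    "x2": [1.0, 0.8, 0.9, 0.3, 0.2, 0.1],
    "target": [0, 0, 0, 1, 1, 1],
})
test_set = pd.DataFrame({
    "x1": [0.2, 1.4, 0.7],
    "x2": [0.9, 0.2, 0.5],
    "target": [None, None, None],  # empty, to be predicted
})

logmodel = LogisticRegression()
logmodel.fit(train.drop("target", axis=1), train["target"])

# Keep the features separate so new output columns never leak into the model input.
features = test_set.drop("target", axis=1)

# predict_proba has one column per class, ordered as in logmodel.classes_.
proba = logmodel.predict_proba(features)
positive_col = list(logmodel.classes_).index(1)

output = pd.DataFrame({
    "Id": test_set.index,
    "probability": proba[:, positive_col],
    "target": logmodel.predict(features),
})
print(output)
```

Looking up the column through classes_ avoids assuming that class 1 is always the second column.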

Python: Combine predicted y-variable labels to the dataframe

I have a multi-class label prediction problem: identifying, let's say, fruits. I am able to get predictions from the model using fit and predict, and I have trained and tested the model. Below is the code. I am trying to merge my predictions, stored in the variable forest_y_pred, back into my original dataset so that I can compare the original target variable with the predicted target variable in a dataframe. I have 2 questions:
1) Is y_test the same as forest_y_pred = forest.predict(X_test)? I get exactly the same results when I compare them. Am I getting this wrong? I am a bit confused here: predict() is supposed to predict from X_test, not reproduce exactly the same results as y_test.
2) I am trying to merge forest_y_pred = forest.predict(X_test) back into df. Here is what I tried, based on this: Merging results from model.predict() with original pandas DataFrame?
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
# Load Data
df = pd.read_excel('../data/file.xlsx',converters={'col1':str})
df = df.set_index('INDEX_ID') # Setting index id
df
# Doing this way because of setting index. INDEX_ID is a column in the df
X_train, X_test, y_train, y_test = train_test_split(df.loc[:, ~df.columns.isin(['Target'])], df.Target, train_size=0.5)
print(y_test[:5])
type(y_test) #pandas.core.series.Series
ID
12 Apples
124 Oranges
345 Apples
123 Oranges
232 Kiwi
forest = RandomForestClassifier()
# Training
forest_model = forest.fit(X_train, y_train)
print(forest_model)
# Predictions
forest_y_pred = forest.predict(X_test)
print("forest_y_pred:\n",forest_y_pred[:5])
['Apples','Oranges','Apples','Oranges','Kiwi']
y_test['preds'] = forest_y_pred
print(y_test['preds'][:5])
['Apples','Oranges','Apples','Oranges','Kiwi']
df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
# ValueError: can not merge DataFrame with instance of type <class 'pandas.core.series.Series'>
# How do I fix this? I tried a ton of ways to convert between ndarray, Series and DataFrame, but nothing I tried has worked so far. Thanks a bunch!
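One way this could be approached, as a sketch under assumptions rather than a confirmed fix: y_test is a Series, so it has no 'preds' column to select. Wrapping the predictions in their own Series that reuses X_test.index gives pandas something it can join onto df by index. The weight column, the data values, and random_state are invented for the example.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical fruit data indexed by INDEX_ID.
df = pd.DataFrame({
    "INDEX_ID": [12, 124, 345, 123, 232, 77, 88, 99],
    "weight": [150, 130, 155, 128, 75, 152, 131, 78],
    "Target": ["Apples", "Oranges", "Apples", "Oranges",
               "Kiwi", "Apples", "Oranges", "Kiwi"],
}).set_index("INDEX_ID")

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=["Target"]), df["Target"], train_size=0.5, random_state=0
)

forest = RandomForestClassifier(random_state=0)
forest.fit(X_train, y_train)

# Give the predictions the same index as the rows they were made from.
preds = pd.Series(forest.predict(X_test), index=X_test.index, name="preds")

# Left join on the index: test rows get a prediction, training rows get NaN.
df_out = df.join(preds, how="left")
print(df_out)
```

Rows that were used for training have no prediction, which makes it easy to compare Target and preds on the held-out rows only.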

Python: ValueError: could not convert string to float when apply for down-sampling

I have an imbalanced dataset, as below:
id text category
1 comment1 0
2 comment2 0
3 comment3 1
4 comment4 0
I have pre-processed the "text" by removing numeric values and applying stemming.
Next, I split my data to training and testing set for validation.
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['category'])
Then I apply the down-sampling method to my training dataset:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(return_indices=True)
train_X_resampled, train_y_resampled, idx_resampled = rus.sample(X_train, y_train)
However, I get the error below. How can I fix it?
ValueError: could not convert string to float: 'comment2'
imblearn doesn't support dataframes. Convert your column(s) of interest to a list and then reshape it into a 2D array using np.array(list(data['text'])).reshape(-1, 1), and it should work.
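A minimal sketch of just the reshaping step from the answer, using the four example rows from the question. It only shows the resulting shape and does not call imblearn, so whether the sampler then accepts the array depends on the installed version.

```python
import numpy as np
import pandas as pd

# The example rows from the question.
data = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": ["comment1", "comment2", "comment3", "comment4"],
    "category": [0, 0, 1, 0],
})

# One column of text becomes an array with one row per sample and one feature.
X_2d = np.array(list(data["text"])).reshape(-1, 1)
print(X_2d.shape)

# After resampling, a column like this can be flattened back to 1-D.
X_1d = X_2d.ravel()
print(X_1d.shape)
```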
