I am trying to convert the categorical column of my dataset into numerical using LabelEncoder.
dataset
Here is the conversion code:
for i in cat_columns:
df[i]=encoder.fit_transform(df[i])
After conversion dataset looks like dataset after transformation
But the problem is whenever I try to transform my test dataset it gives an error that
y contains previously unseen labels: 'Male'
Code for transformation on test data :
for i in cat_columns:
df1[i]=encoder.transform(df1[i])
test data
Now how can i solve this problem?
I guess the problem is that you are using the same encoder to fit all the different columns. You should instead fit each column using a different encoder. For example, you can use a dictionary to store the different encoders:
from sklearn import preprocessing
encoders = {}
for i in cat_columns:
encoders[i] = preprocessing.LabelEncoder()
df[i] = encoders[i].fit_transform(df[i])
for i in cat_columns:
df1[i] = encoders[i].transform(df1[i])
The error you encounter (previously unseen labels: 'Male') is caused by the fact you are trying to transform the gender column using the last encoder you create in the previous for loop, which in your case might be a smoking_status label encoder.
Related
I am wondering what is the difference between pandas' get_dummies() encoding of categorical features as compared to the sklearn's OneHotEncoder().
I've seen answers that mention that get_dummies() cannot produce encoding for categories not seen in the training dataset (answers here). However, this is a result of having performed the get_dummies() separately on the testing and training datasets (which can give inconsistent shapes). On the other hand, if we applied the get_dummies() on the original dataset, before splitting it, I think the two methods should give identical results. Am I wrong? Would that cause problems?
My code is currently working like the one below:
def one_hot_encode(ds,feature):
#get DF of dummy variables
dummies = pd.get_dummies(ds[feature])
#One dummy variable to drop (Dummy Trap)
dummyDrop = dummies.columns[0]
#Create a DF from the original and the dummies' DF
#Drop the original categorical variable and the one dummy
final = pd.concat([ds,dummies], axis='columns').drop([feature,dummyDrop], axis='columns')
return final
#Get data DF
dataset = pd.read_csv("census_income_dataset.csv")
columns = dataset.columns
#Perform one-hot-encoding on the DF (See function above) on categorical features
features = ["workclass","marital_status","occupation","relationship","race","sex","native_country"]
for f in features:
dataset = one_hot_encode(dataset,f)
#Re-order to get ouput feature in last column
dataset = dataset[[c for c in dataset.columns if c!="income_level"]+["income_level"]]
dataset.head()
If you apply get_dummies() and OneHotEncoder() in the general dataset, you should obtain the same result.
If you apply get_dummies() in the general dataset, and OneHotEncoder() in the train dataset, you will probably obtain a few (very small) differences if in the test data you have a "new" category. If not, they should have the same result.
The main difference between get_dummies() and OneHotEncoder() is how they work when you are using this model in real life (or in production) and your receive a "new" class of a categorical column that you haven't faced before
Example: Imagine your category "sex" can be only: male or female, and you sold your model to a company. What will happen if now, the category "sex" receives the value: "NA" (not applicable)? (Also, you can image that "NA" is an option, but it only appear 0.001%, and casually, you don't have any of this value in your dataset)
Using get_dummies(), you will have a problem, since your model is trained for only 2 different categories of sex, and now, you have a different and new category that the model can't hand with it.
Using OneHotEncoder(), will allow you to "ignore" this new category that your model can't face, allowing you to keep the same shape between the model input, and your new sample input.
That's why people uses OneHotEncoder() in train set and not in the general dataset, they are "simulating" this type of success (having "new" class you haven't faced before in a categorical column)
The training dataset has object columns called shops and others. Now for the machine learning model I converted the columns into labels for training purposes. Using the code below
from sklearn.ensemble import RandomForestRegressor
X = df_all_4.copy()
y = df_all_4.item_price
X = X.drop(['item_price','date'], axis=1)
for c in df_all_4.columns[df_all_4.dtypes == 'object']:
X[c] = X[c].factorize()[0]
rf = RandomForestRegressor()
rf.fit(X,y)
Now the testing dataset also has those categorical columns but with the some columns missing including the target column not relevant here I think. But if I again label the training dataset (unordered) the labels would be different than the one used while training so the model would not work properly . How to solve this problem and get the same encodings while training and testing
The important thing here is you can use LabelEncoder or OneHotEncoder classes present in Sklearn package. which makes this task pretty much simple.
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
for c in df_all_4.columns[df_all_4.dtypes == 'object']:
le = LabelEncoder()
X[c] = le.fit_transform(X[c])
test[c] = le.transform(test[c])
That's it you have encoded the labels into numbers for both train and test data
You can also use OneHotEncoder which does OneHotEncoding to categorical data.
In Sklearn how can I do OneHotEncoding after LabelEncoding in Sklearn.
What i have done so far is that i mapped all the string features of my dataset like such.
# Categorical boolean mask
categorical_feature_mask = X.dtypes==object
# filter categorical columns using mask and turn it into a list
categorical_cols = X.columns[categorical_feature_mask].tolist()
After that i applied this to the dataset columns, with indexing like such:
X[categorical_cols] = X[categorical_cols].apply(lambda col: le.fit_transform(col))
My results were not super good, so what I want to do, is that I want to use ÒneHotEncoding to see if performance is improved.
This is my code:
ohe = OneHotEncoder(categorical_features = categorical_cols)
X[categorical_cols] = ohe.fit_transform(df).toarray()
I have tried different approaches, but what i try to accomplish here is using the OneHotEncoding technique to overwrite the features.
OneHotEncoder directly supports categorical features, so no need to use a LabelEncoder prior to using it. Also note, that you should not use a LabelEncoder to encode features. Check LabelEncoder for features? for a detailed explanation on this. A LabelEncoder only makes sense on the actual target here.
So select the categorical columns (df.select_dtypes is normally used here), and fit on the specified columns. Here's a sketch one how you could proceed:
# OneHot encoding categorical columns
oh_cols = df.select_dtypes('object').columns
X_cat = df[oh_cols].to_numpy()
oh = OneHotEncoder()
one_hot_cols = oh.fit(X_cat)
Then just call the transform method of the encoder. If you wanted to reconstruct the dataframe (as your code suggests) get_feature_names will give you the category names of the categorical features:
df_prepr = pd.DataFrame(one_hot_cols.transform(X_cat).toarray(),
columns=one_hot_cols.get_feature_names(input_features=oh_cols))
I have data in the following form
Class Feature set list
classlabel1 - [size,time] example:[6780.3,350.00]
classlabel2 - [size,time]
classlabel3 - [size,time]
classlabel4 - [size,time]
How do I save this data in excel sheet and how can I train the model using this feature set? Currently I am working on SVM classifier.
I have tried saving the feature set list in a dataframe and saving this dataframe to a csv file. But the size and time are getting split into two different columns.
The dataframe is getting saved in csv file in the following way:
col 0 col1 col2
62309 396.5099154 label1
I would like to train and test on the feature vector [size,time] combined. Is it possible and is this a right way? If it is possible, how can I do it?
Firstly responding to your question:
I would like to train and test on the feature vector [size,time]
combined. Is it possible and is this a right way? If it is possible,
how can I do it?
Combining the two is not the right thing to do because both are in two different scales (if they are actually what there name suggests) and also combining them will result in loss of information which they will provide, so they are two totally independent features for any ML supervised algorithm. So I would suggest to treat these two features separately rather than combining into one.
Now let's move onto to next section:
How do I save this data in excel sheet and how can I train the model
using this feature set? Currently I am working on SVM classifier.
Storing data : In my opinion, you can store data in whichever format you want but I would prefer storing data in csv format as it is convenient and loading of data file is faster.
sample_data.csv
size,time,class_label
100,150,label1
200,250,label2
240,180,label1
Below is the code for reading the data from csv and training SVM :
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# loading data
data = pd.read_csv("sample_data.csv", error_bad_lines=True,
warn_bad_lines=True)
# Dividing into dependent and independent features
Y = data.class_label_col.values
X = data.drop("class_label_col", axis=1).values
# encode the class column values
label_encoded_Y = preprocessing.LabelEncoder().fit_transform(list(Y))
# split training and testing data
x_train,x_test,y_train,y_test=train_test_split(X,label_encoded_Y,
train_size=0.8,
test_size=0.2)
# Now use the whichever trainig algo you want
clf = SVC(gamma='auto')
clf.fit(x_train, y_train)
# Using the predictor
y_pred = clf.predict(x_test)
Since size and time are different features, you should separate them into 2 different columns so your model could set separate weight to each of them, i.e.
# data.csv
size time label
6780.3 3,350.00 classLabel1
...
If you want to transform the data you have into the format above you could use pandas.read_excel and use ast to transform the string list into python list object.
import pandas as pd
import ast
df = pd.read_excel("data.xlsx")
size_time = [(ast.literal_eval(x)[0], ast.literal_eval(x)[1]) for x in df["Feature set list"]]
size = [x[0] for x in size_time]
time = [x[1] for x in size_time]
label = df["Class"]
new_df = pd.DataFrame({"size":size, "time":time, "label":label})
# This will result in the DataFrame below.
# size time label
# 6780.3 350.0 classlabel1
# Save DataFrame to csv
new_df.to_csv("data_fix.csv")
# Use it
x = new_df.drop("label", axis=1)
y = new_df.label
# Further data preparation, such as split the dataset
# into train and test set, etc.
...
Hope this helps
I'm trying to get a prediction using a dataset that having string values using Naive Bayes classifier. the data set having 14 columns and 12 columns are having string values.
I encoded the dataset using Labalencoder and onehot encoder and ready the dataset to use the Naive Bayes classifier.
dataset = pd.read_csv('D:\\\\CRC data set copies\\Testing1.csv')
columns = ['Age', 'Weight', 'Gender', 'Ethnic_Group', 'Religion', 'Smoking', 'Alchohol', 'Maritial_Status', 'Family_History', 'District', 'Blood_in_stools', 'Abnormal_Stomach_pain', 'Weight_Loss', 'Tiredness']
X = dataset[columns]
y = dataset['Class']
labelencoder_X = LabelEncoder() # encoding the categorical variables
# replacing the column0 categorical data with numeric values
for col in columns[2:]:
X[col] = labelencoder_X.fit_transform(X[col])
onehotencoder = OneHotEncoder(categorical_features=[2,3,4,5,6,7,8,9,10,11,12,13])
# creating new columns and representing true by 1
X = onehotencoder.fit_transform(X).toarray()
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
then the model was created and saved.
joblib.dump(model, 'model_joblib')
# load the trained model using joblib
load_model = joblib.load('model_joblib')
predict = [[70,65,"M","b","s","Yes","Yes","MA","Y","kurunegala","P","P","P","P"]]
predict = pd.DataFrame(predict,columns=columns)
for col in columns[2:]:
predict[col] = labelencoder_X.fit_transform(predict[col])
predict = onehotencoder.transform(predict).toarray()
print('\nNew predicted value: ', load_model.predict(predict))
I want to get user inputs and predict a result using the saved model of naive Bayes. I tried encoding the user input using same encoding way but now it's not encoding correctly as same as the dataset values. because of this, the prediction is wrong.
Can someone help me to encode the user input values as the same values that the dataset has been encoded?
In Line:
for col in columns[2:]:
predict[col] = labelencoder_X.fit_transform(predict[col])
Try using labelencoder_X.transform(predict[col]) instead of fit_transform.
During training and test you obviously must use the exact same encoding. If you use a new encoding for test, the results will be all wrong.
Now it can happen that some value was never seen during training. You need to decide what to do then. For example, you could introduce an "unknown" value for previously unseen values.