I am trying to use a high-cardinality feature (siteid) in a scikit-learn model and am using get_dummies to one-hot encode it. I get around 800 new binary columns, which gives decent accuracy with logistic regression. My problem is that when I pass a new dataset through the model, this feature has a different cardinality, say 300 unique values, and the model rightly asks: where are the other 500 columns you trained me on? How can I resolve this?
I don't want to retrain the model every time the cardinality changes, nor do I want to hard-code these columns in my SQL data load.
cat_columns = ["siteid"]
df = pd.get_dummies(df, prefix_sep="__",
columns=cat_columns)
My recommendation would be to pad the missing columns with zeros. So if your new dataset has, for example, 10 unique values and the model expects 50 columns (total_cols), then add 40 columns of zeros on the right to "fill out" the rest of the data:
import numpy as np
import pandas as pd
df = pd.DataFrame({"siteid": range(10)})
cat_columns = ["siteid"]
df1 = pd.get_dummies(df, columns=cat_columns)
# df1 has shape (10, 10)
total_cols = 50 # Number of columns that model expects
zero_padding = pd.DataFrame(np.zeros((df1.shape[0], total_cols - df1.shape[1])))
df = pd.concat([df1, zero_padding], axis=1)
df.columns = ["siteid__" + str(i) for i in range(df.shape[1])]
# df now has shape (10, 50)
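Padding on the right only lines up with the trained model if the new dummy columns happen to sit in the same positions as during training. A minimal sketch of an alternative, assuming you saved the list of dummy column names produced at training time (train_columns here, with made-up siteid values), is to reindex the new frame against that list:

```python
import pandas as pd

# Toy data; the siteid values are made up for illustration.
train = pd.DataFrame({"siteid": ["a", "b", "c", "d"]})
new = pd.DataFrame({"siteid": ["b", "d", "e"]})  # "e" was never seen in training

train_dummies = pd.get_dummies(train, prefix_sep="__", columns=["siteid"])
train_columns = train_dummies.columns  # assumption: stored alongside the model

new_dummies = pd.get_dummies(new, prefix_sep="__", columns=["siteid"])

# Add missing training columns as zeros, drop columns the model never saw,
# and put everything in the training column order.
aligned = new_dummies.reindex(columns=train_columns, fill_value=0).astype(int)
print(aligned)
```

The row for the unseen value "e" ends up as all zeros, and the column order matches what the model was trained on.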
I suggest using scikit-learn's OneHotEncoder (see its documentation).
In your case, the usage would look something like:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
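# cat_columns is the list defined in the question: ["siteid"]
enc.fit(df[cat_columns])
categories = [cat for cats in enc.categories_ for cat in cats]
# transform returns a sparse matrix by default, so convert it to a dense array
df[categories] = enc.transform(df[cat_columns]).toarray()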
The handle_unknown parameter is key, and the fitted enc object must be kept so the same encoding can be reproduced on new data.
On new dataframes you would run
df_new[categories] = enc.transform(df_new[cat_columns]).toarray()
This will one-hot encode the same categories and ignore any new ones that your model was not trained on.
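To make the effect of handle_unknown='ignore' concrete, here is a small sketch on made-up siteid values (the data is an assumption, not from the question). The encoder is fitted once and then reused on a frame containing a value it has never seen:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frames; the siteid values are invented for illustration.
df = pd.DataFrame({"siteid": ["a", "b", "c"]})
df_new = pd.DataFrame({"siteid": ["c", "z"]})  # "z" is unseen

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(df[["siteid"]])

# Same number of columns as at fit time; the unseen value becomes an all-zero row.
encoded_new = enc.transform(df_new[["siteid"]]).toarray()
print(encoded_new.shape)
print(encoded_new)
```

The output always has one column per category learned during fit, regardless of how many distinct values the new data contains.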
Related
I'm not sure if I'm doing something wrong, or if this is not the correct way to do this.
I'm encoding variables in a dataset for a model, and I'm using a Normalizer() from sklearn.preprocessing to normalize one of my variables, which is numerical.
My dataset is split in two, one part for training and one for inference. My goal is to normalize this numerical variable (let's call it column x) in the training subset, and then use the same normalization parameters to normalize that variable in the inference dataset. The two subsets don't have the same number of entries, so what I'm doing is:
nr = Normalizer()
nr.fit([df1.x])
new_col = nr.transform(df1.x)
Now, the problem is that when I try to use the same normalizer parameters on column x in the inference subset, which has a different number of rows:
new_col1 = nr.transform(df2.x)
I get:
X has 10 features, but Normalizer is expecting 697 features as input.
I'm not sure if it's some reshape problem or if the Normalizer() shouldn't be used in that way, so, any advice would be more than welcome.
Normalizer normalizes rows (samples), whereas StandardScaler standardizes columns (features). From your question, it seems that you want to scale a column, so you should use StandardScaler.
scikit-learn transformers expect a 2D array of shape (n_samples, n_features) as input, but a pandas.Series is a one-dimensional labeled array.
You can fix that by passing a pandas.DataFrame to the transformer, as follows:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
df1 = pd.DataFrame({'x' : np.random.uniform(low=0, high=10, size=1000)})
df2 = pd.DataFrame({'x' : np.random.uniform(low=0, high=10, size=850)})
scaler = StandardScaler()
new_col = scaler.fit_transform(df1[['x']])
new_col1 = scaler.transform(df2[['x']])
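The row-versus-column distinction can be checked directly. The following is a small sketch on an invented 3x2 array: Normalizer rescales each row to unit L2 norm, while StandardScaler gives each column zero mean and unit variance.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Invented data: 3 samples (rows), 2 features (columns).
X = np.array([[3.0, 4.0], [6.0, 8.0], [0.0, 5.0]])

row_normalized = Normalizer().fit_transform(X)   # operates per row
col_scaled = StandardScaler().fit_transform(X)   # operates per column

row_norms = np.linalg.norm(row_normalized, axis=1)  # L2 norm of each row
col_means = col_scaled.mean(axis=0)                 # mean of each column
col_stds = col_scaled.std(axis=0)                   # std of each column
print(row_norms, col_means, col_stds)
```

Because Normalizer works row by row, it has nothing to learn from the training set, which is why fitting it on one subset and applying it to another does not do what the question intends.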
I am wondering what is the difference between pandas' get_dummies() encoding of categorical features as compared to the sklearn's OneHotEncoder().
I've seen answers that mention that get_dummies() cannot produce encoding for categories not seen in the training dataset (answers here). However, this is a result of having performed the get_dummies() separately on the testing and training datasets (which can give inconsistent shapes). On the other hand, if we applied the get_dummies() on the original dataset, before splitting it, I think the two methods should give identical results. Am I wrong? Would that cause problems?
My code is currently working like the one below:
def one_hot_encode(ds,feature):
    #get DF of dummy variables
    dummies = pd.get_dummies(ds[feature])
    #One dummy variable to drop (Dummy Trap)
    dummyDrop = dummies.columns[0]
    #Create a DF from the original and the dummies' DF
    #Drop the original categorical variable and the one dummy
    final = pd.concat([ds,dummies], axis='columns').drop([feature,dummyDrop], axis='columns')
    return final
#Get data DF
dataset = pd.read_csv("census_income_dataset.csv")
columns = dataset.columns
#Perform one-hot-encoding on the DF (See function above) on categorical features
features = ["workclass","marital_status","occupation","relationship","race","sex","native_country"]
for f in features:
    dataset = one_hot_encode(dataset,f)
#Re-order to get ouput feature in last column
dataset = dataset[[c for c in dataset.columns if c!="income_level"]+["income_level"]]
dataset.head()
If you apply get_dummies() and OneHotEncoder() to the whole dataset, you should get the same result.
If you apply get_dummies() to the whole dataset and OneHotEncoder() to the training set only, the results will differ only if the test data contains a "new" category that is absent from the training set. Otherwise they should match.
The main difference between get_dummies() and OneHotEncoder() is how they behave when the model is used in production and you receive a "new" class in a categorical column that you haven't seen before.
Example: Imagine your category "sex" can only be male or female, and you sold your model to a company. What happens if the category "sex" now receives the value "NA" (not applicable)? (You can also imagine that "NA" was always an option but appears in only 0.001% of cases, and by chance your dataset contains none of them.)
Using get_dummies(), you will have a problem: your model was trained on only 2 categories of sex, and now there is a new category that it can't handle.
Using OneHotEncoder() with handle_unknown='ignore' lets you "ignore" this new category, so the new sample keeps the same shape as the model input.
That's why people fit OneHotEncoder() on the training set rather than on the whole dataset: they are "simulating" this kind of situation (a "new" class in a categorical column that you haven't seen before).
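As a quick check of the first claim, the sketch below (on an invented "sex" column) compares the two encodings on the same data and then shows what get_dummies() produces for a lone sample with an unseen value:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Invented data for illustration.
data = pd.DataFrame({"sex": ["male", "female", "female", "male"]})

dummies = pd.get_dummies(data["sex"])
encoded = OneHotEncoder().fit_transform(data[["sex"]]).toarray()

# Both order the categories alphabetically, so the matrices match.
same = np.array_equal(dummies.to_numpy().astype(float), encoded)
print(same)

# Calling get_dummies() on new data alone yields columns for that data only.
new = pd.DataFrame({"sex": ["NA"]})
new_dummy_cols = list(pd.get_dummies(new["sex"]).columns)
print(list(dummies.columns), new_dummy_cols)
```

The second part illustrates the shape mismatch: the new sample gets a single "NA" column instead of the "female" and "male" columns the model was trained on.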
Below is a piece of my code for Label Encoding. While implementing Label Encoder for one column of the Dataframe at a time, it worked fine. But, when I tried to implement on whole categorical features at once sklearn throws
ValueError: bad input shape (600000, 24). I'm not able to find any specific reason for that.
df = pd.read_csv("../inputs/cat-in-the-dat-train-folds.csv")
# extracting the categorical features
cat_features = [x for x in df.columns if x not in ("id", "target", "kfold")]
for col in cat_features:
    # fill missing values before casting, otherwise NaN becomes the string "nan"
    df.loc[:, col] = df[col].fillna("NONE").astype(str)
df_train = df[df["kfold"] != fold].reset_index(drop=True)
df_valid = df[df["kfold"] == fold].reset_index(drop=True)
lbl_enc = preprocessing.LabelEncoder()
full_cat_data = pd.concat(
    [df_train[cat_features], df_valid[cat_features]],
    axis=0)
lbl_enc.fit(full_cat_data)
x_train = lbl_enc.transform(df_train[cat_features])
x_valid = lbl_enc.transform(df_valid[cat_features])
sklearn.preprocessing.LabelEncoder.fit only takes a 1D array as a parameter.
To fit multiple columns, use sklearn.preprocessing.OrdinalEncoder.fit which can take multi-dimensional data with [n_samples, n_features] (as per the documentation)
In your example, try replacing lbl_enc = preprocessing.LabelEncoder() with lbl_enc = preprocessing.OrdinalEncoder() and that should work.
See this answer here for more information on the difference between LabelEncoder and OrdinalEncoder
Additional resources:
Label encoding across multiple columns in scikit-learn
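A minimal sketch of OrdinalEncoder on several columns at once, using an invented two-column frame (the column names and values are assumptions for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Invented categorical data with two feature columns.
df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "shape": ["circle", "square", "square"],
})

enc = OrdinalEncoder()
# Input is 2D (n_samples, n_features); each column gets its own integer mapping.
encoded = enc.fit_transform(df[["color", "shape"]])
print(enc.categories_)
print(encoded)
```

Each column's categories are sorted independently, so "blue" and "circle" both map to 0 within their own columns.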
I am struggling to understand how training our algorithms connects with making predictions on new data.
My situation: I have an algorithm that I use on a labeled dataset. After the steps of importing it, encoding it, fit_transforming it and fitting it to make predictions on the data_test of the train_test_split function I get a really nice prediction from using the labeled dataset.
I am stumped as to how I need to feed a new dataset (unlabeled this time) to the trained model, which has learned from the labeled dataset. I know that technically the data used to train withheld the labels from itself to predict, but I am unaware how I have to provide the gaussianNB algorithm new data features to predict unknown labels.
My code for the training:
df = pd.read_csv(chosen_file, sep=',')
cat_cols = df.select_dtypes(include=['object'])
cat_cols_filled = cat_cols.fillna('0')
le = LabelEncoder()
cat_cols_fitted = cat_cols_filled.apply(lambda col: le.fit_transform(col))
non_cat_cols = df.select_dtypes(exclude=['object'])
non_cat_cols_filled = non_cat_cols.fillna('0')
non_cat_cols_fitted = non_cat_cols_filled.apply(lambda col: le.fit_transform(col))
target_prep = df.iloc[:,-1]
target = le.fit_transform(target_prep.astype(str))
data = pd.concat([cat_cols_fitted, non_cat_cols_fitted], axis=1)
data_train, data_test, target_train, target_test = train_test_split(data, target, train_size=0.3)
alg = GaussianNB()
pred = alg.fit(data_train, target_train).predict(data_test)
This is all fine and dandy. But I cannot understand how I have to give something in place of data_test. Do I need to provide the new dataset with some placeholder values for the label column? My label column from the beginning dataframe is the last one.
My attempt:
new_df = pd.read_csv(new_chosen_file, sep=',')
new_cat_cols = new_df.select_dtypes(include=['object'])
new_cat_cols_filled = new_cat_cols.fillna('0')
new_cat_cols_fitted = new_cat_cols_filled.apply(lambda col: le.fit_transform(col))
new_non_cat_cols = new_df.select_dtypes(exclude=['object'])
new_non_cat_cols_filled = new_non_cat_cols.fillna('0')
new_non_cat_cols_fitted = new_non_cat_cols_filled.apply(lambda col: le.fit_transform(col))
new_data = pd.concat([new_cat_cols_fitted, new_non_cat_cols_fitted], axis=1)
print(new_data)
new_pred = alg.predict(new_data)
new_prediction = pd.DataFrame({'NEW ML prediction':new_pred})
print(new_pred)
print(new_prediction)
Notice I do not provide the target column in the new dataset. However, the program errors out if my column count does not match, so I am forced to add at least the label column to stop that:
Am I way off in my understanding of how this works? Please let me know.
EDIT:
I found my major screw-up in the code. I had not isolated my target column out of the data DataFrame. This was why data was 10 column shape.
I can finally appreciate the simplicity of the code.
You instantiate an unfitted model as alg and assign the result of the chained call to a variable named pred. Because fit() fits the estimator in place and returns it, alg is fitted after that line, but the chain makes this hard to see.
Stringing several method calls together, as in
alg.fit(data_train, target_train).predict(data_test), is known as method chaining and can cause confusion.
A cleaner and more readable alternative is:
alg = GaussianNB() # instantiate the model
alg = alg.fit(data_train, target_train) # fit the model on the training data
pred = alg.predict(data_test) # predict on the test data
new_pred = alg.predict(new_data) # predict on the new data
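For predictions on new data to be meaningful, the new features also have to pass through the same fitted preprocessing as the training features, rather than through an encoder refitted on the new file. Here is a hedged sketch on invented data (the column names, values, and the use of OrdinalEncoder are assumptions for illustration, not the asker's setup):

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import OrdinalEncoder

# Invented labeled training data and unlabeled new data.
train = pd.DataFrame({
    "color": ["red", "blue", "red", "blue"],
    "size": [1.0, 5.0, 1.2, 4.8],
    "label": ["a", "b", "a", "b"],
})
new = pd.DataFrame({"color": ["blue", "red"], "size": [5.1, 0.9]})  # no label column

features = ["color", "size"]

# Fit the encoder on the training features only, and keep it.
enc = OrdinalEncoder()
X_train = train[features].copy()
X_train["color"] = enc.fit_transform(train[["color"]]).ravel()

alg = GaussianNB()
alg.fit(X_train, train["label"])

# Reuse the fitted encoder with transform (not fit_transform) on the new data.
X_new = new[features].copy()
X_new["color"] = enc.transform(new[["color"]]).ravel()

new_pred = alg.predict(X_new)
print(new_pred)
```

The new frame needs exactly the feature columns used for training, in the same order, and no placeholder for the label.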
I'm trying to get a prediction with a Naive Bayes classifier from a dataset that contains string values. The dataset has 14 columns, 12 of which hold string values.
I encoded the dataset using LabelEncoder and OneHotEncoder to prepare it for the Naive Bayes classifier.
dataset = pd.read_csv('D:\\\\CRC data set copies\\Testing1.csv')
columns = ['Age', 'Weight', 'Gender', 'Ethnic_Group', 'Religion', 'Smoking', 'Alchohol', 'Maritial_Status', 'Family_History', 'District', 'Blood_in_stools', 'Abnormal_Stomach_pain', 'Weight_Loss', 'Tiredness']
X = dataset[columns]
y = dataset['Class']
labelencoder_X = LabelEncoder() # encoding the categorical variables
# replacing the column0 categorical data with numeric values
for col in columns[2:]:
    X[col] = labelencoder_X.fit_transform(X[col])
onehotencoder = OneHotEncoder(categorical_features=[2,3,4,5,6,7,8,9,10,11,12,13])
# creating new columns and representing true by 1
X = onehotencoder.fit_transform(X).toarray()
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
then the model was created and saved.
joblib.dump(model, 'model_joblib')
# load the trained model using joblib
load_model = joblib.load('model_joblib')
predict = [[70,65,"M","b","s","Yes","Yes","MA","Y","kurunegala","P","P","P","P"]]
predict = pd.DataFrame(predict,columns=columns)
for col in columns[2:]:
    predict[col] = labelencoder_X.fit_transform(predict[col])
predict = onehotencoder.transform(predict).toarray()
print('\nNew predicted value: ', load_model.predict(predict))
I want to take user input and predict a result with the saved Naive Bayes model. I tried encoding the user input the same way, but it is not encoded to the same values as the dataset, so the prediction is wrong.
Can someone help me encode the user input to the same values the dataset was encoded with?
In these lines:
for col in columns[2:]:
    predict[col] = labelencoder_X.fit_transform(predict[col])
Try using labelencoder_X.transform(predict[col]) instead of fit_transform.
During training and test you obviously must use the exact same encoding. If you use a new encoding for test, the results will be all wrong.
Now it can happen that some value was never seen during training. You need to decide what to do then. For example, you could introduce an "unknown" value for previously unseen values.
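One way to sketch the "unknown" idea with LabelEncoder is to include a placeholder class when fitting and map anything unseen to it before calling transform. The placeholder name, the helper function, and the data below are all hypothetical:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

UNKNOWN = "unknown"  # hypothetical placeholder label, not part of scikit-learn

# Invented training values for a single categorical column.
train_values = np.array(["M", "F", "M", "F"])

le = LabelEncoder()
# Fit on the training values plus the placeholder so it gets its own code.
le.fit(np.append(train_values, UNKNOWN))

def encode_with_unknown(encoder, values):
    """Encode values with a fitted encoder, mapping unseen values to UNKNOWN."""
    known = set(encoder.classes_)
    cleaned = [v if v in known else UNKNOWN for v in values]
    return encoder.transform(cleaned)  # transform only, never fit_transform

encoded = encode_with_unknown(le, ["M", "X", "F"])
print(encoded)
```

With several categorical columns, one fitted encoder per column would be needed, since a single LabelEncoder only remembers the classes from its most recent fit.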