How to test one-hot vs. label encoding in gradient boosting (scikit-learn, Python)

I've done some research on this matter. Most people debate which one to use in gradient boosting. https://www.kaggle.com/c/home-credit-default-risk/discussion/59873 Most Kagglers there explain that label encoding is used for ordinal values and one-hot encoding for categorical values.
For example: genius, smart, less-smart should be label encoded into genius=3, smart=2, and less-smart=1.
However, according to the sklearn documentation, label encoding is only meant for target values: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
So I trained a model twice, converting one of my features with a label encoder and then with one-hot encoding, to see which would predict better.
(According to the sklearn docs, LabelEncoder should be used for target values. In my case, is it wrong that I used it on one of my independent variables (X)?)
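(Side note, not from the original post: for feature columns, scikit-learn also provides OrdinalEncoder, which plays the role for features that LabelEncoder plays for the target. A minimal sketch, assuming the same data_set as below:)
from sklearn.preprocessing import OrdinalEncoder

# OrdinalEncoder is the feature-oriented counterpart of LabelEncoder;
# it expects 2D input, hence the double brackets around the column name.
ordinal_encoder = OrdinalEncoder()
data_set[['region']] = ordinal_encoder.fit_transform(data_set[['region']])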
First, using the label encoder:
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(data_set["region"])
data_set['region'] = label_encoder.transform(data_set["region"])
X, y = data_set.iloc[:, 1:6], data_set.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=0)
gbr_le = GradientBoostingRegressor(
    n_estimators=1000,
    learning_rate=0.1,
    random_state=0
)
model = gbr_le.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f'RMSE with label encoding (best_num_trees) = {np.sqrt(metrics.mean_squared_error(y_test, y_pred))}')
>>> RMSE with label encoding (best_num_trees) = 4.881378370139346
Now, using one-hot encoding:
regions = pd.get_dummies(data_set_onehot.region)
data_set_onehot.drop(columns=['region'], inplace=True)
X, y = data_set_onehot.iloc[:, 1:5], data_set_onehot.iloc[:,-1]
X = pd.concat([X, regions], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=0)
gbr_onehot = GradientBoostingRegressor(
    n_estimators=1000,
    learning_rate=0.1,
    random_state=0
)
model = gbr_onehot.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f'RMSE with one-hot encoding (best_num_trees) = {np.sqrt(metrics.mean_squared_error(y_test, y_pred))}')
>>> RMSE with one-hot encoding (best_num_trees) = 4.86943654679908
This time, the model trained on the one-hot features gave a better RMSE score. But I don't think I can judge from a single run, since the RMSE changes depending on the random_state passed to train_test_split or to GradientBoostingRegressor.
What would be the best way to compare the accuracy of two models trained with different feature processing?
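(One option, offered as a sketch rather than a definitive answer: compare the two encodings with repeated k-fold cross-validation instead of a single split, so the result does not hinge on one random_state. X_le and X_onehot below are assumed to be the label-encoded and one-hot feature matrices built above, with y the common target:)
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# 5-fold CV repeated 10 times gives 50 RMSE estimates per encoding
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
for name, X_enc in [('label', X_le), ('one-hot', X_onehot)]:
    gbr = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.1,
                                    random_state=0)
    scores = cross_val_score(gbr, X_enc, y, cv=cv,
                             scoring='neg_root_mean_squared_error')
    print(f'{name}: RMSE = {-scores.mean():.4f} +/- {scores.std():.4f}')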

Related

Unable to convert array of bytes/strings into decimal numbers with dtype='numeric'

I am working on a classification model in which I use the logistic regression algorithm. I got as a result: Prediction LR": "['AnxiousPersonalityDisorder']". Now I need to calculate the probability of this result, and I am having a lot of problems.
Here is the code, in case anyone has an idea of the source of the problem:
# code in colab notebook
x = df.text.values.tolist()
y = df.label.values.tolist()
vectorizer = CountVectorizer()
data1 = vectorizer.fit_transform(x)
x_train, x_test, y_train, y_test = train_test_split(data1, y, test_size=0.2, random_state=32,stratify=y)
#Training model
lr = LogisticRegression(random_state=40)
lr.fit(x_train, y_train)
y_pred_train = lr.predict(x_train)
y_pred_test = lr.predict(x_test)
I use FastAPI for the deployment:
class EmotionAdoPrediction():
    def __init__(self):
        # Note: for model_path you should make full path (see docker volume)
        self.vector = load("C:/Users/aabid/PycharmProjects/emotion-detection-multilabels/model1/Lrvector.joblib")
        self.model = load("C:/Users/aabid/PycharmProjects/emotion-detection-multilabels/model1/LRClassifier.joblib")
        self.classes_names = {0: ["angry"], 1: ["joy"], 2: ["sadness"], 3: ["fear"]}
        return

    def predict(self, text):
        text = self.data_cleaning(text)
        text_clean = self.standardization(text)
        probs = self.model.predict([[text_clean]])[:, 1]
        proba = np.max(probs[0])
        class_ind = np.argmax(probs[0])
        return self.classes_names[class_ind], proba
I suppose that you use the scikit-learn package for logistic regression. If this is the case, then you can calculate the probability of the result. The LogisticRegression class contains the predict_log_proba(X) and predict_proba(X) functions.
According to the documentation:
predict_log_proba(X): Predict logarithm of probability estimates.
predict_proba(X): For a multi_class problem, if multi_class is set to be “multinomial” the softmax function is used to find the predicted probability of each class.
Use the predict_proba function, because you have a multi-class problem.
y_prob = lr.predict_proba(x_test)
EDIT: Also encode your labels from categorical values to numerical values. Use pandas.get_dummies to convert the categorical values.
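(Applied to the deployment class above, the predict method could look roughly like the sketch below. It assumes the saved vectorizer must transform the cleaned text before the classifier sees it, since the model was trained on count vectors rather than raw strings, and that numpy is imported as np:)
def predict(self, text):
    text = self.data_cleaning(text)
    text_clean = self.standardization(text)
    # the classifier was trained on vectorized text, so transform first
    features = self.vector.transform([text_clean])
    probs = self.model.predict_proba(features)[0]  # one probability per class
    class_ind = int(np.argmax(probs))
    return self.classes_names[class_ind], float(probs[class_ind])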

How to use cross validation with sm.GLM (sm = statsmodels.api)?

I am trying to fill in the "model" parameter of cross_val_score. However, since I need to use sm.GLM, I don't know how to do it.
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.20, random_state = 1)
n = 5
df_cut, bins = pd.cut(X_train, n, retbins=True, right=True)
df_steps = pd.concat([X_train, df_cut, y_train], keys=['age','age_cuts','wage'], axis=1)
# Create dummy variables for the age groups
df_steps_dummies = pd.get_dummies(df_cut)
# Fitting Generalised linear models
fit3 = sm.GLM(df_steps.wage, df_steps_dummies).fit()
# Binning validation set into same 5 bins
bin_mapping = np.digitize(X_test, bins, right=True)
X_valid = pd.get_dummies(bin_mapping)
# Removing any outliers
# X_valid = pd.get_dummies(bin_mapping).drop([6], axis=1)# Prediction
pred2 = fit3.predict(X_valid)
# Calculating RMSE
rms = sqrt(mean_squared_error(y_test, pred2))
print(rms)
scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error',
                         cv=cv, n_jobs=-1)
Therefore, instead of computing the RMSE as above, I would like to use cross_val_score. For example, for the model parameter, if I wanted to use lasso I would put model = Lasso(). However, here I cannot put model = sm.GLM().
I hope it is clear...
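(One common workaround is to wrap sm.GLM in a minimal sklearn-style estimator so that cross_val_score can clone and fit it. A sketch, with the SMWrapper class name and the default Gaussian family as illustrative assumptions:)
import statsmodels.api as sm
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import cross_val_score

class SMWrapper(BaseEstimator, RegressorMixin):
    """Adapts a statsmodels GLM to the sklearn estimator interface."""
    def __init__(self, family=None):
        self.family = family

    def fit(self, X, y):
        # default to a Gaussian family if none is given
        family = self.family if self.family is not None else sm.families.Gaussian()
        self.results_ = sm.GLM(y, X, family=family).fit()
        return self

    def predict(self, X):
        return self.results_.predict(X)

# one (negative) MSE per fold, as in the cross_val_score call above
scores = cross_val_score(SMWrapper(), X, y, scoring='neg_mean_squared_error', cv=5)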

Selecting number of features for RFE model in Python

I have a dataset with more than 120 features, and I want to use RFE for selecting which features / column names I should use.
I have a problem because RFE is very slow. My code looks like this:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
full_df = pd.read_csv('data.csv')
x = full_df.iloc[:,:-1]
y = full_df.iloc[:,-1]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 42)
model = LogisticRegression(solver ='lbfgs')
for i in range(1, 120):
    rfe = RFE(model, i)
    fit = rfe.fit(x_train, y_train)
    acc = fit.score(x_test, y_test)
    print(acc)
    print(fit.support_)
My problem is this line: rfe = RFE(model, i). I do not know what the best number for i is, which is why I loop over for i in range(1,120). Is there any better way to do this? Is there a better function in scikit-learn that can help me determine the number of features, and the names of those features?
Because this took too long, I changed my approach, and I want to hear what you think about it, whether it is a good/correct approach.
First I ran PCA, and I found that each column contributes around 0.4-1% of the explained variance, except the last 9 columns, which contribute less than 0.00001%, so I removed them. Now I have 121 features.
pca = PCA()
fit = pca.fit(x)
Then I split my data into train and test (with 121 features).
Then I used SelectFromModel, and I tested it with 4 different classifiers. Each classifier in SelectFromModel reduced the number of columns. I chose the number of columns determined by the classifier that gave me the best accuracy:
model = SelectFromModel(clf, prefit=True)
#train_score = clf.score(x_train, y_train)
test_score = clf.score(x_test, y_test)
column_res = model.transform(x_train).shape
And finally I used RFE, with the number of columns I got from SelectFromModel:
rfe = RFE(model, number_of_columns)
fit = rfe.fit(x_train, y_train)
acc = fit.score(x_test, y_test)
Is this a good approach, or did I do something wrong?
Also, if I got the best accuracy in SelectFromModel with one classifier, do I need to use the same classifier in RFE?
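(On the original question of choosing i: scikit-learn's RFECV runs RFE inside cross-validation and picks the number of features automatically, so the manual loop is not needed. A minimal sketch, reusing x_train and y_train from the first snippet; max_iter=1000 is an added convergence safeguard:)
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# RFECV cross-validates every candidate feature count and keeps the best one
model = LogisticRegression(solver='lbfgs', max_iter=1000)
rfecv = RFECV(estimator=model, step=1, cv=5, scoring='accuracy', n_jobs=-1)
rfecv.fit(x_train, y_train)

print('Optimal number of features:', rfecv.n_features_)
print('Selected columns:', list(x_train.columns[rfecv.support_]))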

cross-validation with Kfold

I'm trying to use three binary explanatory variables from a banking history (default, housing, and loan) to predict a binary response variable, using a logistic regression classifier.
I have the following dataset:
# mapping function to convert text no/yes to integer 0/1
convert_to_binary = {'no' : 0, 'yes' : 1}
default = bank['default'].map(convert_to_binary)
housing = bank['housing'].map(convert_to_binary)
loan = bank['loan'].map(convert_to_binary)
response = bank['response'].map(convert_to_binary)
I added my three explanatory variables and the response to an array:
data = np.array([np.array(default), np.array(housing), np.array(loan),np.array(response)]).T
kfold = KFold(n_splits=3)
scores = []
for train_index, test_index in kfold.split(data):
    X_train, X_test = data[train_index], data[test_index]
    y_train, y_test = response[train_index], response[test_index]
    model = LogisticRegression().fit(X_train, y_train)
    pred = model.predict(data[test_index])
    results = model.score(X_test, y_test)
    scores.append(results)
print(np.mean(scores))
My accuracy is always 100%, which I know is not correct. Shouldn't the accuracy be somewhere around 50-65%?
Is there something I'm doing wrong?
The split is not correct. Here is the correct split:
X_train, X_labels = data[train_index], response[train_index]
y_test, y_labels = data[test_index], response[test_index]
model = LogisticRegression().fit(X_train, X_labels)
pred = model.predict(y_test)
acc = sklearn.metrics.accuracy_score(y_labels,pred,normalize=True)
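(Note also that data as built in the question contains the response as its last column, so a model fit on data[train_index] sees the label among its features; that alone explains the 100% accuracy. A minimal leak-free sketch, assuming the mapped Series default, housing, loan, and response from the question:)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X = np.column_stack([default, housing, loan])  # features only, no response
y = response.values

scores = []
for train_index, test_index in KFold(n_splits=3).split(X):
    model = LogisticRegression().fit(X[train_index], y[train_index])
    pred = model.predict(X[test_index])
    scores.append(accuracy_score(y[test_index], pred))
print(np.mean(scores))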

My r-squared score is negative but my accuracy score using k-fold cross-validation is about 92%

For the code below, my r-squared score comes out negative, but my accuracy score using k-fold cross-validation comes out to about 92%. How is this possible? I'm using the random forest regression algorithm to predict some data. The link to the dataset is given below:
https://www.kaggle.com/ludobenistant/hr-analytics
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
dataset = pd.read_csv("HR_comma_sep.csv")
x = dataset.iloc[:,:-1].values ##Independent variable
y = dataset.iloc[:,9].values ##Dependent variable
##Encoding the categorical variables
le_x1 = LabelEncoder()
x[:,7] = le_x1.fit_transform(x[:,7])
le_x2 = LabelEncoder()
x[:,8] = le_x1.fit_transform(x[:,8])
ohe = OneHotEncoder(categorical_features = [7,8])
x = ohe.fit_transform(x).toarray()
##splitting the dataset in training and testing data
from sklearn.cross_validation import train_test_split
y = pd.factorize(dataset['left'].values)[0].reshape(-1, 1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)
print(y_pred)
from sklearn.metrics import r2_score
r2_score(y_test , y_pred)
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = regressor, X = x_train, y = y_train, cv = 10)
accuracies.mean()
accuracies.std()
There are several issues with your question...
For starters, you are making a very basic mistake: you think you are using accuracy as a metric, while you are in a regression setting and the actual metric used underneath is the mean squared error (MSE).
Accuracy is a metric used in classification, and it has to do with the percentage of correctly classified examples - check the Wikipedia entry for more details.
The metric used internally in your chosen regressor (Random Forest) is included in the verbose output of your regressor.fit(x_train, y_train) command - notice the criterion='mse' argument:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_split=1e-07, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
                      verbose=0, warm_start=False)
MSE is a positive continuous quantity, and it is not upper-bounded by 1, i.e. if you got a value of 0.92, this means... well, 0.92, and not 92%.
Knowing that, it is good practice to include explicitly the MSE as the scoring function of your cross-validation:
cv_mse = cross_val_score(estimator = regressor, X = x_train, y = y_train, cv = 10, scoring='neg_mean_squared_error')
cv_mse.mean()
# -2.433430574463703e-28
For all practical purposes, this is zero - you fit the training set almost perfectly; for confirmation, here is the (perfect again) R-squared score on your training set:
train_pred = regressor.predict(x_train)
r2_score(y_train , train_pred)
# 1.0
But, as always, the moment of truth comes when you apply your model on the test set; your second mistake here is that, since you train your regressor with scaled y_train, you should also scale y_test before evaluating:
y_test = sc_y.fit_transform(y_test)
r2_score(y_test , y_pred)
# 0.9998476914664215
and you get a very nice R-squared in the test set (close to 1).
What about the MSE?
from sklearn.metrics import mean_squared_error
mse_test = mean_squared_error(y_test, y_pred)
mse_test
# 0.00015230853357849051
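(If an error on the same scale as the scaled target is preferred, the RMSE is just the square root of this value:)
rmse_test = np.sqrt(mse_test)  # ~0.0123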
