XGBoost model on imbalanced data using StratifiedKFold techniques - Python

I am trying to build an XGBoost model for a binary classification problem on an imbalanced dataset, with Class 1 (21k records) being the minority and Class 2 (~1M records) being the majority.
I am not a big fan of SMOTE or plain upsampling/downsampling, because downsampling can throw away data, while upsampling (augmented data) can introduce bias or overfitting.
I want to build my model something like below.
I want to create stratified K-folds that maintain the class ratio in each fold,
so if I use 48 splits, I will have folds where Class 1 and Class 2 are roughly equal.
from sklearn import model_selection
kf = model_selection.StratifiedKFold(n_splits=48)
By doing this I feel that each of my models will be trained on a diverse dataset.
Questions:
Now how exactly do I merge all of these weak learner models (48 models) and make predictions with the final ensemble model (strong learner)?
Is this a good idea for working with such imbalanced data, or will just setting "scale_pos_weight" help me here?
I also thought of something like below (but I feel it's not the best way):
pred_logreg = logreg.predict_proba(xfold1)[:, 1]
pred_rf = rf.predict_proba(xfold1)[:, 1]
pred_xgbc = xgbc.predict_proba(xfold1)[:, 1]
avg_pred = (pred_logreg + pred_rf + pred_xgbc) / 3
any other ways here?
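For reference, a minimal sketch of the "scale_pos_weight" option mentioned above, using the class counts from the question (Class 1 as the positive/minority class); X_train and y_train are hypothetical arrays:
from xgboost import XGBClassifier

n_pos, n_neg = 21_000, 1_000_000
clf = XGBClassifier(scale_pos_weight=n_neg / n_pos)  # ~48: up-weights the minority class in the loss
# clf.fit(X_train, y_train)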

Related

How to get k-fold cross validation final model with sklearn

Once I have iterated over each training combination, given the k-fold split, I can estimate the mean and standard deviation of the models' performance, but I actually get k different models (each with its own fitted parameters). How do I get the final, whole model? Is it a matter of averaging the parameters?
Not showing code because this is a general question, so I'll write down the logic only:
dataset
splitting dataset according to the k-fold theory (let's say k = 5)
iterations: training from the first to the fifth model
getting 5 different models with, let's say, the following parameters:
model_1 = [p10, p11, p12] \
model_2 = [p20, p21, p22] |
model_3 = [p30, p31, p32] > param_matrix
model_4 = [p40, p41, p42] |
model_5 = [p50, p51, p52] /
What about model_final: [pf0, pf1, pf2]?
Too trivial solution 1:
model_final = mean(param_matrix, axis=0)
Too trivial solution 2:
model_final = the one of the five that reaches the highest performance
(could be an overfit rather than the optimal one)
First of all, the purpose of cross-validation (K-fold) is model checking, not model building.
In your question, you said that every fold of your program has different parameters; maybe this is not the best way to work.
One way to proceed is to evaluate every model (each one with different parameters) using K-fold internally (e.g. with GridSearchCV); if you obtain similar values of accuracy or other metrics in each split, then you are not overfitting.
Apply this methodology to every model you have, and choose the one for which you obtain the best results. Of course, there is always the possibility of overfitting, but with K-fold you reduce it.
Finally, once you have checked with cross-validation that you obtain similar metrics for every split and you have chosen the model parameters, train your model on all your training data; you will then obtain one single, final model.
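A minimal sketch of that procedure, using a hypothetical RandomForestClassifier and parameter grid; X_train and y_train are the full training data:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 10]}  # illustrative grid
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,        # K-fold is used here for checking/selecting, not for building
    refit=True,  # after selection, refit once on all of X_train, y_train
)
search.fit(X_train, y_train)
final_model = search.best_estimator_  # the single, final model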

SVM testing - normalization of test data [duplicate]

This question already has an answer here:
what is the difference between fit() ,fit_transform() and transform() in scikit_learn? [duplicate]
(1 answer)
Closed 1 year ago.
I'm working with an SVM model to classify 5 different classes (N1, N2, N3, W, R).
Feature extraction -> Data normalization -> train SVM
When I tested the model (usual 80/20 train-test split), it showed high accuracy.
But when I tried testing with a completely new dataset, with the same method of
Feature extraction -> Data normalization -> test on trained SVM model
it came out really badly.
Let's say the original dataset used in training is A, and the new test dataset is B.
When I trained the model only with A and tested on B, it came out really badly.
First I thought it was model overfitting, so I included both A and B to train the model and tested with B. It came out badly again...
I think the problem is the normalization process. It eventually worked when I tried a new dataset C, but this time I took the training data A, concatenated A+C to normalize, and then cut only the C dataset out of it. When I compared that with dataset C normalized alone, the values were different.
I used MinMaxScaler from sklearn.
Mathematically speaking, of course it's different, because every dataset has a different minimum and maximum, and the normalized data will be different when mixed with other data.
My question is: when you test with a new dataset, is it normal to bring in the train dataset, normalize them together, and then take out only the test part? It's like mixing A (112x12) and B (15x12) -> normalizing (127x12) together -> taking out (15x12).
Or should I start by fixing the code for feature extraction and training the SVM?
(I attached the code; each feature has a 12x1 shape, which means each stage is a 12xN matrix.)
import pandas as pd
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()
# Load training data
N1_train = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Train_N1_features")
N2_train = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Train_N2_features")
N3_train = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Train_N3_features")
W_train = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Train_W_features")
R_train = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Train_R_features")
# Load test data
N1_test = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Test_N1_features")
N2_test = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Test_N2_features")
N3_test = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Test_N3_features")
W_test = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Test_W_features")
R_test = pd.read_pickle("C:/Users/User/Desktop/EWHADATASETS/Features/Test_R_features")
# normalize with original raw features and take only test out
N1_scaled_test = features.normalize_together(N1_test, N1_train, "N1")
N2_scaled_test = features.normalize_together(N2_test, N2_train, "N2")
N3_scaled_test = features.normalize_together(N3_test, N3_train, "N3")
W_scaled_test = features.normalize_together(W_test, W_train, "W")
R_scaled_test = features.normalize_together(R_test, R_train, "R")
def normalize_together(test, raw, stage_no):
    together = pd.concat([test, raw], ignore_index=True)
    scaled_test = pd.DataFrame(scaler.fit_transform(together.iloc[:, :-1]))
    scaled_test['label'] = "{}".format(stage_no)
    scaled_test = scaled_test.iloc[0:test.shape[0], :]
    return scaled_test
Test data should remain unseen during training (this includes preprocessing): don't use test + train data together to compute a common normalisation factor. Normalise the training set; separately, normalise the test set.
Why? It's vital to use an unseen test partition to evaluate your trained model. Otherwise you have not tested your model's ability to generalise - imagine playing a game of cards where you already have prior knowledge of the cards or the order of the deck.
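One common way to implement this with the question's MinMaxScaler is to fit the scaler on the training data and then apply the same fitted scaler to the test data; X_train and X_test here are hypothetical feature matrices:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training min/max on the test data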

Drastically different feature importance between very same data and very similar model for catboost

Let me first explain the dataset that I am using.
I have three sets:
Train set with shape (1277, 927); the target is present about 12% of the time
Eval set with shape (174, 927); the target is present about 11.5% of the time
Hold-out set with shape (414, 927); the target is present about 10% of the time
The sets are built using time slices: the train set is the oldest data, the hold-out set is the newest data, and the eval set is in between.
Now I am building two models.
Model1:
# Initialize CatBoostClassifier
model = CatBoostClassifier(
    # custom_loss=['Accuracy'],
    depth=9,
    random_seed=42,
    l2_leaf_reg=1,
    # has_time= True,
    iterations=300,
    learning_rate=0.05,
    loss_function='Logloss',
    logging_level='Verbose',
)
## Fitting catboost model
model.fit(
    train_set.values, Y_train.values,
    cat_features=categorical_features_indices,
    eval_set=(test_set.values, Y_test)
    # logging_level='Verbose' # you can uncomment this for text output
)
Then I predict on the hold-out set.
Model2:
model = CatBoostClassifier(
    # custom_loss=['Accuracy'],
    depth=9,
    random_seed=42,
    l2_leaf_reg=1,
    # has_time= True,
    iterations='bestIteration from model1',
    learning_rate=0.05,
    loss_function='Logloss',
    logging_level='Verbose',
)
## Fitting catboost model
model.fit(
    train.values, Y.values,
    cat_features=categorical_features_indices,
    # logging_level='Verbose' # you can uncomment this for text output
)
Both models are identical except for iterations. The first model has a fixed 300 rounds, but it shrinks the model to bestIteration. The second model uses that bestIteration from model1.
However, when I compare the feature importances, they look drastically different.
Feature Score_m1 Score_m2 delta
0 x0 3.612309 2.013193 -1.399116
1 x1 3.390630 3.121273 -0.269357
2 x2 2.762750 1.822564 -0.940186
3 x3 2.553052 NaN NaN
4 x4 2.400786 0.329625 -2.071161
As you can see, feature x3, which was in the top 3 in the first model, dropped out in the second model. Not only that, but there is a large shift in weights between the models for a given feature. There are about 60 features present in model1 that are not present in model2, and about 60 features present in model2 that are not present in model1. delta is the difference between Score_m1 and Score_m2. I have seen a model's scores change a little, but never this drastically. AUC and LogLoss don't change that much whether I use model1 or model2.
Now I have the following questions regarding this situation:
Are these models unstable due to the small number of samples and large number of features? If so, how do I check for this?
Are there features in this model that just don't give much information about the model outcome, so that a random change is creating the split? If so, how do I check for this situation?
Is CatBoost the right model for this situation?
Any help regarding this issue will be appreciated.
Yes. Trees in general are somewhat unstable. If you remove the least important feature, you can get a very different model.
Having more data reduces this tendency.
Having more features increases this tendency.
Tree algorithms are random by nature, so the results will be different.
Things to try:
Run the model a large number of times but with different random seeds, and use the results to determine which features seem to be the least important; a sketch of this is shown after this list. (How many features do you have?)
Try to balance your training set. This might require you to upsample the rarer cases.
Get more data. Maybe you'll have to combine your train and test set and use the holdout as the test.
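A minimal sketch of the first suggestion, reusing the settings and variable names from the question (train_set, Y_train, categorical_features_indices):
import numpy as np
from catboost import CatBoostClassifier

importances = []
for seed in range(10):
    model = CatBoostClassifier(
        depth=9,
        random_seed=seed,
        l2_leaf_reg=1,
        iterations=300,
        learning_rate=0.05,
        loss_function='Logloss',
        logging_level='Silent',
    )
    model.fit(train_set.values, Y_train.values,
              cat_features=categorical_features_indices)
    importances.append(model.get_feature_importance())

# Features whose importance varies wildly across seeds are likely splitting on noise.
mean_imp = np.mean(importances, axis=0)
std_imp = np.std(importances, axis=0)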

Imbalanced Dataset Using Keras

I am building a classifying ANN with Python and the Keras library. I am training the NN on an imbalanced dataset with 3 different classes. Class 1 is about 7.5 times as prevalent as Classes 2 and 3. As a remedy, I took the advice of this stackoverflow answer and set my class weights as such:
class_weight = {0: 1,
                1: 6.5,
                2: 7.5}
However, here is the problem: The ANN is predicting the 3 classes at equal rates!
This is not useful because the dataset is imbalanced, and predicting the outcomes as each having a 33% chance is inaccurate.
Here is the question: How do I deal with an imbalanced dataset so that the ANN does not predict Class 1 every time, but also so that the ANN does not predict the classes with equal probability?
Here is my code I am working with:
class_weight = {0: 1,
                1: 6.5,
                2: 7.5}

# Making the ANN
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout

classifier = Sequential()

# Adding the input layer and the first hidden layer with dropout
classifier.add(Dense(activation='relu',
                     input_dim=5,
                     units=3,
                     kernel_initializer='uniform'))
# Randomly drops 0.1, 10% of the neurons in the layer.
classifier.add(Dropout(rate=0.1))

# Adding the second hidden layer
classifier.add(Dense(activation='relu',
                     units=3,
                     kernel_initializer='uniform'))
# Randomly drops 0.1, 10% of the neurons in the layer.
classifier.add(Dropout(rate=0.1))

# Adding the output layer
classifier.add(Dense(activation='sigmoid',
                     units=2,
                     kernel_initializer='uniform'))

# Compiling the ANN
classifier.compile(optimizer='adam',
                   loss='binary_crossentropy',
                   metrics=['accuracy'])

# Fitting the ANN to the training set
classifier.fit(X_train, y_train, batch_size=100, epochs=100, class_weight=class_weight)
The most evident problem that I see with your model is that it is not properly structured for classification.
If your samples can belong to only one class at a time, then you should not overlook this fact by having a sigmoid activation as your last layer.
Ideally, the last layer of a classifier should output the probability of a sample belonging to each class, i.e. (in your case) an array [a, b, c] where a + b + c == 1.
If you use a sigmoid output, then the output [1, 1, 1] is possible, although it is not what you are after. This is also the reason why your model is not generalizing properly: given that you're not specifically training it to prefer "unbalanced" outputs (like [1, 0, 0]), it will default to predicting the average values that it sees during training, accounting for the reweighting.
Try changing the activation of your last layer to 'softmax' and the loss to 'categorical_crossentropy':
# Adding the output layer
classifier.add(Dense(activation='softmax',
                     units=3,  # one output unit per class (3 classes here)
                     kernel_initializer='uniform'))

# Compiling the ANN
classifier.compile(optimizer='adam',
                   loss='categorical_crossentropy',
                   metrics=['accuracy'])
If this doesn't work, see my other comment and get back to me with that info, but I'm pretty confident that this is the main problem.
Cheers
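One detail to note: 'categorical_crossentropy' expects one-hot encoded targets, so if y_train currently holds integer class labels, a minimal sketch of the conversion (assuming 3 classes) would be:
from keras.utils import to_categorical

y_train_onehot = to_categorical(y_train, num_classes=3)  # integer labels -> one-hot vectors
classifier.fit(X_train, y_train_onehot, batch_size=100, epochs=100,
               class_weight=class_weight)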
Imbalanced datasets (where classes are uneven or unequally distributed) are a prevalent problem in classification. For example, one class label has a very high number of observations, and the other has a pretty low number of observations. Significant causes of data imbalance include:
Faulty data collection
Domain peculiarity – some domains naturally produce imbalanced datasets.
Imbalanced datasets can create many problems in classification, hence the need to improve the data so that models are robust and perform better.
Here are several methods to bring balance to imbalanced datasets:
Undersampling – works by resampling the majority class points in a dataset to match or make them equal to the minority class points. It brings equilibrium between the majority and minority classes so that the classifier gives equal importance to both. However, it's important to note that undersampling may cause some loss of information and therefore weaker results.
Oversampling – Also known as upsampling, oversampling resamples the minority class to equal the total number of majority class points. It replicates the observations from minority class points to balance datasets.
Synthetic Minority Oversampling Technique – As the name suggests, the SMOTE technique uses oversampling to create artificial data points for minority classes. It creates new instances between the attributes of the minority class, which are synthesized from existing data.
Searching for an optimal value from a grid – This technique involves predicting probabilities for a particular class label and then finding the optimum threshold to map those probabilities to the correct class label.
Using the BalancedBaggingClassifier – The BalancedBaggingClassifier allows you to resample each subset of the dataset before training each estimator, creating balanced bags of data (see the sketch after this list).
Use different algorithms – Some algorithms aren’t effective in restoring balance in imbalanced datasets. Sometimes it’s wise to try different algorithms to stand a better chance at creating a balanced dataset and improving performance. For instance, you can employ regularization or penalized models to punish the wrong predictions on the minority class.
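A minimal sketch of the BalancedBaggingClassifier approach from the list above (imbalanced-learn package); X and y are hypothetical feature and label arrays:
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.tree import DecisionTreeClassifier

clf = BalancedBaggingClassifier(
    DecisionTreeClassifier(),   # base estimator trained on each resampled bag
    sampling_strategy='auto',   # resample each bag so the classes are balanced
    replacement=False,
    random_state=42,
)
clf.fit(X, y)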
The effects of imbalanced datasets can be significant. Hopefully, one of the approaches above can help you get in the right direction.
To test which approach works best for you, I'd suggest using deepchecks, an awesome open-source Python package for validating data and models quickly.

Can you use counts in sklearn logistic regression input?

So, I know that in R you can provide data for a logistic regression in this form:
model <- glm( cbind(count_1, count_0) ~ [features] ..., family = 'binomial' )
Is there a way to do something like cbind(count_1, count_0) with sklearn.linear_model.LogisticRegression? Or do I actually have to provide all those duplicate rows? (My features are categorical, so there would be a lot of redundancy.)
If they are categorical, you should provide a binarized version of them. I don't know how that code in R works, but you should always binarize your categorical features, because you have to emphasize that each value of your feature is not related to the others, i.e. for a feature "blood_type" with possible values 1, 2, 3, 4 your classifier must learn that 2 is not related to 3, and 4 is not related to 1 in any sense. This is achieved by binarization.
If you have too many features after binarization, you can reduce the dimensionality of the binarized dataset with FeatureHasher or more sophisticated methods like PCA.
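A minimal sketch of that binarization step before sklearn's LogisticRegression; the "blood_type" feature and the tiny DataFrame here are hypothetical:
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({'blood_type': [1, 2, 3, 4, 2, 1],
                   'y': [0, 1, 0, 1, 1, 0]})
X = pd.get_dummies(df['blood_type'], prefix='blood_type')  # one binary column per category
model = LogisticRegression().fit(X, df['y'])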
