Ensembling Technique - python

I want to assign weights to multiple models and combine them into a single ensemble model. I want to use their outputs as the input to a new machine learning algorithm, and have that algorithm learn the correct weights.
But how do I feed the output of multiple models into a new ML algorithm when I am getting output like this?
preds1 = model1.predict_proba(xx)
[[0.28054154 0.35648097 0.32954868 0.03342881]
[0.20625692 0.30749627 0.37018309 0.11606372]
[0.28362306 0.33325501 0.34658685 0.03653508]
...
preds2 = model2.predict_proba(xx)
[[0.22153498 0.30271243 0.26420254 0.21155006]
[0.32327647 0.39197589 0.23899729 0.04575035]
[0.18440374 0.32447016 0.4736297 0.0174964 ]
...
How do I make a single DataFrame from the output of these two or more models?
The simplest way of doing this is given below, but I want to feed the outputs to a different ML algorithm that learns the weights.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

model = LogisticRegression()
model.fit(xx_train, yy_train)
preds1 = model.predict_proba(xx_test)

model = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
model.fit(xx_train, yy_train)
preds2 = model.predict_proba(xx_test)

# Each candidate weight is evaluated by calculating the corresponding score
# (weights is a grid of candidate values and scores_corr_wts an accumulator,
#  both defined elsewhere in the question's code)
for i in range(len(weights)):
    final_inner_preds = np.argmax(preds1*weights[i] + preds2*(1-weights[i]), axis=1)
    scores_corr_wts[i] += accuracy_score(yy_test, final_inner_preds)

In scikit-learn you can use the StackingClassifier; this should fit your need.
Create a list of your base model definitions:
base_models = [('SVC', LinearSVC(C=1)), ('RF', RandomForestClassifier(n_estimators=500))]
Instantiate your meta-learner:
meta_model = LogisticRegressionCV()
Instantiate the stacking model:
stacking_model = StackingClassifier(estimators=base_models, final_estimator=meta_model, passthrough=True, cv=3)
Fit and predict:
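For completeness, the missing step (StackingClassifier and RandomForestClassifier come from sklearn.ensemble, LinearSVC from sklearn.svm, and LogisticRegressionCV from sklearn.linear_model; the xx_train/yy_train/xx_test names are taken from the question's code):
stacking_model.fit(xx_train, yy_train)   # base models are cross-validated internally
preds = stacking_model.predict(xx_test)  # the meta-learner combines their outputs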

Related

How to get k-fold cross validation final model with sklearn

Once I have iterated over each training combination, given the k-fold split, I can estimate the mean and standard deviation of the models' performance, but I actually get k different models (each with its own fitted parameters). How do I get the final, whole model? Is it a matter of averaging the parameters?
I'm not showing code because this is a general question, so I'll write down the logic only:
dataset
splitting the dataset according to the k-fold scheme (let's say k = 5)
iterations: training the models from the first to the fifth
getting 5 different models with, let's say, the following parameters:
model_1 = [p10, p11, p12] \
model_2 = [p20, p21, p22] |
model_3 = [p30, p31, p32] > param_matrix
model_4 = [p40, p41, p42] |
model_5 = [p50, p51, p52] /
What about model_final: [pf0, pf1, pf2]?
Too trivial solution 1:
model_final = mean(param_matrix, axis=0)
Too trivial solution 2:
model_final = the one of the five that reaches the highest performance
(which could be an overfit rather than the optimal one)
First of all, the purpose of cross-validation (K-fold) is model checking, not model building.
In your question you said that every fold of your program has different parameters; this may not be the best way to work.
One possibility is to evaluate every model (each one with different hyperparameters) using K-fold internally (e.g. via GridSearchCV); if you obtain similar values of accuracy or other metrics in each split, then you are not overfitting.
Apply this methodology to every model you have, and choose the one with which you obtain the best results. Of course, there is always the possibility of overfitting, but K-fold reduces it.
Finally, once you have checked with cross-validation that you obtain similar metrics for every split and you have chosen the model parameters, train your model on all of your training data; you will then obtain one unique final model.
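As a minimal sketch of that workflow (the estimator, the parameter grid, and the X_train/y_train names here are illustrative assumptions):
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10]}                 # hypothetical parameter grid
search = GridSearchCV(SVC(), param_grid, cv=5)   # the K-fold check happens inside
search.fit(X_train, y_train)                     # refit=True (the default) retrains on
                                                 # all the data with the best parameters
final_model = search.best_estimator_             # one unique model, fitted on all X_train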

How to balance data while using data generators with keras? [duplicate]

I am trying to use Keras to fit a CNN model to classify two classes of data. I have an imbalanced dataset and I want to balance the data. I don't know whether I can use class_weight in model.fit_generator, and I wonder what happens if I use class_weight="balanced" in model.fit_generator.
The main code:
def generate_arrays_for_training(indexPat, paths, start=0, end=100):
    while True:
        from_ = int(len(paths)/100*start)
        to_ = int(len(paths)/100*end)
        for i in range(from_, int(to_)):
            f = paths[i]
            x = np.load(PathSpectogramFolder+f)
            x = np.expand_dims(x, axis=0)
            if 'P' in f:
                y = np.repeat([[0,1]], x.shape[0], axis=0)
            else:
                y = np.repeat([[1,0]], x.shape[0], axis=0)
            yield (x, y)

history = model.fit_generator(generate_arrays_for_training(indexPat, filesPath, end=75),
                              validation_data=generate_arrays_for_training(indexPat, filesPath, start=75),
                              steps_per_epoch=int((len(filesPath)-int(len(filesPath)/100*25))),
                              validation_steps=int((len(filesPath)-int(len(filesPath)/100*75))),
                              verbose=2,
                              epochs=15, max_queue_size=2, shuffle=True, callbacks=[callback])
If you don't want to change your data creation process, you can use class_weight in your generator-based fit. You can use a dictionary to set the class_weight and tune it by observation. For instance, suppose class_weight is not used and you have 50 examples for class 0 and 100 examples for class 1. Then the loss function weighs every example uniformly, so the model is biased toward the overrepresented class 1 and the underrepresented class 0 becomes the problem. But when you set:
class_weight = {0: 2, 1: 1}
the loss function now gives twice the weight to class 0. Misclassification of the underrepresented class is punished twice as heavily as before, so the model can handle the imbalanced data better.
Note that Keras expects class_weight to be a dictionary mapping class indices to weights; the scikit-learn-style string 'balanced' is not accepted, but you can compute balanced weights yourself. My suggestion is to create a dictionary like class_weight = {0: a1, 1: a2} and try different values for a1 and a2, so you can see the difference.
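A sketch of computing balanced weights automatically with scikit-learn's helper (the label extraction below is a hypothetical mirror of the 'P'-in-filename convention used by the generator):
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

labels = np.array([1 if 'P' in f else 0 for f in filesPath])  # hypothetical label extraction
weights = compute_class_weight('balanced', classes=np.unique(labels), y=labels)
class_weight = dict(enumerate(weights))  # e.g. {0: 1.5, 1: 0.75}
# then pass it to the generator-based fit call:
# model.fit_generator(..., class_weight=class_weight)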
Also, you can use undersampling methods for imbalanced data instead of using class_weight. Check Bootstrapping methods for that purpose.

How to train ML model on 2 columns to solve for classification?

I have three columns in a dataset on which I'm doing sentiment analysis (classes 0, 1, 2):
text thing sentiment
But the problem is that I can train my data on either text or thing only and get the predicted sentiment. Is there a way to train the data on both text & thing and then predict the sentiment?
Problem case (say):
|text thing sentiment
0 | t1 thing1 0
. |
. |
54| t1 thing2 2
This example tells us that the sentiment depends on the thing as well. I could concatenate the two columns one below the other and train on that, but it would be incorrect, as we wouldn't be giving the model any relationship between the two columns.
Also, my test set contains the two columns text and thing, for which I have to predict the sentiment according to the model trained on the two columns.
Right now I'm using the tokenizer and then the model below:
from keras.models import Sequential
from keras.layers import Embedding, SpatialDropout1D, LSTM, Dense

model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
Any pointers on how to proceed, or which model or coding manipulation to use?
You may want to shift to the Keras functional API and train a multi-input model.
According to Keras's creator, François Chollet, in his book Deep Learning with Python [Manning, 2017] (chapter 7, section 1):
Some tasks require multimodal inputs: they merge data coming from different input sources, processing each type of data using different kinds of neural layers. Imagine a deep-learning model trying to predict the most likely market price of a second-hand piece of clothing, using the following inputs: user-provided metadata (such as the item’s brand, age, and so on), a user-provided text description, and a picture of the item. If you had only the metadata available, you could one-hot encode it and use a densely connected network to predict the price. If you had only the text description available, you could use an RNN or a 1D convnet. If you had only the picture, you could use a 2D convnet. But how can you use all three at the same time? A naive approach would be to train three separate models and then do a weighted average of their predictions. But this may be suboptimal, because the information extracted by the models may be redundant. A better way is to jointly learn a more accurate model of the data by using a model that can see all available input modalities simultaneously: a model with three input branches.
I think the Concatenate functionality is the way to go in such a case, and the general idea should be as follows. Please tweak it according to your use case.
from keras.models import Model
from keras.layers import Input, Dense, Concatenate

### whatever preprocessing you may want to do
text_input = Input(shape=(1,))
thing_input = Input(shape=(1,))
### now bring them together
merged_inputs = Concatenate(axis=1)([text_input, thing_input])
### sample output layer
output = Dense(3)(merged_inputs)
### pass your inputs and outputs to the model
model = Model(inputs=[text_input, thing_input], outputs=output)
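To use it, compile and fit with one array per input branch (the data and label names below are hypothetical; as in the question's model, the output layer would typically get activation='softmax' for the 3 classes):
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# text_data and thing_data are hypothetical preprocessed arrays, one per input branch
model.fit([text_data, thing_data], sentiment_labels, epochs=10, batch_size=32)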
You have to take the multiple columns as a list and then merge them for training, after embedding and preprocessing the raw data.
Example:
import pandas as pd

train = pd.read_csv('COVID19 multifeature Emotion - 50 data.csv', nrows=49)
# This dataset has two text columns and different class labels
X_train_doctor_opinion = train["doctor-opinion"].str.lower()
X_train_patient_opinion = train["patient-opinion"].str.lower()
X_train = list(X_train_doctor_opinion) + list(X_train_patient_opinion)
Then preprocess and embed.
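A minimal sketch of that step, assuming Keras's Tokenizer and pad_sequences (the vocabulary size and sequence length are arbitrary assumptions):
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=10000)                 # hypothetical vocabulary size
tokenizer.fit_on_texts(X_train)
sequences = tokenizer.texts_to_sequences(X_train)
X_train_padded = pad_sequences(sequences, maxlen=100)  # hypothetical sequence length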

How can I transform CatBoost's raw prediction score (RawFormulaVal) into a probability?

For some objects from the catboost library (like the Python code export model - https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_catboostclassifier_save_model-docpage/), predictions (https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_apply_catboost_model-docpage/) will only give a so-called raw score per record (the parameter value is called "RawFormulaVal").
Other API functions also allow the result of a prediction to be a probability for the target class (https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_catboostclassifier_predict-docpage/) - the parameter value is called "Probability".
I would like to know
how this relates to probabilities (in the case of binary classification), and
whether it can be transformed into such a probability using the Python API (https://tech.yandex.com/catboost/doc/dg/concepts/python-quickstart-docpage/)?
The raw score from the catboost prediction function with type "RawFormulaVal" is the log-odds (https://en.wikipedia.org/wiki/Logit).
So if we apply the function exp(score) / (1 + exp(score)), we get the same probabilities as if we had used prediction type "Probability".
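A minimal sketch of that conversion for the binary case (the preds_raw name anticipates the sample code below; 1/(1+exp(-s)) is the numerically equivalent sigmoid form):
import numpy as np

# exp(s) / (1 + exp(s)) == 1 / (1 + exp(-s)), the logistic sigmoid
def raw_to_proba(raw_scores):
    return 1.0 / (1.0 + np.exp(-np.asarray(raw_scores)))

# probs = raw_to_proba(preds_raw)  # probability of the positive class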
Alternatively, the line model.predict_proba(eval_dataset) computes the probabilities directly.
The following sample code illustrates this:
from catboost import Pool, CatBoostClassifier, cv

train_dataset = Pool(data=X_train,
                     label=y_train,
                     cat_features=cat_features)
eval_dataset = Pool(data=X_valid,
                    label=y_valid,
                    cat_features=cat_features)

# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=30,
                           learning_rate=1,
                           depth=2,
                           loss_function='MultiClass')
# Fit model
model.fit(train_dataset)
# Get predicted classes
preds_class = model.predict(eval_dataset)
# Get predicted probabilities for each class
preds_proba = model.predict_proba(eval_dataset)
# Get predicted RawFormulaVal
preds_raw = model.predict(eval_dataset,
                          prediction_type='RawFormulaVal')

model.fit(train_dataset,
          use_best_model=True,
          eval_set=eval_dataset)
print("Count of trees in model = {}".format(model.tree_count_))
print(preds_proba)
print(preds_raw)

Using tensorflow or keras to build a NN model by feeding 'pairwise' samples

I'm trying to implement an NN model with pairwise samples. The details are as follows:
Original data:
X_org with shape of (100, 50) for example, namely 100 samples with 50 features.
Y_org with shape of (100, 1).
Processing these original data for real training:
Select 2 samples from X_org randomly (so we have 100*99/2 such combinations) to form a new 'pairwise' sample; the prediction target, namely the new y label, is the difference of the two corresponding y_org labels (Y_org_sample1 - Y_org_sample2). Now we have a new X_train and Y_train.
I need an NN model (DNN, CNN, LSTM, whatever...) into which I can pass the first sub-sample of a pairwise sample from X_train and get one result, and do the same for the second sub-sample. By calculating the difference of the two results, I get the prediction for this pairwise sample, which is then compared with the corresponding Y label from Y_train.
Overall, I need to train a model (update the weights) by feeding it a 'pairwise' sample (two successive sub-samples). The reason I don't choose a 'two-arm' model (e.g. merging two arms via xxx.sub()) is that I will only feed one sub-sample during the test process; in the end I will just use the model to predict a single sub-sample.
So I will use data in the X_train format during the training step, while using X_org-like data during the test step. It looks a bit complex.
It looks like TensorFlow would be more feasible for this task; if Keras also works, please kindly share your idea.
You can first create a model that will take only one X_org-like element:
# create a model the way you like it; it can be Functional API or Sequential, no problem
xOrgModel = createAModelForXOrgData(...)
Now, let's create a second model, this time necessarily using the functional API, that works with both inputs:
from keras.models import Model
from keras.layers import Input, Subtract
input1 = Input(shapeOfInput)
input2 = Input(shapeOfInput)
output1 = xOrgModel(input1)
output2 = xOrgModel(input2)
output = Subtract()([output1,output2])
pairWiseModel = Model([input1,input2],output)
Now you have two models: xOrgModel and pairWiseModel. You can use either of them depending on the task you are doing at the moment.
Both models share their weights. This means that you can train either of them and the other will be updated as well.
Using the pairwise model
First, organize your data in two separate arrays (because our model uses two inputs):
import numpy as np

L = len(X_org)
x1 = []
x2 = []
y = []
for i in range(L):
    for j in range(i+1, L):
        x1.append(X_org[i])
        x2.append(X_org[j])
        y.append(Y_org[i] - Y_org[j])

x1 = np.array(x1)
x2 = np.array(x2)
y = np.array(y)
Train and predict with a list of inputs:
pairWiseModel.fit([x1,x2],y,...)
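At test time, when only single sub-samples are available, use the shared inner model directly (X_test_single is a hypothetical array in the X_org format):
# training pairWiseModel also trained xOrgModel, since the weights are shared
single_preds = xOrgModel.predict(X_test_single)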
