Adjust custom loss function for gradient boosting classification - python

I have implemented a gradient boosting decision tree to do multiclass classification. My custom loss functions look like this:
import numpy as np
from sklearn.preprocessing import OneHotEncoder
def softmax(mat):
    res = np.exp(mat)
    res = np.multiply(res, 1/np.sum(res, axis=1, keepdims=True))
    return res
def custom_asymmetric_objective(y_true, y_pred_encoded):
    pred = y_pred_encoded.reshape((-1, 3), order='F')
    pred = softmax(pred)
    y_true = OneHotEncoder(sparse=False,categories='auto').fit_transform(y_true.reshape(-1, 1))
    grad = (pred - y_true).astype("float")
    hess = 2.0 * pred * (1.0-pred)
    return grad.flatten('F'), hess.flatten('F')
def custom_asymmetric_valid(y_true, y_pred_encoded):
    y_true = OneHotEncoder(sparse=False,categories='auto').fit_transform(y_true.reshape(-1, 1)).flatten('F')
    margin = (y_true - y_pred_encoded).astype("float")
    loss = margin*10
    return "custom_asymmetric_eval", np.mean(loss), False
Everything works, but now I want to adjust my loss function in the following way: it should "penalize" an item that is classified incorrectly, and an additional penalty should be added for a certain constraint (this is calculated beforehand; let's just say the penalty is e.g. 0.05, so just a real number).
Is there any way to consider both, the misclassification and the penalty value?

Try L2 regularization: each weight is updated by subtracting the learning rate times the error times x, plus the penalty term, lambda times the weight to the power of 2.
Simplifying:
This will be the effect:
ADDED: The penalization term (on the right of the equation) increases the generalization power of your model. If you overfit your model on the training set, the performance will be poor on the test set. So you penalize those "right" classifications on the training set that generate error on the test set and compromise generalization.
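As a rough sketch of the update described in this answer: assuming the penalty added to the loss is (lambda/2) times the squared weight norm, its gradient is lambda times the weight, so each step subtracts the learning rate times (error times x plus lambda times w). The function name, the synthetic data, and the squared-error loss are illustrative assumptions, not part of the original answer.

```python
import numpy as np

def l2_gradient_step(w, x, y, lr=0.1, lam=0.0):
    """One batch gradient step for squared error plus an L2 penalty (hypothetical helper)."""
    error = x @ w - y
    # Data term (error times x, averaged) plus the gradient of (lam / 2) * ||w||^2
    grad = x.T @ error / len(y) + lam * w
    return w - lr * grad

# Synthetic data, assumed purely for illustration
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))
y = x @ np.array([1.0, -2.0, 0.5])

w_plain = np.zeros(3)
w_ridge = np.zeros(3)
for _ in range(500):
    w_plain = l2_gradient_step(w_plain, x, y, lam=0.0)
    w_ridge = l2_gradient_step(w_ridge, x, y, lam=1.0)

print(np.linalg.norm(w_plain), np.linalg.norm(w_ridge))
```

With the penalty switched on, the fitted weights come out with a smaller norm, which is the shrinkage effect the answer relies on for better generalization.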

Related

Why does this training loss fluctuate? (Logistic regression from scratch with binary cross entropy loss)

I am trying to implement logistic regression from scratch using binary cross entropy loss function. The loss function implemented below is created based on the following formula.
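Reading the formula off the implementation (an inference from the code, assuming N is the number of samples and ŷ the predicted probability), it is presumably the standard binary cross-entropy:

```latex
L(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right]
```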
def binary_crossentropy(y, yhat):
    no_of_samples = len(y)
    numerator_1 = y*np.log(yhat)
    numerator_2 = (1-y) * np.log(1-yhat)
    loss = -(np.sum(numerator_1 + numerator_2) / no_of_samples)
    return loss
And below is how I implement the training using gradient descent.
L = 0.01
epochs = 40000
no_of_samples = len(x)
# Keeping track of the loss
loss = []
for _ in range(epochs):
    yhat = sigmoid(x*weight + bias)
    # Finding out the loss of each iteration
    loss.append(binary_crossentropy(y, yhat))
    d_weight = np.sum(x *(yhat-y)) / no_of_samples
    d_bias = np.sum(yhat-y) / no_of_samples
    weight = weight - L*d_weight
    bias = bias - L*d_bias
The training above goes fine, since the weight and bias are adjusted properly. But my question is: why does the loss graph fluctuate so much?
I have previously implemented linear regression, and there the loss decreased steadily.
Is there anything incorrect in my logistic regression implementation? If my implementation is already correct, why does it fluctuate that way?
You need to tune the hyperparameters to see whether that solves the problem. One thing you can do is change the optimizer: for instance, you can use fmin_tnc instead of gradient descent.
Besides, you can tune the number of epochs, L, and the type of solver (‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’) if you use sklearn for the regression.
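Since the question does not show the data, the cause of the fluctuation cannot be pinned down from the text alone. As a minimal sketch for checking the step-size explanation, the same loop can be run on synthetic, roughly unit-scale data (an assumption), with the predictions clipped before the logarithm (also an addition, to guard against log(0)). On such data a small learning rate gives a steadily decreasing loss, so persistent fluctuation on the real data would point toward the learning rate or unscaled features.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y, yhat, eps=1e-12):
    # Clip so that log(0) can never occur (an added safeguard)
    yhat = np.clip(yhat, eps, 1.0 - eps)
    return -np.mean(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

# Synthetic one-feature data, assumed for illustration only
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x + 0.5 * rng.normal(size=200) > 0).astype(float)

weight, bias, lr = 0.0, 0.0, 0.01
losses = []
for _ in range(2000):
    yhat = sigmoid(x * weight + bias)
    losses.append(binary_crossentropy(y, yhat))
    weight -= lr * np.mean(x * (yhat - y))
    bias -= lr * np.mean(yhat - y)

# Count how often the loss went up from one iteration to the next
increases = sum(b > a for a, b in zip(losses, losses[1:]))
print(losses[0], losses[-1], increases)
```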

For a classification model in tensorflow, is there a way to impose an asymmetric cost function during the training?

I am trying to build a Neural Network in tensorflow where the cost of a Type I error (false-positive) is more costly than a Type II error (false-negative). Is there a way to impose this during the training process (i.e. inputting a cost matrix)? This is possible with simple models like Logistic Regression in scikit learn by specifying the class_weight parameter.
cw = {0: 3,1:1}
clf = LogisticRegression(class_weight = cw )
In this case, incorrectly predicting a 0 is 3x more costly than incorrectly predicting a 1. However, this cannot be performed with a Neural Network, so I want to see if it is possible in tensorflow.
Thanks
You could use tf.nn.weighted_cross_entropy_with_logits and its pos_weight argument.
This argument weights the positive class, as described by the documentation (in TF2.0 at least):
A value pos_weight > 1 decreases the false negative count, hence increasing the recall.
Conversely setting pos_weight < 1 decreases the false positive count and increases the precision.
In your case, you could create a custom loss function like this:
import tensorflow as tf
# Output logits from your network, not the values after sigmoid activation
class WeightedBinaryCrossEntropy:
    def __init__(self, positive_weight: float):
        self.positive_weight = positive_weight
    def __call__(self, targets, logits, sample_weight=None):
        return tf.nn.weighted_cross_entropy_with_logits(
            targets, logits, pos_weight=self.positive_weight
        )
And create a custom neural network with it, for example using tf.keras (samples are weighted as they were in your question):
import numpy as np
model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Dense(32, input_shape=(10,)),
        tf.keras.layers.Activation("relu"),
        tf.keras.layers.Dense(10),
        tf.keras.layers.Activation("relu"),
        # Output one logit for binary classification
        tf.keras.layers.Dense(1),
    ]
)
# Example random data
data = np.random.random((32, 10))
targets = np.random.randint(2, size=32)
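# Type I errors (false positives) 3 times as costly: pos_weight < 1 down-weights
# the positive class, mirroring class_weight {0: 3, 1: 1} up to an overall scale
model.compile(optimizer="rmsprop", loss=WeightedBinaryCrossEntropy(positive_weight=1 / 3))
model.fit(data, targets, batch_size=32)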
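To see how pos_weight shifts the relative cost of the two error types, here is a NumPy sketch of the formula given in the TensorFlow documentation for this loss, labels * -log(sigmoid(logits)) * pos_weight + (1 - labels) * -log(1 - sigmoid(logits)). The function name and the example logits are illustrative assumptions. With pos_weight = 1/3, a false positive costs three times as much as an equally confident false negative.

```python
import numpy as np

def weighted_cross_entropy(labels, logits, pos_weight):
    """NumPy re-statement of the documented formula (hypothetical helper)."""
    # log(sigmoid(z)) = -logaddexp(0, -z); log(1 - sigmoid(z)) = -logaddexp(0, z)
    log_p = -np.logaddexp(0.0, -logits)
    log_not_p = -np.logaddexp(0.0, logits)
    return -(pos_weight * labels * log_p + (1.0 - labels) * log_not_p)

logit = np.array([2.0])
# True class 0, confidently predicted as 1 (false positive)
false_positive = weighted_cross_entropy(np.array([0.0]), logit, pos_weight=1 / 3)
# True class 1, predicted as 0 with the same confidence (false negative)
false_negative = weighted_cross_entropy(np.array([1.0]), -logit, pos_weight=1 / 3)

print(false_positive / false_negative)
```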
You can use a logarithmic scale. For a 0 incorrectly predicted as 1, y - ŷ = -1 and the loss is about 1.72. For a 1 predicted as 0, y - ŷ = 1 and the loss is about 0.63. For y == ŷ the loss is 0. So a 0 incorrectly predicted as 1 is almost three times more costly.
import numpy as np
from math import exp
loss=abs(1-exp(-np.log(exp(y-ŷ))))
#abs(1-exp(-np.log(exp(0))))
#Out[53]: 0.0
#abs(1-exp(-np.log(exp(-1))))
#Out[54]: 1.718281828459045
#abs(1-exp(-np.log(exp(1))))
#Out[55]: 0.6321205588285577
Then you will have a convex optimization. Implementing:
import keras.backend as K
def custom_loss(y_true, y_pred):
    # exp(-log(exp(d))) simplifies to exp(-d); backend ops work on tensors, unlike math.exp
    return K.mean(K.abs(1 - K.exp(-(y_true - y_pred))))
Then:
model.compile(loss=custom_loss, optimizer=sgd,metrics = ['accuracy'])

Deal with imbalanced dataset in text classification with Keras and Theano

For a text dataset of ~20,000 samples, the true and false samples are ~5,000 against ~15,000. A two-channel textCNN built with Keras and Theano is used for the classification, with the F1 score as the evaluation metric. The F1 score is not bad, but the confusion matrix shows that the accuracy on the true samples is relatively low (~40%). It is very important to predict the true samples accurately, so I want to design a custom binary cross-entropy loss function that increases the weight of misclassified true samples and makes the model focus more on predicting the true samples accurately.
I tried class_weight with sklearn in the model.fit method, and it did not work very well, since the weight is applied to all samples instead of only the misclassified ones.
I tried and adjusted the method mentioned here: https://github.com/keras-team/keras/issues/2115, but that loss function is a categorical cross-entropy and did not work well for the binary classification problem. I tried to modify the loss function into a binary one but ran into issues with the input dimension.
The sample code of the cost sensitive loss function focusing on the mis-classified samples is:
from itertools import product
from keras import backend as K
def w_categorical_crossentropy(y_true, y_pred, weights):
    nb_cl = len(weights)
    final_mask = K.zeros_like(y_pred[:, 0])
    # Mark, per sample, which class has the highest predicted score
    y_pred_max = K.max(y_pred, axis=1)
    y_pred_max = K.reshape(y_pred_max, (K.shape(y_pred)[0], 1))
    y_pred_max_mat = K.equal(y_pred, y_pred_max)
    # Accumulate the cost weights[c_t, c_p] for each (true, predicted) class pair
    for c_p, c_t in product(range(nb_cl), range(nb_cl)):
        final_mask += (weights[c_t, c_p] * y_pred_max_mat[:, c_p] * y_true[:, c_t])
    return K.categorical_crossentropy(y_pred, y_true) * final_mask
A custom loss function for binary classification, implemented with Keras and Theano, that focuses on the misclassified samples would be of great importance for this imbalanced dataset. Please help troubleshoot this. Thanks!
Well when I have to deal with imbalanced datasets in keras, what I do is to first compute the weights for each class and pass them to the model instance during training. This will look something like this:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
classes = np.unique(targets)
# recent scikit-learn releases require classes and y as keyword arguments
w = compute_class_weight('balanced', classes=classes, y=targets)
# here I am adding only two categories with their corresponding weights
# you can spin a loop or continue by hand until you include all of your categories
weights = {
    classes[0]: w[0],  # first class with its balanced weight
    classes[1]: w[1],  # second class with its balanced weight
}
# then during training you do like this
model.fit(x=features, y=targets, {..}, class_weight=weights)
I believe this will solve your problem.
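For a feel of the numbers, the 'balanced' heuristic documented by scikit-learn is n_samples / (n_classes * count_per_class). A small NumPy sketch, using the roughly 5,000 true versus 15,000 false split from the question (the helper name and the label encoding 1 = true, 0 = false are assumptions):

```python
import numpy as np

def balanced_class_weights(targets):
    """Balanced weights: n_samples / (n_classes * count of each class)."""
    classes, counts = np.unique(targets, return_counts=True)
    weights = len(targets) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Assumed encoding: 1 = true sample (minority), 0 = false sample (majority)
targets = np.array([1] * 5000 + [0] * 15000)
weights = balanced_class_weights(targets)
print(weights)
```

The minority class ends up with three times the weight of the majority class, which matches the 1:3 imbalance.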

Custom combined hinge/kb-divergence loss function in siamese-net fails to generate meaningful speaker-embeddings

I'm currently trying to implement a siamese-net in Keras where I have to implement the following loss function:
loss(p ∥ q) = I_s · KL(p ∥ q) + I_ds · HL(p ∥ q)
detailed description of loss function from paper
Where KL is the Kullback-Leibler divergence and HL is the Hinge-loss.
During training, I label same-speaker pairs as 1, different speakers as 0.
The goal is to use the trained net to extract embeddings from spectrograms.
A spectrogram is a 2-dimensional numpy-array 40x128 (time x frequency)
The problem is that I never get above 0.5 accuracy, and when clustering the speaker embeddings, the results show no apparent correlation between embeddings and speakers.
I implemented the KL divergence as the distance measure and adjusted the hinge loss accordingly:
def kullback_leibler_divergence(vects):
    x, y = vects
    x = ks.backend.clip(x, ks.backend.epsilon(), 1)
    y = ks.backend.clip(y, ks.backend.epsilon(), 1)
    return ks.backend.sum(x * ks.backend.log(x / y), axis=-1)
def kullback_leibler_shape(shapes):
    shape1, shape2 = shapes
    return shape1[0], 1
def kb_hinge_loss(y_true, y_pred):
    """
    y_true: binary label, 1 = same speaker
    y_pred: output of siamese net i.e. kullback-leibler divergence
    """
    MARGIN = 1.
    hinge = ks.backend.mean(ks.backend.maximum(MARGIN - y_pred, 0.), axis=-1)
    return y_true * y_pred + (1 - y_true) * hinge
A single spectrogram would be fed into a branch of the base network, the siamese-net consists of two such branches, so two spectrograms are fed simultaneously, and joined in the distance-layer. The output of the base network is 1 x 128. The distance layer computes the kullback-leibler divergence and its output is fed into the kb_hinge_loss. The architecture of the base-network is as follows:
def create_lstm(units: int, gpu: bool, name: str, is_sequence: bool = True):
    if gpu:
        return ks.layers.CuDNNLSTM(units, return_sequences=is_sequence, input_shape=INPUT_DIMS, name=name)
    else:
        return ks.layers.LSTM(units, return_sequences=is_sequence, input_shape=INPUT_DIMS, name=name)
def build_model(mode: str = 'train') -> ks.Model:
    topology = TRAIN_CONF['topology']
    is_gpu = tf.test.is_gpu_available(cuda_only=True)
    model = ks.Sequential(name='base_network')
    model.add(
        ks.layers.Bidirectional(create_lstm(topology['blstm1_units'], is_gpu, name='blstm_1'), input_shape=INPUT_DIMS))
    model.add(ks.layers.Dropout(topology['dropout1']))
    model.add(ks.layers.Bidirectional(create_lstm(topology['blstm2_units'], is_gpu, is_sequence=False, name='blstm_2')))
    if mode == 'extraction':
        return model
    num_units = topology['dense1_units']
    model.add(ks.layers.Dense(num_units, name='dense_1'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
    model.add(ks.layers.Dropout(topology['dropout2']))
    num_units = topology['dense2_units']
    model.add(ks.layers.Dense(num_units, name='dense_2'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
    num_units = topology['dense3_units']
    model.add(ks.layers.Dense(num_units, name='dense_3'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
    num_units = topology['dense4_units']
    model.add(ks.layers.Dense(num_units, name='dense_4'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
    return model
I then build a siamese net as follows:
base_network = build_model()
input_a = ks.Input(shape=INPUT_DIMS, name='input_a')
input_b = ks.Input(shape=INPUT_DIMS, name='input_b')
processed_a = base_network(input_a)
processed_b = base_network(input_b)
distance = ks.layers.Lambda(kullback_leibler_divergence,
                            output_shape=kullback_leibler_shape,
                            name='distance')([processed_a, processed_b])
model = ks.Model(inputs=[input_a, input_b], outputs=distance)
adam = build_optimizer()
model.compile(loss=kb_hinge_loss, optimizer=adam, metrics=['accuracy'])
Lastly, I build a net with the same architecture but only one input and try to extract embeddings, then take the mean over them; the resulting embedding should serve as a representation of a speaker, to be used during clustering:
utterance_embedding = np.mean(embedding_extractor.predict_on_batch(spectrogram), axis=0)
We train the net on the voxceleb speaker set.
The full code can be seen here: GitHub repo
I'm trying to figure out if I have made any wrong assumptions and how to improve my accuracy.
Issue with accuracy
Notice that in your model:
y_true = labels
y_pred = kullback-leibler divergence
These two cannot be compared, see this example:
For correct results, when y_true == 1 (same speaker), Kullback-Leibler is y_pred == 0 (no divergence).
So it's totally expected that metrics will not work properly.
Then, either you create a custom metric, or you count only on the loss for evaluations.
This custom metric needs a few adjustments in order to be feasible, as explained below.
Possible issues with the loss
Clipping
This might be a problem
First, notice that you're using clip on the values for the Kullback-Leibler. This may be bad because clipping zeroes the gradients in the clipped regions. And since your activation is a PReLU, you have values lower than zero and bigger than 1. So there are certainly zero-gradient cases here and there, with the risk of having a frozen model.
So, you might not want to clip these values. And to avoid having negative values with the PReLU, you can try to use a 'softplus' activation, which is kind of a soft relu without negative values. You might also add an epsilon to avoid trouble, but there is no problem in leaving values bigger than one:
#considering you used 'softplus' instead of 'PReLU' in speakers
def kullback_leibler_divergence(speakers):
    x, y = speakers
    x = x + ks.backend.epsilon()
    y = y + ks.backend.epsilon()
    return ks.backend.sum(x * ks.backend.log(x / y), axis=-1)
Asymmetry in Kullback-Leibler
This IS a problem
Notice also that Kullback-Leibler is not a symmetric function, and, on inputs that are not normalized probability distributions (as with these layer outputs), it doesn't have its minimum at zero! The perfect match is zero, but bad matches can have lower values, and this is bad for a loss function because it will drive you to divergence.
See this picture showing KL's graph
Your paper states that you should sum two losses: (p||q) and (q||p).
This eliminates the asymmetry and also the negative values.
So:
distance1 = ks.layers.Lambda(kullback_leibler_divergence,
name='distance1')([processed_a, processed_b])
distance2 = ks.layers.Lambda(kullback_leibler_divergence,
name='distance2')([processed_b, processed_a])
distance = ks.layers.Add(name='dist_add')([distance1,distance2])
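A quick NumPy check of this point, as a sketch with made-up positive vectors standing in for unnormalized layer outputs (the vectors and helper names are assumptions): the one-sided expression can go negative, while the sum of both directions equals the sum of (x - y) * log(x / y), which is never negative and is symmetric in its arguments.

```python
import numpy as np

def kl(x, y, eps=1e-7):
    """One-sided KL expression, with an epsilon added as in the code above."""
    x = x + eps
    y = y + eps
    return np.sum(x * np.log(x / y), axis=-1)

def symmetric_kl(x, y):
    # KL(x || y) + KL(y || x) = sum((x - y) * log(x / y)) >= 0
    return kl(x, y) + kl(y, x)

# Made-up positive, unnormalized vectors (they do not sum to 1)
x = np.array([0.2, 0.1, 0.3])
y = np.array([0.9, 0.8, 1.5])

print(kl(x, y), symmetric_kl(x, y), symmetric_kl(y, x))
```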
Very low margin and clipped hinge
This might be a problem
Finally, see that the hinge loss also clips values below zero!
Since Kullback-Leibler is not limited to 1, samples with high divergence may not be controlled by this loss. I'm not sure this is really an issue, but you might want to either:
increase the margin
inside the Kullback-Leibler, use mean instead of sum
use a softplus in hinge instead of a max, to avoid losing gradients.
See:
MARGIN = someValue
hinge = ks.backend.mean(ks.backend.softplus(MARGIN - y_pred), axis=-1)
Now we can think of a custom accuracy
This is not very easy, since we don't have clear limits on KL that tell us "correct/not correct"
You might try one at random, but you'd need to tune this threshold parameter until you find a value that represents reality well. You may for instance use your validation data to find the threshold that brings the best accuracy.
# threshold must be defined beforehand (a tuned cutoff on the divergence)
def customMetric(y_true_targets, y_pred_KBL):
    isMatch = ks.backend.less(y_pred_KBL, threshold)
    isMatch = ks.backend.cast(isMatch, ks.backend.floatx())
    isMatch = ks.backend.equal(y_true_targets, isMatch)
    isMatch = ks.backend.cast(isMatch, ks.backend.floatx())
    return ks.backend.mean(isMatch)
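The threshold search on validation data could look roughly like the following sketch. It assumes you already have the predicted divergences and the 0/1 labels for a validation set as NumPy arrays; the helper name and the toy numbers are made up for illustration.

```python
import numpy as np

def best_threshold(distances, labels):
    """Try each observed distance as a cutoff and keep the most accurate one."""
    best_t, best_acc = None, -1.0
    for t in np.unique(distances):
        # Pairs with divergence below the cutoff are predicted as "same speaker"
        predicted_same = (distances < t).astype(int)
        acc = float(np.mean(predicted_same == labels))
        if acc > best_acc:
            best_t, best_acc = float(t), acc
    return best_t, best_acc

# Toy validation data: small divergence for same-speaker pairs (label 1)
val_distances = np.array([0.1, 0.3, 0.2, 1.5, 2.0, 0.9])
val_labels = np.array([1, 1, 1, 0, 0, 0])

threshold, accuracy = best_threshold(val_distances, val_labels)
print(threshold, accuracy)
```

The selected value can then be used as the threshold inside customMetric.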

Linear regression implementation always performs worse than sklearn

I implemented linear regression with gradient descent in python. To see how well it is doing I compared it with scikit-learn's LinearRegression() class. For some reason, sklearn always outperforms my program by a MSE of 3 on average (I am using the Boston Housing dataset for testing). I understand that I am currently not doing gradient checking to check for convergence, but I am allowing for many iterations and have set the learning rate low enough such that it SHOULD converge. Is there any clear bug in my learning algorithm implementation? Here is my code:
import numpy as np
from sklearn.linear_model import LinearRegression
def getWeights(x):
    lenWeights = len(x[1,:])
    weights = np.random.rand(lenWeights)
    bias = np.random.random()
    return weights, bias
def train(x, y, weights, bias, maxIter):
    converged = False
    iterations = 1
    m = len(x)
    alpha = 0.001
    while not converged:
        for i in range(len(x)):
            # Dot product of weights and training sample
            hypothesis = np.dot(x[i,:], weights) + bias
            # Calculate gradient
            error = hypothesis - y[i]
            grad = (alpha * 1/m) * (error * x[i,:])
            # Update weights and bias
            weights = weights - grad
            bias = bias - alpha * error
            iterations = iterations + 1
            if iterations > maxIter:
                converged = True
                break
    return weights, bias
def predict(x, weights, bias):
    return np.dot(x, weights) + bias
if __name__ == '__main__':
    data = np.loadtxt('housing.txt')
    x = data[:,:-1]
    y = data[:,-1]
    # Min-max scale each feature column to [0, 1]
    for i in range(len(x[1,:])):
        x[:,i] = (x[:,i] - np.min(x[:,i])) / (np.max(x[:,i]) - np.min(x[:,i]))
    initialWeights, initialBias = getWeights(x)
    weights, bias = train(x, y, initialWeights, initialBias, 55000)
    pred = predict(x, weights, bias)
    MSE = np.mean(abs(pred - y))
    print("This Program MSE: " + str(MSE))
    sklearnModel = LinearRegression()
    sklearnModel = sklearnModel.fit(x, y)
    sklearnModel = sklearnModel.predict(x)
    skMSE = np.mean(abs(sklearnModel - y))
    print("Sklearn MSE: " + str(skMSE))
First, make sure that you are computing the correct objective function value. The linear regression objective should be .5*np.mean((pred-y)**2), rather than np.mean(abs(pred - y)).
You are actually running a stochastic gradient descent (SGD) algorithm (running a gradient iteration on individual examples), which should be distinguished from "gradient descent".
SGD is a good learning method, but a bad optimization method - it can take many iterations to converge to a minimum of the empirical error (http://leon.bottou.org/publications/pdf/nips-2007.pdf).
For SGD to converge, the learning rate must be restricted. Typically, the learning rate is set to the base learning rate divided by the number of iterations, something like alpha/(iterations+1), using the variables in your code.
You also include a multiple of 1/m in your gradient, which is typically not used in SGD updates.
To test your SGD implementation, rather than evaluating the error on the dataset that you trained with, split the dataset into a training set and a test set, and evaluate the error on this test set after training with both methods. The training/test set split will allow you to estimate the performance of your algorithm as a learning algorithm (estimate the expected error) rather than as an optimization algorithm (minimize the empirical error).
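The points above can be put together in a small sketch: per-sample updates without the 1/m factor, a learning rate of alpha/(t+1), the objective 0.5*mean((pred-y)**2), and evaluation on a held-out test set. The synthetic data, the split sizes, and the helper names are assumptions for illustration, not the Boston Housing setup from the question.

```python
import numpy as np

def half_mse(pred, y):
    """Linear regression objective: 0.5 * mean squared error."""
    return 0.5 * np.mean((pred - y) ** 2)

def sgd_fit(x, y, alpha=0.1, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    weights, bias, t = np.zeros(x.shape[1]), 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(len(x)):
            error = x[i] @ weights + bias - y[i]
            step = alpha / (t + 1)  # decaying learning rate
            weights = weights - step * error * x[i]  # no 1/m factor
            bias = bias - step * error
            t += 1
    return weights, bias

# Synthetic data with features in [0, 1], assumed for illustration
rng = np.random.default_rng(1)
x = rng.random((300, 3))
y = x @ np.array([2.0, -1.0, 0.5]) + 1.0 + 0.1 * rng.normal(size=300)
x_train, y_train, x_test, y_test = x[:200], y[:200], x[200:], y[200:]

weights, bias = sgd_fit(x_train, y_train)
test_error = half_mse(x_test @ weights + bias, y_test)
baseline_error = half_mse(np.zeros_like(y_test), y_test)  # all-zero predictor
print(test_error, baseline_error)
```

The same test_error can be computed for sklearn's LinearRegression fitted on x_train, y_train to compare the two methods on unseen data.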
Try increasing your iteration value. This should allow your algorithm to, hopefully, converge on a value that is closer to the global minimum. Keep in mind you are not using l-bfgs which can come closer to converging much faster than plain gradient descent or even SGD.
Also try using the normal equation as another way to do Linear Regression.
http://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression/
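A minimal sketch of the closed-form approach, on synthetic noise-free data (an assumption, so the exact weights are known). It uses np.linalg.lstsq, which solves the same least-squares problem as the normal equation without forming an explicit matrix inverse; the helper name is made up.

```python
import numpy as np

def normal_equation(x, y):
    """Least-squares fit with an intercept; returns (weights, bias)."""
    # Prepend a column of ones so the first coefficient is the bias
    design = np.column_stack([np.ones(len(x)), x])
    theta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return theta[1:], theta[0]

# Synthetic, noise-free data with known coefficients
rng = np.random.default_rng(0)
x = rng.random((100, 3))
y = x @ np.array([2.0, -1.0, 0.5]) + 1.0

weights, bias = normal_equation(x, y)
print(weights, bias)
```

Because this gives the exact minimizer of the squared error, it is a useful reference point for judging how far a gradient-based implementation is from convergence.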
