Using Python SkLearn Gradient Boost Classifier. The setting I am using is selecting random samples (stochastic). Using the sample_weight of 1 for one of the binary classes (outcome = 0) and 20 for the other class (outcome = 1). My question is how are these weights applied in 'laymans terms'.
Is it that at each iteration, the model will select x rows from the sample for the 0 outcome, and y rows for the 1 outcome, then the sample_weight setting will kick into and keep all of x but oversample the y (1) outcome by a factor of 20?
In the documentation I am not clear if it is oversampling by having sample_weight > 1. I understand that class_weight is different and does not change the data but how the model interprets the data via the loss function. Sample_weight on the other hand, is it true that it effectively changes the data fed into the model by oversampling?
Thanks
Sample weights are a multiplier factor, here is the code:
https://github.com/scikit-learn/scikit-learn/blob/f0ab589f/sklearn/ensemble/gradient_boosting.py#L1225
Related
When training logistic regression it goes through an iterative process where at each process it calculates weights of x variables and bias value to minimize the loss function.
From official sklearn code class LogisticRegression | linear model in scikit-learn, the logistic regression class' fit method is as follows
def fit(self, X, y, sample_weight=None):
"""
Fit the model according to the given training data.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and
n_features is the number of features.
y : array-like of shape (n_samples,)
Target vector relative to X.
sample_weight : array-like of shape (n_samples,) default=None
Array of weights that are assigned to individual samples.
If not provided, then each sample is given unit weight.
.. versionadded:: 0.17
*sample_weight* support to LogisticRegression.
I am guessing sample_weight = weight of x variables which are set to 1 if not given, is the bias value also 1?
You sound somewhat confused, perhaps looking for an analogy here with the weights & biases of a neural network. But this is not the case; sample_weight here has nothing to do with the weights of a neural network, even as a concept.
sample_weight is there so that, if the (business) problem requires so, we can give more weight (i.e. more importance) to some samples compared with others, and this importance directly affects the loss. It is sometimes used in cases of imbalanced data; quoting from the Tips on practical use section of the documentation (it is about decision trees, but the rationale is the same):
Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value.
and from a relevant thread at Cross Validated:
Sample weights are used to increase the importance of a single data-point (let's say, some of your data is more trustworthy, then they receive a higher weight). So: The sample weights exist to change the importance of data-points
You can see a practical demostration of how changing the weight of some samples changes the final model in the SO thread What does `sample_weight` do to the way a `DecisionTreeClassifier` works in sklearn? (again, it is about decision trees, but the rationale is the same).
Having clarified that, it should now be apparent that there is no room here for any kind of "bias" parameter whatsoever. In fact, the introductory paragraph in your question is wrong: logistic regression does not compute such weights and biases; it returns coefficients and an intercept term (sometimes itself called bias), and these coefficients & intercept have nothing to do with sample_weight.
I am trying to use keras to fit a CNN model to classify 2 classes of data . I have imbalanced dataset I want to balance the data. I don't know can I use class_weight in model.fit_generator . I wonder if I used class_weight="balanced" in model.fit_generator
The main code:
def generate_arrays_for_training(indexPat, paths, start=0, end=100):
while True:
from_=int(len(paths)/100*start)
to_=int(len(paths)/100*end)
for i in range(from_, int(to_)):
f=paths[i]
x = np.load(PathSpectogramFolder+f)
x = np.expand_dims(x, axis=0)
if('P' in f):
y = np.repeat([[0,1]],x.shape[0], axis=0)
else:
y =np.repeat([[1,0]],x.shape[0], axis=0)
yield(x,y)
history=model.fit_generator(generate_arrays_for_training(indexPat, filesPath, end=75),
validation_data=generate_arrays_for_training(indexPat, filesPath, start=75),
steps_per_epoch=int((len(filesPath)-int(len(filesPath)/100*25))),
validation_steps=int((len(filesPath)-int(len(filesPath)/100*75))),
verbose=2,
epochs=15, max_queue_size=2, shuffle=True, callbacks=[callback])
If you don't want to change your data creation process, you can use class_weight in your fit generator. You can use dictionary to set your class_weight and observe with fine tuning. For instance when class_weight is not used, and you have 50 examples for class0 and 100 examples for class1. Then, loss function calculate loss uniformly. It means that class1 will be a problem. But, when you set:
class_weight = {0:2 , 1:1}
It means that loss function will give 2 times weight to your class 0 now. Therefore, misclassification of underrepresented data will take 2 times more punishment than before. Thus, model can handle imbalanced data.
If you use class_weight='balanced' model can make that setting automatically. But my suggestion is that, create a dictionary like class_weight = {0:a1 , 1:a2} and try different values for a1 and a2, so you can understand difference.
Also, you can use undersampling methods for imbalanced data instead of using class_weight. Check Bootstrapping methods for that purpose.
For ~20,000 text datasets, the true and false samples are ~5,000 against ~1,5000. Two-channel textCNN built with Keras and Theano is used to do the classification. F1 score is the evaluation metric. The F1 score is not bad while the confusion matrix shows that the accuracy of the true samples is relatively low(~40%). But actually it is very important to predict the true samples accurately. Therefore, want to design a custom binary cross entropy loss function to increase the weight of mis-classified true samples and make the model focus more on predicting accurately on the true samples.
tried class_weight with sklearn in model.fit method and it did not work very well since the weight applied to all samples instead of the mis-classified ones.
tried and adjusted the method mentioned here: https://github.com/keras-team/keras/issues/2115, but the loss function was categorical cross entropy and it did not work well for the binary classification problem. Tried to modified the loss function to a binary one but encounter some issues concerning the input dimension.
The sample code of the cost sensitive loss function focusing on the mis-classified samples is:
def w_categorical_crossentropy(y_true, y_pred, weights):
nb_cl = len(weights)
final_mask = K.zeros_like(y_pred[:, 0])
y_pred_max = K.max(y_pred, axis=1)
y_pred_max = K.reshape(y_pred_max, (K.shape(y_pred)[0], 1))
y_pred_max_mat = K.equal(y_pred, y_pred_max)
for c_p, c_t in product(range(nb_cl), range(nb_cl)):
final_mask += (weights[c_t, c_p] * y_pred_max_mat[:, c_p] * y_true[:, c_t])
return K.categorical_crossentropy(y_pred, y_true) * final_mask
Actually, a custom loss function for binary classification implemented with Keras and Theano that focuses on the mis-classified samples is of great importance to the imbalanced dataset. Please help troubleshoot this. Thanks!
Well when I have to deal with imbalanced datasets in keras, what I do is to first compute the weights for each class and pass them to the model instance during training. This will look something like this:
from sklearn.utils import compute_class_weight
w = compute_class_weight('balanced', np.unique(targets), targets)
# here I am adding only two categories with their corresponding weights
# you can spin a loop or continue by hand until you include all of your categories
weights = {
np.unique(targets)[0] : w[0], # class 0 with weight 0
np.unique(targets)[1] : w[1] # class 1 with weight 1
}
# then during training you do like this
model.fit(x=features, y=targets, {..}, class_weight=weights)
I believe this will solve your problem.
I'm trying to write my own logistic regressor (using batch/mini-batch gradient descent) for practice purposes.
I generated a random dataset (see below) with normally distributed inputs, and the output is binary (0,1). I manually used coefficients for the input and was hoping to be able to reproduce them (see below for the code snippet). However, to my surprise, neither my own code, nor sklearn LogisticRegression were able to reproduce the actual numbers (although the sign and order of magnitude are in line). Moreso, the coefficients my algorithm produced are different than the one produced by sklearn.
Am I misinterpreting what the coefficients for a logistic regression are?
I will appreciate any insight into this discrepancy.
Thank you!
edit: I tried using statsmodels Logit and got yet a third set of slightly different values for the coefficients
Some more info that might be relevant:
I wrote a linear regressor using an almost identical code and it worked perfectly, so I am fairly confident this is not a problem in the code. Also my regressor actually outperformed the sklearn one on the training set, and they have the exact same accuracy on the test set, so I have no reason to believe the regressors are wrong.
Code snippets for the generation of the dataset:
o1 = 2
o2 = -3
x[:,1]=np.random.rand(size)*2
x[:,2]=np.random.rand(size)*3
y = np.vectorize(sigmoid)(x[:,1]*o1+x[:,2]*o2 + np.random.normal(size=size))
so as can be seen, input coefficients are +2 and -3 (intercept 0);
sklearn coefficients were ~2.8 and ~-4.8;
my coefficients were ~1.7 and ~-2.6
and of the regressor (the most relevant parts of it):
for j in range(bin_size):
xs = x[i]
y_real = y[i]
z = np.dot(self.coeff,xs)
h = sigmoid(z)
dc+= (h-y_real)*xs
self.coeff-= dc * (learning_rate/n)
What was the intercept learned? It really should not be a surprise, as your y is polynomial of 3rd degree, while your model has only two coefficients, while 3 + y-intercept would be needed to model the response variable from predictors.
Furthermore, values may be different due to SGD for example.
Not really sure, but the coefficients could be different and return correct y for finite set of points. What are the metrics on each model? Do those differ?
I have been learning for myself for several months artificial intelligence through a project of character recognition and transcription of handwriting. Until now I have successfully used Keras, Theano and Tensorflow by implementing CNN, CTC neural networks.
Today, I try to use Gaussian mixture models, the first step towards hidden markov models with Gaussian emission. To do so, I used the sklearn mixture with pca reduction to select the best model with Akaike and Bayesian information criterion. With type of covariance Full for Aic which provides a nice U-curve and Tied for Bic, because with Full covariance Bic gives just a linear curve. With 12.000 samples, I get the best model at 60 n-components for Aic and 120 n-components for Bic.
My input images have 64 pixels aside which represent only the capital letters of the English alphabet, 26 categories numbered from 0 to 25.
The fit method of Sklearn GaussianMixture ignore labels and the predict method returns the position of the component (0 to 59 or 0 to 119) into the n-components regarding the probabilities.
How to retrieve the original label the position of the character in a list using sklearn GaussianMixture ?
So, you want to use GaussianMixture in a generative classifier. You need to compute P(Y|X) for each label and estimate label according to these probabilities. To do so, you need to keep a GMM for each label and train with data from corresponding label. Then score method will give you likelihood, P(X|Y), of given data (or log-likelihood, you may want to check that). If you multiple likelihood with prior, you get posterior, P(Y|X). For each label, you will get a posterior e.g. P(Y=0|X), P(Y=1|X), ... Label with the maximum posterior probability can be reported as estimated label.
You can get some hints from the code sample below. (Here it is assumed that prior probabilities are equal, you need to consider that in your implementation)
Y_predicted = clf.predict(X_test)
score = np.empty((Y_test.shape[0], 10))
predictor_list = []
for i in range(10):
predictor = GMM()
predictor.fit(X[Y==i])
predictor_list.append(predictor)
score[:, i] = predictor.score(X_test)
Y_predicted = np.argmax(score, axis=1)