I have an SGDClassifier model trained with scikit-learn. I extract the feature names with .get_feature_names() and the coefficients with .coef_.
I combine the two columns in a dataframe like this:
feature value
hiroshima 3.918584
wildfire 3.287680
earthquake 3.256817
massacre 3.186762
storm 3.124809
... ...
job -1.696438
song -1.736640
as -1.956571
nowplaying -2.028240
write -2.263968
How can I interpret these feature importances?
What does a high positive value mean?
What does a large negative value mean?
SGDClassifier fits a linear model, meaning that the decision is essentially based on
SUM_i w_i f_i + b
where w_i is the weight attached to feature f_i and b is the intercept (stored as intercept_ on the model). Consequently you can interpret these numbers literally as "votes" for the positive/negative class, with influence proportional to their absolute value: the classifier just adds up these weighted feature values, adds the intercept, and classifies based on the sign. A large positive coefficient means the feature pushes the decision strongly towards the positive class; a large negative coefficient pushes it towards the negative class.
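For example, here is a minimal sketch of how you might inspect the largest positive and negative coefficients for a binary problem (clf and vec are placeholder names for your fitted classifier and vectorizer):
import pandas as pd
# clf: fitted binary SGDClassifier, vec: fitted vectorizer (placeholder names)
coefs = pd.DataFrame({
    "feature": vec.get_feature_names(),  # get_feature_names_out() on newer scikit-learn
    "value": clf.coef_[0],               # a single row of weights in the binary case
}).sort_values("value", ascending=False)
print(coefs.head())   # strongest "votes" for the positive class
print(coefs.tail())   # strongest "votes" for the negative class
# the raw decision for a sample x is x @ clf.coef_[0] + clf.intercept_[0]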
I'm quite new to scikit-learn and I have a question about the fit() function. I tried to look for information on the internet but couldn't find much.
In an assignment I have to create a dict of parameters passed to the fit function of a classifier, which means the function will take 3 arguments (X, y, kwargs). What parameters is this dictionary supposed to have? Apparently those are hyperparameters for the fit function. Online I only found information for XGBoost, but I'm not supposed to use that, only classifiers from sklearn.
I also found online that fit can take keyword arguments collected as **fit_params, but there is nothing about which parameters the function might accept.
I hope my question is clear, thanks a lot in advance!
The model hyperparameters are not arguments to the fit function, but to the model class object that you need to create beforehand.
If you have a dictionary with parameters that you want to pass to your model, you need to do things this way (here with a Logistic Regression):
from sklearn.linear_model import LogisticRegression
params = {"C":10, "max_iter":200}
LR = LogisticRegression(**params)
Now that you have created the model specifying the hyperparameters, you can proceed and fit it with your data.
LR.fit(X, y)
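If the assignment really does want keyword arguments for fit() itself, note that for most sklearn classifiers the main fit-time parameter of this kind is sample_weight, and you can pass it with dictionary unpacking. A minimal sketch (the weights array is a placeholder):
fit_params = {"sample_weight": weights}  # weights: array of shape (n_samples,)
LR.fit(X, y, **fit_params)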
I haven't used scikit-learn before, but you can get the docs of a function you are unsure about through its __doc__ attribute. For example, the fit() method of an SVC estimator has this docstring:
Fit the SVM model according to the given training data.
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_samples)
Training vectors, where n_samples is the number of samples
and n_features is the number of features.
For kernel="precomputed", the expected shape of X is
(n_samples, n_samples).
y : array-like of shape (n_samples,)
Target values (class labels in classification, real numbers in
regression)
sample_weight : array-like of shape (n_samples,), default=None
Per-sample weights. Rescale C per sample. Higher weights
force the classifier to put more emphasis on these points.
Returns
-------
self : object
Notes
-----
If X and y are not C-ordered and contiguous arrays of np.float64 and
X is not a scipy.sparse.csr_matrix, X and/or y may be copied.
If X is a dense array, then the other methods will not support sparse
matrices as input.
I ran this to get that output:
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)
print(clf.fit.__doc__)
I am using Python scikit-learn's Gradient Boosting Classifier with stochastic gradient boosting (random subsampling). I am using a sample_weight of 1 for one of the binary classes (outcome = 0) and 20 for the other class (outcome = 1). My question is how these weights are applied, in layman's terms.
Is it that at each iteration, the model will select x rows from the sample for the 0 outcome and y rows for the 1 outcome, then the sample_weight setting will kick in and keep all of x but oversample the y (1) outcome by a factor of 20?
From the documentation it is not clear to me whether having sample_weight > 1 amounts to oversampling. I understand that class_weight is different: it does not change the data, only how the model weighs the data via the loss function. Is it true that sample_weight, on the other hand, effectively changes the data fed into the model by oversampling?
Thanks
Sample weights act as a per-sample multiplier on the loss (they do not duplicate or oversample rows); here is the relevant code:
https://github.com/scikit-learn/scikit-learn/blob/f0ab589f/sklearn/ensemble/gradient_boosting.py#L1225
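Conceptually, giving a sample a weight of 20 has roughly the same effect on the fitted loss as repeating that row 20 times, without actually changing the data. A minimal sketch of passing such weights (the synthetic dataset and the subsample value are only illustrative assumptions):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
# imbalanced binary toy data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
# weight every class-1 sample 20x as heavily as a class-0 sample
weights = np.where(y == 1, 20.0, 1.0)
clf = GradientBoostingClassifier(subsample=0.8)  # stochastic gradient boosting
clf.fit(X, y, sample_weight=weights)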
The training dataset contains two classes, A and B, which we represent as 1 and 0 in our target labels respectively. Our label data is heavily skewed towards class 0, which makes up roughly 95% of the data, while class 1 is only 5%. How should we construct our loss function in such a case?
I found Tensorflow has a function that can be used with weights:
tf.losses.sigmoid_cross_entropy
weights acts as a coefficient for the loss. If a scalar is provided, then the loss is simply scaled by the given value.
Sounds good. I set weights to 2.0 to make the loss higher and punish errors more.
loss = loss_fn(targets, cell_outputs, weights=2.0, label_smoothing=0)
However, not only did the loss not go down, it increased, and the final accuracy on the dataset decreased slightly. OK, maybe I misunderstood and it should be < 1.0, so I tried a smaller number. This didn't change anything; I got almost the same loss and accuracy. O_o
Needless to say, the same network trained on the same dataset but with a loss weight of 0.3 reduces the loss by up to 10x in Torch / PyTorch.
Can somebody please explain how to use loss weights in Tensorflow?
If you're scaling the loss with a scalar, like 2.0, then basically you're multiplying the loss and therefore the gradient for backpropagation. It's similar to increasing the learning rate, but not exactly the same, because you're also changing the ratio to regularization losses such as weight decay.
If your classes are heavily skewed, and you want to balance it at the calculation of loss, then you have to specify a tensor as weight, as described in the manual for tf.losses.sigmoid_cross_entropy():
weights: Optional Tensor whose rank is either 0, or the same rank as labels, and must be broadcastable to labels (i.e., all dimensions must be either 1, or the same as the corresponding losses dimension).
That is, make the weights tensor 1.0 for class 0 and maybe 10.0 for class 1; now "false negative" losses will be counted much more heavily.
How much you should over-weight the underrepresented class is something of an art. If you overdo it, the model will collapse and predict the over-weighted class all the time.
An alternative to achieve the same thing is using tf.nn.weighted_cross_entropy_with_logits(), which has a pos_weight argument for the exact same purpose. But it's in tf.nn not tf.losses so you have to manually add it to the losses collection.
Another general method to handle this is to increase the proportion of the underrepresented class at sampling time (oversampling). That should not be overdone either. You can also combine both approaches.
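Coming back to the weights tensor described above, a minimal sketch of how it might be built for tf.losses.sigmoid_cross_entropy (TF 1.x style; the factor 10.0 is just an illustrative choice):
import tensorflow as tf  # 1.x
labels = tf.constant([[0.], [1.], [0.], [1.]])        # ground truth, shape [batch, 1]
logits = tf.constant([[-1.2], [0.3], [2.0], [-0.5]])  # raw model outputs
# weight 1.0 for the majority class (0) and 10.0 for the minority class (1)
weights = tf.where(tf.equal(labels, 1.0), 10.0 * tf.ones_like(labels), tf.ones_like(labels))
loss = tf.losses.sigmoid_cross_entropy(multi_class_labels=labels, logits=logits, weights=weights)
with tf.Session() as sess:
    print(sess.run(loss))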
You can set a penalty for misclassification of each sample. If weights is a tensor of shape [batch_size], the loss for each sample will be multiplied by the corresponding weight. So if you assign the same weight to all samples (which is the same as using a scalar weight), your loss will only be scaled by this scalar, and the accuracy should not change.
If you instead assign different weights for the minority class and the majority class, the contributions of the samples to the loss function will be different, and you should be able to influence the accuracy by choosing your weights differently.
A few scenarios (your choice will depend on what you need):
1.) If you want a good overall accuracy, you could choose the weights of the majority class to be very large and the weights of the minority class much smaller. This will probably lead to all events being classified into the majority class (i.e. 95% total classification accuracy), but the minority class will usually be classified into the wrong class.
2.) If your signal is the minority class and the background is the majority class, you probably want very little background contamination in your predicted signal, i.e. you want almost no background samples to be predicted as signal. This will also happen if you choose the majority weight much larger than the minority weight, but you might find that the network tends to predict all samples to be background. So you will not have any signal samples left.
In this case you should consider a large weight for the minority class + an extra loss for background samples being classified as signal samples (false positives), like this:
loss = weighted_cross_entropy + extra_penalty_for_false_positives
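A rough sketch of what such a combined loss could look like in TF 1.x (the penalty term and the factors 10.0 and 5.0 are illustrative assumptions, not prescribed values):
import tensorflow as tf  # 1.x
labels = tf.constant([[0.], [1.], [0.], [1.]])   # 1 = signal (minority), 0 = background
logits = tf.constant([[1.5], [0.3], [-2.0], [0.8]])
# cross entropy with the minority class up-weighted
class_weights = tf.where(tf.equal(labels, 1.0), 10.0 * tf.ones_like(labels), tf.ones_like(labels))
weighted_ce = tf.losses.sigmoid_cross_entropy(multi_class_labels=labels, logits=logits, weights=class_weights)
# extra penalty for false positives: background samples (label 0) predicted as signal
probs = tf.nn.sigmoid(logits)
extra_penalty_for_false_positives = tf.reduce_mean((1.0 - labels) * probs)
loss = weighted_ce + 5.0 * extra_penalty_for_false_positives
with tf.Session() as sess:
    print(sess.run(loss))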
In the tensorflow API docs they use a keyword called logits. What is it? A lot of methods are written like:
tf.nn.softmax(logits, name=None)
If logits is just a generic Tensor input, why is it named logits?
Secondly, what is the difference between the following two methods?
tf.nn.softmax(logits, name=None)
tf.nn.softmax_cross_entropy_with_logits(logits, labels, name=None)
I know what tf.nn.softmax does, but not the other. An example would be really helpful.
"Softmax with logits" simply means that the function operates on the unscaled output of earlier layers and that the relative scale used to interpret the values is linear. In particular, the sum of the inputs may not equal 1 and the values are not probabilities (you might have an input of 5). Internally, it first applies softmax to the unscaled output, and then computes the cross entropy of those values vs. what they "should" be as defined by the labels.
tf.nn.softmax produces the result of applying the softmax function to an input tensor. The softmax "squishes" the inputs so that the outputs sum to 1, and it does the mapping by interpreting the inputs as log-probabilities (logits) and then converting them back into raw probabilities between 0 and 1. The shape of the output of a softmax is the same as the input:
import numpy as np
import tensorflow as tf
sess = tf.Session()
a = tf.constant(np.array([[.1, .3, .5, .9]]))
print(sess.run(tf.nn.softmax(a)))
# [[ 0.16838508  0.205666    0.25120102  0.37474789]]
See this answer for more about why softmax is used extensively in DNNs.
tf.nn.softmax_cross_entropy_with_logits combines the softmax step with the calculation of the cross-entropy loss in a single, more mathematically careful operation. It's similar to the result of:
sm = tf.nn.softmax(x)
ce = cross_entropy(sm)
The cross entropy is a summary metric: it sums across the elements. The output of tf.nn.softmax_cross_entropy_with_logits on a shape [2,5] tensor is of shape [2], one loss value per row (the first dimension is treated as the batch).
If you want to do optimization to minimize the cross entropy AND you're softmaxing after your last layer, you should use tf.nn.softmax_cross_entropy_with_logits instead of doing it yourself, because it covers numerically unstable corner cases in the mathematically right way. Otherwise, you'll end up hacking it by adding little epsilons here and there.
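To see the kind of hack this avoids, here is a small sketch comparing the naive epsilon approach with the fused op (the epsilon value is just an illustrative assumption):
import numpy as np
import tensorflow as tf
sess = tf.Session()
logits = tf.constant(np.array([[100.0, 0.0, -100.0]]))
labels = tf.constant(np.array([[0.0, 0.0, 1.0]]))   # the true class has a very negative logit
# naive version: needs an epsilon to avoid log(0), which distorts the result (~23 here)
naive = -tf.reduce_sum(labels * tf.log(tf.nn.softmax(logits) + 1e-10), axis=1)
# fused version: numerically stable, returns the true loss (~200 here)
fused = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
print(sess.run([naive, fused]))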
Edited 2016-02-07:
If you have single-class labels, where an object can only belong to one class, you might now consider using tf.nn.sparse_softmax_cross_entropy_with_logits so that you don't have to convert your labels to a dense one-hot array. This function was added after release 0.6.0.
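For example, with integer class indices instead of one-hot rows, the sparse variant can be used roughly like this (a small TF 1.x sketch):
import numpy as np
import tensorflow as tf
sess = tf.Session()
logits = tf.constant(np.array([[0.5, 1.5, 0.1], [2.2, 1.3, 1.7]]))
sparse_labels = tf.constant([1, 2])  # class indices, no one-hot encoding needed
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=sparse_labels, logits=logits)
print(sess.run(loss))  # one loss value per instance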
Short version:
Suppose you have two tensors, where y_hat contains computed scores for each class (for example, from y = W*x + b) and y_true contains one-hot encoded true labels.
y_hat = ... # Predicted label, e.g. y = tf.matmul(X, W) + b
y_true = ... # True label, one-hot encoded
If you interpret the scores in y_hat as unnormalized log probabilities, then they are logits.
Additionally, the total cross-entropy loss computed in this manner:
y_hat_softmax = tf.nn.softmax(y_hat)
total_loss = tf.reduce_mean(-tf.reduce_sum(y_true * tf.log(y_hat_softmax), [1]))
is essentially equivalent to the total cross-entropy loss computed with the function softmax_cross_entropy_with_logits():
total_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_true, logits=y_hat))
Long version:
In the output layer of your neural network, you will probably compute an array that contains the class scores for each of your training instances, such as from a computation y_hat = W*x + b. To serve as an example, below I've created a y_hat as a 2 x 3 array, where the rows correspond to the training instances and the columns correspond to classes. So here there are 2 training instances and 3 classes.
import tensorflow as tf
import numpy as np
sess = tf.Session()
# Create example y_hat.
y_hat = tf.convert_to_tensor(np.array([[0.5, 1.5, 0.1],[2.2, 1.3, 1.7]]))
sess.run(y_hat)
# array([[ 0.5, 1.5, 0.1],
# [ 2.2, 1.3, 1.7]])
Note that the values are not normalized (i.e. the rows don't add up to 1). In order to normalize them, we can apply the softmax function, which interprets the input as unnormalized log probabilities (aka logits) and outputs normalized linear probabilities.
y_hat_softmax = tf.nn.softmax(y_hat)
sess.run(y_hat_softmax)
# array([[ 0.227863 , 0.61939586, 0.15274114],
# [ 0.49674623, 0.20196195, 0.30129182]])
It's important to fully understand what the softmax output is saying. Below I've shown a table that more clearly represents the output above. It can be seen that, for example, the probability of training instance 1 being "Class 2" is 0.619. The class probabilities for each training instance are normalized, so the sum of each row is 1.0.
Pr(Class 1) Pr(Class 2) Pr(Class 3)
,--------------------------------------
Training instance 1 | 0.227863 | 0.61939586 | 0.15274114
Training instance 2 | 0.49674623 | 0.20196195 | 0.30129182
So now we have class probabilities for each training instance, and we can take the argmax() of each row to generate a final classification. From the output above, we would predict that training instance 1 belongs to "Class 2" and training instance 2 belongs to "Class 1".
Are these classifications correct? We need to measure against the true labels from the training set. You will need a one-hot encoded y_true array, where again the rows are training instances and columns are classes. Below I've created an example y_true one-hot array where the true label for training instance 1 is "Class 2" and the true label for training instance 2 is "Class 3".
y_true = tf.convert_to_tensor(np.array([[0.0, 1.0, 0.0],[0.0, 0.0, 1.0]]))
sess.run(y_true)
# array([[ 0., 1., 0.],
# [ 0., 0., 1.]])
Is the probability distribution in y_hat_softmax close to the probability distribution in y_true? We can use cross-entropy loss to measure the error.
We can compute the cross-entropy loss on a row-wise basis and see the results. Below we can see that training instance 1 has a loss of 0.479, while training instance 2 has a higher loss of 1.200. This result makes sense because in our example above, y_hat_softmax showed that training instance 1's highest probability was for "Class 2", which matches training instance 1 in y_true; however, the prediction for training instance 2 showed a highest probability for "Class 1", which does not match the true class "Class 3".
loss_per_instance_1 = -tf.reduce_sum(y_true * tf.log(y_hat_softmax), reduction_indices=[1])
sess.run(loss_per_instance_1)
# array([ 0.4790107 , 1.19967598])
What we really want is the total loss over all the training instances. So we can compute:
total_loss_1 = tf.reduce_mean(-tf.reduce_sum(y_true * tf.log(y_hat_softmax), reduction_indices=[1]))
sess.run(total_loss_1)
# 0.83934333897877944
Using softmax_cross_entropy_with_logits()
We can instead compute the total cross entropy loss using the tf.nn.softmax_cross_entropy_with_logits() function, as shown below.
loss_per_instance_2 = tf.nn.softmax_cross_entropy_with_logits(labels=y_true, logits=y_hat)
sess.run(loss_per_instance_2)
# array([ 0.4790107 , 1.19967598])
total_loss_2 = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_true, logits=y_hat))
sess.run(total_loss_2)
# 0.83934333897877922
Note that total_loss_1 and total_loss_2 produce essentially equivalent results with some small differences in the very final digits. However, you might as well use the second approach: it takes one less line of code and accumulates less numerical error because the softmax is done for you inside of softmax_cross_entropy_with_logits().
tf.nn.softmax computes the forward propagation through a softmax layer. You use it during evaluation of the model when you compute the probabilities that the model outputs.
tf.nn.softmax_cross_entropy_with_logits computes the cost for a softmax layer. It is only used during training.
The logits are the unnormalized log probabilities output by the model (the values output before the softmax normalization is applied to them).
Mathematical motivation for the term
When we wish to constrain an output between 0 and 1, but our model architecture outputs unconstrained values, we can add a normalisation layer to enforce this.
A common choice is a sigmoid function. In binary classification this is typically the logistic function, and in multi-class tasks the multinomial logistic function (a.k.a. softmax).
If we want to interpret the outputs of our new final layer as 'probabilities', then (by implication) the unconstrained inputs to our sigmoid must be inverse-sigmoid(probabilities). In the logistic case this is equivalent to the log-odds of our probability (i.e. the log of the odds), a.k.a. the logit:
logit(p) = log(p / (1 - p))
That is why the argument to softmax is called logits in Tensorflow: under the assumption that softmax is the final layer in the model and the output p is interpreted as a probability, the input x to this layer is interpretable as a logit:
x = inverse-sigmoid(p) = logit(p)
Generalised term
In Machine Learning there is a propensity to generalise terminology borrowed from maths/stats/computer science, hence in Tensorflow logit (by analogy) is used as a synonym for the input to many normalisation functions.
While it has nice properties such as being easily differentiable, and the aforementioned probabilistic interpretation, it is somewhat arbitrary.
softmax might be more accurately called softargmax, as it is a smooth approximation of the argmax function.
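A small numpy sketch of both points, the inverse relation between sigmoid and logit and the softargmax behaviour (purely illustrative):
import numpy as np
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
def logit(p):
    return np.log(p / (1.0 - p))
print(logit(sigmoid(2.5)))   # ~2.5: the logit is the inverse of the logistic sigmoid
def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()
scores = np.array([1.0, 2.0, 3.0])
print(softmax(scores))       # [0.09, 0.24, 0.67] (approx.)
print(softmax(10 * scores))  # ~[0, 0, 1]: scaling sharpens softmax towards one-hot argmax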
The answers above have enough description for the question asked.
Adding to that, Tensorflow has optimised the combined operation of applying the activation function and then calculating the cost. Hence it is good practice to use tf.nn.softmax_cross_entropy_with_logits() rather than applying tf.nn.softmax() and computing the cross entropy separately.
You can see a prominent difference between them in a resource-intensive model.
Tensorflow 2.0 Compatible Answer: The explanations of dga and stackoverflowuser2010 are very detailed about logits and the related functions.
All those functions work fine in Tensorflow 1.x, but if you migrate your code from 1.x (1.14, 1.15, etc.) to 2.x (2.0, 2.1, etc.), using them results in errors.
Hence, for the benefit of the community, here are the 2.x compatible calls for all the functions discussed above.
Functions in 1.x:
tf.nn.softmax
tf.nn.softmax_cross_entropy_with_logits
tf.nn.sparse_softmax_cross_entropy_with_logits
Respective Functions when Migrated from 1.x to 2.x:
tf.compat.v2.nn.softmax
tf.compat.v2.nn.softmax_cross_entropy_with_logits
tf.compat.v2.nn.sparse_softmax_cross_entropy_with_logits
For more information about migration from 1.x to 2.x, please refer to the Migration Guide.
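As a minimal sketch of one of these calls under TF 2.x (eager execution; the tensors reuse the example values from the answers above):
import tensorflow as tf  # 2.x
y_true = tf.constant([[0., 1., 0.], [0., 0., 1.]])
y_hat = tf.constant([[0.5, 1.5, 0.1], [2.2, 1.3, 1.7]])
loss = tf.compat.v2.nn.softmax_cross_entropy_with_logits(labels=y_true, logits=y_hat)
print(loss.numpy())  # one loss value per instance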
One more thing that I would definitely like to highlight: a logit is just a raw output, generally the output of the last layer. It can be a negative value as well. If we use it as-is for the "cross entropy" evaluation shown below:
-tf.reduce_sum(y_true * tf.log(logits))
then it won't work, as the log of a negative number is not defined.
Applying the softmax activation first overcomes this problem.
This is my understanding; please correct me if I'm wrong.
Logits are the unnormalized outputs of a neural network. Softmax is a normalization function that squashes the outputs of a neural network so that they are all between 0 and 1 and sum to 1. softmax_cross_entropy_with_logits is a loss function that takes in the raw outputs of a neural network (before they have been squashed by softmax, i.e. the logits) together with the true labels, applies softmax internally, and returns a loss value.