I'm implementing a multinomial logistic regression model in Python using scikit-learn. However, I'd like to use a probability distribution over the classes as my target variable. As an example, let's say this is a 3-class variable which looks as follows:
   class_1  class_2  class_3
0      0.0      0.0      1.0
1      1.0      0.0      0.0
2      0.0      0.5      0.5
3      0.2      0.3      0.5
4      0.5      0.1      0.4
So the values in every row sum to 1.
How could I fit a model like this? When I try:
model = LogisticRegression(solver='saga', multi_class='multinomial')
model.fit(X, probabilities)
I get an error saying:
ValueError: bad input shape (10000, 3)
I know this is because the method expects a 1-d vector of labels, not a matrix. But here I can't collapse the probability matrix into a vector of labels, since the classes are not exclusive.
You can't use cross-entropy loss with non-indicator (soft) probabilities in scikit-learn; this is not implemented and not supported by the API. It is a limitation of scikit-learn.
For logistic regression you can approximate it by upsampling instances according to the probabilities of their labels. For example, you can up-sample every instance 10x: if a training instance has probability 0.2 for class 1 and probability 0.8 for class 2, generate 10 training instances: 8 with class 2 and 2 with class 1. It won't be as efficient as it could be, but in the limit you'll be optimizing the same objective function.
You can do something like this:
from sklearn.utils import check_random_state
import numpy as np
def expand_dataset(X, y_proba, factor=10, random_state=None):
    """
    Convert a dataset with float multiclass probabilities to a dataset
    with indicator probabilities by duplicating X rows and sampling
    true labels.
    """
    rng = check_random_state(random_state)
    n_classes = y_proba.shape[1]
    classes = np.arange(n_classes, dtype=int)
    for x, probs in zip(X, y_proba):
        for label in rng.choice(classes, size=factor, p=probs):
            yield x, label
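For example, to feed the expanded dataset into the model from your snippet (a minimal sketch; X and probabilities are assumed to be the arrays from your question):
import numpy as np
from sklearn.linear_model import LogisticRegression

# Materialize the generator into plain arrays
X_big, y_big = zip(*expand_dataset(X, probabilities, factor=10, random_state=0))
X_big, y_big = np.asarray(X_big), np.asarray(y_big)

model = LogisticRegression(solver='saga', multi_class='multinomial')
model.fit(X_big, y_big)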
See a more complete example here: https://github.com/TeamHG-Memex/eli5/blob/8cde96878f14c8f46e10627190abd9eb9e705ed4/eli5/lime/utils.py#L16
Alternatively, you can implement your logistic regression using libraries like TensorFlow or PyTorch; unlike scikit-learn, it is easy to define any loss in these frameworks, and cross-entropy with soft targets is available out of the box.
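For instance, here is a minimal PyTorch sketch of multinomial logistic regression trained with cross-entropy against soft targets (the data here is randomly generated purely for illustration; substitute your own X and probability matrix):
import torch

# Illustrative data: replace with your features and probability matrix
X_t = torch.randn(100, 5)
P_t = torch.softmax(torch.randn(100, 3), dim=1)  # rows sum to 1

model = torch.nn.Linear(X_t.shape[1], P_t.shape[1])
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

for _ in range(500):
    optimizer.zero_grad()
    log_probs = torch.log_softmax(model(X_t), dim=1)
    # Cross-entropy against a full probability distribution per row,
    # not against hard 0/1 labels
    loss = -(P_t * log_probs).sum(dim=1).mean()
    loss.backward()
    optimizer.step()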
You need to provide the correct labels with the training data; the logistic regression model will then give you probabilities in return when you call predict_proba(X), which returns a matrix of shape [n_samples, n_classes]. If you use just predict(X), it gives you an array of the most probable class, of shape [n_samples,].
Related
I'm fitting some data for a classification task using Gaussian Process Classifiers in sklearn. I know that for the Gaussian Process Regressor one can pass return_std in
y_test, std = gp.predict(x_test, return_std=True)
to output the standard deviation of the test sample (like in this question)
However, I couldn't find such a parameter for the GP Classifier.
Is there such a thing as outputting the predictive mean and standard deviation of test data from a GP classifier? And is there a way to output the posterior mean and covariance of the fitted model?
There is no standard deviation for categorical data, hence there is no return_std parameter for the classifier.
However, if you want to quantify the uncertainty of the classifier's predictions, you could use the .predict_proba(X) method. Once you get the probabilities of each possible class, you could compute the entropy of the predicted probabilities.
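For example (a small sketch; gpc is assumed to be a fitted GaussianProcessClassifier):
import numpy as np

proba = gpc.predict_proba(X_test)  # shape (n_samples, n_classes)
# Shannon entropy per sample: 0 for a fully confident prediction,
# log(n_classes) for a maximally uncertain one
entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)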
You could get the variance associated with the logit function by going to the predict_proba function definition in _gpc.py and returning the var_f_star value. Below I have modified predict_proba into a function that returns the logit variance:
import numpy as np
from scipy.linalg import solve
from sklearn.utils.validation import check_is_fitted

def predict_var(self, X):
    """Return the variance of the latent (logit-scale) function at X.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features) or list of object
        Query points where the GP is evaluated for classification.

    Returns
    -------
    var_f_star : array of shape (n_samples,)
        Variance of the latent function for each query point.
    """
    check_is_fitted(self)
    # Based on Algorithm 3.2 of GPML
    K_star = self.kernel_(self.X_train_, X)  # K_star = k(x_star)
    f_star = K_star.T.dot(self.y_train_ - self.pi_)  # Line 4 (posterior mean, unused here)
    v = solve(self.L_, self.W_sr_[:, np.newaxis] * K_star)  # Line 5
    # Line 6 (compute np.diag(v.T.dot(v)) via einsum)
    var_f_star = self.kernel_.diag(X) - np.einsum("ij,ij->j", v, v)
    return var_f_star
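If I remember correctly, in the binary case the fitted Laplace-approximation estimator carrying the attributes used above (X_train_, pi_, W_sr_, L_, kernel_) lives in gpc.base_estimator_, so you would call the function along these lines (untested sketch):
from sklearn.gaussian_process import GaussianProcessClassifier

gpc = GaussianProcessClassifier().fit(X_train, y_train)
# Pass the internal binary estimator, not the wrapper itself
logit_variance = predict_var(gpc.base_estimator_, X_test)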
I have trained a binary classification task (pos. vs. neg.) and have a .h5 model. I also have external data (which was never used in training nor in validation). There are 20 samples overall, belonging to both classes.
preds = model.predict(img)
y_classes = np.argmax(preds , axis=1)
The above code is supposed to produce probabilities (preds) and class labels (0 or 1) if the model was trained with softmax as the last output layer. But preds is only a single number between 0 and 1, and y_classes is always 0.
To step back a little, the model was evaluated with a mean AUC of around 0.75.
I can see that the probabilities of those 20 samples mostly (17 of them) lie between 0 and 0.15; the rest are 0.74, 0.51 and 0.79.
How do I make a conclusion from this?
EDIT:
10 among the 20 test samples belong to the positive class, the other 10 to the negative class. All 10 positive samples have very low probability (0 - 0.15). 7 out of the 10 negative samples have the same low probability, with only 3 being higher (0.74, 0.51 and 0.79).
The question: why is the model predicting the samples with such low probability even though its AUC was quite high?
The sigmoid activation function is used to generate probabilities in binary classification problems. In this case, the model outputs an array of probabilities with shape equal to the number of images to predict. We can retrieve the predicted class by simply checking the probability score: if it's above 0.5 (this is a common practice, but you can also change it according to your needs), the image belongs to class 1, else it belongs to class 0.
preds = model.predict(img)  # (n_images, 1)
y_classes = (preds > 0.5).astype(int).ravel()  # (n_images,)
In the case of sigmoid, your last output layer must be Dense(1, activation='sigmoid').
In the case of softmax (as you have done), the predicted class is retrieved using argmax:
preds = model.predict(img) # (n_images, n_class)
y_classes = np.argmax(preds , axis=1) # (n_images,)
In the case of softmax, your last output layer must be Dense(n_classes, activation='softmax').
WHY AUC IS NOT A GOOD METRIC
The value of AUC can be misleading and can cause us sometimes to overestimate and sometimes to underestimate the actual performance of a model. The behavior of average precision is more expressive for getting a feel of how the model is doing, because it is more sensitive in distinguishing between a good and a very good model. Moreover, it is directly linked to precision, an indicator which is human-understandable. Here is a great reference about these topics which explains all you need: https://towardsdatascience.com/why-you-should-stop-using-the-roc-curve-a46a9adc728
By using a sigmoid function as your activation function, you are basically "compressing" the output of prior layers into a probability value from 0 to 1.
The softmax function, loosely speaking, takes a collection of per-class scores, aggregates them, and shows the ratio between a specific class's score and the aggregated scores of all classes.
For example: if I'm using a model to predict whether an image is an image of a banana, an apple or a grape, and my model recognizes that a certain image is 0.75 banana, 0.20 apple and 0.15 grape (each score generated with a sigmoid function), my softmax layer will make this calculation:
banana: 0.75 / (0.75 + 0.20 + 0.15) = 0.6818, apple: 0.20 / 1.1 = 0.1818, grape: 0.15 / 1.1 = 0.1364.
As we can see, this model will classify this specific picture as a picture of a banana thanks to our softmax layer. Yet, in order to make this classification, it first used a series of sigmoid scores.
To finally get to the point: the interpretation of a sigmoid output should be similar to the one you'd make with a softmax layer, but while a softmax layer gives you the comparison between one class and the others, a sigmoid function simply tells you how likely it is that a given piece of information belongs to the positive class.
In order to make the final call and decide whether a certain item belongs to the positive class, you need to pick a threshold (not necessarily 0.5). Picking a threshold is the final step of interpreting your output. If you'd like to maximize the precision of your model, pick a higher threshold; if you'd like to maximize its recall, pick a lower one.
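For example, you could sweep candidate thresholds with scikit-learn's precision_recall_curve (a sketch; y_true holds your 20 ground-truth labels and preds the sigmoid outputs from above):
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_true, preds.ravel())
# Pick, say, the first threshold reaching a minimum acceptable precision
for p, r, t in zip(precision, recall, thresholds):
    if p >= 0.8:
        print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
        break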
I hope it answers your question, let me know if you'd like me to elaborate on anything as this answer is quite general.
I am trying to build a neural network in TensorFlow where the cost of a Type I error (false positive) is higher than that of a Type II error (false negative). Is there a way to impose this during the training process (i.e. by inputting a cost matrix)? This is possible with simple models like logistic regression in scikit-learn by specifying the class_weight parameter.
cw = {0: 3, 1: 1}
clf = LogisticRegression(class_weight = cw )
In this case, incorrectly predicting a 0 is 3x more costly than incorrectly predicting a 1. However, this cannot be done the same way with a neural network, so I want to see if it is possible in TensorFlow.
Thanks
You could use tf.nn.weighted_cross_entropy_with_logits and its pos_weight argument.
This argument weights the positive class, as described by the documentation (in TF 2.0 at least):
A value pos_weights > 1 decreases the false negative count, hence increasing the recall.
Conversely setting pos_weights < 1 decreases the false positive count and increases the precision.
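Concretely, here is the per-sample loss from the TF docs, transcribed in NumPy to make the effect of pos_weight visible: the positive-label term is scaled, so a value below 1 makes errors on true negatives (type I errors) relatively more expensive.
import numpy as np

def weighted_bce(labels, logits, pos_weight):
    # labels * -log(sigmoid(logits)) * pos_weight
    #   + (1 - labels) * -log(1 - sigmoid(logits))
    p = 1 / (1 + np.exp(-logits))
    return -labels * np.log(p) * pos_weight - (1 - labels) * np.log(1 - p)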
In your case, you could create a custom loss function like this:
import tensorflow as tf

# The loss expects raw logits from your network,
# not values after a sigmoid activation
class WeightedBinaryCrossEntropy:
    def __init__(self, positive_weight: float):
        self.positive_weight = positive_weight

    def __call__(self, targets, logits, sample_weight=None):
        # Match the dtype and shape of the logits so the loss is computed
        # elementwise instead of broadcasting (n,) against (n, 1)
        targets = tf.cast(tf.reshape(targets, tf.shape(logits)), logits.dtype)
        return tf.nn.weighted_cross_entropy_with_logits(
            targets, logits, pos_weight=self.positive_weight
        )
And train a neural network with it, for example using tf.keras (weighted so that, as in your question, a type I error is three times as costly):
import numpy as np

model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Dense(32, input_shape=(10,)),
        tf.keras.layers.Activation("relu"),
        tf.keras.layers.Dense(10),
        tf.keras.layers.Activation("relu"),
        # Output one logit for binary classification
        tf.keras.layers.Dense(1),
    ]
)

# Example random data
data = np.random.random((32, 10))
targets = np.random.randint(2, size=32)

# Per the documentation quoted above, pos_weight < 1 penalizes false
# positives: a weight of 1/3 makes a type I error 3x as costly
model.compile(optimizer="rmsprop", loss=WeightedBinaryCrossEntropy(positive_weight=1 / 3))
model.fit(data, targets, batch_size=32)
You can use a logarithmic scale. For a 0 incorrectly predicted as 1, y - ŷ = -1, and the loss is about 1.72. For a 1 predicted as 0, y - ŷ = 1, and the loss is about 0.63. For y == ŷ the loss equals 0. So a 0 incorrectly predicted as 1 is almost three times more costly.
import numpy as np
from math import exp

# Note: -np.log(exp(y - ŷ)) is just ŷ - y, so this simplifies to
# abs(1 - exp(ŷ - y))
loss = abs(1 - exp(-np.log(exp(y - ŷ))))

# abs(1-exp(-np.log(exp(0))))
# Out[53]: 0.0
# abs(1-exp(-np.log(exp(-1))))
# Out[54]: 1.718281828459045
# abs(1-exp(-np.log(exp(1))))
# Out[55]: 0.6321205588285577
Then you have an asymmetric penalty to optimize. Implementing it as a Keras loss:
import keras.backend as K

def custom_loss(y_true, y_pred):
    # Same expression as above, simplified and written with Keras backend
    # ops so it works on tensors
    return K.mean(K.abs(1 - K.exp(y_pred - y_true)))
Then:
model.compile(loss=custom_loss, optimizer=sgd, metrics=['accuracy'])
I am using Python scikit-learn's GradientBoostingClassifier with the setting that selects random samples (stochastic), and a sample_weight of 1 for one of the binary classes (outcome = 0) and 20 for the other (outcome = 1). My question is how these weights are applied, in layman's terms.
Is it that at each iteration, the model will select x rows from the sample for the 0 outcome and y rows for the 1 outcome, and then the sample_weight setting will kick in and keep all of x but oversample the y (outcome 1) rows by a factor of 20?
From the documentation I cannot tell whether it is oversampling when sample_weight > 1. I understand that class_weight is different and does not change the data, only how the model interprets it via the loss function. Is it true that sample_weight, on the other hand, effectively changes the data fed into the model by oversampling?
Thanks
Sample weights are a multiplicative factor; here is the code:
https://github.com/scikit-learn/scikit-learn/blob/f0ab589f/sklearn/ensemble/gradient_boosting.py#L1225
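In other words, a sample with weight 20 contributes to the loss (and therefore to the gradients fitted at each boosting iteration) as if it appeared 20 times; the rows themselves are not duplicated, and as far as I can tell the stochastic subsample at each iteration is still drawn uniformly. A rough illustration of the idea, not scikit-learn's actual code:
import numpy as np

def weighted_log_loss(y_true, y_prob, sample_weight):
    # Each sample's loss term is multiplied by its weight, which is
    # equivalent in expectation to duplicating the row weight-many times
    per_sample = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return np.sum(sample_weight * per_sample) / np.sum(sample_weight)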
There are standard ways of predicting proportions, such as logistic regression (without thresholding) and beta regression. There have already been discussions about this:
http://scikit-learn-general.narkive.com/4dSCktaM/using-logistic-regression-on-a-continuous-target-variable
http://scikit-learn-general.narkive.com/lLVQGzyl/beta-regression
I cannot tell whether a work-around exists within the sklearn framework.
There exists a workaround, but it is not intrinsically within the sklearn framework.
If you have a proportional target variable (value range 0-1) you run into two basic difficulties with scikit-learn:
Classifiers (such as logistic regression) deal with class labels as target variables only. As a workaround you could simply threshold your probabilities to 0/1 and interpret them as class labels, but you would lose a lot of information.
Regression models (such as linear regression) do not restrict the target variable. You can train them on proportional data, but there is no guarantee that the output on unseen data will be restricted to the 0/1 range. However, in this situation, there is a powerful work-around (below).
There are different ways to mathematically formulate logistic regression. One of them is the generalized linear model, which basically defines the logistic regression as a normal linear regression on logit-transformed probabilities. Normally, this approach requires sophisticated mathematical optimization because the probabilities are unknown and need to be estimated along with the regression coefficients.
In your case, however, the probabilities are known. This means you can simply transform them with y = log(p / (1 - p)). Now they cover the full range from -oo to oo and can serve as the target variable for a LinearRegression model [*]. Of course, the model output then needs to be transformed again to result in probabilities p = 1 / (exp(-y) + 1).
import numpy as np
from sklearn.linear_model import LinearRegression


class LogitRegression(LinearRegression):

    def fit(self, x, p):
        # p must lie strictly inside (0, 1); clip exact 0/1 values first,
        # e.g. p = np.clip(p, 1e-8, 1 - 1e-8), or the logit blows up
        p = np.asarray(p)
        y = np.log(p / (1 - p))
        return super().fit(x, y)

    def predict(self, x):
        y = super().predict(x)
        return 1 / (np.exp(-y) + 1)


if __name__ == '__main__':
    # generate example data
    np.random.seed(42)
    n = 100
    x = np.random.randn(n).reshape(-1, 1)
    noise = 0.1 * np.random.randn(n).reshape(-1, 1)
    p = np.tanh(x + noise) / 2 + 0.5

    model = LogitRegression()
    model.fit(x, p)
    print(model.predict([[-10], [0.0], [1]]))
    # [[ 2.06115362e-09]
    #  [ 5.00000000e-01]
    #  [ 8.80797078e-01]]
There are also numerous other alternatives. Some non-linear regression models can work naturally in the 0-1 range. For example, random forest regressors will never exceed the range of the target variables they were trained with: simply put probabilities in and you will get probabilities out. Neural networks with an appropriate output activation function (a sigmoid, for the 0-1 range) will also work well with probabilities, but if you want to use those there are more specialized libraries than sklearn.
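For instance, a minimal sketch reusing x and p from the example above:
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(x, p.ravel())
# Tree predictions are averages of training targets, so they stay in [0, 1]
print(forest.predict([[-10], [0.0], [1]]))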
[*] You could in fact plug in any linear regression model, which can make the method more powerful, but then it is no longer exactly equivalent to logistic regression.