How to output mean and stdv of Gaussian Process Classifier in sklearn?

I'm fitting some data for a classification task using Gaussian Process Classifiers in sklearn. I know that for the Gaussian Process Regressor one can pass return_std in
y_test, std = gp.predict(x_test, return_std=True)
to output the standard deviation of the test sample (like in this question).
However, I couldn't find such a parameter for the GP Classifier.
Is there a way to output the predictive mean and stdv of test data from a GP Classifier? And is there a way to output the posterior mean and covariance of the fitted model?

There is no standard deviation for categorical data, hence there is no return_std parameter in the Classifier.
However, if you want to quantify the uncertainty of the classifier's predictions, you could use the .predict_proba(X) method. Once you have the probabilities of each possible class, you could compute the entropy of the predicted probabilities.
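As a minimal sketch of the entropy idea above: the function name predictive_entropy and the small proba array are made up for illustration, and the array stands in for whatever clf.predict_proba(X_test) would return on your fitted classifier.

```python
import numpy as np

def predictive_entropy(proba, eps=1e-12):
    """Shannon entropy (in nats) of each row of a class-probability matrix."""
    proba = np.asarray(proba, dtype=float)
    # Clip only inside the log to avoid log(0); rows are assumed to sum to 1.
    clipped = np.clip(proba, eps, 1.0)
    return -np.sum(proba * np.log(clipped), axis=1)

# Hypothetical stand-in for the output of clf.predict_proba(X_test)
proba = np.array([[0.5, 0.5], [0.9, 0.1], [1.0, 0.0]])
entropy = predictive_entropy(proba)
```

Rows with higher entropy are the ones the classifier is less sure about; for two classes the maximum is log(2), reached at probabilities of 0.5 each.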

You can get the variance of the latent function (the logit) by going to the predict_proba definition in _gpc.py and returning the var_f_star value. I modified predict_proba into a function that returns this latent variance:
import numpy as np
from scipy.linalg import solve
from sklearn.utils.validation import check_is_fitted
def predict_var(self, X):
    """Return the variance of the latent function at the query points X.
    ``self`` is the fitted object whose predict_proba this was copied from
    in _gpc.py.
    Parameters
    ----------
    X : array-like of shape (n_samples, n_features) or list of object
        Query points where the GP is evaluated for classification.
    Returns
    -------
    var_f_star : ndarray of shape (n_samples,)
        Variance of the latent function f at each query point.
    """
    check_is_fitted(self)
    # Based on Algorithm 3.2 of GPML
    K_star = self.kernel_(self.X_train_, X)  # K_star = k(x_star)
    f_star = K_star.T.dot(self.y_train_ - self.pi_)  # Line 4 (latent mean)
    v = solve(self.L_, self.W_sr_[:, np.newaxis] * K_star)  # Line 5
    # Line 6 (compute np.diag(v.T.dot(v)) via einsum)
    var_f_star = self.kernel_.diag(X) - np.einsum("ij,ij->j", v, v)
    return var_f_star
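To show what the latent mean and variance are good for, here is a hedged sketch that turns them into a standard deviation and an averaged class probability. It uses MacKay's approximation to the logistic-Gaussian integral, which is not necessarily the approximation scikit-learn applies internally; the function name averaged_probability and the example numbers are assumptions for illustration only.

```python
import numpy as np

def averaged_probability(f_mean, f_var):
    """Approximate E[sigmoid(f)] for f ~ N(f_mean, f_var) (MacKay's approximation)."""
    f_mean = np.asarray(f_mean, dtype=float)
    f_var = np.asarray(f_var, dtype=float)
    # A larger latent variance shrinks the effective logit towards 0.
    kappa = 1.0 / np.sqrt(1.0 + np.pi * f_var / 8.0)
    return 1.0 / (1.0 + np.exp(-kappa * f_mean))

# Hypothetical latent means and variances (stand-ins for f_star and var_f_star)
f_mean = np.array([2.0, 2.0, 0.0])
f_var = np.array([0.0, 10.0, 10.0])
f_std = np.sqrt(f_var)  # latent standard deviation
p = averaged_probability(f_mean, f_var)
```

With the same latent mean, the point with the larger latent variance gets a probability closer to 0.5, which is how uncertainty in the latent function shows up in the class probabilities.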

Related

How does statsmodels calculate in-sample predictions in AR models?

I am very new to time series modeling and statsmodels and am trying to understand the AR model in statsmodels. Suppose I have a data record y of 1000 samples, and I fit an AR(1) model on y. Then I generate the in-sample prediction from this model as y_pred. I do this as
from statsmodels.tsa.ar_model import AutoReg
model = AutoReg(y,1).fit()
y_pred = model.predict()
I get the parameters of the model using model.params.
I would like to know how statsmodels calculates the in-sample predictions after estimating the model parameters. For example, how is y_pred[10] calculated?
I am sorry if the question is too basic, thanks for the help.

Per Wikipedia:
The autoregressive model specifies that the output variable depends linearly on its own previous values and on a stochastic term (an imperfectly predictable term).
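In your model example, you have one predictor: the lagged value of y. In this simple case, the .predict() method multiplies each lagged value by the estimated slope parameter for that predictor and adds the estimated intercept. Every in-sample prediction is therefore intercept + slope * (previous observation). Which element of y_pred that lands on depends on how the returned array is aligned with y. In the output below the array has one element fewer than y, so y_pred[0] is the prediction for y[1], and y_pred[10] would be the slope times y[10] plus the intercept (the prediction for y[11]). If the array you get back has the same length as y, then y_pred[10] is the prediction for y[10], i.e. the slope times y[9] plus the intercept.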
Here is an example:
from statsmodels.tsa.ar_model import AutoReg
y = [1, 2, 3, 6, 2, 9, 1]
model = AutoReg(y,1).fit()
model.params
# array([ 5.72953737, -0.49466192])
The first value in the params array is the estimated intercept parameter and the second value is the estimated linear (slope) parameter.
y_pred = model.predict()
y_pred
# array([5.23487544, 4.74021352, 4.2455516 , 2.76156584, 4.74021352, 1.27758007])
The first value in the y_pred array is the predicted value for the second value in the y array. It is calculated as:
-0.49466192 * 1 + 5.72953737 ≈ 5.23487544
The second value in the y_pred array is computed as:
-0.49466192 * 2 + 5.72953737 ≈ 4.74021352
and so on...
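The numbers above can be reproduced without statsmodels. This sketch assumes the AR(1) model has a constant term and is estimated by ordinary least squares of y[t] on y[t-1]; the variable names are made up for illustration.

```python
import numpy as np

y = np.array([1, 2, 3, 6, 2, 9, 1], dtype=float)

# Regress each observation on the one before it, with an intercept column.
lagged, target = y[:-1], y[1:]
design = np.column_stack([np.ones_like(lagged), lagged])
(intercept, slope), *_ = np.linalg.lstsq(design, target, rcond=None)

# One-step-ahead in-sample predictions for y[1], ..., y[6]
manual_pred = intercept + slope * lagged
```

Here manual_pred[k] is the prediction for y[k + 1] computed from y[k], matching the alignment of the y_pred array printed above.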

how initial bias value is chosen in sklearn logistic regression?

Training logistic regression is an iterative process: at each step it updates the weights of the x variables and the bias value to minimize the loss function.
In the official sklearn code (class LogisticRegression in the scikit-learn linear_model module), the fit method of the logistic regression class starts as follows:
def fit(self, X, y, sample_weight=None):
    """
    Fit the model according to the given training data.
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like of shape (n_samples,)
        Target vector relative to X.
    sample_weight : array-like of shape (n_samples,) default=None
        Array of weights that are assigned to individual samples.
        If not provided, then each sample is given unit weight.
        .. versionadded:: 0.17
           *sample_weight* support to LogisticRegression.
I am guessing that sample_weight is the weight of the x variables, which is set to 1 if not given. Is the bias value also 1?

You sound somewhat confused, perhaps looking for an analogy here with the weights & biases of a neural network. But this is not the case; sample_weight here has nothing to do with the weights of a neural network, even as a concept.
sample_weight is there so that, if the (business) problem requires so, we can give more weight (i.e. more importance) to some samples compared with others, and this importance directly affects the loss. It is sometimes used in cases of imbalanced data; quoting from the Tips on practical use section of the documentation (it is about decision trees, but the rationale is the same):
Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value.
and from a relevant thread at Cross Validated:
Sample weights are used to increase the importance of a single data-point (let's say, some of your data is more trustworthy, then they receive a higher weight). So: The sample weights exist to change the importance of data-points
You can see a practical demonstration of how changing the weight of some samples changes the final model in the SO thread What does `sample_weight` do to the way a `DecisionTreeClassifier` works in sklearn? (again, it is about decision trees, but the rationale is the same).
Having clarified that, it should now be apparent that there is no room here for any kind of "bias" parameter whatsoever. In fact, the introductory paragraph in your question is wrong: logistic regression does not compute such weights and biases; it returns coefficients and an intercept term (sometimes itself called bias), and these coefficients & intercept have nothing to do with sample_weight.
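To make the distinction concrete, here is a rough sketch of how sample weights enter a logistic loss. It is an illustration of the idea, not scikit-learn's implementation (regularization is left out), and the function weighted_log_loss and the toy data are hypothetical.

```python
import numpy as np

def weighted_log_loss(coef, intercept, X, y, sample_weight=None):
    """Sum of per-sample logistic losses, each scaled by its sample weight."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if sample_weight is None:
        sample_weight = np.ones(len(y))  # unit weight for every sample
    z = X @ coef + intercept
    # log(1 + exp(z)) - y * z is the negative log-likelihood for labels in {0, 1}
    per_sample = np.logaddexp(0.0, z) - y * z
    return float(np.sum(sample_weight * per_sample))

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0, 0, 1])
coef, intercept = np.array([0.7]), -0.3  # arbitrary model parameters

# Giving the last sample weight 2 ...
loss_weighted = weighted_log_loss(coef, intercept, X, y, sample_weight=[1, 1, 2])
# ... yields the same loss as including that sample twice.
loss_duplicated = weighted_log_loss(coef, intercept, np.vstack([X, X[-1:]]), np.append(y, 1))
```

The weights multiply rows of the data (samples), whereas coef and intercept are the quantities being fitted; the two play different roles in the loss.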

scikit-learn - multinomial logistic regression with probabilities as a target variable

I'm implementing a multinomial logistic regression model in Python using scikit-learn. The thing is, however, that I'd like to use probability distribution for classes of my target variable. As an example let's say that this is a 3-classes variable which looks as follows:
class_1 class_2 class_3
0 0.0 0.0 1.0
1 1.0 0.0 0.0
2 0.0 0.5 0.5
3 0.2 0.3 0.5
4 0.5 0.1 0.4
The values in every row sum to 1.
How could I fit a model like this? When I try:
model = LogisticRegression(solver='saga', multi_class='multinomial')
model.fit(X, probabilities)
I get an error saying:
ValueError: bad input shape (10000, 3)
Which I know is related to the fact that this method expects a vector, not a matrix. But here I can't compress the probabilities matrix into vector since the classes are not exclusive.

You can't have cross-entropy loss with non-indicator probabilities in scikit-learn; this is not implemented and not supported in the API. It is a limitation of scikit-learn.
For logistic regression you can approximate it by upsampling instances according to the probabilities of their labels. For example, you can up-sample every instance 10x: if for a training instance class 1 has probability 0.2 and class 2 has probability 0.8, generate 10 training instances, 8 with class 2 and 2 with class 1. It won't be as efficient as it could be, but in the limit you'll be optimizing the same objective function.
You can do something like this:
from sklearn.utils import check_random_state
import numpy as np
def expand_dataset(X, y_proba, factor=10, random_state=None):
    """
    Convert a dataset with float multiclass probabilities to a dataset
    with indicator probabilities by duplicating X rows and sampling
    true labels.
    """
    rng = check_random_state(random_state)
    n_classes = y_proba.shape[1]
    classes = np.arange(n_classes, dtype=int)
    for x, probs in zip(X, y_proba):
        for label in rng.choice(classes, size=factor, p=probs):
            yield x, label
See a more complete example here: https://github.com/TeamHG-Memex/eli5/blob/8cde96878f14c8f46e10627190abd9eb9e705ed4/eli5/lime/utils.py#L16
Alternatively, you can implement your Logistic Regression using libraries like TensorFlow or PyTorch; unlike scikit-learn, it is easy to define any loss in these frameworks, and cross-entropy is available out of the box.
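A deterministic variant of the upsampling idea, offered as a sketch rather than an established recipe: repeat each row once per class and carry the class probability as a sample weight instead of sampling labels. The helper name expand_with_weights and the toy arrays are hypothetical.

```python
import numpy as np

def expand_with_weights(X, y_proba):
    """Repeat each row once per class; the class probability becomes the row weight."""
    X = np.asarray(X)
    y_proba = np.asarray(y_proba, dtype=float)
    n_samples, n_classes = y_proba.shape
    X_rep = np.repeat(X, n_classes, axis=0)            # x0, x0, x0, x1, x1, x1, ...
    y_rep = np.tile(np.arange(n_classes), n_samples)   # 0, 1, 2, 0, 1, 2, ...
    weights = y_proba.ravel()                          # p00, p01, p02, p10, ...
    keep = weights > 0                                 # zero-weight rows add nothing
    return X_rep[keep], y_rep[keep], weights[keep]

X = np.array([[0.1], [0.2]])
y_proba = np.array([[0.0, 0.0, 1.0],
                    [0.2, 0.3, 0.5]])
X_rep, y_rep, weights = expand_with_weights(X, y_proba)
```

The three arrays could then be passed to the classifier as model.fit(X_rep, y_rep, sample_weight=weights), so that each (row, label) pair contributes to the loss in proportion to its probability.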

You need to input the correct labels with the training data; the logistic regression model will then give you probabilities in return when you use predict_proba(X), which returns a matrix of shape [n_samples, n_classes]. If you use just predict(X), it gives you an array with the most probable class for each sample, of shape [n_samples].

how to use sklearn when target variable is a proportion

There are standard ways of predicting proportions such as logistic regression (without thresholding) and beta regression. There have already been discussions about this:
http://scikit-learn-general.narkive.com/4dSCktaM/using-logistic-regression-on-a-continuous-target-variable
http://scikit-learn-general.narkive.com/lLVQGzyl/beta-regression
I cannot tell if there exists a work-around within the sklearn framework.

There exists a workaround, but it is not intrinsically within the sklearn framework.
If you have a proportional target variable (value range 0-1) you run into two basic difficulties with scikit-learn:
Classifiers (such as logistic regression) deal with class labels as target variables only. As a workaround you could simply threshold your probabilities to 0/1 and interpret them as class labels, but you would lose a lot of information.
Regression models (such as linear regression) do not restrict the target variable. You can train them on proportional data, but there is no guarantee that the output on unseen data will be restricted to the 0/1 range. However, in this situation, there is a powerful work-around (below).
There are different ways to mathematically formulate logistic regression. One of them is the generalized linear model, which basically defines the logistic regression as a normal linear regression on logit-transformed probabilities. Normally, this approach requires sophisticated mathematical optimization because the probabilities are unknown and need to be estimated along with the regression coefficients.
In your case, however, the probabilities are known. This means you can simply transform them with y = log(p / (1 - p)). Now they cover the full range from -oo to oo and can serve as the target variable for a LinearRegression model [*]. Of course, the model output then needs to be transformed again to result in probabilities p = 1 / (exp(-y) + 1).
import numpy as np
from sklearn.linear_model import LinearRegression
class LogitRegression(LinearRegression):
    """Linear regression fitted on logit-transformed proportions."""
    def fit(self, x, p):
        p = np.asarray(p)
        y = np.log(p / (1 - p))  # logit transform; requires 0 < p < 1
        return super().fit(x, y)
    def predict(self, x):
        y = super().predict(x)
        return 1 / (np.exp(-y) + 1)  # inverse logit maps back to (0, 1)
if __name__ == '__main__':
    # generate example data
    np.random.seed(42)
    n = 100
    x = np.random.randn(n).reshape(-1, 1)
    noise = 0.1 * np.random.randn(n).reshape(-1, 1)
    p = np.tanh(x + noise) / 2 + 0.5
    model = LogitRegression()
    model.fit(x, p)
    print(model.predict([[-10], [0.0], [1]]))
    # [[ 2.06115362e-09]
    # [ 5.00000000e-01]
    # [ 8.80797078e-01]]
There are also numerous other alternatives. Some non-linear regression models work naturally in the 0-1 range. For example, Random Forest Regressors will never exceed the range of the target values they were trained with. Simply put probabilities in and you will get probabilities out. Neural networks with an appropriate output activation function (a sigmoid, for instance, since tanh outputs lie between -1 and 1 and would need rescaling) will also work well with probabilities, but if you want to use those there are more specialized libraries than sklearn.
[*] You could in fact plug in any linear regression model which can make the method more powerful, but then it no longer is exactly equivalent to logistic regression.
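One practical caveat with the logit transform: proportions of exactly 0 or 1 map to -inf and +inf. A common workaround, sketched here under the assumption that a small clipping constant eps is acceptable for your data, is to clip before transforming. The names safe_logit and inverse_logit are made up for this example.

```python
import numpy as np

def safe_logit(p, eps=1e-6):
    """Logit transform with clipping so that p = 0 or p = 1 stays finite."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def inverse_logit(y):
    """Map values on the real line back to the (0, 1) range."""
    return 1.0 / (np.exp(-np.asarray(y, dtype=float)) + 1.0)

p = np.array([0.0, 0.25, 0.5, 1.0])
y = safe_logit(p)          # finite everywhere, including the endpoints
p_back = inverse_logit(y)  # recovers p up to the clipping at the endpoints
```

The choice of eps is a judgment call: it bounds how extreme the transformed targets can get, and therefore how strongly the endpoint samples pull on the linear fit.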

Keras, output of model predict_proba

In the docs, the predict_proba(self, x, batch_size=32, verbose=1) is
Generates class probability predictions for the input samples batch by batch.
and returns
A Numpy array of probability predictions.
Suppose my model is a binary classification model. Is the output [a, b], where a is the probability of class_0 and b is the probability of class_1?

Here the situation is different and somewhat misleading, especially when you compare the predict_proba method to the sklearn methods with the same name. In Keras (not the sklearn wrappers), the predict_proba method is exactly the same as the predict method. You can even check it here:
def predict_proba(self, x, batch_size=32, verbose=1):
    """Generates class probability predictions for the input samples
    batch by batch.
    # Arguments
        x: input data, as a Numpy array or list of Numpy arrays
            (if the model has multiple inputs).
        batch_size: integer.
        verbose: verbosity mode, 0 or 1.
    # Returns
        A Numpy array of probability predictions.
    """
    preds = self.predict(x, batch_size, verbose)
    if preds.min() < 0. or preds.max() > 1.:
        warnings.warn('Network returning invalid probability values. '
                      'The last layer might not normalize predictions '
                      'into probabilities '
                      '(like softmax or sigmoid would).')
    return preds
So, in the binary classification case, the output you get depends on the design of your network:
If the final output of your network is a single sigmoid unit, then the output of predict_proba is simply the probability assigned to class 1.
If the final output of your network is two-dimensional with a softmax applied to it, then the output of predict_proba is a pair [a, b] where a = P(class(x) = 0) and b = P(class(x) = 1).
The second design is rarely used, and there are some theoretical advantages to using the first, but I wanted to mention it just in case.
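The two output designs can be compared numerically without Keras. This is a plain NumPy sketch with made-up logits; it only illustrates the shapes and the relationship between a single sigmoid output and a two-way softmax.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

z = np.array([[-2.0], [0.0], [3.0]])  # hypothetical pre-activation values

# Design 1: one sigmoid unit, shape (n_samples, 1), holds P(class 1) only.
single_output = sigmoid(z)

# Design 2: two softmax units, shape (n_samples, 2), holds [P(class 0), P(class 1)].
two_outputs = softmax(np.hstack([np.zeros_like(z), z]))

# A single-column output can be expanded into a pair by hand.
as_pair = np.hstack([1.0 - single_output, single_output])
```

A softmax over the logits [0, z] gives the same class-1 probability as sigmoid(z), so the two designs carry the same information; only the shape of what predict_proba returns differs.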

It depends on how you specify the output of your model and your targets. It can be either. Usually, when doing binary classification, the output is a single value, which is the probability of the positive class. One minus the output is the probability of the negative class.
