There are standard ways of predicting proportions such as logistic regression (without thresholding) and beta regression. There have already been discussions about this:
http://scikit-learn-general.narkive.com/4dSCktaM/using-logistic-regression-on-a-continuous-target-variable
http://scikit-learn-general.narkive.com/lLVQGzyl/beta-regression
I cannot tell if there exists a work-around within the sklearn framework.
There exists a workaround, but it is not intrinsically within the sklearn framework.
If you have a proportional target variable (value range 0-1) you run into two basic difficulties with scikit-learn:
1. Classifiers (such as logistic regression) deal with class labels as target variables only. As a workaround you could simply threshold your probabilities to 0/1 and interpret them as class labels, but you would lose a lot of information.
2. Regression models (such as linear regression) do not restrict the target variable. You can train them on proportional data, but there is no guarantee that the output on unseen data will be restricted to the 0-1 range. However, in this situation, there is a powerful workaround (below).
There are different ways to mathematically formulate logistic regression. One of them is the generalized linear model, which basically defines the logistic regression as a normal linear regression on logit-transformed probabilities. Normally, this approach requires sophisticated mathematical optimization because the probabilities are unknown and need to be estimated along with the regression coefficients.
In your case, however, the probabilities are known. This means you can simply transform them with y = log(p / (1 - p)). Now they cover the full range from -oo to oo and can serve as the target variable for a LinearRegression model [*]. Of course, the model output then needs to be transformed again to result in probabilities p = 1 / (exp(-y) + 1).
import numpy as np
from sklearn.linear_model import LinearRegression


class LogitRegression(LinearRegression):

    def fit(self, x, p):
        p = np.asarray(p)
        y = np.log(p / (1 - p))
        return super().fit(x, y)

    def predict(self, x):
        y = super().predict(x)
        return 1 / (np.exp(-y) + 1)


if __name__ == '__main__':
    # generate example data
    np.random.seed(42)
    n = 100
    x = np.random.randn(n).reshape(-1, 1)
    noise = 0.1 * np.random.randn(n).reshape(-1, 1)
    p = np.tanh(x + noise) / 2 + 0.5

    model = LogitRegression()
    model.fit(x, p)

    print(model.predict([[-10], [0.0], [1]]))
    # [[ 2.06115362e-09]
    #  [ 5.00000000e-01]
    #  [ 8.80797078e-01]]
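One practical caveat with the transform above: log(p / (1 - p)) is undefined when a target is exactly 0 or 1. A minimal sketch of one common workaround, clipping the proportions into [eps, 1 - eps] before transforming, is shown here in plain Python. The helper names and the value of EPS are assumptions chosen for illustration and are not part of the answer above.

```python
import math

EPS = 1e-6  # assumed clipping margin; tune for your data


def safe_logit(p, eps=EPS):
    """Logit of p after clipping p into [eps, 1 - eps]."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def inverse_logit(y):
    """Map a real-valued prediction back to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))


if __name__ == "__main__":
    for p in (0.0, 0.25, 0.5, 1.0):
        y = safe_logit(p)
        # Targets of exactly 0 or 1 now give finite values of y.
        print(p, y, inverse_logit(y))
```

The same clipping could be applied to the array p inside fit() before the log is taken; how large eps should be depends on how much weight you want boundary observations to carry.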
There are also numerous other alternatives. Some non-linear regression models can work naturally in the 0-1 range. For example, Random Forest Regressors will never exceed the range of the target variable they were trained with. Simply put probabilities in and you will get probabilities out. Neural networks with an appropriate output activation function (a sigmoid, which maps to the 0-1 range) will also work well with probabilities, but if you want to use those there are more specialized libraries than sklearn.
[*] You could in fact plug in any linear regression model which can make the method more powerful, but then it no longer is exactly equivalent to logistic regression.
From many documents, I have learned the recipe of Ridge regression, which is:
loss_Ridge = loss_function + lambda * (L2 norm of slope)
and the recipe of Lasso regression, which is:
loss_Lasso = loss_function + lambda * (L1 norm of slope)
When I read the topic "Implementing Lasso and Ridge Regression" in "TensorFlow Machine Learning Cookbook", its author explained that:
"...we will use a continuous approximation to a step function, called
the continuous heavy step function..."
and its author also provided lines of code here.
I don't understand what 'the continuous heavy step function' refers to in this context. Please help me.
From the link that you provided,
if regression_type == 'LASSO':
    # Declare Lasso loss function
    # Lasso Loss = L2_Loss + heavyside_step,
    # Where heavyside_step ~ 0 if A < constant, otherwise ~ 99
    lasso_param = tf.constant(0.9)
    heavyside_step = tf.truediv(1., tf.add(1., tf.exp(tf.multiply(-50., tf.subtract(A, lasso_param)))))
    regularization_param = tf.multiply(heavyside_step, 99.)
    loss = tf.add(tf.reduce_mean(tf.square(y_target - model_output)), regularization_param)
This heavyside_step function is very close to a logistic function which in turn can be a continuous approximation for a step function.
You use a continuous approximation because the loss function needs to be differentiable with respect to the parameters of your model.
To get an intuition for this, read the constrained formulation in section 1.6 of https://www.cs.ubc.ca/~schmidtm/Documents/2005_Notes_Lasso.pdf
You can see that in your code if A < 0.9 then regularization_param vanishes, so optimization will constrain A in that range.
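To make the shape of that approximation concrete, here is a small plain-Python sketch that evaluates the same expression as heavyside_step (steepness 50, threshold 0.9) next to a true step function. The function names are hypothetical and chosen only for illustration.

```python
import math


def smooth_step(a, threshold=0.9, steepness=50.0):
    """Logistic curve that approximates a unit step located at `threshold`."""
    return 1.0 / (1.0 + math.exp(-steepness * (a - threshold)))


def hard_step(a, threshold=0.9):
    """The discontinuous step function being approximated."""
    return 1.0 if a >= threshold else 0.0


if __name__ == "__main__":
    for a in (0.5, 0.85, 0.9, 0.95, 1.5):
        # Far from the threshold the two agree closely; near it the
        # logistic curve changes smoothly, so it has a usable gradient.
        print(a, hard_step(a), round(smooth_step(a), 6))
```

Multiplying smooth_step by 99, as the quoted code does, gives a penalty that is close to 0 for A well below 0.9 and close to 99 for A well above it.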
If you want to select features using Lasso regression, here is one example:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
estimator = Lasso()
featureSelection = SelectFromModel(estimator)
featureSelection.fit(features_vector, target)
selectedFeatures = featureSelection.transform(features_vector)
print(selectedFeatures)
I'm trying to write my own logistic regressor (using batch/mini-batch gradient descent) for practice purposes.
I generated a random dataset (see below) with normally distributed inputs, and the output is binary (0,1). I chose the coefficients for the inputs manually and was hoping to be able to reproduce them (see below for the code snippet). However, to my surprise, neither my own code nor sklearn's LogisticRegression was able to reproduce the actual numbers (although the sign and order of magnitude are in line). Moreover, the coefficients my algorithm produced are different from the ones produced by sklearn.
Am I misinterpreting what the coefficients for a logistic regression are?
I will appreciate any insight into this discrepancy.
Thank you!
edit: I tried using statsmodels Logit and got yet a third set of slightly different values for the coefficients
Some more info that might be relevant:
I wrote a linear regressor using an almost identical code and it worked perfectly, so I am fairly confident this is not a problem in the code. Also my regressor actually outperformed the sklearn one on the training set, and they have the exact same accuracy on the test set, so I have no reason to believe the regressors are wrong.
Code snippets for the generation of the dataset:
o1 = 2
o2 = -3
x[:,1]=np.random.rand(size)*2
x[:,2]=np.random.rand(size)*3
y = np.vectorize(sigmoid)(x[:,1]*o1+x[:,2]*o2 + np.random.normal(size=size))
so as can be seen, input coefficients are +2 and -3 (intercept 0);
sklearn coefficients were ~2.8 and ~-4.8;
my coefficients were ~1.7 and ~-2.6
and of the regressor (the most relevant parts of it):
for j in range(bin_size):
    xs = x[i]
    y_real = y[i]
    z = np.dot(self.coeff,xs)
    h = sigmoid(z)
    dc+= (h-y_real)*xs
self.coeff-= dc * (learning_rate/n)
What intercept was learned? The result should not really be a surprise: your y is a polynomial of 3rd degree, whereas your model has only two coefficients; three coefficients plus an intercept would be needed to model the response variable from the predictors.
Furthermore, the values may differ because of SGD, for example.
I am not really sure, but the coefficients could be different and still return the correct y for a finite set of points. What are the metrics for each model? Do those differ?
All regression examples I find are ones where you predict a real number and, unlike with classification, you don't get the confidence the model had when predicting that number. In reinforcement learning I have done this another way: the output is instead the mean and std, and then you sample from that distribution. Then you know how confident the model is in predicting every value. Now I can't find how to do this with supervised learning in PyTorch. The problem is that I don't understand how to sample from the distribution to get the actual value while training, or what sort of loss function I should use; I am not sure how, for example, MSE or Smooth L1 would work.
Is there any example out there where this is done in PyTorch in a robust, state-of-the-art way?
The key point is that you do not need to sample from the NN-produced distribution. All you need is to optimize the likelihood of the target value under the NN distribution.
There is an example in the official PyTorch example on VAE (https://github.com/pytorch/examples/tree/master/vae), though for multidimensional Bernoulli distribution.
Since PyTorch 0.4, you can use torch.distributions: instantiate distribution distro with outputs of your NN and then optimize -distro.log_prob(target).
EDIT: As requested in a comment, a complete example of using the torch.distributions module.
First, we create a heteroscedastic dataset:
import numpy as np
import torch
X = np.random.uniform(size=300)
Y = X + 0.25*X*np.random.normal(size=X.shape[0])
We build a trivial model, which is perfectly able to match the generative process of our data:
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.mean_coeff = torch.nn.Parameter(torch.Tensor([0]))
        self.var_coeff = torch.nn.Parameter(torch.Tensor([1]))

    def forward(self, x):
        return torch.distributions.Normal(self.mean_coeff * x, self.var_coeff * x)

mdl = Model()
optim = torch.optim.SGD(mdl.parameters(), lr=1e-3)
With this initialization the model always produces a zero-mean normal whose scale equals x, which is a poor fit for our data, so we train (note that this is very naive full-batch training, but it demonstrates that you can output a set of distributions for your whole batch at once):
for _ in range(2000): # epochs
    dist = mdl(torch.from_numpy(X).float())
    obj = -dist.log_prob(torch.from_numpy(Y).float()).mean()
    optim.zero_grad()
    obj.backward()
    optim.step()
Eventually, the learned parameters should match the values we used to construct the Y.
print(mdl.mean_coeff, mdl.var_coeff)
# tensor(1.0150) tensor(0.2597)
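For readers who want to see what the objective in the training loop amounts to, here is a plain-Python sketch of the negative log-density of a normal distribution, which is the quantity -dist.log_prob(target) evaluates for torch.distributions.Normal. The function names and the three data points are made up for illustration; this is not the PyTorch implementation.

```python
import math


def normal_nll(y, mean, std):
    """Negative log-density of y under Normal(mean, std)."""
    return (0.5 * math.log(2.0 * math.pi)
            + math.log(std)
            + (y - mean) ** 2 / (2.0 * std ** 2))


def mean_nll(xs, ys, mean_coeff, std_coeff):
    """Average NLL when the mean and the std both scale linearly with x."""
    total = sum(normal_nll(y, mean_coeff * x, std_coeff * x)
                for x, y in zip(xs, ys))
    return total / len(xs)


if __name__ == "__main__":
    xs = [0.2, 0.5, 0.8]    # illustrative inputs (must be positive)
    ys = [0.21, 0.45, 0.9]  # illustrative targets
    # Objective at the initial parameters versus near the generating ones.
    print(mean_nll(xs, ys, 0.0, 1.0))
    print(mean_nll(xs, ys, 1.0, 0.25))
```

The log(std) term is what stops the model from claiming arbitrarily large uncertainty, while the squared term penalizes over-confident misses; minimizing the sum fits the mean and the spread together, with no sampling required.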
The regularization parameter C in logistic regression (see http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) is used to keep the function to be fitted well defined and to avoid either overfitting or problems with step functions (see https://datascience.stackexchange.com/questions/10805/does-scikit-learn-use-regularization-by-default/10806).
However, regularization in logistic regression should only concern the weights for the features, not the intercept (also explained here: http://aimotion.blogspot.com/2011/11/machine-learning-with-python-logistic.html)
But it seems that sklearn.linear_model.LogisticRegression actually regularizes the intercept as well. Here is why:
1) Consider the above link carefully (https://datascience.stackexchange.com/questions/10805/does-scikit-learn-use-regularization-by-default/10806): the sigmoid is moved slightly to the left, closer to the intercept 0.
2) I tried to fit data points with a logistic curve and a manual maximum likelihood function. Including the intercept in the L2 norm gives results identical to those of sklearn's function.
Two questions please:
1) Did I get this wrong, is this a bug, or is there a well-justified reason for regularizing the intercept?
2) Is there a way to use sklearn and specify to regularize all parameters except the intercepts?
Thanks!
import numpy as np
from sklearn.linear_model import LogisticRegression
C = 1e1
model = LogisticRegression(C=C)
x = np.arange(100, 110)
x = x[:, np.newaxis]
y = np.array([0]*5 + [1]*5)
print(x)
print(y)
model.fit(x, y)
a = model.coef_[0][0]
b = model.intercept_[0]
b_modified = -b/a # without regularization, b_modified should be 104.5 (as for C=1e10)
print("a, b:", a, b_modified)
# OUTPUT:
# [[100]
# [101]
# [102]
# [103]
# [104]
# [105]
# [106]
# [107]
# [108]
# [109]]
# [0 0 0 0 0 1 1 1 1 1]
# a, b: 0.0116744221756 100.478968664
scikit-learn has default regularized logistic regression.
Changing the intercept_scaling parameter of sklearn.linear_model.LogisticRegression affects the result in a way similar to changing only the C parameter.
When intercept_scaling is modified, regularization affects the estimate of the bias in logistic regression. The higher this parameter's value, the smaller the impact of regularization on the bias. Per the official documentation:
The intercept becomes intercept_scaling * synthetic_feature_weight.
Note! the synthetic feature weight is subject to l1/l2 regularization
as all other features. To lessen the effect of regularization on
synthetic feature weight (and therefore on the intercept)
intercept_scaling has to be increased.
Hope it helps!
Thanks #Prem, this is indeed the solution:
C = 1e1
intercept_scaling=1e3 # very high numbers make it unstable in practice
model = LogisticRegression(C=C, intercept_scaling=intercept_scaling)
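A rough way to see why this works: according to the documentation quoted above, the intercept equals intercept_scaling times the weight of a synthetic feature, and it is that weight which gets penalized. The sketch below assumes an L2 penalty of the form 0.5 * w**2 on the synthetic weight; the function name and the example intercept are hypothetical and only illustrate the scaling argument.

```python
def intercept_l2_penalty(intercept, intercept_scaling):
    """L2 penalty paid for a given intercept, assuming the penalty is
    0.5 * w**2 on the synthetic weight w = intercept / intercept_scaling."""
    synthetic_weight = intercept / intercept_scaling
    return 0.5 * synthetic_weight ** 2


if __name__ == "__main__":
    intercept = -10.0  # illustrative value only
    for scaling in (1.0, 10.0, 1e3):
        # The same intercept costs less and less as the scaling grows.
        print(scaling, intercept_l2_penalty(intercept, scaling))
```

Under this assumption the penalty on a fixed intercept shrinks with the square of intercept_scaling, so a value of 1e3 makes it about a million times smaller than the default of 1. Note that the scikit-learn documentation describes intercept_scaling as used only with the liblinear solver, so the solver in use is worth checking.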
I was looking at the robust linear regression in statsmodels and I couldn't find a way to specify the "weights" of this regression, that is, to assign a weight to each observation as in weighted least squares, similar to what WLS does in statsmodels.
Or is there a way to get around it?
http://www.statsmodels.org/dev/rlm.html
RLM currently does not allow user specified weights. Weights are internally used to implement the reweighted least squares fitting method.
If the weights have the interpretation of variance weights to account for different variances across observations, then rescaling the data, both endog y and exog x, in analogy to WLS will produce the weighted parameter estimates.
WLS uses the following in its whiten method to rescale y and x:
X = np.asarray(X)
if X.ndim == 1:
    return X * np.sqrt(self.weights)
elif X.ndim == 2:
    return np.sqrt(self.weights)[:, None]*X
I'm not sure whether all extra results that are available will be appropriate for the rescaled model.
Edit: Follow-up based on comments
In WLS the equivalence W*( Y_est - Y )^2 = (sqrt(W)*Y_est - sqrt(W)*Y)^2 means that the parameter estimates are the same independent of the interpretation of weights.
In RLM we have a nonlinear objective function g((y - y_est) / sigma) for which this equivalence does not hold in general
fw * g((y - y_est) / sigma) != g((y - y_est) * sw / sigma )
where fw are frequency weights and sw are scale or variance weights and sigma is the estimated scale or standard deviation of the residual. (In general, we cannot find sw that would correspond to the fw.)
That means that in RLM we cannot use rescaling of the data to account for frequency weights.
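The difference between the two cases can be checked numerically with a small plain-Python sketch. It compares fw * g(r / sigma) with g(r * sw / sigma) for sw = sqrt(fw), once with a squared loss and once with Huber's function. The helper names, the residual and the weight are illustrative assumptions; 1.345 is a commonly used Huber tuning constant.

```python
import math


def squared_loss(r):
    """Least-squares objective for a single scaled residual."""
    return 0.5 * r * r


def huber_loss(r, t=1.345):
    """Huber's rho: quadratic near zero, linear in the tails."""
    a = abs(r)
    return 0.5 * r * r if a <= t else t * a - 0.5 * t * t


def weighted_vs_rescaled(loss, resid, fw, sigma=1.0):
    """Return (fw * loss(resid / sigma), loss(resid * sqrt(fw) / sigma))."""
    sw = math.sqrt(fw)
    return fw * loss(resid / sigma), loss(resid * sw / sigma)


if __name__ == "__main__":
    resid, fw = 3.0, 4.0  # illustrative residual and frequency weight
    print("squared:", weighted_vs_rescaled(squared_loss, resid, fw))
    print("huber:  ", weighted_vs_rescaled(huber_loss, resid, fw))
```

For the squared loss the two numbers coincide, which is the WLS equivalence; for the Huber loss they differ once the residual falls in the linear part of the function, which is why rescaling the data cannot stand in for frequency weights in RLM.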
Aside: Current development in statsmodels is adding different weight categories to GLM, to establish a pattern that can then be added to other models. The goal is to offer, similar to Stata, at least freq_weights, var_weights and prob_weights as options in the models.