What is the parameter Alpha in Ridge Regression? - python

Can someone give me an understandable explanation of the parameter alpha in sklearn's Ridge regression? How does it influence the model, etc.?
Examples would be helpful :)

Ridge regression minimizes the objective function:
||y - Xw||^2_2 + alpha * ||w||^2_2
This model solves a regression problem where the loss function is the linear least squares function and the regularization is given by the L2 norm. In simple words, alpha controls how hard ridge regression tries to prevent overfitting.
Say you have three parameters, W = [w1, w2, w3]. In an overfitting situation, the loss function can fit a model with W = [0.95, 0.001, 0.0004], which relies almost entirely on the first parameter. The term alpha * ||w||^2_2 increases the loss in such cases and keeps all parameters within some bounds, which prevents overfitting; with the regularizer, W could instead be, for instance, [0.5, 0.2, 0.33]. When you increase alpha, you push ridge regression to be more robust against overfitting, but you might get a larger training error.
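To get a feel for the effect, here is a minimal sketch (my own, not from the answer above) that fits sklearn's Ridge with increasing alpha on toy data; larger alpha shrinks the coefficient norm:
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.95, 0.001, 0.0004]) + rng.normal(scale=0.1, size=200)

for alpha in [0.01, 1.0, 100.0]:
    w = Ridge(alpha=alpha).fit(X, y).coef_
    print(alpha, np.round(w, 4), round(float(np.sum(w ** 2)), 4))  # ||w||^2 shrinks as alpha grows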

Related

Lasso Regression: The continuous heavy step function

From many documents, I have learned the recipe of Ridge regression, which is:
loss_Ridge = loss_function + lambda * (L2 norm of slope)
and the recipe of Lasso regression, which is:
loss_Lasso = loss_function + lambda * (L1 norm of slope)
When I read the topic "Implementing Lasso and Ridge Regression" in the "TensorFlow Machine Learning Cookbook", its author explained that:
"...we will use a continuous approximation to a step function, called
the continuous heavy step function..."
and the author also provided lines of code here.
I don't understand what is called 'the continuous heavy step function' in this context. Please help me.
From the link that you provided:
if regression_type == 'LASSO':
    # Declare Lasso loss function
    # Lasso Loss = L2_Loss + heavyside_step,
    # where heavyside_step ~ 0 if A < constant, otherwise ~ 99
    lasso_param = tf.constant(0.9)
    heavyside_step = tf.truediv(1., tf.add(1., tf.exp(tf.multiply(-50., tf.subtract(A, lasso_param)))))
    regularization_param = tf.multiply(heavyside_step, 99.)
    loss = tf.add(tf.reduce_mean(tf.square(y_target - model_output)), regularization_param)
This heavyside_step function is very close to a logistic function, which in turn can serve as a continuous approximation to a step function.
You use a continuous approximation because the loss function needs to be differentiable with respect to the parameters of your model.
To get an intuition, read the constrained formulation in section 1.6 of https://www.cs.ubc.ca/~schmidtm/Documents/2005_Notes_Lasso.pdf
You can see that in your code, if A < 0.9, then regularization_param vanishes, so the optimization will constrain A to that range.
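A small numpy sketch (mine, not from the book) that compares the hard step with the logistic approximation used above, for lasso_param = 0.9:
import numpy as np

A = np.linspace(0.5, 1.3, 9)
hard_step = np.where(A < 0.9, 0.0, 1.0)
soft_step = 1.0 / (1.0 + np.exp(-50.0 * (A - 0.9)))  # same form as heavyside_step

for a, h, s in zip(A, hard_step, soft_step):
    print(f"A={a:.1f}  hard={h:.0f}  soft={s:.4f}")
# the soft version is ~0 below 0.9 and ~1 above, but smooth and differentiable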
If you want to perform feature selection with Lasso regression, here is one example:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

estimator = Lasso()
featureSelection = SelectFromModel(estimator)
featureSelection.fit(features_vector, target)  # features_vector and target are your data
selectedFeatures = featureSelection.transform(features_vector)  # keeps only features with non-zero Lasso coefficients
print(selectedFeatures)

Logistic regression coefficient meaning

I'm trying to write my own logistic regressor (using batch/mini-batch gradient descent) for practice purposes.
I generated a random dataset (see below) with normally distributed inputs, and a binary (0, 1) output. I used fixed coefficients for the inputs and was hoping to be able to reproduce them (see the code snippet below). However, to my surprise, neither my own code nor sklearn's LogisticRegression was able to reproduce the actual numbers (although the sign and order of magnitude are in line). What's more, the coefficients my algorithm produced differ from the ones produced by sklearn.
Am I misinterpreting what the coefficients of a logistic regression are?
I would appreciate any insight into this discrepancy.
Thank you!
Edit: I tried using statsmodels' Logit and got yet a third set of slightly different values for the coefficients.
Some more info that might be relevant:
I wrote a linear regressor using almost identical code and it worked perfectly, so I am fairly confident the problem is not in the code. Also, my regressor actually outperformed the sklearn one on the training set, and they have exactly the same accuracy on the test set, so I have no reason to believe the regressors are wrong.
Code snippets for the generation of the dataset:
o1 = 2
o2 = -3
x[:,1]=np.random.rand(size)*2
x[:,2]=np.random.rand(size)*3
y = np.vectorize(sigmoid)(x[:,1]*o1+x[:,2]*o2 + np.random.normal(size=size))
so, as can be seen, the input coefficients are +2 and -3 (intercept 0);
sklearn's coefficients were ~2.8 and ~-4.8;
my coefficients were ~1.7 and ~-2.6
and of the regressor (the most relevant parts of it):
for j in range(bin_size):
    xs = x[i]
    y_real = y[i]
    z = np.dot(self.coeff, xs)
    h = sigmoid(z)
    dc += (h - y_real) * xs
self.coeff -= dc * (learning_rate / n)
What was the learned intercept? It really should not be a surprise, as your y is a function of three terms (the two inputs plus the noise), while your model has only two coefficients; three coefficients plus a y-intercept would be needed to model the response variable from the predictors.
Furthermore, the values may differ simply because of the stochasticity of SGD, for example.
Not really sure, but the coefficients could be different and still return the correct y for a finite set of points. What are the metrics on each model? Do they differ?
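One way to probe this (my own sketch, not from the answers above): the snippet in the question keeps the sigmoid output as a continuous y; if instead the labels are drawn as Bernoulli samples of that probability (dropping the latent noise term for clarity), a near-unregularized logistic regression recovers the true coefficients:
import numpy as np
from sklearn.linear_model import LogisticRegression

size = 100_000
rng = np.random.default_rng(0)
x = np.column_stack([rng.random(size) * 2, rng.random(size) * 3])
p = 1.0 / (1.0 + np.exp(-(x[:, 0] * 2 + x[:, 1] * -3)))  # true coefficients +2, -3
y = (rng.random(size) < p).astype(int)  # sample binary labels from p

clf = LogisticRegression(C=1e6).fit(x, y)  # large C => almost no regularization
print(clf.coef_, clf.intercept_)  # approaches [2, -3] and 0 as size grows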

Incorporate side conditions into Keras neural network

I want to train my neural network (in Keras) with an additional condition on the output elements.
An example:
Minimize my loss function MSE between network output y_pred and y_true.
Additionally, ensure that the norm of y_pred is less than or equal to 1.
Without the condition, the task is straightforward.
Note: The condition is not necessarily the vector norm of y_pred.
How can I implement the additional condition/restriction in a Keras (or maybe Tensorflow) model?
In principle, TensorFlow (and Keras) don't allow you to add hard constraints to your model.
You have to convert your invariant (norm <= 1) into a penalty function that is added to the loss. This could look like this:
y_norm = tf.norm(y_pred)
norm_loss = tf.where(y_norm > 1., y_norm, 0.)
total_loss = mse + norm_loss
Look at the docs of tf.where. If your prediction has a norm bigger than one, backpropagation tries to minimize the norm. If the norm is less than or equal to one, this part of the loss is simply 0 and no gradient is produced.
But this can be very hard to optimize: your predictions could oscillate around a norm of 1. It is also possible to add a weighting factor, e.g. total_loss = mse + 1000 * norm_loss. Be very careful with this, as it can make optimization even harder.
In the example above, the norm above one contributes linearly to the loss, which is called l1-regularization. You could also square it, which would become l2-regularization.
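Put together as a Keras-compatible loss, the penalty approach could look like the sketch below (the function name and the weight value are my own assumptions; it penalises only the excess above one, a slight variant of the tf.where version that keeps the loss continuous at the boundary):
import tensorflow as tf

def mse_with_norm_penalty(weight=10.0):
    # MSE plus a penalty on the part of ||y_pred|| that exceeds 1
    def loss(y_true, y_pred):
        mse = tf.reduce_mean(tf.square(y_true - y_pred))
        y_norm = tf.norm(y_pred, axis=-1)
        excess = tf.nn.relu(y_norm - 1.0)  # zero whenever the norm is <= 1
        return mse + weight * tf.reduce_mean(excess)
    return loss

# model.compile(optimizer="adam", loss=mse_with_norm_penalty(10.0))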
In your specific case, you could get creative. Why not normalize both your predictions and the targets to norm one (just a suggestion, it might be a bad idea)?
loss = mse(y_pred / tf.norm(y_pred), y_target / np.linalg.norm(y_target))

How to use F-score as error function to train neural networks?

I am pretty new to neural networks. I am training a network in TensorFlow, but the number of positive examples is much smaller than the number of negative examples in my dataset (it is a medical dataset).
So, I know that the F-score, calculated from precision and recall, is a good measure of how well the model is trained.
I have used error functions like cross-entropy loss or MSE before, but they are all based on accuracy calculations (if I am not wrong). But how do I use this F-score as an error function? Is there a TensorFlow function for that, or do I have to create a new one?
Thanks in advance.
It appears that approaches for optimising directly for these types of metrics have been devised and used successfully, improving scores and/or training times:
https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/77289
https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/70328
https://www.kaggle.com/rejpalcz/best-loss-function-for-f1-score-metric
One such method involves using sums of probabilities, in place of counts, for the sets of true positives, false positives, and false negatives. For example, the F-beta loss (the generalisation of F1) can be calculated with PyTorch in Python as follows:
def forward(self, y_logits, y_true):
    y_pred = self.sigmoid(y_logits)
    TP = (y_pred * y_true).sum(dim=1)        # soft true positives
    FP = (y_pred * (1 - y_true)).sum(dim=1)  # soft false positives
    FN = ((1 - y_pred) * y_true).sum(dim=1)  # soft false negatives
    fbeta = (1 + self.beta**2) * TP / ((1 + self.beta**2) * TP + (self.beta**2) * FN + FP + self.epsilon)
    fbeta = fbeta.clamp(min=self.epsilon, max=1 - self.epsilon)
    return 1 - fbeta.mean()
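To make the snippet runnable end to end, here is one way it might be wrapped in an nn.Module (the class name and default hyperparameters are my assumptions):
import torch
import torch.nn as nn

class FBetaLoss(nn.Module):
    def __init__(self, beta=1.0, epsilon=1e-7):
        super().__init__()
        self.beta = beta
        self.epsilon = epsilon
        self.sigmoid = nn.Sigmoid()

    def forward(self, y_logits, y_true):
        y_pred = self.sigmoid(y_logits)
        TP = (y_pred * y_true).sum(dim=1)
        FP = (y_pred * (1 - y_true)).sum(dim=1)
        FN = ((1 - y_pred) * y_true).sum(dim=1)
        b2 = self.beta ** 2
        fbeta = (1 + b2) * TP / ((1 + b2) * TP + b2 * FN + FP + self.epsilon)
        fbeta = fbeta.clamp(min=self.epsilon, max=1 - self.epsilon)
        return 1 - fbeta.mean()

logits = torch.randn(8, 10, requires_grad=True)  # batch of 8 examples, 10 labels each
targets = torch.randint(0, 2, (8, 10)).float()
loss = FBetaLoss(beta=1.0)(logits, targets)
loss.backward()  # differentiable, unlike a thresholded F-score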
An alternative method is described in this paper:
https://arxiv.org/abs/1608.04802
The approach taken optimises a lower bound on the statistic. Other metrics such as AUROC and AUCPR are also discussed. A TF implementation of such an approach can be found here:
https://github.com/tensorflow/models/tree/master/research/global_objectives
I think you are confusing model evaluation metrics for classification with training losses.
Accuracy, precision, F-scores etc. are evaluation metrics computed from binary outcomes and binary predictions.
For model training, you need a function that compares a continuous score (your model's output) with a binary outcome, such as cross-entropy. Ideally, this function is calibrated so that it is minimised when the predicted mean matches the population mean (given covariates). Such functions are called proper scoring rules, and cross-entropy is one of them.
Also check the thread is-accuracy-an-improper-scoring-rule-in-a-binary-classification-setting
If you want to weight positive and negative cases differently, two methods are:
oversample the minority class and correct the predicted probabilities when predicting on new examples. For fancier methods, check the under-sampling module of imbalanced-learn to get an overview.
use a different proper scoring rule for the training loss. This allows you, for example, to build in asymmetry in how you treat positive and negative cases while preserving calibration. Here is a review of the subject.
In practice, I recommend just using simple oversampling, as in the sketch below.
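A minimal oversampling sketch using imbalanced-learn (the data is synthetic, just to show the mechanics):
import numpy as np
from imblearn.over_sampling import RandomOverSampler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.05).astype(int)  # roughly 5% positives

ros = RandomOverSampler(random_state=0)
X_res, y_res = ros.fit_resample(X, y)
print(np.bincount(y), np.bincount(y_res))  # the resampled classes are balanced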
Loss value and accuracy are different concepts. The loss value is used for training the NN, whereas accuracy and other metrics are used to evaluate the training result.

Logistic regression using python

I want to implement logistic regression from scratch in Python. These are the functions in it:
sigmoid
cost
fminunc
Evaluating Logistic regression
I would like to know what would be a good way to start this from scratch in Python. Any guidance on how to approach it would be welcome. I know the theory behind those functions, but I am looking for a more pythonic answer.
I used Octave and got it all right, but I don't know how to start in Python, as Octave already has those packages set up to do the work.
You may want to translate your Octave code to Python and see what's going on. You can also use a Python package to do this for you; check out scikit-learn's logistic regression. There is also an easy example in this blog.
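As a rough Python counterpart to the Octave workflow (sigmoid, cost, fminunc), here is a sketch using scipy.optimize.minimize in place of fminunc; the toy data and all names are my own:
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # negative log-likelihood (cross-entropy), the usual logistic cost
    h = sigmoid(X @ theta)
    eps = 1e-12  # avoids log(0)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

# toy data: 1000 samples, a bias column plus 2 features
rng = np.random.default_rng(0)
X = np.hstack([np.ones((1000, 1)), rng.normal(size=(1000, 2))])
y = (rng.random(1000) < sigmoid(X @ np.array([0.0, 2.0, -3.0]))).astype(float)

res = minimize(cost, x0=np.zeros(3), args=(X, y), method='BFGS')
print(res.x)  # fitted coefficients, roughly [0, 2, -3]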
In order to implement logistic regression, you may consider the following two approaches:
Consider how linear regression works: apply the sigmoid function to the hypothesis of linear regression and run gradient descent until convergence. Alternatively, apply the exponential-based softmax function to rule out classes with a lower probability of occurrence.
import numpy as np

def logistic_regression(x, y, alpha=0.05, lamda=0):
    '''
    Logistic regression via batch gradient descent,
    with optional L2 regularization controlled by lamda.
    '''
    m, n = np.shape(x)
    theta = np.ones(n)
    xTrans = x.transpose()
    oldcost = 0.0
    value = True
    while value:
        hypothesis = np.dot(x, theta)
        logistic = 1.0 / (np.exp(-hypothesis) + 1.0)  # sigmoid of the linear score
        reg = (lamda / (2 * m)) * np.sum(np.power(theta, 2))  # L2 penalty term
        loss = logistic - y
        cost = np.sum(loss ** 2)
        # average gradient per example
        gradient = np.dot(xTrans, loss) / m
        # update, adding the gradient of the L2 penalty when lamda > 0
        if lamda:
            cost = cost + reg
            theta = theta - alpha * (gradient + (lamda / m) * theta)
        else:
            theta = theta - alpha * gradient
        # stop once the cost no longer changes between iterations
        if oldcost == cost:
            value = False
        else:
            oldcost = cost
    print(accuracy(theta, m, y, x))  # accuracy() is assumed to be defined elsewhere
    return theta, accuracy(theta, m, y, x)
