I was looking at the robust linear regression in statsmodels and I couldn't find a way to specify the "weights" of this regression, i.e. to assign a weight to each observation as in weighted least squares regression, similar to what WLS does in statsmodels.
Or is there a way to get around it?
http://www.statsmodels.org/dev/rlm.html
RLM currently does not allow user specified weights. Weights are internally used to implement the reweighted least squares fitting method.
If the weights have the interpretation of variance weights to account for different variances across observations, then rescaling the data, both endog y and exog x, in analogy to WLS will produce the weighted parameter estimates.
WLS uses this in its whiten method to rescale y and x:
X = np.asarray(X)
if X.ndim == 1:
    return X * np.sqrt(self.weights)
elif X.ndim == 2:
    return np.sqrt(self.weights)[:, None]*X
I'm not sure whether all extra results that are available will be appropriate for the rescaled model.
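As a rough illustration of the rescaling idea, here is a minimal NumPy-only sketch. The data and the weights `w` are made up for the example. It whitens y and X by sqrt(w), as the whiten method above does, and checks that ordinary least squares on the whitened data agrees with the weighted normal equations. The whitened arrays `y_w` and `X_w` are what would be handed to RLM in place of the original y and X; note that the constant column is rescaled along with the other regressors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one regressor
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)                      # hypothetical variance weights

# Whiten as WLS does: multiply y and every column of X by sqrt(w).
sw = np.sqrt(w)
y_w = y * sw
X_w = X * sw[:, None]

# OLS on the whitened data ...
beta_whitened, *_ = np.linalg.lstsq(X_w, y_w, rcond=None)
# ... solves the same problem as the weighted normal equations (X' W X) b = X' W y.
beta_weighted = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

print(beta_whitened, beta_weighted)
```

This only demonstrates the least squares case; whether the extra RLM results remain meaningful on rescaled data is a separate question.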
Edit: Follow-up based on comments
In WLS the equivalence W*( Y_est - Y )^2 = (sqrt(W)*Y_est - sqrt(W)*Y)^2 means that the parameter estimates are the same independent of the interpretation of weights.
In RLM we have a nonlinear objective function g((y - y_est) / sigma) for which this equivalence does not hold in general
fw * g((y - y_est) / sigma) != g((y - y_est) * sw / sigma )
where fw are frequency weights and sw are scale or variance weights and sigma is the estimated scale or standard deviation of the residual. (In general, we cannot find sw that would correspond to the fw.)
That means that in RLM we cannot use rescaling of the data to account for frequency weights.
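A small numerical sketch of why the equivalence breaks, assuming Huber's objective with the common tuning constant 1.345 as g (the function name `huber_rho` and the residual values are chosen only for illustration). For a small residual, where g is quadratic, rescaling by sqrt(fw) reproduces the frequency weight; for a large residual, where g is linear, the same rescaling does not.

```python
import numpy as np

def huber_rho(z, t=1.345):
    """Huber's objective: quadratic for |z| <= t, linear beyond."""
    z = np.abs(z)
    return np.where(z <= t, 0.5 * z**2, t * z - 0.5 * t**2)

fw = 2.0           # frequency weight: the observation counts twice
sw = np.sqrt(fw)   # the rescaling that works for a squared loss

small, large = 0.5, 10.0   # scaled residuals (y - y_est) / sigma

# Quadratic region: rescaling by sqrt(fw) reproduces the frequency weight.
match_small = bool(np.isclose(fw * huber_rho(small), huber_rho(small * sw)))
# Linear region: the same sw no longer reproduces it.
match_large = bool(np.isclose(fw * huber_rho(large), huber_rho(large * sw)))

print(match_small, match_large)
```

Since the rescaling factor that would be needed depends on the size of the residual, it cannot be fixed in advance by transforming the data.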
Aside: The current development in statsmodels is to add different weight categories to GLM, in order to develop a pattern that can then be added to other models. The target is to offer, similar to Stata, at least freq_weights, var_weights and prob_weights as options in the models.
PyMC3 has excellent functionality for dealing with Bayesian regressions, so I've been trying to leverage that to run a Bayesian Gamma Regression using PyMC3 where the likelihood would be Gamma.
From what I understand, running any sort of Bayesian Regression in PyMC3 requires the pymc3.glm.GLM() function, which takes in a model formula in Patsy form (e.g. y ~ x_1 + x_2 + ... + x_m), the dataframe, and a distribution.
However, the issue is that the pymc3.glm.GLM() function requires a pymc3.glm.families object (https://github.com/pymc-devs/pymc3/blob/master/pymc3/glm/families.py) for the distribution. But the Gamma distribution doesn't show up as one of the families built into the package, so I'm stuck. Or is the Gamma family hidden somewhere? Would appreciate any help in this matter!
For context:
I have a dataframe of features [x_1, x_2, ..., x_m] (call it X) and a target variable (call it y). This is the code I have prepared so far, but just need to figure out how to get the Gamma distribution in as my likelihood.
import pymc3 as pm
# Combine X and y into a single dataframe
patsy_DF = X
patsy_DF['y'] = y
# Get Patsy Formula
all_columns = "+".join(X.columns)
patsy_formula = "y~" + all_columns
# Instantiate model
model = pm.Model()
# Construct Model
with model:
    # Fit Bayesian Gamma Regression
    pm.glm.GLM(patsy_formula, df_dummied, family=pm.families.Gamma())
    # !!! ... but pm.families.Gamma() doesn't exist ... !!!
    # Get MAP Estimate and Trace
    map_estimate = pm.find_MAP(model=model)
    trace = pm.sample(draws=2000, chains=3, start = map_estimate)
# Get regression results summary (coefficient estimates,
pm.summary(trace).round(3)
I'm trying to write my own logistic regressor (using batch/mini-batch gradient descent) for practice purposes.
I generated a random dataset (see below) with normally distributed inputs, and the output is binary (0,1). I manually chose coefficients for the inputs and was hoping to be able to reproduce them (see below for the code snippet). However, to my surprise, neither my own code nor sklearn's LogisticRegression was able to reproduce the actual numbers (although the sign and order of magnitude are in line). Moreover, the coefficients my algorithm produced are different from the ones produced by sklearn.
Am I misinterpreting what the coefficients for a logistic regression are?
I will appreciate any insight into this discrepancy.
Thank you!
edit: I tried using statsmodels Logit and got yet a third set of slightly different values for the coefficients
Some more info that might be relevant:
I wrote a linear regressor using an almost identical code and it worked perfectly, so I am fairly confident this is not a problem in the code. Also my regressor actually outperformed the sklearn one on the training set, and they have the exact same accuracy on the test set, so I have no reason to believe the regressors are wrong.
Code snippets for the generation of the dataset:
o1 = 2
o2 = -3
x[:,1]=np.random.rand(size)*2
x[:,2]=np.random.rand(size)*3
y = np.vectorize(sigmoid)(x[:,1]*o1+x[:,2]*o2 + np.random.normal(size=size))
so as can be seen, input coefficients are +2 and -3 (intercept 0);
sklearn coefficients were ~2.8 and ~-4.8;
my coefficients were ~1.7 and ~-2.6
and of the regressor (the most relevant parts of it):
for j in range(bin_size):
    xs = x[i]
    y_real = y[i]
    z = np.dot(self.coeff,xs)
    h = sigmoid(z)
    dc+= (h-y_real)*xs
self.coeff-= dc * (learning_rate/n)
What intercept was learned? The discrepancy should not really be a surprise: your y is a polynomial of 3rd degree, while your model has only two coefficients; three plus a y-intercept would be needed to model the response variable from the predictors.
Furthermore, values may be different due to SGD for example.
Not really sure, but the coefficients could be different and return correct y for finite set of points. What are the metrics on each model? Do those differ?
There are standard ways of predicting proportions such as logistic regression (without thresholding) and beta regression. There have already been discussions about this:
http://scikit-learn-general.narkive.com/4dSCktaM/using-logistic-regression-on-a-continuous-target-variable
http://scikit-learn-general.narkive.com/lLVQGzyl/beta-regression
I cannot tell if there exists a work-around within the sklearn framework.
There exists a workaround, but it is not intrinsically within the sklearn framework.
If you have a proportional target variable (value range 0-1) you run into two basic difficulties with scikit-learn:
Classifiers (such as logistic regression) deal with class labels as target variables only. As a workaround you could simply threshold your probabilities to 0/1 and interpret them as class labels, but you would lose a lot of information.
Regression models (such as linear regression) do not restrict the target variable. You can train them on proportional data, but there is no guarantee that the output on unseen data will be restricted to the 0/1 range. However, in this situation, there is a powerful work-around (below).
There are different ways to mathematically formulate logistic regression. One of them is the generalized linear model, which basically defines the logistic regression as a normal linear regression on logit-transformed probabilities. Normally, this approach requires sophisticated mathematical optimization because the probabilities are unknown and need to be estimated along with the regression coefficients.
In your case, however, the probabilities are known. This means you can simply transform them with y = log(p / (1 - p)). Now they cover the full range from -∞ to +∞ and can serve as the target variable for a LinearRegression model [*]. Of course, the model output then needs to be transformed back to obtain probabilities: p = 1 / (exp(-y) + 1).
import numpy as np
from sklearn.linear_model import LinearRegression

class LogitRegression(LinearRegression):

    def fit(self, x, p):
        p = np.asarray(p)
        y = np.log(p / (1 - p))
        return super().fit(x, y)

    def predict(self, x):
        y = super().predict(x)
        return 1 / (np.exp(-y) + 1)

if __name__ == '__main__':
    # generate example data
    np.random.seed(42)
    n = 100
    x = np.random.randn(n).reshape(-1, 1)
    noise = 0.1 * np.random.randn(n).reshape(-1, 1)
    p = np.tanh(x + noise) / 2 + 0.5

    model = LogitRegression()
    model.fit(x, p)

    print(model.predict([[-10], [0.0], [1]]))
    # [[ 2.06115362e-09]
    # [ 5.00000000e-01]
    # [ 8.80797078e-01]]
There are also numerous other alternatives. Some non-linear regression models can work naturally in the 0-1 range. For example, Random Forest Regressors will never exceed the range of the target variable they were trained with. Simply put probabilities in and you will get probabilities out. Neural networks with an appropriate output activation function (a sigmoid, for example, which maps to the 0-1 range) will also work well with probabilities, but if you want to use those there are more specialized libraries than sklearn.
[*] You could in fact plug in any linear regression model which can make the method more powerful, but then it no longer is exactly equivalent to logistic regression.
I am trying to implement a solution to Ridge regression in Python using Stochastic gradient descent as the solver. My code for SGD is as follows:
def fit(self, X, Y):
    # Convert to data frame in case X is numpy matrix
    X = pd.DataFrame(X)
    # Define a function to calculate the error given a weight vector beta and a training example xi, yi
    # Prepend a column of 1s to the data for the intercept
    X.insert(0, 'intercept', np.array([1.0]*X.shape[0]))
    # Find dimensions of train
    m, d = X.shape
    # Initialize weights to random
    beta = self.initializeRandomWeights(d)
    beta_prev = None
    epochs = 0
    prev_error = None
    while (beta_prev is None or epochs < self.nb_epochs):
        print("## Epoch: " + str(epochs))
        indices = range(0, m)
        shuffle(indices)
        for i in indices: # Pick a training example from a randomly shuffled set
            beta_prev = beta
            xi = X.iloc[i]
            errori = sum(beta*xi) - Y[i] # Error[i] = sum(beta*x) - y = error of ith training example
            gradient_vector = xi*errori + self.l*beta_prev
            beta = beta_prev - self.alpha*gradient_vector
        epochs += 1
The data I'm testing this on is not normalized and my implementation always ends up with all the weights being Infinity, even though I initialize the weights vector to low values. Only when I set the learning rate alpha to a very small value ~1e-8, the algorithm ends up with valid values of the weights vector.
My understanding is that normalizing/scaling input features only helps reduce convergence time. But the algorithm should not fail to converge as a whole if the features are not normalized. Is my understanding correct?
You can check from scikit-learn's Stochastic Gradient Descent documentation that one of the disadvantages of the algorithm is that it is sensitive to feature scaling. In general, gradient based optimization algorithms converge faster on normalized data.
Also, normalization is advantageous for regression methods.
The updates to the coefficients during each step will depend on the ranges of each feature. Also, the regularization term will be affected heavily by large feature values.
SGD may converge without data normalization, but that depends on the data at hand. Therefore, your assumption is not correct.
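To make the sensitivity to feature scaling concrete, here is a minimal sketch on synthetic data. The function `sgd_ridge`, the data, and the hyper-parameters are all invented for the illustration; the update rule mirrors the one in the question, including the penalty on the intercept. With the same learning rate, the run on raw features blows up to non-finite weights, while the run on standardized features stays finite.

```python
import numpy as np

def sgd_ridge(X, y, alpha, lam, epochs, seed=0):
    """Plain SGD for ridge regression, mirroring the update in the question."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
    beta = np.zeros(X.shape[1])
    with np.errstate(all="ignore"):             # silence overflow warnings on divergence
        for _ in range(epochs):
            for i in rng.permutation(len(X)):   # shuffled pass over the data
                err = X[i] @ beta - y[i]
                beta = beta - alpha * (X[i] * err + lam * beta)
    return beta

rng = np.random.default_rng(42)
n = 200
# Two features on very different scales (roughly 0-1 and 500-3000).
X = np.column_stack([rng.uniform(0, 1, n), rng.uniform(500, 3000, n)])
y = 3.0 + 2.0 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(scale=0.1, size=n)

beta_raw = sgd_ridge(X, y, alpha=0.01, lam=0.1, epochs=5)

X_std = (X - X.mean(axis=0)) / X.std(axis=0)    # zero mean, unit variance per column
beta_std = sgd_ridge(X_std, y, alpha=0.01, lam=0.1, epochs=5)

print("raw features finite:", np.isfinite(beta_raw).all())
print("standardized features finite:", np.isfinite(beta_std).all())
```

On the raw data a much smaller learning rate would be needed to keep the updates bounded, which matches the behaviour described in the question.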
Your assumption is not correct.
It's hard to answer this, because there are so many different methods/environments, but I will try to mention some points.
Normalization
When a method is not scale-invariant (I think no linear regression is), you really should normalize your data
I take it that you are just ignoring this because of debugging / analyzing
Normalizing your data is not only relevant for convergence time; the results will differ too (think about the effect within the loss function: large values can contribute much more to the loss than small ones)!
Convergence
There is probably much to tell about convergence of many methods on normalized/non-normalized data, but your case is special:
SGD's convergence theory only guarantees convergence to some local minimum (= global minimum in your convex optimization problem) for some choices of hyper-parameters (learning rate and learning schedule/decay)
Even optimizing normalized data can fail with SGD when those params are bad!
This is one of the most important downsides of SGD; dependence on hyper-parameters
As SGD is based on gradients and step-sizes, non-normalized data has a possibly huge effect on not achieving this convergence!
For SGD to converge in linear regression, the step size should be smaller than 2/s, where s is the largest singular value of the matrix (see the "Convergence and stability in the mean" section in https://en.m.wikipedia.org/wiki/Least_mean_squares_filter). In the case of ridge regression it should be less than 2*(1+p/s^2)/s, where p is the ridge penalty.
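A quick numerical check of this kind of threshold, as a sketch under stated assumptions: it uses full-batch gradient descent as a deterministic stand-in for SGD, and it normalizes the loss as (1/2m)||Xb - y||^2, so the threshold shows up as 2/lam_max with lam_max the largest eigenvalue of X'X/m. With a different normalization the constant changes accordingly. The data and the helper `gradient_descent` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
X = np.column_stack([np.ones(m), rng.normal(scale=5.0, size=m)])
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=m)

H = X.T @ X / m                         # Hessian of the loss (1/2m)||Xb - y||^2
lam_max = np.linalg.eigvalsh(H).max()   # equals (largest singular value of X)^2 / m
beta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)

def gradient_descent(step, iters=500):
    """Full-batch gradient descent on the least squares loss, started at zero."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        beta = beta - step * (X.T @ (X @ beta - y)) / m
    return beta

# Just below the threshold the iterates approach the least squares solution,
# just above it they move away from it.
err_stable = np.linalg.norm(gradient_descent(1.9 / lam_max) - beta_exact)
err_unstable = np.linalg.norm(gradient_descent(2.1 / lam_max) - beta_exact)

print(err_stable, err_unstable)
```

Because lam_max grows with the square of the feature scale, unnormalized features with large values push the admissible step size toward zero.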
Normalizing the rows of the matrix (or the gradients) changes the loss function to give each sample an equal weight, and it changes the singular values of the matrix such that you can choose a step size near 1 (see the NLMS section in https://en.m.wikipedia.org/wiki/Least_mean_squares_filter). Depending on your data, it might require smaller step sizes or allow for larger step sizes. It all depends on whether the normalization increases or decreases the largest singular value of the matrix.
Note that when deciding whether or not to normalize the rows, you shouldn't just think about the convergence rate (which is determined by the ratio between the largest and smallest singular values) or stability in the mean, but also about how it changes the loss function and whether the changed loss fits your needs. Sometimes it makes sense to normalize, but sometimes it doesn't (for example, when you want to give different importance to different samples, or when you think that a larger signal energy means a better SNR).
I want to implement logistic regression from scratch in Python. The following are the functions in it:
sigmoid
cost
fminunc
Evaluating Logistic regression
I would like to know what would be a good way to start this from scratch in Python. Any guidance on how to approach it would be welcome. I know the theory behind those functions, but I am looking for a more Pythonic answer.
I used Octave and got it all right, but I don't know how to start in Python, as Octave already has those packages set up to do the work.
You may want to try to translate your octave code to python and see what's going on. You can also use the python package to do this for you. Check out scikit-learn on logistic regression. There is also an easy example in this blog.
In order to implement logistic regression, you may consider the following 2 approaches:
Consider how linear regression works. Apply the sigmoid function to the hypothesis of linear regression and run gradient descent until convergence. OR apply the exponential-based softmax function to rule out outcomes with a lower probability of occurrence.
import numpy as np

def logistic_regression(x, y, alpha=0.05, lamda=0):
    '''
    Logistic regression for datasets
    '''
    m, n = np.shape(x)
    theta = np.ones(n)
    xTrans = x.transpose()
    oldcost = 0.0
    value = True
    while value:
        hypothesis = np.dot(x, theta)
        # sigmoid of the linear hypothesis: 1 / (1 + exp(-z))
        logistic = 1 / (np.exp(-hypothesis) + 1)
        # L2 penalty; the parentheses make this lamda / (2 * m)
        reg = (lamda / (2 * m)) * np.sum(np.power(theta, 2))
        loss = logistic - y
        cost = np.sum(loss ** 2)
        #print(cost)
        # avg cost per example (the 2 in 2*m doesn't really matter here.
        # But to be consistent with the gradient, I include it)
        # avg gradient per example
        gradient = np.dot(xTrans, loss) / m
        # update
        if reg:
            cost = cost + reg
            # the gradient of the L2 penalty is (lamda / m) * theta
            theta = theta - alpha * (gradient + (lamda / m) * theta)
        else:
            theta = theta - (alpha / m) * gradient
        # stop once the cost no longer changes between iterations
        if oldcost == cost:
            value = False
        else:
            oldcost = cost
    # accuracy() is a helper that is not shown in this answer
    print(accuracy(theta, m, y, x))
    return theta, accuracy(theta, m, y, x)
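For comparison, here is a compact sketch that follows the four pieces listed in the question (sigmoid, cost, a minimizer, evaluation). It is only one possible layout, not the code from the answer above: it assumes the cross-entropy cost rather than a squared loss, replaces fminunc with a fixed number of plain gradient descent steps, and uses synthetic data with made-up coefficients. All function names are chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Mean cross-entropy (log loss) of the logistic model."""
    h = sigmoid(X @ theta)
    eps = 1e-12                                  # guards against log(0)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

def gradient(theta, X, y):
    """Gradient of the mean cross-entropy with respect to theta."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def fit(X, y, alpha=0.1, iters=5000):
    """Fixed-step gradient descent, standing in for Octave's fminunc."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta = theta - alpha * gradient(theta, X, y)
    return theta

def accuracy(theta, X, y):
    """Share of examples whose thresholded prediction matches the label."""
    return np.mean((sigmoid(X @ theta) >= 0.5) == y)

# Synthetic binary data: intercept column plus two standard normal features.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_theta = np.array([0.5, 2.0, -3.0])
y = (rng.uniform(size=n) < sigmoid(X @ true_theta)).astype(float)

theta = fit(X, y)
print(theta, cost(theta, X, y), accuracy(theta, X, y))
```

If SciPy is available, `scipy.optimize.minimize` can take `cost` and `gradient` (via its `jac` argument) and plays a role similar to fminunc.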