I have a set of independent data points X, and a set of dependent points Y, and I would like to find a model of the form:
(a0 + a1*x1 + a2*x2 + ... + am*xm) * (a(m+1)*x(m+1) + a(m+2)*x(m+2))
I know I can use scipy's curve_fit, but to avoid overfitting, I want to use Lasso for the linear part (i.e. the part in the first set of parentheses).
Is there a simple way of doing that in Python?
If you multiply out your brackets you get 2m+2 product terms, and by performing a change of variables (treating each product as a new feature) you can make this a linear regression problem again, so you can fit a lasso regressor to the whole lot.
See this link for more details:
http://scikit-learn.org/stable/modules/linear_model.html#polynomial-regression
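For illustration, here is a minimal sketch of that change of variables, using made-up data and arbitrary coefficients (nothing here comes from your actual problem): each product of a first-bracket term with one of the two second-bracket features becomes a column of a new design matrix, and a Lasso is fit on those 2m+2 columns.

import numpy as np
from sklearn.linear_model import Lasso

# Made-up example data: m features for the first bracket, 2 for the second.
rng = np.random.default_rng(0)
n, m = 200, 5
X = rng.normal(size=(n, m + 2))
first = np.hstack([np.ones((n, 1)), X[:, :m]])   # [1, x1, ..., xm]
second = X[:, m:]                                # [x_(m+1), x_(m+2)]
y = (first @ rng.normal(size=m + 1)) * (second @ rng.normal(size=2))

# Change of variables: every product of a first-bracket term with a
# second-bracket term is a new feature, giving 2*(m+1) = 2m+2 columns.
Z = np.hstack([first * second[:, [0]], first * second[:, [1]]])

lasso = Lasso(alpha=0.01, fit_intercept=False, max_iter=10000)
lasso.fit(Z, y)
print(lasso.coef_)   # each coefficient estimates a product a_i * a_j

Note that the Lasso acts on the expanded coefficients (the products a_i * a_j), so the factorized structure of your original model is not enforced; the penalty only sparsifies the expanded form.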
Problem statement: I'm working with a linear system of equations that corresponds to an ill-posed inverse problem. I can apply Tikhonov regularization or ridge regression by hand in Python and get solutions on test data that are sufficiently accurate for my problem. I'd like to try solving this problem with sklearn.linear_model.Ridge, because I'd also like to try other machine-learning methods in the linear models part of that package (https://scikit-learn.org/stable/modules/linear_model.html). I'd like to know whether sklearn is the wrong tool in this context.
What I've done: I read the documentation for sklearn.linear_model.Ridge. Since I know the linear transformation corresponding to the forward problem, I ran it over impulse responses to create training data and supplied that to sklearn.linear_model.Ridge to generate a model. Unlike when I apply the ridge regression equation myself in Python, the model from sklearn.linear_model.Ridge only works on impulse responses. On the other hand, applying ridge regression with the equations myself generates a model that can be applied to any linear combination of the impulse responses.
Is there a way to apply the linear methods of sklearn without needing to generate a large training data set that represents the entire parameter space of the problem, or is this a prerequisite for using (even linear) machine learning algorithms?
Should sklearn.linear_model.Ridge return the same results as solving the ridge regression equation, when the sklearn method is applied to test cases that span the forward problem?
Many thanks to anyone who can help my understanding.
Found the answer through trial and error. Answering my own question in case anyone was thinking like I did and needs clarity.
Yes, if you use training data that spans the problem space, it is the same as running ridge regression in Python using the equations. sklearn does what it says in the documentation.
You need to use fit_intercept=True to get sklearn.linear_model.Ridge to fit the Y intercept of your problem, otherwise it is assumed to be zero.
If you use the default, fit_intercept=False, and your problem does NOT have a Y-intercept of zero, you will, of course, get a bad solution.
This might give a novice like me the impression that you haven't supplied enough training data, which is incorrect.
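As a sanity check, here is a minimal sketch (with made-up data, not your actual problem) comparing sklearn's Ridge against the closed-form ridge/Tikhonov solution w = (X^T X + alpha*I)^-1 X^T y; with fit_intercept=False the two agree, and fit_intercept=True additionally estimates the Y intercept.

import numpy as np
from sklearn.linear_model import Ridge

# Made-up training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=50)

alpha = 1.0

# Closed-form ridge / Tikhonov solution (no intercept term).
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# sklearn with fit_intercept=False solves exactly the same problem.
w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_

print(np.allclose(w_closed, w_sklearn))   # True, up to numerical tolerance

# With fit_intercept=True the intercept is estimated as well,
# which is what you need when the Y-intercept is not zero.
ridge = Ridge(alpha=alpha, fit_intercept=True).fit(X, y)
print(ridge.intercept_, ridge.coef_)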
How can I use logistic regression in sklearn for a continuous but bounded (0 <= y <= 1) dependent variable? If it's not possible in sklearn, with what library can I do it?
It completely depends on the distribution of your problem.
These two pictures explain the difference between linear and logistic regression. There are also other regression types (e.g. polynomial regression); depending on your data points (shown here in red), you need to search for the right approach.
Here is the overview from scikit: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning
See the discussion here: https://scikit-learn-general.narkive.com/4dSCktaM/using-logistic-regression-on-a-continuous-target-variable
There are two suggestions:
Stop doing logistic regression on something that is not a binary target
Use statsmodels https://www.statsmodels.org (a sketch of that route follows below)
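If you go the statsmodels route, one common approach for a continuous response bounded in [0, 1] is a fractional logit, i.e. a GLM with a Binomial family and logit link. A minimal sketch with made-up data (none of this comes from your actual data):

import numpy as np
import statsmodels.api as sm

# Made-up continuous target in (0, 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.0 / (1.0 + np.exp(-(X @ np.array([0.5, -1.0, 2.0]))))

X_design = sm.add_constant(X)                        # add an intercept column
model = sm.GLM(y, X_design, family=sm.families.Binomial())
result = model.fit()
print(result.params)                                 # coefficients on the logit scale
print(result.predict(X_design)[:5])                  # predictions stay inside [0, 1]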
I am using Spark's linear regression (pyspark.ml.regression.LinearRegression) in Python. However, I would like to force the coefficients to be positive (non-negative) for every feature; is there any way I can accomplish that? I was looking in the documentation but could not find one. I understand I may not get the best solution, but I need the weights to be non-negative.
BTW, scikit-learn's Lasso has a parameter called positive for exactly this purpose:
from sklearn.linear_model import Lasso

# positive=True constrains every coefficient to be non-negative.
lin = Lasso(alpha=0.0001, precompute=True, max_iter=1000,
            positive=True, random_state=9999, selection='random')
lin.fit(X, y)
Thanks
I am looking to build a predictive model and am working with our current JMP model. Our current approach is to guess an nth-degree polynomial and then look at which terms are not significant model effects. Polynomials are not always the best fit, which leads to a lot of confusion and bad models. Our data can have between 2 and 7 effects and always has one response.
I want to use python for this, but package documentation or online guides for something like this are hard to find. I know how to fit a specific nth degree polynomial or do a linear regression in python, but not how to 'guess' the best function type for the data set.
Am I missing something obvious, or should I be writing something that probes through a variety of function types? Precision is the most important consideration. I am working with a small (~2000x100) data set.
Potentially I can do regression on smaller training sets, test them against the validation set, then rank the models and choose the best. Is there something better?
Try using other regression models instead of the vanilla Linear Model.
You can use something like this for polynomial regression:
from sklearn.preprocessing import PolynomialFeatures

# Expand the inputs into all polynomial terms up to the chosen degree.
poly = PolynomialFeatures(degree=2)
X_ = poly.fit_transform(input_data)
And you can constrain the weights through Lasso regression:
from sklearn import linear_model

clf = linear_model.Lasso(alpha=0.5, positive=True)
clf.fit(X_, Y_)
where Y_ is the output you want to train against.
Setting alpha to 0 turns it into ordinary linear regression; alpha is the strength of the L1 penalty on the coefficients, so larger values push more of the weights to zero. You can also constrain the weights to be non-negative with positive=True; check the Lasso documentation for details.
Run it with a small degree and perform cross-validation to check how well it fits.
Increasing the degree of the polynomial generally leads to over-fitting, so if you find yourself forced to use degree 4 or 5, you should probably look for other models.
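As a rough illustration of that cross-validation step (made-up data and parameter grids, only to show the shape of the code): put PolynomialFeatures and Lasso in a Pipeline and let GridSearchCV pick the degree and alpha.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Made-up data with a genuinely quadratic relationship.
rng = np.random.default_rng(0)
input_data = rng.normal(size=(200, 3))
Y_ = (input_data[:, 0] * input_data[:, 1] + input_data[:, 2] ** 2
      + rng.normal(scale=0.1, size=200))

pipe = Pipeline([
    ("poly", PolynomialFeatures()),
    ("lasso", Lasso(max_iter=10000)),
])
grid = GridSearchCV(
    pipe,
    param_grid={"poly__degree": [1, 2, 3], "lasso__alpha": [0.01, 0.1, 0.5]},
    cv=5,
)
grid.fit(input_data, Y_)
print(grid.best_params_)    # the degree/alpha combination that cross-validates best
print(grid.best_score_)     # mean R^2 over the folds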
You should also take a look at this question, which explains how you can curve fit.
ANOVA (analysis of variance) uses variance decomposition to determine which effects are statistically significant... you shouldn't have to choose terms at random.
However, if you are saying that your data is inhomogeneous (i.e., you shouldn't fit a single model to all the data), then you might consider using the scikit-learn toolkit to build a classifier that could choose a subset of the data to fit.
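If you want those significance tests outside JMP, one possible route (my suggestion, not something from the question) is statsmodels, which gives you an ANOVA table and per-coefficient p-values for an OLS fit:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Made-up data: y depends on x1 and x1^2 but not on x2.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 2.0 * df.x1 + 0.5 * df.x1 ** 2 + rng.normal(scale=0.1, size=200)

model = smf.ols("y ~ x1 + I(x1**2) + x2", data=df).fit()
print(anova_lm(model))      # F-test per term; insignificant terms can be dropped
print(model.summary())      # t-statistics and p-values for each coefficient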
Given time-series data, I want to find the best fitting logarithmic curve. What are good libraries for doing this in either Python or SQL?
Edit: Specifically, what I'm looking for is a library that can fit data resembling a sigmoid function, with upper and lower horizontal asymptotes.
If your data were categorical, then you could use a logistic regression to fit the probabilities of belonging to a class (classification).
However, I understand you are trying to fit the data to a sigmoid curve, which means you just want to minimize the mean squared error of the fit.
I would point you to the SciPy function scipy.optimize.leastsq, which performs least-squares fits.
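For example, here is a minimal sketch (with made-up data, and parameter names I've invented for illustration) of fitting a four-parameter sigmoid, with lower and upper asymptotes, a midpoint, and a slope, using scipy.optimize.leastsq:

import numpy as np
from scipy.optimize import leastsq

def sigmoid(p, x):
    # p = (bottom, top, x0, k): lower/upper asymptote, midpoint, slope
    bottom, top, x0, k = p
    return bottom + (top - bottom) / (1.0 + np.exp(-k * (x - x0)))

def residuals(p, x, y):
    return y - sigmoid(p, x)

# Made-up noisy sigmoid data.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = sigmoid((1.0, 5.0, 4.0, 1.5), x) + rng.normal(scale=0.05, size=x.size)

# leastsq needs a starting guess; a rough one is usually good enough here.
p0 = (float(y.min()), float(y.max()), float(np.median(x)), 1.0)
p_fit, flag = leastsq(residuals, p0, args=(x, y))
print(p_fit)    # estimated bottom, top, x0, k

scipy.optimize.curve_fit is a thin convenience wrapper around the same least-squares machinery, if you prefer passing the model function directly.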