Initially, I used sklearn to perform logistic regression. When I asked how to get the p-values for the coefficients, I was told to use statsmodels, even though I mentioned that it won't read in my data (and I've never used it before).
My training data is a np.array and my labels are a list. I get the following error: TypeError: cannot perform reduce with flexible type. I've never worked with this module before, so I have no clue how to use it or why it doesn't accept my data, since sklearn has no problem with it. How do I make this work and get the p-values?
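For reference, here is a minimal sketch of one way to get the p-values from statsmodels, using synthetic stand-in data since the real arrays aren't shown. The "flexible type" error usually just means a string-dtype array reached numpy, so the labels are encoded as integers before fitting:

import numpy as np
import statsmodels.api as sm
from sklearn.datasets import make_classification

# Synthetic stand-ins for the question's data: a numeric feature array and a plain list of labels
X, y_int = make_classification(n_samples=200, n_features=4, random_state=0)
labels = ["yes" if v == 1 else "no" for v in y_int]

# Encode the string labels as 0/1 integers so numpy sees a numeric dtype
_, y = np.unique(np.asarray(labels), return_inverse=True)

# statsmodels does not add an intercept automatically; add_constant mirrors sklearn's default
X_const = sm.add_constant(np.asarray(X, dtype=float))
result = sm.Logit(y, X_const).fit(disp=0)
print(result.pvalues)   # one p-value per coefficient, including the intercept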
Problem statement: I'm working with a linear system of equations that corresponds to an ill-posed inverse problem. I can apply Tikhonov regularization or ridge regression by hand in Python and get solutions on test data that are sufficiently accurate for my problem. I'd like to solve this problem with sklearn.linear_model.Ridge, because I'd also like to try other machine-learning methods in the linear-models part of that package (https://scikit-learn.org/stable/modules/linear_model.html). I'd like to know whether sklearn is simply the wrong tool in this context.
What I've done: I read the documentation for sklearn.linear_model.Ridge. Since I know the linear transformation corresponding to the forward problem, I ran it over impulse responses to create training data and supplied that to sklearn.linear_model.Ridge to fit a model. The model from sklearn.linear_model.Ridge only works on impulse responses, whereas applying the ridge-regression equations myself gives a model that can be applied to any linear combination of the impulse responses.
Is there a way to apply the linear methods of sklearn without needing to generate a large training data set that spans the entire parameter space of the problem, or is this a prerequisite for using (even linear) machine-learning algorithms?
Should sklearn.linear_model.Ridge return the same results as solving the equation for ridge regression when the sklearn method is applied to test cases that span the forward problem?
Many thanks to anyone who can help my understanding.
Found the answer through trial and error. Answering my own question in case anyone was thinking like I did and needs clarity.
Yes, if you use training data that spans the problem space, it is the same as running ridge regression in Python using the equations. sklearn does what it says in the documentation.
You need fit_intercept=True for sklearn.linear_model.Ridge to fit the Y-intercept of your problem; otherwise the intercept is assumed to be zero.
If you use fit_intercept=False and your problem does NOT have a Y-intercept of zero, you will, of course, get a bad solution.
This might lead a novice like me to the impression that you haven't supplied enough training data, which is incorrect.
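To make this concrete, here is a minimal sketch on made-up data (the forward operator A, the noise level, and alpha are arbitrary) comparing the closed-form Tikhonov/ridge solution with sklearn.linear_model.Ridge, including the fit_intercept point:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))                # hypothetical forward operator
x_true = rng.normal(size=10)
b = A @ x_true + rng.normal(scale=0.01, size=50)
alpha = 1e-2

# Closed-form Tikhonov / ridge solution: x = (A^T A + alpha*I)^-1 A^T b
x_closed = np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ b)

# sklearn equivalent; fit_intercept=False matches the zero-intercept closed form above
x_skl = Ridge(alpha=alpha, fit_intercept=False).fit(A, b).coef_
print(np.allclose(x_closed, x_skl))          # True (up to numerical tolerance)

# If the data have a nonzero offset, fit it explicitly with fit_intercept=True
model = Ridge(alpha=alpha, fit_intercept=True).fit(A, b + 3.0)
print(model.intercept_)                      # close to 3.0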
When using sklearn's LinearRegression, after splitting the data with train_test_split, do we have to use only the training data for OLS (the least-squares fit), or can we use the full data set for OLS and derive the regression result from that?
There are many mistakes beginning data scientists make, and one of them is using the test data anywhere in the learning process. Look at this diagram from here:
As you can see, the data is kept separate during the training process, and it is important that it stays that way.
Now, about the least-squares question: you may think that using the full data improves the fit, but you are forgetting about the evaluation part. The evaluation would then look better not because the regression is better, but simply because you have shown the model the data you are testing it on.
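A minimal sketch of that workflow on synthetic data: split first, fit OLS on the training portion only, and keep the test portion purely for evaluation:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Hold out the test set before any fitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the OLS model on the training data only
model = LinearRegression().fit(X_train, y_train)

# The test set is used exclusively for evaluation
print(r2_score(y_test, model.predict(X_test)))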
I am trying to make sense of the values stored in Scikit-Learn RidgeCV's cv_values_ attribute when scoring is set to the r2_score metric.
Per the documentation for Scikit-Learn's RidgeCV, when store_cv_values=True:
Cross-validation values for each alpha (only available if store_cv_values=True and cv=None). After fit() has been called, this attribute will contain the mean squared errors (by default) or the values of the {loss,score}_func function (if provided in the constructor).
Though I'm somewhat unclear on the specifics of how RidgeCV's native generalized cross-validation works, if it is indeed an approximation of leave-one-out cross-validation, then what the cv_values_ attribute seems to represent is the r2_score for individual (left-out) samples... except that r2_score does not work for individual samples. What, then, is returned in cv_values_ when scoring is set to r2_score?
In short, this Generalized Cross-Validation makes the leave-one-out predictions on the entire training set, then applies the scoring function to those (rather than scoring first then averaging).
You can see that in the code, though it's a little obfuscated by the IdentityRegressor/Classifier. That line is really just computing the score with inputs predictions_ and y_. A few lines up you can see where they generate the predictions: that's where they use the trick that makes Generalized Cross-Validation an efficient way to do Leave-One-Out in the context of Ridge Regression. If you're interested, the docs link to a report and course slides describing why/how that actually works.
Update: After some helpful back and forth, it seems the mystery here is partially resolved and points to an error in the sklearn documentation for RidgeCV. If a scoring argument is provided, the cv_values_ attribute contains the prediction for each point. (If a scoring argument is not provided, it contains the squared error for each point, as the documentation states.)
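A small sketch on synthetic data illustrating this behaviour (assuming the older API the question uses; newer scikit-learn releases rename store_cv_values/cv_values_ to store_cv_results/cv_results_):

from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=50, n_features=5, noise=10.0, random_state=0)
alphas = [0.1, 1.0, 10.0]

# Default (scoring=None): cv_values_ holds per-sample squared errors, one column per alpha
ridge_mse = RidgeCV(alphas=alphas, store_cv_values=True).fit(X, y)
print(ridge_mse.cv_values_.shape)        # (50, 3)

# With scoring="r2": cv_values_ holds the leave-one-out predictions instead
ridge_r2 = RidgeCV(alphas=alphas, scoring="r2", store_cv_values=True).fit(X, y)
loo_pred = ridge_r2.cv_values_

# r2 is then computed once per alpha over the whole prediction vector, not per sample
print([r2_score(y, loo_pred[:, i]) for i in range(len(alphas))])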
I wanted to check whether a multiple linear regression problem produced the same output when solved with Scikit-Learn and with statsmodels.api. I did it in 3 sections (in the order of their mention): Statsmodels (without intercept), Statsmodels (with intercept), and SKL. As expected, my SKL coefficients and R-squared were the same as those of Statsmodels (with intercept), but my SKL mean squared error matched that of Statsmodels (without intercept).
I am going to share my notebook code; it's a fairly basic piece of code, since I have just started with Machine Learning Applications. Please go through it and tell me why it is happening. Also, if you could share your insights on any inefficient piece of code, I would be thankful. Here's the code:
https://github.com/vgoel60/Linear-Regression-using-Sklearn-vs-Statsmodel.api/blob/master/Linear%20Regression%20Boston%20Housing%20Prices%20using%20Scikit-Learn%20and%20Statsmodels.api.ipynb
You made a mistake, which explains the strange results. When you make the predictions from the linear model with scikit-learn, you write:
predictions2 = lm.predict(xtest2)
Notice that you are using the lm model, the one resulting from the first statsmodels regression. Instead, you should have written:
predictions2 = lm2.predict(xtest2)
When you do this, the results are as expected.
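For illustration, here is a self-contained sketch with made-up data (the notebook itself uses the Boston housing set) showing that, once lm2 is used for the predictions, the statsmodels-with-intercept and scikit-learn mean squared errors agree:

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
xtrain, xtest = rng.normal(size=(100, 3)), rng.normal(size=(30, 3))
coef = np.array([1.5, -0.5, 3.0])
ytrain = 2.0 + xtrain @ coef + rng.normal(scale=0.1, size=100)
ytest = 2.0 + xtest @ coef + rng.normal(scale=0.1, size=30)

# statsmodels with an intercept (the notebook's "lm2")
xtrain2, xtest2 = sm.add_constant(xtrain), sm.add_constant(xtest)
lm2 = sm.OLS(ytrain, xtrain2).fit()
predictions2 = lm2.predict(xtest2)        # lm2, not lm

# scikit-learn, which fits an intercept by default
skl = LinearRegression().fit(xtrain, ytrain)
predictions_skl = skl.predict(xtest)

# With the correct model, the two test MSEs agree up to floating-point noise
print(mean_squared_error(ytest, predictions2))
print(mean_squared_error(ytest, predictions_skl))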
I'm trying to run a grid search with LogisticRegression, and I get
ValueError: Can't handle mix of continuous and binary
I've traced this error to metrics.accuracy_score. Apparently the prediction doesn't go so well, and while the y_true is continuous (as is the rest of the data), y_pred is all zeros and is thus classified as binary.
Is there any way to avoid this error?
Does the nature of y_pred mean I have no business using logistic regression at all, or could this be a result of the parameters used?
Thanks
Somewhat confusingly, logistic regression is actually a classification algorithm (see http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression). As such, the target ("y_true") data that you feed it should be binary. If you are actually trying to solve a regression problem, you should choose a different algorithm, e.g. LinearRegression, SVR, RandomForestRegressor, etc.
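As a rough illustration of that suggestion (the estimator and parameter grid here are just placeholders), a grid search over a regressor on a continuous target looks like this:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Continuous target -> grid-search over a regressor rather than LogisticRegression
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, scoring="r2", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)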