Linear Regression using Scikit-learn vs Statsmodels - python

I wanted to check whether a multiple linear regression problem produces the same output when solved with Scikit-Learn and statsmodels.api. I did it in three sections, in this order: statsmodels (without intercept), statsmodels (with intercept), and SKL. As expected, my SKL coefficients and R-squared were the same as those of statsmodels (with intercept), but my SKL mean squared error was equivalent to that of statsmodels (without intercept).
I am going to share my notebook code; it's fairly basic, since I have just started with machine learning applications. Please go through it and tell me why this is happening. Also, if you could share your insights on any inefficient pieces of code, I would be grateful. Here's the code:
https://github.com/vgoel60/Linear-Regression-using-Sklearn-vs-Statsmodel.api/blob/master/Linear%20Regression%20Boston%20Housing%20Prices%20using%20Scikit-Learn%20and%20Statsmodels.api.ipynb

You made a mistake, which explains the strange results. When you make the predictions from the linear model with scikit-learn, you write:
predictions2 = lm.predict(xtest2)
Notice that you are using the lm model, the one resulting from the first statsmodels regression. Instead, you should have written:
predictions2 = lm2.predict(xtest2)
When you do this, the results are as expected.
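For reference, here is a minimal, self-contained sketch of the intended comparison, with synthetic data standing in for the Boston housing set (the names xtrain, xtest, lm, lm2 are assumed from the notebook). The key points: statsmodels only fits an intercept if you add a constant column yourself, sklearn's LinearRegression fits one by default, and predictions must come from the matching model.

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Boston data
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + 4.0 + rng.normal(scale=0.5, size=200)
xtrain, xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

lm = sm.OLS(ytrain, xtrain).fit()                    # statsmodels, no intercept
lm2 = sm.OLS(ytrain, sm.add_constant(xtrain)).fit()  # statsmodels, with intercept
xtest2 = sm.add_constant(xtest)

predictions2 = lm2.predict(xtest2)                   # lm2, not lm!

skl = LinearRegression().fit(xtrain, ytrain)         # fits an intercept by default
print(mean_squared_error(ytest, skl.predict(xtest)))
print(mean_squared_error(ytest, predictions2))       # now matches sklearn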

Related

linear ill-conditioned problems using sklearn.linear_model.Ridge - best way to describe training data?

Problem statement: I'm working with a linear system of equations that corresponds to an inverse problem that is ill-posed. I can apply Tikhonov regularization or ridge regression by hand in Python, and get solutions on test data that are sufficiently accurate for my problem. I'd like to try solving this problem using sklearn.linear_model.Ridge, because I'd like to try other machine-learning methods in the linear models part of that package (https://scikit-learn.org/stable/modules/linear_model.html). I'd like to know whether sklearn is the wrong tool in this context.
What I've done: I read the documentation for sklearn.linear_model.Ridge. Since I know the linear transformation corresponding to the forward problem, I ran it over impulse responses to create training data, and then supplied that to sklearn.linear_model.Ridge to generate a model. Unlike when I apply the ridge regression equation myself in Python, the model from sklearn.linear_model.Ridge only works on impulse responses. On the other hand, applying ridge regression using the equations myself generates a model that can be applied to any linear combination of the impulse responses.
Is there a way to apply the linear methods of sklearn, without needing to generate a large test data set that represents the entire parameter space of the problem, or is this requisite for using (even linear) machine learning algorithms?
Should sklearn.linear_model.Ridge return the same results as solving the ridge regression equation directly, when the sklearn method is applied to test cases that span the forward problem?
Many thanks to anyone who can help my understanding.
Found the answer through trial and error. Answering my own question in case anyone was thinking like I did and needs clarity.
Yes, if you use training data that spans the problem space, it is the same as running ridge regression in Python using the equations. sklearn does what it says in the documentation.
You need fit_intercept=True (which is the default) for sklearn.linear_model.Ridge to fit the Y-intercept of your problem; otherwise the intercept is assumed to be zero.
If you set fit_intercept=False and your problem does NOT have a Y-intercept of zero, you will, of course, get a bad solution.
This might give a novice like me the impression that not enough training data was supplied, which is incorrect.
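To illustrate the equivalence (my own toy sketch, not from the original post): with fit_intercept=False, sklearn's Ridge minimizes ||y - A w||^2 + alpha*||w||^2, so its coefficients should match the closed-form Tikhonov solution (A^T A + alpha*I)^(-1) A^T y.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))      # forward operator / training inputs
x_true = rng.standard_normal(20)
y = A @ x_true + 0.01 * rng.standard_normal(100)
alpha = 1e-2

# Closed-form Tikhonov solution: (A^T A + alpha*I)^(-1) A^T y
w_closed = np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ y)

# sklearn minimizes the same objective when fit_intercept=False
w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(A, y).coef_
print(np.allclose(w_closed, w_sklearn))  # True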

Query regarding the probabilities obtained from Logistic regression

I am working on a classification task with 985 classes.
I have trained my model and predicted the class of X_test data.
I am using logistic regression. When I call clf.predict(X_test[0]), I get the correct class.
But when I look at the probabilities with clf.predict_proba(X_test[0]), the correct class does not have the highest probability; another class has the maximum probability. I don't understand why this is happening. I have checked other inputs, and the same thing happens for them too.
This is really hard to troubleshoot without an example to replicate. However, I suspect that there may be an indexing problem. Try restarting the notebook kernel if you're using a notebook, and check for indexing problems.
Also, if you could post more details or examples of this happening, it would help.
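One concrete thing worth checking (a sketch on toy data, since the original model isn't shown): the columns of predict_proba are ordered by clf.classes_, not by the raw label values, and predict should always agree with the argmax mapped back through clf.classes_.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small toy stand-in for the poster's 985-class problem
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X[:5])
# predict() == argmax of predict_proba(), mapped through clf.classes_
print((clf.predict(X[:5]) == clf.classes_[np.argmax(proba, axis=1)]).all())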

logic in fit() and predict() method for regression

Can anyone please explain the concept of the fit() and predict() methods used in machine learning algorithms?
fit()- used to fit the data.
output- LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
Query 1: what is the back-end calculation for fit()? On what basis do we get the above output after calling the fit() method?
predict() - used to predict the data.
Query 2: what is the back-end calculation used here?
These are some basic concepts I need to understand. Any help is appreciated.
Thanks.
I learned logistic regression (machine learning) by reading a very interesting book that gives you the basics but also walks through harder algorithms. You can find the code examples at https://github.com/rasbt/python-machine-learning-book/tree/master/code; chapter 3 covers logistic regression, though not the full algorithm.
Anyway, the fit() method is used to adapt the model to the data; as far as I know it serves largely an "organizational" purpose. The predict() method in logistic regression uses the sigmoid function and a logarithmic expression.
I do not remember the exact calculations offhand (sorry), but I can dig them up in a few hours.
In machine learning you want to make a model of a real-world concept. For example, there's a good chance there is a correlation between the growth of a plant and the amount of water it gets. fit() will try to capture this correlation in a mathematical formula (= the model, a simplification of the real-world concept).
Fitting means that the algorithm will try to repeatedly adjust its estimate, based on the error in its previous guess.
Not sure what you mean by your second question, but if you're asking how linear regression works, you can check out Wikipedia: https://en.wikipedia.org/wiki/Linear_regression
It gives a very good explanation of the basic concept.
Keep in mind, however, that LinearRegression() performs multi-variable linear regression, so it fits a linear relationship in n-dimensional space, not just in 2D.
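To make this concrete, here is a minimal sketch (my own toy example) of what the two methods do for LinearRegression: fit() solves a least-squares problem to estimate coef_ and intercept_, and predict() simply evaluates the fitted formula X @ coef_ + intercept_.

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 3x + 2 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X.ravel() + 2 + rng.normal(scale=0.5, size=50)

lr = LinearRegression()
lr.fit(X, y)                    # solves least squares for coef_ and intercept_
print(lr.coef_, lr.intercept_)  # close to [3.] and 2

# predict() just applies the fitted formula
y_hat = lr.predict(X[:3])
print(np.allclose(y_hat, X[:3] @ lr.coef_ + lr.intercept_))  # True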

Why the result is different between Matlab and scikit-learn when using PLS regression?

I use PLSRegression.predict from sklearn.cross_decomposition and plsregress from MATLAB (2014a), and the results are slightly different. I'm sure I used the same number of components and the same data. MATLAB consistently performs better than scikit-learn.
Python:
from sklearn.cross_decomposition import PLSRegression
pls = PLSRegression(n_components=8)
pls.fit(X_train, Y_train)
Y_pred = pls.predict(X_train)
Matlab:
[XL,YL,XS,YS,BETA,PCTVAR,MSE]=plsregress(X_train , Y_train ,8);
Yfit = [ones(size(X_train,1),1) X_train]*BETA;
I believe that scikit-learn uses the NIPALS algorithm for PLS, whereas MATLAB uses the SIMPLS algorithm. They are likely to give slightly differing results.
See the documentation page for plsregress in MATLAB, with a reference to the algorithm at the bottom. I don't have a convenient link for NIPALS, but it's an algorithm by Svante Wold, and fairly widely described on the internet.

statsmodels logistic regression doesn't work

Initially, I used sklearn to perform logistic regression. When I asked how to get the p-values for the coefficients, I was told to use statsmodels, even though I mentioned that it doesn't accept my data (and I've never used it before).
My training data is a np.array and my labels are a list. I get the following error: TypeError: cannot perform reduce with flexible type. I've never worked with this module before, so I have no clue how to use it or why it doesn't accept my data, since sklearn seems to have no problems with it. How do I make this work and get the p-values?
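No accepted answer is recorded here, but one likely cause worth noting: numpy's "flexible type" is its string/bytes dtype, so this error usually means the labels (or features) ended up in a string array. A minimal sketch assuming that is the case (the data below is invented for illustration):

import numpy as np
import statsmodels.api as sm

labels = ["0", "1", "0", "1", "1", "0"]  # e.g. labels stored as strings
y = np.asarray(labels, dtype=float)      # coerce to a numeric array first

X = np.asarray([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
X = sm.add_constant(X)                   # statsmodels needs an explicit intercept

result = sm.Logit(y, X).fit(disp=0)
print(result.pvalues)                    # the p-values sklearn does not expose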
