logic in fit() and predict() method for regression - python

Can anyone please explain to me the concept of fit() and predict() as used in machine learning algorithms?
fit() - used to fit the data.
Output: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
Query 1: what is the back-end calculation for fit()? On what basis do we get the above output after calling the fit() method?
predict() - used to predict the data.
Query 2: what is the back-end calculation used here?
This is the basic conceptual understanding that I need. Any help is appreciated.
Thanks.

I learned logistic regression (machine learning) from a very interesting book that covers the basics but also some of the harder algorithms. You can find its code examples at https://github.com/rasbt/python-machine-learning-book/tree/master/code; chapter 3 covers logistic regression, but not the algorithm itself.
Anyway, the fit() method is used to adapt the algorithm to the data; as far as I know there is not a single algorithm behind it, it mostly serves an "organizational" purpose. The predict() method in logistic regression uses the sigmoid function and a logarithmic expression.
I do not remember the calculations of the algorithms right now (sorry), but I can get them in a few hours.
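For what it's worth, here is a rough sketch (on a tiny synthetic data set, not from the book) of what predict() conceptually does for binary logistic regression in scikit-learn: take the linear combination of the learned weights and the features, pass it through the sigmoid, and threshold the probability.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# tiny synthetic binary problem, just to have a fitted model to inspect
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

# what predict() does conceptually: sigmoid of the linear scores, thresholded
z = X @ clf.coef_.T + clf.intercept_       # linear combination of weights and features
p = 1.0 / (1.0 + np.exp(-z))               # sigmoid -> probability of class 1
manual = (p.ravel() > 0.5).astype(int)     # probability above 0.5 -> class 1

print(np.array_equal(manual, clf.predict(X)))  # True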

In Machine Learning you want to make a model of a real-world concept. For example, there is a good chance there is a correlation between the growth of a plant and the amount of water it gets. fit() will try to capture this correlation in a mathematical formula (= the model, a simplification of the real-world concept).
Fitting means that the algorithm will repeatedly adjust its estimate, based on the error of its previous guess.
Not sure what you mean by your second question, but if you're asking how linear regression works, you can check out Wikipedia: https://en.wikipedia.org/wiki/Linear_regression
It gives a very good explanation of the basic concept.
Keep in mind, however, that LinearRegression() will do a multi-variable linear regression, so it fits a linear relationship in n-dimensional space, not just in 2D.
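To make that concrete, here is a minimal sketch with made-up plant-growth numbers: fit() estimates the coefficients and intercept, and predict() is then just the learned linear formula applied to the inputs.

import numpy as np
from sklearn.linear_model import LinearRegression

# made-up example: plant growth vs. amount of water
X = np.array([[1.0], [2.0], [3.0], [4.0]])   # litres of water
y = np.array([2.1, 3.9, 6.2, 8.1])           # growth in cm

model = LinearRegression()
model.fit(X, y)        # estimates model.coef_ and model.intercept_ by least squares

# predict() simply applies y_hat = X @ coef_ + intercept_
manual = X @ model.coef_ + model.intercept_
print(np.allclose(manual, model.predict(X)))  # True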

Related

How to have regression model predict ranges

I'm trying to make a model which can predict test scores. I'm currently using a simple linear regression model but receiving an accuracy score of close to 0 due to the fact that it's guessing a single number as the score. I was wondering if there was a way to have the model predict a range of about 10 numbers and if the true number is in that range it is marked as a correct guess.
The dataset I am using
Github page with notebook
It seems like you are using LogisticRegression. Despite the name, LogisticRegression is not for regression; it is for classification (for example, deciding whether an input belongs to class a or class b).
Use sklearn.linear_model.LinearRegression for linear regression; read this for more details.
There are also many other regression algorithms, too many to list in one answer. If you want to use regressions other than simple naive linear regression, read this for all the supervised learning algorithms scikit-learn provides; Ridge regression and SVR might be good places to start.
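As a hedged sketch of the "within a range" idea from the question (the placeholder data and the ±5 tolerance are assumptions, not taken from the linked notebook): fit a LinearRegression and count a prediction as correct when it falls within a window around the true score.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# placeholder data; substitute your own feature matrix X and score vector y
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

reg = LinearRegression().fit(X_train, y_train)
pred = reg.predict(X_test)

# "range" accuracy: a prediction counts as correct if it is within +/- 5 points
tolerance = 5
print("fraction within range:", np.mean(np.abs(pred - y_test) <= tolerance))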

linear ill-conditioned problems using sklearn.linear_model.Ridge - best way to describe training data?

Problem statement: I'm working with a linear system of equations that correspond to an inverse problem that is ill-posed. I can apply Tikhonov regularization or ridge regression by hand in Python, and get solutions on test data that are sufficiently accurate for my problem. I'd like to try solving this problem using sklearn.linear_model.Ridge, because I'd like to try other machine-learning methods in the linear models part of that package (https://scikit-learn.org/stable/modules/linear_model.html). I'd like to know if using sklearn in this context is using the wrong tool.
What I've done: I read the documentation for sklearn.linear_model.Ridge. Since I know the linear transformation corresponding to the forward problem, I have run it over impulse responses to create training data and then supplied that to sklearn.linear_model.Ridge to generate a model. Unlike when I apply the equation for ridge regression myself in Python, the model from sklearn.linear_model.Ridge only works on impulse responses. On the other hand, applying ridge regression using the equations myself generates a model that can be applied to any linear combination of the impulse responses.
Is there a way to apply the linear methods of sklearn, without needing to generate a large test data set that represents the entire parameter space of the problem, or is this requisite for using (even linear) machine learning algorithms?
Should sklearn.linear_model.Ridge return the same results as solving the equation for ridge regression, when the sklearn method is applied to test cases that span the forward problem?
Many thanks to anyone who can help my understanding.
Found the answer through trial and error. Answering my own question in case anyone was thinking like I did and needs clarity.
Yes, if you use training data that spans the problem space, it is the same as running ridge regression in python using the equations. sklearn does what it says in the documentation.
You need to use fit_intercept=True to get sklearn.linear_model.Ridge to fit the Y intercept of your problem, otherwise it is assumed to be zero.
If you use fit_intercept=False and your problem does NOT have a Y-intercept of zero, you will, of course, get a bad solution.
This might lead a novice like me to think that not enough training data was supplied, which is incorrect.
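To illustrate the point about sklearn matching the hand-written equations, here is a rough sketch on synthetic data (with the intercept disabled so the two forms solve exactly the same equation), comparing Ridge against the closed-form Tikhonov solution w = (XᵀX + αI)⁻¹Xᵀy.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                  # stand-in for the forward problem's training data
w_true = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=50)

alpha = 1.0

# closed-form Tikhonov / ridge solution (no intercept)
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# sklearn, with the intercept disabled so it solves the same equation
w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_

print(np.allclose(w_closed, w_sklearn))  # True, up to solver tolerance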

Linear Regression using Scikit-learn vs Statsmodels

I wanted to check whether a multiple linear regression problem produced the same output when solved using Scikit-Learn and Statsmodels.api. I did it in 3 sections (in the order of their mention): Statsmodels (without intercept), Statsmodels (with intercept) and SKL. As expected, my SKL coefficients and R-squared were the same as those of Statsmodels (with intercept), but my SKL mean square error was equivalent to that of Statsmodels (without intercept).
I am going to share my notebook code; it's a fairly basic piece of code, since I have just started with Machine Learning Applications. Please go through it and tell me why it is happening. Also, if you could share your insights on any inefficient piece of code, I would be thankful. Here's the code:
https://github.com/vgoel60/Linear-Regression-using-Sklearn-vs-Statsmodel.api/blob/master/Linear%20Regression%20Boston%20Housing%20Prices%20using%20Scikit-Learn%20and%20Statsmodels.api.ipynb
You made a mistake, which explains the strange results. When you make the predictions from the linear model with scikit-learn, you write:
predictions2 = lm.predict(xtest2)
Notice that you are using the lm model, the one resulting from the first statsmodels regression. Instead, you should have written:
predictions2 = lm2.predict(xtest2)
When you do this, the results are as expected.

how to predict binary outcome with categorical and continuous features using scikit-learn?

I need advice choosing a model and machine learning algorithm for a classification problem.
I'm trying to predict a binary outcome for a subject. I have 500,000 records in my data set and 20 continuous and categorical features. Each subject has 10--20 records. The data is labeled with its outcome.
So far I'm thinking logistic regression model and kernel approximation, based on the cheat-sheet here.
I am unsure where to start when implementing this in either R or Python.
Thanks!
Choosing an algorithm and optimizing its parameters is a difficult task in any data mining project, because it must be customized for your data and problem. Try different algorithms such as SVM, Random Forest, Logistic Regression, KNN, and so on, run cross-validation for each of them, and then compare them.
You can use GridSearchCV in scikit-learn to try different parameters and optimize them for each algorithm. Also try this project, which tests a range of parameters with a genetic algorithm.
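As a minimal sketch of that advice (the placeholder data, the candidate models, and the parameter grid are only examples): compare a couple of algorithms with cross-validation, then tune one of them with GridSearchCV.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# placeholder data; substitute your own features and binary labels
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# compare two candidate models with 5-fold cross-validation
for name, model in [("logreg", LogisticRegression(max_iter=1000)),
                    ("forest", RandomForestClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())

# tune the stronger candidate over a small parameter grid
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"n_estimators": [100, 300], "max_depth": [None, 10]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)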
Features
If your categorical features don't have too many possible different values, you might want to have a look at sklearn.preprocessing.OneHotEncoder.
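For example, a small sketch with a made-up categorical column:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# made-up categorical feature with a few possible values
colours = np.array([["red"], ["green"], ["blue"], ["green"]])

enc = OneHotEncoder()
# each row becomes a 0/1 indicator vector, one column per category
print(enc.fit_transform(colours).toarray())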
Model choice
The choice of "the best" model depends mainly on the amount of available training data and the simplicity of the decision boundary you expect to get.
You can try dimensionality reduction to 2 or 3 dimensions. Then you can visualize your data and see if there is a nice decision boundary.
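A rough sketch of that idea using PCA and matplotlib (placeholder data again):

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# placeholder data; substitute your own features and binary labels
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# project onto the first two principal components and colour points by class
X_2d = PCA(n_components=2).fit_transform(X)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=5)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()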
With 500,000 training examples you can think about using a neural network. I can recommend Keras for beginners and TensorFlow for people who know how neural networks work.
You should also know that there are Ensemble methods.
A nice cheat sheet on what to use is in the sklearn tutorial you already found (source: scikit-learn.org).
Just try it, compare different results. Without more information it is not possible to give you better advice.
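If you do go the neural-network route mentioned above, a minimal Keras sketch for a binary outcome could look like this (layer sizes are arbitrary, and the feature count must match your data):

from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Sequential

model = Sequential([
    Input(shape=(20,)),               # 20 = number of features; adjust to your data
    Dense(64, activation="relu"),
    Dense(1, activation="sigmoid"),   # single sigmoid unit for a binary outcome
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=10, batch_size=256, validation_split=0.1)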

Python model targeting n variable prediction equation

I am looking to build a predictive model and am working with our current JMP model. Our current approach is to guess an nth degree polynomial and then look at which terms are not significant model effects. Polynomials are not always the best and this leads to a lot of confusion and bad models. Our data can have between 2 and 7 effects and always has one response.
I want to use python for this, but package documentation or online guides for something like this are hard to find. I know how to fit a specific nth degree polynomial or do a linear regression in python, but not how to 'guess' the best function type for the data set.
Am I missing something obvious or should I be writing something that probes through a variety of function types? Precision is the most important. I am working with a small (~2000x100) data set.
Potentially I can do regression on smaller training sets, test them against the validation set, then rank the models and choose the best. Is there something better?
Try using other regression models instead of the vanilla Linear Model.
You can use something like this for polynomial regression:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)   # expand the features into degree-2 polynomial terms
X_ = poly.fit_transform(input_data)   # input_data is your feature matrix
And you can constrain the weights through Lasso regression:
from sklearn import linear_model
clf = linear_model.Lasso(alpha=0.5, positive=True)   # L1-regularized linear model, weights kept positive
clf.fit(X_, Y_)
where Y_ is the output you want to train against.
Setting alpha to 0 turns it into simple linear regression; alpha is basically the strength of the penalty on the size of the weights, so larger values shrink more coefficients toward zero. You can also force the weights to be strictly positive. Check this out here.
Run it with a small degree and perform a cross-validation to check how good it fits.
Increasing the degree of the polynomial generally leads to over-fitting. So if you are forced to use degree 4 or 5, that means you should look for other models.
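One hedged way to put the cross-validation advice into practice (placeholder data; the alpha value is just an example): wrap PolynomialFeatures and Lasso in a Pipeline and score a few candidate degrees.

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# placeholder data; substitute your own ~2000 x 100 data set
X, y = make_regression(n_samples=500, n_features=5, noise=5.0, random_state=0)

for degree in (1, 2, 3):
    model = make_pipeline(PolynomialFeatures(degree=degree),
                          Lasso(alpha=0.5, max_iter=10000))
    scores = cross_val_score(model, X, y, cv=5)   # R^2 by default for regressors
    print(degree, scores.mean())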
You should also take a look at this question. This explains how you can curve fit.
ANOVA (analysis of variance) can tell you which effects are statistically significant... you shouldn't have to choose terms at random.
However, if you are saying that your data is inhomogeneous (i.e., you shouldn't fit a single model to all the data), then you might consider using the scikit-learn toolkit to build a classifier that could choose a subset of the data to fit.
