I have the dataset shown below. Any value between 500 and 900 was categorized as A, while values between 900 and ~1500 were mixed between A and B. I want to find the probability of getting A, B, or C at any value of x, where x is my independent variable and A, B, and C are the classes of my dependent variable. It seems to be a good fit for multinomial logistic regression, and I believe the number of observations for each class is sufficient. If multinomial logistic regression is appropriate, I would like to use Python's scikit-learn logistic regression module to obtain the probabilities of A, B, and C at any value of x, but I am not sure how to approach this using that module.
Personally, it looks like a reasonable candidate for logistic regression, but the fact that the data is one-dimensional with overlapping classes may make those regions hard to separate. I'm mainly here to answer the second part of your question, which generalizes to pretty much any other classifier in scikit-learn.
I recommend looking at the scikit-learn documentation for SGDClassifier, since it has a simple example right below the attribute list; just swap the SGDClassifier part for the LogisticRegression class instead.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier
Here’s also the documentation for LogisticRegression: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression
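To make that concrete, here is a minimal sketch of the second part: fit LogisticRegression on a single feature and read the class probabilities off predict_proba. The toy x values and labels below are invented, not your dataset:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for your data: x values with their A/B/C labels
x = np.array([600, 700, 850, 950, 1100, 1300, 1600, 1800, 2000]).reshape(-1, 1)
y = np.array(['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'])

clf = LogisticRegression(multi_class='multinomial', solver='lbfgs')
clf.fit(x, y)

# Probabilities for each class at any new x; columns are ordered as clf.classes_
print(clf.classes_)                 # ['A' 'B' 'C']
print(clf.predict_proba([[1000]]))  # one row with P(A), P(B), P(C)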
Related
In my dataset X I have two continuous variables a, b and two boolean variables c, d, making a total of 4 columns.
I have a multidimensional target y consisting of two continuous variables A, B and one boolean variable C.
I would like to train a model on the columns of X to predict the columns of y. However, LinearRegression on X didn't perform so well (my variables vary over several orders of magnitude and I have to apply suitable transforms to take logarithms; I won't go into too much detail here).
I think I need to use LogisticRegression on the boolean columns.
What I'd really like to do is combine both LinearRegression on the continuous variables and LogisticRegression on the boolean variables into a single pipeline. Note that all the columns of y depend on all the columns of X, so I can't simply train the continuous and boolean variables independently.
Is this even possible, and if so how do I do it?
I've used something called a "Model Tree" (see link below) for the same sort of problem.
https://github.com/ankonzoid/LearningX/tree/master/advanced_ML/model_tree
But it will need to be customized for your application. Please ask more questions if you get stuck using it.
If your target data Y has multiple columns, you need a multi-task learning approach. Scikit-learn contains some multi-task learning algorithms for regression, like multi-task elastic-net, but you cannot combine logistic regression with linear regression, because the two algorithms optimize different loss functions. You could also try neural networks for your problem.
What I understand you want to do is train a single model that predicts both a continuous variable and a class. You would need to combine both losses into a single loss to do that, which I don't think is possible in scikit-learn. However, I suggest you use a deep learning framework (TensorFlow, PyTorch, etc.) to implement your own model with the properties you need, which is more flexible. Tinkering with a neural network for this problem may also improve your results.
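As a rough sketch of that combined-loss idea in PyTorch (all layer sizes and names here are my own assumptions, not from the question): one shared network with a regression head for the continuous targets A, B and a classification head for the boolean target C, trained on the sum of an MSE loss and a BCE loss:

import torch
import torch.nn as nn

class MixedTargetNet(nn.Module):
    def __init__(self, n_in=4, n_hidden=16):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU())
        self.reg_head = nn.Linear(n_hidden, 2)  # continuous targets A, B
        self.clf_head = nn.Linear(n_hidden, 1)  # boolean target C (as a logit)

    def forward(self, x):
        h = self.shared(x)
        return self.reg_head(h), self.clf_head(h)

model = MixedTargetNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mse, bce = nn.MSELoss(), nn.BCEWithLogitsLoss()

# Dummy batch, just to show the training step
X = torch.randn(32, 4)
y_reg = torch.randn(32, 2)
y_clf = torch.randint(0, 2, (32, 1)).float()

for _ in range(100):
    opt.zero_grad()
    pred_reg, pred_clf = model(X)
    loss = mse(pred_reg, y_reg) + bce(pred_clf, y_clf)  # the single combined loss
    loss.backward()
    opt.step()

Because both heads share the hidden layer, every output still depends on every input column, which matches the constraint in the question.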
I am wondering whether there exists some correlation among the hyperparameters of two different classifiers.
For example: say we run LogisticRegression on a dataset with its best hyperparameters (found through GridSearch) and then want to run another classifier, like SVC (the SVM classifier), on the same dataset. Instead of finding all of its hyperparameters with GridSearch, can we fix some of their values (or reduce their ranges to limit GridSearch's search space)?
As an experiment, I used scikit-learn classifiers like LogisticRegression, SVC, LinearSVC, SGDClassifier, and Perceptron to classify some well-known datasets. In some cases I can see some correlation empirically, but not always, and not for all datasets.
Could someone please help me clear this point up?
I don't think you can correlate the parameters of different classifiers like this, mainly because each classifier behaves differently: it adjusts to the data through its own set of equations. For example, take the case of SVC with two different kernels, rbf and sigmoid. It might be that rbf fits the data well with the regularization parameter C set to, say, 0.001, while the sigmoid kernel over the same data fits best with a C value of 0.00001. Both values may also turn out to be equal; you can never say for sure. When you say:
In some cases, I am able to see some correlation empirically, but not always for all datasets.
it may simply be a coincidence, since it all depends on the data and the classifiers; you cannot apply it globally. Correlation does not always imply causation.
You can visit this site and see for yourself that although different regressor functions share a parameter a, their equations are vastly different, and hence over the same dataset you might get drastically different values of a.
I am looking to build a predictive model and am working from our current JMP model. Our current approach is to guess an nth-degree polynomial and then look at which terms are not significant model effects. Polynomials are not always the best fit, and this leads to a lot of confusion and bad models. Our data can have between 2 and 7 effects and always has one response.
I want to use Python for this, but package documentation and online guides for something like this are hard to find. I know how to fit a specific nth-degree polynomial or do a linear regression in Python, but not how to 'guess' the best function type for the data set.
Am I missing something obvious, or should I write something that probes through a variety of function types? Precision is the most important consideration. I am working with a small (~2000x100) data set.
Potentially I can do regression on smaller training sets, test them against the validation set, then rank the models and choose the best. Is there something better?
Try using other regression models instead of the vanilla Linear Model.
You can use something like this for polynomial regression:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
X_ = poly.fit_transform(input_data)
And you can constrain the weights through Lasso regression:
from sklearn import linear_model

clf = linear_model.Lasso(alpha=0.5, positive=True)
clf.fit(X_, Y_)
where Y_ is the output you want to train against.
Setting alpha to 0 turns it into a simple linear regression; alpha is the strength of the penalty that shrinks the weights toward zero, so larger values force more coefficients to be small or exactly zero. You can also make the weights strictly positive with positive=True. Check the Lasso documentation for details.
Run it with a small degree and perform cross-validation to check how well it fits.
Increasing the degree of the polynomial generally leads to over-fitting, so if you are forced to go to degree 4 or 5, that is a sign you should look for other models.
You should also take a look at this question; it explains how you can do general curve fitting.
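Putting the pieces above together, a sketch of the degree-then-cross-validate loop (the synthetic data is a stand-in for your own X and y):

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = X[:, 0] ** 2 - X[:, 1] + rng.normal(scale=0.1, size=200)

# Score a PolynomialFeatures + Lasso pipeline at each degree and keep the best
for degree in (1, 2, 3, 4):
    model = make_pipeline(PolynomialFeatures(degree=degree), Lasso(alpha=0.5))
    scores = cross_val_score(model, X, y, cv=5)
    print(degree, scores.mean())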
ANOVA (analysis of variance) uses statistical tests on the variance explained by each term to determine which effects are significant, so you shouldn't have to choose terms by guessing.
However, if you are saying that your data is inhomogeneous (i.e., you shouldn't fit a single model to all the data), then you might consider using the scikit-learn toolkit to build a classifier that chooses a subset of the data to fit.
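As a minimal illustration of letting significance tests pick the terms, here is a statsmodels sketch (a tool I'm bringing in for this; the column names and synthetic data are placeholders for your effects and response):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.RandomState(0)
df = pd.DataFrame({'x1': rng.normal(size=200), 'x2': rng.normal(size=200)})
df['y'] = 2 * df['x1'] + 0.5 * df['x1'] ** 2 + rng.normal(size=200)  # x2 is irrelevant

# Fit a quadratic model in both effects and read the per-term p-values
fit = smf.ols('y ~ x1 + I(x1**2) + x2 + I(x2**2)', data=df).fit()
print(fit.summary())  # the x2 terms should come out non-significant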
I am running some algorithms in scikit-learn. Currently I use RandomizedLasso, but this question pertains to any ML algorithm in scikit-learn.
My initial training data is 149x56. Now here is what I do:
from sklearn.linear_model import RandomizedLasso

est_rlasso = RandomizedLasso(max_iter=1000)
# Running randomized Lasso
x = est_rlasso.fit_transform(tourism_X, tourism_Y)
x.shape
>>> (149, 36)
So it gives out the 36 best features to retain out of the initial 56 and transforms the dataset from 149x56 to 149x36. But the problem is: which 36 features did it retain? The big problem with scikit-learn is that it strips off the variable headers, so now I am left clueless about which features the algorithm kept and which it removed, as the final X has no header to cross-check against.
This is common across ML algorithm implementations in scikit-learn. How does one overcome this? For instance, if I need to find which variables it marked as significant, or, if I am running a regression model, which variables the coefficients stand for; I might have used OneHotEncoder to transform categorical variables, which changes the variable order from the original.
Any idea?
From the docs:
get_support([indices]): Return a mask, or list, of the features/indices selected.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RandomizedLasso.html
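In other words, something along these lines (a sketch that assumes your original data is in a pandas DataFrame so the headers survive, and an older scikit-learn release, since RandomizedLasso was deprecated and later removed):

import numpy as np
import pandas as pd
from sklearn.linear_model import RandomizedLasso

# Dummy stand-in for the question's 149x56 frame, just so this runs
tourism_X = pd.DataFrame(np.random.randn(149, 56),
                         columns=['feat%d' % i for i in range(56)])
tourism_Y = np.random.randn(149)

est_rlasso = RandomizedLasso(max_iter=1000)
est_rlasso.fit(tourism_X, tourism_Y)

mask = est_rlasso.get_support()  # boolean mask over the original columns
kept = tourism_X.columns[mask]   # names of the retained features
print(list(kept))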
Is there any implementation of an incremental SVM that can also return the probability of a given feature vector belonging to the various classes? Preferably one usable from Python.
I have heard about LaSVM. Does LaSVM return probability estimates? And does it have features for handling imbalanced training datasets?
You can have a look at scikit-learn, a very flexible and efficient library written in Python.
Every fitted model stores its internally calculated values. If clf is your SVM classifier, you can call clf.decision_function to get the scores behind the predictions.
It also provides a good set of tools for preprocessing data, among other things you may find interesting.
cheers,
For probability estimates you can use the scikit-learn library. There are two alternatives. One gives probabilities; here is an example: How to know what classes are represented in return array from predict_proba in Scikit-learn
The other gives signed values for ranking (not probabilities, but this generally gives better results): Scikit-learn predict_proba gives wrong answers; you should look at the answer there.
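For completeness, a toy sketch of both options (SGDClassifier with the 'modified_huber' loss is one scikit-learn estimator that is both incremental, via partial_fit, and able to return probability estimates; the data here is random noise):

import numpy as np
from sklearn.linear_model import SGDClassifier

X = np.random.randn(100, 5)
y = np.random.randint(0, 3, size=100)

clf = SGDClassifier(loss='modified_huber')
clf.partial_fit(X, y, classes=np.array([0, 1, 2]))  # incremental updates

print(clf.predict_proba(X[:2]))      # probability estimates per class
print(clf.decision_function(X[:2]))  # signed scores, useful for ranking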