Combining logistic and continuous regression with scikit-learn - python

In my dataset X I have two continuous variables a, b and two boolean variables c, d, making a total of 4 columns.
I have a multidimensional target y consisting of two continuous variables A, B and one boolean variable C.
I would like to train a model on the columns of X to predict the columns of y. However, having tried LinearRegression on X it didn't perform so well (my variables vary several orders of magnitude and I have to apply suitable transforms to get the logarithms, I won't go into too much detail here).
I think I need to use LogisticRegression on the boolean columns.
What I'd really like to do is combine both LinearRegression on the continuous variables and LogisticRegression on the boolean variables into a single pipeline. Note that all the columns of y depend on all the columns of X, so I can't simply train the continuous and boolean variables independently.
Is this even possible, and if so how do I do it?

I've used something called a "Model Tree" (see link below) for the same sort of problem.
https://github.com/ankonzoid/LearningX/tree/master/advanced_ML/model_tree
But it will need to be customized for your application. Please ask more questions if you get stuck using it.
Here's a screen shot of what it does

If your target data Y has multiple columns you need to use multi-task learning approach. Scikit-learn contains some multi-task learning algorithms for regression like multi-task elastic-net but you cannot combine logistic regression with linear regression because these algorithms use different loss functions to optimize. Also, you may try neural networks for your problem.

What i understand you want to do is to is to train a single model that both predicts a continuous variable and a class. You would need to combine both loses into one single loss to be able to do that which I don't think is possible in scikit-learn. However I suggest you use a deep learning framework (tensorflow, pytorch, etc) to implement your own model with the required properties you need which would be more flexible. In addition you can also tinker with solving the above problem using neural networks which would improve your results.

Related

How can you predict a combination of categorical and continuous variables with Scikit learn?

I have a dataset with a large number of predictive variables and I want to use them to predict a number of output variables. However, some of the things I want to predict are categorical, and others are continuous; the things I want to predict are not independent. Is it possible with scikit-learn to, for example, mix a classifier and a regressor so that I can predict and disentangle these variables? (I'm currently looking at gradient boosting classifiers/regressors, but there may be better options.)
You can certainly use One Hot Encoding or Dummy Variable Encoding, to convert labels to numerics. See the link below for all details.
https://codefires.com/how-convert-categorical-data-numerical-data-python/
As an aside, Random Forest is a popular machine learning model that is commonly used for classification tasks as can be seen in many academic papers, Kaggle competitions, and blog posts. In addition to classification, Random Forests can also be used for regression tasks. A Random Forest’s nonlinear nature can give it a leg up over linear algorithms, making it a great option. However, it is important to know your data and keep in mind that a Random Forest can’t extrapolate. It can only make a prediction that is an average of previously observed labels. In this sense it is very similar to KNN. In other words, in a regression problem, the range of predictions a Random Forest can make is bound by the highest and lowest labels in the training data. This behavior becomes problematic in situations where the training and prediction inputs differ in their range and/or distributions. This is called covariate shift and it is difficult for most models to handle but especially for Random Forest, because it can’t extrapolate.
https://towardsdatascience.com/a-limitation-of-random-forest-regression-db8ed7419e9f
https://stackabuse.com/random-forest-algorithm-with-python-and-scikit-learn
In closing, Scikit-learn uses numpy matrices as inputs to its models. As such all features become de facto numerical (if you have categorical feature you’ll need to convert them to numerical).
I don't think there's a builtin way. There are ClassifierChain and RegressorChain that allow you to use earlier predictions as features in later predictions, but as the names indicate they assume either classification or regression. Two options come to mind:
Manually patch those together for what you want to do. For example, use a ClassifierChain to predict each of your categorical targets using just the independent features, then add those predictions to the dataset before training a RegressorChain with the numeric targets.
Use those classes as a base for defining a custom estimator. In that case you'll probably look mostly at their common parent class _BaseChain. Unfortunately that also uses a single estimator attribute, whereas you'd need (at least) two, one classifier and one regressor.

Python Select variables in multiple linear regression

I have a dependent variable y and 6 independent variables. I want to make a linear regression out of it. I use sklearn library to do it.
The problem is some of my independent variables have correlation more than 0.5. So I can't have them in my model at the same time
I searched throw internet but didn't find any solution to select best set of independent variables to draw linear regression and output the variables that had been selected.
If you see that you have a correlation between independent variables. You should consider to remove them.
I see you are working with scikit-learn. If you don't want to do any feature selection manually, you could always use one of the feature selection methods in scikit-learns feature_selection module. There are many ways to automatically remove features, and you should cross-validate to determine which one is best for your problem.
You are probably looking for a k-fold validation model.
The idea is to randomly select your features, and have a way to validate them against each other.
The idea is to train your model with your feature selection on (k-1) partitions of your data. And validate it against the last partition. You do it for each partition and take the average of your score (MAE / RMSE for instance)
Your score is an objectif figure to compare your models aka your features selections

Multi-output regression

I have been looking in to Multi-output regression the last view weeks. I am working with the scikit learn package. My machine learning problem has an a input of 3 features an needs to predict two output variables. Some ML models in the sklearn package support multioutput regression nativly. If the models do not support this, the sklearn multioutput regression algorithm can be used to convert it. The multioutput class fits one regressor per target.
Does the mulioutput regressor class or supported multi-output regression algorithms take the underlying relationship of the input variables in to account?
Instead of a multi-output regression algorithm should I use a Neural network?
1) For your first question, I have divided that into two parts.
First part has the answer written in the documentation you linked and also in this user guide topic, which states explicitly that:
As MultiOutputRegressor fits one regressor per target it can not take
advantage of correlations between targets.
Second part of first question asks about other algorithms which support this. For that you can look at the "inherently multiclass" part in the user-guide. Inherently multi-class means that they don't use One-vs-Rest or One-vs-One strategy to be able to handle multi-class (OvO and OvR uses multiple models to fit multiple classes and so may not use the relationship between targets). Inherently multi-class means that they can structure the multi-class setting into a single model. This lists the following:
sklearn.naive_bayes.BernoulliNB
sklearn.tree.DecisionTreeClassifier
sklearn.tree.ExtraTreeClassifier
sklearn.ensemble.ExtraTreesClassifier
sklearn.naive_bayes.GaussianNB
sklearn.neighbors.KNeighborsClassifier
sklearn.semi_supervised.LabelPropagation
sklearn.semi_supervised.LabelSpreading
sklearn.discriminant_analysis.LinearDiscriminantAnalysis
sklearn.svm.LinearSVC (setting multi_class=”crammer_singer”)
sklearn.linear_model.LogisticRegression (setting multi_class=”multinomial”)
...
...
...
Try replacing the 'Classifier' at the end with 'Regressor' and see the documentation of fit() method there. For example let's take DecisionTreeRegressor.fit():
y : array-like, shape = [n_samples] or [n_samples, n_outputs]
The target values (real numbers).
Use dtype=np.float64 and order='C' for maximum efficiency.
You see that it supports a 2-d array for targets (y). So it may be able to use correlation and underlying relationship of targets.
2) Now for your second question about using neural network or not, it depends on personal preference, the type of problem, the amount and type of data you have, the training iterations you want to do. Maybe you can try multiple algorithms and choose what gives best output for your data and problem.

Is Multinomial logistic regression appropriate for this dataset?

I have the following dataset shown below. Any value between 500 & 900 were categorized as A, while values between 900 & ~1500 were mixed between A and B. I want to find the probability of getting A, B, and C at any value of x where x is my independent variable and A,B,C are my dependent variables. It seems to be a good fit for multinomial logistic regression. I believe the number of observations for each dependent variable is sufficient. If multinomial log regression is appropriate, I wish to uses Python's scikit learn logistic regression module to obtain my probability of A, B, and C at any value of x but I am not sure how to approach this using that module.
Personally, it looks like an all right candidate for logistic regression, but the fact that it looks 1-dimensional with overlapping may make it hard to separate along those parts. I’m mainly here to answer the second part of your question which can be generalized to pretty much any other classifier within scikit-learn.
I recommend looking at the scikit-learn section on SGDClassifier since it has a simple example right below the attribute list, but replace the SGDClassifier part with the LogisticRegression class instead.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier
Here’s also the documentation for LogisticRegression: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression

Correlation among Hyperparameters of Classifiers

I am wondering whether there exists some correlation among the hyperparameters of two different classifiers.
For example: let us say that we run LogisticRegression on a dataset with best hyperparameters (by finding through GridSearch) and want to run another classifier like SVC (SVM classifier) on the same dataset but instead of finding all hyperparameters using GridSearch, can we fix some values (or reduce range to limit the search space for GridSearch) of hyperparameters?
As an experimentation, I used scikit-learn's classifiers like LogisticRegression, SVS, LinearSVC, SGDClassifier and Perceptron to classifiy some well know datasets. In some cases, I am able to see some correlation empirically, but not always for all datasets.
So please help me to clear this point.
I don't think you can correlated different parameters of different classifiers together like this. This is mainly because each classifier behaves differently as it has it's own way of adjusting the data along their own set of equations. For example, take the case of SVC with two different kernels rbf and sigmoid. It might be the case that rbf may fit perfectly over the data with the intercept parameter C set to say 0.001, while 'sigmoidkernel over the same data may fit withC` value 0.00001. Both values may also be equal. However, you can never say that for sure. When you say that :
In some cases, I am able to see some correlation empirically, but not always for all datasets.
It may simply be a coincidence. Since it all depends on the and the classifiers. You cannot apply it globally.Correlation does not always equal to causation
You can visit this site and see for yourself that although different regressor functions have the same parameter a, their equations are vastly different and hence over the same dataset you might drastically different values of a.

Categories

Resources