tpot: Use multi-output regressors only - python

I want to use tpot. The data I have includes multi-output continuous variables only (i.e. output shape is: (n_samples, n_output_variables), where all items are floats).
This could be achievable using sklearn's MultiOutputRegressor class. But because I have over 100 different output variables, I want to avoid applying tpot to each individual output.
Now, how can I use tpot to only search for multi-output models? Is there a way to tell tpot that only multi-output models (such as DecisionTree) should be used?

About regressors with multiple output:
You have a multioutput regression problem. I suggest that you check this answer: Multi-output regression.
There are regressors which natively support multiple outputs on the target, for example KNeighborsRegressor, DecisionTreeRegressor, ExtraTreesRegressor and RandomForestRegressor. Others (like SGDRegressor, ElasticNetCV, etc.) can be used with multiple outputs if you wrap them in MultiOutputRegressor, as you already mentioned.
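To illustrate the wrapper route, a minimal sketch (X and y stand in for your feature matrix and 2-D target):

```python
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import SGDRegressor

# MultiOutputRegressor fits one independent SGDRegressor per target column,
# so it works even for estimators that only accept a 1-D y.
model = MultiOutputRegressor(SGDRegressor(max_iter=1000))
# model.fit(X, y)  # y of shape (n_samples, n_output_variables)
```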
About TPOT and multiple output regression:
Currently TPOT can be used with all the regressors that natively support multiple outputs, but you have to adjust a configuration file yourself because this is not implemented yet; take a look at https://github.com/EpistasisLab/tpot/issues/971. If you want to compare the other (single-output) regressors together with MultiOutputRegressor, TPOT will currently let you choose only one at a time. That is, you can specify one of the several algorithms, search for the best pipeline, and then rerun with another algorithm.
Regarding your question about specifying which algorithms to search over: first take a look at the official documentation and read the section Customizing TPOT's operators and parameters. If you want to use only specific algorithms, one way to achieve this is to copy the standard TPOT configuration for regression (https://github.com/EpistasisLab/tpot/blob/master/tpot/config/regressor.py) into your code and remove (or add) the algorithms you do not (or do) want to include in your search.
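For example, a minimal sketch of such a custom configuration, restricted to a few regressors that natively handle 2-D targets (the hyperparameter ranges below are illustrative, not the full TPOT defaults):

```python
from tpot import TPOTRegressor

# Custom TPOT search space: only regressors that natively accept
# a 2-D target array. Hyperparameter ranges are illustrative.
multioutput_config = {
    'sklearn.tree.DecisionTreeRegressor': {
        'max_depth': range(1, 11),
        'min_samples_split': range(2, 21),
    },
    'sklearn.ensemble.RandomForestRegressor': {
        'n_estimators': [100],
        'max_features': [0.5, 0.75, 1.0],
    },
    'sklearn.neighbors.KNeighborsRegressor': {
        'n_neighbors': range(1, 26),
        'weights': ['uniform', 'distance'],
    },
}

tpot = TPOTRegressor(config_dict=multioutput_config,
                     generations=5, population_size=20, verbosity=2)
# tpot.fit(X_train, y_train)  # y_train of shape (n_samples, n_output_variables)
```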

Related

Forward feature selection with custom criterion

I am trying to find the best features in my data for classification. For this I want to try feature selection using SVM, KNN, LDA and QDA.
Also, the way to test this data is a leave-one-out approach, not cross-validation by splitting the data into parts (basically I can't split one file/matrix, but have to leave one file out for testing while training with the other files).
I tried using sfs with SVM in Matlab but keep getting only the first feature and nothing else (there are 254 features).
Is there any way to do this in Python or Matlab?
If you're trying to code the feature selector from scratch, I think you'd better first get deeper into the theory of your algorithm of choice.
But if you're looking for a way to get results faster, scikit-learn provides you with a variety of tools for feature selection. Have a look at this page.
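For instance, a sketch of forward selection in scikit-learn combined with a leave-one-file-out scheme, assuming each file is encoded as a group label (all data shapes below are placeholders):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

# Placeholder data: 8 "files" of 5 samples each, 10 features.
X = np.random.rand(40, 10)
y = np.random.randint(0, 2, 40)
groups = np.repeat(np.arange(8), 5)

# Materialize the leave-one-group-out splits so they can be reused
# when scoring every candidate feature subset.
cv = list(LeaveOneGroupOut().split(X, y, groups))

sfs = SequentialFeatureSelector(
    SVC(kernel='linear'),
    n_features_to_select=5,   # illustrative; tune for your 254 features
    direction='forward',
    cv=cv,
)
sfs.fit(X, y)
print(sfs.get_support())      # boolean mask over the original features
```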

Combining logistic and continuous regression with scikit-learn

In my dataset X I have two continuous variables a, b and two boolean variables c, d, making a total of 4 columns.
I have a multidimensional target y consisting of two continuous variables A, B and one boolean variable C.
I would like to train a model on the columns of X to predict the columns of y. However, having tried LinearRegression on X, it didn't perform so well (my variables vary over several orders of magnitude and I have to apply suitable transforms to get the logarithms; I won't go into too much detail here).
I think I need to use LogisticRegression on the boolean columns.
What I'd really like to do is combine both LinearRegression on the continuous variables and LogisticRegression on the boolean variables into a single pipeline. Note that all the columns of y depend on all the columns of X, so I can't simply train the continuous and boolean variables independently.
Is this even possible, and if so how do I do it?
I've used something called a "Model Tree" (see link below) for the same sort of problem.
https://github.com/ankonzoid/LearningX/tree/master/advanced_ML/model_tree
But it will need to be customized for your application. Please ask more questions if you get stuck using it.
If your target data y has multiple columns, you need a multi-task learning approach. Scikit-learn contains some multi-task learning algorithms for regression, like multi-task elastic-net, but you cannot combine logistic regression with linear regression because these algorithms optimize different loss functions. You may also try neural networks for your problem.
What I understand you want to do is to train a single model that predicts both continuous variables and a class. You would need to combine both losses into a single loss to be able to do that, which I don't think is possible in scikit-learn. However, I suggest you use a deep learning framework (TensorFlow, PyTorch, etc.) to implement your own model with the required properties, which would be more flexible. Tinkering with a neural network for this problem may also improve your results.
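To make the combined-loss idea concrete, here is a minimal PyTorch sketch of a shared network with two heads, trained on the sum of a mean-squared-error loss (continuous targets A, B) and a binary cross-entropy loss (boolean target C); all layer sizes and data are illustrative:

```python
import torch
import torch.nn as nn

class TwoHeadNet(nn.Module):
    """Shared trunk with a regression head and a classification head."""
    def __init__(self, n_features=4):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU())
        self.reg_head = nn.Linear(32, 2)   # predicts A and B
        self.clf_head = nn.Linear(32, 1)   # predicts the logit of C

    def forward(self, x):
        h = self.shared(x)
        return self.reg_head(h), self.clf_head(h)

model = TwoHeadNet()
mse, bce = nn.MSELoss(), nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

X = torch.randn(64, 4)                        # placeholder inputs (a, b, c, d)
y_reg = torch.randn(64, 2)                    # placeholder continuous targets
y_clf = torch.randint(0, 2, (64, 1)).float()  # placeholder boolean target

for _ in range(100):
    opt.zero_grad()
    pred_reg, pred_clf = model(X)
    loss = mse(pred_reg, y_reg) + bce(pred_clf, y_clf)  # combined loss
    loss.backward()
    opt.step()
```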

Scikit-learn multithreading

Do you know if models from scikit-learn use multithreading automatically, or just sequential instructions?
Thanks
No. All scikit-learn estimators will by default work on a single thread only.
But then again, it all depends on the algorithm and the problem. If the algorithm inherently requires processing the data sequentially, nothing can be parallelized. If the dataset is multi-class or multi-label and the algorithm works on a one-vs-rest basis, then yes, it can use multi-threading.
Look for a param n_jobs in the utility or algorithm you want to use, and set it to -1 to use all available cores.
For example:
LogisticRegression, when working on a binary problem, will train only a single model, which processes the data sequentially, so here n_jobs has no effect. But it handles multi-class problems as one-vs-rest (OvR), so it has to train as many estimators as there are classes on the same data; in this case you can use n_jobs=-1.
DecisionTreeClassifier is inherently multi-class enabled and doesn't need to train multiple models, so it doesn't have that param.
Ensemble methods like RandomForestClassifier will train multiple estimators (irrespective of problem type) which individually work on part of the data, so here again we can make use of n_jobs.
Cross-validation utilities like cross_val_score or GridSearchCV evaluate each fold or parameter setting independently of the others, so here too we can use multi-threading capabilities.
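A short sketch of both levels of parallelism (parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# n_jobs=-1 on the estimator parallelizes the individual trees;
# n_jobs=-1 on the search parallelizes folds and parameter settings.
clf = RandomForestClassifier(n_jobs=-1, random_state=0)
search = GridSearchCV(clf, {'n_estimators': [50, 100]}, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```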

Multi-output regression

I have been looking into multi-output regression the last few weeks. I am working with the scikit-learn package. My machine learning problem has an input of 3 features and needs to predict two output variables. Some ML models in the sklearn package support multi-output regression natively. If a model does not support this, sklearn's multioutput regression wrapper can be used to convert it; the MultiOutputRegressor class fits one regressor per target.
Does the MultiOutputRegressor class, or the natively supported multi-output regression algorithms, take the underlying relationship of the input variables into account?
Instead of a multi-output regression algorithm, should I use a neural network?
1) For your first question, I have divided the answer into two parts.
The first part is answered in the documentation you linked and also in this user-guide topic, which states explicitly that:
As MultiOutputRegressor fits one regressor per target it can not take advantage of correlations between targets.
The second part of the first question asks about other algorithms which support this. For that you can look at the "inherently multiclass" part of the user guide. Inherently multi-class means that the algorithm does not rely on a One-vs-Rest or One-vs-One strategy to handle multiple classes (OvO and OvR fit multiple models for the multiple classes, and so may not use the relationship between targets); instead it structures the multi-class setting into a single model. The guide lists the following:
sklearn.naive_bayes.BernoulliNB
sklearn.tree.DecisionTreeClassifier
sklearn.tree.ExtraTreeClassifier
sklearn.ensemble.ExtraTreesClassifier
sklearn.naive_bayes.GaussianNB
sklearn.neighbors.KNeighborsClassifier
sklearn.semi_supervised.LabelPropagation
sklearn.semi_supervised.LabelSpreading
sklearn.discriminant_analysis.LinearDiscriminantAnalysis
sklearn.svm.LinearSVC (setting multi_class=”crammer_singer”)
sklearn.linear_model.LogisticRegression (setting multi_class=”multinomial”)
...
...
...
Try replacing the 'Classifier' at the end with 'Regressor' and see the documentation of the fit() method there. For example, let's take DecisionTreeRegressor.fit():
y : array-like, shape = [n_samples] or [n_samples, n_outputs]
The target values (real numbers).
Use dtype=np.float64 and order='C' for maximum efficiency.
You see that it supports a 2-d array for the targets (y), so it may be able to use the correlations and underlying relationships between the targets.
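For instance, a minimal sketch of fitting a single tree directly on a 2-D target (synthetic data, illustrative parameters):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.random.rand(100, 3)                             # 3 input features
y = np.column_stack([X.sum(axis=1), X.prod(axis=1)])   # 2 output variables

# One tree handles both outputs natively; no wrapper needed.
reg = DecisionTreeRegressor(max_depth=5).fit(X, y)
print(reg.predict(X[:2]).shape)  # (2, 2): one row per sample, one column per target
```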
2) Now, for your second question about whether to use a neural network: it depends on personal preference, the type of problem, the amount and type of data you have, and the training iterations you want to do. Maybe you can try multiple algorithms and choose the one that gives the best output for your data and problem.

How to add oversampling/undersampling procedure in scikit's Pipeline?

I would like to add an oversampling procedure, like SMOTE oversampling, to scikit's Pipeline. But transformers only support the fit and transform methods, and do not provide a way to change the number of samples and targets.
One possible way to do this is to break the pipeline to two separate pipelines connected by SMOTE sampling.
Is there any better solutions?
Our current Pipeline does not support changing the number of samples between steps, as the Transformer.transform method does not return the y argument that would also need to be resampled. This is a known limitation of the current design. It might be fixed in a future version, but we have not started to work on that yet.
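As a side note, the separate imbalanced-learn package provides a Pipeline variant whose sampler steps are allowed to change the number of samples during fit (and are skipped at predict time); a minimal sketch, assuming imbalanced-learn is installed:

```python
from imblearn.pipeline import Pipeline   # imbalanced-learn, not sklearn
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('smote', SMOTE(random_state=0)),    # resamples X and y during fit only
    ('clf', LogisticRegression()),
])
# pipe.fit(X_train, y_train)
```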
