Does scikit-learn's DecisionTreeRegressor do true multi-output regression? - python

I have run into an ML problem that requires us to use a multi-dimensional Y. Right now we are training independent models on each dimension of this output, which does not take advantage of the additional information from the fact that the outputs are correlated.
I have been reading this to learn more about the few ML algorithms that have been truly extended to handle multi-dimensional outputs. Decision trees are one of them.
Does scikit-learn use "multi-target regression trees" when fit(X, Y) is given a multi-dimensional Y, or does it fit a separate tree for each output dimension? I spent some time looking at the code but couldn't figure it out.

After more digging, the only difference between a tree fitted on points labeled with a single-dimensional Y and one fitted on points with multi-dimensional labels is in the Criterion object it uses to decide splits. A Criterion can handle multi-dimensional labels, so the result of fitting a DecisionTreeRegressor is a single regression tree regardless of the dimension of Y.
This implies that, yes, scikit-learn does use true multi-target regression trees, which can leverage correlated outputs to positive effect.
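A quick way to confirm this from the public API (the toy dataset from make_regression is purely illustrative): a single fitted tree serves every output dimension at once.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Toy multi-output dataset with 2 targets
X, Y = make_regression(n_samples=200, n_features=5, n_targets=2, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(X, Y)

# One fitted tree predicts both targets at once
print(tree.predict(X[:3]).shape)   # (3, 2)
# Each node of the single tree stores one value per output dimension
print(tree.tree_.value.shape[1])   # 2
```

If separate trees were fitted per target, you would instead see a collection of estimators (as MultiOutputRegressor exposes via `estimators_`); here there is only one `tree_`.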

Related

How to generate a multi target regression dataset for different target ranges (or patterns)

We can generate a multi-target regression dataset using the make_regression() function of sklearn. Here, the number of targets is 2:
X, y = make_regression(n_samples=5000, n_features=10, n_informative=7, n_targets=2, random_state=1, noise=5)
Now, I want to make a multi-target dataset where the ranges (or patterns) of the target variables are different, so that different ML models fit and predict well for different targets.
Say I have 2 targets in a dataset. Target 1 might be fitted and predicted very well by Linear, Lasso, or Ridge regression, while target 2 is fitted and predicted well by RF, SVR, or KNN.
Any idea how I can make this type of dataset?
According to the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html), this function generates data from a linear regression model, which means data generated by it should be fitted well by linear methods. Perhaps Lasso would work better if the number of informative features is much smaller than the total number of features, because Lasso tends to create sparse models.
To generate data that linear models fit poorly but non-linear models such as RF, SVR, or KNN fit well, you will need to add non-linearity to the data. As an example approach, transforming y by some non-linear function such as sin(y) may work (I have not tried it, though).
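A rough sketch of that idea (the sin transform and the scaling by the standard deviation are my own illustrative choices, not a tested recipe):

```python
import numpy as np
from sklearn.datasets import make_regression

# Two-target dataset; target 0 stays linear in X, target 1 gets warped
X, y = make_regression(n_samples=5000, n_features=10, n_informative=7,
                       n_targets=2, random_state=1, noise=5)

# Non-linear transform of target 1 only; dividing by the standard deviation
# first keeps sin() from oscillating over many periods
y[:, 1] = np.sin(y[:, 1] / y[:, 1].std())
```

Linear models should still do well on y[:, 0] but poorly on the warped y[:, 1], while RF, SVR, or KNN can pick up the non-linear relationship.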

Is it necessary to use StandardScaler on y_train and y_test? If yes, cases?

I have read multiple cases where StandardScaler is used on y_train and y_test, and also cases where it is not. Are there any specific rules for when it should be used on them?
Quoting from here:
Standardization of a dataset is a common requirement for many machine
learning estimators: they might behave badly if the individual
features do not more or less look like standard normally distributed
data (e.g. Gaussian with 0 mean and unit variance).
For instance many elements used in the objective function of a
learning algorithm (such as the RBF kernel of Support Vector Machines
or the L1 and L2 regularizers of linear models) assume that all
features are centered around 0 and have variance in the same order. If
a feature has a variance that is orders of magnitude larger that
others, it might dominate the objective function and make the
estimator unable to learn from other features correctly as expected.
So, when your features (or targets) have different scales/distributions, you should probably standardize/scale them; for y, remember to inverse-transform the predictions back to the original scale afterwards.
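For scaling y specifically, a minimal sketch using scikit-learn's TransformedTargetRegressor (the SVR model and the synthetic data are illustrative choices): it standardizes y during fit and inverse-transforms predictions automatically, so you never scale y_test by hand.

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)
y = y * 1000.0  # pretend the target lives on a large scale

model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR()),  # features scaled here
    transformer=StandardScaler(),                      # y scaled here
)
model.fit(X, y)
preds = model.predict(X[:5])  # already back on the original y scale
```

This sidesteps the y_test question entirely: the target scaler is fitted on training targets only, and predictions come out unscaled.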

Multi-output regression

I have been looking into multi-output regression for the last few weeks. I am working with the scikit-learn package. My machine learning problem has an input of 3 features and needs to predict two output variables. Some ML models in the sklearn package support multi-output regression natively. If a model does not support this, sklearn's multi-output regression wrapper can be used to convert it. The MultiOutputRegressor class fits one regressor per target.
Does the MultiOutputRegressor class, or the natively supported multi-output regression algorithms, take the underlying relationships of the variables into account?
Should I use a neural network instead of a multi-output regression algorithm?
1) I have divided your first question into two parts.
The first part is answered in the documentation you linked and also in this user-guide topic, which states explicitly that:
As MultiOutputRegressor fits one regressor per target it can not take
advantage of correlations between targets.
The second part of your first question asks about other algorithms which support this. For that, look at the "Inherently multiclass" section of the user guide. Inherently multi-class means the estimator does not use a One-vs-Rest or One-vs-One strategy to handle multiple classes (OvO and OvR fit multiple models and so may not use the relationships between targets); instead, it handles the multi-class setting within a single model. The guide lists the following:
sklearn.naive_bayes.BernoulliNB
sklearn.tree.DecisionTreeClassifier
sklearn.tree.ExtraTreeClassifier
sklearn.ensemble.ExtraTreesClassifier
sklearn.naive_bayes.GaussianNB
sklearn.neighbors.KNeighborsClassifier
sklearn.semi_supervised.LabelPropagation
sklearn.semi_supervised.LabelSpreading
sklearn.discriminant_analysis.LinearDiscriminantAnalysis
sklearn.svm.LinearSVC (setting multi_class="crammer_singer")
sklearn.linear_model.LogisticRegression (setting multi_class="multinomial")
...
Try replacing the 'Classifier' at the end with 'Regressor' and look at the documentation of the fit() method there. For example, take DecisionTreeRegressor.fit():
y : array-like, shape = [n_samples] or [n_samples, n_outputs]
The target values (real numbers).
Use dtype=np.float64 and order='C' for maximum efficiency.
You can see that it accepts a 2-d array for the targets (y), so it may be able to use the correlations and underlying relationships between targets.
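The contrast between the two strategies can be sketched directly (the Ridge base estimator and the synthetic data are illustrative choices):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor
from sklearn.tree import DecisionTreeRegressor

X, Y = make_regression(n_samples=500, n_features=3, n_targets=2, random_state=0)

# Wrapper strategy: one independent Ridge model fitted per target,
# so correlations between the targets cannot be exploited
wrapped = MultiOutputRegressor(Ridge()).fit(X, Y)
print(len(wrapped.estimators_))        # 2 separate models

# Native strategy: a single tree fitted directly on the 2-d Y
native = DecisionTreeRegressor(random_state=0).fit(X, Y)
print(native.predict(X[:1]).shape)     # (1, 2) from one model
```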
2) As for your second question about whether to use a neural network: it depends on personal preference, the type of problem, the amount and kind of data you have, and the training iterations you want to run. Try multiple algorithms and choose the one that gives the best output for your data and problem.

Correlation among Hyperparameters of Classifiers

I am wondering whether there exists some correlation among the hyperparameters of two different classifiers.
For example: let us say that we run LogisticRegression on a dataset with best hyperparameters (by finding through GridSearch) and want to run another classifier like SVC (SVM classifier) on the same dataset but instead of finding all hyperparameters using GridSearch, can we fix some values (or reduce range to limit the search space for GridSearch) of hyperparameters?
As an experiment, I used scikit-learn classifiers such as LogisticRegression, SVC, LinearSVC, SGDClassifier, and Perceptron to classify some well-known datasets. In some cases I can see some correlation empirically, but not consistently across all datasets.
So please help me to clear this point.
I don't think you can correlate different parameters of different classifiers like this, mainly because each classifier behaves differently: each has its own way of fitting the data through its own set of equations. For example, take the case of SVC with two different kernels, rbf and sigmoid. It might be that rbf fits the data perfectly with the regularization parameter C set to, say, 0.001, while the sigmoid kernel over the same data fits well with a C value of 0.00001. The two values may also happen to be equal, but you can never say that for sure. When you say:
In some cases, I am able to see some correlation empirically, but not always for all datasets.
It may simply be a coincidence, since it all depends on the data and the classifiers; you cannot apply it globally. Correlation does not always equal causation.
You can visit this site and see for yourself that although different regressor functions share the same parameter a, their equations are vastly different, and hence over the same dataset you might get drastically different values of a.
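To make the point concrete, a small sketch (the dataset and the grid values are arbitrary): the best C has to be searched per kernel, and nothing guarantees the optima coincide.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Run an independent grid search for each kernel over the same C grid
best = {}
for kernel in ("rbf", "sigmoid"):
    grid = GridSearchCV(SVC(kernel=kernel),
                        {"C": [1e-3, 1e-1, 10.0, 1e3]}, cv=3)
    best[kernel] = grid.fit(X, y).best_params_["C"]

print(best)  # the two kernels may well land on different C values
```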

scikit-learn classifiers give varying results when one non-binary feature is added

I'm evaluating some machine learning models for a binary classification problem, and encountering weird results when adding one non-binary feature.
My dataset consists of tweets and some other values related to them, so the main feature vector is a sparse matrix (5000 columns) generated using scikit-learn's Tf-idf Vectoriser on the tweets and SelectKBest feature selection.
I have two other features I want to add, which are both 1-column dense matrices. I convert them to sparse format and use scipy's hstack function to append them to the main feature vector. The first of these features is binary, and when I add just that one all is good and I get accuracies of ~60%. However, the second feature takes integer values, and adding it causes varying results.
I am testing Logistic Regression, SVM (rbf), and Multinomial Naive Bayes. When I add the final feature, the SVM accuracy increases to 80%, but Logistic Regression now always predicts the same class, and MNB is also heavily skewed towards that class.
SVM confusion matrix
[[13112 3682]
[ 1958 9270]]
MNB confusion matrix
[[13403 9803]
[ 1667 3149]]
LR confusion matrix
[[15070 12952]
[ 0 0]]
Can anyone explain why this could be? I don't understand why this one extra feature makes two of the classifiers effectively useless while improving the other one so much. Thanks!
Sounds like your extra feature is non-linear. NB and LR both assume that the features are linear. SVM only assumes that the variables are linearly separable; intuitively, this means there is a "cut-off" value for your variable that the SVM is optimizing for. If you still want to use LR or NB, you could try transforming this variable to make it linear, or converting it to a binary indicator variable based on that threshold, which might improve your models' performance.
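A minimal sketch of the binary-indicator idea (the count values and the threshold of 10 are hypothetical; in practice, pick the cut-off from validation data):

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical integer-valued feature column (e.g. retweet counts)
counts = np.array([[0.0], [2.0], [15.0], [40.0], [3.0]])

# Values above the cut-off become 1, the rest 0
indicator = Binarizer(threshold=10).fit_transform(counts)
print(indicator.ravel())  # [0. 0. 1. 1. 0.]
```

The resulting binary column behaves like your first feature, which LR and MNB already handled well.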
Take a look at https://stats.stackexchange.com/questions/182329/how-to-know-whether-the-data-is-linearly-separable for some further reading.
