Need to append bias term when using `sklearn` models?

Need to append bias term when using `sklearn` models? - python

In my machine learning class, we have learned about appending a 1 to each sample's feature vector when using many machine learning models to account for bias. For example, if we are doing linear regression and a sample has features f_1, f_2, ..., f_d, we need to add a "fake" feature value of 1 to allow for the regression function to not have to pass through the origin.
When using sklearn models, do you need to do this yourself, or do their implementations do it for you? Specifically, I'm interested in whether or not this is necessary when using any of their regression models or their SVM models.

No, you do not add any biases, models define biases in their own way. What you learned during course is generic, although not perfect - solution. It matters for models such as SVM, which should not ever have appended "1"s, as then this bias would get regularized, which is simply wrong for SVMs. Thus, while this is nice theoretical trick to show that you can actually create methods completely ignoring bias, in practise - it is often treated in a specific way, and scikit-learn does it for you.

Related

How can you predict a combination of categorical and continuous variables with Scikit learn?

I have a dataset with a large number of predictive variables and I want to use them to predict a number of output variables. However, some of the things I want to predict are categorical, and others are continuous; the things I want to predict are not independent. Is it possible with scikit-learn to, for example, mix a classifier and a regressor so that I can predict and disentangle these variables? (I'm currently looking at gradient boosting classifiers/regressors, but there may be better options.)

You can certainly use One Hot Encoding or Dummy Variable Encoding, to convert labels to numerics. See the link below for all details.
https://codefires.com/how-convert-categorical-data-numerical-data-python/
As an aside, Random Forest is a popular machine learning model that is commonly used for classification tasks as can be seen in many academic papers, Kaggle competitions, and blog posts. In addition to classification, Random Forests can also be used for regression tasks. A Random Forest’s nonlinear nature can give it a leg up over linear algorithms, making it a great option. However, it is important to know your data and keep in mind that a Random Forest can’t extrapolate. It can only make a prediction that is an average of previously observed labels. In this sense it is very similar to KNN. In other words, in a regression problem, the range of predictions a Random Forest can make is bound by the highest and lowest labels in the training data. This behavior becomes problematic in situations where the training and prediction inputs differ in their range and/or distributions. This is called covariate shift and it is difficult for most models to handle but especially for Random Forest, because it can’t extrapolate.
https://towardsdatascience.com/a-limitation-of-random-forest-regression-db8ed7419e9f
https://stackabuse.com/random-forest-algorithm-with-python-and-scikit-learn
In closing, Scikit-learn uses numpy matrices as inputs to its models. As such all features become de facto numerical (if you have categorical feature you’ll need to convert them to numerical).

I don't think there's a builtin way. There are ClassifierChain and RegressorChain that allow you to use earlier predictions as features in later predictions, but as the names indicate they assume either classification or regression. Two options come to mind:
Manually patch those together for what you want to do. For example, use a ClassifierChain to predict each of your categorical targets using just the independent features, then add those predictions to the dataset before training a RegressorChain with the numeric targets.
Use those classes as a base for defining a custom estimator. In that case you'll probably look mostly at their common parent class _BaseChain. Unfortunately that also uses a single estimator attribute, whereas you'd need (at least) two, one classifier and one regressor.

Multi-output regression

I have been looking in to Multi-output regression the last view weeks. I am working with the scikit learn package. My machine learning problem has an a input of 3 features an needs to predict two output variables. Some ML models in the sklearn package support multioutput regression nativly. If the models do not support this, the sklearn multioutput regression algorithm can be used to convert it. The multioutput class fits one regressor per target.
Does the mulioutput regressor class or supported multi-output regression algorithms take the underlying relationship of the input variables in to account?
Instead of a multi-output regression algorithm should I use a Neural network?

1) For your first question, I have divided that into two parts.
First part has the answer written in the documentation you linked and also in this user guide topic, which states explicitly that:
As MultiOutputRegressor fits one regressor per target it can not take
advantage of correlations between targets.
Second part of first question asks about other algorithms which support this. For that you can look at the "inherently multiclass" part in the user-guide. Inherently multi-class means that they don't use One-vs-Rest or One-vs-One strategy to be able to handle multi-class (OvO and OvR uses multiple models to fit multiple classes and so may not use the relationship between targets). Inherently multi-class means that they can structure the multi-class setting into a single model. This lists the following:
sklearn.naive_bayes.BernoulliNB
sklearn.tree.DecisionTreeClassifier
sklearn.tree.ExtraTreeClassifier
sklearn.ensemble.ExtraTreesClassifier
sklearn.naive_bayes.GaussianNB
sklearn.neighbors.KNeighborsClassifier
sklearn.semi_supervised.LabelPropagation
sklearn.semi_supervised.LabelSpreading
sklearn.discriminant_analysis.LinearDiscriminantAnalysis
sklearn.svm.LinearSVC (setting multi_class=”crammer_singer”)
sklearn.linear_model.LogisticRegression (setting multi_class=”multinomial”)
...
...
...
Try replacing the 'Classifier' at the end with 'Regressor' and see the documentation of fit() method there. For example let's take DecisionTreeRegressor.fit():
y : array-like, shape = [n_samples] or [n_samples, n_outputs]
The target values (real numbers).
Use dtype=np.float64 and order='C' for maximum efficiency.
You see that it supports a 2-d array for targets (y). So it may be able to use correlation and underlying relationship of targets.
2) Now for your second question about using neural network or not, it depends on personal preference, the type of problem, the amount and type of data you have, the training iterations you want to do. Maybe you can try multiple algorithms and choose what gives best output for your data and problem.

How to get predictions out of tensorflow model after you've used tf.group on your optimizers

I'm trying to write something similar to google's wide and deep learning after running into difficulties of doing multi-class classification(12 classes) with the sklearn api. I've tried to follow the advice in a couple of posts and used the tf.group(logistic_regression_optimizer, deep_model_optimizer). It seems to work but I was trying to figure out how to get predictions out of this model. I'm hoping that with the tf.group operator the model is learning to weight the logistic and deep models differently but I don't know how to get these weights out so I can get the right combination of the two model's predictions. Thanks in advance for any help.
https://groups.google.com/a/tensorflow.org/forum/#!topic/discuss/Cs0R75AGi8A
How to set layer-wise learning rate in Tensorflow?

tf.group() creates a node that forces a list of other nodes to run using control dependencies. It's really just a handy way to package up logic that says "run this set of nodes, and I don't care about their output". In the discussion you point to, it's just a convenient way to create a single train_op from a pair of training operators.
If you're interested in the value of a Tensor (e.g., weights), you should pass it to session.run() explicitly, either in the same call as the training step, or in a separate session.run() invocation. You can pass a list of values to session.run(), for example, your tf.group() expression, as well as a Tensor whose value you would like to compute.
Hope that helps!

changing the bias parameter b in scikit SVM after training, before prediction

I'm using sklearn.svm.SVC for a classification problem. After having trained on my data, I would like to loop the bias (i.e. the term b in the usual sign(w.x + b) SVM equation) through a number of values, so as to produce an ROC curve. (I've already performed cross-validation and chosen my hyperparameters, so this is for testing).
I tried playing with the .intercept_ attribute, but this doesn't change what I get out from .predict()... Is there an alternative method for altering the bias term?
I could potentially recover the support vectors, and then implement my own .predict() function, with an altered bias, but this seems like a rather heavy-handed approach.

I had a very same problem 2 years ago. Unfortunately the only solution is to do this by yourself. Implementing "predict" is pretty straight forward, it is a one-liner in Python. Unfortunately .intercept_ is actually a copy of intercept used internally (the libsvm one). Quite confusing thing is that for LinearSVC from the very same library it is not true, and you can actually alternate the bias (however, without access to kernels, obviously).
Obviously you do not have to go as deep as computing kernels values yourself. You still have access to "decision_function", which in the end, has a bias inside. Simply remove the old bias from decision function, add new one, and take the sign. This will be (up to the sign of bias):
def new_predict(clf, new_bias, X):
return np.sign(clf.decision_function(X) + clf.intercept_ - new_bias)

Scikit-learn model parameters unavailable? If so what ML workbench alternative?

I am doing machine learning using scikit-learn as recommended in this question. To my surprise, it does not appear to provide access to the actual models it trains. For example, if I create an SVM, linear classifier or even a decision tree, it doesn't seem to provide a way for me to see the parameters selected for the actual trained model.
Seeing the actual model is useful if the model is being created partly to get a clearer picture of what features it is using (e.g., decision trees). Seeing the model is also a significant issue if one wants to use Python to train the model and some other code to actually implement it.
Am I missing something in scikit-learn or is there some way to get at this in scikit-learn? If not, what is the a good free machine learning workbench, not necessarily in python, in which models are transparently available?

The fitted model parameters are stored directly as attributes on the model instance. There is a specific naming convention for those fitted parameters: they all end with a trailing underscore as opposed to user-provided constructor parameters (a.k.a. hyperparameters) which don't.
The type of the fitted attributes is algorithm-dependent. For instance for a kernel Support Vector Machine you will have the arrays support vectors, dual coefs and intercepts while for random forests and extremly randomized trees you will have a collection of binary trees (internally represented in memory as contiguous numpy arrays for performance matters: structure of arrays representation).
See the Attributes section of the docstring of each model for more details, for instance for SVC:
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
For tree based models you also have a helper function to generate a graphivz_export of the learned trees:
http://scikit-learn.org/stable/modules/tree.html#classification
To find the importance of features in forests models you should also have a look at the compute_importances parameter, see the following examples for instance:
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#example-ensemble-plot-forest-importances-py
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances_faces.html#example-ensemble-plot-forest-importances-faces-py

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.