I am doing machine learning using scikit-learn as recommended in this question. To my surprise, it does not appear to provide access to the actual models it trains. For example, if I create an SVM, linear classifier or even a decision tree, it doesn't seem to provide a way for me to see the parameters selected for the actual trained model.
Seeing the actual model is useful if the model is being created partly to get a clearer picture of what features it is using (e.g., decision trees). Seeing the model is also a significant issue if one wants to use Python to train the model and some other code to actually implement it.
Am I missing something in scikit-learn, or is there some way to get at this in scikit-learn? If not, what is a good free machine learning workbench, not necessarily in Python, in which models are transparently available?
The fitted model parameters are stored directly as attributes on the model instance. There is a specific naming convention for those fitted parameters: they all end with a trailing underscore as opposed to user-provided constructor parameters (a.k.a. hyperparameters) which don't.
The type of the fitted attributes is algorithm-dependent. For instance, for a kernel Support Vector Machine you will have the arrays of support vectors, dual coefficients and intercepts (support_vectors_, dual_coef_ and intercept_), while for random forests and extremely randomized trees you will have a collection of binary trees (internally represented in memory as contiguous numpy arrays for performance reasons: a structure-of-arrays representation).
See the Attributes section of the docstring of each model for more details, for instance for SVC:
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
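As a minimal sketch of the naming convention described above, the snippet below fits an SVC on the iris dataset (chosen here only for illustration) and reads back its fitted attributes. The list comprehension at the end is just one convenient way to enumerate attributes with a trailing underscore, not an official scikit-learn API.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# kernel and C are constructor parameters (hyperparameters): no trailing underscore.
clf = SVC(kernel="rbf", C=1.0).fit(X, y)

# Fitted parameters are plain attributes ending with a trailing underscore.
print(clf.support_vectors_.shape)  # (n_support_vectors, n_features)
print(clf.dual_coef_.shape)        # (n_classes - 1, n_support_vectors)
print(clf.intercept_.shape)        # one intercept per pair of classes

# Enumerate the public fitted attributes set on the instance.
fitted = [name for name in vars(clf)
          if name.endswith("_") and not name.startswith("_")]
print(fitted)
```

These arrays are ordinary numpy arrays, so they can be saved and re-implemented in other code if needed.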
For tree-based models you also have a helper function, export_graphviz, to generate a graphviz export of the learned trees:
http://scikit-learn.org/stable/modules/tree.html#classification
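A short sketch of how a learned tree could be inspected, assuming a small decision tree trained on the iris dataset (an illustrative choice). It uses sklearn.tree.export_graphviz to get the DOT source as a string and also peeks at the underlying array representation through the tree_ attribute.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# With out_file=None the DOT description is returned as a string
# instead of being written to a file.
dot_source = export_graphviz(tree, out_file=None,
                             feature_names=iris.feature_names)
print(dot_source[:200])

# The raw structure-of-arrays representation: one entry per node.
print(tree.tree_.node_count)
print(tree.tree_.feature)    # index of the feature tested at each node
print(tree.tree_.threshold)  # split threshold at each node
```

The DOT string can be rendered with the Graphviz command line tools, or the arrays can be read directly when porting the tree to another language.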
To find the importance of features in forest models you should also have a look at the feature_importances_ attribute (older releases required passing the compute_importances parameter to populate it); see the following examples for instance:
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#example-ensemble-plot-forest-importances-py
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances_faces.html#example-ensemble-plot-forest-importances-faces-py
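For a quick look at feature importances without the plotting code from the linked examples, something like the following should work. The synthetic dataset from make_classification and the choice of ExtraTreesClassifier are assumptions made for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data: with shuffle=False the informative features come first.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)

# One importance value per feature, normalized to sum to 1.
importances = forest.feature_importances_
for idx in importances.argsort()[::-1]:
    print(f"feature {idx}: {importances[idx]:.3f}")
```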
When dealing with class imbalance issue, penalizing the majority class is a common practice that I have come across while building Machine Learning models. Hence, I often use class weights post re-sampling. LightGBM is one efficient decision tree based framework that is believed to handle class imbalance well. So I am using a LightGBM model for my binary classification problem. The dataset has high class imbalance in the ratio 34:1.
I initially used the LightGBM Classifier with the class_weight parameter. However, the documentation of the LightGBM Classifier says to use this parameter for multi-class problems only. For binary classification, it suggests using the is_unbalance or scale_pos_weight parameters. But by using class weights I see better results, and it is also easier to tune the weights and track the performance of the model than when using the other two parameters.
But since the documentation recommends not to use it for Binary Classification, are there any repercussions of using the parameter? I am getting good results with it on my test data and validation data, but I wonder if it will behave otherwise on other real time data?
Documentation recommends alternative parameters:
Use this parameter only for multi-class classification task; for binary classification task you may use is_unbalance or scale_pos_weight parameters.
https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html
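One way to compare the two settings on an equal footing is to look at the relative weight each puts on the positive class. The sketch below does not call LightGBM at all: it only uses scikit-learn's compute_class_weight on made-up labels with the 34:1 ratio from the question, and it makes no claim that the two parameters produce identical models.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with a 34:1 negative-to-positive ratio.
y = np.array([0] * 3400 + [1] * 100)
classes = np.array([0, 1])

# "balanced" weights: n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)

# Relative weight of the positive class implied by the class weights ...
relative_pos_weight = class_weight[1] / class_weight[0]
# ... versus the usual negative/positive count ratio used for scale_pos_weight.
neg_over_pos = (y == 0).sum() / (y == 1).sum()
print(relative_pos_weight, neg_over_pos)
```

If tuned class weights give a relative positive weight far from the raw count ratio, that difference, rather than the parameter name, is a plausible source of the different results.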
I know that it is possible, for example using TensorFlow but also in PyTorch or whatever, to store an instance of a trained (or in training) model in a way that it can be loaded in future, or loaded by another machine, or just to use it as a checkpoint during the training.
What I wonder is whether there is any way, similar to the one mentioned above, to store the difference (maybe not exactly the algebraic subtraction but a similar concept, always referring to operations on tensors) between two instances of the same neural network (same architecture, different weights) for efficiency purposes.
If you are wondering why this would be convenient, consider a hypothetical setting where there are different entities and all of them know a model instance (a "shared model"), so using the "difference" calculated with respect to this shared model could be useful in terms of storage space or in terms of bandwidth (if the local model parameters have to be sent over the Internet to another machine).
The hypothesis is that it is possible to reconstruct a model knowing the shared model and the "difference" with the model to reconstruct.
Summarizing my questions:
Are there any built-in features in TensorFlow, PyTorch, etc. to do this?
In your opinion, would it be convenient to do something like that? If not, why?
PS: In literature, this concept exists and it has been recently explored within the "Federated Learning" topic, and the "difference" I mentioned is called update.
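To make the idea concrete, here is a framework-agnostic sketch in plain numpy, assuming each model's weights are available as a dictionary mapping layer names to arrays. The helper names compute_update and apply_update and the layer names are hypothetical, not functions of TensorFlow or PyTorch.

```python
import numpy as np

def compute_update(shared, local):
    """Per-tensor difference between a local model and the shared model."""
    return {name: local[name] - shared[name] for name in shared}

def apply_update(shared, update):
    """Reconstruct the local model from the shared model and the update."""
    return {name: shared[name] + update[name] for name in shared}

rng = np.random.default_rng(0)

# Hypothetical weights: same architecture, slightly different values.
shared = {"dense/kernel": rng.normal(size=(4, 3)), "dense/bias": np.zeros(3)}
local = {name: w + 0.01 * rng.normal(size=w.shape)
         for name, w in shared.items()}

update = compute_update(shared, local)
rebuilt = apply_update(shared, update)

print(all(np.allclose(rebuilt[name], local[name]) for name in local))
```

Note that a dense difference has the same size as the weights themselves, so any saving in storage or bandwidth would have to come from the update being sparse, low-precision or otherwise compressible.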
I have been looking into multi-output regression for the last few weeks. I am working with the scikit-learn package. My machine learning problem has an input of 3 features and needs to predict two output variables. Some ML models in the sklearn package support multi-output regression natively. If a model does not support this, the sklearn MultiOutputRegressor can be used to convert it. The multioutput class fits one regressor per target.
Does the MultiOutputRegressor class, or do the algorithms that support multi-output regression natively, take the underlying relationship of the input variables into account?
Instead of a multi-output regression algorithm should I use a Neural network?
1) For your first question, I have divided that into two parts.
First part has the answer written in the documentation you linked and also in this user guide topic, which states explicitly that:
As MultiOutputRegressor fits one regressor per target it can not take
advantage of correlations between targets.
The second part of the first question asks about other algorithms which support this. For that you can look at the "inherently multiclass" part in the user guide. Inherently multi-class means that they don't use a One-vs-Rest or One-vs-One strategy to be able to handle multi-class (OvO and OvR use multiple models to fit multiple classes and so may not use the relationship between targets). Inherently multi-class means that they can structure the multi-class setting into a single model. This lists the following:
sklearn.naive_bayes.BernoulliNB
sklearn.tree.DecisionTreeClassifier
sklearn.tree.ExtraTreeClassifier
sklearn.ensemble.ExtraTreesClassifier
sklearn.naive_bayes.GaussianNB
sklearn.neighbors.KNeighborsClassifier
sklearn.semi_supervised.LabelPropagation
sklearn.semi_supervised.LabelSpreading
sklearn.discriminant_analysis.LinearDiscriminantAnalysis
sklearn.svm.LinearSVC (setting multi_class="crammer_singer")
sklearn.linear_model.LogisticRegression (setting multi_class="multinomial")
...
...
...
Try replacing the 'Classifier' at the end with 'Regressor' and see the documentation of fit() method there. For example let's take DecisionTreeRegressor.fit():
y : array-like, shape = [n_samples] or [n_samples, n_outputs]
The target values (real numbers).
Use dtype=np.float64 and order='C' for maximum efficiency.
You see that it supports a 2-d array for targets (y). So it may be able to use correlation and underlying relationship of targets.
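The two approaches can be put side by side as follows. This is only a sketch on synthetic data with 3 features and 2 targets; GradientBoostingRegressor is picked here as an example of an estimator that needs the MultiOutputRegressor wrapper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))                               # 3 input features
Y = np.column_stack([X[:, 0] + X[:, 1], X[:, 0] * X[:, 2]])  # 2 targets

# Native multi-output: a single tree whose splits are chosen using both targets.
native = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, Y)
print(native.n_outputs_)
print(native.predict(X[:5]).shape)

# Wrapper: one independent regressor is fitted per target column.
wrapped = MultiOutputRegressor(
    GradientBoostingRegressor(n_estimators=20, random_state=0)
).fit(X, Y)
print(len(wrapped.estimators_))
print(wrapped.predict(X[:5]).shape)
```

Whether the single multi-output tree actually predicts better than independent models depends on the data, so comparing both with cross-validation is the safest check.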
2) Now for your second question about using neural network or not, it depends on personal preference, the type of problem, the amount and type of data you have, the training iterations you want to do. Maybe you can try multiple algorithms and choose what gives best output for your data and problem.
While running in production, is it possible to update a trained model with new data without re-fitting the model? I see you can use the warm_start parameter to enable adding trees to the model; however, I am looking for a way to update the existing trees with the incoming data.
As far as I can tell, this is not possible with sklearn (as they seem to implement the classical Breiman algorithm). However, you might have a look at Mondrian Forests (https://papers.nips.cc/paper/5234-mondrian-forests-efficient-online-random-forests.pdf, python implementation: https://github.com/balajiln/mondrianforest).
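For reference, the warm_start behaviour mentioned in the question can be sketched as below, using random data purely for illustration. It shows that refitting with a larger n_estimators adds new trees trained on the new data while the previously fitted trees are kept as they were, which is exactly why it does not update existing trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_old, y_old = rng.uniform(size=(200, 3)), rng.uniform(size=200)
X_new, y_new = rng.uniform(size=(50, 3)), rng.uniform(size=50)

forest = RandomForestRegressor(n_estimators=10, warm_start=True,
                               random_state=0)
forest.fit(X_old, y_old)
old_trees = list(forest.estimators_)

# Ask for 5 more trees; only these are fitted, and only on the new data.
forest.set_params(n_estimators=15)
forest.fit(X_new, y_new)

print(len(forest.estimators_))
# The first 10 trees are the very same objects as before the second fit.
print(all(a is b for a, b in zip(old_trees, forest.estimators_[:10])))
```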
In my machine learning class, we have learned about appending a 1 to each sample's feature vector when using many machine learning models to account for bias. For example, if we are doing linear regression and a sample has features f_1, f_2, ..., f_d, we need to add a "fake" feature value of 1 to allow for the regression function to not have to pass through the origin.
When using sklearn models, do you need to do this yourself, or do their implementations do it for you? Specifically, I'm interested in whether or not this is necessary when using any of their regression models or their SVM models.
No, you do not need to add a bias feature yourself; each model handles the bias in its own way. What you learned in your course is a generic, although not perfect, solution. The distinction matters for models such as SVMs, which should never have a "1" appended to the features, because the bias would then be regularized, which is simply wrong for SVMs. So while this is a nice theoretical trick showing that methods can be built that ignore the bias entirely, in practice the bias is often treated in a specific way, and scikit-learn does that for you.
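A small sketch of what this looks like in practice, assuming toy data generated from y = 2x + 5: no column of ones is added to X, and the bias is read back from the intercept_ attribute. The SVC part uses an arbitrary threshold on y just to obtain two classes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC

# Toy data from y = 2x + 5; note that X has no extra column of ones.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 5.0

# fit_intercept=True is the default, so the bias is estimated for you.
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)

# SVMs also keep their bias separately, in intercept_.
labels = (y > y.mean()).astype(int)
svm = SVC(kernel="linear").fit(X, labels)
print(svm.coef_, svm.intercept_)
```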