Currently the Python API does not yet support multiclass classification within Spark, but it will in the future, as described on the Spark page.
Is there any release date, or any way to run multiclass logistic regression with Python in the meantime? I know it works with Scala, but I would like to run it with Python. Thank you.
scikit-learn's LogisticRegression offers a multi_class parameter. From the docs:
Multiclass option can be either ‘ovr’ or ‘multinomial’. If the option
chosen is ‘ovr’, then a binary problem is fit for each label. Else the
loss minimised is the multinomial loss fit across the entire
probability distribution. Works only for the ‘lbfgs’ solver.
Hence, multi_class='ovr' seems to be the right choice for you.
For more information, see this link.
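For illustration, a minimal sketch of the one-vs-rest option on scikit-learn's built-in iris data (three classes; the solver choice is just one that supports 'ovr'):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # three classes: 0, 1, 2

# 'ovr' fits one binary problem per label
clf = LogisticRegression(multi_class='ovr', solver='liblinear')
clf.fit(X, y)
print(clf.predict(X[:5]))
```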
Added:
As per the pyspark documentation, you can still do multiclass classification using its API: the class pyspark.mllib.classification.LogisticRegressionWithLBFGS accepts an optional numClasses parameter for multiclass classification.
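A minimal sketch of how that looks (the app name and toy data are illustrative; in a real job you would parallelize your own LabeledPoint RDD):

```python
from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="multiclass-lr")  # hypothetical app name

# toy three-class dataset
data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
    LabeledPoint(2.0, [1.0, 1.0]),
])

# numClasses > 2 switches on multinomial logistic regression
model = LogisticRegressionWithLBFGS.train(data, numClasses=3)
print(model.predict([1.0, 1.0]))
```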
When dealing with class imbalance, penalizing the majority class is a common practice that I have come across while building machine learning models, so I often use class weights after re-sampling. LightGBM is one efficient decision-tree-based framework that is believed to handle class imbalance well, so I am using a LightGBM model for my binary classification problem. The dataset has high class imbalance, in the ratio 34:1.
I initially used the LightGBM classifier with the class_weight parameter. However, the LGBMClassifier documentation says to use this parameter for multi-class problems only; for binary classification, it suggests using the is_unbalance or scale_pos_weight parameters. But with class weights I see better results, and it is also easier to tune the weights and track the model's performance than with the other two parameters.
Since the documentation recommends against using it for binary classification, are there any repercussions of using the parameter anyway? I am getting good results with it on my test and validation data, but I wonder if it will behave differently on other real-world data.
The documentation recommends alternative parameters:
Use this parameter only for multi-class classification task; for binary classification task you may use is_unbalance or scale_pos_weight parameters.
https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html
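For comparison, a minimal sketch of the two documented alternatives on toy data (the weight value 34 simply mirrors the imbalance ratio from the question; real data would replace the random arrays):

```python
import numpy as np
import lightgbm as lgb

rng = np.random.RandomState(0)
X = rng.rand(350, 5)
y = (rng.rand(350) < 1 / 35).astype(int)  # roughly 34:1 imbalance

# weight positives by the negative:positive ratio
clf = lgb.LGBMClassifier(scale_pos_weight=34).fit(X, y)

# or let LightGBM derive the weighting itself
clf = lgb.LGBMClassifier(is_unbalance=True).fit(X, y)
```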
I have a binary classification problem which I'm trying to solve using LightGBM's train and cv APIs.
First I tuned the hyperparameters using hyperopt together with an objective function that wraps the LightGBM CV API call. Since the target classes are highly unbalanced, I used a customized focal loss function with F1-score evaluation to find the best fit.
When I try to fit the final model using the optimized parameters, the model doesn't treat the problem as binary and outputs continuous values at prediction time. See the attached image.
Does anyone know what I'm missing?
Jupyter notebook
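One thing worth checking (a guess, since the notebook itself isn't reproduced here): when lightgbm.train is given a custom objective such as a focal loss, the booster no longer applies the sigmoid link itself, so predict() returns raw margin scores rather than probabilities. A minimal sketch of mapping them back, assuming a fitted booster bst and a feature matrix X_test (both hypothetical names standing in for the notebook's variables):

```python
import numpy as np

def sigmoid(z):
    # inverse of the logit link assumed by a binary focal loss
    return 1.0 / (1.0 + np.exp(-z))

raw_scores = bst.predict(X_test)      # raw margins under a custom objective
proba = sigmoid(raw_scores)           # probabilities in [0, 1]
labels = (proba > 0.5).astype(int)    # hard binary predictions
```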
I have been looking into multi-output regression for the last few weeks. I am working with the scikit-learn package. My machine learning problem has an input of 3 features and needs to predict two output variables. Some ML models in the sklearn package support multi-output regression natively. If a model does not, sklearn's multi-output regression wrapper can be used to convert it; the multioutput class fits one regressor per target.
Does the multioutput regressor class, or the natively supported multi-output regression algorithms, take the underlying relationship of the input variables into account?
Instead of a multi-output regression algorithm, should I use a neural network?
1) I have divided your first question into two parts.
The first part is answered in the documentation you linked and also in this user-guide topic, which states explicitly that:
As MultiOutputRegressor fits one regressor per target it can not take
advantage of correlations between targets.
The second part of your first question asks which other algorithms support this. For that, look at the "inherently multiclass" part of the user guide. Inherently multiclass means the algorithm does not rely on a one-vs-rest or one-vs-one strategy to handle multiple classes (OvO and OvR fit multiple models to cover the classes, and so may not use the relationship between targets); instead, it handles the multiclass setting within a single model. It lists the following:
sklearn.naive_bayes.BernoulliNB
sklearn.tree.DecisionTreeClassifier
sklearn.tree.ExtraTreeClassifier
sklearn.ensemble.ExtraTreesClassifier
sklearn.naive_bayes.GaussianNB
sklearn.neighbors.KNeighborsClassifier
sklearn.semi_supervised.LabelPropagation
sklearn.semi_supervised.LabelSpreading
sklearn.discriminant_analysis.LinearDiscriminantAnalysis
sklearn.svm.LinearSVC (setting multi_class=”crammer_singer”)
sklearn.linear_model.LogisticRegression (setting multi_class=”multinomial”)
...
Try replacing the 'Classifier' at the end with 'Regressor' and see the documentation of the fit() method there. For example, take DecisionTreeRegressor.fit():
y : array-like, shape = [n_samples] or [n_samples, n_outputs]
The target values (real numbers).
Use dtype=np.float64 and order='C' for maximum efficiency.
You can see that it supports a 2-d array for the targets (y), so it may be able to use the correlations and underlying relationships between targets.
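To make the distinction concrete, here is a small sketch on toy data contrasting a natively multi-output DecisionTreeRegressor with the MultiOutputRegressor wrapper (the data and targets are purely illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 3)                      # 3 input features
y = np.column_stack([X.sum(axis=1),       # 2 related targets
                     X.sum(axis=1) ** 2])

# native multi-output: a single model sees both targets at once
tree = DecisionTreeRegressor().fit(X, y)

# wrapper: one independent regressor fitted per target
wrapped = MultiOutputRegressor(LinearRegression()).fit(X, y)

print(tree.predict(X[:2]).shape, wrapped.predict(X[:2]).shape)  # (2, 2) each
```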
2) As for your second question about whether to use a neural network: that depends on personal preference, the type of problem, the amount and kind of data you have, and the training effort you are willing to invest. You could try multiple algorithms and choose whichever gives the best results for your data and problem.
I have been creating logistic regression models in R and was trying out the same in Python, where I noticed it does not show the F-statistic, adjusted R-squared, etc. We have a test to check the accuracy of the model, and that's it. Is that how model fitness is usually checked in Python?
There are plenty of methods to check a model's performance in Python (F1-score, precision, recall, accuracy); it depends on the library you are using.
For instance, scikit-learn is one such library (not the only option). This link lists the metrics available for performance measurement in scikit-learn: http://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics
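As a small illustration with scikit-learn (using its built-in breast-cancer dataset; any fitted classifier would do in place of the logistic regression here):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, classification_report

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(f1_score(y_test, y_pred))
print(classification_report(y_test, y_pred))  # precision/recall per class
```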
I ran a program with Optunity to find the hyperparameters of an SVM without deciding the kernel first, as shown here: http://optunity.readthedocs.io/en/latest/notebooks/notebooks/sklearn-svc.html#tune-svc-without-deciding-the-kernel-in-advance. It ran, but when I replaced the data and labels with multiclass information it raises an error. Why is this happening?
Optunity uses ROC-AUC for selecting optimal hyperparameters, and AUC is only defined for binary problems, so it can't be estimated directly for multiclass data. Some workarounds for using Optunity on multiclass problems are:
Use accuracy as the criterion for selecting optimal parameters rather than AUC (a sketch follows below).
Convert the multiclass problem into binary ones and select optimal parameters for each classifier.
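A minimal sketch of the first workaround, adapted from the style of the linked notebook but scored with cross-validated accuracy instead of ROC-AUC (the search ranges and fold count are illustrative):

```python
import optunity
import optunity.metrics
from sklearn.svm import SVC
from sklearn.datasets import load_iris

data, labels = load_iris(return_X_y=True)  # a 3-class problem

# score each candidate parameter set by cross-validated accuracy
@optunity.cross_validated(x=data, y=labels, num_folds=5)
def svm_acc(x_train, y_train, x_test, y_test, logC, logGamma):
    model = SVC(C=10 ** logC, gamma=10 ** logGamma).fit(x_train, y_train)
    return optunity.metrics.accuracy(y_test, model.predict(x_test))

best_pars, _, _ = optunity.maximize(svm_acc, num_evals=100,
                                    logC=[-5, 2], logGamma=[-5, 1])
print(best_pars)
```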