I have been building logistic regression models in R and was trying the same in Python, where I noticed it does not show the F-statistic, adjusted R-squared, etc. All I can find is a way to score the model's accuracy, and that's it. Is that how model fitness is usually checked in Python?
There are plenty of methods to check a model's performance in Python (F1-score, precision, recall, accuracy); it depends on the library you are using.
For instance, scikit-learn is one library (not the only option). This link lists the possible performance metrics in scikit-learn: http://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics
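A minimal sketch of those scikit-learn metrics; the toy dataset and model below are illustrative stand-ins, not from the question:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Toy binary dataset standing in for your own data.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
```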
I am using the PyCaret library and created a CatBoost model with it.
The model has a great AUC score but pretty bad recall and F1, which suggests the default threshold of 0.5 is not ideal and that some other threshold would give a good score on both of those metrics.
Is there any way to find this threshold? I am not sure how to go about it, since I am new to PyCaret.
Which threshold do you mean? For feature selection? You can try several adjustments to improve the model relative to your baseline in the picture above; a direct search for a probability threshold is sketched after this list.
compare_models() - maybe there are other algorithms that perform better than CatBoost
Feature selection - RFE or random forest (here you can use the feature_selection parameter in PyCaret and play with its threshold; the Boruta algorithm is worth checking as well)
Feature engineering
fold=5
Try several train/test splits (80/20, 70/30, etc.)
In the PyCaret setup, double-check which features are treated as numerical and which as categorical; change the type when needed
Try compare_models() again after these changes
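On the threshold question itself, a minimal sketch of a direct threshold search with scikit-learn, assuming you can pull the fitted model out of PyCaret and get predicted probabilities on a validation split (model, X_val, and y_val are placeholders):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_proba):
    """Scan candidate thresholds and return the one maximizing F1."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_proba)
    # precision/recall have one more entry than thresholds; drop the last
    # so the arrays line up, and guard against division by zero.
    f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
    best = np.argmax(f1)
    return thresholds[best], f1[best]

# Usage, with placeholders for your validation split and fitted model:
# threshold, f1 = best_f1_threshold(y_val, model.predict_proba(X_val)[:, 1])
```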
When dealing with class imbalance, penalizing the majority class is a common practice I have come across while building machine learning models, so I often use class weights after re-sampling. LightGBM is an efficient decision-tree-based framework that is believed to handle class imbalance well, so I am using a LightGBM model for my binary classification problem. The dataset has high class imbalance, at a ratio of 34:1.
I initially used the LightGBM classifier with the class_weight parameter. However, the LightGBM documentation says to use this parameter only for multi-class problems; for binary classification it suggests the is_unbalance or scale_pos_weight parameters. But with class weights I see better results, and it is also easier to tune the weights and track the model's performance than with the other two parameters.
Since the documentation recommends against using it for binary classification, are there any repercussions to doing so? I am getting good results with it on my test and validation data, but I wonder whether it will behave differently on other real-world data.
The documentation recommends the alternative parameters:
Use this parameter only for multi-class classification task; for binary classification task you may use is_unbalance or scale_pos_weight parameters.
https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html
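For reference, a minimal sketch of the two documented alternatives; the value 34.0 is just the imbalance ratio quoted in the question, used as a starting point:

```python
import lightgbm as lgb

# Option 1: let LightGBM reweight the classes automatically.
clf_auto = lgb.LGBMClassifier(objective="binary", is_unbalance=True)

# Option 2: set the positive-class weight explicitly, typically
# n_negative / n_positive (roughly 34 for a 34:1 imbalance).
clf_manual = lgb.LGBMClassifier(objective="binary", scale_pos_weight=34.0)

# Either model is then fitted as usual: clf_auto.fit(X_train, y_train)
```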
I have a binary classification problem that I'm trying to solve using LightGBM's train and cv APIs.
First I tuned the hyperparameters using hyperopt together with an objective function that wraps the LightGBM cv API call. Since the target classes are highly imbalanced, I used a customized focal loss function with F1-score evaluation to find the best fit.
When I try to fit the final model with the optimized parameters, the model doesn't treat it as a binary problem and outputs continuous values at prediction time. See the attached image.
Does anyone know what I'm missing?
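One likely cause, sketched under the assumption that the final model was fitted via lgb.train with the custom focal loss: with a custom objective, predict() returns raw margin scores rather than probabilities, so the sigmoid has to be applied manually (booster and X_test below are placeholders for the fitted model and test features):

```python
import numpy as np

raw_scores = booster.predict(X_test)        # continuous raw margins
proba = 1.0 / (1.0 + np.exp(-raw_scores))   # map margins to [0, 1]
labels = (proba >= 0.5).astype(int)         # then threshold as usual
```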
Currently the Python API does not yet support multi-class classification within Spark, but it will in the future, as described on the Spark page.
Is there any release date, or any way to already run multi-class logistic regression with Python? I know it works with Scala, but I would like to run it with Python. Thank you.
scikit-learn's LogisticRegression offers a multi_class parameter. From the docs:
Multiclass option can be either ‘ovr’ or ‘multinomial’. If the option chosen is ‘ovr’, then a binary problem is fit for each label. Else the loss minimised is the multinomial loss fit across the entire probability distribution. Works only for the ‘lbfgs’ solver.
Hence, multi_class='ovr' seems to be the right choice for you.
For more information: see this link
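A minimal sketch of that option, using the iris data as a stand-in for a multi-class problem:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# multi_class='ovr' fits one binary classifier per class.
clf = LogisticRegression(multi_class="ovr", solver="liblinear").fit(X, y)
print(clf.predict(X[:5]))
```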
Added:
As per the PySpark documentation, you can still do multi-class classification using their API: the class pyspark.mllib.classification.LogisticRegressionWithLBFGS offers an optional numClasses parameter for multi-class classification.
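A minimal sketch, assuming an existing SparkContext named sc and a toy three-class dataset:

```python
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

# Toy three-class dataset as an RDD of LabeledPoint rows.
data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
    LabeledPoint(2.0, [1.0, 1.0]),
])

# numClasses switches the model to multinomial logistic regression.
model = LogisticRegressionWithLBFGS.train(data, numClasses=3)
print(model.predict([1.0, 0.0]))
```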
Is there any implementation of incremental SVM that also returns the probability of a given feature vector belonging to the various classes? Preferably one usable from Python code.
I have heard about LaSVM. Does LaSVM return probability estimates? And does it have features for handling imbalanced training datasets?
You can have a look at scikit-learn, a very flexible and efficient library written in Python.
Every fitted model stores its internally calculated values. If clf is your SVM classifier, you can call clf.decision_function to get the signed scores behind its predictions.
It also provides a good set of tools for preprocessing data, among other things you may find interesting.
Cheers,
For probability estimates you can use the scikit-learn library. There are two alternatives. One gives probabilities; here is an example: How to know what classes are represented in return array from predict_proba in Scikit-learn.
The other gives signed values for ranking (not probabilities, but generally better results): Scikit-learn predict_proba gives wrong answers; you should look at the answer there.
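A minimal sketch contrasting the two alternatives, on toy data:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Alternative 1: calibrated probability estimates (Platt scaling).
clf = SVC(probability=True).fit(X, y)
print(clf.predict_proba(X[:3]))    # columns are ordered as clf.classes_

# Alternative 2: signed distances to the separating hyperplane;
# not probabilities, but often better for ranking.
print(clf.decision_function(X[:3]))
```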