I've been trying to find this information and couldn't find any help.
What I want to do is get a floating-point number as output from an sklearn SVM, to use as input for a sub-classifier.
Is it possible to get output from the SVM like 0.89898 instead of 1, when a sample is more likely to belong to class 1?
Thank you
Platt scaling can help to achieve what you want. It fits a logistic sigmoid curve on top of the output of SVM in a post-hoc fashion.
To do this in sklearn, you'll need to fit your SVM with the probability parameter set to True. Then you can use the fitted model's predict_proba() method to get a floating-point output. More documentation can be found here. You'll also find related discussion in this thread.
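As a minimal sketch of that recipe (the toy data from make_classification is illustrative; SVC and predict_proba are scikit-learn's own names):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy data for illustration only
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# probability=True enables Platt scaling, fitted via internal cross-validation
clf = SVC(probability=True, random_state=0)
clf.fit(X, y)

# One column per class; each row sums to 1
proba = clf.predict_proba(X[:5])
print(proba)
```

The column for class 1 is exactly the kind of float (0.89898-style) you can feed into a sub-classifier. One caveat: on small datasets these Platt-scaled probabilities may occasionally disagree with the hard predict() output.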
I have been exploring Scikit-learn as a tool, and I am very interested in learning if I can modify how Scikit-learn classifies a data point, more specifically, its SVM function. I am looking for a programmatic way to attack this problem.
In general, we can say that SVM classification looks something like this, and let's imagine the blue points are positive, and red points are negative:
[image: SVM]
Where the classification occurs as follows:
[image: SVM classification]
As my understanding goes, Scikit-learn does this for us quite easily.
However, I was wondering if there are any parameters I could change to make it look something more like this:
[image: SVM, the way I want it]
That is, the positive point that makes the support vector is also my decision boundary. Is there another algorithm that I am missing? Would I have to build my SVM function from the ground up?
Thank you
I mean the result, not the theory:
In linear regression, there is a formula explaining which variables and weights contribute to the final score.
In a decision tree, there is a path map explaining which conditions lead to each segment.
The only result I can read from a fitted model (from sklearn.tree import DecisionTreeRegressor) is via pickle.dump, but a pickle is still a black box. The feature_importances_ output explains the relative importance of each feature, but that's an indirect method; I still cannot see where the score comes from.
How can I read and interpret the fitting result of a Random Forest directly?
Is there any formula or path map?
With sklearn.tree.export_graphviz and dot you can visualize the decision-making process. It's a little tricky to set up, but it's a way to read the fitting result. Read more here.
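A minimal sketch of that workflow (load_diabetes is just a convenient toy regression dataset; export_graphviz is scikit-learn's own function):

```python
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor, export_graphviz

# Toy data; a shallow tree keeps the resulting graph readable
X, y = load_diabetes(return_X_y=True)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# out_file=None returns the Graphviz .dot source as a string;
# render it with e.g.:  dot -Tpng tree.dot -o tree.png
dot_source = export_graphviz(tree, out_file=None)
print(dot_source[:120])
```

For a random forest you would export each tree in clf.estimators_ the same way; the forest's score is the average of the per-tree path predictions.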
I have two datasets, each defined by the same two parameters. If you plot them on a scatter plot, there is some overlap. I'd like to classify them, but also get a probability that a given point is in one dataset or another. So in the overlap region, I would never expect the probability to be 100%.
I've implemented this using Python's scikit-learn package and the kNN algorithm, KNeighborsClassifier. It looks pretty good! When I use predict_proba to return the probability, it looks like what I would expect!
So then I tried the same thing with TensorFlow's DNNClassifier, mostly as a learning exercise for myself. When I evaluate the test samples, I use predict_proba to return the probabilities, but the distribution of probabilities looks much different from the kNN approach. It looks like the DNNClassifier is really trying to drive the probabilities to 1 or 0, rather than somewhere in between for the overlapping region.
I've not posted code here because my question is more basic: can I interpret the probabilities returned by these two approaches in the same way, or is there a fundamental difference between them?
Thanks!
Yes. Provided you used a sigmoid or softmax for prediction, you should be getting values that are reasonable to interpret as probabilities (DNNClassifier uses softmax, as far as I know).
Now, you didn't give us any details on the models. Depending on the model's complexity and the training parameters, you might be getting more overfitting.
If you are seeing extreme (0 or 1) values for the overlapping area, it's probably overfitting; use a test/validation set to keep a check on it.
From what you are describing, a very simple model should do. Try less depth and fewer parameters.
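To illustrate why kNN behaves so differently here (a sketch on made-up overlapping blobs; the blob parameters are arbitrary), note that kNN's "probability" is literally the vote fraction among the neighbours, so it cannot be pushed to extremes in an overlap region:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs as a stand-in for the overlapping datasets
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(1.5, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

knn = KNeighborsClassifier(n_neighbors=15).fit(X, y)

# In the overlap region the probability is the fraction of the 15
# nearest neighbours belonging to each class
p = knn.predict_proba([[0.75, 0.75]])[0]
print(p)
```

A network trained to minimize cross-entropy has no such built-in smoothing: with enough capacity it keeps sharpening its outputs on the training points, which is the behaviour described above.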
I am using a Random Forest classifier in scikit-learn with an imbalanced dataset of two classes. I am much more worried about false negatives than false positives. Is it possible to fix the false negative rate (to, say, 1%) and ask scikit-learn to optimize the false positive rate somehow?
If this classifier doesn't support it, is there another classifier that does?
I believe the problem of class imbalance in sklearn can be partially resolved by using the class_weight parameter.
This parameter is either a dictionary, in which each class is assigned a weight, or a string that tells sklearn how to build that dictionary. For instance, setting this parameter to 'auto' (named 'balanced' in newer scikit-learn versions) will weight each class in proportion to the inverse of its frequency.
By giving the under-represented class a higher weight, you can end up with 'better' results.
Classifiers like SVM or logistic regression also offer this class_weight parameter.
This Stack Overflow answer gives some other ideas on how to handle class imbalance, like undersampling and oversampling.
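A minimal sketch of class_weight on an imbalanced toy set (the data and the example weight dict are illustrative; 'balanced' is the current name for the old 'auto' mode):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Imbalanced toy data: roughly 95% class 0, 5% class 1
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# 'balanced' weights each class inversely to its frequency;
# an explicit dict such as {0: 1, 1: 20} gives finer manual control
clf = RandomForestClassifier(class_weight='balanced', random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))
```

Raising the minority-class weight penalizes false negatives more during training, which is the lever the question is asking for, though it fixes no exact rate.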
I found this article on class imbalance problem.
http://www.chioka.in/class-imbalance-problem/
To summarize, it discusses the following possible solutions:
Cost function based approaches
Sampling based approaches
SMOTE (Synthetic Minority Over-Sampling Technique)
Recent approaches: RUSBoost, SMOTEBagging and UnderBagging
Hope it helps.
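As an illustration of the SMOTE idea only (a hand-rolled sketch built on scikit-learn's NearestNeighbors; for real work the third-party imbalanced-learn package provides a tested implementation):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: synthesize minority samples by interpolating
    between a minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)  # idx[:, 0] is the point itself
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(1, k + 1)]  # a random true neighbour
        lam = rng.random()                  # interpolation factor in [0, 1)
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new)

# 20 minority samples in 3 features -> 40 synthetic ones
X_min = np.random.default_rng(1).normal(size=(20, 3))
X_new = smote_sketch(X_min, n_new=40)
print(X_new.shape)
```

The synthetic points lie on line segments between real minority points, which is what distinguishes SMOTE from plain random oversampling (duplicating rows).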
Random forest is already a bagged classifier, so it should already give some good results.
One typical way of achieving a desired false-positive or false-negative rate is to analyze it using ROC curves
http://scikit-learn.org/stable/auto_examples/plot_roc.html
and then modify certain parameters, such as the decision threshold, to reach the desired FP rate.
I am not sure whether it is possible to tune the random forest's FP rate through its training parameters alone; you can look at other classifiers depending on your application.
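A sketch of that thresholding idea (toy imbalanced data; the 1% false-negative target comes from the question, and roc_curve is scikit-learn's own function):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Toy imbalanced data: ~90% negatives, ~10% positives
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, scores)

# Pick the highest threshold whose true-positive rate is at least 99%,
# i.e. a false-negative rate of at most 1% on this evaluation set
threshold = thresholds[tpr >= 0.99][0]
y_pred = (scores >= threshold).astype(int)
print(threshold)
```

The resulting FP rate is then whatever fpr value sits at the same index; it is determined by the classifier's ranking quality, which is exactly the "optimize the other rate" trade-off the question describes. Note the 1% is guaranteed only on the set used to pick the threshold, so a held-out validation set should be used.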
Is there any implementation of an incremental SVM that also returns the probability of a given feature vector belonging to each of the classes? Preferably usable from Python code.
I have heard about LaSVM. Does LaSVM have a feature for returning probability estimates? Also, does it have features for handling imbalanced training datasets?
You can have a look at Scikit-learn, a very flexible and efficient library written in Python.
Every model stores the values it calculated internally. If clf is your SVM classifier, you can access clf.decision_function to see some explanation of the predictions.
It also provides a good set of tools for preprocessing data among other things you can find interesting.
cheers,
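A short sketch of what clf.decision_function exposes (toy data; for a binary SVC the sign indicates the predicted side of the hyperplane and the magnitude the distance to it):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
clf = SVC(kernel='linear', random_state=0).fit(X, y)

# Signed distance to the separating hyperplane:
# positive scores mean class 1, negative mean class 0
scores = clf.decision_function(X[:5])
print(scores)
print(clf.predict(X[:5]))
```

These raw margins are what Platt scaling (probability=True) squashes into probabilities.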
For getting probability estimates you can use the scikit-learn library. There are two alternatives. One gives probabilities; here is an example: How to know what classes are represented in return array from predict_proba in Scikit-learn.
The other gives signed values useful for ranking (not probabilities, but it generally gives better results): see the answer to Scikit-learn predict_proba gives wrong answers.