Incremental SVM With Probability Estimates - python

Is there any implementation of incremental SVM that also returns the probability of a given feature vector belonging to each of the classes? Preferably one usable from Python.
I have heard about LaSVM. Does LaSVM have a feature for returning probability estimates? Does it also have features for handling imbalanced training datasets?

You can have a look at scikit-learn, a very flexible and efficient machine learning library written in Python.
Every fitted model stores its internally computed values. If clf is your SVM classifier, you can call clf.decision_function() to see a measure of how confident each prediction is.
It also provides a good set of tools for preprocessing data, among other things you may find interesting.
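For example, a minimal sketch on synthetic data (the dataset and the linear kernel here are assumptions):

```python
# decision_function gives the signed distance to the separating hyperplane.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
clf = SVC(kernel="linear").fit(X, y)
print(clf.decision_function(X[:5]))  # positive leans toward clf.classes_[1]
```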

For getting probability estimates you can use the scikit-learn library. There are two alternatives. One gives probabilities; here is an example: How to know what classes are represented in return array from predict_proba in Scikit-learn.
The other gives signed values for ranking (not probabilities, but it generally gives better results): see the answer to Scikit-learn predict_proba gives wrong answers.
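A minimal sketch of the first alternative on synthetic stand-in data:

```python
# predict_proba requires fitting with probability=True; its columns
# follow the class order stored in clf.classes_.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
clf = SVC(probability=True).fit(X, y)
print(clf.classes_)              # which column is which class
print(clf.predict_proba(X[:3]))  # one probability per class; rows sum to 1
```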

Related

How to find the best threshold with PyCaret

I am using the PyCaret library and created a CatBoost model with it.
The model has a great AUC score but pretty bad recall and F1, which means that the default threshold of 0.5 is not ideal and that there is some threshold that will give a good score for both of those metrics.
Is there any way to find this threshold? I am not sure how to go about this, since I am just trying out PyCaret.
Which threshold do you mean? One for feature selection? You can try several adjustments to improve the model in comparison to your baseline (a threshold-search sketch follows this list):
compare_models() - maybe there are other algorithms that perform better than CatBoost.
Feature selection - RFE or Random Forest (here you can use the feature_selection parameter in PyCaret and play with its threshold; the Boruta algorithm is worth checking as well).
Feature engineering.
fold=5 - try a different number of cross-validation folds.
Try several train/test splits (80/20, 70/30, etc.).
In the PyCaret setup, double-check which features are numerical and which are categorical, and change the format where needed.
After these adjustments, run the comparison again.
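If the threshold you have in mind is the probability cutoff itself, here is a minimal sketch outside of PyCaret, using sklearn's precision_recall_curve; the predictions and labels below are random stand-ins for your model's hold-out output:

```python
# A minimal sketch: pick the cutoff that maximizes F1 on hold-out data.
import numpy as np
from sklearn.metrics import precision_recall_curve

# Stand-ins: replace with model.predict_proba(X_holdout)[:, 1] and true labels.
proba = np.random.rand(1000)
y_true = np.random.randint(0, 2, 1000)

precision, recall, thresholds = precision_recall_curve(y_true, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = thresholds[np.argmax(f1[:-1])]  # last precision/recall pair has no threshold
print(f"best threshold by F1: {best:.3f}")
```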

What is the difference in interpretation of the "probability" returned by a kNN or a DNN algorithm

I have two datasets, each defined by the same two parameters. If you plot them on a scatter plot, there is some overlap. I'd like to classify them, but also get a probability that a given point is in one dataset or another. So in the overlap region, I would never expect the probability to be 100%.
I've implemented this using Python's scikit-learn package and the kNN algorithm, KNeighborsClassifier. It looks pretty good! When I use predict_proba to return the probabilities, they look like what I would expect!
So then I tried doing the same thing with TensorFlow and the DNNClassifier, mostly as a learning exercise for myself. When I evaluate the test samples and use predict_proba to return the probabilities, the distribution looks much different from the kNN approach. It looks like the DNNClassifier is really trying to drive the probabilities to 1 or 0, rather than somewhere in between for the overlapping region.
I've not posted code here because my question is more basic: can I interpret the probabilities returned by these two approaches in the same way? Or is there a fundamental difference between them?
Thanks!
Yes. Provided you used sigmoid or softmax for the prediction, you should be getting values that are reasonable to interpret as probabilities (DNNClassifier uses softmax, as far as I know).
That said, you didn't give us any details about the models. Depending on their complexity and the training parameters, you might be getting more overfitting.
If you are seeing extreme (0 or 1) values for the overlapping area, it's probably overfitting. Use a test/validation set to keep a check on it.
From what you are describing, a very simple model should do: try less depth and fewer parameters.
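To illustrate the contrast, here is a minimal sketch on synthetic overlapping data, with sklearn's MLPClassifier standing in for the DNN; the data, neighbor count, and layer size are assumptions:

```python
# Two overlapping Gaussian blobs: compare kNN and a small neural network.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, (500, 2)), rng.normal(1.5, 1.0, (500, 2))])
y = np.array([0] * 500 + [1] * 500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=25).fit(X_train, y_train)
# Keep the network small to limit overfitting, as suggested above.
mlp = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                    random_state=0).fit(X_train, y_train)

# In the overlap region kNN stays moderate, while a larger network
# tends to push probabilities toward 0 or 1.
print(knn.predict_proba(X_test[:5]))
print(mlp.predict_proba(X_test[:5]))
```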

How to predict a binary outcome with categorical and continuous features using scikit-learn?

I need advice choosing a model and machine learning algorithm for a classification problem.
I'm trying to predict a binary outcome for a subject. I have 500,000 records in my data set, with 20 continuous and categorical features. Each subject has 10-20 records. The data is labeled with its outcome.
So far I'm thinking of a logistic regression model with kernel approximation, based on the cheat-sheet here.
I am unsure where to start when implementing this in either R or Python.
Thanks!
Choosing an algorithm and optimizing its parameters is a difficult task in any data mining project, because it must be customized for your data and problem. Try different algorithms like SVM, Random Forest, Logistic Regression, and KNN, evaluate each with cross-validation, and then compare them.
You can use GridSearchCV in scikit-learn to try different parameters and optimize them for each algorithm, as in the sketch below. Also try this project, which tests a range of parameters with a genetic algorithm.
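A minimal sketch of that workflow on synthetic data (the grid values are arbitrary):

```python
# Compare algorithms with cross-validation, then tune one with GridSearchCV.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for model in (LogisticRegression(max_iter=1000), RandomForestClassifier()):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean())

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"n_estimators": [100, 300], "max_depth": [None, 10]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```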
Features
If your categorical features don't have too many possible different values, you might want to have a look at sklearn.preprocessing.OneHotEncoder.
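For example, a minimal sketch (the city column is made up):

```python
# One-hot encode a categorical column: one binary column per category.
from sklearn.preprocessing import OneHotEncoder

cities = [["London"], ["Paris"], ["London"], ["Berlin"]]
enc = OneHotEncoder(sparse_output=False)  # use sparse=False on older sklearn
print(enc.fit_transform(cities))
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]
print(enc.categories_)  # column order: ['Berlin', 'London', 'Paris']
```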
Model choice
The choice of "the best" model depends mainly on the amount of available training data and the simplicity of the decision boundary you expect to get.
You can try dimensionality reduction to 2 or 3 dimensions. Then you can visualize your data and see if there is a nice decision boundary.
With 500,000 training examples you can think about using a neural network. I can recommend Keras for beginners and TensorFlow for people who know how neural networks work.
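For instance, a minimal Keras sketch of a binary classifier with a sigmoid output; the layer sizes and the random stand-in data are assumptions, not a tuned setup:

```python
# A small feed-forward binary classifier in Keras.
import numpy as np
from tensorflow import keras

n_features = 20  # continuous plus one-hot encoded columns (assumption)
model = keras.Sequential([
    keras.layers.Input(shape=(n_features,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # probability of positive class
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stand-ins: replace with your real feature matrix and 0/1 labels.
X = np.random.rand(1000, n_features).astype("float32")
y = np.random.randint(0, 2, 1000)
model.fit(X, y, epochs=5, batch_size=64, validation_split=0.2)
```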
You should also know that there are ensemble methods.
A nice cheat sheet on what to use is the one in the sklearn tutorial you already found (source: scikit-learn.org).
Just try things out and compare the results. Without more information it is not possible to give you better advice.

How may I calculate Accuracy in NLTK KMeans Clustering

I am trying to use NLTK's KMeans clustering algorithm, and it is generally going fine.
I want to use NLTK's metrics package to determine precision, recall, and F-measure.
I searched for examples on the web and in other references, but without luck.
Could anyone kindly cite an example or reference?
Thanks in advance.
It is hard to evaluate the performance of unsupervised learning, i.e. clustering; it entirely depends on why you are trying to cluster in the first place.
Still, there are a few performance metrics available; see:
http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
Precision, recall, and thus the F-measure are inappropriate for cluster analysis: clustering is not classification, and clusters are not classes!
Common measures for clustering (if you are trying to compare with existing labels, which does not make a whole lot of sense: if you already know the classes, use classification, not clustering) are the Adjusted Rand Index and its variants.
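As a minimal sketch, sklearn exposes the Adjusted Rand Index; note that it ignores the arbitrary numbering of clusters (the label arrays here are made up):

```python
# Compare cluster assignments to known labels with the Adjusted Rand Index.
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]
cluster_ids = [1, 1, 0, 0, 2, 2]  # same grouping, different label names
print(adjusted_rand_score(true_labels, cluster_ids))  # 1.0: perfect agreement
```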

sklearn svm non integer outputs

I've been trying to find this information and couldn't find any help.
What I want is to get a float number as output from sklearn's SVM, to use as input for a sub-classifier.
Is it possible to get an output from the SVM like 0.89898 instead of 1, given that a sample is more likely to be classified as class 1?
Thank you.
Platt scaling can help achieve what you want. It fits a logistic sigmoid curve on top of the SVM's output in a post-hoc fashion.
To do this in sklearn, you'll need to fit your SVM with the probability parameter set to True. Then you can use the fitted model's predict_proba() method to get a floating-point output. More documentation can be found here. You'll also find related discussion in this thread.
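A minimal sketch of that on synthetic data:

```python
# With probability=True, Platt scaling yields floats instead of hard labels.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
clf = SVC(probability=True).fit(X, y)
print(clf.predict(X[:3]))        # hard class labels, e.g. [1 0 1]
print(clf.predict_proba(X[:3]))  # floats like 0.89898, usable by a sub-classifier
```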
