I have been exploring scikit-learn, and I would like to know whether I can modify how it classifies a data point, specifically with its SVM classifier. I am looking for a programmatic way to attack this problem.
In general, SVM classification looks something like this, where the blue points are positive and the red points are negative:
[Image: SVM]
Where the classification occurs as follows:
[Image: SVM Classification]
As I understand it, scikit-learn does this for us quite easily.
However, I was wondering if there are any parameters I could change to make it look something more like this:
[Image: SVM The way I want]
That is, the decision boundary passes through the positive point that forms the support vector. Is there another algorithm that I am missing? Would I have to build my SVM function from the ground up?
Thank you
I mean the result, not the theory:
In linear regression, there is a formula that explains which variables and weights contribute to the final score.
In a decision tree, there is a path map that explains which conditions produce each split.
The only result I can get out of DecisionTreeRegressor (from sklearn.tree import DecisionTreeRegressor) is via pickle.dump, but a pickle is still a black box. The feature_importances_ attribute gives the importance of each feature, but that is an indirect method; I still cannot see where the score comes from.
How can I read the data and explain the fitting result of a Random Forest directly?
Is there any formula or path map?
With sklearn.tree.export_graphviz and dot you can visualize the decision making process. It's a little tricky to implement but that's a way to read the fitting result. Read more here.
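As a rough sketch of that suggestion, the snippet below fits a small forest and exports one of its trees as DOT text. The bundled diabetes dataset, the forest size, and the tree depth are assumptions chosen only to keep the example small; they are not from the question.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz

# Toy data standing in for the asker's dataset (an assumption).
data = load_diabetes()
X, y = data.data, data.target

forest = RandomForestRegressor(n_estimators=5, max_depth=3, random_state=0)
forest.fit(X, y)

# A fitted forest keeps its individual decision trees in `estimators_`.
first_tree = forest.estimators_[0]

# With out_file=None, export_graphviz returns the DOT source as a string.
dot_source = export_graphviz(
    first_tree,
    out_file=None,
    feature_names=data.feature_names,
    filled=True,
)

# Show the first few lines of the DOT description of the tree.
print("\n".join(dot_source.splitlines()[:5]))
```

Saving dot_source to a file such as tree_0.dot and running "dot -Tpng tree_0.dot -o tree_0.png" should render the split conditions as an image, provided Graphviz is installed. Each tree in the forest can be exported the same way.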
I have two datasets, each defined by the same two parameters. If you plot them on a scatter plot, there is some overlap. I'd like to classify them, but also get a probability that a given point is in one dataset or another. So in the overlap region, I would never expect the probability to be 100%.
I've implemented this using python's scikit-learn package and the kNN algorithm, KNeighborsClassifier. It looks pretty good! When I use predict_proba to return the probability, it looks like what I would expect!
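For reference, a minimal sketch of that kNN setup might look like the following. The two overlapping Gaussian blobs, the choice of 15 neighbours, and the query points are all made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Two overlapping 2-D Gaussian blobs stand in for the two datasets (assumed).
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
X1 = rng.normal(loc=[1.5, 1.5], scale=1.0, size=(200, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X, y)

# Each row is [P(class 0), P(class 1)]. With the default uniform weights,
# this is the fraction of the 15 nearest neighbours in each class.
query = np.array([[0.75, 0.75], [-2.0, -2.0]])
proba = knn.predict_proba(query)
print(proba)
```

Because these values are neighbour fractions, they can only take the values k/15 here, which is worth keeping in mind when comparing them with the output of another model.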
So then I tried doing the same thing with TensorFlow and the DNNClassifier classifier, mostly as a learning exercise for myself. When I evaluated the test samples I used predict_proba to return the probabilities, but the distribution of probabilities looks very different from the kNN approach. It looks like the DNNClassifier is really trying to drive the probabilities to 1 or 0, rather than somewhere in between for the overlapping region.
I've not posted code here because my question is more basic: can I interpret the probabilities returned by these two approaches in the same way? Or is there a fundamental difference between them?
Thanks!
Yes. Provided you used sigmoid or softmax for prediction you should be getting values that are reasonable to interpret as probabilities (DNNClassifier will use softmax as far as I know).
You didn't give us any details on the models, though. Depending on the complexity of the models and the training parameters, you might be getting more overfitting.
If you are seeing extreme (0 or 1) values for the overlapping area, it's probably overfitting. Use a test/validation set to keep a check on it.
From what you are describing, a very simple model should do; try less depth and fewer parameters.
I need advice choosing a model and machine learning algorithm for a classification problem.
I'm trying to predict a binary outcome for a subject. I have 500,000 records in my data set and 20 continuous and categorical features. Each subject has 10--20 records. The data is labeled with its outcome.
So far I'm thinking of a logistic regression model and kernel approximation, based on the cheat-sheet here.
I am unsure where to start when implementing this in either R or Python.
Thanks!
Choosing an algorithm and optimizing its parameters is a difficult task in any data mining project, because both must be customized for your data and problem. Try different algorithms such as SVM, Random Forest, Logistic Regression, and KNN, run cross-validation for each of them, and then compare the results.
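A minimal sketch of that comparison is shown below. The synthetic data from make_classification, the model settings, and the 5-fold split are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic binary problem with 20 features (a stand-in for the real data).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

candidates = {
    "svm": SVC(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
}

# Mean 5-fold cross-validated accuracy for each candidate model.
scores = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X, y, cv=5)
    scores[name] = fold_scores.mean()
    print(f"{name}: {scores[name]:.3f}")
```

Since each subject in the question contributes several records, it may be worth keeping all records of a subject in the same fold (for example with GroupKFold), so that the scores are not inflated by leakage between folds. That is not shown here.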
You can use grid search (GridSearchCV) in scikit-learn to try different parameters and optimize them for each algorithm. Also try this project, which tests a range of parameters with a genetic algorithm.
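As an illustration of the grid search idea, here is a small sketch using scikit-learn's GridSearchCV with an SVM. The parameter grid and the synthetic data are arbitrary assumptions, not recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data standing in for the real problem (an assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Every combination of these values is evaluated with 3-fold cross-validation.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, search.best_estimator_ holds a model refit on all of the data with the best parameters found.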
Features
If your categorical features don't have too many possible different values, you might want to have a look at sklearn.preprocessing.OneHotEncoder.
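For example, a single categorical column could be encoded roughly like this. The colour values are made up for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

# One categorical feature with three distinct values (illustrative data).
colors = [["red"], ["green"], ["blue"], ["green"]]

# handle_unknown="ignore" encodes unseen categories as all zeros at
# transform time instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")

# fit_transform returns a sparse matrix by default; densify it for display.
encoded = encoder.fit_transform(colors).toarray()

print(encoder.categories_)
print(encoded)
```

Each distinct value becomes its own 0/1 column, in the sorted order listed in encoder.categories_, which is why this works best when the number of distinct values is small.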
Model choice
The choice of "the best" model depends mainly on the amount of available training data and the simplicity of the decision boundary you expect to get.
You can try dimensionality reduction to 2 or 3 dimensions. Then you can visualize your data and see if there is a nice decision boundary.
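One possible way to do this is PCA, sketched below. PCA is only one of several dimensionality reduction methods, and the synthetic data and scaling step are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 20-feature data standing in for the real records (an assumption).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# PCA is sensitive to feature scale, so standardize the features first.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the two directions of largest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)
print(pca.explained_variance_ratio_)
```

The two columns of X_2d can then be drawn as a scatter plot coloured by y. If the explained variance ratio is low, the 2-D picture may hide structure that exists in the full feature space.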
With 500,000 training examples you can think about using a neural network. I can recommend Keras for beginners and TensorFlow for people who know how neural networks work.
You should also know that there are ensemble methods.
A nice cheat sheet on what to use is in the sklearn tutorial you already found:
(source: scikit-learn.org)
Just try them and compare the results. Without more information it is not possible to give you better advice.
I've been trying to find this information and couldn't find any help.
What I want to do is get a floating-point number as output from sklearn's SVM, to use as input for a sub-classifier.
Is it possible to get output from the SVM like 0.89898 instead of 1, indicating how close a sample is to being classified as 1?
Thank you
Platt scaling can help to achieve what you want. It fits a logistic sigmoid curve on top of the output of SVM in a post-hoc fashion.
To do this in sklearn, you'll need to fit your SVM with the probability parameter set to True. Then, you can use the fitted model's predict_proba() method to get a floating-point output. More documentation can be found here. You'll also find related discussion in this thread.
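A short sketch of those two steps, assuming a synthetic binary dataset and an RBF kernel (neither comes from the question):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary data standing in for the real features (an assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# probability=True makes SVC fit a Platt-style sigmoid on top of its
# decision values, which enables predict_proba.
clf = SVC(kernel="rbf", probability=True, random_state=0)
clf.fit(X, y)

# Columns follow the order of clf.classes_, so column 1 is P(class 1).
proba = clf.predict_proba(X[:3])
p_class_1 = proba[:, 1]

print(clf.classes_)
print(proba)
```

The values in p_class_1 are floats between 0 and 1 and could be passed on as a feature to a second classifier.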
Is there any implementation of incremental SVM that can also return the probability of a given feature vector belonging to the various classes? Preferably one usable from Python code.
I have heard about LaSVM. Does LaSVM have a feature for returning probability estimates? Does it also have features for handling imbalanced training datasets?
You can have a look at scikit-learn, a very flexible and efficient library written in Python.
Every fitted model stores its internally calculated values. If clf is your SVM classifier, you can call clf.decision_function to see some explanation of the predictions.
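A minimal sketch of reading those values for a binary problem follows. The synthetic data and the linear kernel are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary data (an assumption made for illustration).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = SVC(kernel="linear").fit(X, y)

# For a binary SVC, decision_function returns one signed score per sample:
# positive scores fall on the clf.classes_[1] side of the boundary, and a
# larger absolute value means the sample lies further from the boundary.
scores = clf.decision_function(X[:5])
preds = clf.predict(X[:5])

for score, pred in zip(scores, preds):
    print(f"score={score:+.3f} -> predicted class {pred}")
```

These scores are not probabilities, but they can be used to rank samples by how strongly the model assigns them to a class.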
It also provides a good set of tools for preprocessing data, among other things you may find interesting.
cheers,
To get probability estimates, you can use the scikit-learn library. There are two alternatives. One gives probabilities; here is an example: How to know what classes are represented in return array from predict_proba in Scikit-learn
The other gives signed values for ranking (not probabilities, but it generally gives better results): Scikit-learn predict_proba gives wrong answers. You should look at the answer there.