I mean the result, not the theory:
In linear regression, there is a formula that explains which variables and weights contribute to the final score.
In a decision tree, there is a path map that explains which conditions lead to each split.
The only result I can read from sklearn.tree.DecisionTreeRegressor is via pickle.dump, but a pickle is still a black box. The feature_importances_ output explains the relative importance of each feature, but that is an indirect method; I still cannot see where the score comes from.
How can I read and explain the fitting result of a Random Forest directly?
Is there a formula or path map?
With sklearn.tree.export_graphviz and dot you can visualize the decision-making process. It is a little tricky to set up, but it is one way to read the fitting result. Read more here.
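As a rough sketch (assuming a fitted RandomForestRegressor on toy data; the feature names are placeholders), exporting one of the ensemble's trees might look like this:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz

X, y = make_regression(n_samples=200, n_features=4, random_state=0)
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Each tree in the ensemble is a plain DecisionTreeRegressor.
first_tree = forest.estimators_[0]
export_graphviz(first_tree, out_file="tree0.dot",
                feature_names=["f0", "f1", "f2", "f3"],
                filled=True, rounded=True)
# Then render it with Graphviz on the command line:
#   dot -Tpng tree0.dot -o tree0.png

Each node in the rendered image shows the split condition, so you can follow the path that produces a given prediction.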
While using the sklearn Linear Regression library, since we split the data with train_test_split, do we have to use only the training data for OLS (the least squares method), or can we use the full data for OLS and derive the regression result from that?
There are many mistakes that data scientists make as beginners, and one of them is to let the test data take part in the learning process. Look at this diagram from here:
As you can see, the test data is kept separate during the training process, and it is really important that it stays that way.
As for the least squares question: you may think that using the full data improves the fit, but you are forgetting about the evaluation step. The evaluation would then look better not because the regression is better, but simply because you have already shown the model the data you are testing it on.
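A minimal sketch, assuming a toy regression dataset: the OLS fit uses only the training split, and the held-out test split is used purely for evaluation.

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# Least squares fit on the training data only.
model = LinearRegression().fit(X_train, y_train)
print("R^2 on unseen test data:", model.score(X_test, y_test))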
I have been exploring Scikit-learn as a tool, and I am very interested in learning whether I can modify how Scikit-learn classifies a data point, more specifically with its SVM function. I am looking for a programmatic way to attack this problem.
In general, SVM classification looks something like this; let's imagine the blue points are positive and the red points are negative:
SVM
Where the classification occurs as follows:
SVM Classification
As my understanding goes, Scikit-learn does this for us quite easily.
However, I was wondering if there are any parameters I could change to make it look something more like this:
SVM The way I want
That is, the positive point that defines the support vector is also my decision boundary. Is there another algorithm that I am missing? Would I have to build my SVM function from the ground up?
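For reference, a minimal sketch of the standard sklearn SVC behaviour described above, on made-up 2-D data; the offset boundary asked about is not an exposed parameter here.

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2],    # "blue"/positive cluster
               rng.randn(20, 2) - [2, 2]])   # "red"/negative cluster
y = np.array([1] * 20 + [0] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors:\n", clf.support_vectors_)
# decision_function is 0 on the learned boundary; the margins sit at +/-1.
print("decision values:", clf.decision_function(X[:3]))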
Thank you
The concept of KNN is to find the data points nearest to the query point.
Therefore, there is no math or processing before testing the model.
All it does is find the closest K points, which means there is no training process.
If this is right, then what happens in the training step for KNN in Python?
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)
Something happens in the background when fit gets called.
What is happening there, if the process requires no calculations?
KNN is not quite a specific algorithm in itself, but rather a method that you can implement in several ways. The idea behind nearest neighbors is to select one or more examples from the training data to decide the predicted value for the sample at hand. The simplest way to do that is to iterate through the whole dataset and pick the closest data points from the training set. In that case you could skip the fitting step, or you could see fitting as producing a callable function that runs that loop. Even then, if you are using a library like scikit-learn, it is useful to keep the same interface for all predictors, so you can write generic code for them (e.g. training code independent of the specific algorithm used).
However, you can do smarter things for KNN too. In scikit-learn, you will see that KNeighborsClassifier implements three different algorithms. One is brute force, which just traverses the whole dataset as described, but there are also the BallTree (wiki) and the KDTree (wiki). These are data structures that can accelerate the search for nearest neighbors, but they need to be constructed in advance from the data. So the fitting step here builds the data structure that will later help you find the nearest neighbors.
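A minimal sketch of the three search strategies mentioned above, on a made-up toy dataset:

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

for algo in ("brute", "kd_tree", "ball_tree"):
    clf = KNeighborsClassifier(n_neighbors=5, algorithm=algo)
    clf.fit(X, y)   # for kd_tree/ball_tree, fit() builds the search structure
    print(algo, clf.score(X, y))

With algorithm="brute", fit essentially just stores the data; with the tree-based options, it constructs the structure that speeds up the neighbor search at prediction time.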
I have two datasets, each defined by the same two parameters. If you plot them on a scatter plot, there is some overlap. I'd like to classify them, but also get a probability that a given point is in one dataset or another. So in the overlap region, I would never expect the probability to be 100%.
I've implemented this using python's scikit-learn package and the kNN algorithm, KNeighborsClassifier. It looks pretty good! When I use predict_proba to return the probability, it looks like what I would expect!
So then I tried doing the same thing with TensorFlow and the DNNClassifier classifier, mostly as a learning exercise for myself. When I evaluate the test samples I use predict_proba to return the probabilities, but the distribution of probabilities looks much different from the kNN approach. It looks like the DNNClassifier is really trying to drive the probabilities to 1 or 0, rather than somewhere in between for the overlapping region.
I've not posted code here because my question is more basic: can I interpret the probabilities returned by these two approaches in the same way? Or is there a fundamental difference between them?
Thanks!
Yes. Provided you used a sigmoid or softmax for prediction, you should be getting values that are reasonable to interpret as probabilities (DNNClassifier uses softmax, as far as I know).
Now, you didn't give us any details about the models. Depending on the complexity of the models and the training parameters, you might be getting more overfitting.
If you are seeing extreme (0 or 1) values for the overlapping area, it is probably overfitting. Use a test/validation set to keep a check on it.
From what you are describing, a very simple model should do; try less depth and fewer parameters.
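A rough illustrative sketch (using sklearn's MLPClassifier as a stand-in for a small neural network) on two overlapping Gaussian blobs: the kNN probabilities stay spread out in the overlap region, while an over-capacity network may overfit and push its probabilities towards 0 or 1.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(500, 2), rng.randn(500, 2) + [1.0, 1.0]])  # heavy overlap
y = np.array([0] * 500 + [1] * 500)

knn = KNeighborsClassifier(n_neighbors=15).fit(X, y)
net = MLPClassifier(hidden_layer_sizes=(100, 100), max_iter=2000,
                    random_state=0).fit(X, y)

# Compare the predicted probability of class 1 for the same points.
print("kNN:", knn.predict_proba(X[:5])[:, 1])
print("net:", net.predict_proba(X[:5])[:, 1])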
I've been trying to find this information and couldn't find any help.
What I want to do is get a float number as output from the sklearn SVM, to use as input for a sub-classifier.
Is it possible to get an output like 0.89898 instead of 1, indicating that a sample is more likely to belong to class 1?
Thank you
Platt scaling can help to achieve what you want. It fits a logistic sigmoid curve on top of the SVM output in a post-hoc fashion.
To do this in sklearn, you will need to fit your SVM with the probability parameter set to True. Then you can use the fitted model's predict_proba() method to get a floating-point output. More documentation can be found here. You will also find related discussions in this thread.
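A minimal sketch, assuming a toy binary dataset: with probability=True the fitted SVC exposes predict_proba (Platt scaling), and decision_function gives a raw signed score if calibrated probabilities are not needed.

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)
print(clf.predict_proba(X[:3]))       # per-class probabilities as floats, not a hard 0/1 label
print(clf.decision_function(X[:3]))   # signed distance to the separating hyperplane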