Self-learner here.
I am building a web application that predicts events.
Let's consider this quick example.
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)
print(neigh.predict([[1.1]]))
How can I keep the state of neigh so that when I enter a new value like neigh.predict([[1.2]]) I don't need to re-train the model? Is there any good practice, or a hint, to get started on solving the problem?
You've chosen a slightly confusing example for a couple of reasons. First, when you say neigh.predict([[1.2]]), you aren't adding a new training point, you're just doing a new prediction, so that doesn't require any changes at all. Second, KNN algorithms aren't really "trained" -- KNN is an instance-based algorithm, which means that "training" amounts to storing the training data in a suitable structure. As a result, this question has two different answers. I'll try to answer the KNN question first.
K Nearest Neighbors
For KNN, adding new training data amounts to appending new data points to the structure. However, it appears that scikit-learn doesn't provide any such functionality. (That's reasonable enough -- since KNN explicitly stores every training point, you can't just keep giving it new training points indefinitely.)
If you aren't using many training points, a simple list might be good enough for your needs! In that case, you could skip sklearn altogether, and just append new data points to your list. To make a prediction, do a linear search, saving the k nearest neighbors, and then make a prediction based on a simple "majority vote" -- if out of five neighbors, three or more are red, then return red, and so on. But keep in mind that every training point you add will slow the algorithm.
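For illustration, here is a minimal sketch of that "plain list" approach in pure Python (no sklearn): store (point, label) pairs, append new ones as they arrive, and predict with a majority vote among the k closest stored points.
from collections import Counter
import math

training_data = []  # list of (features, label) pairs

def add_point(features, label):
    training_data.append((features, label))

def predict(features, k=3):
    # Linear search: sort stored points by Euclidean distance to the query.
    neighbors = sorted(training_data, key=lambda item: math.dist(features, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]  # simple majority vote

add_point([0], 0); add_point([1], 0); add_point([2], 1); add_point([3], 1)
print(predict([1.1]))  # -> 0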
If you need to use many training points, you'll want to use a more efficient structure for nearest neighbor search, like a K-D Tree. There's a scipy K-D Tree implementation that ought to work. The query method allows you to find the k nearest neighbors. It will be more efficient than a list, but it will still get slower as you add more training data.
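As a rough sketch of how that might look (labels kept in a parallel list; note the tree has to be rebuilt whenever you add new points):
import numpy as np
from collections import Counter
from scipy.spatial import KDTree

points = np.array([[0.0], [1.0], [2.0], [3.0]])
labels = [0, 0, 1, 1]

tree = KDTree(points)  # build the search structure up front

def predict(x, k=3):
    _, idx = tree.query([x], k=k)  # indices of the k nearest training points
    votes = Counter(labels[i] for i in idx[0])
    return votes.most_common(1)[0][0]

print(predict([1.1]))  # -> 0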
Online Learning
A more general answer to your question is that you are (unbeknownst to yourself) trying to do something called online learning. Online learning algorithms allow you to use individual training points as they arrive, and discard them once they've been used. For this to make sense, you need to be storing not the training points themselves (as in KNN) but a set of parameters, which you optimize.
This means that some algorithms are better suited to this than others. sklearn provides just a few algorithms capable of online learning. These all have a partial_fit method that allows you to pass training data in batches. SGDClassifier with 'hinge' or 'log' loss is probably a good starting point.
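A minimal sketch of what that looks like (the class labels must be declared on the first partial_fit call; 'hinge' is the default loss, and in recent sklearn versions the log loss is spelled 'log_loss'):
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="hinge")

X_first = np.array([[0], [1], [2], [3]])
y_first = np.array([0, 0, 1, 1])
clf.partial_fit(X_first, y_first, classes=np.array([0, 1]))

# Later, as new labelled points arrive, update the model without retraining from scratch:
clf.partial_fit(np.array([[1.5]]), np.array([1]))

print(clf.predict([[1.1]]))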
Or maybe you just want to save your model after fitting:
import joblib
joblib.dump(neigh, FName)
and load it when needed
neigh = joblib.load(FName)
neigh.predict([[1.1]])
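In a web application that would typically mean loading the saved model once at startup and reusing it for every request. A hedged sketch (Flask is used purely for illustration; the file name is hypothetical):
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("knn_model.joblib")  # hypothetical file; load once, not on every request

@app.route("/predict")
def predict():
    value = float(request.args.get("x", 0.0))
    return jsonify(prediction=int(model.predict([[value]])[0]))

# app.run()  # start the development server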
From what I understand about the random forest algorithm, it randomly samples the original dataset to build a new sampled/bootstrapped dataset, and that sampled dataset is then turned into a decision tree.
In scikit-learn you can visualize each individual tree in the random forest. But my question is: how can I show the sampled/bootstrapped dataset for each of those trees?
I want to see the features and the rows of data used to build each individual tree.
I am not aware of a way to see the bootstrapped rows (samples) of your data, as this is a random process. Nor do I think it is of much importance, provided the algorithm trained well.
Nevertheless, for the features: in your visualization of the tree you may already see which features were used by the splits that were made. But you can also access them directly via the attribute feature_names_in_ (see the scikit-learn docs) for each DecisionTree in the forest, e.g.
print(rf.estimators_[0].feature_names_in_)
If your feature names in the data have not been defined as described in the documentation, a workaround is to use feature_importances_ (see the docs) as a proxy instead - features that were never used obviously have an importance of 0. You can contrast this with the max_features_ value defined for training to see whether you caught all of them.
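As a small illustration of that workaround (assumes a fitted forest named rf, as above): features with zero importance in a given tree were never used by any of its splits.
import numpy as np

for i, tree in enumerate(rf.estimators_):
    used = np.flatnonzero(tree.feature_importances_ > 0)  # indices of features the tree actually split on
    print(f"tree {i} used feature indices: {used}")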
The concept of KNN is to find the data points nearest to the point you want to predict.
Therefore there is no math or processing before testing the model.
All it does is find the closest k points, which means there is no training process.
If this is right, then what happens during the training process for KNN in Python?
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)
Then something happens in the background when fit gets called.
What is happening there if the process requires no calculations?
KNN is not really a specific algorithm in itself, but rather a method that you can implement in several ways. The idea behind nearest neighbors is to select one or more examples from the training data to decide the predicted value for the sample at hand. The simplest way to do that is to iterate through the whole dataset and pick the closest data points from the training dataset. In that case, you could skip the fitting step, or you could see the fitting as the production of a callable function that runs that loop. Even in that case, if you are using a library like scikit-learn, it is useful to maintain a similar interface for all predictors, so you can write generic code for them (e.g. training code independent of the specific algorithm used).
However, you can do smarter things for KNN too. In scikit-learn, you will see that KNeighborsClassifier implements three different algorithms. One is brute force, which is just traversing the whole dataset as described, but you also have BallTree (wiki) and KDTree (wiki). These are data structures that can accelerate the search for nearest neighbors, but they need to be constructed in advance from the data. So the fitting step here is building the data structure that will help you find the nearest neighbors.
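A small sketch of how that choice is exposed in scikit-learn (with algorithm="auto", the default, the library picks a structure for you):
from sklearn.neighbors import KNeighborsClassifier

X_train = [[0], [1], [2], [3], [4], [5]]
y_train = [0, 0, 0, 1, 1, 1]

clf = KNeighborsClassifier(n_neighbors=3, algorithm="kd_tree")
clf.fit(X_train, y_train)    # the KD-tree is built here, during fit()
print(clf.predict([[1.4]]))  # the tree is queried here, at prediction time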
This might sound silly but I'm just wondering about the possibility of modifying a neural network to obtain a probability density function rather than a single value when you are trying to predict a scalar. I know that when you are trying to classify images or words you can get a probability for each class, so I'm thinking there might be a way to do something similar with a continuous value and plot it. (Similar to the posterior plot with bayesian optimisation)
Such details could be interesting when deploying a model for prediction and could provide more flexibility than a single value.
Does anyone know a way to obtain such an output?
Thanks!
OK, so I found a solution to this issue, though it adds a lot of overhead.
Initially I thought the Keras callbacks could be of use: they provide the flexibility I wanted (i.e. run only on test data, or only on a subset, and not for every test), but it turns out that callbacks are only given summary data from the logs.
So the first step was to create a custom metric that does the same calculation as any metric with the two arrays (the true values and the predicted values) and, once those calculations are done, writes them to a file for later use.
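A rough sketch of that custom-metric idea, assuming TensorFlow/Keras compiled with run_eagerly=True so the metric can use plain Python I/O (the file name is hypothetical):
import numpy as np
import tensorflow as tf

def logging_mae(y_true, y_pred):
    # Same computation as a standard metric...
    err = tf.reduce_mean(tf.abs(y_true - y_pred))
    # ...plus appending the raw per-sample values to a file for later analysis (eager mode only).
    with open("predictions_log.csv", "a") as f:  # hypothetical log file
        np.savetxt(f, np.column_stack([y_true.numpy().ravel(), y_pred.numpy().ravel()]), delimiter=",")
    return err

# model.compile(optimizer="adam", loss="mse", metrics=[logging_mae], run_eagerly=True)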
Then, once we had a way to gather all the data for every sample, the next step was to implement a method that could give a good measure of error. I'm currently implementing a handful of methods, but the most fitting one seems to be Bayesian bootstrapping (user lmc2179 has a great Python implementation). I also implemented ensemble methods and Gaussian processes as alternatives, or to use as additional metrics, along with some other Bayesian methods.
I'll try to find out whether there are internals in Keras that are set during the training and testing phases, to see if I can set a trigger for my metric. The main issue with using all the data is that you get a lot of unreliable data points at the start, since the network is not yet optimized. Some data filtering could be useful to remove a good portion of those points and improve the results of the error predictors.
I'll update if I find anything interesting.
Link to the MIT problem set
Here are my current thoughts--please point to where I'm wrong :)
What I believe: The holdout set's purpose is to serve as a foil, a contrast, to the training set -- to prove that k-means eliminates error at each round. To do this, the holdout set shows the error at the very beginning, i.e. it doesn't recompute the centroid of each cluster to be at the very center of the cluster after each point has been assigned. It just stops, and the error is calculated.
The training set -- the initial 80% of the points, partitioned using randomPartition() -- simply goes through the entire k-means function and returns the error after that.
Where I'm probably wrong: The problem probably just requests another run of k-means, but with a smaller set. Also, the way of calculating error for the training set vs. the holdout set seems identical to me. They're probably not. Also, I heard something about it involving feature selection.
Current methods I'm considering based on my current belief: Duplicate the k-means function, and modify the duplicate so that it returns the clusters and maxDistance after the initial run. Use this function for the holdout set.
The goal of clustering is to group similar data points. But how would you know if the similar data points you have grouped are grouped correctly? How can you judge your results? For this reason you divide your available data into 2 sets: training and holdout.
Take this as an analogy.
Think of the training set as practice questions for some examination. You work through the practice questions, try to do your best on them, and improve your skills.
You can think of the holdout set as the actual examination. If you did well on the practice questions (training set), then you will probably perform well in the examination (holdout set).
Now you know how well you did in both practice and the examination (after attempting them, of course), and from this you can infer your overall performance and judge what is good (e.g. which number of clusters is good, or how well the data is clustered).
So you apply your clustering algorithm to the training data only, and find the cluster centers (the representatives of the clusters). For the holdout data, you simply use the cluster centers found by the algorithm and assign each data point to the cluster whose center is nearest. Calculate your performance on the training and holdout data using some performance metric (squared distance error in your case). Finally, compare these metrics over different values of k to make a good judgement. There is more to it, but for the assignment's sake this seems enough.
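A minimal sketch of that train/holdout procedure using scikit-learn's KMeans (the assignment uses its own k-means implementation, so treat this purely as an illustration):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

X = np.random.rand(200, 2)  # toy data
X_train, X_holdout = train_test_split(X, train_size=0.8, random_state=0)

for k in (2, 3, 4, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train)
    train_err = km.inertia_  # sum of squared distances on the training data
    # Holdout: assign each point to its nearest learned center, with no refitting.
    holdout_err = np.sum((X_holdout - km.cluster_centers_[km.predict(X_holdout)]) ** 2)
    print(k, train_err, holdout_err)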
In practice there are many other methods, but the key idea in most of them is the same. There is a statistics community where you can find more similar questions: https://stats.stackexchange.com/
References:
https://en.wikipedia.org/wiki/Cross-validation_(statistics)#Holdout_method
I had trained my model with the KNN classification algorithm and I was getting around 97% accuracy. However, I later noticed that I had forgotten to normalise my data, so I normalised it and retrained the model, and now I am getting an accuracy of only 87%. What could be the reason? And should I stick to using data that is not normalised, or should I switch to the normalised version?
To answer your question, you first need to understand how KNN works. Picture a simple scatter plot of red and blue points, with a "?" marking the point you want to classify.
Suppose the ? is the point you are trying to classify as either red or blue. For this case, let's assume you haven't normalized any of the data. As you can see, the ? is closer to more red dots than blue dots, so this point would be classified as red. Let's also assume the correct label is red; then this is a correct match!
Now, to discuss normalization. Normalization is a way of taking data that is on different scales and giving it a common scale (in your case, think of it as making the features more comparable). Assume in the above example that you normalize the ?'s features, and as a result its y value becomes smaller. This would place the question mark below its current position, surrounded by more blue dots. Your algorithm would then label it blue, and it would be incorrect. Ouch!
Now to answer your questions. Sorry, but there is no single answer! Sometimes normalizing data removes important feature differences, causing accuracy to go down. Other times, it helps to eliminate noise in your features that causes incorrect classifications. Also, just because accuracy goes up for the data set you are currently working with doesn't mean you will get the same results with a different data set.
Long story short, instead of trying to label normalization as good or bad, consider the feature inputs you are using for classification, determine which ones are important to your model, and make sure differences in those features are reflected accurately in your classification model. Best of luck!
That's a pretty good question, and the result is unexpected at first glance, because normalization usually helps a KNN classifier do better. Good KNN performance generally requires preprocessing the data so that all variables are similarly scaled and centered; otherwise KNN will often be inappropriately dominated by scaling factors.
In this case the opposite effect is seen: KNN gets WORSE with scaling, seemingly.
However, what you may be witnessing could be overfitting. The KNN may be overfit, which is to say it memorized the data very well, but does not work well at all on new data. The first model might have memorized more data due to some characteristic of that data, but it's not a good thing. You would need to check your prediction accuracy on a different set of data than what was trained on, a so-called validation set or test set.
Then you will know whether the KNN accuracy is OK or not.
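A minimal sketch of that check (toy setup; it assumes features X and labels y are already loaded):
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# X, y are assumed to be your existing features and labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_train)  # fit the scaler on the training data only
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(scaler.transform(X_train), y_train)

print("test accuracy:", knn.score(scaler.transform(X_test), y_test))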
Look into learning curve analysis in the context of machine learning, and go learn about bias and variance. It's a deeper subject than can be detailed here. The best, cheapest, and fastest sources of instruction on this topic are videos on the web, by the following instructors:
Andrew Ng, in the online Coursera course Machine Learning
Tibshirani and Hastie, in the online Stanford course Statistical Learning.
If you use normalized feature vectors, the distances between your data points are likely to be different from when you used unnormalized features, particularly when the ranges of the features differ. Since kNN typically uses Euclidean distance to find the k nearest points to any given point, using normalized features may select a different set of k neighbors than the ones chosen when unnormalized features were used, hence the difference in accuracy.
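A tiny illustration of that effect: with two features on very different scales, rescaling changes the distances and can change which training point is nearest (the scale factors below are purely illustrative).
import numpy as np

# Two features on very different scales: (income in dollars, age in years).
query = np.array([50000.0, 25.0])
a = np.array([52000.0, 70.0])
b = np.array([70000.0, 26.0])

# Raw distances: the income feature dominates, so `a` looks closest.
print(np.linalg.norm(query - a), np.linalg.norm(query - b))  # ~2000 vs ~20000

# After rescaling each feature to a comparable range, age matters too and `b` becomes the nearer point.
scale = np.array([10000.0, 10.0])
print(np.linalg.norm((query - a) / scale), np.linalg.norm((query - b) / scale))  # ~4.5 vs ~2.0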