Does the SVM in sklearn support incremental (online) learning? - python

I am currently in the process of designing a recommender system for text articles (a binary case of 'interesting' or 'not interesting'). One of my specifications is that it should continuously update to changing trends.
From what I can tell, the best way to do this is to use a machine learning algorithm that supports incremental/online learning.
Algorithms like the Perceptron and Winnow support online learning, but I am not completely certain about Support Vector Machines. Does the scikit-learn Python library support online learning, and if so, is a support vector machine one of the algorithms that can make use of it?
I am obviously not completely tied down to using support vector machines, but they are usually the go-to algorithm for binary classification due to their all-round performance. I would be willing to change to whatever fits best in the end.

While online algorithms for SVMs do exist, it is important to specify whether you want a kernel or a linear SVM, as many efficient algorithms have been developed for the special case of linear SVMs.
For the linear case, if you use the SGD classifier in scikit-learn with the hinge loss and L2 regularization you will get an SVM that can be updated online/incrementally. You can combine this with feature transforms that approximate a kernel to get something similar to an online kernel SVM.
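For concreteness, here is a minimal sketch of that combination (not part of the original answer): SGDClassifier with hinge loss plus scikit-learn's RBFSampler kernel approximation. The gamma value and the toy data are placeholders.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.kernel_approximation import RBFSampler

rbf = RBFSampler(gamma=1.0, random_state=0)        # approximate RBF kernel features
clf = SGDClassifier(loss="hinge", penalty="l2")    # hinge + L2 behaves like a linear SVM

x_batch = np.random.rand(10, 5)                    # stand-in for a batch of article features
y_batch = np.random.randint(0, 2, size=10)         # stand-in for interesting/not-interesting labels

x_feat = rbf.fit_transform(x_batch)                # fit the sampler on the first batch
clf.partial_fit(x_feat, y_batch, classes=[0, 1])   # all classes are required on the first call
# later batches: clf.partial_fit(rbf.transform(next_x), next_y)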
One of my specifications is that it should continuously update to changing trends.
This is referred to as concept drift, and it will not be handled well by a simple online SVM. Using the PassiveAggressive classifier will likely give you better results, as its learning rate does not decrease over time.
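A similarly minimal sketch of the PassiveAggressive option (again with placeholder data), since it follows the same partial_fit pattern:
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

pa = PassiveAggressiveClassifier()
x_batch = np.random.rand(10, 5)                    # placeholder feature vectors
y_batch = np.random.randint(0, 2, size=10)         # placeholder binary labels
pa.partial_fit(x_batch, y_batch, classes=[0, 1])   # full label set needed on the first call
# later updates: pa.partial_fit(next_x, next_y)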
Assuming you get feedback while training/running, you can attempt to detect decreases in accuracy over time and begin training a new model when accuracy starts to drop (switching to the new one once you believe it has become more accurate); a rough sketch of this idea appears at the end of this answer. JSAT has two drift-detection methods (see jsat.driftdetectors) that can be used to track accuracy and alert you when it has changed.
It also has more online linear and kernel methods.
(bias note: I'm the author of JSAT).
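To illustrate the accuracy-drop idea above in Python terms (this is a hand-rolled sketch, not JSAT's drift detectors; the window size and the 5% threshold are arbitrary):
from collections import deque

window = deque(maxlen=200)          # recent correct/incorrect feedback flags
best_acc = None

def drift_suspected(prediction_was_correct):
    """Record one piece of feedback and return True if accuracy has dropped."""
    global best_acc
    window.append(1.0 if prediction_was_correct else 0.0)
    if len(window) < window.maxlen:
        return False                               # not enough feedback yet
    acc = sum(window) / len(window)
    if best_acc is None or acc > best_acc:
        best_acc = acc                             # track the best accuracy seen so far
        return False
    return acc < best_acc - 0.05                   # flag a 5% drop as possible drift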

Maybe it's me being naive, but I think it is worth mentioning how to actually update the scikit-learn SGD classifier when you present your data incrementally:
from sklearn import linear_model

clf = linear_model.SGDClassifier()

x1 = some_new_data
y1 = the_labels
clf.partial_fit(x1, y1, classes=[0, 1])  # the full label set (here assumed binary 0/1) is required on the first call

x2 = some_newer_data
y2 = the_labels
clf.partial_fit(x2, y2)

Technical aspects
The short answer is no. The sklearn implementation (as well as most existing ones) does not support online SVM training. It is possible to train an SVM incrementally, but it is not a trivial task.
If you want to limit yourself to the linear case, then the answer is yes, as sklearn provides you with Stochastic Gradient Descent (SGD), which has the option to minimize the SVM criterion.
You can also try out the pegasos library instead, which supports online SVM training.
Theoretical aspects
The problem of trend adaptation is currently very popular in the ML community. As @Raff stated, it is called concept drift, and there are numerous approaches to it, often meta-models that analyze how the trend is behaving and adjust the underlying ML model (for example by forcing it to retrain on a subset of the data). So you have two independent problems here:
the online training issue, which is purely technical and can be addressed with SGD or with libraries other than sklearn;
concept drift, which is currently a hot topic and has no "just works" answer. There are many possibilities, hypotheses, and proofs of concept, but no single, generally accepted way of dealing with this phenomenon; in fact, many PhD dissertations in ML are currently based on this issue.

SGD for batch learning tasks normally has a decreasing learning rate and goes over the training set multiple times. So, for purely online learning, make sure learning_rate is set to 'constant' in sklearn.linear_model.SGDClassifier() and set eta0 to 0.1 or any desired value. The process is then as follows:
import sklearn.linear_model

clf = sklearn.linear_model.SGDClassifier(learning_rate='constant', eta0=0.1, shuffle=False, max_iter=1)  # older sklearn versions use n_iter instead of max_iter
# get x1, y1 as a new instance
clf.partial_fit(x1, y1, classes=[0, 1])  # the full label set (here assumed binary 0/1) is required on the first call
# get x2, y2
# update accuracy if needed
clf.partial_fit(x2, y2)

A way to scale an SVM could be to split your large dataset into batches that can be safely consumed by an SVM algorithm, then find the support vectors for each batch separately, and finally build the resulting SVM model on a dataset consisting of all the support vectors found in all the batches.
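A rough sketch of that batching idea, assuming the whole dataset is available as NumPy arrays X and y (the batch size and SVC settings are illustrative, and each batch needs to contain both classes):
import numpy as np
from sklearn.svm import SVC

def svm_from_batched_support_vectors(X, y, batch_size=5000):
    sv_X, sv_y = [], []
    for start in range(0, len(X), batch_size):
        bX, by = X[start:start + batch_size], y[start:start + batch_size]
        batch_svm = SVC(kernel="rbf").fit(bX, by)
        sv_X.append(bX[batch_svm.support_])        # keep only this batch's support vectors
        sv_y.append(by[batch_svm.support_])
    # final model trained on the union of all support vectors
    return SVC(kernel="rbf").fit(np.vstack(sv_X), np.concatenate(sv_y))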
Updating to trends could be achieved by maintaining a time window each time you run your training pipeline. For example, if you train once a day and there is enough information in a month's historical data, create your training dataset from the historical data obtained over the most recent 30 days.
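A minimal sketch of the time-window idea, assuming the history lives in a pandas DataFrame with a 'timestamp' column (both the column name and the 30-day window are illustrative):
import pandas as pd

def recent_training_data(history, days=30):
    cutoff = pd.Timestamp.now() - pd.Timedelta(days=days)
    return history[history["timestamp"] >= cutoff]   # keep only the last `days` of history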

If you are interested in online learning with concept drift, here is some previous work:
Learning under Concept Drift: an Overview
https://arxiv.org/pdf/1010.4784.pdf
The problem of concept drift: definitions and related work
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.58.9085&rep=rep1&type=pdf
A Survey on Concept Drift Adaptation
http://www.win.tue.nl/~mpechen/publications/pubs/Gama_ACMCS_AdaptationCD_accepted.pdf
MOA Concept Drift Active Learning Strategies for Streaming Data
http://videolectures.net/wapa2011_bifet_moa/
A Stream of Algorithms for Concept Drift
http://people.cs.georgetown.edu/~maloof/pubs/maloof.heilbronn12.handout.pdf
Mining Data Streams with Concept Drift
http://www.cs.put.poznan.pl/dbrzezinski/publications/ConceptDrift.pdf
Analyzing time series data with stream processing and machine learning
http://www.ibmbigdatahub.com/blog/analyzing-time-series-data-stream-processing-and-machine-learning

Related

Regression problem optimization using ML or DL

I have some data (from sensors, etc.) from an energy system. Consider that the x-axis is temperature and the y-axis is energy consumption. Suppose we just have the data and don't have access to the mathematical formulation of the problem:
[figure: energy consumption vs. temperature curve]
In the above figure, it is absolutely obvious that the optimum point is 20. I want to predict the optimum point using ML or DL models. Based on the courses that I have taken, I know that it is a supervised regression problem; however, I don't know how I can do optimization on this kind of problem.
I don't want you to write code for this problem; I just want some hints and instructions on how to approach it.
Also, if you can recommend any references or courses on how to predict the optimum point of a supervised regression problem without knowing its mathematical formulation, I would welcome them.
There are lots of approaches you can try when it comes to optimizing your model, for example fine-tuning it. With fine-tuning, you try the different options a model offers and look for the smallest errors or highest accuracy based on the actual and predicted data.
With a DecisionTreeRegressor model, you can try different split criteria and limit the minimum number of samples per split and the maximum depth to see which give the best scores/errors. For a neural network model using Keras, you can try different optimizers and loss functions, tune your parameters, etc., and try out different combinations; a rough sketch of this kind of tuning is shown just below.
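As a hedged sketch of that kind of tuning (the toy data, the parameter ranges, and the idea of reading the optimum off the fitted curve are illustrative, not part of the original answer):
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

X = np.random.rand(200, 1) * 40                       # toy temperatures in [0, 40]
y = (X.ravel() - 20) ** 2 + np.random.randn(200)      # toy consumption with a minimum near 20

grid = GridSearchCV(
    DecisionTreeRegressor(),
    param_grid={"max_depth": [3, 5, 10], "min_samples_split": [2, 10, 50]},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)

# Estimate the optimum point from the fitted model by scanning a grid of temperatures.
temps = np.linspace(0, 40, 401).reshape(-1, 1)
print(temps[np.argmin(grid.predict(temps))])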
As for resources, you can go to Google, YouTube, and other platforms and use keywords such as "fine-tuning DNN model", and a lot of resources will pop up for your reference. The bottom line is that you will need to try out different models and fine-tune them until you are satisfied with your results. The results are based on your judgement and there are no right or wrong answers (errors are always there); it is completely up to you how you would like to reach your solution with the handful of ML and DL models that you have. My advice is to spend more time getting your hands dirty. It will be worth it in the long run. HFGL!

How to study the effect of each data on a deep neural network model?

I'm working on a training a neural network model using Python and Keras library.
My model's test accuracy is very low (60.0%) and I have tried a lot to raise it, but I couldn't. I'm using the DEAP dataset (32 participants in total) to train the model. The splitting technique I'm using is fixed: 28 participants for training, 2 for validation, and 2 for testing.
The model I'm using is as follows:
sequential model
Optimizer = Adam
With L2_regularizer, Gaussian noise, dropout, and Batch normalization
Number of hidden layers = 3
Activation = relu
Compile loss = categorical_crossentropy
initializer = he_normal
Now I'm using a train-test technique (also fixed) to split the data, and I got better results. However, I figured out that some of the participants affect the training accuracy negatively. Thus, I want to know if there is a way to study the effect of each participant's data on the accuracy (performance) of the model.
Best Regards,
This comes from my "Starting deep learning hands-on: image classification on CIFAR-10" tutorial, in which I insist on keeping track of both:
global metrics (log-loss, accuracy),
examples (correctly and incorrectly classified cases).
The latter may help us tell which kinds of patterns are problematic, and on numerous occasions it has helped me change the network (or supplement the training data, where appropriate).
Here is an example of how it works (with Neptune, though you can do it manually in a Jupyter Notebook or using the TensorBoard image channel).
You then look at particular examples along with their predicted probabilities; a minimal manual version is sketched below.
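Since the original screenshots are not reproduced here, a manual version (assuming a trained Keras classifier `model` and test arrays `x_test`, `y_test` with integer labels; all of these names are placeholders) might look like:
import numpy as np

probs = model.predict(x_test)                     # predicted class probabilities
preds = probs.argmax(axis=1)
wrong = np.where(preds != y_test)[0]              # indices of misclassified cases
# sort the mistakes by how confident the (wrong) prediction was
worst_first = wrong[np.argsort(-probs[wrong, preds[wrong]])]
print(worst_first[:10])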
Full disclaimer: I collaborate with deepsense.ai, the creators of Neptune - Machine Learning Lab.
This is perhaps a broader answer than you may like, but I hope it will be useful nevertheless.
Neural networks are great. I like them. But the vast majority of top-performance, hyper-tuned models are ensembles; they use a combination of stats-on-crack techniques, neural networks among them. One of the main reasons for this is that some techniques handle some situations better. In your case, you've run into a situation for which I'd recommend exploring alternative techniques.
In the case of outliers, rigorous value analyses are the first line of defense. You might also consider using principal component analysis or linear discriminant analysis. You could also try to chase them out with density estimation or nearest neighbors. There are many other techniques for handling outliers, and hopefully you'll find the tools I've pointed to easily implemented (with help from their docs); sklearn tends to readily accept data prepared for Keras.
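As a small sketch of the neighbors-based screening idea (LocalOutlierFactor here stands in for the density/nearest-neighbor approaches mentioned above; the feature matrix and contamination level are placeholders):
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.rand(320, 40)                        # stand-in for per-participant feature rows
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)                        # -1 marks likely outliers
print(np.where(labels == -1)[0])                   # indices worth inspecting by hand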

What classification model should I use? New to machine learning. Recommendation needed

the goal:
Hey guys, I'm trying to create a classification model in Python to predict when a bike-share station will have too much relative inflow or outflow per hour.
what we're working with:
The first 5 rows of my dataframe (over 200,000 rows in all) look like this, and I've assigned values 0, 1, 2 in the 'flux' column - 0 if no significant action, 1 if too much inflow, 2 if too much outflow.
And I'm thinking of using the station_name (over 300 stations), hour of day, and day of week as the predictor variables to classify 'flux'.
the model choice:
What should I go with? Naive Bayes? KNN? Random Forest? Anything else that would be a good fit? GDMs? SVMs?
FYI: the baseline prediction of always 0 is pretty high at 92.8%. Unfortunately, the accuracy of logistic regression and decision trees is right on par with that and doesn't improve it much, and KNN just takes forever.
Recommendations from those more experienced with machine learning in dealing with a classification question like this?
The Azure machine learning team has an article on how to choose algorithms which could help even if you aren't using AzureML. From that article:
How large is your training data? If your training set is small, and you're going to train a supervised classifier, then machine learning theory says you should stick to a classifier with high bias/low variance, such as Naive Bayes. These have an advantage over low bias/high variance classifiers such as kNN since the latter tends to overfit. But low bias/high variance classifiers are more appropriate if you have a larger training set because they have a smaller asymptotic error - in these cases a high bias classifier isn't powerful enough to provide an accurate model. There are theoretical and empirical results that indicate that Naive Bayes does well in such circumstances. But note that having better data and good features usually can give you a greater advantage than having a better algorithm. Also, if you have a very large dataset classification performance may not be affected as much by the algorithm you use, so in that case it's better to choose your algorithm based on such things as its scalability, speed, or ease of use.
Do you need to train incrementally or in a batched mode? If you have a lot of data, or your data is updated frequently, you probably want to use Bayesian algorithms that update well. Both neural nets and SVMs need to work on the training data in batch mode.
Is your data exclusively categorical or exclusively numeric or a mixture of both kinds? Bayesian works best with categorical/binomial data. Decision trees can't predict numerical values.
Do you or your audience need to understand how the classifier works? Bayesian or decision trees are more easily explained. It's much harder to see or explain how neural networks and SVMs classify data.
How fast does your classification need to be generated? Decision trees can be slow when the tree is complex. SVMs, on the other hand, classify more quickly since they only need to determine which side of the "line" your data is on.
How much complexity does the problem present or require? Neural nets and SVMs can handle complex non-linear classification.
Now, regarding your comment that "the baseline prediction of always 0 is pretty high at 92.8%": there are anomaly detection algorithms for exactly this situation - where the classification is highly unbalanced, with one class being an "anomaly" that occurs very rarely, like credit card fraud detection (true fraud is hopefully a very small percentage of your total dataset). In Azure Machine Learning, we use one-class support vector machine (SVM) and PCA-based anomaly detection algorithms. Hope that helps!
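A brief sketch of the one-class SVM idea using scikit-learn rather than Azure ML (the feature matrices and the nu value are placeholders):
import numpy as np
from sklearn.svm import OneClassSVM

X_normal = np.random.rand(500, 3)                  # stand-in for rows where flux == 0
oc = OneClassSVM(nu=0.05, kernel="rbf").fit(X_normal)
X_new = np.random.rand(10, 3)
print(oc.predict(X_new))                           # +1 = looks normal, -1 = looks anomalous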
With such unbalanced data, just use something other than plain accuracy for model evaluation: precision/recall/F1/confusion matrix:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
Try different models and choose the best one according to the chosen metrics on a test set.
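For example (with placeholder labels and predictions):
from sklearn.metrics import classification_report, confusion_matrix

y_true = [0, 0, 0, 1, 2, 0, 1, 0]                  # placeholder test labels
y_pred = [0, 0, 0, 0, 2, 0, 1, 0]                  # placeholder model predictions
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))       # per-class precision/recall/F1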

Random Forest for multi-label classification

I am making an application for multilabel text classification.
I've tried different machine learning algorithms.
No doubt the SVM with a linear kernel gets the best results.
I have also tried the Random Forest algorithm, and the results I obtained were very bad; both recall and precision are very low.
The fact that the linear kernel gives better results suggests that the different categories are linearly separable.
Is there any reason the Random Forest results are so low?
Random forest ensembles perform well across many domains and types of data. They are excellent at reducing error from variance and don't overfit if the trees are kept simple enough.
I would expect a forest to perform comparably to a SVM with a linear kernel.
The SVM will tend to overfit more because it does not benefit from being an ensemble.
If you are not using cross-validation of some kind, or at a minimum measuring performance on unseen data using a train/test regimen, then I could see you obtaining this type of result.
Go back and make sure performance is measured on unseen data, and it's likely you'll see the RF performing more comparably.
Good luck.
It is very hard to answer this question without looking at the data in question.
SVM does have a history of working better with text classification - but machine learning by definition is context dependent.
Consider the parameters with which you are running the random forest algorithm. What are your number and depth of trees? Are you pruning branches? Are you searching a larger parameter space for SVMs, and therefore more likely to find a better optimum?
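To make those parameter questions concrete, here is a hedged sketch comparing a shallow forest with a larger, unconstrained one on synthetic multilabel data (all settings are illustrative):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_multilabel_classification

X, y = make_multilabel_classification(n_samples=500, n_features=50, n_classes=5,
                                      random_state=0)
for params in [{"n_estimators": 100, "max_depth": 5},
               {"n_estimators": 500, "max_depth": None}]:
    rf = RandomForestClassifier(random_state=0, **params)
    print(params, cross_val_score(rf, X, y, cv=3).mean())   # subset accuracy by default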

sklearn linear regression for large data

Does sklearn.LinearRegression support online/incremental learning?
I have 100 groups of data, and I am trying to fit on all of them together. Each group has over 10000 instances and ~10 features, so constructing one huge matrix (10^6 by 10) leads to a memory error with sklearn. It would be nice if I could update the regressor each time with a batch of samples from a new group.
I found this post relevant, but the accepted solution works for online learning with a single new data point (only one instance) rather than batches of samples.
Take a look at linear_model.SGDRegressor; it learns a linear model using stochastic gradient descent.
In general, sklearn has many models that support partial_fit; they are all pretty useful on medium-to-large datasets that don't fit in RAM.
Not all algorithms can learn incrementally - that is, without seeing all of the instances at once. That said, all estimators implementing the partial_fit API are candidates for mini-batch learning, also known as "online learning".
Here is an article that goes over scaling strategies for incremental learning. For your purposes, have a look at the sklearn.linear_model.SGDRegressor class. It is truly online, so memory use and convergence rate are not affected by the batch size.
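A minimal sketch of mini-batch fitting with SGDRegressor (the random group data here is a placeholder for however each of the 100 groups is loaded; scaling the features first is usually a good idea):
import numpy as np
from sklearn.linear_model import SGDRegressor

reg = SGDRegressor()
for group in range(100):                           # one partial_fit per group of data
    X_batch = np.random.rand(10000, 10)            # placeholder: ~10000 instances, ~10 features
    y_batch = np.random.rand(10000)
    reg.partial_fit(X_batch, y_batch)              # updates the model in place
print(reg.coef_)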
