I am working on one-class classification in Python, currently using the One-Class SVM implemented in scikit-learn. My data will gradually change over time, and the older data will no longer be valid for classification; as far as I am aware, this is known as concept drift. Hence I will have to retrain my classifier whenever a new sample (or a predetermined number of samples) is added.
However, I have read that a One-Class SVM does not work well under concept drift. Why is that?
Which other one-class classification algorithm can I use instead of OSVM when my dataset exhibits concept drift?
When dealing with class imbalance, penalizing the majority class is a common practice I have come across while building machine learning models, so I often use class weights after re-sampling. LightGBM is an efficient decision-tree-based framework that is believed to handle class imbalance well, and I am using a LightGBM model for my binary classification problem. The dataset has a high class imbalance, with a ratio of 34:1.
I initially used the LightGBM classifier with the class_weight parameter. However, the LightGBM documentation says to use this parameter for multi-class problems only; for binary classification, it suggests using the is_unbalance or scale_pos_weight parameters. But with class weights I see better results, and it is also easier to tune the weights and track the model's performance than with the other two parameters.
Since the documentation recommends not using it for binary classification, are there any repercussions of using the parameter? I am getting good results with it on my test and validation data, but I wonder whether it will behave differently on real, unseen data.
Documentation recommends alternative parameters:
Use this parameter only for multi-class classification task; for binary classification task you may use is_unbalance or scale_pos_weight parameters.
https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html
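For reference, here is a minimal sketch of the two weighting options being compared (parameter names from the LightGBM Python API; the 34:1 ratio and the specific weight values are simply taken from this question):

import lightgbm as lgb

# Option 1: the class_weight dict used in the question, mapping label -> weight
clf_weighted = lgb.LGBMClassifier(class_weight={0: 1, 1: 34})

# Option 2: what the documentation recommends for binary tasks
clf_spw = lgb.LGBMClassifier(scale_pos_weight=34)  # weight applied to the positive class
clf_unb = lgb.LGBMClassifier(is_unbalance=True)    # let LightGBM derive the weight itself

# either model is then fit and evaluated as usual, e.g. clf_weighted.fit(X_train, y_train)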
I need to write a program that, given an object with certain attributes, knows how to classify it. It should learn to classify new objects by being trained on a list of known objects with known attributes.
For example, I have object A with the following attributes: a=10 and b=1. I have also trained the program so that it knows that values of 5..15 for a and 0..2 for b classify an object as label1.
As the program evolves, I need to keep training it with known data so that the attribute intervals (and hence the classification) become more accurate.
Now, I haven't got any experience with machine learning or anything of this kind, and I would like to know how I should start. I've seen plenty of tutorials, but only for text classification, and only for two-way classification (that is, positive or negative, yes or no... only two values to choose from). I would have 5-6 labels to start with, and their number will soon increase. Also, the object attributes are integers.
Any tip is highly appreciated!
Machine learning is a very broad field, so the first step is knowing exactly what you're looking for and familiarizing yourself with the subproblem you're trying to solve.
By your description, you're trying to solve a classification problem with a supervised learning approach.
I'll paraphrase a bit from here:
The classification problem consists in identifying which class an observation belongs to.
Supervised learning is a way of "teaching" a machine. Basically, an algorithm is trained through examples (i.e.: this particular object belongs to class X). After training, the machine should be able to apply its acquired knowledge to new data.
The k-NN algorithm is one of the simplest algorithms for solving this kind of problem. I suggest you familiarize yourself with it.
You have an implementation of k-NN in scikit-learn (KNeighborsClassifier). Here's a link to a tutorial on using it.
Now, answering your specific questions:
only for two-way classification (that is, positive or negative, yes or no... only two values to choose from)
k-NN can handle any (finite) number of classes, so you're fine there.
Also, the object attributes are integers
k-NN usually operates in a continuous space, so you'll have to convert those integers to floats.
Mapping the attribute values to points in that space is not a trivial problem (see data pre-processing, especially the articles on normalization, feature extraction and selection).
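To make this concrete, here is a minimal sketch of a multi-class k-NN classifier using scikit-learn's KNeighborsClassifier; the attribute values and labels are made up for illustration:

from sklearn.neighbors import KNeighborsClassifier

# each row is one object described by its attributes (a, b); integer values are fine,
# they are simply treated as points in a continuous space
X_train = [[10, 1], [12, 2], [30, 7], [28, 8], [50, 20], [52, 18]]
y_train = ["label1", "label1", "label2", "label2", "label3", "label3"]

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

print(clf.predict([[11, 1]]))  # most likely ['label1']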
I am currently in the process of designing a recommender system for text articles (a binary case of 'interesting' or 'not interesting'). One of my specifications is that it should continuously update to changing trends.
From what I can tell, the best way to do this is to make use of a machine learning algorithm that supports incremental/online learning.
Algorithms like the Perceptron and Winnow support online learning but I am not completely certain about Support Vector Machines. Does the scikit-learn python library support online learning and if so, is a support vector machine one of the algorithms that can make use of it?
I am obviously not completely tied down to using support vector machines, but they are usually the go-to algorithm for binary classification due to their all-round performance. I would be willing to change to whatever fits best in the end.
While online algorithms for SVMs do exist, it is important to specify whether you want a kernel or a linear SVM, as many efficient algorithms have been developed for the special case of linear SVMs.
For the linear case, if you use the SGD classifier in scikit-learn with the hinge loss and L2 regularization you will get an SVM that can be updated online/incrementally. You can combine this with feature transforms that approximate a kernel to get something similar to an online kernel SVM.
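As a rough sketch of that setup (the RBFSampler feature map and the variable names are my additions, not part of the answer itself):

from sklearn.linear_model import SGDClassifier
from sklearn.kernel_approximation import RBFSampler

# hinge loss + L2 penalty makes SGDClassifier optimize a linear SVM objective
svm = SGDClassifier(loss="hinge", penalty="l2")

# optional: an approximate RBF feature map so the linear model mimics a kernel SVM;
# fit the random map once (e.g. on the first batch), then only transform later batches
rbf_map = RBFSampler(gamma=1.0, random_state=0).fit(X_first_batch)

# classes must be passed on the first partial_fit call only
svm.partial_fit(rbf_map.transform(X_first_batch), y_first_batch, classes=[0, 1])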
One of my specifications is that it should continuously update to changing trends.
This is referred to as concept drift, and will not be handled well by a simple online SVM. Using the PassiveAggressive classifier will likely give you better results, as its learning rate does not decrease over time.
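In scikit-learn that would be PassiveAggressiveClassifier, which exposes the same partial_fit interface; a minimal sketch (the batch variables are placeholders):

from sklearn.linear_model import PassiveAggressiveClassifier

pa = PassiveAggressiveClassifier()
pa.partial_fit(X_batch, y_batch, classes=[0, 1])  # classes only needed on the first call
# later batches: pa.partial_fit(X_next_batch, y_next_batch)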
Assuming you get feedback while training / running, you can attempt to detect decreases in accuracy over time and begin training a new model when the accuracy starts to decrease (and switch to the new one when you believe that it has become more accurate). JSAT has 2 drift detection methods (see jsat.driftdetectors) that can be used to track accuracy and alert you when it has changed.
It also has more online linear and kernel methods.
(bias note: I'm the author of JSAT).
Maybe it's me being naive, but I think it is worth mentioning how to actually update the scikit-learn SGD classifier when you present your data incrementally:
from sklearn import linear_model

clf = linear_model.SGDClassifier()

# first batch: the full set of classes must be passed to the first partial_fit call
x1 = some_new_data
y1 = the_labels
clf.partial_fit(x1, y1, classes=all_possible_labels)

# subsequent batches only need the data and labels
x2 = some_newer_data
y2 = the_labels
clf.partial_fit(x2, y2)
Technical aspects
The short answer is no. The sklearn implementation (as well as most of the existing ones) does not support online SVM training. It is possible to train an SVM incrementally, but it is not a trivial task.
If you want to limit yourself to the linear case, then the answer is yes, as sklearn provides Stochastic Gradient Descent (SGD), which has an option to minimize the SVM criterion.
You can also try the pegasos library instead, which supports online SVM training.
Theoretical aspects
The problem of trend adaptation is currently very popular in the ML community. As @Raff stated, it is called concept drift, and there are numerous approaches to it, often meta-models that analyze "how the trend is behaving" and change the underlying ML model (for example by forcing it to retrain on a subset of the data). So you have two independent problems here:
the online training issue, which is purely technical and can be addressed by SGD or by libraries other than sklearn;
concept drift, which is currently a hot topic and has no "just works" answer. There are many possibilities, hypotheses and proofs of concept, but there is no single, generally accepted way of dealing with this phenomenon; in fact, many PhD dissertations in ML are currently based on this issue.
SGD for batch learning tasks normally uses a decreasing learning rate and goes over the training set multiple times. So, for purely online learning, make sure learning_rate is set to 'constant' in sklearn.linear_model.SGDClassifier() and set eta0 to 0.1 or any desired value. The process is then as follows:
from sklearn.linear_model import SGDClassifier

# constant learning rate; partial_fit always does a single pass, so no epoch count is needed
clf = SGDClassifier(learning_rate='constant', eta0=0.1, shuffle=False)

# get x1, y1 as a new instance; classes is required on the first partial_fit call
clf.partial_fit(x1, y1, classes=all_possible_labels)
# get x2, y2
# update accuracy if needed
clf.partial_fit(x2, y2)
A way to scale an SVM could be to split your large dataset into batches that can be safely consumed by an SVM algorithm, find the support vectors for each batch separately, and then build the resulting SVM model on a dataset consisting of all the support vectors found across the batches.
Updating to trends could be achieved by maintaining a time window each time you run your training pipeline. For example, if you train once a day and there is enough information in a month's historical data, create your training dataset from the data obtained in the most recent 30 days.
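A rough sketch of that batch-then-refit idea with scikit-learn's SVC (the splitting scheme and names are mine, purely for illustration):

import numpy as np
from sklearn.svm import SVC

def fit_svm_in_batches(X, y, n_batches=10):
    sv_X, sv_y = [], []
    for X_b, y_b in zip(np.array_split(X, n_batches), np.array_split(y, n_batches)):
        clf = SVC(kernel="rbf").fit(X_b, y_b)
        sv_X.append(X_b[clf.support_])  # keep only this batch's support vectors
        sv_y.append(y_b[clf.support_])
    # final model trained on the union of all support vectors found in the batches
    return SVC(kernel="rbf").fit(np.concatenate(sv_X), np.concatenate(sv_y))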
If you are interested in online learning with concept drift, here is some previous work:
Learning under Concept Drift: an Overview
https://arxiv.org/pdf/1010.4784.pdf
The problem of concept drift: definitions and related work
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.58.9085&rep=rep1&type=pdf
A Survey on Concept Drift Adaptation
http://www.win.tue.nl/~mpechen/publications/pubs/Gama_ACMCS_AdaptationCD_accepted.pdf
MOA Concept Drift Active Learning Strategies for Streaming Data
http://videolectures.net/wapa2011_bifet_moa/
A Stream of Algorithms for Concept Drift
http://people.cs.georgetown.edu/~maloof/pubs/maloof.heilbronn12.handout.pdf
Mining Data Streams with Concept Drift
http://www.cs.put.poznan.pl/dbrzezinski/publications/ConceptDrift.pdf
Analyzing time series data with stream processing and machine learning
http://www.ibmbigdatahub.com/blog/analyzing-time-series-data-stream-processing-and-machine-learning
I am doing machine learning using scikit-learn as recommended in this question. To my surprise, it does not appear to provide access to the actual models it trains. For example, if I create an SVM, linear classifier or even a decision tree, it doesn't seem to provide a way for me to see the parameters selected for the actual trained model.
Seeing the actual model is useful if the model is being created partly to get a clearer picture of what features it is using (e.g., decision trees). Seeing the model is also a significant issue if one wants to use Python to train the model and some other code to actually implement it.
Am I missing something in scikit-learn, or is there some way to get at this? If not, what is a good free machine learning workbench, not necessarily in Python, in which models are transparently available?
The fitted model parameters are stored directly as attributes on the model instance. There is a specific naming convention for these fitted parameters: they all end with a trailing underscore, as opposed to user-provided constructor parameters (a.k.a. hyperparameters), which don't.
The type of the fitted attributes is algorithm-dependent. For instance, for a kernel Support Vector Machine you will have arrays of support vectors, dual coefficients and intercepts, while for random forests and extremely randomized trees you will have a collection of binary trees (internally represented in memory as contiguous numpy arrays for performance reasons: a structure-of-arrays representation).
See the Attributes section of the docstring of each model for more details, for instance for SVC:
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
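For example, a short sketch of reading those trailing-underscore attributes from a fitted SVC (toy data made up here):

from sklearn.svm import SVC

X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y = [0, 0, 1, 1]

clf = SVC(kernel="rbf").fit(X, y)

print(clf.support_vectors_)  # the support vectors themselves
print(clf.dual_coef_)        # dual coefficients of the support vectors
print(clf.intercept_)        # intercept of the decision function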
For tree-based models you also have a helper function, export_graphviz, to generate a Graphviz export of the learned trees:
http://scikit-learn.org/stable/modules/tree.html#classification
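A minimal usage sketch of that helper (the toy data and output file name are arbitrary):

from sklearn.tree import DecisionTreeClassifier, export_graphviz

X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y = [0, 0, 1, 1]

tree = DecisionTreeClassifier().fit(X, y)
export_graphviz(tree, out_file="tree.dot")  # render with e.g. `dot -Tpng tree.dot -o tree.png`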
To find the importance of features in forest models you should also have a look at the compute_importances parameter; see for instance the following examples:
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#example-ensemble-plot-forest-importances-py
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances_faces.html#example-ensemble-plot-forest-importances-faces-py
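Note that in more recent scikit-learn versions the importances are simply exposed as a fitted attribute; a minimal sketch:

from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y = [0, 0, 1, 1]

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
print(forest.feature_importances_)  # one importance score per feature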
I just started playing a bit with libsvm in python and got some simple classification to work.
The problem is that I'm constructing a face detection system, and I want a very low false rejection rate. The SVM, on the other hand, seems to optimize for equal false rejection and false acceptance rates. What options do I have here?
And as I said earlier, I'm very new to libSVM, so be kind. ;)
SVMs are not usually thought of as a probabilistic model, but as a maximally-discriminant model. Thus I have a hard time formulating your question in the context of what I know of SVMs.
In addition, the Python bindings that come with libSVM are not terribly performant and don't expose all the options of libSVM.
That said, if you are willing to look at other bindings, scikit-learn's SVM bindings are richer and expose some parameters that may come in handy, such as weighted classes or weighted samples. You might be able to put more emphasis on the class you do not want misclassified.
In addition, the scikit bindings expose a posterior classification probability, but in the case of SVMs I believe it relies on a workaround in libSVM (as SVMs are not probabilistic) that resamples the data to obtain a confidence estimate for the prediction.
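As an illustration of those options (the weights and variable names are made up; a weight of 10 on the face class is just an example):

from sklearn.svm import SVC

# penalize mistakes on class 1 (faces) ten times more than on class 0,
# and enable libSVM's probability estimates (slower; uses internal cross-validation)
clf = SVC(kernel="rbf", class_weight={0: 1, 1: 10}, probability=True)
clf.fit(X_train, y_train)

print(clf.predict_proba(X_test))  # per-class posterior probability estimates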
I've been using the Python wrapper for libSVM and found I could compute a confidence measure using the margin; see the predict_values_raw call below. It returns a real value: large positive values indicate high confidence that the sample IS a class member, large negative values indicate high confidence that it IS NOT, and values close to zero indicate the classifier is not confident either way. So instead of calling predict, call predict_values_raw and apply a low threshold (e.g. -2) to ensure you don't reject any true faces.
# Begin pseudo-code
import svm as svmlib

# build the problem and train an RBF-kernel C-SVC model
prob = svmlib.svm_problem(labels, data)
param = svmlib.svm_parameter(svm_type=svmlib.C_SVC, kernel_type=svmlib.RBF)
model = svmlib.svm_model(prob, param)

# get the raw margin value as a confidence measure
model.predict_values_raw(sample_to_classify)
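For comparison, the same margin-as-confidence idea with scikit-learn's SVC would look roughly like this (variable names are placeholders; the -2 threshold is the one suggested above):

from sklearn.svm import SVC

clf = SVC(kernel="rbf").fit(X_train, y_train)

# signed distance to the decision boundary: large positive = confident it IS a face,
# large negative = confident it is NOT, near zero = uncertain
margins = clf.decision_function(X_test)
is_face = margins > -2  # low threshold keeps the false rejection rate low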