Too Much Data for SVM? - python

So I'm running a SVM classifier (with a linear kernel and probability false) from sklearn on a dataframe with about 120 features and 10,000 observations. The program takes hours to run and keeps crashing due to exceeding computational limits. Just wondering if this dataframe is perhaps too large?

In short no, this is not too big at all. Linear svm can scale much further. The libSVC library on the other hand cannot. The good thing, even in scikit-learn you do have large scale svm implementation - LinearSVC which is based on liblinear. You can also solve it using SGD (also available in scikitlearn) which will converge for much bigger datasets as well.

You could try changing the parameters for the algorithm.
Tips on practical use from the documentation.
You could try a different algorithm, here's a cheat sheet you might find helpful:

The implementation is based on libsvm. The fit time complexity is more
than quadratic with the number of samples which makes it hard to scale
to dataset with more than a couple of 10000 samples.
The offical data about sklearn svm told the theshold is 10,000 samples
so SGD could be a better try.

Related

SciKit One-class SVM classifier training time increases exponentially with size of training data

I am using the Python SciKit OneClass SVM classifier to detect outliers in lines of text. The text is converted to numerical features first using bag of words and TF-IDF.
When I train (fit) the classifier running on my computer, the time seems to increase exponentially with the number of items in the training set:
Number of items in training data and training time taken:
10K: 1 sec, 15K: 2 sec, 20K: 8 sec, 25k: 12 sec, 30K: 16 sec, 45K: 44 sec.
Is there anything I can do to reduce the time taken for training, and avoid that this will become too long when training data size increases to a couple of hundred thousand items ?
Well scikit's SVM is a high-level implementation so there is only so much you can do, and in terms of speed, from their website, "SVMs do not directly provide probability estimates, these are calculated using an expensive five-fold cross-validation."
You can increase your kernel size parameter based on your available RAM, but this increase does not help much.
You can try changing your kernel, though your model might be incorrect.
Here is some advice from http://scikit-learn.org/stable/modules/svm.html#tips-on-practical-use: Scale your data.
Otherwise, don't use scikit and implement it yourself using neural nets.
Hope I'm not too late. OCSVM, and SVM, is resource hungry, and the length/time relationship is quadratic (the numbers you show follow this). If you can, see if Isolation Forest or Local Outlier Factor work for you, but if you're considering applying on a lengthier dataset I would suggest creating a manual AD model that closely resembles the context of these off-the-shelf solutions. By doing this then you should be able to work either in parallel or with threads.
For anyone coming here from Google, sklearn has implemented SGDOneClassSVM, which "has a linear complexity in the number of training samples". It should be faster for large datasets.

Random Forest for multi-label classification

I am making an application for multilabel text classification .
I've tried different machine learning algorithm.
No doubt the SVM with linear kernel gets the best results.
I have also tried to sort through the algorithm Radom Forest and the results I have obtained have been very bad, both the recall and precision are very low.
The fact that the linear kernel to respond better result gives me an idea of the different categories are linearly separable.
Is there any reason the Random Forest results are so low?
The ensemble of the random forest performs well across many domains and types of data. They are excellent at reducing error from variance and don't over fit if trees are kept simple enough.
I would expect a forest to perform comparably to a SVM with a linear kernel.
The SVM will tend to overfit more because it does not benefit from being an ensemble.
If you are not using cross validation of some kind. At minimum measuring performance on unseen data using a test/training regimen than i could see you obtaining this type of result.
Go back and make sure performance is measured on unseen data and likelier you'll see the RF performing more comparably.
Good luck.
It is very hard to answer this question without looking at the data in question.
SVM does have a history of working better with text classification - but machine learning by definition is context dependent.
Consider the parameters by which you are running the random forest algorithm. What are your number and depth of trees, are you pruning branches? Are you searching a larger parameter space for SVMs therefore are more likely to find a better optimum.

sklearn linear regression for large data

Does sklearn.LinearRegression support online/incremental learning?
I have 100 groups of data, and I am trying to implement them altogether. For each group, there are over 10000 instances and ~ 10 features, so it will lead to memory error with sklearn if I construct a huge matrix (10^6 by 10). It will be nice if I can update the regressor each time with batch samples of new group.
I found this post relevant, but the accepted solution works for online learning with single new data (only one instance) rather than batch samples.
Take a look at linear_model.SGDRegressor, it learns a a linear model using stochastic gradient.
In general, sklearn has many models that admit "partial_fit", they are all pretty useful on medium to large datasets that don't fit in the RAM.
Not all algorithms can learn incrementally, without seeing all of the instances at once that is. That said, all estimators implementing the partial_fit API are candidates for the mini-batch learning, also known as "online learning".
Here is an article that goes over scaling strategies for incremental learning. For your purposes, have a look at the sklearn.linear_model.SGDRegressor class. It is truly online so the memory and convergence rate are not affected by the batch size.

SVM: Choosing Support Vector Machine regression termination criterion tolerence in sklearn

I am using sklearn.svr with the RBF kernel on an 80k-size dataset with 20+ variables. I was wondering how to choose the termination parameter tol. I ask because the regression does not seem to converge for certain combinations of C and gamma (2+ days before I give up). Interestingly, it converges after less than 10 minutes for certain combinations with an average run-time of approximately an hour.
Is there some sort of rule of thumb for setting this parameter? Perhaps a relationship to the standard deviation or expected value of the forecast?
Mike's answer is correct: subsampling for grid searching parameter is probably the best strategy to train SVR on medium-ish dataset sizes. SVR is not scalable so don't waste your time doing a grid search on the full dataset. Try on 1000 random sub samples, then 2000 and then 4000. Each time find the optimal values for C and gamma and try to guess how they evolve whenever you double the size of the dataset.
Also you can approximate the true SVR solution with the Nystroem kernel approximation and a linear regressor model such as SGDRegressor, LinearRegression, LassoCV or ElasticNetCV. RidgeCV is likely not to improve upon LinearRegression in the n_samples >> n_features regime.
Finally, do not forget to scale your input data by putting a MinMaxScaler or a StandardScaler before the SVR model in a Pipeline.
I would also try GradientBoostingRegressor models (although completely unrelated to SVR).
You really shouldn't use SVR on large data sets: its training algorithm takes between quadratic and cubic time. sklearn.linear_model.SGDRegressor can fit a linear regression on such datasets without trouble, so try that instead. If linear regression won't hack it, transform your data with a kernel approximation before feeding it to SGDRegressor to get a linear-time approximation of an RBF-SVM.
You may have seen the scikit learn documentation for the RBF function. Considering what C and gamma actually do and the fact that the SVR training time is at worst quadratic in the number of samples, I would try training first on a small subset of the data. By first getting a result for all parameter settings and then scaling up the amount of training data used, you might find you actually only need a small sample of the data to get results very close to the full set.
This is the advice I was given by my MSc project supervisor recently, as I had the exact same problem. I found that out of a set of 120k examples with 250 features I only needed around 3000 samples to get within 2% of the error of the full set models.
Sorry this isn't answering your question directly, but I thought it might help.

How many features can scikit-learn handle?

I have a csv file of [66k, 56k] size (rows, columns). Its a sparse matrix. I know that numpy can handle that size a matrix. I would like to know based on everyone's experience, how many features scikit-learn algorithms can handle comfortably?
Depends on the estimator. At that size, linear models still perform well, while SVMs will probably take forever to train (and forget about random forests since they won't handle sparse matrices).
I've personally used LinearSVC, LogisticRegression and SGDClassifier with sparse matrices of size roughly 300k × 3.3 million without any trouble. See #amueller's scikit-learn cheat sheet for picking the right estimator for the job at hand.
Full disclosure: I'm a scikit-learn core developer.
Some linear model (Regression, SGD, Bayes) will probably be your best bet if you need to train your model frequently.
Although before you go running any models you could try the following
1) Feature reduction. Are there features in your data that could easily be removed? For example if your data is text or ratings based there are lots known options available.
2) Learning curve analysis. Maybe you only need a small subset of your data to train a model, and after that you are only fitting to your data or gaining tiny increases in accuracy.
Both approaches could allow you to greatly reduce the training data required.

Categories

Resources