How to do hyperparameter optimization on large data? - Python

I have almost finished my time series model and collected enough data, and now I am stuck at hyperparameter optimization.
After lots of googling I found a new and good library called ultraopt, but the problem is: how large a fragment of my total data (~150 GB) should I use for hyperparameter tuning? I also want to try lots of algorithms and combinations, so is there a faster and easier way?
Or
Is there any math involved, something like:
mydata = 100% of the data size
hyperparameter optimization with 5% of mydata,
then reuse the optimized hyperparameters on the remaining 95% of the data  # something like this
to get a similar result as using the full data for optimization at once. Is there any shortcut for this?
I am using Python 3.7,
CPU: AMD Ryzen 5 3400G,
GPU: AMD Vega 11,
RAM: 16 GB

Hyperparameter tuning is typically done on the validation set of a train-val-test split, where each split will have something along the lines of 70%, 10%, and 20% of the entire dataset, respectively. As a baseline, random search can be used, while Bayesian optimization with Gaussian processes has been shown to be more compute-efficient. scikit-optimize is a good package for this.
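For instance, a minimal sketch of a Bayesian search with scikit-optimize's BayesSearchCV, run on a small subsample of the data; the estimator, parameter ranges, synthetic stand-in data and 5% subsample size are illustrative assumptions, not a recommendation:

# Sketch: Bayesian hyperparameter search on a small subsample of the data.
from skopt import BayesSearchCV
from skopt.space import Real, Integer
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=50_000, n_features=20, random_state=0)  # stand-in for your data
X_small, _, y_small, _ = train_test_split(X, y, train_size=0.05, random_state=0)  # tune on ~5%

search = BayesSearchCV(
    estimator=GradientBoostingRegressor(),
    search_spaces={
        "learning_rate": Real(1e-3, 1e-1, prior="log-uniform"),
        "n_estimators": Integer(50, 300),
        "max_depth": Integer(2, 8),
    },
    n_iter=32,   # number of hyperparameter settings evaluated
    cv=3,
    random_state=0,
)
search.fit(X_small, y_small)
print(search.best_params_)

The best parameters found on the subsample can then be used to retrain on the full data, which is the usual workaround when tuning on everything at once is too expensive.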

A good Python library for hyperparameter tuning is Keras Tuner. You can use different tuners from this library, but for large data, as you've mentioned, Hyperband optimization can be a state-of-the-art and appropriate choice.
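A rough sketch of what a Hyperband search with Keras Tuner could look like; the architecture, search ranges and random stand-in data below are made-up placeholders:

# Sketch: Hyperband search with Keras Tuner on placeholder data.
import numpy as np
import keras_tuner as kt
from tensorflow import keras

def build_model(hp):
    model = keras.Sequential([
        keras.layers.Dense(hp.Int("units", 32, 256, step=32), activation="relu"),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer=keras.optimizers.Adam(hp.Float("lr", 1e-4, 1e-2, sampling="log")),
                  loss="mse")
    return model

tuner = kt.Hyperband(build_model,
                     objective="val_loss",
                     max_epochs=10,   # Hyperband's per-trial resource budget
                     directory="tuning",
                     project_name="timeseries")

x = np.random.rand(5_000, 20).astype("float32")   # replace with a subsample of your data
y = np.random.rand(5_000, 1).astype("float32")
tuner.search(x, y, validation_split=0.2, epochs=10)
best_hp = tuner.get_best_hyperparameters(1)[0]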

Related

Optimizing RAM usage when training a learning model

I have been working on creating and training a deep learning model for the first time. I did not have any knowledge about the subject prior to the project, and therefore my knowledge is limited even now.
I used to run the model on my own laptop, but after implementing a well-working OHE and SMOTE I simply couldn't run it on my own device anymore due to a MemoryError (8 GB of RAM). Therefore I am currently running the model on a 30 GB RAM RDP, which allows me to do so much more, I thought.
My code seems to have some horrible inefficiencies, and I wonder if they can be solved. One example is that using pandas.concat makes my model's RAM usage skyrocket from 3 GB to 11 GB, which seems very extreme; afterwards I drop a few columns, making the RAM spike to 19 GB but actually returning to 11 GB after the computation is completed (unlike the concat). I have also forced myself to stop using SMOTE for now just because its RAM usage would go up way too much.
At the end of the code, where the training happens, the model breathes its final breath while trying to fit. What can I do to optimize this?
I have thought about splitting the code into multiple parts (for example preprocessing and training), but to do so I would need to store massive datasets in a pickle, which can only reach 4 GB (correct me if I'm wrong). I have also given thought to using pre-trained models, but I truly did not understand how this process works and how to use one in Python.
P.S.: I would also like my SMOTE back if possible
Thank you all in advance!
Let's analyze the steps:
Step 1: OHE
For your OHE, the only dependence between data points is that it needs to be clear which categories exist overall. So the OHE can be broken into two steps, neither of which requires that all data points are in RAM.
Step 1.1: determine categories
Stream read your data points, collecting all the categories. It is not necessary to save the data points you read.
Step 1.2: transform data
After step 1.1, each data point can be independently converted. So stream read, convert, stream write. You only need one or very few data points in memory at all times.
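For illustration, a two-pass streaming one-hot encoding with pandas chunked CSV reads could look roughly like this; the file names, chunk size and categorical column names are assumptions:

# Sketch of the two streaming passes: collect categories, then encode chunk by chunk.
import pandas as pd

categorical_cols = ["col_a", "col_b"]   # placeholder column names

# Step 1.1: stream once to collect every category that occurs
categories = {col: set() for col in categorical_cols}
for chunk in pd.read_csv("data.csv", chunksize=100_000):
    for col in categorical_cols:
        categories[col].update(chunk[col].dropna().unique())

# Step 1.2: stream again, encode each chunk independently, append to the output file
first = True
for chunk in pd.read_csv("data.csv", chunksize=100_000):
    for col in categorical_cols:
        # fixing the category set guarantees identical dummy columns in every chunk
        chunk[col] = pd.Categorical(chunk[col], categories=sorted(categories[col]))
    encoded = pd.get_dummies(chunk, columns=categorical_cols)
    encoded.to_csv("data_encoded.csv", mode="w" if first else "a", header=first, index=False)
    first = False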
Step 1.3: feature selection
It may be worthwhile to look at feature selection to reduce the memory footprint and improve performance. This answer argues it should happen before SMOTE.
Feature selection methods based on entropy depend on all of the data. While you can probably also throw together something that streams, one approach that has worked well for me in the past is removing features that only one or two data points have, since these features definitely have low entropy and probably don't help the classifier much. This can again be done as in Steps 1.1 and 1.2, as sketched below.
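A rough sketch of that streaming count-and-drop, again over chunked reads; the file names and the "fewer than three rows" threshold are placeholders:

# Sketch: count in how many rows each one-hot column is non-zero, then drop near-empty columns.
import pandas as pd

counts = None
for chunk in pd.read_csv("data_encoded.csv", chunksize=100_000):
    chunk_counts = (chunk != 0).sum()
    counts = chunk_counts if counts is None else counts.add(chunk_counts, fill_value=0)

keep = counts[counts > 2].index   # drop features present in only one or two rows (keep the label column in a real run)
first = True
for chunk in pd.read_csv("data_encoded.csv", chunksize=100_000):
    chunk[keep].to_csv("data_selected.csv", mode="w" if first else "a", header=first, index=False)
    first = False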
Step 2: SMOTE
I don't know SMOTE well enough to give an answer, but maybe the problem has already solved itself if you do feature selection. In any case, save the resulting data to disk so you do not need to recompute it for every training run.
Step 3: training
See if the training can be done in batches or streaming (online, basically), or simply with less sampled data.
With regards to saving to disk: Use a format that can be easily streamed, like csv or some other splittable format. Don't use pickle for that.
Slightly orthogonal to your actual question: if your high RAM usage is caused by having the entire dataset in memory for training, you can eliminate that memory footprint by reading and storing only one batch at a time: read a batch, train on this batch, read the next batch, and so on.
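For example, a minimal sketch of batch-at-a-time training with Keras; the tiny model, file name and label column are placeholders, not the asker's actual setup:

# Sketch: train one chunk at a time so the full dataset never sits in RAM.
import pandas as pd
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

for epoch in range(5):
    for chunk in pd.read_csv("data_selected.csv", chunksize=10_000):
        y = chunk.pop("label").values          # assumed name of the target column
        X = chunk.values.astype("float32")
        model.train_on_batch(X, y)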

SciKit One-class SVM classifier training time increases exponentially with size of training data

I am using the Python SciKit OneClass SVM classifier to detect outliers in lines of text. The text is converted to numerical features first using bag of words and TF-IDF.
When I train (fit) the classifier running on my computer, the time seems to increase exponentially with the number of items in the training set:
Number of items in the training set vs. training time taken:
10K: 1 s, 15K: 2 s, 20K: 8 s, 25K: 12 s, 30K: 16 s, 45K: 44 s.
Is there anything I can do to reduce the time taken for training, and avoid it becoming too long when the training data size increases to a couple of hundred thousand items?
Well, scikit-learn's SVM is a high-level implementation, so there is only so much you can do. In terms of speed, from their website: "SVMs do not directly provide probability estimates, these are calculated using an expensive five-fold cross-validation."
You can increase the kernel cache size (the cache_size parameter) based on your available RAM, but this increase does not help much.
You can try changing your kernel, though the resulting model might not fit your data as well.
Here is some advice from http://scikit-learn.org/stable/modules/svm.html#tips-on-practical-use: Scale your data.
Otherwise, don't use scikit and implement it yourself using neural nets.
Hope I'm not too late. OCSVM, and SVM generally, is resource hungry, and the size/time relationship is quadratic (the numbers you show follow this). If you can, see if Isolation Forest or Local Outlier Factor work for you, but if you're considering applying this to a larger dataset I would suggest creating a manual anomaly detection model that closely resembles the context of these off-the-shelf solutions. That way you should be able to work either in parallel or with threads.
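For reference, a minimal sketch of the Isolation Forest alternative on TF-IDF features; the toy corpus and parameters are placeholders:

# Sketch: Isolation Forest as a faster outlier detector on TF-IDF features.
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["some line of text", "another ordinary line", "a very unusual line"]   # placeholder corpus
X = TfidfVectorizer().fit_transform(docs)

clf = IsolationForest(n_estimators=100, contamination="auto", n_jobs=-1, random_state=0)
labels = clf.fit_predict(X.toarray())   # +1 = inlier, -1 = outlier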
For anyone coming here from Google, sklearn has implemented SGDOneClassSVM, which "has a linear complexity in the number of training samples". It should be faster for large datasets.
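A small sketch of how SGDOneClassSVM could be used, optionally with a Nystroem kernel approximation so the linear solver behaves more like the kernelized OneClassSVM; the data and parameter values are made up:

# Sketch: linear-complexity one-class SVM via SGD plus an approximate RBF kernel map.
import numpy as np
from sklearn.linear_model import SGDOneClassSVM
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline

X = np.random.RandomState(0).randn(50_000, 100)   # placeholder feature matrix

clf = make_pipeline(Nystroem(gamma=0.1, n_components=100, random_state=0),
                    SGDOneClassSVM(nu=0.05, random_state=0))
clf.fit(X)
outliers = clf.predict(X)   # +1 = inlier, -1 = outlier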

Too Much Data for SVM?

So I'm running an SVM classifier (with a linear kernel and probability=False) from sklearn on a dataframe with about 120 features and 10,000 observations. The program takes hours to run and keeps crashing due to exceeding computational limits. I'm just wondering if this dataframe is perhaps too large?
In short: no, this is not too big at all. A linear SVM can scale much further. The libsvm-based SVC implementation, on the other hand, cannot. The good thing is that even in scikit-learn you have a large-scale SVM implementation - LinearSVC, which is based on liblinear. You can also solve it using SGD (also available in scikit-learn), which will converge for much bigger datasets as well.
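For instance, a hedged sketch of swapping the libsvm-based SVC for LinearSVC at roughly the question's scale; the synthetic data just mirrors the stated shape:

# Sketch: liblinear-based LinearSVC on ~10,000 x 120 data, with scaling first.
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=10_000, n_features=120, random_state=0)

clf = make_pipeline(StandardScaler(), LinearSVC())
clf.fit(X, y)   # typically finishes in seconds at this scale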
You could try changing the parameters for the algorithm.
Tips on practical use from the documentation.
You could try a different algorithm; the scikit-learn algorithm cheat sheet in the documentation might help you pick one.
The scikit-learn documentation for SVC notes: "The implementation is based on libsvm. The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to datasets with more than a couple of 10000 samples."
So the official guidance puts the practical threshold around 10,000 samples, and SGD could be a better option beyond that.

sklearn linear regression for large data

Does sklearn.LinearRegression support online/incremental learning?
I have 100 groups of data, and I am trying to use them all together. For each group, there are over 10,000 instances and ~10 features, so it leads to a memory error with sklearn if I construct a huge matrix (10^6 by 10). It would be nice if I could update the regressor each time with batch samples from a new group.
I found this post relevant, but the accepted solution works for online learning with a single new data point (only one instance) rather than batch samples.
Take a look at linear_model.SGDRegressor; it learns a linear model using stochastic gradient descent.
In general, sklearn has many models that support partial_fit; they are all pretty useful on medium to large datasets that don't fit in RAM.
Not all algorithms can learn incrementally, that is, without seeing all of the instances at once. That said, all estimators implementing the partial_fit API are candidates for mini-batch learning, also known as "online learning".
Here is an article that goes over scaling strategies for incremental learning. For your purposes, have a look at the sklearn.linear_model.SGDRegressor class, as in the sketch below. It is truly online, so the memory footprint and convergence rate are not affected by the batch size.
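A minimal sketch of that incremental loop over the 100 groups; load_group is a hypothetical helper standing in for however you read each group from disk:

# Sketch: update SGDRegressor (and the scaler) one group at a time.
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

def load_group(i):
    # hypothetical helper; replace with code that loads group i from disk
    rng = np.random.RandomState(i)
    X = rng.randn(10_000, 10)
    return X, X @ rng.randn(10) + 0.1 * rng.randn(10_000)

scaler = StandardScaler()
reg = SGDRegressor()
for i in range(100):
    X, y = load_group(i)
    X = scaler.partial_fit(X).transform(X)   # scale incrementally as well
    reg.partial_fit(X, y)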

Does the SVM in sklearn support incremental (online) learning?

I am currently in the process of designing a recommender system for text articles (a binary case of 'interesting' or 'not interesting'). One of my specifications is that it should continuously update to changing trends.
From what I can tell, the best way to do this is to make use of a machine learning algorithm that supports incremental/online learning.
Algorithms like the Perceptron and Winnow support online learning but I am not completely certain about Support Vector Machines. Does the scikit-learn python library support online learning and if so, is a support vector machine one of the algorithms that can make use of it?
I am obviously not completely tied down to using support vector machines, but they are usually the go-to algorithm for binary classification due to their all-round performance. I would be willing to change to whatever fits best in the end.
While online algorithms for SVMs do exist, it has become important to specify if you want kernel or linear SVMs, as many efficient algorithms have been developed for the special case of linear SVMs.
For the linear case, if you use the SGD classifier in scikit-learn with the hinge loss and L2 regularization you will get an SVM that can be updated online/incrementally. You can combine this with feature transforms that approximate a kernel to get something similar to an online kernel SVM, as sketched below.
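A small sketch of that combination, assuming synthetic data and an untuned gamma:

# Sketch: online linear SVM (hinge loss + L2) with an approximate RBF kernel map.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.kernel_approximation import RBFSampler

rng = np.random.RandomState(0)
X, y = rng.randn(1_000, 20), rng.randint(0, 2, 1_000)

feature_map = RBFSampler(gamma=0.5, random_state=0).fit(X)   # approximates an RBF kernel
clf = SGDClassifier(loss="hinge", penalty="l2")              # hinge + L2 = linear SVM objective

clf.partial_fit(feature_map.transform(X), y, classes=np.array([0, 1]))   # first call needs classes
X_new, y_new = rng.randn(100, 20), rng.randint(0, 2, 100)
clf.partial_fit(feature_map.transform(X_new), y_new)                     # later online updates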
One of my specifications is that it should continuously update to changing trends.
This is referred to as concept drift, and will not be handled well by a simple online SVM. Using the PassiveAggressive classifier will likely give you better results, as its learning rate does not decrease over time.
Assuming you get feedback while training / running, you can attempt to detect decreases in accuracy over time and begin training a new model when the accuracy starts to decrease (and switch to the new one when you believe that it has become more accurate). JSAT has 2 drift detection methods (see jsat.driftdetectors) that can be used to track accuracy and alert you when it has changed.
It also has more online linear and kernel methods.
(bias note: I'm the author of JSAT).
Maybe it's me being naive, but I think it is worth mentioning how to actually update the scikit-learn SGD classifier when you present your data incrementally (the names below are placeholders for your own arrays):
from sklearn import linear_model

clf = linear_model.SGDClassifier()

x1 = some_new_data
y1 = the_labels
# the first call to partial_fit must be given the full list of possible classes
clf.partial_fit(x1, y1, classes=all_classes)

x2 = some_newer_data
y2 = the_labels
clf.partial_fit(x2, y2)
Technical aspects
The short answer is no. The sklearn implementation (as well as most existing ones) does not support online SVM training. It is possible to train an SVM in an incremental way, but it is not a trivial task.
If you want to limit yourself to the linear case, then the answer is yes, as sklearn provides you with Stochastic Gradient Descent (SGD), which has an option to minimize the SVM criterion.
You can also try out the pegasos library instead, which supports online SVM training.
Theoretical aspects
The problem of trend adaptation is currently very popular in the ML community. As @Raff stated, it is called concept drift, and there are numerous approaches, which are often kinds of meta-models that analyze "how the trend is behaving" and change the underlying ML model (for example by forcing it to retrain on a subset of the data). So you have two independent problems here:
the online training issue, which is purely technical and can be addressed with SGD or with libraries other than sklearn
concept drift, which is currently a hot topic and has no "just works" answer. There are many possibilities, hypotheses and proofs of concept, but there is no single, generally accepted way of dealing with this phenomenon; in fact many PhD dissertations in ML are currently based on this issue.
SGD for batch learning tasks normally has a decreasing learning rate and goes over the training set multiple times. So, for purely online learning, make sure learning_rate is set to 'constant' in sklearn.linear_model.SGDClassifier() and eta0=0.1 or any desired value. The process is then as follows (x1, y1, x2, y2 and all_classes are placeholders for your own data):

from sklearn import linear_model

# note: the old n_iter argument is called max_iter in recent scikit-learn versions
clf = linear_model.SGDClassifier(learning_rate='constant', eta0=0.1, shuffle=False, max_iter=1)

# get x1, y1 as a new instance; the first partial_fit call must list all classes
clf.partial_fit(x1, y1, classes=all_classes)

# get x2, y2
# update accuracy if needed
clf.partial_fit(x2, y2)
A way to scale an SVM could be to split your large dataset into batches that can be safely consumed by an SVM algorithm, then find the support vectors for each batch separately, and finally build a resulting SVM model on a dataset consisting of all the support vectors found in all the batches, as sketched below.
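A rough sketch of that idea with scikit-learn's SVC; the synthetic data and parameters are placeholders, and the pooling is an approximation rather than an exact equivalent of training on everything at once:

# Sketch: fit an SVM per batch, pool the support vectors, fit a final SVM on the pool.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=20_000, n_features=30, random_state=0)

sv_X, sv_y = [], []
for Xb, yb in zip(np.array_split(X, 10), np.array_split(y, 10)):
    svm = SVC(kernel="rbf").fit(Xb, yb)
    sv_X.append(Xb[svm.support_])   # keep only this batch's support vectors
    sv_y.append(yb[svm.support_])

final_svm = SVC(kernel="rbf").fit(np.vstack(sv_X), np.concatenate(sv_y))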
Updating to trends could be achieved by maintaining a time window each time you run your training pipeline. For example, if you do your training once a day and there is enough information in a month's worth of historical data, create your training dataset from the historical data obtained in the most recent 30 days.
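A minimal sketch of such a sliding window with pandas, assuming a hypothetical history.csv with a timestamp column:

# Sketch: build the training set from the most recent 30 days only.
import pandas as pd

df = pd.read_csv("history.csv", parse_dates=["timestamp"])
cutoff = df["timestamp"].max() - pd.Timedelta(days=30)
train_df = df[df["timestamp"] >= cutoff]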
If you are interested in online learning with concept drift, here is some previous work:
Learning under Concept Drift: an Overview
https://arxiv.org/pdf/1010.4784.pdf
The problem of concept drift: definitions and related work
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.58.9085&rep=rep1&type=pdf
A Survey on Concept Drift Adaptation
http://www.win.tue.nl/~mpechen/publications/pubs/Gama_ACMCS_AdaptationCD_accepted.pdf
MOA Concept Drift Active Learning Strategies for Streaming Data
http://videolectures.net/wapa2011_bifet_moa/
A Stream of Algorithms for Concept Drift
http://people.cs.georgetown.edu/~maloof/pubs/maloof.heilbronn12.handout.pdf
Mining Data Streams with Concept Drift
http://www.cs.put.poznan.pl/dbrzezinski/publications/ConceptDrift.pdf
Analyzing time series data with stream processing and machine learning
http://www.ibmbigdatahub.com/blog/analyzing-time-series-data-stream-processing-and-machine-learning
