Using Ray + LightGBM with limited memory - Python

So, I would like to train a LightGBM model on a large dataset on a remote, large Ray cluster. Before that, I would like to write the code so that I can also run the training in a memory-constrained setting, e.g. my local laptop, where the dataset does not fit in memory. That will require some way of lazily loading the data.
The way I imagine it, it should be possible with Ray to load batches of random samples of the large dataset from disk (multiple .pq files) and feed them to the LightGBM training function. Memory would thereby act as a fast buffer that holds random, loaded batches which are fed to the training function and then removed from memory. Multiple workers take care of training plus the I/O for loading new samples from disk into memory. The maximum amount of memory can be constrained so as not to exceed my local resources, so that my PC doesn't crash. Is this possible?
I have not yet understood whether LightGBM needs the full dataset at once, or whether it can be fed batches iteratively, as with neural networks, for instance. So far, I have tried using the lightgbm_ray library for this:
from lightgbm_ray import RayDMatrix, RayParams, train, RayFileType

# some stuff before
...

# make dataset
data_train = RayDMatrix(
    data=filenames,
    label=TARGET,
    feature_names=features,
    filetype=RayFileType.PARQUET,
    num_actors=2,
    lazy=True,
)

# feed to training function
evals_result = {}
bst = train(
    params_model,
    data_train,
    evals_result=evals_result,
    valid_sets=[data_train],
    valid_names=["train"],
    verbose_eval=False,
    ray_params=RayParams(num_actors=2, cpus_per_actor=2),
)
I thought the lazy=True keyword might take care of this; however, when executing it, I see the memory being maxed out and then my app crashes.
Thanks for any advice!

LightGBM requires loading the entire dataset for training, so in this case you can test on your laptop with a subset of the data (i.e. only pass a subset of the parquet filenames in).
The lazy=True flag delays the data loading so that it is split across the actors, rather than being loaded into memory first and then split and sent to the actors. However, this still loads the entire dataset into memory, since all actors are on the same (local) node.
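For example, a minimal sketch of a local test run that only reads a handful of the parquet files (reusing filenames, TARGET, and features from your snippet; the number of files is arbitrary):

# Local smoke test: only pass a few of the .pq files so the data fits in memory.
subset_filenames = filenames[:5]  # arbitrary small subset

data_train_small = RayDMatrix(
    data=subset_filenames,
    label=TARGET,
    feature_names=features,
    filetype=RayFileType.PARQUET,
)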
Additionally, when you do move to running on the remote cluster, these tips might be helpful to optimize memory usage: https://docs.ray.io/en/latest/train/gbdt.html?highlight=xgboost%20memro#how-to-optimize-xgboost-memory-usage.

Related

How can I use xgboost.dask with gpu to model a very large dataset in both a distributed and batched manner?

I would like to utilise multiple GPUs spread across many nodes to train an XGBoost model on a very large dataset within Azure Machine Learning, using 3 NC12s_v3 compute nodes. The dataset size exceeds both VRAM and RAM when persisted into Dask, but comfortably fits on disk. However, XGBoost's Dask module seems to persist all data when training (at least by default).
All data preprocessing has been handled (one hot encoding with the np.bool data type), and one can assume that I have the most efficient data types elsewhere (for example, changing np.float64 to np.float32 for decimal features, int8 for ordinal data, etc.).
Currently, I am trying to get a simple toy model working with just a training set. My code is as follows:
from dask_cuda import LocalCUDACluster
import dask
import dask.dataframe  # needed so that dask.dataframe.read_parquet is available
import xgboost as xgb
import pandas as pd
import numpy as np
import distributed
from dask.distributed import Client, wait, LocalCluster

# Start cluster and client. This is currently local, although I would like to make this distributed across many nodes.
cluster = LocalCUDACluster(n_workers=2, device_memory_limit='16 GiB')
client = distributed.Client(cluster)

# Read in training data.
train_dd = dask.dataframe.read_parquet("folder_of_training_data/*.parquet", chunksize="128 MiB")

# Split into features and target; all other preprocessing including one hot encoding has been completed.
X = train_dd[train_dd.columns.difference(['label'])]
y = train_dd['label']

# Delete dask dataframe to free up memory.
del train_dd

# Create DaskDMatrix from X, y inputs.
dtrain = xgb.dask.DaskDMatrix(client, X, y)

# Delete X and y to free up memory.
del X
del y

# Create watchlist for input into xgb train method.
watchlist = [(dtrain, 'train')]

# Train toy booster for 10 rounds with subsampling and gradient-based sampling to reduce memory requirements.
bst = xgb.dask.train(
    client,
    {
        'predictor': 'gpu_predictor',
        'tree_method': 'gpu_hist',
        'verbosity': 2,
        'objective': 'binary:logistic',
        'sampling_method': 'gradient_based',
        'subsample': 0.1
    },
    dtrain,
    num_boost_round=10,
    evals=watchlist
)

del dtrain

print("History:", str(bst['history']))
With the above on a single node containing 2 GPUs, I can only load up to 32 GB at a time (the limit of VRAM).
From my current code, I have a few questions:
Is there any way I can stop XGBoost from persisting all data into memory, instead perhaps working through partitions in batches?
Is there any way I can get Dask to natively handle the batch process rather than manually performing for example, incremental learning?
In the docs they mention that you can use external memory mode along with their distributed mode. Assuming I had libsvm files, how would I go about this with multiple nodes & multiple GPUs?
How can I alter my code above such that I can work with more than one node?
Bonus question: Assuming either there is a way to batch process with xgboost.dask, how can I integrate this with RAPIDS for processing purely on GPUs?
Currently, Dask-XGBoost expects your data to be resident in GPU memory. Experimental work is in progress on both out-of-core spilling to host memory and incremental data loading, but it's probably too early to deploy one of those.
Which version of XGBoost are you using? XGBoost 1.1 significantly reduces GPU memory usage in many cases, so that may be worth a shot.
However, since you are on a cloud provider, spinning up more GPUs is probably the easiest option (and if it makes your training faster, you can shut down those instances faster and keep your cost stable).
Basically, you will start a dask scheduler on your master node, then start Dask GPU workers on each of the worker nodes. (E.g. by running the dask-cuda-worker command line program on each worker.)
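As a rough sketch (the scheduler address below is a placeholder, not something from your setup):

# On the master node (shell):        dask-scheduler
# On every GPU worker node (shell):  dask-cuda-worker tcp://<scheduler-ip>:8786
# Then, in the training script, connect to the remote scheduler instead of
# creating a LocalCUDACluster:
from dask.distributed import Client

client = Client("tcp://<scheduler-ip>:8786")  # placeholder scheduler address
# ...the rest of the xgb.dask code from the question stays the same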
There are several older tutorials to set up a Dask cluster on Azure, but this new tutorial video from Tom Drabas and the AzureML team may be the best starting point: https://www.youtube.com/watch?v=zyDpbH33LXE&feature=youtu.be
It leverages Dask cloud provider to simplify the cluster setup.

Tensorflow data pipeline: Slow with caching to disk - how to improve evaluation performance?

I've built a data pipeline. Pseudo code is as follows:
dataset ->
dataset = augment(dataset)
dataset = dataset.batch(35).prefetch(1)
dataset = set_from_generator(to_feed_dict(dataset)) # expensive op
dataset = Cache('/tmp', dataset)
dataset = dataset.unbatch()
dataset = dataset.shuffle(64).batch(256).prefetch(1)
to_feed_dict(dataset)
Steps 1 to 5 (up to and including the cache) generate the pretrained model outputs. I cache them, as they do not change across epochs (the pretrained model weights are not updated). Steps 5 to 8 prepare the dataset for training.
Different batch sizes have to be used, as the pretrained model inputs are of a much bigger dimensionality than the outputs.
The first epoch is slow, as it has to evaluate the pretrained model on every input item to generate templates and save them to the disk. Later epochs are faster, yet they're still quite slow - I suspect the bottleneck is reading the disk cache.
What could be improved in this data pipeline to reduce the issue?
Thank you!
prefetch(1) means that only one element will be prefetched; I think you may want to make it as large as the batch size or larger.
After the first cache you could try adding a second cache() without providing a path, so that it caches (at least part of the data) in memory.
Maybe your HDD is just slow? ;)
Another idea is to manually write a compressed TFRecord after steps 1-4 and then read it with another dataset. A compressed file has lower I/O but causes higher CPU usage.
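A rough sketch of that last idea with the tf.data API (the file path is a placeholder and precomputed stands for the dataset produced by steps 1-4, already serialized to tf.train.Example strings):

import tensorflow as tf

# One-off: write the expensive pretrained-model outputs to a compressed TFRecord.
writer = tf.data.experimental.TFRecordWriter("/tmp/features.tfrecord", compression_type="GZIP")
writer.write(precomputed)  # `precomputed`: dataset of serialized tf.train.Example strings

# Training epochs: read the cached records instead of re-running the pretrained model.
dataset = tf.data.TFRecordDataset("/tmp/features.tfrecord", compression_type="GZIP")
dataset = dataset.shuffle(64).batch(256).prefetch(tf.data.experimental.AUTOTUNE)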

Optimizing RAM usage when training a learning model

I have been working on creating and training a deep learning model for the first time. I did not have any knowledge about the subject prior to this project, and therefore my knowledge is limited even now.
I used to run the model on my own laptop, but after implementing a well-working OHE and SMOTE I simply couldn't run it on my own device anymore due to a MemoryError (8 GB of RAM). Therefore I am currently running the model on a 30 GB RAM RDP, which allows me to do so much more, I thought.
My code seems to have some horrible inefficiencies, and I wonder if they can be solved. One example is that using pandas.concat makes my model's RAM usage skyrocket from 3 GB to 11 GB, which seems very extreme. Afterwards I drop a few columns, making the RAM spike to 19 GB, but it actually returns to 11 GB after the computation is completed (unlike the concat). I have also forced myself to stop using SMOTE for now, just because the RAM usage would go up way too much.
At the end of the code, where the training happens, the model breathes its last while trying to fit. What can I do to optimize this?
I have thought about splitting the code into multiple parts (for example preprocessing and training), but to do so I would need to store massive datasets in a pickle, which can only reach 4 GB (correct me if I'm wrong). I have also thought about using pre-trained models, but I truly did not understand how that works and how to use one in Python.
P.S.: I would also like my SMOTE back if possible.
Thank you all in advance!
Let's analyze the steps:
Step 1: OHE
For your OHE, the only dependence between data points is that it needs to be clear which categories exist overall. So the OHE can be broken into two steps, neither of which requires all data points to be in RAM.
Step 1.1: determine categories
Stream read your data points, collecting all the categories. It is not necessary to save the data points you read.
Step 1.2: transform data
After step 1.1, each data point can be independently converted. So stream read, convert, stream write. You only need one or very few data points in memory at all times.
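A minimal two-pass sketch of Steps 1.1 and 1.2 with pandas (the file path and column names are made up for illustration):

import pandas as pd

CAT_COLS = ["cat_a", "cat_b"]  # hypothetical categorical columns

# Pass 1 (Step 1.1): stream the file in chunks and collect the categories.
categories = {col: set() for col in CAT_COLS}
for chunk in pd.read_csv("data.csv", chunksize=100_000):
    for col in CAT_COLS:
        categories[col].update(chunk[col].dropna().unique())

# Pass 2 (Step 1.2): stream again, encode each chunk independently, append to disk.
first = True
for chunk in pd.read_csv("data.csv", chunksize=100_000):
    for col in CAT_COLS:
        chunk[col] = pd.Categorical(chunk[col], categories=sorted(categories[col]))
    encoded = pd.get_dummies(chunk, columns=CAT_COLS)
    encoded.to_csv("data_ohe.csv", mode="w" if first else "a", header=first, index=False)
    first = False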
Step 1.3: feature selection
It may be worthwhile to look at feature selection to reduce the memory footprint and improve performance. This answer argues it should happen before SMOTE.
Feature selection methods based on entropy depend on all the data. While you could probably also throw together something that streams, one approach that has worked well for me in the past is removing features that only one or two data points have, since these features definitely have low entropy and probably don't help the classifier much. This can be done again like Steps 1.1 and 1.2.
Step 2: SMOTE
I don't know SMOTE enough to give an answer, but maybe the problem has already solved itself if you do feature selection. In any case, save the resulting data to disk so you do not need to recompute for every training.
Step 3: training
See if the training can be done in batches or streaming (online, basically), or simply with less sampled data.
With regards to saving to disk: Use a format that can be easily streamed, like csv or some other splittable format. Don't use pickle for that.
Slightly orthogonal to your actual question: if your high RAM usage is caused by having the entire dataset in memory for training, you could eliminate that memory footprint by reading and storing only one batch at a time: read a batch, train on it, read the next batch, and so on.
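A minimal sketch of that loop, assuming a compiled Keras model named model and a CSV with a label column (both hypothetical):

import pandas as pd

for epoch in range(10):
    # Only one chunk of rows is ever held in RAM.
    for chunk in pd.read_csv("data_ohe.csv", chunksize=4096):  # placeholder path
        X_batch = chunk.drop(columns=["label"]).to_numpy()
        y_batch = chunk["label"].to_numpy()
        model.train_on_batch(X_batch, y_batch)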

Tensorflow 1.0 training model uses exponentially more space

I am using tensorflow 1.0 to train a DNNRegressor. Most of the training is already handled automatically by the new tensorflow 1.0 features. The model information is saved automatically in a folder. I call the train(filepath, isAuthentic) function repeatedly, with different training files, using a for loop.
The problem is that the events.out.tfevents files keep getting larger and larger, taking up space. I have gotten around this by deleting these files as they are generated, but the CPU still wastes incrementally more time trying to generate these files. These don't affect the results of training or predicting. Is there a way to stop these events.out.tfevents files from being generated?
I've noticed that when I run the python program for a long period, the events.out.tfevents file sizes start small and then get large, but if I run the training for several periods of shorter intervals, the file sizes stay small.
[picture of model folder, contents ordered by size]
When I let the training run long enough, the events.out.tfevents reaches over 200 MB, wasting much time and space. I have already tried changing the checkpoint and summary parameters in a RunConfig object passed to the DNNRegressor.
def getRegressor():
    feature_cols = [tf.contrib.layers.real_valued_column(k) for k in networkSetup.FEATURES]
    # Build 2 layer fully connected DNN with 8, 8 units respectively.
    regressor = tf.contrib.learn.DNNRegressor(feature_columns=feature_cols,
                                              hidden_units=[8, 8],
                                              model_dir=networkSetup.MODEL_DIR,
                                              activation_fn=tf.nn.sigmoid,
                                              optimizer=tf.train.GradientDescentOptimizer(
                                                  learning_rate=0.001
                                              ))
    return regressor

def train(filepath, isAuthentic):
    regressor = getRegressor()
    # training on training set
    regressor.fit(input_fn=lambda: input_fn(filepath, isAuthentic), steps=1)
.tfevents files contain the events written by the fit method. DNNRegressor saves at least histograms and the fraction of zeros for each hidden layer. You can use TensorBoard to view them.
TensorFlow doesn't overwrite event files; instead it appends to them, so a bigger file size doesn't mean more CPU cycles.
You can pass a config parameter to the DNNRegressor constructor (a RunConfig instance) and specify how often summaries should be saved using its save_summary_steps property. The default is to save a summary every 100 steps.
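A sketch of wiring that into the getRegressor from the question (the 10,000-step interval is just an arbitrary example, and this assumes the tf.contrib.learn RunConfig of TF 1.x):

run_config = tf.contrib.learn.RunConfig(save_summary_steps=10000)  # write summaries far less often

regressor = tf.contrib.learn.DNNRegressor(feature_columns=feature_cols,
                                          hidden_units=[8, 8],
                                          model_dir=networkSetup.MODEL_DIR,
                                          activation_fn=tf.nn.sigmoid,
                                          optimizer=tf.train.GradientDescentOptimizer(
                                              learning_rate=0.001
                                          ),
                                          config=run_config)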
To prevent TensorFlow from creating the events.out file, you just have to comment out the part of the code that writes this file each time a new user trains the model.
In all the models, there are writers in the main class that create these summaries/logs for further analysis of the data, although they are not useful in many cases.
Sample code lines from TensorFlow Inception's retrain.py:
train_writer = tf.summary.FileWriter(FLAGS.summaries_dir + '/train',sess.graph)
validation_writer = tf.summary.FileWriter(FLAGS.summaries_dir + '/validation')
Just comment out the part of the code creating the events.out file and you are done.
That's because of the newly generated graph files for TensorBoard.
tfFileWriter = tf.summary.FileWriter(os.getcwd())
tfFileWriter.add_graph(sess.graph)
tfFileWriter.close()
Comment out these lines if you find them in your code and the files will go away.

Multi-processing on Large Image Dataset in Python

I have a very large image dataset (>50 GB, single images in a folder) for training. To make loading of the images more efficient, I first load parts of the images into RAM and then send small batches to the GPU for training.
I want to further speed up the data preparation process before feeding the images to the GPU and was thinking about multiprocessing. But I'm not sure how I should do it. Any ideas?
For speed I would advise using HDF5 or LMDB:
I have successfully used ml-pyxis for creating deep learning datasets using LMDBs.
It allows you to create binary blobs (LMDB), and they can be read quite fast.
The link above comes with some simple examples of how to create and read the data, including Python generators/iterators.
For multiprocessing:
I personally work with Keras, and by using a Python generator it is possible to train with multiprocessing for the data loading using the fit_generator method.
fit_generator(self, generator, samples_per_epoch,
              nb_epoch, verbose=1, callbacks=[],
              validation_data=None, nb_val_samples=None,
              class_weight={}, max_q_size=10, nb_worker=1,
              pickle_safe=False)
Fits the model on data generated batch-by-batch by a Python generator. The generator is run in parallel to the model for efficiency. For instance, this allows you to do real-time data augmentation on images on the CPU in parallel to training your model on the GPU. You can find the source code here, and the documentation here.
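A small sketch of that pattern, matching the (older) signature quoted above; model, load_images, train_paths and train_labels are placeholders:

def batch_generator(image_paths, labels, batch_size=32):
    # Yield (X, y) batches forever, loading images from disk on the fly.
    while True:
        for start in range(0, len(image_paths), batch_size):
            X = load_images(image_paths[start:start + batch_size])  # hypothetical loader
            y = labels[start:start + batch_size]
            yield X, y

model.fit_generator(batch_generator(train_paths, train_labels),
                    samples_per_epoch=len(train_paths),
                    nb_epoch=10,
                    max_q_size=10,
                    nb_worker=4,       # several worker processes prepare batches
                    pickle_safe=True)  # needed for process-based workers in this Keras version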
I don't know whether you prefer tensorflow/keras/torch/caffe or something else.
Multiprocessing is simply using multiple GPUs.
Basically, you are trying to leverage more hardware by delegating or spawning one child process for every GPU and letting them do their magic. The example above is for logistic regression.
Of course, you would be more keen on looking into ConvNets:
This LSU material (pgs 48-52 [slides 11-14]) builds some intuition.
Keras is yet to officially provide support, but you can "proceed at your own risk".
For multiprocessing, TensorFlow is a better way to go about this (in my opinion).
In fact, they have some good documentation on it too.
