I am using a subset of the PlantVillage (image) dataset on my Google Drive and trying to train CNN models on that data from Google Colab (with a GPU, of course). The problem is that the first epoch of training goes very slowly because the data is being loaded into the GPU for the first time; the later epochs run much faster and in a predictable amount of time. Is it possible to do this loading before training so it is excluded from it? I want to %%time my training, and having this extra loading time counted in my training run messes things up.
I use Tensorflow and Keras applications for data preprocessing and model training.
You can use Dataset.cache() and Dataset.prefetch(): cache() keeps the data in memory after it has been loaded from disk once, and prefetch() overlaps data preparation with training, which makes the model train comparatively faster.
Check the code below:
AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
Please have a look at this link for your reference.
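If you also want to keep that one-time cache fill out of the %%time measurement, one option (just a sketch, assuming train_ds and val_ds are the cached datasets from above) is to iterate over each dataset once before the timed cell, so the cache is already populated when model.fit() runs:
# Warm-up pass: reads every element once and fills the in-memory cache
# created by .cache() above, so the timed training run no longer pays
# the first-load cost.
for _ in train_ds:
    pass
for _ in val_ds:
    pass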
I'm training a wide and deep autoencoder (21 layers, ~500 features) in TensorFlow on GCP. I have around 30 million samples, which add up to about 55 GB of raw TF proto files.
My training is extremely slow. With 128 Tesla A100 GPUs using MultiWorkerMirroredStrategy (plus reduction servers) and a batch size of 256 per replica, performance is about 1 hour per epoch.
My dashboard reports that my GPUs are at <1% GPU utilization but ~100% GPU memory utilization (see screenshot). This tells me something is wrong.
However, I've been debugging this for weeks now, and I've honestly exhausted all my hypotheses. I'm beginning to think perhaps it's just supposed to be slow like this.
Q: I understand that this is not a well-formed question, but what are some possibilities as to why GPU memory utilization is at 100% while GPU utilization is <1%? Is it just supposed to be this slow, or is there something wrong?
Some of the things I've tried (not exhaustive):
increase batch size
remove preprocessing layer (i.e. dataset.map() calls)
increase/decrease worker count; increase/decrease attached GPU count
non-deterministic dataset reads
Some of the key highlights of my setup:
Vertex AI training using TFX, mostly following the tutorials here
ETA reported to be about 1 hour per epoch according to model.fit logs.
no custom training loop. Sequential model with Adamax optimizer.
idiomatic call to model.fit; did not tamper with performance parameters
DataAccessor call:
dataset = data_accessor.tf_dataset_factory(
    file_pattern,
    tfxio.TensorFlowDatasetOptions(
        batch_size=batch_size,
        drop_final_batch=True,
        num_epochs=1,
        shuffle=True,
        shuffle_buffer_size=1000000,
        prefetch_buffer_size=tf.data.experimental.AUTOTUNE,
        reader_num_threads=tf.data.experimental.AUTOTUNE,
        parser_num_threads=tf.data.experimental.AUTOTUNE,
        sloppy_ordering=True),
    schema=tf_transform_output.transformed_metadata.schema)

def _apply_preprocessing(x):
    # preprocessing_model is just the input layer + one-hot encoding;
    # tested to be slow with or without this.
    preprocessed_features = preprocessing_model(x)
    return preprocessed_features, preprocessed_features

dataset = dataset.map(_apply_preprocessing,
                      num_parallel_calls=tf.data.AUTOTUNE,
                      deterministic=False)

return dataset.prefetch(tf.data.AUTOTUNE)
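For reference, a quick way to check whether the input pipeline alone is the bottleneck is to time iteration over the dataset with no model attached. A rough sketch (make_dataset is a stand-in for the function whose body is shown above, not an actual name from my code):
import time
import tensorflow as tf

def benchmark(dataset, num_batches=200):
    # Time how fast batches come out of the pipeline, with no model attached.
    start = time.perf_counter()
    for _ in dataset.take(num_batches):
        pass
    elapsed = time.perf_counter() - start
    print(f"{num_batches} batches in {elapsed:.1f}s "
          f"({num_batches / elapsed:.2f} batches/s)")

# benchmark(make_dataset(file_pattern, batch_size=256))
If the batches/s here is far below what 128 A100 replicas can consume, the GPUs will sit idle waiting on data, which would match the <1% utilization.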
I am using TensorFlow to train my model and routinely save it every 10 epochs. I have a limited number of samples to train on, so I am augmenting my dataset to make a larger training dataset.
If I need to use my saved model to resume training after a power outage would it be best to resume training using the same dataset or to make a new dataset?
Your question very much depends on how you're augmenting your dataset. If your augmentation skews the statistical distribution of the underlying dataset then you should resume training with the pre-power outage dataset. Otherwise, you're assuming that your augmentation has not changed the distribution of the dataset.
Assuming your augmentations do not change the data in an extremely significant way, it is fairly safe to resume training on either a new dataset or the old dataset without a significant change in accuracy.
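Mechanically, resuming looks roughly like the sketch below; the checkpoint path, epoch numbers, and the train_dataset variable are placeholders, not details from the question:
import tensorflow as tf

# Reload the saved model (a full model saved this way also restores the
# optimizer state, so training continues where it left off).
model = tf.keras.models.load_model("checkpoint_epoch_40.h5")  # path assumed

# Rebuild the (augmented) dataset and continue training; initial_epoch keeps
# the epoch numbering and any epoch-based callbacks consistent.
model.fit(train_dataset,
          epochs=100,
          initial_epoch=40)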
I am training a model with Keras that consists of a Hugging Face RoBERTa model as a backbone, with a downstream task of span prediction and binary prediction for text.
I had been training the model regularly with datasets under 2 GB in size, which worked fine. The dataset has grown in recent weeks and is now around 2.3 GB, which puts it over the 2 GB Google protobuf hard limit. This makes it impossible to train the model with Keras on TPUs from plain numpy tensors without a generator, because TensorFlow uses Google protobuf to buffer the tensors for the TPUs, and trying to serve all the data without a generator fails. If I use a dataset under 2 GB in size, everything works fine. TPUs don't support Keras generators yet, so I was looking into using the tf.data.Dataset API instead.
After seeing this question I adopted code from this gist trying to get this to work, resulting in the following code:
def tfdata_generator(x, y, is_training, batch_size=384):
    dataset = tf.data.Dataset.from_tensor_slices((x, y))
    if is_training:
        dataset = dataset.shuffle(1000)
    dataset = dataset.map(map_fn)
    dataset = dataset.batch(batch_size)
    dataset = dataset.repeat()
    dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
    return dataset
The model is created and compiled for TPU use as before, which has never caused any problems, and then I create the generator and call the fit function:
train_gen = tfdata_generator(x_train, y_train, is_training=True)
model.fit(
    train_gen,
    steps_per_epoch=10000,
    epochs=1,
)
This results in the following error:
FetchOutputs node : not found [Op:AutoShardDataset]
Edit: Colab with bare-minimum code and a dummy dataset. Unfortunately, because of Colab RAM restrictions, building a dummy dataset exceeding 2 GB in size crashes the notebook, but it still shows code that runs and works on CPU/TPU with a smaller dataset.
This code does, however, work on a CPU. I can't find any further information on this error online, and I haven't been able to find more detailed information on how to feed training data to Keras on TPUs using generators. I have looked into TFRecords a bit but also found the documentation on TPUs lacking. All help appreciated!
For numpy tensors, 2 GB seems to be a hard limit for TPU training (as of now).
I see 2 workarounds that you could use.
Write your tf.data to a GCS bucket as TFRecord/CSV using TFRecordWriter and let the TPU read the training data from that bucket (a rough sketch follows after this list).
Use the tf.data service for your input pipeline. It's a relatively new service that lets you run your data pipeline on separate workers. For details on how to run it, please see running_the_tfdata_service.
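For the first workaround, the sketch below writes the arrays out as TFRecords and reads them back with tf.data. The bucket path, feature layout, and dtypes are all assumptions; real inputs with several tensors (e.g. input IDs plus attention masks) would need one feature per tensor:
import tensorflow as tf

def _serialize_example(x, y):
    # Each tensor is stored as a serialized-tensor bytes feature.
    feature = {
        "x": tf.train.Feature(bytes_list=tf.train.BytesList(
            value=[tf.io.serialize_tensor(x).numpy()])),
        "y": tf.train.Feature(bytes_list=tf.train.BytesList(
            value=[tf.io.serialize_tensor(y).numpy()])),
    }
    return tf.train.Example(
        features=tf.train.Features(feature=feature)).SerializeToString()

# Write once to a GCS bucket that the TPU workers can read from (path assumed).
with tf.io.TFRecordWriter("gs://your-bucket/train.tfrecord") as writer:
    for x, y in zip(x_train, y_train):
        writer.write(_serialize_example(x, y))

# On the training side, read the records back as a tf.data pipeline.
def _parse(record):
    parsed = tf.io.parse_single_example(record, {
        "x": tf.io.FixedLenFeature([], tf.string),
        "y": tf.io.FixedLenFeature([], tf.string),
    })
    # dtypes assumed here; they must match whatever was serialized above.
    return (tf.io.parse_tensor(parsed["x"], tf.int32),
            tf.io.parse_tensor(parsed["y"], tf.float32))

train_ds = tf.data.TFRecordDataset("gs://your-bucket/train.tfrecord").map(_parse)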
I am creating checkpoints, so I can resume training again.
checkpoint = ModelCheckpoint('model.h5', monitor='val_loss', verbose=1, save_best_only=True, mode='min')
but when I try to resume training, loading model.h5 is very slow.
from keras.models import load_model
model = load_model('model.h5', custom_objects={'GroupNormalization': GroupNormalization}, compile=False)
Is there a way to solve this?
The .h5 format is one of the fastest for loading large files. There are a couple of points to consider when loading weights:
Are you using a normal HDD?
Are you using GPUs?
If there is no GPU, the model is loaded into RAM, and that loading/unloading is CPU-intensive work, so an older processor may take a long time to load it.
Saving a model with ModelCheckpoint without save_weights_only=True saves the optimizer state as well. You will probably notice that the saved file is much bigger than a file with just the weights.
Bigger files are slower to load, especially with a slow CPU. Colab uses a 1-core CPU on GPU instances, so it is really slow.
If, for now, you only want to resume your training, use save_weights_only=True; on resuming, create the model and call model.load_weights, which should be faster. But note that the optimizer state will be reset.
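A minimal sketch of that weights-only workflow (filenames and the build_model/compile arguments are placeholders):
from keras.callbacks import ModelCheckpoint

# Save only the weights; the resulting file is much smaller and faster to load.
checkpoint = ModelCheckpoint('weights.h5', monitor='val_loss', verbose=1,
                             save_best_only=True, mode='min',
                             save_weights_only=True)

# On resume: rebuild the model architecture in code, then load the weights.
# The optimizer starts fresh, so compile before continuing training.
model = build_model()                         # your model-building function (assumed)
model.compile(optimizer='adam', loss='mse')   # optimizer/loss are placeholders
model.load_weights('weights.h5')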
I've built a data pipeline. Pseudo code is as follows:
dataset ->
dataset = augment(dataset)
dataset = dataset.batch(35).prefetch(1)
dataset = set_from_generator(to_feed_dict(dataset)) # expensive op
dataset = Cache('/tmp', dataset)
dataset = dataset.unbatch()
dataset = dataset.shuffle(64).batch(256).prefetch(1)
to_feed_dict(dataset)
Steps 1 to 5 are required to generate the pretrained model outputs. I cache them as they do not change across epochs (the pretrained model weights are not updated). Steps 5 to 8 prepare the dataset for training.
Different batch sizes have to be used, as the pretrained model inputs are of a much bigger dimensionality than the outputs.
The first epoch is slow, as it has to evaluate the pretrained model on every input item to generate templates and save them to the disk. Later epochs are faster, yet they're still quite slow - I suspect the bottleneck is reading the disk cache.
What could be improved in this data pipeline to reduce the issue?
Thank you!
prefetch(1) means that only one element will be prefetched; I think you may want to make it as big as the batch size or larger.
After the first cache you could try caching a second time, but without providing a path, so some of the data is cached in memory.
Maybe your HDD is just slow? ;)
Another idea is to manually write the output of steps 1-4 to a compressed TFRecord and then read it back with another dataset. A compressed file has lower I/O but causes higher CPU usage.
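A rough sketch of that last idea, with the file path, the precomputed_dataset name, and the tensor dtype all assumed rather than taken from the question:
import tensorflow as tf

options = tf.io.TFRecordOptions(compression_type="GZIP")

# Write the pretrained-model outputs (steps 1-4) to disk once, compressed.
with tf.io.TFRecordWriter("/tmp/precomputed.tfrecord", options) as writer:
    for features in precomputed_dataset.unbatch():  # per-example elements (name assumed)
        writer.write(tf.io.serialize_tensor(features).numpy())

# Later epochs read the compressed file instead of re-running the model.
dataset = tf.data.TFRecordDataset("/tmp/precomputed.tfrecord",
                                  compression_type="GZIP")
dataset = dataset.map(lambda s: tf.io.parse_tensor(s, tf.float32),  # dtype assumed
                      num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.shuffle(64).batch(256).prefetch(tf.data.AUTOTUNE)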