Using tf.data.Dataset with Keras on a TPU - python

I am training a model with Keras that consists of a Huggingface RoBERTa model as a backbone, with downstream tasks of span prediction and binary prediction for text.
I have been training the model regularly with datasets under 2 GB in size, which has worked fine. The dataset has grown in recent weeks and is now around 2.3 GB, which puts it over the 2 GB Google protobuf hard limit. This makes it impossible to train the model with Keras on TPUs from plain numpy tensors without a generator, as TensorFlow uses protobuf to buffer the tensors for the TPUs, and trying to serve all the data at once fails. With a dataset under 2 GB everything works fine. TPUs don't support Keras generators yet, so I was looking into using the tf.data.Dataset API instead.
After seeing this question I adapted code from this gist to try to get this working, resulting in the following code:
def tfdata_generator(x, y, is_training, batch_size=384):
    dataset = tf.data.Dataset.from_tensor_slices((x, y))
    if is_training:
        dataset = dataset.shuffle(1000)
    dataset = dataset.map(map_fn)
    dataset = dataset.batch(batch_size)
    dataset = dataset.repeat()
    dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
    return dataset
The model is created and compiled for TPU use as before which has never caused any problems and then I create the generators and call the fit function:
train_gen = tfdata_generator(x_train, y_train, is_training=True)
model.fit(
    train_gen,
    steps_per_epoch=10000,
    epochs=1,
)
This results in the following error:
FetchOutputs node : not found [Op:AutoShardDataset]
Edit: Colab with bare-minimum code and a dummy dataset - unfortunately, because of Colab RAM restrictions, building a dummy dataset exceeding 2 GB in size crashes the notebook, but it still shows code that runs and works on CPU/TPU with a smaller dataset.
This code does, however, work on a CPU. I can't find any further information on this error online, and I haven't been able to find more detailed information on how to feed training data to TPUs with Keras using generators. I have looked into TFRecords a bit but also find the documentation on TPUs lacking there. All help appreciated!

For numpy tensors, 2 GB seems to be a hard limit for TPU training (as of now).
I see two workarounds that you could use.
Write your tf.data to a GCS bucket as TFRecord/CSV using TFRecordWriter and let the TPU read the training data from that bucket, along the lines of the sketch below.
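For the first workaround, a rough sketch of writing the arrays to TFRecords and reading them back as a tf.data pipeline (the feature names, shapes, bucket path, and float feature types are placeholders to adapt to your own x/y structure):
import tensorflow as tf

seq_len, num_labels = x_train.shape[1], y_train.shape[1]  # adapt to your shapes

def serialize_example(x, y):
    # Assumes x and y are float numpy arrays; adapt the feature types to your data.
    feature = {
        "x": tf.train.Feature(float_list=tf.train.FloatList(value=x.ravel())),
        "y": tf.train.Feature(float_list=tf.train.FloatList(value=y.ravel())),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

# Write the records straight into a GCS bucket the TPU can read from.
with tf.io.TFRecordWriter("gs://my-bucket/train-000.tfrecord") as writer:
    for xi, yi in zip(x_train, y_train):
        writer.write(serialize_example(xi, yi))

def parse_fn(record):
    spec = {
        "x": tf.io.FixedLenFeature([seq_len], tf.float32),
        "y": tf.io.FixedLenFeature([num_labels], tf.float32),
    }
    parsed = tf.io.parse_single_example(record, spec)
    return parsed["x"], parsed["y"]

dataset = (tf.data.TFRecordDataset("gs://my-bucket/train-000.tfrecord")
           .map(parse_fn)
           .shuffle(1000)
           .batch(384)
           .repeat()
           .prefetch(tf.data.experimental.AUTOTUNE))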
Use the tf.data service for your input pipeline. It's a relatively new service that lets you run your data pipeline on separate workers. For details on how to run it, please see running_the_tfdata_service.
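A rough sketch of the second workaround, assuming a tf.data service dispatcher is already running at some address (the address below is made up; see the linked guide for starting the dispatcher and workers):
import tensorflow as tf

service_address = "grpc://10.0.0.2:5000"  # hypothetical dispatcher address

dataset = tfdata_generator(x_train, y_train, is_training=True)
# Hand the pipeline off to the tf.data service workers instead of producing data locally.
dataset = dataset.apply(tf.data.experimental.service.distribute(
    processing_mode="parallel_epochs",
    service=service_address))

model.fit(dataset, steps_per_epoch=10000, epochs=1)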

Related

loading data in GPU before starting training in Google Colab

I am using a subset of the PlantVillage (image) dataset on my Google Drive and trying to train CNN models on that data from Google Colab (and of course, I use the GPU). The problem is that the first epoch of training goes very slowly because the data is being loaded onto the GPU for the first time; the later epochs move much faster and in a predictable amount of time. Is it possible to do the loading prior to training and exclude it from the training itself? I want to %%time my training, and having this extra loading time included messes things up.
I use TensorFlow and Keras applications for data preprocessing and model training.
You can use Dataset.cache() and Dataset.prefetch(), which keep the data in memory after it is loaded from disk and speed up model training accordingly.
Check the below code:
AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
Please have a look at this link for your reference.
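One way to keep this one-time loading cost out of a %%time measurement (a sketch, assuming train_ds and val_ds are the cached datasets above) is to run a throwaway pass over the data first, so the cache is already filled when the timed fit() starts:
import time

# Warm-up pass: the first full iteration fills the cache created by .cache().
start = time.time()
for _ in train_ds:
    pass
for _ in val_ds:
    pass
print(f"Cache warm-up took {time.time() - start:.1f}s")

# The timed training run now reads from the in-memory cache from epoch one.
model.fit(train_ds, validation_data=val_ds, epochs=10)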

Resource Exhausted in Tensorflow with any architecture

I tried to train an image classifier using TensorFlow. I used the Data API to load the dataset and dataset caching to speed up the training process. While trying to train the model I ran into a Resource Exhausted error. I tried to change the batch size, but even after trying different batch sizes like 32, 64 and 128 I could not overcome this problem.
I have also tried to remove some layers, but I could not fix this error.
Check your batch_size and decrease it further. It seems to be overwhelming your memory.
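For illustration, the batch size is either the argument to .batch() in a tf.data pipeline or the batch_size argument of fit() (the names here are hypothetical):
import tensorflow as tf

# Rebuild the dataset with a smaller batch size until the OOM error goes away.
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(8)

# Or, when feeding numpy arrays directly:
model.fit(x_train, y_train, batch_size=8, epochs=10)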

How can one quickly verify that a CNN actually learns?

I tried to build a CNN from scratch based on the LeNet architecture from this article.
I implemented backprop and am now trying to train it on the MNIST dataset using SGD with a batch size of 16. I want to find a quick way to verify that the learning is going well and there are no bugs. For this, I visualize the loss for every 100th batch, but it takes too long on my laptop and I don't see an overall trend (the loss fluctuates downwards, but occasionally jumps back up, so I am not sure). Could anyone suggest a proven way to check that the CNN works well without waiting many hours of training?
MNIST consists of 60k images of 28 * 28 pixels. Training a CNN with a batch size of 16 means 3,750 forward passes per epoch.
Taking into consideration that you are using LeNet, which is not a very deep model, I would suggest you do the following:
Check your PC specifications, such as RAM, processor and GPU.
Try to train your model on a cloud service such as Google Colab, Kaggle or others.
Try a batch size of 128 or 64.
Try to normalize your image dataset before training (see the sketch after this list).
Keep in mind that training speed also depends on the machine learning framework you are using, such as TensorFlow or PyTorch.
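A minimal normalization sketch, assuming the MNIST images sit in numpy arrays x_train/x_test with raw pixel values in [0, 255]:
import numpy as np

# Scale pixels to [0, 1] before training so the gradients stay in a reasonable range.
x_train = x_train.astype(np.float32) / 255.0
x_test = x_test.astype(np.float32) / 255.0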
I hope this will help.

Using datasets larger than 2 Gb with Keras

TensorFlow has a long-standing limitation of 2 GB on a single tensor. It means that you can't train your model on more than 2 GB of data at one time without jumping through hoops. See Initializing tensorflow Variable with an array larger than 2GB; Use large dataset in Tensorflow.
The standard solution referenced in those posts is to use a placeholder and to pass it to the "session" through feed_dict:
import tensorflow as tf

my_graph = tf.Graph()
sess = tf.Session(graph=my_graph)
# Feed the large array through a placeholder so it is never baked into the graph proto.
X_init = tf.placeholder(tf.float32, shape=(m_input, n_input))
X = tf.Variable(X_init)
sess.run(tf.global_variables_initializer(), feed_dict={X_init: data_for_X})
However, this only works when I use the "old" API (tf.Session(), etc.) The recommended approach nowadays is to use Keras (all the tutorials on tensorflow.org use it). And, with Keras, there's no tf.Graph(), no tf.Session(), and no run() (at least none that are readily visible to the user.)
How do I adapt the above code to work with Keras?
In Keras, you wouldn't load your entire dataset into a tensor. You load it into numpy arrays.
If the entire dataset fits in a single numpy array:
Thanks to @sebrockm's comment.
The most trivial usage of Keras is simply loading your dataset into a numpy array (not a tf tensor) and calling model.fit(arrayWithInputs, arrayWithOutputs, ...)
If the entire dataset doesn't fit in a numpy array:
You'd create a generator or a keras.utils.Sequence to load batches one by one and then train the model with model.fit_generator(generatorOrSequence, ...)
The limitation becomes the batch size, but you'd hardly ever hit 2GB in a single batch.
So, go for it:
keras.utils.Sequence
model.fit_generator
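A minimal sketch of such a Sequence, assuming the data lives in numpy arrays (or anything indexable, e.g. memory-mapped files) called x_data and y_data:
import numpy as np
from tensorflow import keras

class BatchSequence(keras.utils.Sequence):
    # Serves one batch at a time, so the full dataset never becomes a single tensor.
    def __init__(self, x, y, batch_size=64):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        # Number of batches per epoch.
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.x[sl], self.y[sl]

model.fit_generator(BatchSequence(x_data, y_data, batch_size=64), epochs=5)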
Keras doesn't have a 2 GB limitation for datasets; I've trained much larger datasets with Keras with no issues.
The limitation could come from TensorFlow constants, which do have a 2GB limit, but in any case you should NOT store datasets as constants, as these are saved as part of the graph, and that is not the idea of storing a model.
Keras has the model.fit_generator function that you can use to pass a generator function which loads data on the fly and makes batches. This lets you load a large dataset piece by piece, and you usually adjust the batch size so you maximize performance with acceptable RAM usage. TensorFlow doesn't have a similar API; you have to implement it manually, as you say, with feed_dict.
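For illustration, a rough sketch of such a generator, assuming the dataset was split into .npz chunk files on disk (file names and array keys are hypothetical):
import numpy as np

def batch_generator(chunk_paths, batch_size=64):
    # Keras expects the generator to loop over the data indefinitely.
    while True:
        np.random.shuffle(chunk_paths)
        for path in chunk_paths:
            chunk = np.load(path)  # e.g. an .npz file with "x" and "y" arrays
            x_chunk, y_chunk = chunk["x"], chunk["y"]
            for i in range(0, len(x_chunk), batch_size):
                yield x_chunk[i:i + batch_size], y_chunk[i:i + batch_size]

model.fit_generator(batch_generator(train_files), steps_per_epoch=1000, epochs=5)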

Train on own data set. Mask_RCNN Resource exhausted: OOM when allocating

I'm trying to train on my own dataset with Mask_RCNN, but I get the following framework errors:
I have followed the GitHub tutorial on Mask R-CNN for object detection.
Can I do something to decrease the memory needed to train the dataset? Or how else can I solve this problem?
We need more information about your PC. I think it may be because the batch size or the image size is too large, so more video memory (or RAM) is needed. When I trained on my own dataset using mask-rcnn-resnet101 in tensorflow-models/object_detection, a GTX 1080 Ti with 11 GB worked fine (batch_size=1, image_size=1000*600).
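If you are using the matterport Mask_RCNN implementation (an assumption, since the question doesn't say), batch and image size are controlled through its Config class, so a low-memory configuration could look roughly like this:
from mrcnn.config import Config

class LowMemoryConfig(Config):
    NAME = "my_dataset"     # hypothetical name for your dataset
    IMAGES_PER_GPU = 1      # effectively batch size 1
    IMAGE_MIN_DIM = 512     # smaller input images need much less memory
    IMAGE_MAX_DIM = 512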
In https://stackoverflow.com/a/64145818/11262633, I have described 3 changes you can make to reduce memory utilization in TensorFlow v2.
