tf.data.Dataset caching causes memory error - python

I've been trying to use cache() to speed up my training, but every time, after ~300 iterations, I get an error saying I've tried to allocate more memory than is available (I'm training in a Kaggle notebook, so my resources are 13 GB of RAM and 16 GB of GPU memory). My dataset is about 5 GB in total, and I'm loading it like this:
paths = glob(str(Path(BASE_TRAIN) / '*' / '*'), recursive=True)

ds_train = tf.data.Dataset.list_files(str(Path(BASE_TRAIN) / '*' / '*'))
ds_train = (ds_train.shuffle(len(paths))
            .map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)
            .cache()
            .batch(64)
            .prefetch(tf.data.experimental.AUTOTUNE))
This is my load_image function:
def load_image(file_path):
    image = tf.io.read_file(file_path)
    image = tf.io.decode_jpeg(image, channels=3, dct_method='INTEGER_ACCURATE')
    image = tf.image.resize(image, (224, 224), method='nearest')
    image = tf.cast(image, tf.float32) / 255.0
    return image, image
(I'm returning image, image because I'm working on a Convolutional Autoencoder)
So my question is: is this simply a case of not having enough resources to use cache() with a dataset of this size, or am I doing something wrong that can be corrected so I can use it?
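For reference, here is a minimal sketch (not from the original post) of the same pipeline using tf.data's file-backed cache, which writes cached elements to disk instead of holding them in RAM. Note that caching after map() stores the decoded float32 tensors (roughly 600 KB per 224x224x3 image), which is far larger than the source JPEGs; the cache path below is a hypothetical Kaggle working-directory location.

import tensorflow as tf
from pathlib import Path

AUTOTUNE = tf.data.experimental.AUTOTUNE

# Assumes BASE_TRAIN, paths and load_image are defined as in the question above.
ds_train = tf.data.Dataset.list_files(str(Path(BASE_TRAIN) / '*' / '*'))
ds_train = (ds_train.shuffle(len(paths))
            .map(load_image, num_parallel_calls=AUTOTUNE)
            # Passing a filename makes cache() spill to disk rather than RAM.
            .cache('/kaggle/working/train_cache')  # hypothetical path
            .batch(64)
            .prefetch(AUTOTUNE))

One caveat: with shuffle() placed before cache(), the element order is fixed after the first epoch; keeping a shuffle() after the cache preserves per-epoch reshuffling.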

Related

How can I reduce the I/O bottleneck when training a CNN using a batch generator for files stored in the cloud?

I have a CNN that uses a batch generator to load my image data, which is stored in a cloud storage bucket. If the batch generator downloads images on-the-fly during model training, it presents a large I/O bottleneck - high GPU memory usage but very low compute usage.
I believe a possible solution is to download the next batch(es) while the current one is being trained on. That way, while the GPU is busy training, I can keep the network busy loading the next set of images (and likely doing augmentation as well).
In the past I simply loaded the whole dataset in, did augmentation, and then saved back to disk as a Numpy array to be quickly re-loaded during training. However, I have a lot more data now and don't think there will be enough disk space.
Here is a reduced snippet of my generator and some relevant methods:
class DatasetGeneratorFromBucket(keras.utils.Sequence):
    def __init__(self, image_file_names, labels, batch_size, generator_type):
        self.image_file_names = image_file_names
        self.labels = labels
        self.batch_size = batch_size
        self.generator_type = generator_type
        self.num_samples = len(self.image_file_names)

    def __len__(self):
        return self.num_samples // self.batch_size

    def __getitem__(self, idx):
        file_names_for_batch = self.image_file_names[idx * self.batch_size : (idx + 1) * self.batch_size]
        labels_for_batch = self.labels[idx * self.batch_size : (idx + 1) * self.batch_size]

        batch_x = []
        for fn in file_names_for_batch:
            # `storage_client` is a google.cloud.storage.Client() instance
            im = download_blob_into_memory(storage_client, 'honours-project-ct-data', fn)
            try:
                batch_x.append(resize_image(im))
            except:
                labels_for_batch = np.delete(labels_for_batch, len(batch_x), axis=0)

        # convert list of numpy arrays to numpy array of numpy arrays
        batch_x = np.stack(batch_x, axis=0)
        # grab already-encoded labels
        batch_y = np.array(labels_for_batch)
        return batch_x, batch_y
def download_blob_into_memory(storage_client, bucket_name, blob_name):
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    contents = blob.download_as_bytes()
    # convert and read in
    blob_as_np_array = np.frombuffer(contents, np.uint8)
    im = cv2.imdecode(blob_as_np_array, cv2.IMREAD_COLOR)
    return im
def resize_image(im):
    desired_size = 224
    old_size = im.shape[:2]
    ratio = float(desired_size) / max(old_size)
    new_size = tuple([int(x * ratio) for x in old_size])
    im = cv2.resize(im, (new_size[1], new_size[0]))
    delta_w = desired_size - new_size[1]
    delta_h = desired_size - new_size[0]
    top, bottom = delta_h // 2, delta_h - (delta_h // 2)
    left, right = delta_w // 2, delta_w - (delta_w // 2)
    color = [0, 0, 0]
    new_im = cv2.copyMakeBorder(im, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color)
    return new_im
The generator is called by Keras:
training_batch_generator = DatasetGeneratorFromBucket(training_image_file_names, training_labels, BATCH_SIZE, 'training')
validation_batch_generator = DatasetGeneratorFromBucket(valid_image_file_names, valid_labels, BATCH_SIZE, 'validation')

# model is an instance of tf.keras.models.Sequential
model.fit(
    x=training_batch_generator,
    validation_data=validation_batch_generator,
    epochs=60,
)
Some potential solutions:
Use the tf.data API, though I'm not sure where to start (a rough sketch of this option follows below).
Use a distribution strategy, but a quick test doesn't seem to change much.
Download the next batch on another thread, using the index plus BATCH_SIZE*2 to start downloading the next batches.
Are any of these feasible? Is there perhaps a better solution, not involving pre-downloading batches, that I am missing?
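To make the tf.data option more concrete, here is a rough, untested sketch of a pipeline that overlaps downloading with training. It assumes TensorFlow's gs:// filesystem support is available and that gcs_paths and labels are hypothetical, aligned lists of blob URLs (e.g. 'gs://honours-project-ct-data/...') and encoded labels; it is not the original poster's solution.

import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE
BATCH_SIZE = 32  # hypothetical

def load_and_resize(path, label):
    raw = tf.io.read_file(path)  # reads straight from GCS when given a gs:// path
    im = tf.io.decode_image(raw, channels=3, expand_animations=False)
    im = tf.image.resize_with_pad(im, 224, 224)  # letterbox resize, similar in spirit to resize_image()
    return im, label

ds = (tf.data.Dataset.from_tensor_slices((gcs_paths, labels))
      .shuffle(10_000)
      .map(load_and_resize, num_parallel_calls=AUTOTUNE)  # parallel downloads and decodes
      .batch(BATCH_SIZE)
      .prefetch(AUTOTUNE))  # keeps the next batches ready while the GPU trains

model.fit(ds, epochs=60)

Because map() runs with AUTOTUNE parallel calls and prefetch() runs ahead of the training step, the downloads happen concurrently with GPU work, which is the behaviour the threaded-download idea above is trying to achieve by hand.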

load tiff images in tensorflow using tensorflow-io's decode_tiff

I have a dataset of TIFF images and I am trying to load them from a dataframe containing a column of image paths, using tfio.experimental.image.decode_tiff, since TensorFlow itself does not currently support TIFF decoding.
tensorflow version: 2.3.0
tensorflow-io version: 0.15.0.dev20201015045556
list_ds = tf.data.Dataset.from_tensor_slices(df['image_path'].values)
image_count = 86
val_size = int(image_count * 0.2)
train_ds = list_ds.skip(val_size)
val_ds = list_ds.take(val_size)
def decode_img(img):
    img = tfio.experimental.image.decode_tiff(img, index=0, name=None)
    # resize the image to the desired size
    return tf.image.resize(img, [256, 256])

def process_path(file_path):
    # load the raw data from the file as a string
    img = tf.io.read_file(file_path)
    img = decode_img(img)
    return img

AUTOTUNE = tf.data.experimental.AUTOTUNE

# Set `num_parallel_calls` so multiple images are loaded/processed in parallel.
train_ds = train_ds.map(process_path, num_parallel_calls=AUTOTUNE)
val_ds = val_ds.map(process_path, num_parallel_calls=AUTOTUNE)
When I try to run the following code to visualize the images:
image_batch = next(iter(train_ds))

plt.figure(figsize=(2, 2))
for i in range(9):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(image_batch[i].numpy().astype("uint8"))
    # label = label_batch[i]
plt.show()
The first line itself is giving this error:
Assertion failed: sizeof(tmsize_t)==sizeof(void*), file external/libtiff/libtiff/tif_open.c, line 99
Process finished with exit code -1073740791 (0xC0000409)
And it then runs endlessly in the Python console.
I have looked at various issues:
https://github.com/tensorflow/io/issues/957
https://github.com/tensorflow/io/issues/831
but could not find a solution.
I want to load the images in TensorFlow only to make use of the tf.data API to improve performance, as my dataset is quite large. So let me know if someone has a better idea of how to load TIFF images in TensorFlow.
Are you still having issues with newer TF versions? I see that you posted this using TF 2.3.x and TFIO 0.15.x, and now TF 2.10.x and TFIO 0.16.x have been released.
Also, are you getting any error message when reading the images in, or only when you go to display them? I ask because I'm running into errors when reading images in using the .decode_tiff() function.
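One possible workaround for the original question (not suggested in the thread, and assuming Pillow can read these particular TIFFs) is to decode outside the TensorFlow graph with tf.py_function. This gives up some graph-level parallelism but sidesteps decode_tiff entirely; it reuses list_ds and val_size from the question.

import io
import numpy as np
import tensorflow as tf
from PIL import Image  # assumption: Pillow handles this TIFF flavour

def _decode_tiff_py(raw_bytes):
    # Runs as eager Python, outside the TF graph.
    img = Image.open(io.BytesIO(raw_bytes.numpy())).convert("RGB")
    return np.asarray(img, dtype=np.uint8)

def process_path(file_path):
    raw = tf.io.read_file(file_path)
    img = tf.py_function(_decode_tiff_py, [raw], tf.uint8)
    img.set_shape([None, None, 3])  # py_function loses static shape info
    return tf.image.resize(img, [256, 256])

train_ds = list_ds.skip(val_size).map(process_path, num_parallel_calls=tf.data.experimental.AUTOTUNE)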

Google colab not loading image files while using tensorflow 2.0 batched dataset

A little bit of background: I am loading about 60,000 images into Colab to train a GAN. I have already uploaded them to Drive, and the directory structure contains folders for the different classes (about 7-8) inside root. I am loading them into Colab as follows:
root = "drive/My Drive/data/images"
root = pathlib.Path(root)
list_ds = tf.data.Dataset.list_files(str(root/'*/*'))
for f in list_ds.take(3):
print(f.numpy())
which gives the output:
b'drive/My Drive/data/images/folder_1/2994.jpg'
b'drive/My Drive/data/images/folder_1/6628.jpg'
b'drive/My Drive/data/images/folder_2/37872.jpg'
I am further processing them as follows:
def process_path(file_path):
    label = tf.strings.split(file_path, '/')[-2]
    image = tf.io.read_file(file_path)
    image = tf.image.decode_jpeg(image)
    image = tf.image.convert_image_dtype(image, tf.float32)
    return image  #, label

ds = list_ds.map(process_path)

BUFFER_SIZE = 60000
BATCH_SIZE = 128
train_dataset = ds.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
Each image is 128x128. Now, coming to the problem: when I try to view a batch in Colab, the execution goes on forever and never stops, for example with this code:
for batch in train_dataset.take(4):
    print([arr.numpy() for arr in batch])
Earlier I thought that batch_size might be an issue, so I tried changing it, but the problem remains. Could it be a problem with Colab, since I am loading a large number of files?
Or is it due to the size of the images, given that it worked with MNIST (28x28)? If so, what are the possible solutions?
Thanks in advance.
EDIT:
After removing the shuffle statement, the last line executes within a few seconds. So I thought it could be a problem with the BUFFER_SIZE of shuffle, but even with a reduced BUFFER_SIZE it again takes a very long time to execute. Any workaround?
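One mitigation worth trying, assuming the slowdown comes from the shuffle buffer filling up with decoded image tensors rather than from Colab itself: shuffle the filename dataset before map(), so the buffer only ever holds path strings. A rough sketch reusing the names from the question (this is not taken from the answers below):

AUTOTUNE = tf.data.experimental.AUTOTUNE

list_ds = tf.data.Dataset.list_files(str(root/'*/*'), shuffle=False)
train_dataset = (list_ds
                 .shuffle(BUFFER_SIZE)  # shuffles 60,000 short strings, not 128x128 float images
                 .map(process_path, num_parallel_calls=AUTOTUNE)
                 .batch(BATCH_SIZE)
                 .prefetch(AUTOTUNE))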
Here is how I load a 1.12 GB zipped FLICKR image dataset from my personal Google Drive. First, I unzip the dataset in the Colab environment. Some features that can speed up performance are prefetch and autotune. Additionally, I use the local Colab cache to store the processed images. This takes ~20 seconds to execute the first time (assuming you have unzipped the dataset). The cache then allows subsequent calls to load very fast.
Assuming you have authorized the Google Drive API, I start by unzipping the folder(s):
!unzip /content/drive/My\ Drive/Flickr8k
!unzip Flickr8k_Dataset
!ls
I then used your code with the addition of prefetch(), autotune, and a cache file.
import pathlib
import tensorflow as tf

def prepare_for_training(ds, cache, BUFFER_SIZE, BATCH_SIZE):
    if cache:
        if isinstance(cache, str):
            ds = ds.cache(cache)
        else:
            ds = ds.cache()
    ds = ds.shuffle(buffer_size=BUFFER_SIZE)
    ds = ds.batch(BATCH_SIZE)
    ds = ds.prefetch(buffer_size=AUTOTUNE)
    return ds

AUTOTUNE = tf.data.experimental.AUTOTUNE

root = "Flicker8k_Dataset"
root = pathlib.Path(root)

list_ds = tf.data.Dataset.list_files(str(root/'**'))
for f in list_ds.take(3):
    print(f.numpy())

def process_path(file_path):
    label = tf.strings.split(file_path, '/')[-2]
    img = tf.io.read_file(file_path)
    img = tf.image.decode_jpeg(img)
    img = tf.image.convert_image_dtype(img, tf.float32)
    # resize the image to the desired size.
    img = tf.image.resize(img, [128, 128])
    return img  #, label

ds = list_ds.map(process_path, num_parallel_calls=AUTOTUNE)
train_dataset = prepare_for_training(ds, cache="./custom_ds.tfcache", BUFFER_SIZE=600000, BATCH_SIZE=128)

for batch in train_dataset.take(4):
    print([arr.numpy() for arr in batch])
Here is a way to do it with Keras flow_from_directory(). The benefit of this approach is that you avoid TensorFlow's shuffle(), which depending on the buffer size may require processing the whole dataset. Keras gives you an iterator which you can call to fetch the data batch, and it has random shuffling built in.
import pathlib
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

root = "Flicker8k_Dataset"
BATCH_SIZE = 128

train_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(
    directory=root,          # This is the source directory for training images
    target_size=(128, 128),  # All images will be resized
    batch_size=BATCH_SIZE,
    shuffle=True,
    seed=42,                 # for the shuffle
    classes=[''])

i = 4
for batch in range(i):
    [print(x[0]) for x in next(train_generator)]

Why would this dataset implementation run out of memory?

I followed these instructions and wrote the following code to create a Dataset for images (the COCO2014 training set):
from pathlib import Path
import tensorflow as tf

def image_dataset(filepath, image_size, batch_size, norm=True):
    def preprocess_image(image):
        image = tf.image.decode_jpeg(image, channels=3)
        image = tf.image.resize(image, image_size)
        if norm:
            image /= 255.0  # normalize to [0,1] range
        return image

    def load_and_preprocess_image(path):
        image = tf.read_file(path)
        return preprocess_image(image)

    all_image_paths = [str(f) for f in Path(filepath).glob('*')]
    path_ds = tf.data.Dataset.from_tensor_slices(all_image_paths)
    ds = path_ds.map(load_and_preprocess_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    ds = ds.shuffle(buffer_size=len(all_image_paths))
    ds = ds.repeat()
    ds = ds.batch(batch_size)
    ds = ds.prefetch(tf.data.experimental.AUTOTUNE)
    return ds

ds = image_dataset(train2014_dir, (256, 256), 4, False)
image = ds.make_one_shot_iterator().get_next('images')
# image is then fed to the network
This code always runs out of both RAM (32 GB) and GPU memory (11 GB) and the process gets killed.
I also noticed that the program gets stuck at sess.run(opt_op). What is wrong? How can I fix it?
The problem is this:
ds = ds.shuffle(buffer_size = len(all_image_paths))
The buffer that Dataset.shuffle() uses is an in-memory buffer, so you are effectively trying to load the whole dataset into memory.
You have a couple of options (which you can combine) to fix this:
Option 1:
Reduce the buffer size to a much smaller number.
Option 2:
Move the shuffle() statement before the map() statement.
This means we would be shuffling before we load the images, so we would only be storing the filenames in the shuffle buffer rather than huge image tensors.
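For illustration (a sketch rather than part of the original answer), the relevant part of image_dataset() with Option 2 applied would look roughly like this:

all_image_paths = [str(f) for f in Path(filepath).glob('*')]
path_ds = tf.data.Dataset.from_tensor_slices(all_image_paths)

# The shuffle buffer now holds only path strings, not decoded 256x256 image tensors.
ds = path_ds.shuffle(buffer_size=len(all_image_paths))
ds = ds.map(load_and_preprocess_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds = ds.repeat()
ds = ds.batch(batch_size)
ds = ds.prefetch(tf.data.experimental.AUTOTUNE)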

TensorFlow use dataset to replace function feed_dict

While learning from a TensorFlow project, I found this line of code:
cls_prob, box_pred = sess.run([output_cls_prob, output_box_pred], feed_dict={input_img: blob})
But this line of code takes a lot of time (about 15 seconds on CPU... ┭┮﹏┭┮).
From what I have read, using the Dataset API could solve this performance problem. How should I use it?
Source of blob:
img_data = cv2.imread('./imgs/001.jpg')
img_scale = float(600) / min(img_data.shape[0], img_data.shape[1])
if np.round(img_scale * max(img_data.shape[0], img_data.shape[1])) > 1200:
    img_scale = float(1200) / max(img_data.shape[0], img_data.shape[1])
img_data = cv2.resize(img_data, None, None, fx=img_scale, fy=img_scale, interpolation=cv2.INTER_LINEAR)
img_orig = img_data.astype(np.float32, copy=True)

blob = np.zeros((1, img_data.shape[0], img_data.shape[1], 3), dtype=np.float32)
blob[0, 0:img_data.shape[0], 0:img_data.shape[1], :] = img_orig
Source of output_cls_prob, output_box_pred and input_img:
# Actually, read from a PB model...
input_img = sess.graph.get_tensor_by_name('Placeholder:0')
output_cls_prob = sess.graph.get_tensor_by_name('Reshape_2:0')
output_box_pred = sess.graph.get_tensor_by_name('rpn_bbox_pred/Reshape_1:0')
Parameter types:
blob: numpy.ndarray
output_cls_prob: tensorflow.python.framework.ops.Tensor
output_box_pred: tensorflow.python.framework.ops.Tensor
input_img: tensorflow.python.framework.ops.Tensor
tf.data is the recommended API for tensorflow input pipelines. Here is a tutorial on tensorflow.org. For your example, the section "Decoding image data and resizing it" could be most useful. For example, you could do something like:
# Reads an image from a file, decodes it into a dense tensor, and resizes it
# to a fixed shape.
def _parse_function(filename):
    image_string = tf.read_file(filename)
    image_decoded = tf.image.decode_jpeg(image_string)
    image_resized = tf.image.resize_images(image_decoded, [new_width, new_height])
    image_resized = tf.expand_dims(image_resized, 0)  # Adds size 1 dimension
    return image_resized

# A vector of filenames.
filenames = tf.constant(["./imgs/001.jpg", ...])

dataset = tf.data.Dataset.from_tensor_slices(filenames)
dataset = dataset.map(_parse_function)
And instead of having input_img be a placeholder, change:
input_img = tf.placeholder(tf.float32)
output_class_prob, output_class_pred = (... use input_img ...)
to:
iterator = dataset.make_one_shot_iterator()
input_img = iterator.get_next()
output_class_prob, output_class_pred = (... use input_img ...)
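To make the payoff explicit (this line is an added illustration, not part of the original answer, and uses the tensor names from the question): once input_img comes from the iterator, the session call no longer needs a feed_dict, because decoding and resizing now happen inside the input pipeline.

# No feed_dict: the graph pulls images from the dataset iterator itself.
cls_prob, box_pred = sess.run([output_cls_prob, output_box_pred])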
First of all, you should know that using the Dataset API has a great impact on performance when multiple GPUs are used... Otherwise it is almost identical to feed_dict. I recommend you read this other answer from a TF developer; it has almost everything one needs to know to form a mental picture of the benefits of this new API.
