How to check preprocessing time/speed in Colab? - python

I am training a neural network on Google Colab GPU. Therefore, I synchronized the input images (180k in total, 105k for training, 76k for validation) with my Google Drive. Then I mount the Google Drive and go from there.
I load a csv-file with image paths and labels in Google Colab and store it as pandas dataframe.
After that I use a list of image paths and labels.
I take this function to get my labels onehot-encoded because I need a special output shape (7, 35) per label, which cannot be done by the existing default functions:
#One Hot Encoding der Labels, Zielarray hat eine Shape von (7,35)
from numpy import argmax
# define input string
def my_onehot_encoded(label):
# define universe of possible input values
characters = '0123456789ABCDEFGHIJKLMNPQRSTUVWXYZ'
# define a mapping of chars to integers
char_to_int = dict((c, i) for i, c in enumerate(characters))
int_to_char = dict((i, c) for i, c in enumerate(characters))
# integer encode input data
integer_encoded = [char_to_int[char] for char in label]
# one hot encode
onehot_encoded = list()
for value in integer_encoded:
character = [0 for _ in range(len(characters))]
character[value] = 1
onehot_encoded.append(character)
return onehot_encoded
After that I use a customized DataGenerator to get the data in batches into my model. x_set is a list of image paths to my images and y_set are the onehot-encoded labels:
class DataGenerator(Sequence):
def __init__(self, x_set, y_set, batch_size):
self.x, self.y = x_set, y_set
self.batch_size = batch_size
def __len__(self):
return math.ceil(len(self.x) / self.batch_size)
def __getitem__(self, idx):
batch_x = self.x[idx*self.batch_size : (idx + 1)*self.batch_size]
batch_x = np.array([resize(imread(file_name), (224, 224)) for file_name in batch_x])
batch_x = batch_x * 1./255
batch_y = self.y[idx*self.batch_size : (idx + 1)*self.batch_size]
batch_y = np.array(batch_y)
return batch_x, batch_y
And with this code I apply the DataGenerator to my data:
training_generator = DataGenerator(X_train, y_train, batch_size=32)
validation_generator = DataGenerator(X_val, y_val, batch_size=32)
When I now train my model one epoch lasts 25-40 minutes which is very long.
model.fit_generator(generator=training_generator,
validation_data=validation_generator,
steps_per_epoch = num_train_samples // 16,
validation_steps = num_val_samples // 16,
epochs = 10, workers=6, use_multiprocessing=True)
I now was wondering how to measure preprocessing time because I don't think it is due to the model size, because I already experimented with models with fewer parameters but the time for training did not reduce significantly... So, I am suspicious regarding the preprocessing...

To measure time in Colab, you can use this autotime package:
!pip install ipython-autotime
%load_ext autotime
Additionally for profiling, you can use %time as mentioned here.
In general to ensure generator runs faster, suggest you to copy the data from gdrive to local host of that colab, otherwise it can get slower.
If you are using Tensorflow 2.0, cause could be this bug.
Work arounds are:
Call tf.compat.v1.disable_eager_execution() at the start of the code
Use model.fit rather than model.fit_generator. The former supports generators anyway.
Downgrade to TF 1.14
Regardless of Tensorflow version, limit how much disk access you are doing, this that is often a bottleneck.
Note that there does seem to be an issue with generators being slow in TF
1.13.2 and 2.0.1 (at least).

Related

How can I reduce the I/O bottleneck when training a CNN using a batch generator for files stored in the cloud?

I have a CNN that uses a batch generator to load my image data, which is stored in a cloud storage bucket. If the batch generator downloads images on-the-fly during model training, it presents a large I/O bottleneck - high GPU memory usage but very low compute usage.
I beleive a possible solution is to download the next batch(es) while the current is being trained on. Thus while the GPU is busy training, I can keep the network busy loading the next set of images (and also likely doing augmentation).
In the past I simply loaded the whole dataset in, did augmentation, and then saved back to disk as a Numpy array to be quickly re-loaded during training. However, I have a lot more data now and don't think there will be enough disk space.
Here is a reduced snippet of my generator and some relevant methods:
class DatasetGeneratorFromBucket(keras.utils.Sequence):
def __init__(self, image_file_names, labels, batch_size, generator_type):
self.image_file_names = image_file_names
self.labels = labels
self.batch_size = batch_size
self.generator_type = generator_type
self.num_samples = len(self.image_file_names)
def __len__(self):
return self.num_samples // self.batch_size
def __getitem__(self, idx):
file_names_for_batch = self.image_file_names[idx * self.batch_size : (idx+1) * self.batch_size]
labels_for_batch = self.labels[idx * self.batch_size : (idx+1) * self.batch_size]
batch_x = []
for fn in file_names_for_batch:
# `storage_client` is a google.cloud.storage.Client() instance
im = download_blob_into_memory(storage_client, 'honours-project-ct-data', fn)
try:
batch_x.append(resize_image(im))
except:
labels_for_batch = np.delete(labels_for_batch, len(batch_x), axis=0)
# convert list of numpy arrays to numpy array of numpy arrays
batch_x = np.stack(batch_x, axis = 0)
# grab already-encoded labels
batch_y = np.array(labels_for_batch)
return batch_x, batch_y
def download_blob_into_memory(storage_client, bucket_name, blob_name):
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(blob_name)
contents = blob.download_as_bytes()
# convert and read in
blob_as_np_array = np.frombuffer(contents, np.uint8)
im = cv2.imdecode(blob_as_np_array, cv2.IMREAD_COLOR)
return im
def resize_image(im):
desired_size = 224
old_size = im.shape[:2]
ratio = float(desired_size)/max(old_size)
new_size = tuple([int(x*ratio) for x in old_size])
im = cv2.resize(im, (new_size[1], new_size[0]))
delta_w = desired_size - new_size[1]
delta_h = desired_size - new_size[0]
top, bottom = delta_h//2, delta_h-(delta_h//2)
left, right = delta_w//2, delta_w-(delta_w//2)
color = [0, 0, 0]
new_im = cv2.copyMakeBorder(im, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color)
return new_im
The generator is called by Keras:
training_batch_generator = DatasetGeneratorFromBucket(training_image_file_names, training_labels, BATCH_SIZE, 'training')
validation_batch_generator = DatasetGeneratorFromBucket(valid_image_file_names, valid_labels, BATCH_SIZE, 'validation')
# model is an instance of tf.keras.models.Sequential
model.fit(
x = training_batch_generator,
validation_data = validation_batch_generator,
epochs = 60,
)
Some potential solutions:
Using the tf.data API, but I'm not sure where to start.
Use a distribution strategy, but a quick test doesn't seem to change much.
Invoke downloading the next batch on another thread, using the index plus BATCH_SIZE*2 to start downloading the next batches.
Are any of these a feasable solution? Is there maybe a better solution not involving pre-downloading batches that I am missing?

How to write flow function without memory leaks in tensorflow

I am trying to compete in kaggle's cornell birdcall detection challenge and there is in total 23 gb of data which mainly composed as mp3 sound files. As you may know 23 gb of data is impossible to fit into the RAM kaggle or google colab. Therefore, I tried to write a datagenerator to fetch mp3 files while training my model and convert them in order to prevent out of memory issue. However, I am still getting out of memory issue after first few epochs. Below you can find my generator and training code where I use del command to specifically de-allocate objects from memory but apparently I did something wrong. Is there any resource you can suggest for that or any suggestion to improve my code to prevent memory leak? Calling garbage collector makes no difference too.
Thx
My datagenerator code
from tensorflow import keras
import random
import glob
import gc
class My_Custom_Generator(keras.utils.Sequence):
def __init__(self, batch_size):
files = glob.glob("../input/birdsong-recognition/train_audio/*/*.mp3")
random.shuffle(files)
self.files = files
self.batch_size = batch_size
def __len__(self) :
return (np.ceil(len(self.files) / float(self.batch_size))).astype(np.int)
def __getitem__(self, idx) :
gc.collect(2)
batch_x = self.files[idx * self.batch_size : (idx+1) * self.batch_size]
#batch_y = self.labels[idx * self.batch_size : (idx+1) * self.batch_size]
train_image = []
train_label = []
for i in range(0, len(batch_x)):
image, label = get_data(batch_x[i])
image = tf.convert_to_tensor(image)
label_matrix = get_cat_label(label)
train_image.append(image)
train_label.append(label_matrix)
self.train_image = np.array(train_image)
self.train_label = np.array(train_label)
del train_image
del train_label
return self.train_image, self.train_label
My training loop which I got from tensorflow tutorial and edited
## Note: Rerunning this cell uses the same model variables
# Keep results for plotting
train_loss_results = []
train_accuracy_results = []
num_epochs = int(len(glob.glob("../input/birdsong-recognition/train_audio/*/*.mp3")) // 8)
for epoch in range(num_epochs):
epoch_loss_avg = tf.keras.metrics.Mean()
epoch_accuracy = tf.keras.metrics.CategoricalAccuracy()
imgs, labels = my_training_batch_generator.__getitem__(epoch)
# Training loop - using batches of 32
for i in range(1):
# Optimize the model
loss_value, grads = grad(xceptionModel, imgs, labels)
optimizer.apply_gradients(zip(grads, xceptionModel.trainable_variables))
# Track progress
epoch_loss_avg.update_state(loss_value) # Add current batch loss
# Compare predicted label to actual label
# training=True is needed only if there are layers with different
# behavior during training versus inference (e.g. Dropout).
epoch_accuracy.update_state(labels, xceptionModel(imgs, training=True))
del imgs
del labels
# End epoch
train_loss_results.append(epoch_loss_avg.result())
train_accuracy_results.append(epoch_accuracy.result())
if epoch % 2 == 0:
print("Epoch {:03d}: Loss: {:.3f}, Accuracy: {:.3%}".format(epoch,
epoch_loss_avg.result(),
epoch_accuracy.result()))

FileNotFoundError: No such file: -> Error occuring due to TimeOut of Google Drive?

I created a DataGenerator with Sequence class.
import tensorflow.keras as keras
from skimage.io import imread
from skimage.transform import resize
import numpy as np
import math
from tensorflow.keras.utils import Sequence
Here, `x_set` is list of path to the images and `y_set` are the associated classes.
class DataGenerator(Sequence):
def __init__(self, x_set, y_set, batch_size):
self.x, self.y = x_set, y_set
self.batch_size = batch_size
def __len__(self):
return math.ceil(len(self.x) / self.batch_size)
def __getitem__(self, idx):
batch_x = self.x[idx * self.batch_size:(idx + 1) *
self.batch_size]
batch_y = self.y[idx * self.batch_size:(idx + 1) *
self.batch_size]
return np.array([
resize(imread(file_name), (224, 224))
for file_name in batch_x]), np.array(batch_y)
Then, I applied this to my training and validation data. X_train is a list of strings which contains the image paths to the training data. y_train are onehotencoded labels of the training data. The same for validation data.
I created the image paths using this code:
X_train = []
for name in train_FileName:
file_path = r"/content/gdrive/My Drive/data/2017-IWT4S-CarsReId_LP-dataset/" + name
X_train.append(file_path)
After that, I applied the DataGenerator to the training and validation data:
training_generator = DataGenerator(X_train, y_train, batch_size=32)
validation_generator = DataGenerator(X_val, y_val, batch_size=32)
Afterwards I used the fit_generator method to run a model:
model.fit_generator(generator=training_generator,
validation_data=validation_generator,
steps_per_epoch = num_train_samples // 32,
validation_steps = num_val_samples // 32,
epochs = 10,
use_multiprocessing=True,
workers=2)
On CPU it worked fine the first times, my model was initialized and the first epoch started. Then, I changed the runtime type in Google Colab to GPU and ran the model again.
And got the following error:
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-79-f43ade94ee10> in <module>()
5 epochs = 10,
6 use_multiprocessing=True,
----> 7 workers=2)
16 frames
/usr/local/lib/python3.6/dist-packages/imageio/core/request.py in _parse_uri(self, uri)
271 # Reading: check that the file exists (but is allowed a dir)
272 if not os.path.exists(fn):
--> 273 raise FileNotFoundError("No such file: '%s'" % fn)
274 else:
275 # Writing: check that the directory to write to does exist
FileNotFoundError: No such file: '/content/gdrive/My Drive/data/2017-IWT4S-CarsReId_LP-dataset/s01_l01/1_1.png'
Today, I got this error also when running the program without the usage of GPU. When running the program, Colab told me that there was Google Drive Time Out. So, is this error due to this timeout of Google Drive? And if yes, how can I solve this?
Does anyone know what I should change in the program?
You can write this code to avoid timeout in google colab in console
ConnectButton() {
console.log("Connect pushed");
document.querySelector("#top-toolbar > colab-connect-
button").shadowRoot.querySelector("#connect").click()
}
setInterval(ConnectButton,60000);
Source: How to prevent Google Colab from disconnecting?
The problem seems to be the input. Your model cannot find the input file. If you change runtime then there's gonna be a factory reset. All your disk content in the session will be erased.
Run cells from the beginning if you change runtime in between.

keras predict_generator is shuffling its output when using a keras.utils.Sequence

I am using keras to build a model that inputs 720x1280 images and outputs a value.
I am having a problem with keras.models.Sequential.predict_generator when using the keras.utils.Sequence class to obtain the values corresponding to images on the validation/training sets. The values returned are shuffled, so I don't know which output corresponds to which image.
This is how my generators are defined
from skimage.io import ImageCollection, imread
from keras.utils import Sequence
def load_images(f):
return imread(f).astype(np.float64)
class DataSetImageKeras(Sequence):
def __init__(self, image_collection, values, batch_size):
self.images = image_collection
self.hf = values
self.batch_size = batch_size
self.n = len(self.images)
self.x_scale = 250
self.y_scale = 1e4
def __len__(self):
return int(np.ceil(len(self.images) / float(self.batch_size)))
def __getitem__(self, idx):
# batch_x is a numpy.ndarray
batch_x = (
self.images[idx:min(idx + self.batch_size, self.n)]
.concatenate()
.reshape(self.batch_size, 720, 1280, 1)
)
batch_y = self.hf[idx:min(idx + self.batch_size, self.n)]
return batch_x/self.x_scale, batch_y/self.y_scale
images_train = ImageCollection(images_paths_train, load_func=load_images)
images_val = ImageCollection(images_paths_test, load_func=load_images)
data_train = DataSetImageKeras(images_train, values_train, n_batch)
data_val = DataSetImageKeras(images_val, values_val, n_batch)
from keras.models import load_model
model = load_model('model001') #this model is already trained
If I use the following code:
val_result = []
val_hf =[]
for (batch_x, batch_y) in data_val:
val_result.append(model.predict_on_batch(batch_x))
val_hf.append(batch_y)
val_result = np.concatenate(val_result)
val_hf = np.concatenate(val_hf)
plt.plot(val_hf,
val_result,
marker='.',
linestyle='')
The correct result is obtained (as seen on this image where x is the desired value and y is the predicted value)
However if I use the predict_generator function, as below:
val_result = model.predict_generator(data_val, verbose=1,
workers=1,
max_queue_size=50,
use_multiprocessing=False)
The output is shuffled as can be seen here.
My problem is similar to
#5048 and
#6745,
which should be solved by
#6891 API, but I am using keras version 2.1.6 and it is still shuffling my predictions, even when using workers=1.
It is also similar to this, but I didn't find anything that could reset the generators and this problem is still present if I define a new generator and try to run the predict_generator.
I also found something stating that it could have something to do with the number of batches not dividing exactly the number of samples, but this problem is still present if I use n_batch=1
As a side note, it might be that predict_generator is not shuffling data, but only returning it with an index offset, since the input data on values and images_paths are already shuffled.
predict_generator was not shuffling my predictions, after all. The problem was with the __getitem__ method. For instance, usingn_batch=32, the method would yield values from 1 to 32, then from 2 to 33 and so forth, instead of from 1 to 32, 33 to 64, etc.
Changing the method as follows solves the problem
def __getitem__(self, idx):
# batch_x is a numpy.ndarray
idx_min = idx*self.batch_size
idx_max = min(idx_min + self.batch_size, self.n)
batch_x = (
self.images[idx_min:idx_max]
.concatenate()
.reshape(self.batch_size, 720, 1280, 1)
)
batch_y = self.hf[idx_min:idx_max]

Tensorflow CNN image augmentation pipeline

I'm trying to learn the new Tensorflow APIs and I am a bit lost on where to get a handle on my input batch tensors so I can manipulate and augment them with for example tf.image.
This is the my current network & pipeline:
trainX, testX, trainY, testY = read_data()
# trainX [num_image, height, width, channels], these are numpy arrays
#...
train_dataset = tf.data.Dataset.from_tensor_slices((trainX, trainY))
test_dataset = tf.data.Dataset.from_tensor_slices((testX, testY))
#...
iterator = tf.data.Iterator.from_structure(train_dataset.output_types,
train_dataset.output_shapes)
features, labels = iterator.get_next()
train_init_op = iterator.make_initializer(train_dataset)
test_init_op = iterator.make_initializer(test_dataset)
#...defining cnn architecture...
# In the train loop
TrainLoop {
sess.run(train_init_op) # switching to train data
sess.run(train_step, ...) # running a train step
#...
sess.run(test_init_op) # switching to test data
test_loss = sess.run(loss, ...) # printing test loss after epoch
}
I'm using the Dataset API creating 2 datasets so that in the trainloop I can calculate the train and test loss and log them.
Where in this pipeline would I manipulate and distort my input batch of images?
I'm not creating any tf.placeholders for my trainX input batches so I can't manipulate them with tf.image because for example tf.image.flip_up_down requires a 3-D or 4-D tensor.
What is the natural way to implement this pipeline with the new API?
Is there a module or easy way to augment an input batch of images for training that would fit in this pipeline?
There's a really good article and talk released recently that go over the API in a lot more detail than my response here. Here's a brief example:
import tensorflow as tf
import numpy as np
def read_data():
n_train = 100
n_test = 50
height = 20
width = 30
channels = 3
trainX = (np.random.random(
size=(n_train, height, width, channels)) * 255).astype(np.uint8)
testX = (np.random.random(
size=(n_test, height, width, channels))*255).astype(np.uint8)
trainY = (np.random.random(size=(n_train,))*10).astype(np.int32)
testY = (np.random.random(size=(n_test,))*10).astype(np.int32)
return trainX, testX, trainY, testY
trainX, testX, trainY, testY = read_data()
# trainX [num_image, height, width, channels], these are numpy arrays
train_dataset = tf.data.Dataset.from_tensor_slices((trainX, trainY))
test_dataset = tf.data.Dataset.from_tensor_slices((testX, testY))
def map_single(x, y):
print('Map single:')
print('x shape: %s' % str(x.shape))
print('y shape: %s' % str(y.shape))
x = tf.image.per_image_standardization(x)
# Consider: x = tf.image.random_flip_left_right(x)
return x, y
def map_batch(x, y):
print('Map batch:')
print('x shape: %s' % str(x.shape))
print('y shape: %s' % str(y.shape))
# Note: this flips ALL images left to right. Not sure this is what you want
# UPDATE: looks like tf documentation is wrong and you need a 3D tensor?
# return tf.image.flip_left_right(x), y
return x, y
batch_size = 32
train_dataset = train_dataset.repeat().shuffle(100)
train_dataset = train_dataset.map(map_single, num_parallel_calls=8)
train_dataset = train_dataset.batch(batch_size)
train_dataset = train_dataset.map(map_batch)
train_dataset = train_dataset.prefetch(2)
test_dataset = test_dataset.map(
map_single, num_parallel_calls=8).batch(batch_size).map(map_batch)
test_dataset = test_dataset.prefetch(2)
iterator = tf.data.Iterator.from_structure(train_dataset.output_types,
train_dataset.output_shapes)
features, labels = iterator.get_next()
train_init_op = iterator.make_initializer(train_dataset)
test_init_op = iterator.make_initializer(test_dataset)
with tf.Session() as sess:
sess.run(train_init_op)
feat, lab = sess.run((features, labels))
print(feat.shape)
print(lab.shape)
sess.run(test_init_op)
feat, lab = sess.run((features, labels))
print(feat.shape)
print(lab.shape)
A few notes:
This approach relies on being able to load your entire dataset into memory. If you cannot, consider using tf.data.Dataset.from_generator. This can lead to slow shuffle times if your shuffle buffer is large. My preferred method is to load some keys tensor entirely into memory - it might just be the indices of each example - then map that key value to data values using tf.py_func. This is slightly less efficient than converting to tfrecords, but with prefetching it likely won't affect performance. Since the shuffling is done before the mapping, you only have to load shuffle_buffer keys into memory, rather than shuffle_buffer examples.
To augment your dataset, use tf.data.Dataset.map either before or after the batch operation, depending on whether or not you want to apply a batch-wise operation (something working on a 4D image tensor) or element-wise operation (3D image tensor). Note it looks like the documentation for tf.image.flip_left_right is out of date, since I get an error when I try and use a 4D tensor. If you want to augment you data randomly, use tf.image.random_flip_left_right rather than tf.image.flip_left_right.
If you're using a tf.estimator.Estimator (or wouldn't mind converting your code to using it), then check out tf.estimator.train_and_evaluate for an in-built way of switching between datasets.
Consider shuffling/repeating your dataset with the shuffle/repeat methods. See the article for notes on efficiencies. In particular, repeat -> shuffle -> map -> batch -> batch-wise map -> prefetch seems to be the best ordering of operations for most applications.

Categories

Resources