I am trying to compete in Kaggle's Cornell Birdcall Identification challenge, which comes with about 23 GB of data, mostly MP3 sound files. As you may know, 23 GB cannot fit into the RAM available on Kaggle or Google Colab. I therefore wrote a data generator that fetches and converts the MP3 files while the model trains, to avoid the out-of-memory issue. However, I still get an out-of-memory error after the first few epochs. Below you can find my generator and training code, where I use the del statement to explicitly de-allocate objects from memory, but apparently I did something wrong. Is there any resource you can suggest for this, or any way to improve my code to prevent the memory leak? Calling the garbage collector makes no difference either.
Thanks.
My data generator code:
from tensorflow import keras
import tensorflow as tf
import numpy as np
import random
import glob
import gc


class My_Custom_Generator(keras.utils.Sequence):

    def __init__(self, batch_size):
        files = glob.glob("../input/birdsong-recognition/train_audio/*/*.mp3")
        random.shuffle(files)
        self.files = files
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.files) / float(self.batch_size)))

    def __getitem__(self, idx):
        gc.collect(2)
        batch_x = self.files[idx * self.batch_size : (idx + 1) * self.batch_size]
        # batch_y = self.labels[idx * self.batch_size : (idx + 1) * self.batch_size]

        train_image = []
        train_label = []

        for i in range(0, len(batch_x)):
            # get_data decodes one mp3 file into model input and its label
            image, label = get_data(batch_x[i])
            image = tf.convert_to_tensor(image)
            label_matrix = get_cat_label(label)
            train_image.append(image)
            train_label.append(label_matrix)

        self.train_image = np.array(train_image)
        self.train_label = np.array(train_label)
        del train_image
        del train_label
        return self.train_image, self.train_label
My training loop, which I adapted from a TensorFlow tutorial:
## Note: Rerunning this cell uses the same model variables

# Keep results for plotting
train_loss_results = []
train_accuracy_results = []

num_epochs = int(len(glob.glob("../input/birdsong-recognition/train_audio/*/*.mp3")) // 8)

for epoch in range(num_epochs):
    epoch_loss_avg = tf.keras.metrics.Mean()
    epoch_accuracy = tf.keras.metrics.CategoricalAccuracy()

    imgs, labels = my_training_batch_generator.__getitem__(epoch)

    # Training loop - using batches of 32
    for i in range(1):
        # Optimize the model
        loss_value, grads = grad(xceptionModel, imgs, labels)
        optimizer.apply_gradients(zip(grads, xceptionModel.trainable_variables))

        # Track progress
        epoch_loss_avg.update_state(loss_value)  # Add current batch loss
        # Compare predicted label to actual label
        # training=True is needed only if there are layers with different
        # behavior during training versus inference (e.g. Dropout).
        epoch_accuracy.update_state(labels, xceptionModel(imgs, training=True))

    del imgs
    del labels

    # End epoch
    train_loss_results.append(epoch_loss_avg.result())
    train_accuracy_results.append(epoch_accuracy.result())

    if epoch % 2 == 0:
        print("Epoch {:03d}: Loss: {:.3f}, Accuracy: {:.3%}".format(epoch,
                                                                    epoch_loss_avg.result(),
                                                                    epoch_accuracy.result()))
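One alternative __getitem__ I have been considering returns the batch without keeping it as instance attributes; a rough, untested sketch is below (it assumes get_data and get_cat_label behave exactly as above):

    def __getitem__(self, idx):
        batch_files = self.files[idx * self.batch_size : (idx + 1) * self.batch_size]
        train_image, train_label = [], []
        for file_name in batch_files:
            image, label = get_data(file_name)       # decode one mp3 into features + label
            train_image.append(np.asarray(image))    # keep plain numpy arrays
            train_label.append(get_cat_label(label))
        # the batch is returned directly and not stored on self, so it can be freed after the step
        return np.array(train_image), np.array(train_label)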
A related question:
I have a CNN that uses a batch generator to load my image data, which is stored in a cloud storage bucket. Because the batch generator downloads images on the fly during model training, it presents a large I/O bottleneck: high GPU memory usage but very low compute usage.
I believe a possible solution is to download the next batch(es) while the current one is being trained on. That way, while the GPU is busy training, I can keep the network busy loading the next set of images (and likely doing augmentation as well).
In the past I simply loaded the whole dataset into memory, did the augmentation, and then saved it back to disk as a NumPy array to be quickly re-loaded during training. However, I have a lot more data now and don't think there will be enough disk space.
Here is a reduced snippet of my generator and some relevant methods:
class DatasetGeneratorFromBucket(keras.utils.Sequence):

    def __init__(self, image_file_names, labels, batch_size, generator_type):
        self.image_file_names = image_file_names
        self.labels = labels
        self.batch_size = batch_size
        self.generator_type = generator_type
        self.num_samples = len(self.image_file_names)

    def __len__(self):
        return self.num_samples // self.batch_size

    def __getitem__(self, idx):
        file_names_for_batch = self.image_file_names[idx * self.batch_size : (idx + 1) * self.batch_size]
        labels_for_batch = self.labels[idx * self.batch_size : (idx + 1) * self.batch_size]

        batch_x = []
        for fn in file_names_for_batch:
            # `storage_client` is a google.cloud.storage.Client() instance
            im = download_blob_into_memory(storage_client, 'honours-project-ct-data', fn)
            try:
                batch_x.append(resize_image(im))
            except:
                # drop the label belonging to an image that could not be resized
                labels_for_batch = np.delete(labels_for_batch, len(batch_x), axis=0)

        # convert list of numpy arrays to a single numpy array
        batch_x = np.stack(batch_x, axis=0)

        # grab already-encoded labels
        batch_y = np.array(labels_for_batch)

        return batch_x, batch_y
def download_blob_into_memory(storage_client, bucket_name, blob_name):
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    contents = blob.download_as_bytes()
    # convert and read in
    blob_as_np_array = np.frombuffer(contents, np.uint8)
    im = cv2.imdecode(blob_as_np_array, cv2.IMREAD_COLOR)
    return im

def resize_image(im):
    desired_size = 224
    old_size = im.shape[:2]
    ratio = float(desired_size) / max(old_size)
    new_size = tuple([int(x * ratio) for x in old_size])
    im = cv2.resize(im, (new_size[1], new_size[0]))
    delta_w = desired_size - new_size[1]
    delta_h = desired_size - new_size[0]
    top, bottom = delta_h // 2, delta_h - (delta_h // 2)
    left, right = delta_w // 2, delta_w - (delta_w // 2)
    color = [0, 0, 0]
    new_im = cv2.copyMakeBorder(im, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color)
    return new_im
The generator is called by Keras:
training_batch_generator = DatasetGeneratorFromBucket(training_image_file_names, training_labels, BATCH_SIZE, 'training')
validation_batch_generator = DatasetGeneratorFromBucket(valid_image_file_names, valid_labels, BATCH_SIZE, 'validation')

# model is an instance of tf.keras.models.Sequential
model.fit(
    x = training_batch_generator,
    validation_data = validation_batch_generator,
    epochs = 60,
)
Some potential solutions:
Using the tf.data API, but I'm not sure where to start (I have put a rough sketch of what I have in mind after this list).
Use a distribution strategy, but a quick test doesn't seem to change much.
Invoke downloading the next batch on another thread, using the index plus BATCH_SIZE*2 to start downloading the next batches.
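For the tf.data option, this is roughly what I imagine, as an untested sketch: it reuses download_blob_into_memory and resize_image from above, assumes the labels are already numeric arrays, and uses tf.data.AUTOTUNE (tf.data.experimental.AUTOTUNE on older TF versions):

import numpy as np
import tensorflow as tf

def load_one(file_name, label):
    # runs as plain Python inside tf.py_function, so the storage client can be used directly
    im = download_blob_into_memory(storage_client, 'honours-project-ct-data',
                                   file_name.numpy().decode('utf-8'))
    return resize_image(im).astype(np.float32), np.asarray(label, dtype=np.float32)

def tf_load_one(file_name, label):
    image, label = tf.py_function(load_one, [file_name, label], [tf.float32, tf.float32])
    image.set_shape((224, 224, 3))
    return image, label

train_ds = (tf.data.Dataset
            .from_tensor_slices((training_image_file_names, training_labels))
            .map(tf_load_one, num_parallel_calls=tf.data.AUTOTUNE)  # download in parallel
            .batch(BATCH_SIZE)
            .prefetch(tf.data.AUTOTUNE))  # keep loading batches while the GPU trains

model.fit(x=train_ds, epochs=60)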
Are any of these a feasible solution? Is there maybe a better solution, not involving pre-downloading batches, that I am missing?
I don't have access to any GPUs, but I want to speed up the training of my model created with PyTorch by using more than one CPU. I will use the most basic model as an example here.
All I want is for this code to run on multiple CPUs instead of just one (the Dataset and Network classes are in the Appendix).
import pandas as pd
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

df = pd.read_pickle('path/to/data')
X = df.drop(columns=['target'])
y = df[['target']]

train_data = CustomDataset(X, y)
train_loader = DataLoader(
    dataset=train_data,
    batch_size=64
)

model = Network(X.shape[-1])
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())

epochs = 10
for e in range(1, epochs + 1):
    train_loss = .0
    model.train()
    for batch_id, (data, labels) in enumerate(train_loader):
        optimizer.zero_grad()
        target = model(data)
        loss = criterion(target, labels)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    print('Epoch {}:\tTrain Loss: {:.4f}'.format(
        e,
        train_loss / len(train_loader)
    ))
Right now, this code uses only one CPU at 100% during training. I want it to use four CPUs at 100% in the training process. There is so much conflicting information out there on the best way to do this that none of it really worked. I have tried different approaches, for example model = torch.nn.parallel.DistributedDataParallel(model), but none of them worked.
Does someone know how to get the most performance out of this code using 4 CPUs? Thanks in advance!
APPENDIX
class CustomDataset(Dataset):

    def __init__(self, X, y):
        self.X = torch.Tensor(X.values)
        self.y = torch.Tensor(y.values)

    def __getitem__(self, index):
        return self.X[index], self.y[index]

    def __len__(self):
        return len(self.X)


class Network(nn.Module):

    def __init__(self, input_size):
        super(Network, self).__init__()
        self.input_size = input_size
        self.linear_1 = nn.Linear(self.input_size, 32)
        self.linear_2 = nn.Linear(32, 1)

    def forward(self, data):
        output = self.linear_1(data)
        output = self.linear_2(output)
        return output
Torchrun (included with PyTorch) makes this surprisingly easy.
How you want the CPUs to work together is not clear from your question, but I am assuming (because you refer to DistributedDataParallel) that you would like to distribute the data across multiple cores, which all do backward passes and broadcast their losses to the main process.
First, see if torch.distributed is available: torch.distributed.is_available().
Torchrun requires your script to have a few tweaks.
To initialize a process group, include
import torch.distributed as dist
dist.init_process_group(backend="gloo")
The backend must be gloo for CPUs.
Torchrun sets the environment variables MASTER_PORT, MASTER_ADDR, WORLD_SIZE, and RANK, which are required for torch.distributed.
Then, to be able to broadcast the losses across cores, use
import os

# LOCAL_RANK identifies this process; for a CPU-only (gloo) setup,
# device_ids and output_device must be left as None
local_rank = int(os.environ["LOCAL_RANK"])
model = torch.nn.parallel.DistributedDataParallel(model)
Every core still gets to see the same data.
This can be more efficient using DistributedSampler:
from torch.utils.data.distributed import DistributedSampler

train_sampler = DistributedSampler(train_data)
train_loader = DataLoader(
    train_data,
    ...,              # other DataLoader arguments as before
    shuffle=False,    # train_sampler will shuffle for you
    sampler=train_sampler,
)

for e in range(1, epochs + 1):
    train_sampler.set_epoch(e)
    train(train_loader)
where train(train_loader) does the training steps. Note the train_sampler.set_epoch(e).
This makes sure that every epoch the data is distributed to processes differently.
To start the training, run
torchrun \
    --standalone \
    --nnodes=1 \
    --nproc_per_node=$NUM_TRAINERS \
    YOUR_TRAINING_SCRIPT.py
where $NUM_TRAINERS is the number of processes that work on improving your model.
It is recommended, though not necessary, to include

from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    # do train
    pass

if __name__ == "__main__":
    main()
to see which process throws what error.
If I misunderstood your question and you only want more workers to load the data, then pass num_workers=N to DataLoader.
This ensures more workers (other than the main process) push the data into RAM while the current batch is being processed.
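For example, the loader from the question with four worker processes would look roughly like this:

train_loader = DataLoader(
    dataset=train_data,
    batch_size=64,
    num_workers=4,  # four subprocesses prepare batches while the main process trains
)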
I have written a custom keras callback to check the augmented data from a generator. (See this answer for the full code.) However, when I tried to use the same callback for a tf.data.Dataset, it gave me an error:
File "/path/to/tensorflow_image_callback.py", line 16, in on_batch_end
imgs = self.train[batch][images_or_labels]
TypeError: 'PrefetchDataset' object is not subscriptable
Do Keras callbacks in general only work with generators, or is it something about the way I've written mine? Is there a way to modify either my callback or the dataset to make it work?
I think there are three pieces to this puzzle. I'm open to changes to any and all of them. Firstly, the init function in the custom callback class:
class TensorBoardImage(tf.keras.callbacks.Callback):

    def __init__(self, logdir, train, validation=None):
        super(TensorBoardImage, self).__init__()
        self.logdir = logdir
        self.file_writer = tf.summary.create_file_writer(logdir)
        self.train = train
        self.validation = validation
Secondly, the on_batch_end function within that same class
def on_batch_end(self, batch, logs):
    images_or_labels = 0  # 0=images, 1=labels
    imgs = self.train[batch][images_or_labels]
Thirdly, instantiating the callback:

import tensorflow_image_callback

tensorboard_image_callback = tensorflow_image_callback.TensorBoardImage(
    logdir=tensorboard_log_dir, train=train_dataset, validation=valid_dataset)

model.fit(train_dataset,
          epochs=n_epochs,
          validation_data=valid_dataset,
          callbacks=[
              tensorboard_callback,
              tensorboard_image_callback
          ])
Some related threads which haven't led me to an answer yet:
Accessing validation data within a custom callback
Create keras callback to save model predictions and targets for each batch during training
What ended up working for me was the following, using tfds:
the __init__ function:
def __init__(self, logdir, train, validation=None):
    super(TensorBoardImage, self).__init__()
    self.logdir = logdir
    self.file_writer = tf.summary.create_file_writer(logdir)

    # # from keras generator
    # self.train = train
    # self.validation = validation

    # from tf.data
    my_data = tfds.as_numpy(train)
    imgs = my_data['image']
then on_batch_end:
def on_batch_end(self, batch, logs):
    images_or_labels = 0  # 0=images, 1=labels
    imgs = self.train[batch][images_or_labels]

    # calculate epoch
    n_batches_per_epoch = self.train.samples / self.train.batch_size
    epoch = math.floor(self.train.total_batches_seen / n_batches_per_epoch)

    # since the training data is shuffled each epoch, we need to use the index_array to find
    # something which uniquely identifies the image and is constant throughout training
    first_index_in_batch = batch * self.train.batch_size
    last_index_in_batch = first_index_in_batch + self.train.batch_size
    last_index_in_batch = min(last_index_in_batch, len(self.train.index_array))
    img_indices = self.train.index_array[first_index_in_batch : last_index_in_batch]

    with self.file_writer.as_default():
        for ix, img in enumerate(imgs):
            # only post 1 out of every 1000 images to tensorboard
            if (img_indices[ix] % 1000) == 0:
                # instead of img_filename, I could just use str(img_indices[ix]) as a unique identifier
                # but this way makes it easier to find the unaugmented image
                img_filename = self.train.filenames[img_indices[ix]]

                # convert float to uint8, shift range to 0-255
                img -= tf.reduce_min(img)
                img *= 255 / tf.reduce_max(img)
                img = tf.cast(img, tf.uint8)
                img_tensor = tf.expand_dims(img, 0)  # tf.summary needs a 4D tensor

                tf.summary.image(img_filename, img_tensor, step=epoch)
I didn't need to make any changes to the instantiation.
I recommend only using it for debugging, otherwise it saves every nth image in your dataset to tensorboard every epoch. That can end up using a lot of disk space.
I am training a neural network on a Google Colab GPU, so I synchronized the input images (180k in total: 105k for training, 76k for validation) with my Google Drive. Then I mount Google Drive and go from there.
I load a CSV file with image paths and labels in Google Colab and store it as a pandas DataFrame.
After that I work with a list of image paths and labels.
I use this function to one-hot encode my labels, because I need a special output shape of (7, 35) per label, which cannot be produced by the existing default functions:
# one-hot encoding of the labels; the target array has a shape of (7, 35)
from numpy import argmax

# define input string
def my_onehot_encoded(label):
    # define universe of possible input values
    characters = '0123456789ABCDEFGHIJKLMNPQRSTUVWXYZ'
    # define a mapping of chars to integers
    char_to_int = dict((c, i) for i, c in enumerate(characters))
    int_to_char = dict((i, c) for i, c in enumerate(characters))
    # integer encode input data
    integer_encoded = [char_to_int[char] for char in label]
    # one hot encode
    onehot_encoded = list()
    for value in integer_encoded:
        character = [0 for _ in range(len(characters))]
        character[value] = 1
        onehot_encoded.append(character)
    return onehot_encoded
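For example, a hypothetical 7-character label drawn from this 35-character set (digits plus letters without O) produces the (7, 35) shape:

import numpy as np

encoded = my_onehot_encoded('AB12345')  # hypothetical 7-character label
print(np.array(encoded).shape)          # (7, 35)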
After that I use a customized DataGenerator to get the data in batches into my model. x_set is a list of image paths to my images and y_set are the onehot-encoded labels:
class DataGenerator(Sequence):

    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        batch_x = self.x[idx * self.batch_size : (idx + 1) * self.batch_size]
        batch_x = np.array([resize(imread(file_name), (224, 224)) for file_name in batch_x])
        batch_x = batch_x * 1./255
        batch_y = self.y[idx * self.batch_size : (idx + 1) * self.batch_size]
        batch_y = np.array(batch_y)

        return batch_x, batch_y
And with this code I apply the DataGenerator to my data:
training_generator = DataGenerator(X_train, y_train, batch_size=32)
validation_generator = DataGenerator(X_val, y_val, batch_size=32)
When I now train my model, one epoch takes 25-40 minutes, which is very long.
model.fit_generator(generator=training_generator,
                    validation_data=validation_generator,
                    steps_per_epoch=num_train_samples // 16,
                    validation_steps=num_val_samples // 16,
                    epochs=10, workers=6, use_multiprocessing=True)
I am now wondering how to measure the preprocessing time, because I don't think the slowness is due to the model size: I already experimented with models with fewer parameters, and the training time did not drop significantly. So I am suspicious of the preprocessing...
To measure time in Colab, you can use this autotime package:
!pip install ipython-autotime
%load_ext autotime
Additionally for profiling, you can use %time as mentioned here.
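For the generator itself, timing a single batch is a quick sanity check (a sketch using the training_generator defined in the question):

import time

start = time.perf_counter()
batch_x, batch_y = training_generator[0]  # one batch: 32 imread + resize calls
print(f"one batch took {time.perf_counter() - start:.2f} s")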
In general, to make the generator run faster, I suggest you copy the data from Google Drive to the local disk of the Colab machine; reading directly from the mounted Drive can be slow.
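For example, something along these lines copies an image folder from the mounted Drive to the Colab VM's local disk once per session (the paths are placeholders and assume the Drive is mounted at /content/drive):

import shutil

# copy once at the start of the session; reads during training then hit the fast local disk
shutil.copytree('/content/drive/MyDrive/images',  # hypothetical source folder on Drive
                '/content/local_images')          # local destination on the Colab VM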
If you are using TensorFlow 2.0, the cause could be this bug.
Workarounds are:
Call tf.compat.v1.disable_eager_execution() at the start of the code
Use model.fit rather than model.fit_generator. The former supports generators anyway.
Downgrade to TF 1.14
Regardless of the TensorFlow version, limit how much disk access you are doing; that is often a bottleneck.
Note that there does seem to be an issue with generators being slow in TF 1.13.2 and 2.0.1 (at least).
Since I'm a novice with PyTorch, this question might be a very trivial one, but I'd like to ask for your help with how to solve it.
I've implemented a network from a paper, using all the hyperparameters and all the layers described in the paper.
But when it starts training, even though I set the learning rate to 0.001 with decay, the errors don't go down. The training error stays around 3.3-3.4 and the test error around 3.5-3.6 over 100 epochs.
I could change the hyperparameters to improve the model, but since the paper states exact numbers, I'd like to check whether there's an error in the training code that I've implemented.
The code below is the code that I used for training.
from torch.utils.data.sampler import SubsetRandomSampler
import torch.nn.functional as F
import torch.optim as optim
import torch.nn as nn
import json
import torch
import math
import time
import os

model = nn.Sequential(Baseline(), Classification(40)).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)

batch = 32

train_path = '/content/mtrain'
train_data = os.listdir(train_path)

test_path = '/content/mtest'
test_data = os.listdir(test_path)

train_loader = torch.utils.data.DataLoader(train_data, batch, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_data, batch, shuffle=True)

train_loss, val_loss = [], []
epochs = 100
now = time.time()

print('training start!')
for epoch in range(epochs):
    running_loss = 0.0
    for bidx, trainb32 in enumerate(train_loader):
        bpts, blabel = [], []
        for i, data in enumerate(trainb32):
            path = os.path.join(train_path, data)
            with open(path, 'r') as f:
                jdata = json.load(f)
                label = jdata['label']
                pts = jdata['pts']
                bpts.append(pts)
                blabel.append(label)
        bpts = torch.tensor(bpts).transpose(1, 2).to(device)
        blabel = torch.tensor(blabel).to(device)
        input = data_aug(bpts).to(device)

        optimizer.zero_grad()
        y_pred, feat_stn, glob_feat = model(input)
        # print(f'global_feat is {global_feat}')

        loss = F.nll_loss(y_pred, blabel) + 0.001 * regularizer(feat_stn)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        if bidx % 10 == 9:
            vrunning_loss = 0
            vacc = 0
            model.eval()
            with torch.no_grad():
                # val batch
                for vbidx, testb32 in enumerate(test_loader):
                    bpts, blabel = [], []
                    for j, data in enumerate(testb32):
                        path = os.path.join(test_path, data)
                        with open(path, 'r') as f:
                            jdata = json.load(f)
                            label = jdata['label']
                            pts = jdata['pts']
                            bpts.append(pts)
                            blabel.append(label)
                    bpts = torch.tensor(bpts).transpose(1, 2).to(device)
                    blabel = torch.tensor(blabel).to(device)
                    input = data_aug(bpts).to(device)
                    vy_pred, vfeat_stn, vglob_feat = model(input)
                    # print(f'global_feat is {vglob_feat}')

                    vloss = F.nll_loss(vy_pred, blabel) + 0.001 * regularizer(vfeat_stn)
                    _, vy_max = torch.max(vy_pred, dim=1)
                    vy_acc = torch.sum(vy_max == blabel) / batch
                    vacc += vy_acc
                    vrunning_loss += vloss

            # print every training 10th batch
            train_loss.append(running_loss / len(train_loader))
            val_loss.append(vrunning_loss / len(test_loader))

            print(f"Epoch {epoch+1}/{epochs} {bidx}/{len(train_loader)}.. "
                  f"Train loss: {running_loss / 10:.3f}.."
                  f"Val loss: {vrunning_loss / len(test_loader):.3f}.."
                  f"Val Accuracy: {vacc/len(test_loader):.3f}.."
                  f"Time: {time.time() - now}")
            now = time.time()
            running_loss = 0
            model.train()

print(f'training finish! training time is {time.time() - now}')
print(model.parameters())
savePath = '/content/modelpath.pth'
torch.save(model.state_dict(), '/content/modelpath.pth')
Sorry for the basic question, but if there's no error in this training code, I'd be glad to know, and if there is, please give me a hint towards solving it.
I've implemented the PointNet code and the full version is available at https://github.com/RaraKim/PointNet/blob/master/PointNet_pytorch.ipynb
Thank you!
I saw your code, and I believe that you have some tensors that are declared manually. In torch, tensors have the "requires_grad" flag set to False by default, and I think that is why your backpropagation isn't working correctly. Can you try to fix that? I will be happy to help you further if the issue still persists.
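For illustration, this is the difference in question (a minimal sketch; whether it applies depends on which tensors in the linked notebook actually need gradients):

import torch

a = torch.tensor([1.0, 2.0, 3.0])                      # requires_grad defaults to False
b = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)  # this tensor will track gradients

print(a.requires_grad, b.requires_grad)  # False True

loss = (b * 2).sum()
loss.backward()
print(b.grad)  # tensor([2., 2., 2.])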