Memory utilization much higher than it should be - python

I'm using a simple method to extract descriptors from images and save them to disk in a .csv file. I have around 1M images and my network returns 512 features per image (float32).
Therefore, I estimate that at the end of the loop I would have 1e6 * 512 * 4 bytes / 1e9 ≈ 2 GB. However, I observed that it is using more than twice that memory.
index is a string and class_id is an int64, so I don't think they are the culprits here.
I have already tried using gc.collect() without any success. Do you think my code is leaving references behind?
Here is the method:
def prepare_gallery(self, data_loader, tta, pbar=False, dump_path=None):
    '''Compute embeddings for a data_loader and store them in the model.
    This is required before predicting on a test set.
    New entries should be removed from data before calling this function
    to avoid inferring on useless images.
    data_loader: a linear loader containing the database that the test set
    is compared against.'''
    self.set_mode('valid')
    self.net.cuda()
    n_iter = len(data_loader)  # number of batches
    if pbar:
        loader = tqdm(enumerate(data_loader), total=n_iter)
    else:
        loader = enumerate(data_loader)
    # Run inference and collect embeddings
    feat_list = []
    index_list = []
    class_list = []
    for i, (index, im, class_id) in loader:
        with torch.no_grad():
            feat = tta(self.net, im)
            # Returns something like np.random.random((32, 512))
        feat_list.extend(feat)
        index_list.extend(index)
        class_list.extend(class_id.tolist())
    if dump_path is not None:
        np.save(dump_path + '_ids', index_list)
        np.save(dump_path + '_cls', class_list)
        np.save(dump_path + '_feat', feat_list)
    return np.asarray(index_list), np.asarray(feat_list), np.asarray(class_list)
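One thing worth checking: np.random.random-style outputs are float64, so if tta returns float64 arrays, the stacked features occupy roughly 4 GB rather than the estimated ~2 GB. For comparison, here is a minimal sketch (an illustration, not the original code; names such as data_loader, tta and net are reused from the method above) that preallocates a float32 array, which avoids both the float64 doubling and the overhead of holding a million small per-row arrays in Python lists:
import numpy as np

n_images = len(data_loader.dataset)
feats = np.empty((n_images, 512), dtype=np.float32)
# Sanity check on the estimate: feats.nbytes / 1e9 is ~2.05 for 1e6 images
row = 0
for i, (index, im, class_id) in enumerate(data_loader):
    with torch.no_grad():
        feat = tta(net, im)                   # shape (batch_size, 512)
    feats[row:row + len(feat)] = feat         # cast/copy into the float32 block
    row += len(feat)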

Related

Jax / Neural Tangents `linearize` induces CUDA_ERROR_OUT_OF_MEMORY

Cross-posting from GitHub: https://github.com/google/neural-tangents/issues/144
We're trying to fine-tune a linearized Vision Transformer by adapting code from https://github.com/google-research/vision_transformer/blob/main/vit_jax.ipynb.
We're running into a really puzzling problem: when we load a model, we can train it, and after we linearize it, we can still use the pre-linearized model for training. However, when we try to train with the linearized model, we get:
RuntimeError: Internal: Failed to load in-memory CUBIN: CUDA_ERROR_OUT_OF_MEMORY: out of memory
This error emerges regardless of whether we are using 1 GPU or multiple. It also emerges whether we are using a large batch (512) or small (1).
We manually tested that a forward pass raises no error, and that a backward pass raises no error. We suspect that the error might arise from the following code (although we could be wrong!):
Their code:
def make_update_fn(*, apply_fn, accum_steps, lr_fn):
  """Returns update step for data parallel training."""

  def update_fn(opt, step, batch, rng):

    _, new_rng = jax.random.split(rng)
    # Bind the rng key to the device id (which is unique across hosts)
    # Note: This is only used for multi-host training (i.e. multiple computers
    # each with multiple accelerators).
    dropout_rng = jax.random.fold_in(rng, jax.lax.axis_index('batch'))

    def cross_entropy_loss(*, logits, labels):
      logp = jax.nn.log_softmax(logits)
      return -jnp.mean(jnp.sum(logp * labels, axis=1))

    def loss_fn(params, images, labels):
      logits = apply_fn(
          dict(params=params),
          rngs=dict(dropout=dropout_rng),
          inputs=images,
          train=True)
      return cross_entropy_loss(logits=logits, labels=labels)

    l, g = utils.accumulate_gradient(
        jax.value_and_grad(loss_fn), opt.target, batch['image'], batch['label'],
        accum_steps)
    g = jax.tree_map(lambda x: jax.lax.pmean(x, axis_name='batch'), g)
    l = jax.lax.pmean(l, axis_name='batch')
    opt = opt.apply_gradient(g, learning_rate=lr_fn(step))
    return opt, l, new_rng

  return jax.pmap(update_fn, axis_name='batch', donate_argnums=(0,))
That function is then called via:
# Check out train.make_update_fn in the editor on the right side for details.
lr_fn = utils.create_learning_rate_schedule(total_steps, base_lr, decay_type, warmup_steps)

update_fn_repl = train.make_update_fn(
    apply_fn=vit_apply, accum_steps=accum_steps, lr_fn=lr_fn)

# We use a momentum optimizer that uses half precision for state to save
# memory. It also implements the gradient clipping.
opt = momentum_clip.Optimizer(grad_norm_clip=grad_norm_clip).create(params)
opt_repl = flax.jax_utils.replicate(opt)
The training loop where the memory error arises:
losses = []
lrs = []
# Completes in ~20 min on the TPU runtime.
for step, batch in zip(
    tqdm.trange(1, total_steps + 1),
    ds_train.as_numpy_iterator(),
):
  opt_repl, loss_repl, update_rng_repl = update_fn_repl(
      opt_repl, flax.jax_utils.replicate(step), batch, update_rng_repl)  # ERROR IS HERE
  losses.append(loss_repl[0])
  lrs.append(lr_fn(step))
In order to linearize the ViT, we do the following:
def vit_apply(params, input):
  return model.apply(dict(params=params), input, train=True)

f_lin = nt.linearize(vit_apply, params)
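For reference, a hedged sketch (an assumption on our part, not code from the notebook) of how f_lin could be wrapped to match the keyword signature that update_fn passes to apply_fn; vit_apply_lin is a hypothetical name:
import neural_tangents as nt

f_lin = nt.linearize(vit_apply, params)  # keeps vit_apply's (params, inputs) signature

def vit_apply_lin(variables, *, rngs=None, inputs=None, train=True):
    # update_fn passes dropout rngs and a train flag; vit_apply (and hence f_lin)
    # was defined without them, so this wrapper simply drops those arguments.
    del rngs, train
    return f_lin(variables['params'], inputs)

update_fn_repl = train.make_update_fn(
    apply_fn=vit_apply_lin, accum_steps=accum_steps, lr_fn=lr_fn)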

How to access a tf.data.Dataset within a Keras custom callback?

I have written a custom keras callback to check the augmented data from a generator. (See this answer for the full code.) However, when I tried to use the same callback for a tf.data.Dataset, it gave me an error:
File "/path/to/tensorflow_image_callback.py", line 16, in on_batch_end
imgs = self.train[batch][images_or_labels]
TypeError: 'PrefetchDataset' object is not subscriptable
Do Keras callbacks in general only work with generators, or is it something about the way I've written mine? Is there a way to modify either my callback or the dataset to make it work?
I think there are three pieces to this puzzle. I'm open to changes to any and all of them. Firstly, the init function in the custom callback class:
class TensorBoardImage(tf.keras.callbacks.Callback):
    def __init__(self, logdir, train, validation=None):
        super(TensorBoardImage, self).__init__()
        self.logdir = logdir
        self.file_writer = tf.summary.create_file_writer(logdir)
        self.train = train
        self.validation = validation
Secondly, the on_batch_end function within that same class:
def on_batch_end(self, batch, logs):
    images_or_labels = 0  # 0=images, 1=labels
    imgs = self.train[batch][images_or_labels]
Thirdly, instantiating the callback:
import tensorflow_image_callback
tensorboard_image_callback = tensorflow_image_callback.TensorBoardImage(
    logdir=tensorboard_log_dir, train=train_dataset, validation=valid_dataset)
model.fit(train_dataset,
          epochs=n_epochs,
          validation_data=valid_dataset,
          callbacks=[
              tensorboard_callback,
              tensorboard_image_callback
          ])
Some related threads which haven't led me to an answer yet:
Accessing validation data within a custom callback
Create keras callback to save model predictions and targets for each batch during training
What ended up working for me was the following, using tfds:
the __init__ function:
def __init__(self, logdir, train, validation=None):
    super(TensorBoardImage, self).__init__()
    self.logdir = logdir
    self.file_writer = tf.summary.create_file_writer(logdir)
    # # from keras generator
    # self.train = train
    # self.validation = validation
    # from tf.data
    my_data = tfds.as_numpy(train)
    imgs = my_data['image']
then on_batch_end:
def on_batch_end(self, batch, logs):
    images_or_labels = 0  # 0=images, 1=labels
    imgs = self.train[batch][images_or_labels]

    # calculate epoch
    n_batches_per_epoch = self.train.samples / self.train.batch_size
    epoch = math.floor(self.train.total_batches_seen / n_batches_per_epoch)

    # since the training data is shuffled each epoch, we need to use the index_array
    # to find something which uniquely identifies the image and is constant
    # throughout training
    first_index_in_batch = batch * self.train.batch_size
    last_index_in_batch = first_index_in_batch + self.train.batch_size
    last_index_in_batch = min(last_index_in_batch, len(self.train.index_array))
    img_indices = self.train.index_array[first_index_in_batch:last_index_in_batch]

    with self.file_writer.as_default():
        for ix, img in enumerate(imgs):
            # only post 1 out of every 1000 images to tensorboard
            if (img_indices[ix] % 1000) == 0:
                # instead of img_filename, I could just use str(img_indices[ix]) as a
                # unique identifier, but this way makes it easier to find the
                # unaugmented image
                img_filename = self.train.filenames[img_indices[ix]]
                # convert float to uint8, shift range to 0-255
                img -= tf.reduce_min(img)
                img *= 255 / tf.reduce_max(img)
                img = tf.cast(img, tf.uint8)
                img_tensor = tf.expand_dims(img, 0)  # tf.summary needs a 4D tensor
                tf.summary.image(img_filename, img_tensor, step=epoch)
I didn't need to make any changes to the instantiation.
I recommend only using it for debugging, otherwise it saves every nth image in your dataset to tensorboard every epoch. That can end up using a lot of disk space.
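Alternatively, a minimal sketch (an assumption, not the approach above) that sidesteps the subscripting problem by iterating the dataset instead of indexing it; DatasetImageCallback is a hypothetical name:
import tensorflow as tf

class DatasetImageCallback(tf.keras.callbacks.Callback):
    def __init__(self, logdir, train_ds):
        super().__init__()
        self.file_writer = tf.summary.create_file_writer(logdir)
        # tf.data pipelines are not subscriptable, but they are iterable;
        # use iter(train_ds.repeat()) if training runs for more than one pass.
        self.train_iter = iter(train_ds)

    def on_batch_end(self, batch, logs=None):
        images, labels = next(self.train_iter)  # one (images, labels) batch
        with self.file_writer.as_default():
            # tf.summary.image expects float values in [0, 1] or uint8 in [0, 255]
            tf.summary.image("augmented_batch", images, step=batch, max_outputs=3)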

Custom Keras generator much slower compared to Keras' built-in generator

I have a multi-label classification problem. I wrote this custom generator; it reads images and their output labels from disk and returns them in batches of size 32.
def get_input(img_name):
    path = os.path.join("images", img_name)
    img = image.load_img(path, target_size=(224, 224))
    return img

def get_output(img_name, file_path):
    # Note: this re-reads the whole label file from disk once per image.
    data = pd.read_csv(file_path, delim_whitespace=True, header=None)
    img_id = img_name.split(".")[0]
    img_id = img_id.lstrip("0")
    img_id = int(img_id)
    labels = data.loc[img_id - 1].values
    labels = labels[1:]
    labels = list(labels)
    label_arrays = []
    for i in range(20):
        val = np.zeros((1))
        val[0] = labels[i]
        label_arrays.append(val)
    return label_arrays

def preprocess_input(img_name):
    img = get_input(img_name)
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    return x

def train_generator(batch_size):
    file_path = "train.txt"
    data = pd.read_csv(file_path, delim_whitespace=True, header=None)
    while True:
        for i in range(math.floor(8000 / batch_size)):
            x_batch = np.zeros(shape=(batch_size, 224, 224, 3))
            y_batch = np.zeros(shape=(batch_size, 20))
            for j in range(batch_size):
                img_name = data.loc[i * batch_size + j].values
                img_name = img_name[0]
                x = preprocess_input(img_name)
                y = get_output(img_name, file_path)
                x_batch[j, :, :, :] = x
                y_batch[j] = y
            ys = []
            for i in range(20):
                ys.append(y_batch[:, i])
            yield (x_batch, ys)
I had a little problem with the labels returned to the model, and got it solved in this question:
training a multi-output keras model
I tested this generator on a single-output problem. This custom generator is very slow: the ETA for a single epoch with it is around 27 hours, while the built-in generator (using flow_from_directory) takes 25 minutes per epoch. What am I doing wrong?
The training process for both tests is identical, except for the generator used. The validation generator is similar to the training generator. I know I will not reach the efficiency of Keras' built-in generator, but this difference in speed is too much.
EDIT
Some guides I read for creating custom generators.
Writing Custom Keras Generators
custom generator for fit_generator() that yields multiple inputs with different shapes
Maybe the built-in generator processes the data on your GPU while your custom generator runs on the CPU, making it significantly slower.
Another guess is that Keras uses a tf.data Dataset in the background, while your implementation probably relies on feed_dict, which is the slowest possible way to pass information to TensorFlow. The best way to feed data into a model is to use an input pipeline, so that the GPU never has to wait for new data to come in.
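Building on that last suggestion, a minimal sketch of such an input pipeline (assuming TF 2.4+; the image_paths and labels arrays, and the single-output label shape, are illustrative assumptions rather than the questioner's exact setup):
import tensorflow as tf

def make_dataset(image_paths, labels, batch_size=32):
    # Decode and resize each image on the fly; tf.data parallelizes this and
    # prefetches batches so the GPU does not sit idle waiting for data.
    def load(path, label):
        img = tf.io.read_file(path)
        img = tf.image.decode_jpeg(img, channels=3)
        img = tf.image.resize(img, (224, 224))
        return img, label

    ds = tf.data.Dataset.from_tensor_slices((image_paths, labels))
    ds = ds.map(load, num_parallel_calls=tf.data.AUTOTUNE)
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)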

How can I get the indices of the data used in every batch?

I need to save the indices of the data that are used in every mini-batch.
For example if my data is:
x = np.array([[1.1], [2.2], [3.3], [4.4]])
and the first mini-batch is [1.1] and [3.3], then I want to store 0 and 2 (since [1.1] is the 0th observation and [3.3] is the 2nd observation).
I am using TensorFlow in eager execution with the Keras Sequential API.
As far as I can tell from reading the source code, this information is not stored anywhere so I was unable to do this with a callback.
I am currently solving my problem by creating an object that stores the indices.
class IndexIterator(object):
    def __init__(self, n, n_epochs, batch_size, shuffle=True):
        data_ix = np.arange(n)
        if shuffle:
            np.random.shuffle(data_ix)
        self.ix_batches = np.array_split(data_ix, np.ceil(n / batch_size))
        self.batch_indices = []

    def generate_arrays(self, x, y):
        batch_ixs = np.arange(len(self.ix_batches))
        while 1:
            np.random.shuffle(batch_ixs)
            for batch in batch_ixs:
                self.batch_indices.append(self.ix_batches[batch])
                yield (x[self.ix_batches[batch], :], y[self.ix_batches[batch], :])

data_gen = IndexIterator(n=32, n_epochs=100, batch_size=16)
dnn.fit_generator(data_gen.generate_arrays(x, y),
                  steps_per_epoch=2,
                  epochs=100)
# This is what I am looking for
print(data_gen.batch_indices)
Is there no way to do this using a tensorflow callback?
Not sure if this will be more efficient than your solution, but it is certainly more general.
If you have training data with n indices, you can create a secondary Dataset that contains only these indices and zip it with the "real" dataset.
For example:
real_data = tf.data.Dataset ...
indices = tf.data.Dataset.from_tensor_slices(tf.range(data_set_length))
total_dataset = tf.data.Dataset.zip((real_data, indices))

# Perform optional pre-processing ops.

iterator = total_dataset.make_one_shot_iterator()

# Next line yields `(original_data_element, index)`
item_and_index_tuple = iterator.get_next()
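A usage sketch for the zipped dataset (assuming eager execution / TF 2.x, as in the question): because the indices are zipped in before shuffling and batching, they stay attached to their examples, so batch membership can simply be recorded while iterating.
batched = total_dataset.shuffle(buffer_size=1024).batch(16)
batch_indices = []
for data_element, idx in batched:
    # data_element is whatever real_data yields (e.g. an (x, y) pair);
    # idx holds the original positions of the examples in this mini-batch.
    batch_indices.append(idx.numpy())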

Possible to use keras “flow_from_directory” with “.mat” files?

I have built a CNN to predict lymph node positivity (has cancer or not). Right now, to load the data, I have a self-defined function that loads a batch of data and feeds it to the model for training.
Instead of loading batches I would love to use the flow_from_directory method. The problem I have is that my data are saved as arrays [#, rows, width, height, PET or CT], not as images that would later be converted to arrays. For example, [0,:,:,:,0] is a 48x48x32 volume from a CT image.
If I try to use flow_from_directory I get 0 images with 3 classes, which I expected, since '.mat' is not a recognized file type (https://github.com/keras-team/keras-preprocessing/blob/362fe9f8daf556151328eb5d02bd5ae638c653b8/keras_preprocessing/image.py#L1868). Interestingly, it doesn't raise any errors, but I am stuck indefinitely at 1/150 epochs. I am going to see if I can write my own flow_from_directory. Not sure if someone has run across this problem and could give me pointers.
Illustrating how the data is combined:
train_combo = np.zeros((1, 48, 48, 32, 2))
for fname in fnames:
    data = scipy.io.loadmat(os.path.join(dir_in_train, fname))['roi_patch']
    data_PET = scipy.io.loadmat(os.path.join(dir_in_train_PET, fname))['roi_patch']
    train_combo[0, :, :, :, 0] = data / 4.0950
    train_combo[0, :, :, :, 1] = data_PET / 32.1959
    # train_combo[0, :, :, :, :].shape == (48, 48, 32, 2)
    scipy.io.savemat(fname, {fname: train_combo})
This will create a file, e.g. '1.mat', that has the CT data and PET data in one array.
Then I have code that changes these into .npy files.
Example of the data generator I already have:
# load training data
def load_train_data_batch_generator(self, batch_size=32, rows_in=48, cols_in=48, zs_in=32,
                                    channels_in=2, num_classes=3,
                                    dir_in_train=None, dir_out_train=None):
    # dir_in_train = main_dir + '/test_CT_PET_combo'
    fnames = ['{}.mat'.format(i) for i in range(1, len(os.listdir(dir_in_train)) + 1)]
    y_train = np.zeros((batch_size, num_classes))
    x_train = np.zeros((batch_size, rows_in, cols_in, zs_in, channels_in))
    while True:
        count = 0
        for fname in np.random.choice(fnames, batch_size, replace=False):
            data_label = scipy.io.loadmat(os.path.join(dir_out_train, fname))['output']
            # changing one-hot encoding to integer
            integer_label = np.argmax(data_label[0], axis=0)
            y_train[count, :] = data_label
            # Loading train ct w/ c and pet/ct combo that will be saved into new directory
            train_combo = scipy.io.loadmat(os.path.join(dir_in_train, fname))[fname]
            x_train[count, :, :, :, :] = train_combo
            count += 1
        yield (x_train, y_train)
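As a possible alternative to a hand-rolled flow_from_directory, a hedged sketch of a keras.utils.Sequence that batches the .mat volumes; the directory layout and the 'output' / fname key names mirror the generator above and are assumptions, not a confirmed fix:
import os
import numpy as np
import scipy.io
from tensorflow.keras.utils import Sequence

class MatVolumeSequence(Sequence):
    def __init__(self, dir_in, dir_out, batch_size=32):
        self.fnames = ['{}.mat'.format(i) for i in range(1, len(os.listdir(dir_in)) + 1)]
        self.dir_in, self.dir_out, self.batch_size = dir_in, dir_out, batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.fnames) / self.batch_size))

    def __getitem__(self, idx):
        names = self.fnames[idx * self.batch_size:(idx + 1) * self.batch_size]
        # each .mat file stores a (1, 48, 48, 32, 2) array under its own filename key
        x = np.stack([scipy.io.loadmat(os.path.join(self.dir_in, n))[n][0] for n in names])
        y = np.stack([scipy.io.loadmat(os.path.join(self.dir_out, n))['output'][0] for n in names])
        return x, y

# Usage: model.fit(MatVolumeSequence(dir_in_train, dir_out_train), epochs=...)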
