Broken Pipe in PyTorch DataLoader

I was trying to understand how DataLoader works.
This is how I applied it:
# DATASET
class Word2VecDataset(torch_data.Dataset):
    def __init__(self, vocabulary):
        super(Word2VecDataset, self).__init__()
        self.data_list = []
        self.vocab = vocabulary
        self.generate_batch_list()

    def __getitem__(self, index):
        return self.data_list[index]

    def __len__(self):
        return len(self.data_list)

    def generate_batch_list(self):
        # Build the training entries from every query and response phrase.
        training_data = self.vocab.get_training_phrases()
        for query in training_data.Query:
            query = utils.skip_gram_tokenize(vocab=self.vocab, sentence=query)
            for entry in query:
                self.data_list.append(entry)
        for response in training_data.Response:
            response = utils.skip_gram_tokenize(vocab=self.vocab, sentence=response)
            for entry in response:
                self.data_list.append(entry)
And this is the actual dataloader part:
dataset = Word2VecDataset(self.vocab)
# positional args: dataset, batch_size, shuffle
data_loader = torch_data.DataLoader(dataset, self.batch_size, True, num_workers=4)
print('Model Initialized')
for epo in range(self.num_epochs):
    loss_val = None
    for i_batch, sample_batched in enumerate(data_loader):
        # This loop seems to be causing issues. For some reason this is the
        # part that 'reboots' the whole model, making it print twice
        # (more info below the code).
        loss_val = 0
        for data, target in sample_batched:
            ....
Now, weirdly enough, both the initialization phase (which you don't see here), which prints 'This is the gpu detected: xxx', and the print('Model Initialized') line get printed twice.
Finally, here (pastebin) is the full console log (with the error).

I have the same issue. I solved it by using if __name__ == '__main__': in my Python script, but I was unable to solve the broken pipe in my Jupyter notebook...
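For anyone hitting this: on Windows (and anywhere the spawn start method is used), each DataLoader worker re-imports the main module, so all module-level code runs once per worker, which is also why the startup prints appear twice. A minimal sketch of the guard, with vocab and num_epochs standing in as placeholders for the question's own objects:

import torch.utils.data as torch_data

def main():
    dataset = Word2VecDataset(vocab)  # the Dataset from the question
    data_loader = torch_data.DataLoader(dataset, batch_size=32,
                                        shuffle=True, num_workers=4)
    for epo in range(num_epochs):
        for i_batch, sample_batched in enumerate(data_loader):
            ...  # training step

if __name__ == '__main__':
    main()  # workers re-import this module, but nothing runs at import time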


Is there a way to pickle a custom tensorflow.keras metric?

I defined the following custom metric to train my model in tensorflow:
import tensorflow as tf
from tensorflow import keras as ks

N_CLASSES = 15

class MulticlassMeanIoU(tf.keras.metrics.MeanIoU):
    def __init__(self,
                 y_true=None,
                 y_pred=None,
                 num_classes=None,
                 name="Multi_MeanIoU",
                 dtype=None):
        super(MulticlassMeanIoU, self).__init__(num_classes=num_classes,
                                                name=name, dtype=dtype)
        self.__name__ = name

    def get_config(self):
        base_config = super().get_config()
        return {**base_config, "num_classes": self.num_classes}

    def update_state(self, y_true, y_pred, sample_weight=None):
        # Collapse the class-probability axis to hard predictions before
        # delegating to MeanIoU, which expects class indices.
        y_pred = tf.math.argmax(y_pred, axis=-1)
        return super().update_state(y_true, y_pred, sample_weight)

met = MulticlassMeanIoU(num_classes=N_CLASSES)
After training the model, I save it, and I also tried to save the custom metric object as follows:
with open("/some/path/custom_metrics.pkl", "wb") as f:
pickle.dump(met, f)
However, when I try to load the metric like this:
with open(path_custom_metrics, "rb") as f:
    met = pickle.load(f)
I always get some errors, e.g. AttributeError: 'MulticlassMeanIoU' object has no attribute 'update_state_fn'.
Now I wonder whether it is possible to pickle a custom metric at all and, if so, how. It would come in handy if I could save custom metrics with the model: when I load the model in another Python session, I would always have the metric that is required to load the model in the first place. I could define the metric anew by pasting the full code into the other script before loading the model, but I think this would be bad style and could cause problems if I changed something about the metric in the training script and forgot to copy the code over.
If you need to pickle a metric, one possible solution is to implement the __getstate__() and __setstate__() methods. These two methods are called during (de)serialization, if they are available. Add them to your class and you will have what you need. I tried to make this as general as possible, so that it works for any Metric:
# Requires: from typing import Any, Dict
def __getstate__(self):
    variables = {v.name: v.numpy() for v in self.variables}
    state = {
        name: variables[var.name]
        for name, var in self._unconditional_dependency_names.items()
        if isinstance(var, tf.Variable)}
    state['name'] = self.name
    state['num_classes'] = self.num_classes
    return state

def __setstate__(self, state: Dict[str, Any]):
    self.__init__(name=state.pop('name'), num_classes=state.pop('num_classes'))
    for name, value in state.items():
        self._unconditional_dependency_names[name].assign(value)
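A quick round-trip sketch of how that plays out (assuming the two methods above have been added to MulticlassMeanIoU):

import pickle

met = MulticlassMeanIoU(num_classes=N_CLASSES)
# ... update_state() is called during/after training ...

with open("custom_metrics.pkl", "wb") as f:
    pickle.dump(met, f)            # __getstate__ captures name, num_classes, variables

with open("custom_metrics.pkl", "rb") as f:
    met_restored = pickle.load(f)  # __setstate__ re-inits and re-assigns them

print(met_restored.result())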

How do I properly restore a TensorFlow checkpoint?

I've extended the Python implementation of WGAN-GP from here: https://keras.io/examples/generative/wgan_gp/
Basically, I added a callback to the fit function:
class GANCheckpoint(keras.callbacks.Callback):
    def __init__(self, cpkt=None, manager=None):
        self.cpkt = cpkt
        self.manager = manager

    def on_epoch_begin(self, epoch, logs=None):
        if self.manager.latest_checkpoint:
            self.cpkt.restore(self.manager.latest_checkpoint)
            print("Restored from {}".format(self.manager.latest_checkpoint))
        else:
            print("Initializing from scratch.")

    def on_epoch_end(self, epoch, logs=None):
        save_path = self.manager.save()
        self.cpkt.step.assign_add(1)
        print("\nSaved checkpoint for step {}: {}".format(int(self.cpkt.step), save_path))
And the checkpoint manager is initialized as:
# Checkpoint manager
checkpoint_dir = './training_checkpoints/GAN/'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(step=tf.Variable(1),
                                 d_model=d_model, g_model=g_model,
                                 discriminator_optimizer=discriminator_optimizer,
                                 generator_optimizer=generator_optimizer)
manager = tf.train.CheckpointManager(checkpoint, checkpoint_dir, max_to_keep=None)
cbk = GANCheckpoint(cpkt=checkpoint, manager=manager)
Finally, I have the fit call:
wgan.fit(X, batch_size=BATCH_SIZE, epochs=epochs, verbose = True, callbacks=[cbk])
I'm using checkpoint.restore(manager.latest_checkpoint) to restore weights in another python file.
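Concretely, the restore in the second file looks roughly like this (a sketch; it assumes d_model, g_model and both optimizers are rebuilt there with the same construction code, and the status assertion is just a way to surface silent mismatches):

checkpoint = tf.train.Checkpoint(step=tf.Variable(1),
                                 d_model=d_model, g_model=g_model,
                                 discriminator_optimizer=discriminator_optimizer,
                                 generator_optimizer=generator_optimizer)
manager = tf.train.CheckpointManager(checkpoint, './training_checkpoints/GAN/',
                                     max_to_keep=None)
status = checkpoint.restore(manager.latest_checkpoint)
status.assert_existing_objects_matched()  # raises if the restore matched nothing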
However, my generator's results are way off compared to what they are supposed to be.
I'm using the following code:
for i in range(10):
    a = tf.random.normal(shape=(1, 128))
    sample = checkpoint.g_model.predict(a)
    print(sample)
I checked the weights of the generator and optimizer; they're coherent and seem identical.
Are checkpoints tied to a specific Python file?
Additionally, even when I try to restore a checkpoint in the original Python file without fitting the model first, it does not work either.
Do you have any idea?
Thanks in advance

Where in the code of PyTorch or huggingface/transformers does label get "renamed" into labels?

My question concerns an example available in the great huggingface/transformers library.
I am using a notebook provided by the library creators as a starting point for my pipeline. It presents a pipeline for fine-tuning BERT for sentence classification on the GLUE dataset.
When getting into the code, I noticed a very weird thing which I cannot explain.
In the example, input data is introduced to the model as instances of the InputFeatures class from here:
This class has 4 attributes, including the label attribute:
class InputFeatures:
    ...
    input_ids: List[int]
    attention_mask: Optional[List[int]] = None
    token_type_ids: Optional[List[int]] = None
    label: Optional[Union[int, float]] = None
which are later passed as a dictionary of inputs to the forward() method of the model. This is done by the Trainer class, for example in lines 573-576 here:
def _training_step(
    self, model: nn.Module, inputs: Dict[str, torch.Tensor], optimizer: torch.optim.Optimizer
) -> float:
    model.train()
    for k, v in inputs.items():
        inputs[k] = v.to(self.args.device)

    outputs = model(**inputs)
However, the forward() method expects a labels (note the plural form) input parameter (taken from here):
def forward(
    self,
    input_ids=None,
    attention_mask=None,
    head_mask=None,
    inputs_embeds=None,
    labels=None,
    output_attentions=None,
):
So my question is: where does label become labels in this pipeline?
To give some extra info on the issue: I created my own pipeline, which uses nothing related to the GLUE data and pipeline; basically, it relies only on the Trainer class of transformers. I even use another model (Flaubert). I replicated the InputFeature class, and my code works for both of the cases below:
class InputFeature:
    def __init__(self, text, label):
        self.input_ids = text
        self.label = label

class InputFeaturePlural:
    def __init__(self, text, label):
        self.input_ids = text
        self.labels = label
But it does not work if I name the second attribute self.labe or any other name. Why is it possible to use both attribute names?
It's not like it is extremely important in my case, but I feel uncomfortable passing the data around in a variable that "changes name" somewhere along the way.
The rename happens in the collator. In the Trainer init, when data_collator is None, a default one is used:
class Trainer:
    # ...
    def __init__(...):
        # ...
        self.data_collator = data_collator if data_collator is not None else default_data_collator
        # ...
FYI, the self.data_collator is later used when you get the dataloader:
data_loader = DataLoader(
    self.train_dataset,
    batch_size=self.args.train_batch_size,
    sampler=train_sampler,
    collate_fn=self.data_collator,  # <-- here
    drop_last=self.args.dataloader_drop_last,
)
The default collator has special handling for labels, which performs this renaming if needed:
# Special handling for labels.
# Ensure that tensor is created with the correct type
# (it should be automatically the case, but let's make sure of it.)
if hasattr(first, "label") and first.label is not None:
    if type(first.label) is int:
        labels = torch.tensor([f.label for f in features], dtype=torch.long)
    else:
        labels = torch.tensor([f.label for f in features], dtype=torch.float)
    batch = {"labels": labels}  # <-- here is where it happens
elif hasattr(first, "label_ids") and first.label_ids is not None:
    if type(first.label_ids[0]) is int:
        labels = torch.tensor([f.label_ids for f in features], dtype=torch.long)
    else:
        labels = torch.tensor([f.label_ids for f in features], dtype=torch.float)
    batch = {"labels": labels}
else:
    batch = {}
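You can watch the rename happen in isolation with a toy feature class (a sketch; Feature is a made-up stand-in for InputFeatures, and it assumes a reasonably recent transformers that exports default_data_collator):

from dataclasses import dataclass
from typing import List

from transformers import default_data_collator

@dataclass
class Feature:
    input_ids: List[int]
    label: int

batch = default_data_collator([Feature([1, 2, 3], 0), Feature([4, 5, 6], 1)])
print(batch.keys())  # input_ids survives as-is; label comes out as 'labels'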

Getting a ValueError when converting a Python list into a NumPy array

I'm working with a piece of code written by someone else for domain generalization, and as part of it, I have a dataloader set up for loading my training, validation, and test data for one of my datasets. The code works fine when I load the train or test data, but when I try to load the val data I get ValueError: could not broadcast input array from shape (320,371) into shape (320) in the load_samples function at the images = np.asarray(images) line.

I understand what this error is saying, but I can't for the life of me figure out why it's happening. The code for the val section is identical to the code for the train and test sections, and the csv file I'm reading from has the exact same format as the other two csv files. I'm also calling the get_chexpert function for each of them in the exact same way. Additionally, the dataloader for my other dataset has nearly identical code to this one and can create the validation set just fine. I tried testing whether it was the csv file by replacing the val csv with the test csv, but I still get the same error.

Can anyone point out what I'm doing wrong? I feel like it must be some stupidly obvious mistake, but I just can't see it.
import os
import csv
from PIL import Image
import numpy as np
import torch
import torch.utils.data as data
from torchvision import datasets, transforms
import params
class Chexpert(data.Dataset):
    def __init__(self, root, train=True, val=False, transform=None):
        """Init chexpert dataset."""
        # init params
        self.root = os.path.expanduser(root)
        self.train = train
        self.val = val
        self.transform = transform
        self.dataset_size = None

        self.train_data, self.train_labels = self.load_samples()
        if self.train:
            total_num_samples = self.train_labels.shape[0]
            indices = np.arange(total_num_samples)
            np.random.shuffle(indices)
            self.train_data = self.train_data[indices[0:self.dataset_size]]
            self.train_labels = self.train_labels[indices[0:self.dataset_size]]

    def __getitem__(self, index):
        """Get images and target for data loader.

        Args:
            index (int): Index

        Returns:
            tuple: (image, target) where target is index of the target class.
        """
        img, label = self.train_data[index], self.train_labels[index]
        if self.transform is not None:
            img = self.transform(img)
        label = torch.LongTensor([np.int64(label).item()])
        return img, label

    def __len__(self):
        """Return size of dataset."""
        return self.dataset_size
    def load_samples(self):
        """Load sample images from dataset."""
        # some arbitrary limits so I'm not loading 100,000 images while debugging
        numtr = 50
        numts = 20
        numvl = 10
        data_root = os.path.join(self.root, 'CheXpert-v1.0-small')
        images = []
        labels = []
        if self.val:
            val_info = csv.reader(open(os.path.join(data_root, 'effusion-val-split.csv'), 'r'))
            for count, row in enumerate(val_info):
                if count == numvl:
                    break
                image = np.array(Image.open(os.path.join(self.root, row[0])))
                images.append(image)
                labels.append(row[1])
        elif self.train:
            train_info = csv.reader(open(os.path.join(data_root, 'effusion-train-split.csv'), 'r'))
            for count, row in enumerate(train_info):
                if count == numtr:
                    break
                image = np.array(Image.open(os.path.join(self.root, row[0])))
                images.append(image)
                labels.append(row[1])
        elif not self.val and not self.train:
            test_info = csv.reader(open(os.path.join(data_root, 'effusion-test-split.csv'), 'r'))
            for count, row in enumerate(test_info):
                if count == numts:
                    break
                image = np.array(Image.open(os.path.join(self.root, row[0])))
                images.append(image)
                labels.append(row[1])
        images = np.asarray(images)  # <-- the failing line
        labels = np.asarray(labels)
        self.dataset_size = labels.shape[0]
        return images, labels
def get_chexpert(train=True, val=False):  # defaults added so the no-arg calls below work
    """Get chexpert dataset loader."""
    # image pre-processing
    pre_process = transforms.Compose([transforms.ToPILImage(),
                                      transforms.Resize((224, 224)),
                                      transforms.ToTensor(),
                                      #transforms.Normalize(
                                      #    mean=params.dataset_mean,
                                      #    std=params.dataset_std)
                                      ])

    # dataset and data loader
    chexpert_dataset = Chexpert(root=params.data_root,
                                train=train,
                                val=val,
                                transform=pre_process)

    chexpert_data_loader = torch.utils.data.DataLoader(
        dataset=chexpert_dataset,
        batch_size=params.batch_size,
        shuffle=True)

    return chexpert_data_loader
if __name__ == '__main__':
    # load dataset
    print("Loading Source Train Data")
    src_data_loader = get_chexpert()
    print("Loading Source Validation Data")
    src_data_loader_val = get_chexpert(train=False, val=True)
    print("Loading Source Test Data")
    src_data_loader_eval = get_chexpert(train=False)
    print("Loading Target Train Data")
    tgt_data_loader = get_nih()  # get_nih comes from the other dataset's dataloader module
    print("Loading Target Validation Data")
    tgt_data_loader_val = get_nih(train=False, val=True)
    print("Loading Target Test Data")
    tgt_data_loader_eval = get_nih(train=False)
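For what it's worth, here is a quick shape audit (a sketch) that could go right before the failing images = np.asarray(images) line. np.asarray can only stack the list into a single array when every image has the same shape, so more than one entry in the counter means the list is ragged:

from collections import Counter

# Tally the image shapes; more than one distinct shape means the list
# cannot be stacked into a single ndarray and np.asarray will fail.
shape_counts = Counter(img.shape for img in images)
print(shape_counts)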

How to display runtime statistics in TensorBoard using the Estimator API in a distributed environment

This article illustrates how to add runtime statistics to TensorBoard:
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
summary, _ = sess.run([merged, train_step],
                      feed_dict=feed_dict(True),
                      options=run_options,
                      run_metadata=run_metadata)
train_writer.add_run_metadata(run_metadata, 'step%d' % i)
train_writer.add_summary(summary, i)
print('Adding run metadata for', i)
which surfaces per-step runtime details (such as compute time and memory) in TensorBoard.
This is fairly straightforward on a single machine. How could one do this in a distributed environment using Estimators?
I use the following hook, based on ProfilerHook, to have the estimator output the run metadata into the model directory and inspect it later with Tensorboard.
import tensorflow as tf
from tensorflow.python.training.session_run_hook import SessionRunHook, SessionRunArgs
from tensorflow.python.training import training_util
from tensorflow.python.training.basic_session_run_hooks import SecondOrStepTimer

class MetadataHook(SessionRunHook):
    def __init__(self,
                 save_steps=None,
                 save_secs=None,
                 output_dir=""):
        self._output_tag = "step-{}"
        self._output_dir = output_dir
        self._timer = SecondOrStepTimer(
            every_secs=save_secs, every_steps=save_steps)

    def begin(self):
        self._next_step = None
        self._global_step_tensor = training_util.get_global_step()
        self._writer = tf.summary.FileWriter(self._output_dir, tf.get_default_graph())
        if self._global_step_tensor is None:
            raise RuntimeError("Global step should be created to use ProfilerHook.")

    def before_run(self, run_context):
        self._request_summary = (
            self._next_step is None or
            self._timer.should_trigger_for_step(self._next_step)
        )
        requests = {"global_step": self._global_step_tensor}
        opts = (tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
                if self._request_summary else None)
        return SessionRunArgs(requests, options=opts)

    def after_run(self, run_context, run_values):
        stale_global_step = run_values.results["global_step"]
        global_step = stale_global_step + 1
        if self._request_summary:
            global_step = run_context.session.run(self._global_step_tensor)
            self._writer.add_run_metadata(
                run_values.run_metadata, self._output_tag.format(global_step))
            self._writer.flush()
        self._next_step = global_step + 1

    def end(self, session):
        self._writer.close()
To use it, one creates the estimator instance (my_estimator) as usual, whether it is a pre-made or a custom estimator, and passes an instance of the class above as a hook to the desired operation. For example:
hook = MetadataHook(save_steps=1, output_dir=<model dir>)
my_estimator.train(train_input_fn, hooks=[hook])
The run metadata will be placed in the model dir and can be inspected by TensorBoard.
You may use tf.train.ProfilerHook. However, the catch is that it was released in 1.14.
Example usage:
estimator = tf.estimator.LinearClassifier(...)
hooks = [tf.train.ProfilerHook(output_dir=model_dir, save_secs=600, show_memory=False)]
estimator.train(input_fn=train_input_fn, hooks=hooks)
Executing the hook will generate files named timeline-xx.json in output_dir.
Then open chrome://tracing/ in the Chrome browser and load one of the files; you will get a time-usage timeline.
