Keras fit_generator is very slow. The GPU is not used constantly during training; sometimes its usage drops to 0%, even with 4 workers and use_multiprocessing=True.
Also, the script's processes request a huge amount of virtual memory and sit in D status (uninterruptible sleep, usually I/O).
I have already tried different values of max_queue_size, but it didn't help.
[Screenshot: GPU usage]
[Screenshot: process virtual memory and status]
Hardware Info
GPU: Titan Xp (12 GB)
Code of Data Generator Class
import numpy as np
import keras
import conf
class DataGenerator(keras.utils.Sequence):
'Generates data for Keras'
def __init__(self, list_IDs, labels, batch_size=32, dim=(conf.max_file, 128),
n_classes=10, shuffle=True):
'Initialization'
self.dim = dim
self.batch_size = batch_size
self.labels = labels
self.list_IDs = list_IDs
self.n_classes = n_classes
self.shuffle = shuffle
self.on_epoch_end()
def __len__(self):
'Denotes the number of batches per epoch'
return int(np.floor(len(self.list_IDs) / self.batch_size))
def __getitem__(self, index):
'Generate one batch of data'
# Generate indexes of the batch
indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
# Find list of IDs
list_IDs_temp = [self.list_IDs[k] for k in indexes]
# Generate data
X, y = self.__data_generation(list_IDs_temp)
return X, y
def on_epoch_end(self):
'Updates indexes after each epoch'
self.indexes = np.arange(len(self.list_IDs))
if self.shuffle == True:
np.random.shuffle(self.indexes)
def __data_generation(self, list_IDs_temp):
'Generates data containing batch_size samples' # X : (n_samples, *dim, n_channels)
# Initialization
X = np.empty((self.batch_size, *self.dim))
y = np.empty((self.batch_size, conf.max_file, self.n_classes))
# Generate data
for i, ID in enumerate(list_IDs_temp):
# Store sample
X[i, ] = np.load(conf.dir_out_data+"data_by_file/" + ID)
# Store class
y[i, ] = np.load(conf.dir_out_data +
'data_by_file/' + self.labels[ID])
return X, y
Code of the Python script
training_generator = DataGenerator(partition['train'], labels, **params)
validation_generator = DataGenerator(partition['validation'], labels, **params)
model.fit_generator(generator = training_generator,
validation_data = validation_generator,
epochs=steps,
callbacks=[tensorboard, checkpoint],
workers=4,
use_multiprocessing=True,
max_queue_size=50)
If you are using Tensorflow 2.0, you might be hitting this bug: https://github.com/tensorflow/tensorflow/issues/33024
Workarounds are:
Call tf.compat.v1.disable_eager_execution() at the start of the code
Use model.fit rather than model.fit_generator. The former supports generators anyway (a short sketch follows this list).
Downgrade to TF 1.14
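A minimal sketch of the first two workarounds, reusing the generator and callback names from the question:
import tensorflow as tf

# Workaround 1: fall back to graph mode before building the model
tf.compat.v1.disable_eager_execution()

# Workaround 2: model.fit accepts a keras.utils.Sequence directly
model.fit(training_generator,
          validation_data=validation_generator,
          epochs=steps,
          callbacks=[tensorboard, checkpoint],
          workers=4,
          use_multiprocessing=True,
          max_queue_size=50)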
Regardless of Tensorflow version, these principles apply:
Limit how much disk access you are doing; this is often a bottleneck.
Check the batch size in both training and validation. A batch size of 1 in validation is going to be very slow.
There does seem to be an issue, though, with generators being slow in 1.13.2 and 2.0.1 (at least): https://github.com/keras-team/keras/issues/12683
Related
I don't have access to any GPUs, but I want to speed up the training of my PyTorch model by using more than 1 CPU. I will use the most basic model as an example here.
All I want is for this code to run on multiple CPUs instead of just 1 (the Dataset and Network classes are in the Appendix).
import pandas as pd
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
df = pd.read_pickle('path/to/data')
X = df.drop(columns=['target'])
y = df[['target']]
train_data = CustomDataset(X, y)
train_loader = DataLoader(
dataset=train_data,
batch_size=64
)
model = Network(X.shape[-1])
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())
epochs = 10
for e in range(1, epochs + 1):
train_loss = .0
model.train()
for batch_id, (data, labels) in enumerate(train_loader):
optimizer.zero_grad()
target = model(data)
loss = criterion(target, labels)
loss.backward()
optimizer.step()
train_loss += loss.item()
print('Epoch {}:\tTrain Loss: {:.4f}'.format(
e,
train_loss / len(train_loader)
))
Right now, this code uses only 1 CPU at 100% during training. I want it to use 4 CPUs at 100% in the training process. There is so much conflicting information out there on how best to do this, and none of it really works. I have tried different approaches, using model = torch.nn.parallel.DistributedDataParallel(model) for example, but none of them worked.
Does someone know how to get the most performance out of this code using 4 CPUs? Thanks in advance!
APPENDIX
class CustomDataset(Dataset):
def __init__(
self,
X,
y,
):
self.X = torch.Tensor(X.values)
self.y = torch.Tensor(y.values)
def __getitem__(
self,
index
):
return self.X[index], self.y[index]
def __len__(self):
return len(self.X)
class Network(nn.Module):
def __init__(
self,
input_size
):
super(Network, self).__init__()
self.input_size = input_size
self.linear_1 = nn.Linear(self.input_size, 32)
self.linear_2 = nn.Linear(32, 1)
def forward(self, data):
output = self.linear_1(data)
output = self.linear_2(output)
return output
Torchrun (included with PyTorch) makes this surprisingly easy.
How you want the CPUs to work together is not clear from your question, but I am assuming (because you refer to DistributedDataParallel) that you would like to distribute the data across multiple cores, each of which does its own forward and backward passes and synchronizes its gradients with the other processes.
First, see if torch.distributed is available: torch.distributed.is_available().
Torchrun requires your script to have a few tweaks.
To initialize a process group, include
import torch.distributed as dist
dist.init_process_group(backend="gloo")
The backend must be gloo for CPUs.
Torchrun sets the environment variables MASTER_PORT, MASTER_ADDR, WORLD_SIZE, and RANK, which are required for torch.distributed.
Then, to be able to share the gradients across cores, wrap the model (device_ids and output_device apply to single-GPU-per-process training; for a CPU-only model, leave them at their default of None):
local_rank = int(os.environ["LOCAL_RANK"])
model = torch.nn.parallel.DistributedDataParallel(
model,
device_ids=[local_rank],
output_device=local_rank,
)
Every core still gets to see the same data.
This can be made more efficient with a DistributedSampler:
train_sampler = DistributedSampler(train_data)
train_loader = DataLoader(
...
train_data,
shuffle=False, # train_sampler will shuffle for you.
sampler=train_sampler,
)
for e in range(1, epochs + 1):
train_sampler.set_epoch(e)
train(train_loader)
where train(train_loader) does the training steps. Note the train_sampler.set_epoch(e).
This makes sure that every epoch the data is distributed to processes differently.
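Putting these pieces together, a minimal sketch of the adapted training script (it assumes the CustomDataset and Network classes from the appendix are defined in the same file; batch size, epochs, loss, and optimizer are taken from the original code, the rank-0 print is my addition):
# Intended to be launched with torchrun, e.g.:
#   torchrun --standalone --nnodes=1 --nproc_per_node=4 YOUR_TRAINING_SCRIPT.py
import pandas as pd
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group(backend="gloo")   # gloo backend for CPU-only training

df = pd.read_pickle('path/to/data')
X = df.drop(columns=['target'])
y = df[['target']]

train_data = CustomDataset(X, y)
train_sampler = DistributedSampler(train_data)       # shards the data per process
train_loader = DataLoader(dataset=train_data,
                          batch_size=64,
                          shuffle=False,              # the sampler shuffles instead
                          sampler=train_sampler)

# device_ids is left at None because the model stays on the CPU
model = DistributedDataParallel(Network(X.shape[-1]))
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())

epochs = 10
for e in range(1, epochs + 1):
    train_sampler.set_epoch(e)        # reshuffle the shards every epoch
    model.train()
    train_loss = .0
    for data, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(data), labels)
        loss.backward()               # DDP synchronizes gradients here
        optimizer.step()
        train_loss += loss.item()
    if dist.get_rank() == 0:          # only the first process prints
        print('Epoch {}:\tTrain Loss: {:.4f}'.format(e, train_loss / len(train_loader)))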
To start the training, run
torchrun --standalone --nnodes=1 --nproc_per_node=$NUM_TRAINERS YOUR_TRAINING_SCRIPT.py
where $NUM_TRAINERS is the number of processes that work on improving your model.
It is recommended, though not necessary, to include
from torch.distributed.elastic.multiprocessing.errors import record
@record
def main():
    # do train
    pass
if __name__ == "__main__":
main()
to see which process throws what error.
If I misunderstood your question and you only want more workers to load the data, then pass num_workers=N to DataLoader.
This ensures more workers (other than the main process) push the data into RAM while the current batch is being processed.
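A minimal sketch of that simpler variant, reusing the DataLoader from the question:
train_loader = DataLoader(
    dataset=train_data,
    batch_size=64,
    num_workers=4,      # four background worker processes prefetch batches
)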
I am trying to segment medical images using a version of U-Net implemented with Keras. The inputs of my network are 3D images and the outputs are two one-hot-encoded 3D segmentation maps. I know that my dataset is very imbalanced (there is not so much to segment) and therefore I want to use class weights for my loss function (currently binary_crossentropy). With the class weights, I hope the model will give more attention to the small stuff it has to segment.
If you know the imbalance of your dataset, you can pass the class_weight parameter to model.fit(). Does this also work for my use case?
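(For reference, a minimal sketch of that standard class_weight usage; the weights and the x_train/y_train names are placeholders:)
# Hypothetical: make errors on class 1 count ten times as much as class 0
class_weight = {0: 1.0, 1: 10.0}
model.fit(x_train, y_train, epochs=10, class_weight=class_weight)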
With the help of the above-mentioned GitHub issue I managed to solve the problem for my particular use case, and I want to share the solution with you anyway. An extra hurdle was the fact that I am using a custom generator for my data. A simplified version of this class is the following code:
import numpy as np
import keras
class DataGenerator(keras.utils.Sequence):
'Generates data for Keras'
def __init__(self, list_IDs, batch_size=2, dim=(144,144,144), n_classes=2):
'Initialization'
self.dim = dim
self.batch_size = batch_size
self.list_IDs = list_IDs
self.n_classes = n_classes
self.on_epoch_end()
def __len__(self):
'Denotes the number of batches per epoch'
return int(np.floor(len(self.list_IDs) / self.batch_size))
def __getitem__(self, index):
'Generate one batch of data'
# Generate indexes of the batch
indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
# Find list of IDs
list_IDs_temp = [self.list_IDs[k] for k in indexes]
# Generate data
x, y = self.__data_generation(list_IDs_temp)
return x, y
def __data_generation(self, list_IDs_temp):
'Generates data containing batch_size samples' # X : (n_samples, *dim, 1)
# Initialization
x = np.empty((self.batch_size, *self.dim, 1))
y = np.empty((self.batch_size, *self.dim, 1))
# Generate data
for i, ID in enumerate(list_IDs_temp):
# Load dataset
data = np.load('data/' + ID + '.npy')
# Store x and y
x[i,] = data[:, :, :, 0] # Image
y[i,] = data[:, :, :, 1] # Mask
# One-hot-encoding
y = keras.utils.to_categorical(y, num_classes=self.n_classes)
return x, y
Actually, a few lines of code did the trick: an extra class_weights argument to my generator, a line in the __getitem__() method that converts the class weights to sample weights for each individual batch, and returning those sample weights from the same method. The class weights are passed in as a list with the following structure: class_weights = [weight_class_0, weight_class_1]. My basic generator class now looks like this (I have marked the changes with a comment):
import numpy as np
import keras
class DataGenerator(keras.utils.Sequence):
'Generates data for Keras'
def __init__(self, list_IDs, class_weights, batch_size=2, dim=(144,144,144),
n_classes=2):
'Initialization'
self.dim = dim
self.batch_size = batch_size
self.list_IDs = list_IDs
self.n_classes = n_classes
self.class_weights = class_weights # CLASS WEIGHTS FIX
self.on_epoch_end()
def __len__(self):
'Denotes the number of batches per epoch'
return int(np.floor(len(self.list_IDs) / self.batch_size))
def __getitem__(self, index):
'Generate one batch of data'
# Generate indexes of the batch
indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
# Find list of IDs
list_IDs_temp = [self.list_IDs[k] for k in indexes]
# Generate data
x, y = self.__data_generation(list_IDs_temp)
# Compute sample weights CLASS WEIGHTS FIX
sample_weights = np.take(np.array(self.class_weights), np.round(y[:, :, :, :, 1]).astype('int'))
return x, y, sample_weights # CLASS WEIGHTS FIX
def __data_generation(self, list_IDs_temp):
'Generates data containing batch_size samples' # X : (n_samples, *dim, 1)
# Initialization
x = np.empty((self.batch_size, *self.dim, 1))
y = np.empty((self.batch_size, *self.dim, 1))
# Generate data
for i, ID in enumerate(list_IDs_temp):
# Load dataset
data = np.load('data/' + ID + '.npy')
# Store x and y
x[i,] = data[:, :, :, 0] # Image
y[i,] = data[:, :, :, 1] # Mask
# One-hot-encoding
y = keras.utils.to_categorical(y, num_classes=self.n_classes)
return x, y
It might seem a bit like a magic one-liner, but what sample_weights = np.take(np.array(self.class_weights), np.round(y[:, :, :, :, 1]).astype('int')) does is the following: it takes the y-values belonging to the less common class, in my case the one to segment, and gives each pixel in this 3D image a sample weight. This sample weight is either the class weight for the common class or the one for the uncommon class, depending on which class the pixel belongs to.
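A tiny worked example of that np.take() lookup, with made-up weights and a 2x2 mask instead of a 3D volume:
import numpy as np

class_weights = [0.1, 0.9]            # [weight_class_0, weight_class_1]
mask = np.array([[0, 1],
                 [1, 0]])             # 1 marks the rare class to segment
sample_weights = np.take(np.array(class_weights), mask)
print(sample_weights)
# [[0.1 0.9]
#  [0.9 0.1]]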
The output of this generator class can then be used in the model.fit() method of the Keras model, as long as sample_weight_mode="temporal" is passed to model.compile().
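A minimal sketch of those two calls; the loss comes from the question, everything else (optimizer, epoch count, generator names) is an assumption:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              sample_weight_mode='temporal')   # enables per-element sample weights

model.fit(training_generator,
          validation_data=validation_generator,
          epochs=50)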
I am training a neural network on Google Colab GPU. Therefore, I synchronized the input images (180k in total, 105k for training, 76k for validation) with my Google Drive. Then I mount the Google Drive and go from there.
I load a csv-file with image paths and labels in Google Colab and store it as pandas dataframe.
After that I use a list of image paths and labels.
I use this function to one-hot encode my labels, because I need a special output shape of (7, 35) per label, which the existing default functions cannot produce:
# One-hot encoding of the labels; the target array has a shape of (7, 35)
from numpy import argmax
# define input string
def my_onehot_encoded(label):
# define universe of possible input values
characters = '0123456789ABCDEFGHIJKLMNPQRSTUVWXYZ'
# define a mapping of chars to integers
char_to_int = dict((c, i) for i, c in enumerate(characters))
int_to_char = dict((i, c) for i, c in enumerate(characters))
# integer encode input data
integer_encoded = [char_to_int[char] for char in label]
# one hot encode
onehot_encoded = list()
for value in integer_encoded:
character = [0 for _ in range(len(characters))]
character[value] = 1
onehot_encoded.append(character)
return onehot_encoded
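For example, for a (hypothetical) 7-character label the function returns a 7x35 nested list:
label = '0AB12CD'                         # made-up 7-character label
encoded = my_onehot_encoded(label)
print(len(encoded), len(encoded[0]))      # 7 35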
After that I use a customized DataGenerator to get the data in batches into my model. x_set is a list of image paths to my images and y_set are the onehot-encoded labels:
class DataGenerator(Sequence):
def __init__(self, x_set, y_set, batch_size):
self.x, self.y = x_set, y_set
self.batch_size = batch_size
def __len__(self):
return math.ceil(len(self.x) / self.batch_size)
def __getitem__(self, idx):
batch_x = self.x[idx*self.batch_size : (idx + 1)*self.batch_size]
batch_x = np.array([resize(imread(file_name), (224, 224)) for file_name in batch_x])
batch_x = batch_x * 1./255
batch_y = self.y[idx*self.batch_size : (idx + 1)*self.batch_size]
batch_y = np.array(batch_y)
return batch_x, batch_y
And with this code I apply the DataGenerator to my data:
training_generator = DataGenerator(X_train, y_train, batch_size=32)
validation_generator = DataGenerator(X_val, y_val, batch_size=32)
When I now train my model, one epoch takes 25-40 minutes, which is very long.
model.fit_generator(generator=training_generator,
validation_data=validation_generator,
steps_per_epoch = num_train_samples // 16,
validation_steps = num_val_samples // 16,
epochs = 10, workers=6, use_multiprocessing=True)
I was now wondering how to measure the preprocessing time, because I don't think the slowness is due to the model size: I already experimented with models with fewer parameters, but the training time did not decrease significantly. So I am suspicious of the preprocessing...
To measure time in Colab, you can use this autotime package:
!pip install ipython-autotime
%load_ext autotime
Additionally, for profiling, you can use %time as mentioned here.
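For example, to time how long a single batch takes to come out of the generator (index 0 is arbitrary):
%time batch_x, batch_y = training_generator[0]   # loads and resizes one batch of images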
In general, to make the generator run faster, I suggest you copy the data from Google Drive to the local disk of the Colab instance; otherwise it can get slow.
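A sketch of that copy step; the Drive and destination paths are placeholders:
!mkdir -p /content/data
!cp -r "/content/drive/My Drive/images" /content/data/
The image paths used by the generator then have to point at the local copy.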
If you are using Tensorflow 2.0, the cause could be this bug.
Workarounds are:
Call tf.compat.v1.disable_eager_execution() at the start of the code
Use model.fit rather than model.fit_generator. The former supports generators anyway.
Downgrade to TF 1.14
Regardless of Tensorflow version, limit how much disk access you are doing; this is often a bottleneck.
Note that there does seem to be an issue with generators being slow in TF
1.13.2 and 2.0.1 (at least).
I am trying to train a network (an LRCN, i.e. a CNN followed by an LSTM) using TensorFlow, like so:
model = Sequential()
...
# my model
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model.fit_generator(generator=training_generator,
validation_data=validation_generator,
use_multiprocessing=True,
workers=6)
I am following this link to create the generator class. It looks like this:
class DataGenerator(tf.keras.utils.Sequence):
# 'Generates data for Keras'
def __init__(self, list_ids, labels, batch_size = 8, dim = (15, 16, 3200), n_channels = 1,
n_classes = 3, shuffle = True):
# 'Initialization'
self.dim = dim
self.batch_size = batch_size
self.labels = labels
self.list_IDs = list_ids
self.n_channels = n_channels
self.n_classes = n_classes
self.shuffle = shuffle
self.on_epoch_end()
def __len__(self):
# 'Denotes the number of batches per epoch'
return int(np.floor(len(self.list_IDs) / self.batch_size))
def __getitem__(self, index):
# 'Generate one batch of data'
# Generate indexes of the batch
indexes = self.indexes[index * self.batch_size:(index + 1) * self.batch_size]
# Find list of IDs
list_ids_temp = [self.list_IDs[k] for k in indexes]
# Generate data
X, y = self.__data_generation(list_ids_temp)
return X, y
def on_epoch_end(self):
# Updates indexes after each epoch'
self.indexes = np.arange(len(self.list_IDs))
if self.shuffle:
np.random.shuffle(self.indexes)
def __data_generation(self, list_ids_temp):
# 'Generates data containing batch_size samples' # X : (n_samples, *dim, n_channels)
# Initialization
X = np.empty((self.batch_size, *self.dim, self.n_channels))
y = np.empty(self.batch_size, dtype = int)
sequences = np.empty((15, 16, 3200, self.n_channels))
# Generate data
for i, ID in enumerate(list_ids_temp):
with h5py.File(ID) as file:
_data = list(file['decimated_data'])
_npData = np.array(_data)
_allSequences = np.transpose(_npData)
# a 16 x 48000 matrix is split into 15 sequences of size 16x3200
for sq in range(15):
sequences[sq, :, :, :] = np.reshape(_allSequences[0:16, i:i + 3200], (16, 3200, 1))
# Store sample
X[i, ] = sequences
# Store class
y[i] = self.labels[ID]
return X, tf.keras.utils.to_categorical(y, num_classes = self.n_classes)
This works fine and the code runs; however, I notice that GPU usage remains at 0%. When I set log_device_placement to true, it shows the operations being assigned to the GPU, but when I monitor the GPU using Task Manager or nvidia-smi, I see no activity.
But when I don't use the DataGenerator class and just call model.fit() on data generated like the example shown below, I notice that the program does use the GPU.
_data, _label = {}, {}
data = np.random.random((550, num_seq, rows, cols, ch))
label = np.random.random((num_of_samples, 1))
_data['train'] = data[0:500, :]
_label['train'] = label[0:500, :]
_data['valid'] = data[500:, :]
_label['valid'] = label[500:, :]
model.fit(_data['train'],
          _label['train'],
          epochs=FLAGS.epochs,
          batch_size=FLAGS.batch_size,
          validation_data=(_data['valid'], _label['valid']),
          shuffle=True,
          callbacks=[tb, early_stopper, checkpoint])
So I'm guessing it can't be that my NVIDIA drivers or TensorFlow were installed incorrectly: the message I get when I run both scripts indicates that TF can recognize my GPU. That leads me to believe there is something wrong with my DataGenerator class and/or fit_generator().
Can anyone help me point out what I am doing wrong?
I am using TensorFlow 1.10 and CUDA 9 on a Windows 10 machine with a GTX 1050 Ti.
Problem
I am trying to build a multi-input model in Keras using two inputs, image and text. I am using the flow_from_dataframe method, passing it a pandas dataframe containing the image names as well as the respective text (as a vectorized feature representation) for each image and the target label/class. As such, the dataframe looks as follows:
ID path text-features label
111 'cat001.jpg' [0.0, 1.0, 0.0,...] cat
112 'dog001.jpg' [1.0, 0.0, 1.0,...] dog
113 'bunny001.jpg' [0.0, 1.0, 1.0,...] bunny
...
After constructing my model using the Keras functional API, I feed both inputs into the model like so:
model = Model(inputs=[images, text], outputs=output)
For the images I use an ImageDataGenerator as suggested in the docs (https://keras.io/preprocessing/image/#flow_from_dataframe) :
datagen=ImageDataGenerator(rescale=1./255,validation_split=0.15)
train_generator=datagen.flow_from_dataframe(dataframe=df, directory=data_dir, x_col="path", y_col="label", has_ext=True, class_mode="categorical", target_size=(224,224), batch_size=batch_size, subset="training")
validation_generator=datagen.flow_from_dataframe(dataframe=df, directory=data_dir, x_col="path", y_col="label", has_ext=True, class_mode="categorical", target_size=(224,224), batch_size=batch_size, subset="validation")
So far so good, but now I am stuck on how to feed the text-features within my dataframe to the model as well during training.
Question
How can I modify the flow_from_dataframe generator in order to handle the text-feature data in the dataframe as well as the images during training? Also, since I can't find any example of this sort of modification to flow_from_dataframe, I am wondering if I am approaching this problem wrong, i.e. is there a better method of achieving this?
UPDATE
Meanwhile I've been trying to write my own generator following the guide I found here (https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly) and adjusting it to my needs. This is what I came up with:
from matplotlib.image import imread
class DataGenerator(keras.utils.Sequence):
def __init__(self, list_IDs, labels, batch_size=32, dim=(32,32,32), n_channels=1,
n_classes=10, shuffle=True):
#'Initialization'
self.dim = dim
self.batch_size = batch_size
self.labels = labels
self.list_IDs = list_IDs
self.n_channels = n_channels
self.n_classes = n_classes
self.shuffle = shuffle
self.on_epoch_end()
def on_epoch_end(self):
#'Updates indexes after each epoch'
self.indexes = np.arange(len(self.list_IDs))
if self.shuffle == True:
np.random.shuffle(self.indexes)
# method for producing batches of data.
# takes as argument the list of IDs of the target batch
def __data_generation(self, list_IDs_temp):
#'Generates data containing batch_size samples' # X : (n_samples, *dim, n_channels)
# Initialization
X = np.empty((self.batch_size, *self.dim, self.n_channels))
Xtext = np.empty((self.batch_size, 7576))
y = np.empty((self.batch_size), dtype=int)
# Generate data
for i, ID in enumerate(list_IDs_temp):
# Store sample
X[i,] = imread('C:/Users/aaron/Desktop/training/'+str(ID)) # <--- all files are in the same DIR
Xtext[i,] = np.array(total_data[df.path== str(ID)]["text-features"].values) # <--- I look-up the text-features by using the ID as a filter with the path column. This line throws the error.
# Store class
y[i] = self.labels[ID]
return X, Xtext, keras.utils.to_categorical(y, num_classes=self.n_classes)
def __len__(self):
#'Denotes the number of batches per epoch'
return int(np.floor(len(self.list_IDs) / self.batch_size))
# Now, when the batch corresponding to a given index is called,
# the generator executes the __getitem__ method to generate it.
def __getitem__(self, index):
#'Generate one batch of data'
# Generate indexes of the batch
indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
# Find list of IDs
list_IDs_temp = [self.list_IDs[k] for k in indexes]
# Generate data
X,Xtext, y = self.__data_generation(list_IDs_temp)
return X,Xtext, y
And I initialize the generator as follows:
partition = {}
partition['train'] = X_train.path.values
partition['validation'] = X_test.path.values
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
encoded_labels = le.fit_transform(df.label)
labels = pd.Series(encoded_labels,index=df.path).to_dict()
# Parameters
params = {'dim': (224,224),
'batch_size': 64,
'n_classes': 5,
'n_channels': 3,
'shuffle': True}
# Generators
training_generator = DataGenerator(partition['train'], labels, **params)
validation_generator = DataGenerator(partition['validation'], labels, **params)
Using this generator however throws me an error:
ValueError: setting an array element with a sequence.
which is caused by the line Xtext[i,] = np.array(total_data[df.path == str(ID)]["text-features"].values) in my code above. Any suggestions on how to solve this?