Problem
I am trying to build a multi-input model in keras using two inputs, image and text. I am using the flow_from_dataframe method, passing it a pandas dataframe containing the image-names as well as the respective text (as a vectorized feature-represenation) for each image and the target label/class. As such, the dataframe looks as follows:
ID path text-features label
111 'cat001.jpg' [0.0, 1.0, 0.0,...] cat
112 'dog001.jpg' [1.0, 0.0, 1.0,...] dog
113 'bunny001.jpg' [0.0, 1.0, 1.0,...] bunny
...
After constructing my model using the Keras functional API, I feed both inputs into the model like so:
model = Model(inputs=[images, text], outputs=output)
For the images I use an ImageDataGenerator as suggested in the docs (https://keras.io/preprocessing/image/#flow_from_dataframe) :
datagen=ImageDataGenerator(rescale=1./255,validation_split=0.15)
train_generator=datagen.flow_from_dataframe(dataframe=df, directory=data_dir, x_col=path, y_col="label", has_ext=True, class_mode="categorical", target_size=(224,224), batch_size=batch_size,subset="training")
validation_generator=datagen.flow_from_dataframe(dataframe=df, directory=data_dir, x_col=path, y_col="label", has_ext=True, class_mode="categorical", target_size=(224,224), batch_size=batch_size,subset="validation")
So far so good, but now I am stuck on how to feed the text-features within my dataframe to the model as well during training.
Question
How can I modify the flow_from_dataframe generator in order to handle the text-feature data in the dataframe as well as the images during training? Also, since I can't find any example of this sort of modification on flow_from_dataframe I am wondering if I am approaching this problem wrong i.e. is there any better method of achieving this?
UPDATE
Meanwhile I've been trying to write my own generator following the guide I found here (https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly) and adjusting it to my needs. This is what I came up with:
from matplotlib.image import imread
class DataGenerator(keras.utils.Sequence):
def __init__(self, list_IDs, labels, batch_size=32, dim=(32,32,32), n_channels=1,
n_classes=10, shuffle=True):
#'Initialization'
self.dim = dim
self.batch_size = batch_size
self.labels = labels
self.list_IDs = list_IDs
self.n_channels = n_channels
self.n_classes = n_classes
self.shuffle = shuffle
self.on_epoch_end()
def on_epoch_end(self):
#'Updates indexes after each epoch'
self.indexes = np.arange(len(self.list_IDs))
if self.shuffle == True:
np.random.shuffle(self.indexes)
# method for producing batches of data.
# takes as argument the list of IDs of the target batch
def __data_generation(self, list_IDs_temp):
#'Generates data containing batch_size samples' # X : (n_samples, *dim, n_channels)
# Initialization
X = np.empty((self.batch_size, *self.dim, self.n_channels))
Xtext = np.empty((self.batch_size, 7576))
y = np.empty((self.batch_size), dtype=int)
# Generate data
for i, ID in enumerate(list_IDs_temp):
# Store sample
X[i,] = imread('C:/Users/aaron/Desktop/training/'+str(ID)) # <--- all files are in the same DIR
Xtext[i,] = np.array(total_data[df.path== str(ID)]["text-features"].values) # <--- I look-up the text-features by using the ID as a filter with the path column. This line throws the error.
# Store class
y[i] = self.labels[ID]
return X, Xtext, keras.utils.to_categorical(y, num_classes=self.n_classes)
def __len__(self):
#'Denotes the number of batches per epoch'
return int(np.floor(len(self.list_IDs) / self.batch_size))
# Now, when the batch corresponding to a given index is called,
# the generator executes the __getitem__ method to generate it.
def __getitem__(self, index):
#'Generate one batch of data'
# Generate indexes of the batch
indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
# Find list of IDs
list_IDs_temp = [self.list_IDs[k] for k in indexes]
# Generate data
X,Xtext, y = self.__data_generation(list_IDs_temp)
return X,Xtext, y
And I initialize the generator as follows:
partition = {}
partition['train'] = X_train.path.values
partition['validation'] = X_test.path.values
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
encoded_labels = le.fit_transform(df.label)
labels = pd.Series(encoded_labels,index=df.path).to_dict()
# Parameters
params = {'dim': (224,224),
'batch_size': 64,
'n_classes': 5,
'n_channels': 3,
'shuffle': True}
# Generators
training_generator = DataGenerator(partition['train'], labels, **params)
validation_generator = DataGenerator(partition['validation'], labels, **params)
Using this generator however throws me an error:
ValueError: setting an array element with a sequence.
caused by the line X_text[i,] = np.array(total_data[total_data.bust == str(ID)].text.values) in my code above. Any suggestion on how to solve this?
Related
I have a text dataset that I want to use for a GAN and it should turn to onehotencode and this is how I Creating a Custom Dataset for my files
class Dataset2(torch.utils.data.Dataset):
def __init__(self, list_, labels):
'Initialization'
self.labels = labels
self.list_IDs = list_
def __len__(self):
'Denotes the total number of samples'
return len(self.list_IDs)
def __getitem__(self, index):
'Generates one sample of data'
# Select sample
mylist = self.list_IDs[index]
# Load data and get label
X = F.one_hot(mylist, num_classes=len(alphabet))
y = self.labels[index]
return X, y
It is working well and every time I call it, it works just fine but the problem is when I use DataLoader and try to use it, its shape is not the same as it just came out of the dataset, this is the shape that came out of the dataset
x , _ = dataset[1]
x.shape
torch.Size([1274, 22])
and this is the shape that came out dataloader
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)
one = []
for epoch in range(epochs):
for i, (real_data, _) in enumerate(dataloader):
one.append(real_data)
one[3].shape
torch.Size([4, 1274, 22])
this 4 is number of samples in my data but it should not be there, how can I fix this problem?
You confirmed you only had four elements in your dataset. You have wrapped your dataset with a data loader with batch_size=64 which is greater than 4. This means the dataloader will only output a single batch containing 4 elements.
In turn, this means you only append a single element per epoch, and one[3].shape is a batch (the only batch of the data loader), shaped (4, 1274, 22).
I am trying to segment medical images using a version of U-Net implemented with Keras. The inputs of my network are 3D images and the outputs are two one-hot-encoded 3D segmentation maps. I know that my dataset is very imbalanced (there is not so much to segment) and therefore I want to use class weights for my loss function (currently binary_crossentropy). With the class weights, I hope the model will give more attention to the small stuff it has to segment.
If you know the imbalance of your database, you can pass the parameter class_weight to model.fit(). Does this also work with my use case?
With the help of the above mentioned github issue I managed to solve the problem for my particular use case. I want to share the solution with you anyway. An extra hurdle was the fact I am using a custom generator for my data. A simplified version of this class is the following code:
import numpy as np
import keras
class DataGenerator(keras.utils.Sequence):
'Generates data for Keras'
def __init__(self, list_IDs, batch_size=2, dim=(144,144,144), n_classes=2):
'Initialization'
self.dim = dim
self.batch_size = batch_size
self.list_IDs = list_IDs
self.n_classes = n_classes
self.on_epoch_end()
def __len__(self):
'Denotes the number of batches per epoch'
return int(np.floor(len(self.list_IDs) / self.batch_size))
def __getitem__(self, index):
'Generate one batch of data'
# Generate indexes of the batch
indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
# Find list of IDs
list_IDs_temp = [self.list_IDs[k] for k in indexes]
# Generate data
x, y = self.__data_generation(list_IDs_temp)
return x, y
def __data_generation(self, list_IDs_temp):
'Generates data containing batch_size samples' # X : (n_samples, *dim, 1)
# Initialization
x = np.empty((self.batch_size, *self.dim, 1))
y = np.empty((self.batch_size, *self.dim, 1))
# Generate data
for i, ID in enumerate(list_IDs_temp):
# Load dataset
data = np.load('data/' + ID + '.npy')
# Store x and y
x[i,] = data[:, :, :, 0] # Image
y[i,] = data[:, :, :, 1] # Mask
# One-hot-encoding
y = keras.utils.to_categorical(y, num_classes=self.n_classes)
return x, y
Actually a few lines of code did the trick. With an extra input argument class_weights to my generator, a line to convert the class weights to sample weights for each individual batch in the __getitem__() method, and also a return of the sample weights in the same method, I solved the issue. The class weights are inputted as list with the following structure: class_weights = [weight_class_0, weight_class_1]. My basic generator class now looks like this (I have marked changes with a comment):
import numpy as np
import keras
class DataGenerator(keras.utils.Sequence):
'Generates data for Keras'
def __init__(self, list_IDs, class_weights, batch_size=2, dim=(144,144,144),
n_classes=2):
'Initialization'
self.dim = dim
self.batch_size = batch_size
self.list_IDs = list_IDs
self.n_classes = n_classes
self.class_weights = class_weights # CLASS WEIGHTS FIX
self.on_epoch_end()
def __len__(self):
'Denotes the number of batches per epoch'
return int(np.floor(len(self.list_IDs) / self.batch_size))
def __getitem__(self, index):
'Generate one batch of data'
# Generate indexes of the batch
indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
# Find list of IDs
list_IDs_temp = [self.list_IDs[k] for k in indexes]
# Generate data
x, y = self.__data_generation(list_IDs_temp)
# Compute sample weights CLASS WEIGHTS FIX
sample_weights = np.take(np.array(self.class_weights), np.round(y[:, :, :, :, 1]).astype('int'))
return x, y, sample weights # CLASS WEIGHTS FIX
def __data_generation(self, list_IDs_temp):
'Generates data containing batch_size samples' # X : (n_samples, *dim, 1)
# Initialization
x = np.empty((self.batch_size, *self.dim, 1))
y = np.empty((self.batch_size, *self.dim, 1))
# Generate data
for i, ID in enumerate(list_IDs_temp):
# Load dataset
data = np.load('data/' + ID + '.npy')
# Store x and y
x[i,] = data[:, :, :, 0] # Image
y[i,] = data[:, :, :, 1] # Mask
# One-hot-encoding
y = keras.utils.to_categorical(y, num_classes=self.n_classes)
return x, y
It might seem a bit like a magic one-liner, but what sample_weights = np.take(np.array(self.class_weights), np.round(y[:, :, :, :, 1]).astype('int')) does is the following: It takes the y-values belonging to the not so common class, in my case the one to segment, and gives each pixel in this 3D image a sample weight. This sample weight is either the class weight for the common class or the uncommon class, depending on which class the pixel is belonging too.
The output of this generator class can be then used in the model.fit() method of the Keras model as long as sample_weight_mode="temporal" is passed to model.compile().
I am trying to train a network (a lrcn ie a CNN followed by LSTM) using TensoFlow like so:
model=Sequential();
..
.
.
# my model
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model.fit_generator(generator=training_generator,
validation_data=validation_generator,
use_multiprocessing=True,
workers=6)
I am following this link to create the generator class. It looks like this:
class DataGenerator(tf.keras.utils.Sequence):
# 'Generates data for Keras'
def __init__(self, list_ids, labels, batch_size = 8, dim = (15, 16, 3200), n_channels = 1,
n_classes = 3, shuffle = True):
# 'Initialization'
self.dim = dim
self.batch_size = batch_size
self.labels = labels
self.list_IDs = list_ids
self.n_channels = n_channels
self.n_classes = n_classes
self.shuffle = shuffle
self.on_epoch_end()
def __len__(self):
# 'Denotes the number of batches per epoch'
return int(np.floor(len(self.list_IDs) / self.batch_size))
def __getitem__(self, index):
# 'Generate one batch of data'
# Generate indexes of the batch
indexes = self.indexes[index * self.batch_size:(index + 1) * self.batch_size]
# Find list of IDs
list_ids_temp = [self.list_IDs[k] for k in indexes]
# Generate data
X, y = self.__data_generation(list_ids_temp)
return X, y
def on_epoch_end(self):
# Updates indexes after each epoch'
self.indexes = np.arange(len(self.list_IDs))
if self.shuffle:
np.random.shuffle(self.indexes)
def __data_generation(self, list_ids_temp):
# 'Generates data containing batch_size samples' # X : (n_samples, *dim, n_channels)
# Initialization
X = np.empty((self.batch_size, *self.dim, self.n_channels))
y = np.empty(self.batch_size, dtype = int)
sequences = np.empty((15, 16, 3200, self.n_channels))
# Generate data
for i, ID in enumerate(list_ids_temp):
with h5py.File(ID) as file:
_data = list(file['decimated_data'])
_npData = np.array(_data)
_allSequences = np.transpose(_npData)
# a 16 x 48000 matrix is split into 15 sequences of size 16x3200
for sq in range(15):
sequences[sq, :, :, :] = np.reshape(_allSequences[0:16, i:i + 3200], (16, 3200, 1))
# Store sample
X[i, ] = sequences
# Store class
y[i] = self.labels[ID]
return X, tf.keras.utils.to_categorical(y, num_classes = self.n_classes)
This works fine and the code runs, however, I notice that the GPU usage remains 0. When I set log_device_placement to true, it shows the operations being assigned to GPU. But when I monitor the GPU using the task manager or nvidia-smi, I see no activity.
But when I don't use the DataGenerator class and just use model.fit() using generated like this the one shown below, I notice that the program does use GPU.
data = np.random.random((550, num_seq, rows, cols, ch))
label = np.random.random((num_of_samples,1))
_data['train'] = data[0:500,:]
_label['train'] = label[0:500, :]
_data['valid'] = data[500:,:]
_label['valid']=label[500:,:]
model.fit(data['train'],
labels['train'],
epochs = FLAGS.epochs,
batch_size = FLAGS.batch_size,
validation_data = (data['valid'], labels['valid']),
shuffle = True,
callbacks = [tb, early_stopper, checkpoint])'
So I'm guessing it can't be because my NVIDIA drivers were installed wrong or TensorFlow was installed incorrectly and this is the message I get when I run both the codes, which indicates TF can recognize my GPU which leads me to believe there is something wrong with my DataGenerator class and/or the fit_generator()
Can anyone help me point out what I am doing wrong?
I am using TensorFlow 1.10 and cUDA 9 on a windows 10 machine with GTX 1050Ti.
I have followed the following link to learn to use a generator for keras model to fit_generator on.
https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly
One problem I have encountered is that, when I called the model.predict_generator() on some test data generator, length of the returned value is not the same as I have sent in the generator.
My test data is of length 229431, and I use a batch_size of 256, and when I define __len__ function in the generator class in the following way:
class DataGenerator(keras.utils.Sequence):
"""A simple generator"""
def __init__(self, list_IDs, labels, dim, dim_label, batch_size=512, shuffle=True, is_training=True):
"""Initialization"""
self.list_IDs = list_IDs
self.labels = labels
self.dim = dim
self.dim_label = dim_label
self.batch_size = batch_size
self.shuffle = shuffle
self.is_training = is_training
self.on_epoch_end()
def __len__(self):
"""Denotes the number of batches per epoch"""
return int(np.ceil(len(self.list_IDs) / self.batch_size))
def __getitem__(self, index):
"""Generate one batch of data"""
# Generate indexes of the batch
indexes = self.indexes[index * self.batch_size: (index + 1) * self.batch_size]
# Find list of IDs
list_IDs_temp = [self.list_IDs[k] for k in indexes]
list_labels_temp = [self.labels[k] for k in indexes]
# Generate data
result = self.__data_generation(list_IDs_temp, list_labels_temp, self.is_training)
if self.is_training:
X, y = result
return X, y
else:
# only return X when test
X = result
return X
def on_epoch_end(self):
"""Updates indexes after each epoch"""
self.indexes = np.arange(len(self.list_IDs))
if self.shuffle:
np.random.shuffle(self.indexes)
def __data_generation(self, list_IDs_temp, list_labels_temp, is_training):
"""Generates data containing batch_size samples"""
# Initialization
# X is a list of np.array
X = np.empty((self.batch_size, *self.dim))
if is_training:
# y could have multiple columns
y = np.empty((self.batch_size, *self.dim_label), dtype=int)
# Generate data
for i, (ID, label) in enumerate(zip(list_IDs_temp, list_labels_temp)):
# Store sample
X[i,] = np.load(ID)
if is_training:
# Store class
y[i,] = np.load(label)
if is_training:
return X, y
else:
return X
The returned length of my predicted value is 229632. Here is the code of predict:
test_generator = DataGenerator(partition, labels, is_training=False, **self.params)
predict_raw = self.model.predict_generator(generator=test_generator, workers=12, verbose=2)
I figured that 229632 / 256 = 897 which is the length of my generator, when I modify the __len__ method of DataGenerator to return int(np.ceil(len(self.list_IDs) / self.batch_size)), I get 229376 predicted values, 229376/256 = 896, which is the correct number of length.
But what I have passed to the generator is 229431 samples.
And I think in __getitem__ method, when running on the last batch, it should only get the less than 256 samples to test automatically. But appearently it is not the case, so how can I make sure the model predict the right number of samples?
For the last batch, the indexes calculated in the method __getitem__ don't have the correct size. To predict the right number of samples, the indexes should be defined as follow (see post ):
def __getitem__(self, index):
"""Generate one batch of data"""
idx_min = idx*self.batch_size
idx_max = min(idx_min + self.batch_size, len(self.list_IDs))
indexes = self.indexes[idx_min: idx_max]
...
I am using ImageDataGenerator to generate new augmented images and extract bottleneck features from pretrained model but most of the tutorial I see on keras
samples same no of training samples as number of images in directory.
train_generator = train_datagen.flow_from_directory(
train_path,
target_size=image_size,
shuffle = "false",
class_mode='categorical',
batch_size=1)
bottleneck_features_train = model.predict_generator(
train_generator, 2* nb_train_samples // batch_size)
Suppose I want 2 times more images from the above code, how I can get the desired class labels for the features extracted from bottleneck layer which are stored in tuple train_generator.
shouldnt the code in training_generator.py at line 422
x, _ = generator_output
do something like this
=> x, y = generator_output
and return tuple [np.concatenate(out) for out in all_outs],y from predict_generator
i.e return the corresponding class labels along with the predicted features all_outs since there is no way to get the corresponding labels without running generator twice.
If you're using predict, normally you simply don't want Y, because Y will be the result of the prediction. (You're not training, so you don't need the true labels)
But you can do it yourself:
bottleneck = []
labels = []
for i in range(2 * nb_train_samples // batch_size):
x, y = next(train_generator)
bottleneck.append(model.predict(x))
labels.append(y)
bottleneck = np.concatenate(bottleneck)
labels = np.concatenate(labels)
If you want it with indexing (if your generator supports that):
#...
for epoch in range(2):
for i in range(nb_train_samples // batch_size):
x,y = train_generator[i]
#...