I have trained a ResNet50 using Keras for classification. For testing, I used the ImageDataGenerator flow_from_directory() method to feed input to the model. Here's the code for that:
testdata_generator = keras.preprocessing.image.ImageDataGenerator(
    preprocessing_function=tf.keras.applications.resnet.preprocess_input
)
testgen = testdata_generator.flow_from_directory(
    './test',
    shuffle=False,
    target_size=(224, 224),
    color_mode='rgb',
    batch_size=32,
    class_mode=None
)
Found 18223 images belonging to 1 classes.
However, when I test the model on the test images, it fails to return predictions for a few of them.
pred = model.predict(
    testgen,
    batch_size=32,
    steps=testgen.n // testgen.batch_size
)
print(len(pred))
18208
Can anyone help?
You should try removing steps=testgen.n//testgen.batch_size. Integer division drops the remainder, so when the sample count is not a multiple of the batch size the computed step count skips the final partial batch: here 18223 // 32 = 569 steps, covering only 569 * 32 = 18208 images. If you omit steps, Keras runs the generator to exhaustion and predicts on every sample.
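For instance, a minimal sketch reusing testgen from above: either drop steps entirely, or round up with math.ceil so the final partial batch is included.

import math

# Option 1: omit steps; Keras exhausts the generator and predicts on all 18223 images.
pred = model.predict(testgen)

# Option 2: round the step count up instead of down (570 steps rather than 569).
steps = math.ceil(testgen.n / testgen.batch_size)
pred = model.predict(testgen, steps=steps)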
I have this code:
epochs = 50
batch_size = 5
validation_split = 0.2

datagen = tf.keras.preprocessing.image.ImageDataGenerator(validation_split=validation_split)

train_generator = datagen.flow(
    X_train_noisy, y_train_denoisy, batch_size=batch_size,
    subset='training'
)
val_generator = datagen.flow(
    X_train_noisy, y_train_denoisy, batch_size=batch_size,
    subset='validation'
)

history = model.fit(train_generator,
                    steps_per_epoch=(len(X_train_noisy) * (1 - validation_split)) // batch_size,
                    epochs=epochs,
                    validation_data=val_generator,
                    validation_steps=(len(X_train_noisy) * validation_split) // batch_size)
X_train_noisy and y_train_denoisy are ndarrays, e.g. of shape [20, 512, 512, 1]. But I get this error:
training and validation subsets have different number of classes after the split
How can I solve that?
Thanks!
Probably what happened is that when the data was split for training and validation, the set of samples selected for validation did not include any files from one or more of the classes. This can happen when your dataset is small. Try increasing validation_split to a larger value, say 0.5, and see if the problem goes away; it should. Then reduce the size of the validation split until the error reoccurs. That will determine the minimum split value you can use. Since exactly which samples end up in each subset depends on how your data happens to be ordered, set the split value somewhat above that minimum.
Another (better) alternative is to split the data using sklearn's train_test_split. This function has a stratify parameter that splits the data while ensuring that all classes are represented in both components. See the code below.
from sklearn.model_selection import train_test_split

X_train_noisy, X_valid_noisy, y_train_denoisy, y_valid_denoisy = train_test_split(
    X_train_noisy, y_train_denoisy, test_size=validation_split,
    shuffle=True, random_state=123, stratify=y_train_denoisy)
Now use these split variables in model.fit, as sketched below.
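A minimal sketch of that fit call, reusing epochs and batch_size from the question and passing the held-out arrays via validation_data:

history = model.fit(
    X_train_noisy, y_train_denoisy,
    batch_size=batch_size,
    epochs=epochs,
    # The stratified hold-out set replaces the generator's 'validation' subset.
    validation_data=(X_valid_noisy, y_valid_denoisy))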
How can I randomly split my image dataset into training and validation datasets? More specifically, the validation_split argument in Keras' ImageDataGenerator function is not randomly splitting my images into training and validation sets but is slicing the validation sample from an unshuffled dataset.
When specifying the validation_split argument in Keras' ImageDataGenerator, the split is performed before the data is shuffled, so only the last x samples are taken. The issue is that the last slice of data selected as validation may not be representative of the training data, so validation can fail. This is an especially common dead end when your image data is stored in a common directory with each sub-folder named by class. This has been noted in several posts:
Choose random validation data set
As you mentioned, Keras simply takes the last x samples of the dataset, so if you want to keep using it, you need to shuffle your dataset in advance.
The training accuracy is very high, while the validation accuracy is very low?
Please check whether you shuffled the data before training. Because the validation split in Keras is performed before shuffling, you may have chosen an unbalanced subset as your validation set, hence the low accuracy.
Does 'validation split' randomly choose validation sample?
The validation data is picked as the last x% of the input (for instance, the last 10% if validation_split=0.1). The training data (the remainder) can optionally be shuffled at every epoch (the shuffle argument in fit). That doesn't affect the validation data, which obviously has to be the same set from epoch to epoch.
This answer points to sklearn's train_test_split() as a solution, but I want to propose a different solution that keeps consistency with the Keras workflow.
With the split-folders package you can randomly split your main data directory into training, validation, and testing (or just training and validation) directories. The class-specific subfolders are automatically copied.
The input folder should have the following format:
input/
    class1/
        img1.jpg
        img2.jpg
        ...
    class2/
        imgWhatever.jpg
        ...
    ...
In order to give you this:
output/
    train/
        class1/
            img1.jpg
            ...
        class2/
            imga.jpg
            ...
    val/
        class1/
            img2.jpg
            ...
        class2/
            imgb.jpg
            ...
    test/            # optional
        class1/
            img3.jpg
            ...
        class2/
            imgc.jpg
            ...
From the documentation:
import split_folders

# Split with a ratio.
# To only split into training and validation set, set a tuple to `ratio`, i.e., `(.8, .2)`.
split_folders.ratio('input_folder', output="output", seed=1337, ratio=(.8, .1, .1))  # default values

# Split val/test with a fixed number of items, e.g. 100 for each set.
# To only split into training and validation set, use a single number for `fixed`, i.e., `10`.
split_folders.fixed('input_folder', output="output", seed=1337, fixed=(100, 100), oversample=False)  # default values
With this new folder arrangement you can easily use Keras data generators to divide your data into training and validation sets and eventually train your model.
import tensorflow as tf
import split_folders
import os

main_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/Data'
output_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/output'

split_folders.ratio(main_dir, output=output_dir, seed=1337, ratio=(.7, .3))

train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./224)

train_generator = train_datagen.flow_from_directory(os.path.join(output_dir, 'train'),
                                                    class_mode='categorical',
                                                    batch_size=32,
                                                    target_size=(224, 224),
                                                    shuffle=True)

validation_generator = train_datagen.flow_from_directory(os.path.join(output_dir, 'val'),
                                                         target_size=(224, 224),
                                                         batch_size=32,
                                                         class_mode='categorical',
                                                         shuffle=True)  # set as validation data

IMG_SHAPE = (224, 224, 3)  # not defined in the original snippet; matches the generators' target_size

base_model = tf.keras.applications.ResNet50V2(
    input_shape=IMG_SHAPE,
    include_top=False,
    weights=None)

maxpool_layer = tf.keras.layers.GlobalMaxPooling2D()
prediction_layer = tf.keras.layers.Dense(4, activation='softmax')

model = tf.keras.Sequential([
    base_model,
    maxpool_layer,
    prediction_layer
])

opt = tf.keras.optimizers.Adam(lr=0.004)

model.compile(optimizer=opt,
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])

model.fit(
    train_generator,
    steps_per_epoch=train_generator.samples // 32,
    validation_data=validation_generator,
    validation_steps=validation_generator.samples // 32,
    epochs=20)
I want to use the Keras ImageDataGenerator for data augmentation.
To do so, I have to call the .fit() function on the instantiated ImageDataGenerator object, using my training data as a parameter, as shown below.
image_datagen = ImageDataGenerator(featurewise_center=True, rotation_range=90)
image_datagen.fit(X_train, augment=True)
train_generator = image_datagen.flow_from_directory('data/images')
model.fit_generator(train_generator, steps_per_epoch=2000, epochs=50)
However, my training data set is too large to fit into memory when loaded up at once.
Consequently, I would like to fit the generator in several steps using subsets of my training data.
Is there a way to do this?
One potential solution that came to my mind is to load up batches of my training data using a custom generator function and fitting the image generator multiple times in a loop. However, I am not sure whether the fit function of ImageDataGenerator can be used in this way, as it might reset on each fitting call.
As an example of how it might work:
def custom_train_generator():
    # Code loading training data subsets X_batch
    yield X_batch

image_datagen = ImageDataGenerator(featurewise_center=True, rotation_range=90)

gen = custom_train_generator()
for batch in gen:
    image_datagen.fit(batch, augment=True)

train_generator = image_datagen.flow_from_directory('data/images')
model.fit_generator(train_generator, steps_per_epoch=2000, epochs=50)
NEWER TF VERSIONS (>=2.5):
ImageDataGenerator() has been deprecated in favour of:
tf.keras.utils.image_dataset_from_directory
An example usage from the documentation:
tf.keras.utils.image_dataset_from_directory(
    directory,
    labels='inferred',
    label_mode='int',
    class_names=None,
    color_mode='rgb',
    batch_size=32,
    image_size=(256, 256),
    shuffle=True,
    seed=None,
    validation_split=None,
    subset=None,
    interpolation='bilinear',
    follow_links=False,
    crop_to_aspect_ratio=False,
    **kwargs
)
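As a hedged sketch of how it can replace the generator workflow (the 'data/images' path comes from the earlier snippet; the split parameters here are illustrative, not from the original post):

import tensorflow as tf

# Build shuffled training and validation splits straight from the directory tree.
# The same seed must be used for both calls so the subsets do not overlap.
train_ds = tf.keras.utils.image_dataset_from_directory(
    'data/images', validation_split=0.2, subset='training',
    seed=123, image_size=(256, 256), batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    'data/images', validation_split=0.2, subset='validation',
    seed=123, image_size=(256, 256), batch_size=32)

model.fit(train_ds, validation_data=val_ds, epochs=50)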
OLDER TF VERSIONS (<2.5)
ImageDataGenerator() already loads the data in batches: you pass the batch_size parameter to its flow() / flow_from_directory() methods, and the resulting generator feeds fit_generator() batch by batch, so there is no need (other than as good practice, if you want) to write a generator from scratch.
IMPORTANT NOTE:
Starting from TensorFlow 2.1, .fit_generator() has been deprecated and you should use .fit()
Example taken from Keras official documentation:
datagen = ImageDataGenerator(
    featurewise_center=True,
    featurewise_std_normalization=True,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True)

# compute quantities required for featurewise normalization
# (std, mean, and principal components if ZCA whitening is applied)
datagen.fit(x_train)

# TF <= 2.0
# fits the model on batches with real-time data augmentation:
model.fit_generator(datagen.flow(x_train, y_train, batch_size=32),
                    steps_per_epoch=len(x_train) // 32, epochs=epochs)

# TF >= 2.1
model.fit(datagen.flow(x_train, y_train, batch_size=32),
          steps_per_epoch=len(x_train) // 32, epochs=epochs)
I would suggest reading this excellent article about ImageDataGenerator and augmentation: https://machinelearningmastery.com/how-to-configure-image-data-augmentation-when-training-deep-learning-neural-networks/
The solution to your problem lies in this line of code (either a simple flow or flow_from_directory):
# prepare iterator
it = datagen.flow(samples, batch_size=1)
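Because flow() / flow_from_directory() stream batches from memory or disk, the full dataset never has to be held in memory during training; only the statistics pass (datagen.fit) needs an array. A minimal sketch of one common workaround, under the assumption that a random subset of your data is representative enough for the featurewise statistics (load_sample_of_images is a hypothetical helper, not a library function):

import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(featurewise_center=True, rotation_range=90)

# Hypothetical helper: load only a manageable, random subset into memory.
sample = load_sample_of_images('data/images', n=2000)
datagen.fit(sample, augment=True)  # mean is computed from the sample only

# The fitted statistics are then applied to every batch streamed from disk.
train_generator = datagen.flow_from_directory('data/images')
model.fit(train_generator, steps_per_epoch=2000, epochs=50)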
For creating your own DataGenerator, one should have a look at this link (for a starting point): https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly
IMPORTANT NOTE (2):
If you use Keras from TensorFlow (Keras inside TensorFlow), then for both the code presented and the tutorials you consult, ensure that you replace the import/neural-network creation snippets:
from keras.x.y.z import A
WITH
from tensorflow.keras.x.y.z import A
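As a concrete instance of this replacement (Dense is chosen here purely for illustration):

# standalone Keras:
from keras.layers import Dense

# becomes, with Keras inside TensorFlow:
from tensorflow.keras.layers import Dense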
I am working on multi-label image classification, using Inception Net as my base architecture.
After the complete training I am getting training accuracy > 90% and validation accuracy > 85%, but only 17% accuracy on the test data.
Model training -->
model = Model(pre_trained_model.input, x)

model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(lr=0.0001),  # 'adam'
              metrics=['acc'])

history = model.fit_generator(
    train_generator,
    steps_per_epoch=600,  # total data / batch size
    epochs=100,
    validation_data=validation_generator,
    validation_steps=20,
    verbose=1, callbacks=callbacks)
Testing on the trained model:
test_generator = test_datagen.flow_from_directory(
    test_dir, target_size=(128, 128), batch_size=1, class_mode='categorical')

filenames = test_generator.filenames
nb_samples = len(filenames)

prediction = test_model.predict_generator(test_generator, steps=nb_samples, verbose=1)
Saving the results to Pandas
predicted_class_indices = np.argmax(prediction, axis=1)

labels = train_generator.class_indices  # getting names of classes from the folder structure
labels = dict((v, k) for k, v in labels.items())
predictions = [k for k in predicted_class_indices]

results = pd.DataFrame({"image_name": filenames,
                        "label": predictions})
results['image_name'] = [each.split("\\")[-1] for each in results['image_name']]
Everything looks fine, but I am still getting very poor predictions.
Kindly help me figure out where I am making the mistake.
It can be the case that the images in your dataset are arranged in such a way that the test split is not representative of the training data, so the accuracy drops significantly.
What I recommend is that you try K-fold cross-validation, or even Stratified K-fold cross-validation. The benefit is that your dataset is split into, say, 10 'folds'. In every iteration (out of 10), one fold is the test fold and all the others are training folds. In the next iteration, the test fold from the previous step becomes a training fold and some other fold becomes the test fold. Importantly, every fold serves as the test fold exactly once. A further benefit of the Stratified K-fold is that it takes the class labels into account and tries to split so that every fold has approximately the same class distribution.
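A minimal sketch of Stratified K-fold with scikit-learn; X, y, and build_model are illustrative assumptions (single-label integer targets), not taken from the question:

import numpy as np
from sklearn.model_selection import StratifiedKFold

# X: images of shape (n_samples, h, w, c); y: 1-D integer class labels.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

scores = []
for train_idx, test_idx in skf.split(X, y):
    model = build_model()  # hypothetical factory returning a freshly compiled model
    model.fit(X[train_idx], y[train_idx], epochs=10, batch_size=32, verbose=0)
    loss, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
    scores.append(acc)

print('mean accuracy over folds:', np.mean(scores))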
Another way to achieve somewhat better results is simply to shuffle the images before picking the training and test sets.
I have x_train and y_train numpy arrays, each >2GB. I want to train a model using the tf.estimator API, but I am getting this error:
ValueError: Cannot create a tensor proto whose content is larger than 2GB
I am passing the data using:
def input_fn(features, labels=None, batch_size=None,
             shuffle=False, repeats=False):
    if labels is not None:
        inputs = (features, labels)
    else:
        inputs = features
    dataset = tf.data.Dataset.from_tensor_slices(inputs)
    if shuffle:
        dataset = dataset.shuffle(shuffle)
    if batch_size:
        dataset = dataset.batch(batch_size)
    if repeats:
        # if False, evaluate after each epoch
        dataset = dataset.repeat(repeats)
    return dataset
train_spec = tf.estimator.TrainSpec(
    lambda: input_fn(x_train, y_train,
                     batch_size=BATCH_SIZE, shuffle=50),
    max_steps=EPOCHS
)
eval_spec = tf.estimator.EvalSpec(lambda: input_fn(x_dev, y_dev))

tf.estimator.train_and_evaluate(model, train_spec, eval_spec)
The tf.data documentation mentions this error and provides a solution using the traditional TensorFlow API with placeholders. Unfortunately, I don't know how this could be translated to the tf.estimator API.
The solution that worked for me was using
tf.estimator.inputs.numpy_input_fn(x_train, y_train, num_epochs=EPOCHS,
                                   batch_size=BATCH_SIZE, shuffle=True)
instead of input_fn. The only problem is that tf.estimator.inputs.numpy_input_fn raises deprecation warnings, so unfortunately this will stop working as well.
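One longer-lived alternative, sketched here as an assumption rather than a tested drop-in, is tf.data.Dataset.from_generator: the arrays are yielded example by example at run time, so they are never serialized into the graph as a single >2GB constant.

import tensorflow as tf

def input_fn(features, labels, batch_size):
    def gen():
        # Yield one example at a time; nothing is embedded in the graph.
        for x, y in zip(features, labels):
            yield x, y

    dataset = tf.data.Dataset.from_generator(
        gen,
        output_types=(features.dtype, labels.dtype),
        output_shapes=(features.shape[1:], labels.shape[1:]))
    return dataset.shuffle(50).batch(batch_size)

# Used in place of the original lambda from the question:
train_spec = tf.estimator.TrainSpec(
    lambda: input_fn(x_train, y_train, BATCH_SIZE),
    max_steps=EPOCHS
)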