Custom Datagenerator - python

I have a custom file containing the paths to all my images and their labels which I load in a dataframe using:
MyIndex=pd.read_table('./MySet.txt')
MyIndex has two columns of interest ImagePath and ClassName
Next I do some train test split and encoding the output labels as:
images=[]
for index, row in MyIndex.iterrows():
img_path=basePath+row['ImageName']
img = image.load_img(img_path, target_size=(299, 299))
img_path=None
img_data = image.img_to_array(img)
img=None
images.append(img_data)
img_data=None
images[0].shape
Classes=Sample['ClassName']
OutputClasses=Classes.unique().tolist()
labels=Sample['ClassName']
images=np.array(images, dtype="float") / 255.0
(trainX, testX, trainY, testY) = train_test_split(images,labels, test_size=0.10, random_state=42)
trainX, valX, trainY, valY = train_test_split(trainX, trainY, test_size=0.10, random_state=41)
images=None
labels=None
encoder = LabelEncoder()
encoder=encoder.fit(OutputClasses)
encoded_Y = encoder.transform(trainY)
# convert integers to dummy variables (i.e. one hot encoded)
trainY = to_categorical(encoded_Y, num_classes=len(OutputClasses))
encoded_Y = encoder.transform(valY)
# convert integers to dummy variables (i.e. one hot encoded)
valY = to_categorical(encoded_Y, num_classes=len(OutputClasses))
encoded_Y = encoder.transform(testY)
# convert integers to dummy variables (i.e. one hot encoded)
testY = to_categorical(encoded_Y, num_classes=len(OutputClasses))
datagen=ImageDataGenerator(rotation_range=90,horizontal_flip=True,vertical_flip=True,width_shift_range=0.25,height_shift_range=0.25)
datagen.fit(trainX,augment=True)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
batch_size=128
model.fit_generator(datagen.flow(trainX,trainY,batch_size=batch_size), epochs=500,
steps_per_epoch=trainX.shape[0]//batch_size,validation_data=(valX,valY))
The problem I face that the data loaded in one go is too large to fit in current machine memory and so I am unable to work with the complete dataset.
I have tried to work with the datagenerator but do not want to follow he directory conventions it follows and also cannot eradicate the augmentation part.
The question is that is there a way to load batches from the disk ensuring the two stated conditions.

I believe you should have a look at this post
What you are looking for is Keras flow_from_dataframe that let you load the batches from disk by providing the names of your files and their labels in a dataframe and also providing a top directory path that contains all your images.
Making a bit of midifications in your code and borrowing some from the link shared:
MyIndex=pd.read_table('./MySet.txt')
Classes=MyIndex['ClassName']
OutputClasses=Classes.unique().tolist()
trainDf=MyIndex[['ImageName','ClassName']]
train, test = train_test_split(trainDf, test_size=0.10, random_state=1)
#creating a data generator to load the files on runtime
traindatagen=ImageDataGenerator(rotation_range=90,horizontal_flip=True,vertical_flip=True,width_shift_range=0.25,height_shift_range=0.25,
validation_split=0.1)
train_generator=traindatagen.flow_from_dataframe(
dataframe=train,
directory=basePath,#the directory containing all your images
x_col='ImageName',
y_col='ClassName',
class_mode='categorical',
target_size=(299, 299),
batch_size=batch_size,
subset='training'
)
#Also a generator for the validation data
val_generator=traindatagen.flow_from_dataframe(
dataframe=train,
directory=basePath,#the directory containing all your images
x_col='ImageName',
y_col='ClassName',
class_mode='categorical',
target_size=(299, 299),
batch_size=batch_size,
subset='validation'
)
STEP_SIZE_TRAIN=train_generator.n//train_generator.batch_size
STEP_SIZE_VALID=val_generator.n//val_generator.batch_size
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit_generator(generator=train_generator, steps_per_epoch=STEP_SIZE_TRAIN,
validation_data=val_generator,
validation_steps=STEP_SIZE_VALID,
epochs=500)
Also note now you do not need the encoding of the labels as you had in your original code and also omit the image loading code.
I have not tried this code itself so try to fix any bugs you may encounter, as the primary focus was to deliver you the basic idea.
In response to your comment:
If you have all files in different directories then one solution would be to have your ImagesName to store the relative path including the intermediate directory in path something like './Dir/File.jpg' and then move all the directories to one folder and use the one as base path and everything else stays the same.
Also looking at your code segment that loaded the files look like you already have file paths stored in ImageName column so the suggested approach should work for you.
images=[]
for index, row in MyIndex.iterrows():
img_path=basePath+row['ImageName']
img = image.load_img(img_path, target_size=(299, 299))
img_path=None
img_data = image.img_to_array(img)
img=None
images.append(img_data)
img_data=None
In case if still some ambiguity exists feel free to ask again.

I think the simplest way to do this would be to just load part of your images per each generator and repeatedly call .fit_generator() with that smaller batch.
This example uses `random.random()` to choose which images to load – you could use something more sophisticated.
The previous version used random.random(), but we can just as well use a start index and page size like in this revised version to loop over the list of images forever.
import itertools
def load_images(start_index, page_size):
images = []
for index in range(page_size):
# Generate index using modulo to loop over the list forever
index = (start_index + index) % len(rows)
row = MyIndex[index]
img_path = basePath + row["ImageName"]
img = image.load_img(img_path, target_size=(299, 299))
img_data = image.img_to_array(img)
images.append(img_data)
return images
def generate_datagen(batch_size, start_index, page_size):
images = load_images(start_index, page_size)
# ... everything else you need to get from images to trainX and trainY, etc. here ...
datagen = ImageDataGenerator(
rotation_range=90,
horizontal_flip=True,
vertical_flip=True,
width_shift_range=0.25,
height_shift_range=0.25,
)
datagen.fit(trainX, augment=True)
return (
trainX,
trainY,
valX,
valY,
datagen.flow(trainX, trainY, batch_size=batch_size),
)
model.compile(
loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
)
page_size = (
500
) # load 500 images at a time; change this as suitable for your memory condition
for page in itertools.count(): # Count from zero to forever.
batch_size = 128
trainX, trainY, valX, valY, generator = generate_datagen(
128, page * page_size, page_size
)
model.fit_generator(
generator,
epochs=5,
steps_per_epoch=trainX.shape[0] // batch_size,
validation_data=(valX, valY),
)
# TODO: add a `break` clause with a suitable condition

If you want to load from the disk it is convenient to do with ImageDataGenerator that you used.
There are two ways to do it. By stating the directory of the data with flow_from_directory. Alternatively you can use flow_from_dataframe with Pandas dataframe
If you want to have a list of paths you should not use a custom generator that yields batches of images. Here is a stub:
def load_image_from_path(path):
"Loading and preprocessing"
...
def my_generator():
length = df.shape[0]
for i in range(0, length, batch_size)
batch = df.loc[i:min(i+batch_size, length-1)]
x, y = map(load_image_from_path, batch['ImageName']), batch['ClassName']
yield x, y
Note: in fit_generator there is an additional generator named validation_data for well you guessed it - validation.
One option is to pass the generators the indices to choose from in order to split train and test (assuming the data is shuffled, if not check this out).

Related

How to preprocess my ImageDataset using Keras (Augmentation, Split)

I have a project on object detection.
I have few data and want to apply the data augmentation method using Keras, but I am taking errors when I try to split and save my data into training and test.
How can I do all of this?
what I want to do?
First, I want to resize my image dataset then split data randomly into training and test.
After that saving into 'training' 'test' directory then I want to implement data augmentation for the training folder.
from tensorflow.keras.applications.xception import preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator
data_dir=/..path/
ds_gen = ImageDataGenerator(
preprocessing_function=preprocess_input,
validation_split=0.2
)
train_ds = ds_gen.flow_from_directory(
"data_dir",
seed=1,
target_size=(150, 150), #adjust to your needs
batch_size=32,#adjust to your needs
save_to_dir= data_dir/training
subset='training'
)
val_ds = ds_gen.flow_from_directory(
"data_dir",
seed=1,
target_size=(150, 150),
batch_size=32,
save_to_dir= data_dir/validation
subset='validation'
)
I recommend using ImageDataGenerator.flow_from_dataframe to do what you wish. Since you are using flow from directory your data is organized so that the code below will read in the image information and create a train_df, a test_df and a valid_df set of data frames:
def preprocess (sdir, trsplit, vsplit, random_seed):
filepaths=[]
labels=[]
classlist=os.listdir(sdir)
for klass in classlist:
classpath=os.path.join(sdir,klass)
flist=os.listdir(classpath)
for f in flist:
fpath=os.path.join(classpath,f)
filepaths.append(fpath)
labels.append(klass)
Fseries=pd.Series(filepaths, name='filepaths')
Lseries=pd.Series(labels, name='labels')
df=pd.concat([Fseries, Lseries], axis=1)
# split df into train_df and test_df
dsplit=vsplit/(1-trsplit)
strat=df['labels']
train_df, dummy_df=train_test_split(df, train_size=trsplit, shuffle=True, random_state=random_seed, stratify=strat)
strat=dummy_df['labels']
valid_df, test_df=train_test_split(dummy_df, train_size=dsplit, shuffle=True, random_state=random_seed, stratify=strat)
print('train_df length: ', len(train_df), ' test_df length: ',len(test_df), ' valid_df length: ', len(valid_df))
print(train_df['labels'].value_counts())
return train_df, test_df, valid_df
sdir=/..path/
train_split=.8 # set this to the % of data you want for the train set
valid_split=.1 # set this to the % of the data you want for a validation set
# note % used for test is 1-train_split-valid_split
train_df, test_df, valid_df= preprocess(sdir,train_split, valid_split)
The function will show the balance between the classes in terms of how many sample there are in the training dataframe for each class. Examine this data and decide how on the number of samples you want in every class. For example is class0 has 3000 samples, class1 has 1200 samples and class2 has 800 samples you may decide that for the training dataframe you want to have every class have 1000 samples (max_samples=1000). That implies that for class 2 you have to create 200 augmented images, and for classes 0 and 1 you need to reduce the number of images. The functions below will do that for you.
The trim function trims the maximum number of samples in a class. The balance function use the trim function, then creates directories to store the augmented images, creates an aug_df dataframe and merges it with the train_df data frame. The result is a composite dataframe ndf that serves as the composite training set and is balanced with exactly max_samples of samples in each class.
def trim (df, max_size, min_size, column):
df=df.copy()
sample_list=[]
groups=df.groupby(column)
for label in df[column].unique():
group=groups.get_group(label)
sample_count=len(group)
if sample_count> max_size :
samples=group.sample(max_size, replace=False, weights=None, random_state=123, axis=0).reset_index(drop=True)
sample_list.append(samples)
elif sample_count>= min_size:
sample_list.append(group)
df=pd.concat(sample_list, axis=0).reset_index(drop=True)
balance=list(df[column].value_counts())
print (balance)
return df
def balance(train_df,max_samples, min_samples, column, working_dir, image_size):
train_df=train_df.copy()
train_df=trim (train_df, max_samples, min_samples, column)
# make directories to store augmented images
aug_dir=os.path.join(working_dir, 'aug')
if os.path.isdir(aug_dir):
shutil.rmtree(aug_dir)
os.mkdir(aug_dir)
for label in train_df['labels'].unique():
dir_path=os.path.join(aug_dir,label)
os.mkdir(dir_path)
# create and store the augmented images
total=0
gen=ImageDataGenerator(horizontal_flip=True, rotation_range=20, width_shift_range=.2,
height_shift_range=.2, zoom_range=.2)
groups=train_df.groupby('labels') # group by class
for label in train_df['labels'].unique(): # for every class
group=groups.get_group(label) # a dataframe holding only rows with the specified label
sample_count=len(group) # determine how many samples there are in this class
if sample_count< max_samples: # if the class has less than target number of images
aug_img_count=0
delta=max_samples-sample_count # number of augmented images to create
target_dir=os.path.join(aug_dir, label) # define where to write the images
aug_gen=gen.flow_from_dataframe( group, x_col='filepaths', y_col=None, target_size=image_size,
class_mode=None, batch_size=1, shuffle=False,
save_to_dir=target_dir, save_prefix='aug-', color_mode='rgb',
save_format='jpg')
while aug_img_count<delta:
images=next(aug_gen)
aug_img_count += len(images)
total +=aug_img_count
print('Total Augmented images created= ', total)
# create aug_df and merge with train_df to create composite training set ndf
if total>0:
aug_fpaths=[]
aug_labels=[]
classlist=os.listdir(aug_dir)
for klass in classlist:
classpath=os.path.join(aug_dir, klass)
flist=os.listdir(classpath)
for f in flist:
fpath=os.path.join(classpath,f)
aug_fpaths.append(fpath)
aug_labels.append(klass)
Fseries=pd.Series(aug_fpaths, name='filepaths')
Lseries=pd.Series(aug_labels, name='labels')
aug_df=pd.concat([Fseries, Lseries], axis=1)
ndf=pd.concat([train_df,aug_df], axis=0).reset_index(drop=True)
else:
ndf=train_df
print (list(ndf['labels'].value_counts()) )
return ndf
max_samples= 1000 # set this to how many samples you want in each class
min_samples=0
column='labels'
working_dir = r'./' # this is the directory where the augmented images will be stored
img_size=(224,224) # set this to the image size you want for the images
ndf=balance(train_df,max_samples, min_samples, column, working_dir, img_size)
now create the train, test and valid generators
channels=3
batch_size=30
img_shape=(img_size[0], img_size[1], channels)
length=len(test_df)
test_batch_size=sorted([int(length/n) for n in range(1,length+1) if length % n ==0 and length/n<=80],reverse=True)[0]
test_steps=int(length/test_batch_size)
print ( 'test batch size: ' ,test_batch_size, ' test steps: ', test_steps)
def scalar(img):
return img # EfficientNet expects pixelsin range 0 to 255 so no scaling is required
trgen=ImageDataGenerator(preprocessing_function=scalar, horizontal_flip=True)
tvgen=ImageDataGenerator(preprocessing_function=scalar)
train_gen=trgen.flow_from_dataframe( ndf, x_col='filepaths', y_col='labels', target_size=img_size, class_mode='categorical',
color_mode='rgb', shuffle=True, batch_size=batch_size)
test_gen=tvgen.flow_from_dataframe( test_df, x_col='filepaths', y_col='labels', target_size=img_size, class_mode='categorical',
color_mode='rgb', shuffle=False, batch_size=test_batch_size)
valid_gen=tvgen.flow_from_dataframe( valid_df, x_col='filepaths', y_col='labels', target_size=img_size, class_mode='categorical',
color_mode='rgb', shuffle=True, batch_size=batch_size)
classes=list(train_gen.class_indices.keys())
class_count=len(classes)
now use the train_gen and valid_gen in model.fit. Use the test_gen in model.evaluate or model.predict

AutoKeras Image Classifier: Generator not working and plt.show() gives empty image

I am trying to build an image classification program using AutoKeras, Tensorflow, and Pandas.
The code is as folllows:
from keras_preprocessing.image import ImageDataGenerator
import autokeras as ak
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
# directory with subfolders (that contain other subfolders) that contain images
data_dir = "/home/jack/project/"
# dataframe initialization
dataframe = pd.read_excel("/home/jack/project/pathsandlabels.xlsx")
# splitting the dataset
train_dataframe = dataframe.sample(frac=0.75, random_state=200)
test_dataframe = dataframe.drop(train_dataframe.index)
# Augmenting it
datagen = ImageDataGenerator(rescale=1./255., horizontal_flip=True, shear_range=0.6, zoom_range=0.4,
validation_split=0.25)
# Setting up a train generator
train_generator = datagen.flow_from_dataframe(
dataframe=train_dataframe,
directory="/home/jack/project",
x_col="filename",
y_col="assessment",
subset="training",
seed=42,
batch_size=16,
shuffle=True,
class_mode="binary",
target_size=(224, 224)
)
# setting up a validation generator
validation_generator = datagen.flow_from_dataframe(
dataframe=train_dataframe,
directory="/home/jack/project/",
x_col="filename",
y_col="assessment",
subset="validation",
batch_size=16,
seed=42,
shuffle=True,
class_mode="binary",
target_size=(224, 224)
)
# Another augmentation but for test data
test_gen = ImageDataGenerator(rescale=1./255.)
# test generator set up
test_generator = test_gen.flow_from_dataframe(
dataframe=test_dataframe,
directory="/home/jack/project/",
x_col="filename",
y_col=None,
batch_size=16,
seed=42,
shuffle=False,
class_mode=None,
target_size=(224, 224)
)
# this function will yield the variables we need to work with in order to create a train and test set
# it will iterate through the generator
def my_iterator(generator):
for img_batch, targets_batch in generator:
yield test_generator.batch_size, targets_batch
# Train and Validation set creation
# The first problem is here
# 1: Invalid argument: Value Error: 'generator' yielded an element of shape (16,224,224,3) where an element
# of shape (224,) was expected.
train_set = tf.data.Dataset.from_generator(lambda: my_iterator(train_generator), output_shapes=(224, 244),
output_types=(tf.float32, tf.float32))
val_set = tf.data.Dataset.from_generator(lambda: my_iterator(validation_generator), output_shapes=(224, 224),
output_types=(tf.float32, tf.float32))
# we check the output of both validation and train sets
print(train_set)
print(val_set)
# This piece of code is where the other two issues are:
# 2: squeeze(axis=2) gives this error: ValueError: cannot select an axis to squeeze out which has size not equal to one
# 3: Issue 2 can be averted by setting axis=None, but the next problem is plt.show() gives an empty image.
for image, label in train_set.take(1):
print("Image shape: ", image.numpy.shape())
print("Label: ", label.numpy.shape())
plt.imshow(image.numpy()[0].squeeze(axis=2) * 255)
plt.show()
clf = ak.ImageClassifier(overwrite=True, max_trials=1, seed=5)
clf.fit(x=train_set, epochs=20)
print(clf.evaluate(val_set))
I mentioned the issues I face as comments in the code, but I will explain again.
The biggest issue is the first one:Value Error: 'generator' yielded an element of shape (16,224,224,3) where an element of shape (224,) was expected. This happens when I try to initialize my test set.
What I tried:
Changing output_shape to (224,224,3) and (16,224,224,3) (didn't help, threw a different error saying that "The two sequences do not have the same length"
Deleting batch_size from train_generator (this set it back to the default 32 which my pc can't handle)
Changing target_size within the generators to (224,224,3) and (16,224,224,3). didn't work
Changing the number of variables that my_iterator yields. Didn't work (error message: expect n (this is either 3 or 4) values to unpack, got 2)
Changing batch_size to a number by which the total number of images can be divided by (didn't work, throws original error message)
How the data is stored:
Excel. Single sheet. Two columns, A and B. filename and assessment being the column names. Filename is paths to the images (e.g "/subfolder/subfolder/subfolder/A2c3jc3291n.jpeg") but without the quotes obviously.
Assessments are the classes. There are only two in this case.

How to deal with thousands of images for CNN training Keras

I have ~10000k images that cannot fit in memory. So for now I can only read 1000 images and train on it...
My code is here :
img_dir = "TrainingSet" # Enter Directory of all images
image_path = os.path.join(img_dir+"/images",'*.bmp')
files = glob.glob(image_path)
images = []
masks = []
contours = []
indexes = []
files_names = []
for f1 in np.sort(files):
img = cv2.imread(f1)
result = re.search('original_cropped_(.*).bmp', str(f1))
idx = result.group(1)
mask_path = img_dir+"/masks/mask_cropped_"+str(idx)+".bmp"
mask = cv2.imread(mask_path,0)
contour_path = img_dir+"/contours/contour_cropped_"+str(idx)+".bmp"
contour = cv2.imread(contour_path,0)
indexes.append(idx)
images.append(img)
masks.append(mask)
contours.append(contour)
train_df = pd.DataFrame({"id":indexes,"masks": masks, "images": images,"contours": contours })
train_df.sort_values(by="id",ascending=True,inplace=True)
print(train_df.shape)
img_size_target = (256,256)
ids_train, ids_valid, x_train, x_valid, y_train, y_valid, c_train, c_valid = train_test_split(
train_df.index.values,
np.array(train_df.images.apply(lambda x: cv2.resize(x,img_size_target).reshape(img_size_target[0],img_size_target[1],3))),
np.array(train_df.masks.apply(lambda x: cv2.resize(x,img_size_target).reshape(img_size_target[0],img_size_target[1],1))),
np.array(train_df.contours.apply(lambda x: cv2.resize(x,img_size_target).reshape(img_size_target[0],img_size_target[1],1))),
test_size=0.2, random_state=1337)
#Here we define the model architecture...
#.....
#End of model definition
# Training
optimizer = Adam(lr=1e-3,decay=1e-10)
model.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"])
early_stopping = EarlyStopping(patience=10, verbose=1)
model_checkpoint = ModelCheckpoint("./keras.model", save_best_only=True, verbose=1)
reduce_lr = ReduceLROnPlateau(factor=0.5, patience=5, min_lr=0.00001, verbose=1)
epochs = 200
batch_size = 32
history = model.fit(x_train, y_train,
validation_data=[x_valid, y_valid],
epochs=epochs,
batch_size=batch_size,
callbacks=[early_stopping, model_checkpoint, reduce_lr])
What I would like to know is how can I modify my code in order to do batches of a small set of images without loading all the other 10000 into memory ? which means that the algorithm will read X images each epoch from directory and train on it and after that goes for the next X until the last one.
X here would be a reasonable amount of images that can fit into memory.
use fit_generator instead of fit
def generate_batch_data(num):
#load X images here
return images
model.fit_generator(generate_batch_data(X),
samples_per_epoch=10000, nb_epoch=10)
Alternative you could use train_on_batch instead of fit
Discussion on GitHub about this topic: https://github.com/keras-team/keras/issues/2708
np.array(train_df.images.apply(lambda x:cv2.resize(x,img_size_target).reshape(img_size_target[0],img_size_target[1],3)))
You can first apply this filter (and the 2 others) to each individual file and save them to a special folder (images_prepoc, masks_preproc,etc.. ) in a separate script, then load them back already ready for use in the current script.
Assuming that the actual images dimensions are greater than 256x256, you will have a faster algorithm, using less memory at the cost of a single preparation phase.

predict_generator and class labels

I am using ImageDataGenerator to generate new augmented images and extract bottleneck features from pretrained model but most of the tutorial I see on keras
samples same no of training samples as number of images in directory.
train_generator = train_datagen.flow_from_directory(
train_path,
target_size=image_size,
shuffle = "false",
class_mode='categorical',
batch_size=1)
bottleneck_features_train = model.predict_generator(
train_generator, 2* nb_train_samples // batch_size)
Suppose I want 2 times more images from the above code, how I can get the desired class labels for the features extracted from bottleneck layer which are stored in tuple train_generator.
shouldnt the code in training_generator.py at line 422
x, _ = generator_output
do something like this
=> x, y = generator_output
and return tuple [np.concatenate(out) for out in all_outs],y from predict_generator
i.e return the corresponding class labels along with the predicted features all_outs since there is no way to get the corresponding labels without running generator twice.
If you're using predict, normally you simply don't want Y, because Y will be the result of the prediction. (You're not training, so you don't need the true labels)
But you can do it yourself:
bottleneck = []
labels = []
for i in range(2 * nb_train_samples // batch_size):
x, y = next(train_generator)
bottleneck.append(model.predict(x))
labels.append(y)
bottleneck = np.concatenate(bottleneck)
labels = np.concatenate(labels)
If you want it with indexing (if your generator supports that):
#...
for epoch in range(2):
for i in range(nb_train_samples // batch_size):
x,y = train_generator[i]
#...

How do I check the order in which Keras' flow_from_directory method processes folders?

When doing transfer learning, I am first feeding the images through the bottom layers of the VGG16 network. I am using a generator function.
datagen = ImageDataGenerator(1./255)
generator = datagen.flow_from_directory(
train_data_dir,
target_size=(img_width, img_height),
batch_size = 32,
class_mode=None,
shuffle=False
)
model.predict_generator(generator, nb_train_samples)
I am setting the class mode to none, because I only want the data output. I am setting shuffle = false, because I want to feed the predicted features later here, and match them up with the ground truth category variable:
train_data = np.lead(open(file_name, 'rb'))
train_labels = np.array([0] * NUMBER_OF_ITEMS_FOR_ITEM1 +
[1] * NUMBER_OF_ITEMS_FOR_ITEM2 +...
[n-1] * NUMBER_OF_ITEMS_FOR_ITEMN
The problem here is that I don't know in which order the files are read. How can I find it out? Or even better yet, how can I just avoid having to guess the correct order? I am asking because I am almost certain that the low prediction accuracy has to do with labels not matching up.
I looked into the source code. I should note that since I posted this question, Keras was updated to version 2.0. So the answer is based on that version.
ImageDataGenerator inherits from DirectoryGenerator. In it, I find the following lines:
if not classes:
classes = []
for subdir in sorted(os.listdir(directory)):
if os.path.isdir(os.path.join(directory, subdir)):
classes.append(subdir)
self.num_class = len(classes)
self.class_indices = dict(zip(classes, range(len(classes))))
def _recursive_list(subpath):
return sorted(os.walk(subpath, followlinks=follow_links), key=lambda tpl: tpl[0])
for subdir in classes:
subpath = os.path.join(directory, subdir)
for root, _, files in _recursive_list(subpath):
for fname in files:
is_valid = False
for extension in white_list_formats:
if fname.lower().endswith('.' + extension):
is_valid = True
break
if is_valid:
self.samples += 1
print('Found %d images belonging to %d classes.' % (self.samples, self.num_class))
Note line 3 where it says "sorted(os.listdir(direcectory, subdir))". The generator traverses all folders in alphabetical order.
Later on the definition of a _recursive_list uses the same logic on substructures as well.
So the answer is: Folders are processed in alphabetical order which somehow makes sense.
Good question, add a print statement to the keras/preprocessing/image.py in the next method in inside the DirectoryIterator class: Here is the relevant code which iterates over a list of filenames. You would of course have to rebuild keras from source.
for i, j in enumerate(index_array):
fname = self.filenames[j]
print(fname) # add this to see the current file being accessed
img = load_img(os.path.join(self.directory, fname),
grayscale=grayscale,
target_size=self.target_size)
x = img_to_array(img, data_format=self.data_format)
x = self.image_data_generator.random_transform(x)
However to avoid all that pain this example on the keras docs page suggests that to ensure the consistency one should follow this pattern.
Pass train and validate generators of the same template to the model.fit_generator function.
train_datagen = ImageDataGenerator(
rescale=1./255,
shear_range=0.2,
zoom_range=0.2,
horizontal_flip=True)
test_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(
'data/train',
target_size=(150, 150),
batch_size=32,
class_mode='binary')
validation_generator = test_datagen.flow_from_directory(
'data/validation',
target_size=(150, 150),
batch_size=32,
class_mode='binary')
model.fit_generator(
train_generator,
samples_per_epoch=2000,
epochs=50,
validation_data=validation_generator,
num_val_samples=800)

Categories

Resources