I am trying to create a multilabel classification model with Keras. All my images are in one folder, and I have a CSV file containing the path to each image followed by its labels (an image can have several).
Example of my CSV:
path, x1, x2, x3
img/img_00000001.jpg,1,0,1
img/img_00000002.jpg,0,0,1
...
I am trying to read in my images using flow_from_directory and provide the respective labels via the CSV. My code so far looks like this:
image_path= "C:/user/Images"
data_generator = ImageDataGenerator(rescale=1./255,
validation_split=0.20)
train_generator = data_generator.flow_from_directory(image_path, target_size=(IMAGE_HEIGHT, IMAGE_SIZE), shuffle=True, seed=13,
class_mode='binary', batch_size=BATCH_SIZE, subset="training")
validation_generator = data_generator.flow_from_directory(image_path, target_size=(IMAGE_HEIGHT, IMAGE_SIZE), shuffle=False, seed=13,
class_mode='binary', batch_size=BATCH_SIZE, subset="validation")
A solution to a similar problem is suggested in "How to manually specify class labels in keras flow_from_directory?", which provides this code:
def multiclass_flow_from_directory(flow_from_directory_gen, multiclasses_getter):
    for x, y in flow_from_directory_gen:
        yield x, multiclasses_getter(x, y)
However, I can't figure out how to implement multiclasses_getter() so that it works.
Try using flow_from_dataframe instead of flow_from_directory.
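A minimal sketch of that suggestion, assuming the CSV layout from the question (a path column followed by one 0/1 column per label). The helper names read_label_table and make_multilabel_generators, the default sizes, and the seed are made up for illustration. class_mode='raw' passes the label columns through unchanged, which is what a multilabel model with a sigmoid output layer would typically expect.

```python
import csv


def read_label_table(csv_path):
    """Read 'path, x1, x2, ...' rows into (label_names, rows).

    Each row is a dict holding the image path under 'path' and one int per label.
    skipinitialspace handles the blanks after the commas in the header line.
    """
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle, skipinitialspace=True)
        header = next(reader)
        label_names = header[1:]
        rows = []
        for record in reader:
            if not record:  # skip blank lines
                continue
            row = {"path": record[0]}
            row.update({name: int(v) for name, v in zip(label_names, record[1:])})
            rows.append(row)
    return label_names, rows


def make_multilabel_generators(csv_path, image_dir, target_size=(224, 224),
                               batch_size=32, validation_split=0.2):
    """Hypothetical helper: build training/validation generators from the CSV."""
    # Imported here so read_label_table works without pandas/TensorFlow installed.
    import pandas as pd
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    label_names, rows = read_label_table(csv_path)
    frame = pd.DataFrame(rows)
    datagen = ImageDataGenerator(rescale=1.0 / 255,
                                 validation_split=validation_split)
    common = dict(
        dataframe=frame,
        directory=image_dir,   # CSV paths are taken relative to this folder
        x_col="path",
        y_col=label_names,     # several columns -> one multi-hot vector per image
        class_mode="raw",      # hand the label columns through unchanged
        target_size=target_size,
        batch_size=batch_size,
        seed=13,
    )
    train = datagen.flow_from_dataframe(subset="training", shuffle=True, **common)
    valid = datagen.flow_from_dataframe(subset="validation", shuffle=False, **common)
    return train, valid
```

With the question's setup this would be called as make_multilabel_generators("labels.csv", "C:/user/Images"), where labels.csv is a placeholder name for the CSV file.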
I have trained a ResNet50 using Keras for classification. For testing, I used the ImageDataGenerator flow_from_directory() method to pass input to the model. Here's the code for that:
testdata_generator = keras.preprocessing.image.ImageDataGenerator(
preprocessing_function=tf.keras.applications.resnet.preprocess_input
)
testgen = testdata_generator.flow_from_directory(
'./test',
shuffle=False,
target_size=(224,224),
color_mode='rgb',
batch_size=32,
class_mode=None
)
Found 18223 images belonging to 1 classes.
However when I test the model on the test images, it doesn't predict for a few images.
pred = model.predict(
testgen,
batch_size=32,
steps=testgen.n//testgen.batch_size
)
print(len(pred))
18208
Can anyone help?
You should try removing steps=testgen.n//testgen.batch_size. Integer division discards the remainder, so whenever the number of samples is not a multiple of the batch size, the final partial batch is never predicted and you get fewer predictions than images.
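To make the arithmetic concrete, here is a small sketch using the numbers from the question (18223 images, batch size 32). Using math.ceil is one way to get a step count that covers the final partial batch, in case steps has to be given explicitly.

```python
import math

n_samples = 18223   # images found by the generator
batch_size = 32

floor_steps = n_samples // batch_size            # full batches only
ceil_steps = math.ceil(n_samples / batch_size)   # also counts the last partial batch

print(floor_steps, floor_steps * batch_size)             # 569 18208
print(ceil_steps, n_samples - floor_steps * batch_size)  # 570 15
```

The 15 images left over by the floor division are exactly the ones missing from the 18208 predictions.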
I have this code:
epochs =50
batch_size = 5
validation_split = 0.2
datagen = tf.keras.preprocessing.image.ImageDataGenerator(validation_split=validation_split )
train_generator = datagen.flow(
X_train_noisy, y_train_denoisy, batch_size=batch_size,
subset='training'
)
val_generator = datagen.flow(
X_train_noisy, y_train_denoisy, batch_size=batch_size,
subset='validation'
)
history = model.fit(train_generator,
steps_per_epoch=(len(X_train_noisy)*(1-validation_split)) // batch_size, epochs=epochs,
validation_data = val_generator, validation_steps=(len(X_train_noisy)*validation_split)//batch_size)
X_train_noisy and y_train_denoisy are ndarrays (of shape [20, 512, 512, 1], for example). But I get this error:
training and validation subsets have different number of classes after the split
How can I solve that?
thanks!
Probably what happened is that, when the data was split into training and validation sets, the files selected for validation did not include any files from one or more of the classes. This can happen when your dataset is small. Try increasing validation_split to a larger value, say 0.5, and see if the problem goes away. It should. Then reduce the validation split until the error recurs; that determines the minimum split value you can use. Remember the split is randomized, so set the split value somewhat above that minimum.
Another (better) alternative is to split the data using sklearn's train_test_split. This function has a stratify parameter that splits the data while ensuring that all classes are represented in both parts. See the code below:
from sklearn.model_selection import train_test_split
X_train_noisy, X_valid_noisy, y_train_denoisy, y_valid_denoisy=train_test_split(X_train_noisy,
y_train_denoisy, test_size=validation_split,
shuffle=True, random_state=123,
stratify=y_train_denoisy)
Now use these split variables in model.fit.
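One caveat: stratification needs discrete class labels. If the targets are themselves images, as the shapes in the question suggest, there is nothing to stratify on, and a plain shuffled split may be enough. The sketch below uses NumPy only; split_pairs is a made-up name and the small random arrays stand in for the real data.

```python
import numpy as np


def split_pairs(x, y, validation_split=0.2, seed=123):
    """Shuffle (x, y) pairs together and split off a validation part."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of samples")
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_valid = int(round(len(x) * validation_split))
    valid_idx, train_idx = order[:n_valid], order[n_valid:]
    return x[train_idx], x[valid_idx], y[train_idx], y[valid_idx]


# Stand-in data: 20 noisy/clean image pairs (smaller than 512x512 to keep it light).
clean = np.random.default_rng(0).random((20, 8, 8, 1))
noisy = clean + 0.1
x_tr, x_va, y_tr, y_va = split_pairs(noisy, clean)
print(x_tr.shape, x_va.shape)  # (16, 8, 8, 1) (4, 8, 8, 1)
```

The four arrays can then be passed directly, for example model.fit(x_tr, y_tr, validation_data=(x_va, y_va), batch_size=batch_size, epochs=epochs).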
I'm trying to use generators in my CNN training, but for some reason, each time I run model.predict_generator() (I'm working in a Jupyter Notebook), it gives different results! Same data (stored in a folder), same model (I just rerun the same cell).
This block works fine; every time I rerun it, it gives the same metrics:
test_generator = test_datagen.flow_from_directory(
'keras_data/test',
batch_size = 1,
class_mode='categorical')
loss, acc = model.evaluate(test_generator, verbose=1)
print(loss,acc)
However, when I run this cell, it gives different results every time
ytest = test_generator.classes
yhat = np.argmax(model.predict_generator(test_generator),axis=1)
from sklearn.metrics import confusion_matrix
m = confusion_matrix(ytest,yhat)
print(m)
It doesn't make any sense! Any ideas on what's happening?
EDIT: here is how I create the generators, just in case the problem is here
train_datagen = ImageDataGenerator(
preprocessing_function=preprocess_input,
horizontal_flip=True)
test_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
train_generator = train_datagen.flow_from_directory(
'keras_data/train',
batch_size=1,
class_mode='categorical')
validation_generator = test_datagen.flow_from_directory(
'keras_data/val',
batch_size=1,
class_mode='categorical')
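From the flow_from_directory documentation:
shuffle: Whether to shuffle the data (default: True). If set to False, sorts the data in alphanumeric order.
Since the test generator is created without shuffle=False, it yields the images in a new random order on every pass, while test_generator.classes always lists the labels in file order. model.evaluate is unaffected because each batch carries its own labels, but the predictions no longer line up with ytest.
From comments:
Setting shuffle=False for the test generator resolved the issue (paraphrased from Frightera).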
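A toy illustration of the effect, in plain Python rather than Keras: classes plays the role of test_generator.classes (always in file order), and fake_predict is a made-up stand-in for a perfect model whose outputs come back in whatever order the generator yielded the samples.

```python
import random

# Labels in file order, as test_generator.classes would report them.
classes = [0, 0, 0, 1, 1, 1, 2, 2, 2]


def fake_predict(order):
    """Stand-in for a perfect model: the true label of each sample,
    returned in the order the samples were yielded."""
    return [classes[i] for i in order]


def accuracy(y_true, y_pred):
    """Fraction of positions where the two label lists agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


file_order = list(range(len(classes)))
shuffled_order = file_order[:]
random.Random(13).shuffle(shuffled_order)

# shuffle=False: predictions line up with classes.
print(accuracy(classes, fake_predict(file_order)))  # 1.0
# shuffle=True: the same predictions, compared against the wrong rows.
print(accuracy(classes, fake_predict(shuffled_order)))
```

Even with a perfect model, the second comparison depends on the shuffle, which is why the confusion matrix changes from run to run.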
I am looking to build a classification model using my own dataset, but I'm having trouble formatting the dataset. The images are currently in subfolders, with each subfolder name being the class. I want my dataset in the same format as the MNIST dataset, but I am unable to get there. For example, for MNIST, we can split the dataset:
(train_images, train_labels), (
test_images,
test_labels) = tf.keras.datasets.mnist.load_data()
And then for example, I could flatten the data:
train_images = train_images.reshape((train_images.shape[0], -1))
test_images = test_images.reshape((test_images.shape[0], -1))
How would I replace tf.keras.datasets.mnist.load_data() with my own dataset but in the same format as the MNIST dataset? I am also doing multi-class classification.
Edit: Added Notes:
To be clear, my main task is to replace the MNIST dataset with my own dataset: My subdirectories are like this for example:
main_directory/
...class_a/
......a_image_1.jpg
......a_image_2.jpg
...class_b/
......b_image_1.jpg
......b_image_2.jpg
...class_c/
......c_image_1.jpg
......c_image_2.jpg
...class_d/
......d_image_1.jpg
......d_image_2.jpg
I tried following this link to make my dataset into a format that tf.keras could use to load the dataset, similar to the way the MNIST dataset is loaded. I have tried generating a tf.data.Dataset.
data_dir = "/datas"
data_dir = pathlib.Path(data_dir)
image_count = len(list(data_dir.glob('*/*.jpg')))
print(image_count)
cats = list(data_dir.glob('cats/*'))
PIL.Image.open(str(cats[0]))
batch_size = 32
img_height = 150
img_width = 150
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
data_dir,
validation_split=0.2,
subset="training",
seed=123,
image_size=(img_height, img_width),
batch_size=batch_size)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
data_dir,
validation_split=0.2,
subset="validation",
seed=123,
image_size=(img_height, img_width),
batch_size=batch_size)
But I am still unable to put it into the right format, to fix the code as described above. Something I saw was about conversion to a numpy array but I'm not sure how to do it.
Some packages have helper functions to make data access easy for the users. As you can see in the documentation for the load_data() function, it returns a tuple of Numpy arrays:
(X_train, y_train), (X_test, y_test)
You can get the same structure by arranging your features (X) and target (y) as numpy arrays, and pass them through the train_test_split() scikit-learn function.
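A rough sketch of building those arrays from a class-per-subfolder layout like the one in the question. The function names, the accepted extensions, the 28x28 size, and the grayscale conversion (chosen only to mirror MNIST) are all assumptions, not part of any Keras API. Class indices follow the sorted subfolder names.

```python
from pathlib import Path


def list_labelled_files(main_directory, extensions=(".jpg", ".jpeg", ".png")):
    """Return (paths, labels, class_names) for a class-per-subfolder layout.

    Class names are the sorted subfolder names; labels are their indices.
    """
    root = Path(main_directory)
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    paths, labels = [], []
    for index, name in enumerate(class_names):
        for path in sorted((root / name).iterdir()):
            if path.suffix.lower() in extensions:
                paths.append(path)
                labels.append(index)
    return paths, labels, class_names


def load_as_arrays(main_directory, image_size=(28, 28)):
    """Hypothetical helper: load every image into one uint8 array, MNIST-style."""
    # Imported here so list_labelled_files runs without NumPy/Pillow installed.
    import numpy as np
    from PIL import Image

    paths, labels, class_names = list_labelled_files(main_directory)
    # image_size is (width, height), as Pillow's resize expects.
    images = [
        np.asarray(Image.open(p).convert("L").resize(image_size), dtype=np.uint8)
        for p in paths
    ]
    return np.stack(images), np.asarray(labels, dtype=np.int64), class_names
```

Calling X, y, class_names = load_as_arrays("main_directory") gives an image array and an integer label array; passing X and y to train_test_split then yields the train and test parts, which can be flattened with reshape exactly as in the MNIST snippet.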
I want to use the Keras ImageDataGenerator for data augmentation.
To do so, I have to call the .fit() function on the instantiated ImageDataGenerator object using my training data as parameter as shown below.
image_datagen = ImageDataGenerator(featurewise_center=True, rotation_range=90)
image_datagen.fit(X_train, augment=True)
train_generator = image_datagen.flow_from_directory('data/images')
model.fit_generator(train_generator, steps_per_epoch=2000, epochs=50)
However, my training data set is too large to fit into memory when loaded up at once.
Consequently, I would like to fit the generator in several steps using subsets of my training data.
Is there a way to do this?
One potential solution that came to my mind is to load up batches of my training data using a custom generator function and fitting the image generator multiple times in a loop. However, I am not sure whether the fit function of ImageDataGenerator can be used in this way as it might reset on each fitting approach.
As an example of how it might work:
def custom_train_generator():
    # Code loading training data subsets X_batch
    yield X_batch
image_datagen = ImageDataGenerator(featurewise_center=True, rotation_range=90)
gen = custom_train_generator()
for batch in gen:
    image_datagen.fit(batch, augment=True)
train_generator = image_datagen.flow_from_directory('data/images')
model.fit_generator(train_generator, steps_per_epoch=2000, epochs=50)
NEWER TF VERSIONS (>=2.5):
ImageDataGenerator() has been deprecated in favour of:
tf.keras.utils.image_dataset_from_directory
Its signature, with default values, from the documentation:
tf.keras.utils.image_dataset_from_directory(
directory,
labels='inferred',
label_mode='int',
class_names=None,
color_mode='rgb',
batch_size=32,
image_size=(256, 256),
shuffle=True,
seed=None,
validation_split=None,
subset=None,
interpolation='bilinear',
follow_links=False,
crop_to_aspect_ratio=False,
**kwargs
)
OLDER TF VERSIONS (<2.5)
ImageDataGenerator() lets you load the data in batches: pass batch_size to flow() or flow_from_directory(), as in the example below, and hand the resulting iterator to fit_generator(). There is no need to write a generator from scratch (other than as good practice, if you want to).
IMPORTANT NOTE:
Starting from TensorFlow 2.1, .fit_generator() has been deprecated and you should use .fit()
Example taken from Keras official documentation:
datagen = ImageDataGenerator(
featurewise_center=True,
featurewise_std_normalization=True,
rotation_range=20,
width_shift_range=0.2,
height_shift_range=0.2,
horizontal_flip=True)
# compute quantities required for featurewise normalization
# (std, mean, and principal components if ZCA whitening is applied)
datagen.fit(x_train)
# TF <= 2.0
# fits the model on batches with real-time data augmentation:
model.fit_generator(datagen.flow(x_train, y_train, batch_size=32),
steps_per_epoch=len(x_train) // 32, epochs=epochs)
#TF >= 2.1
model.fit(datagen.flow(x_train, y_train, batch_size=32),
steps_per_epoch=len(x_train) // 32, epochs=epochs)
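If x_train does not fit in memory, the per-channel mean and standard deviation that datagen.fit(x_train) computes can, in principle, be accumulated chunk by chunk instead. The sketch below does only the NumPy part: running_channel_stats is a made-up name, channels-last images are assumed, and the augment=True option is ignored. Handing the results to the generator (for example by assigning them to its mean and std attributes) relies on Keras internals and is an untested assumption here.

```python
import numpy as np


def running_channel_stats(chunks):
    """Per-channel mean and std over an iterable of (N, H, W, C) arrays,
    visiting each chunk once and holding only one chunk in memory."""
    count = 0
    total = None
    total_sq = None
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=np.float64)
        pixels = chunk.reshape(-1, chunk.shape[-1])   # (N*H*W, C)
        if total is None:
            total = np.zeros(pixels.shape[1])
            total_sq = np.zeros(pixels.shape[1])
        count += pixels.shape[0]
        total += pixels.sum(axis=0)
        total_sq += (pixels ** 2).sum(axis=0)
    if count == 0:
        raise ValueError("no data")
    mean = total / count
    # Var = E[x^2] - E[x]^2; clip tiny negatives caused by rounding.
    std = np.sqrt(np.maximum(total_sq / count - mean ** 2, 0.0))
    return mean, std


# Stand-in data: three chunks of RGB images instead of files read from disk.
rng = np.random.default_rng(0)
chunks = [rng.random((10, 16, 16, 3)) for _ in range(3)]
mean, std = running_channel_stats(chunks)

full = np.concatenate(chunks)
print(np.allclose(mean, full.mean(axis=(0, 1, 2))))  # True
print(np.allclose(std, full.std(axis=(0, 1, 2))))    # True
```

In practice the chunks would come from a generator that reads a subset of the training images at a time, like the custom_train_generator idea in the question.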
I would suggest reading this excellent article about ImageDataGenerator and augmentation: https://machinelearningmastery.com/how-to-configure-image-data-augmentation-when-training-deep-learning-neural-networks/
The solution to your problem lies in this line of code (using either plain flow or flow_from_directory):
# prepare iterator
it = datagen.flow(samples, batch_size=1)
For creating your own DataGenerator, this link is a good starting point: https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly
IMPORTANT NOTE (2):
If you use Keras from Tensorflow (Keras inside Tensorflow), then for both the code presented and the tutorials you consult, ensure that you replace the import/neural network creation snippets:
from keras.x.y.z import A
WITH
from tensorflow.keras.x.y.z import A