TensorFlow: how to create training and testing image datasets


I've been searching the internet for a long time while trying to create, train and test my own TensorFlow model, but I haven't managed to get it working. From investigating my code, I think the problem is how I create my image dataset. Most online tutorials just import a prepared dataset, but my dataset is specifically for use-case diagrams and holds each element within those diagrams. My aim is to train a TensorFlow model to predict each element in a diagram and hopefully the errors too. Here's the code:
import os
import cv2 as cv
import numpy as np

def createDataSet(labelList, label, filePath, width, height):
    dataList = []
    for img in os.listdir(filePath):
        filename = str(img)
        # skip the .npy files written by np.save below
        if filename[len(filename) - 3:len(filename)] != "npy":
            pic = cv.imread(os.path.join(filePath, img))
            pic = cv.cvtColor(pic, cv.COLOR_BGR2RGB)
            pic = cv.resize(pic, (width, height))
            dataList.append(pic)
            labelList.append(label)
    return dataList, labelList

# appending the pics to the training data list
training_dataset, train_labels = createDataSet(train_labels, train_label, path, width, height)
test_dataset, test_labels = createDataSet(test_labels, test_label, path2, width, height)
# converting the list to numpy array and saving it to a file using numpy.save
np.save(os.path.join(path, train_label), np.array(training_dataset))
np.save(os.path.join(path2, test_label), np.array(test_dataset))
# loading the saved file once again
train_images = np.array(training_dataset)
test_images = np.array(test_dataset)
As of now, the function creates a list that is saved as a NumPy array, and that array is used for my model. But it causes errors like UNIMPLEMENTED: Cast string to float is not supported.
I'm sure that I'm creating the train and test data, and the labels for both, incorrectly.

You can use the tf.keras.utils.image_dataset_from_directory function to read images from a directory and split the data into training and validation sets. It expects one subdirectory per class. Below are the expected directory layout and sample code.
main_directory/
...class_a/
......a_image_1.jpg
......a_image_2.jpg
...class_b/
......b_image_1.jpg
......b_image_2.jpg
import tensorflow as tf

data_dir = "main_directory"  # path to the directory laid out as above
batch_size = 32
img_height = 224
img_width = 224
train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)
# class names are inferred from the subdirectory names
class_names = train_ds.class_names
# same seed and validation_split, so the two subsets do not overlap
val_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="validation",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)
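If you would rather keep the NumPy-based loader from the question, one likely cause of "Cast string to float is not supported" is that the labels are still Python strings when they reach the model. That is an assumption, since the model code isn't shown. A minimal sketch of mapping string labels to integer ids follows; encode_labels and the label names are made up for illustration.

```python
import numpy as np

def encode_labels(labels):
    """Map string labels to integer ids (hypothetical helper, not a TensorFlow API)."""
    class_names = sorted(set(labels))
    index = {name: i for i, name in enumerate(class_names)}
    ids = np.array([index[name] for name in labels], dtype=np.int64)
    return ids, class_names

# Made-up label names, for illustration only.
string_labels = ["actor", "use_case", "actor", "association"]
label_ids, class_names = encode_labels(string_labels)
print(class_names)
print(label_ids)
```

Integer ids like these work with a sparse categorical loss. image_dataset_from_directory does the equivalent mapping for you from the folder names.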

Related

How to split folders to 3 datasets with ImageDataGenerator?

The validation_split parameter lets ImageDataGenerator split the data read from a folder into 2 disjoint sets. Is there any way to create 3 sets (training, validation, and evaluation) using it?
I am thinking about splitting the dataset into 2 datasets, then splitting the 2nd dataset into another 2 datasets:
datagen = ImageDataGenerator(validation_split=0.5, rescale=1./255)
train_generator = datagen.flow_from_directory(
    TRAIN_DIR,
    subset='training'
)
val_generator = datagen.flow_from_directory(
    TRAIN_DIR,
    subset='validation'
)
Here I am thinking about splitting the validation dataset into 2 sets using val_generator. One for validation and the other for evaluation? How should I do it?

I like working with the flow_from_dataframe() method of ImageDataGenerator, where I interact with a simple Pandas DataFrame (perhaps containing other features), not with the directory. But you can easily change my code if you insist on flow_from_directory().
So this is my go-to function, e.g. for a regression task, where we try to predict a continuous y:
def get_generators(train_samp, test_samp, validation_split = 0.1):
    train_datagen = ImageDataGenerator(validation_split=validation_split, rescale = 1. / 255)
    test_datagen = ImageDataGenerator(rescale = 1. / 255)
    train_generator = train_datagen.flow_from_dataframe(
        dataframe = images_df[images_df.index.isin(train_samp)],
        directory = images_dir,
        x_col = 'img_file',
        y_col = 'y',
        target_size = (IMG_HEIGHT, IMG_WIDTH),
        class_mode = 'raw',
        batch_size = batch_size,
        shuffle = True,
        subset = 'training',
        validate_filenames = False
    )
    valid_generator = train_datagen.flow_from_dataframe(
        dataframe = images_df[images_df.index.isin(train_samp)],
        directory = images_dir,
        x_col = 'img_file',
        y_col = 'y',
        target_size = (IMG_HEIGHT, IMG_WIDTH),
        class_mode = 'raw',
        batch_size = batch_size,
        shuffle = False,
        subset = 'validation',
        validate_filenames = False
    )
    test_generator = test_datagen.flow_from_dataframe(
        dataframe = images_df[images_df.index.isin(test_samp)],
        directory = images_dir,
        x_col = 'img_file',
        y_col = 'y',
        target_size = (IMG_HEIGHT, IMG_WIDTH),
        class_mode = 'raw',
        batch_size = batch_size,
        shuffle = False,
        validate_filenames = False
    )
    return train_generator, valid_generator, test_generator
Things to notice:
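I use two ImageDataGenerator instances: one with validation_split for the training and validation generators, and one without it for the test generator.
The inputs to the function are the train/test indices (such as those returned by Sklearn's train_test_split), which are used to filter the DataFrame index.
The function also takes a validation_split parameter for the training generator.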
images_df is a DataFrame somewhere in global memory with proper columns like img_file and y.
No need to shuffle validation and test generators
This can be further generalized for multiple outputs, classification, what have you.
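For the train/test indices that get_generators() expects, here is a minimal NumPy-only sketch, assuming images_df has a default 0..n-1 integer index. split_indices is a hypothetical helper standing in for Sklearn's train_test_split, not part of any library.

```python
import numpy as np

def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Return disjoint (train, test) index arrays (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    # The first n_test shuffled positions become the test set, the rest training.
    return shuffled[n_test:], shuffled[:n_test]

# 100 is a made-up dataset size; in practice it would be len(images_df).
train_samp, test_samp = split_indices(100, test_fraction=0.2)
print(len(train_samp), len(test_samp))
```

These two arrays can then be passed as train_samp and test_samp. The validation set is carved out of the training indices by validation_split.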

I have mostly been splitting data 80/10/10 for training, validation and test respectively.
When working with Keras I favor the tf.data API, as it provides a good abstraction for complex input pipelines.
It does not provide a simple tf.data.Dataset.split functionality though.
I have this function (which I found in someone's code; I no longer have the source) that I consistently use:
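import tensorflow as tf

def get_dataset_partitions_tf(ds: tf.data.Dataset, ds_size, train_split=0.8, val_split=0.1, test_split=0.1, shuffle=True, shuffle_size=10000):
    # Compare with a tolerance: float fractions may not sum to exactly 1
    assert abs(train_split + test_split + val_split - 1) < 1e-9
    if shuffle:
        # Specify seed to always have the same split distribution between runs.
        # reshuffle_each_iteration=False keeps the order fixed across epochs,
        # so samples cannot move between the three splits.
        ds = ds.shuffle(shuffle_size, seed=12, reshuffle_each_iteration=False)
    train_size = int(train_split * ds_size)
    val_size = int(val_split * ds_size)
    train_ds = ds.take(train_size)
    val_ds = ds.skip(train_size).take(val_size)
    # everything after the train and validation elements
    test_ds = ds.skip(train_size).skip(val_size)
    return train_ds, val_ds, test_ds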
First read your dataset and get its size (with the cardinality method), then pass both into the function and you're good to go!
The function takes a flag to shuffle the original dataset before creating the splits, which is useful for getting more realistic validation and test metrics.
The seed for shuffling is fixed so that we can run the same function and the splits remain the same, which we want for consistent results.
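As a usage illustration, here is a minimal sketch on a toy dataset of 100 integers. split_dataset is a compact variant of the function above written for this example. It sets reshuffle_each_iteration=False, which is my assumption about what you want here: without it the shuffle order changes between iterations, so elements can drift between the splits.

```python
import tensorflow as tf

def split_dataset(ds, ds_size, train_split=0.8, val_split=0.1, seed=12):
    """Shuffle once, then cut into train/val/test with take() and skip()."""
    # Fixed seed plus reshuffle_each_iteration=False gives a stable order,
    # so the three parts stay disjoint every time they are iterated.
    ds = ds.shuffle(ds_size, seed=seed, reshuffle_each_iteration=False)
    train_size = int(train_split * ds_size)
    val_size = int(val_split * ds_size)
    train_ds = ds.take(train_size)
    val_ds = ds.skip(train_size).take(val_size)
    test_ds = ds.skip(train_size + val_size)
    return train_ds, val_ds, test_ds

# Toy stand-in for a real dataset.
ds = tf.data.Dataset.range(100)
ds_size = int(ds.cardinality().numpy())
train_ds, val_ds, test_ds = split_dataset(ds, ds_size)

train_ids = {int(x) for x in train_ds.as_numpy_iterator()}
val_ids = {int(x) for x in val_ds.as_numpy_iterator()}
test_ids = {int(x) for x in test_ds.as_numpy_iterator()}
print(len(train_ids), len(val_ids), len(test_ids))
```

Note that for a batched dataset (such as the one image_dataset_from_directory returns), cardinality() counts batches, not images, so the split is made at batch granularity.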

Extract data from tensorflow dataset (e.g. to numpy)

I'm loading images via
data = keras.preprocessing.image_dataset_from_directory(
    './data',
    labels='inferred',
    label_mode='binary',
    validation_split=0.2,
    subset="training",
    image_size=(img_height, img_width),
    batch_size=sz_batch,
    crop_to_aspect_ratio=True
)
I want to use the obtained data in non-tensorflow routines too. Therefore, I want to extract the data e.g. to numpy arrays. How can I achieve this? I can't use tfds

I would suggest unbatching your dataset and using tf.data.Dataset.map:
import pathlib
import numpy as np
import tensorflow as tf

dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
data_dir = pathlib.Path(data_dir)
batch_size = 32
train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(180, 180),
    batch_size=batch_size,
    shuffle=False)
# one (image, label) pair per element instead of one batch per element
train_ds = train_ds.unbatch()
images = np.asarray(list(train_ds.map(lambda x, y: x)))
labels = np.asarray(list(train_ds.map(lambda x, y: y)))
Or, as suggested in the comments, you could skip the unbatch() call, work with the batches and concatenate them afterwards:
# train_ds must still be batched here
images = np.concatenate(list(train_ds.map(lambda x, y: x)))
labels = np.concatenate(list(train_ds.map(lambda x, y: y)))
Or set shuffle=True and use tf.TensorArray:
images = tf.TensorArray(dtype=tf.float32, size=0, dynamic_size=True)
labels = tf.TensorArray(dtype=tf.int32, size=0, dynamic_size=True)
for x, y in train_ds.unbatch():
    images = images.write(images.size(), x)
    labels = labels.write(labels.size(), y)
images = tf.stack(images.stack(), axis=0)
labels = tf.stack(labels.stack(), axis=0)

Because tf.keras.utils.image_dataset_from_directory returns a Dataset object, you can use tf.data.Dataset.as_numpy_iterator. For example:
for elem in data.as_numpy_iterator():
    print(elem)
In the end, it's probably a better idea to keep using tf.data.Dataset because it's more efficient. See the tf.data documentation for more information.
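If you do need plain arrays, a minimal sketch of collecting the batches from as_numpy_iterator() could look like this. The dataset is a small synthetic stand-in (zeros and made-up labels) for what image_dataset_from_directory would return, and it assumes everything fits in memory.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for the (images, labels) batches from image_dataset_from_directory.
fake_images = np.zeros((10, 8, 8, 3), dtype=np.float32)
fake_labels = np.arange(10, dtype=np.int32)
data = tf.data.Dataset.from_tensor_slices((fake_images, fake_labels)).batch(4)

image_batches, label_batches = [], []
for batch_images, batch_labels in data.as_numpy_iterator():
    # Each element is a tuple of NumPy arrays holding one batch.
    image_batches.append(batch_images)
    label_batches.append(batch_labels)

# Join the batches along the sample axis; the last batch may be smaller.
images = np.concatenate(image_batches, axis=0)
labels = np.concatenate(label_batches, axis=0)
print(images.shape, labels.shape)
```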

How can I preprocess a tf.data.Dataset using a provided preprocess_input function that expects a tf.Tensor?

I'm having a bit of a clueless moment. I'm looking to apply transfer learning to a problem using ResNet50 pre-trained on ImageNet.
I've got the transfer learning process all ready to go, but I need my dataset in the right form, which tf.keras.applications.resnet50.preprocess_input handily produces. Except it works on a numpy.array or tf.Tensor, and I'm using image_dataset_from_directory to load the data, which gives me a tf.data.Dataset.
Is there a simple way to use the provided preprocess_input function to preprocess my data in this form?
Alternatively, the function specifies:
The images are converted from RGB to BGR, then each color channel is zero-centered with respect to the ImageNet dataset, without scaling.
So any other way to achieve this in the data pipeline or as part of the model would also be acceptable.

You could use the map function of tf.data.Dataset to apply the preprocess_input function to every batch of images:
import tensorflow as tf
import pathlib
import matplotlib.pyplot as plt

dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
data_dir = pathlib.Path(data_dir)
batch_size = 32
train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(180, 180),
    batch_size=batch_size)

def display(ds):
    images, _ = next(iter(ds.take(1)))
    image = images[0].numpy()
    image /= 255.0
    plt.imshow(image)

def preprocess(images, labels):
    return tf.keras.applications.resnet50.preprocess_input(images), labels

train_ds = train_ds.map(preprocess)
display(train_ds)
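If you want the same transformation without calling preprocess_input, here is a rough NumPy sketch of the steps quoted in the question (RGB to BGR, then per-channel zero-centering, no scaling). The mean values are the ImageNet BGR means that, as far as I know, Keras uses for ResNet50; treat them as an assumption and check them against your Keras version. resnet50_style_preprocess is a made-up name.

```python
import numpy as np

# ImageNet channel means in BGR order (assumed to match Keras' ResNet50 preprocessing).
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def resnet50_style_preprocess(rgb_images):
    """Convert RGB to BGR and zero-center each channel, without scaling."""
    images = np.asarray(rgb_images, dtype=np.float32)
    bgr = images[..., ::-1]          # reverse the channel axis: RGB -> BGR
    return bgr - IMAGENET_BGR_MEAN   # broadcast over batch, height and width

# A batch of two 4x4 images whose pixels all equal the ImageNet mean colour (in RGB order).
mean_rgb = np.array([123.68, 116.779, 103.939], dtype=np.float32)
batch = np.tile(mean_rgb, (2, 4, 4, 1))
out = resnet50_style_preprocess(batch)
print(out.shape, float(np.abs(out).max()))
```

Because the output is zero-centered and in BGR order, it is no longer a displayable image; dividing by 255 will not bring it back into the 0 to 1 range.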

Creating a dataset like the MNIST format, with different classes?

I am looking to build a classification model using my own dataset. But I'm having trouble formatting the dataset to be used. They are currently in subfolders, with each name being the class. I want to create my dataset like the format of the MNIST dataset, but I am unable to do so. For example, for MNIST, we can split the dataset:
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()
And then for example, I could flatten the data:
train_images = train_images.reshape((train_images.shape[0], -1))
test_images = test_images.reshape((test_images.shape[0], -1))
How would I replace tf.keras.datasets.mnist.load_data() with my own dataset but in the same format as the MNIST dataset? I am also doing multi-class classification.
Edit: Added Notes:
To be clear, my main task is to replace the MNIST dataset with my own dataset: My subdirectories are like this for example:
main_directory/
...class_a/
......a_image_1.jpg
......a_image_2.jpg
...class_b/
......b_image_1.jpg
......b_image_2.jpg
...class_c/
......c_image_1.jpg
......c_image_2.jpg
...class_d/
......d_image_1.jpg
......d_image_2.jpg
I tried following this link to make my dataset into a format that tf.keras could use to load the dataset, similar to the way the MNIST dataset is loaded. I have tried generating a tf.data.Dataset.
data_dir = "/datas"
data_dir = pathlib.Path(data_dir)
image_count = len(list(data_dir.glob('*/*.jpg')))
print(image_count)
cats = list(data_dir.glob('cats/*'))
PIL.Image.open(str(cats[0]))
batch_size = 32
img_height = 150
img_width = 150
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="validation",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)
But I am still unable to put it into the right format, to fix the code as described above. Something I saw was about conversion to a numpy array but I'm not sure how to do it.

Some packages have helper functions to make data access easy for users. As you can see in the documentation for the load_data() function, it returns a tuple of NumPy arrays:
(X_train, y_train), (X_test, y_test)
You can get the same structure by arranging your features (X) and target (y) as NumPy arrays and passing them to scikit-learn's train_test_split() function.
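A minimal sketch of that idea, using random arrays in place of real images (40 made-up 28x28 grayscale images in 4 classes). With real data, X and y would come from reading the class subfolders; the sizes and the stratify choice here are only assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for images loaded from class subfolders.
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(40, 28, 28), dtype=np.uint8)
y = np.repeat(np.arange(4), 10)  # 10 samples for each of 4 classes

# stratify=y keeps the class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Same nesting as tf.keras.datasets.mnist.load_data()
(train_images, train_labels), (test_images, test_labels) = (X_train, y_train), (X_test, y_test)

# Flatten exactly as in the question.
train_images = train_images.reshape((train_images.shape[0], -1))
test_images = test_images.reshape((test_images.shape[0], -1))
print(train_images.shape, test_images.shape)
```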

My training Data and labels have different numpy array shapes. It is disrupting my training

I have an image-based dataset that I am working with, and I am attempting to convert it to a NumPy array, which I would then use as input for a cGAN. I have tried several pieces of code and they all give me dimensionality issues. I'm not sure what to do.
import os
import pickle
import random
import cv2
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.utils import to_categorical

training_data = []
IMG_SIZE = 32
datadir = 'drive/My Drive/dummyDS'
CATEGORIES = ['HTC-1-M7', 'IPhone-4s', 'iPhone-6', 'LG-Nexus-5x',
              'Motorola-Droid-Max', 'Motorola-Nexus-6', 'Motorola-X',
              'Samsung-Galaxy-Note3', 'Samsung-Galaxy-S4', 'Sony-Nex-7']

def create_training_data():
    i=0
    for category in CATEGORIES:
        path=os.path.join(datadir,category)
        class_num = CATEGORIES.index(category)
        for img in os.listdir(path):
            img_array=cv2.imread(os.path.join(path,img))
            new_array=cv2.resize(img_array,(IMG_SIZE,IMG_SIZE))
            training_data.append([new_array,class_num])
    plt.imshow(img_array,cmap="gray")
    plt.imshow(new_array,cmap="gray")
    plt.show()

create_training_data()
X=[]
y=[]
random.shuffle(training_data)
for features,label in training_data:
    X.append(features)
    y.append(label)
X = np.array(X).reshape(-1, IMG_SIZE, IMG_SIZE, 3)
pickle_out = open("X.pickle","wb")
pickle.dump(X, pickle_out)
pickle_out.close()
y = np.array(y)
pickle_out = open("y.pickle","wb")
pickle.dump(y, pickle_out)
pickle_out.close()
y = to_categorical(y)
# saving the y_labels_one_hot array as a .npy file
np.save('y_labels_one_hot.npy', y)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=2./11)
X_train.shape = (32, 32, 32, 3) while y_train.shape = (32, 4, 2)
Now in training I am getting:
real_labels=to_categorical(Y_train[i*batch_size:(i+1)*batch_size].reshape(-1,1),num_classes=10)
d_loss_real = discriminator.train_on_batch(x=[X_batch, real_labels],
                                           y=real * (1 - smooth))
ValueError: All input arrays (x) should have the same number of samples. Got array shapes: [(32, 32, 32, 3), (256, 10)]

tf.keras.preprocessing.image.ImageDataGenerator.flow_from_directory should simplify your task.
It does almost everything your code does, in a simpler way, including splitting the data.
The code below demonstrates how to use it, with an explanation of each line:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale=1./255,        # Normalizes every pixel value
                                   validation_split=0.2)  # Setting Validation Data as 20% of Total Data
train_generator = train_datagen.flow_from_directory(
    datadir,                              # Traverses through all the Sub Folders (Category) inside this dir
    target_size=(img_height, img_width),  # Sets the Image Size
    batch_size=batch_size,                # Generates batches of `batch_size`
    class_mode='categorical',             # Will Consider Labels as Categorical
    shuffle=True,                         # Shuffles the Data
    subset='training')                    # Considers 80% as training data
# Since we don't have a separate directory for Validation Data and since we want the Total Data to be Partitioned, we should use "train_datagen"
validation_generator = train_datagen.flow_from_directory(
    datadir,                              # Should use the Same Dir as Training for Splitting
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='categorical',
    shuffle=True,                         # Shuffles the Data
    subset='validation')                  # Considers 20% as Validation data
# Then you can train the model using the code mentioned below
model.fit(
    train_generator,
    steps_per_epoch=train_generator.samples // batch_size,
    validation_data=validation_generator,
    validation_steps=validation_generator.samples // batch_size,
    epochs=nb_epochs)
I hope this resolves the mismatched shapes, since the generator yields features and labels with the same number of samples in every batch. Please share more information if this approach still results in an error.
Happy Learning!
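As for the ValueError itself, one possible reading (an assumption, since the full training loop isn't shown) is that the labels were already one-hot encoded by y = to_categorical(y) before the split, with only 8 of the 10 categories present. Reshaping such rows with reshape(-1, 1) turns 32 samples into 32 * 8 = 256 entries, which would match the (256, 10) shape in the error. A NumPy-only sketch of the effect, with one_hot as a small stand-in for to_categorical:

```python
import numpy as np

def one_hot(int_labels, num_classes):
    """Small NumPy stand-in for to_categorical (illustration only)."""
    return np.eye(num_classes, dtype=np.float32)[int_labels]

batch_size = 32
int_labels = np.arange(batch_size) % 8          # made-up labels covering 8 classes
encoded = one_hot(int_labels, num_classes=8)    # shape (32, 8), like y after the first encoding

# Reshaping one-hot rows into a column multiplies the apparent sample count.
flattened = encoded.reshape(-1, 1)
print(flattened.shape)

# Recover the integer ids first, then encode once with the full class count.
real_labels = one_hot(encoded.argmax(axis=1), num_classes=10)
print(real_labels.shape)
```

If that is what happened, slicing the integer labels (before any one-hot encoding) and encoding them once per batch keeps the label array at batch_size rows.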
