Extract data from tensorflow dataset (e.g. to numpy)

I'm loading images via
data = keras.preprocessing.image_dataset_from_directory(
    './data',
    labels='inferred',
    label_mode='binary',
    validation_split=0.2,
    subset="training",
    image_size=(img_height, img_width),
    batch_size=sz_batch,
    crop_to_aspect_ratio=True
)
I want to use the obtained data in non-TensorFlow routines too. Therefore, I want to extract the data, e.g. to numpy arrays. How can I achieve this? I can't use tfds.

I would suggest unbatching your dataset and using tf.data.Dataset.map:
import pathlib
import numpy as np
import tensorflow as tf
dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
data_dir = pathlib.Path(data_dir)
batch_size = 32
train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(180, 180),
    batch_size=batch_size,
    shuffle=False)
# Split the batches back into single (image, label) elements
train_ds = train_ds.unbatch()
images = np.asarray(list(train_ds.map(lambda x, y: x)))
labels = np.asarray(list(train_ds.map(lambda x, y: y)))
Or, as suggested in the comments, you could work with the batches directly and concatenate them afterwards. This applies to the batched dataset, so skip the unbatch() call above:
images = np.concatenate(list(train_ds.map(lambda x, y: x)))
labels = np.concatenate(list(train_ds.map(lambda x, y: y)))
Or, if you set shuffle=True, iterate over the dataset only once so that images and labels stay aligned, for example with tf.TensorArray (again starting from the batched train_ds):
images = tf.TensorArray(dtype=tf.float32, size=0, dynamic_size=True)
labels = tf.TensorArray(dtype=tf.int32, size=0, dynamic_size=True)
for x, y in train_ds.unbatch():
    images = images.write(images.size(), x)
    labels = labels.write(labels.size(), y)
images = tf.stack(images.stack(), axis=0)
labels = tf.stack(labels.stack(), axis=0)

Because tf.keras.utils.image_dataset_from_directory returns a Dataset object, you can use tf.data.Dataset.as_numpy_iterator. For example:
for elem in data.as_numpy_iterator():
    print(elem)
In the end, it's probably a better idea to keep using tf.data.Dataset because it's more efficient. You can find more information here.
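As a minimal sketch of that approach, the batches yielded by as_numpy_iterator can be concatenated into two plain NumPy arrays. The small synthetic dataset below is an assumption standing in for the one returned by image_dataset_from_directory; only the last few statements matter for real data.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for the dataset returned by image_dataset_from_directory:
# 10 tiny "images" with binary labels, batched like the real thing.
fake_images = np.random.rand(10, 4, 4, 3).astype("float32")
fake_labels = np.arange(10, dtype="int32") % 2
data = tf.data.Dataset.from_tensor_slices((fake_images, fake_labels)).batch(4)

# Iterate once so images and labels stay aligned even if the dataset shuffles.
batches = list(data.as_numpy_iterator())
images = np.concatenate([x for x, _ in batches], axis=0)
labels = np.concatenate([y for _, y in batches], axis=0)

print(images.shape, labels.shape)
```

Collecting both arrays in a single pass avoids the misalignment that two separate passes over a shuffled dataset could produce.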

Related

Too much RAM is required for loading dataset

I’m working on a neural network and my dataset has 42000 images, all of which I have to load. I’m using Google Colab for that, but every time I load the dataset the RAM is insufficient.
I am putting everything in a numpy array, because I tried to use the ImageGenerator method and it didn’t work. I’m using the following code to load the data:
# "class" is a reserved word in Python, so the list of files is named class_files
class_files = glob.glob(r"/content/drive/MyDrive/DATASET/class/*.*")
data = []
labels = []
for i in class_files:
    image = tf.keras.preprocessing.image.load_img(i, color_mode='rgb',
                                                  target_size=(336, 336))
    image = np.array(image)
    data.append(image)
    labels.append(0)
data = np.array(data)
labels = np.array(labels)

As ImageDataGenerator is deprecated, you can use a custom Keras Sequence class to load images only when they are needed.
The strategy is to create a Pandas DataFrame with the path and class of every image, then convert the class to a numeric label with pd.factorize. Once you have X (paths) and y (labels), you can use train_test_split to extract three subsets: train, validation and test. The last step is to convert these collections into datasets compatible with TensorFlow.
Each time TensorFlow processes a batch, the Sequence loads only that batch of images into memory.
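To see why loading everything at once exhausts the RAM, here is a back-of-the-envelope calculation using the numbers from the question (42000 RGB images at 336x336). The batch size of 32 is the one used later in this answer. The whole dataset needs roughly 13 GiB as uint8 and about four times that as float32, while a single float32 batch needs about 41 MiB.

```python
# Memory needed to hold every image at once, using the numbers from the question.
num_images = 42_000
height, width, channels = 336, 336, 3

bytes_per_image_uint8 = height * width * channels   # np.array(PIL image) is uint8
total_uint8 = num_images * bytes_per_image_uint8
total_float32 = total_uint8 * 4                     # after conversion to float32

batch_size = 32
one_batch_float32 = batch_size * bytes_per_image_uint8 * 4

gib = 1024 ** 3
print(f"all images, uint8:   {total_uint8 / gib:.1f} GiB")
print(f"all images, float32: {total_float32 / gib:.1f} GiB")
print(f"one batch, float32:  {one_batch_float32 / 1024 ** 2:.1f} MiB")
```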
Step 0: Imports and constants
import tensorflow as tf
import pandas as pd
import numpy as np
import pathlib
from sklearn.model_selection import train_test_split
INPUT_SHAPE = (336, 336, 3)
BATCH_SIZE = 32
DATA_DIR = pathlib.Path('/content/drive/MyDrive/DATASET/')
Step 1: Load all image paths to a Pandas DataFrame:
# Find images of dataset
data = []
for file in DATA_DIR.glob('**/*.jpg'):
    d = {'class': file.parent.name,
         'path': file}
    data.append(d)
# Create dataframe and select columns
df = pd.DataFrame(data)
df['label'] = pd.factorize(df['class'])[0]
X = df['path']
y = df['label']
# Split into 3 datasets: train, validation and test
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.2, random_state=2023)
X_train, X_valid, y_train, y_valid = \
    train_test_split(X_train, y_train, test_size=0.2, random_state=2023)
Step 2: Create a custom data Sequence
class ImgDataSequence(tf.keras.utils.Sequence):
    """
    Check documentation here: https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence
    """
    def __init__(self, image_set, label_set, batch_size=32, image_size=(256, 256)):
        super().__init__()  # expected by newer Keras versions, harmless in older ones
        self.image_set = np.array(image_set)
        self.label_set = np.array(label_set)
        self.batch_size = batch_size
        self.image_size = image_size
    def __get_image(self, image):
        # Load one image from disk and convert it to a float array
        image = tf.keras.preprocessing.image.load_img(image, color_mode='rgb', target_size=self.image_size)
        image_arr = tf.keras.preprocessing.image.img_to_array(image)
        return image_arr
    def __get_data(self, images, labels):
        image_batch = np.asarray([self.__get_image(img) for img in images])
        label_batch = np.asarray(labels)
        return image_batch, label_batch
    def __getitem__(self, index):
        # Slice out the paths and labels of batch number `index`, then load the images
        images = self.image_set[index * self.batch_size:(index + 1) * self.batch_size]
        labels = self.label_set[index * self.batch_size:(index + 1) * self.batch_size]
        images, labels = self.__get_data(images, labels)
        return images, labels
    def __len__(self):
        # Number of batches, counting a final partial batch
        return len(self.image_set) // self.batch_size + (len(self.image_set) % self.batch_size > 0)
Step 3: Create datasets
train_ds = ImgDataSequence(X_train, y_train, image_size=INPUT_SHAPE[:2], batch_size=BATCH_SIZE)
valid_ds = ImgDataSequence(X_valid, y_valid, image_size=INPUT_SHAPE[:2], batch_size=BATCH_SIZE)
test_ds = ImgDataSequence(X_test, y_test, image_size=INPUT_SHAPE[:2], batch_size=BATCH_SIZE)
Test the new datasets:
# Take the first batch of our train dataset
>>> imgs, labels = train_ds[0]
# Check the length (BATCH_SIZE)
>>> len(labels)
32
# Check the dimension of one image
>>> imgs[0].shape
(336, 336, 3)
How to use it with Tensorflow?
# train_ds & valid_ds to fit
history = model.fit(train_ds, epochs=10, validation_data=valid_ds)
# test_ds to evaluate
loss, *metrics = model.evaluate(test_ds)
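The batch arithmetic used by __len__ and __getitem__ in the Sequence above can be checked without TensorFlow. This is a minimal sketch; num_batches and batch_bounds are hypothetical helper names that only mirror the expressions in the class.

```python
import math

def num_batches(n_samples, batch_size):
    # Same expression as ImgDataSequence.__len__: full batches plus one partial batch
    return n_samples // batch_size + (n_samples % batch_size > 0)

def batch_bounds(index, n_samples, batch_size):
    # Same slice as ImgDataSequence.__getitem__, clipped to the data length
    start = index * batch_size
    stop = min((index + 1) * batch_size, n_samples)
    return start, stop

n_samples, batch_size = 100, 32
n = num_batches(n_samples, batch_size)
print(n)                                            # 4
print(batch_bounds(n - 1, n_samples, batch_size))   # (96, 100)
print(n == math.ceil(n_samples / batch_size))       # True
```

The last batch is shorter than the others (4 samples here), which Keras handles without any extra code.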

How to split folders to 3 datasets with ImageDataGenerator?

The validation_split parameter lets ImageDataGenerator split the data read from a folder into 2 disjoint sets. Is there any way to create 3 sets (training, validation, and evaluation) using it?
I am thinking about splitting the dataset into 2 datasets, then splitting the 2nd dataset into another 2 datasets:
datagen = ImageDataGenerator(validation_split=0.5, rescale=1./255)
train_generator = datagen.flow_from_directory(
    TRAIN_DIR,
    subset='training'
)
val_generator = datagen.flow_from_directory(
    TRAIN_DIR,
    subset='validation'
)
Here I am thinking about splitting the validation dataset into 2 sets using val_generator: one for validation and the other for evaluation. How should I do it?

I like working with the flow_from_dataframe() method of ImageDataGenerator, where I interact with a simple Pandas DataFrame (perhaps containing other features), not with the directory. But you can easily change my code if you insist on flow_from_directory().
This is my go-to function, e.g. for a regression task, where we try to predict a continuous y:
def get_generators(train_samp, test_samp, validation_split=0.1):
    train_datagen = ImageDataGenerator(validation_split=validation_split, rescale=1. / 255)
    test_datagen = ImageDataGenerator(rescale=1. / 255)
    train_generator = train_datagen.flow_from_dataframe(
        dataframe=images_df[images_df.index.isin(train_samp)],
        directory=images_dir,
        x_col='img_file',
        y_col='y',
        target_size=(IMG_HEIGHT, IMG_WIDTH),
        class_mode='raw',
        batch_size=batch_size,
        shuffle=True,
        subset='training',
        validate_filenames=False
    )
    valid_generator = train_datagen.flow_from_dataframe(
        dataframe=images_df[images_df.index.isin(train_samp)],
        directory=images_dir,
        x_col='img_file',
        y_col='y',
        target_size=(IMG_HEIGHT, IMG_WIDTH),
        class_mode='raw',
        batch_size=batch_size,
        shuffle=False,
        subset='validation',
        validate_filenames=False
    )
    test_generator = test_datagen.flow_from_dataframe(
        dataframe=images_df[images_df.index.isin(test_samp)],
        directory=images_dir,
        x_col='img_file',
        y_col='y',
        target_size=(IMG_HEIGHT, IMG_WIDTH),
        class_mode='raw',
        batch_size=batch_size,
        shuffle=False,
        validate_filenames=False
    )
    return train_generator, valid_generator, test_generator
Things to notice:
I use two ImageDataGenerator instances: one with validation_split for the training and validation generators, and one for the test generator.
The inputs to the function are the train/test indices (such as those returned by Sklearn's train_test_split), which are used to filter the DataFrame index.
The function also takes a validation_split parameter for the training generator.
images_df is a DataFrame somewhere in global memory with proper columns like img_file and y.
There is no need to shuffle the validation and test generators.
This can be further generalized for multiple outputs, classification, what have you.
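The train_samp and test_samp indices can come from any index split, and validation_split then carves the validation set out of the training part, which gives the three sets the question asks for. Here is a minimal NumPy-only sketch of producing such indices; split_indices is a hypothetical helper, not part of Keras or scikit-learn.

```python
import numpy as np

def split_indices(n_samples, test_fraction=0.1, seed=0):
    """Return disjoint (train, test) index arrays covering range(n_samples)."""
    rng = np.random.default_rng(seed)
    permuted = rng.permutation(n_samples)
    n_test = int(round(test_fraction * n_samples))
    # The first n_test shuffled indices form the test set, the rest the training set
    return np.sort(permuted[n_test:]), np.sort(permuted[:n_test])

train_samp, test_samp = split_indices(1000, test_fraction=0.1)
print(len(train_samp), len(test_samp))   # 900 100
```

These arrays could then be passed as get_generators(train_samp, test_samp, validation_split=0.1), assuming images_df uses a default integer index.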

I have mostly been splitting data 80/10/10 for training, validation and test respectively.
When working with Keras I favor the tf.data API, as it provides a good abstraction for complex input pipelines.
It does not provide a simple tf.data.Dataset.split functionality, though.
I consistently use this function (found in someone else's code; I have lost the source):
def get_dataset_partitions_tf(ds: tf.data.Dataset, ds_size, train_split=0.8, val_split=0.1, test_split=0.1, shuffle=True, shuffle_size=10000):
    # Compare with a tolerance: float sums such as 0.7 + 0.2 + 0.1 are not exactly 1
    assert abs(train_split + val_split + test_split - 1) < 1e-9
    if shuffle:
        # Fix the seed and disable per-iteration reshuffling so that the three
        # splits stay disjoint and identical between runs
        ds = ds.shuffle(shuffle_size, seed=12, reshuffle_each_iteration=False)
    train_size = int(train_split * ds_size)
    val_size = int(val_split * ds_size)
    train_ds = ds.take(train_size)
    val_ds = ds.skip(train_size).take(val_size)
    test_ds = ds.skip(train_size).skip(val_size)
    return train_ds, val_ds, test_ds
First read your dataset and get its size (with the cardinality method), then pass both to the function and you're good to go!
The function takes a flag to shuffle the original dataset before creating the splits, which is useful for getting more realistic validation and test metrics.
The seed for shuffling is fixed so that we can run the same function and the splits remain the same, which we want for consistent results.
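One caveat with take/skip on a shuffled dataset: shuffle reshuffles on every iteration by default, so the three splits can overlap unless reshuffle_each_iteration=False is passed. Here is a minimal sketch that checks the splits are disjoint, using a toy range dataset as an assumption in place of real data.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(100)
# reshuffle_each_iteration=False keeps the order identical every time the
# dataset is iterated, so take/skip always cut the same sequence.
ds = ds.shuffle(100, seed=12, reshuffle_each_iteration=False)

train_ds = ds.take(80)
val_ds = ds.skip(80).take(10)
test_ds = ds.skip(90)

train_ids = {int(i) for i in train_ds.as_numpy_iterator()}
val_ids = {int(i) for i in val_ds.as_numpy_iterator()}
test_ids = {int(i) for i in test_ds.as_numpy_iterator()}

print(len(train_ids), len(val_ids), len(test_ids))
print(train_ids.isdisjoint(val_ids),
      train_ids.isdisjoint(test_ids),
      val_ids.isdisjoint(test_ids))
```

If the training data should still be reshuffled every epoch, a separate shuffle can be applied to train_ds after the split.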

TensorFlow: how to create training and testing image datasets

I've been searching the internet for a long time, trying to create, train and test my own TensorFlow model, but without success. From investigating my code, I think the problem is how I create my dataset of images. Most online tutorials just import a prepared dataset, but my dataset is specifically for use-case diagrams and holds each element within those diagrams. My aim is to train a TensorFlow model to predict each element in a diagram, and hopefully the errors too. Here's the code:
def createDataSet(labelList, label, filePath, width, height):
    dataList = []
    for img in os.listdir(filePath):
        filename = str(img)
        if filename[len(filename) - 3:len(filename)] != "npy":
            pic = cv.imread(os.path.join(filePath, img))
            pic = cv.cvtColor(pic, cv.COLOR_BGR2RGB)
            pic = cv.resize(pic, (width, height))
            dataList.append(pic)
            labelList.append(label)
    return dataList, labelList
# appending the pics to the training data list
training_dataset, train_labels = createDataSet(train_labels, train_label, path, width, height)
test_dataset, test_labels = createDataSet(test_labels, test_label, path2, width, height)
# converting the list to numpy array and saving it to a file using numpy.save
np.save(os.path.join(path, train_label), np.array(training_dataset))
np.save(os.path.join(path2, test_label), np.array(test_dataset))
# loading the saved file once again
train_images = np.array(training_dataset)
test_images = np.array(test_dataset)
As of now, the function creates a list which will be saved as a numpy array, and that numpy array will be used for my model. But it causes errors like UNIMPLEMENTED: Cast string to float is not supported.
I'm sure that I'm creating the train and test data, and the labels for both, incorrectly.

You can use the tf.keras.utils.image_dataset_from_directory function to read images from a directory and split the data into training and validation sets. The expected directory layout is shown first, followed by sample code.
main_directory/
...class_a/
......a_image_1.jpg
......a_image_2.jpg
...class_b/
......b_image_1.jpg
......b_image_2.jpg
import tensorflow as tf
data_dir = "main_directory"  # root folder from the layout above
batch_size = 32
img_height = 224
img_width = 224
train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)
# Class names are inferred from the sub-folder names
class_names = train_ds.class_names
# Use the same seed and validation_split so the two subsets do not overlap
val_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="validation",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)
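The "Cast string to float is not supported" error in the question is consistent with string labels reaching the model, since createDataSet appends the label string itself to labelList. image_dataset_from_directory avoids this by deriving integer labels from the folder names. If you keep the NumPy route instead, one way to obtain integer labels is sketched below; the label names are made-up examples.

```python
import numpy as np

# Hypothetical string labels, one per image, as collected by a loader like createDataSet
string_labels = np.array(["actor", "use_case", "actor", "association"])

# class_names is sorted alphabetically; label_ids indexes into it
class_names, label_ids = np.unique(string_labels, return_inverse=True)

print(class_names)  # ['actor' 'association' 'use_case']
print(label_ids)    # [0 2 0 1]
```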

Reading images without rigid folder structure

I am using TensorFlow 2 (TensorFlow 2.2 in particular).
The code below allows us to read in images from folders:
train_datagen = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)
test_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(
    'data/train',
    target_size=(150, 150),
    batch_size=32,
    class_mode='binary')
but it requires us to rigidly structure the folders according to the classes to be classified, say cat and dog, as
data/train/cat and data/train/dog.
Say now, we have all the training images in the folder data/train/ (say data/train/1.jpg etc.) and I have train_set X and label y in the following:
X=['1.jpg','2.jpg',...]
y=[0,1,...]
where 0 denotes dog and 1 denotes cat for y, and I want to achieve the same effect as the code above (e.g., image aug. like horizontal flipping etc. + with batchsize specified), how should I do that?
An approach I have tried:
I use the following code
def preprocess(image):
    img_shape = np.array(image).shape
    image = tf.cast(np.array(image), tf.float32)
    image = (image / 127.5) - 1
    return image
image_path = pathlib.Path("train", "data")
class_names = [x.name.lower() for x in image_path.glob('*') if x.is_dir()]
X = []
y = []
for path in image_path.glob('**/*'):
    if path.is_file():
        if path.name.lower().endswith(('.png', '.jpg', '.jpeg', '.tiff', '.bmp', '.gif')):
            X.append(preprocess(Image.open(path).resize((224, 224), resample=Image.BICUBIC)))
            y.append(class_names.index(path.parent.name.lower()))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1 - train_ratio, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test,
                                                test_size=test_ratio / (test_ratio + validation_ratio),
                                                stratify=y_test)
train_data = tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(batch_size)
validation_data = tf.data.Dataset.from_tensor_slices((X_val, y_val)).batch(batch_size)
test_data = tf.data.Dataset.from_tensor_slices((X_test, y_test)).batch(batch_size)
I get an out-of-memory error (as I store all images in X). How should I resolve that?

For this I strongly recommend using tf.data.Dataset to read and ingest your data.
In fact, it is the officially recommended way to build the ETL (extract, transform, load) process in TensorFlow.
You can have a look here: https://www.tensorflow.org/api_docs/python/tf/data/Dataset
For example, in your particular case (it will make more sense once you have read the documentation), you could use a .map() function in which you retrieve or generate the label 0 or 1 depending on the string in the description of your image.
Or you could implement it the way you described above, using tf.data.Dataset.from_tensor_slices(), but on file names rather than on the loaded images.
In addition, you can use another mapping function for augmentation; the available image preprocessing techniques are listed here: https://www.tensorflow.org/api_docs/python/tf/image
From my own work (adapted from a tutorial some time ago), here is an example:
import os
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
BATCH_SIZE = 32  # assumed value; not given in the original
def load_filenames(csv_data, datapath):
    filenames = [os.path.join(datapath, filename) for filename in csv_data['id'].tolist()]
    return filenames
def load_labels(csv_data):
    return csv_data['has_cactus'].tolist()
# Debug helper (not used below); .numpy() only works in eager mode, not inside Dataset.map
def parse_fn(filename, label):
    filename = filename.numpy().decode('utf-8')
    print(filename)
    return filename, label
# Read one image from disk, scale it to [-1, 1] and resize it
def process_function(filename, label):
    img = tf.io.read_file(filename)
    img = tf.image.decode_jpeg(img)
    img = (tf.cast(img, tf.float32) / 127.5) - 1
    img = tf.image.resize(img, (96, 96))
    return img, label
train_csv = pd.read_csv(filepath_or_buffer='data/aerial-cactus-identification/train.csv')
filenames = load_filenames(csv_data=train_csv, datapath='data/aerial-cactus-identification/train')
labels = load_labels(csv_data=train_csv)
train_filenames, val_filenames, train_labels, val_labels = train_test_split(filenames,
                                                                            labels,
                                                                            train_size=0.9,
                                                                            random_state=42)
num_train = len(train_filenames)
num_val = len(val_filenames)
# Only file names and labels are held in memory; images are loaded batch by batch
train_data = tf.data.Dataset.from_tensor_slices(
    (tf.constant(train_filenames), tf.constant(train_labels))
)
val_data = tf.data.Dataset.from_tensor_slices(
    (tf.constant(val_filenames), tf.constant(val_labels))
)
train_data = (train_data.map(process_function)
              .shuffle(buffer_size=num_train)
              .batch(BATCH_SIZE)
              .prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
              )
val_data = (val_data.map(process_function)
            .shuffle(buffer_size=num_val)
            .batch(BATCH_SIZE)
            .prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
            )
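For the augmentation part of the question, a second map call can apply random transformations from tf.image to the training data only. This is a minimal sketch, assuming images are already decoded and scaled to [-1, 1] as in process_function. The augment function and the synthetic data are illustrative assumptions, and only flipping and brightness are shown, not the shear or zoom from the question.

```python
import numpy as np
import tensorflow as tf

def augment(img, label):
    # Random horizontal flip plus a small brightness change; labels pass through unchanged
    img = tf.image.random_flip_left_right(img)
    img = tf.image.random_brightness(img, max_delta=0.1)
    img = tf.clip_by_value(img, -1.0, 1.0)
    return img, label

# Synthetic stand-in for the decoded training images
images = np.random.uniform(-1, 1, size=(8, 96, 96, 3)).astype("float32")
labels = np.array([0, 1] * 4, dtype="int32")

train_data = (tf.data.Dataset.from_tensor_slices((images, labels))
              .map(augment)
              .batch(4))

for batch_images, batch_labels in train_data.take(1):
    print(batch_images.shape, batch_labels.shape)
```

Because the transformations run inside the pipeline, each epoch sees differently augmented images without any extra copies being stored in memory.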

My training Data and labels have different numpy array shapes. It is disrupting my training

I have an image-based database that I am trying to convert to a numpy array, which I would then use as input for a cGAN. I have tried several code snippets and they all give me dimensionality issues. I'm not sure what to do.
training_data = []
IMG_SIZE = 32
datadir = 'drive/My Drive/dummyDS'
CATEGORIES = ['HTC-1-M7', 'IPhone-4s', 'iPhone-6', 'LG-Nexus-5x',
'Motorola-Droid-Max', 'Motorola-Nexus-6', 'Motorola-X',
'Samsung-Galaxy-Note3', 'Samsung-Galaxy-S4', 'Sony-Nex-7']
def create_training_data():
    i = 0
    for category in CATEGORIES:
        path = os.path.join(datadir, category)
        class_num = CATEGORIES.index(category)
        for img in os.listdir(path):
            img_array = cv2.imread(os.path.join(path, img))
            new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
            training_data.append([new_array, class_num])
    plt.imshow(img_array, cmap="gray")
    plt.imshow(new_array, cmap="gray")
    plt.show()
create_training_data()
X = []
y = []
random.shuffle(training_data)
for features, label in training_data:
    X.append(features)
    y.append(label)
X = np.array(X).reshape(-1, IMG_SIZE, IMG_SIZE, 3)
pickle_out = open("X.pickle","wb")
pickle.dump(X, pickle_out)
pickle_out.close()
y = np.array(y)
pickle_out = open("y.pickle","wb")
pickle.dump(y, pickle_out)
pickle_out.close()
y = to_categorical(y)
# saving the y_labels_one_hot array as a .npy file
np.save('y_labels_one_hot.npy', y)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=2./11)
This gives X_train.shape = (32, 32, 32, 3), while y_train.shape = (32, 4, 2).
Now, during training, the code below fails with the ValueError shown after it:
real_labels=to_categorical(Y_train[i*batch_size:(i+1)*batch_size].reshape(-1,1),num_classes=10)
d_loss_real = discriminator.train_on_batch(x=[X_batch, real_labels],
y=real * (1 - smooth))
ValueError: All input arrays (x) should have the same number of samples. Got array shapes: [(32, 32, 32, 3), (256, 10)]

tf.keras.preprocessing.image.ImageDataGenerator.flow_from_directory should simplify your task.
It does almost everything your code does in a simpler way, including splitting the data.
The code below demonstrates how to use it, with an explanation of each line:
train_datagen = ImageDataGenerator(rescale=1./255,        # Normalizes every pixel value
                                   validation_split=0.2)  # Sets validation data to 20% of the total data
train_generator = train_datagen.flow_from_directory(
    datadir,                              # Traverses all the sub-folders (categories) inside this dir
    target_size=(img_height, img_width),  # Sets the image size
    batch_size=batch_size,                # Generates batches of `batch_size`
    class_mode='categorical',             # Treats labels as categorical
    shuffle=True,                         # Shuffles the data
    subset='training')                    # Uses 80% as training data
# Since we don't have a separate directory for validation data and we want the
# total data to be partitioned, we should use "train_datagen" here as well
validation_generator = train_datagen.flow_from_directory(
    datadir,                              # Must be the same dir as training for splitting
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='categorical',
    shuffle=True,                         # Shuffles the data
    subset='validation')                  # Uses 20% as validation data
# Then you can train the model using the code below
model.fit(
    train_generator,
    steps_per_epoch=train_generator.samples // batch_size,
    validation_data=validation_generator,
    validation_steps=validation_generator.samples // batch_size,
    epochs=nb_epochs)
Hopefully this resolves your issue with mismatched shapes, since the generator yields features and labels with the same number of samples in every batch. Please share more information if this approach results in an error.
Happy Learning!
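The shapes in the error message suggest the labels were one-hot encoded twice: the question already calls to_categorical(y) before the split, then reshapes and encodes each batch again. Here is a NumPy-only sketch of why that multiplies the sample count, using the shapes reported in the question; one_hot is a hypothetical helper standing in for to_categorical.

```python
import numpy as np

def one_hot(int_labels, num_classes):
    # Minimal stand-in for to_categorical on a 1-D integer label array
    return np.eye(num_classes, dtype="float32")[int_labels]

batch_size, num_classes = 32, 10

# Encoding integer class ids once gives one row per sample
int_labels = np.random.randint(0, num_classes, size=batch_size)
print(one_hot(int_labels, num_classes).shape)        # (32, 10)

# A label batch shaped (32, 4, 2), as reported for y_train, flattens to 256 rows
already_encoded = np.zeros((batch_size, 4, 2), dtype="int64")
flattened = already_encoded.reshape(-1, 1)
print(one_hot(flattened[:, 0], num_classes).shape)   # (256, 10)
```

Keeping y as a 1-D integer array until the single encoding step keeps the label count equal to the image count.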
