I want to read images from multiple directories. I used Keras' ImageDataGenerator for this. After applying a preprocessing function, I now want to load the entire dataset into a single list. I tried the following code.
import numpy as np
from keras.preprocessing.image import ImageDataGenerator

def preProcess(X):
    X = X.astype('float32')
    X = (X - 127.5) / 127.5
    return X

batch_sz = 32
path = "/content/drive/My Drive/new_net/mini_unet_data/labeled/"

test_datagen = ImageDataGenerator(preprocessing_function=preProcess)
test_gen = test_datagen.flow_from_directory(directory=path, target_size=(256, 256),
                                            batch_size=batch_sz, color_mode="rgb",
                                            class_mode="sparse", shuffle=True, seed=42)

test_images = []
test_labels = []
for i in range(int(test_gen.n / batch_sz)):
    tmp1, tmp2 = test_gen.next()
    test_images.append(tmp1)
    test_labels.append(tmp2)

test_images = np.asarray(test_images)
test_labels = np.asarray(test_labels)
test_images = np.reshape(test_images, (test_images.shape[0] * test_images.shape[1],
                                       test_images.shape[2], test_images.shape[3],
                                       test_images.shape[4]))
test_labels = np.squeeze(np.reshape(test_labels, (test_labels.shape[0] * test_labels.shape[1],)))
The above code works without errors, but it takes more than 10 minutes to load around 1500 images. Is there a better way to achieve this faster? I also tried glob and OpenCV calls, but that was too slow as well.
Thank you.
You can try two things to speed up your process each time you run the evaluation code:
Convert the images to .npy format: read each image with cv2, convert it from BGR to RGB, and save the array. Decoding the JPEGs is the expensive part, and this way you only pay for it once.
You can generate the .npy array of the whole test set and then use the predict/evaluate generator on it.
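A rough sketch of the conversion, assuming the same folder layout and preprocessing as in your code (the file name test_images.npy is just an example):

import glob
import cv2
import numpy as np

# One-off conversion: decode every image once, fix the channel order, save as .npy
paths = sorted(glob.glob(path + "*/*.jpg"))       # assumes one subfolder per class
images = np.zeros((len(paths), 256, 256, 3), dtype=np.float32)
for i, p in enumerate(paths):
    img = cv2.imread(p)                           # OpenCV loads BGR
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)    # convert to RGB
    images[i] = cv2.resize(img, (256, 256))
images = (images - 127.5) / 127.5                 # same scaling as preProcess
np.save("test_images.npy", images)

# On every later run, loading the array is much faster than re-decoding JPEGs
test_images = np.load("test_images.npy")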
I have a dataset of medical images (.dcm) which I can read into TensorFlow as a batch. However, the problem I am facing is that the labels of these images are in a .csv file. The .csv file contains two columns: image_path (the location of the image) and image_labels (0 for no, 1 for yes). I wanted to know how I can read the labels into a TensorFlow dataset batch-wise. I am using the following code to load the images batch-wise:
import tensorflow as tf
import tensorflow_io as tfio

def process_image(filename):
    image_bytes = tf.io.read_file(filename)
    image = tf.squeeze(
        tfio.image.decode_dicom_image(image_bytes, on_error='strict', dtype=tf.uint16),
        axis=0
    )
    x = tfio.image.decode_dicom_data(image_bytes, tfio.image.dicom_tags.PhotometricInterpretation)
    image = (image - tf.reduce_min(image)) / (tf.reduce_max(image) - tf.reduce_min(image))
    if x == "MONOCHROME1":
        image = 1 - image
    image = image * 255
    image = tf.cast(tf.image.resize(image, (512, 512)), tf.uint8)
    return image

# train_images is a list containing the locations of .dcm images
dataset = tf.data.Dataset.from_tensor_slices(train_images)
dataset = dataset.map(process_image, num_parallel_calls=4).batch(50)
Hence, I can load the images into the TensorFlow dataset. But I would like to know how I can load the image labels batch-wise.
Something like this instead of the last two lines should work:
# train_labels is a list of labels for each image, in the same order as train_images
dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels))
dataset = dataset.map(lambda x, y: (process_image(x), y), num_parallel_calls=4).batch(50)
Now the dataset can be passed to your network's .fit(), .predict(), and other methods:
model.fit(dataset, epochs=epochs, callbacks=callbacks)
Alternatively, you can create a second dataset containing the labels and then combine the two datasets with tf.data.Dataset.zip(). It works similarly to Python's built-in zip.
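A rough sketch of that variant (untested, reusing train_images and train_labels from above):

image_ds = tf.data.Dataset.from_tensor_slices(train_images).map(process_image, num_parallel_calls=4)
label_ds = tf.data.Dataset.from_tensor_slices(train_labels)
dataset = tf.data.Dataset.zip((image_ds, label_ds)).batch(50)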
I prefer the first method since it feels a bit cleaner to me, plus I can, for example, shuffle the filenames/labels first and only then parse the files, instead of doing it the other way around.
I am training a face-recognition model, so for the triplet loss I have to generate batches such that each batch contains a fixed number of images from each label. For example, take 8 images from 3 random labels each time a batch is generated for training, as suggested in this GitHub issue.
In my dataset folder I have subfolders, each renamed to a label and containing the images for that label.
In the given issue, the following solution is presented:
import os
import numpy as np
import cv2
import tensorflow as tf

num_labels = len(path_list)
num_classes_per_batch = 3
num_images_per_class = 8

image_dirs = ["/content/drive/My Drive/smalld_processed/train/{:d}".format(i)
              for i in range(num_labels)]

## Create the list of datasets creating filenames
#datasets = [tf.data.Dataset.list_files(f"{image_dir}/*.jpg" for image_dir in image_dirs)]
datasets = [tf.data.Dataset.list_files(f"{image_dir}/*.jpg") for image_dir in image_dirs]
adk = ["{}/*.jpg".format(image_dir) for image_dir in image_dirs]
print(adk)

def generator():
    while True:
        # Sample the labels that will compose the batch
        labels = np.random.choice(range(num_labels),
                                  num_classes_per_batch,
                                  replace=False)
        for label in labels:
            for _ in range(num_images_per_class):
                yield label

choice_dataset = tf.data.Dataset.from_generator(generator, tf.int64)
dataset = tf.data.experimental.choose_from_datasets(datasets, choice_dataset)

## Now you read the image content
def load_image(filename):
    image = cv2.imread(filename, 1)
    image = dataset.map(image, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    image = image[..., ::-1]
    label = int(os.path.split(os.path.dirname(filename))[1])
    image = dataset1.append()
    label = dataset2.append
    return image, label

dataset = dataset.map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)

batch_size = num_classes_per_batch * num_images_per_class
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(None)
With this I am not able to load the images, and it shows me this error:
SystemError: <built-in function imread> returned NULL without setting an error
Could you help me fix the error, or suggest another way to load the images?
Thanks in advance!!
I think that in this case your cv2.imread is acting up: inside dataset.map your function receives tf.Tensor objects, not Python strings, so cv2.imread cannot open the file and returns NULL. I would first build a simple program that does not do the reading "on the fly", but instead pre-loads images to train on a small dataset.
It also feels like you are misusing the dataset.map function (you call dataset.map inside load_image, and append to the undefined dataset1/dataset2). I would recommend this tutorial on the tf.data.Dataset function: http://tensorexamples.com/2020/07/27/Using-the-tf.data.Dataset.html, and maybe this one on augmentation so you can see how to use the map function properly: http://tensorexamples.com/2020/07/28/Augmentation.html.
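For what it's worth, here is a minimal, untested sketch of a map-friendly loader using TensorFlow ops instead of cv2; it assumes your directory layout, where the parent folder name is the integer label:

def load_image(filename):
    # Inside dataset.map, filename is a tf.string tensor, so use TF ops to read it
    image_bytes = tf.io.read_file(filename)
    image = tf.image.decode_jpeg(image_bytes, channels=3)  # decodes directly to RGB
    # Add tf.image.resize here if your images vary in size
    # Parse the label from the parent directory name, e.g. ".../train/7/img.jpg" -> 7
    label = tf.strings.to_number(tf.strings.split(filename, "/")[-2], out_type=tf.int64)
    return image, label

dataset = dataset.map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)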
Good luck!
I'm trying to create a Dataset object in TensorFlow 1.14 (I have some legacy code that I can't change for this specific project) starting from numpy arrays, but every time I try, everything gets copied onto my graph, and as a result the event log file I create is huge (719 MB in this case).
Originally I tried using tf.data.Dataset.from_tensor_slices(), but it didn't work. Then I read that this is a common problem and someone suggested trying generators instead, so I tried the following code, but again I got a huge event file (719 MB again):
def fetch_batch(x, y, batch):
    i = 0
    while i < batch:
        yield (x[i, :, :, :], y[i])
        i += 1

train, test = tf.keras.datasets.fashion_mnist.load_data()
images, labels = train
images = images / 255

training_dataset = tf.data.Dataset.from_generator(fetch_batch,
    args=[images, np.int32(labels), batch_size],
    output_types=(tf.float32, tf.int32),
    output_shapes=(tf.TensorShape(features_shape), tf.TensorShape(labels_shape)))

file_writer = tf.summary.FileWriter("/content", graph=tf.get_default_graph())
I know in this case I could use the tensorflow_datasets API and it would be easier, but this is a more general question about how to create datasets in general, not only with MNIST.
Could you explain to me what I am doing wrong? Thank you.
I guess it's because you are using args in from_generator; this will put the provided args into the graph as constants.
What you could do is define a function that returns a generator that iterates through your set, something like this (untested):
def data_generator(images, labels):
    def fetch_examples():
        i = 0
        while True:
            example = (images[i], labels[i])
            i += 1
            i %= len(labels)
            yield example
    return fetch_examples
This would give in your example:
train, test = tf.keras.datasets.fashion_mnist.load_data()
images, labels = train
images = images / 255

training_dataset = tf.data.Dataset.from_generator(data_generator(images, labels),
    output_types=(tf.float32, tf.int32),
    output_shapes=(tf.TensorShape(features_shape), tf.TensorShape(labels_shape))).batch(batch_size)

file_writer = tf.summary.FileWriter("/content", graph=tf.get_default_graph())
Note that I changed fetch_batch to fetch_examples since you probably want to batch using the dataset utilities (.batch).
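In 1.14 you would then typically consume the dataset through an iterator in graph mode; a small sketch:

iterator = training_dataset.make_one_shot_iterator()
images_batch, labels_batch = iterator.get_next()

with tf.Session() as sess:
    imgs, lbls = sess.run([images_batch, labels_batch])  # fetches one batch per run call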
I am new to PyTorch, and have a couple of questions regarding the way pictures are handled:
1) In the "Training a Classifier" tutorial, the pictures are PIL files, and are handled via the following commands (where "transform" also turns the PIL format into a tensor format):
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)
It seems like trainset[1] (and likewise the other indices) consists of a tensor and a number. I want to define a new variable "image" that will consist of the tensor part of trainset[1] and then print it. How can I do that?
2) Assume that I have a different dataset that I want to classify. It consists of .jpeg images that are located in the folder "C:/temp/dataset". How can I define the variable "trainset" to consist of these images?
Thanks a lot in advance!
For your first question:
image = trainset[1][0]
print(image)
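Each element of trainset is an (image, label) tuple, so you can also unpack both parts directly (the shape below assumes the tutorial's ToTensor transform):

image, label = trainset[1]
print(image.shape)  # torch.Size([3, 32, 32]) for CIFAR10 after ToTensor()
print(label)        # an integer class index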
For your second question:
from PIL import Image
import numpy as np
import os

def load_image(infilename):
    """This function loads an image into memory when you give it
    the path of the image
    """
    img = Image.open(infilename)
    img.load()
    data = np.asarray(img, dtype="float32")
    return data

def create_npy_from_image(images_folder, output_name, num_images, image_dim):
    """Loops through the images in a folder and saves all of them
    as a numpy array in output_name
    """
    image_matrix = np.empty((num_images, image_dim, image_dim, 3), dtype=np.float32)
    for i, filename in enumerate(os.listdir(images_folder)):
        if filename.endswith((".jpg", ".jpeg")):
            # join the folder and file name safely instead of concatenating strings
            data = load_image(os.path.join(images_folder, filename))
            image_matrix[i] = data
        else:
            continue
    np.save(output_name, image_matrix)
So I would write something like this:
create_npy_from_image(path_to_images_folder, "trainset.npy", number_of_images_in_your_folder, DIM)
DIM is 64, for example, if your images are 64x64x3.
You can then load the saved array with np.load and convert it to a PyTorch tensor using torch.from_numpy.
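For example (a short sketch; trainset.npy is the file saved above):

import numpy as np
import torch

data = np.load("trainset.npy")       # shape (num_images, DIM, DIM, 3)
tensor = torch.from_numpy(data)      # float32 tensor
tensor = tensor.permute(0, 3, 1, 2)  # NHWC -> NCHW, the layout PyTorch expects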
Let me know if this works. Good luck!
I have a dataset (71094 train images and 17000 test images) on which I need to train a CNN. During preprocessing, I tried creating a numpy matrix that turns out to be ridiculously large (71094x100x100x3 for the train data; all images are RGB, 100 by 100), and I get a memory error. How do I tackle this situation? Please help.
This is my code:
import numpy as np
import cv2
from matplotlib import pyplot as plt

data_dir = './fashion-data/images/'
train_data = './fashion-data/train.txt'
test_data = './fashion-data/test.txt'

f = open(train_data, 'r').read()
ims = f.split('\n')
print len(ims)

train = np.zeros((71094, 100, 100, 3))  # this line causes the error
for ix in range(train.shape[0]):
    i = cv2.imread(data_dir + ims[ix] + '.jpg')
    label = ims[ix].split('/')[0]
    train[ix, :, :, :] = cv2.resize(i, (100, 100))

print train[0]

train_labels = np.zeros((71094, 1))
for ix in range(train_labels.shape[0]):
    l = ims[ix].split('/')[0]
    train_labels[ix] = int(l)

print train_labels[0]

np.save('./data/train', train)
np.save('./data/train_labels', train_labels)
I recently ran into the same problem, and I believe it is a common one when working with image data.
There are a number of methods you can use to tackle it, depending on what you would like to do:
1) It can make sense to sample pixels from each image during training, rather than training on all 71094x100x100 pixels. This can be done simply by creating a function that loads one image at a time and samples your pixels. There is some argument that doing this randomly for each epoch can reduce overfitting, but again it depends on the exact problem. Stratified sampling may also help to balance the classes if you are working with pixel classification.
2) Mini-batch training: split your training data into small "mini-batches" and train on each of them individually. Your epoch ends once you have trained on all of the mini-batches, i.e. on all of your data. Here you should randomise the order of the data each epoch to avoid overfitting (a rough generator sketch follows after this list).
3) Load and train on one image at a time: similar to mini-batch training, but use a single image as the "mini-batch" for each iteration and run a for-loop through all of the images in the folder. This way only 1x100x100x3 is stored in memory at a time. Depending on the size of your memory, you could use more than one image per mini-batch, i.e. Nx100x100x3, and run 71094/N iterations to cover all the training data.
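For option 2, here is a rough, untested sketch of such a mini-batch generator, reusing the paths and label layout from your code:

def batch_generator(ims, data_dir, batch_size=32):
    # Reshuffle the order each epoch to reduce overfitting
    idx = np.random.permutation(len(ims))
    for start in range(0, len(ims), batch_size):
        batch_ids = idx[start:start + batch_size]
        images = np.zeros((len(batch_ids), 100, 100, 3), dtype=np.float32)
        labels = np.zeros(len(batch_ids), dtype=np.int64)
        for j, ix in enumerate(batch_ids):
            img = cv2.imread(data_dir + ims[ix] + '.jpg')
            images[j] = cv2.resize(img, (100, 100))
            labels[j] = int(ims[ix].split('/')[0])
        yield images, labels  # only batch_size x 100 x 100 x 3 held in memory at once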
I hope this was clear, and that it helps somewhat!