using tensorflow dataset for custom image oversampling - python

OK. I want to set up a custom dataset workflow in tensorflow for custom image oversampling. I have an imbalanced data set with many more normal images than fibrosis images.
I start with some variables...
num_fibrosis = len(glob.glob(WORKING_DIR_TS + 'NIH_1stPA_Norm_Fib/Fibrosis/*.png'))
num_normal = len(glob.glob(WORKING_DIR_TS + 'NIH_1stPA_Norm_Fib/No Finding/*.png'))
perc_for_val = 0.2
oversample_multiplier = 5
num_fibrosis_in_val = int(num_fibrosis * perc_for_val)
oversample_count = num_fibrosis_in_val * oversample_multiplier
And I create a dataset based on the folder structure. This dataset contains images and labels.
full_ds = tf.keras.utils.image_dataset_from_directory(
    'folder_path',
    image_size=(SIZE, SIZE),
    batch_size=None,
    # shuffle=False
)
Then I take 20% of the fibrosis images and put them into our validation dataset. I also put an equal number of normal images in val_ds.
fibrosis_ds = full_ds.filter(lambda x, y: tf.equal(y, 0) ) # y == 0 for fibrosis
normal_ds = full_ds.filter(lambda x, y: tf.equal(y, 1) ) # y == 1 for normal
# Let's take 20% of fibrosis images, and an equal number of normals, for our validation dataset
val_ds = fibrosis_ds.take( num_fibrosis_in_val )
val_ds = val_ds.concatenate( normal_ds.take( num_fibrosis_in_val ) )
val_ds = val_ds.batch(BATCH_SIZE)
And lastly I make the training dataset. I use skip so I don't repeat any of the images that I used earlier. I use repeat to oversample the fibrosis images. I add an equal number of normal images to make sure the classes are balanced. And at the end I shuffle.
# Make the training set
train_ds = fibrosis_ds.skip(num_fibrosis_in_val).take(num_fibrosis - num_fibrosis_in_val)
train_ds = train_ds.repeat( oversample_multiplier )
train_ds = train_ds.concatenate( normal_ds.skip(num_fibrosis_in_val).take(oversample_count) )
train_ds = train_ds.shuffle(oversample_count*2)
train_ds = train_ds.batch(BATCH_SIZE)
This seems to work, but in Google Colab it almost fills RAM just when I loop over val_ds to confirm the count. Is it holding the entire dataset in memory while applying the chained functions on top? Is there a more reasonable way to do this?
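One pattern that avoids the repeated full-dataset scans that filter incurs is to build one dataset per class folder and rebalance with tf.data.Dataset.sample_from_datasets (tf.data.experimental.sample_from_datasets before TF 2.7). This is a sketch, not a verified fix for the RAM issue; the per-class directories mirror the globs above, and the decode settings (3 channels, png) are assumptions:

import tensorflow as tf

# Sketch: one dataset per class folder, so no filter pass over the full
# dataset is needed. Paths mirror the globs above (assumption).
def load_class_ds(folder, label):
    files = tf.data.Dataset.list_files(folder + '/*.png', shuffle=False)
    def decode(path):
        img = tf.io.decode_png(tf.io.read_file(path), channels=3)  # channels assumed
        img = tf.image.resize(img, [SIZE, SIZE])
        return img, tf.constant(label, tf.int32)
    return files.map(decode, num_parallel_calls=tf.data.AUTOTUNE)

fibrosis_ds = load_class_ds(WORKING_DIR_TS + 'NIH_1stPA_Norm_Fib/Fibrosis', 0)
normal_ds = load_class_ds(WORKING_DIR_TS + 'NIH_1stPA_Norm_Fib/No Finding', 1)

# Same take/skip split as before, then oversample by drawing from both
# classes with equal probability instead of repeat + concatenate + big shuffle.
train_fib = fibrosis_ds.skip(num_fibrosis_in_val).repeat()
train_norm = normal_ds.skip(num_fibrosis_in_val).repeat()
train_ds = tf.data.Dataset.sample_from_datasets([train_fib, train_norm],
                                                weights=[0.5, 0.5])
# The result is infinite; pass steps_per_epoch to model.fit.
train_ds = train_ds.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)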

Related

Tensorflow dataset iterator pick a sub-sample of whole data

I have some code that generates an iterator from a TensorFlow dataset:
@tf.function
def normalize_image(record):
    out = record.copy()
    out['image'] = tf.cast(out['image'], 'float32') / 255.
    return out
train_it = iter(tfds.builder('mnist').as_dataset(split='train').map(normalize_image).repeat().batch(256*10))
However, I want to do the splitting manually. For example, the MNIST dataset has 60000 training samples, but I want to use only the first 50000 (and hold the others out for validation). The problem is I don't know how to do so.
I tried to convert it to NumPy and split based on that, but then I couldn't apply the map to it.
ds_builder = tfds.builder('mnist')
print(dir(ds_builder))
ds_builder.download_and_prepare()
train_ds = tfds.as_numpy(ds_builder.as_dataset(split='train', batch_size=-1))
train_ds['image'] = train_ds['image'][0:50000, : , :]
train_ds['label'] = train_ds['label'][0:50000]
I was wondering how to do so.
P.S.: The ordering of the data is also important to me, so I was thinking of loading all the data in NumPy, saving the required samples as png, and loading them with tfds, but I'm not sure whether that keeps the original order. I want to take the first 50000 samples of the whole 60000.
Thanks.
You can split manually with take and skip:
train_ds = tfds.builder('mnist').as_dataset(split='train').map(normalize_image)
train_ds = train_ds.take(50000).repeat().batch(256*10)
val_ds = tfds.builder('mnist').as_dataset(split='train').map(normalize_image)
val_ds = val_ds.skip(50000).batch(256*10)
train_it = iter(train_ds)
val_it = iter(val_ds)
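Alternatively (a sketch relying on the TFDS subsplit syntax, which slices by position in the stored order), you can request the slices directly instead of iterating past skipped examples:

import tensorflow_datasets as tfds

# train[:50000] / train[50000:] select examples by position
train_ds = tfds.builder('mnist').as_dataset(split='train[:50000]').map(normalize_image)
val_ds = tfds.builder('mnist').as_dataset(split='train[50000:]').map(normalize_image)
train_it = iter(train_ds.repeat().batch(256 * 10))
val_it = iter(val_ds.batch(256 * 10))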

How to add data augmentation to regression problem?

I am trying to build a CNN model for a regression problem with a limited amount of input data (400 samples). The inputs are images and the labels are extracted from a column of a csv file. To increase the input data, I need to augment the input images and match them with the existing labels. I am using rotation and flipping augmentation methods. I am not sure how the existing labels should be linked to the augmented images and how the final tensorflow dataset should be created to fit the model. Can anyone help me solve this data augmentation?
#load csv file
labelPath = "/content/drive/MyDrive/Notebook/tepm.csv"
cols = ["temperature"]
df = pd.read_csv(labelPath, sep=" ", header=None, names=cols)
inputPath='/content/drive/MyDrive/Notebook/test_png_64'
images = []
# Load in the images
for filepath in os.listdir(inputPath):
    images.append(cv2.imread(inputPath + '/{0}'.format(filepath), flags=(cv2.IMREAD_ANYCOLOR | cv2.IMREAD_ANYDEPTH)))
images_scaled = np.array(images, dtype="float") / 255.0
(trainY, testY, trainX, testX) = train_test_split(df, images_scaled, test_size=0.20, random_state=42)
(trainY, valY, trainX, valX) = train_test_split(trainY, trainX, test_size=0.20, random_state=42)
def rotate(trainX: tf.Tensor) -> tf.Tensor:
    # Rotate by a random multiple of 90 degrees (k must be an integer)
    return tf.image.rot90(trainX, k=tf.random.uniform(shape=[], minval=0, maxval=4, dtype=tf.int32))

def flip(trainX: tf.Tensor) -> tf.Tensor:
    trainX = tf.image.random_flip_left_right(trainX)
    trainX = tf.image.random_flip_up_down(trainX)
    return trainX
Update with ImageDataGenerator:
datagen = ImageDataGenerator(
    vertical_flip=True,
    horizontal_flip=True,
    fill_mode="nearest")
datagen.fit(trainX)
model.compile(optimizer=tf.keras.optimizers.Adam(lr=0.0001), loss='mean_squared_error', metrics='mse')
ImageDataGenerator should do the trick. It generates batches of tensor image data with real-time data augmentation.
First, you should consider whether you need augmentations that preserve the labels, augmentations that require matching label augmentation, or both. If I am following your code correctly, you have temperature as a scalar label. Without knowing the nature of your images, I'd guess it unlikely that rotations and flips are temperature-dependent, so the labels are preserved and you are all set with ImageDataGenerator as is. Whether those augmentations will actually help the training is hard to know without trying.

Conversely, ImageDataGenerator does have a brightness augmentation, which is the sort of thing I could imagine being temperature-dependent in an image. In that case the labels aren't preserved and you'd have to augment them manually, because I don't think ImageDataGenerator has methods for scalar labels. In my experience, it is the latter sort of augmentation (labels not preserved) which is more obviously useful. But to get matching label augmentation you may have to do a little more manual coding than what comes stock with ImageDataGenerator; fortunately it might not be too hard.
Some of the basic elements for matching label augmentation might go like this (this is not complete code, just snippets):
Set up the subset of parameters for ImageDataGenerator augmentation that make sense for your scalar labels in a convenience dict:
regression_aug = dict(fill_mode='nearest',
                      rotation_range=3,
                      width_shift_range=0.1,
                      height_shift_range=0.1)
Use the ImageDataGenerator method get_random_transform:
self.tparams[i] = self.generator.get_random_transform(self.img_dims)
Apply it to the training image, and then manually apply it to the scalar label(s):
batch_X[i] = self.generator.apply_transform(img[i], self.tparams[i])
batch_y[i,0] = self.lbl[x,0] - self.tparams[i]['tx']
batch_y[i,1] = self.lbl[x,1] - self.tparams[i]['ty']
batch_y[i,2] = self.lbl[x,2] - self.tparams[i]['theta']
where in this example case I had scalar labels consisting of position and orientation, so they could sensibly be translated and rotated during augmentation.
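Putting those snippets together, a minimal sketch of a matching-label generator might look like the following. The class name and the (x, y, theta) label layout are illustrative assumptions, and the sign conventions for the label updates follow the snippets above and would need checking against your coordinate system:

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

class RegressionAugSequence(tf.keras.utils.Sequence):  # hypothetical name
    def __init__(self, images, labels, batch_size=32, **aug_kwargs):
        self.img = images                    # (N, H, W, C) float array
        self.lbl = labels                    # (N, 3): x, y, theta per sample
        self.batch_size = batch_size
        self.generator = ImageDataGenerator(**aug_kwargs)
        self.img_dims = images.shape[1:]

    def __len__(self):
        return int(np.ceil(len(self.img) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        batch_X = self.img[sl].copy()
        batch_y = self.lbl[sl].copy()
        for i in range(len(batch_X)):
            # draw one random transform and apply it to the image...
            tparams = self.generator.get_random_transform(self.img_dims)
            batch_X[i] = self.generator.apply_transform(batch_X[i], tparams)
            # ...and manually to the matching scalar labels
            batch_y[i, 0] -= tparams['tx']
            batch_y[i, 1] -= tparams['ty']
            batch_y[i, 2] -= tparams['theta']
        return batch_X, batch_y

# usage sketch: model.fit(RegressionAugSequence(trainX, trainY, 32, **regression_aug))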

how to load label data presented in raster format into Keras/Tensorflow

I want to use a CNN to segment 2 objects (binary: "0: object not present, 1: object present") but I have an issue with the data. The training data is 150 images in "jpg" format, and the ground truth (label data) is another 150 images, "png" rasters of 0s and 1s (i.e. black-and-white images).
Now the question is how to load this hybrid of training images and label images in Keras/Tensorflow. If there's a dummy example and/or demonstration of how to do that in Python, I would be grateful.
You can define one generator for reading the input images and another one for reading the labels using the ImageDataGenerator class and its flow_from_directory() method, and then combine these two generators in a single generator. Just make sure the directory structure and (order of) file names of input and label images are the same:
data_image_gen = ImageDataGenerator(...)
data_label_gen = ImageDataGenerator(...)
image_gen = data_image_gen.flow_from_directory(image_directory,
                                               # no need to return labels
                                               class_mode=None,
                                               # don't shuffle, to keep the same order as the labels
                                               shuffle=False)
label_gen = data_label_gen.flow_from_directory(label_directory,
                                               color_mode='grayscale',
                                               # no need to return labels
                                               class_mode=None,
                                               # don't shuffle, to keep the same order as the images
                                               shuffle=False)
def final_gen(image_gen, label_gen):
    for data, labels in zip(image_gen, label_gen):
        # divide labels by 255 to make them masks, i.e. 0 and 1
        labels /= 255.
        # remove the last axis, i.e. (batch_size, n_rows, n_cols, 1) --> (batch_size, n_rows, n_cols)
        labels = np.squeeze(labels, axis=-1)
        yield data, labels
# ... define your model
# fit the model
model.fit_generator(final_gen(image_gen, label_gen), ...)
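For newer TensorFlow versions, a tf.data version of the same idea might look like the sketch below. The directory layouts ('images/*.jpg', 'labels/*.png') and the 256x256 size are assumptions; it pairs each image with its mask by matching sorted file lists:

import tensorflow as tf

# Sketch: pair each training image with its mask raster via sorted,
# parallel file lists (file names must correspond, as noted above).
image_paths = sorted(tf.io.gfile.glob('images/*.jpg'))
label_paths = sorted(tf.io.gfile.glob('labels/*.png'))

def load_pair(img_path, lbl_path):
    img = tf.image.decode_jpeg(tf.io.read_file(img_path), channels=3)
    img = tf.image.resize(img, [256, 256]) / 255.0
    msk = tf.image.decode_png(tf.io.read_file(lbl_path), channels=1)
    msk = tf.image.resize(msk, [256, 256], method='nearest')
    msk = tf.cast(msk > 0, tf.float32)[..., 0]   # 0/1 mask, channel axis dropped
    return img, msk

ds = (tf.data.Dataset.from_tensor_slices((image_paths, label_paths))
      .map(load_pair, num_parallel_calls=tf.data.AUTOTUNE)
      .batch(8)
      .prefetch(tf.data.AUTOTUNE))
# model.fit(ds, ...)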

How to use .predict_generator() on new Images - Keras

I've used ImageDataGenerator and flow_from_directory for training and validation.
These are my directories:
train_dir = Path('D:/Datasets/Trell/images/new_images/training')
test_dir = Path('D:/Datasets/Trell/images/new_images/validation')
pred_dir = Path('D:/Datasets/Trell/images/new_images/testing')
ImageGenerator Code:
img_width, img_height = 28, 28
batch_size=32
train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)
test_datagen = ImageDataGenerator(rescale=1. / 255)
train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='categorical')
validation_generator = test_datagen.flow_from_directory(
    test_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='categorical')
Found 1852 images belonging to 4 classes
Found 115 images belonging to 4 classes
This is my model training code:
history = cnn.fit_generator(
    train_generator,
    steps_per_epoch=1852 // batch_size,
    epochs=20,
    validation_data=validation_generator,
    validation_steps=115 // batch_size)
Now I have some new images in a test folder (all images are inside the same folder only), on which I want to predict. But when I use .predict_generator I get:
Found 0 images belonging to 0 class
So I tried these solutions:
1) Keras: How to use predict_generator with ImageDataGenerator? This didn't work out, because it only tries on the validation set.
2) How to predict the new image by using model.predict? module image not found
3) How to get predictions with predict_generator on streaming test data in Keras? This also didn't work out.
My train data is basically stored in 4 separate folders, i.e. 4 specific classes; the validation data is also stored the same way and works out pretty well.
So in my test folder I have around 300 images, on which I want to predict and make a dataframe, like this:
image_name class
gghh.jpg 1
rrtq.png 2
1113.jpg 1
44rf.jpg 4
tyug.png 1
ssgh.jpg 3
I have also used the following code:
img = image.load_img(pred_dir, target_size=(28, 28))
img_tensor = image.img_to_array(img)
img_tensor = np.expand_dims(img_tensor, axis=0)
img_tensor /= 255.
cnn.predict(img_tensor)
But I get this error: [Errno 13] Permission denied: 'D:\\Datasets\\Trell\\images\\new_images\\testing'
But I haven't been able to run predict_generator on my test images. So how can I predict on my new images using Keras? I have googled a lot and searched on Kaggle Kernels as well, but haven't been able to find a solution.
First of all, the test images should be placed inside a separate folder inside the test folder. So in my case I made another folder inside the test folder and named it all_classes.
Then ran the following code:
test_generator = test_datagen.flow_from_directory(
    directory=pred_dir,
    target_size=(28, 28),
    color_mode="rgb",
    batch_size=32,
    class_mode=None,
    shuffle=False
)
The above code gives me an output:
Found 306 images belonging to 1 class
And most importantly, you have to write the following code:
test_generator.reset()
otherwise weird outputs will come.
Then using the .predict_generator() function:
pred = cnn.predict_generator(test_generator, verbose=1, steps=int(np.ceil(306 / batch_size)))  # ceil so the final partial batch is included
Running the above code gives the output as probabilities, so first I need to convert them to class numbers. In my case there were 4 classes, so the class numbers were 0, 1, 2 and 3.
Code written:
predicted_class_indices=np.argmax(pred,axis=1)
The next step is to get the names of the classes:
labels = (train_generator.class_indices)
labels = dict((v,k) for k,v in labels.items())
predictions = [labels[k] for k in predicted_class_indices]
whereby the class numbers are replaced by the class names. One final step: if you want to save the results to a csv file, arrange them in a dataframe with the image names alongside the predicted classes.
filenames=test_generator.filenames
results=pd.DataFrame({"Filename":filenames,
"Predictions":predictions})
Display your dataframe. Everything is done now; you get all the predicted classes for your images.
I had some trouble with predict_generator(). Some posts here helped a lot. I post my solution here as well and hope it will help others. What I do:
Make predictions on new images using predict_generator()
Get filename for each prediction
Store results in a data frame
I make binary predictions à la "cats and dogs" as documented here. However, the logic can be generalised to multiclass cases; then the outcome of the prediction has one column per class.
First, I load my stored model and set up the data generator:
import numpy as np
import pandas as pd
from keras.preprocessing.image import ImageDataGenerator
from keras.models import load_model
# Load model
model = load_model('my_model_01.hdf5')
test_datagen = ImageDataGenerator(rescale=1./255)
test_generator = test_datagen.flow_from_directory(
    "C:/kerasimages/pred/",
    target_size=(150, 150),
    batch_size=20,
    class_mode='binary',
    shuffle=False)
Note: it is important to specify shuffle=False in order to preserve the order of filenames and predictions.
Images are stored in C:/kerasimages/pred/images/. The data generator will only look for images in subfolders of C:/kerasimages/pred/ (as specified in test_generator). It is important to respect the logic of the data generator, so the subfolder /images/ is required. Each subfolder in C:/kerasimages/pred/ is interpreted as one class by the generator. Here, the generator will report Found x images belonging to 1 classes (since there is only one subfolder). If we make predictions, classes (as detected by the generator) are not relevant.
Now, I can make predictions using the generator:
# Predict from generator (returns probabilities)
pred=model.predict_generator(test_generator, steps=len(test_generator), verbose=1)
Resetting the generator is not required in this case, but if a generator has been set up before, it may be necessary to reset it using test_generator.reset().
Next, I round the probabilities to get classes, and I retrieve the filenames:
# Get classes by np.round
cl = np.round(pred)
# Get filenames (setting shuffle=False in the generator is important)
filenames=test_generator.filenames
Finally, results can be stored in a data frame:
# Data frame
results=pd.DataFrame({"file":filenames,"pr":pred[:,0], "class":cl[:,0]})
I strongly recommend that you make a parent folder for the test folder and move the test folder into it.
That means, if your test folder currently looks like this:
/root/test/img1.png
/root/test/img2.png
/root/test/img3.png
/root/test/img4.png
this is the wrong layout for predict_generator. Update your test folder like this:
/root/test_parent/test/img1.png
/root/test_parent/test/img2.png
/root/test_parent/test/img3.png
/root/test_parent/test/img4.png
Use this command to update:
mv /root/test/ /root/test_parent/test
And also don't forget to pass the parent folder's path to the generator, like this:
"/root/test_parent/"
This method works for me.
Most probably you are making a mistake using flow_from_directory. Reading the docs:
flow_from_directory(directory, ...)
Where:
directory: Path to the target directory. It should contain one subdirectory per class. Any PNG, JPG, BMP, PPM or TIF images inside each of the subdirectories directory tree will be included in the generator.
That means that inside the directory you pass to this function, you have to create subdirectories and place your images inside them. Otherwise, when the images are directly in the directory you pass (not in subdirectories), there are indeed 0 images and 0 classes.
EDIT
Okay, so for the prediction you want to perform, I believe you want to use the predict function as follows (note that you have to provide data to the network in exactly the same format as during the learning process):
import numpy as np
from keras.preprocessing.image import img_to_array, load_img
from skimage.color import rgb2lab

image = img_to_array(load_img(f"{directory}/{foldername}/{filename}"))
# here you prepare the input data; in this example we take the gray image
# (gray scale is the 1st channel in the Lab color space)
color_me = rgb2lab((1.0 / 255) * image)[:, :, 0]
color_me = color_me.reshape(color_me.shape + (1,))
# here the data is in the format accepted by, in this case, my model;
# for your model, do the preparation just as in the learning process
output = model.predict(np.array([color_me]))
# and here you have your predicted output
As per the Keras documentation cited below, predict_generator is deprecated. Model.predict now supports generators, so there is no longer any need to use the predict_generator endpoint.
Reference: https://www.tensorflow.org/api_docs/python/tf/keras/Model#predict_generator
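A minimal sketch of the replacement (assuming a model and test_datagen set up as in the answers above):

# Model.predict consumes the generator directly; no predict_generator needed.
test_generator = test_datagen.flow_from_directory(
    directory=pred_dir,
    target_size=(28, 28),
    class_mode=None,
    shuffle=False)
pred = model.predict(test_generator, verbose=1)
predicted_class_indices = np.argmax(pred, axis=1)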

tensorflow input pipeline returns multiple values

I'm trying to make an input pipeline in tensorflow for image classification, so I want to make batches of images and corresponding labels. The TensorFlow documentation suggests that we can use tf.train.batch to make batches of inputs:
train_batch, train_label_batch = tf.train.batch(
    [train_image, train_image_label],
    batch_size=batch_size,
    num_threads=1,
    capacity=10 * batch_size,
    enqueue_many=False,
    shapes=[[224, 224, 3], [len(labels),]],
    allow_smaller_final_batch=True
)
However, I'm thinking would it be a problem if I feed in the graph like this:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=train_label_batch, logits=Model(train_batch)))
The question is: does the operation in the cost function dequeue images together with their corresponding labels, or does it return them separately, thereby training on mismatched images and labels?
There are several things you need to consider to preserve the ordering of images and labels.
Let's say we need a function that gives us images and labels.
def _get_test_images(_train=False):
    """
    Gets the test images and labels as a batch

    Inputs:
    =======
    _train : Boolean if images are from the training set

    Outputs:
    ========
    images_batch : Batch of images containing BATCH_SIZE images at a time
    label_batch  : Batch of labels corresponding to the images in images_batch
    idx          : Batch of indexes of images
    """
    # get images and labels
    _, _img_names, _img_class, index = _get_list(_train=_train)
    # the total number of distinct images used for training equals the images
    # fed into tf.train.slice_input_producer as _img_names
    img_path, label, idx = tf.train.slice_input_producer([_img_names, _img_class, index], shuffle=False)
    img_path, label, idx = tf.convert_to_tensor(img_path), tf.convert_to_tensor(label), tf.convert_to_tensor(idx)
    img_path = tf.cast(img_path, dtype=tf.string)
    # read file
    image_file = tf.read_file(img_path)
    # decode jpeg/png/bmp
    # tf.image.decode_image doesn't return a static shape, so resizing would fail
    image = tf.image.decode_jpeg(image_file)
    # image preprocessing
    image = tf.image.resize_images(image, [IMG_DIM, IMG_DIM])
    float_image = tf.cast(image, dtype=tf.float32)
    # subtract the mean and divide by the standard deviation
    float_image = tf.image.per_image_standardization(float_image)
    # set the shape
    float_image.set_shape(IMG_SIZE)
    labels_original = tf.cast(label, dtype=tf.int32)
    img_index = tf.cast(idx, dtype=tf.int32)
    # parameters for batching
    batch_size = BATCH_SIZE
    min_fraction_of_examples_in_queue = 0.3
    num_preprocess_threads = 1
    num_examples_per_epoch = MAX_TEST_EXAMPLE
    min_queue_examples = int(num_examples_per_epoch *
                             min_fraction_of_examples_in_queue)
    images_batch, label_batch, idx = tf.train.batch(
        [float_image, label, img_index],
        batch_size=batch_size,
        num_threads=num_preprocess_threads,
        capacity=min_queue_examples + 3 * batch_size)
    # Display the training images in the visualizer.
    tf.summary.image('images', images_batch)
    return images_batch, label_batch, idx
Here, tf.train.slice_input_producer([_img_names, _img_class, index], shuffle=False) is the interesting part: if you set shuffle=True, it will shuffle all three arrays in coordination.
The second thing is num_preprocess_threads. As long as you use a single thread for the dequeue operation, batches come out in a deterministic order. With more than one thread the order of the examples becomes non-deterministic; note, though, that each image is dequeued together with its own label and index as one tuple, so the pairing itself is preserved. Once dequeued, everything is in tensor form, and tf.nn.softmax_cross_entropy_with_logits has no problem with such tensors.
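For what it's worth, here is a sketch of the same pipeline in the newer tf.data API (not part of the original queue-based answer; img_paths and img_labels are assumed lists). The pairing guarantee comes for free, since each dataset element is a complete (path, label) tuple from the start:

import tensorflow as tf

# Sketch: each element is a (path, label) tuple throughout, so parallel
# map calls can reorder elements but never mismatch an image and its label.
def load(path, label):
    image = tf.image.decode_jpeg(tf.io.read_file(path))
    image = tf.image.resize(image, [224, 224])
    image = tf.image.per_image_standardization(image)
    return image, label

ds = (tf.data.Dataset.from_tensor_slices((img_paths, img_labels))
      .map(load, num_parallel_calls=tf.data.AUTOTUNE)
      .batch(32))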
