Possible to use keras “flow_from_directory” with “.mat” files? - python

I have built a CNN to predict lymph node positivity (has cancer or not). Right now to load the data I have a self defined function that loads a batch of data and feeds it to the model for training.
Instead of loading batches I would love to use the flow_from_directory method. The problem I have is that my data are saved as arrays [#, rows, width, height, PET or CT] not images (that would later be converted to arrays). Example [0,:,:,:,0] is volume sized 48x48x32 from a ct image.
If I try to use flow_from_directory I get 0 images with 3 classes which I expected since '.mat' is not a recognized file (https://github.com/keras-team/keras-preprocessing/blob/362fe9f8daf556151328eb5d02bd5ae638c653b8/keras_preprocessing/image.py#L1868). Interestingly it doesnt raise any errors but I am indefinitely stuck on 1/150 epochs. I am going to see if I can write my own flow_from_directory. Not sure if someone has run across this problem and could give my pointers.
Illustrating how data is combined
for fname in fnames:
data = scipy.io.loadmat(os.path.join(dir_in_train, fname))['roi_patch']
data_PET = scipy.io.loadmat(os.path.join(dir_in_train_PET, fname))['roi_patch']
train_combo[0,:,:,:,0]=data/4.0950
train_combo[0,:,:,:,1]=data_PET/32.1959
train_combo[0,:,:,:,:].shape
train_combo = np.zeros((1, 48, 48, 32, 2))
scipy.io.savemat(fname, {fname: train_combo})
This will create a file ex '1.mat' that has CT data and PET data in one area
Then I have code changing it into npy files.
Example of data generator I already have
# load training data
def load_train_data_batch_generator(self, batch_size=32, rows_in=48, cols_in=48, zs_in=32,
channels_in=2, num_classes=3,
dir_in_train=None, dir_out_train=None):
# dir_in_train = main_dir + '/test_CT_PET_combo'
fnames = ['{}.mat'.format(i) for i in range(1,len(os.listdir(dir_in_train))+1)]
y_train = np.zeros((batch_size, num_classes))
x_train = np.zeros((batch_size, rows_in, cols_in, zs_in, channels_in))
while True:
count = 0
for fname in np.random.choice(fnames, batch_size, replace=False):
data_label = scipy.io.loadmat(os.path.join(dir_out_train, fname))['output']
# changing one hot encoding to integer
integer_label = np.argmax(data_label[0], axis=0)
y_train[count,:] = data_label
# Loading train ct w/ c and pet/ct combo that will be saved into new directory
train_combo = scipy.io.loadmat(os.path.join(dir_in_train, fname))[fname]
x_train[count,:,:,:,:] = train_combo
count += 1
yield(x_train, y_train)

Related

Why would this dataset implementation run out of memory?

I follow this instruction and write the following code to create a Dataset for images(COCO2014 training set)
from pathlib import Path
import tensorflow as tf
def image_dataset(filepath, image_size, batch_size, norm=True):
def preprocess_image(image):
image = tf.image.decode_jpeg(image, channels=3)
image = tf.image.resize(image, image_size)
if norm:
image /= 255.0 # normalize to [0,1] range
return image
def load_and_preprocess_image(path):
image = tf.read_file(path)
return preprocess_image(image)
all_image_paths = [str(f) for f in Path(filepath).glob('*')]
path_ds = tf.data.Dataset.from_tensor_slices(all_image_paths)
ds = path_ds.map(load_and_preprocess_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds = ds.shuffle(buffer_size = len(all_image_paths))
ds = ds.repeat()
ds = ds.batch(batch_size)
ds = ds.prefetch(tf.data.experimental.AUTOTUNE)
return ds
ds = image_dataset(train2014_dir, (256, 256), 4, False)
image = ds.make_one_shot_iterator().get_next('images')
# image is then fed to the network
This code will always run out of both memory(32G) and GPU(11G) and kill the process. Here is the messages shown on terminal.
I also spot that the program get stuck at sess.run(opt_op). Where is wrong? How can I fix it?
The problem is this:
ds = ds.shuffle(buffer_size = len(all_image_paths))
The buffer that Dataset.shuffle() uses is an 'in memory' buffer so you are effectively trying to load the whole dataset in memory.
You have a couple of options (which you can combine) to fix this:
Option 1:
Reduce the buffer size to a much smaller number.
Option 2:
Move the shuffle() statment before the map() statement.
This means we would be shuffling before we load the images therefore we'd just be storing the filenames in the memory buffer for the shuffle rather than storing huge tensors.

Tensorflow 1.13.1 tf.data map multiple images with a single row together

I'm building my tf dataset where there are multiple inputs (images and numerical/categorical data). The problem I am having is that multiple images correspond to the same row in the pd.Dataframe I have. I am doing regression.
So how, (even when shuffling all the inputs) do I ensure that each image gets mapped to the correct row?
Again, say I have 10 rows, and 100 images, with 10 images corresponding to a particular row. Now we shuffle the dataset, and we want to make sure that the shuffled images all correspond to their respective row.
I am using tf.data.Dataset to do this. I also have a directory structure such that the folder name corresponds to an element in the DataFrame, which is what I was thinking of using if I knew how to do the mapping
i.e. folder1 would be in the df with cols like dir_name, feature1, feature2, .... Naturally, the dir_names should not be passed as data into the model to fit on.
#images
path_ds = tf.data.Dataset.from_tensor_slices(paths)
image_ds = path_ds.map(load_and_preprocess_image, num_parallel_calls=AUTOTUNE)
#numerical&categorical features. First remove the dirs
x_train_input = X_train[X_train.columns.difference(['dir_name'])]
x_train_input=np.expand_dims(x_train_input, axis=1)
text_ds = tf.data.Dataset.from_tensor_slices(x_train_input)
#labels, y_train's cols are: 'label' and 'dir_name'
label_ds = tf.data.Dataset.from_tensor_slices(
tf.cast(y_train['label'], tf.float32))
# test creation of dataset without prior shuffling.
xtrain_ = tf.data.Dataset.zip((image_ds, text_ds))
model_ds = tf.data.Dataset.zip((xtrain_, label_ds))
# Shuffling
BATCH_SIZE = 64
# Setting a shuffle buffer size as large as the dataset ensures that
# data is completely shuffled
ds = model_ds.shuffle(buffer_size=len(paths))
ds = ds.repeat()
ds = ds.batch(BATCH_SIZE)
# prefetch lets the dataset fetch batches in the background while the
# model is training
# ds = ds.prefetch(buffer_size=AUTOTUNE)
ds = ds.prefetch(buffer_size=BATCH_SIZE)
My solution would be to utilize TFRecords for storing the data and holding it's integrity. This will also open doors for other efficiencies as well.
What the below code is doing...
Create dummy data. All need to be arrays with the same datatype found in the _parse_function. You can change that dtype, just also ensure you change it for your data too.
Create a dictionary that holds the arrays by name
Create feature_dimensions object that holds the shape of all arrays
Create TFRecords by looping over data dict. You can create one large file, or many small ones. This is a good starting point for you however.
Declare functions for generating the dataset. You can add and modify whatever logic you want there. The key, however, is that these functions use the feature_dimensions object to remember how to put the data back together
Create a dataset
Generate a sample. The result is a dictionary with a batch-size worth of data.
You should be able to just run this sample code all by itself and have no issues. Then just make the changes you need for it to work in your problem.
import tensorflow as tf
import pandas as pd
import numpy as np
from functools import partial
# Create dummy data, TODO replace with your own logic
# 10 images per row in DF
images_per_example = 10
examples = 200
# Save name for TFRecords, you can create multiple and pass a list of the names as well
save_name = "my_tfrecords.tfrecords"
# DF, dataframe with random categorical data
x_data = pd.DataFrame(data=(np.random.normal(size=(examples, 50)) > 0).astype(np.float32))
y_data = np.random.uniform(0, 1, size=(examples, )).reshape(-1, 1).astype(np.float32)
def load_and_preprocess_image(file):
# For dummy purposes generating instead of loading
img = np.random.uniform(high=255, low=0, size=(15, 15))
return (img / 255.).astype(np.float32)
# I would preprocess your images prior to creating the tfrecords file
img_data = np.array([[load_and_preprocess_image("add_logic") for j in range(images_per_example)]
for k in range(examples)])
# Prepare for tfrecords
data_dict = dict()
data_dict["images"] = img_data # Already an array
data_dict["x_data"] = x_data.values # Ensure it's an array
data_dict["y_data"] = y_data # Already an array
# Remember the dimensions for later restoration, replacing number of examples with -1
feature_dimensions = {k: v.shape for k, v in data_dict.items()}
feature_dimensions = {k: tuple([-1] + list(v[1:])) for k, v in feature_dimensions.items()}
def _bytes_feature(value):
return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
writer = tf.python_io.TFRecordWriter(save_name)
# Create TFRecords file
for i in range(examples):
example_dict = dict() # New dictionary for each single example
for name, data in data_dict.items():
# if name == "images":
# break
example_dict[name] = data[i]
# Define the features of your tfrecord
feature = {k: _bytes_feature(tf.compat.as_bytes(v.tostring())) for k, v in example_dict.items()}
# Serialize to string and write to file
example = tf.train.Example(features=tf.train.Features(feature=feature))
writer.write(example.SerializeToString())
writer.close()
# Declare functions for creating dataset
def _parse_function(proto, feature_dimensions_: dict):
# define your tfrecord again. Remember that you saved your image as a string.
keys_to_features = {k: tf.FixedLenFeature([], tf.string) for k in feature_dimensions_.keys()}
# Load one example
parsed_features = tf.parse_single_example(proto, keys_to_features)
# Split data
for k, v in parsed_features.items():
parsed_features[k] = tf.decode_raw(v, tf.float32)
return parsed_features
def create_tf_dataset(file_paths: str, feature_dimensions_: dict, batch_size=64):
# This works with arrays as well
dataset = tf.data.TFRecordDataset(file_paths)
# Maps the parser on every filepath in the array. You can set the number of parallel loaders here
parse_function = partial(_parse_function, feature_dimensions_=feature_dimensions_)
dataset = dataset.map(parse_function, num_parallel_calls=1)
# This dataset will go on forever
dataset = dataset.repeat()
# Set the number of datapoints you want to load and shuffle
dataset = dataset.shuffle(batch_size) # Put whatever you want here
# Set the batchsize
dataset = dataset.batch(batch_size)
# Set up a pipeline
dataset = dataset.prefetch(batch_size) # Put whatever you want here
# Create an iterator
iterator = dataset.make_one_shot_iterator()
# Create your tf representation of the iterator
parsed_features = iterator.get_next()
# Reshape arrays and cast to float
for k, v in parsed_features.items():
parsed_features[k] = tf.reshape(v, feature_dimensions_[k])
for k, v in parsed_features.items():
parsed_features[k] = tf.cast(v, tf.float32)
return parsed_features
# Create dataset
ds = create_tf_dataset(save_name, feature_dimensions, batch_size=64)
# The final result is a dictionary with the names used above
sample = tf.Session().run(ds)
print("Sample Length:", len(sample))
print("Sample Keys:", sample.keys())
print("images shape:", sample["images"].shape)
print("x_data shape:", sample["x_data"].shape)
print("y_data shape:", sample["y_data"].shape)
Printed Results
Sample Length: 3
Sample Keys: dict_keys(['images', 'x_data', 'y_data'])
images shape: (64, 10, 15, 15)
x_data shape: (64, 50)
y_data shape: (64, 1)

Custom Keras generator much slower compared to Keras' bult in generator

I have a multi label classification problem. I wrote this custom generator. It reads images and output labels from the disk, and returns them in batches of size 32.
def get_input(img_name):
path = os.path.join("images", img_name)
img = image.load_img(path, target_size=(224, 224))
return img
def get_output(img_name, file_path):
data = pd.read_csv(file_path, delim_whitespace=True, header=None)
img_id = img_name.split(".")[0]
img_id = img_id.lstrip("0")
img_id = int(img_id)
labels = data.loc[img_id - 1].values
labels = labels[1:]
labels = list(labels)
label_arrays = []
for i in range(20):
val = np.zeros((1))
val[0] = labels[i]
label_arrays.append(val)
return label_arrays
def preprocess_input(img_name):
img = get_input(img_name)
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
return x
def train_generator(batch_size):
file_path = "train.txt"
data = pd.read_csv(file_path, delim_whitespace=True, header=None)
while True:
for i in range(math.floor(8000/batch_size)):
x_batch = np.zeros(shape=(32, 224, 224, 3))
y_batch = np.zeros(shape=(32, 20))
for j in range(batch_size):
img_name = data.loc[i * batch_size + j].values
img_name = img_name[0]
x = preprocess_input(img_name)
y = get_output(img_name, file_path)
x_batch[j, :, :, :] = x
y_batch[j] = y
ys = []
for i in range(20):
ys.append(y_batch[:,i])
yield(x_batch, ys)
Had a little problem with labels returned to the model, and got it solved in this question:
training a multi-output keras model
I tested this generator on a single output problem. This custom generator is very slow. The ETA for a single epoch by using this custom generator is around 27 hours, while the builtin generator(using flow_from_directory) takes 25 minutes for a single epoch. What am I doing wrong?
The training process for both tests is identical, except for the generator used. Validation generator is similar to training generator. I know I will not reach the efficiency of Keras' built in generator, but this difference in speed is too much.
EDIT
Some guides I read for creating custom generators.
Writing Custom Keras Generators
custom generator for fit_generator() that yields multiple inputs with different shapes
Maybe the built in generator processes the data on your gpu while your custom generator runs on the cpu, making is significantly slower.
Another guess is because Keras is using Dataset in the background. Your implementation probably uses feed-dict which is the slowest possible way to pass information to TensorFlow. The best way to feed data into the models is to use an input pipeline to ensure that the GPU never has to wait for new stuff to come in.

how to write a generator for Keras fit_generator with a state?

I am trying to feed a large dataset to a keras model.
The dataset does not fit into memory.
It is currently stored as a serie of hd5f files
I want to train my model using
model.fit_generator(my_gen, steps_per_epoch=30, epochs=10, verbose=1)
However, in all the examples I could find online, my_gen was used only to perform data augmentation on a already loaded dataset. For example
def generator(features, labels, batch_size):
# Create empty arrays to contain batch of features and labels#
batch_features = np.zeros((batch_size, 64, 64, 3))
batch_labels = np.zeros((batch_size,1))
while True:
for i in range(batch_size):
# choose random index in features
index= random.choice(len(features),1)
batch_features[i] = some_processing(features[index])
batch_labels[i] = labels[index]
yield batch_features, batch_labels
In my case, it needs to be something like
def generator(features, labels, batch_size):
while True:
for i in range(batch_size):
# choose random index in features
index= # SELECT THE NEXT FILE
batch_features[i] = some_processing(features[files[index]])
batch_labels[i] = labels[file[index]]
yield batch_features, batch_labels
How do I keep track of the files which were already read in previous batch?
From the keras doc
generator: A generator or an instance of Sequence (keras.utils.Sequence) object in order to avoid duplicate data when using multiprocessing. [...]
This means you can write a class inheriting from keras.utils.sequence
class ProductSequence(keras.utils.Sequence):
def __init__(self):
pass
def __len__(self):
pass
def __getitem__(self, idx):
pass
__init__ ist to init the class.
__len__ should return the number of batches per epoch. Keras will use thisto know which index can be passed to __getitem__. __getitem__ will then return the batch data depending on the index.
A simple example can be found here
With this approach you can simpy have an internal class object in which you save which files are already read.
Let us suppose that your data are images. If you have many images you probably won't be able to load all of them in memory and you would like to read from disk in batches.
Keras flow_from _directory is very fast in doing that as it does this in a multi threading way too but it needs all the images to be in different files, according to their class. If we have all the images in the same file and their classes in separated file we could use the generator bellow to load our x,y data.
import pandas as pd
import numpy as np
import cv2
#df_train: data frame with class of every image
#dpath: path of images
classes=list(np.unique(df_train.label))
def batch_generator(ids):
while True:
for start in range(0, len(ids), batch_size):
x_batch = []
y_batch = []
end = min(start + batch_size, len(ids))
ids_batch = ids[start:end]
for id in ids_batch:
img = cv2.imread(dpath+'train/{}.png'.format(id)) #open cv read as BGR
#img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) #BGR to RGB
#img = cv2.resize(img, (224, 224), interpolation = cv2.INTER_CUBIC)
#img = pre_process(img)
labelname=df_train.label.loc[df_train.id==id].values
labelnum=classes.index(labelname)
x_batch.append(img)
y_batch.append(labelnum)
x_batch = np.array(x_batch)
y_batch = to_categorical(y_batch,10)
yield x_batch, y_batch

Tensorflow and reading binary data properly

I am trying to properly read in my own binary data to Tensorflow based on Fixed length records section of this tutorial, and by looking at the read_cifar10 function here. Mind you I am new to tensorflow, so my understanding may be off.
My Data
My files are binary with float32 type. The first 32 bit sample is the label, and the remaining 256 samples are the data. I want to reshape the data at the end to a [2, 128] matrix.
My Code So far:
import tensorflow as tf
import os
def read_data(filename_queue):
item_type = tf.float32
label_items = 1
data_items = 256
label_bytes = label_items * item_type.size
data_bytes = data_items * item_type.size
record_bytes = label_bytes + data_bytes
reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
key, value = reader.read(filename_queue)
record_data = tf.decode_raw(value, item_type)
# labels = tf.cast(tf.strided_slice(record_data, [0], [label_items]), tf.int32)
label = tf.strided_slice(record_data, [0], [label_items])
data0 = tf.strided_slice(record_data, [label_items], [label_items + data_items])
data = tf.reshape(data0, [2, data_items/2])
return data, label
if __name__ == '__main__':
os.environ["CUDA_VISIBLE_DEVICES"] = "0" # Set GPU device
datafiles = ['train_0000.dat', 'train_0001.dat']
num_epochs = 2
filename_queue = tf.train.string_input_producer(datafiles, num_epochs=num_epochs, shuffle=True)
data, label = read_data(filename_queue)
with tf.Session() as sess:
init = tf.global_variables_initializer()
sess.run(init)
(x, y) = read_data(filename_queue)
print(y.eval())
This code hands at the print(y.eval()), but I fear I have much bigger issues than that.
Question:
When I execute this, I get a data and label tensor returned. The problem is I don't quite understand how to actually read the data from the tensor. For example, I understand the autoencoder example here, however this has a mnist.train.next_batch(batch_size) function that is called to read the next batch. Do I need to write that for my function, or is it handled by something internal to my read_data() function. If I need to write that function, what does it look like?
Are their any other obvious things I'm missing? My goal in using this method is to reduce I/O overhead, and not store all of the data in memory, since my file are quite large.
Thanks in advance.
Yes. You are pretty much done. At this point you need to:
1) Write your neural network model model which is supposed to take your data and return a label.
2) Write your cost function C which takes the network prediction and the true label and gives you a cost.
3) Choose and optimizer.
4) Put everything together:
opt = tf.AdamOptimizer(learning_rate=0.001)
datafiles = ['train_0000.dat', 'train_0001.dat']
num_epochs = 2
with tf.Session() as sess:
init = tf.global_variables_initializer()
sess.run(init)
filename_queue = tf.train.string_input_producer(datafiles, num_epochs=num_epochs, shuffle=True)
data, label = read_data(filename_queue)
example_batch, label_batch = tf.train.shuffle_batch(
[data, label], batch_size=128)
y_pred = model(data)
loss = C(label, y_pred)
After which you iterate and minimize the loss with:
opt.minimize(loss)
See also tf.train.string_input_producer behavior in a loop for related information.

Categories

Resources