Reading data into tensorflow and creating Dataset with TF-slim

Reading data into tensorflow and creating Dataset with TF-slim - python

I need to read in many 'images' from .txt files and want to generate a tensorflow dataset with them. Currently, I read in every single matrix with numpy.loadtxt and create an array of shape [N_matrices, height, width, N_channels], and a similar array with the label for every matrix.
I create a tensorflow dataset from these two arrays by using
inputs = tf.convert_to_tensor(x_train, dtype=tf.float32)
labels = tf.convert_to_tensor(y_train, dtype=tf.float32)
dataset = tf.data.Dataset.from_tensor_slices( {"image": inputs,"label": labels})
I now want to make use of the following function to create batches from this dataset (as done here):
def load_batch(dataset, batch_size=BATCH_SIZE, height=LENGTH_INPUT, width=LENGTH_INPUT):
data_provider = slim.dataset_data_provider.DatasetDataProvider(dataset)
image, label = data_provider.get(['image', 'label'])
images, labels = tf.train.batch(
[image, label],
batch_size=batch_size,
allow_smaller_final_batch=True)
return images, labels
However, this gives me the following error:
data_provider = slim.dataset_data_provider.DatasetDataProvider(dataset)
File "/home/.local/lib/python3.5/site-packages/tensorflow/contrib/slim/python/slim/data/dataset_data_provider.py", line 85, in init
dataset.data_sources,
AttributeError: 'TensorSliceDataset' object has no attribute 'data_sources'
Why am I getting this error, and how can I fix it? I also suppose there are much better ways for handling input from txt files to tensorflow (or tensorflow-slim) but I've found very little information on this. How could I generate my Datasets in a better way?

Related

Read image labels from a csv file

I have a dataset of medical images (.dcm) which I can read into TensorFlow as a batch. However, the problem that I am facing is that the labels of these images are in a .csv. The .csv file contains two columns - image_path (location of the image) and image_labels (0 for no; 1 for yes). I wanted to know how I can read the labels into a TensorFlow dataset batch wise. I am using the following code to load the images batch wise:-
import tensorflow as tf
import tensorflow_io as tfio
def process_image(filename):
image_bytes = tf.io.read_file(filename)
image = tf.squeeze(
tfio.image.decode_dicom_image(image_bytes, on_error='strict', dtype=tf.uint16),
axis = 0
)
x = tfio.image.decode_dicom_data(image_bytes, tfio.image.dicom_tags.PhotometricInterpretation)
image = (image - tf.reduce_min(image))/(tf.reduce_max(image) - tf.reduce_min(image))
if(x == "MONOCHROME1"):
image = 1 - image
image = image*255
image = tf.cast(tf.image.resize(image, (512, 512)),tf.uint8)
return image
# train_images is a list containing the locations of .dcm images
dataset = tf.data.Dataset.from_tensor_slices(train_images)
dataset = dataset.map(process_image, num_parallel_calls=4).batch(50)
Hence, I can load the images into the TensorFlow dataset. But I would like to know how I can load the image labels batch wise.

Something like this instead of the last two lines should work:
#train_labels is a list of labels for each image in the same order as in train_images
dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels))
dataset = dataset.map(lambda x,y : (process_image(x), y), num_parallel_calls=4).batch(50)
now the dataset can be passed to your network's .fit(), .predict() and other methods:
model.fit(dataset, epochs=epochs, callbacks=callbacks)
Alternatively, you can create a second dataset containing the labels and then combine two datasets with tf.data.Dataset.zip(). It works similarly to the python's native zip.
I prefer the first method since It feels a bit cleaner to me + I can, for example, shuffle the filenames/labels and only then parse the files instead of doing the opposite.

Flatten Dataset of multiple files tensorflow

I'm trying to read the CIFAR-10 dataset from 6 .bin files, and then create a initializable_iterator. This is the site I downloaded the data from, and it also contains a description of the structure of the binary files. Each file contains 2500 images. The resulting iterator, however, only generates one tensor for each file, a tensor of size (2500,3703). Here is my code
import tensorflow as tf
filename_dataset = tf.data.Dataset.list_files("cifar-10-batches-bin/*.bin")
image_dataset = filename_dataset.map(lambda x: tf.decode_raw(tf.read_file(x), tf.float32))
iter_ = image_dataset.make_initializable_iterator()
next_file_data = iter_.get_next()I
next_file_data = tf.reshape(next_file_data, [-1,3073])
next_file_img_data, next_file_labels = next_file_data[:,:-1], next_file_data[:,-1]
next_file_img_data = tf.reshape(next_file_img_data, [-1,32,32,3])
init_op = iter_.initializer
with tf.Session() as sess:
sess.run(init_op)
print(next_file_img_data.eval().shape)
_______________________________________________________________________
>> (2500,32,32,3)
The first two lines are based on this answer. I would like to be able to specify the number of images generated by get_next(), using batch() rather than it being the number of images in each .bin file, which here is 2500.
There has already been a question about flattening a dataset here, but the answer is not clear to me. In particular, the question seems to contain a code snippet from a class function which is defined elsewhere, and I am not sure how to implement it.
I have also tried creating the dataset with tf.data.Dataset.from_tensor_slices(), replacing the first line above with
import os
filenames = [os.path.join('cifar-10-batches-bin',f) for f in os.listdir("cifar-10-batches-bin") if f.endswith('.bin')]
filename_dataset = tf.data.Dataset.from_tensor_slices(filenames)
but this didn't solve the problem.
Any help would be very much appreciated. Thanks.

I am not sure how your bin file is structured. I am assuming 32*32*3 = 3072 points per image is present in each file. So the data present in each file is a multiple of 3072. However for any other structure, the kind of operations would be similar, so this can still serve as a guide for that.
You could do a series of mapping operations:
import tensorflow as tf
filename_dataset = tf.data.Dataset.list_files("cifar-10-batches-bin/*.bin")
image_dataset = filename_dataset.map(lambda x: tf.decode_raw(tf.read_file(x), tf.float32))
image_dataset = image_dataset.map(lambda x: tf.reshape(x, [-1, 32, 32, 3]) # Reshape your data to get 2500, 32, 32, 3
image_dataset = image_dataset.flat_map(lambda x: tf.data.Dataset.from_tensor_slices(x)) # This operation would give you tensors of shape 32,32,3 and put them all together.
image_dataset = image_dataset.batch(batch_size) # Now you can define your batchsize

how to train model with batches

I trying yolo model in python.
To process the data and annotation I'm taking the data in batches.
batchsize = 50
#boxList= []
#boxArr = np.empty(shape = (0,26,5))
for i in range(0, len(box_list), batchsize):
boxList = box_list[i:i+batchsize]
imagesList = image_list[i:i+batchsize]
#to convert the annotation from VOC format
convertedBox = np.array([np.array(get_boxes_for_id(box_l)) for box_l in boxList])
#pre-process on image and annotaion
image_data, boxes = process_input_data(imagesList,max_boxes,convertedBox)
boxes = np.array(list(itertools.chain.from_iterable(boxes)))
detectors_mask, matching_true_boxes = get_detector_mask(boxes, anchors)
after this, I want to pass my data to the model to train.
when I append the list it gives memory error because of array size.
and when i append array gives dimensionality error because of shape.
how can i train the data and what shoud i use model.fit() or model.train_on_batch()

If you are using Keras to Train your model with a bunch of Images you can use Train generator and validation generator, all you have to do is put your images in there respective class folders. look at a sample code . also take a look at this link maybe it may help you https://keras.io/preprocessing/image/ . i hope i have answered your question unless i did not understand it

Tensorflow: Batching whole dataset (MNIST Tutorial)

Following this tutorial: https://www.tensorflow.org/versions/r1.3/get_started/mnist/pros
I wanted to solve a classification problem with labeled images by myself. Since I'm not using the MNIST database, I spent days creating my own dataset inside tensorflow. It looks like this:
#variables
batch_size = 50
dimension = 784
stages = 10
#step 1 read Dataset
filenames = tf.constant(filenamesList)
labels = tf.constant(labelsList)
#step 2 create Dataset
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
#step 3: parse every image in the dataset using `map`
def _parse_function(filename, label):
#convert label to one-hot encoding
one_hot = tf.one_hot(label, stages)
#read image file
image_string = tf.read_file(filename)
image_decoded = tf.image.decode_image(image_string, channels=3)
image = tf.cast(image_decoded, tf.float32)
return image, one_hot
#step 4 final input tensor
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size) #batch_size = 100
iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()
images = tf.reshape(images, [batch_size,dimension]).eval()
labels = tf.reshape(labels, [batch_size,stages]).eval()
for _ in range(10):
dataset = dataset.shuffle(buffer_size = 100)
dataset = dataset.batch(batch_size)
iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()
images = tf.reshape(images, [batch_size,dimension]).eval()
labels = tf.reshape(labels, [batch_size,stages]).eval()
train_step.run(feed_dict={x: images, y_:labels})
Somehow using a higher batch_sizes will break python. What I'm trying to do is to train my neural network with new batches on each iteration. That's why Im also using dataset.shuffle(...). Using dataset.shuffle also breaks my Python.
What I wanted to do (because shuffle breaks) is to batch the whole dataset. By evaluating ('.eval()') I will get a numpy array. I will then shuffle the array with numpy.random.shuffle(images) and then pick up some the first elements to train it.
e.g.
for _ in range(1000):
images = tf.reshape(images, [batch_size,dimension]).eval()
labels = tf.reshape(labels, [batch_size,stages]).eval()
#shuffle
np.random.shuffle(images)
np.random.shuffle(labels)
train_step.run(feed_dict={x: images[0:train_size], y_:labels[0:train_size]})
But then here comes the problem that I can't batch the my whole dataset. It looks like that the data is too big for python to work with.
How should I solve this differently?
Since I'm not using the MNIST database there isn't a function like mnist.train.next_batch(100) which comes handy for me.

Notice how you call shuffle and batch inside your for loop? This is wrong. Datasets in TF work in the style of functional programming, so you are actually defining a pipeline for preprocessing the data to feed into your model. In a way, you give a recipe that answers the question "given this raw data, which operations (map, etc.) should I do to get batches that I can feed into my neural network?"
Now you are modifying that pipeline for every batch! What happens is that the first iteration, the batch size is, say [32 3600]. The next iteration, the elements of this shape are batched again, to [32 32 3600], and so on.
There's a great tutorial on the TF website where you can find out more how Datasets work, but here are a few suggestions how you can resolve your problem.
Move the shuffling to right after "Step 2" in your code. Then you are shuffling the whole dataset so your batches will have a good mixture of examples. Also increase the buffer_size argument, this works in a different way than you probably assume. It's usually a good idea to shuffle as early as possible, as it can be a slow operation if you have a large dataset -- the shuffled part of dataset will have to be read into memory. Here it does not really matter whether you shuffle the filenames and labels, or the read images and labels -- but the latter will have more work to do since the dataset is larger by that time.
Move batching and the iterator generator to be the last steps, just before starting your training loop.
Don't use feed_dict with Dataset iterators to input data into your model. Instead, define your model in terms of the outputs of iterator.get_next() and omit the feed_dict argument. See more details from this Q&A: Tensorflow: create minibatch from numpy array > 2 GB

Ive been getting through a lot of problems with creating tensorflow datasets. So I decided to use OpenCV to import images.
import opencv as cv
imgDataset = []
for i in range(len(files)):
imgDataset.append(cv2.imread(files[i]))
imgDataset = np.asarray(imgDataset)
the shape of imgDataset is (num_img, height, width, col_channels). Getting the i-th image should be imgDataset[i].
shuffling the dataset and getting only batches of it can be done like this:
from sklearn.utils import shuffle
X,y = shuffle(X, y)
X_feed = X[batch_size]
y_feed = y[batch_size]
Then you feed X_feed and y_feed into your model

Pytorch: Can’t load images using ImageFolder

I’m trying to load images using “ImageFolder”.
data_dir = './train_dog' # directory structure is
train_dog/image
dset = datasets.ImageFolder(data_dir, transform)
train_loader = torch.utils.data.DataLoader(dset, batch_size=128, shuffle=True)
However, it seems not working. So I checked the stored data as below
print dset[0][0]
Then it shows only 3 tensors(size 64x64).
[torch.FloatTensor of size 3x64x64]
There are more than 10,000 images in the folder. How come it can’t store all data?

You should try this:
print len(dset)
which represents the size of the dataset, aka the number of image files.
dset[0] means the (shuffled) first index of the dataset, where dset[0][0] contains the input image tensor and dset[0][1] contains the corresponding label or target.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.