import numpy as np
import torchvision
from skimage import io  # assuming skimage.io, which reads multi-band TIFFs
from torch.utils.data import DataLoader

img_path = 'G:/tiff/NC_H08_20220419_0600.tif'
img = io.imread(img_path).astype(np.float32)
print(img.shape)

data_tf = torchvision.transforms.Compose([torchvision.transforms.ToTensor()])
train_data = data_tf(img)
print(train_data.shape)

train_loader = DataLoader(dataset=train_data, batch_size=1)
print(len(train_loader))
result:
(2486, 2755, 16)
torch.Size([16, 2486, 2755])
16
I expected len(train_loader) to be 1, but it is 16, and I wonder why.
The DataLoader assumes you pass in a dataset, which is usually not a single piece of data, so it interprets the first dimension as the sample dimension. In your case, it assumes you have 16 pieces of 2D data.
To solve it, add a batch dimension to your train_data. (Or make a Dataset, but that seems like a hassle for your simple use case)
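Continuing from the code above, a minimal sketch of the first option (assuming you want the whole 16-channel image as a single sample):

train_data = train_data.unsqueeze(0)  # shape becomes [1, 16, 2486, 2755]
train_loader = DataLoader(dataset=train_data, batch_size=1)
print(len(train_loader))  # 1; each batch has shape [1, 16, 2486, 2755]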
Related
I have a question about testing the model. I created a test set using tf.keras.utils.image_dataset_from_directory as follows:
batch_size = 32
test_dataset = tf.keras.utils.image_dataset_from_directory(
    '/content/drive/MyDrive/test',
    image_size=(224, 224),
    batch_size=batch_size,
    shuffle=False
)
and I get the output: Found 150 files belonging to 3 classes.
After that, I want to iterate over the test dataset batches by using:
labels_batch = []
for dataset in test_dataset.unbatch():
    image_batch, label_batch = dataset
    labels = label_batch.numpy()
    labels_batch.append(labels)
I understand that each element of the dataset is a <class 'tuple'> with two entries, image_batch and label_batch, both of type <class 'tensorflow.python.framework.ops.EagerTensor'>.
Therefore, image_batch[0] should be the first image in test_dataset. But when I run print(image_batch[0]) to show the first image's array, I get an array of shape=(224, 3), whereas I think each image should have shape=(224, 224, 3).
So what command do I have to use to access the array of each image?
I use TensorFlow 2.9 in Google Colab. I'm not sure about test_dataset.unbatch(); is the problem here or not?
The unbatch method actually returns each individual image, not batches. To get an iterator that yields a batch on each iteration, you should call the batch method instead, or just iterate over the dataset directly, i.e.:
for dataset in test_dataset:
So in your code, image_batch is a single image of shape (224, 224, 3), and image_batch[0] is an array of shape (224, 3), because you sliced along the first dimension.
You might want to check the dataset documentation for a description of each method.
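A minimal sketch of both access patterns (shapes assume the image_size and batch_size above; the last batch may be smaller than 32):

# iterate over batches directly
for image_batch, label_batch in test_dataset:
    print(image_batch.shape)     # (32, 224, 224, 3)
    print(image_batch[0].shape)  # (224, 224, 3) -- the first image of the batch
    break

# or iterate over single images with unbatch()
for image, label in test_dataset.unbatch():
    print(image.shape)           # (224, 224, 3)
    break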
I have a directory with multiple images separated into folders, each holding up to 3000 images. I would like to modify the PyTorch dataset's __getitem__ function so that it returns bags of images, where each bag contains 10 images.
Here is what I have so far:
transform = transforms.Compose([transforms.Resize(255),
                                transforms.CenterCrop(224),
                                transforms.ToTensor()])
dataset = datasets.ImageFolder('./../BCNB/patches/WSI_1', transform=transform)
data_loader = torch.utils.data.DataLoader(dataset, batch_size = 1)
My output of DataLoader should be a tensor with a shape of [1, 10, 3, 256, 256].
Any input would be very helpful!
Thank you very much in advance!
Why do you need "bags of 10 images"? If you need them as mini batches for training -- don't change the Dataset, but use a DataLoader for that. A DataLoader takes a dataset and does the "batching" for you.
Alternatively, you can overload the __getitem__ method and implement your own that returns 10 images instead of just one.
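A minimal sketch of that second option, using a hypothetical wrapper dataset (note that with CenterCrop(224) in your transform the spatial size will be 224, not 256):

import torch
from torch.utils.data import Dataset, DataLoader

class BagDataset(Dataset):
    # hypothetical wrapper: groups consecutive images of an ImageFolder into bags
    def __init__(self, base, bag_size=10):
        self.base = base
        self.bag_size = bag_size

    def __len__(self):
        return len(self.base) // self.bag_size

    def __getitem__(self, idx):
        start = idx * self.bag_size
        imgs = [self.base[start + i][0] for i in range(self.bag_size)]
        return torch.stack(imgs)  # [10, 3, 224, 224]

bags = BagDataset(dataset, bag_size=10)
data_loader = DataLoader(bags, batch_size=1)  # each batch: [1, 10, 3, 224, 224]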
I wish to create a pipeline to provide non-standard files to the neural network (for example with extension *.xxx).
Currently I have structured my code as follows:
1) I define a list of paths where to find training files
2) I define an instance of the tf.data.Dataset object containing these paths
3) I map over the Dataset a Python function that takes each path and returns the associated numpy array (loaded from a folder on the PC); this array has shape [256, 256, 192].
4) I define an initializable iterator and then use it during network training.
My doubt lies in the size of the batches I provide to the network: I would like to feed it batches of size 64. How can I do this?
For example, if I use train_data.batch(b_size) with b_size = 1, then when iterated, the iterator yields one element of shape [256, 256, 192]; what if I wanted to feed the neural net just 64 slices of this array?
This is an extract of my code:
with tf.name_scope('data'):
    train_filenames = tf.constant(list_of_files_train)
    train_data = tf.data.Dataset.from_tensor_slices(train_filenames)
    train_data = train_data.map(lambda filename: tf.py_func(
        self._parse_xxx_data, [filename], [tf.float32]))
    # note: Dataset methods return new datasets, so the results must be reassigned
    train_data = train_data.shuffle(buffer_size=len(list_of_files_train))
    train_data = train_data.batch(b_size)
    iterator = tf.data.Iterator.from_structure(train_data.output_types, train_data.output_shapes)
    input_data = iterator.get_next()
    train_init = iterator.make_initializer(train_data)

[...]

with tf.Session() as sess:
    sess.run(train_init)
    _ = sess.run([self.train_op])
Thanks in advance
----------
I posted a solution to my problem in the comments below. I would still be happy to receive any comment or suggestion on possible improvements. Thank you ;)
It's been a long time, but I'll post a possible solution for batching the dataset with a custom shape in TensorFlow, in case someone needs it.
The module tf.data offers the method unbatch() to unwrap the content of each dataset element. One can first unbatch and then batch the dataset object again in the desired way. Oftentimes it is also a good idea to shuffle the unbatched dataset before batching it again (so that each batch contains random slices from random elements):
with tf.name_scope('data'):
    train_filenames = tf.constant(list_of_files_train)
    train_data = tf.data.Dataset.from_tensor_slices(train_filenames)
    train_data = train_data.map(lambda filename: tf.py_func(
        self._parse_xxx_data, [filename], [tf.float32]))
    # un-batch first, then batch the data again (reassigning each result)
    train_data = train_data.apply(tf.data.experimental.unbatch())
    train_data = train_data.shuffle(buffer_size=BSIZE)
    train_data = train_data.batch(b_size)
# [...]
If I understand your question correctly, you can try to slice the array into the shape you want inside your self._parse_xxx_data function.
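A hypothetical sketch of that idea; the np.fromfile line is only a stand-in, replace it with however your *.xxx files are actually decoded:

import numpy as np

def _parse_xxx_data(self, filename):
    # stand-in reader (an assumption): swap in your real *.xxx decoder;
    # note that tf.py_func passes `filename` as bytes, so decode if needed
    volume = np.fromfile(filename, dtype=np.float32).reshape(256, 256, 192)
    # put the 192 slices on the leading axis so that downstream
    # unbatch()/batch(64) operate slice by slice
    return np.transpose(volume, (2, 0, 1))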
Following this tutorial: https://www.tensorflow.org/versions/r1.3/get_started/mnist/pros
I wanted to solve a classification problem with labeled images by myself. Since I'm not using the MNIST database, I spent days creating my own dataset inside TensorFlow. It looks like this:
#variables
batch_size = 50
dimension = 784
stages = 10

#step 1 read Dataset
filenames = tf.constant(filenamesList)
labels = tf.constant(labelsList)

#step 2 create Dataset
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))

#step 3: parse every image in the dataset using `map`
def _parse_function(filename, label):
    #convert label to one-hot encoding
    one_hot = tf.one_hot(label, stages)
    #read image file
    image_string = tf.read_file(filename)
    image_decoded = tf.image.decode_image(image_string, channels=3)
    image = tf.cast(image_decoded, tf.float32)
    return image, one_hot

#step 4 final input tensor
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size) #batch_size = 100
iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()
images = tf.reshape(images, [batch_size,dimension]).eval()
labels = tf.reshape(labels, [batch_size,stages]).eval()

for _ in range(10):
    dataset = dataset.shuffle(buffer_size = 100)
    dataset = dataset.batch(batch_size)
    iterator = dataset.make_one_shot_iterator()
    images, labels = iterator.get_next()
    images = tf.reshape(images, [batch_size,dimension]).eval()
    labels = tf.reshape(labels, [batch_size,stages]).eval()
    train_step.run(feed_dict={x: images, y_:labels})
Somehow, using higher batch sizes breaks Python. What I'm trying to do is train my neural network with a new batch on each iteration, which is why I'm also using dataset.shuffle(...). Using dataset.shuffle also breaks my Python.
What I wanted to do instead (because shuffle breaks) is to batch the whole dataset. By evaluating it with .eval() I get a numpy array. I would then shuffle that array with numpy.random.shuffle(images) and pick the first elements to train on.
e.g.
for _ in range(1000):
    images = tf.reshape(images, [batch_size,dimension]).eval()
    labels = tf.reshape(labels, [batch_size,stages]).eval()
    #shuffle
    np.random.shuffle(images)
    np.random.shuffle(labels)
    train_step.run(feed_dict={x: images[0:train_size], y_:labels[0:train_size]})
But then I run into the problem that I can't batch my whole dataset; it looks like the data is too big for Python to work with.
How should I solve this differently?
Since I'm not using the MNIST database, there isn't a function like mnist.train.next_batch(100) that comes in handy for me.
Notice how you call shuffle and batch inside your for loop? This is wrong. Datasets in TF work in the style of functional programming, so you are actually defining a pipeline for preprocessing the data to feed into your model. In a way, you give a recipe that answers the question "given this raw data, which operations (map, etc.) should I do to get batches that I can feed into my neural network?"
Now you are modifying that pipeline for every batch! What happens is that in the first iteration the batch shape is, say, [32, 3600]. In the next iteration, the elements of this shape are batched again, to [32, 32, 3600], and so on.
There's a great tutorial on the TF website where you can find out more about how Datasets work, but here are a few suggestions for how you can resolve your problem:
1) Move the shuffling to right after "Step 2" in your code. Then you are shuffling the whole dataset, so your batches will have a good mixture of examples. Also increase the buffer_size argument; it works differently than you probably assume. It's usually a good idea to shuffle as early as possible, as shuffling can be a slow operation on a large dataset: the shuffled part of the dataset has to be read into memory. Here it does not really matter whether you shuffle the filenames and labels or the decoded images and labels, but the latter will have more work to do since the dataset is larger by that point.
2) Move batching and the iterator creation to be the last steps, just before starting your training loop.
3) Don't use feed_dict with Dataset iterators to input data into your model. Instead, define your model in terms of the outputs of iterator.get_next() and omit the feed_dict argument (see the sketch below). See more details in this Q&A: Tensorflow: create minibatch from numpy array > 2 GB
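Putting those suggestions together, a possible sketch of the rearranged pipeline, reusing the names from the question (TF 1.x API; the model wiring is left out):

filenames = tf.constant(filenamesList)
labels = tf.constant(labelsList)

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(buffer_size=len(filenamesList))  # shuffle once, early
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size)
dataset = dataset.repeat()  # re-iterate across epochs

iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()
# build the model on `images`/`labels` directly instead of using feed_dict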
I've been running into a lot of problems creating TensorFlow datasets, so I decided to use OpenCV to import images.
import cv2  # OpenCV's Python package is imported as cv2
import numpy as np

imgDataset = []
for i in range(len(files)):
    imgDataset.append(cv2.imread(files[i]))
imgDataset = np.asarray(imgDataset)
The shape of imgDataset is (num_img, height, width, col_channels), assuming all images share the same dimensions. Getting the i-th image is imgDataset[i].
Shuffling the dataset and taking just one batch of it can be done like this:
from sklearn.utils import shuffle

X, y = shuffle(X, y)
X_feed = X[:batch_size]  # the first batch_size samples (X[batch_size] would be a single element)
y_feed = y[:batch_size]
Then you feed X_feed and y_feed into your model.
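For instance, a hypothetical epoch loop over successive batches, rather than reusing only the first batch_size samples (the train_on_batch call is just an illustration of the training step):

for start in range(0, len(X), batch_size):
    X_feed = X[start:start + batch_size]
    y_feed = y[start:start + batch_size]
    # train on this batch, e.g. model.train_on_batch(X_feed, y_feed)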
I'm using the ImageDataGenerator inside Keras to read a directory of images. I'd like to save the result inside a numpy array, so I can do further manipulations and save it to disk in one file.
flow_from_directory() returns an iterator, which is why I tried the following
itr = gen.flow_from_directory('data/train/', batch_size=1, target_size=(32,32))
imgs = np.concatenate([itr.next() for i in range(itr.nb_sample)])
but that produced
ValueError: could not broadcast input array from shape (32,32,3) into shape (1)
I think I'm misusing the concatenate() function, but I can't figure out where I went wrong.
I had the same problem and solved it the following way:
itr.next() returns the next batch as a tuple of two numpy.ndarray objects, (batch_x, batch_y), which is why concatenating the results directly fails. (Source: keras/preprocessing/image.py)
So what you can do is set the batch_size for flow_from_directory to the size of your whole train dataset.
For example, my whole training set consists of 1481 images:
train_datagen = ImageDataGenerator(rescale=1. / 255)
itr = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_width, img_height),
    batch_size=1481,
    class_mode='categorical')
X, y = itr.next()
While using ImageDataGenerator, the data is loaded as a DirectoryIterator. You can extract it either as batches or as a whole:
train_generator = train_datagen.flow_from_directory(
    train_parent_dir,
    target_size=(300, 300),
    batch_size=32,
    class_mode='categorical'
)
the output of which is:
Found 3875 images belonging to 3 classes.
To extract it as a single numpy array (i.e. not as batches), this code can be used:
# collect each batch once so that images and labels stay aligned
# (calling next() in two separate passes could misalign them when shuffling)
batches = [train_generator.next() for i in range(train_generator.__len__())]
x = np.concatenate([b[0] for b in batches])
y = np.concatenate([b[1] for b in batches])
print(x.shape)
print(y.shape)
NOTE: before this code it is advised to call train_generator.reset().
The output of the above code is:
(3875, 300, 300, 3)
(3875, 3)
The output is obtained as one numpy array, even though the data was loaded in batches of 32 by the ImageDataGenerator.
To get the output as batches, use the following code:
x = []
y = []
train_generator.reset()
for i in range(train_generator.__len__()):
    a, b = train_generator.next()
    x.append(a)
    y.append(b)
x = np.array(x, dtype=object)  # 1-D array of batches; the last batch may be smaller
y = np.array(y, dtype=object)
print(x.shape)
print(y.shape)
The output of this code is:
(122,)
(122,)
Hope this works as a solution.