How can I solve the wrong shape in DataLoader? - python

I have a text dataset that I want to use for a GAN, so it has to be one-hot encoded. This is how I create a custom Dataset for my files:
class Dataset2(torch.utils.data.Dataset):
    def __init__(self, list_, labels):
        'Initialization'
        self.labels = labels
        self.list_IDs = list_
    def __len__(self):
        'Denotes the total number of samples'
        return len(self.list_IDs)
    def __getitem__(self, index):
        'Generates one sample of data'
        # Select sample
        mylist = self.list_IDs[index]
        # Load data and get label
        X = F.one_hot(mylist, num_classes=len(alphabet))
        y = self.labels[index]
        return X, y
It works fine whenever I index the dataset directly, but when I wrap it in a DataLoader the shape is not the same as what comes out of the dataset. This is the shape that comes out of the dataset:
x , _ = dataset[1]
x.shape
torch.Size([1274, 22])
and this is the shape that comes out of the DataLoader:
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)
one = []
for epoch in range(epochs):
    for i, (real_data, _) in enumerate(dataloader):
        one.append(real_data)
one[3].shape
torch.Size([4, 1274, 22])
This 4 is the number of samples in my data, but it should not be there. How can I fix this problem?

You confirmed that your dataset contains only four elements. You wrapped it in a data loader with batch_size=64, which is greater than 4, so the data loader outputs a single batch containing all 4 elements.
In turn, this means you append only one element per epoch, so one[3] is the batch from the fourth epoch (the only batch the data loader produces), shaped (4, 1274, 22). The leading 4 is the batch dimension.
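To make the batch dimension concrete, here is a minimal sketch. It assumes a small stand-in tensor of shape (4, 5, 3) instead of the real one-hot data, and the names samples, labels, and single_loader are illustrative only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for the one-hot data: 4 samples of shape (5, 3).
samples = torch.zeros(4, 5, 3)
labels = torch.arange(4)
dataset = TensorDataset(samples, labels)

# A single item taken from the dataset has no batch dimension.
x, _ = dataset[1]
item_shape = tuple(x.shape)

# batch_size exceeds the dataset size, so one batch holds every sample.
loader = DataLoader(dataset, batch_size=64, shuffle=True)
batch_shapes = [tuple(xb.shape) for xb, _ in loader]

# Even with batch_size=1 the loader keeps a leading batch dimension of size 1.
single_loader = DataLoader(dataset, batch_size=1)
single_shapes = [tuple(xb.shape) for xb, _ in single_loader]

print(item_shape, batch_shapes, single_shapes[0])
```

If a single sample is needed inside the training loop, indexing the batch (for example real_data[0]) recovers the per-item shape.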

Related

Pytorch DataLoader is not dividing the dataset into batches

I am trying to load training data in the DataLoader with following code
class Dataset(Dataset):
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __getitem__(self, index):
        x = torch.Tensor(self.x[index])
        y = torch.Tensor(self.y[index])
        return (x, y)
    def __len__(self):
        count = self.x.shape[0]
        return count
X_train = np.reshape(X_train,(-1,1,X_train.shape[0],X_train.shape[1]))
y_train = np.reshape(y_train,(-1,1,y_train.shape[0],y_train.shape[1]))
train_dataset = Dataset(X_train, y_train)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,batch_size=128,shuffle=True)
Now, when I check the length of the DataLoader, I get a single batch every time. The loader is not splitting the dataset into batches. What am I doing wrong here?
After testing your code, it seems to work perfectly if you remove the reshape steps. The reshape introduces a new leading dimension, so the new shape of X_train is (1, something, something), but you index your items with self.x[index], so you are always indexing that new dimension of size 1. You make the same mistake when calculating the length of your dataset: it is always 1.
Solution: do not reshape.
X_train = np.random.rand(12_000, 1280)
y_train = np.random.rand(12_000, 1)
train_dataset = Dataset(X_train, y_train)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,batch_size=128,shuffle=True)
for x, y in train_loader:
    print(x.shape)
    print(y.shape)
    break
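The effect of the reshape on the dataset length can be checked with NumPy alone. This is a small sketch using a hypothetical 12 x 5 array in place of the real X_train:

```python
import numpy as np

# Hypothetical small stand-in for X_train: 12 samples with 5 features each.
X_train = np.random.rand(12, 5)

# The reshape from the question prepends two axes of size 1.
reshaped = np.reshape(X_train, (-1, 1, X_train.shape[0], X_train.shape[1]))

# __len__ in the question returns x.shape[0], so compare both cases.
len_original = X_train.shape[0]
len_reshaped = reshaped.shape[0]

print(reshaped.shape, len_original, len_reshaped)
```

With a reported length of 1, the DataLoader sees one sample and therefore produces one batch, whatever batch_size is.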

Keras - multi-input model using flow_from_dataframe

Problem
I am trying to build a multi-input model in keras using two inputs, image and text. I am using the flow_from_dataframe method, passing it a pandas dataframe containing the image names as well as the respective text (as a vectorized feature representation) for each image and the target label/class. As such, the dataframe looks as follows:
ID   path            text-features         label
111  'cat001.jpg'    [0.0, 1.0, 0.0,...]   cat
112  'dog001.jpg'    [1.0, 0.0, 1.0,...]   dog
113  'bunny001.jpg'  [0.0, 1.0, 1.0,...]   bunny
...
After constructing my model using the Keras functional API, I feed both inputs into the model like so:
model = Model(inputs=[images, text], outputs=output)
For the images I use an ImageDataGenerator as suggested in the docs (https://keras.io/preprocessing/image/#flow_from_dataframe) :
datagen=ImageDataGenerator(rescale=1./255,validation_split=0.15)
train_generator=datagen.flow_from_dataframe(dataframe=df, directory=data_dir, x_col=path, y_col="label", has_ext=True, class_mode="categorical", target_size=(224,224), batch_size=batch_size,subset="training")
validation_generator=datagen.flow_from_dataframe(dataframe=df, directory=data_dir, x_col=path, y_col="label", has_ext=True, class_mode="categorical", target_size=(224,224), batch_size=batch_size,subset="validation")
So far so good, but now I am stuck on how to feed the text-features within my dataframe to the model as well during training.
Question
How can I modify the flow_from_dataframe generator in order to handle the text-feature data in the dataframe as well as the images during training? Also, since I can't find any example of this sort of modification on flow_from_dataframe I am wondering if I am approaching this problem wrong i.e. is there any better method of achieving this?
UPDATE
Meanwhile I've been trying to write my own generator following the guide I found here (https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly) and adjusting it to my needs. This is what I came up with:
from matplotlib.image import imread
class DataGenerator(keras.utils.Sequence):
    def __init__(self, list_IDs, labels, batch_size=32, dim=(32,32,32), n_channels=1,
                 n_classes=10, shuffle=True):
        #'Initialization'
        self.dim = dim
        self.batch_size = batch_size
        self.labels = labels
        self.list_IDs = list_IDs
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.shuffle = shuffle
        self.on_epoch_end()
    def on_epoch_end(self):
        #'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle == True:
            np.random.shuffle(self.indexes)
    # method for producing batches of data.
    # takes as argument the list of IDs of the target batch
    def __data_generation(self, list_IDs_temp):
        #'Generates data containing batch_size samples' # X : (n_samples, *dim, n_channels)
        # Initialization
        X = np.empty((self.batch_size, *self.dim, self.n_channels))
        Xtext = np.empty((self.batch_size, 7576))
        y = np.empty((self.batch_size), dtype=int)
        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Store sample
            X[i,] = imread('C:/Users/aaron/Desktop/training/'+str(ID)) # <--- all files are in the same DIR
            Xtext[i,] = np.array(total_data[df.path== str(ID)]["text-features"].values) # <--- I look-up the text-features by using the ID as a filter with the path column. This line throws the error.
            # Store class
            y[i] = self.labels[ID]
        return X, Xtext, keras.utils.to_categorical(y, num_classes=self.n_classes)
    def __len__(self):
        #'Denotes the number of batches per epoch'
        return int(np.floor(len(self.list_IDs) / self.batch_size))
    # Now, when the batch corresponding to a given index is called,
    # the generator executes the __getitem__ method to generate it.
    def __getitem__(self, index):
        #'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]
        # Generate data
        X,Xtext, y = self.__data_generation(list_IDs_temp)
        return X,Xtext, y
And I initialize the generator as follows:
partition = {}
partition['train'] = X_train.path.values
partition['validation'] = X_test.path.values
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
encoded_labels = le.fit_transform(df.label)
labels = pd.Series(encoded_labels,index=df.path).to_dict()
# Parameters
params = {'dim': (224,224),
'batch_size': 64,
'n_classes': 5,
'n_channels': 3,
'shuffle': True}
# Generators
training_generator = DataGenerator(partition['train'], labels, **params)
validation_generator = DataGenerator(partition['validation'], labels, **params)
Using this generator, however, throws an error:
ValueError: setting an array element with a sequence.
caused by the line X_text[i,] = np.array(total_data[total_data.bust == str(ID)].text.values) in my code above. Any suggestions on how to solve this?
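One plausible cause, offered as an assumption and not a confirmed diagnosis: if each cell of the text-features column holds a Python list, then filtering down to one row and taking .values gives a length-1 object array whose single element is that list. Assigning it to a float row asks NumPy to convert a whole list into one number. The sketch below reproduces that situation with NumPy only; features, values, and the sizes are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical stand-in for what `.values` returns when a column stores one
# Python list per row: a length-1 object array whose only element is the list.
features = [0.0, 1.0, 0.0]
values = np.empty(1, dtype=object)
values[0] = features

Xtext = np.empty((2, 3))

# Assigning the object array itself asks NumPy to turn a whole list into a
# single float, which is expected to fail.
try:
    Xtext[0,] = values
    failed = False
except (ValueError, TypeError):
    failed = True

# Unwrapping the single element first yields a plain float vector.
Xtext[0,] = np.asarray(values[0], dtype=float)

print(failed, Xtext[0])
```

Under that assumption, indexing the first element of .values before converting it would be the corresponding change in the generator.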

How can I get the indices of the data used in every batch?

I need to save the indices of the data that are used in every mini-batch.
For example if my data is:
x = np.array([[1.1], [2.2], [3.3], [4.4]])
and the first mini-batch is [1.1] and [3.3], then I want to store 0 and 2 (since [1.1] is the 0th observation and [3.3] is the 2nd observation).
I am using tensorflow in eager execution with the keras.sequential APIs.
As far as I can tell from reading the source code, this information is not stored anywhere so I was unable to do this with a callback.
I am currently solving my problem by creating an object that stores the indices.
class IndexIterator(object):
    def __init__(self, n, n_epochs, batch_size, shuffle=True):
        data_ix = np.arange(n)
        if shuffle:
            np.random.shuffle(data_ix)
        self.ix_batches = np.array_split(data_ix, np.ceil(n / batch_size))
        self.batch_indices = []
    def generate_arrays(self, x, y):
        batch_ixs = np.arange(len(self.ix_batches))
        while 1:
            np.random.shuffle(batch_ixs)
            for batch in batch_ixs:
                self.batch_indices.append(self.ix_batches[batch])
                yield (x[self.ix_batches[batch], :], y[self.ix_batches[batch], :])
data_gen = IndexIterator(n=32, n_epochs=100, batch_size=16)
dnn.fit_generator(data_gen.generate_arrays(x, y),
                  steps_per_epoch=2,
                  epochs=100)
# This is what I am looking for
print(data_gen.batch_indices)
Is there no way to do this using a tensorflow callback?
Not sure if this will be more efficient than your solution, but it is certainly more general.
If you have training data with n indices, you can create a secondary Dataset that contains only these indices and zip it with the "real" dataset.
For example:
real_data = tf.data.Dataset ...
indices = tf.data.Dataset.from_tensor_slices(tf.range(data_set_length))
total_dataset = tf.data.Dataset.zip((real_data, indices))
# Perform optional pre-processing ops.
iterator = total_dataset.make_one_shot_iterator()
# Next line yields `(original_data_element, index)`
item_and_index_tuple = iterator.get_next()
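The same zip-with-indices idea can be sketched for eager execution. This assumes TensorFlow 2.x, where a dataset can be iterated directly with a for loop; the variable names and the tiny four-row array are illustrative only.

```python
import tensorflow as tf

# The four observations from the question.
x = tf.constant([[1.1], [2.2], [3.3], [4.4]])

data = tf.data.Dataset.from_tensor_slices(x)
indices = tf.data.Dataset.range(4)

# Zip first, then shuffle and batch, so each row keeps its original index.
paired = tf.data.Dataset.zip((data, indices)).shuffle(4).batch(2)

seen = []  # one list of original indices per mini-batch
for batch_x, batch_idx in paired:
    seen.append(batch_idx.numpy().tolist())

print(seen)
```

Zipping before shuffling is the important design choice here: shuffling the two datasets separately would break the correspondence between rows and indices.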

How to iterate over two dataloaders simultaneously using pytorch?

I am trying to implement a Siamese network that takes in two images. I load these images and create two separate dataloaders.
In my loop I want to go through both dataloaders simultaneously so that I can train the network on both images.
for i, data in enumerate(zip(dataloaders1, dataloaders2)):
    # get the inputs
    inputs1 = data[0][0].cuda(async=True);
    labels1 = data[0][1].cuda(async=True);
    inputs2 = data[1][0].cuda(async=True);
    labels2 = data[1][1].cuda(async=True);
    labels1 = labels1.view(batchSize,1)
    labels2 = labels2.view(batchSize,1)
    # zero the parameter gradients
    optimizer.zero_grad()
    # forward + backward + optimize
    outputs1 = alexnet(inputs1)
    outputs2 = alexnet(inputs2)
The return value of the dataloader is a tuple.
However, when I try to use zip to iterate over them, I get the following error:
OSError: [Errno 24] Too many open files
Exception NameError: "global name 'FileNotFoundError' is not defined" in <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f2d3c00c190>> ignored
Shouldn't zip work on all iterable items? But it seems like here I can't use it on dataloaders.
Is there any other way to pursue this? Or am I approaching the implementation of a Siamese network incorrectly?
In addition to what has already been mentioned, cycle() and zip() might create a memory leak, especially when using image datasets. To avoid that, instead of iterating like this:
dataloaders1 = DataLoader(DummyDataset(0, 100), batch_size=10, shuffle=True)
dataloaders2 = DataLoader(DummyDataset(0, 200), batch_size=10, shuffle=True)
num_epochs = 10
for epoch in range(num_epochs):
    for i, (data1, data2) in enumerate(zip(cycle(dataloaders1), dataloaders2)):
        do_cool_things()
you could use:
dataloaders1 = DataLoader(DummyDataset(0, 100), batch_size=10, shuffle=True)
dataloaders2 = DataLoader(DummyDataset(0, 200), batch_size=10, shuffle=True)
num_epochs = 10
for epoch in range(num_epochs):
    dataloader_iterator = iter(dataloaders1)
    for i, data1 in enumerate(dataloaders2):
        try:
            data2 = next(dataloader_iterator)
        except StopIteration:
            # dataloaders1 is exhausted: start a new pass over it
            dataloader_iterator = iter(dataloaders1)
            data2 = next(dataloader_iterator)
        do_cool_things()
Bear in mind that if you use labels as well, you should replace data1 in this example with (inputs1, targets1) and data2 with (inputs2, targets2), as @Sajad Norouzi said.
Credit goes to this comment: https://github.com/pytorch/pytorch/issues/1917#issuecomment-433698337
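The restart-on-StopIteration pattern above can be wrapped in a small generator. This is a sketch only: paired_batches is a hypothetical helper name, plain lists stand in for the two DataLoaders, and the shorter iterable is assumed to be non-empty.

```python
def paired_batches(short_loader, long_loader):
    """Yield (long_batch, short_batch) pairs, restarting short_loader when it runs out.

    Assumes short_loader is non-empty and can be iterated more than once.
    """
    short_iter = iter(short_loader)
    for long_batch in long_loader:
        try:
            short_batch = next(short_iter)
        except StopIteration:
            # Start a fresh pass over the shorter iterable.
            short_iter = iter(short_loader)
            short_batch = next(short_iter)
        yield long_batch, short_batch


# Plain lists stand in for the two DataLoaders.
pairs = list(paired_batches(["a", "b"], [1, 2, 3, 4, 5]))
print(pairs)
```

Because a new iterator is created on each restart, nothing from the previous pass has to be kept in memory, which is the point of preferring this over cycle().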
If you want to iterate over two datasets simultaneously, there is no need to define your own dataset class; just use TensorDataset as below:
dataset = torch.utils.data.TensorDataset(dataset1, dataset2)
dataloader = DataLoader(dataset, batch_size=128, shuffle=True)
for index, (xb1, xb2) in enumerate(dataloader):
    ....
If you want the labels as well, or want to iterate over more than two datasets, just pass them as additional arguments to TensorDataset after dataset2.
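A self-contained version of this idea might look as follows. It assumes the two image sets are already tensors with the same number of samples (TensorDataset requires matching first dimensions); the shapes and the names images1, images2, and labels are hypothetical.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical data: two aligned image tensors plus one label per pair.
images1 = torch.randn(10, 3, 8, 8)
images2 = torch.randn(10, 3, 8, 8)
labels = torch.randint(0, 2, (10,))

# Sample i of the dataset is (images1[i], images2[i], labels[i]).
dataset = TensorDataset(images1, images2, labels)
loader = DataLoader(dataset, batch_size=4, shuffle=True)

# Shuffling permutes whole samples, so the pairs stay aligned.
shapes = [(tuple(a.shape), tuple(b.shape), tuple(y.shape)) for a, b, y in loader]
print(shapes)
```

Since both images come from the same sample index, shuffling cannot break the correspondence between the two inputs, unlike two independently shuffled loaders.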
To complete @ManojAcharya's answer:
The error you are getting comes neither from zip() nor from DataLoader() directly. Python is trying to tell you that it couldn't find one of the data files you are asking for (cf. FileNotFoundError in the exception trace), probably in your Dataset.
Below is a working example using DataLoader and zip together. Note that if you want to shuffle your data, it becomes difficult to keep the correspondence between the two datasets. This justifies @ManojAcharya's solution.
import torch
from torch.utils.data import DataLoader, Dataset
class DummyDataset(Dataset):
    """
    Dataset of numbers in [a,b] inclusive
    """
    def __init__(self, a=0, b=100):
        super(DummyDataset, self).__init__()
        self.a = a
        self.b = b
    def __len__(self):
        return self.b - self.a + 1
    def __getitem__(self, index):
        return index, "label_{}".format(index)
dataloaders1 = DataLoader(DummyDataset(0, 9), batch_size=2, shuffle=True)
dataloaders2 = DataLoader(DummyDataset(0, 9), batch_size=2, shuffle=True)
for i, data in enumerate(zip(dataloaders1, dataloaders2)):
    print(data)
# ([tensor([ 4, 7]), ('label_4', 'label_7')], [tensor([ 8, 5]), ('label_8', 'label_5')])
# ([tensor([ 1, 9]), ('label_1', 'label_9')], [tensor([ 6, 9]), ('label_6', 'label_9')])
# ([tensor([ 6, 5]), ('label_6', 'label_5')], [tensor([ 0, 4]), ('label_0', 'label_4')])
# ([tensor([ 8, 2]), ('label_8', 'label_2')], [tensor([ 2, 7]), ('label_2', 'label_7')])
# ([tensor([ 0, 3]), ('label_0', 'label_3')], [tensor([ 3, 1]), ('label_3', 'label_1')])
Adding to @Aldream's solution: when the datasets have different lengths and we want to pass through all of them in the same epoch, we can use cycle() from itertools, which is part of the Python standard library. Using @Aldream's code snippet, the updated code looks like this:
from torch.utils.data import DataLoader, Dataset
from itertools import cycle
class DummyDataset(Dataset):
    """
    Dataset of numbers in [a,b] inclusive
    """
    def __init__(self, a=0, b=100):
        super(DummyDataset, self).__init__()
        self.a = a
        self.b = b
    def __len__(self):
        return self.b - self.a + 1
    def __getitem__(self, index):
        return index
dataloaders1 = DataLoader(DummyDataset(0, 100), batch_size=10, shuffle=True)
dataloaders2 = DataLoader(DummyDataset(0, 200), batch_size=10, shuffle=True)
num_epochs = 10
for epoch in range(num_epochs):
    for i, data in enumerate(zip(cycle(dataloaders1), dataloaders2)):
        print(data)
With only zip(), iteration stops as soon as the shorter data loader is exhausted (here the one built from DummyDataset(0, 100)). With cycle(), the shorter data loader is repeated until the iterator has seen all the samples from the larger dataset (here DummyDataset(0, 200)).
P.S. One can argue that this approach is not required for convergence as long as samples are drawn randomly, but it may make evaluation easier.
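The difference in the number of iterations can be checked without PyTorch. In this sketch, lists of batch numbers stand in for the two data loaders; the counts assume the inclusive DummyDataset above, so 101 and 201 samples with a batch size of 10.

```python
import math
from itertools import cycle

batch_size = 10
# DummyDataset(0, 100) holds 101 samples, DummyDataset(0, 200) holds 201.
n_small = math.ceil(101 / batch_size)
n_large = math.ceil(201 / batch_size)

# Lists of batch numbers stand in for the two DataLoaders.
small = list(range(n_small))
large = list(range(n_large))

# zip() alone stops with the shorter iterable.
zip_steps = sum(1 for _ in zip(small, large))

# cycle() repeats the shorter iterable until the longer one is exhausted.
cycle_steps = sum(1 for _ in zip(cycle(small), large))

print(n_small, n_large, zip_steps, cycle_steps)
```

Note that itertools.cycle keeps a copy of every item it has yielded so that it can replay them, which is why it can be costly in memory when the items are batches of images.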
I see you are struggling to write a proper dataloader function. I would do:
class Siamese(Dataset):
    def __init__(self, transform=None):
        #init data here
        pass
    def __len__(self):
        return #length of the data
    def __getitem__(self, idx):
        #get images and labels here
        #returned images must be tensor
        #labels should be int
        return img1, img2 , label1, label2

keras predict_generator is shuffling its output when using a keras.utils.Sequence

I am using keras to build a model that inputs 720x1280 images and outputs a value.
I am having a problem with keras.models.Sequential.predict_generator when using the keras.utils.Sequence class to obtain the values corresponding to images on the validation/training sets. The values returned are shuffled, so I don't know which output corresponds to which image.
This is how my generators are defined
from skimage.io import ImageCollection, imread
from keras.utils import Sequence
def load_images(f):
    return imread(f).astype(np.float64)
class DataSetImageKeras(Sequence):
    def __init__(self, image_collection, values, batch_size):
        self.images = image_collection
        self.hf = values
        self.batch_size = batch_size
        self.n = len(self.images)
        self.x_scale = 250
        self.y_scale = 1e4
    def __len__(self):
        return int(np.ceil(len(self.images) / float(self.batch_size)))
    def __getitem__(self, idx):
        # batch_x is a numpy.ndarray
        batch_x = (
            self.images[idx:min(idx + self.batch_size, self.n)]
            .concatenate()
            .reshape(self.batch_size, 720, 1280, 1)
        )
        batch_y = self.hf[idx:min(idx + self.batch_size, self.n)]
        return batch_x/self.x_scale, batch_y/self.y_scale
images_train = ImageCollection(images_paths_train, load_func=load_images)
images_val = ImageCollection(images_paths_test, load_func=load_images)
data_train = DataSetImageKeras(images_train, values_train, n_batch)
data_val = DataSetImageKeras(images_val, values_val, n_batch)
from keras.models import load_model
model = load_model('model001') #this model is already trained
If I use the following code:
val_result = []
val_hf =[]
for (batch_x, batch_y) in data_val:
    val_result.append(model.predict_on_batch(batch_x))
    val_hf.append(batch_y)
val_result = np.concatenate(val_result)
val_hf = np.concatenate(val_hf)
plt.plot(val_hf,
         val_result,
         marker='.',
         linestyle='')
The correct result is obtained (as seen on this image where x is the desired value and y is the predicted value)
However if I use the predict_generator function, as below:
val_result = model.predict_generator(data_val, verbose=1,
                                     workers=1,
                                     max_queue_size=50,
                                     use_multiprocessing=False)
The output is shuffled as can be seen here.
My problem is similar to #5048 and #6745, which should be solved by #6891 API, but I am using keras version 2.1.6 and it is still shuffling my predictions, even when using workers=1.
It is also similar to this, but I didn't find anything that could reset the generators and this problem is still present if I define a new generator and try to run the predict_generator.
I also found something stating that it could have something to do with the number of batches not dividing exactly the number of samples, but this problem is still present if I use n_batch=1
As a side note, it might be that predict_generator is not shuffling data, but only returning it with an index offset, since the input data on values and images_paths are already shuffled.
predict_generator was not shuffling my predictions after all. The problem was in the __getitem__ method. For instance, using n_batch=32, the method would yield values from 1 to 32, then from 2 to 33 and so forth, instead of from 1 to 32, then 33 to 64, etc.
Changing the method as follows solves the problem:
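def __getitem__(self, idx):
    # batch_x is a numpy.ndarray
    idx_min = idx*self.batch_size
    idx_max = min(idx_min + self.batch_size, self.n)
    batch_x = (
        self.images[idx_min:idx_max]
        .concatenate()
        # -1 lets the last batch be smaller than batch_size
        .reshape(-1, 720, 1280, 1)
    )
    batch_y = self.hf[idx_min:idx_max]
    # same scaling and return value as in the original method
    return batch_x/self.x_scale, batch_y/self.y_scale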
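The index arithmetic behind this fix can be checked in isolation. In the sketch below, batch_bounds is a hypothetical helper and the sizes (70 samples, batches of 32) are made up for illustration; it compares the corrected slice bounds with the ones the original __getitem__ produced.

```python
import math


def batch_bounds(idx, batch_size, n):
    """Return the [start, stop) sample range covered by batch number idx."""
    start = idx * batch_size
    stop = min(start + batch_size, n)
    return start, stop


n, batch_size = 70, 32
num_batches = math.ceil(n / batch_size)

# Corrected bounds: consecutive, non-overlapping ranges.
correct = [batch_bounds(i, batch_size, n) for i in range(num_batches)]

# Bounds used by the original method, which started each slice at idx itself.
buggy = [(i, min(i + batch_size, n)) for i in range(num_batches)]

print(correct)
print(buggy)
```

The original bounds overlap almost entirely from one batch to the next, which explains why the predictions looked offset and not randomly shuffled.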
