I am trying to load training data into a DataLoader with the following code:
import numpy as np
import torch
from torch.utils.data import Dataset

class Dataset(Dataset):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __getitem__(self, index):
        x = torch.Tensor(self.x[index])
        y = torch.Tensor(self.y[index])
        return (x, y)

    def __len__(self):
        count = self.x.shape[0]
        return count
X_train = np.reshape(X_train,(-1,1,X_train.shape[0],X_train.shape[1]))
y_train = np.reshape(y_train,(-1,1,y_train.shape[0],y_train.shape[1]))
train_dataset = Dataset(X_train, y_train)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,batch_size=128,shuffle=True)
Now, when I check the length of the DataLoader, I always get 1; the loader is not splitting the dataset into batches. What am I doing wrong here?
After testing your code, it seems to work perfectly if you remove the reshape steps. The reshape introduces a new leading dimension of size 1, so X_train ends up with shape (1, 1, N, D), but you index your items with self.x[index], so you are always indexing into that batch dimension. You make the same mistake when calculating the length of your dataset: self.x.shape[0] is always 1.
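To see this concretely, here is a minimal sketch using the same dimensions as the example below:

import numpy as np

X_train = np.random.rand(12_000, 1280)
X_reshaped = np.reshape(X_train, (-1, 1, X_train.shape[0], X_train.shape[1]))
print(X_reshaped.shape)     # (1, 1, 12000, 1280): a leading batch dimension of size 1
print(X_reshaped.shape[0])  # 1, so __len__ returns 1 and the loader sees a single sample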
Solution: do not reshape.
X_train = np.random.rand(12_000, 1280)
y_train = np.random.rand(12_000, 1)
train_dataset = Dataset(X_train, y_train)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,batch_size=128,shuffle=True)
for x, y in train_loader:
    print(x.shape)
    print(y.shape)
    break
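With 12,000 samples and batch_size=128, this prints torch.Size([128, 1280]) and torch.Size([128, 1]), and len(train_loader) is now ceil(12000 / 128) = 94 batches (the last one partial).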
Related
In principle I'd like to do the opposite of what was done here https://datascience.stackexchange.com/questions/45916/loading-own-train-data-and-labels-in-dataloader-using-pytorch.
I have a PyTorch DataLoader train_dataloader with shape (2000, 3). I want to store the 3 DataLoader columns in 3 separate NumPy arrays. (The first column of the DataLoader contains the data, the second column contains the labels.)
I managed to do it for the last batch of the train_dataloader (see below), but unfortunately couldn't make it work for the whole train_dataloader.
for X, y, ind in train_dataloader:
    pass

train_X = np.asarray(X, dtype=np.float32)
train_y = np.asarray(y, dtype=np.float32)
Any help would be very much appreciated!
You can collect all the data:
all_X = []
all_y = []
for X, y, ind in train_dataloader:
    all_X.append(X)
    all_y.append(y)

train_X = torch.cat(all_X, dim=0).numpy()
train_y = torch.cat(all_y, dim=0).numpy()
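One caveat: .numpy() only works on CPU tensors that are detached from the autograd graph, so if your batches live on the GPU or carry gradients, detach and move them first:

train_X = torch.cat(all_X, dim=0).detach().cpu().numpy()
train_y = torch.cat(all_y, dim=0).detach().cpu().numpy()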
I have a text dataset that I want to use for a GAN, and it should be one-hot encoded. This is how I create a custom Dataset for my files:
import torch
import torch.nn.functional as F

class Dataset2(torch.utils.data.Dataset):
    def __init__(self, list_, labels):
        'Initialization'
        self.labels = labels
        self.list_IDs = list_

    def __len__(self):
        'Denotes the total number of samples'
        return len(self.list_IDs)

    def __getitem__(self, index):
        'Generates one sample of data'
        # Select sample
        mylist = self.list_IDs[index]

        # Load data and get label
        X = F.one_hot(mylist, num_classes=len(alphabet))  # alphabet is defined elsewhere
        y = self.labels[index]
        return X, y
It works well every time I call it directly, but the problem is that when I wrap it in a DataLoader, the shape is no longer the same as what comes out of the dataset. This is the shape straight from the dataset:
x, _ = dataset[1]
x.shape
torch.Size([1274, 22])
and this is the shape that comes out of the DataLoader:
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)

one = []
for epoch in range(epochs):
    for i, (real_data, _) in enumerate(dataloader):
        one.append(real_data)

one[3].shape
torch.Size([4, 1274, 22])
This 4 is the number of samples in my data, but it should not be there. How can I fix this problem?
You confirmed you only have four elements in your dataset, and you have wrapped it in a DataLoader with batch_size=64, which is greater than 4. This means the DataLoader will only output a single batch containing those 4 elements.
In turn, this means you only append a single element per epoch, and one[3].shape is a batch (the only batch of the data loader), shaped (4, 1274, 22).
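You can check this directly (a quick sketch against the four-element dataset from the question):

print(len(dataloader))  # 1: a single (partial) batch
batch, _ = next(iter(dataloader))
print(batch.shape)      # torch.Size([4, 1274, 22]): all 4 samples stacked together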
I am trying to segment medical images using a version of U-Net implemented with Keras. The inputs of my network are 3D images and the outputs are two one-hot-encoded 3D segmentation maps. I know that my dataset is very imbalanced (there is not so much to segment) and therefore I want to use class weights for my loss function (currently binary_crossentropy). With the class weights, I hope the model will give more attention to the small stuff it has to segment.
If you know the imbalance of your dataset, you can pass the parameter class_weight to model.fit(). Does this also work for my use case?
With the help of a related GitHub issue I managed to solve the problem for my particular use case, and I want to share the solution with you anyway. An extra hurdle was the fact that I am using a custom generator for my data. A simplified version of this class is the following code:
import numpy as np
import keras

class DataGenerator(keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, list_IDs, batch_size=2, dim=(144,144,144), n_classes=2):
        'Initialization'
        self.dim = dim
        self.batch_size = batch_size
        self.list_IDs = list_IDs
        self.n_classes = n_classes
        self.on_epoch_end()

    def on_epoch_end(self):
        'Recomputes the sample indexes (needed so that self.indexes exists)'
        self.indexes = np.arange(len(self.list_IDs))

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.list_IDs) / self.batch_size))

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]

        # Generate data
        x, y = self.__data_generation(list_IDs_temp)

        return x, y

    def __data_generation(self, list_IDs_temp):
        'Generates data containing batch_size samples'  # x : (n_samples, *dim, 1)
        # Initialization
        x = np.empty((self.batch_size, *self.dim, 1))
        y = np.empty((self.batch_size, *self.dim, 1))

        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Load dataset
            data = np.load('data/' + ID + '.npy')

            # Store x and y (the slice keeps the trailing channel axis)
            x[i,] = data[:, :, :, 0:1]  # Image
            y[i,] = data[:, :, :, 1:2]  # Mask

        # One-hot-encoding
        y = keras.utils.to_categorical(y, num_classes=self.n_classes)

        return x, y
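For reference, a Sequence like this can be passed straight to training. A hedged sketch, where model and train_ids are placeholders for your own model and list of .npy file IDs:

train_gen = DataGenerator(train_ids, batch_size=2, dim=(144, 144, 144), n_classes=2)
model.fit_generator(train_gen, epochs=10)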
Actually, a few lines of code did the trick. With an extra input argument class_weights to my generator, a line in the __getitem__() method that converts the class weights to sample weights for each individual batch, and a return of those sample weights from the same method, I solved the issue. The class weights are passed in as a list with the following structure: class_weights = [weight_class_0, weight_class_1]. My basic generator class now looks like this (I have marked the changes with comments):
import numpy as np
import keras

class DataGenerator(keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, list_IDs, class_weights, batch_size=2, dim=(144,144,144),
                 n_classes=2):
        'Initialization'
        self.dim = dim
        self.batch_size = batch_size
        self.list_IDs = list_IDs
        self.n_classes = n_classes
        self.class_weights = class_weights  # CLASS WEIGHTS FIX
        self.on_epoch_end()

    def on_epoch_end(self):
        'Recomputes the sample indexes (needed so that self.indexes exists)'
        self.indexes = np.arange(len(self.list_IDs))

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.list_IDs) / self.batch_size))

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]

        # Generate data
        x, y = self.__data_generation(list_IDs_temp)

        # Compute sample weights  # CLASS WEIGHTS FIX
        sample_weights = np.take(np.array(self.class_weights),
                                 np.round(y[:, :, :, :, 1]).astype('int'))

        return x, y, sample_weights  # CLASS WEIGHTS FIX

    def __data_generation(self, list_IDs_temp):
        'Generates data containing batch_size samples'  # x : (n_samples, *dim, 1)
        # Initialization
        x = np.empty((self.batch_size, *self.dim, 1))
        y = np.empty((self.batch_size, *self.dim, 1))

        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Load dataset
            data = np.load('data/' + ID + '.npy')

            # Store x and y (the slice keeps the trailing channel axis)
            x[i,] = data[:, :, :, 0:1]  # Image
            y[i,] = data[:, :, :, 1:2]  # Mask

        # One-hot-encoding
        y = keras.utils.to_categorical(y, num_classes=self.n_classes)

        return x, y
It might seem a bit like a magic one-liner, but what sample_weights = np.take(np.array(self.class_weights), np.round(y[:, :, :, :, 1]).astype('int')) does is the following: it takes the one-hot channel of the uncommon class (in my case, the one to segment), which is 0 or 1 for every voxel, and uses it as an index into the class_weights list. Each pixel in the 3D image thereby receives a sample weight: the weight of the common class or of the uncommon class, depending on which class the pixel belongs to.
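A toy example of the mechanism, with made-up weights:

import numpy as np

class_weights = [0.1, 0.9]         # [weight_class_0, weight_class_1]
mask = np.array([[0, 1], [1, 0]])  # toy 2x2 slice of the one-hot channel
print(np.take(np.array(class_weights), mask))
# [[0.1 0.9]
#  [0.9 0.1]]  every pixel receives the weight of its class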
The output of this generator class can be then used in the model.fit() method of the Keras model as long as sample_weight_mode="temporal" is passed to model.compile().
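A minimal compile call along those lines; the optimizer and loss here are just placeholders for your own choices:

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              sample_weight_mode="temporal")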
Problem
I am trying to build a multi-input model in Keras using two inputs, image and text. I am using the flow_from_dataframe method, passing it a pandas DataFrame containing the image names as well as the respective text (as a vectorized feature representation) for each image, and the target label/class. As such, the DataFrame looks as follows:
ID path text-features label
111 'cat001.jpg' [0.0, 1.0, 0.0,...] cat
112 'dog001.jpg' [1.0, 0.0, 1.0,...] dog
113 'bunny001.jpg' [0.0, 1.0, 1.0,...] bunny
...
After constructing my model using the Keras functional API, I feed both inputs into the model like so:
model = Model(inputs=[images, text], outputs=output)
For the images I use an ImageDataGenerator as suggested in the docs (https://keras.io/preprocessing/image/#flow_from_dataframe):
datagen = ImageDataGenerator(rescale=1./255, validation_split=0.15)

train_generator = datagen.flow_from_dataframe(dataframe=df, directory=data_dir,
    x_col="path", y_col="label", has_ext=True, class_mode="categorical",
    target_size=(224,224), batch_size=batch_size, subset="training")

validation_generator = datagen.flow_from_dataframe(dataframe=df, directory=data_dir,
    x_col="path", y_col="label", has_ext=True, class_mode="categorical",
    target_size=(224,224), batch_size=batch_size, subset="validation")
So far so good, but now I am stuck on how to feed the text-features within my dataframe to the model as well during training.
Question
How can I modify the flow_from_dataframe generator so that it handles the text-feature data in the DataFrame as well as the images during training? Also, since I can't find any example of this sort of modification to flow_from_dataframe, I am wondering if I am approaching this problem wrong, i.e. is there a better method of achieving this?
UPDATE
Meanwhile I've been trying to write my own generator following the guide I found here (https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly) and adjusting it to my needs. This is what I came up with:
import numpy as np
import keras
from matplotlib.image import imread

class DataGenerator(keras.utils.Sequence):
    def __init__(self, list_IDs, labels, batch_size=32, dim=(32,32,32), n_channels=1,
                 n_classes=10, shuffle=True):
        # Initialization
        self.dim = dim
        self.batch_size = batch_size
        self.labels = labels
        self.list_IDs = list_IDs
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.shuffle = shuffle
        self.on_epoch_end()

    def on_epoch_end(self):
        # Updates indexes after each epoch
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle == True:
            np.random.shuffle(self.indexes)

    # Method for producing batches of data.
    # Takes as argument the list of IDs of the target batch.
    def __data_generation(self, list_IDs_temp):
        # Generates data containing batch_size samples # X : (n_samples, *dim, n_channels)
        # Initialization
        X = np.empty((self.batch_size, *self.dim, self.n_channels))
        Xtext = np.empty((self.batch_size, 7576))
        y = np.empty((self.batch_size), dtype=int)

        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Store sample
            X[i,] = imread('C:/Users/aaron/Desktop/training/' + str(ID))  # <--- all files are in the same DIR
            Xtext[i,] = np.array(total_data[df.path == str(ID)]["text-features"].values)  # <--- I look up the text-features using the ID as a filter on the path column. This line throws the error.

            # Store class
            y[i] = self.labels[ID]

        return X, Xtext, keras.utils.to_categorical(y, num_classes=self.n_classes)

    def __len__(self):
        # Denotes the number of batches per epoch
        return int(np.floor(len(self.list_IDs) / self.batch_size))

    # When the batch corresponding to a given index is requested,
    # the generator executes the __getitem__ method to generate it.
    def __getitem__(self, index):
        # Generate one batch of data
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]

        # Generate data
        X, Xtext, y = self.__data_generation(list_IDs_temp)

        return X, Xtext, y
And I initialize the generator as follows:
partition = {}
partition['train'] = X_train.path.values
partition['validation'] = X_test.path.values
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
encoded_labels = le.fit_transform(df.label)
labels = pd.Series(encoded_labels,index=df.path).to_dict()
# Parameters
params = {'dim': (224,224),
          'batch_size': 64,
          'n_classes': 5,
          'n_channels': 3,
          'shuffle': True}
# Generators
training_generator = DataGenerator(partition['train'], labels, **params)
validation_generator = DataGenerator(partition['validation'], labels, **params)
Using this generator, however, throws an error:
ValueError: setting an array element with a sequence.
caused by the line Xtext[i,] = np.array(total_data[df.path == str(ID)]["text-features"].values) in my code above. Any suggestion on how to solve this?
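For context, the error can be reproduced in isolation; a small sketch assuming the text-features column stores Python lists, so that .values returns a one-element object array rather than a (1, 7576) float array:

import numpy as np

Xtext = np.empty((2, 7576))
vals = np.empty(1, dtype=object)  # what selecting one row of an object column returns
vals[0] = [0.0] * 7576            # the stored list
Xtext[0,] = vals                  # ValueError: setting an array element with a sequence.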
I am using Keras to build a model that takes 720x1280 images as input and outputs a value.
I am having a problem with keras.models.Sequential.predict_generator when using the keras.utils.Sequence class to obtain the values corresponding to images on the validation/training sets. The values returned are shuffled, so I don't know which output corresponds to which image.
This is how my generators are defined:
import numpy as np
from skimage.io import ImageCollection, imread
from keras.utils import Sequence

def load_images(f):
    return imread(f).astype(np.float64)

class DataSetImageKeras(Sequence):
    def __init__(self, image_collection, values, batch_size):
        self.images = image_collection
        self.hf = values
        self.batch_size = batch_size
        self.n = len(self.images)
        self.x_scale = 250
        self.y_scale = 1e4

    def __len__(self):
        return int(np.ceil(len(self.images) / float(self.batch_size)))

    def __getitem__(self, idx):
        # batch_x is a numpy.ndarray
        batch_x = (
            self.images[idx:min(idx + self.batch_size, self.n)]
            .concatenate()
            .reshape(self.batch_size, 720, 1280, 1)
        )
        batch_y = self.hf[idx:min(idx + self.batch_size, self.n)]
        return batch_x/self.x_scale, batch_y/self.y_scale
images_train = ImageCollection(images_paths_train, load_func=load_images)
images_val = ImageCollection(images_paths_test, load_func=load_images)
data_train = DataSetImageKeras(images_train, values_train, n_batch)
data_val = DataSetImageKeras(images_val, values_val, n_batch)
from keras.models import load_model
model = load_model('model001') #this model is already trained
If I use the following code:
val_result = []
val_hf = []
for (batch_x, batch_y) in data_val:
    val_result.append(model.predict_on_batch(batch_x))
    val_hf.append(batch_y)

val_result = np.concatenate(val_result)
val_hf = np.concatenate(val_hf)

plt.plot(val_hf,
         val_result,
         marker='.',
         linestyle='')
The correct result is obtained (a plot of the desired value x against the predicted value y shows the expected correspondence).
However if I use the predict_generator function, as below:
val_result = model.predict_generator(data_val, verbose=1,
                                     workers=1,
                                     max_queue_size=50,
                                     use_multiprocessing=False)
The output comes out shuffled.
My problem is similar to #5048 and #6745, which should be solved by #6891, but I am using Keras version 2.1.6 and it is still shuffling my predictions, even when using workers=1.
It is also similar to a previously reported problem, but I didn't find anything there that could reset the generators, and the issue is still present if I define a new generator and try to run predict_generator.
I also found something stating that it could have to do with the number of batches not dividing the number of samples exactly, but the problem is still present if I use n_batch=1.
As a side note, it might be that predict_generator is not shuffling data, but only returning it with an index offset, since the input data on values and images_paths are already shuffled.
predict_generator was not shuffling my predictions, after all. The problem was with the __getitem__ method. For instance, using n_batch=32, the method would yield samples 1 to 32, then samples 2 to 33, and so forth, instead of 1 to 32, 33 to 64, etc.
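In other words:

# Buggy slicing: self.images[idx : idx + batch_size]
#   idx=0 -> samples 0..31, idx=1 -> samples 1..32, ...   (overlapping windows)
# Fixed slicing: self.images[idx * batch_size : (idx + 1) * batch_size]
#   idx=0 -> samples 0..31, idx=1 -> samples 32..63, ...  (disjoint batches)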
Changing the method as follows solves the problem:
def __getitem__(self, idx):
    # batch_x is a numpy.ndarray
    idx_min = idx * self.batch_size
    idx_max = min(idx_min + self.batch_size, self.n)
    batch_x = (
        self.images[idx_min:idx_max]
        .concatenate()
        # idx_max - idx_min rather than self.batch_size, so that a final
        # partial batch is reshaped correctly as well
        .reshape(idx_max - idx_min, 720, 1280, 1)
    )
    batch_y = self.hf[idx_min:idx_max]
    return batch_x / self.x_scale, batch_y / self.y_scale
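With these non-overlapping slices, predict_generator returns one prediction per sample of data_val, in dataset order, so each output can be matched to its image again.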