I want to play around with a neural network that recognizes handwritten numbers. I found some of these on the web which use PyTorch, however they seem to download the data from the MNIST website in a particular format. My data is, however, available as follows:
with np.load('prediction-challenge-01-data.npz') as fh:
    data_x = fh['data_x']
    data_y = fh['data_y']
Where data_x is the training data and data_y are the labels of the pictures. I want these data sets to be in the same format as trainloader as shown below:
trainset = datasets.MNIST('/data/mnist', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
Where trainloader already has the training set data_x and labels data_y together in one set.
Is there any way to do this?
Edit: Shapes of data_x and data_y:
In [1]: data_x.shape
Out[2]: (20000, 1, 28, 28)
In [5]: data_y.shape
Out[7]: (20000,)
You can easily create your own dataset. Just inherit from torch.utils.data.Dataset and implement __getitem__ at the very least (plus __len__, which DataLoader needs in order to shuffle and batch).
Here is a quick and dirty example to get you going:
class YourOwnDataset(torch.utils.data.Dataset):
    def __init__(self, input_file_path, transformations):
        super().__init__()
        self.path = input_file_path
        self.transforms = transformations
        with np.load(self.path) as fh:
            # I assume fh['data_x'] is array-like, you get the idea
            self.data = fh['data_x']
            self.labels = fh['data_y']

    # in __getitem__, we retrieve one item based on the input index
    def __getitem__(self, index):
        data = self.data[index]
        # based on the loss you chose and what you have in mind,
        # you can transform your label, here I assume they are
        # integer numbers (like 1, 3, etc. as labels used for classification)
        label = self.labels[index]
        # convert/reshape your data into an image; for (1, 28, 28) arrays,
        # drop the channel axis because ToTensor() expects (H, W) or (H, W, C)
        img = data.squeeze(0)
        img = self.transforms(img)
        return img, label

    def __len__(self):
        return len(self.data)
and you can create your dataset like :
from torchvision import transforms
# add any number of transformations you like, I just added ToTensor()
transformations = transforms.Compose([transforms.ToTensor()])
trainset = YourOwnDataset('prediction-challenge-01-data.npz', transformations )
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
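If the arrays fit in memory and no per-sample transform is needed, a shorter route may be torch.utils.data.TensorDataset. This is only a minimal sketch: it assumes data_x has shape (N, 1, 28, 28) and data_y has shape (N,), as in the question, and the random arrays below are stand-ins for the contents of the .npz file.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in arrays with the shapes from the question
# (assumption: the real ones come from the .npz file)
data_x = np.random.rand(200, 1, 28, 28).astype(np.float32)
data_y = np.random.randint(0, 10, size=200)

# Wrap both arrays so that each index yields an (image, label) pair
trainset = TensorDataset(torch.from_numpy(data_x).float(),
                         torch.from_numpy(data_y).long())
trainloader = DataLoader(trainset, batch_size=64, shuffle=True)

# Pull one batch to check the shapes
images, labels = next(iter(trainloader))
print(images.shape, labels.shape)
```

The custom Dataset above is still the better fit when transforms have to be applied to each sample.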
I am training a neural network with PyTorch, and I would like to understand more about the MNIST dataset.
The dataloader looks like this:
batch_size = 128
transform = transforms.Compose([
    transforms.Resize((28,28)),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])
train_dataset = datasets.MNIST('./data', transform=transform, download=True)
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataset = datasets.MNIST('./data', transform=transform, download=True, train=False)
test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
However, when I train on my own dataset there are problems loading the data. What I know is that the samples in PyTorch's MNIST dataset have the shape (1,28,28), i.e. they are grayscale images. I want to know how they are saved. Are they png, jpg, jpeg or npy files?
The MNIST dataset class is based on this code. If you would like to use your own dataset, you should write your custom dataset class to read your dataset based on its properties, like its image size, number of channels, labels, etc.
For instance something like this example:
import os
import numpy as np
import pandas as pd
import scipy.io as scipyIO
import torch
from torch.utils.data import Dataset

class CustomImageDataset(Dataset):
    def __init__(self, annotations_file, img_dir, transform=None, target_transform=None):
        self.img_labels = pd.read_csv(annotations_file)
        self.img_dir = img_dir
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.img_labels)

    def __getitem__(self, idx):
        # column 0 of the annotations file holds the file name
        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
        image = scipyIO.loadmat(img_path).get('rawData')
        image = image.astype(np.float64)
        h, w = image.shape
        # add a leading channel axis: (h, w) -> (1, h, w)
        image = torch.from_numpy(image).reshape(1, h, w)
        image = image.float()
        ua = self.img_labels.iloc[idx, 1]  # 1: ua value
        us = self.img_labels.iloc[idx, 2]  # 2: us value
        g = self.img_labels.iloc[idx, 3]   # 3: g value
        gt = torch.tensor([ua, us, g])
        gt = gt.float()
        if self.transform:
            image = self.transform(image)
        if self.target_transform:
            gt = self.target_transform(gt)
        return image, gt
(above example is based on this repository)
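On the storage question itself: as far as I know, the raw MNIST files that torchvision downloads are neither image files nor .npy files. They are gzip-compressed IDX files, a simple binary layout with a short header followed by unsigned bytes. Below is a minimal sketch of parsing that layout, using a tiny made-up array in place of a real download. parse_idx is a hypothetical helper written for this illustration, not a torchvision function.

```python
import struct
import numpy as np

def parse_idx(raw: bytes) -> np.ndarray:
    """Parse an uncompressed IDX buffer of unsigned bytes (hypothetical helper)."""
    # Header: two zero bytes, a data-type code, and the number of dimensions
    zero, dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0 or dtype_code != 0x08:
        raise ValueError("this sketch only handles unsigned-byte IDX data")
    # Each dimension size is a 4-byte big-endian integer
    shape = struct.unpack(">" + "I" * ndim, raw[4:4 + 4 * ndim])
    data = np.frombuffer(raw, dtype=np.uint8, offset=4 + 4 * ndim)
    return data.reshape(shape)

# Build a fake "image file" holding 2 images of 28x28 pixels
fake_images = (np.arange(2 * 28 * 28) % 256).astype(np.uint8).reshape(2, 28, 28)
header = struct.pack(">HBB", 0, 0x08, 3) + struct.pack(">III", 2, 28, 28)
raw = header + fake_images.tobytes()

parsed = parse_idx(raw)
print(parsed.shape, parsed.dtype)
```

A real file would first be decompressed (for example with gzip.open) before being passed to such a parser.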
I have a directory RealPhotos containing 17000 jpg photos. I would like to create a train dataloader and a test dataloader from it.
ls RealPhotos/
2007_000027.jpg 2008_007119.jpg 2010_001501.jpg 2011_002987.jpg
2007_000032.jpg 2008_007120.jpg 2010_001502.jpg 2011_002988.jpg
2007_000033.jpg 2008_007123.jpg 2010_001503.jpg 2011_002992.jpg
2007_000039.jpg 2008_007124.jpg 2010_001505.jpg 2011_002993.jpg
2007_000042.jpg 2008_007129.jpg 2010_001511.jpg 2011_002994.jpg
2007_000061.jpg 2008_007130.jpg 2010_001514.jpg 2011_002996.jpg
2007_000063.jpg 2008_007131.jpg 2010_001515.jpg 2011_002997.jpg
2007_000068.jpg 2008_007133.jpg 2010_001516.jpg 2011_002999.jpg
2007_000121.jpg 2008_007134.jpg 2010_001518.jpg 2011_003002.jpg
2007_000123.jpg 2008_007138.jpg 2010_001520.jpg 2011_003003.jpg
...
I know I can subclass TensorDataset to make it compatible with unlabeled data with
class UnlabeledTensorDataset(TensorDataset):
    """Dataset wrapping unlabeled data tensors.

    Each sample will be retrieved by indexing tensors along the first
    dimension.

    Arguments:
        data_tensor (Tensor): contains sample data.
    """
    def __init__(self, data_tensor):
        self.data_tensor = data_tensor

    def __getitem__(self, index):
        return self.data_tensor[index]

    def __len__(self):
        # defined explicitly so len() does not depend on attributes
        # that TensorDataset.__init__ would have set
        return self.data_tensor.size(0)
And something along these lines for training the autoencoder
X_train = rnd.random((300,100))
train = UnlabeledTensorDataset(torch.from_numpy(X_train).float())
train_loader= data_utils.DataLoader(train, batch_size=1)
for epoch in range(50):
    for batch in train_loader:
        data = Variable(batch)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, data)
You first need to define a Dataset (torch.utils.data.Dataset) then you can use DataLoader on it. There is no difference between your train and test dataset, you can define a generic dataset that will look into a particular directory and map each index to a unique file.
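class MyDataset(Dataset):
    def __init__(self, directory):
        self.directory = directory
        self.files = sorted(os.listdir(directory))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, index):
        # os.listdir returns bare file names, so join them with the directory
        img = Image.open(os.path.join(self.directory, self.files[index])).convert('RGB')
        return T.ToTensor()(img)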
Where T refers to torchvision.transforms and Image is imported from PIL.
You can then instantiate a dataset with
data_set = MyDataset('./RealPhotos')
From there you can use torch.utils.data.random_split to perform the split:
train_len = int(len(data_set)*0.7)
train_set, test_set = random_split(data_set, [train_len, len(data_set)-train_len])
Then use torch.utils.data.DataLoader as you did:
train_loader = DataLoader(train_set, batch_size=1, shuffle=True)
test_loader = DataLoader(test_set, batch_size=16, shuffle=False)
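Note that random_split calls len() on the dataset, so the dataset needs a __len__ method. If the split should be repeatable between runs, a seeded generator can be passed in. The sketch below assumes a PyTorch version in which random_split accepts a generator argument, and RangeDataset is a hypothetical stand-in for MyDataset so that it runs without any image files.

```python
import torch
from torch.utils.data import Dataset, random_split

class RangeDataset(Dataset):
    """Hypothetical stand-in for MyDataset: item i is just the integer i."""
    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        return index

data_set = RangeDataset(100)
train_len = int(len(data_set) * 0.7)

# A fixed seed makes the same samples land in each subset on every run
generator = torch.Generator().manual_seed(0)
train_set, test_set = random_split(
    data_set, [train_len, len(data_set) - train_len], generator=generator)

# Collect the items of each subset to check that they do not overlap
train_items = {train_set[i] for i in range(len(train_set))}
test_items = {test_set[i] for i in range(len(test_set))}
print(len(train_items), len(test_items), train_items & test_items)
```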
I am trying to train PyTorch's torchvision.models.detection.fasterrcnn_resnet50_fpn to detect objects in my own images.
According to the documentation, this model expects a list of images and a list of dictionaries with 'boxes' and 'labels' as keys. So my dataset's __getitem__() looks like this:
def __getitem__(self, idx):
    # load images
    _, img = self.images[idx].getImage()
    img = Image.fromarray(img, mode='RGB')
    objects = self.images[idx].objects
    boxes = []
    labels = []
    for o in objects:
        # append bbox to boxes
        boxes.append([o.x, o.y, o.x+o.width, o.y+o.height])
        # append the 4th char of class_id, the number of lights (1-4)
        labels.append(int(str(o.class_id)[3]))
    # convert everything into a torch.Tensor
    boxes = torch.as_tensor(boxes, dtype=torch.float32)
    labels = torch.as_tensor(labels, dtype=torch.int64)
    target = {}
    target["boxes"] = boxes
    target["labels"] = labels
    # transforms consists only of transforms.Compose([transforms.ToTensor()]) for the time being
    if self.transforms is not None:
        img = self.transforms(img)
    return img, target
To the best of my knowledge, it returns exactly what's asked. My dataloader looks like this
data_loader = torch.utils.data.DataLoader(
dataset, batch_size=4, shuffle=False, num_workers=2)
however, when it gets to this stage:
for images, targets in dataloaders[phase]:
it raises
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 12 and 7 in dimension 1 at C:\w\1\s\windows\pytorch\aten\src\TH/generic/THTensor.cpp:689
Can someone point me in the right direction?
@jodag was right, I had to write a separate collate function in order for the net to receive the data like it was supposed to. In my case I only needed to bypass the default function.
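For context, the default collate function tries to stack every field of the batch into one tensor, which fails as soon as two images have a different number of boxes (the "Got 12 and 7" in the error). One common way to bypass it is a collate function that simply regroups the samples into tuples. The following is a minimal sketch of that idea, not necessarily the exact function used here. ToyDetectionDataset is a hypothetical dataset made up for the illustration.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDetectionDataset(Dataset):
    """Hypothetical dataset: sample i has i + 1 boxes, so targets differ in size."""
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        img = torch.zeros(3, 32, 32)
        n_boxes = idx + 1
        target = {
            "boxes": torch.zeros(n_boxes, 4, dtype=torch.float32),
            "labels": torch.ones(n_boxes, dtype=torch.int64),
        }
        return img, target

def collate_fn(batch):
    # Regroup [(img, target), ...] into (imgs, targets) without stacking anything
    return tuple(zip(*batch))

data_loader = DataLoader(ToyDetectionDataset(), batch_size=4, shuffle=False,
                         collate_fn=collate_fn)

images, targets = next(iter(data_loader))
print(len(images), [t["boxes"].shape[0] for t in targets])
```

The loop `for images, targets in data_loader:` then receives a tuple of images and a tuple of target dictionaries, which can be turned into the lists the model expects.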
Problem
I am trying to build a multi-input model in keras using two inputs, image and text. I am using the flow_from_dataframe method, passing it a pandas dataframe containing the image names as well as the respective text (as a vectorized feature representation) for each image and the target label/class. As such, the dataframe looks as follows:
ID   path            text-features         label
111  'cat001.jpg'    [0.0, 1.0, 0.0,...]   cat
112  'dog001.jpg'    [1.0, 0.0, 1.0,...]   dog
113  'bunny001.jpg'  [0.0, 1.0, 1.0,...]   bunny
...
After constructing my model using the Keras functional API, I feed both inputs into the model like so:
model = Model(inputs=[images, text], outputs=output)
For the images I use an ImageDataGenerator as suggested in the docs (https://keras.io/preprocessing/image/#flow_from_dataframe) :
datagen=ImageDataGenerator(rescale=1./255,validation_split=0.15)
train_generator=datagen.flow_from_dataframe(dataframe=df, directory=data_dir, x_col=path, y_col="label", has_ext=True, class_mode="categorical", target_size=(224,224), batch_size=batch_size,subset="training")
validation_generator=datagen.flow_from_dataframe(dataframe=df, directory=data_dir, x_col=path, y_col="label", has_ext=True, class_mode="categorical", target_size=(224,224), batch_size=batch_size,subset="validation")
So far so good, but now I am stuck on how to feed the text-features within my dataframe to the model as well during training.
Question
How can I modify the flow_from_dataframe generator in order to handle the text-feature data in the dataframe as well as the images during training? Also, since I can't find any example of this sort of modification on flow_from_dataframe I am wondering if I am approaching this problem wrong i.e. is there any better method of achieving this?
UPDATE
Meanwhile I've been trying to write my own generator following the guide I found here (https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly) and adjusting it to my needs. This is what I came up with:
from matplotlib.image import imread

class DataGenerator(keras.utils.Sequence):
    def __init__(self, list_IDs, labels, batch_size=32, dim=(32,32,32), n_channels=1,
                 n_classes=10, shuffle=True):
        #'Initialization'
        self.dim = dim
        self.batch_size = batch_size
        self.labels = labels
        self.list_IDs = list_IDs
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.shuffle = shuffle
        self.on_epoch_end()

    def on_epoch_end(self):
        #'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle == True:
            np.random.shuffle(self.indexes)

    # method for producing batches of data.
    # takes as argument the list of IDs of the target batch
    def __data_generation(self, list_IDs_temp):
        #'Generates data containing batch_size samples' # X : (n_samples, *dim, n_channels)
        # Initialization
        X = np.empty((self.batch_size, *self.dim, self.n_channels))
        Xtext = np.empty((self.batch_size, 7576))
        y = np.empty((self.batch_size), dtype=int)
        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Store sample
            X[i,] = imread('C:/Users/aaron/Desktop/training/'+str(ID)) # <--- all files are in the same DIR
            Xtext[i,] = np.array(total_data[df.path== str(ID)]["text-features"].values) # <--- I look-up the text-features by using the ID as a filter with the path column. This line throws the error.
            # Store class
            y[i] = self.labels[ID]
        return X, Xtext, keras.utils.to_categorical(y, num_classes=self.n_classes)

    def __len__(self):
        #'Denotes the number of batches per epoch'
        return int(np.floor(len(self.list_IDs) / self.batch_size))

    # Now, when the batch corresponding to a given index is called,
    # the generator executes the __getitem__ method to generate it.
    def __getitem__(self, index):
        #'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]
        # Generate data
        X,Xtext, y = self.__data_generation(list_IDs_temp)
        return X,Xtext, y
And I initialize the generator as follows:
partition = {}
partition['train'] = X_train.path.values
partition['validation'] = X_test.path.values
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
encoded_labels = le.fit_transform(df.label)
labels = pd.Series(encoded_labels,index=df.path).to_dict()
# Parameters
params = {'dim': (224,224),
'batch_size': 64,
'n_classes': 5,
'n_channels': 3,
'shuffle': True}
# Generators
training_generator = DataGenerator(partition['train'], labels, **params)
validation_generator = DataGenerator(partition['validation'], labels, **params)
Using this generator however throws me an error:
ValueError: setting an array element with a sequence.
caused by the line X_text[i,] = np.array(total_data[total_data.bust == str(ID)].text.values) in my code above. Any suggestion on how to solve this?
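A likely cause, assuming each cell of the text-features column holds a Python list: filtering the dataframe and taking .values yields a length-1 object array whose single element is that list, and NumPy cannot place a whole list into one float slot. Pulling the list itself out with .iloc[0] (or .values[0]) before assigning should avoid the error. The sketch below uses a hypothetical miniature dataframe with three features per row instead of 7576.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataframe from the question
df = pd.DataFrame({
    "path": ["cat001.jpg", "dog001.jpg"],
    "text-features": [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]],
})

Xtext = np.empty((2, 3))
for i, ID in enumerate(["dog001.jpg", "cat001.jpg"]):
    # .values would give a length-1 object array holding the list;
    # .iloc[0] pulls out the list itself so it can fill one row of Xtext
    features = df.loc[df["path"] == str(ID), "text-features"].iloc[0]
    Xtext[i, :] = np.asarray(features, dtype=float)

print(Xtext)
```

For a model with two inputs, __getitem__ would probably also need to group the inputs, returning something like [X, Xtext], y, since a three-element tuple is usually read by Keras as inputs, targets and sample weights.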
I am trying to feed a large dataset to a keras model.
The dataset does not fit into memory.
It is currently stored as a series of HDF5 files.
I want to train my model using
model.fit_generator(my_gen, steps_per_epoch=30, epochs=10, verbose=1)
However, in all the examples I could find online, my_gen was used only to perform data augmentation on an already loaded dataset. For example
def generator(features, labels, batch_size):
    # Create empty arrays to contain batch of features and labels#
    batch_features = np.zeros((batch_size, 64, 64, 3))
    batch_labels = np.zeros((batch_size,1))
    while True:
        for i in range(batch_size):
            # choose random index in features
            index= random.choice(len(features),1)
            batch_features[i] = some_processing(features[index])
            batch_labels[i] = labels[index]
        yield batch_features, batch_labels
In my case, it needs to be something like
def generator(features, labels, batch_size):
    while True:
        for i in range(batch_size):
            # choose random index in features
            index= # SELECT THE NEXT FILE
            batch_features[i] = some_processing(features[files[index]])
            batch_labels[i] = labels[files[index]]
        yield batch_features, batch_labels
How do I keep track of the files which were already read in previous batch?
From the keras doc
generator: A generator or an instance of Sequence (keras.utils.Sequence) object in order to avoid duplicate data when using multiprocessing. [...]
This means you can write a class inheriting from keras.utils.Sequence
class ProductSequence(keras.utils.Sequence):
    def __init__(self):
        pass

    def __len__(self):
        pass

    def __getitem__(self, idx):
        pass
__init__ initializes the class.
__len__ should return the number of batches per epoch. Keras will use this to know which index can be passed to __getitem__. __getitem__ will then return the batch data depending on the index.
A simple example can be found here
With this approach you can simply keep an internal attribute on the object in which you record which files have already been read.
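Because batches are requested by index, the bookkeeping can also be done once up front: map every batch index to a file and a row offset, and nothing has to be remembered between calls. The sketch below shows only that index mapping, under loud assumptions. FileBackedBatches is a hypothetical plain class that would subclass keras.utils.Sequence in real use, and small .npy files stand in for the HDF5 files so the example runs without h5py or Keras.

```python
import os
import tempfile
import numpy as np

class FileBackedBatches:
    """Index-based batches over several files (would subclass keras.utils.Sequence)."""
    def __init__(self, feature_paths, label_paths, batch_size):
        self.feature_paths = feature_paths
        self.label_paths = label_paths
        self.batch_size = batch_size
        # Precompute (file index, start row) for every batch
        self.batches = []
        for file_idx, path in enumerate(feature_paths):
            n_rows = np.load(path, mmap_mode="r").shape[0]
            for start in range(0, n_rows, batch_size):
                self.batches.append((file_idx, start))

    def __len__(self):
        # Number of batches per epoch
        return len(self.batches)

    def __getitem__(self, idx):
        file_idx, start = self.batches[idx]
        stop = start + self.batch_size
        # Memory-map the file so only the requested rows are read from disk
        x = np.load(self.feature_paths[file_idx], mmap_mode="r")[start:stop]
        y = np.load(self.label_paths[file_idx], mmap_mode="r")[start:stop]
        return np.array(x), np.array(y)

# Stand-in data: two files with 5 and 3 rows of 4 features each
tmp_dir = tempfile.mkdtemp()
feature_paths, label_paths = [], []
for k, n_rows in enumerate([5, 3]):
    fx = os.path.join(tmp_dir, f"features_{k}.npy")
    fy = os.path.join(tmp_dir, f"labels_{k}.npy")
    np.save(fx, np.full((n_rows, 4), k, dtype=np.float32))
    np.save(fy, np.full((n_rows,), k, dtype=np.int64))
    feature_paths.append(fx)
    label_paths.append(fy)

batches = FileBackedBatches(feature_paths, label_paths, batch_size=2)
print(len(batches), [batches[i][0].shape[0] for i in range(len(batches))])
```

With HDF5 files, the two np.load calls in __getitem__ would be replaced by opening the file and slicing the dataset in the same way.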
Let us suppose that your data are images. If you have many images you probably won't be able to load all of them in memory and you would like to read from disk in batches.
Keras flow_from_directory is very fast at doing that, as it is also multithreaded, but it needs the images to be in different folders according to their class. If we have all the images in the same folder and their classes in a separate file, we could use the generator below to load our x, y data.
import pandas as pd
import numpy as np
import cv2
from keras.utils import to_categorical

#df_train: data frame with class of every image
#dpath: path of images
#batch_size: number of images per batch
classes=list(np.unique(df_train.label))

def batch_generator(ids):
    while True:
        for start in range(0, len(ids), batch_size):
            x_batch = []
            y_batch = []
            end = min(start + batch_size, len(ids))
            ids_batch = ids[start:end]
            for id in ids_batch:
                img = cv2.imread(dpath+'train/{}.png'.format(id)) #open cv read as BGR
                #img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) #BGR to RGB
                #img = cv2.resize(img, (224, 224), interpolation = cv2.INTER_CUBIC)
                #img = pre_process(img)
                # take the single matching label rather than a length-1 array
                labelname=df_train.label.loc[df_train.id==id].values[0]
                labelnum=classes.index(labelname)
                x_batch.append(img)
                y_batch.append(labelnum)
            x_batch = np.array(x_batch)
            y_batch = to_categorical(y_batch,10)
            yield x_batch, y_batch
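Since this generator loops forever, Keras has to be told how many batches make up one epoch. A reasonable choice is the number of batches needed to see every id once. The sketch below uses a hypothetical helper, batch_slices, that mirrors the slicing of the generator above for a single pass, with made-up ids.

```python
import math

def batch_slices(ids, batch_size):
    # Same slicing as the generator above, but for a single pass over ids
    for start in range(0, len(ids), batch_size):
        yield ids[start:min(start + batch_size, len(ids))]

ids = list(range(10))  # made-up ids for the illustration
batch_size = 4

# The last batch may be smaller, hence the ceiling
steps_per_epoch = math.ceil(len(ids) / batch_size)
batches = list(batch_slices(ids, batch_size))
print(steps_per_epoch, [len(b) for b in batches])
```

This value would then be passed as steps_per_epoch when calling fit_generator with batch_generator(ids).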