creating a train and a test dataloader

creating a train and a test dataloader - python

I have actually a directory RealPhotos containing 17000 jpg photos. I would be interested in creating a train dataloader and a test dataloader
ls RealPhotos/
2007_000027.jpg 2008_007119.jpg 2010_001501.jpg 2011_002987.jpg
2007_000032.jpg 2008_007120.jpg 2010_001502.jpg 2011_002988.jpg
2007_000033.jpg 2008_007123.jpg 2010_001503.jpg 2011_002992.jpg
2007_000039.jpg 2008_007124.jpg 2010_001505.jpg 2011_002993.jpg
2007_000042.jpg 2008_007129.jpg 2010_001511.jpg 2011_002994.jpg
2007_000061.jpg 2008_007130.jpg 2010_001514.jpg 2011_002996.jpg
2007_000063.jpg 2008_007131.jpg 2010_001515.jpg 2011_002997.jpg
2007_000068.jpg 2008_007133.jpg 2010_001516.jpg 2011_002999.jpg
2007_000121.jpg 2008_007134.jpg 2010_001518.jpg 2011_003002.jpg
2007_000123.jpg 2008_007138.jpg 2010_001520.jpg 2011_003003.jpg
...
I know I can subclassing TensorDataset to make it compatible with unlabeled data with
class UnlabeledTensorDataset(TensorDataset):
"""Dataset wrapping unlabeled data tensors.
Each sample will be retrieved by indexing tensors along the first
dimension.
Arguments:
data_tensor (Tensor): contains sample data.
"""
def __init__(self, data_tensor):
self.data_tensor = data_tensor
def __getitem__(self, index):
return self.data_tensor[index]
And something along these lines for training the autoencoder
X_train = rnd.random((300,100))
train = UnlabeledTensorDataset(torch.from_numpy(X_train).float())
train_loader= data_utils.DataLoader(train, batch_size=1)
for epoch in range(50):
for batch in train_loader:
data = Variable(batch)
optimizer.zero_grad()
output = model(data)
loss = criterion(output, data)

You first need to define a Dataset (torch.utils.data.Dataset) then you can use DataLoader on it. There is no difference between your train and test dataset, you can define a generic dataset that will look into a particular directory and map each index to a unique file.
class MyDataset(Dataset):
def __init__(self, directory):
self.files = os.listdir(directory)
def __getitem__(self, index):
img = Image.open(self.files[index]).convert('RGB')
return T.ToTensor()(img)
Where T refers to torchvision.transform and Image is imported from PIL.
You can then instanciate a dataset with
data_set = MyDataset('./RealPhotos')
From there you can use torch.utils.data.random_split to perform the split:
train_len = int(len(data_set)*0.7)
train_set, test_set = random_split(data_set, [train_len, len(data_set)-train_len])
Then use torch.utils.data.DataLoader as you did:
train_loader = DataLoader(train_set, batch_size=1, shuffle=True)
test_loader = DataLoader(test_set, batch_size=16, shuffle=False)

Related

AttributeError in Classification report

The thing is i want to output the precisio, recall, and f1-score using classification report. But when i run below code, that error occurs. How can i fix the AttributeError?
print(classification_report(test.targets.cpu().numpy(),
File "C:\Users\Admin\PycharmProjects\ImageRotation\venv\lib\site-packages\torch\utils\data\dataset.py", line 83, in __getattr__
raise AttributeError
AttributeError
This is where i load the data from my directory.
data_loader = ImageFolder(data_dir,transform = transformer)
lab = data_loader.classes
num_classes = int(len(lab))
print("Number of Classes: ", num_classes)
print("The classes are as follows : \n",data_loader.classes)
batch_size = 128
train_size = int(len(data_loader) * 0.8)
test_size = len(data_loader) - train_size
train,test = random_split(data_loader,[train_size,test_size])
train_size = int(len(train) * 0.8)
val_size = len(train) - train_size
train_data, val_data = random_split(train,[train_size,val_size])
#load the train and validation into batches.
print(f"Length of Train Data : {len(train_data)}")
print(f"Length of Validation Data : {len(val_data)}")
print(f"Length of Test Data : {len(test)}")
train_dl = DataLoader(train_data, batch_size, shuffle = True)
val_dl = DataLoader(val_data, batch_size*2)
test_dl = DataLoader(test, batch_size, shuffle=True)
model.evaL() code
with torch.no_grad():
# set the model in evaluation mode
model.eval()
# initialize a list to store our predictions
preds = []
# loop over the test set
for (x, y) in test_dl:
# send the input to the device
x = x.to(device)
# make the predictions and add them to the list
pred = model(x)
preds.extend(pred.argmax(axis=1).cpu().numpy())
# generate a classification report
print(classification_report(test.targets.cpu().numpy(),
np.array(preds), target_names=test.classes))

It seems, ImageFolder is the your dataset object, but that is not inherited ted from torch.utils.data.Datasets.
torch Dataloader tries to call the __getitem__ method in the your Dataset object, but since it is not torch.utils.data.Dataset object it does not have has a function, then that causes to AttributeError now you are getting.
Convert ImageFolder to torch torch dataset. For further library details : torch doc
Practical implemtation : ast_dataloader
Also, you can use freeze model [without back propagation] to speedup the inference process.
with torch.no_grad():
# make the predictions and add them to the list
pred = model(x)
Update>
sample torch dataset:
from torch.utils.data import Dataset
class Dataset_train(Dataset):
def __init__(self, list_IDs, labels, base_dir):
"""self.list_IDs : list of strings (each string: utt key),
self.labels : dictionary (key: utt key, value: label integer)"""
self.list_IDs = list_IDs
self.labels = labels
self.base_dir = base_dir
def __len__(self):
return len(self.list_IDs)
def __getitem__(self, index):
key = self.list_IDs[index]
X, _ = get_sample(f"{self.base_dir}/{key}", self.noises)
y = self.labels[index]
return X, y
[Note] get_sample is custom build function for .wav file read. you could replace it with any funtion.
torch example-1
torch example-2
medium example

How to use DistilBERT Huggingface NLP model to perform sentiment analysis on new data?

I am using DistilBERT to do sentiment analysis on my dataset. The dataset contains text and a label for each row which identifies whether the text is a positive or negative movie review (eg: 1 = positive and 0 = negative). Here is the code from the huggingface documentation (https://huggingface.co/transformers/custom_datasets.html?highlight=imdb)
#This dataset can be explored in the Hugging Face model hub (IMDb), and can be alternatively downloaded with the 🤗 Datasets library with load_dataset("imdb").
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar -xf aclImdb_v1.tar.gz
#This data is organized into pos and neg folders with one text file per example. Let’s write a function that can read this in.
from pathlib import Path
def read_imdb_split(split_dir):
split_dir = Path(split_dir)
texts = []
labels = []
for label_dir in ["pos", "neg"]:
for text_file in (split_dir/label_dir).iterdir():
texts.append(text_file.read_text())
labels.append(0 if label_dir is "neg" else 1)
return texts, labels
train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
import torch
class IMDbDataset(torch.utils.data.Dataset):
def __init__(self, encodings, labels):
self.encodings = encodings
self.labels = labels
def __getitem__(self, idx):
item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
item['labels'] = torch.tensor(self.labels[idx])
return item
def __len__(self):
return len(self.labels)
train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)
#Now that our datasets our ready, we can fine-tune a model either #with the 🤗 Trainer/TFTrainer or with native PyTorch/TensorFlow. See #training.
#Fine-tuning with Trainer
#The steps above prepared the datasets in the way that the trainer is #expected. Now all we need to do is create a model to fine-tune, #define the TrainingArguments/TFTrainingArguments and instantiate a #Trainer/TFTrainer.
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments
training_args = TrainingArguments(
output_dir='./results', # output directory
num_train_epochs=3, # total number of training epochs
per_device_train_batch_size=16, # batch size per device during training
per_device_eval_batch_size=64, # batch size for evaluation
warmup_steps=500, # number of warmup steps for learning rate scheduler
weight_decay=0.01, # strength of weight decay
logging_dir='./logs', # directory for storing logs
logging_steps=10,
)
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
trainer = Trainer(
model=model, # the instantiated 🤗 Transformers model to be trained
args=training_args, # training arguments, defined above
train_dataset=train_dataset, # training dataset
eval_dataset=val_dataset # evaluation dataset
)
trainer.train()
#We can also train with Pytorch/Tensorflow
from torch.utils.data import DataLoader
from transformers import DistilBertForSequenceClassification, AdamW
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(device)
model.train()
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
optim = AdamW(model.parameters(), lr=5e-5)
for epoch in range(3):
for batch in train_loader:
optim.zero_grad()
input_ids = batch['input_ids'].to(device)
attention_mask = batch['attention_mask'].to(device)
labels = batch['labels'].to(device)
outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
loss = outputs[0]
loss.backward()
optim.step()
model.eval()
I want to know test this model on a new piece of data. So, I have a dataframe which contains a piece of text/review for each row, and I want to predict the label. Does anyone know how I would go about doing that? I apologize, I am very new to this and would greatly appreciate any help! I tried taking in text, cleaning it, and then doing
prediction = model.predict(text)
and I got an error saying DistilBERT has no attribute .predict.

If you just want to use the model, you can use the corresponding pipeline:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
Then you can use it:
classifier("I hate this book")

The code that you've shared from the documentation essentially covers the training and evaluation loop. Beware that your shared code contains two ways of fine-tuning, once with the trainer, which also includes evaluation, and once with native Pytorch/TF, which contains just the training portion and not the evaluation portion.
Here is how the native method can be tweaked to generate predictions on the test set:
# Put model in evaluation mode
model.eval()
# Tracking variables for storing ground truth and predictions
predictions , true_labels = [], []
# Prediction Loop
for batch in test_dataset:
# Unpack the inputs from our dataloader and move to GPU/accelerator
input_ids = batch['input_ids'].to(device)
attention_mask = batch['attention_mask'].to(device)
labels = batch['labels'].to(device)
# Telling the model not to compute or store gradients, saving memory and
# speeding up prediction
with torch.no_grad():
# Forward pass, calculate logit predictions
outputs = model(input_ids, attention_mask=attention_mask,
labels=labels)
logits = outputs[0]
# Move logits and labels to CPU
logits = logits.detach().cpu().numpy()
label_ids = labels.to('cpu').numpy()
# Store predictions and true labels
predictions.append(logits)
true_labels.append(label_ids)
After the execution of this loop, predictions will contain logits, i.e., the probability distribution from the model before any form of normalization.
You can use the following to pick the label with the maximum score from the logits, and produce a classification report
from sklearn.metrics import classification_report, accuracy_score
# Combine the results across all batches.
flat_predictions = np.concatenate(predictions, axis=0)
# For each sample, pick the label (0 or 1) with the higher score.
flat_predictions = np.argmax(flat_predictions, axis=1).flatten()
# Combine the correct labels for each batch into a single list.
flat_true_labels = np.concatenate(true_labels, axis=0)
# Accuracy
print(accuracy_score(flat_true_labels, flat_predictions))
# Classification Report
report = classification_report(flat_true_labels, flat_predictions)
For a more elegant way of performing predictions, you can create a BERTModel Class that would contain different methods and variables for handling the tokenization, creation of dataloader, running the predictions, etc.

You can try code like this example: Link-BERT
You'll arrange the dataset according to the BERT model. D Section in this link, you can just change the model name and your dataset.

Converting two Numpy data sets into a particularr PyTorch data set

I want to play around with a neural network that recognizes handwritten numbers. I found some of these on the web which use PyTorch, however they seem to download the data from the MNIST website in a particular format. My data is, however, available as follows:
with np.load('prediction-challenge-01-data.npz') as fh:
data_x = fh['data_x']
data_y = fh['data_y']
Where data_x is the training data and data_y are the labels of the pictures. I want these data sets to be in the same format as trainloader as shown below:
trainset = datasets.MNIST('/data/mnist', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
Where trainloader already has the training set data_x and labels data_y together in one set.
Is there any way to do this?
Edit: Shapes of data_x and data_y:
In [1]: data_x.shape
Out[2]: (20000, 1, 28, 28)
In [5]: data_y.shape
Out[7]: (20000,)

You can easily create your own dataset. Just inherit from torch.utils.data.Dataset and implement
__getitem__ at the very least:
Here is a quick and dirty example to get you going:
class YourOwnDataset(torch.utils.data.Dataset):
def __init__(self, input_file_path, transformations) :
super().__init__()
self.path = input_file_path
self.transforms = transformations
with np.load(self.path) as fh:
# I assume fh['data_x'] is a list you get the idea
self.data = fh['data_x']
self.labels = fh['data_y']
# in getitem, we retrieve one item based on the input index
def __getitem__(self, index):
data = self.data[index]
# based on the loss you chose and what you have in mind,
# you can transform you label, here I assume they are
# integer numbers (like, 1, 3, etc as labels used for classification)
label = self.labels[index]
img = convert/reshape your data into img
img = self.transforms(img)
return img, labels
def __len__(self):
return len(self.data)
and you can create your dataset like :
from torchvision import transforms
# add any number of transformations you like, I just added ToTensor()
transformations = transforms.Compose([transforms.ToTensor()])
trainset = YourOwnDataset('prediction-challenge-01-data.npz', transformations )
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

Training a Keras model from batches of .npy files using generator?

Currently I am dealing with a big data issue when training Image data using Keras. I have directory which has batch of .npy file. Each batch contain 512 images. Each batch has its corresponding label file as .npy. So it looks like: {image_file_1.npy, label_file_1.npy, ..., image_file_37.npy, label_file_37}. Each image file has dimension (512, 199, 199, 3), each label file has dimension (512, 1)(eather 1 or 0) . If I load all the images in one ndarray it will be 35+ GB. So far reading all the Keras Doc. I am still not able to find how I will be able to train using custom generator. I have read about flow_from_dict and ImageDataGenerator(...).flow() but they are not ideal in that case or I do not know how to customized them.Here what I have done.
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.optimizers import SGD
from keras.preprocessing.image import ImageDataGenerator
val_gen = ImageDataGenerator(rescale=1./255)
x_test = np.load("../data/val_file.npy")
y_test = np.load("../data/val_label.npy")
val_gen.fit(x_test)
model = Sequential()
...
model_1.add(layers.Dense(512, activation='relu'))
model_1.add(layers.Dense(1, activation='sigmoid'))
model.compile(loss='categorical_crossentropy',
optimizer=sgd,
metrics=['acc'])
model.fit_generator(generate_batch_from_directory() # should give 1 image file and 1 label file
validation_data=val_gen.flow(x_test,
y_test,
batch_size=64),
validation_steps=32)
So here generate_batch_from_directory() should take image_file_i.npy and label_file_i.npy every time and optimise the weight until there is no batch left. Each image array in the .npy files has already been processed with augmentation, rotation and scaling. Each .npy file is properly mixed with data from class 1 and 0 (50/50).
If I append all the batch and create a big file such as:
X_train = np.append([image_file_1, ..., image_file_37])
y_train = np.append([label_file_1, ..., label_file_37])
It does not fit in the memory. Otherwise I could use .flow() to generate image sets to train the model.
Thanks for any advise.

Finally I was able to solve that problem. But I had to go through source code and documentation of keras.utils.Sequence to build my own generator class. This document help a lot to understand how generator works in Kears. You can read more detail in my kaggle notebook:
all_files_loc = "datapsycho/imglake/population/train/image_files/"
all_files = os.listdir(all_files_loc)
image_label_map = {
"image_file_{}.npy".format(i+1): "label_file_{}.npy".format(i+1)
for i in range(int(len(all_files)/2))}
partition = [item for item in all_files if "image_file" in item]
class DataGenerator(keras.utils.Sequence):
def __init__(self, file_list):
"""Constructor can be expanded,
with batch size, dimentation etc.
"""
self.file_list = file_list
self.on_epoch_end()
def __len__(self):
'Take all batches in each iteration'
return int(len(self.file_list))
def __getitem__(self, index):
'Get next batch'
# Generate indexes of the batch
indexes = self.indexes[index:(index+1)]
# single file
file_list_temp = [self.file_list[k] for k in indexes]
# Set of X_train and y_train
X, y = self.__data_generation(file_list_temp)
return X, y
def on_epoch_end(self):
'Updates indexes after each epoch'
self.indexes = np.arange(len(self.file_list))
def __data_generation(self, file_list_temp):
'Generates data containing batch_size samples'
data_loc = "datapsycho/imglake/population/train/image_files/"
# Generate data
for ID in file_list_temp:
x_file_path = os.path.join(data_loc, ID)
y_file_path = os.path.join(data_loc, image_label_map.get(ID))
# Store sample
X = np.load(x_file_path)
# Store class
y = np.load(y_file_path)
return X, y
# ====================
# train set
# ====================
all_files_loc = "datapsycho/imglake/population/train/image_files/"
all_files = os.listdir(all_files_loc)
training_generator = DataGenerator(partition)
validation_generator = ValDataGenerator(val_partition) # work same as training generator
hst = model.fit_generator(generator=training_generator,
epochs=200,
validation_data=validation_generator,
use_multiprocessing=True,
max_queue_size=32)

Tensorflow CNN image augmentation pipeline

I'm trying to learn the new Tensorflow APIs and I am a bit lost on where to get a handle on my input batch tensors so I can manipulate and augment them with for example tf.image.
This is the my current network & pipeline:
trainX, testX, trainY, testY = read_data()
# trainX [num_image, height, width, channels], these are numpy arrays
#...
train_dataset = tf.data.Dataset.from_tensor_slices((trainX, trainY))
test_dataset = tf.data.Dataset.from_tensor_slices((testX, testY))
#...
iterator = tf.data.Iterator.from_structure(train_dataset.output_types,
train_dataset.output_shapes)
features, labels = iterator.get_next()
train_init_op = iterator.make_initializer(train_dataset)
test_init_op = iterator.make_initializer(test_dataset)
#...defining cnn architecture...
# In the train loop
TrainLoop {
sess.run(train_init_op) # switching to train data
sess.run(train_step, ...) # running a train step
#...
sess.run(test_init_op) # switching to test data
test_loss = sess.run(loss, ...) # printing test loss after epoch
}
I'm using the Dataset API creating 2 datasets so that in the trainloop I can calculate the train and test loss and log them.
Where in this pipeline would I manipulate and distort my input batch of images?
I'm not creating any tf.placeholders for my trainX input batches so I can't manipulate them with tf.image because for example tf.image.flip_up_down requires a 3-D or 4-D tensor.
What is the natural way to implement this pipeline with the new API?
Is there a module or easy way to augment an input batch of images for training that would fit in this pipeline?

There's a really good article and talk released recently that go over the API in a lot more detail than my response here. Here's a brief example:
import tensorflow as tf
import numpy as np
def read_data():
n_train = 100
n_test = 50
height = 20
width = 30
channels = 3
trainX = (np.random.random(
size=(n_train, height, width, channels)) * 255).astype(np.uint8)
testX = (np.random.random(
size=(n_test, height, width, channels))*255).astype(np.uint8)
trainY = (np.random.random(size=(n_train,))*10).astype(np.int32)
testY = (np.random.random(size=(n_test,))*10).astype(np.int32)
return trainX, testX, trainY, testY
trainX, testX, trainY, testY = read_data()
# trainX [num_image, height, width, channels], these are numpy arrays
train_dataset = tf.data.Dataset.from_tensor_slices((trainX, trainY))
test_dataset = tf.data.Dataset.from_tensor_slices((testX, testY))
def map_single(x, y):
print('Map single:')
print('x shape: %s' % str(x.shape))
print('y shape: %s' % str(y.shape))
x = tf.image.per_image_standardization(x)
# Consider: x = tf.image.random_flip_left_right(x)
return x, y
def map_batch(x, y):
print('Map batch:')
print('x shape: %s' % str(x.shape))
print('y shape: %s' % str(y.shape))
# Note: this flips ALL images left to right. Not sure this is what you want
# UPDATE: looks like tf documentation is wrong and you need a 3D tensor?
# return tf.image.flip_left_right(x), y
return x, y
batch_size = 32
train_dataset = train_dataset.repeat().shuffle(100)
train_dataset = train_dataset.map(map_single, num_parallel_calls=8)
train_dataset = train_dataset.batch(batch_size)
train_dataset = train_dataset.map(map_batch)
train_dataset = train_dataset.prefetch(2)
test_dataset = test_dataset.map(
map_single, num_parallel_calls=8).batch(batch_size).map(map_batch)
test_dataset = test_dataset.prefetch(2)
iterator = tf.data.Iterator.from_structure(train_dataset.output_types,
train_dataset.output_shapes)
features, labels = iterator.get_next()
train_init_op = iterator.make_initializer(train_dataset)
test_init_op = iterator.make_initializer(test_dataset)
with tf.Session() as sess:
sess.run(train_init_op)
feat, lab = sess.run((features, labels))
print(feat.shape)
print(lab.shape)
sess.run(test_init_op)
feat, lab = sess.run((features, labels))
print(feat.shape)
print(lab.shape)
A few notes:
This approach relies on being able to load your entire dataset into memory. If you cannot, consider using tf.data.Dataset.from_generator. This can lead to slow shuffle times if your shuffle buffer is large. My preferred method is to load some keys tensor entirely into memory - it might just be the indices of each example - then map that key value to data values using tf.py_func. This is slightly less efficient than converting to tfrecords, but with prefetching it likely won't affect performance. Since the shuffling is done before the mapping, you only have to load shuffle_buffer keys into memory, rather than shuffle_buffer examples.
To augment your dataset, use tf.data.Dataset.map either before or after the batch operation, depending on whether or not you want to apply a batch-wise operation (something working on a 4D image tensor) or element-wise operation (3D image tensor). Note it looks like the documentation for tf.image.flip_left_right is out of date, since I get an error when I try and use a 4D tensor. If you want to augment you data randomly, use tf.image.random_flip_left_right rather than tf.image.flip_left_right.
If you're using a tf.estimator.Estimator (or wouldn't mind converting your code to using it), then check out tf.estimator.train_and_evaluate for an in-built way of switching between datasets.
Consider shuffling/repeating your dataset with the shuffle/repeat methods. See the article for notes on efficiencies. In particular, repeat -> shuffle -> map -> batch -> batch-wise map -> prefetch seems to be the best ordering of operations for most applications.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

creating a train and a test dataloader - python

Related

AttributeError in Classification report

How to use DistilBERT Huggingface NLP model to perform sentiment analysis on new data?

Converting two Numpy data sets into a particularr PyTorch data set

Training a Keras model from batches of .npy files using generator?

Tensorflow CNN image augmentation pipeline

Categories

Resources