Splitting the dataset using SubsetRandomSampler not working - python

I used SubsetRandomSampler to split the training data into a training set (80%) and a validation set (20%). But it shows the same number of images for both after the split (4996):
>>> print('len(train_data): ', len(train_loader.dataset))
>>> print('len(valid_data): ', len(validation_loader.dataset))
len(train_data): 4996
len(valid_data): 4996
Full code:
import numpy as np
import torch
from torchvision import datasets, transforms
from torch.utils.data.sampler import SubsetRandomSampler

train_transforms = transforms.Compose([transforms.ToTensor(),
                                       transforms.Normalize([0.485, 0.456, 0.406],
                                                            [0.229, 0.224, 0.225])])
dataset = datasets.ImageFolder('/data/images/train', transform=train_transforms)
validation_split = .2
shuffle_dataset = True
random_seed= 42
batch_size = 20
dataset_size = len(dataset) #4996
indices = list(range(dataset_size))
split = int(np.floor(validation_split * dataset_size))
if shuffle_dataset:
    np.random.seed(random_seed)
    np.random.shuffle(indices)
train_indices, val_indices = indices[split:], indices[:split]
train_sampler = SubsetRandomSampler(train_indices)
valid_sampler = SubsetRandomSampler(val_indices)
train_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, sampler=train_sampler)
validation_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, sampler=valid_sampler)

train_loader.dataset and validation_loader.dataset are attributes that reference the underlying original dataset the loaders sample from (i.e. the original dataset of size 4996).
If you iterate through the loaders themselves, however, you will see that each one only yields as many samples (accounting for batching) as are included in its sampler's indices.
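To check the actual split sizes, you can count what the loaders yield, or just ask the samplers; a minimal sketch, assuming the loaders defined above:

# Count the samples each loader actually yields
n_train = sum(images.size(0) for images, _ in train_loader)
n_valid = sum(images.size(0) for images, _ in validation_loader)
print('train samples:', n_train)  # 3997 (80% of 4996)
print('valid samples:', n_valid)  # 999 (20% of 4996, floored)
# The samplers report the same sizes without iterating:
print(len(train_loader.sampler), len(validation_loader.sampler))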

Too much RAM is required for loading dataset

I'm working on a neural network and my dataset has 42000 images, all of which I have to load. I'm using Google Colab, but every time I load the dataset the RAM is insufficient.
I'm putting everything in a NumPy array, because I tried the ImageDataGenerator approach and it didn't work. I'm using the following code to load the data:
import glob
import numpy as np
import tensorflow as tf

# 'class' is a reserved word in Python, so the list needs a different name
class_paths = glob.glob(r"/content/drive/MyDrive/DATASET/class/*.*")
data = []
labels = []
for i in class_paths:
    image = tf.keras.preprocessing.image.load_img(i, color_mode='rgb',
                                                  target_size=(336, 336))
    image = np.array(image)
    data.append(image)
    labels.append(0)
data = np.array(data)
labels = np.array(labels)
As ImageDataGenerator is deprecated, you can use a custom Keras Sequence class to load images when needed.
The strategy here is to build a Pandas DataFrame with the path and class of every image, then turn the class into a numeric label with pd.factorize. Once you have X (paths) and y (labels), you can use train_test_split to extract three subsets: train, test and validation. The last step is to wrap these collections in datasets compatible with TensorFlow.
Each time TensorFlow processes a batch, the Sequence loads just that batch of images into memory, and so on.
Step 0: Imports and constants
import tensorflow as tf
import pandas as pd
import numpy as np
import pathlib
from sklearn.model_selection import train_test_split
INPUT_SHAPE = (336, 336, 3)
BATCH_SIZE = 32
DATA_DIR = pathlib.Path('/content/drive/MyDrive/DATASET/')
Step 1: Load all image paths into a Pandas DataFrame:
# Find images of dataset
data = []
for file in DATA_DIR.glob('**/*.jpg'):
    d = {'class': file.parent.name,
         'path': file}
    data.append(d)

# Create dataframe and select columns
df = pd.DataFrame(data)
df['label'] = pd.factorize(df['class'])[0]
X = df['path']
y = df['label']
# Split into 3 subsets (pass stratify=y to keep the class proportions balanced)
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.2, random_state=2023)
X_train, X_valid, y_train, y_valid = \
    train_test_split(X_train, y_train, test_size=0.2, random_state=2023)
Step 2: Create a custom data Sequence
class ImgDataSequence(tf.keras.utils.Sequence):
    """
    Check documentation here: https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence
    """
    def __init__(self, image_set, label_set, batch_size=32, image_size=(256, 256)):
        self.image_set = np.array(image_set)
        self.label_set = np.array(label_set)
        self.batch_size = batch_size
        self.image_size = image_size

    def __get_image(self, image):
        # Load a single image from disk and convert it to a float array
        image = tf.keras.preprocessing.image.load_img(image, color_mode='rgb', target_size=self.image_size)
        image_arr = tf.keras.preprocessing.image.img_to_array(image)
        return image_arr

    def __get_data(self, images, labels):
        # Load one batch of images; only this batch is held in memory at a time
        image_batch = np.asarray([self.__get_image(img) for img in images])
        label_batch = np.asarray(labels)
        return image_batch, label_batch

    def __getitem__(self, index):
        images = self.image_set[index * self.batch_size:(index + 1) * self.batch_size]
        labels = self.label_set[index * self.batch_size:(index + 1) * self.batch_size]
        images, labels = self.__get_data(images, labels)
        return images, labels

    def __len__(self):
        # Number of batches, counting a final partial batch if present
        return len(self.image_set) // self.batch_size + (len(self.image_set) % self.batch_size > 0)
Step 3: Create datasets
train_ds = ImgDataSequence(X_train, y_train, image_size=INPUT_SHAPE[:2], batch_size=BATCH_SIZE)
valid_ds = ImgDataSequence(X_valid, y_valid, image_size=INPUT_SHAPE[:2], batch_size=BATCH_SIZE)
test_ds = ImgDataSequence(X_test, y_test, image_size=INPUT_SHAPE[:2], batch_size=BATCH_SIZE)
Test the new datasets:
# Take the first batch of our train dataset
>>> imgs, labels = train_ds[0]
# Check the length (BATCH_SIZE)
>>> len(labels)
32
# Check the dimension of one image
>>> imgs[0].shape
(336, 336, 3)
How to use it with TensorFlow?
# train_ds & valid_ds to fit
history = model.fit(train_ds, epochs=10, validation_data=valid_ds)
# test_ds to evaluate
loss, *metrics = model.evaluate(test_ds)
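One caveat (my addition, not part of the original answer): as written, the Sequence yields batches in the same order every epoch. Keras calls on_epoch_end() on a Sequence after each epoch, so a small subclass can reshuffle the training order between epochs:

class ShuffledImgDataSequence(ImgDataSequence):
    """Same Sequence, but reshuffles the sample order after every epoch."""
    def on_epoch_end(self):
        # Permute paths and labels together so the pairs stay aligned
        perm = np.random.permutation(len(self.image_set))
        self.image_set = self.image_set[perm]
        self.label_set = self.label_set[perm]

Use it for train_ds only; the order of validation and test batches doesn't matter.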

How to split a dataset into a custom training set and a custom validation set with pytorch?

I'm using a non-torchvision dataset that I have loaded with the ImageFolder method. I'm trying to split the dataset into a 20% validation set and an 80% training set. I can only find random_split in the PyTorch library for splitting a dataset, but it is random every time. Is there a way to split the dataset into fixed subsets in PyTorch?
This is my code for loading the dataset and splitting it randomly.
transformations = transforms.Compose([
    transforms.Resize(255),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
TrafficSignSet = datasets.ImageFolder(root='./train/', transform=transformations)
####### split data
train_size = int(0.8 * len(TrafficSignSet))
test_size = len(TrafficSignSet) - train_size
train_dataset_split, test_dataset_split = torch.utils.data.random_split(TrafficSignSet, [train_size, test_size])
#######put into a Dataloader
train_dataset = torch.utils.data.DataLoader(train_dataset_split, batch_size=32, shuffle=True)
test_dataset = torch.utils.data.DataLoader(test_dataset_split, batch_size=32, shuffle=True)
If you look "under the hood" of random_split you'll see it uses torch.utils.data.Subset to do the actual splitting. You can do so yourself with fixed indices:
import random

indices = list(range(len(TrafficSignSet)))
random.seed(310)  # fix the seed so the shuffle will be the same every time
random.shuffle(indices)
train_dataset_split = torch.utils.data.Subset(TrafficSignSet, indices[:train_size])
val_dataset_split = torch.utils.data.Subset(TrafficSignSet, indices[train_size:])
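As a side note (my addition, assuming a reasonably recent PyTorch version): random_split itself accepts a seeded generator, which makes the split reproducible without managing indices by hand. A minimal sketch, reusing train_size and test_size from the question:

generator = torch.Generator().manual_seed(310)  # same split on every run
train_dataset_split, val_dataset_split = torch.utils.data.random_split(
    TrafficSignSet, [train_size, test_size], generator=generator)

Either way, wrap the resulting subsets in DataLoaders exactly as in the question.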

CNN is overfitting to minority class

I'm new to developing CNNs and am currently building a binary image classifier using PyTorch. My dataset was heavily unbalanced, so I've manually augmented my testing and training splits to balance them. I have class 0 (6500 training images) and class 1 (around 5200 training images). When I use skorch's fit function, I get a validation accuracy equal to the percentage of class 1 images in the set, and my prediction function outputs 1 for every image.
This is the tutorial I adapted my CNN from: https://colab.research.google.com/github/dnouri/skorch/blob/master/notebooks/Transfer_Learning.ipynb#scrollTo=cane7VRWW3dO
Here's my CNN:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models
import numpy as np
from numpy import array
import pandas as pd
from skorch import NeuralNetClassifier
from skorch.helper import predefined_split
from skorch.callbacks import LRScheduler
from skorch.callbacks import Checkpoint
from skorch.callbacks import Freezer
from PIL import Image
import skorch
import os
import cv2
import glob
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer
# Define transforms (will be the same w/o online transforms)
# Manually augmented earlier
train_transforms = transforms.Compose([
    #transforms.RandomResizedCrop(224),
    #transforms.RandomHorizontalFlip(),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])
])
val_transforms = transforms.Compose([
    #transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])
])
train_ds = datasets.ImageFolder(train_split_path, train_transforms)
val_ds = datasets.ImageFolder(test_aug_split_path, val_transforms)
checkpoint = Checkpoint(
    f_params='best_model.pt', monitor='valid_acc_best')
# Using ResNet with some extra layers for now
class PretrainedModel(nn.Module):
    def __init__(self, output_features):
        super().__init__()
        model = models.resnet152(pretrained=True)
        # Don't want to change pretrained weights
        for param in model.parameters():
            param.requires_grad = False
        num_features = model.fc.in_features
        fc_layers = nn.Sequential(
            nn.Linear(num_features, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.3),
            nn.Linear(4096, output_features),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.3),
        )
        model.fc = fc_layers
        self.model = model

    def forward(self, x):
        return self.model(x)
use_cuda = torch.cuda.is_available()
net = NeuralNetClassifier(
    module=PretrainedModel,
    module__output_features=2,
    criterion=nn.CrossEntropyLoss,
    batch_size=16,
    lr=0.0001,
    max_epochs=3,
    optimizer=optim.Adam,
    train_split=predefined_split(val_ds),
    callbacks=[checkpoint],
    device=torch.device("cuda" if use_cuda else "cpu")
)
net.fit(train_ds, y=None)
Here are the results of the fit function:
Epoch 1: Train loss = .46 Valid_acc = .4301 Valid Loss = .6931
Epoch 2: Train loss = .6931 Valid_acc = .4301 Valid Loss = .6931
Epoch 3: Train loss = .6931 Valid_acc = .4301 Valid Loss = .6931
As it turns out for this particular dataset, exactly 43% of my validation images are Class 1 images.
y_pred = net.predict(val_ds) gives me the following:
array([1, 1, 1, ..., 1, 1, 1], dtype=int64)
I guess I have two questions:
1) Is there anything that I have done incorrectly in my initialization of my CNN that would cause this?
2) What would cause this, and is there anything I can do to correct for it?
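No answer is attached here, but one thing worth checking (my observation, not part of the original post): nn.CrossEntropyLoss expects raw logits, while the head above applies ReLU and Dropout after the final Linear layer; the ReLU clamps every negative logit to zero, and the validation loss plateauing at 0.6931 ≈ ln(2) is consistent with the model collapsing to a single class. A sketch of the same head from __init__ with the trailing activation removed:

fc_layers = nn.Sequential(
    nn.Linear(num_features, 4096),
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.3),
    nn.Linear(4096, output_features),  # raw logits go straight to the loss
)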

Unexpected results when using tfrecords loaded using tf.data.Dataset.list_files() with shuffle argument

I'm hoping to get clarification on how the shuffle argument in tf.data.Dataset.list_files() works. The documentation states that when shuffle=True, the filenames will be shuffled randomly. I've made model predictions using a tfrecords dataset that has been loaded using tf.data.Dataset.list_files(), and I would've expected the accuracy metric to be the same no matter the order of the files (i.e. whether shuffle is True or False), but am seeing otherwise.
Is this expected behavior or is there something wrong with my code or interpretation? I have reproducible example code below.
Oddly, as long as tf.random.set_random_seed() is set initially (it seems it doesn't even matter what seed value is used), the prediction results are the same whether shuffle is True or False in list_files().
tensorflow==1.13.1, keras==2.2.4
Thanks for any clarifications!
Edit: re-thinking it through, I'm wondering whether Y = [y[0] for _ in range(steps) for y in sess.run(Y)] is a separate and independent evaluation of the pipeline?
# Fit and Save a Dummy Model
import numpy as np
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import np_utils
from sklearn import datasets, metrics
seed = 7
np.random.seed(seed)
tf.random.set_random_seed(seed)
dataset = datasets.load_iris()
X = dataset.data
Y = dataset.target
dummy_Y = np_utils.to_categorical(Y)
# 150 rows
print(len(X))
model = Sequential()
model.add(Dense(8, input_dim=4, activation='relu'))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, dummy_Y, epochs=10, batch_size=10, verbose=2)
model.save('./iris/iris_model')
predictions = model.predict(X)
predictions = np.argmax(predictions, axis=1)
# returns accuracy = 0.3466666666666667
print(metrics.accuracy_score(y_true=Y, y_pred=predictions))
Split the dataset into multiple tfrecord files so we can reload them with list_files() later:
numrows = 15
for i, j in enumerate(range(0, len(X), numrows)):
    with tf.python_io.TFRecordWriter('./iris/iris{}.tfrecord'.format(i)) as writer:
        for x, y in zip(X[j:j+numrows, ], Y[j:j+numrows, ]):
            features = tf.train.Features(feature={
                'X': tf.train.Feature(float_list=tf.train.FloatList(value=x)),
                'Y': tf.train.Feature(int64_list=tf.train.Int64List(value=[y]))
            })
            example = tf.train.Example(features=features)
            writer.write(example.SerializeToString())
At this point, I exit (ipython) and restart:
import numpy as np
import tensorflow as tf
from keras.models import load_model
from sklearn import metrics
model = load_model('./iris/iris_model')
batch_size = 10
steps = int(150/batch_size)
file_pattern = './iris/iris*.tfrecord'
feature_description = {
    'X': tf.FixedLenFeature([4], tf.float32),
    'Y': tf.FixedLenFeature([1], tf.int64)
}

def _parse_function(example_proto):
    return tf.parse_single_example(example_proto, feature_description)

def load_data(filenames, batch_size):
    raw_dataset = tf.data.TFRecordDataset(filenames)
    dataset = raw_dataset.map(_parse_function)
    dataset = dataset.batch(batch_size, drop_remainder=True)
    dataset = dataset.prefetch(2)
    iterator = dataset.make_one_shot_iterator()
    record = iterator.get_next()
    return record['X'], record['Y']

def get_predictions_accuracy(filenames):
    X, Y = load_data(filenames=filenames, batch_size=batch_size)
    predictions = model.predict([X], steps=steps)
    predictions = np.argmax(predictions, axis=1)
    print(len(predictions))
    with tf.Session() as sess:
        Y = [y[0] for _ in range(steps) for y in sess.run(Y)]
        print(metrics.accuracy_score(y_true=Y, y_pred=predictions))
# No shuffle results:
# Returns expected accuracy = 0.3466666666666667
filenames_noshuffle = tf.data.Dataset.list_files(file_pattern=file_pattern, shuffle=False)
get_predictions_accuracy(filenames_noshuffle)
# Shuffle results, no seed value set:
# Returns UNEXPECTED accuracy (non-deterministic value)
filenames_shuffle_noseed = tf.data.Dataset.list_files(file_pattern=file_pattern, shuffle=True)
get_predictions_accuracy(filenames_shuffle_noseed)
# Shuffle results, seed value set:
# Returns expected accuracy = 0.3466666666666667
# It seems like it doesn't even matter what seed value you set, as long as you set it
seed = 1000
tf.random.set_random_seed(seed)
filenames_shuffle_seed = tf.data.Dataset.list_files(file_pattern=file_pattern, shuffle=True)
get_predictions_accuracy(filenames_shuffle_seed)
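On the Edit above (my reading, not a confirmed answer): model.predict and the later sess.run(Y) each advance the input pipeline independently, so with an unseeded shuffle the two passes can list the files in different orders and the X/Y pairing gets scrambled; that would explain why shuffle=False or a fixed graph seed restores the expected accuracy. One way to rule this out is to fetch features and labels in the same sess.run call; a minimal sketch under the same TF1 setup:

def get_predictions_accuracy_aligned(filenames):
    X, Y = load_data(filenames=filenames, batch_size=batch_size)
    xs, ys = [], []
    with tf.Session() as sess:
        for _ in range(steps):
            # One sess.run pulls a matched (features, labels) batch
            x_batch, y_batch = sess.run([X, Y])
            xs.append(x_batch)
            ys.append(y_batch[:, 0])
    xs, ys = np.concatenate(xs), np.concatenate(ys)
    predictions = np.argmax(model.predict(xs), axis=1)
    print(metrics.accuracy_score(y_true=ys, y_pred=predictions))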

LSTM-Keras Error: ValueError: non-broadcastable output operand with shape (67704,1) doesn't match the broadcast shape (67704,12)

Good morning everyone. I'm trying to implement this LSTM algorithm using Keras, with pandas to read in the CSV file; the backend I'm using is TensorFlow. I'm having a problem when it comes to inverse-transforming my results after predicting on the training set. Below is my code:
import numpy
import matplotlib.pyplot as plt
import pandas
import math
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
#plt.plot(dataset)
#plt.show()
#fix random seed for reproducibility
numpy.random.seed(7)
#Load dataset
col_names = ['UserID', 'SysTouchTime', 'EventTime', 'ActivityTouchID', 'Pointer_count', 'PointerID',
             'ActionID', 'Touch_X', 'Touch_Y', 'Touch_Pressure', 'Contact_Size', 'Phone_Orientation']
dataframe = pandas.read_csv('touchEventsFor5Users.csv', engine='python', header=None, names=col_names, skiprows=1)
#print(dataset.head())
#print(dataset.shape)
dataset = dataframe.astype('float32')
print(dataset.isnull().any())
dataset = dataset.fillna(method='ffill')
feature_cols = ['SysTouchTime', 'EventTime', 'ActivityTouchID', 'Pointer_count', 'PointerID', 'ActionID', 'Touch_X', 'Touch_Y', 'Touch_Pressure', 'Contact_Size', 'Phone_Orientation']
X = dataset[feature_cols]
y = dataset['UserID']
print(y.head())
#normalize the dataset
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(dataset)
# split into train and test sets
train_size = int(len(dataset) * 0.67)
test_size = len(dataset) - train_size
train, test = dataset[0:train_size, :], dataset[train_size:len(dataset),:]
print(len(train), len(test))
# convert an array of values into a dataset matrix
def create_dataset(dataset, look_back=1):
    dataX, dataY = [], []
    for i in range(len(dataset)-look_back-1):
        a = dataset[i:(i+look_back), 0]
        dataX.append(a)
        dataY.append(dataset[i + look_back, 0])
    return numpy.array(dataX), numpy.array(dataY)
# reshape into X=t and Y=t+1
look_back = 1
trainX, trainY = create_dataset(train, look_back)
testX, testY = create_dataset(test, look_back)
#reshape input to be [samples, time steps, features]
trainX = numpy.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
testX = numpy.reshape(testX, (testX.shape[0], 1, testX.shape[1]))
#create and fit the LSTM network
model = Sequential()
model.add(LSTM(4, input_dim=look_back))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
model.fit(trainX, trainY, epochs=1, batch_size=32, verbose=2)
# make predictions
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)
# invert predictions
import gc
gc.collect()
#####problem occurs with the following line of code#############
trainPredict = scaler.inverse_transform(trainPredict)
trainY = scaler.inverse_transform([trainY])
testPredict = scaler.inverse_transform(testPredict)
testY = scaler.inverse_transform([testY])
# calculate root mean squared error
trainScore = math.sqrt(mean_squared_error(trainY[0], trainPredict[:,0]))
print('Train Score: %.2f RMSE' % (trainScore))
testScore = math.sqrt(mean_squared_error(testY[0], testPredict[:,0]))
print('Test Score: %.2f RMSE' % (testScore))
#shift train predictions for plotting
trainPredictPlot = numpy.empty_like(dataset)
trainPredictPlot[:, :] = numpy.nan
trainPredictPlot[look_back:len(trainPredict)+look_back, :] = trainPredict
# shift test predictions for plotting
testPredictPlot = numpy.empty_like(dataset)
testPredictPlot[:, :] = numpy.nan
testPredictPlot[len(trainPredict)+(look_back*2)+1:len(dataset)-1, :] = testPredict
# plot baseline and predictions
plt.plot(scaler.inverse_transform(dataset))
plt.plot(trainPredictPlot)
plt.plot(testPredictPlot)
plt.show()
The error that I get is
ValueError: non-broadcastable output operand with shape (67704,1) doesn't match the broadcast shape (67704,12)
Think you guys could help me solve this problem? I'm very new to this but want to learn it so bad, and this error is making me suffer! Thanks for any help that can be provided.
When you scale your data, the scaler fits the 12 fields independently: it takes the min and max of each field and maps that field to the 0-1 range.
When you then call inverse_transform with only one field, it makes no sense to the function: it doesn't know which field it is or what its min and max were. You need to feed it a 12-field dataset, with the predicted field in the right place.
Try to add this before the problematic line:

# create an empty table with 12 fields
trainPredict_dataset_like = numpy.zeros(shape=(len(trainPredict), 12))
# put the predicted values in the right field
trainPredict_dataset_like[:, 0] = trainPredict[:, 0]
# inverse transform and then select the right field
trainPredict = scaler.inverse_transform(trainPredict_dataset_like)[:, 0]
Does this help? :)
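For completeness (my extension, not part of the original answer), trainY, testPredict and testY need the same padding before their inverse transforms; a small helper keeps it tidy:

def inverse_field(scaler, values, n_fields=12, field=0):
    # Pad one column into a full-width array, invert, take the column back
    padded = numpy.zeros(shape=(len(values), n_fields))
    padded[:, field] = numpy.ravel(values)
    return scaler.inverse_transform(padded)[:, field]

trainY = inverse_field(scaler, trainY)
testPredict = inverse_field(scaler, testPredict)
testY = inverse_field(scaler, testY)

These come back as 1-D arrays, so the RMSE lines simplify to mean_squared_error(trainY, trainPredict) and mean_squared_error(testY, testPredict).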
Here is the same zero-padding idea in a different setting: scale a DataFrame, predict, then inverse-transform by writing the predicted column into a zero-filled frame with the original shape.

x_df = pd.DataFrame(prepared, None, [
    'open',
    '..',
])

# Scale
# ------------------------------------------------------------------------
scaler = MinMaxScaler()
scaled = scaler.fit_transform(x_df)
x_df_scaled = pd.DataFrame(scaled, None, x_df.keys())
x_df_scaled_expanded = np.expand_dims(x_df_scaled, axis=0)

# Model
# ------------------------------------------------------------------------
model = tf.keras.models.load_model(filepath_model)
y = model.predict(x_df_scaled_expanded)

# Scale back
# ------------------------------------------------------------------------
y_df = pd.DataFrame(np.zeros((len(x_df), len(x_df.columns))), columns=x_df.columns)
y_df['open'] = y[0][:, 0]
y_inversed = scaler.inverse_transform(y_df)
y_df['open'] = y_inversed[:, 0]
