CIFAR10 dataloader sampler split

CIFAR10 dataloader sampler split - python

i am trying to split the training data of CIFAR10 so the last 5000 of the training set is used for validation. my code
size = len(CIFAR10_training)
dataset_indices = list(range(size))
val_index = int(np.floor(0.9 * size))
train_idx, val_idx = dataset_indices[:val_index], dataset_indices[val_index:]
train_sampler = SubsetRandomSampler(train_idx)
val_sampler = SubsetRandomSampler(val_idx)
train_dataloader = torch.utils.data.DataLoader(CIFAR10_training,
batch_size=config['batch_size'],
shuffle=False, sampler = train_sampler)
valid_dataloader = torch.utils.data.DataLoader(CIFAR10_training,
batch_size=config['batch_size'],
shuffle=False, sampler = val_sampler)
print(len(train_dataloader.dataset),len(valid_dataloader.dataset),
but the last print statement prints 50000 and 10000. should it not be 45000 and 5000
when i print the train_idx and val_idx it prints the right values([0:44999],[45000:49999]
is there anything wrong with my code

I cannot replicate your results, when I execute your code, the print statements outputs twice the same number : the number of elements in train_CIFAR10. So I guess you made a mistake when copying your code, and valid_dataloader is actually given CIFAR10_test (or something like that) as parameter. In the following, I'm gonna assume that it's the case, and that your print outputs (50000, 50000), which is the size of the training part of Pytorch's CIFAR10 dataset.
Then it is completely expected, and no it should not output (45000, 5000). You are asking for the length of train_dataloader.dataset and valid_dataloader.dataset, i.e the length of the underlying datasets. For both your loaders, this dataset is CIFAR10_training. Therefore you will get twice the size of this dataset (i.e 50000).
You cannot ask for len(train_dataloader) either, because you that would yield the number of batches in your dataset (approximately 45000/batch_size).
If you need to know the size of your splits, then you have to compute the length of your samplers:
print(len(train_dataloader.sampler), len(valid_dataloader.sampler))
Besides this, your code is fine, you are correctly splitting your data.

Related

Tensorflow dataset iterator pick a sub-sample of whole data

I have a code that generates an iterator from a Tensorflow dataset. The code is this:
#tf.function
def normalize_image(record):
out = record.copy()
out['image'] = tf.cast(out['image'], 'float32') / 255.
return out
train_it = iter(tfds.builder('mnist').as_dataset(split='train').map(normalize_image).repeat().batch(256*10))
However, I want to do the manual splitting. For example, the MNISt dataset has 60000 training samples, but I want to only use the first 50000 (and hold others for validation). The problem is I don't know how to do so.
I tried to convert it to NumPy and split based on that, but then I couldn't apply the map to it.
ds_builder = tfds.builder('mnist')
print(dir(ds_builder))
ds_builder.download_and_prepare()
train_ds = tfds.as_numpy(ds_builder.as_dataset(split='train', batch_size=-1))
train_ds['image'] = train_ds['image'][0:50000, : , :]
train_ds['label'] = train_ds['label'][0:50000]
I was wondering how to do so.
P.S: The ordering of data is also important for me, so I was thinking of loading all data in Numpy and saving the required ones in png and loading with tfds, but I'm not sure if it keeps the original order or not. I want to take the first 50000 samples of the whole 60000 samples.
Thanks.

train_ds = tfds.builder('mnist').as_dataset(split='train').map(normalize_image)
train_ds = train_ds.take(50000).repeat().batch(256*10)
val_ds = tfds.builder('mnist').as_dataset(split='train').map(normalize_image)
val_ds = val_ds.skip(50000).batch(256*10)
train_it = iter(train_ds)
val_it = iter(val_ds)

tensorflow 2.0, model.fit() : Your input ran out of data

I am absolutely new to TensorFlow and Keras, and I am trying to make my way around trying out some code that I am finding online.
In particular I am using the fashion-MNIST - consisting of 60000 examples and test set of 10000 examples. Each of them is a 28x28 grayscale image.
I am following this tutorial "https://towardsdatascience.com/building-your-first-neural-network-in-tensorflow-2-tensorflow-for-hackers-part-i-e1e2f1dfe7a0", and I have no problem until the definition of
history = model.fit(
train_dataset.repeat(),
epochs=10,
steps_per_epoch=500,
validation_data=val_dataset.repeat(),
validation_steps=2)
As long as I understood, I need to use train_dataset.repeat() as input dataset because otherwise I won't have enough training example using those values for the hyperparameters (epochs, steps_per_epochs).
My question is: how can I avoid to have to use .repeat()?
How do I need to change the hyperparameters?
I am coping the code here, for simplicity:
def preprocess(x,y):
x = tf.cast(x,tf.float32) / 255.0
y = tf.cast(y, tf.float32)
return x,y
def create_dataset(xs, ys, n_classes=10):
ys = tf.one_hot(ys, depth=n_classes)
return tf.data.Dataset.from_tensor_slices((xs, ys)).map(preprocess).shuffle(len(ys)).batch(128)
model.compile(optimizer = 'adam', loss =tf.losses.CategoricalCrossentropy(from_logits= True), metrics =['accuracy'])
history1 = model.fit(train_dataset.repeat(),
epochs=10,
steps_per_epoch=500,
validation_data=val_dataset.repeat(),
validation_steps=2)
Thanks!

If you don't want to use .repeat() you need to have your model passing thought your entire data only one time per epoch.
In order to do that you need to calculate how many steps it will take for your model to pass throught the entire dataset, the calcul is easy :
steps_per_epoch = len(train_dataset) // batch_size
So with a train_dataset of 60 000 sample and a batch_size of 128, you need to have 468 steps per epoch.
By setting this parameter like that you make sure that you do not exceed the size of your dataset.

I encountered the same problem and here is what I found.
Documentation of tf.keras.Model.fit: "If x is a tf.data dataset, and 'steps_per_epoch' is None, the epoch will run until the input dataset is exhausted."
In other words, we don't need to specify 'steps_per_epoch' if we use the tf.data.dataset as the training data, and tf will figure out how many steps are there. Meanwhile, tf will automatically repeat the dataset when the next epoch begins, so you can specify any 'epoch'.
When passing an infinitely repeating dataset (e.g. dataset.repeat()), you must specify the steps_per_epoch argument.

Keras Validation Accuracy is Zero but other metrics are normal

I am working on a computer vision problem in keras and I have run into a an interesting problem. My val_acc is 0.0000e+00. This is especially interesting as my other metrics such as loss, acc, and val_loss all are acting normally.
This started happening when I switched from the Sequence data_generator to a custom one that I'm pretty sure is working as intended. My issue is very similar to this one validation accuracy is 0 with Keras fit_generator but no answer was reached in that thread.
I have checked to make sure my activations and loss metrics are appropriate for my particular problem. I am using: loss='categorical_crossentropy' metrics=['accuracy'] and am attempting to predict the month that a certain spectrogram comes from.The validation data is being loaded in the exact same way as the training data so I really can't figure out whats happening also even random guessing should give a 1/12 val_acc right? It can't be zero.
Here is my model architecture:
x = (Convolution2D(32,5,5,activation='relu',input_shape=(501,501,1)))(input_img)
x = (MaxPooling2D(pool_size=(2,2)))(x)
x = (Convolution2D(32,5,5,activation='relu'))(x)
x = (MaxPooling2D(pool_size=(2,2)))(x)
x = (Dropout(0.25))(x)
x = (Flatten())(x)
x = (Dense(128,activation='relu'))(x)
x = (Dropout(0.5))(x)
classify = (Dense(12,activation='softmax', kernel_regularizer=regularizers.l1_l2(l1 = 0.001,l2 = 0.001)))(x)
model = Model(input_img,classify)
model.compile(loss='categorical_crossentropy',optimizer='nadam',metrics=['accuracy'])
and here is my call to fit_generator:
model.fit_generator(generator = pd.data_generator(folder,'train'),
validation_data = pd.data_generator(folder,'test'),
steps_per_epoch=(120),
validation_steps=(24),
nb_epoch=20,
verbose=1,
shuffle=True,
callbacks=[tensorboard_callback,early_stop_callback])
and finally here is the important part of my data generator:
if mode == 'test':
print('test')
while True:
for things in up.unpickle_batch(folder,50,6000,7200): #The last 1200 things in batches of 50
random.shuffle(things)
test_spect = []
test_months = []
for thing in things:
test_spect.append(thing.spect) #GET BATCH DATA
test_months.append(thing.month-1) #this is is here because the months go from 1-12 but should go from 0-11 for to_categorical
x_test = np.asarray(test_spect) #PREPARE BATCH DATA
x_test = x_test.astype('float32')
x_test /= np.amax(x_test) #- 0.5
X_test = np.reshape(x_test, (-1,501, 501,1))
Y_test = np_utils.to_categorical(test_months,12)
yield X_test,Y_test #RETURN BATCH DATA

Check for bad data.
Make sure your data is what you think it is -- shuffled, distributed the same as your validation and/or test set, free of misleading/erroneous/contradictory samples. You can probably generate a failproof dataset (e.g. distinguish dark images from light ones, or sharp versus blurry) and prove that everything but the data is OK. If you can't, then look more closely at your code. This, however, sounds like a data problem.
I just fixed a similar problem in a simple 3-layer MLP network for which training loss & accuracy were heading in reasonable directions, validation loss was following training loss (but lagging) yet validation accuracy hovered at zero. There was an off-by-one error in my training dataset generation (a sampling script from a larger set) that meant that 1 sample in the entire block of samples for one type had the label for the next block for a different type. 499 correct samples out of 500 was insufficient to keep the training on track.

Why is TensorFlow's `tf.data` package slowing down my code?

I'm just learning to use TensorFlow's tf.data API, and I've found that it is slowing my code down a lot, measured in time per epoch. This is the opposite of what it's supposed to do, I thought. I wrote a simple linear regression program to test it out.
Tl;Dr: With 100,000 training data, tf.data slows time per epoch down by about a factor of ten, if you're using full batch training. Worse if you use smaller batches. The opposite is true with 500 training data.
My question: What is going on? Is my implementation flawed? Other sources I've read have tf.data improving speeds by about 30%.
import tensorflow as tf
import numpy as np
import timeit
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
tf.logging.set_verbosity(tf.logging.ERROR)
n_epochs = 10
input_dimensions_list = [10]
def function_to_approximate(x):
return np.dot(x, random_covector).astype(np.float32) + np.float32(.01) * np.random.randn(1,1).astype(np.float32)
def regress_without_tfData(n_epochs, input_dimension, training_inputs, training_labels):
tf.reset_default_graph()
weights = tf.get_variable("weights", initializer=np.random.randn(input_dimension, 1).astype(np.float32))
X = tf.placeholder(tf.float32, shape=(None, input_dimension), name='X')
Y = tf.placeholder(tf.float32, shape=(None, 1), name='Y')
prediction = tf.matmul(X,weights)
loss = tf.reduce_mean(tf.square(tf.subtract(prediction, Y)))
loss_op = tf.train.AdamOptimizer(.01).minimize(loss)
init = tf.global_variables_initializer()
with tf.Session() as sess:
sess.run(init)
for _ in range(n_epochs):
sess.run(loss_op, feed_dict={X: training_inputs, Y:training_labels})
def regress_with_tfData(n_epochs, input_dimension, training_inputs, training_labels, batch_size):
tf.reset_default_graph()
weights = tf.get_variable("weights", initializer=np.random.randn(input_dimension, 1).astype(np.float32))
X,Y = data_set.make_one_shot_iterator().get_next()
prediction = tf.matmul(X, weights)
loss = tf.reduce_mean(tf.square(tf.subtract(prediction, Y)))
loss_op = tf.train.AdamOptimizer(.01).minimize(loss)
init = tf.global_variables_initializer()
with tf.Session() as sess:
sess.run(init)
while True:
try:
sess.run(loss_op)
except tf.errors.OutOfRangeError:
break
for input_dimension in input_dimensions_list:
for data_size in [500, 100000]:
training_inputs = np.random.randn(data_size, input_dimension).astype(np.float32)
random_covector = np.random.randint(-5, 5, size=(input_dimension, 1))
training_labels = function_to_approximate(training_inputs)
print("Not using tf.data, with data size "
"{}, input dimension {} and training with "
"a full batch, it took an average of "
"{} seconds to run {} epochs.\n".
format(
data_size,
input_dimension,
timeit.timeit(
lambda: regress_without_tfData(
n_epochs, input_dimension,
training_inputs, training_labels
),
number=3
),
n_epochs))
for input_dimension in input_dimensions_list:
for data_size, batch_size in [(500, 50), (500, 500), (100000, 50), (100000, 100000)]:
training_inputs = np.random.randn(data_size, input_dimension).astype(np.float32)
random_covector = np.random.randint(-5, 5, size=(input_dimension, 1))
training_labels = function_to_approximate(training_inputs)
data_set = tf.data.Dataset.from_tensor_slices((training_inputs, training_labels))
data_set = data_set.repeat(n_epochs)
data_set = data_set.batch(batch_size)
print("Using tf.data, with data size "
"{}, and input dimension {}, and training with "
"batch size {}, it took an average of {} seconds "
"to run {} epochs.\n".
format(
data_size,
input_dimension,
batch_size,
timeit.timeit(
lambda: regress_with_tfData(
n_epochs, input_dimension,
training_inputs, training_labels,
batch_size
),
number=3
)/3,
n_epochs
))
This outputs for me:
Not using tf.data, with data size 500, input dimension 10 and training
with a full batch, it took an average of 0.20243382899980134 seconds
to run 10 epochs.
Not using tf.data, with data size 100000, input dimension 10 and
training with a full batch, it took an average of 0.2431719040000644
seconds to run 10 epochs.
Using tf.data, with data size 500, and input dimension 10, and
training with batch size 50, it took an average of 0.09512088866661846
seconds to run 10 epochs.
Using tf.data, with data size 500, and input dimension 10, and
training with batch size 500, it took an average of
0.07286913600000844 seconds to run 10 epochs.
Using tf.data, with data size 100000, and input dimension 10, and
training with batch size 50, it took an average of 4.421892363666605
seconds to run 10 epochs.
Using tf.data, with data size 100000, and input dimension 10, and
training with batch size 100000, it took an average of
2.2555197536667038 seconds to run 10 epochs.
Edit: Fixed an important issue that Fred Guth pointed out. It didn't much affect the results, though.

I wanted to test the dataset API which seems to be really convenient for processing data. I did a lot of time testing about this API in CPU, GPU and multi-GPU way for small and large NN with different type of data.
First thing, It seems to me that your code is ok. But I need to point that your NN is just one simple layer.
Now, the dataset API is not suitable for your type of NN but for NN with a lot more complexity. Why ? For several reasons that I explain below (founded in my quest of understanding the dataset API).
Firstly, in one hand the dataset API processes data each batch whereas in the other hand data are preprocessed. Therefore, if it fits your RAM, you can save time by preprocessing the data. Here your data are just to "simple". If you want to test what i am saying, try to find a really really big dataset to process. Nevertheless, the dataset API can be tuned with prefetching data. You can take a look to this tutorial that explain really well why it is good to process data with prefetch.
Secondly, in my quest of dataset API for Multi-GPU training, I discovered that as far as i know the old pre-processing way is faster than dataset API for small Neural Network. You can verify that by creating a simple stackable RNN which take a sequence in input. You can try different size of stack (i have tested 1, 2, 10 and 20). You will see that, using the dataset API, on 1-GPU or on 4-GPUs, the time did not differ for small RNN stacks (1, 2 and 5).
To summarize, the dataset API is suitable for Neural Network that have data that can't be pre-process. Depending on your task, it may be more convenient to pre-process data, for example if you want to tweak your NN in order to improve it. I agree that the dataset API is really cool for batch, padding and also convenient for shuffling large amount of data but it's also not suitable for multi-GPU training.

First:
You are recreating the dataset unnecessarily.
data_set = tf.data.Dataset.from_tensor_slices((training_inputs, training_labels))
Create the dataset prior to the loop and change the regress_with_tfData input signature to use dataset instead of training_inputs and training_labels.
Second:
The problem here is that minibatches of size 50 or even 500 are too small to compensate the cost of td.data building latency. You should increase the minibatch size. Interestingly you did so with a minibatch of size 100000, but then maybe it is too big ( I am not certain of this, I think it would need more tests).
There are a couple of things you could try:
1) Increase the minibatch size to something like 10000 and see if you get an improvement
2) Change your pipeline to use an iterator, example:
data_set = tf.data.Dataset.from_tensor_slices((training_inputs, training_labels))
data_set = data_set.repeat(n_epochs)
data_set = data_set.batch(batch_size)
iterator = data_set.make_one_shot_iterator()
....
next_element = iterator.get_next()

That is because you are comparing apples with bananas.
On one hand, when using placeholders, you are providing a monolithic tensor as is. On the other hand, when using Dataset, you are slicing the tensor into individual samples. This is very different.
The equivalent of providing a monolothic placeholder tensor with the Dataset pipeline is by using tf.data.Dataset.from_tensors. When I use from_tensors in your example, I get similar (actually smaller) computation times than with placeholders.
If you want to compare a more sophisticated pipeline using from_tensor_slices, you should use a fair comparison with placeholders. For example, shuffle your data. Add some preprocessing on your slices. I have no doubt you will observe the performance gain that makes people switch to this pipeline.

One possible thing you are missing is a prefetch. Add a prefetch of 1 at the end of your data pipeline like so:
data_set = tf.data.Dataset.from_tensor_slices((training_inputs, training_labels))
data_set = data_set.repeat(n_epochs)
data_set = data_set.batch(batch_size).prefetch(1)
Adding a prefetch of 1 at the end of your dataset pipeline means you try and fetch 1 batch of data while training is happening. This way you wont be waiting around while the batch is prepared, it should be ready to go as soon as each train iteration is done.

The accepted answer doesn't help longer valid, as the TF behavior has changed. Per documentation:
from_tensors produces a dataset containing only a single element. To
slice the input tensor into multiple elements, use from_tensor_slices
instead.
This means you cannot batch it
X = np.arange(10)
data = tf.data.Dataset.from_tensors( X )
data = data.batch(2)
for t in data.as_numpy_iterator():
print(t)
# only one row, whereas expected 5 !!!
The documentation recommends from_tensor_slices. But this has quite some overhead when compared to numpy slicing. Slow slicing is an open issue https://github.com/tensorflow/tensorflow/issues/39750
Essentially, slicing in TF is slow and impacts input-bound or light models such as small networks (regression, word2vec).

Tensorflow: Batching whole dataset (MNIST Tutorial)

Following this tutorial: https://www.tensorflow.org/versions/r1.3/get_started/mnist/pros
I wanted to solve a classification problem with labeled images by myself. Since I'm not using the MNIST database, I spent days creating my own dataset inside tensorflow. It looks like this:
#variables
batch_size = 50
dimension = 784
stages = 10
#step 1 read Dataset
filenames = tf.constant(filenamesList)
labels = tf.constant(labelsList)
#step 2 create Dataset
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
#step 3: parse every image in the dataset using `map`
def _parse_function(filename, label):
#convert label to one-hot encoding
one_hot = tf.one_hot(label, stages)
#read image file
image_string = tf.read_file(filename)
image_decoded = tf.image.decode_image(image_string, channels=3)
image = tf.cast(image_decoded, tf.float32)
return image, one_hot
#step 4 final input tensor
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size) #batch_size = 100
iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()
images = tf.reshape(images, [batch_size,dimension]).eval()
labels = tf.reshape(labels, [batch_size,stages]).eval()
for _ in range(10):
dataset = dataset.shuffle(buffer_size = 100)
dataset = dataset.batch(batch_size)
iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()
images = tf.reshape(images, [batch_size,dimension]).eval()
labels = tf.reshape(labels, [batch_size,stages]).eval()
train_step.run(feed_dict={x: images, y_:labels})
Somehow using a higher batch_sizes will break python. What I'm trying to do is to train my neural network with new batches on each iteration. That's why Im also using dataset.shuffle(...). Using dataset.shuffle also breaks my Python.
What I wanted to do (because shuffle breaks) is to batch the whole dataset. By evaluating ('.eval()') I will get a numpy array. I will then shuffle the array with numpy.random.shuffle(images) and then pick up some the first elements to train it.
e.g.
for _ in range(1000):
images = tf.reshape(images, [batch_size,dimension]).eval()
labels = tf.reshape(labels, [batch_size,stages]).eval()
#shuffle
np.random.shuffle(images)
np.random.shuffle(labels)
train_step.run(feed_dict={x: images[0:train_size], y_:labels[0:train_size]})
But then here comes the problem that I can't batch the my whole dataset. It looks like that the data is too big for python to work with.
How should I solve this differently?
Since I'm not using the MNIST database there isn't a function like mnist.train.next_batch(100) which comes handy for me.

Notice how you call shuffle and batch inside your for loop? This is wrong. Datasets in TF work in the style of functional programming, so you are actually defining a pipeline for preprocessing the data to feed into your model. In a way, you give a recipe that answers the question "given this raw data, which operations (map, etc.) should I do to get batches that I can feed into my neural network?"
Now you are modifying that pipeline for every batch! What happens is that the first iteration, the batch size is, say [32 3600]. The next iteration, the elements of this shape are batched again, to [32 32 3600], and so on.
There's a great tutorial on the TF website where you can find out more how Datasets work, but here are a few suggestions how you can resolve your problem.
Move the shuffling to right after "Step 2" in your code. Then you are shuffling the whole dataset so your batches will have a good mixture of examples. Also increase the buffer_size argument, this works in a different way than you probably assume. It's usually a good idea to shuffle as early as possible, as it can be a slow operation if you have a large dataset -- the shuffled part of dataset will have to be read into memory. Here it does not really matter whether you shuffle the filenames and labels, or the read images and labels -- but the latter will have more work to do since the dataset is larger by that time.
Move batching and the iterator generator to be the last steps, just before starting your training loop.
Don't use feed_dict with Dataset iterators to input data into your model. Instead, define your model in terms of the outputs of iterator.get_next() and omit the feed_dict argument. See more details from this Q&A: Tensorflow: create minibatch from numpy array > 2 GB

Ive been getting through a lot of problems with creating tensorflow datasets. So I decided to use OpenCV to import images.
import opencv as cv
imgDataset = []
for i in range(len(files)):
imgDataset.append(cv2.imread(files[i]))
imgDataset = np.asarray(imgDataset)
the shape of imgDataset is (num_img, height, width, col_channels). Getting the i-th image should be imgDataset[i].
shuffling the dataset and getting only batches of it can be done like this:
from sklearn.utils import shuffle
X,y = shuffle(X, y)
X_feed = X[batch_size]
y_feed = y[batch_size]
Then you feed X_feed and y_feed into your model

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.