Tensorflow dataset iterator: picking a sub-sample of the whole data - python

I have code that generates an iterator from a TensorFlow dataset:
@tf.function
def normalize_image(record):
    out = record.copy()
    out['image'] = tf.cast(out['image'], 'float32') / 255.
    return out

train_it = iter(tfds.builder('mnist').as_dataset(split='train').map(normalize_image).repeat().batch(256*10))
However, I want to do the splitting manually. For example, the MNIST dataset has 60000 training samples, but I want to use only the first 50000 (and hold out the rest for validation). The problem is that I don't know how to do so.
I tried to convert it to NumPy and split based on that, but then I couldn't apply the map to it.
ds_builder = tfds.builder('mnist')
print(dir(ds_builder))
ds_builder.download_and_prepare()
train_ds = tfds.as_numpy(ds_builder.as_dataset(split='train', batch_size=-1))
train_ds['image'] = train_ds['image'][0:50000, : , :]
train_ds['label'] = train_ds['label'][0:50000]
I was wondering how to do so.
P.S.: The ordering of the data also matters to me. I was thinking of loading all the data into NumPy, saving the required samples as PNG files, and loading those with tfds, but I'm not sure whether that would keep the original order. I want to take the first 50000 of the 60000 samples.
Thanks.

train_ds = tfds.builder('mnist').as_dataset(split='train').map(normalize_image)
train_ds = train_ds.take(50000).repeat().batch(256*10)
val_ds = tfds.builder('mnist').as_dataset(split='train').map(normalize_image)
val_ds = val_ds.skip(50000).batch(256*10)
train_it = iter(train_ds)
val_it = iter(val_ds)
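The split can also be expressed directly through the TFDS slicing API, which avoids building the pipeline twice (a minimal sketch, reusing normalize_image from above; as long as no shuffling is applied, both this and the take/skip version split on the dataset's deterministic on-disk order):
train_ds = tfds.builder('mnist').as_dataset(split='train[:50000]').map(normalize_image)
val_ds = tfds.builder('mnist').as_dataset(split='train[50000:]').map(normalize_image)
train_it = iter(train_ds.repeat().batch(256*10))
val_it = iter(val_ds.batch(256*10))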

Related

using tensorflow dataset for custom image oversampling

OK. I want to set up a custom dataset workflow in TensorFlow for image oversampling. I have an unbalanced dataset with many more normal images than fibrosis images.
I start with some variables...
num_fibrosis = len(glob.glob(WORKING_DIR_TS +'NIH_1stPA_Norm_Fib/Fibrosis/*.png'))
num_normal = len(glob.glob(WORKING_DIR_TS +'NIH_1stPA_Norm_Fib/No Finding/*.png'))
perc_for_val =.2
oversample_multiplier = 5
num_fibrosis_in_val = int(num_fibrosis*perc_for_val)
oversample_count = num_fibrosis_in_val*oversample_multiplier
Then I create a dataset based on the folder structure. This dataset contains images and labels.
full_ds = tf.keras.utils.image_dataset_from_directory(
    'folder_path',
    image_size=(SIZE, SIZE),
    batch_size=None,
    # shuffle=False
)
Then I take 20% of the fibrosis images and put them into the validation dataset. I also put an equal number of normal images in val_ds.
fibrosis_ds = full_ds.filter(lambda x, y: tf.equal(y, 0) ) # y == 0 for fibrosis
normal_ds = full_ds.filter(lambda x, y: tf.equal(y, 1) ) # y == 1 for normal
# Let's take 20% of fibrosis images, and an equal number of normals, for our validation dataset
val_ds = fibrosis_ds.take( num_fibrosis_in_val )
val_ds = val_ds.concatenate( normal_ds.take( num_fibrosis_in_val ) )
val_ds = val_ds.batch(BATCH_SIZE)
Lastly, I make the training dataset. I use skip so I don't repeat any of the images used earlier, and repeat to oversample the fibrosis images. I add an equal number of normal images to keep the classes balanced, and at the end I shuffle.
# Make the training set
train_ds = fibrosis_ds.skip(num_fibrosis_in_val).take(num_fibrosis - num_fibrosis_in_val)
train_ds = train_ds.repeat( oversample_multiplier )
train_ds = train_ds.concatenate( normal_ds.skip(num_fibrosis_in_val).take(oversample_count) )
train_ds = train_ds.shuffle(oversample_count*2)
train_ds = train_ds.batch(BATCH_SIZE)
This seems to work, but in Google Colab it almost fills RAM just when I loop over val_ds to confirm the count. Is it holding the entire dataset in memory while applying the chained functions on top? Is there a more reasonable way to do this?
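One way to keep memory down (a sketch that is not from this thread; it assumes the same folder layout and the variables defined above) is to split at the file-path level before any decoding, so that filter/take/skip never pass over decoded images and only the current batch is held in memory:
import glob
import tensorflow as tf

fib_paths = sorted(glob.glob(WORKING_DIR_TS + 'NIH_1stPA_Norm_Fib/Fibrosis/*.png'))
norm_paths = sorted(glob.glob(WORKING_DIR_TS + 'NIH_1stPA_Norm_Fib/No Finding/*.png'))

def load_image(path, label):
    # Decode one PNG at a time; nothing beyond the current batch stays in memory.
    img = tf.io.decode_png(tf.io.read_file(path), channels=3)
    img = tf.image.resize(img, (SIZE, SIZE))
    return img, label

def paths_to_ds(paths, label):
    labels = tf.fill([len(paths)], label)
    return tf.data.Dataset.from_tensor_slices((tf.constant(paths), labels))

# Validation: the first num_fibrosis_in_val fibrosis paths plus an equal number of normals.
val_ds = paths_to_ds(fib_paths[:num_fibrosis_in_val], 0)
val_ds = val_ds.concatenate(paths_to_ds(norm_paths[:num_fibrosis_in_val], 1))
val_ds = val_ds.map(load_image).batch(BATCH_SIZE)
The training set can be built the same way from the remaining paths, with repeat applied to the fibrosis paths for oversampling; shuffling can then be done on the paths, so the shuffle buffer holds strings rather than decoded images.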

Progressive loading of large arbitrary datasets in keras

I'm training my keras dense models on very large datasets.
For practical reasons, I am saving them on disk as separate .txt files. I have 1e4 text files, each containing 1e4 examples.
I would like to find a way to fit my keras model on this dataset as a whole. For now, I am only able to use "model.fit" on individual text files, i.e.:
for k in range(10000):
    X = np.loadtxt('/path/X_'+str(k)+'.txt')
    Y = np.loadtxt('/path/Y_'+str(k)+'.txt')
    mod = model.fit(x=X, y=Y, batch_size=batch_size, epochs=epochs)
This is problematic if, for instance, I want to perform several epochs over the whole dataset.
Ideally, I would like to have a dataloader function that could be used in the following way to feed all the sub-datasets as a single one:
mod = model.fit(dataloader('/path/'), batch_size=batch_size, epochs=epochs)
I think I found what I want, but only for datasets composed of images: tf.keras.preprocessing.image.ImageDataGenerator.flow_from_directory
Is there any tf/keras function doing something similar, but for datasets which are not composed of images?
Thanks!
You can create a generator function and then build a TensorFlow Dataset from it with the from_generator method; see below for a dummy example:
import numpy as np
import tensorflow as tf
from tensorflow.data import Dataset

def mygenerator():
    for k in range(1000):
        x = np.random.normal(size=(1000,))
        y = np.random.randint(low=0, high=5, size=1000)
        yield x, y

mydataset = Dataset.from_generator(
    mygenerator,
    output_signature=(tf.TensorSpec(shape=(1000,), dtype=tf.float32),
                      tf.TensorSpec(shape=(1000,), dtype=tf.int32)))
mytraindata = mydataset.batch(batch_size)
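For the file layout described in the question, the same pattern could read the .txt files lazily (a sketch; feature_dim stands in for the actual number of columns in each X_k.txt, and the label dtype is assumed to be float):
import numpy as np
import tensorflow as tf

def file_generator(path, num_files):
    # Returns a generator that yields one (x, y) example at a time,
    # loading each text file only when it is reached.
    def gen():
        for k in range(num_files):
            X = np.loadtxt(path + 'X_' + str(k) + '.txt')
            Y = np.loadtxt(path + 'Y_' + str(k) + '.txt')
            for x, y in zip(X, Y):
                yield x, y
    return gen

feature_dim = 100  # placeholder for the width of one example
dataset = tf.data.Dataset.from_generator(
    file_generator('/path/', 10000),
    output_signature=(tf.TensorSpec(shape=(feature_dim,), dtype=tf.float32),
                      tf.TensorSpec(shape=(), dtype=tf.float32)))
dataset = dataset.batch(batch_size)
# model.fit(dataset, epochs=epochs)  # batch_size is omitted: the dataset is already batched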

CIFAR10 dataloader sampler split

I am trying to split the training data of CIFAR10 so that the last 5000 samples of the training set are used for validation. My code:
size = len(CIFAR10_training)
dataset_indices = list(range(size))
val_index = int(np.floor(0.9 * size))
train_idx, val_idx = dataset_indices[:val_index], dataset_indices[val_index:]
train_sampler = SubsetRandomSampler(train_idx)
val_sampler = SubsetRandomSampler(val_idx)
train_dataloader = torch.utils.data.DataLoader(CIFAR10_training,
                                               batch_size=config['batch_size'],
                                               shuffle=False, sampler=train_sampler)
valid_dataloader = torch.utils.data.DataLoader(CIFAR10_training,
                                               batch_size=config['batch_size'],
                                               shuffle=False, sampler=val_sampler)
print(len(train_dataloader.dataset), len(valid_dataloader.dataset))
But the last print statement prints 50000 and 10000. Should it not be 45000 and 5000?
When I print train_idx and val_idx they contain the right values ([0:44999], [45000:49999]).
Is there anything wrong with my code?
I cannot replicate your results: when I execute your code, the print statement outputs the same number twice, namely the number of elements in CIFAR10_training. So I guess you made a mistake when copying your code, and valid_dataloader was actually given CIFAR10_test (or something like that) as a parameter. In what follows, I'll assume that this is the case and that, with the code as posted, your print would output (50000, 50000), which is the size of the training part of PyTorch's CIFAR10 dataset.
Then it is completely expected, and no, it should not output (45000, 5000). You are asking for the length of train_dataloader.dataset and valid_dataloader.dataset, i.e. the length of the underlying datasets. For both of your loaders, this dataset is CIFAR10_training. Therefore you get twice the size of this dataset (i.e. 50000).
You cannot use len(train_dataloader) either, because that would yield the number of batches in your dataset (approximately 45000/batch_size).
If you need to know the size of your splits, then you have to compute the length of your samplers:
print(len(train_dataloader.sampler), len(valid_dataloader.sampler))
Besides this, your code is fine; you are correctly splitting your data.
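To make the distinction concrete, a quick sanity check might look like this (a sketch, assuming torchvision's CIFAR10 has been downloaded to ./data):
import numpy as np
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler
from torchvision import datasets, transforms

CIFAR10_training = datasets.CIFAR10('./data', train=True, download=True,
                                    transform=transforms.ToTensor())
indices = list(range(len(CIFAR10_training)))
split = int(np.floor(0.9 * len(CIFAR10_training)))
train_loader = DataLoader(CIFAR10_training, batch_size=64,
                          sampler=SubsetRandomSampler(indices[:split]))
valid_loader = DataLoader(CIFAR10_training, batch_size=64,
                          sampler=SubsetRandomSampler(indices[split:]))

print(len(train_loader.dataset), len(valid_loader.dataset))  # 50000 50000: the underlying dataset
print(len(train_loader.sampler), len(valid_loader.sampler))  # 45000 5000: the actual split sizes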

How to load numpy array in a tensorflow dataset

I'm trying to create a Dataset object in TensorFlow 1.14 (I have some legacy code that I can't change for this specific project) starting from NumPy arrays, but every time I try, everything gets copied into my graph, and as a result the event log file I create is huge (719 MB in this case).
Originally I tried using tf.data.Dataset.from_tensor_slices(), but it didn't work. Then I read that this is a common problem and someone suggested trying generators, so I tried the following code, but again I got a huge event file (719 MB again):
def fetch_batch(x, y, batch):
    i = 0
    while i < batch:
        yield (x[i, :, :, :], y[i])
        i += 1

train, test = tf.keras.datasets.fashion_mnist.load_data()
images, labels = train
images = images / 255

training_dataset = tf.data.Dataset.from_generator(fetch_batch,
    args=[images, np.int32(labels), batch_size], output_types=(tf.float32, tf.int32),
    output_shapes=(tf.TensorShape(features_shape), tf.TensorShape(labels_shape)))

file_writer = tf.summary.FileWriter("/content", graph=tf.get_default_graph())
I know that in this case I could use the tensorflow_datasets API and it would be easier, but this is a more general question about how to create datasets in general, not only for MNIST.
Could you explain what I am doing wrong? Thank you.
I guess it's because you are using args in from_generator. This will surely put the provided args in the graph.
What you could do is define a function that returns a generator that iterates through your set, something like (haven't tested):
def data_generator(images, labels):
    def fetch_examples():
        i = 0
        while True:
            example = (images[i], labels[i])
            i += 1
            i %= len(labels)
            yield example
    return fetch_examples
This would give in your example:
train, test = tf.keras.datasets.fashion_mnist.load_data()
images, labels = train
images = images / 255

training_dataset = tf.data.Dataset.from_generator(
    data_generator(images, labels),
    output_types=(tf.float32, tf.int32),
    output_shapes=(tf.TensorShape(features_shape), tf.TensorShape(labels_shape))).batch(batch_size)

file_writer = tf.summary.FileWriter("/content", graph=tf.get_default_graph())
Note that I changed fetch_batch to fetch_examples since you probably want to batch using the dataset utilities (.batch).
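Another common TF 1.x workaround (a sketch, not part of the original answer) is to define the dataset on placeholders and feed the arrays only when initializing the iterator, so they never become graph constants; the shapes below assume the 28x28 Fashion-MNIST arrays from the question:
images_ph = tf.placeholder(tf.float32, shape=(None, 28, 28))
labels_ph = tf.placeholder(tf.int32, shape=(None,))

dataset = tf.data.Dataset.from_tensor_slices((images_ph, labels_ph)).batch(batch_size)
iterator = dataset.make_initializable_iterator()
next_images, next_labels = iterator.get_next()

with tf.Session() as sess:
    # The arrays are fed once here, at initialization time, instead of being
    # baked into the graph (and hence the event file) as constants.
    sess.run(iterator.initializer,
             feed_dict={images_ph: images, labels_ph: labels})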

Tensorflow: Batching whole dataset (MNIST Tutorial)

Following this tutorial: https://www.tensorflow.org/versions/r1.3/get_started/mnist/pros
I wanted to solve a classification problem with labeled images by myself. Since I'm not using the MNIST database, I spent days creating my own dataset inside tensorflow. It looks like this:
#variables
batch_size = 50
dimension = 784
stages = 10

#step 1 read Dataset
filenames = tf.constant(filenamesList)
labels = tf.constant(labelsList)

#step 2 create Dataset
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))

#step 3: parse every image in the dataset using `map`
def _parse_function(filename, label):
    #convert label to one-hot encoding
    one_hot = tf.one_hot(label, stages)
    #read image file
    image_string = tf.read_file(filename)
    image_decoded = tf.image.decode_image(image_string, channels=3)
    image = tf.cast(image_decoded, tf.float32)
    return image, one_hot

#step 4 final input tensor
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size) #batch_size = 100
iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()
images = tf.reshape(images, [batch_size, dimension]).eval()
labels = tf.reshape(labels, [batch_size, stages]).eval()

for _ in range(10):
    dataset = dataset.shuffle(buffer_size=100)
    dataset = dataset.batch(batch_size)
    iterator = dataset.make_one_shot_iterator()
    images, labels = iterator.get_next()
    images = tf.reshape(images, [batch_size, dimension]).eval()
    labels = tf.reshape(labels, [batch_size, stages]).eval()
    train_step.run(feed_dict={x: images, y_: labels})
Somehow, using a higher batch_size breaks Python. What I'm trying to do is train my neural network with new batches on each iteration; that's why I'm also using dataset.shuffle(...). Using dataset.shuffle also breaks my Python.
What I wanted to do (because shuffle breaks) is batch the whole dataset. By evaluating it ('.eval()') I get a NumPy array. I would then shuffle the array with numpy.random.shuffle(images) and pick the first elements to train on.
e.g.
for _ in range(1000):
    images = tf.reshape(images, [batch_size, dimension]).eval()
    labels = tf.reshape(labels, [batch_size, stages]).eval()
    #shuffle
    np.random.shuffle(images)
    np.random.shuffle(labels)
    train_step.run(feed_dict={x: images[0:train_size], y_: labels[0:train_size]})
But then the problem is that I can't batch my whole dataset; it looks like the data is too big for Python to work with.
How should I solve this differently?
Since I'm not using the MNIST database, there isn't a handy function like mnist.train.next_batch(100) available to me.
Notice how you call shuffle and batch inside your for loop? This is wrong. Datasets in TF work in the style of functional programming, so you are actually defining a pipeline for preprocessing the data to feed into your model. In a way, you give a recipe that answers the question "given this raw data, which operations (map, etc.) should I do to get batches that I can feed into my neural network?"
Now you are modifying that pipeline on every iteration of your loop! What happens is that on the first iteration the batch shape is, say, [32 3600]. On the next iteration, elements of this shape are batched again, to [32 32 3600], and so on.
There's a great tutorial on the TF website where you can find out more about how Datasets work, but here are a few suggestions for how you can resolve your problem.
Move the shuffling to right after "Step 2" in your code. Then you are shuffling the whole dataset, so your batches will have a good mixture of examples. Also increase the buffer_size argument; it works differently than you probably assume. It's usually a good idea to shuffle as early as possible, as it can be a slow operation if you have a large dataset: the shuffled part of the dataset will have to be read into memory. Here it does not really matter whether you shuffle the filenames and labels or the read images and labels, but the latter will have more work to do since the dataset is larger by that point.
Move batching and the iterator generator to be the last steps, just before starting your training loop.
Don't use feed_dict with Dataset iterators to input data into your model. Instead, define your model in terms of the outputs of iterator.get_next() and omit the feed_dict argument. See more details from this Q&A: Tensorflow: create minibatch from numpy array > 2 GB
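Putting these suggestions together, the pipeline from the question might look roughly like this (a sketch in the same TF 1.x style; filenamesList, labelsList, _parse_function, and the model definition are assumed from the question):
filenames = tf.constant(filenamesList)
labels = tf.constant(labelsList)

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(buffer_size=len(filenamesList))  # shuffle early, over filenames only
dataset = dataset.map(_parse_function)
dataset = dataset.repeat()           # re-shuffled on every pass through the data
dataset = dataset.batch(batch_size)  # batch as the very last step

iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()
# Define the model directly on `images` and `labels`, then run the training op
# in a loop without any feed_dict.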
I've been running into a lot of problems creating TensorFlow datasets, so I decided to use OpenCV to import images.
import cv2
import numpy as np

imgDataset = []
for i in range(len(files)):
    imgDataset.append(cv2.imread(files[i]))
imgDataset = np.asarray(imgDataset)
The shape of imgDataset is (num_img, height, width, col_channels). Getting the i-th image is imgDataset[i].
Shuffling the dataset and taking a batch of it can be done like this:
from sklearn.utils import shuffle

X, y = shuffle(X, y)
X_feed = X[:batch_size]
y_feed = y[:batch_size]
Then you feed X_feed and y_feed into your model.
