Tensorflow-IO Dataset input pipeline with very large HDF5 files

Tensorflow-IO Dataset input pipeline with very large HDF5 files - python

I have very big training (30Gb) files.
Since all the data does not fit in my available RAM, I want to read the data by batch.
I saw that there is Tensorflow-io package which implemented a way to read HDF5 into Tensorflow this way thanks to the function tfio.IODataset.from_hdf5()
Then, since tf.keras.model.fit() takes a tf.data.Dataset as input containing both samples and targets, I need to zip my X and Y together and then use .batch and .prefetch to load in memory just the necessary data. For testing I tried to apply this method to smaller samples: training (9Gb), validation (2.5Gb) and testing (1.2Gb) which I know work well because they can fit into memory and I have good results (70% accuracy and <1 loss).
The training files are stored in HDF5 files split into samples (X) and labels (Y) files like so:
X_learn.hdf5
X_val.hdf5
X_test.hdf5
Y_test.hdf5
Y_learn.hdf5
Y_val.hdf5
Here is my code:
BATCH_SIZE = 2048
EPOCHS = 100
# Create an IODataset from a hdf5 file's dataset object
x_val = tfio.IODataset.from_hdf5(path_hdf5_x_val, dataset='/X_val')
y_val = tfio.IODataset.from_hdf5(path_hdf5_y_val, dataset='/Y_val')
x_test = tfio.IODataset.from_hdf5(path_hdf5_x_test, dataset='/X_test')
y_test = tfio.IODataset.from_hdf5(path_hdf5_y_test, dataset='/Y_test')
x_train = tfio.IODataset.from_hdf5(path_hdf5_x_train, dataset='/X_learn')
y_train = tfio.IODataset.from_hdf5(path_hdf5_y_train, dataset='/Y_learn')
# Zip together samples and corresponding labels
train = tf.data.Dataset.zip((x_train,y_train)).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.experimental.AUTOTUNE)
test = tf.data.Dataset.zip((x_test,y_test)).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.experimental.AUTOTUNE)
val = tf.data.Dataset.zip((x_val,y_val)).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.experimental.AUTOTUNE)
# Build the model
model = build_model()
# Compile the model with custom learing rate function for Adam optimizer
model.compile(loss='categorical_crossentropy',
optimizer=Adam(lr=lr_schedule(0)),
metrics=['accuracy'])
# Fit model with class_weights calculated before
model.fit(train,
epochs=EPOCHS,
class_weight=class_weights_train,
validation_data=val,
shuffle=True,
callbacks=callbacks)
This code runs but the loss goes very high (300+) and accuracy drops to 0 (0.30 -> 4*e^-5) right from the beginning... I don't understand what I am doing wrong, am I missing something ?

Providing the solution here (Answer Section), even though it is present in the Comment Section for the benefit of the community.
There was no issue with the code, its actually with the data (not preprocessed properly), hence model not able to learning well, which leads to strange loss and accuracy.

Related

Tensorflow : Manually selecting the batch when training deep neural networks

# x_train.shape[0] = 54000
model.fit(
x_train, y_train,
batch_size = 128,
epochs = 12,
validation_data = (x_val, y_val)
)
When I am using this fit() method to train a neural network:
batch_size = 128 means that I randomly pick 54000 // 128 batches of size 128 in my training dataset every epoch.
Are those batches chosen with replacement? I suspect from the docs they're not but I'd like confirmation.
Can I manually choose my batches? I would like to focus on specific images and not others for a given batch, by choosing them personally instead of letting randomness choose for me.

Are those batches chosen with replacement?
In each individual epoch, no. Of course the entire dataset is used again in the next epoch.
Can I manually choose my batches? I would like to focus on specific images and not others for a given batch, by choosing them personally instead of letting randomness choose for me.
You should create a custom dataset for this, and leave the rest of the training loop (data loader, model etc.) unchanged.
But be aware that the samples in a minibatch are supposed to be random.

Validation accuracy on MNIST differs when training data in one go vs when separated

I have a 27G dataset to analyse, and because of the size of my RAM I can't feed all my data into my Neural Network at once, and I have to import bits of it, learn on them, and then another part, so the process would look some thing like this:
import 10% of data
learn
save model
delete the data on RAM
import the next 10% and so on
To see how this would affect a known dataset, I tested it on MNIST. the following is the process/procedure:
for 35 times:
import 1/5 of the data
learn
delete
import the next 1/5
learn
delete
...
This is the code to import the dataset from tensorflow:
from tensorflow.keras.datasets import mnist
(sep, label), (sep_t, label_t) = mnist.load_data()
Then, the network:
Dense = tf.keras.layers.Dense
fc_model = tf.keras.Sequential(
[
tf.keras.Input(shape=(28,28)),
tf.keras.layers.Flatten(),
Dense(128, activation='relu'),
Dense(32, activation='relu'),
Dense(10, activation='softmax')])
fc_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
Below is the code for partially importing and learning the MNIST data set:
for k in range(35):
for j in range(5):
if i == 0:
history = fc_model.fit(sep[i*12000:(i+1)*12000-1], label[i*12000:(i+1)*12000-1], batch_size=128, validation_data=(sep_t, label_t) ,epochs=1)
fc_model.save('Mytf.h5')
i = i + 1
else:
fc_model = load_model('Mytf.h5')
history = fc_model.fit(sep[i*12000:(i+1)*12000-1], label[i*12000:(i+1)*12000-1], batch_size=128, validation_data=(sep_t, label_t) ,epochs=1)
fc_model.save('Mytf.h5')
valacc.append(history.history['val_accuracy'])
valacc_epc.append(history.history['val_accuracy'])
The following is the code to learn the data in one whole dataset:
history_new = fc_model.fit(sep, label, batch_size=128, validation_data=(sep_t, label_t) ,epochs=35)
and the graph below is the comparison between the two methods in terms of accuracy of the validation data:
even though the difference is like 1% (96(avg)-95(avg)=1%), would this mean that when testing on a different dataset using the same methodology of saving and learning, this would result in reduced accuracy? is it better to do some investment and do it on a cloud computation platform?

For both approaches, the batches are organized differently, so there would have to be some deviations (similar to shuffling the data vs feeding it in a specific order). But we can assume that these differences would not be consistent unless we observe this on a large amount of trials.
In any case, this is the common approach to loading large datasets bit by bit: tf.keras allows you to pass a Python generator for model.fit(x) (in case you want to research tutorials or work with an older API: until recently this was a separate method called model.fit_generator, see API).
All the data generator needs to do is yield a batch of training data and labels (x,y) each time it is called. The API will take care of calling it for you, as long as you pass it with fit. The result is that everything is read batch-by-batch into the RAM. A very basic template for a generator is something like this (source):
def generator(features, labels, batch_size):
# Create empty arrays to contain batch of features and labels#
batch_features = np.zeros((batch_size, 64, 64, 3))
batch_labels = np.zeros((batch_size,1))
while True:
for i in range(batch_size):
# choose random index in features
index= random.choice(len(features),1)
batch_features[i] = some_processing(features[index])
batch_labels[i] = labels[index]
yield batch_features, batch_labels

tensorflow 2.0, model.fit() : Your input ran out of data

I am absolutely new to TensorFlow and Keras, and I am trying to make my way around trying out some code that I am finding online.
In particular I am using the fashion-MNIST - consisting of 60000 examples and test set of 10000 examples. Each of them is a 28x28 grayscale image.
I am following this tutorial "https://towardsdatascience.com/building-your-first-neural-network-in-tensorflow-2-tensorflow-for-hackers-part-i-e1e2f1dfe7a0", and I have no problem until the definition of
history = model.fit(
train_dataset.repeat(),
epochs=10,
steps_per_epoch=500,
validation_data=val_dataset.repeat(),
validation_steps=2)
As long as I understood, I need to use train_dataset.repeat() as input dataset because otherwise I won't have enough training example using those values for the hyperparameters (epochs, steps_per_epochs).
My question is: how can I avoid to have to use .repeat()?
How do I need to change the hyperparameters?
I am coping the code here, for simplicity:
def preprocess(x,y):
x = tf.cast(x,tf.float32) / 255.0
y = tf.cast(y, tf.float32)
return x,y
def create_dataset(xs, ys, n_classes=10):
ys = tf.one_hot(ys, depth=n_classes)
return tf.data.Dataset.from_tensor_slices((xs, ys)).map(preprocess).shuffle(len(ys)).batch(128)
model.compile(optimizer = 'adam', loss =tf.losses.CategoricalCrossentropy(from_logits= True), metrics =['accuracy'])
history1 = model.fit(train_dataset.repeat(),
epochs=10,
steps_per_epoch=500,
validation_data=val_dataset.repeat(),
validation_steps=2)
Thanks!

If you don't want to use .repeat() you need to have your model passing thought your entire data only one time per epoch.
In order to do that you need to calculate how many steps it will take for your model to pass throught the entire dataset, the calcul is easy :
steps_per_epoch = len(train_dataset) // batch_size
So with a train_dataset of 60 000 sample and a batch_size of 128, you need to have 468 steps per epoch.
By setting this parameter like that you make sure that you do not exceed the size of your dataset.

I encountered the same problem and here is what I found.
Documentation of tf.keras.Model.fit: "If x is a tf.data dataset, and 'steps_per_epoch' is None, the epoch will run until the input dataset is exhausted."
In other words, we don't need to specify 'steps_per_epoch' if we use the tf.data.dataset as the training data, and tf will figure out how many steps are there. Meanwhile, tf will automatically repeat the dataset when the next epoch begins, so you can specify any 'epoch'.
When passing an infinitely repeating dataset (e.g. dataset.repeat()), you must specify the steps_per_epoch argument.

Keras Validation Accuracy is Zero but other metrics are normal

I am working on a computer vision problem in keras and I have run into a an interesting problem. My val_acc is 0.0000e+00. This is especially interesting as my other metrics such as loss, acc, and val_loss all are acting normally.
This started happening when I switched from the Sequence data_generator to a custom one that I'm pretty sure is working as intended. My issue is very similar to this one validation accuracy is 0 with Keras fit_generator but no answer was reached in that thread.
I have checked to make sure my activations and loss metrics are appropriate for my particular problem. I am using: loss='categorical_crossentropy' metrics=['accuracy'] and am attempting to predict the month that a certain spectrogram comes from.The validation data is being loaded in the exact same way as the training data so I really can't figure out whats happening also even random guessing should give a 1/12 val_acc right? It can't be zero.
Here is my model architecture:
x = (Convolution2D(32,5,5,activation='relu',input_shape=(501,501,1)))(input_img)
x = (MaxPooling2D(pool_size=(2,2)))(x)
x = (Convolution2D(32,5,5,activation='relu'))(x)
x = (MaxPooling2D(pool_size=(2,2)))(x)
x = (Dropout(0.25))(x)
x = (Flatten())(x)
x = (Dense(128,activation='relu'))(x)
x = (Dropout(0.5))(x)
classify = (Dense(12,activation='softmax', kernel_regularizer=regularizers.l1_l2(l1 = 0.001,l2 = 0.001)))(x)
model = Model(input_img,classify)
model.compile(loss='categorical_crossentropy',optimizer='nadam',metrics=['accuracy'])
and here is my call to fit_generator:
model.fit_generator(generator = pd.data_generator(folder,'train'),
validation_data = pd.data_generator(folder,'test'),
steps_per_epoch=(120),
validation_steps=(24),
nb_epoch=20,
verbose=1,
shuffle=True,
callbacks=[tensorboard_callback,early_stop_callback])
and finally here is the important part of my data generator:
if mode == 'test':
print('test')
while True:
for things in up.unpickle_batch(folder,50,6000,7200): #The last 1200 things in batches of 50
random.shuffle(things)
test_spect = []
test_months = []
for thing in things:
test_spect.append(thing.spect) #GET BATCH DATA
test_months.append(thing.month-1) #this is is here because the months go from 1-12 but should go from 0-11 for to_categorical
x_test = np.asarray(test_spect) #PREPARE BATCH DATA
x_test = x_test.astype('float32')
x_test /= np.amax(x_test) #- 0.5
X_test = np.reshape(x_test, (-1,501, 501,1))
Y_test = np_utils.to_categorical(test_months,12)
yield X_test,Y_test #RETURN BATCH DATA

Check for bad data.
Make sure your data is what you think it is -- shuffled, distributed the same as your validation and/or test set, free of misleading/erroneous/contradictory samples. You can probably generate a failproof dataset (e.g. distinguish dark images from light ones, or sharp versus blurry) and prove that everything but the data is OK. If you can't, then look more closely at your code. This, however, sounds like a data problem.
I just fixed a similar problem in a simple 3-layer MLP network for which training loss & accuracy were heading in reasonable directions, validation loss was following training loss (but lagging) yet validation accuracy hovered at zero. There was an off-by-one error in my training dataset generation (a sampling script from a larger set) that meant that 1 sample in the entire block of samples for one type had the label for the next block for a different type. 499 correct samples out of 500 was insufficient to keep the training on track.

Keras model taking forerver to train with dask dataframe

I m working with large dataset having low memory and I got introduced to Dask dataframe. What I understood from the docs that Dask does not load whole dataset into memory . instead it created multiple threads which will fetch the records from disk on demand basis. So I suppose keras model with having batch size = 500, it should only have 500 records in the memory at the training time. but when I start training. it takes forever.May be I am doing something wrong.please suggest.
shape of training data: 1000000 * 1290
import glob
import dask.dataframe
paths_train = glob.glob(r'x_train_d_final*.csv')
X_train_d = dd.read_csv('.../x_train_d_final0.csv')
Y_train1 = keras.utils.to_categorical(Y_train.iloc[,1], num_classes)
batch_size = 500
num_classes = 2
epochs = 5
model = Sequential()
model.add(Dense(645, activation='sigmoid', input_shape=(1290,),kernel_initializer='glorot_normal'))
#model.add(Dense(20, activation='sigmoid',kernel_initializer='glorot_normal'))
model.add(Dense(num_classes, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
optimizer=Adam(decay=0),
metrics=['accuracy'])
history = model.fit(X_train_d.to_records(), Y_train,
batch_size=batch_size,
epochs=epochs,
verbose=1,
class_weight = {0:1,1:6.5},
shuffle=False)

You should use fit_generator() from Sequential model with generator or with a Sequence instance. Both provide a proper way to load only a portion of data.
Keras docs provide an excellent example:
def generate_arrays_from_file(path):
while 1:
f = open(path)
for line in f:
# create Numpy arrays of input data
# and labels, from each line in the file
x, y = process_line(line)
yield (x, y)
f.close()
model.fit_generator(generate_arrays_from_file('/my_file.txt'),
steps_per_epoch=1000, epochs=10)

Today Keras does not know about Dask dataframes or arrays. I suspect that it is just converting the dask object into the equivalent Pandas or Numpy object instead.
If your Keras model can be trained incrementally then you could solve this problem using dask.delayed and some for loops.
Eventually it would be nice to see the Keras and Dask projects learn more about each other to facilitate these workloads without excess work.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Tensorflow-IO Dataset input pipeline with very large HDF5 files - python

Providing the solution here (Answer Section), even though it is present in the Comment Section for the benefit of the community. There was no issue with the code, its actually with the data (not preprocessed properly), hence model not able to learning well, which leads to strange loss and accuracy.

Related

Tensorflow : Manually selecting the batch when training deep neural networks

Validation accuracy on MNIST differs when training data in one go vs when separated

tensorflow 2.0, model.fit() : Your input ran out of data

Keras Validation Accuracy is Zero but other metrics are normal

Keras model taking forerver to train with dask dataframe

Categories

Resources