Handling larger tensorflow dataset - python

I am relatively new to TensorFlow and have been putting together some model training based on a tutorial I found on the TF website. I have been able to put together something functional that satisfies my preliminary requirements.
I am reading locally a csv file that provides links to images, with the associated labels written on the same csv row. My code roughly looks like this:
def map_func(*row):
    img = process_img(img_filename)   # img_filename is taken from one of the CSV columns in row
    output = read(row)
    return img, output

dataset = tf.data.experimental.CsvDataset(CSV_FILE, default_format, header=True)
dataset = dataset.map(map_func)
dataset = dataset.shuffle(buffer_size=shuffle_buffer_size)
dataset = dataset.batch(NB_IMG)
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

X, y = next(iter(dataset))
X_train, X_test = tf.split(X, split, axis=0)
y_train, y_test = tf.split(y, split, axis=0)

model = create_model()
model.compile(optimizer=OPTIMIZER, loss='mse')
model.fit(x=X_train, y=y_train, epochs=EPOCHS, validation_data=(X_test, y_test))
NB_IMG is the total number of images I have. EPOCHS is arbitrarily fixed to a given value (in general 20 or 40) and split is a ratio applied to NB_IMG.
All my images are stored locally, and with that code my GPU can currently manage roughly up to 50,000 images. Training fails with more images (GPU memory exhausted). I understand this is because I am reading all the data at once, but I am a bit stuck on the next step needed to handle a bigger dataset.
This part below is the one that needs improvement, I guess:
X, y = next(iter(dataset))
Could someone here help me move forward and point me towards some examples or snippets where the model is trained on a bigger dataset? I am a bit lost about the next move and not sure where to focus in the TF documentation. I did not really find a clear example online that suits my needs. How should I loop over the batches? How should the iterator be written?
Thanks!

Well, can you give more details about the two functions process_img and read?
During my experiments, I have noticed that the shuffle function can be slow when you have a lot of data and the buffer size is big. Try commenting out that line and check whether it runs faster. If so, you can use pandas to load your CSV file, shuffle it there, and then use tf.data.Dataset.from_tensor_slices.
Tensorflow has a great tool now to profile models and the dataset pipeline (https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras).
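A minimal sketch of that pandas + from_tensor_slices idea, assuming the CSV has one column with the image filename and one with the label (the column names 'filename' and 'label' are made up for illustration; CSV_FILE is the path from the question):

import pandas as pd
import tensorflow as tf

df = pd.read_csv(CSV_FILE)
df = df.sample(frac=1).reset_index(drop=True)    # shuffle the rows in pandas instead of tf.data

def load_image(filename, label):
    img = tf.io.read_file(filename)
    img = tf.image.decode_jpeg(img, channels=3)
    return img, label

dataset = tf.data.Dataset.from_tensor_slices((df['filename'].values, df['label'].values))
dataset = dataset.map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset = dataset.batch(32)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)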

process_img and read are very simple functions:
def process_img(filename):
    img = tf.io.read_file(filename)
    return tf.image.decode_jpeg(img, channels=3)

def read(row):
    return row[1]
The shuffle part of my code is slow but does not seem to be the cause of the failure; I can remove it and shuffle the data directly in the csv. It seems to fail at the X, y = next(iter(dataset)) line when the dataset is too big.
Thanks for the suggestion to profile the code, I will give it a go. Is there any other possible approach to split and iterate over the dataset?
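One way around that line, sketched here as a rough idea rather than a definitive fix: keep the data as a tf.data.Dataset, split it with take/skip, and pass the datasets straight to model.fit, which then iterates over the batches itself so only one batch at a time is materialized on the GPU. BATCH_SIZE and the 80/20 split below are illustrative values; the other names come from the question.

BATCH_SIZE = 32                      # illustrative batch size
NB_TRAIN = int(NB_IMG * 0.8)         # illustrative 80/20 train/validation split

dataset = tf.data.experimental.CsvDataset(CSV_FILE, default_format, header=True)
dataset = dataset.map(map_func, num_parallel_calls=tf.data.experimental.AUTOTUNE)

train_ds = dataset.take(NB_TRAIN)
train_ds = train_ds.shuffle(shuffle_buffer_size)
train_ds = train_ds.batch(BATCH_SIZE)
train_ds = train_ds.prefetch(tf.data.experimental.AUTOTUNE)

val_ds = dataset.skip(NB_TRAIN)
val_ds = val_ds.batch(BATCH_SIZE)
val_ds = val_ds.prefetch(tf.data.experimental.AUTOTUNE)

model = create_model()
model.compile(optimizer=OPTIMIZER, loss='mse')
model.fit(train_ds, epochs=EPOCHS, validation_data=val_ds)   # Keras pulls the batches one at a time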

Related

Correct way to pass a set of images to a model for training

I'm trying to create a Keras model to train with a group of images, taken from a list of paths.
I know that the method tf.keras.utils.image_dataset_from_directory exists, but it doesn't meet my needs because I want to learn the correct way to handle images and because I need to do regression, not classification.
Every approach I tried failed one way or another, mostly because the type of the x_train variable is wrong.
The most promising function I used to load a single image is:
def encode_image(img_path):
    img = tf.keras.preprocessing.image.load_img(img_path)
    img_array = tf.keras.preprocessing.image.img_to_array(img)
    img_array = tf.expand_dims(img_array, 0)
    return img_array
x_train = df['filename'].apply(lambda i: encode_image(i))
This doesn't work because, when I call the .fit() method this way:
history = model.fit(x_train, y_train, epochs=1)
I receive the following error:
Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray)
This makes me understand that I'm passing the data in the wrong format.
Can someone provide a basic example of creating an (x_train, y_train) pair to feed a model for training using a set of images?
Thank you very much
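One minimal way to build that pair, sketched under the assumption that df has a 'filename' column and a numeric 'target' column, and that all images are resized to the same shape (the column names and image size are illustrative, not from the original post). The key change is stacking all images into a single array instead of keeping a Series of individually expanded arrays:

import numpy as np
import tensorflow as tf

def encode_image(img_path):
    # Fixed target_size so every array has the same shape and they can be stacked.
    img = tf.keras.preprocessing.image.load_img(img_path, target_size=(224, 224))
    return tf.keras.preprocessing.image.img_to_array(img)

# One (num_images, 224, 224, 3) array; no per-image expand_dims needed.
x_train = np.stack([encode_image(p) for p in df['filename']])
y_train = df['target'].to_numpy(dtype='float32')   # regression targets

history = model.fit(x_train, y_train, epochs=1)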

How to form a tuple of numpy array from a directory of Images or a csv file?

I am working on a project involving Convolutional Neural Networks, but I am quite new to Python and the number of different libraries has overwhelmed me.
I stored the training images in the same directory and also created a csv file with the filenames and labels. After using ImageDataGenerator to create variants of the images, the data would be fed to flow_from_dataframe for further processing instead of flow_from_directory, as there are more than two labels to match against the images. I realized I must use the fit() function to calculate the mean and std of the images, so that flow_from_dataframe can use the calculated values for normalization and zero-centering of the images.
My questions are:
How do I form a tuple of numpy arrays from a directory of images or a csv file, just like the x_train from tf.keras.datasets.cifar10.load_data()?
I am also not quite sure what to feed to fit() before using flow_from_dataframe, since I added .png to the ID column of my csv file.
p.s. If you think my plan would not work, please point it out. I would much appreciate it. Thank you.
Below are my code, directory and csv file:
Code1:
def append_ext(fn):
    return fn + ".png"

traindf = pd.read_csv(r'C:\Users\User\Desktop\Python\COVID-19/COVID-19RadiographyDataLabel.csv', dtype=str)
traindf["ID"] = traindf["ID"].apply(append_ext)

testdf = pd.read_csv(r'C:\Users\User\Desktop\Python\COVID-19/SampleDataLabel.csv', dtype=str)
testdf["ID"] = testdf["ID"].apply(append_ext)
Code2:
datagen = ImageDataGenerator(featurewise_center=True,
                             featurewise_std_normalization=True,
                             rescale=1./255.,
                             rotation_range=10,
                             width_shift_range=0.2,
                             height_shift_range=0.2,
                             zoom_range=0.2,
                             horizontal_flip=True,
                             brightness_range=[0.2, 1.2],
                             validation_split=0.25,
                             preprocessing_function=Gray_to_Color)
Code3:
train_iterator = datagen.fit(traindf)   # this fails: ImageDataGenerator.fit() expects a numpy array of sample images, not a DataFrame
[Attachments in the original post: the training image directory (20,165 images in total), the training csv file, the output of Code1, the cifar10.load_data() reference output, and the error produced by Code3.]
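A possible sketch of both steps, assuming the csv has an 'ID' column with filenames and a 'Label' column, and that the images live in a single directory (IMG_DIR, the column names and image size are all assumptions): first load a sample of images into one numpy array for datagen.fit(), then let flow_from_dataframe generate the (x, y) batches.

import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array

IMG_DIR = r'C:\Users\User\Desktop\Python\COVID-19\images'    # hypothetical image directory
IMG_SIZE = (224, 224)                                        # assumed target size

# Build a (N, H, W, 3) numpy array from a sample of the files listed in the csv,
# similar in shape to the x_train returned by cifar10.load_data().
sample_paths = traindf["ID"].sample(500, random_state=0)
x_sample = np.stack([img_to_array(load_img(f"{IMG_DIR}/{p}", target_size=IMG_SIZE))
                     for p in sample_paths])

datagen.fit(x_sample)    # fit() wants an array of images, not the DataFrame itself

train_iterator = datagen.flow_from_dataframe(
    dataframe=traindf,
    directory=IMG_DIR,
    x_col="ID",
    y_col="Label",           # assumed label column name
    subset="training",
    target_size=IMG_SIZE,
    class_mode="categorical",
    batch_size=32)

x_batch, y_batch = next(train_iterator)   # a tuple of numpy arrays: (batch, H, W, 3) and (batch, n_classes)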

Save the dataset obtained from tensorflow.keras.preprocessing.text_dataset_from_directory() to an external file

Can I save the dataset obtained from tensorflow.keras.preprocessing.text_dataset_from_directory() to an external file?
from tensorflow.keras import preprocessing

train_ds = preprocessing.text_dataset_from_directory(
    directory='aclImdb/train',
    validation_split=0.2,
    subset='training',   # we are in the training subset
    shuffle=True,
    seed=689
)
val_ds = preprocessing.text_dataset_from_directory(
    directory='aclImdb/train',
    validation_split=0.2,
    subset='validation',
    shuffle=True,
    seed=689
)
test_ds = preprocessing.text_dataset_from_directory(
    directory='aclImdb/test'
)
I'm reading the documentation but I'm not sure if it's possible.
Answer to @Lescurel's question
I want to do this because I want to avoid doing this preprocessing each time and having to wait while it runs. Furthermore, I want to see whether the saved file takes up less space on my computer.
Actually, I don't care about the format. I thought that if this could be done, there would already be a standard format that everyone uses.
Thank you very much.
Technically that is possible.
But you don't want that, because:
preprocessing.text_dataset_from_directory creates a generator-based dataset that supports:
on-the-fly loading of data,
shuffling after each epoch (for training),
prefetching and other features.
If you just save a shuffled dataset as a file on your computer, you lose those features and have to take care of them yourself again. If the dataset were to become larger than your RAM, you would have to handle that too.
If you still want to do it: you can get batches of data with dataset.take(1) and then either save all the individual strings (using for .. in) or pickle the binary objects... But I repeat myself: you do not want to do that.
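A rough sketch of that (not recommended) approach, dumping the raw strings of a single batch to a text file; the file name and the tab-separated "label<TAB>text" layout are assumptions:

# Write one batch of train_ds to a plain text file, one review per line.
with open("train_batch_dump.txt", "w", encoding="utf-8") as f:
    for texts, labels in train_ds.take(1):                    # one batch of (texts, labels)
        for text, label in zip(texts.numpy(), labels.numpy()):
            f.write(f"{int(label)}\t{text.decode('utf-8')}\n")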
If you want to do the preprocessing up front, use a program that works on your text files and saves them back as text files (e.g. for cleaning etc.), but be aware that you will have to do the same for test and production data later on: everything you remove from the (Keras) pipeline you have to take care of yourself.

Tensorflow Dataset API shuffle hurts performance by 9x

I'm using the Tensorflow Dataset API to take a bunch of filenames; shuffle the filenames; perform a python function to load the image files, preprocess them, and turn them into tensors; and then cache, repeat, and batch them. So far, so good.
When I add a shuffle() to the tensors, performance degrades by a factor of 9x. Likewise, when I do self.dataset.apply(tf.data.experimental.shuffle_and_repeat(16384)).
Why does shuffle hurt performance so badly, and how can I fix it?
Code:
filenames = tf.data.Dataset.list_files(self.FILE_PATTERN).shuffle(buffer_size=16384)
dataset = filenames.map(lambda filename: self.pp(filename),
                        num_parallel_calls=self.N_CPUS)
dataset = dataset.cache("./cachefile")
# The line below (shuffle_and_repeat) made performance very bad (1s/step without, 9s/step with)
# dataset = dataset.apply(tf.data.experimental.shuffle_and_repeat(16384))
# This too:
# dataset = dataset.repeat().shuffle(16384)
# This works fine, but doesn't shuffle:
dataset = dataset.repeat()
dataset = dataset.batch(self.BATCH_SIZE)
dataset = dataset.prefetch(4)
Try changing the prefetch parameter to buffer_size=2:
dataset = dataset.prefetch(2)
prefetch is a performance option: it reads the next elements in the background for the upcoming iterations. If prefetch's buffer_size is large, it keeps many elements ready for future iterations and can slow things down when memory is low.
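For reference, a hedged variant of the question's pipeline with the smaller prefetch buffer applied, and the post-cache shuffle kept to a much smaller buffer so that 16384 decoded tensors are not held in memory at once (the buffer sizes are illustrative, not tuned values):

filenames = tf.data.Dataset.list_files(self.FILE_PATTERN).shuffle(buffer_size=16384)   # shuffling filenames is cheap
dataset = filenames.map(lambda filename: self.pp(filename),
                        num_parallel_calls=self.N_CPUS)
dataset = dataset.cache("./cachefile")
dataset = dataset.shuffle(1024)          # small buffer of decoded tensors after the cache
dataset = dataset.repeat()
dataset = dataset.batch(self.BATCH_SIZE)
dataset = dataset.prefetch(2)            # smaller prefetch buffer as suggested above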

OneClassSVM scikit learn

I have two data sets, training and test. They have labels "1" and "0". I need to evaluate these data sets using the OneClassSVM algorithm with an "rbf" kernel in scikit-learn. I loaded the training data set, but I have no idea how to evaluate it with the test data set. Below is my code:
from sklearn import svm
import numpy as np

input_file_data = "/home/anuradha/TrainData.csv"
dataset = np.loadtxt(input_file_data, delimiter=",")   # load the training csv

X = dataset[:, 0:4]
y = dataset[:, 4]

estimator = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
Can someone please help me solve this problem?
It's as simple as adding the following two lines of code at the end of your script:
estimator.fit(X_train)
y_pred_test = estimator.predict(X_test)
The first line tells the SVM which training data to use and the second one makes predictions on the test set (be sure to load both datasets and change the variable names accordingly).
Here is a complete example of how to use OneClassSVM, and here is the class reference.
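Putting that together, a small end-to-end sketch, under the assumption that the test csv sits next to the training one and has the same column layout (the test file name is hypothetical):

from sklearn import svm
import numpy as np

train = np.loadtxt("/home/anuradha/TrainData.csv", delimiter=",")
test = np.loadtxt("/home/anuradha/TestData.csv", delimiter=",")    # hypothetical test file

X_train, y_train = train[:, 0:4], train[:, 4]
X_test, y_test = test[:, 0:4], test[:, 4]

estimator = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
estimator.fit(X_train)                      # one-class SVM is fit on the training features only
y_pred_test = estimator.predict(X_test)     # returns +1 (inlier) / -1 (outlier)

# The original labels are 1/0, so one option is to map 0 -> -1 before comparing:
accuracy = np.mean(y_pred_test == np.where(y_test == 1, 1, -1))
print(accuracy)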
