How to include model.predict into my input pipeline? - python

The problem:
I'm building a video captioning seq2seq model and I have a problem with the input pipeline:
I'm using a pre-trained InceptionV3 model to preprocess all my data.
However, I have a lot of data: 10,000 videos, each containing several hundred frames.
When I use InceptionV3 to preprocess my data, it returns a very large numpy array.
It's then impossible for me to create a tf.data.Dataset from this numpy array, because its 6 GiB size is way bigger than TensorFlow's 2 GiB tensor size limit.
My pipeline steps:
Step 1:
Extracting videos and storing frames on disk. Creating a tf.data.Dataset from these files.
record_files, lowest_n_frames = build_tfrecord_dataset(videos_name, video_path_zip, record_file_path)
records_dataset = tf.data.TFRecordDataset(record_files)
Step 2:
Taking the frames from the TFRecord files created in step 1 and reformatting them for InceptionV3.
video_dataset = records_dataset.map(lambda tfrecord: decode_tfrecord(tfrecord, n_frames))
video_dataset = video_dataset.map(format_video)
Step 3:
Preprocessing the video frames with InceptionV3. This returns a huge numpy array.
bottlenecks = image_features_extract_model.predict(video_dataset, verbose=1, steps=n_samples)
Step 4 (Error step):
Creating a tf.data.Dataset from bottlenecks. Of course, this works on a small amount of data.
features_dataset = tf.data.Dataset.from_tensor_slices(bottlenecks)
The error is very simple and straightforward:
ValueError: Cannot create a tensor proto whose content is larger than 2GB.
What I would like:
I'm wondering how to make the model.predict operation part of my input pipeline!
I don't want a 6 GiB numpy array sitting in RAM all at once; I want to do the preprocessing "step by step" over my data.

When your data gets big, it's smarter to use tf.data.Dataset.from_generator.
I don't know what your exact code and outputs look like, but something like this should work (note that from_generator expects a callable, not an iterator):
features_dataset = tf.data.Dataset.from_generator(lambda: iter(bottlenecks),
                                                  output_types=...,
                                                  output_shapes=...)
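You can also push the predict call itself inside the generator, so the full 6 GiB array never exists in RAM at once, which is what the question asks for. A minimal sketch assuming TF 2.x eager execution, reusing image_features_extract_model and video_dataset from the question (the batch size of 16 is an arbitrary choice):
import tensorflow as tf

def bottleneck_generator():
    # run InceptionV3 one batch of frames at a time, so only one batch
    # of features is ever held in memory
    for batch in video_dataset.batch(16):
        features = image_features_extract_model.predict_on_batch(batch)
        for feature in features:  # one bottleneck per frame
            yield feature

features_dataset = tf.data.Dataset.from_generator(
    bottleneck_generator,
    output_types=tf.float32)  # set output_shapes too if you know the bottleneck shape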

Related

How to read a large sample of images effectively without overloading RAM?

While training a classification model I pass the input image samples as a NumPy array, but when I try to train on a large dataset I run into a memory error. I currently have 120 GB of memory, and even at this size I run into the error. I've enclosed a code snippet below.
x_train = np.array([np.array(ndimage.imread(image)) for image in image_list])
x_train = x_train.astype(np.float32)
Error traceback:
x_train = x_train.astype(np.float32)
numpy.core._exceptions.MemoryError: Unable to allocate 134. GiB for an array with shape (2512019, 82, 175, 1) and data type float32
How can I fix this issue without increasing RAM size? Is there a better way to read the data, like using a cache or protobuf?
I would load the first half of the dataset, train the model on it, then load the second half and train on that. This does not influence the result.
The easiest way to split your dataset is to simply make a second folder with the same structure containing 50% of the data.
The pseudo-code for that method of training would look like this:
load dataset 1
train the model with dataset 1
load dataset 2 into the same variable as the first, reusing its memory instead of keeping both halves in RAM
train the model with dataset 2
A second option to decrease the memory size of your array is to use np.float16 instead of np.float32, but this results in a less accurate model. The difference is data-dependent, so it could be 1-2% or even 5-10%; the only option that loses no accuracy is the one described above.
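For that second option, the dtype change is a one-line tweak to the loading code from the question:
x_train = np.array([np.array(ndimage.imread(image)) for image in image_list], dtype=np.float16)  # half the memory of float32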
EDIT
I am going to add the actual code.
import cv2  # pip install opencv-python
import os
import numpy as np

part1_dir = "Path_to_your_first_dataset"
part2_dir = "Path_to_your_second_dataset"
part1_of_dataset = os.listdir(part1_dir)
part2_of_dataset = os.listdir(part2_dir)

# first half: load, scale to [0, 1], mean-center, train
x_train = np.array([cv2.imread(os.path.join(part1_dir, image)) for image in part1_of_dataset], dtype=np.float32)
x_train /= 255.0
x_train_m = np.mean(x_train, axis=0)
x_train -= x_train_m
model.fit(x_train, y_train)  # not the full training code, just an example

# second half: reuse the same variable so the first half's memory is freed
x_train = np.array([cv2.imread(os.path.join(part2_dir, image)) for image in part2_of_dataset], dtype=np.float32)
x_train /= 255.0
x_train_m = np.mean(x_train, axis=0)
x_train -= x_train_m
model.fit(x_train, y_train)  # not the full training code, just an example
This question comes up just as I put the first two 32 GB RAM sticks into my PC today, for pretty much the same reason.
At this point it becomes necessary to handle the data differently.
I am not sure what you are using to do the learning, but if it's TensorFlow you can customize your input pipeline.
Anyway, it comes down to correctly analyzing what you want to do with the data and the capabilities of your environment. If the data is ready to train and you just load it from disk, it should not be a problem to load only a portion of it, train on that portion, then move on to the next, and so on.
You can split the data into multiple files or partially load it (there are datatypes/file formats that help with that). You can even optimize this to the point where you read from disk during training and have the next batch ready to go when you need it, as in the sketch below.
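In TensorFlow, a tf.data pipeline gives you exactly that. A minimal sketch, assuming file_paths and labels are Python lists describing your images (the grayscale decode is an assumption matching the (..., 1) shape in the error; tf.data.AUTOTUNE needs TF 2.4+):
import tensorflow as tf

def load_image(path, label):
    img = tf.io.read_file(path)              # read from disk lazily, per sample
    img = tf.io.decode_png(img, channels=1)  # assumption: grayscale PNG files
    return tf.cast(img, tf.float32) / 255.0, label

ds = (tf.data.Dataset.from_tensor_slices((file_paths, labels))
      .map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
      .batch(32)
      .prefetch(tf.data.AUTOTUNE))  # next batch is prepared while the model trains
model.fit(ds, epochs=10)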

Resizing images in data preprocessing for training convolution network

I am trying to load data from JPEG files to train a convolutional network. The images are large, however, with 24 million pixels each, so loading and using them at full resolution is not practical.
To get the images into a more useful format, I am trying to load each image, rescale it, and append it to a list. Once this is done, I can convert the list into a numpy array and feed it into the network for training as usual.
My problem is that my data set is very large and it takes about a second to rescale each image, so resizing every image the way I have currently implemented it is not feasible:
from skimage.measure import block_reduce
import matplotlib.pyplot as plt
import numpy as np

length_training_DF = 30000
trainX = []
for i in range(length_training_DF):
    im = plt.imread(TRAIN_IM_DIR + trainDF.iloc[i]['image_name'] + '.jpg')
    image = block_reduce(im, block_size=(10, 10, 1), func=np.max)  # 10x10 max-pool downscale
    trainX.append(image)
I have also used the following:
from keras.preprocessing import image

for i in range(50):
    img = image.load_img(TRAIN_IM_DIR + trainDF.iloc[i]['image_name'] + '.jpg', target_size=(224, 224))
    trainX.append(img)
Is there any way to load these images more quickly into a format suitable for training a network? I have thought about using a Keras dataset, perhaps via tf.keras.preprocessing.image_dataset_from_directory(), but the directory in which the image data is stored is not formatted into per-class folders as that method requires.
The images are for a binary classification problem.
The usual way is to write a preprocessing script that loads the large images, rescales them, applies other operations if needed, and then saves each class to a separate directory, as required by ImageDataGenerator (see the sketch after the list below).
There are at least three good reasons to do that:
Typically, you will run your training process dozens of times. You don't want to redo the rescaling (or e.g. auto white balance) every time.
ImageDataGenerator provides vital methods for augmenting your training data set.
It's a good generator out of the box, and you likely don't want to load the entire data set into memory.
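A minimal sketch of such a script, assuming trainDF has the 'image_name' column from the question plus a hypothetical 'target' column holding the binary label:
import os
from keras.preprocessing import image
from keras.preprocessing.image import ImageDataGenerator

# one-time preprocessing: resize each image and save it into a per-class folder
for _, row in trainDF.iterrows():
    img = image.load_img(TRAIN_IM_DIR + row['image_name'] + '.jpg', target_size=(224, 224))
    out_dir = os.path.join('train_resized', str(row['target']))
    os.makedirs(out_dir, exist_ok=True)
    img.save(os.path.join(out_dir, row['image_name'] + '.jpg'))

# training then streams the small copies from disk and augments on the fly
gen = ImageDataGenerator(rescale=1. / 255, horizontal_flip=True)
train_it = gen.flow_from_directory('train_resized', target_size=(224, 224), class_mode='binary')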

PyTorch data loading from multiple different-sized datasets

I have multiple datasets, each with a different number of images (and different image dimensions) in it. In the training loop I want to load a batch of images randomly from among all the datasets but so that each batch only contains images from a single dataset. For example, I have datasets A, B, C, D and each has images 01.jpg, 02.jpg, … n.jpg (where n depends on the dataset), and let’s say the batch size is 3. In the first loaded batch, for example, I may get images [B/02.jpg, B/06.jpg, B/12.jpg], in the next batch [D/01.jpg, D/05.jpg, D/12.jpg], etc.
So far I have considered the following:
Use a different DataLoader for each dataset, e.g. dataloaderA, dataloaderB, etc., and then in each training iteration randomly select one of the dataloaders and get a batch from it. However, this requires a for loop, and for a large number of datasets it would be very slow, since it can't be split among workers to run in parallel.
Use a single DataLoader with all of the images from all datasets together but with a custom collate_fn which will create a batch using only images from the same dataset. (I’m not sure how exactly to go about this.)
I have looked at the ConcatDataset class, but from its source code it looks like if I use it and try to get a new batch, the images in it will be mixed up from different datasets, which I don't want.
What would be the best way to do this? Thanks!
You can use ConcatDataset, and provide a batch_sampler to DataLoader.
concat_dataset = ConcatDataset((dataset1, dataset2))
concat_dataset.cumulative_sizes will give you the boundaries between the datasets:
ds_indices = concat_dataset.cumulative_sizes
Now, you can use ds_indices to create a batch sampler. See the source for BatchSampler for reference. Your batch sampler just has to return a list of N random indices that respects the ds_indices boundaries. This guarantees that each batch contains elements from a single dataset. A sketch follows.
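A minimal sketch of such a sampler (the class name and the num_batches parameter are hypothetical, and it assumes every dataset holds at least batch_size images):
import random
from torch.utils.data import ConcatDataset, DataLoader, Sampler

class SingleDatasetBatchSampler(Sampler):
    """Yields index batches drawn entirely from one randomly chosen dataset."""
    def __init__(self, cumulative_sizes, batch_size, num_batches):
        self.boundaries = [0] + list(cumulative_sizes)
        self.batch_size = batch_size
        self.num_batches = num_batches

    def __iter__(self):
        for _ in range(self.num_batches):
            d = random.randrange(len(self.boundaries) - 1)       # pick a dataset
            lo, hi = self.boundaries[d], self.boundaries[d + 1]  # its index range
            yield random.sample(range(lo, hi), self.batch_size)

    def __len__(self):
        return self.num_batches

concat_dataset = ConcatDataset((dataset1, dataset2))
sampler = SingleDatasetBatchSampler(concat_dataset.cumulative_sizes, batch_size=3, num_batches=100)
loader = DataLoader(concat_dataset, batch_sampler=sampler)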

splitting tfrecords dataset with multiple features

I have an image classification task where I've created multiple crops of each image, as well as flipped/flopped versions, to extend my limited dataset. I have written the dataset to a tfrecords file, where each record consists of (simplified here to two crops and only a flipped version):
{
lbl: int,
crop_0: np.ndarray,
crop_1: np.ndarray,
crop_0_flipped: np.ndarray,
crop_1_flipped: np.ndarray
}
Basically 4 images per entry. During training, I'd like to treat each image as separate, i.e. feed each record as 4 images with the same label, shuffled with the rest of the images in the dataset, so that N records become 4N images. During testing (on a separate but similarly structured dataset), I'd like to use only the crop_0 and crop_1 images of each record and average the softmax outputs for classification.
My question is: what is the best and most efficient way of training on such a dataset? I'm willing to change my approach if it will make training more efficient. The simplest thing would seem to be separate tfrecords files for each version (crop & flip/flop) interleaved into one dataset, but I do not want a whole bunch of files to deal with if I can help it.
Writing the dataset to disk with 4N images is an approach you'll come to loathe later (I did it that way originally and loathe that code now). The better way is to keep your original dataset on disk as-is and not write your preprocessing steps to disk; do that kind of preprocessing on the CPU while you train. The TensorFlow Dataset preprocessing pipeline makes this easy and modular, and provides the parallelization you need to take advantage of multiple cores at no extra coding expense.
This is the main guide:
https://www.tensorflow.org/programmers_guide/datasets
Your approach should be to create 2 Dataset objects, one for training and one for testing. The train Dataset pipeline will perform all the data augmentation you mentioned; the test Dataset pipeline, naturally, will not.
One key to understanding this approach is that you will not feed the data to tensorflow using feed_dict; instead, tensorflow will invoke the Dataset pipeline to pull the data it needs for each batch.
To get parallelization you'll use the Dataset.map function to apply a set of transformations, with the num_parallel_calls argument to distribute the operations across multiple cores. If your preprocessing can be done in tensorflow code, great; if not, you'll need tf.py_func to call python preprocessing code.
The guide I linked above describes all of this very well. You will want to use a feedable iterator, described in the section called "Creating an iterator". This allows you to get a string handle for each of the 2 datasets (train and test) and pass that string to tensorflow via feed_dict, indicating which of the two datasets tensorflow should pull samples from.
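As an illustration of the train-side pipeline (in current tf.data style rather than the feedable-iterator style), here is a minimal sketch; decode_record() is a hypothetical parser that extracts the label and the four crop tensors from one record:
import tensorflow as tf

def record_to_examples(record):
    lbl, crop_0, crop_1, crop_0_f, crop_1_f = decode_record(record)  # hypothetical parser
    images = tf.stack([crop_0, crop_1, crop_0_f, crop_1_f])          # shape (4, H, W, C)
    labels = tf.fill([4], lbl)
    return tf.data.Dataset.from_tensor_slices((images, labels))

train_ds = (tf.data.TFRecordDataset(["train.tfrecords"])
            .flat_map(record_to_examples)  # 1 record -> 4 (image, label) examples
            .shuffle(10000)                # mix crops across the whole dataset
            .batch(32)
            .prefetch(1))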

How to normalize data when using Keras fit_generator

I have a very large data set and am using Keras' fit_generator to train a model (TensorFlow backend). My data needs to be normalized across the entire data set, but with fit_generator I only have access to relatively small batches, and normalizing within such a small batch is not representative of normalizing across the whole data set. The impact is quite large (I tested it, and model accuracy is significantly degraded).
My question is this: what is the correct practice for normalizing data across the entire data set when using Keras' fit_generator? One last point: my data is a mix of text and numeric data, not images, so I cannot use some of the capabilities of Keras' image generator, which may address some of these issues for image data.
I have looked at normalizing the full data set prior to training (the "brute-force" approach, I suppose), but I am wondering if there is a more elegant way of doing this.
The generator does allow you to do on-the-fly processing of data, but pre-processing the data prior to training is the preferred approach:
Pre-processing and saving avoids reprocessing the data every epoch; on the fly, you should really only do small operations that can be applied per batch. One-hot encoding, for example, is a common one, while tokenising sentences etc. can be done offline.
You will probably tweak and fine-tune your model. You don't want the overhead of normalising the data each time, and you do want to ensure every model trains on the same normalised data.
So, pre-process once offline prior to training and save it as your training data. When predicting, you can process on-the-fly.
You would do this via pre-processing your data to a matrix. One hot encode your text data:
from keras.preprocessing.text import Tokenizer
# X is a list of text elements
t = Tokenizer()
t.fit_on_texts(X)
X_one_hot = t.texts_to_matrix(X)
and normalize your numeric data via:
# min-max scale each sample in place; the small constant avoids division by zero
for i in range(len(matrix)):
    matrix[i] = (matrix[i] - np.min(matrix[i], 0)) / (np.max(matrix[i], 0) + 0.0001)
If you concatenate your two matrices, you should have properly preprocessed data. I could just imagine that the text will always influence the outcome of your model too much, so it would make sense to train separate models for the text and numeric data.
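To follow the advice above of pre-processing once offline and saving the result, a minimal sketch (the file name is arbitrary, and matrix is assumed to hold the normalised numeric data):
import numpy as np

X_processed = np.concatenate([X_one_hot, matrix], axis=1)  # text + numeric features
np.save('X_train_preprocessed.npy', X_processed)

# later, in the training script
X_train = np.load('X_train_preprocessed.npy')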
