Generating batches of images in dask - python

I just started with dask because it offers great parallel processing power. I have around 40000 images on my disk which I am going to use for building a classifier using some DL library, say Keras or TF. I collected this meta-info (image path and corresponding label) in a pandas dataframe, which looks like this:
    img_path    labels
0   data/1.JPG  1
1   data/2.JPG  1
2   data/3.JPG  5
...
Now here is my simple task: Use dask to read images and corresponding labels in a lazy fashion. Do some processing on images and pass batches to the classifier in a batch size of 32.
Define functions for reading and preprocessing:
import cv2

def read_data(idx):
    # look up the path and label in the metadata dataframe `df`
    img = cv2.imread(df['img_path'].iloc[idx])
    label = df['labels'].iloc[idx]
    return img, label

def img_resize(img):
    return cv2.resize(img, (224, 224))
Get delayed dask arrays:
import numpy as np
import dask as dd
import dask.array  # so that dd.array.from_delayed is available below

data = [dd.delayed(read_data)(idx) for idx in range(len(df))]
images = [d[0] for d in data]
labels = [d[1] for d in data]
resized_images = [dd.delayed(img_resize)(img) for img in images]
resized_images = [dd.array.from_delayed(x, shape=(224, 224, 3), dtype=np.float32) for x in resized_images]
Now here are my questions:
Q1. How do I get a batch of data with batch_size=32 from this array? Is this equivalent to a lazy generator now? If not, can it be made to behave like one?
Q2. How do I choose an effective chunk size for better batch generation? For example, if I have 4 cores and the images have shape (224, 224, 3), how can I make my batch processing efficient?
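For what it's worth, here is a minimal sketch (not from the original post) of one way to assemble the delayed pieces above into batch-sized chunks; `da` is dask.array, and the batch iteration is plain slicing:
import dask.array as da

# Stack the per-image delayed arrays into a single lazy (N, 224, 224, 3) array
stacked = da.stack(resized_images)

# Re-chunk so each block along the first axis holds one batch of 32 images
stacked = stacked.rechunk((32, 224, 224, 3))

def batch_iterator(arr, batch_size=32):
    # Materialise only one batch at a time
    for start in range(0, arr.shape[0], batch_size):
        yield arr[start:start + batch_size].compute()

for batch in batch_iterator(stacked):
    pass  # feed `batch` (a numpy array of up to 32 images) to the classifier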

Related

Python: How to feed large dataset to Keras Model? [duplicate]

Basically I have a training dataset with 100s of thousands of images with labels that can be used to train an ML model.
However (as expected) I can't simply create a numpy array to hold the images as follows:
all_images = np.zeros(shape=(500000, 256, 256, 3), dtype="uint8")
I don't suppose large companies simply have 'huge' RAM to hold huge datasets for training.
So how can I use the entire data set for training without having to hold the entire thing in memory before calling model.fit()?
Here's the entire loading function if needed:
(details about it below)
import cv2
import numpy as np

def load_images(images: list):
    # Create empty np.ndarray to hold n images of size 256 x 256 with 3 channels (RGB)
    resized_images = np.zeros(shape=(len(images), 256, 256, 3), dtype="uint8")
    index = 0
    for image in images:
        print(index)
        # Load image with cv2 (read the single path, not the whole list)
        img = cv2.imread(image)
        # Resize image to 256 width, 256 height
        img = cv2.resize(img, dsize=(256, 256))
        # Add image to ndarray 'resized_images'
        resized_images[index] = img
        index += 1
    return resized_images
The objective of this function is to resize the training images and load them into a single numpy array to be passed to the model in model.fit()
Note: I removed some np.transpose() calls to make the code more legible so this might not work if copied and pasted
So far I've tried saving the model and loading it up to continue the training, without success (the loaded model doesn't retain all its properties). But if this is the best way, feel free to share your method.
Consider using a generator.
First, I would suggest you look at the tf.keras.preprocessing.image.ImageDataGenerator class and its flow_from_directory() method.
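For instance, a minimal sketch of that approach; the directory layout under data/train/ (one sub-folder per class) and the sizes are assumptions, not from the original post:
import tensorflow as tf

train_gen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1. / 255)
train_flow = train_gen.flow_from_directory(
    "data/train/",           # hypothetical path, one sub-folder per label
    target_size=(256, 256),  # images are resized on the fly
    batch_size=32,
    class_mode="categorical",
)
# model.fit(train_flow, epochs=10)  # batches are read from disk lazily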
If you want to preprocess images in some unusual way, I would recommend creating your own generator by inheriting from the tf.keras.utils.Sequence class, like this:
class CustomImageDataGen(tf.keras.utils.Sequence):
This article may help.
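A minimal sketch of such a Sequence subclass (the image_paths/labels lists, batch size and image size here are illustrative assumptions, not from the original post):
import math
import cv2
import numpy as np
import tensorflow as tf

class CustomImageDataGen(tf.keras.utils.Sequence):
    def __init__(self, image_paths, labels, batch_size=32, img_size=(256, 256)):
        self.image_paths = image_paths
        self.labels = labels
        self.batch_size = batch_size
        self.img_size = img_size

    def __len__(self):
        # number of batches per epoch
        return math.ceil(len(self.image_paths) / self.batch_size)

    def __getitem__(self, idx):
        # load and resize only the images belonging to batch `idx`
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        imgs = np.stack([cv2.resize(cv2.imread(p), self.img_size)
                         for p in self.image_paths[sl]])
        return imgs, np.asarray(self.labels[sl])

# model.fit(CustomImageDataGen(paths, labels), epochs=10)  # only one batch in RAM at a time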

How to use multiprocessing Pool when evaluating many images using scikit-learn pipeline?

I used a GridSearchCV pipeline for training several different image classifiers in scikit-learn. In the pipeline I used two stages, scaler and classifier. The training ran successfully, and this is what turned out to be the best hyper-parameter setting:
Pipeline(steps=[('scaler', MinMaxScaler()),
                ('classifier',
                 ExtraTreesClassifier(criterion='log_loss', max_depth=30,
                                      min_samples_leaf=5, min_samples_split=7,
                                      n_estimators=50, random_state=42))],
         verbose=True)
Now I want to use this trained pipeline to test it on a lot of images. Therefore, I'm reading my test images from disk (150x150 px) and storing them in an HDF5 file, where each image is represented as a row vector (150*150 = 22500 px), and all images are stacked upon each other in an np.array:
X_test.shape -> (n_imgs,22500)
Then I'm predicting the labels y_preds with
y_preds = model.predict(X_test)
So far, so good, as long as I'm only predicting some images.
But when n_imgs grows (e.g. 1 million images), it doesn't fit into memory anymore. So I googled around and found some solutions that unfortunately didn't work.
I'm currently trying to use multiprocessing.pool.Pool. Now my problem: I want to call multiprocessing's Pool.map(), like so:
n_cores = 10
with Pool(n_cores) as pool:
    results = pool.map(model.predict, X_test, chunksize=22500)
but suddenly all the workers fail with an error, without further details, no matter what chunksize I use.
So I tried to reshape X_test so that each image is represented blockwise next to each other:
X_reshaped = np.reshape(X_test,(n_imgs,150,150))
now chunksize picks out whole images, but as my model has been trained on 1x22500 arrays, not square ones, I get the error:
ValueError: X_test has 150 features, but MinMaxScaler is expecting 22500 features as input.
I'd need to reshape the images back to 1x22500 before predict runs on the chunks. But I'd need a function with several inputs, which pool.map() doesn't allow (it only takes 1 argument for the given function).
So I followed Jason Brownlee's post: Multiprocessing Pool map() Multiple Arguments
and packed several variables into a tuple, which I then unpacked in a wrapper function, before calling model.predict():
n_imgs = X_test.shape[0]
X_reshaped = np.reshape(X_test, (n_imgs, 150, 150))  # reshape each row to a 150x150px image
input_tuple = (model, X_reshaped)  # pack model and data into a tuple as input for the wrapper

with Pool(n_cores) as pool:
    results = pool.map(predict_wrapper, input_tuple, chunksize=22500)
and the wrapper function:
def predict_wrapper(input_tuple):
    model, X = input_tuple  # unpack the input tuple
    n_imgs = X.shape[0]
    X_mod = np.reshape(X, (n_imgs, 150*150))  # reshape back
    y_preds = model.predict(X_mod)
    return y_preds
But input_tuple doesn't get unpacked correctly in the wrapper function:
Instead of assigning the model to model and X_reshaped to X, it splits my pipeline and assigns the scaler to model and the classifier to X. 🤯 (pool.map() iterates over input_tuple, so each worker call receives a single element of the tuple, and unpacking the pipeline object then yields its two steps.)
So, long story short:
does anybody have a solution how I can use my trained scikit-learn pipeline and do prediction on a plethora of images? I'm not bound to use multiprocessing.pool.Pool, but I didn't find any other solution so far...
Many thanks in advance! 🤝🏼
When you call pool.map() on a numpy array, the array is broken up along its first dimension.
So if you called pool.map(my_func, X_test), this would cause my_func to be called n_imgs times, each time with a 1-dimensional array of size 22500.
You have already mentioned that X_test is too big to fit into memory. It might make sense to have each subprocess read a range of images on its own from the database, process those, and send you back the results, rather than you sending the images to it.
import multiprocessing as mp

def process_image_ranges(image_range):
    start, end = image_range
    # read images start (inclusive) to end (exclusive) and process them
    ...

if __name__ == '__main__':
    image_count = 1_000_000   # or whatever the count is
    image_batching = 1024     # or whatever you want your batch size to be
    image_ranges = [(i, min(i + image_batching, image_count))
                    for i in range(0, image_count, image_batching)]
    with mp.Pool() as pool:
        result = pool.map(process_image_ranges, image_ranges)
Ok, now I finally got this working! Thanks to Frank Yellin's answer here I realized that my problem seemed to be the chunksize I explicitly passed. I thought that by doing so I could force pool.map() to take a certain number of images per chunk, but it behaved differently and complained about the wrong dimensions of the given chunks.
But inspired by Frank's answer I rather defined the chunks before the call to pool.map() and then passed the chunks to it. Now the images are passed chunkwise to the single workers.
Seems I could not see the forest for the trees...
So in the end it looks like this:
from multiprocessing import Pool
import h5py
import joblib
import numpy as np

def main_prediction_batch():
    # --- load model ---
    model_URL = "<path to model.pkl>"
    with open(model_URL, 'rb') as model_file:
        model = joblib.load(model_file)
    # --- load image and label file ---
    hdf5_file_URL = "<path to hdf5 file with images and labels.hdf5>"
    with h5py.File(hdf5_file_URL, mode='r') as hdf5_file:
        X_test = hdf5_file["Images"][:]           # 👉️ (n_imgs, 150*150)
        y_test = hdf5_file["Labels"].asstr()[:]   # 👉️ (n_imgs,)

    n_imgs = X_test.shape[0]
    n_cores = 10
    image_batching = 10000  # or whatever you want your batch size to be
                            # doesn't have to be a multiple of n_cores!
    chunk_ranges = [(i, min(i + image_batching, n_imgs))
                    for i in range(0, n_imgs, image_batching)]
    # define chunks of several images
    chunks = [
        X_test[chunk_ranges[i][0]:chunk_ranges[i][1], :]
        for i in range(len(chunk_ranges))
    ]
    with Pool(n_cores) as pool:
        results = pool.map(model.predict, chunks)
    # stack the predictions to get a final row vector
    y_preds = np.hstack(results)  # can now be compared with y_test
    return y_preds

# --------------------------
# MAIN
# --------------------------
if __name__ == '__main__':
    y_preds = main_prediction_batch()
When I now look at it... it was complicated to describe but the final solution was quite simple... thanks a lot for enlightening me!

How to include files with tf.data.Dataset

I am training a face-recognition model, so for the triplet loss I have to generate batches such that each batch contains a fixed number of images from each label. For example, take 8 images from 3 random labels each time a batch is generated for training, as suggested in this GitHub issue.
In my dataset folder I have sub-folders, each named after a label and containing the images for that label.
In the given issue, the following solution is presented:
import os
import numpy as np
import cv2
import tensorflow as tf

num_labels = len(path_list)
num_classes_per_batch = 3
num_images_per_class = 8

image_dirs = ["/content/drive/My Drive/smalld_processed/train/{:d}".format(i)
              for i in range(num_labels)]

## Create the list of datasets creating filenames
#datasets = [tf.data.Dataset.list_files(f"{image_dir}/*.jpg" for image_dir in image_dirs)]
datasets = [tf.data.Dataset.list_files(f"{image_dir}/*.jpg") for image_dir in image_dirs]
adk = ["{}/*.jpg".format(image_dir) for image_dir in image_dirs]
print(adk)

def generator():
    while True:
        # Sample the labels that will compose the batch
        labels = np.random.choice(range(num_labels),
                                  num_classes_per_batch,
                                  replace=False)
        for label in labels:
            for _ in range(num_images_per_class):
                yield label

choice_dataset = tf.data.Dataset.from_generator(generator, tf.int64)
dataset = tf.data.experimental.choose_from_datasets(datasets, choice_dataset)

## Now you read the image content
def load_image(filename):
    image = cv2.imread(filename, 1)
    image = dataset.map(image, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    image = image[..., ::-1]
    label = int(os.path.split(os.path.dirname(filename))[1])
    image = dataset1.append()
    label = dataset2.append
    return image, label

dataset = dataset.map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)

batch_size = num_classes_per_batch * num_images_per_class
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(None)
With this I am not able to load the images, and it shows me this error:
SystemError: <built-in function imread> returned NULL without setting an error
Could you help me fix the error, or suggest any other way to load the images?
Thanks in advance!!
I think that in this case your cv2.imread is acting up. I would first build a simple program that does not do the reading "on the fly", but instead pre-loads images to train on a small dataset.
It also feels like you are misusing the dataset.map function. I would recommend this tutorial on the tf.data.Dataset function: http://tensorexamples.com/2020/07/27/Using-the-tf.data.Dataset.html, and maybe this one on augmentation so you can see how you should use the map function properly: http://tensorexamples.com/2020/07/28/Augmentation.html.
Good luck!
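For reference, a hedged sketch of what a TF-native load_image for such a pipeline could look like; this is not the original poster's code, and it assumes the parent directory name of each file is the integer label and that dataset yields filename strings:
import os
import tensorflow as tf

def load_image(filename):
    # read and decode with TF ops, so this works on the graph tensors
    # that dataset.map passes in (cv2.imread cannot handle a tf.Tensor)
    image = tf.io.read_file(filename)
    image = tf.image.decode_jpeg(image, channels=3)
    # the parent directory name is assumed to be the integer label
    label = tf.strings.to_number(tf.strings.split(filename, os.sep)[-2],
                                 out_type=tf.int64)
    return image, label

dataset = dataset.map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)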

How to limit RAM usage while batch training in tensorflow?

I am training a deep neural network with a large image dataset in mini-batches of size 40. My dataset is in .mat format (which I can easily change to any other format, e.g. .npy, if necessary) and is loaded as a 4-D numpy array before training. My problem is that while training, CPU RAM (not GPU RAM) is exhausted very quickly and the process starts using almost half of my swap memory.
My training code has the following pattern:
batch_size = 40
...
with h5py.File('traindata.mat', 'r') as _data:
    train_imgs = np.array(_data['train_imgs'])
    # I can replace above with below loading, if necessary
    # train_imgs = np.load('traindata.npy')
...
shape_4d = train_imgs.shape
for epoch_i in range(max_epochs):
    for iter in range(shape_4d[0] // batch_size):
        y_ = train_imgs[iter*batch_size:(iter+1)*batch_size]
        ...
...
It seems like the initial loading of the full training data is itself becoming the bottleneck (taking over 12 GB of CPU RAM before I abort).
What is the most efficient way to tackle this bottleneck?
Thanks in advance.
Loading a big dataset into memory is not a good idea. I suggest you use something different for loading the dataset; take a look at the Dataset API in TensorFlow: https://www.tensorflow.org/programmers_guide/datasets
You might need to convert your data into another format, but if you have a CSV or TXT file with one example per line you can use TextLineDataset and feed the model with it:
filenames = ["/var/data/file1.txt", "/var/data/file2.txt"]
dataset = tf.data.TextLineDataset(filenames)
def _parse_py_fun(text_line):
... your custom code here, return np arrays
def _map_fun(text_line):
result = tf.py_func(_parse_py_fun, [text_line], [tf.uint8])
... other tensorlow code here
return result
dataset = dataset.map(_map_fun)
dataset = dataset.batch(4)
iterator = dataset.make_one_shot_iterator()
input_data_of_your_model = iterator.get_next()
output = build_model_fn(input_data_of_your_model)
sess.run([output]) # the input was assigned directly when creating the model
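Alternatively, staying with the HDF5/.mat file from the question, here is a hedged sketch (not part of the original answer) that reads batches lazily through Dataset.from_generator, so only one batch sits in RAM at a time; the float32 output type is an assumption:
import h5py
import tensorflow as tf

batch_size = 40

def batch_generator():
    # h5py datasets are sliced lazily, so the full array is never loaded at once
    with h5py.File('traindata.mat', 'r') as f:
        imgs = f['train_imgs']
        for start in range(0, imgs.shape[0], batch_size):
            yield imgs[start:start + batch_size]

dataset = tf.data.Dataset.from_generator(batch_generator, tf.float32)  # dtype is an assumption
iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()  # feed this tensor into the model graph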

Tensorflow: Batching whole dataset (MNIST Tutorial)

Following this tutorial: https://www.tensorflow.org/versions/r1.3/get_started/mnist/pros
I wanted to solve a classification problem with labeled images by myself. Since I'm not using the MNIST database, I spent days creating my own dataset inside tensorflow. It looks like this:
#variables
batch_size = 50
dimension = 784
stages = 10

#step 1 read Dataset
filenames = tf.constant(filenamesList)
labels = tf.constant(labelsList)

#step 2 create Dataset
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))

#step 3: parse every image in the dataset using `map`
def _parse_function(filename, label):
    #convert label to one-hot encoding
    one_hot = tf.one_hot(label, stages)
    #read image file
    image_string = tf.read_file(filename)
    image_decoded = tf.image.decode_image(image_string, channels=3)
    image = tf.cast(image_decoded, tf.float32)
    return image, one_hot

#step 4 final input tensor
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size)  #batch_size = 100
iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()
images = tf.reshape(images, [batch_size, dimension]).eval()
labels = tf.reshape(labels, [batch_size, stages]).eval()

for _ in range(10):
    dataset = dataset.shuffle(buffer_size=100)
    dataset = dataset.batch(batch_size)
    iterator = dataset.make_one_shot_iterator()
    images, labels = iterator.get_next()
    images = tf.reshape(images, [batch_size, dimension]).eval()
    labels = tf.reshape(labels, [batch_size, stages]).eval()
    train_step.run(feed_dict={x: images, y_: labels})
Somehow using higher batch sizes breaks Python. What I'm trying to do is train my neural network with new batches on each iteration; that's why I'm also using dataset.shuffle(...). Using dataset.shuffle also breaks my Python.
What I wanted to do (because shuffle breaks) is to batch the whole dataset. By evaluating ('.eval()') I would get a numpy array. I would then shuffle the array with numpy.random.shuffle(images) and pick the first few elements to train on.
e.g.
for _ in range(1000):
    images = tf.reshape(images, [batch_size, dimension]).eval()
    labels = tf.reshape(labels, [batch_size, stages]).eval()
    #shuffle
    np.random.shuffle(images)
    np.random.shuffle(labels)
    train_step.run(feed_dict={x: images[0:train_size], y_: labels[0:train_size]})
But then comes the problem that I can't batch my whole dataset. It looks like the data is too big for Python to work with.
How should I solve this differently?
Since I'm not using the MNIST database, there isn't a function like mnist.train.next_batch(100) that comes in handy for me.
Notice how you call shuffle and batch inside your for loop? This is wrong. Datasets in TF work in the style of functional programming, so you are actually defining a pipeline for preprocessing the data to feed into your model. In a way, you give a recipe that answers the question "given this raw data, which operations (map, etc.) should I do to get batches that I can feed into my neural network?"
Now you are modifying that pipeline for every batch! What happens is that in the first iteration the batch shape is, say, [32, 3600]. In the next iteration the elements of this shape are batched again, to [32, 32, 3600], and so on.
There's a great tutorial on the TF website where you can find out more about how Datasets work, but here are a few suggestions for how you can resolve your problem.
Move the shuffling to right after "Step 2" in your code. Then you are shuffling the whole dataset, so your batches will have a good mixture of examples. Also increase the buffer_size argument; it works differently than you probably assume. It's usually a good idea to shuffle as early as possible, as it can be a slow operation if you have a large dataset -- the shuffled part of the dataset will have to be read into memory. Here it does not really matter whether you shuffle the filenames and labels, or the read images and labels -- but the latter will have more work to do since the dataset is larger by that time.
Move batching and the iterator generator to be the last steps, just before starting your training loop.
Don't use feed_dict with Dataset iterators to input data into your model. Instead, define your model in terms of the outputs of iterator.get_next() and omit the feed_dict argument. See more details from this Q&A: Tensorflow: create minibatch from numpy array > 2 GB
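Put together, a hedged sketch of the reordered pipeline those suggestions describe (TF 1.x style to match the question; names are reused from the question's code):
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(buffer_size=len(filenamesList))  # shuffle once, early, over everything
dataset = dataset.map(_parse_function)
dataset = dataset.repeat()                                 # keep producing batches across epochs
dataset = dataset.batch(batch_size)                        # batching is the last step
iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()
# build the model directly on `images`/`labels` and drop feed_dict entirely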
I've been running into a lot of problems with creating TensorFlow datasets, so I decided to use OpenCV to import images.
import cv2
import numpy as np

imgDataset = []
for i in range(len(files)):
    imgDataset.append(cv2.imread(files[i]))
imgDataset = np.asarray(imgDataset)
The shape of imgDataset is (num_img, height, width, col_channels), so getting the i-th image is imgDataset[i].
Shuffling the dataset and taking a batch of it can be done like this:
from sklearn.utils import shuffle

X, y = shuffle(X, y)
X_feed = X[:batch_size]  # take the first batch_size examples, not the element at index batch_size
y_feed = y[:batch_size]
Then you feed X_feed and y_feed into your model
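And to walk through the whole dataset in successive mini-batches rather than only the first slice, a small sketch along the same lines (num_epochs and the commented-out training call are placeholders):
from sklearn.utils import shuffle

for epoch in range(num_epochs):                # num_epochs is a placeholder
    X, y = shuffle(X, y)                       # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        X_feed = X[start:start + batch_size]
        y_feed = y[start:start + batch_size]
        # train_step.run(feed_dict={x: X_feed, y_: y_feed})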
