Batch processing: read image files, then write multidimensional numpy array to HDFS - python

I am trying to iteratively load a batch of images from a folder, process them, then store the results of the batch to an HDF5 file. What is the best practice for batch-reading images/files, and for batch-storing a resulting multidimensional array?
First Part
I start with a csv list of file names:
from itertools import permutations

file_list = [''.join(x) + '.png' for x in permutations('abcde')][:100]
Say for example I want to process 5 images at a time.
I currently grab 5 file names from the list, create an empty array to hold 5 images, then read each image one at a time to yield a batch.
import numpy as np

def load_images(file_list):
    for i in range(0, 100, 5):
        files_list = file_list[i:i + 5]
        image_list = np.zeros(shape=(5, 50, 50, 3))
        for idx, file in enumerate(files_list):
            loaded_img = np.random.random((50, 50, 3))  # misc.imread(file)
            image_list[idx] = loaded_img
        yield image_list, files_list
Question 1: Is there a way to eliminate the second for loop? Can I batch read in the images, or is the method above (one at a time) best practice?
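For reference, one way to drop the explicit pre-allocation (a sketch only; each image is still decoded one file at a time, since separate image files cannot really be read "as a batch") is to build each batch with a list comprehension and np.stack:

import numpy as np

def load_images(file_list, batch_size=5):
    for i in range(0, len(file_list), batch_size):
        batch_files = file_list[i:i + batch_size]
        # every file is still read individually; np.stack merely assembles the batch array
        batch = np.stack([np.random.random((50, 50, 3)) for f in batch_files])  # misc.imread(f)
        yield batch, batch_files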
Second Part:
After loading the images I do some processing on them, which results in an array of a different size.
def process_images(image_batch):
    result = image_batch[:, 5, 4, 3]  # a novel down-sampling algorithm
    return result
Now, I want to store the batch of images with their original file names.
import pandas as pd

def store_images(data, file_names):
    with pd.HDFStore('output.h5') as hdf:
        pass
Question 2: What is the best way to store a batch of multidimensional numpy arrays, while still referencing them with a key (such as the original file name)?
I would like to explore using .h5 files, so if anyone knows how to batch-process data into an .h5 file and has advice on this, it would be most appreciated. Alternatively, I think there is a way to save the numpy arrays as .npy files to a folder, but I was having trouble with this and still wouldn't know how to do it other than one sample at a time (versus one batch at a time).
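One possible answer to Question 2 (a sketch using h5py rather than pandas; the file name output.h5 and the calls to load_images/process_images are taken from the snippets above) is to write one dataset per original file name, so every processed array can later be looked up by its key:

import h5py

def store_images(data, file_names, h5_path='output.h5'):
    # data: processed batch (first axis = sample); file_names: matching keys
    with h5py.File(h5_path, 'a') as hdf:
        for arr, name in zip(data, file_names):
            hdf.create_dataset(name, data=arr, compression='gzip')

# stream every batch straight to disk
for image_batch, names in load_images(file_list):
    store_images(process_images(image_batch), names)

# later, retrieve a result by its original file name
with h5py.File('output.h5', 'r') as hdf:
    sample = hdf['abcde.png'][...]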

Related

Fast loading multiple .npy files into data generator

I have thousands of .npy files stored on my hard disk, each containing a single matrix with dimensions [128, T], where T is variable (on average T=800). Each .npy file is around 2 MB, depending on the matrix shape.
These matrices are then passed to a generator, which yields batches of 32 to a neural network. The Python code used to pass the matrices into the generator is:
def load_batch(path_list):
    np_list = []
    for path in path_list:
        np_list.append(np.load(path))
    return np_list
which, given a list of paths of the .npy files, returns a list of the corresponding NumPy matrices.
This code takes, on average, 0.6s to return a list of 32 matrices. I am using append because this is usually a quick operation.
I am aware that the speed of the hard disk buffer does have an influence on timings but, right now, I really would like to shrink the amount of time required as much as possible by just modifying the code in a smart way.
As an alternative, I tried implementing multi-processing:
from multiprocessing import Pool

def reader(filename):
    return np.load(filename)

def load_multiprocess(path_list, n_cores=5):
    pool = Pool(n_cores)
    np_list = pool.map(reader, path_list)
    return np_list
However, the performance is much worse. I had a look around stackoverflow, and I got the idea that my specific application could not benefit from multiprocessing.
To summarize, I am looking for any kind of advice for one of these two tasks:
Improving the speed of the first code (even 0.1s less would mean a lot).
Using multiprocessing in the right way, if possible (one idea is sketched below).
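One thing that might be worth trying for the second task (a sketch, not from the original thread): a thread pool instead of a process pool. np.load mostly waits on disk I/O, and threads avoid pickling every loaded array back to the parent process, which is usually what makes a multiprocessing Pool slower for this kind of workload:

from multiprocessing.pool import ThreadPool

import numpy as np

def load_batch_threaded(path_list, n_workers=4):
    # threads share memory, so the loaded arrays are not serialized back to the parent
    with ThreadPool(n_workers) as pool:
        return pool.map(np.load, path_list)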
SOLUTION AND BENCHMARK
Of the three methods proposed here, user7138814's solution seems to improve execution speed the most. However, things seem to change when the data is loaded while training a neural network: even though memory mapping is by itself still the quickest way to load the data, the overall training time increases, and I have no idea where or why, since the timings of the mapped loads are always better.
Below, I benchmark the three methods. First, define the methods:
import numpy as np

# my initial method
def load_batch(path_list):
    np_list = []
    for path in path_list:
        np_list.append(np.load(path))
    return np_list

# Aaj Kaal's method
def load_batch1(path_list):
    return [np.load(path) for path in path_list]

# user7138814's method
def load_batch2(path_list):
    np_list = []
    for path in path_list:
        np_list.append(np.load(path, mmap_mode='r'))
    return np_list
I defined a list of paths as follows:
batches_list = []
batch_size = 32
for n in range(0, 150):
    batches_list.append(X_path_list[n*batch_size:n*batch_size+batch_size])
The list contains 150 batches of 32 paths each, which should be enough to calculate a meaningful mean.
Then each method is executed, passing it exactly the same data.
import time

# my initial method
timing0 = []
for l in batches_list:
    start = time.time()
    load_batch(l)
    end = time.time()
    timing0.append(end - start)
print(np.mean(timing0))

# Aaj Kaal's method
timing1 = []
for l in batches_list:
    start = time.time()
    load_batch1(l)
    end = time.time()
    timing1.append(end - start)
print(np.mean(timing1))

# user7138814's method
timing2 = []
for l in batches_list:
    start = time.time()
    load_batch2(l)
    end = time.time()
    timing2.append(end - start)
print(np.mean(timing2))
Output (mean timing in seconds over 150 executions):
0.022530150413513184
0.022546884218851725
0.009580903053283692
Results seem to be consistent when changing the length of batches_list and the batch_size.
Maybe memory mapping the files will be beneficial due to lazy loading. If you use, for example,
np.load(filename, mmap_mode='r')
the creation of the numpy array becomes almost a no-op, but later in the pipeline you pay the price. This could provide a speedup if it results in processing the data in parallel with reading from disk.
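A minimal sketch of what that trade-off looks like (the file name is hypothetical): the np.load call returns almost immediately, and the real disk read only happens when a slice is materialized:

import numpy as np

arr = np.load('features_000.npy', mmap_mode='r')  # near-instant: only the header is parsed
window = np.asarray(arr[:, :256])                 # the actual bytes are read from disk here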
Did you try using a list comprehension? Replace
def load_batch(path_list):
    np_list = []
    for path in path_list:
        np_list.append(np.load(path))
    return np_list
with
def load_batch(path_list):
    return [np.load(path) for path in path_list]
In fact, you can get rid of the function and use the list comprehension directly. If a callable is required, use a lambda.
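For example, a sketch of that lambda form (path_list is your list of .npy paths, as above):

import numpy as np

load_batch = lambda path_list: [np.load(path) for path in path_list]
np_list = load_batch(path_list)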

H5Py and storage

I am writing some code which needs to save a very large numpy array to memory. The numpy array is so large in fact that I cannot load it all into memory at once. But I can calculate the array in chunks. I.e. my code looks something like:
for i in np.arange(numberOfChunks):
    myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = #... do some calculation
As I can't load myArray into memory all at once, I want to save it to a file one "chunk" at a time. i.e. I want to do something like this:
for i in np.arange(numberOfChunks):
    myArrayChunk = #... do some calculation to obtain chunk
    saveToFile(myArrayChunk, indicesInFile=[(i*chunkSize):(i*(chunkSize+1)),:,:], filename)
I understand this can be done with h5py but I am a little confused how to do this. My current understanding is that I can do this:
import h5py

# Make the file
h5py_file = h5py.File(filename, "a")

# Tell it we are going to store a dataset
myArray = h5py_file.create_dataset("myArray", myArrayDimensions, compression="gzip")

for i in np.arange(numberOfChunks):
    myArrayChunk = #... do some calculation to obtain chunk
    myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = myArrayChunk
But this is where I become a little confused. I have read that if you index an h5py dataset the way I did when I wrote myArray[(i*chunkSize):(i*(chunkSize+1)),:,:], then that part of myArray is read into memory. So surely, by the end of my loop above, don't I still have the whole of myArray in memory? How has this saved memory?
Similarly, later on, I would like to read in my file back in one chunk at a time, doing further calculation. i.e. I would like to do something like:
import h5py

# Read in the file
h5py_file = h5py.File(filename, "a")

# Read in myArray
myArray = h5py_file['myArray']

for i in np.arange(numberOfChunks):
    # Read in chunk
    myArrayChunk = myArray[(i*chunkSize):(i*(chunkSize+1)),:,:]
    # ... Do some calculation on myArrayChunk
But by the end of this loop, is the whole of myArray now in memory? I am a little confused about when myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] is in memory and when it isn't. Could someone please explain this?
You have the basic idea. Take care when saying "save to memory". NumPy arrays are saved in memory (RAM). HDF5 data is saved on disk (not to memory/RAM!), then accessed (memory used depends on how you access). In the first step you are creating and writing data in chunks to the disk. In the second step you are accessing data from disk in chunks. Working example provided at the end.
When reading data with h5py, there are 2 ways to read the data:
This returns a NumPy array:
myArrayNP = myArray[:,:,:]
This returns a h5py dataset object that operates like a NumPy array:
myArrayDS = myArray
The difference: h5py dataset objects are not read into memory all at once. You can then slice them as needed. Continuing from above, this is a valid operation to get a subset of the data:
myArrayChunkNP = myArrayDS[(i*chunkSize):((i+1)*chunkSize),:,:]
My example also corrects 1 small error in your chunksize increment equation.
You had:
myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = myArrayChunk
You want:
myArray[(i*chunkSize):((i+1)*chunkSize),:,:] = myArrayChunk
Working Example (writes and reads):
import h5py
import numpy as np

# Make the file
with h5py.File("SO_61173314.h5", "w") as h5w:
    numberOfChunks = 3
    chunkSize = 4
    print('WRITING %d chunks w/ chunkSize=%d' % (numberOfChunks, chunkSize))
    # Write dataset to disk
    h5Array = h5w.create_dataset("myArray", (numberOfChunks*chunkSize, 2, 2), compression="gzip")
    for i in range(numberOfChunks):
        h5ArrayChunk = np.random.random(chunkSize*2*2).reshape(chunkSize, 2, 2)
        print(h5ArrayChunk)
        h5Array[(i*chunkSize):((i+1)*chunkSize),:,:] = h5ArrayChunk

with h5py.File("SO_61173314.h5", "r") as h5r:
    print('\nREADING %d chunks w/ chunkSize=%d\n' % (numberOfChunks, chunkSize))
    # Access myArray dataset - Note: This is NOT a NumPy array
    myArray = h5r['myArray']
    for i in range(numberOfChunks):
        # Read a chunk into memory (as a NumPy array)
        myArrayChunk = myArray[(i*chunkSize):((i+1)*chunkSize),:,:]
        # ... Do some calculation on myArrayChunk
        print(myArrayChunk)

How to work with a large dataset in PyTorch

I have a huge dataset that does not fit in memory (150G) and I'm looking for the best way to work with it in pytorch. The dataset is composed of several .npz files of 10k samples each. I tried to build a Dataset class
import os
import numpy as np
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, path):
        self.path = path
        self.files = os.listdir(self.path)
        self.file_length = {}
        for f in self.files:
            # Load file as a memmap
            d = np.load(os.path.join(self.path, f), mmap_mode='r')
            self.file_length[f] = len(d['y'])

    def __len__(self):
        raise NotImplementedError()

    def __getitem__(self, idx):
        # Find the file where idx belongs to
        count = 0
        f_key = ''
        local_idx = 0
        for k in self.file_length:
            if count <= idx < count + self.file_length[k]:
                f_key = k
                local_idx = idx - count
                break
            else:
                count += self.file_length[k]
        # Open file as numpy.memmap
        d = np.load(os.path.join(self.path, f_key), mmap_mode='r')
        # Actually fetch the data
        X = np.expand_dims(d['X'][local_idx], axis=1)
        y = np.expand_dims((d['y'][local_idx] == 2).astype(np.float32), axis=1)
        return X, y
but when a sample is actually fetched, it takes more than 30 s. It looks like the entire .npz is opened, loaded into RAM, and only then is the right index accessed.
How can I be more efficient?
EDIT
It appears to be a misunderstanding of .npz files (see the linked post), but is there a better approach?
SOLUTION PROPOSAL
As proposed by @covariantmonkey, lmdb can be a good choice. For now, since the problem comes from the .npz files and not from memmap, I remodelled my dataset by splitting the .npz package files into several .npy files. I can now use the same logic, where memmap makes perfect sense and is really fast (several ms to load a sample).
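A minimal sketch of what that remodelled dataset might look like, assuming a hypothetical layout with one X_<idx>.npy / y_<idx>.npy pair per sample:

import os
import numpy as np
from torch.utils.data import Dataset

class NpyDataset(Dataset):
    def __init__(self, path):
        self.path = path
        # hypothetical layout: X_000000.npy, y_000000.npy, X_000001.npy, ...
        self.x_files = sorted(f for f in os.listdir(path) if f.startswith('X_'))

    def __len__(self):
        return len(self.x_files)

    def __getitem__(self, idx):
        x_name = self.x_files[idx]
        # mmap_mode='r' keeps the load cheap; only the touched bytes are read
        X = np.load(os.path.join(self.path, x_name), mmap_mode='r')
        y = np.load(os.path.join(self.path, x_name.replace('X_', 'y_')), mmap_mode='r')
        return np.asarray(X), np.asarray(y)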
How large are the individual .npz files? I was in a similar predicament a month ago. After various forum posts and Google searches, I went the lmdb route. Here is what I did:
Chunk the large dataset into files small enough to fit in GPU memory; each of them is essentially my minibatch. I did not optimize for load time at this stage, just for memory.
Create an lmdb index with key = filename and data = np.savez_compressed(stuff).
lmdb takes care of the mmap for you and is insanely fast to load.
PS: savez_compressed requires a bytes object, so you can do something like
import io
import numpy as np

output = io.BytesIO()
np.savez_compressed(output, x=your_np_data)
# cache output in lmdb
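A hedged sketch of the full lmdb round trip (the database path, key, and map_size are hypothetical; the value is exactly the BytesIO buffer from above):

import io
import lmdb
import numpy as np

env = lmdb.open('minibatches.lmdb', map_size=int(1e12))  # map_size is an upper bound, not allocated up front

# write: key = filename, value = compressed npz bytes
with env.begin(write=True) as txn:
    buf = io.BytesIO()
    np.savez_compressed(buf, x=np.random.random((32, 128, 800)))
    txn.put(b'batch_000', buf.getvalue())

# read: lmdb memory-maps the file, so this is basically a memory copy
with env.begin() as txn:
    data = np.load(io.BytesIO(txn.get(b'batch_000')))
    x = data['x']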

How to retrieve the filename of an image with keras flow_from_directory shuffled method?

If I don't shuffle my files, I can get the file names with generator.filenames. But when the generator shuffles the images, filenames isn't shuffled, so I don't know how to get the file names back.
Internally, the DirectoryIterator will iterate over an index_array which is shuffled when you ask it to.
You just need to index the filename array using the current indexes of the batch:
it = ImageDataGenerator().flow_from_directory(shuffle=True, ...)
for img in it:
    idx = (it.batch_index - 1) * it.batch_size
    fnames = [it.filenames[it.index_array[i]] for i in range(idx, idx + it.batch_size)]
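One caveat not covered by the snippet above: the last batch of an epoch can be smaller than batch_size, and batch_index wraps back to 0 afterwards. A slightly more defensive variant (a sketch assuming the standard DirectoryIterator attributes batch_index, batch_size, index_array, filenames and samples) could be:

for img_batch in it:
    batch_no = (it.batch_index - 1) % len(it)       # index of the batch just yielded
    start = batch_no * it.batch_size
    stop = min(start + it.batch_size, it.samples)   # the final batch may be short
    fnames = [it.filenames[i] for i in it.index_array[start:stop]]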
I think the only option here is to NOT shuffle the files. I have been wondering this myself and this is the only thing I could find in the docs. Seems odd and not correct...

Create a single-file dataset out of _many_ b/w GIFs [duplicate]

This question already has answers here: Efficient serialization of numpy boolean arrays (closed as a duplicate).
I have many 100x100px black/white GIF images.
I want to use them in Numpy to train a machine learning algorithm, but I would like to save them in a single file that is easily readable in Python/Numpy.
By many I mean several hundred thousand, so I would like to take advantage of the fact that the images carry only 1 bit per pixel.
Any idea on how I can do this?
EDIT:
I used a BitArray object from the bitstring module, then saved it using numpy.savez. The problem is that it takes ages to save; I never managed to see the end of the process on the entire dataset. I tried saving a small subset and it took 10 minutes, and the result was about 20 times the size of the subset itself.
I will try with the BoolArray, thanks for the reference.
EDIT (solved):
I solved the problem by using a different approach from those that I found in the questions you linked. I found the numpy.packbits function here: numpy boolean array with 1 bit entries
I'm reporting my code here so it can be useful to others:
from imageio import imread  # any imread that returns a numpy array works (e.g. scipy.misc.imread in older code)
import numpy as np

accepted_shape = (100, 100)
images = []
for file_path in gifs:  # gifs: list of GIF file paths
    img_data = imread(file_path)
    if img_data.shape != accepted_shape:
        continue
    max_value = img_data.max()
    min_value = img_data.min()
    middle_value = (max_value - min_value) // 2
    image = np.packbits((img_data.ravel() > middle_value).astype(int))
    images.append(image)

images = np.vstack(images)
np.savez_compressed('dataset.npz', shape=accepted_shape, images=images)
This just requires some attention when uncompressing because if the number of bits is not a multiple of 8, some zeros are added as padding. This is how I uncompress the files:
data = np.load('dataset.npz')
shape = data['shape']
images = data['images']
nf = np.prod(shape)
ne = images.size / nf
images = np.unpackbits(images, axis=1)
images = images[:,:nf]
PyTables seems like a good option here. Something like this might work:
import numpy as np
import tables as tb

nfiles = 100000  # or however many files you have
h5file = tb.open_file('data.h5', mode='w', title="Test Array")  # openFile in older PyTables
root = h5file.root
x = h5file.create_carray(root, 'x', tb.Float64Atom(), shape=(100, 100, nfiles))  # createCArray in older PyTables
x[:100, :100, 0] = np.random.random(size=(100, 100))  # Now put in some data
h5file.close()
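Reading the array back later is just as lazy (a sketch assuming the same data.h5 layout created above):

import tables as tb

with tb.open_file('data.h5', mode='r') as h5file:
    x = h5file.root.x          # CArray node: nothing is read yet
    first_image = x[:, :, 0]   # only this 100x100 slice is loaded into memory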
