How to work with large dataset in pytorch

I have a huge dataset (150 GB) that does not fit in memory, and I'm looking for the best way to work with it in PyTorch. The dataset is composed of several .npz files of 10k samples each. I tried to build a Dataset class:
import os
import numpy as np
from torch.utils.data import Dataset
class MyDataset(Dataset):
    def __init__(self, path):
        self.path = path
        self.files = os.listdir(self.path)
        self.file_length = {}
        for f in self.files:
            # Load the file, intended as a memmap
            d = np.load(os.path.join(self.path, f), mmap_mode='r')
            self.file_length[f] = len(d['y'])
    def __len__(self):
        # Total number of samples across all files
        return sum(self.file_length.values())
    def __getitem__(self, idx):
        # Find the file that idx belongs to
        count = 0
        f_key = ''
        local_idx = 0
        for k in self.file_length:
            # <= so that the first sample of each file is matched too
            if count <= idx < count + self.file_length[k]:
                f_key = k
                local_idx = idx - count
                break
            else:
                count += self.file_length[k]
        # Open the file as numpy.memmap
        d = np.load(os.path.join(self.path, f_key), mmap_mode='r')
        # Actually fetch the data
        X = np.expand_dims(d['X'][local_idx], axis=1)
        y = np.expand_dims((d['y'][local_idx] == 2).astype(np.float32), axis=1)
        return X, y
but when a sample is actually fetched, it takes more than 30 s. It looks like the entire .npz file is opened and loaded into RAM before the right index is accessed.
How can I make this more efficient?
EDIT
It appears I misunderstood how .npz files work (see post), but is there a better approach?
SOLUTION PROPOSAL
As proposed by @covariantmonkey, lmdb can be a good choice. For now, since the problem comes from the .npz files and not from memmap, I restructured my dataset by splitting the .npz archives into several .npy files. I can now use the same logic, where memmap makes sense and is really fast (a few ms to load a sample).
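For reference, the per-file memmap logic from the solution proposal could look roughly like the sketch below. This is not the asker's actual code: the class name NpyShardIndex, the file names and the array shapes are made up for illustration, and it assumes each .npy file holds one array whose first axis indexes samples. In a real pipeline the class would presumably subclass torch.utils.data.Dataset; that is left out so the sketch only needs NumPy.

```python
import bisect
import os
import tempfile

import numpy as np


class NpyShardIndex:
    """Maps a global sample index to (file, local index) over memory-mapped .npy files."""

    def __init__(self, paths):
        # mmap_mode works for plain .npy files; the array data is not read here.
        self.shards = [np.load(p, mmap_mode="r") for p in paths]
        # offsets[i] is the global index of the first sample in shard i.
        self.offsets = [0]
        for shard in self.shards:
            self.offsets.append(self.offsets[-1] + len(shard))

    def __len__(self):
        return self.offsets[-1]

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Binary search for the shard whose index range contains idx.
        shard_id = bisect.bisect_right(self.offsets, idx) - 1
        local_idx = idx - self.offsets[shard_id]
        # Copy only this one sample out of the mapping into a regular array.
        return np.array(self.shards[shard_id][local_idx])


# Tiny demo with two hypothetical shards of 3 and 2 samples.
with tempfile.TemporaryDirectory() as tmp_dir:
    paths = []
    for shard_id, n_samples in enumerate([3, 2]):
        path = os.path.join(tmp_dir, f"X_{shard_id}.npy")
        np.save(path, np.full((n_samples, 4), shard_id, dtype=np.float32))
        paths.append(path)
    index = NpyShardIndex(paths)
    total = len(index)
    sample = index[3]  # first sample of the second shard
    del index  # drop the memory maps before the directory is removed
```

The cumulative offsets with bisect avoid walking over every file on each lookup. Keeping every memmap open is fine for a modest number of files; with very many files, reopening the file inside __getitem__ may be the safer choice.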

How large are the individual .npz files? I was in a similar predicament a month ago. After various forum posts and Google searches, I went the lmdb route. Here is what I did:
Chunk the large dataset into files small enough to fit on the GPU; each of them is essentially one minibatch. I did not optimize for load time at this stage, just memory.
Create an lmdb index with key = filename and data = np.savez_compressed(stuff).
lmdb takes care of the mmap for you and is insanely fast to load.
Regards,
A
PS: savez_compressed writes to a file-like object, so to get bytes you can do something like
import io
import numpy as np
output = io.BytesIO()
np.savez_compressed(output, x=your_np_data)
# cache output.getvalue() in lmdb
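A minimal sketch of that serialization round trip, in case it helps. A plain dict stands in for the lmdb database here (that substitution, the key name batch_0000 and the helper names are my assumptions, not part of the answer); with lmdb the bytes would be stored and fetched through a transaction instead.

```python
import io

import numpy as np


def array_to_bytes(arr):
    """Serialize an array to compressed .npz bytes held in memory."""
    buffer = io.BytesIO()
    np.savez_compressed(buffer, x=arr)
    return buffer.getvalue()


def bytes_to_array(blob):
    """Inverse of array_to_bytes: decode the stored bytes back into an array."""
    with np.load(io.BytesIO(blob)) as npz:
        return npz["x"]


# A dict is used as a stand-in key-value store (assumption for illustration).
store = {}
batch = np.arange(12, dtype=np.float32).reshape(3, 4)
store[b"batch_0000"] = array_to_bytes(batch)

restored = bytes_to_array(store[b"batch_0000"])
```

Note that compressed data has to be decompressed on every read, so whether compression pays off depends on the data and the disk.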

Related

Hierarchical dictionary (reducing memory footprint or using a database)

I am working with extremely high dimensional biological count data (single cell RNA sequencing where rows are cell ID and columns are genes).
Each dataset is a separate flat file (AnnData format). Each flat file can be broken down by various metadata attributes, including by cell type (eg: muscle cell, heart cell), subtypes (eg: a lung dataset can be split into normal lung and cancerous lung), cancer stage (eg: stage 1, stage 2), etc.
The goal is to pre-compute aggregate metrics for a specific metadata column, sub-group, dataset, cell-type, gene combination and keep that readily accessible such that when a person queries my web app for a plot, I can quickly retrieve results (refer to Figure below to understand what I want to create). I have generated Python code to assemble the dictionary below and it has sped up how quickly I can create visualizations.
The only issue now is that the memory footprint of this dictionary is very high (there are ~10,000 genes per dataset). What is the best way to reduce the memory footprint of this dictionary? Or should I consider another storage framework (I briefly saw something called Redis Hashes)?

One option to reduce your memory footprint but keep fast lookup is to use an hdf5 file as a database. This will be a single large file that lives on your disk instead of memory, but is structured the same way as your nested dictionaries and allows for rapid lookups by reading in only the data you need. Writing the file will be slow, but you only have to do it once and then upload to your web-app.
To test this idea, I've created two test nested dictionaries in the format of the diagram you shared. The small one has 1e5 metadata/group/dataset/celltype/gene entries, and the other is 10 times larger.
Writing the small dict to hdf5 took ~2 minutes and resulted in a file 140 MB in size while the larger dict-dataset took ~14 minutes to write to hdf5 and is a 1.4 GB file.
Querying the small and large hdf5 files takes similar amounts of time, showing that the queries scale well to more data.
Here's the code I used to create the test dict-datasets, write to hdf5, and query:
import h5py
import numpy as np
import time
def create_data_dict(level_counts):
    """
    Create test data in the same nested-dict format as the diagram you show
    The Agg_metric values are random floats between 0 and 1
    (you shouldn't need this function since you already have real data in dict format)
    """
    if not level_counts:
        return {f'Agg_metric_{i+1}':np.random.random() for i in range(num_agg_metrics)}
    level,num_groups = level_counts.popitem()
    return {f'{level}_{i+1}':create_data_dict(level_counts.copy()) for i in range(num_groups)}
def write_dict_to_hdf5(hdf5_path,d):
    """
    Write the nested dictionary to an HDF5 file to act as a database
    only have to create this file once, but can then query it any number of times
    (unless the data changes)
    """
    def _recur_write(f,d):
        for k,v in d.items():
            #check if the next level is also a dict
            sk,sv = v.popitem()
            v[sk] = sv
            if type(sv) == dict:
                #this is a 'node', move on to next level
                _recur_write(f.create_group(k),v)
            else:
                #this is a 'leaf', stop here
                leaf = f.create_group(k)
                for sk,sv in v.items():
                    leaf.attrs[sk] = sv
    with h5py.File(hdf5_path,'w') as f:
        _recur_write(f,d)
def query_hdf5(hdf5_path,search_terms):
    """
    Query the hdf5_path with a list of search terms
    The search terms must be in the order of the dict, and have a value at each level
    Output is a dict of agg stats
    """
    with h5py.File(hdf5_path,'r') as f:
        k = '/'.join(search_terms)
        try:
            f = f[k]
        except KeyError:
            print('oh no! at least one of the search terms wasnt matched')
            return {}
        return dict(f.attrs)
################
#     start    #
################
#this "small_level_counts" results in an hdf5 file of size 140 MB (took < 2 minutes to make)
#all possible nested dictionaries are made,
#so there are 40*30*10*3*3 = ~1e5 metadata/group/dataset/celltype/gene entries
num_agg_metrics = 7
small_level_counts = {
    'Gene':40,
    'Cell_Type':30,
    'Dataset':10,
    'Unique_Group':3,
    'Metadata':3,
}
#"large_level_counts" results in an hdf5 file of size 1.4 GB (took 14 mins to make)
#has 400*30*10*3*3 = ~1e6 metadata/group/dataset/celltype/gene combinations
num_agg_metrics = 7
large_level_counts = {
    'Gene':400,
    'Cell_Type':30,
    'Dataset':10,
    'Unique_Group':3,
    'Metadata':3,
}
#Determine which test dataset to use
small_test = True
if small_test:
    level_counts = small_level_counts
    hdf5_path = 'small_test.hdf5'
else:
    level_counts = large_level_counts
    hdf5_path = 'large_test.hdf5'
np.random.seed(1)
start = time.time()
data_dict = create_data_dict(level_counts)
print('created dict in {:.2f} seconds'.format(time.time()-start))
start = time.time()
write_dict_to_hdf5(hdf5_path,data_dict)
print('wrote hdf5 in {:.2f} seconds'.format(time.time()-start))
#Search terms in order of most broad to least
search_terms = ['Metadata_1','Unique_Group_3','Dataset_8','Cell_Type_15','Gene_17']
start = time.time()
query_result = query_hdf5(hdf5_path,search_terms)
print('queried in {:.2f} seconds'.format(time.time()-start))
direct_result = data_dict['Metadata_1']['Unique_Group_3']['Dataset_8']['Cell_Type_15']['Gene_17']
print(query_result == direct_result)

Although Python dictionaries themselves are fairly efficient in terms of memory usage, you are likely storing multiple copies of the strings you are using as dictionary keys. From your description of your data structure, it is likely that you have 10,000 copies of "Agg metric 1", "Agg metric 2", etc., one for every gene in your dataset. These duplicate strings are probably taking up a significant amount of memory. They can be deduplicated with sys.intern so that, although you still have as many references to the string in your dictionary, they all point to a single copy in memory. You would only need to make a minimal adjustment to your code by changing the assignment to data[sys.intern('Agg metric 1')] = value. I would do this for all of the keys used at all levels of your dictionary hierarchy.
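A small sketch of the interning idea, under the assumption that the key strings are built at runtime (for example parsed from a file or assembled by concatenation). String literals written directly in the source are usually already shared by CPython, so the savings apply mainly to runtime-built keys. The function name build_metrics and the key format are illustrative only.

```python
import sys


def build_metrics(n_genes, n_metrics=3):
    """Build {gene: {metric_name: value}} using interned metric-name keys."""
    data = {}
    for g in range(n_genes):
        gene = {}
        for m in range(n_metrics):
            # The key is assembled at runtime; interning makes every gene
            # reuse one shared string object per metric name.
            key = sys.intern("Agg_metric_" + str(m + 1))
            gene[key] = 0.0
        data["Gene_" + str(g + 1)] = gene
    return data


data = build_metrics(1000)
keys_first = list(data["Gene_1"])
keys_last = list(data["Gene_1000"])
# Identity check (not just equality): both genes hold the very same key objects.
shared = all(a is b for a, b in zip(keys_first, keys_last))
```

This only deduplicates the key strings; the dict objects themselves and the stored values still take their usual space.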

Optimize data acquisition with HDF5 files in Python

I'm trying to understand how I can write a DAQ in Python where I manage two signals (I and Q from an IQ mixer) from an NI device. My doubts concern two problems:
What are the main differences between using h5py and using pandas? My data are not complex; I need only two matrix datasets, one for the I signal and one for the Q signal.
Is it more efficient to create the whole dataset and then occupy a lot of memory before storing it in an HDF5 file, or to open the HDF5 file each time to add a new row (new data) to the matrix?

@Frostman, this is primarily an opinion question. Remember, "Beauty is in the eye of the beholder."
The question about memory use is the more important consideration. The answer depends on memory required to hold your data (and if you have enough before writing to disk). Creating the whole dataset in memory is faster and easier. But, that's not an option if it doesn't fit. :-) Note: you don't want to write data 1 row at a time. That is the slowest way to work with HDF5 data. If you need to save incrementally, write "a lot" of rows at 1 time (say 1,000).
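To make the "write a lot of rows at one time" advice concrete, here is a rough h5py sketch. The dataset name iq, the two-column I/Q layout, the block size of 1,000 rows and the helper append_block are all assumptions for illustration, and the random-free stand-in blocks take the place of real acquired samples.

```python
import os
import tempfile

import h5py
import numpy as np

BLOCK_ROWS = 1000  # rows buffered in memory before each write (assumed value)


def append_block(dset, block):
    """Grow the dataset along axis 0 and write one block of rows at the end."""
    start = dset.shape[0]
    dset.resize(start + block.shape[0], axis=0)
    dset[start:] = block


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "iq.h5")
    with h5py.File(path, "w") as f:
        # Columns 0 and 1 hold the I and Q signals (layout is an assumption).
        dset = f.create_dataset("iq", shape=(0, 2), maxshape=(None, 2),
                                dtype="f8", chunks=(BLOCK_ROWS, 2))
        for block_id in range(3):
            # Stand-in for one buffer of acquired samples.
            block = np.full((BLOCK_ROWS, 2), float(block_id))
            append_block(dset, block)
    with h5py.File(path, "r") as f:
        final_shape = f["iq"].shape
        last_row = f["iq"][final_shape[0] - 1]
```

Each call does one resize and one write for 1,000 rows, instead of 1,000 separate one-row writes.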
There is a related consideration: which of these packages is fast enough (and easy enough) to keep up with the I/O requirements of data acquisition? (I have no expertise in this area, but know high sample rates will quickly create a lot of data.)
From a technical perspective, the difference between the packages is "cosmetic" (IMHO). (FYI, you can also use PyTables to create HDF5 data.) In other words, all 3 can easily create an HDF5 file with the data described in your question. The question (for you) is which package you want to learn, and which package you plan to use for later post-processing. (I assume you want to open the file and "do something" with the data later.)
All else being equal, I would create the data with the same package I plan to use for downstream processing. Why? h5py and pandas use different schemas to store the data, so reading the data will be easier if you write and read with the same package. (That said, you can manipulate HDF5 data between the packages.)
If the downstream processing requirement has not been decided, I would select h5py if either of these are true: a) you are comfortable with NumPy, or b) you need to use NumPy for other operations. h5py is pretty easy to learn if you know NumPy.
Otherwise, you might prefer pandas. Many claim it is easier to use (vs h5py and PyTables). I am not a pandas expert, so I can't comment. I prefer h5py and PyTables primarily because the 2D data schema is saved in a table format that is easy to review with HDFView. (Also, I use NumPy 90+% of the time, so h5py/PyTables are natural extensions.)
If you want to compare code for each, look at the answers to this question: How to write large multiple arrays to a h5 file in layers? They show code required to store data similar to yours for all 3 packages: h5py, pytables and pandas.

Especially when acquiring long measurements, it is handy to write directly into an HDF5 file. This is my preferred way, because an interruption (power failure etc.) won't result in data loss.
This is my solution using a while loop that, in each cycle, collects all available samples from the DAQ and stores them immediately in the HDF5 file. You could imagine some real-time display during each loop cycle, but be aware of the loop duration (set parameter["debug_output"] = 3 to see more statistics, such as the buffer size of each cycle).
Changing the boolean hdf5_write to False causes the code to store the samples in the NumPy array data, which sooner or later will fill the memory. If True, all the samples are written directly into a growing HDF5 file.
import nidaqmx
import datetime
import time
import numpy as np
import h5py
def hdf5_write_parameter(h5_file, parameter, group_name='parameter'):
    # add parameter group
    param_grp = h5_file.create_group(group_name)
    # write single item
    for key, item in parameter.items():
        try:
            if item is None:
                item = 'None'
            if isinstance(item, dict):
                # recursively write each dictionary
                hdf5_write_parameter(h5_file, item, group_name+'/'+key)
            else:
                h5_file.create_dataset("/"+group_name+"/{}".format(key), data=item)
        except Exception:
            print("[hdf5_write_parameter]: failed to write:", key, "=", item)
    return
run_bool = True  # should be controlled by GUI or caller thread
measurement_duration = 1  # in seconds
filename = 'test_acquisition'
hdf5_write = True  # a hdf5 file with ending '.h5' is created, False = numpy array
# check if device is available
system = nidaqmx.system.System.local()
system.driver_version
for device in system.devices:
    print(device)  # print devices
    ADC_DEVICE_NAME = device.name  # 'PCI6024e'
print('ADC: init measure for', measurement_duration, 'seconds')
# Setup ADC
parameter = {
    "channels": 8,  # number of AI channels
    "channel_name": ADC_DEVICE_NAME + '/ai0:7',
    "log_rate": int(20000),  # Samples per second
    "adc_min_value": -5.0,  # minimum ADC value in Volts
    "adc_max_value": 5.0,  # maximum ADC value in Volts
    "timeout": measurement_duration + 2.0,  # timeout to detect external clock on read
    "debug_output": 1,
    "measurement_duration": measurement_duration,
}
parameter["buffer_size"] = int(parameter["log_rate"])  # buffer size in samples
# must be bigger than loop duration!
parameter["requested_samples"] = parameter["log_rate"] * measurement_duration
parameter["hdf5_write"] = hdf5_write  # write in array
if parameter['hdf5_write']:
    filename += '.h5'
    f = h5py.File(filename, 'w')  # create a h5-file object if True
    data = f.create_dataset('data', (0, parameter["channels"]),
                            maxshape=(None, parameter["channels"]), chunks=True)
else:
    filename += '.csv'
    # pre-allocate array, we might get up to 1 buffer more than requested...
    data = np.empty((parameter["requested_samples"]+parameter["buffer_size"], parameter["channels"]), dtype=np.float64)
    data[:] = np.nan
with nidaqmx.Task() as task:
    task.ai_channels.add_ai_voltage_chan(parameter["channel_name"],
                                         terminal_config=nidaqmx.constants.TerminalConfiguration.RSE,
                                         min_val=parameter["adc_min_value"],
                                         max_val=parameter["adc_max_value"],
                                         units=nidaqmx.constants.VoltageUnits.VOLTS
                                         )
    task.timing.cfg_samp_clk_timing(rate=parameter["log_rate"],
                                    sample_mode=nidaqmx.constants.AcquisitionType.CONTINUOUS)
    # helper variables
    total_samples = 0
    i = 0
    last_display = -1
    parameter["acquisition_start"] = str(datetime.datetime.now())
    if 1:
        print("ADC: --- acquisition started:", parameter["acquisition_start"])
        print("ADC: Requested samples:", parameter["requested_samples"], "Acquisition duration:",
              measurement_duration)
    task.control(nidaqmx.constants.TaskMode.TASK_COMMIT)
    time_adc_start = time.perf_counter()
    # ############################# READING LOOP ##########################
    while (run_bool and total_samples < parameter["requested_samples"]
           and time.perf_counter() - time_adc_start < parameter["timeout"]):
        i = i + 1
        if parameter["debug_output"] >= 1:
            elapsed_time = np.floor(time.perf_counter() - time_adc_start)  # in sec
            if elapsed_time != last_display:
                print("ADC: ...", round(elapsed_time), "of", measurement_duration, "sec:",
                      total_samples, "acquired ...")
                last_display = elapsed_time
        # high-level read function: always creates a new array
        data_buff = np.asarray(
            task.read(number_of_samples_per_channel=nidaqmx.constants.READ_ALL_AVAILABLE)).T
        time_adc_end = time.perf_counter()
        samples_from_buffer = data_buff.shape[0]
        # get number of samples and accumulate to total_samples
        total_samples = int(total_samples + samples_from_buffer)
        if parameter["debug_output"] >= 2:
            print("ADC: iter", i, "total:", total_samples, "smp from buffer", samples_from_buffer,
                  "time elapsed", time.perf_counter() - time_adc_start)
        if samples_from_buffer > 0:
            # prepare buffer and hdf5 dataset
            if parameter["hdf5_write"]:  # sequential write to hdf5 file
                chunk_start = data.shape[0]
                # resize dataset in file
                data.resize(data.shape[0] + samples_from_buffer, axis=0)
            else:
                # prepare buffer to fit in pre-allocated array 'data'
                chunk_start = int(np.count_nonzero(~np.isnan(data)) / parameter["channels"])
            if parameter['channels'] == 1:
                data_buff = data_buff[:, np.newaxis]
            if parameter["debug_output"] >= 3:
                print("Non-empty data shape: (", data.shape,
                      "), buffer shape:", data_buff.shape,
                      "chunk start:", chunk_start)
            # write buffer to HDF5 file or into numpy array
            data[chunk_start:chunk_start + samples_from_buffer, :] = data_buff
    # ############################# READING LOOP #########################
    parameter["acquisition_stop"] = str(datetime.datetime.now())
    if parameter["debug_output"] >= 1:
        print("ADC: requested points: ", parameter["requested_samples"])
        print("ADC: total acquired points", total_samples, "in", time_adc_end - time_adc_start)
        print("ADC: data array shape:", data.shape)
        print("ADC: --- acquisition finished:", parameter["acquisition_stop"])
        print("ADC: sample rate:", round(1/((time_adc_end-time_adc_start)/parameter["requested_samples"])))
    # prepare data nparray for return
    if not parameter["hdf5_write"]:
        # shrink numpy array by all NaNs (from oversize with buffer size)
        total_written = int(np.count_nonzero(~np.isnan(data)) / parameter["channels"])
        if parameter["debug_output"] >= 2:
            print("resize data array by cutting", data.shape[0] - total_written, "trailing NaNs")
        data = np.resize(data, (total_written, parameter["channels"]))
# add more parameters to write into the hdf5 file
parameter["total_samples"] = total_samples
parameter["total_acquisition_time"] = time_adc_end - time_adc_start
parameter["data_shape"] = data.shape
if parameter['hdf5_write']:
    hdf5_write_parameter(f, parameter)  # write parameter
    f.close()

Fast loading multiple .npy files into data generator

I have thousands of .npy files stored on my hard disk, each containing a single matrix with dimensions [128, T], where T is variable (on average T=800). Each .npy file is around 2 MB in size, depending on the matrix shape.
These matrices are then passed to a generator, which yields batches of 32 to a neural network. The Python code used to pass the matrices into the generator is:
def load_batch(path_list):
    np_list = []
    for path in path_list:
        np_list.append(np.load(path))
    return np_list
which, given a list of paths of the .npy files, returns a list of the corresponding NumPy matrices.
This code takes, on average, 0.6s to return a list of 32 matrices. I am using append because this is usually a quick operation.
I am aware that the speed of the hard disk buffer does have an influence on timings but, right now, I really would like to shrink the amount of time required as much as possible by just modifying the code in a smart way.
As an alternative, I tried implementing multi-processing:
from multiprocessing import Pool
import numpy as np
def reader(filename):
    return np.load(filename)
def load_multiprocess(path_list, n_cores=5):
    # the context manager shuts the pool down once the batch is loaded
    with Pool(n_cores) as pool:
        np_list = pool.map(reader, path_list)
    return np_list
However, the performance is much worse. I had a look around stackoverflow, and I got the idea that my specific application could not benefit from multiprocessing.
To summarize, I am looking for any kind of advice for one of these two tasks:
Improving the speed of the first code (even 0.1s less would mean a lot).
Using multiprocessing in the right way, if possible.
SOLUTION AND BENCHMARK
Of the three methods proposed here, user7138814's solution generally seems to improve execution speed the most. However, things seem to change when the data is loaded while training a neural network: even though memory mapping is by itself still the quickest method for loading data, the overall training time seems to increase. I have no idea where or why, as the timings using the memory-mapped load are always better.
Below is a benchmark of the three methods. First, define the methods:
import numpy as np
# my initial method
def load_batch(path_list):
    np_list = []
    for path in path_list:
        np_list.append(np.load(path))
    return np_list
# Aaj Kaal's method
def load_batch1(path_list):
    return [np.load(path) for path in path_list]
# user7138814's method
def load_batch2(path_list):
    np_list = []
    for path in path_list:
        np_list.append(np.load(path, mmap_mode='r'))
    return np_list
I defined a list of paths as follows:
batches_list = []
batch_size = 32
for n in range(0,150):
    batches_list.append(X_path_list[n*batch_size:n*batch_size+batch_size])
The list contains 150 batches of 32 paths each, which should be enough to calculate the mean.
Then, each method is executed on exactly the same data.
import time
# my initial method
timing0 = []
for l in batches_list:
    start = time.time()
    load_batch(l)
    end = time.time()
    timing0.append(end-start)
print(np.mean(timing0))
# Aaj Kaal's method
timing1 = []
for l in batches_list:
    start = time.time()
    load_batch1(l)
    end = time.time()
    timing1.append(end-start)
print(np.mean(timing1))
# user7138814's method
timing2 = []
for l in batches_list:
    start = time.time()
    load_batch2(l)
    end = time.time()
    timing2.append(end-start)
print(np.mean(timing2))
Output (mean timing in seconds over 150 executions):
0.022530150413513184
0.022546884218851725
0.009580903053283692
Results seem to be consistent when changing length of batches_list and batch_size.

Maybe memory mapping the files will be beneficial due to lazy loading. If you would use for example
np.load(filename, mmap_mode='r')
the creation of the numpy array becomes almost a no-op, but later in the pipeline you pay the price. This could provide a speedup if it results in processing the data in parallel with reading from disk.
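To illustrate where the cost moves to, here is a small sketch. The file name matrix.npy and the [128, 800] shape are just stand-ins for one of the files described in the question. The load call returns a memory map almost immediately, and the actual disk read only happens when data is copied out of it.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "matrix.npy")
    np.save(path, np.arange(128 * 800, dtype=np.float32).reshape(128, 800))

    # Returns an np.memmap; the array data has not been read into RAM yet.
    lazy = np.load(path, mmap_mode="r")
    is_memmap = isinstance(lazy, np.memmap)

    # The read is paid here, and only for the requested [128, 10] window.
    window = np.array(lazy[:, :10])
    window_shape = window.shape
    first_row_sum = float(window[0].sum())

    del lazy  # release the mapping before the temp directory is removed
```

This is also why a benchmark that only times the load call looks very favorable for the memory-mapped version: the read shows up later, wherever the array is first used.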

Did you try using a list comprehension? Replace
def load_batch(path_list):
    np_list = []
    for path in path_list:
        np_list.append(np.load(path))
    return np_list
with
def load_batch(path_list):
    return [np.load(path) for path in path_list]
In fact, you can get rid of the function and use the list comprehension directly. If a function call is required, use a lambda.

How to append chunks of 2D numpy array to binary file as the chunks are created?

I have a large input file which consists of data frames (a data series (complex64), with an identifying header in each frame). It is larger than my available memory. The headers repeat, but are randomly ordered, so for example the input file could look like:
<FRAME header={0}, data={**first** 500 numbers...}>,
<FRAME header={18}, data={first 500 numbers...}>,
<FRAME header={4}, data={first 500 numbers...}>,
<FRAME header={0}, data={**next** 500 numbers...}>
...
I want to order the data into a new file that is a numpy array of shape (len(headers), len(data_series)). It has to build the output file as it reads the frames, because I can't fit it all in memory.
I've looked at numpy.savetxt and the python csv package but for disk size, precision, and speed reasons I would prefer for the output file to be binary. numpy.save is good except that I can't figure out how to make it append to an unknown array size.
I have to work in Python2.7 because of some dependencies needed to read these frames. What I have done so far is made a function able to write all of the frames with a matching header to a single binary file:
input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)
with open("singleFrameHeader", 'ab') as f:
    current_data = input_data.readFrame()  # This loads the next frame in the file
    if current_data.header == 0:
        float_arr = np.array(current_data.data).view(float)
        float_arr.tofile(f)
This works great, but I need to extend it to be two-dimensional. I'm starting to look at h5py as an option, but was hoping there is a simpler solution.
What would be great is something like:
input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)
with open("bigMatrix", 'ab') as f:
    current_data = input_data.readFrame()  # This loads the next frame in the file
    index = current_data.header
    float_arr = np.array(current_data.data).view(float)
    float_arr.tofile(f, index)
Any help is appreciated. I thought this would be a more common use-case to read and write to a 2D binary file in append mode.

You have two problems: one is that a file contains sequential data, and the other is that numpy binary files don't store shape information.
A simple way to start solving this would be to carry through with your initial idea of converting the data into files by header, then combining all the binary files into one large product (if you still feel the need to do so).
You could maintain a map of the headers you've found so far to their output files, data size, etc. This will allow you to combine the data more intelligently, if for example, there are missing chunks or headers or something.
from contextlib import ExitStack
from os import remove
from tempfile import NamedTemporaryFile
from shutil import copyfileobj
import sys
import numpy as np
class Header:
    __slots__ = ('id', 'count', 'file', 'name')
    def __init__(self, id):
        self.id = id
        self.count = 0  # number of float values written so far
        self.file = NamedTemporaryFile(delete=False)
        self.name = self.file.name
    def write_frame(self, frame):
        data = np.array(frame.data).view(float)
        self.count += data.size
        data.tofile(self.file)
input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)
file_map = {}
with ExitStack() as stack:
    while True:
        frame = input_data.next_frame()
        if frame is None:
            break  # recast this loop as necessary
        if frame.header not in file_map:
            header = Header(frame.header)
            stack.enter_context(header.file)
            file_map[frame.header] = header
        else:
            header = file_map[frame.header]
        header.write_frame(frame)
# Headers are assumed to be non-negative integers, so rows 0..max(file_map) are written
num_headers = max(file_map) + 1
max_count = max(h.count for h in file_map.values())
with open('singleFrameHeader', 'wb') as output:
    output.write(num_headers.to_bytes(8, sys.byteorder))
    output.write(max_count.to_bytes(8, sys.byteorder))
    for i in range(num_headers):
        if i in file_map:
            h = file_map[i]
            with open(h.name, 'rb') as source:
                copyfileobj(source, output)
            remove(h.name)
            # pad short rows with NaN
            if h.count < max_count:
                np.full(max_count - h.count, np.nan, dtype=np.float64).tofile(output)
        else:
            # header never seen: write a full row of NaN
            np.full(max_count, np.nan, dtype=np.float64).tofile(output)
The first 16 bytes will be the int64 number of headers and number of elements per header, respectively. Keep in mind that the file is in native byte order, whatever that may be, and is therefore not portable.
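Reading such a file back could look something like the sketch below. It assumes the layout just described (two native-order int64 values, then rows of float64 data) and uses np.memmap so the matrix does not have to fit in memory. The helper name open_matrix, the file name and the small demo matrix are made up for illustration.

```python
import os
import sys
import tempfile

import numpy as np

HEADER_BYTES = 16  # two int64 values: number of rows, elements per row


def open_matrix(path):
    """Memory-map the float64 payload that follows the 16-byte size header."""
    with open(path, "rb") as f:
        rows = int.from_bytes(f.read(8), sys.byteorder)
        cols = int.from_bytes(f.read(8), sys.byteorder)
    return np.memmap(path, dtype=np.float64, mode="r",
                     offset=HEADER_BYTES, shape=(rows, cols))


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bigMatrix")
    # Write a tiny file in the assumed layout; NaN marks a padded element.
    expected = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, np.nan]])
    with open(path, "wb") as f:
        f.write(expected.shape[0].to_bytes(8, sys.byteorder))
        f.write(expected.shape[1].to_bytes(8, sys.byteorder))
        expected.tofile(f)

    matrix = open_matrix(path)
    matrix_shape = matrix.shape
    second_row = np.array(matrix[1])  # copy one row out of the mapping
    del matrix  # release the mapping before the directory is removed
```

Because the sizes are stored in native byte order, the same portability caveat applies to the reader as to the writer.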
Alternative
If (and only if) you know the exact size of a header dataset ahead of time, you can do this in one pass, with no temporary files. It also helps if the headers are contiguous. Otherwise, missing swaths will be zero-filled. You will still need to maintain a dictionary of your current position within a header, but you will no longer have to keep a separate file pointer around for each one. All-in-all, this is a much better alternative than the original solution, if your use-case allows it:
import sys
import numpy as np
# N, max_header and max_count are placeholders: all of them must be known up front
header_size = 500 * N  # size in bytes of one header's complete row
input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)
header_map = {}  # header -> number of bytes already written for that header
with open('singleFrameHeader', 'wb') as output:
    output.write(max_header.to_bytes(8, sys.byteorder))
    output.write(max_count.to_bytes(8, sys.byteorder))
    while True:
        frame = input_data.next_frame()
        if frame is None:
            break
        if frame.header not in header_map:
            header_map[frame.header] = 0
        data = np.array(frame.data).view(float)
        # skip the 16-byte size header, then jump to this header's row and current position
        output.seek(16 + frame.header * header_size + header_map[frame.header])
        data.tofile(output)
        header_map[frame.header] += data.size * data.dtype.itemsize
I asked a question regarding this sort of out-of-order write pattern as a consequence of this answer: What happens when you seek past the end of a file opened for writing?
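For what it's worth, the zero-filling behaviour is easy to check with a few lines. On POSIX systems, seeking past the end of a file and writing there leaves a gap that reads back as zero bytes; I would not rely on this elsewhere without testing. The file name is arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sparse.bin")
    with open(path, "wb") as f:
        f.seek(8)       # jump past the end of the still-empty file
        f.write(b"AB")  # bytes 0..7 are never written explicitly
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        content = f.read()

gap = content[:8]  # the skipped region
```

Since the gap is filled with zeros rather than NaN, unwritten regions cannot be told apart from genuine zero samples unless the positions are tracked separately.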

Batch processing: read image files, then write multidimensional numpy array to HDFS

I am trying to iteratively load a batch of images from a folder, process, then store the results of the batch to an hdf file. What's the best practice for batch reading images/files, and batch storing a resulting multi-dimensional array?
First Part
I start with a csv list of file names:
from itertools import permutations
file_list = [''.join(x) + '.png' for x in permutations('abcde')][:100]
Say for example I want to process 5 images at a time.
I currently grab 5 file names from the list, create an empty array to hold 5 images, then read each image one at a time to yield a batch.
import numpy as np
def load_images(file_list):
    for i in range(0, 100, 5):
        files_list = file_list[i:i + 5]
        image_list = np.zeros(shape=(5, 50, 50, 3))
        for idx, file in enumerate(files_list):
            loaded_img = np.random.random((50, 50, 3))  # misc.imread(file)
            image_list[idx] = loaded_img
        yield image_list, files_list
Question 1: Is there a way to eliminate the second for loop? Can I batch read in the images, or is the method above (one at a time) best practice?
Second Part:
After loading the images I do some processing on them. This results in a different size array
def process_images(image_batch):
    result = image_batch[:, 5, 4, 3]  # a novel down-sampling algorithm
    return result
Now, I want to store the batch of images with their original file names.
import pandas as pd
def store_images(data, file_names):
    with pd.HDFStore('output.h5') as hdf:
        pass
Question 2: What is the best way to store a batch of multidimensional numpy arrays, while still referencing them with a key (such as the original file name)?
I would like to explore using .h5 files, so if anyone knows how to batch process data to an .h5 and has advice on this, it would be most appreciated. Alternatively I think there is a way to save the numpy arrays as just .npy files to a folder, but I was having trouble with this and still wouldn't know how to do it other than one sample at a time (versus one batch at a time)
