I am newbie to Tensorflow so I would appreciate any constructive help.
I am trying to build a feature extraction and data pipeline with Tensorflow for video processing where multiple folders holding video files with multiple classes (JHMDB database), but kind of stuck.
I have the feature extracted to one folder, at the moment to separate *.npz compressed arrays, in the filename I have stored the class name as well.
First Attempt
First I thought I would use this code from the TF tutorials site, simply reading files from folder method:
jhmdb_path = Path('...')
# Process files in folder
list_ds = tf.data.Dataset.list_files(str(jhmdb_path/'*.npz'))
for f in list_ds.take(5):
print(f.numpy())
def process_path(file_path):
labels = tf.strings.split(file_path, '_')[-1]
features = np.load(file_path)
return features, labels
labeled_ds = list_ds.map(process_path)
for a, b in labeled_ds.take(5):
print(a, b)
TypeError: expected str, bytes or os.PathLike object, not Tensor
..but this not working.
Second Attempt
Then I thought ok I will use generators:
# using generator
jhmdb_path = Path('...')
def generator():
for item in jhmdb_path.glob("*.npz"):
features = np.load(item)
print(item.files)
print(f['PAFs'].shape)
features = features['PAFs']
yield features
dataset = tf.data.Dataset.from_generator(generator, (tf.uint8))
iter(next(dataset))
TypeError: 'FlatMapDataset' object is not an iterator...not working.
In the first case, somehow the path is a byte type, and I could not change it to str to be able to load it with np.load(). (If I point a file directly on np.load(direct_path), then strange, but it works.)
At second case... I am not sure what is wrong.
I looked for hours to find a solution how to build an iterable dataset from list of relatively big and large numbers of 'npz' or 'npy' files, but seems to be this is not mentioned anywhere (or just way too trivial maybe).
Also, as I could not test the model so far, I am not sure if this is the good way to go. I.e. to feed the model with hundreds of files in this way, or just build one huge 3.5 GB npz (that would sill fit in memory) and use that instead. Or use TFrecords, but that looks more complicated, than the usual examples.
What is really annoying here, that TF tutorials and in general all are about how to load a ready dataset directly, or how to load np array(s) or how to load, image, text, dataframe objects, but no way to find any real examples how to process big chunks of data files, e.g. the case of extracting features from audio or video files.
So any suggestions or solutions would be highly appreciated and I would be really, really grateful to have something finally working! :)
Related
I have an existing h5py file that I downloaded which is ~18G in size. It has a number of nested datasets within it:
h5f = h5py.File('input.h5', 'r')
data = h5f['data']
latlong_data = data['lat_long'].value
I want to be able to some basic min/max scaling of the numerical data within latlong, so i want to put it in its own h5py file for easier use and lower memory usage.
However, when i try to write it out to its own file:
out = h5py.File('latlong_only.h5', 'w')
out.create_dataset('latlong', data=latlong)
out.close()
The output file is incredibly large. It's still not done writing to disk and is ~85GB in space. Why is the data being written to the new file not compressed?
Could be h5f['data/lat_long'] is using compression filters (and you aren't). To check the original dataset's compression settings, use this line:
print (h5f['data/latlong'].compression, h5f['data/latlong'].compression_opts)
After writing my answer, it occurred to me that you don't need to copy the data to another file to reduce the memory footprint. Your code reads the dataset into an array, which is not necessary in most use cases. A h5py dataset object behaves similar to a NumPy array. Instead, use this line: ds = h5f1['data/latlong'] to create a dataset object (instead of an array) and use it "like" it's a NumPy array. FYI, .value is a deprecated method to return the dataset as an array. Use this syntax instead arr = h5f1['data/latlong'][()]. Loading the dataset into an array also requires more memory than using an h5py object (which could be an issue with large datasets).
There are other ways to access the data. My suggestion to use dataset objects is 1 way. Your method (extracting data to a new file) is another way. I am not found of that approach because you now have 2 copies of the data; a bookkeeping nightmare. Another alternative is to create external links from the new file to the existing 18GB file. That way you have a small file that links to the big file (and no duplicate data). I describe that method in this post: [How can I combine multiple .h5 file?][1] Method 1: Create External Links.
If you still want to copy the data, here is what I would do. Your code reads the dataset into an array then writes the array to the new file (uncompressed). Instead, copy the dataset using h5py's group .copy() method, it will retain compression settings and attributes.
See below:
with h5py.File('input.h5', 'r') as h5f1, \
h5py.File('latlong_only.h5', 'w') as h5f2:
h5f1.copy(h5f1['data/latlong'], h5f2,'latlong')
I'm running my current code on a data set which has the following naming convetions:
Training files: training-??-?? where the ?? are wildcards (placeholders for any range).
The same convetion goes for validation and test files (e.g. validation-??-??).
In my code I create the file pattern like this:
training_file_pattern = os.path.join(config['data_dir'], "training-??-of-??")
But now I wanted to train my model also on the validation and training set together. But I'm having problems figuring out how I can take both datasets. For training I would do:
tf_data_files = tf.data.Dataset.list_files(training_file_pattern, seed=1234, shuffle=self.shuffle)
I thought to do the same with the validation set and concatenate it to like this:
tf_data_files = tf.concat(tf_data_files, tf.data.Dataset.list_files(validation_file_pattern, seed=1234, shuffle=self.shuffle))
But it doesn't work correctly. What would the correct way to do it?
I also tried to define the file_pattern differently to contain also validation but I don't know how to do it without taking also the test set (they are all in the same folder). So I cannot do this:
training_and_validation_file_pattern = os.path.join(config['data_dir'], "?-??-of-??")
Because this would also take the test set right?
Any help would be very appreciated.
If I got your point, you can simply do
dataset = tf.data.Dataset.list_files(os.listdir('path'))
dataset = tf.data.TextLineDataset(dataset)
Dataset API also has concatenate method
dataset = dataset_1.concatenate(dataset_2)
but it's not completely clear wether you need it
Edit:
list_files will create dataset with filenames
dataset = tf.data.Dataset.list_files(['f1.csv', 'f2.csv'])
for i in ds:
print(i) #output 'f1.csv'
I'm using TF 2.0 version just for clarity.
On the other hand, tf.data.TextLineDataset() outputs actual values form text file, like
tf.Tensor(b'0.7079635943784122,0.9659163071487907'
So using just list_files will create dataset from files, not their contents and will require to apply additional parse function to dataset
Perhaps this question has been asked before, but I'm having trouble finding relevant info for my situation.
I'm using PyTorch to create a CNN for regression with image data. I don't have a formal, academic programming background, so many of my approaches are ad-hoc and just terribly inefficient. May times I can go back through my code and clean things up later because the inefficiency is not so drastic that performance is significantly affected. However, in this case, my method for using the image data takes a long time, uses a lot of memory, and it is done every time I want to test a change in the model.
What I've done is essentially loaded the image data into numpy arrays, saved those arrays in an .npy file, and then when I want to use said data for the model I import all of the data in that file. I don't think the data set is really THAT large, as it is comprised of 5000, 3 color channel images of size 64x64. Yet my memory usage shoots up to 70%-80% (out of 16gb) when it is being loaded, and it takes 20-30 seconds to load in every time.
My guess is that I'm being dumb about the way I'm loading it in, but frankly I'm not sure what the standard is. Should I, in some way, put the image data somewhere before I need it, or should the data be loaded directly from the image files? And in either case, what is the best, most efficient way to do that, independent of file structure?
I would really appreciate any help on this.
For speed I would advise to used HDF5 or LMDB:
Reasons to use LMDB:
LMDB uses memory-mapped files, giving much better I/O performance.
Works well with really large datasets. The HDF5 files are always read
entirely into memory, so you can’t have any HDF5 file exceed your
memory capacity. You can easily split your data into several HDF5
files though (just put several paths to h5 files in your text file).
Then again, compared to LMDB’s page caching the I/O performance won’t
be nearly as good.
[http://deepdish.io/2015/04/28/creating-lmdb-in-python/]
If you decide to used LMDB:
ml-pyxis is a tool for creating and reading deep learning datasets using LMDBs.*(I am co author of this tool)
It allows to create binary blobs (LMDB) and they can be read quite fast. The link above comes with some simple examples on how to create and read the data. Including python generators/ iteratos .
This notebook has an example on how to create a dataset and read it paralley while using pytorch.
If you decide to use HDF5:
PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data.
https://www.pytables.org/
Here is a concrete example to demonstrate what I meant. This assumes that you've already dumped the images into an hdf5 file (train_images.hdf5) using h5py.
import h5py
hf = h5py.File('train_images.hdf5', 'r')
group_key = list(hf.keys())[0]
ds = hf[group_key]
# load only one example
x = ds[0]
# load a subset, slice (n examples)
arr = ds[:n]
# should load the whole dataset into memory.
# this should be avoided
arr = ds[:]
In simple terms, ds can now be used as an iterator which gives images on the fly (i.e. it doesn't load anything in memory). This should make the whole run time blazing fast.
for idx, img in enumerate(ds):
# do something with `img`
In addition to the above answers, the following may be useful due to some recent advances (2020) in the Pytorch world.
Your question: Should I, in some way, put the image data somewhere before I need it, or should the data be loaded directly from the image files? And in either case, what is the best, most efficient way to do that, independent of file structure?
You can leave the image files in their original format (.jpg, .png, etc.) on your local disk or on the cloud storage, but with one added step - compress the directory as a tar file. Please read this for more details:
Pytorch Blog (Aug 2020): Efficient PyTorch I/O library for Large Datasets, Many Files, Many GPUs (https://pytorch.org/blog/efficient-pytorch-io-library-for-large-datasets-many-files-many-gpus/)
This package is designed for situations where the data files are too large to fit in memory for training. Therefore, you give the URL of the dataset location (local, cloud, ..) and it will bring in the data in batches and in parallel.
The only (current) requirement is that the dataset must be in a tar file format.
The tar file can be on the local disk or on the cloud. With this, you don't have to load the entire dataset into the memory every time. You can use the torch.utils.data.DataLoader to load in batches for stochastic gradient descent.
No need saving image into npy and loading all into memory. Just load a batch of image path and transform then into tensor.
The following code define the MassiveDataset, and pass it into DataLoader, everything goes well.
from torch.utils.data.dataset import Dataset
from typing import Optional, Callable
import os
import multiprocessing
def apply_transform(transform: Callable, data):
try:
if isinstance(data, (list, tuple)):
return [transform(item) for item in data]
return transform(data)
except Exception as e:
raise RuntimeError(f'applying transform {transform}: {e}')
class MassiveDataset(Dataset):
def __init__(self, filename, transform: Optional[Callable] = None):
self.offset = []
self.n_data = 0
if not os.path.exists(filename):
raise ValueError(f'filename does not exist: {filename}')
with open(filename, 'rb') as fp:
self.offset = [0]
while fp.readline():
self.offset.append(fp.tell())
self.offset = self.offset[:-1]
self.n_data = len(self.offset)
self.filename = filename
self.fd = open(filename, 'rb', buffering=0)
self.lock = multiprocessing.Lock()
self.transform = transform
def __len__(self):
return self.n_data
def __getitem__(self, index: int):
if index < 0:
index = self.n_data + index
with self.lock:
self.fd.seek(self.offset[index])
line = self.fd.readline()
data = line.decode('utf-8').strip('\n')
return apply_transform(self.transform, data) if self.transform is not None else data
NB: open file with buffering=0 and multiprocessing.Lock() are used to avoid loading bad data (usually a bit from one part of the file and a bit from the another part of the file).
additionally, if using multiprocessing in DataLoader, one could get such exception TypeError: cannot serialize '_io.BufferedReader' object. This is caused by pickle module used in multiprocessing, it cannot serialize _io.BufferedReader, but dill can. Replacing multiprocessing with multiprocess, things goes okay (major changes compare with multiprocessing, enhanced serialization is done with dill)
same thing was discussed in this issue
I am setting up a TensorFlow pipeline for reading large HDF5 files as input for my deep learning models. Each HDF5 file contains 100 videos of variable size length stored as a collection of compressed JPG images (to make size on disk manageable). Using tf.data.Dataset and a map to tf.py_func, reading examples from the HDF5 file using custom Python logic is quite easy. For example:
def read_examples_hdf5(filename, label):
with h5py.File(filename, 'r') as hf:
# read frames from HDF5 and decode them from JPG
return frames, label
filenames = glob.glob(os.path.join(hdf5_data_path, "*.h5"))
labels = [0]*len(filenames) # ... can we do this more elegantly?
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(
lambda filename, label: tuple(tf.py_func(
read_examples_hdf5, [filename, label], [tf.uint8, tf.int64]))
)
dataset = dataset.shuffle(1000 + 3 * BATCH_SIZE)
dataset = dataset.batch(BATCH_SIZE)
iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()
This example works, however the problem is that it seems like tf.py_func can only handle one example at a time. As my HDF5 container stores 100 examples, this limitation causes significant overhead as the files constantly need to be opened, read, closed and reopened. It would be much more efficient to read all the 100 video examples into the dataset object and then move on with the next HDF5 file (preferably in multiple threads, each thread dealing with it's own collection of HDF5 files).
So, what I would like is a number of threads running in the background, reading video frames from the HDF5 files, decode them from JPG and then feed them into the dataset object. Prior to the introduction of the tf.data.Dataset pipeline, this was quite easy using the RandomShuffleQueue and enqueue_many ops, but it seems like there is currently no elegant way of doing this (or the documentation is lacking).
Does anyone know what would be the best way of achieving my goal? I have also looked into (and implemented) the pipeline using tfrecord files, but taking a random sample of video frames stored in a tfrecord file seems quite impossible (see here). Additionally, I have looked at the from_generator() inputs for tf.data.Dataset but that is definitely not going to run in multiple threads it seems. Any suggestions are more than welcome.
I stumbled across this question while dealing with a similar issue. I came up with a solution based on using a Python generator, together with the TF dataset construction method from_generator. Because we use a generator, the HDF5 file should be opened for reading only once and kept open as long as there are entries to read. So it will not be opened, read, and then closed for every single call to get the next data element.
Generator definition
To allow the user to pass in the HDF5 filename as an argument, I generated a class that has a __call__ method since from_generator specifies that the generator has to be callable. This is the generator:
import h5py
import tensorflow as tf
class generator:
def __init__(self, file):
self.file = file
def __call__(self):
with h5py.File(self.file, 'r') as hf:
for im in hf["train_img"]:
yield im
By using a generator, the code should pick up from where it left off at each call from the last time it returned a result, instead of running everything from the beginning again. In this case it is on the next iteration of the inner for loop. So this should skip opening the file again for reading, keeping it open as long as there is data to yield. For more on generators, see this excellent Q&A.
Of course, you will have to replace anything inside the with block to match how your dataset is constructed and what outputs you want to obtain.
Usage example
ds = tf.data.Dataset.from_generator(
generator(hdf5_path),
tf.uint8,
tf.TensorShape([427,561,3]))
value = ds.make_one_shot_iterator().get_next()
# Example on how to read elements
while True:
try:
data = sess.run(value)
print(data.shape)
except tf.errors.OutOfRangeError:
print('done.')
break
Again, in my case I had stored uint8 images of height 427, width 561, and 3 color channels in my dataset, so you will need to modify these in the above call to match your use case.
Handling multiple files
I have a proposed solution for handling multiple HDF5 files. The basic idea is to construct a Dataset from the filenames as usual, and then use the interleave method to process many input files concurrently, getting samples from each of them to form a batch, for example.
The idea is as follows:
ds = tf.data.Dataset.from_tensor_slices(filenames)
# You might want to shuffle() the filenames here depending on the application
ds = ds.interleave(lambda filename: tf.data.Dataset.from_generator(
generator(filename),
tf.uint8,
tf.TensorShape([427,561,3])),
cycle_length, block_length)
What this does is open cycle_length files concurrently, and produce block_length items from each before moving to the next file - see interleave documentation for details. You can set the values here to match what is appropriate for your application: e.g., do you need to process one file at a time or several concurrently, do you only want to have a single sample at a time from each file, and so on.
Edit: for a parallel version, take a look at tf.contrib.data.parallel_interleave!
Possible caveats
Be aware of the peculiarities of using from_generator if you decide to go with the solution. For Tensorflow 1.6.0, the documentation of from_generator mentions these two notes.
It may be challenging to apply this across different environments or with distributed training:
NOTE: The current implementation of Dataset.from_generator() uses
tf.py_func and inherits the same constraints. In particular, it
requires the Dataset- and Iterator-related operations to be placed on
a device in the same process as the Python program that called
Dataset.from_generator(). The body of generator will not be serialized
in a GraphDef, and you should not use this method if you need to
serialize your model and restore it in a different environment.
Be careful if the generator depends on external state:
NOTE: If generator depends on mutable global variables or other
external state, be aware that the runtime may invoke generator
multiple times (in order to support repeating the Dataset) and at any
time between the call to Dataset.from_generator() and the production
of the first element from the generator. Mutating global variables or
external state can cause undefined behavior, and we recommend that you
explicitly cache any external state in generator before calling
Dataset.from_generator().
I took me a while to figure this out, so I thought I should record this here. Based on mikkola's answer, this is how to handle multiple files:
import h5py
import tensorflow as tf
class generator:
def __call__(self, file):
with h5py.File(file, 'r') as hf:
for im in hf["train_img"]:
yield im
ds = tf.data.Dataset.from_tensor_slices(filenames)
ds = ds.interleave(lambda filename: tf.data.Dataset.from_generator(
generator(),
tf.uint8,
tf.TensorShape([427,561,3]),
args=(filename,)),
cycle_length, block_length)
The key is you can't pass filename directly to generator, since it's a Tensor. You have to pass it through args, which tensorflow evaluates and converts it to a regular python variable.
I have a folder with NetCDF files from 2006-2100, in ten year blocks (2011-2020, 2021-2030 etc).
I want to create a new NetCDF file which contains all of these files joined together. So far I have read in the files:
ds = xarray.open_dataset('Path/to/file/20062010.nc')
ds1 = xarray.open_dataset('Path/to/file/20112020.nc')
etc.
Then merged these like this:
dsmerged = xarray.merge([ds,ds1])
This works, but is clunky and there must be a simpler way to automate this process, as I will be doing this for many different folders full of files. Is there a more efficient way to do this?
EDIT:
Trying to join these files using glob:
for filename in glob.glob('path/to/file/.*nc'):
dsmerged = xarray.merge([filename])
Gives the error:
AttributeError: 'str' object has no attribute 'items'
This is reading only the text of the filename, and not the actual file itself, so it can't merge it. How do I open, store as a variable, then merge without doing it bit by bit?
If you are looking for a clean way to get all your datasets merged together, you can use some form of list comprehension and the xarray.merge function to get it done. The following is an illustration:
ds = xarray.merge([xarray.open_dataset(f) for f in glob.glob('path/to/file/.*nc')])
In response to the out of memory issues you encountered, that is probably because you have more files than the python process can handle. The best fix for that is to use the xarray.open_mfdataset function, which actually uses the library dask under the hood to break the data into smaller chunks to be processed. This is usually more memory efficient and will often allow you bring your data into python. With this function, you do not need a for-loop; you can just pass it a string glob in the form "path/to/my/files/*.nc". The following is equivalent to the previously provided solution, but more memory efficient:
ds = xarray.open_mfdataset('path/to/file/*.nc')
I hope this proves useful.