how to save torchtext Dataset?

how to save torchtext Dataset? - python

I'm working with text and use torchtext.data.Dataset.
Creating the dataset takes a considerable amount of time.
For just running the program this is still acceptable. But I would like to debug the torch code for the neural network. And if python is started in debug mode, the dataset creation takes roughly 20 minutes (!!). That's just to get a working environment where I can debug-step through the neural network code.
I would like to save the Dataset, for example with pickle. This sample code is taken from here, but I removed everything that is not necessary for this example:
from torchtext import data
from fastai.nlp import *
PATH = 'data/aclImdb/'
TRN_PATH = 'train/all/'
VAL_PATH = 'test/all/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'
TEXT = data.Field(lower=True, tokenize="spacy")
bs = 64;
bptt = 70
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)
with open("md.pkl", "wb") as file:
pickle.dump(md, file)
To run the code, you need the aclImdb dataset, it can be downloaded from here. Extract it into a data/ folder next to this code snippet. The code produces an error in the last line, where pickle is used:
Traceback (most recent call last):
File "/home/lhk/programming/fastai_sandbox/lesson4-imdb2.py", line 27, in <module>
pickle.dump(md, file)
TypeError: 'generator' object is not callable
The samples from fastai often use dill instead of pickle. But that doesn't work for me either.

I came up with the following functions for myself:
import dill
from pathlib import Path
import torch
from torchtext.data import Dataset
def save_dataset(dataset, path):
if not isinstance(path, Path):
path = Path(path)
path.mkdir(parents=True, exist_ok=True)
torch.save(dataset.examples, path/"examples.pkl", pickle_module=dill)
torch.save(dataset.fields, path/"fields.pkl", pickle_module=dill)
def load_dataset(path):
if not isinstance(path, Path):
path = Path(path)
examples = torch.load(path/"examples.pkl", pickle_module=dill)
fields = torch.load(path/"fields.pkl", pickle_module=dill)
return Dataset(examples, fields)
Not that actual objects could be a bit different, for example, if you save TabularDataset, then load_dataset returns an instance of class Dataset. This unlikely affect the data pipeline but may require extra diligence for tests.
In the case of a custom tokenizer, it should be serializable as well (e.g. no lambda functions, etc).

You can use dill instead of pickle. It works for me.
You can save a torchtext Field like
TEXT = data.Field(sequential=True, tokenize=tokenizer, lower=True,fix_length=200,batch_first=True)
with open("model/TEXT.Field","wb")as f:
dill.dump(TEXT,f)
And load a Field like
with open("model/TEXT.Field","rb")as f:
TEXT=dill.load(f)
The offical code suppport is under development，you can follow https://github.com/pytorch/text/issues/451 and https://github.com/pytorch/text/issues/73 .

You can always use the pickle to dump the objects, but keep in mind one thing that dumping a list of dictionary or fields objects are not taken care of by the module, so to the best try to decompose the list first
To Store the DataSet Object to a pickle file for later easy loading
def save_to_pickle(dataSetObject,PATH):
with open(PATH,'wb') as output:
for i in dataSetObject:
pickle.dump(vars(i), output, pickle.HIGHEST_PROTOCOL)
The toughest thing is yet to come, Yeah loading the pickle file.... ;)
First, try to look for all field names and field attributes and then go for the kill
To load the pickle file into the DataSetObject
def load_pickle(PATH, FIELDNAMES, FIELD):
dataList = []
with open(PATH, "rb") as input_file:
while True:
try:
# Taking the dictionary instance as the input Instance
inputInstance = pickle.load(input_file)
# plugging it into the list
dataInstance = [inputInstance[FIELDNAMES[0]],inputInstance[FIELDNAMES[1]]]
# Finally creating an example objects list
dataList.append(Example().fromlist(dataInstance,fields=FIELD))
except EOFError:
break
# At last creating a data Set Object
exampleListObject = Dataset(dataList, fields=data_fields)
return exampleListObject
This hackish solution has worked in my case, hope you will find it useful in your case too.
Btw any suggestion is welcome :).

The pickle/dill approach is fine if your dataset is small. But if you are working with large datasets I won't recommend it as it will be too slow.
I simply save the examples (iteratively) as JSON-strings. The reason behind this is because saving the whole Dataset object takes a lot of time, plus you need serialization tricks such a dill, which makes the serialization even slower.
Moreover, these serializers take a lot of memory (some of them even create copies of the dataset) and if they start making use of the swap memory, you're done. That process is gonna take so long that you will probably terminate it before it finishes.
Therefore, I end up with the following approach:
Iterate over the examples
Convert each example (on the fly) to a JSON-string
Write that JSON-string into a text file (one sample per
line)
When loading, add the examples to the Dataset object along with the fields
def save_examples(dataset, savepath):
with open(savepath, 'w') as f:
# Save num. elements (not really need it)
f.write(json.dumps(total)) # Write examples length
f.write("\n")
# Save elements
for pair in dataset.examples:
data = [pair.src, pair.trg]
f.write(json.dumps(data)) # Write samples
f.write("\n")
def load_examples(filename):
examples = []
with open(filename, 'r') as f:
# Read num. elements (not really need it)
total = json.loads(f.readline())
# Save elements
for i in range(total):
line = f.readline()
example = json.loads(line)
# example = data.Example().fromlist(example, fields) # Create Example obj. (you can do it here or later)
examples.append(example)
end = time.time()
print(end - start)
return examples
Then, you can simply rebuild the dataset by:
# Define fields
SRC = data.Field(...)
TRG = data.Field(...)
fields = [('src', SRC), ('trg', TRG)]
# Load examples from JSON and convert them to "Example objects"
examples = load_examples(filename)
examples = [data.Example().fromlist(d, fields) for d in examples]
# Build dataset
mydataset = Dataset(examples, fields)
The reason why I use JSON instead of pickle, dill, msgpack, etc is not arbitrary.
I did some tests and these are the results:
Dataset size: 2x (1,960,641)
Saving times:
- Pickle/Dill*: >30-45 min (...or froze my computer)
- MessagePack (iterative): 123.44 sec
100%|██████████| 1960641/1960641 [02:03<00:00, 15906.52it/s]
- JSON (iterative): 16.33 sec
100%|██████████| 1960641/1960641 [00:15<00:00, 125955.90it/s]
- JSON (bulk): 46.54 sec (memory problems)
Loading times:
- Pickle/Dill*: -
- MessagePack (iterative): 143.79 sec
100%|██████████| 1960641/1960641 [02:23<00:00, 13635.20it/s]
- JSON (iterative): 33.83 sec
100%|██████████| 1960641/1960641 [00:33<00:00, 57956.28it/s]
- JSON (bulk): 27.43 sec
*Similar approach as the other answers

Related

Is there any feasible solution to read WOT battle results .dat files?

I am new here to try to solve one of my interesting questions in World of Tanks. I heard that every battle data is reserved in the client's disk in the Wargaming.net folder because I want to make a batch of data analysis for our clan's battle performances.
image
It is said that these .dat files are a kind of json files, so I tried to use a couple of lines of Python code to read but failed.
import json
f = open('ex.dat', 'r', encoding='unicode_escape')
content = f.read()
a = json.loads(content)
print(type(a))
print(a)
f.close()
The code is very simple and obviously fails to make it. Well, could anyone tell me the truth about that?
Added on Feb. 9th, 2022
After I tried another set of codes via Jupyter Notebook, it seems like something can be shown from the .dat files
import struct
import numpy as np
import matplotlib.pyplot as plt
import io
with open('C:/Users/xukun/Desktop/br/ex.dat', 'rb') as f:
fbuff = io.BufferedReader(f)
N = len(fbuff.read())
print('byte length: ', N)
with open('C:/Users/xukun/Desktop/br/ex.dat', 'rb') as f:
data =struct.unpack('b'*N, f.read(1*N))
The result is a set of tuple but I have no idea how to deal with it now.

Here's how you can parse some parts of it.
import pickle
import zlib
file = '4402905758116487.dat'
cache_file = open(file, 'rb') # This can be improved to not keep the file opened.
# Converting pickle items from python2 to python3 you need to use the "bytes" encoding or "latin1".
legacyBattleResultVersion, brAllDataRaw = pickle.load(cache_file, encoding='bytes', errors='ignore')
arenaUniqueID, brAccount, brVehicleRaw, brOtherDataRaw = brAllDataRaw
# The data stored inside the pickled file will be a compressed pickle again.
vehicle_data = pickle.loads(zlib.decompress(brVehicleRaw), encoding='latin1')
account_data = pickle.loads(zlib.decompress(brAccount), encoding='latin1')
brCommon, brPlayersInfo, brPlayersVehicle, brPlayersResult = pickle.loads(zlib.decompress(brOtherDataRaw), encoding='latin1')
# Lastly you can print all of these and see a lot of data inside.
The response contains a mixture of more binary files as well as some data captured from the replays.
This is not a complete solution but it's a decent start to parsing these files.

First you can look at the replay file itself in a text editor. But it won't show the code at the beginning of the file that has to be cleaned out. Then there is a ton of info that you have to read in and figure out but it is the stats for each player in the game. THEN it comes to the part that has to do with the actual replay. You don't need that stuff.
You can grab the player IDs and tank IDs from WoT developer area API if you want.

After loading the pickle files like gabzo mentioned, you will see that it is simply a list of values and without knowing what the value is referring to, its hard to make sense of it. The identifiers for the values can be extracted from your game installation:
import zipfile
WOT_PKG_PATH = "Your/Game/Path/res/packages/scripts.pkg"
BATTLE_RESULTS_PATH = "scripts/common/battle_results/"
archive = zipfile.ZipFile(WOT_PKG_PATH, 'r')
for file in archive.namelist():
if file.startswith(BATTLE_RESULTS_PATH):
archive.extract(file)
You can then decompile the python files(uncompyle6) and then go through the code to see the identifiers for the values.
One thing to note is that the list of values for the main pickle objects (like brAccount from gabzo's code) always has a checksum as the first value. You can use this to check whether you have the right order and the correct identifiers for the values. The way these checksums are generated can be seen in the decompiled python files.
I have been tackling this problem for some time (albeit in Rust): https://github.com/dacite/wot-battle-results-parser/tree/main/datfile_parser.

How to extract relation members from .osm xml files

All,
I've been trying to build a website (in Django) which is to be an index of all MTB routes in the world. I'm a Pythonian so wherever I can I try to use Python.
I've successfully extracted data from the OSM API (Display relation (trail) in leaflet) but found that doing this for all MTB trails (tag: route=mtb) is too much data (processing takes very long). So I tried to do everything locally by downloading a torrent of the entire OpenStreetMap dataset (from Latest Weekly Planet XML File) and filtering for tag: route=mtb using osmfilter (part of osmctools in Ubuntu 20.04), like this:
osmfilter $unzipped_osm_planet_file --keep="route=mtb" -o=$osm_planet_dir/world_mtb_routes.osm
This produces a file of about 1.2 GB and on closer inspection seems to contain all the data I need. My goal was to transform the file into a pandas.DataFrame() so I could do some further filtering en transforming before pushing relevant aspects into my Django DB. I tried to load the file as a regular XML file using Python Pandas but this crashed the Jupyter notebook Kernel. I guess the data is too big.
My second approach was this solution: How to extract and visualize data from OSM file in Python. It worked for me, at least, I can get some of the information, like the tags of the relations in the file (and the other specified details). What I'm missing is the relation members (the ways) and then the way members (the nodes) and their latitude/longitudes. I need these to achieve what I did here: Plotting OpenStreetMap relations does not generate continuous lines
I'm open to many solutions, for example one could break the file up into many different files containing 1 relation and it's members per file, using an osmium based script. Perhaps then I can move on with pandas.read_xml(). This would be nice for batch processing en filling the Database. Loading the whole OSM XML file into a pd.DataFrame would be nice but I guess this really is a lot of data. Perhaps this can also be done on a per-relation basis with pyosmium?
Any help is appreciated.

Ok, I figured out how to get what I want (all information per relation of the type "route=mtb" stored in an accessible way), it's a multi-step process, I'll describe it here.
First, I downloaded the world file (went to wiki.openstreetmap.org/wiki/Planet.osm, opened the xml of the pbf file and downloaded the world file as .pbf (everything on Linux, and this file is referred to as $osm_planet_file below).
I converted this file to o5m using osmconvert (available in Ubuntu 20.04 by doing apt install osmctools, on the Linux cli:
osmconvert --verbose --drop-version $osm_planet_file -o=$osm_planet_dir/planet.o5m
The next step is to filter all relations of interest out of this file (in my case I wanted all MTB routes: route=mtb) and store them in a new file, like this:
osmfilter $osm_planet_dir/planet.o5m --keep="route=mtb" -o=$osm_planet_dir/world_mtb_routes.o5m
This creates a much smaller file that contains all information on the relations that are MTB routes.
From there on I switched to a Jupyter notebook and used Python3 to further divide the file into useful, manageable chunks. I first installed osmium using conda (in the env I created first but that can be skipped):
conda install -c conda-forge osmium
Then I made a recommended osm.SimpleHandle class, this class iterates through the large o5m file and while doing this it can do actions. This is the way to deal with these files because they are far to big for memory. I made the choice to iterate through the file and store everything I needed into separate json files. This does generate more than 12.000 json files but it can be done on my laptop with 8 GB of memory. This is the class:
import osmium as osm
import json
import os
data_dump_dir = '../data'
class OSMHandler(osm.SimpleHandler):
def __init__(self):
osm.SimpleHandler.__init__(self)
self.osm_data = []
def tag_inventory(self, elem, elem_type):
for tag in elem.tags:
data = dict()
data['version'] = elem.version,
data['members'] = [int(member.ref) for member in elem.members if member.type == 'w'], # filter nodes from waylist => could be a mistake
data['visible'] = elem.visible,
data['timestamp'] = str(elem.timestamp),
data['uid'] = elem.uid,
data['user'] = elem.user,
data['changeset'] = elem.changeset,
data['num_tags'] = len(elem.tags),
data['key'] = tag.k,
data['value'] = tag.v,
data['deleted'] = elem.deleted
with open(os.path.join(data_dump_dir, str(elem.id)+'.json'), 'w') as f:
json.dump(data, f)
def relation(self, r):
self.tag_inventory(r, "relation")
Run the class like this:
osmhandler = OSMHandler()
osmhandler.apply_file("../data/world_mtb_routes.o5m")
Now we have json files with the relation number as their filename and with all metadata, and a list of the ways. But we want a list of the ways and then also all the nodes per way, so we can plot the full relations (the MTB routes). To achieve this, we parse the o5m file again (using a class build on the osm.SimpleHandler class) and this time we extract all way members (the nodes), and create a dictionary:
class OSMHandler(osm.SimpleHandler):
def __init__(self):
osm.SimpleHandler.__init__(self)
self.osm_data = dict()
def tag_inventory(self, elem, elem_type):
for tag in elem.tags:
self.osm_data[int(elem.id)] = dict()
# self.osm_data[int(elem.id)]['is_closed'] = str(elem.is_closed)
self.osm_data[int(elem.id)]['nodes'] = [str(n) for n in elem.nodes]
def way(self, w):
self.tag_inventory(w, "way")
Execute the class:
osmhandler = OSMHandler()
osmhandler.apply_file("../data/world_mtb_routes.o5m")
ways = osmhandler.osm_data
This gives is dict (called ways) of all ways as keys and the node IDs (!Meaning we need some more steps!) as values.
len(ways.keys())
>>> 337597
In the next (and almost last) step we add the node IDs for all ways to our relation jsons, so they become part of the files:
all_data = dict()
for relation_file in [
os.path.join(data_dump_dir,file) for file in os.listdir(data_dump_dir) if file.endswith('.json')
]:
with open(relation_file, 'r') as f:
data = json.load(f)
if 'members' in data: # Make sure these steps are never performed twice
try:
data['ways'] = dict()
for way in data['members'][0]:
data['ways'][way] = ways[way]['nodes']
del data['members']
with open(relation_file, 'w') as f:
json.dump(data, f)
except KeyError as err:
print(err, relation_file) # Not sure why some relations give errors?
So now we have relation jsons with all ways and all ways have all node IDs, the last thing to do is to replace the node IDs with their values (latitude and longitude). I also did this in 2 steps, first I build a nodeID:lat/lon dictionary, again using an osmium.SimpleHandler based class :
import osmium
class CounterHandler(osmium.SimpleHandler):
def __init__(self):
osmium.SimpleHandler.__init__(self)
self.osm_data = dict()
def node(self, n):
self.osm_data[int(n.id)] = [n.location.lat, n.location.lon]
Execute the class:
h = CounterHandler()
h.apply_file("../data/world_mtb_routes.o5m")
nodes = h.osm_data
This gives us dict with a latitude/longitude pair for every node ID. We can use this on our json files to fill the ways with coordinates (where there are now still only node IDs), I create these final json files in a new directory (data/with_coords in my case) because if there is an error, my original (input) json file is not affected and I can try again:
import os
relation_files = [file for file in os.listdir('../data/') if file.endswith('.json')]
for relation in relation_files:
relation_file = os.path.join('../data/',relation)
relation_file_with_coords = os.path.join('../data/with_coords',relation)
with open(relation_file, 'r') as f:
data = json.load(f)
try:
for way in data['ways']:
node_coords_per_way = []
for node in data['ways'][way]:
node_coords_per_way.append(nodes[int(node)])
data['ways'][way] = node_coords_per_way
with open(relation_file_with_coords, 'w') as f:
json.dump(data, f)
except KeyError:
print(relation)
Now I have what I need and I can start adding the info to my Django database, but that is beyond the scope of this question.
Btw, there are some relations that give an error, I suspect that for some relations ways were labelled as nodes but I'm not sure. I'll update here if I find out. I also have to do this process regularly (when the world file updates, or every now and then) so I'll probably write something more concise later on, but for now this works and the steps are understandable, to me, after a lot of thinking at least.
All of the complexity comes from the fact that the data is not big enough for memory, otherwise I'd have created a pandas.DataFrame in step one and be done with it. I could also have loaded the data in a database in one go perhaps, but I'm not that good with databases, yet.

python Ray: How to write to a file

How can i construct a ray framework where each process will write it's results to a common file ? What i'm currently trying is :
import ray
import time
import pickle
import filelock
ray.init()
filename = 'data/db.pkl'
#ray.remote
def f(i):
try:
with filelock.FileLock(filename):
with open(filename, 'rb') as file:
data = pickle.load(file)
except FileNotFoundError:
data = {}
if i not in data.keys():
# The actual computations that takes times and need to be parralell: here just a square.
new_key = i
new_item = i**2
with filelock.FileLock(filename):
with open(filename, 'rb') as file:
data = pickle.load(file)
data[new_key] = new_item
with open(filename, 'wb') as file:
pickle.dump(data,file)
return None
numbers = [0,1,2,3,4,5,6,7,8,9,10]
rez = [f.remote(i) for i in numbers]
But i get an error.
How can i achieve this behavior ? I want each process to :
1° Check the database to see if it's work is needed
2° Work
3° Write it's result to the database.
Without locking the file, this work, but not all results are saved... How can i achieve the wanted behavior ? Note that later i'll need this to work on a distributed setup..

First of all, you should use 'ab' (the append mode instead of 'wb' for overwriting the file). With append mode you shouldn't need locking since it is thread-safe on a POSIX system.
What error did you get when using lock on the file?
Given that you will eventually make the program distributed, I think the easiest thing to do is to use ray.put() in your f(i) to store the data in Ray shared memory and then write the objects out from the main program.

Most efficient way to use a large data set for PyTorch?

Perhaps this question has been asked before, but I'm having trouble finding relevant info for my situation.
I'm using PyTorch to create a CNN for regression with image data. I don't have a formal, academic programming background, so many of my approaches are ad-hoc and just terribly inefficient. May times I can go back through my code and clean things up later because the inefficiency is not so drastic that performance is significantly affected. However, in this case, my method for using the image data takes a long time, uses a lot of memory, and it is done every time I want to test a change in the model.
What I've done is essentially loaded the image data into numpy arrays, saved those arrays in an .npy file, and then when I want to use said data for the model I import all of the data in that file. I don't think the data set is really THAT large, as it is comprised of 5000, 3 color channel images of size 64x64. Yet my memory usage shoots up to 70%-80% (out of 16gb) when it is being loaded, and it takes 20-30 seconds to load in every time.
My guess is that I'm being dumb about the way I'm loading it in, but frankly I'm not sure what the standard is. Should I, in some way, put the image data somewhere before I need it, or should the data be loaded directly from the image files? And in either case, what is the best, most efficient way to do that, independent of file structure?
I would really appreciate any help on this.

For speed I would advise to used HDF5 or LMDB:
Reasons to use LMDB:
LMDB uses memory-mapped files, giving much better I/O performance.
Works well with really large datasets. The HDF5 files are always read
entirely into memory, so you can’t have any HDF5 file exceed your
memory capacity. You can easily split your data into several HDF5
files though (just put several paths to h5 files in your text file).
Then again, compared to LMDB’s page caching the I/O performance won’t
be nearly as good.
[http://deepdish.io/2015/04/28/creating-lmdb-in-python/]
If you decide to used LMDB:
ml-pyxis is a tool for creating and reading deep learning datasets using LMDBs.*(I am co author of this tool)
It allows to create binary blobs (LMDB) and they can be read quite fast. The link above comes with some simple examples on how to create and read the data. Including python generators/ iteratos .
This notebook has an example on how to create a dataset and read it paralley while using pytorch.
If you decide to use HDF5:
PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data.
https://www.pytables.org/

Here is a concrete example to demonstrate what I meant. This assumes that you've already dumped the images into an hdf5 file (train_images.hdf5) using h5py.
import h5py
hf = h5py.File('train_images.hdf5', 'r')
group_key = list(hf.keys())[0]
ds = hf[group_key]
# load only one example
x = ds[0]
# load a subset, slice (n examples)
arr = ds[:n]
# should load the whole dataset into memory.
# this should be avoided
arr = ds[:]
In simple terms, ds can now be used as an iterator which gives images on the fly (i.e. it doesn't load anything in memory). This should make the whole run time blazing fast.
for idx, img in enumerate(ds):
# do something with `img`

In addition to the above answers, the following may be useful due to some recent advances (2020) in the Pytorch world.
Your question: Should I, in some way, put the image data somewhere before I need it, or should the data be loaded directly from the image files? And in either case, what is the best, most efficient way to do that, independent of file structure?
You can leave the image files in their original format (.jpg, .png, etc.) on your local disk or on the cloud storage, but with one added step - compress the directory as a tar file. Please read this for more details:
Pytorch Blog (Aug 2020): Efficient PyTorch I/O library for Large Datasets, Many Files, Many GPUs (https://pytorch.org/blog/efficient-pytorch-io-library-for-large-datasets-many-files-many-gpus/)
This package is designed for situations where the data files are too large to fit in memory for training. Therefore, you give the URL of the dataset location (local, cloud, ..) and it will bring in the data in batches and in parallel.
The only (current) requirement is that the dataset must be in a tar file format.
The tar file can be on the local disk or on the cloud. With this, you don't have to load the entire dataset into the memory every time. You can use the torch.utils.data.DataLoader to load in batches for stochastic gradient descent.

No need saving image into npy and loading all into memory. Just load a batch of image path and transform then into tensor.
The following code define the MassiveDataset, and pass it into DataLoader, everything goes well.
from torch.utils.data.dataset import Dataset
from typing import Optional, Callable
import os
import multiprocessing
def apply_transform(transform: Callable, data):
try:
if isinstance(data, (list, tuple)):
return [transform(item) for item in data]
return transform(data)
except Exception as e:
raise RuntimeError(f'applying transform {transform}: {e}')
class MassiveDataset(Dataset):
def __init__(self, filename, transform: Optional[Callable] = None):
self.offset = []
self.n_data = 0
if not os.path.exists(filename):
raise ValueError(f'filename does not exist: {filename}')
with open(filename, 'rb') as fp:
self.offset = [0]
while fp.readline():
self.offset.append(fp.tell())
self.offset = self.offset[:-1]
self.n_data = len(self.offset)
self.filename = filename
self.fd = open(filename, 'rb', buffering=0)
self.lock = multiprocessing.Lock()
self.transform = transform
def __len__(self):
return self.n_data
def __getitem__(self, index: int):
if index < 0:
index = self.n_data + index
with self.lock:
self.fd.seek(self.offset[index])
line = self.fd.readline()
data = line.decode('utf-8').strip('\n')
return apply_transform(self.transform, data) if self.transform is not None else data
NB: open file with buffering=0 and multiprocessing.Lock() are used to avoid loading bad data (usually a bit from one part of the file and a bit from the another part of the file).
additionally, if using multiprocessing in DataLoader, one could get such exception TypeError: cannot serialize '_io.BufferedReader' object. This is caused by pickle module used in multiprocessing, it cannot serialize _io.BufferedReader, but dill can. Replacing multiprocessing with multiprocess, things goes okay (major changes compare with multiprocessing, enhanced serialization is done with dill)
same thing was discussed in this issue

TensorFlow - tf.data.Dataset reading large HDF5 files

I am setting up a TensorFlow pipeline for reading large HDF5 files as input for my deep learning models. Each HDF5 file contains 100 videos of variable size length stored as a collection of compressed JPG images (to make size on disk manageable). Using tf.data.Dataset and a map to tf.py_func, reading examples from the HDF5 file using custom Python logic is quite easy. For example:
def read_examples_hdf5(filename, label):
with h5py.File(filename, 'r') as hf:
# read frames from HDF5 and decode them from JPG
return frames, label
filenames = glob.glob(os.path.join(hdf5_data_path, "*.h5"))
labels = [0]*len(filenames) # ... can we do this more elegantly?
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(
lambda filename, label: tuple(tf.py_func(
read_examples_hdf5, [filename, label], [tf.uint8, tf.int64]))
)
dataset = dataset.shuffle(1000 + 3 * BATCH_SIZE)
dataset = dataset.batch(BATCH_SIZE)
iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()
This example works, however the problem is that it seems like tf.py_func can only handle one example at a time. As my HDF5 container stores 100 examples, this limitation causes significant overhead as the files constantly need to be opened, read, closed and reopened. It would be much more efficient to read all the 100 video examples into the dataset object and then move on with the next HDF5 file (preferably in multiple threads, each thread dealing with it's own collection of HDF5 files).
So, what I would like is a number of threads running in the background, reading video frames from the HDF5 files, decode them from JPG and then feed them into the dataset object. Prior to the introduction of the tf.data.Dataset pipeline, this was quite easy using the RandomShuffleQueue and enqueue_many ops, but it seems like there is currently no elegant way of doing this (or the documentation is lacking).
Does anyone know what would be the best way of achieving my goal? I have also looked into (and implemented) the pipeline using tfrecord files, but taking a random sample of video frames stored in a tfrecord file seems quite impossible (see here). Additionally, I have looked at the from_generator() inputs for tf.data.Dataset but that is definitely not going to run in multiple threads it seems. Any suggestions are more than welcome.

I stumbled across this question while dealing with a similar issue. I came up with a solution based on using a Python generator, together with the TF dataset construction method from_generator. Because we use a generator, the HDF5 file should be opened for reading only once and kept open as long as there are entries to read. So it will not be opened, read, and then closed for every single call to get the next data element.
Generator definition
To allow the user to pass in the HDF5 filename as an argument, I generated a class that has a __call__ method since from_generator specifies that the generator has to be callable. This is the generator:
import h5py
import tensorflow as tf
class generator:
def __init__(self, file):
self.file = file
def __call__(self):
with h5py.File(self.file, 'r') as hf:
for im in hf["train_img"]:
yield im
By using a generator, the code should pick up from where it left off at each call from the last time it returned a result, instead of running everything from the beginning again. In this case it is on the next iteration of the inner for loop. So this should skip opening the file again for reading, keeping it open as long as there is data to yield. For more on generators, see this excellent Q&A.
Of course, you will have to replace anything inside the with block to match how your dataset is constructed and what outputs you want to obtain.
Usage example
ds = tf.data.Dataset.from_generator(
generator(hdf5_path),
tf.uint8,
tf.TensorShape([427,561,3]))
value = ds.make_one_shot_iterator().get_next()
# Example on how to read elements
while True:
try:
data = sess.run(value)
print(data.shape)
except tf.errors.OutOfRangeError:
print('done.')
break
Again, in my case I had stored uint8 images of height 427, width 561, and 3 color channels in my dataset, so you will need to modify these in the above call to match your use case.
Handling multiple files
I have a proposed solution for handling multiple HDF5 files. The basic idea is to construct a Dataset from the filenames as usual, and then use the interleave method to process many input files concurrently, getting samples from each of them to form a batch, for example.
The idea is as follows:
ds = tf.data.Dataset.from_tensor_slices(filenames)
# You might want to shuffle() the filenames here depending on the application
ds = ds.interleave(lambda filename: tf.data.Dataset.from_generator(
generator(filename),
tf.uint8,
tf.TensorShape([427,561,3])),
cycle_length, block_length)
What this does is open cycle_length files concurrently, and produce block_length items from each before moving to the next file - see interleave documentation for details. You can set the values here to match what is appropriate for your application: e.g., do you need to process one file at a time or several concurrently, do you only want to have a single sample at a time from each file, and so on.
Edit: for a parallel version, take a look at tf.contrib.data.parallel_interleave!
Possible caveats
Be aware of the peculiarities of using from_generator if you decide to go with the solution. For Tensorflow 1.6.0, the documentation of from_generator mentions these two notes.
It may be challenging to apply this across different environments or with distributed training:
NOTE: The current implementation of Dataset.from_generator() uses
tf.py_func and inherits the same constraints. In particular, it
requires the Dataset- and Iterator-related operations to be placed on
a device in the same process as the Python program that called
Dataset.from_generator(). The body of generator will not be serialized
in a GraphDef, and you should not use this method if you need to
serialize your model and restore it in a different environment.
Be careful if the generator depends on external state:
NOTE: If generator depends on mutable global variables or other
external state, be aware that the runtime may invoke generator
multiple times (in order to support repeating the Dataset) and at any
time between the call to Dataset.from_generator() and the production
of the first element from the generator. Mutating global variables or
external state can cause undefined behavior, and we recommend that you
explicitly cache any external state in generator before calling
Dataset.from_generator().

I took me a while to figure this out, so I thought I should record this here. Based on mikkola's answer, this is how to handle multiple files:
import h5py
import tensorflow as tf
class generator:
def __call__(self, file):
with h5py.File(file, 'r') as hf:
for im in hf["train_img"]:
yield im
ds = tf.data.Dataset.from_tensor_slices(filenames)
ds = ds.interleave(lambda filename: tf.data.Dataset.from_generator(
generator(),
tf.uint8,
tf.TensorShape([427,561,3]),
args=(filename,)),
cycle_length, block_length)
The key is you can't pass filename directly to generator, since it's a Tensor. You have to pass it through args, which tensorflow evaluates and converts it to a regular python variable.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.