Queue input data with tensorflow or skflow - python

I am training a neural net with a DataFeeder which is a bit slow (because it reads non-contiguous data from an h5 file); so the GPU stays idle (GPU-Util is at 0%) half of the time.
Is there a way, in either TensorFlow or skflow, to have multiple DataFeeders running in parallel, to avoid this bottleneck?

TensorFlow has a reader library that can read and queue data in parallel (and in C++). This should remove the bottleneck you are talking about.
We are currently (this/next week) adding support for it to tf.learn (the new name for skflow) to make it easy to use. You will still need to convert your data into one of the formats the readers support (fixed-length vectors, Example protos).
If you want to try to make it work yourself, you can create a separate DataFeeder that uses the ops from the reader library in its input_builder function and returns a no-op in get_feed_dict_fn.
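To give a rough idea of the reader-op side, here's a minimal sketch (assuming your data has already been converted to a TFRecord file of Example protos with hypothetical 'image'/'label' features and a fixed image shape):

import tensorflow as tf  # TF 1.x era, matching skflow / early tf.learn

# Hypothetical TFRecord file with raw-byte 'image' and int64 'label' features.
filename_queue = tf.train.string_input_producer(['train.tfrecord'])
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)
features = tf.parse_single_example(serialized, features={
    'image': tf.FixedLenFeature([], tf.string),
    'label': tf.FixedLenFeature([], tf.int64),
})
image = tf.reshape(tf.decode_raw(features['image'], tf.uint8), [28, 28, 1])  # shape assumed
label = features['label']

# The batch queue is filled by background threads (num_threads=4), so the GPU
# is not left waiting on slow, non-contiguous reads.
image_batch, label_batch = tf.train.shuffle_batch(
    [image, label], batch_size=128, capacity=2000, min_after_dequeue=1000,
    num_threads=4)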

Related

Loading pretrained glove on production with flask and Gunicorn

I have a model that requires some preprocessing using GloVe from Stanford. In my experience it takes at least 20-30 seconds until GloVe is loaded by this code:
glove_pd = pd.read_csv(embed_path+'/glove.6B.300d.txt', sep=" ", quoting=3, header=None, index_col=0)
glove = {key: val.values for key, val in glove_pd.T.items()}
My question is: what is the best practice to handle this in a production app? As far as I understand, every time I restart the server I need to wait 30 seconds until the endpoint is ready.
Also, I have read that when using Gunicorn, it is recommended to run with workers>1, something like this:
ExecStart=/path/to/gunicorn --workers 3 --bind unix:app.sock -m 007 wsgi:app
Does it mean that each gunicorn worker needs to load the same GloVe data into memory? If so, the server's memory requirements will be quite large; let me know if I am correct here.
Bottom line, my question is: what are the recommended methods for hosting a model that requires a pretrained embedding (GloVe/word2vec/fastText) on a production server?
At one level, if you need it in memory, and that's how long it takes to read the gigabyte-plus from disk into useful RAM structures, then yes - that's how long it takes before a process is ready to use that data. But there's room for optimizations!
For example, reading this as 1st a Pandas dataframe, then converting it to a Python dict, involves both more steps & more RAM than other options. (At the momentary peak, when both glove_pd and glove are fully constructed & referenced, you'll have two full copies in memory – and neither is as compact as would be ideal, which could cause other slowdowns, especially if the bloat pushes you into using any virtual memory at all.)
And as you fear, if 3 gunicorn workers each run the same loading code, 3 separate copies of the same data will be loaded – but there's a way to avoid this, below.
I'd suggest 1st loading the vectors into a utility class for accessing word-vectors, like the KeyedVectors interface in the Gensim library. It'll store all the vectors in one compact numpy matrix, with a dict-like interface that still returns one numpy ndarray for each individual vector.
For example, you can convert GloVe text-format vectors to a slightly-different interchange format (with an extra header line) that Gensim calls word2vec_format, after its use by the original Google word2vec.c code. In gensim-3.8.3 (the current release as of August 2020) you can do:
from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec('glove.6B.300d.txt', 'glove.6B.300d.w2vtxt')
Then, the utility-class KeyedVectors can load them like so:
from gensim.models import KeyedVectors
glove_kv = KeyedVectors.load_word2vec_format('glove.6B.300d.w2vtxt', binary=False)
(Starting in the future gensim-4.0.0 release, it should be possible to skip the conversion & just use the new no_header argument to read a GloVe text file directly: glove_kv = KeyedVectors.load_word2vec_format('glove.6B.300d.txt', binary=False, no_header=True). But this headerless format will be a little slower, as it requires two passes over the file - the 1st to learn the full size.)
Loading just once into KeyedVectors should already be faster & more-compact than your original generic two-step process. And, lookups that are analogous to what you were doing on the prior dict will be available on the glove_kv instance. (Also, there are many other convenience operations, like ranked .most_similar() lookup, that utilize efficient array library functions for speed.)
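For instance (assuming the standard glove.6B vocabulary, which is lowercased and includes these words):

vec = glove_kv['king']                        # dict-style lookup, returns a numpy ndarray
print(vec.shape)                              # (300,)
print(glove_kv.most_similar('king', topn=3))  # ranked nearest-neighbour lookup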
You can take another step, though, to minimize the parsing-on-load, and to defer loading unneeded ranges of the full set of vectors, and automatically reuse raw array data between processes.
That extra step is to re-save the vectors using the Gensim instance's .save() function, which will dump the raw vectors into a separate dense file that's suitable for memory-mapping upon the next load. So first:
glove_kv.save('glove.6B.300d.gs')
This will create more than one file which must be kept together if relocated – but the .npy file(s) saved will be the exact minimal format ready for memory-mapping.
Then, when needed later, load as:
glove_kv = KeyedVectors.load('glove.6B.300d.gs', mmap='r')
The mmap argument uses underlying OS mechanisms to simply map the relevant matrix address-space to the (read-only) file(s) on disk, so that the initial 'load' is effectively instant, but any attempt to access ranges of the matrix will use virtual-memory to page-in the right ranges of the file. It thus eliminates any scanning-for-delimiters & defers IO until absolutely needed. (And if there are any ranges you never access? They'll never be loaded.)
The other big benefit of memory-mapping is that if multiple processes each read-only memory-map the same on-disk files, the OS is smart enough to let them share any common paged-in ranges. So with, say, 3 totally-separate OS processes that each mmap the same file, you get 3X RAM savings.
(If after all these changes, the lag upon restarting server processes is still an issue – perhaps because the server processes crash or otherwise need restarting often – you could even consider using some other long-lived, stable process to initially mmap the vectors. Then, even the crash of all server processes wouldn't cause the OS to lose any paged-in ranges of the file, and the restart of the server processes might find some or all of the relevant data already in RAM. But the complication of this extra role may be superfluous once the other optimizations are in place.)
One extra caveat: if you start using KeyedVectors methods like .most_similar() that can (up through gensim-3.8.3) trigger the creation of a full-size cache of the unit-length-normalized word-vectors, you could lose the mmap benefits unless you take some extra steps to short-circuit that process. See more details in prior answer: How to speed up Gensim Word2vec model load time?
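One possible workaround (a sketch of the approach described in that linked answer, using gensim-3.8.3-era attributes): unit-normalize the vectors once before saving, then after each mmap load tell the instance its vectors are already the normed ones:

from gensim.models import KeyedVectors

glove_kv.init_sims(replace=True)   # destructive: overwrites vectors with unit-normed versions
glove_kv.save('glove.6B.300d.gs')  # re-save the already-normed vectors

# later, in each server process:
glove_kv = KeyedVectors.load('glove.6B.300d.gs', mmap='r')
glove_kv.vectors_norm = glove_kv.vectors  # so .most_similar() won't build a separate full-size cache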

MXNet parameter serialisation with numpy

I want to use a pre-trained MXNet model on s390x architecture but it doesn't seem to work. This is because the pre-trained models are in little-endian whereas s390x is big-endian. So, I'm trying to use https://numpy.org/devdocs/reference/generated/numpy.lib.format.html which works on both little-endian as well as big-endian.
One way I've found to solve this is to load the model parameters on an x86 machine, call asnumpy, and save them through numpy. Then load the parameters on the s390x machine using numpy and convert them back to MXNet. But I'm not really sure how to code it. Can anyone please help me with that?
UPDATE
It seems the question is unclear. So, I'm adding an example that better explains what I want to do in 3 steps -
Load a preexisting model from MXNet, something like this -
net = mx.gluon.model_zoo.vision.resnet18_v1(pretrained=True, ctx=mx.cpu())
Export the model. The following code saves the model parameters in a .params file. But this binary file has endian issues. So, instead of directly saving the model using the MXNet API, I want to save the parameters using numpy - https://numpy.org/devdocs/reference/generated/numpy.lib.format.html - because the numpy binary format (.npy) is endian independent. I am not sure how to convert the parameters of an MXNet model into numpy format and save them.
gluon.contrib.utils.export(net, path="./my_model")
Load the model. The following code loads the model from the .params file.
net = gluon.contrib.utils.import(symbol_file="my_model-symbol.json",
                                 param_file="my_model-0000.params",
                                 ctx='cpu')
Instead of loading using the MXNet API, I want to use numpy to load the .npy file that we created in step 2. After we have loaded the .npy file, we need to convert it back to MXNet, so I can finally use the model in MXNet.
Starting from the code snippets posted in the other question, Save/Load MXNet model parameters using NumPy:
It appears that mxnet has an option to store data internally as numpy arrays:
mx.npx.set_np(True, True)
Unfortunately, this option doesn't do what I hoped (my IPython session crashed).
The parameters are a dict of mxnet.gluon.parameter.Parameter instances, each of them containing attributes of other special datatypes. Disentangling this so that you can store it as a large number of pure numpy arrays (or a collection of them in an .npz file) is a hopeless task.
Fortunately, python has pickle to convert complex data structures into something more or less portable:
# (mxnet/resnet setup skipped)
parameters = resnet.collect_params()
import pickle
with open('foo.pkl', 'wb') as f:
    pickle.dump(parameters, f)
To restore the parameters:
with open('foo.pkl', 'rb') as f:
    parameters_loaded = pickle.load(f)
Essentially, it looks like resnet.save_parameters() as defined in mxnet/gluon/block.py gets the parameters (using _collect_params_with_prefix()) and writes them to a file using a custom write function which appears to be compiled from C (I didn't check the details).
You can save the parameters using pickle instead.
For loading, load_parameters (also in util.py) contains this code (with sanity checks removed):
for name in loaded:
    params[name]._load_init(loaded[name], ctx, cast_dtype=cast_dtype, dtype_source=dtype_source)
Here, loaded is a dict as loaded from the file. From examining the code, I don't fully grasp exactly what is being loaded - params seems to be a local variable in the function that is not used anymore. But it's worth a try to start from here, by writing a replacement for the load_parameters function. You can "monkey-patch" a function into an existing class by defining a function outside the class like this:
def my_load_parameters(self, ...):
    ...  # (put your modified implementation here)

mx.gluon.Block.load_parameters = my_load_parameters
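A rough, untested sketch of what such a replacement might look like, assuming the pickled ParameterDict from foo.pkl round-trips intact and that each Parameter's stored values can be pulled back out with .data():

import pickle
import mxnet as mx

def my_load_parameters(self, filename, ctx=mx.cpu(), cast_dtype=False,
                       dtype_source='current'):
    # Hypothetical replacement: read a pickled ParameterDict instead of the
    # native .params format, then push each value in with _load_init,
    # mirroring the loop quoted from load_parameters above.
    with open(filename, 'rb') as f:
        loaded = pickle.load(f)
    params = self.collect_params()
    for name in loaded:
        params[name]._load_init(loaded[name].data(), ctx,
                                cast_dtype=cast_dtype,
                                dtype_source=dtype_source)

mx.gluon.Block.load_parameters = my_load_parameters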
Disclaimers/warnings:
even if you get save/load via pickle to work on a single big-endian system, it's not guaranteed to work between different-endian systems. The pickle protocol itself is endian-neutral, but if floating-point values (deep inside the mxnet.gluon.parameter.Parameter) were stored as a raw data buffer in machine-endian convention, then pickle is not going to magically guess that groups of 8 bytes in the buffer need to be reversed. I think numpy arrays are endian-safe when pickled.
Pickle is not very robust if the underlying class definitions change between pickling and unpickling.
Never unpickle untrusted data.

How to convert images to TFRecords with tf.data.Dataset in most efficient way possible

I am absolutely baffled by how many unhelpful error messages I've received while trying to use this supposedly simple API to write TFRecords in a manner that doesn't take 30 minutes every time I have a new dataset.
Task:
I'd like to feed a list of image paths and a list of labels to a tf.data.Dataset, parse them in parallel to read the images and encode as tf.train.Examples, use tf.data.Dataset.shard to distribute them into different TFRecord shards (e.g. train-001-of-010.tfrecord, train-002-of-010.tfrecord, etc.), and for each shard finally write them to the corresponding file.
Since I've been debugging this for hours, I haven't gotten any single particular error to fix; otherwise I would provide it. I've struggled to find any up-to-date tutorial that doesn't either (a) come from 2017 and use queue runners, (b) use a tf.Session (I'm using TensorFlow 1.15, but the official docs keep telling me to phase out sessions), (c) conveniently do the record creation in pure Python, which makes a simple tutorial but is too slow for any actual application, or (d) use already-created TFRecords and just skip the whole process.
If necessary, I can put together an example of what I'm talking about. But since I'm getting stuck at every level of the process, at the moment it seems unhelpful.
Tldr:
If anyone has used tf.data.Dataset to create TFRecord shards in parallel, please point me in a better direction than Google has.
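For reference, here is roughly the pipeline I mean (untested sketch; TF 1.15 with eager execution enabled, placeholder paths and labels):

import tensorflow as tf  # 1.15
tf.compat.v1.enable_eager_execution()

NUM_SHARDS = 10
paths = ['images/0001.jpg', 'images/0002.jpg']   # placeholder lists
labels = [0, 1]

def serialize_example(path, label):
    # Runs eagerly inside tf.py_function, so .numpy() is available.
    image_bytes = tf.io.read_file(path).numpy()
    feature = {
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label.numpy())])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

def tf_serialize(path, label):
    out = tf.py_function(serialize_example, (path, label), tf.string)
    out.set_shape(())
    return out

ds = tf.data.Dataset.from_tensor_slices((paths, labels))
ds = ds.map(tf_serialize, num_parallel_calls=tf.data.experimental.AUTOTUNE)

for i in range(NUM_SHARDS):
    shard = ds.shard(NUM_SHARDS, i)
    writer = tf.data.experimental.TFRecordWriter(
        'train-%03d-of-%03d.tfrecord' % (i + 1, NUM_SHARDS))
    writer.write(shard)   # each pass re-reads the dataset, which is part of my worry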

Is the tf recorder way of reading data more efficient than feeding a placeholder?

I'm dealing with a huge amount of data in Tensorflow.
One way is to define placeholders and then read my data with my own functions outside of the graph (such as a queue), feeding a batch into the placeholders every time.
Another way is to use the record-reader related built-in classes in TensorFlow to directly read data as tensors.
I searched but failed to find any relevant comparison between the two. Does anyone have an idea about their advantages and disadvantages, especially regarding efficiency? Which one do you prefer when you use TensorFlow?
The different methods of reading data in TensorFlow are compared and discussed here, with more comparison here.
TFRecord allows you to read data in chunks, so you can deal with data that exceeds RAM capacity. It can also be arranged in such a way that you read data in a separate thread using tf.train.Coordinator and start_queue_runners. More information can be found here.
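A minimal illustration of that threading pattern (toy in-memory data just to show the mechanics; a real pipeline would use the TFRecord reader ops instead):

import tensorflow as tf  # TF 1.x

data = tf.constant([[float(i)] for i in range(100)])
labels = tf.constant(list(range(100)), dtype=tf.int64)

# slice_input_producer / batch build queues that are filled by background threads
x, y = tf.train.slice_input_producer([data, labels], shuffle=True)
x_batch, y_batch = tf.train.batch([x, y], batch_size=16)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        for _ in range(10):
            xb, yb = sess.run([x_batch, y_batch])  # dequeues pre-filled batches
    finally:
        coord.request_stop()
        coord.join(threads)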

I want to read a large amount of images for deep learning, but what is the solution for when there is insufficient memory?

In a deep learning program written in Python, I wanted to store a large amount of image data in a numpy array at once, and to extract batch data randomly from that array, but the image data is too large and memory runs out.
How should I deal with such cases? Do I have no choice but to do I/O processing and read image data from storage every time I retrieve a batch?
File I/O would solve the issue, but will slow down the learning process, since file I/O is a task which takes a long time.
However, you could try to implement a mixture of both using multithreading, e.g.
https://github.com/stratospark/keras-multiprocess-image-data-generator
(I do not know what kind of framework you are using).
Anyhow, back to the basic idea:
Pick some random files, read them, and start training. During training, start a second thread which will read more random files. Thus, your learning thread does not have to wait for new data, since the training step might take longer than the reading process.
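A framework-agnostic sketch of that idea (load_image, file_list, and train_on_batch are placeholders for your own code):

import threading
import queue
import random
import numpy as np

def loader(file_list, batch_size, out_queue):
    # background thread: keep reading random batches from disk
    while True:
        batch_files = random.sample(file_list, batch_size)
        batch = np.stack([load_image(f) for f in batch_files])  # load_image is your own reader
        out_queue.put(batch)  # blocks when the queue is full

prefetch = queue.Queue(maxsize=4)  # keep a few batches ready ahead of training
threading.Thread(target=loader, args=(file_list, 32, prefetch), daemon=True).start()

for step in range(10000):
    batch = prefetch.get()  # usually ready immediately thanks to the background thread
    train_on_batch(batch)   # your training step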
Some frameworks have this feature already implemented, check out:
https://github.com/fchollet/keras/issues/1627
or:
https://github.com/pytorch/examples/blob/master/mnist_hogwild/train.py
