Keras: Tokenizer with fit_generator() on text data - python

I am creating a neural net on a very large text dataset using Keras. To build the model and make sure everything was working, I read a fraction of the data into memory and used the built-in Keras 'Tokenizer' to do the necessary preprocessing, including mapping each word to a token. Then I used model.fit().
Now, I want to extend to the full dataset, and I don't have the space to read all the data into memory. So, I'd like to make a generator function to sequentially read data from disk and use model.fit_generator(). However, if I do this, then I end up fitting a separate Tokenizer object on each batch of data, which gives a different word-to-token mapping for each batch. Is there any way around this? Is there any way I can continuously build a token dictionary using Keras?

So basically you could define a text generator and feed it to the fit_on_texts method in the following manner:
Assuming that you have a texts_generator that reads your data from disk in parts and returns an iterable collection of texts, you may define:
def text_generator(texts_generator):
    for texts in texts_generator:
        for text in texts:
            yield text
Take care that this generator has to stop once it has read all of the data from disk; this may require you to change the original generator you want to use in model.fit_generator.
Once you have the generator defined above, you may simply apply tokenizer.fit_on_texts by:
tokenizer.fit_on_texts(text_generator(texts_generator))
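To make the whole flow concrete, here is a minimal sketch under some assumptions: read_texts_from_disk() and read_batches_from_disk() are hypothetical helpers for streaming your data from disk, and model is an already-compiled Keras model.

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=20000)

# First pass over the data on disk: build one global word index.
tokenizer.fit_on_texts(text_generator(read_texts_from_disk()))

def batch_generator(batch_size=128, maxlen=100):
    # Second pass: yields (x, y) batches for model.fit_generator().
    while True:  # fit_generator expects a generator that never stops
        for texts, labels in read_batches_from_disk(batch_size):
            sequences = tokenizer.texts_to_sequences(texts)
            x = pad_sequences(sequences, maxlen=maxlen)
            yield x, np.array(labels)

model.fit_generator(batch_generator(), steps_per_epoch=1000, epochs=5)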

Related

TFLite model maker custom object detector training using tfrecord

I am trying to train a custom object detector using tflite model maker (https://www.tensorflow.org/lite/tutorials/model_maker_object_detection). I want to deploy the trained tflite model to a Coral edgeTPU, and I want to use tensorflow tfrecords (multiple) as input for training the model, like the object detection API does. I tried
tflite_model_maker.object_detector.DataLoader(
    tfrecord_file_patten, size, label_map, annotations_json_file=None
)
but I am not able to get it to work. I have the following questions.
Is it possible to use tfrecords for training, as mentioned above?
Is it also possible to pass multiple CSV files for training?
For multiple CSV files, you could probably just append one file to the other. Then you'd just have to pass one csv file.
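For example, a quick way to combine them (a pandas sketch; the file names are placeholders, and header handling depends on whether your CSVs have a header row):

import pandas as pd

# Concatenate the annotation CSVs row-wise into a single file.
# ModelMaker-style annotation CSVs often have no header row, hence header=None.
parts = [pd.read_csv(f, header=None) for f in ['train_part1.csv', 'train_part2.csv']]
pd.concat(parts).to_csv('train_combined.csv', header=False, index=False)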
As for passing a tfrecord instead, this should be possible. I'm also attempting to do this, so if I get it working I'll update my post. Looking at the source, it seems from_cache is the function used internally. Following that structure, you should be able to create a DataLoader object similarly:
train_data = DataLoader(tfrecord_file_patten, meta_data['size'],
                        meta_data['label_map'], ann_json_file)
In this case, tfrecord_file_patten should be a tfrecord of your training data. You can construct the validation and test data the same way. This will work provided you're constructing your TFRecords correctly. There appears to be some inconsistency in how it's done in different places, so make sure you follow the same structure in creating the TFRecords as found in the ModelMaker source. This worked for me. One specific thing to watch out for is to use an integer for the 'image/source_id' feature in your TFExamples. If you use a string it'll throw an error.
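As a rough illustration of that last point, here is a guess at a minimal tf.train.Example builder (the feature keys follow the usual object-detection TFRecord layout; box/class features are omitted, and whether the id should be an int64 or an integer-valued string may depend on your ModelMaker version):

import tensorflow as tf

def build_example(encoded_jpeg, image_id, height, width):
    # Minimal Example; add bounding-box and class features as your dataset requires.
    feature = {
        'image/encoded': tf.train.Feature(bytes_list=tf.train.BytesList(value=[encoded_jpeg])),
        'image/height': tf.train.Feature(int64_list=tf.train.Int64List(value=[height])),
        'image/width': tf.train.Feature(int64_list=tf.train.Int64List(value=[width])),
        # Keep the source id numeric, per the note above (here: an integer-valued string).
        'image/source_id': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[str(int(image_id)).encode('utf8')])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))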

Upload custom text dataset to tensorflow model

I'm attempting to create a text-classification model with tensorflow. There are many datasets you can import into a project using tfds.load(), but I want to create a unique dataset of my own. In tensorflow.js, all I had to do was create a JSON file with training/testing data. There doesn't seem to be an easy way to do this with python.
Does anyone have experience with this?
tf.data.Dataset is the place to be. Lil' pointer: https://www.tensorflow.org/api_docs/python/tf/data/Dataset. If your dataset fits into memory, you can go with tf.data.Dataset.from_tensor_slices, which lets you create a Dataset from numpy arrays. If not, tf.data.Dataset.from_generator might suit you, as you can write your generator in plain Python. For the "correct" way to do it (this gives you the fastest pipeline in theory) you should save your data as TFRecords and read them with tf.data.TFRecordDataset. Whatever floats your boat. Just click the link!
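A small sketch of the in-memory option (the texts and labels are made-up placeholders):

import tensorflow as tf

texts = ['good movie', 'terrible plot', 'loved it', 'would not recommend']
labels = [1, 0, 1, 0]

# Build a Dataset directly from in-memory Python lists / numpy arrays.
ds = tf.data.Dataset.from_tensor_slices((texts, labels))
ds = ds.shuffle(buffer_size=len(texts)).batch(2)

for batch_texts, batch_labels in ds:
    print(batch_texts.numpy(), batch_labels.numpy())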

Does Tensorflow use only one hot encoding to store labels?

I have just started working with Tensorflow. With Caffe it was super practical to read in the data in an efficient manner, but with Tensorflow I see that I have to write the data loading process myself: creating TFRecords, the batching, the multiple threads, handling those threads, etc. So I started with an example, Inception v3, as it handles the part that reads in the data. I am new to Tensorflow and relatively new to Python, so I feel like I don't understand exactly what is going on with this part (I mean, yes, it extends the labels list with label_index repeated once per file, but why? Is it creating a one-hot encoding for the labels? Do we have to? Why doesn't it just extend by the number of files, since each file has one label?). Thanks.
labels.extend([label_index] * len(filenames))
texts.extend([text] * len(filenames))
filenames.extend(filenames)
The whole code is here: https://github.com/tensorflow/models/tree/master/research/inception
The part mentioned is under data/build_image_data.py and builds image dataset from an existing dataset as images stored under folders (where foldername is the label): https://github.com/tensorflow/models/blob/master/research/inception/inception/data/build_image_data.py
Putting together what we discussed in the comments:
You have to one-hot encode because the network architecture requires it, not because Tensorflow demands it. The network is an N-class classifier, so the final layer will have one neuron per class and you'll train the network to activate the neuron matching the class the sample belongs to. One-hot encoding the label is the first step in doing this.
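For instance, a toy sketch of the idea (not code from the Inception example):

import tensorflow as tf

num_classes = 3
labels = tf.constant([0, 2, 1])  # one integer class index per sample

# Expand each index into a vector with one entry per output neuron.
one_hot = tf.one_hot(labels, depth=num_classes)
# -> [[1., 0., 0.], [0., 0., 1.], [0., 1., 0.]]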
About the human-readable labels, the code you're referring to is located in the _find_image_files function, which in turn is used by _process_dataset to transform the dataset from a set of folders to a set of TFRecord files, which are a convenient input format for Tensorflow.
The human-readable label string is included as a feature in the Examples inside the tfrecord files as an 'extra' (probably to simplify visualization of intermediate results during training); it is not strictly necessary for the dataset and will not be used in any way in the actual optimization of the network's parameters.

TensorFlow Experiment: how to avoid loading all data in memory with input_fn?

I'm struggling with passing my (messy) code from tensorflow core to the Estimator paradigm, especially using Experiments - with learn_runner.run. But I'm actually having issues feeding data to my neural network.
What I'm trying to achieve is actually pretty close to what's done with all the examples of TensorFlow and the tf.TextLineReader, e.g. https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/census/customestimator/trainer/model.py#L297, though I load data not from a file on disk but with a web-service.
From my understanding (and looking at the code of tensorflow.python.estimator._train_model()) the input_fn is only called once and not at each iteration. I could easily load all my data, and then do something like:
def input_fn():
    data = ...  # all data in memory
    batch = tf.train.input_producer(tf.constant(data))
    return batch.dequeue_many(batch_size)
but this is not sustainable as my data won't fit in memory. I'm trying to do something like:
1. load first piece of data (say N lines)
2. consume it by batches in a queue just like the input_fn above
2'. feed this queue asynchronously with new data when it's almost empty
I know how to do it in "pure" tf, e.g. How to prefetch data using a custom python function in tensorflow or Tensorflow: custom data load + asynchronous computation but I'm finding it hard to transpose it to the Experiment paradigm as I don't have access to the session to load things by myself, nor to the graph to append operations inside.
EDIT
I managed to do it using tf.py_func(), something like:
class Reader(object):
    # a Python object that can load data and has some intelligence, not related to TF,
    # initialized with batch_size
    def read_up_to(self):
        """Reads up to batch_size elements loaded in Python."""

def input_fn():
    reader = Reader()  # instantiated once
    return tf.py_func(reader.read_up_to, inp=[], Tout=...)
It works fine, though it's a bit slower (as expected, there's a round trip from C++ execution to Python that introduces about a 50% delay). I'm trying to work around this by putting the Python data that the reader loads asynchronously into a specific TensorFlow queue, so that loading could be done without passing data from Python to C++ (just as in the two links above).
I had a similar issue, which I fixed by using a SessionRunHook. This hook (there are also others) allows you to initialize operations just after the Session is created.
tf.data.Dataset.from_generator is a dataset that calls a function of yours to generate the data one example at a time. This gives you a hook to program the generation of data however you want, such as loading in batches then yielding a single example from the batch on each call to it. This other question has an example.
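A minimal sketch of that idea (the data-loading logic here is a random placeholder standing in for the web-service calls):

import numpy as np
import tensorflow as tf

def example_generator():
    # Load a chunk at a time, then yield single (features, label) examples from it.
    for _ in range(10):  # placeholder for fetching chunks from the web service
        chunk_x = np.random.rand(32, 4).astype(np.float32)
        chunk_y = np.random.randint(0, 2, size=32).astype(np.int64)
        for x, y in zip(chunk_x, chunk_y):
            yield x, y

def input_fn():
    ds = tf.data.Dataset.from_generator(
        example_generator,
        output_types=(tf.float32, tf.int64),
        output_shapes=((4,), ()))
    return ds.batch(32).prefetch(1)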

Training tensorflow RNN with large datasets

I'm training an RNN in tensorflow. The function used is "rnn" from https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/rnn.py.
outputs, states = rnn.rnn(cell, inputs, initial_state=initial_state, sequence_length=seq_length)
The reason I use this function is that my data sequences are of variable lengths. This function expects all data to be loaded at once. Since my data doesn't fit into memory all at once, I need to load it piece by piece. Any pointers on how this can be done would be highly appreciated.
Thanks
The standard practice here is to break your data up into chunks and work on it a chunk at a time. For example, if you are working with text, you might break your data up into sentences, and pass mini-batches of 10s-100s of sentences to the training process one at a time.
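A rough sketch of that chunking idea (the file name and tokenization are placeholders; padding/bucketing of the variable-length sequences is omitted):

def sentence_chunks(path, chunk_size=100):
    # Read sentences lazily and yield mini-batches of chunk_size sentences.
    chunk = []
    with open(path) as f:
        for line in f:
            chunk.append(line.strip().split())  # naive whitespace tokenization
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:
        yield chunk

# Hypothetical training loop: feed one chunk at a time so the whole
# corpus never has to sit in memory.
# for batch in sentence_chunks('corpus.txt'):
#     train_step(batch)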
For an example of how to do this, take a look at this RNN tutorial.
https://www.tensorflow.org/versions/r0.9/tutorials/recurrent/index.html
The tutorial text itself doesn't describe chunking in detail, but take a look at the associated code in github and see how it loads its input data and batches it for training.
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/models/rnn/ptb
Hope that helps!
