Upload custom text dataset to tensorflow model

Upload custom text dataset to tensorflow model - python

I'm attempting to create a text-classification model with tensorflow. There are many datasets you can import into a project using tfds.load(), but I want to create a unique dataset of my own. In tensorflow.js, all I had to do was create a JSON file with training/testing data. There doesn't seem to be an easy way to do this with python.
Does anyone have experience with this?

ŧf.data.Dataset is the place to be. Lil' pointer: https://www.tensorflow.org/api_docs/python/tf/data/Dataset. If your dataset fits into memory, you can go with tf.data.from_tensor_slices which lets you create a Dataset from numpy arrays. If not, from_generator might suit you, as you can write your generator in plain python. For the "correct" way to do it (this gives you the fastest pipeline in theory) you should save your data as TFRecords and read them with tf.data.TFRecordDataset. Whatever floats your boat. Just click the link!

Related

TFLite model maker custom object detector training using tfrecord

I am trying to train a custom object detector using tflite model maker (https://www.tensorflow.org/lite/tutorials/model_maker_object_detection). I want to deploy trained tflite model to coral edgeTPU. I want to use tensorflow tfrecord (multiple) as input for training a model like object detection API. I tried with
tflite_model_maker.object_detector.DataLoader(
tfrecord_file_patten, size, label_map, annotations_json_file=None
) but I am not able to work around it. I have following questions.
Is it possible to tfrecord for training like mentioned above?
Is it also possible to pass multiple CSV files for training?

For multiple CSV files, you could probably just append one file to the other. Then you'd just have to pass one csv file.
As for passing a tfrecord instead, this should be possible. I'm also attempting to do this, so if I get it working I'll update my post. Looking at the source, it seems from_cache is the function internally used. Following that structure, should be able to create a DataLoader object similarly:
train_data = DataLoader(tfrecord_file_patten, meta_data['size'],
meta_data['label_map'], ann_json_file)
In this case, tfrecord_file_patten should be a tfrecord of your training data. You can construct the validation and test data the same way. This will work provided you're constructing your TFRecords correctly. There appears to be some inconsistency to how it's done in different places, so make sure you follow the same structure in creating the TFRecords as found in the ModelMaker source. This worked for me. One specific thing to watch out for is to use an integer for the 'image/source_id' feature in your TFExamples. If you use a string it'll throw an error.

Saving custom variables in Keras .h5 file

I'm developing a RNN for a project and I need to train it on a computer and be able to predict on another. The solution I found is to save the model into a .h5 file using the code below:
... # Train the data etc....
model.save("model.h5")
My problem is that I need to store some meta-data from my training dataset and pre-process and be able to load it together with the model. (e.g. name of dataset file, size of the dataset file, number of characters, etc...)
I don't want to store this information in a second file (e.g. a .txt file) because I would have to use two files. I don't want to use any additional library or framework for this task.
I was thinking (brainstorming) a code like this:
model.save("model.h5", metaData={'myVariableName': myVariable})
And to load would be:
myVariable = model.load("model.h5").getMetaData('myVariableName')
I know this is not possible in the current version and I already read Keras doc, but I couldn't find any efficient method to do that. Notice that what I'm asking is different from custom_object because want to save and load my own variables.
Is there a smarter approach to solve this problem?

What is the difference between tfrecord and bottleneck

I have been studying transfer learning with models like inception_v4 and inception_resnet_v2. Found some projects that uses bottleneck and some uses tfrecords to store the training images. When retraining the inception_v4 model with the same data using those two methods bottleneck gave 95% accuracy and tfrecord only gave 75%. But, all the new projects seems to use tfrecords for data and .ckpt format to store the model. Can someone explain me whats the difference and which one is better in which case

If you are working with large datasets, using a binary file format for storage of your data can have a significant impact on the performance of your import pipeline. Hence, it will affect your training time of the model.
By using TFRecords, it is possible to store sequence data. For e.g, a series of data. Besides, it easy to combine multiple datasets and integrates seamlessly with the data import and preprocessing functionality provided by the library.
For more information about TFrecords, please refer this link.

How to create my own dataset to train/test a convolutional neural network

So here is my question:
I want to make my very own dataset using a motion capture camera system to get the ground truth poses and one RGB camera to get images, and then using this as input to my network, train/test a convNet.
I have looked around at other datasets for tensorflow, caffe and Matlab. I have viewed the MNIST, Cats/Dogs, Iris, LSP, HumanEva, HumanEva3.6, FLIC, etc. datasets and have viewed and tried to understand their data as best as I can. I have viewed online people trying to make their own datasets. The one thing is usually when you use their datasets as an example, you download a .txt file that already contains the labels.
If anyone could please explain to me how to use the image data with the labels to feed it into my network, it would be a tremendous help. I have made code before using tensorflow to input a .txt file into the network and get the correct predicted output. But, my brain is missing something to understand how to input an image with a label. How to I create that dataset?

Your input images and your labels are two separate variables. You will be writing separate bits of code to import them. The videos typically need to be converted to JPG files (it's a royal pain to read video files directly, mostly because you can't randomly skip around the video easily).
Probably the easiest way to structure you data is via a CSV that contains filename, poseinfoA, poseinfoB, etc. And the filename refers to the JPG image on disk.
To get started on the basics, I suggest looking at the Aymericdamen tutorial examples, I haven't found tutorials anywhere that were as clear and concise.
https://github.com/aymericdamien/TensorFlow-Examples
Those examples don't go into detail on the data input pipeline though. To set up a good data input pipeline in tensorflow I suggest you use the new (as of TF 1.4) Dataset object. It will force you into a good data input pipline workflow, and it's the way all data input is going in tensorflow, so it's worth learning. It's also easy to test and debug when you write it this way. Here's the guide you want to follow.
https://www.tensorflow.org/programmers_guide/datasets
You can start your Dataset object from the CSV, and use a dataset.map_fn() to load the images using tf.image.decode_jpeg
Since you're doing pose estimation I'll also suggest a nice blog I came across recently that will probably interest you. The topic is segmentation, but pose estimation is quite related.
http://blog.qure.ai/notes/semantic-segmentation-deep-learning-review

Can I export RapidMiner model to integrate with python?

I have trained a classifier model using RapidMiner after a trying a lot of algorithms and evaluate it on my dataset.
I also export the model from RapidMiner as XML and pkl file, but I can't read it in my python program (scikit-learn).
Is there any way to import RapidMiner classifier/model in a python program and use it to predict or classify new data in my end application?

Practically, I would say no - just train your model in sklearn from the beginning if that's where you want it.
Your RapidMiner model is some kind of object. The two formats you are exporting as are just storage methods. Sklearn models are a different kind of object. You can't directly save one and load it into the other. A similar example would be to ask if you can take an airplane engine and load it into a train.
To do what you're asking, you'll need to take the underlying data that your classifier saved, find the format, and then figure out a way to get it in the same format as a sklearn classifier. This is dependent on what type of classifier you have. For example, if you're using a bayesian model, you could somehow capture the prior probabilities and then use those, but this isn't trivial.

You could use the pmml extenstion for RapidMiner to export your model.
For python there is for example the augustus library that can work with pmml files.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.