Custom input function for estimator instead of tf.data.dataset

Custom input function for estimator instead of tf.data.dataset - python

I want to know if anyone has created their own custom input function for tensorflow's estimator ? like in (link) this image:
where the say it is recommended to use tf.data.dataset. But I do not want to use that one, as I want to write my own iterator which yields data in batches and shuffles it as well.
def data_in(train_data):
data = next(train_data)
ff = list(data)
tf.enable_eager_execution()
imgs = tf.stack([tf.convert_to_tensor(np.reshape(f[0], [img_size[0], img_size[1], img_size[2]])) for f
in ff])
lbls = tf.stack([f[1] for f in ff])
print('TRAIN data: %s %s ' % (imgs.get_shape(), lbls.get_shape()))
return imgs, lbls
output: TRAIN data: (10, 32, 32, 3) (10,)
where train_data is a generator object which iterates through my dataset using iter and np.reshape(f[0], [img_size[0], img_size2, img_size2] basically reshapes the data I extract to the required dimensions and it is a batch of an entire dataset. I use stack to convert the list of tensors to convert to stacked tensors. But when I use this with estimators I get an error for the features provided to the model saying the features do not have get_shape(). When I test it without an estimator it works well and it get_shape() also works well.

Hey kvish I figured it out how to do it. I just had to add these lines
experiment = tf.contrib.learn.Experiment(
cifar_classifier,
train_input_fn=lambda: data_in(),
eval_input_fn=lambda: data_in_eval(),
train_steps=train_steps)
I know experiment is deprecated, I will also do it with estimator now :)

Related

How to load big dataset from CSV into keras

I'm trying to use Keras with TensorFlow to train a network based on the SURF features that I obtained from several images. I have all this features stored in a CSV file that has the following columns:
[ID, Code, PointX, PointY, Desc1, ..., Desc64]
The "ID" column is an autoincremental index created by pandas when I store all the values. The "Code" column is the label of the point, this would be just a number that I got by pairing the actual code (which is a string) with a number. "PointX/Y" are the coordinates of the point found in an image of a given class, and "Desc#" is the float value of the corresponding descriptor of that point.
The CSV file contains all the KeyPoints and Descriptors found in all 20.000 images. This gives me a total size of almost 60GB in disk, which I obviously can't fit into memory.
I've been trying to load batches of the file using pandas, then put all the values in a numpy array, and then fitting my model (a Sequential model of only 3 layers). I've used the following code to do so:
chunksize = 10 ** 6
for chunk in pd.read_csv("surf_kps.csv", chunksize=chunksize):
dataset_chunk = chunk.to_numpy(dtype=np.float32, copy=False)
print(dataset_chunk)
# Divide dataset in data and labels
X = dataset_chunk[:,9:]
Y = dataset_chunk[:,1]
# Train model
model.fit(x=X,y=Y,batch_size=200,epochs=20)
# Evaluate model
scores = model.evaluate(X, Y)
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
This is alright with the first chunk loaded, but when the loop gets another chunk, accuracy and loss stuck on 0.
Is it wrong the way I'm trying to load all this information?
Thanks in advance!
------ EDIT ------
Ok, now I made a simple generator like this:
def read_csv(filename):
with open(filename, 'r') as f:
for line in f.readlines():
record = line.rstrip().split(',')
features = [np.float32(n) for n in record[9:73]]
label = int(record[1])
print("features: ",type(features[0]), " ", type(label))
yield np.array(features), label
and use fit_generator with it:
tf_ds = read_csv("mini_surf_kps.csv")
model.fit_generator(tf_ds,steps_per_epoch=1000,epochs=20)
I don't know why, but I keep getting an error just before the first epoch starts:
ValueError: Error when checking input: expected dense_input to have shape (64,) but got array with shape (1,)
The first layer of the model has input_dim=64 and the shape of the features array yielded is also 64.

I think it is better to use tf.data.Dataset, this may help:
https://www.tensorflow.org/beta/tutorials/load_data/csv
Streaming large training and test files into Tensorflow's DNNClassifier

If you are using Tf 2.0, you could verify if the contents of the dataset are right.
You can simply do this by ,
print(next(iter(tf_ds)))
to see the first element of the dataset and check if it matches the input expected by the model.

Keras / Tensorflow: Predict Using tf.data.Dataset API

I'm using Keras with a Tensorflow backend for building a model for this problem: https://www.kaggle.com/cfpb/us-consumer-finance-complaints (just practicing).
I train my Keras model using the tf.data.Dataset API. Now, I have a Pandas DataFrame, df_testing, whose columns are complaint (strings) and label (also strings). I want to predict on these new samples. I create a tf.data.Dataset object, perform preprocessing, make an Iterator, and call predict on my model:
data = df_testing["complaint"].values
labels = df_testing["label"].values
dataset = tf.data.Dataset.from_tensor_slices((data))
dataset = dataset.map(lambda x: ({'reviews': x}))
dataset = dataset.batch(self.batch_size).repeat()
dataset = dataset.map(lambda x: self.preprocess_text(x, self.data_table))
dataset = dataset.map(lambda x: x['reviews'])
dataset = dataset.make_initializable_iterator()
My training used a tf.data.Dataset where each element was of the form ({'reviews': "movie was great"}, "positive") so I'm mimicking that here for prediction. Also, my preprocessing just turns my string into a Tensor of integers.
When I call:
preds = model.predict(dataset)
But I'm told my predict call fails:
ValueError: When using iterators as input to a model, you should specify the `steps` argument.
So I modify this call to be:
preds = model.predict(dataset, steps=3)
But now I get back:
ValueError: Please provide data as a list or tuple of 2 elements - input and target pair. Received Tensor("IteratorGetNext_2:0", shape=(?, 100), dtype=int32)
What am I doing incorrectly here? I shouldn't have to provide a tuple of 2 elements when predicting (I shouldn't need the label).
Thanks for any help you can offer!

What version of Keras are you on? I cannot find that specific error message in the code base, but I think I found where it used to be.
Here's the error in a version of the code that I think is close to the version you're running: commit
And here's the updated version of that error: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/engine/training_eager.py#L464
The conditions of the input validation have changed (in the newest version your input would be accepted), but what's relevant is that the error message is much more clear:
raise ValueError(
'Please provide data as a list or tuple of 1, 2, or 3 elements '
' - `(input)`, or `(input, target)`, or `(input, target,'
'sample_weights)`. Received %s. We do not use the `target` or'
'`sample_weights` value here.' % inputs.output_shapes)
The target value is never used in the predict function, and so can be anything. Looking at the rest of the function next_element[1] is never used.
[TLDR] Using your current version, add a dummy target value to the data, or update your Keras.

The following code worked for me (tested on tensorflow 1.10.0):
[TLDR] Only insert empty dictionary as a dummy input and specify the number of steps:
model.predict(x={},steps=4)
Full code:
import numpy as np
import tensorflow as tf
from tensorflow.data import Dataset
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model
# dummy data:
x = np.arange(4).reshape(-1, 1).astype('float32')
y = np.arange(5, 9).reshape(-1, 1).astype('float32')
# build the Datasets
ds_x = Dataset.from_tensor_slices(x).repeat().batch(4)
it_x = ds_x.make_one_shot_iterator()
ds_y = Dataset.from_tensor_slices(y).repeat().batch(4)
it_y = ds_y.make_one_shot_iterator()
# build compile and train the model
input_vals = Input(tensor=it_x.get_next())
output = Dense(1, activation='relu')(input_vals)
model = Model(inputs=input_vals, outputs=output)
model.compile('rmsprop', 'mse', target_tensors=[it_y.get_next()])
model.fit(steps_per_epoch=1, epochs=5, verbose=2)
# infer using the dataset
model.predict(x={},steps=4)

how to train model with batches

I trying yolo model in python.
To process the data and annotation I'm taking the data in batches.
batchsize = 50
#boxList= []
#boxArr = np.empty(shape = (0,26,5))
for i in range(0, len(box_list), batchsize):
boxList = box_list[i:i+batchsize]
imagesList = image_list[i:i+batchsize]
#to convert the annotation from VOC format
convertedBox = np.array([np.array(get_boxes_for_id(box_l)) for box_l in boxList])
#pre-process on image and annotaion
image_data, boxes = process_input_data(imagesList,max_boxes,convertedBox)
boxes = np.array(list(itertools.chain.from_iterable(boxes)))
detectors_mask, matching_true_boxes = get_detector_mask(boxes, anchors)
after this, I want to pass my data to the model to train.
when I append the list it gives memory error because of array size.
and when i append array gives dimensionality error because of shape.
how can i train the data and what shoud i use model.fit() or model.train_on_batch()

If you are using Keras to Train your model with a bunch of Images you can use Train generator and validation generator, all you have to do is put your images in there respective class folders. look at a sample code . also take a look at this link maybe it may help you https://keras.io/preprocessing/image/ . i hope i have answered your question unless i did not understand it

How to correctly map a python function and then batch the Dataset in Tensorflow

I wish to create a pipeline to provide non-standard files to the neural network (for example with extension *.xxx).
Currently I have structured my code as follows:
  1) I define a list of paths where to find training files
  2) I define an instance of the tf.data.Dataset object containing these paths
  3) I map to the Dataset a python function that takes each path and returns the associated numpy array (loaded from the folder on the pc); this array is a matrix with dimensions [256, 256, 192].
  4) I define an initializable iterator and then use it during network training.
My doubt lies in the size of the batch I provide to the network. I would like to have batches of size 64 supplied to the network. How could I do?
For example, if I use the function train_data.batch(b_size) with b_size = 1 the result is that when iterated, the iterator gives one element of shape [256, 256, 192]; what if I wanted to feed the neural net with just 64 slices of this array?
This is an extract of my code:
with tf.name_scope('data'):
train_filenames = tf.constant(list_of_files_train)
train_data = tf.data.Dataset.from_tensor_slices(train_filenames)
train_data = train_data.map(lambda filename: tf.py_func(
self._parse_xxx_data, [filename], [tf.float32]))
train_data.shuffle(buffer_size=len(list_of_files_train))
train_data.batch(b_size)
iterator = tf.data.Iterator.from_structure(train_data.output_types, train_data.output_shapes)
input_data = iterator.get_next()
train_init = iterator.make_initializer(train_data)
[...]
with tf.Session() as sess:
sess.run(train_init)
_ = sess.run([self.train_op])
Thanks in advance
----------
I posted a solution to my problem in the comments below. I would still be happy to receive any comment or suggestion on possible improvements. Thank you ;)

It's been a long time but I'll post a possible solution to batch the dataset with custom shape in TensorFlow, in case someone may need it.
The module tf.data offers the method unbatch() to unwrap the content of each dataset element. One can first unbatch and than batch again the dataset object in the desired way. Oftentimes, a good idea may also be shuffling the unbatched dataset before batching it again (so that we have random slices from random elements in each batch):
with tf.name_scope('data'):
train_filenames = tf.constant(list_of_files_train)
train_data = tf.data.Dataset.from_tensor_slices(train_filenames)
train_data = train_data.map(lambda filename: tf.py_func(
self._parse_xxx_data, [filename], [tf.float32]))
# un-batch first, then batch the data
train_data = train_data.apply(tf.data.experimental.unbatch())
train_data.shuffle(buffer_size=BSIZE)
train_data.batch(b_size)
# [...]

If I clearly understand you question, you can try to slice the array into the shape you want in your self._parse_xxx_data function.

Tensorflow: Batching whole dataset (MNIST Tutorial)

Following this tutorial: https://www.tensorflow.org/versions/r1.3/get_started/mnist/pros
I wanted to solve a classification problem with labeled images by myself. Since I'm not using the MNIST database, I spent days creating my own dataset inside tensorflow. It looks like this:
#variables
batch_size = 50
dimension = 784
stages = 10
#step 1 read Dataset
filenames = tf.constant(filenamesList)
labels = tf.constant(labelsList)
#step 2 create Dataset
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
#step 3: parse every image in the dataset using `map`
def _parse_function(filename, label):
#convert label to one-hot encoding
one_hot = tf.one_hot(label, stages)
#read image file
image_string = tf.read_file(filename)
image_decoded = tf.image.decode_image(image_string, channels=3)
image = tf.cast(image_decoded, tf.float32)
return image, one_hot
#step 4 final input tensor
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size) #batch_size = 100
iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()
images = tf.reshape(images, [batch_size,dimension]).eval()
labels = tf.reshape(labels, [batch_size,stages]).eval()
for _ in range(10):
dataset = dataset.shuffle(buffer_size = 100)
dataset = dataset.batch(batch_size)
iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()
images = tf.reshape(images, [batch_size,dimension]).eval()
labels = tf.reshape(labels, [batch_size,stages]).eval()
train_step.run(feed_dict={x: images, y_:labels})
Somehow using a higher batch_sizes will break python. What I'm trying to do is to train my neural network with new batches on each iteration. That's why Im also using dataset.shuffle(...). Using dataset.shuffle also breaks my Python.
What I wanted to do (because shuffle breaks) is to batch the whole dataset. By evaluating ('.eval()') I will get a numpy array. I will then shuffle the array with numpy.random.shuffle(images) and then pick up some the first elements to train it.
e.g.
for _ in range(1000):
images = tf.reshape(images, [batch_size,dimension]).eval()
labels = tf.reshape(labels, [batch_size,stages]).eval()
#shuffle
np.random.shuffle(images)
np.random.shuffle(labels)
train_step.run(feed_dict={x: images[0:train_size], y_:labels[0:train_size]})
But then here comes the problem that I can't batch the my whole dataset. It looks like that the data is too big for python to work with.
How should I solve this differently?
Since I'm not using the MNIST database there isn't a function like mnist.train.next_batch(100) which comes handy for me.

Notice how you call shuffle and batch inside your for loop? This is wrong. Datasets in TF work in the style of functional programming, so you are actually defining a pipeline for preprocessing the data to feed into your model. In a way, you give a recipe that answers the question "given this raw data, which operations (map, etc.) should I do to get batches that I can feed into my neural network?"
Now you are modifying that pipeline for every batch! What happens is that the first iteration, the batch size is, say [32 3600]. The next iteration, the elements of this shape are batched again, to [32 32 3600], and so on.
There's a great tutorial on the TF website where you can find out more how Datasets work, but here are a few suggestions how you can resolve your problem.
Move the shuffling to right after "Step 2" in your code. Then you are shuffling the whole dataset so your batches will have a good mixture of examples. Also increase the buffer_size argument, this works in a different way than you probably assume. It's usually a good idea to shuffle as early as possible, as it can be a slow operation if you have a large dataset -- the shuffled part of dataset will have to be read into memory. Here it does not really matter whether you shuffle the filenames and labels, or the read images and labels -- but the latter will have more work to do since the dataset is larger by that time.
Move batching and the iterator generator to be the last steps, just before starting your training loop.
Don't use feed_dict with Dataset iterators to input data into your model. Instead, define your model in terms of the outputs of iterator.get_next() and omit the feed_dict argument. See more details from this Q&A: Tensorflow: create minibatch from numpy array > 2 GB

Ive been getting through a lot of problems with creating tensorflow datasets. So I decided to use OpenCV to import images.
import opencv as cv
imgDataset = []
for i in range(len(files)):
imgDataset.append(cv2.imread(files[i]))
imgDataset = np.asarray(imgDataset)
the shape of imgDataset is (num_img, height, width, col_channels). Getting the i-th image should be imgDataset[i].
shuffling the dataset and getting only batches of it can be done like this:
from sklearn.utils import shuffle
X,y = shuffle(X, y)
X_feed = X[batch_size]
y_feed = y[batch_size]
Then you feed X_feed and y_feed into your model

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.