Keras / Tensorflow: Predict Using tf.data.Dataset API - python

I'm using Keras with a Tensorflow backend for building a model for this problem: https://www.kaggle.com/cfpb/us-consumer-finance-complaints (just practicing).
I train my Keras model using the tf.data.Dataset API. Now, I have a Pandas DataFrame, df_testing, whose columns are complaint (strings) and label (also strings). I want to predict on these new samples. I create a tf.data.Dataset object, perform preprocessing, make an Iterator, and call predict on my model:
data = df_testing["complaint"].values
labels = df_testing["label"].values
dataset = tf.data.Dataset.from_tensor_slices((data))
dataset = dataset.map(lambda x: ({'reviews': x}))
dataset = dataset.batch(self.batch_size).repeat()
dataset = dataset.map(lambda x: self.preprocess_text(x, self.data_table))
dataset = dataset.map(lambda x: x['reviews'])
dataset = dataset.make_initializable_iterator()
My training used a tf.data.Dataset where each element was of the form ({'reviews': "movie was great"}, "positive") so I'm mimicking that here for prediction. Also, my preprocessing just turns my string into a Tensor of integers.
When I call:
preds = model.predict(dataset)
the predict call fails with:
ValueError: When using iterators as input to a model, you should specify the `steps` argument.
So I modify this call to be:
preds = model.predict(dataset, steps=3)
But now I get back:
ValueError: Please provide data as a list or tuple of 2 elements - input and target pair. Received Tensor("IteratorGetNext_2:0", shape=(?, 100), dtype=int32)
What am I doing incorrectly here? I shouldn't have to provide a tuple of 2 elements when predicting (I shouldn't need the label).
Thanks for any help you can offer!

What version of Keras are you on? I cannot find that specific error message in the code base, but I think I found where it used to be.
Here's the error in a version of the code that I think is close to the version you're running: commit
And here's the updated version of that error: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/engine/training_eager.py#L464
The conditions of the input validation have changed (in the newest version your input would be accepted), but what's relevant is that the error message is much clearer:
raise ValueError(
    'Please provide data as a list or tuple of 1, 2, or 3 elements '
    ' - `(input)`, or `(input, target)`, or `(input, target,'
    'sample_weights)`. Received %s. We do not use the `target` or'
    '`sample_weights` value here.' % inputs.output_shapes)
The target value is never used in the predict function, so it can be anything. Looking at the rest of the function, next_element[1] is never used.
[TLDR] Using your current version, add a dummy target value to the data, or update your Keras.
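For the first option, a minimal sketch against the pipeline from the question (on the older Keras): pair every input with a throwaway target before building the iterator, since predict() never reads it.
# dummy target value: predict() ignores it, but the old (input, target) check passes
dataset = dataset.map(lambda x: (x, tf.constant(0, dtype=tf.int32)))
dataset = dataset.make_initializable_iterator()
preds = model.predict(dataset, steps=3)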

The following code worked for me (tested on tensorflow 1.10.0):
[TLDR] Just pass an empty dictionary as a dummy input and specify the number of steps:
model.predict(x={}, steps=4)
Full code:
import numpy as np
import tensorflow as tf
from tensorflow.data import Dataset
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model
# dummy data:
x = np.arange(4).reshape(-1, 1).astype('float32')
y = np.arange(5, 9).reshape(-1, 1).astype('float32')
# build the Datasets
ds_x = Dataset.from_tensor_slices(x).repeat().batch(4)
it_x = ds_x.make_one_shot_iterator()
ds_y = Dataset.from_tensor_slices(y).repeat().batch(4)
it_y = ds_y.make_one_shot_iterator()
# build compile and train the model
input_vals = Input(tensor=it_x.get_next())
output = Dense(1, activation='relu')(input_vals)
model = Model(inputs=input_vals, outputs=output)
model.compile('rmsprop', 'mse', target_tensors=[it_y.get_next()])
model.fit(steps_per_epoch=1, epochs=5, verbose=2)
# infer using the dataset
model.predict(x={}, steps=4)
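Note that the input tensor is wired directly into the graph via Input(tensor=it_x.get_next()), so Keras needs no actual feed data at predict time; the empty dictionary merely satisfies the API check. With steps=4 and a batch size of 4 on a repeating dataset, this should return a numpy array of shape (16, 1).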

Related

How to apply Keras Normalization to a ParallelMapDataset without making it eager?

I am training a Tensorflow Keras CNN over images; there is too much training data to fit into memory. I've got a tf.Dataset preprocessing pipeline that reads the images from HDF5 files using a dataset.map() pipeline step. Now I'm trying to normalize the numeric image data to 0 mean and unit variance.
I'm following this example from this guide, except that I have that .map() in there:
def load_features_from_hdf5(filename):
    spec = tf.TensorSpec(feature_shape, dtype=tf.dtypes.float32, name=None)
    dataset = tfio.IODataset.from_hdf5(filename, "/features", spec=spec)  # returns a Dataset
    feature = dataset.get_single_element()
    feature.set_shape(feature_shape)
    return feature

train_x = tf.data.Dataset.from_tensor_slices(filenames).map(load_features_from_hdf5, num_parallel_calls=tf.data.AUTOTUNE)
normalizer = tf.keras.layers.Normalization(axis=None)
normalizer.adapt(train_x.take(1000))
train_x_normalized = normalizer(train_x) # <-- ValueError
adapt() successfully computes the mean and variance from the dataset. But when I try to actually apply normalization of values on the exact same dataset, it errors while trying to convert my ParallelMapDataset to an EagerTensor.
ValueError: Attempt to convert a value (<ParallelMapDataset shapes: (41, 682, 1), types: tf.float32>) with an unsupported type (<class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>) to a Tensor.
How can I get this working? Since the data is so large, I wouldn't think I want to make anything eager until training starts. Should I make the normalization an explicit pipeline step on the Dataset? Or an explicit layer on the model itself? (If the latter case, how can I bring the mean and variance values from training time to inference time in another process?)
You could try something like this:
import tensorflow as tf
# Create dummy data
train_x = tf.data.Dataset.from_tensor_slices((tf.random.normal((100, 28, 28, 3)), tf.random.normal((100, 1)))).batch(10)
normalizer = tf.keras.layers.Normalization(axis=None)
# Adapt
normalizer.adapt(train_x.map(lambda x, y: x))
# Apply to images
train_x_normalized = train_x.map(lambda x, y: (normalizer(x), y))
Example:
for x, y in train_x_normalized.take(1):
    print(tf.reduce_mean(x), tf.math.reduce_variance(x))
tf.Tensor(0.00930768, shape=(), dtype=float32) tf.Tensor(1.0023469, shape=(), dtype=float32)
Or, as you mentioned in your question, you can use the normalization layer as part of your model.
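A minimal sketch of that second option (the Flatten/Dense layers here are hypothetical placeholders around the adapted normalizer from above): since adapt() stores the mean and variance as layer weights, saving the model carries them to inference time in another process.
inputs = tf.keras.Input(shape=(28, 28, 3))
x = normalizer(inputs)                      # normalization baked into the model
x = tf.keras.layers.Flatten()(x)
outputs = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(inputs, outputs)
model.save('model_with_norm')               # adapted mean/variance travel with the saved model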

How to print out the tensor values of a specific layer

I wish to examine the values of a tensor after the mask is applied to it.
Here is a truncated part of the model. I set temp = x so that I can later print temp to check the exact values.
Consider a 4-class classification model using acoustic features. Assume the data has shape (1000, 50, 136), i.e. (batch, timesteps, features).
The objective is to check whether the model is studying the features timestep by timestep. In other words, we wish to confirm the model is learning from the slice highlighted by the red rectangle in the picture. Logically, that is how a Keras LSTM layer works, but the confusion matrix it produces changes considerably when a parameter changes (e.g. the number of Dense units). The validation accuracy stays at 45%, so we would like to inspect the model.
The proposed idea is to print out the first step of the first batch along with the input to the model. If they are the same, then the model is learning in the right way ((136, 1) features at a time) rather than (50, 1) timesteps of a single feature at a time.
input_feature = Input(shape=(X_train.shape[1],X_train.shape[2]))
x = Masking(mask_value=0)(input_feature)
temp = x
x = Dense(Dense_unit,kernel_regularizer=l2(dense_reg), activation='relu')(x)
I have tried tf.print(), which raised AttributeError: 'Tensor' object has no attribute '_datatype_enum'.
Following Get output from a non final keras model layer, as suggested by Lescurel, I tried:
model2 = Model(inputs=[input_attention, input_feature], outputs=model.get_layer('masking')).output
print(model2.predict(X_test))
AttributeError: 'Masking' object has no attribute 'op'
You want the output after the mask.
Lescurel's link in the comment shows how to do that, and so does this link to github.
You need to make a new model that takes as inputs the input from your model, and as outputs the output from the masking layer.
I tested it with some made-up code derived from your snippets.
import numpy as np
from keras import Input
from keras.layers import Masking, Dense
from keras.regularizers import l2
from keras.models import Sequential, Model
X_train = np.random.rand(4,3,2)
Dense_unit = 1
dense_reg = 0.01
mdl = Sequential()
mdl.add(Input(shape=(X_train.shape[1],X_train.shape[2]),name='input_feature'))
mdl.add(Masking(mask_value=0,name='masking'))
mdl.add(Dense(Dense_unit,kernel_regularizer=l2(dense_reg),activation='relu',name='output_feature'))
mdl.summary()
mdl2mask = Model(inputs=mdl.input,outputs=mdl.get_layer("masking").output)
maskoutput = mdl2mask.predict(X_train)
mdloutput = mdl.predict(X_train)
maskoutput # print output after/of masking
mdloutput # print output of mdl
maskoutput.shape #(4, 3, 2): masking has the shape of the layer before (input here)
mdloutput.shape #(4, 3, 1): shape of the output of dense
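One thing to keep in mind when reading maskoutput: the Masking layer zeroes out any timestep whose values all equal mask_value and passes everything else through unchanged. Since X_train here is uniform random in (0, 1), nothing gets masked and maskoutput should equal X_train; with real zero-padded data, the padded timesteps would show up as zeros.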

How do I predict on more than one batch from a Tensorflow Dataset, using .predict_on_batch?

As the question says, I can only predict from my model with model.predict_on_batch(). If I use model.predict(), Keras tries to concatenate everything together, and that doesn't work.
For my application (a sequence to sequence model) it is faster to do the grouping on the fly. But even if I had done it in Pandas and then only used the Dataset for the padded batches, wouldn't .predict() still fail?
If I can get predict_on_batch to work, then that's fine. But it only predicts on the first batch of the Dataset. How do I get predictions for the rest? I can't loop over the Dataset, I can't consume it...
Here's a smaller code example. The group is the same as the labels, but in the real world they are obviously two different things. There are 3 classes, a maximum of 2 values in a sequence, and 2 rows of data per batch. There are a lot of comments, and I nicked parts of the windowing from somewhere on StackOverflow. I hope it is fairly legible to most.
If you have any other suggestions on how to improve the code, please comment. But no, that's not what my model looks like at all. So suggestions for that part probably aren't helpful.
EDIT: Tensorflow version 2.1.0
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Bidirectional, Masking, Input, Dense, GRU
import random
import numpy as np
random.seed(100)
# input data
feature = list(range(3, 14))
# shuffle data
random.shuffle(feature)
# make label from feature data, +1 because we are padding with zero
label = [feat // 5 +1 for feat in feature]
group = label[:]
# random.shuffle(group)
max_group = 2
batch_size = 2
print('Data:')
print(*zip(group, feature, label), sep='\n')
# make dataset from data arrays
ds = tf.data.Dataset.zip((tf.data.Dataset.from_tensor_slices({'group': group, 'feature': feature}),
                          tf.data.Dataset.from_tensor_slices({'label': label})))
# group by window
ds = ds.apply(tf.data.experimental.group_by_window(
    # use group as key (you may have to use tf.reshape(x['group'], []) instead of tf.cast)
    key_func=lambda x, y: tf.cast(x['group'], tf.int64),
    # convert each window to a batch
    reduce_func=lambda _, window: window.batch(max_group),
    # use batch size as window size
    window_size=max_group))
# shuffle at most 100k rows, but commented out because we don't want to predict on shuffled data
# ds = ds.shuffle(int(1e5))
ds = ds.padded_batch(batch_size,
                     padded_shapes=({s: (None,) for s in ['group', 'feature']},
                                    {s: (None,) for s in ['label']}))
# show dataset contents
print('Result:')
for element in ds:
    print(element)
# Keras matches the name in the input to the tensor names in the first part of ds
inp = Input(shape=(None,), name='feature')
# RNNs require an additional rank, even if it is a degenerate dimension
duck = tf.expand_dims(inp, axis=-1)
rnn = GRU(32, return_sequences=True)(duck)
# again Keras matches names
out = Dense(max(label)+1, activation='softmax', name='label')(rnn)
model = Model(inputs=inp, outputs=out)
model.summary()
model.compile(loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(ds, epochs=3)
model.predict_on_batch(ds)
You can iterate over the dataset, like so, remembering what is "x" and what is "y" in typical notation:
for item in ds:
    xi, yi = item
    pi = model.predict_on_batch(xi)
    print(xi["group"].shape, pi.shape)
Of course, this predicts on each dataset element (i.e. each padded batch) individually. Otherwise you'd have to define the batches yourself, by batching matching shapes together, since the batch size itself is allowed to be variable.
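If you want all the predictions collected afterwards, here's a small sketch (using the ds and model from above): keep the per-batch outputs in a Python list rather than concatenating them, since the padded sequence length can differ from batch to batch.
all_preds = [model.predict_on_batch(xi) for xi, _ in ds]
# each entry has shape (rows_in_batch, padded_timesteps, n_classes)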

Fitting an RNN estimator in Tensorflow

I'm trying to train a TF estimator using the RNNEstimator() class, but I'm having trouble with defining the estimator. My goal is the following:
Create a tf.data.Dataset.
Feed it into the RNN estimator.
The first part seems to be working correctly. I define the parsing and input functions:
def _parse_func(record):
    # takes tf record as input and returns the following tensors
    # numeric_tensor.shape = (5,170) and y.shape=()
    return {'numerical': numeric_tensor,}, y

def input_fn(filenames=['data.tfrecord']):
    # Returns parsed tf record i.e. the tf.data.Dataset
    dataset = tf.data.TFRecordDataset(filenames=filenames)
    dataset = dataset.map(map_func=_parse_func)
    dataset = dataset.repeat()
    dataset = dataset.batch(batch_size=BATCH_SIZE)
    return dataset
Now let's move on to the meaty part.
Estimators take care of creating the session and graph. So I simply create the estimator in the following format:
# create the column
column = tf.contrib.feature_column.sequence_numeric_column('numerical')
# create the estimator
estimator = RNNEstimator(
    head=tf.contrib.estimator.regression_head(),
    sequence_feature_columns=[column],
    num_units=[32, 16], cell_type='lstm')
# train the estimator
estimator.train(input_fn=input_fn, steps=100)
However, this doesn't work. It gives me a variety of errors! In particular, at the moment I get:
TypeError: Input must be a SparseTensor.
Additionally, I seem to be unable to change the loss to log-loss. I tried setting it by passing it to the head parameter using:
head = tf.contrib.estimator.regression_head(loss_fn=tf.losses.log_loss)

Tensorflow error: "Tensor must be from the same graph as Tensor..."

I am trying to train a simple binary logistic regression classifier using Tensorflow (version 0.9.0) in a very similar way to the beginner's tutorial and am encountering the following error when fitting the model:
ValueError: Tensor("centered_bias_weight:0", shape=(1,), dtype=float32_ref) must be from the same graph as Tensor("linear_14/BiasAdd:0", shape=(?, 1), dtype=float32).
Here is my code:
import tempfile
import tensorflow as tf
import pandas as pd
# Customized training data parsing
train_data = read_train_data()
feature_names = get_feature_names(train_data)
labels = get_labels(train_data)
# Construct dataframe from training data features
x_train = pd.DataFrame(train_data , columns=feature_names)
x_train["label"] = labels
y_train = tf.constant(labels)
# Create SparseColumn for each feature (assume all feature values are integers and either 0 or 1)
feature_cols = [ tf.contrib.layers.sparse_column_with_integerized_feature(f,2) for f in feature_names ]
# Create SparseTensor for each feature based on data
categorical_cols = { f: tf.SparseTensor(indices=[[i,0] for i in range(x_train[f].size)],
                                        values=x_train[f].values,
                                        shape=[x_train[f].size,1]) for f in feature_names }
# Initialize logistic regression model
model_dir = tempfile.mkdtemp()
model = tf.contrib.learn.LinearClassifier(feature_columns=feature_cols, model_dir=model_dir)
def eval_input_fun():
    return categorical_cols, y_train
# Fit the model - similarly to the tutorial
model.fit(input_fn=eval_input_fun, steps=200)
I feel like I'm missing something critical... maybe something that was assumed in the tutorial but wasn't explicitly mentioned?
Also, I get the following warning every time I call fit():
WARNING:tensorflow:create_partitioned_variables is deprecated. Use tf.get_variable with a partitioner set, or tf.get_partitioned_variable_list, instead.
When you execute model.fit, the LinearClassifier creates a separate tf.Graph based on the Ops contained in your eval_input_fun function. But categorical_cols and y_train were defined globally in the default graph, so mixing them with Ops in the new Graph triggers the "must be from the same graph" error.
Solution: move all the Ops definitions (and their dependencies) inside eval_input_fun.
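A minimal sketch of that fix (reusing x_train, labels, and feature_names from the question): everything that creates Tensors now runs inside the input function, so the Ops are built in the same graph the estimator constructs during fit().
def eval_input_fun():
    # created here so these Ops join the estimator's graph
    y = tf.constant(labels)
    cols = {f: tf.SparseTensor(indices=[[i, 0] for i in range(x_train[f].size)],
                               values=x_train[f].values,
                               shape=[x_train[f].size, 1])
            for f in feature_names}
    return cols, y

model.fit(input_fn=eval_input_fun, steps=200)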
