TypeError when training a TensorFlow Random Forest using TensorForestEstimator - python

I get a TypeError when attempting to train a TensorFlow Random Forest using TensorForestEstimator.
TypeError: Input 'input_data' of 'CountExtremelyRandomStats' Op has type float64 that does not match expected type of float32.
I've tried using Python 2.7 and Python 3, and I've tried using tf.cast() to put everything in float32, but it doesn't help. I have checked the data type on execution and it's float32. The problem doesn't seem to be the data I provide (a CSV of all floats), so I'm not sure where to go from here.
Any suggestions of things I can try would be much appreciated.
Code:
# Build an estimator.
def build_estimator(model_dir):
    params = tensor_forest.ForestHParams(
        num_classes=2, num_features=760,
        num_trees=FLAGS.num_trees, max_nodes=FLAGS.max_nodes)
    graph_builder_class = tensor_forest.RandomForestGraphs
    if FLAGS.use_training_loss:
        graph_builder_class = tensor_forest.TrainingLossForest
    # Use the SKCompat wrapper, which gives us a convenient way to split in-memory data into batches.
    return estimator.SKCompat(random_forest.TensorForestEstimator(
        params, graph_builder_class=graph_builder_class, model_dir=model_dir))
# Train and evaluate the model.
def train_and_eval():
    # load datasets
    training_set = pd.read_csv('/Users/carl/Dropbox/Docs/Python/randomforest_balanced_train.csv', dtype=np.float32, header=None)
    test_set = pd.read_csv('/Users/carl/Dropbox/Docs/Python/randomforest_balanced_test.csv', dtype=np.float32, header=None)
    print('###########')
    print(training_set.loc[:,1].dtype)  # this prints float32
    # load labels
    training_labels = pd.read_csv('/Users/carl/Dropbox/Docs/Python/randomforest_balanced_train_class.csv', dtype=np.int32, names=LABEL, header=None)
    test_labels = pd.read_csv('/Users/carl/Dropbox/Docs/Python/randomforest_balanced_test_class.csv', dtype=np.int32, names=LABEL, header=None)
    # define the path where the model will be stored - default is current directory
    model_dir = tempfile.mkdtemp() if not FLAGS.model_dir else FLAGS.model_dir
    print('model directory = %s' % model_dir)
    # build the random forest estimator
    est = build_estimator(model_dir)
    tf.cast(training_set, tf.float32)  # error occurs with/without casts
    tf.cast(test_set, tf.float32)
    # train the forest to fit the training data
    est.fit(x=training_set, y=training_labels)  # this line throws the error

You are using tf.cast in an incorrect manner:
tf.cast(training_set, tf.float32)  # error occurs with/without casts
should be
training_set = tf.cast(training_set, tf.float32)
tf.cast is not an in-place method; it is a TensorFlow op like any other, and its result needs to be assigned and run.
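Note also that est.fit here receives in-memory pandas data through the SKCompat wrapper rather than graph tensors, so another route worth trying (a sketch, not a confirmed fix for this error) is to cast the underlying NumPy arrays before fitting:
import numpy as np
# Sketch: cast the in-memory data itself; an unassigned tf.cast op has no
# effect on the NumPy/pandas data that SKCompat.fit actually consumes.
x_train = training_set.values.astype(np.float32)
y_train = training_labels.values.astype(np.int32)
est.fit(x=x_train, y=y_train)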

Related

Tensorflow Data API: repeat()

The following code is a snippet from "Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow".
I understand everything in the following code except the chaining of the .repeat(repeat) function in the second line.
I know that repeat repeats the dataset elements (in this case, the file paths), and that if the argument is set to None or left empty the repetition continues forever, until whatever consumes the dataset decides when to stop.
As you can see in the code below, the author sets the repeat() argument to None.
1 - Basically, I want to know why the author decided to do this.
2 - Or is it because the code is trying to simulate a situation where the dataset does not fit in memory? If that is the case, then in a real situation we should avoid repeat(), am I correct?
def csv_reader_dataset(filepaths, repeat=1, n_readers=5,
                       n_read_threads=None, shuffle_buffer_size=10000,
                       n_parse_threads=5, batch_size=32):
    dataset = tf.data.Dataset.list_files(filepaths, seed=42).repeat(repeat)
    dataset = dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
        cycle_length=n_readers, num_parallel_calls=n_read_threads)
    dataset = dataset.shuffle(shuffle_buffer_size)
    dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)
    dataset = dataset.batch(batch_size)
    return dataset.prefetch(1)

train_set = csv_reader_dataset(train_filepaths, repeat=None)
valid_set = csv_reader_dataset(valid_filepaths)
test_set = csv_reader_dataset(test_filepaths)
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)
model = keras.models.Sequential([
    keras.layers.InputLayer(input_shape=X_train.shape[-1:]),
    keras.layers.Dense(30, activation='relu'),
    keras.layers.Dense(1)
])
m_loss = keras.losses.mean_squared_error
m_optimizer = keras.optimizers.SGD(lr=1e-3)
batch_size = 32
model.compile(loss=m_loss, optimizer=m_optimizer, metrics=['accuracy'])
model.fit(train_set, steps_per_epoch=len(X_train) // batch_size, epochs=10, validation_data=valid_set)
For your questions, I think:
The tf.data API does not easily lead to out-of-memory problems, since it loads data lazily from the file paths or TFRecords (compressed mode). So repeat() has nothing to do with memory here; it is part of the data transformation.
You have to use repeat(#) when setting steps_per_epoch. Say your batch_size = 32 and steps_per_epoch = 100 // 32 = 3; that requires 3 * 32 = 96 samples per epoch, but your data has only 80 samples. Then you have to use data.repeat(2) to get 160 samples in total, so that the 80 samples of the first pass plus the first 16 samples of the second pass cover one epoch. This prevents the "input ran out of data" error.
I posted a copy of the same question on the book author's GitHub repo.
The issue is clarified; it was due to a bug in Keras 2.0.
Read more on: https://github.com/ageron/handson-ml2/issues/407
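To make the interplay between repeat(None) and steps_per_epoch concrete, here is a minimal sketch on a hypothetical toy dataset (TF 2.x, not the book's code): an infinitely repeated dataset never signals "end of data", so steps_per_epoch is what delimits each epoch.
import tensorflow as tf

toy = tf.data.Dataset.range(80)        # stand-in for 80 training samples
batched = toy.repeat(None).batch(32)   # cycles forever, never runs out of data
steps_per_epoch = 80 // 32             # Keras would draw exactly this many batches per epoch
for i, batch in enumerate(batched.take(steps_per_epoch)):
    print(i, batch.numpy())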

Converting tensorflow 2 estimator to tf.lite

I am trying to convert a LinearClassifier estimator into TFLite.
However, the code is throwing an error and I am not able to understand where I am going wrong.
Here is my code:
import pandas as pd
import tensorflow as tf
dftrain = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')
dfeval = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/eval.csv')
y_train = dftrain.pop('survived')
y_eval = dfeval.pop('survived')
# create feature columns. For testing I am using only numeric ones
NUMERIC_COLUMNS = ['age', 'fare']
feature_columns = []
for feature_name in NUMERIC_COLUMNS:
    feature_columns.append(tf.feature_column.numeric_column(feature_name,
                                                            dtype=tf.float32))
# Use entire batch since this is such a small dataset.
NUM_EXAMPLES = len(y_train)
def make_input_fn(X, y, n_epochs=None, shuffle=True):
    def input_fn():
        dataset = tf.data.Dataset.from_tensor_slices((dict(X), y))
        if shuffle:
            dataset = dataset.shuffle(NUM_EXAMPLES)
        # For training, cycle thru dataset as many times as needed (n_epochs=None).
        dataset = dataset.repeat(n_epochs)
        # In memory training doesn't use batching.
        dataset = dataset.batch(NUM_EXAMPLES)
        return dataset
    return input_fn
# Training and evaluation input functions.
train_input_fn = make_input_fn(dftrain[NUMERIC_COLUMNS], y_train)
eval_input_fn = make_input_fn(dfeval[NUMERIC_COLUMNS], y_eval, shuffle=False, n_epochs=1)
linear_est = tf.estimator.LinearClassifier(feature_columns)
# Train model.
linear_est.train(train_input_fn, max_steps=100)
# Evaluation.
result = linear_est.evaluate(eval_input_fn)
So the model is working fine.
print(pd.Series(result))
accuracy 0.659091
accuracy_baseline 0.625000
auc 0.667095
auc_precision_recall 0.589936
average_loss 0.619227
label/mean 0.375000
loss 0.619227
precision 0.764706
prediction/mean 0.336755
recall 0.131313
global_step 100.000000
dtype: float64
Now the saving part:
serving_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(
    tf.feature_column.make_parse_example_spec(feature_columns))
model_dir = 'model_data'
path = linear_est.export_saved_model(model_dir, serving_input_fn)
When I use:
converter = tf.lite.TFLiteConverter.from_saved_model(path)
tflite_model = converter.convert()
it throws this error:
ValueError: This converter can only convert a single ConcreteFunction. Converting multiple functions is under development.
I have also tried:
saved_model_obj = tf.saved_model.load(export_dir=path)
concrete_func = saved_model_obj.signatures['serving_default']
converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete_func])
tflite_model = converter.convert()
And the error is:
ConverterError: See console for info.
2020-02-18 16:23:15.446583: I tensorflow/lite/toco/import_tensorflow.cc:193] Unsupported data type in placeholder op: 20
2020-02-18 16:23:15.446687: F tensorflow/lite/toco/import_tensorflow.cc:2706] Check failed: status.ok() Input_content string_val doesn't have the right dimensions for this string tensor
(while processing node 'head/AsString')
Fatal Python error: Aborted
Please help.
Please refer to the following info:
"Some of the operators in the model are not supported by the standard TensorFlow Lite runtime. If those are native TensorFlow operators, you might be able to use the extended runtime by passing --enable_select_tf_ops, or by setting target_ops=TFLITE_BUILTINS,SELECT_TF_OPS when calling tf.lite.TFLiteConverter(). Otherwise, if you have a custom implementation for them you can disable this error with --allow_custom_ops, or by setting allow_custom_ops=True when calling tf.lite.TFLiteConverter(). Here is a list of builtin operators you are using: ADD, ADD_N, ARG_MAX, EXPAND_DIMS, FULLY_CONNECTED, RESHAPE, SOFTMAX, TILE. Here is a list of operators for which you will need custom implementations: AsString."
For me, this problem was solved after adding the option --allow_custom_ops:
tflite_convert --enable_v1_converter --allow_custom_ops --output_file=xxx --saved_model_dir=xxx --saved_model_signature_key='predict'
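Roughly the same thing through the Python API could look like the sketch below (TF 2.x; path is the directory returned by export_saved_model, and the 'predict' signature key mirrors the CLI flag above, so adjust as needed):
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model(path, signature_keys=['predict'])
converter.allow_custom_ops = True                     # analogous to --allow_custom_ops
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,   # standard TFLite ops
    tf.lite.OpsSet.SELECT_TF_OPS,     # fall back to select TensorFlow ops
]
tflite_model = converter.convert()
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)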

Keras model.fit() with tf.dataset fails while using tf.train works fine

Summary: according to the documentation, Keras model.fit() should accept a tf.data Dataset as input (I am using TF version 1.12.0). I can train my model if I do the training steps manually, but using model.fit() on the same model I get an error I cannot resolve.
Here is a sketch of what I did: my dataset, which is too big to fit in memory, consists of many files, each with a different number of rows of (100 features, label). I'd like to use tf.data to build my data pipeline:
def data_loader(filename):
    '''load a single data file with many rows'''
    features, labels = load_hdf5(filename)
    ...
    return features, labels

def make_dataset(filenames, batch_size):
    '''read files one by one, pick individual rows, batch them and repeat'''
    dataset = tf.data.Dataset.from_tensor_slices(filenames)
    dataset = dataset.map(  # Problem here! See edit for solution
        lambda filename: tuple(tf.py_func(data_loader, [filename], [tf.float32, tf.float32])))
    dataset = dataset.flat_map(
        lambda features, labels: tf.data.Dataset.from_tensor_slices((features, labels)))
    dataset = dataset.batch(batch_size)
    dataset = dataset.repeat()
    dataset = dataset.prefetch(1000)
    return dataset

_BATCH_SIZE = 128
training_set = make_dataset(training_files, batch_size=_BATCH_SIZE)
I'd like to try a very basic logistic regression model:
inputs = tf.keras.layers.Input(shape=(100,))
outputs = tf.keras.layers.Dense(1, activation='softmax')(inputs)
model = tf.keras.Model(inputs, outputs)
If I train it manually everything works fine, e.g.:
labels = tf.placeholder(tf.float32)
loss = tf.reduce_mean(tf.keras.backend.categorical_crossentropy(labels, outputs))
train_step = tf.train.GradientDescentOptimizer(.05).minimize(loss)
iterator = training_set.make_one_shot_iterator()
next_element = iterator.get_next()
init_op = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init_op)
    for i in range(training_size // _BATCH_SIZE):
        x, y = sess.run(next_element)
        train_step.run(feed_dict={inputs: x, labels: y})
However, if I instead try to use model.fit like this:
model.compile('adam', 'categorical_crossentropy', metrics=['acc'])
model.fit(training_set.make_one_shot_iterator(),
          steps_per_epoch=training_size // _BATCH_SIZE,
          epochs=1,
          verbose=1)
I get the error message ValueError: Cannot take the length of Shape with unknown rank. inside Keras's _standardize_user_data function.
I have tried quite a few things but could not resolve the issue. Any ideas?
Edit: based on #kvish's answer, the solution was to change the map from a lambda to a function that specifies the correct tensor dimensions, e.g.:
def data_loader(filename):
    def loader_impl(filename):
        features, labels, _ = load_hdf5(filename)
        ...
        return features, labels
    features, labels = tf.py_func(loader_impl, [filename], [tf.float32, tf.float32])
    features.set_shape((None, 100))
    labels.set_shape((None, 1))
    return features, labels
and now all that is needed is to call this function from map:
dataset = dataset.map(data_loader)
tf.py_func probably produces a tensor with an unknown shape, which Keras cannot infer. We can set the shape of the tensor it returns using the set_shape(your_shape) method, and that helps Keras infer the shape of the result.
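As a small standalone illustration (TF 1.x, with a hypothetical toy loader standing in for load_hdf5), the tensors coming out of tf.py_func have no static shape at all until set_shape declares one:
import numpy as np
import tensorflow as tf

def toy_loader(_):
    # stand-in for load_hdf5: 8 rows of 100 features plus a label column
    return np.zeros((8, 100), np.float32), np.zeros((8, 1), np.float32)

filename = tf.constant('dummy.h5')
features, labels = tf.py_func(toy_loader, [filename], [tf.float32, tf.float32])
print(features.shape)             # <unknown> -- Keras cannot infer even the rank
features.set_shape((None, 100))
labels.set_shape((None, 1))
print(features.shape)             # (?, 100) -- rank and feature count now known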

Stuck trying to make this tensorflow script work

I'm trying to move my scikit-learn Python script to TensorFlow code, but I keep getting stuck with errors. Please help!
import pandas as pd
import numpy as np
import tensorflow as tf
# read csv
df = pd.read_csv("/Downloads/iris-2.csv", header=0)
# get header names as array
features = list(df.columns.values)
label = features.pop()
classes = len(df[label].unique())
# encode target
X = df[features]
y = df[label]
# convert feature headers into tf
for index, value in enumerate(features):
    features[index] = tf.feature_column.numeric_column(value)
# initialize classifier
classifier = tf.estimator.DNNClassifier(
    feature_columns=features,
    hidden_units=[10, 10],
    n_classes=classes)
# train the classifier
dataset = tf.data.Dataset.from_tensor_slices((dict(X), y))
dataset = dataset.shuffle(1000).repeat().batch(0)
data = dataset.make_one_shot_iterator().get_next()
classifier.train(input_fn=lambda: data, steps=3)
predictions = classifier.predict([5.1, 3.0, 4.2, 1.2])
print(predictions)
The latest error I'm stuck on is:
ValueError: Passed Tensor("dnn/head/weighted_loss/Sum:0", shape=(), dtype=float32) should have graph attribute that is equal to current graph <tensorflow.python.framework.ops.Graph object at 0x10dd9a190>.
Here's the dataset I'm using: https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/d546eaee765268bf2f487608c537c05e22e4b221/iris.csv
The input tensors (the variables data and dataset) cannot be precomputed. They need to be created inside the function passed as input_fn in the call to train(), so that the tensors are in the graph that the Estimator (classifier) creates during the call to train(). So for your last block you could use:
# train the classifier
def my_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((dict(X), y))
    dataset = dataset.shuffle(1000).repeat().batch(0)
    return dataset.make_one_shot_iterator().get_next()
classifier.train(input_fn=my_input_fn, steps=3)
predictions = classifier.predict([5.1, 3.0, 4.2, 1.2])
print(predictions)

How to use the iterator 'make_initializable_iterator' of tensorflow within a 'input_fn'?

I want to train my model with a tf.estimator.Estimator and load my data with the Dataset API. Because my data, for example 'mnist', is an array (tensor), I try to load it with tf.data.Dataset.from_tensor_slices. But I don't know how to initialize a make_initializable_iterator within an input_fn.
I can use make_one_shot_iterator to train successfully, but it loads slowly before training. "Higher-Level APIs in TensorFlow" is a good example of using make_initializable_iterator within an input_fn, but it needs to return an iterator_initializer_hook from the input_fn to another function. I want to know whether there is a better or more elegant way.
def input_fn():
    mnist_data = input_data.read_data_sets('mnist_data', one_hot=False)
    images = mnist_data.train.images.reshape([-1, 28, 28, 1])
    labels = np.asarray(mnist_data.train.labels, dtype=np.int64)
    # Build dataset iterator
    dataset = tf.data.Dataset.from_tensor_slices((images, labels))
    dataset = dataset.repeat(None)  # Infinite iterations
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.batch(100)
    iterator = dataset.make_one_shot_iterator()
    next_example = iterator.get_next()
    # Set runhook to initialize iterator
    return next_example
In TensorFlow version 1.5 and later, the tf.estimator.Estimator will automatically create and initialize an initializable iterator when you return a tf.data.Dataset from your input_fn. This enables you to write the following code, without having to worry about initialization or hooks:
def input_fn():
    mnist_data = input_data.read_data_sets('mnist_data', one_hot=False)
    images = mnist_data.train.images.reshape([-1, 28, 28, 1])
    labels = np.asarray(mnist_data.train.labels, dtype=np.int64)
    # Build dataset.
    dataset = tf.data.Dataset.from_tensor_slices((images, labels))
    dataset = dataset.repeat(None)  # Infinite iterations
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.batch(100)
    return dataset
Alternatively, if you manage the iterator initialization yourself with a hook, inside your code add this:
self.hooks.append(utils_hooks.DatasetHook(iter))
In run_loop.py, before the call into your fn, add this:
for hook in dataset_hooks:
    sess.run(hook.iterator().initializer)
Then, it should be fine.
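For reference, utils_hooks.DatasetHook is not a TensorFlow API; a hedged sketch of the kind of SessionRunHook this pattern implies (initializing the iterator once the session exists) might look like:
import tensorflow as tf

class IteratorInitializerHook(tf.train.SessionRunHook):
    """Runs the iterator's initializer right after the session is created."""
    def __init__(self):
        # input_fn assigns e.g. lambda sess: sess.run(iterator.initializer)
        self.iterator_initializer_func = None

    def after_create_session(self, session, coord):
        self.iterator_initializer_func(session)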
