I have x_train and y_train numpy arrays, each over 2GB. I want to train a model using the tf.estimator API, but I am getting the error:
ValueError: Cannot create a tensor proto whose content is larger than 2GB
I am passing the data using:
def input_fn(features, labels=None, batch_size=None,
             shuffle=False, repeats=False):
    if labels is not None:
        inputs = (features, labels)
    else:
        inputs = features
    dataset = tf.data.Dataset.from_tensor_slices(inputs)
    if shuffle:
        dataset = dataset.shuffle(shuffle)
    if batch_size:
        dataset = dataset.batch(batch_size)
    if repeats:
        # if False, evaluate after each epoch
        dataset = dataset.repeat(repeats)
    return dataset
train_spec = tf.estimator.TrainSpec(
    lambda: input_fn(x_train, y_train,
                     batch_size=BATCH_SIZE, shuffle=50),
    max_steps=EPOCHS
)
eval_spec = tf.estimator.EvalSpec(lambda: input_fn(x_dev, y_dev))
tf.estimator.train_and_evaluate(model, train_spec, eval_spec)
The tf.data documentation mentions this error and provides a solution using the traditional TensorFlow API with placeholders. Unfortunately, I don't know how that approach could be translated to the tf.estimator API.
The solution that worked for me was using
tf.estimator.inputs.numpy_input_fn(x_train, y_train, num_epochs=EPOCHS,
                                   batch_size=BATCH_SIZE, shuffle=True)
instead of input_fn. The only problem is that tf.estimator.inputs.numpy_input_fn raises deprecation warnings, so unfortunately this will stop working as well.
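If you want to avoid the deprecated helper, one non-deprecated way around the 2GB graph limit is to stream the arrays through tf.data.Dataset.from_generator, so the data is never embedded in the graph as a constant. A minimal sketch, assuming x_train is float32 and y_train is int32 (adjust the dtypes and shapes to your data; generator_input_fn is just an illustrative name):

import tensorflow as tf

def generator_input_fn(features, labels, batch_size,
                       shuffle_buffer=None, repeats=None):
    # yield one example at a time so the arrays are fed, not baked into the graph
    def gen():
        for x, y in zip(features, labels):
            yield x, y

    dataset = tf.data.Dataset.from_generator(
        gen,
        output_types=(tf.float32, tf.int32),              # assumed dtypes
        output_shapes=(features.shape[1:], labels.shape[1:]))
    if shuffle_buffer:
        dataset = dataset.shuffle(shuffle_buffer)
    if repeats:
        dataset = dataset.repeat(repeats)
    return dataset.batch(batch_size)

train_spec = tf.estimator.TrainSpec(
    lambda: generator_input_fn(x_train, y_train,
                               batch_size=BATCH_SIZE, shuffle_buffer=50),
    max_steps=EPOCHS
)

Streaming per example is slower than slicing in-memory tensors, but it sidesteps the tensor proto size limit entirely.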
I have trained a ResNet50 using Keras for classification. For testing, I used the ImageDataGenerator flow_from_directory() method to pass input to the model. Here's the code for that:
testdata_generator = keras.preprocessing.image.ImageDataGenerator(
    preprocessing_function=tf.keras.applications.resnet.preprocess_input
)
testgen = testdata_generator.flow_from_directory(
    './test',
    shuffle=False,
    target_size=(224, 224),
    color_mode='rgb',
    batch_size=32,
    class_mode=None
)
Found 18223 images belonging to 1 classes.
However when I test the model on the test images, it doesn't predict for a few images.
pred = model.predict(
    testgen,
    batch_size=32,
    steps=testgen.n // testgen.batch_size
)
print(len(pred))
18208
Can anyone help?
You should try removing steps=testgen.n//testgen.batch_size. Integer division drops the remainder, so 18223 // 32 = 569 full batches = 18208 images, and the 15 images left over in the final partial batch are never predicted. If you omit steps, Keras runs the generator until it is exhausted, including the last partial batch.
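A minimal sketch of the two ways to cover all samples (reusing model and testgen from the question; the second option keeps an explicit step count but rounds up with math.ceil instead of down):

import math

# Option 1: let Keras exhaust the generator, including the partial last batch
pred = model.predict(testgen)

# Option 2: keep an explicit step count, rounded up instead of down
steps = math.ceil(testgen.n / testgen.batch_size)   # 570 steps for 18223 images
pred = model.predict(testgen, steps=steps)

print(len(pred))  # should now be 18223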
I would like to train a Keras model using sample weights. My data source is of type tf.data.Dataset. I get the following error when using the sample_weight argument of model.fit:
ValueError: `sample_weight` argument is not supported when using dataset as input.
The code looks like:
model.fit(tf_train_dataset,
          epochs=epochs,
          verbose=self.verbose,
          batch_size=batch_size,
          callbacks=callbacks,
          sample_weight=sample_weights,
          steps_per_epoch=self.steps_per_epoch,
          use_multiprocessing=True)
tf_train_dataset is created by tf.data.Dataset.from_generator. How can I pass a weight for each sample so that it is applied to the loss during training?
When using the tf.data.Dataset API, sample weights should be provided as a third element of each dataset tuple, in the order (input_batch, label_batch, sample_weight_batch).
Dummy example:
import numpy as np
import tensorflow as tf
from sklearn.utils.class_weight import compute_sample_weight
x_train = np.random.randn(100,2)
y_train = np.random.randint(low=0, high=5, size=100, dtype=np.int32)
weights = compute_sample_weight(class_weight='balanced', y=y_train)
train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train, weights))
For more details, you can refer to the docs.
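To finish the picture, a minimal sketch of feeding that weighted dataset to model.fit, continuing the dummy example above (the model here is just a placeholder; any Keras model with a matching input shape would do):

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(2,)),
    tf.keras.layers.Dense(5, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Keras treats the third tuple element as per-sample weights automatically,
# so no sample_weight argument is passed to fit()
model.fit(train_data.batch(32), epochs=5)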
I can use "categorical_crossentropy" as loss function without error but when I replace it with "mse" this error raises:
Error when checking target: expected dense_2 to have shape (2,) but got array with shape (1,)
If I use the following method
labels = np_utils.to_categorical(labels, num_classes = 2)
another error is raised:
Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead.
The question is how can I use "mse" with cross_val_score() function?
Here is the GitHub link, and this is the troublesome code:
model = KerasClassifier(build_fn=customXceptionBuild, epochs=epochs, batch_size=batch_size)
kfold = StratifiedKFold(n_splits=folds, shuffle=True, random_state=random_state)

def classification_report_with_accuracy_score(y_true, y_pred):
    originalclass.extend(y_true)
    predictedclass.extend(y_pred)
    return accuracy_score(y_true, y_pred)  # return accuracy score

scoring = make_scorer(classification_report_with_accuracy_score)
scores = cross_val_score(model, data, labels, cv=kfold, error_score="raise", scoring=scoring)
The customXceptionBuild function builds on the pre-trained Xception model and uses "mse" as the loss function.
The first error is caused by an output size mismatch. Change this
F3 = Dense(classes, activation='softmax')(D2)
to
F3 = Dense(1, activation='sigmoid')(D2)
Since this is binary classification, you only need one output neuron. (Note that with a single unit the activation should be sigmoid; softmax over a single unit always outputs 1.)
Or, if you want to fix the second error instead, here is why it happens: stratification cannot be done on one-hot encoded labels, so pass the integer labels to cross_val_score and do the one-hot encoding only after the stratified split. That is,
labels = np_utils.to_categorical(labels, num_classes = 2)
should come after
kfold = StratifiedKFold(n_splits=folds, shuffle=True, random_state=random_state)
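Putting the first fix together, a rough sketch of a build function that works with "mse" and plain integer labels (this customXceptionBuild is simplified to a generic binary head on top of Xception; the real function from the linked repo is not shown in the question):

from tensorflow.keras.applications import Xception
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

def customXceptionBuild():
    base = Xception(weights='imagenet', include_top=False, input_shape=(299, 299, 3))
    x = GlobalAveragePooling2D()(base.output)
    # single sigmoid unit: targets stay as integer 0/1 labels,
    # so StratifiedKFold and "mse" both work without to_categorical
    out = Dense(1, activation='sigmoid')(x)
    model = Model(inputs=base.input, outputs=out)
    model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])
    return model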
I found the following way to get the MNIST dataset in TensorFlow:
def get_input_fn(dataset_split, batch_size, capacity=10000, min_after_dequeue=3000):
    def _input_fn():
        images_batch, labels_batch = tf.train.shuffle_batch(
            tensors=[dataset_split.images, dataset_split.labels.astype(np.int32)],
            batch_size=batch_size,
            capacity=capacity,
            min_after_dequeue=min_after_dequeue,
            enqueue_many=True,
            num_threads=4)
        features_map = {'images': images_batch}
        return features_map, labels_batch
    return _input_fn

data = tf.contrib.learn.datasets.mnist.load_mnist()
train_input_fn = get_input_fn(data.train, batch_size=256)
eval_input_fn = get_input_fn(data.validation, batch_size=5000)
The data variable is a Datasets object. This approach is quite unclear to me, and I cannot figure out how to reduce the 60K training set to a 10K dataset.
When I do the following:
data = tf.contrib.learn.datasets.mnist.load_mnist().take(10000)
I get error:
AttributeError: 'Datasets' object has no attribute 'take'
But the docs do provide this method.
Thank you for your help!
This function from the contrib module is deprecated. You can use tf.keras.datasets.mnist.load_data() instead. As per https://www.tensorflow.org/api_docs/python/tf/keras/datasets/mnist/load_data, it returns
Tuple of Numpy arrays: `(x_train, y_train), (x_test, y_test)`.
So, in order to apply any dataset functions to it, you need to load it into a tf.data.Dataset object:
train, test = tf.keras.datasets.mnist.load_data(path='mnist.npz')
dataset_train = tf.data.Dataset.from_tensor_slices((train[0], train[1]))
dataset_test = tf.data.Dataset.from_tensor_slices((test[0], test[1]))
Then you can apply shuffle, batch, take, or any map function to the dataset_train or dataset_test objects.
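For example, to get the 10K subset the question is after, a minimal sketch continuing the code above (shuffling first so the subset is not just the first 10,000 examples):

dataset_train_10k = (dataset_train
                     .shuffle(60000)   # buffer covers the full 60K training set
                     .take(10000)      # keep only 10K examples
                     .batch(256))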
I have created a network using high-level TF APIs such as tf.estimator.
Training and evaluating work fine and produce output. However, when predicting on new data, get_inputs() requires label_data and batch_size.
The error is: TypeError: get_inputs() missing 2 required positional arguments: 'label_data' and 'batch_size'
How can I resolve this so I can make a prediction?
Here is my code:
predictTest = [0.34, 0.65, 0.88]
predictTest is just a test and won't be my real prediction data.
get_inputs() is where the error is thrown:
def get_inputs(feature_data, label_data, batch_size, n_epochs=None, shuffle=True):
    dataset = tf.data.Dataset.from_tensor_slices(
        (feature_data, label_data))
    dataset = dataset.repeat(n_epochs)
    if shuffle:
        dataset = dataset.shuffle(len(feature_data))
    dataset = dataset.batch(batch_size)
    features, labels = dataset.make_one_shot_iterator().get_next()
    return features, labels
Prediction inputs:
def predict_input_fn():
    return get_inputs(
        predictTest,
        n_epochs=1,
        shuffle=False
    )
Predicting:
predict = estimator.predict(predict_input_fn)
print("Prediction: {}".format(list(predict)))
I worked out that I must create a new input function for prediction.
If I use the get_inputs() that train and evaluate use, it expects label data and a batch size that prediction won't provide.
The original get_inputs:
def get_inputs(feature_data, label_data, batch_size, n_epochs=None, shuffle=True):
    dataset = tf.data.Dataset.from_tensor_slices(  # from_tensor_slices
        (feature_data, label_data))
    dataset = dataset.repeat(n_epochs)
    if shuffle:
        dataset = dataset.shuffle(len(feature_data))
    dataset = dataset.batch(batch_size)
    features, labels = dataset.make_one_shot_iterator().get_next()
    return features, labels
Make a new function called get_pred_inputs that doesn't require label_data or batch_size:
def get_pred_inputs(feature_data, n_epochs=None, shuffle=False):
    dataset = tf.data.Dataset.from_tensor_slices(  # from_tensor_slices
        (feature_data))
    dataset = dataset.repeat(n_epochs)
    if shuffle:
        dataset = dataset.shuffle(len(feature_data))
    dataset = dataset.batch(1)
    features = dataset
    return features
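For completeness, a small sketch of how the new function could be wired into the prediction call (the lambda stands in for the earlier predict_input_fn, and predictTest is the list from the question; whether the resulting feature tensors match your model depends on your feature columns):

# no labels and no explicit batch size are needed for prediction
predict = estimator.predict(lambda: get_pred_inputs(predictTest, n_epochs=1))
print("Prediction: {}".format(list(predict)))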
Testing a model comes in two flavors:
1) If you want accuracy, recall, etc., you need to provide labels for the test data; if you don't provide labels, you will get an error.
2) If you just want to run the model without computing accuracy, you don't need labels, but in that case you only get predictions, not evaluation metrics.