Tensorflow predictions change as more predictions are made - python

I'm using TensorFlow 0.8.0 and skflow (now known as learn). My model is very similar to this example but with a DNN as the last layer (similar to the MNIST example). Nothing very fancy is going on, and the model works pretty well on its own. The text inputs are at most 200 characters long and there are 3 classes.
The problem I'm seeing is that when I load the model and make many predictions (usually around 200 or more), the results start to vary.
For example, my model is already trained and I load it and go through my data and make predictions.
char_processor = skflow.preprocessing.ByteProcessor(200)
classifier = skflow.TensorFlowEstimator.restore('/path/to/model')
for item in dataset:
    # each item is an array of strings, ex: ['foo', 'bar', 'hello', 'world']
    line_data = np.array(list(char_processor.transform(item)))
    res = classifier.predict_proba(line_data)
If I load my classifier and only give it one item to predict upon then quit, it works perfectly. When I continue to make predictions, I start to see weirdness.
What could I be missing here? Shouldn't my model always return the same results for the same data?
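One way to narrow this down is to record the first result for each input and compare every later result for the same input against it. The sketch below is only a diagnostic harness, not a fix: `predict_fn` is a hypothetical stand-in for a call such as `classifier.predict_proba`, and the tolerance is an assumed value.

```python
import math


def find_drift(predict_fn, inputs, rounds=3, tol=1e-6):
    """Call predict_fn repeatedly on the same inputs and report mismatches.

    predict_fn is a hypothetical stand-in for the real prediction call; it
    takes one input and returns a sequence of class probabilities.
    """
    baseline = {}  # input index -> probabilities seen in the first round
    drifted = []   # (round, input index) pairs that differ from the baseline
    for r in range(rounds):
        for i, item in enumerate(inputs):
            probs = list(predict_fn(item))
            if i not in baseline:
                baseline[i] = probs
                continue
            same_length = len(probs) == len(baseline[i])
            same_values = all(
                math.isclose(a, b, abs_tol=tol)
                for a, b in zip(baseline[i], probs)
            )
            if not (same_length and same_values):
                drifted.append((r, i))
    return drifted
```

An empty result means every repeated call matched its first answer within the tolerance; a non-empty result tells you at which round and for which input the output first diverged.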

Related

fasttext train_supervised model: get top predicted labels

I have used fasttext train_supervised utility to train a classification model according to their webpage https://fasttext.cc/docs/en/supervised-tutorial.html .
model = fasttext.train_supervised(input='train.txt', autotuneValidationFile='validation.txt', autotuneDuration=600)
Once I have the model, how can I inspect the best parameters that autotune found? In sklearn, after a search has finished, you can always check the values of the best parameters, but I could not find any documentation that explains how to do this here.
I also used this trained model to make predictions on my data:
model.predict(test_df.iloc[2, 1])
It returns the label with a probability, like this:
(('__label__2',), array([0.92334366]))
I have 5 labels. When making a prediction, is it possible to get the probability of every label for each text?
For the test_df text above, I would like to get something like:
model.predict(test_df.iloc[2, 1])
(('__label__2',), array([0.92334366])),(('__label__1',), array([0.82334366])),
(('__label__3',), array([0.52333333])),(('__label__0',), array([0.07000000])),
(('__label__4',), array([0.00002000]))
I could not find anything about how to get prediction results like this.
Any suggestion?
Thanks.
As the documentation shows, when using the predict method you can specify the k parameter to get the top-k predicted classes.
model.predict("Why not put knives in the dishwasher?", k=5)
OUTPUT:
((u'__label__food-safety', u'__label__baking', u'__label__equipment',
u'__label__substitutions', u'__label__bread'), array([0.0857 , 0.0657,
0.0454, 0.0333, 0.0333]))
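The return value is a pair of parallel sequences (labels, probabilities), so getting a per-label view is a matter of zipping them. A minimal sketch, using the output shown above as literal sample data rather than calling fastText; the helper name `to_label_probs` and the prefix handling are assumptions made for illustration.

```python
def to_label_probs(prediction, prefix="__label__"):
    """Turn a (labels, probabilities) pair into a {label: probability} dict."""
    labels, probs = prediction
    result = {}
    for label, prob in zip(labels, probs):
        # Strip the "__label__" prefix if it is present.
        name = label[len(prefix):] if label.startswith(prefix) else label
        result[name] = float(prob)
    return result


# Sample data copied from the output above (a plain list stands in for the array).
sample = (
    ("__label__food-safety", "__label__baking", "__label__equipment",
     "__label__substitutions", "__label__bread"),
    [0.0857, 0.0657, 0.0454, 0.0333, 0.0333],
)
label_probs = to_label_probs(sample)
```

With a real model, the same helper would be applied to the value returned by `model.predict(text, k=5)`.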

Keras Calling model prediction with different input generate same result?

Here I call predict multiple times:
m = tf.keras.models.load_model(model_path, compile=False)
for i, (img_id, corr) in enumerate(zip(csv_file['id'], csv_file['corr'])):
    img_filename = os.path.join(data_dir, 'data/piccollage_data/train_imgs/{}.png'.format(img_id))
    img = tf.expand_dims(load_image(img_filename), 0)
    result = m.predict(img, steps=1)
    print('idx:{} result: {}, gt: {}'.format(i, result, corr))
And the results make me very confused:
I tried to find similar questions, but the only ones I found are about training problems.
I also tried other methods, e.g. passing the input as a dataset (no problem). Another brute-force solution is to reload the model on every loop iteration (which is very wasteful), and a last option seems to be model serving.
So my question is: why does the code above generate the same result for different inputs? Can anyone give me a clue?

How can I make predictions from a trained model inside a Tensorflow input pipeline?

I am trying to train a model for emotion recognition, which uses one of VGG's layer's output as an input.
I could manage what I want by running the prediction in a first step, saving the extracted features and then using them as input to my network, but I am looking for a way to do the whole process at once.
The second model uses a concatenated array of feature maps as input (I am working with video data), so I am not able to simply wire it to the output of VGG.
I tried to use a map operation as described in the tf.data.Dataset API documentation, like this:
def trimmed_vgg16():
    vgg16 = tf.keras.applications.vgg16.VGG16(input_shape=(224,224,3))
    trimmed = tf.keras.models.Model(inputs=vgg16.get_input_at(0),
                                    outputs=vgg16.layers[-3].get_output_at(0))
    return trimmed
vgg16_model = trimmed_vgg16()
def _extract_vgg_features(images, labels):
    pred = vgg16_model.predict(images, batch_size=batch_size, steps=1)
    return pred, labels
dataset = ...  # load the dataset (image, label) as usual
dataset = dataset.map(_extract_vgg_features)
But I'm getting this error : Tensor Tensor("fc1/Relu:0", shape=(?, 4096), dtype=float32) is not an element of this graph which is pretty explicit. I'm stuck here, as I don't see a good way of inserting the trained model in the same graph and getting predictions "on the fly".
Is there a clean way of doing this or something similar ?
Edit: missed a line.
Edit2: added details
You should be able to connect the layers by first creating the VGG16 model and retrieving its output tensor as shown here. You can then use that tensor as the input to your own network.
vgg16 = tf.keras.applications.vgg16.VGG16(input_shape=(224,224,3))
network_input = vgg16.get_input_at(0)
vgg16_out = vgg16.layers[-3].get_output_at(0) # use this tensor as input to your own network

Keras Lambda Layer Before Embedding: Use to Convert Text to Integers

I currently have a keras model which uses an Embedding layer. Something like this:
input = tf.keras.layers.Input(shape=(20,), dtype='int32')
x = tf.keras.layers.Embedding(input_dim=1000,
                              output_dim=50,
                              input_length=20,
                              trainable=True,
                              embeddings_initializer='glorot_uniform',
                              mask_zero=False)(input)
This is great and works as expected. However, I want to be able to send text to my model, have it preprocess the text into integers, and continue normally.
Two issues:
1) The Keras docs say that Embedding layers can only be used as the first layer in a model: https://keras.io/layers/embeddings/
2) Even if I could add a Lambda layer before the Embedding, I'd need it to keep track of certain state (like a dictionary mapping specific words to integers). How might I go about this stateful preprocessing?
In short, I need to modify the underlying Tensorflow DAG, so when I save my model and upload to ML Engine, it'll be able to handle my sending it raw text.
Thanks!
Here are the first few layers of a model which uses a string input:
input = keras.layers.Input(shape=(1,), dtype="string", name='input_1')
lookup_table_op = tf.contrib.lookup.index_table_from_tensor(
    mapping=vocab_list,
    num_oov_buckets=num_oov_buckets,
    default_value=-1,
)
lambda_output = Lambda(lookup_table_op.lookup)(input)
emb_layer = Embedding(int(number_of_categories), int(number_of_categories**0.25))(lambda_output)
Then you can continue the model as you normally would after an embedding layer. This is working for me and the model trains fine from string inputs.
It is recommended that you do the string -> int conversion in some preprocessing step to speed up the training process. Then after the model is trained you create a second keras model that just converts string -> int and then combine the two models to get the full string -> target model.
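The string -> int preprocessing step can be sketched in plain Python as below. This mirrors the idea of a vocabulary lookup with out-of-vocabulary buckets, but it is not TensorFlow's implementation: the `crc32` hash, the helper names, and the sample vocabulary are assumptions for illustration, so the OOV ids will not match those produced by `index_table_from_tensor`.

```python
import zlib


def build_lookup(vocab_list, num_oov_buckets=1, default_value=-1):
    """Return a function that maps a token to an integer id."""
    index = {token: i for i, token in enumerate(vocab_list)}
    vocab_size = len(vocab_list)

    def lookup(token):
        if token in index:
            return index[token]
        if num_oov_buckets <= 0:
            # No OOV buckets configured: fall back to the default value.
            return default_value
        # Unknown tokens share a small range of ids placed after the vocabulary.
        # crc32 is used because it is stable across runs, unlike hash().
        bucket = zlib.crc32(token.encode("utf-8")) % num_oov_buckets
        return vocab_size + bucket

    return lookup


# Hypothetical vocabulary, for illustration only.
lookup = build_lookup(["hello", "world", "foo"], num_oov_buckets=2)
ids = [lookup(t) for t in ["foo", "hello", "unseen"]]
```

Note that with OOV buckets the embedding's input dimension has to cover at least `len(vocab_list) + num_oov_buckets` ids.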

Hold out tensorflow 1.4 new dataset API

With the new dataset object, is there a way to divide a dataset into training and test datasets according to a certain ratio, to get a hold-out set? And what about k-fold cross-validation?
In my case I wrote all the data into a single TFRecord file and then imported it with tf.data.TFRecordDataset.
For the hold-out, I'd like a way to split this dataset into two datasets by ratio. I solved this with dataset.take() and dataset.skip(), but to apply a ratio I need to know the dataset's length, which is not graceful.
def split_dataset(dataset, ratio, n):
    count_train = (n*ratio)//100
    train = dataset.take(count_train)
    test = dataset.skip(count_train)
    return train, test
filenames = ["dataset_breast.tfrecords"]
dataset = tf.data.TFRecordDataset(filenames)
train_dataset, test_dataset = split_dataset(dataset, 80, 3360)
For k-fold, the only solution I found is a scikit-learn workaround applied to the data before importing it with tf.data.TFRecordDataset.
I do not know of any feature like what you're describing. There are, of course, ways to achieve the functionality you're after. Here are two:
"Source" Placeholder
This one comes straight from the API docs. Though it is originally intended for TFRecordDataset, I imagine it could be adapted to other types. I'll copy/paste from the link:
filenames = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...) # Parse the record into tensors.
dataset = dataset.repeat() # Repeat the input indefinitely.
dataset = dataset.batch(32)
iterator = dataset.make_initializable_iterator()
# You can feed the initializer with the appropriate filenames for the current
# phase of execution, e.g. training vs. validation.
# Initialize `iterator` with training data.
training_filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
sess.run(iterator.initializer, feed_dict={filenames: training_filenames})
# Initialize `iterator` with validation data.
validation_filenames = ["/var/data/validation1.tfrecord", ...]
sess.run(iterator.initializer, feed_dict={filenames: validation_filenames})
A little discussion: This works with just one training tfrecord and one validation tfrecord, too. So, you could split your data using sklearn.model_selection.train_test_split before writing the TFRecords. Write one TFRecord dedicated to training data, one dedicated to validation data. With sklearn, you can specify a ratio (or an absolute number).
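If you would rather not depend on sklearn for this, the split before writing can be sketched with the standard library alone. The function names, the fixed seed, and the use of integer indices (standing in for your records) are assumptions for illustration. The k-fold part covers the cross-validation half of the question in the same way, by deciding the partition before any TFRecord is written.

```python
import random


def holdout_split(n, train_ratio=0.8, seed=0):
    """Shuffle indices 0..n-1 and split them into (train, test) lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    count_train = int(n * train_ratio)
    return indices[:count_train], indices[count_train:]


def kfold_splits(n, k=5, seed=0):
    """Yield (train, validation) index lists for each of k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at i, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# 3360 is the record count used in the question.
train_idx, test_idx = holdout_split(3360, train_ratio=0.8)
```

Each index list would then select the records written to the corresponding TFRecord file, one file per split (or one pair of files per fold).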
Two Datasets
Exactly like the name says, forget the filenames = tf.placeholder. Just create two iterators, one for testing and one for training. I usually use TFRecords, but you're free to try another type of dataset. Typically, I put the get_next calls into a tf.cond on a boolean tf.placeholder. If you're especially interested in this method, I could provide a MWE. But, the source placeholder seems to be the preferred method (seeing as it's in the docs...).
