Keras: Calling model prediction with different inputs generates the same result? - python

Here, I call predict multiple times:
m = tf.keras.models.load_model(model_path, compile=False)
for i, (img_id, corr) in enumerate(zip(csv_file['id'], csv_file['corr'])):
    img_filename = os.path.join(data_dir, 'data/piccollage_data/train_imgs/{}.png'.format(img_id))
    img = tf.expand_dims(load_image(img_filename), 0)
    result = m.predict(img, steps=1)
    print('idx: {} result: {}, gt: {}'.format(i, result, corr))
And the results leave me very confused.
I tried to find a similar question, but I only found one about a training problem link.
I also tried other approaches, e.g. changing the input to a dataset (no problem); another brute-force solution is to reload the model on every loop iteration (very wasteful); and the last option seems to be model serving link.
So what I want to ask is: why does the code above generate the same results? Can anyone give me a clue?

Related

How to write a generation function for text pytorch transformer?

Following this pytorch tutorial, I'm able to create and train a transformer model on a custom dataset. The problem is, I've scoured the web and have found no clear answers... How do I use this model to generate text? I took a stab at it, by encoding my SOS and seed text and passing it through the model's forward method... But this only produces repeating garbage. The src_mask doesn't appear to be the right size or functioning at all.
def generate(model: nn.Module, src_text: str):
    src = BeatleSet.encode(src_text.lower())  # encodes seed text
    SOS = BeatleSet.textTokDict['<sos>']; EOS = BeatleSet.textTokDict['<eos>']  # obtain sos and eos tokens
    model.eval(); entry = [SOS] + src
    y_input = torch.tensor([entry], dtype=torch.long, device=device)  # convert entry to tensor
    num_tokens = len(BeatleSet)
    for i in range(50):
        src_mask = generate_square_subsequent_mask(y_input.size(0)).to(device)  # create a mask of size 1,1 (???)
        pred = model(y_input, src_mask)  # passed through forward method
        next_item = pred.topk(1)[1].view(-1)[-1].item()  # selecting the most probable next token (I think)
        next_item = torch.tensor([[next_item]], device=device)
        y_input = torch.cat((y_input, next_item), dim=1)  # added to inputs to be run again
        if next_item.view(-1).item() == EOS:
            break
    return " ".join(BeatleSet.decode(y_input.view(-1).tolist()))

print(generate(model, "Oh yeah I"))
For the record, I'm following the architecture to the letter. This should be reproducible with the WikiText-2 dataset that is used in the tutorial. Please advise, I've been banging my head on this one.

Tensorflow Datasets, padded_batch, why allow different output_shapes, and is there a better way?

I'm trying to write Tensorflow 2.0 code which is good enough to share with other people. I have run into a problem with tf.data.Dataset. I have solved it, but I dislike my solutions.
Here is working Python code which generates padded batches from irregular data, two different ways. In one case, I re-use a global variable to supply the shape information. I dislike the global variable, especially because I know that the Dataset knows its own output shapes, and in the future I may have Dataset objects with several different output shapes.
In the other case, I extract the shape information from the Dataset object itself. But I have to jump through hoops to do it.
import numpy as np
import tensorflow as tf

print("""
Create a data set with the desired shape: 1 input per sub-element,
3 targets per sub-element, 8 elements of varying lengths.
""")

def gen():
    lengths = np.tile(np.arange(4, 8), 2)
    np.random.shuffle(lengths)
    for length in lengths:
        inp = np.random.randint(1, 51, length)
        tgt = np.random.random((length, 3))
        yield inp, tgt

output_types = (tf.int64, tf.float64)
output_shapes = ([None], [None, 3])
dataset = tf.data.Dataset.from_generator(gen, output_types, output_shapes)

print("""
Using the global variable, output_shapes, allows the retrieval
of padded batches.
""")

for inp, tgt in dataset.padded_batch(3, output_shapes):
    print(inp)
    print(tgt)
    print()

print("""
Obtaining the shapes supplied to Dataset.from_generator()
is possible, but hard.
""")

default_shapes = tuple([[y.value for y in x.shape.dims] for x in dataset.element_spec])  # Crazy!
for inp, tgt in dataset.padded_batch(3, default_shapes):
    print(inp)
    print(tgt)
I don't quite understand why one might want to pad the data in a batch of unevenly-sized elements to any shapes other than the output shapes which were used to define the Dataset elements in the first place. Does anyone know of a use case?
Also, there is no default value for the padded_shapes argument. I show how to retrieve what I think is the sensible default value for padded_shapes. That one-liner works... but why is it so difficult?
I'm currently trying to subclass Dataset to provide the Dataset default shapes as a Python property. Tensorflow is fighting me, probably because the underlying Dataset is a C++ object while I'm working in Python.
All this trouble makes me wonder whether there is a cleaner approach than what I have tried.
Thanks for your suggestions.
Answering my own question. I asked this same question on Reddit. A Tensorflow contributor replied that TF 2.2 will provide a default value for the padded_shapes argument. I am glad to see that the development team has recognized the same need that I identified.

Runtime switching between datasets in Tensorflow from_generator?

I have a huge dataset (about 50 Gigabytes) and I'm loading it using Python generators like this:
def data_generator(self, images_path):
    with open(self.temp_csv, 'r') as f:
        for image in f.readlines():
            # Something going on...
            yield (X, y)
The important thing is that I'm using a single generator for both the training and validation data, and I try to change self.temp_csv at runtime. However, things do not go as expected: after updating the variable self.temp_csv, which is supposed to switch between the train and validation sets, with open is not called again and I end up iterating over the same dataset over and over. I wonder if there is any way to use Dataset.from_generator and switch to another dataset at runtime for the validation phase. Here is how I specify the generator. Thank you!
def get_data(self):
    with tf.name_scope('data'):
        data_generator = lambda: self.data_generator(images_path=self.data_path)
        my_data = tf.data.Dataset.from_generator(
            generator=data_generator,
            output_types=(tf.float32, tf.float32),
            output_shapes=(tf.TensorShape([None]), tf.TensorShape([None]))
        ).batch(self.batch_size).prefetch(2)
        img, self.label = my_data.make_one_shot_iterator().get_next()
        self.img = tf.reshape(img, [-1, CNN_INPUT_HEIGHT, CNN_INPUT_WIDTH, CNN_INPUT_CHANNELS])
You could use a reinitializable iterator or a feedable iterator to switch between two datasets, as shown in the official docs.
However, if you want to read all your data through the generator and then create a train/validation split, it is not that straightforward.
If you have a separate validation file, you can simply create a new validation dataset and use its iterator as shown above.
If that's not the case, methods such as skip() and take() can help you split the data, but you need to think about shuffling to get a good split.

Pymc3: Optimizing parameters with multiple data?

I've designed a model using Pymc3, and I have some trouble optimizing it with multiple data.
The model is a bit similar to the coal-mining disaster (as in the Pymc3 tutorial for those who know it), except there are multiple switchpoints.
The output of the network is a series of real numbers, for instance:
[151,152,150,20,19,18,0,0,0]
with Model() as accrochage_model:
    time = np.linspace(0, n_cycles*data_length, n_cycles*data_length)
    poisson = [Normal('poisson_0', 5, 1), Normal('poisson_1', 10, 1)]
    variance = 3
    t = [Normal('t_0', 0.5, 0.01), Normal('t_1', 0.7, 0.01)]
    taux = [Bernoulli('taux_{}'.format(i), t[i]) for i in range(n_peaks)]
    switchpoint = [Poisson('switchpoint_{}'.format(i), poisson[i])*taux[i] for i in range(n_peaks)]
    peak = [Normal('peak_0', 150, 2), Normal('peak_1', 50, 2), Normal('peak_2', 0, 2)]
    z_init = switch(switchpoint[0] >= time % n_cycles, 0, peak[0])
    z_list = [switch(sum(switchpoint[j] for j in range(i)) >= time % n_cycles, 0, peak[i]-peak[i-1]) for i in range(1, n_peaks)]
    z = sum(z_list[i] for i in range(len(z_list)))
    z += z_init
    m = Normal('m', z, variance, observed=data)
I have multiple realisations of the true distribution, and I'd like to take all of them into account when optimizing the parameters of the system.
Right now the "data" that appears in observed=data is just one list of results, such as:
[151,152,150,20,19,18,0,0,0]
What I would like to do is give not just one but several lists of results, for instance:
data=([151,152,150,20,19,18,0,0,0],[145,152,150,21,17,19,1,0,0],[151,149,153,17,19,18,0,0,1])
I tried using the shape parameter and making data an array of results, but none of it seemed to work.
Does anyone have an idea of how it's possible to do the inference so that the network is optimized for an entire dataset and not a single output?

Tensorflow predictions change as more predictions are made

I'm using tensorflow 0.8.0 and skflow (now known as learn). My model is very similar to this example but with a DNN as the last layer (similar to the MNIST example). Nothing very fancy is going on; the model works pretty well on its own. The text inputs are a max of 200 characters, with 3 classes.
The problem I'm seeing is when I try to load the model and make many predictions (Usually around 200 predictions or more), I start to see results vary.
For example, my model is already trained and I load it and go through my data and make predictions.
char_processor = skflow.preprocessing.ByteProcessor(200)
classifier = skflow.TensorFlowEstimator.restore('/path/to/model')
for item in dataset:
    # each item is an array of strings, ex: ['foo', 'bar', 'hello', 'world']
    line_data = np.array(list(char_processor.transform(item)))
    res = classifier.predict_proba(line_data)
If I load my classifier and only give it one item to predict upon then quit, it works perfectly. When I continue to make predictions, I start to see weirdness.
What could I be missing here? Shouldn't my model always return the same results for the same data?
