Multi-instance classification using tranformer model - python

I use the transformer from this Keras documentation example for multi-instance classification. The class of each instance depends on other instances that come in one bag. I use transformer model because:
It makes no assumptions about the temporal/spatial relationships across the data. This is ideal for processing a set of objects
For example, each bag may have maximal 5 instances and there are 3 features per instance.
# Generate data
max_length = 5
x_lst = []
y_lst = []
for _ in range(10):
num_instances = np.random.randint(2, max_length + 1)
x_bag = np.random.randint(0, 9, size=(num_instances, 3))
y_bag = np.random.randint(0, 2, size=num_instances)
x_lst.append(x_bag)
y_lst.append(y_bag)
Features and labels of first 2 bags (with 5 and 2 instances):
x_lst[:2]
[array([[8, 0, 3],
[8, 1, 0],
[4, 6, 8],
[1, 6, 4],
[7, 4, 6]]),
array([[5, 8, 4],
[2, 1, 1]])]
y_lst[:2]
[array([0, 1, 1, 1, 0]), array([0, 0])]
Next, I pad features with zeros and targets with -1:
x_padded = []
y_padded = []
for x, y in zip(x_lst, y_lst):
x_p = np.zeros((max_length, 3))
x_p[:x.shape[0], :x.shape[1]] = x
x_padded.append(x_p)
y_p = np.negative(np.ones(max_length))
y_p[:y.shape[0]] = y
y_padded.append(y_p)
X = np.stack(x_padded)
y = np.stack(y_padded)
where X.shape is equal to (10, 5, 3) and y.shape is equal to (10, 5).
I made two changes to the original model: added the Masking layer
after the Input layer and set the number of units in the last Dense layer to the maximal size of the bag (plus 'sigmoid' activation):
def transformer_encoder(inputs, head_size, num_heads, ff_dim, dropout=0):
# Attention and Normalization
x = layers.MultiHeadAttention(
key_dim=head_size, num_heads=num_heads, dropout=dropout
)(inputs, inputs)
x = layers.Dropout(dropout)(x)
x = layers.LayerNormalization(epsilon=1e-6)(x)
res = x + inputs
# Feed Forward Part
x = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(res)
x = layers.Dropout(dropout)(x)
x = layers.Conv1D(filters=inputs.shape[-1], kernel_size=1)(x)
x = layers.LayerNormalization(epsilon=1e-6)(x)
return x + res
def build_model(
input_shape,
head_size,
num_heads,
ff_dim,
num_transformer_blocks,
mlp_units,
dropout=0,
mlp_dropout=0,
):
inputs = keras.Input(shape=input_shape)
inputs = keras.layers.Masking(mask_value=0)(inputs) # ADDED MASKING LAYER
x = inputs
for _ in range(num_transformer_blocks):
x = transformer_encoder(x, head_size, num_heads, ff_dim, dropout)
x = layers.GlobalAveragePooling1D(data_format="channels_first")(x)
for dim in mlp_units:
x = layers.Dense(dim, activation="relu")(x)
x = layers.Dropout(mlp_dropout)(x)
outputs = layers.Dense(5, activation='sigmoid')(x) # CHANGED ACCORDING TO MY OUTPUT
return keras.Model(inputs, outputs)
input_shape = (5, 3)
model = build_model(
input_shape,
head_size=256,
num_heads=4,
ff_dim=4,
num_transformer_blocks=4,
mlp_units=[128],
mlp_dropout=0.4,
dropout=0.25,
)
model.compile(
loss="binary_crossentropy",
optimizer=keras.optimizers.Adam(learning_rate=1e-4),
metrics=["binary_accuracy"],
)
model.summary()
It looks like my model doesn't learn much. If I use the number of true values for each bag (y.sum(axis=1) and Dense(1)) as a target instead of classification of each instance, the model learns good. Where is my error? How should I build the output layer in this case? Do I need a custom lost function?
UPDATE:
I made a custom loss function:
def my_loss_fn(y_true, y_pred):
mask = tf.cast(tf.math.not_equal(y_true, tf.constant(-1.)), tf.float32)
y_true, y_pred = tf.expand_dims(y_true, axis=-1), tf.expand_dims(y_pred, axis=-1)
bce = tf.keras.losses.BinaryCrossentropy(reduction='none')
return tf.reduce_sum(tf.cast(bce(y_true, y_pred), tf.float32) * mask)
mask = (y_test != -1).astype(int)
pd.DataFrame({'n_labels': mask.sum(axis=1), 'preds': ((preds * mask) >= .5).sum(axis=1)}).plot(figsize=(20, 5))
And it looks like the model learns:
But it predicts all nonmasked labels as 1.
#thushv89 This is my problem. I take 2 time points: t1 and t2 and look for all vehicles that are in maintenance at the time t1 and for all vehicles that are planned to be in maintenance at the time t2. So, this is my bag of items. Then I calculate features like how much time t1 vehicles have already spent in maintenance, how much time from t1 to the plan start for t2 vehicle etc. My model learns well if I try to predict the number of vehicles in maintenance at the time t2, but I would like to predict which of them will leave and which of them will come in (3 vs [True, False, True, True] for 4 vehicles in the bag).

There are three important improvements:
Remove the GlobalAveragePooling1D. It's a kind of bottleneck (data compression) if you make a prediction for each item. Without this layer, you also get a two-dimensional tensor with the max number of items in the first dimension (5 in my case) for free in the output. So, you can set the last Dense layer to the number of categories (1 in my case).
Add a custom loss function to exclude target padding from calculation (already added to my question) and a custom metric function if you want to see the real metric.
Add an attention_mask to the MultiHeadAttention (instead of Masking layer) to mask the padding.

Just a simple add-on to #Mykola_Zotko 's improvement answer to those new users who are learning deep-learning with keras and tensorflow.
Remove the GlobalAveragePooling1D
For context, this GlobalAveragePooling1D is basically a Global average pooling operation for temporal data. So basically when you remove this method call, you are removing the "pooling" operation, or in simpler terms by #Mykola_Zotko:
... you get a two-dimensional tensor with the max number of items in the first dimension (5 in my case) for free in the output
The alias is:
tf.keras.layers.GlobalAvgPool1D
and the code for this method:
tf.keras.layers.GlobalAveragePooling1D (
data_format = "channels_last", **kwargs
)
The source for this can be found on:
Github
TensorFlow 1 version
TensorFlow.org doc
Add a custom loss function
What a loss function simply does is to "to generate the quantity that a model should seek to minimize during training time". Source
Or in other terms:
In mathematical optimization, statistics, machine learning and Deep Learning the Loss Function (also known as Cost Function or Error Function) is a function that defines a correlation between a series of values and a real number. That number represents conceptually the cost associated with an event or a set of values. In general, the goal of an optimization procedure is to minimize the loss function. Towardsdatascience - custom loss function in tensorflow
Add an attention_mask to the MultiHeadAttention
Alias:
tf.keras.layers.MultiHeadAttention
Code for the method:
tf.keras.layers.MultiHeadAttention(
num_heads,
key_dim,
value_dim=None,
dropout=0.0,
use_bias=True,
output_shape=None,
attention_axes=None,
kernel_initializer='glorot_uniform',
bias_initializer='zeros',
kernel_regularizer=None,
bias_regularizer=None,
activity_regularizer=None,
kernel_constraint=None,
bias_constraint=None,
**kwargs
)
Source on:
Github
TensorFlow.org doc
Previous improvements that were made to the code:
metrics=["accuracy"] to metrics=["binary_accuracy"]
model.compile(
loss="binary_crossentropy",
optimizer=keras.optimizers.Adam(learning_rate=1e-4),
metrics=["binary_accuracy"],
)
Using Crossentropy in the custom loss function

Related

How to prepare the inputs in Keras implementation of Wavenet for time-series prediction

In Keras implementation of Wavenet, the input shape is (None, 1). I have a time series (val(t)) in which the target is to predict the next data point given a window of past values (the window size depends on maximum dilation). The input-shape in wavenet is confusing. I have few questions about it:
How Keras figure out the input dimension (None) when a full sequence is given? According to dilations, we want the input to have a length of 2^8.
If a input series of shape (1M, 1) is given as training X, do we need to generate vectors of 2^8 time-steps as input? It seems, we can just use the input series as input of wave-net (Not sure why raw time series input does not give error).
In general, how we can debug such Keras networks. I tried to apply the function on numerical data like Conv1D(16, 1, padding='same', activation='relu')(inputs), however, it gives error.
#
n_filters = 32
filter_width = 2
dilation_rates = [2**i for i in range(7)] * 2
from keras.models import Model
from keras.layers import Input, Conv1D, Dense, Activation, Dropout, Lambda, Multiply, Add, Concatenate
from keras.optimizers import Adam
history_seq = Input(shape=(None, 1))
x = history_seq
skips = []
for dilation_rate in dilation_rates:
# preprocessing - equivalent to time-distributed dense
x = Conv1D(16, 1, padding='same', activation='relu')(x)
# filter
x_f = Conv1D(filters=n_filters,
kernel_size=filter_width,
padding='causal',
dilation_rate=dilation_rate)(x)
# gate
x_g = Conv1D(filters=n_filters,
kernel_size=filter_width,
padding='causal',
dilation_rate=dilation_rate)(x)
# combine filter and gating branches
z = Multiply()([Activation('tanh')(x_f),
Activation('sigmoid')(x_g)])
# postprocessing - equivalent to time-distributed dense
z = Conv1D(16, 1, padding='same', activation='relu')(z)
# residual connection
x = Add()([x, z])
# collect skip connections
skips.append(z)
# add all skip connection outputs
out = Activation('relu')(Add()(skips))
# final time-distributed dense layers
out = Conv1D(128, 1, padding='same')(out)
out = Activation('relu')(out)
out = Dropout(.2)(out)
out = Conv1D(1, 1, padding='same')(out)
# extract training target at end
def slice(x, seq_length):
return x[:,-seq_length:,:]
pred_seq_train = Lambda(slice, arguments={'seq_length':1})(out)
model = Model(history_seq, pred_seq_train)
model.compile(Adam(), loss='mean_absolute_error')
you are using extreme values for dilatation rate, they don't make sense. try to reduce them using, for example, a sequence made of [1, 2, 4, 8, 16, 32]. the dilatation rates aren't a constraint on the dimension of the input passed
your network work simply passing this input
n_filters = 32
filter_width = 2
dilation_rates = [1, 2, 4, 8, 16, 32]
....
model = Model(history_seq, pred_seq_train)
model.compile(Adam(), loss='mean_absolute_error')
n_sample = 5
time_step = 100
X = np.random.uniform(0,1, (n_sample,time_step,1))
model.predict(X)
specify a None dimension in Keras means to leave the model free to receive every dimension. this not means you can pass samples of various dimension, they always must have the same format... you can build the model every time with a different dimension size
for time_step in np.random.randint(100,200, 4):
print('temporal dim:', time_step)
n_sample = 5
model = Model(history_seq, pred_seq_train)
model.compile(Adam(), loss='mean_absolute_error')
X = np.random.uniform(0,1, (n_sample,time_step,1))
print(model.predict(X).shape)
I suggest also you a premade library in Keras which provide WAVENET implementation: https://github.com/philipperemy/keras-tcn you can use it as a baseline and investigate also the code to create a WAVENET

Problems porting Tensorflow code to Keras

I have been bashing my head against the wall for the past few days - and I simply cannot figure it out.
Would some of you good people perhaps let me know what I am doing wrong?
I am trying to port code from https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Deep%20Q%20Learning/Doom/Deep%20Q%20learning%20with%20Doom.ipynb (written in Tensorflow) to Keras. Here is the original part of the code:
class DQNetwork:
def __init__(self, state_size, action_size, learning_rate, name='DQNetwork'):
self.state_size = state_size
self.action_size = action_size
self.learning_rate = learning_rate
with tf.variable_scope(name):
self.inputs_ = tf.placeholder(tf.float32, [None, *state_size], name="inputs")
self.actions_ = tf.placeholder(tf.float32, [None, 3], name="actions_")
self.target_Q = tf.placeholder(tf.float32, [None], name="target")
#First convnet: CNN => BatchNormalization => ELU; Input is 84x84x4
self.conv1 = tf.layers.conv2d(inputs = self.inputs_,
filters = 32, kernel_size = [8,8],strides = [4,4],padding = "VALID",
kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(), name = "conv1")
self.conv1_batchnorm = tf.layers.batch_normalization(self.conv1,training = True,
epsilon = 1e-5,name = 'batch_norm1')
self.conv1_out = tf.nn.elu(self.conv1_batchnorm, name="conv1_out")
## --> [20, 20, 32]
#Second convnet: CNN => BatchNormalization => ELU
self.conv2 = tf.layers.conv2d(inputs = self.conv1_out,
filters = 64,kernel_size = [4,4],strides = [2,2],padding = "VALID",
kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),name = "conv2")
self.conv2_batchnorm = tf.layers.batch_normalization(self.conv2,training = True,
epsilon = 1e-5,name = 'batch_norm2')
self.conv2_out = tf.nn.elu(self.conv2_batchnorm, name="conv2_out")
## --> [9, 9, 64]
#Third convnet: CNN => BatchNormalization => ELU
self.conv3 = tf.layers.conv2d(inputs = self.conv2_out,
filters = 128,kernel_size = [4,4],strides = [2,2],padding = "VALID",
kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),name = "conv3")
self.conv3_batchnorm = tf.layers.batch_normalization(self.conv3,training = True,
epsilon = 1e-5,name = 'batch_norm3')
self.conv3_out = tf.nn.elu(self.conv3_batchnorm, name="conv3_out")
## --> [3, 3, 128]
self.flatten = tf.layers.flatten(self.conv3_out)
## --> [1152]
self.fc = tf.layers.dense(inputs = self.flatten,
units = 512, activation = tf.nn.elu,
kernel_initializer=tf.contrib.layers.xavier_initializer(),name="fc1")
self.output = tf.layers.dense(inputs = self.fc, kernel_initializer=tf.contrib.layers.xavier_initializer(),
units = 3, activation=None)
# Q is our predicted Q value.
self.Q = tf.reduce_sum(tf.multiply(self.output, self.actions_), axis=1)
# The loss is the difference between our predicted Q_values and the Q_target
# Sum(Qtarget - Q)^2
self.loss = tf.reduce_mean(tf.square(self.target_Q - self.Q))
self.optimizer = tf.train.RMSPropOptimizer(self.learning_rate).minimize(self.loss)
# farther below...
Qs_next_state = sess.run(DQNetwork.output, feed_dict = {DQNetwork.inputs_: next_states_mb})
# Set Q_target = r if the episode ends at s+1, otherwise set Q_target = r + gamma*maxQ(s', a')
for i in range(0, len(batch)):
terminal = dones_mb[i]
# If we are in a terminal state, only equals reward
if terminal:
target_Qs_batch.append(rewards_mb[i])
else:
target = rewards_mb[i] + gamma * np.max(Qs_next_state[i])
target_Qs_batch.append(target)
targets_mb = np.array([each for each in target_Qs_batch])
loss, _ = sess.run([DQNetwork.loss, DQNetwork.optimizer],
feed_dict={DQNetwork.inputs_: states_mb,
DQNetwork.target_Q: targets_mb,
DQNetwork.actions_: actions_mb})
And here is my conversion:
class DQNetworkA:
def __init__(self, state_size, action_size, learning_rate):
self.state_size = state_size
self.action_size = action_size
self.learning_rate = learning_rate
self.model = keras.models.Sequential()
self.model.add(keras.layers.Conv2D(32, (8, 8), strides=(4, 4), padding = "VALID", input_shape=state_size))#, kernel_initializer='glorot_normal'))
self.model.add(keras.layers.BatchNormalization(epsilon = 1e-5))
self.model.add(keras.layers.Activation('elu'))
self.model.add(keras.layers.Conv2D(64, (4, 4), strides=(2, 2), padding = "VALID"))#, kernel_initializer='glorot_normal'))
self.model.add(keras.layers.BatchNormalization(epsilon = 1e-5))
self.model.add(keras.layers.Activation('elu'))
self.model.add(keras.layers.Conv2D(128, (4, 4), strides=(2, 2), padding = "VALID"))#, kernel_initializer='glorot_normal'))
self.model.add(keras.layers.BatchNormalization(epsilon = 1e-5))
self.model.add(keras.layers.Activation('elu'))
self.model.add(keras.layers.Flatten())
self.model.add(keras.layers.Dense(512))
self.model.add(keras.layers.Activation('elu'))
self.model.add(keras.layers.Dense(action_size))
self.model.compile(loss="mse", optimizer=keras.optimizers.RMSprop(lr=self.learning_rate))
print(self.model.summary())
# farther below...
Qs = DQNetwork.predict(states_mb)
Qs_next_state = DQNetwork.predict(next_states_mb)
# Set Q_target = r if the episode ends at s+1, otherwise set Q_target = r + gamma*maxQ(s', a')
for i in range(0, len(batch)):
terminal = dones_mb[i]
t = np.copy(Qs[i])
a = np.argmax(actions_mb[i])
# If we are in a terminal state, only equals reward
if terminal:
t[a] = rewards_mb[i]
else:
t[a] = rewards_mb[i] + gamma * np.max(Qs_next_state[i])
target_Qs_batch.append(t)
dbg_target_Qs_batch.append(t[a])
targets_mb = np.array([each for each in target_Qs_batch])
loss = DQNetwork.train_on_batch(states_mb, targets_mb)
Everything else is the same. I have even tried to mess around with a custom loss function to minimize differences in the code – and it simply does not work! While the original code quickly converges my Keras doodlings simply does not seem to want to work!
Does anyone have a clue? Any hints or help would be highly appreciated...
A little further explanation:
This is a simple DQN playing Doom - so the after about 100 episodes (games), the model seems to be able to shoot the target without a problem every episode. Loss goes down, rewards per game go up - as one would expect... However, in the Keras model loss graph is flat, reward graph is flat - it almost seems not to be able to learn anything. (see the graphs linked below)
Here is how it works. In TF code, model outputs a tensor [a, b, c] where a, b and c give probability of each action the main character might take (ie: [left, right, shoot]). Model is then given reward for every action, so it is passed a target value (target_mb, f.ex. 10) along with which action this is for (one-hot encoded in actions_mb, ie [0,1,0] - if this is a target for moving right). Loss is then computed with a simple MSE over difference between target and predicted value of the model for the given action.
I have done two things:
1) I tried to use the standard "mse" loss as I have seen in other models of this type. To make the loss behave the same way, I pass the model its own input apart from target value. So if model predicts [3,4,5] and the target is 10 for [0,1,0] - we pass [3,10,5] as the truth to the model. This should be equivalent to the actions of the TF model. ie, difference between 10 and 4, squared and then mean over all differences from the batch.
2) When 1) did not work, I tried to make a custom loss function that basically attempts to mimick behaviour of the TF model as closely as possible. So if model predicts [3,4,5] and the target is 10 for [0,1,0] (as above) - we pass [0,10,0] as the truth to the model. Then the custom loss function through some finicky multiplication and division arrives at difference between 10 and 4 - squares it and takes mean of all squared errors as below:
def custom_loss(y_true, y_pred):
isolated_truths = tf.reduce_sum(y_true, axis=1)
isolated_predictions = tf.divide(tf.reduce_sum(tf.multiply(y_true, y_pred), axis=1), isolated_truths)
delta = isolated_predictions - isolated_truths
return tf.reduce_mean(tf.square(delta))
# when training, this small modification is made to targets:
loss = DQN_Keras.train_on_batch(states_mb, targets_mb.reshape(len(targets_mb),1) * actions_mb)
And it still does not work (although you can see on the graphs that the loss seems to behave far more reasonably!).
Take a look at the graphs:
tf model: https://pasteboard.co/IN1b5MN.png
keras model with mse loss: https://pasteboard.co/IN1kH6P.png
keras model with custom loss: https://pasteboard.co/IN17ktg.png
edit #2 - runnable code
Original TF code - copy pasted from tutorial above, working:
=> https://pastebin.com/QLb7nWZi
My code with custom loss in full:
=> https://pastebin.com/3HiYg6t7
Well, I have made work - by removing BatchNormalization layers. Now I am completely mystified... so does batch normalization work differently in Keras and Tensorflow? Or is the missing clue this mysterious "training=True" parameter in TF (not present in Keras)?
PS.
While digging into the issue, I also found this very useful article describing how to create advanced Keras models with several inputs like masks (like in the original TF code!):
https://becominghuman.ai/lets-build-an-atari-ai-part-1-dqn-df57e8ff3b26

Keras: how to implement target replication for LSTM?

Using examples from Lipton et al (2016), target replication is basically calculating the loss at each time step (except final) of the LSTM (or GRU) and averaging this loss and adding it to the main loss while training. Mathematically, it is given by -
Graphically, it can be represented as -
So how do I go about exactly implementing this in Keras? Say, I have binary classification task. Let's say my model is a simple one given below -
model.add(LSTM(50))
model.add(Dense(1))
model.compile(loss='binary_crossentropy', class_weights={0:0.5, 1:4}, optimizer=Adam(), metrics=['accuracy'])
model.fit(x_train, y_train)
I think y_train needs to be reshaped/tiled from (batch_size, 1) to (batch_size, time_step) right?
The dense layer needs TimeDistributed to be applied correctly to the LSTM after setting return_sequences=True?
How do I exactly implement the exact loss function given above? Will class_weights need to be modified?
Target replication is only during training. How to implement validation set evaluation using only the main loss?
How should I deal with zero paddings in target replication? My sequences are padded to a max_len of 15 with average length being 7. Since the target replication loss averages over all the steps, how do I make sure it doesn't use the padded words in calculating the loss? Basically, dynamically assign T the actual sequence length.
Question 1:
So, for the targets, you need it shaped as (batch_size, time_steps, 1). Just use:
y_train = np.stack([y_train]*time_steps, axis=1)
Question 2:
You're correct, but TimeDistributed is optional in Keras 2.
Question 3:
I don't know how class weights will behave, but a regular loss function should go like:
from keras.losses import binary_crossentropy
def target_replication_loss(alpha):
def inner_loss(true,pred):
losses = binary_crossentropy(true,pred)
return (alpha*K.mean(losses[:,:-1], axis=-1)) + ((1-alpha)*losses[:,-1])
return inner_loss
model.compile(......, loss = target_replication_loss(alpha), ...)
Question 3a:
Since the above doens't work well with class weights, I created an alternative where the weights go into the loss:
def target_replication_loss(alpha, class_weights):
def get_weights(x):
b = class_weights[0]
a = class_weights[1] - b
return (a*x) + b
def inner_loss(true,pred):
#this will only work for classification with only one class 0 or 1
#and only if the target is the same for all classes
true_classes = true[:,-1,0]
weights = get_weights(true_classes)
losses = binary_crossentropy(true,pred)
return weights*((alpha*K.mean(losses[:,:-1], axis=-1)) + ((1-alpha)*losses[:,-1]))
return inner_loss
Question 4:
To avoid complexity, I'd say you should use an additional metric in validation:
def last_step_BC(true,pred):
return binary_crossentropy(true[:,-1], pred[:,-1])
model.compile(....,
loss = target_replication_loss(alpha),
metrics=[last_step_BC])
Question 5:
This is a hard one and I'd need to research a little....
As an initial workaround, you can set the model with an input shape of (None, features), and train each sequence individually.
Working example without class_weight
def target_replication_loss(alpha):
def inner_loss(true,pred):
losses = binary_crossentropy(true,pred)
#print(K.int_shape(losses))
#print(K.int_shape(losses[:,:-1]))
#print(K.int_shape(K.mean(losses[:,:-1], axis=-1)))
#print(K.int_shape(losses[:,-1]))
return (alpha*K.mean(losses[:,:-1], axis=-1)) + ((1-alpha)*losses[:,-1])
return inner_loss
alpha = 0.6
i1 = Input((5,2))
i2 = Input((5,2))
out = LSTM(1, activation='sigmoid', return_sequences=True)(i1)
model = Model(i1, out)
model.compile(optimizer='adam', loss = target_replication_loss(alpha))
model.fit(np.arange(30).reshape((3,5,2)), np.arange(15).reshape((3,5,1)), epochs = 200)
Working example with class weights:
def target_replication_loss(alpha, class_weights):
def get_weights(x):
b = class_weights[0]
a = class_weights[1] - b
return (a*x) + b
def inner_loss(true,pred):
#this will only work for classification with only one class 0 or 1
#and only if the target is the same for all classes
true_classes = true[:,-1,0]
weights = get_weights(true_classes)
losses = binary_crossentropy(true,pred)
print(K.int_shape(losses))
print(K.int_shape(losses[:,:-1]))
print(K.int_shape(K.mean(losses[:,:-1], axis=-1)))
print(K.int_shape(losses[:,-1]))
print(K.int_shape(weights))
return weights*((alpha*K.mean(losses[:,:-1], axis=-1)) + ((1-alpha)*losses[:,-1]))
return inner_loss
alpha = 0.6
class_weights={0: 0.5, 1:4.}
i1 = Input(batch_shape=(3,5,2))
i2 = Input((5,2))
out = LSTM(1, activation='sigmoid', return_sequences=True)(i1)
model = Model(i1, out)
model.compile(optimizer='adam', loss = target_replication_loss(alpha, class_weights))
model.fit(np.arange(30).reshape((3,5,2)), np.arange(15).reshape((3,5,1)), epochs = 200)

Tensorflow: Layer size dependent on batch size?

I am currently trying to get familiar with the Tensorflow library and I have a rather fundamental question that bugs me.
While building a convolutional neural network for MNIST classification I tried to use my own model_fn. In which usually the following line occurs to reshape the input features.
x = tf.reshape(x, shape=[-1, 28, 28, 1]), with the -1 referring to the input batch size.
Since I use this node as input to my convolutional layer,
x = tf.reshape(x, shape=[-1, 28, 28, 1])
conv1 = tf.layers.conv2d(x, 32, 5, activation=tf.nn.relu)
does this mean that the size all my networks layers are dependent on the batch size?
I tried freezing and running the graph on a single test input, which will only work if I provide n=batch_size test images.
Can you give me a hint on how to make my network run on any input batchsize while predicting?
Also I guess using the tf.reshape node (see first node in cnn_layout) in the network definition is not the best input for serving.
I will append my network layer-up and the model_fn
def cnn_layout(features,reuse,is_training):
with tf.variable_scope('cnn',reuse=reuse):
# resize input to [batchsize,height,width,channel]
x = tf.reshape(features['x'], shape=[-1,30,30,1], name='input_placeholder')
# conv1, 32 filter, 5 kernel
conv1 = tf.layers.conv2d(x, 32, 5, activation=tf.nn.relu, name='conv1')
# pool1, 2 stride, 2 kernel
pool1 = tf.layers.max_pooling2d(conv1, 2, 2, name='pool1')
# conv2, 64 filter, 3 kernel
conv2 = tf.layers.conv2d(pool1, 64, 3, activation=tf.nn.relu, name='conv2')
# pool2, 2 stride, 2 kernel
pool2 = tf.layers.max_pooling2d(conv2, 2, 2, name='pool2')
# flatten pool2
flatten = tf.contrib.layers.flatten(pool2)
# fc1 with 1024 neurons
fc1 = tf.layers.dense(flatten, 1024, name='fc1')
# 75% dropout
drop = tf.layers.dropout(fc1, rate=0.75, training=is_training, name='dropout')
# output logits
output = tf.layers.dense(drop, 1, name='output_logits')
return output
def model_fn(features, labels, mode):
# setup two networks one for training one for prediction while sharing weights
logits_train = cnn_layout(features=features,reuse=False,is_training=True)
logits_test = cnn_layout(features=features,reuse=True,is_training=False)
# predictions
predictions = tf.round(tf.sigmoid(logits_test),name='predictions')
if mode == tf.estimator.ModeKeys.PREDICT:
return tf.estimator.EstimatorSpec(mode, predictions=predictions)
# define loss and optimizer
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=logits_train,labels=labels),name='loss')
optimizer = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE, name='optimizer')
train = optimizer.minimize(loss, global_step=tf.train.get_global_step(),name='train')
# accuracy for evaluation
accuracy = tf.metrics.accuracy(labels=labels,predictions=predictions,name='accuracy')
# summarys for tensorboard
tf.summary.scalar('loss',loss)
# return training and evalution spec
return tf.estimator.EstimatorSpec(
mode=mode,
predictions=predictions,
loss=loss,
train_op=train,
eval_metric_ops={'accuracy':accuracy}
)
Thanks!
In the typical scenario, the rank of features['x'] is already going to be 4, with the outer dimension being the actual batch size, so there's no need to resize it.
Let me try to explain.
You haven't shown your serving_input_receiver_fn yet and there are several ways to do that, although in the end the principle is similar across them all. If you're using TensorFlow Serving, then you probably use build_parsing_serving_input_receiver_fn. It's informative to look at the source code:
def build_parsing_serving_input_receiver_fn(feature_spec,
default_batch_size=None):
serialized_tf_example = array_ops.placeholder(
dtype=dtypes.string,
shape=[default_batch_size],
name='input_example_tensor')
receiver_tensors = {'examples': serialized_tf_example}
features = parsing_ops.parse_example(serialized_tf_example, feature_spec)
return ServingInputReceiver(features, receiver_tensors)
So in your client, you're going to prepare a request that has one or more Examples in it (let's say the length is N). The server treats the serialized examples as a list of strings which get "fed" into the input_example_tensor placeholder. The shape (which is None) dynamically gets filled in to be the size of the list (N).
Then the parse_example op parses each item in the placeholder and out pops a Tensor for each feature whose outer dimension is N. In your case, you'll have x with shape=[N, 30, 30, 1].
(Note that other serving systems, such as CloudML Engine, do not operate on Example objects, but the principles are the same).
I just want to briefly provide my found solution. Since I did not want to build a scalable production grade model, but a simple model runner in python to execute my CNN locally.
To export the model I used,
input_size = 900
def serving_input_receiver_fn():
inputs = {"x": tf.placeholder(shape=[None, input_size], dtype=tf.float32)}
return tf.estimator.export.ServingInputReceiver(inputs, inputs)
model.export_savedmodel(
export_dir_base=model_dir,
serving_input_receiver_fn=serving_input_receiver_fn)
To load and run it (without needing the model definition again) I used the tensorflow predictor class.
from tensorflow.contrib import predictor
class TFRunner:
""" runs a frozen cnn graph """
def __init__(self,model_dir):
self.predictor = predictor.from_saved_model(model_dir)
def run(self, input_list):
""" runs the input list through the graph, returns output """
if len(input_list) > 1:
inputs = np.vstack(input_list)
predictions = self.predictor({"x": inputs})
elif len(input_list) == 1:
predictions = self.predictor({"x": input_list[0]})
else:
predictions = []
return predictions

scheduled sampling in Tensorflow

The newest Tensorflow api about seq2seq model has included scheduled sampling:
https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/ScheduledEmbeddingTrainingHelper
https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/ScheduledOutputTrainingHelper
The original paper of scheduled sampling can be found here:
https://arxiv.org/abs/1506.03099
I read the paper but I cannot understand the difference between ScheduledEmbeddingTrainingHelper and ScheduledOutputTrainingHelper. The documentation only says ScheduledEmbeddingTrainingHelper is a training helper that adds scheduled sampling while ScheduledOutputTrainingHelper is a training helper that adds scheduled sampling directly to outputs.
I wonder what's the difference between these two helpers?
I contacted the engineer behind this, and he responded:
The output sampler either emits the raw rnn output or the raw ground truth at that time step. The embedding sampler treats the rnn output as logits of a distribution and either emits the embedding lookup of a sampled id from that categorical distribution or the raw ground truth at that time step.
Here's a basic example of using ScheduledEmbeddingTrainingHelper, using TensorFlow 1.3 and some higher level tf.contrib APIs. It's a sequence2sequence model, where the decoder's initial hidden state is the final hidden state of the encoder. It shows only how to train on a single batch (and apparently the task is "reverse this sequence"). For actual training tasks, I suggest looking at tf.contrib.learn APIs such as learn_runner, Experiment and tf.estimator.Estimator.
import tensorflow as tf
import numpy as np
from tensorflow.python.layers.core import Dense
vocab_size = 7
embedding_size = 5
lstm_units = 10
src_batch = np.array([[1, 2, 3], [4, 5, 6]])
trg_batch = np.array([[3, 2, 1], [6, 5, 4]])
# *_seq will have shape (2, 3), *_seq_len will have shape (2)
source_seq = tf.placeholder(shape=(None, None), dtype=tf.int32)
target_seq = tf.placeholder(shape=(None, None), dtype=tf.int32)
source_seq_len = tf.placeholder(shape=(None,), dtype=tf.int32)
target_seq_len = tf.placeholder(shape=(None,), dtype=tf.int32)
# add Start of Sequence (SOS) tokens to each sequence
batch_size, sequence_size = tf.unstack(tf.shape(target_seq))
sos_slice = tf.zeros([batch_size, 1], dtype=tf.int32) # 0 = start of sentence token
decoder_input = tf.concat([sos_slice, target_seq], axis=1)
embedding_matrix = tf.get_variable(
name="embedding_matrix",
shape=[vocab_size, embedding_size],
dtype=tf.float32)
source_seq_embedded = tf.nn.embedding_lookup(embedding_matrix, source_seq) # shape=(2, 3, 5)
decoder_input_embedded = tf.nn.embedding_lookup(embedding_matrix, decoder_input) # shape=(2, 4, 5)
unused_encoder_outputs, encoder_state = tf.nn.dynamic_rnn(
tf.contrib.rnn.LSTMCell(lstm_units),
source_seq_embedded,
sequence_length=source_seq_len,
dtype=tf.float32)
# Decoder:
# At each time step t and for each sequence in the batch, we get x_t by either
# (1) sampling from the distribution output_layer(t-1), or
# (2) reading from decoder_input_embedded.
# We do (1) with probability sampling_probability and (2) with 1 - sampling_probability.
# Using sampling_probability=0.0 is equivalent to using TrainingHelper (no sampling).
# Using sampling_probability=1.0 is equivalent to doing inference,
# where we don't supervise the decoder at all: output at t-1 is the input at t.
sampling_prob = tf.Variable(0.0, dtype=tf.float32)
helper = tf.contrib.seq2seq.ScheduledEmbeddingTrainingHelper(
decoder_input_embedded,
target_seq_len,
embedding_matrix,
sampling_probability=sampling_prob)
output_layer = Dense(vocab_size)
decoder = tf.contrib.seq2seq.BasicDecoder(
tf.contrib.rnn.LSTMCell(lstm_units),
helper,
encoder_state,
output_layer=output_layer)
outputs, state, seq_len = tf.contrib.seq2seq.dynamic_decode(decoder)
loss = tf.contrib.seq2seq.sequence_loss(
logits=outputs.rnn_output,
targets=target_seq,
weights=tf.ones(trg_batch.shape))
train_op = tf.contrib.layers.optimize_loss(
loss=loss,
global_step=tf.contrib.framework.get_global_step(),
optimizer=tf.train.AdamOptimizer,
learning_rate=0.001)
with tf.Session() as session:
session.run(tf.global_variables_initializer())
_, _loss = session.run([train_op, loss], {
source_seq: src_batch,
target_seq: trg_batch,
source_seq_len: [3, 3],
target_seq_len: [3, 3],
sampling_prob: 0.5
})
print("Loss: " + str(_loss))
For ScheduledOutputTrainingHelper, I would expect to just swap out the helper and use:
helper = tf.contrib.seq2seq.ScheduledOutputTrainingHelper(
target_seq,
target_seq_len,
sampling_probability=sampling_prob)
However this gives an error, since the LSTM cell expects a multidimensional input per timestep (of shape (batch_size, input_dims)). I will raise an issue in GitHub to find out if this is a bug, or there's some other way to use ScheduledOutputTrainingHelper.
This might also help you. This is for the case where you want to do scheduled sampling at each decoding step separately.
import tensorflow as tf
import numpy as np
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import gen_array_ops
from tensorflow.python.ops import math_ops
from tensorflow.python.ops.distributions import categorical
from tensorflow.python.ops.distributions import bernoulli
batch_size = 64
vocab_size = 50000
emb_dim = 128
output = tf.get_variable('output',
initializer=tf.constant(np.random.rand(batch_size,vocab_size)))
base_next_inputs = tf.get_variable('input',
initializer=tf.constant(np.random.rand(batch_size,emb_dim)))
embedding = tf.get_variable('embedding',
initializer=tf.constant(np.random.rand(vocab_size,emb_dim)))
select_sampler = bernoulli.Bernoulli(probs=0.99, dtype=tf.bool)
select_sample = select_sampler.sample(sample_shape=batch_size,
seed=123)
sample_id_sampler = categorical.Categorical(logits=output)
sample_ids = array_ops.where(
select_sample,
sample_id_sampler.sample(seed=123),
gen_array_ops.fill([batch_size], -1))
where_sampling = math_ops.cast(
array_ops.where(sample_ids > -1), tf.int32)
where_not_sampling = math_ops.cast(
array_ops.where(sample_ids <= -1), tf.int32)
sample_ids_sampling = array_ops.gather_nd(sample_ids, where_sampling)
inputs_not_sampling = array_ops.gather_nd(base_next_inputs,
where_not_sampling)
sampled_next_inputs = tf.nn.embedding_lookup(embedding,
sample_ids_sampling)
base_shape = array_ops.shape(base_next_inputs)
result1 = array_ops.scatter_nd(indices=where_sampling,
updates=sampled_next_inputs, shape=base_shape)
result2 = array_ops.scatter_nd(indices=where_not_sampling,
updates=inputs_not_sampling, shape=base_shape)
result = result1 + result2
I used the tensorflow documentation code to make this example.
https://github.com/tensorflow/tensorflow/blob/r1.5/tensorflow/contrib/seq2seq/python/ops/helper.py

Categories

Resources