Keras Masking for RNN with Varying Time Steps - python

I'm trying to fit an RNN in Keras using sequences that have varying time lengths. My data is in a Numpy array with format (sample, time, feature) = (20631, max_time, 24) where max_time is determined at run-time as the number of time steps available for the sample with the most time stamps. I've padded the beginning of each time series with 0, except for the longest one, obviously.
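For reference, a minimal sketch of how that kind of pre-padding can be done with Keras' pad_sequences (the name sequences is mine, not from the question, and this assumes each sample is an array of shape (time_i, 24)):
import numpy as np
from keras.preprocessing.sequence import pad_sequences

# sequences: hypothetical list of per-sample arrays, each of shape (time_i, 24)
train_x = pad_sequences(sequences, maxlen=max_time, dtype='float32',
                        padding='pre', value=0.0)
# train_x now has shape (num_samples, max_time, 24), zero-padded at the start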
I've initially defined my model like so...
model = Sequential()
model.add(Masking(mask_value=0., input_shape=(max_time, 24)))
model.add(LSTM(100, input_dim=24))
model.add(Dense(2))
model.add(Activation(activate))
model.compile(loss=weibull_loglik_discrete, optimizer=RMSprop(lr=.01))
model.fit(train_x, train_y, nb_epoch=100, batch_size=1000, verbose=2, validation_data=(test_x, test_y))
For completeness, here's the code for the loss function:
def weibull_loglik_discrete(y_true, ab_pred, name=None):
    y_ = y_true[:, 0]
    u_ = y_true[:, 1]
    a_ = ab_pred[:, 0]
    b_ = ab_pred[:, 1]
    hazard0 = k.pow((y_ + 1e-35) / a_, b_)
    hazard1 = k.pow((y_ + 1) / a_, b_)
    return -1 * k.mean(u_ * k.log(k.exp(hazard1 - hazard0) - 1.0) - hazard1)
And here's the code for the custom activation function:
def activate(ab):
    a = k.exp(ab[:, 0])
    b = k.softplus(ab[:, 1])
    a = k.reshape(a, (k.shape(a)[0], 1))
    b = k.reshape(b, (k.shape(b)[0], 1))
    return k.concatenate((a, b), axis=1)
When I fit the model and make some test predictions, every sample in the test set gets exactly the same prediction, which seems fishy.
Things get better if I remove the masking layer, which makes me think there's something wrong with the masking layer, but as far as I can tell, I've followed the documentation exactly.
Is there something mis-specified with the masking layer? Am I missing something else?

The way you implemented masking should be correct. If you have data with the shape (samples, timesteps, features), and you want to mask timesteps lacking data with a zero mask of the same size as the features argument, then you add Masking(mask_value=0., input_shape=(timesteps, features)). See here: keras.io/layers/core/#masking
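As a quick sanity check (a toy example of my own, not the question's data), you can confirm that the mask is wired up the way you expect by building a tiny version of the same stack:
import numpy as np
from keras.models import Sequential
from keras.layers import Masking, LSTM

# two samples, 5 timesteps, 3 features; first sample pre-padded with zeros
x = np.random.rand(2, 5, 3).astype('float32')
x[0, :2, :] = 0.0  # these all-zero timesteps should be masked out

m = Sequential()
m.add(Masking(mask_value=0., input_shape=(5, 3)))
m.add(LSTM(4))
print(m.predict(x).shape)  # (2, 4); the masked timesteps are skipped by the LSTM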
Your model could potentially be too simple, and/or your number of epochs could be insufficient for the model to differentiate between all of your classes. Try this model:
model = Sequential()
model.add(Masking(mask_value=0., input_shape=(max_time, 24)))
model.add(LSTM(256, input_dim=24))
model.add(Dense(1024))
model.add(Dense(2))
model.add(Activation(activate))
model.compile(loss=weibull_loglik_discrete, optimizer=RMSprop(lr=.01))
model.fit(train_x, train_y, nb_epoch=100, batch_size=1000, verbose=2, validation_data=(test_x, test_y))
If that does not work, try doubling the epochs a few times (e.g. 200, 400) and see if that improves the results.

I could not validate without actual data, but I had a similar experience with an RNN. In my case normalization solved the issue. Add a normalization layer to your model.
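A minimal sketch of one way to do that here (my own suggestion for combining normalization with the zero mask, not the answerer's exact code): standardize each feature using statistics computed from the non-padded timesteps only, so the padding stays exactly at the mask value of 0.
import numpy as np

# train_x: (samples, max_time, 24); padded timesteps are all-zero rows
pad_mask = np.all(train_x == 0, axis=-1)          # (samples, max_time) boolean
real_steps = train_x[~pad_mask]                   # (n_real_steps, 24)
mean, std = real_steps.mean(axis=0), real_steps.std(axis=0) + 1e-8

train_x_norm = (train_x - mean) / std
train_x_norm[pad_mask] = 0.                       # keep padding at the mask value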

Related

Tensorflow model performing significantly worse than Keras model

I was having an issue with my Tensorflow model and decided to try Keras. It appears to me, at least, that I am creating the same model with the same parameters, but the Tensorflow model just outputs the mean value of train_y while the Keras model actually varies according to the input. Am I missing something in my tf.Session? I usually use Tensorflow and have never had a problem like this.
Tensorflow Code:
score_inputs = tf.placeholder(np.float32, shape=(None, 100))
targets = tf.placeholder(np.float32, shape=(None), name="targets")
l2 = tf.contrib.layers.l2_regularizer(0.01)
first_layer = tf.layers.dense(score_inputs, 100, activation=tf.nn.relu, kernel_regularizer=l2)
outputs = tf.layers.dense(first_layer, 1, activation=None, kernel_regularizer=l2)
optimizer = tf.train.AdamOptimizer(0.001)
l2_loss = tf.losses.get_regularization_loss()
loss = tf.reduce_mean(tf.square(tf.subtract(targets, outputs)))
loss += l2_loss
rmse = tf.sqrt(tf.reduce_mean(tf.square(outputs - targets)))
mae = tf.reduce_mean(tf.sqrt(tf.square(outputs - targets)))
training_op = optimizer.minimize(loss)

batch_size = 32
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(10):
        avg_train_error = []
        for i in range(len(train_x) // batch_size):
            batch_x = train_x[i*batch_size: (i+1)*batch_size]
            batch_y = train_y[i*batch_size: (i+1)*batch_size]
            _, train_loss = sess.run([training_op, loss], {score_inputs: batch_x, targets: batch_y})

    feed = {score_inputs: test_x, targets: test_y}
    test_loss, test_mae, test_rmse, test_ouputs = sess.run([loss, mae, rmse, outputs], feed)
This has a mean absolute error of 0.682 and root mean squared error of 0.891.
The Keras Code:
inputs = Input(shape=(100,))
hidden = Dense(100, activation="relu", kernel_regularizer = regularizers.l2(0.01))(inputs)
outputs = Dense(1, activation=None, kernel_regularizer = regularizers.l2(0.01))(hidden)
model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=keras.optimizers.Adam(lr=0.001), loss='mse', metrics=['mae'])
model.fit(train_x, train_y, batch_size=32, epochs=10, shuffle=False)
keras_pred = model.predict(test_x)
This has a mean absolute error of 0.601 and root mean square error of 0.753.
It appears to me that I am defining the same network in both instances, yet as I said the Tensorflow model only outputs the mean value of train_y, while the Keras model performs a lot better. Any suggestions?
I'm going to try to point out the differences between the two pieces of code.
The Keras documentation here shows that the weights are initialized with 'glorot_uniform', whereas your TensorFlow weights use the default initializer, most probably something random, as the documentation doesn't clearly specify what the TensorFlow initialization is. So the initialization is most probably different, and it definitely matters.
The second difference is most probably the data type of the input: one is numpy.float32 and the other is Keras' default input type, which again isn't specified by the documentation.
@Priyank Pathak and @lehiester have made some valid points. Taking their suggestions into account, I'd suggest you change the following things and check again (a rough sketch of points 1 and 3 follows below):
Use the same kernel_initializer and data type
Use more epochs for better generalisation
Seed your random, numpy and tensorflow functions
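A rough sketch of what points 1 and 3 could look like here (the exact API depends on your TF/Keras versions, so treat this as an illustration; score_inputs and l2 are the names from the question's TensorFlow code):
import random
import numpy as np
import tensorflow as tf

# seed everything before building either model
random.seed(0)
np.random.seed(0)
tf.set_random_seed(0)          # tf.random.set_seed(0) in TF 2.x

# match Keras' default Dense initializer ('glorot_uniform') on the TF side,
# and make sure both models see float32 inputs
init = tf.glorot_uniform_initializer()
first_layer = tf.layers.dense(score_inputs, 100, activation=tf.nn.relu,
                              kernel_initializer=init, kernel_regularizer=l2)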
There isn't any obvious difference in the models, but the different results could be explained by random variation in training. Especially since you're only training for 10 epochs, the results could be fairly sensitive to the randomly chosen initial weights for the models.
Try running with more epochs (e.g. 1000) and running each one several times (e.g. 5)--the average results should be fairly close.

How to iterate through tensors in custom loss function?

I'm using Keras with the TensorFlow backend. My goal is to query the batch size of the current batch in a custom loss function. This is needed to compute values of the custom loss function which depend on the index of particular observations. I'd like to make this clearer with the minimal reproducible examples below.
(BTW: Of course I could use the batch size defined for the training procedure and plug its value in when defining the custom loss function, but there are reasons why it can vary. In particular, if epochsize % batchsize (epoch size modulo batch size) is not zero, the last batch of an epoch has a different size; for example, 1000 samples with batch_size=32 leave a final batch of only 8 samples. I didn't find a suitable approach on Stack Overflow, e.g.
Tensor indexing in custom loss function, Tensorflow custom loss function in Keras - loop over tensor and Looping over a tensor, because obviously the shape of a tensor can't be inferred when building the graph, which is the case for a loss function; shape inference is only possible when evaluating on the data, which in turn requires the graph. Hence I need to tell the custom loss function to do something with particular elements along a certain dimension without knowing the length of that dimension.)
(this is the same in all examples)
from keras.models import Sequential
from keras.layers import Dense, Activation
# Generate dummy data
import numpy as np
data = np.random.random((1000, 100))
labels = np.random.randint(2, size=(1000, 1))
model = Sequential()
model.add(Dense(32, activation='relu', input_dim=100))
model.add(Dense(1, activation='sigmoid'))
example 1: nothing special without issue, no custom loss
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
# Train the model, iterating on the data in batches of 32 samples
model.fit(data, labels, epochs=10, batch_size=32)
(Output omitted, this runs perfectly fine)
example 2: nothing special, with a fairly simple custom loss
def custom_loss(yTrue, yPred):
    loss = np.abs(yTrue - yPred)
    return loss

model.compile(optimizer='rmsprop',
              loss=custom_loss,
              metrics=['accuracy'])
# Train the model, iterating on the data in batches of 32 samples
model.fit(data, labels, epochs=10, batch_size=32)
(Output omitted, this runs perfectly fine)
example 3: the issue
def custom_loss(yTrue, yPred):
    print(yPred)  # Output: Tensor("dense_2/Sigmoid:0", shape=(?, 1), dtype=float32)
    n = yPred.shape[0]
    for i in range(n):  # TypeError: __index__ returned non-int (type NoneType)
        loss = np.abs(yTrue[i] - yPred[int(i/2)])
    return loss

model.compile(optimizer='rmsprop',
              loss=custom_loss,
              metrics=['accuracy'])
# Train the model, iterating on the data in batches of 32 samples
model.fit(data, labels, epochs=10, batch_size=32)
Of course the tensor has no shape info yet, and it can't be inferred when building the graph, only at training time. Hence for i in range(n) raises an error. Is there any way to do this?
The traceback of the output:
-------
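For what it's worth, yPred.shape is the static shape, which is (?, 1) at graph-construction time; the dynamic batch size has to come from a backend op. A minimal sketch (not a fix for the loop itself):
from keras import backend as K

def custom_loss(yTrue, yPred):
    n = K.shape(yPred)[0]   # dynamic batch size: a tensor, not a Python int
    # n cannot drive a Python for-loop; per-element logic needs tensor ops
    # (K.gather, K.slice, K.cumsum, ...) as in the answer below
    return K.mean(K.abs(yTrue - yPred))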
BTW here's my true custom loss function in case of any questions. I skipped it above for clarity and simplicity.
def neg_log_likelihood(yTrue, yPred):
    yStatus = yTrue[:, 0]
    yTime = yTrue[:, 1]
    n = yTrue.shape[0]
    for i in range(n):
        s1 = K.greater_equal(yTime, yTime[i])
        s2 = K.exp(yPred[s1])
        s3 = K.sum(s2)
        logsum = K.log(s3)
        loss = K.sum(yStatus[i] * yPred[i] - logsum)
    return loss
Here's an image of the partial negative log-likelihood of the Cox proportional hazards model.
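(For reference, the standard form of the Cox partial log-likelihood is $\sum_i \delta_i \big( f(x_i) - \log \sum_{j:\, t_j \ge t_i} e^{f(x_j)} \big)$, with $\delta_i$ the event/status indicator and the negative of this sum being minimized; that matches the yStatus[i] * yPred[i] - logsum term in the loop.)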
This is to clarify a question in the comments to avoid confusion. I don't think it is necessary to understand this in detail to answer the question.
As usual, don't loop. Looping has severe performance drawbacks and also invites bugs. Use only backend functions unless it's totally unavoidable (it usually isn't).
Solution for example 3:
So, there is a very weird thing there...
Do you really want to simply ignore half of your model's predictions? (Example 3)
Assuming this is true, just duplicate your tensor in the last dimension, flatten and discard half of it. You have the exact effect you want.
def custom_loss(true, pred):
    n = K.shape(pred)[0:1]
    pred = K.concatenate([pred]*2, axis=-1)  # duplicate in the last axis
    pred = K.flatten(pred)                   # flatten
    pred = K.slice(pred,                     # take only half (= n samples)
                   K.constant([0], dtype="int32"),
                   n)
    return K.abs(true - pred)
Solution for your loss function:
If you have sorted times from greater to lower, just do a cumulative sum.
Warning: If you have one time per sample, you cannot train with mini-batches!!!
batch_size = len(labels)
It makes sense to have time in an additional dimension (many times per sample), as is done in recurrent and 1D conv networks. Anyway, considering your example as expressed, that is a shape of (samples_equal_times,) for yTime:
def neg_log_likelihood(yTrue, yPred):
    yStatus = yTrue[:, 0]
    yTime = yTrue[:, 1]
    n = K.shape(yTrue)[0]

    # sort the times and everything else from greater to lower:
    # obs: you can have the data sorted already and avoid doing it here for performance
    # important: yTime will be sorted in the last dimension, make sure it's (None,) in this case
    # or that it's (None, time_length) in the case of many times per sample
    sortedTime, sortedIndices = tf.math.top_k(yTime, n, True)
    sortedStatus = K.gather(yStatus, sortedIndices)
    sortedPreds = K.gather(yPred, sortedIndices)

    # do the calculations
    exp = K.exp(sortedPreds)
    sums = K.cumsum(exp)  # this will have the sum for j >= i in the loop
    logsums = K.log(sums)

    return K.sum(sortedStatus * sortedPreds - logsums)

Constant Output and Prediction Syntax with LSTM Keras Network

I am new to neural networks and have two, probably pretty basic, questions. I am setting up a generic LSTM network to predict the future of a sequence, based on multiple features.
My training data is therefore of the shape (number of training sequences, length of each sequence, amount of features for each timestep).
Or to make it more specific, something like (2000, 10, 3).
I try to predict the value of one feature, not of all three.
Problem:
If I make my Network deeper and/or wider, the only output I get is the constant mean of the values to be predicted. Take this setup for example:
z0 = Input(shape=[None, len(dataset[0])])
z = LSTM(32, return_sequences=True, activation='softsign', recurrent_activation='softsign')(z0)
z = LSTM(32, return_sequences=True, activation='softsign', recurrent_activation='softsign')(z)
z = LSTM(64, return_sequences=True, activation='softsign', recurrent_activation='softsign')(z)
z = LSTM(64, return_sequences=True, activation='softsign', recurrent_activation='softsign')(z)
z = LSTM(128, activation='softsign', recurrent_activation='softsign')(z)
z = Dense(1)(z)
model = Model(inputs=z0, outputs=z)
print(model.summary())
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
history = model.fit(trainX, trainY, validation_split=0.1, epochs=200, batch_size=32,
                    callbacks=[ReduceLROnPlateau(factor=0.67, patience=3, verbose=1, min_lr=1E-5),
                               EarlyStopping(patience=50, verbose=1)])
If I just use one layer, like:
z0 = Input(shape=[None, len(dataset[0])])
z = LSTM(4, activation='softsign', recurrent_activation='softsign')(z0)
z = Dense(1)(z)
model = Model(inputs=z0, outputs=z)
print(model.summary())
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
history = model.fit(trainX, trainY, validation_split=0.1, epochs=200, batch_size=32,
                    callbacks=[ReduceLROnPlateau(factor=0.67, patience=3, verbose=1, min_lr=1E-5),
                               EarlyStopping(patience=200, verbose=1)])
The predictions are somewhat reasonable, at least they are not constant anymore.
Why does that happen? Around 2000 samples is not that many, but in the case of overfitting I would expect the predictions to match the training data perfectly...
EDIT: Solved. As stated in the comments, it's just that Keras always expects batches:
When I use:
`test=model.predict(trainX[0])`
to get the prediction for the first sequence, I get a dimension error:
"Error when checking : expected input_1 to have 3 dimensions, but got array with shape (3, 3)"
I need to feed in an array of sequences like:
`test=model.predict(trainX[0:1])`
This is a workaround, but I am not really sure whether this has any deeper meaning, or is just a syntax thing...
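It isn't just a syntax quirk: model.predict always expects a leading batch axis, so a single sequence has to arrive with shape (1, timesteps, features). Slicing with trainX[0:1] keeps that axis; an equivalent, more explicit way (same effect, names taken from the question) is:
import numpy as np

single = trainX[0]                                    # shape (timesteps, features)
test = model.predict(np.expand_dims(single, axis=0))  # shape (1, timesteps, features)
# or, equivalently:
test = model.predict(trainX[0][None, ...])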
This is because you have not normalised the input data.
Any neural network model will initially have weights normalised around zero. Since your training dataset has all positive values, the model will try to adjust its weights to predict only positive values. However, the activation function (in your case softsign) maps large values close to 1, so the model can do little except adjust the bias. That is why you are getting an almost constant line around the average value of the dataset.
For this, you can use a general tool like sklearn to pre-process your data. If you are using a pandas DataFrame, something like this will help:
data_df = (data_df - data_df.mean()) / data_df.std()
Or, to have the parameters in the model, you can consider adding a batch normalization layer to your model.
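A minimal sketch of that second option (placement is a judgment call and the exact import depends on your Keras version):
from keras.layers import BatchNormalization

z0 = Input(shape=[None, len(dataset[0])])
z = BatchNormalization()(z0)   # normalizes the features before they reach the LSTM
z = LSTM(4, activation='softsign', recurrent_activation='softsign')(z)
z = Dense(1)(z)
model = Model(inputs=z0, outputs=z)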

keras: model for learning with two input sequences and one scalar target value

I'm trying to implement this utility/preference learning approach in Keras: https://arxiv.org/abs/1706.03741. The problem:
xTrain: 2 input sequences, yTrain: binary value
The DNN only maps (single) input vectors to numeric values
To be more precise:
DNN: f(x)=y
t1 and t2 are sequences of vectors
Loss-function: cross_entropy(sigmoid(sum_t1(f(x_t1))-sum_t2(f(x_t2))),1)
The question: How do I implement that?
My current approach is to use a TimeDistributed network with flattened output. The input is a concatenation of both sequences (WIP code below). But I think this is a really bad way, and there has to be something better?
model = Sequential()
model.add(TimeDistributed(Masking(mask_value=0.), input_shape=(2000, inputDims)))
model.add(TimeDistributed(Dense(64)))
model.add(LeakyReLU(alpha=0.01))
model.add(TimeDistributed(Dense(64)))
model.add(LeakyReLU(alpha=0.01))
model.add(TimeDistributed(Dense(1, activation='sigmoid')))
model.add(Flatten())
model.compile(loss=customLoss,
              optimizer=Adam(),
              metrics=['mse'])
def customLoss(y_true, y_pred):
    y_pred = K.reshape(y_pred, shape=[2, 1000])
    y_pred = K.sum(y_pred, axis=1)
    y_pred = K.sigmoid(y_pred[0] - y_pred[1])
    y_pred = K.clip(y_pred, epsilon, 1.0 - epsilon)
    loss = K.binary_crossentropy(y_pred, 1)
    return loss
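One possibly cleaner direction, sketched under assumptions rather than as a drop-in solution (seq_len and inputDims are assumed sizes, and masking is omitted for brevity): give the model two sequence inputs, run both through a shared per-timestep scorer f, sum each over time, and output sigmoid(sum_t1 - sum_t2), so the loss in the question reduces to plain binary cross-entropy against the preference label.
from keras.models import Model, Sequential
from keras.layers import Input, Dense, LeakyReLU, TimeDistributed, Lambda
from keras.optimizers import Adam
from keras import backend as K

seq_len, inputDims = 1000, 8   # assumed sizes, adjust to your data

# shared scorer f(x): maps one timestep vector to a scalar
f = Sequential([
    Dense(64, input_shape=(inputDims,)), LeakyReLU(alpha=0.01),
    Dense(64), LeakyReLU(alpha=0.01),
    Dense(1),
])

t1 = Input(shape=(seq_len, inputDims))
t2 = Input(shape=(seq_len, inputDims))
s1 = Lambda(lambda z: K.sum(z, axis=1))(TimeDistributed(f)(t1))  # sum_t1 f(x_t1)
s2 = Lambda(lambda z: K.sum(z, axis=1))(TimeDistributed(f)(t2))  # sum_t2 f(x_t2)
p = Lambda(lambda ab: K.sigmoid(ab[0] - ab[1]))([s1, s2])

model = Model(inputs=[t1, t2], outputs=p)
model.compile(loss='binary_crossentropy', optimizer=Adam())
# model.fit([x_t1, x_t2], labels)  # labels: 1 where sequence 1 is preferred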

model.get_weights() returning array of NaNs after training due to NaN masking

I'm trying to train an LSTM to classify sequences of various lengths. I want to get the weights of this model so I can use them in a stateful version of the model. Before training, the weights are normal. Also, the training seems to run successfully, with a gradually decreasing error. However, when I change the mask value from -10 to np.nan, mod.get_weights() starts returning arrays of NaNs and the validation error drops suddenly to a value close to zero. Why is this occurring?
from keras import models
from keras.layers import Dense, Masking, LSTM
from keras.optimizers import RMSprop
from keras.losses import categorical_crossentropy
from keras.preprocessing.sequence import pad_sequences
import numpy as np
import matplotlib.pyplot as plt
def gen_noise(noise_len, mag):
    return np.random.uniform(size=noise_len) * mag

def gen_sin(t_val, freq):
    return 2 * np.sin(2 * np.pi * t_val * freq)

def train_rnn(x_train, y_train, max_len, mask, number_of_categories):
    epochs = 3
    batch_size = 100

    # three hidden layers of 256 each
    vec_dims = 1
    hidden_units = 256
    in_shape = (max_len, vec_dims)

    model = models.Sequential()
    model.add(Masking(mask, name="in_layer", input_shape=in_shape,))
    model.add(LSTM(hidden_units, return_sequences=False))
    model.add(Dense(number_of_categories, input_shape=(number_of_categories,),
                    activation='softmax', name='output'))
    model.compile(loss=categorical_crossentropy, optimizer=RMSprop())

    model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
              validation_split=0.05)
    return model

def gen_sig_cls_pair(freqs, t_stops, num_examples, noise_magnitude, mask, dt=0.01):
    x = []
    y = []
    num_cat = len(freqs)
    max_t = int(np.max(t_stops) / dt)
    for f_i, f in enumerate(freqs):
        for t_stop in t_stops:
            t_range = np.arange(0, t_stop, dt)
            t_len = t_range.size
            for _ in range(num_examples):
                sig = gen_sin(f, t_range) + gen_noise(t_len, noise_magnitude)
                x.append(sig)
                one_hot = np.zeros(num_cat, dtype=np.bool)
                one_hot[f_i] = 1
                y.append(one_hot)
    pad_kwargs = dict(padding='post', maxlen=max_t, value=mask, dtype=np.float32)
    return pad_sequences(x, **pad_kwargs), np.array(y)

if __name__ == '__main__':
    noise_mag = 0.01
    mask_val = -10
    frequencies = (5, 7, 10)
    signal_lengths = (0.8, 0.9, 1)
    dt_val = 0.01
    x_in, y_in = gen_sig_cls_pair(frequencies, signal_lengths, 50, noise_mag, mask_val)
    mod = train_rnn(x_in[:, :, None], y_in, int(np.max(signal_lengths) / dt_val), mask_val, len(frequencies))
This persists even if I change the network architecture to return_sequences=True and wrap the Dense layer with TimeDistributed; removing the LSTM layer doesn't help either.
I had the same problem. In your case I can see it was probably something different, but someone might have the same problem and come here from Google. In my case I was passing the sample_weight parameter to the fit() method, and when the sample weights contained some zeros, get_weights() returned an array with NaNs. When I omitted the samples where sample_weight=0 (they were useless anyway with a weight of 0), it started to work.
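In case it helps with that variant of the problem, dropping the zero-weight samples before fitting could look roughly like this (x_train, y_train and sample_weight are the hypothetical arrays passed to fit()):
import numpy as np

keep = sample_weight != 0
model.fit(x_train[keep], y_train[keep],
          sample_weight=sample_weight[keep],
          batch_size=batch_size, epochs=epochs)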
The weights are indeed changing. The unchanging weights are from the edge of the image, and they may have not changed because the edge isn't helpful for classifying digits.
To check, select a specific layer and inspect the result:
print(model.layers[70].get_weights()[1])
70 is the index of the last layer in my case.
The get_weights() method of a keras.engine.training.Model instance retrieves the weights of the model.
It returns a flat list of Numpy arrays, in other words the list of all weight tensors in the model.
mw = model.get_weights()
print(mw)
If you get NaN(s), this has a specific meaning: you are simply dealing with the vanishing gradients problem (in some cases even with exploding gradients).
I would first try to alter the model to reduce the chance of vanishing gradients. Try reducing hidden_units first, and normalize your activations.
Even though LSTMs exist to mitigate the vanishing/exploding gradients problem, you still need to keep the activations in the (-1, 1) interval.
Note this interval is where floating-point numbers are most precise.
Working with np.nan under the masking layer is not a predictable operation, since you cannot do an equality comparison with np.nan.
Try print(np.nan == np.nan) and it will return False. This is old behaviour required by the IEEE 754 standard.
Or it may actually be a bug in Tensorflow triggered by this IEEE 754 behaviour.
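A quick illustration of why a NaN mask value cannot behave like a finite one:
import numpy as np

print(np.nan == np.nan)   # False: NaN never compares equal, per IEEE 754
print(np.isnan(np.nan))   # True: NaN has to be detected with isnan, not ==
# so Masking(mask_value=np.nan) can never match the padded timesteps;
# the NaNs flow into the LSTM and eventually into the weights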
