I've had this suspicion for the longest time but couldn't figure out whether it was the case or not, so here's the scenario:
I'm trying to build a model that has 3 features from 3 different inputs:
A text sequence
A float
A float
All three of these together make up one time step. But since I'm using GloVe with 100 dimensions to vectorize my text sequence, a 20-word text sequence ends up with length 2000. Hence, the overall input per step has length 2002 (at each time step a matrix with shape (1, 2002) is fed in, with 2000 of those values coming from a single feature).
Is the text sequence overwhelming the two floats, so that whatever their values are, they're irrelevant to the prediction? If so, what can I do to fix this? Perhaps manually weight how much each feature should be used? Code is attached:
from keras.backend import int_shape
from keras.layers import (Input, Embedding, Dropout, Reshape, concatenate,
                          Convolution1D, MaxPool1D, LSTM, Dense)
from keras.models import Model

def build_model(embedding_matrix) -> Model:
    text = Input(shape=(9, news_text.shape[1]), name='text')
    price = Input(shape=(9, 1), name='price')
    volume = Input(shape=(9, 1), name='volume')

    text_layer = Embedding(
        embedding_matrix.shape[0],
        embedding_matrix.shape[1],
        weights=[embedding_matrix]
    )(text)
    text_layer = Dropout(0.2)(text_layer)
    # Flatten the vectorized text matrix
    text_layer = Reshape((9, int_shape(text_layer)[2] * int_shape(text_layer)[3]))(text_layer)

    inputs = concatenate([
        text_layer,
        price,
        volume
    ])
    output = Convolution1D(128, 5, activation='relu')(inputs)
    output = MaxPool1D(pool_size=4)(output)
    output = LSTM(units=128, dropout=0.2, return_sequences=True)(output)
    output = LSTM(units=128, dropout=0.2, return_sequences=True)(output)
    output = LSTM(units=128, dropout=0.2)(output)
    output = Dense(units=2, activation='linear', name='output')(output)

    model = Model(
        inputs=[text, price, volume],
        outputs=[output]
    )
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model
Edit: note that the input shape into the LSTM is (?, 9, 2002), which means that right now the 2000 values coming from the text are treated as 2000 independent features.
As I mentioned in my comments, one approach is to use a two-branch model, where one branch processes the text data and the other processes the two float features. At the end, the outputs of the two branches are merged together:
# Branch one: process text data
text_input = Input(shape=(news_text.shape[1],), name='text')
text_emb = Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
                     weights=[embedding_matrix])(text_input)
# you may alternatively use only Conv1D + MaxPool1D or
# stack multiple LSTM layers on top of each other or
# use a combination of Conv1D, MaxPool1D and LSTM
text_conv = Convolution1D(128, 5, activation='relu')(text_emb)
text_lstm = LSTM(units=128, dropout=0.2)(text_conv)
# Branch two: process float features
price_input = Input(shape=(9, 1), name='price')
volume_input = Input(shape=(9, 1), name='volume')
pv = concatenate([price_input, volume_input])
# you can also stack multiple LSTM layers on top of each other
pv_lstm = LSTM(units=128, dropout=0.2)(pv)
# merge output of branches
text_pv = concatenate([text_lstm, pv_lstm])
output = Dense(units=2, activation='linear', name='output')(text_pv)
model = Model(
inputs=[text_input, price_input, volume_input],
outputs=[output]
)
model.compile(optimizer='adam', loss='mean_squared_error')
As I have commented in the code, this is just a simple illustration. You may need to further add or remove layers or regularization and tune the hyper-parameters.
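For completeness, a minimal training sketch for the two-branch model above; news_data, price_data, volume_data and targets are hypothetical placeholder arrays, not names from the original post:
# Hedged training sketch; the arrays below are placeholders for your data.
# The dict keys match the Input layer names defined above.
model.fit(
    x={'text': news_data,       # shape (num_samples, sequence_length)
       'price': price_data,     # shape (num_samples, 9, 1)
       'volume': volume_data},  # shape (num_samples, 9, 1)
    y=targets,                  # shape (num_samples, 2)
    epochs=10,
    batch_size=32
)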
I'm trying to understand the Keras LSTM layer a bit better with regard to timesteps, but am still struggling a bit.
I want to create a model that is able to compare 2 inputs (a siamese network). So my input is a preprocessed text, twice. The preprocessing is done as follows:
max_len = 64
data['cleaned_text_1'] = data.apply(lambda x: clean_string(x['text_1']), axis=1)  # 'text_1' is assumed to be the raw text column
data['text_1_seq'] = t.texts_to_sequences(data['cleaned_text_1'].astype(str).values)
data['text_1_seq_pad'] = [list(x) for x in pad_sequences(data['text_1_seq'], maxlen=max_len, padding='post')]
The same is done for the second text input; t is a keras.preprocessing.text.Tokenizer instance.
I defined the model with:
common_embed = Embedding(
    name="synopsis_embedd",
    input_dim=len(t.word_index) + 1,
    output_dim=300,
    input_length=len(data['text_1_seq_pad'].tolist()[0]),
    trainable=True
)
lstm_layer = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(32, dropout=0.2, recurrent_dropout=0.2)
)
input1 = tf.keras.Input(shape=(len(data['text_1_seq_pad'].tolist()[0]),))
e1 = common_embed(input1)
x1 = lstm_layer(e1)
input2 = tf.keras.Input(shape=(len(data['text_1_seq_pad'].tolist()[0]),))
e2 = common_embed(input2)
x2 = lstm_layer(e2)
merged = tf.keras.layers.Lambda(
    function=l1_distance, output_shape=l1_dist_output_shape, name='L1_distance'
)([x1, x2])
conc = Concatenate(axis=-1)([merged, x1, x2])
x = Dropout(0.01)(conc)
preds = tf.keras.layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs=[input1, input2], outputs=preds)
that seems to work if I feed the numpy data with the fit method:
model.fit(
    x=[np.array(data['text_1_seq_pad'].tolist()), np.array(data['text_2_seq_pad'].tolist())],
    y=y_train.values.reshape(-1, 1),
    epochs=epochs,
    batch_size=batch_size,
    validation_data=([np.array(val['text_1_seq_pad'].tolist()), np.array(val['text_2_seq_pad'].tolist())],
                     y_val.values.reshape(-1, 1)),
)
What I'm trying to understand at the moment is what the shape is in my case for the LSTM layer, in terms of:
samples
time_steps
features
Is it correct that the input_shape for the LSTM layer would be input_shape=(300,1) because I set the embedding output dim to 300 and I have only 1 input feature per LSTM?
And do I need to reshape the embedding output or can I just set
lstm_layer = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(32, input_shape=(300, 1), dropout=0.2, recurrent_dropout=0.2)
)
from the embedding output?
An example notebook can be found on GitHub or as a Colab.
In general, an LSTM layer needs 3D inputs shaped this way: (batch_size, length of the input sequence, number of features). (Batch size is not really important, so you can just consider that one input needs to have the shape (length of sequence, number of features per item).)
In your case, the output dim of your embedding layer is 300. So your LSTM has 300 features.
Then, using an LSTM on sentences requires a constant number of tokens. An LSTM works with a constant input dimension; you cannot pass it a text with 12 tokens followed by another one with 68 tokens. You need to fix a limit and pad the sequences if needed.
So, if your sentence is 20 tokens long and your limit is 50, you need to pad the sequence (add at its end) with 30 “neutral” tokens (often zeros).
In the end, your LSTM input dimension must be (number of tokens per text, dimension of your embedding outputs) -> (50, 300) in my example.
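As a minimal illustration of the padding step (my sketch, not code from the question):
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Two sequences of different lengths, padded to a fixed limit of 50 tokens
sequences = [[4, 7, 12], [4, 7, 12, 3, 9, 1, 15]]
padded = pad_sequences(sequences, maxlen=50, padding='post')
print(padded.shape)  # (2, 50)
# After an Embedding layer with output_dim=300, a batch of these becomes
# (batch_size, 50, 300) -- exactly the (timesteps, features) shape the LSTM expects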
To learn more about it, I suggest you take a look at this (in your case, you can replace time_steps with number_of_tokens):
https://shiva-verma.medium.com/understanding-input-and-output-shape-in-lstm-keras-c501ee95c65e
I am trying to build an RNN-based model in TensorFlow that takes a sequence of categorical values as the input and a sequence of categorical values as the output.
For example, if I have a sequence of 30 values, the first 25 would be the training data and the last 5 would be the target. Imagine the data is something like a person pressing keys on a computer keyboard and recording their key presses over time.
I've tried to feed the training data and targets into this model in different shapes, and I always get an error that indicates the data is in the wrong shape.
I've included a code sample that should run and demonstrate what I'm trying to do and the failure I'm seeing.
In the code sample, I've used windows for batches. So if there are 90 values in the sequence, the first 25 values would be the training data for the first batch, and the next 5 values would be the target. The next batch would be the next 30 values (25 training values, 5 target values).
import numpy as np
import tensorflow as tf
from tensorflow import keras
num_categories = 20
data_sequence = np.random.choice(num_categories, 10000)
def create_target(batch):
    X = tf.cast(batch[:, :-5][:, :, None], tf.float32)
    Y = batch[:, -5:][:, :, None]
    return X, Y

def add_windows(data):
    data = tf.data.Dataset.from_tensor_slices(data)
    return data.window(20, shift=1, drop_remainder=True)
dataset = tf.data.Dataset.from_tensor_slices(data_sequence)
dataset = dataset.window(30, drop_remainder=True)
dataset = dataset.flat_map(lambda x: x.batch(30))
dataset = dataset.batch(5)
dataset = dataset.map(create_target)
model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(num_categories, activation="softmax"))
])
optimizer = keras.optimizers.Adam()
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer)
model.fit(dataset, epochs=1)
The error I get when I run the above code is
Node: 'sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits'
logits and labels must have the same first dimension, got logits shape [125,20] and labels shape [25]
I've also tried the following model, but the errors are similar.
model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.SimpleRNN(20),
    keras.layers.Dense(num_categories, activation="softmax")
])
Does anybody have any recommendations about what I need to do to get this working?
Thanks.
I figured out the issue. The size of the time dimension needs to be the same for the training data and the target.
If you look at my original example code, the training data has these shapes
X.shape = (1, 25, 1)
Y.shape = (1, 5, 1)
To fix it, the time dimension should be the same.
X.shape = (1, 15, 1)
Y.shape = (1, 15, 1)
Here is the updated function that will let the model train. Note that all I did was update the array sizes so they are equally sized. The value of 15 is used because the original array length is 30.
def create_target(batch):
    X = tf.cast(batch[:, :-15][:, :, None], tf.float32)
    Y = batch[:, -15:][:, :, None]
    return X, Y
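To verify the fix, one batch from the pipeline can be inspected (a quick sketch assuming the dataset code from the question):
# Inspect the shapes produced by the corrected create_target;
# the batch size of 5 comes from dataset.batch(5) in the original code
for X, Y in dataset.take(1):
    print(X.shape)  # (5, 15, 1)
    print(Y.shape)  # (5, 15, 1)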
I would like to use an RNN for time series prediction, using 96 backward steps to predict 96 steps into the future. For this I have the following code:
#Import modules
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
from tensorflow import keras
# Define the parameters of the RNN and the training
epochs = 1
batch_size = 50
steps_backwards = 96
steps_forward = 96
split_fraction_trainingData = 0.70
split_fraction_validatinData = 0.90
randomSeedNumber = 50
helpValueStrides = int(steps_backwards /steps_forward)
#Read dataset
df = pd.read_csv('C:/Users1/Desktop/TestValues.csv', sep=';', header=0, low_memory=False, infer_datetime_format=True, parse_dates={'datetime':[0]}, index_col=['datetime'])
# standardize data
data = df.values
indexWithYLabelsInData = 0
data_X = data[:, 0:3]
data_Y = data[:, indexWithYLabelsInData].reshape(-1, 1)
scaler_standardized_X = StandardScaler()
data_X = scaler_standardized_X.fit_transform(data_X)
data_X = pd.DataFrame(data_X)
scaler_standardized_Y = StandardScaler()
data_Y = scaler_standardized_Y.fit_transform(data_Y)
data_Y = pd.DataFrame(data_Y)
# Prepare the input data for the RNN
series_reshaped_X = np.array([data_X[i:i + (steps_backwards+steps_forward)].copy() for i in range(len(data) - (steps_backwards+steps_forward))])
series_reshaped_Y = np.array([data_Y[i:i + (steps_backwards+steps_forward)].copy() for i in range(len(data) - (steps_backwards+steps_forward))])
timeslot_x_train_end = int(len(series_reshaped_X)* split_fraction_trainingData)
timeslot_x_valid_end = int(len(series_reshaped_X)* split_fraction_validatinData)
X_train = series_reshaped_X[:timeslot_x_train_end, :steps_backwards]
X_valid = series_reshaped_X[timeslot_x_train_end:timeslot_x_valid_end, :steps_backwards]
X_test = series_reshaped_X[timeslot_x_valid_end:, :steps_backwards]
Y_train = series_reshaped_Y[:timeslot_x_train_end, steps_backwards:]
Y_valid = series_reshaped_Y[timeslot_x_train_end:timeslot_x_valid_end, steps_backwards:]
Y_test = series_reshaped_Y[timeslot_x_valid_end:, steps_backwards:]
# Build the model and train it
np.random.seed(randomSeedNumber)
tf.random.set_seed(randomSeedNumber)
model = keras.models.Sequential([
keras.layers.SimpleRNN(10, return_sequences=True, input_shape=[None, 3]),
keras.layers.SimpleRNN(10, return_sequences=True),
keras.layers.Conv1D(16, helpValueStrides, strides=helpValueStrides),
keras.layers.TimeDistributed(keras.layers.Dense(1))
])
model.compile(loss="mean_squared_error", optimizer="adam", metrics=['mean_absolute_percentage_error'])
history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size, validation_data=(X_valid, Y_valid))
#Predict the test data
Y_pred = model.predict(X_test)
prediction_lastValues_list=[]
for i in range(0, len(Y_pred)):
    prediction_lastValues_list.append((Y_pred[i][0][1 - 1]))
# Create the dataframe for the whole data
wholeDataFrameWithPrediciton = pd.DataFrame((X_test[:,1]))
wholeDataFrameWithPrediciton.rename(columns = {indexWithYLabelsInData:'actual'}, inplace = True)
wholeDataFrameWithPrediciton.rename(columns = {1:'Feature 1'}, inplace = True)
wholeDataFrameWithPrediciton.rename(columns = {2:'Feature 2'}, inplace = True)
wholeDataFrameWithPrediciton['predictions'] = prediction_lastValues_list
wholeDataFrameWithPrediciton['difference'] = (wholeDataFrameWithPrediciton['predictions'] - wholeDataFrameWithPrediciton['actual']).abs()
wholeDataFrameWithPrediciton['difference_percentage'] = ((wholeDataFrameWithPrediciton['difference'])/(wholeDataFrameWithPrediciton['actual']))*100
# Inverse the scaling (traInv: transformation inversed)
data_X_traInv = scaler_standardized_X.inverse_transform(data_X)
data_Y_traInv = scaler_standardized_Y.inverse_transform(data_Y)
series_reshaped_X_notTransformed = np.array([data_X_traInv[i:i + (steps_backwards+steps_forward)].copy() for i in range(len(data) - (steps_backwards+steps_forward))])
X_test_notTranformed = series_reshaped_X_notTransformed[timeslot_x_valid_end:, :steps_backwards]
predictions_traInv = scaler_standardized_Y.inverse_transform(wholeDataFrameWithPrediciton['predictions'].values.reshape(-1, 1))
# predictions_traInv = wholeDataFrameWithPrediciton['predictions'].values.reshape(-1, 1)  # alternative without inverse scaling
# Create the dataframe for the inverse-transformed data
wholeDataFrameWithPrediciton_traInv = pd.DataFrame((X_test_notTranformed[:,0]))
wholeDataFrameWithPrediciton_traInv.rename(columns = {indexWithYLabelsInData:'actual'}, inplace = True)
wholeDataFrameWithPrediciton_traInv.rename(columns = {1:'Feature 1'}, inplace = True)
wholeDataFrameWithPrediciton_traInv['predictions'] = predictions_traInv
wholeDataFrameWithPrediciton_traInv['difference_absolute'] = (wholeDataFrameWithPrediciton_traInv['predictions'] - wholeDataFrameWithPrediciton_traInv['actual']).abs()
wholeDataFrameWithPrediciton_traInv['difference_percentage'] = ((wholeDataFrameWithPrediciton_traInv['difference_absolute'])/(wholeDataFrameWithPrediciton_traInv['actual']))*100
wholeDataFrameWithPrediciton_traInv['difference'] = (wholeDataFrameWithPrediciton_traInv['predictions'] - wholeDataFrameWithPrediciton_traInv['actual'])
Here is some test data (don't worry about the actual values, as I made them up; just the shape is important): Download test data
How can the output of the Y_pred data be interpreted? Which of those values gives me the predicted values 96 steps into the future? I have attached a screenshot of the Y_pred data, once with 5 output neurons in the last layer and once with only 1. Can anyone tell me how to interpret the Y_pred data, i.e. what exactly is the RNN predicting? I can use any number of values in the output (last layer) of the RNN model. The Y_pred data always has the shape (batch size of X_test, time sequence, number of output neurons). My question is about the last dimension: I thought these might be the features, but this is not true in my case, as I only have 1 output feature (you can see that in the shape of the Y_train, Y_test and Y_valid data).
**Reminder**: The bounty is expiring soon and unfortunately I still have not received an answer. So I would like to remind you of the question and the bounty. I'll highly appreciate every comment.
It may be useful to step through the model inputs/outputs in detail.
When using the keras.layers.SimpleRNN layer with return_sequences=True, the output will be a 3-D tensor where the 0th axis is the batch size, the 1st axis is the timestep, and the 2nd axis is the number of hidden units (in the case of both SimpleRNN layers in your model, 10).
The Conv1D layer will produce an output tensor where the last dimension becomes the number of filters (in the case of your model, 16), as it's just being convolved with the input.
With keras.layers.TimeDistributed, the layer supplied (in the example provided, Dense(1)) is applied to each timestep in the batch independently. So with 96 timesteps, we have 96 outputs for each record in the batch.
So stepping through your model:
model = keras.models.Sequential([
    keras.layers.SimpleRNN(10, return_sequences=True, input_shape=[None, 3]),  # output size is (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 10)
    keras.layers.SimpleRNN(10, return_sequences=True),  # output size is (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 10)
    keras.layers.Conv1D(16, helpValueStrides, strides=helpValueStrides),  # output size is (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 16)
    keras.layers.TimeDistributed(keras.layers.Dense(1))  # output size is (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 1)
])
To answer your question, the output tensor from your model contains the predicted values for 96 steps into the future, for each sample. If it's easier to conceptualize, for the case of 1 output, you can apply np.squeeze to the result of model.predict, which will make the output 2-D:
Y_pred = model.predict(X_test) # output size is (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 1)
Y_pred_squeezed = np.squeeze(Y_pred) # output size is (BATCH_SIZE, NUMBER_OF_TIMESTEPS)
In that way, you have a rectangular matrix where each row corresponds to a sample in the batch, and each column i corresponds to the prediction for the timestep i.
In the loop after the prediction step, all the timestep predictions are being discarded except for the first one:
for i in range(0, len(Y_pred)):
    prediction_lastValues_list.append((Y_pred[i][0][1 - 1]))
which means the end result is just a list of predictions for the first timestep for each sample in the batch. If you wanted the prediction for the 96th timestep, you could do:
for i in range(0, len(Y_pred)):
    prediction_lastValues_list.append((Y_pred[i][-1][1 - 1]))
Notice the -1 instead of 0 for the second bracket, to ensure we grab the last predicted timestep instead of the first.
As a side note, to replicate the results, I had to make one change to your code, specifically when creating series_reshaped_X and series_reshaped_Y. I hit an exception when using np.array to create the array from the list: ValueError: cannot copy sequence with size 192 to array axis with dimension 3 , but looking at what you were doing (joining tensors along a new axis), I changed it to np.stack, which will accomplish the same goal (https://numpy.org/doc/stable/reference/generated/numpy.stack.html):
series_reshaped_X = np.stack([data_X[i:i + (steps_backwards + steps_forward)].copy() for i in
range(len(data) - (steps_backwards + steps_forward))])
series_reshaped_Y = np.stack([data_Y[i:i + (steps_backwards + steps_forward)].copy() for i in
range(len(data) - (steps_backwards + steps_forward))])
Update
"What are those 5 values representing when I only have 1 target feature?"
That's actually just the broadcasting feature of the Tensorflow API (which is also a feature of NumPy). If you perform an arithmetic operation on two tensors with differing shapes, it will try to make them compatible. In this case, if you change the output layer size to be "5" instead of "1" (keras.layers.Dense(5)), the output size is (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 5) instead of (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 1), which just means the output from the convolutional layer is going into 5 neurons instead of 1. When the loss (mean squared error) is computed between the two, the size of the label tensor ((BATCH_SIZE, NUMBER_OF_TIMESTEPS, 1)) is broadcast to the size of the prediction tensor ((BATCH_SIZE, NUMBER_OF_TIMESTEPS, 5)). In this case, the broadcasting is accomplished by replicating the column. For example, if Y_train had [-1.69862224] in the first row for the first timestep, and Y_pred had [-0.6132075 , -0.6621697 , -0.7712653 , -0.60011995, -0.48753992] in the first row for the first timestep, to perform the subtraction operation, the entry in Y_train is converted to [-1.69862224, -1.69862224, -1.69862224, -1.69862224, -1.69862224].
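The same broadcasting behavior can be reproduced in plain NumPy (a standalone illustration, not part of the original code):
import numpy as np

y_true = np.array([[-1.69862224]])  # label, shape (1, 1)
y_pred = np.array([[-0.6132075, -0.6621697, -0.7712653, -0.60011995, -0.48753992]])  # shape (1, 5)

# NumPy broadcasts the (1, 1) label against the (1, 5) prediction by
# replicating the single target value across all 5 columns
diff = y_pred - y_true
print(diff.shape)  # (1, 5)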
And which of those 5 values is the "correct" value to choose for the 96-time-step-ahead prediction?
There is no real "correct" value when trained this way - as detailed above, this is just a feature of the API. All outputs should converge to the single target value for the timestep, since they're all being compared to that value, so you could technically train that way, but it's just adding parameters and complexity to the model (and you would just have to choose one to be the "real" prediction). The correct approach for getting the prediction for 96 timesteps ahead is detailed in the original answer, but just to reiterate, the output of the model contains future timestep predictions for each sample in the batch. The output tensor could be iterated over to retrieve the predictions for each timestep, for each sample. Furthermore, ensure the number of neurons in the final dense layer matches the number of target values you are trying to predict, otherwise you'll hit the broadcasting issue (and the "correct" output will be unclear).
Just to be exhaustive (and I am not recommending this), if you really wanted to incorporate several neurons in the output despite only having one target value, you could do something like averaging the results:
for i in range(0, len(Y_pred)):
    prediction_lastValues_list.append(np.mean(Y_pred[i][0]))
But there is absolutely no benefit to this approach, so I would recommend just sticking with the previous suggestion.
Update 2
Is my model only predicting one time slot which is 96 time steps into the future or is it also predicting everything in between?
The model is predicting everything in between. So for a sample at timestep t, the output of the model are predictions [t + 1, t + 2, ..., t + NUMBER_OF_TIMESTEPS]. Per my original answer, "the output tensor from your model contains the predicted values for 96 steps into the future, for each sample". To specify that in your evaluation code, you can do something like:
Y_pred = np.squeeze(Y_pred)
predictions_for_all_samples_and_timesteps = Y_pred.tolist()
This results in a list of length BATCH_SIZE, and each element in the list is a list of length NUMBER_OF_TIMESTEPS (to be clear, predictions_for_all_samples_and_timesteps is a list of lists). The element at index i in predictions_for_all_samples_and_timesteps contains the predictions for each timestep from 1-96 for the i^th sample (row) in X_test.
As a side note, you could omit np.squeeze, but then you will have a list of lists of lists, where each element in the inner list is a list of one item (instead of [[1, 2, 3, ...], ...], the output would look like [[[1], [2], [3], ...], ...]).
Update 3
Y_test and Y_pred are both 3-D numpy arrays of size (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 1). To compare them, you can take the absolute (or squared) difference between the two:
abs_diff = np.abs(Y_pred - Y_test)
This results in an array of the same dimensions, (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 1), which you can squeeze to (BATCH_SIZE, NUMBER_OF_TIMESTEPS) as before. You can then iterate over the rows and generate a plot of the timestep error for each row.
import matplotlib.pyplot as plt

abs_diff = np.squeeze(abs_diff)  # (BATCH_SIZE, NUMBER_OF_TIMESTEPS)
for diff in abs_diff:
    print(diff.shape)
    plt.plot(range(len(diff)), diff)
It may get a bit unwieldy with a large batch size (as you can see in the image), so maybe you plot a subset of the rows. You can also transform the absolute difference to an error percentage if you would prefer to plot that:
percentage_diff = abs_diff / Y_test
which would be the absolute difference over the actual value, as I see you were originally doing in Pandas. This numpy array will have the same dimensions, so you can iterate over it and generate plots in the same fashion.
For future inquiries, instead of posting the comments, please open a new question and just provide the link - I would be happy to continue helping, but I would like to continue gaining reputation from it.
I disagree with @danielcahall on just one point:
The output tensor from your model contains the predicted values for 96 steps into the future, for each sample
The output does contain 96 time steps, one for each input time step, and you can take an output to mean whatever you want. But this is just not a good model for what you're trying to do. The main reason is that the RNNs you're using run in a single direction.
x x x x x x # input
| | | | | |
x-->x-->x-->x-->x-->x # SimpleRNN
| | | | | |
x-->x-->x-->x-->x-->x # SimpleRNN
| /|\ /|\ /|\ /|\ |
| / | \ | \ | \ | \ |
x x x x x x # Conv
| | | | | |
x x x x x x # Dense -> output
So the first time index of the output only sees the first 2 input times (thanks to the Conv); it can't see the later times. The first prediction is based only on old data. It's only the last few outputs that can see all the inputs.
use 96 backwards steps to predict 96 steps into the future
Most of the outputs just can't see all the data.
This model would be appropriate if you were trying to predict 1 step into the future from each of the input times.
To predict 96 steps into the future it would be much more reasonable to drop the return_sequences=True and the Conv layer. Then expand the Dense layer to make the prediction:
model = keras.models.Sequential([
keras.layers.SimpleRNN(10, return_sequences=True, input_shape=[None, 3]), # output size is (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 10)
keras.layers.SimpleRNN(10), # output size is (BATCH_SIZE, 10)
keras.layers.Dense(96) # output size is (BATCH_SIZE, 96)
])
That way all 96 predictions see all 96 inputs.
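One caveat, which is my addition rather than part of the original answer: with Dense(96) the model outputs shape (BATCH_SIZE, 96), so the target arrays from the question, shaped (num_samples, 96, 1), would need their trailing singleton axis dropped before training:
# Drop the trailing singleton axis so the targets match the (BATCH_SIZE, 96) output
Y_train = Y_train.reshape(Y_train.shape[0], -1)  # (num_samples, 96)
Y_valid = Y_valid.reshape(Y_valid.shape[0], -1)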
See https://www.tensorflow.org/tutorials/structured_data/time_series for more details.
Also SimpleRNN is terrible. Never use it over more than a couple of steps.
I am working on a project where I have to use a combination of numeric and text data in a neural network to make predictions of a system's availability for the next hour. Instead of trying to use separate neural networks and doing something weird/unclear (to me) at the end to produce the desired output, I decided to use Keras' merge layer with two networks (one for numeric data, one for text). The idea is that I feed the model a sequence of performance metrics for the previous 6 hours in the shape of (batch_size, 6hrs, num_features). Alongside the input I am giving to the network that handles numeric data, I am giving the second network another sequence of size (batch_size, max_alerts_per_sequence, max_sentence_length).
Any sequence of numeric data within a time range can have a variable number of events (text data) associated with it. For the sake of simplicity, I only allow a maximum of 50 events to accompany a sequence of performance data. Each event is hash encoded by word and padded. I have tried using a flatten layer to reduce the input shape from (50, 30) to (1500) so that the model can train on every event in these "sequences" (to clarify: I pass the model 50 sentences with 30 encoded elements each for every sequence of performance data).
My question is: Due to the fact that I need the NN to look at all events for a given sequence of performance metrics, how can I make the NN for text based data train on sequences of sentences?
My Model:
#LSTM Module for performance metrics
input = Input(shape=(shape[1], shape[2]))
lstm1 = Bidirectional(LSTM(units=lstm_layer_count, activation='tanh', return_sequences=True, input_shape=shape))(input)
dropout1 = Dropout(rate=0.2)(lstm1)
lstm2 = Bidirectional(LSTM(units=lstm_layer_count, activation='tanh', return_sequences=False))(dropout1)
dropout2 = Dropout(rate=0.2)(lstm2)
#LSTM Module for text based data
tInput = Input(shape=(50, 30))
flatten = Flatten()(tInput)
embed = Embedding(input_dim=vocabsize + 1, output_dim= 50 * 30, input_length=30*50)(flatten)
magic = Bidirectional(LSTM(100))(embed)
tOut = Dense(1, activation='relu')(magic)
#Merge the layers
concat = Concatenate()([dropout2, tOut])
output = Dense(units=1, activation='sigmoid')(concat)
nn = keras.models.Model(inputs=[input, tInput], outputs = output)
opt = keras.optimizers.SGD(lr=0.1, momentum=0.8, nesterov=True, decay=0.001)
nn.compile(optimizer=opt, loss='mse', metrics=['accuracy', coeff_determination])
As far as I understood, you have a sequence of at most 50 events, which you want to make predictions for. These events have text data attached, which can be treated as another sequence of word embeddings. Here is an article about a similar architecture.
I would propose a solution which involves LSTMs for the text part and a 1D convolution for the "real" sequence part. Every LSTM output is concatenated with the numerical data. This involves 50 LSTM layers, which can be time-consuming to train, even if you use shared weights. It would also be possible to use only convolution layers for the text part, which is faster but does not model long-term dependencies. (In my experience, these long-term dependencies are often not that important in text mining.)
Text -> LSTM or 1DConv -> concat with numerical data -> 1DConv -> Output
Here is some example code, which shows how to use shared weights:
numeric_input = Input(shape=(x_numeric_train.values.shape[1],), name='numeric_input')
nlp_seq = Input(shape=(number_of_messages, seq_length,), name='nlp_input')
# shared layers
emb = TimeDistributed(Embedding(input_dim=num_features, output_dim=embedding_size,
                                input_length=seq_length, mask_zero=True,
                                input_shape=(seq_length, )))(nlp_seq)
x = TimeDistributed(Bidirectional(LSTM(32, dropout=0.3, recurrent_dropout=0.3, kernel_regularizer=regularizers.l2(0.01))))(emb)
c1 = Conv1D(filter_size, kernel1, padding='valid', activation='relu', strides=1, kernel_regularizer=regularizers.l2(kernel_reg))(x)
p1 = GlobalMaxPooling1D()(c1)
c2 = Conv1D(filter_size, kernel2, padding='valid', activation='relu', strides=1, kernel_regularizer=regularizers.l2(kernel_reg))(x)
p2 = GlobalMaxPooling1D()(c2)
c3 = Conv1D(filter_size, kernel3, padding='valid', activation='relu', strides=1, kernel_regularizer=regularizers.l2(kernel_reg))(x)
p3 = GlobalMaxPooling1D()(c3)
x = concatenate([p1, p2, p3, numeric_input])
x = Dense(1, activation='sigmoid')(x)
model = Model(inputs=[nlp_seq, numeric_input], outputs=[x])
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])
And training:
model.fit([x_train, x_numeric_train], y_train)
# where x_train is an array of shape num_samples * num_messages * seq_length
A complex model like this needs a lot of data to converge. For less data a simpler solution could be implemented by aggregating the events to have only one sequence. For example the text data of all events can be treated as one single text (with a separator token), instead of multiple texts, while the numerical data can be summed up, averaged or even combined into a fixed length list. But this depends on your data.
As I am working on something similar, I will update this answer with code later on.
I'm trying to combine two outputs that are produced by the same network, which makes predictions on a 4-class task and a 10-class task. I then combine these outputs to give a length-14 array, which I use as my end target.
While this seems to work, the predictions are always for one class, so the model produces a probability distribution that is only concerned with selecting 1 of the 14 options instead of 2. What I actually need is two predictions, one for each task. I want this all to be produced by the same model.
input = Input(shape=(100, 100), name='input')
lstm = LSTM(128, input_shape=(100, 100))(input)
output1 = Dense(4, activation='softmax', name='output1')(lstm)
output2 = Dense(10, activation='softmax', name='output2')(lstm)
output3 = concatenate([output1, output2])
model = Model(inputs=[input], outputs=[output3])
My issue here is determining an appropriate loss function and method of prediction. For prediction I can simply grab the output of each layer after the softmax; however, I'm unsure how to set the loss function for each of these outputs so they can be trained.
Any ideas?
Thanks a lot
You don't need to concatenate the outputs; your model can have two outputs:
input = Input(shape=(100, 100), name='input')
lstm = LSTM(128, input_shape=(100, 100))(input)
output1 = Dense(4, activation='softmax', name='output1')(lstm)
output2 = Dense(10, activation='softmax', name='output2')(lstm)
model = Model(inputs=[input], outputs=[output1, output2])
Then to train this model, you typically use two losses that are weighted to produce a single loss:
model.compile(optimizer='sgd',
              loss=['categorical_crossentropy', 'categorical_crossentropy'],
              loss_weights=[0.2, 0.8])
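Training then passes one label array per output, for example keyed by the output layer names (a minimal sketch; x_train, y1_train and y2_train are placeholder arrays):
model.fit(
    x_train,
    {'output1': y1_train,   # one-hot labels, 4 classes
     'output2': y2_train},  # one-hot labels, 10 classes
    epochs=10,
    batch_size=32
)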
Just make sure to format your data correctly: each input sample now corresponds to two labeled outputs. For more information, check the Functional API Guide.