Training on sequences of sentences using Keras - python

I am working on a project where I have to use a combination of numeric and text data in a neural network to make predictions of a system's availability for the next hour. Instead of trying to use separate neural networks and doing something weird/unclear (to me) at the end to produce the desired output, I decided to use Keras' merge layer with two networks (one for numeric data, one for text). The idea is that I feed the model a sequence of performance metrics for the previous 6 hours with the shape (batch_size, 6hrs, num_features). Alongside the input to the network that handles the numeric data, I give the second network another sequence of shape (batch_size, max_alerts_per_sequence, max_sentence_length).
Any sequence of numeric data within a time range can have a variable number of events (text data) associated with it. For the sake of simplicity, I only allow a maximum of 50 events to accompany a sequence of performance data. Each event is hash encoded by word and padded. I have tried using a flatten layer to reduce the input shape from (50, 30) to (1500) so that the model can train on every event in these "sequences" (to clarify: I pass the model 50 sentences with 30 encoded elements each for every sequence of performance data).
My question is: Due to the fact that I need the NN to look at all events for a given sequence of performance metrics, how can I make the NN for text based data train on sequences of sentences?
My Model:
#LSTM Module for performance metrics
input = Input(shape=(shape[1], shape[2]))
lstm1 = Bidirectional(LSTM(units=lstm_layer_count, activation='tanh', return_sequences=True, input_shape=shape))(input)
dropout1 = Dropout(rate=0.2)(lstm1)
lstm2 = Bidirectional(LSTM(units=lstm_layer_count, activation='tanh', return_sequences=False))(dropout1)
dropout2 = Dropout(rate=0.2)(lstm2)
#LSTM Module for text based data
tInput = Input(shape=(50, 30))
flatten = Flatten()(tInput)
embed = Embedding(input_dim=vocabsize + 1, output_dim= 50 * 30, input_length=30*50)(flatten)
magic = Bidirectional(LSTM(100))(embed)
tOut = Dense(1, activation='relu')(magic)
#Merge the layers
concat = Concatenate()([dropout2, tOut])
output = Dense(units=1, activation='sigmoid')(concat)
nn = keras.models.Model(inputs=[input, tInput], outputs = output)
opt = keras.optimizers.SGD(lr=0.1, momentum=0.8, nesterov=True, decay=0.001)
nn.compile(optimizer=opt, loss='mse', metrics=['accuracy', coeff_determination])

So as far as I understand, you have a sequence of at most 50 events, which you want to make predictions for. These events have text data attached, which can be treated as another sequence of word embeddings. Here is an article about a similar architecture.
I would propose a solution which involves LSTMs for the text part and a 1D convolution for the "real" sequence part. Every LSTM layer is concatenated with the numerical data. This involves 50 LSTM layers, which can be time consuming to train, even if you use shared weights. It would also be possible to use only convolution layers for the text part, which is faster but does not model long-term dependencies. (In my experience, these long-term dependencies are often not that important in text mining.)
Text -> LSTM or 1DConv -> concat with numerical data -> 1DConv -> Output
Here is some example code which shows how to use shared weights:
from keras.layers import (Input, Embedding, LSTM, Bidirectional, TimeDistributed,
                          Conv1D, GlobalMaxPooling1D, Dense, concatenate)
from keras.models import Model
from keras import regularizers

# numeric features as a flat vector, text as number_of_messages padded sequences of word indices
numeric_input = Input(shape=(x_numeric_train.values.shape[1],), name='numeric_input')
nlp_seq = Input(shape=(number_of_messages, seq_length,), name='nlp_input')
# shared layers: the same embedding and LSTM are applied to every message
emb = TimeDistributed(Embedding(input_dim=num_features, output_dim=embedding_size,
                                input_length=seq_length, mask_zero=True,
                                input_shape=(seq_length,)))(nlp_seq)
x = TimeDistributed(Bidirectional(LSTM(32, dropout=0.3, recurrent_dropout=0.3,
                                       kernel_regularizer=regularizers.l2(0.01))))(emb)
# 1D convolutions with different kernel sizes over the sequence of message encodings
# (filter_size, kernel1/2/3 and kernel_reg are hyperparameters to choose)
c1 = Conv1D(filter_size, kernel1, padding='valid', activation='relu', strides=1, kernel_regularizer=regularizers.l2(kernel_reg))(x)
p1 = GlobalMaxPooling1D()(c1)
c2 = Conv1D(filter_size, kernel2, padding='valid', activation='relu', strides=1, kernel_regularizer=regularizers.l2(kernel_reg))(x)
p2 = GlobalMaxPooling1D()(c2)
c3 = Conv1D(filter_size, kernel3, padding='valid', activation='relu', strides=1, kernel_regularizer=regularizers.l2(kernel_reg))(x)
p3 = GlobalMaxPooling1D()(c3)
# merge the pooled text features with the numeric features
x = concatenate([p1, p2, p3, numeric_input])
x = Dense(1, activation='sigmoid')(x)
model = Model(inputs=[nlp_seq, numeric_input], outputs=[x])
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])
And training:
model.fit([x_train, x_numeric_train], y_train)
# where x_train is an array of shape (num_samples, num_messages, seq_length)
A complex model like this needs a lot of data to converge. With less data, a simpler solution could be to aggregate the events so that there is only one sequence. For example, the text data of all events can be treated as one single text (with a separator token) instead of multiple texts, while the numerical data can be summed, averaged, or combined into a fixed-length list. But this depends on your data.
As I am working on something similar, I will update this answer with code later on.
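In the meantime, here is a minimal sketch of the aggregation idea (the names events, tokenizer and numeric_windows are placeholders for your own data structures, not taken from the code above):
import numpy as np
from keras.preprocessing.sequence import pad_sequences

SEP = ' eventsep '  # separator token placed between events

# one text per performance-data window: join the texts of all its events
joined_texts = [SEP.join(texts) for texts in events]      # events: list of lists of strings
text_ids = tokenizer.texts_to_sequences(joined_texts)     # tokenizer fitted on joined_texts
x_text = pad_sequences(text_ids, maxlen=500, padding='post')

# numeric data per window reduced to a fixed-length vector, e.g. by averaging over time
x_numeric = np.array([window.mean(axis=0) for window in numeric_windows])
The single padded text can then be fed to one embedding + LSTM branch and concatenated with x_numeric, instead of handling 50 separate messages.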

Correct keras LSTM input shape after text-embedding

I'm trying to understand the keras LSTM layer a bit better with regard to timesteps, but am still struggling a bit.
I want to create a model that is able to compare 2 inputs (siamese network). So my input is two preprocessed texts. The preprocessing is done as follows:
max_len = 64
# note: the name of the raw-text column is assumed here; the original snippet left it out
data['cleaned_text_1'] = data.apply(lambda row: clean_string(row['text_1']), axis=1)
data['text_1_seq'] = t.texts_to_sequences(data['cleaned_text_1'].astype(str).values)
data['text_1_seq_pad'] = [list(x) for x in pad_sequences(data['text_1_seq'], maxlen=max_len, padding='post')]
The same is done for the second text input. t is an instance of keras.preprocessing.text.Tokenizer.
I defined the model with:
common_embed = Embedding(
    name="synopsis_embedd",
    input_dim=len(t.word_index) + 1,
    output_dim=300,
    input_length=len(data['text_1_seq_pad'].tolist()[0]),
    trainable=True
)
lstm_layer = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(32, dropout=0.2, recurrent_dropout=0.2)
)

input1 = tf.keras.Input(shape=(len(data['text_1_seq_pad'].tolist()[0]),))
e1 = common_embed(input1)
x1 = lstm_layer(e1)

input2 = tf.keras.Input(shape=(len(data['text_1_seq_pad'].tolist()[0]),))
e2 = common_embed(input2)
x2 = lstm_layer(e2)

merged = tf.keras.layers.Lambda(
    function=l1_distance, output_shape=l1_dist_output_shape, name='L1_distance'
)([x1, x2])

conc = Concatenate(axis=-1)([merged, x1, x2])
x = Dropout(0.01)(conc)
preds = tf.keras.layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs=[input1, input2], outputs=preds)
That seems to work if I feed the numpy data with the fit method:
model.fit(
    x=[np.array(data['text_1_seq_pad'].tolist()), np.array(data['text_2_seq_pad'].tolist())],
    y=y_train.values.reshape(-1, 1),
    epochs=epochs,
    batch_size=batch_size,
    validation_data=([np.array(val['text_1_seq_pad'].tolist()), np.array(val['text_2_seq_pad'].tolist())], y_val.values.reshape(-1, 1)),
)
What I'm trying to understand at the moment is what the shape for the LSTM layer is in my case, in terms of:
samples
time_steps
features
Is it correct that the input_shape for the LSTM layer would be input_shape=(300,1) because I set the embedding output dim to 300 and I have only 1 input feature per LSTM?
And do I need to reshape the embedding output or can I just set
lstm_layer = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(32, input_shape=(300, 1), dropout=0.2, recurrent_dropout=0.2)
)
from the embedding output?
An example notebook can be found on GitHub or as a Colab.
In general, an LSTM layer needs 3D inputs shaped this way: (batch_size, length of the input sequence, number of features). (The batch size is not really important, so you can just consider that one input needs to have the shape (length of sequence, number of features per item).)
In your case, the output dim of your embedding layer is 300, so your LSTM has 300 features.
Then, using an LSTM on sentences requires a constant number of tokens. An LSTM works with a constant input dimension; you cannot pass it a text with 12 tokens followed by another one with 68 tokens. You need to fix a limit and pad the sequences if needed.
So, if your sentence is 20 tokens long and your limit is 50, you need to pad the sequence (add at the end of the sequence) 30 “neutral” tokens (often zeros).
In the end, your LSTM input dimension must be (number of tokens per text, dimension of your embedding output), i.e. (50, 300) in my example.
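A small self-contained sketch of these shapes (the numbers are only examples, not taken from your data):
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_len = 50       # fixed number of tokens per text
vocab_size = 1000  # example vocabulary size

# two texts of different lengths, padded to the same length
seqs = [[4, 7, 12], [3, 9, 9, 2, 15, 6]]
padded = pad_sequences(seqs, maxlen=max_len, padding='post')
print(padded.shape)  # (2, 50) -> (batch_size, number_of_tokens)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=300, input_length=max_len),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
])
out = model(padded)
print(out.shape)  # (2, 64): the LSTM sees (50, 300) per sample, 32 units per direction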
To learn more about it, I suggest you take a look at this (in your case, you can replace time_steps with number_of_tokens):
https://shiva-verma.medium.com/understanding-input-and-output-shape-in-lstm-keras-c501ee95c65e

Siamese network for feature similarity

I have around 20k images from different domains with the features already extracted using GLCM and HOG. The feature dimension is around 2000 for each image. I want to find the similarity between features using a Siamese network. I stored them all in a dataframe. I'm not sure how we can give the input features to the neural net.
The only possibility I see is to use 1D-CNN / Dense layers.
encoder = models.Sequential(name='encoder')
encoder.add(layer=layers.Dense(units=1024, activation=activations.relu, input_shape=[n_features]))
encoder.add(layers.Dropout(0.1))
encoder.add(layer=layers.Dense(units=512, activation=activations.relu))
encoder.add(layers.Dropout(0.1))
encoder.add(layer=layers.Dense(units=256, activation=activations.relu))
encoder.add(layers.Dropout(0.1))
In the above code we only give the number of features as input to the encoder, but I have two images, each with the same number of features.
Should I train two encoders separately and join them at the end to form an embedding layer?
But how should I test?
For a siamese network you would want to have one network, and train it on different sets of data.
So say you have two sets of data X0 and X1 that have the same shape, you would do
import tensorflow as tf
from tensorflow.keras import models
from tensorflow.keras import layers
from tensorflow.python.keras.utils import losses_utils

# number of features
n_features = 2000

# fake data w/ batch size 4
X0 = tf.random.normal([4, n_features])
X1 = tf.random.normal([4, n_features])

# siamese encoder model
encoder = models.Sequential(name='encoder')
encoder.add(layer=layers.Dense(units=1024, activation="relu", input_shape=[n_features]))
encoder.add(layers.Dropout(0.1))
encoder.add(layer=layers.Dense(units=512, activation="relu"))
encoder.add(layers.Dropout(0.1))
encoder.add(layer=layers.Dense(units=256, activation="relu"))
encoder.add(layers.Dropout(0.1))

# send both sets of data through the same model
enc0 = encoder(X0)
enc1 = encoder(X1)

# compare the two outputs
compared = tf.keras.losses.CosineSimilarity(
    reduction=losses_utils.ReductionV2.NONE)(enc0, enc1)
print(f"cosine similarity of output: {compared.numpy()}")
# cosine similarity of output: [-0.5785658, -0.6405066, -0.57274437, -0.6017716]

# now do optimization ...
There are numerous ways to compare the outputs, cosine similarity being one of them; I just included it for illustration and you may require some other metric.
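For example, continuing from the snippet above, a plain Euclidean (L2) distance could be used instead (just an illustration, not necessarily the best metric for your data):
# Euclidean (L2) distance between the two encodings, one value per pair in the batch
l2_dist = tf.norm(enc0 - enc1, axis=1)
print(f"L2 distance of output: {l2_dist.numpy()}")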
There is only one network, which is just duplicated. All weights are shared, so you are training one network; you just run it twice at each step of learning.
You should pick two samples from your dataset and label the pair 1 if they come from the same class and 0 otherwise.
from tensorflow.keras import models
from tensorflow.keras import layers
import tensorflow.keras.backend as K

n_features = 2000

def cos_similarity(x):
    # cosine similarity per sample in the batch
    x1, x2 = x
    dot = K.sum(x1 * x2, axis=-1)
    norm = K.sqrt(K.sum(x1 * x1, axis=-1)) * K.sqrt(K.sum(x2 * x2, axis=-1))
    return dot / norm

inp1 = layers.Input(shape=(n_features,))
inp2 = layers.Input(shape=(n_features,))

encoder = models.Sequential(name='encoder')
encoder.add(layer=layers.Dense(units=1024, activation="relu", input_shape=[n_features]))
encoder.add(layers.Dropout(0.1))
encoder.add(layer=layers.Dense(units=512, activation="relu"))
encoder.add(layers.Dropout(0.1))
encoder.add(layer=layers.Dense(units=256, activation="relu"))
encoder.add(layers.Dropout(0.1))

out1 = encoder(inp1)
out2 = encoder(inp2)

similarity = layers.Lambda(cos_similarity)([out1, out2])
model = models.Model(inputs=[inp1, inp2], outputs=[similarity])
model.compile(optimizer='adam', loss='mse')
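A rough sketch of how the labeled pairs could be built and the model trained (features, labels and the pair count are placeholders, not taken from your dataframe):
import numpy as np

# features: array of shape (num_images, n_features); labels: class id per image
def make_pairs(features, labels, num_pairs=10000, seed=0):
    rng = np.random.default_rng(seed)
    a, b, y = [], [], []
    for _ in range(num_pairs):
        i, j = rng.integers(0, len(features), size=2)
        a.append(features[i])
        b.append(features[j])
        y.append(1.0 if labels[i] == labels[j] else 0.0)  # 1 = same class
    return np.array(a), np.array(b), np.array(y)

x_a, x_b, y = make_pairs(features, labels)
model.fit([x_a, x_b], y, batch_size=64, epochs=10, validation_split=0.1)
In practice you would usually balance positive and negative pairs rather than sampling completely at random.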
For testing, first compute the HOG features (which you said have around 2000 dimensions) for both images, then run
model.predict([hog_features_1, hog_features_2])
and you get the predicted similarity.
By the way, I recommend not using HOG features with a siamese network. Extract the image features with the network itself: change the input shape and train on the images directly.

Performing one hot encoding inside a Lambda layer in Keras to avoid memory issues: problem

I've made a sequential model in keras, for generating musical sequences. Something very simple, with an LSTM and a dense softmax layer. I have 333 possible musical events.
I know that model.fit() needs all training data in memory, which is a problem if it is one hot encoded. So I give the model an integer as input, transform this to a one hot encoding in a Lambda layer, and then use sparse categorical cross entropy for the loss. Because each batch would be transformed to one hot encoding on the fly, I thought this would sort out my memory issues. But instead, it hangs at the beginning of fitting and fills up my memory, even with a small batch size. Evidently, I'm not understanding something about how keras works, which is not surprising, given that I'm new to it (and on that note, please point out anything too naive in my code).
1) What is happening behind the scenes? What is it about keras that I'm not understanding? It seems like keras is going ahead and running the Lambda layer on all of my training examples before doing any training.
2) How can I solve this, and make keras do it truly on the fly? Can I solve it with model.fit(), which I'm currently using, or do I need model.fit_generator(), which to me looks like it could solve this rather easily?
Here is some of my code:
def musicmodel(Tx, n_a, n_values):
    """
    Arguments:
    Tx -- length of a sequence in the corpus
    n_a -- the number of activations used in our model (for the LSTM)
    n_values -- number of unique values in the music data

    Returns:
    model -- a keras model
    """
    # Define the input with a shape
    X = Input(shape=(Tx,))
    # Define a0 and c0, the initial hidden state and cell state for the LSTM
    a0 = Input(shape=(n_a,), name='a0')
    c0 = Input(shape=(n_a,), name='c0')
    a = a0
    c = c0
    # Create empty list to append the outputs to while iterating
    outputs = []
    # Step 2: Loop over the time steps
    for t in range(Tx):
        # select the "t"th time step from X
        x = Lambda(lambda x: x[:, t])(X)
        # We need the class represented in one hot fashion:
        x = Lambda(lambda x: tf.one_hot(K.cast(x, dtype='int32'), n_values))(x)
        # We then reshape x to be (1, n_values)
        x = reshapor(x)
        # Perform one step of the LSTM_cell
        a, _, c = LSTM_cell(x, initial_state=[a, c])
        # Apply densor to the hidden state output of LSTM_cell
        out = densor(a)
        # Add the output to "outputs"
        outputs.append(out)
    # Step 3: Create model instance
    model = Model(inputs=[X, a0, c0], outputs=outputs)
    return model
I then fit my model:
model = musicmodel(Tx, n_a, n_values)
opt = Adam(lr=0.01, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
a0 = np.zeros((m, n_a))
c0 = np.zeros((m, n_a))
model.fit([X, a0, c0], list(Y), validation_split=0.25, epochs=600, verbose=2, batch_size=4)

Keras LSTM: Do Larger Features Overwhelm Smaller Ones?

I've had this suspicion for the longest time but couldn't figure out whether it's the case or not, so here's the scenario:
I'm trying to build a model that has 3 features from 3 different inputs:
A text sequence
A float
A float
Now all three of these make up one time step. But since I'm using glove to vectorize my text sequence with 100 dimensions, a 20 word text sequence ends up having a length of 2000. Hence, the overall input per step has a length of 2002 (each time step a matrix with the shape (1, 2002) is fed in, with 2000 of those values coming from a single feature).
Is the text sequence overwhelming the two floats, so that whatever the values of the floats are, they're irrelevant to the prediction? If so, what can I do to fix this? Perhaps manually weight how much each feature should be used? Code is attached:
def build_model(embedding_matrix) -> Model:
    text = Input(shape=(9, news_text.shape[1]), name='text')
    price = Input(shape=(9, 1), name='price')
    volume = Input(shape=(9, 1), name='volume')

    text_layer = Embedding(
        embedding_matrix.shape[0],
        embedding_matrix.shape[1],
        weights=[embedding_matrix]
    )(text)
    text_layer = Dropout(0.2)(text_layer)

    # Flatten the vectorized text matrix
    text_layer = Reshape((9, int_shape(text_layer)[2] * int_shape(text_layer)[3]))(text_layer)

    inputs = concatenate([
        text_layer,
        price,
        volume
    ])

    output = Convolution1D(128, 5, activation='relu')(inputs)
    output = MaxPool1D(pool_size=4)(output)
    output = LSTM(units=128, dropout=0.2, return_sequences=True)(output)
    output = LSTM(units=128, dropout=0.2, return_sequences=True)(output)
    output = LSTM(units=128, dropout=0.2)(output)
    output = Dense(units=2, activation='linear', name='output')(output)

    model = Model(
        inputs=[text, price, volume],
        outputs=[output]
    )
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model
Edit: note that the input shape into the LSTM is (?, 9, 2002), which means that right now the 2000 values coming from the text are treated as 2000 independent features.
As I mentioned in my comments, one approach is to have a two branch model where one branch processes the text data and another one processes the two float features. At the end the output of the two branches are merged together:
# Branch one: process text data
text_input = Input(shape=(news_text.shape[1],), name='text')
text_emb = Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
                     weights=[embedding_matrix])(text_input)

# you may alternatively use only Conv1D + MaxPool1D or
# stack multiple LSTM layers on top of each other or
# use a combination of Conv1D, MaxPool1D and LSTM
text_conv = Convolution1D(128, 5, activation='relu')(text_emb)
text_lstm = LSTM(units=128, dropout=0.2)(text_conv)

# Branch two: process float features
price_input = Input(shape=(9, 1), name='price')
volume_input = Input(shape=(9, 1), name='volume')
pv = concatenate([price_input, volume_input])
# you can also stack multiple LSTM layers on top of each other
pv_lstm = LSTM(units=128, dropout=0.2)(pv)

# merge output of branches
text_pv = concatenate([text_lstm, pv_lstm])
output = Dense(units=2, activation='linear', name='output')(text_pv)

model = Model(
    inputs=[text_input, price_input, volume_input],
    outputs=[output]
)
model.compile(optimizer='adam', loss='mean_squared_error')
As I have commented in the code, this is just a simple illustration. You may need to further add or remove layers or regularization and tune the hyper-parameters.
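For completeness, training this two-branch model could then look something like the following (a sketch; the array names are placeholders for your prepared data, matched to the input names used above):
# text_data:   (num_samples, news_text.shape[1]) integer word indices
# price_data:  (num_samples, 9, 1) float values
# volume_data: (num_samples, 9, 1) float values
# targets:     (num_samples, 2)
model.fit(
    {'text': text_data, 'price': price_data, 'volume': volume_data},
    targets,
    epochs=20,
    batch_size=32,
    validation_split=0.1,
)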

Keras - Making two predictions from one neural network

I'm trying to combine two outputs that are produced by the same network, which makes predictions on a 4 class task and a 10 class task. I then combine these outputs to give a length 14 array which I use as my end target.
While this seems to work, the predictions are always for one class, so it produces a probability distribution which is only concerned with selecting 1 out of the 14 options instead of 2. What I actually need is for it to provide 2 predictions, one for each of the two tasks. I want this all to be produced by the same model.
input = Input(shape=(100, 100), name='input')
lstm = LSTM(128, input_shape=(100, 100))(input)
output1 = Dense(4, activation='softmax', name='output1')(lstm)
output2 = Dense(10, activation='softmax', name='output2')(lstm)
output3 = concatenate([output1, output2])
model = Model(inputs=[input], outputs=[output3])
My issue here is determining an appropriate loss function and method of prediction. For prediction I can simply grab the output of each layer after the softmax, however I'm unsure how to set the loss function for each of these outputs so they can be trained.
Any ideas?
Thanks a lot
You don't need to concatenate the outputs, your model can have two outputs:
input = Input(shape=(100, 100), name='input')
lstm = LSTM(128, input_shape=(100, 100))(input)
output1 = Dense(4, activation='softmax', name='output1')(lstm)
output2 = Dense(10, activation='softmax', name='output2')(lstm)
model = Model(inputs=[input], outputs=[output1, output2])
Then to train this model, you typically use two losses that are weighted to produce a single loss:
model.compile(optimizer='sgd',
              loss=['categorical_crossentropy', 'categorical_crossentropy'],
              loss_weights=[0.2, 0.8])
Just make sure to format your data correctly, as each input sample now corresponds to two output labels. For more information check the Functional API Guide.
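For example, fitting and predicting could then look like this (a sketch; x, y4, y10 and x_new are placeholders for your prepared arrays):
# x:   (num_samples, 100, 100) input sequences
# y4:  (num_samples, 4)  one-hot labels for the 4 class task
# y10: (num_samples, 10) one-hot labels for the 10 class task
model.fit(x, {'output1': y4, 'output2': y10}, epochs=10, batch_size=32)

# predict returns one array per output, i.e. the two softmax distributions
pred4, pred10 = model.predict(x_new)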
