My supervisor asked me to implement an attention layer for a CNN (applied to text), but I'm pretty sure it does not work for Conv1D layers.
Using Keras, I have a pretty straightforward model: an embedding layer followed by a convolutional layer.
Since the input to Conv1D is (#documents, words, embedding_size) and it is followed by a MaxPooling layer, I considered dropping max pooling and inserting attention there, but I really don't know what my query and value inputs would be in this case.
I know tf.keras.layers has an Attention layer, but is it possible to apply it here? Or do I need some kind of self-attention?
What I need is to see which "words" are (most) responsible for a given classification.
from tensorflow.keras.layers import Input, Conv1D, MaxPooling1D, Dropout, Dense
from tensorflow.keras.models import Model
from tensorflow.keras import regularizers
from tensorflow.keras.metrics import TruePositives, TrueNegatives, FalseNegatives, FalsePositives

# embedding_layer and MAX_SEQUENCE_LENGTH are defined elsewhere
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
conv1 = Conv1D(filters=2, kernel_size=2, padding='same')(embedded_sequences)
conv1 = MaxPooling1D(pool_size=32)(conv1)
conv1 = Dropout(0.2)(conv1)
x = Dense(50, activation="relu",
          kernel_regularizer=regularizers.l2(0.01),
          bias_regularizer=regularizers.l2(0.01))(conv1)
x = Dropout(0.3)(x)
preds = Dense(1, activation='sigmoid', name='output')(x)
model = Model(sequence_input, preds)
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=[TruePositives(name='true_positives'),
                       TrueNegatives(name='true_negatives'),
                       FalseNegatives(name='false_negatives'),
                       FalsePositives(name='false_positives')])
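For reference, a minimal sketch of one way tf.keras.layers.Attention could be wired in as self-attention (query = value = the Conv1D output). This is a sketch of the idea, not a verified solution; it reuses embedding_layer and MAX_SEQUENCE_LENGTH from the code above, and return_attention_scores requires a reasonably recent TensorFlow version:

import tensorflow as tf
from tensorflow.keras.layers import Input, Conv1D, Attention, GlobalAveragePooling1D, Dense
from tensorflow.keras.models import Model

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
conv1 = Conv1D(filters=2, kernel_size=2, padding='same')(embedded_sequences)

# Self-attention: pass the conv output as both query and value.
# attn_scores has shape (batch, steps, steps); inspecting it shows which
# positions (words) each position attends to, which is one way to get at
# word importance for a classification.
attn_out, attn_scores = Attention()([conv1, conv1], return_attention_scores=True)

x = GlobalAveragePooling1D()(attn_out)  # collapse the time axis instead of max-pooling
preds = Dense(1, activation='sigmoid', name='output')(x)
model = Model(sequence_input, preds)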
I am trying to make a 3-sequence many-to-many LSTM model, but I am confused about its implementation in Keras. I searched the internet for examples of many-to-many models, but each website gives a different method, which has confused me even more. Which of those is correct? I want a model like this:
Some of the various methods I found were:
1. Using an encoder and a decoder (with RepeatVector):
from keras.models import Sequential
from keras.layers import LSTM, Dense, RepeatVector, TimeDistributed

model = Sequential()
# encoder layer
model.add(LSTM(100, activation='relu', input_shape=(3, 1)))
# repeat vector: feed the encoding to the decoder once per output step
model.add(RepeatVector(3))
# decoder layer
model.add(LSTM(100, activation='relu', return_sequences=True))
model.add(TimeDistributed(Dense(1)))
model.compile(optimizer='adam', loss='mse')
2. Another encoder-decoder, this one passing the encoder states to the decoder:
from keras.models import Model
from keras.layers import Input, LSTM, Dense

encoder_inputs = Input(shape=(None, 1))
encoder = LSTM(100, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
encoder_states = [state_h, state_c]

decoder_inputs = Input(shape=(None, 1))
decoder_lstm = LSTM(100, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
# num_decoder_tokens must be defined for your data
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
3. A single LSTM returning sequences, with a TimeDistributed Dense output:

model = Sequential()
model.add(LSTM(100, input_shape=(3, 1), return_sequences=True))
model.add(TimeDistributed(Dense(2)))
model.compile(optimizer='adam', loss='mse')
4. A bare LSTM returning sequences, with no Dense layer:

model = Sequential()
model.add(LSTM(100, input_shape=(3, 1), return_sequences=True))
model.compile(optimizer='adam', loss='mse')
Which one of these is the correct method? Which one will give me the model I want?
You have to state your problem first.
1 and 2 are best for neural machine translation problems, and 2 is superior because it passes the encoder's final states into the decoder's LSTM layer. 3 is also a good architecture where the logic from input to output is simple. 4 is a very basic architecture, because the nth output only has knowledge of inputs 0 through n-1 (not later ones), and there is no fully connected (Dense) layer, so even moderately complex logic cannot be learned.
I am trying to use a CNN architecture to classify text sentences. The architecture of the network is as follows:
from tensorflow.keras.layers import Input, Conv1D, Dropout, MaxPooling1D, Dense, Flatten
from tensorflow.keras.models import Model

text_input = Input(shape=X_train_vec.shape[1:], name="Text_input")
conv2 = Conv1D(filters=128, kernel_size=5, activation='relu')(text_input)
drop21 = Dropout(0.5)(conv2)
pool1 = MaxPooling1D(pool_size=2)(drop21)
conv22 = Conv1D(filters=64, kernel_size=5, activation='relu')(pool1)
drop22 = Dropout(0.5)(conv22)
pool2 = MaxPooling1D(pool_size=2)(drop22)
dense = Dense(16, activation='relu')(pool2)
flat = Flatten()(dense)
dense = Dense(128, activation='relu')(flat)
out = Dense(32, activation='relu')(dense)
outputs = Dense(y_train.shape[1], activation='softmax')(out)
model = Model(inputs=text_input, outputs=outputs)
# compile
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
I have some callbacks, such as EarlyStopping and ReduceLROnPlateau, to stop the training and to reduce the learning rate when the validation loss stops improving (decreasing).
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

early_stopping = EarlyStopping(monitor='val_loss',
                               patience=5)
# checkpoint_filepath is defined elsewhere
model_checkpoint = ModelCheckpoint(filepath=checkpoint_filepath,
                                   save_weights_only=False,
                                   monitor='val_loss',
                                   mode="auto",
                                   save_best_only=True)
learning_rate_decay = ReduceLROnPlateau(monitor='val_loss',
                                        factor=0.1,
                                        patience=2,
                                        verbose=1,
                                        mode='auto',
                                        min_delta=0.0001,
                                        cooldown=0,
                                        min_lr=0)
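For completeness, the callbacks are then passed to fit(). This is a sketch using the variable names from the question; the epoch and batch values are placeholders, and some validation data is required for val_loss to exist:

history = model.fit(X_train_vec, y_train,
                    validation_split=0.1,   # provides the val_loss the callbacks monitor
                    epochs=50,
                    batch_size=32,
                    callbacks=[early_stopping, model_checkpoint, learning_rate_decay])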
Once the model is trained, the training history looks as follows:
We can observe here that the validation loss stops improving from epoch 5 on, while the training loss keeps decreasing, i.e. the model is overfitting more with each epoch.
I would like to know if I'm doing something wrong in the architecture of the CNN. Aren't the dropout layers enough to avoid the overfitting? What are other ways to reduce overfitting?
Any suggestion?
Thanks in advance.
Edit:
I have also tried regularization and the results were even worse:
kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01)
Edit 2:
I have tried applying BatchNormalization layers after each convolution and the result is as follows:
norm = BatchNormalization()(conv2)
Edit 3:
After applying the LSTM architecture:
text_input = Input(shape=X_train_vec.shape[1:], name = "Text_input")
conv2 = Conv1D(filters=128, kernel_size=5, activation='relu')(text_input)
drop21 = Dropout(0.5)(conv2)
conv22 = Conv1D(filters=64, kernel_size=5, activation='relu')(drop21)
drop22 = Dropout(0.5)(conv22)
lstm1 = Bidirectional(LSTM(128, return_sequences = True))(drop22)
lstm2 = Bidirectional(LSTM(64, return_sequences = True))(lstm1)
flat = Flatten()(lstm2)
dense = Dense(128, activation='relu')(flat)
out = Dense(32, activation='relu')(dense)
outputs = Dense(y_train.shape[1], activation='softmax')(out)
model = Model(inputs=text_input, outputs=outputs)
# compile
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Overfitting can be caused by many factors; it happens when your model fits the training set too well.
To handle it, you can try the following (a concrete sketch follows the list):
Add more data
Use data augmentation
Use architectures that generalize well
Add regularization (mostly dropout; L1/L2 regularization are also possible)
Reduce the architecture's complexity
For more detail, you can read https://towardsdatascience.com/deep-learning-3-more-on-cnns-handling-overfitting-2bd5d99abe5d
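As a hedged illustration of the last two points (adding regularization and reducing complexity), here is a minimal sketch; the layer sizes and the L2 factor are assumptions to tune, not recommended values:

from tensorflow.keras import regularizers
from tensorflow.keras.layers import Input, Conv1D, SpatialDropout1D, GlobalMaxPooling1D, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=X_train_vec.shape[1:])
# a single, narrower conv block = much lower capacity than the original model
x = Conv1D(64, 5, activation='relu',
           kernel_regularizer=regularizers.l2(1e-4))(inputs)  # milder L2 than 0.01
x = SpatialDropout1D(0.3)(x)   # drops whole feature maps, often works well on sequences
x = GlobalMaxPooling1D()(x)    # replaces Flatten plus the large Dense stack
outputs = Dense(y_train.shape[1], activation='softmax')(x)
model = Model(inputs, outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])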
This is screaming transfer learning. The google-universal-sentence-encoder is perfect for this use case. Replace your model with:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text

text_input = Input(shape=X_train_vec.shape[1:], name="Text_input")
# this next layer might need some tweaking dimension-wise, to correctly fit
# X_train in the model (kept on a separate variable so the original Input
# layer can still be passed to Model below)
squeezed = tf.keras.layers.Lambda(lambda x: tf.squeeze(x))(text_input)
# conv2 = Conv1D(filters=128, kernel_size=5, activation='relu')(text_input)
# drop21 = Dropout(0.5)(conv2)
# pool1 = MaxPooling1D(pool_size=2)(drop21)
# conv22 = Conv1D(filters=64, kernel_size=5, activation='relu')(pool1)
# drop22 = Dropout(0.5)(conv22)
# pool2 = MaxPooling1D(pool_size=2)(drop22)
# 1) you might need `squeezed = tf.expand_dims(squeezed, axis=0)` here
# 2) If you're classifying English only, you can use the link to the normal `google-universal-sentence-encoder`, not the multilingual one
# 3) both the English and multilingual encoders have a `-large` version: more accurate but slower to train and infer with
embedded = hub.KerasLayer('https://tfhub.dev/google/universal-sentence-encoder-multilingual/3')(squeezed)
# this layer seems out of place:
# dense = Dense(16, activation='relu')(embedded)
# you don't need to flatten after a dense layer (in your case) or a backbone (in my case, google-universal-sentence-encoder):
# flat = Flatten()(dense)
dense = Dense(128, activation='relu')(embedded)
out = Dense(32, activation='relu')(dense)
outputs = Dense(y_train.shape[1], activation='softmax')(out)
model = Model(inputs=text_input, outputs=outputs)
I think that since you are doing text classification, adding 1 or 2 LSTM layers might help the network learn better, since it will be able to better associate with the context of the data. I suggest adding the following code before the flatten layer:
lstm1 = Bidirectional(LSTM(128, return_sequences=True))
lstm2 = Bidirectional(LSTM(64))
LSTM layers can help the network learn associations between certain words and might improve its accuracy.
I also suggest dropping the max pooling layers, as max pooling, especially in text classification, can lead the network to drop some useful features.
Just keep the convolutional layers and the dropout. Also remove the Dense layer before Flatten and add the aforementioned LSTMs.
It is unclear how you feed the text into your model. I am assuming that you tokenize the text to represent it as a sequence of integers, but do you use any word embedding prior to feeding it into your model? If not, I suggest you add a trainable TensorFlow Embedding layer at the start of your model. (There is a clever technique called embedding lookup to speed up its training, but you can save that for later.) Your Conv1D layer will then have a much easier time working on a sequence of floats. I also suggest adding BatchNormalization after each Conv1D; it should help to speed up convergence and training. A sketch of both suggestions follows.
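A minimal sketch of those two suggestions, assuming integer-tokenized input; vocab_size, embed_dim, and MAX_SEQUENCE_LENGTH are hypothetical placeholders for your data, not values from the question:

from tensorflow.keras.layers import (Input, Embedding, Conv1D, BatchNormalization,
                                     Dropout, GlobalMaxPooling1D, Dense)
from tensorflow.keras.models import Model

inp = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
x = Embedding(input_dim=vocab_size, output_dim=embed_dim)(inp)  # trainable by default
x = Conv1D(128, 5, activation='relu')(x)
x = BatchNormalization()(x)   # after each Conv1D, as suggested
x = Dropout(0.5)(x)
x = Conv1D(64, 5, activation='relu')(x)
x = BatchNormalization()(x)
x = GlobalMaxPooling1D()(x)
outputs = Dense(y_train.shape[1], activation='softmax')(x)
model = Model(inp, outputs)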
This is part of the code for constructing a CNN in a book.
I don't understand why filters = 64 here. As far as I know, this is the number of feature maps. How do I determine this number when I make my own CNN?
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dropout, Dense, Activation
from keras.utils import plot_model

# network parameters (image_size and num_labels are defined earlier in the book)
# image is processed as is (square grayscale)
input_shape = (image_size, image_size, 1)
batch_size = 128
kernel_size = 3
pool_size = 2
filters = 64
dropout = 0.2

model = Sequential()
model.add(Conv2D(filters=filters,
                 kernel_size=kernel_size,
                 activation='relu',
                 input_shape=input_shape))
model.add(MaxPooling2D(pool_size))
model.add(Conv2D(filters=filters,
                 kernel_size=kernel_size,
                 activation='relu'))
model.add(MaxPooling2D(pool_size))
model.add(Conv2D(filters=filters,
                 kernel_size=kernel_size,
                 activation='relu'))
model.add(Flatten())
# dropout added as regularizer
model.add(Dropout(dropout))
# output layer is 10-dim one-hot vector
model.add(Dense(num_labels))
model.add(Activation('softmax'))
model.summary()
plot_model(model, to_file='cnn-mnist.png', show_shapes=True)
Filters are the number of features you want to detect in the image, also known as feature detectors. The count is a hyper-parameter and entirely up to you.
There are several well-established CNN architectures, so it is best to look for existing solutions to the problem you are trying to solve with your CNN. Then try tuning the filter value and check whether accuracy increases.
You can choose the number of filters based on the complexity of the task. The filter count tends to increase with each layer: the first layer extracts simple features, and deeper layers extract more complex ones. Have a look at the link below for further reference, and at the sketch after it. Hope I helped :)
https://stackoverflow.com/a/48243420/9024042
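To illustrate the "filters grow with depth" pattern in code (the counts 32/64/128 here are conventional choices, not prescribed values):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D

model = Sequential([
    # early layers: few filters, simple features (edges, blobs)
    Conv2D(32, kernel_size=3, activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D(pool_size=2),
    # deeper layers: more filters, more complex features
    Conv2D(64, kernel_size=3, activation='relu'),
    MaxPooling2D(pool_size=2),
    Conv2D(128, kernel_size=3, activation='relu'),
])
model.summary()  # inspect how the feature-map count grows with depth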
I have a CNN and I want to sneak in some extra information into one of the final layers.
Here's a simplified version of the code. Watch for the comment.
def define_model():
    model = Sequential()
    model.add(Conv2D(32, (3,3)))
    model.add(Conv2D(32, (3,3)))
    model.add(MaxPooling2D((2,2)))
    model.add(Conv2D(64, (3,3)))
    model.add(Conv2D(64, (3,3)))
    model.add(MaxPooling2D((2,2)))
    model.add(Flatten())
    # this next layer is where I want to sneak the neuron(s) in
    model.add(Dense(1024))
    model.add(Dropout(rate=0.4))
    model.add(Dense(168))
    model.compile()
    return model
So I have some additional information about the input image which might be able to help the network. Think of it as a clue which may or may not deserve a reasonable amount of weighting.
The clue is in the form of an integer which technically is in [0, inf) but practically is probably in [0, 20].
So my questions are:
What's the appropriate way to represent that hint, in terms of NN architecture in general?
How do I tweak the Keras model to make that happen in practice?
Bonus: If I wanted to, could I prevent the subsequent dropout from ever dropping out this added feature?
This could work by using Keras' functional API. (Note: the hints are joined with Concatenate, which appends them as extra features regardless of their size; Add would require the hint tensor to exactly match the flattened feature shape.)
def define_model():
    inputs = Input(shape=(...))
    hints = Input(shape=(...))
    x = Conv2D(32, (3,3))(inputs)
    x = Conv2D(32, (3,3))(x)
    x = MaxPooling2D((2,2))(x)
    x = Conv2D(64, (3,3))(x)
    x = Conv2D(64, (3,3))(x)
    x = MaxPooling2D((2,2))(x)
    x = Flatten()(x)
    x = Concatenate()([x, hints])
    x = Dense(1024)(x)
    x = Dropout(rate=0.4)(x)
    outputs = Dense(168)(x)
    model = Model([inputs, hints], outputs)
    model.compile()
    return model
I don't know about protecting it from the dropout using Keras though.
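One possible arrangement for the bonus question (a sketch, not tested): apply the Dropout to the convolutional features before concatenating the hints, so the hint values are never candidates for dropping. The image shape here is a hypothetical example:

from tensorflow.keras.layers import (Input, Conv2D, MaxPooling2D, Flatten,
                                     Dense, Dropout, Concatenate)
from tensorflow.keras.models import Model

def define_model_hint_safe():
    # Hypothetical variant of define_model() above
    inputs = Input(shape=(32, 32, 3))   # example image shape, an assumption
    hints = Input(shape=(1,))           # the integer clue as a single feature
    x = Conv2D(32, (3, 3))(inputs)
    x = MaxPooling2D((2, 2))(x)
    x = Flatten()(x)
    x = Dense(1024)(x)
    x = Dropout(rate=0.4)(x)            # dropout applies only to the image features
    x = Concatenate()([x, hints])       # hints join after dropout, so they are never dropped
    outputs = Dense(168)(x)
    return Model([inputs, hints], outputs)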
I'm building a Keras model to categorise data into one of 9 categories. The issue is that it will only work with a sigmoid activation, which is designed for binary outputs; other activations result in 0 accuracy. What would I need to change for it to classify into each of the labels?
# Reshape data to add a new dimension
X_train = X_train.reshape((100, 150, 1))
Y_train = Y_train.reshape((100, 1, 1))

model = Sequential()
model.add(Conv1D(1, kernel_size=3, activation='relu', input_shape=(None, 1)))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='categorical_hinge', optimizer='adam', metrics=['accuracy'])
model.fit(x=X_train, y=Y_train, epochs=200, batch_size=20)
A single-unit dense layer is not what we use for multi-class classification; you should first ensure that your Y data are one-hot encoded. If they are not, you can make them so using Keras' utility function:
num_classes=9
Y_train = keras.utils.to_categorical(Y_train, num_classes)
and then change your last layer to:
model.add(Dense(num_classes))
model.add(Activation('softmax'))
Also, if you don't have any specific reason to use the categorical hinge loss, I would suggest starting with loss='categorical_crossentropy' in your model compilation.
That said, your model seems too simple, and you may want to try adding some more layers; a sketch of the combined changes follows.
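Putting those suggestions together, a minimal sketch of a multi-class version (the conv layer sizes are illustrative assumptions, not tuned values; the input shape follows the (100, 150, 1) reshape in the question):

import keras
from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, Dense

num_classes = 9
Y_train = keras.utils.to_categorical(Y_train, num_classes)  # one-hot targets

model = Sequential()
model.add(Conv1D(32, kernel_size=3, activation='relu', input_shape=(150, 1)))
model.add(Conv1D(64, kernel_size=3, activation='relu'))
model.add(GlobalMaxPooling1D())  # collapse the sequence dimension
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, Y_train, epochs=200, batch_size=20)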