I am a novice in TensorFlow and I am trying to use BERT embeddings in an LSTM model.
This is my model function:
def bert_tweets_model():
    Bertmodel = TFAutoModel.from_pretrained(model_name, output_hidden_states=True)
    input_word_ids = tf.keras.Input(shape=(max_length,), dtype=tf.int32, name="input_ids")
    input_masks_in = tf.keras.Input(shape=(max_length,), name='masked_token', dtype='int32')
    with torch.no_grad():
        last_hidden_states = Bertmodel(input_word_ids, attention_mask=input_masks_in)[0]
    x = tf.keras.layers.LSTM(100, dropout=0.1, activation='relu', recurrent_dropout=0.3, return_sequences=True)(last_hidden_states)
    x = tf.keras.layers.LSTM(50, dropout=0.1, activation='relu', recurrent_dropout=0.3, return_sequences=True)(x)
    x = tf.keras.layers.Flatten()(x)
    output = tf.keras.layers.Dense(units=2, activation='sigmoid')(x)
    model = tf.keras.Model(inputs=[input_word_ids, input_masks_in], outputs=output)
    return model
with strategy.scope():
    model = bert_tweets_model()
    adam_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
    model.compile(loss='binary_crossentropy', optimizer=adam_optimizer, metrics=['accuracy'])
    model.summary()

validation_data = [dev_encoded, y_val]
train2 = [input_id, attention_mask]

history = model.fit(
    x=train2, y=y_train, batch_size=batch_size,
    epochs=3,
    validation_data=validation_data,
    verbose=2)
I received this error in the fit function when I tried to input the data:
"ValueError: Layer "model_1" expects 2 input(s), but it received 1 input tensors. Inputs received: [<tf.Tensor 'IteratorGetNext:0' shape=(None, 512) dtype=int32>]"
Also, I received these warning messages and I do not know what they mean:
WARNING:tensorflow:Layer lstm_2 will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.
WARNING:tensorflow:Layer lstm_3 will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.
Can someone help me? Thanks in advance.
Reproducing your error:
_input1 = tf.random.uniform((1,100), 0 , 10)
_input2 = tf.random.uniform((1,100), 0 , 10)
model(_input1, _input2)
After running this code I am getting the same error...
Layer "model" expects 2 input(s), but it received 1 input tensors. Inputs received: [<tf.Tensor: shape=(1, 100), ...
Now, the problem is that you have to enclose the inputs in a tuple or list and then pass them to the model, like this:
model((_input1, _input2))
<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[0.5324366, 0.3743334]], dtype=float32)>
Remember: if you are using tf.data.Dataset, enclose the inputs in a tuple when building the dataset, like this:
tf.data.Dataset.from_tensor_slices((words_id, words_mask))
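Applied to your fit call, a minimal sketch would be to group the two input tensors into a tuple for x and do the same inside validation_data (dev_input_id and dev_attention_mask are placeholder names for however you encoded your validation set):

train2 = (input_id, attention_mask)                              # both inputs together
validation_data = ((dev_input_id, dev_attention_mask), y_val)    # inputs grouped, then labels

history = model.fit(
    x=train2, y=y_train, batch_size=batch_size,
    epochs=3,
    validation_data=validation_data,
    verbose=2)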
Second problem, as you asked:
You are getting the warning because your LSTM layers do not meet the criteria for the fused cuDNN kernel (for example, they use activation='relu' instead of the default 'tanh' and a non-zero recurrent_dropout), so TensorFlow falls back to a slower, generic GPU kernel. It is just telling you that these LSTM layers will run more slowly on the GPU.
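If you do want the fast cuDNN path, a minimal sketch of your two LSTM layers with settings that meet its criteria would be:

# cuDNN requires activation='tanh', recurrent_activation='sigmoid',
# recurrent_dropout=0, unroll=False and use_bias=True
x = tf.keras.layers.LSTM(100, dropout=0.1, recurrent_dropout=0.0, return_sequences=True)(last_hidden_states)
x = tf.keras.layers.LSTM(50, dropout=0.1, recurrent_dropout=0.0, return_sequences=True)(x)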
I'm trying to recreate a transformer written in PyTorch in TensorFlow. The problem is that, despite following the documentation for both the PyTorch and TensorFlow versions, the outputs still come out quite differently.
I wrote a little code snippet to show the issue:
import torch
import tensorflow as tf
import numpy as np
class TransformerLayer(tf.Module):
    def __init__(self, d_model, nhead, dropout=0):
        super(TransformerLayer, self).__init__()
        self.self_attn = torch.nn.MultiheadAttention(d_model, nhead, dropout=dropout)
batch_size = 2
seq_length = 5
d_model = 10
src = np.random.uniform(size=(batch_size, seq_length, d_model))
srcTF = tf.convert_to_tensor(src)
srcPT = torch.Tensor(src.reshape((seq_length, batch_size, d_model)))
self_attnTF = tf.keras.layers.MultiHeadAttention(key_dim=10, num_heads=5, dropout=0)
transformer_encoder = TransformerLayer(d_model=10, nhead=5, dropout=0.0)
output, scores = self_attnTF(srcTF, srcTF, srcTF, return_attention_scores=True)
print("Tensorflow Attendtion outputs:", output)
print("Tensorflow (averaged) weights:", tf.math.reduce_mean(scores, 1))
print("Torch Attendtion outputs:", transformer_encoder.self_attn(srcPT,srcPT,srcPT)[0])
print("Torch attention output weights:", transformer_encoder.self_attn(srcPT,srcPT,srcPT)[1])
and the result is:
Tensorflow Attention outputs: tf.Tensor(
[[[ 0.02602757 -0.14134401 0.00855263 0.4735083 -0.01851891
-0.20382246 -0.18152176 -0.21076852 0.08623976 -0.33548725]
[ 0.02607442 -0.1403394 0.00814065 0.47415024 -0.01882939
-0.20353754 -0.18291879 -0.21234266 0.08595885 -0.33613583]
[ 0.02524654 -0.14096384 0.00870436 0.47411725 -0.01800703
-0.20486829 -0.18163288 -0.21082559 0.08571021 -0.3362339 ]
[ 0.02518575 -0.14039244 0.0090138 0.47431853 -0.01775141
-0.20391947 -0.18138805 -0.2118245 0.08432849 -0.33521986]
[ 0.02556361 -0.14039293 0.00876258 0.4746476 -0.01891363
-0.20398234 -0.18229616 -0.21147579 0.08555281 -0.33639923]]
[[ 0.07844199 -0.1614371 0.01649148 0.5287745 0.05126739
-0.13851154 -0.09829871 -0.1621251 0.01922669 -0.2428589 ]
[ 0.07844222 -0.16024739 0.01805423 0.52941847 0.04975721
-0.13537636 -0.09829231 -0.16129729 0.01979005 -0.24491176]
[ 0.07800542 -0.160701 0.01677295 0.52902794 0.05082911
-0.13843337 -0.09805533 -0.16165744 0.01928401 -0.24327613]
[ 0.07815789 -0.1600025 0.01757433 0.5291927 0.05032986
-0.1368022 -0.09849522 -0.16172451 0.01929555 -0.24438493]
[ 0.0781548 -0.16028519 0.01764914 0.52846324 0.04941286
-0.13746066 -0.09787872 -0.16141161 0.01994199 -0.2440269 ]]], shape=(2, 5, 10), dtype=float32)
Tensorflow (averaged) weights: tf.Tensor(
[[[0.199085 0.20275716 0.20086522 0.19873264 0.19856 ]
[0.2015336 0.19960018 0.20218948 0.19891861 0.19775811]
[0.19906266 0.20318432 0.20190334 0.19812575 0.19772394]
[0.20074987 0.20104568 0.20269363 0.19744729 0.19806348]
[0.19953248 0.20176074 0.20314851 0.19782843 0.19772986]]
[[0.2010009 0.20053487 0.20004745 0.20092985 0.19748697]
[0.20034568 0.20035927 0.19955876 0.20062163 0.19911464]
[0.19967113 0.2006859 0.20012529 0.20047483 0.19904283]
[0.20132652 0.19996871 0.20019794 0.20008174 0.19842513]
[0.2006393 0.20000939 0.19938737 0.20054278 0.19942114]]], shape=(2, 5, 5), dtype=float32)
Torch Attention outputs: tensor([[[ 0.1097, -0.4467, -0.0719, -0.1779, -0.0766, -0.1247, 0.1557,
0.0051, -0.3932, -0.1323],
[ 0.1264, -0.3822, 0.0759, -0.0335, -0.1084, -0.1539, 0.1475,
-0.0272, -0.4235, -0.1744]],
[[ 0.1122, -0.4502, -0.0747, -0.1796, -0.0756, -0.1271, 0.1581,
0.0049, -0.3964, -0.1340],
[ 0.1274, -0.3823, 0.0754, -0.0356, -0.1091, -0.1547, 0.1477,
-0.0272, -0.4252, -0.1752]],
[[ 0.1089, -0.4427, -0.0728, -0.1746, -0.0756, -0.1202, 0.1501,
0.0031, -0.3894, -0.1242],
[ 0.1263, -0.3820, 0.0718, -0.0374, -0.1063, -0.1562, 0.1485,
-0.0271, -0.4233, -0.1761]],
[[ 0.1061, -0.4369, -0.0685, -0.1696, -0.0772, -0.1173, 0.1454,
0.0012, -0.3860, -0.1201],
[ 0.1265, -0.3820, 0.0762, -0.0325, -0.1082, -0.1560, 0.1501,
-0.0271, -0.4249, -0.1779]],
[[ 0.1043, -0.4402, -0.0705, -0.1719, -0.0791, -0.1205, 0.1508,
0.0018, -0.3895, -0.1262],
[ 0.1260, -0.3805, 0.0775, -0.0298, -0.1083, -0.1547, 0.1494,
-0.0276, -0.4242, -0.1768]]], grad_fn=<AddBackward0>)
Torch attention output weights: tensor([[[0.2082, 0.2054, 0.1877, 0.1956, 0.2031],
[0.2100, 0.2079, 0.1841, 0.1943, 0.2037],
[0.2007, 0.1995, 0.1929, 0.1999, 0.2070],
[0.1995, 0.1950, 0.1976, 0.2002, 0.2077],
[0.1989, 0.1969, 0.1970, 0.2024, 0.2048]],
[[0.2095, 0.1902, 0.1987, 0.2027, 0.1989],
[0.2090, 0.1956, 0.1997, 0.2004, 0.1952],
[0.2047, 0.1869, 0.2006, 0.2121, 0.1957],
[0.2073, 0.1953, 0.1982, 0.2014, 0.1978],
[0.2089, 0.2003, 0.1953, 0.1957, 0.1998]]], grad_fn=<DivBackward0>)
The output weights look similar but the base attention outputs are way off. Is there any way to make the Tensorflow model come out more like the Pytorch one? Any help would be greatly appreciated!
In MultiHeadAttention there is also a projection layer, like
Q = W_q @ input_query + b_q
K = W_k @ input_keys + b_k
V = W_v @ input_values + b_v
The matrices W_q, W_k, W_v and the biases b_q, b_k, b_v are initialized randomly, so a difference in outputs is expected (even between two distinct layers in PyTorch on the same input). After the self-attention operation there is one more projection, and it is also initialized randomly. Weights can be set manually in TensorFlow by calling the set_weights method of self_attnTF.
The correspondence between the weights in tf.keras.layers.MultiHeadAttention and nn.MultiheadAttention is not obvious: for example, PyTorch packs the query/key/value projections into a single in_proj_weight matrix, while TF stores separate query/key/value kernels shaped per head. So if you are using the weights of a pretrained PyTorch model and trying to put them into a TensorFlow model (for whatever reason), it will certainly take more than five minutes.
The results should be the same if, after initializing the PyTorch and TensorFlow models, you step through their parameters and assign them identical values.
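For example, here is a minimal sketch of copying the PyTorch weights into a TF layer, reusing the names from your snippet. It assumes the two layers are configured equivalently (key_dim = d_model // nhead, i.e. key_dim=2 and num_heads=5 here, not key_dim=10), that the TF layer has been built by calling it once, and that set_weights expects the query/key/value/output kernels and biases in that order; verify the exact ordering and shapes against get_weights() on your TF version.

d_model, nhead = 10, 5
head_dim = d_model // nhead

mha_tf = tf.keras.layers.MultiHeadAttention(key_dim=head_dim, num_heads=nhead, dropout=0)
_ = mha_tf(srcTF, srcTF, srcTF)          # build the layer so its variables exist

mha_pt = transformer_encoder.self_attn
in_w  = mha_pt.in_proj_weight.detach().numpy()    # (3*d_model, d_model), stacked q/k/v
in_b  = mha_pt.in_proj_bias.detach().numpy()      # (3*d_model,)
out_w = mha_pt.out_proj.weight.detach().numpy()   # (d_model, d_model)
out_b = mha_pt.out_proj.bias.detach().numpy()     # (d_model,)

w_q, w_k, w_v = np.split(in_w, 3)
b_q, b_k, b_v = np.split(in_b, 3)

def to_tf_kernel(w):
    # torch projects as x @ w.T; the TF kernel has shape (d_model, num_heads, head_dim)
    return w.T.reshape(d_model, nhead, head_dim)

mha_tf.set_weights([
    to_tf_kernel(w_q), b_q.reshape(nhead, head_dim),
    to_tf_kernel(w_k), b_k.reshape(nhead, head_dim),
    to_tf_kernel(w_v), b_v.reshape(nhead, head_dim),
    out_w.T.reshape(nhead, head_dim, d_model), out_b,
])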
I am following this tutorial on how to train a siamese BERT network:
https://keras.io/examples/nlp/semantic_similarity_with_bert/
All good, but I am not sure what the best way is to save the model after training it.
Any suggestions?
I was trying with
model.save('models/bert_siamese_v1')
which creates a folder with saved_model.pb, keras_metadata.pb and two subfolders (variables and assets).
then I try to load it with:
model.load_weights('models/bert_siamese_v1/')
and it gives me this error:
2022-03-08 14:11:52.567762: W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open models/bert_siamese_v1/: Failed precondition: models/bert_siamese_v1; Is a directory: perhaps your file is in a different file format and you need to use a different restore operator?
what is the best way to proceed?
Try using tf.saved_model.save to save your model:
tf.saved_model.save(model, 'models/bert_siamese_v1')
model = tf.saved_model.load('models/bert_siamese_v1')
The warning you get during saving can apparently be ignored. After loading your model, you can use it for inference with f(test_data):
f = model.signatures["serving_default"]
x1 = tf.random.uniform((1, 128), maxval=100, dtype=tf.int32)
x2 = tf.random.uniform((1, 128), maxval=100, dtype=tf.int32)
x3 = tf.random.uniform((1, 128), maxval=100, dtype=tf.int32)
print(f)
print(f(attention_masks = x1, input_ids = x2, token_type_ids = x3))
ConcreteFunction signature_wrapper(*, token_type_ids, attention_masks, input_ids)
Args:
attention_masks: int32 Tensor, shape=(None, 128)
input_ids: int32 Tensor, shape=(None, 128)
token_type_ids: int32 Tensor, shape=(None, 128)
Returns:
{'dense': <1>}
<1>: float32 Tensor, shape=(None, 3)
{'dense': <tf.Tensor: shape=(1, 3), dtype=float32, numpy=array([[0.40711606, 0.13456087, 0.45832306]], dtype=float32)>}
It seems you have two options
manually save weights
model.save_weights('./checkpoints/my_checkpoint')
model = create_model()
model.load_weights('./checkpoints/my_checkpoint')
save the entire model
Call model.save to save a model's architecture, weights, and training configuration in a single file/folder. This allows you to export a model so it can be used without access to the original Python code*. Since the optimizer-state is recovered, you can resume training from exactly where you left off.
Save model
# Create and train a new model instance.
model = create_model()
model.fit(train_images, train_labels, epochs=5)
# Save the entire model as a SavedModel.
!mkdir -p saved_model
model.save('saved_model/my_model')
Load model
new_model = tf.keras.models.load_model('saved_model/my_model')
It seems that you are mixing both approaches: saving the entire model but then loading only weights.
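For your specific case, a minimal sketch of the matched pair (reusing the path from your snippet) would be:

model.save('models/bert_siamese_v1')                           # saves the entire model
model = tf.keras.models.load_model('models/bert_siamese_v1')  # load_model, not load_weights

Depending on the custom layers in the tutorial model, load_model may need extra arguments (e.g. custom_objects), but the key point is to pair save with load_model, or save_weights with load_weights, not mix them.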
I've been writing some custom layers and I have realized my bias values will train but my weights are not training. I'm going to use a very simplified code here to illustrate the issue.
class myWeights(Layer):
    def __init__(self, units, **kwargs):
        self.units = units
        super(myWeights, self).__init__(**kwargs)
    def build(self, input_shape):
        self.w = self.add_weight(shape=(input_shape[-1], self.units),
                                 initializer='GlorotUniform',
                                 trainable=True)
        self.b = self.add_weight(shape=(self.units,),
                                 initializer='random_normal',
                                 trainable=True)
        super(myWeights, self).build(input_shape)
    def call(self, inputs):
        return tf.matmul(inputs, self.w) + self.b
    def compute_output_shape(self, input_shape):
        return (input_shape[0], self.units)
Now I set up MNIST data to train. I also set a seed so this is reproducible on your end.
tf.random.set_seed(1234)
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train=tf.keras.utils.normalize(x_train, axis=1)
x_test=tf.keras.utils.normalize(x_test, axis=1)
I build out the model using the functional API
inp=Input(shape=(x_train.shape[1:]))
flat=Flatten()(inp)
hid=myWeights(32)(flat)
out=Dense(10, 'softmax')(hid)
model=Model(inp,out)
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
Now when I check the values of the parameters using
print(model.layers[2].get_weights())
I see output like the following, which I have reformatted for easier reading.
[array([[ 0.00652369, -0.02321771, 0.01399945, ..., -0.07599965,
-0.04356881, -0.0333882 ],
[-0.03132245, -0.05264733, 0.05576386, ..., -0.03755575,
0.07358163, -0.02338506],
[-0.01808248, 0.04092623, 0.02177643, ..., 0.00971264,
0.07631209, 0.0495184 ],
...,
[-0.03780914, 0.00219346, 0.04460619, ..., -0.06703794,
0.03407502, -0.01071112],
[-0.0012739 , -0.0683699 , -0.06152753, ..., 0.05373723,
0.03079057, 0.00855774],
[ 0.06245673, -0.07649396, 0.06748571, ..., -0.06948434,
-0.01416317, -0.08318184]], dtype=float32),
array([ 0.05734033, 0.04822996, 0.04391507, -0.01550511, 0.05383257,
0.05043739, -0.04092903, -0.0081823 , -0.06425817, 0.02402171,
-0.00374672, -0.06069579, -0.08422226, 0.02909392, -0.02071654,
0.0422841 , -0.05020861, 0.01267704, 0.0365625 , -0.01743891,
-0.01030697, 0.00639807, -0.01493454, 0.03214667, 0.03262959,
0.07799669, 0.05789128, 0.01754347, -0.07558075, 0.0466203 ,
-0.05332188, 0.00270758], dtype=float32)]
After training with
model.fit(x_train,y_train, epochs=3, verbose=1)
print(model.layers[2].get_weights())
I find the following output.
[array([[ 0.00652369, -0.02321771, 0.01399945, ..., -0.07599965,
-0.04356881, -0.0333882 ],
[-0.03132245, -0.05264733, 0.05576386, ..., -0.03755575,
0.07358163, -0.02338506],
[-0.01808248, 0.04092623, 0.02177643, ..., 0.00971264,
0.07631209, 0.0495184 ],
...,
[-0.03780914, 0.00219346, 0.04460619, ..., -0.06703794,
0.03407502, -0.01071112],
[-0.0012739 , -0.0683699 , -0.06152753, ..., 0.05373723,
0.03079057, 0.00855774],
[ 0.06245673, -0.07649396, 0.06748571, ..., -0.06948434,
-0.01416317, -0.08318184]], dtype=float32),
array([-0.250459 , -0.21746232, 0.01250297, 0.00065066, -0.09093136,
0.04943814, -0.13446714, -0.11985168, 0.23259214, -0.14288908,
0.03274751, 0.1462888 , -0.2206902 , 0.14455307, 0.17767513,
0.11378342, -0.22250313, 0.11601174, -0.1855521 , 0.0900097 ,
0.21218981, -0.03386492, -0.06818825, 0.34211585, -0.24891953,
0.08827516, 0.2806849 , 0.07634751, -0.32905066, -0.1860122 ,
0.06170518, -0.20212872], dtype=float32)]
I can see that the bias values have changed but the weight values are static. I'm not sure at all why this is occurring.
What you're building is a Multilayer Perceptron (MLP). An MLP is usually composed of one (passthrough) input layer, one or more layers of TLUs called hidden layers, and one final layer of TLUs called the output layer.
Here the signal flows only in one direction (from the inputs to the outputs), so this architecture is an example of a feedforward neural network (FNN).
See this link, which explains feedforward neural networks.
Coming to the explanation of your code: you are initializing the weights using some initializers. The first initialization of weights happens at the hidden layer, and they then get updated in the next Dense layer.
So whatever weights are initialized in the hidden layer will remain the same even after training, since it is a feedforward neural network and is not dependent on the output of the current layer.
But if you want to check your code, you can include one more hidden layer exactly like the one already present and look at the weights of layer 3 (hidden layer 2), which looks something like this:
inp=Input(shape=(x_train.shape[1:]))
flat=Flatten()(inp)
hid=myWeights(32)(flat)
hid2=myWeights(32)(hid)
out=Dense(10, 'softmax')(hid2)
model=Model(inp,out)
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
Then printing the weights of the hidden2 layer before and after fit will give you different values, since the weights of hidden layer 2 depend on the output of hidden layer 1.
print(model.layers[3].get_weights())
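If you want to compare numerically rather than by eye (the printed arrays are truncated with '...'), a small sketch that works for whichever layer you want to inspect (index 3 here, matching the print above):

import numpy as np

before = model.layers[3].get_weights()[0].copy()   # kernel before training
model.fit(x_train, y_train, epochs=3, verbose=1)
after = model.layers[3].get_weights()[0]           # kernel after training
print(np.abs(after - before).max())                # maximum change in any weight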
I want to have a model that only predicts a certain syntactic category, for example verbs. Can I update the weights of the LSTM so that they are set to 1 if the word is a verb and 0 if it is any other category?
This is my current code:
model = Sequential()
model.add(Embedding(vocab_size, embedding_size, input_length=5, weights=[pretrained_weights]))
model.add(Bidirectional(LSTM(units=embedding_size)))
model.add(Dense(2000, activation='softmax'))
for e in zip(model.layers[-1].trainable_weights, model.layers[-1].get_weights()):
    print('Param %s:\n%s' % (e[0], e[1]))
weights = [layer.get_weights() for layer in model.layers]
print(weights)
print(model.summary())
# compile network
model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(lr=0.001),
              metrics=['accuracy'])
# fit network
history = model.fit(X_train_fit, y_train_fit, epochs=100, verbose=2, validation_data=(X_val, y_val))
score = model.evaluate(x=X_test, y=y_test, batch_size=32)
These are the weights that I am returning:
Param <tf.Variable 'dense_1/kernel:0' shape=(600, 2000) dtype=float32_ref>:
[[-0.00803087 0.0332068 -0.02052244 ... 0.03497869 0.04023124
-0.02789269]
[-0.02439511 0.02649114 0.00163587 ... -0.01433908 0.00598045
0.00556619]
[-0.01622458 -0.02026448 0.02620039 ... 0.03154427 0.00676246
0.00236203]
...
[-0.00233192 0.02012364 -0.01562861 ... -0.01857186 -0.02323328
0.01365903]
[-0.02556716 0.02962652 0.02400535 ... -0.01870854 -0.04620285
-0.02111554]
[ 0.01415684 -0.00216265 0.03434955 ... 0.01771339 0.02930249
0.002172 ]]
Param <tf.Variable 'dense_1/bias:0' shape=(2000,) dtype=float32_ref>:
[0. 0. 0. ... 0. 0. 0.]
[[array([[-0.023167 , -0.0042483, -0.10572 , ..., 0.089398 , -0.0159 ,
0.14866 ],
[-0.11112 , -0.0013859, -0.1778 , ..., 0.063374 , -0.12161 ,
0.039339 ],
[-0.065334 , -0.093031 , -0.017571 , ..., 0.16642 , -0.13079 ,
0.035397 ],
and so on.
Can I do it by updating the weights? Or is there a more efficient way to be able to only output verbs?
Thank you for the help!
In this model, with this loss (categorical_crossentropy), you cannot learn verb/non-verb labels without supervision, so you need labeled data. Perhaps you can use a tagged corpus, e.g. the Penn Treebank corpus, and train this model to take the input words and predict the output labels (a closed class of labels).
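For example, a minimal sketch of building binary verb labels from NLTK's Penn Treebank sample (assumes the nltk package and its treebank corpus are available; in the Penn tagset all verb tags start with 'VB'):

import nltk
# nltk.download('treebank')   # one-time download of the sample corpus
from nltk.corpus import treebank

tagged = treebank.tagged_sents()    # list of [(word, tag), ...] per sentence
words  = [[w for w, t in sent] for sent in tagged]
labels = [[1 if t.startswith('VB') else 0 for w, t in sent] for sent in tagged]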
If you want one binary tag per word (verb vs. non-verb), you can change the model so that the last layer outputs a value between 0 and 1:
model.add(Dense(1, activation='sigmoid'))
Then change the loss function to binary cross-entropy:
# compile network
model.compile(loss='binary_crossentropy',
              optimizer=RMSprop(lr=0.001),
              metrics=['accuracy'])
Then, instead of category labels, y_train_fit should contain 1 and 0 values representing verb/non-verb for each word.
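Putting it together, a minimal sketch of the per-word binary setup, keeping your embedding; return_sequences=True is needed so the model emits one prediction per word, and y_train_fit then has shape (num_samples, 5, 1):

model = Sequential()
model.add(Embedding(vocab_size, embedding_size, input_length=5, weights=[pretrained_weights]))
model.add(Bidirectional(LSTM(units=embedding_size, return_sequences=True)))
model.add(Dense(1, activation='sigmoid'))   # one verb/non-verb score per word
model.compile(loss='binary_crossentropy',
              optimizer=RMSprop(lr=0.001),
              metrics=['accuracy'])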