Huggingface Transformers Tensorflow fine-tuned distilgpt2 bad outputs - python

I fine-tuned a model starting from the 'distilgpt2' checkpoint. I fit the model with the model.fit() method and saved the resulting model with the .save_pretrained() method.
When I use this model to generate text:
import transformers
from transformers import TFAutoModelForCausalLM, AutoTokenizer
original_model = 'distilgpt2'
path2model = 'clm_model_save'
path2tok = 'clm_tokenizer_save'
tuned_model = TFAutoModelForCausalLM.from_pretrained(path2model, from_pt=False)
tuned_tokenizer = AutoTokenizer.from_pretrained(path2tok)
input_context = 'The dog'
input_ids = tuned_tokenizer.encode(input_context, return_tensors='tf') # encode input context
outputs = tuned_model.generate(input_ids=input_ids,
                               max_length=40,
                               temperature=0.7,
                               num_return_sequences=3,
                               do_sample=True)  # generate 3 candidates using sampling
for i in range(3):  # 3 output sequences were generated
    print(f'Generated {i}: {tuned_tokenizer.decode(outputs[i], skip_special_tokens=True)}')
The model returns the output:
>>>All model checkpoint layers were used when initializing TFGPT2LMHeadModel.
>>>All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at clm_model_save.
>>>If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
>>>Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence
>>>Generated 0: The dog!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>>>Generated 1: The dog!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>>>Generated 2: The dog!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
When I use the original checkpoint, distilgpt2, the model generates text just fine. Is this a sign of some sort of misconfiguration? Or is this simply a sign of a poorly trained model?
I've tried using the original checkpoint tokenizer, manually setting the pad_token_id, using a much longer input context, and changing several parameters for the .generate() method. Same results each time.
Also, I added special tokens to my tuned_tokenizer:
tuned_tokenizer.special_tokens_map
>>>{'bos_token': '<|startoftext|>',
>>> 'eos_token': '<|endoftext|>',
>>> 'unk_token': '<|endoftext|>',
>>> 'pad_token': '<|PAD|>'}
Compared to the original tokenizer:
tokenizer.special_tokens_map
>>> {'bos_token': '<|endoftext|>',
>>> 'eos_token': '<|endoftext|>',
>>> 'unk_token': '<|endoftext|>'}

Related

Meta-learning to find optimal model from pre-trained models in Tensorflow

I have many pre-trained models with a different number of layers (Models are not Sequential). Training data had a shape (1, 1, 103) for these models and output was a class label between 0 and 9.
I loaded these saved models, set all layers as non-trainable, and used these models in a new architecture as follows:
inp = keras.layers.Input(shape=(1,1,103), name = "new_input")
out_1 = model_1(inp) # model_1 is the name of variable where I loaded trained model
out_2 = model_2(inp)
out_3 = model_3(inp)
out_4 = model_4(inp)
x = keras.layers.concatenate([out_1, out_2, out_3, out_4])
out = keras.layers.Dense(1)(x)
model = keras.models.Model(inputs=inp, outputs=out, name = "meta_model")
I compile this model with optimizer="sgd" and loss="mse". I don't get any error up to this point, but when I run model.fit() I get this error:
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
I'm not sure where I'm going wrong.
The previous models were trained with "adam" optimizer and "sparse_categorical_crossentropy" loss and the dataset had 10 classes.
The objective of this model was to train it on the same data and find out which of the previously trained models is optimal.
Any other solution/suggestion for finding an optimal number of layers using meta-learning would also be appreciated. I can find the optimal number of layers manually by trial and error, but I want the meta-model to find it based on the dataset.
E.g.: by training on dataset1 I found that there was no significant increase in accuracy after 7 layers, whereas for dataset2 accuracy peaked at 4 layers and adding more layers was useless.
For hyperparameter tuning I can recommend Ray Tune. I use it and like this framework very much.
https://docs.ray.io/en/latest/tune/examples/tune_mnist_keras.html
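For example, a rough sketch of searching over the number of layers with Ray Tune (a sketch only: it assumes you have a build_model(num_layers=...) helper and train/validation arrays of your own, and it uses the older tune.run API from the linked Keras example):
from ray import tune
from ray.tune.integration.keras import TuneReportCallback

def train_candidate(config):
    # build_model, x_train/y_train and x_val/y_val are placeholders for your own code
    model = build_model(num_layers=config["num_layers"])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.fit(x_train, y_train,
              validation_data=(x_val, y_val),
              epochs=10,
              callbacks=[TuneReportCallback({"val_accuracy": "val_accuracy"})])

analysis = tune.run(
    train_candidate,
    config={"num_layers": tune.grid_search([2, 4, 7, 10])},
    metric="val_accuracy",
    mode="max",
)
print(analysis.best_config)  # the candidate that reached the best validation accuracy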

Bert using transformer's pipeline and encode_plus function

when I use:
modelname = 'deepset/bert-base-cased-squad2'
model = BertForQuestionAnswering.from_pretrained(modelname)
tokenizer = AutoTokenizer.from_pretrained(modelname)
nlp = pipeline('question-answering', model=model, tokenizer=tokenizer)
result = nlp({'question': question,'context': context})
it doesn't crash. However, when I use encode_plus():
modelname = 'deepset/bert-base-cased-squad2'
model = BertForQuestionAnswering.from_pretrained(modelname)
tokenizer = AutoTokenizer.from_pretrained(modelname)
inputs= tokenizer.encode_plus(question,context,return_tensors='pt')
I have this error:
The size of tensor a (629) must match the size of tensor b (512) at non-singleton dimension 1
I understand the error, but why don't I get it in the first case? Can someone explain the difference?
The reason for the error in the second snippet is that the tokenized input is longer than the 512 tokens the model can handle, so it does not fit in the PyTorch tensor. You have to set the truncation flag to True when encoding, so that input that would not fit is cut down to the maximum length, i.e.:
inputs = tokenizer.encode_plus(question, context, truncation=True, max_length=512, return_tensors='pt')
There is no problem when using the pipeline, most likely because the question-answering pipeline handles over-long contexts itself (it truncates or splits the input by default).
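For reference, here is a minimal sketch of the corrected flow run outside the pipeline; it assumes question and context are defined as in the question and a recent transformers version that returns named outputs:
import torch
from transformers import AutoTokenizer, BertForQuestionAnswering

modelname = 'deepset/bert-base-cased-squad2'
model = BertForQuestionAnswering.from_pretrained(modelname)
tokenizer = AutoTokenizer.from_pretrained(modelname)

# Truncate anything beyond the model's 512-token limit at encoding time
inputs = tokenizer.encode_plus(question, context, truncation=True, max_length=512, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# Pick the most likely start/end positions and decode the answer span
start = torch.argmax(outputs.start_logits)
end = torch.argmax(outputs.end_logits) + 1
answer = tokenizer.decode(inputs['input_ids'][0][start:end])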

Add dense layer on top of Huggingface BERT model

I want to add a dense layer on top of the bare BERT Model transformer outputting raw hidden-states, and then fine tune the resulting model. Specifically, I am using this base model. This is what the model should do:
Encode the sentence (a vector with 768 elements for each token of the sentence)
Keep only the first vector (related to the first token)
Add a dense layer on top of this vector, to get the desired transformation
So far, I have successfully encoded the sentences:
from sklearn.neural_network import MLPRegressor
import torch
from transformers import AutoModel, AutoTokenizer
# List of strings
sentences = [...]
# List of numbers
labels = [...]
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
model = AutoModel.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
# 2D array, one line per sentence containing the embedding of the first token
encoded_sentences = torch.stack([model(**tokenizer(s, return_tensors='pt'))[0][0][0]
                                 for s in sentences]).detach().numpy()
regr = MLPRegressor()
regr.fit(encoded_sentences, labels)
In this way I can train a neural network by feeding it with the encoded sentences. However, this approach clearly does not fine tune the base BERT model. Can anybody help me? How can I build a model (possibly in pytorch or using the Huggingface library) that can be entirely fine tuned?
There are two ways to do it. Since you are looking to fine-tune the model for a downstream task similar to classification, you can directly use the BertForSequenceClassification class, which fine-tunes a classification (logistic regression) layer on top of the 768-dimensional output.
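A minimal sketch of this first option (the three-class setup and the example sentence are placeholders, not taken from the question):
from transformers import AutoTokenizer, BertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
# num_labels is a placeholder; the classification head is initialized randomly
# and fine-tuned together with the BERT encoder.
model = BertForSequenceClassification.from_pretrained("dbmdz/bert-base-italian-xxl-cased", num_labels=3)

inputs = tokenizer("Una frase di esempio", return_tensors="pt")
outputs = model(**inputs)  # outputs.logits has shape (1, num_labels)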
Alternatively, you can define a custom module that creates a BERT model from the pre-trained weights and adds your own layers on top of it:
import torch
import torch.nn as nn
from transformers import AutoTokenizer, BertModel

class CustomBERTModel(nn.Module):
    def __init__(self):
        super(CustomBERTModel, self).__init__()
        self.bert = BertModel.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
        ### New layers:
        self.linear1 = nn.Linear(768, 256)
        self.linear2 = nn.Linear(256, 3)  ## 3 is the number of classes in this example

    def forward(self, ids, mask):
        # return_dict=False so the outputs can be unpacked as a tuple
        sequence_output, pooled_output = self.bert(ids, attention_mask=mask, return_dict=False)
        # sequence_output has the following shape: (batch_size, sequence_length, 768)
        linear1_output = self.linear1(sequence_output[:, 0, :].view(-1, 768))  ## extract the 1st token's embeddings
        linear2_output = self.linear2(linear1_output)
        return linear2_output
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
model = CustomBERTModel() # You can pass the parameters if required to have more flexible model
model.to(torch.device("cpu")) ## can be gpu
criterion = nn.CrossEntropyLoss() ## If required define your own criterion
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()))
for epoch in range(epochs):  ## epochs = number of training epochs
    for batch in data_loader:  ## If you have a DataLoader() object to get the data.
        data = batch[0]
        targets = batch[1]  ## assuming that the data loader returns a tuple of data and its targets
        optimizer.zero_grad()
        encoding = tokenizer.batch_encode_plus(data, return_tensors='pt', padding=True, truncation=True, max_length=50, add_special_tokens=True)
        input_ids = encoding['input_ids']
        attention_mask = encoding['attention_mask']
        outputs = model(input_ids, attention_mask)
        # CrossEntropyLoss applies log-softmax internally, so the raw logits are passed directly
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
If you want to tune the BERT model itself you will need to modify the parameters of the model. To do this you will most likely want to do your work with PyTorch. Here is some rough pseudo-code to illustrate:
from torch.optim import SGD
model = ... # whatever model you are using
parameters = model.parameters() # or some more specific set of parameters
optimizer = SGD(parameters,lr=.01) # or whatever optimizer you want
optimizer.zero_grad() # boiler-platy pytorch function
input = ... # whatever the appropriate input for your task is
label = ... # whatever the appropriate label for your task is
loss = model(**input, labels=label)[0]  # usually the loss is the first item returned
loss.backward()  # calculates gradient
optimizer.step()  # runs optimization algorithm
I've left out all the relevant details because they are quite tedious and specific to whatever your specific task is. Huggingface has a nice article walking through this in more detail here, and you will definitely want to refer to some PyTorch documentation as you use any PyTorch stuff. I highly recommend the PyTorch blitz before trying to do anything serious with it.
For anyone using Tensorflow/ Keras the equivalent of Ashwin's answer would be:
from tensorflow import keras
from transformers import AutoTokenizer, TFAutoModel

class CustomBERTModel(keras.Model):
    def __init__(self):
        super(CustomBERTModel, self).__init__()
        self.bert = TFAutoModel.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
        ### New layers:
        self.linear1 = keras.layers.Dense(256)
        self.linear2 = keras.layers.Dense(3)  ## 3 is the number of classes in this example

    def call(self, inputs, training=False):
        # call expects only one positional argument, so you have to pass in a tuple and unpack.
        # The next parameter is a special reserved training parameter.
        ids, mask = inputs
        sequence_output = self.bert(ids, mask, training=training).last_hidden_state
        # sequence_output has the following shape: (batch_size, sequence_length, 768)
        linear1_output = self.linear1(sequence_output[:, 0, :])  ## extract the 1st token's embeddings
        linear2_output = self.linear2(linear1_output)
        return linear2_output

model = CustomBERTModel()
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
ipts = tokenizer("Some input sequence", return_tensors="tf")
test = model((ipts["input_ids"], ipts["attention_mask"]))
Then to train the model you can make a custom training loop using GradientTape.
You can verify that the additional layers are also trainable with model.trainable_weights. You can access the weights of individual layers as well; e.g. model.trainable_weights[-1].numpy() would get the last layer's bias vector. [Note the Dense layers will only appear after the first time the call method is executed.]
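For completeness, a rough sketch of such a GradientTape loop; here train_dataset is a placeholder tf.data.Dataset yielding ((input_ids, attention_mask), labels) batches, and the learning rate and epoch count are arbitrary:
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)

for epoch in range(3):
    for (ids, mask), labels in train_dataset:
        with tf.GradientTape() as tape:
            logits = model((ids, mask), training=True)  # forward pass through CustomBERTModel
            loss = loss_fn(labels, logits)
        grads = tape.gradient(loss, model.trainable_weights)
        optimizer.apply_gradients(zip(grads, model.trainable_weights))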

Loading a converted pytorch model in huggingface transformers properly

I converted a pre-trained tf model to pytorch using the following function.
def convert_tf_checkpoint_to_pytorch(*, tf_checkpoint_path, albert_config_file, pytorch_dump_path):
    # Initialise PyTorch model
    config = AlbertConfig.from_json_file(albert_config_file)
    print("Building PyTorch model from configuration: {}".format(str(config)))
    model = AlbertForPreTraining(config)
    # Load weights from tf checkpoint
    load_tf_weights_in_albert(model, config, tf_checkpoint_path)
    # Save pytorch-model
    print("Save PyTorch model to {}".format(pytorch_dump_path))
    torch.save(model.state_dict(), pytorch_dump_path)
I am loading the converted model and encoding sentences in the following way:
def vectorize_sentence(text):
    albert_tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
    config = AlbertConfig.from_pretrained(config_path, output_hidden_states=True)
    model = TFAlbertModel.from_pretrained(pytorch_dir, config=config, from_pt=True)
    e = albert_tokenizer.encode(text, max_length=512)
    model_input = tf.constant(e)[None, :]  # Batch size 1
    output = model(model_input)
    v = [0] * 768
    # generate sentence vectors by averaging the word vectors
    for i in range(1, len(model_input[0]) - 1):
        v = v + output[0][0][i].numpy()
    vector = v / len(model_input[0])
    return vector
However while loading the model, a warning comes up:
Some weights or buffers of the PyTorch model TFAlbertModel were not
initialized from the TF 2.0 model and are newly initialized:
['predictions.LayerNorm.bias', 'predictions.dense.weight',
'predictions.LayerNorm.weight', 'sop_classifier.classifier.bias',
'predictions.dense.bias', 'sop_classifier.classifier.weight',
'predictions.decoder.bias', 'predictions.bias',
'predictions.decoder.weight'] You should probably TRAIN this model on
a down-stream task to be able to use it for predictions and inference.
Can anyone tell me if I am doing anything wrong? What does the warning mean? I saw issue #5588 on the GitHub repo of Transformers, but I don't know if my issue is the same.
I think you could try using
model = AlbertModel.from_pretrained
instead of
model = TFAlbertModel.from_pretrained
in the vectorize_sentence definition.
AlbertModel is the name of the class for the pytorch format model, and TFAlbertModel is the name of the class for the tensorflow format model.
I'm not sure exactly what load_tf_weights_in_albert() does, but I think that once you have done that your model is in pytorch format.
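Roughly, the suggested change would look like this (a sketch reusing config_path, pytorch_dir and text from the question; it assumes the converted weights are stored where from_pretrained can find them, e.g. as pytorch_model.bin inside pytorch_dir):
from transformers import AlbertConfig, AlbertModel, AlbertTokenizer

albert_tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
config = AlbertConfig.from_pretrained(config_path, output_hidden_states=True)
model = AlbertModel.from_pretrained(pytorch_dir, config=config)  # PyTorch class, no from_pt needed

e = albert_tokenizer.encode(text, max_length=512, return_tensors="pt")
output = model(e)  # output[0] holds the token-level hidden states as PyTorch tensors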

different prediction after load a model in keras

I have a Sequential model built in Keras. After training it gives me good predictions, but when I save and then load the model I don't get the same predictions on the same dataset. Why?
Note that I checked the weights of the model and they are the same, as is the architecture, checked with model.summary() and model.get_weights(). This is very strange in my opinion and I have no idea how to deal with this problem.
I don't get any error, but the predictions are different.
I tried using model.save() and load_model().
I tried using model.save_weights(), then re-building the model and loading the weights.
I have the same problem with both options.
def Classifier(input_shape, word_to_vec_map, word_to_index, emb_dim, num_activation):
    sentence_indices = Input(shape=input_shape, dtype=np.int32)
    emb_dim = 300  # 300-dimensional embeddings of Italian words
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index, emb_dim)
    embeddings = embedding_layer(sentence_indices)
    X = LSTM(256, return_sequences=True)(embeddings)
    X = Dropout(0.15)(X)
    X = LSTM(128)(X)
    X = Dropout(0.15)(X)
    X = Dense(num_activation, activation='softmax')(X)
    model = Model(sentence_indices, X)
    sequentialModel = Sequential(model.layers)
    return sequentialModel
model = Classifier((maxLen,), word_to_vec_map, word_to_index, maxLen, num_activation)
...
model.fit(Y_train_indices, Z_train_oh, epochs=30, batch_size=32, shuffle=True)
# attempt 1
model.save('classificationTest.h5', True, True)
modelRNN = load_model(r'C:\Users\Alessio\classificationTest.h5')
# attempt 2
model.save_weights("myWeight.h5")
model = Classifier((maxLen,), word_to_vec_map, word_to_index, maxLen, num_activation)
model.load_weights(r'C:\Users\Alessio\myWeight.h5')
# PREDICTION TEST
code_train, category_train, category_code_train, text_train = read_csv_for_email(r'C:\Users\Alessio\Desktop\6Febbraio\2test.csv')
categories, code_categories = get_categories(r'C:\Users\Alessio\Desktop\6Febbraio\2test.csv')
X_my_sentences = text_train
Y_my_labels = category_code_train
X_test_indices = sentences_to_indices(X_my_sentences, word_to_index, maxLen)
pred = model.predict(X_test_indices)
def codeToCategory(categories, code_categories, current_code):
    i = 0
    for code in code_categories:
        if code == current_code:
            return categories[i]
        i = i + 1
    return "no_one_find"

# result
for i in range(len(Y_my_labels)):
    num = np.argmax(pred[i])
# Pretrained embedding layer
def pretrained_embedding_layer(word_to_vec_map, word_to_index, emb_dim):
    """
    Creates a Keras Embedding() layer and loads in pre-trained GloVe 50-dimensional vectors.
    Arguments:
    word_to_vec_map -- dictionary mapping words to their GloVe vector representation.
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)
    Returns:
    embedding_layer -- pretrained layer Keras instance
    """
    vocab_len = len(word_to_index) + 1  # adding 1 to fit Keras embedding (requirement)
    ### START CODE HERE ###
    # Initialize the embedding matrix as a numpy array of zeros of shape (vocab_len, dimensions of word vectors = emb_dim)
    emb_matrix = np.zeros((vocab_len, emb_dim))
    # Set each row "index" of the embedding matrix to be the word vector representation of the "index"th word of the vocabulary
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]
    # Define Keras embedding layer with the correct output/input sizes, make it trainable. Use Embedding(...). Make sure to set trainable=False.
    embedding_layer = Embedding(vocab_len, emb_dim)
    ### END CODE HERE ###
    # Build the embedding layer, it is required before setting the weights of the embedding layer. Do not modify the "None".
    embedding_layer.build((None,))
    # Set the weights of the embedding layer to the embedding matrix. Your layer is now pretrained.
    embedding_layer.set_weights([emb_matrix])
    return embedding_layer
Do you have any kind of suggestion?
Thanks in Advance.
Edit1: if I use the saving and loading code in the same "page" (I'm using a Jupyter notebook) it works fine. If I change "page" it doesn't work. Could it be something related to the TensorFlow session?
Edit2: my final goal is to load a model trained in Keras with Deeplearning4J in Java. So if you know a solution for "transforming" the Keras model into something else readable by DL4J, that would help as well.
Edit3: add function pretrained_embedding_layer()
Edit4: dictionaries from word2Vec model read with gensim
from gensim.models import Word2Vec
model = Word2Vec.load('C:/Users/Alessio/Desktop/emoji_ita/embedding/glove_WIKI')
def getMyModels(model):
    word_to_index = dict({})
    index_to_word = dict({})
    word_to_vec_map = dict({})
    for idx, key in enumerate(model.wv.vocab):
        word_to_index[key] = idx
        index_to_word[idx] = key
        word_to_vec_map[key] = model.wv[key]
    return word_to_index, index_to_word, word_to_vec_map
Are you pre-processing your data in the same way when you load your model?
And if so, did you set the seed of your pre-processing functions?
If you build a dictionary with Keras, are the sentences coming in the same order?
I had the same problem before, so here is how you solve it. After making sure that the weights and the summary are the same, try to print your random seed and check. If its value changes from one session to another and you have already tried TensorFlow's seed, it means you need to disable hash randomization via the PYTHONHASHSEED environment variable. You can read more about it here:
https://docs.python.org/3/using/cmdline.html#envvar-PYTHONHASHSEED
To do this, go to your system environment variables and add PYTHONHASHSEED as a new variable if it doesn't exist. Then set its value to 0 to disable hash randomization. Note that it has to be set this way because the variable must be in place before the interpreter starts.
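As a quick sanity check (a hypothetical snippet, not from the original answer): once the variable is set before the interpreter starts, string hashes, and therefore any hash-dependent ordering such as dictionary iteration on older Python versions, become reproducible across runs:
import os

# Should print "0" once PYTHONHASHSEED has been set as described above
print(os.environ.get("PYTHONHASHSEED"))
# With hash randomization disabled this value is identical on every run;
# with randomization enabled it changes between interpreter sessions.
print(hash("example_word"))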
