BERT using transformers' pipeline and encode_plus function - Python

When I use:
modelname = 'deepset/bert-base-cased-squad2'
model = BertForQuestionAnswering.from_pretrained(modelname)
tokenizer = AutoTokenizer.from_pretrained(modelname)
nlp = pipeline('question-answering', model=model, tokenizer=tokenizer)
result = nlp({'question': question,'context': context})
it doesn't crash. However, when I use encode_plus():
modelname = 'deepset/bert-base-cased-squad2'
model = BertForQuestionAnswering.from_pretrained(modelname)
tokenizer = AutoTokenizer.from_pretrained(modelname)
inputs = tokenizer.encode_plus(question, context, return_tensors='pt')
I get this error:
The size of tensor a (629) must match the size of tensor b (512) at non-singleton dimension 1
which I understand, but why don't I get the same error in the first case? Can someone explain the difference?

The reason for the error in the second snippet is that the tokenized input (629 tokens) is longer than the model's maximum sequence length of 512, so the resulting tensor cannot be matched against the position embeddings. To avoid this, set the truncation flag to True when calling the tokenizer, so that anything beyond the maximum length is cut off, i.e.:
inputs = tokenizer.encode_plus(question, context, truncation=True, max_length=512, return_tensors='pt')
There is no problem when using the pipeline, most likely because the question-answering pipeline deals with over-long contexts itself (truncating or splitting them into chunks) before the tensors reach the model.
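For illustration, here is a minimal sketch of the manual flow with truncation enabled (assuming question and context are the same strings passed to the pipeline; the span decoding shown here is a simplified stand-in for what the pipeline does internally):

import torch
from transformers import AutoTokenizer, BertForQuestionAnswering

modelname = 'deepset/bert-base-cased-squad2'
model = BertForQuestionAnswering.from_pretrained(modelname)
tokenizer = AutoTokenizer.from_pretrained(modelname)

# Cut the combined question+context off at the model's 512-token limit
inputs = tokenizer.encode_plus(question, context, truncation=True,
                               max_length=512, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# Pick the most likely start/end positions and decode that span
start = torch.argmax(outputs.start_logits)
end = torch.argmax(outputs.end_logits) + 1
answer = tokenizer.decode(inputs['input_ids'][0][start:end])
print(answer)

Note that if the answer happens to lie in the truncated part of the context, this simple approach cannot find it, which is one reason the pipeline's internal handling of long contexts is more convenient.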

Related

Huggingface Transformers Tensorflow fine-tuned distilgpt2 bad outputs

I fine-tuned a model starting from the 'distilgpt2' checkpoint. I fit the model with the model.fit() method and saved the resulting model with the .save_pretrained() method.
When I use this model to generate text:
import transformers
from transformers import TFAutoModelForCausalLM, AutoTokenizer

original_model = 'distilgpt2'
path2model = 'clm_model_save'
path2tok = 'clm_tokenizer_save'

tuned_model = TFAutoModelForCausalLM.from_pretrained(path2model, from_pt=False)
tuned_tokenizer = AutoTokenizer.from_pretrained(path2tok)

input_context = 'The dog'
input_ids = tuned_tokenizer.encode(input_context, return_tensors='tf')  # encode input context
outputs = tuned_model.generate(input_ids=input_ids,
                               max_length=40,
                               temperature=0.7,
                               num_return_sequences=3,
                               do_sample=True)  # generate 3 candidates using sampling

for i in range(3):  # 3 output sequences were generated
    print(f'Generated {i}: {tuned_tokenizer.decode(outputs[i], skip_special_tokens=True)}')
The model returns the output:
>>>All model checkpoint layers were used when initializing TFGPT2LMHeadModel.
>>>All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at clm_model_save.
>>>If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
>>>Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence
>>>Generated 0: The dog!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>>>Generated 1: The dog!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>>>Generated 2: The dog!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
When I use the original checkpoint, distilgpt2, the model generates text just fine. Is this a sign of some sort of misconfiguration? Or is this simply a sign of a poorly trained model?
I've tried using the original checkpoint tokenizer, manually setting the pad_token_id, using a much longer input context, and changing several parameters for the .generate() method. Same results each time.
Also, I added special tokens to my tuned_tokenizer:
tuned_tokenizer.special_tokens_map
>>>{'bos_token': '<|startoftext|>',
>>> 'eos_token': '<|endoftext|>',
>>> 'unk_token': '<|endoftext|>',
>>> 'pad_token': '<|PAD|>'}
Compared to the original tokenizer:
tokenizer.special_tokens_map
>>> {'bos_token': '<|endoftext|>',
>>> 'eos_token': '<|endoftext|>',
>>> 'unk_token': '<|endoftext|>'}

How to make a Trainer pad inputs in a batch with huggingface-transformers?

I'm trying to train a model using a Trainer; according to the documentation (https://huggingface.co/transformers/master/main_classes/trainer.html#transformers.Trainer) I can specify a tokenizer:
tokenizer (PreTrainedTokenizerBase, optional) – The tokenizer used to
preprocess the data. If provided, will be used to automatically pad
the inputs to the maximum length when batching inputs, and it will be
saved along the model to make it easier to rerun an interrupted
training or reuse the fine-tuned model.
So padding should be handled automatically, but when trying to run it I get this error:
ValueError: Unable to create tensor, you should probably activate
truncation and/or padding with 'padding=True' 'truncation=True' to
have batched tensors with the same length.
The tokenizer is created this way:
tokenizer = BertTokenizerFast.from_pretrained(pretrained_model)
And the Trainer like this:
trainer = Trainer(
    tokenizer=tokenizer,
    model=model,
    args=training_args,
    train_dataset=train,
    eval_dataset=dev,
    compute_metrics=compute_metrics
)
I've tried putting the padding and truncation parameters in the tokenizer, in the Trainer, and in the training_args. Nothing works. Any ideas?
Look at the columns your tokenizer is returning. You might want to limit it to only the required columns.
For example:
def preprocess_function(examples):
    # function to tokenize the dataset
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True, padding=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True, padding=True)

encoded_dataset = dataset.map(preprocess_function, batched=True, load_from_cache_file=False)

# Then restrict the dataset to the columns the model actually needs
columns_to_return = ['input_ids', 'label', 'attention_mask']
encoded_dataset.set_format(type='torch', columns=columns_to_return)
I was able to solve this problem by adding a data collator to the Trainer:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=...,
    eval_dataset=...,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    tokenizer=tokenizer,
    optimizers=(optimizer, None),
)
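As a related note (not part of the original answers): for plain sequence classification the analogous collator is DataCollatorWithPadding, which pads each batch dynamically to the longest example in that batch. A sketch using the names from the question:

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train,
    eval_dataset=dev,
    data_collator=data_collator,
    tokenizer=tokenizer,
)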
I had the same error when one of the inputs to the tokenizer was None. My tokenizer takes two texts at a time (so BERT will add [SEP] between them).

RuntimeError, working on AI, trying to use a pre-trained BERT model

Hi, here is part of my code to use a pre-trained BERT model for classification:
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",         # Use the 12-layer BERT model, with an uncased vocab.
    num_labels=2,                # The number of output labels -- 2 for binary classification.
                                 # You can increase this for multi-class tasks.
    output_attentions=False,     # Whether the model returns attention weights.
    output_hidden_states=False,  # Whether the model returns all hidden states.
)

...

for step, batch in enumerate(train_dataloader):
    b_input_ids = batch[0].to(device)
    b_input_mask = batch[1].to(device)
    b_labels = batch[2].to(device)
    outputs = model(b_input_ids,
                    token_type_ids=None,
                    attention_mask=b_input_mask,
                    labels=b_labels)
but then I receive this error message:
RuntimeError: Expected tensor for argument #1 'indices' to have scalar
type Long;
but got torch.IntTensor instead (while checking arguments for embedding)
So I think I should convert my b_input_ids to a Long tensor but I don't know how to do it.
Thanks a lot in advance for your help, everyone!
Finally succeeded using .to(torch.int64).
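For completeness, a minimal sketch of where that cast fits in the training loop above (assuming the same batch layout; embedding lookups require int64 ('Long') indices, and the classification labels usually need the same dtype):

for step, batch in enumerate(train_dataloader):
    b_input_ids = batch[0].to(device).to(torch.int64)   # cast indices to Long
    b_input_mask = batch[1].to(device)
    b_labels = batch[2].to(device).to(torch.int64)       # labels for the loss must be Long as well
    outputs = model(b_input_ids,
                    token_type_ids=None,
                    attention_mask=b_input_mask,
                    labels=b_labels)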

Different predictions after loading a model in Keras

I have a Sequential model built in Keras, and after training it gives me good predictions, but when I save and then load the model I don't obtain the same predictions on the same dataset. Why?
Note that I checked the weights of the model and they are the same, as is the architecture of the model (checked with model.summary() and model.get_weights()). This is very strange in my opinion and I have no idea how to deal with this problem.
I don't get any errors, but the predictions are different.
I tried using model.save() and load_model().
I tried using model.save_weights(), re-building the model, and then loading the weights.
I have the same problem with both options.
def Classifier(input_shape, word_to_vec_map, word_to_index, emb_dim, num_activation):
    sentence_indices = Input(shape=input_shape, dtype=np.int32)
    emb_dim = 300  # 300-dimensional embeddings of Italian words
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index, emb_dim)
    embeddings = embedding_layer(sentence_indices)
    X = LSTM(256, return_sequences=True)(embeddings)
    X = Dropout(0.15)(X)
    X = LSTM(128)(X)
    X = Dropout(0.15)(X)
    X = Dense(num_activation, activation='softmax')(X)
    model = Model(sentence_indices, X)
    sequentialModel = Sequential(model.layers)
    return sequentialModel
model = Classifier((maxLen,), word_to_vec_map, word_to_index, maxLen, num_activation)
...
model.fit(Y_train_indices, Z_train_oh, epochs=30, batch_size=32, shuffle=True)

# attempt 1
model.save('classificationTest.h5', True, True)
modelRNN = load_model(r'C:\Users\Alessio\classificationTest.h5')

# attempt 2
model.save_weights("myWeight.h5")
model = Classifier((maxLen,), word_to_vec_map, word_to_index, maxLen, num_activation)
model.load_weights(r'C:\Users\Alessio\myWeight.h5')

# PREDICTION TEST
code_train, category_train, category_code_train, text_train = read_csv_for_email(r'C:\Users\Alessio\Desktop\6Febbraio\2test.csv')
categories, code_categories = get_categories(r'C:\Users\Alessio\Desktop\6Febbraio\2test.csv')

X_my_sentences = text_train
Y_my_labels = category_code_train
X_test_indices = sentences_to_indices(X_my_sentences, word_to_index, maxLen)
pred = model.predict(X_test_indices)

def codeToCategory(categories, code_categories, current_code):
    i = 0
    for code in code_categories:
        if code == current_code:
            return categories[i]
        i = i + 1
    return "no_one_find"

# result
for i in range(len(Y_my_labels)):
    num = np.argmax(pred[i])
# Pretrained embedding layer
def pretrained_embedding_layer(word_to_vec_map, word_to_index, emb_dim):
    """
    Creates a Keras Embedding() layer and loads in pre-trained GloVe 50-dimensional vectors.

    Arguments:
    word_to_vec_map -- dictionary mapping words to their GloVe vector representation.
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
    embedding_layer -- pretrained layer Keras instance
    """
    vocab_len = len(word_to_index) + 1  # adding 1 to fit Keras embedding (requirement)

    ### START CODE HERE ###
    # Initialize the embedding matrix as a numpy array of zeros of shape (vocab_len, emb_dim)
    emb_matrix = np.zeros((vocab_len, emb_dim))

    # Set each row "index" of the embedding matrix to be the word vector representation of the "index"th word of the vocabulary
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]

    # Define the Keras embedding layer with the correct input/output sizes using Embedding(...)
    embedding_layer = Embedding(vocab_len, emb_dim)
    ### END CODE HERE ###

    # Build the embedding layer; this is required before setting its weights. Do not modify the "None".
    embedding_layer.build((None,))

    # Set the weights of the embedding layer to the embedding matrix. Your layer is now pretrained.
    embedding_layer.set_weights([emb_matrix])

    return embedding_layer
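As a side remark (not part of the original post): in the Keras versions of that era, a pretrained embedding layer can also be created in one step by passing the matrix to the constructor, which avoids the separate build() and set_weights() calls; trainable=False additionally freezes it, which the code above does not do:

embedding_layer = Embedding(vocab_len, emb_dim,
                            weights=[emb_matrix],   # initialize from the pretrained matrix
                            trainable=False)        # keep the pretrained vectors fixed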
Do you have any suggestions?
Thanks in advance.
Edit 1: if I use the saving and loading code in the same "page" (I'm using a Jupyter notebook) it works fine. If I change "page" it doesn't work. Could it be that there is something related to the TensorFlow session?
Edit 2: my final goal is to load a model trained in Keras with Deeplearning4J in Java. So if you know a solution for "transforming" the Keras model into something else readable by DL4J, that would help as well.
Edit 3: added the function pretrained_embedding_layer().
Edit 4: dictionaries from a Word2Vec model read with gensim:
from gensim.models import Word2Vec

model = Word2Vec.load('C:/Users/Alessio/Desktop/emoji_ita/embedding/glove_WIKI')

def getMyModels(model):
    word_to_index = dict({})
    index_to_word = dict({})
    word_to_vec_map = dict({})
    for idx, key in enumerate(model.wv.vocab):
        word_to_index[key] = idx
        index_to_word[idx] = key
        word_to_vec_map[key] = model.wv[key]
    return word_to_index, index_to_word, word_to_vec_map
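One editorial observation (not from the original post): the loop above relies on the iteration order of model.wv.vocab, so word_to_index can come out differently in another interpreter session if hash randomization changes that order; sorting the vocabulary makes the index assignment deterministic, e.g.:

for idx, key in enumerate(sorted(model.wv.vocab)):
    word_to_index[key] = idx
    index_to_word[idx] = key
    word_to_vec_map[key] = model.wv[key]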
Are you pre-processing your data in the same way when you load your model?
And if yes, did you set the seed of your pre-processing functions?
If you build a dictionary with Keras, are the sentences coming in the same order?
I had the same problem before, so here is how you solve it. After making sure that the weights and summary are the same, try printing your random seed and check it. If its value changes from one session to another and you have already tried TensorFlow's seed, you need to pin the PYTHONHASHSEED environment variable. You can read more about it here:
https://docs.python.org/3/using/cmdline.html#envvar-PYTHONHASHSEED
To do this, go to your system environment variables and add PYTHONHASHSEED as a new variable if it doesn't exist. Then set its value to 0 to disable hash randomization. Please note that it is done this way because the variable has to be in place before the interpreter starts.
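As a hedged illustration of that advice (not from the original answer): besides pinning PYTHONHASHSEED in the system environment, a common recipe also fixes the Python, NumPy, and TensorFlow seeds before building the model:

import os
os.environ['PYTHONHASHSEED'] = '0'   # only fully effective if set before the interpreter starts,
                                     # which is why the answer recommends a system environment variable

import random
import numpy as np
import tensorflow as tf

random.seed(42)
np.random.seed(42)
tf.set_random_seed(42)   # on TensorFlow 2.x this is tf.random.set_seed(42)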

Keras / Tensorflow: Predict Using tf.data.Dataset API

I'm using Keras with a Tensorflow backend for building a model for this problem: https://www.kaggle.com/cfpb/us-consumer-finance-complaints (just practicing).
I train my Keras model using the tf.data.Dataset API. Now, I have a Pandas DataFrame, df_testing, whose columns are complaint (strings) and label (also strings). I want to predict on these new samples. I create a tf.data.Dataset object, perform preprocessing, make an Iterator, and call predict on my model:
data = df_testing["complaint"].values
labels = df_testing["label"].values
dataset = tf.data.Dataset.from_tensor_slices((data))
dataset = dataset.map(lambda x: ({'reviews': x}))
dataset = dataset.batch(self.batch_size).repeat()
dataset = dataset.map(lambda x: self.preprocess_text(x, self.data_table))
dataset = dataset.map(lambda x: x['reviews'])
dataset = dataset.make_initializable_iterator()
My training used a tf.data.Dataset where each element was of the form ({'reviews': "movie was great"}, "positive") so I'm mimicking that here for prediction. Also, my preprocessing just turns my string into a Tensor of integers.
When I call:
preds = model.predict(dataset)
the predict call fails with:
ValueError: When using iterators as input to a model, you should specify the `steps` argument.
So I modify this call to be:
preds = model.predict(dataset, steps=3)
But now I get back:
ValueError: Please provide data as a list or tuple of 2 elements - input and target pair. Received Tensor("IteratorGetNext_2:0", shape=(?, 100), dtype=int32)
What am I doing incorrectly here? I shouldn't have to provide a tuple of 2 elements when predicting (I shouldn't need the label).
Thanks for any help you can offer!
What version of Keras are you on? I cannot find that specific error message in the code base, but I think I found where it used to be.
Here's the error in a version of the code that I think is close to the version you're running: commit
And here's the updated version of that error: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/engine/training_eager.py#L464
The conditions of the input validation have changed (in the newest version your input would be accepted), but what's relevant is that the error message is much more clear:
raise ValueError(
    'Please provide data as a list or tuple of 1, 2, or 3 elements '
    ' - `(input)`, or `(input, target)`, or `(input, target,'
    'sample_weights)`. Received %s. We do not use the `target` or'
    '`sample_weights` value here.' % inputs.output_shapes)
The target value is never used in the predict function, and so can be anything. Looking at the rest of the function, next_element[1] is never used.
[TLDR] Using your current version, add a dummy target value to the data, or update your Keras.
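A minimal sketch of that workaround, assuming the pipeline shown in the question (the map would be applied before make_initializable_iterator() is called; the zero tensor is just a throwaway target that predict() never looks at):

# Pair each input batch with a dummy target so the old input validation
# sees an (input, target) tuple
dataset = dataset.map(lambda x: (x, tf.zeros_like(x)))
preds = model.predict(dataset, steps=3)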
The following code worked for me (tested on tensorflow 1.10.0):
[TLDR] Just pass an empty dictionary as a dummy input and specify the number of steps:
model.predict(x={},steps=4)
Full code:
import numpy as np
import tensorflow as tf
from tensorflow.data import Dataset
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model
# dummy data:
x = np.arange(4).reshape(-1, 1).astype('float32')
y = np.arange(5, 9).reshape(-1, 1).astype('float32')
# build the Datasets
ds_x = Dataset.from_tensor_slices(x).repeat().batch(4)
it_x = ds_x.make_one_shot_iterator()
ds_y = Dataset.from_tensor_slices(y).repeat().batch(4)
it_y = ds_y.make_one_shot_iterator()
# build compile and train the model
input_vals = Input(tensor=it_x.get_next())
output = Dense(1, activation='relu')(input_vals)
model = Model(inputs=input_vals, outputs=output)
model.compile('rmsprop', 'mse', target_tensors=[it_y.get_next()])
model.fit(steps_per_epoch=1, epochs=5, verbose=2)
# infer using the dataset
model.predict(x={},steps=4)
