How to revert BERT/XLNet embeddings? - python

I've been experimenting with stacking language models recently and noticed something interesting: the output embeddings of BERT and XLNet are not the same as the input embeddings. For example, this code snippet:
bert = transformers.BertForMaskedLM.from_pretrained("bert-base-cased")
tok = transformers.BertTokenizer.from_pretrained("bert-base-cased")
sent = torch.tensor(tok.encode("I went to the store the other day, it was very rewarding."))
enc = bert.get_input_embeddings()(sent)
dec = bert.get_output_embeddings()(enc)
print(tok.decode(dec.softmax(-1).argmax(-1)))
Outputs this for me:
,,,,,,,,,,,,,,,,,
I would have expected the (formatted) input sequence to be returned since I was under the impression that the input and output token embeddings were tied.
What's interesting is that most other models do not exhibit this behavior. For example, if you run the same code snippet on GPT2, Albert or Roberta, it outputs the input sequence.
Is this a bug? Or is it expected for BERT/XLNet?

Not sure if it's too late, but I've experimented a bit with your code and it can be reverted. :)
bert = transformers.BertForMaskedLM.from_pretrained("bert-base-cased")
tok = transformers.BertTokenizer.from_pretrained("bert-base-cased")
sent = torch.tensor(tok.encode("I went to the store the other day, it was very rewarding."))
print("Initial sentence:", sent)
enc = bert.get_input_embeddings()(sent)
dec = bert.get_output_embeddings()(enc)
print("Decoded sentence:", tok.decode(dec.softmax(0).argmax(1)))
For this, you get the following output:
Initial sentence: tensor([ 101, 146, 1355, 1106, 1103, 2984, 1103, 1168, 1285, 117,
1122, 1108, 1304, 10703, 1158, 119, 102])
Decoded sentence: [CLS] I went to the store the other day, it was very rewarding. [SEP]
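If you want to double-check the weight tying the question assumes, a quick sanity check along these lines (not part of the original answer, just a sketch) should work; note that the MLM output head also carries its own bias term, which the input embedding does not have:
in_emb = bert.get_input_embeddings().weight
out_emb = bert.get_output_embeddings().weight
print(in_emb.data_ptr() == out_emb.data_ptr())  # True when the two matrices are tied
print(bert.get_output_embeddings().bias is not None)  # the output head adds a bias on top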

Related

Using past and attention_mask at the same time for gpt2

I am processing a batch of sentences with different lengths, so I am planning to take advantage of the padding + attention_mask functionality in gpt2 for that.
At the same time, for each sentence I need to add a suffix phrase and run N different inferences. For instance, given the sentence "I like to drink coke", I may need to run two different inferences: "I like to drink coke. Coke is good" and "I like to drink coke. Drink is good". To improve the inference time I am trying to use the "past" functionality (https://huggingface.co/transformers/quickstart.html#using-the-past), so that I process the original sentence (e.g. "I like to drink coke") only once and then somehow expand the result so it can be used with the two other sentences: "Coke is good" and "Drink is good".
Below is a simple piece of code that shows how I was trying to do this. For simplicity I'm just adding a single suffix phrase per sentence (but I still hope my original idea is possible):
from transformers.tokenization_gpt2 import GPT2Tokenizer
from transformers.modeling_gpt2 import GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2', pad_token='<|endoftext|>')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Complete phrases are: "I like to drink soda without sugar" and "Go watch TV alone, I am not going"
docs = ["I like to drink soda", "Go watch TV"]
docs_tensors = tokenizer.batch_encode_plus(
    [d for d in docs], pad_to_max_length=True, return_tensors='pt')

docs_next = ["without sugar", "alone, I am not going"]
docs_next_tensors = tokenizer.batch_encode_plus(
    [d for d in docs_next], pad_to_max_length=True, return_tensors='pt')

# predicting the first part of each phrase
_, past = model(docs_tensors['input_ids'], attention_mask=docs_tensors['attention_mask'])

# predicting the rest of the phrase
logits, _ = model(docs_next_tensors['input_ids'], attention_mask=docs_next_tensors['attention_mask'], past=past)
logits = logits[:, -1]
_, top_indices_results = logits.topk(30)
The error I am getting is the following:
Traceback (most recent call last):
  File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/pydevd.py", line 1434, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/damiox/Workspace/xxLtd/yy/stress-test-withpast2.py", line 26, in <module>
    logits, _ = model(docs_next_tensors['input_ids'], attention_mask=docs_next_tensors['attention_mask'], past=past)
  File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/transformers/modeling_gpt2.py", line 593, in forward
    inputs_embeds=inputs_embeds,
  File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/transformers/modeling_gpt2.py", line 476, in forward
    hidden_states, layer_past=layer_past, attention_mask=attention_mask, head_mask=head_mask[i]
  File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/transformers/modeling_gpt2.py", line 226, in forward
    self.ln_1(x), layer_past=layer_past, attention_mask=attention_mask, head_mask=head_mask
  File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/transformers/modeling_gpt2.py", line 189, in forward
    attn_outputs = self._attn(query, key, value, attention_mask, head_mask)
  File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/transformers/modeling_gpt2.py", line 150, in _attn
    w = w + attention_mask
RuntimeError: The size of tensor a (11) must match the size of tensor b (6) at non-singleton dimension 3
Process finished with exit code 1
Initially I thought this was related to https://github.com/huggingface/transformers/issues/3031, so I rebuilt the latest master to try the fix, but I still experience the issue.
In order to make your current code snippet work, you will have to combine the previous and the new attention mask as follows:
from transformers.tokenization_gpt2 import GPT2Tokenizer
from transformers.modeling_gpt2 import GPT2LMHeadModel
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2', pad_token='<|endoftext|>')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Complete phrases are: "I like to drink soda without sugar" and "Go watch TV alone, I am not going"
docs = ["I like to drink soda", "Go watch TV"]
docs_tensors = tokenizer.batch_encode_plus(
    [d for d in docs], pad_to_max_length=True, return_tensors='pt')

docs_next = ["without sugar", "alone, I am not going"]
docs_next_tensors = tokenizer.batch_encode_plus(
    [d for d in docs_next], pad_to_max_length=True, return_tensors='pt')

# predicting the first part of each phrase
_, past = model(docs_tensors['input_ids'], attention_mask=docs_tensors['attention_mask'])

# predicting the rest of the phrase
attn_mask = torch.cat([docs_tensors['attention_mask'], docs_next_tensors['attention_mask']], dim=-1)
logits, _ = model(docs_next_tensors['input_ids'], attention_mask=attn_mask, past=past)
logits = logits[:, -1]
_, top_indices_results = logits.topk(30)
For the case where you want to test two possible suffixes for one sentence start, you will probably have to clone your past variable as many times as you have suffixes. That means the batch size of your prefix input_ids has to match the batch size of your suffix input_ids in order to make it work; see the sketch below.
Also, you have to shift the positional encoding input of your suffix input_ids (GPT2 uses absolute positional encodings) if one of your prefix sequences is padded (this is not shown in the code above - please take a look at https://github.com/huggingface/transformers/issues/3021 to see how it's done).
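A minimal sketch of that cloning step, assuming the tuple-of-tensors past format of the transformers version used in the question (each element shaped (2, batch, heads, seq_len, head_dim)); suffix_tensors is a hypothetical batch holding all tokenized suffixes, ordered suffix-by-suffix per original sentence:
n_suffixes = 2  # number of suffixes to score per original sentence
# repeat each cached layer along the batch dimension (dim=1 in this format)
expanded_past = tuple(p.repeat_interleave(n_suffixes, dim=1) for p in past)
# the prefix attention mask has to be expanded the same way before concatenating
expanded_prefix_mask = docs_tensors['attention_mask'].repeat_interleave(n_suffixes, dim=0)
attn_mask = torch.cat([expanded_prefix_mask, suffix_tensors['attention_mask']], dim=-1)
logits, _ = model(suffix_tensors['input_ids'], attention_mask=attn_mask, past=expanded_past)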

How can I use a list of lists, or a list of sets, for the TfidfVectorizer?

I'm using the sklearn TfidfVectorizer for text classification.
I know this vectorizer wants raw text as input, but using a list works (see input1).
However, if I want to use multiple lists (or sets) I get the following AttributeError.
Does anyone know how to tackle this problem? Thanks in advance!
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=1, stop_words="english")
input1 = ["This", "is", "a", "test"]
input2 = [["This", "is", "a", "test"], ["It", "is", "raining", "today"]]
print(vectorizer.fit_transform(input1)) #works
print(vectorizer.fit_transform(input2)) #gives Attribute error
input 1:
(3, 0) 1.0
input 2:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 1381, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 869, in fit_transform
    self.fixed_vocabulary_)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 792, in _count_vocab
    for feature in analyze(doc):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 266, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 232, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'
Note that input1 works, but it considers each element of the list (string) as a different document to vectorize.
In the case of input2, I assume you want to vectorize each "sentence" (sublist). One solution is the following list comprehension:
input2_corrected = [" ".join(x) for x in input2]
which produces
['This is a test', 'It is raining today']
which does not yield the AttributeError anymore.
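Putting it together, a short sketch (hypothetical, not from the original answer) of fitting the vectorizer on the joined sentences:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=1, stop_words="english")
input2 = [["This", "is", "a", "test"], ["It", "is", "raining", "today"]]

# join each token list back into one string so every sublist becomes one document
input2_corrected = [" ".join(x) for x in input2]
X = vectorizer.fit_transform(input2_corrected)

print(X.shape)                 # one row per sentence
print(vectorizer.vocabulary_)  # the learned vocabulary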

ValueError: too many values to unpack (expected 2)

I'm doing sentiment analysis using the NLTK naive bayes classifier. I'm just passing in a csv file that contains words and their labels as the training set and am not testing it yet. I find the sentiment of each sentence and then the average of the sentiments of all sentences at the end. My file contains words in the format:
good,0.6
amazing,0.95
great,0.8
awesome,0.95
love,0.7
like,0.5
better,0.4
beautiful,0.6
bad,-0.6
worst,-0.9
hate,-0.8
sad,-0.4
disappointing,-0.6
angry,-0.7
happy,0.7
But training on the file fails and the above-mentioned error shows up. Here's my Python code:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from nltk.classify.api import ClassifierI

operators = set(('not', 'never', 'no'))
stop_words = set(stopwords.words("english")) - operators

text = "this restaurant is good but i hate it ."
sent = 0.0
x = 0
text2 = ""
xyz = []
dot = 0

if "but" in text:
    i = text.find("but")
    text = text[:i] + "." + text[i+3:]
if "whereas" in text:
    i = text.find("whereas")
    text = text[:i] + "." + text[i+7:]
if "while" in text:
    i = text.find("while")
    text = text[:i] + "." + text[i+5:]

a = open('C:/Users/User/train_words.csv', 'r')

for w in text.split():
    if w in stop_words:
        continue
    else:
        text2 = text2 + " " + w

print(text2)

cl = nltk.NaiveBayesClassifier.train(a)

xyz = sent_tokenize(text2)
print(xyz)

for s in xyz:
    x = x + 1
    print(s)
    if "not" in s or "n't" in s:
        print(float(cl.classify(s)) * -1)
        sent = sent + (float(cl.classify(s)) * -1)
    else:
        print(cl.classify(s))
        sent = sent + float(cl.classify(s))

print("sentiment of the overall document:", sent / x)
error:
runfile('C:/Users/User/Documents/untitled1.py', wdir='C:/Users/User/Documents')
restaurant good . hate .
Traceback (most recent call last):
  File "<ipython-input-8-d03fac6844c7>", line 1, in <module>
    runfile('C:/Users/User/Documents/untitled1.py', wdir='C:/Users/User/Documents')
  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
    execfile(filename, namespace)
  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/User/Documents/untitled1.py", line 37, in <module>
    cl = nltk.NaiveBayesClassifier.train(a)
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\classify\naivebayes.py", line 194, in train
    for featureset, label in labeled_featuresets:
ValueError: too many values to unpack (expected 2)
If I am not wrong, train() takes a list of tuples and you are providing a file object.
Instead of this
a = open('C:/Users/User/train_words.csv', 'r')
try this
a = open('C:/Users/User/train_words.csv', 'r').read()  # this is a string
a_list = a.split('\n')
a_list_of_tuple = [tuple(x.split(',')) for x in a_list]
and pass the a_list_of_tuple variable to train().
Hope this will help :)
From the doc:
def train(cls, labeled_featuresets, estimator=ELEProbDist):
    """
    :param labeled_featuresets: A list of classified featuresets,
        i.e., a list of tuples ``(featureset, label)``.
    """
So you can write something like this:
feature_set = [line.strip().split(',')[::-1] for line in open('filename').readlines()]
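For reference, a minimal, hypothetical sketch (not from either answer) of the shape NaiveBayesClassifier.train() actually expects - a list of (feature dict, label) tuples - built from the same csv file:
import nltk

labeled_featuresets = []
with open('C:/Users/User/train_words.csv') as f:
    for line in f:
        if not line.strip():
            continue
        word, score = line.strip().split(',')
        # each item is (featureset, label): a dict of features plus a label
        labeled_featuresets.append(({'word': word}, score))

cl = nltk.NaiveBayesClassifier.train(labeled_featuresets)
print(cl.classify({'word': 'good'}))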

How to shuffle words in word2vec [duplicate]

This question already has answers here:
Why does random.shuffle return None?
(5 answers)
Closed 5 months ago.
I have this piece of code:
import gensim
import random

file = open('../../../dataset/output/interaction_jobroles_titles_tags.txt')
read_data = file.read()
data = read_data.split('\n')
sentences = [line.split() for line in data]
print(len(sentences))
print(sentences[1])

model = gensim.models.Word2Vec(min_count=1, window=10, size=300, negative=5)
model.build_vocab(sentences)

for epoch in range(5):
    shuffled_sentences = random.shuffle(sentences)
    model.train(shuffled_sentences)
    print(epoch)
    print(model)

model.save("../../../dataset/output/wordvectors_jobroles_titles_300d_10w_wordshuffling" + '.model')
If I print a single sentence, the output is something like this:
['JO_3787672', 'JO_272304', 'JO_2027410', 'TI_2969041', 'TI_2509936', 'TA_954638', 'TA_4321623', 'TA_339347', 'TA_272304', 'TA_3017535', 'TA_494116', 'TA_798840']
What I need is to shuffle the words before training and then save the model.
I am not sure whether I am coding it the right way. I end up with this exception:
Exception in thread Thread-8:
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.5/site-packages/gensim/models/word2vec.py", line 747, in job_producer
    for sent_idx, sentence in enumerate(sentences):
  File "/usr/local/lib/python3.5/site-packages/gensim/utils.py", line 668, in __iter__
    for document in self.corpus:
TypeError: 'NoneType' object is not iterable
I would like to ask how I can shuffle the words.
random.shuffle shuffles the list in place and returns None. For this reason shuffled_sentences is None after this call.
model.build_vocab(sentences)

sentences_list = sentences
Idx = range(len(sentences_list))
print(Idx)

for epoch in range(5):
    random.shuffle(sentences)
    perm_sentences = [sentences_list[i] for i in Idx]
    model.train(perm_sentences)
    print(epoch)
    print(model)

model.save("somefile.model")
This solves my problem. But how can I shuffle the individual words within a sentence?
Sentence:
['JO_3787672', 'JO_272304', 'JO_2027410', 'TI_2969041', 'TI_2509936', 'TA_954638', 'TA_4321623', 'TA_339347', 'TA_272304', 'TA_3017535', 'TA_494116', 'TA_798840']
My objective is: if I check the most similar words for, let's say, 'JO_3787672', then every time it predicts words starting with 'JO_', and the words starting with 'TA_' and 'TI_' have a much lower similarity score.
I suspect this is because of the positions of the words in the data (I am not sure). That is why I am trying to shuffle the words within each sentence (I am really not sure whether it helps or not).
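A minimal sketch of word-level shuffling, following the old gensim API used in the question (newer gensim versions also require total_examples and epochs arguments for train()); whether it actually changes the similarity behaviour is a separate question:
for epoch in range(5):
    # shuffle the words inside every sentence, in place
    for sentence in sentences:
        random.shuffle(sentence)
    # optionally also shuffle the order of the sentences
    random.shuffle(sentences)
    model.train(sentences)
    print(epoch)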

nltk stemming and stop words for naive bayes

I'm trying to understand why using stemming and stop word removal results in worse results in my naive bayes classifier.
I have two files, positive and negative reviews, both of which have around 200 lines but with many words, possibly with 5000 words per line.
I have the following code that creates a bag of words; then I create two feature sets for training and testing and run them against the nltk classifier:
word_features = list(all_words.keys())[:15000]
testing_set = featuresets[10000:]
training_set = featuresets[:10000]
nbclassifier = nltk.NaiveBayesClassifier.train(training_set)
print((nltk.classify.accuracy(nbclassifier, testing_set))*100)
nbclassifier.show_most_informative_features(30)
This produces around 45000 words and has an accuracy of 85%.
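(The code that builds featuresets is not shown in the question; in setups like this it is usually something along the lines of the following hypothetical sketch, where documents is a list of (review_words, label) pairs.)
def find_features(document_words):
    words = set(document_words)
    # mark, for every word in word_features, whether it occurs in this document
    return {w: (w in words) for w in word_features}

featuresets = [(find_features(review), category) for (review, category) in documents]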
I've looked at adding stemming (PorterStemmer) and removing stop words from my training data, but when I run the classifier again I now get 205 words and 0% accuracy, and while testing other classifiers the script generates errors:
Traceback (most recent call last):
  File "foo.py", line 108, in <module>
    print((nltk.classify.accuracy(MNB_classifier, testing_set))*100)
  File "/Library/Python/2.7/site-packages/nltk/classify/util.py", line 87, in accuracy
    results = classifier.classify_many([fs for (fs, l) in gold])
  File "/Library/Python/2.7/site-packages/nltk/classify/scikitlearn.py", line 83, in classify_many
    X = self._vectorizer.transform(featuresets)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 293, in transform
    return self._transform(X, fitting=False)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 184, in _transform
    raise ValueError("Sample sequence X is empty.")
ValueError: Sample sequence X is empty.
I don't understand why adding stemming and/or removing stop words breaks the classifier.
Adding stemming or removing stop words should not by itself cause your issue. I think you have an issue further up in your code, due to how you read the files. When I was following sentdex's tutorial on YouTube, I came across this same error. I was stuck for the past hour, but I finally got it. If you follow his code you get this:
short_pos = open("short_reviews/positive.txt", "r").read()
short_neg = open("short_reviews/negative.txt", "r").read()

documents = []
for r in short_pos.split('\n'):
    documents.append((r, 'pos'))
for r in short_neg.split('\n'):
    documents.append((r, 'neg'))

all_words = []
short_pos_words = word_tokenize(short_pos)
short_neg_words = word_tokenize(short_neg)

for w in short_pos_words:
    all_words.append(w.lower())
for w in short_neg_words:
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:5000]
I kept running into this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 6056: invalid start byte.
You get this error because there are non UTF-8 characters in the files provided. I was able to get around the error by changing the code to this:
fname = 'short_reviews/positive.txt'
with open(fname, 'r', encoding='utf-16') as f:
    for line in f:
        pos_lines.append(line)
Unfortunately, then I started getting this error:
UnicodeError: UTF-16 stream does not start with BOM
I forget how, but I made this error go away too. Then I started getting the same error as your original question:
ValueError: Sample sequence X is empty.
When I printed the length of featuresets, I saw it was only 2.
print("Feature sets list length : ", len(featuresets))
After digging on this site, I found these two questions:
Delete every non utf-8 symbols froms string
'str' object has no attribute 'decode' in Python3
The first one didn't really help, but the second one solved my problem (Note: I'm using python-3).
I'm not one for one liners, but this worked for me:
pos_lines = [line.rstrip('\n') for line in open('short_reviews/positive.txt', 'r', encoding='ISO-8859-1')]
I will update my github repo later this week with the full code for the nlp tutorial if you'd like to see the complete solution. I realize this answer probably comes 2 years too late, but hopefully it helps.
