This question already has answers here:
Why does random.shuffle return None?
(5 answers)
Closed 5 months ago.
I have this piece of code:
import gensim
import random

file = open('../../../dataset/output/interaction_jobroles_titles_tags.txt')
read_data = file.read()
data = read_data.split('\n')
sentences = [line.split() for line in data]
print(len(sentences))
print(sentences[1])

model = gensim.models.Word2Vec(min_count=1, window=10, size=300, negative=5)
model.build_vocab(sentences)

for epoch in range(5):
    shuffled_sentences = random.shuffle(sentences)
    model.train(shuffled_sentences)
    print(epoch)

print(model)
model.save("../../../dataset/output/wordvectors_jobroles_titles_300d_10w_wordshuffling" + '.model')
If I print a single sentence, the output is something like this:
['JO_3787672', 'JO_272304', 'JO_2027410', 'TI_2969041', 'TI_2509936', 'TA_954638', 'TA_4321623', 'TA_339347', 'TA_272304', 'TA_3017535', 'TA_494116', 'TA_798840']
What I need is to shuffle the words before training and then save the model.
I am not sure whether I am coding it the right way. I end up with this exception:
Exception in thread Thread-8:
Traceback (most recent call last):
File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.5/site-packages/gensim/models/word2vec.py", line 747, in job_producer
for sent_idx, sentence in enumerate(sentences):
File "/usr/local/lib/python3.5/site-packages/gensim/utils.py", line 668, in __iter__
for document in self.corpus:
TypeError: 'NoneType' object is not iterable
I would like to ask how I can shuffle the words.
random.shuffle shuffles the list in place and returns None. For this reason, your shuffled_sentences variable is None after the call.
model.build_vocab(sentences)

sentences_list = sentences
Idx = range(len(sentences_list))
print(Idx)

for epoch in range(5):
    random.shuffle(sentences)
    perm_sentences = [sentences_list[i] for i in Idx]
    model.train(perm_sentences)
    print(epoch)

print(model)
model.save("somefile.model")
This solves my problem.
But how can I shuffle the individual words within a sentence?
Sentence:
['JO_3787672', 'JO_272304', 'JO_2027410', 'TI_2969041', 'TI_2509936', 'TA_954638', 'TA_4321623', 'TA_339347', 'TA_272304', 'TA_3017535', 'TA_494116', 'TA_798840']
My objective is:
If I check the most similar words for, let's say, 'JO_3787672', then every time it predicts words starting with 'JO_', while the words starting with 'TA_' and 'TI_' get a much lower similarity score.
I suspect this is because of the positions of the words in the data (I am not sure). That is why I am trying to shuffle the words within each sentence (I am really not sure whether it helps or not).
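To shuffle the words inside each sentence (rather than the order of the sentences), one option is to shuffle each inner list in place before every pass. A minimal sketch, assuming sentences is the list of token lists built above:

import random

for epoch in range(5):
    # shuffle the tokens of every sentence in place; random.shuffle returns None,
    # so we keep passing the (now reordered) sentences themselves to train()
    for sentence in sentences:
        random.shuffle(sentence)
    model.train(sentences)
    print(epoch)

Note that with window=10 and sentences of roughly a dozen tokens, most words already fall inside each other's context window, so shuffling within a sentence may change relatively little.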
I am processing a batch of sentences with different lengths, so I am planning to take advantage of the padding + attention_mask functionality in gpt2 for that.
At the same time, for each sentence I need to add a suffix phrase and run N different inferences. For instance, given the sentence "I like to drink coke", I may need to run two different inferences: "I like to drink coke. Coke is good" and "I like to drink coke. Drink is good". Thus, I am trying to improve the inference time for this by using the "past" functionality: https://huggingface.co/transformers/quickstart.html#using-the-past so I just process the original sentence (e.g. "I like to drink coke") once, and then I somehow expand the result to be able to be used with two other sentences: "Coke is good" and "Drink is good".
Below you will find a simple snippet that represents how I was trying to do this. For simplicity I'm just adding a single suffix phrase per sentence (but I still hope my original idea is possible):
from transformers.tokenization_gpt2 import GPT2Tokenizer
from transformers.modeling_gpt2 import GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2', pad_token='<|endoftext|>')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Complete phrases are: "I like to drink soda without sugar" and "Go watch TV alone, I am not going"
docs = ["I like to drink soda", "Go watch TV"]
docs_tensors = tokenizer.batch_encode_plus(
    [d for d in docs], pad_to_max_length=True, return_tensors='pt')

docs_next = ["without sugar", "alone, I am not going"]
docs_next_tensors = tokenizer.batch_encode_plus(
    [d for d in docs_next], pad_to_max_length=True, return_tensors='pt')

# predicting the first part of each phrase
_, past = model(docs_tensors['input_ids'], attention_mask=docs_tensors['attention_mask'])

# predicting the rest of the phrase
logits, _ = model(docs_next_tensors['input_ids'], attention_mask=docs_next_tensors['attention_mask'], past=past)
logits = logits[:, -1]
_, top_indices_results = logits.topk(30)
The error I am getting is the following:
Traceback (most recent call last):
File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/pydevd.py", line 1434, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/Users/damiox/Workspace/xxLtd/yy/stress-test-withpast2.py", line 26, in <module>
logits, _ = model(docs_next_tensors['input_ids'], attention_mask=docs_next_tensors['attention_mask'], past=past)
File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/transformers/modeling_gpt2.py", line 593, in forward
inputs_embeds=inputs_embeds,
File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/transformers/modeling_gpt2.py", line 476, in forward
hidden_states, layer_past=layer_past, attention_mask=attention_mask, head_mask=head_mask[i]
File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/transformers/modeling_gpt2.py", line 226, in forward
self.ln_1(x), layer_past=layer_past, attention_mask=attention_mask, head_mask=head_mask
File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/transformers/modeling_gpt2.py", line 189, in forward
attn_outputs = self._attn(query, key, value, attention_mask, head_mask)
File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/transformers/modeling_gpt2.py", line 150, in _attn
w = w + attention_mask
RuntimeError: The size of tensor a (11) must match the size of tensor b (6) at non-singleton dimension 3
Process finished with exit code 1
Initially I thought this was related to https://github.com/huggingface/transformers/issues/3031, so I rebuilt the latest master to try the fix, but I still experience the issue.
In order to make your current code snippet work, you will have to combine the previous and new attention masks as follows:
from transformers.tokenization_gpt2 import GPT2Tokenizer
from transformers.modeling_gpt2 import GPT2LMHeadModel
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2', pad_token='<|endoftext|>')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Complete phrases are: "I like to drink soda without sugar" and "Go watch TV alone, I am not going"
docs = ["I like to drink soda", "Go watch TV"]
docs_tensors = tokenizer.batch_encode_plus(
    [d for d in docs], pad_to_max_length=True, return_tensors='pt')

docs_next = ["without sugar", "alone, I am not going"]
docs_next_tensors = tokenizer.batch_encode_plus(
    [d for d in docs_next], pad_to_max_length=True, return_tensors='pt')

# predicting the first part of each phrase
_, past = model(docs_tensors['input_ids'], attention_mask=docs_tensors['attention_mask'])

# predicting the rest of the phrase
attn_mask = torch.cat([docs_tensors['attention_mask'], docs_next_tensors['attention_mask']], dim=-1)
logits, _ = model(docs_next_tensors['input_ids'], attention_mask=attn_mask, past=past)
logits = logits[:, -1]
_, top_indices_results = logits.topk(30)
For the case where you want to test two possible suffixes for one sentence start, you will probably have to clone your past variable as many times as you have suffixes. That means that the batch size of your prefix input_ids has to match the batch size of your suffix input_ids in order to make it work.
You also have to adjust the positional encodings of your suffix input_ids (GPT2 uses absolute positional encodings) if one of your prefix input_ids is padded (this is not shown in the code above; please take a look at https://github.com/huggingface/transformers/issues/3021 to see how it's done).
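A rough sketch of that cloning step, assuming two suffix variants per original sentence and that past has the layout used by this transformers version (a list of per-layer tensors shaped (2, batch, num_heads, seq_len, head_dim), i.e. the batch dimension is dim 1):

import torch

n_suffixes = 2  # hypothetical: number of suffix variants per original sentence

# repeat every prefix entry n_suffixes times along the batch dimension
expanded_past = [layer_past.repeat_interleave(n_suffixes, dim=1) for layer_past in past]

# the prefix attention mask has to be expanded the same way (its batch dimension is dim 0)
# before being concatenated with the suffix mask as shown above
expanded_prefix_mask = docs_tensors['attention_mask'].repeat_interleave(n_suffixes, dim=0)

The suffix input_ids batch then has to list all suffixes for the first sentence, followed by all suffixes for the second, so that it lines up with the repeated past entries.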
I have a word2vec model using the pre-trained GoogleNews-vectors-negative300.bin. The model works fine and I can get the similarity between two words. For example:
word2vec.similarity('culture','friendship')
0.2732939
Now, I want to use list elements instead of the words. For example, suppose that I have a list named "tag", and the first two elements in its first row are culture and friendship. So tag[0,0] = culture and tag[0,1] = friendship.
I use the following code which gives me an error:
word2vec.similarity(tag[0,0],tag[0,1])
the "tag" list is a numpy.ndarray
the error is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\s\AppData\Local\Programs\Python6436\Python36\lib\site-packages\gensim\models\keyedvectors.py", line 992, in similarity
return dot(matutils.unitvec(self[w1]), matutils.unitvec(self[w2]))
File "C:\Users\s\AppData\Local\Programs\Python6436\Python36\lib\site-packages\gensim\models\keyedvectors.py", line 337, in __getitem__
return self.get_vector(entities)
File "C:\Users\s\AppData\Local\Programs\Python6436\Python36\lib\site-packages\gensim\models\keyedvectors.py", line 455, in get_vector
return self.word_vec(word)
File "C:\Users\s\AppData\Local\Programs\Python6436\Python36\lib\site-packages\gensim\models\keyedvectors.py", line 452, in word_vec
raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word ' friendship' not in vocabulary"
I think there are some leading spaces in your word ' friendship'.
Could you try this:
word2vec.similarity(tag[0,0].strip(),tag[0,1].strip())
If tag is, as your question says, a Python list, then the problem is that you cannot index a list with a tuple.
If your list is like [["culture","friendship"], [...], ...]
then you should write word2vec.similarity(tag[0][0], tag[0][1])
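A small sketch combining both suggestions, with a hypothetical tag array standing in for the real data:

import gensim
import numpy as np

# word2vec loaded as in the question
word2vec = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

# hypothetical stand-in for the real "tag" data; note the stray leading spaces
tag = np.array([[' culture', ' friendship']])

# strip whitespace element-wise so the lookups match the vocabulary
tag = np.char.strip(tag)

# an ndarray accepts tuple indexing...
print(word2vec.similarity(tag[0, 0], tag[0, 1]))

# ...whereas a plain list of lists needs chained indexing
tag_list = tag.tolist()
print(word2vec.similarity(tag_list[0][0], tag_list[0][1]))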
I'm doing sentiment analysis using the Naive Bayes classifier from NLTK. I'm just inserting a CSV file that contains words and their labels as a training set, and not testing it yet. I find the sentiment of each sentence and then the average sentiment of all sentences at the end. My file contains words in this format:
good,0.6
amazing,0.95
great,0.8
awesome,0.95
love,0.7
like,0.5
better,0.4
beautiful,0.6
bad,-0.6
worst,-0.9
hate,-0.8
sad,-0.4
disappointing,-0.6
angry,-0.7
happy,0.7
But the file doesn't get trained on, and the error shown below comes up. Here's my Python code:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from nltk.classify.api import ClassifierI

operators = set(('not', 'never', 'no'))
stop_words = set(stopwords.words("english")) - operators

text = "this restaurant is good but i hate it ."
sent = 0.0
x = 0
text2 = ""
xyz = []
dot = 0

if "but" in text:
    i = text.find("but")
    text = text[:i] + "." + text[i+3:]
if "whereas" in text:
    i = text.find("whereas")
    text = text[:i] + "." + text[i+7:]
if "while" in text:
    i = text.find("while")
    text = text[:i] + "." + text[i+5:]

a = open('C:/Users/User/train_words.csv', 'r')

for w in text.split():
    if w in stop_words:
        continue
    else:
        text2 = text2 + " " + w
print(text2)

cl = nltk.NaiveBayesClassifier.train(a)

xyz = sent_tokenize(text2)
print(xyz)

for s in xyz:
    x = x + 1
    print(s)
    if "not" in s or "n't" in s:
        print(float(cl.classify(s)) * -1)
        sent = sent + (float(cl.classify(s)) * -1)
    else:
        print(cl.classify(s))
        sent = sent + float(cl.classify(s))

print("sentiment of the overall document:", sent/x)
error:
runfile('C:/Users/User/Documents/untitled1.py', wdir='C:/Users/User/Documents')
restaurant good . hate .
Traceback (most recent call last):
File "<ipython-input-8-d03fac6844c7>", line 1, in <module>
runfile('C:/Users/User/Documents/untitled1.py', wdir='C:/Users/User/Documents')
File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
execfile(filename, namespace)
File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/User/Documents/untitled1.py", line 37, in <module>
cl = nltk.NaiveBayesClassifier.train(a)
File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\classify\naivebayes.py", line 194, in train
for featureset, label in labeled_featuresets:
ValueError: too many values to unpack (expected 2)
If I am not wrong, train() takes a list of tuples, and you are providing a file object.
Instead of this
a = open('C:/Users/User/train_words.csv','r')
Try this
a = open('C:/Users/User/train_words.csv','r').read() # this is string
a_list = a.split('\n')
a_list_of_tuple = [tuple(x.split(',')) for x in a_list]
and pass the a_list_of_tuple variable to train().
Hope this will help :)
From the doc:
def train(cls, labeled_featuresets, estimator=ELEProbDist):
"""
:param labeled_featuresets: A list of classified featuresets,
i.e., a list of tuples ``(featureset, label)``.
"""
So you can write something similar:
feature_set = [tuple(line.strip().split(',')[::-1]) for line in open('filename').readlines()]
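For reference, NaiveBayesClassifier.train() expects each featureset to be a dict mapping feature names to values, with the label as the second element of the tuple. A minimal sketch of building such pairs from the word,score rows above (the 'word' feature name is just an illustrative choice):

import nltk

labeled_featuresets = []
with open('C:/Users/User/train_words.csv', 'r') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        word, score = line.split(',')
        # each training item is ({feature_name: value}, label)
        labeled_featuresets.append(({'word': word}, score))

cl = nltk.NaiveBayesClassifier.train(labeled_featuresets)

# classify() likewise takes a featureset dict, not a raw string
print(cl.classify({'word': 'good'}))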
I am using the text.similar('example') function from the nltk.Text module.
(It prints the similar words for a given word based on the corpus.)
However, I want to store those words in a list, but the function itself returns None.
#text is a variable of nltk.Text module
simList = text.similar("physics")
>>> a = text.similar("physics")
the and a in science this which it that energy his of but chemistry is
space mathematics theory as mechanics
>>> a
>>> a
# a contains no value.
So should I modify the source function itself? I don't think that is good practice. How can I override that function so that it returns the value?
Edit - Referring to this thread, I tried using the ContextIndex class, but I am getting the following error.
File "test.py", line 39, in <module>
text = nltk.text.ContextIndex(word.lower() for word in words) File "/home/kenden/den/codes/nlpenv/local/lib/python2.7/site-packages/nltk/text.py", line 56, in __init__
for i, w in enumerate(tokens)) File "/home/kenden/den/codes/nlpenv/local/lib/python2.7/site-packages/nltk/probability.py", line 1752, in __init__
for (cond, sample) in cond_samples: File "/home/kenden/den/codes/nlpenv/local/lib/python2.7/site-packages/nltk/text.py", line 56, in <genexpr>
for i, w in enumerate(tokens)) File "/home/kenden/den/codes/nlpenv/local/lib/python2.7/site-packages/nltk/text.py", line 43, in _default_context
right = (tokens[i+1].lower() if i != len(tokens) - 1 else '*END*') TypeError: object of type 'generator' has no len()
This is my line 39 of test.py
text = nltk.text.ContextIndex(word.lower() for word in words)
How can I solve this?
You are getting the error because the ContextIndex constructor is trying to take the len() of your token list (the argument tokens). But you actually pass it as a generator, hence the error. To avoid the problem, just pass a true list, e.g.:
text = nltk.text.ContextIndex(list(word.lower() for word in words))
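As a follow-up, ContextIndex can also hand the similar words back as a value instead of printing them; a small sketch with a toy token list standing in for the question's corpus:

import nltk

# toy stand-in for the real `words` token list built from the corpus
words = "the physics of energy and the chemistry of energy".split()

idx = nltk.text.ContextIndex([word.lower() for word in words])

# similar_words() returns a plain Python list, unlike Text.similar(), which only prints
sim_list = idx.similar_words("physics")
print(sim_list)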
I'm trying to understand why using stemming and removing stop words results in worse results for my naive bayes classifier.
I have two files, positive and negative reviews, both of which have around 200 lines but many words, possibly 5000 words per line.
I have the following code that creates a bag of words, then creates two feature sets for training and testing, and then runs them against the nltk classifier:
word_features = list(all_words.keys())[:15000]
testing_set = featuresets[10000:]
training_set = featuresets[:10000]
nbclassifier = nltk.NaiveBayesClassifier.train(training_set)
print((nltk.classify.accuracy(nbclassifier, testing_set))*100)
nbclassifier.show_most_informative_features(30)
This produces around 45000 words and has an accuracy of 85%.
I've looked at adding stemming (PorterStemmer) and removing stop words in my training data, but when I run the classifier again I now get 205 words and 0% accuracy, and while testing other classifiers the script generates errors:
Traceback (most recent call last):
File "foo.py", line 108, in <module>
print((nltk.classify.accuracy(MNB_classifier, testing_set))*100)
File "/Library/Python/2.7/site-packages/nltk/classify/util.py", line 87, in accuracy
results = classifier.classify_many([fs for (fs, l) in gold])
File "/Library/Python/2.7/site-packages/nltk/classify/scikitlearn.py", line 83, in classify_many
X = self._vectorizer.transform(featuresets)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 293, in transform
return self._transform(X, fitting=False)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 184, in _transform
raise ValueError("Sample sequence X is empty.")
ValueError: Sample sequence X is empty.
I don't understand why adding stemming and/or removing stop words breaks the classifier.
Adding stemming or removing stop words is not what causes your issue. I think you have an issue further up in your code, due to how you read the file. When I was following sentdex's tutorial on YouTube, I came across this same error. I was stuck for an hour, but I finally got it. If you follow his code you get this:
short_pos = open("short_reviews/positive.txt", "r").read()
short_neg = open("short_reviews/negative.txt", "r").read()

documents = []

for r in short_pos.split('\n'):
    documents.append((r, 'pos'))
for r in short_neg.split('\n'):
    documents.append((r, 'neg'))

all_words = []

short_pos_words = word_tokenize(short_pos)
short_neg_words = word_tokenize(short_neg)

for w in short_pos_words:
    all_words.append(w.lower())
for w in short_neg_words:
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:5000]
I kept running into this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 6056: invalid start byte.
You get this error because there are non-UTF-8 characters in the files provided. I was able to get around the error by changing the code to this:
pos_lines = []
fname = 'short_reviews/positive.txt'
with open(fname, 'r', encoding='utf-16') as f:
    for line in f:
        pos_lines.append(line)
Unfortunately, then I started getting this error:
UnicodeError: UTF-16 stream does not start with BOM
I forget how, but I made this error go away too. Then I started getting the same error as your original question:
ValueError: Sample sequence X is empty.
When I printed the length of featuresets, I saw it was only 2.
print("Feature sets list length : ", len(featuresets))
After digging on this site, I found these two questions:
Delete every non utf-8 symbols froms string
'str' object has no attribute 'decode' in Python3
The first one didn't really help, but the second one solved my problem (note: I'm using Python 3).
I'm not one for one-liners, but this worked for me:
pos_lines = [line.rstrip('\n') for line in open('short_reviews/positive.txt', 'r', encoding='ISO-8859-1')]
I will update my GitHub repo later this week with the full code for the NLP tutorial if you'd like to see the complete solution. I realize this answer probably comes 2 years too late, but hopefully it helps.
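Putting it together, a compact sketch that reads both review files with an explicit encoding (ISO-8859-1, as in the one-liner above; errors='replace' is an alternative if the encoding is uncertain):

def read_lines(path, encoding='ISO-8859-1'):
    # read a review file line by line without tripping over non-UTF-8 bytes
    with open(path, 'r', encoding=encoding) as f:
        return [line.rstrip('\n') for line in f]

pos_lines = read_lines('short_reviews/positive.txt')
neg_lines = read_lines('short_reviews/negative.txt')

documents = [(line, 'pos') for line in pos_lines] + [(line, 'neg') for line in neg_lines]
print("Number of documents:", len(documents))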