ValueError: too many values to unpack (expected 2) - Python

I'm doing sentiment analysis using NLTK's naive Bayes classifier. I'm just passing in a CSV file that contains words and their labels as the training set, without testing it yet. I find the sentiment of each sentence and then take the average sentiment over all sentences at the end. My file contains words in the format:
good,0.6
amazing,0.95
great,0.8
awesome,0.95
love,0.7
like,0.5
better,0.4
beautiful,0.6
bad,-0.6
worst,-0.9
hate,-0.8
sad,-0.4
disappointing,-0.6
angry,-0.7
happy,0.7
But the file doesn't get trained and the above-mentioned error shows up. Here's my Python code:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from nltk.classify.api import ClassifierI

operators=set(('not','never','no'))
stop_words=set(stopwords.words("english"))-operators

text="this restaurant is good but i hate it ."
sent=0.0
x=0
text2=""
xyz=[]
dot=0

if "but" in text:
    i=text.find("but")
    text=text[:i]+"."+text[i+3:]
if "whereas" in text:
    i=text.find("whereas")
    text=text[:i]+"."+text[i+7:]
if "while" in text:
    i=text.find("while")
    text=text[:i]+"."+text[i+5:]

a=open('C:/Users/User/train_words.csv','r')

for w in text.split():
    if w in stop_words:
        continue
    else:
        text2=text2+" "+w

print (text2)

cl=nltk.NaiveBayesClassifier.train(a)

xyz=sent_tokenize(text2)
print(xyz)

for s in xyz:
    x=x+1
    print(s)
    if "not" in s or "n't" in s:
        print(float(cl.classify(s))*-1)
        sent=sent+(float(cl.classify(s))*-1)
    else:
        print(cl.classify(s))
        sent=sent+float(cl.classify(s))

print("sentiment of the overall document:",sent/x)
error:
runfile('C:/Users/User/Documents/untitled1.py', wdir='C:/Users /User/Documents')
restaurant good . hate .
Traceback (most recent call last):
File "<ipython-input-8-d03fac6844c7>", line 1, in <module>
runfile('C:/Users/User/Documents/untitled1.py', wdir='C:/Users/User/Documents')
File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
execfile(filename, namespace)
File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/User/Documents/untitled1.py", line 37, in <module>
cl = nltk.NaiveBayesClassifier.train(a)
File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\classify\naivebayes.py", line 194, in train
for featureset, label in labeled_featuresets:
ValueError: too many values to unpack (expected 2)

If I am not wrong, train() takes a list of tuples, but you are providing a file object.
Instead of this
a = open('C:/Users/User/train_words.csv','r')
Try this
a = open('C:/Users/User/train_words.csv','r').read() # this is string
a_list = a.split('\n')
a_list_of_tuple = [tuple(x.split(',')) for x in a_list]
and pass the a_list_of_tuple variable to train().
Hope this will help :)
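One extra detail worth noting: NLTK's NaiveBayesClassifier expects the featureset half of each tuple to be a dict mapping feature names to values, not a bare string. Here is a minimal sketch of that, reusing the path from the question and a hypothetical single 'word' feature:

import nltk
from nltk.classify import NaiveBayesClassifier

train_set = []
with open('C:/Users/User/train_words.csv', 'r') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, label = line.split(',')
        # Each training example is a (featureset, label) tuple,
        # where the featureset is a dict.
        train_set.append(({'word': word}, label))

cl = NaiveBayesClassifier.train(train_set)

# Classification then has to use the same feature shape; labels come back as the
# strings from the file (e.g. '0.6'), which the question converts to float.
print(cl.classify({'word': 'good'}))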

From the doc:
def train(cls, labeled_featuresets, estimator=ELEProbDist):
    """
    :param labeled_featuresets: A list of classified featuresets,
        i.e., a list of tuples ``(featureset, label)``.
    """
So you can write something similar:
feature_set = [line.strip().split(',')[::-1] for line in open('filename').readlines()]

Using past and attention_mask at the same time for gpt2

I am processing a batch of sentences with different lengths, so I am planning to take advantage of the padding + attention_mask functionality in gpt2 for that.
At the same time, for each sentence I need to add a suffix phrase and run N different inferences. For instance, given the sentence "I like to drink coke", I may need to run two different inferences: "I like to drink coke. Coke is good" and "I like to drink coke. Drink is good". Thus, I am trying to improve the inference time by using the "past" functionality (https://huggingface.co/transformers/quickstart.html#using-the-past), so that I process the original sentence (e.g. "I like to drink coke") only once and then somehow expand the result so it can be used with the two other sentences: "Coke is good" and "Drink is good".
Below is a simple piece of code that tries to show how I was attempting this. For simplicity I'm just adding a single suffix phrase per sentence (but I still hope my original idea is possible):
from transformers.tokenization_gpt2 import GPT2Tokenizer
from transformers.modeling_gpt2 import GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2', pad_token='<|endoftext|>')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Complete phrases are: "I like to drink soda without sugar" and "Go watch TV alone, I am not going"
docs = ["I like to drink soda", "Go watch TV"]
docs_tensors = tokenizer.batch_encode_plus(
    [d for d in docs], pad_to_max_length=True, return_tensors='pt')

docs_next = ["without sugar", "alone, I am not going"]
docs_next_tensors = tokenizer.batch_encode_plus(
    [d for d in docs_next], pad_to_max_length=True, return_tensors='pt')

# predicting the first part of each phrase
_, past = model(docs_tensors['input_ids'], attention_mask=docs_tensors['attention_mask'])

# predicting the rest of the phrase
logits, _ = model(docs_next_tensors['input_ids'], attention_mask=docs_next_tensors['attention_mask'], past=past)
logits = logits[:, -1]
_, top_indices_results = logits.topk(30)
The error I am getting is the following:
Traceback (most recent call last):
File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/pydevd.py", line 1434, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/Users/damiox/Workspace/xxLtd/yy/stress-test-withpast2.py", line 26, in <module>
logits, _ = model(docs_next_tensors['input_ids'], attention_mask=docs_next_tensors['attention_mask'], past=past)
File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/transformers/modeling_gpt2.py", line 593, in forward
inputs_embeds=inputs_embeds,
File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/transformers/modeling_gpt2.py", line 476, in forward
hidden_states, layer_past=layer_past, attention_mask=attention_mask, head_mask=head_mask[i]
File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/transformers/modeling_gpt2.py", line 226, in forward
self.ln_1(x), layer_past=layer_past, attention_mask=attention_mask, head_mask=head_mask
File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/transformers/modeling_gpt2.py", line 189, in forward
attn_outputs = self._attn(query, key, value, attention_mask, head_mask)
File "/Users/damiox/.local/share/virtualenvs/yy-uMxmjV2h/lib/python3.7/site-packages/transformers/modeling_gpt2.py", line 150, in _attn
w = w + attention_mask
RuntimeError: The size of tensor a (11) must match the size of tensor b (6) at non-singleton dimension 3
Process finished with exit code 1
Initially I thought this was related to https://github.com/huggingface/transformers/issues/3031 - so I re-built latest master to try the fix, but I still experience the issue.
In order to make your current code snippet work, you will have to combine the previous and new attention masks as follows:
from transformers.tokenization_gpt2 import GPT2Tokenizer
from transformers.modeling_gpt2 import GPT2LMHeadModel
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2', pad_token='<|endoftext|>')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Complete phrases are: "I like to drink soda without sugar" and "Go watch TV alone, I am not going"
docs = ["I like to drink soda", "Go watch TV"]
docs_tensors = tokenizer.batch_encode_plus(
    [d for d in docs], pad_to_max_length=True, return_tensors='pt')

docs_next = ["without sugar", "alone, I am not going"]
docs_next_tensors = tokenizer.batch_encode_plus(
    [d for d in docs_next], pad_to_max_length=True, return_tensors='pt')

# predicting the first part of each phrase
_, past = model(docs_tensors['input_ids'], attention_mask=docs_tensors['attention_mask'])

# predicting the rest of the phrase
attn_mask = torch.cat([docs_tensors['attention_mask'], docs_next_tensors['attention_mask']], dim=-1)
logits, _ = model(docs_next_tensors['input_ids'], attention_mask=attn_mask, past=past)
logits = logits[:, -1]
_, top_indices_results = logits.topk(30)
For the case where you want to test two possible suffixes for one sentence start, you will probably have to clone your past variable as many times as you have suffixes (see the sketch below). That means that the batch size of your prefix input_ids has to match the batch size of your suffix input_ids in order to make it work.
Also, you have to change the positional encoding input of your suffix input_ids (GPT-2 uses absolute positional encodings) if one of your prefix input_ids is padded (this is not shown in the code above - please take a look at https://github.com/huggingface/transformers/issues/3021 to see how it's done).
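A rough sketch of that cloning step, assuming a transformers 2.x-style past in which each layer's tensor has the batch dimension at index 1 (shape roughly (2, batch, heads, seq_len, head_dim)); n_suffixes is a hypothetical count, and docs_next_tensors is assumed to now hold n_suffixes suffixes per original sentence, grouped so that all suffixes of docs[0] come first:

import torch

n_suffixes = 2  # hypothetical: number of suffixes tried per original sentence

# Repeat every prefix entry n_suffixes times so the prefix batch lines up with
# the suffix batch. In the past tensors the batch dimension is 1, in the
# attention mask it is 0.
expanded_past = [layer_past.repeat_interleave(n_suffixes, dim=1) for layer_past in past]
expanded_prefix_mask = docs_tensors['attention_mask'].repeat_interleave(n_suffixes, dim=0)

# Combine the expanded prefix mask with the suffix mask along the sequence axis,
# then run the suffix batch on top of the expanded past.
attn_mask = torch.cat([expanded_prefix_mask, docs_next_tensors['attention_mask']], dim=-1)
logits, _ = model(docs_next_tensors['input_ids'], attention_mask=attn_mask, past=expanded_past)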

How to use a list in word2vec.similarity

I have a word2vec model using the pre-trained GoogleNews-vectors-negative300.bin. The model works fine and I can get the similarity between two words. For example:
word2vec.similarity('culture','friendship')
0.2732939
Now, I want to use list elements instead of the words. For example, suppose I have a list named "tag", and the first two elements in its first row are culture and friendship, so tag[0,0] = culture and tag[0,1] = friendship.
I use the following code, which gives me an error:
word2vec.similarity(tag[0,0],tag[0,1])
the "tag" list is a numpy.ndarray
the error is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\s\AppData\Local\Programs\Python6436\Python36\lib\site-packages\gensim\models\keyedvectors.py", line 992, in similarity
return dot(matutils.unitvec(self[w1]), matutils.unitvec(self[w2]))
File "C:\Users\s\AppData\Local\Programs\Python6436\Python36\lib\site-packages\gensim\models\keyedvectors.py", line 337, in __getitem__
return self.get_vector(entities)
File "C:\Users\s\AppData\Local\Programs\Python6436\Python36\lib\site-packages\gensim\models\keyedvectors.py", line 455, in get_vector
return self.word_vec(word)
File "C:\Users\s\AppData\Local\Programs\Python6436\Python36\lib\site-packages\gensim\models\keyedvectors.py", line 452, in word_vec
raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word ' friendship' not in vocabulary"
I think there are some leading spaces in your word ' friendship'.
Could you try this:
word2vec.similarity(tag[0,0].strip(),tag[0,1].strip())
If tag is, as your question says, a Python list, then the problem is that you cannot index a list with a tuple.
If your list looks like [["culture","friendship"],[...]...],
then you should write word2vec.similarity(tag[0][0], tag[0][1])
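Since the traceback shows the lookup went through gensim on a numpy.ndarray (so tag[0,0] indexing is valid) and the KeyError contains a leading space, the likely fix is simply stripping whitespace from the entries. A minimal sketch, assuming the tags were loaded from a comma-separated text file with a hypothetical name tags.csv:

import numpy as np
from gensim.models import KeyedVectors

# Load the pre-trained vectors referenced in the question.
word2vec = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# Hypothetical loading step: spaces after the commas end up inside the strings.
tag = np.loadtxt('tags.csv', dtype=str, delimiter=',')

# Strip stray spaces before the lookup so 'friendship' (not ' friendship') is used.
print(word2vec.similarity(tag[0, 0].strip(), tag[0, 1].strip()))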

How can I use a list of lists, or a list of sets, for the TfidfVectorizer?

I'm using the sklearn TfidfVectorizer for text classification.
I know this vectorizer wants raw text as input, but using a list works (see input1).
However, if I want to use multiple lists (or sets), I get the following AttributeError.
Does anyone know how to tackle this problem? Thanks in advance!
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=1, stop_words="english")
input1 = ["This", "is", "a", "test"]
input2 = [["This", "is", "a", "test"], ["It", "is", "raining", "today"]]
print(vectorizer.fit_transform(input1)) #works
print(vectorizer.fit_transform(input2)) #gives Attribute error
input 1:
(3, 0) 1.0
input 2:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 1381, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 869, in fit_transform
    self.fixed_vocabulary_)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 792, in _count_vocab
    for feature in analyze(doc):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 266, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 232, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'
Note that input1 works, but it considers each element of the list (string) as a different document to vectorize.
In the case of input2, I assume you want to vectorize each "sentence" (sublist). One solution is to use the following list comprehension syntax:
input2_corrected = [" ".join(x) for x in input2]
which produces
['This is a test', 'It is raining today']
which does not yield the AttributeError anymore.
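Putting it together, here is a small end-to-end sketch of that fix (the variable names mirror the question):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=1, stop_words="english")

input2 = [["This", "is", "a", "test"], ["It", "is", "raining", "today"]]

# Join each tokenized sentence back into a single string, one document per sublist.
input2_corrected = [" ".join(x) for x in input2]

X = vectorizer.fit_transform(input2_corrected)
print(vectorizer.get_feature_names())  # e.g. ['raining', 'test', 'today']
print(X.shape)                         # (2, number_of_terms_kept)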

Python .words issue?

OK, so I'm trying to create a program that tells me how positive or negative each line of the paulryan.txt file is. I'm using the opinion_lexicon, and the file is an '_io.TextIOWrapper'.
Is there something I can use instead of .words?
Another, less important problem: any ideas how to make my WHOLE paulryan.txt file lowercase while keeping it tokenized by line? I'm thinking it won't give me an accurate positive or negative score if I don't make the whole thing lowercase, because there are only lowercase words in the opinion_lexicon.
import nltk
from nltk.corpus import opinion_lexicon
from nltk.tokenize.simple import (LineTokenizer, line_tokenize)

poswords = set(opinion_lexicon.words("positive-words.txt"))
negwords = set(opinion_lexicon.words("negative-words.txt"))

f=open("paulryan.txt", "rU")
raw = f.read()
token= nltk.line_tokenize(raw)
print(token)

def finddemons():
    for x in token:
        y = token.words()
        percpos = len([w for w in token if w in poswords ]) / len(y)
        percneg = len([w for w in token if w in negwords ]) / len(y)
        print(x, "pos:", round(percpos, 3), "neg:", round(percneg, 3))

finddemons()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 3, in finddemons
AttributeError: 'list' object has no attribute 'words'
I suggest you read the file line by line. Then, use word_tokenize:
for line in f:
    tokens = word_tokenize(line)
You are right about lowercasing the text before searching in the lexicon:
for line in f:
    tokens = word_tokenize(line.lower())
You could even try to lemmatize the tokens using WordNet, because the opinion lexicon is not that rich in vocabulary, especially if you use tweets, where words often appear in different forms.
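Putting the pieces together, a minimal sketch of the line-by-line scoring loop (assuming the paulryan.txt file from the question and that the opinion_lexicon corpus has been downloaded):

from nltk.corpus import opinion_lexicon
from nltk.tokenize import word_tokenize

poswords = set(opinion_lexicon.words("positive-words.txt"))
negwords = set(opinion_lexicon.words("negative-words.txt"))

with open("paulryan.txt") as f:
    for line in f:
        # Lowercase before tokenizing so the tokens match the all-lowercase lexicon.
        tokens = word_tokenize(line.lower())
        if not tokens:
            continue  # skip empty lines
        percpos = len([w for w in tokens if w in poswords]) / len(tokens)
        percneg = len([w for w in tokens if w in negwords]) / len(tokens)
        print(line.strip(), "pos:", round(percpos, 3), "neg:", round(percneg, 3))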

How to shuffle words in word2vec [duplicate]

This question already has answers here:
Why does random.shuffle return None?
(5 answers)
Closed 5 months ago.
I have this piece of code:
import gensim
import random

file = open('../../../dataset/output/interaction_jobroles_titles_tags.txt')
read_data = file.read()
data = read_data.split('\n')
sentences = [line.split() for line in data]
print(len(sentences))
print(sentences[1])

model = gensim.models.Word2Vec(min_count=1, window=10, size=300, negative=5)
model.build_vocab(sentences)

for epoch in range(5):
    shuffled_sentences = random.shuffle(sentences)
    model.train(shuffled_sentences)
    print(epoch)
    print(model)

model.save("../../../dataset/output/wordvectors_jobroles_titles_300d_10w_wordshuffling" + '.model')
If I print a single sentence, the output is something like this:
['JO_3787672', 'JO_272304', 'JO_2027410', 'TI_2969041', 'TI_2509936', 'TA_954638', 'TA_4321623', 'TA_339347', 'TA_272304', 'TA_3017535', 'TA_494116', 'TA_798840']
What I need is to shuffle the words before training and then save the model.
I am not sure whether I am coding it the right way. I end up with an exception:
Exception in thread Thread-8:
Traceback (most recent call last):
File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.5/site-packages/gensim/models/word2vec.py", line 747, in job_producer
for sent_idx, sentence in enumerate(sentences):
File "/usr/local/lib/python3.5/site-packages/gensim/utils.py", line 668, in __iter__
for document in self.corpus:
TypeError: 'NoneType' object is not iterable
I would like to ask how I can shuffle the words.
random.shuffle shuffles the list in place and returns None. For this reason, your shuffled_sentences is None after this call.
model.build_vocab(sentences)

sentences_list = sentences
Idx = range(len(sentences_list))
print(Idx)

for epoch in range(5):
    random.shuffle(sentences)
    perm_sentences = [sentences_list[i] for i in Idx]
    model.train(perm_sentences)
    print(epoch)
    print(model)

model.save("somefile.model")
This solves my problem. But how can I shuffle the individual words in a sentence?
Sentence:
['JO_3787672', 'JO_272304', 'JO_2027410', 'TI_2969041', 'TI_2509936', 'TA_954638', 'TA_4321623', 'TA_339347', 'TA_272304', 'TA_3017535', 'TA_494116', 'TA_798840']
My objective is this: if I check the most similar words for, let's say, ['JO_3787672'], then every time it predicts words starting with 'JO_', and the words starting with 'TA_' and 'TI_' get a much lower similarity score.
I suspect this is because of the words' positions in the data (I am not sure). That is why I am trying to shuffle the words within a sentence (I am really not sure whether it helps or not).
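A minimal sketch of shuffling the words inside each sentence before every training pass, assuming the same gensim version as in the question (where model.train(sentences) can be called without extra arguments). This only reorders tokens within a line; whether it actually removes the position effect is the open question above:

import random

import gensim

# A tiny stand-in for the sentences list built from the text file in the question.
sentences = [
    ['JO_3787672', 'TI_2969041', 'TA_954638'],
    ['JO_272304', 'TI_2509936', 'TA_4321623'],
]

model = gensim.models.Word2Vec(min_count=1, window=10, size=300, negative=5)
model.build_vocab(sentences)

for epoch in range(5):
    # Shuffle the order of the sentences in place...
    random.shuffle(sentences)
    # ...and the order of the words inside each sentence, also in place.
    for sentence in sentences:
        random.shuffle(sentence)
    model.train(sentences)
    print(epoch)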
