How to use a list in word2vec.similarity - python

I have a word2vec model using the pre-trained GoogleNews-vectors-negative300.bin. The model works fine and I can get the similarity between two words. For example:
word2vec.similarity('culture','friendship')
0.2732939
Now I want to use list elements instead of literal words. For example, suppose I have a list named "tag" whose first two elements in the first row are culture and friendship, so tag[0,0] = culture and tag[0,1] = friendship.
I use the following code which gives me an error:
word2vec.similarity(tag[0,0],tag[0,1])
the "tag" list is a numpy.ndarray
the error is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\s\AppData\Local\Programs\Python6436\Python36\lib\site-packages\gensim\models\keyedvectors.py", line 992, in similarity
return dot(matutils.unitvec(self[w1]), matutils.unitvec(self[w2]))
File "C:\Users\s\AppData\Local\Programs\Python6436\Python36\lib\site-packages\gensim\models\keyedvectors.py", line 337, in __getitem__
return self.get_vector(entities)
File "C:\Users\s\AppData\Local\Programs\Python6436\Python36\lib\site-packages\gensim\models\keyedvectors.py", line 455, in get_vector
return self.word_vec(word)
File "C:\Users\s\AppData\Local\Programs\Python6436\Python36\lib\site-packages\gensim\models\keyedvectors.py", line 452, in word_vec
raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word ' friendship' not in vocabulary"

I think there is a leading space in your word ' friendship' (note the space inside the quotes in the KeyError message). Could you try this:
word2vec.similarity(tag[0,0].strip(),tag[0,1].strip())
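If tag is a numpy array of fixed-width strings, you can also strip every element at once with numpy's vectorized string functions; a minimal sketch, assuming tag holds plain strings (not object dtype):
import numpy as np
tag = np.char.strip(tag)  # remove leading/trailing whitespace from every element
word2vec.similarity(tag[0,0], tag[0,1])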

If tag is, as your question says, a Python list, then the problem is that you cannot index a list with a tuple.
If your list is like [["culture","friendship"],[...]...],
then you should write word2vec.similarity(tag[0][0], tag[0][1])
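To illustrate the difference (a minimal sketch; the sample data is made up):
tag = [["culture", "friendship"]]
tag[0][0]    # 'culture' - chained indexing works on nested lists
# tag[0, 0]  # TypeError: list indices must be integers or slices, not tuple

import numpy as np
tag = np.array([["culture", "friendship"]])
tag[0, 0]    # 'culture' - tuple indexing works on a 2-D ndarray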

Related

How can I use a list of lists, or a list of sets, for the TfidfVectorizer?

I'm using the sklearn TfidfVectorizer for text-classification.
I know this vectorizer wants raw text as input, but using a list works (see input1).
However, if I want to use multiple lists (or sets) I get the following AttributeError.
Does anyone know how to tackle this problem? Thanks in advance!
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=1, stop_words="english")
input1 = ["This", "is", "a", "test"]
input2 = [["This", "is", "a", "test"], ["It", "is", "raining", "today"]]
print(vectorizer.fit_transform(input1)) #works
print(vectorizer.fit_transform(input2)) #gives Attribute error
input 1:
(3, 0) 1.0
input 2:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 1381, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 869, in fit_transform
    self.fixed_vocabulary_)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 792, in _count_vocab
    for feature in analyze(doc):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 266, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 232, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'
Note that input1 works, but it considers each element of the list (string) as a different document to vectorize.
In the case of input2, I assume you want to vectorize each "sentence" (sublist). One solution is to join each sublist into a single string with a list comprehension:
input2_corrected = [" ".join(x) for x in input2]
which produces
['This is a test', 'It is raining today']
which does not yield the AttributeError anymore.
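To complete the fix, a short usage sketch building on the snippet above:
print(vectorizer.fit_transform(input2_corrected))  # now each joined sentence is vectorized as one document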

ValueError : too many values to unpack (expected 2)

I'm doing sentiment analysis using the Naive Bayes classifier from NLTK. I'm just passing in a CSV file that contains words and their labels as the training set, without testing it yet. I find the sentiment of each sentence and then average the sentiments of all sentences at the end. My file contains words in the format:
good,0.6
amazing,0.95
great,0.8
awesome,0.95
love,0.7
like,0.5
better,0.4
beautiful,0.6
bad,-0.6
worst,-0.9
hate,-0.8
sad,-0.4
disappointing,-0.6
angry,-0.7
happy,0.7
But the file doesn't get trained and the above mentioned error shows up. Here's my python code:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from nltk.classify.api import ClassifierI

operators = set(('not', 'never', 'no'))
stop_words = set(stopwords.words("english")) - operators
text = "this restaurant is good but i hate it ."
sent = 0.0
x = 0
text2 = ""
xyz = []
dot = 0
if "but" in text:
    i = text.find("but")
    text = text[:i] + "." + text[i+3:]
if "whereas" in text:
    i = text.find("whereas")
    text = text[:i] + "." + text[i+7:]
if "while" in text:
    i = text.find("while")
    text = text[:i] + "." + text[i+5:]
a = open('C:/Users/User/train_words.csv', 'r')
for w in text.split():
    if w in stop_words:
        continue
    else:
        text2 = text2 + " " + w
print(text2)
cl = nltk.NaiveBayesClassifier.train(a)
xyz = sent_tokenize(text2)
print(xyz)
for s in xyz:
    x = x + 1
    print(s)
    if "not" in s or "n't" in s:
        print(float(cl.classify(s)) * -1)
        sent = sent + (float(cl.classify(s)) * -1)
    else:
        print(cl.classify(s))
        sent = sent + float(cl.classify(s))
print("sentiment of the overall document:", sent / x)
error:
runfile('C:/Users/User/Documents/untitled1.py', wdir='C:/Users/User/Documents')
restaurant good . hate .
Traceback (most recent call last):
File "<ipython-input-8-d03fac6844c7>", line 1, in <module>
runfile('C:/Users/User/Documents/untitled1.py', wdir='C:/Users/User/Documents')
File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
execfile(filename, namespace)
File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/User/Documents/untitled1.py", line 37, in <module>
cl = nltk.NaiveBayesClassifier.train(a)
File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\classify\naivebayes.py", line 194, in train
for featureset, label in labeled_featuresets:
ValueError: too many values to unpack (expected 2)
If I am not wrong, train() takes a list of tuples and you are providing a file object.
Instead of this:
a = open('C:/Users/User/train_words.csv', 'r')
try this:
a = open('C:/Users/User/train_words.csv', 'r').read()  # this is a string
a_list = a.split('\n')
a_list_of_tuple = [tuple(x.split(',')) for x in a_list]
and pass the a_list_of_tuple variable to train().
Hope this will help :)
From the doc:
def train(cls, labeled_featuresets, estimator=ELEProbDist):
"""
:param labeled_featuresets: A list of classified featuresets,
i.e., a list of tuples ``(featureset, label)``.
"""
So you can write something like this (iterating the file object yields one line at a time; the original readline() call would iterate over the characters of a single line):
feature_set = [line.strip().split(',')[::-1] for line in open('filename')]
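Note that NaiveBayesClassifier.train() also expects each featureset to be a dict, not a bare string. A minimal sketch of building the (featureset, label) pairs that way; the 'word' feature name here is just an illustration:
labeled_featuresets = []
with open('C:/Users/User/train_words.csv') as f:
    for line in f:
        word, label = line.strip().split(',')
        labeled_featuresets.append(({'word': word}, label))  # (featureset dict, label)
cl = nltk.NaiveBayesClassifier.train(labeled_featuresets)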

Override a function in nltk - Error in ContextIndex class

I am using the text.similar('example') function from the nltk.Text module.
(It prints the words that are similar to a given word, based on the corpus.)
However, I want to store that list of words in a list, but the function itself returns None.
# text is an instance of nltk.Text
simList = text.similar("physics")
>>> a = text.similar("physics")
the and a in science this which it that energy his of but chemistry is
space mathematics theory as mechanics
>>> a
>>> a
# a contains no value.
So should I modify the source function itself? I don't think that is good practice. How can I override that function so that it returns the value?
Edit: Referring to this thread, I tried using the ContextIndex class, but I am getting the following error.
File "test.py", line 39, in <module>
text = nltk.text.ContextIndex(word.lower() for word in words) File "/home/kenden/den/codes/nlpenv/local/lib/python2.7/site-packages/nltk/text.py", line 56, in __init__
for i, w in enumerate(tokens)) File "/home/kenden/den/codes/nlpenv/local/lib/python2.7/site-packages/nltk/probability.py", line 1752, in __init__
for (cond, sample) in cond_samples: File "/home/kenden/den/codes/nlpenv/local/lib/python2.7/site-packages/nltk/text.py", line 56, in <genexpr>
for i, w in enumerate(tokens)) File "/home/kenden/den/codes/nlpenv/local/lib/python2.7/site-packages/nltk/text.py", line 43, in _default_context
right = (tokens[i+1].lower() if i != len(tokens) - 1 else '*END*') TypeError: object of type 'generator' has no len()
This is my line 39 of test.py
text = nltk.text.ContextIndex(word.lower() for word in words)
How can I solve this?
You are getting the error because the ContextIndex constructor tries to take the len() of your token sequence (the tokens argument), but you pass it a generator, hence the error. To avoid the problem, pass a true list, e.g.:
text = nltk.text.ContextIndex(list(word.lower() for word in words))
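Alternatively, once the ContextIndex is built, its similar_words() method returns the list instead of printing it; a short sketch, assuming words is your token sequence:
text = nltk.text.ContextIndex([word.lower() for word in words])
simList = text.similar_words('physics')  # returns a list of similar words
print(simList)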

Python NLP: TypeError: not all arguments converted during string formatting

I tried the code from "Natural Language Processing with Python", but a TypeError occurred.
import nltk
from nltk.corpus import brown

suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist.inc(word[-1:])
    suffix_fdist.inc(word[-2:])
    suffix_fdist.inc(word[-3:])
common_suffixes = suffix_fdist.items()[:100]

def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
    return features

pos_features('people')
the error is below:
Traceback (most recent call last):
File "/home/wanglan/javadevelop/TestPython/src/FirstModule.py", line 323, in <module>
pos_features('people')
File "/home/wanglan/javadevelop/TestPython/src/FirstModule.py", line 321, in pos_features
features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
TypeError: not all arguments converted during string formatting
Could anyone help me find out where I am wrong?
suffix is a tuple, because .items() returns (key, value) tuples. When you use %, if the right-hand side is a tuple, its values are unpacked and substituted for each % format in order. The error you get is complaining that the tuple has more entries than there are % formats.
You probably want just the key (the actual suffix), in which case you should use suffix[0], or use .keys() to retrieve only the dictionary keys.
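For example, one way to keep only the suffix strings before building the features (a small sketch against the code above; note that in newer NLTK versions FreqDist has no .inc() and .items() is not sliceable, so you would use suffix_fdist.most_common(100) instead):
common_suffixes = [suffix for suffix, count in suffix_fdist.items()[:100]]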

Mongoengine - using icontains with all

I have seen this question but it does not answer my question, or even pose it very well.
I think that this is best explained with an example:
class Blah(Document):
    someList = ListField(StringField())

Blah.drop_collection()
Blah(someList=['lop', 'glob', 'hat']).save()
Blah(someList=['hello', 'kitty']).save()

# One of these should match the first entry
print(Blah.objects(someList__icontains__all=['Lo']).count())
print(Blah.objects(someList__all__icontains=['Lo']).count())
I assumed that this would print either 1, 0 or 0, 1 (or miraculously 1, 1) but instead it gives
0
Traceback (most recent call last):
File "metst.py", line 14, in <module>
print(Blah.objects(someList__all__icontains=['lO']).count())
File "/home/blah/.pythonbrew/pythons/Python-3.1.4/lib/python3.1/site-packages/mongoengine/queryset.py", line 1034, in count
return self._cursor.count(with_limit_and_skip=True)
File "/home/blah/.pythonbrew/pythons/Python-3.1.4/lib/python3.1/site-packages/mongoengine/queryset.py", line 608, in _cursor
self._cursor_obj = self._collection.find(self._query,
File "/home/blah/.pythonbrew/pythons/Python-3.1.4/lib/python3.1/site-packages/mongoengine/queryset.py", line 390, in _query
self._mongo_query = self._query_obj.to_query(self._document)
File "/home/blah/.pythonbrew/pythons/Python-3.1.4/lib/python3.1/site-packages/mongoengine/queryset.py", line 213, in to_query
query = query.accept(QueryCompilerVisitor(document))
File "/home/blah/.pythonbrew/pythons/Python-3.1.4/lib/python3.1/site-packages/mongoengine/queryset.py", line 278, in accept
return visitor.visit_query(self)
File "/home/blah/.pythonbrew/pythons/Python-3.1.4/lib/python3.1/site-packages/mongoengine/queryset.py", line 170, in visit_query
return QuerySet._transform_query(self.document, **query.query)
File "/home/blah/.pythonbrew/pythons/Python-3.1.4/lib/python3.1/site-packages/mongoengine/queryset.py", line 755, in _transform_query
value = field.prepare_query_value(op, value)
File "/home/blah/.pythonbrew/pythons/Python-3.1.4/lib/python3.1/site-packages/mongoengine/fields.py", line 594, in prepare_query_value
return self.field.prepare_query_value(op, value)
File "/home/blah/.pythonbrew/pythons/Python-3.1.4/lib/python3.1/site-packages/mongoengine/fields.py", line 95, in prepare_query_value
value = re.escape(value)
File "/home/blah/.pythonbrew/pythons/Python-3.1.4/lib/python3.1/re.py", line 246, in escape
return bytes(s)
TypeError: 'str' object cannot be interpreted as an integer
Neither query works!
Does MongoEngine support some way to search using icontains and all? Or some way to get around this?
Note: I want to use MongoEngine, not PyMongo.
Edit: The same issue exists with Python 2.7.3.
The only way to do this, as of now (version 0.8.0), is by using a __raw__ query, possibly combined with re.compile(), like so:
import re
input_list = ['Lo']
converted_list = [re.compile(q, re.I) for q in input_list]
print(Blah.objects(__raw__={"someList": {"$all": converted_list}}).count())
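If the search strings can contain regex metacharacters, it may be safer to escape them before compiling; a small variation on the snippet above:
converted_list = [re.compile(re.escape(q), re.I) for q in input_list]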
There is currently no way in mongoengine to combine all and icontains; the only operator that can be combined with other operators is not. This is subtly mentioned in the docs, which say:
not – negate a standard check, may be used before other operators (e.g. Q(age__not__mod=5))
(emphasis mine)
But the docs do not explicitly say that you cannot do this with any other operator, which is actually the case.
You can confirm this behavior by looking at the source:
version 0.8.0+ (in mongoengine/queryset/transform.py, lines 42-48):
if parts[-1] in MATCH_OPERATORS:
    op = parts.pop()

negate = False
if parts[-1] == 'not':
    parts.pop()
    negate = True
In older versions the above lines can be seen in mongoengine/queryset.py within the _transform_query method.
