How to use gensim.similarities.Similarity to find similarity between two sentences - python

I want to compute the similarity between two sentences, so I wrote the code below using nltk and gensim. It tokenizes the sentences and uses gensim.similarities.Similarity to compare them, but it does not work as intended.
Everything runs fine until the last line of the code.
import gensim
import nltk
raw_documents = ["I'm taking the show on the road.",
                 "My socks are a force multiplier.",
                 "I am the barber who cuts everyone's hair who doesn't cut their own.",
                 "Legend has it that the mind is a mad monkey.",
                 "I make my own fun."]
from nltk.tokenize import word_tokenize
gen_docs = [[w.lower() for w in word_tokenize(text)]
            for text in raw_documents]
dictionary = gensim.corpora.Dictionary(gen_docs)
print(dictionary[5])
print(dictionary.token2id['socks'])
print("Number of words in dictionary:",len(dictionary))
for i in range(len(dictionary)):
    print(i, dictionary[i])
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
print(corpus)
tf_idf = gensim.models.TfidfModel(corpus)
print(tf_idf)
s = 0
for i in corpus:
    s += len(i)
print(s)
sims = gensim.similarities.Similarity('/usr/workdir/', tf_idf[corpus],
                                      num_features=len(dictionary))
print(sims)
print(type(sims))
query_doc = [w.lower() for w in word_tokenize("Socks are a force for good.")]
print(query_doc)
query_doc_bow = dictionary.doc2bow(query_doc)
print(query_doc_bow)
query_doc_tf_idf = tf_idf[query_doc_bow]
print(query_doc_tf_idf)
sims[query_doc_tf_idf]
It throws this error. I couldn't find the answer for this anywhere on the internet.
Traceback (most recent call last):
  File "C:\Python36\lib\site-packages\gensim\utils.py", line 679, in save
    _pickle.dump(self, fname_or_handle, protocol=pickle_protocol)
TypeError: file must have a 'write' attribute
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "semantic.py", line 45, in <module>
    sims[query_doc_tf_idf]
  File "C:\Python36\lib\site-packages\gensim\similarities\docsim.py", line 503, in __getitem__
    self.close_shard() # no-op if no documents added to index since last query
  File "C:\Python36\lib\site-packages\gensim\similarities\docsim.py", line 427, in close_shard
    shard = Shard(self.shardid2filename(shardid), index)
  File "C:\Python36\lib\site-packages\gensim\similarities\docsim.py", line 110, in __init__
    index.save(self.fullname())
  File "C:\Python36\lib\site-packages\gensim\utils.py", line 682, in save
    self._smart_save(fname_or_handle, separately, sep_limit, ignore, pickle_protocol=pickle_protocol)
  File "C:\Python36\lib\site-packages\gensim\utils.py", line 538, in _smart_save
    pickle(self, fname, protocol=pickle_protocol)
  File "C:\Python36\lib\site-packages\gensim\utils.py", line 1337, in pickle
    with smart_open(fname, 'wb') as fout: # 'b' for binary, needed on Windows
  File "C:\Python36\lib\site-packages\smart_open\smart_open_lib.py", line 181, in smart_open
    fobj = _shortcut_open(uri, mode, **kw)
  File "C:\Python36\lib\site-packages\smart_open\smart_open_lib.py", line 287, in _shortcut_open
    return io.open(parsed_uri.uri_path, mode, **open_kwargs)
Please help figure out where the problem is.

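Your query should work if you pass a valid path when you instantiate Similarity. The first argument is used as a prefix for the index files that gensim writes to disk, so the directory it points to must exist and be writable. For the example below, I created a directory named Similarity on my C: drive and passed the directory path plus a file name (sims) as that argument.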
sims = gensim.similarities.Similarity('C:/Similarity/sims', tf_idf[corpus],
                                      num_features=len(dictionary))
print(sims)
print(type(sims))
query_doc = [w.lower() for w in word_tokenize("Socks are a force for good.")]
print(query_doc)
query_doc_bow = dictionary.doc2bow(query_doc)
print(query_doc_bow)
query_doc_tf_idf = tf_idf[query_doc_bow]
print(query_doc_tf_idf)
print('Query result:', sims[query_doc_tf_idf])
Query result: [0. 0.84565616 0. 0.06124881 0. ]
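As a minimal sketch of the same idea without a hard-coded drive path, the prefix can be built from a directory that is created first. The helper name make_output_prefix and the directory name similarity_index are hypothetical, chosen only for illustration. The gensim call is left as a comment so the snippet runs on its own.

```python
import os
import tempfile


def make_output_prefix(directory, name="sims"):
    """Create `directory` if needed and return `directory/name` as a prefix."""
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, name)


# Hypothetical location; any writable directory should work.
index_dir = os.path.join(tempfile.gettempdir(), "similarity_index")
output_prefix = make_output_prefix(index_dir)
print(output_prefix)

# The prefix would then be passed as the first argument, as in the answer:
# sims = gensim.similarities.Similarity(output_prefix, tf_idf[corpus],
#                                       num_features=len(dictionary))
```

Creating the directory up front avoids relying on a path such as /usr/workdir/ that may not exist on the machine running the script.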

Related

AttributeError: 'Vocab' object has no attribute 'stoi'

I'm trying to run a training script. After resolving a few error messages, I've come across this one. Does anyone know what is happening here?
Batch size > 1 not implemented! Falling back to batch_size = 1 ...
Building multi-modal model...
Loading model parameters.
Traceback (most recent call last):
File "translate_mm.py", line 166, in <module>
main()
File "translate_mm.py", line 98, in main
use_filter_pred=False)
File "/content/drive/My Drive/Thesis/thesis_code/onmt/io/IO.py", line 198, in build_dataset
use_filter_pred=use_filter_pred)
File "/content/drive/My Drive/Thesis/thesis_code/onmt/io/TextDataset.py", line 75, in __init__
out_examples = list(out_examples)
File "/content/drive/My Drive/Thesis/thesis_code/onmt/io/TextDataset.py", line 69, in <genexpr>
out_examples = (self._construct_example_fromlist(
File "/content/drive/My Drive/Thesis/thesis_code/onmt/io/TextDataset.py", line 68, in <genexpr>
example_values = ([ex[k] for k in keys] for ex in examples_iter)
File "/content/drive/My Drive/Thesis/thesis_code/onmt/io/TextDataset.py", line 265, in _dynamic_dict
src_map = torch.LongTensor([src_vocab.stoi[w] for w in src])
File "/content/drive/My Drive/Thesis/thesis_code/onmt/io/TextDataset.py", line 265, in <listcomp>
src_map = torch.LongTensor([src_vocab.stoi[w] for w in src])
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1178, in __getattr__
type(self).__name__, name))
AttributeError: 'Vocab' object has no attribute 'stoi'
which refers to
def _dynamic_dict(self, examples_iter):
    for example in examples_iter:
        src = example["src"]
        src_vocab = torchtext.vocab.Vocab(Counter(src))
        self.src_vocabs.append(src_vocab)
        # Mapping source tokens to indices in the dynamic dict.
        src_map = torch.LongTensor([src_vocab.stoi[w] for w in src])
        example["src_map"] = src_map
        if "tgt" in example:
            tgt = example["tgt"]
            mask = torch.LongTensor(
                [0] + [src_vocab.stoi[w] for w in tgt] + [0])
            example["alignment"] = mask
        yield example
Note: the original model was built with a much older version of torchtext. I am guessing the error is related to that, but I am too inexperienced to know for sure.
Does anyone have an idea? Googling this gave no useful results.
regards,
U.
Use get_stoi()[w] instead of stoi[w]. This applies to newer versions of torchtext, where the legacy Vocab API was removed. You can also use get_itos(), which returns the list of tokens ordered by index.
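If the same script has to run against both old and new torchtext versions, one option is a small compatibility helper. This is only a sketch: token_to_index is a hypothetical name, and the two classes below are stand-ins that mimic the two API styles so the snippet runs without torchtext installed.

```python
def token_to_index(vocab):
    """Return the token -> index mapping for old- or new-style vocab objects."""
    if hasattr(vocab, "get_stoi"):   # newer style: accessor method
        return vocab.get_stoi()
    return vocab.stoi                # older style: plain attribute


# Hypothetical stand-ins for the two API styles, used only to exercise the helper.
class NewStyleVocab:
    def get_stoi(self):
        return {"<unk>": 0, "socks": 1}


class OldStyleVocab:
    stoi = {"<unk>": 0, "socks": 1}


src = ["socks", "socks"]
for vocab in (NewStyleVocab(), OldStyleVocab()):
    stoi = token_to_index(vocab)
    print([stoi[w] for w in src])
```

In _dynamic_dict, the lookups src_vocab.stoi[w] would then become token_to_index(src_vocab)[w], ideally with the mapping fetched once per example rather than once per token.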

Python - NaiveBayesClassifier - TypeError: 'float' object is not iterable

I am trying to train a classifier for a classification tool I am building. It works fine on some simple examples.
I am now trying to use real data from work (which is what the tool will be used on), and I am getting TypeError: 'float' object is not iterable. The traceback is here:
Traceback (most recent call last):
File "C:/Users/nicholas/Desktop/machineTraining/classLearning.py", line 13, in <module>
cl = NaiveBayesClassifier(train)
File "C:\Users\nicholas\AppData\Local\Programs\Python\Python36-32\lib\site-packages\textblob\classifiers.py", line 205, in __init__
super(NLTKClassifier, self).__init__(train_set, feature_extractor, format, **kwargs)
File "C:\Users\nicholas\AppData\Local\Programs\Python\Python36-32\lib\site-packages\textblob\classifiers.py", line 139, in __init__
self._word_set = _get_words_from_dataset(self.train_set) #Keep a hidden set of unique words.
File "C:\Users\nicholas\AppData\Local\Programs\Python\Python36-32\lib\site-packages\textblob\classifiers.py", line 63, in _get_words_from_dataset
return set(all_words)
This is my code:
import pandas as pd
from textblob.classifiers import NaiveBayesClassifier
df = pd.read_csv("C:/Users/nicholas/Desktop/trainData.csv", encoding='latin-1')
df['train'] = df[['Summary', 'Primary Classification']].apply(tuple, axis=1)
aTrain = df['train'].values.tolist()
train = aTrain
cl = NaiveBayesClassifier(train)
Any ideas on what is going wrong?

Gensim getting started Error : No such file or directory: 'vectors.bin'

I am learning about the Word2Vec and GloVe models in Python, so I am going through this getting-started tutorial for gensim, available here.
I ran the following code step by step in IDLE 3:
from gensim.models import word2vec
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = word2vec.Text8Corpus('text8')
sentences = word2vec.Text8Corpus('~/Desktop/text8')
model = word2vec.Word2Vec(sentences, size=200)
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=2)
model.most_similar(['man'])
model.save('text8.model')
model.save_word2vec_format('text.model.bin', binary=True)
model1 = word2vec.Word2Vec.load_word2vec_format('text.model.bin', binary=True)
model1.most_similar(['girl', 'father'], ['boy'], topn=3)
more_examples = ["he is she", "big bigger bad", "going went being"]
for example in more_examples:
    a, b, x = example.split()
    predicted = model.most_similar([x, b], [a])[0][0]
    print ("'%s' is to '%s' as '%s' is to '%s'" % (a, b, x, predicted))
model_org = word2vec.Word2Vec.load_word2vec_format('vectors.bin', binary=True)
I am getting this error:
2017-01-17 10:34:26,054 : INFO : loading projection weights from vectors.bin
Traceback (most recent call last):
File "<pyshell#16>", line 1, in <module>
model_org = word2vec.Word2Vec.load_word2vec_format('vectors.bin', binary=True)
File "/usr/local/lib/python3.5/dist-packages/gensim/models/word2vec.py", line 1172, in load_word2vec_format
with utils.smart_open(fname) as fin:
File "/usr/local/lib/python3.5/dist-packages/smart_open-1.3.5-py3.5.egg/smart_open/smart_open_lib.py", line 127, in smart_open
return file_smart_open(parsed_uri.uri_path, mode)
File "/usr/local/lib/python3.5/dist-packages/smart_open-1.3.5-py3.5.egg/smart_open/smart_open_lib.py", line 558, in file_smart_open
return open(fname, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'vectors.bin'
How do I fix this? Where can I get the vectors.bin file?
Thanks in advance for your help.
The tutorial you link uses the name vectors.bin, as an example, when describing how you might load vectors created by the original Google-released word2vec.c toolkit. (That's the name used in that toolkit's documentation.)
Unless you have such a file and need to do something with it, you wouldn't need to load it.
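If you do want that last line to stay in the script, a guarded load is one way to keep it from failing when the file is absent. This is a sketch under assumptions: load_vectors_if_present is a hypothetical helper, and it assumes a gensim version that exposes load_word2vec_format on KeyedVectors (the question's version exposes it on Word2Vec instead).

```python
import os


def load_vectors_if_present(path, binary=True):
    """Load word2vec-format vectors from `path`, or return None if it is absent."""
    if not os.path.exists(path):
        print("No file at %r; nothing to load." % path)
        return None
    # Imported here so the existence check works even without gensim installed.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=binary)


model_org = load_vectors_if_present('vectors.bin')
```

The vectors saved earlier in the same session (text.model.bin) can be loaded the same way by passing that file name instead.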

Usage of python-readability

(https://github.com/buriy/python-readability)
I am struggling to use this library and can't find any documentation for it. (Is there any?)
Calling help(Document) gives some usable information, but something is still going wrong.
My code so far:
from readability.readability import Document
import requests
url = 'http://www.somepage.com'
html = requests.get(url, verify=False).content
readable_article = Document(html, negative_keywords='test_keyword').summary()
with open('test.html', 'w', encoding='utf-8') as test_file:
    test_file.write(readable_article)
According to the help(Document) output, it should be possible to pass a list as negative_keywords:
readable_article = Document(html, negative_keywords=['test_keyword1', 'test-keyword2']).summary()
This gives me errors I don't understand:
Traceback (most recent call last):
  File "/usr/lib/python3.4/site-packages/readability/readability.py", line 163, in summary
    candidates = self.score_paragraphs()
  File "/usr/lib/python3.4/site-packages/readability/readability.py", line 300, in score_paragraphs
    candidates[parent_node] = self.score_node(parent_node)
  File "/usr/lib/python3.4/site-packages/readability/readability.py", line 360, in score_node
    content_score = self.class_weight(elem)
  File "/usr/lib/python3.4/site-packages/readability/readability.py", line 348, in class_weight
    if self.negative_keywords and self.negative_keywords.search(feature):
AttributeError: 'list' object has no attribute 'search'
Could someone please give me a hint about this error or how to deal with it?
There's an error in the library code. If you look at compile_pattern:
def compile_pattern(elements):
    if not elements:
        return None
    elif isinstance(elements, (list, tuple)):
        return list(elements)
    elif isinstance(elements, regexp_type):
        return elements
    else:
        # assume string or string like object
        elements = elements.split(',')
        return re.compile(u'|'.join([re.escape(x.lower()) for x in elements]), re.U)
You can see that it only compiles a regex when elements is a string, that is, when it is not None, not a list or tuple, and not already a regular expression. A list or tuple is returned as a plain list.
Later on, though, the code assumes that self.negative_keywords is a regular expression and calls .search() on it. So I suggest passing your keywords as a single comma-separated string, "test_keyword1,test_keyword2". compile_pattern will then return a regular expression, which should fix the error.
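A short sketch of that workaround, using only the standard library. The helper names keywords_to_arg and compile_from_string are hypothetical; the second one copies the string branch of compile_pattern quoted above to show what the library would end up with.

```python
import re


def keywords_to_arg(keywords):
    """Join a list of keywords into the comma-separated string form."""
    return ",".join(keywords)


def compile_from_string(elements):
    """Mirror of the string branch of compile_pattern quoted above."""
    parts = elements.split(',')
    return re.compile(u'|'.join(re.escape(x.lower()) for x in parts), re.U)


arg = keywords_to_arg(['test_keyword1', 'test-keyword2'])
pattern = compile_from_string(arg)
print(arg)
print(bool(pattern.search('sidebar test-keyword2 widget')))
```

The joined string would then be passed in place of the list, as in Document(html, negative_keywords=arg).summary(). Note that the pattern is built from lowercased keywords, so this assumes the class and id values it is matched against are lowercase too.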

Reportlab: use Table and SPAN

Example:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle
from reportlab.lib.pagesizes import letter
def testPdf():
    doc = SimpleDocTemplate("testpdf.pdf", pagesize=letter,
                            rightMargin=72, leftMargin=72,
                            topMargin=72, bottomMargin=18)
    elements = []
    datas = []
    for x in range(1, 50):
        datas.append(
            [x, x+1]
        )
    t = Table(datas)
    tTableStyle = [
        ('SPAN', (0,0), (0,37)),
    ]
    t.setStyle(TableStyle(tTableStyle))
    elements.append(t)
    doc.build(elements)
if __name__ == '__main__':
    testPdf()
This code runs successfully because the span stays within one page. If I set the "SPAN" to "(0,0),(0,38)", the error is:
reportlab.platypus.doctemplate.LayoutError: Flowable with cell(0,0) containing
'1'(46.24 x 702) too large on page 2 in frame 'normal'(456.0 x 690.0*) of template 'Later'
If I make the span even larger, the error becomes:
Traceback (most recent call last):
  File "testpdf.py", line 26, in <module>
    testPdf()
  File "testpdf.py", line 23, in testPdf
    doc.build(elements)
  File "/usr/local/lib/python2.7/dist-packages/reportlab-2.5-py2.7-linux-x86_64.egg/reportlab/platypus/doctemplate.py", line 1117, in build
    BaseDocTemplate.build(self,flowables, canvasmaker=canvasmaker)
  File "/usr/local/lib/python2.7/dist-packages/reportlab-2.5-py2.7-linux-x86_64.egg/reportlab/platypus/doctemplate.py", line 880, in build
    self.handle_flowable(flowables)
  File "/usr/local/lib/python2.7/dist-packages/reportlab-2.5-py2.7-linux-x86_64.egg/reportlab/platypus/doctemplate.py", line 763, in handle_flowable
    if frame.add(f, canv, trySplit=self.allowSplitting):
  File "/usr/local/lib/python2.7/dist-packages/reportlab-2.5-py2.7-linux-x86_64.egg/reportlab/platypus/frames.py", line 159, in _add
    w, h = flowable.wrap(aW, h)
  File "/usr/local/lib/python2.7/dist-packages/reportlab-2.5-py2.7-linux-x86_64.egg/reportlab/platypus/tables.py", line 1113, in wrap
    self._calc(availWidth, availHeight)
  File "/usr/local/lib/python2.7/dist-packages/reportlab-2.5-py2.7-linux-x86_64.egg/reportlab/platypus/tables.py", line 587, in _calc
    self._calc_height(availHeight,availWidth,W=W)
  File "/usr/local/lib/python2.7/dist-packages/reportlab-2.5-py2.7-linux-x86_64.egg/reportlab/platypus/tables.py", line 553, in _calc_height
    spanFixDim(H0,H,spanCons,lim=hmax)
  File "/usr/local/lib/python2.7/dist-packages/reportlab-2.5-py2.7-linux-x86_64.egg/reportlab/platypus/tables.py", line 205, in spanFixDim
    t = sum([V[x]+M.get(x,0) for x in xrange(x0,x1)])
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
How can I deal with this?
You are facing this problem for exactly the reason Gordon Worley gave in his comment above. A SPAN cannot automatically continue across a page break, because the layout algorithm cannot compute consistent heights and widths for a spanned cell that is split between pages.
One way to tackle this is to format and style the table manually for each page, using row/column coordinates. Sadly, even the replies from reportlab suggest doing this manually.
I split my tables manually and styled them separately, which in my opinion is a very ugly approach. I'll look for other alternatives later.
For reference: https://bitbucket.org/ntj/reportlab_imko_table
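The manual splitting can be reduced to some coordinate arithmetic. The sketch below is not reportlab code: split_span is a hypothetical helper, and rows_per_page is an assumed value that has to be chosen for the actual page size and row height. It cuts one vertical span into per-page pieces, each expressed in row numbers relative to its own chunk.

```python
def split_span(n_rows, rows_per_page, span_start, span_end, col=0):
    """Split a vertical span into per-page pieces.

    Returns one (first_row, last_row, span_command) tuple per page-sized
    chunk of the table. span_command uses row numbers relative to the chunk
    and is None when fewer than two spanned cells fall in that chunk.
    """
    chunks = []
    for first in range(0, n_rows, rows_per_page):
        last = min(first + rows_per_page, n_rows) - 1
        lo, hi = max(span_start, first), min(span_end, last)
        span = None
        if lo < hi:
            span = ('SPAN', (col, lo - first), (col, hi - first))
        chunks.append((first, last, span))
    return chunks


# 49 rows as in the question; 38 rows per page is an assumed value.
for first, last, span in split_span(49, 38, span_start=0, span_end=45):
    print(first, last, span)
```

Each chunk datas[first:last + 1] would then become its own Table with its own TableStyle built from the returned command, so no single span has to cross a page boundary.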
