(https://github.com/buriy/python-readability)
I am struggling using this library and I can't find any documentation for it. (Is there any?)
There are some kind of useable pieces calling help(Document) but there is still something wrong.
My code so far:
from readability.readability import Document
import requests
url = 'http://www.somepage.com'
html = requests.get(url, verify=False).content
readable_article = Document(html, negative_keywords='test_keyword').summary()
with open('test.html', 'w', encoding='utf-8') as test_file:
test_file.write(readable_article)
According to the help(Document) output, it should be possible to use a list for the input of the negative_keywords.
readable_article = Document(html, negative_keywords=['test_keyword1', 'test-keyword2').summary()
Gives me a bunch of errors I don't understand:
Traceback (most recent call last): File
"/usr/lib/python3.4/site-packages/readability/readability.py", line
163, in summary
candidates = self.score_paragraphs() File "/usr/lib/python3.4/site-packages/readability/readability.py", line
300, in score_paragraphs
candidates[parent_node] = self.score_node(parent_node) File "/usr/lib/python3.4/site-packages/readability/readability.py", line
360, in score_node
content_score = self.class_weight(elem) File "/usr/lib/python3.4/site-packages/readability/readability.py", line
348, in class_weight
if self.negative_keywords and self.negative_keywords.search(feature): AttributeError: 'list' object
has no attribute 'search' Traceback (most recent call last): File
"/usr/lib/python3.4/site-packages/readability/readability.py", line
163, in summary
candidates = self.score_paragraphs() File "/usr/lib/python3.4/site-packages/readability/readability.py", line
300, in score_paragraphs
candidates[parent_node] = self.score_node(parent_node) File "/usr/lib/python3.4/site-packages/readability/readability.py", line
360, in score_node
content_score = self.class_weight(elem) File "/usr/lib/python3.4/site-packages/readability/readability.py", line
348, in class_weight
if self.negative_keywords and self.negative_keywords.search(feature): AttributeError: 'list' object
has no attribute 'search'
Could some one give me please a hint on the error or how to deal with it?
There's an error in the library code. If you look at compile_pattern:
def compile_pattern(elements):
if not elements:
return None
elif isinstance(elements, (list, tuple)):
return list(elements)
elif isinstance(elements, regexp_type):
return elements
else:
# assume string or string like object
elements = elements.split(',')
return re.compile(u'|'.join([re.escape(x.lower()) for x in elements]), re.U)
You can see that it only returns a regex if the elements is not None, not a list or tuple, and not a regular expression.
Later on, though, it assumes that self.negative_keywords is a regular expression. So, I suggest you input your list as a string in the form of "test_keyword1,test_keyword2". This will make sure that compile_pattern returns a regular expression which should fix the error.
Related
Trying to run a training script, after resolving a few error messages I've come accross this one, Anyone know what is happening here?
Batch size > 1 not implemented! Falling back to batch_size = 1 ...
Building multi-modal model...
Loading model parameters.
Traceback (most recent call last):
File "translate_mm.py", line 166, in <module>
main()
File "translate_mm.py", line 98, in main
use_filter_pred=False)
File "/content/drive/My Drive/Thesis/thesis_code/onmt/io/IO.py", line 198, in build_dataset
use_filter_pred=use_filter_pred)
File "/content/drive/My Drive/Thesis/thesis_code/onmt/io/TextDataset.py", line 75, in __init__
out_examples = list(out_examples)
File "/content/drive/My Drive/Thesis/thesis_code/onmt/io/TextDataset.py", line 69, in <genexpr>
out_examples = (self._construct_example_fromlist(
File "/content/drive/My Drive/Thesis/thesis_code/onmt/io/TextDataset.py", line 68, in <genexpr>
example_values = ([ex[k] for k in keys] for ex in examples_iter)
File "/content/drive/My Drive/Thesis/thesis_code/onmt/io/TextDataset.py", line 265, in _dynamic_dict
src_map = torch.LongTensor([src_vocab.stoi[w] for w in src])
File "/content/drive/My Drive/Thesis/thesis_code/onmt/io/TextDataset.py", line 265, in <listcomp>
src_map = torch.LongTensor([src_vocab.stoi[w] for w in src])
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1178, in __getattr__
type(self).__name__, name))
AttributeError: 'Vocab' object has no attribute 'stoi'
which refers to
def _dynamic_dict(self, examples_iter):
for example in examples_iter:
src = example["src"]
src_vocab = torchtext.vocab.Vocab(Counter(src))
self.src_vocabs.append(src_vocab)
# Mapping source tokens to indices in the dynamic dict.
src_map = torch.LongTensor([src_vocab.stoi[w] for w in src])
example["src_map"] = src_map
if "tgt" in example:
tgt = example["tgt"]
mask = torch.LongTensor(
[0] + [src_vocab.stoi[w] for w in tgt] + [0])
example["alignment"] = mask
yield example
Note: the original model was made with a much older version of torchtext, I am guessing the error is related to that, but I am simply too inexperienced to know for sure.
Anyone has an idea? Googling this provided no significant results.
regards,
U.
You must use get_stoi()[w].This is for the newer version after removing the legacy. You also can use get_itos() which returns a list of elements.
I wanted to write the code to find the similarity between two sentences and then I ended up writing this code using nltk and gensim. I used tokenization and gensim.similarities.Similarity to do the work. But it ain't serving my purpose.
It works fine until I introduce the last line of code.
import gensim
import nltk
raw_documents = ["I'm taking the show on the road.",
"My socks are a force multiplier.",
"I am the barber who cuts everyone's hair who doesn't cut their
own.",
"Legend has it that the mind is a mad monkey.",
"I make my own fun."]
from nltk.tokenize import word_tokenize
gen_docs = [[w.lower() for w in word_tokenize(text)]
for text in raw_documents]
dictionary = gensim.corpora.Dictionary(gen_docs)
print(dictionary[5])
print(dictionary.token2id['socks'])
print("Number of words in dictionary:",len(dictionary))
for i in range(len(dictionary)):
print(i, dictionary[i])
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
print(corpus)
tf_idf = gensim.models.TfidfModel(corpus)
print(tf_idf)
s = 0
for i in corpus:
s += len(i)
print(s)
sims = gensim.similarities.Similarity('/usr/workdir/',tf_idf[corpus],
num_features=len(dictionary))
print(sims)
print(type(sims))
query_doc = [w.lower() for w in word_tokenize("Socks are a force for good.")]
print(query_doc)
query_doc_bow = dictionary.doc2bow(query_doc)
print(query_doc_bow)
query_doc_tf_idf = tf_idf[query_doc_bow]
print(query_doc_tf_idf)
sims[query_doc_tf_idf]
It throws this error. I couldn't find the answer for this anywhere on the internet.
Traceback (most recent call last):
File "C:\Python36\lib\site-packages\gensim\utils.py", line 679, in save
_pickle.dump(self, fname_or_handle, protocol=pickle_protocol)
TypeError: file must have a 'write' attribute
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "semantic.py", line 45, in <module>
sims[query_doc_tf_idf]
File "C:\Python36\lib\site-packages\gensim\similarities\docsim.py", line
503, in __getitem__
self.close_shard() # no-op if no documents added to index since last
query
File "C:\Python36\lib\site-packages\gensim\similarities\docsim.py", line
427, in close_shard
shard = Shard(self.shardid2filename(shardid), index)
File "C:\Python36\lib\site-packages\gensim\similarities\docsim.py", line
110, in __init__
index.save(self.fullname())
File "C:\Python36\lib\site-packages\gensim\utils.py", line 682, in save
self._smart_save(fname_or_handle, separately, sep_limit, ignore,
pickle_protocol=pickle_protocol)
File "C:\Python36\lib\site-packages\gensim\utils.py", line 538, in
_smart_save
pickle(self, fname, protocol=pickle_protocol)
File "C:\Python36\lib\site-packages\gensim\utils.py", line 1337, in pickle
with smart_open(fname, 'wb') as fout: # 'b' for binary, needed on
Windows
File "C:\Python36\lib\site-packages\smart_open\smart_open_lib.py", line
181, in smart_open
fobj = _shortcut_open(uri, mode, **kw)
File "C:\Python36\lib\site-packages\smart_open\smart_open_lib.py", line
287, in _shortcut_open
return io.open(parsed_uri.uri_path, mode, **open_kwargs)
Please help figure out where the problem is.
Your query should work if you specify a valid path when you instantiate your Similarity. For the example below, I have created a directory Similarity on my C-drive and have specified the directory path and a name for the file in the function call.
sims = gensim.similarities.Similarity('C:/Similarity/sims',tf_idf[corpus],
num_features=len(dictionary))
print(sims)
print(type(sims))
query_doc = [w.lower() for w in word_tokenize("Socks are a force for good.")]
print(query_doc)
query_doc_bow = dictionary.doc2bow(query_doc)
print(query_doc_bow)
query_doc_tf_idf = tf_idf[query_doc_bow]
print(query_doc_tf_idf)
print('Query result:', sims[query_doc_tf_idf])
Query result: [0. 0.84565616 0. 0.06124881 0. ]
I am trying to train some data for a classification tool I am building. I have done some simple examples and it works fine.
I am now trying to use some data from work (which is what it will be used on), and I am getting a TypeError: 'float' object is not iterable error, the traceback is here:
Traceback (most recent call last):
File "C:/Users/nicholas/Desktop/machineTraining/classLearning.py", line 13, in <module>
cl = NaiveBayesClassifier(train)
File "C:\Users\nicholas\AppData\Local\Programs\Python\Python36-32\lib\site-packages\textblob\classifiers.py", line 205, in __init__
super(NLTKClassifier, self).__init__(train_set, feature_extractor, format, **kwargs)
File "C:\Users\nicholas\AppData\Local\Programs\Python\Python36-32\lib\site-packages\textblob\classifiers.py", line 139, in __init__
self._word_set = _get_words_from_dataset(self.train_set) #Keep a hidden set of unique words.
File "C:\Users\nicholas\AppData\Local\Programs\Python\Python36-32\lib\site-packages\textblob\classifiers.py", line 63, in _get_words_from_dataset
return set(all_words)
This is my code:
df = pd.read_csv("C:/Users/nicholas\Desktop/trainData.csv", encoding='latin-1')
df['train'] = df[['Summary', 'Primary Classification']].apply(tuple, axis=1)
aTrain = df['train'].values.tolist()
train = aTrain
cl = NaiveBayesClassifier(train)
Any ideas on what is going wrong?
I create xml structure using the methods lxml.etree.Element
import lxml.etree
import lxml.html
parent = lxml.etree.Element('root')
child = lxml.etree.Element('sub')
child.text = 'text'
parent.append(child)
I need to do the following query:
doc = lxml.html.document_fromstring(parent)
text = doc.xpath('sub/text()')
print(text)
but I get the following error message:
Traceback (most recent call last): File
"C:\VINT\OPENSERVER\OpenServer\domains\localhost\python\parse_html\6_first_store_names_cat_full_xml_nested\q.py",
line 9, in
doc = lxml.html.document_fromstring(parent) File "C:\Python33\lib\site-packages\lxml\html__init__.py", line 600, in
document_fromstring
value = etree.fromstring(html, parser, **kw) File "lxml.etree.pyx", line 3003, in lxml.etree.fromstring
(src\lxml\lxml.etree.c:67277) File "parser.pxi", line 1784, in
lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:101615)
ValueError: can only parse strings
>
help my please
lxml.html.document_fromstring() accepts a string, not an Element as you are passing in. Try passing in lxml.etree.tostring(parent):
s = lxml.etree.tostring(parent)
doc = lxml.html.document_fromstring(s)
I have the following code:
from random import randint,choice
add=lambda x:lambda y:x+y
sub=lambda x:lambda y:x-y
mul=lambda x:lambda y:x*y
ops=[[add,'+'],[sub,'-'],[mul,'*']]
def gen_expression(length,left,right):
expr=[]
for i in range(length):
op=choice(ops)
expr.append([op[0](randint(left,right)),op[1]])
return expr
def eval_expression (expr,x):
for i in expr:
x=i[0](x)
return x
def eval_expression2 (expr,x):
for i in expr:
x=i(x)
return x
[snip , see end of post]
def genetic_arithmetic(get,start,length,left,right):
batch=[]
found = False
for i in range(30):
batch.append(gen_expression(length,left,right))
while not found:
batch=sorted(batch,key=lambda y:abs(eval_expression(y,start)-get))
print evald_expression_tostring(batch[0],start)+"\n\n"
#combine
for w in range(len(batch)):
rind=choice(range(length))
batch.append(batch[w][:rind]+choice(batch)[rind:])
#mutate
for w in range(len(batch)):
rind=choice(range(length))
op=choice(ops)
batch.append(batch[w][:rind]+[op[0](randint(left,right)),op[1]]+batch[w][rind+1:])
found=(eval_expression(batch[0],start)==get)
print "\n\n"+evald_expression_tostring(batch[0],start)
When I try to call to call genetic_artihmetic with eval_expression as the sorting key, I get this:
Traceback (most recent call last):
File "<pyshell#113>", line 1, in <module>
genetic_arithmetic(0,10,10,-10,10)
File "/home/benikis/graming/py/genetic_number.py", line 50, in genetic_arithmetic
batch=sorted(batch,key=lambda y:abs(eval_expression(y,start)-get))
File "/home/benikis/graming/py/genetic_number.py", line 50, in <lambda>
batch=sorted(batch,key=lambda y:abs(eval_expression(y,start)-get))
File "/home/benikis/graming/py/genetic_number.py", line 20, in eval_expression
x=i[0](x)
TypeError: 'function' object is unsubscriptable
And when I try the same with eval_expression2 as the sorting,the error is this:
Traceback (most recent call last):
File "<pyshell#114>", line 1, in <module>
genetic_arithmetic(0,10,10,-10,10)
File "/home/benikis/graming/py/genetic_number.py", line 50, in genetic_arithmetic
batch=sorted(batch,key=lambda y:abs(eval_expression2(y,start)-get))
File "/home/benikis/graming/py/genetic_number.py", line 50, in <lambda>
batch=sorted(batch,key=lambda y:abs(eval_expression2(y,start)-get))
File "/home/benikis/graming/py/genetic_number.py", line 25, in eval_expression2
x=i(x)
TypeError: 'list' object is not callable
As far as i can wrap my head around this, my guess is that sorted() is trying to recursively sort the sublists,maybe? What is really going on here?
Python version is 2.6 - the one in the debian stable repos.
[snip] here:
def expression_tostring(expr):
expr_str=len(expr)*'('+'x '
for i in expr :
if i[1]=='*':
n=i[0](1)
else:
n=i[0](0)
expr_str+=i[1]+' '+str(n)+') '
return expr_str
def evald_expression_tostring(expr,x):
exprstr=expression_tostring(expr).replace('x',str(x))
return exprstr+ ' = ' + str(eval_expression(expr,x))
x=i[0](x) #here i is a function so you can't perform indexing operation on it
x=i(x) #here i is a list so you can't call it as a function
in both cases the value of i is fetched from expr, may be expr contains different type of object than what you're assuming here.
Try this modification:
def gen_expression(length,left,right):
expr=[]
for i in range(length):
op=choice(ops)
expr.append([op[0], randint(left,right),op[1]])
return expr
def eval_expression (expr,x):
for i in expr:
x=i[0](i[1])
return x
You had expr.append([op[0](randint(left,right)),op[1]]) which will put the return value of the calling the function into the 0th index.