Tf-idf with char_wb ignores custom preprocessor? - python

I have
import nltk
from nltk.stem.snowball import GermanStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
def my_tokenizer(doc):
    stemmer = GermanStemmer()
    return [stemmer.stem(t.lower()) for t in nltk.word_tokenize(doc)
            if t.lower() not in my_stop_words]
text = "hallo df sdfd"
singleTFIDF = TfidfVectorizer(analyzer='char_wb', ngram_range=(4, 6),
                              preprocessor=my_tokenizer, max_features=50).fit([str(text)])
From the docs it is clear that a custom tokenizer only applies for analyzer='word'.
I get
Traceback (most recent call last):
File "TfidF.py", line 95, in <module>
singleTFIDF = TfidfVectorizer(analyzer='char_wb', ngram_range=(4,6),preprocessor=my_tokenizer, max_features=50).fit([str(text)])
File "C:\Users\chris1\Anaconda3\envs\master\lib\site-packages\sklearn\feature_extraction\text.py", line 185, in _char_wb_ngrams
text_document = self._white_spaces.sub(" ", text_document)
TypeError: expected string or bytes-like object

The preprocessor has to return a single string, not a list of tokens, so join the words before returning them:
    return ' '.join([stemmer.stem(t.lower()) for t in nltk.word_tokenize(doc)
                     if t.lower() not in my_stop_words])
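The TypeError in the traceback comes from a regular-expression substitution (self._white_spaces.sub) that receives whatever the preprocessor returned. Here is a minimal sketch that reproduces this with the standard library only. The names white_spaces, as_tokens, as_string and normalize are hypothetical stand-ins, not scikit-learn names, and a plain str.split replaces the NLTK tokenizer and stemmer:

```python
import re

# Hypothetical stand-in for the whitespace regex the char_wb analyzer applies.
white_spaces = re.compile(r"\s\s+")

def as_tokens(doc):
    # Returns a list, like the preprocessor in the question.
    return [t.lower() for t in doc.split()]

def as_string(doc):
    # Returns one string, like the corrected preprocessor.
    return " ".join(as_tokens(doc))

def normalize(preprocessed):
    # A regex substitution only accepts str or bytes input.
    return white_spaces.sub(" ", preprocessed)

text = "Hallo  df   sdfd"

try:
    normalize(as_tokens(text))
    list_result = "no error"
except TypeError:
    list_result = "TypeError"

string_result = normalize(as_string(text))
print(list_result, repr(string_result))
```

Passing the list fails in the same way as in the question, while the joined string goes through the substitution unchanged.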

Related

AttributeError: Python

I am using a Python script to create a ".json" file and I get the following error:
Traceback (most recent call last):
File "ngs_rawdata_config_creator.py", line 104, in <module>
per_lib = parse_per_lib(pd.read_csv(args.per_lib_input, dtype=str))
File "ngs_rawdata_config_creator.py", line 32, in parse_per_lib
per_lib_dict['lib_paths'] = assign_libpaths(lib_basepaths)
File "ngs_rawdata_config_creator.py", line 53, in assign_libpaths
libpaths_dict[lib] = basepath_to_filepathsdict(path, "*.fastq.gz", ".*_L(\d+)_R(\d+).*\.fastq\.gz")
File "ngs_rawdata_config_creator.py", line 73, in basepath_to_filepathsdict
if rmatch.group(0) == basename:
AttributeError: 'NoneType' object has no attribute 'group'
This is the relevant part of the code:
for fq in all_fastqs:
    basename = os.path.basename(fq)
    rmatch = re.match(capture_regex, basename)
    if rmatch.group(0) == basename:
        lane = rmatch.group(1)
        read = rmatch.group(2)
        readgroups[lane][read] = fq
If re.match doesn't find a match, it returns None, and None has no group method. You need to check for that:
    if rmatch and rmatch.group(0) == basename:
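Here is a self-contained sketch of the same loop with the None check in place. The function name group_by_lane_and_read, the skipped list and the example file paths are hypothetical; re.fullmatch is swapped in for the rmatch.group(0) == basename comparison, since it only succeeds when the whole basename matches:

```python
import os
import re

# Same pattern as in the traceback, written as a raw string.
CAPTURE_REGEX = r".*_L(\d+)_R(\d+).*\.fastq\.gz"

def group_by_lane_and_read(paths, capture_regex=CAPTURE_REGEX):
    """Map lane -> read -> path, collecting paths that do not match."""
    readgroups = {}
    skipped = []
    for fq in paths:
        basename = os.path.basename(fq)
        rmatch = re.fullmatch(capture_regex, basename)
        if rmatch is None:
            # No match: remember the file instead of calling .group on None.
            skipped.append(fq)
            continue
        lane, read = rmatch.group(1), rmatch.group(2)
        readgroups.setdefault(lane, {})[read] = fq
    return readgroups, skipped

# Hypothetical example paths.
paths = [
    "/data/libA/sample_L001_R1_001.fastq.gz",
    "/data/libA/sample_L001_R2_001.fastq.gz",
    "/data/libA/notes.fastq.gz",
]
readgroups, skipped = group_by_lane_and_read(paths)
print(readgroups)
print(skipped)
```

Collecting the non-matching files makes it easier to see which names the pattern rejects, which is usually the real question behind this error.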

How to use gensim.similarities.Similarity to find similarity between two sentences

I wanted to write code to find the similarity between two sentences and ended up with the code below, which uses nltk and gensim. I used tokenization and gensim.similarities.Similarity to do the work, but it isn't working as intended.
It works fine until I add the last line of code.
import gensim
import nltk
raw_documents = ["I'm taking the show on the road.",
                 "My socks are a force multiplier.",
                 "I am the barber who cuts everyone's hair who doesn't cut their own.",
                 "Legend has it that the mind is a mad monkey.",
                 "I make my own fun."]
from nltk.tokenize import word_tokenize
gen_docs = [[w.lower() for w in word_tokenize(text)]
            for text in raw_documents]
dictionary = gensim.corpora.Dictionary(gen_docs)
print(dictionary[5])
print(dictionary.token2id['socks'])
print("Number of words in dictionary:",len(dictionary))
for i in range(len(dictionary)):
    print(i, dictionary[i])
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
print(corpus)
tf_idf = gensim.models.TfidfModel(corpus)
print(tf_idf)
s = 0
for i in corpus:
    s += len(i)
print(s)
sims = gensim.similarities.Similarity('/usr/workdir/',tf_idf[corpus],
                                      num_features=len(dictionary))
print(sims)
print(type(sims))
query_doc = [w.lower() for w in word_tokenize("Socks are a force for good.")]
print(query_doc)
query_doc_bow = dictionary.doc2bow(query_doc)
print(query_doc_bow)
query_doc_tf_idf = tf_idf[query_doc_bow]
print(query_doc_tf_idf)
sims[query_doc_tf_idf]
It throws this error. I couldn't find the answer for this anywhere on the internet.
Traceback (most recent call last):
File "C:\Python36\lib\site-packages\gensim\utils.py", line 679, in save
_pickle.dump(self, fname_or_handle, protocol=pickle_protocol)
TypeError: file must have a 'write' attribute
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "semantic.py", line 45, in <module>
sims[query_doc_tf_idf]
File "C:\Python36\lib\site-packages\gensim\similarities\docsim.py", line 503, in __getitem__
self.close_shard() # no-op if no documents added to index since last query
File "C:\Python36\lib\site-packages\gensim\similarities\docsim.py", line 427, in close_shard
shard = Shard(self.shardid2filename(shardid), index)
File "C:\Python36\lib\site-packages\gensim\similarities\docsim.py", line 110, in __init__
index.save(self.fullname())
File "C:\Python36\lib\site-packages\gensim\utils.py", line 682, in save
self._smart_save(fname_or_handle, separately, sep_limit, ignore, pickle_protocol=pickle_protocol)
File "C:\Python36\lib\site-packages\gensim\utils.py", line 538, in _smart_save
pickle(self, fname, protocol=pickle_protocol)
File "C:\Python36\lib\site-packages\gensim\utils.py", line 1337, in pickle
with smart_open(fname, 'wb') as fout: # 'b' for binary, needed on Windows
File "C:\Python36\lib\site-packages\smart_open\smart_open_lib.py", line 181, in smart_open
fobj = _shortcut_open(uri, mode, **kw)
File "C:\Python36\lib\site-packages\smart_open\smart_open_lib.py", line 287, in _shortcut_open
return io.open(parsed_uri.uri_path, mode, **open_kwargs)
Please help figure out where the problem is.
Your query should work if you specify a valid path when you instantiate your Similarity. The first argument is the prefix of the index files that gensim writes to disk, so it has to point into a directory that exists. For the example below, I created a directory Similarity on my C: drive and passed the directory path plus a file name in the function call.
sims = gensim.similarities.Similarity('C:/Similarity/sims',tf_idf[corpus],
                                      num_features=len(dictionary))
print(sims)
print(type(sims))
query_doc = [w.lower() for w in word_tokenize("Socks are a force for good.")]
print(query_doc)
query_doc_bow = dictionary.doc2bow(query_doc)
print(query_doc_bow)
query_doc_tf_idf = tf_idf[query_doc_bow]
print(query_doc_tf_idf)
print('Query result:', sims[query_doc_tf_idf])
Query result: [0. 0.84565616 0. 0.06124881 0. ]
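One way to avoid a hard-coded path is to build the prefix in a directory the script creates itself. This is only a sketch using the standard library. The helper name make_output_prefix and the directory name similarity_index are hypothetical, and the gensim call is left as a comment because it needs the tf_idf, corpus and dictionary objects from the question:

```python
import os
import tempfile

def make_output_prefix(directory, name="sims"):
    """Create the directory if needed and return a file prefix inside it."""
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, name)

# Hypothetical location: a subdirectory of the system temp directory.
index_dir = os.path.join(tempfile.gettempdir(), "similarity_index")
output_prefix = make_output_prefix(index_dir)
print(output_prefix)

# Assumed usage, with tf_idf, corpus and dictionary built as in the question:
# sims = gensim.similarities.Similarity(output_prefix, tf_idf[corpus],
#                                       num_features=len(dictionary))
```

Because the directory is guaranteed to exist before the index is queried, the shard files have somewhere to be written.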

Python - NaiveBayesClassifier - TypeError: 'float' object is not iterable

I am trying to train a classification tool I am building. I have done some simple examples and they work fine.
I am now trying to use some data from work (which is what the tool will be used on), and I am getting a TypeError: 'float' object is not iterable error. The traceback is here:
Traceback (most recent call last):
File "C:/Users/nicholas/Desktop/machineTraining/classLearning.py", line 13, in <module>
cl = NaiveBayesClassifier(train)
File "C:\Users\nicholas\AppData\Local\Programs\Python\Python36-32\lib\site-packages\textblob\classifiers.py", line 205, in __init__
super(NLTKClassifier, self).__init__(train_set, feature_extractor, format, **kwargs)
File "C:\Users\nicholas\AppData\Local\Programs\Python\Python36-32\lib\site-packages\textblob\classifiers.py", line 139, in __init__
self._word_set = _get_words_from_dataset(self.train_set) #Keep a hidden set of unique words.
File "C:\Users\nicholas\AppData\Local\Programs\Python\Python36-32\lib\site-packages\textblob\classifiers.py", line 63, in _get_words_from_dataset
return set(all_words)
This is my code:
df = pd.read_csv("C:/Users/nicholas/Desktop/trainData.csv", encoding='latin-1')
df['train'] = df[['Summary', 'Primary Classification']].apply(tuple, axis=1)
aTrain = df['train'].values.tolist()
train = aTrain
cl = NaiveBayesClassifier(train)
Any ideas on what is going wrong?

Usage of python-readability

(https://github.com/buriy/python-readability)
I am struggling to use this library and I can't find any documentation for it. (Is there any?)
Calling help(Document) gives some usable pieces, but something is still going wrong.
My code so far:
from readability.readability import Document
import requests
url = 'http://www.somepage.com'
html = requests.get(url, verify=False).content
readable_article = Document(html, negative_keywords='test_keyword').summary()
with open('test.html', 'w', encoding='utf-8') as test_file:
    test_file.write(readable_article)
According to the help(Document) output, it should be possible to pass a list as negative_keywords:
readable_article = Document(html, negative_keywords=['test_keyword1', 'test-keyword2']).summary()
This gives me a bunch of errors I don't understand:
Traceback (most recent call last):
File "/usr/lib/python3.4/site-packages/readability/readability.py", line 163, in summary
candidates = self.score_paragraphs()
File "/usr/lib/python3.4/site-packages/readability/readability.py", line 300, in score_paragraphs
candidates[parent_node] = self.score_node(parent_node)
File "/usr/lib/python3.4/site-packages/readability/readability.py", line 360, in score_node
content_score = self.class_weight(elem)
File "/usr/lib/python3.4/site-packages/readability/readability.py", line 348, in class_weight
if self.negative_keywords and self.negative_keywords.search(feature):
AttributeError: 'list' object has no attribute 'search'
Could someone please give me a hint about the error or how to deal with it?
There's an error in the library code. If you look at compile_pattern:
def compile_pattern(elements):
    if not elements:
        return None
    elif isinstance(elements, (list, tuple)):
        return list(elements)
    elif isinstance(elements, regexp_type):
        return elements
    else:
        # assume string or string like object
        elements = elements.split(',')
        return re.compile(u'|'.join([re.escape(x.lower()) for x in elements]), re.U)
You can see that it only compiles a regex when elements is a non-empty string. A list or tuple is returned as a plain list, and an existing regular expression is passed through unchanged.
Later on, though, the code assumes that self.negative_keywords is a regular expression. So I suggest you pass your keywords as a single string of the form "test_keyword1,test_keyword2". This makes sure that compile_pattern returns a regular expression, which should fix the error.
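The difference between the two input forms can be checked in isolation. The sketch below copies compile_pattern as quoted above. The definition of regexp_type is an assumption about what the library uses, and the keyword values are only examples:

```python
import re

# Assumption: the library's regexp_type is the type of a compiled pattern.
regexp_type = type(re.compile(""))

def compile_pattern(elements):
    # Copied from the snippet quoted above.
    if not elements:
        return None
    elif isinstance(elements, (list, tuple)):
        return list(elements)
    elif isinstance(elements, regexp_type):
        return elements
    else:
        # assume string or string like object
        elements = elements.split(',')
        return re.compile(u'|'.join([re.escape(x.lower()) for x in elements]), re.U)

keywords = ["test_keyword1", "test_keyword2"]

as_list = compile_pattern(keywords)               # stays a list, has no .search
as_pattern = compile_pattern(",".join(keywords))  # becomes a compiled regex

print(hasattr(as_list, "search"))
print(as_pattern.search("sidebar test_keyword2 box") is not None)
```

Joining the list with commas before passing it in keeps the calling code close to what the question intended while avoiding the list branch.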

Python TypeErrors: "'list' object is not callable" and "'function' object is unsubscriptable"

I have the following code:
from random import randint,choice
add=lambda x:lambda y:x+y
sub=lambda x:lambda y:x-y
mul=lambda x:lambda y:x*y
ops=[[add,'+'],[sub,'-'],[mul,'*']]
def gen_expression(length,left,right):
    expr=[]
    for i in range(length):
        op=choice(ops)
        expr.append([op[0](randint(left,right)),op[1]])
    return expr
def eval_expression (expr,x):
    for i in expr:
        x=i[0](x)
    return x
def eval_expression2 (expr,x):
    for i in expr:
        x=i(x)
    return x
[snip , see end of post]
def genetic_arithmetic(get,start,length,left,right):
    batch=[]
    found = False
    for i in range(30):
        batch.append(gen_expression(length,left,right))
    while not found:
        batch=sorted(batch,key=lambda y:abs(eval_expression(y,start)-get))
        print evald_expression_tostring(batch[0],start)+"\n\n"
        #combine
        for w in range(len(batch)):
            rind=choice(range(length))
            batch.append(batch[w][:rind]+choice(batch)[rind:])
        #mutate
        for w in range(len(batch)):
            rind=choice(range(length))
            op=choice(ops)
            batch.append(batch[w][:rind]+[op[0](randint(left,right)),op[1]]+batch[w][rind+1:])
        found=(eval_expression(batch[0],start)==get)
    print "\n\n"+evald_expression_tostring(batch[0],start)
When I call genetic_arithmetic with eval_expression as the sorting key, I get this:
Traceback (most recent call last):
File "<pyshell#113>", line 1, in <module>
genetic_arithmetic(0,10,10,-10,10)
File "/home/benikis/graming/py/genetic_number.py", line 50, in genetic_arithmetic
batch=sorted(batch,key=lambda y:abs(eval_expression(y,start)-get))
File "/home/benikis/graming/py/genetic_number.py", line 50, in <lambda>
batch=sorted(batch,key=lambda y:abs(eval_expression(y,start)-get))
File "/home/benikis/graming/py/genetic_number.py", line 20, in eval_expression
x=i[0](x)
TypeError: 'function' object is unsubscriptable
And when I try the same with eval_expression2 as the sorting key, the error is this:
Traceback (most recent call last):
File "<pyshell#114>", line 1, in <module>
genetic_arithmetic(0,10,10,-10,10)
File "/home/benikis/graming/py/genetic_number.py", line 50, in genetic_arithmetic
batch=sorted(batch,key=lambda y:abs(eval_expression2(y,start)-get))
File "/home/benikis/graming/py/genetic_number.py", line 50, in <lambda>
batch=sorted(batch,key=lambda y:abs(eval_expression2(y,start)-get))
File "/home/benikis/graming/py/genetic_number.py", line 25, in eval_expression2
x=i(x)
TypeError: 'list' object is not callable
As far as I can wrap my head around this, my guess is that sorted() is trying to recursively sort the sublists, maybe? What is really going on here?
Python version is 2.6 - the one in the debian stable repos.
[snip] here:
def expression_tostring(expr):
    expr_str=len(expr)*'('+'x '
    for i in expr :
        if i[1]=='*':
            n=i[0](1)
        else:
            n=i[0](0)
        expr_str+=i[1]+' '+str(n)+') '
    return expr_str
def evald_expression_tostring(expr,x):
    exprstr=expression_tostring(expr).replace('x',str(x))
    return exprstr+ ' = ' + str(eval_expression(expr,x))
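x=i[0](x) # here i is a function, so you can't index it
x=i(x) # here i is a list, so you can't call it as a function
In both cases the value of i comes from expr, so expr contains a different type of object than you are assuming. The mutate step is where the types get mixed: [op[0](randint(left,right)),op[1]] is concatenated directly into the expression, so the function and the string become two separate top-level elements instead of one [function, string] pair.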
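Try this modification:
def gen_expression(length,left,right):
    expr=[]
    for i in range(length):
        op=choice(ops)
        expr.append([op[0], randint(left,right),op[1]])
    return expr
def eval_expression (expr,x):
    for i in expr:
        x=i[0](i[1])(x)
    return x
You had expr.append([op[0](randint(left,right)),op[1]]), which puts the return value of calling the function, not the function itself, at index 0. With this change every element is [function, number, symbol], so any other code that builds or reads an element (the mutate step and expression_tostring) has to use the same layout.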
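The effect of list concatenation on the element types can be seen in a small Python 3 sketch. It keeps the question's original [function, symbol] layout. The concrete expression, the index rind = 1 and the names new_step, broken and fixed are made up for illustration:

```python
add = lambda x: lambda y: x + y
mul = lambda x: lambda y: x * y

def eval_expression(expr, x):
    # Every element is expected to be a [function, symbol] pair.
    for func, _symbol in expr:
        x = func(x)
    return x

expr = [[add(2), '+'], [mul(3), '*'], [add(1), '+']]
rind = 1
new_step = [add(5), '+']

# Concatenating the pair directly splices its two items into the expression.
broken = expr[:rind] + new_step + expr[rind + 1:]
# Wrapping the pair in a list keeps it as a single element.
fixed = expr[:rind] + [new_step] + expr[rind + 1:]

print(len(broken), len(fixed))
print(eval_expression(fixed, 10))

try:
    eval_expression(broken, 10)
    broken_result = "no error"
except TypeError:
    broken_result = "TypeError"
print(broken_result)
```

The broken variant ends up one element longer than the original and contains a bare function and a bare string, which is the kind of mixed content both error messages in the question point to.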
