Override a function in nltk - Error in ContextIndex class - python

I am using the text.similar('example') function from the nltk.Text class, which prints the words similar to a given word based on the corpus.
However, I want to store those words in a list, but the function itself returns None.
#text is a variable of nltk.Text module
simList = text.similar("physics")
>>> a = text.similar("physics")
the and a in science this which it that energy his of but chemistry is
space mathematics theory as mechanics
>>> a
# a contains no value.
So should I modify the source function itself? I don't think that is good practice. How can I override the function so that it returns the value?
Edit - Referring to this thread, I tried using the ContextIndex class, but I am getting the following error.
  File "test.py", line 39, in <module>
    text = nltk.text.ContextIndex(word.lower() for word in words)
  File "/home/kenden/den/codes/nlpenv/local/lib/python2.7/site-packages/nltk/text.py", line 56, in __init__
    for i, w in enumerate(tokens))
  File "/home/kenden/den/codes/nlpenv/local/lib/python2.7/site-packages/nltk/probability.py", line 1752, in __init__
    for (cond, sample) in cond_samples:
  File "/home/kenden/den/codes/nlpenv/local/lib/python2.7/site-packages/nltk/text.py", line 56, in <genexpr>
    for i, w in enumerate(tokens))
  File "/home/kenden/den/codes/nlpenv/local/lib/python2.7/site-packages/nltk/text.py", line 43, in _default_context
    right = (tokens[i+1].lower() if i != len(tokens) - 1 else '*END*')
TypeError: object of type 'generator' has no len()
This is line 39 of my test.py:
text = nltk.text.ContextIndex(word.lower() for word in words)
How can I solve this?

You are getting the error because the ContextIndex constructor tries to take the len() of its tokens argument, but you are passing it a generator, which has no length. To avoid the problem, pass a real list, e.g.:
text = nltk.text.ContextIndex(list(word.lower() for word in words))
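The difference can be reproduced without NLTK. The following is a minimal sketch, assuming a made-up `words` list and a hypothetical helper `last_token` that, like `_default_context` in the traceback, calls len() on its argument and indexes into it:

```python
# Hypothetical stand-in for code that needs a real sequence (not part of NLTK):
# it calls len() and indexes into its argument.
def last_token(tokens):
    return tokens[len(tokens) - 1]

words = ["Physics", "And", "Chemistry"]  # assumed sample data

# A generator expression produces items lazily and has no length.
gen = (word.lower() for word in words)
try:
    last_token(gen)
    error = None
except TypeError as exc:
    error = str(exc)

# A list supports both len() and indexing.
tokens = [word.lower() for word in words]
result = last_token(tokens)

print(error)
print(result)
```

A list comprehension such as `[word.lower() for word in words]` builds the same list as wrapping the generator in list().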

Related

Why am I getting an IndexError in Python 3 when indexing a string and not slicing?

I'm new to programming, and experimenting with Python 3. I've found a few topics which deal with IndexError but none that seem to help with this specific circumstance.
I've written a function which opens a text file, reads it one line at a time, and slices the line up into individual strings which are each appended to a particular list (one list per 'column' in the record line). Most of the slices are multiple characters [x:y] but some are single characters [x].
I'm getting an IndexError: string index out of range message, when as far as I can tell, it isn't. This is the function:
def read_recipe_file():
    recipe_id = []
    recipe_book = []
    recipe_name = []
    recipe_page = []
    ingred_1 = []
    ingred_1_qty = []
    ingred_2 = []
    ingred_2_qty = []
    ingred_3 = []
    ingred_3_qty = []
    f = open('recipe-file.txt', 'r')  # open the file
    for line in f:
        # slice out each component of the record line and store it in the appropriate list
        recipe_id.append(line[0:3])
        recipe_name.append(line[3:23])
        recipe_book.append(line[23:43])
        recipe_page.append(line[43:46])
        ingred_1.append(line[46])
        ingred_1_qty.append(line[47:50])
        ingred_2.append(line[50])
        ingred_2_qty.append(line[51:54])
        ingred_3.append(line[54])
        ingred_3_qty.append(line[55:])
    f.close()
    return recipe_id, recipe_name, recipe_book, recipe_page, ingred_1, ingred_1_qty, ingred_2, ingred_2_qty, ingred_3, \
        ingred_3_qty
This is the traceback:
Traceback (most recent call last):
  File "recipe-test.py", line 84, in <module>
    recipe_id, recipe_book, recipe_name, recipe_page, ingred_1, ingred_1_qty, ingred_2, ingred_2_qty, ingred_3, ingred_3_qty = read_recipe_file()
  File "recipe-test.py", line 27, in read_recipe_file
    ingred_1.append(line[46])
IndexError: string index out of range
The code which calls the function in question is:
print('To show list of recipes: 1')
print('To add a recipe: 2')
user_choice = input()
recipe_id, recipe_book, recipe_name, recipe_page, ingred_1, ingred_1_qty, ingred_2, ingred_2_qty, \
    ingred_3, ingred_3_qty = read_recipe_file()
if int(user_choice) == 1:
    print_recipe_table(recipe_id, recipe_book, recipe_name, recipe_page, ingred_1, ingred_1_qty,
                       ingred_2, ingred_2_qty, ingred_3, ingred_3_qty)
elif int(user_choice) == 2:
    # code to add recipe
The failing line is this:
ingred_1.append(line[46])
There are more than 46 characters in each line of the text file I am trying to read, so I don't understand why I'm getting an out-of-bounds error (a sample line is below). If I change the code to this:
ingred_1.append(line[46:])
to read a slice, rather than a specific character, the line executes correctly, and the program fails on this line instead:
ingred_2.append(line[50])
This leads me to think it is somehow related to appending a single character from the string, rather than a slice of multiple characters.
Here is a sample line from the text file I am reading:
001Cheese on Toast Meals For Two 012120038005002
I should probably add that I'm well aware this isn't great code overall - there are lots of ways I could generally improve the program, but as far as I can tell the code should actually work.
This will happen if some of the lines in the file are empty or at least short. A stray newline at the end of the file is a common cause, since that comes up as an extra blank line. The best way to debug a case like this is to catch the exception, and investigate the particular line that fails (which almost certainly won't be the sample line you reproduced):
try:
    ingred_1.append(line[46])
except IndexError:
    print(line)
    print(len(line))
Catching this exception is also usually the right way to deal with the error: you've detected a pathological case, and now you can consider what to do. You might, for example:
Use continue, which silently skips processing that line.
Log something and then continue.
Bail out by raising a new, more topical exception, e.g. raise ValueError("Line too short").
Printing something relevant, with or without continuing, is almost always a good idea if this represents a problem with the input file that warrants fixing. Continuing silently is a good option if the problem is relatively trivial and you know it can't cause flow-on errors in the rest of your processing. You may want to distinguish the "too short" case from the "completely empty" case by detecting empty lines early, at the top of your loop. Lines read from a file keep their trailing newline, so test the stripped line:
if not line.strip():
    # Skip blank lines
    continue
Then handle the error for the other case appropriately.
Changing it to a slice works because string slices never fail. If both indexes in the slice are outside the string (in the same direction), you get an empty string, e.g.:
>>> 'abc'[4]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range
>>> 'abc'[4:]
''
>>> 'abc'[4:7]
''
Your code fails on line[46] because line contains fewer than 47 characters. The slice operation line[46:] still works because an out-of-range string slice returns an empty string.
You can verify that the line is too short by replacing
ingred_1.append(line[46])
with
try:
    ingred_1.append(line[46])
except IndexError:
    print('line = "%s", length = %d' % (line, len(line)))
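Putting the advice from both answers together, the length check can be done once per line before any indexing. The sketch below is only an illustration under assumptions: the helper names `parse_record` and `read_records` are made up, the field offsets are copied from the question's slices, and the sample record is rebuilt with padding because the column spacing of the original sample line is not preserved here.

```python
# Highest single-character index used by the question's code is line[54],
# so a valid record needs at least 55 characters.
MIN_LENGTH = 55

def parse_record(line):
    """Split one fixed-width record into fields, or raise ValueError."""
    line = line.rstrip("\n")
    if len(line) < MIN_LENGTH:
        raise ValueError("Line too short (%d characters): %r" % (len(line), line))
    return {
        "recipe_id": line[0:3],
        "recipe_name": line[3:23].strip(),
        "recipe_book": line[23:43].strip(),
        "recipe_page": line[43:46],
        "ingred_1": line[46],
        "ingred_1_qty": line[47:50],
        "ingred_2": line[50],
        "ingred_2_qty": line[51:54],
        "ingred_3": line[54],
        "ingred_3_qty": line[55:],
    }

def read_records(lines):
    """Parse every non-blank line; blank lines are skipped silently."""
    records = []
    for line in lines:
        if not line.strip():
            continue  # e.g. a stray newline at the end of the file
        records.append(parse_record(line))
    return records

# Assumed sample record, padded to the column widths used in the question.
sample = ("001" + "Cheese on Toast".ljust(20) + "Meals For Two".ljust(20)
          + "012" + "1" + "200" + "3" + "800" + "5" + "002")

records = read_records([sample + "\n", "\n"])
print(records[0]["recipe_name"], records[0]["ingred_1"])
```

Raising ValueError for short lines is one of the options listed above; logging the line and continuing would be an equally reasonable choice if short lines are expected.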

Python NLP: TypeError: not all arguments converted during string formatting

I tried the code from "Natural Language Processing with Python", but a TypeError occurred.
import nltk
from nltk.corpus import brown
suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist.inc(word[-1:])
    suffix_fdist.inc(word[-2:])
    suffix_fdist.inc(word[-3:])
common_suffixes = suffix_fdist.items()[:100]
def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
    return features
pos_features('people')
The error is below:
Traceback (most recent call last):
  File "/home/wanglan/javadevelop/TestPython/src/FirstModule.py", line 323, in <module>
    pos_features('people')
  File "/home/wanglan/javadevelop/TestPython/src/FirstModule.py", line 321, in pos_features
    features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
TypeError: not all arguments converted during string formatting
Could anyone help me find out where I went wrong?
suffix is a tuple, because .items() returns (key,value) tuples. When you use %, if the right hand side is a tuple, the values will be unpacked and substituted for each % format in order. The error you get is complaining that the tuple has more entries than % formats.
You probably want just the key (the actual suffix), in which case you should use suffix[0], or .keys() to only retrieve the dictionary keys.
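The tuple unpacking behaviour of % can be seen in isolation. This is a small sketch with made-up (suffix, count) pairs standing in for what .items() returns; the counts are invented sample data, and `pos_features` takes the suffix list as a parameter here only to keep the example self-contained.

```python
# Assumed sample data shaped like FreqDist.items(): (suffix, count) pairs.
pairs = [("e", 3), ("le", 2), ("ing", 1)]

# Formatting with a 2-tuple against a single %s fails, because % unpacks
# the tuple and finds more values than format specifiers.
try:
    'endswith(%s)' % pairs[0]
    message = None
except TypeError as exc:
    message = str(exc)

def pos_features(word, suffixes):
    features = {}
    for suffix in suffixes:
        features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
    return features

# Keep only the key of each pair, i.e. the suffix string itself.
common_suffixes = [suffix for suffix, count in pairs]
features = pos_features('people', common_suffixes)

print(message)
print(features)
```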

TypeError: 'str' object is not callable

I am stuck with these 2 errors in Python 3.3.2:
import os
path="D:\\Data\\MDF Testing\\MDF 4 -Bangalore\\Bangalore Testing"
os.chdir(path)
for file in os.listdir("."):
    if file.endswith(".doc"):
        print('FileName is ', file)
def testcasenames(file):
    nlines = 0
    lookup="Test procedures"
    procnames=[]
    temp=[]
    '''Open a doc file and try to get the names of the various test procedures:'''
    f = open(file, 'r')
    for line in f:
        val=int(nlines)+1
        if (lookup in line):
            val1=int(nlines)
        elif(line(int(val))!=" " and line(int(val1))==lookup):
            temp=line.split('.')
            procnames.append(temp[1])
        else:
            continue
    return procnames
filename="MDF_Bng_Test.doc"
testcasenames(filename)
Traceback (most recent call last):
  File "D:/Data/Python files/MS_Word_Python.py", line 34, in <module>
    testcasenames(filename)
  File "D:/Data/Python files/MS_Word_Python.py", line 25, in testcasenames
    elif(line(val)!=" " and line(val1)==lookup):
TypeError: 'str' object is not callable
The idea is to only get the test procedure names after I get the Section "Test procedures" while looping in the Test document file (MDF_Bng_Test.doc) and after that I copy all the test procedure names (T_Proc_2.1,S_Proc_2.2...)coming under it.
Ex:
1.1.1 Test objectives
1.Obj 1.1
2.Obj 1.2
3.Obj 1.3
4.Obj 1.4
**2.1.1 Test procedures
1.T_Proc_2.1
2.S_Proc_2.2
3.M_Proc_2.3
4.N_Proc_2.4**
3.1.1 Test References
1.Refer_3.1
2.Refer_3.2
3.Refer_3.3
When you use () with line, Python treats line as a function and tries to call it, but line is a string. To index it, use [] notation:
line[int(val)]!=" " and line[int(val1)]==lookup
The problem is in this line:
elif(line(int(val))!=" " and line(int(val1))==lookup):
To index a string, Python uses square brackets ([]), like this:
elif(line[int(val)]!=" " and line[int(val1)]==lookup):
Another suggestion: parentheses around if/elif conditions are optional in Python, and the code normally looks better without them:
elif line[int(val)]!=" " and line[int(val1)]==lookup:
Hope this helps!
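For illustration, here is the difference between calling and indexing a string, using one of the sample lines from the question as assumed input:

```python
line = "1.T_Proc_2.1"  # sample line taken from the question's example

# Parentheses mean "call this object"; a str is not callable.
try:
    line(0)
    message = None
except TypeError as exc:
    message = str(exc)

# Square brackets index into the string.
first_char = line[0]

# Splitting only on the first '.' keeps the rest of the name intact.
name = line.split('.', 1)[1]

print(message)
print(first_char, name)
```

Note that the question's own line.split('.') followed by temp[1] would return only the part up to the next dot, so limiting the split to one occurrence may be closer to what was intended.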

How do I use a list as a variable in a regexp in Python

How do I use a list variable in a regexp?
The problem is here:
re.search(re.compile(''.format('|'.join(map(re.escape, kand))), corpus.raw(fileid)))
The error is:
TypeError: unsupported operand type(s) for &: 'str' and 'int'
A simple re.search works well, but I need a list as the first argument to re.search:
for fileid in corpus.fileids():
    if re.search(r'[Чч]естны[й|м|ого].труд(а|ом)', corpus.raw(fileid)):
        dict_features[fileid]['samoprezentacia'] = 1
    else:
        dict_features[fileid]['samoprezentacia'] = 0
    if re.search(re.compile('\b(?:%s)\b'.format('|'.join(map(re.escape, kand))), corpus.raw(fileid))):
        dict_features[fileid]['up'] = 1
    else:
        dict_features[fileid]['up'] = 0
return dict_features
By the way, kand is a list:
kand = [line.strip() for line in open('kand.txt', encoding="utf8")]
When printed, kand is ['apple', 'banana', 'peach', 'plum', 'pineapple', 'kiwi']
Edit: I am using Python 3.3.2 with WinPython on Windows 7.
Full error stack:
Traceback (most recent call last):
  File "F:/Python/NLTK packages/agit_classify.py", line 59, in <module>
    print (regexp_features(agit_corpus))
  File "F:/Python/NLTK packages/agit_classify.py", line 53, in regexp_features
    if re.search(re.compile(r'\b(?:{0})\b'.format('|'.join(map(re.escape, kandidats_all))), corpus.raw(fileid))):
  File "F:\WinPython-32bit-3.3.2.0\python-3.3.2\lib\re.py", line 214, in compile
    return _compile(pattern, flags)
  File "F:\WinPython-32bit-3.3.2.0\python-3.3.2\lib\re.py", line 281, in _compile
    p = sre_compile.compile(pattern, flags)
  File "F:\WinPython-32bit-3.3.2.0\python-3.3.2\lib\sre_compile.py", line 494, in compile
    p = sre_parse.parse(p, flags)
  File "F:\WinPython-32bit-3.3.2.0\python-3.3.2\lib\sre_parse.py", line 748, in parse
    p = _parse_sub(source, pattern, 0)
  File "F:\WinPython-32bit-3.3.2.0\python-3.3.2\lib\sre_parse.py", line 360, in _parse_sub
    itemsappend(_parse(source, state))
  File "F:\WinPython-32bit-3.3.2.0\python-3.3.2\lib\sre_parse.py", line 453, in _parse
    if state.flags & SRE_FLAG_VERBOSE:
TypeError: unsupported operand type(s) for &: 'str' and 'int'
The reason you're getting the actual exception is mismatched parentheses. Let's break it up to make it clearer:
re.search(
    re.compile(
        ''.format('|'.join(map(re.escape, kand))),
        corpus.raw(fileid)))
In other words, you're passing a string, corpus.raw(fileid), as the second argument to re.compile, not as the second argument to re.search.
That means you're trying to use it as the flags value, which is supposed to be an integer. When re.compile tries to use the & operator on your string to test each flag bit, it raises a TypeError.
And if you got past this error, the re.search would itself raise a TypeError because you're only passing it one argument rather than two.
This is exactly why you shouldn't write overly-complicated expressions. They're very painful to debug. If you'd written this in separate steps, it would be obvious:
escaped_kand = map(re.escape, kand)
alternation = '|'.join(escaped_kand)
whatever_this_was_supposed_to_do = ''.format(alternation)
regexpr = re.compile(whatever_this_was_supposed_to_do, corpus.raw(fileid))
re.search(regexpr)
This would also make it obvious that half the work you're doing isn't needed in the first place.
First, re.search takes a pattern, not a compiled regexpr. If it happens to work with a compiled regexpr, that's just an accident. So, that whole part of the expression is useless. Just pass the pattern itself.
Or, if you have a good reason to compile the regexpr, as re.compile explains, the result regular expression object "can be used for matching using its match() and search() methods". So use the compiled object's search method, not the top-level re.search function.
Second, I don't know what you expected ''.format(anything) to do, but it can't possibly return anything but ''.
You're mixing old and new string formatting rules. Also, you need to use raw strings with a regex, or \b will mean backspace, not word boundary.
'\b(?:%s)\b'.format('|'.join(map(re.escape, kand)))
should be
r'\b(?:{0})\b'.format('|'.join(map(re.escape, kand)))
Furthermore, be aware that \b only works if your "words" start and end with alphanumeric characters (or _).
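Combining the points from both answers, a working version might look like the sketch below. The `kand` list is the one printed in the question; the helper name `mentions_any` and the test sentences are assumptions added for illustration.

```python
import re

kand = ['apple', 'banana', 'peach', 'plum', 'pineapple', 'kiwi']

# Escape each word, join the words into one alternation, and wrap it in word
# boundaries. The raw string keeps \b as a word boundary, not a backspace.
pattern = re.compile(r'\b(?:{0})\b'.format('|'.join(map(re.escape, kand))))

def mentions_any(text):
    """Return True if any word from kand occurs in text as a whole word."""
    # Use the compiled object's own search method.
    return pattern.search(text) is not None

print(mentions_any("I ate a plum today"))
print(mentions_any("plums are nice"))
```

In the question's loop, the text argument would be corpus.raw(fileid), and compiling the pattern once before the loop avoids rebuilding it for every file.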

Bug in python tokenize?

Why would this
if 1 \
   and 0:
    pass
simplest of code choke on a tokenize/untokenize cycle?
import tokenize
import cStringIO
def tok_untok(src):
    f = cStringIO.StringIO(src)
    return tokenize.untokenize(tokenize.generate_tokens(f.readline))
src='''if 1 \\
   and 0:
    pass
'''
print tok_untok(src)
It throws:
AssertionError:
  File "/mnt/home/anushri/untitled-1.py", line 13, in <module>
    print tok_untok(src)
  File "/mnt/home/anushri/untitled-1.py", line 6, in tok_untok
    tokenize.untokenize(tokenize.generate_tokens(f.readline))
  File "/usr/lib/python2.6/tokenize.py", line 262, in untokenize
    return ut.untokenize(iterable)
  File "/usr/lib/python2.6/tokenize.py", line 198, in untokenize
    self.add_whitespace(start)
  File "/usr/lib/python2.6/tokenize.py", line 187, in add_whitespace
    assert row <= self.prev_row
Is there a workaround that does not require modifying the src to be tokenized? (It seems \ is the culprit.)
Another example where it fails is when there is no newline at the end, e.g. src='if 1:pass' fails with the same error.
Workaround:
Using untokenize in a different way seems to work:
def tok_untok(src):
    f = cStringIO.StringIO(src)
    tokens = [ t[:2] for t in tokenize.generate_tokens(f.readline)]
    return tokenize.untokenize(tokens)
i.e. do not pass back the whole token tuple but only t[:2], even though the Python docs say extra elements are ignored:
Converts tokens back into Python source code. The iterable must return sequences with at least two elements, the token type and the token string. Any additional sequence elements are ignored.
Yes, it's a known bug and there is interest in a cleaner patch than the one attached to that issue. Perfect time to contribute to a better Python ;)
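For readers on Python 3, the two-element workaround from the question can be sketched as follows. This is a port under assumptions, not the code from the question: io.StringIO replaces cStringIO, and the helper `token_pairs` is a made-up name used only to compare token streams. With two-element tuples, untokenize does not try to reproduce the original spacing, so the output text may differ from the input while the tokens stay the same.

```python
import io
import tokenize

def token_pairs(src):
    """Return the (type, string) pair of every token in src."""
    readline = io.StringIO(src).readline
    return [(tok.type, tok.string) for tok in tokenize.generate_tokens(readline)]

def tok_untok(src):
    # Passing only (type, string) pairs makes untokenize ignore positions.
    return tokenize.untokenize(token_pairs(src))

src = 'if 1 \\\n   and 0:\n    pass\n'
round_tripped = tok_untok(src)

# The spacing may change, but the token stream should be preserved.
print(token_pairs(src) == token_pairs(round_tripped))
```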
