Bug in python tokenize?

Why would this simplest of code
if 1 \
   and 0:
    pass
choke on a tokenize/untokenize cycle?
import tokenize
import cStringIO
def tok_untok(src):
    f = cStringIO.StringIO(src)
    return tokenize.untokenize(tokenize.generate_tokens(f.readline))
src='''if 1 \\
and 0:
    pass
'''
print tok_untok(src)
It throws:
AssertionError:
  File "/mnt/home/anushri/untitled-1.py", line 13, in <module>
    print tok_untok(src)
  File "/mnt/home/anushri/untitled-1.py", line 6, in tok_untok
    tokenize.untokenize(tokenize.generate_tokens(f.readline))
  File "/usr/lib/python2.6/tokenize.py", line 262, in untokenize
    return ut.untokenize(iterable)
  File "/usr/lib/python2.6/tokenize.py", line 198, in untokenize
    self.add_whitespace(start)
  File "/usr/lib/python2.6/tokenize.py", line 187, in add_whitespace
    assert row <= self.prev_row
Is there a workaround that does not require modifying the src to be tokenized? (The \ seems to be the culprit.)
Another failing example is a source with no newline at the end, e.g. src='if 1:pass' fails with the same error.
Workaround:
Using untokenize in a different way seems to work:
def tok_untok(src):
    f = cStringIO.StringIO(src)
    tokens = [ t[:2] for t in tokenize.generate_tokens(f.readline)]
    return tokenize.untokenize(tokens)
That is, pass back only t[:2] rather than the whole token tuple, even though the Python docs say the extra elements are ignored:
Converts tokens back into Python source code. The iterable must return sequences with at least two elements, the token type and the token string. Any additional sequence elements are ignored.

Yes, it's a known bug and there is interest in a cleaner patch than the one attached to that issue. Perfect time to contribute to a better Python ;)
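For readers on Python 3, here is a minimal sketch of the same two-element workaround from the question. It assumes io.StringIO in place of cStringIO; token_pairs and tok_untok are illustrative names, not part of the tokenize module. With two-element tokens the output spacing can differ from the input, so the check compares token types and strings rather than the raw text.

```python
import io
import tokenize


def token_pairs(src):
    """Return the (type, string) pair of every token in src."""
    readline = io.StringIO(src).readline
    return [(tok.type, tok.string) for tok in tokenize.generate_tokens(readline)]


def tok_untok(src):
    """Round-trip src through tokenize/untokenize using 2-tuples only."""
    return tokenize.untokenize(token_pairs(src))


# Same source as in the question: a backslash continuation inside the if.
SRC = "if 1 \\\nand 0:\n    pass\n"
result = tok_untok(SRC)

# The text may be spaced differently, but it tokenizes back to the same tokens.
same_tokens = token_pairs(result) == token_pairs(SRC)
```

Dropping the position information is what sidesteps the row/column bookkeeping in add_whitespace, at the cost of losing the original layout.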

Related

Override a function in nltk - Error in ContextIndex class

I am using the text.similar('example') function from the nltk.Text class, which prints the words that are similar to a given word based on the corpus.
However, I want to store those words in a list, and the function itself returns None.
# text is an instance of nltk.Text
simList = text.similar("physics")
>>> a = text.similar("physics")
the and a in science this which it that energy his of but chemistry is
space mathematics theory as mechanics
>>> a
# a contains no value.
Should I modify the source function itself? I don't think that is good practice. How can I override the function so that it returns the value?
Edit - Referring to this thread, I tried using the ContextIndex class, but I am getting the following error.
  File "test.py", line 39, in <module>
    text = nltk.text.ContextIndex(word.lower() for word in words)
  File "/home/kenden/den/codes/nlpenv/local/lib/python2.7/site-packages/nltk/text.py", line 56, in __init__
    for i, w in enumerate(tokens))
  File "/home/kenden/den/codes/nlpenv/local/lib/python2.7/site-packages/nltk/probability.py", line 1752, in __init__
    for (cond, sample) in cond_samples:
  File "/home/kenden/den/codes/nlpenv/local/lib/python2.7/site-packages/nltk/text.py", line 56, in <genexpr>
    for i, w in enumerate(tokens))
  File "/home/kenden/den/codes/nlpenv/local/lib/python2.7/site-packages/nltk/text.py", line 43, in _default_context
    right = (tokens[i+1].lower() if i != len(tokens) - 1 else '*END*')
TypeError: object of type 'generator' has no len()
This is my line 39 of test.py
text = nltk.text.ContextIndex(word.lower() for word in words)
How can I solve this?

You are getting the error because the ContextIndex constructor is trying to take the len() of your token list (the argument tokens). But you actually pass it as a generator, hence the error. To avoid the problem, just pass a true list, e.g.:
text = nltk.text.ContextIndex(list(word.lower() for word in words))
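The underlying behaviour can be reproduced without nltk. This is a small sketch with made-up words; lowercase_tokens is a hypothetical helper standing in for the generator expression in the question.

```python
def lowercase_tokens(words):
    """Return a generator expression, like the one passed to ContextIndex."""
    return (word.lower() for word in words)


words = ["Physics", "Chemistry", "Energy"]

# A generator has no length, so len() raises TypeError.
try:
    len(lowercase_tokens(words))
    generator_has_len = True
except TypeError:
    generator_has_len = False

# Materialising it as a list makes len() and indexing available.
tokens = list(lowercase_tokens(words))
last_index = len(tokens) - 1
```

Any code that calls len() or indexes its argument needs a real sequence, which is why wrapping the generator in list() is enough here.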

Why am I getting an IndexError in Python 3 when indexing a string and not slicing?

I'm new to programming, and experimenting with Python 3. I've found a few topics which deal with IndexError but none that seem to help with this specific circumstance.
I've written a function which opens a text file, reads it one line at a time, and slices the line up into individual strings which are each appended to a particular list (one list per 'column' in the record line). Most of the slices are multiple characters [x:y] but some are single characters [x].
I'm getting an IndexError: string index out of range message, when as far as I can tell, it isn't. This is the function:
def read_recipe_file():
    recipe_id = []
    recipe_book = []
    recipe_name = []
    recipe_page = []
    ingred_1 = []
    ingred_1_qty = []
    ingred_2 = []
    ingred_2_qty = []
    ingred_3 = []
    ingred_3_qty = []
    f = open('recipe-file.txt', 'r')  # open the file
    for line in f:
        # slice out each component of the record line and store it in the appropriate list
        recipe_id.append(line[0:3])
        recipe_name.append(line[3:23])
        recipe_book.append(line[23:43])
        recipe_page.append(line[43:46])
        ingred_1.append(line[46])
        ingred_1_qty.append(line[47:50])
        ingred_2.append(line[50])
        ingred_2_qty.append(line[51:54])
        ingred_3.append(line[54])
        ingred_3_qty.append(line[55:])
    f.close()
    return recipe_id, recipe_name, recipe_book, recipe_page, ingred_1, ingred_1_qty, ingred_2, ingred_2_qty, ingred_3, \
        ingred_3_qty
This is the traceback:
Traceback (most recent call last):
  File "recipe-test.py", line 84, in <module>
    recipe_id, recipe_book, recipe_name, recipe_page, ingred_1, ingred_1_qty, ingred_2, ingred_2_qty, ingred_3, ingred_3_qty = read_recipe_file()
  File "recipe-test.py", line 27, in read_recipe_file
    ingred_1.append(line[46])
The code which calls the function in question is:
print('To show list of recipes: 1')
print('To add a recipe: 2')
user_choice = input()
recipe_id, recipe_book, recipe_name, recipe_page, ingred_1, ingred_1_qty, ingred_2, ingred_2_qty, \
    ingred_3, ingred_3_qty = read_recipe_file()
if int(user_choice) == 1:
    print_recipe_table(recipe_id, recipe_book, recipe_name, recipe_page, ingred_1, ingred_1_qty,
                       ingred_2, ingred_2_qty, ingred_3, ingred_3_qty)
elif int(user_choice) == 2:
    # code to add recipe
The failing line is this:
ingred_1.append(line[46])
There are more than 46 characters in each line of the text file I am trying to read, so I don't understand why I'm getting an out of bounds error (a sample line is below). If I change the code to this:
ingred_1.append(line[46:])
to read a slice, rather than a specific character, the line executes correctly, and the program fails on this line instead:
ingred_2.append(line[50])
This leads me to think it is somehow related to appending a single character from the string, rather than a slice of multiple characters.
Here is a sample line from the text file I am reading:
001Cheese on Toast Meals For Two 012120038005002
I should probably add that I'm well aware this isn't great code overall - there are lots of ways I could generally improve the program, but as far as I can tell the code should actually work.

This will happen if some of the lines in the file are empty or at least short. A stray newline at the end of the file is a common cause, since that comes up as an extra blank line. The best way to debug a case like this is to catch the exception, and investigate the particular line that fails (which almost certainly won't be the sample line you reproduced):
try:
    ingred_1.append(line[46])
except IndexError:
    print(line)
    print(len(line))
Catching this exception is also usually the right way to deal with the error: you've detected a pathological case, and now you can consider what to do. You might for example:
continue, which silently skips processing that line;
log something and then continue; or
bail out by raising a new, more topical exception, e.g. raise ValueError("Line too short").
Printing something relevant, with or without continuing, is almost always a good idea if this represents a problem with the input file that warrants fixing. Continuing silently is a good option if it is something relatively trivial that you know can't cause flow-on errors in the rest of your processing. You may want to differentiate between the "too short" and "completely empty" cases by detecting the "completely empty" case early, for example at the top of your loop. A blank line read from a file still contains its newline character, so strip it before testing:
if not line.strip():
    # Skip blank lines
    continue
Then handle the error for the other case appropriately.
The reason changing it to a slice works is because string slices never fail. If both indexes in the slice are outside the string (in the same direction), you will get an empty string - eg:
>>> 'abc'[4]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: string index out of range
>>> 'abc'[4:]
''
>>> 'abc'[4:7]
''
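Putting the suggestions above together, here is a rough sketch of a per-line parser that skips blank lines and raises a clearer error for short ones. The column positions come from the question; parse_record, MIN_RECORD_LEN and the padded sample line are assumptions made for illustration.

```python
# Assumption: 55 fixed columns plus at least one character of the last quantity.
MIN_RECORD_LEN = 56


def parse_record(line):
    """Slice one fixed-width record into fields; return None for blank lines."""
    record = line.rstrip("\r\n")
    if not record.strip():
        return None  # blank line, e.g. a stray newline at the end of the file
    if len(record) < MIN_RECORD_LEN:
        raise ValueError("Line too short (%d chars): %r" % (len(record), record))
    return {
        "recipe_id": record[0:3],
        "recipe_name": record[3:23].strip(),
        "recipe_book": record[23:43].strip(),
        "recipe_page": record[43:46],
        "ingred_1": record[46],
        "ingred_1_qty": record[47:50],
        "ingred_2": record[50],
        "ingred_2_qty": record[51:54],
        "ingred_3": record[54],
        "ingred_3_qty": record[55:],
    }


# Hypothetical sample, padded to the column widths used by the slices.
sample = "001" + "Cheese on Toast".ljust(20) + "Meals For Two".ljust(20) + "012120038005002\n"
fields = parse_record(sample)
```

Checking the length once up front means the single-character indexes such as record[46] can no longer raise IndexError further down.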

Your code fails on line[46] because line contains fewer than 47 characters. The slice operation line[46:] still works because an out-of-range string slice returns an empty string.
You can verify that the line is too short by replacing
ingred_1.append(line[46])
with
try:
    ingred_1.append(line[46])
except IndexError:
    print('line = "%s", length = %d' % (line, len(line)))

Python: ipaddress.AddressValueError: At least 3 parts expected

An attribute of ipaddress.IPv4Network can be used to check if any IP address is reserved.
In IPython:
In [52]: IPv4Address(u'169.254.255.1').is_private
Out[52]: False
Yet if I try the exact same thing in a function:
import ipaddress
def isPrivateIp(ip):
    unicoded = unicode(ip)
    if ipaddress.IPv4Network(unicoded).is_private or ipaddress.IPv6Network(unicoded).is_private:
        return True
    else:
        return False
print isPrivateIp(r'169.254.255.1')
I get:
  File "isPrivateIP.py", line 13, in <module>
    print isPrivateIp(ur'169.254.255.1')
  File "isPrivateIP.py", line 7, in isPrivateIp
    if ipaddress.IPv4Network(unicoded).is_private or ipaddress.IPv6Network(unicoded).is_private:
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ipaddress.py", line 2119, in __init__
    self.network_address = IPv6Address(self._ip_int_from_string(addr[0]))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ipaddress.py", line 1584, in _ip_int_from_string
    raise AddressValueError(msg)
ipaddress.AddressValueError: At least 3 parts expected in u'169.254.255.1'
Why is this the case?
Note: In python 2, ip addresses must be passed to ipaddress functions as unicode objects, hence calling unicode() on the string input ip.

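The expected input for ipaddress.IPv6Network() is different from that of ipaddress.IPv4Network(). Because the IPv4 check evaluates to False for this address, the or goes on to evaluate ipaddress.IPv6Network(unicoded), which cannot parse an IPv4 string and raises AddressValueError. If you remove or ipaddress.IPv6Network(unicoded).is_private from your code it works fine. The ipaddress documentation describes the accepted input formats.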
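If the function needs to accept both IPv4 and IPv6 input, one option is to let the module pick the address family. This is a sketch for Python 3 using ipaddress.ip_address(); is_private_ip is an illustrative name, and on Python 2 the argument would still have to be a unicode object.

```python
import ipaddress


def is_private_ip(ip):
    """Return True if ip (an IPv4 or IPv6 address string) is in a private range."""
    # ip_address() returns an IPv4Address or IPv6Address depending on the input,
    # and raises ValueError if the string is neither.
    address = ipaddress.ip_address(ip)
    return address.is_private
```

This avoids constructing the IPv6 class from an IPv4 string in the first place, so the or fallback is not needed.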

How do i use list as variable in regexp in Python

How do I use a list variable in a regexp?
The problem is here:
re.search(re.compile(''.format('|'.join(map(re.escape, kand))), corpus.raw(fileid)))
The error is
TypeError: unsupported operand type(s) for &: 'str' and 'int'
A simple re.search works well, but I need a list as the first argument to re.search:
for fileid in corpus.fileids():
    if re.search(r'[Чч]естны[й|м|ого].труд(а|ом)', corpus.raw(fileid)):
        dict_features[fileid]['samoprezentacia'] = 1
    else:
        dict_features[fileid]['samoprezentacia'] = 0
    if re.search(re.compile('\b(?:%s)\b'.format('|'.join(map(re.escape, kand))), corpus.raw(fileid))):
        dict_features[fileid]['up'] = 1
    else:
        dict_features[fileid]['up'] = 0
return dict_features
By the way, kand is a list:
kand = [line.strip() for line in open('kand.txt', encoding="utf8")]
Printed out, kand is ['apple', 'banana', 'peach', 'plum', 'pineapple', 'kiwi']
Edit: I am using Python 3.3.2 with WinPython on Windows 7.
Full error stack:
Traceback (most recent call last):
  File "F:/Python/NLTK packages/agit_classify.py", line 59, in <module>
    print (regexp_features(agit_corpus))
  File "F:/Python/NLTK packages/agit_classify.py", line 53, in regexp_features
    if re.search(re.compile(r'\b(?:{0})\b'.format('|'.join(map(re.escape, kandidats_all))), corpus.raw(fileid))):
  File "F:\WinPython-32bit-3.3.2.0\python-3.3.2\lib\re.py", line 214, in compile
    return _compile(pattern, flags)
  File "F:\WinPython-32bit-3.3.2.0\python-3.3.2\lib\re.py", line 281, in _compile
    p = sre_compile.compile(pattern, flags)
  File "F:\WinPython-32bit-3.3.2.0\python-3.3.2\lib\sre_compile.py", line 494, in compile
    p = sre_parse.parse(p, flags)
  File "F:\WinPython-32bit-3.3.2.0\python-3.3.2\lib\sre_parse.py", line 748, in parse
    p = _parse_sub(source, pattern, 0)
  File "F:\WinPython-32bit-3.3.2.0\python-3.3.2\lib\sre_parse.py", line 360, in _parse_sub
    itemsappend(_parse(source, state))
  File "F:\WinPython-32bit-3.3.2.0\python-3.3.2\lib\sre_parse.py", line 453, in _parse
    if state.flags & SRE_FLAG_VERBOSE:
TypeError: unsupported operand type(s) for &: 'str' and 'int'

The reason you're getting the actual exception is mismatched parentheses. Let's break it up to make it clearer:
re.search(
    re.compile(
        ''.format('|'.join(map(re.escape, kand))),
        corpus.raw(fileid)))
In other words, you're passing a string, corpus.raw(fileid), as the second argument to re.compile, not as the second argument to re.search.
That means you're trying to use it as the flags value, which is supposed to be an integer. When re.compile tries to use the & operator on your string to test each flag bit, it raises a TypeError.
And if you got past this error, the re.search would itself raise a TypeError because you're only passing it one argument rather than two.
This is exactly why you shouldn't write overly-complicated expressions. They're very painful to debug. If you'd written this in separate steps, it would be obvious:
escaped_kand = map(re.escape, kand)
alternation = '|'.join(escaped_kand)
whatever_this_was_supposed_to_do = ''.format(alternation)
regexpr = re.compile(whatever_this_was_supposed_to_do, corpus.raw(fileid))
re.search(regexpr)
This would also make it obvious that half the work you're doing isn't needed in the first place.
First, re.search takes a pattern, not a compiled regexpr. If it happens to work with a compiled regexpr, that's just an accident. So, that whole part of the expression is useless. Just pass the pattern itself.
Or, if you have a good reason to compile the regexpr, as re.compile explains, the result regular expression object "can be used for matching using its match() and search() methods". So use the compiled object's search method, not the top-level re.search function.
Second, I don't know what you expected ''.format(anything) to do, but it can't possibly return anything but ''.

You're mixing old and new string formatting rules. Also, you need to use raw strings with a regex, or \b will mean backspace, not word boundary.
'\b(?:%s)\b'.format('|'.join(map(re.escape, kand)))
should be
r'\b(?:{0})\b'.format('|'.join(map(re.escape, kand)))
Furthermore, be aware that \b only works if your "words" start and end with alphanumeric characters (or _).
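Combining the points from both answers, a minimal sketch of the intended search might look like this. build_word_pattern and the sample text are made up for illustration; kand is the list shown in the question.

```python
import re


def build_word_pattern(words):
    """Compile a regex matching any of the given words as a whole word."""
    # Escape each word so regex metacharacters in the list are taken literally.
    alternation = "|".join(map(re.escape, words))
    # Raw string, so \b means word boundary rather than backspace.
    return re.compile(r"\b(?:{0})\b".format(alternation))


kand = ['apple', 'banana', 'peach', 'plum', 'pineapple', 'kiwi']
pattern = build_word_pattern(kand)

# Use the compiled object's own search method; it takes only the text.
match = pattern.search("I ate a banana today")
found = match.group() if match else None
```

Compiling once outside the loop over fileids and calling pattern.search(corpus.raw(fileid)) inside it keeps each step on its own line, which is easier to debug than the nested one-liner.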

Why is ElementTree raising a ParseError?

I have been trying to parse a file with xml.etree.ElementTree:
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import ParseError
def analyze(xml):
    it = ET.iterparse(file(xml))
    count = 0
    last = None
    try:
        for (ev, el) in it:
            count += 1
            last = el
    except ParseError:
        print("catastrophic failure")
        print("last successful: {0}".format(last))
    print('count: {0}'.format(count))
This is of course a simplified version of my code, but this is enough to break my program. I get this error with some files if I remove the try-catch block:
Traceback (most recent call last):
  File "<pyshell#22>", line 1, in <module>
    from yparse import analyze; analyze('file.xml')
  File "C:\Python27\yparse.py", line 10, in analyze
    for (ev, el) in it:
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1258, in next
    self._parser.feed(data)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1624, in feed
    self._raiseerror(v)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1488, in _raiseerror
    raise err
ParseError: reference to invalid character number: line 1, column 52459
The results are deterministic, though: if a file works, it always works. If a file fails, it always fails, and always at the same point.
The strangest thing is I'm using the trace to find out if I have any malformed XML that's breaking the parser. I then isolate the node that caused the failure. But when I create an XML file containing that node and a few of its neighbors, the parsing works!
This doesn't seem to be a size problem either. I have managed to parse much larger files with no problems.
Any ideas?

Here are some ideas:
(0) Explain "a file" and "occasionally": do you really mean it works sometimes and fails sometimes with the same file?
Do the following for each failing file:
(1) Find out what is in the file at the point that it is complaining about:
text = open("the_file.xml", "rb").read()
err_col = 52459
print repr(text[err_col-50:err_col+100]) # should include the error text
print repr(text[:50]) # show the XML declaration
(2) Throw your file at a web-based XML validation service e.g. http://www.validome.org/xml/ or http://validator.aborla.net/
and edit your question to display your findings.
Update: Here is the minimal xml file that illustrates your problem:
[badcharref.xml]
<a></a>
[Python 2.7.1 output]
>>> import xml.etree.ElementTree as ET
>>> it = ET.iterparse(file("badcharref.xml"))
>>> for ev, el in it:
... print el.tag
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\python27\lib\xml\etree\ElementTree.py", line 1258, in next
    self._parser.feed(data)
  File "C:\python27\lib\xml\etree\ElementTree.py", line 1624, in feed
    self._raiseerror(v)
  File "C:\python27\lib\xml\etree\ElementTree.py", line 1488, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: reference to invalid character number: line 1, column 3
>>>
Not all valid Unicode characters are valid in XML. See the XML 1.0 Specification.
You may wish to examine your files using regexes like r'&#([0-9]+);' and r'&#x([0-9A-Fa-f]+);', convert the matched text to an int ordinal and check against the valid list from the spec i.e. #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
... or maybe the numeric character reference is syntactically invalid, e.g. not terminated by a ';', or &#not-a-digit, etc.
Update 2 I was wrong, the number in the ElementTree error message is counting Unicode code points, not bytes. See the code below and snippets from the output from running it over the two bad files.
# coding: ascii
# Find numeric character references that refer to Unicode code points
# that are not valid in XML.
# Get byte offsets for seeking etc in undecoded file bytestreams.
# Get unicode offsets for checking against ElementTree error message,
# **IF** your input file is small enough.
BYTE_OFFSETS = True
import sys, re, codecs
fname = sys.argv[1]
print fname
if BYTE_OFFSETS:
    text = open(fname, "rb").read()
else:
    # Assumes file is encoded in UTF-8.
    text = codecs.open(fname, "rb", "utf8").read()
rx = re.compile("&#([0-9]+);|&#x([0-9a-fA-F]+);")
endpos = len(text)
pos = 0
while pos < endpos:
    m = rx.search(text, pos)
    if not m: break
    mstart, mend = m.span()
    target = m.group(1)
    if target:
        num = int(target)
    else:
        num = int(m.group(2), 16)
    # #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    if not(num in (0x9, 0xA, 0xD) or 0x20 <= num <= 0xD7FF
           or 0xE000 <= num <= 0xFFFD or 0x10000 <= num <= 0x10FFFF):
        print mstart, m.group()
    pos = mend
Output:
comments.xml
6615405 
10205764 
10213901 
10213936 
10214123 
13292514 
...
155656543 
155656564 
157344876 
157722583 
posts.xml
7607143 
12982273 
12982282 
12982292 
12982302 
12982310 
16085949 
16085955 
...
36303479 
36303494  <<=== whoops
38942863 
...
785292911 
801282472 
848911592
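For Python 3, the same check can be sketched as a small function that works on an already decoded string. The names is_valid_xml_char and find_invalid_char_refs are illustrative; the ranges are the ones quoted from the XML 1.0 specification above.

```python
import re

# Decimal (&#NNN;) or hexadecimal (&#xHHH;) numeric character references.
CHAR_REF = re.compile(r"&#([0-9]+);|&#x([0-9a-fA-F]+);")


def is_valid_xml_char(num):
    """Check a code point against the XML 1.0 Char production."""
    return (num in (0x9, 0xA, 0xD) or 0x20 <= num <= 0xD7FF
            or 0xE000 <= num <= 0xFFFD or 0x10000 <= num <= 0x10FFFF)


def find_invalid_char_refs(text):
    """Return (offset, reference) for every reference to an invalid code point."""
    bad = []
    for m in CHAR_REF.finditer(text):
        decimal, hexadecimal = m.groups()
        num = int(decimal) if decimal is not None else int(hexadecimal, 16)
        if not is_valid_xml_char(num):
            bad.append((m.start(), m.group()))
    return bad
```

For example, find_invalid_char_refs("<a>&#1;</a>") reports the reference at offset 3, while references such as &#65; pass. Offsets are character offsets into the string, not byte offsets into the file.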

As @John Machin suggested, the files in question do have dubious numeric entities in them, though the error messages seem to be pointing at the wrong place in the text. Perhaps the streaming nature and buffering are making it difficult to report accurate positions.
In fact, all of these entities appear in the text:
set(['', '', '', '', '', '', '
', '', '', '', '', '', '', '', '
', '', '', ' ', '', '', '', '', ''])
Most are not allowed. It looks like this parser is quite strict, so you'll need to find another that is more lenient, or pre-process the XML.

I'm not sure if this answers your question, but if you want to catch the ParseError raised by ElementTree, you would do this:
except ET.ParseError:
    print("catastrophic failure")
    print("last successful: {0}".format(last))
Source: http://effbot.org/zone/elementtree-13-intro.htm

It may also be worth noting that you can catch the error, and avoid stopping your program completely, by using the same technique you already use later in the function. Place your statement:
it = ET.iterparse(file(xml))
inside a try/except block:
try:
    it = ET.iterparse(file(xml))
except:
    print('iterparse error')
Of course, this will not fix your XML file or pre-processing technique, but could help in identifying which file (if you're parsing lots) is causing your error.
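As a rough Python 3 sketch of that idea, the loop over the iterator can be wrapped as well, since that is where the traceback in the question was raised. count_elements is a hypothetical helper, and io.BytesIO stands in for a real file here.

```python
import io
import xml.etree.ElementTree as ET


def count_elements(source):
    """Count parsed elements; return (count, position of the parse error or None)."""
    count = 0
    try:
        for _event, _elem in ET.iterparse(source):
            count += 1
    except ET.ParseError as err:
        # err.position is a (line, column) tuple reported by the parser.
        return count, err.position
    return count, None


good = count_elements(io.BytesIO(b"<a><b/></a>"))
bad = count_elements(io.BytesIO(b"<a>&#1;</a>"))
```

Returning the position instead of printing makes it straightforward to log which file failed and where when many files are processed in a loop.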
