I have been trying to parse a file with xml.etree.ElementTree:
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import ParseError

def analyze(xml):
    it = ET.iterparse(file(xml))
    count = 0
    last = None
    try:
        for (ev, el) in it:
            count += 1
            last = el
    except ParseError:
        print("catastrophic failure")
        print("last successful: {0}".format(last))
    print('count: {0}'.format(count))
This is of course a simplified version of my code, but it is enough to break my program. I get this error with some files if I remove the try/except block:
Traceback (most recent call last):
File "<pyshell#22>", line 1, in <module>
from yparse import analyze; analyze('file.xml')
File "C:\Python27\yparse.py", line 10, in analyze
for (ev, el) in it:
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1258, in next
self._parser.feed(data)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1624, in feed
self._raiseerror(v)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1488, in _raiseerror
raise err
ParseError: reference to invalid character number: line 1, column 52459
The results are deterministic, though: if a file works, it always works; if a file fails, it always fails, and always at the same point.
The strangest thing is I'm using the trace to find out if I have any malformed XML that's breaking the parser. I then isolate the node that caused the failure. But when I create an XML file containing that node and a few of its neighbors, the parsing works!
This doesn't seem to be a size problem either. I have managed to parse much larger files with no problems.
Any ideas?
Here are some ideas:
(0) Explain "a file" and "occasionally": do you really mean it works sometimes and fails sometimes with the same file?
Do the following for each failing file:
(1) Find out what is in the file at the point that it is complaining about:
text = open("the_file.xml", "rb").read()
err_col = 52459
print repr(text[err_col-50:err_col+100]) # should include the error text
print repr(text[:50]) # show the XML declaration
(2) Throw your file at a web-based XML validation service e.g. http://www.validome.org/xml/ or http://validator.aborla.net/
and edit your question to display your findings.
Update: Here is the minimal xml file that illustrates your problem:
[badcharref.xml]
<a>&#0;</a>
[Python 2.7.1 output]
>>> import xml.etree.ElementTree as ET
>>> it = ET.iterparse(file("badcharref.xml"))
>>> for ev, el in it:
...     print el.tag
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\python27\lib\xml\etree\ElementTree.py", line 1258, in next
self._parser.feed(data)
File "C:\python27\lib\xml\etree\ElementTree.py", line 1624, in feed
self._raiseerror(v)
File "C:\python27\lib\xml\etree\ElementTree.py", line 1488, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: reference to invalid character number: line 1, column 3
>>>
Not all valid Unicode characters are valid in XML. See the XML 1.0 Specification.
You may wish to examine your files using regexes like r'&#([0-9]+);' and r'&#x([0-9A-Fa-f]+);', convert the matched text to an int ordinal, and check it against the valid list from the spec, i.e. #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF].
... or maybe the numeric character reference is syntactically invalid, e.g. not terminated by a ';', &#not-a-digit, etc.
Update 2: I was wrong; the number in the ElementTree error message is counting Unicode code points, not bytes. See the code below and snippets from the output of running it over the two bad files.
# coding: ascii
# Find numeric character references that refer to Unicode code points
# that are not valid in XML.
# Get byte offsets for seeking etc in undecoded file bytestreams.
# Get unicode offsets for checking against ElementTree error message,
# **IF** your input file is small enough.
BYTE_OFFSETS = True
import sys, re, codecs
fname = sys.argv[1]
print fname
if BYTE_OFFSETS:
    text = open(fname, "rb").read()
else:
    # Assumes file is encoded in UTF-8.
    text = codecs.open(fname, "rb", "utf8").read()
rx = re.compile("&#([0-9]+);|&#x([0-9a-fA-F]+);")
endpos = len(text)
pos = 0
while pos < endpos:
    m = rx.search(text, pos)
    if not m: break
    mstart, mend = m.span()
    target = m.group(1)
    if target:
        num = int(target)
    else:
        num = int(m.group(2), 16)
    # #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    if not (num in (0x9, 0xA, 0xD) or 0x20 <= num <= 0xD7FF
            or 0xE000 <= num <= 0xFFFD or 0x10000 <= num <= 0x10FFFF):
        print mstart, m.group()
    pos = mend
Output:
comments.xml
6615405
10205764
10213901
10213936
10214123
13292514
...
155656543
155656564
157344876
157722583
posts.xml
7607143
12982273
12982282
12982292
12982302
12982310
16085949
16085955
...
36303479
36303494 <<=== whoops
38942863
...
785292911
801282472
848911592
As @John Machin suggested, the files in question do have dubious numeric entities in them, though the error messages seem to point at the wrong place in the text. Perhaps the streaming nature and buffering make it difficult to report accurate positions.
In fact, the text contains a whole set of such entities, mostly references to control characters (they render as unprintable blanks, so the listing is not reproduced here).
Most are not allowed. This parser is quite strict, so you'll need to find another that is less strict, or pre-process the XML.
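For the pre-processing route, here is a minimal sketch (my own, not from @John Machin's answer) that rewrites invalid numeric character references before parsing. It assumes the input is UTF-8; substituting U+FFFD (the Unicode replacement character) is an arbitrary choice:
import re, codecs

# Sketch: neutralize numeric character references that are invalid in XML 1.0.
CHARREF = re.compile(r"&#([0-9]+);|&#x([0-9A-Fa-f]+);")

def _fix(m):
    num = int(m.group(1)) if m.group(1) else int(m.group(2), 16)
    if (num in (0x9, 0xA, 0xD) or 0x20 <= num <= 0xD7FF
            or 0xE000 <= num <= 0xFFFD or 0x10000 <= num <= 0x10FFFF):
        return m.group()   # valid in XML 1.0: keep the reference as-is
    return u"\uFFFD"       # invalid: substitute the replacement character

def clean_xml(in_name, out_name):
    text = codecs.open(in_name, "r", "utf-8").read()
    codecs.open(out_name, "w", "utf-8").write(CHARREF.sub(_fix, text))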
I'm not sure if this answers your question, but if you want to catch the ParseError raised by ElementTree, you would do this:
except ET.ParseError:
    print("catastrophic failure")
    print("last successful: {0}".format(last))
Source: http://effbot.org/zone/elementtree-13-intro.htm
It might also be worth noting that you can catch the error and avoid completely stopping your program by reusing what you're already using later on in the function: place your statement
it = ET.iterparse(file(xml))
inside a try/except block:
try:
    it = ET.iterparse(file(xml))
except:
    print('iterparse error')
Of course, this will not fix your XML file or pre-processing technique, but could help in identifying which file (if you're parsing lots) is causing your error.
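One caveat (mine, not the answerer's): ET.iterparse parses lazily, so a malformed document often raises ParseError only while you iterate, not at the iterparse() call itself. If you are scanning many files, a sketch like this keeps the iteration inside the try as well:
import xml.etree.ElementTree as ET

def check_files(paths):
    # Sketch: report which of several files fail to parse.
    for path in paths:
        try:
            for ev, el in ET.iterparse(open(path)):
                pass
        except ET.ParseError as e:
            print "%s failed: %s" % (path, e)
        else:
            print "%s parsed OK" % path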
Related
I am trying to follow a tutorial from Coding Robin to create a Haar classifier: http://coding-robin.de/2013/07/22/train-your-own-opencv-haar-classifier.html.
I am at the part where I need to merge all the .vec files. I am trying to execute the python script given and I am getting the following error:
Traceback (most recent call last):
File "mergevec.py", line 170, in <module>
merge_vec_files(vec_directory, output_filename)
File "mergevec.py", line 133, in merge_vec_files
val = struct.unpack('<iihh', content[:12])
struct.error: unpack requires a string argument of length 12
Here is the code from the python script:
# Get the value for the first image size
prev_image_size = 0
try:
    with open(files[0], 'rb') as vecfile:
        content = ''.join(str(line) for line in vecfile.readlines())
        val = struct.unpack('<iihh', content[:12])
        prev_image_size = val[1]
except IOError as e:
    print('An IO error occurred while processing the file: {0}'.format(f))
    exception_response(e)

# Get the total number of images
total_num_images = 0
for f in files:
    try:
        with open(f, 'rb') as vecfile:
            content = ''.join(str(line) for line in vecfile.readlines())
            val = struct.unpack('<iihh', content[:12])
            num_images = val[0]
            image_size = val[1]
            if image_size != prev_image_size:
                err_msg = """The image sizes in the .vec files differ. These values must be the same. \n The image size of file {0}: {1}\n
The image size of previous files: {2}""".format(f, image_size, prev_image_size)
                sys.exit(err_msg)
            total_num_images += num_images
    except IOError as e:
        print('An IO error occurred while processing the file: {0}'.format(f))
        exception_response(e)
I tried looking through solutions, but can't find a solution that fits this specific problem. Any help will be appreciated.
Thank you!
I figured it out by going to the GitHub page for the tutorial. Apparently, I had to delete any .vec files that had a length of zero.
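For instance, a quick sketch (mine, with a placeholder directory name) to drop the zero-length .vec files before merging:
import glob, os

# Hypothetical: collect the .vec paths, then skip any that are empty.
files = [f for f in glob.glob('vec/*.vec') if os.path.getsize(f) > 0]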
Your problem is this bit:
content[:12]
The string is not guaranteed to be 12 characters long; it could be fewer. Add a length check and handle it separately, or wrap it in try/except and give the user a saner error message like "Invalid input in file ...".
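For example, a minimal sketch of the length-check option, reusing the names from the question's script (f, content, struct and sys as defined there):
# Guard the unpack against short or empty .vec files.
if len(content) < 12:
    sys.exit('Invalid input in file {0}: expected at least 12 bytes, '
             'got {1}'.format(f, len(content)))
val = struct.unpack('<iihh', content[:12])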
I am parsing dates from a measurement file (about 200k lines). The format is a date and a measurement; the date format is "2013-08-07-20-46", or in time format "%Y-%m-%d-%H-%M". Every so often a time stamp has a bad character (the data came from a serial link which had interruptions). A bad entry looks like: 201-08-11-05-15.
My parsing line to convert the time string into seconds is :
time.mktime(datetime.datetime.strptime(dt, "%Y-%m-%d-%H-%M").timetuple())
I got it online and don't fully understand how it works (but it works).
My problem is preventing the program from exiting with an error when a format mismatch happens. Is there a way to have strptime not exit, but gracefully return an error flag, in which case I would simply discard the data line and move on to the next? Yes, I could perform a pattern check with a regexp, but I was wondering if some smart mismatch handling is already built into strptime.
Update (@Anand S Kumar):
It worked for a few bad lines but then it failed.
fp = open('bmp085.dat', 'r')
for line in fp:
    [dt, t, p] = string.split(line)
    try:
        sec = time.mktime(datetime.datetime.strptime(dt, "%Y-%m-%d-%H-%M").timetuple()) - sec0
    except ValueError:
        print 'Bad data : ' + line
        continue  # if you are doing this in a loop (looping over the lines), this moves on to the next iteration
    print sec, p, t
    t_list.append(sec)
    p_list.append(p)
fp.close()
Output:
288240.0 1014.48 24.2
288540.0 1014.57 24.2
288840.0 1014.46 24.2
Bad data : �013-08-11-05-05 24.2! 1014.49
Bad data : 2013=0▒-11-05-10 �24.2 1014.57
Bad data : 201�-08-11-05-15 24.1 1014.57
Bad data : "0�#-08-1!-p5-22 24.1 1014.6
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
ValueError: too many values to unpack
>>>
Second update (@Anand S Kumar):
It crashed again.
for line in fp:
    print line
    dt, t, p = line.split(' ', 2)
    try:
        sec = time.mktime(datetime.datetime.strptime(dt, "%Y-%m-%d-%H-%M").timetuple()) - sec0
    except ValueError:
        print 'Bad data : ' + line
        continue  # move on to the next line
    print sec, p, t
Failed :
2013-08-11�06-t5 03/9 9014.y
Bad data : 2013-08-11�06-t5 03/9 9014.y
2013-08-11-06-50 (23. 1014.96
295440.0 (23. 1014.96
2013-08-11-06%55 23.9 !�1015.01
Traceback (most recent call last):
File "<stdin>", line 5, in <module>
TypeError: must be string without null bytes, not str
>>> fp.close()
>>>
You can use try..except to catch the ValueError and, if one occurs, move on to the next line. Example:
try:
    time.mktime(datetime.datetime.strptime(dt, "%Y-%m-%d-%H-%M").timetuple())
except ValueError:
    continue  # if you are doing this in a loop over the lines, this moves on to the next iteration
If you are doing something else (for example calling a function for each line), then return None or similar from the except block.
The second ValueError you are getting occurs on the line:
[dt,t,p]= string.split(line)
The issue is that a particular line splits into more than 3 elements. One fix is to use the maxsplit argument of str.split() so that it splits at most twice, giving at most 3 fields. Example:
dt,t,p = line.split(None,2)
Or, if you really want to use string.split():
[dt,t,p]= string.split(line,None,2)
Or, if you are not expecting spaces inside any of the fields, you can move the line causing the ValueError inside the try..except block and treat it as a bad line.
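To illustrate maxsplit (with a made-up line, not from the data): the extra whitespace-separated pieces stay attached to the last field instead of raising ValueError:
>>> "2013-08-11-05-15 24.1 1014.57 junk".split(None, 2)
['2013-08-11-05-15', '24.1', '1014.57 junk']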
Use try/except in a for-loop:
for dt in data:
    try:
        print time.mktime(datetime.datetime.strptime(dt, "%Y-%m-%d-%H-%M").timetuple())
    except ValueError:
        print "Wrong format!"
        continue
Output for data = ["1998-05-14-15-45","11998-05-14-15-45","2002-05-14-15-45"]:
895153500.0
Wrong format!
1021383900.0
I'm new to programming, and experimenting with Python 3. I've found a few topics which deal with IndexError but none that seem to help with this specific circumstance.
I've written a function which opens a text file, reads it one line at a time, and slices the line up into individual strings which are each appended to a particular list (one list per 'column' in the record line). Most of the slices are multiple characters [x:y] but some are single characters [x].
I'm getting an IndexError: string index out of range message when, as far as I can tell, the index isn't out of range. This is the function:
def read_recipe_file():
    recipe_id = []
    recipe_book = []
    recipe_name = []
    recipe_page = []
    ingred_1 = []
    ingred_1_qty = []
    ingred_2 = []
    ingred_2_qty = []
    ingred_3 = []
    ingred_3_qty = []
    f = open('recipe-file.txt', 'r')  # open the file
    for line in f:
        # slice out each component of the record line and store it in the appropriate list
        recipe_id.append(line[0:3])
        recipe_name.append(line[3:23])
        recipe_book.append(line[23:43])
        recipe_page.append(line[43:46])
        ingred_1.append(line[46])
        ingred_1_qty.append(line[47:50])
        ingred_2.append(line[50])
        ingred_2_qty.append(line[51:54])
        ingred_3.append(line[54])
        ingred_3_qty.append(line[55:])
    f.close()
    return recipe_id, recipe_name, recipe_book, recipe_page, ingred_1, ingred_1_qty, ingred_2, ingred_2_qty, ingred_3, \
        ingred_3_qty
This is the traceback:
Traceback (most recent call last):
File "recipe-test.py", line 84, in <module>
recipe_id, recipe_book, recipe_name, recipe_page, ingred_1, ingred_1_qty, ingred_2, ingred_2_qty, ingred_3, ingred_3_qty = read_recipe_file()
File "recipe-test.py", line 27, in read_recipe_file
    ingred_1.append(line[46])
IndexError: string index out of range
The code which calls the function in question is:
print('To show list of recipes: 1')
print('To add a recipe: 2')
user_choice = input()
recipe_id, recipe_book, recipe_name, recipe_page, ingred_1, ingred_1_qty, ingred_2, ingred_2_qty, \
    ingred_3, ingred_3_qty = read_recipe_file()
if int(user_choice) == 1:
    print_recipe_table(recipe_id, recipe_book, recipe_name, recipe_page, ingred_1, ingred_1_qty,
                       ingred_2, ingred_2_qty, ingred_3, ingred_3_qty)
elif int(user_choice) == 2:
    # code to add recipe
The failing line is this:
ingred_1.append(line[46])
There are more than 46 characters in each line of the text file I am trying to read, so I don't understand why I'm getting an out-of-bounds error (a sample line is below). If I change the code to this:
ingred_1.append(line[46:])
to read a slice, rather than a specific character, the line executes correctly, and the program fails on this line instead:
ingred_2.append(line[50])
This leads me to think it is somehow related to appending a single character from the string, rather than a slice of multiple characters.
Here is a sample line from the text file I am reading:
001Cheese on Toast     Meals For Two       012120038005002
I should probably add that I'm well aware this isn't great code overall - there are lots of ways I could generally improve the program, but as far as I can tell the code should actually work.
This will happen if some of the lines in the file are empty or at least short. A stray newline at the end of the file is a common cause, since that comes up as an extra blank line. The best way to debug a case like this is to catch the exception, and investigate the particular line that fails (which almost certainly won't be the sample line you reproduced):
try:
ingred_1.append(line[46])
except IndexError:
print(line)
print(len(line))
Catching this exception is also usually the right way to deal with the error: you've detected a pathological case, and now you can consider what to do. You might for example:
continue, which will silently skip processing that line;
log something and then continue; or
bail out by raising a new, more topical exception, e.g. raise ValueError("Line too short").
Printing something relevant, with or without continuing, is almost always a good idea if this represents a problem with the input file that warrants fixing. Continuing silently is a good option if it is something relatively trivial, that you know can't cause flow-on errors in the rest of your processing. You may want to differentiate between the "too short" and "completely empty" cases by detecting the "completely empty" case early such as by doing this at the top of your loop:
if not line:
    # Skip blank lines
    continue
And handling the error for the other case appropriately.
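Putting those pieces together, one possible sketch (the 55-character minimum is my inference from the slices in the question, where line[54] is the last single-index access):
f = open('recipe-file.txt', 'r')
for line in f:
    if not line.strip():
        continue  # completely empty (or whitespace-only) line: skip it
    if len(line) < 55:
        # too short to satisfy line[54]: flag it rather than crash later
        raise ValueError('Line too short: {!r}'.format(line))
    # ... slice out the fields as before ...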
The reason changing it to a slice works is that string slices never fail. If both indexes in the slice are outside the string (in the same direction), you will get an empty string, e.g.:
>>> 'abc'[4]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: string index out of range
>>> 'abc'[4:]
''
>>> 'abc'[4:7]
''
Your code fails on line[46] because line contains fewer than 47 characters. The slice operation line[46:] still works because an out-of-range string slice returns an empty string.
You can verify that the line is too short by replacing
ingred_1.append(line[46])
with
try:
    ingred_1.append(line[46])
except IndexError:
    print('line = "%s", length = %d' % (line, len(line)))
I need the contents of a file made by one function to be readable by other functions. The closest I've come is to import a function within another function. The following code is what I'm using. According to the tutorials I've read, Python will either open a file if it exists or create one if it doesn't. What's happening is that in "def spacer" the file "loader.py" is duplicated, with no content.
def load():  # all this is input with a couple of filters
    first = input("1st lot#: ")
    last = input("last lot#: ")
    for a in range(first, last+1):
        x = raw_input("?:")
        while x == (""):
            print " Error",
            x = raw_input("?")
        while int(x) > 35:
            print "Error",
            x = raw_input("?")
        num = x  # python thinks this is a tuple
        num = str(num)
        f = open("loader.py", "a")  # this is the file I want to share
        f.write(num)
        f.close()
    f = open("loader.py", "r")  # just shows that the file is being appended
    print f.read()
    f.close()
    print "Finished loading"

def spacer():
    count = 0
    f = open("loader.py", "r")  # this is what I thought would open the file,
                                # but it just opens a new one with the same name
    length = len(f.read())
    print type(f.read(count))
    print f.read(count)
    print f.read(count+1)
    for a in range(1, length+1):
        print f.read(count)
        vector1 = int(f.read(count))
        vector2 = int(f.read(count+1))
        if vector1 == vector2:
            space = 0
        if vector1 < vector2:
            space = vector2 - vector1
        else:
            space = (35 - vector1) + vector2
        count =+ 1
        b = open("store_space.py", "w")
        b.write(space)
        b.close()

load()
spacer()
This is what I get:
1st lot#: 1
last lot#: 1
?:2
25342423555619333523452624356232184517181933235991010111348287989469658293435253195472514148238543246547722232633834632
Finished loading    # This is the end of "def load"; it shows the file is being appended
<type 'str'>        # This is from "def spacer". I realized Python was creating another
                    # file named "loader.py" with nothing in it. You can see this in the
                    # error msgs below.
Traceback (most recent call last):
File "C:/Python27/ex1", line 56, in <module>
spacer()
File "C:/Python27/ex1", line 41, in spacer
vector1= int(f.read(count))
ValueError: invalid literal for int() with base 10: ''
I also tried importing the function within another function, but this only causes the imported function to run.
The file probably has content, but you're not reading it properly. You have:
count = 0
# ...
vector1 = int(f.read(count))
You told Python to read 0 bytes, so it returns an empty string. Then it tries to convert the empty string to an int, and that fails, as the error says, because an empty string is not a valid representation of an integer value.
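A quick interactive demonstration (any readable file will do):
>>> f = open("loader.py", "r")
>>> f.read(0)   # read(n) returns at most n bytes; n == 0 gives ''
''
>>> int('')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: ''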
Why would this
if 1 \
and 0:
    pass
simplest of code choke on a tokenize/untokenize cycle?
import tokenize
import cStringIO

def tok_untok(src):
    f = cStringIO.StringIO(src)
    return tokenize.untokenize(tokenize.generate_tokens(f.readline))

src = '''if 1 \\
and 0:
    pass
'''
print tok_untok(src)
It throws:
AssertionError:
File "/mnt/home/anushri/untitled-1.py", line 13, in <module>
print tok_untok(src)
File "/mnt/home/anushri/untitled-1.py", line 6, in tok_untok
tokenize.untokenize(tokenize.generate_tokens(f.readline))
File "/usr/lib/python2.6/tokenize.py", line 262, in untokenize
return ut.untokenize(iterable)
File "/usr/lib/python2.6/tokenize.py", line 198, in untokenize
self.add_whitespace(start)
File "/usr/lib/python2.6/tokenize.py", line 187, in add_whitespace
assert row <= self.prev_row
Is there a workaround without modifying the src to be tokenized? (It seems \ is the culprit.)
Another example where it fails: if there is no newline at the end, e.g. src='if 1:pass' fails with the same error.
Workaround:
But it seems using untokenize a different way works:
def tok_untok(src):
    f = cStringIO.StringIO(src)
    tokens = [t[:2] for t in tokenize.generate_tokens(f.readline)]
    return tokenize.untokenize(tokens)
i.e. do not pass back the whole token tuple, only t[:2], even though the Python docs say any extra sequence elements are ignored:
Converts tokens back into Python source code. The iterable must return sequences with at least two elements, the token type and the token string. Any additional sequence elements are ignored.
Yes, it's a known bug and there is interest in a cleaner patch than the one attached to that issue. Perfect time to contribute to a better Python ;)