Unclosed XML Token - python

How would I handle this error in Python 2.6?
Traceback (most recent call last):
  File "./fetch_xml_collect.py", line 32, in <module>
    tree=ET.parse(response)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/etree/ElementTree.py", line 862, in parse
    tree.parse(source, parser)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/etree/ElementTree.py", line 587, in parse
    self._root = parser.close()
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/etree/ElementTree.py", line 1254, in close
    self._parser.Parse("", 1) # end of data
xml.parsers.expat.ExpatError: unclosed token: line 56, column 1
Current code being implemented:
while urlget == 1:
    try:
        response = urllib.urlopen(rep)
    except IOError:
        print("reason")
    else:
        try:
            tree = ET.parse(response)
        except IOError:
            print("XML Parse Error\n")
        else:
            root = tree.getroot()
            print root[0].text
            powerlist = tree.findall('meter/power')
            print powerlist[0].tag, powerlist[0].text
The question is: How would I handle the above error in the given code?

try:
    # Some code
    ...
except xml.parsers.expat.ExpatError, ex:
    print ex
    continue
Something like the above should work: just continue when you get that error. The loop will move on to the next iteration, or, if it was the last iteration, fall out of the loop naturally.
The XML is malformed and cannot be parsed, so skip it and go on to the next one.
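For concreteness, here is a minimal sketch of that advice applied to the loop from the question (assuming urlget, rep, and the ET import keep their meanings from the original script; the URL-open message is made up for illustration):

import urllib
import xml.parsers.expat
import xml.etree.ElementTree as ET

while urlget == 1:
    try:
        response = urllib.urlopen(rep)
    except IOError:
        print("could not open URL")
        continue
    try:
        tree = ET.parse(response)
    except xml.parsers.expat.ExpatError, ex:
        # Malformed or truncated XML: report it and move on to the next fetch.
        print ex
        continue
    root = tree.getroot()
    print root[0].text
    powerlist = tree.findall('meter/power')
    print powerlist[0].tag, powerlist[0].text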

Related

Why Does This Python Script Freeze Up?

I have a function that tries a list of regexes on some text to see if there's a match.
@timeout(1)
def get_description(data, old):
    description = None
    if old:
        for rx in rxs:
            try:
                matched = re.search(rx, data, re.S|re.M)
                if matched is not None:
                    try:
                        description = matched.groups(1)
                        if description:
                            return description
                        else:
                            continue
                    except TimeoutError as why:
                        print(why)
                        continue
                else:
                    continue
            except Exception as why:
                print(why)
                pass
I use this function in a loop and run a bunch of text files through. In one file, execution keeps stopping:
Traceback (most recent call last):
  File "extract.py", line 223, in <module>
    scrape()
  File "extract.py", line 40, in scrape
    metadata = get_metadata(f)
  File "extract.py", line 186, in get_metadata
    description = get_description(text, True)
  File "extract.py", line 64, in get_description
    matched = re.search(rx, data, re.S|re.M)
  File "C:\Users\Joseph\AppData\Local\Programs\Python\Python36\lib\re.py", line 182, in search
    return _compile(pattern, flags).search(string)
KeyboardInterrupt
It simply hangs on evaluating matched = re.search(rx, data, re.S|re.M). For many other files, when no match is found, it goes on to the next regex. With this file, it does nothing and throws no exception. Any ideas what could be causing this?
EDIT:
I'm now trying to detect timeout errors (this is more practical for me than rewriting the regexes).
The TimeoutError, borrowed from this question, is triggered, but it doesn't let the script keep running. It simply prints 'Timer expired' and stays frozen.
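For reference, a signal-based timeout decorator along the lines of the one referenced above might look like the sketch below. This is an assumption about its shape, not the exact code from that question; it only works on Unix-like systems (signal.alarm does not exist on Windows), and TimeoutError here is a custom class:

import signal
from functools import wraps

class TimeoutError(Exception):
    pass

def timeout(seconds):
    def decorator(func):
        def _handle_timeout(signum, frame):
            raise TimeoutError('Timer expired')
        @wraps(func)
        def wrapper(*args, **kwargs):
            old_handler = signal.signal(signal.SIGALRM, _handle_timeout)
            signal.alarm(seconds)  # deliver SIGALRM after `seconds` seconds
            try:
                return func(*args, **kwargs)
            finally:
                signal.alarm(0)  # cancel any pending alarm
                signal.signal(signal.SIGALRM, old_handler)
        return wrapper
    return decorator

One plausible reading of the frozen behavior: the except TimeoutError clause inside get_description catches the error and continues with the next pattern, but the one-second alarm has already fired and is never re-armed, so the next catastrophic search hangs with no timeout pending. Note also that CPython runs signal handlers only between bytecode instructions, so a search stuck deep in the C regex engine may not be interruptible at all.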

Catch error in a for loop python

I have a for loop on an avro data reader object
for i in reader:
    print i
Then I got a UnicodeDecodeError in the for statement itself, so I wanted to ignore that particular record and continue. So I did this:
try:
    for i in reader:
        print i
except:
    pass
but it does not continue further. How can I overcome this problem?
Edit: Error trace added
Traceback (most recent call last):
  File "modify.py", line 22, in <module>
    for record in reader:
  File "/usr/lib/python2.6/site-packages/avro-1.7.7-py2.6.egg/avro/datafile.py", line 362, in next
    datum = self.datum_reader.read(self.datum_decoder)
  File "/usr/lib/python2.6/site-packages/avro-1.7.7-py2.6.egg/avro/io.py", line 445, in read
    return self.read_data(self.writers_schema, self.readers_schema, decoder)
  File "/usr/lib/python2.6/site-packages/avro-1.7.7-py2.6.egg/avro/io.py", line 490, in read_data
    return self.read_record(writers_schema, readers_schema, decoder)
  File "/usr/lib/python2.6/site-packages/avro-1.7.7-py2.6.egg/avro/io.py", line 690, in read_record
    field_val = self.read_data(field.type, readers_field.type, decoder)
  File "/usr/lib/python2.6/site-packages/avro-1.7.7-py2.6.egg/avro/io.py", line 468, in read_data
    return decoder.read_utf8()
  File "/usr/lib/python2.6/site-packages/avro-1.7.7-py2.6.egg/avro/io.py", line 233, in read_utf8
    return unicode(self.read_bytes(), "utf-8")
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb4 in position 14: invalid start byte
Could it be due to the fact that the file is corrupted?
Edit2:
As per the suggestion in the answers to iterate with iterobject, I modified the code and got this error:
Traceback (most recent call last):
  File "modify.py", line 28, in <module>
    print next(iterobject)["filepath"]
  File "/usr/lib/python2.6/site-packages/avro-1.7.7-py2.6.egg/avro/datafile.py", line 362, in next
    datum = self.datum_reader.read(self.datum_decoder)
  File "/usr/lib/python2.6/site-packages/avro-1.7.7-py2.6.egg/avro/io.py", line 445, in read
    return self.read_data(self.writers_schema, self.readers_schema, decoder)
  File "/usr/lib/python2.6/site-packages/avro-1.7.7-py2.6.egg/avro/io.py", line 490, in read_data
    return self.read_record(writers_schema, readers_schema, decoder)
  File "/usr/lib/python2.6/site-packages/avro-1.7.7-py2.6.egg/avro/io.py", line 690, in read_record
    field_val = self.read_data(field.type, readers_field.type, decoder)
  File "/usr/lib/python2.6/site-packages/avro-1.7.7-py2.6.egg/avro/io.py", line 468, in read_data
    return decoder.read_utf8()
  File "/usr/lib/python2.6/site-packages/avro-1.7.7-py2.6.egg/avro/io.py", line 233, in read_utf8
    return unicode(self.read_bytes(), "utf-8")
  File "/usr/lib/python2.6/site-packages/avro-1.7.7-py2.6.egg/avro/io.py", line 226, in read_bytes
    return self.read(self.read_long())
  File "/usr/lib/python2.6/site-packages/avro-1.7.7-py2.6.egg/avro/io.py", line 184, in read_long
    b = ord(self.read(1))
TypeError: ord() expected a character, but string of length 0 found
If the error is raised by the for i in reader statement itself, then try this; it will skip an element of the iterator whenever a UnicodeDecodeError occurs.
iterobject = iter(reader)
while True:
    try:
        print(next(iterobject))
    except StopIteration:
        break
    except UnicodeDecodeError:
        pass
You need the try/except inside the loop:
for i in reader:
    try:
        print i
    except UnicodeEncodeError:
        pass
By the way, it's good practice to specify the type of error you're trying to catch (like I did with except UnicodeEncodeError:), since otherwise you risk making your code very hard to debug!
You can catch the specific error, and avoid letting unknown errors pass unnoticed.
Python 3.x:
try:
    for i in reader:
        print(i)
except UnicodeDecodeError as ue:
    print(str(ue))
Python 2.x:
try:
    for i in reader:
        print i
except UnicodeDecodeError, ue:
    print(str(ue))
By printing the error it's possible to know what happened. When you use a bare except, you catch anything (and that can include an obscure RuntimeError), and you'll never know what happened. That can be useful sometimes, but it's dangerous and generally bad practice.
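Putting the two answers together, a sketch that skips undecodable records while still reporting them (assuming reader is the avro DataFileReader from the question):

iterobject = iter(reader)
while True:
    try:
        record = next(iterobject)
    except StopIteration:
        break  # no more records
    except UnicodeDecodeError as ue:
        print(str(ue))  # report the bad record, then keep going
        continue
    print record

One caveat: after a decode error the avro decoder may be left mid-record, so subsequent reads can fail on garbage, which is consistent with the ord() error in Edit2. Skipping works cleanly only when the reader can recover at a record boundary.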

ElementTree.parse() not opening a file that open() will

I am trying to write a Python script that will open an XML file, add some elements, and then exit. I can't get ElementTree to open the file I want, although I am able to open the file with open(). Example code:
try:
    file = open(projectLocation + "\etc\config\struts-config.xml", 'r')
    print "opened file"  # this will print
    tree = ET.parse(projectLocation + "\etc\config\struts-config.xml")  # this fails
    print "opened xml"
except:
    print "could not open " + projectLocation + "\etc\config\struts-config.xml"
    sys.exit()
Updating the code to this, per the comment I received:
tree = ET.parse(projectLocation + "\etc\config\struts-config.xml")
print "opened xml"
this gives me the error:
Traceback (most recent call last):
  File "test.py", line 117, in <module>
    tree = ET.parse(projectLocation + "\etc\config\struts-config.xml")
  File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 862, in parse
    tree.parse(source, parser)
  File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 586, in parse
    parser.feed(data)
  File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 1245, in feed
    self._parser.Parse(data, 0)
xml.parsers.expat.ExpatError: mismatched tag: line 1205, column 2
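The traceback shows that open() succeeds and the failure happens while parsing the XML content (a mismatched tag at line 1205 of the file), not while opening it. A sketch that separates the two failure modes, using a raw string so the backslashes in the Windows-style path can't be misread as escape sequences (projectLocation as in the question):

import sys
import xml.parsers.expat
import xml.etree.ElementTree as ET

path = projectLocation + r"\etc\config\struts-config.xml"
try:
    tree = ET.parse(path)
except IOError:
    print "could not open " + path
    sys.exit(1)
except xml.parsers.expat.ExpatError, ex:
    # The file opened fine, but the XML itself is malformed;
    # ex reports the offending line and column.
    print "XML parse error in " + path + ": " + str(ex)
    sys.exit(1)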

Exception handling in Python and Praw

I am having trouble with the following code:
import praw
import argparse
# argument handling was here
def main():
    r = praw.Reddit(user_agent='Python Reddit Image Grabber v0.1')
    for i in range(len(args.subreddits)):
        try:
            r.get_subreddit(args.subreddits[i])  # test to see if the subreddit is valid
        except:
            print "Invalid subreddit"
        else:
            submissions = r.get_subreddit(args.subreddits[i]).get_hot(limit=100)
            print [str(x) for x in submissions]

if __name__ == '__main__':
    main()
Subreddit names are taken as arguments to the program.
When an invalid args.subreddits entry is passed to get_subreddit, it should throw an exception that is caught in the above code.
When a valid subreddit name is given as an argument, the program runs fine.
But when an invalid name is given, no exception is caught, and instead the following uncaught exception is output:
Traceback (most recent call last):
  File "./pyrig.py", line 33, in <module>
    main()
  File "./pyrig.py", line 30, in main
    print [str(x) for x in submissions]
  File "/usr/local/lib/python2.7/dist-packages/praw/__init__.py", line 434, in get_content
    page_data = self.request_json(url, params=params)
  File "/usr/local/lib/python2.7/dist-packages/praw/decorators.py", line 95, in wrapped
    return_value = function(reddit_session, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/praw/__init__.py", line 469, in request_json
    response = self._request(url, params, data)
  File "/usr/local/lib/python2.7/dist-packages/praw/__init__.py", line 342, in _request
    response = handle_redirect()
  File "/usr/local/lib/python2.7/dist-packages/praw/__init__.py", line 316, in handle_redirect
    url = _raise_redirect_exceptions(response)
  File "/usr/local/lib/python2.7/dist-packages/praw/internal.py", line 165, in _raise_redirect_exceptions
    .format(subreddit))
praw.errors.InvalidSubreddit: `soccersdsd` is not a valid subreddit
I can't tell what I am doing wrong. I have also tried rewriting the exception code as except praw.errors.InvalidSubreddit:, which also does not work.
EDIT: exception info for Praw can be found here
File "./pyrig.py", line 30, in main
print [str(x) for x in submissions]
The problem, as your traceback indicates, is that the exception doesn't occur when you call get_subreddit. In fact, it also doesn't occur when you call get_hot. The first is a lazy invocation that just creates a dummy Subreddit object but doesn't do anything with it. The second is a generator that doesn't make any requests until you actually try to iterate over it.
Thus you need to move the exception handling code around your print statement (line 30), which is where the request is actually made that results in the exception.
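A sketch of that rearrangement, keeping the rest of the question's code (args and the user agent are as in the original; the request fires inside the list comprehension):

def main():
    r = praw.Reddit(user_agent='Python Reddit Image Grabber v0.1')
    for name in args.subreddits:
        submissions = r.get_subreddit(name).get_hot(limit=100)
        try:
            # Iterating over the generator is what actually hits the API.
            print [str(x) for x in submissions]
        except praw.errors.InvalidSubreddit:
            print "Invalid subreddit"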

Why is urllib.urlopen() only working once? - Python

I'm writing a crawler to download static HTML pages using urllib.
The get_page function works for one cycle, but when I try to loop it, it doesn't open the content of the next URL I've fed in.
How do I make urllib.urlopen continuously download HTML pages?
If that is not possible, is there any other suggestion for downloading webpages within my Python code?
My code below only returns the HTML for the first website in the seed list:
import urllib
def get_page(url):
    return urllib.urlopen(url).read().decode('utf8')

seed = ['http://www.pmo.gov.sg/content/pmosite/home.html',
        'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html']

for j in seed:
    print "here"
    print get_page(j)
The same crawl "once-only" problem also occurs with urllib2:
import urllib2
def get_page(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read().decode('utf8')

seed = ['http://www.pmo.gov.sg/content/pmosite/home.html',
        'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html']

for j in seed:
    print "here"
    print get_page(j)
Without the exception handling, I'm getting an IOError with urllib:
Traceback (most recent call last):
  File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 91, in <module>
    print get_page(j)
  File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 4, in get_page
    return urllib.urlopen(url).read().decode('utf8')
  File "/usr/lib/python2.7/urllib.py", line 86, in urlopen
    return opener.open(url)
  File "/usr/lib/python2.7/urllib.py", line 207, in open
    return getattr(self, name)(url)
  File "/usr/lib/python2.7/urllib.py", line 462, in open_file
    return self.open_local_file(url)
  File "/usr/lib/python2.7/urllib.py", line 476, in open_local_file
    raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] No such file or directory: 'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html'
Without the exception handling, I'm getting a ValueError with urllib2:
Traceback (most recent call last):
  File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 95, in <module>
    print get_page(j)
  File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 7, in get_page
    response = urllib2.urlopen(req)
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 392, in open
    protocol = req.get_type()
  File "/usr/lib/python2.7/urllib2.py", line 254, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: http://www.pmo.gov.sg/content/pmosite/aboutpmo.html
ANSWERED:
The IOError and ValueError occurred because there was some sort of Unicode byte order mark (BOM); a non-breaking space was found in the second URL. Thanks for all your help and suggestions in solving the problem!
Your code is choking on .read().decode('utf8'), but you wouldn't see that, since you are just swallowing exceptions. urllib works fine "more than once":
import urllib

def get_page(url):
    return urllib.urlopen(url).read()

seeds = ['http://www.pmo.gov.sg/content/pmosite/home.html',
         'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html']

for seed in seeds:
    print 'here'
    print get_page(seed)
Both of your examples work fine for me. The only explanation I can think of for your exact errors is that the second URL string contains some sort of non-printable character (a Unicode BOM, perhaps) that got filtered out when pasting the code here. Try copying the code back from this site into your file, or retyping the entire second string from scratch.
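If retyping doesn't help, one defensive option (an assumption on top of this answer, not part of the original code) is to strip a UTF-8 BOM and non-breaking spaces from each URL before opening it:

import urllib

def clean_url(url):
    # Remove a UTF-8 byte order mark and non-breaking spaces
    # that can sneak in via copy-paste, then trim whitespace.
    return url.replace('\xef\xbb\xbf', '').replace('\xc2\xa0', '').strip()

seeds = ['http://www.pmo.gov.sg/content/pmosite/home.html',
         'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html']

for seed in seeds:
    print urllib.urlopen(clean_url(seed)).read()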
