Upper limit of fromstring function in ElementTree - python

I'm using Python 2.4 version on a Windows 32-bit PC. I'm trying to parse through a very large XML file using the ElementTree module. I downloaded version 1.2.6 of this module from effbot.org.
I followed the below code for my purpose:
import elementtree.ElementTree as ET
input = ''' 001 Chuck 009 Brent '''
stuff = ET.fromstring(input)
lst = stuff.findall("users/user")
print len(lst)
for item in lst:
print item.attrib["x"]
item = lst[0]
ET.dump(item)
item.get("x") # get works on attributes
item.find("id").text
item.find("id").tag
for user in stuff.getiterator('user'):
print "User" , user.attrib["x"]
ET.dump(user)
If the content of input is too large, more than 10,000 lines, the fromstring function raises an error (below). Can anyone help me out in rectifying this error?
This is the error generated:
Traceback (most recent call last): File "C:\Documents and Settings\hariprar\My Documents\My files\Python Try\xml_try1.py", line 16, in -toplevel- stuff = ET.fromstring(input) File "C:\Python24\Lib\site-packages\elementtree\ElementTree.py", line 1012, in XML return api.fromstring(text) File "C:\Python24\Lib\site-packages\elementtree\ElementTree.py", line 182, in fromstring parser.feed(text) File "C:\Python24\Lib\site-packages\elementtree\ElementTree.py", line 1292, in feed self._parser.Parse(data, 0) ExpatError: not well-formed (invalid token): line 2445, column 39

Take a look at the iterparse function. It will let you parse your input incrementally rather than reading it into memory as one big chunk.
It's described here: http://effbot.org/zone/element-iterparse.htm

Related

Python lxml XPathEvalError: Error in xpath expression when parsing larger file

The following XPATH query works with the majority of the XML files I am parsing, but it is causing an XPathEvalError on a larger XML file (~200MB) that I am attempting to parse.
from lxml import etree
query = "//*[self::foo or self::bar]/test/entry"
# Working with a 1 MB file
small_tree = etree.parse("http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/courses/wsu.xml")
_ = small_tree.xpath(query)
# Failing with a 683 MB file
big_tree = etree.parse("http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/pir/psd7003.xml")
_ = big_tree.xpath(query)
Traceback (most recent call last):
File "<ipython-input-35-495ebeb2207b>", line 1, in <module>
big_tree.xpath("//*[self::foo or self::bar]/test/entry")
File "src/lxml/etree.pyx", line 2287, in lxml.etree._ElementTree.xpath
File "src/lxml/xpath.pxi", line 359, in lxml.etree.XPathDocumentEvaluator.__call__
File "src/lxml/xpath.pxi", line 227, in lxml.etree._XPathEvaluatorBase._handle_result
XPathEvalError: Error in xpath expression
Is this a bug with lxml?
For testing purposes, you can use this sample small XML file to see the query work () and this sample large XML file to see the query fail.

Python XML Parser: junk after document element

I'm learning Python at work. I've got a large XML file with data similar to this:
testData3.xml File
<r><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c></c><c></c><c>something1</c><c>something1</c></r>
<r><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c></c><c></c><c>something2</c><c>something2</c></r>
I have copied an XML parser out of one of my Python books that works in gathering the data when the data file contains only one line. As soon as I add a second line of data, the script fails when it runs.
Python script that I'm running (xmlReader.py):
from xml.dom.minidom import parse, Node
xmltree = parse('testData3.xml')
for node1 in xmltree.getElementsByTagName('c'):
for node2 in node1.childNodes:
if node2.nodeType == Node.TEXT_NODE:
print(node2.data)
I'm looking for some help on how to write the loop so that my xmlReader.py continues through the entire file instead of just one line. I get the following errors when I run this script:
Errors during execution:
xxxx#xxxx:~/xxxx/xxxx> python xmlReader.py
Traceback (most recent call last):
File "xmlReader.py", line 2, in <module>
xmltree = parse('testData3.xml')
File "/usr/lib64/python2.6/site-packages/_xmlplus/dom/minidom.py", line 1915, in parse
return expatbuilder.parse(file)
File "/usr/lib64/python2.6/site-packages/_xmlplus/dom/expatbuilder.py", line 926, in parse
result = builder.parseFile(fp)
File "/usr/lib64/python2.6/site-packages/_xmlplus/dom/expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: junk after document element: line 2, column 0
xxxx#xxxx:~/xxxx/xxxx>
The problem is that your example data is not valid XML. A valid XML document should have a single root element; this is true for a single line of the file, where <r> is the root element, but not true when you add a second line, because each line is contained within a separate <r> element, but there is no global parent element in the file.
Either construct valid XML, for example:
<root>
<r><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c></c><c></c><c>something1</c><c>something1</c></r>
<r><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c></c><c></c><c>something2</c><c>something2</c></r>
</root>
or parse the file line by line:
from xml.dom.minidom import parseString
f = open('testData3.xml'):
for line in f:
xmltree = parseString(line)
...
f.close()

ExpatError: no element found - Python script

Using OS X 10.6.8, libxml 2-2.7.8, libxslt-1.1.26, and python 2.6, I'm trying to run the tumblrRestore.py script linked here:
https://github.com/hughsaunders/Tumblr-Restore/blob/master/tumblrRestore.py
It ran successfully and restored 76 posts before crashing.
However on second run I got an ExpatError: no element found, and have not been able to run it successfully since - it always produces this same error now. Error text:
Tumblr Restore
Traceback (most recent call last):
File "tumblrRestore.py", line 264, in <module>
cli.start()
File "tumblrRestore.py", line 232, in start
bp.parse()
File "tumblrRestore.py", line 51, in parse
postelement=ElementTree.fromstring(xml_string)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/etree/ElementTree.py", line 964, in XM
return parser.close()
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/etree/ElementTree.py", line 1254, in close
self._parser.Parse("", 1) # end of data
xml.parsers.expat.ExpatError: no element found: line 1, column 0
I'm wondering whether I have the wrong or competing or outdated versions of python or lxml, though that still doesn't explain why the script ran successfully once.
Complete newbie, any advice appreciated.
Check your extract_xml_string method of BackupParser class. It definitely returns empty string, because your begin_re regular expresssion doesn't match xml header.
Try the next one:
begin_re = re.compile("<\? xml .*\?>")

Wikipedia with Python

I have this very simple python code to read xml for the wikipedia api:
import urllib
from xml.dom import minidom
usock = urllib.urlopen("http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500")
xmldoc=minidom.parse(usock)
usock.close()
print xmldoc.toxml()
But this code returns with these errors:
Traceback (most recent call last):
File "/home/user/workspace/wikipediafoundations/src/list.py", line 5, in <module><br>
xmldoc=minidom.parse(usock)<br>
File "/usr/lib/python2.6/xml/dom/minidom.py", line 1918, in parse<br>
return expatbuilder.parse(file)<br>
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 928, in parse<br>
result = builder.parseFile(file)<br>
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 207, in parseFile<br>
parser.Parse(buffer, 0)<br>
xml.parsers.expat.ExpatError: syntax error: line 1, column 62<br>
I have no clue as I just learning python. Is there a way to get an error with more detail? Does anyone know the solution? Also, please recommend a better language to do this in.
Thank You,
Venkat Rao
The URL you're requesting is an HTML representation of the XML that would be returned:
http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500
So the XML parser fails. You can see this by pasting the above in a browser. Try adding a format=xml at the end:
http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500&format=xml
as documented on the linked page:
http://en.wikipedia.org/w/api.php

Yahoo BOSS Python Library, ExpatError

I tried to install the Yahoo BOSS mashup framework, but am having trouble running the examples provided. Examples 1, 2, 5, and 6 work, but 3 & 4 give Expat errors. Here is the output from ex3.py:
gpython examples/ex3.py
examples/ex3.py:33: Warning: 'as' will become a reserved keyword in Python 2.6
Traceback (most recent call last):
File "examples/ex3.py", line 27, in <module>
digg = db.select(name="dg", udf=titlef, url="http://digg.com/rss_search?search=google+android&area=dig&type=both&section=news")
File "/usr/lib/python2.5/site-packages/yos/yql/db.py", line 214, in select
tb = create(name, data=data, url=url, keep_standards_prefix=keep_standards_prefix)
File "/usr/lib/python2.5/site-packages/yos/yql/db.py", line 201, in create
return WebTable(name, d=rest.load(url), keep_standards_prefix=keep_standards_prefix)
File "/usr/lib/python2.5/site-packages/yos/crawl/rest.py", line 38, in load
return xml2dict.fromstring(dl)
File "/usr/lib/python2.5/site-packages/yos/crawl/xml2dict.py", line 41, in fromstring
t = ET.fromstring(s)
File "/usr/lib/python2.5/xml/etree/ElementTree.py", line 963, in XML
parser.feed(text)
File "/usr/lib/python2.5/xml/etree/ElementTree.py", line 1245, in feed
self._parser.Parse(data, 0)
xml.parsers.expat.ExpatError: syntax error: line 1, column 0
It looks like both examples are failing when trying to query Digg.com. Here is the query that is constructed in ex3.py's code:
diggf = lambda r: {"title": r["title"]["value"], "diggs": int(r["diggCount"]["value"])}
digg = db.select(name="dg", udf=diggf, url="http://digg.com/rss_search?search=google+android&area=dig&type=both&section=news")
The problem is the digg search string. It should be "s=". Not "search="
I believe that must be an error in the example: it's getting a JSON result (indeed if you copy and paste that URL in your browser, you'll download a file names search.json which starts with
{"results":[{"profile_image_url":
"http://a3.twimg.com/profile_images/255524395/KEN_OMALLEY_REVISED_normal.jpg",
"created_at":"Mon, 14 Sep 2009 14:52:07 +0000","from_user":"twilightlords",
i.e. perfectly normal JSON; but then instead of parsing it with modules such as json or simplejson, it tries to parse it as XML -- and obviously this attempt fails.
I believe the fix (which probably needs to be brought to the attention of whoever maintains that code so they can incorporate it) is either to ask for XML instead of JSON output, OR to parse the resulting JSON with appropriate means instead of trying to look at it as XML (not sure how to best implement either change, as I'm not familiar with that code).

Categories

Resources