Python XML Parser: junk after document element

I'm learning Python at work. I've got a large XML file with data similar to this:
testData3.xml File
<r><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c></c><c></c><c>something1</c><c>something1</c></r>
<r><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c></c><c></c><c>something2</c><c>something2</c></r>
I copied an XML parser from one of my Python books. It reads the data correctly when the data file contains only one line, but as soon as I add a second line of data, the script fails when it runs.
Python script that I'm running (xmlReader.py):
from xml.dom.minidom import parse, Node
xmltree = parse('testData3.xml')
for node1 in xmltree.getElementsByTagName('c'):
    for node2 in node1.childNodes:
        if node2.nodeType == Node.TEXT_NODE:
            print(node2.data)
I'm looking for some help on how to write the loop so that my xmlReader.py continues through the entire file instead of just one line. I get the following errors when I run this script:
Errors during execution:
xxxx#xxxx:~/xxxx/xxxx> python xmlReader.py
Traceback (most recent call last):
  File "xmlReader.py", line 2, in <module>
    xmltree = parse('testData3.xml')
  File "/usr/lib64/python2.6/site-packages/_xmlplus/dom/minidom.py", line 1915, in parse
    return expatbuilder.parse(file)
  File "/usr/lib64/python2.6/site-packages/_xmlplus/dom/expatbuilder.py", line 926, in parse
    result = builder.parseFile(fp)
  File "/usr/lib64/python2.6/site-packages/_xmlplus/dom/expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: junk after document element: line 2, column 0
xxxx#xxxx:~/xxxx/xxxx>

The problem is that your example data is not well-formed XML. An XML document must have exactly one root element. That holds for a single line of the file, where <r> is the root element, but not once you add a second line: each line is wrapped in its own <r> element, and no parent element encloses them all.
Either construct valid XML, for example:
<root>
<r><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c>something1</c><c></c><c></c><c>something1</c><c>something1</c></r>
<r><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c>something2</c><c></c><c></c><c>something2</c><c>something2</c></r>
</root>
or parse the file line by line:
from xml.dom.minidom import parseString

with open('testData3.xml') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines, which are not XML documents
        xmltree = parseString(line)
        ...
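If the file cannot be changed on disk, a third option is to wrap its contents in a synthetic root element in memory before parsing. The following is a minimal Python 3 sketch under that assumption. The `wrap_and_parse` helper and the shortened sample data are illustrative, not part of the original answer, and the approach assumes the file has no XML declaration or DOCTYPE of its own.

```python
from xml.dom.minidom import parseString, Node


def wrap_and_parse(fragment_text, wrapper="root"):
    """Wrap several top-level elements in one root element and parse them."""
    return parseString("<{0}>{1}</{0}>".format(wrapper, fragment_text))


# Shortened stand-in for the contents of testData3.xml:
# two top-level <r> elements, one per line.
data = (
    "<r><c>something1</c><c></c><c>something1</c></r>\n"
    "<r><c>something2</c><c></c><c>something2</c></r>\n"
)

xmltree = wrap_and_parse(data)

# Same traversal as in the question: collect the text of every <c> element.
values = [
    child.data
    for cell in xmltree.getElementsByTagName("c")
    for child in cell.childNodes
    if child.nodeType == Node.TEXT_NODE
]
print(values)
```

Empty `<c></c>` elements have no text node, so they contribute nothing to `values`.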

Related

Python lxml XPathEvalError: Error in xpath expression when parsing larger file

The following XPATH query works with the majority of the XML files I am parsing, but it is causing an XPathEvalError on a larger XML file (~200MB) that I am attempting to parse.
from lxml import etree
query = "//*[self::foo or self::bar]/test/entry"
# Working with a 1 MB file
small_tree = etree.parse("http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/courses/wsu.xml")
_ = small_tree.xpath(query)
# Failing with a 683 MB file
big_tree = etree.parse("http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/pir/psd7003.xml")
_ = big_tree.xpath(query)
Traceback (most recent call last):
  File "<ipython-input-35-495ebeb2207b>", line 1, in <module>
    big_tree.xpath("//*[self::foo or self::bar]/test/entry")
  File "src/lxml/etree.pyx", line 2287, in lxml.etree._ElementTree.xpath
  File "src/lxml/xpath.pxi", line 359, in lxml.etree.XPathDocumentEvaluator.__call__
  File "src/lxml/xpath.pxi", line 227, in lxml.etree._XPathEvaluatorBase._handle_result
XPathEvalError: Error in xpath expression
Is this a bug with lxml?
For testing purposes, you can use this sample small XML file to see the query work and this sample large XML file to see the query fail.

Python - write Xml (formatted)

I wrote this Python script to create XML content, and I would like to write the "prettified" XML to a file (50% done):
My script so far:
import xml.dom.minidom
import xml.etree.ElementTree as ET

data = ET.Element("data")
project = ET.SubElement(data, "project")
project.text = "This project text"
rawString = ET.tostring(data, "utf-8")
reparsed = xml.dom.minidom.parseString(rawString)
cleanXml = reparsed.toprettyxml(indent=" ")
# This prints the prettified XML I would like to save to a file
print cleanXml
# This part does not work; the only parameter I can pass is "data"
# But when I pass "data" as a parameter, an unformatted XML string is written to the file
tree = ET.ElementTree(cleanXml)
tree.write("config.xml")
The error I get when I pass cleanXml as the parameter:
Traceback (most recent call last):
  File "app.py", line 45, in <module>
    tree.write("config.xml")
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 817, in write
    self._root, encoding, default_namespace
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 876, in _namespaces
    iterate = elem.getiterator # cET compatibility
AttributeError: 'unicode' object has no attribute 'getiterator'
Does anybody know how I can get my prettified XML into a file?
Thanks and Greetings!

The ElementTree constructor takes a root element (and optionally a file), not a string, which is why passing cleanXml fails. To parse a string into an element, use ElementTree.fromstring.
However, that isn't what you want. Just open a file and write the string directly:
with open("config.xml", "w") as config_file:
    config_file.write(cleanXml)
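For reference, here is how the whole round trip might look on Python 3. This is a sketch rather than the asker's actual script: the temporary directory stands in for wherever config.xml should live, and the two-space indent is an arbitrary choice.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Build the same small tree as in the question.
data = ET.Element("data")
project = ET.SubElement(data, "project")
project.text = "This project text"

# Serialize to bytes, re-parse with minidom, then pretty-print to a str.
raw = ET.tostring(data, encoding="utf-8")
clean_xml = minidom.parseString(raw).toprettyxml(indent="  ")

# Hypothetical output location for this sketch; use your own path instead.
path = os.path.join(tempfile.mkdtemp(), "config.xml")

# toprettyxml() returned text, so write it as text with an explicit encoding.
with open(path, "w", encoding="utf-8") as config_file:
    config_file.write(clean_xml)

print(clean_xml)
```

The key point is unchanged from the answer above: the pretty-printed result is a string, so it is written with an ordinary file object rather than through ElementTree.write.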

Removing Elements From 300MB Xml In Python / Element Tree

I'm trying to parse a 300MB XML file with ElementTree, based on advice such as "Can Python xml ElementTree parse a very large xml file?"
from xml.etree import ElementTree as Et
for event, elem in Et.iterparse('C:\...path...\desc2015.xml'):
    if elem.tag == 'DescriptorRecord':
        for e in elem._children:
            if str(e.tag) in ['DateCreated', 'Year', 'Month', 'TreeNumber', 'HistoryNote', 'PreviousIndexing']:
                e.clear()
                elem.remove(e)
                print 'removed %s' % e
giving...
removed <Element 'HistoryNote' at 0x557cc7f0>
removed <Element 'DateCreated' at 0x557fa990>
removed <Element 'HistoryNote' at 0x55809af0>
removed <Element 'DateCreated' at 0x5580f5d0>
However, this just keeps going, the file isn't getting any smaller, and on inspection the elements are still there. I tried e.clear() and elem.remove(e) separately, with the same result. Regards
UPDATE
Error code from my first comment on @alexanderlukanin13's answer:
Traceback (most recent call last):
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd.py", line 1570, in trace_dispatch
Traceback (most recent call last):
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd.py", line 2278, in <module>
    globals = debugger.run(setup['file'], None, None)
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd.py", line 1704, in run
    pydev_imports.execfile(file, globals, locals) # execute the script
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\runfiles.py", line 234, in <module>
    main()
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\runfiles.py", line 78, in main
    return pydev_runfiles.main(configuration) # Note: still doesn't return a proper value.
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 835, in main
    PydevTestRunner(configuration).run_tests()
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 762, in run_tests
    file_and_modules_and_module_name = self.find_modules_from_files(files)
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 517, in find_modules_from_files
    mod = self.__get_module_from_str(import_str, print_exception, pyfile)
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 476, in __get_module_from_str
    buf_err = pydevd_io.StartRedirect(keep_original_redirection=True, std='stderr')
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd_io.py", line 72, in StartRedirect
    import sys
MemoryError

The main problem in your script is that you don't save the altered XML back to disk. You need to keep a reference to the root element and then call ElementTree.write:
from xml.etree import ElementTree as Et
context = Et.iterparse('input.xml')
root = None
for event, elem in context:
    if elem.tag == 'DescriptorRecord':
        # Iterate over a copy of the children; don't use _children, it's a private field.
        # list(elem) replaces getchildren(), which was removed in Python 3.9.
        for e in list(elem):
            if e.tag in ['DateCreated', 'Year', 'Month', 'TreeNumber', 'HistoryNote', 'PreviousIndexing']:
                elem.remove(e) # You need remove(), not clear()
    root = elem  # the last element reported is assumed to be the root
with open('output.xml', 'wb') as file:
    Et.ElementTree(root).write(file, encoding='utf-8', xml_declaration=True)
Note: here I use an awkward (and probably unsafe) way to get a root element - I assume that it's always the last element in iterparse output. If anyone knows a better way, please tell.
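One commonly used alternative for getting the root is to ask iterparse for 'start' events as well and take the first element it reports, which is the root of the document. The sketch below assumes Python 3 and uses a tiny in-memory document. The `DescriptorRecordSet` root tag, the `Name` element, and the shortened `UNWANTED` set are made up for illustration.

```python
import io
from xml.etree import ElementTree as Et

UNWANTED = {'DateCreated', 'HistoryNote'}

# Small in-memory stand-in for the large input file (hypothetical content).
source = io.BytesIO(
    b"<DescriptorRecordSet>"
    b"<DescriptorRecord><DateCreated>x</DateCreated><Name>A</Name></DescriptorRecord>"
    b"<DescriptorRecord><HistoryNote>y</HistoryNote><Name>B</Name></DescriptorRecord>"
    b"</DescriptorRecordSet>"
)

# Request 'start' events too; the very first event is the start of the root.
context = iter(Et.iterparse(source, events=('start', 'end')))
_, root = next(context)

for event, elem in context:
    # Only act on 'end' events, when the element is fully parsed.
    if event == 'end' and elem.tag == 'DescriptorRecord':
        for child in list(elem):  # copy, because elem is modified while looping
            if child.tag in UNWANTED:
                elem.remove(child)

result = Et.tostring(root, encoding='unicode')
print(result)
```

This only changes how the root is found. The full tree is still built in memory, so on its own it does not address memory use on very large files.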

Import xml in another xml with Python ElementTree parser

Is it possible to load an xml file which imports another xml file with Python ElementTree.parse ?
For example:
I have file test.xml which contains:
<TestXml>
<!DOCTYPE doc [
<!ENTITY otherFile SYSTEM "test_1.xml">
]>
</TestXml>
and I also have test_1.xml which contains:
<test>it works!</test>
and I want to load test.xml in my python script:
from xml.etree.ElementTree import parse
a = parse('test.xml')
print a.find('test').text
and I expect it to output:
it works!
but instead I have:
Traceback (most recent call last):
  File "D:/Work/depot/WIP/olex/Python/test/test.py", line 3, in <module>
    a = parse('test.xml')
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1182, in parse
    tree.parse(source, parser)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 656, in parse
    parser.feed(data)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 2, column 6
Does anybody know what I am doing wrong, or is it simply impossible to load such an XML file with the Python ElementTree parser?

The specific problem you are having is that your xml is malformed. Your DOCTYPE declaration should not be inside your root element. Rather, it should precede your root element:
<!DOCTYPE doc [
<!ENTITY otherFile SYSTEM "test_1.xml">
]>
<TestXml>
some content . . .
</TestXml>
That said, you will face a larger problem once you solve that issue. How do you use Python to parse the DOCTYPE declaration? Should you use the xml module, the lxml module, or the bs4 module?
That's a tough question. From what I have seen, people have (recently) had to do dtd parsing themselves. See the SO threads here and here for some possible leads.

Upper limit of fromstring function in ElementTree

I'm using Python 2.4 version on a Windows 32-bit PC. I'm trying to parse through a very large XML file using the ElementTree module. I downloaded version 1.2.6 of this module from effbot.org.
I followed the below code for my purpose:
import elementtree.ElementTree as ET
input = ''' 001 Chuck 009 Brent '''
stuff = ET.fromstring(input)
lst = stuff.findall("users/user")
print len(lst)
for item in lst:
    print item.attrib["x"]
item = lst[0]
ET.dump(item)
item.get("x") # get works on attributes
item.find("id").text
item.find("id").tag
for user in stuff.getiterator('user'):
    print "User" , user.attrib["x"]
    ET.dump(user)
If the content of input is too large, more than 10,000 lines, the fromstring function raises an error (below). Can anyone help me out in rectifying this error?
This is the error generated:
Traceback (most recent call last):
  File "C:\Documents and Settings\hariprar\My Documents\My files\Python Try\xml_try1.py", line 16, in -toplevel-
    stuff = ET.fromstring(input)
  File "C:\Python24\Lib\site-packages\elementtree\ElementTree.py", line 1012, in XML
    return api.fromstring(text)
  File "C:\Python24\Lib\site-packages\elementtree\ElementTree.py", line 182, in fromstring
    parser.feed(text)
  File "C:\Python24\Lib\site-packages\elementtree\ElementTree.py", line 1292, in feed
    self._parser.Parse(data, 0)
ExpatError: not well-formed (invalid token): line 2445, column 39

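Take a look at the iterparse function. It lets you parse your input incrementally rather than reading it into memory as one big chunk. Note, though, that the ExpatError above reports an invalid token at line 2445, column 39, so the input at that position is worth checking as well.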
It's described here: http://effbot.org/zone/element-iterparse.htm
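As a rough illustration of that approach, here is a minimal sketch using the standard-library xml.etree.ElementTree on Python 3 rather than the separate elementtree package from the question. The sample document is invented for the example: the element names follow the code in the question, but the `x` attribute values are made up.

```python
import io
import xml.etree.ElementTree as ET

# Small in-memory stand-in for a large file (hypothetical content).
xml_bytes = io.BytesIO(
    b'<stuff><users>'
    b'<user x="2"><id>001</id><name>Chuck</name></user>'
    b'<user x="7"><id>009</id><name>Brent</name></user>'
    b'</users></stuff>'
)

seen = []
# iterparse yields each element once its end tag has been read.
for event, elem in ET.iterparse(xml_bytes, events=("end",)):
    if elem.tag == "user":
        seen.append((elem.get("x"), elem.find("id").text))
        elem.clear()  # drop children and text that are no longer needed

print(seen)
```

For a real file, pass the file name to `ET.iterparse` instead of the `BytesIO` object. Calling `elem.clear()` after each record is what keeps memory use from growing with the size of the input.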
