Parsing large combined XML document with Python

I have one large document (400 MB) that contains hundreds of XML documents, each with its own declaration. I am trying to parse each document using ElementTree in Python, but I am having a lot of trouble splitting the file into individual XML documents so that I can parse out the information. Here is an example of what the document looks like:
<?xml version="1.0"?>
<data>
<more>
<p></p>
</more>
</data>
<?xml version="1.0"?>
<different data>
<etc>
<p></p>
</etc>
</different data>
<?xml version="1.0"?>
<continues.....>
Ideally I would like to read through each XML declaration, parse the data, and continue on with the next XML document. Any suggestions will help.

You'll need to read in the documents separately; here is a generator function that'll yield complete XML documents from a given file object:
def xml_documents(fileobj):
    document = []
    for line in fileobj:
        if line.strip().startswith('<?xml') and document:
            yield ''.join(document)
            document = []
        document.append(line)
    if document:
        yield ''.join(document)
Then use ElementTree.fromstring() to load and parse these:
from xml.etree import ElementTree

with open('file_with_multiple_xmldocuments') as fileobj:
    for xml in xml_documents(fileobj):
        # fromstring() returns the root Element of each document
        root = ElementTree.fromstring(xml)
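
A minimal, self-contained sketch of the approach above, using an in-memory io.StringIO in place of the real 400 MB file. The sample data and the name combined are made up for illustration, and the sketch assumes each XML declaration starts on its own line, as in the question.

```python
import io
from xml.etree import ElementTree


def xml_documents(fileobj):
    """Yield each complete XML document found in fileobj as a string."""
    document = []
    for line in fileobj:
        # A new declaration marks the start of the next document.
        if line.strip().startswith('<?xml') and document:
            yield ''.join(document)
            document = []
        document.append(line)
    if document:
        yield ''.join(document)


# Hypothetical sample standing in for the large combined file.
combined = io.StringIO(
    '<?xml version="1.0"?>\n'
    '<data><more><p>one</p></more></data>\n'
    '<?xml version="1.0"?>\n'
    '<other><etc><p>two</p></etc></other>\n'
)

roots = [ElementTree.fromstring(doc) for doc in xml_documents(combined)]
root_tags = [root.tag for root in roots]
```

When the documents are consumed one at a time in a for loop instead of being collected into a list, memory use should depend on the largest single document, not on the size of the whole file.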

Related

ParseError: junk after document element: line 7, column 0, (Python, XML parsing)

I have a dummy xml file,
<?xml version="1.0" encoding="UTF-8"?>
<hello xmlns="abc">
<inside>
<ok>xyz</ok>
</inside>
</hello>
<?xml version="1.0" encoding="UTF-8"?>
<xyz xmlns="acxd">
</xyz>
<?xml version="1.0" encoding="UTF-8"?>
<zz xmlns="zmrt">
</zz>
]]>]]>
I am trying to parse this XML file using the following code.
import xml.etree.ElementTree as ET
mytree = ET.parse(temp_xml)
The error I am getting is "ParseError: junk after document element: line 7, column 0".
I tried removing ']]>]]>', but I still get the same kind of error: "ParseError: junk after document element: line 8, column 0". Is there a way to deal with this error, or to skip the lines that contain junk data?

An XML document may have only a single root element. Yours has three and is therefore not well-formed. If you wish to parse it using XML tools, you'll first have to separate the root elements into their own documents, either manually or programmatically.
Note that an XML document also can have at most a single XML declaration (<?xml version="1.0" encoding="UTF-8"?>), and if it exists, it must be at the top of the file.
See also
Why must XML documents have a single root element?
How to parse invalid (bad / not well-formed) XML?
Are multiple XML declarations in a document well-formed XML?
Parse a xml file with multiple root element in python
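
One possible way to do the programmatic separation is sketched below. It is not a general solution: it assumes every document begins with an XML declaration, that the trailing ']]>]]>' marker never appears inside real content, and that the whole file fits in memory. The names split_documents and sample are invented for this example.

```python
import re
import xml.etree.ElementTree as ET

# Matches an XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
# (the \s keeps it from matching <?xml-stylesheet ...?>).
DECLARATION = re.compile(r'<\?xml\s[^>]*\?>')


def split_documents(text, trailer=']]>]]>'):
    """Split text holding several XML documents into one string per document."""
    text = text.replace(trailer, '')      # drop the junk marker (assumption)
    parts = DECLARATION.split(text)       # declarations act as separators
    return [part.strip() for part in parts if part.strip()]


sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<hello xmlns="abc"><inside><ok>xyz</ok></inside></hello>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<xyz xmlns="acxd"></xyz>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<zz xmlns="zmrt"></zz>\n'
    ']]>]]>\n'
)

roots = [ET.fromstring(doc) for doc in split_documents(sample)]
tags = [root.tag for root in roots]
```

Each piece is parsed on its own, so every call to ET.fromstring() sees exactly one root element. The declarations are discarded in the split, which is harmless here because the text has already been decoded to a string.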

Writing XML formatted text to output file in Python

I am having trouble writing the XML below to an output file.
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>
<root>
<document>
<sentences>
<sentence id="1">
<tokens>
<token id="1">
<word>
Pusheen
</word>
<CharacterOffsetBegin>
0
</CharacterOffsetBegin>
<CharacterOffsetEnd>
7
</CharacterOffsetEnd>
<POS>
NNP
</POS>
</token>
</tokens>
</sentence>
</sentences>
</document>
</root>
How can I write this to an output file in XML format? I tried the write statement below:
tree.write(open('person.xml', 'w'), encoding='unicode')
But I get the following error:
AttributeError: 'str' object has no attribute 'write'
I don't have to build the XML here because I already have the data in XML format. I just need to write it to an XML file.

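Assuming tree holds your XML, it is a string, not an ElementTree object. You probably want something like:
with open("person.xml", "w", encoding="utf-8") as outfile:
    outfile.write(tree)
(It is good practice to use with for files; it closes them automatically afterwards.)
The error occurs because tree is a string, and strings have no write() method. Note that "unicode" is an encoding name understood only by ElementTree's own write(), not by open(), so use a real codec such as "utf-8" here.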

I recommend using the lxml module to check the format first and then write it to a file. I notice that you've got two elements with the same id, which caught my eye. It doesn't flag an error in XML, but it could cause trouble on an HTML page, where each id is supposed to be unique.
Here's the simple code to do what I described above:
from lxml import etree

try:
    # Raises XMLSyntaxError if the XML is not well-formed. Pass bytes rather
    # than str if the data starts with a declaration that names an encoding.
    root = etree.fromstring(your_xml_data)
    tree = etree.ElementTree(root)  # wrap the Element in an ElementTree
    tree.write('person.xml')  # the ElementTree provides write()
except etree.XMLSyntaxError:
    print('Oops!')
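
If installing lxml is not an option, a similar check-then-write step can be sketched with the standard library alone. This is an illustration under assumptions, not the answerers' code: write_xml_text, xml_text, and out_path are hypothetical names, and the data is assumed to be a str meant to be stored as UTF-8.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_xml_text(xml_text, path):
    """Check that xml_text is well-formed, then write it to path unchanged."""
    # ET.fromstring raises ET.ParseError if the text is not well-formed.
    # Encoding first lets the parser see bytes matching the declared encoding.
    ET.fromstring(xml_text.encode('utf-8'))
    with open(path, 'w', encoding='utf-8') as outfile:
        outfile.write(xml_text)


# Hypothetical sample data and output location.
xml_text = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>\n'
    '<root><document><sentences/></document></root>\n'
)
out_path = os.path.join(tempfile.mkdtemp(), 'person.xml')
write_xml_text(xml_text, out_path)

with open(out_path, encoding='utf-8') as infile:
    written = infile.read()
```

Because the original text is written back as-is, the declaration and the xml-stylesheet processing instruction are kept exactly as they were.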

Element tree writing back to disk

I parse a large xml file in python using
tree = ET.parse('test.xml')
#do my manipulation
How do I write the XML file back to disk exactly as I read it, but with my modifications?

The first line of the input XML file was:
<?xml version="1.0" encoding="utf-16"?>
I added tree.write("output.sbp", encoding="utf-16"), and now the input and output files are the same size.
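
The round trip can be sketched end to end as follows. The file names, the sample content, and the edit to the item element are placeholders for illustration; only the tree.write() call with encoding="utf-16" comes from the post above.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, 'test.xml')      # hypothetical input file
out_path = os.path.join(workdir, 'output.xml')   # hypothetical output file

# Create a small UTF-16 input file so the sketch is self-contained.
source = '<?xml version="1.0" encoding="utf-16"?>\n<root><item>old</item></root>'
with open(in_path, 'wb') as f:
    f.write(source.encode('utf-16'))

tree = ET.parse(in_path)
tree.getroot().find('item').text = 'new'  # stand-in for the real manipulation

# Passing the input file's encoding makes ElementTree write UTF-16 again;
# xml_declaration=True asks for the <?xml ...?> line explicitly.
tree.write(out_path, encoding='utf-16', xml_declaration=True)

reloaded = ET.parse(out_path).getroot()
with open(out_path, 'rb') as f:
    header = f.read().decode('utf-16').splitlines()[0]
```

Matching the encoding does not guarantee a byte-for-byte copy: ElementTree may quote the declaration differently, and its default parser drops comments, so compare the parsed content if exact fidelity matters.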

How to remove xml header in beautifulsoup?

I have imported and modified some XML, but when I write it out using test.prettify(), the top line of the XML changes from
<?xml version="1.0"?>
to
<?xml version="1.0" encoding="utf-8"?>
I don't want this change. How can I just keep the first line unchanged? What is the easiest way to do this?
If it matters, I'm using the xml parser.
soup = BeautifulSoup(r.text,'xml')

I'm sure there's a more elegant way to do this using BeautifulSoup's built-ins, but based on your comment, I'll give you the "strip it out" version:
xml_string = '<?xml version="1.0" encoding="utf-8"?>'
# Keep everything before " encoding", then close the declaration again.
print(xml_string[:xml_string.find("encoding") - 1] + "?>")
This is general enough to strip out any encoding from the header (not just utf-8).
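
The same "strip it out" idea can be applied to a whole document string with a regular expression, so that only the declaration is touched. This is a sketch using the standard library only; strip_declared_encoding and the sample string pretty are invented for illustration and stand in for whatever prettify() returned.

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_ENCODING = re.compile(r'^(<\?xml[^>]*?)\s+encoding=(["\'])[^"\']*\2')


def strip_declared_encoding(xml_text):
    """Return xml_text with encoding="..." removed from its XML declaration."""
    return _ENCODING.sub(r'\1', xml_text, count=1)


pretty = '<?xml version="1.0" encoding="utf-8"?>\n<root/>'  # e.g. prettify() output
result = strip_declared_encoding(pretty)
```

Text without a leading declaration is returned unchanged, because the pattern is anchored to the start of the string.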

You could find the XML declaration and use replace_with() (spelled replaceWith() in older BeautifulSoup versions) to replace it with the value you want.

Detect empty XML root element

I am reading in a bunch of XML files. If the file only contains an empty root element like:
<?xml version="1.0" encoding="UTF-8"?>
<root />
I want to skip over it. Currently I do:
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
xml = ET.parse(filename)
if not [el for el in xml.getroot()]:
    pass  # skip
Is there a better way to handle this case?

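Instead of the list comprehension, ask the root element directly. Older code used getchildren() for this, but that method was removed in Python 3.9; len() on an element returns its number of child elements:
if len(xml.getroot()) == 0:
    pass  # skip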
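
A small helper can make the check reusable. This sketch assumes that "empty" means no child elements and no non-whitespace text; attributes on the root are ignored. The name has_empty_root is hypothetical.

```python
import xml.etree.ElementTree as ET


def has_empty_root(root):
    """Return True if root has no child elements and no non-whitespace text."""
    return len(root) == 0 and not (root.text or '').strip()


empty = ET.fromstring('<root />')
with_child = ET.fromstring('<root><item/></root>')
with_text = ET.fromstring('<root>hello</root>')
results = [has_empty_root(empty), has_empty_root(with_child), has_empty_root(with_text)]
```

For files on disk, the same function can be called as has_empty_root(ET.parse(filename).getroot()).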
