'XML' document with multiple root elements - python

I have an 'XML' file, which I do not control, which I am trying to parse with etree.ElementTree which contains two root elements:
<?xml version="1.0"?>
<meta>
... data I do not care about
</meta>
<database>
... data I wish to parse
</database>
Trying to parse the file I'm getting the error: 'junk after document element' which I understand is related to the fact that it isn't valid xml, since xml can only have one root element. I've been reading around for a solution, and while I have found a few posts addressing this issue they have all been different enough or difficult enough that I could not, as a beginner, get my head round them.
As I understand it the solution would either be to encase everything in a new root element, and parse that, or somehow ignore/split off the <meta> element and it's children. Any guidance on how to best accomplish this would be appreciated.

Beautiful Soup might ease your problem (although it is the lxml inside which renders this service), but its a long-term downgrade, thus for instance when you want to use xpath.
Stick to ET. It is strict and won't allow you to parse not well-formed XML, which requires one root element and nothing else outside of it.
If you manage to parse your xml-file, you can be sure, it is well-formed. All options are legit:
1) Read the file as a string, remove the declaration and put the root tags around it. Then parse from string. (Clear the string variable after that.) Or you could edit the file first.
2) Create a new root element ( new_root = ET.Element('new_root') ), read the top-level elements in the file an append them with SubElement.
The second option requires more coding and maintainance, if the file gets changed.

Here is one solution using BeautifulSoup, in data is malformed xml. BeautifulSoup will process it as any document, so you can access both parts:
from bs4 import BeautifulSoup
data = """<?xml version="1.0"?>
<meta>
<somedata>1</somedata>
</meta>
<database>
<important>100</important>
</database>"""
soup = BeautifulSoup(data, 'lxml')
print(soup.database.important.text)
Prints:
100

Related

Parse self-closing tags missing the '/'

I'm trying to parse some old SGML code using BeautifulSoup4 and build an Element Tree with the data. It's mostly working fine, but some of the tags that should be self-closing are aren't marked as such. For example:
<element1>
<element2 attr="0">
<element3>Data</element3>
</element1>
When I parse the data, it ends up like:
<element1>
<element2 attr="0">
<element3>Data</element3>
</element2>
</element1>
What I'd like is for it to assume that if it doesn't find a closing tag for such elements, it should treat it as self-closing tag instead of assuming that everything after it is a child and putting the closing tag as late as possible, like so:
<element1>
<element2 attr="0"/>
<element3>Data</element3>
</element1>
Can anyone point me to a parser that could do this, or some way to modify an existing one to act this way? I've dug through a few parsers (lxml, lxml-xml, html5lib) but I can't figure out how to get these results.
What I ended up doing was extracting all empty elements where the end tag can be omitted from the DTD (eg. <!ELEMENT elem_name - o EMPTY >), creating a list from those elements, then using regex to close all the tags in the list. The resulting text is then passed to the XML parser.
Here's a boiled down version of what I'm doing:
import re
from lxml.html import soupparser
from lxml import etree as ET
empty_tags = ['elem1', 'elem2', 'elem3']
markup = """
<elem1 attr="some value">
<elem2/>
<elem3></elem3>
"""
for t in empty_tags:
markup = re.sub(r'(<{0}(?:>|\s+[^>/]*))>\s*(?:</{0}>)?\n?'.format(t), r'\1/>\n', markup)
tree = soupparser.fromstring(markup)
print(ET.tostring(tree, pretty_print=True).decode("utf-8"))
The output should be:
<elem1 attr="some value"/>
<elem2/>
<elem3/>
(This will actually be enclosed in tags, but the parser adds those in.)
It will leave attributes alone, and won't touch tags that are already self-closed. If the tag has a closing tag, but is empty, it will remove the closing tag and self-close the tag instead, just so it's standardized.
It's not a very generic solution but, as far as I can tell, there's no other way to do this without knowing which tags should be closed. Even OpenSP needs the DTD to know which tags it should be closing.

Parse xml from file using etree works when reading string, but not a file

I am a relative newby to Python and SO. I have an xml file from which I need to extract information. I've been struggling with this for several days, but I think I finally found something that will extract the information properly. Now I'm having troubles getting the right output. Here is my code:
from xml import etree
node = etree.fromstring('<dataObject><identifier>5e1882d882ec530069d6d29e28944396</identifier><description>This is a paragraph about a shark.</description></dataObject>')
identifier = node.findtext('identifier')
description = node.findtext('description')
print identifier, description
The result that I get is "5e1882d882ec530069d6d29e28944396 This is a paragraph about a shark.", which is what I want.
However, what I really need is to be able to read from a file instead of a string. So I try this code:
from xml import etree
node = etree.parse('test3.xml')
identifier = node.findtext('identifier')
description = node.findtext('description')
print identifier, description
Now my result is "None None". I have a feeling I'm either not getting the file in right or something is wrong with the output. Here is the contents of test3.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response xmlns="http://www.eol.org/transfer/content/0.3" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dwc="http://rs.tdwg.org/dwc/dwcore/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:dwct="http://rs.tdwg.org/dwc/terms/" xsi:schemaLocation="http://www.eol.org/transfer/content/0.3 http://services.eol.org/schema/content_0_3.xsd">
<identifier>5e1882d822ec530069d6d29e28944369</identifier>
<description>This is a paragraph about a shark.</description>
Your XML file uses a default namespace. You need to qualify your searches with the correct namespace:
identifier = node.findtext('{http://www.eol.org/transfer/content/0.3}identifier')
for ElementTree to match the correct elements.
You could also give the .find(), findall() and iterfind() methods an explicit namespace dictionary. This is not documented very well:
namespaces = {'eol': 'http://www.eol.org/transfer/content/0.3'} # add more as needed
root.findall('eol:identifier', namespaces=namespaces)
Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the eol: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.eol.org/transfer/content/0.3}identifier instead.
If you can switch to the lxml library things are better; that library supports the same ElementTree API, but collects namespaces for you in a .nsmap attribute on elements.
Have you thought of trying beautifulsoup to parse your xml with python:
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing%20XML
There is some good documentation and a healthy online group so support is quite good
A

Parse XML in order using Python

I'm trying to parse an XML document. The document has HTML like formatting embedded, for example
<p>This is a paragraph
<em>with some <b>extra</b> formatting</em>
scattered throughout.
</p>
So far I've used
import xml.etree.cElementTree as xmlTree
to handle the XML document, but I am not sure if this provides the functionality I look for. How would I go about handling the text nodes here?
Also, is there a way to find the closing tags in a document?
Thanks!
If your XML document fits in memory, you should use Beautiful Soup which will give you a much cleaner access to the document. You'll be able to select a node and automatically interact with its children; every node will have a .next command, which will iterate through the text up to the next tag.
So:
>>> b = BeautifulSoup.BeautifulStoneSoup("<p>This is a paragraph <em>with some <b>extra</b> formatting</em> scattered throughout.</p>")
>>> b.find('p')
<p>This is a paragraph <em>with some <b>extra</b> formatting</em> scattered throughout.</p>
>>> b.find('p').next
u'This is a paragraph '
>>> b.find('p').next.next
<em>with some <b>extra</b> formatting</em>
That, or something like it, should solve your problem.
If it doesn't fit in memory, you'll need to subclass a SAX parser, which is a bit more work. To do that, you use from xml.parsers import expat and write handlers for opening and closing of tags. It's a bit more involved.

lxml removing <?xml ...> tags when parsing?

I'm currently working with parsing XML documents (adding elements, adding attributes, etc). So I first need to parse the XML in before working on it. However, lxml seems to be removing the element <?xml ...>. For example
from lxml import etree
tree = etree.fromstring('<?xml version="1.0" encoding="utf-8"?><dmodule>test</dmodule>', etree.XMLParser())
print etree.tostring(tree)
will result in
<dmodule>test</dmodule>
Does anyone know why the <?xml ...> element is being removed? I thought encoding tags were valid XML. Thanks for your time.
The <?xml> element is an XML declaration, so it's not strictly an element. It just gives info about the XML tree below it.
If you need to print it out with lxml, there is some info here about the xmlDeclaration=TRUE flag you can use.
http://lxml.de/api.html#serialisation
etree.tostring(tree, xml_declaration=True)
Does anyone know why the <?xml ...> element is being removed?
XML defaults to version 1.0 in UTF-8 so the document is equivalent if you remove them.
You are parsing some XML to a data structure and then converting that data structure back to XML. You will get a representation of that data structure in XML, but it might not be expressed in the same way (so the prolog can be removed and <foo /> can be exchanged with <foo></foo> and so on).

How to get whole text of an Element in xml.minidom?

I want to get the whole text of an Element to parse some xhtml:
<div id='asd'>
<pre>skdsk</pre>
</div>
begin E = div element on the above example, I want to get
<pre>skdsk</pre>
How?
Strictly speaking:
from xml.dom.minidom import parse, parseString
tree = parseString("<div id='asd'><pre>skdsk</pre></div>")
root = tree.firstChild
node = root.childNodes[0]
print node.toxml()
In practice, though, I'd recommend looking at the http://www.crummy.com/software/BeautifulSoup/ library. Finding the right childNode in an xhtml document, and skipping "whitespace nodes" is a pain. BeautifulSoup is a robust html/xhtml parser with fantastic tree-search capacilities.
Edit: The example above compresses the HTML into one string. If you use the HTML as in the question, the line breaks and so-forth will generate "whitespace" nodes, so the node you want won't be at childNodes[0].

Categories

Resources