Parsing XML with undeclared prefixes in Python

I am trying to parse XML data with Python that uses prefixes, but not every file has the declaration of the prefix. Example XML:
<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>
I have been using xml.etree.ElementTree to parse these files, but whenever the prefix is not properly declared, ElementTree throws a parse error. (unbound prefix, right at the start of <abc:thing2>)
Searching for this error leads me to solutions that suggest I fix the namespace declaration. However, I do not control the XML that I need to work with, so modifying the input files is not a viable option.
Searching for namespace parsing in general leads me to many questions about searching in a namespace-agnostic way, which is not what I need.
I am looking for some way to automatically parse these files, even if the namespace declaration is broken. I have thought about doing the following:
tell ElementTree what namespaces to expect beforehand, because I do know which ones can occur. I found register_namespace, but that does not seem to work.
have the full DTD read in before parsing, and see if that solves it. I could not find a way to do this with ElementTree.
tell ElementTree to not bother about namespaces at all. It should not cause issues with my data, but I found no way to do this
use some other parsing library that can handle this issue - though I prefer not to need installation of extra libraries. I have difficulty seeing from the documentation if any others would be able to solve my issue.
some other route that I am currently not seeing?
UPDATE:
After Har07 put me on the path of lxml, I tried to see if this would let me perform the different solutions I had thought of, and what the result would be:
telling the parser what namespaces to expect beforehand: I still could not find any 'official' way to do this, but in my earlier searches I had found the suggestion to simply add the requisite declaration to the data programmatically (for a different programming situation; unfortunately I can't find the link anymore). It seemed terribly hacky to me, but I tried it anyway. It involves loading the data as a string, changing the enclosing element to have the right xmlns declarations, and then handing it off to lxml.etree's fromstring method. Unfortunately, that also requires removing the encoding declaration from the string. It works, though.
Read in the DTD before parsing: it is possible with lxml (through attribute_defaults, dtd_validation, or load_dtd), but unfortunately does not solve the namespace issue.
Telling lxml not to bother about namespaces: possible through the recover option. Unfortunately, that also ignores other ways in which the XML may be broken (see Har07's answer for details)
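The first workaround above (adding the missing declarations to the root element before parsing) can be sketched with the standard library alone. This is only a minimal illustration under stated assumptions: the helper name declare_missing_prefixes, the KNOWN_NAMESPACES table and the URI http://example.com/abc are made up for the example, and the root tag is located with a simple regular expression that would be fooled by a comment containing markup before the root element. Working on bytes rather than str should also avoid having to strip the encoding declaration.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the prefixes that can occur are known; the URI is a placeholder.
KNOWN_NAMESPACES = {"abc": "http://example.com/abc"}


def declare_missing_prefixes(xml_bytes, namespaces):
    """Add xmlns:prefix declarations to the root start tag where missing."""
    # First "<" followed by a name character: this skips the XML
    # declaration, comments and a DOCTYPE, which start with "<?" or "<!".
    match = re.search(rb"<[A-Za-z_][^\s/>]*", xml_bytes)
    if match is None:
        return xml_bytes
    tag_end = xml_bytes.find(b">", match.end())
    if tag_end == -1:
        tag_end = len(xml_bytes)
    root_tag = xml_bytes[match.start():tag_end]
    extra = b""
    for prefix, uri in namespaces.items():
        # Only add a declaration the root tag does not already carry,
        # otherwise the parser would report a duplicate attribute.
        if ("xmlns:%s=" % prefix).encode("ascii") not in root_tag:
            extra += (' xmlns:%s="%s"' % (prefix, uri)).encode("ascii")
    # Insert the new declarations right after the root element's name.
    return xml_bytes[:match.end()] + extra + xml_bytes[match.end():]


SAMPLE = (b'<?xml version="1.0" encoding="UTF-8"?>\n'
          b'<item subtype="bla">\n'
          b'<thing>Word</thing>\n'
          b'<abc:thing2>Another Word</abc:thing2>\n'
          b'</item>')

root = ET.fromstring(declare_missing_prefixes(SAMPLE, KNOWN_NAMESPACES))
print(root.find("{http://example.com/abc}thing2").text)
```

Unlike the recover option, this keeps the parser strict about every other kind of error; the cost is that the prefix-to-URI table has to be maintained by hand.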

One possible way is to use lxml, a library that is largely compatible with the ElementTree API. For example:
from lxml import etree as ElementTree
xml = """<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>"""
parser = ElementTree.XMLParser(recover=True)
tree = ElementTree.fromstring(xml, parser)
thing = tree.xpath("//thing")[0]
print(ElementTree.tostring(thing))
To parse XML that is not well-formed with lxml, all you need to do is pass recover=True to the XMLParser constructor. lxml also has full support for XPath 1.0, which is very useful when you need to select part of an XML document using more complex criteria.
UPDATE :
I don't know every type of XML error that the recover=True option can tolerate, but here is another one I know of besides an unbound namespace prefix: an unclosed tag. lxml will fix, rather than ignore, an unclosed tag by adding the corresponding closing tag automatically. For example, given the following broken XML:
xml = """<item subtype="bla">
<thing>Word</thing>
<bad>
<abc:thing2>Another Word</abc:thing2>
</item>"""
parser = ElementTree.XMLParser(recover=True)
tree = ElementTree.fromstring(xml, parser)
print(ElementTree.tostring(tree))
The final XML output after parsing with lxml is as follows:
<item subtype="bla">
<thing>Word</thing>
<bad>
<abc:thing2>Another Word</abc:thing2>
</bad></item>

Related

lxml namespaces defined multiple times

I am switching from Python's xml module to lxml, and I have found that lxml has a rather strict policy towards namespaces.
I want to produce XML with multiple redundant xmlns:ns0 namespace declarations. However, lxml strips these declarations from child elements.
How can I define redundant namespaces via lxml?
<?xml version="1.0" ?>
<env:Envelope xmlns:env="some URI" xmlns:ns0="http://www.w3.org/2001/XMLSchema" xmlns:ns1="http://www.w3.org/2001/XMLSchema-instance">
<env:Body>
<ns0:getLabelResponseElement>
<ns0:result>
<ns1:statusMessage xmlns:ns0="http://www.w3.org/2001/XMLSchema"></ns1:statusMessage>
</ns0:result>
</ns0:getLabelResponseElement>
</env:Body>
</env:Envelope>

Based on my research, lxml.etree does not let you add a redundant namespace declaration to a child element. Even if you explicitly redeclare a namespace that is already in scope on a child element, lxml will not write it to the final XML.
Unlike lxml, Python's xml.etree.ElementTree library allows redundant XML namespace declarations.

'XML' document with multiple root elements

I have an 'XML' file, which I do not control, which I am trying to parse with etree.ElementTree which contains two root elements:
<?xml version="1.0"?>
<meta>
... data I do not care about
</meta>
<database>
... data I wish to parse
</database>
Trying to parse the file I'm getting the error: 'junk after document element' which I understand is related to the fact that it isn't valid xml, since xml can only have one root element. I've been reading around for a solution, and while I have found a few posts addressing this issue they have all been different enough or difficult enough that I could not, as a beginner, get my head round them.
As I understand it, the solution would be either to encase everything in a new root element and parse that, or to somehow ignore or split off the <meta> element and its children. Any guidance on how best to accomplish this would be appreciated.

Beautiful Soup might ease your problem (although it is lxml inside that does the actual work), but it is a downgrade in the long term, for instance when you want to use XPath.
Stick to ET. It is strict and will not parse XML that is not well-formed; well-formed XML has exactly one root element and no other elements outside it.
If you manage to parse your XML file, you can be sure it is well-formed. Both of the following options are legitimate:
1) Read the file as a string, remove the XML declaration, and put root tags around the rest. Then parse from the string. (Clear the string variable after that.) Or you could edit the file first.
2) Create a new root element ( new_root = ET.Element('new_root') ), read the top-level elements in the file, and append them to it.
The second option requires more coding and more maintenance if the file gets changed.
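Option 1 can be sketched as follows. This is a minimal illustration, not a complete solution: the function name parse_multi_root and the wrapper tag new_root are arbitrary choices, and the sample data stands in for the real file. It assumes the only thing before the first element is an optional XML declaration; a DOCTYPE, for example, would need extra handling.

```python
import re
import xml.etree.ElementTree as ET


def parse_multi_root(text, wrapper_tag="new_root"):
    """Parse text holding several top-level elements by wrapping them."""
    # An XML declaration is only allowed at the very start of a document,
    # so it has to go before a wrapper element is put in front of it.
    body = re.sub(r"^\s*<\?xml[^>]*\?>", "", text, count=1)
    return ET.fromstring("<{0}>{1}</{0}>".format(wrapper_tag, body))


SAMPLE = """<?xml version="1.0"?>
<meta>
<somedata>1</somedata>
</meta>
<database>
<important>100</important>
</database>"""

root = parse_multi_root(SAMPLE)
# The original top-level elements are now children of the wrapper.
database = root.find("database")
print(database.find("important").text)
```

Because the parser stays strict, any other well-formedness problem in the file still raises a ParseError instead of being silently repaired.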

Here is one solution using BeautifulSoup, where data holds the malformed XML. BeautifulSoup will process it like any other document, so you can access both parts:
from bs4 import BeautifulSoup
data = """<?xml version="1.0"?>
<meta>
<somedata>1</somedata>
</meta>
<database>
<important>100</important>
</database>"""
soup = BeautifulSoup(data, 'lxml')
print(soup.database.important.text)
Prints:
100

Parsing XML with namespaces using ElementTree in Python

I have an xml, small part of it looks like this:
<?xml version="1.0" ?>
<i:insert xmlns:i="urn:com:xml:insert" xmlns="urn:com:xml:data">
<data>
<image imageId="1"></image>
<content>Content</content>
</data>
</i:insert>
When I parse it using ElementTree and save it to a file, I see the following:
<ns0:insert xmlns:ns0="urn:com:xml:insert" xmlns:ns1="urn:com:xml:data">
<ns1:data>
<ns1:image imageId="1"></ns1:image>
<ns1:content>Content</ns1:content>
</ns1:data>
</ns0:insert>
Why does it change the prefixes and put them everywhere? Using minidom I don't have this problem. Is it configurable? The documentation for ElementTree is very poor.
The problem is that I can't find any node after such parsing. For example, I can't find image with or without a namespace, whether I use {namespace}image or just image. Why is that? Any suggestions are strongly appreciated.
What I have already tried:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
for a in root.findall('ns1:image'):
    print(a.attrib)
This returns an error and the other one returns nothing:
for a in root.findall('{urn:com:xml:data}image'):
    print(a.attrib)
I also tried to make namespace like this and use it:
namespaces = {'ns1': 'urn:com:xml:data'}
for a in root.findall('ns1:image', namespaces):
    print(a.attrib)
It returns nothing. What am I doing wrong?

This snippet from your question,
for a in root.findall('{urn:com:xml:data}image'):
    print(a.attrib)
does not output anything because it only looks for direct {urn:com:xml:data}image children of the root of the tree.
This slightly modified code,
for a in root.findall('.//{urn:com:xml:data}image'):
    print(a.attrib)
will print {'imageId': '1'} because it uses .//, which selects matching subelements on all levels.
Reference: https://docs.python.org/2/library/xml.etree.elementtree.html#supported-xpath-syntax.
It is a bit annoying that ElementTree does not just retain the original namespace prefixes by default, but keep in mind that it is not the prefixes that matter anyway. The register_namespace() function can be used to set the wanted prefix when serializing the XML. The function does not have any effect on parsing or searching.
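A short sketch of that last point, using the XML from the question: registering prefixes changes what the serializer writes, while searching still goes through the namespace URI. The prefix d for the default namespace is an arbitrary choice made for this example.

```python
import xml.etree.ElementTree as ET

SOURCE = """<i:insert xmlns:i="urn:com:xml:insert" xmlns="urn:com:xml:data">
<data>
<image imageId="1"></image>
<content>Content</content>
</data>
</i:insert>"""

root = ET.fromstring(SOURCE)

# Searching uses the namespace URI, whatever prefixes are registered.
images = root.findall(".//{urn:com:xml:data}image")
print(images[0].attrib)

# Registration only affects the prefixes written on serialization.
# Note that it updates a module-wide table, not just this tree.
ET.register_namespace("i", "urn:com:xml:insert")
ET.register_namespace("d", "urn:com:xml:data")  # "d" is an arbitrary prefix
serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```

The serialized output should now use the i: and d: prefixes instead of the generated ns0 and ns1.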

From what I gather, it has something to do with how ET handles namespaces.
From http://effbot.org/zone/element-namespaces.htm:
When you save an Element tree to XML, the standard Element serializer generates unique prefixes for all URI:s that appear in the tree. The prefixes usually have the form “ns” followed by a number. For example, the above elements might be serialized with the prefix ns0 for “http://www.w3.org/1999/xhtml” and ns1 for “http://effbot.org/namespace/letters”.
If you want to use specific prefixes, you can add prefix/uri mappings to a global table in the ElementTree module. In 1.3 and later, you do this by calling the register_namespace function. In earlier versions, you can access the internal table directly:
ElementTree 1.3
ET.register_namespace(prefix, uri)
ElementTree 1.2 (Python 2.5)
ET._namespace_map[uri] = prefix
Note the argument order; the function takes the prefix first, while the raw dictionary maps from URI:s to prefixes.

Parse xml from file using etree works when reading string, but not a file

I am a relative newbie to Python and SO. I have an XML file from which I need to extract information. I've been struggling with this for several days, but I think I finally found something that will extract the information properly. Now I'm having trouble getting the right output. Here is my code:
from xml.etree import ElementTree as etree
node = etree.fromstring('<dataObject><identifier>5e1882d882ec530069d6d29e28944396</identifier><description>This is a paragraph about a shark.</description></dataObject>')
identifier = node.findtext('identifier')
description = node.findtext('description')
print(identifier, description)
The result that I get is "5e1882d882ec530069d6d29e28944396 This is a paragraph about a shark.", which is what I want.
However, what I really need is to be able to read from a file instead of a string. So I try this code:
from xml.etree import ElementTree as etree
node = etree.parse('test3.xml')
identifier = node.findtext('identifier')
description = node.findtext('description')
print(identifier, description)
Now my result is "None None". I have a feeling I'm either not reading the file in correctly or something is wrong with the output. Here are the contents of test3.xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response xmlns="http://www.eol.org/transfer/content/0.3" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dwc="http://rs.tdwg.org/dwc/dwcore/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:dwct="http://rs.tdwg.org/dwc/terms/" xsi:schemaLocation="http://www.eol.org/transfer/content/0.3 http://services.eol.org/schema/content_0_3.xsd">
<identifier>5e1882d822ec530069d6d29e28944369</identifier>
<description>This is a paragraph about a shark.</description>

Your XML file uses a default namespace. You need to qualify your searches with the correct namespace:
identifier = node.findtext('{http://www.eol.org/transfer/content/0.3}identifier')
for ElementTree to match the correct elements.
You could also give the .find(), findall() and iterfind() methods an explicit namespace dictionary. This is not documented very well:
namespaces = {'eol': 'http://www.eol.org/transfer/content/0.3'} # add more as needed
root.findall('eol:identifier', namespaces=namespaces)
Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the eol: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.eol.org/transfer/content/0.3}identifier instead.
If you can switch to the lxml library things are better; that library supports the same ElementTree API, but collects namespaces for you in a .nsmap attribute on elements.
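The difference between the unqualified and the qualified search can be reproduced with a small self-contained sketch. The XML below is a shortened, hypothetical stand-in for test3.xml (only the default namespace is kept and the root element is closed), and the prefix eol exists only in the dictionary passed to the search, not in the document.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for test3.xml, kept only for illustration.
SOURCE = """<response xmlns="http://www.eol.org/transfer/content/0.3">
<identifier>5e1882d822ec530069d6d29e28944369</identifier>
<description>This is a paragraph about a shark.</description>
</response>"""

root = ET.fromstring(SOURCE)
namespaces = {"eol": "http://www.eol.org/transfer/content/0.3"}

# Without a namespace the search matches nothing, hence "None None".
unqualified = root.findtext("identifier")

# Either spell out the URI in braces or use a prefix from the dictionary.
qualified = root.findtext("eol:identifier", namespaces=namespaces)
braces = root.findtext("{http://www.eol.org/transfer/content/0.3}description")

print(unqualified, qualified, braces)
```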

Have you thought of trying beautifulsoup to parse your xml with python:
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing%20XML
There is some good documentation and a healthy online group so support is quite good

How to parse XML containing prefixes but no namespace declarations with lxml?

I have a bunch of XML files which are using prefixes but without the corresponding namespace declaration.
Stuff like:
<tal:block tal:condition="foo">
...
</tal:block>
or:
<div i18n:domain="my-app">
...
I know where those prefixes come from, and I tried the following, but without success:
from lxml import etree as ElementTree
ElementTree.register_namespace("i18n", "http://namespaces.zope.org")
ElementTree.register_namespace("tal", "http://xml.zope.org/namespaces/tal")
with open(path) as fp:
    tree = ElementTree.parse(fp)
but lxml still chokes with:
lxml.etree.XMLSyntaxError: Namespace prefix i18n for domain on div is not defined, line 4, column 20
I know I can use ElementTree.XMLParser(recover=True), but I would like to keep the prefixes, which this method does not do.
Any ideas?

It is not namespace-well-formed XML, because it uses undefined prefixes, so a conforming namespace-aware XML parser will reject it.
Your best bet (other than fixing the XML) is to programmatically modify the XML source to add the namespace attributes to the root element (just using the string support in your language). Add xmlns:tal="http://xml.zope.org/namespaces/tal", etc. to the root element before you give the XML to the parser. Then the XML parser should handle it without complaint and without registering any namespaces.
