Parsing XML with namespaces using ElementTree in Python


I have an xml, small part of it looks like this:
<?xml version="1.0" ?>
<i:insert xmlns:i="urn:com:xml:insert" xmlns="urn:com:xml:data">
<data>
<image imageId="1"></image>
<content>Content</content>
</data>
</i:insert>
When I parse it using ElementTree and save it to a file, I see the following:
<ns0:insert xmlns:ns0="urn:com:xml:insert" xmlns:ns1="urn:com:xml:data">
<ns1:data>
<ns1:image imageId="1"></ns1:image>
<ns1:content>Content</ns1:content>
</ns1:data>
</ns0:insert>
Why does it change the prefixes and put them everywhere? With minidom I don't have this problem. Can this be configured? The documentation for ElementTree is very poor.
The problem is that I can't find any node after parsing this way. For example, I can't find image with or without a namespace, whether I write {namespace}image or just image. Why is that? Any suggestions are greatly appreciated.
What I have already tried:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
for a in root.findall('ns1:image'):
    print(a.attrib)
This returns an error and the other one returns nothing:
for a in root.findall('{urn:com:xml:data}image'):
    print(a.attrib)
I also tried to make namespace like this and use it:
namespaces = {'ns1': 'urn:com:xml:data'}
for a in root.findall('ns1:image', namespaces):
    print(a.attrib)
It returns nothing. What am I doing wrong?

This snippet from your question,
for a in root.findall('{urn:com:xml:data}image'):
    print(a.attrib)
does not output anything because it only looks for direct {urn:com:xml:data}image children of the root of the tree.
This slightly modified code,
for a in root.findall('.//{urn:com:xml:data}image'):
    print(a.attrib)
will print {'imageId': '1'} because it uses .//, which selects matching subelements on all levels.
Reference: https://docs.python.org/2/library/xml.etree.elementtree.html#supported-xpath-syntax.
It is a bit annoying that ElementTree does not just retain the original namespace prefixes by default, but keep in mind that it is the namespace URIs, not the prefixes, that matter. The register_namespace() function can be used to set the desired prefix when serializing the XML. The function does not have any effect on parsing or searching.
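A minimal sketch of the points above, with the XML from the question inlined as a string so it runs on its own. The variable names and the prefix d are arbitrary choices made here, not part of the original code. It contrasts a direct-children search with a .// search, and shows that a prefix used in a search only has to match the dictionary passed in:

```python
import xml.etree.ElementTree as ET

# The document from the question, inlined so the sketch is self-contained
XML = """<?xml version="1.0" ?>
<i:insert xmlns:i="urn:com:xml:insert" xmlns="urn:com:xml:data">
<data>
<image imageId="1"></image>
<content>Content</content>
</data>
</i:insert>"""

root = ET.fromstring(XML)

# Direct children of the root only: image is nested inside data, so no match
direct = root.findall('{urn:com:xml:data}image')

# .// searches all levels below the root
nested = root.findall('.//{urn:com:xml:data}image')

# Same search with a prefix; "d" is an arbitrary name that only has to
# match a key in the dictionary, not the prefix used in the file
ns = {'d': 'urn:com:xml:data'}
prefixed = root.findall('.//d:image', ns)

print(direct)
print([a.attrib for a in nested])
print([a.attrib for a in prefixed])
```

The prefix ns1 that shows up in the saved file is generated at serialization time, which is why searching for ns1:image without a dictionary fails.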

From what I gather, this comes from the way ElementTree handles namespaces when it serializes a tree.
Quoting http://effbot.org/zone/element-namespaces.htm:
When you save an Element tree to XML, the standard Element serializer generates unique prefixes for all URI:s that appear in the tree. The prefixes usually have the form “ns” followed by a number. For example, the above elements might be serialized with the prefix ns0 for “http://www.w3.org/1999/xhtml” and ns1 for “http://effbot.org/namespace/letters”.
If you want to use specific prefixes, you can add prefix/uri mappings to a global table in the ElementTree module. In 1.3 and later, you do this by calling the register_namespace function. In earlier versions, you can access the internal table directly:
ElementTree 1.3
ET.register_namespace(prefix, uri)
ElementTree 1.2 (Python 2.5)
ET._namespace_map[uri] = prefix
Note the argument order; the function takes the prefix first, while the raw dictionary maps from URI:s to prefixes.
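As a rough illustration of register_namespace in use, here is a sketch that reuses the namespace URIs from the question above. Passing an empty string as the prefix to get a default namespace is a widely used trick rather than something the quoted text describes, so treat that line as an assumption and check it on your Python version:

```python
import xml.etree.ElementTree as ET

XML = ('<i:insert xmlns:i="urn:com:xml:insert" xmlns="urn:com:xml:data">'
       '<data><image imageId="1"></image><content>Content</content></data>'
       '</i:insert>')

# register_namespace changes a module-wide table, so it affects every
# later serialization in the same process
ET.register_namespace('i', 'urn:com:xml:insert')
# Assumption: an empty prefix makes this URI the default namespace on output
ET.register_namespace('', 'urn:com:xml:data')

root = ET.fromstring(XML)
serialized = ET.tostring(root, encoding='unicode')
print(serialized)
```

The registration only influences how the tree is written out. Searches still need the full {uri}tag form or a namespaces dictionary.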

Related

How can I turn an xml Element into an ElementTree (python)?

As I understood it, XML files are tree structures ie each branch is its own tree. Conceptually, I can't see the difference between an Element and an ElementTree. But I guess that's ok - what's worse is that there is stuff you can't do with an Element - for example root.write("bla.xml") seems to be fine but element.write("bla.xml") doesn't work.
So I suppose I need to convert the Element to an ElementTree and set it as root before I do anything else. How do I do this...?

You are right, conceptually there is no difference. So just build your elements however you like, then wrap their root in an ElementTree so you have access to its methods. You can just do
from xml.etree.ElementTree import ElementTree
tree = ElementTree(my_root_element)
tree.write(...)
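A small self-contained sketch of that wrapping step. The element names are made up for illustration, and the tree is written to an in-memory buffer instead of a file such as bla.xml so the result can be inspected directly:

```python
import io
import xml.etree.ElementTree as ET

# Build a bare Element hierarchy; an Element has no write() method
root = ET.Element('root')
ET.SubElement(root, 'child').text = 'hello'

# Wrap the Element in an ElementTree to get write()
tree = ET.ElementTree(root)

buffer = io.BytesIO()
tree.write(buffer, encoding='utf-8', xml_declaration=True)
xml_bytes = buffer.getvalue()
print(xml_bytes.decode('utf-8'))
```

Passing a filename instead of the buffer writes the same bytes to disk.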

With lxml, you can get the ElementTree that an Element belongs to with its getroottree method:
import lxml.html
root = lxml.html.parse(s).getroot()
tree = root.getroottree()
For more information, check the lxml documentation.

Finding and converting XML processing instructions using Python

We are converting our ancient FrameMaker docs to XML. My job is to convert this:
<?FM MARKER [Index] foo, bar ?>
to this:
<indexterm>
<primary>foo, bar</primary>
</indexterm>
I'm not worried about that part (yet); what is stumping me is that the ProcessingInstructions are all over the documents and could potentially be under any element, so I need to be able to search the entire tree, find them, and then process them. I cannot figure out how to iterate over an entire XML tree using minidom. Am I missing some secret method/iterator? This is what I've looked at thus far:
ElementTree has the excellent Element.iter() method, which is a depth-first search, but it doesn't process ProcessingInstructions.
ProcessingInstructions don't have tag names, so I cannot search for them using minidom's getElementsByTagName.
xml.sax's ContentHandler.processingInstruction looks like it's only used to create ProcessingInstructions.
Short of creating my own depth-first search algorithm, is there a way to generate a list of ProcessingInstructions in an XML file, or identify their parents?

Use the XPath API of the lxml module as such:
from io import StringIO
from lxml import etree
foo = StringIO('<foo><bar></bar></foo>')
tree = etree.parse(foo)
result = tree.xpath('//processing-instruction()')
The node test processing-instruction() is true for any processing instruction. The processing-instruction() test may have an argument that is Literal; in this case, it is true for any processing instruction that has a name equal to the value of the Literal.
References
XPath and XSLT with lxml
XML Path Language 1.0: Node Tests
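If lxml is not an option, the tree walk the question asks about can also be sketched with minidom alone. This is an alternative to the XPath answer above, not a replacement for it. The function name iter_processing_instructions and the sample document are made up for illustration:

```python
from xml.dom import minidom, Node


def iter_processing_instructions(node):
    """Yield every processing-instruction node below `node`, depth first."""
    for child in node.childNodes:
        if child.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
            yield child
        else:
            # Text and comment nodes have no children, so this just ends there
            yield from iter_processing_instructions(child)


doc = minidom.parseString(
    '<doc><para>Some text<?FM MARKER [Index] foo, bar ?></para>'
    '<?FM MARKER [Index] baz ?></doc>'
)

# Each PI exposes its target, its data and its parent element
found = [
    (pi.parentNode.tagName, pi.target, pi.data.strip())
    for pi in iter_processing_instructions(doc)
]
print(found)
```

Because every node carries parentNode, the replacement indexterm element can be inserted next to the processing instruction once it has been found.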

Setting 'xml:space' to 'preserve' Python lxml

I have a text element within an SVG file that I'm generating using lxml. I want to preserve whitespace in this element. I create the text element and then attempt to .set() the xml:space to preserve but nothing I try seems to work. I'm probably missing something conceptually. Any ideas?

You can do it by explicitly specifying the namespace URI associated with the special xml: prefix (see http://www.w3.org/XML/1998/namespace).
from lxml import etree
root = etree.Element("root")
root.set("{http://www.w3.org/XML/1998/namespace}space", "preserve")
print(etree.tostring(root).decode())
Output:
<root xml:space="preserve"/>
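The same {namespace}name form also works with the standard library's ElementTree, which may help if you want to check the idea without lxml. A minimal sketch, assuming a bare text element rather than a full SVG document:

```python
import xml.etree.ElementTree as ET

# The reserved namespace bound to the xml: prefix
XML_NS = 'http://www.w3.org/XML/1998/namespace'

text = ET.Element('text')
text.text = '  two  leading  spaces  '
# Attribute names use the same {uri}local form as tags
text.set('{%s}space' % XML_NS, 'preserve')

serialized = ET.tostring(text, encoding='unicode')
print(serialized)
```

The xml prefix is predefined, so the serializer writes xml:space directly and does not emit a separate namespace declaration for it.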

Parse xml from file using etree works when reading string, but not a file

I am a relative newbie to Python and SO. I have an XML file from which I need to extract information. I've been struggling with this for several days, but I think I finally found something that will extract the information properly. Now I'm having trouble getting the right output. Here is my code:
import xml.etree.ElementTree as etree
node = etree.fromstring('<dataObject><identifier>5e1882d882ec530069d6d29e28944396</identifier><description>This is a paragraph about a shark.</description></dataObject>')
identifier = node.findtext('identifier')
description = node.findtext('description')
print(identifier, description)
The result that I get is "5e1882d882ec530069d6d29e28944396 This is a paragraph about a shark.", which is what I want.
However, what I really need is to be able to read from a file instead of a string. So I try this code:
import xml.etree.ElementTree as etree
node = etree.parse('test3.xml')
identifier = node.findtext('identifier')
description = node.findtext('description')
print(identifier, description)
Now my result is "None None". I have a feeling I'm either not reading the file correctly or something is wrong with the output. Here are the contents of test3.xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response xmlns="http://www.eol.org/transfer/content/0.3" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dwc="http://rs.tdwg.org/dwc/dwcore/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:dwct="http://rs.tdwg.org/dwc/terms/" xsi:schemaLocation="http://www.eol.org/transfer/content/0.3 http://services.eol.org/schema/content_0_3.xsd">
<identifier>5e1882d822ec530069d6d29e28944369</identifier>
<description>This is a paragraph about a shark.</description>

Your XML file uses a default namespace. You need to qualify your searches with the correct namespace:
identifier = node.findtext('{http://www.eol.org/transfer/content/0.3}identifier')
for ElementTree to match the correct elements.
You could also give the .find(), findall() and iterfind() methods an explicit namespace dictionary. This is not documented very well:
namespaces = {'eol': 'http://www.eol.org/transfer/content/0.3'} # add more as needed
root.findall('eol:identifier', namespaces=namespaces)
Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the eol: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.eol.org/transfer/content/0.3}identifier instead.
If you can switch to the lxml library things are better; that library supports the same ElementTree API, but collects namespaces for you in a .nsmap attribute on elements.
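To make the difference concrete, here is a small sketch against a trimmed-down version of test3.xml. The closing tag and the reduced set of namespace declarations are assumptions made so the string parses on its own; the prefix eol is an arbitrary choice:

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for test3.xml (only the default namespace is kept)
XML = (
    '<response xmlns="http://www.eol.org/transfer/content/0.3">'
    '<identifier>5e1882d822ec530069d6d29e28944369</identifier>'
    '<description>This is a paragraph about a shark.</description>'
    '</response>'
)
EOL = 'http://www.eol.org/transfer/content/0.3'

root = ET.fromstring(XML)

# Unqualified name: the element lives in the default namespace, so no match
unqualified = root.findtext('identifier')

# Fully qualified name in {uri}tag form
qualified = root.findtext('{%s}identifier' % EOL)

# Prefix form, resolved through the dictionary passed in
namespaces = {'eol': EOL}
description = root.findtext('eol:description', namespaces=namespaces)

print(unqualified, qualified, description)
```

The string version in the question worked only because that snippet had no xmlns declaration, not because it was parsed from a string.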

Have you thought of trying BeautifulSoup to parse your XML with Python?
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing%20XML
There is some good documentation and a healthy online community, so support is quite good.

How do I find an xml node that does not have an attribute

I'm using Python 2.7 and trying to parse the XML below. What I'm trying to do is create a Python array of all genres that have a language attribute, together with an array of those that have no language attribute.
I'm using the xml.etree.cElementTree module, imported as ET.
I know I can find the XML section where the language attribute is "fr" with this syntax:
tree=ET.ElementTree(file='popups.xml')
root = tree.getroot()
for x in root.findall('alt[@{http://www.w3.org/XML/1998/namespace}lang="fr"]/alt'):
    print(x.text)
I don't really understand why I can't use xml:lang rather than {http://www.w3.org/XML/1998/namespace}lang, but the above seems to work on Ubuntu 12.04.
What I'm trying to find out is the "not" syntax, where the XML section does NOT have any language attribute.
Does anybody have any thoughts on how to achieve this?
<genre>
<alt>
<alt genre="easy listening">lounge</alt>
<alt genre="alternative">ska</alt>
</alt>
<alt xml:lang="fr">
<alt genre="gospel">catholique</alt>
</alt>
</genre>

You need to use the full QName in your xpath because the stdlib ElementTree does not have a way of registering a prefix. I usually use a helper function to create QNames:
def qname(prefix, element, map={'xml':'http://www.w3.org/XML/1998/namespace'}):
    return "{{{}}}{}".format(map[prefix], element)
The ElementTree implementation in the standard library does not support enough XPath to do what you want easily. However, the spec for xml:lang specifies that the value of this attribute is inherited by everything that contains it, sort of like xml:base or xmlns namespace declarations. So as an alternative, we can make the language setting explicit on all elements:
xml_lang = qname('xml', 'lang')
def set_xml_lang(root, defaultlang=''):
    # Copy the inherited xml:lang value onto every descendant element
    for item in root:
        try:
            lang = item.attrib[xml_lang]
        except KeyError:
            item.set(xml_lang, defaultlang)
            lang = defaultlang
        set_xml_lang(item, lang)
set_xml_lang(root)
namespaces = {'xml':'http://www.w3.org/XML/1998/namespace'}
# Every element in root now has an xml:lang attribute
# so XPath is easy now:
alts_with_no_lang = root.findall('alt[@{{{xml}}}lang=""]'.format(**namespaces))
If you're willing to use lxml, your use of "lang" can be much more robust because it follows the complete XPath 1.0 spec. In particular, you can use the lang() function:
import lxml.etree as ET
root = ET.fromstring(xml)
print(root.xpath('//alt[lang("fr")]'))
As a bonus, it will have proper lang() semantics, like case-insensitivity and being smart about language regions (e.g., lang('en') will be true for xml:lang="en-US" too).
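Unfortunately you can't use lang() to determine the language of a node. You need to find the nearest xml:lang attribute on the node or its ancestors and use that. A predicate on a parenthesized expression counts in document order, so the nearest one is the last:
mylang = node.xpath('(ancestor-or-self::*/@xml:lang)[last()]')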
Putting it all together, to match nodes that have no language:
tree.xpath('//alt[not((ancestor-or-self::*/@xml:lang)[1])]')
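If you would rather stay in the standard library and skip the XPath predicate entirely, the attribute test can be done in plain Python. This is a sketch of a different technique from the two above: it checks only the attribute on each top-level alt group and does not implement xml:lang inheritance. The variable names are made up for illustration:

```python
import xml.etree.ElementTree as ET

XML_LANG = '{http://www.w3.org/XML/1998/namespace}lang'

# The document from the question, inlined so the sketch is self-contained
XML = """<genre>
<alt>
<alt genre="easy listening">lounge</alt>
<alt genre="alternative">ska</alt>
</alt>
<alt xml:lang="fr">
<alt genre="gospel">catholique</alt>
</alt>
</genre>"""

root = ET.fromstring(XML)

with_lang = {}     # language code -> list of genre names
without_lang = []  # genre names from groups with no xml:lang

for group in root.findall('alt'):
    names = [alt.text for alt in group.findall('alt')]
    lang = group.get(XML_LANG)  # None when the attribute is absent
    if lang is None:
        without_lang.extend(names)
    else:
        with_lang.setdefault(lang, []).extend(names)

print(without_lang)
print(with_lang)
```

Element.get() returning None for a missing attribute is what stands in for the "not" predicate here.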

I don't really understand why I can't use xml:lang rather than {http://www.w3.org/XML/1998/namespace}lang, but the above seems to work on Ubuntu 12.04
What you are trying to do will be easier using the xpath method (which is not available in cElementTree), which among other things will read the namespace labels from the root element of your document, so you can ask this:
import lxml.etree as et
root = et.parse(open('mydoc.xml')).getroot()
for x in root.xpath('alt[not(@xml:lang)]/alt'):
    print(x.text)
The not(@attr) syntax was new to me, but a Google search for "xpath find element without attribute" was tremendously useful.
