Parsing a kml file using lxml - python

I've got a KML file - I'm using the wikipedia 'default' as a sample:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<Placemark>
<name>New York City</name>
<description>New York City</description>
<Point>
<coordinates>-74.006393,40.714172,0</coordinates>
</Point>
</Placemark>
</Document>
</kml>
And I'm trying to extract the coordinates.
Now, I've got a snippet working that embeds the namespace to search:
#!/usr/python/python3.4/bin/python3
from lxml import etree as ET
tree = ET.parse('sample.kml')
root = tree.getroot()
print (root.find('.//{http://www.opengis.net/kml/2.2}coordinates').text)
This works fine.
However having found this:
Parsing XML with namespace in Python via 'ElementTree'
I'm trying to do it via reading the namespace from the document, using 'root.nsmap'.
print (root.nsmap)
Gives me:
{None: 'http://www.opengis.net/kml/2.2'}
So I think I should be able to do this:
print ( root.find('.//coordinates',root.nsmap).text )
Or something very similar, using the None namespace (i.e. the one that has no prefix). But this doesn't work. I get an error when doing it:
AttributeError: 'NoneType' object has no attribute 'text'
I assume that means that my 'find' didn't find anything in this instance.
What am I missing here?

This code,
root.find('.//coordinates', root.nsmap)
does not return anything because no prefix is used. See http://lxml.de/xpathxslt.html#namespaces-and-prefixes.
Below are two options that work.
Define another nsmap with a real prefix as key:
nsmap2 = {"k": root.nsmap[None]}
print (root.find('.//k:coordinates', nsmap2).text)
Don't bother with prefixes. Put the namespace URI inside curly braces ("Clark notation") to form a universal element name:
ns = root.nsmap[None]
print (root.find('.//{{{0}}}coordinates'.format(ns)).text)
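Both options can be sketched end to end on the sample document. This is a stand-in under stated assumptions, not the answer's exact code: it uses the standard library's xml.etree.ElementTree instead of lxml (the find() calls take the same arguments), parses the sample from an inline bytes literal rather than sample.kml, and hardcodes the namespace URI because root.nsmap is lxml-specific. The names KML, KML_NS, by_prefix and by_clark are illustrative.

```python
import xml.etree.ElementTree as ET

# Inline copy of the sample KML from the question (assumption: no file on disk).
KML = b"""<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<Placemark>
<name>New York City</name>
<description>New York City</description>
<Point>
<coordinates>-74.006393,40.714172,0</coordinates>
</Point>
</Placemark>
</Document>
</kml>
"""

# Hardcoded here; with lxml this value would come from root.nsmap[None].
KML_NS = "http://www.opengis.net/kml/2.2"

root = ET.fromstring(KML)

# Option 1: map a real prefix ("k") to the namespace URI and use it in the path.
by_prefix = root.find(".//k:coordinates", {"k": KML_NS}).text

# Option 2: Clark notation, i.e. the URI in curly braces before the local name.
by_clark = root.find(".//{%s}coordinates" % KML_NS).text

# The coordinates string is "longitude,latitude,altitude".
lon, lat, alt = (float(part) for part in by_prefix.split(","))
print(by_prefix, by_clark, lon, lat, alt)
```

Either form finds the same element; the prefix map tends to read better when several paths share one namespace.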

Related

Validate XML with namespaces against Schematron using lxml in Python

I am not able to get lxml Schematron validator to recognize namespaces. Validation works fine in code without namespaces.
This is for Python 3.7.4 and lxml 4.4.0 on MacOS 10.15
Here is the schematron file
<?xml version='1.0' encoding='UTF-8'?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron"
xmlns:ns1="http://foo">
<pattern>
<rule context="//ns1:bar">
<assert test="number(.) = 2">
bar must be 2
</assert>
</rule>
</pattern>
</schema>
and here is the xml file
<?xml version="1.0" encoding="UTF-8"?>
<zip xmlns:ns1="http://foo">
<ns1:bar>3</ns1:bar>
</zip>
here is the python code
from lxml import etree, isoschematron
from plumbum import local
schematron_doc = etree.parse(local.path('rules.sch'))
schematron = isoschematron.Schematron(schematron_doc)
xml_doc = etree.parse(local.path('test.xml'))
is_valid = schematron.validate(xml_doc)
assert not is_valid
What I get: lxml.etree.XSLTParseError: xsltCompilePattern : failed to compile '//ns1:bar'
If I remove ns1 from both the XML file and the Schematron file, the example works perfectly-- no error message.
There must be a trick to registering namespaces in lxml Schematron that I am missing. Has anyone done this?
As it turns out, there is a specific way to register namespaces in Schematron. It is described in the Schematron ISO standard.
It only required a small change to the Schematron file: adding the "ns" element, as follows:
<?xml version='1.0' encoding='UTF-8'?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
<ns uri="http://foo" prefix="ns1"/>
<pattern>
<rule context="//ns1:bar">
<assert test="number(.) = 2">
bar must be 2
</assert>
</rule>
</pattern>
</schema>
I won't remove the question, since there is a dearth of examples of Schematron rules using namespaces. Hopefully it can be helpful to someone.

Writing a xml:id attribute with lxml

I'm trying to rebuild a TEI-XML file with lxml.
The beginning of my file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="https://www.ssrq-sds-fds.ch/tei/TEI_Schema_SSRQ.rng"
type="application/xml"
schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="https://www.ssrq-sds-fds.ch/tei/TEI_Schema_SSRQ.rng"
type="application/xml"
schematypens="http://purl.oclc.org/dsdl/schematron"?>
<?xml-stylesheet type="text/css"
href="https://www.ssrq-sds-fds.ch/tei/Textkritik_Version_tei-ssrq.css"?>
<TEI xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns="http://www.tei-c.org/ns/1.0" n=""
xml:id="[To be generated]" <!-- e.g. StAAG_U-17_0007a --> >
The first four lines should not matter too much in my opinion, but I included them for completeness. My problem starts with the TEI-Element.
So my code to copy this looks like this:
NSMAP = {"xml":"http://www.tei-c.org/ns/1.0",
"xi":"http://www.w3.org/2001/XInclude"}
root = et.Element('TEI', n="", nsmap=NSMAP)
root.attrib["id"] = xml_id
root.attrib["xmlns"] = "http://www.tei-c.org/ns/1.0"
The string xml_id is assigned at some point before and does not matter for my question. My code returns this line:
<TEI xmlns:xi="http://www.w3.org/2001/XInclude"
n=""
id="StAAG_U-17_0006"
xmlns="http://www.tei-c.org/ns/1.0">
So the only thing that is missing is this xml:id attribute. I found this specification page: https://www.w3.org/TR/xml-id/ and I know it is mentioned to be supported in lxml in its FAQ.
Btw, root.attrib["xml:id"] does not work, as it is not a viable attribute name.
So, does anyone know how I can assign my id to an element's xml:id attribute?
You need to specify that id belongs to the XML namespace, the one bound to the reserved xml prefix. Try this:
root.attrib["{http://www.w3.org/XML/1998/namespace}id"] = xml_id
Reference: https://www.w3.org/TR/xml-names/#ns-decl
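The same Clark-notation key can be tried out in isolation. The sketch below is an assumption-based stand-in that uses the standard library's xml.etree.ElementTree rather than lxml, so the serialized root element gets an auto-generated prefix instead of a default namespace declaration; the point is only that the attribute comes out as xml:id. XML_NS, TEI_NS and the sample id value are illustrative names and data taken from the question.

```python
import xml.etree.ElementTree as ET

# The reserved namespace that the "xml" prefix is always bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Sample id taken from the question's output.
xml_id = "StAAG_U-17_0006"

# Put the root element in the TEI namespace via Clark notation.
root = ET.Element("{%s}TEI" % TEI_NS, {"n": ""})

# Setting "{XML namespace}id" is what serializes as xml:id.
root.set("{%s}id" % XML_NS, xml_id)

serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```

No xmlns:xml declaration is needed in the output, because the xml prefix is predefined by the XML Namespaces specification.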

Reading xml with lxml lib getting strange string from xmlns tag

I am writing a program to work on an xml file and change it. But when I try to get to any part of it, I get an extra part in the tag.
My xml file:
<?xml version="1.0" encoding="UTF-8"?>
<Package xmlns="http://soap.sforce.com/2006/04/metadata">
<types>
<members>sbaa__ApprovalChain__c.ExternalID__c</members>
<members>sbaa__ApprovalCondition__c.ExternalID__c</members>
<members>sbaa__ApprovalRule__c.ExternalID__c</members>
<name>CustomField</name>
</types>
<version>40.0</version>
</Package>
And I have my code:
from lxml import etree
import sys
tree = etree.parse('package.xml')
root = tree.getroot()
print( root[0][0].tag )
As output I expect to see members but I get something like this:
{http://soap.sforce.com/2006/04/metadata}members
Why do I see that URL, and how do I stop it from showing up?
You have defined a default namespace (Wikipedia, lxml tutorial). When defined, it applies to every unprefixed child element, and lxml reports each tag as {namespace}name.
If you want to print the tag without the namespace, it's easy:
tag = root[0][0].tag
print(tag[tag.find('}')+1:])
If you want to remove the namespace from XML, see this question.
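The slicing trick above can be wrapped in a small helper and applied to every child. This is a minimal sketch, assuming the standard library's xml.etree.ElementTree as a stand-in for lxml (both report tags in the same {namespace}name form) and an inline copy of the question's file; local_name and PACKAGE are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Inline copy of package.xml from the question (assumption: no file on disk).
PACKAGE = b"""<?xml version="1.0" encoding="UTF-8"?>
<Package xmlns="http://soap.sforce.com/2006/04/metadata">
<types>
<members>sbaa__ApprovalChain__c.ExternalID__c</members>
<members>sbaa__ApprovalCondition__c.ExternalID__c</members>
<members>sbaa__ApprovalRule__c.ExternalID__c</members>
<name>CustomField</name>
</types>
<version>40.0</version>
</Package>
"""


def local_name(tag):
    """Return the tag without its '{namespace}' part, if it has one."""
    # rpartition yields the whole string as the last item when '}' is absent.
    return tag.rpartition("}")[2]


root = ET.fromstring(PACKAGE)

# root[0] is <types>; collect the local names of its children.
names = [local_name(child.tag) for child in root[0]]
print(names)
```

This only changes what is printed; the elements in the tree stay in their namespace.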

lxml: How do I search for fields without adding a xmlns (localhost) path to each search term?

I'm trying to locate fields in a SOAP xml file using lxml (3.6.0)
...
<soap:Body>
<Request xmlns="http://localhost/">
<Test>
<field1>hello</field1>
<field2>world</field2>
</Test>
</Request>
</soap:Body>
...
In this example I'm trying to find field1 and field2.
I need to add the namespace URI to the search term to find the field:
print (myroot.find(".//{http://localhost/}field1").tag) # prints '{http://localhost/}field1'
Without it, I don't find anything:
print (myroot.find("field1").tag) # find() returns None, so .tag raises AttributeError
Is there any other way to search for the field tag (here field1) without giving path info?
Full example below:
from lxml import etree
example = """<?xml version="1.0" encoding="utf-8"?><soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body><Request xmlns="http://localhost/">
<Test><field1>hello</field1><field2>world</field2></Test>
</Request></soap:Body></soap:Envelope>
"""
myroot = etree.fromstring(example)
# this works
print (myroot.find(".//{http://localhost/}field1").text)
print (myroot.find(".//{http://localhost/}field2").text)
# this fails
print (myroot.find(".//field1").text)
print (myroot.find("field1").text)
Comment: The input of the SOAP request is given; I can't change any of it in real life to make things easier.
There is a way to ignore namespaces when selecting elements with XPath, but that isn't good practice. Namespaces are there for a reason. Anyway, there is a cleaner way to reference an element in a namespace: map a prefix to the namespace URI and use that prefix, instead of writing out the full namespace URI every time:
.....
>>> ns = {'d': 'http://localhost/'}
>>> print (myroot.find(".//d:field1", ns).text)
hello
>>> print (myroot.find(".//d:field2", ns).text)
world
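The answer mentions a way to ignore the namespace without showing one. One such form, sketched here with the standard library's xml.etree.ElementTree (an assumption: this stand-in needs Python 3.8 or later, where find() accepts the {*} wildcard), matches a local name in any namespace. It sits next to the prefix-map form for comparison; SOAP, ns, mapped and wildcard are illustrative names.

```python
import xml.etree.ElementTree as ET

# The SOAP document from the question, as bytes.
SOAP = b"""<?xml version="1.0" encoding="utf-8"?><soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body><Request xmlns="http://localhost/">
<Test><field1>hello</field1><field2>world</field2></Test>
</Request></soap:Body></soap:Envelope>
"""

myroot = ET.fromstring(SOAP)

# Prefix map, as in the answer: "d" is an arbitrary prefix chosen by the caller.
ns = {"d": "http://localhost/"}
mapped = (myroot.find(".//d:field1", ns).text,
          myroot.find(".//d:field2", ns).text)

# Wildcard form: "{*}name" matches that local name in any namespace.
wildcard = (myroot.find(".//{*}field1").text,
            myroot.find(".//{*}field2").text)

print(mapped, wildcard)
```

The wildcard is convenient for quick scripts, but it will also match same-named elements from unrelated namespaces, which is the risk the answer warns about.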

Remove namespace with xmltodict in Python

xmltodict converts XML to a Python dictionary. It supports namespaces. I can follow the example on the homepage and successfully remove a namespace. However, I cannot remove the namespace from my XML and cannot identify why. Here is my XML:
<?xml version="1.0" encoding="UTF-8"?>
<status xmlns:mystatus="http://localhost/mystatus">
<section1
mystatus:field1="data1"
mystatus:field2="data2" />
<section2
mystatus:lineA="outputA"
mystatus:lineB="outputB" />
</status>
And using:
xmltodict.parse(xml,process_namespaces=True,namespaces={'http://localhost/mystatus':None})
I get:
OrderedDict([(u'status', OrderedDict([(u'section1', OrderedDict([(u'@http://localhost/mystatus:field1', u'data1'), (u'@http://localhost/mystatus:field2', u'data2')])), (u'section2', OrderedDict([(u'@http://localhost/mystatus:lineA', u'outputA'), (u'@http://localhost/mystatus:lineB', u'outputB')]))]))])
instead of:
OrderedDict([(u'status', OrderedDict([(u'section1', OrderedDict([(u'field1', u'data1'), (u'field2', u'data2')])), (u'section2', OrderedDict([(u'lineA', u'outputA'), (u'@lineB', u'outputB')]))]))])
Am I making some simple mistake, or is there something about my XML that prevents the process_namespace modification from working correctly?
xmltodict is based on expat, so namespaces should be applied to the element name, not to attribute names:
<?xml version="1.0" encoding="UTF-8"?>
<status xmlns:mystatus="http://localhost/mystatus">
<mystatus:section1 field1="data1" field2="data2" />
<mystatus:section2 lineA="outputA" lineB="outputB" />
</status>
When parsed with:
foo = xmltodict.parse(xml,
process_namespaces=True,
namespaces={'http://localhost/mystatus':None})
outputs:
OrderedDict([(u'status', OrderedDict([(u'section1', OrderedDict([(u'@field1', u'data1'), (u'@field2', u'data2')])), (u'section2', OrderedDict([(u'@lineA', u'outputA'), (u'@lineB', u'outputB')]))]))])
Accessing it is easy:
# Get attribute 'lineA' from element 'section2' inside element 'status'
>>> foo.get('status').get('section2').get('@lineA')
u'outputA'
Attribute namespaces are only required when you have multiple attributes of the same name (e.g. multiple ids or multiple prices), in which case I couldn't get expat or xmltodict to parse it correctly. YMMV though.
