Writing a xml:id attribute with lxml

I'm trying to rebuild a TEI-XML file with lxml.
The beginning of my file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="https://www.ssrq-sds-fds.ch/tei/TEI_Schema_SSRQ.rng"
type="application/xml"
schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="https://www.ssrq-sds-fds.ch/tei/TEI_Schema_SSRQ.rng"
type="application/xml"
schematypens="http://purl.oclc.org/dsdl/schematron"?>
<?xml-stylesheet type="text/css"
href="https://www.ssrq-sds-fds.ch/tei/Textkritik_Version_tei-ssrq.css"?>
<TEI xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns="http://www.tei-c.org/ns/1.0" n=""
xml:id="[To be generated]" <!-- e.g. StAAG_U-17_0007a --> >
The first four lines probably do not matter, but I included them for completeness. My problem starts with the TEI element.
My code to reproduce it looks like this:
NSMAP = {"xml":"http://www.tei-c.org/ns/1.0",
"xi":"http://www.w3.org/2001/XInclude"}
root = et.Element('TEI', n="", nsmap=NSMAP)
root.attrib["id"] = xml_id
root.attrib["xmlns"] = "http://www.tei-c.org/ns/1.0"
The string xml_id is assigned earlier and does not matter for my question. My code produces this start tag:
<TEI xmlns:xi="http://www.w3.org/2001/XInclude"
n=""
id="StAAG_U-17_0006"
xmlns="http://www.tei-c.org/ns/1.0">
So the only thing missing is the xml: prefix on the id attribute. I found the specification at https://www.w3.org/TR/xml-id/, and the lxml FAQ mentions that xml:id is supported.
By the way, root.attrib["xml:id"] does not work, because it is not a valid attribute name.
Does anyone know how I can assign my id to an element's xml:id attribute?

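The id attribute has to be placed in the XML namespace, http://www.w3.org/XML/1998/namespace, which is permanently bound to the xml prefix. In lxml you write the attribute name in Clark notation, {namespace-URI}localname. Try this: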
root.attrib["{http://www.w3.org/XML/1998/namespace}id"] = xml_id
Reference: https://www.w3.org/TR/xml-names/#ns-decl
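A minimal sketch of the whole start tag, assuming the goal is the TEI element from the question. Besides the namespaced id attribute, it declares the TEI namespace as the default namespace through nsmap (key None) instead of writing an xmlns attribute by hand. The constant names and the sample id value are illustrative choices, not part of the original code.

```python
from lxml import etree

TEI_NS = "http://www.tei-c.org/ns/1.0"
XI_NS = "http://www.w3.org/2001/XInclude"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# None as the key declares the default namespace (xmlns="...").
nsmap = {None: TEI_NS, "xi": XI_NS}

# The element itself lives in the TEI namespace, so its tag uses Clark notation.
root = etree.Element("{%s}TEI" % TEI_NS, nsmap=nsmap)
root.set("n", "")

# xml:id is the attribute "id" in the reserved XML namespace.
root.set("{%s}id" % XML_NS, "StAAG_U-17_0007a")

serialized = etree.tostring(root, encoding="unicode")
print(serialized)
```

The xml prefix is reserved, so it needs no entry in nsmap; lxml writes it out as xml:id on its own.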

Related

Find Placemarks in kml file

I want to find all the Placemarks in a kml file:
from lxml import etree
doc = etree.parse(filename)
for elem in doc.findall('<Placemark>'):
    print(elem.find("<Placemark>").text)
This doesn't work, i.e. it doesn't find anything, I think because each Placemark is unique in that each has its own id, e.g.:
<Placemark id="ID_09795">
<Placemark id="ID_15356">
<Placemark id="ID_64532">
How do I do this?
Edit: changed code based on @ScottHunter's comment:
placemark_list = doc.findall("Placemark")
print ("length:" + str(len(placemark_list)))
for placemark in placemark_list:
    print(placemark.text)
length is 0

It's hard to tell without seeing the full file, but try something like this
placemark_list = doc.xpath("//*[local-name()='Placemark']")
print(len(placemark_list))
and see if it works.
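If the file declares the usual KML default namespace, which is an assumption here because the question does not show the root element, the empty result comes from the namespace rather than from the id attributes. A small sketch with a made-up two-placemark document, using a prefix map instead of local-name():

```python
from lxml import etree

# Hypothetical sample; the real file's root element is not shown in the question.
kml = b"""<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark id="ID_09795"><name>First</name></Placemark>
    <Placemark id="ID_15356"><name>Second</name></Placemark>
  </Document>
</kml>"""

root = etree.fromstring(kml)
ns = {"kml": "http://www.opengis.net/kml/2.2"}

# An unqualified name only matches elements that are in no namespace.
unqualified = root.findall(".//Placemark")

# A prefixed name is resolved through the mapping passed as namespaces.
qualified = root.findall(".//kml:Placemark", namespaces=ns)
ids = [placemark.get("id") for placemark in qualified]

print(len(unqualified), ids)
```

The prefix kml is arbitrary; only the URI it maps to has to match the document.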

Unable to retrieve comment from XML due to namespace prefix issue Python

I've got the following "example.xml" document where my main goal is to be able to retrieve the comments for each tag in the document. Note, I've been able to retrieve the comments thanks to this answer, where there are no namespace prefixes, but given this, I'm getting the below errors.
<?xml version="1.0" encoding="UTF-8"?>
<abc:root xmlns:abc="http://com/example/URL" xmlns:abcdef="http://com/another/example/URL" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<tag1>
<tag2>
<tag3>tag3<!-- comment = “this is the tag3.1 comment”--></tag3>
<tag4>tag4<!-- comment = “this is the tag4.1 comment”--></tag4>
</tag2>
</tag1>
<tag1>
<tag2>
<tag3>tag3<!-- comment = “this is the tag3.2 comment”--></tag3>
<tag4>tag4<!-- comment = “this is the tag4.2 comment”--></tag4>
</tag2>
</tag1>
</abc:root>
I've tried to go through two options, both resulting in errors.
I'm essentially iterating through each node of the document and checking for the comment associated. The code is as follows:
from lxml import etree
import os
tree = etree.parse("example.xml")
rootXML = tree.getroot()
print(rootXML.nsmap)
for Node in tree.xpath('//*'):
    elements = tree.xpath(tree.getpath(Node), rootXML.nsmap)
    basename = os.path.basename(tree.getpath(Node))
    for tag in elements:
        comment = tag.xpath('{0}/comment()'.format(tree.getpath(Node)))
        print(tree.getpath(Node))
        print(comment)
Executing this code however, gives me the following error:
TypeError: xpath() takes exactly 1 positional argument (2 given)
I've also tried to follow this answer and define the namespace within the xpath. In doing so, my code becomes:
from lxml import etree
import os
tree = etree.parse("example.xml")
rootXML = tree.getroot()
print(rootXML.nsmap)
for Node in tree.xpath('//*'):
    elements = tree.xpath(tree.getpath(Node), namespaces={rootXML.nsmap})
    basename = os.path.basename(tree.getpath(Node))
    for tag in elements:
        comment = tag.xpath('{0}/comment()'.format(tree.getpath(Node)))
        print(tree.getpath(Node))
        print(comment)
where the only change is replacing elements = tree.xpath(tree.getpath(Node), rootXML.nsmap) with elements = tree.xpath(tree.getpath(Node), namespaces={rootXML.nsmap}). However, this then results in the following error at the modified line.
TypeError: unhashable type: 'dict'
EDIT: modified a closing bracket as per one of the answers.

You are missing a closing bracket at the end of this line:
comment = tag.xpath('{0}/comment()'.format(tree.getpath(Node))
Update
Here's a working example:
from lxml import etree
import os
xml = """<?xml version="1.0" encoding="UTF-8"?>
<abc:root xmlns:abc="http://com/example/URL" xmlns:abcdef="http://com/another/example/URL" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<tag1>
<tag2>
<tag3>tag3<!-- comment = “this is the tag3 comment”--></tag3>
<tag4>tag4<!-- comment = “this is the tag4 comment”--></tag4>
</tag2>
</tag1>
<tag1>
<tag2>
<tag3>tag3<!-- comment = “this is the tag3 comment”--></tag3>
<tag4>tag4<!-- comment = “this is the tag4 comment”--></tag4>
</tag2>
</tag1>
</abc:root>""".encode('utf-8')
rootElement = etree.fromstring(xml)
rootTree = rootElement.getroottree()
print(rootElement.nsmap)
for Node in rootTree.xpath('//*'):
    elements = rootTree.xpath(rootTree.getpath(Node), namespaces=rootElement.nsmap)
    basename = os.path.basename(rootTree.getpath(Node))
    for tag in elements:
        comment = tag.xpath('{0}/comment()'.format(rootTree.getpath(Node)), namespaces=rootElement.nsmap)
        print(rootTree.getpath(Node))
        print(comment)
The main issue was passing the namespaces to xpath() as a second positional argument; they have to be given through the namespaces keyword argument. The other issue was trying to call methods on an _Element when they can only be called on an _ElementTree, and vice versa.
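A shorter variant of the same idea, offered as a sketch rather than a replacement for the code above: it passes the prefix map by keyword once and then asks each element for its own comment children with a relative path. The trimmed sample document and the variable names are illustrative.

```python
from lxml import etree

xml = b"""<abc:root xmlns:abc="http://com/example/URL">
  <tag1>
    <tag3>tag3<!-- first comment --></tag3>
    <tag4>tag4<!-- second comment --></tag4>
  </tag1>
</abc:root>"""

root = etree.fromstring(xml)

# The prefix map goes in by keyword; xpath() accepts only one positional argument.
leaves = root.xpath("/abc:root/tag1/*", namespaces=root.nsmap)

comments = {}
for node in leaves:
    # "comment()" is relative to node, so it selects that element's own comments.
    comments[node.tag] = [c.text.strip() for c in node.xpath("comment()")]

print(comments)
```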
Also, in your second example you write namespaces={rootXML.nsmap}. rootXML.nsmap is already a dictionary, so it needs no curly braces. Wrapping it in braces does not create a dictionary; it tries to create a set containing the dictionary, which is why Python complains that the thing you're trying to put in it is not hashable.
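The set-versus-dictionary point can be checked without lxml. A small sketch in plain Python; the variable names are illustrative:

```python
nsmap = {"abc": "http://com/example/URL"}

# Braces around a single expression build a set, and set members must be hashable.
try:
    wrapped = {nsmap}
except TypeError as exc:
    error_message = str(exc)
else:
    error_message = None

# Passing the dictionary itself is what the namespaces keyword expects.
kwargs = {"namespaces": nsmap}

print(error_message)
```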

Reading xml with lxml lib getting strange string from xmlns tag

I am writing a program that reads an XML file and changes it, but whenever I access any part of it I get an extra string in front of the tag name.
My xml file:
<?xml version="1.0" encoding="UTF-8"?>
<Package xmlns="http://soap.sforce.com/2006/04/metadata">
<types>
<members>sbaa__ApprovalChain__c.ExternalID__c</members>
<members>sbaa__ApprovalCondition__c.ExternalID__c</members>
<members>sbaa__ApprovalRule__c.ExternalID__c</members>
<name>CustomField</name>
</types>
<version>40.0</version>
</Package>
And I have my code:
from lxml import etree
import sys
tree = etree.parse('package.xml')
root = tree.getroot()
print( root[0][0].tag )
As output I expect to see members but I get something like this:
{http://soap.sforce.com/2006/04/metadata}members
Why do I see that URL, and how can I stop it from showing up?

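Your document declares a default namespace (see Wikipedia and the lxml tutorial). It applies to the Package element and to every unprefixed element inside it, and lxml reports such tags as {namespace-URI}localname.
If you want to print the tag without the namespace, strip everything up to the closing brace: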
tag = root[0][0].tag
print(tag[tag.find('}')+1:])
If you want to remove the namespace from XML, see this question.
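As an alternative to slicing the string, lxml's QName helper splits a tag into its namespace and local name. A sketch using a shortened copy of the package.xml from the question:

```python
from lxml import etree

xml = b"""<Package xmlns="http://soap.sforce.com/2006/04/metadata">
  <types>
    <members>sbaa__ApprovalChain__c.ExternalID__c</members>
    <name>CustomField</name>
  </types>
</Package>"""

root = etree.fromstring(xml)
first = root[0][0]

# QName accepts an element (or a tag string) and exposes both parts of the name.
qname = etree.QName(first)

print(qname.namespace)
print(qname.localname)
```

This leaves the document untouched; only the printed name changes.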

Parsing a kml file using lxml

I've got a KML file - I'm using the wikipedia 'default' as a sample:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<Placemark>
<name>New York City</name>
<description>New York City</description>
<Point>
<coordinates>-74.006393,40.714172,0</coordinates>
</Point>
</Placemark>
</Document>
</kml>
And I'm trying to extract the coordinates.
Now, I've got a snippet working that embeds the namespace to search:
#!/usr/python/python3.4/bin/python3
from lxml import etree as ET
tree = ET.parse('sample.kml')
root = tree.getroot()
print (root.find('.//{http://www.opengis.net/kml/2.2}coordinates').text)
This works fine.
However having found this:
Parsing XML with namespace in Python via 'ElementTree'
I'm trying to do it via reading the namespace from the document, using 'root.nsmap'.
print (root.nsmap)
Gives me:
{None: '{http://www.opengis.net/kml/2.2}'}
So I think I should be able to do this:
print ( root.find('.//coordinates',root.nsmap).text )
Or something very similar, using the None namespace (i.e. the one that has no prefix). But this doesn't work; I get an error:
AttributeError: 'NoneType' object has no attribute 'text'
I assume that means that my 'find' didn't find anything in this instance.
What am I missing here?

This code,
root.find('.//coordinates', root.nsmap)
does not return anything because no prefix is used. See http://lxml.de/xpathxslt.html#namespaces-and-prefixes.
Below are two options that work.
Define another nsmap with a real prefix as key:
nsmap2 = {"k": root.nsmap[None]}
print (root.find('.//k:coordinates', nsmap2).text)
Don't bother with prefixes. Put the namespace URI inside curly braces ("Clark notation") to form a universal element name:
ns = root.nsmap[None]
print (root.find('.//{{{0}}}coordinates'.format(ns)).text)
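Both options can be tried side by side on the sample from the question. This sketch parses the KML from a byte string instead of sample.kml and additionally splits the coordinates into numbers; the prefix k and the variable names are arbitrary choices.

```python
from lxml import etree

kml = b"""<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <Point>
        <coordinates>-74.006393,40.714172,0</coordinates>
      </Point>
    </Placemark>
  </Document>
</kml>"""

root = etree.fromstring(kml)
uri = root.nsmap[None]  # URI of the default namespace

# Option 1: bind the URI to a real prefix and use that prefix in the path.
via_prefix = root.find(".//k:coordinates", {"k": uri}).text

# Option 2: Clark notation, no prefix map needed.
via_clark = root.find(".//{%s}coordinates" % uri).text

# KML stores a point as longitude,latitude,altitude.
lon, lat, alt = (float(value) for value in via_prefix.strip().split(","))
print(lon, lat, alt)
```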

List only one category Python xml

I am trying to write a Python program that uses the DOM to read an XML file and print only the url nodes whose category attribute is "fun".
<?xml version="1.0" encoding="ISO-8859-1"?>
<website>
<url category="fun">
<title>Fun world</title>
<author>Jack</author>
<year>2010</year>
<price>100.00</price>
</url>
<url category="entertainment">
<title>Fun world</title>
<author>Jack</author>
<year>2010</year>
<price>100.00</price>
</url>
</website>
I couldn't select the url elements that have category="fun".
I tried this code:
for n in dom.getElementsByTagName('url'):
    s = n.attribute['category']
    if (s.value == "fun"):
        print n.toxml()
Can you help me debug my code?

nb: One of your tags opens "Website" and attempts to close "website" - so you'll want to fix that one...
You've mentioned lxml.
from lxml import etree as et
root = et.fromstring(xml)
fun = root.xpath('/website/url[@category="fun"]')
for node in fun:
    print(et.tostring(node))

Use getAttribute:
for n in dom.getElementsByTagName('url'):
    if (n.getAttribute('category') == "fun"):
        print(n.toxml())
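A self-contained version of the getAttribute approach, as a sketch that parses a shortened copy of the sample from a string. The second title is changed to "Other world" here so the two entries can be told apart; that change and the variable names are not from the question.

```python
from xml.dom import minidom

xml = """<website>
  <url category="fun"><title>Fun world</title></url>
  <url category="entertainment"><title>Other world</title></url>
</website>"""

dom = minidom.parseString(xml)

# getAttribute returns the attribute value as a string ("" if it is absent).
fun_urls = [
    node for node in dom.getElementsByTagName("url")
    if node.getAttribute("category") == "fun"
]

# The mapping is called attributes (plural); each entry is an Attr with .value.
categories = [node.attributes["category"].value for node in fun_urls]

titles = [node.getElementsByTagName("title")[0].firstChild.data for node in fun_urls]
print(titles)
```

The original attempt failed on n.attribute, which does not exist on minidom nodes; n.attributes does.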
