Unable to retrieve comment from XML due to namespace prefix issue (Python)

I have the following "example.xml" document, and my goal is to retrieve the comments for each tag in it. I was able to retrieve the comments with the approach from this answer when there are no namespace prefixes, but with the prefixed document below I get the errors shown further down.
<?xml version="1.0" encoding="UTF-8"?>
<abc:root xmlns:abc="http://com/example/URL" xmlns:abcdef="http://com/another/example/URL" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<tag1>
<tag2>
<tag3>tag3<!-- comment = “this is the tag3.1 comment”--></tag3>
<tag4>tag4<!-- comment = “this is the tag4.1 comment”--></tag4>
</tag2>
</tag1>
<tag1>
<tag2>
<tag3>tag3<!-- comment = “this is the tag3.2 comment”--></tag3>
<tag4>tag4<!-- comment = “this is the tag4.2 comment”--></tag4>
</tag2>
</tag1>
</abc:root>
I've tried to go through two options, both resulting in errors.
I'm essentially iterating through each node of the document and checking for the comment associated. The code is as follows:
from lxml import etree
import os
tree = etree.parse("example.xml")
rootXML = tree.getroot()
print(rootXML.nsmap)
for Node in tree.xpath('//*'):
    elements = tree.xpath(tree.getpath(Node), rootXML.nsmap)
    basename = os.path.basename(tree.getpath(Node))
    for tag in elements:
        comment = tag.xpath('{0}/comment()'.format(tree.getpath(Node)))
        print(tree.getpath(Node))
        print(comment)
Executing this code however, gives me the following error:
TypeError: xpath() takes exactly 1 positional argument (2 given)
I've also tried to follow this answer and define the namespace within the xpath. In doing so, my code becomes:
from lxml import etree
import os
tree = etree.parse("example.xml")
rootXML = tree.getroot()
print(rootXML.nsmap)
for Node in tree.xpath('//*'):
    elements = tree.xpath(tree.getpath(Node), namespaces={rootXML.nsmap})
    basename = os.path.basename(tree.getpath(Node))
    for tag in elements:
        comment = tag.xpath('{0}/comment()'.format(tree.getpath(Node)))
        print(tree.getpath(Node))
        print(comment)
where the only change is replacing elements = tree.xpath(tree.getpath(Node), rootXML.nsmap) with elements = tree.xpath(tree.getpath(Node), namespaces={rootXML.nsmap}). However, this then results in the following error at the modified line.
TypeError: unhashable type: 'dict'
EDIT: modified a closing bracket as per one of the answers.

You are missing a closing bracket at the end of this line:
comment = tag.xpath('{0}/comment()'.format(tree.getpath(Node))
Update
Here's a working example:
from lxml import etree
import os
xml = """<?xml version="1.0" encoding="UTF-8"?>
<abc:root xmlns:abc="http://com/example/URL" xmlns:abcdef="http://com/another/example/URL" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<tag1>
<tag2>
<tag3>tag3<!-- comment = “this is the tag3 comment”--></tag3>
<tag4>tag4<!-- comment = “this is the tag4 comment”--></tag4>
</tag2>
</tag1>
<tag1>
<tag2>
<tag3>tag3<!-- comment = “this is the tag3 comment”--></tag3>
<tag4>tag4<!-- comment = “this is the tag4 comment”--></tag4>
</tag2>
</tag1>
</abc:root>""".encode('utf-8')
rootElement = etree.fromstring(xml)
rootTree = rootElement.getroottree()
print(rootElement.nsmap)
for Node in rootTree.xpath('//*'):
    elements = rootTree.xpath(rootTree.getpath(Node), namespaces=rootElement.nsmap)
    basename = os.path.basename(rootTree.getpath(Node))
    for tag in elements:
        comment = tag.xpath('{0}/comment()'.format(rootTree.getpath(Node)), namespaces=rootElement.nsmap)
        print(rootTree.getpath(Node))
        print(comment)
The main issue was passing the namespaces to xpath() as a positional argument, when they must be given using the namespaces keyword argument. The other issue was calling methods on an _Element that are only available on an _ElementTree, and vice versa (getpath(), for example, is an _ElementTree method).
Also, in your second example you write namespaces={rootXML.nsmap}. rootXML.nsmap is already a dictionary, so it needs no curly braces. That syntax does not create a dictionary anyway; it creates a set, and a dict cannot be a member of a set because it is not hashable, hence the error.
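As a side note, the comments can also be collected without building an absolute path for each node. The following is only a sketch, assuming a trimmed-down copy of the question's document with made-up comment text; the name comments_by_path is illustrative and not from the original post. It relies on the relative XPath step comment(), which needs no namespace mapping because it contains no prefixes.

```python
from lxml import etree

# Trimmed-down version of the question's document (comment text is illustrative).
xml = b"""<abc:root xmlns:abc="http://com/example/URL">
  <tag1>
    <tag2>
      <tag3>tag3<!-- first comment --></tag3>
      <tag4>tag4<!-- second comment --></tag4>
    </tag2>
  </tag1>
</abc:root>"""

root = etree.fromstring(xml)
tree = root.getroottree()

comments_by_path = {}
for node in tree.xpath("//*"):
    # "comment()" is a relative step: it selects only comment children of this node.
    texts = [c.text.strip() for c in node.xpath("comment()")]
    if texts:
        # getpath() lives on the tree, not on the element.
        comments_by_path[tree.getpath(node)] = texts

for path, texts in comments_by_path.items():
    print(path, texts)
```

Elements without a comment child are simply skipped, so the dictionary only holds paths that actually carry a comment.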

Related

Remove element in a XML file with Python

I'm a newbie with Python and I'd like to remove the element openingHours and the child elements from the XML.
I have this input
<Root>
<stations>
<station id= "1">
<name>whatever</name>
<openingHours>
<openingHour>
<entrance>main</entrance>
<timeInterval>
<from>05:30</from>
<to>21:30</to>
</timeInterval>
<openingHour/>
<openingHours>
<station/>
<station id= "2">
<name>foo</name>
<openingHours>
<openingHour>
<entrance>main</entrance>
<timeInterval>
<from>06:30</from>
<to>21:30</to>
</timeInterval>
<openingHour/>
<openingHours>
<station/>
<stations/>
<Root/>
I'd like this output
<Root>
<stations>
<station id= "1">
<name>whatever</name>
<station/>
<station id= "2">
<name>foo</name>
<station/>
<stations/>
<Root/>
So far I've tried this, taken from another thread (How to remove elements from XML using Python):
from lxml import etree
doc=etree.parse('stations.xml')
for elem in doc.xpath('//*[attribute::openingHour]'):
    parent = elem.getparent()
    parent.remove(elem)
print(etree.tostring(doc))
However, it doesn't seem to be working.
Thanks
I took your code for a spin but at first Python couldn't agree with the way you composed your XML, wanting the / in the closing tag to be at the beginning (like </...>) instead of at the end (<.../>).
That aside, the reason your code isn't working is because the xpath expression is looking for the attribute openingHour while in reality you want to look for elements called openingHours. I got it to work by changing the expression to //openingHours. Making the entire code:
from lxml import etree
doc=etree.parse('stations.xml')
for elem in doc.xpath('//openingHours'):
    parent = elem.getparent()
    parent.remove(elem)
print(etree.tostring(doc))
You want to remove the tags <openingHours> and not some attribute with name openingHour:
from lxml import etree
doc = etree.parse('stations.xml')
for elem in doc.findall('.//openingHours'):
    parent = elem.getparent()
    parent.remove(elem)
print(etree.tostring(doc))
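To try the removal without a file on disk, here is a minimal self-contained sketch. It assumes a well-formed, shortened version of the input (closing tags written as </...>, as pointed out above) and parses it from a string; the variable remaining is only there to check the result.

```python
from lxml import etree

# Well-formed, shortened version of the question's input.
xml = b"""<Root>
  <stations>
    <station id="1">
      <name>whatever</name>
      <openingHours><openingHour><entrance>main</entrance></openingHour></openingHours>
    </station>
    <station id="2">
      <name>foo</name>
      <openingHours><openingHour><entrance>main</entrance></openingHour></openingHours>
    </station>
  </stations>
</Root>"""

root = etree.fromstring(xml)

# findall() returns a list, so removing elements while looping over it is safe.
for elem in root.findall(".//openingHours"):
    elem.getparent().remove(elem)

# Tags of the children left under each <station>.
remaining = [child.tag for station in root.iter("station") for child in station]
print(remaining)
print(etree.tostring(root).decode())
```

Removing an element through its parent takes the whole subtree with it, so the nested openingHour, entrance and similar children need no separate handling.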

Writing a xml:id attribute with lxml

I'm trying to rebuild a TEI-XML file with lxml.
The beginning of my file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="https://www.ssrq-sds-fds.ch/tei/TEI_Schema_SSRQ.rng"
type="application/xml"
schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="https://www.ssrq-sds-fds.ch/tei/TEI_Schema_SSRQ.rng"
type="application/xml"
schematypens="http://purl.oclc.org/dsdl/schematron"?>
<?xml-stylesheet type="text/css"
href="https://www.ssrq-sds-fds.ch/tei/Textkritik_Version_tei-ssrq.css"?>
<TEI xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns="http://www.tei-c.org/ns/1.0" n=""
xml:id="[To be generated]" <!-- e.g. StAAG_U-17_0007a --> >
The first four lines should not matter too much in my opinion, but I included them for completeness. My problem starts with the TEI-Element.
So my code to copy this looks like this:
NSMAP = {"xml":"http://www.tei-c.org/ns/1.0",
"xi":"http://www.w3.org/2001/XInclude"}
root = et.Element('TEI', n="", nsmap=NSMAP)
root.attrib["id"] = xml_id
root.attrib["xmlns"] = "http://www.tei-c.org/ns/1.0"
The string xml_id is assigned earlier and does not matter for my question. My code returns this line:
<TEI xmlns:xi="http://www.w3.org/2001/XInclude"
n=""
id="StAAG_U-17_0006"
xmlns="http://www.tei-c.org/ns/1.0">
So the only thing that is missing is this xml:id attribute. I found this specification page: https://www.w3.org/TR/xml-id/ and I know it is mentioned to be supported in lxml in its FAQ.
Btw, root.attrib["xml:id"] does not work, as it is not a viable attribute name.
So, does anyone know how I can assign my id to an element's xml:id attribute?
You need to specify that id belongs to the XML namespace (http://www.w3.org/XML/1998/namespace), which is permanently bound to the xml prefix. Try this:
root.attrib["{http://www.w3.org/XML/1998/namespace}id"] = xml_id
Reference: https://www.w3.org/TR/xml-names/#ns-decl
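A fuller sketch of the same idea follows. It rests on one assumption that is not in the original answer: the TEI namespace is meant to be the default namespace (key None in the nsmap), as the xmlns="http://www.tei-c.org/ns/1.0" declaration in the target file suggests, rather than being bound to the xml prefix as in the question's NSMAP. The constant names XML_NS and TEI_NS are illustrative, and the id value is the one from the question's output.

```python
from lxml import etree as et

XML_NS = "http://www.w3.org/XML/1998/namespace"  # always bound to the "xml" prefix
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Assumption: TEI is the default namespace (key None), "xi" is a regular prefix.
NSMAP = {None: TEI_NS, "xi": "http://www.w3.org/2001/XInclude"}

xml_id = "StAAG_U-17_0006"

# The element itself is created in the TEI namespace using {namespace}tag notation.
root = et.Element("{%s}TEI" % TEI_NS, n="", nsmap=NSMAP)

# The attribute is addressed by its namespace URI, not by the literal "xml:id".
root.set("{%s}id" % XML_NS, xml_id)

serialized = et.tostring(root).decode()
print(serialized)
```

With this setup there is no need to set an xmlns attribute by hand; the namespace declarations come from the nsmap when the element is serialized.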

Reading xml with lxml lib getting strange string from xmlns tag

I am writing a program to work on an XML file and change it. But when I try to access any part of it, I get an extra part in the tag name.
My xml file:
<?xml version="1.0" encoding="UTF-8"?>
<Package xmlns="http://soap.sforce.com/2006/04/metadata">
<types>
<members>sbaa__ApprovalChain__c.ExternalID__c</members>
<members>sbaa__ApprovalCondition__c.ExternalID__c</members>
<members>sbaa__ApprovalRule__c.ExternalID__c</members>
<name>CustomField</name>
</types>
<version>40.0</version>
</Package>
And I have my code:
from lxml import etree
import sys
tree = etree.parse('package.xml')
root = tree.getroot()
print( root[0][0].tag )
As output I expect to see members but I get something like this:
{http://soap.sforce.com/2006/04/metadata}members
Why do I see that url and how to stop it from showing up?
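You have defined a default namespace (see Wikipedia and the lxml tutorial). Once declared on an element, it applies to that element and to every unprefixed descendant, so it becomes part of each of their tag names.
If you want to print the tag without the namespace, slice it off: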
tag = root[0][0].tag
print(tag[tag.find('}')+1:])
If you want to remove the namespace from XML, see this question.
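As an alternative to slicing the string, lxml offers etree.QName, which splits a qualified tag into its namespace and local name. The sketch below assumes a shortened copy of the package.xml from the question, parsed from a string rather than from a file.

```python
from lxml import etree

# Shortened copy of the question's package.xml.
xml = b"""<Package xmlns="http://soap.sforce.com/2006/04/metadata">
  <types>
    <members>sbaa__ApprovalChain__c.ExternalID__c</members>
    <name>CustomField</name>
  </types>
  <version>40.0</version>
</Package>"""

root = etree.fromstring(xml)
first_member = root[0][0]

# QName accepts an element and exposes the two halves of its tag.
qname = etree.QName(first_member)
print(first_member.tag)   # full tag in {namespace}localname form
print(qname.namespace)    # the namespace URI only
print(qname.localname)    # the bare element name
```

This avoids hand-written index arithmetic and also works for tags that carry no namespace at all.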

Python -lxml xpath returns empty list

I am reading an xliff file and planning to retrieve specific element. I tried to print all the elements using
from lxml import etree
with open('path\to\file\.xliff', 'r',encoding = 'utf-8') as xml_file:
    tree = etree.parse(xml_file)
    root = tree.getroot()
    for element in root.iter():
        print("child", element)
The output was
child <Element {urn:oasis:names:tc:xliff:document:2.0}segment at 0x6b8f9c8>
child <Element {urn:oasis:names:tc:xliff:document:2.0}source at 0x6b8f908>
When I tried to get the specific element (with the help of many posts here) - source tag
segment = tree.xpath('{urn:oasis:names:tc:xliff:document:2.0}segment')
print(segment)
it returns an empty list. Can someone tell me how to retrieve it properly.
Input :
<?xml version='1.0' encoding='UTF-8'?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0">
<segment id = 1>
<source>
Hello world
</source>
</segment>
<segment id = 2 >
<source>
2nd statement
</source>
</segment>
</xliff>
I want to get the values of segment and its corresponding source
This code,
tree.xpath('{urn:oasis:names:tc:xliff:document:2.0}segment')
is not accepted by lxml ("lxml.etree.XPathEvalError: Invalid expression"), because XPath does not understand the {namespace}tag notation. You need to use findall(), which does.
The following works (in the XML sample, the segment elements are children of xliff):
from lxml import etree
tree = etree.parse("test.xliff") # XML in the question; ill-formed attributes corrected
segment = tree.findall('{urn:oasis:names:tc:xliff:document:2.0}segment')
print(segment)
However, the real XML is apparently more complex (segment is not a direct child of xliff). Then you need to add .// to search the whole tree:
segment = tree.findall('.//{urn:oasis:names:tc:xliff:document:2.0}segment')
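If you would rather stay with xpath(), the usual route is to bind the namespace to a prefix of your own choosing and pass the mapping through the namespaces keyword. This is a sketch under assumptions: the prefix x and the names NS and pairs are arbitrary choices, and the sample is the question's XML with the attribute values quoted so that it parses.

```python
from lxml import etree

# The question's sample with quoted attribute values.
xml = b"""<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0">
  <segment id="1">
    <source>
      Hello world
    </source>
  </segment>
  <segment id="2">
    <source>
      2nd statement
    </source>
  </segment>
</xliff>"""

# XPath has no notion of a default namespace, so it needs an explicit prefix.
NS = {"x": "urn:oasis:names:tc:xliff:document:2.0"}

root = etree.fromstring(xml)

pairs = []
for seg in root.xpath("//x:segment", namespaces=NS):
    # findtext() accepts the same prefix mapping.
    text = seg.findtext("x:source", namespaces=NS).strip()
    pairs.append((seg.get("id"), text))

print(pairs)
```

Each tuple pairs a segment id with the text of its source child, which is what the question asks for.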

XML parsing specific values - Python

I've been attempting to parse a list of xml files. I'd like to print specific values such as the userName value.
<?xml version="1.0" encoding="utf-8"?>
<Drives clsid="{8FDDCC1A-0C3C-43cd-A6B4-71A6DF20DA8C}"
disabled="1">
<Drive clsid="{935D1B74-9CB8-4e3c-9914-7DD559B7A417}"
name="S:"
status="S:"
image="2"
changed="2007-07-06 20:57:37"
uid="{4DA4A7E3-F1D8-4FB1-874F-D2F7D16F7065}">
<Properties action="U"
thisDrive="NOCHANGE"
allDrives="NOCHANGE"
userName=""
cpassword=""
path="\\scratch"
label="SCRATCH"
persistent="1"
useLetter="1"
letter="S"/>
</Drive>
</Drives>
My script is working fine collecting a list of xml files etc. However the below function is to print the relevant values. I'm trying to achieve this as suggested in this post. However I'm clearly doing something incorrectly as I'm getting errors suggesting that elm object has no attribute text. Any help would be appreciated.
Current Code
from lxml import etree as ET
def read_files(files):
    for fi in files:
        doc = ET.parse(fi)
        elm = doc.find('userName')
        print(elm.text)
doc.find looks for a tag with the given name. You are looking for an attribute with the given name.
elm.text is giving you an error because doc.find doesn't find any matching tag, so it returns None, and None has no text attribute.
Read the lxml.etree docs some more, and then try something like this:
doc = ET.parse(fi)
root = doc.getroot()
prop = root.find(".//Properties") # finds the first <Properties> tag anywhere
elm = prop.attrib['userName']
userName is an attribute, not an element. Attributes don't have text nodes attached to them at all.
for el in doc.xpath('//*[@userName]'):
    print(el.attrib['userName'])
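You can take the element by its tag name and then read its attribute (userName is an attribute of Properties). Note that getElementsByTagName and attributes belong to the DOM API in xml.dom.minidom, not to lxml, so the file is parsed with minidom here:
from xml.dom import minidom
def read_files(files):
    for fi in files:
        doc = minidom.parse(fi)
        # Every <Properties> element in the document, in document order
        props = doc.getElementsByTagName('Properties')
        # attributes is a NamedNodeMap; indexing it returns an Attr node
        elm = props[0].attributes['userName']
        print(elm.value)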
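For comparison, here is a self-contained lxml sketch of the same attribute lookup. It assumes a shortened copy of the question's XML parsed from a string; the list names user_names and labels are illustrative. It shows both element.get() and an XPath attribute step.

```python
from lxml import etree

# Shortened copy of the question's XML.
xml = b"""<Drives disabled="1">
  <Drive name="S:" status="S:">
    <Properties action="U" userName="" cpassword="" label="SCRATCH" letter="S"/>
  </Drive>
</Drives>"""

root = etree.fromstring(xml)

# get() reads one attribute and returns None if it is absent.
user_names = [prop.get("userName") for prop in root.iter("Properties")]

# An XPath attribute step returns the attribute values directly as strings.
labels = root.xpath("//Properties/@label")

print(user_names)
print(labels)
```

In the sample the userName attribute is present but empty, so the value read back is an empty string rather than None.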
