How to comment out an XML Element (using minidom DOM implementation) - python

I would like to comment out a specific XML element in an xml file. I could just remove the element, but I would prefer to leave it commented out, in case it's needed later.
The code I use at the moment that removes the element looks like this:
from xml.dom import minidom
doc = minidom.parse(myXmlFile)
for element in doc.getElementsByTagName('MyElementName'):
    if element.getAttribute('name') in ['AttribName1', 'AttribName2']:
        element.parentNode.removeChild(element)
f = open(myXmlFile, "w")
f.write(doc.toxml())
f.close()
I would like to modify this so that it comments the element out rather than deleting it.

The following solution does exactly what I want.
from xml.dom import minidom
doc = minidom.parse(myXmlFile)
for element in doc.getElementsByTagName('MyElementName'):
    if element.getAttribute('name') in ['AttrName1', 'AttrName2']:
        parentNode = element.parentNode
        parentNode.insertBefore(doc.createComment(element.toxml()), element)
        parentNode.removeChild(element)
f = open(myXmlFile, "w")
f.write(doc.toxml())
f.close()
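
The same idea can be sketched as a small function that works on an XML string, which makes it easy to try without touching a file. The helper name comment_out and the sample document are assumptions for illustration, and replaceChild is used here in place of the insertBefore/removeChild pair. One caveat: minidom refuses to serialize a comment whose text contains "--", so an element with such content would need extra handling.

```python
from xml.dom import minidom


def comment_out(xml_text, tag, names):
    """Return the document element as XML, with matching elements commented out.

    Hypothetical helper: `tag` is the element name to look for and `names`
    the values of the `name` attribute that should be commented out.
    """
    doc = minidom.parseString(xml_text)
    for element in doc.getElementsByTagName(tag):
        if element.getAttribute("name") in names:
            # Wrap the element's serialized form in a comment node
            # and swap it in for the element itself.
            comment = doc.createComment(element.toxml())
            element.parentNode.replaceChild(comment, element)
    return doc.documentElement.toxml()


sample = '<root><MyElementName name="AttrName1"/><MyElementName name="Other"/></root>'
result = comment_out(sample, "MyElementName", ["AttrName1"])
print(result)
```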

You can do it with BeautifulSoup: read the target tag, create a corresponding comment tag, and replace the target tag with it.
For example, to create the comment tag:
from BeautifulSoup import BeautifulSoup
hello = "<!--Comment tag-->"
commentSoup = BeautifulSoup(hello)

Related

Is there a way to return the value for a tag from a XML based on a specific path in python?

I have this XML
<Body>
    <Batch_Number>2000</Batch_Number>
    <Total_No_Of_Batches>12312</Total_No_Of_Batches>
    <requestNo>1923</requestNo>
    <Parent1>
        <Parent2>
            <Parent3>
                <lastModifiedDateTime>2022-11-11T11:07:30.000</lastModifiedDateTime>
                <purpose>NeverMore</purpose>
                <endDate>9999-12-31T00:00:00.000</endDate>
                <createdDateTime>2019-06-06T06:32:16.000</createdDateTime>
                <createdOn>2019-06-06T08:32:16.000</createdOn>
                <address2>Forever street 21</address2>
                <externalCode>code123</externalCode>
                <lastModifiedBy>user2.thisUser</lastModifiedBy>
                <lastModifiedOn>2039-06-11T13:07:30.000</lastModifiedOn>
                <lastModifiedBy>MG</lastModifiedBy>
                <PS>1234431</PS>
            </Parent3>
        </Parent2>
    </Parent1>
</Body>
Is there a way to return the value for lastModifiedBy for example where the path has this specific structure :
Body.Parent1.Parent2.Parent3.lastModifiedBy
Ideally, I would like to populate a dictionary with the child tag name and its value, for example:
dict[lastModifiedBy.tag] = lastModifiedBy.text
You can use the xml package from the standard library to work with XML files.
from xml.etree import ElementTree as ET
tree = ET.parse("d.xml") # our xml file
root = tree.getroot()
You can then access child elements by index, or iterate over root as if it were a list:
for i in root:
    print(i)
An XML element may have more than one child with the same tag name (you even have two lastModifiedBy elements in Parent3). This is why elements behave like lists, so you shouldn't try to treat them like a dictionary.
I think you need to use XPath. Like so:
from xml.etree import ElementTree as ET
tree = ET.parse("d.xml") # our xml file
root = tree.getroot()
s = root.findall("./Parent1/Parent2/Parent3/lastModifiedBy")
for i in s:
    print(i.text)
This gives you all lastModifiedBy elements in the Parent3 element. You can also pick one by position, like this:
from xml.etree import ElementTree as ET
tree = ET.parse("d.xml") # our xml file
root = tree.getroot()
s = root.find("./Parent1/Parent2/Parent3/lastModifiedBy[1]") # first element with "lastModifiedBy" tag (positions start at 1)
print(s.text)
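
To get something close to the dictionary the question asks for while still coping with repeated tags, one option is to map each tag name to a list of values. This is only a sketch: the XML is a trimmed copy of the document from the question, and the variable names are illustrative.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Trimmed copy of the question's XML, kept inline so the sketch is self-contained.
xml_text = (
    "<Body><requestNo>1923</requestNo><Parent1><Parent2><Parent3>"
    "<purpose>NeverMore</purpose>"
    "<lastModifiedBy>user2.thisUser</lastModifiedBy>"
    "<lastModifiedBy>MG</lastModifiedBy>"
    "</Parent3></Parent2></Parent1></Body>"
)

root = ET.fromstring(xml_text)
parent3 = root.find("./Parent1/Parent2/Parent3")

# Each tag maps to a list, so duplicate tags keep all of their values.
values = defaultdict(list)
for child in parent3:
    values[child.tag].append(child.text)

print(dict(values))
```

If a tag is known to occur only once, values[tag][0] gives its single value.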

Parse element's tail with requests-html

I want to parse an HTML document like this with requests-html 0.9.0:
from requests_html import HTML
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = html.find('.data', first=True)
print(data.html)
# <span class="data">important data</span> and some rubbish
print(data.text)
# important data and some rubbish
I need to distinguish the text inside the tag (enclosed by it) from the tag's tail (the text that follows the element up to the next tag). This is the behaviour I initially expected:
data.text == 'important data'
data.tail == ' and some rubbish'
But tail is not defined for Elements. Since requests-html provides access to inner lxml objects, we can try to get it from lxml.etree.Element.tail:
from lxml.etree import tostring
print(tostring(data.lxml))
# b'<html><span class="data">important data</span></html>'
print(data.lxml.tail is None)
# True
There's no tail in lxml representation! The tag with its inner text is OK, but the tail seems to be stripped away. How do I extract 'and some rubbish'?
Edit: I discovered that full_text provides the inner text only (so much for “full”). This enables a dirty hack of subtracting full_text from text, although I'm not positive it will work if there are any links.
print(data.full_text)
# important data
I'm not sure I've understood your problem, but if you just want to get 'and some rubbish', you can use the code below:
from requests_html import HTML
from lxml.html import fromstring
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = fromstring(html.html)
# or without using requests_html.HTML: data = fromstring('<span><span class="data">important data</span> and some rubbish</span>')
print(data.xpath('//span[span[@class="data"]]/text()')[-1]) # " and some rubbish"
Note that data = html.find('.data', first=True) returns the <span class="data">important data</span> node, which doesn't contain " and some rubbish": that text is a child text node of the parent span!
The tail property exists on objects of type 'lxml.html.HtmlElement'.
I think what you are asking for is very easy to implement.
Here is a very simple example using requests_html and lxml:
from requests_html import HTML
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = html.find('span')
print (data[0].text) # important data and some rubbish
print (data[-1].text) # important data
print (data[-1].element.tail) # and some rubbish
The element attribute points to the 'lxml.html.HtmlElement' object.
Hope this helps.
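
The difference between an element's text and its tail can also be seen with the standard library alone. The sketch below uses xml.etree.ElementTree rather than requests-html or lxml (a substitution made here so that it runs without third-party packages); lxml follows the same text/tail model.

```python
from xml.etree import ElementTree as ET

# Same markup as in the question, parsed as XML.
root = ET.fromstring(
    '<span><span class="data">important data</span> and some rubbish</span>'
)

inner = root.find("span")   # the <span class="data"> element
inner_text = inner.text     # text enclosed by the element
inner_tail = inner.tail     # text after the closing tag, up to the next tag

print(repr(inner_text))
print(repr(inner_tail))
```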

XML parsing specific values - Python

I've been attempting to parse a list of xml files. I'd like to print specific values such as the userName value.
<?xml version="1.0" encoding="utf-8"?>
<Drives clsid="{8FDDCC1A-0C3C-43cd-A6B4-71A6DF20DA8C}"
        disabled="1">
    <Drive clsid="{935D1B74-9CB8-4e3c-9914-7DD559B7A417}"
           name="S:"
           status="S:"
           image="2"
           changed="2007-07-06 20:57:37"
           uid="{4DA4A7E3-F1D8-4FB1-874F-D2F7D16F7065}">
        <Properties action="U"
                    thisDrive="NOCHANGE"
                    allDrives="NOCHANGE"
                    userName=""
                    cpassword=""
                    path="\\scratch"
                    label="SCRATCH"
                    persistent="1"
                    useLetter="1"
                    letter="S"/>
    </Drive>
</Drives>
My script is working fine collecting a list of xml files etc. However the below function is to print the relevant values. I'm trying to achieve this as suggested in this post. However I'm clearly doing something incorrectly as I'm getting errors suggesting that elm object has no attribute text. Any help would be appreciated.
Current Code
from lxml import etree as ET
def read_files(files):
    for fi in files:
        doc = ET.parse(fi)
        elm = doc.find('userName')
        print elm.text
doc.find looks for a tag with the given name. You are looking for an attribute with the given name.
elm.text is giving you an error because doc.find doesn't find any tags, so it returns None, which has no text property.
Read the lxml.etree docs some more, and then try something like this:
doc = ET.parse(fi)
root = doc.getroot()
prop = root.find(".//Properties") # finds the first <Properties> tag anywhere
elm = prop.attrib['userName']
userName is an attribute, not an element. Attributes don't have text nodes attached to them at all.
for el in doc.xpath('//*[@userName]'):
    print el.attrib['userName']
You can get the element by its tag name and then read its attribute (userName is an attribute of Properties). Note that getElementsByTagName is a minidom method, not an lxml one, so this version parses with minidom:
from xml.dom import minidom
def read_files(files):
    for fi in files:
        doc = minidom.parse(fi)
        props = doc.getElementsByTagName('Properties')
        elm = props[0].attributes['userName']
        print elm.value
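
As a further sketch of the point that userName is an attribute rather than an element, the example below reads it with the standard library's ElementTree and skips Properties elements that lack the attribute. The function name user_names, the shortened sample document, and the value demo_user are assumptions made for illustration.

```python
from xml.etree import ElementTree as ET

# Shortened, hypothetical sample modelled on the question's file.
SAMPLE = (
    '<Drives>'
    '<Drive name="S:"><Properties action="U" userName="" letter="S"/></Drive>'
    '<Drive name="T:"><Properties action="U" userName="demo_user" letter="T"/></Drive>'
    '<Drive name="U:"><Properties action="U" letter="U"/></Drive>'
    '</Drives>'
)


def user_names(xml_text):
    """Collect the userName attribute of every <Properties> element that has one."""
    root = ET.fromstring(xml_text)
    return [
        prop.get("userName")
        for prop in root.iter("Properties")
        if "userName" in prop.attrib  # skip elements without the attribute
    ]


names = user_names(SAMPLE)
print(names)
```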

Parsing XML using Python minidom

<PacketHeader>
    <HeaderField>
        <name>number</name>
        <dataType>int</dataType>
    </HeaderField>
</PacketHeader>
This is my small XML file and I want to extract out the text which is within the name tag.
Here is my code snippet:-
from xml.dom import minidom
from xml.dom.minidom import parse
xmldoc = minidom.parse('sample.xml')
packetHeader = xmldoc.getElementsByTagName("PacketHeader")
headerField = packetHeader.getElementsByTagName("HeaderField")
for field in headerField:
    getFieldName = field.getElementsByTagName("name")
    print getFieldName
But I am getting the object's location rather than its text.
from xml.dom import minidom
from xml.dom.minidom import parse
xmldoc = minidom.parse('sample.xml')
# find the name element, if found return a list, get the first element
name_element = xmldoc.getElementsByTagName("name")[0]
# this will be a text node that contains the actual text
text_node = name_element.childNodes[0]
# get text
print text_node.data
Please check this.
Update
By the way, I suggest using ElementTree. Below is a snippet using ElementTree that does the same thing as the minidom code above:
import xml.etree.ElementTree as ET
tree = ET.parse("sample.xml")
# the tree root is the toplevel `PacketHeader` element
print tree.findtext("HeaderField/name")
A small variant of the accepted and correct answer above is:
from xml.dom import minidom
xmldoc = minidom.parse('fichier.xml')
name_element = xmldoc.getElementsByTagName('name')[0]
print name_element.childNodes[0].nodeValue
This simply uses nodeValue instead of its alias data
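
Both variants above assume the name element has a text node as its first child. A slightly more defensive sketch is shown below: it loops over each HeaderField (getElementsByTagName must be called on an element or the document, not on the list it returns) and joins all text nodes, so an empty element yields an empty string. The helper name element_text is an assumption, and the XML is inlined from the question.

```python
from xml.dom import minidom


def element_text(element):
    """Concatenate the data of all direct text-node children of an element."""
    return "".join(
        node.data
        for node in element.childNodes
        if node.nodeType == node.TEXT_NODE
    )


doc = minidom.parseString(
    "<PacketHeader><HeaderField>"
    "<name>number</name><dataType>int</dataType>"
    "</HeaderField></PacketHeader>"
)

# getElementsByTagName returns a list of elements; query each item separately.
names = [
    element_text(field.getElementsByTagName("name")[0])
    for field in doc.getElementsByTagName("HeaderField")
]
print(names)
```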

Python and ElementTree: return "inner XML" excluding parent element

In Python 2.6 using ElementTree, what's a good way to fetch the XML (as a string) inside a particular element, like what you can do in HTML and javascript with innerHTML?
Here's a simplified sample of the XML node I am starting with:
<label attr="foo" attr2="bar">This is some text and a link in embedded HTML</label>
I'd like to end up with this string:
This is some text and a link in embedded HTML
I've tried iterating over the parent node and concatenating the tostring() of the children, but that gave me only the subnodes:
# returns only subnodes (e.g. and a link)
''.join([et.tostring(sub, encoding="utf-8") for sub in node])
I can hack up a solution using regular expressions, but was hoping there'd be something less hacky than this:
re.sub("</\w+?>\s*?$", "", re.sub("^\s*?<\w*?>", "", et.tostring(node, encoding="utf-8")))
How about:
from xml.etree import ElementTree as ET
xml = '<root>start here<child1>some text<sub1/>here</child1>and<child2>here as well<sub2/><sub3/></child2>end here</root>'
root = ET.fromstring(xml)
def content(tag):
    return tag.text + ''.join(ET.tostring(e) for e in tag)
print content(root)
print content(root.find('child2'))
Resulting in:
start here<child1>some text<sub1 />here</child1>and<child2>here as well<sub2 /><sub3 /></child2>end here
here as well<sub2 /><sub3 />
This is based on the other solutions, but the other solutions did not work in my case (resulted in exceptions) and this one worked:
from xml.etree import ElementTree
from xml.etree.ElementTree import Element
def inner_xml(element: Element):
    return (element.text or '') + ''.join(ElementTree.tostring(e, 'unicode') for e in element)
Use it the same way as in Mark Tolonen's answer.
The following worked for me:
from xml.etree import ElementTree as etree
xml = '<root>start here<child1>some text<sub1/>here</child1>and<child2>here as well<sub2/><sub3/></child2>end here</root>'
dom = etree.XML(xml)
(dom.text or '') + ''.join(map(etree.tostring, dom)) + (dom.tail or '')
# 'start here<child1>some text<sub1 />here</child1>and<child2>here as well<sub2 /><sub3 /></child2>end here'
dom.text or '' is used to get the text at the start of the root element. If there is no text dom.text is None.
Note that the result is not valid XML: a valid XML document must have exactly one root element.
Have a look at the ElementTree docs about mixed content.
Using Python 2.6.5, Ubuntu 10.04
