Parsing XML using Python minidom - python

<PacketHeader>
<HeaderField>
<name>number</name>
<dataType>int</dataType>
</HeaderField>
</PacketHeader>
This is my small XML file, and I want to extract the text inside the name tag.
Here is my code snippet:
from xml.dom import minidom
from xml.dom.minidom import parse
xmldoc = minidom.parse('sample.xml')
packetHeader = xmldoc.getElementsByTagName("PacketHeader")
headerField = packetHeader.getElementsByTagName("HeaderField")
for field in headerField:
    getFieldName = field.getElementsByTagName("name")
    print getFieldName
But this prints the node's location (its object representation), not the text.

from xml.dom import minidom
xmldoc = minidom.parse('sample.xml')
# getElementsByTagName returns a list of matches; take the first <name> element
name_element = xmldoc.getElementsByTagName("name")[0]
# its first child is the text node that holds the actual text
text_node = name_element.childNodes[0]
# print the text
print(text_node.data)
Please check this.
Update
By the way, I suggest ElementTree. Below is a snippet using ElementTree that does the same thing as the minidom code above:
import xml.etree.ElementTree as ET
tree = ET.parse("sample.xml")
# the tree root is the top-level `PacketHeader` element
print(tree.findtext("HeaderField/name"))
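If the file holds several HeaderField elements, the same ElementTree idea extends to a loop. The following is a minimal sketch, assuming the XML from the question is available as a string (the SAMPLE constant is only there to make the example self-contained):

```python
import xml.etree.ElementTree as ET

# Inline copy of the question's XML so the example runs without a file
SAMPLE = """<PacketHeader>
<HeaderField>
<name>number</name>
<dataType>int</dataType>
</HeaderField>
</PacketHeader>"""

root = ET.fromstring(SAMPLE)  # root is the <PacketHeader> element
# findall looks at direct children of root; findtext returns the child's text
names = [field.findtext("name") for field in root.findall("HeaderField")]
print(names)  # prints ['number']
```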

A small variant of the accepted and correct answer above is:
from xml.dom import minidom
xmldoc = minidom.parse('fichier.xml')
name_element = xmldoc.getElementsByTagName('name')[0]
print(name_element.childNodes[0].nodeValue)
This simply uses nodeValue instead of its alias data.
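For the loop in the original question, note that getElementsByTagName returns a list of nodes, so it has to be called on a document or an element, not on the list itself. A minimal sketch, assuming the question's XML is available as a string (SAMPLE and field_names are illustrative names, not part of the original code):

```python
from xml.dom import minidom

# Inline copy of the question's XML so the example runs without a file
SAMPLE = """<PacketHeader>
<HeaderField>
<name>number</name>
<dataType>int</dataType>
</HeaderField>
</PacketHeader>"""


def field_names(xml_text):
    """Return the text of every <name> inside every <HeaderField>."""
    doc = minidom.parseString(xml_text)
    names = []
    for field in doc.getElementsByTagName("HeaderField"):
        for name_el in field.getElementsByTagName("name"):
            # Collect the text children of <name>; elements themselves hold no text
            text = "".join(
                child.data
                for child in name_el.childNodes
                if child.nodeType == child.TEXT_NODE
            )
            names.append(text)
    return names


print(field_names(SAMPLE))  # prints ['number']
```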

Related

lxml create CDATA element

I am trying to create a CDATA element as per https://lxml.de/apidoc/lxml.etree.html#lxml.etree.CDATA
The simplified version of my code looks like this:
description = ET.SubElement(item, "description")
description.text = CDATA('test')
But when I later try to convert it to string:
xml_str = ET.tostring(self.__root, xml_declaration=True).decode()
I get an exception:
cannot serialize <lxml.etree.CDATA object at 0x122c30ef0> (type CDATA)
Could you advise what I am missing?
Here is a simple example:
import xml.etree.cElementTree as ET
from lxml.etree import CDATA
root = ET.Element('rss')
root.set("version", "2.0")
description = ET.SubElement(root, "description")
description.text = CDATA('test')
xml_str = ET.tostring(root, xml_declaration=True).decode()
print(xml_str)
lxml.etree and xml.etree are two different libraries; you should pick one and stick with it, rather than using both and trying to pass objects created by one to the other.
A working example, using lxml only:
import lxml.etree as ET
from lxml.etree import CDATA
root = ET.Element('rss')
root.set("version", "2.0")
description = ET.SubElement(root, "description")
description.text = CDATA('test')
xml_str = ET.tostring(root, xml_declaration=True).decode()
print(xml_str)
You can run this yourself at https://replit.com/@CharlesDuffy2/JovialMediumLeadership

Parsing XML Attributes with Python

I am trying to parse out all the attributes highlighted in green (some sensitive values have been blacked out). I have a bunch of XML files with similar formats and already know how to loop through them individually, but I am having trouble parsing out the specific attributes.
XML Document
I need the text in these attributes:
name="text1" from <project logLevel="verbose" version="2.0" mainModule="Main" name="text1">
destinationDir="/text2" from <put label="Put Files" destinationDir="/Trigger/FPDMMT_INBOUND">
destDir="/text3" from <copy disabled="false" version="1.0" label="Archive Files" destDir="/text3" suffix="">
I am using
import csv
import os
import re
import xml.etree.ElementTree as ET
tree = ET.parse(XMLfile_path)
item = tree.getroot()[0]
root = tree.getroot()
print (item.get("name"))
print (root.get("name"))
This outputs:
Main
text1
item.get reads from the element at index [0], the first child of the root, which is <module
root.get reads from the root element itself, <project
I know there's a way to search for exactly the right part of the root/tree with something like:
test = root.find('./project/module/ftp/put')
print (test.get("destinationDir"))
I need to be able to jump directly to the thing I need and output the attributes I need.
Any help would be appreciated
Thanks.
Simplified copy of your XML:
xml = '''<project logLevel="verbose" version="2.0" mainModule="Main" name="hidden">
<module name="Main">
<createWorkspace version="1.0"/>
<ftp version="1.0" label="FTP connection to PRD">
<put label="Put Files" destinationDir="destination1">
</put>
</ftp>
<ftp version="1.0" label="FTP connection to PRD">
<put label="Put Files" destinationDir="destination2">
</put>
</ftp>
<copy disabled="false" destDir="destination3">
</copy>
</module>
</project>
'''
# solution using ETree
from xml.etree import ElementTree as ET
root = ET.fromstring(xml)
name = root.get('name')
ftp_destination_dir1 = root.findall('./module/ftp/put')[0].get('destinationDir')
ftp_destination_dir2 = root.findall('./module/ftp/put')[1].get('destinationDir')
copy_destination_dir = root.find('./module/copy').get('destDir')
print(name)
print(ftp_destination_dir1)
print(ftp_destination_dir2)
print(copy_destination_dir)
# solution using lxml
from lxml import etree as et
root = et.fromstring(xml)
name = root.get('name')
ftp_destination_dirs = root.xpath('./module/ftp/put/@destinationDir')
copy_destination_dir = root.xpath('./module/copy/@destDir')[0]
print(name)
print(ftp_destination_dirs[0])
print(ftp_destination_dirs[1])
print(copy_destination_dir)
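To jump straight to the elements without spelling out the full path, the standard library's Element.iter can search the whole tree by tag name. A minimal sketch, assuming the simplified XML above (the XML constant and the collect_attributes helper are illustrative names):

```python
import xml.etree.ElementTree as ET

# Condensed copy of the simplified XML above
XML = """<project logLevel="verbose" version="2.0" mainModule="Main" name="hidden">
  <module name="Main">
    <ftp version="1.0" label="FTP connection to PRD">
      <put label="Put Files" destinationDir="destination1"/>
    </ftp>
    <ftp version="1.0" label="FTP connection to PRD">
      <put label="Put Files" destinationDir="destination2"/>
    </ftp>
    <copy disabled="false" destDir="destination3"/>
  </module>
</project>"""


def collect_attributes(xml_text):
    """Gather the wanted attributes regardless of how deep the elements sit."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.get("name"),
        # iter() walks the whole subtree, so no explicit path is needed
        "destinationDir": [p.get("destinationDir") for p in root.iter("put")],
        "destDir": [c.get("destDir") for c in root.iter("copy")],
    }


result = collect_attributes(XML)
print(result)
```

Because iter() ignores nesting, this also picks up put or copy elements that sit elsewhere in the tree; use an explicit path with findall if only one location should match.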

xml minidom - get the full content of childnodes text

I have a Test.xml file as:
<?xml version="1.0" encoding="utf-8"?>
<SetupConf>
<LocSetup>
<Src>
<Dir1>C:\User1\test1</Dir1>
<Dir2>C:\User2\log</Dir2>
<Dir3>D:\Users\Checkup</Dir3>
<Dir4>D:\Work1</Dir4>
<Dir5>E:\job1</Dir5>
</Src>
</LocSetup>
</SetupConf>
The number of Dir nodes depends on user input; there may be 1, 2, 5, or 10 directories defined. With help from @Padraic Cunningham, I am able to extract the data from Test.xml using the Python code below:
from xml.dom import minidom
from StringIO import StringIO
dom = minidom.parse('Test.xml')
Src = dom.getElementsByTagName('Src')
output = ", ".join([a.childNodes[0].nodeValue for node in Src for a in node.getElementsByTagName('Dir')])
print [output]
And getting the output:
C:\User1\test1, C:\User2\log, D:\Users\Checkup, D:\Work1, E:\job1
But the expected output is:
['C:\\User1\\test1', 'C:\\User2\\log', 'D:\\Users\\Checkup', 'D:\\Work1', 'E:\\job1']
I solved it myself:
from xml.dom import minidom
DOMTree = minidom.parse('Test0001.xml')
dom = DOMTree.documentElement
Src = dom.getElementsByTagName('Src')
for node in Src:
    output = [a.childNodes[0].nodeValue for a in node.getElementsByTagName('Dir')]
    print output
And getting output:
[u'C:\User1\test1', u'C:\User2\log', u'D:\Users\Checkup', u'D:\Work1', u'E:\job1']
I am sure there is a simpler way; please let me know. Thanks in advance.
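One thing to watch for: getElementsByTagName matches exact tag names, so with the Test.xml shown above (Dir1 through Dir5) a lookup for 'Dir' would match nothing. Below is a minimal sketch, assuming the numbered tag names are what the real file contains, that filters the children of Src by name prefix instead. TEST_XML and dir_values are illustrative names, and the sample is shortened to two entries:

```python
from xml.dom import minidom

# Shortened inline copy of Test.xml; raw string keeps the backslashes literal
TEST_XML = r"""<SetupConf>
<LocSetup>
<Src>
<Dir1>C:\User1\test1</Dir1>
<Dir2>C:\User2\log</Dir2>
</Src>
</LocSetup>
</SetupConf>"""


def dir_values(xml_text):
    """Return the text of every child of <Src> whose tag starts with 'Dir'."""
    dom = minidom.parseString(xml_text)
    values = []
    for src in dom.getElementsByTagName("Src"):
        for child in src.childNodes:
            # Skip whitespace text nodes and keep only Dir1, Dir2, ...
            if child.nodeType != child.ELEMENT_NODE:
                continue
            if child.tagName.startswith("Dir") and child.firstChild is not None:
                values.append(child.firstChild.nodeValue)
    return values


print(dir_values(TEST_XML))
```

Printing the list shows each backslash doubled because that is how Python displays strings inside a list; the strings themselves contain single backslashes.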

Python, extract date from XML

Apologies, my Python knowledge is pretty much non-existent. I need to extract a date from some XML which is in a format similar to:
<Header>
<Version>1.0</Version>
....
<cd:Data>...</cd:Data>
.....
<cd:DateReceived>20070620171524</cd:DateReceived>
From looking around here I found something similar
#!/usr/bin/python
from xml.dom.minidom import parse
import xml.dom.minidom
# Open XML document using minidom parser
DOMTree = xml.dom.minidom.parse("date.xml")
collection = DOMTree.documentElement
print collection.getElementsByTagName("cd:DateReceived").item(0)
However, this only prints the element object's representation (with its memory address in hex):
<DOM Element: cd:DateReceived at 0x1529e0>
How can I get the date 20070620171524?
I've tried using the following
#!/usr/bin/python
from xml.dom.minidom import parse
import xml.dom.minidom
# Open XML document using minidom parser
DOMTree = xml.dom.minidom.parse("date.xml")
collection = DOMTree.documentElement
date = cd:DateReceived[0].firstChild.nodeValue
print date
but it gives an error as it doesn't like the "cd" part of the tag
date = cd:DateReceived[0].firstChild.nodeValue
^
SyntaxError: invalid syntax
Any help would be appreciated. Thanks!
collection.getElementsByTagName("cd:DateReceived").item(0) returns an element node. The text lives in that element's child text node, so read it with .firstChild.nodeValue (the element's own nodeValue is None).
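Put together, that looks roughly like the sketch below. It makes two assumptions that are not in the question: the cd prefix is declared somewhere in the real file (the urn:example:cd URI here is a placeholder, since minidom rejects undeclared prefixes), and the value is a timestamp in YYYYMMDDHHMMSS form. SAMPLE and date_received are illustrative names:

```python
from datetime import datetime
from xml.dom import minidom

# Placeholder namespace URI; the real document must declare the cd prefix
SAMPLE = """<Header xmlns:cd="urn:example:cd">
<Version>1.0</Version>
<cd:DateReceived>20070620171524</cd:DateReceived>
</Header>"""


def date_received(xml_text):
    """Return the text of the first cd:DateReceived element, or None."""
    root = minidom.parseString(xml_text).documentElement
    element = root.getElementsByTagName("cd:DateReceived").item(0)
    if element is None or element.firstChild is None:
        return None
    # The text is stored in the element's first child, a text node
    return element.firstChild.nodeValue.strip()


raw = date_received(SAMPLE)
print(raw)  # prints 20070620171524

# Assumed format: four-digit year, then month, day, hour, minute, second
parsed = datetime.strptime(raw, "%Y%m%d%H%M%S")
print(parsed.isoformat())  # prints 2007-06-20T17:15:24
```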

How to comment out an XML Element (using minidom DOM implementation)

I would like to comment out a specific XML element in an xml file. I could just remove the element, but I would prefer to leave it commented out, in case it's needed later.
The code I use at the moment that removes the element looks like this:
from xml.dom import minidom
doc = minidom.parse(myXmlFile)
for element in doc.getElementsByTagName('MyElementName'):
    if element.getAttribute('name') in ['AttribName1', 'AttribName2']:
        element.parentNode.removeChild(element)
f = open(myXmlFile, "w")
f.write(doc.toxml())
f.close()
I would like to modify this so that it comments the element out rather than deleting it.
The following solution does exactly what I want.
from xml.dom import minidom
doc = minidom.parse(myXmlFile)
for element in doc.getElementsByTagName('MyElementName'):
    if element.getAttribute('name') in ['AttrName1', 'AttrName2']:
        parentNode = element.parentNode
        parentNode.insertBefore(doc.createComment(element.toxml()), element)
        parentNode.removeChild(element)
f = open(myXmlFile, "w")
f.write(doc.toxml())
f.close()
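The same idea can be tried without touching a file. The sketch below is an illustration rather than the original code: it parses an inline string (SAMPLE) and uses replaceChild to swap the element for its comment in one step; comment_out is a hypothetical helper name.

```python
from xml.dom import minidom

# Small inline document standing in for myXmlFile
SAMPLE = '<root><MyElementName name="AttribName1"/><MyElementName name="Other"/></root>'


def comment_out(xml_text, tag, names):
    """Replace matching elements with a comment holding their serialized XML."""
    doc = minidom.parseString(xml_text)
    # Copy the match list first so replacing nodes cannot affect the iteration
    for element in list(doc.getElementsByTagName(tag)):
        if element.getAttribute("name") in names:
            comment = doc.createComment(element.toxml())
            element.parentNode.replaceChild(comment, element)
    return doc.documentElement.toxml()


result = comment_out(SAMPLE, "MyElementName", {"AttribName1"})
print(result)
```

One limitation: XML comments may not contain "--", so serializing fails with a ValueError if the commented-out element's own XML includes that sequence, for example when it holds a nested comment.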
You can do it with BeautifulSoup: read the target tag, create the corresponding comment tag, and replace the target tag with it.
For example, creating the comment tag:
from bs4 import BeautifulSoup
hello = "<!--Comment tag-->"
commentSoup = BeautifulSoup(hello, "html.parser")
