xml minidom - get the full content of childnodes text - python

I have a Test.xml file as:
<?xml version="1.0" encoding="utf-8"?>
<SetupConf>
<LocSetup>
<Src>
<Dir1>C:\User1\test1</Dir1>
<Dir2>C:\User2\log</Dir2>
<Dir3>D:\Users\Checkup</Dir3>
<Dir4>D:\Work1</Dir4>
<Dir5>E:\job1</Dir5>
</Src>
</LocSetup>
</SetupConf>
The number of Dir nodes depends on user input: there may be 1, 2, 5, or 10 directories defined. With help from @Padraic Cunningham, I am able to extract the data from Test.xml using the Python code below:
from xml.dom import minidom
from StringIO import StringIO
dom = minidom.parse('Test.xml')
Src = dom.getElementsByTagName('Src')
output = ", ".join([a.childNodes[0].nodeValue for node in Src for a in node.getElementsByTagName('Dir')])
print [output]
And getting the output:
C:\User1\test1, C:\User2\log, D:\Users\Checkup, D:\Work1, E:\job1
But the expected output is:
['C:\\User1\\test1', 'C:\\User2\\log', 'D:\\Users\\Checkup', 'D:\\Work1', 'E:\\job1']

Well it's solved by myself:
from xml.dom import minidom
DOMTree = minidom.parse('Test0001.xml')
dom = DOMTree.documentElement
Src = dom.getElementsByTagName('Src')
for node in Src:
    output = [a.childNodes[0].nodeValue for a in node.getElementsByTagName('Dir')]
    print output
And getting output:
[u'C:\User1\test1', u'C:\User2\log', u'D:\Users\Checkup', u'D:\Work1', u'E:\job1']
I am sure there is a simpler way. Please let me know. Thanks in advance.
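One possibly simpler route is sketched below using ElementTree. It assumes the children of Src are the Dir1 ... DirN elements shown in the question, and it iterates over those children directly instead of matching a tag name. The inline XML string is a shortened stand-in for Test.xml, and the names XML, src and dirs are only illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for Test.xml (assumption: same layout as in the question).
XML = r"""<SetupConf>
  <LocSetup>
    <Src>
      <Dir1>C:\User1\test1</Dir1>
      <Dir2>C:\User2\log</Dir2>
      <Dir3>D:\Users\Checkup</Dir3>
    </Src>
  </LocSetup>
</SetupConf>"""

root = ET.fromstring(XML)

# Iterating over an element yields its direct children, so this works
# however many Dir<N> elements the user defined.
src = root.find("./LocSetup/Src")
dirs = [child.text for child in src if child.tag.startswith("Dir")]

print(dirs)
```

The doubled backslashes in the expected output are only how Python displays backslashes inside a printed list; each string itself contains single backslashes.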

Related

Parsing XML Attributes with Python

I am trying to parse out all the green highlighted attributes (some sensitive things have been blacked out). I have a bunch of XML files, all with similar formats. I already know how to loop through all of them individually, but I am having trouble parsing out the specific attributes.
XML Document
I need the text in these attributes:
name="text1" from
<project logLevel="verbose" version="2.0" mainModule="Main" name="text1">
destinationDir="/text2" from
<put label="Put Files" destinationDir="/Trigger/FPDMMT_INBOUND">
destDir="/text3" from
<copy disabled="false" version="1.0" label="Archive Files" destDir="/text3" suffix="">
I am using
import csv
import os
import re
import xml.etree.ElementTree as ET
tree = ET.parse(XMLfile_path)
item = tree.getroot()[0]
root = tree.getroot()
print (item.get("name"))
print (root.get("name"))
This outputs:
Main
text1
item.get reads from the element at index [0] of the root, which is the root's first child: <module
root.get reads from the root element itself: <project
I know there's a way to search for exactly the right part of the root/tree with something like:
test = root.find('./project/module/ftp/put')
print (test.get("destinationDir"))
I need to be able to jump directly to the thing I need and output the attributes I need.
Any help would be appreciated
Thanks.
Simplified copy of your XML:
xml = '''<project logLevel="verbose" version="2.0" mainModule="Main" name="hidden">
<module name="Main">
<createWorkspace version="1.0"/>
<ftp version="1.0" label="FTP connection to PRD">
<put label="Put Files" destinationDir="destination1">
</put>
</ftp>
<ftp version="1.0" label="FTP connection to PRD">
<put label="Put Files" destinationDir="destination2">
</put>
</ftp>
<copy disabled="false" destDir="destination3">
</copy>
</module>
</project>
'''
# solution using ETree
from xml.etree import ElementTree as ET
root = ET.fromstring(xml)
name = root.get('name')
ftp_destination_dir1 = root.findall('./module/ftp/put')[0].get('destinationDir')
ftp_destination_dir2 = root.findall('./module/ftp/put')[1].get('destinationDir')
copy_destination_dir = root.find('./module/copy').get('destDir')
print(name)
print(ftp_destination_dir1)
print(ftp_destination_dir2)
print(copy_destination_dir)
# solution using lxml
from lxml import etree as et
root = et.fromstring(xml)
name = root.get('name')
ftp_destination_dirs = root.xpath('./module/ftp/put/@destinationDir')
copy_destination_dir = root.xpath('./module/copy/@destDir')[0]
print(name)
print(ftp_destination_dirs[0])
print(ftp_destination_dirs[1])
print(copy_destination_dir)
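If the number of put elements is not known in advance, indexing findall() results by position becomes fragile. The sketch below, which assumes the same simplified XML shape as above, collects the attributes in a loop instead. It uses only the standard library; the names put_dirs and copy_dirs are illustrative.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the simplified XML above (assumed shape).
XML = """<project name="hidden">
  <module name="Main">
    <ftp><put destinationDir="destination1"/></ftp>
    <ftp><put destinationDir="destination2"/></ftp>
    <copy destDir="destination3"/>
  </module>
</project>"""

root = ET.fromstring(XML)

# iter() walks the whole tree, so the exact path to <put> does not matter.
put_dirs = [put.get("destinationDir") for put in root.iter("put")]

# The [@destDir] predicate keeps only <copy> elements that carry the attribute.
copy_dirs = [el.get("destDir") for el in root.findall(".//copy[@destDir]")]

print(put_dirs)
print(copy_dirs)
```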

How to print CData content while parsing xml local file with python?

So I have a XML file from a local folder that I want to scrape using Python. It has CData and looks like this:
<?xml version="1.0" encoding="utf-8"?>
<trial xmlns="urn::trial">
<drksId><![CDATA[DRKS00000024]]></drksId>
<firstDrksPublishDate><![CDATA[2008-09-05T12:36:54.000+02:00]]></firstDrksPublishDate>
<firstPartnerPublishDate><![CDATA[2004-01-15T00:00:00.000+01:00]]></firstPartnerPublishDate>
......
I tried:
import xml.etree.ElementTree as Et
tree=Et.parse(filename)
root=tree.getroot()
print(root.find('drksId').text)
Output:
I am getting root.find('drksId') as None. Thanks in advance
The root element declares a default namespace (xmlns="urn::trial"), so the search has to take that namespace into account:
ns = {'ns': 'urn::trial'}
drksId = root.find('./ns:drksId', ns)
print(drksId.text)
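A self-contained sketch of the same idea follows, using a shortened inline copy of the file from the question in place of the local file. It shows that the CDATA content comes back as ordinary .text once the element is found, and that the prefix "ns" is arbitrary: it only needs to map to the namespace URI.

```python
import xml.etree.ElementTree as ET

# Shortened inline copy of the document in the question.
XML = """<trial xmlns="urn::trial">
  <drksId><![CDATA[DRKS00000024]]></drksId>
</trial>"""

root = ET.fromstring(XML)

# Option 1: map a prefix of your choice to the namespace URI.
ns = {"ns": "urn::trial"}
drks_id = root.find("ns:drksId", ns)

# Option 2: spell the namespace out in {uri}tag form.
same = root.find("{urn::trial}drksId")

print(drks_id.text)
print(same.text)
```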

Python, extract date from XML

Apologies, my Python knowledge is pretty non-existent. I need to extract a date from some XML which is in a format similar to:
<Header>
<Version>1.0</Version>
....
<cd:Data>...</cd:Data>
.....
<cd:DateReceived>20070620171524</cd:DateReceived>
From looking around here I found something similar
#!/usr/bin/python
from xml.dom.minidom import parse
import xml.dom.minidom
# Open XML document using minidom parser
DOMTree = xml.dom.minidom.parse("date.xml")
collection = DOMTree.documentElement
print collection.getElementsByTagName("cd:DateReceived").item(0)
However this only prints the Hex value:
<DOM Element: cd:DateReceived at 0x1529e0>
How can I get the date 20070620171524?
I've tried using the following
#!/usr/bin/python
from xml.dom.minidom import parse
import xml.dom.minidom
# Open XML document using minidom parser
DOMTree = xml.dom.minidom.parse("date.xml")
collection = DOMTree.documentElement
date = cd:DateReceived[0].firstChild.nodeValue
print date
but it gives an error as it doesn't like the "cd" part of the tag
date = cd:DateReceived[0].firstChild.nodeValue
^
SyntaxError: invalid syntax
Any help would be appreciated. Thanks!
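collection.getElementsByTagName("cd:DateReceived").item(0) returns an element node. The date is stored in that element's child text node, so read it with .firstChild.nodeValue (the nodeValue of the element itself is None).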
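A minimal sketch of that lookup follows. The question does not show where the cd prefix is declared, so the namespace URI "urn:example:cd" and the inline document are assumptions made only so the example parses. Converting the value with strptime is an optional extra step.

```python
from datetime import datetime
from xml.dom import minidom

# Stand-in for date.xml; the xmlns:cd URI is a made-up placeholder.
XML = """<Header xmlns:cd="urn:example:cd">
  <Version>1.0</Version>
  <cd:DateReceived>20070620171524</cd:DateReceived>
</Header>"""

dom = minidom.parseString(XML)

# The tag name is passed as a string, prefix included.
node = dom.getElementsByTagName("cd:DateReceived").item(0)

# The text is held by the element's first child, a text node.
raw = node.firstChild.nodeValue

# Optional: turn the YYYYMMDDHHMMSS string into a datetime.
received = datetime.strptime(raw, "%Y%m%d%H%M%S")

print(raw)
print(received)
```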

Parsing XML using Python minidom

<PacketHeader>
<HeaderField>
<name>number</name>
<dataType>int</dataType>
</HeaderField>
</PacketHeader>
This is my small XML file and I want to extract out the text which is within the name tag.
Here is my code snippet:-
from xml.dom import minidom
from xml.dom.minidom import parse
xmldoc = minidom.parse('sample.xml')
packetHeader = xmldoc.getElementsByTagName("PacketHeader")
headerField = packetHeader.getElementsByTagName("HeaderField")
for field in headerField:
    getFieldName = field.getElementsByTagName("name")
    print getFieldName
But I am getting the location but not the text.
from xml.dom import minidom
from xml.dom.minidom import parse
xmldoc = minidom.parse('sample.xml')
# find the name element, if found return a list, get the first element
name_element = xmldoc.getElementsByTagName("name")[0]
# this will be a text node that contains the actual text
text_node = name_element.childNodes[0]
# get text
print text_node.data
Please check this.
Update
By the way, I suggest ElementTree. The snippet below uses ElementTree to do the same thing as the minidom code above:
import xml.etree.ElementTree as ET
tree = ET.parse("sample.xml")
# the tree root is the toplevel `PacketHeader` element
print tree.findtext("HeaderField/name")
A small variant of the accepted and correct answer above is:
from xml.dom import minidom
xmldoc = minidom.parse('fichier.xml')
name_element = xmldoc.getElementsByTagName('name')[0]
print name_element.childNodes[0].nodeValue
This simply uses nodeValue instead of its alias data
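Both variants read only childNodes[0], which raises IndexError for an empty element and misses text that is split over several child nodes. A small helper can join all direct text children instead. This is a sketch: get_text is a hypothetical name, not part of minidom, and the inline document is made up for illustration.

```python
from xml.dom import minidom


def get_text(element):
    """Join the data of all direct text/CDATA children; '' if there are none."""
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
    )


# Made-up document: <name> mixes plain text with a CDATA section.
doc = minidom.parseString(
    "<HeaderField><name>num<![CDATA[ber]]></name><empty/></HeaderField>"
)

name_text = get_text(doc.getElementsByTagName("name")[0])
empty_text = get_text(doc.getElementsByTagName("empty")[0])

print(repr(name_text))
print(repr(empty_text))
```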

XML parsing using python (with xmlns attribute) doesn't work

It's my first time trying to parse XML with python so answer could be simple but I can't figure this out.
I'm using ElementTree to parse some XML file. Problem is that I cannot get any result inside the tree when having this attribute:
<package xmlns="http://apple.com/itunes/importer" version="software5.1">
When removing this attribute everything works great. To be clear I mean when changing first line of XML file to:
<package>
Everything works great.
What am I doing wrong?
Here is my code:
import xml.etree.ElementTree as ET
tree = ET.parse('metadataCopy.xml')
root = tree.getroot()
p = root.find(".//intervals/interval")
print p
for interval in root.iterfind(".//intervals/interval"):
    start_date = interval.find('start_date').text
    end_date = interval.find('end_date').text
    print start_date, end_date
Please help. Thanks!
UPDATE:
The XML file:
<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://apple.com/itunes/importer" version="software5.1">
<metadata_token>TOKEN</metadata_token>
<provider>Provider Name</provider>
<team_id>Team_ID_Here</team_id>
<software>
<!--Apple ID: 01234567-->
<vendor_id>vendorSKU</vendor_id>
<read_only_info>
<read_only_value key="apple-id">01234567</read_only_value>
</read_only_info>
<software_metadata>
<versions>
<version string="1.0">
<locales>
<locale name="en-US">
<title>title text</title>
<description>Description text</description>
<keywords>
<keyword>key1</keyword>
<keyword>key2</keyword>
</keywords>
<version_whats_new>New things here</version_whats_new>
<support_url>http://someurl.com</support_url>
<software_screenshots>
<software_screenshot display_target="iOS-3.5-in" position="1">
</software_screenshot>
<software_screenshot display_target="iOS-4-in" position="1">
</software_screenshot>
</software_screenshots>
</locale>
</locales>
</version>
</versions>
<products>
<product>
<territory>WW</territory>
<cleared_for_sale>true</cleared_for_sale>
<sales_start_date>2013-01-05</sales_start_date>
<intervals>
<interval>
<start_date>2013-08-25</start_date>
<end_date>2014-09-01</end_date>
<wholesale_price_tier>5</wholesale_price_tier>
</interval>
<interval>
<start_date>2014-09-01</start_date>
<wholesale_price_tier>6</wholesale_price_tier>
</interval>
</intervals>
<allow_volume_discount>true</allow_volume_discount>
</product>
</products>
</software_metadata>
</software>
This happens because ElementTree does not apply the document's default namespace to search paths automatically. Every element name in a lookup has to be qualified with a prefix that is mapped to the namespace URI:
import xml.etree.ElementTree as ET
namespaces = {"pns" : "http://apple.com/itunes/importer"}
tree = ET.parse('metadataCopy.xml')
root = tree.getroot()
p = root.find(".//pns:intervals/pns:interval", namespaces=namespaces)
print p
for interval in root.iterfind(".//pns:intervals/pns:interval", namespaces=namespaces):
    start_date = interval.find('pns:start_date', namespaces=namespaces)
    end_date = interval.find('pns:end_date', namespaces=namespaces)
    st_text = end_text = None
    if start_date is not None:
        st_text = start_date.text
    if end_date is not None:
        end_text = end_date.text
    print st_text, end_text
The XML file as posted is not well-formed: the closing </package> tag is missing at the end. With that tag added, the program produces:
<Element '{http://apple.com/itunes/importer}interval' at 0x178b350>
2013-08-25 2014-09-01
2014-09-01 None
If switching libraries is an option, consider lxml, which has good support for namespaces. See the short tutorial at http://lxml.de/tutorial.html#namespaces
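Staying with the standard library, the explicit None checks can be folded into findtext(), which returns None by default when the element is missing. The sketch below uses a cut-down inline copy of the posted file (only the intervals part, with the closing tag added), so the shortened structure is an assumption.

```python
import xml.etree.ElementTree as ET

# Cut-down, well-formed stand-in for metadataCopy.xml.
XML = """<package xmlns="http://apple.com/itunes/importer" version="software5.1">
  <intervals>
    <interval>
      <start_date>2013-08-25</start_date>
      <end_date>2014-09-01</end_date>
    </interval>
    <interval>
      <start_date>2014-09-01</start_date>
    </interval>
  </intervals>
</package>"""

ns = {"pns": "http://apple.com/itunes/importer"}
root = ET.fromstring(XML)

# findtext() returns the element's text, or None when no element matches.
dates = [
    (
        interval.findtext("pns:start_date", namespaces=ns),
        interval.findtext("pns:end_date", namespaces=ns),
    )
    for interval in root.iterfind(".//pns:intervals/pns:interval", namespaces=ns)
]

for start_date, end_date in dates:
    print(start_date, end_date)
```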
