I have a Test.xml file as:
<?xml version="1.0" encoding="utf-8"?>
<SetupConf>
<LocSetup>
<Src>
<Dir1>C:\User1\test1</Dir1>
<Dir2>C:\User2\log</Dir2>
<Dir3>D:\Users\Checkup</Dir3>
<Dir4>D:\Work1</Dir4>
<Dir5>E:\job1</Dir5>
</Src>
</LocSetup>
</SetupConf>
The number of Dir nodes depends on user input: there may be 1, 2, 5, or 10 directories defined. With help from @Padraic Cunningham, I am able to extract the data from Test.xml using the Python code below:
from xml.dom import minidom
from StringIO import StringIO
dom = minidom.parse('Test.xml')
Src = dom.getElementsByTagName('Src')
output = ", ".join([a.childNodes[0].nodeValue for node in Src for a in node.getElementsByTagName('Dir')])
print [output]
And getting the output:
C:\User1\test1, C:\User2\log, D:\Users\Checkup, D:\Work1, E:\job1
But the expected output is:
['C:\\User1\\test1', 'C:\\User2\\log', 'D:\\Users\\Checkup', 'D:\\Work1', 'E:\\job1']
Well, I solved it myself:
from xml.dom import minidom
DOMTree = minidom.parse('Test0001.xml')
dom = DOMTree.documentElement
Src = dom.getElementsByTagName('Src')
for node in Src:
    output = [a.childNodes[0].nodeValue for a in node.getElementsByTagName('Dir')]
    print output
And getting output:
[u'C:\User1\test1', u'C:\User2\log', u'D:\Users\Checkup', u'D:\Work1', u'E:\job1']
I am sure there is a simpler way. Please let me know. Thanks in advance.
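One possibly simpler route is sketched here with the standard-library ElementTree instead of minidom: iterate over the children of Src and collect their text. The sample is inlined as a string (an assumption, so the snippet runs without Test.xml), and the sketch relies on every child of Src being a directory entry, whatever its tag is called.

```python
import xml.etree.ElementTree as ET

# Inlined copy of the sample so the sketch runs on its own;
# with the real file, use ET.parse('Test.xml').getroot() instead.
xml_text = r"""<SetupConf>
  <LocSetup>
    <Src>
      <Dir1>C:\User1\test1</Dir1>
      <Dir2>C:\User2\log</Dir2>
      <Dir3>D:\Users\Checkup</Dir3>
      <Dir4>D:\Work1</Dir4>
      <Dir5>E:\job1</Dir5>
    </Src>
  </LocSetup>
</SetupConf>"""

root = ET.fromstring(xml_text)

# Iterating an element yields its direct children, so this works
# for any number of entries under <Src>, regardless of tag name.
dirs = [child.text for child in root.find('./LocSetup/Src')]
print(dirs)
```

Printing the list itself, rather than a joined string, is what gives the bracketed, quoted form asked for in the question.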
I am trying to parse out all the green-highlighted attributes (some sensitive things have been blacked out). I have a bunch of XML files, all with similar formats. I already know how to loop through them individually, but I am having trouble parsing out the specific attributes.
XML Document
I need the text in the attributes: name="text1"
from
project logLevel="verbose" version="2.0" mainModule="Main" name="text1">
destinationDir="/text2" from
put label="Put Files" destinationDir="/Trigger/FPDMMT_INBOUND">
destDir="/text3" from
copy disabled="false" version="1.0" label="Archive Files" destDir="/text3" suffix="">
I am using
import csv
import os
import re
import xml.etree.ElementTree as ET
tree = ET.parse(XMLfile_path)
item = tree.getroot()[0]
root = tree.getroot()
print (item.get("name"))
print (root.get("name"))
This outputs:
Main
text1
The item.get call reads from the element at index [0] of the root, that is, its first child, which is <module
The root.get call reads from the root element itself, <project
I know there's a way to search for exactly the right part of the root/tree with something like:
test = root.find('./project/module/ftp/put')
print (test.get("destinationDir"))
I need to be able to jump directly to the thing I need and output the attributes I need.
Any help would be appreciated
Thanks.
Simplified copy of your XML:
xml = '''<project logLevel="verbose" version="2.0" mainModule="Main" name="hidden">
<module name="Main">
<createWorkspace version="1.0"/>
<ftp version="1.0" label="FTP connection to PRD">
<put label="Put Files" destinationDir="destination1">
</put>
</ftp>
<ftp version="1.0" label="FTP connection to PRD">
<put label="Put Files" destinationDir="destination2">
</put>
</ftp>
<copy disabled="false" destDir="destination3">
</copy>
</module>
</project>
'''
# solution using ETree
from xml.etree import ElementTree as ET
root = ET.fromstring(xml)
name = root.get('name')
ftp_destination_dir1 = root.findall('./module/ftp/put')[0].get('destinationDir')
ftp_destination_dir2 = root.findall('./module/ftp/put')[1].get('destinationDir')
copy_destination_dir = root.find('./module/copy').get('destDir')
print(name)
print(ftp_destination_dir1)
print(ftp_destination_dir2)
print(copy_destination_dir)
# solution using lxml
from lxml import etree as et
root = et.fromstring(xml)
name = root.get('name')
ftp_destination_dirs = root.xpath('./module/ftp/put/@destinationDir')
copy_destination_dir = root.xpath('./module/copy/@destDir')[0]
print(name)
print(ftp_destination_dirs[0])
print(ftp_destination_dirs[1])
print(copy_destination_dir)
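If the number of ftp blocks varies between files, hard-coded indexes such as [0] and [1] will break. A small variation on the ElementTree solution, assuming the same simplified layout (shortened further here for brevity), loops over every match instead:

```python
from xml.etree import ElementTree as ET

# Shortened version of the simplified XML above
xml_text = '''<project name="hidden">
  <module name="Main">
    <ftp><put label="Put Files" destinationDir="destination1"/></ftp>
    <ftp><put label="Put Files" destinationDir="destination2"/></ftp>
    <copy destDir="destination3"/>
  </module>
</project>'''

root = ET.fromstring(xml_text)

# findall returns every matching element in document order
put_dirs = [put.get('destinationDir') for put in root.findall('./module/ftp/put')]

# iter() walks the whole tree, so the exact nesting of <copy> does not matter
copy_dirs = [copy.get('destDir') for copy in root.iter('copy')]

print(root.get('name'), put_dirs, copy_dirs)
```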
I have an XML file in a local folder that I want to scrape using Python. It contains CDATA sections and looks like this:
<?xml version="1.0" encoding="utf-8"?>
<trial xmlns="urn::trial">
<drksId><![CDATA[DRKS00000024]]></drksId>
<firstDrksPublishDate><![CDATA[2008-09-05T12:36:54.000+02:00]]></firstDrksPublishDate>
<firstPartnerPublishDate><![CDATA[2004-01-15T00:00:00.000+01:00]]></firstPartnerPublishDate>
......
I tried:
import xml.etree.ElementTree as Et
tree=Et.parse(filename)
root=tree.getroot()
print(root.find('drksId').text)
Output:
I am getting root.find('drksId') as None. Thanks in advance
The root element declares a default namespace, so the search has to take it into account:
ns = {'ns': 'urn::trial'}
drksId = root.find('./ns:drksId', ns)
print(drksId.text)
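As a self-contained sketch of why the bare name fails (the XML is a trimmed copy of the sample from the question): the parser stores each tag together with its namespace URI, so the lookup must either map a prefix of your choice to that URI or spell the URI out in braces.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document
xml_text = '''<trial xmlns="urn::trial">
  <drksId><![CDATA[DRKS00000024]]></drksId>
</trial>'''

root = ET.fromstring(xml_text)

# The tag is stored as '{urn::trial}drksId', so the bare name finds nothing
assert root.find('drksId') is None

# Option 1: map an arbitrary prefix to the namespace URI
ns = {'ns': 'urn::trial'}
via_prefix = root.find('ns:drksId', ns).text

# Option 2: write the URI inline in braces
via_braces = root.find('{urn::trial}drksId').text

print(via_prefix, via_braces)
```

The CDATA wrapper needs no special handling; its content is returned as ordinary element text.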
Apologies, my Python knowledge is pretty much non-existent. I need to extract a date from some XML which is in a format similar to:
<Header>
<Version>1.0</Version>
....
<cd:Data>...</cd:Data>
.....
<cd:DateReceived>20070620171524</cd:DateReceived>
From looking around here I found something similar
#!/usr/bin/python
from xml.dom.minidom import parse
import xml.dom.minidom
# Open XML document using minidom parser
DOMTree = xml.dom.minidom.parse("date.xml")
collection = DOMTree.documentElement
print collection.getElementsByTagName("cd:DateReceived").item(0)
However, this only prints the element object and its memory address:
<DOM Element: cd:DateReceived at 0x1529e0>
How can I get the date 20070620171524?
I've tried using the following
#!/usr/bin/python
from xml.dom.minidom import parse
import xml.dom.minidom
# Open XML document using minidom parser
DOMTree = xml.dom.minidom.parse("date.xml")
collection = DOMTree.documentElement
date = cd:DateReceived[0].firstChild.nodeValue
print date
but it gives an error as it doesn't like the "cd" part of the tag
date = cd:DateReceived[0].firstChild.nodeValue
^
SyntaxError: invalid syntax
Any help would be appreciated. Thanks!
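collection.getElementsByTagName("cd:DateReceived").item(0) returns an element node. The text is stored in that element's child text node, so read it with .firstChild.nodeValue; the element's own nodeValue is None.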
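A minimal sketch of that lookup follows. The xmlns:cd declaration and its URI are assumptions, since the excerpt in the question does not show where the cd prefix is declared; minidom rejects a document whose prefix is not declared somewhere.

```python
from xml.dom import minidom

# Hypothetical namespace URI for the cd prefix (not shown in the question)
xml_text = '''<Header xmlns:cd="urn:example:cd">
  <Version>1.0</Version>
  <cd:DateReceived>20070620171524</cd:DateReceived>
</Header>'''

dom = minidom.parseString(xml_text)

# The tag name is passed as a string, prefix included
element = dom.getElementsByTagName('cd:DateReceived').item(0)

# The element itself has no value; the text sits in its first child node
date = element.firstChild.nodeValue
print(date)
```

The SyntaxError in the question comes from writing cd:DateReceived as bare Python code; the name only makes sense as a string argument to getElementsByTagName.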
<PacketHeader>
<HeaderField>
<name>number</name>
<dataType>int</dataType>
</HeaderField>
</PacketHeader>
This is my small XML file and I want to extract out the text which is within the name tag.
Here is my code snippet:-
from xml.dom import minidom
from xml.dom.minidom import parse
xmldoc = minidom.parse('sample.xml')
packetHeader = xmldoc.getElementsByTagName("PacketHeader")
headerField = packetHeader.getElementsByTagName("HeaderField")
for field in headerField:
    getFieldName = field.getElementsByTagName("name")
    print getFieldName
But this prints the object's location, not the text.
from xml.dom import minidom
from xml.dom.minidom import parse
xmldoc = minidom.parse('sample.xml')
# find the name element, if found return a list, get the first element
name_element = xmldoc.getElementsByTagName("name")[0]
# this will be a text node that contains the actual text
text_node = name_element.childNodes[0]
# get text
print text_node.data
Please check this.
Update
By the way, I suggest using ElementTree. Below is a snippet that does the same thing as the minidom code above:
import xml.etree.ElementTree as ET
tree = ET.parse("sample.xml")
# the tree root is the toplevel `PacketHeader` element
print(tree.findtext("HeaderField/name"))
A small variant of the accepted and correct answer above is:
from xml.dom import minidom
xmldoc = minidom.parse('fichier.xml')
name_element = xmldoc.getElementsByTagName('name')[0]
print name_element.childNodes[0].nodeValue
This simply uses nodeValue instead of its alias data
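When a file holds several HeaderField entries, the same idea extends to a loop. In this sketch the second HeaderField (with the name "length") is hypothetical, added only to show more than one result:

```python
from xml.dom import minidom

# The second <HeaderField> is made up for illustration
xml_text = '''<PacketHeader>
  <HeaderField><name>number</name><dataType>int</dataType></HeaderField>
  <HeaderField><name>length</name><dataType>int</dataType></HeaderField>
</PacketHeader>'''

doc = minidom.parseString(xml_text)

names = []
for field in doc.getElementsByTagName('HeaderField'):
    # getElementsByTagName returns a list-like NodeList, so index into it
    name_element = field.getElementsByTagName('name')[0]
    # the text lives in the element's first child, a text node
    names.append(name_element.firstChild.data)

print(names)
```

Note that getElementsByTagName is called on the document and on each element, never on the NodeList it returns.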
It's my first time trying to parse XML with Python, so the answer could be simple, but I can't figure this out.
I'm using ElementTree to parse an XML file. The problem is that I cannot find anything inside the tree when the root element has this xmlns attribute:
<package xmlns="http://apple.com/itunes/importer" version="software5.1">
When removing this attribute everything works great. To be clear I mean when changing first line of XML file to:
<package>
Everything works great.
What am I doing wrong?
Here is my code:
import xml.etree.ElementTree as ET
tree = ET.parse('metadataCopy.xml')
root = tree.getroot()
p = root.find(".//intervals/interval")
print p
for interval in root.iterfind(".//intervals/interval"):
    start_date = interval.find('start_date').text
    end_date = interval.find('end_date').text
    print start_date, end_date
Please help. Thanks!
UPDATE:
The XML file:
<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://apple.com/itunes/importer" version="software5.1">
<metadata_token>TOKEN</metadata_token>
<provider>Provider Name</provider>
<team_id>Team_ID_Here</team_id>
<software>
<!--Apple ID: 01234567-->
<vendor_id>vendorSKU</vendor_id>
<read_only_info>
<read_only_value key="apple-id">01234567</read_only_value>
</read_only_info>
<software_metadata>
<versions>
<version string="1.0">
<locales>
<locale name="en-US">
<title>title text</title>
<description>Description text</description>
<keywords>
<keyword>key1</keyword>
<keyword>key2</keyword>
</keywords>
<version_whats_new>New things here</version_whats_new>
<support_url>http://someurl.com</support_url>
<software_screenshots>
<software_screenshot display_target="iOS-3.5-in" position="1">
</software_screenshot>
<software_screenshot display_target="iOS-4-in" position="1">
</software_screenshot>
</software_screenshots>
</locale>
</locales>
</version>
</versions>
<products>
<product>
<territory>WW</territory>
<cleared_for_sale>true</cleared_for_sale>
<sales_start_date>2013-01-05</sales_start_date>
<intervals>
<interval>
<start_date>2013-08-25</start_date>
<end_date>2014-09-01</end_date>
<wholesale_price_tier>5</wholesale_price_tier>
</interval>
<interval>
<start_date>2014-09-01</start_date>
<wholesale_price_tier>6</wholesale_price_tier>
</interval>
</intervals>
<allow_volume_discount>true</allow_volume_discount>
</product>
</products>
</software_metadata>
</software>
This happens because ElementTree does not apply the document's default namespace to search paths automatically. Every element name in a lookup path must carry a prefix that is mapped to the namespace URI.
import xml.etree.ElementTree as ET
namespaces = {"pns" : "http://apple.com/itunes/importer"}
tree = ET.parse('metadataCopy.xml')
root = tree.getroot()
p = root.find(".//pns:intervals/pns:interval", namespaces=namespaces)
print p
for interval in root.iterfind(".//pns:intervals/pns:interval",namespaces=namespaces):
    start_date = interval.find('pns:start_date',namespaces=namespaces)
    end_date = interval.find('pns:end_date',namespaces=namespaces)
    st_text = end_text = None
    if start_date is not None:
        st_text = start_date.text
    if end_date is not None:
        end_text = end_date.text
    print st_text, end_text
The XML file as shared is not well-formed: it is missing the closing </package> tag at the end. With that fixed, the program produces:
<Element '{http://apple.com/itunes/importer}interval' at 0x178b350>
2013-08-25 2014-09-01
2014-09-01 None
If changing the library is an option, consider lxml, which has good support for working with namespaces. See the short tutorial at http://lxml.de/tutorial.html#namespaces
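Staying with the standard library is also an option. As a sketch on a trimmed copy of the file (only the intervals part is kept, which is a simplification for this example), findtext can stand in for the explicit None checks, because it returns a default when the child element is missing:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the shared file, with the closing </package> tag present
xml_text = '''<package xmlns="http://apple.com/itunes/importer" version="software5.1">
  <intervals>
    <interval>
      <start_date>2013-08-25</start_date>
      <end_date>2014-09-01</end_date>
    </interval>
    <interval>
      <start_date>2014-09-01</start_date>
    </interval>
  </intervals>
</package>'''

ns = {'pns': 'http://apple.com/itunes/importer'}
root = ET.fromstring(xml_text)

dates = []
for interval in root.iterfind('.//pns:intervals/pns:interval', ns):
    # findtext returns None (the default) when the child is absent
    start = interval.findtext('pns:start_date', namespaces=ns)
    end = interval.findtext('pns:end_date', namespaces=ns)
    dates.append((start, end))

print(dates)
```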