parsing XML file in python2.7 - python

I know this is a very common question, but the kind of XML file and the kind of extraction of data i need is a little unique due to the nature of the xml file. So appreciate any help on the steps to extract the required data, with pyhton2.7
I have the below XML
<?xml version="1.0" encoding="UTF-8"?>
<Package xmlns="http://soap.sforce.com/2006/04/metadata">
<types>
<members>Mango.XYZ_DIG_Team_ABCDEF_Mango_Review</members>
<members>Mango.XYZ_DIG_Team_Reporting_Mango_Review</members>
<members>Opportunity.A_T_Occupier_City_Job_List</members>
<name>ListView</name>
</types>
<types>
<members>Modify_All_Data_Permission</members>
<members>Opportunity_Alerts_Implementation</members>
<members>Process_Builder_Permission</members>
<members>Regional_Business_Support</members>
<members>Reports_Dashboards_Data_Export_for_Super_Users</members>
<name>PermissionSet</name>
</types>
<types>
<members>SolutionManager</members>
<members>Standard</members>
<name>Profile</name>
</types>
<types>
<members>Mango.Set Verified Date and System Id</members>
<members>Mango.Update Mango Site With Billing Street%2C City%2C Country</members>
<members>Mango.Update Family Id on Mango when created</members>
<members>Opportunity.Set Opportunity Name</members>
<name>WorkflowRule</name>
</types>
<version>38.0</version>
</Package>
i am trying to extract only the members from the PermissionSet block. So that eventually i will have a file, that only have the entries like
Modify_All_Data_Permission
Opportunity_Alerts_Implementation
Process_Builder_Permission
Regional_Business_Support
Reports_Dashboards_Data_Export_for_Super_Users
I have been able to extract only the 'name' tag by
from xml.dom import minidom
doc = minidom.parse("path_to_xmlFile")
t = doc.getElementsByTagName("types")
for n in t:
name = n.getElementsByTagName("name")[0]
print name.firstChild.data
How can i extract the members and save that to a file?
Note: the number of 'members' are not fixed they varies.
I can also try with a different library, if it serves the purpose.

Probably easiest to use XPath
import xml.etree.ElementTree as ET
root = ET.parse('file.xml').getroot()
for member in root.findall(".//members/")
print(member.text)

This may help you!
import xml.etree.ElementTree as ET
tree = ET.parse('file.xml')
root = tree.getroot()
for data in root[1]:
print data.text

Related

store content of a tag in a string with elementtree in python3

I'm using Python 3.7.2 and elementtree to copy the content of a tag in an XML file.
This is my XML file:
<?xml version="1.0" encoding="UTF-8"?>
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pain.001.003.03">
<CstmrCdtTrfInitn>
<GrpHdr>
<MsgId>nBblsUR-uH..6jmGgZNHLQAAAXgXN1Lu</MsgId>
<CreDtTm>2016-11-10T12:00:00.000+01:00</CreDtTm>
<NbOfTxs>1</NbOfTxs>
<CtrlSum>6</CtrlSum>
<InitgPty>
<Nm>TC 03000 Kunde 55 Protokollr ckf hrung</Nm>
</InitgPty>
</GrpHdr>
</CstmrCdtTrfInitn>
</Document>
I want to copy the content of the 'MsgId' tag and save it as a string.
I've manage to do this with minidom before, but due to new circumstances, I have to settle with elementtree for now.
This is that code with minidom:
dom = xml.dom.minidom.parse('H:\\app_python/in_spsh/{}'.format(filename_string))
message = dom.getElementsByTagName('MsgId')
for MsgId in message:
print(MsgId.firstChild.nodeValue)
Now I want to do the exact same thing with elementtree. How can I achieve this?
To get the text value of a single element, you can use the findtext() method. The namespace needs to be taken into account.
from xml.etree import ElementTree as ET
tree = ET.parse("test.xml") # Your XML document
msgid = tree.findtext('.//{urn:iso:std:iso:20022:tech:xsd:pain.001.003.03}MsgId')
With Python 3.8 and later, it is possible to use a wildcard for the namespace:
msgid = tree.findtext('.//{*}MsgId')

Parsing XML Attributes with Python

I am trying to parse out all the green highlighted attributes (some sensitive things have been blacked out), I have a bunch of XML files all with similar formats, I already know how to loop through all of them individually them I am having trouble parsing out the specific attributes though.
XML Document
I need the text in the attributes: name="text1"
from
project logLevel="verbose" version="2.0" mainModule="Main" name="text1">
destinationDir="/text2" from
put label="Put Files" destinationDir="/Trigger/FPDMMT_INBOUND">
destDir="/text3" from
copy disabled="false" version="1.0" label="Archive Files" destDir="/text3" suffix="">
I am using
import csv
import os
import re
import xml.etree.ElementTree as ET
tree = ET.parse(XMLfile_path)
item = tree.getroot()[0]
root = tree.getroot()
print (item.get("name"))
print (root.get("name"))
This outputs:
Main
text1
The item.get pulls the line at index [0] which is the first line root in the tree which is <module
The root.get pulls from the first line <project
I know there's a way to search for exactly the right part of the root/tree with something like:
test = root.find('./project/module/ftp/put')
print (test.get("destinationDir"))
I need to be able to jump directly to the thing I need and output the attributes I need.
Any help would be appreciated
Thanks.
Simplified copy of your XML:
xml = '''<project logLevel="verbose" version="2.0" mainModule="Main" name="hidden">
<module name="Main">
<createWorkspace version="1.0"/>
<ftp version="1.0" label="FTP connection to PRD">
<put label="Put Files" destinationDir="destination1">
</put>
</ftp>
<ftp version="1.0" label="FTP connection to PRD">
<put label="Put Files" destinationDir="destination2">
</put>
</ftp>
<copy disabled="false" destDir="destination3">
</copy>
</module>
</project>
'''
# solution using ETree
from xml.etree import ElementTree as ET
root = ET.fromstring(xml)
name = root.get('name')
ftp_destination_dir1 = root.findall('./module/ftp/put')[0].get('destinationDir')
ftp_destination_dir2 = root.findall('./module/ftp/put')[1].get('destinationDir')
copy_destination_dir = root.find('./module/copy').get('destDir')
print(name)
print(ftp_destination_dir1)
print(ftp_destination_dir2)
print(copy_destination_dir)
# solution using lxml
from lxml import etree as et
root = et.fromstring(xml)
name = root.get('name')
ftp_destination_dirs = root.xpath('./module/ftp/put/#destinationDir')
copy_destination_dir = root.xpath('./module/copy/#destDir')[0]
print(name)
print(ftp_destination_dirs[0])
print(ftp_destination_dirs[1])
print(copy_destination_dir)

python, xml: how to access the 3rd child by element' name

Would you help me, pleace, to get an access to elemnt with name 'id' by the following construction in Python (i have lxml and xml.etree.ElementTree libraries).
Desirable result: '0000000'
Desirable method:
Search in xml-document a child, where it's name is fcsProtocolEF3.
Search in fcsProtocolEF3 an element with name 'id'.
It is crucial to search by element name. Not by ordinal position.
I tried to use something like this: tree.findall('{http://zakupki.gov.ru/oos/export/1}fcsProtocolEF3')[0].findall('{http://zakupki.gov.ru/oos/types/1}id')[0].text
it works, but it requires to input namespaces. XML-document have different namespaces and I don't know how to define them beforehand.
Thank you.
That would be great to use something like XQuery in SQL:
value('(/*:export/*:fcsProtocolEF3/*:id)[1]', 'nvarchar(21)')) AS [id],
XML-document:
<?xml version="1.0" encoding="UTF-8" standalone="true"?>
<ns2:export xmlns:ns3="http://zakupki.gov.ru/oos/common/1" xmlns:ns4="http://zakupki.gov.ru/oos/base/1" xmlns:ns2="http://zakupki.gov.ru/oos/export/1" xmlns:ns10="http://zakupki.gov.ru/oos/printform/1" xmlns:ns11="http://zakupki.gov.ru/oos/control99/1" xmlns:ns9="http://zakupki.gov.ru/oos/SMTypes/1" xmlns:ns7="http://zakupki.gov.ru/oos/pprf615types/1" xmlns:ns8="http://zakupki.gov.ru/oos/EPtypes/1" xmlns:ns5="http://zakupki.gov.ru/oos/TPtypes/1" xmlns:ns6="http://zakupki.gov.ru/oos/CPtypes/1" xmlns="http://zakupki.gov.ru/oos/types/1">
<ns2:fcsProtocolEF3 schemeVersion="10.2">
<id>0000000</id>
<purchaseNumber>0000000000000000</purchaseNumber>
</ns2:fcsProtocolEF3>
</ns2:export>
lxml solution:
xml = '''<?xml version="1.0"?>
<ns2:export xmlns:ns3="http://zakupki.gov.ru/oos/common/1" xmlns:ns4="http://zakupki.gov.ru/oos/base/1" xmlns:ns2="http://zakupki.gov.ru/oos/export/1" xmlns:ns10="http://zakupki.gov.ru/oos/printform/1" xmlns:ns11="http://zakupki.gov.ru/oos/control99/1" xmlns:ns9="http://zakupki.gov.ru/oos/SMTypes/1" xmlns:ns7="http://zakupki.gov.ru/oos/pprf615types/1" xmlns:ns8="http://zakupki.gov.ru/oos/EPtypes/1" xmlns:ns5="http://zakupki.gov.ru/oos/TPtypes/1" xmlns:ns6="http://zakupki.gov.ru/oos/CPtypes/1" xmlns="http://zakupki.gov.ru/oos/types/1">
<ns2:fcsProtocolEF3 schemeVersion="10.2">
<id>0000000</id>
<purchaseNumber>0000000000000000</purchaseNumber>
</ns2:fcsProtocolEF3>
</ns2:export>'''
from lxml import etree as et
root = et.fromstring(xml)
text = root.xpath('//*[local-name()="export"]/*[local-name()="fcsProtocolEF3"]/*[local-name()="id"]/text()')[0]
print(text)
Below is ET based solution. NS are in use.
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns2:export xmlns:ns3="http://zakupki.gov.ru/oos/common/1" xmlns:ns4="http://zakupki.gov.ru/oos/base/1" xmlns:ns2="http://zakupki.gov.ru/oos/export/1" xmlns:ns10="http://zakupki.gov.ru/oos/printform/1" xmlns:ns11="http://zakupki.gov.ru/oos/control99/1" xmlns:ns9="http://zakupki.gov.ru/oos/SMTypes/1" xmlns:ns7="http://zakupki.gov.ru/oos/pprf615types/1" xmlns:ns8="http://zakupki.gov.ru/oos/EPtypes/1" xmlns:ns5="http://zakupki.gov.ru/oos/TPtypes/1" xmlns:ns6="http://zakupki.gov.ru/oos/CPtypes/1" xmlns="http://zakupki.gov.ru/oos/types/1">
<ns2:fcsProtocolEF3 schemeVersion="10.2">
<id>0000000</id>
<purchaseNumber>0000000000000000</purchaseNumber>
</ns2:fcsProtocolEF3>
</ns2:export>
'''
def get_id_text():
root = ET.fromstring(xml)
fcs = root.find('{http://zakupki.gov.ru/oos/export/1}fcsProtocolEF3')
# assuming there is one fcs element and one id under fcs
return fcs.find('{http://zakupki.gov.ru/oos/types/1}id').text
print(get_id_text())
output
0000000

Reading xml with lxml lib geting strange string from xmlns tag

I am writing program to work on xml file and change it. But when I try to get to any part of it I get some extra part.
My xml file:
<?xml version="1.0" encoding="UTF-8"?>
<Package xmlns="http://soap.sforce.com/2006/04/metadata">
<types>
<members>sbaa__ApprovalChain__c.ExternalID__c</members>
<members>sbaa__ApprovalCondition__c.ExternalID__c</members>
<members>sbaa__ApprovalRule__c.ExternalID__c</members>
<name>CustomField</name>
</types>
<version>40.0</version>
</Package>
And I have my code:
from lxml import etree
import sys
tree = etree.parse('package.xml')
root = tree.getroot()
print( root[0][0].tag )
As output I expect to see members but I get something like this:
{http://soap.sforce.com/2006/04/metadata}members
Why do I see that url and how to stop it from showing up?
You have defined a default namespace (Wikipedia, lxml tutorial). When defined, it is a part of every child tag.
If you want to print the tag without the namespace, it's easy
tag = root[0][0].tag
print(tag[tag.find('}')+1:])
If you want to remove the namespace from XML, see this question.

List only one category Python xml

I am trying to write a python program that uses DOM to read xml file and print another xml structure that list from only one node with particular selected attribute "fun".
<?xml version="1.0" encoding="ISO-8859-1"?>
<website>
<url category="fun">
<title>Fun world</title>
<author>Jack</author>
<year>2010</year>
<price>100.00</price>
</url>
<url category="entertainment">
<title>Fun world</title>
<author>Jack</author>
<year>2010</year>
<price>100.00</price>
</url>
</website>
I couldn't select the list from the URL having category="fun".
I tried this code:
for n in dom.getElementsByTagName('url'):
s = n.attribute['category']
if (s.value == "fun"):
print n.toxml()
Can you guys help to me to debug my code?
nb: One of your tags opens "Website" and attempts to close "website" - so you'll want to fix that one...
You've mentioned lxml.
from lxml import etree as et
root = et.fromstring(xml)
fun = root.xpath('/Website/url[#category="fun"]')
for node in fun:
print et.tostring(node)
Use getAttribute:
for n in dom.getElementsByTagName('url'):
if (n.getAttribute('category') == "fun"):
print(n.toxml())

Categories

Resources