strange output from Python 3 ElementTree

I'm parsing a really simple .xml file with this snippet:
import xml.etree.ElementTree as etree
tree = etree.parse('/home/user/dummy.xml')
print(tree.getroot())
The output is:
<Element 'doc' at 0x1d2f090>
which is correct, but I was expecting something cleaner, as simple as:
doc
Is this the normal output? How can I clean it up?
I'm using Python 3.x.
The dummy.xml file:
<?xml version="1.0"?>
<doc>
<branch name="testing" hash="1cdf045c">
text,source
</branch>
<branch name="release01" hash="f200013e">
<sub-branch name="subrelease01">
xml,sgml
</sub-branch>
</branch>
<branch name="invalid">
</branch>
</doc>

Yes, that's the default output (the repr) for an Element object. If you want just the tag name, print its tag attribute:
print(tree.getroot().tag)
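
To make the difference concrete, here is a minimal sketch that contrasts the default representation with the attributes an Element exposes. It assumes a shortened inline copy of dummy.xml instead of the file on disk, so the memory address in the first line of output will differ from run to run.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for dummy.xml so the example runs without a file.
root = ET.fromstring(
    '<doc><branch name="testing" hash="1cdf045c">text,source</branch></doc>'
)

print(repr(root))  # the default form, e.g. <Element 'doc' at 0x...>
print(root.tag)    # doc

# Child elements expose the same attributes: tag, attrib and text.
branch = root.find('branch')
print(branch.tag, branch.attrib['name'], branch.text)

# tostring() serializes the element back to markup if you want the XML itself.
markup = ET.tostring(root, encoding='unicode')
print(markup)
```

The tag attribute is usually what you want for display; tostring() is the option when you need the whole subtree as text.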

Related

How can I retrieve specific information from a XML file using python?

I am working with Sentinel-2 Images, and I want to retrieve the Cloud_Coverage_Assessment from the XML file. I need to do this with Python.
Does anyone have any idea how to do this? I think I have to use the xml.etree.ElementTree but I'm not sure how?
The XML file:
<n1:Level-1C_User_Product xmlns:n1="https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd">
<n1:General_Info>
...
</n1:General_Info>
<n1:Geometric_Info>
...
</n1:Geometric_Info>
<n1:Auxiliary_Data_Info>
...
</n1:Auxiliary_Data_Info>
<n1:Quality_Indicators_Info>
<Cloud_Coverage_Assessment>90.7287</Cloud_Coverage_Assessment>
<Technical_Quality_Assessment>
...
</Technical_Quality_Assessment>
<Quality_Control_Checks>
...
</Quality_Control_Checks>
</n1:Quality_Indicators_Info>
</n1:Level-1C_User_Product>

Read the XML from a file and search for the element:
import xml.etree.ElementTree as ET
tree = ET.parse('sentinel2.xml')
root = tree.getroot()
print(root.find('.//Cloud_Coverage_Assessment').text)

Since you want to retrieve the Cloud_Coverage_Assessment, try the code below (it uses an XPath expression):
import xml.etree.ElementTree as ET
xml = '''<n1:Level-1C_User_Product xmlns:n1="https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd">
<n1:General_Info>
</n1:General_Info>
<n1:Geometric_Info>
</n1:Geometric_Info>
<n1:Auxiliary_Data_Info>
</n1:Auxiliary_Data_Info>
<n1:Quality_Indicators_Info>
<Cloud_Coverage_Assessment>90.7287</Cloud_Coverage_Assessment>
<Technical_Quality_Assessment>
</Technical_Quality_Assessment>
<Quality_Control_Checks>
</Quality_Control_Checks>
</n1:Quality_Indicators_Info>
</n1:Level-1C_User_Product>'''
root = ET.fromstring(xml)
print(root.find('.//Cloud_Coverage_Assessment').text)
output
90.7287
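
The search above works without any namespace handling because Cloud_Coverage_Assessment carries no prefix and the document declares no default namespace, so the element is in no namespace at all. The sketch below, which assumes a trimmed-down copy of the metadata and a hypothetical helper named cloud_coverage, converts the value to a float and guards against a missing element. It also shows that the prefixed elements, by contrast, need a namespace map.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the Sentinel-2 metadata shown above.
xml_text = '''<n1:Level-1C_User_Product xmlns:n1="https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd">
<n1:Quality_Indicators_Info>
<Cloud_Coverage_Assessment>90.7287</Cloud_Coverage_Assessment>
</n1:Quality_Indicators_Info>
</n1:Level-1C_User_Product>'''


def cloud_coverage(root):
    """Hypothetical helper: return the coverage as a float, or None if absent."""
    node = root.find('.//Cloud_Coverage_Assessment')
    if node is None or node.text is None:
        return None
    return float(node.text)


root = ET.fromstring(xml_text)
coverage = cloud_coverage(root)
print(coverage)  # 90.7287

# Elements written with the n1: prefix do need a prefix-to-URI mapping.
ns = {'n1': 'https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd'}
quality = root.find('n1:Quality_Indicators_Info', ns)
print(quality is not None)  # True
```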

How to print CData content while parsing xml local file with python?

I have an XML file in a local folder that I want to scrape using Python. It contains CDATA sections and looks like this:
<?xml version="1.0" encoding="utf-8"?>
<trial xmlns="urn::trial">
<drksId><![CDATA[DRKS00000024]]></drksId>
<firstDrksPublishDate><![CDATA[2008-09-05T12:36:54.000+02:00]]></firstDrksPublishDate>
<firstPartnerPublishDate><![CDATA[2004-01-15T00:00:00.000+01:00]]></firstPartnerPublishDate>
......
I tried:
import xml.etree.ElementTree as Et
tree=Et.parse(filename)
root=tree.getroot()
print(root.find('drksId').text)
Output:
I am getting root.find('drksId') as None. Thanks in advance

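The root element declares a default namespace (xmlns="urn::trial"), so every child element belongs to it and a bare 'drksId' matches nothing. Search for the element with the namespace taken into account: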
ns = {'ns': 'urn::trial'}
drksId = root.find('./ns:drksId', ns)
print(drksId.text)
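
A self-contained sketch of the same idea, assuming a shortened inline copy of the file: the un-namespaced search returns None, the namespaced one finds the element, and the CDATA content arrives as ordinary text with no extra handling.

```python
import xml.etree.ElementTree as ET

# Shortened inline copy of the trial file from the question.
xml_bytes = b'''<?xml version="1.0" encoding="utf-8"?>
<trial xmlns="urn::trial">
<drksId><![CDATA[DRKS00000024]]></drksId>
<firstDrksPublishDate><![CDATA[2008-09-05T12:36:54.000+02:00]]></firstDrksPublishDate>
</trial>'''

root = ET.fromstring(xml_bytes)

# Without the namespace the element is not found.
print(root.find('drksId'))  # None

# With a prefix-to-URI mapping (the prefix 'ns' is an arbitrary local choice).
ns = {'ns': 'urn::trial'}
drks_id = root.find('ns:drksId', ns)
print(drks_id.text)  # DRKS00000024

# Equivalent form that spells the namespace out in braces.
same = root.find('{urn::trial}drksId')
print(same.text)  # DRKS00000024
```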

Reading xml with lxml lib getting strange string from xmlns tag

I am writing a program that reads an XML file and changes it. But when I try to access any part of it, I get some extra text in the tag.
My xml file:
<?xml version="1.0" encoding="UTF-8"?>
<Package xmlns="http://soap.sforce.com/2006/04/metadata">
<types>
<members>sbaa__ApprovalChain__c.ExternalID__c</members>
<members>sbaa__ApprovalCondition__c.ExternalID__c</members>
<members>sbaa__ApprovalRule__c.ExternalID__c</members>
<name>CustomField</name>
</types>
<version>40.0</version>
</Package>
And I have my code:
from lxml import etree
import sys
tree = etree.parse('package.xml')
root = tree.getroot()
print( root[0][0].tag )
As output I expect to see members but I get something like this:
{http://soap.sforce.com/2006/04/metadata}members
Why do I see that url and how to stop it from showing up?

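You have defined a default namespace (Wikipedia, lxml tutorial). Once defined, it applies to the element and all of its descendants, so it becomes part of every child tag.
If you want to print the tag without the namespace, cut off everything up to and including the closing brace: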
tag = root[0][0].tag
print(tag[tag.find('}')+1:])
If you want to remove the namespace from XML, see this question.
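
The slicing above can be wrapped in a small function that also copes with tags that have no namespace. This is a sketch using the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages; element tags are expected to use the same {namespace}name string form in both. The helper name local_name is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened inline copy of package.xml from the question.
xml_text = '''<Package xmlns="http://soap.sforce.com/2006/04/metadata">
<types>
<members>sbaa__ApprovalChain__c.ExternalID__c</members>
<name>CustomField</name>
</types>
<version>40.0</version>
</Package>'''


def local_name(tag):
    """Hypothetical helper: drop a leading '{namespace}' from a tag string.

    Tags without a namespace are returned unchanged.
    """
    return tag.rpartition('}')[2]


root = ET.fromstring(xml_text)
first = root[0][0]
print(first.tag)              # {http://soap.sforce.com/2006/04/metadata}members
print(local_name(first.tag))  # members

# Local names of the root's direct children.
names = [local_name(child.tag) for child in root]
print(names)  # ['types', 'version']
```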

parsing XML file in python2.7

I know this is a very common question, but the kind of XML file and the kind of data extraction I need are a little unusual due to the nature of the XML file. I would appreciate any help with the steps to extract the required data using Python 2.7.
I have the XML below:
<?xml version="1.0" encoding="UTF-8"?>
<Package xmlns="http://soap.sforce.com/2006/04/metadata">
<types>
<members>Mango.XYZ_DIG_Team_ABCDEF_Mango_Review</members>
<members>Mango.XYZ_DIG_Team_Reporting_Mango_Review</members>
<members>Opportunity.A_T_Occupier_City_Job_List</members>
<name>ListView</name>
</types>
<types>
<members>Modify_All_Data_Permission</members>
<members>Opportunity_Alerts_Implementation</members>
<members>Process_Builder_Permission</members>
<members>Regional_Business_Support</members>
<members>Reports_Dashboards_Data_Export_for_Super_Users</members>
<name>PermissionSet</name>
</types>
<types>
<members>SolutionManager</members>
<members>Standard</members>
<name>Profile</name>
</types>
<types>
<members>Mango.Set Verified Date and System Id</members>
<members>Mango.Update Mango Site With Billing Street%2C City%2C Country</members>
<members>Mango.Update Family Id on Mango when created</members>
<members>Opportunity.Set Opportunity Name</members>
<name>WorkflowRule</name>
</types>
<version>38.0</version>
</Package>
I am trying to extract only the members from the PermissionSet block, so that eventually I will have a file that contains only entries like:
Modify_All_Data_Permission
Opportunity_Alerts_Implementation
Process_Builder_Permission
Regional_Business_Support
Reports_Dashboards_Data_Export_for_Super_Users
I have been able to extract only the 'name' tag by
from xml.dom import minidom
doc = minidom.parse("path_to_xmlFile")
t = doc.getElementsByTagName("types")
for n in t:
    name = n.getElementsByTagName("name")[0]
    print name.firstChild.data
How can I extract the members and save them to a file?
Note: the number of 'members' elements is not fixed; it varies.
I can also try a different library if it serves the purpose.

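Probably easiest to use an XPath-style search. The elements are in the document's default namespace, so the tag names must include it:
import xml.etree.ElementTree as ET
NS = '{http://soap.sforce.com/2006/04/metadata}'
root = ET.parse('file.xml').getroot()
for types in root.findall(NS + 'types'):
    # keep only the block whose <name> is PermissionSet
    if types.findtext(NS + 'name') == 'PermissionSet':
        for member in types.findall(NS + 'members'):
            print(member.text)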

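This may help you!
import xml.etree.ElementTree as ET
tree = ET.parse('file.xml')
root = tree.getroot()
# root[1] is the second <types> block (PermissionSet in the file above);
# the loop prints every child, so the <name> text is printed as well
for data in root[1]:
    print(data.text)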
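
Neither answer covers the "save to a file" part of the question. The following is a sketch under a few assumptions: it targets Python 3 rather than 2.7, it uses a shortened inline copy of the XML, and the helper name members_of and the output file name are made up for illustration. It selects the block by its name rather than by position, so the order of the types blocks does not matter.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Prefix-to-URI map; the prefix 'md' is an arbitrary local choice.
NS = {'md': 'http://soap.sforce.com/2006/04/metadata'}


def members_of(root, type_name):
    """Hypothetical helper: list the <members> text of the named <types> block."""
    for types in root.findall('md:types', NS):
        if types.findtext('md:name', namespaces=NS) == type_name:
            return [m.text for m in types.findall('md:members', NS)]
    return []


# Shortened inline copy of the package file from the question.
xml_text = '''<Package xmlns="http://soap.sforce.com/2006/04/metadata">
<types>
<members>SolutionManager</members>
<members>Standard</members>
<name>Profile</name>
</types>
<types>
<members>Modify_All_Data_Permission</members>
<members>Process_Builder_Permission</members>
<name>PermissionSet</name>
</types>
<version>38.0</version>
</Package>'''

root = ET.fromstring(xml_text)
members = members_of(root, 'PermissionSet')

# Write one member per line; the temp-dir path is only for the demo.
out_path = os.path.join(tempfile.gettempdir(), 'permission_set_members.txt')
with open(out_path, 'w') as out:
    out.write('\n'.join(members) + '\n')

print(members)
```

For a real file, replace ET.fromstring(xml_text) with ET.parse(path).getroot() and pick an output path of your own.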

Adding processing instruction to xml before root element with cElementTree

I am using cElementTree library to produce xml files. Now I want to write .xsl file for better readability. That's why I need to add <?xml-stylesheet type="text/xsl" href="style.xsl"?> before first tag. Unfortunately I was able to put desired line only after first tag:
import xml.etree.cElementTree as Et
test_report = Et.Element("TEST_REPORT")
root = test_report
root.append(Et.ProcessingInstruction('xml-stylesheet', 'type="text/xsl" href="style.xsl"'))
...
...
tree = Et.ElementTree(root)
tree.write(self.file_name+"_result.xml")
Which logically produces:
<TEST_REPORT>
<?xml-stylesheet type="text/xsl" href="style.xsl"?>
...
...
</TEST_REPORT>
What I need is:
<?xml-stylesheet type="text/xsl" href="style.xsl"?>
<TEST_REPORT>
...
...
</TEST_REPORT>
I am looking for something like this but it seems like there is no addprevious method in cElementTree.
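
One possible workaround, offered as a sketch rather than a definitive answer: serialize the tree to a string and prepend the XML declaration and the processing instruction by hand. It assumes Python 3 and xml.etree.ElementTree (cElementTree is no longer available in recent Python versions); the function name to_xml_with_stylesheet and the CASE child element are made up for illustration, and the href value is inserted without escaping. If lxml is an option, its elements do provide the addprevious method mentioned above.

```python
import xml.etree.ElementTree as ET


def to_xml_with_stylesheet(root, href):
    """Hypothetical helper: serialize root with a stylesheet PI before it.

    href is inserted as-is, so it must not contain a double quote.
    """
    body = ET.tostring(root, encoding='unicode')
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<?xml-stylesheet type="text/xsl" href="{}"?>\n'.format(href)
        + body
    )


root = ET.Element('TEST_REPORT')
ET.SubElement(root, 'CASE').text = 'passed'  # illustrative content only

document = to_xml_with_stylesheet(root, 'style.xsl')
print(document)

# To save it, write the text as UTF-8 so the bytes match the declaration:
# with open('result.xml', 'w', encoding='utf-8') as out:
#     out.write(document)
```

The processing instruction ends up on its own line between the declaration and the root element, which is the layout asked for above.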
