I have an XML file in a local folder that I want to parse with Python. It contains CDATA sections and looks like this:
<?xml version="1.0" encoding="utf-8"?>
<trial xmlns="urn::trial">
<drksId><![CDATA[DRKS00000024]]></drksId>
<firstDrksPublishDate><![CDATA[2008-09-05T12:36:54.000+02:00]]></firstDrksPublishDate>
<firstPartnerPublishDate><![CDATA[2004-01-15T00:00:00.000+01:00]]></firstPartnerPublishDate>
......
I tried:
import xml.etree.ElementTree as Et
tree=Et.parse(filename)
root=tree.getroot()
print(root.find('drksId').text)
Output:
root.find('drksId') returns None, so accessing .text fails. Thanks in advance.
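The root element declares a default namespace (xmlns="urn::trial"), so every child tag belongs to it. Search for the element with the namespace taken into account: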
ns = {'ns': 'urn::trial'}
drksId = root.find('./ns:drksId', ns)
print(drksId.text)
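To see the namespace lookup end to end, here is a minimal, self-contained sketch. It assumes a shortened document that holds only the three elements shown in the question and closes the trial element by hand; the variable names are illustrative only. It shows two equivalent ways of naming the namespace: a prefix mapping and the brace notation.

```python
import xml.etree.ElementTree as ET

# Shortened sample document: only the three elements shown in the question,
# with the root element closed by hand (an assumption for this sketch).
xml_text = '''<?xml version="1.0" encoding="utf-8"?>
<trial xmlns="urn::trial">
<drksId><![CDATA[DRKS00000024]]></drksId>
<firstDrksPublishDate><![CDATA[2008-09-05T12:36:54.000+02:00]]></firstDrksPublishDate>
<firstPartnerPublishDate><![CDATA[2004-01-15T00:00:00.000+01:00]]></firstPartnerPublishDate>
</trial>'''

root = ET.fromstring(xml_text)

# The default namespace becomes part of every tag name.
print(root.tag)  # {urn::trial}trial

# Without the namespace the lookup finds nothing.
missing = root.find('drksId')

# Option 1: map a prefix of your choice to the namespace URI.
ns = {'ns': 'urn::trial'}
by_prefix = root.find('ns:drksId', ns)

# Option 2: spell out the namespace in braces.
by_braces = root.find('{urn::trial}drksId')

# CDATA content arrives as ordinary element text.
print(by_prefix.text)
```

The prefix in the mapping ('ns' here) is arbitrary; only the URI has to match the one declared in the document.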
I am working with Sentinel-2 Images, and I want to retrieve the Cloud_Coverage_Assessment from the XML file. I need to do this with Python.
Does anyone have any idea how to do this? I think I have to use xml.etree.ElementTree, but I'm not sure how.
The XML file:
<n1:Level-1C_User_Product xmlns:n1="https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd">
<n1:General_Info>
...
</n1:General_Info>
<n1:Geometric_Info>
...
</n1:Geometric_Info>
<n1:Auxiliary_Data_Info>
...
</n1:Auxiliary_Data_Info>
<n1:Quality_Indicators_Info>
<Cloud_Coverage_Assessment>90.7287</Cloud_Coverage_Assessment>
<Technical_Quality_Assessment>
...
</Technical_Quality_Assessment>
<Quality_Control_Checks>
...
</Quality_Control_Checks>
</n1:Quality_Indicators_Info>
</n1:Level-1C_User_Product>
Read the XML from a file:
import xml.etree.ElementTree as ET
tree = ET.parse('sentinel2.xml')
root = tree.getroot()
print(root.find('.//Cloud_Coverage_Assessment').text)
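"...and I want to retrieve the Cloud_Coverage_Assessment"
Try the code below, which uses an XPath expression. The element has no prefix and the document declares no default namespace, so no namespace handling is needed: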
import xml.etree.ElementTree as ET
xml = '''<n1:Level-1C_User_Product xmlns:n1="https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd">
<n1:General_Info>
</n1:General_Info>
<n1:Geometric_Info>
</n1:Geometric_Info>
<n1:Auxiliary_Data_Info>
</n1:Auxiliary_Data_Info>
<n1:Quality_Indicators_Info>
<Cloud_Coverage_Assessment>90.7287</Cloud_Coverage_Assessment>
<Technical_Quality_Assessment>
</Technical_Quality_Assessment>
<Quality_Control_Checks>
</Quality_Control_Checks>
</n1:Quality_Indicators_Info>
</n1:Level-1C_User_Product>'''
root = ET.fromstring(xml)
print(root.find('.//Cloud_Coverage_Assessment').text)
output
90.7287
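The value comes back as a string. As a small sketch, it can be converted to a number with a guard for a missing element. The shortened XML string and the helper name read_cloud_coverage are assumptions made for illustration, not part of any Sentinel-2 tooling:

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the Sentinel-2 metadata file (assumption for this sketch).
SAMPLE = '''<n1:Level-1C_User_Product xmlns:n1="https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd">
<n1:Quality_Indicators_Info>
<Cloud_Coverage_Assessment>90.7287</Cloud_Coverage_Assessment>
</n1:Quality_Indicators_Info>
</n1:Level-1C_User_Product>'''


def read_cloud_coverage(root):
    """Return the cloud coverage as a float, or None if the element is absent."""
    node = root.find('.//Cloud_Coverage_Assessment')
    if node is None or node.text is None:
        return None
    return float(node.text)


root = ET.fromstring(SAMPLE)
coverage = read_cloud_coverage(root)
print(coverage)
```

For a file on disk, ET.parse(path).getroot() would take the place of ET.fromstring.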
Could you please help me access the element named 'id' in the following document using Python? (I have the lxml and xml.etree.ElementTree libraries.)
Desired result: '0000000'
Desired method:
Find the child of the XML document whose name is fcsProtocolEF3.
Within fcsProtocolEF3, find the element named 'id'.
It is crucial to search by element name, not by ordinal position.
I tried to use something like this: tree.findall('{http://zakupki.gov.ru/oos/export/1}fcsProtocolEF3')[0].findall('{http://zakupki.gov.ru/oos/types/1}id')[0].text
It works, but it requires spelling out the namespaces. The XML documents have different namespaces, and I don't know how to determine them beforehand.
Thank you.
It would be great to use something like XQuery in SQL:
value('(/*:export/*:fcsProtocolEF3/*:id)[1]', 'nvarchar(21)')) AS [id],
XML-document:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns2:export xmlns:ns3="http://zakupki.gov.ru/oos/common/1" xmlns:ns4="http://zakupki.gov.ru/oos/base/1" xmlns:ns2="http://zakupki.gov.ru/oos/export/1" xmlns:ns10="http://zakupki.gov.ru/oos/printform/1" xmlns:ns11="http://zakupki.gov.ru/oos/control99/1" xmlns:ns9="http://zakupki.gov.ru/oos/SMTypes/1" xmlns:ns7="http://zakupki.gov.ru/oos/pprf615types/1" xmlns:ns8="http://zakupki.gov.ru/oos/EPtypes/1" xmlns:ns5="http://zakupki.gov.ru/oos/TPtypes/1" xmlns:ns6="http://zakupki.gov.ru/oos/CPtypes/1" xmlns="http://zakupki.gov.ru/oos/types/1">
<ns2:fcsProtocolEF3 schemeVersion="10.2">
<id>0000000</id>
<purchaseNumber>0000000000000000</purchaseNumber>
</ns2:fcsProtocolEF3>
</ns2:export>
lxml solution:
xml = '''<?xml version="1.0"?>
<ns2:export xmlns:ns3="http://zakupki.gov.ru/oos/common/1" xmlns:ns4="http://zakupki.gov.ru/oos/base/1" xmlns:ns2="http://zakupki.gov.ru/oos/export/1" xmlns:ns10="http://zakupki.gov.ru/oos/printform/1" xmlns:ns11="http://zakupki.gov.ru/oos/control99/1" xmlns:ns9="http://zakupki.gov.ru/oos/SMTypes/1" xmlns:ns7="http://zakupki.gov.ru/oos/pprf615types/1" xmlns:ns8="http://zakupki.gov.ru/oos/EPtypes/1" xmlns:ns5="http://zakupki.gov.ru/oos/TPtypes/1" xmlns:ns6="http://zakupki.gov.ru/oos/CPtypes/1" xmlns="http://zakupki.gov.ru/oos/types/1">
<ns2:fcsProtocolEF3 schemeVersion="10.2">
<id>0000000</id>
<purchaseNumber>0000000000000000</purchaseNumber>
</ns2:fcsProtocolEF3>
</ns2:export>'''
from lxml import etree as et
root = et.fromstring(xml)
text = root.xpath('//*[local-name()="export"]/*[local-name()="fcsProtocolEF3"]/*[local-name()="id"]/text()')[0]
print(text)
Below is an ElementTree-based solution. It spells out the namespaces explicitly.
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns2:export xmlns:ns3="http://zakupki.gov.ru/oos/common/1" xmlns:ns4="http://zakupki.gov.ru/oos/base/1" xmlns:ns2="http://zakupki.gov.ru/oos/export/1" xmlns:ns10="http://zakupki.gov.ru/oos/printform/1" xmlns:ns11="http://zakupki.gov.ru/oos/control99/1" xmlns:ns9="http://zakupki.gov.ru/oos/SMTypes/1" xmlns:ns7="http://zakupki.gov.ru/oos/pprf615types/1" xmlns:ns8="http://zakupki.gov.ru/oos/EPtypes/1" xmlns:ns5="http://zakupki.gov.ru/oos/TPtypes/1" xmlns:ns6="http://zakupki.gov.ru/oos/CPtypes/1" xmlns="http://zakupki.gov.ru/oos/types/1">
<ns2:fcsProtocolEF3 schemeVersion="10.2">
<id>0000000</id>
<purchaseNumber>0000000000000000</purchaseNumber>
</ns2:fcsProtocolEF3>
</ns2:export>
'''
def get_id_text():
    root = ET.fromstring(xml)
    fcs = root.find('{http://zakupki.gov.ru/oos/export/1}fcsProtocolEF3')
    # assuming there is one fcs element and one id under fcs
    return fcs.find('{http://zakupki.gov.ru/oos/types/1}id').text
print(get_id_text())
output
0000000
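If the namespaces are not known beforehand, the standard library can also ignore them. As a sketch, assuming Python 3.8 or later, ElementTree accepts {*} as a namespace wildcard in find paths, which comes close to the *:name syntax of the XQuery example. The shortened document (fewer xmlns declarations) and the helper local_name are assumptions made for this example:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the document: only the two namespaces actually used.
xml_text = '''<ns2:export xmlns:ns2="http://zakupki.gov.ru/oos/export/1" xmlns="http://zakupki.gov.ru/oos/types/1">
<ns2:fcsProtocolEF3 schemeVersion="10.2">
<id>0000000</id>
<purchaseNumber>0000000000000000</purchaseNumber>
</ns2:fcsProtocolEF3>
</ns2:export>'''

root = ET.fromstring(xml_text)

# {*} matches any namespace (or none), so only the local names are spelled out.
id_node = root.find('{*}fcsProtocolEF3/{*}id')
id_text = id_node.text if id_node is not None else None
print(id_text)


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag (hypothetical helper)."""
    return tag.rsplit('}', 1)[-1]


# Fallback for older Python versions: compare local names by hand.
fcs = next(child for child in root if local_name(child.tag) == 'fcsProtocolEF3')
id_fallback = next(child.text for child in fcs if local_name(child.tag) == 'id')
print(id_fallback)
```

Both variants search by element name rather than position, as the question requires. The trade-off is that they would also match an element with the same local name in an unrelated namespace.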
I have a Test.xml file as:
<?xml version="1.0" encoding="utf-8"?>
<SetupConf>
<LocSetup>
<Src>
<Dir1>C:\User1\test1</Dir1>
<Dir2>C:\User2\log</Dir2>
<Dir3>D:\Users\Checkup</Dir3>
<Dir4>D:\Work1</Dir4>
<Dir5>E:\job1</Dir5>
</Src>
</LocSetup>
</SetupConf>
The number of "Dir" nodes depends on user input: there may be 1, 2, 5, or 10 directories defined. With the help of @Padraic Cunningham, I was able to extract the data from Test.xml using the Python code below:
from xml.dom import minidom
from StringIO import StringIO
dom = minidom.parse('Test.xml')
Src = dom.getElementsByTagName('Src')
output = ", ".join([a.childNodes[0].nodeValue for node in Src for a in node.getElementsByTagName('Dir')])
print [output]
And getting the output:
C:\User1\test1, C:\User2\log, D:\Users\Checkup, D:\Work1, E:\job1
But the expected output is:
['C:\\User1\\test1', 'C:\\User2\\log', 'D:\\Users\\Checkup', 'D:\\Work1', 'E:\\job1']
I solved it myself:
from xml.dom import minidom
DOMTree = minidom.parse('Test0001.xml')
dom = DOMTree.documentElement
Src = dom.getElementsByTagName('Src')
for node in Src:
    output = [a.childNodes[0].nodeValue for a in node.getElementsByTagName('Dir')]
    print(output)
And getting output:
[u'C:\User1\test1', u'C:\User2\log', u'D:\Users\Checkup', u'D:\Work1', u'E:\job1']
I am sure there is a simpler way. Please let me know. Thanks in advance.
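One simpler route, sketched with xml.etree.ElementTree from the standard library: because the tags in the sample file are numbered (Dir1, Dir2, ...), this version selects the children of Src whose tag starts with "Dir" rather than looking for a tag named exactly Dir. The inline XML string stands in for Test.xml and is an assumption for this sketch:

```python
import xml.etree.ElementTree as ET

# Raw string so the backslashes in the Windows paths are kept literally.
xml_text = r'''<SetupConf>
<LocSetup>
<Src>
<Dir1>C:\User1\test1</Dir1>
<Dir2>C:\User2\log</Dir2>
<Dir3>D:\Users\Checkup</Dir3>
<Dir4>D:\Work1</Dir4>
<Dir5>E:\job1</Dir5>
</Src>
</LocSetup>
</SetupConf>'''

root = ET.fromstring(xml_text)
src = root.find('LocSetup/Src')

# Collect the text of every child of Src whose tag begins with "Dir".
dirs = [child.text for child in src if child.tag.startswith('Dir')]
print(dirs)
```

Printing a list shows the repr of each string, which is why the backslashes appear doubled in the expected output; the strings themselves contain single backslashes. To read from disk, ET.parse('Test.xml').getroot() would replace ET.fromstring.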
I'm parsing a really simple .xml file with this snippet
import xml.etree.ElementTree as etree
tree = etree.parse('/home/user/dummy.xml')
print(tree.getroot())
the output is
<Element 'doc' at 0x1d2f090>
which is correct, but I was expecting something cleaner and as simple as
doc
Is this the normal output? How can I clean it up?
I'm using Python 3.x
the dummy.xml file
<?xml version="1.0"?>
<doc>
<branch name="testing" hash="1cdf045c">
text,source
</branch>
<branch name="release01" hash="f200013e">
<sub-branch name="subrelease01">
xml,sgml
</sub-branch>
</branch>
<branch name="invalid">
</branch>
</doc>
Yes, that's the default output for an Element. If you want just the tag, try:
print(tree.getroot().tag)
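Printing an Element shows its repr, which is why the memory address appears. As a small sketch of reading the useful parts instead (tag, attributes, text), here is a trimmed copy of dummy.xml as an inline string, which is an assumption for this example:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of dummy.xml (two branches only).
xml_text = '''<doc>
<branch name="testing" hash="1cdf045c">
text,source
</branch>
<branch name="invalid">
</branch>
</doc>'''

root = ET.fromstring(xml_text)
print(root.tag)

# .get() reads an attribute and returns None when it is missing.
for branch in root.iter('branch'):
    print(branch.tag, branch.get('name'), branch.get('hash'), branch.text.strip())

names = [branch.get('name') for branch in root.iter('branch')]
```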
I am trying to write a Python program that uses the DOM to read an XML file and print another XML structure containing only the node whose category attribute is "fun".
<?xml version="1.0" encoding="ISO-8859-1"?>
<website>
<url category="fun">
<title>Fun world</title>
<author>Jack</author>
<year>2010</year>
<price>100.00</price>
</url>
<url category="entertainment">
<title>Fun world</title>
<author>Jack</author>
<year>2010</year>
<price>100.00</price>
</url>
</website>
I couldn't select the url element that has category="fun".
I tried this code:
for n in dom.getElementsByTagName('url'):
    s = n.attribute['category']
    if (s.value == "fun"):
        print n.toxml()
Can you help me debug my code?
NB: One of your tags opens "Website" and attempts to close "website", so you'll want to fix that.
You've mentioned lxml.
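from lxml import etree as et
# xml is assumed to hold the document above as bytes
root = et.fromstring(xml)
# attribute tests use @; the root tag is spelled as in the document above
fun = root.xpath('/website/url[@category="fun"]')
for node in fun:
    print(et.tostring(node, encoding='unicode'))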
Use getAttribute:
for n in dom.getElementsByTagName('url'):
    if n.getAttribute('category') == "fun":
        print(n.toxml())
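The original attempt fails because minidom elements expose their attributes as n.attributes (plural), not n.attribute. A self-contained sketch of the getAttribute approach follows; the inline XML string is a copy of the document from the question without its XML declaration, which is an assumption made so the snippet runs on its own:

```python
from xml.dom import minidom

# Copy of the document from the question, without the XML declaration.
xml_text = '''<website>
<url category="fun">
<title>Fun world</title>
<author>Jack</author>
<year>2010</year>
<price>100.00</price>
</url>
<url category="entertainment">
<title>Fun world</title>
<author>Jack</author>
<year>2010</year>
<price>100.00</price>
</url>
</website>'''

dom = minidom.parseString(xml_text)

# getAttribute returns '' when the attribute is absent, so the comparison is safe.
fun_nodes = [n for n in dom.getElementsByTagName('url')
             if n.getAttribute('category') == 'fun']

for n in fun_nodes:
    print(n.toxml())
```

The dictionary-style form n.attributes['category'].value also works, but it raises KeyError when the attribute is missing.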