Problem getting the correct value when using xml.etree

I'm trying to export all the movie titles from an xml file but I can't seem to get the titles. The xml looks like:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<videodb>
<version>1</version>
<movie>
<title>2 Guns</title>
<originaltitle>2 Guns</originaltitle>
<ratings>
<rating name="themoviedb" max="10" default="true">
<value>6.500000</value>
<votes>1776</votes>
</rating>
</ratings>
I've seen lots of examples where the XML stores the value in an attribute such as value="title", but I can't find an example that works when there is no such attribute and the value is the element's text.
My code so far:
# Import required library
import xml.etree.cElementTree as ET
root = ET.parse('D:\\temp\\videodb.xml').getroot()
for type_text in root.findall('movie/title'):
    value = type_text.get ('text')
    print(value)

XML file:
<?xml version="1.0" encoding="utf-8"?>
<videodb>
<version>1</version>
<movie>
<title>2 Guns</title>
<originaltitle>2 Guns</originaltitle>
<ratings>
<rating name="themoviedb" max="10" default="true">
<value>6.500000</value>
<votes>1776</votes>
</rating>
</ratings>
</movie>
<movie>
<title>Top Gun</title>
<originaltitle>Top Gun</originaltitle>
<ratings>
<rating name="themoviedb" max="10" default="true">
<value>7.500000</value>
<votes>1566</votes>
</rating>
</ratings>
</movie>
<movie>
<title>Inception</title>
<originaltitle>Inceptions</originaltitle>
<ratings>
<rating name="themoviedb" max="10" default="true">
<value>9.500000</value>
<votes>177346</votes>
</rating>
</ratings>
</movie>
</videodb>
Code:
import xml.etree.ElementTree as ET
# raw string so the backslashes in the Windows path are kept literally; replace with your path
tree = ET.parse(r'E:\Python\DataFiles\movies.xml')
root = tree.getroot()
for aMovie in root.iter('movie'):
    print(aMovie.find('title').text)
Output:
2 Guns
Top Gun
Inception

Try replacing:
value = type_text.get ('text')
with
value = type_text.text
xml.etree uses Element.get to retrieve the value of an element's attribute.
You're after the element's text; see Element.text.
For example, given this contrived XML:
<element some_attribute="Some Attribute">Some Text</element>
.get('some_attribute') would return Some Attribute while .text would return Some Text.
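The difference can be checked directly with a minimal sketch that parses the contrived element above from a string. The variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# The contrived element from the answer: one attribute plus text content.
element = ET.fromstring('<element some_attribute="Some Attribute">Some Text</element>')

attribute_value = element.get('some_attribute')  # value of the named attribute
text_value = element.text                        # text between the opening and closing tags
missing_value = element.get('text')              # there is no attribute called "text", so this is None

print(attribute_value)
print(text_value)
print(missing_value)
```

The last line mirrors the question's code: .get('text') looks for an attribute named text, which a title element does not have, so it yields None rather than the title.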

Related

filter non-nested tag values from XML

I have an xml that looks like this.
<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
<offer id="11" parent_id="12">
<name>Alpha</name>
<pos>697</pos>
<kat_pis>
<pos kat="2">112</pos>
</kat_pis>
</offer>
<offer id="12" parent_id="31">
<name>Beta</name>
<pos>099</pos>
<kat_pis>
<pos kat="2">113</pos>
</kat_pis>
</offer>
</details>
</main_heading>
I am parsing it using BeautifulSoup. Upon doing this:
soup = BeautifulSoup(file, 'xml')
pos = []
for i in (soup.find_all('pos')):
    pos.append(i.text)
I get a list of all pos tag values, even the ones that are nested within the kat_pis tag.
So I get (697, 112, 099, 113).
However, I only want to get the pos values of the non-nested tags.
The desired result is (697, 099).
How can I achieve this?

Here is one way of getting those first level pos:
from bs4 import BeautifulSoup as bs
xml_doc = '''<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
<offer id="11" parent_id="12">
<name>Alpha</name>
<pos>697</pos>
<kat_pis>
<pos kat="2">112</pos>
</kat_pis>
</offer>
<offer id="12" parent_id="31">
<name>Beta</name>
<pos>099</pos>
<kat_pis>
<pos kat="2">113</pos>
</kat_pis>
</offer>
</details>
</main_heading>'''
soup = bs(xml_doc, 'xml')
pos = []
for i in (soup.select('offer > pos')):
    pos.append(i.text)
print(pos)
Result in terminal:
['697', '099']

I think the best solution would be to abandon BeautifulSoup for an XML parser with XPath support, like lxml. Using XPath expressions, you can ask for only those pos elements that are children of offer elements:
from lxml import etree
with open('data.xml') as fd:
    doc = etree.parse(fd)
pos = []
for ele in (doc.xpath('//offer/pos')):
    pos.append(ele.text)
print(pos)
Given your example input, the above code prints:
['697', '099']
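If installing lxml is not an option, the same child-only selection can probably be done with the standard library, since xml.etree.ElementTree supports a limited XPath subset. The following is a sketch under that assumption, using the sample document from the question (without the XML declaration) as an inline string:

```python
import xml.etree.ElementTree as ET

xml_doc = '''<main_heading timestamp="20220113">
<details>
<offer id="11" parent_id="12">
<name>Alpha</name>
<pos>697</pos>
<kat_pis>
<pos kat="2">112</pos>
</kat_pis>
</offer>
<offer id="12" parent_id="31">
<name>Beta</name>
<pos>099</pos>
<kat_pis>
<pos kat="2">113</pos>
</kat_pis>
</offer>
</details>
</main_heading>'''

root = ET.fromstring(xml_doc)

# ".//offer/pos" selects <pos> elements that are direct children of an <offer>;
# the <pos> elements inside <kat_pis> are children of <kat_pis>, so they are skipped.
pos = [ele.text for ele in root.findall('.//offer/pos')]
print(pos)
```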

Remove namespaces and nodes from XML string in python

I get an xml string from a post request and I need to use this xml in a subsequent request. I need to edit the XML from the first request to reflect the correct format for the subsequent request.
I can successfully remove the namespaces but am struggling to extract the desired node while keeping the XML formatting.
current format
<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body>
<GetExResponse xmlns="http://www.someurl.com/">
<GetExResult>
<DataMap xmlns="" sourceType="0">
<FieldMap flag="Q1" destination="Q1_1" source="Q1_1"/>
<FieldMap flag="Q1" destination="Q1_1" source="Q1_1"/>
</DataMap>
</GetExResult>
</GetExResponse>
</soap:Body>
</soap:Envelope>
Desired Format
<?xml version="1.0" encoding="UTF-8"?>
<DataMap xmlns="" sourceType="0">
<FieldMap flag="Q1" destination="Q1_1" source="Q1_1"/>
<FieldMap flag="Q1" destination="Q1_1" source="Q1_1"/>
</DataMap>
--removes namespaces
dmXML = xmlstring
from lxml import etree
root = etree.fromstring(dmXML)
for elem in root.getiterator():
    elem.tag = etree.QName(elem).localname
etree.cleanup_namespaces(root)
test = etree.tostring(root).decode()
print(test)
--extracts desired node but into dataframe changing the formatting
xdf = pandas.read_xml(dmXML, xpath='.//DataMap/*', namespaces={"doc": "http://www.w3.org/2001/XMLSchema"})
xml = pandas.DataFrame.to_xml(xdf)

You can simply extract the relevant portion into a new document:
import xml.etree.ElementTree as ET
root = ET.fromstring(dmXML)
new_root = root.find('.//DataMap')
print(ET.tostring(new_root, xml_declaration=True, encoding='UTF-8').decode())
Output:
<?xml version='1.0' encoding='UTF-8'?>
<DataMap sourceType="0">
<FieldMap flag="Q1" destination="Q1_1" source="Q1_1" />
<FieldMap flag="Q1" destination="Q1_1" source="Q1_1" />
</DataMap>
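The plain tag name works here because DataMap carries xmlns="", which places it (and its children) in no namespace, so no namespace stripping is needed for that node. A minimal sketch with a shortened, hypothetical copy of the response illustrates the contrast with an element that is still namespaced:

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the SOAP response in the question (an assumption for illustration).
dm_xml = '''<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body>
<GetExResponse xmlns="http://www.someurl.com/">
<GetExResult>
<DataMap xmlns="" sourceType="0">
<FieldMap flag="Q1" destination="Q1_1" source="Q1_1"/>
</DataMap>
</GetExResult>
</GetExResponse>
</soap:Body>
</soap:Envelope>'''

root = ET.fromstring(dm_xml)

# xmlns="" resets the default namespace, so DataMap is matched by its bare name.
data_map = root.find('.//DataMap')

# GetExResult inherits the default namespace and needs the {uri}tag form.
result = root.find('.//{http://www.someurl.com/}GetExResult')

print(data_map.tag)
print(result.tag)
```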

Python - replace root element of one xml file with another root element without its children

I have one xml file that looks like this, XML1:
<?xml version='1.0' encoding='utf-8'?>
<report>
</report>
And the other one that is like this,
XML2:
<?xml version='1.0' encoding='utf-8'?>
<report attrib1="blabla" attrib2="blabla" attrib3="blabla" attrib4="blabla" attrib5="blabla" >
<child1>
<child2>
....
</child2>
</child1>
</report>
I need to replace the root element of XML1 with the root element of XML2, without its children, so that XML1 looks like this:
<?xml version='1.0' encoding='utf-8'?>
<report attrib1="blabla" attrib2="blabla" attrib3="blabla" attrib4="blabla" attrib5="blabla">
</report>
Currently my code looks like this, but it doesn't remove the children; it writes the whole tree instead:
source_tree = ET.parse('XML2.xml')
source_root = source_tree.getroot()
report = source_root.findall('report')
for child in list(report):
    report.remove(child)
source_tree.write('XML1.xml', encoding='utf-8', xml_declaration=True)
Does anyone have an idea how I can achieve this?
Thanks!

Try the code below (it just copies attrib):
import xml.etree.ElementTree as ET
xml1 = '''<?xml version='1.0' encoding='utf-8'?>
<report>
</report>'''
xml2 = '''<?xml version='1.0' encoding='utf-8'?>
<report attrib1="blabla" attrib2="blabla" attrib3="blabla" attrib4="blabla" attrib5="blabla" >
<child1>
<child2>
</child2>
</child1>
</report>'''
root1 = ET.fromstring(xml1)
root2 = ET.fromstring(xml2)
root1.attrib = root2.attrib
ET.dump(root1)
Output:
<report attrib1="blabla" attrib2="blabla" attrib3="blabla" attrib4="blabla" attrib5="blabla">
</report>

So here is the working code:
import xml.etree.ElementTree as ET
source_tree = ET.parse('XML2.xml')
source_root = source_tree.getroot()
dest_tree = ET.parse('XML1.xml')
dest_root = dest_tree.getroot()
dest_root.attrib = source_root.attrib
dest_tree.write('XML1.xml', encoding='utf-8', xml_declaration=True)
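One possible refinement: assigning attrib directly appears to make both elements refer to the same dictionary, which only matters if either tree is modified afterwards. The sketch below copies the values instead; the inline strings are shortened stand-ins for XML1 and XML2, not the real files.

```python
import xml.etree.ElementTree as ET

# Shortened stand-ins for XML2 (source) and XML1 (destination).
source_root = ET.fromstring('<report attrib1="blabla" attrib2="blabla"><child1/></report>')
dest_root = ET.fromstring('<report></report>')

# Copy the attribute values rather than sharing one dict between both elements.
dest_root.attrib.update(source_root.attrib)

# A later change to the destination leaves the source untouched.
dest_root.set('attrib1', 'changed')

print(ET.tostring(dest_root, encoding='unicode'))
print(source_root.get('attrib1'))
```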

Parsing XML in Python with ElementTree

I'm using the documentation here to try to get only the values (name, ip, netmask) for certain elements.
This is an example of the structure of my xml:
<?xml version="1.0" ?>
<rpc-reply xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" xmlns:nc="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="urn:uuid:5cf32451-91af-4f71-a0bd-ead244b81b1f">
<data>
<interfaces xmlns="urn:ietf:params:xml:ns:yang:ietf-interfaces">
<interface>
<name>GigabitEthernet1</name>
<type xmlns:ianaift="urn:ietf:params:xml:ns:yang:iana-if-type">ianaift:ethernetCsmacd</type>
<enabled>true</enabled>
<ipv4 xmlns="urn:ietf:params:xml:ns:yang:ietf-ip">
<address>
<ip>192.168.40.30</ip>
<netmask>255.255.255.0</netmask>
</address>
</ipv4>
<ipv6 xmlns="urn:ietf:params:xml:ns:yang:ietf-ip"/>
</interface>
<interface>
<name>GigabitEthernet2</name>
<type xmlns:ianaift="urn:ietf:params:xml:ns:yang:iana-if-type">ianaift:ethernetCsmacd</type>
<enabled>true</enabled>
<ipv4 xmlns="urn:ietf:params:xml:ns:yang:ietf-ip">
<address>
<ip>10.10.10.1</ip>
<netmask>255.255.255.0</netmask>
</address>
</ipv4>
<ipv6 xmlns="urn:ietf:params:xml:ns:yang:ietf-ip"/>
</interface>
</interfaces>
</data>
</rpc-reply>
Python code (this code returns nothing):
import xml.etree.ElementTree as ET
tree = ET.parse("C:\\Users\\Redha\\Documents\\test_network\\interface1234.xml")
root = tree.getroot()
namespaces = {'interfaces': 'urn:ietf:params:xml:ns:yang:ietf-interfaces' }
for elem in root.findall('.//interfaces:interfaces', namespaces):
    s0 = elem.find('.//interfaces:name',namespaces)
    name = s0.text
    print(name)

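import xml.etree.ElementTree as ET
interface = ET.parse('interface2.xml')
interface_root = interface.getroot()
# interface_root[0][0] is data/interfaces; each of its children is one interface element
for interface_attribute in interface_root[0][0]:
    # [0] is name; [3][0][0] and [3][0][1] are ip and netmask under ipv4/address
    print(f"{interface_attribute[0].text}, {interface_attribute[3][0][0].text}, {interface_attribute[3][0][1].text}")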
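Positional indexing like this breaks as soon as the element order changes. A namespace-aware alternative might look like the sketch below. It assumes a trimmed copy of the sample document as an inline string, and the prefixes if and ip are arbitrary labels chosen here, not names defined by the XML.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample reply from the question (type, enabled and ipv6 omitted).
xml_doc = '''<rpc-reply xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
<data>
<interfaces xmlns="urn:ietf:params:xml:ns:yang:ietf-interfaces">
<interface>
<name>GigabitEthernet1</name>
<ipv4 xmlns="urn:ietf:params:xml:ns:yang:ietf-ip">
<address>
<ip>192.168.40.30</ip>
<netmask>255.255.255.0</netmask>
</address>
</ipv4>
</interface>
<interface>
<name>GigabitEthernet2</name>
<ipv4 xmlns="urn:ietf:params:xml:ns:yang:ietf-ip">
<address>
<ip>10.10.10.1</ip>
<netmask>255.255.255.0</netmask>
</address>
</ipv4>
</interface>
</interfaces>
</data>
</rpc-reply>'''

# Prefixes are local to this script; only the URIs have to match the document.
ns = {
    'if': 'urn:ietf:params:xml:ns:yang:ietf-interfaces',
    'ip': 'urn:ietf:params:xml:ns:yang:ietf-ip',
}

root = ET.fromstring(xml_doc)
rows = []
for interface in root.findall('.//if:interface', ns):
    # name lives in the ietf-interfaces namespace, the address parts in ietf-ip.
    name = interface.findtext('if:name', namespaces=ns)
    ip = interface.findtext('ip:ipv4/ip:address/ip:ip', namespaces=ns)
    netmask = interface.findtext('ip:ipv4/ip:address/ip:netmask', namespaces=ns)
    rows.append((name, ip, netmask))
    print(f"{name}, {ip}, {netmask}")
```

The key point is that ipv4 declares its own default namespace, so its descendants need the second URI even though they sit inside an interface element.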

Python / XML request

I am new to Python and am trying to query a website for public transport information, which I then want to show on a small display on my Raspberry Pi.
import requests
xml = """<?xml version="1.0" encoding="UTF-8"?>
<Trias version="1.1" xmlns="http://www.vdv.de/trias" xmlns:siri="http://www.siri.org.uk/siri" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ServiceRequest>
<siri:RequestTimestamp>2016-06-27T13:34:00</siri:RequestTimestamp>
<siri:RequestorRef>EPSa</siri:RequestorRef>
<RequestPayload>
<StopEventRequest>
<Location>
<LocationRef>
<StopPointRef>8578169</StopPointRef>
</LocationRef>
</Location>
<Params>
<NumberOfResults>5</NumberOfResults>
<StopEventType>departure</StopEventType>
<IncludePreviousCalls>false</IncludePreviousCalls>
<IncludeOnwardCalls>false</IncludeOnwardCalls>
<IncludeRealtimeData>true</IncludeRealtimeData>
</Params>
</StopEventRequest>
</RequestPayload>
</ServiceRequest>
</Trias>"""
headers = {'Authorization': '#MYCODE', 'Content-Type': 'application/xml'}
answer = requests.post('https://api.opentransportdata.swiss/trias', data=xml, headers=headers)
The response looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<Trias xmlns="http://www.vdv.de/trias" version="1.1">
<ServiceDelivery>
<ResponseTimestamp xmlns="http://www.siri.org.uk/siri">2018-11-19T14:17:42Z</ResponseTimestamp>
<ProducerRef xmlns="http://www.siri.org.uk/siri">EFAController10.2.9.62-WIN-G0NJHFUK71P</ProducerRef>
<Status xmlns="http://www.siri.org.uk/siri">true</Status>
<MoreData>false</MoreData>
<Language>de</Language>
<DeliveryPayload>
<StopEventResponse>
<StopEventResult>
<ResultId>ID-8E6262DF-2FB8-4591-97A3-AC3E94E56635</ResultId>
<StopEvent>
<ThisCall>
<CallAtStop>
<StopPointRef>8578169</StopPointRef>
<StopPointName>
<Text>Basel, Thomaskirche</Text>
<Language>de</Language>
</StopPointName>
<ServiceDeparture>
<TimetabledTime>2018-11-19T14:16:00Z</TimetabledTime>
<EstimatedTime>2018-11-19T14:17:00Z</EstimatedTime>
</ServiceDeparture>
<StopSeqNumber>31</StopSeqNumber>
</CallAtStop>
</ThisCall>
<Service>
<OperatingDayRef>2018-11-19</OperatingDayRef>
<JourneyRef>odp:05036::H:j18:36143:36143</JourneyRef>
<LineRef>odp:05036::H</LineRef>
<DirectionRef>outward</DirectionRef>
<Mode>
<PtMode>bus</PtMode>
<BusSubmode>regionalBus</BusSubmode>
<Name>
<Text>Bus</Text>
<Language>de</Language>
</Name>
</Mode>
<PublishedLineName>
<Text>36</Text>
<Language>de</Language>
</PublishedLineName>
<OperatorRef>odp:823</OperatorRef>
<OriginStopPointRef>8589334</OriginStopPointRef>
<OriginText>
<Text>Basel, Kleinhüningen</Text>
<Language>de</Language>
</OriginText>
<DestinationStopPointRef>8588780</DestinationStopPointRef>
<DestinationText>
<Text>Basel, Schifflände</Text>
<Language>de</Language>
</DestinationText>
</Service>
</StopEvent>
</StopEventResult>
</StopEventResponse>
</DeliveryPayload>
</ServiceDelivery>
</Trias>
How can I now extract information from it? (I am interested in TimetabledTime and EstimatedTime.)
I tried to use the ElementTree but it did not really work.
Thanks in advance!
Website of the data provider: https://opentransportdata.swiss/en/cookbook/departurearrival-display/

I tried to use the ElementTree but it did not really work.
I think @mzjn was probably right when they mentioned: Note that XML namespaces are used.
Just in case that's what the issue was, here's an example of using ElementTree to parse the XML while properly handling the default namespace.
I used the answer from @AndreaCattaneo as a base. It produces the exact same output.
Python
import xml.etree.ElementTree as ET
from datetime import datetime
test_answer = """<?xml version="1.0" encoding="UTF-8"?>
<Trias xmlns="http://www.vdv.de/trias" version="1.1">
<ServiceDelivery>
<ResponseTimestamp xmlns="http://www.siri.org.uk/siri">2018-11-19T14:17:42Z</ResponseTimestamp>
<ProducerRef xmlns="http://www.siri.org.uk/siri">EFAController10.2.9.62-WIN-G0NJHFUK71P</ProducerRef>
<Status xmlns="http://www.siri.org.uk/siri">true</Status>
<MoreData>false</MoreData>
<Language>de</Language>
<DeliveryPayload>
<StopEventResponse>
<StopEventResult>
<ResultId>ID-8E6262DF-2FB8-4591-97A3-AC3E94E56635</ResultId>
<StopEvent>
<ThisCall>
<CallAtStop>
<StopPointRef>8578169</StopPointRef>
<StopPointName>
<Text>Basel, Thomaskirche</Text>
<Language>de</Language>
</StopPointName>
<ServiceDeparture>
<TimetabledTime>2018-11-19T14:16:00Z</TimetabledTime>
<EstimatedTime>2018-11-19T14:17:00Z</EstimatedTime>
</ServiceDeparture>
<StopSeqNumber>31</StopSeqNumber>
</CallAtStop>
</ThisCall>
<Service>
<OperatingDayRef>2018-11-19</OperatingDayRef>
<JourneyRef>odp:05036::H:j18:36143:36143</JourneyRef>
<LineRef>odp:05036::H</LineRef>
<DirectionRef>outward</DirectionRef>
<Mode>
<PtMode>bus</PtMode>
<BusSubmode>regionalBus</BusSubmode>
<Name>
<Text>Bus</Text>
<Language>de</Language>
</Name>
</Mode>
<PublishedLineName>
<Text>36</Text>
<Language>de</Language>
</PublishedLineName>
<OperatorRef>odp:823</OperatorRef>
<OriginStopPointRef>8589334</OriginStopPointRef>
<OriginText>
<Text>Basel, Kleinhüningen</Text>
<Language>de</Language>
</OriginText>
<DestinationStopPointRef>8588780</DestinationStopPointRef>
<DestinationText>
<Text>Basel, Schifflände</Text>
<Language>de</Language>
</DestinationText>
</Service>
</StopEvent>
</StopEventResult>
</StopEventResponse>
</DeliveryPayload>
</ServiceDelivery>
</Trias>"""
ns = {"t": "http://www.vdv.de/trias"}
tree = ET.fromstring(test_answer)
# as strings
timetabled_time = tree.find(".//t:TimetabledTime", ns).text
estimated_time = tree.find(".//t:EstimatedTime", ns).text
# as datetime objects
date_format = "%Y-%m-%dT%H:%M:%SZ"
timetabled_time = datetime.strptime(timetabled_time, date_format)
estimated_time = datetime.strptime(estimated_time, date_format)
print("Timetabled time: {} at {}".format(timetabled_time.date(), timetabled_time.time()))
print("Estimated time: {} at {}".format(estimated_time.date(), estimated_time.time()))
Output
Timetabled time: 2018-11-19 at 14:16:00
Estimated time: 2018-11-19 at 14:17:00
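Since the request asks for up to five results, the real response may contain several StopEventResult entries, while find only returns the first match. The sketch below loops over all departures with findall. The second entry in the trimmed sample, including its time and its missing EstimatedTime, is a hypothetical addition for illustration and is not taken from the API.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Trimmed sample: the first result is from the question, the second is hypothetical.
sample = '''<Trias xmlns="http://www.vdv.de/trias" version="1.1">
<ServiceDelivery>
<DeliveryPayload>
<StopEventResponse>
<StopEventResult>
<StopEvent><ThisCall><CallAtStop>
<ServiceDeparture>
<TimetabledTime>2018-11-19T14:16:00Z</TimetabledTime>
<EstimatedTime>2018-11-19T14:17:00Z</EstimatedTime>
</ServiceDeparture>
</CallAtStop></ThisCall></StopEvent>
</StopEventResult>
<StopEventResult>
<StopEvent><ThisCall><CallAtStop>
<ServiceDeparture>
<TimetabledTime>2018-11-19T14:31:00Z</TimetabledTime>
</ServiceDeparture>
</CallAtStop></ThisCall></StopEvent>
</StopEventResult>
</StopEventResponse>
</DeliveryPayload>
</ServiceDelivery>
</Trias>'''

ns = {"t": "http://www.vdv.de/trias"}
date_format = "%Y-%m-%dT%H:%M:%SZ"

root = ET.fromstring(sample)
departures = []
for departure in root.findall(".//t:ServiceDeparture", ns):
    timetabled = departure.findtext("t:TimetabledTime", namespaces=ns)
    # findtext returns None when the element is absent, so guard before parsing.
    estimated = departure.findtext("t:EstimatedTime", namespaces=ns)
    departures.append((
        datetime.strptime(timetabled, date_format),
        datetime.strptime(estimated, date_format) if estimated else None,
    ))

for timetabled, estimated in departures:
    print("Timetabled: {}, estimated: {}".format(timetabled, estimated))
```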

I tried to use the ElementTree but it did not really work.
As mzjn said, you should provide us more information about the difficulties you encountered.
Anyway, if you want to parse the xml I suggest using a third party library to ease your work. In my example I used BeautifulSoup:
from bs4 import BeautifulSoup
from datetime import datetime
test_answer = """<?xml version="1.0" encoding="UTF-8"?>
<Trias xmlns="http://www.vdv.de/trias" version="1.1">
<ServiceDelivery>
<ResponseTimestamp xmlns="http://www.siri.org.uk/siri">2018-11-19T14:17:42Z</ResponseTimestamp>
<ProducerRef xmlns="http://www.siri.org.uk/siri">EFAController10.2.9.62-WIN-G0NJHFUK71P</ProducerRef>
<Status xmlns="http://www.siri.org.uk/siri">true</Status>
<MoreData>false</MoreData>
<Language>de</Language>
<DeliveryPayload>
<StopEventResponse>
<StopEventResult>
<ResultId>ID-8E6262DF-2FB8-4591-97A3-AC3E94E56635</ResultId>
<StopEvent>
<ThisCall>
<CallAtStop>
<StopPointRef>8578169</StopPointRef>
<StopPointName>
<Text>Basel, Thomaskirche</Text>
<Language>de</Language>
</StopPointName>
<ServiceDeparture>
<TimetabledTime>2018-11-19T14:16:00Z</TimetabledTime>
<EstimatedTime>2018-11-19T14:17:00Z</EstimatedTime>
</ServiceDeparture>
<StopSeqNumber>31</StopSeqNumber>
</CallAtStop>
</ThisCall>
<Service>
<OperatingDayRef>2018-11-19</OperatingDayRef>
<JourneyRef>odp:05036::H:j18:36143:36143</JourneyRef>
<LineRef>odp:05036::H</LineRef>
<DirectionRef>outward</DirectionRef>
<Mode>
<PtMode>bus</PtMode>
<BusSubmode>regionalBus</BusSubmode>
<Name>
<Text>Bus</Text>
<Language>de</Language>
</Name>
</Mode>
<PublishedLineName>
<Text>36</Text>
<Language>de</Language>
</PublishedLineName>
<OperatorRef>odp:823</OperatorRef>
<OriginStopPointRef>8589334</OriginStopPointRef>
<OriginText>
<Text>Basel, Kleinhüningen</Text>
<Language>de</Language>
</OriginText>
<DestinationStopPointRef>8588780</DestinationStopPointRef>
<DestinationText>
<Text>Basel, Schifflände</Text>
<Language>de</Language>
</DestinationText>
</Service>
</StopEvent>
</StopEventResult>
</StopEventResponse>
</DeliveryPayload>
</ServiceDelivery>
</Trias>"""
soup = BeautifulSoup(test_answer, "html.parser")
service_departure = soup.find("servicedeparture")
# as Tag objects
timetabled_time = service_departure.timetabledtime
estimated_time = service_departure.estimatedtime
# as strings
timetabled_time = timetabled_time.text
estimated_time = estimated_time.text
# as datetime objects
date_format = "%Y-%m-%dT%H:%M:%SZ"
timetabled_time = datetime.strptime(timetabled_time, date_format)
estimated_time = datetime.strptime(estimated_time, date_format)
print("Timetabled time: {} at {}".format(timetabled_time.date(), timetabled_time.time()))
print("Estimated time: {} at {}".format(estimated_time.date(), estimated_time.time()))
This prints:
Timetabled time: 2018-11-19 at 14:16:00
Estimated time: 2018-11-19 at 14:17:00

If you want to stick to standard Python you can use: html.parser
https://docs.python.org/3/library/html.parser.html
There are also many third party libraries that make life easier (google "html parsing python")
