Parsing a deeply nested xml file using a for loop - python

How can I efficiently pull data from this nested XML?
By efficiently I mean, for example, using a for loop.
Would I need to make use of a new data structure?
Parsing function:
import xml.etree.ElementTree as ET
it = ET.iterparse('OTA_AirSeatMapRS.xml')
# This for loop removes the namespaces
for _, el in it:
    _, _, el.tag = el.tag.rpartition('}')
root = it.root
# I am not able to select data with this loop
for x in element.find(Service):
    print(x)
This is part of the XML file:
<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Envelope
xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/"
xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<soapenv:Body>
<ns:OTA_AirSeatMapRS Version="1"
xmlns:ns="http://www.opentravel.org/OTA/2003/05/common/">
<ns:Success/>
<ns:SeatMapResponses>
<ns:SeatMapResponse>
<ns:FlightSegmentInfo DepartureDateTime="2020-11-22T15:30:00" FlightNumber="1179">
<ns:DepartureAirport LocationCode="LAS"/>
<ns:ArrivalAirport LocationCode="IAH"/>
<ns:Equipment AirEquipType="739"/>
</ns:FlightSegmentInfo>
<ns:SeatMapDetails>
<ns:CabinClass Layout="AB EF" UpperDeckInd="false">
<ns:RowInfo CabinType="First" OperableInd="true" RowNumber="1">
<ns:SeatInfo BlockedInd="false" BulkheadInd="false" ColumnNumber="1" ExitRowInd="false" GalleyInd="false" GridNumber="1" PlaneSection="Left">
<ns:Summary AvailableInd="false" InoperativeInd="false" OccupiedInd="false" SeatNumber="1A"/>
<ns:Features>Window</ns:Features>
</ns:SeatInfo>
My eventual goal is to store the parsed data as JSON.

Take a look at the following instruction in your code:
for x in element.find(Service):
The first flaw in your code sample is that
Service is a variable (not a string literal).
You probably initialized this variable to some string but did not
include that instruction in your code sample.
The second flaw is that find returns only the first element
matching the given path, so looping over its result visits that
element's children, not every match.
You should also check whether find returned a not-None
result, but this is a detail.
The third point is that print(x) on an element shows only its
repr (tag name and memory address), not its text or attributes.
So to have a more general example, run:
Service = 'Summary'
x = root.find(f'.//{Service}')
print(f'{x.tag}, {x.text}, {x.attrib}')
The first instruction sets the tag name.
The second instruction invokes find, but note that I added './/'
to the XPath, to look at any depth of the source XML tree.
And the last instruction prints not only text of the element found,
but also the tag name and attributes.
The result I got (for your input XML) is:
Summary, None, {'AvailableInd': 'false', 'InoperativeInd': 'false', 'OccupiedInd': 'false', 'SeatNumber': '1A'}
(text is None because Summary carries all of its data in attributes,
not in text content).
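Since the stated goal is JSON, here is a minimal sketch of how the same idea could be extended with iter and findall to collect every seat. It assumes the structure shown in the question; the sample below is a trimmed copy of that XML, and the helper names (load_without_namespaces, seats_as_dicts) and the JSON keys are my own choices, not part of any API.

```python
import io
import json
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question (assumption: the real file
# follows the same nesting).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <ns:OTA_AirSeatMapRS Version="1" xmlns:ns="http://www.opentravel.org/OTA/2003/05/common/">
      <ns:SeatMapResponses>
        <ns:SeatMapResponse>
          <ns:SeatMapDetails>
            <ns:CabinClass Layout="AB EF" UpperDeckInd="false">
              <ns:RowInfo CabinType="First" OperableInd="true" RowNumber="1">
                <ns:SeatInfo ColumnNumber="1" PlaneSection="Left">
                  <ns:Summary AvailableInd="false" SeatNumber="1A"/>
                  <ns:Features>Window</ns:Features>
                </ns:SeatInfo>
              </ns:RowInfo>
            </ns:CabinClass>
          </ns:SeatMapDetails>
        </ns:SeatMapResponse>
      </ns:SeatMapResponses>
    </ns:OTA_AirSeatMapRS>
  </soapenv:Body>
</soapenv:Envelope>"""


def load_without_namespaces(source):
    """Parse source and strip the '{uri}' prefix from every tag."""
    it = ET.iterparse(source)
    for _, el in it:
        _, _, el.tag = el.tag.rpartition('}')
    return it.root


def seats_as_dicts(root):
    """Build one dict per SeatInfo element, at any depth below root."""
    seats = []
    for row in root.iter('RowInfo'):
        for seat in row.findall('SeatInfo'):
            summary = seat.find('Summary')
            seats.append({
                'row': row.get('RowNumber'),
                'cabin': row.get('CabinType'),
                'seat': summary.get('SeatNumber') if summary is not None else None,
                'available': summary.get('AvailableInd') if summary is not None else None,
                'features': [f.text for f in seat.findall('Features')],
            })
    return seats


root = load_without_namespaces(io.BytesIO(SAMPLE))
seats = seats_as_dicts(root)
seats_json = json.dumps(seats, indent=2)
print(seats_json)
```

A plain list of dicts is usually enough as the intermediate data structure, because json.dumps can serialize it directly.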

Related

Modifying element in xml using python

Can anyone please explain how to modify an XML element in Python using ElementTree?
I want to keep the rego AD-4214 and change the make 'Tata' to 'Nissan' and the model 'Sumo' to 'Skyline'.
If rewriting the entire file is acceptable [1], the easiest way would be to turn the XML file into a dictionary (see for example here: How to convert an XML string to a dictionary?), do your modifications on that dictionary, and convert the dict back to XML (for example with https://pypi.org/project/dicttoxml/).
[1] Consider lost formatting: whitespace, number formats, etc. may not be preserved by this.
This should work:
import xml.etree.ElementTree as ET
tree = ET.parse('your_xml_source.xml')
root = tree.getroot()
root[1][1].text = "Nissan"
root[1][2].text = "Skyline"
getroot() gives you the root element (<motorvehicle>), and [1] selects its second child, the <vehicle> with rego AD-4214. The second index, [1] or [2], selects that vehicle's <make> or <model> respectively. You can then change their content through the text attribute.
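Positional indexing breaks as soon as the order of vehicles changes, so a lookup by rego may be more robust. The sketch below is only an illustration: the question's XML was not shown, so the document here is hypothetical and assumes that <rego>, <make> and <model> are child elements of each <vehicle>.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the element layout is an assumption based on
# the indices used in the answer above.
SAMPLE = """<motorvehicle>
  <vehicle><rego>AB-1234</rego><make>Ford</make><model>Focus</model></vehicle>
  <vehicle><rego>AD-4214</rego><make>Tata</make><model>Sumo</model></vehicle>
</motorvehicle>"""

root = ET.fromstring(SAMPLE)

# Find the vehicle by its rego instead of relying on its position.
for vehicle in root.findall('vehicle'):
    if vehicle.findtext('rego') == 'AD-4214':
        vehicle.find('make').text = 'Nissan'
        vehicle.find('model').text = 'Skyline'

updated = ET.tostring(root, encoding='unicode')
print(updated)
```

When the tree comes from ET.parse, the change only exists in memory until it is saved, for example with tree.write('your_xml_source.xml').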

ElementTree namespace dictionary not working with find() or findall()

I'm stumped on how to use the ElementTree namespace dictionary with subsequent find() and findall() calls using the documented syntax:
A better way to search the namespaced XML example is to create a
dictionary with your own prefixes and use those in the search
functions:
ns = {'real_person': 'http://people.example.com',
      'role': 'http://characters.example.com'}
for actor in root.findall('real_person:actor', ns):
    name = actor.find('real_person:name', ns)
    print(name.text)
    for char in actor.findall('role:character', ns):
        print(' |-->', char.text)
The issue I'm having is that if I use the syntax noted in that doc, passing the "ns" dictionary as the 2nd argument to find() or findall(), I get an empty list. If I type out the full namespace without passing the 2nd argument, it returns all of the expected elements.
I've defined my namespace dictionary as such:
ns = {'ws':'{urn:com.workday/workersync}'}
And here is the ElementTree and root setup:
xmlparser = ET.parse(xmlfile)
xmlroot = xmlparser.getroot()
Here is what I get when I try to use the dictionary shortcut syntax noted in the docs:
>>> xmlroot.findall('ws:Worker', ns)
[]
Just an empty list. Here is what I get if I type out the namespace in the call:
>>> xmlroot.findall('{urn:com.workday/workersync}Worker')
[<Element '{urn:com.workday/workersync}Worker' at 0x03220A78>, <Element '{urn:com.workday/workersync}Worker' at 0x0322D8C0>]
That returns the expected 2 elements in my sample file.
Here is what the top of my sample file looks like for reference:
<?xml version="1.0" encoding="UTF-8"?>
<ws:Worker_Sync xmlns:ws="urn:com.workday/workersync" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ws:Header>
<ws:Version>34.0</ws:Version>
<ws:Prior_Entry_Time>2020-07-04T21:40:25.822-07:00</ws:Prior_Entry_Time>
<ws:Current_Entry_Time>2020-07-04T22:03:47.458-07:00</ws:Current_Entry_Time>
<ws:Prior_Effective_Time>2020-07-04T00:00:00.000-07:00</ws:Prior_Effective_Time>
<ws:Current_Effective_Time>2020-07-05T00:00:00.000-07:00</ws:Current_Effective_Time>
<ws:Full_File>true</ws:Full_File>
<ws:Document_Retention_Policy>30</ws:Document_Retention_Policy>
<ws:Worker_Count>2</ws:Worker_Count>
</ws:Header>
<ws:Worker>
*<snipped rest of XML data>*
The snipped XML data contains 2 <ws:Worker> elements with many children under them.
I've been messing with this for longer than I'd care to admit. I feel like I'm missing something incredibly obvious: to my eyes, my code looks like every example I've found online and like the example code in the docs.
Please help!
Remove the curly brackets from the URI string. The namespace dictionary should look like this:
ns = {'ws': 'urn:com.workday/workersync'}
Another option is to use a wildcard for the namespace. This is supported for find() and findall() since Python 3.8:
print(xmlroot.findall('{*}Worker'))
Output:
[<Element '{urn:com.workday/workersync}Worker' at 0x033E6AC8>]
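To make the difference concrete, here is a small self-contained sketch comparing the two dictionaries. The document is a cut-down stand-in for the sample file, and the <ws:Name> children are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the sample file; ws:Name is a made-up child.
SAMPLE = """<ws:Worker_Sync xmlns:ws="urn:com.workday/workersync">
  <ws:Header><ws:Worker_Count>2</ws:Worker_Count></ws:Header>
  <ws:Worker><ws:Name>A</ws:Name></ws:Worker>
  <ws:Worker><ws:Name>B</ws:Name></ws:Worker>
</ws:Worker_Sync>"""

xmlroot = ET.fromstring(SAMPLE)

# Correct: the dictionary value is the bare URI.
good_ns = {'ws': 'urn:com.workday/workersync'}
# Wrong: ElementTree adds the braces itself, so these get doubled.
bad_ns = {'ws': '{urn:com.workday/workersync}'}

with_good = xmlroot.findall('ws:Worker', good_ns)
with_bad = xmlroot.findall('ws:Worker', bad_ns)
with_wildcard = xmlroot.findall('{*}Worker')  # needs Python 3.8+

print(len(with_good), len(with_bad), len(with_wildcard))
```

The prefix in the dictionary does not have to match the prefix used in the file; only the URI has to match.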

How do I extract specific data from xml using python?

I'm relatively new to Python. I've been trying to learn it through a hands-on approach (I learnt C/C++ by doing Project Euler).
Right now I'm learning how to extract data from files. I've gotten the hang of extracting data from simple text files, but I'm kind of stuck on XML files.
Here is an example of what I am trying to do.
I have my call logs backed up on Google Drive, and there are a lot of them (about 4000).
Here is the xml file example
<call number="+91234567890" duration="49" date="1483514046018" type="3" presentation="1" readable_date="04-Jan-2017 12:44:06 PM" contact_name="Dad" />
I want to take all the calls to my dad and display them like this
number = 234567890
duration = "49" date="04-Jan-2017 12:44:06 PM"
duration = "x" date="y"
duration = "n" date="z"
and so on like that.
How do you propose I do that?
It's advisable to provide sufficient information in a question so that problem can be recreated.
<?xml version="1.0" encoding="UTF-8"?>
<call number="+91234567890" duration="49" date="1483514046018" type="3"
presentation="1" readable_date="04-Jan-2017 12:44:06 PM"
contact_name="Dad" />
First we need to figure out which elements we can iterate over. Since <call ../> is the root element here, we iterate over that.
NOTE: if you have tags/elements before the line provided, you will need to work out the proper root element instead of call.
>>> [i for i in root.iter('call')]
[<Element 'call' at 0x29d3410>]
Here you can see that we can iterate over the call element.
Then we simply loop over the elements and pick out the attribute keys and values we need.
Working Code
import xml.etree.ElementTree as ET
data_file = 'test.xml'
tree = ET.parse(data_file)
root = tree.getroot()
# Print the duration and timestamp attributes of every <call> element
for i in root.iter('call'):
    print('duration', '=', i.attrib['duration'])
    print('date', '=', i.attrib['date'])
Result
>>>
duration = 49
date = 1483514046018
>>>
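The question asks only for calls to one contact, shown with the readable date. A minimal sketch of that filter follows; it assumes the <call> elements sit under some wrapper element (called <calls> here, which is a guess), and every row except the first is invented test data.

```python
import xml.etree.ElementTree as ET

# The <calls> wrapper and all rows but the first are assumptions.
SAMPLE = """<calls>
  <call number="+91234567890" duration="49" readable_date="04-Jan-2017 12:44:06 PM" contact_name="Dad" />
  <call number="+91987654321" duration="12" readable_date="05-Jan-2017 09:00:00 AM" contact_name="Mum" />
  <call number="+91234567890" duration="7" readable_date="06-Jan-2017 04:23:20 PM" contact_name="Dad" />
</calls>"""


def calls_for(root, contact):
    """Return (duration, readable_date) pairs for one contact name."""
    return [(c.get('duration'), c.get('readable_date'))
            for c in root.iter('call')
            if c.get('contact_name') == contact]


root = ET.fromstring(SAMPLE)
dad_calls = calls_for(root, 'Dad')
for duration, when in dad_calls:
    print(f'duration = "{duration}" date="{when}"')
```

Using get() instead of attrib[...] avoids a KeyError when a call entry is missing one of the attributes.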

Iterparse object in Python is not returning iter object

I'm working with the XML file in this link (Downloadable file of 40MB). In this file, I'm expecting data from 2 types of tags.
Those are: OpportunityForecastDetail_1_0 and OpportunitySynopsisDetail_1_0.
I wrote the following code for that:
ARTICLE_TAGS = ['OpportunitySynopsisDetail_1_0', 'OpportunityForecastDetail_1_0']
for _tag in ARTICLE_TAGS:
    f = open(xml_f)
    context = etree.iterparse(f, tag = _tag)
    for _, e in context:
        _id = e.xpath('.//OpportunityID/text()')
        text = e.xpath('.//OpportunityTitle/text()')
    f.close()
Then etree.iterparse(f, tag = _tag) is returning an object which is not iterable. I think this occurs when the tag is not found in the XML file.
So, I added name spaces to the iterable tag like this.
context = etree.iterparse(f, tag='{http://apply.grants.gov/system/OpportunityDetail-V1.0}'+_tag)
Now it creates an iterable object, but I'm not getting any text. I tried the other namespaces in that file, but that did not work either.
Please tell me the solution to this problem. This is a sample snippet of the XML file. The OpportunityForecastDetail_1_0 and OpportunitySynopsisDetail_1_0 tags are repeated many times in the XML file.
<?xml version="1.0" encoding="UTF-8"?>
<Grants xsi:schemaLocation="http://apply.grants.gov/system/OpportunityDetail-V1.0 https://apply07.grants.gov/apply/system/schemas/OppotunityDetail-V1.0.xsd" xmlns="http://apply.grants.gov/system/OpportunityDetail-V1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instace">
<OpportunitySynopsisDetail_1_0>
<OpportunityID>262148</OpportunityID>
<OpportunityTitle>Establishment of the Edmund S. Muskie Graduate Internship Program</OpportunityTitle>
</OpportunitySynopsisDetail_1_0>
<OpportunityForecastDetail_1_0>
<OpportunityID>284765</OpportunityID>
<OpportunityTitle>PPHF 2015: Immunization Grants-CDC Partnership: Strengthening Public Health Laboratories-financed in part by 2015 Prevention and Public Health Funds</OpportunityTitle>
</OpportunityForecastDetail_1_0>
</Grants>
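First, when you are parsing XML that contains namespaces, you must use those namespaces when looking up tag names. Your file declares a default namespace on <Grants>, so every element in it is namespaced.
Second, iterparse in the standard library's xml.etree.ElementTree doesn't take an argument named tag (lxml.etree's iterparse does), so I don't see how your code could have worked as posted unless etree is lxml.
Finally, the elements returned by the standard library's iterparse don't have a method called xpath (lxml elements do), and under lxml an un-namespaced path such as './/OpportunityID' matches nothing in this file, which would explain the missing text.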
Here is an example of how to parse XML using iterparse:
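import xml.etree.ElementTree as etree
NS = '{http://apply.grants.gov/system/OpportunityDetail-V1.0}'
ARTICLE_TAGS = [NS+'OpportunitySynopsisDetail_1_0', NS+'OpportunityForecastDetail_1_0']
# Open in binary mode so the parser applies the encoding declared in the file
with open(xml_f, 'rb') as f:
    context = etree.iterparse(f)
    for _, e in context:
        # Only react to the two wanted tags, compared with their namespace
        if e.tag in ARTICLE_TAGS:
            _id = e.find(NS+'OpportunityID')
            text = e.find(NS+'OpportunityTitle')
            print(_id.text, text.text)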
As I said in my comment, the Python documentation is helpful, as is the Effbot page on ElementTree. There are lots of other resources available; put xml.etree.ElementTree into Google and start reading!
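For a 40MB file it can also help to release each record once it has been handled. The following sketch shows one way to do that with the standard library; it runs against an in-memory copy of the snippet from the question (the second title is shortened), and iter_opportunities is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET

NS = '{http://apply.grants.gov/system/OpportunityDetail-V1.0}'
WANTED = {NS + 'OpportunitySynopsisDetail_1_0',
          NS + 'OpportunityForecastDetail_1_0'}

# In-memory stand-in for the real file; second title is shortened.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<Grants xmlns="http://apply.grants.gov/system/OpportunityDetail-V1.0">
  <OpportunitySynopsisDetail_1_0>
    <OpportunityID>262148</OpportunityID>
    <OpportunityTitle>Establishment of the Edmund S. Muskie Graduate Internship Program</OpportunityTitle>
  </OpportunitySynopsisDetail_1_0>
  <OpportunityForecastDetail_1_0>
    <OpportunityID>284765</OpportunityID>
    <OpportunityTitle>PPHF 2015: Immunization Grants</OpportunityTitle>
  </OpportunityForecastDetail_1_0>
</Grants>"""


def iter_opportunities(source):
    """Yield (id, title) for each wanted record as its end tag is seen."""
    for _, elem in ET.iterparse(source, events=('end',)):
        if elem.tag in WANTED:
            yield (elem.findtext(NS + 'OpportunityID'),
                   elem.findtext(NS + 'OpportunityTitle'))
            # Drop the record's children and text once it has been used.
            elem.clear()


records = list(iter_opportunities(io.BytesIO(SAMPLE)))
for opp_id, title in records:
    print(opp_id, title)
```

With the real file, the BytesIO object would be replaced by the file path or a file opened in binary mode.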

Creating Properly-Nested XML Output in Python

I'm attempting to save data from several lists in XML format, but I cannot understand how to make the XML display properly. An example of my code right now is as follows:
from lxml import etree
# Create XML Root
articles = etree.Element('root')
# Create Lists & Data
t_list = ['title1', 'title2', 'title3', 'title4', 'title5']
c_list = ['content1', 'content2', 'content3', 'content4', 'content5']
sum_list = ['summary1', 'summary2', 'summary3', 'summary4', 'summary5']
s_list = ['source1', 'source2', 'source3', 'source4', 'source5']
i = 0
for t in t_list:
    for i in range(len(t_list)):
        # Create SubElements of XML Root
        article = etree.SubElement(articles, 'Article')
        titles = etree.SubElement(article, 'Title')
        summary = etree.SubElement(article, 'Summary')
        source = etree.SubElement(article, 'Source')
        content = etree.SubElement(article, 'Content')
        # Add List Data to SubElements
        titles.text = t_list[i]
        summary.text = sum_list[i]
        source.text = s_list[i]
        content.text = c_list[i]
print(etree.tostring(articles, pretty_print=True))
My current output is jumbled, all on a single line, as follows:
b'<root>\n <Article>\n <Title>title1</Title>\n <Summary>summary1</Summary>\n <Source>source1</Source>\n <Content>content1</Content>\n </Article>\n
It looks like the pretty_print option in lxml is adding proper indentation, as well as the \n breaks I want, but they are not being interpreted during output; everything is written on a single line.
The output I'm trying to get is as follows:
<root>
  <Article>
    <Title>title1</Title>
    <Summary>summary1</Summary>
    <Source>source1</Source>
    <Content>content1</Content>
  </Article>
Ideally, I'd like for my output to be viewed as a valid XML document, and display in proper nested format.
Your "Current Output" is the repr (Python's internal representation) of the bytes object returned by etree.tostring(); in Python 3, print() on a bytes object shows that repr rather than the decoded text.
Fortunately, the solution is quite simple: pass encoding="unicode" to etree.tostring() so that it returns a str, i.e.:
xml = etree.tostring(articles, encoding="unicode", pretty_print=True)
print(xml)
I've only used the base ET module in Python and can't find an lxml download for Python 3.5 (which I'm on) in order to test it, but the b before the line indicates bytes, and a quick glance at the documentation indicates that tostring() has an encoding keyword, so you should just need to set that to "unicode" (with "utf-8" you still get bytes and would have to call .decode() on the result before printing).
I'll also mention that you don't need to set "i" before your for-loop (Python will create the "i" it needs for the loop), though I personally would zip the lists and iterate over their items directly (that won't have any real impact on the output in this situation).
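The zip suggestion could look roughly like the sketch below. Note that it swaps lxml for the standard library's xml.etree.ElementTree so that it runs without extra installs; ET.indent needs Python 3.9 or later, and with lxml the equivalent would be the pretty_print=True call shown above. The lists are shortened to two items.

```python
import xml.etree.ElementTree as ET

t_list = ['title1', 'title2']
c_list = ['content1', 'content2']
sum_list = ['summary1', 'summary2']
s_list = ['source1', 'source2']

root = ET.Element('root')

# zip walks the four lists in step, so no index variable is needed.
for title, summary, source, content in zip(t_list, sum_list, s_list, c_list):
    article = ET.SubElement(root, 'Article')
    ET.SubElement(article, 'Title').text = title
    ET.SubElement(article, 'Summary').text = summary
    ET.SubElement(article, 'Source').text = source
    ET.SubElement(article, 'Content').text = content

ET.indent(root)  # standard-library pretty printing, Python 3.9+
xml_text = ET.tostring(root, encoding='unicode')  # str, not bytes
print(xml_text)
```

A single loop over the zipped lists creates exactly one <Article> per title.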
