How do I extract specific data from xml using python? - python

I'm relatively new to Python. I've been trying to learn it through a hands-on approach (I learnt C/C++ by doing Project Euler).
Right now I'm learning how to extract data from files. I've gotten the hang of extracting data from simple text files, but I'm kind of stuck on XML files.
Here is an example of what I am trying to do.
I have my call logs backed up on Google Drive, and there are a lot of them (about 4000).
Here is an example line from the XML file:
<call number="+91234567890" duration="49" date="1483514046018" type="3" presentation="1" readable_date="04-Jan-2017 12:44:06 PM" contact_name="Dad" />
I want to take all the calls to my dad and display them like this
number = 234567890
duration = "49" date="04-Jan-2017 12:44:06 PM"
duration = "x" date="y"
duration = "n" date="z"
and so on like that.
How do you propose I do that?

It's advisable to provide sufficient information in a question so that the problem can be recreated. I'll assume the file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<call number="+91234567890" duration="49" date="1483514046018" type="3"
presentation="1" readable_date="04-Jan-2017 12:44:06 PM"
contact_name="Dad" />
First we need to figure out which elements we can iterate over. Since <call ../> is the root element here, we iterate over that.
NOTE: if you have tags/elements before the line provided, you will need to figure out the proper root element instead of call.
>>> [i for i in root.iter('call')]
[<Element 'call' at 0x29d3410>]
Here you can see that we can iterate over the call element.
Then we simply iterate over the matching elements and pick out the attribute keys and values we need.
Working Code
import xml.etree.ElementTree as ET
data_file = 'test.xml'
tree = ET.parse(data_file)
root = tree.getroot()
# iter('call') yields every <call> element, including the root itself if it matches
for i in root.iter('call'):
    print('duration', "=", i.attrib['duration'])
    print('date', "=", i.attrib['date'])
Result
>>>
duration = 49
date = 1483514046018
>>>
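To get closer to the output asked for in the question (only the calls to "Dad", with the readable date), the loop can filter on the contact_name attribute. This is a minimal sketch, not the original answer's code: the <calls> wrapper element, the second sample entry, and the helper name calls_for are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a <calls> wrapper and a second entry are assumed here.
SAMPLE = """<calls>
  <call number="+91234567890" duration="49" date="1483514046018" type="3"
        presentation="1" readable_date="04-Jan-2017 12:44:06 PM" contact_name="Dad" />
  <call number="+91987654321" duration="12" date="1483600000000" type="1"
        presentation="1" readable_date="05-Jan-2017 12:36:40 PM" contact_name="Mom" />
</calls>"""


def calls_for(root, contact):
    """Return (number, duration, readable_date) for each call to the given contact."""
    return [
        (c.get("number"), c.get("duration"), c.get("readable_date"))
        for c in root.iter("call")
        if c.get("contact_name") == contact
    ]


root = ET.fromstring(SAMPLE)  # for a file: ET.parse("test.xml").getroot()
dad_calls = calls_for(root, "Dad")
if dad_calls:
    print("number =", dad_calls[0][0])
for _, duration, when in dad_calls:
    print(f'duration = "{duration}" date="{when}"')
```

Using .get() instead of .attrib[...] returns None for a missing attribute rather than raising KeyError, which is probably safer across about 4000 log entries.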

Related

XMLTree Parsing and Printing

I'm starting to learn Python 3 and one of the things being discussed is XMLTree, which I'm having a hard time grasping (most likely because I'm learning Python at the same time).
What I am trying to do is output an easier-to-read version of my XML file.
The XML file (there is no limit to the number of child customers; I've included two as an example):
<?xml version="1.0" encoding="UTF-8"?>
<customers>
<customers>
<number area_code="800" exch_code="225" sub_code="5288" />
<address zip_code="90210" st_addr="9401 Sunset Blvd" />
<nameText>First Choice</nameText>
</customers>
<customers>
<number area_code="800" exch_code="867" sub_code="5309" />
<address zip_code="60652" st_addr="5 Lake Shore Drive" />
<nameText>Green Grass"</nameText>
</customers>
</customers>
From what I understand, the XML tree defines these lines as the following:
<root>
<child>
<element attribute...>
Here the first 'customers' in the XML file is the root, the second 'customers' is a child of the root, and 'number' (or address, or nameText) are elements.
With that being said, here is where I start to get confused.
If we take <number area_code="800" exch_code="225" sub_code="5288" />
This is an element with three attributes, area_code, exch_code, and sub_code but no text.
If we take <nameText>Green Grass"</nameText>
This is an element with no attributes, but does contain Text (Green Grass)
What I would like to see would be something like this:
First Choice
|--> Phone Number: 800-225-5288
|--> Address: 9401 Sunset Blvd, Zip Code: 90210
Green Grass
|--> Phone Number: 800-867-5309
|--> Address: 5 Lake Shore Drive, Zip Code: 60652
I don't really have any code to share, but here is what I have:
import xml.etree.ElementTree as ET
tree = ET.parse(my_files[0])
root = tree.getroot()
print(root.tag)
for child in root:
    print(child.tag,child.attrib)
Which provides the following output (line 1 being from print(root.tag), I believe):
customers
customers {}
customers {}
The questions I have after writing all this:
1 - Is my interpretation of the tree structure correct?
2 - How do you differentiate between attributes in ElementTree?
3 - How/what should I be considerate of in terms of the attributes, tags, and the rest of this file when trying to make the desired output? I might be overthinking how much more complex having XML in the mix is making this scenario so I am struggling to figure out how to do something similar to get the output I saw here: https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces but my xml lacks namespaces.
I'm still trying to learn, so any additional explanation is sincerely appreciated!
Resources that I've been trying to read through to understand all this:
https://docs.python.org/3/library/xml.etree.elementtree.html# (When I'm looking through this, I'm going off the assumption that when they call something an attribute, it's not something unique to ElementTree but the same attribute as defined in the next link)
https://www.w3schools.com/xml/xml_tree.asp (however, I haven't seen anything yet about multiple attributes)
https://www.edureka.co/blog/python-xml-parser-tutorial/ (This page has been a great help breaking things down step by step so I have been able to follow along)
1 - Is my interpretation of the tree structure correct?
The ElementTree parser only knows about two entities: elements and
attributes. So when you say:
From what I understand, the XML tree defines these lines as the following:
<root>
<child>
<element attribute...>
I'm a little confused. Your XML document -- or any other XML document
-- is just an element that may have zero or more attributes and may
have zero or more children...and so forth all the way down.
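That "elements all the way down" view can be made concrete with a short recursive walk. This is only an illustrative sketch: the describe helper and the trimmed sample string are made up for this example and are not part of ElementTree.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's data, used here only for illustration.
SAMPLE = """<customers>
  <customers>
    <number area_code="800" exch_code="225" sub_code="5288" />
    <nameText>First Choice</nameText>
  </customers>
</customers>"""


def describe(element, depth=0):
    """Return one line per element: indented tag, attribute dict, stripped text."""
    text = (element.text or "").strip()
    lines = [f"{'  ' * depth}{element.tag} {element.attrib} {text!r}"]
    for child in element:  # iterating an element yields its direct children
        lines.extend(describe(child, depth + 1))
    return lines


lines = describe(ET.fromstring(SAMPLE))
print("\n".join(lines))
```

Every node, whether you think of it as "root", "child" or "element", is the same kind of object with a tag, an attrib dictionary, optional text, and zero or more children.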
2 - How do you differentiate between attributes in ElementTree?
It's not clear what you mean by "differentiate" here; you can ask for
elements by name and then read their attributes by name. For example, the following code prints out the
area_code attribute of all <number> elements:
>>> from xml.etree import ElementTree as ET
>>> doc = ET.parse('data.xml')
>>> doc.findall('.//number')
[<Element 'number' at 0x7fdb8981e640>, <Element 'number' at 0x7fdb8981e680>]
>>> for x in doc.findall('.//number'):
...     print(x.get('area_code'))
...
800
800
If you'd like, you can get all of the attributes of an element as a Python
dictionary:
>>> number = doc.find('customers/number')
>>> attrs = dict(number.items())
>>> attrs
{'area_code': '800', 'exch_code': '225', 'sub_code': '5288'}
3 - How/what should I be considerate of in terms of the attributes, tags, and the rest of this file when trying to make the desired output?
That code seems to have mostly what you're looking for. As you say,
you're not using namespaces, so you don't need to qualify element
names with namespace names...that is, you can write number instead
of {some/name/space}number.
That gives us something like:
from xml.etree import ElementTree as ET
with open('data.xml') as fd:
    doc = ET.parse(fd)
# Each inner <customers> element is a direct child of the root
for customer in doc.findall('customers'):
    name = customer.find('nameText')
    number = customer.find('number')
    address = customer.find('address')
    print(name.text)
    print('|--> Address: {}, Zip Code: {}'.format(
        address.get('st_addr'), address.get('zip_code')))
    print('|--> Phone number: {}-{}-{}'.format(
        number.get('area_code'), number.get('exch_code'), number.get('sub_code')))
Given your sample input, this produces:
First Choice
|--> Address: 9401 Sunset Blvd, Zip Code: 90210
|--> Phone number: 800-225-5288
Green Grass"
|--> Address: 5 Lake Shore Drive, Zip Code: 60652
|--> Phone number: 800-867-5309

Parsing a deeply nested xml file using a for loop

How could I efficiently pull data from the nested xml?
By efficiently, I mean for example using a for loop.
Would I need to make use of a new data structure?
Parsing function:
import xml.etree.ElementTree as ET
it = ET.iterparse('OTA_AirSeatMapRS.xml')
# This for loop removes the namespaces
for _, el in it:
    _, _, el.tag = el.tag.rpartition('}')
root = it.root
# I am not able to select data with this loop
for x in element.find(Service):
    print(x)
This is part of the XML file:
<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Envelope
xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/"
xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<soapenv:Body>
<ns:OTA_AirSeatMapRS Version="1"
xmlns:ns="http://www.opentravel.org/OTA/2003/05/common/">
<ns:Success/>
<ns:SeatMapResponses>
<ns:SeatMapResponse>
<ns:FlightSegmentInfo DepartureDateTime="2020-11-22T15:30:00" FlightNumber="1179">
<ns:DepartureAirport LocationCode="LAS"/>
<ns:ArrivalAirport LocationCode="IAH"/>
<ns:Equipment AirEquipType="739"/>
</ns:FlightSegmentInfo>
<ns:SeatMapDetails>
<ns:CabinClass Layout="AB EF" UpperDeckInd="false">
<ns:RowInfo CabinType="First" OperableInd="true" RowNumber="1">
<ns:SeatInfo BlockedInd="false" BulkheadInd="false" ColumnNumber="1" ExitRowInd="false" GalleyInd="false" GridNumber="1" PlaneSection="Left">
<ns:Summary AvailableInd="false" InoperativeInd="false" OccupiedInd="false" SeatNumber="1A"/>
<ns:Features>Window</ns:Features>
</ns:SeatInfo>
My eventual goal is to store the parsed data as JSON.
Take a look at the following instruction in your code:
for x in element.find(Service):
The first flaw in your code sample is that
Service is a variable (not a string literal).
You probably initialized this variable to some string but did not
include that instruction in your code sample.
The second flaw is that find returns only the first element
matching the given path, so you should not use it as the source of a loop.
You should also check whether find returned something other than None,
but that is a side detail.
The third reason why you got empty output is that print(x)
actually prints only the text of the element in question.
So to have a more general example, run:
Service = 'Summary'
x = root.find(f'.//{Service}')
print(f'{x.tag}, {x.text}, {x.attrib}')
The first instruction sets the tag name.
The second instruction invokes find, but note that I added './/'
to the XPath, to look at any depth of the source XML tree.
And the last instruction prints not only text of the element found,
but also the tag name and attributes.
The result I got (for your input XML) is:
Summary, None, {'AvailableInd': 'false', 'InoperativeInd': 'false', 'OccupiedInd': 'false', 'SeatNumber': '1A'}
(text is just None, so you didn't see any result in your original
output).
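For the loop and the JSON goal mentioned in the question, findall (which returns every match) is the natural counterpart to find. The following is a rough sketch under stated assumptions: the sample string is a trimmed stand-in for OTA_AirSeatMapRS.xml with closing tags added, and the dictionary keys (row, cabin, seat, available, features) are names chosen for illustration only.

```python
import io
import json
import xml.etree.ElementTree as ET

# Trimmed, hypothetical stand-in for OTA_AirSeatMapRS.xml (closing tags added).
SAMPLE = b"""<ns:OTA_AirSeatMapRS xmlns:ns="http://www.opentravel.org/OTA/2003/05/common/">
  <ns:RowInfo CabinType="First" RowNumber="1">
    <ns:SeatInfo ColumnNumber="1" PlaneSection="Left">
      <ns:Summary AvailableInd="false" SeatNumber="1A"/>
      <ns:Features>Window</ns:Features>
    </ns:SeatInfo>
  </ns:RowInfo>
</ns:OTA_AirSeatMapRS>"""

# Same namespace-stripping trick as in the question
it = ET.iterparse(io.BytesIO(SAMPLE))
for _, el in it:
    _, _, el.tag = el.tag.rpartition('}')
root = it.root

seats = []
for row in root.findall('.//RowInfo'):      # every row, at any depth
    for seat in row.findall('SeatInfo'):    # direct children of this row
        summary = seat.find('Summary')      # first match, or None
        seats.append({
            'row': row.get('RowNumber'),
            'cabin': row.get('CabinType'),
            'seat': summary.get('SeatNumber') if summary is not None else None,
            'available': summary.get('AvailableInd') if summary is not None else None,
            'features': [f.text for f in seat.findall('Features')],
        })

print(json.dumps(seats, indent=2))
```

A plain list of dictionaries is usually enough as the intermediate data structure, since json.dumps can serialize it directly.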

How to parse .xml file with multiple nested children in python?

I am using python to parse a .xml file which is quite complicated since it has a lot of nested children; accessing some of the values contained in it is quite annoying since the code starts to become pretty bad looking.
Let me first present you the .xml file:
<?xml version="1.0" encoding="utf-8"?>
<Start>
<step1 stepA="5" stepB="6" />
<step2>
<GOAL1>11111</GOAL1>
<stepB>
<stepBB>
<stepBBB stepBBB1="pinco">1</stepBBB>
</stepBB>
<stepBC>
<stepBCA>
<GOAL2>22222</GOAL2>
</stepBCA>
</stepBC>
<stepBD>-NO WOMAN NO CRY
-I SHOT THE SHERIF
-WHO LET THE DOGS OUT
</stepBD>
</stepB>
</step2>
<step3>
<GOAL3 GOAL3_NAME="GIOVANNI" GOAL3_ID="GIO">
<stepB stepB1="12" stepB2="13" />
<stepC>XXX</stepC>
<stepC>
<stepCC>
<stepCC GOAL4="saf12">33333</stepCC>
</stepCC>
</stepC>
</GOAL3>
</step3>
<step3>
<GOAL3 GOAL3_NAME="ANDREA" GOAL3_ID="DRW">
<stepB stepB1="14" stepB2="15" />
<stepC>YYY</stepC>
<stepC>
<stepCC>
<stepCC GOAL4="fwe34">44444</stepCC>
</stepCC>
</stepC>
</GOAL3>
</step3>
</Start>
My goal would be to access the values contained inside the children named "GOAL" in a nicer way than the one I wrote in my sample code below. Furthermore, I would like an automated way to find the values of GOALs that share the same kind of tag but belong to different children with the same name:
Example: GIOVANNI and ANDREA are both under the same kind of tag (GOAL3_NAME) and belong to different children having the same name (<step3>) though.
Here is the code that I wrote:
import xml.etree.ElementTree as ET
data = ET.parse('test.xml').getroot()
GOAL1 = data.getchildren()[1].getchildren()[0].text
print(GOAL1)
GOAL2 = data.getchildren()[1].getchildren()[1].getchildren()[1].getchildren()[0].getchildren()[0].text
print(GOAL2)
GOAL3 = data.getchildren()[2].getchildren()[0].text
print(GOAL3)
GOAL4_A = data.getchildren()[2].getchildren()[0].getchildren()[2].getchildren()[0].getchildren()[0].text
print(GOAL4_A)
GOAL4_B = data.getchildren()[3].getchildren()[0].getchildren()[2].getchildren()[0].getchildren()[0].text
print(GOAL4_B)
and the output that I get is the following:
11111
22222
33333
44444
The output that I would like should be like this:
11111
22222
GIOVANNI
33333
ANDREA
44444
As you can see, I am able to read GOAL1 and GOAL2 easily, but I am looking for a nicer coding practice to access those values, since my version seems too long and hard to read/understand.
The second thing I would like to do is get GOAL3 and GOAL4 in an automated way, so that I do not have to repeat similar lines of code and the result is more readable and understandable.
Note: as you can see, I was not able to read GOAL3. If possible, I would like to get both the GOAL3_NAME and GOAL3_ID.
In order to make the .xml file structure more understandable I post an image of what it looks like:
The highlighted elements are what I am looking for.
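Here is a simple example that iterates from head to tail with a recursive function; you can then collect the needed information from that. (The original version imported cElementTree for speed; that module was removed in Python 3.9, and the plain ElementTree import picks up the fast C implementation automatically.)
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
def get_tail(root):
    # Print the text of every descendant, depth first
    for child in root:
        print(child.text)
        get_tail(child)
get_tail(root)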
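import xml.etree.ElementTree as ET
data = ET.parse('test.xml')
for d in data.iter():
    if d.tag in ["GOAL1", "GOAL2", "stepCC"]:
        # Skip the outer <stepCC> wrapper, whose text is only whitespace
        if d.text and d.text.strip():
            print(d.text)
    elif d.tag == "GOAL3":
        # GOAL3_NAME is an attribute of <GOAL3>, so read it with get()
        print(d.get("GOAL3_NAME"))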
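As an alternative to chained getchildren() calls, ElementTree's limited XPath support lets you name the path instead of counting positions. The sketch below is one possible approach, not the answerers' code; the sample string is a condensed copy of the question's file, and the results list is just a convenient container for this example.

```python
import xml.etree.ElementTree as ET

# Condensed copy of the question's XML (only the parts relevant to the GOALs).
SAMPLE = """<Start>
  <step2>
    <GOAL1>11111</GOAL1>
    <stepB><stepBC><stepBCA><GOAL2>22222</GOAL2></stepBCA></stepBC></stepB>
  </step2>
  <step3>
    <GOAL3 GOAL3_NAME="GIOVANNI" GOAL3_ID="GIO">
      <stepC>XXX</stepC>
      <stepC><stepCC><stepCC GOAL4="saf12">33333</stepCC></stepCC></stepC>
    </GOAL3>
  </step3>
  <step3>
    <GOAL3 GOAL3_NAME="ANDREA" GOAL3_ID="DRW">
      <stepC>YYY</stepC>
      <stepC><stepCC><stepCC GOAL4="fwe34">44444</stepCC></stepCC></stepC>
    </GOAL3>
  </step3>
</Start>"""

root = ET.fromstring(SAMPLE)

# findtext returns the text of the first match; './/' searches at any depth
results = [root.findtext('step2/GOAL1'), root.findtext('.//GOAL2')]

# One loop handles every <step3>/<GOAL3>, however many there are
for goal3 in root.iterfind('step3/GOAL3'):
    inner = goal3.find('.//stepCC[@GOAL4]')  # the stepCC that carries a GOAL4 attribute
    results.append((goal3.get('GOAL3_NAME'), goal3.get('GOAL3_ID')))
    results.append((inner.get('GOAL4'), inner.text))

for item in results:
    print(item)
```

The [@GOAL4] predicate is what distinguishes the inner stepCC from its same-named wrapper, so no index arithmetic is needed.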

Creating Properly-Nested XML Output in Python

I'm attempting to save data from several lists in XML format, but I cannot understand how to make the XML display properly. An example of my code right now is as follows:
from lxml import etree
#Create XML Root
articles = etree.Element('root')
#Create Lists & Data
t_list = ['title1', 'title2', 'title3', 'title4', 'title5']
c_list = ['content1', 'content2', 'content3', 'content4', 'content5']
sum_list = ['summary1', 'summary2', 'summary3', 'summary4', 'summary5']
s_list = ['source1', 'source2', 'source3', 'source4', 'source5']
i = 0
for t in t_list:
    for i in range(len(t_list)):
        #Create SubElements of XML Root
        article = etree.SubElement(articles, 'Article')
        titles = etree.SubElement(article, 'Title')
        summary = etree.SubElement(article, 'Summary')
        source = etree.SubElement(article, 'Source')
        content = etree.SubElement(article, 'Content')
        #Add List Data to SubElements
        titles.text = t_list[i]
        summary.text = sum_list[i]
        source.text = s_list[i]
        content.text = c_list[i]
print(etree.tostring(articles, pretty_print=True))
My Current Output is written in one very jumbled fashion, all on a single line as follows:
b'<root>\n <Article>\n <Title>title1</Title>\n <Summary>summary1</Summary>\n <Source>source1</Source>\n <Content>content1</Content>\n </Article>\n
It looks like the pretty_print option within lxml is adding proper indentation, as well as the \n breaks I want, but they don't seem to be interpreted correctly during output; everything is written on a single line.
The output I'm trying to get is as follows:
<root>
<Article>
<Title>title1</Title>
<Summary>summary1</Summary>
<Source>source1</Source>
<Content>content1</Content>
</Article>
Ideally, I'd like for my output to be viewed as a valid XML document, and display in proper nested format.
Your "Current Output" is the repr (Python's internal representation) of the bytestring generated by etree.tostring(); in Python 3, print(somebytestring) prints that representation instead of the decoded text.
Fortunately the solution is quite simple: just pass the desired encoding to etree.tostring(), i.e.:
xml = etree.tostring(articles, encoding="unicode", pretty_print=True)
print(xml)
I've only used the base ET module in Python and can't find an lxml download for Python 3.5 (which I'm on) in order to test it, but the b before the line indicates bytes, and a quick glance at the documentation indicates that tostring() has an encoding keyword, so you should just need to set that to "unicode" (a byte encoding such as utf-8 would still give you bytes).
I'll also mention that you don't need to set i before your for-loop (Python will create the i it needs for the loop), though I personally would zip the lists and iterate over the items themselves (that's not going to have any real impact on the code in this situation).
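The zip suggestion might look roughly like this. Note the swap: this sketch uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra installs, and ET.indent only exists on Python 3.9+, hence the hasattr guard. With lxml you would keep etree.tostring(articles, encoding="unicode", pretty_print=True) instead. The two-item lists are shortened sample data.

```python
import xml.etree.ElementTree as ET

# Shortened sample data; the question uses five items per list.
t_list = ['title1', 'title2']
sum_list = ['summary1', 'summary2']
s_list = ['source1', 'source2']
c_list = ['content1', 'content2']

articles = ET.Element('root')

# zip walks the four lists in lockstep, so no index variable is needed
for title, summary, source, content in zip(t_list, sum_list, s_list, c_list):
    article = ET.SubElement(articles, 'Article')
    ET.SubElement(article, 'Title').text = title
    ET.SubElement(article, 'Summary').text = summary
    ET.SubElement(article, 'Source').text = source
    ET.SubElement(article, 'Content').text = content

if hasattr(ET, 'indent'):  # Python 3.9+; adds newlines and indentation in place
    ET.indent(articles)

# encoding='unicode' returns str, so print shows real line breaks
xml_text = ET.tostring(articles, encoding='unicode')
print(xml_text)
```

Because each list is consumed exactly once, this produces one Article per title rather than one per pair of loop iterations.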

Extracting Specific Lines of XML with Python ElementTree

I am a bit stuck on a project I am doing which uses Python, which I am very new to. I have been told to use ElementTree to get specified data out of an incoming XML file. It sounds simple, but I am not great at programming. Below is a (very!) tiny example of an incoming file along with the code I am trying to use.
I would like any tips or places to go next with this. I have tried searching and following what other people have done, but I can't seem to get the same results. My aim is to get the information contained in "Active", "Room" and "Direction", but later on I will need to get much more information.
I have tried using XPaths but it does not work too well, especially with the namespaces the XML uses and the fact that an XPath for everything I need would become too large. I have simplified the example so I can understand the principle, as after this it must be extended to gain more information from an "AssetEquipment" and multiple instances of them. The end goal would be all information from one equipment being saved to a dictionary so I can manipulate it later, with each new equipment in its own separate dictionary.
Example XML:
<AssetData>
<Equipment>
<AssetEquipment ID="3" name="PC960">
<Active>Yes</Active>
<Location>
<RoomLocation>
<Room>23</Room>
<Area>
<X-Area>-1</X-Area>
<Y-Area>2.4</Y-Area>
</Area>
</RoomLocation>
</Location>
<Direction>Positive</Direction>
<AssetSupport>12</AssetSupport>
</AssetEquipment>
</Equipment>
Example Code:
import re
import xml.etree.ElementTree as ET
tree = ET.parse(r'C:\Temp\Example.xml')
root = tree.getroot()
ns = "{http://namespace.co.uk}"
for equipment in root.findall(ns + "Equipment//"):
    tagname = re.sub(r'\{.*?\}','',equipment.tag)
    name = equipment.get('name')
    if tagname == 'AssetEquipment':
        print("\tName: " + repr(name))
        for attributes in root.findall(ns + "Equipment/" + ns + "AssetEquipment//"):
            attname = re.sub(r'\{.*?\}','',attributes.tag)
            if tagname == 'Room': #This does not work but I need it to be found while
                                  #in this instance of "AssetEquipment" so it does not
                                  #call information from another asset instead.
                room = equipment.text
                print("\t\tRoom:", repr(room))
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
for elem in tree.iter():
    if elem.tag=='{http://www.namespace.co.uk}AssetEquipment':
        # One dictionary per AssetEquipment
        output={}
        for elem1 in list(elem):
            if elem1.tag=='{http://www.namespace.co.uk}Active':
                output['Active']=elem1.text
            if elem1.tag=='{http://www.namespace.co.uk}Direction':
                output['Direction']=elem1.text
            if elem1.tag=='{http://www.namespace.co.uk}Location':
                for elem2 in list(elem1):
                    if elem2.tag=='{http://www.namespace.co.uk}RoomLocation':
                        for elem3 in list(elem2):
                            if elem3.tag=='{http://www.namespace.co.uk}Room':
                                output['Room']=elem3.text
        print(output)
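A shorter variant is possible by passing a prefix-to-URI mapping to find/findtext, so the paths stay readable and each lookup is scoped to the current AssetEquipment. This is a sketch under assumptions: the xmlns declaration and the second AssetEquipment in the sample are invented for illustration (the question's example shows neither), and the prefix a is arbitrary.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the namespace declaration and second asset are assumed.
SAMPLE = """<AssetData xmlns="http://www.namespace.co.uk">
  <Equipment>
    <AssetEquipment ID="3" name="PC960">
      <Active>Yes</Active>
      <Location><RoomLocation><Room>23</Room></RoomLocation></Location>
      <Direction>Positive</Direction>
    </AssetEquipment>
    <AssetEquipment ID="4" name="PC961">
      <Active>No</Active>
      <Location><RoomLocation><Room>7</Room></RoomLocation></Location>
      <Direction>Negative</Direction>
    </AssetEquipment>
  </Equipment>
</AssetData>"""

NS = {'a': 'http://www.namespace.co.uk'}  # the prefix 'a' is an arbitrary local alias

root = ET.fromstring(SAMPLE)
assets = []
for equipment in root.iterfind('a:Equipment/a:AssetEquipment', NS):
    # Searching from `equipment` keeps every lookup inside this one asset
    assets.append({
        'name': equipment.get('name'),
        'Active': equipment.findtext('a:Active', namespaces=NS),
        'Room': equipment.findtext('.//a:Room', namespaces=NS),
        'Direction': equipment.findtext('a:Direction', namespaces=NS),
    })

for asset in assets:
    print(asset)
```

Calling findtext on the equipment element rather than on root is what prevents the Room of one asset from being picked up for another.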
