How to parse .xml file with multiple nested children in python? - python

I am using python to parse a .xml file which is quite complicated since it has a lot of nested children; accessing some of the values contained in it is quite annoying since the code starts to become pretty bad looking.
Let me first present you the .xml file:
<?xml version="1.0" encoding="utf-8"?>
<Start>
<step1 stepA="5" stepB="6" />
<step2>
<GOAL1>11111</GOAL1>
<stepB>
<stepBB>
<stepBBB stepBBB1="pinco">1</stepBBB>
</stepBB>
<stepBC>
<stepBCA>
<GOAL2>22222</GOAL2>
</stepBCA>
</stepBC>
<stepBD>-NO WOMAN NO CRY
-I SHOT THE SHERIF
-WHO LET THE DOGS OUT
</stepBD>
</stepB>
</step2>
<step3>
<GOAL3 GOAL3_NAME="GIOVANNI" GOAL3_ID="GIO">
<stepB stepB1="12" stepB2="13" />
<stepC>XXX</stepC>
<stepC>
<stepCC>
<stepCC GOAL4="saf12">33333</stepCC>
</stepCC>
</stepC>
</GOAL3>
</step3>
<step3>
<GOAL3 GOAL3_NAME="ANDREA" GOAL3_ID="DRW">
<stepB stepB1="14" stepB2="15" />
<stepC>YYY</stepC>
<stepC>
<stepCC>
<stepCC GOAL4="fwe34">44444</stepCC>
</stepCC>
</stepC>
</GOAL3>
</step3>
</Start>
My goal is to access the values contained in the children named "GOAL" in a nicer way than the one in my sample code below. Furthermore, I would like an automated way to find the values of GOALs that share the same tag but belong to different children with the same name:
Example: GIOVANNI and ANDREA are both under the same kind of tag (GOAL3_NAME) and belong to different children having the same name (<step3>) though.
Here is the code that I wrote:
import xml.etree.ElementTree as ET
data = ET.parse('test.xml').getroot()
GOAL1 = data.getchildren()[1].getchildren()[0].text
print(GOAL1)
GOAL2 = data.getchildren()[1].getchildren()[1].getchildren()[1].getchildren()[0].getchildren()[0].text
print(GOAL2)
GOAL3 = data.getchildren()[2].getchildren()[0].text
print(GOAL3)
GOAL4_A = data.getchildren()[2].getchildren()[0].getchildren()[2].getchildren()[0].getchildren()[0].text
print(GOAL4_A)
GOAL4_B = data.getchildren()[3].getchildren()[0].getchildren()[2].getchildren()[0].getchildren()[0].text
print(GOAL4_B)
and the output that I get is the following:
11111
22222
33333
44444
The output that I would like should be like this:
11111
22222
GIOVANNI
33333
ANDREA
44444
As you can see, I am able to read GOAL1 and GOAL2 easily, but I am looking for a nicer coding practice to access those values, since the current code seems too long and hard to read and understand.
The second thing I would like to do is get GOAL3 and GOAL4 in an automated way, so that I do not have to repeat similar lines of code and the result is more readable and understandable.
Note: as you can see, I was not able to read GOAL3. If possible, I would like to get both the GOAL3_NAME and the GOAL3_ID.
In order to make the .xml file structure more understandable I post an image of what it looks like:
The highlighted elements are what I am looking for.

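Here is a simple example that walks the tree from top to bottom with a recursive function; you can then collect the information you need from it. On Python 2, xml.etree.cElementTree was the faster C implementation (15-20x faster); on Python 3.3+ xml.etree.ElementTree uses the C accelerator automatically, and cElementTree was removed in Python 3.9.
import xml.etree.ElementTree as ET

tree = ET.parse('test.xml')
root = tree.getroot()

def get_tail(root):
    # Print the text of every descendant, depth first
    for child in root:
        print(child.text)
        get_tail(child)

get_tail(root)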

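import xml.etree.ElementTree as ET

data = ET.parse('test.xml')
for d in data.iter():
    if d.tag in ["GOAL1", "GOAL2", "stepCC"]:
        # The outer <stepCC> holds only whitespace, so skip empty text
        if d.text and d.text.strip():
            print(d.text)
    elif d.tag == "GOAL3":
        # GOAL3 carries its data in attributes, not in its text
        print(d.get("GOAL3_NAME"))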
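As an alternative to indexing children by position, here is a minimal sketch that uses ElementTree path expressions (findtext, iter and an attribute predicate). It assumes the structure shown in the question; the XML is a trimmed copy embedded as a string so that the snippet runs on its own, and the names XML and results are only for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's document (assumption: test.xml has this shape)
XML = """<Start>
  <step1 stepA="5" stepB="6" />
  <step2>
    <GOAL1>11111</GOAL1>
    <stepB>
      <stepBC><stepBCA><GOAL2>22222</GOAL2></stepBCA></stepBC>
    </stepB>
  </step2>
  <step3>
    <GOAL3 GOAL3_NAME="GIOVANNI" GOAL3_ID="GIO">
      <stepC>XXX</stepC>
      <stepC><stepCC><stepCC GOAL4="saf12">33333</stepCC></stepCC></stepC>
    </GOAL3>
  </step3>
  <step3>
    <GOAL3 GOAL3_NAME="ANDREA" GOAL3_ID="DRW">
      <stepC>YYY</stepC>
      <stepC><stepCC><stepCC GOAL4="fwe34">44444</stepCC></stepCC></stepC>
    </GOAL3>
  </step3>
</Start>"""

root = ET.fromstring(XML)

# A path from the root replaces the chain of positional lookups
results = [root.findtext("step2/GOAL1"), root.findtext(".//GOAL2")]

# One loop handles every <GOAL3>, however many <step3> blocks there are
for goal3 in root.iter("GOAL3"):
    results.append(goal3.get("GOAL3_NAME"))
    # [@GOAL4] selects only the inner <stepCC>, the one carrying the attribute
    results.append(goal3.findtext(".//stepCC[@GOAL4]"))

print("\n".join(results))
```

To read the real file, replace ET.fromstring(XML) with ET.parse('test.xml').getroot(). The GOAL3_ID is available the same way, through goal3.get("GOAL3_ID").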

Related

XMLTree Parsing and Printing

I'm starting to learn python3 and one of the things being discussed is XMLTree which I'm having a hard time grasping (most likely due to learning python concurrently)
What I am trying to do is output an easier to read version of my XML file.
The XML file (there is no limit to the number of child customers; I've included two as an example):
<?xml version="1.0" encoding="UTF-8"?>
<customers>
<customers>
<number area_code="800" exch_code="225" sub_code="5288" />
<address zip_code="90210" st_addr="9401 Sunset Blvd" />
<nameText>First Choice</nameText>
</customers>
<customers>
<number area_code="800" exch_code="867" sub_code="5309" />
<address zip_code="60652" st_addr="5 Lake Shore Drive" />
<nameText>Green Grass"</nameText>
</customers>
</customers>
From what I understand, the XML tree defines these lines as the following:
<root>
<child>
<element attribute...>
Where the first xml files 'customers' is the root, the second 'customers' is a child of 'customers', and 'number' (or address, or nameText) are elements.
With that being said, here is where I start to get confused.
If we take <number area_code="800" exch_code="225" sub_code="5288" />
This is an element with three attributes, area_code, exch_code, and sub_code but no text.
If we take <nameText>Green Grass"</nameText>
This is an element with no attributes, but does contain Text (Green Grass)
What I would like to see would be something like this:
First Choice
|--> Phone Number: 800-225-5288
|--> Address: 9401 Sunset Blvd, Zip Code: 90210
Green Grass
|--> Phone Number: 800-867-5309
|--> Address: 5 Lake Shore Drive, Zip Code: 60652
I don't really have any code to share, but here it is:
import xml.etree.ElementTree as ET
tree = ET.parse(my_files[0])
root = tree.getroot()
print(root.tag)
for child in root:
    print(child.tag, child.attrib)
Which provides the following output (line 1 being from print(root.tag) I believe)
customer
customer
{}
customer
{}
The questions I have after writing all this:
1 - Is my interpretation of the tree structure correct?
2 - How do you differentiate between attributes in ElementTree?
3 - How/what should I be considerate of in terms of the attributes, tags, and the rest of this file when trying to make the desired output? I might be overthinking how much more complex having XML in the mix is making this scenario so I am struggling to figure out how to do something similar to get the output I saw here: https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces but my xml lacks namespaces.
I'm still trying to learn, so any additional explanation is sincerely appreciated!
Resources that I've been trying to read through to understand all this:
https://docs.python.org/3/library/xml.etree.elementtree.html# (When I'm looking through this, I'm going off the assumption that when they are calling something an attribute, its not something unique to ElementTree but the same attribute as defined in the next link)
https://www.w3schools.com/xml/xml_tree.asp (however I havent seen anything yet about multiple attributes)
https://www.edureka.co/blog/python-xml-parser-tutorial/ (This page has been a great help breaking things down step by step so I have been able to follow along)
1 - Is my interpretation of the tree structure correct?
The ElementTree parser only knows about two entities: elements and
attributes. So when you say:
From what I understand, the XML tree defines these lines as the following:
<root>
<child>
<element attribute...>
I'm a little confused. Your XML document -- or any other XML document
-- is just an element that may have zero or more attributes and may
have zero or more children...and so forth all the way down.
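To make that "elements all the way down" picture concrete, here is a small sketch that walks a document recursively and lists each element with its attributes and text. It uses a shortened copy of the question's XML; the function name outline is made up for this example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's file, embedded so the snippet is self-contained
XML = """<customers>
  <customers>
    <number area_code="800" exch_code="225" sub_code="5288" />
    <address zip_code="90210" st_addr="9401 Sunset Blvd" />
    <nameText>First Choice</nameText>
  </customers>
</customers>"""

def outline(elem, depth=0):
    """Return one line per element: indentation, tag, attributes, stripped text."""
    text = (elem.text or "").strip()
    lines = ["{}{} {} {!r}".format("  " * depth, elem.tag, elem.attrib, text)]
    for child in elem:  # iterating an element yields its direct children
        lines.extend(outline(child, depth + 1))
    return lines

lines = outline(ET.fromstring(XML))
print("\n".join(lines))
```

Every line comes from the same kind of object: the root, the inner customers element and number are all elements, and they differ only in which attributes, text and children they happen to have.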
2 - How do you differentiate between attributes in ElementTree?
It's not clear what you mean by "differentiate" here; you can ask for
an attribute by name. For example, the following code prints out the
area_code attribute of all <number> elements:
>>> from xml.etree import ElementTree as ET
>>> doc = ET.parse(open('data.xml'))
>>> doc.findall('.//number')
[<Element number at 0x7fdb8981e640>, <Element number at 0x7fdb8981e680>]
>>> for x in doc.findall('.//number'):
...     print(x.get('area_code'))
...
800
800
If you'd like, you can get all of the attributes of an element as a Python
dictionary:
>>> number = doc.find('customers/number')
>>> attrs = dict(number.items())
>>> attrs
{'area_code': '800', 'exch_code': '225', 'sub_code': '5288'}
3 - How/what should I be considerate of in terms of the attributes, tags, and the rest of this file when trying to make the desired output?
That code seems to have mostly what you're looking for. As you say,
you're not using namespaces, so you don't need to qualify element
names with namespace names...that is, you can write number instead
of {some/name/space}number.
That gives us something like:
from xml.etree import ElementTree as ET

with open('data.xml') as fd:
    doc = ET.parse(fd)

for customer in doc.findall('customers'):
    name = customer.find('nameText')
    number = customer.find('number')
    address = customer.find('address')
    print(name.text)
    print('|--> Address: {}, Zip Code: {}'.format(
        address.get('st_addr'), address.get('zip_code')))
    print('|--> Phone number: {}-{}-{}'.format(
        number.get('area_code'), number.get('exch_code'), number.get('sub_code')))
Given your sample input, this produces:
First Choice
|--> Address: 9401 Sunset Blvd, Zip Code: 90210
|--> Phone number: 800-225-5288
Green Grass"
|--> Address: 5 Lake Shore Drive, Zip Code: 60652
|--> Phone number: 800-867-5309

How to copy attributes from a SVG to another?

I have two SVGs that were the same, but I changed some values inside, for instance:
SVG1
<desc
id="desc20622">[Visualization]
name=H131B1;
</desc>
and the other is:
SVG2
<desc
id="desc20622">[Visualization]
name=R131C2;
</desc>
Now I have relocated a lot of elements in one SVG and I would like to replicate these changes in the other SVG. What is the simplest way to read those SVGs, compare the ids, copy the values from SVG2 to SVG1 and save a new SVG file?
I'm familiar with a bunch of programming languages, but I was taking a look at Python to do this job using minidom or xml.etree.ElementTree.
Could someone help me with that? Thanks in advance.
I figured out how to do it on my own with Python.
import xml.etree.ElementTree as ET

tree1 = ET.parse('SVG1.svg')
root1 = tree1.getroot()
tree2 = ET.parse('SVG2.svg')
root2 = tree2.getroot()

for child1 in root1.iter('desc'):
    for child2 in root2.iter('desc'):
        # Same attributes (here: same id) means the same element
        if child1.attrib == child2.attrib:
            child1.text = child2.text
            break

tree1.write('output.svg')
You just have to parse both SVGs, iterate over every desc, compare the ids and copy the text!
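A variation on the same idea, sketched under one assumption that the question does not confirm: that the files declare the standard SVG default namespace (xmlns="http://www.w3.org/2000/svg"). In that case ElementTree reports the tag as {http://www.w3.org/2000/svg}desc, so the lookup has to include the namespace. The sketch also indexes SVG2 by id once instead of using a nested loop. The inline SVG1 and SVG2 strings stand in for the two files.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
# Keep the default namespace on output instead of generated ns0: prefixes
ET.register_namespace("", SVG_NS)

# Stand-ins for SVG1.svg and SVG2.svg (assumption: both use the SVG namespace)
SVG1 = ('<svg xmlns="http://www.w3.org/2000/svg">'
        '<desc id="desc20622">[Visualization]\nname=H131B1;\n</desc></svg>')
SVG2 = ('<svg xmlns="http://www.w3.org/2000/svg">'
        '<desc id="desc20622">[Visualization]\nname=R131C2;\n</desc></svg>')

root1 = ET.fromstring(SVG1)
root2 = ET.fromstring(SVG2)

desc_tag = "{%s}desc" % SVG_NS

# Index the <desc> text of SVG2 by id, then update SVG1 in a single pass
text_by_id = {d.get("id"): d.text for d in root2.iter(desc_tag)}
for d in root1.iter(desc_tag):
    if d.get("id") in text_by_id:
        d.text = text_by_id[d.get("id")]

merged = ET.tostring(root1, encoding="unicode")
print(merged)
```

With real files, parse them with ET.parse as in the code above and finish with tree1.write('output.svg').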

How do I extract specific data from xml using python?

I'm relatively new to Python. I've been trying to learn Python through a hands-on approach (I learnt C/C++ by doing Project Euler).
Right now I'm learning how to extract data from files. I've gotten the hang of extracting data from simple text files but I'm kinda stuck on xml files.
An example of what I was trying to do.
I have my call logs backed up on google drive and they're a lot (about 4000)
Here is the xml file example
<call number="+91234567890" duration="49" date="1483514046018" type="3" presentation="1" readable_date="04-Jan-2017 12:44:06 PM" contact_name="Dad" />
I want to take all the calls to my dad and display them like this
number = 234567890
duration = "49" date="04-Jan-2017 12:44:06 PM"
duration = "x" date="y"
duration = "n" date="z"
and so on like that.
How do you propose I do that?
It's advisable to provide sufficient information in a question so that problem can be recreated.
<?xml version="1.0" encoding="UTF-8"?>
<call number="+91234567890" duration="49" date="1483514046018" type="3"
presentation="1" readable_date="04-Jan-2017 12:44:06 PM"
contact_name="Dad" />
First we need to figure out which elements we can iterate over. Since <call ../> is the root element here, we iterate over that.
NOTE: if you have tags/elements before the line provided, you will need to work out the proper root element instead of call.
>>> [i for i in root.iter('call')]
[<Element 'call' at 0x29d3410>]
Here you can see that we can iterate over the call element.
Then we simply iterate over the elements and pick out the attribute keys and values as required.
Working Code
import xml.etree.ElementTree as ET

data_file = 'test.xml'
tree = ET.parse(data_file)
root = tree.getroot()

for i in root.iter('call'):
    # Each attribute of <call> is available through the attrib dictionary
    print('duration', "=", i.attrib['duration'])
    print('date', "=", i.attrib['date'])
Result
>>>
duration = 49
date = 1483514046018
>>>
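To get closer to the output asked for in the question (only the calls to "Dad", with the readable date), here is a minimal sketch using an attribute predicate. The full backup file was not shown, so the <calls> wrapper and the second entry are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Assumption: the backup wraps its <call> entries in a single root element.
# The <calls> wrapper and the second entry are invented for this sketch.
XML = """<calls>
  <call number="+91234567890" duration="49" date="1483514046018" type="3"
        presentation="1" readable_date="04-Jan-2017 12:44:06 PM" contact_name="Dad" />
  <call number="+910000000000" duration="7" date="1483600000000" type="1"
        presentation="1" readable_date="05-Jan-2017 12:26:40 PM" contact_name="Someone" />
</calls>"""

root = ET.fromstring(XML)

lines = []
# The [@attr='value'] predicate keeps only calls whose contact_name is "Dad"
for call in root.findall("call[@contact_name='Dad']"):
    lines.append('duration = "{}" date="{}"'.format(
        call.get("duration"), call.get("readable_date")))

print("\n".join(lines))
```

If the <call> elements sit deeper in the real file, ".//call[@contact_name='Dad']" searches the whole tree instead of only the root's direct children.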

Nested XML tags in Python

I have a nested XML that looks like this:
<data>foo <data1>hello</data1> bar</data>
I am using minidom, but no matter how I try to get the values between the "data" tags, I only get "foo" and not "bar".
It is even worse if the XML is like this:
<data><data1>hello</data1> bar</data>
I only get "None", which is correct according to the logic above. So I came across this: http://levdev.wordpress.com/2011/07/29/get-xml-element-value-in-python-using-minidom and concluded that it is due to a limitation of minidom?
So I used the method in that blog and I now get
foo <data1>hello</data1> bar
and
<data1>hello</data1> bar
which is acceptable. However, if I try to create a new node (createTextNode) using the output above as the node value, the markup is escaped and the XML becomes:
<data>foo &lt;data1&gt;hello&lt;/data1&gt; bar</data>
and
<data>&lt;data1&gt;hello&lt;/data1&gt; bar</data>
Is there any way that I can create it so that it looks like the original? Thank you.
You can use ElementTree for XML; it is efficient for both retrieving and creating nodes.
Have a look at the ElementTree tutorials, in particular the material on mixed-content XML.
Some examples of creating nodes:
import xml.etree.ElementTree as ET
data = ET.Element('data')
data1= ET.SubElement(data, 'data1',attr="value")
data1.text="hello"
data.text="bar"
data1.tail="some code"
ET.dump(data)
output :<data>bar<data1 attr="value">hello</data1>some code</data>
First of all, use the following function to prettify your XML so that it is a lot easier to read:
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET

def prettify(elem):
    """Return a pretty-printed XML string for the Element. Props goes
    to Maxime from stackoverflow for this code."""
    rough_string = ET.tostring(elem, 'utf-8')
    reparsed = minidom.parseString(rough_string)
    return reparsed.toprettyxml(indent="\t")
That makes stepping through the tree visually a lot simpler.
Next I would suggest a modification in your xml that will make your life a whole lot easier i think.
Instead of :
<data>foo
<data1>hello</data1>
bar
</data>
which mixes text and child elements (valid XML, but awkward to work with), I would save your 'foo' and 'bar' as attributes of <data>, so that
it looks like this:
<data var1='foo' var2='bar'>
<data1>hello</data1>
</data>
to do this using xml.etree.ElementTree:
import xml.etree.ElementTree as ET

# 'foo' and 'bar' become attributes instead of loose text
data = ET.Element('data', {'var1': 'foo', 'var2': 'bar'})
data1 = ET.SubElement(data, 'data1')
data1.text = 'hello'
print(prettify(data))
As @pandubear pointed out, the XML:
<data>foo <data1>hello</data1> bar</data>
does have two text nodes, containing "foo " and " bar", so what can be done is to iterate through all the child nodes of data and get their values.
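A minimal minidom sketch of that idea: it reads the text nodes by iterating over childNodes, then rebuilds the same mixed content by appending separate text and element nodes rather than passing markup to a single createTextNode call, which treats its argument as plain text. Variable names such as texts, inner and rebuilt are only for illustration.

```python
from xml.dom import minidom

doc = minidom.parseString("<data>foo <data1>hello</data1> bar</data>")
data = doc.documentElement

# Reading: <data> has three children, a text node, an element, a text node
texts = [n.data for n in data.childNodes if n.nodeType == n.TEXT_NODE]
inner = "".join(n.toxml() for n in data.childNodes)

# Writing: build each piece as its own node so nothing gets escaped
new_doc = minidom.Document()
new_data = new_doc.createElement("data")
new_data.appendChild(new_doc.createTextNode("foo "))
child = new_doc.createElement("data1")
child.appendChild(new_doc.createTextNode("hello"))
new_data.appendChild(child)
new_data.appendChild(new_doc.createTextNode(" bar"))
rebuilt = new_data.toxml()

print(texts)
print(inner)
print(rebuilt)
```

The second form, <data><data1>hello</data1> bar</data>, works the same way; it simply has no leading text node.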

Extracting Specific Lines of XML with Python ElementTree

I am a bit stuck on a project I am doing which uses Python -which I am very new to. I have been told to use ElementTree and get specified data out of an incoming XML file. It sounds simple but I am not great at programming. Below is a (very!) tiny example of an incoming file along with the code I am trying to use.
I would like any tips or places to go next with this. I have tried searching and following what other people have done but I can't seem to get the same results. My aim is to get the information contained in the "Active", "Room" and "Direction" but later on I will need to get much more information.
I have tried using XPaths but it does not work too well, especially with the namespaces the XML uses and the fact that an XPath for everything I need would become too large. I have simplified the example so I can understand the principle, as after this it must be extended to gain more information from an "AssetEquipment" and from multiple instances of them. The end goal would be all information from one equipment being saved to a dictionary so I can manipulate it later, with each new equipment in its own separate dictionary.
Example XML:
<AssetData>
<Equipment>
<AssetEquipment ID="3" name="PC960">
<Active>Yes</Active>
<Location>
<RoomLocation>
<Room>23</Room>
<Area>
<X-Area>-1</X-Area>
<Y-Area>2.4</Y-Area>
</Area>
</RoomLocation>
</Location>
<Direction>Positive</Direction>
<AssetSupport>12</AssetSupport>
</AssetEquipment>
</Equipment>
</AssetData>
Example Code:
import re
import xml.etree.ElementTree as ET

tree = ET.parse(r'C:\Temp\Example.xml')
root = tree.getroot()
ns = "{http://namespace.co.uk}"
for equipment in root.findall(ns + "Equipment//"):
    tagname = re.sub(r'\{.*?\}','',equipment.tag)
    name = equipment.get('name')
    if tagname == 'AssetEquipment':
        print("\tName: " + repr(name))
        for attributes in root.findall(ns + "Equipment/" + ns + "AssetEquipment//"):
            attname = re.sub(r'\{.*?\}','',attributes.tag)
            if tagname == 'Room': #This does not work but I need it to be found while
                                  #in this instance of "AssetEquipment" so it does not
                                  #call information from another asset instead.
                room = equipment.text
                print("\t\tRoom:", repr(room))
import xml.etree.ElementTree as ET

NS = '{http://www.namespace.co.uk}'  # must match the namespace declared in the real file
tree = ET.parse('test.xml')
for elem in tree.iter():
    if elem.tag == NS + 'AssetEquipment':
        output = {}  # one dictionary per AssetEquipment
        for elem1 in elem:
            if elem1.tag == NS + 'Active':
                output['Active'] = elem1.text
            if elem1.tag == NS + 'Direction':
                output['Direction'] = elem1.text
            if elem1.tag == NS + 'Location':
                for elem2 in elem1:
                    if elem2.tag == NS + 'RoomLocation':
                        for elem3 in elem2:
                            if elem3.tag == NS + 'Room':
                                output['Room'] = elem3.text
        print(output)
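The nested loops can probably be shortened with ElementTree's namespace mapping. This is a sketch, not a drop-in replacement: the question's sample omits the namespace declaration, so the xmlns value below is taken from the ns string in the question's code and may differ in the real file. The prefix "a" and the name equipment are arbitrary choices for this example.

```python
import xml.etree.ElementTree as ET

# Assumption: the real file declares this default namespace on its root element
XML = """<AssetData xmlns="http://namespace.co.uk">
  <Equipment>
    <AssetEquipment ID="3" name="PC960">
      <Active>Yes</Active>
      <Location><RoomLocation><Room>23</Room></RoomLocation></Location>
      <Direction>Positive</Direction>
    </AssetEquipment>
  </Equipment>
</AssetData>"""

# Map a short prefix to the namespace URI so paths stay readable
NS = {"a": "http://namespace.co.uk"}
root = ET.fromstring(XML)

equipment = []
for asset in root.findall("a:Equipment/a:AssetEquipment", NS):
    # Searches start from this asset, so values never leak in from another one
    equipment.append({
        "name": asset.get("name"),
        "Active": asset.findtext("a:Active", namespaces=NS),
        "Room": asset.findtext(".//a:Room", namespaces=NS),
        "Direction": asset.findtext("a:Direction", namespaces=NS),
    })

print(equipment)
```

Each AssetEquipment ends up in its own dictionary, and adding another field is one more findtext line.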
