Parsing nested XML with ElementTree - python

I have the following XML format, and I want to pull out the values for name, region, and status using python's xml.etree.ElementTree module.
However, my attempt to get this information has been unsuccessful so far.
<feed>
<entry>
<id>uuid:asdfadsfasdf123123</id>
<title type="text"></title>
<content type="application/xml">
<NamespaceDescription xmlns="http://schemas.microsoft.com/netservices/2010/10/servicebus/connect" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Name>instancename</Name>
<Region>US</Region>
<Status>Active</Status>
</NamespaceDescription>
</content>
</entry>
<entry>
<id>uuid:asdfadsfasdf234234</id>
<title type="text"></title>
<content type="application/xml">
<NamespaceDescription xmlns="http://schemas.microsoft.com/netservices/2010/10/servicebus/connect" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Name>instancename2</Name>
<Region>US2</Region>
<Status>Active</Status>
</NamespaceDescription>
</content>
</entry>
</feed>
My code attempt:
NAMESPACE = '{http://www.w3.org/2005/Atom}'
root = et.fromstring(XML_STRING)
entry_root = root.findall('{0}entry'.format(NAMESPACE))
for child in entry_root:
content_node = child.find('{0}content'.format(NAMESPACE))
for content in content_node:
for desc in content.iter():
print desc.tag
name = desc.find('{0}Name'.format(NAMESPACE))
print name
desc.tag is giving me the nodes I want to access, but name is returning None. Any ideas what's wrong with my code?
Output of desc.tag:
{http://schemas.microsoft.com/netservices/2010/10/servicebus/connect}Name
{http://schemas.microsoft.com/netservices/2010/10/servicebus/connect}Region
{http://schemas.microsoft.com/netservices/2010/10/servicebus/connect}Status

I don't know why I didn't see this before. But, I was able to get the values.
root = et.fromstring(XML_STRING)
entry_root = root.findall('{0}entry'.format(NAMESPACE))
for child in entry_root:
content_node = child.find('{0}content'.format(NAMESPACE))
for descr in content_node:
name_node = descr.find('{0}Name'.format(NAMESPACE))
print name_node.text

You can use lxml.etree along with default namespace mapping to parse the XML as follows:
content = '''
<feed>
<entry>
<id>uuid:asdfadsfasdf123123</id>
<title type="text"></title>
<content type="application/xml">
<NamespaceDescription xmlns="http://schemas.microsoft.com/netservices/2010/10/servicebus/connect" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Name>instancename</Name>
<Region>US</Region>
<Status>Active</Status>
</NamespaceDescription>
</content>
</entry>
<entry>
<id>uuid:asdfadsfasdf234234</id>
<title type="text"></title>
<content type="application/xml">
<NamespaceDescription xmlns="http://schemas.microsoft.com/netservices/2010/10/servicebus/connect" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Name>instancename2</Name>
<Region>US2</Region>
<Status>Active</Status>
</NamespaceDescription>
</content>
</entry>
</feed>'''
from lxml import etree
tree = etree.XML(content)
ns = {'default': 'http://schemas.microsoft.com/netservices/2010/10/servicebus/connect'}
names = tree.xpath('//default:Name/text()', namespaces=ns)
regions = tree.xpath('//default:Region/text()', namespaces=ns)
statuses = tree.xpath('//default:Status/text()', namespaces=ns)
print(names)
print(regions)
print(statuses)
Output
['instancename', 'instancename2']
['US', 'US2']
['Active', 'Active']
This XPath/namespace functionality can be adapted to output the data in any format you require.

Related

Export information from child nodes in xml using Python

I have an xml file called persons.xml in the following format:
<?xml version="1.0" encoding="UTF-8"?>
<persons>
<person id="1" name="John">
<city id="21" name="New York"/>
</person>
<person id="2" name="Mary">
<city id="22" name="Los Angeles"/>
</person>
</persons>
I want to export to a file the list of person names along with the city names
import pandas as pd
import xml.etree.ElementTree as ET
tree = ET.parse('./persons.xml')
root = tree.getroot()
df_cols = ["person_name", "city_name"]
rows = []
for node in root:
person_name = node.attrib.get("name")
rows.append({"person_name": person_name})
out_df = pd.DataFrame(rows, columns = df_cols)
out_df
Obviously this part of the code will only work for obtaining the name as it’s part of the root, but I can’t figure out how to loop through the child nodes too and obtain this info. Do I need to append something to root to iterate over the child nodes?
I can obtain everything using root.getchildren but it doesn’t allow me to return only the child nodes:
children = root.getchildren()
for child in children:
ElementTree.dump(child)
Is there a good way to get this information?
See below
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<persons>
<person id="1" name="John">
<city id="21" name="New York" />
</person>
<person id="2" name="Mary">
<city id="22" name="Los Angeles" />
</person>
</persons>'''
root = ET.fromstring(xml)
data = []
for p in root.findall('.//person'):
data.append({'parson': p.attrib['name'], 'city': p.find('city').attrib['name']})
df = pd.DataFrame(data)
print(df)
output
parson city
0 John New York
1 Mary Los Angeles

Parse XML with childs that have different tags in Python

I am trying to parse following xml data from a file with python for print only the elements with tag "zip-code" with his attribute name
<response status="success" code="19"><result total-count="1" count="1">
<address>
<entry name="studio">
<zip-code>14407</zip-code>
<description>Nothing</description>
</entry>
<entry name="mailbox">
<zip-code>33896</zip-code>
<description>Nothing</description>
</entry>
<entry name="garage">
<zip-code>33746</zip-code>
<description>Tony garage</description>
</entry>
<entry name="playstore">
<url>playstation.com</url>
<description>game download</description>
</entry>
<entry name="gym">
<zip-code>33746</zip-code>
<description>Getronics NOC subnet 2</description>
</entry>
<entry name="e-cigars">
<url>vape.com/24</url>
<description>vape juices</description>
</entry>
</address>
</result></response>
The python code that I am trying to run is
from xml.etree import ElementTree as ET
tree = ET.parse('file.xml')
root = tree.getroot()
items = root.iter('entry')
for item in items:
zip = item.find('zip-code').text
names = (item.attrib)
print(' {} {} '.format(
names, zip
))
However it fails once it gets to the items without "zip-code" tag.
How I could make this run?
Thanks in advance
As #AmitaiIrron suggested, xpath can help here.
This code searches the document for element named zip-code, and pings back to get the parent of that element. From there, you can get the name attribute, and pair with the text from zip-code element
for ent in root.findall(".//zip-code/.."):
print(ent.attrib.get('name'), ent.find('zip-code').text)
studio 14407
mailbox 33896
garage 33746
gym 33746
OR
{ent.attrib.get('name') : ent.find('zip-code').text
for ent in root.findall(".//zip-code/..")}
{'studio': '14407', 'mailbox': '33896', 'garage': '33746', 'gym': '33746'}
Your loop should look like this:
# Find all <entry> tags in the hierarchy
for item in root.findall('.//entry'):
# Try finding a <zip-code> child
zipc = item.find('./zip-code')
# If found a child, print data for it
if zipc is not None:
names = (item.attrib)
print(' {} {} '.format(
names, zipc.text
))
It's all a matter of learning to use xpath properly when searching through the XML tree.
If you have no problem using regular expressions, the following works just fine:
import re
file = open('file.xml', 'r').read()
pattern = r'name="(.*?)".*?<zip-code>(.*?)<\/zip-code>'
matches = re.findall(pattern, file, re.S)
for m in matches:
print("{} {}".format(m[0], m[1]))
and produces the result:
studio 14407
mailbox 33896
garage 33746
aystore 33746

Missing wrapper in XML

I'm trying to pull specific data from one big XML to another. My Main XML File looks like below
<MAIN>
<transaction>
<date>20190415</date>
<ticket>1</ticket>
<value>15</value>
</transaction>
<transaction>
<date>20190415</date>
<ticket>2</ticket>
<value>15</value>
</transaction>
<transaction>
<date>20190415</date>
<ticket>3</ticket>
<value>15</value>
</transaction>
<transaction>
<date>20190415</date>
<ticket>4</ticket>
<value>15</value>
</transaction>
<transaction>
<date>20190415</date>
<ticket>5</ticket>
<value>15</value>
</transaction>
</MAIN>
I'm only pulling the <ticket> values & Appending it to a Fresh/New xml file.
Below is my code
import pandas as pd
import xml.etree.ElementTree as ET
from lxml import etree
path_source = 'source\path'
path_dest = 'dest\path'
tree = ET.parse(path_source)
root = tree.getroot()
L_roots = []
for trx in root.iter('transaction'):
ticket = trx.find('ticket').text
root_T = ET.Element('MAIN')
doc = ET.SubElement(root_T, 'Transaction')
ET.SubElement(doc, 'ticket').text = ticket
L_roots.append(doc)
with open(path_dest,'wb') as f:
for i in L_roots:
ET.Element('MAIN')
f.write(ET.tostring(i, method="xml"))
what i get is a plain text file without the outer <MAIN> tags. like below
<Transaction>
<ticket>1</ticket>
</Transaction>
<Transaction>
<ticket>2</ticket>
</Transaction>
<Transaction>
<ticket>3</ticket>
</Transaction>
<Transaction>
<ticket>4</ticket>
</Transaction>
<Transaction>
<ticket>5</ticket>
</Transaction>
What is missing here is the wrapper <MAIN> tags. what should be changed in my code to achieve this?
Replace this:
with open(path_dest,'wb') as f:
for i in L_roots:
ET.Element('MAIN')
f.write(ET.tostring(i, method="xml"))
with this:
outroot = ET.Element('MAIN')
outroot.extend(L_roots)
with open(path_dest,'wb') as f:
f.write(ET.tostring(outroot, method="xml"))
The error in your snippet is that you never save the new ET.Element('MAIN') to a variable, so that is lost. When using f.write you are simply writing the elements in L_roots, which have the Transaction tag.
In the snippet I propose, all the L_roots elements are inserted into another MAIN element, and then the main element is written (all its subelements are automatically written).

XML Parse attributes using namespace

I'm using the following XML:
<feed xmlns:im="http://itunes.apple.com/rss" xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
<id>
https://itunes.apple.com/IN/rss/topfreeapplications/limit=200/xml
</id>
<title>iTunes Store: Top Free Apps</title>
<updated>2016-12-05T12:37:06-07:00</updated>
<link rel="alternate" type="text/html" href="https://itunes.apple.com/WebObjects/MZStore.woa/wa/viewTop?cc=in&id=134581&popId=27"/>
<link rel="self" href="https://itunes.apple.com/IN/rss/topfreeapplications/limit=200/xml"/>
<icon>http://itunes.apple.com/favicon.ico</icon>
<author>
<name>iTunes Store</name>
<uri>http://www.apple.com/uk/itunes/</uri>
</author>
<rights>Copyright 2008 Apple Inc.</rights>
<entry>
<updated>2016-12-05T12:37:06-07:00</updated>
<id im:id="473941634" im:bundleId="com.one97.paytm">https://itunes.apple.com/in/app/recharge-bill-payment-wallet/id473941634?mt=8&uo=2</id>
<title>Recharge, Bill Payment & Wallet - Paytm Mobile Solutions</title>
<summary></summary>
<im:name>Recharge, Bill Payment & Wallet</im:name>
<link rel="alternate" type="text/html" href="https://itunes.apple.com/in/app/recharge-bill-payment-wallet/id473941634?mt=8&uo=2"/>
<im:contentType term="Application" label="Application"/>
<category im:id="6024" term="Shopping" scheme="https://itunes.apple.com/in/genre/ios-shopping/id6024?mt=8&uo=2" label="Shopping"/>
<im:artist href="https://itunes.apple.com/in/developer/paytm-mobile-solutions/id473941637?mt=8&uo=2">Paytm Mobile Solutions</im:artist>
<im:price amount="0.00000" currency="INR">Get</im:price>
<im:image height="53">http://is1.mzstatic.com/image/thumb/Purple71/v4/9b/37/bf/9b37bf75-6b4d-9c95-a8a4-ea369f05ae7e/pr_source.png/53x53bb-85.png</im:image>
<im:image height="75">http://is5.mzstatic.com/image/thumb/Purple71/v4/9b/37/bf/9b37bf75-6b4d-9c95-a8a4-ea369f05ae7e/pr_source.png/75x75bb-85.png</im:image>
<im:image height="100">http://is5.mzstatic.com/image/thumb/Purple71/v4/9b/37/bf/9b37bf75-6b4d-9c95-a8a4-ea369f05ae7e/pr_source.png/100x100bb-85.png</im:image>
<rights>© One97 Communications Ltd</rights>
<im:releaseDate label="24 October 2011">2011-10-24T16:18:48-07:00</im:releaseDate>
<content type="html"></content>
</entry>
</feed>
I would like to extract the id information for each entry value:
the attribute is as follows: "im:id"
from xml.dom import minidom
xmldoc = minidom.parse('topIN.xml')
itemlist = xmldoc.getElementsByTagName('link')
print(len(itemlist))
print(itemlist[0].attributes.keys())
I get information:
1
[u'href', u'type', u'rel']
But when I do the same of id, nothing returns.
Here is a version using xml.etree.ElementTree:
import xml.etree.ElementTree as ET
tree = ET.parse('topIN.xml')
root = tree.getroot()
ns={'im':"http://itunes.apple.com/rss", 'atom':"http://www.w3.org/2005/Atom"}
for id_ in root.findall('atom:entry/atom:id', ns):
print (id_.attrib['{' + ns['im'] + '}id'])
Here is a version using lxml:
from lxml import etree
root=etree.parse('topIN.xml')
ns={'im':"http://itunes.apple.com/rss", 'atom':"http://www.w3.org/2005/Atom"}
print('\n'.join(root.xpath('atom:entry/atom:id/#im:id', namespaces=ns)))
This worked:
from xml.dom import minidom
xmldoc = minidom.parse('topIN.xml')
itemlist = xmldoc.getElementsByTagName('entry')
print(len(itemlist))
for s in itemlist:
print s.getElementsByTagName('id')[0].attributes['im:id'].value

Appending to attribute values of an xml using python

Example xml:
<response version-api="2.0">
<value>
<books>
<book available="20" id="1" tags="">
<title></title>
<author id="1" tags="Joel">Manuel De Cervantes</author>
</book>
<book available="14" id="2" tags="Jane">
<title>Catcher in the Rye</title>
<author id="2" tags="">JD Salinger</author>
</book>
<book available="13" id="3" tags="">
<title></title>
<author id="3">Lewis Carroll</author>
</book>
<book available="5" id="4" tags="Harry">
<title>Don</title>
<author id="4">Manuel De Cervantes</author>
</book>
</books>
</value>
</response>
I want to append a string value of my choosing to all attributes called "tags". This is whether the "tags" attribute has a value or not and also the attributes are at different levels of the xml structure. I have tried the method findall() but I keep on getting an error "IndexError: list index out of range." This is the code I have so far which is a little short but I have run out of steam for what else I need to type...
splitter = etree.XMLParser(strip_cdata=False)
xmldoc = etree.parse(os.path.join(root, xml_file), splitter ).getroot()
for child in xmldoc:
if child.tag != 'response':
allDescendants = list(etree.findall())
for child in allDescendants:
if hasattr(child, 'tags'):
child.attribute["tags"].value = "someString"
findall() is the right API to use. Here is an example:
from lxml import etree
import os
splitter = etree.XMLParser(strip_cdata=False)
xml_file = 'foo.xml'
root = '.'
xmldoc = etree.parse(os.path.join(root, xml_file), splitter ).getroot()
for element in xmldoc.findall(".//*[#tags]"):
element.attrib["tags"] += " KILROY!"
print etree.tostring(xmldoc)

Categories

Resources