Example xml:
<response version-api="2.0">
<value>
<books>
<book available="20" id="1" tags="">
<title></title>
<author id="1" tags="Joel">Manuel De Cervantes</author>
</book>
<book available="14" id="2" tags="Jane">
<title>Catcher in the Rye</title>
<author id="2" tags="">JD Salinger</author>
</book>
<book available="13" id="3" tags="">
<title></title>
<author id="3">Lewis Carroll</author>
</book>
<book available="5" id="4" tags="Harry">
<title>Don</title>
<author id="4">Manuel De Cervantes</author>
</book>
</books>
</value>
</response>
I want to append a string of my choosing to every attribute called "tags", whether or not the attribute already has a value. The attributes also sit at different levels of the XML structure. I have tried findall(), but I keep getting the error "IndexError: list index out of range". This is the code I have so far; it is short, but I have run out of ideas for what else to try:
splitter = etree.XMLParser(strip_cdata=False)
xmldoc = etree.parse(os.path.join(root, xml_file), splitter ).getroot()
for child in xmldoc:
    if child.tag != 'response':
        allDescendants = list(etree.findall())
        for child in allDescendants:
            if hasattr(child, 'tags'):
                child.attribute["tags"].value = "someString"
findall() is the right API to use. Here is an example:
from lxml import etree
import os
splitter = etree.XMLParser(strip_cdata=False)
xml_file = 'foo.xml'
root = '.'
xmldoc = etree.parse(os.path.join(root, xml_file), splitter ).getroot()
# ".//*[@tags]" matches every descendant element that has a "tags" attribute
for element in xmldoc.findall(".//*[@tags]"):
    element.attrib["tags"] += " KILROY!"
print(etree.tostring(xmldoc).decode())
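For comparison, here is a minimal sketch of the same idea using only the standard library's xml.etree.ElementTree, which also accepts the [@tags] predicate. The shortened sample document, the helper name append_to_tags and the suffix are made up for illustration. Note that ".//*" matches descendants only, so a "tags" attribute on the root element itself would need separate handling.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the sample document
SAMPLE = """<response version-api="2.0">
  <value>
    <books>
      <book available="20" id="1" tags="">
        <title></title>
        <author id="1" tags="Joel">Manuel De Cervantes</author>
      </book>
      <book available="13" id="3" tags="">
        <title></title>
        <author id="3">Lewis Carroll</author>
      </book>
    </books>
  </value>
</response>"""


def append_to_tags(root, suffix):
    """Append suffix to every descendant's "tags" attribute; return the count."""
    changed = 0
    for element in root.findall(".//*[@tags]"):
        element.set("tags", element.get("tags") + suffix)
        changed += 1
    return changed


root = ET.fromstring(SAMPLE)
count = append_to_tags(root, " KILROY!")
print(count)
print(ET.tostring(root, encoding="unicode"))
```

Elements without a "tags" attribute (the second author here) are left untouched, because the predicate only selects elements where the attribute is present, even if its value is empty.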
I have an xml file called persons.xml in the following format:
<?xml version="1.0" encoding="UTF-8"?>
<persons>
<person id="1" name="John">
<city id="21" name="New York"/>
</person>
<person id="2" name="Mary">
<city id="22" name="Los Angeles"/>
</person>
</persons>
I want to export the list of person names, along with the city names, to a file:
import pandas as pd
import xml.etree.ElementTree as ET
tree = ET.parse('./persons.xml')
root = tree.getroot()
df_cols = ["person_name", "city_name"]
rows = []
for node in root:
    person_name = node.attrib.get("name")
    rows.append({"person_name": person_name})
out_df = pd.DataFrame(rows, columns = df_cols)
out_df
This code obviously only retrieves the person name, since name is an attribute of the direct children of the root, but I can't figure out how to loop through the child nodes as well to get the city name. Do I need to append something to root to iterate over the child nodes?
I can obtain everything using root.getchildren(), but it doesn't let me return only the child nodes:
children = root.getchildren()
for child in children:
    ET.dump(child)
Is there a good way to get this information?
See below
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<persons>
<person id="1" name="John">
<city id="21" name="New York" />
</person>
<person id="2" name="Mary">
<city id="22" name="Los Angeles" />
</person>
</persons>'''
root = ET.fromstring(xml)
data = []
for p in root.findall('.//person'):
    data.append({'person': p.attrib['name'], 'city': p.find('city').attrib['name']})
df = pd.DataFrame(data)
print(df)
output
   person         city
0    John     New York
1    Mary  Los Angeles
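Since the question asks for an export to a file, here is a minimal sketch that writes the same rows as CSV with the standard library only, and that tolerates a person without a city. The third person, the helper name person_rows and the file name mentioned in the comment are assumptions added for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Sample from the question plus one hypothetical person without a <city>
XML = """<persons>
  <person id="1" name="John">
    <city id="21" name="New York"/>
  </person>
  <person id="2" name="Mary">
    <city id="22" name="Los Angeles"/>
  </person>
  <person id="3" name="Ann"/>
</persons>"""


def person_rows(root):
    """Return one dict per <person>; city_name is None when <city> is absent."""
    rows = []
    for person in root.findall("person"):
        city = person.find("city")
        rows.append({
            "person_name": person.get("name"),
            "city_name": city.get("name") if city is not None else None,
        })
    return rows


rows = person_rows(ET.fromstring(XML))

# Written to an in-memory buffer here; open("persons.csv", "w", newline="")
# would write a real file instead (the file name is a placeholder).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["person_name", "city_name"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The explicit "is not None" check matters because find() returns None for a missing child, and calling .attrib on None would raise an AttributeError.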
I am using Python's ElementTree library to parse an XML file with the following structure. I am trying to get the XML string for the entity with id = 192 together with all its parents (folders), but without the other entities:
<catalog>
<folder name="entities">
<entity id="102">
</entity>
<folder name="newEntities">
<entity id="192">
</entity>
<entity id="2982">
</entity>
</folder>
</folder>
</catalog>
The required result should be
<catalog>
<folder name="entities">
<folder name="newEntities">
<entity id="192">
</entity>
</folder>
</folder>
</catalog>
Assuming the first XML string is stored in a variable called xml_string:
tree = ET.fromstring(xml_string)
id = 192
required_element = tree.find(".//entity[@id='" + str(id) + "']")
This gets the XML element for the required entity but not the parent folders. Is there a quick fix for this?
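The challenge here is that ElementTree elements keep no reference to their parent. One way around this is to build a parent_map (a dict from each child element to its parent) and walk it upwards from the target element: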
import copy
import xml.etree.ElementTree as ET
import xml.dom.minidom as minidom
xml = '''<catalog>
<folder name="entities">
<entity id="102">
</entity>
<folder name="newEntities">
<entity id="192">
</entity>
<entity id="2982">
</entity>
</folder>
</folder>
</catalog>'''
def prettify(elem):
    """Return a pretty-printed XML string for the Element."""
    rough_string = ET.tostring(elem, 'utf-8')
    reparsed = minidom.parseString(rough_string)
    return reparsed.toprettyxml(indent="\t")
root = ET.fromstring(xml)
# Map every child element to its parent
parent_map = {c: p for p in root.iter() for c in p}
_id = 192
required_element = root.find(".//entity[@id='" + str(_id) + "']")
# Collect copies of the target and each of its ancestors, innermost first
_path = [copy.deepcopy(required_element)]
while True:
    parent = parent_map.get(required_element)
    if parent is not None:
        _path.append(copy.deepcopy(parent))
        required_element = parent
    else:
        break
# Empty each ancestor copy and re-attach only the next element on the path.
# Note: clear() also removes the ancestor's attributes.
idx = len(_path) - 1
while idx >= 1:
    _path[idx].clear()
    _path[idx].append(_path[idx-1])
    idx -= 1
print(prettify(_path[-1]))
output
<?xml version="1.0" ?>
<catalog>
<folder>
<folder>
<entity id="192">
</entity>
</folder>
</folder>
</catalog>
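The output above has lost the name attributes of the folders, because Element.clear() resets attributes as well as children. A minimal sketch of a variant that keeps them is shown here: instead of copying and clearing each ancestor, it builds a fresh wrapper element with the ancestor's tag and attributes. The helper name extract_with_ancestors is made up for illustration.

```python
import copy
import xml.etree.ElementTree as ET

XML = """<catalog>
  <folder name="entities">
    <entity id="102"></entity>
    <folder name="newEntities">
      <entity id="192"></entity>
      <entity id="2982"></entity>
    </folder>
  </folder>
</catalog>"""


def extract_with_ancestors(root, target):
    """Return a new tree: a copy of target wrapped in attribute-only copies of its ancestors."""
    parent_map = {child: parent for parent in root.iter() for child in parent}
    node = copy.deepcopy(target)
    node.tail = None  # drop the whitespace that followed the element in the source
    current = target
    while current in parent_map:
        parent = parent_map[current]
        # New element with the same tag and attributes, but no other children
        wrapper = ET.Element(parent.tag, dict(parent.attrib))
        wrapper.append(node)
        node = wrapper
        current = parent
    return node


root = ET.fromstring(XML)
target = root.find(".//entity[@id='192']")
result = extract_with_ancestors(root, target)
print(ET.tostring(result, encoding="unicode"))
```

The original tree is left unchanged, so the same root can be reused to extract other entities.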
My sample XML:
<RecordContainer RecordNumber = "1">
<catalog>
<book id="bk101">
<person>
<author>Gambardella, Matthew</author>
<personal_info>
<age>40</age>
</personal_info>
</person>
<title>XML Developer's Guide</title>
<description>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
</description>
<details>
<info>this is the guide to XML</info>
</details>
</book>
</catalog>
</RecordContainer>
<RecordContainer RecordNumber = "2">
<catalog>
<book id="bk102">
<person>
<author>Ralls, Kim</author>
</person>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<description>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
</description>
</book>
</catalog>
</RecordContainer>
Please note that the XML above has nested child tags, and some of the nested tags are missing in some of the containers.
My expected output is a pandas DataFrame with a column for every tag, filled with null wherever a tag's text is missing.
Code to parse the data:
import xml.etree.ElementTree as ET
import pandas as pd
root = ET.fromstring("<root>"+ sample_data + "</root>")
records = []
containers = root.findall('.//RecordContainer')
for container in containers:
    entry = container.attrib
    book = container.find('.//catalog/book')
    entry.update(book.attrib)
    for child in list(book):
        entry[child.tag] = child.text
    records.append(entry)
df = pd.DataFrame(records)
The code above returns null for missing tags, and the values are not aligned with the column names.
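One possible approach, sketched here under some assumptions: the loop above only visits the direct children of book, so nested tags such as author or price come back as whitespace or None. Walking book.iter() and keeping only leaf elements (those without children) collects the nested text instead. The sample data is shortened, the helper name flatten_records is hypothetical, and the sketch assumes leaf tag names are unique within a book (duplicates would overwrite each other).

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the sample data
sample_data = """
<RecordContainer RecordNumber="1">
  <catalog>
    <book id="bk101">
      <person>
        <author>Gambardella, Matthew</author>
        <personal_info><age>40</age></personal_info>
      </person>
      <title>XML Developer's Guide</title>
      <description><price>44.95</price></description>
    </book>
  </catalog>
</RecordContainer>
<RecordContainer RecordNumber="2">
  <catalog>
    <book id="bk102">
      <person><author>Ralls, Kim</author></person>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <description><price>5.95</price></description>
    </book>
  </catalog>
</RecordContainer>
"""


def flatten_records(xml_fragment):
    """Return (records, columns): one dict per container, keyed by leaf tag."""
    root = ET.fromstring("<root>" + xml_fragment + "</root>")
    records, columns = [], []
    for container in root.findall("RecordContainer"):
        entry = dict(container.attrib)  # copy, so the parsed tree is not mutated
        book = container.find("catalog/book")
        entry.update(book.attrib)
        for elem in book.iter():
            if len(elem) == 0:  # leaf element: no children, only text
                entry[elem.tag] = elem.text
        records.append(entry)
        for key in entry:
            if key not in columns:
                columns.append(key)
    # Give every record the same keys; tags a container lacks become None
    return [{c: r.get(c) for c in columns} for r in records], columns


records, columns = flatten_records(sample_data)
for record in records:
    print(record)
```

Passing the result to pd.DataFrame(records, columns=columns) should then give one aligned column per tag, with missing values where a container has no such tag.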
I have the following XML format, and I want to pull out the values for name, region, and status using python's xml.etree.ElementTree module.
However, my attempt to get this information has been unsuccessful so far.
<feed>
<entry>
<id>uuid:asdfadsfasdf123123</id>
<title type="text"></title>
<content type="application/xml">
<NamespaceDescription xmlns="http://schemas.microsoft.com/netservices/2010/10/servicebus/connect" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Name>instancename</Name>
<Region>US</Region>
<Status>Active</Status>
</NamespaceDescription>
</content>
</entry>
<entry>
<id>uuid:asdfadsfasdf234234</id>
<title type="text"></title>
<content type="application/xml">
<NamespaceDescription xmlns="http://schemas.microsoft.com/netservices/2010/10/servicebus/connect" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Name>instancename2</Name>
<Region>US2</Region>
<Status>Active</Status>
</NamespaceDescription>
</content>
</entry>
</feed>
My code attempt:
NAMESPACE = '{http://www.w3.org/2005/Atom}'
root = et.fromstring(XML_STRING)
entry_root = root.findall('{0}entry'.format(NAMESPACE))
for child in entry_root:
    content_node = child.find('{0}content'.format(NAMESPACE))
    for content in content_node:
        for desc in content.iter():
            print(desc.tag)
            name = desc.find('{0}Name'.format(NAMESPACE))
            print(name)
desc.tag is giving me the nodes I want to access, but name is returning None. Any ideas what's wrong with my code?
Output of desc.tag:
{http://schemas.microsoft.com/netservices/2010/10/servicebus/connect}Name
{http://schemas.microsoft.com/netservices/2010/10/servicebus/connect}Region
{http://schemas.microsoft.com/netservices/2010/10/servicebus/connect}Status
I don't know why I didn't see this before. But, I was able to get the values.
root = et.fromstring(XML_STRING)
entry_root = root.findall('{0}entry'.format(NAMESPACE))
for child in entry_root:
    content_node = child.find('{0}content'.format(NAMESPACE))
    for descr in content_node:
        name_node = descr.find('{0}Name'.format(NAMESPACE))
        print(name_node.text)
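The desc.tag output above shows that Name, Region and Status live in the Service Bus namespace rather than the Atom one, so the lookup has to use that URI. A minimal sketch with the standard library's namespaces mapping follows; the prefix "sb" is an arbitrary choice for this example, and the sample is shortened to a single entry.

```python
import xml.etree.ElementTree as ET

# Shortened version of the feed from the question (one entry only)
XML_STRING = """<feed>
  <entry>
    <id>uuid:asdfadsfasdf123123</id>
    <content type="application/xml">
      <NamespaceDescription xmlns="http://schemas.microsoft.com/netservices/2010/10/servicebus/connect">
        <Name>instancename</Name>
        <Region>US</Region>
        <Status>Active</Status>
      </NamespaceDescription>
    </content>
  </entry>
</feed>"""

# "sb" is an arbitrary prefix chosen for this sketch; only the URI has to match
NS = {"sb": "http://schemas.microsoft.com/netservices/2010/10/servicebus/connect"}

root = ET.fromstring(XML_STRING)
descriptions = []
for desc in root.iterfind(".//sb:NamespaceDescription", namespaces=NS):
    descriptions.append({
        "name": desc.findtext("sb:Name", namespaces=NS),
        "region": desc.findtext("sb:Region", namespaces=NS),
        "status": desc.findtext("sb:Status", namespaces=NS),
    })
print(descriptions)
```

Searching relative to each NamespaceDescription element keeps the three values of one entry together, which is convenient if they need to end up in the same row.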
You can use lxml.etree along with default namespace mapping to parse the XML as follows:
content = '''
<feed>
<entry>
<id>uuid:asdfadsfasdf123123</id>
<title type="text"></title>
<content type="application/xml">
<NamespaceDescription xmlns="http://schemas.microsoft.com/netservices/2010/10/servicebus/connect" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Name>instancename</Name>
<Region>US</Region>
<Status>Active</Status>
</NamespaceDescription>
</content>
</entry>
<entry>
<id>uuid:asdfadsfasdf234234</id>
<title type="text"></title>
<content type="application/xml">
<NamespaceDescription xmlns="http://schemas.microsoft.com/netservices/2010/10/servicebus/connect" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Name>instancename2</Name>
<Region>US2</Region>
<Status>Active</Status>
</NamespaceDescription>
</content>
</entry>
</feed>'''
from lxml import etree
tree = etree.XML(content)
ns = {'default': 'http://schemas.microsoft.com/netservices/2010/10/servicebus/connect'}
names = tree.xpath('//default:Name/text()', namespaces=ns)
regions = tree.xpath('//default:Region/text()', namespaces=ns)
statuses = tree.xpath('//default:Status/text()', namespaces=ns)
print(names)
print(regions)
print(statuses)
Output
['instancename', 'instancename2']
['US', 'US2']
['Active', 'Active']
This XPath/namespace functionality can be adapted to output the data in any format you require.
my XML
<root>
<Book category="Children">
<title>Harry Potter</title>
<author>J.K</author>
<year>2005</year>
<price>29.99</price>
</Book>
<Book category="WEB">
<title>Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</Book>
</root>
I'm using etree in Python:
import xml.etree.ElementTree as ET
Books = ET.parse('4.xml')  # parse the XML file into an ElementTree
1) How can I get a list of elements for each Book, where the list of elements I would like to receive is
BookInfo = [title,author,year,price]
2) What is the correct way to read the text of a specific element of the list BookInfo?
Thanks
1) Try this:
import xml.etree.ElementTree as ET
Books = ET.parse('4.xml')  # parse the XML file into an ElementTree
root = Books.getroot()
for child in root:
    BookInfo = [
        child.find('title').text,
        child.find('author').text,
        child.find('year').text,
        child.find('price').text
    ]
    print(BookInfo)
2) To read a specific element from the list, index into it: BookInfo[0] is the title, BookInfo[1] is the author, and so on.
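If some Book elements might lack one of these children, child.find(...).text would raise an AttributeError on the None result. A minimal sketch using findtext with a default avoids that. The inline sample (with the second Book deliberately missing year) and the FIELDS list are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the second Book deliberately has no <year>
XML = """<root>
  <Book category="Children">
    <title>Harry Potter</title>
    <author>J.K</author>
    <year>2005</year>
    <price>29.99</price>
  </Book>
  <Book category="WEB">
    <title>Learning XML</title>
    <author>Erik T. Ray</author>
    <price>39.95</price>
  </Book>
</root>"""

FIELDS = ["title", "author", "year", "price"]

root = ET.fromstring(XML)
books = []
for book in root.findall("Book"):
    # findtext returns the default instead of raising when the child is absent
    books.append([book.findtext(field, default="") for field in FIELDS])

for book_info in books:
    print(book_info)
```

Each inner list keeps the order of FIELDS, so book_info[0] is still the title and book_info[1] the author.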