Export information from child nodes in xml using Python - python

I have an xml file called persons.xml in the following format:
<?xml version="1.0" encoding="UTF-8"?>
<persons>
<person id="1" name="John">
<city id="21" name="New York"/>
</person>
<person id="2" name="Mary">
<city id="22" name="Los Angeles"/>
</person>
</persons>
I want to export the list of person names, along with the city names, to a file:
import pandas as pd
import xml.etree.ElementTree as ET
tree = ET.parse('./persons.xml')
root = tree.getroot()
df_cols = ["person_name", "city_name"]
rows = []
for node in root:
    person_name = node.attrib.get("name")
    rows.append({"person_name": person_name})
out_df = pd.DataFrame(rows, columns = df_cols)
out_df
Obviously this part of the code will only work for obtaining the name as it’s part of the root, but I can’t figure out how to loop through the child nodes too and obtain this info. Do I need to append something to root to iterate over the child nodes?
I can obtain everything using root.getchildren but it doesn’t allow me to return only the child nodes:
children = root.getchildren()
for child in children:
    ET.dump(child)
Is there a good way to get this information?

See below
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<persons>
<person id="1" name="John">
<city id="21" name="New York" />
</person>
<person id="2" name="Mary">
<city id="22" name="Los Angeles" />
</person>
</persons>'''
root = ET.fromstring(xml)
data = []
for p in root.findall('.//person'):
    # Read the person's own attribute, then descend into its <city> child
    data.append({'person': p.attrib['name'], 'city': p.find('city').attrib['name']})
df = pd.DataFrame(data)
print(df)
output
   person         city
0    John     New York
1    Mary  Los Angeles
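Since the question asks for a file export, and a real file might contain a person without a city, a slightly more defensive variant could look like the sketch below. It uses only the standard library; the helper names person_city_rows and rows_to_csv and the extra third person are assumptions made for illustration, not part of the original answer.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Sample data from the question, plus a hypothetical person without a <city>.
XML = '''<persons>
<person id="1" name="John">
<city id="21" name="New York"/>
</person>
<person id="2" name="Mary">
<city id="22" name="Los Angeles"/>
</person>
<person id="3" name="Ann"/>
</persons>'''


def person_city_rows(xml_text):
    """Return one dict per <person>; city_name is None when <city> is absent."""
    root = ET.fromstring(xml_text)
    rows = []
    for person in root.iter('person'):
        city = person.find('city')  # direct child lookup; None if missing
        rows.append({
            'person_name': person.get('name'),
            'city_name': city.get('name') if city is not None else None,
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text; write this string to a file to export."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=['person_name', 'city_name'])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = person_city_rows(XML)
csv_text = rows_to_csv(rows)
```

If pandas is already in use, passing the same rows to pd.DataFrame(rows) and calling to_csv('persons.csv', index=False) on the result should give an equivalent file (the file name is only an example).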

Related

Python Parsing Multiple XML Nodes with Dynamic Data

I have a log file from an application in an XML-like format that I'm trying to parse. As you can see from the file, one "group" starts with a [trace] line and contains 4 nodes: RequestMeta, Request, ReplyMeta, and Reply.
Once the file is parsed, I want to create an object for each "group" and use the objects for further processing. There could be anywhere from 1 to n groups, depending on the complexity of the log file.
I have been able to parse the XML, but I have some questions on how best to proceed based on its structure.
The first problem is how to structure/re-structure the file for parsing. Since I'm adding a single root node to more than one "group", there will be no easy way for me to know which children of the root node belong together in that group. In the original file, the group is denoted as everything between the [trace] line and the next [trace] line.
I think I could potentially solve this by taking each "group" string and creating a tree for each group instead of a tree for the entire file.
The second problem is how to store the data once it's parsed. Each and every request/reply will contain different data elements under the srvdata node. I'm not sure how to dynamically store a variable number of values that have a variable number of names.
After parsing all of the data, I want to output it in a simple webpage that looks something like https://imgur.com/a/2l6ZSJK
py script
import xml.etree.ElementTree as ET
with open('C:/code/mra/requestreply.txt') as f:
    txt = f.read()
pos = 0
# replace all [trace] lines
while pos >= 0:
    pos = txt.find('[trace-')
    pos2 = txt.find('\n', pos + 1) + 1
    if pos >= 0:
        txt = txt.replace(txt[pos:pos2], '')
# replace all xml instances because they are out of order
txt = txt.replace('<?xml version="1.0" encoding="utf-8"?>\n', '')
# add a master root node
xml = '<root>\n' + txt + '</root>'
tree = ET.fromstring(xml)
xml file - this is considered a single group (there could be hundreds)
[trace-592] TransactionID=6010 TransactionName=CPM.ExecuteDiscernScript User=MEPPS
<RequestMeta>
<?xml version="1.0" encoding="utf-8"?>
<srvxml>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
....
</xs:schema>
<srvdata lang="C">
....
</srvdata>
</srvxml>
</RequestMeta>
<Request>
<?xml version="1.0" encoding="utf-8"?>
<srvxml>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
....
</xs:schema>
<srvdata lang="C">
....
</srvdata>
</srvxml>
</Request>
<ReplyMeta>
<?xml version="1.0" encoding="utf-8"?>
<srvxml>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
....
</xs:schema>
<srvdata lang="C">
....
</srvdata>
</srvxml>
</ReplyMeta>
<Reply>
<?xml version="1.0" encoding="utf-8"?>
<srvxml>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
....
</xs:schema>
<srvdata lang="C">
....
</srvdata>
</srvxml>
</Reply>
I suggest modifying your XML structure like this (I named the file trace.xml):
<?xml version="1.0" encoding="utf-8"?>
<root>
<!--[trace-592] TransactionID=6010 TransactionName=CPM.ExecuteDiscernScript User=MEPPS-->
<RequestMeta>
<!-- <?xml version="1.0" encoding="utf-8"?> -->
<srvxml>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
....
</xs:schema>
<srvdata lang="C">
....
</srvdata>
</srvxml>
</RequestMeta>
<Request>
<!-- <?xml version="1.0" encoding="utf-8"?> -->
<srvxml>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
....
</xs:schema>
<srvdata lang="C">
....
</srvdata>
</srvxml>
</Request>
<ReplyMeta>
<!-- <?xml version="1.0" encoding="utf-8"?> -->
<srvxml>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
....
</xs:schema>
<srvdata lang="C">
....
</srvdata>
</srvxml>
</ReplyMeta>
<Reply>
<!-- <?xml version="1.0" encoding="utf-8"?> -->
<srvxml>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
....
</xs:schema>
<srvdata lang="C">
....
</srvdata>
</srvxml>
</Reply>
</root>
Then you can parse each segment separately, like this:
import xml.etree.ElementTree as ET
def parseRequestMeta(RequestMeta):
    """Parse the parts you are interested in here."""
    for root in RequestMeta:
        print(root.tag)
        for child in root.iter():
            print(child.tag, child.text)
def parseRequest(Request):
    pass
def parseReplyMeta(ReplyMeta):
    pass
def parseReply(Reply):
    pass
RequestMeta = []
Request = []
ReplyMeta = []
Reply = []
events = ["start", "end"]
# Collect each kind of node once its closing tag has been read
for event, node in ET.iterparse('trace.xml', events=events):
    if event == "end" and node.tag == "RequestMeta":
        RequestMeta.append(node)
        print(node.tag)
    if event == "end" and node.tag == "Request":
        Request.append(node)
        print(node.tag)
    if event == "end" and node.tag == "ReplyMeta":
        ReplyMeta.append(node)
        print(node.tag)
    if event == "end" and node.tag == "Reply":
        Reply.append(node)
        print(node.tag)
# The other parse* functions are still stubs, so parseRequestMeta is reused to print every list
parseRequestMeta(RequestMeta)
parseRequestMeta(Request)
parseRequestMeta(ReplyMeta)
parseRequestMeta(Reply)
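The question's own idea of building one tree per group, and its second problem of storing a variable set of named values, can be sketched without editing the log by hand. The sample log, the names parse_groups, TRACE_RE and XML_DECL_RE, and the field names inside srvdata are all hypothetical; the real srvdata content is elided in the question, so this sketch assumes it holds simple child elements with text.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical two-group log in the same shape as the question's file.
LOG = '''[trace-1] TransactionID=1 TransactionName=Demo.First User=A
<Request>
<?xml version="1.0" encoding="utf-8"?>
<srvxml>
<srvdata lang="C">
<person_id>42</person_id>
<mode>fast</mode>
</srvdata>
</srvxml>
</Request>
<Reply>
<?xml version="1.0" encoding="utf-8"?>
<srvxml>
<srvdata lang="C">
<status>OK</status>
</srvdata>
</srvxml>
</Reply>
[trace-2] TransactionID=2 TransactionName=Demo.Second User=B
<Request>
<?xml version="1.0" encoding="utf-8"?>
<srvxml>
<srvdata lang="C">
<query>abc</query>
</srvdata>
</srvxml>
</Request>
'''

TRACE_RE = re.compile(r'^\[trace-\d+\].*$', re.MULTILINE)
XML_DECL_RE = re.compile(r'<\?xml[^>]*\?>')


def parse_groups(log_text):
    """Build one small tree per [trace] group and flatten each srvdata to a dict."""
    groups = []
    headers = list(TRACE_RE.finditer(log_text))
    for i, header in enumerate(headers):
        # A group runs from the end of its [trace] line to the next [trace] line.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(log_text)
        body = XML_DECL_RE.sub('', log_text[header.end():end])
        root = ET.fromstring('<group>' + body + '</group>')
        sections = {}
        for section in root:  # e.g. RequestMeta, Request, ReplyMeta, Reply
            srvdata = section.find('.//srvdata')
            fields = {} if srvdata is None else {
                el.tag: (el.text or '').strip() for el in srvdata
            }
            sections[section.tag] = fields
        groups.append({'trace': header.group(0).strip(), 'sections': sections})
    return groups


groups = parse_groups(LOG)
```

A plain dict per section copes with field names that differ from one request to the next. It only captures the direct children of srvdata, so nested or repeated elements would need a richer structure.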

Python - replace root element of one xml file with another root element without its children

I have one xml file that looks like this, XML1:
<?xml version='1.0' encoding='utf-8'?>
<report>
</report>
And the other one that is like this,
XML2:
<?xml version='1.0' encoding='utf-8'?>
<report attrib1="blabla" attrib2="blabla" attrib3="blabla" attrib4="blabla" attrib5="blabla" >
<child1>
<child2>
....
</child2>
</child1>
</report>
I need to replace the root element of XML1 with the root element of XML2, without its children, so that XML1 looks like this:
<?xml version='1.0' encoding='utf-8'?>
<report attrib1="blabla" attrib2="blabla" attrib3="blabla" attrib4="blabla" attrib5="blabla">
</report>
Currently my code looks like this, but it doesn't remove the children; it writes the whole tree instead:
source_tree = ET.parse('XML2.xml')
source_root = source_tree.getroot()
report = source_root.findall('report')
for child in list(report):
    report.remove(child)
source_tree.write('XML1.xml', encoding='utf-8', xml_declaration=True)
Does anyone have an idea how I can achieve this?
Thanks!
Try the code below (it just copies the attributes):
import xml.etree.ElementTree as ET
xml1 = '''<?xml version='1.0' encoding='utf-8'?>
<report>
</report>'''
xml2 = '''<?xml version='1.0' encoding='utf-8'?>
<report attrib1="blabla" attrib2="blabla" attrib3="blabla" attrib4="blabla" attrib5="blabla" >
<child1>
<child2>
</child2>
</child1>
</report>'''
root1 = ET.fromstring(xml1)
root2 = ET.fromstring(xml2)
root1.attrib = root2.attrib
ET.dump(root1)
output
<report attrib1="blabla" attrib2="blabla" attrib3="blabla" attrib4="blabla" attrib5="blabla">
</report>
So here is the working code:
import xml.etree.ElementTree as ET
source_tree = ET.parse('XML2.xml')
source_root = source_tree.getroot()
dest_tree = ET.parse('XML1.xml')
dest_root = dest_tree.getroot()
dest_root.attrib = source_root.attrib
dest_tree.write('XML1.xml', encoding='utf-8', xml_declaration=True)
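For reference, the question's first attempt did nothing because findall('report') searches the children of the root for report elements, while the root itself is the report element, so the returned list is empty and the loop never runs. Removing the children directly from the root is another way to reach the same result as copying the attributes. The sketch below assumes a shortened inline copy of XML2 in place of the file.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for XML2.xml from the question.
XML2 = '''<report attrib1="blabla" attrib2="blabla">
<child1>
<child2/>
</child1>
</report>'''

root = ET.fromstring(XML2)  # root already is the <report> element

# Iterate over a copy, because removing while iterating would skip elements.
for child in list(root):
    root.remove(child)

remaining_children = len(root)
attributes = dict(root.attrib)
```

The stripped root could then be saved with ET.ElementTree(root).write('XML1.xml', encoding='utf-8', xml_declaration=True), as in the code above.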

Get text inside xml tags by their name

I have some XML and I want to get the text of specific elements (XML tags) using Python.
I have tried a couple of solutions, but they didn't work.
import xml.etree.ElementTree as ET
tree = ET.fromstring(xml)
for node in tree.iter('Model'):
    print(node)
How can I do that?
XML code:
<soap:Envelope
xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<GetVehicleLimitedInfoResponse
xmlns="http://schemas.conversesolutions.com/xsd/dmticta/v1">
<return>
<ResponseMessage xsi:nil="true" />
<ErrorCode xsi:nil="true" />
<RequestId> 2012290007705 </RequestId>
<TransactionCharge>150</TransactionCharge>
<VehicleNumber>GF-0176</VehicleNumber>
<AbsoluteOwner>SIYAPATHA FINANCE PLC</AbsoluteOwner>
<EngineNo>GA15-483936F</EngineNo>
<ClassOfVehicle>MOTOR CAR</ClassOfVehicle>
<Make>NISSAN</Make>
<Model>PULSAR</Model>
<YearOfManufacture>1998</YearOfManufacture>
<NoOfSpecialConditions>0</NoOfSpecialConditions>
<SpecialConditions xsi:nil="true" />
</return>
</GetVehicleLimitedInfoResponse>
</soap:Body>
</soap:Envelope>
Edited and improved answer:
import xml.etree.ElementTree as ET
import re
ns = {"veh": "http://schemas.conversesolutions.com/xsd/dmticta/v1"}
tree = ET.parse('test.xml') # save your xml as test.xml
root = tree.getroot()
def get_tag_name(tag):
    # Strip the "{namespace}" prefix that ElementTree puts in front of tag names
    return re.sub(r'\{.*\}', '', tag)
for node in root.find(".//veh:return", ns):
    print(get_tag_name(node.tag) + ': ', node.text)
It should produce something like this:
ResponseMessage: None
ErrorCode: None
RequestId: 2012290007705
TransactionCharge: 150
VehicleNumber: GF-0176
AbsoluteOwner: SIYAPATHA FINANCE PLC
EngineNo: GA15-483936F
ClassOfVehicle: MOTOR CAR
Make: NISSAN
Model: PULSAR
YearOfManufacture: 1998
NoOfSpecialConditions: 0
SpecialConditions: None
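To fetch a single element by name, as the title asks, the same namespace mapping can be passed to findtext. The sketch below assumes a trimmed inline copy of the SOAP response; the prefix veh is just a local alias, as in the answer above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response from the question.
XML = '''<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body>
<GetVehicleLimitedInfoResponse
 xmlns="http://schemas.conversesolutions.com/xsd/dmticta/v1">
<return>
<Make>NISSAN</Make>
<Model>PULSAR</Model>
</return>
</GetVehicleLimitedInfoResponse>
</soap:Body>
</soap:Envelope>'''

# Map a local prefix to the default namespace declared in the document.
ns = {'veh': 'http://schemas.conversesolutions.com/xsd/dmticta/v1'}

root = ET.fromstring(XML)
model = root.findtext('.//veh:Model', namespaces=ns)
make = root.findtext('.//veh:Make', namespaces=ns)
missing = root.findtext('.//veh:Colour', default='n/a', namespaces=ns)
```

The question's tree.iter('Model') finds nothing because the element's real tag includes the namespace, so the lookup has to be namespace-qualified.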

Extract multiple xml attributes to pandas dataframe

I have a basic xml file called meals.xml which looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<meals name="Sample Text">
<meal id="1" name="Poached Eggs" type="breakfast"/>
<meal id="2" name="Club Sandwich" type="lunch"/>
<meal id="3" name="Steak" type="dinner"/>
<meal id="4" name="Steak" type="dinner"/>
</meals>
I want to extract both the 'id' and 'name' attributes into a dataframe. I can extract one when specifying one column and one attribute (e.g., name only), but can't seem to figure out the syntax for getting multiple attributes in the for loop. This is what I've tried, adding id to the 'df_cols' and 'attrib.get' function:
import xml.etree.ElementTree as ET
import pandas as pd
root = ET.parse('meals.xml').getroot()
df_cols = ["id", "name"]
rows = []
for node in root:
    value = node.attrib.get('id', 'name')
    rows.append(value)
df = pd.DataFrame(rows, columns = df_cols)
df
Can someone advise how to do this?
The code below may work for you:
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<meals name="Sample Text">
<meal id="1" name="Poached Eggs" type="breakfast"/>
<meal id="2" name="Club Sandwich" type="lunch"/>
<meal id="3" name="Steak" type="dinner"/>
<meal id="4" name="Steak" type="dinner"/>
</meals>'''
root = ET.fromstring(xml)
data = [{'id': m.attrib['id'], 'name': m.attrib['name']} for m in root.findall('.//meal')]
df = pd.DataFrame(data)
print(df)
output
   id           name
0   1   Poached Eggs
1   2  Club Sandwich
2   3          Steak
3   4          Steak
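The attempt in the question fails because attrib is a plain dict, so attrib.get('id', 'name') returns the value of id and treats 'name' as the default for a missing key. A generic version that takes the list of wanted attributes could be sketched as follows; the helper name attribute_rows and the third meal without a type are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Sample from the question; the last meal has no "type" on purpose.
XML = '''<meals name="Sample Text">
<meal id="1" name="Poached Eggs" type="breakfast"/>
<meal id="2" name="Club Sandwich" type="lunch"/>
<meal id="3" name="Steak"/>
</meals>'''


def attribute_rows(root, tag, columns):
    """One dict per matching element; missing attributes become None."""
    rows = []
    for element in root.iter(tag):
        rows.append({col: element.get(col) for col in columns})
    return rows


root = ET.fromstring(XML)
rows = attribute_rows(root, 'meal', ['id', 'name', 'type'])
```

The resulting list of dicts can be handed to pd.DataFrame(rows) exactly as in the answer above.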

Python XML check next item

Here is a little xml example:
<?xml version="1.0" encoding="UTF-8"?>
<list>
<person id="1">
<name>Smith</name>
<city>New York</city>
</person>
<person id="2">
<name>Pitt</name>
</person>
...
...
</list>
Now I need all Persons with a name and city.
I tried:
#!/usr/bin/python
# coding: utf8
import xml.dom.minidom as dom
tree = dom.parse("test.xml")
for listItems in tree.firstChild.childNodes:
    for personItems in listItems.childNodes:
        if personItems.nodeName == "name" and personItems.nextSibling == "city":
            print(personItems.firstChild.data.strip())
But the output is empty. Without the "and" condition I get all the names. How can I check that the next tag after "name" is "city"?
You can do this in minidom:
import xml.dom.minidom as minidom
def getChild(n, v):
    # Yield the direct children of n whose local name is v
    for child in n.childNodes:
        if child.localName == v:
            yield child
xmldoc = minidom.parse('test.xml')
person = getChild(xmldoc, 'list')
for p in person:
    for v in getChild(p, 'person'):
        attr = v.getAttributeNode('id')
        if attr:
            print(attr.nodeValue.strip())
This prints the id of each person node:
1
2
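The condition in the question is never true because nextSibling in minidom is a node object (in indented XML usually a whitespace text node), not a string, so comparing it with "city" always fails. Staying with minidom, one way to keep only persons that have both a name and a city is sketched below; the helper first_child_text is a hypothetical name, and the XML is inlined instead of read from test.xml.

```python
import xml.dom.minidom as minidom

XML = """<list>
<person id="1">
<name>Smith</name>
<city>New York</city>
</person>
<person id="2">
<name>Pitt</name>
</person>
</list>"""


def first_child_text(parent, tag):
    """Text of the first direct child element called tag, or None if absent."""
    for child in parent.childNodes:
        if child.nodeType == child.ELEMENT_NODE and child.tagName == tag:
            return ''.join(n.data for n in child.childNodes
                           if n.nodeType == n.TEXT_NODE).strip()
    return None


doc = minidom.parseString(XML)
matches = []
for person in doc.getElementsByTagName('person'):
    name = first_child_text(person, 'name')
    city = first_child_text(person, 'city')
    # Keep the person only when both child elements exist.
    if name is not None and city is not None:
        matches.append((person.getAttribute('id'), name, city))
```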
Use ElementTree instead:
import xml.etree.ElementTree as ET
tree = ET.parse('a.xml')
root = tree.getroot()
for person in root.findall('person'):
    name = person.find('name').text
    try:
        city = person.find('city').text
    except AttributeError:
        # find() returned None, so this person has no <city>
        continue
    print(name, city)
To get the id, use person.get('id').
Output: Smith New York
Using lxml, you can use xpath to get in one step what you need:
from lxml import etree
xmlstr = """
<list>
<person id="1">
<name>Smith</name>
<city>New York</city>
</person>
<person id="2">
<name>Pitt</name>
</person>
</list>
"""
xml = etree.fromstring(xmlstr)
xp = "//person[city]"
for person in xml.xpath(xp):
    print(etree.tostring(person, encoding="unicode"))
lxml is an external Python package, but it is so useful that, to me, it is always worth installing.
The XPath expression searches for any (//) person element that has a city subelement (the condition inside []).
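If installing lxml is not an option, the standard library's ElementTree supports a limited XPath subset that includes the same [city] child predicate. A minimal sketch, assuming the same sample document inlined as a string:

```python
import xml.etree.ElementTree as ET

XML = """<list>
<person id="1">
<name>Smith</name>
<city>New York</city>
</person>
<person id="2">
<name>Pitt</name>
</person>
</list>"""

root = ET.fromstring(XML)

# "person[city]" selects <person> children of the root that have a <city> child.
people = [
    (p.get('id'), p.findtext('name'), p.findtext('city'))
    for p in root.findall('person[city]')
]
```

Full XPath features such as functions and axes still require lxml.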
