I'm working with some XML data that will ultimately be loaded into a CSV. I am having trouble indexing the data correctly when an element doesn't exist in an entry. Below is a simple XML example of what I am working with:
<root>
<entry>
<LASTNAME>Doe</LASTNAME>
<FIRSTNAME>Jon</FIRSTNAME>
<GENDER>M</GENDER>
</entry>
<entry>
<LASTNAME>Doe</LASTNAME>
<FIRSTNAME>Jane</FIRSTNAME>
<GENDER>F</GENDER>
<HAIRCOLOR>Blonde</HAIRCOLOR>
</entry>
</root>
The output I end up getting is as follows:
LASTNAME
FIRSTNAME
GENDER
HAIRCOLOR
Doe
Jon
M
Blonde
Doe
Jane
F
But the correct output should be:
LASTNAME
FIRSTNAME
GENDER
HAIRCOLOR
Doe
Jon
M
Doe
Jane
F
Blonde
So I seem to have an indexing problem: when HAIRCOLOR is looked up for an entry that lacks it, the search continues down the XML until it finds one in a later entry (how often this happens depends on the number of HAIRCOLOR elements in the document), but it should stop when it reaches the end of the current entry.
Here's the code I am working with:
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import ParseError
from xml.etree import ElementTree
bytes_ = '''
<root>
<entry>
<LASTNAME>Doe</LASTNAME>
<FIRSTNAME>Jon</FIRSTNAME>
<GENDER>M</GENDER>
</entry>
<entry>
<LASTNAME>Doe</LASTNAME>
<FIRSTNAME>Jane</FIRSTNAME>
<GENDER>F</GENDER>
<HAIRCOLOR>Blonde</HAIRCOLOR>
</entry>
</root>
'''
xpaths = [
"./entry/LASTNAME",
"./entry/FIRSTNAME",
"./entry/GENDER",
"./entry/HAIRCOLOR"
]
data = []
_fields = [
{'text' : ''},
{'text' : ''},
{'text' : ''},
{'text' : ''}
]
root = ET.fromstring(bytes_)
for count in range(0,len(root.findall("./entry"))):
    for ele, xpath in enumerate(xpaths):
        try:
            attribs = list(root.findall(xpath)[count].attrib.keys())
            for attrib in attribs:
                for i in _fields[ele].keys():
                    if attrib == i:
                        _fields[ele][i] = root.findall(xpath)[count].attrib[attrib]
            _fields[ele]["text"] =root.findall(xpath)[count].text
        except IndexError:
            _fields[ele]["text"]=''
        data.append(_fields[ele].values())
    data_list = [item for sublist in data for item in sublist]
    data.clear()
    print(data_list)
Any help is appreciated.
Edited for clarity
Your XPath './entry/HAIRCOLOR' matches every HAIRCOLOR tag under any entry, so indexing the result with count no longer lines up with the entries once one of them lacks the tag. You need to first find each './entry', then look for HAIRCOLOR and the other tags within that entry. At the moment you're searching from root for the sub-tags.
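A minimal sketch of that per-entry approach, assuming the sample data from the question. The names `FIELDS` and `rows` are illustrative and not part of the original code; `findtext` with a default is used so that a missing tag produces an empty string.

```python
import xml.etree.ElementTree as ET

xml_text = """
<root>
  <entry>
    <LASTNAME>Doe</LASTNAME>
    <FIRSTNAME>Jon</FIRSTNAME>
    <GENDER>M</GENDER>
  </entry>
  <entry>
    <LASTNAME>Doe</LASTNAME>
    <FIRSTNAME>Jane</FIRSTNAME>
    <GENDER>F</GENDER>
    <HAIRCOLOR>Blonde</HAIRCOLOR>
  </entry>
</root>
"""

# Hypothetical column list; adjust to the tags you need.
FIELDS = ["LASTNAME", "FIRSTNAME", "GENDER", "HAIRCOLOR"]

root = ET.fromstring(xml_text)
rows = []
for entry in root.findall("./entry"):
    # Search relative to this entry only, so a missing tag yields ''
    # instead of picking up a value from a later entry.
    rows.append([entry.findtext(tag, default="") for tag in FIELDS])

for row in rows:
    print(row)
```

Each row then has one cell per column, which maps directly onto `csv.writer.writerow`.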
List elements to be appended in XML data:
Sorted_TestSpecID: [10860972, 10860972, 10860972, 10860972, 10860972]
Sorted_TestCaseID: [16961435, 16961462, 16961739, 16961741, 16961745]
Sorted_TestText : ['SIG1', 'SIG2', 'SIG3', 'Signal1', 'Signal2']
original xml data:
<tc>
<title>Signal1</title>
<tcid>2c758925-dc3d-4b1d-a5e2-e0ca54c52a47</tcid>
<attributes>
<attr>
<key>TestSpec ID</key>
<value>0</value>
</attr>
<attr>
<key>TestCase ID</key>
<value>0</value>
</attr>
</attributes>
</tc>
I am trying to write a Python script that:
Searches the XML data for a title from Sorted_TestText (for example Signal1)
Then finds the key TestCase ID and updates its value with the corresponding entry, 16961741
Then finds the key TestSpec ID and updates its value with the corresponding entry, 10860972.
from bs4 import BeautifulSoup
soup = BeautifulSoup(xml_data, 'xml')
for tc in soup.find_all('tc'):
    for title, spec, case in zip(Sorted_TestText, Sorted_TestSpecID, Sorted_TestCaseID):
        if tc.find('title').text == title:
            for attr in tc.find_all('attr'):
                if attr.find('key').text == "TestSpec ID":
                    attr.find('value').text = str(spec)
                if attr.find('key').text == "TestCase ID":
                    attr.find('value').text = str(case)
print(soup)
I've tried the script above, but it does not update spec and case based on the title; it only works if spec, case, and title are in the same order. My intention is for the script to look up the title and then update its respective attributes. Say 'SIG1', 'SIG2', and 'SIG3' are not present in my XML: I want to update the spec and case of Signal1 with spec: 10860972 and case: 16961741, but this script writes spec: 10860972 and case: 16961435 instead. The spec and case lists need to be traversed together with the respective title. I tried, but had no luck. Any support is appreciated. Thanks in advance.
I'd use a dictionary where keys are titles and values are TestCaseIDs and TestSpecIDs.
Then, to change the contents of <value> use .string instead of .text:
dct = {
    c: (str(a), str(b))
    for a, b, c in zip(Sorted_TestSpecID, Sorted_TestCaseID, Sorted_TestText)
}
for tc in soup.select("tc"):
    title = tc.title.get_text(strip=True)
    if title not in dct:
        continue
    val = tc.select_one('attr:has(key:-soup-contains("TestSpec ID")) value')
    if val:
        val.string = str(dct[title][0])
    val = tc.select_one('attr:has(key:-soup-contains("TestCase ID")) value')
    if val:
        val.string = str(dct[title][1])
print(soup.prettify())
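For comparison, the same title-to-IDs lookup can be sketched with the standard library's ElementTree instead of BeautifulSoup. This is only an illustration: the XML is a trimmed copy of the sample from the question, and `ids_by_title` is a name introduced here.

```python
import xml.etree.ElementTree as ET

Sorted_TestSpecID = [10860972, 10860972, 10860972, 10860972, 10860972]
Sorted_TestCaseID = [16961435, 16961462, 16961739, 16961741, 16961745]
Sorted_TestText = ['SIG1', 'SIG2', 'SIG3', 'Signal1', 'Signal2']

# Trimmed copy of the sample XML (the <tcid> element is left out).
xml_data = """
<tc>
  <title>Signal1</title>
  <attributes>
    <attr><key>TestSpec ID</key><value>0</value></attr>
    <attr><key>TestCase ID</key><value>0</value></attr>
  </attributes>
</tc>
"""

# title -> {key name: new value}
ids_by_title = {
    title: {"TestSpec ID": str(spec), "TestCase ID": str(case)}
    for title, spec, case in zip(Sorted_TestText, Sorted_TestSpecID, Sorted_TestCaseID)
}

root = ET.fromstring(xml_data)
# iter() also yields the root itself, so this works whether <tc> is the
# root element or nested deeper in the document.
for tc in root.iter("tc"):
    new_values = ids_by_title.get(tc.findtext("title", default="").strip())
    if new_values is None:
        continue  # title not in the lists: leave this <tc> untouched
    for attr in tc.iter("attr"):
        key = attr.findtext("key", default="").strip()
        value = attr.find("value")
        if key in new_values and value is not None:
            value.text = new_values[key]

print(ET.tostring(root, encoding="unicode"))
```

Because the lookup is keyed by title, the order of the lists relative to the XML no longer matters.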
Prints:
<?xml version="1.0" encoding="utf-8"?>
<tc>
<title>
Signal1
</title>
<tcid>
2c758925-dc3d-4b1d-a5e2-e0ca54c52a47
</tcid>
<attributes>
<attr>
<key>
TestSpec ID
</key>
<value>
10860972
</value>
</attr>
<attr>
<key>
TestCase ID
</key>
<value>
16961741
</value>
</attr>
</attributes>
</tc>
I am trying to parse the following XML data from a file with Python, to print only the entries that have a "zip-code" tag, together with their name attribute:
<response status="success" code="19"><result total-count="1" count="1">
<address>
<entry name="studio">
<zip-code>14407</zip-code>
<description>Nothing</description>
</entry>
<entry name="mailbox">
<zip-code>33896</zip-code>
<description>Nothing</description>
</entry>
<entry name="garage">
<zip-code>33746</zip-code>
<description>Tony garage</description>
</entry>
<entry name="playstore">
<url>playstation.com</url>
<description>game download</description>
</entry>
<entry name="gym">
<zip-code>33746</zip-code>
<description>Getronics NOC subnet 2</description>
</entry>
<entry name="e-cigars">
<url>vape.com/24</url>
<description>vape juices</description>
</entry>
</address>
</result></response>
The python code that I am trying to run is
from xml.etree import ElementTree as ET
tree = ET.parse('file.xml')
root = tree.getroot()
items = root.iter('entry')
for item in items:
    zip = item.find('zip-code').text
    names = (item.attrib)
    print(' {} {} '.format(
        names, zip
    ))
However, it fails once it reaches an entry without a "zip-code" tag.
How can I make this run?
Thanks in advance
As @AmitaiIrron suggested, XPath can help here.
This code searches the document for elements named zip-code and steps back up to the parent of each one. From there, you can get the name attribute and pair it with the text of the zip-code element:
for ent in root.findall(".//zip-code/.."):
    print(ent.attrib.get('name'), ent.find('zip-code').text)
studio 14407
mailbox 33896
garage 33746
gym 33746
OR
{ent.attrib.get('name') : ent.find('zip-code').text
 for ent in root.findall(".//zip-code/..")}
{'studio': '14407', 'mailbox': '33896', 'garage': '33746', 'gym': '33746'}
Your loop should look like this:
# Find all <entry> tags in the hierarchy
for item in root.findall('.//entry'):
    # Try finding a <zip-code> child
    zipc = item.find('./zip-code')
    # If found a child, print data for it
    if zipc is not None:
        names = (item.attrib)
        print(' {} {} '.format(
            names, zipc.text
        ))
It's all a matter of learning to use xpath properly when searching through the XML tree.
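If you have no problem using regular expressions, the following works for most of this data:
import re
file = open('file.xml', 'r').read()
pattern = r'name="(.*?)".*?<zip-code>(.*?)<\/zip-code>'
matches = re.findall(pattern, file, re.S)
for m in matches:
    print("{} {}".format(m[0], m[1]))
and produces the result:
studio 14407
mailbox 33896
garage 33746
playstore 33746
Note the last line: because the lazy .*? can run across entry boundaries, playstore (which has no zip-code) is paired with the zip code of gym. An XML parser avoids this kind of mismatch.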
I'm relatively new to Python and am currently trying to parse XML and convert it to CSV. My code works if the parent and child tags exist, but I receive this error message:
Phone = element[3][0].text
IndexError: child index out of range
when a tag exists in the first Person element but not in the second.
I tried to put in an if statement, but it didn't quite work. This is what the XML and my original code look like. If anyone can point me in the right direction, I would appreciate it!
XML File
<Member>
<Person>
<FirstName>JOHN</FirstName>
<LastName>DOE</LastName>
<Address>
<Address1>1234 TEST DR</Address1>
<Address2></Address2>
<City>SIMCITY</City>
<State>TD</State>
<ZipCode>12345 </ZipCode>
</Address>
<Phone>
<AreaCode>212</AreaCode>
<PhoneNumber>2223333</PhoneNumber>
</Phone>
</Person>
<Person>
<FirstName>JANE</FirstName>
<LastName>DOE</LastName>
<Address>
<Address1>1234 DEE ST</Address1>
<Address2></Address2>
<City>LCITY</City>
<State>TD</State>
<ZipCode>12345 </ZipCode>
</Address>
</Person>
</Member>
My Code:
import csv
import xml.etree.ElementTree as ET
tree = ET.parse("Stack.xml")
root = tree.getroot()
xml_data_to_csv =open('Out.csv','w')
Csv_writer=csv.writer(xml_data_to_csv)
list_head=[]
count=0
for element in root.findall('Person'):
    person = []
    address_list = []
    phone_list = []
    #get head node
    if count == 0:
        FirstName = element.find('FirstName').tag
        list_head.append(FirstName)
        LastName = element.find('LastName').tag
        list_head.append(LastName)
        Address = element[2].tag
        list_head.append(Address)
        Phone = element[3].tag
        list_head.append(Phone)
        Csv_writer.writerow(list_head)
        count = count +1
    #get child node
    FirstName = element.find('FirstName').text
    person.append(FirstName)
    LastName = element.find('LastName').text
    person.append(LastName)
    Address = element[2][0].text
    address_list.append(Address)
    Address2 = element[2][1].text
    address_list.append(Address2)
    City = element[2][2].text
    address_list.append(City)
    State = element[2][3].text
    address_list.append(State)
    ZipCode = element[2][4].text
    address_list.append(ZipCode)
    person.append(address_list)
    Phone = element[3][0].text
    phone_list.append(Phone)
    AreaCode = element[3][1].text
    phone_list.append(AreaCode)
    person.append(phone_list)
    #Write List_nodes to csv
    Csv_writer.writerow(person)
xml_data_to_csv.close()
Try using XPath to find the tags you need. For example, you can replace this code:
Phone = element[3][0].text
phone_list.append(Phone)
AreaCode = element[3][1].text
phone_list.append(AreaCode)
person.append(phone_list)
with something like this:
phone_list = [e.text for e in element.findall('Phone//')]
person.append(phone_list)
or, in my opinion the best option:
person.append([e.text for e in element.findall('Phone//')])
This avoids the error, because findall returns an empty list when there is no Phone element, and it significantly reduces the amount of code :)
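The same idea can be taken one step further so that every CSV row has the same number of columns. The following is a sketch under assumptions, not the asker's code: `COLUMNS` is a hypothetical mapping from column name to a path relative to each Person, the XML is a trimmed version of the sample, and an in-memory buffer stands in for Out.csv.

```python
import csv
import io
import xml.etree.ElementTree as ET

xml_text = """
<Member>
  <Person>
    <FirstName>JOHN</FirstName>
    <LastName>DOE</LastName>
    <Phone>
      <AreaCode>212</AreaCode>
      <PhoneNumber>2223333</PhoneNumber>
    </Phone>
  </Person>
  <Person>
    <FirstName>JANE</FirstName>
    <LastName>DOE</LastName>
  </Person>
</Member>
"""

# Hypothetical mapping of CSV column -> path relative to each <Person>.
COLUMNS = {
    "FirstName": "FirstName",
    "LastName": "LastName",
    "AreaCode": "Phone/AreaCode",
    "PhoneNumber": "Phone/PhoneNumber",
}

root = ET.fromstring(xml_text)
out = io.StringIO()  # stands in for open('Out.csv', 'w', newline='')
writer = csv.writer(out, lineterminator="\n")
writer.writerow(COLUMNS)  # iterating a dict yields its keys: the header row
for person in root.findall("Person"):
    # findtext returns the default when the path matches nothing,
    # so a missing <Phone> simply produces empty cells.
    writer.writerow([person.findtext(path, default="") for path in COLUMNS.values()])

print(out.getvalue())
```

Looking elements up by name rather than by position (`element[3][0]`) also keeps the code working if the order of the child tags changes.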
Here is a little xml example:
<?xml version="1.0" encoding="UTF-8"?>
<list>
<person id="1">
<name>Smith</name>
<city>New York</city>
</person>
<person id="2">
<name>Pitt</name>
</person>
...
...
</list>
Now I need all Persons with a name and city.
I tried:
#!/usr/bin/python
# coding: utf8
import xml.dom.minidom as dom
tree = dom.parse("test.xml")
for listItems in tree.firstChild.childNodes:
    for personItems in listItems.childNodes:
        if personItems.nodeName == "name" and personItems.nextSibling == "city":
            print(personItems.firstChild.data.strip())
But the output is empty. Without the "and" condition I get all names. How can I check that the next tag after "name" is "city"?
You can do this in minidom:
import xml.dom.minidom as minidom
def getChild(n,v):
    for child in n.childNodes:
        if child.localName==v:
            yield child
xmldoc = minidom.parse('test.xml')
person = getChild(xmldoc, 'list')
for p in person:
    for v in getChild(p,'person'):
        attr = v.getAttributeNode('id')
        if attr:
            print(attr.nodeValue.strip())
This prints the id of each person node:
1
2
Use ElementTree:
import xml.etree.ElementTree as ET
tree = ET.parse('a.xml')
root = tree.getroot()
for person in root.findall('person'):
    name = person.find('name').text
    try:
        city = person.find('city').text
    except AttributeError:
        # no <city> child: find() returned None
        continue
    print(name, city)
The id is available as person.get('id').
Output: Smith New York
Using lxml, you can use xpath to get in one step what you need:
from lxml import etree
xmlstr = """
<list>
<person id="1">
<name>Smith</name>
<city>New York</city>
</person>
<person id="2">
<name>Pitt</name>
</person>
</list>
"""
xml = etree.fromstring(xmlstr)
xp = "//person[city]"
for person in xml.xpath(xp):
    print(etree.tostring(person, encoding="unicode"))
lxml is an external Python package, but it is so useful that, to me, it is always worth installing.
The XPath expression searches for any (//) person element that has (declared by the content of []) a city subelement.
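If installing lxml is not an option, the standard library's ElementTree supports the same `[tag]` predicate in its limited XPath subset. A small sketch, reusing the sample data above; the variable `matches` is introduced here for illustration.

```python
import xml.etree.ElementTree as ET

xmlstr = """
<list>
  <person id="1">
    <name>Smith</name>
    <city>New York</city>
  </person>
  <person id="2">
    <name>Pitt</name>
  </person>
</list>
"""

root = ET.fromstring(xmlstr)
# "person[city]" selects <person> children of the root that have a <city> child.
matches = [
    (p.findtext("name"), p.findtext("city"))
    for p in root.findall("person[city]")
]
print(matches)
```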
I am new to Python and to coding, so please be patient with my question.
Here's my busy XML:
<?xml version="1.0" encoding="utf-8"?>
<Total>
<ID>999</ID>
<Response>
<Detail>
<Nix>
<Check>pass</Check>
</Nix>
<MaxSegment>
<Status>V</Status>
<Input>
<Name>
<First>jack</First>
<Last>smiths</Last>
</Name>
<Address>
<StreetAddress1>100 rodeo dr</StreetAddress1>
<City>long beach</City>
<State>ca</State>
<ZipCode>90802</ZipCode>
</Address>
<DriverLicense>
<Number>123456789</Number>
<State>ca</State>
</DriverLicense>
<Contact>
<Email>x#me.com</Email>
<Phones>
<Home>0000000000</Home>
<Work>1111111111</Work>
</Phones>
</Contact>
</Input>
<Type>Regular</Type>
</MaxSegment>
</Detail>
</Response>
</Total>
What I am trying to do is extract these values into the nice and clean table below:
Here's my code so far, but I couldn't figure out how to get the sub-children:
import os
os.chdir('d:/py/xml/')
import xml.etree.ElementTree as ET
tree = ET.parse('xxml.xml')
root=tree.getroot()
x = root.tag
y = root.attrib
print(x,y)
#---PRINT ALL NODES---
for child in root:
    print(child.tag, child.attrib)
Thank you in advance !
You could create a dictionary that maps the column names to xpath expressions that extract corresponding values e.g.:
xpath = {
    "ID": "/Total/ID/text()",
    "Check": "/Total/Response/Detail/Nix/Check/text()", # or "//Check/text()"
}
To populate the table row:
row = {name: tree.xpath(path) for name, path in xpath.items()}
The above assumes that you use lxml, which supports the full XPath syntax. ElementTree supports only a subset of XPath expressions, but it might be enough in your case (you could remove the "text()" expression and use el.text instead), e.g.:
xpath = {
    "ID": ".//ID",
    "Check": ".//Check",
}
row = {name: tree.findtext(path) for name, path in xpath.items()}
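A runnable sketch of the ElementTree variant, assuming a trimmed version of the XML from the question. The column names in `COLUMNS` are made up for illustration; the paths are qualified with their parent tag because State occurs under both Address and DriverLicense.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the sample document.
xml_text = """
<Total>
  <ID>999</ID>
  <Response><Detail>
    <Nix><Check>pass</Check></Nix>
    <MaxSegment>
      <Input>
        <Name><First>jack</First><Last>smiths</Last></Name>
        <Address><State>ca</State></Address>
        <DriverLicense><State>ca</State></DriverLicense>
        <Contact><Phones><Home>0000000000</Home></Phones></Contact>
      </Input>
    </MaxSegment>
  </Detail></Response>
</Total>
"""

# Hypothetical column name -> path (ElementTree's XPath subset).
COLUMNS = {
    "ID": "ID",
    "Check": ".//Nix/Check",
    "FirstName": ".//Name/First",
    "AddressState": ".//Address/State",
    "LicenseState": ".//DriverLicense/State",
    "HomePhone": ".//Phones/Home",
    "Fax": ".//Phones/Fax",  # not present in the document
}

root = ET.fromstring(xml_text)
# findtext returns the default for paths that match nothing.
row = {name: root.findtext(path, default="") for name, path in COLUMNS.items()}
print(row)
```

A list of such dictionaries, one per document, can be written out with `csv.DictWriter`.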
To print all text with corresponding tag names:
import xml.etree.ElementTree as etree
for _, el in etree.iterparse("xxm.xml"):
    if el.text and len(el) == 0: # leaf element with text
        print(el.tag, el.text)
If column names differ from tag names (as in your case) then the last example is not enough to build the table.
This is how you could traverse the tree and print only the text nodes:
def traverse(node):
    show = True
    for c in node:
        show = False
        traverse(c)
    if show:
        print(node.tag, node.text)
For your example I get the following:
traverse(root)
ID 999
Check pass
Status V
First jack
Last smiths
StreetAddress1 100 rodeo dr
City long beach
State ca
ZipCode 90802
Number 123456789
State ca
Email x#me.com
Home 0000000000
Work 1111111111
Type Regular
Instead of printing out you could store (node.tag, node.text) tuples or store {node.tag: node.text} in a dict.
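One way to collect the pairs instead of printing them is a small generator. This is a sketch with a made-up helper name, `leaf_items`, and a shortened document; a list of tuples is used because a dict would keep only one of the two State values.

```python
import xml.etree.ElementTree as ET

def leaf_items(node):
    """Yield (tag, text) for every element that has no children."""
    if len(node) == 0:
        yield node.tag, node.text
    for child in node:
        yield from leaf_items(child)

# Shortened stand-in for the document in the question.
root = ET.fromstring(
    "<Total><ID>999</ID><Address><State>ca</State></Address>"
    "<DriverLicense><State>ca</State></DriverLicense></Total>"
)
pairs = list(leaf_items(root))
print(pairs)
```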