I have an XML file I want to extract data from. I tried using Python, but when I run an example script I found online, I can't extract the data I want.
I found a script at: https://medium.com/analytics-vidhya/parsing-xml-files-in-python-d7c136bb9aa5
It is able to extract the data from the given example, but when I try to apply it to my XML, I can't get it to work. Here is my XML file: https://pastebin.com/Q4HTYacM (it is ~200 lines long, which is why I pasted it to Pastebin).
The data I'm interested in can be found in the
<datafield tag="100" ind1="1" ind2=" "> <!--VerfasserIn-->
    <subfield code="a">Ullenboom, Christian</subfield>
    <subfield code="e">VerfasserIn</subfield>
    <subfield code="0">(DE-588)123404738</subfield>
    <subfield code="0">(DE-627)502584122</subfield>
    <subfield code="0">(DE-576)184619254</subfield>
    <subfield code="4">aut</subfield>
</datafield>
field, as well as some others. The problem is that I'm interested in <subfield code="a">Ullenboom, Christian</subfield>, but I can't get it extracted: it seems like root = tree.getroot() only treats the first line as searchable, and I haven't found any way to search for the specific datafields.
Any help is appreciated.
Edit: My Script:
## source: https://medium.com/analytics-vidhya/parsing-xml-files-in-python-d7c136bb9aa5
# import libs
import pandas as pd
import numpy as np
import glob
import xml.etree.cElementTree as et

# parse the file
tree = et.parse(glob.glob('./**/*baselinexml.xml', recursive=True)[0])
root = tree.getroot()

# create lists for values
creator = []
titlebook = []
VerfasserIn = []

# Converting the data
for creator in root.iter('datafield tag="100" '):
    print(creator)
    print("step1")
    creator.append(VerfasserIn)
for titlebook in root.iter('datafield tag="245" ind1="1" ind2="0"'):
    print(titlebook.text)

# creating dataframe
Jobs_df = pd.DataFrame(
    list(zip(creator, titlebook)),
    columns=['creator', 'titlebook'])

# saving as .csv
Jobs_df.to_csv("sample-api1.csv")
I'm fairly new to this kind of programming, so I tried modifying the code from the example.
See [Python.Docs]: xml.etree.ElementTree - The ElementTree XML API; you will find everything you need to know there.
The XML you posted on PasteBin is not complete (and therefore invalid). It's lacking </zs:recordData></zs:record></zs:records></zs:searchRetrieveResponse> at the end.
The presence of namespaces makes things complicated, as the (real) node tags are not the literal strings from the file. You should pay close attention to the namespaces and XPath sections of the above URL.
Here's a variant.
code00.py:
#!/usr/bin/env python

import sys
from xml.etree import ElementTree as ET


def main(*argv):
    doc = ET.parse("./blob.xml")  # Saved (and corrected) the file from PasteBin
    root = doc.getroot()
    namespaces = {  # Manually extracted from the XML file, but there could be code written to do that automatically.
        "zs": "http://www.loc.gov/zing/srw/",
        "": "http://www.loc.gov/MARC21/slim",
    }
    #print(root)
    datafield_nodes_path = "./zs:records/zs:record/zs:recordData/record/datafield"  # XPath
    datafield_attribute_filters = [
        {
            "tag": "100",
            "ind1": "1",
            "ind2": " ",
        },
        {
            "tag": "245",
            "ind1": "1",
            "ind2": "0",
        },
    ]
    #datafield_attribute_filters = []  # Uncomment this line to clear filters (and process each datafield node)
    ret = []
    for datafield_node in root.iterfind(datafield_nodes_path, namespaces=namespaces):
        if datafield_attribute_filters:
            skip_node = True
            for attr_dict in datafield_attribute_filters:
                for k, v in attr_dict.items():
                    if datafield_node.get(k) != v:
                        break
                else:
                    skip_node = False
                    break
            if skip_node:
                continue
        for subfield_node in datafield_node.iterfind("./subfield[@code='a']", namespaces=namespaces):
            ret.append(subfield_node.text)
    print("Results:")
    for i, e in enumerate(ret, start=1):
        print("{:2d}: {:s}".format(i, e))


if __name__ == "__main__":
    print("Python {:s} {:03d}bit on {:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                   64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    rc = main(*sys.argv[1:])
    print("\nDone.")
    sys.exit(rc)
Output:
[cfati#CFATI-5510-0:e:\Work\Dev\StackOverflow\q071724477]> "e:\Work\Dev\VEnvs\py_pc064_03.09_test0\Scripts\python.exe" code00.py
Python 3.9.9 (tags/v3.9.9:ccb0e6a, Nov 15 2021, 18:08:50) [MSC v.1929 64 bit (AMD64)] 064bit on win32
Results:
1: Ullenboom, Christian
2: Java ist auch eine Insel
Done.
In the above example, I extracted the text of every subfield node (whose code attribute has a value of a) which is a child of a datafield node whose attributes match one of the entries in the datafield_attribute_filters list (I added the attribute filtering later, from your script). You can do some more filtering if you need to.
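The namespaces dict above is hand-written, but as its comment notes, it could also be collected automatically: ElementTree's iterparse emits start-ns events with every prefix/URI pair it encounters. A sketch (the inline sample document below is a minimal stand-in for the PasteBin file, not the real data):

```python
from io import StringIO
from xml.etree import ElementTree as ET

def extract_namespaces(source):
    # Collect every prefix -> URI pair the parser sees.
    # A prefix bound more than once keeps its last binding.
    namespaces = {}
    for _, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        namespaces[prefix] = uri
    return namespaces

# Small stand-in document mirroring the structure of the real file:
sample = (
    '<zs:searchRetrieveResponse xmlns:zs="http://www.loc.gov/zing/srw/">'
    '<zs:records><record xmlns="http://www.loc.gov/MARC21/slim"/></zs:records>'
    '</zs:searchRetrieveResponse>'
)
print(extract_namespaces(StringIO(sample)))
```

For a file on disk, pass the path (or an open file object) to iterparse instead of the StringIO wrapper.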
I wrote this code below. In my XML file I have nodes:
Assembly_1, Detail_1, Detail_2, Assembly_2, Detail_3
What I am trying to do is to get the name of the assembly for each detail (Detail_1 and 2 would be in Assembly_1, etc.)
I have a lot of details, more than 200. So this code (function) works, but it takes a lot of time because the XML file is loaded on every call.
How can I make it run faster?
def CorrectAssembly(detail):
    from xml.dom import minidom
    xml_path = r"C:\Users\vblagoje\test_python_s2k\Load_Independent_Results\HSB53111-01-D_2008_v2-Final-Test-Cases_All_1.1.xml"
    mydoc = minidom.parse(xml_path)
    root = mydoc.getElementsByTagName("FEST2000")
    assembly = ""
    for node in root:
        for childNodes in node.childNodes:
            if childNodes.nodeType == childNodes.TEXT_NODE: continue
            if childNodes.nodeName == "ASSEMBLY":
                assembly = childNodes.getAttribute("NAME")
            if childNodes.nodeName == "DETAIL":
                if detail == childNodes.getAttribute("NAME"):
                    break
    return assembly
One solution is to simply read the XML file once before looking up all the details.
Something along these lines:
from xml.dom import minidom

def CorrectAssembly(detail, root):
    assembly = ""
    for node in root:
        for childNodes in node.childNodes:
            if childNodes.nodeType == childNodes.TEXT_NODE: continue
            if childNodes.nodeName == "ASSEMBLY":
                assembly = childNodes.getAttribute("NAME")
            if childNodes.nodeName == "DETAIL":
                if detail == childNodes.getAttribute("NAME"):
                    break
    return assembly

xml_path = r"C:\Users\vblagoje\test_python_s2k\Load_Independent_Results\HSB53111-01-D_2008_v2-Final-Test-Cases_All_1.1.xml"
mydoc = minidom.parse(xml_path)
root = mydoc.getElementsByTagName("FEST2000")

aDetail = "myDetail"
assembly = CorrectAssembly(aDetail, root)
anotherDetail = "myDetail2"
assembly = CorrectAssembly(anotherDetail, root)
# and so on
You still go through (part of) the loaded XML every time you call the function though. Maybe it is beneficial to create a dictionary mapping the assembly to details and then to simply look them up when you need it:
from xml.dom import minidom

# read the xml
xml_path = r"C:\Users\vblagoje\test_python_s2k\Load_Independent_Results\HSB53111-01-D_2008_v2-Final-Test-Cases_All_1.1.xml"
mydoc = minidom.parse(xml_path)
root = mydoc.getElementsByTagName("FEST2000")

detail_assembly_map = {}
# fill the dictionary
for node in root:
    for childNodes in node.childNodes:
        if childNodes.nodeType == childNodes.TEXT_NODE: continue
        if childNodes.nodeName == "ASSEMBLY":
            assembly = childNodes.getAttribute("NAME")
        if childNodes.nodeName == "DETAIL":
            detail_assembly_map[childNodes.getAttribute("NAME")] = assembly

# use it
aDetail = "myDetail"
assembly = detail_assembly_map[aDetail]
From your post it is not really clear what the structure of the XML is, but if the details are children of the assemblies, the mapping could be built differently: iterate first over the assembly nodes and, within each, over its detail children. Then you would not rely on a proper ordering of the elements.
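If the details really are nested under their assemblies, that version could be sketched like this (the FEST2000 document below is a guess at the structure, not taken from the original file):

```python
from xml.dom import minidom

# Inline stand-in for the real file; structure is an assumption.
doc = minidom.parseString(
    "<FEST2000>"
    "<ASSEMBLY NAME='Assembly_1'><DETAIL NAME='Detail_1'/><DETAIL NAME='Detail_2'/></ASSEMBLY>"
    "<ASSEMBLY NAME='Assembly_2'><DETAIL NAME='Detail_3'/></ASSEMBLY>"
    "</FEST2000>"
)

detail_assembly_map = {}
for assembly in doc.getElementsByTagName("ASSEMBLY"):
    assembly_name = assembly.getAttribute("NAME")
    # getElementsByTagName on the assembly node only returns its own DETAIL descendants
    for detail in assembly.getElementsByTagName("DETAIL"):
        detail_assembly_map[detail.getAttribute("NAME")] = assembly_name

print(detail_assembly_map["Detail_2"])  # Assembly_1
```

Because each DETAIL is looked up inside its own ASSEMBLY, the mapping no longer depends on document order.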
This post could be helpful too, depending on the structure of your XML-tree.
Is it possible in python to pretty print the root's attributes?
I used etree to extend the attributes of the child tag and then overwrote the existing file with the new content. However, during the first generation of the XML we were using a template where the attributes of the root tag were listed one per line, and now with etree I can't achieve the same result.
I found similar questions but they were all referring to the tutorial of etree, which I find incomplete.
Hopefully someone has found a solution for this using etree.
EDIT: This is for custom XML so HTML Tidy (which was proposed in the comments), doesn't work for this.
Thanks!
generated_descriptors = list_generated_files(generated_descriptors_folder)
counter = 0
for g in generated_descriptors:
    if counter % 20 == 0:
        print "Extending Descriptor # %s out of %s" % (counter, len(descriptor_attributes))
    with open(generated_descriptors_folder + "\\" + g, 'r+b') as descriptor:
        root = etree.XML(descriptor.read(), parser=parser)
        # Go through every ContextObject to check if the block is mandatory
        for context_object in root.findall('ContextObject'):
            for attribs in descriptor_attributes:
                if attribs['descriptor_name'] == g[:-11] and context_object.attrib['name'] in attribs['attributes']['mandatoryobjects']:
                    context_object.set('allow-null', 'false')
                elif attribs['descriptor_name'] == g[:-11] and context_object.attrib['name'] not in attribs['attributes']['mandatoryobjects']:
                    context_object.set('allow-null', 'true')
        # Sort the ContextObjects based on allow-null and their name
        context_objects = root.findall('ContextObject')
        context_objects_sorted = sorted(context_objects, key=lambda c: (c.attrib['allow-null'], c.attrib['name']))
        root[:] = context_objects_sorted
        # Remove mandatoryobjects from Descriptor attributes and pretty print
        root.attrib.pop("mandatoryobjects", None)
        # paste new line here
        # Convert to string in order to write the enhanced descriptor
        xml = etree.tostring(root, pretty_print=True, encoding="UTF-8", xml_declaration=True)
        # Write the enhanced descriptor
        descriptor.seek(0)  # Set cursor at beginning of the file
        descriptor.truncate(0)  # Make sure that file is empty
        descriptor.write(xml)
        descriptor.close()
    counter += 1
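Since etree serializes all of a tag's attributes on a single line, one crude workaround is to post-process the serialized string before writing it back. A rough sketch with a hypothetical helper (not from the original post); it assumes no XML declaration precedes the root start-tag and no spaces or '>' inside the root's attribute values:

```python
def pretty_root_attributes(xml_text, indent="    "):
    # Re-serialize the root start-tag with one attribute per line.
    # Assumptions: the string starts with the root tag (no <?xml ...?>
    # declaration), and attribute values contain no spaces or '>'.
    end = xml_text.index(">")
    head_parts = xml_text[:end].split(" ")  # ['<Root', 'a="1"', 'b="2"', ...]
    pretty_head = head_parts[0] + "\n" + "\n".join(indent + p for p in head_parts[1:])
    return pretty_head + xml_text[end:]

print(pretty_root_attributes('<Descriptor name="x" allow-null="true"><ContextObject/></Descriptor>'))
```

A proper solution would use a real serializer hook rather than string surgery, but for a fixed template-generated root tag this kind of post-processing can be good enough.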
My code imports a MARC file using MARCReader and compares a string against a list of acceptable answers. If the string from MARC has no match in my list, it gets added to an error list. This worked for years in Python 2.7.4 installations on Windows 7 with no issue. I recently got a Windows 10 machine and installed Python 2.7.10, and now strings with non-standard characters fail that match. The issue is not Python 2.7.10 alone; I've installed every version from 2.7.4 through 2.7.10 on this new machine and get the same problem. A new install of Python 2.7.10 on a Windows 7 machine also has the problem.
I've trimmed out functions that aren't relevant, and I've dramatically trimmed the master list. In this example, "Académie des Sciences" is an existing repository, but "Acadm̌ie des Sciences" now appears in our list of new repositories.
# -*- coding: utf-8 -*-
from aipmarc import get_catdb, get_bibno, parse_date
from phfawstemplate import browsepage #, nutchpage, eadpage, titlespage
from pymarc import MARCReader, marc8_to_unicode
from time import strftime
from umlautsort import alafiling
import urllib2
import sys
import os
import string

def make_newrepos_list(list, fn): # Create list of unexpected repositories found in the MArcout database dump
    output = "These new repositories are not yet included in the master list in phfaws.py. Please add the repository code (in place of ""NEWCODE*""), and the URL (in place of ""TEST""), and then add these lines to phfaws.py. Please keep the list alphabetical. \nYou can find repository codes at http://www.loc.gov/marc/organizations/ \n \n"
    for row in list:
        output = '%s reposmasterlist.append([u"%s", "%s", "%s"])\n' % (output, row[0], row[1], row[2])
    fh = open(fn,'w')
    fh.write(output.encode("utf-8"))
    fh.close()

def main(marcfile):
    reader = MARCReader(file(marcfile))
    '''
    Creating list of preset repository codes.
    '''
    reposmasterlist = [[u"American Institute of Physics", "MdCpAIP", "http://www.aip.org/history/nbl/index.html"]]
    reposmasterlist.append([u"Académie des Sciences", "FrACADEMIE", "http://www.academie-sciences.fr/fr/Transmettre-les-connaissances/inventaires-des-fonds-d-archives-personnelles.html"])
    reposmasterlist.append([u"American Association for the Advancement of Science", "daaas", "http://archives.aaas.org/"])
    newreposcounter = 0
    newrepos = ""
    newreposlist = []
    findingaidcounter = 0
    reposcounter = 0
    for record in reader:
        if record['903']: # Get only records where 903a="PHFAWS"
            phfawsfull = record.get_fields('903')
            for field in phfawsfull:
                phfawsnote = field['a']
                if 'PHFAWS' in phfawsnote:
                    if record['852'] is not None: # Get only records where 852/repository is not blank
                        repository = record.get_fields('852')
                        for field in repository:
                            reposname = field['a']
                            reposname = marc8_to_unicode(reposname) # Convert repository name from MARC file to Unicode
                            reposname = reposname.rstrip('.,')
                            reposcode = None
                            reposurl = None
                            for row in reposmasterlist: # Match field 852 repository against the master list.
                                if row[0] == reposname: # If it's in the master list, use the master list to populate our repository-related fields
                                    reposcode = row[1]
                                    reposurl = row[2]
                            if record['856'] is not None: # Get only records where 856 is not blank and includes "online finding aid"
                                links = record.get_fields('856')
                                for field in links:
                                    linksthree = field['3']
                                    if linksthree is not None and "online finding aid" in linksthree:
                                        if reposcode == None: # If this record's repository wasn't in the master list, add to list of new repositories
                                            newreposcounter += 1
                                            newrepos = '%s %s \n' % (newrepos, reposname)
                                            reposcode = "NEWCODE" + str(newreposcounter)
                                            reposurl = "TEST"
                                            reposmasterlist.append([reposname, reposcode, reposurl])
                                            newreposlist.append([reposname, reposcode, reposurl])
                                        human_url = field['u']
                                    else:
                                        pass
                            else:
                                pass
                    else:
                        pass
                else:
                    pass
        else:
            pass
    # Output list of new repositories
    newreposlist.sort(key = lambda rep: rep[0])
    if newreposcounter != 0:
        status = '%d new repositories found. you must add information on these repositories, then run phfaws.py again. Please see the newly updated rewrepos.txt for details.' % (newreposcounter)
        sys.stderr.write(status)
        make_newrepos_list(newreposlist, 'newrepos.txt')

if __name__ == '__main__':
    try:
        mf = sys.argv[1]
        sys.exit(main(mf))
    except IndexError:
        sys.exit('Usage: %s <marcfile>' % sys.argv[0])
Edit: I've found that simply commenting out the "reposname = marc8_to_unicode(reposname)" line gets me the results I want. I still don't understand why this is, since it was a necessary step before.
This suggests to me that the encoding of strings in your database changed from MARC8 to Unicode. Have you upgraded your cataloging system recently?
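If the data did switch encodings, the records themselves say which one is in use: position 9 of the MARC 21 leader is 'a' for Unicode (UTF-8) records and a blank for legacy MARC-8, so the marc8_to_unicode() call could be made conditional. A sketch (the sample leaders below are illustrative, not from the actual data):

```python
def needs_marc8_conversion(leader):
    # MARC 21 leader position 9 holds the character coding scheme:
    # 'a' means UCS/Unicode (UTF-8), a space means MARC-8.
    return leader[9] != "a"

print(needs_marc8_conversion("00714cam  2200205 a 4500"))  # True - MARC-8, still convert
print(needs_marc8_conversion("00714cam a2200205 a 4500"))  # False - already Unicode
```

In the script above, that check (via record.leader) would guard the reposname = marc8_to_unicode(reposname) line, so records already stored as Unicode are left alone.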
I'm trying to parse the following XML to pull out certain data, then eventually edit the data as needed.
Here is the XML:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<CHECKLIST>
<VULN>
<STIG_DATA>
<VULN_ATTRIBUTE>Vuln_Num</VULN_ATTRIBUTE>
<ATTRIBUTE_DATA>V-38438</ATTRIBUTE_DATA>
</STIG_DATA>
<STIG_DATA>
<VULN_ATTRIBUTE>Rule_Title</VULN_ATTRIBUTE>
<ATTRIBUTE_DATA>More text.</ATTRIBUTE_DATA>
</STIG_DATA>
<STIG_DATA>
<VULN_ATTRIBUTE>Vuln_Discuss</VULN_ATTRIBUTE>
<ATTRIBUTE_DATA>Some text here</ATTRIBUTE_DATA>
</STIG_DATA>
<STIG_DATA>
<VULN_ATTRIBUTE>IA_Controls</VULN_ATTRIBUTE>
<ATTRIBUTE_DATA></ATTRIBUTE_DATA>
</STIG_DATA>
<STIG_DATA>
<VULN_ATTRIBUTE>Rule_Ver</VULN_ATTRIBUTE>
<ATTRIBUTE_DATA>Gen000000</ATTRIBUTE_DATA>
</STIG_DATA>
<STATUS>NotAFinding</STATUS>
<FINDING_DETAILS></FINDING_DETAILS>
<COMMENTS></COMMENTS>
<SEVERITY_OVERRIDE></SEVERITY_OVERRIDE>
<SEVERITY_JUSTIFICATION></SEVERITY_JUSTIFICATION>
</VULN>
</CHECKLIST>
The data I'm looking to pull from this is the STATUS, COMMENTS, and the ATTRIBUTE_DATA directly following the VULN_ATTRIBUTE that matches Rule_Ver. So in this example I should get the following:
Gen000000 NotAFinding None
What I have so far: I can get the Status and Comments easily, but I can't figure out the ATTRIBUTE_DATA portion. I can find the first one (Vuln_Num), and when I tried to add an index I got a "list index out of range" error.
This is where I'm at now:
import xml.etree.ElementTree as ET

doc = ET.parse('test.ckl')
root = doc.getroot()
TagList = doc.findall("./VULN")
for curTag in TagList:
    StatusTag = curTag.find("STATUS")
    CommentTag = curTag.find("COMMENTS")
    DataTag = curTag.find("./STIG_DATA/ATTRIBUTE_DATA")
    print "GEN:[%s] Status:[%s] Comments: %s" % (DataTag.text, StatusTag.text, CommentTag.text)
This gives the following output:
GEN:[V-38438] Status:[NotAFinding] Comments: None
I want:
GEN:[Gen000000] Status:[NotAFinding] Comments: None
So the end goal is to be able to parse hundreds of these and edit the comments field as needed. I don't think the editing part will be that hard once I get the right element.
Logically I see two ways of doing this. Either go to the ATTRIBUTE_DATA[5] and grab the text or find VULN_ATTRIBUTE == Rule_Ver then grab the next ATTRIBUTE_DATA.
I have tried doing this:
DataTag = curTag.find(".//STIG_DATA//ATTRIBUTE_DATA")[5]
and
DataTag[5].text
and both give me IndexError: list index out of range.
I saw lxml had get_element_by_id and xpath, but I can't add modules to this system so it is etree for me.
Thanks in advance.
One can find an element by position, but you've used the incorrect XPath syntax. Either of the following lines should work:
DataTag = curTag.find("./STIG_DATA[5]/ATTRIBUTE_DATA") # Note: 5, not 4
DataTag = curTag.findall("./STIG_DATA/ATTRIBUTE_DATA")[4] # Note: 4, not 5
However, I strongly recommend against using that. There is no guarantee that the Rule_Ver instance of STIG_DATA is always the fifth item.
If you could change to lxml, then this works:
DataTag = curTag.xpath(
'./STIG_DATA/VULN_ATTRIBUTE[text()="Rule_Ver"]/../ATTRIBUTE_DATA')[0]
Since you can't use lxml, you must iterate the STIG_DATA elements by hand, like so:
def GetData(curTag):
    for stig in curTag.findall('STIG_DATA'):
        if stig.find('VULN_ATTRIBUTE').text == 'Rule_Ver':
            return stig.find('ATTRIBUTE_DATA')
Here is a complete program with error checking added to GetData():
import xml.etree.ElementTree as ET

doc = ET.parse('test.ckl')
root = doc.getroot()
TagList = doc.findall("./VULN")

def GetData(curTag):
    for stig in curTag.findall('STIG_DATA'):
        vuln = stig.find('VULN_ATTRIBUTE')
        if vuln is not None and vuln.text == 'Rule_Ver':
            data = stig.find('ATTRIBUTE_DATA')
            return data

for curTag in TagList:
    StatusTag = curTag.find("STATUS")
    CommentTag = curTag.find("COMMENTS")
    DataTag = GetData(curTag)
    print "GEN:[%s] Status:[%s] Comments: %s" % (DataTag.text, StatusTag.text, CommentTag.text)
References:
https://stackoverflow.com/a/10836343/8747
http://lxml.de/xpathxslt.html#xpath
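As a side note, the stdlib's limited XPath support does include the [tag='text'] predicate (available since Python 2.7), which can express the Rule_Ver lookup without a helper function. A minimal sketch against an inline fragment of the checklist:

```python
import xml.etree.ElementTree as ET

vuln = ET.fromstring("""
<VULN>
  <STIG_DATA>
    <VULN_ATTRIBUTE>Vuln_Num</VULN_ATTRIBUTE>
    <ATTRIBUTE_DATA>V-38438</ATTRIBUTE_DATA>
  </STIG_DATA>
  <STIG_DATA>
    <VULN_ATTRIBUTE>Rule_Ver</VULN_ATTRIBUTE>
    <ATTRIBUTE_DATA>Gen000000</ATTRIBUTE_DATA>
  </STIG_DATA>
</VULN>
""")

# Select the STIG_DATA whose VULN_ATTRIBUTE child's text is exactly
# 'Rule_Ver', then step down to its ATTRIBUTE_DATA sibling.
data = vuln.find("STIG_DATA[VULN_ATTRIBUTE='Rule_Ver']/ATTRIBUTE_DATA")
print(data.text)  # Gen000000
```

The predicate matches on the complete text content of the named child, so it is equivalent to the explicit loop in GetData() above.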
OK, I'll be the first to admit it is a path, just not the path I want, and I don't know how to get it.
I'm using Python 3.3 in Eclipse with the PyDev plugin, in both Windows 7 at work and Ubuntu 13.04 at home. I'm new to Python and have limited programming experience.
I'm trying to write a script to take in an XML Lloyd's market insurance message, find all the tags, dump them in a .csv where we can easily update them, and then reimport them to create an updated XML.
I have managed to do all of that, except that when I get all the tags it only gives the tag name and not the tags above it.
<TechAccount Sender="broker" Receiver="insurer">
<UUId>2EF40080-F618-4FF7-833C-A34EA6A57B73</UUId>
<BrokerReference>HOY123/456</BrokerReference>
<ServiceProviderReference>2012080921401A1</ServiceProviderReference>
<CreationDate>2012-08-10</CreationDate>
<AccountTransactionType>premium</AccountTransactionType>
<GroupReference>2012080921401A1</GroupReference>
<ItemsInGroupTotal>
<Count>1</Count>
</ItemsInGroupTotal>
<ServiceProviderGroupReference>8-2012-08-10</ServiceProviderGroupReference>
<ServiceProviderGroupItemsTotal>
<Count>13</Count>
</ServiceProviderGroupItemsTotal>
That is a fragment of the XML. What I want is to find all the tags and their paths. For example, I want Count to show as ItemsInGroupTotal/Count but can only get it as Count.
Here is my code:
xml = etree.parse(fullpath)
print(xml.xpath('.//*'))
all_xpath = xml.xpath('.//*')
every_tag = []
for i in all_xpath:
    single_tag = '%s,%s' % (i.tag, i.text)
    every_tag.append(single_tag)
print(every_tag)
This gives:
'{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}ServiceProviderGroupReference,8-2012-08-10', '{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}ServiceProviderGroupItemsTotal,\n', '{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}Count,13',
As you can see Count is shown as {namespace}Count, 13 and not {namespace}ItemsInGroupTotal/Count, 13
Can anyone point me towards what I need?
Thanks (hope my first post is OK)
Adam
EDIT:
This is my code now:
with open(fullpath, 'rb') as xmlFilepath:
    xmlfile = xmlFilepath.read()

fulltext = '%s' % xmlfile
text = fulltext[2:]
print(text)
xml = etree.fromstring(fulltext)
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
print(every_tag)
But this returns an error:
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
I removed the first two chars as they are b', and then it complained that it didn't start with a tag.
Update:
I have been playing around with this, and if I remove the xis: xxx tags and the namespace stuff at the top, it works as expected. I need to keep the xis tags and be able to identify them as xis tags, so I can't just delete them.
Any help on how I can achieve this?
ElementTree objects have a method getpath(element), which returns a
structural, absolute XPath expression to find that element
Calling getpath on each element in an iter() loop should work for you:
from pprint import pprint
from lxml import etree
text = """
<TechAccount Sender="broker" Receiver="insurer">
<UUId>2EF40080-F618-4FF7-833C-A34EA6A57B73</UUId>
<BrokerReference>HOY123/456</BrokerReference>
<ServiceProviderReference>2012080921401A1</ServiceProviderReference>
<CreationDate>2012-08-10</CreationDate>
<AccountTransactionType>premium</AccountTransactionType>
<GroupReference>2012080921401A1</GroupReference>
<ItemsInGroupTotal>
<Count>1</Count>
</ItemsInGroupTotal>
<ServiceProviderGroupReference>8-2012-08-10</ServiceProviderGroupReference>
<ServiceProviderGroupItemsTotal>
<Count>13</Count>
</ServiceProviderGroupItemsTotal>
</TechAccount>
"""
xml = etree.fromstring(text)
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
pprint(every_tag)
prints:
['/TechAccount, \n',
'/TechAccount/UUId, 2EF40080-F618-4FF7-833C-A34EA6A57B73',
'/TechAccount/BrokerReference, HOY123/456',
'/TechAccount/ServiceProviderReference, 2012080921401A1',
'/TechAccount/CreationDate, 2012-08-10',
'/TechAccount/AccountTransactionType, premium',
'/TechAccount/GroupReference, 2012080921401A1',
'/TechAccount/ItemsInGroupTotal, \n',
'/TechAccount/ItemsInGroupTotal/Count, 1',
'/TechAccount/ServiceProviderGroupReference, 8-2012-08-10',
'/TechAccount/ServiceProviderGroupItemsTotal, \n',
'/TechAccount/ServiceProviderGroupItemsTotal/Count, 13']
UPD:
If your xml data is in the file test.xml, the code would look like:
from pprint import pprint
from lxml import etree
xml = etree.parse('test.xml').getroot()
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
pprint(every_tag)
Hope that helps.
getpath() does indeed return an xpath that's not suited for human consumption. From this xpath, you can build up a more useful one though. Such as with this quick-and-dirty approach:
def human_xpath(element):
    full_xpath = element.getroottree().getpath(element)
    xpath = ''
    human_xpath = ''
    for i, node in enumerate(full_xpath.split('/')[1:]):
        xpath += '/' + node
        element = element.xpath(xpath)[0]
        namespace, tag = element.tag[1:].split('}', 1)
        if element.getparent() is not None:
            nsmap = {'ns': namespace}
            same_name = element.getparent().xpath('./ns:' + tag,
                                                  namespaces=nsmap)
            if len(same_name) > 1:
                tag += '[{}]'.format(same_name.index(element) + 1)
        human_xpath += '/' + tag
    return human_xpath