Python xml parsing etree find element X by position

I'm trying to parse the following xml to pull out certain data then eventually edit the data as needed.
Here is the xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<CHECKLIST>
<VULN>
<STIG_DATA>
<VULN_ATTRIBUTE>Vuln_Num</VULN_ATTRIBUTE>
<ATTRIBUTE_DATA>V-38438</ATTRIBUTE_DATA>
</STIG_DATA>
<STIG_DATA>
<VULN_ATTRIBUTE>Rule_Title</VULN_ATTRIBUTE>
<ATTRIBUTE_DATA>More text.</ATTRIBUTE_DATA>
</STIG_DATA>
<STIG_DATA>
<VULN_ATTRIBUTE>Vuln_Discuss</VULN_ATTRIBUTE>
<ATTRIBUTE_DATA>Some text here</ATTRIBUTE_DATA>
</STIG_DATA>
<STIG_DATA>
<VULN_ATTRIBUTE>IA_Controls</VULN_ATTRIBUTE>
<ATTRIBUTE_DATA></ATTRIBUTE_DATA>
</STIG_DATA>
<STIG_DATA>
<VULN_ATTRIBUTE>Rule_Ver</VULN_ATTRIBUTE>
<ATTRIBUTE_DATA>Gen000000</ATTRIBUTE_DATA>
</STIG_DATA>
<STATUS>NotAFinding</STATUS>
<FINDING_DETAILS></FINDING_DETAILS>
<COMMENTS></COMMENTS>
<SEVERITY_OVERRIDE></SEVERITY_OVERRIDE>
<SEVERITY_JUSTIFICATION></SEVERITY_JUSTIFICATION>
</VULN>
</CHECKLIST>
The data I want to pull out is the STATUS, the COMMENTS, and the ATTRIBUTE_DATA that directly follows the VULN_ATTRIBUTE whose text is Rule_Ver. So for this example I should get the following:
Gen000000 NotAFinding None
I can get STATUS and COMMENTS easily, but I can't figure out the ATTRIBUTE_DATA part. I can find the first one (Vuln_Num), but when I tried to add an index I got a "list index out of range" error.
This is where I'm at now:
import xml.etree.ElementTree as ET
doc = ET.parse('test.ckl')
root = doc.getroot()
TagList = doc.findall("./VULN")
for curTag in TagList:
    StatusTag = curTag.find("STATUS")
    CommentTag = curTag.find("COMMENTS")
    DataTag = curTag.find("./STIG_DATA/ATTRIBUTE_DATA")
    print("GEN:[%s] Status:[%s] Comments: %s" % (DataTag.text, StatusTag.text, CommentTag.text))
This gives the following output:
GEN:[V-38438] Status:[NotAFinding] Comments: None
I want:
GEN:[Gen000000] Status:[NotAFinding] Comments: None
So the end goal is to be able to parse hundreds of these and edit the comments field as needed. I don't think the editing part will be that hard once I get the right element.
Logically I see two ways of doing this: either go to ATTRIBUTE_DATA[5] and grab the text, or find the VULN_ATTRIBUTE equal to Rule_Ver and then grab the next ATTRIBUTE_DATA.
I have tried doing this:
DataTag = curTag.find(".//STIG_DATA//ATTRIBUTE_DATA")[5]
and
DataTag[5].text
and both give me
IndexError: list index out of range
I saw lxml had get_element_by_id and xpath, but I can't add modules to this system so it is etree for me.
Thanks in advance.

You can find an element by position, but your XPath syntax is incorrect: find() returns a single element, so indexing its result with [5] looks for that element's sixth child, which does not exist. Either of the following lines should work:
DataTag = curTag.find("./STIG_DATA[5]/ATTRIBUTE_DATA")  # XPath positions are 1-based: 5, not 4
DataTag = curTag.findall("./STIG_DATA/ATTRIBUTE_DATA")[4]  # Python list indexes are 0-based: 4, not 5
However, I strongly recommend against relying on position. There is no guarantee that the Rule_Ver instance of STIG_DATA is always the fifth item.
If you could change to lxml, then this works:
DataTag = curTag.xpath(
    './STIG_DATA/VULN_ATTRIBUTE[text()="Rule_Ver"]/../ATTRIBUTE_DATA')[0]
Since you can't use lxml, one approach is to iterate over the STIG_DATA elements by hand, like so:
def GetData(curTag):
    for stig in curTag.findall('STIG_DATA'):
        if stig.find('VULN_ATTRIBUTE').text == 'Rule_Ver':
            return stig.find('ATTRIBUTE_DATA')
Here is a complete program with error checking added to GetData():
import xml.etree.ElementTree as ET
doc = ET.parse('test.ckl')
root = doc.getroot()
TagList = doc.findall("./VULN")
def GetData(curTag):
    # Return the ATTRIBUTE_DATA element paired with Rule_Ver, or None if there is none.
    for stig in curTag.findall('STIG_DATA'):
        vuln = stig.find('VULN_ATTRIBUTE')
        if vuln is not None and vuln.text == 'Rule_Ver':
            data = stig.find('ATTRIBUTE_DATA')
            return data
for curTag in TagList:
    StatusTag = curTag.find("STATUS")
    CommentTag = curTag.find("COMMENTS")
    DataTag = GetData(curTag)
    print("GEN:[%s] Status:[%s] Comments: %s" % (DataTag.text, StatusTag.text, CommentTag.text))
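As a possible shortcut that stays within the standard library: as far as I know, the limited XPath support in ElementTree 1.3 (Python 2.7 and later) includes a [tag='text'] predicate, which selects elements that have a child with that tag and exactly that text. The sketch below uses it to pick the Rule_Ver entry without an explicit inner loop. The helper name rule_ver_rows and the trimmed inline sample are assumptions made for illustration, not part of the original code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample modelled on the checklist in the question.
SAMPLE = """<CHECKLIST>
  <VULN>
    <STIG_DATA>
      <VULN_ATTRIBUTE>Vuln_Num</VULN_ATTRIBUTE>
      <ATTRIBUTE_DATA>V-38438</ATTRIBUTE_DATA>
    </STIG_DATA>
    <STIG_DATA>
      <VULN_ATTRIBUTE>Rule_Ver</VULN_ATTRIBUTE>
      <ATTRIBUTE_DATA>Gen000000</ATTRIBUTE_DATA>
    </STIG_DATA>
    <STATUS>NotAFinding</STATUS>
    <COMMENTS></COMMENTS>
  </VULN>
</CHECKLIST>"""


def rule_ver_rows(root):
    """Return (rule_ver, status, comments) for every VULN under root."""
    rows = []
    for vuln in root.findall("./VULN"):
        # Select the STIG_DATA whose VULN_ATTRIBUTE child has the text
        # 'Rule_Ver', then step down to its ATTRIBUTE_DATA child.
        data = vuln.find("./STIG_DATA[VULN_ATTRIBUTE='Rule_Ver']/ATTRIBUTE_DATA")
        rule_ver = data.text if data is not None else None
        rows.append((rule_ver, vuln.findtext("STATUS"), vuln.findtext("COMMENTS")))
    return rows


root = ET.fromstring(SAMPLE)
rows = rule_ver_rows(root)
for rule_ver, status, comments in rows:
    print("GEN:[%s] Status:[%s] Comments: %s" % (rule_ver, status, comments))
```

Note that findtext() returns an empty string, not None, for an element that exists but has no text, so an empty COMMENTS prints as nothing here.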
References:
https://stackoverflow.com/a/10836343/8747
http://lxml.de/xpathxslt.html#xpath

Related

XML data extraction in python

I have an XML file like the following:
<AreaModel>
...
<RecipePhase>
<UniqueName>PHASE1</UniqueName>
...
<NumberOfParameterTags>7</NumberOfParameterTags>
...
<DefaultRecipeParameter>
<Name>PARAM1</Name>
----
</DefaultRecipeParameter>
<DefaultRecipeParameter>
<Name>PARAM2</Name>
----
</DefaultRecipeParameter>
<DefaultRecipeParameter>
<Name>PARAM3</Name>
----
</DefaultRecipeParameter>
</RecipePhase>
<RecipePhase>
....
</RecipePhase>
</AreaModel>
I would like to read this file sequentially and build two lists: one with the text of every UniqueName tag, and a list of lists holding, for each RecipePhase element, the text of every Name tag under it.
For example, I might have 10 RecipePhase elements, each with a UniqueName tag and each containing a different set of DefaultRecipeParameter children.
How can I keep track of when I enter a RecipePhase element and when I leave it during parsing?
I am trying ElementTree but have not been able to find a solution.
cheers,
m

You can use the xml package from the standard library. See my example:
from urllib.request import urlopen
from xml.dom import minidom as dom
def fetchPage(url):
    # Return the raw bytes of the document at url.
    return urlopen(url).read()
def extract(page):
    a = dom.parseString(page)
    item = a.getElementsByTagName('Rate')
    for i in item:
        if i.hasChildNodes():
            print(i.getAttribute('currency') + "-" + i.firstChild.nodeValue)
if __name__ == '__main__':
    page = fetchPage("http://www.bnro.ro/nbrfxrates.xml")
    extract(page)

I partially solved my problem with the following code:
import xml.etree.ElementTree as ET
tree = ET.parse('control_strategies.axml')
root = tree.getroot()
phases = []
for recipephase in root.findall('./RecipePhase/UniqueName'):
    phases.append(recipephase.text)
n_elem = len(phases)
param = [[] for _ in range(n_elem)]
i = 0
for recipephase in root.findall('./RecipePhase'):
    for defparam in recipephase.findall('./DefaultRecipeParameter'):
        for paramname in defparam.findall('./Name'):
            param[i].append(paramname.text)
    i = i + 1
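The same two lists can also be built in a single pass over the RecipePhase elements, which avoids the manual counter. This is only a sketch: the inline sample document and the helper name collect_phases are made up for illustration, and it assumes every RecipePhase has a UniqueName child.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with the same shape as the file in the question.
SAMPLE = """<AreaModel>
  <RecipePhase>
    <UniqueName>PHASE1</UniqueName>
    <DefaultRecipeParameter><Name>PARAM1</Name></DefaultRecipeParameter>
    <DefaultRecipeParameter><Name>PARAM2</Name></DefaultRecipeParameter>
  </RecipePhase>
  <RecipePhase>
    <UniqueName>PHASE2</UniqueName>
    <DefaultRecipeParameter><Name>PARAM3</Name></DefaultRecipeParameter>
  </RecipePhase>
</AreaModel>"""


def collect_phases(root):
    """Return (phase names, per-phase lists of parameter names)."""
    phases, params = [], []
    for phase in root.findall("./RecipePhase"):
        # Entering a RecipePhase: record its name ...
        phases.append(phase.findtext("UniqueName"))
        # ... and gather the Name of each of its DefaultRecipeParameter children.
        params.append([name.text for name in
                       phase.findall("./DefaultRecipeParameter/Name")])
    return phases, params


phases, params = collect_phases(ET.fromstring(SAMPLE))
print(phases)
print(params)
```

Because each inner list is created while its RecipePhase is being visited, phases[i] and params[i] always describe the same element.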

How python lxml iteration handles tag text? [duplicate]

I'd like to write a code snippet that would grab all of the text inside the <content> tag, in lxml, in all three instances below, including the code tags. I've tried tostring(getchildren()) but that would miss the text in between the tags. I didn't have very much luck searching the API for a relevant function. Could you help me out?
<!--1-->
<content>
<div>Text inside tag</div>
</content>
#should return "<div>Text inside tag</div>"
<!--2-->
<content>
Text with no tag
</content>
#should return "Text with no tag"
<!--3-->
<content>
Text outside tag <div>Text inside tag</div>
</content>
#should return "Text outside tag <div>Text inside tag</div>"

Just use the node.itertext() method, as in:
''.join(node.itertext())
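Note that itertext() yields only the text nodes, so the markup the question wants to keep is lost. The sketch below contrasts it with serializing the children. It uses the standard library's xml.etree.ElementTree instead of lxml (an assumption made so that it runs without lxml installed; both offer itertext() and tostring()), and the helper name inner_xml is hypothetical.

```python
import xml.etree.ElementTree as ET


def inner_xml(node):
    """Return node's content, child markup included, without node's own tags."""
    # tostring() on a child also emits that child's tail text, so the text
    # between and after the children is preserved.
    return (node.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in node
    )


node = ET.fromstring(
    "<content>Text outside tag <div>Text inside tag</div></content>"
)
text_only = "".join(node.itertext())   # tags are dropped
with_markup = inner_xml(node)          # tags are kept
print(text_only)
print(with_markup)
```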

Does text_content() do what you need?

Try:
def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    parts = ([node.text] +
             list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
             [node.tail])
    # filter removes possible Nones in texts and tails
    return ''.join(filter(None, parts))
Example:
from lxml import etree
node = etree.fromstring("""<content>
Text outside tag <div>Text <em>inside</em> tag</div>
</content>""")
stringify_children(node)
Produces: '\nText outside tag <div>Text <em>inside</em> tag</div>\n'

A version of albertov's stringify-content that solves the bugs reported by hoju:
def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    return ''.join(
        chunk for chunk in chain(
            (node.text,),
            chain(*((tostring(child, encoding='unicode', with_tail=False), child.tail) for child in node.getchildren())),
            (node.tail,)) if chunk)

The following snippet, which relies on the itertext() generator, is short and efficient; note that it returns only the text, without the tags:
''.join(node.itertext()).strip()

Defining stringify_children this way may be less complicated:
from lxml import etree
def stringify_children(node):
    s = node.text
    if s is None:
        s = ''
    for child in node:
        s += etree.tostring(child, encoding='unicode')
    return s
or in one line
return (node.text if node.text is not None else '') + ''.join((etree.tostring(child, encoding='unicode') for child in node))
Rationale is the same as in this answer: leave the serialization of child nodes to lxml. The tail part of node in this case isn't interesting since it is "behind" the end tag. Note that the encoding argument may be changed according to one's needs.
Another possible solution is to serialize the node itself and afterwards, strip the start and end tag away:
def stringify_children(node):
    s = etree.tostring(node, encoding='unicode', with_tail=False)
    return s[s.index(node.tag) + 1 + len(node.tag): s.rindex(node.tag) - 2]
which is somewhat horrible. This code is correct only if node has no attributes, and I don't think anyone would want to use it even then.

One of the simplest code snippets, that actually worked for me and as per documentation at http://lxml.de/tutorial.html#using-xpath-to-find-text is
etree.tostring(html, method="text")
where html is the node/tag whose complete text you are trying to read. Note that it doesn't get rid of script and style tags, though.

from urllib.request import urlopen
from lxml import etree
url = 'some_url'
# fetch the page
test = urlopen(url)
page = test.read()
# parse the HTML
tree = etree.HTML(page)
# xpath() returns a list of matching elements
table = tree.xpath("xpath_here")
res = etree.tostring(table[0], encoding='unicode')
# res is the HTML code of the table, including the table tag itself
This was doing the job for me.
So you can extract a tag's text with an XPath text() query, and a tag together with its content using tostring():
div = tree.xpath("//div")
div_res = etree.tostring(div[0], encoding='unicode')
text = tree.xpath("//content/text()")
div_3 = tree.xpath("//content")
# strip() and rstrip() remove a set of characters, not a prefix or suffix
div_3_res = etree.tostring(div_3[0], encoding='unicode').strip('<content>').rstrip('</')
Using strip() in this last line is not nice, but it worked for me.

Just a quick enhancement as the answer has been given. If you want to clean the inside text:
clean_string = ' '.join([n.strip() for n in node.itertext()]).strip()

In response to @Richard's comment above, if you patch stringify_children to read:
parts = ([node.text] +
-- list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
++ list(chain(*([tostring(c)] for c in node.getchildren()))) +
[node.tail])
it seems to avoid the duplication he refers to.

I know that this is an old question, but this is a common problem and I have a solution that seems simpler than the ones suggested so far:
import lxml.html
def stringify_children(node):
    """Given an lxml tag, return its contents as a string
    >>> html = "<p><strong>Sample sentence</strong> with tags.</p>"
    >>> node = lxml.html.fragment_fromstring(html)
    >>> stringify_children(node)
    '<strong>Sample sentence</strong> with tags.'
    """
    if node is None or (len(node) == 0 and not getattr(node, 'text', None)):
        return ""
    # Clearing the attributes gives the opening tag a known length (this modifies node).
    node.attrib.clear()
    opening_tag = len(node.tag) + 2
    closing_tag = -(len(node.tag) + 3)
    return lxml.html.tostring(node, encoding='unicode')[opening_tag:closing_tag]
Unlike some of the other answers to this question, this solution preserves all of the tags contained within the node and attacks the problem from a different angle than the other working solutions.

Here is a working solution. We can get the content together with its parent tag and then cut the parent tag from the output.
import re
from lxml import etree
def _tostr_with_tags(parent_element, html_entities=False):
    RE_CUT = r'^<([\w-]+)>(.*)</([\w-]+)>$'
    # tostring() emits ASCII bytes by default, with non-ASCII characters as &#NNN; references
    content_with_parent = etree.tostring(parent_element).decode('ascii')
    def _replace_html_entities(s):
        RE_ENTITY = r'&#(\d+);'
        def repl(m):
            return chr(int(m.group(1)))
        replaced = re.sub(RE_ENTITY, repl, s, flags=re.MULTILINE|re.UNICODE)
        return replaced
    if not html_entities:
        content_with_parent = _replace_html_entities(content_with_parent)
    content_with_parent = content_with_parent.strip()  # remove whitespace at both ends
    start_tag, content_without_parent, end_tag = re.findall(RE_CUT, content_with_parent, flags=re.UNICODE|re.MULTILINE|re.DOTALL)[0]
    if start_tag != end_tag:
        raise Exception('Start tag does not match the end tag while getting content with tags.')
    return content_without_parent
parent_element must be of type Element.
Please note that if you want text content (not HTML entities in the text), leave the html_entities parameter as False.

lxml.html elements have a method for that:
node.text_content()

If this is an a tag, you can try:
node.values()

import re
from lxml import etree
node = etree.fromstring("""
<content>Text before inner tag
<div>Text
<em>inside</em>
tag
</div>
Text after inner tag
</content>""")
print(re.search(r"\A<[^<>]*>(.*)</[^<>]*>\Z", etree.tostring(node, encoding='unicode'), re.DOTALL).group(1))

Python ElementTree

Having trouble with XML config files using ElementTree. I want to have an easy way to find the text of an element regardless of where it is in the XML Tree. From what the documentation says, I should be able to do this with findtext(), but no matter what, I get a return of None. Where am I going wrong here? Everyone was telling me XML is so simple to handle in Python, yet I have had nothing but troubles.
import os
import xml.etree.ElementTree as ET
configFileName = 'file.xml'
def configSet(x):
    if os.path.exists(configFileName):
        tree = ET.parse(configFileName)
        root = tree.getroot()
        return root.findtext(x)
hiTemp = configSet('hiTemp')
print(hiTemp)
and the XML
<configData>
<units>
<temp>F</temp>
</units>
<pins>
<lights>1</lights>
<fan>2</fan>
<co2>3</co2>
</pins>
<events>
<airTemps>
<hiTemp>80</hiTemp>
<lowTemp>72</lowTemp>
<hiTempAlarm>84</hiTempAlarm>
</airTemps>
<CO2>
<co2Hi>1500</co2Hi>
<co2Low>1400</co2Low>
<co2Alarm>600</co2Alarm>
</CO2>
</events>
<settings>
<apikeys>
<prowl>
<apikey>None</apikey>
</prowl>
</apikeys>
</settings>
</configData>
expected result
80
actual result
None

findtext() with a bare tag name only looks at the direct children of the element it is called on, so root.findtext('hiTemp') cannot see an element nested deeper in the tree.
You can either provide a full path or modify your code to check every element:
def configSet(x):
    if os.path.exists(configFileName):
        tree = ET.parse(configFileName)
        root = tree.getroot()
        # iter() replaces getiterator(), which was removed in Python 3.9
        for e in root.iter():
            t = e.findtext(x)
            if t is not None:
                return t
Update 1:
If you want to have all matched text as a list, the code is a bit different.
def configSet(x):
    matches = []
    if os.path.exists(configFileName):
        tree = ET.parse(configFileName)
        root = tree.getroot()
        for e in root.iter():
            t = e.findtext(x)
            if t is not None:
                matches.append(t)
    return matches

You can use xpath to get to your desired element.
return root.find('./events/airTemps/hiTemp').text
There's easy to follow documentation here.
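If the element's position in the tree should not matter, ElementTree paths also accept the .// prefix, which searches all descendants rather than only direct children. A minimal sketch, assuming a trimmed copy of the config from the question and a hypothetical helper name config_value:

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical copy of the config file from the question.
SAMPLE = """<configData>
  <units><temp>F</temp></units>
  <events>
    <airTemps>
      <hiTemp>80</hiTemp>
      <lowTemp>72</lowTemp>
    </airTemps>
  </events>
</configData>"""


def config_value(root, tag):
    """Return the text of the first descendant named tag, or None."""
    # './/tag' matches at any depth below root; a bare 'tag' would only
    # match direct children of root.
    return root.findtext(".//" + tag)


root = ET.fromstring(SAMPLE)
hi_temp = config_value(root, "hiTemp")
print(hi_temp)
```

If the same tag name can appear in more than one branch, this returns only the first match in document order, so a full path is the safer choice in that case.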

Reg adding data to an existing XML in Python

I have to parse an xml file & modify the data in a particular tag using Python. I'm using Element Tree to do this. I'm able to parse & reach the required tag. But I'm not able to modify the value. I'm not sure if Element Tree is okay or if I should use TreeBuilder for this.
As you can see below I just want to replace the Not Executed under Verdict with a string value.
<Procedure>
<PreCondition>PRECONDITION: - ECU in extended diagnostic session (zz = 0x03) </PreCondition>
<PostCondition/>
<ProcedureID>428495</ProcedureID>
<SequenceNumber>2</SequenceNumber>
<CID>-1</CID>
<Verdict Writable="true">NotExecuted</Verdict>
</Procedure>
import xml.etree.ElementTree as etree
X_tree = etree.parse('DIAGNOSTIC SERVER.xml')
X_root = X_tree.getroot()
ATC_Name = X_root.iterfind('TestOrder//TestOrder//TestSuite//')
try:
    while(1):
        temp = ATC_Name.next()
        if temp.tag == 'ProcedureID' and temp.text == str(TestCase_Id[j].text).split('-')[1]:
            ATC_Name.next()
            ATC_Name.next()
            ATC_Name.next().text = 'Pass'  # <-- This is what I want to do
            ATC_Name.close()
            break
except:
    print(sys.exc_info())
I believe my approach is wrong. Kindly guide me with right pointers.
Thanks.

You'd better switch to lxml so that you can use the "unlimited" power of xpath.
The idea is to use the following xpath expression:
//Procedure[ProcedureID/text()="%d"]/Verdict
where %d placeholder is substituted with the appropriate procedure id via string formatting operation.
The xpath expression finds the appropriate Verdict tag which you can set text on:
from lxml import etree
data = """<Procedure>
<PreCondition>PRECONDITION: - ECU in extended diagnostic session (zz = 0x03) </PreCondition>
<PostCondition/>
<ProcedureID>428495</ProcedureID>
<SequenceNumber>2</SequenceNumber>
<CID>-1</CID>
<Verdict Writable="true">NotExecuted</Verdict>
</Procedure>"""
ID = 428495
tree = etree.fromstring(data)
verdict = tree.xpath('//Procedure[ProcedureID/text()="%d"]/Verdict' % ID)[0]
verdict.text = 'test'
print(etree.tostring(tree, encoding='unicode'))
prints:
<Procedure>
<PreCondition>PRECONDITION: - ECU in extended diagnostic session (zz = 0x03) </PreCondition>
<PostCondition/>
<ProcedureID>428495</ProcedureID>
<SequenceNumber>2</SequenceNumber>
<CID>-1</CID>
<Verdict Writable="true">test</Verdict>
</Procedure>

Here is a solution using ElementTree. See Modifying an XML File
import xml.etree.ElementTree as et
tree = et.parse('prison.xml')
root = tree.getroot()
print(root.find('Verdict').text)  # before update
root.find('Verdict').text = 'Executed'
tree.write('prison.xml')
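If the file holds many Procedure elements, the Verdict to change can be picked by its ProcedureID without lxml, because ElementTree's path syntax supports a [tag='text'] predicate. The following is a sketch only: the TestSuite wrapper element, the second procedure, and the function name set_verdict are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two procedures wrapped in an invented TestSuite root.
SAMPLE = """<TestSuite>
  <Procedure>
    <ProcedureID>428494</ProcedureID>
    <Verdict Writable="true">NotExecuted</Verdict>
  </Procedure>
  <Procedure>
    <ProcedureID>428495</ProcedureID>
    <Verdict Writable="true">NotExecuted</Verdict>
  </Procedure>
</TestSuite>"""


def set_verdict(root, procedure_id, verdict):
    """Set the Verdict of the Procedure with this id; return True if found."""
    # Match the Procedure whose ProcedureID child has exactly this text,
    # then step down to its Verdict child.
    node = root.find(".//Procedure[ProcedureID='%s']/Verdict" % procedure_id)
    if node is None:
        return False
    node.text = verdict
    return True


root = ET.fromstring(SAMPLE)
changed = set_verdict(root, 428495, "Pass")
verdicts = [v.text for v in root.iter("Verdict")]
print(changed, verdicts)
```

The .// prefix searches below root only, so this assumes Procedure is not itself the root element. When working from a parsed file, calling write() on the tree afterwards saves the change, as in the answer above.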

try this
import xml.etree.ElementTree as et
root=et.parse(xmldata).getroot()
s=root.find('Verdict')
s.text='Your string'

LXML Xpath does not seem to return full path

OK, I'll be the first to admit it is; it's just not the path I want, and I don't know how to get it.
I'm using Python 3.3 in Eclipse with Pydev plugin in both Windows 7 at work and ubuntu 13.04 at home. I'm new to python and have limited programming experience.
I'm trying to write a script to take in an XML Lloyds market insurance message, find all the tags and dump them in a .csv where we can easily update them and then reimport them to create an updated xml.
I have managed to do all of that except when I get all the tags it only gives the tag name and not the tags above it.
<TechAccount Sender="broker" Receiver="insurer">
<UUId>2EF40080-F618-4FF7-833C-A34EA6A57B73</UUId>
<BrokerReference>HOY123/456</BrokerReference>
<ServiceProviderReference>2012080921401A1</ServiceProviderReference>
<CreationDate>2012-08-10</CreationDate>
<AccountTransactionType>premium</AccountTransactionType>
<GroupReference>2012080921401A1</GroupReference>
<ItemsInGroupTotal>
<Count>1</Count>
</ItemsInGroupTotal>
<ServiceProviderGroupReference>8-2012-08-10</ServiceProviderGroupReference>
<ServiceProviderGroupItemsTotal>
<Count>13</Count>
</ServiceProviderGroupItemsTotal>
That is a fragment of the XML. What I want is to find all the tags and their paths. For example, I want to show Count as ItemsInGroupTotal/Count but can only get it as Count.
Here is my code:
xml = etree.parse(fullpath)
print(xml.xpath('.//*'))
all_xpath = xml.xpath('.//*')
every_tag = []
for i in all_xpath:
    single_tag = '%s,%s' % (i.tag, i.text)
    every_tag.append(single_tag)
print(every_tag)
This gives:
'{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}ServiceProviderGroupReference,8-2012-08-10', '{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}ServiceProviderGroupItemsTotal,\n', '{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}Count,13',
As you can see Count is shown as {namespace}Count, 13 and not {namespace}ItemsInGroupTotal/Count, 13
Can anyone point me towards what I need?
Thanks (hope my first post is OK)
Adam
EDIT:
This is my code now:
with open(fullpath, 'rb') as xmlFilepath:
    xmlfile = xmlFilepath.read()
    fulltext = '%s' % xmlfile
    text = fulltext[2:]
    print(text)
    xml = etree.fromstring(fulltext)
    tree = etree.ElementTree(xml)
    every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
    print(every_tag)
But this returns an error:
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
I remove the first two characters because they are b' and it complained that the text didn't start with a tag.
Update:
I have been playing around with this and if I remove the xis: xxx tags and the namespace stuff at the top it works as expected. I need to keep the xis tags and be able to identify them as xis tags so can't just delete them.
Any help on how I can achieve this?

lxml's ElementTree objects have a method getpath(element), which returns a structural, absolute XPath expression to find that element.
Calling getpath on each element in an iter() loop should work for you:
from pprint import pprint
from lxml import etree
text = """
<TechAccount Sender="broker" Receiver="insurer">
<UUId>2EF40080-F618-4FF7-833C-A34EA6A57B73</UUId>
<BrokerReference>HOY123/456</BrokerReference>
<ServiceProviderReference>2012080921401A1</ServiceProviderReference>
<CreationDate>2012-08-10</CreationDate>
<AccountTransactionType>premium</AccountTransactionType>
<GroupReference>2012080921401A1</GroupReference>
<ItemsInGroupTotal>
<Count>1</Count>
</ItemsInGroupTotal>
<ServiceProviderGroupReference>8-2012-08-10</ServiceProviderGroupReference>
<ServiceProviderGroupItemsTotal>
<Count>13</Count>
</ServiceProviderGroupItemsTotal>
</TechAccount>
"""
xml = etree.fromstring(text)
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
pprint(every_tag)
prints:
['/TechAccount, \n',
'/TechAccount/UUId, 2EF40080-F618-4FF7-833C-A34EA6A57B73',
'/TechAccount/BrokerReference, HOY123/456',
'/TechAccount/ServiceProviderReference, 2012080921401A1',
'/TechAccount/CreationDate, 2012-08-10',
'/TechAccount/AccountTransactionType, premium',
'/TechAccount/GroupReference, 2012080921401A1',
'/TechAccount/ItemsInGroupTotal, \n',
'/TechAccount/ItemsInGroupTotal/Count, 1',
'/TechAccount/ServiceProviderGroupReference, 8-2012-08-10',
'/TechAccount/ServiceProviderGroupItemsTotal, \n',
'/TechAccount/ServiceProviderGroupItemsTotal/Count, 13']
UPD:
If your xml data is in the file test.xml, the code would look like:
from pprint import pprint
from lxml import etree
xml = etree.parse('test.xml').getroot()
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
pprint(every_tag)
Hope that helps.

getpath() does indeed return an XPath that's not suited for human consumption. From this XPath, you can build up a more useful one, though, such as with this quick-and-dirty approach:
def human_xpath(element):
    full_xpath = element.getroottree().getpath(element)
    xpath = ''
    human_xpath = ''
    for i, node in enumerate(full_xpath.split('/')[1:]):
        xpath += '/' + node
        element = element.xpath(xpath)[0]
        namespace, tag = element.tag[1:].split('}', 1)
        if element.getparent() is not None:
            nsmap = {'ns': namespace}
            same_name = element.getparent().xpath('./ns:' + tag,
                                                  namespaces=nsmap)
            if len(same_name) > 1:
                tag += '[{}]'.format(same_name.index(element) + 1)
        human_xpath += '/' + tag
    return human_xpath
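For comparison, readable paths can also be assembled while walking the tree, without calling getpath() at all. The sketch below is an assumption-laden illustration rather than a drop-in replacement: it uses the standard library's xml.etree.ElementTree, the namespace URI and the helper name iter_paths are invented, and it simply drops the namespace part of each tag and adds no [n] position suffix for repeated siblings.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaced sample; the namespace URI is invented.
SAMPLE = """<TechAccount xmlns="http://example.com/ns" Sender="broker">
  <ItemsInGroupTotal><Count>1</Count></ItemsInGroupTotal>
  <ServiceProviderGroupItemsTotal><Count>13</Count></ServiceProviderGroupItemsTotal>
</TechAccount>"""


def iter_paths(element, prefix=""):
    """Yield (path, text) for element and each descendant, in document order."""
    # ElementTree stores namespaced tags as '{uri}local'; keep only 'local'.
    local_name = element.tag.rsplit("}", 1)[-1]
    path = prefix + "/" + local_name
    yield path, (element.text or "").strip()
    for child in element:
        yield from iter_paths(child, path)


every_tag = ["%s, %s" % item for item in iter_paths(ET.fromstring(SAMPLE))]
for line in every_tag:
    print(line)
```

If the prefix has to stay visible (for example to tell xis tags apart), the split-off namespace URI could be mapped back to a prefix instead of being discarded.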
