I am using lxml to parse the following XML text block:
<block>{<block_content><argument_list>(<argument><expr><name><name>String</name><operator>.</operator><name>class</name></name></expr></argument>, <argument><expr><name><name>Object</name><operator>.</operator><name>class</name></name></expr></argument>)</argument_list></block_content>}</block>
<block>{<block_content><argument_list>(<argument><expr><literal type="string">"Expected exception to be thrown"</literal></expr></argument>)</argument_list></block_content>}</block>
<block>{<block_content></block_content>}</block>
My requirement is to print the following from the above xml snippet:
String.class
Object.class
"Expected exception to be thrown"
Basically, I need to print the text values contained within the argument node of the xml snippet.
Below is the code block that I am using.
from lxml import etree
xml_text = '<unit>' \
'<block>{<block_content><argument_list>(<argument><expr><name><name>String</name><operator>.</operator><name>class</name></name></expr></argument>, <argument><expr><name><name>Object</name><operator>.</operator><name>class</name></name></expr></argument>)</argument_list></block_content>}</block> ' \
'<block>{<block_content><argument_list>(<argument><expr><literal type="string">"Expected exception to be thrown"</literal></expr></argument>)</argument_list></block_content>}</block> ' \
'<block>{<block_content></block_content>}</block>' \
'</unit>'
tree = etree.fromstring(xml_text)
args = tree.xpath('//argument_list/argument')
for i in range(len(args)):
    print('%s. %s' % (i+1, etree.tostring(args[i]).decode("utf-8")))
However, the below output produced by this code does not meet my requirement.
1. <argument><expr><name><name>String</name><operator>.</operator><name>class</name></name></expr></argument>,
2. <argument><expr><name><name>Object</name><operator>.</operator><name>class</name></name></expr></argument>)
3. <argument><expr><literal type="string">"Expected exception to be thrown"</literal></expr></argument>)
I would appreciate it if someone could point out what modifications I need to make to my code.
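One option that leaves the tree unmodified is to join the text nodes under each argument element with itertext(). The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree so that it runs without extra dependencies (lxml elements provide the same itertext() method), and the name arg_texts is my own.

```python
import xml.etree.ElementTree as ET

xml_text = (
    '<unit>'
    '<block>{<block_content><argument_list>('
    '<argument><expr><name><name>String</name><operator>.</operator><name>class</name></name></expr></argument>, '
    '<argument><expr><name><name>Object</name><operator>.</operator><name>class</name></name></expr></argument>'
    ')</argument_list></block_content>}</block> '
    '<block>{<block_content><argument_list>('
    '<argument><expr><literal type="string">"Expected exception to be thrown"</literal></expr></argument>'
    ')</argument_list></block_content>}</block> '
    '<block>{<block_content></block_content>}</block>'
    '</unit>'
)

root = ET.fromstring(xml_text)

# itertext() yields the text inside each <argument> (including nested tags),
# but not the tail text that follows the closing </argument> tag.
arg_texts = [''.join(arg.itertext()) for arg in root.iter('argument')]

for text in arg_texts:
    print(text)
```

Because the tail text (the ", " and ")" after each argument) is not part of the element's own content, it does not show up in the result.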
I found that the strip_tags function gets the job done. Below is the updated code:
for i in range(len(args)):
    etree.strip_tags(args[i], "*")
    print('%s. %s' % (i+1, args[i].text))
Output from the updated code:
String.class
Object.class
"Expected exception to be thrown"
Related
I am trying to develop a simple web scraper of sorts, and keep having issues with the parsing code for the XML file used.
Whenever I run it, it gives me Errno 22, even though the path is valid. Could anyone assist?
try:
    xmlTree = ET.parse('C:\TestWork\RWPlus\test.xml')
    root = xmlTree.getroot()
    returnValue = root[tariffPOS][childPOS].text
    return returnValue
except Exception as error:
    errorMessage = "A " + str(
        error) + " error occurred when trying to read the XML file."
    ErrorReport(errorMessage)
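You need to escape backslashes in Python string literals. Otherwise the \t in '\test.xml' is read as a tab character, so the path you pass is not the one you typed: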
ET.parse('C:\\TestWork\\RWPlus\\test.xml')
Or you can use raw strings (note the r):
ET.parse(r'C:\TestWork\RWPlus\test.xml')
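To see the difference without touching the file system, here is a small sketch that compares the three spellings of the path. The variable names are only for illustration, and the unescaped path is built in two pieces so that the only escape sequence in play is \t.

```python
# '\t' is a recognised escape sequence, so this string contains a tab
# character instead of a backslash followed by the letter t.
unescaped = 'C:\\TestWork\\RWPlus' + '\test.xml'

# Doubling each backslash or using a raw string both keep the path intact.
escaped = 'C:\\TestWork\\RWPlus\\test.xml'
raw = r'C:\TestWork\RWPlus\test.xml'

has_tab = '\t' in unescaped

print(repr(unescaped))  # repr() makes the embedded tab visible
print(repr(escaped))
print(escaped == raw)
```

Printing with repr() is a quick way to check what a path string really contains before handing it to ET.parse.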
I have an XML file in which I previously commented out some elements, and now I want to uncomment them.
I have this structure:
<parent parId="22" attr="Alpha">
<!--<reg regId="1">
<cont>There is some content</cont><cont2 attr1="val">Another content</cont2>
</reg>
--></parent>
<parent parId="23" attr="Alpha">
<reg regId="1">
<cont>There is more content</cont><cont2 attr1="noval">Morecont</cont2>
</reg>
</parent>
<parent parId="24" attr="Alpha">
<!--<reg regId="1">
<cont>There is some content</cont><cont2 attr1="val">Another content</cont2>
</reg>
--></parent>
I would like to uncomment all the comments in the file. Each comment contains a commented-out element, and I want to restore those elements.
I am able to find the comments using XPath. Here is my snippet of code.
def unhide_element():
    path = r'path_to_file\file.xml'
    xml_parser = et.parse(path)
    comments = root.xpath('//comment')
    for c in comments:
        print('Comment: ', c)
        parent_comment = c.getparent()
        parent_comment.replace(c,'')
        tree = et.ElementTree(root)
        tree.write(new_file)
However, the replace is not working as it expects another element.
How can I fix this?
Your code is missing a crucial bit of creating the new XML element from comment text. There are a few other bugs related to the incorrect XPath query, and to saving the output file multiple times inside the loop.
Also, it appears that you are mixing xml.etree with lxml.etree. As per the documentation, the former ignores comments by default when the XML file is parsed, so the best way to go is to use lxml.
After fixing all of the above we get something like this.
import lxml.etree as ET

def unhide_element():
    path = r'test.xml'
    tree = ET.parse(path)
    # comment() selects comment nodes; '//comment' would look for <comment> elements
    for c in tree.xpath('//comment()'):
        print('Comment: ', c)
        parent_comment = c.getparent()
        if parent_comment is None:  # comment sits outside the root element
            continue
        new_elem = ET.XML(c.text)  # this bit creates the new element from comment text
        # swap the comment for the element; use c.addnext(new_elem) to retain the comment
        parent_comment.replace(c, new_elem)
    tree.write(r'new_file.xml')  # save once, after the loop
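On the point about xml.etree ignoring comments, here is a small sketch that shows the default behaviour. It also shows that, on Python 3.8 or newer, the standard library can be asked to keep comments through TreeBuilder(insert_comments=True). The sample string and variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

source = '<parent><!--<reg regId="1"/>--><other/></parent>'

# Default parser: the comment is silently dropped.
default_root = ET.fromstring(source)
default_tags = [child.tag for child in default_root]

# Python 3.8+: a TreeBuilder with insert_comments=True keeps comment nodes.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
keeping_root = ET.fromstring(source, parser=parser)

# Comment nodes are elements whose tag is the ET.Comment factory function.
kept_comments = [child for child in keeping_root if child.tag is ET.Comment]

print(default_tags)
print([c.text for c in kept_comments])
```

Note that xml.etree elements have no getparent(), so the lxml version is still the more convenient one for replacing comments in place.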
Well, since you want to uncomment everything, all you really need to do is remove each "<!--" and "-->":
import re
new_xml = ''.join(re.split('<!--|-->', xml))
Or:
new_xml = xml.replace('<!--', '').replace('-->', '')
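As a quick check of the string-based approach, the sketch below applies it to a shortened version of the sample and then parses the result to confirm that the reg elements are back. The wrapping root element is an assumption on my part, since the question only shows the parent elements.

```python
import re
import xml.etree.ElementTree as ET

# Shortened sample; the <root> wrapper is assumed, not taken from the question.
xml = '''<root>
<parent parId="22" attr="Alpha">
<!--<reg regId="1">
<cont>There is some content</cont><cont2 attr1="val">Another content</cont2>
</reg>
--></parent>
<parent parId="23" attr="Alpha">
<reg regId="1">
<cont>There is more content</cont><cont2 attr1="noval">Morecont</cont2>
</reg>
</parent>
</root>'''

# Drop every comment opener and closer, keeping the text between them.
new_xml = re.sub(r'<!--|-->', '', xml)

# Parsing the result confirms it is still well-formed XML.
root = ET.fromstring(new_xml)
reg_count = len(root.findall('./parent/reg'))
print(reg_count)
```

Keep in mind that this uncomments every comment, including ones that hold plain notes rather than markup, so the output is only well-formed when all comments contain valid XML.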
I have the following simple piece of code to parse a reST file and return the corresponding DOM tree.
from docutils import nodes, utils
from docutils.parsers import rst
def _rst_to_dom(self, txt):
    """Parse reStructuredText and return corresponding DOM tree."""
    document = utils.new_document("Doc")
    document.settings.tab_width = 4
    document.settings.pep_references = 1
    document.settings.rfc_references = 1
    document.settings.raw_enabled = True
    document.settings.file_insertion_enabled = True
    rst.Parser().parse(txt, document)
    return document.asdom()
This works great, but when the parser finds some problem with the input, instead of raising an exception so that my program knows that there is something wrong, it simply prints out an error message to the standard output and returns a tree with what it could do. How can I get it to raise an exception? Or, how can I know that something was amiss?
I'm trying to process an XML file using XPath in Python / lxml.
I can pull out the values at a particular level of the tree using this code:
file_name = input('Enter the file name, including .xml extension: ') # User inputs file name
print('Parsing ' + file_name)
from lxml import etree
parser = etree.XMLParser()
tree = etree.parse(file_name, parser)
r = tree.xpath('/dataimport/programmelist/programme')
print (len(r))
with open(file_name+'.log', 'w', encoding='utf-8') as f:
    for r in tree.xpath('/dataimport/programmelist/programme'):
        progid = (r.get("id"))
        print (progid)
It returns a list of values as expected. I also want to return the value of a 'child' (where it exists), but I can't work out how (I can only get it to work as a separate list, but I need to maintain the link between them).
Note: I will be writing the values out to a log file, but since I haven't been successful in getting everything out that I want, I haven't added the 'write out' code yet.
This is the structure of the XML:
<dataimport dtdversion="1.1">
<programmelist>
<programme id="eid-273168">
<imageref idref="img-1844575"/>
How can I get Python to return the id + idref?
The previous examples I have worked with had namespaces, but this file doesn't.
Since each element returned by xpath() supports xpath() itself, you can call it again, relative to that element, to get the idref list you want (note the @ for attributes):
for r in tree.xpath('/dataimport/programmelist/programme'):
    progid = r.get("id")
    ref_list = r.xpath('imageref/@idref')
    print(progid, ref_list)
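Here is a self-contained sketch of that idea that keeps each id paired with its idref, using None where a programme has no imageref child. The closing tags and the second programme (id "eid-000000") are made up for illustration, since the question only shows the start of the file.

```python
from lxml import etree

# Closing tags and the second <programme> are assumptions for this example.
xml_text = (
    '<dataimport dtdversion="1.1">'
    '<programmelist>'
    '<programme id="eid-273168"><imageref idref="img-1844575"/></programme>'
    '<programme id="eid-000000"/>'
    '</programmelist>'
    '</dataimport>'
)

root = etree.fromstring(xml_text)

pairs = []
for programme in root.xpath('/dataimport/programmelist/programme'):
    # Relative XPath: only imageref children of this programme are searched.
    refs = programme.xpath('imageref/@idref')
    pairs.append((programme.get('id'), refs[0] if refs else None))

for progid, idref in pairs:
    print(progid, idref)
```

Running the second query relative to each programme element is what keeps the link between the id and its idref, instead of producing two unrelated lists.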
I am new to Python and working on a utility that changes an XML file into an HTML. The XML comes from a call to request = urllib2.Request(url), where I generate the custom url earlier in the code, and then set response = urllib2.urlopen(request) and, finally, xml_response = response.read(). This works okay, as far as I can tell.
My trouble is with parsing the response. For starters, here is a partial example of the XML structure I get back:
I tried adapting the slideshow example in the minidom tutorial here to parse my XML (which is ebay search results, by the way): http://docs.python.org/2/library/xml.dom.minidom.html
My code so far looks like this, with try blocks as an attempt to diagnose issues:
doc = minidom.parseString(xml_response)
#Extract relevant information and prepare it for HTML formatting.
try:
    handleDocument(doc)
except:
    print "Failed to handle document!"
def getText(nodelist): #taken straight from slideshow example
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            print "A TEXT NODE!"
            rc.append(node.data)
    return ''.join(rc) #this is a string, right?
def handleDocument(doc):
    outputFile = open("EbaySearchResults.html", "w")
    outputFile.write("<html>\n")
    outputFile.write("<body>\n")
    try:
        items = doc.getElementsByTagName("item")
    except:
        "Failed to get elements by tag name."
    handleItems(items)
    outputFile.write("</html>\n")
    outputFile.write("</body>\n")
def handleItems(items):
    for item in items:
        title = item.getElementsByTagName("title")[0] #there should be only one title
        print "<h2>%s</h2>" % getText(title.childNodes) #this works fine!
        try: #none of these things work!
            outputFile.write("<h2>%s</h2>" % getText(title.childNodes))
            #outputFile.write("<h2>" + getText(title.childNodes) + "</h2>")
            #str = getText(title.childNodes)
            #outputFIle.write(string(str))
            #outputFile.write(getText(title.childNodes))
        except:
            print "FAIL"
I do not understand why the correct title text does print to the console but throws an exception and does not work for the output file. Writing plain strings like this works fine: outputFile.write("<html>\n") What is going on with my string construction? As far as I can tell, the getText method I am using from the minidom example returns a string--which is just the sort of thing you can write to a file..?
If I print the actual stack trace...
...
except:
    print "Exception when trying to write to file:"
    print '-'*60
    traceback.print_exc(file=sys.stdout)
    print '-'*60
    traceback.print_tb(sys.last_traceback)
...
...I will instantly see the problem:
------------------------------------------------------------
Traceback (most recent call last):
File "tohtml.py", line 85, in handleItems
outputFile.write(getText(title.childNodes))
NameError: global name 'outputFile' is not defined
------------------------------------------------------------
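Looks like something has gone out of scope! outputFile is a local variable of handleDocument, so handleItems cannot see it unless it is passed in as an argument. The bare except was hiding the NameError.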
Fellow beginners, take note.
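For anyone who wants to see the fix spelled out, here is a minimal sketch that passes the output file into the helper instead of relying on a name from another function. It is written for Python 3, uses snake_case names of my own choosing, and writes to an in-memory buffer with a made-up two-item sample so that it runs without an eBay response.

```python
import io
from xml.dom import minidom

def get_text(nodelist):
    # Concatenate the data of all direct text nodes.
    return ''.join(n.data for n in nodelist if n.nodeType == n.TEXT_NODE)

def handle_items(items, output_file):
    # output_file is passed in explicitly, so it is in scope here.
    for item in items:
        title = item.getElementsByTagName("title")[0]
        output_file.write("<h2>%s</h2>\n" % get_text(title.childNodes))

def handle_document(doc, output_file):
    output_file.write("<html>\n<body>\n")
    handle_items(doc.getElementsByTagName("item"), output_file)
    output_file.write("</body>\n</html>\n")

# Made-up sample standing in for the real search results.
sample = ("<results><item><title>First</title></item>"
          "<item><title>Second</title></item></results>")

buffer = io.StringIO()
handle_document(minidom.parseString(sample), buffer)
html = buffer.getvalue()
print(html)
```

With a real file, the call would look like handle_document(doc, f) inside a with open("EbaySearchResults.html", "w") as f: block, which also makes sure the file gets closed.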