I have an XML string
xml_str = '<Foo><Bar>burp</Bar></Foo>'
I'm parsing it with xml etree
import xml.etree.ElementTree as ET
root_element = ET.fromstring(xml_str)
This creates an Element object (root_element) with tag, tail, text, and attrib values that I can see while debugging. However, I can't see any child Elements in the debugger, even though I know the children are there because I can access them in a for loop:
for child in root_element:
    # break point here
Is there a way to see all elements at once while debugging? And is this issue because the XML parser is a JIT parser or something?
It sounds like you want to see all of the elements available in the XML document you're parsing.
This will list the root element and all of its descendants:
all_children = list(root_element.iter())
This will produce
[<Element 'Foo' at 0x11b315908>, <Element 'Bar' at 0x11b315c28>]
This output, however, doesn't respect the 'shape' of the XML.
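If you only need the direct children (the ones you couldn't see in the debugger), calling list() on the element itself is enough; iter() walks the element's whole subtree, including the element itself. A quick sketch contrasting the two:

```python
import xml.etree.ElementTree as ET

root_element = ET.fromstring('<Foo><Bar>burp</Bar></Foo>')

# Direct children only, one level down
print(list(root_element))         # [<Element 'Bar' at 0x...>]

# The element itself plus every descendant, depth-first
print(list(root_element.iter()))  # [<Element 'Foo' at 0x...>, <Element 'Bar' at 0x...>]
```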
When I want to parse XML, I find it easier to use ElementTree, but my first experience parsing XML was with BeautifulSoup. I still like the prettify() function.
This code,
from bs4 import BeautifulSoup

pretty = ""
# note: html.parser lowercases tag names, hence "foo" rather than "Foo"
soup = BeautifulSoup(xml_str, 'html.parser')
for value in soup.find_all("foo"):
    pretty += value.prettify()
and then printing the result
print(pretty)
produces this output:
<foo>
<bar>
burp
</bar>
</foo>
You can replace the find_all() with specific elements you might be looking for.
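If you'd rather avoid the BeautifulSoup dependency, ElementTree alone can pretty-print on Python 3.9+ via ET.indent(); a minimal sketch:

```python
import xml.etree.ElementTree as ET

root_element = ET.fromstring('<Foo><Bar>burp</Bar></Foo>')

# indent() mutates the tree in place, inserting whitespace for display
ET.indent(root_element)
print(ET.tostring(root_element, encoding="unicode"))
```

This prints the tree with two-space indentation and, unlike the html.parser route, keeps the original tag casing.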
Related
I am implementing a web scraping program in Python.
Consider the following HTML snippet:
<div>
  <b>
    <i>
      HelloWorld
    </i>
    HiThere
  </b>
</div>
If I wish to use lxml to extract my bold or italicized texts only, I use the following command
tree = etree.fromstring(myhtmlstr, htmlparser)
opp1 = tree.xpath(".//text()[ancestor::b or ancestor::i or ancestor::strong]")
This gives me the correct result, i.e. the result of my opp1 is :
['HelloWorld', 'HiThere']
So far, everything is perfect. However, the real problem arises if I try to query the parents of the tags. As expected, the output of opp1[0].getparent().tag and opp1[0].getparent().getparent().tag are i and b.
The real problem, however, is with the second text result. Ideally, the parent of opp1[1] should be the b tag, yet the outputs of opp1[1].getparent().tag and opp1[1].getparent().getparent().tag are i and b again.
You can verify the same in the following code:
from lxml import etree
htmlstr = """<div><b><i>HelloWorld</i>HiThere</b></div>"""
htmlparser = etree.HTMLParser()
tree = etree.fromstring(htmlstr, htmlparser)
opp1 = tree.xpath(".//text()[ancestor::b or ancestor::i or ancestor::strong]")
print(opp1)
print(opp1[0].getparent(), opp1[0].getparent().getparent())
print(opp1[1].getparent(), opp1[1].getparent().getparent())
Can someone point out why this is the case? What can I do to correct it? I plan to use only lxml for my program, and do not want any solution that uses bs4.
The issue seems to stem from LXML's (and ElementTree's) data model, where an element is roughly "tag, attributes, text, children, tail"; the DOM data model has actual nodes for text too.
If you change your program to do
for x in tree.xpath(".//text()[ancestor::b or ancestor::i or ancestor::strong]"):
    print(x, x.getparent(), "text?", x.is_text, "tail?", x.is_tail)
it will print
HelloWorld <Element i at 0x10aa0ccd0> text? True tail? False
HiThere <Element i at 0x10aa0ccd0> text? False tail? True
i.e. "HiThere" is the tail of the i element, since that's the way the Etree data model represents intermingled text and tags.
The takeaway here (which should probably work for your use case) is to consider .getparent().getparent() as the effective parent of a text result that has is_tail=True.
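That takeaway can be wrapped in a small helper (effective_parent is a name introduced here for illustration, not an lxml API):

```python
from lxml import etree

def effective_parent(smart_string):
    """Return the element that visually contains a text() result.

    For tail text, getparent() is the element the text *follows*,
    so the containing element is one level further up.
    """
    parent = smart_string.getparent()
    return parent.getparent() if smart_string.is_tail else parent

htmlparser = etree.HTMLParser()
tree = etree.fromstring("<div><b><i>HelloWorld</i>HiThere</b></div>", htmlparser)
texts = tree.xpath(".//text()[ancestor::b or ancestor::i or ancestor::strong]")
print([effective_parent(t).tag for t in texts])  # -> ['i', 'b']
```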
I'm trying to parse some old SGML code using BeautifulSoup4 and build an element tree from the data. It's mostly working fine, but some of the tags that should be self-closing aren't marked as such. For example:
<element1>
<element2 attr="0">
<element3>Data</element3>
</element1>
When I parse the data, it ends up like:
<element1>
<element2 attr="0">
<element3>Data</element3>
</element2>
</element1>
What I'd like is for it to assume that if it doesn't find a closing tag for such elements, it should treat it as self-closing tag instead of assuming that everything after it is a child and putting the closing tag as late as possible, like so:
<element1>
<element2 attr="0"/>
<element3>Data</element3>
</element1>
Can anyone point me to a parser that could do this, or some way to modify an existing one to act this way? I've dug through a few parsers (lxml, lxml-xml, html5lib) but I can't figure out how to get these results.
What I ended up doing was extracting, from the DTD, all empty elements whose end tag can be omitted (e.g. <!ELEMENT elem_name - o EMPTY>), creating a list of those elements, then using a regex to close all the tags in the list. The resulting text is then passed to the XML parser.
Here's a boiled down version of what I'm doing:
import re
from lxml.html import soupparser
from lxml import etree as ET
empty_tags = ['elem1', 'elem2', 'elem3']
markup = """
<elem1 attr="some value">
<elem2/>
<elem3></elem3>
"""
# Self-close each known-empty tag; tags that are already self-closed don't match
for t in empty_tags:
    markup = re.sub(r'(<{0}(?:\s+[^>/]*)?)>\s*(?:</{0}>)?\n?'.format(t), r'\1/>\n', markup)
tree = soupparser.fromstring(markup)
print(ET.tostring(tree, pretty_print=True).decode("utf-8"))
The output should be:
<elem1 attr="some value"/>
<elem2/>
<elem3/>
(This will actually be enclosed in tags, but the parser adds those in.)
It will leave attributes alone, and won't touch tags that are already self-closed. If the tag has a closing tag, but is empty, it will remove the closing tag and self-close the tag instead, just so it's standardized.
It's not a very generic solution but, as far as I can tell, there's no other way to do this without knowing which tags should be closed. Even OpenSP needs the DTD to know which tags it should be closing.
I have an 'XML' file (which I do not control) containing two root elements, which I am trying to parse with xml.etree.ElementTree:
<?xml version="1.0"?>
<meta>
... data I do not care about
</meta>
<database>
... data I wish to parse
</database>
Trying to parse the file, I get the error 'junk after document element', which I understand is related to the fact that it isn't valid XML, since XML can have only one root element. I've been reading around for a solution, and while I have found a few posts addressing this issue, they were all different enough, or difficult enough, that as a beginner I could not get my head around them.
As I understand it, the solution would be either to wrap everything in a new root element and parse that, or to somehow ignore/split off the <meta> element and its children. Any guidance on how best to accomplish this would be appreciated.
Beautiful Soup might ease your problem (although it is the lxml inside it that renders this service), but it's a long-term downgrade, for instance when you later want to use XPath.
Stick to ElementTree. It is strict and won't let you parse XML that is not well-formed, and well-formedness requires one root element and nothing else outside of it.
If you manage to parse your XML file, you can be sure it is well-formed. Both of the following options are legitimate:
1) Read the file as a string, remove the declaration, and wrap a pair of root tags around it; then parse from the string. (Clear the string variable after that.) Or you could edit the file itself first.
2) Create a new root element (new_root = ET.Element('new_root')), read the top-level elements from the file, and append them to it.
The second option requires more coding and more maintenance if the file gets changed.
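A minimal sketch of the first option, with the file content inlined as a string here (in practice you would read it from disk):

```python
import xml.etree.ElementTree as ET

# Sample file content with two top-level elements (not well-formed XML)
raw = '''<?xml version="1.0"?>
<meta>
  <somedata>1</somedata>
</meta>
<database>
  <important>100</important>
</database>'''

# Strip the XML declaration, then wrap everything in a synthetic root
body = raw.split("?>", 1)[-1]
root = ET.fromstring("<new_root>" + body + "</new_root>")

print(root.find("database/important").text)  # -> 100
```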
Here is one solution using BeautifulSoup, where data holds the malformed XML. BeautifulSoup will process it like any other document, so you can access both parts:
from bs4 import BeautifulSoup
data = """<?xml version="1.0"?>
<meta>
<somedata>1</somedata>
</meta>
<database>
<important>100</important>
</database>"""
soup = BeautifulSoup(data, 'lxml')
print(soup.database.important.text)
Prints:
100
I'm currently working on parsing XML documents (adding elements, adding attributes, etc.), so I first need to parse the XML before working on it. However, lxml seems to be removing the <?xml ...> declaration. For example,
from lxml import etree
# bytes input: lxml rejects a str that carries an encoding declaration
tree = etree.fromstring(b'<?xml version="1.0" encoding="utf-8"?><dmodule>test</dmodule>', etree.XMLParser())
print(etree.tostring(tree).decode())
will result in
<dmodule>test</dmodule>
Does anyone know why the <?xml ...> element is being removed? I thought encoding tags were valid XML. Thanks for your time.
<?xml ...> is an XML declaration, not strictly an element. It just gives information about the XML document that follows it.
If you need to include it in the output with lxml, there is some info here about the xml_declaration=True flag you can use:
http://lxml.de/api.html#serialisation
etree.tostring(tree, xml_declaration=True)
Does anyone know why the <?xml ...> element is being removed?
XML defaults to version 1.0 in UTF-8, so the document is equivalent if you remove them.
You are parsing some XML to a data structure and then converting that data structure back to XML. You will get a representation of that data structure in XML, but it might not be expressed in the same way (so the prolog can be removed and <foo /> can be exchanged with <foo></foo> and so on).
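Putting those two answers together, a short round-trip sketch with lxml that keeps the declaration in the serialized output:

```python
from lxml import etree

# bytes input, since lxml rejects a str carrying an encoding declaration
tree = etree.fromstring(b'<?xml version="1.0" encoding="utf-8"?><dmodule>test</dmodule>')

# Ask the serializer for a declaration explicitly; the result is bytes
out = etree.tostring(tree, xml_declaration=True, encoding="utf-8")
print(out.decode("utf-8"))
```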
I have an XML document "abc.xml":
I need to write a function replace(name, newvalue) which can replace the value node with tag 'name' with the new value and write it back to the disk. Is this possible in python? How should I do this?
import xml.dom.minidom

filename = 'abc.xml'
doc = xml.dom.minidom.parse(filename)
print(doc.toxml())
c = doc.getElementsByTagName("c")
print(c[0].toxml())
c[0].childNodes[0].nodeValue = 'zip'
print(doc.toxml())

def replace(tagname, newvalue):
    '''doc is global; first occurrence of tagname gets it!'''
    doc.getElementsByTagName(tagname)[0].childNodes[0].nodeValue = newvalue

replace('c', 'zit')
print(doc.toxml())
See minidom primer and API Reference.
# cat abc.xml
<root>
<a>
<c>zap</c>
</a>
<b>
</b>
</root>
Sure it is possible.
The xml.etree.ElementTree module will help you with parsing XML, finding tags and replacing values.
If you know a little bit more about the XML file you want to change, you can probably make the task a bit easier than if you need to write a generic function that will handle any XML file.
If you are already familiar with DOM parsing, there's a xml.dom package to use instead of the ElementTree one.
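To make this concrete, here is a sketch of the requested replace() function using ElementTree, assuming the abc.xml layout shown above (the filename parameter is added here so the function is self-contained, and the write-back to disk is included):

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def replace(filename, tagname, newvalue):
    """Set the text of the first element with the given tag, then write back to disk."""
    tree = ET.parse(filename)
    element = tree.find(f".//{tagname}")
    if element is None:
        raise KeyError(f"no <{tagname}> element in {filename}")
    element.text = newvalue
    tree.write(filename)

# Demo with the abc.xml content from the question
path = os.path.join(tempfile.mkdtemp(), "abc.xml")
with open(path, "w") as f:
    f.write("<root><a><c>zap</c></a><b></b></root>")

replace(path, "c", "zip")
print(ET.parse(path).find(".//c").text)  # zip
```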