How to get whole text of an Element in xml.minidom? - python

I want to get the whole text of an Element to parse some xhtml:
<div id='asd'>
<pre>skdsk</pre>
</div>
Given E = the div element in the above example, I want to get
<pre>skdsk</pre>
How?

Strictly speaking:
from xml.dom.minidom import parse, parseString
tree = parseString("<div id='asd'><pre>skdsk</pre></div>")
root = tree.firstChild
node = root.childNodes[0]
print(node.toxml())
In practice, though, I'd recommend looking at the http://www.crummy.com/software/BeautifulSoup/ library. Finding the right childNode in an xhtml document and skipping "whitespace nodes" is a pain. BeautifulSoup is a robust html/xhtml parser with fantastic tree-search capabilities.
Edit: The example above compresses the HTML into one string. If you use the HTML as in the question, the line breaks and so forth will generate "whitespace" nodes, so the node you want won't be at childNodes[0].
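For reference, here is a minimal sketch using the modern bs4 package (assuming it is installed, using the built-in html.parser backend), which finds the <pre> element without any whitespace-node bookkeeping:
from bs4 import BeautifulSoup

html = """
<div id='asd'>
<pre>skdsk</pre>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="asd")
print(div.find("pre"))  # -> <pre>skdsk</pre>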

Related

Parse self-closing tags missing the '/'

I'm trying to parse some old SGML code using BeautifulSoup4 and build an Element Tree with the data. It's mostly working fine, but some of the tags that should be self-closing aren't marked as such. For example:
<element1>
<element2 attr="0">
<element3>Data</element3>
</element1>
When I parse the data, it ends up like:
<element1>
<element2 attr="0">
<element3>Data</element3>
</element2>
</element1>
What I'd like is for it to assume that if it doesn't find a closing tag for such elements, it should treat them as self-closing tags instead of assuming that everything after them is a child and placing the closing tag as late as possible, like so:
<element1>
<element2 attr="0"/>
<element3>Data</element3>
</element1>
Can anyone point me to a parser that could do this, or some way to modify an existing one to act this way? I've dug through a few parsers (lxml, lxml-xml, html5lib) but I can't figure out how to get these results.
What I ended up doing was extracting from the DTD all empty elements whose end tag can be omitted (e.g. <!ELEMENT elem_name - o EMPTY >), creating a list of those elements, then using a regex to close all the tags in the list. The resulting text is then passed to the XML parser.
Here's a boiled down version of what I'm doing:
import re
from lxml.html import soupparser
from lxml import etree as ET
empty_tags = ['elem1', 'elem2', 'elem3']
markup = """
<elem1 attr="some value">
<elem2/>
<elem3></elem3>
"""
for t in empty_tags:
    markup = re.sub(r'(<{0}(?:>|\s+[^>/]*))>\s*(?:</{0}>)?\n?'.format(t), r'\1/>\n', markup)
tree = soupparser.fromstring(markup)
print(ET.tostring(tree, pretty_print=True).decode("utf-8"))
The output should be:
<elem1 attr="some value"/>
<elem2/>
<elem3/>
(The output will actually be enclosed in wrapper tags, which the parser adds.)
It will leave attributes alone, and won't touch tags that are already self-closed. If the tag has a closing tag, but is empty, it will remove the closing tag and self-close the tag instead, just so it's standardized.
It's not a very generic solution but, as far as I can tell, there's no other way to do this without knowing which tags should be closed. Even OpenSP needs the DTD to know which tags it should be closing.
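For the DTD step described above, a hedged sketch of how the list of empty elements might be built; the file name and the exact declaration format are assumptions based on the example declaration:
import re

with open('document.dtd') as f:
    dtd = f.read()

# collect names from declarations like <!ELEMENT elem_name - o EMPTY >
empty_tags = re.findall(r'<!ELEMENT\s+(\S+)\s+-\s+[oO]\s+EMPTY\s*>', dtd)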

LXML: get text in between an element's children

I have a badly structured html template, where my <section> elements contain multiple elements (p, figure, a, etc.), but also raw text in between. How can I access all those snippets of text and edit them in place (what I need is to replace all $$code$$ markers with tags)?
Both section.text and section.tail return empty strings...
Examine the .tail of the complete tag that immediately precedes the text. So, in <section>A<p>B</p>C<p>D</p>E</section>, the .tails of the two <p> elements will contain C and E.
Example:
from lxml import etree
root = etree.fromstring('<root><section>A<p>B</p>C<p>D</p>E</section></root>')
for section_child in root.find('section'):
    section_child.tail = section_child.tail.lower()
print(etree.tounicode(root))
Result:
<root><section>A<p>B</p>c<p>D</p>e</section></root>
I learnt this from the answer to my posted question: Parse XML text in between elements within a root element
from lxml import etree
xml = '<a>aaaa1<b>bbbb</b>aaaa2<c>cccc</c>aaaa3</a>'
element = etree.fromstring(xml)
for text in element.xpath('text()'):
    xml = xml.replace(f'>{text}<', f'>{text.upper()}<')
One concern with this approach is CDATA sections in XML, but I would guess that is not an issue for HTML.
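Building on the .text/.tail idea, a minimal sketch of the in-place editing the question asks about; the sample markup and the 'CODE' replacement string are placeholders:
from lxml import etree

root = etree.fromstring('<root><section>intro $$code$$<p>one</p>more $$code$$ text<p>two</p></section></root>')
section = root.find('section')

# text before the first child lives in .text; text after each child lives in that child's .tail
if section.text:
    section.text = section.text.replace('$$code$$', 'CODE')
for child in section:
    if child.tail:
        child.tail = child.tail.replace('$$code$$', 'CODE')

print(etree.tounicode(root))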

How can I go to an exact position in an XML file given its XPath and offset?

I'm using lxml to parse XML files as ElementTree objects. I'm building an annotation application, and I need to reach exact positions in the file.
I have a relative XPath and a startOffset for where the intended text is located. For example, in this piece of markup:
<section role="doc-abstract">
<h1>Abstract</h1>
<p>The creation and use of knowledge graphs for information discovery, question answering, and task completion has exploded in recent years, but their application has often been limited to the most common user scenarios.</p>
</section>
I want to get the part "knowledge graphs for information discovery" with the following XPath ".//section[2]/p[1]", so I can get to that <p> element. Then I have a startOffset variable equal to "26", which means the text starts 26 characters from the beginning of the element.
My question is how can I get to that exact position using lxml?
Assuming your XML is stored in a string, xml_string:
from lxml import etree
# initialize a parser that drops ignorable whitespace
parser = etree.XMLParser(remove_blank_text=True)
# parse the string; root is automatically the root element of the XML
root = etree.XML(xml_string, parser)
node = root.find('.//section[2]/p[1]')
Now you can process this node. You can also find more matching elements, e.g. with root.findall().
For more reference on lxml: https://lxml.de/tutorial.html
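To then apply the startOffset from the question, a minimal sketch, assuming the offset counts characters from the start of the element's text content:
start_offset = 26                    # from the annotation
text = ''.join(node.itertext())      # full text of the matched <p>, including any nested tags
print(text[start_offset:])           # everything from the annotated position onward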

lxml moves text with element

I have an issue with wrapping image with a div.
from lxml.html import fromstring
from lxml import etree
tree = fromstring('<img src="/img.png"/> some text')
div = etree.Element('div')
div.insert(0, tree.find('img'))
tree.insert(0, div)
print(etree.tostring(tree))
<span><div><img src="/img.png"/> some text</div></span>
Why does it add a span, and how can I make it wrap the image without the text?
Because lxml is actually an XML parser. It has some forgiving parsing rules that allow it to parse HTML (the lxml.html part), but internally it will always build a valid tree.
'<img src="/img.png"/> some text' isn't a tree, as it has no single root element; there is an img element and a text node. To be able to store this snippet internally, lxml needs to wrap it in a suitable tag. If you give it a string alone, it will wrap it in a p tag. Earlier versions just wrapped everything in html tags, which could lead to even more confusion.
You could also use html.fragment_fromstring, which doesn't add tags in that case, but would raise an error because the fragment isn't valid.
As for why the text sticks to the img tag: that's how lxml stores text. Take this example:
>>> p = html.fromstring("<p>spam<br />eggs</p>")
>>> br = p.find("br")
>>> p.text
'spam'
>>> br.text # empty
>>> br.tail # this is where text that comes after a tag is stored
'eggs'
So by moving a tag, you also move its tail.
lxml.html is a kinder, gentler XML processor that tries to make sense of invalid XML. The string you passed in is just junk from an XML perspective, but lxml.html wrapped it in a span element to make it valid again. If you don't want lxml.html guesstimating, stick with lxml.etree.fromstring(). That version will reject the string.
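For the original goal, a minimal sketch (building on the same snippet) that moves only the image into the <div> by detaching its tail first:
from lxml.html import fromstring
from lxml import etree

tree = fromstring('<img src="/img.png"/> some text')
img = tree.find('img')

text_after = img.tail   # the text that followed the image
img.tail = None         # detach it so it does not move with the element

div = etree.Element('div')
div.append(img)
tree.insert(0, div)
div.tail = text_after   # reattach the text after the new wrapper

print(etree.tostring(tree))   # the text now sits outside the <div>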

Parse XML in order using Python

I'm trying to parse an XML document. The document has HTML-like formatting embedded, for example:
<p>This is a paragraph
<em>with some <b>extra</b> formatting</em>
scattered throughout.
</p>
So far I've used
import xml.etree.cElementTree as xmlTree
to handle the XML document, but I'm not sure it provides the functionality I'm looking for. How would I go about handling the text nodes here?
Also, is there a way to find the closing tags in a document?
Thanks!
If your XML document fits in memory, you should use Beautiful Soup, which will give you much cleaner access to the document. You'll be able to select a node and interact with its children; every node has a .next attribute, which steps through the text up to the next tag.
So:
>>> b = BeautifulSoup.BeautifulStoneSoup("<p>This is a paragraph <em>with some <b>extra</b> formatting</em> scattered throughout.</p>")
>>> b.find('p')
<p>This is a paragraph <em>with some <b>extra</b> formatting</em> scattered throughout.</p>
>>> b.find('p').next
u'This is a paragraph '
>>> b.find('p').next.next
<em>with some <b>extra</b> formatting</em>
That, or something like it, should solve your problem.
If it doesn't fit in memory, you'll need a streaming (SAX-style) parser, which is a bit more involved. To do that, you use from xml.parsers import expat and write handlers for the opening and closing of tags.
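A minimal expat sketch of that streaming approach (the handler names are mine); it shows that opening tags, closing tags, and the text between them are each reported as separate events:
from xml.parsers import expat

def start_element(name, attrs):
    print('start', name, attrs)

def end_element(name):
    print('end', name)

def char_data(data):
    if data.strip():
        print('text', data.strip())

parser = expat.ParserCreate()
parser.StartElementHandler = start_element
parser.EndElementHandler = end_element
parser.CharacterDataHandler = char_data
parser.Parse('<p>This is a paragraph <em>with some <b>extra</b> formatting</em> scattered throughout.</p>', True)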
