LXML: get text inbetween elements children - python

I have a badly structured HTML template where my <section> elements contain multiple elements (p, figure, a, etc.) but also raw text in between. How can I access all those snippets of text and edit them in place? (What I need is to replace all $$code$$ with tags.)
Both section.text and section.tail return empty strings...

Examine the .tail of the complete tag that immediately precedes the text. So, in <section>A<p>B</p>C<p>D</p>E</section>, the .tails of the two <p> elements will contain C and E.
Example:
from lxml import etree
root = etree.fromstring('<root><section>A<p>B</p>C<p>D</p>E</section></root>')
for section_child in root.find('section'):
    # .tail is None when no text follows the element
    if section_child.tail:
        section_child.tail = section_child.tail.lower()
print(etree.tounicode(root))
Result:
<root><section>A<p>B</p>c<p>D</p>e</section></root>
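For the original goal of replacing $$code$$ markers with tags, the same .text/.tail bookkeeping can be extended to insert new elements. The following is a minimal sketch rather than a definitive implementation: wrap_markers, PATTERN, and the choice of a <code> tag are hypothetical names picked for illustration. It uses the standard library's xml.etree.ElementTree; Element, .text, .tail, and insert() are used the same way in lxml.etree.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical marker syntax taken from the question: $$something$$
PATTERN = re.compile(r'\$\$(.+?)\$\$')

def wrap_markers(parent, tag='code'):
    """Turn $$x$$ in the direct text of `parent` into <tag>x</tag> children."""

    def expand(text, index):
        # split() with one capture group yields [text, marker, text, marker, ...]
        parts = PATTERN.split(text or '')
        for offset, i in enumerate(range(1, len(parts), 2)):
            new = ET.Element(tag)
            new.text = parts[i]
            new.tail = parts[i + 1]  # text that follows the marker
            parent.insert(index + offset, new)
        return parts[0], len(parts) // 2

    # Text before the first child lives in parent.text
    parent.text, added = expand(parent.text, 0)
    index = added
    # Text after each original child lives in that child's .tail
    for child in list(parent)[added:]:
        index += 1  # position right after `child`
        child.tail, count = expand(child.tail, index)
        index += count

section = ET.fromstring('<section>A $$x$$ B<p>P</p>C $$y$$ D</section>')
wrap_markers(section)
result = ET.tostring(section, encoding='unicode')
print(result)
```

Only the direct text of the given element is processed; text nested inside the child elements is left alone.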

I learnt from the answer in my posted question: Parse XML text in between elements within a root element
from lxml import etree
xml = '<a>aaaa1<b>bbbb</b>aaaa2<c>cccc</c>aaaa3</a>'
element = etree.fromstring(xml)
for text in element.xpath('text()'):
    xml = xml.replace(f'>{text}<', f'>{text.upper()}<')
One concern for this is regarding CDATA in xml, but I would guess this is not an issue for html.
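Note that the string replacement above changes the xml string, not the parsed tree. If the tree itself should be edited, lxml's XPath string results can help: they are "smart strings" that expose getparent(), is_text, and is_tail. A small sketch under that assumption, reusing the same sample document:

```python
from lxml import etree

element = etree.fromstring('<a>aaaa1<b>bbbb</b>aaaa2<c>cccc</c>aaaa3</a>')
for text in element.xpath('text()'):
    owner = text.getparent()  # element whose .text or .tail holds this string
    if text.is_text:
        owner.text = text.upper()
    elif text.is_tail:
        owner.tail = text.upper()

result = etree.tostring(element, encoding='unicode')
print(result)
```

Because each snippet is written back to the element that owns it, identical text elsewhere in the document is not affected.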

Related

Parse self-closing tags missing the '/'

I'm trying to parse some old SGML code using BeautifulSoup4 and build an Element Tree with the data. It's mostly working fine, but some of the tags that should be self-closing aren't marked as such. For example:
<element1>
<element2 attr="0">
<element3>Data</element3>
</element1>
When I parse the data, it ends up like:
<element1>
<element2 attr="0">
<element3>Data</element3>
</element2>
</element1>
What I'd like is for it to assume that if it doesn't find a closing tag for such elements, it should treat it as self-closing tag instead of assuming that everything after it is a child and putting the closing tag as late as possible, like so:
<element1>
<element2 attr="0"/>
<element3>Data</element3>
</element1>
Can anyone point me to a parser that could do this, or some way to modify an existing one to act this way? I've dug through a few parsers (lxml, lxml-xml, html5lib) but I can't figure out how to get these results.
What I ended up doing was extracting from the DTD all empty elements whose end tag can be omitted (e.g. <!ELEMENT elem_name - o EMPTY >), creating a list from those elements, then using regex to close all the tags in the list. The resulting text is then passed to the XML parser.
Here's a boiled down version of what I'm doing:
import re
from lxml.html import soupparser
from lxml import etree as ET
empty_tags = ['elem1', 'elem2', 'elem3']
markup = """
<elem1 attr="some value">
<elem2/>
<elem3></elem3>
"""
for t in empty_tags:
    # the attribute part is optional, so tags without attributes are matched too
    markup = re.sub(r'(<{0}(?:\s+[^>/]*)?)>\s*(?:</{0}>)?\n?'.format(t), r'\1/>\n', markup)
tree = soupparser.fromstring(markup)
print(ET.tostring(tree, pretty_print=True).decode("utf-8"))
The output should be:
<elem1 attr="some value"/>
<elem2/>
<elem3/>
(This will actually be enclosed in <html> tags, but the parser adds those in.)
It will leave attributes alone, and won't touch tags that are already self-closed. If the tag has a closing tag, but is empty, it will remove the closing tag and self-close the tag instead, just so it's standardized.
It's not a very generic solution but, as far as I can tell, there's no other way to do this without knowing which tags should be closed. Even OpenSP needs the DTD to know which tags it should be closing.
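The DTD extraction step described above could be sketched as follows. This is an illustration under stated assumptions, not the code actually used: EMPTY_DECL and empty_tags_from_dtd are hypothetical names, and the pattern only handles declarations that name a single element in the form <!ELEMENT name - o EMPTY>, not name groups such as (a|b).

```python
import re

# Assumes one element name per declaration: <!ELEMENT name - o EMPTY>
EMPTY_DECL = re.compile(
    r'<!ELEMENT\s+([^\s()|]+)\s+-\s+o\s+EMPTY\s*>', re.IGNORECASE
)

def empty_tags_from_dtd(dtd_text):
    """Return names of EMPTY elements whose end tag may be omitted."""
    return EMPTY_DECL.findall(dtd_text)

sample_dtd = """
<!ELEMENT elem1 - o EMPTY >
<!ELEMENT elem2 - - (#PCDATA) >
<!ELEMENT elem3 - O EMPTY>
"""
empty_tags = empty_tags_from_dtd(sample_dtd)
print(empty_tags)
```

The resulting list can then be fed to the regex loop shown earlier.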

In Python, Parsing Custom XML Tags Without Parsing HTML

I'm new to Python 2.7, and I'm trying to parse an XML file that contains HTML. I want to parse the custom XML tags without parsing any HTML content whatsoever. What's the best way to do this? (If it's helpful, my list of custom XML tags is small, so if there's an XML parser that has an option to only parse specified tags that would probably work fine.)
E.g. I have an XML file that looks like
<myTag1 myAttrib="value">
<myTag2>
<p>My what a lovely day.</p>
</myTag2>
</myTag1>
I'd like to be able to parse apart everything except the HTML, and in particular to extract the value of myTag2 as un-parsed HTML.
EDIT: Here's more information to answer a question below. I had previously tried using ElementTree. This is what happened:
root = ET.fromstring(xmlstring)
root.tag # returns 'myTag1'
root[0].tag # returns 'myTag2'
root[0].text # returns None, but I want it to return the HTML string
The HTML string I want has been parsed and is stored as a tag and text:
root[0][0].tag # returns 'p', but I don't even want root[0][0] to exist
root[0][0].text # returns 'My ... day.'
But really I'd like to be able to do something like this...
root[0].unparsedtext # returns '<p>My ... day.</p>'
SOLUTION:
har07's answer works great. I modified that code slightly to account for an edge case. Here's what I'm implementing:
def _getInner(element):
    if element.text is None:
        textStr = ''
    else:
        textStr = element.text
    return textStr + ''.join(ET.tostring(e) for e in element)
Then if
element = ET.fromstring('<myTag>Let us be <b>gratuitous</b> with tags</myTag>')
the original code will only return the text starting with the first XML-formatted tag, but the modified version will capture the desired text:
''.join(ET.tostring(e) for e in element) # returns '<b>gratuitous</b> with tags'
_getInner(element) # returns 'Let us be <b>gratuitous</b> with tags'
I don't think there is an easy way to modify an XML parser's behavior to ignore some predefined tags. A much easier way is to let the parser parse the XML normally and then write a function that returns the unparsed content of an element, for example:
import xml.etree.ElementTree as ET
def getUnparsedContent(element):
    return ''.join(ET.tostring(e) for e in element)
xmlstring = """<myTag1 myAttrib="value">
<myTag2>
<p>My what a lovely day.</p>
</myTag2>
</myTag1>"""
root = ET.fromstring(xmlstring)
print(getUnparsedContent(root[0]))
output :
<p>My what a lovely day.</p>
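The snippets above target Python 2.7, where ET.tostring() returns a str. On Python 3 it returns bytes by default, so the join fails unless encoding='unicode' is passed. A minimal Python 3 sketch that also keeps the leading text; inner_xml is a hypothetical name:

```python
import xml.etree.ElementTree as ET

def inner_xml(element):
    """Return the content of `element` as markup, including leading text."""
    # encoding='unicode' makes tostring() return str instead of bytes
    return (element.text or '') + ''.join(
        ET.tostring(child, encoding='unicode') for child in element
    )

element = ET.fromstring('<myTag>Let us be <b>gratuitous</b> with tags</myTag>')
content = inner_xml(element)
print(content)
```

tostring() includes each child's tail, which is why the text after </b> is preserved.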
You should be able to implement this through the built-in minidom xml parser.
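from xml.dom import minidom
xmldoc = minidom.parse("document.xml")
rootNode = xmldoc.documentElement  # <myTag1>
# minidom keeps the whitespace between tags as text nodes, so keep elements only
myTag2 = [n for n in rootNode.childNodes if n.nodeType == n.ELEMENT_NODE][0]
firstNode = [n for n in myTag2.childNodes if n.nodeType == n.ELEMENT_NODE][0]
In your example case, firstNode.toxml() would return:
<p>My what a lovely day.</p>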
Note that minidom (and probably any other xml-parsing library you might use) won't recognize HTML by default. This is by design, because XML documents do not have predefined tags.
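You could then use a series of if or try statements to determine whether you have reached an HTML-formatted node while extracting data:
for rowNode in rootNode.childNodes:
    # text nodes have no getElementsByTagName(), so check the node type first
    if rowNode.nodeType == rowNode.ELEMENT_NODE and rowNode.getElementsByTagName("p"):
        # this node contains HTML: extract the value and continue
        html = "".join(child.toxml() for child in rowNode.childNodes)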

lxml moves text with element

I have an issue with wrapping an image in a div.
from lxml.html import fromstring
from lxml import etree
tree = fromstring('<img src="/img.png"/> some text')
div = etree.Element('div')
div.insert(0, tree.find('img'))
tree.insert(0, div)
print etree.tostring(tree)
<span><div><img src="/img.png"/> some text</div></span>
Why does it add a span and how can I make it wrap image without text?
Because lxml is actually an XML parser. It has some forgiving parsing rules that allow it to parse HTML (the lxml.html part), but internally it will always build a valid tree.
'<img src="/img.png"/> some text' isn't a tree, as it has no single root element: there is an img element and a text node. To be able to store this snippet internally, lxml needs to wrap it in a suitable tag. If you give it a string alone, it will wrap it in a p tag. Earlier versions just wrapped everything in html tags, which can lead to even more confusion.
You could also use html.fragment_fromstring, which doesn't add tags in that case, but would raise an error because the fragment isn't valid.
As for why the text sticks to the img tag: that's how lxml stores text. Take this example:
>>> p = html.fromstring("<p>spam<br />eggs</p>")
>>> br = p.find("br")
>>> p.text
'spam'
>>> br.text # empty
>>> br.tail # this is where text that comes after a tag is stored
'eggs'
So by moving a tag, you also move its tail.
lxml.html is a kinder, gentler XML processor that tries to make sense of invalid XML. The string you passed in is just junk from an XML perspective, but lxml.html wrapped it in a span element to make it valid again. If you don't want lxml.html guesstimating, stick with lxml.etree.fromstring(). That version will reject the string.
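To wrap the image without taking the text along, the tail can be detached before the move and re-attached to the new wrapper. The sketch below is an illustration under assumptions: it uses the standard library's xml.etree.ElementTree with an explicit <span> root standing in for the wrapper lxml.html adds. The .text/.tail handling is the same in lxml, except that lxml moves an element when it is inserted elsewhere, so the explicit remove() is not needed there.

```python
import xml.etree.ElementTree as ET

# Stand-in for the tree lxml.html builds: a <span> wrapper around the fragment
tree = ET.fromstring('<span><img src="/img.png"/> some text</span>')
img = tree.find('img')

trailing = img.tail   # ' some text' is stored on the <img> element
img.tail = None       # detach the text so it does not travel with <img>
tree.remove(img)      # ElementTree needs this; lxml moves on insert

div = ET.Element('div')
div.append(img)
div.tail = trailing   # the text now follows the new <div>
tree.insert(0, div)

print(ET.tostring(tree, encoding='unicode'))
```

After this, the div contains only the image and " some text" follows the closing </div>.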

Using Python ElementTree to Extract Text in XML Tag

I have a corpus with tens of thousands of XML files (small sized files) and I'm trying to use Python and extract the text contained in one of the XML tags, for example, everything between the body tags for something like:
<body> sample text here with <bold> nested </bold> tags in this paragraph </body>
and then write a text document that contains this string, and move on down the list of XML files.
I'm using effbot's ElementTree but couldn't find the right commands/syntax to do this. I found a website that uses miniDOM's dom.getElementsByTagName but I'm not sure what the corresponding method is for ElementTree. Any ideas would be greatly appreciated.
A better answer, showing how to actually use XML parsing to do this:
import xml.etree.ElementTree as ET
stringofxml = "<body> sample text here with <bold> nested </bold> tags in this paragraph </body>"
def extractTextFromElement(elementName, stringofxml):
    tree = ET.fromstring(stringofxml)
    # only direct children of the root element are checked
    for child in tree:
        if child.tag == elementName:
            return child.text.strip()
print(extractTextFromElement('bold', stringofxml))
I would just use re:
import re
body_txt = re.match('<body>(.*)</body>',body_txt).groups()[0]
then to remove the inner tags:
body_txt = re.sub('<.*?>','',body_txt)
You shouldn't use regexp when they are not needed, it's true... but there's nothing wrong with using them when they are.
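For the original goal (all text inside body, nested tags included), ElementTree's itertext() collects the .text and .tail pieces in document order without any regex. A minimal sketch; extract_body_text is a hypothetical name, and collapsing whitespace at the end is an assumption about the desired output:

```python
import xml.etree.ElementTree as ET

def extract_body_text(xml_string):
    """Return all text inside the first <body>, with nested tags stripped."""
    root = ET.fromstring(xml_string)
    # iter() also yields the root itself, so this works whether <body>
    # is the root element or nested somewhere below it
    body = next(root.iter('body'), None)
    if body is None:
        return None
    # itertext() walks .text and .tail of every descendant in document order
    text = ''.join(body.itertext())
    return ' '.join(text.split())  # collapse runs of whitespace

sample = "<body> sample text here with <bold> nested </bold> tags in this paragraph </body>"
body_text = extract_body_text(sample)
print(body_text)
```

For a corpus of files, the same function can be called on each file's contents and the result written to a text file.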

How to get whole text of an Element in xml.minidom?

I want to get the whole text of an Element to parse some xhtml:
<div id='asd'>
<pre>skdsk</pre>
</div>
With E being the div element in the above example, I want to get
<pre>skdsk</pre>
How?
Strictly speaking:
from xml.dom.minidom import parse, parseString
tree = parseString("<div id='asd'><pre>skdsk</pre></div>")
root = tree.firstChild
node = root.childNodes[0]
print(node.toxml())
In practice, though, I'd recommend looking at the http://www.crummy.com/software/BeautifulSoup/ library. Finding the right childNode in an xhtml document, and skipping "whitespace nodes", is a pain. BeautifulSoup is a robust html/xhtml parser with fantastic tree-search capabilities.
Edit: The example above compresses the HTML into one string. If you use the HTML as in the question, the line breaks and so forth will generate "whitespace" nodes, so the node you want won't be at childNodes[0].
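If staying with minidom, the whitespace nodes can be handled by filtering on nodeType or by serializing all children and stripping the result. A small sketch with hypothetical helper names (element_children, inner_xml), using the markup from the question with its line breaks:

```python
from xml.dom.minidom import parseString

def element_children(node):
    """Child elements only, skipping whitespace text nodes and comments."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]

def inner_xml(node):
    """Serialize everything inside `node`, without the node's own tags."""
    return ''.join(c.toxml() for c in node.childNodes)

doc = parseString("<div id='asd'>\n<pre>skdsk</pre>\n</div>")
div = doc.documentElement

first = element_children(div)[0].toxml()
whole = inner_xml(div).strip()
print(first)
print(whole)
```

The first approach returns a single child element; the second returns the full content of the div, which matters when it holds more than one child.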
