Parse element's tail with requests-html

I want to parse an HTML document like this with requests-html 0.9.0:
from requests_html import HTML
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = html.find('.data', first=True)
print(data.html)
# <span class="data">important data</span> and some rubbish
print(data.text)
# important data and some rubbish
I need to distinguish the text inside the tag (enclosed by it) from the tag's tail (the text that follows the element up to the next tag). This is the behaviour I initially expected:
data.text == 'important data'
data.tail == ' and some rubbish'
But tail is not defined for Elements. Since requests-html provides access to inner lxml objects, we can try to get it from lxml.etree.Element.tail:
from lxml.etree import tostring
print(tostring(data.lxml))
# b'<html><span class="data">important data</span></html>'
print(data.lxml.tail is None)
# True
There's no tail in the lxml representation! The tag with its inner text is OK, but the tail seems to be stripped away. How do I extract 'and some rubbish'?
Edit: I discovered that full_text provides the inner text only (so much for “full”). This enables a dirty hack of subtracting full_text from text, although I'm not positive it will work if there are any links.
print(data.full_text)
# important data

I'm not sure I've understood your problem, but if you just want to get 'and some rubbish' you can use the code below:
from requests_html import HTML
from lxml.html import fromstring
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = fromstring(html.html)
# or without using requests_html.HTML: data = fromstring('<span><span class="data">important data</span> and some rubbish</span>')
print(data.xpath('//span[span[@class="data"]]/text()')[-1]) # " and some rubbish"
NOTE that data = html.find('.data', first=True) returns the <span class="data">important data</span> node, which doesn't contain " and some rubbish": that text is a child text node of the parent span!
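To illustrate the point about where the trailing text lives, here is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption made only so the example runs without third-party packages; lxml uses the same text/tail model). When the inner span is reached through its parent tree, the tail is still attached to it:

```python
import xml.etree.ElementTree as ET

# Parse the whole fragment so the inner span keeps its place in the parent tree.
root = ET.fromstring(
    '<span><span class="data">important data</span> and some rubbish</span>'
)

# Locate the inner span via the parent; ElementTree supports this XPath subset.
inner = root.find("span[@class='data']")

inner_text = inner.text   # text enclosed by the element
tail_text = inner.tail    # text after the closing tag, up to the next tag

print(repr(inner_text))
print(repr(tail_text))
```

The tail disappears only when the element is re-parsed or copied on its own, because a standalone element has no following text to carry.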

The tail property exists on objects of type 'lxml.html.HtmlElement'.
What you are asking for is easy to implement.
Here is a simple example using requests_html and lxml:
from requests_html import HTML
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = html.find('span')
print(data[0].text)  # important data and some rubbish
print(data[-1].text)  # important data
print(data[-1].element.tail)  # and some rubbish
The element attribute points to the 'lxml.html.HtmlElement' object.
Hope this helps.

Related

Extract xml text when elements in between text

I have this xml file:
<do title='Example document' date='today'>
<db descr='First level'>
<P>
Some text here that
<af d='reference 1'>continues</af>
but then has some more stuff.
</P>
</db>
and I need to parse it to extract its text. I am using xml.etree.ElementTree for this (see documentation).
This is the simple code I use to parse and explore the file:
import xml.etree.ElementTree as ET
tree = ET.parse(file_path)
root = tree.getroot()
def explore_element(element):
    print(element.tag)
    print(element.attrib)
    print(element.text)
    for child in element:
        explore_element(child)
explore_element(root)
Things work as expected, except that element <P> does not have the complete text. In particular, I seem to be missing "but then has some more stuff" (the text in <P> that comes after the <af> element).
The xml file is a given, so I cannot improve it, even if there is a better recommended way to write it (and there are too many to try to fix manually).
Is there a way I can get all the text?
The output that my code produces (in case it helps) is this:
do
{'title': 'Example document', 'date': 'today'}
db
{'descr': 'First level'}
P
{}
Some text here that
af
{'d': 'reference 1'}
continues
EDIT:
The accepted answer made me realize I had not read the documentation as closely as I should. People with related problems may also find .tail useful.
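As a rough sketch of how .tail fits into the recursive walk from the question: each child's tail belongs to the parent's text flow, so it has to be read right after the child has been visited. The helper name collect_text is hypothetical, and the sample document is closed with </do> here so that it parses (the excerpt in the question is truncated).

```python
import xml.etree.ElementTree as ET

# Sample from the question, with the closing </do> added so it is well-formed.
XML = """<do title='Example document' date='today'>
<db descr='First level'>
<P>
Some text here that
<af d='reference 1'>continues</af>
but then has some more stuff.
</P>
</db>
</do>"""


def collect_text(element):
    """Return non-empty text pieces of `element` in document order."""
    pieces = []
    if element.text and element.text.strip():
        pieces.append(element.text.strip())
    for child in element:
        # Text inside the child (recursively)...
        pieces.extend(collect_text(child))
        # ...then the text that follows the child's closing tag.
        if child.tail and child.tail.strip():
            pieces.append(child.tail.strip())
    return pieces


root = ET.fromstring(XML)
pieces = collect_text(root.find("db/P"))
print(pieces)
```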

Using BeautifulSoup:
list_test.xml:
<do title='Example document' date='today'>
<db descr='First level'>
<P>
Some text here that
<af d='reference 1'>continues</af>
but then has some more stuff.
</P>
</db>
and then:
from bs4 import BeautifulSoup
with open('list_test.xml','r') as f:
    soup = BeautifulSoup(f.read(), "html.parser")
    for line in soup.find_all('p'):
        print(line.text)
OUTPUT:
Some text here that
continues
but then has some more stuff.
EDIT:
Using ElementTree:
import xml.etree.ElementTree as ET
xml = '<p> Some text here that <af d="reference 1">continues</af> but then has some more stuff.</p>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))
OUTPUT:
Some text here that continues but then has some more stuff.
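With the multi-line <P> element from the original file, itertext() also yields the line breaks between the pieces. A small sketch, assuming single-space normalisation is acceptable for the use case, collapses them with split() and join():

```python
import xml.etree.ElementTree as ET

# The <P> element as it appears in the question's file, line breaks included.
xml = """<P>
Some text here that
<af d='reference 1'>continues</af>
but then has some more stuff.
</P>"""

p = ET.fromstring(xml)

# itertext() walks text and tail of every descendant in document order.
raw = ''.join(p.itertext())

# Collapse every run of whitespace (including newlines) into one space.
normalized = ' '.join(raw.split())
print(normalized)
```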

Adding a blank space in an XML attrib with lxml in Python

from lxml import etree
html = etree.Element("html")
body = etree.SubElement(html, "body")
body.text = "TEXT"
body.set("p style", "color:red")
print(etree.tostring(html))
Gives me the error: ValueError: Invalid attribute name u'p style'

You can't have an attribute name with a space in it in XML, which is what lxml and etree are for. The XML specification defines what a valid attribute name is.
If you are trying to achieve this:
<html><body p style="color:red">TEXT</body></html>
You can't do that in XML. You can do something similar in HTML: empty attributes. See the HTML5 specification for details. But you wouldn't use the kind of code written above to get that result.
If you are trying to get the following result (which seems more likely):
<html><body><p style="color:red">TEXT</p></body></html>
Then it is very easy.
from lxml import etree
html = etree.Element("html")
body = etree.SubElement(html, "body")
p = etree.SubElement(body, "p")
p.text = "TEXT"
p.set("style", "color:red")
print(etree.tostring(html))
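For comparison, the same nested structure can be sketched with the standard library's xml.etree.ElementTree (swapped in here for lxml so the example has no third-party dependency). The attribute is passed as a dictionary to SubElement, and the serialised result is shown as a string:

```python
import xml.etree.ElementTree as ET

html = ET.Element("html")
body = ET.SubElement(html, "body")

# "p" is the tag name; "style" is an ordinary attribute on that tag.
p = ET.SubElement(body, "p", {"style": "color:red"})
p.text = "TEXT"

markup = ET.tostring(html, encoding="unicode")
print(markup)
```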

Adding html tags to text of XML.ElementTree Elements in Python

I am trying to use a Python script to generate an HTML document with text from a data table using the xml.etree.ElementTree module. I would like to format some of the cells to include HTML tags, typically either <br /> or <sup></sup> tags. When I generate a string and write it to a file, I believe the XML parser is converting these tags to individual characters. The output then shows the tags as text rather than processing them as tags. Here is a trivial example:
import xml.etree.ElementTree as ET
root = ET.Element('html')
#extraneous code removed
td = ET.SubElement(tr, 'td')
td.text = 'This is the first line <br /> and the second'
tree = ET.tostring(root)
out = open('test.html', 'w+')
out.write(tree)
out.close()
When you open the resulting 'test.html' file, it displays the text string exactly as typed: 'This is the first line <br /> and the second'.
The HTML document itself shows the problem in the source. It appears that the parser replaces the "less than" and "greater than" symbols in the tag with the HTML representations of those symbols:
<!--Extraneous code removed-->
<td>This is the first line &lt;br /&gt; and the second</td>
Clearly, my intent is to have the document process the tag itself, not display it as text. I'm not sure if there are different parser options I can pass to get this to work, or if there is a different method I should be using. I am open to using other modules (e.g. lxml) if that will solve the problem. I am mainly using the built-in XML module for convenience.
The only thing I've figured out that works is to modify the final string with re substitutions before I write the file:
tree = ET.tostring(root)
tree = re.sub(r'&lt;','<',tree)
tree = re.sub(r'&gt;','>',tree)
This works, but seems like it should be avoidable by using a different setting in xml. Any suggestions?
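The behaviour described above can be reproduced in a few lines. This sketch only demonstrates the escaping: anything assigned to .text is treated as character data, so the serializer escapes markup characters rather than interpreting them.

```python
import xml.etree.ElementTree as ET

td = ET.Element('td')
# The whole string is character data, including the "<br />" part.
td.text = 'This is the first line <br /> and the second'

# encoding="unicode" returns str instead of bytes.
escaped = ET.tostring(td, encoding='unicode')
print(escaped)
```

Because the escaping happens at serialisation time, there is no parser option that turns it off; the markup has to be added to the tree as elements instead.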

You can use the tail attribute together with td and br to construct exactly the text you want:
import xml.etree.ElementTree as ET
root = ET.Element('html')
table = ET.SubElement(root, 'table')
tr = ET.SubElement(table, 'tr')
td = ET.SubElement(tr, 'td')
td.text = "This is the first line "
# note how to end td tail
td.tail = None
br = ET.SubElement(td, 'br')
# now continue your text with br.tail
br.tail = " and the second"
tree = ET.tostring(root)
# see the string
tree
'<html><table><tr><td>This is the first line <br /> and the second</td></tr></table></html>'
with open('test.html', 'w+') as f:
    f.write(tree)
# and the output html file
cat test.html
<html><table><tr><td>This is the first line <br /> and the second</td></tr></table></html>
As a side note, to include <sup></sup> and append text after it while staying within <td>, using tail has the desired effect too:
...
td.text = "this is first line "
sup = ET.SubElement(td, 'sup')
sup.text = "this is second"
# use tail to continue your text
sup.tail = "well and the last"
print ET.tostring(root)
<html><table><tr><td>this is first line <sup>this is second</sup>well and the last</td></tr></table></html>
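The text/tail pattern above can be wrapped in a small helper. This is a sketch for Python 3, and add_inline is a hypothetical name, not part of ElementTree; it appends a child element and sets its text and tail in one call.

```python
import xml.etree.ElementTree as ET


def add_inline(parent, tag, text=None, tail=None):
    """Append a child to `parent`, setting its inner text and trailing text."""
    child = ET.SubElement(parent, tag)
    child.text = text   # text enclosed by the new element
    child.tail = tail   # text that follows it inside the parent
    return child


td = ET.Element('td')
td.text = 'This is the first line '
add_inline(td, 'br', tail=' and the second ')
add_inline(td, 'sup', text='note', tail=' and the last')

# encoding="unicode" yields a str that can be written to a text-mode file.
markup = ET.tostring(td, encoding='unicode')
print(markup)
```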

How to comment out an XML Element (using minidom DOM implementation)

I would like to comment out a specific XML element in an xml file. I could just remove the element, but I would prefer to leave it commented out, in case it's needed later.
The code I use at the moment that removes the element looks like this:
from xml.dom import minidom
doc = minidom.parse(myXmlFile)
for element in doc.getElementsByTagName('MyElementName'):
    if element.getAttribute('name') in ['AttribName1', 'AttribName2']:
        element.parentNode.removeChild(element)
f = open(myXmlFile, "w")
f.write(doc.toxml())
f.close()
I would like to modify this so that it comments the element out rather than deleting it.

The following solution does exactly what I want.
from xml.dom import minidom
doc = minidom.parse(myXmlFile)
for element in doc.getElementsByTagName('MyElementName'):
    if element.getAttribute('name') in ['AttrName1', 'AttrName2']:
        parentNode = element.parentNode
        parentNode.insertBefore(doc.createComment(element.toxml()), element)
        parentNode.removeChild(element)
f = open(myXmlFile, "w")
f.write(doc.toxml())
f.close()
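A self-contained variant of the same idea, for trying it without a file: the sample document below is made up for illustration, and replaceChild is used in place of the insertBefore/removeChild pair. One caveat to keep in mind is that XML comments cannot contain "--", so an element whose serialised form includes that sequence cannot be commented out this way.

```python
from xml.dom import minidom

# Hypothetical sample document standing in for myXmlFile.
doc = minidom.parseString(
    '<root>'
    '<MyElementName name="AttribName1">x</MyElementName>'
    '<MyElementName name="Other">y</MyElementName>'
    '</root>'
)

# Copy the node list first, since the tree is modified inside the loop.
for element in list(doc.getElementsByTagName('MyElementName')):
    if element.getAttribute('name') in ('AttribName1', 'AttribName2'):
        # Serialise the element and swap it for a comment holding that markup.
        comment = doc.createComment(element.toxml())
        element.parentNode.replaceChild(comment, element)

result = doc.documentElement.toxml()
print(result)
```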

You can do it with BeautifulSoup: read the target tag, create an appropriate comment tag, and replace the target tag with it.
For example, creating comment tag:
from BeautifulSoup import BeautifulSoup
hello = "<!--Comment tag-->"
commentSoup = BeautifulSoup(hello)

Python and ElementTree: return "inner XML" excluding parent element

In Python 2.6 using ElementTree, what's a good way to fetch the XML (as a string) inside a particular element, like what you can do in HTML and javascript with innerHTML?
Here's a simplified sample of the XML node I am starting with:
<label attr="foo" attr2="bar">This is some text <a>and a link</a> in embedded HTML</label>
I'd like to end up with this string:
This is some text <a>and a link</a> in embedded HTML
I've tried iterating over the parent node and concatenating the tostring() of the children, but that gave me only the subnodes:
# returns only subnodes (e.g. <a>and a link</a>)
''.join([et.tostring(sub, encoding="utf-8") for sub in node])
I can hack up a solution using regular expressions, but was hoping there'd be something less hacky than this:
re.sub("</\w+?>\s*?$", "", re.sub("^\s*?<\w*?>", "", et.tostring(node, encoding="utf-8")))

How about:
from xml.etree import ElementTree as ET
xml = '<root>start here<child1>some text<sub1/>here</child1>and<child2>here as well<sub2/><sub3/></child2>end here</root>'
root = ET.fromstring(xml)
def content(tag):
    return tag.text + ''.join(ET.tostring(e) for e in tag)
print content(root)
print content(root.find('child2'))
Resulting in:
start here<child1>some text<sub1 />here</child1>and<child2>here as well<sub2 /><sub3 /></child2>end here
here as well<sub2 /><sub3 />

This is based on the other solutions, but the other solutions did not work in my case (resulted in exceptions) and this one worked:
from xml.etree import ElementTree
from xml.etree.ElementTree import Element
def inner_xml(element: Element):
    return (element.text or '') + ''.join(ElementTree.tostring(e, 'unicode') for e in element)
Use it the same way as in Mark Tolonen's answer.
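For reference, a runnable Python 3 sketch of that usage, reusing the sample document from the earlier answer. It relies on the fact that ElementTree.tostring() includes each child's tail, which is why the text between and after the children shows up in the result.

```python
from xml.etree import ElementTree
from xml.etree.ElementTree import Element


def inner_xml(element: Element) -> str:
    """Return the markup inside `element`, without its own start and end tags."""
    # tostring() on a child also emits that child's tail text.
    return (element.text or '') + ''.join(
        ElementTree.tostring(child, encoding='unicode') for child in element
    )


xml = ('<root>start here<child1>some text<sub1/>here</child1>and'
       '<child2>here as well<sub2/><sub3/></child2>end here</root>')
root = ElementTree.fromstring(xml)

whole = inner_xml(root)
part = inner_xml(root.find('child2'))
print(whole)
print(part)
```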

The following worked for me:
from xml.etree import ElementTree as etree
xml = '<root>start here<child1>some text<sub1/>here</child1>and<child2>here as well<sub2/><sub3/></child2>end here</root>'
dom = etree.XML(xml)
(dom.text or '') + ''.join(map(etree.tostring, dom)) + (dom.tail or '')
# 'start here<child1>some text<sub1 />here</child1>and<child2>here as well<sub2 /><sub3 /></child2>end here'
dom.text or '' is used to get the text at the start of the root element. If there is no text dom.text is None.
Note that the result is not valid XML: a valid XML document has exactly one root element.
Have a look at the ElementTree docs about mixed content.
Using Python 2.6.5, Ubuntu 10.04
