Extract XML text when elements are in between the text - Python

I have this xml file:
<do title='Example document' date='today'>
<db descr='First level'>
<P>
Some text here that
<af d='reference 1'>continues</af>
but then has some more stuff.
</P>
</db>
</do>
and I need to parse it to extract its text. I am using xml.etree.ElementTree for this (see documentation).
This is the simple code I use to parse and explore the file:
import xml.etree.ElementTree as ET
tree = ET.parse(file_path)
root = tree.getroot()
def explore_element(element):
    print(element.tag)
    print(element.attrib)
    print(element.text)
    for child in element:
        explore_element(child)

explore_element(root)
Things work as expected, except that element <P> does not have the complete text. In particular, I seem to be missing "but then has some more stuff" (the text in <P> that comes after the <af> element).
The xml file is a given, so I cannot improve it, even if there is a better recommended way to write it (and there are too many to try to fix manually).
Is there a way I can get all the text?
The output that my code produces (in case it helps) is this:
do
{'title': 'Example document', 'date': 'today'}
db
{'descr': 'First level'}
P
{}
Some text here that
af
{'d': 'reference 1'}
continues
EDIT:
The accepted answer made me realize I had not read the documentation as closely as I should. People with related problems may also find .tail useful.

Using BeautifulSoup:
list_test.xml:
<do title='Example document' date='today'>
<db descr='First level'>
<P>
Some text here that
<af d='reference 1'>continues</af>
but then has some more stuff.
</P>
</db>
</do>
and then:
from bs4 import BeautifulSoup
with open('list_test.xml', 'r') as f:
    soup = BeautifulSoup(f.read(), "html.parser")
for line in soup.find_all('p'):
    print(line.text)
OUTPUT:
Some text here that
continues
but then has some more stuff.
EDIT:
Using ElementTree:
import xml.etree.ElementTree as ET
xml = '<p> Some text here that <af d="reference 1">continues</af> but then has some more stuff.</p>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))
OUTPUT:
Some text here that continues but then has some more stuff.
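For anyone who wants to see what itertext() is doing, here is a minimal sketch that walks the tree by hand using .text and .tail. The helper name collect_text is made up for illustration, and the sample assumes the root <do> element is closed at the end of the file.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, assuming the root element is closed.
SAMPLE = """<do title='Example document' date='today'>
<db descr='First level'>
<P>
Some text here that
<af d='reference 1'>continues</af>
but then has some more stuff.
</P>
</db>
</do>"""


def collect_text(element):
    """Return the text fragments of element in document order (hypothetical helper)."""
    parts = []
    if element.text:
        parts.append(element.text)      # text before the first child
    for child in element:
        parts.extend(collect_text(child))
        if child.tail:
            parts.append(child.tail)    # text after the child's closing tag
    return parts


root = ET.fromstring(SAMPLE)
paragraph = root.find("./db/P")
# Join the fragments and collapse the line breaks into single spaces.
full_text = " ".join("".join(collect_text(paragraph)).split())
print(full_text)
```

The text after </af> is stored on the <af> element as its tail, not on <P>, which is why printing only element.text misses it.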

Related

XML Python find all paragraphs and extract text doesn't find ALL paragraphs

I want to find all paragraphs, and it seems like it works, but I'm not sure every paragraph is taken. For example in this piece of XML:
<body>
<sec id="s1">
<title>Introduction</title>
<p>Ordered segregation of the genome during cell division requires bipolar attachment to spindle microtubules [<xref ref-type="bibr" rid="pbio-0060207-b001">1</xref>] and maintenance of sister chromatid cohesion until anaphase onset [<xref ref-type="bibr" rid="pbio-0060207-b002">2</xref>]. Cohesin provides a physical link between sister chromatids, and cleavage of cohesin subunits results from separase activation after the spindle
</p> </sec></body>
The code to extract it is:
for filename in os.listdir(words_input_dir):
    if filename.endswith(".xml"):
        tree = ET.parse(filename)
        root = tree.getroot()
        node = root.findall("./body/sec/p")
        for x in node:
            print(x.text)
But it seems like it doesn't get ALL paragraphs. Am I doing something wrong?
I think your code does find all <p> elements, but what you are probably missing is the tail content, that is, the text that follows some </closing tag>. I will use xml.etree.ElementTree to show what your code sees. For parsing HTML, on the other hand, I would recommend Beautiful Soup (bs4) as the better solution, but you have to install it first via pip install beautifulsoup4.
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse("html.xml")
root = tree.getroot()
columns = ["TAG", "ATTRIBUTE", "TEXT", "TAIL"]
data = []
for elem in root.iter():
    # print(elem.tag, elem.attrib, elem.text, elem.tail)
    row = elem.tag, elem.attrib, elem.text, elem.tail  # avoid shadowing the built-in all()
    data.append(row)
df = pd.DataFrame(data, columns=columns)
df.to_csv('out_html.csv')
Output, opened in LibreOffice:
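If the goal is just the complete paragraph text, a short sketch with itertext() may be enough. The sample below is a shortened, made-up version of the XML from the question, wrapped in a hypothetical <article> root so that the ./body/sec/p path matches.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the question's XML; the <article> root is an assumption.
SAMPLE = (
    '<article><body><sec id="s1"><title>Introduction</title>'
    '<p>Bipolar attachment [<xref ref-type="bibr" rid="b001">1</xref>] and '
    'cohesion [<xref ref-type="bibr" rid="b002">2</xref>]. '
    'Cohesin provides a link.</p>'
    '</sec></body></article>'
)

root = ET.fromstring(SAMPLE)
paragraphs = root.findall("./body/sec/p")

# .text stops at the first child element (<xref>).
first_chunks = [p.text for p in paragraphs]

# itertext() also yields the child text and every tail, in document order.
full_texts = ["".join(p.itertext()) for p in paragraphs]

print(first_chunks)
print(full_texts)
```

So every <p> is found, but x.text only returns the part before the first <xref>.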

How to get the whole text of a div with XPath?

My test website is https://www.nature.com/articles/ncomms12114.
I want to get the text from the "Additional Information" part.
My XPath code is:
import requests
from lxml import etree
# headers is defined elsewhere (not shown here)
page_text = requests.get(url="https://www.nature.com/articles/ncomms12114", headers=headers).text
tree = etree.HTML(page_text)
addInformatin_raw= tree.xpath("//h2[text()='Additional information']/..")[0]
addInformatin = addInformatin_raw.xpath('./div//text()')[0]
But the result of my code is just the string "How to cite this article:", which is only part of the text.
My ideal result is "How to cite this article: Wang, S.Y. et al. Hypoxia causes transgenerational impairments in reproduction of fish. Nat. Commun. 7:12114 doi: 10.1038/ncomms12114 (2016)."
So I want to know how to revise my code. Thank you for your answer.
I think you should remove the [0] here.
//h2[text()='Additional information']/..//div XPath gives the element containing exactly what you are looking for.
So, I'm not familiar with etree but it seems like your code could be something like this:
import requests
from lxml import etree
page_text = requests.get(url="https://www.nature.com/articles/ncomms12114", headers=headers).text
tree = etree.HTML(page_text)
addInformatin_texts= tree.xpath("//h2[text()='Additional information']/..//div//text()")
This will give you 6 texts.
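To illustrate why taking [0] returns only part of the sentence, here is a small sketch that runs with the standard library only, so it uses ElementTree's itertext() in place of lxml's .//text(). The markup is a guess at the page structure, not a copy of the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the bold label and the citation are separate text nodes.
SNIPPET = (
    "<section><h2>Additional information</h2><div><p>"
    "<b>How to cite this article:</b> Wang, S.Y. et al. Hypoxia causes "
    "transgenerational impairments in reproduction of fish. Nat. Commun. "
    "7:12114 doi: 10.1038/ncomms12114 (2016).</p></div></section>"
)

section = ET.fromstring(SNIPPET)
div = section.find("./div")

text_nodes = list(div.itertext())        # one string per text node
first_node = text_nodes[0]               # what indexing with [0] keeps
whole_text = "".join(text_nodes).strip() # all text nodes joined together

print(first_node)
print(whole_text)
```

With lxml the same idea would be to join the list returned by the //text() query instead of indexing it.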

Parse element's tail with requests-html

I want to parse an HTML document like this with requests-html 0.9.0:
from requests_html import HTML
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = html.find('.data', first=True)
print(data.html)
# <span class="data">important data</span> and some rubbish
print(data.text)
# important data and some rubbish
I need to distinguish the text inside the tag (enclosed by it) from the tag's tail (the text that follows the element up to the next tag). This is the behaviour I initially expected:
data.text == 'important data'
data.tail == ' and some rubbish'
But tail is not defined for Elements. Since requests-html provides access to inner lxml objects, we can try to get it from lxml.etree.Element.tail:
from lxml.etree import tostring
print(tostring(data.lxml))
# b'<html><span class="data">important data</span></html>'
print(data.lxml.tail is None)
# True
There's no tail in lxml representation! The tag with its inner text is OK, but the tail seems to be stripped away. How do I extract 'and some rubbish'?
Edit: I discovered that full_text provides the inner text only (so much for “full”). This enables a dirty hack of subtracting full_text from text, although I'm not positive it will work if there are any links.
print(data.full_text)
# important data
I'm not sure I've understood your problem, but if you just want to get 'and some rubbish' you can use below code:
from requests_html import HTML
from lxml.html import fromstring
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = fromstring(html.html)
# or without using requests_html.HTML: data = fromstring('<span><span class="data">important data</span> and some rubbish</span>')
print(data.xpath('//span[span[@class="data"]]/text()')[-1]) # " and some rubbish"
NOTE that data = html.find('.data', first=True) returns you <span class="data">important data</span> node which doesn't contain " and some rubbish" - it's a text child node of parent span!
The tail property exists on objects of type 'lxml.html.HtmlElement'.
I think what you are asking for is very easy to implement.
Here is a very simple example using requests_html and lxml:
from requests_html import HTML
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = html.find('span')
print (data[0].text) # important data and some rubbish
print (data[-1].text) # important data
print (data[-1].element.tail) # and some rubbish
The element attribute points to the 'lxml.html.HtmlElement' object.
Hope this helps.
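The text/tail split itself can be checked without requests-html. This is only a sketch using the standard library's xml.etree.ElementTree on the same markup; it shows the data model, not the requests-html API.

```python
import xml.etree.ElementTree as ET

outer = ET.fromstring(
    '<span><span class="data">important data</span> and some rubbish</span>'
)

# Select the inner span by its class attribute.
data = outer.find("./span[@class='data']")

inner_text = data.text   # text enclosed by the element
tail_text = data.tail    # text after the closing tag, up to the next tag

print(repr(inner_text))
print(repr(tail_text))
```

The tail belongs to the inner element only while it is still attached to its parent, which is why parsing the inner span on its own loses it.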

Adding html tags to text of XML.ElementTree Elements in Python

I am trying to use a Python script to generate an HTML document with text from a data table using the xml.etree.ElementTree module. I would like to format some of the cells to include HTML tags, typically either <br /> or <sup></sup> tags. When I generate a string and write it to a file, I believe the XML parser is converting these tags to individual characters. The output then shows the tags as text rather than processing them as tags. Here is a trivial example:
import xml.etree.ElementTree as ET
root = ET.Element('html')
#extraneous code removed
td = ET.SubElement(tr, 'td')
td.text = 'This is the first line <br /> and the second'
tree = ET.tostring(root)
out = open('test.html', 'w+')
out.write(tree)
out.close()
When you open the resulting 'test.html' file, it displays the text string exactly as typed: 'This is the first line <br /> and the second'.
The HTML document itself shows the problem in the source. It appears that the parser substitutes the "less than" and "greater than" symbols in the tag to the HTML representations of those symbols:
<!--Extraneous code removed-->
<td>This is the first line &lt;br /&gt; and the second</td>
Clearly, my intent is to have the document process the tag itself, not display it as text. I'm not sure if there are different parser options I can pass to get this to work, or if there is a different method I should be using. I am open to using other modules (e.g. lxml) if that will solve the problem. I am mainly using the built-in XML module for convenience.
The only thing I've figured out that works is to modify the final string with re substitutions before I write the file:
tree = ET.tostring(root)
tree = re.sub(r'&lt;', '<', tree)
tree = re.sub(r'&gt;', '>', tree)
This works, but seems like it should be avoidable by using a different setting in xml. Any suggestions?
You can use the tail attribute together with td and br to construct exactly the text you want:
import xml.etree.ElementTree as ET
root = ET.Element('html')
table = ET.SubElement(root, 'table')
tr = ET.SubElement(table, 'tr')
td = ET.SubElement(tr, 'td')
td.text = "This is the first line "
# note how to end td tail
td.tail = None
br = ET.SubElement(td, 'br')
# now continue your text with br.tail
br.tail = " and the second"
tree = ET.tostring(root)
# see the string
tree
'<html><table><tr><td>This is the first line <br /> and the second</td></tr></table></html>'
with open('test.html', 'w+') as f:
f.write(tree)
# and the output html file
cat test.html
<html><table><tr><td>This is the first line <br /> and the second</td></tr></table></html>
As a side note, to include <sup></sup> and append more text after it while staying within <td>, using tail will have the desired effect too:
...
td.text = "this is first line "
sup = ET.SubElement(td, 'sup')
sup.text = "this is second"
# use tail to continue your text
sup.tail = "well and the last"
print ET.tostring(root)
<html><table><tr><td>this is first line <sup>this is second</sup>well and the last</td></tr></table></html>
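The session above is Python 2. On Python 3, ET.tostring() returns bytes by default, so a rough equivalent would ask for a str with encoding="unicode". This is a sketch of the same approach, not a different method.

```python
import xml.etree.ElementTree as ET

root = ET.Element("html")
table = ET.SubElement(root, "table")
tr = ET.SubElement(table, "tr")
td = ET.SubElement(tr, "td")

td.text = "This is the first line "   # text before the <br /> element
br = ET.SubElement(td, "br")          # a real element, so it is not escaped
br.tail = " and the second"           # text after <br />, still inside <td>

# encoding="unicode" makes tostring() return str instead of bytes.
markup = ET.tostring(root, encoding="unicode")
print(markup)
```

The resulting string can be written to a file opened in text mode.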

Python and ElementTree: return "inner XML" excluding parent element

In Python 2.6 using ElementTree, what's a good way to fetch the XML (as a string) inside a particular element, like what you can do in HTML and javascript with innerHTML?
Here's a simplified sample of the XML node I am starting with:
<label attr="foo" attr2="bar">This is some text <a>and a link</a> in embedded HTML</label>
I'd like to end up with this string:
This is some text <a>and a link</a> in embedded HTML
I've tried iterating over the parent node and concatenating the tostring() of the children, but that gave me only the subnodes:
# returns only subnodes (e.g. <a>and a link</a>)
''.join([et.tostring(sub, encoding="utf-8") for sub in node])
I can hack up a solution using regular expressions, but was hoping there'd be something less hacky than this:
re.sub("</\w+?>\s*?$", "", re.sub("^\s*?<\w*?>", "", et.tostring(node, encoding="utf-8")))
How about:
from xml.etree import ElementTree as ET
xml = '<root>start here<child1>some text<sub1/>here</child1>and<child2>here as well<sub2/><sub3/></child2>end here</root>'
root = ET.fromstring(xml)
def content(tag):
    return tag.text + ''.join(ET.tostring(e) for e in tag)
print content(root)
print content(root.find('child2'))
Resulting in:
start here<child1>some text<sub1 />here</child1>and<child2>here as well<sub2 /><sub3 /></child2>end here
here as well<sub2 /><sub3 />
This is based on the other solutions, but the other solutions did not work in my case (resulted in exceptions) and this one worked:
from xml.etree import ElementTree
from xml.etree.ElementTree import Element

def inner_xml(element: Element):
    return (element.text or '') + ''.join(ElementTree.tostring(e, 'unicode') for e in element)
Use it the same way as in Mark Tolonen's answer.
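As a rough usage sketch on Python 3, reusing the sample XML from the answer above (the function is repeated here so the snippet runs on its own):

```python
from xml.etree import ElementTree as ET


def inner_xml(element):
    """Serialize the element's content without its own start and end tags."""
    # tostring() on a child also emits that child's tail, so text between
    # siblings is preserved; 'or ""' covers elements whose text is None.
    return (element.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in element
    )


xml = ('<root>start here<child1>some text<sub1/>here</child1>and'
       '<child2>here as well<sub2/><sub3/></child2>end here</root>')
root = ET.fromstring(xml)

whole = inner_xml(root)
part = inner_xml(root.find("child2"))
print(whole)
print(part)
```

Note that empty elements come back as <sub1 /> rather than <sub1/>, since the content is re-serialized and not copied from the input.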
The following worked for me:
from xml.etree import ElementTree as etree
xml = '<root>start here<child1>some text<sub1/>here</child1>and<child2>here as well<sub2/><sub3/></child2>end here</root>'
dom = etree.XML(xml)
(dom.text or '') + ''.join(map(etree.tostring, dom)) + (dom.tail or '')
# 'start here<child1>some text<sub1 />here</child1>and<child2>here as well<sub2 /><sub3 /></child2>end here'
dom.text or '' is used to get the text at the start of the root element. If there is no text dom.text is None.
Note that the result is not valid XML: a valid XML document must have exactly one root element.
Have a look at the ElementTree docs about mixed content.
Using Python 2.6.5, Ubuntu 10.04
