I have an issue with wrapping an image in a div.
from lxml.html import fromstring
from lxml import etree
tree = fromstring('<img src="/img.png"/> some text')
div = etree.Element('div')
div.insert(0, tree.find('img'))
tree.insert(0, div)
print(etree.tostring(tree, encoding='unicode'))
<span><div><img src="/img.png"/> some text</div></span>
Why does it add a span, and how can I make it wrap the image without the text?
Because lxml is actually an XML parser. It has some forgiving parsing rules that allow it to parse HTML (the lxml.html part), but internally it will always build a valid tree.
'<img src="/img.png"/> some text' isn't a tree, as it has no single root element: there is an img element and a text node. To be able to store this snippet internally, lxml needs to wrap it in a suitable tag. If you give it a string alone, it will wrap it in a p tag. Earlier versions just wrapped everything in html tags, which can lead to even more confusion.
You could also use html.fragment_fromstring, which doesn't add tags in that case, but would raise an error because the fragment isn't valid.
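A minimal sketch of that behaviour, assuming the same snippet as in the question. The `create_parent` argument is shown only as one possible way to choose the wrapper tag yourself; the variable names are illustrative.

```python
from lxml import etree, html

snippet = '<img src="/img.png"/> some text'

# Without a parent, the trailing text makes the fragment invalid.
try:
    html.fragment_fromstring(snippet)
    raised = False
except etree.ParserError:
    raised = True

# Asking for an explicit parent lets you pick the wrapper tag.
wrapper = html.fragment_fromstring(snippet, create_parent='div')

print(raised)
print(wrapper.tag, wrapper[0].tag, repr(wrapper[0].tail))
```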
As for why the text sticks to the img tag: that's how lxml stores text. Take this example:
>>> p = html.fromstring("<p>spam<br />eggs</p>")
>>> br = p.find("br")
>>> p.text
'spam'
>>> br.text # empty
>>> br.tail # this is where text that comes after a tag is stored
'eggs'
So by moving a tag, you also move its tail.
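To wrap only the image, one option is to detach the tail before moving the element and then reattach it to the wrapper. This is a sketch under the assumption that the snippet is the one from the question; the variable names are illustrative.

```python
from lxml import etree
from lxml.html import fromstring

tree = fromstring('<img src="/img.png"/> some text')
img = tree.find('img')
parent = img.getparent()

# Remember the text that follows the image, then detach it.
tail = img.tail
img.tail = None

# Put the div where the image was, then move the image into it.
div = etree.Element('div')
parent.insert(parent.index(img), div)
div.append(img)

# The text now follows the div instead of the image.
div.tail = tail

result = etree.tostring(tree, encoding='unicode')
print(result)
```

The text ends up as the tail of the div, so it stays outside the wrapper.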
lxml.html is a kinder, gentler XML processor that tries to make sense of invalid XML. The string you passed in is just junk from an XML perspective, but lxml.html wrapped it in a span element to make it valid again. If you don't want lxml.html guesstimating, stick with lxml.etree.fromstring(). That version will reject the string.
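As a quick check of that last point, here is a small sketch (same snippet as in the question) showing the strict parser refusing the input:

```python
from lxml import etree

try:
    etree.fromstring('<img src="/img.png"/> some text')
    rejected = False
except etree.XMLSyntaxError as exc:
    # Text after the root element is not well-formed XML.
    rejected = True
    print('rejected:', exc)
```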
Related
I have a badly structured html template, where my <section> elements contain multiple elements (p, figure, a, etc), but also raw text in between. How can I access all those snippets of texts, and edit them in place (what I need is to replace all $$code$$ with tags?)
both section.text and section.tail return empty strings...
Examine the .tail of the complete tag that immediately precedes the text. So, in <section>A<p>B</p>C<p>D</p>E</section>, the .tails of the two <p> elements will contain C and E.
Example:
from lxml import etree
root = etree.fromstring('<root><section>A<p>B</p>C<p>D</p>E</section></root>')
for section_child in root.find('section'):
    if section_child.tail:  # tail is None when no text follows the element
        section_child.tail = section_child.tail.lower()
print(etree.tostring(root, encoding='unicode'))
Result:
<root><section>A<p>B</p>c<p>D</p>e</section></root>
I learnt from the answer in my posted question: Parse XML text in between elements within a root element
from lxml import etree
xml = '<a>aaaa1<b>bbbb</b>aaaa2<c>cccc</c>aaaa3</a>'
element = etree.fromstring(xml)
for text in element.xpath('text()'):
    xml = xml.replace(f'>{text}<', f'>{text.upper()}<')
One concern for this is regarding CDATA in xml, but I would guess this is not an issue for html.
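An alternative that avoids string replacement is to edit the tree in place. In lxml, the strings returned by an XPath `text()` query know where they came from: `getparent()` returns the element that carries the text, and `is_text` / `is_tail` say which slot it sits in. A minimal sketch, reusing the example above:

```python
from lxml import etree

element = etree.fromstring('<a>aaaa1<b>bbbb</b>aaaa2<c>cccc</c>aaaa3</a>')

for text in element.xpath('text()'):
    owner = text.getparent()
    if text.is_text:
        # Text directly after the opening tag of its owner.
        owner.text = text.upper()
    elif text.is_tail:
        # Text that follows the closing tag of its owner.
        owner.tail = text.upper()

result = etree.tostring(element, encoding='unicode')
print(result)
```

Only the direct text of the a element changes; the contents of b and c are left alone.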
I'm new to Python 2.7, and I'm trying to parse an XML file that contains HTML. I want to parse the custom XML tags without parsing any HTML content whatsoever. What's the best way to do this? (If it's helpful, my list of custom XML tags is small, so if there's an XML parser that has an option to only parse specified tags that would probably work fine.)
E.g. I have an XML file that looks like
<myTag1 myAttrib="value">
<myTag2>
<p>My what a lovely day.</p>
</myTag2>
</myTag1>
I'd like to be able to parse apart everything except the HTML, and in particular to extract the value of myTag2 as un-parsed HTML.
EDIT: Here's more information to answer a question below. I had previously tried using ElementTree. This is what happened:
root = ET.fromstring(xmlstring)
root.tag # returns 'myTag1'
root[0].tag # returns 'myTag2'
root[0].text # returns None, but I want it to return the HTML string
The HTML string I want has been parsed and is stored as a tag and text:
root[0][0].tag # returns 'p', but I don't even want root[0][0] to exist
root[0][0].text # returns 'My ... day.'
But really I'd like to be able to do something like this...
root[0].unparsedtext # returns '<p>My ... day.</p>'
SOLUTION:
har07's answer works great. I modified that code slightly to account for an edge case. Here's what I'm implementing:
def _getInner(element):
    # element.text is None when the element starts with a child tag
    if element.text is None:
        textStr = ''
    else:
        textStr = element.text
    return textStr + ''.join(ET.tostring(e) for e in element)
Then if
element = ET.fromstring('<myTag>Let us be <b>gratuitous</b> with tags</myTag>')
the original code will only return the text starting with the first XML-formatted tag, but the modified version will capture the desired text:
''.join(ET.tostring(e) for e in element) # returns '<b>gratuitous</b> with tags'
_getInner(element) # returns 'Let us be <b>gratuitous</b> with tags'
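On Python 3, ET.tostring() returns bytes unless told otherwise, so the join above would fail there. A sketch of the same idea for Python 3, assuming `encoding='unicode'` is acceptable for your use; the function name `get_inner_xml` is hypothetical:

```python
import xml.etree.ElementTree as ET

def get_inner_xml(element):
    """Return the element's leading text plus its serialized children."""
    parts = [element.text or '']
    for child in element:
        # tostring() also emits the child's tail text.
        parts.append(ET.tostring(child, encoding='unicode'))
    return ''.join(parts)

element = ET.fromstring('<myTag>Let us be <b>gratuitous</b> with tags</myTag>')
inner = get_inner_xml(element)
print(inner)
```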
I don't think there is an easy way to modify an XML parser's behavior to ignore some predefined tags. A much easier way would be to let the parser parse the XML normally, and then write a function that returns the unparsed content of an element, for example:
import xml.etree.ElementTree as ET
def getUnparsedContent(element):
    return ''.join(ET.tostring(e) for e in element)
xmlstring = """<myTag1 myAttrib="value">
<myTag2>
<p>My what a lovely day.</p>
</myTag2>
</myTag1>"""
root = ET.fromstring(xmlstring)
print(getUnparsedContent(root[0]))
output :
<p>My what a lovely day.</p>
You should be able to implement this through the built-in minidom xml parser.
from xml.dom import minidom
xmldoc = minidom.parse("document.xml")
rootNode = xmldoc.firstChild
firstNode = rootNode.childNodes[0]
In your example case, once whitespace-only text nodes are skipped, calling toxml() on the <p> element inside myTag2 gives:
<p>My what a lovely day.</p>
Note that minidom (and probably any other xml-parsing library you might use) won't recognize HTML by default. This is by design, because XML documents do not have predefined tags.
You could then use a series of if or try statements to determine whether you have reached an HTML-formatted node while extracting data:
for rowNode in rootNode.childNodes:
    # text nodes have no tagName, so check the node type first
    if rowNode.nodeType == rowNode.ELEMENT_NODE and rowNode.tagName == "p":
        # this is an html-formatted node: extract the value and continue
        html = rowNode.toxml()
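Putting the minidom pieces together for the document in the question might look like the sketch below. The helper `element_children` is hypothetical; it exists only to skip the whitespace text nodes that minidom keeps between tags.

```python
from xml.dom import minidom

xmlstring = """<myTag1 myAttrib="value">
<myTag2>
<p>My what a lovely day.</p>
</myTag2>
</myTag1>"""

def element_children(node):
    """Return only the element children, ignoring text and comments."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]

root = minidom.parseString(xmlstring).documentElement
my_tag2 = element_children(root)[0]

# Serialize each child element of myTag2 back to markup.
inner = ''.join(child.toxml() for child in element_children(my_tag2))
print(root.getAttribute('myAttrib'))
print(inner)
```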
I'm using lxml to parse an XML message. What I want to do is parse the string into an XML message, extract some information with XPath expressions, edit a few attributes, and then dump the XML into a string again.
lxml is doing a wonderful job at it, except for one thing: it won't respect the tag style that was originally provided. What I mean by this is that if your input is:
xml_str = "<root><tag><tutu/></tag></root>"
or
xml_str = "<root><tag><tutu></tutu></tag></root>"
The following code will return the same thing:
>>> from lxml import etree
>>> root = etree.XML(xml_str)
>>> print(etree.tostring(root, encoding='unicode'))
<root><tag><tutu/></tag></root>
The tutu tag will be rendered no matter what as <tutu/>
I found here that by setting the text of the element to '' we can force the closing tag to be explicitly rendered.
My issue is the following : I need to have the exact same tag rendering before and after calling lxml (because some external program will perform a string comparison on both strings and a mismatch will be detected on <tutu/> and <tutu></tutu>)
I know we can create a custom ElementTree class as well as a custom parser. What I was thinking was, while parsing the string, to save in the custom ElementTree what type of tag we have (short or extended) and then, before calling the tostring function, update the tree and set the text to None or '' to keep the same type of tag as in the input.
The question is : How may I know what type of tag I have? Or do you have any other idea on how to solve this issue?
Thanks a lot for your help
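For reference, the trick of setting the text to '' can be sketched as follows. It forces the extended form for every empty element; it does not recover which form the input used, since the parsed tree looks the same for both.

```python
from lxml import etree

root = etree.XML('<root><tag><tutu/></tag></root>')
short_form = etree.tostring(root, encoding='unicode')

for el in root.iter():
    # An element with no children and no text serializes as <el/>.
    if len(el) == 0 and el.text is None:
        el.text = ''  # an empty string forces <el></el>

long_form = etree.tostring(root, encoding='unicode')
print(short_form)
print(long_form)
```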
I want to write a simple SOAP response using Python's ElementTree API (and lxml). Writing the SOAP response involves writing element text (values) with a namespace. For an example, click here.
Writing an element with a namespace isn't that big of a problem, but some elements contain text which have a namespace.
I want to create something like:
<pleh:a xmlns:pleh="http://pleh">pleh:x</pleh:a>
So 'naturally' I do:
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree
pleh = 'http://pleh'
etree.register_namespace('pleh', pleh)
a = etree.Element('{%s}a' % pleh)
a.text = '{%s}x' % pleh
print(etree.tostring(a))
But this prints <pleh:a xmlns:pleh="http://pleh">{http://pleh}x</pleh:a>
What am I missing here?
There is no such thing as "namespaced element text" or "text which have a namespace". Your element as a whole is bound to a namespace, that is true. But element text content such as pleh:x is just a plain string. The pleh bit happens to be the (arbitrary) prefix associated with the element's namespace, but it is not significant as far as XML namespaces are concerned.
Here you create an XML element, in the http://pleh namespace:
a = etree.Element('{%s}a' % pleh)
The curly braces are interpreted as delimiters of the namespace URI. This is known as "Clark notation". The braces do not appear in the serialized XML.
Here you create a string:
a.text = '{%s}x' % pleh
The text content of the element becomes {http://pleh}x. This is not what you expected, but it is perfectly logical (it is not a bug in lxml).
If you need the text content to be pleh:x, just use a.text = 'pleh:x'. Clark notation only applies to element and attribute names, never to text.
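A minimal sketch of the corrected version, using the standard library's ElementTree so it runs without lxml. The prefix inside the text is written by hand and has to match the registered prefix; nothing checks that for you.

```python
import xml.etree.ElementTree as ET

PLEH = 'http://pleh'
ET.register_namespace('pleh', PLEH)

# Clark notation is used for the element name only.
a = ET.Element('{%s}a' % PLEH)

# The text is a plain string; the "pleh:" part is just characters.
a.text = 'pleh:x'

serialized = ET.tostring(a, encoding='unicode')
print(serialized)
```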
I want to get the whole text of an Element to parse some xhtml:
<div id='asd'>
<pre>skdsk</pre>
</div>
With E being the div element in the example above, I want to get
<pre>skdsk</pre>
How?
Strictly speaking:
from xml.dom.minidom import parse, parseString
tree = parseString("<div id='asd'><pre>skdsk</pre></div>")
root = tree.firstChild
node = root.childNodes[0]
print(node.toxml())
In practice, though, I'd recommend looking at the http://www.crummy.com/software/BeautifulSoup/ library. Finding the right childNode in an xhtml document and skipping "whitespace nodes" is a pain. BeautifulSoup is a robust html/xhtml parser with fantastic tree-search capabilities.
Edit: The example above compresses the HTML into one string. If you use the HTML as in the question, the line breaks and so forth will generate "whitespace" nodes, so the node you want won't be at childNodes[0].
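If you stay with minidom, one way around the whitespace nodes is to filter them out before serializing. A sketch, assuming the multi-line markup from the question:

```python
from xml.dom.minidom import parseString

xhtml = """<div id='asd'>
  <pre>skdsk</pre>
</div>"""

root = parseString(xhtml).documentElement

# Keep every child except text nodes that contain only whitespace.
inner = ''.join(
    node.toxml()
    for node in root.childNodes
    if not (node.nodeType == node.TEXT_NODE and not node.data.strip())
)
print(inner)
```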