Python lxml html tostring - python

I have the following code and output (the tag names lose the colon and the prefix before it):
from lxml import html
root = html.fromstring('<ac:link> <ri:attachment filename="test.pdf"/> </ac:link>')
html.tostring(root,encoding='unicode',method='xml')
Output : '<link> <attachment ri:filename="test.pdf"/> </link>'
However, I expected something similar to this (keeping the original tag structure):
Output : '<ac:link> <ri:attachment ri:filename="test.pdf"/> </ac:link>'
Is there a way to produce the wanted output? (I can't change the HTML structure as is.)
Maybe somebody could also point to the relevant part in the source code.

Related

Python lxml method tostring with selfclosing tag <br/>

I have the following Code and Output (without closing br-tag):
from lxml import html
root = html.fromstring('<br/>')
html.tostring(root)
Output : '<br>'
However, I expected something similar to this (where a closing tag is also added for the self-closing element):
from lxml import html
root = html.fromstring('<a/>')
html.tostring(root)
Output : '<a></a>'
Is there a way to produce the wanted output or just '<br/>' ?
Maybe somebody could also point to the relevant part in the source code.
HTML is not XML compliant and this is one of the places where that shows up. A break is <br>. Since a break element has no content (no text or child elements), it does not need to be written as a self-closing <br/> or as <br></br> (although most browsers tolerate these forms). You can change the output method to get XML compliance.
html.tostring(root, method='xml')
This could negatively impact other parts of the document, though.
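A minimal sketch of the difference, assuming Python 3 with lxml installed. encoding='unicode' is only used so that tostring returns str instead of bytes, and the empty div is an added illustration of the kind of side effect meant above, not something taken from the question:

```python
from lxml import html

br = html.fromstring('<br/>')

# Default HTML serialization: the break is written without a slash.
as_html = html.tostring(br, encoding='unicode')
# XML serialization: the same element is written as a self-closing tag.
as_xml = html.tostring(br, encoding='unicode', method='xml')
print(as_html, as_xml)

# Side effect: with method='xml', an empty non-void element is collapsed too,
# which browsers parsing the result as HTML will not read as a closed div.
div = html.fromstring('<div></div>')
div_as_xml = html.tostring(div, encoding='unicode', method='xml')
print(div_as_xml)
```

So method='xml' is probably safest when the result is consumed by an XML tool rather than served to a browser as HTML.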

Incorrect parent element lxml

I am implementing a web scraping program in Python.
Consider my following HTML snippet.
<div>
<b>
<i>
HelloWorld
</i>
HiThere
</b>
</div>
If I wish to use lxml to extract my bold or italicized texts only, I use the following command
tree = etree.fromstring(myhtmlstr, htmlparser)
opp1 = tree.xpath(".//text()[ancestor::b or ancestor::i or ancestor::strong]")
This gives me the correct result, i.e. the result of my opp1 is :
['HelloWorld', 'HiThere']
So far, everything is perfect. However, the real problem arises if I try to query the parents of the tags. As expected, the output of opp1[0].getparent().tag and opp1[0].getparent().getparent().tag are i and b.
The real problem, however, is with the second text node. Ideally, the parent of opp1[1] should be the b tag. However, opp1[1].getparent().tag and opp1[1].getparent().getparent().tag are again i and b.
You can verify the same in the following code:
from lxml import etree
htmlstr = """<div><b><i>HelloWorld</i>HiThere</b></div>"""
htmlparser = etree.HTMLParser()
tree = etree.fromstring(htmlstr, htmlparser)
opp1 = tree.xpath(".//text()[ancestor::b or ancestor::i or ancestor::strong]")
print(opp1)
print(opp1[0].getparent(), opp1[0].getparent().getparent())
print(opp1[1].getparent(), opp1[1].getparent().getparent())
Can someone point out why this is the case? What can I do to correct it? I plan to use only lxml for my program, and do not want any solution that uses bs4.
The issue seems to stem from LXML's (and ElementTree's) data model, where an element is roughly "tag, attributes, text, children, tail"; the DOM data model has actual nodes for text too.
If you change your program to do
for x in tree.xpath(".//text()[ancestor::b or ancestor::i or ancestor::strong]"):
    print(x, x.getparent(), "text?", x.is_text, "tail?", x.is_tail)
it will print
HelloWorld <Element i at 0x10aa0ccd0> text? True tail? False
HiThere <Element i at 0x10aa0ccd0> text? False tail? True
i.e. "HiThere" is the tail of the i element, since that's the way the Etree data model represents intermingled text and tags.
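To make the text/tail split concrete, here is a small sketch using the standard library's xml.etree.ElementTree, which shares this data model with lxml (the variable names are only illustrative):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<div><b><i>HelloWorld</i>HiThere</b></div>")
b_elem = root.find("b")
i_elem = root.find("b/i")

# Text directly inside <i> is stored as i.text.
print(repr(i_elem.text))
# Text that follows </i> but is still inside <b> is stored as i.tail.
print(repr(i_elem.tail))
# <b> starts with a child element, so it has no text of its own.
print(repr(b_elem.text))
```

There is no separate text node for "HiThere" that could have b as its parent; the string simply hangs off the i element.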
The takeaway here (which should probably work for your use case) is to consider .getparent().getparent() as the effective parent of a text result that has is_tail=True.
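That rule could be wrapped in a small helper. This is only a sketch: effective_parent is a made-up name, not part of lxml, and it relies on the is_tail attribute and getparent() method that lxml puts on XPath text results:

```python
from lxml import etree


def effective_parent(text_result):
    """Return the element that visually encloses an XPath text() result."""
    parent = text_result.getparent()
    if text_result.is_tail:
        # Tail text belongs to the element *after* which it appears,
        # so the enclosing element is one level further up.
        return parent.getparent()
    return parent


htmlstr = "<div><b><i>HelloWorld</i>HiThere</b></div>"
tree = etree.fromstring(htmlstr, etree.HTMLParser())
texts = tree.xpath(".//text()[ancestor::b or ancestor::i or ancestor::strong]")
parents = [(str(t), effective_parent(t).tag) for t in texts]
print(parents)
```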

Navigating in html by xpath in python

So, I am accessing some url that is formatted something like the following:
<DOCUMENT>
<TYPE>A
<SEQUENCE>1
<TEXT>
<HTML>
<BODY BGCOLOR="#FFFFFF" LINK=BLUE VLINK=PURPLE>
</BODY>
</HTML>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>B
<SEQUENCE>2
...
As you can see, it starts a document, (which is the sequence number 1), and then finishes the document, and then document with sequence 2 starts and so on.
So, what I want to do is write an XPath expression in Python that gets just the document with sequence value 1 (or, equivalently, TYPE A).
I assumed something like this would work:
import lxml
from lxml import html
page = html.fromstring(pagehtml)
type_a = page.xpath("//document[sequence=1]/descendant::*/text()")
however, it just gives me an empty list as type_a variable.
Could someone please let me know what is my mistake in this code? I am really new to this xml stuff.
It might be because that's highly dubious HTML. The <SEQUENCE> tag is unclosed, so it could well be interpreted by lxml as containing all of the code until the next </DOCUMENT>, so it does not end up just containing the 1. When your XPath code then looks for a <SEQUENCE> containing 1, there isn't one.
Additionally, XML is case-sensitive, but HTML isn't. XPath is designed for XML, so it is also case sensitive, which would also stop your document matching <DOCUMENT>.
Try //DOCUMENT[starts-with(SEQUENCE,'1')]. This uses the XPath starts-with function.
Ideally, if the input is under your control, you should instead just close the type and sequence tags (with </TYPE> and </SEQUENCE>) to make the input valid.
I'd like to point out, apart from the great answer provided by @GKFX, that the lxml.html module is capable of parsing broken HTML or an HTML fragment. In fact it will parse your string just fine and handle it well.
fromstring(string): Returns document_fromstring or
fragment_fromstring, based on whether the string looks like a full
document, or just a fragment.
The problem, which may also come from the other code that generates the string, is that you haven't given the actual path to the SEQUENCE node.
type_a = page.xpath("//document[sequence=1]/descendant::*/text()")
Your XPath above looks for document nodes that have a direct child node called sequence whose value is 1. However, the direct child of your document node is type, not sequence (sequence ends up nested inside type, because type is never closed), so you will never get what you want.
Consider rewriting it like this to get what you need:
page.xpath('//document[type/sequence=1]/descendant::*/text()')
['A\n ', '1\n ']
Since your HTML string is missing the closing tag for sequence, however, you cannot get the correct result with another XPath like this:
page.xpath('//document[type/sequence=1]/../..//text()')
['A\n ', '1\n ', 'B\n ', '2']
That is because sequence=1 has no closing tag, so sequence=2 becomes a child node of it.
I have to stress that your HTML string is still invalid, but lxml's tolerant parser handles your case just fine.
Try using a relative path: explicitly specifying the correct path to your element. (not skipping type)
page.xpath("//document[./type/sequence = 1]")
See: http://pastebin.com/ezQXtKcr
Output:
Trying original post (novice_007): //document[sequence=1]/descendant::*/text()
[]
Using GKFX's answer: //DOCUMENT[starts-with(SEQUENCE,'1')]
[]
My answer: //document[./type/sequence = 1]
[<Element document at 0x1bfcb30>]
Currently, the xpath I provided is the only one that ... to just get the document with sequence value 1
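The comparison can be reproduced with a short script. This is a sketch under assumptions: pagehtml below is a cut-down stand-in for the real page (only the DOCUMENT, TYPE and SEQUENCE tags from the question), and it relies on lxml.html lowercasing tag names and nesting the unclosed sequence tag inside type, as described in the answers above:

```python
from lxml import html

# Cut-down stand-in for the real page; TYPE and SEQUENCE are never closed.
pagehtml = """<DOCUMENT>
<TYPE>A
<SEQUENCE>1
</DOCUMENT>
<DOCUMENT>
<TYPE>B
<SEQUENCE>2
</DOCUMENT>"""

page = html.fromstring(pagehtml)

# Uppercase names: nothing matches, because the parser lowercased the tags.
upper = page.xpath("//DOCUMENT[starts-with(SEQUENCE,'1')]")
# Lowercase, but sequence is not a direct child of document.
direct = page.xpath("//document[sequence=1]")
# Lowercase, with the real path through type.
nested = page.xpath("//document[./type/sequence = 1]")

print(upper, direct, nested)
print([doc.findtext("type").strip() for doc in nested])
```

On a page where more content follows inside the unclosed sequence tag, the numeric comparison may stop matching, so treat this as a check of the idea rather than a general solution.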

lxml: keep the same tags in fromstring and tostring

I'm using lxml to parse an XML message. What I want to do is convert the string into an XML tree, extract some information using XPath expressions, edit a few attributes and then dump the XML into a string again.
lxml is doing a wonderful job at it, except for one thing: it won't respect the form of the tags that were originally provided. What I mean by this is that if in your input you have:
xml_str = "<root><tag><tutu/></tag></root>"
or
xml_str = "<root><tag><tutu></tutu></tag></root>"
The following code will return the same thing:
>>> from lxml import etree
>>> root = etree.XML(xml_str)
>>> print(etree.tostring(root, encoding="unicode"))
<root><tag><tutu/></tag></root>
The tutu tag will be rendered no matter what as <tutu/>
I found here that by setting the text of the element to '' we can force the closing tag to be explicitly rendered.
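For reference, a minimal sketch of that trick (Python 3 with lxml assumed; the variable names are only illustrative):

```python
from lxml import etree

root = etree.XML("<root><tag><tutu/></tag></root>")

# With no text at all, the element is serialized in its short form.
short_form = etree.tostring(root, encoding="unicode")

# Setting the text to an empty string forces an explicit closing tag.
root.find("tag/tutu").text = ""
long_form = etree.tostring(root, encoding="unicode")

print(short_form)
print(long_form)
```

This only controls the output form; it does not tell you which form the input used.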
My issue is the following : I need to have the exact same tag rendering before and after calling lxml (because some external program will perform a string comparison on both strings and a mismatch will be detected on <tutu/> and <tutu></tutu>)
I know we can create a custom ElementTree class as well as a custom parser...What I was thinking was while parsing the string, to save in the custom ElementTree what type of tag we have (short or extended) and then before calling tostring function, update the tree and set the text to None or '' to keep the same type of tag as in the input
The question is : How may I know what type of tag I have? Or do you have any other idea on how to solve this issue?
Thanks a lot for your help

How can I add consistent whitespace to existing HTML using Python?

I just started working on a website that is full of pages with all their HTML on a single line, which is a real pain to read and work with. I'm looking for a tool (preferably a Python library) that will take HTML input and return the same HTML unchanged, except for adding linebreaks and appropriate indentation. (All tags, markup, and content should be untouched.)
The library doesn't have to handle malformed HTML; I'm passing the HTML through html5lib first, so it will be getting well-formed HTML. However, as mentioned above, I would rather it didn't change any of the actual markup itself; I trust html5lib and would rather let it handle the correctness aspect.
First, does anyone know if this is possible with just html5lib? (Unfortunately, their documentation seems a bit sparse.) If not, what tool would you suggest? I've seen some people recommend HTML Tidy, but I'm not sure if it can be configured to only change whitespace. (Would it do anything except insert whitespace if it were passed well-formed HTML to start with?)
Algorithm
Parse html into some representation
Serialize the representation back to html
Example html5lib parser with BeautifulSoup tree builder
#!/usr/bin/env python
from html5lib import HTMLParser, treebuilders
parser = HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))
c = """<HTML><HEAD><TITLE>Title</TITLE></HEAD><BODY>...... </BODY></HTML>"""
soup = parser.parse(c)
print(soup.prettify())
Output:
<html>
 <head>
  <title>
   Title
  </title>
 </head>
 <body>
  ......
 </body>
</html>
I chose J.F. Sebastian's answer because I think it's the simplest and thus the best, but I'm adding another solution for anyone who doesn't want to install Beautiful Soup. (Also, the Beautiful Soup tree builder is going to be deprecated in html5lib 1.0.) This solution was thanks to Amarghosh's tip; I just fleshed it out a bit. Looking at html5lib, I realized that it will output a minidom object natively, which means I can use his suggestion of toprettyxml(). Here's what I came up with:
from io import StringIO
from html5lib import HTMLParser, treebuilders

def tidy_html(text):
    """Returns a well-formatted version of input HTML."""
    p = HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
    dom_tree = p.parseFragment(text)
    # using StringIO for fast string concatenation
    pretty_HTML = StringIO()
    # a DocumentFragment has no writexml(), so pretty-print each child node
    node = dom_tree.firstChild
    while node:
        node_contents = node.toprettyxml(indent=' ')
        pretty_HTML.write(node_contents)
        node = node.nextSibling
    output = pretty_HTML.getvalue()
    pretty_HTML.close()
    return output
And an example:
>>> text = """<b><i>bold, italic</b></i><div>a div</div>"""
>>> tidy_html(text)
<b>
 <i>
  bold, italic
 </i>
</b>
<div>
 a div
</div>
Why am I iterating over the children of the tree, rather than just calling toprettyxml() on dom_tree directly? Some of the HTML I'm dealing with is actually HTML fragments, so it's missing the <head> and <body> tags. To handle this I used the parseFragment() method, which means I get a DocumentFragment in return (rather than a Document). Unfortunately, it doesn't have a writexml() method (which toprettyxml() calls), so I iterate over the child nodes, which do have the method.
If the HTML is indeed well-formed XML, you can use a DOM parser.
from xml.dom.minidom import parse, parseString
# if you have the HTML string in a variable
html = parseString(theHtmlString)
# or parse the HTML file
html = parse(htmlFileName)
print(html.toprettyxml())
The toprettyxml() method lets you specify the indent, the newline character and the encoding of the output. You might also want to check out the writexml() method.
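A short sketch of those parameters in use, assuming Python 3 and markup that is already well-formed XML (the sample string is made up for illustration):

```python
from xml.dom.minidom import parseString

# Hypothetical single-line input; it must be well-formed XML for minidom.
markup = "<div><p>hi</p><p>there</p></div>"

dom = parseString(markup)
# indent is the string used per nesting level, newl the line separator.
pretty = dom.toprettyxml(indent="  ", newl="\n")
print(pretty)
```

Note that minidom also writes an XML declaration as the first line of the result, so the output is not byte-for-byte the original markup plus whitespace.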
