Custom Python Dominate Tag Element - python

To support both a JPEG and WEBP compressed image, I'd like to include the following HTML code in a web page:
<picture>
<source srcset="img/awesomeWebPImage.webp" type="image/webp">
<source srcset="img/creakyOldJPEG.jpg" type="image/jpeg">
<img src="img/creakyOldJPEG.jpg" alt="Alt Text!">
</picture>
I've been using Python Dominate and it has generally worked well for me.
But I think the picture and source tags are not supported by Dominate.
I could add the HTML as a raw() Dominate tag, but was wondering if there was a way to get Dominate to recognize these tags.
p = picture()
with p:
    source(srcset=image.split('.')[0]+'.webp', type="image/webp")
    source(srcset=image, type="image/jpeg")
    img(src=image, alt=imagealt)
I am seeing this kind of error:
p = picture()
NameError: global name 'picture' is not defined

You can create a picture class by inheriting from the dominate.tags.html_tag class:
from dominate.tags import html_tag
class picture(html_tag):
    pass
This can now be used like any of the predefined tags.

Dominate is used to generate HTML(5) documents.
The list of elements is defined in the tags.py file; see the repository on GitHub: https://github.com/Knio/dominate/blob/master/dominate/tags.py.
However, picture was not among them when the question was asked (hence the NameError), even though it is part of the current HTML standard.
You may look at the lxml library, which contains an ElementMaker similar to Dominate for building XML trees easily. See the E-Factory.
For instance:
>>> from lxml import etree
>>> from lxml.builder import E
>>> def CLASS(*args): # class is a reserved word in Python
...     return {"class":' '.join(args)}
>>> html = page = (
...   E.html(       # create an Element called "html"
...     E.head(
...       E.title("This is a sample document")
...     ),
...     E.body(
...       E.h1("Hello!", CLASS("title")),
...       E.p("This is a paragraph with ", E.b("bold"), " text in it!"),
...       E.p("This is another paragraph, with a", "\n      ",
...         E.a("link", href="http://www.python.org"), "."),
...       E.p("Here are some reserved characters: <spam&egg>."),
...       etree.XML("<p>And finally an embedded XHTML fragment.</p>"),
...     )
...   )
... )
>>> print(etree.tostring(page, pretty_print=True))
<html>
  <head>
    <title>This is a sample document</title>
  </head>
  <body>
    <h1 class="title">Hello!</h1>
    <p>This is a paragraph with <b>bold</b> text in it!</p>
    <p>This is another paragraph, with a
      <a href="http://www.python.org">link</a>.</p>
    <p>Here are some reserved characters: &lt;spam&amp;egg&gt;.</p>
    <p>And finally an embedded XHTML fragment.</p>
  </body>
</html>

Related

HTMLParsing in Python

So I need to process some HTML in Python, and my requirement is that I need to find a certain tag and replace it with different characters based on its content...
<html>
<Head>
</HEAD>
<body>
<blah>
<_translate attr="french"> I am no one,
and no where </_translate>
<Blah/>
</body>
</html>
Should become
<html>
<Head>
</HEAD>
<body>
<blah>
Je suis personne et je suis nulle part
<Blah/>
</body>
</html>
I would like to leave the original HTML untouched and only replace the tags labeled 'important-tag'. The attributes and contents of that tag will be important for generating the tag's output.
I had thought about extending the HTMLParser object, but I am having trouble getting the original HTML back out when I want it. I think what I most want is to parse the HTML into tokens, with the original text in each token, so I can produce my desired output, i.e. get something like
(tag, "<html>")
(data, "\n ")
(tag, "<head>")
(data, "\n ")
(end-tag,"</HEAD>")
etc.
Does anyone know of a good Pythonic way to accomplish this? Python 2.7 standard libraries are preferred; third-party libraries would also be useful to consider...
Thanks!
You can use lxml for such a task (http://lxml.de/tutorial.html) and use XPath to navigate easily through your HTML:
from lxml.html import fromstring
my_html = "HTML CONTENT"
root = fromstring(my_html)
nodes_to_process = root.xpath("//_translate")
for node in nodes_to_process:
    lang = node.attrib["attr"]
    translate = AWESOME_TRANSLATE(node.text, lang)
    # Put the translation where the element was, then drop the element itself.
    # (lxml elements have getparent(), not a .parent attribute.)
    node.tail = translate + (node.tail or "")
    node.drop_tree()
I'll leave up to you the implementation of the AWESOME_TRANSLATE function ;)
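The token stream described in the question (each token carrying its original text) can also be approximated with the standard library alone. The following is a minimal sketch for Python 3's html.parser, not the method used in the answer above: RawTokenizer and sample are hypothetical names, and the approach assumes that getpos() reports the start of the construct currently being handled, so that slicing the source between consecutive positions recovers the untouched text. Constructs without a handler here (for example unknown declarations) are simply folded into the preceding token.

```python
from html.parser import HTMLParser


class RawTokenizer(HTMLParser):
    """Split HTML into (kind, original_text) tokens without normalising it."""

    def __init__(self, source):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.source = source
        # Absolute offset of the first character of every line.
        self.line_starts = [0]
        for index, char in enumerate(source):
            if char == "\n":
                self.line_starts.append(index + 1)
        self.marks = []  # (kind, absolute offset where the token starts)

    def _mark(self, kind):
        line, column = self.getpos()  # line is 1-based, column is 0-based
        self.marks.append((kind, self.line_starts[line - 1] + column))

    def handle_starttag(self, tag, attrs):
        self._mark("tag")

    def handle_startendtag(self, tag, attrs):
        self._mark("tag")  # <br/> style tags become a single token

    def handle_endtag(self, tag):
        self._mark("end-tag")

    def handle_data(self, data):
        self._mark("data")

    def handle_entityref(self, name):
        self._mark("data")

    def handle_charref(self, name):
        self._mark("data")

    def handle_comment(self, data):
        self._mark("comment")

    def handle_decl(self, decl):
        self._mark("decl")

    def handle_pi(self, data):
        self._mark("pi")

    def tokens(self):
        self.feed(self.source)
        self.close()
        # Each token ends where the next one starts; the last ends at EOF.
        ends = [offset for _, offset in self.marks[1:]] + [len(self.source)]
        return [(kind, self.source[start:end])
                for (kind, start), end in zip(self.marks, ends)
                if end > start]


sample = "<html>\n <Head>\n </HEAD>\n<body>x &amp; y</body></html>"
tokens = RawTokenizer(sample).tokens()
for token in tokens:
    print(token)
```

Because every token is a slice of the input, joining the token texts reproduces the document unchanged, and a replacement step only has to swap out the tokens it cares about.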

How to control newline processing in the lxml xpath text() function?

Having switched from Fedora 17 to 18, I get different parsing behaviour for the same lxml code, apparently due to different versions of the underlying libraries (libxml2 and libxslt versions changed).
Here's an example of lxml code with different results for the two versions:
from io import BytesIO
from lxml import etree
myHtmlString = \
'<!doctype html public "-//w3c//dtd html 4.0 transitional//en">\r\n'+\
'<html>\r\n'+\
'<head>\r\n'+\
' <title>Title</title>\r\n'+\
'</head>\r\n'+\
'<body/>\r\n'+\
'</html>\r\n'
myFile = BytesIO(myHtmlString)
myTree = etree.parse(myFile, etree.HTMLParser())
myTextElements = myTree.xpath("//text()")
myFullText = ''.join([myEl for myEl in myTextElements])
assert myFullText == 'Title', repr(myFullText)
The f17 version passes the assert, i.e. xpath("//text()") only returns text 'Title', whereas the f18 version fails with output
Traceback (most recent call last):
File "TestLxml.py", line 17, in <module>
assert myFullText == 'Title', repr(myFullText)
AssertionError: '\r\n\r\n Title\r\n\r\n\r\n'
Apparently, the f18 version handles newlines and whitespace differently from the f17 version.
Is there a way to have control over this behaviour? (An optional argument somewhere?)
Or even better, is there a way in which I can get the old behaviour back using the new libraries?
In XML, text() returns the text inside the tags as is (unstripped), so any whitespace characters, tabs, or newlines are included.
It might be that, because of the way you construct the multiline string with + and \r\n, you are accidentally testing two different strings.
Try changing your string to a triple-quoted string like the example below and test it.
from io import BytesIO
from lxml import etree
html = '''
<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
<title>Title</title>
</head>
<body/>
</html>
'''
tree = etree.parse(BytesIO(html), etree.HTMLParser())
text_elements = tree.xpath("//text()")
full_text = ''.join(text_elements)
assert full_text == 'Title', repr(full_text)
You can also see that surrounding the text with spaces or newlines makes them part of what text() returns. See the title below.
html = '''
<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
<title> Title </title>
</head>
<body/>
</html>
'''
tree = etree.parse(BytesIO(html), etree.HTMLParser())
text_elements = tree.xpath("//text()")
full_text = ''.join(text_elements)
assert full_text == ' Title ', repr(full_text)
If you don't need the spaces you can always call strip() on the string yourself. If you're sure you're getting spaces even though your tags do not contain them, then you should report that as a bug on the lxml mailing list.
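As a minimal sketch of the strip() suggestion (Python 3): the text_elements list below is a hand-written stand-in for what xpath("//text()") returned in the question's traceback, and the way the whitespace is split across chunks is an assumption, not output produced by lxml here.

```python
# Stand-in for the list returned by tree.xpath("//text()"); the chunk
# boundaries are assumed, only the joined value comes from the traceback.
text_elements = ["\r\n", "\r\n ", "Title", "\r\n", "\r\n", "\r\n"]

# Joining the raw chunks reproduces the value from the AssertionError.
raw_text = "".join(text_elements)

# Stripping each chunk and dropping the empty ones leaves only real text.
stripped_chunks = [chunk.strip() for chunk in text_elements if chunk.strip()]
full_text = " ".join(stripped_chunks)

print(repr(raw_text))
print(repr(full_text))
```

Joining with a single space keeps words from adjacent elements apart when a document has more than one non-empty text node.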

Replacing string contents with regular expressions

I am trying to remove all the HTML surrounding the data that I seek from a webpage, so that all that is left is the raw data that I can then put into a database. So if I have something like:
<p class="location"> Atlanta, GA </p>
The following code would return
Atlanta, GA </p>
But what I expect is not what is returned. This is a more specific solution to the basic problem I found here. Any help would be appreciated, thanks! Code is found below.
def delHTML(self, html):
    """
    html is a list made up of items with data surrounded by html
    this function should get rid of the html and return the data as a list
    """
    for n,i in enumerate(html):
        if i==re.match('<p class="location">',str(html[n])):
            html[n]=re.sub('<p class="location">', '', str(html[n]))
    return html
As rightly pointed out in the comments, you should use a dedicated library to parse HTML and extract text. Here are some examples:
html2text: Limited functionality, but exactly what you need.
BeautifulSoup: More complex, more powerful.
Assuming all you want is to extract the data contained in <p class="location"> tags, you could use a quick & dirty (but correct) approach with the Python HTMLParser module (a simple HTML SAX parser), like this:
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    PLocationID=0
    PCount=0
    buf=""
    out=[]

    def handle_starttag(self, tag, attrs):
        if tag=="p":
            self.PCount+=1
            if ("class", "location") in attrs and self.PLocationID==0:
                self.PLocationID=self.PCount

    def handle_endtag(self, tag):
        if tag=="p":
            if self.PLocationID==self.PCount:
                self.out.append(self.buf)
                self.buf=""
                self.PLocationID=0
            self.PCount-=1

    def handle_data(self, data):
        if self.PLocationID:
            self.buf+=data

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed("""
<html>
<body>
<p>This won't appear!</p>
<p class="location">This <b>will</b></p>
<div>
<p class="location">This <span class="someclass">too</span></p>
<p>Even if <p class="location">nested Ps <p class="location"><b>shouldn't</b> <p>be allowed</p></p> <p>this will work</p></p> (this last text is out!)</p>
</div>
</body>
</html>
""")
print parser.out
Output:
['This will', 'This too', "nested Ps shouldn't be allowed this will work"]
This will extract all the text contained inside any <p class="location"> tag, stripping all the tags inside it. Separate tags (if not nested - which shouldn't be allowed anyhow for paragraphs) will have a separate entry in the out list.
Notice that for more complex requirements this can easily get out of hand; in those cases a DOM parser is way more appropriate.
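For readers on Python 3, the same idea can be sketched with html.parser. This is an adaptation under stated assumptions rather than the answer's original code: LocationTextParser and its attribute names are hypothetical, state is kept per instance instead of in class-level attributes, and the captured text is stripped, which the original did not do.

```python
from html.parser import HTMLParser


class LocationTextParser(HTMLParser):
    """Collect the text inside each <p class="location"> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # number of <p> tags currently open
        self.capture_depth = 0  # depth of the <p> being captured, 0 if none
        self.buf = []           # text pieces of the paragraph being captured
        self.out = []           # results, one entry per captured paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
            if self.capture_depth == 0 and ("class", "location") in attrs:
                self.capture_depth = self.depth

    def handle_endtag(self, tag):
        if tag == "p":
            if self.capture_depth and self.capture_depth == self.depth:
                # The captured paragraph just closed: store its text.
                self.out.append("".join(self.buf).strip())
                self.buf = []
                self.capture_depth = 0
            self.depth -= 1

    def handle_data(self, data):
        if self.capture_depth:
            self.buf.append(data)


parser = LocationTextParser()
parser.feed('<p class="location"> Atlanta, GA </p><p>ignored</p>')
parser.close()
print(parser.out)
```

Keeping out on the instance avoids two parsers sharing one result list, which is what a class-level out=[] would cause.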

POT file with tags instead of <dynamic element>

I'm trying to translate text out of a template file in a Pyramid project. More or less as in this example: http://docs.pylonsproject.org/projects/pyramid_cookbook/en/latest/chameleon_i18n.html
Now how do I get rid of the <dynamic element> in the comment of my .pot file? I'd like to see the rest of the code along with its tags.
My chameleon template (.pt):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" xmlns:tal="http://xml.zope.org/namespaces/tal"
xmlns:i18n="http://xml.zope.org/namespaces/i18n"
i18n:domain="MyDomain">
<head>
...
</head>
<body>
<div i18n:translate="MyID">
This will appear in the comments.
<span>This will NOT.</span>
While this will again appear.
</div>
</body>
</html>
I use Babel and Lingua to extract the messages with the following options in my setup.py:
message_extractors = { '.': [
    ('**.py', 'lingua_python', None ),
    ('**.pt', 'lingua_xml', None ),
]}
And the relevant output in my .pot file looks like this:
#. Default: This will appear in the comments. <dynamic element> While this will
#. again appear.
#: myproject/templates/base.pt:10
msgid "MyID"
msgstr ""
This is explicitly not supported: a translation should only contain the text - it should never contain markup. Otherwise you would have two problems:
translators could insert markup, which may break your site or create a security problem
a template toolkit would have no way to determine if any characters in a translation need to be escaped or should be output as-is.
It is common to need to translate items with dynamic components or markup inside them: for those you use the i18n:name attribute. For example you can do this:
<p i18n:translate="">This is <strong i18n:name="very" i18n:translate="">very</strong> important.</p>
That would give you two strings to translate: "This is ${very} important." and "very".

How can i remove <p> </p> with python sub

I have an html file and I want to replace the empty paragraphs with a space.
mystring = "This <p></p><p>is a test</p><p></p><p></p>"
result = mystring.sub("<p></p>" , " ")
This is not working.
Please, don't try to parse HTML with regular expressions. Use a proper parsing module, like htmlparser or BeautifulSoup to achieve this. "Suffer" a short learning curve now and benefit:
Your parsing code will be more robust, handling corner cases you may not have considered that will fail with a regex
For future HTML parsing/munging tasks, you will be empowered to do things faster, so eventually the time investment pays off as well.
You won't be sorry! Profit guaranteed!
I think it's always nice to give an example of how to do this with a real parser, as well as just repeating the sound advice that Eli Bendersky gives in his answer.
Here's an example of how to remove empty <p> elements using lxml. lxml's HTMLParser deals with HTML very well.
from lxml import etree
from StringIO import StringIO
input = '''This <p> </p><p>is a test</p><p></p><p><b>Bye.</b></p>'''
parser = etree.HTMLParser()
tree = etree.parse(StringIO(input), parser)
for p in tree.xpath("//p"):
    if len(p):
        continue
    t = p.text
    if not (t and t.strip()):
        p.getparent().remove(p)
print etree.tostring(tree.getroot(), pretty_print=True)
... which produces the output:
<html>
  <body>
    <p>This </p>
    <p>is a test</p>
    <p>
      <b>Bye.</b>
    </p>
  </body>
</html>
Note that I misread the question when replying to this, and I'm only removing the empty <p> elements, not replacing them with &nbsp;. With lxml, I'm not sure of a simple way to do this, so I've created another question to ask:
How can one replace an element with text in lxml?
I think a parsing module would be overkill for this particular problem; str.replace is enough:
>>> mystring = "This <p></p><p>is a test</p><p></p><p></p>"
>>> mystring.replace("<p></p>"," ")
'This  <p>is a test</p>  '
What if <p> is entered as <P>, or < p >, or has an attribute added, or is given using the empty tag syntax <P/>? Pyparsing's HTML tag support handles all of these variations:
from pyparsing import makeHTMLTags, replaceWith, withAttribute
mystring = 'This <p></p><p>is a test</p><p align="left"></p><P> </p><P/>'
p,pEnd = makeHTMLTags("P")
emptyP = p.copy().setParseAction(withAttribute(empty=True))
null_paragraph = emptyP | p+pEnd
null_paragraph.setParseAction(replaceWith(" "))
print null_paragraph.transformString(mystring)
Prints:
This <p>is a test</p>
Using a regexp?
import re
result = re.sub(r"<p>\s*</p>"," ", mystring, flags=re.MULTILINE)
Compile the regexp if you use it often.
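A minimal sketch of the compiled variant (Python 3), assuming that tag case and attributes should be tolerated as well. EMPTY_P and blank_empty_paragraphs are hypothetical names, and this remains a regular expression: it does not handle the <p/> empty-tag form, a > inside an attribute value, or paragraphs that only contain other empty elements.

```python
import re

# Opening <p> with optional attributes, only whitespace, then the closing tag.
# \b stops the pattern from matching other tags such as <pre>.
EMPTY_P = re.compile(r"<p\b[^>]*>\s*</p\s*>", re.IGNORECASE)


def blank_empty_paragraphs(html, replacement=" "):
    """Replace every empty paragraph in html with replacement."""
    return EMPTY_P.sub(replacement, html)


mystring = "This <p></p><p>is a test</p><p></p><p></p>"
result = blank_empty_paragraphs(mystring)
print(repr(result))
```

Compiling once at module level means the pattern is parsed a single time, however often blank_empty_paragraphs is called.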
I wrote this code:
from lxml import etree
from StringIO import StringIO
html_tags = """<div><ul><li>PID temperature controller</li> <li>Smart and reliable</li> <li>Auto-diagnosing</li> <li>Auto setting</li> <li>Intelligent control</li> <li>2-Rows 4-Digits LED display</li> <li>Widely applied in the display and control of the parameter of temperature, pressure, flow, and liquid level</li> <li> </li> <p> </p></ul> <div> </div></div>"""
document = etree.iterparse(StringIO(html_tags), html=True)
for a, e in document:
    if not (e.text and e.text.strip()) and len(e) == 0:
        e.getparent().remove(e)
print etree.tostring(document.root)
