HTMLParsing in Python - python

So i have a need to process some HTML in Python, and my requirement is that i need to find a certain tag and replace it with different charecter based on the content of the charecters...
<html>
<Head>
</HEAD>
<body>
<blah>
<_translate attr="french"> I am no one,
and no where <_translate>
<Blah/>
</body>
</html>
Should become
<html>
<Head>
</HEAD>
<body>
<blah>
Je suis personne et je suis nulle part
<Blah/>
</body>
</html>
I would like to leave the original HTML untouched an only replace the tags labeled 'important-tag'. Attributes and the contents of that tag will be important to generate the tags output.
I had though about using extending HTMLParser Object but I am having trouble getting out the orginal HTML when i want it. I think what i most want is to parse the HTML into tokens, with the orginal text in each token so i can output my desired output ... i.e. get somthing like
(tag, "<html>")
(data, "\n ")
(tag, "<head>")
(data, "\n ")
(end-tag,"</HEAD>")
ect...
ect...
Anyone know of a good pythonic way to accomplish this ? Python 2.7 standard libs are prefered, third party would also be useful to consider...
Thanks!

You can use lxml to perform such a task http://lxml.de/tutorial.html and use XPath to navigate easily trough your html:
from lxml.html import fromstring
my_html = "HTML CONTENT"
root = fromstring(my_html)
nodes_to_process = root.xpath("//_translate")
for node in nodes_to_process:
lang = node.attrib["attr"]
translate = AWESOME_TRANSLATE(node.text, lang)
node.parent.text = translate
I'll leave up to you the implementation of the AWESOME_TRANSLATE function ;)

Related

HTML Output in Pyscript

I was experimenting in Pyscript and I tried to print an HTML table, but it didn't work. It seems to delete the tags and mantain just the plain text.
Why is that? I tried to search online, but being a new technology i didn't find much.
This is my code:
<py-script>
print("<table>")
for i in range (2):
print("<tr>")
for j in range (2):
print("<td>test</td>")
print("</tr>")
print("</table>")
</py-script>
And this is the output I get:
I tried to replace the print() method with the pyscript.write() method, but it didn't work too.
I dig in source code pyscript.py
and at this moment works for me only code similar to JavaScript
For example this adds <h1>Hello</h1>
<div id="output"></div>
<py-script>
element = document.createElement('h1')
element.innerText = "Hello"
document.getElementById("output").append(element)
</py-script>
Full working code
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>PyScript Demo</title>
<!--<link rel="stylesheet" href="https://pyscript.net/alpha/pyscript.css" />-->
<script defer src="https://pyscript.net/alpha/pyscript.js"></script>
</head>
<body>
<div id="output"></div>
<py-script>
element = document.createElement('h1')
element.innerText = "Hello"
document.getElementById("output").append(element)
</py-script>
</body>
</html>
EDIT:
After digging in source code I found that pyscript.js runs function htmlDecode() which removes all tags from code in <py-script> (and probably it also removes tags when you load code from file) and this makes problem.
See Pyscript issue: [BUG] print() doesn't output HTML tags. · Issue #347 · pyscript/pyscript
Some workaround is to use some replacement - ie. {{ }} instead of < > in code - and later use code to replace it back to < >
print( "{{h1}}Hello{{/h1}}".replace("{{", "<").replace("}}", ">") )
or more universal - using function for this
def HTML(text):
return text.replace("{{", "<").replace("}}", ">")
print( HTML("{{h1}}Hello{{/h1}}") )
pyscript.write(some_id, HTML("{{h1}}Hello{{/h1}}") )
document.getElementById(some_id).innerHTML = HTML("{{h1}}Hello{{/h1}}")
Sometimes problem can be also pyscript.css which redefines some items and ie. <h1> looks like normal text.
One solution is to remove pyscript.css.
Other solution is to use classes from pyscript.css like in examples/index.html
<h1 class="text-4xl font-bold">Hello World</h1>
which means
print( HTML('{{h1 class="text-4xl font-bold"}}Hello{{/h1}}') )

Custom Python Dominate Tag Element

To support both a JPEG and WEBP compressed image, I'd like to include the following HTML code in a web page:
<picture>
<source srcset="img/awesomeWebPImage.webp" type="image/webp">
<source srcset="img/creakyOldJPEG.jpg" type="image/jpeg">
<img src="img/creakyOldJPEG.jpg" alt="Alt Text!">
</picture>
I've been using Python Dominate and it has generally worked well for me.
But the Picture and Source tags I think are not supported by Dominate.
I could add the HTML as a raw() Dominate tag, but was wondering if there was a way to get Dominate to recognize these tags.
p = picture()
with p:
source(srcset=image.split('.')[0]+'.webp', type="image/webp")
source(srcset=image, type="image/jpeg")
img(src=image, alt=imagealt)
I am seeing this kind of error:
p = picture()
NameError: global name 'picture' is not defined
You can create a picture class by inheriting from the dominate.tags.html_tag class
from dominate.tags import html_tag
class picture(html_tag):
pass
This can now be used as any of the predefined tags.
Dominate is used to generate HTML(5) documents.
The list of elements are defined in the tags.py file, see the repository in GitHub: https://github.com/Knio/dominate/blob/master/dominate/tags.py.
But, picture is not a standard tag.
You may look at the lxml library which contains a ElementMaker similar to Dominate to build XML tree easily. See the E-Factory.
For instance:
>>> from lxml.builder import E
>>> def CLASS(*args): # class is a reserved word in Python
... return {"class":' '.join(args)}
>>> html = page = (
... E.html( # create an Element called "html"
... E.head(
... E.title("This is a sample document")
... ),
... E.body(
... E.h1("Hello!", CLASS("title")),
... E.p("This is a paragraph with ", E.b("bold"), " text in it!"),
... E.p("This is another paragraph, with a", "\n ",
... E.a("link", href="http://www.python.org"), "."),
... E.p("Here are some reserved characters: <spam&egg>."),
... etree.XML("<p>And finally an embedded XHTML fragment.</p>"),
... )
... )
... )
>>> print(etree.tostring(page, pretty_print=True))
<html>
<head>
<title>This is a sample document</title>
</head>
<body>
<h1 class="title">Hello!</h1>
<p>This is a paragraph with <b>bold</b> text in it!</p>
<p>This is another paragraph, with a
link.</p>
<p>Here are some reserved characters: <spam&egg>.</p>
<p>And finally an embedded XHTML fragment.</p>
</body>
</html>

Genshi: for loop inserts line breaks

Source code: I have the following program.
import genshi
from genshi.template import MarkupTemplate
html = '''
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:py="http://genshi.edgewall.org/">
<head>
</head>
<body>
<py:for each="i in range(3)">
<py:choose>
<del py:when="i == 1">
${i}
</del>
<py:otherwise>
${i}
</py:otherwise>
</py:choose>
</py:for>
</body>
</html>
'''
template = MarkupTemplate(html)
stream = template.generate()
html = stream.render('html')
print(html)
Expected output: the numbers are printed consecutively with no whitespace (and most critically no line-break) between them.
<html>
<head>
</head>
<body>
0<del>1</del>2
</body>
</html>
Actual output: It outputs the following:
<html>
<head>
</head>
<body>
0
<del>1</del>
2
</body>
</html>
Question: How do I eliminate the line-breaks? I can deal with the leading whitespace by stripping it from the final HTML, but I don't know how to get rid of the line-breaks. I need the contents of the for loop to be displayed as a single continuous "word" (e.g. 012 instead of 0 \n 1 \n 2).
What I've tried:
Reading the Genshi documentation.
Searching StackOverflow
Searching Google
Using a <?python ...code... ?> code block. This doesn't work since the carets in the <del> tags are escaped and displayed.
<?python
def numbers():
n = ''
for i in range(3):
if i == 1:
n += '<del>{i}</del>'.format(i=i)
else:
n += str(i)
return n
?>
${numbers()}
Produces 0<del>1</del>2
I also tried this, but using genshi.builder.Element('del') instead. The results are the same, and I was able to conclusively determine that the string returned by numbers() is being escaped after the return occurs.
A bunch of other things that I can't recall at the moment.
Not ideal, but I did finally find an acceptable solution. The trick is to put the closing caret for a given tag on the next line right before the next tag's opening caret.
<body>
<py:for each="i in range(3)"
><py:choose
><del py:when="i == 1">${i}</del
><py:otherwise>${i}</py:otherwise
></py:choose
</py:for>
</body>
Source: https://css-tricks.com/fighting-the-space-between-inline-block-elements/
If anyone has a better approach I'd love to hear it.

How to wrap custom <root> element around whole HTML document?

I have a large number of HTML documents that must be converted to XML. Not all may look exactly the same. For example, the sample below ends with an HTML comment tag, not with the HTML tag.
Note this question is related to this one.
Here is my code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<comment>this is an HTML comment</comment>
<comment>this is another HTML comment</comment>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
...
<comment>here is a comment inside the head tag</comment>
</head>
<body>
...
<comment>Comment inside body tag</comment>
<comment>Another comment inside body tag</comment>
<comment>There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample.</comment>
</body>
</html>
<comment>This comment is the last line of the file</comment>
I wish to wrap the entire document with a custom tag called <root>. So far, the best I can do is wrap <root> around <html>.
root_tag = bs4.Tag(name="root")
soup.html.wrap(root_tag)
How can I position the <root> element such that it wraps the entire document?
A little crude, as this is just wrapping any given file in <root> </root>
See if it works for your use case:
def root_wrap(file):
fin = open(file, 'r+')
fin.write('<root>')
for line in fin:
fin.write(line)
fin.write('</root>')
fin.close()

POT file with tags instead of <dynamic element>

I'm trying to translate text out of a template file in a Pyramid project. More or less as in this example: http://docs.pylonsproject.org/projects/pyramid_cookbook/en/latest/chameleon_i18n.html
Now how do I get rid of the <dynamic element> in the comment of my .pot file? I'd like to see the rest of the code along with its tags.
My chameleon template (.pt):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" xmlns:tal="http://xml.zope.org/namespaces/tal"
xmlns:i18n="http://xml.zope.org/namespaces/i18n"
i18n:domain="MyDomain">
<head>
...
</head>
<body>
<div i18n:translate="MyID">
This will appear in the comments.
<span>This will NOT.</span>
While this will again appear.
</div>
</body>
</html>
I use Babel and Lingua to extract the messages with the following options in my setup.py:
message_extractors = { '.': [
('**.py', 'lingua_python', None ),
('**.pt', 'lingua_xml', None ),
]}
And the relevant output in my .pot file looks like this:
#. Default: This will appear in the comments. <dynamic element> While this will
#. again appear.
#: myproject/templates/base.pt:10
msgid "MyID"
msgstr ""
This is explicitly not supported: a translation should only contain the text - it should never contain markup. Otherwise you would have two problems:
translators could insert markup, which may break your site or create a security problem
a template toolkit would have no way to determine if any characters in a translation
need to be escaped or should be output as-is.
It is common to need to translate items with dynamic components or markup inside them: for those you use the i18n:name attribute. For example you can do this:
<p i18n:translate="">This is <strong i18n:name="very" i18n:translate="">very</strong> important.
That would give you two strings to translate: This is ${very} string and very.

Categories

Resources