I have a large number of HTML documents that must be converted to XML. Not all may look exactly the same. For example, the sample below ends with an HTML comment tag, not with the HTML tag.
Note this question is related to this one.
Here is my code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<comment>this is an HTML comment</comment>
<comment>this is another HTML comment</comment>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
...
<comment>here is a comment inside the head tag</comment>
</head>
<body>
...
<comment>Comment inside body tag</comment>
<comment>Another comment inside body tag</comment>
<comment>There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample.</comment>
</body>
</html>
<comment>This comment is the last line of the file</comment>
I wish to wrap the entire document with a custom tag called <root>. So far, the best I can do is wrap <root> around <html>.
root_tag = bs4.Tag(name="root")
soup.html.wrap(root_tag)
How can I position the <root> element such that it wraps the entire document?
A little crude, as this is just wrapping any given file in <root> </root>
See if it works for your use case:
def root_wrap(file):
fin = open(file, 'r+')
fin.write('<root>')
for line in fin:
fin.write(line)
fin.write('</root>')
fin.close()
Related
Source code: I have the following program.
import genshi
from genshi.template import MarkupTemplate
html = '''
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:py="http://genshi.edgewall.org/">
<head>
</head>
<body>
<py:for each="i in range(3)">
<py:choose>
<del py:when="i == 1">
${i}
</del>
<py:otherwise>
${i}
</py:otherwise>
</py:choose>
</py:for>
</body>
</html>
'''
template = MarkupTemplate(html)
stream = template.generate()
html = stream.render('html')
print(html)
Expected output: the numbers are printed consecutively with no whitespace (and most critically no line-break) between them.
<html>
<head>
</head>
<body>
0<del>1</del>2
</body>
</html>
Actual output: It outputs the following:
<html>
<head>
</head>
<body>
0
<del>1</del>
2
</body>
</html>
Question: How do I eliminate the line-breaks? I can deal with the leading whitespace by stripping it from the final HTML, but I don't know how to get rid of the line-breaks. I need the contents of the for loop to be displayed as a single continuous "word" (e.g. 012 instead of 0 \n 1 \n 2).
What I've tried:
Reading the Genshi documentation.
Searching StackOverflow
Searching Google
Using a <?python ...code... ?> code block. This doesn't work since the carets in the <del> tags are escaped and displayed.
<?python
def numbers():
n = ''
for i in range(3):
if i == 1:
n += '<del>{i}</del>'.format(i=i)
else:
n += str(i)
return n
?>
${numbers()}
Produces 0<del>1</del>2
I also tried this, but using genshi.builder.Element('del') instead. The results are the same, and I was able to conclusively determine that the string returned by numbers() is being escaped after the return occurs.
A bunch of other things that I can't recall at the moment.
Not ideal, but I did finally find an acceptable solution. The trick is to put the closing caret for a given tag on the next line right before the next tag's opening caret.
<body>
<py:for each="i in range(3)"
><py:choose
><del py:when="i == 1">${i}</del
><py:otherwise>${i}</py:otherwise
></py:choose
</py:for>
</body>
Source: https://css-tricks.com/fighting-the-space-between-inline-block-elements/
If anyone has a better approach I'd love to hear it.
Consider the following snippet:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>©</title></head>
<body></body>
</html>
It is deemed valid XHTML 1.0 Transitional per W3C's validator (https://validator.w3.org/). However, Python (3.7)'s ElementTree chokes on it with
$ python -c 'from xml.etree import ElementTree as ET; ET.parse("foo.html")'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.7/xml/etree/ElementTree.py", line 1197, in parse
tree.parse(source, parser)
File "/usr/lib/python3.7/xml/etree/ElementTree.py", line 598, in parse
self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: undefined entity ©: line 4, column 15
Note that © is indeed an entity defined (ultimately) in xhtml-lat1.ent.
Is there a way to parse such documents using ElementTree? An answer to a similar question suggested manually prepending the appropritate XML definitions to the HTML content (e.g. <!ENTITY nbsp ' '>) but that's not really a general solution (unless one prepends a header with all definitions to any document, but it seems like there should be something simpler?).
Thanks in advance.
Consider about lxml?
from lxml import html
root = html.fromstring("""
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>©</title></head>
<body></body>
</html>
""".strip())
print(root.head.getchildren()[0].text)
# '©'
© is not valid in xml. xml package really parse xml but not html. Actually built-in html parser do can parse this content:
from html.parser import HTMLParser
parser = HTMLParser()
parser.feed("""
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>©</title></head>
<body></body>
</html>
""".strip())
# no error
But its api is really difficult to use lol. lxml provides an equivalent api.
I am trying to replace every occurrence of an XML tag in a document (call it the target) with the contents of a tag in a different document (call it the source). The tag from the source could contain just text, or it could contain more XML.
Here is a simple example of what I am not able to get working:
test-source.htm:
<?xml version="1.0" encoding="utf-8"?>
<html>
<head>
</head>
<body>
<srctxt>text to be added</srctxt>
</body>
</html>
test-target.htm:
<?xml version="1.0" encoding="utf-8"?>
<html>
<head>
</head>
<body>
<replacethis src="test-source.htm"></replacethis>
<p>irrelevant, just here for filler</p>
<replacethis src="test-source.htm"></replacethis>
</body>
</html>
replace_example.py:
import os
import re
from bs4 import BeautifulSoup
# Just for testing
source_file = "test-source.htm"
target_file = "test-target.htm"
with open(source_file) as s:
source = BeautifulSoup(s, "lxml")
with open(target_file) as t:
target = BeautifulSoup(t, "lxml")
source_tag = source.srctxt
for tag in target():
for attribute in tag.attrs:
if re.search(source_file, str(tag[attribute])):
tag.replace_with(source_tag)
with open(target_file, "w") as w:
w.write(str(target))
This is my unfortunate test-target.htm after running replace_example.py
<?xml version="1.0" encoding="utf-8"?><html>
<head>
</head>
<body>
<p>irrelevant, just here for filler</p>
<srctxt>text to be added</srctxt>
</body>
</html>
The first replacethis tag is now gone and the second replacethis tag has been replaced. This same problem happens with "insert" and "insert_before".
The output I want is:
<?xml version="1.0" encoding="utf-8"?><html>
<head>
</head>
<body>
<srctxt>text to be added</srctxt>
<p>irrelevant, just here for filler</p>
<srctxt>text to be added</srctxt>
</body>
</html>
Can someone please point me in the right direction?
Additional Complications: The example above is the simplest case where I could reproduce the problem I seem to be having with BeautifulSoup, but it does not convey the full detail of the problem I'm trying to solve. Actually, I have a list of targets and sources. The replacethis tag needs to be replaced by the contents of a source only if the src attribute contains a reference to a source in the list. So I could use the replace method, but it would require writing a lot more regex than if I could convince BeautifulSoup to work. If this problem is a BeautifulSoup bug, then maybe I'll just have to write the regex instead.
You could use another parser (html.parser) if you want to get rid of extra tags.
BS4's replace_with behavior looks like some bug in library.
As a partial solution you can just call
target_text.replace('<replacethis></replacethis>', source_text)
First, it is highly advised to not use regex on [X]HTML documents. Since you are modifying XML content, consider an lxml solution which you do have installed being the parsing engine in your BeautifulSoup calls. No for or if logic needed for this approach.
Specifically, consider XSLT, the special-purpose language, designed to transform XML into other XML, HTML, even json/csv/txt files. XSLT maintains the document() function allowing you to parse across documents. Python's lxml can run XSLT 1.0 scripts.
XSLT (save as .xsl in same folder as source file, adjust 'replacethis' and 'srctxt' names)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes" method="xml"/>
<xsl:strip-space elements="*"/>
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
<!-- UPDATE <replacethis> TAG WITH <srctxt> FROM SOURCE -->
<xsl:template match="replacethis">
<xsl:copy-of select="document('test-source.htm')/html/body/srctxt"/>
</xsl:template>
</xsl:stylesheet>
Python
import lxml.etree as et
# LOAD XML AND XSL SOURCES
doc = et.parse('test-target.htm')
xsl = et.parse('XSLTScript.xsl')
# TRANSFORM SOURCE
transform = et.XSLT(xsl)
result = transform(doc)
# OUTPUT TO SCREEN
print(result)
# OUTPUT TO FILE
with open('test-target.htm', 'wb') as f:
f.write(result)
Output
<?xml version="1.0"?>
<html>
<head/>
<body>
<srctxt>text to be added</srctxt>
<p>irrelevant, just here for filler</p>
<srctxt>text to be added</srctxt>
</body>
</html>
So i have a need to process some HTML in Python, and my requirement is that i need to find a certain tag and replace it with different charecter based on the content of the charecters...
<html>
<Head>
</HEAD>
<body>
<blah>
<_translate attr="french"> I am no one,
and no where <_translate>
<Blah/>
</body>
</html>
Should become
<html>
<Head>
</HEAD>
<body>
<blah>
Je suis personne et je suis nulle part
<Blah/>
</body>
</html>
I would like to leave the original HTML untouched an only replace the tags labeled 'important-tag'. Attributes and the contents of that tag will be important to generate the tags output.
I had though about using extending HTMLParser Object but I am having trouble getting out the orginal HTML when i want it. I think what i most want is to parse the HTML into tokens, with the orginal text in each token so i can output my desired output ... i.e. get somthing like
(tag, "<html>")
(data, "\n ")
(tag, "<head>")
(data, "\n ")
(end-tag,"</HEAD>")
ect...
ect...
Anyone know of a good pythonic way to accomplish this ? Python 2.7 standard libs are prefered, third party would also be useful to consider...
Thanks!
You can use lxml to perform such a task http://lxml.de/tutorial.html and use XPath to navigate easily trough your html:
from lxml.html import fromstring
my_html = "HTML CONTENT"
root = fromstring(my_html)
nodes_to_process = root.xpath("//_translate")
for node in nodes_to_process:
lang = node.attrib["attr"]
translate = AWESOME_TRANSLATE(node.text, lang)
node.parent.text = translate
I'll leave up to you the implementation of the AWESOME_TRANSLATE function ;)
I'm trying to translate text out of a template file in a Pyramid project. More or less as in this example: http://docs.pylonsproject.org/projects/pyramid_cookbook/en/latest/chameleon_i18n.html
Now how do I get rid of the <dynamic element> in the comment of my .pot file? I'd like to see the rest of the code along with its tags.
My chameleon template (.pt):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" xmlns:tal="http://xml.zope.org/namespaces/tal"
xmlns:i18n="http://xml.zope.org/namespaces/i18n"
i18n:domain="MyDomain">
<head>
...
</head>
<body>
<div i18n:translate="MyID">
This will appear in the comments.
<span>This will NOT.</span>
While this will again appear.
</div>
</body>
</html>
I use Babel and Lingua to extract the messages with the following options in my setup.py:
message_extractors = { '.': [
('**.py', 'lingua_python', None ),
('**.pt', 'lingua_xml', None ),
]}
And the relevant output in my .pot file looks like this:
#. Default: This will appear in the comments. <dynamic element> While this will
#. again appear.
#: myproject/templates/base.pt:10
msgid "MyID"
msgstr ""
This is explicitly not supported: a translation should only contain the text - it should never contain markup. Otherwise you would have two problems:
translators could insert markup, which may break your site or create a security problem
a template toolkit would have no way to determine if any characters in a translation
need to be escaped or should be output as-is.
It is common to need to translate items with dynamic components or markup inside them: for those you use the i18n:name attribute. For example you can do this:
<p i18n:translate="">This is <strong i18n:name="very" i18n:translate="">very</strong> important.
That would give you two strings to translate: This is ${very} string and very.