Parsing XHTML including standard entities using ElementTree - python

Consider the following snippet:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>&copy;</title></head>
<body></body>
</html>
It is deemed valid XHTML 1.0 Transitional by W3C's validator (https://validator.w3.org/). However, ElementTree in Python 3.7 chokes on it:
$ python -c 'from xml.etree import ElementTree as ET; ET.parse("foo.html")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.7/xml/etree/ElementTree.py", line 1197, in parse
    tree.parse(source, parser)
  File "/usr/lib/python3.7/xml/etree/ElementTree.py", line 598, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: undefined entity &copy;: line 4, column 15
Note that &copy; is indeed an entity defined (ultimately) in xhtml-lat1.ent.
Is there a way to parse such documents using ElementTree? An answer to a similar question suggested manually prepending the appropriate XML definitions to the HTML content (e.g. <!ENTITY nbsp '&#160;'>), but that's not really a general solution (unless one prepends a header with all definitions to every document, and it seems like there should be something simpler).
Thanks in advance.

Have you considered lxml?
from lxml import html
root = html.fromstring("""
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>&copy;</title></head>
<body></body>
</html>
""".strip())
print(root.head.getchildren()[0].text)
# '©'
&copy; is not one of the entities predefined by XML, and the xml package is a strict XML parser, not an HTML parser. The built-in HTML parser, on the other hand, can parse this content:
from html.parser import HTMLParser
parser = HTMLParser()
parser.feed("""
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>&copy;</title></head>
<body></body>
</html>
""".strip())
# no error
But its API is really difficult to use. lxml provides an API equivalent to ElementTree's.
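If staying with ElementTree is a requirement, one option is to teach its parser the XHTML entity names up front. The following is a minimal sketch, assuming that CPython's XMLParser exposes its entity table as the `entity` dict (long-standing behaviour, but only lightly documented) and that the document carries a DOCTYPE, as the one in the question does. `XHTML_DOC` is a hypothetical stand-in for the contents of foo.html.

```python
import html.entities
from xml.etree import ElementTree as ET

# Hypothetical stand-in for the contents of foo.html.
XHTML_DOC = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>&copy;</title></head>
<body></body>
</html>"""

parser = ET.XMLParser()
# Assumption: parser.entity is a dict consulted for otherwise undefined entities.
for name, codepoint in html.entities.name2codepoint.items():
    parser.entity[name] = chr(codepoint)

root = ET.fromstring(XHTML_DOC, parser=parser)
# The default namespace becomes part of every tag name.
title = root.find(".//{http://www.w3.org/1999/xhtml}title")
print(title.text)
```

A parser object can only be used once, so a fresh one would be needed for each document; for a file, the same parser can be passed as ET.parse("foo.html", parser=parser).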

Related

How to get a href attribute value in xml content (atom feed)?

I'm saving the content (atom feed / xml content) from a get request as content = response.text and the content looks like this:
<feed xmlns="http://www.w3.org/2005/Atom">
<title type="text">title-a</title>
<subtitle type="text">content: application/abc</subtitle>
<updated>2021-08-05T16:29:20.202Z</updated>
<id>tag:tag-a,2021-08:27445852</id>
<generator uri="uri-a" version="v-5.1.0.3846329218047">abc</generator>
<author>
<name>name-a</name>
<email>email-a</email>
</author>
<link href="url-a" rel="self"/>
<link href="url-b" rel="next"/>
<link href="url-c" rel="previous"/>
</feed>
How can I get the value "url-b" of the href attribute with rel="next" ?
I tried it with the ElementTree module, for example:
import requests
from xml.etree import ElementTree
response = requests.get("myurl", headers={"Authorization": f"Bearer {my_access_token}"})
content = response.text
tree = ElementTree.fromstring(content)
tree.find('.//link[@rel="next"]')
# or
tree.find('./link').attrib['href']
but that didn't work.
I appreciate any help and thank you in advance.
If there is an easier, simpler solution (maybe feedparser) I welcome that too.
How can I get the value "url-b" of the href attribute with rel="next" ?
see below
from xml.etree import ElementTree as ET
xml = '''<feed xmlns="http://www.w3.org/2005/Atom">
<title type="text">title-a</title>
<subtitle type="text">content: application/abc</subtitle>
<updated>2021-08-05T16:29:20.202Z</updated>
<id>tag:tag-a,2021-08:27445852</id>
<generator uri="uri-a" version="v-5.1.0.3846329218047">abc</generator>
<author>
<name>name-a</name>
<email>email-a</email>
</author>
<link href="url-a" rel="self"/>
<link href="url-b" rel="next"/>
<link href="url-c" rel="previous"/>
</feed>'''
root = ET.fromstring(xml)
links = root.findall('.//{http://www.w3.org/2005/Atom}link[@rel="next"]')
for link in links:
    print(f'{link.attrib["href"]}')
output
url-b
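You can use this XPath 1.0 expression (it needs a full XPath engine such as lxml, because ElementTree's XPath subset does not support local-name()):
/*[local-name()="feed"]/*[local-name()="link" and @rel="next"]/@href
This should result in "url-b".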
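With only the standard library, a prefix-to-namespace mapping keeps the ElementTree query readable. This is a small sketch rather than the definitive approach; the `atom` prefix, the `ATOM_NS` dict, and the `next_link` helper are names chosen here for illustration, and the feed is shortened to the link elements.

```python
from xml.etree import ElementTree as ET

# Hypothetical prefix mapping; any prefix works as long as the URI matches.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

feed_xml = """<feed xmlns="http://www.w3.org/2005/Atom">
<title type="text">title-a</title>
<link href="url-a" rel="self"/>
<link href="url-b" rel="next"/>
<link href="url-c" rel="previous"/>
</feed>"""


def next_link(xml_text):
    """Return the href of the rel="next" link, or None if there is none."""
    root = ET.fromstring(xml_text)
    link = root.find('atom:link[@rel="next"]', ATOM_NS)
    return None if link is None else link.get("href")


print(next_link(feed_xml))
```

The original attempts failed because the unqualified name link does not match elements that live in the Atom default namespace.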

Use BeautifulSoup to Replace Every Occurrence of XML Tag with Another Tag

I am trying to replace every occurrence of an XML tag in a document (call it the target) with the contents of a tag in a different document (call it the source). The tag from the source could contain just text, or it could contain more XML.
Here is a simple example of what I am not able to get working:
test-source.htm:
<?xml version="1.0" encoding="utf-8"?>
<html>
<head>
</head>
<body>
<srctxt>text to be added</srctxt>
</body>
</html>
test-target.htm:
<?xml version="1.0" encoding="utf-8"?>
<html>
<head>
</head>
<body>
<replacethis src="test-source.htm"></replacethis>
<p>irrelevant, just here for filler</p>
<replacethis src="test-source.htm"></replacethis>
</body>
</html>
replace_example.py:
import os
import re
from bs4 import BeautifulSoup
# Just for testing
source_file = "test-source.htm"
target_file = "test-target.htm"
with open(source_file) as s:
    source = BeautifulSoup(s, "lxml")
with open(target_file) as t:
    target = BeautifulSoup(t, "lxml")
source_tag = source.srctxt
for tag in target():
    for attribute in tag.attrs:
        if re.search(source_file, str(tag[attribute])):
            tag.replace_with(source_tag)
with open(target_file, "w") as w:
    w.write(str(target))
This is my unfortunate test-target.htm after running replace_example.py
<?xml version="1.0" encoding="utf-8"?><html>
<head>
</head>
<body>
<p>irrelevant, just here for filler</p>
<srctxt>text to be added</srctxt>
</body>
</html>
The first replacethis tag is now gone and the second replacethis tag has been replaced. This same problem happens with "insert" and "insert_before".
The output I want is:
<?xml version="1.0" encoding="utf-8"?><html>
<head>
</head>
<body>
<srctxt>text to be added</srctxt>
<p>irrelevant, just here for filler</p>
<srctxt>text to be added</srctxt>
</body>
</html>
Can someone please point me in the right direction?
Additional Complications: The example above is the simplest case where I could reproduce the problem I seem to be having with BeautifulSoup, but it does not convey the full detail of the problem I'm trying to solve. Actually, I have a list of targets and sources. The replacethis tag needs to be replaced by the contents of a source only if the src attribute contains a reference to a source in the list. So I could use the replace method, but it would require writing a lot more regex than if I could convince BeautifulSoup to work. If this problem is a BeautifulSoup bug, then maybe I'll just have to write the regex instead.
You could use another parser (html.parser) if you want to get rid of extra tags.
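BS4's replace_with behaviour looks like a bug in the library, though it is more likely its normal move semantics: a Tag object can occupy only one position in a tree, so inserting the same source_tag a second time takes it away from the first position.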
As a partial solution you can just call
target_text.replace('<replacethis></replacethis>', source_text)
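Another way to get the desired output is to insert a fresh copy of the source tag at every position instead of the same object. The sketch below assumes a BeautifulSoup version in which copy.copy() works on tags (4.4.0 or later) and uses inline strings and the html.parser backend in place of the two files.

```python
import copy

from bs4 import BeautifulSoup

# Inline stand-ins for test-source.htm and test-target.htm.
source = BeautifulSoup(
    "<html><body><srctxt>text to be added</srctxt></body></html>",
    "html.parser",
)
target = BeautifulSoup(
    '<html><body><replacethis src="test-source.htm"></replacethis>'
    "<p>irrelevant, just here for filler</p>"
    '<replacethis src="test-source.htm"></replacethis></body></html>',
    "html.parser",
)

source_tag = source.srctxt
# find_all returns a list, so replacing while looping is safe.
for tag in target.find_all("replacethis", src="test-source.htm"):
    # A copy per position: the original tag would be moved, not duplicated.
    tag.replace_with(copy.copy(source_tag))

result = str(target)
print(result)
```

For the list of sources mentioned in the question, the src filter could be replaced by a check of tag["src"] against that list.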
First, it is highly advised to not use regex on [X]HTML documents. Since you are modifying XML content, consider an lxml solution which you do have installed being the parsing engine in your BeautifulSoup calls. No for or if logic needed for this approach.
Specifically, consider XSLT, the special-purpose language designed to transform XML into other XML, HTML, or even JSON/CSV/text files. XSLT provides the document() function, which lets you read from other documents. Python's lxml can run XSLT 1.0 scripts.
XSLT (save as .xsl in same folder as source file, adjust 'replacethis' and 'srctxt' names)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="xml"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- UPDATE <replacethis> TAG WITH <srctxt> FROM SOURCE -->
  <xsl:template match="replacethis">
    <xsl:copy-of select="document('test-source.htm')/html/body/srctxt"/>
  </xsl:template>
</xsl:stylesheet>
Python
import lxml.etree as et
# LOAD XML AND XSL SOURCES
doc = et.parse('test-target.htm')
xsl = et.parse('XSLTScript.xsl')
# TRANSFORM SOURCE
transform = et.XSLT(xsl)
result = transform(doc)
# OUTPUT TO SCREEN
print(result)
# OUTPUT TO FILE
with open('test-target.htm', 'wb') as f:
    f.write(result)
Output
<?xml version="1.0"?>
<html>
<head/>
<body>
<srctxt>text to be added</srctxt>
<p>irrelevant, just here for filler</p>
<srctxt>text to be added</srctxt>
</body>
</html>

How to wrap custom <root> element around whole HTML document?

I have a large number of HTML documents that must be converted to XML. Not all may look exactly the same. For example, the sample below ends with an HTML comment tag, not with the HTML tag.
Note this question is related to this one.
Here is my code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<comment>this is an HTML comment</comment>
<comment>this is another HTML comment</comment>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
...
<comment>here is a comment inside the head tag</comment>
</head>
<body>
...
<comment>Comment inside body tag</comment>
<comment>Another comment inside body tag</comment>
<comment>There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample.</comment>
</body>
</html>
<comment>This comment is the last line of the file</comment>
I wish to wrap the entire document with a custom tag called <root>. So far, the best I can do is wrap <root> around <html>.
root_tag = bs4.Tag(name="root")
soup.html.wrap(root_tag)
How can I position the <root> element such that it wraps the entire document?
This is a little crude, as it just wraps any given file in <root> </root>.
See if it works for your use case:
def root_wrap(file):
    # Read the whole document first, then rewrite it wrapped in <root>.
    with open(file, 'r') as fin:
        content = fin.read()
    with open(file, 'w') as fout:
        fout.write('<root>\n')
        fout.write(content)
        fout.write('\n</root>\n')

Python DOCTYPE Syntax Error

#!/usr/bin/python
# Import modules for CGI handling
import cgi, cgitb
# Create instance of FieldStorage
form = cgi.FieldStorage()
name = form.getvalue('name')
age = int(form.getvalue('age')) + 1
print "Content-type: text/html"
print
print "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">"
print "<html>"
print "<head><title></title></head>"
print "<body>"
print "<p> Hello, %s</p>" % (name)
print "<p> Next year, you will be %s years old.</p>" % age
print "</body>"
print "</html>"
Whenever I write the DOCTYPE down, I get an Invalid Syntax error. Don't know what the problem is. Help would be appreciated since I'm new to python. Thank you!
Your quotes are conflicting (notice how the syntax highlighting breaks after that line).
Either use single quotes:
print '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" ' \
      '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
Or triple quote it:
print """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">"""
Use different quotes:
print '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
The print statement sees the quotes in the middle as ending quotes. You need to escape the inner quotes with \" or use different outer quotes.
print "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
You have double-quoted a string that already contains a double-quote. Python thinks your string ends after PUBLIC, and the next thing appears to be a minus sign followed by a division sign, which is an error. On top of that, you have broken the string into two lines without any continuation characters, which won't work. Use triple-quotes to allow a string to continue from one line to the next (this will also resolve your problem with the embedded " characters).
print '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'''
For this kind of long multiline text you might prefer triple quotes (""").
Coupled with the format string method available on any reasonably recent version of Python, you get a poor man's template engine:
tmpl = """Content-type: text/html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head><title></title></head>
<body>
<p> Hello, {name}</p>
<p> Next year, you will be {age} years old.</p>
</body>
</html>
"""
print tmpl.format(name='Sylvain', age=40)
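The answers above use the Python 2 print statement. For readers on Python 3, the same template idea might look like the sketch below. The `render_page` helper and the use of html.escape are additions made here for illustration, not part of the original answers; the blank line after the header is what separates CGI headers from the body.

```python
import html

PAGE_TEMPLATE = """Content-type: text/html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head><title></title></head>
<body>
<p> Hello, {name}</p>
<p> Next year, you will be {age} years old.</p>
</body>
</html>
"""


def render_page(name, age):
    """Fill the template; html.escape keeps user input from injecting markup."""
    return PAGE_TEMPLATE.format(name=html.escape(name), age=int(age) + 1)


page = render_page("Sylvain", 39)
print(page)
```

Because the template is triple-quoted, the double quotes inside the DOCTYPE need no escaping and the string may span several lines.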

How to control newline processing in the lxml xpath text() function?

Having switched from Fedora 17 to 18, I get different parsing behaviour for the same lxml code, apparently due to different versions of the underlying libraries (libxml2 and libxslt versions changed).
Here's an example of lxml code with different results for the two versions:
from io import BytesIO
from lxml import etree
myHtmlString = \
'<!doctype html public "-//w3c//dtd html 4.0 transitional//en">\r\n'+\
'<html>\r\n'+\
'<head>\r\n'+\
' <title>Title</title>\r\n'+\
'</head>\r\n'+\
'<body/>\r\n'+\
'</html>\r\n'
myFile = BytesIO(myHtmlString)
myTree = etree.parse(myFile, etree.HTMLParser())
myTextElements = myTree.xpath("//text()")
myFullText = ''.join([myEl for myEl in myTextElements])
assert myFullText == 'Title', repr(myFullText)
The f17 version passes the assert, i.e. xpath("//text()") only returns text 'Title', whereas the f18 version fails with output
Traceback (most recent call last):
  File "TestLxml.py", line 17, in <module>
    assert myFullText == 'Title', repr(myFullText)
AssertionError: '\r\n\r\n Title\r\n\r\n\r\n'
Apparently, the f18 version handles newlines and whitespace differently from the f17 version.
Is there a way to have control over this behaviour? (An optional argument somewhere?)
Or even better, is there a way in which I can get the old behaviour back using the new libraries?
In XML, text() returns the text inside the tags as is (unstripped), so any whitespace characters, tabs, or newlines are included.
It might be that, by constructing the multiline string with + and \r\n, you are accidentally testing two different strings.
Try changing your string to a triple-quoted string like the example below and test it again.
from io import BytesIO
from lxml import etree
html = '''
<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
<title>Title</title>
</head>
<body/>
</html>
'''
tree = etree.parse(BytesIO(html), etree.HTMLParser())
text_elements = tree.xpath("//text()")
full_text = ''.join(text_elements)
assert full_text == 'Title', repr(full_text)
You can also see that surrounding the text with spaces or newlines makes them part of what text() returns. See the title below.
html = '''
<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
<title> Title </title>
</head>
<body/>
</html>
'''
tree = etree.parse(BytesIO(html), etree.HTMLParser())
text_elements = tree.xpath("//text()")
full_text = ''.join(text_elements)
assert full_text == ' Title ', repr(full_text)
If you don't need the spaces you can always call strip() on the string yourself. If you're sure you're getting spaces even though your tags do not contain them, then you should report that as a bug on the lxml mailing list.
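To make the result independent of how a given parser version reports whitespace-only text nodes, each node can be stripped before joining. The sketch below uses the standard library's ElementTree on a well-formed version of the document so that it runs without lxml; `joined_text` is a hypothetical helper, and the same function should accept the list returned by lxml's xpath("//text()"), since those results are string objects.

```python
from xml.etree import ElementTree as ET

doc = (
    "<html>\r\n"
    "<head>\r\n"
    "  <title>Title</title>\r\n"
    "</head>\r\n"
    "<body/>\r\n"
    "</html>\r\n"
)


def joined_text(text_nodes):
    """Join text nodes after stripping whitespace from both ends of each."""
    return "".join(node.strip() for node in text_nodes)


root = ET.fromstring(doc)
raw_text = "".join(root.itertext())       # includes the newlines between tags
clean_text = joined_text(root.itertext())  # whitespace-only nodes vanish
print(repr(raw_text))
print(repr(clean_text))
```

Note that this also removes intentional padding such as the spaces in " Title ", so it only suits cases where surrounding whitespace carries no meaning.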
