Convert HTML entities to Unicode and vice versa - python

How do you convert HTML entities to Unicode and vice versa in Python?

As to the "vice versa" (which I needed myself, leading me to find this question, which didn't help, and subsequently another site which had the answer):
u'some string'.encode('ascii', 'xmlcharrefreplace')
will return a plain string with any non-ascii characters turned into XML (HTML) entities.

You need to have BeautifulSoup.
from BeautifulSoup import BeautifulStoneSoup
import cgi
def HTMLEntitiesToUnicode(text):
"""Converts HTML entities to unicode. For example '&' becomes '&'."""
text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
return text
def unicodeToHTMLEntities(text):
"""Converts unicode to HTML entities. For example '&' becomes '&'."""
text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
return text
text = "&, ®, <, >, ¢, £, ¥, €, §, ©"
uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)
print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &, ®, <, >, ¢, £, ¥, €, §, ©

Update for Python 2.7 and BeautifulSoup4
Unescape -- Unicode HTML to unicode with htmlparser (Python 2.7 standard lib):
>>> escaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
>>> from HTMLParser import HTMLParser
>>> htmlparser = HTMLParser()
>>> unescaped = htmlparser.unescape(escaped)
>>> unescaped
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print unescaped
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood
Unescape -- Unicode HTML to unicode with bs4 (BeautifulSoup4):
>>> html = '''<p>Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood</p>'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> soup.text
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print soup.text
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood
Escape -- Unicode to unicode HTML with bs4 (BeautifulSoup4):
>>> unescaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
>>> from bs4.dammit import EntitySubstitution
>>> escaper = EntitySubstitution()
>>> escaped = escaper.substitute_html(unescaped)
>>> escaped
u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'

As hekevintran answer suggests, you may use cgi.escape(s) for encoding stings, but notice that encoding of quote is false by default in that function and it may be a good idea to pass the quote=True keyword argument alongside your string. But even by passing quote=True, the function won't escape single quotes ("'") (Because of these issues the function has been deprecated since version 3.2)
It's been suggested to use html.escape(s) instead of cgi.escape(s). (New in version 3.2)
Also html.unescape(s) has been introduced in version 3.4.
So in python 3.4 you can:
Use html.escape(text).encode('ascii', 'xmlcharrefreplace').decode() to convert special characters to HTML entities.
And html.unescape(text) for converting HTML entities back to plain-text representations.

For python3 use html.unescape():
import html
s = "&"
decoded = html.unescape(s)
# &

$ python3 -c "
> import html
> print(
> html.unescape('&©—')
> )"
&©—
$ python3 -c "
> import html
> print(
> html.escape('&©—')
> )"
&©—
$ python2 -c "
> from HTMLParser import HTMLParser
> print(
> HTMLParser().unescape('&©—')
> )"
&©—
$ python2 -c "
> import cgi
> print(
> cgi.escape('&©—')
> )"
&©—
HTML only strictly requires & (ampersand) and < (left angle bracket / less-than sign) to be escaped. https://html.spec.whatwg.org/multipage/parsing.html#data-state

If someone like me is out there wondering why some entity numbers (codes) like ™ (for trademark symbol), € (for euro symbol) are not encoded properly, the reason is in ISO-8859-1 (aka Windows-1252) those characters are not defined.
Also note that, the default character set as of html5 is utf-8 it was ISO-8859-1 for html4
So, we will have to workaround somehow (find & replace those at first)
Reference (starting point) from Mozilla's documentation
https://developer.mozilla.org/en-US/docs/Web/Guide/Localizations_and_character_encodings

I used the following function to convert unicode ripped from an xls file into a an html file while conserving the special characters found in the xls file:
def html_wr(f, dat):
''' write dat to file f as html
. file is assumed to be opened in binary format
. if dat is nul it is replaced with non breakable space
. non-ascii characters are translated to xml
'''
if not dat:
dat = ' '
try:
f.write(dat.encode('ascii'))
except:
f.write(html.escape(dat).encode('ascii', 'xmlcharrefreplace'))
hope this is useful to somebody

#!/usr/bin/env python3
import fileinput
import html
for line in fileinput.input():
print(html.unescape(line.rstrip('\n')))

Related

Encode UTF-8 not working when retrieving Xpath text

I'm retrieving some date from a website via lxml xpath:
page = requests.get(url)
tree = html.fromstring(page.content)
titles_arr = tree.xpath("//span[#class='lister-item-header']/span/a/text()")
Some of the titles have German Umlaute (e.g. üöä) so I thought of encoding the text returned like so:
for title in titles_arr:
title = title.encode('utf-8')
but it still consists of like Der Herr der Ringe - Die R\u00fcckkehr des K\u00f6nigs instead of their respective unicode character. What am I doing wrong?
Thanks
You seem to be dealing with a bytestring encoded with unicode characters escaped.
You can decode like this:
>>> bs = b'Die R\u00fcckkehr des K\u00f6nigs'
>>> bs.decode('raw-unicode-escape')
'Die Rückkehr des Königs'
If you are dealing with text, rather than bytes, you'll need to encode then decode:
>>> s = 'Die R\u00fcckkehr des K\u00f6nigs'
>>> s.encode('latin-1').decode('raw-unicode-escape')
'Die Rückkehr des Königs'
This kind of encoding is used to escape unicode characters in json, to restrict the json to ascii values:
>>> json.dumps('Die Rückkehr des Königs')
'"Die R\\u00fcckkehr des K\\u00f6nigs"'
so it's possible that whatever url you are fetching is html with embedded json, or json with embedded html - it might be worth checking the response's json attribute.

scrape with correct character encoding (python requests + beautifulsoup)

I have an issue parsing this website: http://fm4-archiv.at/files.php?cat=106
It contains special characters such as umlauts. See here:
My chrome browser displays the umlauts properly as you can see in the screenshot above. However on other pages (e.g.: http://fm4-archiv.at/files.php?cat=105) the umlauts are not displayed properly, as can be seen in the screenshot below:
The meta HTML tag defines the following charset on the pages:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
I use the python requests package to get the HTML and then use Beautifulsoup to scrape the desired data. My code is as follows:
r = requests.get(URL)
soup = BeautifulSoup(r.content,"lxml")
If I print the encoding (print(r.encoding) the result is UTF-8. If I manually change the encoding to ISO-8859-1 or cp1252 by calling r.encoding = ISO-8859-1 nothing changes when I output the data on the console. This is also my main issue.
r = requests.get(URL)
r.encoding = 'ISO-8859-1'
soup = BeautifulSoup(r.content,"lxml")
still results in the following string shown on the console output in my python IDE:
Der Wildlöwenpfleger
instead it should be
Der Wildlöwenpfleger
How can I change my code to parse the umlauts properly?
In general, instead of using r.content which is the byte string received, use r.text which is the decoded content using the encoding determined by requests.
In this case requests will use UTF-8 to decode the incoming byte string because this is the encoding reported by the server in the Content-Type header:
import requests
r = requests.get('http://fm4-archiv.at/files.php?cat=106')
>>> type(r.content) # raw content
<class 'bytes'>
>>> type(r.text) # decoded to unicode
<class 'str'>
>>> r.headers['Content-Type']
'text/html; charset=UTF-8'
>>> r.encoding
'UTF-8'
>>> soup = BeautifulSoup(r.text, 'lxml')
That will fix the "Wildlöwenpfleger" problem, however, other parts of the page then begin to break, for example:
>>> soup = BeautifulSoup(r.text, 'lxml') # using decoded string... should work
>>> soup.find_all('a')[39]
Der Wildlöwenpfleger
>>> soup.find_all('a')[10]
<a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon �bergeben. Auf Streifz�gen durch die Popliteratur st��t Hermes auf deren gro�e Themen und h�rt mit euch quer. In der heutige">Salon Hermes (6 files)
shows that "Wildlöwenpfleger" is fixed but now "übergeben" and others in the second link are broken.
It appears that multiple encodings are used in the one HTML document. The first link uses UTF-8 encoding:
>>> r.content[8013:8070].decode('iso-8859-1')
'Der Wildlöwenpfleger'
>>> r.content[8013:8070].decode('utf8')
'Der Wildlöwenpfleger'
but the second link uses ISO-8859-1 encoding:
>>> r.content[2868:3132].decode('iso-8859-1')
'Salon Hermes (6 files)\r\n'
>>> r.content[2868:3132].decode('utf8', 'replace')
'Salon Hermes (6 files)\r\n'
Obviously it is incorrect to use multiple encodings in the same HTML document. Other than contacting the document's author and asking for a correction, there is not much that you can easily do to handle the mixed encoding. Perhaps you can run chardet.detect() over the data as you process it, but it's not going to be pleasant.
I just found two solutions. Can you confirm?
Soup = BeautifulSoup(r.content.decode('utf-8','ignore'),"lxml")
and
Soup = BeautifulSoup(r.content,"lxml", fromEncoding='utf-8')
Both results in the following example output:
Der Wildlöwenpfleger
EDIT:
I just wonder why these work, because r.encoding results in UTF-8 anyway. This tells me that requests anyway handled the data as UTF-8 data. Hence I wonder why .decode('utf-8','ignore') or fromEncoding='utf-8' result in the desired output?
EDIT 2:
okay, I think I get it now. The .decode('utf-8','ignore') and fromEncoding='utf-8' mean that the actual data is encoded as UTF-8 and that Beautifulsoup should parse it handling it as UTF-8 encoded data which is actually the case.
I assume that requests correctly handled it as UTF-8, but BeautifulSoup did not. Hence, I have to do this extra decoding.

urllib: get utf-8 encoded site source code

I'm trying to fetch a segment of some website. The script works, however it's a website that has accents such as á, é, í, ó, ú.
When I fetch the site using urllib or urllib2, the site source code is not encoded in utf-8, which I would like it to be, as utf-8 supports these accents.
I believe that the target site is encoded in utf-8 as it contains the following meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
My python script:
opener = urllib2.build_opener()
opener.addheaders = [('Accept-Charset', 'utf-8')]
url_response = opener.open(url)
deal_html = url_response.read().decode('utf-8')
However, I keep getting results that look like they are not encoded un utf-8.
E.g: "Milán" on website = "Mil\xe1n" after urllib2 fetches it
Any suggestions?
Your script is working correctly. The "\xe1" string is the representation of the unicode object resulting from decoding. For example:
>>> "Mil\xc3\xa1n".decode('utf-8')
u'Mil\xe1n'
The "\xc3\xa1" sequence is the UTF-8 sequence for leter a with diacritic mark: á.

Extract text between pattern using REGEX

I need help with regex in python.
I've got a large html file[around 400 lines] with the following pattern
text here(div,span,img tags)
<!-- 3GP||Link|| -->
text here(div,span,img tags)
So, now i am searching for a regex expression which can extract me this-:
Link
The given pattern is unique in the html file.
>>> d = """
... Some text here(div,span,img tags)
...
... <!-- 3GP||**Some link**|| -->
...
... Some text here(div,span,img tags)
... """
>>> import re
>>> re.findall(r'\<!-- 3GP\|\|([^|]+)\|\| --\>',d)
['**Some link**']
r'' is a raw literal, it stops interpretation of standard string escapes
\<!-- 3GP\|\| is a regexp escaped match for <!-- 3GP||
([^|]+) will match everything upto a | and groups it for convenience
\|\| --\> is a regexp escaped match for || -->
re.findall returns all non-overlapping matches of re pattern within a string, if there's a group expression in the re pattern, it returns that.
import re
re.match(r"<!-- 3GP\|\|(.+?)\|\| -->", "<!-- 3GP||Link|| -->").group(1)
yields "Link".
In case you need to parse something else, you can also combine the regular expression with BeautifulSoup:
import re
from BeautifulSoup import BeautifulSoup, Comment
soup = BeautifulSoup(<your html here>)
link_regex = re.compile('\s+3GP\|\|(.*)\|\|\s+')
comment = soup.find(text=lambda text: isinstance(text, Comment)
and link_regex.match(text))
link = link_regex.match(comment).group(1)
print link
Note that in this case the regular expresion only needs to match the comment contents because BeautifulSoup already takes care of extracting the text from the comments.

How do I perform HTML decoding/encoding using Python/Django?

I have a string that is HTML encoded:
'''<img class="size-medium wp-image-113"\
style="margin-left: 15px;" title="su1"\
src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg"\
alt="" width="300" height="194" />'''
I want to change that to:
<img class="size-medium wp-image-113" style="margin-left: 15px;"
title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg"
alt="" width="300" height="194" />
I want this to register as HTML so that it is rendered as an image by the browser instead of being displayed as text.
The string is stored like that because I am using a web-scraping tool called BeautifulSoup, it "scans" a web-page and gets certain content from it, then returns the string in that format.
I've found how to do this in C# but not in Python. Can someone help me out?
Related
Convert XML/HTML Entities into Unicode String in Python
With the standard library:
HTML Escape
try:
from html import escape # python 3.x
except ImportError:
from cgi import escape # python 2.x
print(escape("<"))
HTML Unescape
try:
from html import unescape # python 3.4+
except ImportError:
try:
from html.parser import HTMLParser # python 3.x (<3.4)
except ImportError:
from HTMLParser import HTMLParser # python 2.x
unescape = HTMLParser().unescape
print(unescape(">"))
Given the Django use case, there are two answers to this. Here is its django.utils.html.escape function, for reference:
def escape(html):
"""Returns the given HTML with ampersands, quotes and carets encoded."""
return mark_safe(force_unicode(html).replace('&', '&').replace('<', '&l
t;').replace('>', '>').replace('"', '"').replace("'", '''))
To reverse this, the Cheetah function described in Jake's answer should work, but is missing the single-quote. This version includes an updated tuple, with the order of replacement reversed to avoid symmetric problems:
def html_decode(s):
"""
Returns the ASCII decoded version of the given HTML string. This does
NOT remove normal HTML tags like <p>.
"""
htmlCodes = (
("'", '''),
('"', '"'),
('>', '>'),
('<', '<'),
('&', '&')
)
for code in htmlCodes:
s = s.replace(code[1], code[0])
return s
unescaped = html_decode(my_string)
This, however, is not a general solution; it is only appropriate for strings encoded with django.utils.html.escape. More generally, it is a good idea to stick with the standard library:
# Python 2.x:
import HTMLParser
html_parser = HTMLParser.HTMLParser()
unescaped = html_parser.unescape(my_string)
# Python 3.x:
import html.parser
html_parser = html.parser.HTMLParser()
unescaped = html_parser.unescape(my_string)
# >= Python 3.5:
from html import unescape
unescaped = unescape(my_string)
As a suggestion: it may make more sense to store the HTML unescaped in your database. It'd be worth looking into getting unescaped results back from BeautifulSoup if possible, and avoiding this process altogether.
With Django, escaping only occurs during template rendering; so to prevent escaping you just tell the templating engine not to escape your string. To do that, use one of these options in your template:
{{ context_var|safe }}
{% autoescape off %}
{{ context_var }}
{% endautoescape %}
For html encoding, there's cgi.escape from the standard library:
>> help(cgi.escape)
cgi.escape = escape(s, quote=None)
Replace special characters "&", "<" and ">" to HTML-safe sequences.
If the optional flag quote is true, the quotation mark character (")
is also translated.
For html decoding, I use the following:
import re
from htmlentitydefs import name2codepoint
# for some reason, python 2.5.2 doesn't have this one (apostrophe)
name2codepoint['#39'] = 39
def unescape(s):
"unescape HTML code refs; c.f. http://wiki.python.org/moin/EscapingHtml"
return re.sub('&(%s);' % '|'.join(name2codepoint),
lambda m: unichr(name2codepoint[m.group(1)]), s)
For anything more complicated, I use BeautifulSoup.
Use daniel's solution if the set of encoded characters is relatively restricted.
Otherwise, use one of the numerous HTML-parsing libraries.
I like BeautifulSoup because it can handle malformed XML/HTML :
http://www.crummy.com/software/BeautifulSoup/
for your question, there's an example in their documentation
from BeautifulSoup import BeautifulStoneSoup
BeautifulStoneSoup("Sacré bleu!",
convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
# u'Sacr\xe9 bleu!'
In Python 3.4+:
import html
html.unescape(your_string)
See at the bottom of this page at Python wiki, there are at least 2 options to "unescape" html.
Daniel's comment as an answer:
"escaping only occurs in Django during template rendering. Therefore, there's no need for an unescape - you just tell the templating engine not to escape. either {{ context_var|safe }} or {% autoescape off %}{{ context_var }}{% endautoescape %}"
If anyone is looking for a simple way to do this via the django templates, you can always use filters like this:
<html>
{{ node.description|safe }}
</html>
I had some data coming from a vendor and everything I posted had html tags actually written on the rendered page as if you were looking at the source.
I found a fine function at: http://snippets.dzone.com/posts/show/4569
def decodeHtmlentities(string):
import re
entity_re = re.compile("&(#?)(\d{1,5}|\w{1,8});")
def substitute_entity(match):
from htmlentitydefs import name2codepoint as n2cp
ent = match.group(2)
if match.group(1) == "#":
return unichr(int(ent))
else:
cp = n2cp.get(ent)
if cp:
return unichr(cp)
else:
return match.group()
return entity_re.subn(substitute_entity, string)[0]
Even though this is a really old question, this may work.
Django 1.5.5
In [1]: from django.utils.text import unescape_entities
In [2]: unescape_entities('<img class="size-medium wp-image-113" style="margin-left: 15px;" title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" alt="" width="300" height="194" />')
Out[2]: u'<img class="size-medium wp-image-113" style="margin-left: 15px;" title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" alt="" width="300" height="194" />'
I found this in the Cheetah source code (here)
htmlCodes = [
['&', '&'],
['<', '<'],
['>', '>'],
['"', '"'],
]
htmlCodesReversed = htmlCodes[:]
htmlCodesReversed.reverse()
def htmlDecode(s, codes=htmlCodesReversed):
""" Returns the ASCII decoded version of the given HTML string. This does
NOT remove normal HTML tags like <p>. It is the inverse of htmlEncode()."""
for code in codes:
s = s.replace(code[1], code[0])
return s
not sure why they reverse the list,
I think it has to do with the way they encode, so with you it may not need to be reversed.
Also if I were you I would change htmlCodes to be a list of tuples rather than a list of lists...
this is going in my library though :)
i noticed your title asked for encode too, so here is Cheetah's encode function.
def htmlEncode(s, codes=htmlCodes):
""" Returns the HTML encoded version of the given string. This is useful to
display a plain ASCII text string on a web page."""
for code in codes:
s = s.replace(code[0], code[1])
return s
You can also use django.utils.html.escape
from django.utils.html import escape
something_nice = escape(request.POST['something_naughty'])
Below is a python function that uses module htmlentitydefs. It is not perfect. The version of htmlentitydefs that I have is incomplete and it assumes that all entities decode to one codepoint which is wrong for entities like &NotEqualTilde;:
http://www.w3.org/TR/html5/named-character-references.html
NotEqualTilde; U+02242 U+00338 ≂̸
With those caveats though, here's the code.
def decodeHtmlText(html):
"""
Given a string of HTML that would parse to a single text node,
return the text value of that node.
"""
# Fast path for common case.
if html.find("&") < 0: return html
return re.sub(
'&(?:#(?:x([0-9A-Fa-f]+)|([0-9]+))|([a-zA-Z0-9]+));',
_decode_html_entity,
html)
def _decode_html_entity(match):
"""
Regex replacer that expects hex digits in group 1, or
decimal digits in group 2, or a named entity in group 3.
"""
hex_digits = match.group(1) # '
' -> unichr(10)
if hex_digits: return unichr(int(hex_digits, 16))
decimal_digits = match.group(2) # '' -> unichr(0x10)
if decimal_digits: return unichr(int(decimal_digits, 10))
name = match.group(3) # name is 'lt' when '<' was matched.
if name:
decoding = (htmlentitydefs.name2codepoint.get(name)
# Treat &GT; like >.
# This is wrong for &Gt; and &Lt; which HTML5 adopted from MathML.
# If htmlentitydefs included mappings for those entities,
# then this code will magically work.
or htmlentitydefs.name2codepoint.get(name.lower()))
if decoding is not None: return unichr(decoding)
return match.group(0) # Treat "&noSuchEntity;" as "&noSuchEntity;"
This is the easiest solution for this problem -
{% autoescape on %}
{{ body }}
{% endautoescape %}
From this page.
Searching the simplest solution of this question in Django and Python I found you can use builtin theirs functions to escape/unescape html code.
Example
I saved your html code in scraped_html and clean_html:
scraped_html = (
'<img class="size-medium wp-image-113" '
'style="margin-left: 15px;" title="su1" '
'src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" '
'alt="" width="300" height="194" />'
)
clean_html = (
'<img class="size-medium wp-image-113" style="margin-left: 15px;" '
'title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" '
'alt="" width="300" height="194" />'
)
Django
You need Django >= 1.0
unescape
To unescape your scraped html code you can use django.utils.text.unescape_entities which:
Convert all named and numeric character references to the corresponding unicode characters.
>>> from django.utils.text import unescape_entities
>>> clean_html == unescape_entities(scraped_html)
True
escape
To escape your clean html code you can use django.utils.html.escape which:
Returns the given text with ampersands, quotes and angle brackets encoded for use in HTML.
>>> from django.utils.html import escape
>>> scraped_html == escape(clean_html)
True
Python
You need Python >= 3.4
unescape
To unescape your scraped html code you can use html.unescape which:
Convert all named and numeric character references (e.g. >, >, &x3e;) in the string s to the corresponding unicode characters.
>>> from html import unescape
>>> clean_html == unescape(scraped_html)
True
escape
To escape your clean html code you can use html.escape which:
Convert the characters &, < and > in string s to HTML-safe sequences.
>>> from html import escape
>>> scraped_html == escape(clean_html)
True

Categories

Resources