Python - HTML to Unicode

I have a Python script where I fetch some HTML and parse it using Beautiful Soup. Sometimes the HTML contains non-Unicode characters, and they cause errors in my script and in the file I am creating.
Here is how I am getting the HTML
html = urllib2.urlopen(url).read().replace('&nbsp;',"")
xml = etree.HTML(html)
When I use this
html = urllib2.urlopen(url).read().encode('ascii', 'xmlcharrefreplace')
I get a UnicodeDecodeError.
How can I convert this to Unicode, so that my code won't break if there are non-Unicode characters?

When I use this
html = urllib2.urlopen(url).read().encode('ascii', 'xmlcharrefreplace')
I get an error UnicodeDecodeError. How could I change this into unicode.
unicode characters -> bytes = ‘encode’
bytes -> unicode characters = ‘decode’
You have bytes and you want unicode characters, so the method for that is decode. As you have used encode, Python thinks you want to go from characters to bytes, so tries to convert the bytes to characters so they can be turned back to bytes! It uses the default encoding for this, which in your case is ASCII, so it fails for non-ASCII bytes.
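For example, a minimal sketch of the two directions on Python 2 (assuming url is defined as in the question and the page really is UTF-8):
import urllib2
raw = urllib2.urlopen(url).read()                 # bytes (a Python 2 str)
text = raw.decode('utf-8')                        # bytes -> unicode characters
back = text.encode('ascii', 'xmlcharrefreplace')  # unicode -> ASCII bytes, non-ASCII as &#...; references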
However it is unclear why you want to do this. etree parses bytes as-is. If you want to remove character U+00A0 Non Breaking Space from your data you should do that with the extracted content you get after HTML parsing, rather than try to grapple with the HTML source version. HTML markup might include U+00A0 as raw bytes, incorrectly-unterminated entity references, numeric character references and so on. Let the HTML parser handle that for you, it's what it's good at.
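A sketch of that approach with lxml, which the question already uses (url is assumed to be defined; the whitespace cleanup is just an illustration):
from lxml import etree
import urllib2
html = urllib2.urlopen(url).read()           # keep the raw bytes; etree parses them as-is
tree = etree.HTML(html)                      # references such as &nbsp;/&#160; become u'\xa0' here
for text in tree.xpath('//text()'):
    cleaned = text.replace(u'\xa0', u' ')    # handle U+00A0 in the extracted unicode text, not in the source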

If you feed HTML to BeautifulSoup, it will decode it to Unicode.
If the charset declaration is wrong or missing, or parts of the document are encoded differently, this might fail; there is a special module that comes with BeautifulSoup, UnicodeDammit ("Unicode, Dammit"), which might help you with such documents.
If you mention BeautifulSoup, why don't you do it like this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen(url).read())
and work with the soup?
BTW, all HTML entities will be resolved to unicode characters.
The ascii character set is very limited and might lack many characters in your document. I'd use utf-8 instead whenever possible.
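A quick illustration of that entity handling (the markup here is made up):
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>caf&eacute;&nbsp;au lait</p>')
print repr(soup.p.string)             # u'caf\xe9\xa0au lait' -- entities became unicode characters
print soup.p.string.encode('utf-8')   # encode to UTF-8 bytes only at the point where you need bytes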

Related

TypeError: 'str' does not support the buffer interface in html2text

I'm using python3 to do some web scraping. I want to save a webpage and convert it to text using the following code:
import urllib.request
import html2text
url='http://www.google.com'
page = urllib.request.urlopen(url)
html_content = page.read()
rendered_content = html2text.html2text(html_content)
But when I run the code, it reports a type error:
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/html2text-2016.4.2-py3.4.egg/html2text/__init__.py", line 127, in feed
data = data.replace("</' + 'script>", "</ignore>")
TypeError: 'str' does not support the buffer interface
Could anyone tell me how to deal with this error? Thank you in advance!
I took the time to investigate this, and it turns out to be easily resolved.
Why You Got This Error
The problem is one of bad input: when you called page.read(), a byte string was returned, rather than a regular string.
Byte strings are Python 3's way of handling data whose character encoding isn't known yet: page.read() hands you the raw bytes of the response, not decoded text.
Because Python doesn't know what encoding to use, it keeps the data as raw bytes - this is how all data is represented internally anyway - and lets the programmer decide how to decode it.
The TypeError then comes from mixing the two types: html2text calls replace() with regular (str) arguments, and a bytes object won't accept str arguments.
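A minimal reproduction of the mismatch (Python 3):
b'<p>hi</p>'.replace('</script>', '</ignore>')                  # TypeError: str arguments on a bytes object
b'<p>hi</p>'.decode('utf-8').replace('</script>', '</ignore>')  # works once the bytes have been decoded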
Solution
html_content = page.read().decode('iso-8859-1')
Padraic Cunningham's solution in the comments is correct in its essence: you first have to tell Python which character encoding to use to map these bytes to the correct characters.
Unfortunately, this particular page isn't encoded as UTF-8, so asking it to decode with the UTF-8 codec throws an error.
The correct encoding to use is contained in the response headers themselves, under the Content-Type header - a standard header that well-behaved HTTP servers provide.
Simply calling page.info().get_content_charset() returns the value of this header, which in this case is iso-8859-1. From there, you can decode it correctly using iso-8859-1, so that regular tools can operate on it normally.
A More Generic Solution
charset_encoding = page.info().get_content_charset()
html_content = page.read().decode(charset_encoding)
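Putting it together (a sketch; falling back to utf-8 when the server omits the charset is my own assumption, not something the question requires):
import urllib.request
import html2text

with urllib.request.urlopen('http://www.google.com') as page:
    charset_encoding = page.info().get_content_charset() or 'utf-8'
    html_content = page.read().decode(charset_encoding)

rendered_content = html2text.html2text(html_content)
print(rendered_content)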
The stream returned by urlopen is a byte string; that's what the b prefix before the quoted string indicates when you print it. If you exclude it, as in the appended code, it seems to work as input for html2text.
import urllib.request
import html2text
url='http://www.google.com'
with urllib.request.urlopen(url) as page:
    html_content = page.read()
    charset_encoding = page.info().get_content_charset()
rendered_content = html2text.html2text(str(html_content)[1:], charset_encoding)
Revised using suggestions about encoding. Yes, it's a hack, but it runs. Not using str() means the original TypeError problem remains.

Python - How to get accented characters correct? (BeautifulSoup)

I've written some Python code with BeautifulSoup to get HTML, but I can't work out how to get accented characters to come out correctly.
The charset of the HTML is this
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
I've this python code:
some_text = soup_ad.find("span", { "class" : "h1_span" }).contents[0]
some_text.decode('iso-8859-1','ignore')
And I get this:
Calções
What I'm doing wrong here? Some clues?
Best Regards,
The question here is about "where" do you "get this".
If that's the output received in your terminal, it might as well be possible that your terminal expects a different encoding!
You can try this when using print:
import sys
outenc = sys.stdout.encoding or sys.getfilesystemencoding()
print some_text.decode("iso-8859-1").encode(outenc)
As bernie points out, BS uses Unicode internally.
For BS3:
Beautiful Soup Gives You Unicode, Dammit
By the time your document is parsed, it has been transformed into Unicode. Beautiful Soup stores only Unicode strings in its data structures.
For BS4, the docs explain a bit more clearly when this happens:
You can pass in a string or an open filehandle… First, the document is converted to Unicode, and HTML entities are converted to Unicode characters…
In other words, it decodes the data immediately. So, if you're getting mojibake, you have to fix it before it gets into BS, not after.
The BeautifulSoup constructor can take 8-bit byte strings or files and will try to figure out the encoding. See Encodings for details. You can check whether it guessed right by printing out soup.original_encoding. If it didn't guess ISO-8859-1 or a synonym, your only option is to make it explicit: decode the string before passing it in, open the file in Unicode mode with an encoding, etc.
The results that come out of any BS object, and anything you pass as an argument to any method, will always be UTF-8 (if they're byte strings). So, calling decode('iso-8859-1') on something you got out of BS is guaranteed to break stuff if it's not already broken.
And you don't want to do this anyway. As you said in a comment, "I'm outputting to an SQLite3 database." Well, sqlite3 always uses UTF-8. (You can change this with a pragma at runtime, or change the default at compile time, but that basically breaks the Python interface, so… don't.) And the Python interface only allows UTF-8 in Py2 str (and of course in Py2 unicode/Py3 str, there is no encoding.) So, if you try to encode the BS data into Latin-1 to store in the database, you're creating problems. Just store the Unicode as-is, or encode it to UTF-8 if you must (Py2 only).
If you don't want to figure all of this out, just use Unicode everywhere after the initial call to BeautifulSoup and you'll never go wrong.
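A short sketch of checking and overriding that detection (from_encoding and original_encoding are real BS4 features; url stands in for wherever you fetch the page from):
from bs4 import BeautifulSoup
import urllib2

raw = urllib2.urlopen(url).read()
soup_ad = BeautifulSoup(raw, from_encoding='iso-8859-1')   # be explicit if auto-detection guesses wrong
print soup_ad.original_encoding                            # what BeautifulSoup decided the source encoding was
some_text = soup_ad.find("span", {"class": "h1_span"}).get_text()
print some_text.encode('utf-8')                            # some_text is unicode; only encode when you need bytes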

Python(2.6) cStringIO unicode support?

I'm using the Python pycurl module to download content from various web pages. Since I also want to support potential Unicode text, I've been avoiding the cStringIO.StringIO function, which according to the Python docs (cStringIO - Faster version of StringIO):
Unlike the StringIO module, this module is not able to accept Unicode strings that cannot be encoded as plain ASCII strings.
... does not support Unicode strings. Actually, it states that it does not support Unicode strings that cannot be converted to plain ASCII strings. Can someone please clarify this for me? Which can and which cannot be converted?
I've tested with the following code and it seems to work just fine with unicode:
import pycurl
import cStringIO
downloadedContent = cStringIO.StringIO()
curlHandle = pycurl.Curl()
curlHandle.setopt(pycurl.WRITEFUNCTION, downloadedContent.write)
curlHandle.setopt(pycurl.URL, 'http://www.ltg.ed.ac.uk/~richard/unicode-sample.html')
curlHandle.perform()
content = downloadedContent.getvalue()
fileHandle = open('unicode-test.txt','w')
for char in content:
    fileHandle.write(char)
And the file is correctly written. I can even print the whole content in the console, and all characters show up fine... So what I'm puzzled about is: where does cStringIO fail? Is there any reason why I should not use it?
[Note: I'm using Python 2.6 and need to stick to this version]
Any text that only uses ASCII codepoints (byte values 00-7F hexadecimal) can be converted to ASCII. Basically, any text containing characters outside that range - accented letters, non-Latin scripts, typographic quotes and the like - is not ASCII.
In your example code, you are not converting the input to Unicode text; you are treating it as un-decoded bytes. The test page in question is encoded in UTF-8, and you never decode that to Unicode.
If you were to decode the value to a Unicode string, you would not be able to store that string in a cStringIO object (unless it happened to contain only ASCII characters).
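A quick sketch of exactly where it falls over (Python 2):
import cStringIO
buf = cStringIO.StringIO()
buf.write(u'plain ascii text')   # fine: this unicode string can be encoded as plain ASCII
buf.write(u'\u00e6ble')          # raises UnicodeEncodeError: not representable in ASCII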
You may want to read up on the difference between Unicode and text encodings such as ASCII and UTF-8. I can recommend:
Joel Spolsky's minimum Unicode article
The Python Unicode HOWTO.

Parsing a utf-8 encoded web page with some gb2312 body text with Python

I'm trying to parse a web page using Python's Beautiful Soup parser, and am running into an issue.
The header of the HTML we get from them declares a utf-8 character set, so Beautiful Soup decodes the whole document as utf-8, and indeed the HTML tags are encoded in UTF-8, so we get back a nicely structured HTML page.
The trouble is, this stupid website injects gb2312-encoded body text into the page that gets parsed as utf-8 by beautiful soup. Is there a way to convert the text from this "gb2312 pretending to be utf-8" state to "proper expression of the character set in utf-8?"
The simplest way might be to parse the page twice, once as UTF-8, and once as GB2312. Then extract the relevant section from the GB2312 parse.
I don't know much about GB2312, but looking it up it appears to at least agree with ASCII on the basic letters, numbers, etc. So you should still be able to parse the HTML structure using GB2312, which would hopefully give you enough information to extract the part you need.
This may be the only way to do it, actually. In general, GB2312-encoded text won't be valid UTF-8, so trying to decode it as UTF-8 should lead to errors. The BeautifulSoup documentation says:
In rare cases (usually when a UTF-8 document contains text written in a completely different encoding), the only way to get Unicode may be to replace some characters with the special Unicode character “REPLACEMENT CHARACTER” (U+FFFD, �). If Unicode, Dammit needs to do this, it will set the .contains_replacement_characters attribute to True on the UnicodeDammit or BeautifulSoup object.
This makes it sound like BeautifulSoup just ignores decoding errors and replaces the erroneous characters with U+FFFD. If this is the case (i.e., if your document has contains_replacement_characters == True), then there is no way to get the original data back from document once it's been decoded as UTF-8. You will have to do something like what I suggested above, decoding the entire document twice with different codecs.
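A sketch of the double-parse idea (from_encoding is a real BeautifulSoup parameter; raw_bytes and the id used to locate the injected section are illustrative assumptions):
from bs4 import BeautifulSoup

soup_utf8 = BeautifulSoup(raw_bytes, from_encoding='utf-8')
soup_gb = BeautifulSoup(raw_bytes, from_encoding='gb2312')

# Take the normal parts from the UTF-8 parse, and pull the mis-encoded
# section out of the GB2312 parse instead.
title = soup_utf8.title.string
injected_text = soup_gb.find('div', id='injected-section').get_text()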

Python minidom and UTF-8 encoded XML with hash references

I am experiencing some difficulty in my home project where I need to parse a SOAP request. The SOAP is generated with gSOAP and involves string parameters with special characters like the danish letters "æøå".
gSOAP builds SOAP requests with UTF-8 encoding by default, but instead of sending the special characters in raw form (i.e. the bytes C3A6 for the special character "æ") it sends what I think are called character hash references (i.e. &#195;&#166;).
I don't completely understand why gSOAP does it this way, as I can see that it has marked the incoming payload as being UTF-8 encoded anyway (Content-Type: text/xml; charset=utf-8), but this is beside the point (I think).
Anyway I guess gSOAP probably is obeying transport rules, or what?
When I parse the request from gSOAP in Python with xml.dom.minidom.parseString() I get element values as unicode objects, which is fine, but the character hash references are not decoded as UTF-8 byte sequences. The parser unescapes the character hash references, but does not decode the resulting string as UTF-8 afterwards. In the end I have a unicode string object whose code points are really the UTF-8 bytes:
So if the string "æble" is contained in the XML, it comes like this in the request:
"æble"
After parsing the XML the unicode string in the DOM Text Node's data member looks like this:
u'\xc3\xa6ble'
I would expect it to look like this:
u'\xe6ble'
What am I doing wrong? Should I unescape the SOAP XML before parsing it, or is it somewhere else I should be looking for the solution, maybe gSOAP?
Thanks in advance.
Best regards Jakob Simon-Gaarde
&#195;&#166;ble is actually æble.
To get the expected Unicode string u'\xe6ble' after parsing, the string in the request should be &#230;ble.
Here's how to unescape such stuff: http://effbot.org/zone/re-sub.htm#unescape-html
However the primary problem is what you and/or this "gSOAP" (URL, please) are doing ...
Your example character is LATIN SMALL LETTER AE (U+00E6). As you say, encoded in UTF-8, this is \xc3\xa6. 0xc3 == 195 and 0xa6 == 166. 0xe6 == 230. Escaping your character should produce '&#230;', not '&#195;&#166;'.
However it appears that it is encoding to UTF-8 first and then doing the escaping.
What you need to do is to show us in fine detail the code that you are using together with diagnostic prints (using the repr() function so that we can see the type and unambiguously-represented contents) of each str and unicode object involved in the process. Also provide the docs for the gSOAP API(s) that you are using.
On the receiving end, please show us the repr() of the raw XML that you receive.
Edit in response to this comment on another answer: """The problem is that minidom.parseString() does not seem to unescape the character hash representation before it decodes to unicode."""
It (and any other XML parser) {does not, cannot in generality, and must not} unescape numerical character references or predefined character entities BEFORE decoding.
(1) unescaping "&lt;" to "<" would blow up
(2) what would you unescape "&#256;" to? "\xc4\x80"?
(3) how could it unescape at all if the encoding was UTF-16xx?
Some more detail about my problem. The project I am creating uses wsgi. The SOAP request is extracted using environ['wsgi.input'].read(). It always seems to return a raw string. I created a function that unescapes the character hashes:
import re

def unescape_hash_char(req):
    # Split on numeric character references; the odd-numbered parts are the digits.
    pat = re.compile(r'&#(\d+);', re.M)
    parts = pat.split(req)
    a = 0
    ret = ''
    for p in parts:
        if a % 2:
            n = chr(int(p))   # turn e.g. 195 back into the raw byte '\xc3'
        else:
            n = p
        ret += n
        a += 1
    return ret
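For reference, a rough usage sketch of the function above (req holds the raw request read from wsgi.input; the element name comes from the SOAP body quoted further down):
import xml.dom.minidom

fixed = unescape_hash_char(req)                               # '&#195;&#166;ble' -> the raw bytes '\xc3\xa6ble'
dom = xml.dom.minidom.parseString(fixed)                      # minidom now decodes the UTF-8 bytes itself
name = dom.getElementsByTagName('name')[0].firstChild.data    # u'\xe6ble'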
After doing this I parse the XML and I get the expected result.
Still, I would like to know what you think, and whether it is a good solution. Also, I wrote the function because I couldn't find one to do the job in the standard Python modules - does such a function exist?
Best regards
Jakob Simon-Gaarde
Note that
In [5]: u'æ'.encode('utf-8')
Out[5]: '\xc3\xa6'
So what we have is the unicode object u'\xc3\xa6', and what we really want is the byte string object '\xc3\xa6'. This transformation can be performed with the raw-unicode-escape codec:
In [1]: text=u'\xc3\xa6'
In [2]: text.encode('raw-unicode-escape')
Out[2]: '\xc3\xa6'
In [3]: text.encode('raw-unicode-escape').decode('utf-8')
Out[3]: u'\xe6'
In [4]: print(text.encode('raw-unicode-escape').decode('utf-8'))
æ
Unless someone can tell me that gSOAP is not producing validly encoded SOAP XML (see http://pastebin.com/raw.php?i=9NS7vCMB or the code block below), I see no other solution than to unescape the character hash references before parsing the XML.
Of course, as John Machin has pointed out, I cannot unescape the predefined XML entities like "&lt;" and "&gt;".
<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:ns1="urn:ShopService"><SOAP-ENV:Body SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"><ns1:createCompany><company-code>DK-123</company-code><name>&#195;&#166;ble</name></ns1:createCompany></SOAP-ENV:Body></SOAP-ENV:Envelope>
/ Jakob
