Chinese Unicode issue? - python

From this website http://engine.data.cnzz.com/main.php?s=engine&uv=&st=2014-03-01&et=2014-03-31
<tr class="list03" onclick="showMen1(9);" style="cursor:pointer;">
<td id="e_9" class="qh_one">百度汇总</td>
I'm scraping the text and trying to get 百度汇总,
but when I set r.encoding = 'utf-8' the result is �ٶȻ���,
and if I don't set utf-8 the result is °Ù¶È»ã×Ü.

The server doesn't tell you anything helpful in the response headers, but the HTML page itself contains:
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
GB2312 is a variable-width encoding, like UTF-8. The page lies, however: it actually uses GBK, an extension of GB2312.
You can decode it with GBK just fine:
>>> len(r.content.decode('gbk'))
44535
>>> u'百度汇总' in r.content.decode('gbk')
True
Decoding with gb2312 fails:
>>> r.content.decode('gb2312')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 26367-26368: illegal multibyte sequence
but since GBK is a superset of GB2312, it should always be safe to use the former even when the latter is specified.
If you are using requests, then setting r.encoding to gb2312 works because r.text uses replace when handling decode errors:
content = str(self.content, encoding, errors='replace')
so the decoding error when using GB2312 is masked for those codepoints only defined in GBK.
Note that BeautifulSoup can do the decoding all by itself; it'll find the meta header:
>>> soup = BeautifulSoup(r.content)
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
The warning is caused by the GBK codepoints being used while the page claims to use GB2312.
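The superset relationship can be demonstrated offline, without fetching the page. A minimal sketch (the character '镕' is my own pick of a well-known GBK-only codepoint; it does not come from the page above):

```python
# GBK is a superset of GB2312: bytes for GBK-only characters break a
# strict gb2312 decode but decode fine as GBK.
text = '百度汇总镕'            # '镕' exists in GBK but not in GB2312-80
data = text.encode('gbk')

print(data.decode('gbk'))      # round-trips cleanly

try:
    data.decode('gb2312')      # strict decoding hits the GBK-only bytes
except UnicodeDecodeError as e:
    print('gb2312 failed:', e.reason)
```

This is exactly why decoding the page with 'gbk' succeeds while 'gb2312' raises, and why r.text with errors='replace' silently masks the difference.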

Related

how to encode string to utf8 inside html

I'm trying to use awdeorio's mailmerge. In the HTML template I have French accented characters inside paragraph tags.
When I execute the mailing I get an encoding error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf4 in position 81: invalid continuation byte
How can I encode those paragraphs so that they are handled correctly in Python?
TO: {{email}}
SUBJECT: Testing mailmerge
FROM: My Self <myself#mydomain.com>
Content-Type: text/html
<html>
<body>
<p>Hi, {{name}},</p>
<p>Your number is {{number}}.</p>
<p>Sent by Here is the paragraph. Ce texte est en francais. <b>Accentué<b>. L'ideal</p>
</body>
</html>
Hex 0xF4 is Latin-1 for 'ô'. If this was typed into Python source, you needed this at the start of the source file:
# -*- coding: utf-8 -*-
If the data is coming from a database, please provide some more details.
If the text is coming from somewhere else, please provide some more details.
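If the template turns out to be a Latin-1 file on disk, one fix is simply to re-save it as UTF-8 so that UTF-8 readers stop choking on bytes like 0xF4. A sketch, with a made-up file name and contents:

```python
import os
import tempfile

# Simulate a template that was saved as Latin-1: 0xf4 is 'ô' there.
path = os.path.join(tempfile.mkdtemp(), 'mailmerge_template.txt')
with open(path, 'wb') as f:
    f.write("<p>Accentué. L'hôtel</p>".encode('latin-1'))

# Read with the encoding the file actually uses, then re-save as UTF-8.
with open(path, encoding='latin-1') as f:
    text = f.read()
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

with open(path, encoding='utf-8') as f:
    print(f.read())            # now decodes cleanly as UTF-8
```

Most editors can do the same conversion via "Save with encoding: UTF-8".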

scrape with correct character encoding (python requests + beautifulsoup)

I have an issue parsing this website: http://fm4-archiv.at/files.php?cat=106
It contains special characters such as umlauts, which my Chrome browser displays properly. However, on other pages (e.g. http://fm4-archiv.at/files.php?cat=105) the umlauts are not displayed properly. (Screenshots omitted.)
The meta HTML tag defines the following charset on the pages:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
I use the python requests package to get the HTML and then use Beautifulsoup to scrape the desired data. My code is as follows:
r = requests.get(URL)
soup = BeautifulSoup(r.content,"lxml")
If I print the encoding (print(r.encoding)) the result is UTF-8. If I manually change the encoding to ISO-8859-1 or cp1252 by setting r.encoding = 'ISO-8859-1', nothing changes when I output the data to the console. This is also my main issue.
r = requests.get(URL)
r.encoding = 'ISO-8859-1'
soup = BeautifulSoup(r.content,"lxml")
still results in the following string shown on the console output in my python IDE:
Der WildlÃ¶wenpfleger
instead it should be
Der Wildlöwenpfleger
How can I change my code to parse the umlauts properly?
In general, instead of using r.content which is the byte string received, use r.text which is the decoded content using the encoding determined by requests.
In this case requests will use UTF-8 to decode the incoming byte string because this is the encoding reported by the server in the Content-Type header:
import requests
r = requests.get('http://fm4-archiv.at/files.php?cat=106')
>>> type(r.content) # raw content
<class 'bytes'>
>>> type(r.text) # decoded to unicode
<class 'str'>
>>> r.headers['Content-Type']
'text/html; charset=UTF-8'
>>> r.encoding
'UTF-8'
>>> soup = BeautifulSoup(r.text, 'lxml')
That will fix the "Wildlöwenpfleger" problem, however, other parts of the page then begin to break, for example:
>>> soup = BeautifulSoup(r.text, 'lxml') # using decoded string... should work
>>> soup.find_all('a')[39]
Der Wildlöwenpfleger
>>> soup.find_all('a')[10]
<a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon �bergeben. Auf Streifz�gen durch die Popliteratur st��t Hermes auf deren gro�e Themen und h�rt mit euch quer. In der heutige">Salon Hermes (6 files)</a>
shows that "Wildlöwenpfleger" is fixed but now "übergeben" and others in the second link are broken.
It appears that multiple encodings are used in the one HTML document. The first link uses UTF-8 encoding:
>>> r.content[8013:8070].decode('iso-8859-1')
'Der WildlÃ¶wenpfleger'
>>> r.content[8013:8070].decode('utf8')
'Der Wildlöwenpfleger'
but the second link uses ISO-8859-1 encoding:
>>> r.content[2868:3132].decode('iso-8859-1')
'<a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon übergeben. Auf Streifzügen durch die Popliteratur stößt Hermes auf deren große Themen und hört mit euch quer. In der heutige">Salon Hermes (6 files)\r\n'
>>> r.content[2868:3132].decode('utf8', 'replace')
'<a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon �bergeben. Auf Streifz�gen durch die Popliteratur st��t Hermes auf deren gro�e Themen und h�rt mit euch quer. In der heutige">Salon Hermes (6 files)\r\n'
Obviously it is incorrect to use multiple encodings in the same HTML document. Other than contacting the document's author and asking for a correction, there is not much that you can easily do to handle the mixed encoding. Perhaps you can run chardet.detect() over the data as you process it, but it's not going to be pleasant.
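One crude workaround for this particular page is a per-fragment fallback: try UTF-8 first and drop to ISO-8859-1 only for fragments that fail to decode. This is a heuristic sketch for this specific two-encoding mix, not a general fix:

```python
def decode_mixed(chunk: bytes) -> str:
    """Decode one fragment: UTF-8 if valid, otherwise ISO-8859-1.

    ISO-8859-1 maps every byte, so the fallback never raises.
    """
    try:
        return chunk.decode('utf-8')
    except UnicodeDecodeError:
        return chunk.decode('iso-8859-1')

# The two problem fragments from the page, re-encoded for the demo:
print(decode_mixed('Wildlöwenpfleger'.encode('utf-8')))      # UTF-8 part
print(decode_mixed('übergeben'.encode('iso-8859-1')))        # Latin-1 part
```

The catch is choosing sensible fragment boundaries: applied to the whole document at once this degenerates to plain ISO-8859-1, so you would need to split on tags or attributes first.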
I just found two solutions. Can you confirm?
Soup = BeautifulSoup(r.content.decode('utf-8','ignore'),"lxml")
and
Soup = BeautifulSoup(r.content,"lxml", fromEncoding='utf-8')
Both result in the following example output:
Der Wildlöwenpfleger
EDIT:
I just wonder why these work, because r.encoding is UTF-8 anyway, which tells me requests already handled the data as UTF-8. So why do .decode('utf-8','ignore') or fromEncoding='utf-8' produce the desired output?
EDIT 2:
okay, I think I get it now. The .decode('utf-8','ignore') and fromEncoding='utf-8' mean that the actual data is encoded as UTF-8 and that Beautifulsoup should parse it handling it as UTF-8 encoded data which is actually the case.
I assume that requests correctly handled it as UTF-8, but BeautifulSoup did not. Hence, I have to do this extra decoding.

'ascii' codec can't encode characters in position 2307-2309: ordinal not in range(128) [duplicate]

This question already has answers here:
UnicodeEncodeError: 'ascii' codec can't encode character in position 0: ordinal not in range(128)
(4 answers)
Closed 8 years ago.
I am making a request using the requests lib. As can be seen in the content, the page's encoding is UTF-8, and r.encoding shows that requests also defaults to UTF-8, so why does reading r.text raise a Unicode error?
r = requests.get(url, auth=('username', 'password'))
print r.status_code
print r.encoding
print r.content
print r.text
output:
200
UTF-8
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"><html><head><title>Sign In</title><meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7"/><meta http-equiv="content-type" content="text/html; charset=utf-8"/>.............
Traceback (most recent call last):
File "E:\Python practise programms\reuters.py", line 18, in <module>
print r.text
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2307-2309: ordinal not in range(128)
This was due to Sublime Text not supporting UTF-8 output by default; I was able to get the output in Python's IDLE.
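Independent of the editor, the same failure mode can be sidestepped by re-encoding with a lossy error handler before printing. A Python 3 sketch (the sample string is invented):

```python
# When the output stream can't represent a character, encode with a
# fallback error handler instead of letting print() raise.
s = 'Sign In caf\u00e9'                               # contains 'é'
safe = s.encode('ascii', 'backslashreplace').decode('ascii')
print(safe)                                           # Sign In caf\xe9
```

'backslashreplace' keeps the information as an escape sequence; 'replace' would substitute '?' instead.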

urllib: get utf-8 encoded site source code

I'm trying to fetch a segment of some website. The script works, however it's a website that has accents such as á, é, í, ó, ú.
When I fetch the site using urllib or urllib2, the site source code is not encoded in utf-8, which I would like it to be, as utf-8 supports these accents.
I believe that the target site is encoded in utf-8 as it contains the following meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
My python script:
opener = urllib2.build_opener()
opener.addheaders = [('Accept-Charset', 'utf-8')]
url_response = opener.open(url)
deal_html = url_response.read().decode('utf-8')
However, I keep getting results that look like they are not encoded in utf-8.
E.g: "Milán" on website = "Mil\xe1n" after urllib2 fetches it
Any suggestions?
Your script is working correctly. "Mil\xe1n" is just the repr of the unicode object that decoding produced; the data itself is fine. For example:
>>> "Mil\xc3\xa1n".decode('utf-8')
u'Mil\xe1n'
The "\xc3\xa1" sequence is the UTF-8 sequence for leter a with diacritic mark: á.

lxml.html and Unicode: Extract Links

The code below extracts links from a web page and shows them in a browser. With a lot of UTF-8 encoded webpages this works great. But the French Wikipedia page http://fr.wikipedia.org/wiki/États_unis for example produces an error.
# -*- coding: utf-8 -*-
print 'Content-Type: text/html; charset=utf-8\n'
print '''<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<title>Show Links</title>
</head>
<body>'''
import urllib2, lxml.html as lh

def load_page(url):
    headers = {'User-Agent' : 'Mozilla/5.0 (compatible; testbot/0.1)'}
    try:
        req = urllib2.Request(url, None, headers)
        response = urllib2.urlopen(req)
        page = response.read()
        return page
    except:
        print '<b>Couldn\'t load:', url, '</b><br>'
        return None

def show_links(page):
    tree = lh.fromstring(page)
    for node in tree.xpath('//a'):
        if 'href' in node.attrib:
            url = node.attrib['href']
            if '#' in url:
                url = url.split('#')[0]
            if '#' not in url and 'javascript' not in url:
                if node.text:
                    linktext = node.text
                else:
                    linktext = '-'
                print '<a href="%s">%s</a><br>' % (url, linktext.encode('utf-8'))

page = load_page('http://fr.wikipedia.org/wiki/%C3%89tats_unis')
show_links(page)
print '''
</body>
</html>
'''
I get the following error:
Traceback (most recent call last):
File "C:\***\question.py", line 42, in <module>
show_links(page)
File "C:\***\question.py", line 39, in show_links
print '%s<br>' % (url, linktext.encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
My system: Python 2.6 (Windows), lxml 2.3.3, Apache Server (to show the results)
What am I doing wrong?
You need to encode url too.
The problem might be similar to:
>>> "%s%s" % (u"", "€ <-non-ascii char in a bytestring")
Traceback (most recent call last):
File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in
range(128)
But this works:
>>> "%s%s" % (u"".encode('utf-8'), "€ <-non-ascii char in a bytestring")
'\xe2\x82\xac <-non-ascii char in a bytestring'
The empty unicode string forces the whole expression to be converted to unicode; that is why you see a UnicodeDecodeError.
In general it is a bad idea to mix Unicode and bytestrings. It might appear to be working but sooner or later it breaks. Convert text to Unicode as soon as you receive it, process it and then convert it to bytes for I/O.
lxml returns bytestrings not unicode. It might be better to decode the bytestring to unicode using whatever encoding the page was served with, before encoding as utf-8.
If your text is already in utf-8, there is no need to do any encoding or decoding - just take that operation out.
However, if your linktext is of type unicode (as you say it is), then it is a unicode string (each element represents a unicode codepoint), and encoding as utf-8 should work perfectly well.
I suspect the problem is that your url string is also a unicode string, and it also needs to be encoded as utf-8 before being substituted into your bytestring.
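The advice above boils down to keeping everything as one type until the final write. A Python 3 sketch of the fixed print line (variable names mirror the question's code; the values are sample data):

```python
# Decode once at the input boundary, format as str, and encode once at
# the output boundary -- never mix bytes and str in one '%' expression.
url = u'/wiki/%C3%89tats_unis'          # already text
linktext = u'\u00c9tats unis'           # 'États unis', also text

line = u'<a href="%s">%s</a><br>' % (url, linktext)
payload = line.encode('utf-8')          # bytes, ready for the response
print(line)
```

Because both format arguments are text, there is no implicit ascii decode anywhere, which is precisely what the Python 2 traceback was complaining about.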
