I'm trying to fetch a segment of a website. The script works; however, the site contains accented characters such as á, é, í, ó, ú.
When I fetch the site using urllib or urllib2, the source I get back doesn't appear to be encoded in UTF-8, which is what I'd like, as UTF-8 supports these accents.
I believe that the target site is encoded in utf-8 as it contains the following meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
My python script:
opener = urllib2.build_opener()
opener.addheaders = [('Accept-Charset', 'utf-8')]
url_response = opener.open(url)
deal_html = url_response.read().decode('utf-8')
However, I keep getting results that look like they are not encoded in UTF-8.
E.g: "Milán" on website = "Mil\xe1n" after urllib2 fetches it
Any suggestions?
Your script is working correctly. The "\xe1" you see is just how Python displays the character á in the repr of the unicode object that results from decoding. For example:
>>> "Mil\xc3\xa1n".decode('utf-8')
u'Mil\xe1n'
The "\xc3\xa1" sequence is the UTF-8 sequence for leter a with diacritic mark: á.
Related
Here is the MWE, test.py. The test webpage, written inline below as mypage, is also served from http://sdaaubckp.sourceforge.net/test/test-utf8.html, so you should be able to run this script as-is:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os, sys
import re
import lxml.html as LH
import requests
if sys.version_info[0] < 3: # python 2
    from StringIO import StringIO
else: # python 3
    from io import StringIO
# this page uploaded on: http://sdaaubckp.sourceforge.net/test/test-utf8.html
mypage = """
<!doctype html>
<html lang="en">
<head>
<!-- Basic Page Needs
–––––––––––––––––––––––––––––––––––––––––––––––––– -->
<meta charset="utf-8">
<title>My Page</title>
<meta name="description" content="">
<meta name="author" content="">
</head>
<body>
<div>Testing: tøst</div>
</body>
</html>
"""
url_page = "http://sdaaubckp.sourceforge.net/test/test-utf8.html"
confpage = requests.get(url_page)
print(confpage.encoding) # it detects ISO-8859-1, even if the page declares <meta charset="utf-8">?
confpage.encoding = "UTF-8"
print(confpage.encoding) # now it says UTF-8, but...
#print(confpage.content)
if sys.version_info[0] < 3: # python 2
    mystr = confpage.content
else: # python 3
    mystr = confpage.content.decode("utf-8")

for line in iter(mystr.splitlines()):
    if 'Testing' in line:
        print(line)
confpagetree = LH.fromstring(confpage.content)
print(confpagetree) # <Element html at 0x7f4b7074eec0>
#print(confpagetree.text_content())
for line in iter(confpagetree.text_content().splitlines()):
    if 'Testing' in line:
        print(line)
I'm running this on Ubuntu 14.04.5 LTS; both Python 2 and 3 give the same results with this script:
$ python2 test.py
ISO-8859-1
UTF-8
<div>Testing: tøst</div>
<Element html at 0x7fb5b9d12ec0>
Testing: tÃ¸st
$ python3 test.py
ISO-8859-1
UTF-8
<div>Testing: tøst</div>
<Element html at 0x7f272fc53318>
Testing: tÃ¸st
Note how:
In both cases, confpage.encoding detects ISO-8859-1, even if the webpage declares <meta charset="utf-8">
In both cases, correct UTF-8 character ø is printed from confpage.content
In both cases, the corrupt UTF-8 representation Ã¸ is output from lxml.html.fromstring(confpage.content).text_content()
My suspicion is that, since the webpage uses the UTF-8 character – (Char: '–' u: 8211 [0x2013] b: 226,128,147 [0xE2,0x80,0x93] n: EN DASH [General Punctuation]) before it declares <meta charset="utf-8"> in the <head>, this somehow borks requests and/or lxml.html.fromstring().text_content(), resulting in the corrupt representation.
My question is - what can I do, so I get a correct UTF-8 character at the output of lxml.html.fromstring().text_content() - hopefully for both Python 2 and 3?
The root problem is that you're using confpage.content instead of confpage.text.
requests.Response.content gives you the raw bytes (bytes in 3.x, str in 2.x), as pulled off the wire. It doesn't matter what the encoding is, because you aren't using it.
requests.Response.text gives you the decoded Unicode (str in 3.x, unicode in 2.x), based on the encoding.
So, setting the encoding but then using content doesn't do anything. If you just change the rest of your code to use text instead of content (and get rid of the now-spurious decode for Python 3), it will work:
mystr = confpage.text
for line in iter(mystr.splitlines()):
    if 'Testing' in line:
        print(line)

confpagetree = LH.fromstring(confpage.text)
print(confpagetree) # <Element html at 0x7f4b7074eec0>
#print(confpagetree.text_content())
for line in iter(confpagetree.text_content().splitlines()):
    if 'Testing' in line:
        print(line)
If you want to go through the exact problem with each of your examples:
Your first example is right in Python 3, but it's not the best way to do it. Because the bytes do happen to be UTF-8, calling decode("utf-8") on the content decodes them properly, so they print out properly.
Your first example is wrong in Python 2. You're just printing the content, which is a bunch of UTF-8 bytes. If your console is UTF-8 (as it is on macOS, and might be on Linux), this will happen to work. If your console is something else, like cp1252 or Latin-1 (as it is on Windows, and might be on Linux), this will give you mojibake.
Your second example is also wrong. By passing bytes to LH.fromstring, you're forcing lxml to guess what encoding to use, and it guesses Latin-1, so you get mojibake.
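If for some reason you must keep handing lxml raw bytes, you can also tell it the encoding explicitly instead of letting it guess; a minimal sketch, assuming the page really is UTF-8:

import lxml.html as LH

# Parse the raw bytes with an explicit encoding rather than relying on lxml's guess
parser = LH.HTMLParser(encoding='utf-8')
confpagetree = LH.fromstring(confpage.content, parser=parser)
print(confpagetree.text_content())  # should now contain the correct ø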
I need to get a page source (HTML) and convert it to UTF-8, because I want to find some text in the page (like: if 'my_same_text' in page_source: then ...). The page contains Russian text (Cyrillic symbols) and this tag:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
I use Flask and the requests Python lib.
I send the request:
source = requests.get('url/')
if 'сyrillic symbols' in source.text: ...
and I can't find my text; this is due to the encoding.
How can I convert the text to UTF-8? I tried .encode() and .decode(), but it did not help.
Let's create a page with a windows-1251 charset given in the meta tag and some Russian text. I saved it in Sublime Text as a windows-1251 file, to be sure.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
</head>
<body>
<p>Привет, мир!</p>
</body>
</html>
You can use a little trick in the requests library:
If you change the encoding, Requests will use the new value of r.encoding whenever you call r.text.
So it goes like that:
In [1]: import requests
In [2]: result = requests.get('http://127.0.0.1:1234/1251.html')
In [3]: result.encoding = 'windows-1251'
In [4]: u'Привет' in result.text
Out[4]: True
Voila!
If it doesn't work for you, there's a slightly uglier approach.
You should take a look at what encoding the web server is actually sending you.
It may be that the encoding of the response is actually cp1252 (often conflated with ISO-8859-1), or something else entirely, but neither utf-8 nor cp1251. It can differ and depends on the web server!
In [1]: import requests
In [2]: result = requests.get('http://127.0.0.1:1234/1251.html')
In [3]: result.encoding
Out[3]: 'ISO-8859-1'
So we should recode it accordingly.
In [4]: u'Привет'.encode('cp1251').decode('cp1252') in result.text
Out[4]: True
But that just looks ugly to me (also, I suck at encodings and it's not really the best solution at all). I'd go with re-setting the encoding using requests itself.
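If you don't want to hard-code the charset, requests can also guess it from the body itself via apparent_encoding; a sketch (the guess comes from the bundled character-detection library, so it isn't guaranteed to be right):

In [5]: result.encoding = result.apparent_encoding  # e.g. 'windows-1251' for this page
In [6]: u'Привет' in result.text
Out[6]: True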
As documented, requests automatically decodes the response into unicode for response.text, so you must either look for a unicode string:
if u'cyrillic symbols' in source.text:
    # ...
or encode response.text in the appropriate encoding:
# -*- coding: utf-8 -*-
# (....)
if 'cyrillic symbols' in source.text.encode("utf-8"):
    # ...
The first solution is much simpler and lighter.
From this website http://engine.data.cnzz.com/main.php?s=engine&uv=&st=2014-03-01&et=2014-03-31
<tr class="list03" onclick="showMen1(9);" style="cursor:pointer;">
<td id="e_9" class="qh_one">百度汇总</td>
I'm scraping the text and trying to get 百度汇总
but when I set r.encoding = 'utf-8' the result is �ٶȻ���
if I don't use utf-8 the result is °Ù¶È»ã×Ü
The server doesn't tell you anything helpful in the response headers, but the HTML page itself contains:
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
GB2312 is a variable-width encoding, like UTF-8. The page lies, however; it actually uses GBK, an extension of GB2312.
You can decode it with GBK just fine:
>>> len(r.content.decode('gbk'))
44535
>>> u'百度汇总' in r.content.decode('gbk')
True
Decoding with gb2312 fails:
>>> r.content.decode('gb2312')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 26367-26368: illegal multibyte sequence
but since GBK is a superset of GB2312, it should always be safe to use the former even when the latter is specified.
If you are using requests, then setting r.encoding to gb2312 works because r.text uses replace when handling decode errors:
content = str(self.content, encoding, errors='replace')
so the decoding error when using GB2312 is masked for those codepoints only defined in GBK.
Note that BeautifulSoup can do the decoding all by itself; it'll find the meta header:
>>> soup = BeautifulSoup(r.content)
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
The warning is caused by the GBK codepoints being used while the page claims to use GB2312.
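With requests, then, the cleaner route is to override the encoding with GBK and read r.text, rather than decoding r.content yourself; a short sketch, assuming r is the response for that URL:

>>> r.encoding = 'gbk'      # override the declared gb2312
>>> u'百度汇总' in r.text     # now decoded with GBK, no replacement characters
True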
The code below extracts links from a web page and shows them in a browser. With a lot of UTF-8 encoded webpages this works great. But the French Wikipedia page http://fr.wikipedia.org/wiki/États_unis for example produces an error.
# -*- coding: utf-8 -*-
print 'Content-Type: text/html; charset=utf-8\n'
print '''<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<title>Show Links</title>
</head>
<body>'''
import urllib2, lxml.html as lh
def load_page(url):
    headers = {'User-Agent' : 'Mozilla/5.0 (compatible; testbot/0.1)'}
    try:
        req = urllib2.Request(url, None, headers)
        response = urllib2.urlopen(req)
        page = response.read()
        return page
    except:
        print '<b>Couldn\'t load:', url, '</b><br>'
        return None
def show_links(page):
    tree = lh.fromstring(page)
    for node in tree.xpath('//a'):
        if 'href' in node.attrib:
            url = node.attrib['href']
            if '#' in url:
                url = url.split('#')[0]
            if '#' not in url and 'javascript' not in url:
                if node.text:
                    linktext = node.text
                else:
                    linktext = '-'
                print '<a href="%s">%s</a><br>' % (url, linktext.encode('utf-8'))
page = load_page('http://fr.wikipedia.org/wiki/%C3%89tats_unis')
show_links(page)
print '''
</body>
</html>
'''
I get the following error:
Traceback (most recent call last):
  File "C:\***\question.py", line 42, in <module>
    show_links(page)
  File "C:\***\question.py", line 39, in show_links
    print '<a href="%s">%s</a><br>' % (url, linktext.encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
My system: Python 2.6 (Windows), lxml 2.3.3, Apache Server (to show the results)
What am I doing wrong?
You need to encode url too.
The problem might be similar to:
>>> "%s%s" % (u"", "€ <-non-ascii char in a bytestring")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
But this works:
>>> "%s%s" % (u"".encode('utf-8'), "€ <-non-ascii char in a bytestring")
'\xe2\x82\xac <-non-ascii char in a bytestring'
The empty Unicode string forces the whole expression to be converted to Unicode, which is why you see the UnicodeDecodeError.
In general it is a bad idea to mix Unicode and bytestrings. It might appear to be working but sooner or later it breaks. Convert text to Unicode as soon as you receive it, process it and then convert it to bytes for I/O.
lxml returns bytestrings not unicode. It might be better to decode the bytestring to unicode using whatever encoding the page was served with, before encoding as utf-8.
If your text is already in utf-8, there is no need to do any encoding or decoding - just take that operation out.
However, if your linktext is of type unicode (as you say it is), then it is a unicode string (each element represents a unicode codepoint), and encoding as utf-8 should work perfectly well.
I suspect the problem is that your url string is also a unicode string, and it also needs to be encoded as utf-8 before being substituted into your bytestring.
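Putting that together, the immediate fix for the failing line in the question is to encode url as well, so everything being substituted into the bytestring is itself a bytestring (a sketch using the question's own variable names):

print '<a href="%s">%s</a><br>' % (url.encode('utf-8'), linktext.encode('utf-8'))

A cleaner variant, in the spirit of the advice above, is to keep url and linktext as unicode, build a unicode result with a unicode format string, and encode it once when printing.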
On a Python driven web app using a sqlite datastore I had this error:
Could not decode to UTF-8 column 'name' with text '300µL-10-10'
Reading here it looks like I need to switch my text-factory to str and get bytestrings but when I do this my html output looks like this:
300�L-10-10
I do have my content-type set as:
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
Unfortunately, the data in your datastore is not encoded as UTF-8; instead, it's probably either latin-1 or cp1252. To decode it automatically, try setting Connection.text_factory to your own function:
def convert_string(s):
    try:
        u = s.decode("utf-8")
    except UnicodeDecodeError:
        u = s.decode("cp1252")
    return u

conn.text_factory = convert_string
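For context, a minimal usage sketch (the database filename, table, and column names are made up for illustration):

import sqlite3

conn = sqlite3.connect('app.db')        # hypothetical database file
conn.text_factory = convert_string      # decode as UTF-8, fall back to cp1252

for (name,) in conn.execute("SELECT name FROM samples"):   # hypothetical table/column
    print name.encode('utf-8')          # name is now unicode; encode for UTF-8 output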