The code below extracts the links from a web page and shows them in a browser. This works fine for many UTF-8 encoded web pages, but the French Wikipedia page http://fr.wikipedia.org/wiki/États_unis, for example, produces an error.
# -*- coding: utf-8 -*-
print 'Content-Type: text/html; charset=utf-8\n'
print '''<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<title>Show Links</title>
</head>
<body>'''
import urllib2, lxml.html as lh
def load_page(url):
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; testbot/0.1)'}
    try:
        req = urllib2.Request(url, None, headers)
        response = urllib2.urlopen(req)
        page = response.read()
        return page
    except:
        print '<b>Couldn\'t load:', url, '</b><br>'
        return None
def show_links(page):
    tree = lh.fromstring(page)
    for node in tree.xpath('//a'):
        if 'href' in node.attrib:
            url = node.attrib['href']
            if '#' in url:
                url = url.split('#')[0]
            if '#' not in url and 'javascript' not in url:
                if node.text:
                    linktext = node.text
                else:
                    linktext = '-'
                print '%s (%s)<br>' % (url, linktext.encode('utf-8'))
page = load_page('http://fr.wikipedia.org/wiki/%C3%89tats_unis')
show_links(page)
print '''
</body>
</html>
'''
I get the following error:
Traceback (most recent call last):
File "C:\***\question.py", line 42, in <module>
show_links(page)
File "C:\***\question.py", line 39, in show_links
print '%s (%s)<br>' % (url, linktext.encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
My system: Python 2.6 (Windows), lxml 2.3.3, Apache Server (to show the results)
What am I doing wrong?
You need to encode url too.
The problem might be similar to:
>>> "%s%s" % (u"", "€ <-non-ascii char in a bytestring")
Traceback (most recent call last):
File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
But this works:
>>> "%s%s" % (u"".encode('utf-8'), "€ <-non-ascii char in a bytestring")
'\xe2\x82\xac <-non-ascii char in a bytestring'
The empty unicode string forces the whole expression to be converted to unicode, which is why you see the UnicodeDecodeError.
In general it is a bad idea to mix Unicode and bytestrings. It might appear to be working but sooner or later it breaks. Convert text to Unicode as soon as you receive it, process it and then convert it to bytes for I/O.
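A minimal sketch of that pattern applied here, assuming the page really is UTF-8 (as Wikipedia's pages are; the variable names are illustrative):

page = load_page(url)            # bytes off the wire
text = page.decode('utf-8')      # decode once, at the input boundary
# ... do all processing on unicode objects ...
print text.encode('utf-8')       # encode once, at the output boundary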
lxml can return bytestrings rather than unicode (it uses plain str for ASCII-only text and unicode otherwise). It might be better to decode the bytestring to unicode, using whatever encoding the page was served with, before encoding as utf-8.
If your text is already in utf-8, there is no need to do any encoding or decoding; just take that operation out.
However, if your linktext is of type unicode (as you say it is), then it is a unicode string (each element represents a unicode codepoint), and encoding as utf-8 should work perfectly well.
I suspect the problem is that your url string is also a unicode string, and it also needs to be encoded as utf-8 before being substituted into your bytestring.
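A minimal sketch of that fix (to_utf8 is a hypothetical helper, and this assumes lxml is handing back unicode for the non-ASCII href values):

def to_utf8(s):
    # Encode unicode to UTF-8 bytes; pass bytestrings through unchanged.
    if isinstance(s, unicode):
        return s.encode('utf-8')
    return s

print '%s (%s)<br>' % (to_utf8(url), to_utf8(linktext))

With both substituted values as bytestrings, the % formatting stays in bytes and no implicit ascii decode is triggered.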
Related
I want to scrape the top gainers (%) data from the link, but it returns UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 211: invalid start byte
import requests
from lxml import html
page_indo = requests.get('http://www.sinarmassekuritas.co.id/id/index.asp')
indo = html.fromstring(page_indo.content)
indo = indo.xpath('//tr/td/text()')
I did not find anything weird at position 211 when I viewed the source of the page. Please advise how to avoid this error and get the data from the top gainers (%) table.
Updated
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<script type="text/javascript">
<!--
function MM_reloadPage(init) { //reloads the window if Nav4 resized
if (init==true) with (navigator) {if ((appName=="Netscape")&&(parseInt(appVersion)==4)) {
document.MM_pgW=innerWidth; document.MM_pgH=innerHeight; onresize=MM_reloadPage; }}
else if (innerWidth!=document.MM_pgW || innerHeight!=document.MM_pgH) location.reload();
}
MM_reloadPage(true);
I am not sure what the 211 is pointing at. Tripleee said it is the 211th character from the beginning of the offending line.
If it is counted from <!DOCTYPE html, then the character is the i in (... reloads the window ...);
if it is counted from <script type="text/javascript">, then it is the underscore in document.MM_.
I am not sure how either of these could cause the error.
I downloaded a copy of this data and found the offending character at offset 103826. The error message from lxml isn't very helpful for debugging this.
The context around that place in the file is (wrapped for legibility)
b'tas Pancasakti Tegal dengan tema : \x93Pasar Modal sebagai'
b' indikator perekonomian negaradan peluang investasi pasar '
b'modal\x94.</td>'
I don't speak this language (Indonesian Malay?), so I have no idea what the offending character is supposed to represent, but https://tripleee.github.io/8bit#93 suggests a left curly quote U+201C in some legacy Windows 8-bit encoding, and the \x94 at the end of this fragment seems to reinforce this guess.
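You can confirm the guess in a REPL. Assuming the legacy encoding is Windows-1252 (cp1252), the byte pair decodes to curly quotes, and decoding the whole page that way avoids the error:

>>> b'\x93Pasar Modal\x94'.decode('cp1252')
u'\u201cPasar Modal\u201d'
>>> indo = html.fromstring(page_indo.content.decode('cp1252', 'replace'))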
For anyone else trying to solve this issue of unicode and XPath, this works for me.
Assuming page = requests.get(url), instead of creating the lxml HTML tree this way:
tree = html.fromstring(page.content)
Use this:
tree = html.fromstring(page.content.decode("utf-8", "replace"))
I am making a request with the requests library. As can be seen in the content below, the page's encoding is UTF-8, and UTF-8 is also what requests detects (see r.encoding). So why does printing the text raise a Unicode error?
r = requests.get(url, auth=('username', 'password'))
print r.status_code
print r.encoding
print r.content
print r.text
output:
200
UTF-8
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"><html><head><title>Sign In</title><meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7"/><meta http-equiv="content-type" content="text/html; charset=utf-8"/>.............
Traceback (most recent call last):
File "E:\Python practise programms\reuters.py", line 18, in <module>
print r.text
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2307-2309: ordinal not in range(128)
This was due to Sublime Text's output console not supporting UTF-8 by default. I was able to get the output in Python IDLE.
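A workaround sketch if you want print to survive any console: encode explicitly to whatever encoding the output stream advertises, replacing what it cannot represent (the fallback to UTF-8 is an assumption for streams that advertise nothing):

import sys

enc = sys.stdout.encoding or 'utf-8'   # console encoding, UTF-8 as a fallback
print r.text.encode(enc, 'replace')    # never raises UnicodeEncodeError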
I'm trying to parse an HTML page like this
# coding: utf8
[...]
def search(self, a, b):
    mot_canal = self.champ_rech_canal.get_text()
    url_canal = "http://www.canalplus.fr/pid3330-c-recherche.html?rechercherSite=" + mot_canal
    try:
        f = urllib.urlopen(url_canal)
        self.feuille_canal = f.read()
        f.close()
    except:
        self.champ_rech_canal.set_text("La recherche a échoué")
    print self.feuille_canal
The result is mostly good, but I get � where "é" or "ô" should appear.
How can I decode it?
Tried:
self.feuille_canal = self.feuille_canal.decode("utf-8")
Result:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 8789: invalid continuation byte
You are trying to decode an ISO-8859-1 page as UTF-8, which cannot work. See the content header in the returned HTML:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
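So the sketch of a fix is to decode with the encoding the server actually used; byte 0xe9 from the traceback is exactly é in ISO-8859-1:

self.feuille_canal = self.feuille_canal.decode("iso-8859-1")
# If the rest of the program expects UTF-8 bytes, re-encode afterwards:
# self.feuille_canal = self.feuille_canal.encode("utf-8")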
From this website http://engine.data.cnzz.com/main.php?s=engine&uv=&st=2014-03-01&et=2014-03-31
<tr class="list03" onclick="showMen1(9);" style="cursor:pointer;">
<td id="e_9" class="qh_one">百度汇总</td>
I'm scraping the text and trying to get 百度汇总
but when I set r.encoding = 'utf-8' the result is �ٶȻ���, and if I don't set utf-8 the result is °Ù¶È»ã×Ü.
The server doesn't tell you anything helpful in the response headers, but the HTML page itself contains:
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
GB2312 is a variable-width encoding, like UTF-8. The page lies, however; it actually uses GBK, an extension of GB2312.
You can decode it with GBK just fine:
>>> len(r.content.decode('gbk'))
44535
>>> u'百度汇总' in r.content.decode('gbk')
True
Decoding with gb2312 fails:
>>> r.content.decode('gb2312')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 26367-26368: illegal multibyte sequence
but since GBK is a superset of GB2312, it should always be safe to use the former even when the latter is specified.
If you are using requests, then setting r.encoding to gb2312 works because r.text uses replace when handling decode errors:
content = str(self.content, encoding, errors='replace')
so the decoding error when using GB2312 is masked for those codepoints only defined in GBK.
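In practice, then, a sketch of the requests-side fix is simply to override the detected encoding before touching r.text:

r = requests.get('http://engine.data.cnzz.com/main.php?s=engine&uv=&st=2014-03-01&et=2014-03-31')
r.encoding = 'gbk'   # page declares gb2312 but actually uses its superset GBK
text = r.text        # now decodes cleanly, with no replacement characters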
Note that BeautifulSoup can do the decoding all by itself; it'll find the meta header:
>>> soup = BeautifulSoup(r.content)
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
The warning is caused by the GBK codepoints being used while the page claims to use GB2312.
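If you are on BeautifulSoup 4, you can also silence that warning by telling it the real encoding up front via its from_encoding parameter:

soup = BeautifulSoup(r.content, from_encoding='gbk')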
I'm trying to fetch a segment of some website. The script works, however it's a website that has accents such as á, é, í, ó, ú.
When I fetch the site using urllib or urllib2, the site source code is not encoded in utf-8, which I would like it to be, as utf-8 supports these accents.
I believe that the target site is encoded in utf-8 as it contains the following meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
My python script:
opener = urllib2.build_opener()
opener.addheaders = [('Accept-Charset', 'utf-8')]
url_response = opener.open(url)
deal_html = url_response.read().decode('utf-8')
However, I keep getting results that look like they are not encoded in utf-8.
E.g.: "Milán" on the website becomes "Mil\xe1n" after urllib2 fetches it.
Any suggestions?
Your script is working correctly. The "Mil\xe1n" you see is just the escaped representation (repr) of the unicode object resulting from decoding. For example:
>>> "Mil\xc3\xa1n".decode('utf-8')
u'Mil\xe1n'
The "\xc3\xa1" sequence is the UTF-8 sequence for leter a with diacritic mark: á.