I'm trying to parse an HTML page with code like this:
# coding: utf8
[...]
def search(self, a, b):
    word = self.champ_rech_canal.get_text()
    url_canal = "http://www.canalplus.fr/pid3330-c-recherche.html?rechercherSite=" + word
    try:
        f = urllib.urlopen(url_canal)
        self.feuille_canal = f.read()
        f.close()
    except IOError:
        self.champ_rech_canal.set_text("The search failed")
        return
    print self.feuille_canal
The result is mostly right, but I get � where "é" or "ô" should appear.
How can I decode it?
Tried:
self.feuille_canal = self.feuille_canal.decode("utf-8")
Result:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 8789: invalid continuation byte
You are trying to decode an ISO-8859-1 page as UTF-8, which cannot work. See the content header in the returned HTML:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
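A minimal sketch of decoding with the page's declared charset instead (Python 3 syntax; the byte string below merely stands in for the HTML returned by f.read()):

```python
# Simulated Latin-1 bytes, standing in for the downloaded HTML
raw = "La recherche a échoué".encode("iso-8859-1")

# Decoding as UTF-8 fails: 0xe9 ("é" in Latin-1) is not a valid UTF-8 sequence here
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    pass

# Decoding with the charset the page actually declares works
text = raw.decode("iso-8859-1")
```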
I'm trying to use awdeorio's mailmerge. In the HTML template I have French accented characters inside paragraph tags.
When I execute the mailing I get encoding errors:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf4 in position 81: invalid continuation byte
How should I encode those paragraphs so that Python handles them correctly?
TO: {{email}}
SUBJECT: Testing mailmerge
FROM: My Self <myself#mydomain.com>
Content-Type: text/html
<html>
<body>
<p>Hi, {{name}},</p>
<p>Your number is {{number}}.</p>
<p>Sent by Here is the paragraph. Ce texte est en francais. <b>Accentué</b>. L'ideal</p>
</body>
</html>
Hex F4 is Latin-1 for ô. If this template was typed into a Python source file, you needed this declaration at the start of the file:
# -*- coding: utf-8 -*-
If the data is coming from a database or some other source, please provide more details.
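A small illustration of the diagnosis above (Python 3 syntax; the byte string is a hypothetical example, not your actual template data):

```python
# 0xf4 is "ô" in Latin-1/cp1252; on its own it is not a valid UTF-8 sequence
raw = b"h\xf4tel"

try:
    text = raw.decode("utf-8")
except UnicodeDecodeError:
    # fall back to the legacy single-byte encoding
    text = raw.decode("latin-1")
```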
I want to scrape the top gainers (%) data from the link below, but it returns UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 211: invalid start byte
import requests
from lxml import html
page_indo = requests.get('http://www.sinarmassekuritas.co.id/id/index.asp')
indo = html.fromstring(page_indo.content)
indo = indo.xpath('//tr/td/text()')
I did not find anything weird at position 211 when I view the source of the page. Please advise how to avoid this error and get the data from the top gainers (%) table.
Updated
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<script type="text/javascript">
<!--
function MM_reloadPage(init) { //reloads the window if Nav4 resized
if (init==true) with (navigator) {if ((appName=="Netscape")&&(parseInt(appVersion)==4)) {
document.MM_pgW=innerWidth; document.MM_pgH=innerHeight; onresize=MM_reloadPage; }}
else if (innerWidth!=document.MM_pgW || innerHeight!=document.MM_pgH) location.reload();
}
MM_reloadPage(true);
I am not sure what the 211 is pointing at. tripleee said it is the 211th character from the beginning of the offending line.
If it is counted from <!DOCTYPE html, the character is the "i" in "(... reloads the window ...)";
if it is counted from <script type="text/javascript">, it is the underscore in document.MM_.
I am not sure how either of these could cause the error.
I downloaded a copy of this data and found the offending character at offset 103826. The error message from lxml isn't very helpful for debugging this.
The context around that place in the file is (wrapped for legibility)
b'tas Pancasakti Tegal dengan tema : \x93Pasar Modal sebagai'
b' indikator perekonomian negaradan peluang investasi pasar '
b'modal\x94.</td>'
I don't speak this language (Indonesian Malay?) so I have no idea what the offending character is supposed to represent, but https://tripleee.github.io/8bit#93 suggests a left curly quote U+201C in some legacy Windows 8-bit encoding, and the \x94 at the end of this fragment seems to reinforce this guess.
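That guess can be checked directly (Python 3 syntax; the fragment is shortened from the bytes quoted above):

```python
# \x93 and \x94 are curly quotes in the Windows cp1252 encoding
fragment = b"tema : \x93Pasar Modal\x94"

decoded = fragment.decode("cp1252")
# \x93 -> U+201C LEFT DOUBLE QUOTATION MARK, \x94 -> U+201D RIGHT
```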
For anyone else looking to solve this issue with Unicode and XPath, this works for me.
Assuming page = requests.get(url), instead of creating the lxml HTML tree this way:
tree = html.fromstring(page.content)
Use this:
tree = html.fromstring(page.content.decode("utf-8", "replace"))
This question already has answers here:
UnicodeEncodeError: 'ascii' codec can't encode character in position 0: ordinal not in range(128)
(4 answers)
Closed 8 years ago.
I am making a request using the requests library. As seen in the content, the page's charset is utf-8, and requests also reports UTF-8 in r.encoding, so why does printing r.text raise a Unicode error?
r = requests.get(url, auth=('username', 'password'))
print r.status_code
print r.encoding
print r.content
print r.text
output:
200
UTF-8
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"><html><head><title>Sign In</title><meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7"/><meta http-equiv="content-type" content="text/html; charset=utf-8"/>.............
Traceback (most recent call last):
File "E:\Python practise programms\reuters.py", line 18, in <module>
print r.text
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2307-2309: ordinal not in range(128)
This was due to Sublime Text's build console not supporting UTF-8 output by default. I was able to get the output in Python IDLE.
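Note that this traceback is an encode error, not a decode error: printing implicitly encodes the Unicode text to the console's charset, and an ASCII-only console cannot represent the page's characters. A minimal Python 3 illustration (the sample string is made up):

```python
text = "Sign In \u2013 example"   # contains a non-ASCII en dash

# Encoding to ASCII fails, just like printing to an ASCII-only console
try:
    text.encode("ascii")
    failed = False
except UnicodeEncodeError:
    failed = True

# UTF-8 can represent it fine
utf8_bytes = text.encode("utf-8")
```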
The code below extracts links from a web page and shows them in a browser. With a lot of UTF-8 encoded webpages this works great. But the French Wikipedia page http://fr.wikipedia.org/wiki/États_unis for example produces an error.
# -*- coding: utf-8 -*-
print 'Content-Type: text/html; charset=utf-8\n'
print '''<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<title>Show Links</title>
</head>
<body>'''
import urllib2, lxml.html as lh

def load_page(url):
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; testbot/0.1)'}
    try:
        req = urllib2.Request(url, None, headers)
        response = urllib2.urlopen(req)
        page = response.read()
        return page
    except:
        print '<b>Couldn\'t load:', url, '</b><br>'
        return None

def show_links(page):
    tree = lh.fromstring(page)
    for node in tree.xpath('//a'):
        if 'href' in node.attrib:
            url = node.attrib['href']
            if '#' in url:
                url = url.split('#')[0]
            if '#' not in url and 'javascript' not in url:
                if node.text:
                    linktext = node.text
                else:
                    linktext = '-'
                print '<a href="%s">%s</a><br>' % (url, linktext.encode('utf-8'))

page = load_page('http://fr.wikipedia.org/wiki/%C3%89tats_unis')
show_links(page)
print '''
</body>
</html>
'''
I get the following error:
Traceback (most recent call last):
File "C:\***\question.py", line 42, in <module>
show_links(page)
File "C:\***\question.py", line 39, in show_links
print '<a href="%s">%s</a><br>' % (url, linktext.encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
My system: Python 2.6 (Windows), lxml 2.3.3, Apache Server (to show the results)
What am I doing wrong?
You need to encode url too.
The problem might be similar to:
>>> "%s%s" % (u"", "€ <-non-ascii char in a bytestring")
Traceback (most recent call last):
File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in
range(128)
But this works:
>>> "%s%s" % (u"".encode('utf-8'), "€ <-non-ascii char in a bytestring")
'\xe2\x82\xac <-non-ascii char in a bytestring'
The empty Unicode string forces the whole expression to be converted to Unicode; therefore you see a UnicodeDecodeError.
In general it is a bad idea to mix Unicode and bytestrings. It might appear to be working but sooner or later it breaks. Convert text to Unicode as soon as you receive it, process it and then convert it to bytes for I/O.
lxml returns bytestrings, not unicode. It might be better to decode the bytestring to unicode using whatever encoding the page was served with, before encoding as utf-8.
If your text is already in utf-8, there is no need to do any encoding or decoding - just take that operation out.
However, if your linktext is of type unicode (as you say it is), then it is a unicode string (each element represents a unicode codepoint), and encoding as utf-8 should work perfectly well.
I suspect the problem is that your url string is also a unicode string, and it also needs to be encoded as utf-8 before being substituted into your bytestring.
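The same boundary discipline, sketched in Python 3 terms (the URL and link text here are made up for illustration): decode incoming bytes once, work with text throughout, and encode only at output.

```python
# Bytes as they might arrive from the network (UTF-8 in this example)
raw = "États-Unis".encode("utf-8")

# 1. Decode at the boundary
linktext = raw.decode("utf-8")

# 2. Work with text only -- no bytes/text mixing
line = '<a href="%s">%s</a><br>' % ("/wiki/%C3%89tats-Unis", linktext)

# 3. Encode once, at output time
output = line.encode("utf-8")
```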
On a Python driven web app using a sqlite datastore I had this error:
Could not decode to UTF-8 column
'name' with text '300µL-10-10'
Reading here it looks like I need to switch my text-factory to str and get bytestrings but when I do this my html output looks like this:
300�L-10-10
I do have my content-type set as:
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
Unfortunately, the data in your datastore is not encoded as UTF-8; instead, it's probably either latin-1 or cp1252. To decode it automatically, try setting Connection.text_factory to your own function:
def convert_string(s):
    try:
        u = s.decode("utf-8")
    except UnicodeDecodeError:
        u = s.decode("cp1252")
    return u

conn.text_factory = convert_string
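A self-contained sketch of this text_factory in use (Python 3 syntax; the CAST trick and table name below only simulate legacy cp1252 data sitting in a TEXT column):

```python
import sqlite3

def convert_string(s):
    # sqlite3 hands TEXT values to a custom text_factory as raw bytes
    try:
        return s.decode("utf-8")
    except UnicodeDecodeError:
        return s.decode("cp1252")

conn = sqlite3.connect(":memory:")
conn.text_factory = convert_string
conn.execute("CREATE TABLE samples (name TEXT)")

# Simulate legacy data: cp1252 bytes stored in a TEXT column
legacy = "300µL-10-10".encode("cp1252")
conn.execute("INSERT INTO samples VALUES (CAST(? AS TEXT))", (legacy,))

name = conn.execute("SELECT name FROM samples").fetchone()[0]
```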