How to get a webpage with unicode chars in python - python

I am trying to get and parse a webpage that contains non-ASCII characters (the URL is http://www.one.co.il). This is what I have:
url = "http://www.one.co.il"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
encoding = response.headers.getparam('charset') # windows-1255
html = response.read() # The length of this is valid - about 31000-32000,
# but printing the first characters shows garbage -
# '\x1f\x8b\x08\x00\x00\x00\x00\x00', instead of
# '<!DOCTYPE'
html_decoded = html.decode(encoding)
The last line gives me an exception:
File "C:/Users/....\WebGetter.py", line 16, in get_page
html_decoded = html.decode(encoding)
File "C:\Python27\lib\encodings\cp1255.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0xdb in position 14: character maps to <undefined>
I tried looking at other related questions such as "urllib2 read to Unicode" and "How to handle response encoding from urllib.request.urlopen()", but didn't find anything helpful about this.
Can someone please shed some light and guide me in this subject? Thanks!

0x1f 0x8b 0x08 is the magic number for a gzipped file. You will need to decompress it before you can use the contents.
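A minimal sketch of that step (`maybe_gunzip` is a helper name I made up): check for the gzip magic bytes and decompress with the stdlib gzip module before decoding.

```python
import gzip
import io

def maybe_gunzip(raw):
    # gzip streams always begin with the magic bytes 0x1f 0x8b
    if raw[:2] == b'\x1f\x8b':
        return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    return raw

# Round trip: a compressed page starts with the magic number seen in the question.
page = u'<!DOCTYPE html>'.encode('windows-1255')
compressed = gzip.compress(page)
print(compressed[:2])            # b'\x1f\x8b'
print(maybe_gunzip(compressed))  # b'<!DOCTYPE html>'
```

In the question's code, running `html` through such a helper before `html.decode(encoding)` should yield the expected `<!DOCTYPE` prefix.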


UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte [duplicate]

I am trying to make a crawler in python by following a Udacity course. I have this method get_page() which returns the content of the page.
def get_page(url):
    '''
    Open the given url and return the content of the page.
    '''
    data = urlopen(url)
    html = data.read()
    return html.decode('utf8')
The original method just returned data.read(), but that way I could not do operations like str.find(). After a quick search I found out I need to decode the data. But now I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position
1: invalid start byte
I have found similar questions in SO but none of them were specifically for this. Please help.
You are trying to decode an invalid string.
In UTF-8, a single-byte character must be in the range 0x00 to 0x7F, and a multi-byte sequence must start with a byte in the range 0xC2 to 0xF4.
0x8B has the bit pattern 10xxxxxx, which marks a continuation byte, so it can never be the first byte of a sequence: it is definitely invalid as a start byte.
From RFC3629 Section 3:
In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets. The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character number.
You should post the string you are trying to decode.
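To make the byte ranges concrete, here is a small sketch (a helper name I made up) that classifies a byte by the role it can play as the first byte of a UTF-8 sequence, following the ranges in RFC 3629:

```python
def utf8_start_byte_kind(b):
    """Classify a byte value by its possible role as the first byte of a UTF-8 sequence."""
    if b <= 0x7F:
        return '1-byte (ASCII)'
    if 0x80 <= b <= 0xBF:
        return 'continuation byte, invalid as a start byte'
    if 0xC2 <= b <= 0xDF:
        return '2-byte sequence start'
    if 0xE0 <= b <= 0xEF:
        return '3-byte sequence start'
    if 0xF0 <= b <= 0xF4:
        return '4-byte sequence start'
    return 'never valid in UTF-8'  # 0xC0, 0xC1, 0xF5-0xFF

print(utf8_start_byte_kind(0x8B))  # continuation byte, invalid as a start byte
```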
Maybe the page is encoded with other character encoding but 'utf-8'. So the start byte is invalid.
You could do this.
def get_page(self, url):
    if url is None:
        return None
    response = urllib.request.urlopen(url)
    if response.getcode() != 200:
        print("Http code:", response.getcode())
        return None
    raw = response.read()  # read once; a second read() on the same response returns b''
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw
Web servers often serve HTML pages with a Content-Type header that includes the encoding used to encode the page. The header might look like this:
Content-Type: text/html; charset=UTF-8
We can inspect the content of this header to find the encoding to use to decode the page:
from urllib.request import urlopen

def get_page(url):
    """ Open the given url and return the content of the page."""
    data = urlopen(url)
    content_type = data.headers.get('content-type', '')
    print(f'{content_type=}')
    encoding = 'latin-1'
    if 'charset' in content_type:
        _, _, encoding = content_type.rpartition('=')
    print(f'{encoding=}')
    html = data.read()
    return html.decode(encoding)
Using requests is similar:
response = requests.get(url)
content_type = response.headers.get('content-type', '')
Latin-1 (or ISO-8859-1) is a safe default: it will always decode any bytes (though the result may not be useful).
If the server doesn't serve a content-type header you can try looking for a <meta> tag that specifies the encoding in the HTML. Or pass the response bytes to Beautiful Soup and let it try to guess the encoding.
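As a side note, in Python 3 the response headers are an email.message.Message, which can parse the charset parameter for you; a sketch (`charset_of` is a name I made up):

```python
from email.message import Message
from urllib.request import urlopen

def charset_of(headers, default='latin-1'):
    # get_content_charset() handles quoted values like charset="utf-8"
    # and returns None when no charset parameter is present.
    return headers.get_content_charset() or default

def get_page(url):
    with urlopen(url) as response:
        return response.read().decode(charset_of(response.headers), errors='replace')
```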

python string encoding unicode

I'm using python 2.7 and I have some problems converting chars like "ä" to "ae".
I'm retrieving the content of a webpage using:
req = urllib2.Request(url + str(questionID))
response = urllib2.urlopen(req)
data = response.read()
After that I'm doing some extraction stuff and there is my problem.
extractedStr = pageContent[start:end]  # this string contains the "ä"!
extractedStr = extractedStr.decode("utf8")  # here I get the error; tried it with encode as well
extractedStr = extractedStr.replace(u"ä", "ae")
--> 'utf8' codec can't decode byte 0xe4 in position 13: invalid continuation byte
But: my simple trial is working fine...:
someStr = "geräusch"
someStr = someStr.decode("utf8")
someStr = someStr.replace(u"ä", "ae")
I've got the feeling, it has something to do with WHEN I try to use the .decode() function... I tried it at several positions, no success :(
Use .decode("latin-1") instead. That is what you are trying to decode.
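A quick demonstration of why latin-1 works here, written with Python 3 bytes literals for clarity: 0xE4 is "ä" in Latin-1, but in UTF-8 that byte would have to start a multi-byte sequence, so decoding it as UTF-8 fails.

```python
raw = b'ger\xe4usch'             # "geräusch" as Latin-1 bytes

try:
    raw.decode('utf8')
except UnicodeDecodeError as exc:
    print(exc)                   # invalid continuation byte, as in the question

text = raw.decode('latin-1')     # works: every byte maps to a character
print(text.replace('\xe4', 'ae'))   # geraeusch
```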

Python script receiving a UnicodeEncodeError: 'ascii' codec can't encode character

I have a simple Python script that pulls posts from reddit and posts them on Twitter. Unfortunately, tonight it began having issues that I'm assuming are because of someone's title on reddit having a formatting issue. The error that I'm receiving is:
File "redditbot.py", line 82, in <module>
main()
File "redditbot.py", line 64, in main
tweeter(post_dict, post_ids)
File "redditbot.py", line 74, in tweeter
print post+" "+post_dict[post]+" #python"
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 34: ordinal not in range(128)
And here is my script:
# encoding=utf8
import praw
import json
import requests
import tweepy
import time
import urllib2
import sys

reload(sys)
sys.setdefaultencoding('utf8')

access_token = 'hidden'
access_token_secret = 'hidden'
consumer_key = 'hidden'
consumer_secret = 'hidden'

def strip_title(title):
    if len(title) < 75:
        return title
    else:
        return title[:74] + "..."

def tweet_creator(subreddit_info):
    post_dict = {}
    post_ids = []
    print "[bot] Getting posts from Reddit"
    for submission in subreddit_info.get_hot(limit=2000):
        post_dict[strip_title(submission.title)] = submission.url
        post_ids.append(submission.id)
    print "[bot] Generating short link using goo.gl"
    mini_post_dict = {}
    for post in post_dict:
        post_title = post
        post_link = post_dict[post]
        mini_post_dict[post_title] = post_link
    return mini_post_dict, post_ids

def setup_connection_reddit(subreddit):
    print "[bot] setting up connection with Reddit"
    r = praw.Reddit('PythonReddit PyReTw'
                    'monitoring %s' % (subreddit))
    subreddit = r.get_subreddit('python')
    return subreddit

def duplicate_check(id):
    found = 0
    with open('posted_posts.txt', 'r') as file:
        for line in file:
            if id in line:
                found = 1
    return found

def add_id_to_file(id):
    with open('posted_posts.txt', 'a') as file:
        file.write(str(id) + "\n")

def main():
    subreddit = setup_connection_reddit('python')
    post_dict, post_ids = tweet_creator(subreddit)
    tweeter(post_dict, post_ids)

def tweeter(post_dict, post_ids):
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    for post, post_id in zip(post_dict, post_ids):
        found = duplicate_check(post_id)
        if found == 0:
            print "[bot] Posting this link on twitter"
            print post+" "+post_dict[post]+" #python"
            api.update_status(post+" "+post_dict[post]+" #python")
            add_id_to_file(post_id)
            time.sleep(3000)
        else:
            print "[bot] Already posted"

if __name__ == '__main__':
    main()
Any help would be very much appreciated - thanks in advance!
Consider this simple program:
print(u'\u201c' + "python")
If you try printing to a terminal (with an appropriate character encoding), you get
“python
However, if you try redirecting output to a file, you get a UnicodeEncodeError.
script.py > /tmp/out
Traceback (most recent call last):
File "/home/unutbu/pybin/script.py", line 4, in <module>
print(u'\u201c' + "python")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)
When you print to a terminal, Python uses the terminal's character encoding to encode unicode. (Terminals can only print bytes, so unicode must be encoded in order to be printed.)
When you redirect output to a file, Python cannot determine the character encoding, since files have no declared encoding. So by default Python 2 implicitly encodes all unicode using the ascii codec before writing to the file. Since u'\u201c' cannot be ascii-encoded, a UnicodeEncodeError is raised. (Only the first 128 unicode code points can be encoded with ascii.)
This issue is explained in detail in the Why Print Fails wiki.
To fix the problem, first, avoid adding unicode and byte strings. This causes implicit conversion using the ascii codec in Python2, and an exception in Python3. To future-proof your code, it is better to be explicit. For example, encode post explicitly before formatting and printing the bytes:
post = post.encode('utf-8')
print('{} {} #python'.format(post, post_dict[post]))
You are trying to print a unicode string to your terminal (or possibly a file by IO redirection), but the encoding used by your terminal (or file system) is ASCII. Because of this Python attempts to convert it from the unicode representation to ASCII, but fails because codepoint u'\u201c' (“) can not be represented in ASCII. Effectively your code is doing this:
>>> print u'\u201c'.encode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)
You could try converting to UTF-8:
print (post + " " + post_dict[post] + " #python").encode('utf8')
or convert to ASCII like this:
print (post + " " + post_dict[post] + " #python").encode('ascii', 'replace')
which will replace invalid ASCII characters with ?.
Another way, which is useful if you are printing for debugging purposes, is to print the repr of the string:
print repr(post + " " + post_dict[post] + " #python")
which would output something like this:
>>> s = u'string with \u201cLEFT DOUBLE QUOTATION MARK\u201c'
>>> print repr(s)
u'string with \u201cLEFT DOUBLE QUOTATION MARK\u201c'
The problem likely arises from mixing bytestrings and unicode strings on concatenation. As an alternative to prefixing all string literals with u, maybe
from __future__ import unicode_literals
fixes things for you. See here for a deeper explanation and to decide whether it's an option for you or not.
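A minimal illustration of what the import changes (under Python 2; on Python 3 the import is accepted but is a no-op, since string literals are already unicode):

```python
from __future__ import unicode_literals  # must appear before any other code

s = '\u201cquoted\u201d'   # a unicode literal even without the u prefix
print(type(s))             # <type 'unicode'> on Python 2, <class 'str'> on Python 3
print(s)
```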

'charmap' codec can't encode character '\xae' While Scraping a Webpage

I am web-scraping with Python using BeautifulSoup
I am getting this error
'charmap' codec can't encode character '\xae' in position 69: character maps to <undefined>
when scraping a webpage
This is my Python
hotel = BeautifulSoup(state.)
print (hotel.select("div.details.cf span.hotel-name a"))
# Tried: print (hotel.select("div.details.cf span.hotel-name a")).encode('utf-8')
We usually encounter this problem here when we are trying to .encode() an already encoded byte string. So you might try to decode it first as in
html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")
As an example:
html = '\xae'
encoded_str = html.encode("utf8")
Fails with
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)
While:
html = '\xae'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")
print encoded_str
®
Succeeds without error. Do note that "windows-1252" is just something I used as an example; I got it from chardet, which reported 0.5 confidence that it is right (well, given a 1-character string, what do you expect?). You should change it to the encoding of the byte string returned from .urlopen().read(), i.e. whatever applies to the content you retrieved.

UnicodeEncodeError when fetching URLs

I am using urlfetch to fetch a URL. When I try to send it to html2text function (strips off all HTML tags), I get the following message:
UnicodeEncodeError: 'charmap' codec can't encode characters in position ... character maps to <undefined>
I've been trying to process encode('UTF-8','ignore') on the string but I keep getting this error.
Any ideas?
Thanks,
Joel
Some Code:
result = urlfetch.fetch(url="http://www.google.com")
html2text(result.content.encode('utf-8', 'ignore'))
And the error message:
File "C:\Python26\lib\encodings\cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 159-165: character maps to <undefined>
You need to decode the data you fetched first! With which codec? That depends on the website you fetch.
When you already have unicode and encode it with some_unicode.encode('utf-8', 'ignore'), I can't imagine how it could throw an error.
Ok what you need to do:
result = fetch('http://google.com')
content_type = result.headers['Content-Type'] # figure out what you just fetched
ctype, charset = content_type.split(';')
encoding = charset[len(' charset='):] # get the encoding
print encoding # ie ISO-8859-1
utext = result.content.decode(encoding) # now you have unicode
text = utext.encode('utf8', 'ignore') # encode to utf8
This is not really robust but it should show you the way.
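For reference, a slightly more defensive version of that header parsing (a sketch; `encoding_from_content_type` is a name I made up), which survives a missing charset parameter and quoted values:

```python
def encoding_from_content_type(content_type, default='latin-1'):
    """Pull the charset parameter out of a Content-Type header value."""
    for part in content_type.split(';')[1:]:
        key, _, value = part.strip().partition('=')
        if key.lower() == 'charset':
            return value.strip('"\'') or default
    return default

print(encoding_from_content_type('text/html; charset=ISO-8859-1'))  # ISO-8859-1
print(encoding_from_content_type('text/html'))                      # latin-1
```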
