Python 2.7, issue with decode('utf-8') - python

I have:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from urllib2 import urlopen
page2 = urlopen('http://pogoda.yandex.ru/moscow/').read().decode('utf-8')
page = urlopen('http://yasko.by/').read().decode('utf-8')
The "page = ..." line raises "UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 32: invalid continuation byte", but the "page2 = ..." line does not. Why?
The Cyrillic characters on yasko.by start at position 32. How do I decode them correctly?
Thanks!

The content of http://yasko.by/ is encoded with windows-1251, while the content of http://pogoda.yandex.ru/moscow/ is encoded with utf-8.
The page = ... line should become:
page = urlopen('http://yasko.by/').read().decode('windows-1251')
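The same failure can be reproduced without a network connection. The following is a minimal Python 3 sketch, not the answerer's code: decode_with_fallback is a hypothetical helper, and trying UTF-8 first and windows-1251 second is a heuristic assumed for this example. When the server declares a charset, that declaration is the better source of truth.

```python
def decode_with_fallback(raw, encodings=("utf-8", "windows-1251")):
    """Return (text, encoding) for the first candidate codec that decodes raw."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue  # this codec rejected the bytes, try the next candidate
    raise ValueError("none of the candidate encodings matched")


# Sample bytes standing in for the two pages (assumed content, not fetched).
cyrillic_bytes = "Привет".encode("windows-1251")
utf8_bytes = "Привет".encode("utf-8")

text_a, used_a = decode_with_fallback(cyrillic_bytes)
text_b, used_b = decode_with_fallback(utf8_bytes)
print(used_a, text_a)
print(used_b, text_b)
```

The windows-1251 bytes are rejected by the UTF-8 codec for the same reason as in the question: a lead byte is followed by something that is not a valid continuation byte.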

Related

python3 decode byte string encoded in base64

I know how to decode a base64 string and how to decode a UTF-8 string. The string I have has some other kind of problem.
A web app sends a parameter value as a URL-encoded, Base64-encoded string (with some further encoding underneath) that I'm struggling to decode to a string.
Here's the original value: I8PhvhSPMM23ie9C0mNJoA%3D%3D
With the URL encoding removed, it looks like normal base64: I8PhvhSPMM23ie9C0mNJoA==
When I try to decode it, I get stuck. Normal base64 decoding looks like this:
>>> import base64
>>> _string = "I8PhvhSPMM23ie9C0mNJoA=="
>>> _decoded = base64.b64decode(_string)
>>> _decoded
b'#\xc3\xe1\xbe\x14\x8f0\xcd\xb7\x89\xefB\xd2cI\xa0'
When I run this decoded byte string through .decode() I get the below. I tried UTF-8 and Latin-1.
>>> print(_decoded.decode('utf-8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 1: invalid continuation byte
>>> print(_decoded.decode('latin-1'))
#Ãá¾0Í·ïBÒcI 
>>>
I'm trying to decode the value.
Any help would be appreciated.
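The thread does not include an answer to this question. One possible reading, offered here only as an assumption, is that the 16 decoded bytes are binary data (for example a hash digest or an encrypted block) rather than encoded text, in which case no text codec will produce a meaningful string. Under that assumption, a minimal sketch that undoes the URL encoding and shows the bytes as hex:

```python
import base64
from urllib.parse import unquote

original = "I8PhvhSPMM23ie9C0mNJoA%3D%3D"

# Step 1: undo the URL (percent) encoding, which turns %3D back into "=".
b64_text = unquote(original)

# Step 2: undo the Base64 encoding, which yields raw bytes.
raw = base64.b64decode(b64_text)

# Assumption: the bytes are binary, so show them as hex instead of decoding as text.
hex_view = raw.hex()
print(len(raw), hex_view)
```

Latin-1 "succeeds" only because it maps every byte to some character, so its output says nothing about whether the data was ever text.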

How can I make a request including "à" character in its URL in Python?

I use the requests module in Python to fetch a web page. However, I found that if the URL includes the character à, it raises a UnicodeDecodeError:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 27: invalid continuation byte
Strangely, this only happens if I also add a space in the URL. So for example, the following does not issue an error.
requests.get("http://myurl.com/àieou")
However, the following does:
requests.get("http://myurl.com/àienah aie")
Why does it happen and how can I make the request correctly?
Use the urllib library to percent-encode the characters (Python 2):
import urllib
requests.get("http://myurl.com/" + urllib.quote_plus("àieou"))
In Python 3, use quote_plus() from urllib.parse:
from urllib.parse import quote_plus
requests.get("http://myurl.com/" + quote_plus("àienah aie"))
You can try to url encode your value:
requests.get("http://myurl.com/%C3%A0ieou")
The value for à is %C3%A0 once encoded.
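A minimal Python 3 sketch of what the two standard-library helpers produce for the URL in the question. Whether the server expects %20 or + for the space is an assumption to check against the target site: quote() is generally the one meant for path segments, while quote_plus() targets query-string values.

```python
from urllib.parse import quote, quote_plus

base = "http://myurl.com/"
segment = "àienah aie"

# quote() percent-encodes the space as %20, the usual form inside a path.
path_url = base + quote(segment)

# quote_plus() encodes the space as "+", the form used in query strings.
plus_url = base + quote_plus(segment)

print(path_url)
print(plus_url)
```

Either resulting string is pure ASCII, so it can be passed to requests.get() without any further encoding step.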

hi § symbol unrecognized

Good morning.
I'm trying to do this, but it won't let me.
Can you help me?
Thank you very much.
soup = BeautifulSoup(html_page)
titulo=soup.find('h3').get_text()
titulo=titulo.replace('§','')
titulo=titulo.replace('§','')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Declare the source encoding and work with unicode strings:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html_page = u"<h3>§ title here</h3>"
soup = BeautifulSoup(html_page, "html.parser")
titulo = soup.find('h3').get_text()
titulo = titulo.replace(u'§', '')
print(titulo)
Prints " title here" (with a leading space left over from the removed symbol).
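The byte 0xc2 in the error message is the first byte of § in UTF-8, which suggests a byte string was being mixed with text. A minimal Python 3 sketch of the same idea without BeautifulSoup: the sample bytes are an assumption standing in for the scraped heading, and the point is to decode to text once, then do all replacements on text.

```python
# Assumed sample: the heading as raw UTF-8 bytes, as it might arrive from a socket or file.
raw = "§ title here".encode("utf-8")

# The first byte is 0xc2, matching the byte reported in the error message.
first_byte = raw[0]

# Decode once at the boundary, then operate on text only.
text = raw.decode("utf-8")
cleaned = text.replace("§", "").strip()

print(hex(first_byte), repr(cleaned))
```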
Here is what the problem is:
By default, Python 2 assumes source files are ASCII, so it rejects non-ASCII characters such as "à" or "ò" in the source. To allow them, put this at the top of your script:
# -*- coding: utf-8 -*-
This declaration tells Python that the source file is encoded as UTF-8.
Another approach (Python 2 only, and widely discouraged because it changes implicit conversions for the whole process) is to change the default encoding through the sys module:
# sys.setdefaultencoding() does not exist, here!
import sys
reload(sys) #This reloads the sys module
sys.setdefaultencoding('UTF8') #Here you choose the encoding

'charmap' codec can't encode character '\xae' While Scraping a Webpage

I am web-scraping with Python using BeautifulSoup.
I am getting this error when scraping a webpage:
'charmap' codec can't encode character '\xae' in position 69: character maps to <undefined>
This is my Python code:
hotel = BeautifulSoup(state.)
print (hotel.select("div.details.cf span.hotel-name a"))
# Tried: print (hotel.select("div.details.cf span.hotel-name a")).encode('utf-8')
We usually encounter this problem when trying to .encode() an already encoded byte string. So you might try decoding it first, as in:
html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")
As an example:
html = '\xae'
encoded_str = html.encode("utf8")
Fails with
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)
While:
html = '\xae'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")
print encoded_str
®
This succeeds without error. Note that "windows-1252" is only an example: I got it from chardet, which reported just 0.5 confidence (unsurprising for a one-byte string). Replace it with the actual encoding of the byte string returned by .urlopen().read().
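In Python 3 terms, the 'charmap' error typically appears when text is encoded for an output whose code page has no slot for the character. A minimal sketch, assuming a cp437 console (the code page is a guess, since the question does not state it) and a made-up hotel name:

```python
# Decode bytes to text first; windows-1252 is the example encoding used above.
text = b"Hotel\xae".decode("windows-1252")

# cp437 (assumed console code page) has no registered-sign character, so a
# strict encode raises the same kind of 'charmap' UnicodeEncodeError.
try:
    text.encode("cp437")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Two ways out: substitute unsupported characters, or emit UTF-8 bytes.
replaced = text.encode("cp437", errors="replace")
as_utf8 = text.encode("utf-8")

print(strict_failed, replaced, as_utf8)
```

Substituting with errors="replace" loses the character, so writing UTF-8 bytes to a file or a UTF-8 capable terminal is usually the less lossy option.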

UnicodeEncodeError when fetching URLs

I am using urlfetch to fetch a URL. When I try to send it to html2text function (strips off all HTML tags), I get the following message:
UnicodeEncodeError: 'charmap' codec can't encode characters in position ... character maps to <undefined>
I've tried calling encode('UTF-8', 'ignore') on the string, but I keep getting this error.
Any ideas?
Thanks,
Joel
Some Code:
result = urlfetch.fetch(url="http://www.google.com")
html2text(result.content.encode('utf-8', 'ignore'))
And the error message:
File "C:\Python26\lib\encodings\cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 159-165: character maps to <undefined>
You need to decode the data you fetched first. With which codec? That depends on the website you fetch.
Once you have unicode, I can't imagine how encoding it with some_unicode.encode('utf-8', 'ignore') could throw an error.
Here is what you need to do:
result = fetch('http://google.com')
content_type = result.headers['Content-Type'] # figure out what you just fetched
ctype, charset = content_type.split(';')
encoding = charset[len(' charset='):] # get the encoding
print encoding # ie ISO-8859-1
utext = result.content.decode(encoding) # now you have unicode
text = utext.encode('utf8', 'ignore') # encode to UTF-8
This is not very robust, but it should show you the way.
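The header parsing above breaks when the Content-Type has no charset or carries extra parameters. A somewhat more tolerant Python 3 sketch uses the standard library's email.message.Message to parse the header. The helper name charset_from_content_type and the UTF-8 default are assumptions made for this example, and the sample body is made up rather than fetched.

```python
from email.message import Message


def charset_from_content_type(value, default="utf-8"):
    """Return the charset parameter of a Content-Type header, or a default."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default


declared = charset_from_content_type("text/html; charset=ISO-8859-1")
missing = charset_from_content_type("text/html")

# Assumed sample body: "café" encoded as ISO-8859-1.
body = b"caf\xe9"
utext = body.decode(declared)          # bytes -> text
text = utext.encode("utf-8")           # text -> UTF-8 bytes

print(declared, missing, utext, text)
```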
