I have:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from urllib2 import urlopen
page2 = urlopen('http://pogoda.yandex.ru/moscow/').read().decode('utf-8')
page = urlopen('http://yasko.by/').read().decode('utf-8')
The line "page = ..." raises "UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 32: invalid continuation byte", but the line "page2 = ..." does not. Why?
The Cyrillic characters on yasko.by start at position 32. How do I decode the page correctly?
Thanks!
The content of http://yasko.by/ is encoded with windows-1251, while the content of http://pogoda.yandex.ru/moscow/ is encoded with utf-8.
The "page = ..." line should become:
page = urlopen('http://yasko.by/').read().decode('windows-1251')
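As a rough illustration of why one decode succeeds and the other fails, the Python 3 sketch below encodes a sample Cyrillic string as windows-1251 and tries both codecs in turn. The sample string stands in for the page content and the helper name decode_with_fallback is hypothetical; neither comes from the original post.

```python
def decode_with_fallback(raw, encodings=("utf-8", "windows-1251")):
    """Return (text, encoding) for the first codec that decodes raw cleanly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


# Sample bytes standing in for a windows-1251 page (assumed content).
raw = "Привет".encode("windows-1251")
text, used = decode_with_fallback(raw)
print(used, text)
```

Trying UTF-8 first matters here: windows-1251 accepts almost any byte sequence, so it is only useful as a later candidate. When the server or the page declares its charset, that declaration is a better guide than guessing.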
Related
I know how to decode a base64 string and decode a UTF-8 string. The string I have has some other type of problem.
I have a web app sending a parameter value as a URL-encoded, Base64-encoded string with some further encoding underneath, and I'm struggling to decode it to a string.
Here's the original value: I8PhvhSPMM23ie9C0mNJoA%3D%3D
With the URL encoding removed, it looks like normal Base64: I8PhvhSPMM23ie9C0mNJoA==
When I try to decode it, I get stuck. Normal Base64 decoding looks like this:
>>> _string = "I8PhvhSPMM23ie9C0mNJoA=="
>>> _decoded = base64.b64decode(_string)
>>> _decoded
b'#\xc3\xe1\xbe\x14\x8f0\xcd\xb7\x89\xefB\xd2cI\xa0'
When I run this decoded byte string through .decode() I get the below. I tried UTF-8 and Latin-1.
>>> print(_decoded.decode('utf-8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 1: invalid continuation byte
>>> print(_decoded.decode('latin-1'))
#Ãá¾0Í·ïBÒcI
>>>
I'm trying to decode the value.
Any help would be appreciated.
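For reference, the two decoding steps described in the question can be sketched in Python 3 as follows. The value is the one from the question. What the resulting 16 bytes represent is not stated anywhere in the post; treating them as opaque binary data (for example a digest or ciphertext) is only an assumption, which is why the sketch prints them as hex instead of forcing a text codec on them.

```python
import base64
from urllib.parse import unquote

original = "I8PhvhSPMM23ie9C0mNJoA%3D%3D"

b64_text = unquote(original)          # undo the URL encoding
decoded = base64.b64decode(b64_text)  # raw bytes, not necessarily text

print(b64_text)
print(len(decoded))   # number of bytes in the payload
print(decoded.hex())  # hex view that works for any byte values
```

Latin-1 "succeeds" on these bytes only because it maps every byte value to some character, so its output says nothing about whether the data was text in the first place.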
I use the requests module in Python to fetch a web page. However, I found that if the URL contains the character à, it raises a UnicodeDecodeError:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 27: invalid continuation byte
Strangely, this only happens if I also add a space in the URL. So for example, the following does not issue an error.
requests.get("http://myurl.com/àieou")
However, the following does:
requests.get("http://myurl.com/àienah aie")
Why does it happen and how can I make the request correctly?
Use the urllib library to percent-encode the characters (Python 2):
import urllib
requests.get("http://myurl.com/"+urllib.quote_plus("àieou"))
Use quote_plus().
from urllib.parse import quote_plus
requests.get("http://myurl.com/" + quote_plus("àienah aie"))
You can try to url encode your value:
requests.get("http://myurl.com/%C3%A0ieou")
The value for à is %C3%A0 once encoded.
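The difference between the encoding helpers mentioned in these answers can be seen without sending a request. This is a small Python 3 sketch using the example strings from the question; the variable names are illustrative only.

```python
from urllib.parse import quote, quote_plus

path_segment = "àienah aie"

print(quote("àieou"))            # %C3%A0ieou
print(quote(path_segment))       # %C3%A0ienah%20aie
print(quote_plus(path_segment))  # %C3%A0ienah+aie

# Build the URL from the already encoded segment.
url = "http://myurl.com/" + quote(path_segment)
print(url)
```

quote_plus() turns a space into "+", which is the convention for query strings and form data. For a path segment, quote() and its "%20" is usually the safer choice, although which form a given server accepts is not something the original post confirms.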
Good morning.
I'm trying to do this, but it won't let me.
Can you help me?
Thank you very much.
soup = BeautifulSoup(html_page)
titulo=soup.find('h3').get_text()
titulo=titulo.replace('§','')
titulo=titulo.replace('§','')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Declare the source encoding and work with unicode strings:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html_page = u"<h3>§ title here</h3>"
soup = BeautifulSoup(html_page, "html.parser")
titulo = soup.find('h3').get_text()
titulo = titulo.replace(u'§', '')
print(titulo)
Prints " title here" (the space that followed § is kept).
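The byte 0xc2 in the error message is the first byte of § in UTF-8, which suggests the replace was applied to encoded bytes and not to text. A minimal Python 3 sketch of that distinction, using the sample title from the answer above (the variable names are illustrative):

```python
# "§" is U+00A7, which UTF-8 encodes as the two bytes 0xC2 0xA7.
raw = "§ title here".encode("utf-8")
print(raw[:2])

# Decode to text first, then do the replacement on the text.
title = raw.decode("utf-8").replace("§", "").strip()
print(title)
```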
Here is the problem:
By default, Python 2 treats source files as ASCII, so it does not recognize non-ASCII characters such as "à" or "ò". To make Python accept those characters, put this at the top of your script:
# -*- coding: utf-8 -*-
This declaration tells Python that the source file is UTF-8, so characters that are rejected by default are accepted.
Another method is to set the default encoding through the "sys" module:
# sys.setdefaultencoding() does not exist, here!
import sys
reload(sys) #This reloads the sys module
sys.setdefaultencoding('UTF8') #Here you choose the encoding
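The reload(sys) approach above applies to Python 2 only. As a quick check, assuming the script runs under Python 3, the default encoding is already UTF-8 and the setter no longer exists:

```python
import sys

default_encoding = sys.getdefaultencoding()
has_setter = hasattr(sys, "setdefaultencoding")

print(default_encoding)  # utf-8 on Python 3
print(has_setter)        # False on Python 3
```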
I am web scraping with Python using BeautifulSoup
and I am getting this error
'charmap' codec can't encode character '\xae' in position 69: character maps to <undefined>
when scraping a webpage.
This is my Python code:
hotel = BeautifulSoup(state.)
print (hotel.select("div.details.cf span.hotel-name a"))
# Tried: print (hotel.select("div.details.cf span.hotel-name a")).encode('utf-8')
We usually encounter this problem here when we are trying to .encode() an already encoded byte string. So you might try to decode it first as in
html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")
As an example:
html = '\xae'
encoded_str = html.encode("utf8")
Fails with
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)
While:
html = '\xae'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")
print encoded_str
®
Succeeds without error. Note that "windows-1252" is only an example: chardet suggested it with a confidence of 0.5, which is not surprising for a string that is one character long. Replace it with the actual encoding of the byte string returned by .urlopen().read() for the content you retrieved.
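The same decode-then-encode pattern looks like this in Python 3, where bytes and str are separate types. This is a sketch only: "windows-1252" is carried over from the example above as an assumed source encoding, and the final step uses ASCII as a stand-in for any target codec that lacks the character.

```python
raw = b"\xae"  # one byte, as in the example above

text = raw.decode("windows-1252")  # bytes -> str
utf8_bytes = text.encode("utf-8")  # str -> bytes
print(text, utf8_bytes)

# A codec that lacks the character raises UnicodeEncodeError unless an
# error handler such as "replace" is given.
safe = text.encode("ascii", errors="replace")
print(safe)
```

The "charmap" error from the question is the encode-side version of the same problem: the output codec has no mapping for the character, so either a wider codec or an error handler is needed.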
I am using urlfetch to fetch a URL. When I try to send it to html2text function (strips off all HTML tags), I get the following message:
UnicodeEncodeError: 'charmap' codec can't encode characters in position ... character maps to <undefined>
I've tried calling encode('UTF-8', 'ignore') on the string, but I keep getting this error.
Any ideas?
Thanks,
Joel
Some Code:
result = urlfetch.fetch(url="http://www.google.com")
html2text(result.content.encode('utf-8', 'ignore'))
And the error message:
File "C:\Python26\lib\encodings\cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 159-165: character maps to <undefined>
You need to decode the data you fetched first! With which codec? Depends on the website you fetch.
When you have unicode and encode it with some_unicode.encode('utf-8', 'ignore'), I can't imagine how it could throw an error.
OK, here is what you need to do:
result = fetch('http://google.com')
content_type = result.headers['Content-Type'] # figure out what you just fetched
ctype, charset = content_type.split(';')
encoding = charset[len(' charset='):] # get the encoding
print encoding # ie ISO-8859-1
utext = result.content.decode(encoding) # now you have unicode
text = utext.encode('utf8', 'ignore') # encode to UTF-8
This is not really robust but it should show you the way.
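One way to make the header parsing less fragile is to let the standard library parse the Content-Type value, since split(';') breaks when the charset is missing or when extra parameters are present. The Python 3 sketch below is an illustration, not the original author's code: the helper names are hypothetical and the UTF-8 default is an assumption.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # Returns the lower-cased charset, or the default when none is given.
    return msg.get_content_charset(failobj=default)


def decode_body(body, content_type):
    """Decode response bytes using the charset named in the header."""
    encoding = charset_from_content_type(content_type)
    return body.decode(encoding, errors="replace")


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))
print(decode_body(b"caf\xe9", "text/html; charset=ISO-8859-1"))
```

A header can still name a charset that Python does not know, in which case decode() raises LookupError, so real code may want to catch that as well.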