Python: difficulty converting ASCII to Unicode

My goal: get the page source from a url and count all instances of a keyword within that page source
How I am doing it: getting the pagesource via urllib2, looping through each char of the page source and comparing it to the keyword
My problem: my keyword is encoded in utf-8 while the page source is in ascii... I am running into errors whenever I try conversions.
getting the page source:
import urllib2
response = urllib2.urlopen(myUrl)
return response.read()
comparing page source and keyword:
pageSource[i] == keyWord[j]
I need to convert one of these strings to the other's encoding. Intuitively I felt that ascii (the page source) to utf-8 (the key word) would be the best and easiest, so:
pageSource = unicode(pageSource)
UnicodeDecodeError: 'ascii' codec can't decode byte __ in position __: ordinal not in range(128)

When trying to work with text, don't leave your data as byte strings. Decode to Unicode early, encode back to bytes as late as possible.
Decode your downloaded network data:
import urllib2
response = urllib2.urlopen(myUrl)
# Latin-1 is the default for HTTP text/ responses, adjust as needed
# getparam() returns None when the header carries no charset, so fall back explicitly
codec = response.info().getparam('charset') or 'latin1'
return response.read().decode(codec)
and do the same for your keyWord data. If it is encoded as UTF-8, decode it as such, or use Unicode string literals.
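As a rough Python 3 sketch of the same idea (the snippet above is Python 2), the charset can be pulled from a Content-Type header value and used to decode the body before counting matches. The helper names charset_from_content_type and count_keyword are made up for illustration, and the Latin-1 fallback simply mirrors the comment above; it is an assumption, not something the server guarantees.

```python
from email.message import Message


def charset_from_content_type(content_type, default="latin-1"):
    """Return the charset parameter of a Content-Type value, or the default."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


def count_keyword(raw_bytes, content_type, keyword):
    """Decode the raw body first, then count occurrences of a str keyword."""
    text = raw_bytes.decode(charset_from_content_type(content_type))
    return text.count(keyword)


# Hypothetical body and header standing in for a real HTTP response.
body = "5 € or 10 €".encode("utf-8")
print(count_keyword(body, "text/html; charset=UTF-8", "€"))
```

Once both sides are text, str.count replaces the character-by-character loop from the question.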
You may want to read up on Python and Unicode:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
The Python Unicode HOWTO

I'll assume your remote "source page" contains more than just ASCII; otherwise your comparison would already work as is, because ASCII is a subset of UTF-8 (for example, A is 0x41 in ASCII and the same single byte in UTF-8).
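A small Python 3 sketch of that subset relationship: ASCII characters produce the same bytes under both codecs, while anything outside ASCII becomes a multi-byte UTF-8 sequence that the ascii codec rejects. The helper is_ascii_only is a hypothetical name used only here.

```python
ascii_bytes = "A".encode("ascii")
utf8_bytes = "A".encode("utf-8")
euro_utf8 = "€".encode("utf-8")  # three bytes in UTF-8, not representable in ASCII


def is_ascii_only(data):
    """Return True if the byte string decodes cleanly as ASCII."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


print(ascii_bytes == utf8_bytes)
print(is_ascii_only(b"plain text"), is_ascii_only(euro_utf8))
```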
You may find the Python Requests library easier, as it automatically decodes remote content to Unicode strings based on the server's headers (Unicode strings are encoding-neutral, so they can be compared without worrying about encoding).
resp = requests.get("http://www.example.com/utf8page.html")
resp.text
>> u'My unicode data €'
You will then need to decode your reference data:
keyWord[j] = "€".decode("UTF-8")
keyWord[j]
>> u'€'
If you're embedding non-ASCII in your source code, you need to define the encoding you're using. For example, at the top of your source code/script:
# coding=UTF-8

Related

Decoding a VIEWSTATE string with UTF-8 in Python 3

I'm having trouble decoding an ASP.NET view state string in Python 3.
When I try decoding the string using bash's base64 command, it decodes the string successfully and I'm able to see all the information I need (most of it is in Hebrew, meaning UTF-8). The view state is of course base64-encoded only and not encrypted.
However, when I try to decode the string using Python's base64 library and then decode the byte array to a UTF-8 string, I get an error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
I should mention that since the string is a view state, the first few bytes are binary data and "0xff" makes sense, however after these bytes the data is readable.
Python 3 code segment:
b = "The_ViewState"
print(base64.b64decode(b).decode("utf-8"))
Why does decoding work in bash and not in Python? How can this be resolved?
After a little bit of research I found the answer:
b = "The_ViewState"
print(base64.b64decode(b).decode("utf-8", "ignore"))
Adding the "ignore" flag causes decode() to discard any invalid byte sequences, thus leaving the irrelevant bytes out of the decoded string.
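To make the effect of the error handler concrete, here is a minimal Python 3 sketch. The payload is a hypothetical stand-in for a view state (two binary bytes followed by Hebrew text), not real ASP.NET output. It compares "ignore" with "replace", which keeps a visible marker where bytes were dropped.

```python
import base64

# Hypothetical payload: binary prefix followed by UTF-8 encoded Hebrew text.
raw = b"\xff\x01" + "שלום".encode("utf-8")
encoded = base64.b64encode(raw).decode("ascii")

decoded = base64.b64decode(encoded)
ignored = decoded.decode("utf-8", "ignore")    # invalid bytes are silently dropped
replaced = decoded.decode("utf-8", "replace")  # invalid bytes become U+FFFD

print(repr(ignored))
print(repr(replaced))
```

Note that only the 0xff byte is invalid here; the 0x01 byte is a legal control character and survives both handlers, so "ignore" does not necessarily remove every piece of binary data.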
Another option is the viewstate package, a small Python 3.5+ library for decoding ASP.NET viewstate.
Install it first: pip install viewstate
>>> from viewstate import ViewState
>>> base64_encoded_viewstate = '/wEPBQVhYmNkZQ9nAgE='
>>> vs = ViewState(base64_encoded_viewstate)
>>> vs.decode()
('abcde', (True, 1))

python encoding error when searching a string

I get the following error while trying to search the string below
ERROR:
SyntaxError: Non-ASCII character '\xd8' in file Hadith_scraper.py on line 44, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
STRING:
دَّثَنَا عَبْدَانُ، قَالَ أَخْبَرَنَا عَبْ
CODE:
arabic_hadith = "دَّثَنَا عَبْدَانُ، قَالَ أَخْبَرَنَا عَبْ"
arabic_hadith.encode('utf8')
print arabic_hadith
if "الجمعة" in arabic_hadith:
    day = "5"
else:
    day = ""
You have a byte string, not a unicode value. Trying to encode a byte string in Python 2 means that Python will first try to decode it to unicode so that it can then encode.
Use unicode values here instead, and make sure you set the codec at the top of the file first. See PEP 263 - Defining Python Source Code Encodings (which your error message pointed you to).
Note that there is no need to encode to UTF-8 here; that would only complicate text comparisons:
# encoding: utf8
arabic_hadith = u"دَّثَنَا عَبْدَانُ، قَالَ أَخْبَرَنَا عَبْ"
print arabic_hadith
if u"الجمعة" in arabic_hadith:
    day = "5"
else:
    day = ""
Rule of thumb: decode bytes from incoming sources (files, network data) to Unicode, process only Unicode in your program, and only encode again for any outgoing data.
I urge you to read:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
before you continue.
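The rule of thumb above (decode on the way in, work with text, encode on the way out) might look like this in Python 3. The function name day_for_text and the UTF-8 default are assumptions made for the sketch.

```python
def day_for_text(raw_bytes, encoding="utf-8"):
    """Decode incoming bytes, compare as text, and encode only the result."""
    text = raw_bytes.decode(encoding)        # bytes -> text at the boundary
    day = "5" if "الجمعة" in text else ""    # all comparisons happen on text
    return day.encode(encoding)              # text -> bytes for outgoing data


# Hypothetical incoming data, e.g. read from a file or a socket.
incoming = "خطبة الجمعة".encode("utf-8")
print(day_for_text(incoming))
```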

Send a non-ASCII POST request in Python?

I'm trying to send a POST request to a web app. I'm using the mechanize module (itself a wrapper of urllib2). When I try to send the POST request, I get UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 0: ordinal not in range(128). I tried unicode(string), unicode(string, encoding="utf-8"), unicode(string).encode() and so on, but nothing worked: each attempt either returned the error above or raised TypeError: decoding Unicode is not supported
I looked at the other SO answers to similar questions, but none helped.
Thanks in advance!
EDIT: Example that produces an error:
prda = "šđćč" #valid UTF-8 characters
prda # typing in python shell
'\xc5\xa1\xc4\x91\xc4\x87\xc4\x8d'
print prda # in shell
šđćč
prda.encode("utf-8") #in shell
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 0: ordinal not in range(128)
unicode(prda)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 0: ordinal not in range(128)
I assume you're using Python 2.x.
Given a unicode object:
myUnicode = u'\u4f60\u597d'
encode it using utf-8:
mystr = myUnicode.encode('utf-8')
Note that you need to specify the encoding explicitly. By default it'll (usually) use ascii.
In your example, you use a non-unicode string literal containing non-ASCII characters, which results in prda becoming a byte string.
In the interactive shell, Python uses sys.stdin.encoding to encode the characters you type. In your case, this means the string is encoded as UTF-8.
To convert prda to a unicode object, you need to decode it using the appropriate encoding:
>>> print prda.decode('utf-8')
šđćč
Note that, in a script or module, you cannot rely on Python to guess the encoding automatically. You need to declare the encoding explicitly at the top of the file, like this:
# -*- coding: utf-8 -*-
Whenever you encounter unicode errors in Python 2, it is very often because your code is mixing bytes strings with unicode strings. So you should always check what kind of string is causing the error, by using type(string).
If the string object is <type 'str'>, but you need unicode, decode it using the appropriate encoding. If the string object is <type 'unicode'>, but you need bytes, encode it using the appropriate encoding.
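The same check-then-convert habit carries over to Python 3, where the two types are bytes and str. The helpers below are a sketch with hypothetical names (to_text, to_bytes), assuming UTF-8 unless told otherwise.

```python
def to_text(value, encoding="utf-8"):
    """Return str: decode bytes, pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Return bytes: encode text, pass bytes through unchanged."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


prda = b"\xc5\xa1\xc4\x91\xc4\x87\xc4\x8d"  # the byte string from the question
print(type(prda).__name__, "->", type(to_text(prda)).__name__)
```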
You don't need to wrap your chars in unicode calls, because they're already encoded :) if anything, you need to DE-code it to get a unicode object:
>>> s = '\xc5\xa1\xc4\x91\xc4\x87\xc4\x8d' # your string
>>> s.decode('utf-8')
u'\u0161\u0111\u0107\u010d'
>>> type(s.decode('utf-8'))
<type 'unicode'>
I don't know mechanize so I don't know exactly whether it handles it correctly or not, I'm afraid.
What I'd do with a regular urllib2 POST call, would be to use urlencode :
>>> from urllib import urlencode
>>> postData = urlencode({'test': s }) # note I'm NOT decoding it
>>> postData
'test=%C5%A1%C4%91%C4%87%C4%8D'
>>> urllib2.urlopen(url, postData) # etc etc etc
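For comparison, a Python 3 sketch of the same form encoding: urlencode lives in urllib.parse there and percent-encodes str values as UTF-8 by default, so the text can be passed in directly. The field name test is taken from the example above; no request is actually sent here.

```python
from urllib.parse import urlencode

# In Python 3 the value is text; urlencode encodes it as UTF-8 before quoting.
post_data = urlencode({"test": "šđćč"})

# urllib.request.urlopen expects the POST body as bytes.
body = post_data.encode("ascii")

print(post_data)
```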

Decoding not reversing unicode encoding in Django/Python

Ok, I have a hardcoded string I declare like this
name = u"Par Catégorie"
I have a # -*- coding: utf-8 -*- magic header, so I am guessing it's converted to utf-8
Down the road it's outputted to xml through
xml_output.toprettyxml(indent='....', encoding='utf-8')
And I get a
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
Most of my data is in French and is outputted correctly in CDATA nodes, but that one hardcoded string keep ... I don't see why an ascii codec is called.
what's wrong ?
The coding header in your source file tells Python what encoding your source is in. It's the encoding Python uses to decode the source of the unicode string literal (u"Par Catégorie") into a unicode object. The unicode object itself has no encoding; it's raw unicode data. (Internally, Python will use one of two encodings, depending on how it was configured, but Python code shouldn't worry about that.)
The UnicodeDecodeError you get means that somewhere, you are mixing unicode strings and bytestrings (normal strings.) When mixing them together (concatenating, performing string interpolation, et cetera) Python will try to convert the bytestring into a unicode string by decoding the bytestring using the default encoding, ASCII. If the bytestring contains non-ASCII data, this will fail with the error you see. The operation being done may be in a library somewhere, but it still means you're mixing inputs of different types.
Unfortunately the fact that it'll work just fine as long as the bytestrings contain just ASCII data means this type of error is all too frequent even in library code. Python 3.x solves that problem by getting rid of the implicit conversion between unicode strings (just str in 3.x) and bytestrings (the bytes type in 3.x.)
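A short Python 3 sketch of that difference: mixing text and bytes fails immediately with a TypeError instead of triggering a hidden ASCII decode, so the mismatch shows up even when the data is pure ASCII.

```python
name = "Par Catégorie"          # text (str in Python 3)
encoded = name.encode("utf-8")  # bytes

error = None
try:
    combined = name + encoded   # no implicit conversion is attempted
except TypeError as exc:
    error = exc

# The explicit version states which side gets converted, and how.
combined = name + encoded.decode("utf-8")
print(type(error).__name__, combined)
```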
Wrong parameter name? From the doc, I can see the keyword argument name is supposed to be encoding and not coding.

Convert Unicode to ASCII without errors in Python

My code just scrapes a web page, then converts it to Unicode.
html = urllib.urlopen(link).read()
html.encode("utf8","ignore")
self.response.out.write(html)
But I get a UnicodeDecodeError:
Traceback (most recent call last):
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 507, in __call__
    handler.get(*groups)
  File "/Users/greg/clounce/main.py", line 55, in get
    html.encode("utf8","ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)
I assume that means the HTML contains some wrongly-formed attempt at Unicode somewhere. Can I just drop whatever code bytes are causing the problem instead of getting an error?
>>> u'aあä'.encode('ascii', 'ignore')
'a'
Decode the string you get back, using the charset from either the appropriate meta tag in the response or the Content-Type header, then encode.
The method encode(encoding, errors) accepts the name of an error handler. Besides ignore, the built-in handlers include:
>>> u'aあä'.encode('ascii', 'replace')
b'a??'
>>> u'aあä'.encode('ascii', 'xmlcharrefreplace')
b'a&#12354;&#228;'
>>> u'aあä'.encode('ascii', 'backslashreplace')
b'a\\u3042\\xe4'
See https://docs.python.org/3/library/stdtypes.html#str.encode
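The handlers can be compared side by side in Python 3 with a short loop. This is only a sketch that collects the results in a dictionary.

```python
text = "aあä"
handlers = ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")

# Encode the same text to ASCII once per error handler.
results = {name: text.encode("ascii", name) for name in handlers}

for name, encoded in results.items():
    print(name, encoded)
```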
As an extension to Ignacio Vazquez-Abrams' answer
>>> u'aあä'.encode('ascii', 'ignore')
'a'
It is sometimes desirable to remove accents from characters and print the base form. This can be accomplished with
>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'aあä').encode('ascii', 'ignore')
'aa'
You may also want to translate other characters (such as punctuation) to their nearest equivalents. For instance, the RIGHT SINGLE QUOTATION MARK Unicode character does not get converted to an ASCII APOSTROPHE when encoding.
>>> print u'\u2019'
’
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
>>> u'\u2019'.encode('ascii', 'ignore')
''
# Note we get an empty string back
>>> u'\u2019'.replace(u'\u2019', u'\'').encode('ascii', 'ignore')
"'"
There are more efficient ways to accomplish this, though. For more details, see the question "Where is Python's "best ASCII for this Unicode" database?"
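One way to combine the two steps above in Python 3 is a translation table for punctuation followed by NFKD normalization. The table below is a small hand-picked assumption, not a complete mapping, and to_ascii is a hypothetical helper name.

```python
import unicodedata

# Hand-picked replacements for common typographic punctuation (not exhaustive).
PUNCTUATION = {
    0x2018: "'", 0x2019: "'",   # single quotation marks
    0x201C: '"', 0x201D: '"',   # double quotation marks
    0x2013: "-", 0x2014: "-",   # en and em dashes
}


def to_ascii(text):
    """Map known punctuation, strip accents via NFKD, drop whatever is left."""
    text = text.translate(PUNCTUATION)
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("It\u2019s a caf\xe9"))
```

Characters with no decomposition and no table entry (such as あ) are still dropped silently, so this is lossy by design.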
2018 Update:
As of February 2018, using compressions like gzip has become quite popular (around 73% of all websites use it, including large sites like Google, YouTube, Yahoo, Wikipedia, Reddit, Stack Overflow and Stack Exchange Network sites).
If you do a simple decode like in the original answer with a gzipped response, you'll get an error like or similar to this:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: unexpected code byte
In order to decode a gzipped response you need to import the following modules (in Python 3):
import gzip
import io
Note: In Python 2 you'd use StringIO instead of io
Then you can parse the content out like this:
response = urlopen("https://example.com/gzipped-ressource")
buffer = io.BytesIO(response.read()) # Use StringIO.StringIO(response.read()) in Python 2
gzipped_file = gzip.GzipFile(fileobj=buffer)
decoded = gzipped_file.read()
content = decoded.decode("utf-8") # Replace utf-8 with the source encoding of your requested resource
This code reads the response and places the bytes in a buffer. The gzip module then reads the buffer through the GzipFile class. After that, the decompressed data can be read as bytes again and finally decoded to readable text.
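The same steps can be tried without a network connection. In this Python 3 sketch, gzip.compress stands in for the server and gzip.decompress replaces the BytesIO/GzipFile pair. The function name decode_body and the magic-byte check are assumptions for illustration; in real code the Content-Encoding response header is the more reliable signal.

```python
import gzip


def decode_body(body, content_encoding="", charset="utf-8"):
    """Decompress the body if it is gzipped, then decode it to text."""
    # 0x1f 0x8b is the gzip magic number; checking it is a heuristic fallback.
    if content_encoding.lower() == "gzip" or body[:2] == b"\x1f\x8b":
        body = gzip.decompress(body)
    return body.decode(charset)


# Hypothetical response body standing in for response.read().
compressed = gzip.compress("héllo wörld".encode("utf-8"))
print(decode_body(compressed, content_encoding="gzip"))
```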
Original Answer from 2010:
Can we get the actual value used for link?
In addition, we usually encounter this problem here when we are trying to .encode() an already encoded byte string. So you might try to decode it first as in
html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")
As an example:
html = '\xa0'
encoded_str = html.encode("utf8")
Fails with
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
While:
html = '\xa0'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")
Succeeds without error. Note that "windows-1252" is only an example: I got it from chardet, which reported 0.5 confidence that it is right (not surprising for a one-character string). Replace it with whatever encoding actually applies to the byte string returned from .urlopen().read().
Another problem I see there is that the .encode() string method returns the modified string and does not modify the source in place. So it's kind of useless to have self.response.out.write(html) as html is not the encoded string from html.encode (if that is what you were originally aiming for).
As Ignacio suggested, check the source webpage for the actual encoding of the returned string from read(). It's either in one of the Meta tags or in the ContentType header in the response. Use that then as the parameter for .decode().
Do note however that it should not be assumed that other developers are responsible enough to make sure the header and/or meta character set declarations match the actual content. (Which is a PITA, yeah, I should know, I was one of those before).
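Since the declared charset cannot always be trusted, one defensive pattern is to try it first and then fall back to a short list of likely candidates. The sketch below is Python 3; the function name and the fallback order (UTF-8, then cp1252, then Latin-1 as a last resort that never fails) are assumptions, not a recommendation from the answers above.

```python
def decode_with_fallbacks(data, declared=None, fallbacks=("utf-8", "cp1252")):
    """Try the declared charset, then the fallbacks; return (text, codec used)."""
    candidates = ([declared] if declared else []) + list(fallbacks)
    for name in candidates:
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: move on to the next candidate.
            continue
    # Latin-1 maps every byte to a character, so this cannot raise.
    return data.decode("latin-1"), "latin-1"


# The byte from the question: invalid as UTF-8, a no-break space in cp1252.
print(decode_with_fallbacks(b"\xa0", declared="utf-8"))
```

A fallback that succeeds only proves the bytes are valid in that codec, not that the guess is right, so logging which codec was used is worth considering.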
Use unidecode - it even converts weird characters to ascii instantly, and even converts Chinese to phonetic ascii.
$ pip install unidecode
then:
>>> from unidecode import unidecode
>>> unidecode(u'北京')
'Bei Jing'
>>> unidecode(u'Škoda')
'Skoda'
I use this helper function throughout all of my projects. If it can't convert the unicode, it ignores it. This ties into a django library, but with a little research you could bypass it.
from django.utils import encoding
def convert_unicode_to_string(x):
    """
    >>> convert_unicode_to_string(u'ni\xf1era')
    'niera'
    """
    return encoding.smart_str(x, encoding='ascii', errors='ignore')
I no longer get any unicode errors after using this.
For broken consoles like cmd.exe and HTML output you can always use:
my_unicode_string.encode('ascii','xmlcharrefreplace')
This will preserve all the non-ascii chars while making them printable in pure ASCII and in HTML.
WARNING: If you use this in production code to avoid errors then most likely there is something wrong in your code. The only valid use case for this is printing to a non-unicode console or easy conversion to HTML entities in an HTML context.
And finally, if you are on windows and use cmd.exe then you can type chcp 65001 to enable utf-8 output (works with Lucida Console font). You might need to add myUnicodeString.encode('utf8').
You wrote """I assume that means the HTML contains some wrongly-formed attempt at unicode somewhere."""
The HTML is NOT expected to contain any kind of "attempt at unicode", well-formed or not. It must of necessity contain Unicode characters encoded in some encoding, which is usually supplied up front ... look for "charset".
You appear to be assuming that the charset is UTF-8 ... on what grounds? The "\xA0" byte that is shown in your error message indicates that you may have a single-byte charset e.g. cp1252.
If you can't get any sense out of the declaration at the start of the HTML, try using chardet to find out what the likely encoding is.
Why have you tagged your question with "regex"?
Update after you replaced your whole question with a non-question:
html = urllib.urlopen(link).read()
# html refers to a str object. To get unicode, you need to find out
# how it is encoded, and decode it.
html.encode("utf8","ignore")
# problem 1: will fail because html is a str object;
# encode works on unicode objects so Python tries to decode it using
# 'ascii' and fails
# problem 2: even if it worked, the result will be ignored; it doesn't
# update html in situ, it returns a function result.
# problem 3: "ignore" with UTF-n: any valid unicode object
# should be encodable in UTF-n; error implies end of the world,
# don't try to ignore it. Don't just whack in "ignore" willy-nilly,
# put it in only with a comment explaining your very cogent reasons for doing so.
# "ignore" with most other encodings: error implies that you are mistaken
# in your choice of encoding -- same advice as for UTF-n :-)
# "ignore" with decode latin1 aka iso-8859-1: error implies end of the world.
# Irrespective of error or not, you are probably mistaken
# (needing e.g. cp1252 or even cp850 instead) ;-)
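Problem 2 in the comments above applies in Python 3 as well: strings are immutable, so encode returns a new object and the original is untouched unless the result is assigned. A minimal sketch:

```python
html = "caf\xe9"

html.encode("utf-8")            # result is computed and then thrown away
unchanged = html                # still the original text

encoded = html.encode("utf-8")  # keep the return value to use the encoded bytes

print(repr(unchanged), repr(encoded))
```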
If you have a string line, you can use the .encode([encoding], [errors='strict']) method for strings to convert encoding types.
line = 'my big string'
line.encode('ascii', 'ignore')
For more information about handling ASCII and unicode in Python, this is a really useful site: https://docs.python.org/2/howto/unicode.html
I think the answer is there but only in bits and pieces, which makes it difficult to quickly fix the problem such as
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)
As an example, suppose I have a file with data in the following form (containing ASCII and non-ASCII characters):
1/10/17, 21:36 - Land : Welcome ��
and we want to drop the non-ASCII characters and keep only the ASCII ones.
This code will do:
import unicodedata
fp = open(<FILENAME>)
for line in fp:
    rline = line.strip()
    rline = unicode(rline, "utf-8")
    rline = unicodedata.normalize('NFKD', rline).encode('ascii','ignore')
    if len(rline) != 0:
        print rline
and type(rline) will give you
>>> type(rline)
<type 'str'>
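A Python 3 version of the same loop might look like the sketch below. It takes any iterable of byte lines (for example a file opened with mode "rb"), and the sample data is hypothetical. Unlike the code above, it strips whitespace after filtering, so no trailing space is left behind when a line ends in a removed character.

```python
import io
import unicodedata


def ascii_lines(byte_lines, encoding="utf-8"):
    """Yield each non-empty line reduced to its ASCII characters."""
    for raw in byte_lines:
        text = unicodedata.normalize("NFKD", raw.decode(encoding))
        cleaned = text.encode("ascii", "ignore").decode("ascii").strip()
        if cleaned:
            yield cleaned


# Hypothetical file contents; with a real file use open(path, "rb") instead.
sample = io.BytesIO("Land : Welcome \u263a\n\nna\u00efve\n".encode("utf-8"))
for line in ascii_lines(sample):
    print(line)
```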
unicodestring = '\xa0'
decoded_str = unicodestring.decode("windows-1252")
encoded_str = decoded_str.encode('ascii', 'ignore')
Works for me
You can use the following piece of code as an example to avoid Unicode to ASCII errors:
from anyascii import anyascii
content = "Base Rent for – CC# 2100 Acct# 8410: $41,667.00 – PO – Lines - for Feb to Dec to receive monthly"
content = anyascii(content)
print(content)
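It looks like you are using Python 2.x.
Python 2.x falls back to ASCII whenever it has to convert between byte strings and Unicode implicitly, hence the exception.
If the non-ASCII characters appear in your source file, also add the line below right after the shebang so that Python knows the file's encoding:
# -*- coding: utf-8 -*-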
