What happens when you call str() on a unicode string?

I'm wondering what happens internally when you call str() on a unicode string.
# coding: utf-8
s2 = str(u'hello')
Is s2 just the unicode byte representation of the str() arg?

It will try to encode it with your default encoding (sys.getdefaultencoding()). On my system, that's ASCII, and if there are any non-ASCII characters, it will fail:
>>> str(u'あ')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u3042' in position 0: ordinal not in range(128)
Note that this is the same error you'd get if you called encode('ascii') on it:
>>> u'あ'.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u3042' in position 0: ordinal not in range(128)
As you might imagine, str working on some arguments and failing on others makes it easy to write code that on first glance seems to work, but stops working once you throw some international characters in there. Python 3 avoids this by making the problem blatantly obvious: you can't convert Unicode to a byte string without an explicit encoding:
>>> bytes(u'あ')
TypeError: string argument without an encoding
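For comparison, here is a minimal sketch of the explicit route, assuming Python 3 (where str is the Unicode type). The variable names are only illustrative. It shows that naming the encoding works, and that ASCII still cannot represent the character:

```python
# The character あ, written as an escape so the source file encoding does not matter.
text = "\u3042"

# Naming the encoding explicitly works for any codec that can represent the text.
utf8_bytes = text.encode("utf-8")

# ASCII cannot represent the character, so this raises the same kind of error as above.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# bytes() accepts the encoding as a second argument and matches .encode().
same_result = bytes(text, "utf-8") == utf8_bytes
```

The point of the design is that the choice of encoding is always visible in the code instead of depending on a system default.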

Related

URLDecoding requests

I am trying to get the original url from requests. Here is what I have so far:
res = requests.get(...)
url = urllib.unquote(res.url).decode('utf8')
I then get an error that says:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 60-61: ordinal not in range(128)
The original url I requested is:
https://www.microsoft.com/de-at/store/movies/american-pie-pr\xc3\xa4sentiert-nackte-tatsachen/8d6kgwzl63ql
And here is what happens when I try printing:
>>> print '111', res.url
111 https://www.microsoft.com/de-at/store/movies/american-pie-pr%C3%A4sentiert-nackte-tatsachen/8d6kgwzl63ql
>>> print '222', urllib.unquote( res.url )
222 https://www.microsoft.com/de-at/store/movies/american-pie-prÃ¤sentiert-nackte-tatsachen/8d6kgwzl63ql
>>> print '333', urllib.unquote(res.url).decode('utf8')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 60-61: ordinal not in range(128)
Why is this occurring, and how would I fix this?

UnicodeEncodeError: 'ascii' codec can't encode characters
You are trying to decode a string that is Unicode already. It raises AttributeError on Python 3 (unicode string has no .decode() method there). Python 2 tries to encode the string into bytes first using sys.getdefaultencoding() ('ascii') before passing it to .decode('utf8') which leads to UnicodeEncodeError.
In short, do not call .decode() on Unicode strings, use this instead:
print urllib.unquote(res.url.encode('ascii')).decode('utf-8')
Without the .decode() call, the code prints bytes (assuming a bytestring is passed to unquote()), which may lead to mojibake if the character encoding used by your environment is not utf-8. To avoid mojibake, always print Unicode (don't print text as bytes) and do not hardcode the character encoding of your environment inside your script; that is why .decode() is necessary here.
There is a bug in urllib.unquote() if you pass it a Unicode string:
>>> print urllib.unquote(u'%C3%A4')
Ã¤
>>> print urllib.unquote('%C3%A4') # utf-8 output
ä
Pass bytestrings to unquote() on Python 2.
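If you are on Python 3 (an assumption; the code above is Python 2), the equivalent functions live in urllib.parse. The following sketch uses a shortened, illustrative piece of the URL and shows both the text-returning and the bytes-returning variant, plus how the mojibake from the question arises when the bytes are decoded with the wrong codec:

```python
from urllib.parse import unquote, unquote_to_bytes

quoted = "pr%C3%A4sentiert"  # illustrative fragment of the URL path

# unquote() decodes the percent-escaped bytes as UTF-8 by default and returns text.
text = unquote(quoted)

# unquote_to_bytes() returns the raw bytes, which you then decode explicitly.
raw = unquote_to_bytes(quoted)
decoded = raw.decode("utf-8")

# Decoding the same bytes with the wrong codec reproduces the mojibake.
mojibake = raw.decode("latin-1")
```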

Python 2.7.6 + unicode_literals - UnicodeDecodeError: 'ascii' codec can't decode byte

I'm trying to print the following unicode string but I'm receiving a UnicodeDecodeError: 'ascii' codec can't decode byte error. Can you please help form this query so it can print the unicode string properly?
>>> from __future__ import unicode_literals
>>> ts='now'
>>> free_form_request='[EXID(이엑스아이디)] 위아래 (UP&DOWN) MV'
>>> nick='me'
>>> print('{ts}: free form request {free_form_request} requested from {nick}'.format(ts=ts,free_form_request=free_form_request.encode('utf-8'),nick=nick))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xec in position 6: ordinal not in range(128)
Thank you very much in advance!

Here's what happens when you construct this string:
'{ts}: free form request {free_form_request} requested from {nick}'.format(ts=ts,free_form_request=free_form_request.encode('utf-8'),nick=nick)
1. free_form_request is encoded into a byte string using utf-8 as the encoding. This works because utf-8 can represent [EXID(이엑스아이디)] 위아래 (UP&DOWN) MV.
2. However, the format string ('{ts}: free form request {free_form_request} requested from {nick}') is a unicode string (because of the from __future__ import unicode_literals import).
3. You can't use byte strings as format arguments for a unicode string, so Python attempts to decode the byte string created in 1. to create a unicode string (which would be valid as a format argument).
4. Python attempts the decoding using the default encoding, which is ascii, and fails, because the byte string is a utf-8 byte string that includes byte values that don't make sense in ascii.
5. Python throws a UnicodeDecodeError.
Note that while the code is obviously doing something wrong here, this would actually not throw an exception on Python 3, which would instead substitute the repr of the byte string (the repr being a unicode string).
To fix your issue, just pass unicode strings to format.
That is, don't do step 1. where you encoded free_form_request as a byte string: keep it as a unicode string by removing .encode(...):
'{ts}: free form request {free_form_request} requested from {nick}'.format(
    ts=ts,
    free_form_request=free_form_request,
    nick=nick)
Note Padraic Cunningham's answer in the comments as well.
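The Python 3 behaviour mentioned above can be checked with a short sketch (assuming Python 3; the shortened template and the variable names are illustrative). Passing text keeps the characters, while passing bytes silently inserts their repr:

```python
# 위아래 (UP&DOWN) MV, written with escapes so the file encoding does not matter.
free_form_request = "\uc704\uc544\ub798 (UP&DOWN) MV"
template = "free form request {req} requested from {nick}"

# Text into a text template: the characters come through unchanged.
good = template.format(req=free_form_request, nick="me")

# Bytes into a text template: Python 3 inserts the repr of the bytes object.
bad = template.format(req=free_form_request.encode("utf-8"), nick="me")
```

Here bad contains the literal characters b'\xec\x9c\x84... rather than the Korean text, so there is no exception but the output is still not what was intended.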

Python: Unicode problems

I am getting an error at this line
logger.info(u"Data: {}".format(data))
I'm getting this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 4: ordinal not in range(128)
Before that line, I tried adding data = data.decode('utf8') and I still get the same error.
I tried data = data.encode('utf8') and it says UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 4: ordinal not in range(128)
How do I fix this? I don't know if I should encode or decode but neither works.

Use a byte string literal for the template (no u prefix) and encode unicode values explicitly:
if isinstance(data, unicode):
    data = data.encode('utf8')
logger.info("Data: {}".format(data))
The logging module needs you to pass in byte string (str) values, because these values are passed on unaltered to the formatters and handlers; otherwise, writing log messages to a file means that unicode values are encoded with the default (ASCII) codec. But you also need to pass in a bytestring value when formatting.
Passing a str value into a unicode .format() template leads to decoding errors, passing a unicode value into a str .format() template leads to encoding errors, and passing a formatted unicode value to logger.info() leads to encoding errors too.
It is better not to mix the two types; encode explicitly beforehand.
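The same "do not mix types" rule can be sketched for Python 3 (an assumption; the answer above targets Python 2). In Python 3 the logging module works with text, so the direction is reversed: normalise everything to text before it reaches the logger. The to_text helper and the logger name are hypothetical names introduced only for this example:

```python
import io
import logging


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings with the given encoding."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return str(value)


# Log into an in-memory text stream so the result can be inspected.
stream = io.StringIO()
logger = logging.getLogger("unicode_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))

data = "it\u2019s".encode("utf-8")  # bytes containing U+2019, as in the question
logger.info("Data: %s", to_text(data))
logged = stream.getvalue()
```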

You could do something such as
data.decode('utf-8').encode("ascii", errors="ignore")
This drops ("ignores") every character that ASCII cannot represent.
edit: data.encode('ascii', errors='ignore') may be enough if data is already a unicode object, but I'm not in a position to test this currently.
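As a rough illustration of what the error handlers do when encoding, here is a sketch assuming Python 3 and text input (the sample string is made up, using the U+2019 character from the error message):

```python
data = "it\u2019s"  # contains U+2019, which ASCII cannot represent

# "ignore" drops the character, "replace" substitutes a question mark,
# and "backslashreplace" keeps the information as an escape sequence.
dropped = data.encode("ascii", errors="ignore")
replaced = data.encode("ascii", errors="replace")
escaped = data.encode("ascii", errors="backslashreplace")
```

Note that "ignore" loses information silently, so it is only appropriate when the dropped characters really do not matter.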

Convert hash.digest() to unicode

import hashlib
string1 = u'test'
hashstring = hashlib.md5()
hashstring.update(string1)
string2 = hashstring.digest()
unicode(string2)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8f in position 1: ordinal not in range(128)
The string HAS to be unicode for it to be any use to me, can this be done?
Using python 2.7 if that helps...

Ignacio just gave the perfect answer. Just a complement: when you convert some string from an encoding which has chars not found in ASCII to unicode, you have to pass the encoding as a parameter:
>>> unicode("órgão")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> unicode("órgão", "UTF-8")
u'\xf3rg\xe3o'
If you cannot tell what the original encoding is (UTF-8 in my example), you really cannot convert to Unicode. That is a sign that something is not quite right in what you are trying to do.
Last but not least, encodings are pretty confusing stuff. This comprehensive text about them can make them clear.
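A Python 3 sketch of the same point (assuming Python 3, where the byte string has to be written as a bytes literal): decoding needs the right codec, and a wrong guess may not even raise an error, it just produces the wrong text.

```python
raw = b"\xc3\xb3rg\xc3\xa3o"  # the UTF-8 bytes of "órgão"

# The correct codec recovers the original text.
text = raw.decode("utf-8")

# ASCII cannot decode bytes above 0x7f, so this fails.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# latin-1 accepts every byte, so a wrong guess succeeds but yields mojibake.
wrong_guess = raw.decode("latin-1")
```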

The result of .digest() is a bytestring¹, so converting it to Unicode is pointless. Use .hexdigest() if you want a readable representation.
¹ Some bytestrings can be converted to Unicode, but the bytestrings returned by .digest() do not contain textual data. They can contain any byte including the null byte: they're usually not printable without using escape sequences.
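A minimal sketch of the .hexdigest() route, assuming Python 3 (where the input must be encoded to bytes before hashing; UTF-8 is chosen here as an assumption):

```python
import hashlib

string1 = "test"
hasher = hashlib.md5()
hasher.update(string1.encode("utf-8"))  # hash functions take bytes, so encode explicitly

raw_digest = hasher.digest()     # 16 arbitrary bytes, not text
hex_digest = hasher.hexdigest()  # 32 hexadecimal characters, safe to treat as text
```

The hexadecimal form contains only the characters 0-9 and a-f, so it can be stored or printed as ordinary text, and bytes.fromhex(hex_digest) gives back the raw digest if it is needed later.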

Convert or strip out "illegal" Unicode characters

I've got a database in MSSQL that I'm porting to SQLite/Django. I'm using pymssql to connect to the database and save a text field to the local SQLite database.
However for some characters, it explodes. I get complaints like this:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 1916: ordinal not in range(128)
Is there some way I can convert the chars to proper unicode versions? Or strip them out?

Once you have the string of bytes s, instead of using it as a unicode obj directly, convert it explicitly with the right codec, e.g.:
u = s.decode('latin-1')
and use u instead of s in the code that follows this point (presumably the part that writes to sqlite). That's assuming latin-1 is the encoding that was used to make the byte string originally -- it's impossible for us to guess, so try to find out;-).
As a general rule, I suggest: don't process in your applications any text as encoded byte strings -- decode them to unicode objects right after input, and, if necessary, encode them back to byte strings right before output.
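The decode-early, encode-late rule could look roughly like this in Python 3 (a sketch under assumptions: normalise_field is a hypothetical helper, and cp1252 is only a guess at the source encoding, chosen because byte 0x97 is an em dash in Windows-1252; the real encoding of the database has to be checked):

```python
def normalise_field(raw_bytes, source_encoding="cp1252"):
    """Decode at the input boundary, then work on text only."""
    text = raw_bytes.decode(source_encoding)
    return text.strip()


raw = b"  price \x97 10  "      # byte string as it might arrive from the database
text = normalise_field(raw)      # all processing happens on text
output = text.encode("utf-8")    # encode only right before output
```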

When you decode, just pass 'ignore' to strip those characters.
There are other error handlers for stripping or converting such characters:
'replace': replace malformed data with a suitable replacement marker, such as '?' or '\ufffd'
'ignore': ignore malformed data and continue without further notice
'backslashreplace': replace with backslashed escape sequences (for encoding only)
Test
>>> "abcd\x97".decode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 4: ordinal not in range(128)
>>>
>>> "abcd\x97".decode("ascii","ignore")
u'abcd'
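The same test as a Python 3 sketch (assuming Python 3, so the input is a bytes literal), comparing the 'ignore' and 'replace' handlers when decoding:

```python
raw = b"abcd\x97"

# "ignore" drops the undecodable byte; "replace" inserts U+FFFD in its place.
ignored = raw.decode("ascii", "ignore")
replaced = raw.decode("ascii", "replace")
```

With 'replace' the position of the bad byte stays visible in the result, which can make it easier to find the affected records afterwards.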
