The proper way to convert a unicode string u to a (byte)string in Python is by calling u.encode(someencoding).
Unfortunately, I didn't know that before, and I used str(u) for the conversion. In particular, I called str(u) to coerce u to a string so that I could use it as a shelve key (which must be a str).
Since I didn't encounter any UnicodeEncodeError, I wonder if this process is reversible/lossless. That is, can I do u = str(converted_unicode) (or u = bytes(converted_unicode) in Python 3) to get the original u?
In Python 2, if the conversion with str() succeeded, the result is fully reversible. Using str() on a unicode value is equivalent to calling unicode_value.encode('ascii'), so a conversion that raised no UnicodeEncodeError means the value contained only ASCII code points. The reverse is simply str_value.decode('ascii'). Using unicode(str_value) decodes with the same implicit ASCII codec.
In Python 3, calling str() on a Unicode value simply gives you the same object back, since in Python 3 str is the Unicode type. Calling bytes() on a Unicode value without an encoding fails; in Python 3 you always have to use an explicit codec to convert between str and bytes.
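As a rough illustration of the Python 3 behaviour described above, here is a minimal sketch. The variable names are made up for the example, and the identity check reflects how CPython treats exact str instances rather than a documented guarantee.

```python
# Python 3: text (str) and binary data (bytes) never convert implicitly.
text = "hello"

# str() on an exact str instance hands back the same object (CPython behaviour).
same_object = str(text) is text

# bytes() without an encoding refuses to guess a codec and raises TypeError.
try:
    bytes(text)
    raised = False
except TypeError:
    raised = True

# An explicit ASCII round trip is lossless for pure-ASCII text.
encoded = text.encode("ascii")
decoded = encoded.decode("ascii")
```

The explicit encode/decode pair is the Python 3 counterpart of the implicit ASCII conversion that str() and unicode() perform in Python 2.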
Related
In some Python source code I stumbled upon, I saw a small b before a string, as in:
b"abcdef"
I know about the u prefix signifying a unicode string, and the r prefix for a raw string literal.
What does the b stand for, and in which kind of source code is it useful? It seems to behave exactly like a plain string without any prefix.
The b prefix signifies a bytes string literal.
If you see it used in Python 3 source code, the expression creates a bytes object, not a regular Unicode str object. If you see it echoed in your Python shell or as part of a list, dict or other container contents, then you see a bytes object represented using this notation.
bytes objects basically contain a sequence of integers in the range 0-255, but when represented, Python displays these bytes as ASCII characters to make their contents easier to read. Any bytes outside the printable range of ASCII characters are shown as escape sequences (e.g. \n, \x82, etc.). Conversely, you can use both ASCII characters and escape sequences to define byte values; for ASCII characters their numeric value is used (e.g. b'A' == b'\x41').
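A small sketch of that representation rule, using three arbitrarily chosen byte values (one printable, one control character, one outside ASCII):

```python
# 65 is printable ASCII, 10 is a newline, 130 lies outside the ASCII range.
data = bytes([65, 10, 130])

# repr() shows printable bytes as characters and everything else as escapes.
shown = repr(data)

# A literal character and its hex escape define the same byte.
same = b"A" == b"\x41"
```

Here shown is the text b'A\n\x82' exactly as the interactive shell would echo it.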
Because a bytes object consists of a sequence of integers, you can construct a bytes object from any other sequence of integers with values in the 0-255 range, like a list:
bytes([72, 101, 108, 108, 111])
Indexing gives you back the integers, while slicing produces a new bytes value. If the object above is bound to the name value, then value[0] gives you 72, but value[:1] is b'H', as 72 is the ASCII code point for the capital letter H.
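The difference between indexing and slicing can be checked directly. This sketch binds the list-built object to the name value, which is only an illustrative choice:

```python
# Build a bytes object from a list of integers in the 0-255 range.
value = bytes([72, 101, 108, 108, 111])

first = value[0]      # indexing yields an int: 72
head = value[:1]      # slicing yields a new bytes object: b'H'
as_ints = list(value) # iterating also yields the integers
```

Code that must treat single bytes as bytes objects in Python 3 therefore tends to use one-element slices rather than plain indexes.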
bytes model binary data, including encoded text. If your bytes value does contain text, you need to first decode it, using the correct codec. If the data is encoded as UTF-8, for example, you can obtain a Unicode str value with:
strvalue = bytesvalue.decode('utf-8')
Conversely, to go from text in a str object to bytes you need to encode. You need to decide on an encoding to use; in Python 3, str.encode() defaults to UTF-8 when called without arguments, but what you will need is highly dependent on your use case:
bytesvalue = strvalue.encode('utf-8')
You can also use the constructor, bytes(strvalue, encoding) to do the same.
Both the decoding and encoding methods take an extra argument to specify how errors should be handled.
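To make the encode/decode steps and the error-handling argument concrete, here is a hedged Python 3 sketch. The sample string and the variable names are assumptions for illustration; "replace" and "ignore" are two of the standard error handler names.

```python
text = "caf\u00e9"  # 'café', one non-ASCII character

# Encoding with an explicit codec, via the method or the constructor.
utf8 = text.encode("utf-8")
via_constructor = bytes(text, "utf-8")

# The default 'strict' handler raises when a character cannot be encoded.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Other handlers substitute or drop the offending character instead.
replaced = text.encode("ascii", errors="replace")  # b'caf?'
ignored = text.encode("ascii", errors="ignore")    # b'caf'

# Decoding accepts the same argument; invalid input becomes U+FFFD.
lenient = b"caf\xe9".decode("utf-8", errors="replace")
round_trip = utf8.decode("utf-8")
```

Note that only the strict round trip is lossless; the "replace" and "ignore" handlers deliberately discard information.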
Python 2.6 and 2.7 also accept the b'..' literal syntax, to make it easier to write code that works on both Python 2 and 3; in those versions it produces an ordinary str.
bytes objects are immutable, just like str strings are. Use a bytearray() object if you need to have a mutable bytes value.
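A brief sketch of the immutability difference, with names chosen only for the example:

```python
frozen = b"Hello"

# Item assignment on bytes is rejected with a TypeError.
try:
    frozen[0] = 74
    mutated = True
except TypeError:
    mutated = False

# A bytearray holds the same data but allows in-place changes.
buf = bytearray(frozen)
buf[0] = 74          # 74 is the ASCII code point for 'J'
result = bytes(buf)  # convert back to an immutable bytes object
```

The original frozen object is untouched, because bytearray(frozen) copies the data.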
This is a Python 3 bytes literal. The prefix does not exist in Python 2.5 and older. A bytes literal is equivalent to a plain string in 2.x, while a plain string in 3.x is equivalent to a literal with the u prefix in 2.x. In Python 2.6+ the b prefix is accepted and yields a plain string, for compatibility with 3.x.
Why is the str operator able to convert a list of unicode objects to a str object, but not able to convert a single unicode object?
For example, in the code below I'm creating a list of unicode objects, and then attempting to print out that list. In the second print statement, I'm just printing out a single unicode object.
bill = []
bill.append(u'的东西')
bill.append(u'的东西')
print("list is " + str(bill)) # this is OK
print ("this string is " + str(u'的东西')) # generates a UnicodeEncodeError
The first print statement results in:
list is [u'\u7684\u4e1c\u897f', u'\u7684\u4e1c\u897f']
But the second:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
I can see that in the first statement, the actual unicode objects are being printed, and not converted using any codec - why can't this be done for the single object?
You are looking for the repr() function; lists don't implement their own str() conversion, so str() on a list falls back to the output of repr() instead.
In Python 2, repr() will always produce ASCII-safe output for built-in types:
>>> bill = [u'的东西', u'的东西']
>>> print repr(bill[0])
u'\u7684\u4e1c\u897f'
For built-in containers such as list, tuple, dict and set, the contents are always represented with their repr() content, recursively.
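The question's code is Python 2. As a hedged side note, the same container rule can be observed in Python 3, with one difference: repr() there keeps printable non-ASCII characters, and the built-in ascii() produces the escaped form instead. A minimal sketch, assuming Python 3:

```python
bill = ["的东西", "的东西"]

# str() on a list gives the same text as repr() on it.
list_text = str(bill)
list_repr = repr(bill)

# Each element appears as its repr(), quotes included.
rebuilt = "[" + ", ".join(repr(item) for item in bill) + "]"

# ascii() escapes non-ASCII characters, much like Python 2's repr().
escaped = ascii(bill[0])
```

Here escaped holds the text '\u7684\u4e1c\u897f', quotes included but without the u prefix that Python 2 prints.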
Note that repr() is aimed at producing debug output, not user-readable output. If you need to handle text, stick to Unicode everywhere in your code: decode when ingesting data (unless the API you use already decodes for you) and encode when producing output (again, unless the API already encodes, as print does). I strongly recommend reading or watching Pragmatic Unicode by Ned Batchelder to understand Python and Unicode better.