If I have a list of unicode strings
lst = [ u"aaa", u"bbb", u"foo", u"bar", ... u"baz", u"zzz" ]
is it necessary to write a prefix u before every string? Can I make a construction that says that every element of lst will be unicode string and then write it without u prefix?
In Python 2.7 (also Python 2.6) you can make unicode literals the default for a module:
from __future__ import unicode_literals
You must include the import at the top of the file, and it then applies to all string literals in the file. Use a b prefix to force byte strings:
>>> from __future__ import unicode_literals
>>> "sss"
u'sss'
>>> b"x"
'x'
If your intention is to convert a set of standard strings to unicode, you could map that function onto your list:
lst = ["aaa", "bbb", "ccc"]
map(unicode, lst)
Which gives
[u"aaa", u"bbb", u"ccc"]
If however lst contains a non ASCII character string, you'll have to prefix that particular string with the u. If you don't, you'll get this error on the conversion:
lst = ["\xe4"]
map(unicode,lst)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
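The failure above comes from the implicit ASCII decoding step. As a rough sketch of the explicit alternative, here is the same idea in Python 3 syntax (where byte strings need a b prefix); the names and the choice of Latin-1 are assumptions made for illustration, not something the traceback tells us:

```python
# Byte strings as they might arrive from a file or a socket.
raw_items = [b"aaa", b"\xe4"]

# Decoding with an explicitly named codec avoids the implicit ASCII step.
# Latin-1 is only an assumption; use the encoding the data really has.
decoded = [item.decode("latin-1") for item in raw_items]
print(decoded)

# Decoding the same byte as ASCII fails, just like in the traceback above.
try:
    b"\xe4".decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print(ascii_failed)
```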
As noted in the comments, this answer is different for Python 2.x or 3.x. In Python 3, everything changes:
Everything you thought you knew about binary data and Unicode has changed. Python 3.0 uses the concepts of text and (binary) data instead of Unicode strings and 8-bit strings. All text is Unicode; however encoded Unicode is represented as binary data. The type used to hold text is str, the type used to hold data is bytes. The biggest difference with the 2.x situation is that any attempt to mix text and data in Python 3.0 raises TypeError, whereas if you were to mix Unicode and 8-bit strings in Python 2.x, it would work if the 8-bit string happened to contain only 7-bit (ASCII) bytes, but you would get UnicodeDecodeError if it contained non-ASCII values. This value-specific behavior has caused numerous sad faces over the years.
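A small sketch of what the quoted passage means in practice on Python 3. The variable names are made up for the example:

```python
text = "caf\u00e9"            # str: Unicode text ('café')
data = text.encode("utf-8")   # bytes: the encoded form of that text

# Mixing text and data raises TypeError regardless of the values involved.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Decoding explicitly first makes the operation well defined.
combined = text + data.decode("utf-8")
print(mixed_ok, combined)
```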
I know that using this code can remove the b prefix
>>> b'Hello'
b'Hello'
>>> b'Hello'.decode() # decodes bytes type
'Hello'
But if I use a unicode escape (with unicode_escape codec because the utf-8 codec struggles), it works fine... [Python 3.10, Windows 10, AMD64]
>>> b'Hello\xeb'.decode() # utf-8 codec does not work that well
Traceback (most recent call last):
File "<pyshell#25>", line 1, in <module>
b'Hello\xeb'.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xeb in position 5: unexpected end of data
>>> b'Hello\xeb'.decode('unicode_escape')
'Helloë'
...but for example if I use a .exe file it does not work (and the b prefix is still there??)
>>> # some thing that reads file, code: with open('autoclicker.exe', 'rb') ...
b'MZ\x90\x00\x03\x00\x00\x00\x04\x00\x00\x00\xff\xff\x00\x00\xb8\x00\x00\x00\x00\x00\x00\x00#\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x10\x01\x00\x00\x0e\x1f\xba\x0e\x00\xb4\t\xcd!\xb8\x01L\xcd!This program cannot be run in DOS mode.\r\r\n$\x00\x00\x00\x00\x00\x00\x00-\x82\xc1\xedi\xe3\xaf\xbei\xe3\xaf\xbei\xe3\xaf\xbe\xd4\xac9\xbek\xe3\xaf\xbe`\x9b:\xbew\xe3\xaf\xbe`\x9b,\xbe\xdb\xe3\xaf\xbe`\x9b+\xbeP\xe3\xaf\xbeN%\xc2\xbec\xe3\xaf\xbeN%\xd4\xbeH\xe3\xaf\xbei\xe3\xae\xbed\xe1\xaf\xbe`\x9b \xbe/\xe3\xaf\xbew\xb1:\xbek\xe3\xaf\xbew\xb1;\xbeh\xe3\xaf\xbei\xe38\xbeh\xe3\xaf\xbe`\x9b>\xbeh\xe3\xaf\xbeRichi\xe3\xaf\xbe\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00PE\x00\x00L\x01\x04\x00\x14\xb3\x8a\\\x00\x00\x00\x00\x00\x00\x00\x00\xe0\x00#\x01\x0b\x01\t\x00\x00\x02\x08\x00\x00\xf0\x01\x00\x00\x00\x00\x00\x10c\x01\x00\x00\x10\x00\x00\x00 \x08\x00\x00\x00#\x00\x00\x10\x00\x00\x00\x02\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x00`\x0b\x00\x00\x04\x00\x00\x03\t\x0e\x00\x02\x00\x00\x80\x00\x00#\x00\x00\x10\x00\x00\x00\x00#\x00\x00\x10\x00\x00\x00\x00\x00\x00\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00<\xcd\x08\x00T\x01\x00\x00\x00\xb0\n\x00\xfc\xac\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 \x08\x00#\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00.text\x00\x00\x00\x17\x00\x08\x00\x00\x10\x00\x00\x00\x02\x08\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 \x00\x00`.rdata\x00\x00\\\xd9\x00\x00\x00 
\x08\x00\x00\xda\x00\x00\x00\x06\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00#\x00\x00#.data\x00\x00\x00\x18\xa5\x01\x00\x00\x00\t\x00\x00h\x00\x00\x00\xe0\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00#\x00\x00\xc0.rsrc\x00\x00\x00\xfc\xac\x00\x00\x00\xb0\n\x00\x00\xae\x00\x00\x00H\t\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00#\x00\x00#\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x003\xc0\x81\xec\xac\x03\x00\x008\x05\x92rI\x00tCh\xa4\x03\x00\x00P\x8dT$\x0cR\xa2\x92rI\x00\x89\x81\x8c\x01\x00\x00\xe8\xc5!\x01\x00\xa1\xc0rI\x00\x83\xc4\x0c\x8d\x0c$Qj\x02\xc7D$\x08\xa8\x03\x00\x00\x89D$\x0c\xc7D$\x10\x01\x00\x00\x00\xff\x15\x8c$H\x00\x81\x
c4\xac\x03\x00\x00\xc3\xcc\xcc\xcc\xcc\xcc\xcc\x8bF$S3\xdb;\xc3\x0f\x85J\x85\x02\x00\x8bF,\x89^$;\xc3\x0f\x85J\x85\x02\x00\x89^,\x89^0\x89^4\x89^8\x88^\x10[\xc3\xcc\xcc\xcc\x80~\t\x00\x0f\x85\x95\x82\x02\x00j\x08\xe8y\x06\x01\x00\x83\xc4\x04\x85\xc0t\x10\x8b\x17\x89\x10\x8bN\x04\x89H\x04\xff\x06\x89F\x04\xc3\x8bN\x043\xc0\x89H\x04\xff\x06\x89F\x04\xc3\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\x8bD$\x10\x8bL$\x0cS\x8b\\$\x08W\x8b|$\x10PQ\xe8\x07\x00\x00\x00_[\xc2\x10\x00\xcc\xccU\x8b\xec\x83\xe4\xf8QV;\x1d\xc0rI\x00uP\x81\xff\x11\x01\x00\x00s(\x83\xff\x12r#;=\x80\x93J\x00\x0f\x84r\xd6\x02\x00\x8bU\x0c\x8bE\x08RPWS\xff\x15\x84&H\x00^\x8b\xe5]\xc2\x08\x00\x81\xff\x13\x01\x00\x00u#\x8bE\x08h\xb8\x84J\x00\x8b\xf3\xe8\xe9\x00\x00\x003\xc0^\x8b\xe5]\xc2\x08\x00\xa1\xc0rI\x00\x85\xc0t\xa7\xeb\xbe\x83\xff\x10w9\x0f\x84\xa5\xd5\x02\x00\x8dG\xff\x83\xf8\x06w\x9f\xff$\x85\x14\x12#\x00j\x01S\xff\x15\xd0&H\x00\xb9\xb8\x84J\x00\xe8x\xfe\xff\xffj\x00\xff\x15\xcc&H\x003\xc0^\x8b\xe5]\xc2\x08\x00\x81\xff\x12\x03\x00\x00wa\x0f\x84\xc9\xd5\x02\x00\x83\xff\x11\x0f\x84\x90\xd5\x02\x00\x81\xff\x11\x01\x00\x00\x0f\x85Q\xff\xff\xff\xe9`\xd5\x02\x00j\x00h\xee\x02\x00\x00j\x01S\xff\x15\xdc&H\x00hhHH\x00\xff\x15\xd8&H\x00\x83=\xb8\x84J\x00\x00\xa3\x80\x93J\x00\x0f\x85Y\xff\xff\xff\xff\x15\xd4&H\x00\xa3\xb8\x84J\x003\xc0^\x8b\xe5]\xc2\x08\x00\x81\xff\x01\x04\x00\x00\x0f\x85\xff\xfe\xff\xff\xe9\x9e\xd5\x02\x00\x90\xc1\x11#\x00u\x11#\x00\r\x11#\x00\r\x11#\x00\xe7\xe6B\x00\r\x11#\x00\xd6\xe6B\x00\x81\xec\xa8\x03\x00\x00\x83\xe8\x01SW\x0f\x85\x84\x00\x00\x00h\xa4\x03\x00\x00P\x8dX\x01\x8dD$\x14P\xc7D$\x14\xa8\x03\x00\x00\xe8\x94\x1f\x01\x00\x8b\x84$\xc0\x03\x00\x00\x83\xc4\x0c\xe8\x05\x0c\x00\x00\x80=\x92rI\x00\x00t:\x80=\x94rI\x00\x00\x8b\xbc$\xb4\x03\x00\x00\x89t$\x0c\x89\\$\x10\xc7D$\x14\x02\x00\x00\x00\x0f\x85\xc9\x97\x02\x00\x80\x7f\t\x00\x0f\x85\n\x98\x02\x008\x9f\x84\x01\x00\x00\x0f\x84J\x98\x02\x00SV\xff\x15\xd0&H\x00j\x00h\xee\x02\x00\x00SV\xff\x15\xdc&H\x00_[\x81\xc4\xa8\x03\x00\x00\xc2\x04
\x00\x8b\x03\x85\xc0t\tP\xe8\x13\x00\x01\x00\x83\xc4\x04V\x8d\xb3\xec\x00\x00\x00W\xc7\x06p\xa0H\x00\xe8\x1e\xf5\x00\x00\x8bF\x04P\xe8\xf4\xff\x00\x00\x83\xc4\x04\x8d\x8b\xbc\x00\x00\x00\xe8\x17\x13\x00\x00\x8d{x\xe8O\x00\x00\x00\x8d{4\xe8G\x00\x00\x00\x8dK$\xe8\xff\x12\x00\x00_\x8dK\x14^\xe9\xf5\x12\x00\x00\xcc\xcc\xcc\xcc\xcc\x8bF\x0c\xff\x08\x8bF\x0c\x838\x00u\x14\x8b\x0eQ\xe8\xaa\xff\x00\x00\x8bV\x0cR\xe8\xa1\xff\x00\x00\x83\xc4\x08\xc3\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xccV\x8b\xf7\xe8\xf8\xfc\xff\xff\x8dw\x14\xe8\xc0\xff\xff\xff\x8b\xf7\xe8\xb9\xff\xff\xff^\xc3\xcc\xcc\xcc\xcc\xcc\xcc\xcc\x8b\x06\x85\xc0t\tP\xe8c\xff\x00\x00\x83\xc4\x04W\x8d\xbe\xec\x00\x00\x00\xe8\xc5\xff\xff\xff\x8d\x8e\xbc\x00\x00\x00\xe8z\x12\x00\x00\x8d\x8e\xac\x00\x00\x00\xe8o\x12\x00\x00\x8d\x8e\x9c\x00\x00\x00\xe8d\x12\x00\x00\x8d\x8e\x8c\x00\x00\x00\xe8Y\x12\x00\x00\x8d~\x08\xe8\xd1\xf0\x00\x00_\xc3\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xccj\x04\xe83\x03\x01\x00\x83\xc4\x04\x85\xc0\x0f\x84\xc2\x82\x02\x00\xc7\x00\x01\x00\x00\x00\x89F\x0c\xc3\xcc\xcc\xcc\xccV\x8b\xf1#3\xc9\x89F\x08\xba\x02\x00\x00\x00\xf7\xe2\x0f\x90\xc1\xc7F\x04\x00\x00\x00\x00\xf7\xd9\x0b\xc8Q\xe8\xf6\x02\x01\x003\xc9\x89\x06\x83\xc4\x04f\x89\x08\xe8\xad\xff\xff\xff\x8b\xc6^\xc3\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\x81\xec\x14\x02\x00\x00\xe8\x85\x13\x00\x00\x84\xc0\x0f\x84n\xae\x02\x00\xe8\x98\x00\x00\x00\x85\xc0\x0f\x85a\xae\x02\x00\xe8k\xc1\x00\x00\x85\xc0\x0f\x85T\xae\x02\x00\x8b\x94$\x18\x02\x00\x00\x8d\x04$P\x8dL$\x08Qh\x04\x01\x00\x00R\xff\x15 #H\x00\x8dD$\x04P\xb8\xe8\x7fJ\x00\xe8H\r\x00\x00\x8b\x0c$Q\xb8\xd8\x7fJ\x00\xe8:\r\x00\x00\x8b\x04$3\xd2f\x89P\xfef9T$\x08\x0f\x84\x11\xae\x02\x00\x8dT$\x04R\xb8\xf8\x7fJ\x00\xe8\x17\r\x00\x00\x8b\x84$\x1c\x02\x00\x00\xa3\xd4\x7fJ\x003\xc0\x81\xc4\x14\x02\x00\x00\xc2\x08\x00\x8b#\x04\x85\xc0\x0f\x85\xb4\x81\x02\x00\xc3\xcc\xcc\xcc\xcc\x83\xect3\xc0SUVW\xbd\x01\x00\x00\x00\x89D$\x18\x89D$,\x89D$0\x89D$$\x89D$ \x89D$\x1c\x89D ...
>>> b'hi\xbd'.decode('unicode_escape') # using some (unicode) escape methods work separately
'hi½'
>>> b'abc\'
The b'' isn't a "string prefix", instead it indicates that you are dealing with a sequence of bytes. Bytes can represent anything, including a text which is just a series of characters in some encoding, like UTF-8, ASCII, etc.
That's what .decode() does, it takes the sequence of bytes and interprets it as if it were a string of characters in that encoding and returns a string of those characters. Conversely, you could then encode the resulting string of characters into some other encoding by calling .encode() on the string and you'd get the sequence of bytes that represents that string in that encoding.
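To make the encode/decode relationship concrete, here is a minimal Python 3 sketch using the 'Helloë' text from the question. It also suggests why unicode_escape appeared to "work" there: for bytes that are not backslash escapes it behaves like Latin-1, which maps every byte to some character.

```python
word = "Hello\u00eb"                 # 'Helloë'
as_latin1 = word.encode("latin-1")   # b'Hello\xeb'     (1 byte for ë)
as_utf8 = word.encode("utf-8")       # b'Hello\xc3\xab' (2 bytes for ë)

# Decoding with the codec that produced the bytes gives the text back.
round_trip = as_utf8.decode("utf-8")

# Latin-1 bytes are not valid UTF-8, so this raises UnicodeDecodeError.
try:
    as_latin1.decode("utf-8")
    utf8_decode_failed = False
except UnicodeDecodeError:
    utf8_decode_failed = True

# The wrong codec does not always fail; it can silently give wrong text.
mojibake = as_utf8.decode("latin-1")

# unicode_escape treats non-escape bytes as Latin-1.
via_escape = as_latin1.decode("unicode_escape")
print(round_trip, utf8_decode_failed, mojibake, via_escape)
```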
However, you can't just take any sequence of bytes and decode it with any encoding. If the bytes represent some string, they were produced with a certain encoding, but the example you give (an executable) doesn't represent a string of characters at all, and thus won't successfully decode into a string if you just call .decode() on it.
If you're lucky, the decoding works on the parts of the executable that are strings in that encoding, but even that's not guaranteed to work, as the strings will be surrounded by bytes that don't represent that encoding.
If you want to extract strings from an executable, you need to correctly identify what parts of the executable represent strings, extract those sequences of bytes and decode them with the correct encoding. How to do that will depend on the operating system the executable is for, whether it's 32-bit or 64-bit, etc.
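As a very rough illustration only: one common heuristic (similar in spirit to the Unix strings utility) is to scan for runs of printable ASCII bytes. This does not parse the executable format at all, the function name and the minimum length of 4 are arbitrary assumptions, and it misses strings stored in other encodings such as UTF-16:

```python
import re

def printable_runs(blob, min_length=4):
    """Return runs of printable ASCII bytes found in blob, decoded to str."""
    # Bytes 0x20-0x7e are the printable ASCII range.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.group().decode("ascii") for match in pattern.finditer(blob)]

# A short made-up blob shaped like the start of the dump in the question.
blob = (b"MZ\x90\x00\x03\x00!This program cannot be run in DOS mode."
        b"\r\r\n$\x00\x00.text\x00\x00.rdata\x00")
found = printable_runs(blob)
print(found)
```

Short runs such as MZ are skipped because they fall below the minimum length, which is the price of filtering out random printable bytes.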
Note: many programmers new to Python, or to coding in general, get confused by the fact that Python (for the sake of convenience) displays a bytes object much like a string (it looks just like a string with a b before it). This is even more confusing when the bytes happen to be text in UTF-8 or a similar encoding, because the contents of the bytes object will then be readable. But that doesn't mean the bytes object actually is a string.
Apparently, the following is the valid syntax:
b'The string'
I would like to know:
What does this b character in front of the string mean?
What are the effects of using it?
What are appropriate situations to use it?
I found a related question right here on SO, but that question is about PHP, and it states that the b is used to indicate the string is binary, as opposed to Unicode, which was needed for code to remain compatible with versions of PHP < 6 when migrating to PHP 6. I don't think this applies to Python.
I did find this documentation on the Python site about using a u character in the same syntax to specify a string as Unicode. Unfortunately, it doesn't mention the b character anywhere in that document.
Also, just out of curiosity, are there more symbols than the b and u that do other things?
Python 3.x makes a clear distinction between the types:
str = '...' literals = a sequence of Unicode characters (Latin-1, UCS-2 or UCS-4, depending on the widest character in the string)
bytes = b'...' literals = a sequence of octets (integers between 0 and 255)
If you're familiar with:
Java or C#, think of str as String and bytes as byte[];
SQL, think of str as NVARCHAR and bytes as BINARY or BLOB;
Windows registry, think of str as REG_SZ and bytes as REG_BINARY.
If you're familiar with C(++), then forget everything you've learned about char and strings, because a character is not a byte. That idea is long obsolete.
You use str when you want to represent text.
print('שלום עולם')
You use bytes when you want to represent low-level binary data like structs.
import struct
NaN = struct.unpack('>d', b'\xff\xf8\x00\x00\x00\x00\x00\x00')[0]
You can encode a str to a bytes object.
>>> '\uFEFF'.encode('UTF-8')
b'\xef\xbb\xbf'
And you can decode a bytes into a str.
>>> b'\xE2\x82\xAC'.decode('UTF-8')
'€'
But you can't freely mix the two types.
>>> b'\xEF\xBB\xBF' + 'Text with a UTF-8 BOM'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str
The b'...' notation is somewhat confusing in that it allows the bytes 0x01-0x7F to be specified with ASCII characters instead of hex numbers.
>>> b'A' == b'\x41'
True
But I must emphasize, a character is not a byte.
>>> 'A' == b'A'
False
In Python 2.x
Pre-3.0 versions of Python lacked this kind of distinction between text and binary data. Instead, there was:
unicode = u'...' literals = sequence of Unicode characters = 3.x str
str = '...' literals = sequences of confounded bytes/characters
Usually text, encoded in some unspecified encoding.
But also used to represent binary data like struct.pack output.
In order to ease the 2.x-to-3.x transition, the b'...' literal syntax was backported to Python 2.6, in order to allow distinguishing binary strings (which should be bytes in 3.x) from text strings (which should be str in 3.x). The b prefix does nothing in 2.x, but tells the 2to3 script not to convert it to a Unicode string in 3.x.
So yes, b'...' literals in Python have the same purpose that they do in PHP.
Also, just out of curiosity, are there more symbols than the b and u that do other things?
The r prefix creates a raw string (e.g., r'\t' is a backslash + t instead of a tab), and triple quotes '''...''' or """...""" allow multi-line string literals.
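A quick Python 3 sketch of these prefixes, with names chosen only for the example:

```python
tab = "\t"      # escape sequence is interpreted: one tab character
raw = r"\t"     # raw string: a backslash followed by the letter t

# The r prefix combines with b to give raw bytes literals.
raw_bytes = rb"\x41"   # four bytes: backslash, x, 4, 1 (not b'A')

# Triple quotes allow a literal to span several lines.
multi = """line one
line two"""

print(len(tab), len(raw), len(raw_bytes), multi.splitlines())
```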
To quote the Python 2.x documentation:
A prefix of 'b' or 'B' is ignored in Python 2; it indicates that the literal should become a bytes literal in Python 3 (e.g. when code is automatically converted with 2to3). A 'u' or 'b' prefix may be followed by an 'r' prefix.
The Python 3 documentation states:
Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.
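The ASCII-only rule from the quoted documentation can be checked with a short Python 3 sketch. Using compile() to test the invalid literal is just a device so that the example itself stays valid source code:

```python
# A byte with value >= 128 has to be written as an escape.
e_acute_latin1 = b"\xe9"
same_byte = "\u00e9".encode("latin-1")   # 'é' encoded as Latin-1

# Writing the non-ASCII character directly inside a bytes literal
# is rejected when the source is compiled.
try:
    compile("b'\u00e9'", "<example>", "eval")
    literal_accepted = True
except SyntaxError:
    literal_accepted = False

print(e_acute_latin1 == same_byte, literal_accepted)
```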
The b denotes a byte string.
Bytes are the actual data. Strings are an abstraction.
If you had a multi-character string object and took a single character, it would be a string, and it might be more than 1 byte in size depending on the encoding.
If you took 1 byte from a byte string, you'd get a single 8-bit value from 0-255, and it might not represent a complete character if the encoding uses more than 1 byte for that character.
TBH I'd use strings unless I had some specific low level reason to use bytes.
From server side, if we send any response, it will be sent in the form of byte type, so it will appear in the client as b'Response from server'
In order to get rid of b'....' simply use the code below:
Server file:
stri="Response from server"
c.send(stri.encode())
Client file:
print(s.recv(1024).decode())
then it will print Response from server
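The same exchange can be tried without a real network. This is only a sketch: it uses socket.socketpair() to get two connected sockets in one process, and the variable names are hypothetical stand-ins for the c and s objects above.

```python
import socket

# Two connected sockets standing in for the server and client ends.
server_side, client_side = socket.socketpair()
try:
    message = "Response from server"
    server_side.sendall(message.encode("utf-8"))   # str -> bytes on the wire

    received = client_side.recv(1024)              # always a bytes object
    text = received.decode("utf-8")                # bytes -> str for display
finally:
    server_side.close()
    client_side.close()

print(repr(received))
print(text)
```

Naming the encoding on both sides keeps the two ends in agreement, instead of relying on the default.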
The answer to the question is that it does the same as:
data.encode()
and in order to decode it (remove the b, because sometimes you don't need it), use:
data.decode()
Here's an example where the absence of b would throw a TypeError exception in Python 3.x
>>> f=open("new", "wb")
>>> f.write("Hello Python!")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'str' does not support the buffer interface
Adding a b prefix would fix the problem.
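A self-contained version of this for Python 3, as a sketch. It writes to a temporary directory so nothing is left behind, and the exact TypeError message varies between Python versions:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "new")

    with open(path, "wb") as f:
        # A file opened in binary mode only accepts bytes-like objects.
        try:
            f.write("Hello Python!")
            str_write_ok = True
        except TypeError:
            str_write_ok = False

        # With the b prefix the write succeeds and returns the byte count.
        written = f.write(b"Hello Python!")

    with open(path, "rb") as f:
        content = f.read()

print(str_write_ok, written, content)
```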
It turns it into a bytes literal (or str in 2.x), and is valid for 2.6+.
The r prefix causes backslashes to be "uninterpreted" (not ignored, and the difference does matter).
In addition to what others have said, note that a single character in unicode can consist of multiple bytes.
The way UTF-8 (the most common Unicode encoding) works is that it took the old ASCII format (a 7-bit code whose bytes look like 0xxx xxxx) and added multi-byte sequences in which every byte starts with 1 (1xxx xxxx) to represent characters beyond ASCII, so that it would be backwards-compatible with ASCII.
>>> len('Öl') # German word for 'oil' with 2 characters
2
>>> 'Öl'.encode('UTF-8') # convert str to bytes
b'\xc3\x96l'
>>> len('Öl'.encode('UTF-8')) # 3 bytes encode 2 characters !
3
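To see those leading bits directly, the bytes from the example above can be printed in binary. This is a small Python 3 sketch; format(byte, "08b") is just one way to show the bit pattern:

```python
encoded = "Öl".encode("utf-8")

# Iterating over bytes yields integers; show each one as 8 binary digits.
bits = [format(byte, "08b") for byte in encoded]
print(bits)

# Bytes that belong to the multi-byte sequence for 'Ö' have the top bit set,
# while the plain ASCII letter 'l' does not.
top_bit_set = [byte >= 0x80 for byte in encoded]
print(top_bit_set)
```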
You can use JSON to convert it to dictionary
import json
data = b'{"key":"value"}'
print(json.loads(data))
{'key': 'value'}
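A slightly fuller sketch of the same idea. Since Python 3.6, json.loads accepts bytes directly and expects them to be UTF-8, UTF-16 or UTF-32; for any other encoding you would decode explicitly first.

```python
import json

data = b'{"key":"value"}'

parsed = json.loads(data)                           # bytes accepted directly
parsed_from_text = json.loads(data.decode("utf-8")) # explicit decode first

# Going the other way: serialize to str, then encode to bytes.
back_to_bytes = json.dumps(parsed).encode("utf-8")

print(parsed, type(parsed).__name__, back_to_bytes)
```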
FLASK:
This is an example from Flask. Run this in a Python shell:
import requests
requests.post(url='http://localhost(example)/',json={'key':'value'})
In flask/routes.py
@app.route('/', methods=['POST'])
def api_script_add():
    print(request.data) # --> b'{"key": "value"}'
    print(json.loads(request.data))
    return json.loads(request.data)
{'key':'value'}
b"hello" is not a string (even though it looks like one), but a byte sequence. It is a sequence of 5 numbers, which, if you mapped them to a character table, would look like h e l l o. However the value itself is not a string, Python just has a convenient syntax for defining byte sequences using text characters rather than the numbers itself. This saves you some typing, and also often byte sequences are meant to be interpreted as characters. However, this is not always the case - for example, reading a JPG file will produce a sequence of nonsense letters inside b"..." because JPGs have a non-text structure.
.encode() and .decode() convert between strings and bytes.
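The "sequence of numbers" view is easy to check in Python 3. A short sketch:

```python
data = b"hello"

as_numbers = list(data)      # each element is an int between 0 and 255
first = data[0]              # indexing gives an int, not a 1-byte string
first_slice = data[0:1]      # slicing gives a bytes object
rebuilt = bytes([104, 105])  # bytes can be built from plain numbers

print(as_numbers, first, first_slice, rebuilt)
```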
bytes(somestring.encode()) is the solution that worked for me in python 3.
def compare_types():
    output = b'sometext'
    print(output)
    print(type(output))
    somestring = 'sometext'
    encoded_string = somestring.encode()
    output = bytes(encoded_string)
    print(output)
    print(type(output))
compare_types()
As far as I know there is a difference between strings and unicode strings in Python. But is it possible to instruct Python to use unicode strings instead of regular ones whenever a string object is created?
So when I get a text input, I don't need to use unicode()?
I might sound lazy but I am just interested if this is possible...
p.s. I don't know a lot about character encoding so please correct me if I got anything wrong
For example (in the Python interactive shell; output differs in a GUI shell):
>>> s = '你好'
>>> s
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> us = u'你好'
>>> us
u'\u4f60\u597d'
>>> print type(s)
<type 'str'>
>>> print type(us)
<type 'unicode'>
>>> len(s)
6
>>> len(us)
2
In short:
First, a str object (in Python 2) is a sequence of bytes, while a Unicode string is a sequence of code points, which are numbers from 0 to 0x10ffff.
Then, len(string) returns the number of bytes, while len(unicode) returns the number of characters. A sequence of code points needs to be represented as a sequence of bytes (meaning, values from 0-255) in memory. The rules for translating a Unicode string into a sequence of bytes are called an encoding.
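For comparison, here is a sketch of the same measurement in Python 3 syntax, where the text type is str and the encoded form is bytes. The escapes spell the same two characters as in the session above:

```python
text = "\u4f60\u597d"            # '你好': two code points
encoded = text.encode("utf-8")   # the UTF-8 encoding of those code points

print(len(text))      # number of characters
print(len(encoded))   # number of bytes
print(encoded)
```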
I think you should use raw_input instead of input if you want to get a bytestring.
But is it possible to instruct Python to use unicode strings instead of regular ones whenever a string object is created?
There are two type of strings in Python (on both Python 2 and 3): a bytestring (a sequence of bytes) and a Unicode string (a sequence of Unicode codepoints).
bytestring = b'abc'
unicode_text = u'abc'
The type of string created using the 'abc' string literal depends on the Python version and the presence of the from __future__ import unicode_literals import. Without the import on Python 2, the 'abc' literal creates a bytestring; otherwise it creates a Unicode string.
Add the encoding declaration at the top of your Python source file if you use non-ascii characters in string literals e.g.: # -*- coding: utf-8 -*-.
So when I get a text input, I don't need to use unicode()?
If by "text input" you mean that your program receives bytes somehow (from a file, network, from the command-line) then no: you shouldn't rely on Python to convert bytes to Unicode implicitly -- you should do it explicitly as soon as you receive the bytes using unicode_text = bytestring.decode(character_encoding).
And in reverse, keep the text as Unicode inside your program. Convert Unicode strings to bytes as late as possible when it is necessary (e.g., to send the text via the network).
Use bytestrings to work with a binary data: an image, a compressed content, etc. Use Unicode strings to work with text in Python.
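The "decode early, encode late" advice can be sketched as a small function in Python 3 syntax. The function name and the UTF-8 default are assumptions for the example:

```python
def shout(raw_line, encoding="utf-8"):
    """Take bytes in, work on text internally, give bytes back."""
    text = raw_line.decode(encoding)   # decode as soon as the bytes arrive
    result = text.upper() + "!"        # all processing happens on Unicode text
    return result.encode(encoding)     # encode only when leaving the program

incoming = "über".encode("utf-8")      # pretend this came from the network
outgoing = shout(incoming)
print(outgoing.decode("utf-8"))
```

Doing the upper-casing on the raw bytes instead would leave the ü untouched, because bytes methods only handle the ASCII range.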
To read Unicode from a file, use io.open() (you have to know the correct character encoding if it is not locale.getpreferredencoding(False)).
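A minimal sketch of reading and writing text with an explicit encoding through io.open() (on Python 3 the built-in open() is the same function). The file name is made up, and a temporary directory is used so the example cleans up after itself:

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "greeting.txt")   # hypothetical file name

    # Text mode with an explicit encoding: we write and read Unicode text.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write("Bitte überprüfen")

    with io.open(path, "r", encoding="utf-8") as f:
        text = f.read()

    # Binary mode shows the bytes that actually ended up on disk.
    with io.open(path, "rb") as f:
        raw = f.read()

print(len(text), len(raw))
```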
What character encoding to use when you receive your Unicode text via network may depend on the corresponding protocol e.g., the charset can be specified in Content-Type http header:
text = data.decode(response.headers.getparam('charset'))
You could use universal_newlines=True or io.TextIOWrapper() to get Unicode text from an external process started using subprocess module. It can be non-trivial to figure out what character encoding should be used on Windows (if you read Russian, see the gory details here: Byte при печати вывода внешней команды).
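A sketch of the subprocess case on Python 3.6 or later, where subprocess.run() takes an encoding argument. To keep the example predictable, the child here is another Python interpreter whose output encoding is forced to UTF-8 through the PYTHONIOENCODING environment variable; with an arbitrary external program you would have to find out its encoding yourself.

```python
import os
import subprocess
import sys

# Make the child Python process write UTF-8 to its stdout.
env = dict(os.environ, PYTHONIOENCODING="utf-8")

completed = subprocess.run(
    [sys.executable, "-c", "print('\\u00fcber')"],  # child prints 'über'
    stdout=subprocess.PIPE,
    env=env,
    encoding="utf-8",   # decode the child's output as UTF-8 text
    check=True,
)
output = completed.stdout
print(type(output).__name__, output.strip())
```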
In Python 2.6+ you can use from __future__ import unicode_literals, but that only makes string literals Unicode. All functions that returned byte strings still return byte strings.
Example:
>>> s = 'abc'
>>> type(s)
<type 'str'>
>>> from __future__ import unicode_literals
>>> s = 'abc'
>>> type(s)
<type 'unicode'>
For the behavior you want, use Python 3.
I've been banging my head on this error for some time now and I can't seem to find a solution anywhere on SO, even though there are similar questions.
Here's my code:
f = codecs.open(path, "a", encoding="utf-8")
value = "Bitte überprüfen"
f.write(("\"%s\" = \"%s\";\n" % ("no_internet", value)).encode("utf-8"))
And what I get as an error is:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128)
Why ascii if I say utf-8? I would really appreciate any help.
Try:
value = u"Bitte überprüfen"
in order to declare value as a unicode string and
# -*- coding: utf-8 -*-
at the start of your file in order to declare that your python file is saved with utf-8 encoding.
For the sake of never being hurt by unicode errors ever again, switch to python3:
% python3
>>> with open('/tmp/foo', 'w') as f:
...     value = "Bitte überprüfen"
...     f.write(('"{}" = "{}";\n'.format('no_internet', value)))
...
36
>>> import sys
>>> sys.exit(0)
% cat /tmp/foo
"no_internet" = "Bitte überprüfen";
though if you're really tied to python2 and have no choice:
% python2
>>> with open('/tmp/foo2', 'w') as f:
...     value = u"Bitte überprüfen"
...     f.write(('"{}" = "{}";\n'.format('no_internet', value.encode('utf-8'))))
...
>>> import sys
>>> sys.exit(0)
% cat /tmp/foo2
"no_internet" = "Bitte überprüfen";
And as @JuniorCompressor suggests, don't forget to add # encoding: utf-8 at the start of your python2 file to tell python to read the source file as UTF-8, not as ASCII!
Your error in:
f.write(("\"%s\" = \"%s\";\n" % ("no_internet", value)).encode("utf-8"))
is that you're encoding the whole formatted string into utf-8, whereas you should encode the value string into utf-8 before doing the format:
>>> with open('/tmp/foo2', 'w') as f:
...     value = u"Bitte überprüfen"
...     f.write(('"{}" = "{}";\n'.format('no_internet', value).encode('utf-8')))
...
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 6: ordinal not in range(128)
This happens because the format template is a byte string: to insert the unicode value, Python 2 first has to encode it implicitly, and it does so with the ASCII codec. So keep the text as the unicode type (which is what u"" does), and explicitly encode that value as utf-8 before feeding it to the format parser to build the new string.
As Karl says in his answer, Python2 is totally messy/buggy when using unicode strings, defeating the Explicit is better than implicit zen of python. And for more weird behaviour, the following works just fine in python2:
>>> value = "Bitte überprüfen"
>>> out = '"{}" = "{}";\n'.format('no_internet', value)
>>> out
'"no_internet" = "Bitte \xc3\xbcberpr\xc3\xbcfen";\n'
>>> print(out)
"no_internet" = "Bitte überprüfen";
Still not convinced to switch to python3 ? :-)
Update:
This is the way to go to read a unicode string from one file and write it to another file:
% echo "Bitte überprüfen" > /tmp/foobar
% python2
>>> with open('/tmp/foobar', 'r') as f:
...     data = f.read().decode('utf-8').strip()
...
>>>
>>> with open('/tmp/foo2', 'w') as f:
...     f.write(('"{}" = "{}";\n'.format('no_internet', data.encode('utf-8'))))
...
>>> import sys;sys.exit(0)
% cat /tmp/foo2
"no_internet" = "Bitte überprüfen";
Update:
as a general rule:
when you get a DecodeError you shall use the .decode('utf-8') on the string that contains unicode data and
when you get an EncodeError, you shall use the .encode('utf-8') on the string that contains unicode data
Update: if you cannot update to python3, you can at least make your python2 behave almost like python3, using the following __future__ import statement:
from __future__ import absolute_import, division, print_function, unicode_literals
HTH
Like already suggested your error results from this line:
f.write(("\"%s\" = \"%s\";\n" % ("no_internet", value)).encode("utf-8"))
it should be:
f.write(('"{}" = "{}";\n'.format('no_internet', value.encode('utf-8'))))
A note on unicode and encodings
If working with Python 2, software should only work with unicode strings internally, converting to a particular encoding on output.
To avoid making the same error over and over again, you should make sure you understand the difference between the ascii and utf-8 encodings, and also between str and unicode objects in Python.
The difference between ASCII and UTF-8 encoding:
Ascii needs just one byte to represent all possible characters in the ascii charset/encoding. UTF-8 needs up to four bytes to represent the complete charset.
ascii (default)
1 If the code point is < 128, each byte is the same as the value of the code point.
2 If the code point is 128 or greater, the Unicode string can’t be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)
utf-8 (unicode transformation format)
1 If the code point is <128, it’s represented by the corresponding byte value.
2 If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
3 Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.
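These rules can be checked with a few sample characters. The sketch below uses Python 3 syntax, and the characters were picked only to land in each of the ranges listed above:

```python
# One character from each range: <128, 128..0x7ff, >0x7ff, and >0xffff.
samples = ["A", "\u00e9", "\u20ac", "\U0001f600"]

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)

# Every byte of a multi-byte UTF-8 sequence is between 128 and 255.
high_bytes_only = all(
    byte >= 128 for ch in samples[1:] for byte in ch.encode("utf-8")
)
print(high_bytes_only)

# ASCII cannot represent code points of 128 or greater.
try:
    "\u00e9".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```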
The difference between str and unicode objects:
You can say that str is basically a byte string and unicode is a unicode string. A str can hold text in different encodings like ascii or utf-8; a unicode object can be encoded into any of them.
str vs. unicode
1 str = byte string (8-bit) - uses \x and two digits
2 unicode = unicode string - uses \u and four digits
3 basestring
     /\
    /  \
  str  unicode
If you follow some simple rules you should go fine with handling str/unicode objects in different encodings like ascii or utf-8 or whatever encoding you have to use:
Rules
1 encode(): Gets you from Unicode -> bytes
encode([encoding], [errors='strict']), returns an 8-bit string version of the Unicode string,
2 decode(): Gets you from bytes -> Unicode
decode([encoding], [errors]) method that interprets the 8-bit string using the given encoding
3 codecs.open(encoding="utf-8"): Read and write files directly to/from Unicode (you can use any encoding, not just utf-8, but utf-8 is most common).
4 u'': Makes your string literals into Unicode objects rather than byte sequences.
5 unicode(string[, encoding, errors])
Warning: Don’t use encode() on bytes or decode() on Unicode objects
And again: Software should only work with Unicode strings internally, converting to a particular encoding on output.
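For what it's worth, Python 3 enforces the warning above at the type level: text has no decode() and bytes has no encode(), so the mistake cannot be made silently. A short check, in Python 3 syntax:

```python
text = "Bitte überprüfen"
data = text.encode("utf-8")

# Each direction exists on exactly one of the two types.
text_can_decode = hasattr(text, "decode")   # str has no decode() in Python 3
data_can_encode = hasattr(data, "encode")   # bytes has no encode() in Python 3

print(text_can_decode, data_can_encode)
print(data.decode("utf-8") == text)
```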
Why ascii if I say utf-8?
Because in Python 2, "Bitte überprüfen" is not a Unicode string. Before it can be .encoded by your explicit call, Python must implicitly decode it to Unicode (This is also why it raises a UnicodeDecodeError), and it chooses ASCII because it has no other information to work with. The ü is represented with some byte with value >= 128, so it's not valid ASCII.
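The implicit step can be reproduced by doing it explicitly. The sketch below uses Python 3 syntax and builds the UTF-8 bytes of the formatted line, then decodes them as ASCII the way Python 2 does behind the scenes; the failure lands on the first byte of the first ü.

```python
# The bytes Python 2 holds for the formatted line (source saved as UTF-8).
formatted = '"no_internet" = "Bitte überprüfen";\n'.encode("utf-8")

try:
    formatted.decode("ascii")          # what Python 2 attempts implicitly
    position = None
except UnicodeDecodeError as exc:
    position = exc.start               # index of the first offending byte
    bad_byte = formatted[exc.start]

print(position, hex(bad_byte))
```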
The u prefix shown by @JuniorCompressor will make it a Unicode string, and you should specify the encoding for the file as well (don't just blindly set utf-8; it needs to match whatever your text editor saves the .py file with!).
Switching to Python 3 is realistically (part of) a better long-term solution :) but it is still essential to understand the problem. See http://bit.ly/unipain for more details. The Python 2 behaviour is really a bug, or at least a failure to meet Pythonic design principles: Explicit is better than implicit, and here we see why very clearly ;)
def _oauth_escape(val):
    if isinstance(val, unicode):  # useful ?
        val = val.encode("utf-8")  # useful ?
    return urllib.quote(val, safe="~")
I think it is not useful, is it?
Update: I think unicode is 'utf-8', yes?
utf-8 is an encoding, a recipe for concretely representing unicode data as a series of bytes. This is one of many such encodings. Python str objects are bytestrings, which can represent arbitrary binary data, such as text in a specific encoding.
Python's unicode type is an abstract, not-encoded way to represent text. unicode strings can be encoded in any of many encodings.
As others have said already, unicode and utf-8 are not the same. Utf-8 is one of many encodings for unicode.
Think of unicode objects as "unencoded" unicode strings, while string objects are encoded in a particular encoding (unfortunately, string objects don't have an attribute that tells you how they are encoded).
val.encode("utf-8") converts this unicode object into a UTF-8 encoded string object.
In Python 2.6, this is necessary, as urllib can't handle unicode properly.
>>> import urllib
>>> urllib.quote(u"")
''
>>> urllib.quote(u"ä")
/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib.py:1216: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
res = map(safe_map.__getitem__, s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib.py", line 1216, in quote
res = map(safe_map.__getitem__, s)
KeyError: u'\xe4'
>>> urllib.quote(u"ä".encode("utf-8"))
'%C3%A4'
In Python 3.x, however, where all strings are unicode (the Python 3 equivalent to an encoded string is a bytes object), it is not necessary anymore.
>>> import urllib.parse
>>> urllib.parse.quote("ä")
'%C3%A4'
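As a sketch, a Python 3 counterpart of the _oauth_escape helper from the question could look like the following. The name oauth_escape is hypothetical. urllib.parse.quote accepts both str (encoded as UTF-8 by default) and bytes, so no isinstance check is needed:

```python
from urllib.parse import quote

def oauth_escape(val):
    """Percent-encode val, leaving only unreserved characters unescaped."""
    # safe="~" replaces the default safe="/" so that slashes are escaped too.
    return quote(val, safe="~")

from_text = oauth_escape("ä")             # str input, encoded as UTF-8
from_bytes = oauth_escape(b"\xc3\xa4")    # already-encoded bytes input
mixed = oauth_escape("a b/c~")

print(from_text, from_bytes, mixed)
```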
In Python 3.0 all strings support Unicode, but with previous versions one has to explicitly encode strings to Unicode strings. Could that be it?
(utf-8 is not the only, but the most common encoding for Unicode. Read this.)