From the Python Pattern docs I see that a Unicode string can't be compared with a byte string, but why?
You can read the line here: https://github.com/python/cpython/blob/3.5/Lib/re.py
Python 3 introduced a somewhat controversial change where all Python strings are Unicode strings, and all byte strings need to have an encoding specified before they can be converted to Unicode strings.
This goes with the Python principle of "explicit is better than implicit", and removes a large number of potential bugs where implicit conversion would quietly produce wrong or corrupt results when the programmer was careless or unaware of the implications.
The flip side of this is now that it's hard to write code which mixes Unicode and byte strings unless you properly understand the model. (Well, it was hard before, too; but programmers who were oblivious remained so, and thought their code worked until someone tested it properly. Now they get errors up front.)
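Concretely, here is roughly what that looks like in a Python 3 session (a sketch, not output from the linked source):

>>> 'abc' == b'abc'   # str and bytes never compare equal in Python 3
False
>>> import re
>>> re.match('a', b'abc')   # mixing them raises instead of guessing
Traceback (most recent call last):
  ...
TypeError: cannot use a string pattern on a bytes-like object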
Briefly, quoting from the Stack Overflow character-encoding tag info page:
Just like changing the font from Arial to Wingdings changes what your text looks like, changing encodings affects the interpretation of a sequence of bytes. For example, depending on the encoding, the bytes 0xE2 0x89 0xA0 could represent the text â‰  in Windows code page 1252, or Б┴═ in KOI8-R, or the character ≠ in UTF-8.
Python 2 would do some unobvious stuff under the hood to coerce this byte string into a native string, which depending on context might involve the local system's "default encoding", and thus produce different results on different systems, creating some pretty hard bugs. Python 3 requires you to explicitly say how the bytes should be interpreted if you want to convert them into a string.
bytestr = b'\xE2\x89\xA0'
fugly = bytestr.decode('cp1252') # u'â‰\xa0' (â, ‰, no-break space)
cyril = bytestr.decode('koi8-r') # u'Б┴═'
wtf_8 = bytestr.decode('utf-8') # u'≠'
What function can I apply to a string variable that will cause the same result as prepending the b modifier to a string literal?
I've read in this question about the b modifier for string literals in Python 2 that prepending b to a string makes it a byte string (mainly for compatibility between Python 2 and Python 3 when using 2to3). The result I would like to obtain is the same, but applied to a variable, like so:
def is_binary_string_equal(string_variable):
    binary_string = b'this is binary'
    return convert_to_binary(string_variable) == binary_string

>>> is_binary_string_equal('this is binary')
[1] True
What is the correct definition of convert_to_binary?
First, note that in Python 2.x, the b prefix actually does nothing. b'foo' and 'foo' are both exactly the same string literal. The b only exists to allow you to write code that's compatible with both Python 2.x and Python 3.x: you can use b'foo' to mean "I want bytes in both versions", and u'foo' to mean "I want Unicode in both versions", and just plain 'foo' to mean "I want the default str type in both versions, even though that's Unicode in 3.x and bytes in 2.x".
So, "the functional equivalent of prepending the 'b' character to a string literal in Python 2" is literally doing nothing at all.
But let's assume that you actually have a Unicode string (like what you get out of a plain literal or a text file in Python 3, even though in Python 2 you can only get these by explicitly decoding, or using some function that does it for you, like opening a file with codecs.open). Because then it's an interesting question.
The short answer is: string_variable.encode(encoding).
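In terms of the question's convert_to_binary, that is (a minimal sketch, with the choice of encoding discussed below):

def convert_to_binary(string_variable, encoding='utf-8'):
    # Encode the Unicode string into bytes; picking utf-8 here is an
    # assumption, not something Python can figure out for you.
    return string_variable.encode(encoding)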
But before you can do that, you need to know what encoding you want. You don't need that with a literal string, because when you use the b prefix in your source code, Python knows what encoding you want: the same encoding as your source code file.* But everything other than your source code—files you open and read, input the user types, messages coming in over a socket—could be anything, and Python has no idea; you have to tell it.**
In many cases (especially if you're on a reasonably recent non-Windows machine and dealing with local data), it's safe to assume that the answer is UTF-8, so you can spell convert_to_binary(string_variable) as string_variable.encode('utf8'). But "many" isn't "all".*** This is why text editors and web browsers let the user select an encoding: because sometimes only the user actually knows.
* See PEP 263 for how you can specify the encoding, and why you'd want to.
** You can also use bytes(s, encoding), which is a synonym for s.encode(encoding). And, in both cases, you can leave off the encoding argument—but then it defaults to something which is more likely to be ASCII than what you actually wanted, so don't do that.
*** For example, many older network protocols are defined as Latin-1. Many Windows text files are created in whatever the OEM charset is set to—usually cp1252 on American systems, but there are hundreds of other possibilities. Sometimes sys.getdefaultencoding() or locale.getpreferredencoding() gets what you want, but that obviously doesn't work when, say, you're processing a file that someone uploaded that's in his machine's preferred encoding, not yours.
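If you're curious what those defaults are on your machine, a quick check (the output varies by platform and Python version):

import sys, locale
print(sys.getdefaultencoding())       # 'ascii' on Python 2, 'utf-8' on Python 3
print(locale.getpreferredencoding())  # e.g. 'UTF-8' on a typical modern Linux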
In the special case where the relevant encoding is "whatever this particular source file is in", you pretty much have to know that somehow out-of-band.* Once a script or module has been compiled and loaded, it's no longer possible to tell what encoding it was originally in.**
But there shouldn't be much reason to want that. After all, if two binary strings are equal, and in the same encoding, the Unicode strings are also equal, and vice-versa, so you could just write your code as:
def is_binary_string_equal(string_variable):
    binary_string = u'this is binary'
    return string_variable == binary_string
* The default is, of course, documented—it's UTF-8 for 3.0, ASCII or Latin-1 for 2.x depending on your version. But you can override that, as PEP 263 explains.
** Well, you could use the inspect module to find the source, then the importlib module to start processing it, etc.—but that only works if the file is still there and hasn't been edited since you last compiled it.
Note that in Python 3.7, executed on a Linux machine, it is not the same to use .encode('UTF-8') as to write b'string'.
It caused a lot of pain in a project of mine, and to this day I have no clear understanding of why it happens, but doing this in Python 3.7
print('\xAD\x43\x48\x49\x44\x44\x49\x4E\x47\x53\x54\x4F\x4E\x45'.encode('UTF-8'))
print(b'\xAD\x43\x48\x49\x44\x44\x49\x4E\x47\x53\x54\x4F\x4E\x45')
prints this to the console
b'\xc2\xadCHIDDINGSTONE'
b'\xadCHIDDINGSTONE'
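For what it's worth, the difference seems to come from what \xAD means in each kind of literal: in a str literal it names the code point U+00AD (soft hyphen), which UTF-8 encodes as two bytes, while in a bytes literal it is just the single byte 0xAD:

print('\xAD'.encode('UTF-8'))  # U+00AD encoded in UTF-8 -> b'\xc2\xad'
print(b'\xAD')                 # the raw byte 0xAD       -> b'\xad'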
I'm looking into new languages for a new project, kind of craving one where I no longer need to worry about charset problems, among the inordinate number of other niggles I have with PHP.
I tend to find Java too verbose and messy, and my not wanting to touch Windows with a 6-foot pole tends to rule out .Net. That leaves essentially everything else -- except PHP, C and C++ (the latter two of which I know get messy with unicode stuff irrespective of the ICU library).
I've short-listed a few languages to date, namely Ruby (loved the mixins), Python, Lisp and Javascript (node.js). However, I keep coming across highly inconsistent information on their unicode support, and I dread (lack of time...) having to learn each and every one of them to the point where I can safely break it and rule it out.
In so far as I understood, Python 3 seems to have it. As does Ruby 1.9. Lisp not necessarily. Javascript presumably.
There's arguably more than unicode support to a language, but in my experience it tends to become a major drawback when dealing with locale.
I also realize the question is somewhat subjective. (Please don't close it on those grounds: I'm actually linking to several SO threads which I found unsatisfying.) But... as a user of any of these languages, how well do they support unicode in practice?
Python's unicode support did not really change in 3.x. Its unicode support has been pretty much the same since Python 2.x, which introduced the separate unicode type and the encoding handling. What Python 3.x changes is that unicode becomes the only string type (and is renamed to str), whereas 2.x has bytestrings (str, "...") and unicode strings (unicode, u"...") which often, but not always, don't quite mix. (Allowing them to mix was an attempt to make transitioning from bytestrings to unicode easier, but it turned out to be a mistake.)

All in all, Python's unicode support is quite good, mistakes in Python 2.x notwithstanding. There are unicode literals with numeric and named escapes, source-encoding declarations for non-ASCII characters in unicode literals, automatic encoding/decoding through the codecs module, unicode support in many libraries (like the regular-expression and DB-API modules), and a built-in unicode database.
That said, you still need to know about encodings in order to handle text correctly. Your program will receive bytes in some encoding (be it from files, from environment variables or through other input) and they will need to be interpreted in that encoding. If you don't know the encoding (and can't determine it from the data, like in HTML or XML) you can really only process the data as bytes. If you do know the encoding, Python does allow you to deal with it mostly transparently.
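In Python 3 terms, that boundary looks something like this (the file name and encoding are assumptions for the example):

# If you know the encoding, decode at the boundary and work with str:
with open('data.txt', encoding='utf-8') as f:
    text = f.read()   # str (unicode), decoded from UTF-8 bytes

# If you don't know the encoding, you can really only handle raw bytes:
with open('data.txt', 'rb') as f:
    raw = f.read()    # bytes, no interpretation applied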
Perl has excellent support for unicode. You need to know how to use it properly, but I have never found a language with better unicode support than Perl, especially now with Perl 5.14.
Racket (in the Lisp/Scheme camp) has good Unicode support. Racket distinguishes character strings (written "abc") from byte strings (written #"abc"). Character strings consist of Unicode characters and have all the Unicode-aware string operations one would expect (comparison, case folding, etc). By default Racket uses UTF-8 for character string I/O (including the encoding of source files), but it also supports conversion to and from other encodings. The GUI toolkit works with Unicode. So do regular expressions.
From my personal experience, Ruby 1.9.2 handles unicode internally pretty well, except for some strange areas like the upcase/downcase/capitalize methods on the String class. I have to override them in all my Rails applications.
Lisps have strong support for unicode. All modern popular lisps (SBCL, Clozure CL, clisp) use UTF-32/UCS-4 for strings and support UTF-8 as an external format.
Ruby examples:
# encoding: UTF-8
puts RUBY_VERSION # => 1.9.2
def Σ(arr)
  arr.inject(:+)
end
Π = Math::PI
str = "abc日本def"
puts Σ [4,6,8,3] # => 21
puts Π # => 3.141592653589793
puts str.scan(/\p{Han}+/) # => 日本
p Encoding.name_list # not just utf8
#["ASCII-8BIT", "UTF-8", "US-ASCII", "Big5", "Big5-HKSCS", "Big5-UAO", "CP949", "Emacs-Mule", "EUC-JP", "EUC-KR", "EUC-TW", "GB18030", "GBK", "ISO-8859-1", "ISO-8859-2", "ISO-8859-3", "ISO-8859-4", "ISO-8859-5", "ISO-8859-6", "ISO-8859-7", "ISO-8859-8", "ISO-8859-9", "ISO-8859-10", "ISO-8859-11", "ISO-8859-13", "ISO-8859-14", "ISO-8859-15", "ISO-8859-16", "KOI8-R", "KOI8-U", "Shift_JIS", "UTF-16BE", "UTF-16LE", "UTF-32BE", "UTF-32LE", "Windows-1251", "BINARY", "IBM437", "CP437", "IBM737", "CP737", "IBM775", "CP775", "CP850", "IBM850", "IBM852", "CP852", "IBM855", "CP855", "IBM857", "CP857", "IBM860", "CP860", "IBM861", "CP861", "IBM862", "CP862", "IBM863", "CP863", "IBM864", "CP864", "IBM865", "CP865", "IBM866", "CP866", "IBM869", "CP869", "Windows-1258", "CP1258", "GB1988", "macCentEuro", "macCroatian", "macCyrillic", "macGreek", "macIceland", "macRoman", "macRomania", "macThai", "macTurkish", "macUkraine", "CP950", "CP951", "stateless-ISO-2022-JP", "eucJP", "eucJP-ms", "euc-jp-ms", "CP51932", "eucKR", "eucTW", "GB2312", "EUC-CN", "eucCN", "GB12345", "CP936", "ISO-2022-JP", "ISO2022-JP", "ISO-2022-JP-2", "ISO2022-JP2", "CP50220", "CP50221", "ISO8859-1", "Windows-1252", "CP1252", "ISO8859-2", "Windows-1250", "CP1250", "ISO8859-3", "ISO8859-4", "ISO8859-5", "ISO8859-6", "Windows-1256", "CP1256", "ISO8859-7", "Windows-1253", "CP1253", "ISO8859-8", "Windows-1255", "CP1255", "ISO8859-9", "Windows-1254", "CP1254", "ISO8859-10", "ISO8859-11", "TIS-620", "Windows-874", "CP874", "ISO8859-13", "Windows-1257", "CP1257", "ISO8859-14", "ISO8859-15", "ISO8859-16", "CP878", "SJIS", "Windows-31J", "CP932", "csWindows31J", "MacJapanese", "MacJapan", "ASCII", "ANSI_X3.4-1968", "646", "UTF-7", "CP65000", "CP65001", "UTF8-MAC", "UTF-8-MAC", "UTF-8-HFS", "UCS-2BE", "UCS-4BE", "UCS-4LE", "CP1251", "UTF8-DoCoMo", "SJIS-DoCoMo", "UTF8-KDDI", "SJIS-KDDI", "ISO-2022-JP-KDDI", "stateless-ISO-2022-JP-KDDI", "UTF8-SoftBank", "SJIS-SoftBank", "locale", "external", "filesystem", "internal"]
Indeed, capitalization is not supported for non-ASCII chars, and with reason.
To follow best practices for Unicode in python, you should prefix all string literals of characters with 'u'. Is there any tool available (preferably PyDev compatible) that warns if you forget it?
you should prefix all string literals with 'u'
No, not really.
You should prefix literals for strings of characters with u. But not all strings are strings of characters. When you are talking to components that are byte based, like network services, or binary files, you need to be using byte strings.
eg. Want to try to write a Unicode string into a PNG file? Not sensible. Want to base64-decode the string Y2Fm6Q==? You can't reasonably use a Unicode string here, base64 is explicitly bytes.
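To make that concrete, a sketch in Python 2 of why the base64 example has to be bytes (the latin-1 decode at the end is an assumption about what the decoded bytes contain):

import base64
data = base64.b64decode('Y2Fm6Q==')  # bytes: 'caf\xe9'
# Interpreting those bytes as text is a separate, explicit step:
text = data.decode('latin-1')        # u'caf\xe9', i.e. u'café'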
Sure, Python will often let you get away with passing a unicode string where a byte string is expected, but only by automatically encoding to ASCII. If the string contains non-ASCII characters, you're going to get a UnicodeError just as surely as if you'd used bytes where unicode was expected. “Unicode is right, bytes are wrong” is a damaging myth. Manipulation of both kinds of strings is required.
If you are concerned about the transition to Python 3, you should certainly mark up your character strings as u'', but you should then also mark up your explicitly-bytes strings as b''. Strings where it doesn't matter you can leave as '' and let them get converted from byte strings to unicode strings on Python 3. There are lots of cases where Python 2 used to use bytes and Python 3 uses Unicode where it is appropriate to do this. But there are still plenty of cases where you do really need to be talking bytes, and having that converted to Python 3 as unicode will cause problems.
(The only problem with this is that b'' syntax requires Python 2.6 or later, so using it will make you incompatible with earlier versions.)
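In code, that marking-up strategy looks like this (the literals are just illustrative):

# -*- coding: utf-8 -*-
text = u'café'           # definitely characters, on both 2.x and 3.x
blob = b'\x89PNG\r\n'    # definitely bytes, on both (needs 2.6+)
name = 'native'          # native str: bytes on 2.x, unicode on 3.x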
You might want to write such a warning-generator tool yourself by parsing Python source code using the parser or dis built-in modules. You may also consider adding such a feature to pylint.
KennyTM's comment should be posted as an answer:
from __future__ import unicode_literals
This future declaration can be used in Python 2.6 and 2.7 and enables Python 3's string syntax, so that unprefixed string literals are Unicode strings and byte strings require a b prefix.
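A quick demonstration of the effect (the types shown are for Python 2.6/2.7):

from __future__ import unicode_literals
print(type(''))    # <type 'unicode'> on Python 2 (would be str without the import)
print(type(b''))   # <type 'str'> on Python 2, i.e. a byte string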
I know that Django uses unicode strings all over the framework instead of normal Python strings. What encoding do normal Python strings use? And why don't they use unicode?
In Python 2: Normal strings (Python 2.x str) don't have an encoding: they are raw data.
In Python 3: These are called "bytes" which is an accurate description, as they are simply sequences of bytes, which can be text encoded in any encoding (several are common!) or non-textual data altogether.
For representing text, you want unicode strings, not byte strings. By "unicode strings", I mean unicode instances in Python 2 and str instances in Python 3. Unicode strings are sequences of unicode codepoints represented abstractly without an encoding; this is well-suited for representing text.
Bytestrings are important because to represent data for transmission over a network or writing to a file or whatever, you cannot have an abstract representation of unicode, you need a concrete representation of bytes. Though they are often used to store and represent text, this is at least a little naughty.
This whole situation is complicated by the fact that while you should turn unicode into bytes by calling encode and turn bytes into unicode using decode, Python will try to do this automagically for you using a global encoding you can set that is by default ASCII, which is the safest choice. Never depend on this for your code and never ever change this to a more flexible encoding--explicitly decode when you get a bytestring and encode if you need to send a string somewhere external.
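A short sketch of that discipline (Python 2 syntax, with made-up data):

raw = b'caf\xc3\xa9'                # bytes as they might arrive from a file or socket
text = raw.decode('utf-8')          # explicitly decode at the boundary: u'café'
out = text.upper().encode('utf-8')  # explicitly encode before sending it back out
print(repr(out))                    # 'CAF\xc3\x89'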
Hey! I'd like to add some stuff to other answers, unfortunately I don't have enough rep yet to do that properly :-(
FWIW, Mike Graham's post is pretty good and that's probably what you should be reading first.
Here's a few comments:
The need to prefix unicode literals with "u" in 2.x is pretty easily removed in recent (2.6+) 2.x Pythons: from __future__ import unicode_literals
Similarly, ASCII is only the default source encoding. Python understands a variety of coding hints including the emacs-style # -*- coding: utf-8 -*-. For more information see PEP 0263. Changing the source encoding affects how Unicode literals (regardless of their prefix or lack of prefix, as affected by point 1) are interpreted. In Py3k, the default file encoding is UTF-8.
Python of course does use an encoding internally for Unicode strings (str in py3k, unicode in 2.x) because at some point in time stuff's going to have to be written to memory. Ideally, this would never be evident to the end-user. Unfortunately nothing's perfect, and you can occasionally run into problems with this: specifically if you use funky squiggles outside of the Unicode Basic Multilingual Plane.

Since Python 2.2, we've had what are called wide builds and narrow builds; these names refer to the type used internally to store Unicode code points. Wide builds use UCS-4, which uses 4 bytes to store a Unicode code point. (This means UCS-4's code unit size is 4 bytes, or 32 bits.) Narrow builds use UCS-2. UCS-2 only has 16 bits, and therefore cannot encode all Unicode code points accurately (it's like UTF-16, except without the surrogate pairs). To check, test the value of sys.maxunicode. If it's 1114111, you've got a wide build (which can correctly represent all of Unicode). If it's less, well, don't fret too much. The BMP (code points 0x0000 to 0xFFFF) covers most people's needs. For more information, see PEP 0261.
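The check described there, as a snippet:

import sys
if sys.maxunicode == 1114111:    # 0x10FFFF: wide build, UCS-4 storage
    print('wide build: all of Unicode representable')
else:                            # 65535 (0xFFFF): narrow build, UCS-2 storage
    print('narrow build: BMP only')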
what encoding do normal python strings use?
In Python 3.x
str is Unicode. This may be either UTF-16 or UTF-32 depending on whether your Python interpreter was built with "narrow" or "wide" Unicode characters.
The Windows version of CPython uses UTF-16. On Unix-like systems, UTF-32 tends to be preferred.
In Python 2.x
str is a byte string type like C char. The encoding isn't defined by the language, but is whatever your locale's default encoding is. Or whatever the MIME charset of the document you got off the Internet is. Or, if you get a string from a function like struct.pack, it's binary data, and doesn't meaningfully have a character encoding at all.
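For instance (Python 2, where struct.pack returns a str of raw bytes):

import struct
blob = struct.pack('>I', 1234)   # 4 raw bytes: '\x00\x00\x04\xd2'
# Asking what "characters" these are is meaningless; it's binary data.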
unicode strings in 2.x are equivalent to str in 3.x.
and why don't they use unicode?
Because Python (slightly) predates Unicode. And because Guido wanted to save all the major backwards-incompatible changes for 3.0. Strings in 3.x do use Unicode by default.
From Python 3.0 on, all strings are unicode by default; there is also the bytes datatype (Python documentation).
So the Python developers think that using unicode is a good idea; that it is not used universally in Python 2 is mostly due to backwards compatibility. It also has performance implications.
Python 2.x strings are 8-bit, nothing more. The encoding may vary (though ASCII is assumed). I guess the reasons are historical. Few languages, especially languages that date back to the last century, use unicode right away.
In Python 3, all strings are unicode.
Before Python 3.0, string encoding was ASCII by default, but could be changed. Unicode string literals were u"...". This was silly.