What encoding do normal Python strings use?

I know that Django uses unicode strings all over the framework instead of normal Python strings. What encoding do normal Python strings use? And why don't they use unicode?

In Python 2: Normal strings (Python 2.x str) don't have an encoding: they are raw data.
In Python 3: These are called "bytes" which is an accurate description, as they are simply sequences of bytes, which can be text encoded in any encoding (several are common!) or non-textual data altogether.
For representing text, you want unicode strings, not byte strings. By "unicode strings", I mean unicode instances in Python 2 and str instances in Python 3. Unicode strings are sequences of unicode codepoints represented abstractly without an encoding; this is well-suited for representing text.
Bytestrings are important because to represent data for transmission over a network or writing to a file or whatever, you cannot have an abstract representation of unicode, you need a concrete representation of bytes. Though they are often used to store and represent text, this is at least a little naughty.
This whole situation is complicated by the fact that while you should turn unicode into bytes by calling encode and turn bytes into unicode using decode, Python will try to do this automagically for you using a global encoding you can set that is by default ASCII, which is the safest choice. Never depend on this for your code and never ever change this to a more flexible encoding--explicitly decode when you get a bytestring and encode if you need to send a string somewhere external.
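A minimal Python 2 sketch of that explicit discipline (the byte values are just an example):

raw = "caf\xc3\xa9"             # bytes as received from a file or socket
text = raw.decode("utf-8")      # explicitly decode: bytes -> unicode
assert text == u"caf\xe9"
out = text.encode("utf-8")      # explicitly encode before sending data out
assert out == raw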

Hey! I'd like to add some stuff to other answers; unfortunately, I don't have enough rep yet to do that properly :-(
FWIW, Mike Graham's post is pretty good and that's probably what you should be reading first.
Here's a few comments:
The need to prefix unicode literals with "u" in 2.x is pretty easily removed in recent (2.6+) 2.x Pythons with from __future__ import unicode_literals.
Similarly, ASCII is only the default source encoding. Python understands a variety of coding hints including the emacs-style # -*- coding: utf-8 -*-. For more information see PEP 0263. Changing the source encoding affects how Unicode literals (regardless of their prefix or lack of prefix, as affected by point 1) are interpreted. In Py3k, the default file encoding is UTF-8.
Python of course does use an encoding internally for Unicode strings (str in py3k, unicode in 2.x) because at some point the data has to be written to memory. Ideally, this would never be evident to the end-user. Unfortunately nothing's perfect, and you can occasionally run into problems with this: specifically if you use funky squiggles outside of the Unicode Basic Multilingual Plane.

Since Python 2.2, we've had what are called wide builds and narrow builds; these names refer to the type used internally to store Unicode code points. Wide builds use UCS-4, which uses 4 bytes to store a Unicode code point. (This means UCS-4's code unit size is 4 bytes, or 32 bits.) Narrow builds use UCS-2. UCS-2 only has 16 bits, and therefore cannot encode all Unicode code points accurately (it's like UTF-16, except without the surrogate pairs).

To check, test the value of sys.maxunicode. If it's 1114111, you've got a wide build (which can correctly represent all of Unicode). If it's less, well, don't fret too much. The BMP (code points 0x0000 to 0xFFFF) covers most people's needs. For more information, see PEP 0261.
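A small sketch of that check (the printed labels are mine):

import sys

if sys.maxunicode == 1114111:    # 0x10FFFF: every Unicode code point fits
    print "wide build (UCS-4)"
else:                            # 65535 (0xFFFF) on narrow builds
    print "narrow build (UCS-2)"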

what encoding do normal python strings use?
In Python 3.x
str is Unicode. In CPython releases before 3.3, it is stored internally as either UTF-16 or UTF-32, depending on whether your Python interpreter was built with "narrow" or "wide" Unicode characters.
The Windows version of CPython uses a narrow (UTF-16) build. On Unix-like systems, wide (UTF-32) builds tend to be preferred. (CPython 3.3's PEP 393 later replaced this with a per-string flexible representation.)
In Python 2.x
str is a byte string type like C char. The encoding isn't defined by the language, but is whatever your locale's default encoding is. Or whatever the MIME charset of the document you got off the Internet is. Or, if you get a string from a function like struct.pack, it's binary data, and doesn't meaningfully have a character encoding at all.
unicode strings in 2.x are equivalent to str in 3.x.
and why don't they use unicode?
Because Python (slightly) predates Unicode. And because Guido wanted to save all the major backwards-incompatible changes for 3.0. Strings in 3.x do use Unicode by default.
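A short Python 2 sketch contrasting the two types described above (the values are chosen for illustration):

import struct

b = "caf\xc3\xa9"                 # str: raw bytes (here, UTF-8-encoded text)
u = u"caf\xe9"                    # unicode: abstract code points

print type(b), len(b)             # <type 'str'> 5 (five bytes)
print type(u), len(u)             # <type 'unicode'> 4 (four code points)
print b.decode("utf-8") == u      # True

packed = struct.pack(">I", 42)    # a str holding binary data, not text
print repr(packed)                # '\x00\x00\x00*'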

From Python 3.0 on, all strings are unicode by default; there is also the bytes datatype (see the Python documentation).
So the Python developers consider using unicode a good idea; that it is not used universally in Python 2 is mostly due to backwards compatibility. It also has performance implications.

Python 2.x strings are 8-bit, nothing more. The encoding may vary (though ASCII is assumed). I guess the reasons are historical: few languages, especially languages that date back to the last century, used Unicode from the start.
In Python 3, all strings are unicode.

Before Python 3.0, the default string type was a byte string, and the default encoding for implicit conversions was ASCII (though it could be changed). Unicode string literals were written u"...". This was silly.


Python 2.x Strings: Unicode vs. Bytes

I deal with languages that are non-US and also sometimes still have to write in Python 2.x. Reading this article: http://www.snarky.ca/why-python-3-exists by Brett Cannon makes me wonder whether that implies that, if I use strings that are only characters and not bytes, I should prepend all my strings with u, to avoid a potential mix-up between byte strings and unicode strings? And: does this also apply to Jython?
And one last question: -*- coding: utf-8 -*- is completely independent of the above, declaring only the encoding of the file itself - correct?
Yes, you want to keep text in unicode objects (the str type in Python 3), and maintain a Unicode sandwich (decode incoming data as soon as possible, postpone encoding until the data needs to exit your application). See Ned Batchelder's excellent Unicode presentation.
This also applies to Jython, which is just another implementation of the Python language.
The PEP 263 source code encoding declaration tells the interpreter what codec to use when decoding the bytes in your source code. It helps when defining Unicode literals with non-ASCII bytes, but doesn't dictate how data other than the source code is encoded or decoded.
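A minimal Unicode-sandwich sketch in Python 2.6+ (the file names are hypothetical):

import io

with io.open("incoming.txt", encoding="utf-8") as f:
    text = f.read()              # decode at the boundary: unicode inside

text = text.upper()              # do all internal processing on unicode

with io.open("outgoing.txt", "w", encoding="utf-8") as f:
    f.write(text)                # encode only on the way out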

How well does your language support unicode in practice?

I'm looking into new languages, kind of craving for one where I no longer need to worry about charset problems amongst inordinate amounts of other niggles I have with PHP for a new project.
I tend to find Java too verbose and messy, and my not wanting to touch Windows with a 6-foot pole tends to rule out .Net. That leaves essentially everything else -- except PHP, C and C++ (the latter two of which I know get messy with unicode stuff irrespective of the ICU library).
I've shortlisted a few languages to date, namely Ruby (loved the mixins), Python, Lisp and Javascript (node.js). However, I'm coming across highly inconsistent information on unicode support, and I'm dreading (lack of time...) having to learn each and every one of them to the point where I can safely rule it out.
In so far as I understood, Python 3 seems to have it. As does Ruby 1.9. Lisp not necessarily. Javascript presumably.
There's arguably more than unicode support to a language, but in my experience it tends to become a major drawback when dealing with locale.
I also realize the question is somewhat subjective. (Please don't close it on those grounds: I'm actually linking to several SO threads which I found unsatisfying.) But... as a user of any of these languages, how well do they support unicode in practice?
Python's unicode support did not really change in 3.x. The unicode support in Python has been pretty much the same since Python 2.x, which introduced the separate unicode type and the encoding handling. What Python 3.x changes is that unicode becomes the only string type (and is renamed to str), whereas 2.x has bytestrings (str, "...") and unicode strings (unicode, u"...") that sometimes mix transparently and sometimes don't. (Allowing them to mix was an attempt to make the transition from bytestrings to unicode easier, but it turned out to be a mistake.) All in all, Python's unicode support is quite good, mistakes in Python 2.x notwithstanding. There are unicode literals with numeric and named escapes, source-encoding declarations for non-ASCII characters in unicode literals, automatic encoding/decoding through the codecs module, unicode support in many libraries (like the regular expression and DB-API modules) and a builtin unicode database.
That said, you still need to know about encodings in order to handle text correctly. Your program will receive bytes in some encoding (be it from files, from environment variables or through other input) and they will need to be interpreted in that encoding. If you don't know the encoding (and can't determine it from the data, like in HTML or XML) you can really only process the data as bytes. If you do know the encoding, Python does allow you to deal with it mostly transparently.
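A small Python 2 sketch of the mixing behaviour described above (the byte values are just an example):

print u"abc" + "def"            # works: "def" is silently decoded as ASCII
try:
    u"abc" + "caf\xc3\xa9"      # non-ASCII bytes: the implicit decode fails
except UnicodeDecodeError as e:
    print "mixing failed:", e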
Perl has excellent support for unicode. You need to know how to use it properly, but I have never found a language with better unicode support than Perl, especially now with Perl 5.14.
Racket (in the Lisp/Scheme camp) has good Unicode support. Racket distinguishes character strings (written "abc") from byte strings (written #"abc"). Character strings consist of Unicode characters and have all the Unicode-aware string operations one would expect (comparison, case folding, etc). By default Racket uses UTF-8 for character string I/O (including the encoding of source files), but it also supports conversion to and from other encodings. The GUI toolkit works with Unicode. So do regular expressions.
From my personal experience, Ruby 1.9.2 handles unicode internally pretty well, except for some strange areas like the upcase/downcase/capitalize methods of the String class. I have to override them for all my Rails applications.
Lisps have strong support for unicode. All modern popular lisps (SBCL, Clozure CL, clisp) use UTF-32/UCS-4 for strings and support UTF-8 as an external format.
Ruby examples :
# encoding: UTF-8
puts RUBY_VERSION # => 1.9.2
def Σ(arr)
  arr.inject(:+)
end
Π = Math::PI
str = "abc日本def"
puts Σ [4,6,8,3] # => 21
puts Π # => 3.141592653589793
puts str.scan(/\p{Han}+/) # => 日本
p Encoding.name_list # not just utf8
#["ASCII-8BIT", "UTF-8", "US-ASCII", "Big5", "Big5-HKSCS", "Big5-UAO", "CP949", "Emacs-Mule", "EUC-JP", "EUC-KR", "EUC-TW", "GB18030", "GBK", "ISO-8859-1", "ISO-8859-2", "ISO-8859-3", "ISO-8859-4", "ISO-8859-5", "ISO-8859-6", "ISO-8859-7", "ISO-8859-8", "ISO-8859-9", "ISO-8859-10", "ISO-8859-11", "ISO-8859-13", "ISO-8859-14", "ISO-8859-15", "ISO-8859-16", "KOI8-R", "KOI8-U", "Shift_JIS", "UTF-16BE", "UTF-16LE", "UTF-32BE", "UTF-32LE", "Windows-1251", "BINARY", "IBM437", "CP437", "IBM737", "CP737", "IBM775", "CP775", "CP850", "IBM850", "IBM852", "CP852", "IBM855", "CP855", "IBM857", "CP857", "IBM860", "CP860", "IBM861", "CP861", "IBM862", "CP862", "IBM863", "CP863", "IBM864", "CP864", "IBM865", "CP865", "IBM866", "CP866", "IBM869", "CP869", "Windows-1258", "CP1258", "GB1988", "macCentEuro", "macCroatian", "macCyrillic", "macGreek", "macIceland", "macRoman", "macRomania", "macThai", "macTurkish", "macUkraine", "CP950", "CP951", "stateless-ISO-2022-JP", "eucJP", "eucJP-ms", "euc-jp-ms", "CP51932", "eucKR", "eucTW", "GB2312", "EUC-CN", "eucCN", "GB12345", "CP936", "ISO-2022-JP", "ISO2022-JP", "ISO-2022-JP-2", "ISO2022-JP2", "CP50220", "CP50221", "ISO8859-1", "Windows-1252", "CP1252", "ISO8859-2", "Windows-1250", "CP1250", "ISO8859-3", "ISO8859-4", "ISO8859-5", "ISO8859-6", "Windows-1256", "CP1256", "ISO8859-7", "Windows-1253", "CP1253", "ISO8859-8", "Windows-1255", "CP1255", "ISO8859-9", "Windows-1254", "CP1254", "ISO8859-10", "ISO8859-11", "TIS-620", "Windows-874", "CP874", "ISO8859-13", "Windows-1257", "CP1257", "ISO8859-14", "ISO8859-15", "ISO8859-16", "CP878", "SJIS", "Windows-31J", "CP932", "csWindows31J", "MacJapanese", "MacJapan", "ASCII", "ANSI_X3.4-1968", "646", "UTF-7", "CP65000", "CP65001", "UTF8-MAC", "UTF-8-MAC", "UTF-8-HFS", "UCS-2BE", "UCS-4BE", "UCS-4LE", "CP1251", "UTF8-DoCoMo", "SJIS-DoCoMo", "UTF8-KDDI", "SJIS-KDDI", "ISO-2022-JP-KDDI", "stateless-ISO-2022-JP-KDDI", "UTF8-SoftBank", "SJIS-SoftBank", "locale", "external", "filesystem", "internal"]
Indeed, capitalization is not supported for non-ASCII chars, for good reason.

How to move a python 2.6 project to UTF-8?

We are moving from latin1 to UTF-8 and have 100k lines of python code.
Plus I'm new in python (ha-ha-ha!).
I already know that str() fails when it receives Unicode containing non-ASCII characters, so we should use unicode() instead, which has almost the same effect.
What are the other "dangerous" places of code?
Are there any basic guidelines/algorithms for moving to UTF-8? Could an automatic 'code transformer' be written?
str and unicode are classes, not functions. When you call str(u'abcd') you are constructing a new str instance from the argument u'abcd'. It just so happens that str() can be used to convert a string of any type to an ASCII str (and it fails if the input contains characters outside ASCII).
Other areas to look out for are when reading from a file/input, or basically anything you get back as a string from a function that was not written for unicode.
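For example, a sketch of reading a file as text rather than raw bytes, using the stdlib codecs module (the file name is hypothetical):

import codecs

with codecs.open("data.txt", "r", encoding="utf-8") as f:
    text = f.read()              # a unicode object, already decoded

print type(text)                 # <type 'unicode'>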
Enjoy :)
Could an automatic 'code transformer' be written? =)
No. str and unicode are two different types which have different purposes. You should not attempt to replace every occurrence of a byte string with a Unicode string, neither in Python 2 nor Python 3.
Continue to use byte strings for binary data. In particular anything you're writing to a file or network socket is bytes. And use Unicode strings for user-facing text.
In between there is a grey area of internal ASCII-character strings which could equally be bytes or Unicode. In Python 2 these are typically bytes, in Python 3 typically Unicode. If you are happy to limit your code to Python 2.6+, you can mark your definitely-bytes strings as b'' and bytes, your definitely-characters strings as u'' and unicode, and use '' and str for the “whatever the default type of string is” strings.
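A sketch of that convention under Python 2.6+ (the values are illustrative):

data = b"\x89PNG\r\n\x1a\n"      # definitely bytes: a binary file signature
label = u"caf\xe9"               # definitely characters: user-facing text
key = "content-type"             # ASCII-only: fine as the default str type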
One way to quickly convert Python 2.x to use UTF-8 throughout is to change the global default encoding. This approach has its downsides, primarily that it changes the encoding for all libraries as well as your application, so use it with caution. My company uses that technique in our production apps and it suits us well. It's also forward-compatible with Python 3, where text is Unicode by default. You'll still have to change references of str() to unicode(), but you won't have to explicitly specify the encoding with .decode() and .encode().
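For reference, that override is usually done with the well-known hack below; note that the first answer in this document advises strongly against it:

import sys
reload(sys)                          # re-exposes setdefaultencoding,
sys.setdefaultencoding("utf-8")      # which site.py removes at startup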

Get warning for python string literals not prefixed with 'u'

To follow best practices for Unicode in python, you should prefix all string literals of characters with 'u'. Is there any tool available (preferably PyDev compatible) that warns if you forget it?
you should prefix all string literals with 'u'
No, not really.
You should prefix literals for strings of characters with u. But not all strings are strings of characters. When you are talking to components that are byte based, like network services, or binary files, you need to be using byte strings.
eg. Want to try to write a Unicode string into a PNG file? Not sensible. Want to base64-decode the string Y2Fm6Q==? You can't reasonably use a Unicode string here, base64 is explicitly bytes.
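A quick Python 2 sketch of that base64 point (the string is the one from the paragraph above):

import base64

raw = base64.b64decode("Y2Fm6Q==")   # yields the bytes 'caf\xe9'
print repr(raw)                      # 'caf\xe9'
print raw.decode("latin-1")          # it only becomes text once we pick a codec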
Sure, Python will often let you get away with passing a unicode string where a byte string is expected, but only by automatically encoding it as ASCII. If the string contains non-ASCII characters, you're going to get a UnicodeError just as surely as if you'd used bytes where unicode was expected. “Unicode is right, bytes are wrong” is a damaging myth. You need to be able to manipulate both kinds of strings.
If you are concerned about the transition to Python 3, you should certainly mark up your character strings as u'', but you should then also mark up your explicitly-bytes strings as b''. Strings where it doesn't matter you can leave as '' and let them get converted from byte strings to unicode strings on Python 3. There are lots of cases where Python 2 used bytes, Python 3 uses Unicode, and letting the literal change type like this is appropriate. But there are still plenty of cases where you really do need to be talking bytes, and having those literals become unicode under Python 3 will cause problems.
(The only problem with this is that b'' syntax requires Python 2.6 or later, so using it will make you incompatible with earlier versions.)
You might want to write such a warning-generator tool yourself by parsing Python source code using the parser or dis built-in modules. You may also consider adding such a feature to pylint.
KennyTM's comment should be posted as an answer:
from __future__ import unicode_literals
This future declaration can be used in Python 2.6 and 2.7 and enables Python 3's string syntax, so that unprefixed string literals are Unicode strings and byte-string literals require a b prefix.
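A brief sketch of its effect:

from __future__ import unicode_literals

s = "abc"                # now a unicode string, as in Python 3
b = b"abc"               # byte strings need an explicit b prefix
print type(s)            # <type 'unicode'>
print type(b)            # <type 'str'>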

Python: what does "...".encode("utf8") fix?

I wanted to url encode a python string and got exceptions with hebrew strings.
I couldn't fix it and started doing some guess oriented programming.
Finally, doing mystr = mystr.encode("utf8") before sending it to the url encoder saved the day.
Can somebody explain what happened? What does .encode("utf8") do? My original string was a unicode string anyways (i.e. prefixed by a u).
My original string was a unicode string anyways (i.e. prefixed by a u)
...which is the problem. It wasn't a "string", as such, but a "Unicode object". It contains a sequence of Unicode code points. These code points must, of course, have some internal representation that Python knows about, but whatever that is is abstracted away and they're shown as those \uXXXX entities when you print repr(my_u_str).
To get a sequence of bytes that another program can understand, you need to take that sequence of Unicode code points and encode it. You need to decide on the encoding, because there are plenty to choose from. UTF8 and UTF16 are common choices. ASCII could be too, if it fits. u"abc".encode('ascii') works just fine.
Do my_u_str = u"\u2119ython" and then type(my_u_str) and type(my_u_str.encode('utf8')) to see the difference in types: The first is <type 'unicode'> and the second is <type 'str'>. (Under Python 2.5 and 2.6, anyway).
Things are different in Python 3, but since I rarely use it I'd be talking out of my hat if I tried to say anything authoritative about it.
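Concretely, the session suggested above looks like this in Python 2:

my_u_str = u"\u2119ython"              # DOUBLE-STRUCK CAPITAL P + "ython"
print type(my_u_str)                   # <type 'unicode'>
print type(my_u_str.encode('utf8'))    # <type 'str'>
print repr(my_u_str)                   # u'\u2119ython'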
Your original string was a unicode object containing raw Unicode code points; after encoding it as UTF-8 it is a normal byte string that contains UTF-8 encoded data.
The URL encoder seems to expect a byte string, so that it can URL-encode one byte after another and doesn't have to deal with Unicode code points. When you give it a unicode object, it tries to convert it to a byte string using some default encoding, probably ASCII. For Hebrew characters that cannot be represented as ASCII, this will lead to errors.
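A minimal Python 2 sketch of the fix (the Hebrew word is just an example):

import urllib

text = u"\u05e9\u05dc\u05d5\u05dd"        # Hebrew "shalom" as a unicode object
print urllib.quote(text.encode("utf-8"))  # %D7%A9%D7%9C%D7%95%D7%9D
# passing the unicode object directly fails on the non-ASCII characters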
What does .encode("utf8") do?
It depends on which version of Python you're using:
In Python 3.x, it converts a str object (a sequence of Unicode code points; how those are stored internally is an implementation detail) into a bytes object containing the UTF-8 representation of the string.
In Python 2.x, it converts a unicode object into a str object encoded in UTF-8. But str has an encode method too, and writing '...'.encode('UTF-8') is equivalent to writing '...'.decode('ascii').encode('UTF-8').
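A small sketch of that Python 2 trap (the byte values are just an example):

s = "caf\xc3\xa9"           # a byte string already holding UTF-8 data
try:
    s.encode("utf-8")       # implicitly runs s.decode('ascii') first...
except UnicodeDecodeError:
    print "str.encode decoded as ASCII first and failed"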
Since you mentioned the "u" prefix, you must be using 2.x. If you don't require any 2.x-only libraries, I'd recommend switching to 3.x, which has a nice clear distinction between text and binary data.
Dive into Python 3 has a good explanation of the issue.
Can somebody explain what happened?
It would help if you told us what the error message was.
The urllib.quote function expects a str object. It also happens to work with unicode objects that contain only ASCII characters, but not when they contain Hebrew letters.
In Python 3.x, urllib.parse.quote accepts both str (=Python 2.x unicode) and bytes objects. Strings are automatically encoded in UTF-8.
"...".encode("utf-8") transforms the in-memory representation of the string into an UTF-8 -encoded string.
url encoder likely expected a bytestring, that is, string representation where each character is represented with a single byte.
It returns a UTF-8 encoded version of the Unicode string, mystr. It is important to realize that UTF-8 is simply one way of encoding Unicode. Python can work with many other encodings (e.g. mystr.encode("utf-32") or even mystr.encode("ascii")).
The link that balpha posted explains it all. In short:
The fact that your string was prefixed with "u" just means it's composed of Unicode characters (or code points). UTF-8 is an encoding of this string into a sequence of bytes.
