When I read this official Python 2 page on Unicode, it says
Under the hood, Python represents Unicode strings as either 16- or 32-bit integers, depending on how the Python interpreter was compiled.
What does the above sentence mean? Could it mean that Python 2 has its own special encodings of Unicode? If so, why not just use UTF-8?
This statement simply means that there is underlying C code that supports both of these representations and that, depending on the circumstances, one or the other is chosen. Those circumstances are typically user choice, compiler, and operating system.
Now, for the possible rationale for that, there are reasons not to use UTF-8:
First and foremost, indexing into a UTF-8 string is O(n) in complexity, while it is O(1) for UTF-32/UCS4. While that is irrelevant for streamed data and UTF-8 can actually save space for transmission or storage, in-memory handling is more convenient with one character per Unicode codepoint.
Secondly, using one character per codepoint translates very well to the API that Python itself provides in its language, so this is a natural choice.
On MS Windows platforms, the native encoding for UI and filesystem is UTF-16, so using that encoding provides seamless integration with that platform.
On some compilers, wchar_t is actually a 16-bit type, so if you wanted to use a 32-bit type there you would have to reimplement all kinds of functions for your self-invented character type. Dropping support for anything above the Unicode BMP, or leaking surrogate sequences into the Python API, is a reasonable compromise then (but one that unfortunately sticks around).
Note that those are possible reasons, I don't claim that these apply to Python's implementation.
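As a hedged aside (not something the Python docs spell out this way), you can check from within Python 2 which variant your interpreter uses; the astral character below is just an arbitrary example:

import sys

# Narrow builds store characters above the BMP as surrogate pairs.
if sys.maxunicode == 0xFFFF:
    print("narrow build: 16-bit code units, surrogates leak into the API")
else:
    print("wide build (UCS-4): one code unit per code point")

s = u"\U0001F600"   # an arbitrary astral (non-BMP) character
print(len(s))       # 2 on a narrow build, 1 on a wide build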
..I find that when coding a very "math-heavy" function in Python, one way of making the code more readable (to me at least) is to follow the literature as tightly as possible, including using Greek letters such as Ω, θ, φ, ψ, ω, ϕ, etc.
And the code works as expected, but PyCharm highlights those characters and says "non-ASCII characters in identifier".
..without referencing any PEP (Python coding standard).
I'm asking for sober arguments as to why I should refrain from using non-ASCII characters in identifiers.
Because they're not on a standard QWERTY keyboard and would be a pain in the arse to keep typing out in your code.
You're better off just referring to them by their ASCII equivalents (the string "alpha", etc.) and letting IntelliSense in modern IDEs handle the typing.
If you're looking for a PEP standard, you probably want PEP 3131:
Should identifiers be allowed to contain any Unicode letter?
Drawbacks of allowing non-ASCII identifiers wholesale:
Python will lose the ability to make a reliable round trip to a human-readable display on screen or on paper.
Python will become vulnerable to a new class of security exploits; code and submitted patches will be much harder to inspect.
Humans will no longer be able to validate Python syntax.
Unicode is young; its problems are not yet well understood and solved; tool support is weak.
Languages with non-ASCII identifiers use different character sets and normalization schemes; PEP 3131's choices are non-obvious.
The Unicode bidi algorithm yields an extremely confusing display order for RTL text when digits or operators are nearby.
Of course, PEP 3131 has been accepted, so the warning is PyCharm's decision, not the PEP's.
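For completeness, here is a tiny, purely illustrative Python 3 snippet showing that PEP 3131 does make such identifiers legal (the variable names and formula are made up for the example):

# -*- coding: utf-8 -*-
import math

θ = math.pi / 4    # angle, named as in the literature
ω = 2.0            # angular frequency
φ = ω * θ          # all legal identifiers under PEP 3131
print(math.sin(φ))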
Is there a systematic way to run Python 3.x with all strings defaulting to bytes? I am finding that when "crossing boundaries" for example talking to msgpack, Elixir, or ZeroMQ, I'm having to do all sorts of contortions constantly figuring out whether strings or bytes will be returned. It's a complete pain and adds a layer of cognitive friction over and above my problem.
For example I have
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--nodename")
args = parser.parse_args()
and then to get the nodename I need to do
str(args.nodename)
However zeroMQ wants bytes, and I'm going to use the nodename everywhere I use zeroMQ. So I make it bytes up front with
nodename = args.nodename.encode()
But now every time I want to use it with a string, say for concatenation, I cannot do so because I have to encode the string first. And half the libraries take perfectly good bytes data type and return them to you as strings, at which time you have to convert them back again to bytes if you want to send them outside Python. For a "glue language" this is a total disaster. I'm having to do this encode decode dance whenever I cross the boundary, and the worst is that it does not seem consistent across libraries whether they co-opt you to strings or bytes if you send them bytes.
In Python 3, is there an option to forgo Unicode-by-default (since it does after all say "by default", suggesting it can be changed), or is the answer "stick with 2.7"?
In short, no. And you really don't want to try. You mention contortions but don't give specific examples, so it's hard to offer specific advice.
Neither, in this author's humble opinion, do you want to stick with Python 2.7, but if you don't need bugfixes and language updates after 2020 it won't matter.
The point is precisely that all translation between bytes and text should take place at the boundaries of your code. Decode (from whatever external representation is used) on input, encode (to whatever encoding you wish or need to use) on output. Python 3 is written to enforce this distinction, but understanding the separation should give you proper control and reduce your frustrations.
In Python 3, opening a file in text mode causes readline and friends to produce Unicode strings. You can specify the encoding when you open the file if you wish. Opening a file in binary mode causes them to produce bytestrings, to which you will have to apply your own decoding to make sense of them as text.
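A minimal sketch of that difference, assuming a hypothetical file data.txt and UTF-8 as the encoding:

# Text mode: readline() hands you str, decoded for you.
with open("data.txt", encoding="utf-8") as f:
    line = f.readline()        # str

# Binary mode: readline() hands you bytes; decoding is your job.
with open("data.txt", "rb") as f:
    raw = f.readline()         # bytes
    text = raw.decode("utf-8")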
Whether the Python API for a particular system returns bytes or text is up to its author, and calling Python 3 functions that expect strings with bytestring arguments is likely to lead to confusion and unhappiness. All external communications (network, files, etc.) must necessarily take place in terms of bytestrings, so be clear what is text (decoding on input and encoding on output) and deal with the outside world exclusively in bytestrings.
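Since you mention ZeroMQ, here is a minimal sketch of that boundary pattern with pyzmq; the endpoint and the choice of UTF-8 as the wire encoding are assumptions for the example:

import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.PUSH)
sock.connect("tcp://localhost:5555")   # hypothetical endpoint

nodename = "node-1"                    # keep text as str inside the program
sock.send(nodename.encode("utf-8"))    # encode only at the boundary

# On the receiving side, decode immediately after recv():
# incoming = sock.recv().decode("utf-8")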
There are always, of course, difficult corner cases. I don't envy the maintainers of the email package, who have to deal with messages containing 6-bit encoded bytestreams themselves potentially containing attachments in multiple different encodings. But then I don't usually have to work in such complex environments, and hopefully neither do you.
As you might know, the Python 2 stdlib csv module doesn't "properly" support unicode. It expects binary strings, which it writes to the file as it gets them.
To me this always seemed a bit counter-intuitive, as I would tell people to internally work with unicode strings and properly serialize things for the external world by opening files with codecs.open(..., encoding='...'), but in the csv module's case you need to do this manually for the lists/dicts you pass in.
It always puzzled me why this is, and now that a colleague is asking me again, I have to admit that I don't know any reason for it other than "probably grown and never fixed".
It seems that even PEP 305 already contained TODOs for unicode and references to codecs.open.
Is there some wise python guru here who knows and could enlighten us?
Python 2 csv doesn't support Unicode because CSV doesn't support Unicode.
CSV as defined in RFC 4180 and in common usage is no more than a sequence of bytes. There is no standard to define how those bytes are mapped to readable text, and different CSV-handling tools have divergent behaviours. If Python's csv provided particular encoding rules they would be wrong in many cases. Better to let the user decide by manually encoding/decoding using whichever convention works for that application.
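As a minimal Python 2 sketch of that manual approach (UTF-8 is just an example convention here, not something the csv module mandates):

# -*- coding: utf-8 -*-
import csv

rows = [[u"näme", u"välue"]]           # work with unicode internally

with open("out.csv", "wb") as f:
    writer = csv.writer(f)
    for row in rows:
        # encode each field manually at the boundary
        writer.writerow([field.encode("utf-8") for field in row])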
Python 3's csv gains Unicode support insofar as it has to talk to text I/O streams (as these are now much more common). These have their own encoding, but if you use one with its default encoding for CSV, the results will still be wrong as often as not.
Is it considered good practice to pick Unicode strings over regular strings when coding in Python? I mainly work on the Windows platform, where most of the string types are Unicode these days (i.e. .NET String, '_UNICODE' turned on by default on a new C++ project, etc.). Therefore, I tend to think that the case where non-Unicode string objects are used is a rare one. Anyway, I'm curious about what Python practitioners do in real-world projects.
From my practice -- use unicode.
At the beginning of one project we used plain strings. However, as the project grew we implemented new features and pulled in new third-party libraries, and in that mix of non-unicode and unicode strings some functions started failing. We spent time localizing and fixing these problems. Also, some third-party modules didn't support unicode and started failing after we switched to it (but this is the exception rather than the rule).
I also have experience of needing to rewrite some third-party modules (e.g. SendKeys) because they did not support unicode. It would have been better if they had used unicode from the beginning :)
So I think today we should use unicode.
P.S. All that mess above is only my humble opinion :)
As you ask this question, I suppose you are using Python 2.x.
Python 3.0 changed quite a lot in string representation, and all text is now unicode.
I would go for unicode in any new project - in a way compatible with the switch to Python 3.0 (see details).
Yes, use unicode.
Some hints:
When doing input/output in any sort of binary format, decode directly after reading and encode directly before writing, so that you never need to mix byte strings and unicode. Mixing them tends to lead to UnicodeEncodeErrors and UnicodeDecodeErrors sooner or later.
[Forget about this one, my explanations just made it even more confusing. It's only an issue when porting to Python 3, you can care about it then.]
Common Python newbie errors with Unicode (not saying you are a newbie, but this may be read by newbies): Don't confuse encode and decode. Remember, UTF-8 is an ENcoding, so you ENcode Unicode to UTF-8 and DEcode from it.
Do not fall into the temptation of setting the default encoding in Python (by setdefaultencoding in sitecustomize.py or similar) to whatever you use most. That is just going to give you problems if you reinstall or move to another computer or suddenly need to use another encoding. Be explicit.
Remember, not all of Python 2's standard library accepts unicode. If you feed a method unicode and it doesn't work, but it should, try feeding it ASCII and see. Example: urllib.urlopen(), which fails with unhelpful errors if you give it a unicode object instead of a string.
Hm. That's all I can think of now!
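As a tiny illustration of the encode/decode direction from the hints above (Python 2 syntax, purely illustrative):

# -*- coding: utf-8 -*-
text = u"héllo"                 # unicode object
data = text.encode("utf-8")     # ENcode: unicode -> bytes (str in Python 2)
back = data.decode("utf-8")     # DEcode: bytes -> unicode
assert back == text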
It can be tricky to consistently use unicode strings in Python 2.x - be it because somebody inadvertently uses the more natural str(blah) where they meant unicode(blah), forgets the u prefix on string literals, runs into third-party module incompatibilities - whatever. So in Python 2.x, use unicode only if you have to, and are prepared to provide good unit test coverage.
If you have the option of using Python 3.x however, you don't need to care - strings will be unicode with no extra effort.
In addition to Mihail's comment I would say: use Unicode, since it is the future. In Python 3.0, non-Unicode strings will be gone and, as far as I know, the "u" prefixes will cause trouble, since they are also gone.
If you are dealing with severely constrained memory or disk space, use ASCII strings. In this case, you should additionally write your software in C or something even more compact :)
I'm a bit confused that the argument to crypto functions is a string. Should I simply wrap non-string arguments with str() e.g.
hashlib.sha256(str(user_id) + str(expiry_time))
hmac.new(str(random.getrandbits(256)))
(ignore for the moment that random.getrandbits() might not be cryptographically good).
edit: I realise that the hmac example is silly because I'm not storing the key anywhere!
Well, usually hash functions (and cryptographic functions generally) work on bytes, and Python 2 strings are basically byte strings. If you want to compute the hash of some object, you have to convert it to a string representation. Just make sure to apply the same conversion later if you want to check that the hash is correct. And make sure that your string representation doesn't contain any changing data that you don't want to be checked.
Edit: Due to popular request, a short reminder that Python unicode strings don't contain bytes but Unicode code points. Each Unicode code point is stored internally as 2 or 4 bytes, depending on how the Python interpreter was compiled. Python byte strings (type str) contain only bytes, so str is the type most similar to an array of bytes.
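As a small sketch of that idea, with the field separator and the UTF-8 encoding chosen purely as example conventions for building one unambiguous byte string:

import hashlib

user_id = 42                    # hypothetical values
expiry_time = 1700000000

message = u"{0}|{1}".format(user_id, expiry_time).encode("utf-8")
print(hashlib.sha256(message).hexdigest())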
You can.
However, for the HMAC, you actually want to store the key somewhere. Without the key, there is no way for you to verify the hash value later. :-)
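A minimal sketch of that point: generate the key once, keep it, and reuse the same key to verify later. os.urandom, the literal message, and hmac.compare_digest (which needs a reasonably recent 2.7/3.x) are illustrative choices:

import hashlib
import hmac
import os

key = os.urandom(32)            # generate once and store somewhere safe

def sign(message, key):
    return hmac.new(key, message, hashlib.sha256).hexdigest()

token = sign(b"42|1700000000", key)
# later, with the same stored key:
assert hmac.compare_digest(token, sign(b"42|1700000000", key))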
Oh, and SHA-256 on its own isn't really an industrial-strength way to protect such data (although unfortunately it's used quite commonly on many sites). It's not a real way to protect passwords or other critical data, but it's more than good enough for generating temporary tokens.
Edit: As mentioned, SHA-256 needs at least some salt. Without salt, SHA-256 has a low barrier to being cracked with a dictionary attack (time-wise), and there are plenty of rainbow tables to use as well. Personally, I'd not use anything less than Blowfish for passwords (but that's because I'm paranoid).