Is it considered good practice to prefer Unicode strings over regular strings when coding in Python? I mainly work on the Windows platform, where most string types are Unicode these days (e.g. .NET String, '_UNICODE' turned on by default in a new C++ project, etc.). Therefore, I tend to think that cases where non-Unicode string objects are used are fairly rare. Anyway, I'm curious what Python practitioners do in real-world projects.
From my practice -- use unicode.
At the beginning of one project we used regular strings; however, as the project grew we implemented new features and pulled in new third-party libraries. In that mess of non-unicode/unicode strings, some functions started failing. We spent time localizing these problems and fixing them. However, some third-party modules didn't support unicode and started failing after we switched to it (though this is the exception rather than the rule).
I also have some experience of having to rewrite third-party modules (e.g. SendKeys) because they did not support unicode. It would have been better if they had been written with unicode from the beginning :)
So I think today we should use unicode.
P.S. All that mess above is only my humble opinion :)
As you ask this question, I suppose you are using Python 2.x.
Python 3.0 changed quite a lot in string representation, and all text is now unicode.
I would go for unicode in any new project - in a way compatible with the switch to Python 3.0 (see details).
Yes, use unicode.
Some hints:
When doing input/output in any sort of binary format, decode directly after reading and encode directly before writing, so that you never need to mix strings and unicode. Mixing them tends to lead to UnicodeDecodeErrors or UnicodeEncodeErrors sooner or later (see the sketch after these hints).
[Forget about this one, my explanations just made it even more confusing. It's only an issue when porting to Python 3, you can care about it then.]
Common Python newbie errors with Unicode (not saying you are a newbie, but this may be read by newbies): Don't confuse encode and decode. Remember, UTF-8 is an ENcoding, so you ENcode Unicode to UTF-8 and DEcode from it.
Do not fall into the temptation of setting the default encoding in Python (by setdefaultencoding in sitecustomize.py or similar) to whatever you use most. That is just going to give you problems if you reinstall or move to another computer or suddenly need to use another encoding. Be explicit.
Remember, not all of Python 2's standard library accepts unicode. If you feed a method unicode and it doesn't work, but it should, try feeding it an ASCII byte string and see. Examples: urllib.urlopen(), which fails with unhelpful errors if you give it a unicode object instead of a string.
Hm. That's all I can think of now!
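To make the first two hints concrete, here is a minimal Python 2 sketch of the decode-on-input, encode-on-output habit; the file names and the choice of UTF-8 are just illustrative assumptions:

import io

# Decode directly after reading: inside the program, work only with unicode.
with io.open('names.txt', 'r', encoding='utf-8') as f:
    names = [line.strip() for line in f]      # each line is already unicode

greeting = u'Hello, ' + names[0]              # unicode + unicode, no implicit coercion

# Encode directly before writing: bytes only exist at the boundary.
with io.open('greetings.txt', 'w', encoding='utf-8') as f:
    f.write(greeting + u'\n')

# UTF-8 is an ENcoding: ENcode unicode to UTF-8 bytes, DEcode UTF-8 bytes to unicode.
utf8_bytes = greeting.encode('utf-8')         # unicode -> byte string
round_trip = utf8_bytes.decode('utf-8')       # byte string -> unicode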
It can be tricky to consistently use unicode strings in Python 2.x - be it because somebody inadvertently uses the more natural str(blah) where they meant unicode(blah), forgets the u prefix on string literals, or runs into third-party module incompatibilities - whatever. So in Python 2.x, use unicode only if you have to, and be prepared to provide good unit test coverage.
If you have the option of using Python 3.x however, you don't need to care - strings will be unicode with no extra effort.
In addition to Mihail's comment I would say: use Unicode, since it is the future. In Python 3.0, non-Unicode strings will be gone and, as far as I know, the "u" prefixes will cause trouble, since they are gone as well.
If you are dealing with severely constrained memory or disk space, use ASCII strings. In this case, you should additionally write your software in C or something even more compact :)
I have been in the process of upgrading code bases to Python 3. One thing I've been doing is running 2to3 and seeing what the script suggests. Something it continually suggests is to remove all __future__ imports as well as any unicode strings e.g. u"python2 unicode str" (which makes sense to me, since Python 3 strings are unicode by default).
From what I can tell, these changes do not alter the functionality of the code in any way - it seems to just be "clean up". Is that correct? Is there any reason to keep the __future__ imports and unicode strings? Any explicit reason to remove them?
Note: I don't care about keeping Python 2 compatibility - it's out of support.
There's no reason to remove them, nor any strong reason to keep them. They're guaranteed to remain available, but do nothing, on Python versions that enable them by default:
MandatoryRelease records when the feature became part of the language; in releases at or after that, modules no longer need a future statement to use the feature in question, but may continue to use such imports.
No feature description will ever be deleted from __future__
If you're sure you'll never run on Python 2, it doesn't really matter what you do.
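If you're curious, the __future__ module itself records these releases, so you can check when a given feature became optional and when it became mandatory; the tuples in the comments below are only an illustration of typical CPython output:

import __future__

# Each feature object knows when it became available and when it became the default.
print(__future__.unicode_literals.getOptionalRelease())   # e.g. (2, 6, 0, 'alpha', 2)
print(__future__.unicode_literals.getMandatoryRelease())  # e.g. (3, 0, 0, 'alpha', 0)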
Is there a systematic way to run Python 3.x with all strings defaulting to bytes? I am finding that when "crossing boundaries" for example talking to msgpack, Elixir, or ZeroMQ, I'm having to do all sorts of contortions constantly figuring out whether strings or bytes will be returned. It's a complete pain and adds a layer of cognitive friction over and above my problem.
For example I have
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--nodename")
args = parser.parse_args()
and then to get the nodename I need to do
str(args.nodename)
However, ZeroMQ wants bytes, and I'm going to use the nodename everywhere I use ZeroMQ. So I make it bytes up front with
nodename.encode()
But now every time I want to use it with a string, say for concatenation, I cannot do so because I have to encode the string first. And half the libraries take perfectly good bytes data type and return them to you as strings, at which time you have to convert them back again to bytes if you want to send them outside Python. For a "glue language" this is a total disaster. I'm having to do this encode decode dance whenever I cross the boundary, and the worst is that it does not seem consistent across libraries whether they co-opt you to strings or bytes if you send them bytes.
In Python 3 is there an option to forego Unicode-by-default (since it does after all say, "by default", suggesting it can be changed), or is the answer "stick with 2.7".
In short, no. And you really don't want to try. You mention contortions but don't give specific examples, so it's hard to offer specific advice.
Neither, in this author's humble opinion, do you want to stick with Python 2.7, but if you don't need bugfixes and language updates after 2020 it won't matter.
The point is precisely that all translation between bytes and text should take place at the boundaries of your code. Decode (from whatever external representation is used) on input, encode (to whatever encoding you wish or need to use) on output. Python 3 is written to enforce this distinction, but understanding the separation should give you proper control and reduce your frustrations.
In Python 3, opening a file in text mode causes readline and friends to produce Unicode strings. You can specify the encoding when you open the file if you wish. Opening a file in binary mode causes them to produce bytestrings, to which you will have to apply your own decoding to make sense of them as text.
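For example (the file name and the choice of UTF-8 are just illustrative):

# Text mode: Python decodes for you, so readline() returns str.
with open('data.txt', 'r', encoding='utf-8') as f:
    line = f.readline()        # str

# Binary mode: you get bytes and apply your own decoding if it is text.
with open('data.txt', 'rb') as f:
    raw = f.readline()         # bytes
    text = raw.decode('utf-8')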
Whether the Python API for a particular system returns bytes or text is up to its author, and calling Python 3 functions that expect strings with bytestring arguments is likely to lead to confusion and unhappiness. All external communications (network, files, etc.) must necessarily take place in terms of bytestrings, so be clear what is text (decoding on input and encoding on output) and deal with the outside world exclusively in bytestrings.
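One way to keep that discipline without sprinkling encode()/decode() calls everywhere is a pair of thin wrappers at the boundary. This is only a sketch: send_raw and recv_raw stand in for whatever bytes-only calls your transport actually provides, and UTF-8 is just a default assumption:

def send_text(transport, text, encoding='utf-8'):
    # Encode exactly once, on the way out of your program.
    transport.send_raw(text.encode(encoding))

def recv_text(transport, encoding='utf-8'):
    # Decode exactly once, on the way in; everything past this point is str.
    return transport.recv_raw().decode(encoding)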
There are always, of course, difficult corner cases. I don't envy the maintainers of the email package, who have to deal with messages containing 6-bit encoded bytestreams themselves potentially containing attachments in multiple different encodings. But then I don't usually have to work in such complex environments, and hopefully neither do you.
As you might know the Python 2 stdlib csv module doesn't "properly" support unicode. It expects binary strings that it will write to the file as it gets them.
To me this always seemed a bit counter-intuitive, as I would tell people to internally work with unicode strings and properly serialize things for the external world by opening files with codecs.open(..., encoding='...'), but in the csv module's case you need to do this manually for the lists/dicts you pass in.
It always puzzled me why this is, and now that a colleague asks me again, I have to admit that I don't know any reason for it other than "probably grown and never fixed".
It seems that even PEP 305 already contained TODOs for unicode and references to codecs.open.
Is there some wise python guru here who knows and could enlighten us?
Python 2 csv doesn't support Unicode because CSV doesn't support Unicode.
CSV as defined in RFC 4180 and in common usage is no more than a sequence of bytes. There is no standard to define how those bytes are mapped to readable text, and different CSV-handling tools have divergent behaviours. If Python's csv provided particular encoding rules they would be wrong in many cases. Better to let the user decide by manually encoding/decoding using whichever convention works for that application.
Python 3's csv gains Unicode support inasmuch as it has to talk to text I/O streams (as these are now much more common). These have their own encoding, but if you use one with its default encoding for CSV, the results will still be wrong as often as not.
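In practice, the usual Python 2 workaround is exactly that manual step: encode each unicode cell (typically to UTF-8) before handing the row to csv.writer, and decode again after csv.reader. A rough sketch, with the file name and UTF-8 chosen only for illustration:

# -*- coding: utf-8 -*-
import csv

rows = [[u'café', u'naïve'], [u'plain', u'ascii']]

# Encode on the way out: csv.writer only ever sees byte strings.
with open('out.csv', 'wb') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow([cell.encode('utf-8') for cell in row])

# Decode on the way in: turn the byte strings back into unicode yourself.
with open('out.csv', 'rb') as f:
    for row in csv.reader(f):
        values = [cell.decode('utf-8') for cell in row]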
When I read Python 2's official page on Unicode, it says
Under the hood, Python represents Unicode strings as either 16- or 32-bit integers, depending on how the Python interpreter was compiled.
What does above sentence mean? Could it mean that Python2 has its own special encodings of Unicode? If so, why not just use UTF-8?
This statement simply means that there is underlying C code that uses both these encodings and that depending on the circumstances, either variant is chosen. Those circumstances are typically user choice, compiler and operating system.
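You can see which variant your Python 2 interpreter was compiled with by looking at sys.maxunicode:

import sys

# 65535 on a "narrow" (16-bit) build, 1114111 on a "wide" (32-bit, UCS-4) build.
print(sys.maxunicode)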
Now, for the possible rationale for that, there are reasons not to use UTF-8:
First and foremost, indexing into a UTF-8 string is O(n) in complexity, while it is O(1) for UTF-32/UCS4. While that is irrelevant for streamed data and UTF-8 can actually save space for transmission or storage, in-memory handling is more convenient with one character per Unicode codepoint.
Secondly, using one character per codepoint translates very well to the API that Python itself provides in its language, so this is a natural choice.
On MS Windows platforms, the native encoding for UI and filesystem is UTF-16, so using that encoding provides seamless integration with that platform.
On some compilers wchar_t is actually a 16-bit type, so if you wanted to use a 32-bit type there you would have to reimplement all kinds of functions for your self-invented character type. Dropping support for anything above the Unicode BMP or leaking surrogate sequences into the Python API is a reasonable compromise then (but one that sticks unfortunately).
Note that those are possible reasons, I don't claim that these apply to Python's implementation.
I'm currently working on a project where I need to transfer objects from Ruby to Python and back again; obviously, serialization is the way to go. I've looked at things like YAML but decided to write my own, as I didn't want to deal with the libraries' dependencies and such when it came time to distribute. I've written up how this serialization format works here.
My question is: as this format is intended to work cross-language between Ruby and Python, how should I serialize Ruby's symbols? I'm not aware of an object that works the same way in Python. Should a dump containing a symbol fail? Should I just serialize it as a string? What would be best?
Doesn't that depend on what your project needs? If symbols are important, you'll need some way to deal with them.
I'm not a Ruby programmer, but from what I've just read, I think converting them to strings is probably easiest. The standard Python interpreter will reuse memory for identical short strings, which seems to be a key reason suggested for using symbols.
EDIT: If it needs to work for other programmers, passing values back and forth shouldn't change them. So you either have to handle symbols properly, or throw an error straight away. It should be simple enough in Python:
class Symbol(str):
    pass

# In serialising code:
if isinstance(x, Symbol):
    serialise_as_symbol(x)
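A hedged sketch of what serialise_as_symbol and its reading counterpart might look like in a home-grown text format; the ':sym:' tag is purely illustrative and not part of any existing standard:

SYMBOL_TAG = ':sym:'   # any unambiguous marker would do

def serialise_as_symbol(x):
    return SYMBOL_TAG + str(x)

def deserialise_token(token):
    # Round-trip: anything tagged as a symbol comes back as a Symbol, not a plain str.
    if token.startswith(SYMBOL_TAG):
        return Symbol(token[len(SYMBOL_TAG):])
    return token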
Any reason you're not using a standard data interchange format like JSON or XML? They seem to be acceptable to countless applications, services, and programmers.
If symbols are a stumbling block then you have three choices: don't allow them, convert them to strings on the fly, or figure out a way to make them universal and/or innocuous in other languages.