(Python 2.6)
I have some code that does some customization based on client name.
It does something like this:
custom_module = __import__("some.module.company_inc")
This works fine as long as our clients have ASCII-only names.
I would like to make this code work correctly for non-ASCII company names as well, e.g.
custom_module = __import__(u"some.module.unicóde_company_inc")
However, __import__ only accepts bytes, so I need to encode this first.
Is __import__(u"some.module.unicóde_company_inc".encode(sys.getfilesystemencoding())) guaranteed to work on all systems (assuming, of course, that the filesystem encoding supports "ó")? Is this the right way to do this, given that I don't statically know the encoding the box uses?
I am most interested in Linux systems (but it would be nice to know for non-Linux as well).
Strictly speaking, sys.getfilesystemencoding() can return None under some circumstances (e.g. if LANG is not set). So it is probably slightly safer to fall back to "utf-8" to allow for that (rather unlikely) possibility:
encoding = sys.getfilesystemencoding() or 'utf-8'
That will cover 99.9% of cases. For the rest, I would just allow the application to raise an exception (since that's exactly what it is), and then bail out gracefully with a suitably informative error message.
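Putting the pieces together, a minimal sketch (the helper name and the error message are my own inventions; adapt as needed):

# -*- coding: utf-8 -*-
import sys

def import_custom_module(dotted_name):
    # sys.getfilesystemencoding() may return None (e.g. when LANG is
    # unset), so fall back to UTF-8 for that unlikely case.
    encoding = sys.getfilesystemencoding() or 'utf-8'
    try:
        return __import__(dotted_name.encode(encoding))
    except (UnicodeEncodeError, ImportError):
        sys.exit("cannot load customizations for %r" % dotted_name)

custom_module = import_custom_module(u"some.module.unicóde_company_inc")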
My boss asked me to put the following lines (from this answer) into a Python 3 script I wrote:
import sys
import codecs
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
He says it's to prevent UnicodeEncodeErrors when printing Unicode characters in non-UTF-8 locales. I am wondering whether this is really necessary, and why Python wouldn't handle encoding/decoding correctly without boilerplate code.
What is the most Pythonic way to make Python scripts compatible with different operating system locales? And what does this boilerplate code do exactly?
The answer provided here has a good excerpt from the Python mailing list regarding your question. I guess it is not necessary to do this.
The only supported default encodings in Python are:
Python 2.x: ASCII
Python 3.x: UTF-8
If you change these, you are on your own and strange things will start
to happen. The default encoding does not only affect the translation
between Python and the outside world, but also all internal
conversions between 8-bit strings and Unicode.
Hacks like what's happening in the pango module (setting the default
encoding to 'utf-8' by reloading the site module in order to get the
sys.setdefaultencoding() API back) are just downright wrong and will
cause serious problems since Unicode objects cache their default
encoded representation.
Please don't enable the use of a locale based default encoding.
If all you want to achieve is getting the encodings of stdout and
stdin correctly setup for pipes, you should instead change the
.encoding attribute of those (only).
--
Marc-Andre Lemburg
eGenix.com
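For completeness: if you do decide you must force UTF-8 output despite the locale, recent Python versions (3.7+) offer a supported alternative to the codecs wrapper. A minimal sketch, assuming sys.stdout is still the default text stream:

import sys

# Python 3.7+: re-encode stdout in place, without codecs.getwriter().
# Only do this when you really must override the locale's choice.
if sys.stdout.encoding.lower() != "utf-8":
    sys.stdout.reconfigure(encoding="utf-8")

print("unicóde: \u2713")

Setting the PYTHONIOENCODING environment variable before starting the interpreter achieves the same effect without touching the code at all.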
As you might know the Python 2 stdlib csv module doesn't "properly" support unicode. It expects binary strings that it will write to the file as it gets them.
To me this always seemed a bit counter-intuitive, as I would tell people to work with unicode strings internally and to properly serialize things for the external world by opening files with codecs.open(..., encoding='...'); but in the csv module's case you need to do this manually for the lists/dicts you pass in.
It always puzzled me why this is, and now that a colleague asks me again, I have to admit that I don't know any reason for it other than "probably grown and never fixed".
It seems that even PEP 305 already contained TODOs for unicode and references to codecs.open.
Is there some wise python guru here who knows and could enlighten us?
Python 2 csv doesn't support Unicode because CSV doesn't support Unicode.
CSV as defined in RFC 4180 and in common usage is no more than a sequence of bytes. There is no standard to define how those bytes are mapped to readable text, and different CSV-handling tools have divergent behaviours. If Python's csv provided particular encoding rules they would be wrong in many cases. Better to let the user decide by manually encoding/decoding using whichever convention works for that application.
Python 3's csv gains Unicode support inasmuch as it has to talk to text I/O streams (as these are now much more common). These have their own encoding, but if you use one with its default encoding for CSV, the results will still be wrong as often as not.
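Which is why the usual Python 2 idiom is exactly what the asker describes: keep unicode internally and encode/decode by hand at the csv boundary. A minimal sketch, with UTF-8 as the arbitrarily chosen convention:

# -*- coding: utf-8 -*-
import csv

rows = [[u'name', u'unicóde value']]

# Encode each cell with the convention you picked before csv sees it...
f = open('out.csv', 'wb')
writer = csv.writer(f)
for row in rows:
    writer.writerow([cell.encode('utf-8') for cell in row])
f.close()

# ...and decode with the same convention on the way back in.
f = open('out.csv', 'rb')
for row in csv.reader(f):
    print [cell.decode('utf-8') for cell in row]
f.close()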
When I read Python 2's official page on Unicode, it says:
Under the hood, Python represents Unicode strings as either 16- or 32-bit integers, depending on how the Python interpreter was compiled.
What does the above sentence mean? Could it mean that Python 2 has its own special encodings of Unicode? If so, why not just use UTF-8?
This statement simply means that there is underlying C code that uses both these encodings and that depending on the circumstances, either variant is chosen. Those circumstances are typically user choice, compiler and operating system.
Now, for the possible rationale for that, there are reasons not to use UTF-8:
First and foremost, indexing into a UTF-8 string is O(n) in complexity, while it is O(1) for UTF-32/UCS4. While that is irrelevant for streamed data and UTF-8 can actually save space for transmission or storage, in-memory handling is more convenient with one character per Unicode codepoint.
Secondly, using one character per codepoint translates very well to the API that Python itself provides in its language, so this is a natural choice.
On MS Windows platforms, the native encoding for UI and filesystem is UTF-16, so using that encoding provides seamless integration with that platform.
On some compilers wchar_t is actually a 16-bit type, so if you wanted to use a 32-bit type there you would have to reimplement all kinds of functions for your self-invented character type. Dropping support for anything above the Unicode BMP, or leaking surrogate sequences into the Python API, is a reasonable compromise then (though one that unfortunately sticks).
Note that those are possible reasons, I don't claim that these apply to Python's implementation.
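You can check which variant your own interpreter was compiled with from Python itself:

import sys

# 65535 (0xFFFF) means a narrow build (16-bit units, UCS-2 plus
# surrogates); 1114111 (0x10FFFF) means a wide build (32-bit, UCS-4).
print sys.maxunicode

# Visible consequence: a character outside the BMP is one code point
# on a wide build, but a surrogate pair on a narrow one.
print len(u'\U0001D11E')   # 1 on wide builds, 2 on narrow builds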
Somewhat out of necessity, I develop software with my locale set to either "C" or "en_US". It's difficult to use a different locale because I only speak one language with anything even remotely approaching fluency.
As a result, I often overlook the differences in behavior that can be introduced by having different locale settings. Unsurprisingly, overlooking those differences will sometimes lead to bugs which are only discovered by some unfortunate user using a different locale. In particularly bad cases, that user may not even share a language with me, making the bug reporting process a challenging one. And, importantly, a lot of my software is in the form of libraries; while almost none of it sets the locale, it may be combined with another library, or used in an application which does set the locale - generating behavior I never experience myself.
To be a bit more specific, the kinds of bugs I have in mind are not missing text localizations or bugs in the code for using those localizations. Instead, I mean bugs where a locale changes the result of some locale-aware API (for example, toupper(3)) when the code using that API did not anticipate the possibility of such a change (e.g., in the Turkish locale, toupper does not change "i" to "I" - potentially a problem for a network server trying to speak a particular network protocol to another host).
A few examples of such bugs in software I maintain:
AttributeError in a Turkish locale
imap relies on a C locale for date formatting
Fix for locale-dependant date formatting in imap and conch
In the past, one approach I've taken to dealing with this is to write regression tests which explicitly change the locale to one where code was known not to work, exercise the code, verify correct behavior, and then restore the original locale. This works well enough, but only after someone has reported a bug, and it only covers one small area of a codebase.
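For concreteness, one of those regression tests might look something like this (normalize_keyword is a hypothetical function under test; skipTest needs Python 2.7+):

import locale
import unittest

from myproject.protocol import normalize_keyword   # hypothetical

class TurkishLocaleTests(unittest.TestCase):
    def setUp(self):
        # Remember whatever locale the test runner started with.
        self._saved = locale.setlocale(locale.LC_ALL)
        try:
            locale.setlocale(locale.LC_ALL, 'tr_TR.UTF-8')
        except locale.Error:
            self.skipTest("tr_TR.UTF-8 locale not installed")

    def tearDown(self):
        locale.setlocale(locale.LC_ALL, self._saved)

    def test_keyword_casing_is_locale_independent(self):
        # Turkish maps 'i' to a dotted capital I, so a naive upper()
        # inside the implementation would not produce 'LOGIN' here.
        self.assertEqual('LOGIN', normalize_keyword('login'))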
Another approach which seems possible is to have a continuous integration system (CIS) set up to run a full suite of tests in an environment with a different locale set. This improves the situation somewhat, by giving as much coverage in that one alternate locale as the test suite normally gives. Another shortcoming is that there are many, many, many locales, and each may possibly cause different problems. In practice, there are probably only a dozen or so different ways a locale can break a program, but having dozens of extra testing configurations is taxing on resources (particularly for a project already stretching its resource limits by testing on different platforms, against different library versions, etc).
Another approach which occurred to me is to use (possibly first creating) a new locale which is radically different from the "C" locale in every way it can be - have a different case mapping, use a different thousands separator, format dates differently, etc. This locale could be used with one extra CIS configuration and hopefully relied upon to catch any errors in the code that could be triggered by any locale.
Does such a testing locale exist already? Are there flaws with this idea to testing for locale compatibility?
What other approaches to locale testing have people taken?
I'm primarily interested in POSIX locales, since those are the ones I know about. However, I know that Windows also has some similar features, so extra information (perhaps with more background information about how those features work), could perhaps also be useful.
I would just audit your code for incorrect uses of functions like toupper. Under the C locale model, such functions should be considered as operating only on natural-language text in the locale's language. For any application which deals with potentially multi-lingual text, this means functions such as tolower should not be used at all.
If your target is POSIX, you have a little bit more flexibility due to the uselocale function which makes it possible to temporarily override the locale in a single thread (i.e. without messing up the global state of your program). You could then keep the C locale globally and use tolower etc. for ASCII/machine-oriented text (like config files and such) and only uselocale to the user's selected locale when working with natural-language text from said locale.
Otherwise (and perhaps even then, if your needs are more advanced), I think the best solution is to completely throw out functions like tolower and write your own ASCII versions for config text and the like, and use a powerful Unicode-aware library for natural-language text.
One sticky issue that I haven't yet touched on is the decimal separator in relation to functions like snprintf and strtod. Having it changed to a , instead of a . in some locales can ruin your ability to parse files with the C library. My preferred solution is simply to never set the LC_NUMERIC locale whatsoever. (And I'm a mathematician so I tend to believe numbers should be universal, not subject to cultural convention.) Depending on your application, the only locale categories really needed may just be LC_CTYPE, LC_COLLATE, and LC_MESSAGES. Also often useful are LC_MONETARY and LC_TIME.
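A quick Python illustration of that trap (assuming the de_DE.UTF-8 locale is installed):

import locale

locale.setlocale(locale.LC_NUMERIC, 'de_DE.UTF-8')

print locale.str(3.14)      # '3,14': C-library formatting changed
# float('3,14') still raises ValueError, so naive round-tripping of
# locale-formatted numbers through float() breaks.
print locale.atof('3,14')   # 3.14, via the locale-aware parser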
You have two different problems to solve to answer your question: testing your own code and dealing with issues in other people's code.
Testing your own code - I've dealt with this by using two or three English-based locales set up in a CI environment: en_GB (collation), en_ZW (almost everything changes, but you can still read the errors) and then en_AU (dates, collation).
If you want to make sure your code works with multibyte filenames, then you also need to test with ja_JP.
Dealing with other people's code is in many ways the hardest, and my solution for that is to store date values (it's almost always dates :) as raw date/time values and always keep them in GMT. Then, when you are crossing the boundary of your app, you convert to the appropriate format.
PyTZ and PyICU are very helpful for doing the above.
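A minimal sketch of that store-in-GMT, convert-at-the-boundary pattern with pytz (the zone and timestamp are arbitrary):

from datetime import datetime
import pytz

# Internally, keep every timestamp in UTC/GMT...
stored = datetime(2011, 6, 1, 12, 30, tzinfo=pytz.utc)

# ...and convert only at the boundary, when presenting it to the user.
local = stored.astimezone(pytz.timezone('Europe/Istanbul'))
print local.strftime('%Y-%m-%d %H:%M %Z')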
Is it considered good practice to pick Unicode strings over regular strings when coding in Python? I mainly work on the Windows platform, where most of the string types are Unicode these days (e.g. .NET's String, _UNICODE turned on by default in a new C++ project, etc.). Therefore, I tend to think that the case where non-Unicode string objects are used is a rare one. Anyway, I'm curious about what Python practitioners do in real-world projects.
From my practice -- use unicode.
At the beginning of one project we used the usual strings; however, as the project grew we implemented new features and used new third-party libraries. In that mess of non-unicode/unicode strings some functions started failing. We started spending time localizing these problems and fixing them. However, some third-party modules didn't support unicode and started failing after we switched to it (but this is the exception rather than the rule).
I also have some experience of needing to rewrite third-party modules (e.g. SendKeys) because they did not support unicode. If they had been done in unicode from the beginning, it would have been better :)
So I think today we should use unicode.
P.S. All that mess above is only my humble opinion :)
As you ask this question, I suppose you are using Python 2.x.
Python 3.0 changed quite a lot in string representation, and all text is now unicode.
I would go for unicode in any new project - in a way compatible with the switch to Python 3.0 (see details).
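One concrete step in that direction on Python 2.6+ is the unicode_literals future import; a minimal sketch:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

# With this future import, undecorated string literals are unicode,
# matching Python 3 semantics without sprinkling u'' prefixes around.
s = "unicóde by default"
print type(s)    # <type 'unicode'>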
Yes, use unicode.
Some hints:
When doing input/output in any sort of binary format, decode directly after reading and encode directly before writing, so that you never need to mix strings and unicode, because mixing them tends to lead to UnicodeEncodeErrors and UnicodeDecodeErrors sooner or later. (There is a short sketch of this pattern after these hints.)
[Forget about this one, my explanations just made it even more confusing. It's only an issue when porting to Python 3, you can care about it then.]
Common Python newbie errors with Unicode (not saying you are a newbie, but this may be read by newbies): Don't confuse encode and decode. Remember, UTF-8 is an ENcoding, so you ENcode Unicode to UTF-8 and DEcode from it.
Do not fall into the temptation of setting the default encoding in Python (by setdefaultencoding in sitecustomize.py or similar) to whatever you use most. That is just going to give you problems if you reinstall or move to another computer or suddenly need to use another encoding. Be explicit.
Remember, not all of Python 2's standard library accepts unicode. If you feed a method unicode and it doesn't work, but it should, try feeding it ASCII and see. Example: urllib.urlopen() fails with unhelpful errors if you give it a unicode object instead of a str.
Hm. That's all I can think of now!
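To illustrate the first hint, a minimal sketch of decoding on read and encoding on write (file names and encoding are arbitrary):

import codecs

# Decode at the input boundary: codecs.open hands you unicode objects.
f = codecs.open('input.txt', 'r', encoding='utf-8')
text = f.read()            # unicode, not str
f.close()

# Everything in between stays unicode...
text = text.upper()

# ...and encoding happens only at the output boundary.
f = codecs.open('output.txt', 'w', encoding='utf-8')
f.write(text)              # the writer encodes the unicode for you
f.close()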
It can be tricky to consistently use unicode strings in Python 2.x - be it because somebody inadvertently uses the more natural str(blah) where they meant unicode(blah), forgets the u prefix on string literals, or hits a third-party module incompatibility - whatever. So in Python 2.x, use unicode only if you have to, and are prepared to provide good unit test coverage.
If you have the option of using Python 3.x however, you don't need to care - strings will be unicode with no extra effort.
In addition to Mihail's comment I would say: use unicode, since it is the future. In Python 3.0 non-unicode strings will be gone and, as far as I know, the u prefixes will cause trouble, since they are also gone.
If you are dealing with severely constrained memory or disk space, use ASCII strings. In this case, you should additionally write your software in C or something even more compact :)