Should I include this boilerplate code in every Python script I write? - python

My boss asked me to put the following lines (from this answer) into a Python 3 script I wrote:
import sys
import codecs
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
He says it's to prevent UnicodeEncodeErrors when printing Unicode characters in non-UTF8 locales. I am wondering whether this is really necessary, and why Python wouldn't handle encoding/decoding correctly without boilerplate code.
What is the most Pythonic way to make Python scripts compatible with different operating system locales? And what does this boilerplate code do exactly?

The answer linked here has a good excerpt from the Python mailing list regarding your question. Based on it, I don't think it is necessary to do this.
The only supported default encodings in Python are:
Python 2.x: ASCII
Python 3.x: UTF-8
If you change these, you are on your own and strange things will start
to happen. The default encoding does not only affect the translation
between Python and the outside world, but also all internal
conversions between 8-bit strings and Unicode.
Hacks like what's happening in the pango module (setting the default
encoding to 'utf-8' by reloading the site module in order to get the
sys.setdefaultencoding() API back) are just downright wrong and will
cause serious problems since Unicode objects cache their default
encoded representation.
Please don't enable the use of a locale based default encoding.
If all you want to achieve is getting the encodings of stdout and
stdin correctly setup for pipes, you should instead change the
.encoding attribute of those (only).
--
Marc-Andre Lemburg
eGenix.com
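On Python 3.7 and later there is a cleaner alternative to the codecs wrapper: text streams can be reconfigured in place. A minimal sketch (the hasattr guard allows for redirected streams that aren't TextIOWrappers):

```python
import sys

# Python 3.7+: reconfigure the existing stdout text stream instead of
# replacing it with a codecs writer. Guard for redirected streams that
# may not expose reconfigure().
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

print("café, naïve")  # now encodes as UTF-8 regardless of locale
```

Alternatively, setting the environment variable PYTHONIOENCODING=utf-8 (or PYTHONUTF8=1 on 3.7+) achieves the same effect without touching the script at all.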

Related

Should you remove __future__ imports and unicode strings when upgrading to Python 3?

I have been in the process of upgrading code bases to Python 3. One thing I've been doing is running 2to3 and seeing what the script suggests. Something it continually suggests is to remove all __future__ imports as well as any unicode strings e.g. u"python2 unicode str" (which makes sense to me, since Python 3 strings are unicode by default).
From what I can tell, these changes do not alter the functionality of the code in any way - it seems to just be "clean up". Is that correct? Is there any reason to keep the __future__ imports and unicode strings? Any explicit reason to remove them?
Note: I don't care about keeping Python 2 compatibility - it's out of support.
There's no reason to remove them, nor any strong reason to keep them. They're guaranteed to remain available, but do nothing, on Python versions that enable them by default:
MandatoryRelease records when the feature became part of the language; in releases at or after that, modules no longer need a future statement to use the feature in question, but may continue to use such imports.
No feature description will ever be deleted from __future__
If you're sure you'll never run on Python 2, it doesn't really matter what you do.
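You can see this guarantee programmatically: each feature in the __future__ module records both the release where the import first worked and the release where the behaviour became the language default, and the import stays valid after that. A quick sketch:

```python
import __future__

# Each _Feature records when the import became available and when the
# behaviour became the default; the import keeps working afterwards.
feature = __future__.unicode_literals
print(feature.getOptionalRelease())   # first release accepting the import
print(feature.getMandatoryRelease())  # release where it became the default
```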

Safely Importing Modules With Unicode Names in Python 2

(Python 2.6)
I have some code that does some customization based on client name.
It does something like this:
custom_module = __import__("some.module.company_inc")
This works fine so long as our clients have ASCII-only names.
I would like to make this code work correctly for non-ascii company names as well, e.g.
custom_module = __import__(u"some.module.unicóde_company_inc")
However, __import__ only accepts bytes, so I need to encode this first.
Is __import__(u"some.module.unicóde_company_inc".encode(sys.getfilesystemencoding())) guaranteed to work on all systems (assuming that the filesystem encoding supports "ó", of course)? Is this the right way to do this? (Assuming I don't statically know the encoding that the box uses.)
I am most interested in linux systems. (But it would be nice to know for non-linux as well)
Strictly speaking, it is possible that sys.getfilesystemencoding() can return None under some circumstances (e.g. if LANG is not set). So it would probably be slightly safer to fallback to "utf-8" to allow for that (rather unlikely) possibility:
encoding = sys.getfilesystemencoding() or 'utf-8'
That will cover 99.9% of cases. For the rest, I would just allow the application to raise an exception (since that's exactly what it is), and then bail out gracefully with a suitably informative error message.
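Put together, the answer's suggestion looks something like this (the helper name is hypothetical; on Python 2 the encoded result would be passed straight to __import__):

```python
import sys

def module_name_for_import(dotted_name, fs_encoding=None):
    # Encode a unicode dotted module name for Python 2's __import__,
    # falling back to UTF-8 when the filesystem encoding is unknown
    # (e.g. when LANG is unset and getfilesystemencoding() returns None).
    encoding = fs_encoding or sys.getfilesystemencoding() or "utf-8"
    return dotted_name.encode(encoding)

# On Python 2 you would then do:
# custom_module = __import__(module_name_for_import(u"some.module.unicóde_company_inc"))
```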

Why doesn't the Python 2 csv module support unicode?

As you might know the Python 2 stdlib csv module doesn't "properly" support unicode. It expects binary strings that it will write to the file as it gets them.
To me this always seemed a bit counter-intuitive, as I would tell people to work with unicode strings internally and serialize properly for the external world by opening files with codecs.open(..., encoding='...'); but in the csv module's case you need to do this manually for the lists/dicts you pass in.
It always puzzled me why this is, and now that a colleague asks me again, I have to admit that I don't know any reason for it other than "probably grown and never fixed".
It seems that even PEP305 already contained TODOs for unicode and references to codecs.open.
Is there some wise python guru here who knows and could enlighten us?
Python 2 csv doesn't support Unicode because CSV doesn't support Unicode.
CSV as defined in RFC 4180 and in common usage is no more than a sequence of bytes. There is no standard to define how those bytes are mapped to readable text, and different CSV-handling tools have divergent behaviours. If Python's csv provided particular encoding rules they would be wrong in many cases. Better to let the user decide by manually encoding/decoding using whichever convention works for that application.
Python 3 csv gains Unicode support in as much as it has to talk to text IO streams (as these are now much more common). These have their own encoding, but if you use one with its default encoding for CSV, the results will still be wrong as often as not.
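To illustrate: in Python 3 the encoding decision moves to the stream the csv module talks to, so you choose the byte encoding explicitly when opening (or encoding) the underlying data. A minimal sketch with an in-memory buffer:

```python
import csv
import io

# csv reads and writes text; the byte encoding belongs to the stream.
buf = io.StringIO()
csv.writer(buf).writerow(["naïve", "café"])

encoded = buf.getvalue().encode("utf-8")  # the encoding is your explicit choice
rows = list(csv.reader(io.StringIO(encoded.decode("utf-8"))))
```

With a real file you would make the same choice via open("data.csv", "w", encoding="utf-8", newline="").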

proper replacement of QString().arg method in python3

When it comes to internationalization - using python2 and PyQt4 - the "proposed way" to format a translated string is using the QString.arg() method:
from PyQt4.QtGui import QDialog
#somewhere in a QDialog:
self.tr("string %1 %2").arg(arg1).arg(arg2)
But QString() doesn't exist in python3-PyQt4.
So my question is, what is the best way to format any translated strings in python3? Should I use the standard python method str.format() or maybe there is something more suitable?
The QString::arg method is really there as a workaround for C++'s limited string formatting support, to make sure you don't use sprintf with all the problems that entails (not handling placeholders that are in different orders in different languages, buffer overruns, etc.). Because Python doesn't have any such problems, there's no good reason to use it.
In fact, there's very little reason to ever use QString explicitly. In PyQt4 it wasn't phased out completely, but by PyQt5 it was. (Technically, PyQt4 supports "string API v2" in both Python 2.x and 3.x, but only enables it by default in 3.x; PyQt5 enables v2 by default in both, and hides the ability to switch back to v1.) See Python Strings, Qt Strings and Unicode in the documentation for more information.
There is one uncommon exception, but it's a major one if it affects you: if you're writing an app that's partly in Qt/C++ and partly in PyQt, you're going to have problems sharing I18N data when some strings are in "string %1 %2" format and others are in "string {1} {2}" format. (The first time you ship one of your files out to an outsourced translation company, they're going to get it wrong, guaranteed.)
Yes, just use standard python string formatting.
QString is gone because it's pretty much interchangeable with Python's unicode strings (which are str in Python 3 and unicode in Python 2), so PyQt takes care of converting one into the other as needed.
QString being disabled isn't limited to Python 3, it's just the default there. You can get the same on Python 2 by doing this before importing anything from PyQt4:
import sip
sip.setapi('QString', 2)
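For the formatting itself, numbered str.format fields keep placeholders reorderable by translators, just like %1/%2 in QString::arg. A sketch (in a real dialog the template would come from self.tr(...)):

```python
# Numbered fields can be reordered in a translation, like QString's %1 %2.
template = "Processed {0} files in {1} seconds"  # stand-in for self.tr(...)
message = template.format(42, 3.5)
# A German translator could reorder freely:
# "In {1} Sekunden wurden {0} Dateien verarbeitet"
```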

Should I use Unicode string by default?

Is it considered as a good practice to pick Unicode string over regular string when coding in Python? I mainly work on the Windows platform, where most of the string types are Unicode these days (i.e. .NET String, '_UNICODE' turned on by default on a new c++ project, etc ). Therefore, I tend to think that the case where non-Unicode string objects are used is a sort of rare case. Anyway, I'm curious about what Python practitioners do in real-world projects.
From my practice -- use unicode.
At the beginning of one project we used ordinary strings; however, as the project grew and we implemented new features and adopted new third-party libraries, functions started failing in that mess of non-unicode/unicode strings. We started spending time localizing these problems and fixing them. On the other hand, some third-party modules didn't support unicode and started failing after we switched to it (but that is the exception rather than the rule).
I also have some experience rewriting third-party modules (e.g. SendKeys) because they did not support unicode. It would have been better if they had used unicode from the beginning :)
So I think today we should use unicode.
P.S. All the mess above is only my humble opinion :)
As you ask this question, I suppose you are using Python 2.x.
Python 3.0 changed quite a lot in string representation, and all text now is unicode.
I would go for unicode in any new project - in a way compatible with the switch to Python 3.0 (see details).
Yes, use unicode.
Some hints:
When doing input/output in any sort of binary format, decode directly after reading and encode directly before writing, so that you never need to mix byte strings and unicode. Mixing them tends to lead to UnicodeDecodeErrors or UnicodeEncodeErrors sooner or later.
[Forget about this one, my explanations just made it even more confusing. It's only an issue when porting to Python 3, you can care about it then.]
Common Python newbie errors with Unicode (not saying you are a newbie, but this may be read by newbies): Don't confuse encode and decode. Remember, UTF-8 is an ENcoding, so you ENcode Unicode to UTF-8 and DEcode from it.
Do not fall into the temptation of setting the default encoding in Python (by setdefaultencoding in sitecustomize.py or similar) to whatever you use most. That is just going to give you problems if you reinstall or move to another computer or suddenly need to use another encoding. Be explicit.
Remember, not all of Python 2's standard library accepts unicode. If you feed a method unicode and it doesn't work, but it should, try feeding it ASCII and see. Example: urllib.urlopen(), which fails with unhelpful errors if you give it a unicode object instead of a string.
Hm. That's all I can think of now!
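The decode-after-reading / encode-before-writing point above can be sketched like this (BytesIO stands in for a real binary file):

```python
import io

raw = io.BytesIO(u"café au lait".encode("utf-8"))  # fake binary input

text = raw.read().decode("utf-8")     # bytes -> unicode right after reading
processed = text.upper()              # all internal work on unicode only

out = io.BytesIO()
out.write(processed.encode("utf-8"))  # unicode -> bytes just before writing
```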
It can be tricky to consistently use unicode strings in Python 2.x - be it because somebody inadvertently uses the more natural str(blah) where they meant unicode(blah), forgetting the u prefix on string literals, third-party module incompatibilities - whatever. So in Python 2.x, use unicode only if you have to, and are prepared to provide good unit test coverage.
If you have the option of using Python 3.x however, you don't need to care - strings will be unicode with no extra effort.
In addition to Mihail's comment I would say: use Unicode, since it is the future. In Python 3.0 non-Unicode strings are gone and, as far as I know, the u"" prefixes will cause trouble, since they are gone too.
If you are dealing with severely constrained memory or disk space, use ASCII strings. In this case, you should additionally write your software in C or something even more compact :)
