As you might know, the Python 2 stdlib csv module doesn't "properly" support Unicode. It expects binary strings and writes them to the file exactly as it gets them.
To me this always seemed a bit counter-intuitive: I would tell people to work with unicode strings internally and serialize properly for the external world by opening files with codecs.open(..., encoding='...'), but in the csv module's case you need to do that encoding manually for the lists / dicts you pass in.
It always puzzled me why this is, and now that a colleague has asked me again, I have to admit that I don't know any reason for it other than "probably grown that way and never fixed".
It seems that even PEP 305 already contained TODOs for Unicode and references to codecs.open.
Is there some wise Python guru here who knows and could enlighten us?
Python 2's csv doesn't support Unicode because CSV itself doesn't support Unicode.
CSV as defined in RFC 4180 and in common usage is no more than a sequence of bytes. There is no standard defining how those bytes map to readable text, and different CSV-handling tools have divergent behaviours. If Python's csv imposed particular encoding rules, they would be wrong in many cases. Better to let the user decide by manually encoding/decoding using whichever convention works for that application.
Python 3's csv gains Unicode support insofar as it has to talk to text IO streams (as these are now much more common). These have their own encoding, but if you use one with its default encoding for CSV, the results will still be wrong as often as not.
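To make that concrete, here is a minimal sketch of both approaches; the choice of UTF-8, the file name, and the sample rows are illustrative assumptions, not anything the csv module mandates:

import csv

# Python 2: csv deals only in byte strings, so encode each field yourself
# before handing the row to the writer.
def write_unicode_row_py2(writer, row):
    writer.writerow([field.encode('utf-8') for field in row])

# Python 3: open the file in text mode with an explicit encoding and let
# the io layer do the encoding for you.
with open('out.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'city'])
    writer.writerow(['José', 'München'])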
Related
Is there a systematic way to run Python 3.x with all strings defaulting to bytes? I find that when "crossing boundaries", for example talking to msgpack, Elixir, or ZeroMQ, I'm having to do all sorts of contortions, constantly figuring out whether strings or bytes will be returned. It's a complete pain and adds a layer of cognitive friction on top of my actual problem.
For example, I have

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--nodename")
args = parser.parse_args()
and then to get the nodename I need to do
str(args.nodename)
However, ZeroMQ wants bytes, and I'm going to use the nodename everywhere I use ZeroMQ. So I make it bytes up front with
nodename.encode()
But now every time I want to use it with a string, say for concatenation, I can't, because I have to encode the string first. And half the libraries take a perfectly good bytes value and return it to you as a string, at which point you have to convert it back to bytes if you want to send it outside Python. For a "glue language" this is a total disaster. I'm having to do this encode/decode dance whenever I cross the boundary, and the worst part is that it doesn't seem consistent across libraries whether they coerce you to strings or bytes when you send them bytes.
In Python 3, is there an option to forgo Unicode-by-default (since it does, after all, say "by default", suggesting it can be changed), or is the answer "stick with 2.7"?
In short, no. And you really don't want to try. You mention contortions but don't give specific examples, so it's hard to offer specific advice.
Neither, in this author's humble opinion, do you want to stick with Python 2.7, but if you don't need bugfixes and language updates after 2020 it won't matter.
The point is precisely that all translation between bytes and text should take place at the boundaries of your code. Decode (from whatever external representation is used) on input, encode (to whatever encoding you wish or need to use) on output. Python 3 is written to enforce this distinction, but understanding the separation should give you proper control and reduce your frustrations.
In Python 3, opening a file in text mode causes readline and friends to produce Unicode strings. You can specify the encoding when you open the file if you wish. Opening a file in binary mode causes them to produce bytestrings, to which you will have to apply your own decoding to make sense of them as text.
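A quick illustration (the file name and encoding are assumptions):

# Text mode: readline() hands you str, already decoded.
with open('notes.txt', 'r', encoding='utf-8') as f:
    line = f.readline()          # str

# Binary mode: you get raw bytes and must decode them yourself.
with open('notes.txt', 'rb') as f:
    raw = f.readline()           # bytes
    line = raw.decode('utf-8')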
Whether the Python API for a particular system returns bytes or text is up to its author, and calling Python 3 functions that expect strings with bytestring arguments is likely to lead to confusion and unhappiness. All external communications (network, files, etc.) must necessarily take place in terms of bytestrings, so be clear what is text (decoding on input and encoding on output) and deal with the outside world exclusively in bytestrings.
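Applied to the ZeroMQ case from the question, the discipline looks something like this; a sketch assuming pyzmq, with an illustrative endpoint and node name:

import zmq

nodename = "node-7"                    # str everywhere inside the program

ctx = zmq.Context()
sock = ctx.socket(zmq.PUSH)
sock.connect("tcp://localhost:5555")   # illustrative endpoint

sock.send(nodename.encode("utf-8"))    # encode only at the boundary

# The receiving side reverses it:
#   nodename = sock.recv().decode("utf-8")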
There are always, of course, difficult corner cases. I don't envy the maintainers of the email package, who have to deal with messages containing 6-bit encoded bytestreams themselves potentially containing attachments in multiple different encodings. But then I don't usually have to work in such complex environments, and hopefully neither do you.
My boss asked me to put the following lines (from this answer) into a Python 3 script I wrote:
import sys
import codecs

# Detach the underlying binary buffer from sys.stdout and wrap it in a
# stream writer that always encodes output as UTF-8, regardless of locale.
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
He says it's to prevent UnicodeEncodeErrors when printing Unicode characters in non-UTF8 locales. I am wondering whether this is really necessary, and why Python wouldn't handle encoding/decoding correctly without boilerplate code.
What is the most Pythonic way to make Python scripts compatible with different operating system locales? And what does this boilerplate code do exactly?
The answer provided here has a good excerpt from the Python mailing list regarding your question; I don't think this boilerplate is necessary.
The only supported default encodings in Python are:

Python 2.x: ASCII
Python 3.x: UTF-8

If you change these, you are on your own and strange things will start to happen. The default encoding does not only affect the translation between Python and the outside world, but also all internal conversions between 8-bit strings and Unicode.

Hacks like what's happening in the pango module (setting the default encoding to 'utf-8' by reloading the site module in order to get the sys.setdefaultencoding() API back) are just downright wrong and will cause serious problems since Unicode objects cache their default encoded representation.

Please don't enable the use of a locale based default encoding.

If all you want to achieve is getting the encodings of stdout and stdin correctly setup for pipes, you should instead change the .encoding attribute of those (only).

-- Marc-Andre Lemburg, eGenix.com
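If all you want is UTF-8 on stdout regardless of locale, there is also a less invasive route on Python 3.7+ than replacing sys.stdout; a minimal sketch (choosing 'replace' as the error handler is an assumption, not a requirement):

import sys

# reconfigure() (Python 3.7+) changes the stream's encoding in place
# instead of swapping out the sys.stdout object.
if sys.stdout.encoding.lower() != "utf-8":
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

print("prints even under a non-UTF-8 locale")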
I'm currently working on a project where I need to transfer objects from Ruby to Python and back again; obviously serialization is the way to go. I've looked at things like YAML but decided to write my own, as I didn't want to deal with the libraries' dependencies and such when it came time to distribute. I've written up how this serialization format works here.
My question: as this format is intended to work cross-language between Ruby and Python, how should I serialize Ruby's symbols? I'm not aware of an object that works the same way in Python. Should a dump containing a symbol fail? Should I just serialize it as a string? What would be best?
Doesn't that depend on what your project needs? If symbols are important, you'll need some way to deal with them.
I'm not a Ruby programmer, but from what I've just read, I think converting them to strings is probably easiest. The standard Python interpreter will reuse memory for identical short strings, which seems to be a key reason suggested for using symbols.
EDIT: If it needs to work for other programmers, passing values back and forth shouldn't change them. So you either have to handle symbols properly, or throw an error straight away. It should be simple enough in Python:
class Symbol(str):
    """Marks a value as a Ruby symbol rather than a plain string."""
    pass

# In serialising code (serialise_as_symbol is a placeholder for whatever
# your format does with symbols):
if isinstance(x, Symbol):
    serialise_as_symbol(x)
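One hypothetical shape for that placeholder and its inverse; the ':'-prefix wire convention below is invented for illustration (it mimics Ruby's literal syntax) and is not taken from the asker's format:

def serialise_as_symbol(x):
    # Hypothetical wire form: prefix with ':' like Ruby's symbol literals.
    return ':' + x

def deserialise_token(token):
    # Reverse the convention on the way back in.
    if token.startswith(':'):
        return Symbol(token[1:])
    return token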
Any reason you're not using a standard data interchange format like JSON or XML? They seem to be acceptable to countless applications, services, and programmers.
If symbols are a stumbling block, then you have three choices: don't allow them, convert them to strings on the fly, or figure out a way to make them universal and/or innocuous in other languages.
I'm trying to make my program faster, so I'm profiling it. Right now the top reason is:
566 1.780 0.003 1.780 0.003 (built-in method decode)
What is this exactly? I never call 'decode' anywhere in my code. It reads text files, but I don't believe they are unicode-encoded.
Most likely, this is the decode method of string objects.
(Answering #Claudiu's latest question, weirdly hidden in a comment...?!-)... To really speed up pickling, try Unladen Swallow -- most of its ambitious targets are still to come, but it DOES already give at least a 20-25% speedup in pickling and unpickling.
Presumably this is str.decode ... search your source for "decode". If it's not in your code, look at the Python library routines that show up in the profile results. It's highly unlikely to be anything to do with cPickle. Care to show us a few more "reasons", preferably with the column headings, to give us a wider view of your problem?
Can you explain the connection between "using cPickle" and "some test cases would run faster"?
You left the X and Y out of "Is there anything that will do task X faster than resource Y?" ... Update: so you were asking about cPickle. What are you using for the (optional) protocol arg of cPickle.dump() and/or cPickle.dumps()?
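For reference, the protocol choice alone can change pickling speed considerably; a minimal Python 2 sketch with made-up data:

import cPickle

data = {'key': u'value'}   # made-up sample data
# Protocol 0 (the default) is ASCII-based and slow; the highest binary
# protocol available is usually much faster and more compact.
blob = cPickle.dumps(data, cPickle.HIGHEST_PROTOCOL)
restored = cPickle.loads(blob)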
I believe decode is called any time you are converting byte strings into unicode strings. I am guessing you have a large amount of unicode data. I'm not sure how the internals of pickle work, but it sounds like that unicode data gets decoded again when unpickled?
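Either way, pstats can show exactly which functions are making those decode calls; a minimal sketch, assuming the profile was saved to a file named profile.out:

import pstats

p = pstats.Stats('profile.out')
# List every function in the profile that calls something matching 'decode'.
p.print_callers('decode')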
Is it considered good practice to pick Unicode strings over regular strings when coding in Python? I mainly work on the Windows platform, where most string types are Unicode these days (e.g. .NET String, '_UNICODE' turned on by default on a new C++ project, etc.). Therefore, I tend to think that cases where non-Unicode string objects are used are rather rare. Anyway, I'm curious what Python practitioners do in real-world projects.
From my practice -- use unicode.
At the beginning of one project we used the usual strings; however, as the project grew, we were implementing new features and using new third-party libraries. In that mess of non-unicode/unicode strings, some functions started failing. We started spending time localizing these problems and fixing them. However, some third-party modules didn't support unicode and started failing after we switched to it (though this is the exception rather than the rule).
I also have some experience with having to rewrite third-party modules (e.g. SendKeys) because they did not support unicode. If they had been written with unicode from the beginning, it would have been better :)
So I think today we should use unicode.
P.S. All that mess above is only my humble opinion :)
As you ask this question, I suppose you are using Python 2.x.
Python 3.0 changed quite a lot in string representation, and all text is now unicode.
I would go for unicode in any new project - in a way compatible with the switch to Python 3.0 (see details).
Yes, use unicode.
Some hints:
When doing input/output in any sort of binary format, decode directly after reading and encode directly before writing, so that you never need to mix byte strings and unicode; mixing them tends to lead to UnicodeDecodeErrors or UnicodeEncodeErrors sooner or later (see the sketch after this list).
[Forget about this one; my explanations just made it even more confusing. It's only an issue when porting to Python 3, and you can worry about it then.]
Common Python newbie errors with Unicode (not saying you are a newbie, but this may be read by newbies): Don't confuse encode and decode. Remember, UTF-8 is an ENcoding, so you ENcode Unicode to UTF-8 and DEcode from it.
Do not fall into the temptation of setting the default encoding in Python (by setdefaultencoding in sitecustomize.py or similar) to whatever you use most. That is just going to give you problems if you reinstall or move to another computer or suddenly need to use another encoding. Be explicit.
Remember, not all of Python 2's standard library accepts unicode. If you feed a method unicode and it doesn't work but it should, try feeding it ascii and see. Example: urllib.urlopen(), which fails with unhelpful errors if you give it a unicode object instead of a string.
Hm. That's all I can think of now!
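Here is the decode-early/encode-late hint from above as a minimal Python 2 sketch; the file names and the choice of UTF-8 are illustrative:

with open('data.txt', 'rb') as f:
    text = f.read().decode('utf-8')    # unicode from here on

text = text.upper()                    # work purely in unicode

with open('out.txt', 'wb') as f:
    f.write(text.encode('utf-8'))      # bytes only at the boundary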
It can be tricky to consistently use unicode strings in Python 2.x, be it because somebody inadvertently uses the more natural str(blah) where they meant unicode(blah), forgets the u prefix on string literals, or runs into third-party module incompatibilities. So in Python 2.x, use unicode only if you have to and you are prepared to provide good unit test coverage.
If you have the option of using Python 3.x however, you don't need to care - strings will be unicode with no extra effort.
In addition to Mihail's comment, I would say: use Unicode, since it is the future. In Python 3.0, non-Unicode strings will be gone and, as far as I know, all the "u" prefixes will cause trouble, since they are also gone.
If you are dealing with severely constrained memory or disk space, use ASCII strings. In this case, you should additionally write your software in C or something even more compact :)