Dangers of sys.setdefaultencoding('utf-8') - python

There is a trend of discouraging setting sys.setdefaultencoding('utf-8') in Python 2. Can anybody list real examples of problems with that? Arguments like it is harmful or it hides bugs don't sound very convincing.
UPDATE: Please note that this question is only about utf-8, it is not about changing default encoding "in general case".
Please give some examples with code if you can.

The original poster asked for code which demonstrates that the switch is harmful—except that it "hides" bugs unrelated to the switch.
Updates
[2020-11-01]: pip install setdefaultencoding
Eradicates the need to reload(sys) (from Thomas
Grainger).
[2019]: Personal experience with python3:
No unicode en/decoding problems. Reasons:
Got used to writing .encode('utf-8') .decode('utf-8') a (felt) 100 times a day.
Looking into libraries: Same. 'utf-8' either hardcoded or the silent default, in pretty much all the I/O done
Heavily improved byte strings support made it finally possible to convert I/O centric applications like mercurial.
Having to write .encode and .decode all the time got people aware of the difference between strings for humans and machines.
In my opinion, python2's bytestrings combined with (utf-8 default) decoding only before outputting to humans or unicode only formats would have been the technical superior approach, compared to decoding/encoding everything at ingress and at egress w/o actual need many many times. It depends on the application if something like the len() function is more practical, when returning the character count for humans, compared to returning the bytes used to store and forward by machines.
=> I think it's safe to say that UTF-8 everywhere saved the Unicode Sandwich Design.
Without that many libraries and applications, which only pass through strings w/o interpreting them could not work.
Summary of conclusions
(from 2017)
Based on both experience and evidence I've collected, here are the conclusions I've arrived at.
Setting the defaultencoding to UTF-8 nowadays is safe, except for specialised applications, handling files from non unicode ready systems.
The "official" rejection of the switch is based on reasons no longer relevant for a vast majority of end users (not library providers), so we should stop discouraging users to set it.
Working in a model that handles Unicode properly by default is far better suited for applications for inter-systems communications than manually working with unicode APIs.
Effectively, modifying the default encoding very frequently avoids a number of user headaches in the vast majority of use cases. Yes, there are situations in which programs dealing with multiple encodings will silently misbehave, but since this switch can be enabled piecemeal, this is not a problem in end-user code.
More importantly, enabling this flag is a real advantage is users' code, both by reducing the overhead of having to manually handle Unicode conversions, cluttering the code and making it less readable, but also by avoiding potential bugs when the programmer fails to do this properly in all cases.
Since these claims are pretty much the exact opposite of Python's official line of communication, I think the an explanation for these conclusions is warranted.
Examples of successfully using a modified defaultencoding in the wild
Dave Malcom of Fedora believed it is always right. He proposed, after investigating risks, to change distribution wide def.enc.=UTF-8 for all Fedora users.
Hard fact presented though why Python would break is only the hashing behavior I listed, which is never picked up by any other opponent within the core community as a reason to worry about or even by the same person, when working on user tickets.
Resume of Fedora: Admittedly, the change itself was described as "wildly unpopular" with the core developers, and it was accused of being inconsistent with previous versions.
There are 3000 projects alone at openhub doing it. They have a slow search frontend, but scanning over it, I estimate 98% are using UTF-8. Nothing found about nasty surprises.
There are 18000(!) github master branches with it changed.
While the change is "unpopular" at the core community its pretty popular in the user base. Though this could be disregarded, since users are known to use hacky solutions, I don't think this is a relevant argument due to my next point.
There are only 150 bugreports total on GitHub due to this. At a rate of effectively 100%, the change seems to be positive, not negative.
To summarize the existing issues people have run into, I've scanned through all of the aforementioned tickets.
Chaging def.enc. to UTF-8 is typically introduced but not removed in the issue closing process, most often as a solution. Some bigger ones excusing it as temporary fix, considering the "bad press" it has, but far more bug reporters are justglad about the fix.
A few (1-5?) projects modified their code doing the type conversions manually so that they did not need to change the default anymore.
In two instances I see someone claiming that with def.enc. set to UTF-8 leads to a complete lack of output entirely, without explaining the test setup. I could not verify the claim, and I tested one and found the opposite to be true.
One claims his "system" might depend on not changing it but we do not learn why.
One (and only one) had a real reason to avoid it: ipython either uses a 3rd party module or the test runner modified their process in an uncontrolled way (it is never disputed that a def.enc. change is advocated by its proponents only at interpreter setup time, i.e. when 'owning' the process).
I found zero indication that the different hashes of 'é' and u'é' causes problems in real-world code.
Python does not "break"
After changing the setting to UTF-8, no feature of Python covered by unit tests is working any differently than without the switch. The switch itself, though, is not tested at all.
It is advised on bugs.python.org to frustrated users
Examples here, here or here
(often connected with the official line of warning)
The first one demonstrates how established the switch is in Asia (compare also with the github argument).
Ian Bicking published his support for always enabling this behavior.
I can make my systems and communications consistently UTF-8, things will just get better. I really don't see a downside. But why does Python make it SO DAMN HARD [...] I feel like someone decided they were smarter than me, but I'm not sure I believe them.
Martijn Fassen, while refuting Ian, admitted that ASCII might have been wrong in the first place.
I believe if, say, Python 2.5, shipped with a default encoding of UTF-8, it wouldn't actually break anything. But if I did it for my Python, I'd have problems soon as I gave my code to someone else.
In Python3, they don't "practice what they preach"
While opposing any def.enc. change so harshly because of environment dependent code or implicitness, a discussion here revolves about Python3's problems with its 'unicode sandwich' paradigm and the corresponding required implicit assumptions.
Further they created possibilities to write valid Python3 code like:
>>> from 褐褑褒褓褔褕褖褗褘 import *
>>> def 空手(合氣道): あいき(ど(合氣道))
>>> 空手(う힑힜('👏 ') + 흾)
💔
DiveIntoPython recommends it.
In this thread, Guido himself advises a professional end user to use a process specific environt with the switch set to "create a custom Python environment for each project."
The fundamental reason the designers of Python's 2.x standard library don't want you to be able to set the default encoding in your app, is that the standard library is written with the assumption that the default encoding is fixed, and no guarantees about the correct workings of the standard library can be made when you change it. There are no tests for this situation. Nobody knows what will fail when. And you (or worse, your users) will come back to us with complaints if the standard library suddenly starts doing things you didn't expect.
Jython offers to change it on the fly, even in modules.
PyPy did not support reload(sys) - but brought it back on user request within a single day without questions asked. Compare with the "you are doing it wrong" attitude of CPython, claiming without proof it is the "root of evil".
Ending this list I confirm that one could construct a module which crashes because of a changed interpreter config, doing something like this:
def is_clean_ascii(s):
""" [Stupid] type agnostic checker if only ASCII chars are contained in s"""
try:
unicode(str(s))
# we end here also for NON ascii if the def.enc. was changed
return True
except Exception, ex:
return False
if is_clean_ascii(mystr):
<code relying on mystr to be ASCII>
I don't think this is a valid argument because the person who wrote this dual type accepting module was obviously aware about ASCII vs. non ASCII strings and would be aware of encoding and decoding.
I think this evidence is more than enough indication that changing this setting does not lead to any problems in real world codebases the vast majority of the time.

Because you don't always want to have your strings automatically decoded to Unicode, or for that matter your Unicode objects automatically encoded to bytes. Since you are asking for a concrete example, here is one:
Take a WSGI web application; you are building a response by adding the product of an external process to a list, in a loop, and that external process gives you UTF-8 encoded bytes:
results = []
content_length = 0
for somevar in some_iterable:
output = some_process_that_produces_utf8(somevar)
content_length += len(output)
results.append(output)
headers = {
'Content-Length': str(content_length),
'Content-Type': 'text/html; charset=utf8',
}
start_response(200, headers)
return results
That's great and fine and works. But then your co-worker comes along and adds a new feature; you are now providing labels too, and these are localised:
results = []
content_length = 0
for somevar in some_iterable:
label = translations.get_label(somevar)
output = some_process_that_produces_utf8(somevar)
content_length += len(label) + len(output) + 1
results.append(label + '\n')
results.append(output)
headers = {
'Content-Length': str(content_length),
'Content-Type': 'text/html; charset=utf8',
}
start_response(200, headers)
return results
You tested this in English and everything still works, great!
However, the translations.get_label() library actually returns Unicode values and when you switch locale, the labels contain non-ASCII characters.
The WSGI library writes out those results to the socket, and all the Unicode values get auto-encoded for you, since you set setdefaultencoding() to UTF-8, but the length you calculated is entirely wrong. It'll be too short as UTF-8 encodes everything outside of the ASCII range with more than one byte.
All this is ignoring the possibility that you are actually working with data in a different codec; you could be writing out Latin-1 + Unicode, and now you have an incorrect length header and a mix of data encodings.
Had you not used sys.setdefaultencoding() an exception would have been raised and you knew you had a bug, but now your clients are complaining about incomplete responses; there are bytes missing at the end of the page and you don't quite know how that happened.
Note that this scenario doesn't even involve 3rd party libraries that may or may not depend on the default still being ASCII. The sys.setdefaultencoding() setting is global, applying to all code running in the interpreter. How sure are you there are no issues in those libraries involving implicit encoding or decoding?
That Python 2 encodes and decodes between str and unicode types implicitly can be helpful and safe when you are dealing with ASCII data only. But you really need to know when you are mixing Unicode and byte string data accidentally, rather than plaster over it with a global brush and hope for the best.

First of all: Many opponents of changing default enc argue that its dumb because its even changing ascii comparisons
I think its fair to make clear that, compliant with the original question, I see nobody advocating anything else than deviating from Ascii to UTF-8.
The setdefaultencoding('utf-16') example seems to be always just brought forward by those who oppose changing it ;-)
With m = {'a': 1, 'é': 2} and the file 'out.py':
# coding: utf-8
print u'é'
Then:
+---------------+-----------------------+-----------------+
| DEF.ENC | OPERATION | RESULT (printed)|
+---------------+-----------------------+-----------------+
| ANY | u'abc' == 'abc' | True |
| (i.e.Ascii | str(u'abc') | 'abc' |
| or UTF-8) | '%s %s' % ('a', u'a') | u'a a' |
| | python out.py | é |
| | u'a' in m | True |
| | len(u'a'), len(a) | (1, 1) |
| | len(u'é'), len('é') | (1, 2) [*] |
| | u'é' in m | False (!) |
+---------------+-----------------------+-----------------+
| UTF-8 | u'abé' == 'abé' | True [*] |
| | str(u'é') | 'é' |
| | '%s %s' % ('é', u'é') | u'é é' |
| | python out.py | more | 'é' |
+---------------+-----------------------+-----------------+
| Ascii | u'abé' == 'abé' | False, Warning |
| | str(u'é') | Encoding Crash |
| | '%s %s' % ('é', u'é') | Decoding Crash |
| | python out.py | more | Encoding Crash |
+---------------+-----------------------+-----------------+
[*]: Result assumes the same é. See below on that.
While looking at those operations, changing the default encoding in your program might not look too bad, giving you results 'closer' to having Ascii only data.
Regarding the hashing ( in ) and len() behaviour you get the same then in Ascii (more on the results below). Those operations also show that there are significant differences between unicode and byte strings - which might cause logical errors if ignored by you.
As noted already: It is a process wide option so you just have one shot to choose it - which is the reason why library developers should really never ever do it but get their internals in order so that they do not need to rely on python's implicit conversions.
They also need to clearly document what they expect and return and deny input they did not write the lib for (like the normalize function, see below).
=> Writing programs with that setting on makes it risky for others to use the modules of your program in their code, at least without filtering input.
Note: Some opponents claim that def.enc. is even a system wide option (via sitecustomize.py) but latest in times of software containerisation (docker) every process can be started in its perfect environment w/o overhead.
Regarding the hashing and len() behaviour:
It tells you that even with a modified def.enc. you still can't be ignorant about the types of strings you process in your program. u'' and '' are different sequences of bytes in the memory - not always but in general.
So when testing make sure your program behaves correctly also with non Ascii data.
Some say the fact that hashes can become unequal when data values change - although due to implicit conversions the '==' operations remain equal - is an argument against changing def.enc.
I personally don't share that since the hashing behaviour just remains the same as w/o changing it. Have yet to see a convincing example of undesired behaviour due to that setting in a process I 'own'.
All in all, regarding setdefaultencoding("utf-8"): The answer regarding if its dumb or not should be more balanced.
It depends.
While it does avoid crashes e.g. at str() operations in a log statement - the price is a higher chance for unexpected results later since wrong types make it longer into code whose correct functioning depends on a certain type.
In no case it should be the alternative to learning the difference between byte strings and unicode strings for your own code.
Lastly, setting default encoding away from Ascii does not make your life any easier for common text operations like len(), slicing and comparisons - should you assume than (byte)stringyfying everything with UTF-8 on resolves problems here.
Unfortunately it doesn't - in general.
The '==' and len() results are far more complex problem than one might think - but even with the same type on both sides.
W/o def.enc. changed, "==" fails always for non Ascii, like shown in the table. With it, it works - sometimes:
Unicode did standardise around a million symbols of the world and gave them a number - but there is unfortunately NOT a 1:1 bijection between glyphs displayed to a user in output devices and the symbols they are generated from.
To motivate you research this: Having two files, j1, j2 written with the same program using the same encoding, containing user input:
>>> u1, u2 = open('j1').read(), open('j2').read()
>>> print sys.version.split()[0], u1, u2, u1 == u2
Result: 2.7.9 José José False (!)
Using print as a function in Py2 you see the reason: Unfortunately there are TWO ways to encode the same character, the accented 'e':
>>> print (sys.version.split()[0], u1, u2, u1 == u2)
('2.7.9', 'Jos\xc3\xa9', 'Jose\xcc\x81', False)
What a stupid codec you might say but its not the fault of the codec. Its a problem in unicode as such.
So even in Py3:
>>> u1, u2 = open('j1').read(), open('j2').read()
>>> print sys.version.split()[0], u1, u2, u1 == u2
Result: 3.4.2 José José False (!)
=> Independent of Py2 and Py3, actually independent of any computing language you use: To write quality software you probably have to "normalise" all user input. The unicode standard did standardise normalisation.
In Python 2 and 3 the unicodedata.normalize function is your friend.

Real-word example #1
It doesn't work in unit tests.
The test runner (nose, py.test, ...) initializes sys first, and only then discovers and imports your modules. By that time it's too late to change default encoding.
By the same virtue, it doesn't work if someone runs your code as a module, as their initialisation comes first.
And yes, mixing str and unicode and relying on implicit conversion only pushes the problem further down the line.

One thing we should know is
Python 2 use sys.getdefaultencoding() to decode/encode between str and unicode
so if we change default encoding, there will be all kinds of incompatible issues. eg:
# coding: utf-8
import sys
print "你好" == u"你好"
# False
reload(sys)
sys.setdefaultencoding("utf-8")
print "你好" == u"你好"
# True
More examples:
https://pythonhosted.org/kitchen/unicode-frustrations.html
That said, I remember there is some blog suggesting use unicode whenever possible, and only bit string when deal with I/O. I think if your follow this convention, life will be much easier. More solutions can be found:
https://pythonhosted.org/kitchen/unicode-frustrations.html#a-few-solutions

Related

Current idiom for removing 'surrogateescape' characters fron a decoded string

Armin Ronacher, http://lucumr.pocoo.org/2013/7/2/the-updated-guide-to-unicode/
If you for instance pass [the result of os.fsdecode() or equivalent] to a template engine you [sometimes get a UnicodeEncodeError] somewhere else entirely and because the encoding happens at a much later stage you no longer know why the string was incorrect. If you detect that error when it happens the issue becomes much easier to debug
Armin suggests a function
def remove_surrogate_escaping(s, method='ignore'):
assert method in ('ignore', 'replace'), 'invalid removal method'
return s.encode('utf-8', method).decode('utf-8')
Nick Coghlan, 2014, [Python-Dev] Cleaning up surrogate escaped strings
The current proposal on the issue tracker is to ... take advantage of
the existing error handlers:
def convert_surrogateescape(data, errors='replace'):
return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
That code is short, but semantically dense - it took a few iterations to
come up with that version. (Added bonus: once you're alerted to the
possibility, it's trivial to write your own version for existing Python 3
versions. The standard name just makes it easier to look up when you come
across it in a piece of code, and provides the option of optimising it
later if it ever seems worth the extra work)
The functions are slightly different. The second was written with knowledge of the first.
Since Python 3.5, the backslashreplace error handler now works on decoding as well as encoding. The first approach is not designed to use backslashreplace e.g. an error decoding the byte 0xff would get printed as "\udcff". The second approach is designed to solve this; it would print "\xff".
If you did not need backslashreplace, you might prefer the first version if you had the misfortune to be supporting Python < 3.5 (including polyglot 2/3 code, ouch).
Question
Is there a better idiom for this purpose yet? Or do we still use this drop-in function?
Nick referred to an issue for adding such a function to the codecs module. As of 2019 the function has not been added, and the ticket remains open.
The latest comment says
msg314682 Nick Coghlan, 2018
A recent discussion on python-ideas also introduced me to the third party library, "ftfy", which offers a wide range of tools for cleaning up improperly decoded data.
That includes a lone surrogate fixer: ftfy.fixes.fix_surrogates(text)
...
I do not find the function in ftfy appealing. The documentation does not say so, but it appears to be designed to handle both surrogateescape and ... be part of a workaround for CESU-8, or something like that ?
Replace 16-bit surrogate codepoints with the characters they represent (when properly paired), or with � otherwise.

What character-shifting / pseudo-encryption algorithm is used here?

This is a cry for help from all you cryptologists out there.
Scenario: I have a Windows application (likely built with VC++ or VB and subsequently moved to .Net) that saves some passwords in an XML file. Given a password A0123456789abcDEFGH, the resulting "encrypted" value is 04077040940409304092040910409004089040880408704086040850404504044040430407404073040720407104070
Looking at the string, I've figured out that this is just character shifting: '04' delimits actual character values, which are decimal; if I then subtract these values from 142, I get back the original ASCII code. In Jython (2.2), my decryption routine looks like this (EDITED thanks to suggestions in comments):
blocks = [ pwd[i:i+5] for i in range(0, len(pwd), 5) ]
# now a block looks like '04093'
decrypted = [ chr( 142 - int(block[3:].lstrip('0')) ) for block in blocks ]
This is fine for ASCII values (127 in total) and a handful of accented letters, but 8-bit charsets have another 128 characters; limiting accepted values to 142 doesn't make sense from a decimal perspective.
EDIT: I've gone rummaging through our systems and found three non-ASCII chars:
è 03910
Ø 03926
Õ 03929
From these values, it looks like actually subtracting the 4-number block from 4142 (leaving only '0' as separator) gives me the correct character.
So my question is:
is anybody familiar with this sort of obfuscation scheme in the Windows world? Could this be the product of a standard library function? I'm not very familiar with Win32 and .Net development, to be honest, so I might be missing something very simple.
If it's not a library function, can you think of a better method to de-obfuscate these values without resorting to the magic 142 number, i.e. a scheme that can actually be applied on non-ASCII characters without special-casing them? I'm crap at bit shifting and all that, so again I might be missing something obvious to the trained eye.
is anybody familiar with this sort of obfuscation scheme in the Windows world?
Once you understand it correctly, it's just a trivial rotation cipher like ROT13.
Why would anyone use this?
Well, in general, this is very common. Let's say you have some data that you need to obfuscate. But the decryption algorithm and key have to be embedded in software that the viewers have. There's no point using something fancy like AES, because someone can always just dig the algorithm and key out of your code instead of cracking AES. An encryption scheme that's even marginally harder to crack than finding the hidden key is just as good as a perfect encryption scheme—that is, good enough to deter casual viewers, and useless against serious attackers. (Often you aren't even really worried about stopping attacks, but about proving after the fact that your attacker must have acted in bad faith for contractual/legal reasons.) So, you use either a simple rotation cipher, or a simple xor cipher—it's fast, it's hard to get wrong and easy to debug, and if worst comes to worst you can even decrypt it manually to recover corrupted data.
As for the particulars:
If you want to handle non-ASCII characters, you pretty much have to use Unicode. If you used some fixed 8-bit charset, or the local system's OEM charset, you wouldn't be able to handle passwords from other machines.
A Python script would almost certainly handle Unicode characters, because in Python you either deal in bytes in a str, or Unicode characters in a unicode. But a Windows C or .NET app would be much more likely to use UTF-16, because Windows native APIs deal in UTF-16-LE code points in a WCHAR * (aka a string of 16-bit words).
So, why 4142? Well, it really doesn't matter what the key is. I'm guessing some programmer suggested 42. His manager then said "That doesn't sound very secure." He sighed and said, "I already explained why no key is going to be any more secure than… you know what, forget it, what about 4142?" The manager said, "Ooh, that sounds like a really secure number!" So that's why 4142.
If it's not a library function, can you think of a better method to de-obfuscate these values without resorting to the magic 142 number.
You do need to resort to the magic 4142, but you can make this a lot simpler:
def decrypt(block):
return struct.pack('>H', (4142 - int(block, 10)) % 65536)
So, each block of 5 characters is the decimal representation of a UTF-16 code unit, subtracted from 4142, using C unsigned-short wraparound rules.
This would be trivial to implement in native Windows C, but it's slightly harder in Python. The best transformation function I can come up with is:
def decrypt_block(block):
return struct.pack('>H', (4142 - int(block, 10)) % 65536)
def decrypt(pwd):
blocks = [pwd[i:i+5] for i in range(0, len(pwd), 5)]
return ''.join(map(decrypt_block, blocks)).decode('utf-16-be')
This would be a lot more trivial in C or C#, which is probably what they implemented things in, so let me explain what I'm doing.
You already know how to transform the string into a sequence of 5-character blocks.
My int(block, 10) is doing the same thing as your int(block.lstrip('0')), making sure that a '0' prefix doesn't make Python treat it as an octal numeral instead of decimal, but more explicitly. I don't think this is actually necessary in Jython 2.2 (it definitely isn't in more modern Python/Jython), but I left it just in case.
Next, in C, you'd just do unsigned short x = 4142U - y;, which would automatically underflow appropriately. Python doesn't have unsigned short values, just signed int, so we have to do the underflow manually. (Because Python uses floored division and remainder, the sign is always the same as the divisor—this wouldn't be true in C, at least not C99 and most platforms' C89.)
Then, in C, we'd just cast the unsigned short to a 16-bit "wide character"; Python doesn't have any way to do that, so we have to use struct.pack. (Note that I'm converting it to big-endian, because I think that makes this easier to debug; in C you'd convert to native-endian, and since this is Windows, that would be little-endian.)
So, now we've got a sequence of 2-character UTF-16-BE code points. I just join them into one big string, then decode it as UTF-16-BE.
If you really want to test that I've got this right, you'll need to find characters that aren't just non-ASCII, but non-Western. In particular, you need:
A character that's > U+4142 but < U+10000. Most CJK ideographs, like U+7000 (瀀), fit the bill. This should appear as '41006', because that's 4142-0x7000 rolled over as an unsigned short.
A character that's >= U+10000. This includes uncommon CJK characters, specialized mathematical characters, characters from ancient scripts, etc. For example, the Old Italic character U+10300 (𐌀) encodes to the surrogate pair (0xd800, 0xdf00); 4142-0xd800=14382, and 4142-0xdf00=12590, so you'd get '1438212590'.
The first will be hard to find—even most Chinese- and Japanese-native programmers I've dealt with use ASCII passwords. And the second, even more so; nobody but a historical linguistics professor is likely to even think of using archaic scripts in their passwords. By Murphy's Law, if you write the correct code, it will never be used, but if you don't, it's guaranteed to show up as soon as you ship your code.

Pitfalls in my code for detecting text file encoding with Python?

I know more about bicycle repair, chainsaw use and trench safety than I do Python or text encoding; with that in mind...
Python text encoding seems to be a perennial issue (my own question: Searching text files' contents with various encodings with Python?, and others I've read: 1, 2. I've taken a crack at writing some code to guess the encoding below.
In limited testing this code seems to work for my purposes* without me having to know an excess about the first three bytes of text encoding and the situations where those data aren't informative.
*My purposes are:
Have a dependency-free snippet I can use with a moderate-high degree of success,
Scan a local workstation for text based log files of any encoding and identify them as a file I am interested in based on their contents (which requires the file to be opened with the proper encoding)
for the challenge of getting this to work.
Question: What are the pitfalls with using a what I assume to be a klutzy method of comparing and counting characters like I do below? Any input is greatly appreciated.
def guess_encoding_debug(file_path):
"""
DEBUG - returns many 2 value tuples
Will return list of all possible text encodings with a count of the number of chars
read that are common characters, which might be a symptom of success.
SEE warnings in sister function
"""
import codecs
import string
from operator import itemgetter
READ_LEN = 1000
ENCODINGS = ['ascii','cp1252','mac_roman','utf_8','utf_16','utf_16_le',\
'utf_16_be','utf_32','utf_32_le','utf_32_be']
#chars in the regular ascii printable set are BY FAR the most common
#in most files written in English, so their presence suggests the file
#was decoded correctly.
nonsuspect_chars = string.printable
#to be a list of 2 value tuples
results = []
for e in ENCODINGS:
#some encodings will cause an exception with an incompatible file,
#they are invalid encoding, so use try to exclude them from results[]
try:
with codecs.open(file_path, 'r', e) as f:
#sample from the beginning of the file
data = f.read(READ_LEN)
nonsuspect_sum = 0
#count the number of printable ascii chars in the
#READ_LEN sized sample of the file
for n in nonsuspect_chars:
nonsuspect_sum += data.count(n)
#if there are more chars than READ_LEN
#the encoding is wrong and bloating the data
if nonsuspect_sum <= READ_LEN:
results.append([e, nonsuspect_sum])
except:
pass
#sort results descending based on nonsuspect_sum portion of
#tuple (itemgetter index 1).
results = sorted(results, key=itemgetter(1), reverse=True)
return results
def guess_encoding(file_path):
"""
Stupid, simple, slow, brute and yet slightly accurate text file encoding guessing.
Will return one likely text encoding, though there may be others just as likely.
WARNING: DO NOT use if your file uses any significant number of characters
outside the standard ASCII printable characters!
WARNING: DO NOT use for critical applications, this code will fail you.
"""
results = guess_encoding_debug(file_path)
#return the encoding string (second 0 index) from the first
#result in descending list of encodings (first 0 index)
return results[0][0]
I am assuming it would be slow compared to chardet, which I am not particularly familiar with. Also less accurate. They way it is designed, any roman character based language that uses accents, umlauts, etc. will not work, at least not well. It will be hard to know when it fails. However, most text in English, including most programming code, would largely be written with string.printable on which this code depends.
External libraries may be an option in the future, but for now I want to avoid them because:
This script will be run on multiple company computers on and off the network with various versions of python, so the fewer complications the better. When I say 'company' I mean small non-profit of social scientists.
I am in charge of collecting the logs from GPS data processing, but I am not the systems administrator - she is not a python programmer and the less time I take of hers the better.
The installation of Python that is generally available at my company is installed with a GIS software package, and is generally better when left alone.
My requirements aren't too strict, I just want to identify the files I am interested in and use other methods to copy them to an archive. I am not reading the full contents to memory to manipulate, appending or to rewriting the contents.
It seems like a high-level programming language should have some way of accomplishing this on its own. While "seems like" is a shaky foundation for any endeavor, I wanted to try and see if I could get it to work.
Probably the simplest way to find out how well your code works is to take the test suites for the other existing libraries, and use those as a base to create your own comprehensive test suite. They you will know if your code works for all of those cases, and you can also test for all of the cases you care about.

how to avoid python numeric literals beginning with "0" being treated as octal?

I am trying to write a small Python 2.x API to support fetching a
job by jobNumber, where jobNumber is provided as an integer.
Sometimes the users provide ajobNumber as an integer literal
beginning with 0, e.g. 037537. (This is because they have been
coddled by R, a language that sanely considers 037537==37537.)
Python, however, considers integer literals starting with "0" to
be OCTAL, thus 037537!=37537, instead 037537==16223. This
strikes me as a blatant affront to the principle of least
surprise, and thankfully it looks like this was fixed in Python
3---see PEP 3127.
But I'm stuck with Python 2.7 at the moment. So my users do this:
>>> fetchJob(037537)
and silently get the wrong job (16223), or this:
>>> fetchJob(038537)
File "<stdin>", line 1
fetchJob(038537)
^
SyntaxError: invalid token
where Python is rejecting the octal-incompatible digit.
There doesn't seem to be anything provided via __future__ to
allow me to get the Py3K behavior---it would have to be built-in
to Python in some manner, since it requires a change to the lexer
at least.
Is anyone aware of how I could protect my users from getting the
wrong job in cases like this? At the moment the best I can think
of is to change that API so it take a string instead of an int.
At the moment the best I can think of is to change that API so it take a string instead of an int.
Yes, and I think this is a reasonable option given the situation.
Another option would be to make sure that all your job numbers contain at least one digit greater than 7 so that adding the leading zero will give an error immediately instead of an incorrect result, but that seems like a bigger hack than using strings.
A final option could be to educate your users. It will only take five minutes or so to explain not to add the leading zero and what can happen if you do. Even if they forget or accidentally add the zero due to old habits, they are more likely to spot the problem if they have heard of it before.
Perhaps you could take the input as a string, strip leading zeros, then convert back to an int?
test = "001234505"
test = int(test.lstrip("0")) # 1234505

How do I handle Python unicode strings with null-bytes the 'right' way?

Question
It seems that PyWin32 is comfortable with giving null-terminated unicode strings as return values. I would like to deal with these strings the 'right' way.
Let's say I'm getting a string like: u'C:\\Users\\Guest\\MyFile.asy\x00\x00sy'. This appears to be a C-style null-terminated string hanging out in a Python unicode object. I want to trim this bad boy down to a regular ol' string of characters that I could, for example, display in a window title bar.
Is trimming the string off at the first null byte the right way to deal with it?
I didn't expect to get a return value like this, so I wonder if I'm missing something important about how Python, Win32, and unicode play together... or if this is just a PyWin32 bug.
Background
I'm using the Win32 file chooser function GetOpenFileNameW from the PyWin32 package. According to the documentation, this function returns a tuple containing the full filename path as a Python unicode object.
When I open the dialog with an existing path and filename set, I get a strange return value.
For example I had the default set to: C:\\Users\\Guest\\MyFileIsReallyReallyReallyAwesome.asy
In the dialog I changed the name to MyFile.asy and clicked save.
The full path part of the return value was: u'C:\Users\Guest\MyFile.asy\x00wesome.asy'`
I expected it to be: u'C:\\Users\\Guest\\MyFile.asy'
The function is returning a recycled buffer without trimming off the terminating bytes. Needless to say, the rest of my code wasn't set up for handling a C-style null-terminated string.
Demo Code
The following code demonstrates null-terminated string in return value from GetSaveFileNameW.
Directions: In the dialog change the filename to 'MyFile.asy' then click Save. Observe what is printed to the console. The output I get is u'C:\\Users\\Guest\\MyFile.asy\x00wesome.asy'.
import win32gui, win32con
if __name__ == "__main__":
initial_dir = 'C:\\Users\\Guest'
initial_file = 'MyFileIsReallyReallyReallyAwesome.asy'
filter_string = 'All Files\0*.*\0'
(filename, customfilter, flags) = \
win32gui.GetSaveFileNameW(InitialDir=initial_dir,
Flags=win32con.OFN_EXPLORER, File=initial_file,
DefExt='txt', Title="Save As", Filter=filter_string,
FilterIndex=0)
print repr(filename)
Note: If you don't shorten the filename enough (for example, if you try MyFileIsReally.asy) the string will be complete without a null byte.
Environment
Windows 7 Professional 64-bit (no service pack), Python 2.7.1, PyWin32 Build 216
UPDATE: PyWin32 Tracker Artifact
Based on the comments and answers I have received so far, this is likely a pywin32 bug so I filed a tracker artifact.
UPDATE 2: Fixed!
Mark Hammond reported in the tracker artifact that this is indeed a bug. A fix was checked in to rev f3fdaae5e93d, so hopefully that will make the next release.
I think Aleksi Torhamo's answer below is the best solution for versions of PyWin32 before the fix.
I'd say it's a bug. The right way to deal with it would probably be fixing pywin32, but in case you aren't feeling adventurous enough, just trim it.
You can get everything before the first '\x00' with filename.split('\x00', 1)[0].
This doesn't happen on the version of PyWin32/Windows/Python I tested; I don't get any nulls in the returned string even if it's very short. You might investigate if a newer version of one of the above fixes the bug.
ISTR that I had this issue some years ago, then I discovered that such Win32 filename-dialog-related functions return a sequence of 'filename1\0filename2\0...filenameN\0\0', while including possible garbage characters depending on the buffer that Windows allocated.
Now, you might prefer a list instead of the raw return value, but that would be a RFE, not a bug.
PS When I had this issue, I quite understood why one would expect GetOpenFileName to possibly return a list of filenames, while I couldn't imagine why GetSaveFileName would. Perhaps this is considered as API uniformity. Who am I to know, anyway?

Categories

Resources