Python 'ascii' encode problems in print statement

System: python 3.4.2 on linux.
I'm working on a Django application (the application itself is irrelevant), and I encountered a problem where it throws
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
when print is called (!). After quite a bit of digging, I discovered I should check
>>> sys.getdefaultencoding()
'utf-8'
but it was utf-8, as expected. I also noticed that os.path.exists throws the same exception when used with a Unicode string. So I checked
>>> sys.getfilesystemencoding()
'ascii'
When I set LANG=en_US.UTF-8 the issue disappeared. I now understand why os.path.exists had problems with that. But I have absolutely no clue why print is affected by the filesystem setting. Is there a third setting I'm missing? Or does Python just assume the LANG environment variable is to be trusted for everything?
Also... I don't get the reasoning here. LANG does not say what encoding the filenames on disk use; it has nothing to do with that. It is set for the current environment, not for the filesystem. Why is Python using this setting for filesystem filenames? It makes applications very fragile, as all file operations simply break when run in an environment where LANG is unset or set to C (not uncommon, especially when a web app is run as root or as a new user created specifically for the daemon).
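For reference, a quick way to inspect all the settings in play here (the third one, locale.getpreferredencoding(), turns out to be the one print uses, as the answer below explains):

import locale
import sys

print(sys.getdefaultencoding())            # codec for str; always 'utf-8' on Python 3
print(sys.getfilesystemencoding())         # derived from LC_CTYPE/LANG; 'ascii' under LANG=C
print(locale.getpreferredencoding(False))  # what text streams such as sys.stdout use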
Test code (no actual Unicode input is needed, which avoids terminal encoding pitfalls):
x = b'\xc4\x8c\xc5\xbd'  # UTF-8 bytes for 'ČŽ'
y = x.decode('utf-8')    # decoding always succeeds; the locale plays no part here
print(y)                 # this line raises UnicodeEncodeError under LANG=C
Questions:
1. Is there a good and accepted way of making the application robust to the LANG setting?
2. Is there any real-world reason to guess the filesystem capabilities from the environment instead of asking the filesystem driver?
3. Why is print affected?

LANG is used to determine your locale; if you don't set specific LC_* variables, the LANG variable is used as the default.
The filesystem encoding is determined by the LC_CTYPE variable, but if you haven't set that variable specifically, the LANG environment variable is used instead.
Printing uses sys.stdout, a text stream configured with the codec your terminal uses. Your terminal settings are also locale specific; your LANG variable should really reflect what locale your terminal is set to. If that is UTF-8, you need to make sure your LANG variable reflects that. sys.stdout uses locale.getpreferredencoding(False) (like all text streams opened without an explicit encoding), and on POSIX systems that consults LC_CTYPE too.
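As for making the application robust (question 1), a minimal sketch of one common approach: rewrap the standard streams with an explicit codec at startup, so print() no longer depends on the locale-derived encoding. The choice of UTF-8 and the replace error handler are assumptions here; adjust them to what your output actually needs.

import io
import sys

# Works on Python 3.4; on 3.7+, sys.stdout.reconfigure(encoding='utf-8')
# achieves the same thing.
if sys.stdout.encoding.lower().replace('-', '') != 'utf8':
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8',
                                  errors='replace', line_buffering=True)
    sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8',
                                  errors='replace', line_buffering=True)

print(b'\xc4\x8c\xc5\xbd'.decode('utf-8'))  # now prints regardless of LANG

Alternatively, setting the PYTHONIOENCODING=utf-8 environment variable before the interpreter starts overrides the stream encoding without any code changes.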

Related

UnicodeDecodeError when trying to read an hdf file made with python 2.7

I have a bunch of HDF files that I need to read with pandas pd.read_hdf(), but they were saved in a Python 2.7 environment. Nowadays I'm on Python 3.7, and when trying to read them with data = pd.read_hdf('data.h5', 'data'), I'm getting
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 6: invalid start byte
Now I know, those files can contain various weird things like Ä or ö, and 0xf6 probably is ö.
So how do I read this hdf file?
The documentation for read_hdf only specifies mode as a parameter, but changing it doesn't do anything. Apparently this is an old bug in pandas, or rather in the underlying PyTables, that can't be fixed. However, that report is from 2017, so I wonder whether it has been fixed since, or whether there's a workaround that I just can't find. According to the bug report you can also pass encoding='' to the reader, but that doesn't do anything when I specify encoding='UTF8' as suggested in the bug, or encoding='cp1250', which I would assume could be the culprit.
It's quite annoying to have a file format that is meant to archive data but apparently can't be read anymore by the program that produced it after just one version step. I would be perfectly fine with the ös being garbled to ␣ý⌧ or similar fun things, as usual with encoding errors, but simply not being able to read the file at all is an issue.
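For what it's worth, the byte from the error message does decode to ö under the legacy single-byte Windows codecs, which supports the cp1250 suspicion; a quick interactive check:

>>> b'\xf6'.decode('utf-8')     # what the reader effectively attempts
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 0: invalid start byte
>>> b'\xf6'.decode('cp1250')    # legacy Central European Windows codec
'ö'
>>> b'\xf6'.decode('latin-1')   # ISO-8859-1 gives the same character
'ö'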

Windows file names displayed as corrupted characters on Linux

I believe this is a common issue when it comes to the default character encodings on Linux and Windows. However, after searching the internet I have not found any easy way to fix it automatically, so I am about to write a script to do it.
Here is the scenario:
I created some files on a Windows system, some with non-English names (Chinese, specifically, in my case), and compressed them into a zip file using 7-zip. After that I downloaded the zip file to a Linux system (Ubuntu 16.04 LTS) and extracted the files there with the default archive program. As I had guessed, all the non-English file names are now displayed as corrupted characters! At first I thought it would be easy with convmv, but...
I tried convmv, and it says: "Skipping, already utf8". Nothing got changed.
So I decided to write a tool in Python to do the dirty job, but after some testing I reached a point where I could not associate the original file names with the corrupted ones (except by hashing the contents).
Here is an example. I set up a web server to list the file names on Windows, and one file, encoded with "gbk" in Python, is displayed as
u'j\u63a5\u53e3\u6587\u6863'
I can query the file names on my Linux system. If I create a file directly with the name shown above, the name is CORRECT. I can also encode that Unicode string to UTF-8 and create a file, and the name is also CORRECT. (Thus I cannot do both at the same time, since they are indeed the same name.) But when I read the file name I extracted earlier, which should be the same file, the name is completely different:
'j\xe2\x95\x9c\xe2\x95\x99.....'
Decoding it with utf8 gives something like u'j\u255c\u2559...'. Decoding it with gbk raises a UnicodeDecodeError, and I also tried decoding with utf8 and then encoding with gbk, but the result is still something else.
To summarize: I cannot recover the original file name by decoding or encoding it after it was extracted to the Linux system. If I really want a program to do the job, I have to either redo the archive, perhaps with some encoding options, or go with my script and use a hash of the file contents (like md5 or sha1) to determine each file's original name on Windows.
Do I still have any chance to infer the original names from a Python script in the above case, other than comparing file contents between the two systems?
With a little experimentation with common encodings, I was able to reverse your mojibake:
>>> bad = 'j\xe2\x95\x9c\xe2\x95\x99\xe2\x94\x90\xe2\x94\x8c\xe2\x95\xac\xe2\x94\x80\xe2\x95\xa1\xe2\x95\xa1'
>>> good = bad.decode('utf8').encode('cp437').decode('gbk')
>>> good
u'j\u63a5\u53e3\u6587\u6863' # u'j接口文档'
gbk - common Chinese Windows encoding
cp437 - common US Windows OEM console encoding
utf8 - common Linux encoding
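Building on that round trip, a sketch (Python 2, to match the snippet above) that applies the same utf8 → cp437 → gbk reversal to every extracted name and renames the recoverable ones. The codec chain is an assumption that only holds if all the names were mangled the same way:

import os

def unmangle(name):
    # bytes on disk -> text the extractor thought it had -> the original
    # cp437 bytes -> the intended GBK-decoded name
    return name.decode('utf-8').encode('cp437').decode('gbk')

for name in os.listdir('.'):
    try:
        fixed = unmangle(name)
    except (UnicodeDecodeError, UnicodeEncodeError):
        continue  # not mojibake of this shape; leave the file alone
    os.rename(name, fixed.encode('utf-8'))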

Django: 'ascii' codec can't decode byte 0xc3 in position 1035: ordinal not in range(128)

Django 1.6
Python 3
Nginx, uWsgi
There are a few topics about this error, but the solutions are not applicable to me.
I have a web application where it is possible to upload an XML file inside a tar archive for import purposes.
While developing on my local machine, running the application with "python manage.py runserver", the import process runs flawlessly. When running the application on the vServer with Nginx and uWSGI, I get this error:
UnicodeDecodeError at /sync/upload/
'ascii' codec can't decode byte 0xc3 in position 1035: ordinal not in range(128)
The error happens on this last line written by me, and in the code run from there:
xml = f.read() <- My line
return codecs.ascii_decode(input, self.errors)[0]
Since the whole thing works on my system but not on the VPS, I assume the problem is some kind of configuration issue. So far I've tried setting LANG and LC_ALL before Nginx starts, as well as passing encoding='utf-8' to open(xmlfile), plus many different approaches trying to encode by hand.
So now I'm out of options.
I'm working from Switzerland on an en_US.UTF-8 Arch Linux machine. The VPS is a Debian machine on which I don't know how to configure the default charset, if that is even related. Any help is welcome.
Thanks and regards,
Adrian
The traceback ends in ascii_decode, which can only decode bytes in the range 0..127, and 0xc3 = 195 > 127: the file you are reading on the server contains bytes with values above 127, and something is decoding it as ASCII. Normally, XML readers take bytes rather than strings, so decoding it yourself is unnecessary; the encoding is declared inside the XML file itself.
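A minimal sketch of that advice, assuming ElementTree is an acceptable parser here: open the upload in binary mode so no locale-dependent decoding happens in Python, and let the parser honor the prolog.

import xml.etree.ElementTree as ET

# 'rb' hands the parser raw bytes; it reads the encoding from the
# <?xml version="1.0" encoding="..."?> declaration itself.
with open(xmlfile, 'rb') as f:
    tree = ET.parse(f)
root = tree.getroot()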

Python, UTF-8 filesystem, iso-8859-1 files

I have an application written in Python 2.7 that reads the user's files from the hard drive using os.walk.
The application requires a UTF-8 system locale (we check the environment variables before it starts) because we handle files with Unicode characters in their names (audio files with the artist's name in them, for example), and we want to make sure we can save these files to the filesystem with the correct names.
Some of our users have UTF-8 locales (and therefore a UTF-8 filesystem), but still somehow manage to have ISO-8859-1 file names stored on their drive. This causes problems when our code tries to os.walk() these directories, as Python throws an exception when trying to decode such a sequence of ISO-8859-1 bytes as UTF-8.
So my question is: how do I get Python to ignore such a file and move on to the next one, instead of aborting the entire os.walk()? Should I just roll my own os.walk() function?
Edit: Until now we've been telling our users to use the convmv Linux command to correct their filenames. However, many users have a mix of encodings (8859-1, 8859-2, etc.), and convmv requires the user to make an educated guess about which files have which encoding before running it on each one individually.
Please read Unicode filenames, part of the Python Unicode HOWTO. Most importantly, filesystem encodings are not necessarily the same as the current LANG setting in the terminal.
Specifically, os.walk is built on os.listdir, and will thus switch between Unicode and 8-bit byte strings depending on whether or not you give it a Unicode path.
Pass it an 8-bit byte path instead, and your code will work properly; then decode each name from UTF-8 or ISO-8859-1 as needed.
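A sketch of that approach in Python 2.7 (the path is illustrative), with ISO-8859-1 as the assumed fallback codec; since ISO-8859-1 maps every byte, the fallback itself can never fail:

import os

# A byte-string path makes os.walk yield raw bytes, so undecodable names
# no longer raise during the walk itself.
for dirpath, dirnames, filenames in os.walk('/home/user/music'):
    for raw in filenames:
        try:
            name = raw.decode('utf-8')
        except UnicodeDecodeError:
            name = raw.decode('iso-8859-1')  # assumed fallback
        # ... display `name`, but keep `raw` for the actual file operations ...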
Use character-encoding detection: the chardet module for Python works well for determining the actual encoding with some confidence. As for "as appropriate": you either know the encoding or you have to guess at it. If you guess wrong with chardet, at least you tried.
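A quick illustration (chardet.detect takes bytes and returns a guess plus a confidence score; the sample bytes here are made up):

import chardet

raw = b'caf\xe9'             # ISO-8859-1 bytes for 'café'
guess = chardet.detect(raw)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73}
name = raw.decode(guess['encoding'] or 'utf-8', 'replace')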

Enable Unicode "globally" in Python

Is it possible to avoid having to put this in every page?
# -*- coding: utf-8 -*-
I'd really like Python to default to this.
In Python 3, the default source encoding is UTF-8, so you won't need to set it explicitly anymore. There isn't a way to 'globally' set the default source encoding, though, and history has shown that such global options are generally a bad idea (for instance, the -U and -Q options to Python, and sys.setdefaultencoding() back when we had it). You don't (directly) control all the source that gets imported into your program, because it includes the standard library and any third-party modules you use directly or indirectly.
Also note that this isn't enabling Unicode, as your question title suggests. What it does is make the source encoding UTF-8, meaning that any non-ASCII characters in unicode literals (e.g. u'spæm') will be interpreted using that encoding. It won't make non-unicode literals ('spam' and "spam") suddenly unicode, nor will it do anything for non-literals anywhere in your code.
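To make that concrete, a small Python 2 illustration (assuming the file is saved as UTF-8):

# -*- coding: utf-8 -*-
u = u'spæm'     # unicode literal: the æ is decoded using the declared coding
b = 'spæm'      # still a byte string; it holds the raw UTF-8 bytes of æ
print(type(u))  # <type 'unicode'>
print(type(b))  # <type 'str'>
print(len(u))   # 4 characters
print(len(b))   # 5 bytes (æ is two bytes in UTF-8)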
This is a feature of Python 3.0
It was one of the things done in Python 3 precisely because it would break backward compatibility, so you won't find such a global option in 2.x.
It's a very bad idea in Python 2, because you would come to expect behavior that is only present on your dev machine. That means that when your library goes out to someone else, or to a host server, or elsewhere, any use of it will flood the logs with UnicodeDecodeErrors.
