Changing the “locale preferred encoding” in Python 3 in Windows

I'm using Python 3 (recently switched from Python 2). My code usually runs on Linux but also sometimes (not often) on Windows. According to Python 3 documentation for open(), the default encoding for a text file is from locale.getpreferredencoding() if the encoding arg is not supplied. I want this default value to be utf-8 for a project of mine, no matter what OS it's running on (currently, it's always UTF-8 for Linux, but not for Windows). The project has many many calls to open() and I don't want to add encoding='utf-8' to all of them. Thus, I want to change the locale's preferred encoding in Windows, as Python 3 sees it.
I found a previous question, "Changing the "locale preferred encoding"", which has an accepted answer, so I thought I was good to go. Unfortunately, neither of the commands suggested in that answer and its first comment works for me in Windows. Specifically, they suggest running chcp 65001 and set PYTHONIOENCODING=UTF-8, and I've tried both. Please see the transcript below from my cmd window:
> py -i
Python 3.4.3 ...
>>> f = open('foo.txt', 'w')
>>> f.encoding
'cp1252'
>>> exit()
> chcp 65001
Active code page: 65001
> py -i
Python 3.4.3 ...
>>> f = open('foo.txt', 'w')
>>> f.encoding
'cp1252'
>>> exit()
> set PYTHONIOENCODING=UTF-8
> py -i
Python 3.4.3 ...
>>> f = open('foo.txt', 'w')
>>> f.encoding
'cp1252'
>>> exit()
Note that even after both suggested commands, my opened file's encoding is still cp1252 instead of the intended utf-8.

As of Python 3.5.1, this hack looks like this:
import _locale
_locale._getdefaultlocale = (lambda *args: ['en_US', 'utf8'])
All files opened thereafter will assume the default encoding to be utf8.

I know it's a really hacky workaround, but you could redefine the locale.getpreferredencoding() function like so:
import locale
def getpreferredencoding(do_setlocale=True):
    return "utf-8"
locale.getpreferredencoding = getpreferredencoding
If you run this early on, all files opened afterwards (at least in my testing on a Windows XP machine) open in UTF-8, and since this overrides the module function, it applies on all platforms.

Locale can be set in windows globally to UTF-8, if you so desire, as follows:
Control panel -> Clock and Region -> Region -> Administrative -> Change system locale -> Check Beta: Use Unicode UTF-8 ...
After this, and a reboot, I confirmed that locale.getpreferredencoding() returns 'cp65001' (=UTF-8) and that functions like open default to UTF-8.

The post is old, but the issue is still relevant (under Python 3.7 and Windows 10).
I've improved the solution as follows, so that only the encoding is overridden while the language/country part is preserved, and so that it is applied only on Windows:
import os
if os.name == "nt":
    import _locale
    _locale._gdl_bak = _locale._getdefaultlocale
    _locale._getdefaultlocale = (lambda *args: (_locale._gdl_bak()[0], 'utf8'))
Hope this helps...

As of Python 3.7, you may want to use UTF-8 mode by setting the PYTHONUTF8=1 environment variable or passing the -X utf8 flag to Python. Note that it switches a few more things to UTF-8 besides locale.getpreferredencoding, but that may well be a good thing. As of Python 3.15, UTF-8 mode is set to become the default.
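As a rough illustration of UTF-8 mode, the sketch below starts a child interpreter either with the -X utf8 flag or with PYTHONUTF8=1 in its environment, and reports what locale.getpreferredencoding() returns there. It assumes Python 3.7 or later; the helper name preferred_encoding_in_child is made up for this example.

```python
import os
import subprocess
import sys

# Code run in the child interpreter: print the encoding open() would default to.
CHILD_CODE = "import locale; print(locale.getpreferredencoding(False))"


def preferred_encoding_in_child(use_flag=True):
    """Return the preferred encoding seen by a child Python in UTF-8 mode.

    Hypothetical helper: use_flag=True passes -X utf8, otherwise
    PYTHONUTF8=1 is set in the child's environment.
    """
    env = dict(os.environ)
    cmd = [sys.executable]
    if use_flag:
        # Drop any inherited setting so only the flag is being exercised.
        env.pop("PYTHONUTF8", None)
        cmd += ["-X", "utf8"]
    else:
        env["PYTHONUTF8"] = "1"
    cmd += ["-c", CHILD_CODE]
    result = subprocess.run(cmd, env=env, stdout=subprocess.PIPE, check=True)
    return result.stdout.decode("ascii").strip()


if __name__ == "__main__":
    print(preferred_encoding_in_child(use_flag=True))
    print(preferred_encoding_in_child(use_flag=False))
```

The spelling of the result can vary between Python versions (for example upper or lower case), so compare it case-insensitively.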

Related

Python 3 array with UTF-16 chars

I created two arrays with positive and negative emojis by using the emojis' unicode value:
positive = [
u'\U0001F600',
u'\U0001F601',
u'\U0001F602',
u'\U0001F923',
u'\U0001F603',
u'\U0001F604',
u'\U0001F605',
u'\U0001F606',
u'\U0001F609',
u'\U0001F60A',
u'\U0001F60B',
u'\U0001F60E',
u'\U0001F60D',
u'\U0001F618',
u'\U0001F617',
u'\U0001F619',
u'\U0001F61A',
u'\U0000263A',
u'\U0001F642',
u'\U0001F917',
u'\U0001F60F',
u'\U0001F60C',
u'\U0001F61B',
u'\U0001F61C',
u'\U0001F61D',
u'\U0001F924',
u'\U0001F643',
u'\U0001F62C']
negative = [
u'\U0001F610',
u'\U0001F611',
u'\U0001F636',
u'\U0001F644',
u'\U0001F60F',
u'\U0001F623',
u'\U0001F625',
u'\U0001F62E',
u'\U0001F910',
u'\U0001F62F',
u'\U0001F62A',
u'\U0001F62B',
u'\U0001F634',
u'\U0001F612',
u'\U0001F613',
u'\U0001F614',
u'\U0001F615',
u'\U0001F641',
u'\U0001F616',
u'\U0001F61E',
u'\U0001F61F',
u'\U0001F624',
u'\U0001F622',
u'\U0001F62D',
u'\U0001F626',
u'\U0001F627',
u'\U0001F628',
u'\U0001F629',
u'\U0001F630',
u'\U0001F631',
u'\U0001F635',
u'\U0001F621',
u'\U0001F620',
u'\U0001F637',
u'\U0001F912',
u'\U0001F915',
u'\U0001F922',
u'\U0001F927']
But when I print positive[0], for example, I get back this weird character instead of an emoji:
�
I'm working on an EC2 machine with Amazon Linux and using python-3.4.
Same code works as expected from my Macbook.
This is not a Python issue. The Mac supports terminal fonts that can display those Unicode characters.
Those fonts are not supported in the environment you are using.
If the device in question supported those Unicode values, they would print properly.
I tested a standard macOS SSH terminal session to Ubuntu, and that worked, as does the native Mac terminal.
The issue was that I was running under screen.
Hello Daniel Haviv,
Try this and see if it works:
print(positive[0].encode('utf-8'))
You can read more in "Python 2.x’s Unicode Support".
Update: if the previous solution did not work for you, try this:
import sys
import codecs
# In Python 3, wrap the underlying byte stream: sys.stdout itself accepts str, not bytes.
sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)
print(positive[0])
I found this solution in "Setting the correct encoding when piping stdout in Python".
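Whether a character prints depends on both the terminal's font and the encoding Python uses for standard output. A minimal Python 3 sketch for checking the second part follows; the helper name can_encode is hypothetical, and it says nothing about fonts or about intermediaries such as screen.

```python
import sys


def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    grinning_face = "\U0001F600"
    # sys.stdout.encoding can be None when stdout is replaced; fall back to ASCII.
    stdout_encoding = sys.stdout.encoding or "ascii"
    print("stdout encoding:", stdout_encoding)
    print("emoji representable:", can_encode(grinning_face, stdout_encoding))
```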

How to print unicode strings in a Python 2 shell under Windows?

I'm having problems when trying to print symbols such as €, ≤, Å, Ω, ℃, etc., in Python 2.7.11 under Windows 10. I expected that running this piece of code from IDLE:
print u'\u20AC\u2A7D\u212B\u2126\u2103'
would produce the following output on the screen:
>>> ================================ RESTART ================================
>>>
€⩽ÅΩ℃
>>>
But it didn't. I obtained a funky string of non-ascii characters instead. After struggling for a while, I finally got the expected output by setting up an environment variable:
PYTHONIOENCODING=UTF-8
So far, so good. My problem is that I am unable to get the same output from the Python shell:
>>> print u'\u20AC\u2A7D\u212B\u2126\u2103'
Ôé¼Ô®¢Ôä½ÔäªÔäâ
>>>
I have unsuccessfully tried a number of workarounds I found in answers to similar questions:
Changed the code page from 850 (which is the default in my system) to 65001 (which corresponds to utf-8 encoding)
Wrapped sys.stdout to ensure the appropriate encoding
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
Even changed - although it is widely discouraged - the default encoding
sys.setdefaultencoding("UTF-8")
None of the above worked for me.
My question is twofold:
Why does running print u'\u20AC\u2A7D\u212B\u2126\u2103' from IDLE produce €⩽ÅΩ℃ (as expected), whereas running the same code from the Python shell produces incorrect output?
Does anyone have any tips for printing those symbols correctly from the shell?
Why: IDLE uses tkinter, which wraps the tcl/tk GUI framework. Tcl/tk uses unicode strings, like Python 3, except that it is limited to the first 2**16 characters (the Basic Multilingual Plane, BMP). On Windows, Python uses Command Prompt, which uses code pages mostly limited to 256 chars. CP65001 seems to be a fraud; join the large crowd of people who have failed to get it to work over the last decade. (Search web for code page 65001.)
Tip: unless you limit output to chars in a working codepage, use IDLE to run the program. IDLE has a -r file startup option. See Help => IDLE Help, 3.1 Command line usage. I don't normally recommend using IDLE to run already developed programs, but do on Windows for BMP output.
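To judge in advance whether a string stays within the BMP limit described above, one can inspect its code points. A small sketch, written for Python 3 rather than the Python 2 of the question; is_bmp_only is an illustrative name and not part of IDLE or tkinter.

```python
def is_bmp_only(text):
    """Return True if all code points lie in the Basic Multilingual Plane."""
    return all(ord(ch) <= 0xFFFF for ch in text)


if __name__ == "__main__":
    symbols = "\u20AC\u2A7D\u212B\u2126\u2103"  # the string from the question
    print(is_bmp_only(symbols))
    print(is_bmp_only("\U0001F600"))  # an emoji, which lies outside the BMP
```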

Modify Windows unicode shortcuts using Python

Following this question, I've settled on the following Python code to modify Windows shortcuts. It works for English based shortcuts but it doesn't for unicode based shortcuts.
How could this (or any other) snippet be modified to support unicode?
import re, os, pythoncom
from win32com.shell import shell, shellcon
shortcut_path = os.path.join(path_to_shortcut, shortcut_filename)
shortcut = pythoncom.CoCreateInstance (shell.CLSID_ShellLink, None, pythoncom.CLSCTX_INPROC_SERVER, shell.IID_IShellLink)
persist_file = shortcut.QueryInterface (pythoncom.IID_IPersistFile)
persist_file.Load (shortcut_path)
destination1 = shortcut.GetPath(0)[0]
destination2 = os.path.join(destination_path, destination_filename)
shortcut.SetPath(destination2)
persist_file.Save(shortcut_path, 0)
Assume the following are unicode: path_to_shortcut, shortcut_filename, destination_path, destination_filename
Perhaps looking here may help: Python Unicode HOWTO
I'm guessing you'd need to be sure that each of those strings was properly encoded as Unicode and any changes need to preserve that encoding. That article should provide all the information you'll need.

How do you get the encoding of the terminal from within a python script?

Let's say you want to start a python script with some parameters like
python myscript some arguments
I understand that the strings sys.argv[1] and sys.argv[2] will have the encoding specified in the terminal. Is there a way to get this information from within the Python script?
My goal is something like this:
terminal_encoding = some_way.to.GET_TERMINAL_ENCODING
some = sys.argv[1].decode(terminal_encoding)
arguments = sys.argv[2].decode(terminal_encoding)
sys.stdout.encoding will give you the encoding of standard output. sys.stdin.encoding will give you the encoding for standard input.
You can call locale.getdefaultlocale() and use the second part of the tuple.
See more here (Fedora wiki entry explaining the why's and how's of the default encoding in Python)
The function locale.getpreferredencoding() also seems to do the job.
It returns the Python encoding string which you can directly use like this:
>>> import locale
>>> s = b'123\n'
>>> enc = locale.getpreferredencoding()
>>> s.decode(enc)
'123\n'
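The suggestions above can be combined into a single report. This sketch assumes Python 3, where sys.argv already holds decoded str values, so the encodings are mainly informational; report_encodings is a name chosen for the example.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python associates with the standard streams and the locale."""
    report = {}
    for name in ("stdin", "stdout", "stderr"):
        stream = getattr(sys, name, None)
        # A stream may be None or replaced by an object without an .encoding attribute.
        report[name] = getattr(stream, "encoding", None)
    report["preferred"] = locale.getpreferredencoding(False)
    report["filesystem"] = sys.getfilesystemencoding()
    return report


if __name__ == "__main__":
    for key, value in report_encodings().items():
        print(key, value)
```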

Unicode filename to python subprocess.call() [duplicate]

This question already has answers here:
Unicode filenames on Windows with Python & subprocess.Popen()
(5 answers)
Closed 7 years ago.
I'm trying to run subprocess.call() with a Unicode filename. Here is a simplified version of the problem:
n = u'c:\\windows\\notepad.exe '
f = u'c:\\temp\\nèw.txt'
subprocess.call(n + f)
which raises the famous error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8'
Encoding to utf-8 produces the wrong filename, and mbcs passes the filename as new.txt, without the accent.
I just can't read any more on this confusing subject; I keep going in circles. I have found a lot of answers here to many different problems in the past, so I thought I would join and ask for help myself.
Thanks
I found a fine workaround, it's a bit messy, but it works.
subprocess.call passes the text to the terminal in its own encoding, which may or may not be the one the terminal expects. To keep the code portable, you need to determine the machine's encoding at runtime.
The following
notepad = 'C://Notepad.exe'
subprocess.call([notepad.encode(sys.getfilesystemencoding())])
looks up the current filesystem encoding and applies it to the argument passed to subprocess.call.
As a side note, I have also found that if you compose a string with the current directory, using
os.getcwd()
Python (or the OS, I don't know which) will mess up directories with accented characters. To prevent this, I have found the following to work:
os.getcwd().decode(sys.getfilesystemencoding())
which is very similar to the solution above.
Hope it helps.
If your file exists, you can use its short filename (aka 8.3 name). This name is defined only for existing files, and should cause no trouble to non-Unicode-aware programs when passed as an argument.
One way to obtain one (needs Pywin32 to be installed):
import win32api
short_path = win32api.GetShortPathName(unicode_path)
Alternatively, you can also use ctypes:
import ctypes
import ctypes.wintypes
ctypes.windll.kernel32.GetShortPathNameW.argtypes = [
ctypes.wintypes.LPCWSTR, # lpszLongPath
ctypes.wintypes.LPWSTR, # lpszShortPath
ctypes.wintypes.DWORD # cchBuffer
]
ctypes.windll.kernel32.GetShortPathNameW.restype = ctypes.wintypes.DWORD
buf = ctypes.create_unicode_buffer(1024) # adjust buffer size, if necessary
ctypes.windll.kernel32.GetShortPathNameW(unicode_path, buf, len(buf))
short_path = buf.value
It appears that to make this work, the subprocess code would have to be modified to use a wide character version of CreateProcess (assuming that one exists). There's a PEP discussing the same change made for the file object at http://www.python.org/dev/peps/pep-0277/ Perhaps you could research the Windows C calls and propose a similar change for subprocess.
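For comparison, in Python 3 subprocess accepts str arguments directly, and on Windows it goes through the wide-character CreateProcessW, so no manual encoding step is needed. A minimal sketch, assuming Python 3.5 or later: it passes a file name to a child interpreter and reads it back. echo_argument is a hypothetical helper, and the child interpreter is only a stand-in for notepad.exe.

```python
import subprocess
import sys

# The child writes its first argument back as UTF-8 bytes, so the result
# does not depend on the console code page.
CHILD_CODE = "import sys; sys.stdout.buffer.write(sys.argv[1].encode('utf-8'))"


def echo_argument(arg):
    """Pass arg to a child Python process as a list element and return what it received."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, arg],
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8")


if __name__ == "__main__":
    print(echo_argument("n\u00e8w.txt") == "n\u00e8w.txt")
```

Passing the command as a list, rather than one concatenated string, also avoids quoting problems when the path contains spaces.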
I don't have an answer for you, but I've done a fair amount of research into this problem. Python converts all output (including system calls) to the same character encoding as the terminal it is running in. Windows terminals use code pages for character mapping; the default code page is 437, but it can be changed with the chcp command. chcp 65001 will theoretically change the code page to UTF-8, but as far as I know Python doesn't know what to do with this, so you're out of luck.
As ΤΖΩΤΖΙΟΥ and starbuck mentioned, the problem is with the console code page which is in your case 866 (in Russian localization of windows) and not 1251. Just run chcp in console.
The problem is the same as when you want to output Unicode to the Windows console. Unfortunately, you will need at least to register an alias for unicode as 'cp866' in encodings\aliases.py (or do it programmatically at script start), change the console code page to 65001 before running Notepad, and set it back afterwards.
chcp 65001 & c:\WINDOWS\notepad.exe nèw.txt & chcp 866
By the way, to be able to run the command in console and see the filename correctly, you will need to change the console font to Lucida Console in console window properties.
It might be even worse: you will need to change the code page of the current process. To do that, you will need either to run chcp 65001 right before the script starts or to use pywin32 to do it within the script.
You can try opening the file as:
subprocess.call((n + f).encode("cp437"))
or whichever codepage chcp reports as being used in a command prompt window. If you try to chcp 65001 as starbuck suggested, you'll have to edit the stdlib encodings\aliases.py file and add cp65001 as an alias to 'utf-8' beforehand. It's an open issue in the Python source.
UPDATE: since this is a multiple target scenario, before running such a command, make sure you run a single chcp command first, analyse the output and retrieve the current "Command Prompt" (DOS) codepage. Subsequently, use the discovered codepage to encode the subprocess.call argument.
Use os.startfile with the operation edit. This will work better as it will open the default application for your extension.
