python, and unicode stderr

python, and unicode stderr - python

I used an anonymous pipe to capture all stdout,and stderr then print into a richedit, it's ok when i use wsprintf ,but the python using multibyte char that really annoy me. how can I convert all these output to unicode?
UPDATE 2010-01-03:
Thank you for the reply, but it seems the str.encode() only worked with print xxx stuff, if there is an error during the py_runxxx(), my redirected stderr will capture the error message in multibyte string, so is there a way can make python output it's message in unicode way? And there seems to be an available solution in this post.
I'll try it later.

First, please remember that on Windows console may not fully support Unicode.
The example below does make python output to stderr and stdout using UTF-8. If you want you could change it to other encodings.
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import codecs, sys
reload(sys)
sys.setdefaultencoding('utf-8')
print sys.getdefaultencoding()
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
sys.stderr = codecs.getwriter('utf8')(sys.stderr)
print "This is an Е乂αmp١ȅ testing Unicode support using Arabic, Latin, Cyrillic, Greek, Hebrew and CJK code points."

You can work with Unicode in python either by marking strings as Unicode (ie: u'Hello World') or by using the encode() method that all strings have.
Eg. assuming you have a Unicode string, aStringVariable:
aStringVariable.encode('utf-8')
will convert it to UTF-8. 'utf-16' will give you UTF-16 and 'ascii' will convert it to a plain old ASCII string.
For more information, see:
Tutorial - Unicode Strings
Python String Methods

wsprintf?
This seems to be a "C/C++" question rather than a Python question.
The Python interpreter always writes bytestrings to stdout/stderr, rather than unicode (or "wide") strings. It means Python first encodes all unicode data using the current encoding (likely sys.getdefaultencoding()).
If you want to get at stdout/stderr as unicode data, you must decode it by yourself using the right encoding.
Your favourite C/C++ library certainly has what it takes to do that.

Related

Convert string of unknown encoding to UTF-8

I am consuming a text response from a third party API. This text is in an encoding which is unknown to me. I consume the text in python3 and want to change the encoding into UTF-8.
This is an example of the contents I get:
Danke
"TrÃ¤ume groÃŸ"
ðŸ™ŒðŸ¼
Super Idee!!!
I was able to get the messed up characters readable by doing the following manually:
Open new document in Notepad++
Via the Encoding menu switch the encoding of the document to ANSI
Paste the contents
Again use the Encoding menu, this time switch to UTF-8
Now the text is properly legible like below
Correct content:
Danke
"Träume groß"
🙌🏼
Super Idee!!!
I want to repeat this process in python3, but struggle to do so. From the notepad workflow I gather that the encoding shouldn't be converted, rather the existing characters should be interpreted with a different encoding. That's because if I select Convert to UTF-8 in the Encoding menu, it doesn't work.
From what I have read on SO, there are the encode and decode methods to do that. Also ANSI isn't really an encoding but rather refers to the standard encoding the current machine uses. This would most likely be cp1525 on my windows machine. I have messed around with all combinations of cp1252 and utf-8 as source and/or target, but to no avail. I always end up with a UnicodeEncodeError.
I have also tried using the chardet module to determine the encoding of my input string, but it requires bytes as input and b'ðŸ™ŒðŸ¼' is rejected with SyntaxError: bytes can only contain ASCII literal characters.

"TrÃ¤ume groÃŸ" is a hint that you got something originally encoded as utf-8, but your process read it as cp1252.
A possible way is to encode your string back to cp1252 and then correctly decode it as utf-8:
print('"TrÃ¤ume groÃŸ"'.encode('cp1252').decode('utf8'))
gives as expected:
"Träume groß"
But this is only a workaround. The correct solution is to understand where you have read the original bytes as cp1252 and directly use the utf8 conversion there.

You can use bytes() to convert a string to bytes, and then decode it with .decode()
>>> bytes("TrÃ¤ume groÃŸ", "cp1252").decode("utf-8")
'Träume groß'

chardet could probably be useful here -
Quoting straight from the docs
import urllib.request
rawdata = urllib.request.urlopen('http://yahoo.co.jp/').read()
import chardet
chardet.detect(rawdata) {'encoding': 'EUC-JP', 'confidence': 0.99}

How to solve the error 'UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 1593: character maps to <undefined>' in Python [duplicate]

This question already has answers here:
Python, Unicode, and the Windows console
(15 answers)
Closed 5 years ago.
I am writing a Python (Python 3.3) program to send some data to a webpage using POST method. Mostly for debugging process I am getting the page result and displaying it on the screen using print() function.
The code is like this:
conn.request("POST", resource, params, headers)
response = conn.getresponse()
print(response.status, response.reason)
data = response.read()
print(data.decode('utf-8'));
the HTTPResponse .read() method returns a bytes element encoding the page (which is a well formated UTF-8 document) It seemed okay until I stopped using IDLE GUI for Windows and used the Windows console instead. The returned page has a U+2014 character (em-dash) which the print function translates well in the Windows GUI (I presume Code Page 1252) but does not in the Windows Console (Code Page 850). Given the strict default behavior I get the following error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2014' in position 10248: character maps to <undefined>
I could fix it using this quite ugly code:
print(data.decode('utf-8').encode('cp850','replace').decode('cp850'))
Now it replace the offending character "—" with a ?. Not the ideal case (a hyphen should be a better replacement) but good enough for my purpose.
There are several things I do not like from my solution.
The code is ugly with all that decoding, encoding, and decoding.
It solves the problem for just this case. If I port the program for a system using some other encoding (latin-1, cp437, back to cp1252, etc.) it should recognize the target encoding. It does not. (for instance, when using again the IDLE GUI, the emdash is also lost, which didn't happen before)
It would be nicer if the emdash translated to a hyphen instead of a interrogation bang.
The problem is not the emdash (I can think of several ways to solve that particularly problem) but I need to write robust code. I am feeding the page with data from a database and that data can come back. I can anticipate many other conflicting cases: an 'Á' U+00c1 (which is possible in my database) could translate into CP-850 (DOS/Windows Console encodign for Western European Languages) but not into CP-437 (encoding for US English, which is default in many Windows instalations).
So, the question:
Is there a nicer solution that makes my code agnostic from the output interface encoding?

I see three solutions to this:
Change the output encoding, so it will always output UTF-8. See e.g. Setting the correct encoding when piping stdout in Python, but I could not get these example to work.
Following example code makes the output aware of your target charset.
# -*- coding: utf-8 -*-
import sys
print sys.stdout.encoding
print u"Stöcker".encode(sys.stdout.encoding, errors='replace')
print u"Стоескер".encode(sys.stdout.encoding, errors='replace')
This example properly replaces any non-printable character in my name with a question mark.
If you create a custom print function, e.g. called myprint, using that mechanisms to encode output properly you can simply replace print with myprint whereever necessary without making the whole code look ugly.
Reset the output encoding globally at the begin of the software:
The page http://www.macfreek.nl/memory/Encoding_of_Python_stdout has a good summary what to do to change output encoding. Especially the section "StreamWriter Wrapper around Stdout" is interesting. Essentially it says to change the I/O encoding function like this:
In Python 2:
if sys.stdout.encoding != 'cp850':
sys.stdout = codecs.getwriter('cp850')(sys.stdout, 'strict')
if sys.stderr.encoding != 'cp850':
sys.stderr = codecs.getwriter('cp850')(sys.stderr, 'strict')
In Python 3:
if sys.stdout.encoding != 'cp850':
sys.stdout = codecs.getwriter('cp850')(sys.stdout.buffer, 'strict')
if sys.stderr.encoding != 'cp850':
sys.stderr = codecs.getwriter('cp850')(sys.stderr.buffer, 'strict')
If used in CGI outputting HTML you can replace 'strict' by 'xmlcharrefreplace' to get HTML encoded tags for non-printable characters.
Feel free to modify the approaches, setting different encodings, .... Note that it still wont work to output non-specified data. So any data, input, texts must be correctly convertable into unicode:
# -*- coding: utf-8 -*-
import sys
import codecs
sys.stdout = codecs.getwriter("iso-8859-1")(sys.stdout, 'xmlcharrefreplace')
print u"Stöcker" # works
print "Stöcker".decode("utf-8") # works
print "Stöcker" # fails

Based on Dirk Stöcker's answer, here's a neat wrapper function for Python 3's print function. Use it just like you would use print.
As an added bonus, compared to the other answers, this won't print your text as a bytearray ('b"content"'), but as normal strings ('content'), because of the last decode step.
def uprint(*objects, sep=' ', end='\n', file=sys.stdout):
enc = file.encoding
if enc == 'UTF-8':
print(*objects, sep=sep, end=end, file=file)
else:
f = lambda obj: str(obj).encode(enc, errors='backslashreplace').decode(enc)
print(*map(f, objects), sep=sep, end=end, file=file)
uprint('foo')
uprint(u'Antonín Dvořák')
uprint('foo', 'bar', u'Antonín Dvořák')

For debugging purposes, you could use print(repr(data)).
To display text, always print Unicode. Don't hardcode the character encoding of your environment such as Cp850 inside your script. To decode the HTTP response, see A good way to get the charset/encoding of an HTTP response in Python.
To print Unicode to Windows console, you could use win-unicode-console package.

I dug deeper into this and found the best solutions are here.
http://blog.notdot.net/2010/07/Getting-unicode-right-in-Python
In my case I solved "UnicodeEncodeError: 'charmap' codec can't encode character "
original code:
print("Process lines, file_name command_line %s\n"% command_line))
New code:
print("Process lines, file_name command_line %s\n"% command_line.encode('utf-8'))

If you are using Windows command line to print the data, you should use
chcp 65001
This worked for me!

If you use Python 3.6 (possibly 3.5 or later), it doesn't give that error to me anymore. I had a similar issue, because I was using v3.4, but it went away after I uninstalled and reinstalled.

hi-ascii characters python string

I am always perplexed with the whole hi-ascii handling in python 2.x. I am currently facing an issue in which I have a string with hiascii characters in it. I have a few questions related to it.
How can a string store hiascii characters in it (not a unicode string, but a normal str in python 2.x), which I thought can handle only ascii chars. Does python internally convert the hiascii to something else ?
I have a cli which I spawn as a subprocess from my python code, when I pass this string to the cli, it works fine. While, if I encode this string to utf-8, the cli fails( this string is a password, so it fails saying the password is invalid).
For the second point, I actually did a bit of research and found the following:
1) In windows(sucks), the command line args are encoded in mbcs (sys.getfilesystemencoding). The question I still don't get is, if I read the same string using raw_input, it is encoded in Windows console encoding(on EN windows, it was cp437).
I have a different question that am confused about now regarding Windows encoding. Is the windows sys.stdin.encoding different from Windows console encoding ?
If yes, is there a pythonic way to figure out what my windows console encoding is. I needed this because when I read input using raw_input, its encoded in Windows console encoding, and I want to convert it to say, utf-8. Note: I have already set my sys.stdin.encoding to utf-8, but it doesnt seem to make any effect in the read input.

To answer your first question, Python 2.x byte strings contain the source-encoded bytes of the characters, meaning the exact bytes used to store the string on disk in the source file. For example, here is a Python 2.x program where the source is saved in Windows-1252 encoding (Notepad's default on US Windows):
#!python2
#coding:windows-1252
s = 'æüÿ€éêè'
u = u'æüÿ€éêè'
print repr(s)
print repr(u)
Output:
'\xe6\xfc\xff\x80\xe9\xea\xe8'
u'\xe6\xfc\xff\u20ac\xe9\xea\xe8'
The byte string contains the bytes that represent the characters in Windows-1252.
The Python decodes that same sequence of using the declared source encoding (!#coding:Windows-1252) into Unicode codepoints. Since Windows-1252 is very close to iso-8859-1, and iso-8859-1 is a 1:1 mapping to the first 0-255 Unicode codepoints, the code points are almost the same, except for the Euro character.
But save the source in a different encoding, and you'll get those bytes instead for the byte string:
#!python2
#coding:utf8
s = 'æüÿ€éêè'
u = u'æüÿ€éêè'
print repr(s)
print repr(u)
Output:
'\xc3\xa6\xc3\xbc\xc3\xbf\xe2\x82\xac\xc3\xa9\xc3\xaa\xc3\xa8'
u'\xe6\xfc\xff\u20ac\xe9\xea\xe8'
So, Python 2.X just gives you the source code bytes directly, without decoding them to Unicode codepoints, like a Unicode string would do.
Python 3.X notes that this is confusing, and just forbids non-ASCII characters in byte strings:
#!python3
#coding:utf8
s = b'æüÿ€éêè'
u = 'æüÿ€éêè'
print(repr(s))
print(repr(u))
Output:
File "C:\test.py", line 3
s = b'æüÿ\u20acéêè'
^
SyntaxError: bytes can only contain ASCII literal characters.
To answer your second question, please edit your question to show an example that demonstrates the problem.

Is the windows sys.stdin.encoding different from Windows console encoding?
Yes. There are two locale-specific codepages:
the ANSI code page, aka mbcs, used for strings in Win32 ...A APIs and hence for C runtime operations like reading the command line;
the IO code page, used for stdin/stdout/stderr streams.
These do not have to be the same encoding, and typically they aren't. In my locale (UK), the ANSI code page is 1252 and the IO code page defaults to 850. You can change the console code page using the chcp command, so you can make the two encodings match using eg chcp 1252 before running the Python command.
(You also have to be using a TrueType font in the console for chcp to make any difference.)
is there a pythonic way to figure out what my windows console encoding is.
Python reads it at startup using the Win32 API GetConsoleOutputCP and—unless overridden by PYTHONIOENCODING—writes that to the property sys.stdout.encoding. (Similarly GetConsoleCP for stdin though they will generally be the same code page.)
If you need to read this directly, regardless of whether PYTHONIOENCODING is set, you might have to use ctypes to call GetConsoleOutputCP directly.
Note: I have already set my sys.stdin.encoding to utf-8, but it doesnt seem to make any effect in the read input.
(How have you done that? It's a read-only property.)
Although you can certainly treat input and output as UTF-8 at your end, the Windows console won't supply or display content in that encoding. Most other tools you call via the command line will also be treating their input as encoded in the IO code page, so would misinterpret any UTF-8 sent to them.
You can affect what code page the console side uses by calling the Win32 SetConsoleCP/SetConsoleOutputCP APIs with ctypes (equivalent of the chcp command and also requires TTF console font). In principle you should be able to set code page 65001 and get something that is nearly UTF-8. Unfortunately long-standing console bugs usually make this approach infeasible.
windows(sucks)
yes.

Python how to handle unicode text

I am using Python 2.6.6
item = {u'snippet': {u'title': u'How to Pronounce Canap\xe9'}}
title = item['snippet']['title']
print title
Result:
How to Pronounce CanapÃ©
Desired result:
How to Pronounce Canapé
This looks like a Unicode issue, I tried encode and decode to utf8, but result still the same, any ideas?

Your terminal expects UTF-8:
$ locale charmap
UTF-8
Python prints using UTF-8:
>>> sys.stdout.encoding
UTF-8
Change SecureCRT setting to accept UTF-8.

This is quite possibly due to mismatch of the default encoding that Python is using versus the console's encoding. It looks like Python is assuming that the encoding is UTF-8 but then the console is interpreting that as latin-1.

Instead of \xe9, use \u00e9 if possible. Then pick an appropriate encoding when outputting the unicode string:
print title.encode('latin1')
What encoding is sensible depends on where you are outputting to. Generally, you have to infer it from the environment variables, or maybe let your users make a choice in a configuration file.
PS: If you deal with Unicode strings a lot, I'd recommend switching to Python 3 (e.g. 3.3), if at all possible. Unicode handling is a lot more clear/explicit/sane, there.

I am getting your expected output in my terminal (using python 2.7.7)
The format you are expecting depends on encoding set in the terminal. For me, it is set to 'cp437'
>>> import sys
>>> sys.stdin.encoding
'cp437'
>>> sys.stdout.encoding
'cp437'
You can verify that, you are getting correct output by giving:
print title.encode('cp437')

set your default encoding to iso-8859-1 in your sitecustomize.py file in ${pythondir}/lib/site-packages/ as
import sys
sys.setdefaultencoding('iso-8859-1')
for me it worked with \xe9.

decoding problem with urllib2 in python

I'm trying to use urllib2 in python 2.7 to fetch a page from the web. The page happens to be encoded in unicode(UTF-8) and have greek characters. When I try to fetch and print it with the code below, I get gibberish instead of the greek characters.
import urllib2
print urllib2.urlopen("http://www.pamestihima.gr").read()
The result is the same both in Netbeans 6.9.1 and in Windows 7 CLI.
I'm doing something wrong, but what?

Unicode is not UTF-8. UTF-8 is a string encoding, like ISO-8859-1, ASCII etc.
Always decode your data as soon as possible, to make real Unicode out of it. ('somestring in utf8'.decode('utf-8') == u'somestring in utf-8'), unicode objects are u'' , not ''
When you have data leaving your app, always encode it in the proper encoding. For Web stuff this is utf-8mostly. For console stuff this is whatever your console encoding is. On Windows this is not UTF-8 by default.

It prints correctly for me, too.
Check the character encoding of the program in which you are viewing the HTML source code. For example, in a Linux terminal, you can find "Set Character Encoding" and make sure it is UTF-8.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.