I'm having some trouble handling unicode output from a QProcess. When I run the following example I get ?? instead of 中文. Can anyone tell me how to get the unicode output?
from PyQt4.QtCore import *
def on_ready_stdout():
byte_array = proc.readAllStandardOutput()
print 'byte_array: ', byte_array
print 'unicode: ', unicode(byte_array)
proc = QProcess()
proc.connect(proc, SIGNAL('readyReadStandardOutput()'), on_ready_stdout)
proc.start(u'python -c "print \'hello 中文\'"')
proc.waitForFinished()
#serge
I tried running your modified code, but I get an error:
byte_array: hello Σ╕¡µ??
unicode:
Traceback (most recent call last):
File "python_temp.py", line 7, in on_ready_stdout
print 'unicode: ', unicode(byte_array)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 6: ordinal
not in range(128)
I've changed your code a little and got the expected output:
byte_array: hello 中文
unicode: hello 中文
my changes were:
I added # -- coding: utf-8 -- magic comment (details here)
Removed "u" string declaration from the proc.start call
below is your code with my changes:
# -*- coding: utf-8 -*-
from PyQt4.QtCore import *
def on_ready_stdout():
byte_array = proc.readAllStandardOutput()
print 'byte_array: ', byte_array
print 'unicode: ', unicode(byte_array)
proc = QProcess()
proc.connect(proc, SIGNAL('readyReadStandardOutput()'), on_ready_stdout)
proc.start('python -c "print \'hello 中文\'"')
proc.waitForFinished()
hope this helps, regards
Related
Today at work, I saw the following line of code:
decoded_token = base64.b64decode(api_token.encode("utf-8")).decode("utf-8")
It is part of an AirFlow ETL script and decoded_token is used as a Bearer Token in an API request. This code is executed on a server that uses Python 2.7 and my coworker told my that this code runs daily, successfully.
Yet, from my understanding, the code first tries to turn api_token into bytes (.encode), then turn the bytes into a string (base64.b64decode) and finally turn the string again into a string (.decode). I would think that this always leads to an error.
import base64
api_token = "random-string"
decoded_token = base64.b64decode(api_token.encode("utf-8")).decode("utf-8")
Running the code locally gives me:
Error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xad in position 0: invalid start byte
What input/type would api_token need to be in order for this line not to throw an error? Is that even possible or must there be something else at play?
Edit: As mentioned by Klaus D., apparently, in Python 2 both encode and decode consumed and returned a string. Yet, running the code above in Python 2.7 gives me the same error and I have yet to find an input for api_token that does not throw an error.
The issue is likely just that your test input string is not a base64-encoded string, while in production, whatever input already is!
Python 2.7.18 (default, Jan 4 2022, 17:47:56)
...
>>> import base64
>>> api_token = "random-string"
>>> base64.b64decode(api_token)
'\xad\xa9\xdd\xa2k-\xae)\xe0'
>>> base64.b64decode(api_token).decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xad in position 0: invalid start byte
encoding the string as base64, you also don't need to decode it as "utf-8" afterwards, though you may if you expect unicode characters
>>> api_token = base64.b64encode(api_token)
>>> api_token
'cmFuZG9tLXN0cmluZw=='
>>> base64.b64decode(api_token)
'random-string'
>>> base64.b64decode(api_token).decode("utf-8")
u'random-string'
Example with non-ascii characters
>>> base64.b64decode(base64.b64encode("random string后缀"))
'random string\xe5\x90\x8e\xe7\xbc\x80'
>>> base64.b64decode(base64.b64encode("random string后缀")).decode("utf-8")
u'random string\u540e\u7f00'
>>> sys.stdout.write(base64.b64decode(base64.b64encode("random string后缀")) + "\n")
random string后缀
Note that in Python 2.7, bytes is just an alias for str, and a special unicode was added to support unicode!
>>> bytes is str
True
>>> bytes is unicode
False
>>> str("foo")
'foo'
>>> unicode("foo")
u'foo'
I migrated my python code from Win10 host to WS2012R2. Surprisingly it stops operating correctly and now shows warning message: "UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to "
I've tried to execute a command:
set PYTHONLEGACYWINDOWSSTDIO=yes
My code:
import logging
import sys
def get_console_handler():
console_handler = logging.StreamHandler(sys.stdout)
return console_handler
def get_logger():
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
logger.addHandler(get_console_handler())
return logger
my_logger = get_logger()
my_logger.debug("Это отладочное сообщение".encode("cp1252"))
What should I do to get rid of this warning?
Update
Colleagues, I am sorry for misleading you! I am obviously was tired after long hours of bug tracking )
The problem doesn't connect with "*.encode()" calling as such, it is connected with default python encoding while IO console operation (I suppose so)! The original code makes some requests from DB in cp1251 charset but the problem appears when python is trying to convert it to cp1252.
Here is another example of how to summon the error.
Create a plain text file, i.e. test.txt with text "Это отладочное сообщение" and save it cp1252.
Run python console and enter:
f = open("test.txt")
f.read()
Output:
f = open("test.txt")
f.read()
Traceback (most recent call last): File "<stdin>", line 1, in <module>
File "c:\project\venv\lib\encodings\cp1252.py", line 23, in decode return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 29: character maps to <undefined>
Use encode("utf-8"). Here is a list of python encodings: https://docs.python.org/2.4/lib/standard-encodings.html
my_logger.debug("Это отладочное сообщение".encode("utf-8"))
then use .decode("utf-8") to see the printable value of your string
The problem is how logging.StreamHandler performs console output, namely due to the fact that you couldn't change default encoding in contrast with FileHandler.
If the default system encoding doesn't match the needed one, you could face an issue.
For my example. I wanted to output cp1251 lines, while system default encoding was:
import locale
locale.getpreferredencoding()
'cp1252'
This question was solved by changing system locale (see https://stackoverflow.com/a/11234956/9851754). Choose "Change system locale..." for non-Unicode programs. No code changes needed.
import locale
locale.getpreferredencoding()
'cp1251'
I have tested your code with Python 3.6.8 and it worked for me (I didn't change anything).
Python 3.6.8:
>>> python3 -V
Python 3.6.8
>>> python3 test.py
Это отладочное сообщение
But when I have tested it with Python 2.7.15+, I got a similar error than you.
Python 2.7.15+ with your implementation:
>>> python2 -V
Python 2.7.15+
>>> python2 test.py
File "test.py", line 17
SyntaxError: Non-ASCII character '\xd0' in file test.py on line 17, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
Then I have put the following line into the first line it worked for me.
Begging of code:
# -*- coding: utf-8 -*-
import logging
import sys
...
Output with Python 2.7.15+ and with modified code:
>>> python2 -V
Python 2.7.15+
>>> python2 test.py
Это отладочное сообщение
I wrote a small example of the issue for everybody to see what's going on using Python 2.7 and Django 1.10.8
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, unicode_literals, print_function
import time
from django import setup
setup()
from django.contrib.auth.models import Group
group = Group(name='schön')
print(type(repr(group)))
print(type(str(group)))
print(type(unicode(group)))
print(group)
print(repr(group))
print(str(group))
print(unicode(group))
time.sleep(1.0)
print('%s' % group)
print('%r' % group) # fails
print('%s' % [group]) # fails
print('%r' % [group]) # fails
Exits with the following output + traceback
$ python .PyCharmCE2017.2/config/scratches/scratch.py
<type 'str'>
<type 'str'>
<type 'unicode'>
schön
<Group: schön>
schön
schön
schön
Traceback (most recent call last):
File "/home/srkunze/.PyCharmCE2017.2/config/scratches/scratch.py", line 22, in <module>
print('%r' % group) # fails
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)
Has somebody an idea what's going on here?
At issue here is that you are interpolating UTF-8 bytestrings into a Unicode string. Your '%r' string is a Unicode string because you used from __future__ import unicode_literals, but repr(group) (used by the %r placeholder) returns a bytestring. For Django models, repr() can include Unicode data in the representation, encoded to a bytestring using UTF-8. Such representations are not ASCII safe.
For your specific example, repr() on your Group instance produces the bytestring '<Group: sch\xc3\xb6n>'. Interpolating that into a Unicode string triggers the implicit decoding:
>>> u'%s' % '<Group: sch\xc3\xb6n>'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)
Note that I did not use from __future__ import unicode_literals in my Python session, so the '<Group: sch\xc3\xb6n>' string is not a unicode object, it is a str bytestring object!
In Python 2, you should avoid mixing Unicode and byte strings. Always explicitly normalise your data (encoding Unicode to bytes or decoding bytes to Unicode).
If you must use from __future__ import unicode_literals, you can still create bytestrings by using a b prefix:
>>> from __future__ import unicode_literals
>>> type('') # empty unicode string
<type 'unicode'>
>>> type(b'') # empty bytestring, note the b prefix
<type 'str'>
>>> b'%s' % b'<Group: sch\xc3\xb6n>' # two bytestrings
'<Group: sch\xc3\xb6n>'
I had a hard time finding general solution to your problem.
__repr__() is what I understand supposed to return str, any efforts to change that seems to cause new problems.
Regarding the fact that the __repr__() method is defined outside the project, you are able to overload methods. For example
def new_repr(self):
return 'My representation of self {}'.format(self.name)
Group.add_to_class("__repr__", new_repr)
The only solution I can find, that works is to explicitly tell the interpreter how to handle the strings.
from __future__ import unicode_literals
from django.contrib.auth.models import Group
group = Group(name='schön')
print(type(repr(group)))
print(type(str(group)))
print(type(unicode(group)))
print(group)
print(repr(group))
print(str(group))
print(unicode(group))
print('%s' % group)
print('%r' % repr(group))
print('%s' % [str(group)])
print('%r' % [repr(group)])
# added
print('{}'.format([repr(group).decode("utf-8")]))
print('{}'.format([repr(group)]))
print('{}'.format(group))
Working with strings in python 2.x is a mess.
Hope this brings some light into how to work around (which is the only way I can find) the problem.
I think the real issue is in the django code.
It was reported six years ago:
https://code.djangoproject.com/ticket/18063
I think patch to django would solve it:
def __repr__(self):
return self.....encode('ascii', 'replace')
I think the repr() method should return "7 bit ascii".
If it's the case then we need to override the unicode method with our customised method. Try below code. It will work. I have tested it.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from django.contrib.auth.models import Group
def custom_unicode(self):
return u"%s" % (self.name.encode('utf-8', 'ignore'))
Group.__unicode__ = custom_unicode
group = Group(name='schön')
# Tests
print(type(repr(group)))
print(type(str(group)))
print(type(unicode(group)))
print(group)
print(repr(group))
print(str(group))
print(unicode(group))
print('%s' % group)
print('%r' % group)
print('%s' % [group])
print('%r' % [group])
# output:
<type 'str'>
<type 'str'>
<type 'unicode'>
schön
<Group: schön>
schön
schön
schön
<Group: schön>
[<Group: schön>]
[<Group: schön>]
Reference: https://docs.python.org/2/howto/unicode.html
I am not familiar with Django. Your issue seems to be representing text data in ASCI which is actually in unicode. Please try unidecode module in Python.
from unidecode import unidecode
#print(string) is replaced with
print(unidecode(string))
Refer Unidecode
I have been trying to encode an encrypted text by taking the input (encrypted text) from command line and encoding using the following code:
# -*- coding: utf-8 -*-
import sys
a = sys.argv[1]
b = a.encode('utf-8')
print(a)
print('\n')
print(b)
OUTPUT:
$python3 test.py 'b\x90\x89\xc6g\xa6\x15I\x9bKD\xd4s\xf2\x9f\x82Y\xedaa}0wL'
b\x90\x89\xc6g\xa6\x15I\x9bKD\xd4s\xf2\x9f\x82Y\xedaa}0wL
b'b\\x90\\x89\\xc6g\\xa6\\x15I\\x9bKD\\xd4s\\xf2\\x9f\\x82Y\\xedaa}0wL'
I need the exact same output which i input from terminal just in bytes to perform the decryption operation. When i try to replace it by the following code:
# -*- coding: utf-8 -*-
import sys
a = sys.argv[1]
b = a.encode('utf-8').replace('\\','\')
print(a)
print('\n')
print(b)
OUTPUT:
$python3 test.py 'b\x90\x89\xc6g\xa6\x15I\x9bKD\xd4s\xf2\x9f\x82Y\xedaa}0wL'
File "testsys.py", line 6
b = a.encode('utf-8').replace('\\','\')
^
SyntaxError: EOL while scanning string literal
I don't know about the error but in the line :
b = a.encode().replace('\\\','\')
but the parenthesis in bold is still colored like a string.
How can I get the exact same string just in bytes ?
\' is an escaped single quote character.
\\ is an escaped backslash character.
The quote for the string never got closed
You are escaping the closing '
b = a.encode('utf-8').replace('\\','\')
should be:
b = a.encode('utf-8').replace('\\','\'')
The data you provided cannot be encoded with utf-8.
a = 'b\x90\x89\xc6g\xa6\x15I\x9bKD\xd4s\xf2\x9f\x82Y\xedaa}0wL'
>>> b = a.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x90 in position 1: ordinal
not in range(128)
If it actually works for you did you check decrypting the encoded string and did you get a different answer than the original string. Because by encoding the string with utf-8 doesn't mean that you are changing the integrity of the data.
I use this code to deal with chinese:
# -*- coding: utf-8 -*-
strInFilNname = u'%s' % raw_input("input fileName:").decode('utf-8')
pathName = u'%s' % raw_input("input filePath:").decode('utf-8')
When I run this on PyCharm everything is ok. But when I run this on windows CMD, I get this error code:
Traceback (most recent call last):
File "E:\Sites\GetAllFile.py", line 23, in <module>
strInFilNname = u'%s' % raw_input("input filename:").decode('utf-8')
File "E:\Portable Python 2.7.5.1\App\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd3 in position 0: invalid continuation byte
I have read this python document about Unicode HOWTO, but can't get effective solution.
I really want to know why it does so .
The Windows console encoding is not UTF-8. I will assume you are using a Chinese-localized version of Windows since you mention the errors go away in Python 3.3 and suggest trying sys.stdin.encoding instead of utf-8.
Below is an example from my US-localized Windows using characters from the cp437 code page, which is what the US console uses (Python 2.7.9):
This returns a byte string in the console encoding:
>>> raw_input('test? ')
test? │┤╡╢╖╕╣
'\xb3\xb4\xb5\xb6\xb7\xb8\xb9'
Convert to Unicode:
>>> import sys
>>> sys.stdin.encoding
'cp437'
>>> raw_input('test? ').decode(sys.stdin.encoding)
test? │┤╡╢╖╕╣║╗╝╜╛
u'\u2502\u2524\u2561\u2562\u2556\u2555\u2563\u2551\u2557\u255d\u255c\u255b'
Note it prints correctly:
>>> print(raw_input('test? ').decode(sys.stdin.encoding))
test? │┤╡╢╖╕╣║╗
│┤╡╢╖╕╣║╗
This works correctly for a Chinese Windows console as well as it will use the correct console encoding for Chinese. Here's the same code after switching my system to use Chinese:
>>> raw_input('Test? ')
Test? 我是美国人。
'\xce\xd2\xca\xc7\xc3\xc0\xb9\xfa\xc8\xcb\xa1\xa3'
>>> import sys
>>> sys.stdin.encoding
'cp936'
>>> raw_input('Test? ').decode(sys.stdin.encoding)
Test? 我是美国人。
u'\u6211\u662f\u7f8e\u56fd\u4eba\u3002'
>>> print raw_input('Test? ').decode(sys.stdin.encoding)
Test? 我是美国人。
我是美国人。
Python 3.3 makes this much simpler:
>>> input('Test? ')
Test? 我是美国人。
'我是美国人。'