Today at work, I saw the following line of code:
decoded_token = base64.b64decode(api_token.encode("utf-8")).decode("utf-8")
It is part of an Airflow ETL script, and decoded_token is used as a Bearer token in an API request. The code is executed on a server that runs Python 2.7, and my coworker told me that it runs daily, successfully.
Yet, from my understanding, the code first turns api_token into bytes (.encode), then turns those bytes into a string (base64.b64decode), and finally turns that string into a string again (.decode). I would think this always leads to an error.
import base64
api_token = "random-string"
decoded_token = base64.b64decode(api_token.encode("utf-8")).decode("utf-8")
Running the code locally gives me:
Error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xad in position 0: invalid start byte
What input/type would api_token need to be in order for this line not to throw an error? Is that even possible or must there be something else at play?
Edit: As mentioned by Klaus D., in Python 2 both encode and decode apparently consumed and returned a str. Yet, running the code above in Python 2.7 gives me the same error, and I have yet to find an input for api_token that does not throw one.
The issue is likely just that your test input string is not a base64-encoded string, while in production the incoming token already is!
Python 2.7.18 (default, Jan 4 2022, 17:47:56)
...
>>> import base64
>>> api_token = "random-string"
>>> base64.b64decode(api_token)
'\xad\xa9\xdd\xa2k-\xae)\xe0'
>>> base64.b64decode(api_token).decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xad in position 0: invalid start byte
If you base64-encode the string first, the round trip works. You also don't strictly need to decode the result as "utf-8" afterwards, though you may if you expect non-ASCII Unicode characters:
>>> api_token = base64.b64encode(api_token)
>>> api_token
'cmFuZG9tLXN0cmluZw=='
>>> base64.b64decode(api_token)
'random-string'
>>> base64.b64decode(api_token).decode("utf-8")
u'random-string'
Example with non-ASCII characters:
>>> base64.b64decode(base64.b64encode("random string后缀"))
'random string\xe5\x90\x8e\xe7\xbc\x80'
>>> base64.b64decode(base64.b64encode("random string后缀")).decode("utf-8")
u'random string\u540e\u7f00'
>>> import sys
>>> sys.stdout.write(base64.b64decode(base64.b64encode("random string后缀")) + "\n")
random string后缀
Note that in Python 2.7, bytes is just an alias for str, and a separate unicode type was added to support Unicode text!
>>> bytes is str
True
>>> bytes is unicode
False
>>> str("foo")
'foo'
>>> unicode("foo")
u'foo'
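This bytes/str aliasing also explains why the line from the question runs at all under Python 2.7: calling .encode("utf-8") on a str first decodes it implicitly with the default ASCII codec, which only succeeds when the token is pure ASCII (as any base64 string is). A sketch of the gotcha, assuming a Python 2.7 session:
>>> "random-string".encode("utf-8")   # pure ASCII: the implicit decode succeeds
'random-string'
>>> "caf\xc3\xa9".encode("utf-8")     # non-ASCII bytes: the implicit ascii decode fails
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)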
I have a colleague whose computer won't run a Python script that uses the dateutil.tz module; a timezone name '\xc3\xc0\xb9\xfa\xc9\xbd\xb5\xd8\xb1\xea\xd7\xbc\xca\xb1\xbc\xe4' shows up, and in dateutil.tz there is this code:
def tzname_in_python2(namefunc):
    """Change unicode output into bytestrings in Python 2

    tzname() API changed in Python 3. It used to return bytes, but was changed
    to unicode strings
    """
    def adjust_encoding(*args, **kwargs):
        name = namefunc(*args, **kwargs)
        if name is not None and not PY3:
            name = name.encode()
        return name
    return adjust_encoding
which breaks because the string in question is not ASCII. What is this string? It doesn't look like valid UTF-8:
>>> a = '\xc3\xc0\xb9\xfa\xc9\xbd\xb5\xd8\xb1\xea\xd7\xbc\xca\xb1\xbc\xe4'
>>> a.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\app\python\anaconda\2\envs\emblaze\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: invalid continuation byte
My Python script contains
timezone = dateutil.tz.tzlocal()
and the resulting object fails to run timezone.tzname(some_timestamp) because of the non-ASCII nature of the timezone name.
If this happens again, there is a Python module for detecting the encoding:
>>> import chardet
>>> z = b'\xc3\xc0\xb9\xfa\xc9\xbd\xb5\xd8\xb1\xea\xd7\xbc\xca\xb1\xbc\xe4'
>>> chardet.detect(z)
{'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
Aha, I figured it out after a bunch of searching on the net. It's not UTF-8 or UTF-16; it seems to be GB2312 (or GBK) encoding, which can be decoded in Python (on MS Windows, at least) with the gbk codec:
>>> '\xc3\xc0\xb9\xfa\xc9\xbd\xb5\xd8\xb1\xea\xd7\xbc\xca\xb1\xbc\xe4'.decode('gbk')
u'\u7f8e\u56fd\u5c71\u5730\u6807\u51c6\u65f6\u95f4'
>>> '\xc3\xc0\xb9\xfa\xc9\xbd\xb5\xd8\xcf\xc4\xc1\xee\xca\xb1'.decode('gbk')
u'\u7f8e\u56fd\u5c71\u5730\u590f\u4ee4\u65f6'
which print out (in IPython Notebook) as
美国山地标准时间
美国山地夏令时
which Google Translate tells me represents "American Mountain Standard Time" and "American Mountain Summertime", respectively.
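Since the timezone name comes from the OS and is encoded in the system locale, a more portable fix than hard-coding 'gbk' might be to ask Python for the locale's preferred encoding (a sketch, not part of dateutil's API; locale.getpreferredencoding() should return 'cp936' on a Chinese-localized Windows):
>>> import locale
>>> raw = '\xc3\xc0\xb9\xfa\xc9\xbd\xb5\xd8\xb1\xea\xd7\xbc\xca\xb1\xbc\xe4'
>>> print raw.decode(locale.getpreferredencoding())
美国山地标准时间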
I have a large project where at various places problematic implicit Unicode conversions (coercions) were used, in forms such as:
someDynamicStr = "bar" # could come from various sources
# works
u"foo" + someDynamicStr
u"foo{}".format(someDynamicStr)
someDynamicStr = "\xff" # uh-oh
# raises UnicodeDecodeError
u"foo" + someDynamicStr
u"foo{}".format(someDynamicStr)
(Possibly other forms as well.)
Now I would like to track down those usages, especially those in actively used code.
It would be great if I could easily replace the unicode constructor with a wrapper which checks whether the input is of type str and the encoding/errors parameters are set to the default values and then notifies me (prints traceback or such).
/edit:
While not directly related to what I am looking for, I came across this gloriously horrible hack for making the decode exception go away altogether (only the decode one, i.e. str to unicode, but not the other way around; see https://mail.python.org/pipermail/python-list/2012-July/627506.html).
I don't plan on using it but it might be interesting for those battling problems with invalid Unicode input and looking for a quick fix (but please think about the side effects):
import codecs
codecs.register_error("strict", codecs.ignore_errors)
codecs.register_error("strict", lambda x: (u"", x.end)) # alternatively
(An internet search for codecs.register_error("strict" revealed that apparently it's used in some real projects.)
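For context, here is a small (hypothetical) session showing what the second variant does: decode errors raised under the "strict" handler name silently drop the offending bytes instead of raising:
>>> import codecs
>>> codecs.register_error("strict", lambda exc: (u"", exc.end))
>>> "caf\xc3\xa9".decode("ascii")
u'caf'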
/edit #2:
For explicit conversions I made a snippet with the help of a SO post on monkeypatching:
class PatchedUnicode(unicode):
    def __init__(self, obj=None, encoding=None, *args, **kwargs):
        if encoding in (None, "ascii", "646", "us-ascii"):
            print("Problematic unicode() usage detected!")
        super(PatchedUnicode, self).__init__(obj, encoding, *args, **kwargs)

import __builtin__
__builtin__.unicode = PatchedUnicode
This only affects explicit conversions using the unicode() constructor directly so it's not something I need.
/edit #3:
The thread "Extension method for python built-in types!" makes me think that it might actually not be easily possible (in CPython at least).
/edit #4:
It's nice to see many good answers here, too bad I can only give out the bounty once.
In the meantime I came across a somewhat similar question, at least in the sense of what the person tried to achieve: Can I turn off implicit Python unicode conversions to find my mixed-strings bugs?
Please note though that throwing an exception would not have been OK in my case. I was looking for something which points me to the various locations of problematic code (e.g. by printing something), not something which exits the program or changes its behavior (because this way I can prioritize what to fix).
On another note, the people working on the Mypy project (who include Guido van Rossum) might also come up with something similarly helpful in the future; see the discussions at https://github.com/python/mypy/issues/1141 and, more recently, https://github.com/python/typing/issues/208.
/edit #5:
I also came across the following but haven't yet had the time to test it: https://pypi.python.org/pypi/unicode-nazi
You can register a custom encoding which prints a message whenever it's used:
Code in ourencoding.py:
import sys
import codecs
import traceback

# Define a function to print out a stack frame and a message:
def printWarning(s):
    sys.stderr.write(s)
    sys.stderr.write("\n")
    l = traceback.extract_stack()
    # cut off the frames pointing to printWarning and our_encode
    l = traceback.format_list(l[:-2])
    sys.stderr.write("".join(l))

# Define our encoding:
originalencoding = sys.getdefaultencoding()

def our_encode(s, errors='strict'):
    printWarning("Default encoding used")
    return (codecs.encode(s, originalencoding, errors), len(s))

def our_decode(s, errors='strict'):
    printWarning("Default encoding used")
    return (codecs.decode(s, originalencoding, errors), len(s))

def our_search(name):
    if name == 'our_encoding':
        return codecs.CodecInfo(
            name='our_encoding',
            encode=our_encode,
            decode=our_decode)
    return None

# register our search and set the default encoding:
codecs.register(our_search)
reload(sys)
sys.setdefaultencoding('our_encoding')
If you import this file at the start of your script, you'll see warnings for implicit conversions:
#!python2
# coding: utf-8
import ourencoding
print("test 1")
a = "hello " + u"world"
print("test 2")
a = "hello ☺ " + u"world"
print("test 3")
b = u" ".join(["hello", u"☺"])
print("test 4")
c = unicode("hello ☺")
output:
test 1
test 2
Default encoding used
  File "test.py", line 10, in <module>
    a = "hello ☺ " + u"world"
test 3
Default encoding used
  File "test.py", line 13, in <module>
    b = u" ".join(["hello", u"☺"])
test 4
Default encoding used
  File "test.py", line 16, in <module>
    c = unicode("hello ☺")
It's not perfect, as test 1 shows: if the converted string only contains ASCII characters, sometimes you won't see a warning.
What you can do is the following:
First create a custom encoding. I will call it "lascii" for "logging ASCII":
import codecs
import traceback

def lascii_encode(input, errors='strict'):
    print("ENCODED:")
    traceback.print_stack()
    return codecs.ascii_encode(input)

def lascii_decode(input, errors='strict'):
    print("DECODED:")
    traceback.print_stack()
    return codecs.ascii_decode(input)

class Codec(codecs.Codec):
    def encode(self, input, errors='strict'):
        return lascii_encode(input, errors)

    def decode(self, input, errors='strict'):
        return lascii_decode(input, errors)

class IncrementalEncoder(codecs.IncrementalEncoder):
    def encode(self, input, final=False):
        print("Incremental ENCODED:")
        traceback.print_stack()
        # incremental codecs return only the object, not an (object, length) tuple
        return codecs.ascii_encode(input)[0]

class IncrementalDecoder(codecs.IncrementalDecoder):
    def decode(self, input, final=False):
        print("Incremental DECODED:")
        traceback.print_stack()
        return codecs.ascii_decode(input)[0]

class StreamWriter(Codec, codecs.StreamWriter):
    pass

class StreamReader(Codec, codecs.StreamReader):
    pass

def getregentry():
    return codecs.CodecInfo(
        name='lascii',
        encode=lascii_encode,
        decode=lascii_decode,
        incrementalencoder=IncrementalEncoder,
        incrementaldecoder=IncrementalDecoder,
        streamwriter=StreamWriter,
        streamreader=StreamReader,
    )
What this does is basically the same as the ASCII codec, except that it prints a message and the current stack trace every time it encodes or decodes something.
Now you need to make it available to the codecs module so that it can be found by the name "lascii". For this you need to create a search function that returns the lascii-codec when it's fed with the string "lascii". This is then registered to the codecs module:
def searchFunc(name):
    if name == "lascii":
        return getregentry()
    else:
        return None

codecs.register(searchFunc)
The last thing that is now left to do is to tell the sys module to use 'lascii' as default encoding:
import sys
reload(sys) # necessary, because sys.setdefaultencoding is deleted on start of Python
sys.setdefaultencoding('lascii')
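A quick sanity check (a hypothetical interactive session; the exact stack trace output will vary with the call site):
>>> import sys
>>> sys.getdefaultencoding()
'lascii'
>>> u"foo" + "bar"   # implicit str-to-unicode coercion now goes through lascii
DECODED:
  File "<stdin>", line 1, in <module>
u'foobar'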
Warning:
This uses some deprecated or otherwise unrecommended features. It might not be efficient or bug-free. Do not use in production, only for testing and/or debugging.
Just add:
from __future__ import unicode_literals
at the beginning of your source code files. It has to be the first import, and it has to be in every affected source file; then the headache of using Unicode in Python 2.7 goes away. If you didn't do anything super weird with strings, it should get rid of the problem out of the box.
Check out the following copy-and-paste from my console; I tried it with the sample from your question:
user@linux2:~$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> someDynamicStr = "bar" # could come from various sources
>>>
>>> # works
... u"foo" + someDynamicStr
u'foobar'
>>> u"foo{}".format(someDynamicStr)
u'foobar'
>>>
>>> someDynamicStr = "\xff" # uh-oh
>>>
>>> # raises UnicodeDecodeError
... u"foo" + someDynamicStr
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
>>> u"foo{}".format(someDynamicStr)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
>>>
And now with __future__ magic:
user@linux2:~$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from __future__ import unicode_literals
>>> someDynamicStr = "bar" # could come from various sources
>>>
>>> # works
... u"foo" + someDynamicStr
u'foobar'
>>> u"foo{}".format(someDynamicStr)
u'foobar'
>>>
>>> someDynamicStr = "\xff" # uh-oh
>>>
>>> # raises UnicodeDecodeError
... u"foo" + someDynamicStr
u'foo\xff'
>>> u"foo{}".format(someDynamicStr)
u'foo\xff'
>>>
I see you have a lot of edits describing solutions you may have encountered. I'm just going to address your original post, which I believe to be: "I want to create a wrapper around the unicode constructor that checks its input."
unicode is a built-in constructor; you can decorate it to add the checks you want:
import traceback

def add_checks(fxn):
    def resulting_fxn(*args, **kwargs):
        # this is where we check whether the input is of type str
        if args and type(args[0]) is str:
            # print any information you like, e.g. a traceback of the call site
            traceback.print_stack()
            # this is where the encoding/errors parameters are set to default values
            encoding = 'utf-8'
            errors = 'ignore'
            return fxn(args[0], encoding, errors)
        return fxn(*args, **kwargs)
    return resulting_fxn
Using it looks like this:
unicode = add_checks(unicode)
We overwrite the existing function name so that you don't have to change all the calls in the large project. You want to do this very early on in the runtime so that subsequent calls have the new behavior.
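A minimal usage sketch (assuming the decorator above; the traceback output is illustrative):
>>> unicode = add_checks(unicode)
>>> unicode("caf\xc3\xa9")   # a str slipped in: the wrapper prints the call site
  File "<stdin>", line 1, in <module>
u'caf\xe9'
Note that a plain assignment like this only rebinds the name in the current module; to patch it project-wide you would assign to __builtin__.unicode, as in the asker's edit #2.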
I have been trying to encode an encrypted text by taking the input (the encrypted text) from the command line and encoding it using the following code:
# -*- coding: utf-8 -*-
import sys
a = sys.argv[1]
b = a.encode('utf-8')
print(a)
print('\n')
print(b)
OUTPUT:
$python3 test.py 'b\x90\x89\xc6g\xa6\x15I\x9bKD\xd4s\xf2\x9f\x82Y\xedaa}0wL'
b\x90\x89\xc6g\xa6\x15I\x9bKD\xd4s\xf2\x9f\x82Y\xedaa}0wL
b'b\\x90\\x89\\xc6g\\xa6\\x15I\\x9bKD\\xd4s\\xf2\\x9f\\x82Y\\xedaa}0wL'
I need the exact same output that I input from the terminal, just in bytes, to perform the decryption operation. When I try to replace the doubled backslashes using the following code:
# -*- coding: utf-8 -*-
import sys
a = sys.argv[1]
b = a.encode('utf-8').replace('\\','\')
print(a)
print('\n')
print(b)
OUTPUT:
$python3 test.py 'b\x90\x89\xc6g\xa6\x15I\x9bKD\xd4s\xf2\x9f\x82Y\xedaa}0wL'
File "testsys.py", line 6
b = a.encode('utf-8').replace('\\','\')
^
SyntaxError: EOL while scanning string literal
I don't understand the error, but even in the line:
b = a.encode().replace('\\\','\')
the closing parenthesis is still syntax-highlighted as if it were part of a string.
How can I get the exact same string just in bytes ?
\' is an escaped single quote character.
\\ is an escaped backslash character.
The quote for the string never got closed
You are escaping the closing '
b = a.encode('utf-8').replace('\\','\')
should be:
b = a.encode('utf-8').replace('\\','\'')
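If the underlying goal is to turn the backslash escapes typed at the shell into the actual raw bytes (rather than to fix the replace call), one common approach in Python 3 is the unicode_escape codec. A sketch, assuming all escapes are \xNN sequences in the 0-255 range:
import sys

a = sys.argv[1]
# interpret the textual \xNN escapes, then map code points 0-255 back to single bytes
b = a.encode('utf-8').decode('unicode_escape').encode('latin-1')
print(b)  # b'b\x90\x89\xc6g\xa6\x15I\x9bKD\xd4s\xf2\x9f\x82Y\xedaa}0wL'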
The data you provided cannot be encoded with utf-8.
>>> a = 'b\x90\x89\xc6g\xa6\x15I\x9bKD\xd4s\xf2\x9f\x82Y\xedaa}0wL'
>>> b = a.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x90 in position 1: ordinal not in range(128)
If it actually works for you, did you check that decrypting the encoded string gives back the original string? Encoding the string with UTF-8 does not guarantee that the integrity of the data is preserved.
I use this code to deal with Chinese:
# -*- coding: utf-8 -*-
strInFilNname = u'%s' % raw_input("input fileName:").decode('utf-8')
pathName = u'%s' % raw_input("input filePath:").decode('utf-8')
When I run this in PyCharm everything is OK, but when I run it in the Windows CMD console, I get this error:
Traceback (most recent call last):
  File "E:\Sites\GetAllFile.py", line 23, in <module>
    strInFilNname = u'%s' % raw_input("input filename:").decode('utf-8')
  File "E:\Portable Python 2.7.5.1\App\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd3 in position 0: invalid continuation byte
I have read the Python Unicode HOWTO document but can't get an effective solution from it. I really want to know why this happens.
The Windows console encoding is not UTF-8. I will assume you are using a Chinese-localized version of Windows. The fix is to decode with sys.stdin.encoding instead of "utf-8"; as shown at the end, Python 3.3 makes the errors go away entirely.
Below is an example from my US-localized Windows using characters from the cp437 code page, which is what the US console uses (Python 2.7.9):
This returns a byte string in the console encoding:
>>> raw_input('test? ')
test? │┤╡╢╖╕╣
'\xb3\xb4\xb5\xb6\xb7\xb8\xb9'
Convert to Unicode:
>>> import sys
>>> sys.stdin.encoding
'cp437'
>>> raw_input('test? ').decode(sys.stdin.encoding)
test? │┤╡╢╖╕╣║╗╝╜╛
u'\u2502\u2524\u2561\u2562\u2556\u2555\u2563\u2551\u2557\u255d\u255c\u255b'
Note it prints correctly:
>>> print(raw_input('test? ').decode(sys.stdin.encoding))
test? │┤╡╢╖╕╣║╗
│┤╡╢╖╕╣║╗
This works correctly for a Chinese Windows console as well, since it will use the correct console encoding for Chinese. Here's the same code after switching my system to use Chinese:
>>> raw_input('Test? ')
Test? 我是美国人。
'\xce\xd2\xca\xc7\xc3\xc0\xb9\xfa\xc8\xcb\xa1\xa3'
>>> import sys
>>> sys.stdin.encoding
'cp936'
>>> raw_input('Test? ').decode(sys.stdin.encoding)
Test? 我是美国人。
u'\u6211\u662f\u7f8e\u56fd\u4eba\u3002'
>>> print raw_input('Test? ').decode(sys.stdin.encoding)
Test? 我是美国人。
我是美国人。
Python 3.3 makes this much simpler:
>>> input('Test? ')
Test? 我是美国人。
'我是美国人。'
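Applied to the original Python 2 script, the fix might look like this (a sketch; sys.stdin.encoding reflects the console's actual code page, e.g. cp936 on a Chinese Windows console). The u'%s' % wrapper can also be dropped, since .decode() already returns unicode:
# -*- coding: utf-8 -*-
import sys

# decode console input with the console's real encoding, not a hard-coded "utf-8"
strInFilNname = raw_input("input fileName:").decode(sys.stdin.encoding)
pathName = raw_input("input filePath:").decode(sys.stdin.encoding)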