Encoding String to Bytes in Python

I am trying to encode an encrypted text: I take the encrypted text as input from the command line and encode it with the following code:
# -*- coding: utf-8 -*-
import sys
a = sys.argv[1]
b = a.encode('utf-8')
print(a)
print('\n')
print(b)
OUTPUT:
$ python3 test.py 'b\x90\x89\xc6g\xa6\x15I\x9bKD\xd4s\xf2\x9f\x82Y\xedaa}0wL'
b\x90\x89\xc6g\xa6\x15I\x9bKD\xd4s\xf2\x9f\x82Y\xedaa}0wL
b'b\\x90\\x89\\xc6g\\xa6\\x15I\\x9bKD\\xd4s\\xf2\\x9f\\x82Y\\xedaa}0wL'
I need exactly what I typed in the terminal, just as bytes, to perform the decryption. When I try to replace the doubled backslashes with the following code:
# -*- coding: utf-8 -*-
import sys
a = sys.argv[1]
b = a.encode('utf-8').replace('\\','\')
print(a)
print('\n')
print(b)
OUTPUT:
$ python3 test.py 'b\x90\x89\xc6g\xa6\x15I\x9bKD\xd4s\xf2\x9f\x82Y\xedaa}0wL'
File "testsys.py", line 6
b = a.encode('utf-8').replace('\\','\')
^
SyntaxError: EOL while scanning string literal
I don't understand the error, but in the line:
b = a.encode().replace('\\\','\')
the final parenthesis is still highlighted as if it were part of a string.
How can I get exactly the same string, just as bytes?

\' is an escaped single quote character.
\\ is an escaped backslash character.
So the closing quote is escaped and the string literal is never closed.

You are escaping the closing ' with the backslash, so the string literal never ends:
b = a.encode('utf-8').replace('\\','\')
should be:
b = a.encode('utf-8').replace('\\','\'')
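Closing the string literal fixes the SyntaxError, but it does not turn the escape sequences typed on the command line into raw bytes. Here is a minimal Python 3 sketch of one way to do that. It assumes the argument arrives with literal backslashes (as it does inside single quotes in a POSIX shell) and contains only ASCII text and \xNN-style escapes. The helper name escaped_text_to_bytes is hypothetical.

```python
def escaped_text_to_bytes(text):
    """Interpret backslash escapes such as \\x90 in text and return raw bytes.

    Hypothetical helper: assumes text is ASCII and uses only escapes
    whose values fit in a single byte (0x00-0xFF).
    """
    # 'unicode_escape' turns the four-character sequence \x90 into U+0090.
    unescaped = text.encode("ascii").decode("unicode_escape")
    # latin-1 maps code points 0-255 one-to-one onto byte values.
    return unescaped.encode("latin-1")


# Raw string literal: the backslashes are literal, like sys.argv[1] would be.
argument = r"b\x90\x89\xc6g\xa6\x15I"
raw = escaped_text_to_bytes(argument)
print(len(raw))
```

In a script, sys.argv[1] would take the place of the argument variable. Passing the ciphertext as hex or base64 instead would avoid the escaping question altogether.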

In Python 2, the data you provided cannot be encoded with utf-8, because encoding a byte string first decodes it implicitly as ASCII:
>>> a = 'b\x90\x89\xc6g\xa6\x15I\x9bKD\xd4s\xf2\x9f\x82Y\xedaa}0wL'
>>> b = a.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x90 in position 1: ordinal not in range(128)
If it actually works for you, did you try decrypting the encoded string, and did you get a result different from the original string? Encoding the string with utf-8 does not by itself change the integrity of the data.

Related

Why does this production code work: `base64.b64decode(api_token.encode("utf-8")).decode("utf-8")`?

Today at work, I saw the following line of code:
decoded_token = base64.b64decode(api_token.encode("utf-8")).decode("utf-8")
It is part of an Airflow ETL script, and decoded_token is used as a Bearer Token in an API request. This code is executed on a server that uses Python 2.7, and my coworker told me that it runs daily, successfully.
Yet, from my understanding, the code first tries to turn api_token into bytes (.encode), then turn the bytes into a string (base64.b64decode) and finally turn the string again into a string (.decode). I would think that this always leads to an error.
import base64
api_token = "random-string"
decoded_token = base64.b64decode(api_token.encode("utf-8")).decode("utf-8")
Running the code locally gives me:
Error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xad in position 0: invalid start byte
What input/type would api_token need to be in order for this line not to throw an error? Is that even possible or must there be something else at play?
Edit: As mentioned by Klaus D., apparently, in Python 2 both encode and decode consumed and returned a string. Yet, running the code above in Python 2.7 gives me the same error and I have yet to find an input for api_token that does not throw an error.

The issue is likely just that your test input is not a base64-encoded string, while in production the input already is one!
Python 2.7.18 (default, Jan 4 2022, 17:47:56)
...
>>> import base64
>>> api_token = "random-string"
>>> base64.b64decode(api_token)
'\xad\xa9\xdd\xa2k-\xae)\xe0'
>>> base64.b64decode(api_token).decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xad in position 0: invalid start byte
If you first encode the string as base64, decoding works. You also don't need to decode the result as "utf-8" afterwards, though you may if you expect Unicode characters:
>>> api_token = base64.b64encode(api_token)
>>> api_token
'cmFuZG9tLXN0cmluZw=='
>>> base64.b64decode(api_token)
'random-string'
>>> base64.b64decode(api_token).decode("utf-8")
u'random-string'
Example with non-ascii characters
>>> base64.b64decode(base64.b64encode("random string后缀"))
'random string\xe5\x90\x8e\xe7\xbc\x80'
>>> base64.b64decode(base64.b64encode("random string后缀")).decode("utf-8")
u'random string\u540e\u7f00'
>>> import sys
>>> sys.stdout.write(base64.b64decode(base64.b64encode("random string后缀")) + "\n")
random string后缀
Note that in Python 2.7, bytes is just an alias for str, and a separate unicode type exists to hold Unicode text!
>>> bytes is str
True
>>> bytes is unicode
False
>>> str("foo")
'foo'
>>> unicode("foo")
u'foo'
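For readers on Python 3, where str and bytes are distinct types, the same round trip can be sketched as follows. The helper name decode_token and the choice to return None on bad input are assumptions for illustration, not part of the original script.

```python
import base64
import binascii


def decode_token(api_token):
    """Decode a base64 text token to text, or return None if it is not valid.

    Hypothetical helper: the None-on-failure behavior is an assumption.
    """
    try:
        # str -> bytes, then base64 bytes -> raw bytes (strict alphabet check).
        raw = base64.b64decode(api_token.encode("utf-8"), validate=True)
        # raw bytes -> str; fails if the payload is not UTF-8 text.
        return raw.decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None


encoded = base64.b64encode("random-string".encode("utf-8")).decode("ascii")
print(encoded)
print(decode_token(encoded))
print(decode_token("random-string"))
```

With validate=True, a plain string such as "random-string" is rejected up front because "-" is not in the standard base64 alphabet, instead of being silently decoded into bytes that are not valid UTF-8.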

UnicodeEncodeError raised in logging.StreamHandler

I migrated my Python code from a Win10 host to WS2012R2. Surprisingly, it stopped working correctly and now shows the warning message: "UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>"
I have tried running this command:
set PYTHONLEGACYWINDOWSSTDIO=yes
My code:
import logging
import sys

def get_console_handler():
    console_handler = logging.StreamHandler(sys.stdout)
    return console_handler

def get_logger():
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    logger.addHandler(get_console_handler())
    return logger

my_logger = get_logger()
my_logger.debug("Это отладочное сообщение".encode("cp1252"))
What should I do to get rid of this warning?
Update
Colleagues, I am sorry for misleading you! I was obviously tired after long hours of bug tracking.
The problem is not connected with calling .encode() as such; I suppose it is connected with Python's default encoding for console I/O. The original code reads data from a DB in the cp1251 charset, and the problem appears when Python tries to convert it to cp1252.
Here is another way to trigger the error.
Create a plain text file, e.g. test.txt, with the text "Это отладочное сообщение" and save it as cp1252.
Run the Python console and enter:
f = open("test.txt")
f.read()
Output:
>>> f = open("test.txt")
>>> f.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\project\venv\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 29: character maps to <undefined>

Use encode("utf-8"). Here is a list of Python encodings: https://docs.python.org/2.4/lib/standard-encodings.html
my_logger.debug("Это отладочное сообщение".encode("utf-8"))
Then use .decode("utf-8") to see the printable value of your string.

The problem lies in how logging.StreamHandler writes to the console: unlike FileHandler, it does not let you choose the encoding.
If the default system encoding doesn't match the one you need, you can run into this issue.
In my case, I wanted to output cp1251 lines, while the system default encoding was:
>>> import locale
>>> locale.getpreferredencoding()
'cp1252'
I solved this by changing the system locale (see https://stackoverflow.com/a/11234956/9851754): choose "Change system locale..." for non-Unicode programs. No code changes are needed. Afterwards:
>>> import locale
>>> locale.getpreferredencoding()
'cp1251'
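If changing the system locale is not an option, one alternative on Python 3 is to hand StreamHandler a text stream whose encoding you control, since the handler simply writes to whatever stream it is given. The sketch below is a hedged illustration: the helper name make_utf8_handler and the logger name are hypothetical, and an in-memory buffer stands in for the console.

```python
import io
import logging


def make_utf8_handler(binary_stream):
    """Return a StreamHandler that always writes UTF-8 to binary_stream."""
    text_stream = io.TextIOWrapper(
        binary_stream, encoding="utf-8", errors="replace", newline="\n"
    )
    return logging.StreamHandler(text_stream)


# An in-memory buffer stands in for sys.stdout.buffer so the sketch is
# self-contained; a real script would pass sys.stdout.buffer instead.
buffer = io.BytesIO()
logger = logging.getLogger("encoding_demo")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep the demo output out of the root logger
logger.addHandler(make_utf8_handler(buffer))

logger.debug("Это отладочное сообщение")  # pass str, not encoded bytes
captured = buffer.getvalue().decode("utf-8")
```

Note that the message is passed as str. Encoding it by hand before logging, as in the question, would make the handler print the bytes representation rather than the text.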

I tested your code with Python 3.6.8 and it worked for me (I didn't change anything).
Python 3.6.8:
$ python3 -V
Python 3.6.8
$ python3 test.py
Это отладочное сообщение
But when I tested it with Python 2.7.15+, I got an error similar to yours.
Python 2.7.15+ with your implementation:
$ python2 -V
Python 2.7.15+
$ python2 test.py
File "test.py", line 17
SyntaxError: Non-ASCII character '\xd0' in file test.py on line 17, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
Then I put the following line at the top of the file, and it worked for me.
Beginning of the code:
# -*- coding: utf-8 -*-
import logging
import sys
...
Output with Python 2.7.15+ and the modified code:
$ python2 -V
Python 2.7.15+
$ python2 test.py
Это отладочное сообщение

What is the timezone name '\xc3\xc0\xb9\xfa\xc9\xbd\xb5\xd8\xb1\xea\xd7\xbc\xca\xb1\xbc\xe4'?

I have a colleague whose computer won't run a Python script that uses the dateutil.tz module; there is a timezone name '\xc3\xc0\xb9\xfa\xc9\xbd\xb5\xd8\xb1\xea\xd7\xbc\xca\xb1\xbc\xe4' that shows up and in dateutil.tz there is this code:
def tzname_in_python2(namefunc):
    """Change unicode output into bytestrings in Python 2

    tzname() API changed in Python 3. It used to return bytes, but was changed
    to unicode strings
    """
    def adjust_encoding(*args, **kwargs):
        name = namefunc(*args, **kwargs)
        if name is not None and not PY3:
            name = name.encode()
        return name
    return adjust_encoding
which breaks because the string in question is not ASCII. What is this string? It doesn't look like valid Unicode:
>>> a = '\xc3\xc0\xb9\xfa\xc9\xbd\xb5\xd8\xb1\xea\xd7\xbc\xca\xb1\xbc\xe4'
>>> a.decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "c:\app\python\anaconda\2\envs\emblaze\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: invalid continuation byte
My python script contains
timezone = dateutil.tz.tzlocal()
and the resulting object fails to run timezone.tzname(some_timestamp) because of the non-ASCII nature of the timezone name.

If this happens again, there is a python module for this:
>>> import chardet
>>> z = b'\xc3\xc0\xb9\xfa\xc9\xbd\xb5\xd8\xb1\xea\xd7\xbc\xca\xb1\xbc\xe4'
>>> chardet.detect(z)
{'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}

Aha, I figured it out after a bunch of searching on the net. It's not UTF8 or UTF16; it seems to be GB2312 (or GBK) encoding, which can be decoded in Python (on MS Windows, at least) with the gbk codec:
>>> '\xc3\xc0\xb9\xfa\xc9\xbd\xb5\xd8\xb1\xea\xd7\xbc\xca\xb1\xbc\xe4'.decode('gbk')
u'\u7f8e\u56fd\u5c71\u5730\u6807\u51c6\u65f6\u95f4'
>>> '\xc3\xc0\xb9\xfa\xc9\xbd\xb5\xd8\xcf\xc4\xc1\xee\xca\xb1'.decode('gbk')
u'\u7f8e\u56fd\u5c71\u5730\u590f\u4ee4\u65f6'
which print out (in IPython Notebook) as
美国山地标准时间
美国山地夏令时
which Google Translate tells me represents "American Mountain Standard Time" and "American Mountain Summertime", respectively.
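On Python 3 the same check can be written with bytes literals. This is a small sketch using the two byte strings from above, with gbk as the codec that the answer identified; the variable names are made up for illustration.

```python
# The two timezone names exactly as they appeared, as raw bytes.
raw_names = [
    b"\xc3\xc0\xb9\xfa\xc9\xbd\xb5\xd8\xb1\xea\xd7\xbc\xca\xb1\xbc\xe4",
    b"\xc3\xc0\xb9\xfa\xc9\xbd\xb5\xd8\xcf\xc4\xc1\xee\xca\xb1",
]

# bytes.decode turns each byte string into a str using the gbk codec.
decoded_names = [raw.decode("gbk") for raw in raw_names]

# Printing ascii() avoids depending on the console's own encoding.
for name in decoded_names:
    print(ascii(name))
```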

Encoding and decoding in Python 2.7.5.1 on Windows cmd and PyCharm give different results

I use this code to handle Chinese input:
# -*- coding: utf-8 -*-
strInFilNname = u'%s' % raw_input("input fileName:").decode('utf-8')
pathName = u'%s' % raw_input("input filePath:").decode('utf-8')
When I run this in PyCharm everything is OK. But when I run it in Windows CMD, I get this error:
Traceback (most recent call last):
File "E:\Sites\GetAllFile.py", line 23, in <module>
strInFilNname = u'%s' % raw_input("input filename:").decode('utf-8')
File "E:\Portable Python 2.7.5.1\App\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd3 in position 0: invalid continuation byte
I have read the Python Unicode HOWTO, but couldn't find an effective solution.
I really want to know why this happens.

The Windows console encoding is not UTF-8. Since you mention that the errors go away in Python 3.3, I assume you are using a Chinese-localized version of Windows, and I suggest trying sys.stdin.encoding instead of utf-8.
Below is an example from my US-localized Windows using characters from the cp437 code page, which is what the US console uses (Python 2.7.9):
This returns a byte string in the console encoding:
>>> raw_input('test? ')
test? │┤╡╢╖╕╣
'\xb3\xb4\xb5\xb6\xb7\xb8\xb9'
Convert to Unicode:
>>> import sys
>>> sys.stdin.encoding
'cp437'
>>> raw_input('test? ').decode(sys.stdin.encoding)
test? │┤╡╢╖╕╣║╗╝╜╛
u'\u2502\u2524\u2561\u2562\u2556\u2555\u2563\u2551\u2557\u255d\u255c\u255b'
Note it prints correctly:
>>> print(raw_input('test? ').decode(sys.stdin.encoding))
test? │┤╡╢╖╕╣║╗
│┤╡╢╖╕╣║╗
This also works on a Chinese Windows console, because it uses the correct console encoding for Chinese. Here's the same code after switching my system to use Chinese:
>>> raw_input('Test? ')
Test? 我是美国人。
'\xce\xd2\xca\xc7\xc3\xc0\xb9\xfa\xc8\xcb\xa1\xa3'
>>> import sys
>>> sys.stdin.encoding
'cp936'
>>> raw_input('Test? ').decode(sys.stdin.encoding)
Test? 我是美国人。
u'\u6211\u662f\u7f8e\u56fd\u4eba\u3002'
>>> print raw_input('Test? ').decode(sys.stdin.encoding)
Test? 我是美国人。
我是美国人。
Python 3.3 makes this much simpler:
>>> input('Test? ')
Test? 我是美国人。
'我是美国人。'
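The same idea applies whenever console bytes reach a program undecoded. Below is a minimal Python 3 sketch, assuming the bytes come from a console whose code page is known; the helper name decode_console_bytes and the utf-8 fallback are illustrative assumptions.

```python
import sys


def decode_console_bytes(raw, encoding=None):
    """Decode raw console bytes, defaulting to the encoding of sys.stdin.

    Hypothetical helper: falls back to utf-8 if stdin reports no encoding.
    """
    if encoding is None:
        encoding = getattr(sys.stdin, "encoding", None) or "utf-8"
    # errors="replace" keeps going on undecodable bytes instead of raising.
    return raw.decode(encoding, errors="replace")


# Byte strings taken from the two console sessions above.
chinese = decode_console_bytes(
    b"\xce\xd2\xca\xc7\xc3\xc0\xb9\xfa\xc8\xcb\xa1\xa3", "cp936"
)
boxes = decode_console_bytes(b"\xb3\xb4\xb5\xb6\xb7\xb8\xb9", "cp437")
print(ascii(chinese))
print(ascii(boxes))
```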

printing unicode through a QProcess

I'm having some trouble handling unicode output from a QProcess. When I run the following example I get ?? instead of 中文. Can anyone tell me how to get the unicode output?
from PyQt4.QtCore import *

def on_ready_stdout():
    byte_array = proc.readAllStandardOutput()
    print 'byte_array: ', byte_array
    print 'unicode: ', unicode(byte_array)

proc = QProcess()
proc.connect(proc, SIGNAL('readyReadStandardOutput()'), on_ready_stdout)
proc.start(u'python -c "print \'hello 中文\'"')
proc.waitForFinished()
@serge
I tried running your modified code, but I get an error:
byte_array: hello Σ╕¡µ??
unicode:
Traceback (most recent call last):
File "python_temp.py", line 7, in on_ready_stdout
print 'unicode: ', unicode(byte_array)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 6: ordinal not in range(128)

I've changed your code a little and got the expected output:
byte_array: hello 中文
unicode: hello 中文
My changes were:
I added the # -*- coding: utf-8 -*- magic comment
I removed the "u" string prefix from the proc.start call
Below is your code with my changes:
# -*- coding: utf-8 -*-
from PyQt4.QtCore import *

def on_ready_stdout():
    byte_array = proc.readAllStandardOutput()
    print 'byte_array: ', byte_array
    print 'unicode: ', unicode(byte_array)

proc = QProcess()
proc.connect(proc, SIGNAL('readyReadStandardOutput()'), on_ready_stdout)
proc.start('python -c "print \'hello 中文\'"')
proc.waitForFinished()
hope this helps, regards
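The underlying point is that a child process hands back bytes, and the parent has to decode them with the encoding the child actually used. The sketch below illustrates this on Python 3 with the standard subprocess module swapped in for QProcess, so it runs without Qt. It assumes the child writes UTF-8 explicitly; the variable names are made up for illustration.

```python
import subprocess
import sys

# Child program: writes 'hello 中文' to stdout as UTF-8 bytes. The \u escapes
# keep the command line itself pure ASCII (raw string, so the child sees them).
child_code = (
    r"import sys; "
    r"sys.stdout.buffer.write('hello \u4e2d\u6587'.encode('utf-8'))"
)

result = subprocess.run(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    check=True,
)

byte_output = result.stdout               # raw bytes, as received from the pipe
text_output = byte_output.decode("utf-8")  # decode with the child's encoding
print(ascii(text_output))
```

With QProcess, the equivalent step would be to decode the returned byte array explicitly rather than relying on an implicit ASCII conversion.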
