decode python binary string but not ensure ascii symbols - python

I have a binary object:
b'{"node": "\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435"}'
and I want it to be printed in Unicode and not strictly using ASCII symbols.
There is a hacky way to do it:
decoded = string.decode()
parsed_to_dict = json.loads(decoded)
dumped = json.dumps(parsed_to_dict, ensure_ascii=False)
print(dumped)
>>> {"node": "Обновление"}
However, the text will not always be parseable as JSON, so I need a simpler way.
Is there a way to print my binary object (or a decoded Unicode string) as a non-ASCII string without going through parsing/dumping JSON?
For example, how to print this b'\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435' as Обновление?

A bytes string like
b'\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435'
has been encoded using Unicode escape sequences. To convert it back into a proper Unicode string you simply need to specify the 'unicode-escape' codec:
data = b'\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435'
out = data.decode('unicode-escape')
print(out)
output
Обновление
However, if data is already a Unicode string, then you first need to encode it to bytes. You can do that using the ascii codec, presuming data only contains ASCII characters. If it contains characters outside ASCII but within the range of \x80 to \xff you may be able to use the 'latin1' codec.
data = '\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435'
out = data.encode('ascii').decode('unicode-escape')
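One caveat with the approach above: 'unicode-escape' decodes the bytes as Latin-1, so input that already contains real non-ASCII text (for example UTF-8 encoded bytes) can come out mangled. A minimal sketch of an alternative, assuming only \uXXXX sequences need to be expanded; the helper name unescape_u is made up for this example:

```python
import re

# Matches a literal backslash, "u", and exactly four hex digits.
_U_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def unescape_u(text):
    # Replace each escape with the character it names; leave everything else alone.
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

raw = b'{"node": "\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435"}'
print(unescape_u(raw.decode('utf-8')))
```

This sketch does not combine surrogate pairs (characters outside the Basic Multilingual Plane) and does not treat an escaped backslash before the u specially, so it only fits input where those cases cannot occur.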

This should work so long as all the escapes are valid (no single \).
import ast
bytes_object = b'{"node": "\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435"}}'
unicode_string = ast.literal_eval("'{}'".format(bytes_object.decode()))
output:
'{"node": "Обновление"}}'

Related

How to tell python that a string is actually bytes-object? Not converting

I have a txt file which contains a line:
' 6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
The content in the double quotes is actually a sequence of octal-escaped bytes, but with doubled escape characters (backslashes).
After the line has been read in, I used regex to extract the contents in the double quotes.
c = re.search(r': "(.+)"', line).group(1)
After that, I have two problems:
First, I need to replace the two escape characters with one.
Second, I need to tell Python that the str object c is actually a bytes object.
I have not managed to do either.
I have tried:
re.sub('\\', '\', line)
re.sub(r'\\', '\', line)
re.sub(r'\\', r'\', line)
All failed.
A bytes object can easily be defined with the b prefix:
c = b'\351\231\220\346\227\266\345\205\215\350\264\271'
How do I change the type of a variable from str to bytes? I don't think this is an encode-and-decode thing.
I googled a lot but found no answers. Maybe I used the wrong keywords.
Does anyone know how to do this, or another way to get what I want?
This is always a little confusing. I assume your bytes object should represent a string like:
b = b'\351\231\220\346\227\266\345\205\215\350\264\271'
b.decode()
# '限时免费'
To get that with your escaped string, you could use the codecs library and try:
import re
import codecs
line = ' 6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
c = re.search(r': "(.+)"', line).group(1)
codecs.escape_decode(bytes(c, "utf-8"))[0].decode("utf-8")
# '限时免费'
giving the same result.
The string contains literal text for escape codes. You cannot just replace the literal backslashes with a single backslash as escape codes are used in source code to indicate a single character. Decoding is needed to change literal escape codes to the actual character, but only byte strings can be decoded.
Encoding a Unicode string to a byte string with the Latin-1 codec translates Unicode code points 1:1 to the corresponding byte, so it is the common way to directly convert a "byte-string-like" Unicode string to an actual byte string.
Step-by-Step:
>>> s = "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"
>>> print(s) # Actual text of the string
\351\231\220\346\227\266\345\205\215\350\264\271
>>> s.encode('latin1') # Convert to byte string
b'\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271'
>>> # decode the escape codes...result is latin-1 characters in Unicode
>>> s.encode('latin1').decode('unicode-escape')
'é\x99\x90æ\x97¶å\x85\x8dè´¹'
>>> # convert back to byte string
>>> s.encode('latin1').decode('unicode-escape').encode('latin1')
b'\xe9\x99\x90\xe6\x97\xb6\xe5\x85\x8d\xe8\xb4\xb9'
>>> # data is UTF-8-encoded text so decode it correctly now
>>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
'限时免费'
Your text example looks like part of a Python dictionary. You may be able to save some steps by using the ast module's literal_eval function to turn the dictionary directly into a Python object, and then just fix this line of code:
>>> # Python dictionary-like text
>>> d = '{6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"}'
>>> import ast
>>> ast.literal_eval(d) # returns Python dictionary with value already decoded
{6: 'é\x99\x90æ\x97¶å\x85\x8dè´¹'}
>>> ast.literal_eval(d)[6] # but decoded incorrectly as Latin-1 text.
'é\x99\x90æ\x97¶å\x85\x8dè´¹'
>>> ast.literal_eval(d)[6].encode('latin1').decode('utf8') # undo Latin1, decode as UTF-8
'限时免费'
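The chain of steps shown in the step-by-step transcript can be wrapped in a small helper. This is only a sketch: the function name unescape_utf8 is made up here, and it assumes the escaped bytes really are UTF-8 encoded text, as they are in the question.

```python
import re

def unescape_utf8(text):
    # str -> bytes (1:1 via Latin-1), expand the escape codes,
    # go back to raw bytes, then decode those bytes as UTF-8.
    raw = text.encode('latin1').decode('unicode-escape').encode('latin1')
    return raw.decode('utf-8')

line = ' 6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
c = re.search(r': "(.+)"', line).group(1)
print(unescape_utf8(c))  # 限时免费
```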

How to convert a string containing a byte string to a byte string

How do I convert a string which contains the literal representation of a byte string, to a byte string?
This might seem strange, but a library I'm using raises a certain type of exception, and I need one of that exception's attributes. The attribute gives me the value I need, but as the literal representation of a byte string inside a string.
It is "value=b'\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'", I can get the value by splitting on the equals and then using eval, such as
>>> eval("value=b'\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'".split("=")[1])
b'\xbbOFa\x14\xdb{\xf5\x1b~H\xba\x96\xdaec'
This works, but as we all know eval can be very, very bad. So, is there an alternative to using eval?
There is a unicode-escape codec that will convert bytes containing literal sequences like \x.. or \u.... into their equivalent characters in the string. The remainder of the string is converted using the latin1 encoding, which just translates all the bytes.
So you convert the string to raw bytes using latin1, then convert back to a string using unicode-escape, and finally back to bytes using latin1 again:
>>> s = '\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'
>>> s.encode('latin1').decode('unicode-escape').encode('latin1')
b'\xbbOFa\x14\xdb{\xf5\x1b~H\xba\x96\xdaec'
Getting rid of the clutter around the string is pretty easy using regex or the more manual parsing you showed. For example:
>>> x = "value=b'\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'"
>>> s = re.fullmatch('[^\'"]+b([\'"])(.*)\\1[^\'"]*', x).group(2)
>>> s
'\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'
OR
>>> s = x.split('=')[1].lstrip('b').strip("'")
>>> s
'\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'
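Another option, not mentioned in the answer above, is ast.literal_eval from the standard library. It parses Python literals (including bytes literals) without executing arbitrary code, so it may serve as a safer drop-in for eval here. A minimal sketch, assuming the text after the first "=" is always a valid bytes literal:

```python
import ast

x = "value=b'\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'"

# Split only on the first "=" so an "=" inside the value would not matter.
literal = x.split('=', 1)[1]

# literal_eval accepts only Python literals (bytes, str, numbers, ...),
# not function calls or other expressions.
value = ast.literal_eval(literal)
print(value)
```

If the text is not a valid literal, literal_eval raises ValueError or SyntaxError, which the caller would need to handle.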

Decode UTF-8 encoding in JSON string

I have a JSON file which contains strings encoded like this:
"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1",
I am trying to parse this file using the json module. However I am not able to decode this string correctly.
What I get after decoding the JSON using .load() method is 'HornÃ\xadková'. The string should be correctly decoded as 'Horníková' instead.
I read the JSON specification and I understand that \u should be followed by four hexadecimal digits specifying the Unicode code point of the character. But it seems that in this JSON file UTF-8 encoded bytes are stored as \u-sequences.
What type of encoding is this, and how do I parse it correctly in Python 3?
Is this kind of file even valid JSON according to the specification?
Your text is already encoded, and normally you would tell Python so with a b prefix on the literal. Because you are using json and its input needs to be a string, you have to decode the encoded text manually. Since your input is not bytes, you can use the 'raw_unicode_escape' codec to convert the string to bytes without re-encoding it, and pass the same codec to open() so that it does not apply its own default encoding. You can then decode those bytes as UTF-8 to get the desired result.
Note that because the encoding and decoding must be applied to the file contents after reading them, you should call json.loads() on the resulting string instead of json.load() on the file object.
In [168]: with open('test.json', encoding='raw_unicode_escape') as f:
     ...:     d = json.loads(f.read().encode('raw_unicode_escape').decode())
     ...:
In [169]: d
Out[169]: {'sender_name': 'Horníková'}
The JSON you are reading was written incorrectly and the Unicode strings decoded from it will have to be re-encoded with the wrong encoding used, then decoded with the correct encoding.
Here's an example:
#!python3
import json
# The bad JSON you have
bad_json = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}'
print('bad_json =',bad_json)
# The wanted result from json.loads()
wanted = {'sender_name':'Horníková'}
# What correctly written JSON should look like
good_json = json.dumps(wanted)
print('good_json =',good_json)
# What you get when loading the bad JSON.
got = json.loads(bad_json)
print('wanted =',wanted)
print('got =',got)
# How to correct the mojibake string
corrected_sender = got['sender_name'].encode('latin1').decode('utf8')
print('corrected_sender =',corrected_sender)
Output:
bad_json = {"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}
good_json = {"sender_name": "Horn\u00edkov\u00e1"}
wanted = {'sender_name': 'Horníková'}
got = {'sender_name': 'HornÃ\xadková'}
corrected_sender = Horníková
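If the file has many affected fields, the same re-encode/decode fix can be applied to every string in the loaded data. A rough sketch, assuming all mis-decoded strings follow the same Latin-1/UTF-8 pattern; the function name fix_mojibake and the extra "tags" field are made up for this example:

```python
import json

def fix_mojibake(obj):
    # Strings: undo the wrong Latin-1 decoding, then decode as UTF-8.
    if isinstance(obj, str):
        try:
            return obj.encode('latin1').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            return obj  # does not fit the pattern; leave it unchanged
    # Containers: recurse into elements, keys and values.
    if isinstance(obj, list):
        return [fix_mojibake(item) for item in obj]
    if isinstance(obj, dict):
        return {fix_mojibake(k): fix_mojibake(v) for k, v in obj.items()}
    return obj  # numbers, booleans, None

bad_json = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1", "tags": ["a", 1]}'
print(fix_mojibake(json.loads(bad_json)))
```

Strings that cannot be re-encoded or decoded are returned as they are, which keeps correctly written values intact in most cases but cannot detect every ambiguous one.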
I don't know enough about JSON to be able to say whether this is valid or not, but you can parse these strings using the raw_unicode_escape codec:
>>> "Horn\u00c3\u00adkov\u00c3\u00a1".encode('raw_unicode_escape').decode('utf8')
'Horníková'
Reencode to bytes, and then redecode to text.
>>> 'HornÃ\xadková'.encode('latin-1').decode('utf-8')
'Horníková'
Is this kind of file even valid JSON according to the specification?
No.
A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes [emphasis added].
source
A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). [...] Any code point may be represented as a hexadecimal escape sequence [...] represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point [emphasis added].
source
UTF-8 byte sequences are neither Unicode characters nor Unicode code points.

Convert ascii string to base64 without the "b" and quotation marks

I wanted to convert an ascii string (well just text to be precise) towards base64.
So I know how to do that, I just use the following code:
import base64
string = base64.b64encode(bytes("string", 'utf-8'))
print (string)
Which gives me
b'c3RyaW5n'
However the problem is, I'd like it to just print
c3RyaW5n
Is it possible to print the string without the "b" and the '' quotation marks?
Thanks!
The b prefix denotes a bytes object (a binary string). A binary string is not a text string: it is a sequence of bytes (values in the range 0 to 255). It is simply displayed like a string to make it more compact.
In the case of base64, however, all characters are valid ASCII characters, so you can simply decode it:
print(string.decode('ascii'))
So here we decode each byte to its ASCII equivalent. Since base64 guarantees that every byte it produces is an ASCII character (A-Z, a-z, 0-9, '+', '/', and '=' for padding), we will always get a valid string. Mind, however, that this is not guaranteed for an arbitrary binary string.
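A small sketch of that difference, using a made-up payload of arbitrary bytes: the base64 output always decodes as ASCII, while the raw bytes may not.

```python
import base64

payload = b'\xff\xfe\x00binary'   # arbitrary bytes, not valid ASCII

encoded = base64.b64encode(payload)
text = encoded.decode('ascii')    # safe: base64 output is pure ASCII
print(text)

try:
    payload.decode('ascii')       # raw bytes are not necessarily ASCII
except UnicodeDecodeError as exc:
    print('cannot decode raw payload:', exc.reason)

# The text form round-trips back to the original bytes.
restored = base64.b64decode(text)
```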
A simple .decode("utf-8") would do
import base64
string = base64.b64encode(bytes("string", 'utf-8'))
print (string.decode("utf-8"))

Python, Encoding output to UTF-8

I have a function that builds a string composed of UTF-8 encoded characters. The output files are opened using 'w+', "utf-8" arguments.
However, when I call x.write(string) I get: UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 1: ordinal not in range(128)
I assume this is because normally you would do, for example, print(u'something'). But I need to use a variable, and the quotation marks in u'_' prevent that...
Any suggestions?
EDIT: Actual code here:
source = codecs.open("actionbreak/" + target + '.csv','r', "utf-8")
outTarget = codecs.open("actionbreak/" + newTarget, 'w+', "utf-8")
x = str(actionT(splitList[0], splitList[1]))
outTarget.write(x)
Essentially all this is supposed to be doing is building me a large amount of strings that look similar to this:
[日木曜 Deliverables]= CASE WHEN things = 11
THEN C ELSE 0 END
Are you using codecs.open()? Python 2.7's built-in open() does not support a specific encoding, meaning you have to manually encode non-ascii strings (as others have noted), but codecs.open() does support that and would probably be easier to drop in than manually encoding all the strings.
As you are actually using codecs.open(), going by your added code, and after a bit of looking things up myself, I suggest opening the input and/or output file with the encoding "utf-8-sig", which automatically handles the BOM for UTF-8 (see http://docs.python.org/2/library/codecs.html#encodings-and-unicode, near the bottom of the section). I would think that only matters for the input file. If none of the combinations (utf-8-sig/utf-8, utf-8/utf-8-sig, utf-8-sig/utf-8-sig) works, then I believe the most likely explanation is that your input file is encoded in a different Unicode format with a BOM: Python's default UTF-8 codec treats a BOM as a regular character, so reading the input would not raise an error but writing the output could.
Just noticed this, but... when you use codecs.open(), it expects a Unicode string, not an encoded one; try x = unicode(actionT(splitList[0], splitList[1])).
Your error can also occur when attempting to decode a unicode string (see http://wiki.python.org/moin/UnicodeEncodeError), but I don't think that should be happening unless actionT() or your list-splitting does something to the Unicode strings that causes them to be treated as non-Unicode strings.
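The difference between the "utf-8" and "utf-8-sig" codecs can be seen in a short experiment. This sketch uses Python 3, where the built-in open() accepts an encoding argument (the thread itself is about Python 2 and codecs.open()); the file name is arbitrary and a temporary directory is used so nothing is left behind.

```python
import os
import tempfile

text = 'abc'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'with_bom.txt')

    # 'utf-8-sig' writes the three BOM bytes before the text.
    with open(path, 'w', encoding='utf-8-sig') as f:
        f.write(text)

    with open(path, 'rb') as f:
        raw = f.read()

    # Plain 'utf-8' keeps the BOM as an ordinary first character.
    with open(path, encoding='utf-8') as f:
        with_bom = f.read()

    # 'utf-8-sig' strips the BOM on input.
    with open(path, encoding='utf-8-sig') as f:
        without_bom = f.read()

print(raw)                # b'\xef\xbb\xbfabc'
print(repr(with_bom))     # '\ufeffabc'
print(repr(without_bom))  # 'abc'
```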
In Python 2.x there are two types of string: byte strings and unicode strings. The first contains bytes and the second contains Unicode code points. It is easy to tell which type a string literal is: a unicode string literal starts with u:
# byte string
>>> 'abc'
'abc'
# unicode string:
>>> u'abc абв'
u'abc \u0430\u0431\u0432'
The 'abc' characters look the same in both, because they are in the ASCII range. \u0430 is a Unicode code point outside the ASCII range. Code points are an abstract representation of characters and cannot be written to a file directly; they must first be encoded to bytes. Here is what an encoded unicode string looks like (once encoded, it is a byte string):
>>> s = u'abc абв'
>>> s.encode('utf8')
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
This encoded string now can be written to file:
>>> s = u'abc абв'
>>> with open('text.txt', 'w+') as f:
...     f.write(s.encode('utf8'))
Now it is important to remember which encoding we used when writing the file, because to read the data back we need to decode the content. Here is what the data looks like without decoding:
>>> with open('text.txt', 'r') as f:
...     content = f.read()
>>> content
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
As you can see, we got encoded bytes, exactly the same as s.encode('utf8'). To decode them, you need to provide the encoding name:
>>> content.decode('utf8')
u'abc \u0430\u0431\u0432'
After decoding, we have our unicode string with Unicode code points back.
>>> print content.decode('utf8')
abc абв
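For comparison, here is a sketch of the same round trip in Python 3, where str holds code points, bytes holds encoded data, and open() can encode and decode for you. The file name and temporary directory are only for illustration.

```python
import os
import tempfile

s = 'abc абв'                  # str: a sequence of code points
encoded = s.encode('utf-8')    # bytes: the UTF-8 encoded form
print(encoded)                 # b'abc \xd0\xb0\xd0\xb1\xd0\xb2'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'text.txt')

    # Text mode with an explicit encoding: str in, UTF-8 bytes on disk.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(s)

    # Binary mode shows the bytes exactly as stored.
    with open(path, 'rb') as f:
        on_disk = f.read()

    # Text mode with the same encoding decodes back to str.
    with open(path, encoding='utf-8') as f:
        content = f.read()

print(on_disk == encoded)  # True
print(content == s)        # True
```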
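xgord is right, but for further edification it's worth noting exactly what \ufeff means. It is known as a BOM, or byte order mark, and it is a throwback to the early days of Unicode, when people couldn't agree on the order in which the bytes of each 16-bit unit should be stored. A file in an encoding such as UTF-16 may begin with U+FEFF; a reader that sees its bytes as FE FF or as FF FE can tell which byte order was used (the swapped value, U+FFFE, is deliberately not a valid character). UTF-8 has no byte-order ambiguity, but some tools still write the BOM (the bytes EF BB BF) at the start of UTF-8 files as a signature.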
If you hit an error on those characters in the first location you can be sure the issue is that you are not trying to decode it as utf-8, and the file is probably still fine.
