How to convert incorrectly encoded string to bytes? - python

I've got utf-8 string in the form of 'РїРѕРј'... - in Python 3 string. How can I decode it (to get correct string)?
As I see from error messages I can only convert string from bytes array, but how to get it then? I tried
bytes(str, 'ascii', errors='ignore')
so it should not change existing byte values, but it removed all "incorrect" characters (I suppose because they have codes >= 128).
The example string contains Russian 'пом'...

It looks like you have a string that has been encoded as UTF-8, then decoded as cp1251.
>>> s = 'пом'
>>> s.encode('utf-8').decode('cp1251')
'РїРѕРј'
You can get the original string by reversing the operation.
>>> e = 'РїРѕРј'
>>> e.encode('cp1251').decode('utf-8')
'пом'
If you want to encode the mojibake string as bytes, without losing information, use the backslashreplace error handler.
>>> e.encode('ascii', errors='backslashreplace')
b'\\u0420\\u0457\\u0420\\u0455\\u0420\\u0458'

Related

How to convert a string containing a byte string to a byte string

How do I convert a string which contains the literal representation of a byte string, to a byte string?
This might seem strange, but for a library I'm using for a certain type of exception I need one of the attributes of the exception, this gives me the value I need, but it is a byte string in a string.
It is "value=b'\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'", I can get the value by splitting on the equals and then using eval, such as
>>> eval("value=b'\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'".split("=")[1])
b'\xbbOFa\x14\xdb{\xf5\x1b~H\xba\x96\xdaec'
This works, but as we all know eval can be very, very bad. So, is there an alternative to using eval?
There is a unicode-escape codec that will convert bytes containing literal sequences like \x.. or \u.... into their equivalent characters in the string. The remainder of the string is converted using the latin1 encoding, which just translates all the bytes.
So you convert the string to raw bytes using latin1, then convert back to a string using unicode-escape, and finally back to bytes using latin1 again:
>>> s = '\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'
>>> s.encode('latin1').decode('unicode-escape').encode('latin1')
b'\xbbOFa\x14\xdb{\xf5\x1b~H\xba\x96\xdaec'
Getting rid of the clutter around the string is pretty easy using regex or the more manual parsing you showed. For example:
>>> x = "value=b'\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'"
>>> s = re.fullmatch('[^\'"]+b([\'"])(.*)\\1[^\'"]*', x).group(2)
>>> s
'\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'
OR
>>> s = x.split('=')[1].lstrip('b').strip("'")
>>> s
'\\xbbOFa\\x14\\xdb{\\xf5\\x1b~H\\xba\\x96\\xdaec'

Decode UTF-8 encoding in JSON string

I have JSON file which contains followingly encoded strings:
"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1",
I am trying to parse this file using the json module. However I am not able to decode this string correctly.
What I get after decoding the JSON using .load() method is 'HornÃ\xadková'. The string should be correctly decoded as 'Horníková' instead.
I read the JSON specification and I understasnd that after \u there should be 4 hexadecimal numbers specifing Unicode number of character. But it seems that in this JSON file UTF-8 encoded bytes are stored as \u-sequences.
What type of encoding is this and how to correctly parse it in Python 3?
Is this type JSON file even valid JSON file according to the specification?
Your text is already encoded and you need to tell this to Python by using a b prefix in your string but since you're using json and the input needs to be string you have to decode your encoded text manually. Since your input is not byte you can use 'raw_unicode_escape' encoding to convert the string to byte without encoding and prevent the open method to use its own default encoding. Then you can simply use aforementioned approach to get the desired result.
Note that since you need to do the encoding and decoding your have to read file content and perform the encoding on loaded string, then you should use json.loads() instead of json.load().
In [168]: with open('test.json', encoding='raw_unicode_escape') as f:
...: d = json.loads(f.read().encode('raw_unicode_escape').decode())
...:
In [169]: d
Out[169]: {'sender_name': 'Horníková'}
The JSON you are reading was written incorrectly and the Unicode strings decoded from it will have to be re-encoded with the wrong encoding used, then decoded with the correct encoding.
Here's an example:
#!python3
import json
# The bad JSON you have
bad_json = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}'
print('bad_json =',bad_json)
# The wanted result from json.loads()
wanted = {'sender_name':'Horníková'}
# What correctly written JSON should look like
good_json = json.dumps(wanted)
print('good_json =',good_json)
# What you get when loading the bad JSON.
got = json.loads(bad_json)
print('wanted =',wanted)
print('got =',got)
# How to correct the mojibake string
corrected_sender = got['sender_name'].encode('latin1').decode('utf8')
print('corrected_sender =',corrected_sender)
Output:
bad_json = {"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}
good_json = {"sender_name": "Horn\u00edkov\u00e1"}
wanted = {'sender_name': 'Horníková'}
got = {'sender_name': 'HornÃ\xadková'}
corrected_sender = Horníková
I don't know enough about JSON to be able to say whether this is valid or not, but you can parse these strings using the raw_unicode_escape codec:
>>> "Horn\u00c3\u00adkov\u00c3\u00a1".encode('raw_unicode_escape').decode('utf8')
'Horníková'
Reencode to bytes, and then redecode to text.
>>> 'HornÃ\xadková'.encode('latin-1').decode('utf-8')
'Horníková'
Is this type JSON file even valid JSON file according to the specification?
No.
A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes [emphasis added].
source
A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). [...] Any code point may be represented as a hexadecimal escape sequence [...] represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point [emphasis added].
source
UTF-8 byte sequences are neither Unicode characters nor Unicode code points.

decode python binary string but not ensure ascii symbols

I have a binary object:
b'{"node": "\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435"}}'
and I want it to be printed in Unicode and not strictly using ASCII symbols.
There is a hacky way to do it:
decoded = string.decode()
parsed_to_dict = json.loads(decoded)
dumped = json.dumps(parsed_to_dict, ensure_ascii=False)
print(dumped)
>>> {"node": "Обновление"}
however the text will not always be parseable as JSON, so I need a simpler way.
Is there a way to print out my binary object (or a decoded Unicode string) as a non-ascii string without going trough parsing/dumping JSON?
For example, how to print this b'\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435' as Обновление?
A bytes string like
b'\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435'
has been encoded using Unicode escape sequences. To convert it back into a proper Unicode string you simply need to specify the 'unicode-escape' codec:
data = b'\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435'
out = data.decode('unicode-escape')
print(out)
output
Обновление
However, if data is already a Unicode string, then you first need to encode it to bytes. You can do that using the ascii codec, presuming data only contains ASCII characters. If it contains characters outside ASCII but within the range of \x80 to \xff you may be able to use the 'latin1' codec.
data = '\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435'
out = data.encode('ascii').decode('unicode-escape')
This should work so long as all the escapes are valid (no single \).
import ast
bytes_object = b'{"node": "\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435"}}'
unicode_string = ast.literal_eval("'{}'".format(bytes_object.decode()))
output:
'{"node": "Обновление"}}'

Evaluate UTF-8 literal escape sequences in a string in Python3

I have a string of the form:
s = '\\xe2\\x99\\xac'
I would like to convert this to the character ♬ by evaluating the escape sequence. However, everything I've tried either results in an error or prints out garbage. How can I force Python to convert the escape sequence into a literal unicode character?
What I've read elsewhere suggests that the following line of code should do what I want, but it results in a UnicodeEncodeError.
print(bytes(s, 'utf-8').decode('unicode-escape'))
I also tried the following, which has the same result:
import codecs
print(codecs.getdecoder('unicode_escape')(s)[0])
Both of these approaches produce the string 'â\x99¬', which print is subsequently unable to handle.
In case it makes any difference the string is being read in from a UTF-8 encoded file and will ultimately be output to a different UTF-8 encoded file after processing.
...decode('unicode-escape') will give you string '\xe2\x99\xac'.
>>> s = '\\xe2\\x99\\xac'
>>> s.encode().decode('unicode-escape')
'â\x99¬'
>>> _ == '\xe2\x99\xac'
True
You need to decode it. But to decode it, encode it first with latin1 (or iso-8859-1) to preserve the bytes.
>>> s = '\\xe2\\x99\\xac'
>>> s.encode().decode('unicode-escape').encode('latin1').decode('utf-8')
'♬'

How do I convert a string to a buffer in Python 3.1?

I am attempting to pipe something to a subprocess using the following line:
p.communicate("insert into egg values ('egg');");
TypeError: must be bytes or buffer, not str
How can I convert the string to a buffer?
The correct answer is:
p.communicate(b"insert into egg values ('egg');");
Note the leading b, telling you that it's a string of bytes, not a string of unicode characters. Also, if you are reading this from a file:
value = open('thefile', 'rt').read()
p.communicate(value);
The change that to:
value = open('thefile', 'rb').read()
p.communicate(value);
Again, note the 'b'.
Now if your value is a string you get from an API that only returns strings no matter what, then you need to encode it.
p.communicate(value.encode('latin-1');
Latin-1, because unlike ASCII it supports all 256 bytes. But that said, having binary data in unicode is asking for trouble. It's better if you can make it binary from the start.
You can convert it to bytes with encode method:
>>> "insert into egg values ('egg');".encode('ascii') # ascii is just an example
b"insert into egg values ('egg');"

Categories

Resources