I am trying to convert a string that contains unicode-style formatting, like:
'#U0048#U0045#U004C#U004C#U004F'
What would be the most pythonic way to convert this to:
'HELLO'
thanks!
You have to replace #U with \u to build Unicode escape sequences:
\u0048\u0045\u004C\u004C\u004F
Then you can encode the result to bytes and decode it back using the 'unicode_escape' or 'raw_unicode_escape' codec:
print('#U0048#U0045#U004C#U004C#U004F'.replace('#U', '\\u').encode().decode('unicode_escape'))
Doc: codecs - Text Encoding
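As an alternative to the replace/encode/decode chain, the same conversion can be sketched with a regular expression that touches only the #UXXXX tokens and leaves any other text (including non-ASCII characters) alone. The helper name decode_hash_u is made up for this illustration, and it assumes every code is exactly four hex digits, as in the question:

```python
import re

def decode_hash_u(text):
    # Replace each "#U" followed by exactly four hex digits
    # with the character that has that code point.
    return re.sub(
        r'#U([0-9A-Fa-f]{4})',
        lambda m: chr(int(m.group(1), 16)),
        text,
    )

result = decode_hash_u('#U0048#U0045#U004C#U004C#U004F')
print(result)  # HELLO
```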
I have a JSON file which contains strings encoded like this:
"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1",
I am trying to parse this file using the json module. However I am not able to decode this string correctly.
What I get after decoding the JSON using .load() method is 'HornÃ\xadková'. The string should be correctly decoded as 'Horníková' instead.
I read the JSON specification and I understand that after \u there should be 4 hexadecimal digits specifying the Unicode number of the character. But it seems that in this JSON file, UTF-8 encoded bytes are stored as \u-sequences.
What type of encoding is this and how to correctly parse it in Python 3?
Is this type of JSON file even valid JSON according to the specification?
Your text is already encoded, and you would normally tell Python so by using a b prefix on the string. But since you're using json and its input needs to be a string, you have to decode the encoded text manually. Because your input is not bytes, you can use the 'raw_unicode_escape' encoding to convert the string to bytes without re-encoding it, and to prevent open() from using its own default encoding. Then you can simply use the aforementioned approach to get the desired result.
Note that because you have to read the file content and perform the encoding on the loaded string, you should use json.loads() instead of json.load().
In [168]: with open('test.json', encoding='raw_unicode_escape') as f:
     ...:     d = json.loads(f.read().encode('raw_unicode_escape').decode())
     ...:
In [169]: d
Out[169]: {'sender_name': 'Horníková'}
The JSON you are reading was written incorrectly and the Unicode strings decoded from it will have to be re-encoded with the wrong encoding used, then decoded with the correct encoding.
Here's an example:
#!python3
import json
# The bad JSON you have
bad_json = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}'
print('bad_json =',bad_json)
# The wanted result from json.loads()
wanted = {'sender_name':'Horníková'}
# What correctly written JSON should look like
good_json = json.dumps(wanted)
print('good_json =',good_json)
# What you get when loading the bad JSON.
got = json.loads(bad_json)
print('wanted =',wanted)
print('got =',got)
# How to correct the mojibake string
corrected_sender = got['sender_name'].encode('latin1').decode('utf8')
print('corrected_sender =',corrected_sender)
Output:
bad_json = {"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}
good_json = {"sender_name": "Horn\u00edkov\u00e1"}
wanted = {'sender_name': 'Horníková'}
got = {'sender_name': 'HornÃ\xadková'}
corrected_sender = Horníková
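If the file contains many such strings, the same latin-1/UTF-8 round trip can be applied to every string in the parsed structure. The following is only a sketch: fix_mojibake is a hypothetical helper name, the extra "tags" key is invented sample data, and the code assumes that any string which fails the round trip was already correct and should be left unchanged.

```python
import json

def fix_mojibake(value):
    # Strings: undo "UTF-8 bytes stored as individual code points".
    if isinstance(value, str):
        try:
            return value.encode('latin-1').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Assumption: the string was not mojibake, keep it as-is.
            return value
    # Containers: recurse into lists and dicts (keys included).
    if isinstance(value, list):
        return [fix_mojibake(v) for v in value]
    if isinstance(value, dict):
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    # Numbers, booleans and None pass through untouched.
    return value

bad_json = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1", "tags": ["ok"]}'
fixed = fix_mojibake(json.loads(bad_json))
print(fixed)
```

The try/except is a design choice: a string that is valid latin-1 but not valid UTF-8 after re-encoding is treated as already correct, which may or may not fit a given data set.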
I don't know enough about JSON to be able to say whether this is valid or not, but you can parse these strings using the raw_unicode_escape codec:
>>> "Horn\u00c3\u00adkov\u00c3\u00a1".encode('raw_unicode_escape').decode('utf8')
'Horníková'
Reencode to bytes, and then redecode to text.
>>> 'HornÃ\xadková'.encode('latin-1').decode('utf-8')
'Horníková'
Is this type of JSON file even valid JSON according to the specification?
No.
A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes [emphasis added].
source
A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). [...] Any code point may be represented as a hexadecimal escape sequence [...] represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point [emphasis added].
source
UTF-8 byte sequences are neither Unicode characters nor Unicode code points.
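The difference between a code point and its UTF-8 byte sequence can be seen directly in Python. A small sketch (the variable names are illustrative only): the character í is the single code point U+00ED, while its UTF-8 encoding is the two bytes C3 AD, which is what the file in the question stored as two separate \u escapes.

```python
import json

ch = 'í'

code_point = hex(ord(ch))      # the Unicode code point of the character
utf8_bytes = ch.encode('utf-8')  # the UTF-8 byte sequence for that code point
escaped = json.dumps(ch)       # how the json module escapes the character

print(code_point)   # 0xed
print(utf8_bytes)   # b'\xc3\xad'
print(escaped)      # "\u00ed"
```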
I have a binary object:
b'{"node": "\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435"}}'
and I want it to be printed in Unicode and not strictly using ASCII symbols.
There is a hacky way to do it:
decoded = string.decode()
parsed_to_dict = json.loads(decoded)
dumped = json.dumps(parsed_to_dict, ensure_ascii=False)
print(dumped)
>>> {"node": "Обновление"}
however the text will not always be parseable as JSON, so I need a simpler way.
Is there a way to print out my binary object (or a decoded Unicode string) as a non-ASCII string without going through parsing/dumping JSON?
For example, how to print this b'\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435' as Обновление?
A bytes string like
b'\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435'
has been encoded using Unicode escape sequences. To convert it back into a proper Unicode string you simply need to specify the 'unicode-escape' codec:
data = b'\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435'
out = data.decode('unicode-escape')
print(out)
output
Обновление
However, if data is already a Unicode string, then you first need to encode it to bytes. You can do that using the ascii codec, presuming data only contains ASCII characters. If it contains characters outside ASCII but within the range of \x80 to \xff you may be able to use the 'latin1' codec.
data = '\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435'
out = data.encode('ascii').decode('unicode-escape')
This should work so long as all the escapes are valid (no single \).
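When the string may also contain real non-ASCII characters or other backslashes, one possible alternative is to substitute only the \uXXXX sequences with a regular expression. This is a sketch, not a drop-in replacement for the codec: unescape_u is a hypothetical helper name, and it assumes plain four-digit escapes (surrogate pairs for characters above U+FFFF are not recombined here).

```python
import re

def unescape_u(text):
    # Replace each backslash-u followed by four hex digits with the
    # corresponding character; everything else is left untouched.
    return re.sub(
        r'\\u([0-9a-fA-F]{4})',
        lambda m: chr(int(m.group(1), 16)),
        text,
    )

data = '{"node": "\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435"}'
out = unescape_u(data)
print(out)
```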
import ast
bytes_object = b'{"node": "\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435"}}'
unicode_string = ast.literal_eval("'{}'".format(bytes_object.decode()))
output:
'{"node": "Обновление"}}'
I searched a bit about this, but most people want to convert the original string (테스트) to Unicode escapes (\uD14C\uC2A4\uD2B8).
What I want is the reverse: converting an escaped Unicode string (such as \uD14C\uC2A4\uD2B8) to the real string (테스트). I have a JSON file in which all the Korean strings are in \uXXXX form, and I have to parse it into the original strings. How can I do it in Python?
To sum up,
I am looking for a way to do this conversion in Python:
\uD14C\uC2A4\uD2B8 -> 테스트
import codecs
import json
file_variable = 'path/to/file.json'
# Open the file as UTF-8 text and let json decode the \uXXXX escapes
with codecs.open(file_variable, encoding='utf-8') as file:
    json_object = json.load(file)
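See if that allows you to handle your JSON as Unicode. Note that .encode() and .decode() are built-in str and bytes methods (no codecs import is needed for them); with the 'unicode_escape' codec they let you go back and forth between Unicode text and its escaped form:
text = '테스트'
escaped = text.encode('unicode_escape')      # str -> bytes holding \uXXXX escapes
restored = escaped.decode('unicode_escape')  # bytes -> the original str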
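For the JSON case in the question, the json module already turns \uXXXX escapes into real characters while parsing, so no extra decoding step should be needed. A minimal sketch, assuming the file content looks like the raw string below (the "title" key is invented for the example):

```python
import json

# A raw string stands in for the text read from the JSON file.
raw = r'{"title": "\uD14C\uC2A4\uD2B8"}'

data = json.loads(raw)
title = data['title']
print(title)  # 테스트

# When writing JSON back out, ensure_ascii=False keeps the characters
# readable instead of escaping them again.
readable = json.dumps(data, ensure_ascii=False)
print(readable)
```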
I solved it this way:
read byte by byte and append each byte (variable c) to a variable
aJSON+=encode(c)
Then aJSON.decode('unicode-escape') gives the expected result.
Thanks for your interest anyway.
I first tried typing in a Unicode character, encode it in UTF-8, and decode it back. Python happily gives back the original character.
I took a look at the encoded string, it is b'\xe6\x88\x91'. I don't understand what this is, it looks like 3 hex numbers.
Then I did some research and I found that the CJK set starts from 4E00, so now I want Python to show me what this character looks like. How do I do that? Do I need to convert 4E00 to the form of something like the one above?
The text b'\xe6\x88\x91' is the representation of the bytes that make up the UTF-8 encoding of the Unicode code point \u6211, which is the character 我. So there is no need to convert anything, other than decoding to a Unicode string with .decode('utf-8').
You'll need to decode it using the UTF-8 encoding:
>>> print(b'\xe6\x88\x91'.decode('UTF-8'))
我
By decoding it you're turning the bytes (which is what b'...' is) into a Unicode string and that's how you can display / use the text.
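To see what the character at a given code point such as 4E00 looks like, the built-in chr() and ord() functions convert between code point numbers and characters. A short sketch (variable names are illustrative only):

```python
# Code point number -> character
first_cjk = chr(0x4E00)
print(first_cjk)                   # 一

# Character -> its UTF-8 bytes (three bytes for this range)
first_cjk_bytes = first_cjk.encode('utf-8')
print(first_cjk_bytes)             # b'\xe4\xb8\x80'

# Character -> code point number, for the character from the question
wo_code_point = hex(ord('我'))
print(wo_code_point)               # 0x6211
```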
I'm looking for a function to convert "\x61\x62\x63\x64\x65\x66" to "abcdef" in python, without having to print it
and is the proper term for this type of encoding ascii hex?
>>> "\x61\x62\x63\x64\x65\x66".decode('ascii')
u'abcdef'
In Python 2, you can convert "\x61\x62\x63\x64\x65\x66" to a unicode string with the built-in unicode() function:
unicode("\x61\x62\x63\x64\x65\x66")
Output: u'abcdef'
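The two snippets above are Python 2. In Python 3 there is no unicode() function and str has no .decode() method; a string literal written with \x escapes is already the text itself. A sketch of the Python 3 equivalents, covering the cases where the backslashes are literally present in the data or where only the hex digits are available (variable names are illustrative):

```python
# A literal with \x escapes is interpreted by the parser: already 'abcdef'.
literal = "\x61\x62\x63\x64\x65\x66"
print(literal == "abcdef")  # True

# If the backslashes are real characters in the data (raw string here),
# the unicode_escape codec interprets them.
escaped = r"\x61\x62\x63\x64\x65\x66"
decoded = escaped.encode('ascii').decode('unicode_escape')
print(decoded)  # abcdef

# If only the hex digits are available, bytes.fromhex() converts them.
from_hex = bytes.fromhex("616263646566").decode('ascii')
print(from_hex)  # abcdef
```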