Interpret "plain text" as utf-8 text in python [duplicate] - python

This question already has answers here:
Process escape sequences in a string in Python
(8 answers)
Closed 7 months ago.
I have a text file with text that should have been interpreted as utf-8 but wasn't (it was given to me this way).
Here is an example of a typical line of the file:
\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f
which should have been:
ロンドン在住
Now, I can do it manually in Python by typing the following at the interactive prompt:
>>> h1 = u'\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f'
>>> print h1
ロンドン在住
which gives me what I want. Is there a way to do this automatically? I've tried things like this:
>>> f = codecs.open('testfile.txt', encoding='utf-8')
>>> h = f.next()
>>> print h
\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f
I've also tried the 'encode' and 'decode' functions. Any ideas?
Thanks!

\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f is not UTF-8; it uses the Python Unicode escape format. Use the unicode_escape codec instead:
>>> print '\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f'.decode('unicode_escape')
ロンドン在住
Here is the UTF-8 encoding of the above phrase, for comparison:
>>> '\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f'.decode('unicode_escape').encode('utf-8')
'\xe3\x83\xad\xe3\x83\xb3\xe3\x83\x89\xe3\x83\xb3\xe5\x9c\xa8\xe4\xbd\x8f'
Note that unicode_escape treats any input byte that is not part of a recognised Python escape sequence as Latin-1.
Be careful however; it may be you are really looking at JSON-encoded data, which uses the same notation for specifying character escapes. Use json.loads() to decode actual JSON data; JSON strings with such escapes are delimited with " quotes and are usually part of larger structures (such as JSON lists or objects).
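For readers on Python 3, where str has no decode method, the same idea might be sketched as follows. This is a minimal sketch, assuming the file holds only ASCII text with \uXXXX escapes; the helper name decode_escaped_lines and the in-memory stand-in for the file are illustrative and not part of the original answer.

```python
import io


def decode_escaped_lines(binary_stream):
    """Yield each line of a binary stream with its escape sequences decoded."""
    for raw_line in binary_stream:
        # Each raw_line is a bytes object; unicode_escape turns the literal
        # backslash-u sequences into the characters they name.
        yield raw_line.decode("unicode_escape").rstrip("\r\n")


# Hypothetical stand-in for open("testfile.txt", "rb").
sample = io.BytesIO(b"\\u30ed\\u30f3\\u30c9\\u30f3\\u5728\\u4f4f\n")
decoded = list(decode_escaped_lines(sample))
print(decoded[0])
```

Opening the file in binary mode avoids a separate encode step. If the lines turn out to be JSON string literals, json.loads remains the safer choice, as noted above.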

Related

How to tell python that a string is actually bytes-object? Not converting

I have a txt file which contains a line:
' 6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
The content inside the double quotes is actually octal-escaped bytes, but with doubled backslashes.
After the line has been read in, I used regex to extract the contents in the double quotes.
c = re.search(r': "(.+)"', line).group(1)
After that, I have two problems:
First, I need to replace the two escape characters with one.
Second, I need to tell Python that the str object c is actually a bytes object.
I have not managed to do either.
I have tried:
re.sub('\\', '\', line)
re.sub(r'\\', '\', line)
re.sub(r'\\', r'\', line)
All failed.
A bytes object can easily be defined with the 'b' prefix:
c = b'\351\231\220\346\227\266\345\205\215\350\264\271'
How do I change the type of a variable from str to bytes? I think this is not an encode-and-decode thing.
I googled a lot but found no answers. Maybe I used the wrong keywords.
Does anyone know how to do this, or another way to get what I want?
This is always a little confusing. I assume your bytes object should represent a string like:
b = b'\351\231\220\346\227\266\345\205\215\350\264\271'
b.decode()
# '限时免费'
To get that with your escaped string, you could use the codecs library and try:
import re
import codecs
line = ' 6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
c = re.search(r': "(.+)"', line).group(1)
codecs.escape_decode(bytes(c, "utf-8"))[0].decode("utf-8")
# '限时免费'
giving the same result.
The string contains literal text for escape codes. You cannot just replace the literal backslashes with a single backslash as escape codes are used in source code to indicate a single character. Decoding is needed to change literal escape codes to the actual character, but only byte strings can be decoded.
Encoding a Unicode string to a byte string with the Latin-1 codec translates Unicode code points 1:1 to the corresponding byte, so it is the common way to directly convert a "byte-string-like" Unicode string to an actual byte string.
Step-by-Step:
>>> s = "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"
>>> print(s) # Actual text of the string
\351\231\220\346\227\266\345\205\215\350\264\271
>>> s.encode('latin1') # Convert to byte string
b'\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271'
>>> # decode the escape codes...result is latin-1 characters in Unicode
>>> s.encode('latin1').decode('unicode-escape')
'é\x99\x90æ\x97¶å\x85\x8dè´¹'
>>> # convert back to byte string
>>> s.encode('latin1').decode('unicode-escape').encode('latin1')
b'\xe9\x99\x90\xe6\x97\xb6\xe5\x85\x8d\xe8\xb4\xb9'
>>> # data is UTF-8-encoded text so decode it correctly now
>>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
'限时免费'
Your text example looks like part of a Python dictionary. You may be able to save some steps by using the ast module's literal_eval function to turn the dictionary text directly into a Python object, and then fix only the affected value:
>>> # Python dictionary-like text
>>> d = '{6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"}'
>>> import ast
>>> ast.literal_eval(d) # returns Python dictionary with value already decoded
{6: 'é\x99\x90æ\x97¶å\x85\x8dè´¹'}
>>> ast.literal_eval(d)[6] # but decoded incorrectly as Latin-1 text.
'é\x99\x90æ\x97¶å\x85\x8dè´¹'
>>> ast.literal_eval(d)[6].encode('latin1').decode('utf8') # undo Latin1, decode as UTF-8
'限时免费'
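The step-by-step chain above can be wrapped in a small helper. This is a sketch for Python 3, assuming the escaped text contains only characters up to U+00FF (otherwise the first Latin-1 encode raises UnicodeEncodeError) and that the recovered bytes really are UTF-8; the function name is illustrative.

```python
import re


def octal_escaped_utf8_to_text(escaped):
    """Convert literal octal escapes for UTF-8 bytes into the text they encode."""
    # str -> bytes (1:1), resolve the escapes, then back to bytes (1:1 again).
    raw = escaped.encode("latin1").decode("unicode_escape").encode("latin1")
    # The recovered bytes are UTF-8, so decode them as such.
    return raw.decode("utf-8")


line = ' 6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
escaped = re.search(r': "(.+)"', line).group(1)
text = octal_escaped_utf8_to_text(escaped)
print(text)
```

Keeping the three conversions in one function makes it harder to forget the final UTF-8 decode, which is the step that turns the Latin-1 lookalike characters into the intended text.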

Mojibake when reading JSON containing escaped unicode - wrongly decoded as Latin-1? [duplicate]

This question already has answers here:
Facebook JSON badly encoded
(9 answers)
Closed 3 years ago.
I have a JSON file that contains \u escaped Unicode characters. However, when I read it in Python, the escaped characters are seemingly decoded incorrectly as Latin-1 rather than UTF-8. Calling .encode('latin-1').decode('utf-8') on the affected strings seems to fix this, but why is it happening, and is there a way to tell json.load that escape sequences should be read as Unicode rather than Latin-1?
JSON file message.json, which should contain a message composed of a "Grinning Face With Sweat" emoji:
{
"message": "\u00f0\u009f\u0098\u0085"
}
Python:
>>> with open('message.json') as infile:
...     msg_json = json.load(infile)
...
>>> msg_json
{'message': 'ð\x9f\x98\x85'}
>>> msg_json['message']
'ð\x9f\x98\x85'
>>> msg_json['message'].encode('latin-1').decode('utf-8')
'😅'
Setting the encoding parameter in open or json.load doesn't seem to change anything, as the JSON file is plain ASCII, and the unicode is escaped within it.
What you have there is not the correct notation for the 😅 emoji; it really means "ð" followed by three C1 control characters (U+009F, U+0098, U+0085), so the translation you get is correct! (The \u... notation is independent of encoding.)
The proper notation for 😅, Unicode code point U+1F605, in JSON (as in JavaScript) is the surrogate pair \ud83d\ude05. Use that in the JSON.
{
"message": "\ud83d\ude05"
}
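Where the pair \ud83d\ude05 comes from can be checked with a little arithmetic. The sketch below assumes Python 3; the helper name json_u_escape is made up for illustration, and json.dumps (with its default ensure_ascii=True) is used only as a cross-check.

```python
import json


def json_u_escape(char):
    """Return the JSON backslash-u escape for a single character."""
    code_point = ord(char)
    if code_point < 0x10000:
        # Characters in the Basic Multilingual Plane need one escape.
        return "\\u%04x" % code_point
    # Anything above U+FFFF is split into a UTF-16 surrogate pair.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return "\\u%04x\\u%04x" % (high, low)


escape = json_u_escape("\U0001F605")
print(escape)
print(json.dumps("\U0001F605"))
```

Each \u escape in JSON names a UTF-16 code unit, not a UTF-8 byte, which is why writing the four UTF-8 bytes as four escapes produces the wrong characters.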
If, on the other hand, your question is how you can get the correct results from the wrong data, then yes, as the comments say, you may have to jump through some hoops to do that.
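One possible way to jump through those hoops is to apply the question's encode('latin-1').decode('utf-8') workaround to every string after loading. This is a hedged sketch for Python 3, not a general fix: the helper name repair_mojibake is hypothetical, and a string that is legitimately Latin-1 but happens to form valid UTF-8 bytes would be altered too.

```python
import json


def repair_mojibake(value):
    """Recursively re-decode strings whose UTF-8 bytes were read as Latin-1."""
    if isinstance(value, str):
        try:
            return value.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not mojibake of this kind; leave the string untouched.
            return value
    if isinstance(value, list):
        return [repair_mojibake(item) for item in value]
    if isinstance(value, dict):
        return {repair_mojibake(k): repair_mojibake(v) for k, v in value.items()}
    return value


doc = json.loads('{"message": "\\u00f0\\u009f\\u0098\\u0085"}')
fixed = repair_mojibake(doc)
print(fixed["message"])
```

Strings that cannot be encoded as Latin-1, or whose bytes are not valid UTF-8, are returned unchanged, so already-correct text outside that overlap survives.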

Python - Unicode & double backslashes [duplicate]

I scraped a webpage with BeautifulSoup.
I got great output except parts of the list look like this after getting the text:
list = [u'that\\u2019s', u'it\\u2019ll', u'It\\u2019s', u'don\\u2019t', u'That\\u2019s', u'we\\u2019re', u'\\u2013']
My question now is how to get rid of these double backslashes, or replace them with the special characters they stand for.
If I print the first element of the example list, the output looks like
print list[0]
that\u2019s
I have already read a lot of other questions and threads about this topic, but I ended up even more confused, as I am a beginner regarding unicode / encoding / decoding.
I hope that someone could help me with this issue.
Thanks!
MG
Since you are using Python 2 there, it is simply a matter of re-applying the "decode" method, using the special codec "unicode_escape". It "sees" the "physical" backslashes and decodes those sequences into proper unicode characters:
data = [u'that\\u2019s', u'it\\u2019ll', u'It\\u2019s', u'don\\u2019t', u'That\\u2019s', u'we\\u2019re', u'\\u2013']
result = [part.decode('unicode_escape') for part in data]
To anyone getting here using Python 3: in that version you cannot apply the "decode" method to the str objects delivered by BeautifulSoup. You first have to re-encode them to bytes objects and then decode with the unicode_escape codec. For this purpose it is useful to use the latin1 codec as a transparent encoding: every code point up to U+00FF in the str object maps to the byte with the same value in the new bytes object:
result = [part.encode('latin1').decode('unicode_escape') for part in data]
The problem here is that the site ended up double-encoding those unicode escapes. Just do the following:
ls = [u'that\\u2019s', u'it\\u2019ll', u'It\\u2019s', u'don\\u2019t', u'That\\u2019s', u'we\\u2019re', u'\\u2013']
ls = map(lambda x: x.decode('unicode-escape'), ls)
Now you have a list of properly decoded unicode strings:
for a in ls:
    print a
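On Python 3, the latin1 round trip fails with UnicodeEncodeError as soon as a string already contains a character above U+00FF. A different technique, swapped in here as an alternative and not taken from the answers above, is to replace only the literal \uXXXX sequences with a regular expression. The names below are illustrative, and this sketch does not combine surrogate pairs into a single character.

```python
import re

# Matches a literal backslash, the letter u, and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_u(text):
    """Replace literal backslash-u escapes and leave all other text alone."""
    return _U_ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), text)


data = ['that\\u2019s', 'it\\u2019ll', 'It\\u2019s', 'don\\u2019t',
        'That\\u2019s', 'we\\u2019re', '\\u2013']
result = [unescape_u(part) for part in data]
for item in result:
    print(item)
```

Because only the matched escapes are touched, text that is already correct (including non-Latin-1 characters) passes through unchanged.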

Converting Unicode sequences to a string in Python 3 [duplicate]

In parsing an HTML response to extract data with Python 3.4 on Kubuntu 15.10 in the Bash CLI, using print() I am getting output that looks like this:
\u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df
How would I output the actual text itself in my application?
This is the code generating the string:
response = requests.get(url)
messages = json.loads( extract_json(response.text) )
for k,v in messages.items():
    for message in v['foo']['bar']:
        print("\nFoobar: %s" % (message['body'],))
Here is the function which returns the JSON from the HTML page:
def extract_json(input_):
    """
    Get the JSON out of a webpage.
    The line of interest looks like this:
    foobar = ["{\"name\":\"dotan\",\"age\":38}"]
    """
    for line in input_.split('\n'):
        if 'foobar' in line:
            return line[line.find('"')+1:-2].replace(r'\"',r'"')
    return None
In googling the issue, I've found quite a bit of information relating to Python 2, however Python 3 has completely changed how strings and especially Unicode are handled in Python.
How can I convert the example string (\u05ea) to characters (ת) in Python 3?
Addendum:
Here is some information regarding message['body']:
print(type(message['body']))
# Prints: <class 'str'>
print(message['body'])
# Prints: \u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df
print(repr(message['body']))
# Prints: '\\u05ea\\u05d4 \\u05e0\\u05e9\\u05de\\u05e2 \\u05de\\u05e6\\u05d5\\u05d9\\u05df'
print(message['body'].encode().decode())
# Prints: \u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df
print(message['body'].encode().decode('unicode-escape'))
# Prints: תה נשמע מצוין
Note that the last line does work as expected, but it has a few issues:
Decoding string literals with unicode-escape is the wrong thing as Python escapes are different to JSON escapes for many characters. (Thank you bobince)
encode() relies on the default encoding, which is a bad thing. (Thank you bobince)
The encode() fails on some newer Unicode characters, such as \ud83d\ude03, with UnicodeEncodeError "surrogates not allowed".
It appears your input uses backslash as an escape character. You should unescape the text before passing it to json:
>>> foobar = '{\\"body\\": \\"\\\\u05e9\\"}'
>>> import re
>>> json_text = re.sub(r'\\(.)', r'\1', foobar) # unescape
>>> import json
>>> print(json.loads(json_text)['body'])
ש
Don't use 'unicode-escape' encoding on JSON text; it may produce different results:
>>> import json
>>> json_text = '["\\ud83d\\ude02"]'
>>> json.loads(json_text)
['😂']
>>> json_text.encode('ascii', 'strict').decode('unicode-escape') #XXX don't do it
'["\ud83d\ude02"]'
'😂' == '\U0001F602' is U+1F602 (FACE WITH TEARS OF JOY).
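If the page line really is a JavaScript array whose single element is a JSON document stored as a string, a second json.loads can do the unescaping in place of the regular expression. This is a sketch under that assumption; the sample line is hypothetical and modelled on the docstring in the question.

```python
import json

# Hypothetical page line: an array holding one JSON document as a string.
line = r'foobar = ["{\"body\": \"\\u05e9\"}"]'

# Keep only the array literal, from the first "[" to the last "]".
array_text = line[line.find("["):line.rfind("]") + 1]

# First pass: parse the array, which removes one level of escaping.
inner_json = json.loads(array_text)[0]

# Second pass: parse the embedded document itself.
message = json.loads(inner_json)
print(message["body"])
```

Letting the JSON parser undo each level of escaping avoids hand-written unescaping rules and keeps surrogate pairs intact.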

What does 'u' before a string in python mean? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
What does the ‘u’ symbol mean in front of string values?
I have this json text file:
{
"EASW_KEY":" aaaaaaaaaaaaaaaaaaaaaaa",
"EASF_SESS":" aaaaaaaaaaaaaaaaaaaaaaa",
"PHISHKEY":" aaaaaaaaaaaaaaaaaaaaaaa",
"XSID":" aaaaaaaaaaaaaaaaaaaaaaa"
}
If I decode it like this:
import json
data = open('data.txt').read()
data = json.loads(data)
print data['EASW_KEY']
I get this:
u' aaaaaaaaaaaaaaaaaaaaaaa'
instead of simply:
aaaaaaaaaaaaaaaaaaaaaaa
What does this u mean?
What does 'u' before a string in python mean?
The letter prefixing a Python string literal is called a string prefix. You can read all about them here: https://docs.python.org/3.3/reference/lexical_analysis.html#string-and-bytes-literals
The u stands for unicode. Alternative letters in that slot can be r'foobar' for raw string and b'foobar' for byte string.
Some help on what Python considers unicode:
http://docs.python.org/howto/unicode.html
Then to explore the nature of this, run this command:
type(u'abc')
returns:
<type 'unicode'>
If you use Unicode when you should be using ascii, or ascii when you should be using unicode you are very likely going to encounter bubbleup implementationitis when users start "exploring the unicode space" at runtime.
For example, if you pass a unicode string to facebook's api.fql(...) function, which says it accepts unicode, it will happily process your request, and then later on return undefined results as if the function succeeded as it pukes all over the carpet.
As defined in this post:
FQL multiquery from python fails with unicode query
Some unicode characters are going to cause your code in production to have a seizure then puke all over the floor. So make sure you're okay and customers can tolerate that.
The u prefix in the repr of a unicode object indicates that it's a unicode object.
You should not be seeing this if you're really just printing the unicode object itself, and not, say, a list containing it. (and, in fact, running your code doesn't actually print it like this.)
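The difference between printing a string and printing its repr can be seen directly. The sketch below assumes Python 3, where the u prefix no longer appears in the repr (in Python 2 the repr lines would show u'...'); the shortened key value is a placeholder.

```python
import json

data = json.loads('{"EASW_KEY": " aaa"}')
value = data["EASW_KEY"]

print(value)        # str(): the text itself, without quotes
print(repr(value))  # repr(): the text as a quoted literal
print([value])      # containers show the repr of their items

# In Python 3 the u prefix is still accepted but changes nothing.
same = u"abc" == "abc" and type(u"abc") is str
print(same)
```

The prefix and the quotes belong to the literal notation, not to the data, so they only show up where Python displays a repr.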
