Converting Unicode sequences to a string in Python 3 [duplicate] - python

This question already has answers here:
Process escape sequences in a string in Python
(8 answers)
Closed 7 months ago.
While parsing an HTML response to extract data with Python 3.4 (on Kubuntu 15.10, in the Bash CLI), print() gives me output that looks like this:
\u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df
How would I output the actual text itself in my application?
This is the code generating the string:
response = requests.get(url)
messages = json.loads(extract_json(response.text))
for k, v in messages.items():
    for message in v['foo']['bar']:
        print("\nFoobar: %s" % (message['body'],))
Here is the function which returns the JSON from the HTML page:
def extract_json(input_):
    """
    Get the JSON out of a webpage.

    The line of interest looks like this:
    foobar = ["{\"name\":\"dotan\",\"age\":38}"]
    """
    for line in input_.split('\n'):
        if 'foobar' in line:
            return line[line.find('"')+1:-2].replace(r'\"', r'"')
    return None
In googling the issue I've found quite a bit of information relating to Python 2; however, Python 3 handles strings, and Unicode in particular, very differently.
How can I convert the example string (\u05ea) to characters (ת) in Python 3?
Addendum:
Here is some information regarding message['body']:
print(type(message['body']))
# Prints: <class 'str'>
print(message['body'])
# Prints: \u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df
print(repr(message['body']))
# Prints: '\\u05ea\\u05d4 \\u05e0\\u05e9\\u05de\\u05e2 \\u05de\\u05e6\\u05d5\\u05d9\\u05df'
print(message['body'].encode().decode())
# Prints: \u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df
print(message['body'].encode().decode('unicode-escape'))
# Prints: תה נשמע מצוין
Note that the last line does work as expected, but the approach has a few issues:
Decoding string literals with unicode-escape is the wrong thing, as Python escapes differ from JSON escapes for many characters. (Thank you bobince)
encode() relies on the default encoding, which is a bad thing. (Thank you bobince)
encode() fails on some newer Unicode characters, such as \ud83d\ude03, with UnicodeEncodeError: "surrogates not allowed".
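To see that last failure concretely, here is a small sketch contrasting unicode-escape with json.loads, which combines surrogate pairs correctly (the emoji value is the one from the point above):

```python
import json

raw = r'\ud83d\ude03'  # a surrogate pair, as it appears in JSON text

# unicode-escape leaves two lone surrogates behind:
broken = raw.encode('ascii').decode('unicode-escape')
try:
    broken.encode('utf-8')
except UnicodeEncodeError as e:
    print(e.reason)  # surrogates not allowed

# json.loads pairs the surrogates into one code point:
print(json.loads('"' + raw + '"'))  # 😃
```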

It appears your input uses backslash as an escape character; you should unescape the text before passing it to json:
>>> foobar = '{\\"body\\": \\"\\\\u05e9\\"}'
>>> import re
>>> json_text = re.sub(r'\\(.)', r'\1', foobar) # unescape
>>> import json
>>> print(json.loads(json_text)['body'])
ש
Don't use 'unicode-escape' encoding on JSON text; it may produce different results:
>>> import json
>>> json_text = '["\\ud83d\\ude02"]'
>>> json.loads(json_text)
['😂']
>>> json_text.encode('ascii', 'strict').decode('unicode-escape') #XXX don't do it
'["\ud83d\ude02"]'
'😂' == '\U0001F602' is U+1F602 (FACE WITH TEARS OF JOY).


How can I remove escaping '\' in my string to decode encoded letters? [duplicate]

I'm working on a project with a dataset coming from Board Game Geek.
The issue I have concerns the name of the games I'm studying.
I think the encoding went wrong somewhere, so I have escaped letters in the csv file I received.
For example : Orl\u00e9ans instead of Orléans
When I import the csv in Python, they remain like that and I want to correct these letters.
I managed to find what I guess is the correct function:
>>> unicodedata.normalize("NFD", 'Orl\u00e9ans')
'Orléans'
The problem is that I can't run this function in a for loop.
Indeed, the string displayed is 'Orl\u00e9ans', but in fact it's 'Orl\\u00e9ans', so the function cannot do the job.
Is there any way to correct this? I have 20000 entries in the dataset; I can't correct them all one by one.
Thank you
EDIT
I got the answer in this article : Process escape sequences in a string in Python
>>> myString = "spam\\neggs"
>>> decoded_string = bytes(myString, "utf-8").decode("unicode_escape") # python3
>>> decoded_string = myString.decode('string_escape') # python2
>>> print(decoded_string)
spam
eggs
Thanks a lot
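For the whole dataset, the same decode can be applied in a loop. A minimal sketch, assuming the names sit in a plain Python list (the latin-1 round-trip is used instead of utf-8 so that any real non-ASCII text already present is not mangled):

```python
# Hypothetical column of escaped names; the real dataset layout may differ.
raw_names = ['Orl\\u00e9ans', 'Catan']

# latin-1 maps bytes 1:1, so only the \uXXXX escapes are rewritten.
fixed = [s.encode('latin-1').decode('unicode_escape') for s in raw_names]
print(fixed)  # ['Orléans', 'Catan']
```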
I would try to use latin1 encoding as follows:
import codecs

with codecs.open(r'$(path to your csv file)', encoding='latin1') as f:
    data = f.read()

How to change the coding for python array?

I use the following code to scrape a table from a Chinese website. It works fine. But it seems that the contents I stored in the list are not shown properly.
import requests
from bs4 import BeautifulSoup
import pandas as pd

x = requests.get('http://www.sohu.com/a/79780904_126549')
bs = BeautifulSoup(x.text, 'lxml')
clg_list = []
for tr in bs.find_all('tr'):
    tds = tr.find_all('td')
    for i in range(len(tds)):
        clg_list.append(tds[i].text)
        print(tds[i].text)
When I print the text, it shows Chinese characters. But when I print out the list, it shows u'\u4e00\u671f\uff0834\u6240\uff09'. I am not sure if I should change the encoding or if something else is wrong.
There is nothing wrong in this case.
When you print a Python list, Python calls repr on each of the list's elements. In Python 2, the repr of a unicode string shows escape sequences for the Unicode code points that make up the string.
>>> c = clg_list[0]
>>> c # Ask the interpreter to display the repr of c
u'\u201c985\u201d\u5de5\u7a0b\u5927\u5b66\u540d\u5355\uff08\u622a\u6b62\u52302011\u5e743\u670831\u65e5\uff09'
However, if you print the string, python encodes the unicode string with a text encoding (for example, utf-8) and your computer displays the characters that match the encoding.
>>> print c
“985”工程大学名单(截止到2011年3月31日)
Note that in Python 3, printing the list will show the Chinese characters as you expect, because of Python 3's better Unicode handling.
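A quick Python 3 sketch of that difference (the list contents here are hypothetical):

```python
# Python 3: repr keeps printable non-ASCII characters as-is,
# so echoing the list already shows the Chinese text.
clg_list = ['\u201c985\u201d\u5de5\u7a0b\u5927\u5b66\u540d\u5355']
print(clg_list)           # ['“985”工程大学名单']
print(repr(clg_list[0]))  # '“985”工程大学名单'
```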

Unable to decode utf8 in python to display hindi language in the shell [duplicate]

I'm having a hard time trying to generate a list from a string with proper UTF-8 encoding in Python (I'm just learning to program, so bear with my silly question/terrible coding).
The source file is a tweet feed (JSON format). After parsing it successfully and extracting the tweet message from all the rest, I manage to get the text with the right encoding only after a print (as a string). If I try to put it back into list form, it goes back to the escaped \uXXXX form.
My code is:
import json

with open("file_name.txt") as tweets_file:
    tweets_list = []
    for a in tweets_file:
        b = json.loads(a)
        tweets_list.append(b)

tweet = []
for i in tweets_list:
    key = "text"
    if key in i:
        t = i["text"]
        tweet.append(t)

for k in tweet:
    print k.encode("utf-8")
As an alternative, I tried to have the encoding at the beginning (when fetching the file):
import json
import codecs

tweets_file = codecs.open("file_name.txt", "r", "utf-8")
tweets_list = []
for a in tweets_file:
    b = json.loads(a)
    tweets_list.append(b)
tweets_file.close()

tweet = []
for i in tweets_list:
    key = "text"
    if key in i:
        t = i["text"]
        tweet.append(t)

for k in tweet:
    print k
My question is: how can I put the resulting k strings, into a list? With each k string as an item?
You are getting confused by the Python string representation.
When you print a Python list (or any other standard Python container), the contents are shown in a special representation to make debugging easier; each value is shown as the result of calling the repr() function on that value. For string values, that means the result is a string literal representation, and that is not the same thing as what you see when the string is printed directly.
Unicode and byte strings, when shown like that, are presented as string literals: quoted values that you can copy and paste straight back into Python code without having to worry about encoding. Anything that is not a printable ASCII character is shown in escaped form. Unicode code points beyond the Latin-1 range are shown as '\u....' escape sequences; characters in the Latin-1 range use the '\x..' escape sequence, and many control characters are shown in their 1-letter escape form, such as \n and \t.
The Python interactive prompt does the same thing; when you echo a value at the prompt without using print, the value is 'represented', i.e. shown in its repr() form:
>>> print u'\u2036Hello World!\u2033'
‶Hello World!″
>>> u'\u2036Hello World!\u2033'
u'\u2036Hello World!\u2033'
>>> [u'\u2036Hello World!\u2033', u'Another\nstring']
[u'\u2036Hello World!\u2033', u'Another\nstring']
>>> print _[1]
Another
string
This is entirely normal behaviour. In other words, your code works; nothing is broken.
To come back to your code: if you want to extract just the 'text' key from the tweet JSON structures, filter while reading the file; don't bother with looping twice:
import json

with open("file_name.txt") as tweets_file:
    tweets = []
    for line in tweets_file:
        data = json.loads(line)
        if 'text' in data:
            tweets.append(data['text'])

Parsing invalid Unicode JSON in Python

I have a problematic JSON string that contains some funky escaped characters:
{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}
and if I convert using python
import json
s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
json.loads(s)
# Error..
If I can accept skipping/losing the value of these escaped characters, what is the best way to make json.loads(s) work?
If the rest of the string, apart from the invalid \x5C, is valid JSON, then you could use the string-escape encoding to decode '\x5C' into a backslash:
>>> import json
>>> s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
>>> json.loads(s.decode('string-escape'))
{u'test': {u'foo': u'Ig0s/k/4jRk'}}
You don't have JSON; that can be interpreted directly as Python instead. Use ast.literal_eval():
>>> import ast
>>> s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
>>> ast.literal_eval(s)
{'test': {'foo': 'Ig0s\\/k\\/4jRk'}}
The \x5C is a single backslash, doubled in the Python literal string representation here. The actual string value is:
>>> print _['test']['foo']
Ig0s\/k\/4jRk
This parses the input as Python source, but only allows for literal values; strings, None, True, False, numbers and containers (lists, tuples, dictionaries).
This method is slower than json.loads() because it does part of the parse-tree processing in pure Python code.
Another approach would be to use a regular expression to replace the \xhh escape codes with JSON \u00hh codes:
import re

escape_sequence = re.compile(r'\\x([a-fA-F0-9]{2})')

def repair(string):
    return escape_sequence.sub(r'\\u00\1', string)
Demo:
>>> import json
>>> json.loads(repair(s))
{u'test': {u'foo': u'Ig0s\\/k\\/4jRk'}}
If you can repair the source producing this value to output actual JSON instead that'd be a much better solution.
I'm a bit late to the party, but we were seeing a similar issue, to be precise this one: Logstash JSON input with escaped double quote, just with \xXX.
There, JSON.stringify created such (per the specification) invalid JSON texts.
The solution is to simply replace \x with \u00, as Unicode-escaped characters are allowed where ASCII-escaped characters are not.
import json
s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
s = s.replace("\\x", "\\u00")
json.loads(s)
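On Python 3, the result of that replacement can be checked directly (a small sketch):

```python
import json

s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
fixed = s.replace("\\x", "\\u00")  # \x5C becomes \u005C, a valid JSON escape
print(json.loads(fixed))  # {'test': {'foo': 'Ig0s\\/k\\/4jRk'}}
```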

Interpret "plain text" as utf-8 text in python [duplicate]

I have a text file with text that should have been interpreted as utf-8 but wasn't (it was given to me this way).
Here is an example of a typical line of the file:
\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f
which should have been:
ロンドン在住
Now, I can do it manually on python by typing the following in the command line:
>>> h1 = u'\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f'
>>> print h1
ロンドン在住
which gives me what I want. Is there a way that I can do this automatically? I've tried doing stuff like this
>>> f = codecs.open('testfile.txt', encoding='utf-8')
>>> h = f.next()
>>> print h
\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f
I've also tried with the 'encode' and 'decode' functions, any ideas?
Thanks!
\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f is not UTF8; it's using the python unicode escape format. Use the unicode_escape codec instead:
>>> print '\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f'.decode('unicode_escape')
ロンドン在住
Here is the UTF-8 encoding of the above phrase, for comparison:
>>> '\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f'.decode('unicode_escape').encode('utf-8')
'\xe3\x83\xad\xe3\x83\xb3\xe3\x83\x89\xe3\x83\xb3\xe5\x9c\xa8\xe4\xbd\x8f'
Note that the data decoded with unicode_escape are treated as Latin-1 for anything that's not a recognised Python escape sequence.
Be careful however; it may be you are really looking at JSON-encoded data, which uses the same notation for specifying character escapes. Use json.loads() to decode actual JSON data; JSON strings with such escapes are delimited with " quotes and are usually part of larger structures (such as JSON lists or objects).
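The snippets above are Python 2; a Python 3 sketch of both routes might look like this (in Python 3, unicode_escape decodes bytes, so an encode step is needed first):

```python
import json

line = r'\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f'

# Route 1: treat the line as Python-style escape sequences.
print(line.encode('ascii').decode('unicode_escape'))  # ロンドン在住

# Route 2: if the line came out of JSON, let the JSON parser decode it.
print(json.loads('"' + line + '"'))  # ロンドン在住
```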
