How do I convert a string formatted like unicode into unicode? - python

I have a variable with the string '\u96e8' and I want to convert this to unicode, because the function kanji_to_romaji() only accepts unicode. How would I do this? I am on python 2.7
# -*- coding: UTF-8 -*-
from kanji_to_romaji import kanji_to_romaji
message = '\u96e8'
message = unicode(message)
x = kanji_to_romaji(message)
print(x)

You can decode the bytestring to unicode using the unicode-escape codec.
>>> message = '\u96e8'
>>> unicode_message = message.decode('unicode-escape')
>>> unicode_message
u'\u96e8'
>>> print unicode_message
雨
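Applied to the script from the question, that looks like this (a minimal sketch; it assumes the kanji_to_romaji package mentioned in the question is installed):
# -*- coding: UTF-8 -*-
from kanji_to_romaji import kanji_to_romaji

message = '\u96e8'                          # byte string containing a literal \u escape
message = message.decode('unicode-escape')  # now the unicode object u'\u96e8'
print(kanji_to_romaji(message))             # should print the romaji reading ("ame")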

Use ast.literal_eval:
>>> import ast
>>> message = '\u96e8'
>>> ast.literal_eval('u"{}"'.format(message))
u'\u96e8'
The trick is to construct a string containing a unicode string literal to pass as the argument to literal_eval. That is, u"\u96e8" rather than just \u96e8.
(This is only partially correct, though. It will fail if the value of message itself contains a double-quote. There are probably other cases where this fails as well.)
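For instance, a quick sketch of that failure mode:
>>> import ast
>>> message = 'he said "hi"'
>>> ast.literal_eval('u"{}"'.format(message))  # raises SyntaxError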

Related

Correctly decoding hex escaped unicode strings in python [duplicate]

I have a unicode string like '%C3%A7%C3%B6asd+fjkls%25asd' and I want to decode this string.
I used urllib.unquote_plus(str) but it works incorrectly.
expected : çöasd+fjkls%asd
result : çöasd fjkls%asd
The doubly encoded UTF-8 characters (%C3%A7 and %C3%B6) are decoded wrong.
My python version is 2.7 under a linux distro.
What is the best way to get expected result?
You have 3 or 4 or 5 problems ... but repr() and unicodedata.name() are your friends; they unambiguously show you exactly what you have got, without the confusion engendered by people with different console encodings communicating the results of print fubar.
Summary: either (a) you start with a unicode object and apply the unquote function to that or (b) you start off with a str object and your console encoding is not UTF-8.
If as you say you start off with a unicode object:
>>> s0 = u'%C3%A7%C3%B6asd+fjkls%25asd'
>>> print repr(s0)
u'%C3%A7%C3%B6asd+fjkls%25asd'
this is accidental nonsense. If you apply urllibX.unquote_YYYY() to it, you get another nonsense unicode object (u'\xc3\xa7\xc3\xb6asd+fjkls%asd') which would cause your shown symptoms when printed. You should convert your original unicode object to a str object immediately:
>>> s1 = s0.encode('ascii')
>>> print repr(s1)
'%C3%A7%C3%B6asd+fjkls%25asd'
then you should unquote it:
>>> import urllib2
>>> s2 = urllib2.unquote(s1)
>>> print repr(s2)
'\xc3\xa7\xc3\xb6asd+fjkls%asd'
Looking at the first 4 bytes of that, it's encoded in UTF-8. If you do print s2, it will look OK if your console is expecting UTF-8, but if it's expecting ISO-8859-1 (aka latin1) you'll see your symptomatic rubbish (first char will be A-tilde). Let's park that thought for a moment and convert it to a Unicode object:
>>> s3 = s2.decode('utf8')
>>> print repr(s3)
u'\xe7\xf6asd+fjkls%asd'
and inspect it to see what we've actually got:
>>> import unicodedata
>>> for c in s3[:6]:
...     print repr(c), unicodedata.name(c)
...
u'\xe7' LATIN SMALL LETTER C WITH CEDILLA
u'\xf6' LATIN SMALL LETTER O WITH DIAERESIS
u'a' LATIN SMALL LETTER A
u's' LATIN SMALL LETTER S
u'd' LATIN SMALL LETTER D
u'+' PLUS SIGN
Looks like what you said you expected. Now we come to the question of displaying it on your console. Note: don't freak out when you see "cp850"; I'm doing this portably and just happen to be doing this in a Command Prompt on Windows.
>>> import sys
>>> sys.stdout.encoding
'cp850'
>>> print s3
çöasd+fjkls%asd
Note: print implicitly encoded the unicode object using sys.stdout.encoding. Fortunately all the unicode characters in s3 are representable in that encoding (and in cp1252 and latin1).
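If some character were not representable in the console encoding, print would raise UnicodeEncodeError; a rough workaround (a sketch, not part of the recipe above) is to encode explicitly with an error handler:
>>> print s3.encode(sys.stdout.encoding or 'ascii', 'replace')  # unrepresentable chars become '?'
çöasd+fjkls%asd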
Using either unquote or unquote_plus will give you a byte string. If you want a Unicode string then you have to decode the byte string to unicode:
>>> print(urllib.unquote_plus('%C3%A7%C3%B6asd+fjkls%25asd').decode('utf8'))
çöasd fjkls%asd
>>>
Compared with:
>>> print(urllib.unquote_plus('%C3%A7%C3%B6asd+fjkls%25asd'))
Ã§Ã¶asd fjkls%asd
>>>
Note that your input string must be a byte string: if you pass unicode to unquote/unquote_plus then you'll get a bit of a mess. If this is the case then encode it first:
>>> print(urllib.unquote_plus(u'%C3%A7%C3%B6asd+fjkls%25asd'.encode('ascii')).decode('utf8'))
çöasd fjkls%asd
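If you don't control whether the value arrives as a str or a unicode object, a small helper can normalise it first (a sketch; the name unquote_to_unicode is made up here):
import urllib

def unquote_to_unicode(value, encoding='utf-8'):
    # Hypothetical helper: accept str or unicode, return decoded unicode text.
    if isinstance(value, unicode):
        value = value.encode('ascii')      # percent-encoded data is plain ASCII
    return urllib.unquote_plus(value).decode(encoding)

print unquote_to_unicode(u'%C3%A7%C3%B6asd+fjkls%25asd')  # çöasd fjkls%asd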
Try urllib2 once more: '%C3%A7%C3%B6asd+fjkls%25asd' is not a unicode string; it is a URL-encoded byte string. Use urllib2.unquote() on it instead:
import urllib2
print urllib2.unquote('%C3%A7%C3%B6asd+fjkls%25asd')
You have a double problem: your string is UTF-8 encoded and also contains URL-encoded characters, and the two overlap. You can normalize your string to ASCII to be sure it won't be interpreted incorrectly:
>>> s = '%C3%A7%C3%B6asd+fjkls%25asd' # ascii string
>>> print urllib2.unquote(s) # works as expected
çöasd+fjkls%asd
>>> s = u'%C3%A7%C3%B6asd+fjkls%25asd' # unicode string
>>> print urllib2.unquote(s) # decodes stuff that it shouldn't
Ã§Ã¶asd+fjkls%asd
>>> print urllib2.unquote(s.encode('ascii')) # encode the unicode string to ascii: works!
çöasd+fjkls%asd
You are using the unquote_plus method, which treats + as an encoded space and converts it. Just use the unquote method and you should be fine.
>>> import urllib
>>> print urllib.unquote('%C3%A7%C3%B6asd+fjkls%25asd')
çöasd+fjkls%asd
>>> print urllib.unquote_plus('%C3%A7%C3%B6asd+fjkls%25asd')
çöasd fjkls%asd

Parsing invalid Unicode JSON in Python

I have a problematic JSON string that contains some funky escape sequences:
"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}
and if I convert using python
import json
s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
json.loads(s)
# Error..
If I can accept skipping/losing the value of these characters, what is the best way to make json.loads(s) work?
If the rest of the string, apart from the invalid \x5C escapes, is valid JSON, then you could use the string-escape codec to decode the \x5C sequences into backslashes first:
>>> import json
>>> s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
>>> json.loads(s.decode('string-escape'))
{u'test': {u'foo': u'Ig0s/k/4jRk'}}
You don't have valid JSON; the string can be interpreted directly as a Python literal instead. Use ast.literal_eval():
>>> import ast
>>> s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
>>> ast.literal_eval(s)
{'test': {'foo': 'Ig0s\\/k\\/4jRk'}}
The \x5C is a single backslash, doubled in the Python literal string representation here. The actual string value is:
>>> print _['test']['foo']
Ig0s\/k\/4jRk
This parses the input as Python source, but only allows for literal values: strings, None, True, False, numbers, and containers (lists, tuples, dictionaries).
This method is slower than json.loads() because it does part of the parse-tree processing in pure Python code.
Another approach would be to use a regular expression to replace the \xhh escape codes with JSON \uhhhh codes:
import re
escape_sequence = re.compile(r'\\x([a-fA-F0-9]{2})')
def repair(string):
    return escape_sequence.sub(r'\\u00\1', string)
Demo:
>>> import json
>>> json.loads(repair(s))
{u'test': {u'foo': u'Ig0s\\/k\\/4jRk'}}
If you can repair the source producing this value to output actual JSON instead that'd be a much better solution.
I'm a bit late to the party, but we were seeing a similar issue (to be precise, this one: Logstash JSON input with escaped double quote), just for \xXX.
There, JS.stringify created such (per the specification) invalid JSON texts.
The solution is to simply replace \x with \u00, since Unicode escape sequences are allowed in JSON while \xXX escape sequences are not.
import json
s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
s = s.replace("\\x", "\\u00")
json.loads(s)

How to decode unicode raw literals to readable string?

If I assign unicode raw literals to a variable, I can read its value:
>>> s = u'\u0421\u043e\u043e\u0431\u0449\u0435\u043d\u0438\u0435 \u043e\u0442\u043f\u0440\u0430\u0432\u043b\u0435\u043d\u043e'
>>> s
u'\u0421\u043e\u043e\u0431\u0449\u0435\u043d\u0438\u0435 \u043e\u0442\u043f\u0440\u0430\u0432\u043b\u0435\u043d\u043e'
>>> print s
Сообщение отправлено
But when the value has already been assigned to a plain (non-unicode) string, I cannot:
>>> s = '\u0421\u043e\u043e\u0431\u0449\u0435\u043d\u0438\u0435 \u043e\u0442\u043f\u0440\u0430\u0432\u043b\u0435\u043d\u043e'
>>> s
'\\u0421\\u043e\\u043e\\u0431\\u0449\\u0435\\u043d\\u0438\\u0435 \\u043e\\u0442\\u043f\\u0440\\u0430\\u0432\\u043b\\u0435\\u043d\\u043e'
>>> print s
\u0421\u043e\u043e\u0431\u0449\u0435\u043d\u0438\u0435 \u043e\u0442\u043f\u0440\u0430\u0432\u043b\u0435\u043d\u043e
How can I decode and read it?
Use the unicode_escape codec:
s.decode('unicode_escape')
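For example, with the string from the question:
>>> s = '\u0421\u043e\u043e\u0431\u0449\u0435\u043d\u0438\u0435 \u043e\u0442\u043f\u0440\u0430\u0432\u043b\u0435\u043d\u043e'
>>> print s.decode('unicode_escape')
Сообщение отправлено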
If you are getting weird results when decoding, try the following:
print repr(s).decode('unicode-escape').encode('latin-1')  # or encode using some other encoding
It could be that the Python terminal is using ASCII by default and some symbol goes out of its range.

How to convert a string to utf-8 in Python

I have a browser which sends utf-8 characters to my Python server, but when I retrieve it from the query string, the encoding that Python returns is ASCII. How can I convert the plain string to utf-8?
NOTE: The string passed from the web is already UTF-8 encoded, I just want to make Python to treat it as UTF-8 not ASCII.
In Python 2
>>> plain_string = "Hi!"
>>> unicode_string = u"Hi!"
>>> type(plain_string), type(unicode_string)
(<type 'str'>, <type 'unicode'>)
^ This is the difference between a byte string (plain_string) and a unicode string.
>>> s = "Hello!"
>>> u = unicode(s, "utf-8")
^ Converting to unicode and specifying the encoding.
In Python 3
All strings are unicode. The unicode() function does not exist anymore. See the answer from @Noumenon.
If the methods above don't work, you can also tell Python to ignore portions of a string that it can't convert to utf-8:
stringnamehere.decode('utf-8', 'ignore')
It might be a bit of overkill, but when I work with ASCII and unicode in the same files, repeating decode calls can be a pain, so this is what I use:
def make_unicode(inp):
    if type(inp) != unicode:
        inp = inp.decode('utf-8')
    return inp
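A quick usage sketch (Python 2, assuming the byte strings are UTF-8 encoded):
>>> make_unicode('caf\xc3\xa9')   # byte string gets decoded
u'caf\xe9'
>>> make_unicode(u'caf\xe9')      # unicode passes through unchanged
u'caf\xe9'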
Adding the following line to the top of your .py file:
# -*- coding: utf-8 -*-
allows you to use UTF-8 characters directly in string literals in your script, like this:
utfstr = "ボールト"
If I understand you correctly, you have a utf-8 encoded byte-string in your code.
Converting a byte-string to a unicode string is known as decoding (unicode -> byte-string is encoding).
You do that by using the unicode function or the decode method. Either:
unicodestr = unicode(bytestr, encoding)
unicodestr = unicode(bytestr, "utf-8")
Or:
unicodestr = bytestr.decode(encoding)
unicodestr = bytestr.decode("utf-8")
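For example, a small round trip (a sketch, using "café" as sample data):
>>> bytestr = 'caf\xc3\xa9'                # "café" encoded as UTF-8
>>> unicodestr = bytestr.decode('utf-8')
>>> print unicodestr
café
>>> unicodestr.encode('utf-8') == bytestr
True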
city = 'Ribeir\xc3\xa3o Preto'  # UTF-8 encoded byte string
print city.decode('utf-8')      # u'Ribeir\xe3o Preto', prints as Ribeirão Preto
In Python 3.6, there is no built-in unicode() function.
Strings are already stored as Unicode by default and no conversion is required. Example:
my_str = "\u221a25"
print(my_str)
# √25
Translate with ord() and unichr().
Every Unicode character has a number associated with it, something like an index, and Python has a couple of functions to translate between a character and its number. Below is an example using ñ. Hope it can help.
>>> C = 'ñ'
>>> U = C.decode('utf8')
>>> U
u'\xf1'
>>> ord(U)
241
>>> unichr(241)
u'\xf1'
>>> print unichr(241).encode('utf8')
ñ
First, str in Python is represented in Unicode.
Second, UTF-8 is an encoding standard to encode Unicode string to bytes. There are many encoding standards out there (e.g. UTF-16, ASCII, SHIFT-JIS, etc.).
When the client sends data to your server using UTF-8, they are sending a bunch of bytes, not str.
You received a str because the "library" or "framework" that you are using has implicitly converted some random bytes to str.
Under the hood, there is just a bunch of bytes. Just ask the "library" to give you the request content in bytes and handle the decoding yourself (if the library can't give you the bytes, it is trying to do black magic and you shouldn't use it).
Decode UTF-8 encoded bytes to str: bs.decode('utf-8')
Encode str to UTF-8 bytes: s.encode('utf-8')
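In code, the round trip looks like this (a minimal sketch):
>>> raw = bytes([0xc3, 0xa7, 0xc3, 0xb6])   # UTF-8 bytes as they arrive from the client
>>> text = raw.decode('utf-8')              # bytes -> str
>>> text
'çö'
>>> text.encode('utf-8') == raw             # str -> bytes
True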
The URL arrives percent-encoded in plain ASCII, so to the Python server it is just an ordinary string, e.g.:
"T%C3%A9st%C3%A3o"
Here "é" and "ã" have been percent-encoded as %C3%A9 and %C3%A3.
You can decode such a URL like this:
import urllib.parse
url = "T%C3%A9st%C3%A3o"
print(urllib.parse.unquote(url))
# Téstão
See https://www.adamsmith.haus/python/answers/how-to-decode-a-utf-8-url-in-python for details.
you can also do this:
from unidecode import unidecode
unidecode(yourStringtoDecode)
You can use python's standard library codecs module.
import codecs
codecs.decode(b'Decode me', 'utf-8')
Yes, you can add
# -*- coding: utf-8 -*-
as the first line of your source code.
You can read more details here: https://www.python.org/dev/peps/pep-0263/

Python 3.1.1 string to hex

I am trying to use str.encode() but I get
>>> "hello".encode(hex)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: must be string, not builtin_function_or_method
I have tried a bunch of variations and they seem to all work in Python 2.5.2, so what do I need to do to get them to work in Python 3.1?
The hex codec has been chucked in 3.x. Use binascii instead:
>>> import binascii
>>> binascii.hexlify(b'hello')
b'68656c6c6f'
In Python 3.5+, encode the string to bytes and use the hex() method, returning a string.
s = "hello".encode("utf-8").hex()
s
# '68656c6c6f'
Optionally convert the string back to bytes:
b = bytes(s, "utf-8")
b
# b'68656c6c6f'
You've already got some good answers, but I thought you might be interested in a bit of the background too.
Firstly you're missing the quotes. It should be:
"hello".encode("hex")
Secondly this codec hasn't been ported to Python 3.1. See here. It seems that they haven't yet decided whether or not these codecs should be included in Python 3 or implemented in a different way.
If you look at the diff file attached to that bug you can see the proposed method of implementing it:
import binascii
output = binascii.b2a_hex(input)
The easiest way to do it in Python 3.5 and higher is:
>>> 'halo'.encode().hex()
'68616c6f'
If you are entering the string manually into the Python interpreter, you can do it even faster by typing b before the string to make it a bytes literal:
>>> b'halo'.hex()
'68616c6f'
Equivalent in Python 2.x:
>>> 'halo'.encode('hex')
'68616c6f'
The binascii methods are easier, by the way:
>>> import binascii
>>> x=b'test'
>>> x=binascii.hexlify(x)
>>> x
b'74657374'
>>> y=str(x,'ascii')
>>> y
'74657374'
>>> x=binascii.unhexlify(x)
>>> x
b'test'
>>> y=str(x,'ascii')
>>> y
'test'
Hope it helps. :)
In Python 3, all strings are Unicode. Usually, if you encode a unicode object to a string, you use .encode('TEXT_ENCODING'). Since hex is not a text encoding, you should use codecs.encode() to handle arbitrary codecs. For example:
>>> "hello".encode('hex')
LookupError: 'hex' is not a text encoding; use codecs.encode() to handle arbitrary codecs
>>> import codecs
>>> codecs.encode(b"hello", 'hex')
b'68656c6c6f'
Again, since "hello" is Unicode, you need to indicate it as a byte string before encoding to hexadecimal. This may be more in line with your original approach of using the encode method.
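Decoding back from hex works the same way (a quick sketch):
>>> codecs.decode(b'68656c6c6f', 'hex')
b'hello'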
The differences between binascii.hexlify and codecs.encode are as follows:
binascii.hexlify
Hexadecimal representation of binary data.
The return value is a bytes object.
Type: builtin_function_or_method
codecs.encode
encode(obj, [encoding[,errors]]) -> object
Encodes obj using the codec registered for encoding. encoding defaults
to the default encoding. errors may be given to set a different error
handling scheme. Default is 'strict' meaning that encoding errors raise
a ValueError. Other possible values are 'ignore', 'replace' and
'xmlcharrefreplace' as well as any other name registered with
codecs.register_error that can handle ValueErrors.
Type: builtin_function_or_method
base64.b16encode and base64.b16decode convert bytes to and from hex and work across all Python versions. The codecs approach also works, but is less straightforward in Python 3.
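For example (a sketch; note that b16decode expects uppercase hex digits unless casefold=True):
>>> import base64
>>> base64.b16encode(b'hello')
b'68656C6C6F'
>>> base64.b16decode(b'68656C6C6F')
b'hello'
>>> base64.b16decode(b'68656c6c6f', casefold=True)
b'hello'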
Use hexlify - http://epydoc.sourceforge.net/stdlib/binascii-module.html
Yet another method:
s = 'hello'
h = ''.join([hex(ord(i)) for i in s])
# outputs: '0x680x650x6c0x6c0x6f'
This basically splits the string into chars, does the conversion through hex(ord(char)), and joins the chars back together. In case you want the result without the prefix 0x then do:
h = ''.join([hex(ord(i))[2:4] for i in s])
# outputs: '68656c6c6f'
Tested with Python 3.5.3.
