Python chinese characters - python

I have the following encoding in python 2.7:
["\xe5\x81\x9a\xe6\x88\x8f\xe4\xb9\x8b\xe8\xaf\xb4"]
I need to get the following (chinese characters) from that:
["做戏之说"]
Does anyone know how to decode the above to get that?

You need to decode your string:
>>> l = ["\xe5\x81\x9a\xe6\x88\x8f\xe4\xb9\x8b\xe8\xaf\xb4"]
>>> a = [l[0].decode('utf8')]
>>> print a[0]
做戏之说
If you want to show your Unicode inside the list you need to convert the standard representation of the list to unicode then print it:
>>> print unicode(repr([l[0].decode('utf8')]), 'unicode-escape')
[u'做戏之说']
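For readers on Python 3, here is a minimal sketch of the same idea. It assumes the data arrives as bytes objects; the names raw and decoded are illustrative and not from the question. In Python 3 the repr of a list already shows the characters, so the unicode-escape step is not needed there.

```python
# Python 3 sketch: the escaped value from the question as a bytes object.
raw = [b"\xe5\x81\x9a\xe6\x88\x8f\xe4\xb9\x8b\xe8\xaf\xb4"]

# Decode every element from UTF-8 bytes to str.
decoded = [item.decode("utf-8") for item in raw]

# Four characters come out of the twelve bytes.
assert decoded == ["做戏之说"]
print(len(raw[0]), len(decoded[0]))  # 12 4
```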

Related

Python - How to convert utf literal such as '\xc3\xb6' to the letter ö

I am trying to convert an encoded url with german Umlaute into a string with those Umlaute.
Here is an example of an encoded string = 'K%C3%B6nnen'.
I would like to convert it to 'Können'
When I use urllib.unquote(a) I get this returned: 'K\xc3\xb6nnen'
\xc3\xb6 I found out is a utf literal.
How can I convert this to an ö ? I find that if I use the print function it converts it correctly, but I cannot figure out how to get a function to return this value? Any ideas?
Use decode("utf-8"):
print('K\xc3\xb6nnen'.decode("utf-8"))
OUTPUT
Können
Extra edit: be careful when the string is inside a list:
>>> s = 'K\xc3\xb6nnen'
>>> l = []
>>> l.append(s.decode("utf-8")) # s is the byte string
>>> l
[u'K\xf6nnen']
>>> print(l)
[u'K\xf6nnen']
>>> print(l[0])
Können
>>>
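Python stores the decoded string as Unicode internally. Printing a list shows the repr() of each element, with escapes such as \xf6, while printing the element itself shows the actual character. The value is the same in both cases; repr(s) only shows its escaped form.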
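If you are on Python 3, a function can return the decoded value directly. This is a sketch under that assumption: urllib.parse.unquote decodes percent-escapes as UTF-8 by default, and the helper name decode_url_component is made up for illustration.

```python
from urllib.parse import unquote

def decode_url_component(encoded):
    """Return the percent-decoded text (unquote assumes UTF-8 by default)."""
    return unquote(encoded)

name = decode_url_component("K%C3%B6nnen")
assert name == "Können"

# ascii() shows the escaped representation; the value itself has 6 characters.
print(ascii(name), len(name))  # 'K\xf6nnen' 6
```

The returned value is the real string; the \xf6 only appears when you ask for an escaped representation.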

escape a string which contains non-ascii

I have the string s = "\\u653e"
I want to convert this string into s = "\u653e"
To make it clear:
# this is what I want
>>s
>>'\u653e'
# this is not what I want, print will escape the string automatically
>>print s
>>\u653e
How can I do that?
The original problem is this:
I have a string s = u'\u653e', so [s] = [u'\u653e']
I want to remove the u, that is, [s] = ['\u653e']
So I used ast.literal_eval(json.dumps(r)) to get the string "\\u653e" above.
UPDATE
Thanks tdelaney
Creating a string from the entire list caused my problem. What I should do is start with a unicode string and build the list from its individual elements instead of from the entire list. For more details, see his answer.
s is a single unicode character. "\u653e" is an escape sequence that Python uses to express unicode characters in ASCII text. The unicode_escape codec converts between the two forms.
>>> s = u'\u653e'
>>> print type(s), len(s), s
<type 'unicode'> 1 放
>>> encoded = s.encode('unicode_escape')
>>> print type(encoded), len(encoded), encoded
<type 'str'> 6 \u653e
In your example just do
>>> s = u'\u653e'
>>> somelist = [s.encode('unicode_escape')]
>>> print somelist
['\\u653e']
>>> print somelist[0]
\u653e
update
From your comments, your problem may be how you create your command string. There seems to be confusion between the Python representation of a string and the string itself. Use a unicode string to start with and build the command from the list's individual elements instead of from the repr of the entire list.
>>> excel = [u'\u4e00', u'\u4e8c', u'\u4e09']
>>> cmd = u'create vertex v set s = [{}]'.format(u','.join(excel))
>>> cmd
u'create vertex v set s = [\u4e00,\u4e8c,\u4e09]'
>>> print cmd
create vertex v set s = [一,二,三]
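The same round trip can be sketched in Python 3, where encode returns bytes rather than str. The variable names are illustrative only.

```python
s = "\u653e"                                 # a single character
escaped = s.encode("unicode_escape")         # bytes: backslash, u, 6, 5, 3, e
text_form = escaped.decode("ascii")          # the same six characters as str
restored = escaped.decode("unicode_escape")  # back to the single character

assert restored == s
print(len(s), len(text_form), text_form)     # 1 6 \u653e
```

The list repr doubles the backslash (['\\u653e']) only for display; the element itself contains one backslash.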

How to remove '\x' from a hex string in Python?

I'm reading a wav audio file in Python using the wave module. The readframes() function in this library returns frames as a hex string. I want to remove the \x from this string, but the translate() function doesn't work as I want:
>>> input = wave.open(r"G:\Workspace\wav\1.wav",'r')
>>> input.readframes (1)
'\xff\x1f\x00\xe8'
>>> '\xff\x1f\x00\xe8'.translate(None,'\\x')
'\xff\x1f\x00\xe8'
>>> '\xff\x1f\x00\xe8'.translate(None,'\x')
ValueError: invalid \x escape
>>> '\xff\x1f\x00\xe8'.translate(None,r'\x')
'\xff\x1f\x00\xe8'
>>>
Anyway, I want to divide the result values by 2, then add the \x again and generate a new wav file containing these new values. Does anyone have a better idea?
What's wrong?
Indeed, there are no backslashes in your string, which is why you can't remove them.
If you inspect each character of this string with ord() and check the string with len(), you'll see the real values. Besides, the length of your string is just 4, not 16.
You can play with several solutions to achieve your result:
'hex' encode:
'\xff\x1f\x00\xe8'.encode('hex')
'ff1f00e8'
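Or use the repr() function. Note that the result keeps the surrounding quote characters, and repr() does not show printable bytes as \x escapes, so this is less reliable than the 'hex' codec:
repr('\xff\x1f\x00\xe8').translate(None,r'\\x')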
One way to do what you want is:
>>> s = '\xff\x1f\x00\xe8'
>>> ''.join('%02x' % ord(c) for c in s)
'ff1f00e8'
The reason why translate is not working is that what you are seeing is not the string itself, but its representation. In other words, \x is not contained in the string:
>>> '\\x' in '\xff\x1f\x00\xe8'
False
\xff, \x1f, \x00 and \xe8 are the hexadecimal representation of four characters (in fact, len(s) == 4, not 16).
Use the encode method:
>>> s = '\xff\x1f\x00\xe8'
>>> print s.encode("hex")
ff1f00e8
As this is a hexadecimal representation, encode with hex
>>> '\xff\x1f\x00\xe8'.encode('hex')
'ff1f00e8'
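For the stated goal of halving the values, you do not need the hex text at all. Below is a Python 3 sketch that works on the raw frames directly. It assumes 16-bit signed little-endian samples, which the question does not state (wave's getsampwidth() reports the actual width), and the helper name halve_samples is hypothetical.

```python
import struct

def halve_samples(frames):
    """Halve every sample; assumes 16-bit signed little-endian PCM."""
    count = len(frames) // 2  # two bytes per sample
    samples = struct.unpack("<%dh" % count, frames)
    # Floor division rounds toward negative infinity for negative samples.
    return struct.pack("<%dh" % count, *(s // 2 for s in samples))

frames = b"\xff\x1f\x00\xe8"        # the four bytes from the question
print(frames.hex())                 # ff1f00e8
print(halve_samples(frames).hex())  # ff0f00f4
```

The returned bytes can be passed to writeframes() on a wave file opened for writing, with no need to add \x back.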

utf-8 convert to utf-16

I want to convert Chinese characters to the Unicode escape format, like '\uXXXX',
but when I use str.encode('utf-16be'), it shows:
b'\xOO\xOO'
So I wrote some code to do this, as below:
data="index=索引?"
print(data.encode('UTF-16LE'))
def convert(s):
    returnCode=[]
    temp=''
    for n in s.encode('utf-16be'):
        if temp=='':
            if str.replace(hex(n),'0x','')=='0':
                temp='00'
                continue
            temp+=str.replace(hex(n),'0x','')
        else:
            returnCode.append(temp+str.replace(hex(n),'0x',''))
            temp=''
    return returnCode
print(convert(data))
Can someone suggest how to do this conversion in Python 3.x?
I'm not sure if I understand you well.
Unicode is like a type. In python 3, all strings are unicode, so when you write data = "index=索引?" then data is already unicode. If you want to get an alternative representation just for displaying, you could use:
def display_unicode(data):
    return "".join(["\\u%s" % hex(ord(l))[2:].zfill(4) for l in data])
>>> data = "index=索引?"
>>> print(display_unicode(data))
\u0069\u006e\u0064\u0065\u0078\u003d\u7d22\u5f15\u003f
Note that the string now contains real backslashes and hex digits, not the original unicode characters.
There are also built-in alternatives, which escape only the non-ASCII characters:
>>> data.encode('ascii', 'backslashreplace')
b'index=\\u7d22\\u5f15?'
>>> data.encode('unicode_escape')
b'index=\\u7d22\\u5f15?'
Try decoding first, for example s.decode('utf-8').encode('utf-16be'). This applies only if s is a byte string; in Python 3 a str has no decode method.
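If what you want really is one \uXXXX per UTF-16 code unit, as the utf-16be attempt in the question suggests, here is a Python 3 sketch. The function name utf16_escapes is made up for illustration. Unlike formatting ord() directly, it emits a surrogate pair for characters outside the Basic Multilingual Plane.

```python
def utf16_escapes(text):
    """Return text as \\uXXXX escapes, one per UTF-16 code unit."""
    raw = text.encode("utf-16-be")  # big-endian, no byte order mark
    units = (raw[i:i + 2] for i in range(0, len(raw), 2))
    return "".join("\\u%04x" % int.from_bytes(unit, "big") for unit in units)

print(utf16_escapes("index=\u7d22\u5f15?"))
# \u0069\u006e\u0064\u0065\u0078\u003d\u7d22\u5f15\u003f
```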

python - problems with regular expression and unicode

Hi, I have a problem in Python. I will try to explain it with an example.
I have this string:
>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'
>>> print string
ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ
and I want, for example, to replace every character other than Ñ, Ã, ï with ""
I have tried:
>>> rePat = re.compile('[^ÑÃï]',re.UNICODE)
>>> print rePat.sub("",string)
�Ñ�����������������������������ï�������������������Ã
I get this � character.
I think this happens because each of these characters is represented in Python by two bytes: for example, \xc3\x91 = Ñ.
Because of this, when I apply the regular expression, the \xc3 bytes are not substituted. How can I do this kind of substitution?
Thanks
Franco
You need to make sure that your strings are unicode strings, not plain strings (plain strings are like byte arrays).
Example:
>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'
>>> type(string)
<type 'str'>
# do this instead:
# (note the u in front of the ', this marks the character sequence as a unicode literal)
>>> string = u'\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\xc0\xc1\xc2\xc3'
# or:
>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'.decode('utf-8')
# ... but be aware that the latter will only work if the terminal (or source file) has utf-8 encoding
# ... it is a best practice to use the \xNN form in unicode literals, as in the first example
>>> type(string)
<type 'unicode'>
>>> print string
ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ
>>> import re
# in a unicode pattern, use code points (\xd1 = Ñ, \xc3 = Ã, \xef = ï), not UTF-8 byte pairs
>>> rePat = re.compile(u'[^\xd1\xc3\xef]',re.UNICODE)
>>> print rePat.sub("", string)
ÑïÃ
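In Python 3, where every str is already unicode, the pattern from the question works as written. A minimal sketch with a shortened sample string (the names text and keep_only are illustrative):

```python
import re

text = "ÐÑÒÓîïðÀÁÂÃ"               # shortened sample, 11 characters
keep_only = re.compile("[^ÑÃï]")   # match anything except these three

result = keep_only.sub("", text)
assert result == "ÑïÃ"
print(len(text), len(result))      # 11 3
```

Each character is matched as a whole, so no stray bytes are left behind.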
When reading from a file, string = open('filename.txt').read() reads a byte sequence.
To get the unicode content, do: string = unicode(open('filename.txt').read(), 'encoding'). Or: string = open('filename.txt').read().decode('encoding').
The codecs module can decode unicode streams (such as files) on-the-fly.
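As a sketch of decoding while reading, io.open accepts an encoding argument in both Python 2.7 and Python 3. The helper name read_text and the temporary demo file are assumptions for illustration only; substitute your own path and encoding.

```python
import io
import os
import tempfile

def read_text(path, encoding="utf-8"):
    """Read a whole file and return decoded text rather than raw bytes."""
    with io.open(path, encoding=encoding) as handle:
        return handle.read()

# Demo with a throwaway file holding the two UTF-8 bytes of one character.
fd, demo_path = tempfile.mkstemp()
os.write(fd, b"\xc3\x91")
os.close(fd)
content = read_text(demo_path)
os.remove(demo_path)

assert content == u"\xd1" and len(content) == 1
```

Decoding at the file boundary like this means the regular expression only ever sees whole characters.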
Do a google search for python unicode. Python unicode handling can be a bit hard to grasp at first, it pays to read up on it.
I live by this rule: "Software should only work with Unicode strings internally, converting to a particular encoding on output." (from http://www.amk.ca/python/howto/unicode)
I also recommend: http://www.joelonsoftware.com/articles/Unicode.html
