Convert special characters back to their original form in Python

Suppose I had a string for example:
>>> stri = "日本"
>>> res = stri
>>> res
'\xe6\x97\xa5\xe6\x9c\xac'
Now I want to convert the result in res back to the form in "日本".

(Assuming that you're using Python 2.x on a UTF-8 console):
Nothing has been converted, and there is no need to convert anything back; what you're seeing is the internal representation of the string. Try printing it.
>>> stri = "日本"
>>> stri
'\xe6\x97\xa5\xe6\x9c\xac'
>>> print(stri)
日本
To clarify:
If you enter the name of a Python variable in the console, the console will print the repr of that variable. If you want to print the variable in human-readable form, use print instead. There is no difference in the way the variable is stored, therefore there's nothing to convert.
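The same distinction can be reproduced in Python 3, where the Python 2 byte string above roughly corresponds to a bytes object. This is a minimal sketch, assuming Python 3 and a UTF-8 source file; the variable names are only illustrative.

```python
# Python 3 sketch: Python 2's byte string corresponds to `bytes` here.
text = "日本"
raw = text.encode("utf-8")      # the six UTF-8 bytes behind the two characters

# repr() shows non-ASCII bytes as \x escapes; the data itself is unchanged.
print(repr(raw))                # b'\xe6\x97\xa5\xe6\x9c\xac'

# Decoding recovers exactly the original text, so nothing was ever converted.
round_trip = raw.decode("utf-8")
print(round_trip == text)       # True
```

The escaped form and the readable form are two views of one value, which is why there is nothing to convert back.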

That is the expected behaviour: the console doesn't render the variable as Unicode text. If you actually print it, you'll see that the correct characters are still there. The console automatically applies repr to every value before displaying it. You can verify that yourself with print repr(stri), as in the example below:
>>> stri = "日本"
>>> stri
'\xe6\x97\xa5\xe6\x9c\xac'
>>> print stri
日本
>>> print repr(stri)
'\xe6\x97\xa5\xe6\x9c\xac'
>>>

Like Tim said, the characters haven't been converted.
This article should help you understand what's happening

Related

Python - How to convert utf literal such as '\xc3\xb6' to the letter ö

I am trying to convert an encoded url with german Umlaute into a string with those Umlaute.
Here is an example of an encoded string = 'K%C3%B6nnen'.
I would like to convert it to 'Können'
When I use urllib.unquote(a) I get this returned: 'K\xc3\xb6nnen'
\xc3\xb6 I found out is a utf literal.
How can I convert this to an ö ? I find that if I use the print function it converts it correctly, but I cannot figure out how to get a function to return this value? Any ideas?
Use decode("utf-8"):
print('K\xc3\xb6nnen'.decode("utf-8"))
OUTPUT
Können
Extra note: be careful when the string is inside a list, because printing the list shows the repr of each element:
>>> l = []
>>> l.append(s.decode("utf-8")) #s is the string
>>> l
[u'K\xf6nnen']
>>> print(l)
[u'K\xf6nnen']
>>> print(l[0])
Können
>>>
Python stores the string in an encoded form. print shows the human-readable text, not the escaped value; use repr(s) if you want to see the escaped value.
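For readers on Python 3, the decoding step is built into the standard library's urllib.parse module. The sketch below assumes Python 3; decode_url_component is a hypothetical helper name used only to show that a function can return the decoded text without involving print.

```python
from urllib.parse import unquote, unquote_to_bytes

encoded = "K%C3%B6nnen"

# unquote() decodes the percent-escapes as UTF-8 by default and returns text.
text = unquote(encoded)

# unquote_to_bytes() stops one step earlier and returns the raw UTF-8 bytes,
# the same bytes that Python 2's urllib.unquote returned in the question.
raw = unquote_to_bytes(encoded)

# Hypothetical helper: the return value is the decoded text itself.
def decode_url_component(value):
    return unquote(value)

print(raw)                          # b'K\xc3\xb6nnen'
print(raw.decode("utf-8") == text)  # True
```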

How do I get '\\\\host\\printer' out of a string var of '\\host\printer' in python 2.7.5?

My program takes in a string argument, "\\host\printer", and I need to convert it to "\\\\host\\printer" in order to submit it as a JSON doc to a web endpoint.
Seems simple enough, but python won't let me. Here's what happens:
>>> data = '\\host\printer'
>>> print data.replace('\\','\\\\')
\\host\\printer
Now, if this data var was assigned a raw string, it'd work fine:
>>> data = r'\\host\printer'
>>> print data.replace('\\','\\\\')
\\\\host\\printer
However, since data is an input argument, I can't make it a raw string. I've tried several tricks found on SO to convert it to a raw string, but no luck with the final result, as shown below.
encode() doesn't help:
>>> data = '\\host\printer'
>>> data = data.encode('string-escape')
>>> print data.replace('\\','\\\\')
\\\\host\\\\printer
nor does repr():
>>> data = '\\host\printer'
>>> data = repr(data)
>>> print data.replace('\\','\\\\')
'\\\\host\\\\printer'
nor does re.escape():
>>> import re
>>> data = '\\host\printer'
>>> data = re.escape(data)
>>> print data.replace('\\','\\\\')
\\\\host\\\\printer
When you write this:
>>> data = '\\host\printer'
You end up with data containing the literal string \host\printer, because in Python \ is an escape character, and when you want a single \ you need to write \\. You can disable this behavior by using a raw string, or by escaping \ whenever you use it. So you can write:
>>> data = '\\\\host\\printer'
Or you can write:
>>> data = r'\\host\printer'
Since you want the literal string \\\\host\\printer, you need to replace every instance of \ with \\. Which means you can write this:
>>> newdata = data.replace('\\', '\\\\')
And that gets you:
>>> print newdata
\\\\host\\printer
The String Literals section of the docs has some details on the above.
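Since the stated goal is a JSON document, it may be simpler to let the json module do the escaping. The sketch below assumes Python 3 and that the document is built with the standard json module; the raw string stands in for a value that already contains real backslashes, as a command-line argument would (escape sequences are only processed in source-code literals, not in strings received at runtime).

```python
import json

# Raw string literal: holds exactly the characters \\host\printer.
data = r"\\host\printer"

# Doubling every backslash by hand, as in the answer above.
doubled = data.replace("\\", "\\\\")

# json.dumps performs the same backslash escaping and adds the quotes.
as_json = json.dumps(data)

print(data)      # \\host\printer
print(doubled)   # \\\\host\\printer
print(as_json)   # "\\\\host\\printer"
```

If json.dumps is used, the manual replace should be dropped; applying both would double the backslashes twice.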

Python chinese characters

I have the following encoding in python 2.7:
["\xe5\x81\x9a\xe6\x88\x8f\xe4\xb9\x8b\xe8\xaf\xb4"]
I need to get the following (chinese characters) from that:
["做戏之说"]
Does anyone know how to decode the above to get that?
You need to decode your string:
>>> l = ["\xe5\x81\x9a\xe6\x88\x8f\xe4\xb9\x8b\xe8\xaf\xb4"]
>>> a = [l[0].decode('utf8')]
>>> print a[0]
做戏之说
If you want to show your Unicode inside the list you need to convert the standard representation of the list to unicode then print it:
>>> print unicode(repr([l[0].decode('utf8')]), 'unicode-escape')
[u'做戏之说']
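In Python 3 the unicode-escape round trip is not needed, because the repr of a list keeps printable non-ASCII characters. A minimal sketch, assuming Python 3 and that the data arrives as bytes; the variable names are illustrative:

```python
# Python 3 sketch: the UTF-8 data is a bytes object here.
encoded_items = [b"\xe5\x81\x9a\xe6\x88\x8f\xe4\xb9\x8b\xe8\xaf\xb4"]

# Decode every element to get a list of text strings.
decoded_items = [item.decode("utf-8") for item in encoded_items]

# The list's repr already contains the characters themselves,
# so displaying the list needs no extra conversion step.
shown = repr(decoded_items)
print(len(decoded_items[0]))   # 4
```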

escape a string which contains non-ascii

I have a string s = "\\u653e".
I want to convert this string into s = "\u653e".
To make it clear:
# this is what I want
>>> s
'\u653e'
# this is not what I want, print will escape the string automatically
>>> print s
\u653e
How can I do that?
the original question is that
I have a string s = u'\u653e', [s] = [u'\u653e']
So I want to remove the u, that is, [s] = ['\u653e']
so I just use the command ast.literal_eval(json.dumps(r)) to get the above string "\\u653e"
UPDATE
Thanks tdelaney
Creating a string from the entire list caused my problem. What I should do is start with a unicode string and build the list from its individual elements instead of from the entire list. For more details, see his answer.
s is a single unicode character. "\u653e" is an escape sequence that Python uses to express unicode characters in ASCII text. The unicode_escape codec converts between these two forms.
>>> s = u'\u653e'
>>> print type(s), len(s), s
<type 'unicode'> 1 放
>>> encoded = s.encode('unicode_escape')
>>> print type(encoded), len(encoded), encoded
<type 'str'> 6 \u653e
In your example, just do:
>>> s = u'\u653e'
>>> somelist = [s.encode('unicode_escape')]
>>> print somelist
['\\u653e']
>>> print somelist[0]
\u653e
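The same round trip works in Python 3, where encode returns bytes rather than str. This is a sketch assuming Python 3; the lengths confirm that one character becomes six ASCII bytes and back.

```python
# Python 3 sketch of the unicode_escape round trip.
s = "\u653e"                           # one character

encoded = s.encode("unicode_escape")   # six ASCII bytes: \u653e
decoded = encoded.decode("unicode_escape")

print(len(s), len(encoded))            # 1 6
print(encoded.decode("ascii"))         # \u653e
print(decoded == s)                    # True
```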
update
From your comments, your problem may be how you create your command string. There seems to be confusion between the Python representation of a string and the string itself. Use a unicode string to start with and build the list from its individual elements instead of the entire list.
>>> excel = [u'\u4e00', u'\u4e8c', u'\u4e09']
>>> cmd = u'create vertex v set s = [{}]'.format(u','.join(excel))
>>> cmd
u'create vertex v set s = [\u4e00,\u4e8c,\u4e09]'
>>> print cmd
create vertex v set s = [一,二,三]

String formatting with Python's escape sequence

For clarification purposes, I am rewriting from scratch with additional information.
Consider the following:
y = hex(1200)
y
'0x4b0'
I need to replace that first 0 of y with a '\' to make it look like '\x04b0'. I am communicating with an instrument over RS-232 serial which takes parameters strictly in that format ('\xSumCharsHere'). Python won't let me do the following.
z = '\x' + y[2:]
ValueError: invalid \x escape
The following is not acceptable, because it still has '\\' in the actual value assigned to z.
z = '\\' + y[1:]
z
'\\x4b0'
The end goal is to send a command like this to my serial port:
s.write(z) # s is a serial object
s.write('\x04b0') # This call is an equivalent of the call above
s.write('\\x04b0') # This command will not work
Your last bit of code doesn't do what you think it does:
>>> x = hex(1200)
>>> y = '\\' + x[1: len(x)]
>>> y
'\\x4b0'
>>> print y
\x4b0
When you type the name of a variable in the Python console, Python prints the string's representation as Python code, which is why you see two backslashes -- a literal backslash in a Python string is escaped by another leading backslash. This code does in fact work, the representation of the result is just throwing you off.
However, I would suggest you use this snippet instead, since yours is omitting leading zeroes:
>>> y = '\\x%04x' % 1200
>>> print y
\x04b0
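One detail worth checking is that an escape written directly in source code and a string assembled at runtime are different values, even when they look alike on screen. The sketch below assumes Python 3 and only illustrates that difference; which of the two the instrument actually expects is not stated in the question.

```python
# An escape written in source code is processed when the code is compiled.
literal = "\x04b0"            # 3 characters: chr(4), "b", "0"

# A string assembled at runtime contains a real backslash instead.
built = "\\x%04x" % 1200      # 6 characters: \ x 0 4 b 0

print(len(literal), len(built))   # 3 6
print(built)                      # \x04b0
print(repr(built))                # '\\x04b0'
```

The doubled backslash appears only in the repr; the runtime-built string holds a single backslash, which is the point the answers here make.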
Your last code bit is correct, and it can be alternatively written using a raw string:
y = r'\x' + x[2: len(x)]
As cdhowie said in his answer:
When you type the name of a variable in the Python console, Python prints the string's representation as Python code. This code does in fact work, the representation of the result is just throwing you off.
This is an alternative to hand-writing the escaped backslash, and one I think is slightly better coding practice, as it is much more readable.
The latter will work. In the console, Python uses repr() to print objects, which in this case will show the double slash. Do print y in the console and you'll see that it outputs properly.
You can also clean up your first example a bit:
y = "\\x" + x[2:]
Or the second:
y = "\\" + x[1:]
If you are just trying to get the string \0x4b0 as the representation at the console, you need to actually call print on it at the console:
>>> s='\\0{}'.format(hex(1200)[1:])
>>> s
'\\0x4b0'
>>> print s
\0x4b0
>>> s2='\\0'+hex(1200)[1:]
>>> s2
'\\0x4b0'
>>> print s2
\0x4b0
If you just FORM the string in the console (i.e., it does not go through print), Python is showing you its representation:
>>> '\\0{}'.format(hex(1200)[1:])
'\\0x4b0'
>>> repr(s2)
"'\\\\0x4b0'"
>>> s2
'\\0x4b0'
Edit (based on your comment):
I assume this is an old HP plotter?
Don't be confused by what the shell is showing as your string.
You state that you want to produce a string of \x<someNumGoesHere> (or is it \x0<someNumGoesHere> with a leading 0?)
Here is how:
>>> def angle_string(angle):
...     return '\\x0{}'.format(hex(angle)[2:])
...
>>> angle_string(1200)
'\\x04b0'
>>> print _
\x04b0
>>> angle_string(33)
'\\x021'
>>> print _
\x021
When you send the string to your device (through the OS file/print like service to the RS232 port), it will be as you format it.
Edit 2
Escape-sequence processing is what turns the escapes in a string like this:
>>> s1
'\n\n\t\tline'
into real newline and tab characters when the string is printed:
>>> print s1


		line
Logically, these literal characters are single characters:
>>> s1[0]
'\n'
>>> len('\\')
1
My guess is that the way you have opened the serial port s treats strings in raw mode, so the string \\x0123 is being sent as-is (raw mode) rather than being interpreted as \x0123.
You might try as a work around this:
>>> cmd=chr(92)+'0'+hex(1200)[1:]
>>> s.write(cmd)
I think you also need to open the serial port in FileLike mode so that the string literals are sent as proper single characters.
