I have a string that contains unicode escape sequences, e.g. \u2026 etc. Somehow it does not reach me as unicode, but as a str. How do I convert it back to unicode?
>>> a="Hello\u2026"
>>> b=u"Hello\u2026"
>>> print a
Hello\u2026
>>> print b
Hello…
>>> print unicode(a)
Hello\u2026
>>>
So clearly unicode(a) is not the answer. Then what is?
Unicode escapes only work in unicode strings, so this
a="\u2026"
is actually a string of 6 characters: '\', 'u', '2', '0', '2', '6'.
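A quick check (a small sketch) makes the six characters visible:
a = "\u2026"
print len(a)
print list(a)
## 6
## ['\\', 'u', '2', '0', '2', '6']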
To make unicode out of this, use decode('unicode-escape'):
a="\u2026"
print repr(a)
print repr(a.decode('unicode-escape'))
## '\\u2026'
## u'\u2026'
Decode it with the unicode-escape codec:
>>> a="Hello\u2026"
>>> a.decode('unicode-escape')
u'Hello\u2026'
>>> print _
Hello…
This is because, in a non-unicode string, \u2026 is not recognised as an escape and is instead treated as a literal series of characters (in other words, 'Hello\\u2026'). You need to decode the escapes, and the unicode-escape codec can do that for you.
Note that you can get unicode to recognise it in the same way by specifying the codec argument:
>>> unicode(a, 'unicode-escape')
u'Hello\u2026'
But the a.decode() way is nicer.
>>> a="Hello\u2026"
>>> print a.decode('unicode-escape')
Hello…
I have a string like:
s_str: str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
I need to be able to get the corresponding byte literal of that unicode (for pickle.loads):
s_bytes: bytes = b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
Here the solution s_new: bytes = bytes(s_str, encoding="raw_unicode_escape") was posted, but it does not work for me. I get an incorrect result, b'\\x00\\x01\\x00\\xc0\\x01\\x00\\x00\\x00\\x04', which has two backslashes (representing a single literal backslash) for each one it should have.
Also here and here a similar solution is proposed, but it does not work for me either; I end up with the doubled backslashes again. Why does this occur? How do I get the bytes result I want?
You do not have a string whose escape codes have already been interpreted, as shown below (length 9); if you did, you would not be getting the doubled-backslash result:
s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
You have literal escape codes (length 36), and note the r for raw string that prevents interpreting the escape codes as bytes:
s_str: str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
Note the difference. \\ is an escape code indicating a literal, single backslash:
>>> '\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
'\x00\x01\x00À\x01\x00\x00\x00\x04'
>>> r'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
'\\x00\\x01\\x00\\xc0\\x01\\x00\\x00\\x00\\x04'
>>> len('\x00\x01\x00\xc0\x01\x00\x00\x00\x04')
9
>>> len(r'\x00\x01\x00\xc0\x01\x00\x00\x00\x04')
36
The following gets the desired byte string. First it converts each code point to a byte using the latin1 codec, which maps the first 256 code points (U+0000 to U+00FF) 1:1 onto the byte values 0x00 to 0xFF. Then it decodes the literal escape codes, which yields a Unicode string again, so it encodes with latin1 once more to convert 1:1 back to bytes:
s_bytes: bytes = s_str.encode('latin1').decode('unicode_escape').encode('latin1')
print(s_bytes)
Output:
b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
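If it helps, here is a sketch of the same pipeline with each stage spelled out; the raw/step1/step2/step3 names are just for illustration:
raw: str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"   # 36 literal characters

step1 = raw.encode('latin1')             # bytes of the literal text, still with backslashes
step2 = step1.decode('unicode_escape')   # escapes interpreted: 9 code points, all <= U+00FF
step3 = step2.encode('latin1')           # each code point mapped 1:1 back to one byte

print(len(raw), len(step1), len(step2), len(step3))   # 36 36 9 9
print(step3)   # b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'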
If you did have s_str as posted, a simple .encode('latin1') would convert it:
>>> s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
>>> s_str.encode('latin1')
b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
I was about to post the question when I came across a working solution almost by chance. The combination that works for me is:
s_new: bytes = bytes(s_str.encode('utf-8').decode('unicode-escape'), encoding="oem")
As I said, I have no idea why this works, so feel free to explain it if you know why.
You might simply use .encode("utf-8") to get the desired result, i.e.:
s_1 = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
s_2 = s_1.encode("utf-8")
print(s_2)
Output:
b'\x00\x01\x00\xc3\x80\x01\x00\x00\x00\x04'
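In case it helps to read that output: the only non-ASCII code point in the string, U+00C0, stays a single byte under latin-1 but becomes two bytes under UTF-8. A small sketch:
c = "\xc0"                    # the character À, code point U+00C0
print(c.encode("latin-1"))    # b'\xc0'      -- one byte, matching the pickle data above
print(c.encode("utf-8"))      # b'\xc3\x80'  -- two bytes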
I have a question about Python 2 encoding. I am trying to decode an ASCII string which contains the Unicode escape for a letter, and then encode the result to UTF-8, but with no success. Here is an illustration:
In[27]: d = u'\u010d'
In[28]: print d.encode('utf-8')
č
In[29]: d1 = '\u010d'
In[30]: d1.decode('ascii').encode('utf-8')
Out[30]: '\\u010d'
I would like to convert '\u010d' to 'č'. Are there any built-in solutions to avoid custom string replacement?
When you do
d1 = '\u010d'
you actually get this string:
In [3]: d1
Out[3]: '\\u010d'
This is because "normal" (non-Unicode) strings don't recognize the \unnnn escape sequence and therefore convert it to a literal backslash, followed by unnnn.
In order to decode that, you need to use the unicode_escape codec:
In [4]: print d1.decode("unicode_escape").encode('utf-8')
č
But of course you shouldn't use Unicode escape sequences in non-Unicode strings in the first place.
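A quick length check (a small sketch, Python 2 as in the question) shows the difference between the two kinds of literal:
print len('\u010d')    # 6 -- byte string: backslash, u, 0, 1, 0, d
print len(u'\u010d')   # 1 -- unicode literal: the compiler interprets the escape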
I know that I can print the following:
print u'\u2550'
How do I do that with the .format string method? For example:
for i in range(0x2550, 0x257F):
    s = u'\u{04X}'.format(i) # of course this is a syntax error
You are looking for the unichr() function:
s = u'{}'.format(unichr(i))
or just plain
s = unichr(i)
unichr(integer) produces a unicode character for the given codepoint.
The \uxxxx syntax only works for string literals; these are processed by the Python compiler before running the code.
Demo:
>>> for i in range(0x2550, 0x257F):
... print unichr(i)
...
═
║
╒
╓
╔
╕
╖
# etc.
If you ever do have literal \uxxxx sequences in your strings, you can still have Python turn those into Unicode characters with the unicode_escape codec:
>>> print u'\\u2550'
\u2550
>>> print u'\\u2550'.decode('unicode_escape')
═
>>> print '\\u2550'.decode('unicode_escape')
═
>>> '\\u2550'.decode('unicode_escape')
u'\u2550'
In Python 2, you can use this codec on both byte string and unicode string values; the output is always a unicode string.
In Python 2.x, how can I convert a unicode string (e.g. u'\xe0') to a byte string (here I need it to be '\u00E0')?
To make it clearer: I would like to have '\u00E0', a string with a length of 6. That is, \u is treated as 2 chars instead of one escape sequence.
\u doesn't exist as an escape sequence in Python 2 byte (non-unicode) string literals.
You might mean a JSON-encoded string:
>>> s = u'\xe0'
>>> import json
>>> json.dumps(s)
'"\\u00e0"'
or a UTF-16 (big-endian)-encoded string:
>>> s.encode("utf-16-be")
'\x00\xe0'
but your original request is not fulfillable.
As an aside, note that u'\u00e0' is identical to u'\xe0', but '\u00e0' doesn't exist:
>>> u'\u00e0'
u'\xe0'
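If all you need is the six-character text \u00E0 as data (not as an escape the compiler interprets), one possible sketch is to format it yourself; the choice of %04X for uppercase hex is just an assumption about the exact output you want:
>>> s = u'\xe0'
>>> '\\u%04X' % ord(s)
'\\u00E0'
>>> len(_)
6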
Hi, I have a problem in Python. I'll try to explain it with an example.
I have this string:
>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'
>>> print string
ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ
and I want, for example, to replace every character other than Ñ, Ã, ï with "".
I have tried:
>>> rePat = re.compile('[^ÑÃï]',re.UNICODE)
>>> print rePat.sub("",string)
�Ñ�����������������������������ï�������������������Ã
I get these � characters instead.
I think this happens because these characters are represented in Python by two bytes each: for example \xc3\x91 = Ñ.
Because of that, when I build the regular expression, the \xc3 bytes are not substituted. How can I do this kind of substitution?
Thanks
Franco
You need to make sure that your strings are unicode strings, not plain strings (plain strings are like byte arrays).
Example:
>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'
>>> type(string)
<type 'str'>
# do this instead:
# (note the u in front of the ', this marks the character sequence as a unicode literal)
>>> string = u'\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\xc0\xc1\xc2\xc3'
# or:
>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'.decode('utf-8')
# ... but be aware that the latter will only work if the terminal (or source file) has utf-8 encoding
# ... it is a best practice to use the \xNN form in unicode literals, as in the first example
>>> type(string)
<type 'unicode'>
>>> print string
ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ
>>> rePat = re.compile(u'[^\xd1\xc3\xef]', re.UNICODE)  # Ñ, Ã, ï as single code points
>>> print rePat.sub("", string)
ÑïÃ
When reading from a file, string = open('filename.txt').read() reads a byte sequence.
To get the unicode content, do: string = unicode(open('filename.txt').read(), 'encoding'). Or: string = open('filename.txt').read().decode('encoding').
The codecs module can decode unicode streams (such as files) on-the-fly.
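For example, a minimal sketch with codecs.open (the filename and the utf-8 encoding are placeholders for your own):
import codecs

f = codecs.open('filename.txt', 'r', encoding='utf-8')   # decodes bytes to unicode as you read
string = f.read()    # string is a unicode object, not a byte str
f.close()
print type(string)   # <type 'unicode'>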
Do a Google search for "python unicode". Python unicode handling can be a bit hard to grasp at first; it pays to read up on it.
I live by this rule: "Software should only work with Unicode strings internally, converting to a particular encoding on output." (from http://www.amk.ca/python/howto/unicode)
I also recommend: http://www.joelonsoftware.com/articles/Unicode.html