I'm from Iran and I use Persian characters. When should I use 'u', decode(), encode() and unicode()?
1) You use decode() and unicode() to decode an input string from its encoding (for example iso-8859-2 or utf-8) and get a unicode object.
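For example (a minimal Python 2 sketch; the byte string below is assumed to be the UTF-8 encoding of 'łódź'):
>>> raw = '\xc5\x82\xc3\xb3d\xc5\xba'
>>> raw.decode('utf-8')
u'\u0142\xf3d\u017a'
>>> unicode(raw, 'utf-8')
u'\u0142\xf3d\u017a'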
2) You use the u prefix to indicate that a string literal is to be treated as unicode (the resulting object is of unicode type):
>>> foo = u'łódź'
>>> foo.__class__
<type 'unicode'>
3) You use encode() to encode a unicode object using, for example, utf-8 (or any other encoding of your choice) and get a str object:
>>> foo = u'łódź'
>>> foo.__class__
<type 'unicode'>
>>> bar = foo.encode('utf-8')
>>> bar.__class__
<type 'str'>
Read through an article about unicode in Python to get a better idea of the str/unicode/encoding mess.
I have some text encoded to bytes using utf-8 encoding. When processing this text I incautiously used str() to make it a Unicode string because I assumed this would automatically decode the bytes object with the right encoding. This, however, is not the case. For example:
a = "عجائب"
a_bytes = a.encode(encoding="utf-8")
b = str(a_bytes)
yields
b = "b'\\xd8\\xb9\\xd8\\xac\\xd8\\xa7\\xd8\\xa6\\xd8\\xa8'"
which is not what I expected.
According to the docs
If neither encoding nor errors is given, str(object) returns type(object).__str__(object), [...].
So my question is: What is the implemented string representation of a bytes object in Python and can I recreate my original Unicode string from it in general?
str() gives you a string containing the string representation of the bytes object:
>>> a = "عجائب"
>>> a_bytes = a.encode(encoding="utf-8")
>>> a_bytes
b'\xd8\xb9\xd8\xac\xd8\xa7\xd8\xa6\xd8\xa8'
>>> str(a_bytes)
"b'\\xd8\\xb9\\xd8\\xac\\xd8\\xa7\\xd8\\xa6\\xd8\\xa8'"
Meaning, you have a valid literal representing a bytes object. You can parse that literal back into an actual bytes object using ast.literal_eval:
>>> import ast
>>> ast.literal_eval(str(a_bytes))
b'\xd8\xb9\xd8\xac\xd8\xa7\xd8\xa6\xd8\xa8'
This is again the same as a_bytes. You can properly decode those to a str again, either using .decode, or by using the encoding parameter of str:
>>> str(a_bytes, 'utf-8')
'عجائب'
>>> a_bytes.decode('utf-8')
'عجائب'
When you call str() with a bytes object as its argument, you get the string representation of that object, not the decoded text. If you want to decode UTF-8 bytes back to the original string, you need to use the decode() method and specify the original encoding:
a = "عجائب"
a_bytes = a.encode(encoding="utf-8")
b = str(a_bytes)
print(b)
print(a_bytes)
print(a_bytes.decode("utf-8")) #Prints decoded string from bytes
Output:
b'\xd8\xb9\xd8\xac\xd8\xa7\xd8\xa6\xd8\xa8'
b'\xd8\xb9\xd8\xac\xd8\xa7\xd8\xa6\xd8\xa8'
عجائب
I have data in the form 2\u2070iPSC, which is actually 2⁰iPSC. How do I convert 2\u2070iPSC to 2⁰iPSC using Python?
As a unicode string the data already is 2⁰iPSC. I think that you are concerned about its display.
The code point \u2070 is ⁰:
>>> import unicodedata
>>> unicodedata.name(u'\u2070')
'SUPERSCRIPT ZERO'
If you are using Python 2 you need to prefix the string with u to indicate that the unicode escape sequences are to be interpreted:
>>> type('2\u2070iPSC')
<type 'str'>
>>> type(u'2\u2070iPSC') # note `u` prefix
<type 'unicode'>
In Python 3 strings are unicode by default, so the u prefix is not required:
>>> type('2\u2070iPSC')
<class 'str'>
To display the string you can simply print it:
>>> print(u'2\u2070iPSC')
2⁰iPSC
This works if the default encoding of your interpreter can represent u'\u2070', e.g. UTF-8.
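If it cannot, one workaround (just a sketch; the ascii target here is only for illustration) is to encode explicitly and let unrepresentable characters be replaced:
>>> u'2\u2070iPSC'.encode('ascii', 'replace')
'2?iPSC'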
You need to add the u prefix to make it a unicode string.
>>> unicode_string = u'2\u2070iPSC'
>>> print(unicode_string)
2⁰iPSC
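If instead the data contains the six literal characters \u2070 (for example, text read from a plain file) rather than the real code point, a hedged option is the unicode_escape codec (Python 2 shown; in Python 3 you would decode a bytes object the same way):
>>> '2\\u2070iPSC'.decode('unicode_escape')
u'2\u2070iPSC'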
I have a set of UTF-8 octets and I need to convert them back to unicode code points. How can I do this in Python?
E.g. the UTF-8 octets ['0xc5', '0x81'] should be converted to the code point 0x141.
Python 3.x:
In Python 3.x, str is the class for Unicode text, and bytes is for containing octets.
If by "octets" you really mean strings in the form '0xc5' (rather than '\xc5') you can convert to bytes like this:
>>> bytes(int(x,0) for x in ['0xc5', '0x81'])
b'\xc5\x81'
You can then convert to str (ie: Unicode) using the str constructor...
>>> str(b'\xc5\x81', 'utf-8')
'Ł'
...or by calling .decode('utf-8') on the bytes object:
>>> b'\xc5\x81'.decode('utf-8')
'Ł'
>>> hex(ord('Ł'))
'0x141'
Pre-3.x:
Prior to 3.x, the str type was a byte array, and unicode was for Unicode text.
Again, if by "octets" you really mean strings in the form '0xc5' (rather than '\xc5') you can convert them like this:
>>> ''.join(chr(int(x,0)) for x in ['0xc5', '0x81'])
'\xc5\x81'
You can then convert to unicode using the constructor...
>>> unicode('\xc5\x81', 'utf-8')
u'\u0141'
...or by calling .decode('utf-8') on the str:
>>> '\xc5\x81'.decode('utf-8')
u'\u0141'
In lovely 3.x, where all strs are Unicode, and bytes are what strs used to be:
>>> s = str(bytes([0xc5, 0x81]), 'utf-8')
>>> s
'Ł'
>>> ord(s)
321
>>> hex(ord(s))
'0x141'
Which is what you asked for.
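To handle more than one character, a small sketch (Python 3) that maps a whole list of '0x..' octet strings to their code points:
>>> octets = ['0xc5', '0x81']
>>> text = bytes(int(o, 16) for o in octets).decode('utf-8')
>>> [hex(ord(ch)) for ch in text]
['0x141']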
>>> l = ['0xc5', '0x81']
>>> s = ''.join([chr(int(c, 16)) for c in l]).decode('utf8')
>>> s
u'\u0141'
>>> "".join((chr(int(x,16)) for x in ['0xc5','0x81'])).decode("utf8")
u'\u0141'
I know this may sound like a duplicate question, but that's because I don't know how to describe it properly.
For some reason I got a bunch of unicode string like this:
a = u'\xcb\xea'
As you can see, it's actually the bytes representation of a Chinese character, encoded in gbk:
>>> print(b'\xcb\xea'.decode('gbk'))
岁
u'岁' is what I need, but I don't know how to convert u'\xcb\xea' to b'\xcb\xea'.
Any suggestions?
It's not really a bytes representation; it's still unicode code points. They are the wrong code points, because the data was decoded from bytes as if it had been encoded as Latin-1.
Encode to Latin-1 (whose code points map one-to-one to bytes), then decode as GBK:
a.encode('latin1').decode('gbk')
Demo:
>>> a = u'\xcb\xea'
>>> a.encode('latin1').decode('gbk')
u'\u5c81'
>>> print a.encode('latin1').decode('gbk')
岁
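The same round trip works unchanged in Python 3, where the mis-decoded data is already a str; a quick sketch:
>>> a = '\xcb\xea'
>>> a.encode('latin1').decode('gbk')
'岁'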
The simplest way for Python 2 is to use repr():
>>> key_unicode = u'uuuu\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1'
>>> key_ascii = 'uuuu\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1'
>>> print(key_ascii)
uuuu��_��9ԣ�
>>> print(key_unicode)
uuuuö_¡ë9Ô£Ñ
>>>
>>> # here is the safe method for both string types:
>>> print(repr(key_ascii).lstrip('u')[1:-1])
uuuu\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1
>>> print(repr(key_unicode).lstrip('u')[1:-1])
uuuu\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1
>>> # ____________WARNING!______________
>>> # if you just use `str.strip('u\'\"')`, you will lose
>>> # the "uuuu" (and quotes, if such are present) on sides of the string:
>>> print(repr(key_unicode).strip('u\'\"'))
\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1
For Python 3, use str.encode() to get the bytes type.
>>> key = 'l\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1q\xf5L\xa9\xdd0\x90\x8b\xf5ht\x86za\x0e\x1b\xed\xb6(\xaa+'
>>> key
'lö\x9f_¡\x05ë9Ô£ÑqõL©Ý0\x90\x8bõht\x86za\x0e\x1bí¶(ª+'
>>> print(key)
lö_¡ë9Ô£ÑqõL©Ý0õhtzaí¶(ª+
>>> print(repr(key.encode()).lstrip('b')[1:-1])
l\xc3\xb6\xc2\x9f_\xc2\xa1\x05\xc3\xab9\xc3\x94\xc2\xa3\xc3\x91
Hi, I have a problem in Python. I will try to explain it with an example.
I have this string:
>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'
>>> print string
ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ
and I want, for example, to replace the characters different from Ñ, Ã, ï with "".
I have tried:
>>> rePat = re.compile('[^ÑÃï]',re.UNICODE)
>>> print rePat.sub("",string)
�Ñ�����������������������������ï�������������������Ã
I obtained these � characters. I think this happens because these characters are represented in Python by two bytes each: for example, \xc3\x91 = Ñ.
Because of this, when I apply the regular expression, the \xc3 bytes are not substituted. How can I do this type of sub?
Thanks
Franco
You need to make sure that your strings are unicode strings, not plain strings (plain strings are like byte arrays).
Example:
>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'
>>> type(string)
<type 'str'>
# do this instead:
# (note the u in front of the ', this marks the character sequence as a unicode literal)
>>> string = u'\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\xc0\xc1\xc2\xc3'
# or:
>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'.decode('utf-8')
# ... but be aware that the latter will only work if the terminal (or source file) has utf-8 encoding
# ... it is a best practice to use the \xNN form in unicode literals, as in the first example
>>> type(string)
<type 'unicode'>
>>> print string
ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ
>>> rePat = re.compile(u'[^\xd1\xc3\xef]', re.UNICODE)
>>> print rePat.sub("", string)
ÑïÃ
When reading from a file, string = open('filename.txt').read() reads a byte sequence.
To get the unicode content, do: string = unicode(open('filename.txt').read(), 'encoding'). Or: string = open('filename.txt').read().decode('encoding').
The codecs module can decode unicode streams (such as files) on-the-fly.
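For example, a minimal sketch using codecs.open (assuming filename.txt is UTF-8 encoded):
import codecs
with codecs.open('filename.txt', encoding='utf-8') as f:
    string = f.read()  # string is already a unicode object, no separate decode() needed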
Do a Google search for "python unicode". Python unicode handling can be a bit hard to grasp at first; it pays to read up on it.
I live by this rule: "Software should only work with Unicode strings internally, converting to a particular encoding on output." (from http://www.amk.ca/python/howto/unicode)
I also recommend: http://www.joelonsoftware.com/articles/Unicode.html