Convert unicode base notation to string in Python

I have data in the form 2\u2070iPSC, which is actually 2⁰iPSC. How do I convert 2\u2070iPSC to 2⁰iPSC using Python?

As a unicode string the data already is 2⁰iPSC. I think that you are concerned about its display.
The code point \u2070 is ⁰:
>>> import unicodedata
>>> unicodedata.name(u'\u2070')
'SUPERSCRIPT ZERO'
If you are using Python 2 you need to prefix the string with u to indicate that the unicode escape sequences are to be interpreted:
>>> type('2\u2070iPSC')
<type 'str'>
>>> type(u'2\u2070iPSC') # note `u` prefix
<type 'unicode'>
In Python 3 strings are unicode by default, so the u prefix is not required:
>>> type('2\u2070iPSC')
<class 'str'>
To display the string you can simply print it:
>>> print(u'2\u2070iPSC')
2⁰iPSC
This works if the output encoding of your interpreter or terminal (e.g. UTF-8) can represent u'\u2070'.
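If the data instead holds a literal backslash followed by u2070 (six separate characters, as can happen when the text is read from a file or another program's output), the escape has to be decoded explicitly. A minimal Python 3 sketch, assuming the input is otherwise plain ASCII; the variable names are illustrative only:

```python
# Raw string: the backslash and "u2070" are six literal characters here.
raw = r'2\u2070iPSC'

# Encode to ASCII bytes, then let the unicode_escape codec interpret
# the \uXXXX sequence. Assumes the input contains only ASCII characters.
converted = raw.encode('ascii').decode('unicode_escape')

print(len(raw), len(converted))
print(converted)
```

The length drops because the six-character escape collapses into the single character SUPERSCRIPT ZERO.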

In Python 2 you need to add the u prefix to make the literal a unicode string.
unicode_string = u'2\u2070iPSC'
print(unicode_string)
>> 2⁰iPSC

Related

Encode string as octal utf-8 Python 3

Is there a good way to encode strings to utf-8, but in octal format instead of the default hexadecimal?
For example:
>>> "õ".encode("utf-8")
b'\xc3\xb5'
Here the output is hex, not octal. The output in octal would be: b'\303\265'
Python 3 automatically handles the decoding just fine:
>>> b"\xc3\xb5".decode("utf-8")
'õ'
>>> b'\303\265'.decode("utf-8")
'õ'
Is there a codec or option I'm missing? I'd like to avoid a lot of manual string manipulation.
Update: I had misunderstood. There is no difference at all between b"\xc3\xb5" and b'\303\265'; they are just two different ways to write the same underlying bytes. In fact:
>>> b"\xc3\xb5" == b'\303\265'
True
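One way to see why the two literals compare equal is to look at the integer value of each byte. A small sketch for Python 3 (the variable names are illustrative):

```python
hex_form = b"\xc3\xb5"
oct_form = b"\303\265"

# Iterating over a bytes object yields plain integers in Python 3,
# so the escape style used in the literal leaves no trace.
print(list(hex_form))
print(list(oct_form))
print(hex_form == oct_form)
```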
Here's a class that overrides the representation of the string it wraps:
>>> class OctUTF8:
...     def __init__(self, s):
...         self.s = s.encode()
...     def __repr__(self):
...         return "b'" + ''.join(f'\\{n:03o}' for n in self.s) + "'"
...
>>> s='õ'
>>> OctUTF8(s)
b'\303\265'
This representation can be evaluated as a byte string and decoded back to the original:
>>> eval(repr(OctUTF8(s))).decode()
'õ'
First, you can use ord() to get a character's Unicode code point, then use oct() to format it in octal (note that this is the code point, not the UTF-8 bytes):
print(oct(ord("õ")))
Output:
0o365
You can convert each byte in a bytes object to its octal representation:
[oct(b) for b in "õ".encode("utf-8")]
Gives
['0o303', '0o265']
You can then manipulate the results to get your desired output.
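As one possible way to do that manipulation, the bytes can be formatted directly as three-digit octal escapes. This is only a sketch; the helper name to_octal_escapes is made up for illustration:

```python
def to_octal_escapes(text, encoding="utf-8"):
    """Return the encoded bytes of `text` written as \\ooo octal escapes."""
    # Each byte is an int in Python 3; %03o pads it to three octal digits.
    return "".join("\\%03o" % byte for byte in text.encode(encoding))

escaped = to_octal_escapes("\u00f5")  # the character "õ"
print(escaped)
```

Unlike the class-based answer, this returns a plain str, which may be easier to embed in other output.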

Showing text representation of Unicode symbol in Python 3

I can do this in my ipython notebook:
print(u"\u2605")
★
But how do I go backwards, that is, from the symbol to the unicode escape string? Encoding it as UTF-8 or UTF-16 gives binary representations. For example:
print('★'.encode('utf-16'))
b'\xff\xfe\x05&'
You can use unicode-escape encoding:
>>> '★'.encode('unicode-escape')
b'\\u2605'
>>> print('★'.encode('unicode-escape').decode())
\u2605
or ord if you just want to know the codepoint:
>>> ord('★')
9733
>>> hex(ord('★')) # as hexadecimal
'0x2605'
>>> print(r'\u%04x' % ord('★'))  # zero-padded so code points below U+1000 still get four digits
\u2605
UPDATE
You can also use ascii:
>>> print(ascii('★')) # NOTE: the result includes the surrounding quotes
'\u2605'
>>> print(ascii('★').strip("'"))
\u2605
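The approaches above can be wrapped in a small helper that also copes with characters outside the Basic Multilingual Plane, which need the eight-digit \U form. A hedged sketch for Python 3; escape_char is a hypothetical name, not a standard function:

```python
def escape_char(ch):
    """Return the Python escape sequence for a single character."""
    code = ord(ch)
    # \uXXXX covers code points up to U+FFFF; larger ones
    # need the eight-digit \UXXXXXXXX form.
    return "\\u%04x" % code if code <= 0xFFFF else "\\U%08x" % code

star = escape_char("\u2605")        # BLACK STAR
emoji = escape_char("\U0001F600")   # a code point above U+FFFF
print(star, emoji)

# The unicode_escape codec turns the escape text back into the character.
restored = star.encode("ascii").decode("unicode_escape")
```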

How to convert string to bytes in Python 2

I know this may sound like a duplicate question, but that's because I don't know how to describe it properly.
For some reason I got a bunch of unicode strings like this:
a = u'\xcb\xea'
As you can see, it's actually the bytes representation of a Chinese character, encoded in GBK:
>>> print(b'\xcb\xea'.decode('gbk'))
岁
u'岁' is what I need, but I don't know how to convert u'\xcb\xea' to b'\xcb\xea'.
Any suggestions?
It's not really a bytes representation; it's still unicode code points. They are the wrong code points, because the text was decoded from bytes as if those bytes were Latin-1.
Encode to Latin-1 (whose code points map one-to-one to bytes), then decode as GBK:
a.encode('latin1').decode('gbk')
Demo:
>>> a = u'\xcb\xea'
>>> a.encode('latin1').decode('gbk')
u'\u5c81'
>>> print a.encode('latin1').decode('gbk')
岁
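The same repair can be written for Python 3, where every string literal is already unicode. A minimal sketch using the value from the question (variable names are illustrative):

```python
# Mis-decoded text: GBK bytes that were read as if they were Latin-1.
garbled = "\xcb\xea"

# Latin-1 maps code points 0-255 to the same byte values, so encoding
# recovers the original bytes; decoding them as GBK gives the real text.
raw_bytes = garbled.encode("latin-1")
fixed = raw_bytes.decode("gbk")

print(raw_bytes)
print(fixed)
```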
The simplest way for Python 2 is to use repr():
>>> key_unicode = u'uuuu\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1'
>>> key_ascii = 'uuuu\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1'
>>> print(key_ascii)
uuuu��_��9ԣ�
>>> print(key_unicode)
uuuuö_¡ë9Ô£Ñ
>>>
>>> # here is the safe method for both string types:
>>> print(repr(key_ascii).lstrip('u')[1:-1])
uuuu\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1
>>> print(repr(key_unicode).lstrip('u')[1:-1])
uuuu\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1
>>> # ____________WARNING!______________
>>> # if you use just `str.strip('u\'\"')`, you will lose
>>> # the "uuuu" (and quotes, if present) at the ends of the string:
>>> print(repr(key_unicode).strip('u\'\"'))
\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1
For Python 3, use str.encode() to get a bytes object (the default encoding is UTF-8):
>>> key = 'l\xf6\x9f_\xa1\x05\xeb9\xd4\xa3\xd1q\xf5L\xa9\xdd0\x90\x8b\xf5ht\x86za\x0e\x1b\xed\xb6(\xaa+'
>>> key
'lö\x9f_¡\x05ë9Ô£ÑqõL©Ý0\x90\x8bõht\x86za\x0e\x1bí¶(ª+'
>>> print(key)
lö_¡ë9Ô£ÑqõL©Ý0õhtzaí¶(ª+
>>> print(repr(key.encode()).lstrip('b')[1:-1])
l\xc3\xb6\xc2\x9f_\xc2\xa1\x05\xc3\xab9\xc3\x94\xc2\xa3\xc3\x91
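Note that str.encode() defaults to UTF-8 in Python 3, so each character above U+007F becomes two or more bytes. If the goal is one byte per code point, as with the u'\xcb\xea' string in the question, Latin-1 is the encoding that does that. A short sketch with a shortened key (an assumption made to keep the example small):

```python
key = "l\xf6\x9f_\xa1"

utf8_bytes = key.encode()              # UTF-8 by default in Python 3
latin1_bytes = key.encode("latin-1")   # one byte per code point below 256

print(utf8_bytes)
print(latin1_bytes)
```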

Django Unicode - when should I use unicode functions?

I'm from Iran and I use Persian characters. When should I use 'u', 'decode()', 'encode()' and unicode()?
1) You use decode() and unicode() to decode the input string from its representation (for example iso-8859-2 or utf-8) and get the unicode object.
2) You use a u to indicate that the string is to be treated as unicode (in fact the result object is unicode type):
>>> foo = u'łódź'
>>> foo.__class__
<type 'unicode'>
3) Use the encode() to encode the input string using for example utf-8 (or any other encoding of your choice) and get the str object:
>>> foo = u'łódź'
>>> foo.__class__
<type 'unicode'>
>>> bar = foo.encode('utf-8')
>>> bar.__class__
<type 'str'>
Read through this article about unicode in Python to get a better idea of the string/unicode/string-encoding mess.
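The answer above describes Python 2. In Python 3 the same two directions exist, but the types are named str (text) and bytes (encoded data), and the u prefix is optional. A minimal sketch, with the Persian sample word written as escapes so it does not depend on the source file's encoding:

```python
text = "\u0633\u0644\u0627\u0645"  # Persian word "salam", written with escapes

encoded = text.encode("utf-8")     # str -> bytes
decoded = encoded.decode("utf-8")  # bytes -> str

# Four characters become eight bytes, since each letter takes two bytes in UTF-8.
print(type(encoded).__name__, len(encoded))
print(type(decoded).__name__, len(decoded))
```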

python - problems with regular expression and unicode

Hi, I have a problem in Python. I will try to explain it with an example.
I have this string:
>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'
>>> print string
ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ
and I want, for example, to replace all characters other than Ñ, Ã and ï with "".
I have tried:
>>> rePat = re.compile('[^ÑÃï]',re.UNICODE)
>>> print rePat.sub("",string)
�Ñ�����������������������������ï�������������������Ã
I got these � characters.
I think this happens because in Python these characters are represented by two positions in the byte sequence: for example \xc3\x91 = Ñ.
Because of this, when I apply the regular expression, the \xc3 bytes are not substituted. How can I do this kind of substitution?
Thanks
Franco
You need to make sure that your strings are unicode strings, not plain strings (plain strings are like byte arrays).
Example:
>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'
>>> type(string)
<type 'str'>
# do this instead:
# (note the u in front of the ', this marks the character sequence as a unicode literal)
>>> string = u'\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\xc0\xc1\xc2\xc3'
# or:
>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'.decode('utf-8')
# ... but be aware that the latter will only work if the terminal (or source file) has utf-8 encoding
# ... it is a best practice to use the \xNN form in unicode literals, as in the first example
>>> type(string)
<type 'unicode'>
>>> print string
ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ
>>> import re
>>> rePat = re.compile(u'[^\xd1\xc3\xef]', re.UNICODE)  # Ñ, Ã, ï as code points, not UTF-8 bytes
>>> print rePat.sub("", string)
ÑïÃ
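For comparison, here is a sketch of the same substitution in Python 3, where string literals are unicode by default and the problem in the question does not arise. The test string is rebuilt from its code points (U+00D0 to U+00FF followed by U+00C0 to U+00C3) rather than typed in:

```python
import re

# Code points U+00D0..U+00FF followed by U+00C0..U+00C3, as in the question.
text = "".join(chr(c) for c in list(range(0xD0, 0x100)) + list(range(0xC0, 0xC4)))

# Keep only Ñ (U+00D1), Ã (U+00C3) and ï (U+00EF); remove everything else.
pattern = re.compile("[^\u00d1\u00c3\u00ef]")
kept = pattern.sub("", text)

print(kept)
```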
When reading from a file, string = open('filename.txt').read() reads a byte sequence.
To get the unicode content, do: string = unicode(open('filename.txt').read(), 'encoding'). Or: string = open('filename.txt').read().decode('encoding').
The codecs module can decode unicode streams (such as files) on-the-fly.
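As an alternative to decoding after read(), io.open accepts an encoding argument and decodes while reading; it is available in Python 2.6+ and is the built-in open in Python 3. The sketch below swaps it in for the codecs module mentioned above, and the temporary file name is made up for the demonstration:

```python
import io
import os
import tempfile

text = "\u0142\u00f3d\u017a"  # the sample word used earlier, written with escapes

# Write a small UTF-8 file to read back; the temporary path is illustrative.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(text)

# io.open decodes while reading, so read() returns text rather than raw bytes.
with io.open(path, encoding="utf-8") as handle:
    content = handle.read()

print(content == text)
```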
Do a Google search for "python unicode". Python's unicode handling can be a bit hard to grasp at first, so it pays to read up on it.
I live by this rule: "Software should only work with Unicode strings internally, converting to a particular encoding on output." (from http://www.amk.ca/python/howto/unicode)
I also recommend: http://www.joelonsoftware.com/articles/Unicode.html
