How to get the Unicode character from a code point variable? [duplicate] - python

This question already has answers here:
Process escape sequences in a string in Python
(8 answers)
Closed 4 years ago.
I have a variable which stores the string "u05e2" (The value is constantly changing because I set it within a loop). I want to print the Hebrew letter with that Unicode value. I tried the following but it didn't work:
>>> a = 'u05e2'
>>> print(u'\{}'.format(a))
I got \u05e2 instead of ע (in this case).
I also tried to do:
>>> a = 'u05e2'
>>> b = '\\' + a
>>> print(u'{}'.format(b))
Neither one worked. How can I fix this?
Thanks in advance!

This seems like an X-Y Problem. If you want the Unicode character for a code point, use an integer variable and the function chr (or unichr on Python 2) instead of trying to format an escape code:
>>> for a in range(0x5e0, 0x5eb):
...     print(hex(a), chr(a))
...
0x5e0 נ
0x5e1 ס
0x5e2 ע
0x5e3 ף
0x5e4 פ
0x5e5 ץ
0x5e6 צ
0x5e7 ק
0x5e8 ר
0x5e9 ש
0x5ea ת
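If the loop really does produce strings such as 'u05e2', the same idea can be applied by stripping the leading u and parsing the rest as hexadecimal. This is only a sketch: the helper name char_from_escape_text is hypothetical, and it assumes Python 3 and input of the form 'u' followed by hex digits.

```python
def char_from_escape_text(text):
    """Return the character for text like 'u05e2' (hypothetical helper)."""
    # Assumption: text is the letter 'u' followed by hexadecimal digits.
    if not text.startswith('u'):
        raise ValueError('expected text of the form uXXXX, got %r' % text)
    # Parse the hex digits as an integer code point, then map it to a character.
    return chr(int(text[1:], 16))


for a in ['u05e0', 'u05e1', 'u05e2']:
    print(a, char_from_escape_text(a))
```

Keeping the value as an integer from the start, as in the loop above, avoids the parsing step entirely.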

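All you need is a \ before u05e2 inside the string literal itself, so that Python reads \u05e2 as a single escape sequence. (The u prefix on the format string is only needed on Python 2; Python 3 strings are always Unicode.)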
a = '\u05e2'
print(u'{}'.format(a))
#Output
ע
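When you try the other approach and put the \ in the format string, it does not work: Python processes escape sequences only when it parses a string literal, and '\{' is not an escape sequence, so the backslash stays a literal character. format() then simply places the text u05e2 after it.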
a = 'u05e2'
print(u'\{}'.format(a))
#Output
\u05e2
One way to check that a string holds a single Unicode character is the built-in ord() function. It returns the Unicode code point (an integer) of the character passed to it, and it accepts only a string of length one.
a = '\u05e2'
print(ord(a)) #1506, the Unicode code point for the Unicode string stored in a
To print the Unicode character for the code point above (1506), use the c presentation type in a format specification. This is explained in the Python docs.
print('{0:c}'.format(1506))
#Output
ע
If we pass the string 'u05e2' to ord(), we get an error, because it is five separate characters rather than a single character.
a = 'u05e2'
print(ord(a))
#Error
TypeError: ord() expected a character, but string of length 5 found
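When the text u05e2 only exists at run time, one possible approach is to build the escape text and hand it to the unicode_escape codec, which performs the escape processing that a string literal would otherwise get from the parser. The following is a minimal sketch for Python 3; it assumes the text contains only ASCII characters.

```python
a = 'u05e2'

# Build the six-character text \u05e2 at run time. No escape processing
# happens here: the backslash is just an ordinary character in the result.
escaped_text = '\\' + a
print(len(escaped_text))

# The unicode_escape codec decodes bytes, so encode to ASCII first.
# Assumption: escaped_text contains only ASCII characters.
char = escaped_text.encode('ascii').decode('unicode_escape')
print(len(char), ord(char))
```

For a single code point, chr(int(a[1:], 16)) gives the same result with less machinery.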

This is happening because the backslash has to be part of the string literal; on Python 2 you also have to add the prefix u in front of the string.
a = u'\u05e2'
print(a)
ע
Hope this helps you.

Related

replace() function not replacing backslash (\)

None of the following code is working...
str = '\U0009d014'
print(str)
output: 򝀔
print(str.replace('\\',''))
output: 򝀔
print(str.split('\\'))
output: ['\U0009d014']
I want to get string output but it's giving that unknown symbol.
If you want to get the raw string output:
s = '\U0009d014'
print(repr(s))
gives
'\U0009d014'
s is a single Unicode character, which is why printing it directly shows an unknown symbol: the terminal has no glyph for that code point. (\U takes eight hex digits and can express any code point, up to chr(1114111); \u takes four hex digits and covers code points up to chr(65535), such as chr(57344); ASCII characters such as chr(65), which is 'A', can simply be typed directly.)
Also, don't use str as a variable name. It is the built-in <class 'str'> in Python, and shadowing it with your own variable can cause issues in your code.
The reason you're receiving a weird character in the output is that Python interprets your '\U0009d014' as a single Unicode character. When you print it, it's going to try to display the actual character, not the code you put in. If you want it to operate like a string, you have to escape the backslash character at the beginning, like so:
str = '\\U0009d014'
This tells python to treat the backslash as the actual backslash character and not the beginning of a unicode character.
As other answerers have noted, you can also use repr(s) which gives you the string representation of a character.
An additional note: str is the string type in python and shouldn't be used as a variable name. It will cause a lot of issues if you do so. Consider s or a name more suited for the problem you're solving.
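The difference between the one-character string and the escaped ten-character string can be checked directly. The sketch below assumes Python 3; the variable names are illustrative only.

```python
s = '\U0009d014'         # one character: code point 0x9D014
literal = '\\U0009d014'  # ten characters: backslash, U, eight hex digits

print(len(s), len(literal))

# replace() finds no backslash in s, because s never contained one.
unchanged = s.replace('\\', '')
print(unchanged == s)

# Going from the character back to its escape text with the
# unicode_escape codec, then decoding the resulting ASCII bytes.
escape_text = s.encode('unicode_escape').decode('ascii')
print(escape_text)
```

This also shows why split('\\') returned the string unchanged: there is no backslash in it to split on.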

Python. Replace "\\uxxxx" to "\uxxxx" [duplicate]

This question already has an answer here:
Python-encoding and decoding using codecs,unicode_escape()
(1 answer)
Closed 4 years ago.
I'm scraping a website, and I get Unicode characters as raw escape sequences.
Instead of getting the "ó" character, I get \u00f3.
It is the same as writing:
>>> print("\\u00f3")
I want to convert "\\u00f3" into "\u00f3" for all Unicode characters. That is:
"\\uxxxx" -> "\uxxxx"
But if I try to replace \\ by \, the following characters are interpreted as escape characters.
How can I do it?
With the following code, I can convert part of the characters:
import re

def raw_to_utf8(matcher):
    # matcher.group(0) is the whole match, e.g. \u00f3; skip the leading \u
    string2convert = matcher.group(0)
    return chr(int(string2convert[2:], base=16))

def decode_utf8(text_raw):
    text_raw_re = re.compile(r"\\u[0-9a-ce-z]\w{0,3}")
    return text_raw_re.sub(raw_to_utf8, text_raw)

text_fixed = decode_utf8(text_raw)
As you can see in the regular expression pattern, I have skipped the 'd' character. That is because \udxxx characters can't be converted to UTF-8 by this method or any other. They aren't important characters for me, so it is not a problem.
Thanks for your help.
************************** Solved ********************************
The best solution had already been posted:
Python-encoding and decoding using codecs,unicode_escape()
Thanks for your help.
First: maybe you are not decoding the webpage with the correct charset. If the web server does not supply the charset you might have to find it in the meta tags or make an educated guess. Maybe try a couple of usual charsets and compare the results.
Second: I played around with strings and decoding for a while and it's really frustrating, but I found a possible solution in format():
s = "\\u00f3"
print('{:c}'.format(int(s[2:], 16)))
Formatting the extracted hex value as a Unicode character seems to work.
You cannot replace '\\' by '\' because '\' is not a valid string literal.
Convert the hexadecimal expression into a number, then find the corresponding character:
original = '\\u00f3'
char = chr(int(original[2:], base=16))
You can check that this gives the desired result:
assert char == '\u00f3'
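To apply the same conversion to every escape in a longer text, one option is a regular expression with a replacement function. The sketch below is an illustration under stated assumptions, not the accepted solution: it assumes Python 3 and escapes of exactly four hex digits, the name unescape_u and the sample text are hypothetical, and surrogate halves are left untouched because they are not characters on their own.

```python
import re

# Assumption: escapes are exactly \u followed by four hex digits.
_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')


def unescape_u(text):
    """Replace every literal \\uXXXX in text with its character."""
    def replace(match):
        code_point = int(match.group(1), 16)
        # Leave surrogate halves (U+D800..U+DFFF) as they are.
        if 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        return chr(code_point)
    return _ESCAPE.sub(replace, text)


text_raw = 'Educaci\\u00f3n y formaci\\u00f3n'  # made-up sample input
print(unescape_u(text_raw))
```

Matching exactly four hex digits avoids accidentally consuming letters that follow a shorter escape-like sequence.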

Python 2 string somehow saved as pure Unicode

I have the following strings in Chinese that are saved in the following form as "str" type:
\u72ec\u5230
\u7528\u8272
I am on Python 2.7, when I print those strings they are printed as actual Chinese characters:
chinese_list = ["\u72ec\u5230", "\u7528\u8272", "\u72ec"]
print(chinese_list[0], chinese_list[1], chinese_list[2])
>>> 独到 用色 独
I can't really figure out how they were saved in that form, to me it looks like Unicode. The goal would be to take other Chinese characters that I have and save them in the same kind of encoding. Say I have "国道" and I would need them to be saved in the same way as in the original chinese_list.
I've tried to encode it as utf-8 and also other encodings but I never get the same output as in the original:
new_string = u"国道"
print(new_string.encode("utf-8"))
# >>> b'\xe5\x9b\xbd\xe9\x81\x93'
print(new_string.encode("utf-16"))
# >>> b'\xff\xfe\xfdVS\x90'
Any help appreciated!
EDIT: it doesn't have to have 2 Chinese characters.
EDIT2: Apparently, the encoding was unicode-escape. Thanks @deceze.
print(u"国".encode('unicode-escape'))
>>> \u56fd
The \u.... is Unicode escape syntax. It works similarly to how \n is a newline, not the two characters \ and n.
The elements of your list never actually contain a byte string with literal characters of \, u, 7 and so on. They contain a unicode string with the actual unicode characters, i.e. 独 and so on.
Note that this only works with unicode strings! In Python2, you need to write u"\u....". Python3 always uses unicode strings.
The code point of a character can be obtained with the ord builtin. For example, ord(u"国") gives 22269, the same value as 0x56fd.
To get the hexadecimal escape value, convert the result to hex, zero-padded to four digits.
>>> def escape_literal(character):
...     return r'\u' + format(ord(character), '04x')
...
>>> print(escape_literal('国'))
\u56fd
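For whole strings, the unicode_escape codec mentioned in the question's edit does the same conversion in both directions. A minimal sketch, assuming Python 3 (where encode returns bytes that need a further decode to become text):

```python
new_string = '国道'

# unicode_escape produces ASCII bytes containing literal backslashes.
escaped_bytes = new_string.encode('unicode_escape')
escaped_text = escaped_bytes.decode('ascii')
print(escaped_text)

# Decoding with the same codec restores the original characters.
restored = escaped_bytes.decode('unicode_escape')
print(restored == new_string)
```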

How do I convert a list of strings to a unicode value? [duplicate]

This question already has answers here:
Get unicode code point of a character using Python
(5 answers)
Closed 9 years ago.
I receive the following:
value = ['\\', 'n']
and my regular routine of converting to unicode and calling ord throws the error:
ord() expected a character, but string of length 2 found
It would seem that I need to join the characters within the list if len(value) > 2.
How do I go about doing this?
If you're trying to figure out how to treat this as a single string '\\n' that can then be interpreted as the single character '\n' according to some set of rules, like Python's unicode-escape rules, you have to decide exactly what you want before you can code it.
First, to turn a list of two single-character strings into one two-character string, just use join:
>>> value = ['\\', 'n']
>>> escaped_character = ''.join(value)
>>> escaped_character
'\\n'
Next, to interpret a two-character escape sequence as a single character, you have to know which escape rules you're trying to undo. If it's Python's Unicode escape, there's a codec named unicode_escape that does that:
>>> character = escaped_character.decode('unicode_escape')
>>> character
u'\n'
If, on the other hand, you're trying to undo UTF-8 encoding followed by Python string-escape, or C backslash escapes, or something different, you obviously have to write something different. And given what you've said about UTF-8, I think you probably do want something different. For example, u'é'.encode('UTF-8') is the two-byte sequence '\xc3\xa9'. Just calling decode('unicode_escape') on that will give you the two-character sequence u'\u00c3\u00a9', which is not what you want.
Anyway, now that you've got a single character, just call ord:
>>> char_ord = ord(character)
>>> char_ord
10
I'm not sure what the convert-to-unicode bit is about. If this is Python 3.x, the strings are already Unicode. If it's 2.x, and the strings are ASCII, it's guaranteed that ord(s) == ord(unicode(s)). If it's 2.x, and the strings are in some other encoding, just calling unicode on them is going to give you a UnicodeError or mojibake; you need to pass an encoding in as well, in which case you might as well use the decode method.
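The transcript above is Python 2, where str has a decode method. On Python 3 the same steps might look like the following sketch, which assumes the joined text is pure ASCII so that it can be turned into bytes before decoding. It also reproduces the UTF-8 mismatch described above.

```python
value = ['\\', 'n']

# Join the two one-character strings into the two-character text \n.
escaped_character = ''.join(value)

# Python 3 strings have no decode(); go through ASCII bytes first.
character = escaped_character.encode('ascii').decode('unicode_escape')
char_ord = ord(character)
print(char_ord)

# UTF-8 bytes fed to unicode_escape are read byte by byte as Latin-1,
# which yields two wrong characters instead of one correct one.
utf8_bytes = 'é'.encode('utf-8')
mojibake = utf8_bytes.decode('unicode_escape')
print(len(mojibake))
```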

How do I convert Unicode escape sequences to Unicode characters in a Python string

When I tried to get the content of a tag using "unicode(head.contents[3])", I get output similar to this: "Christensen Sk\xf6ld". I want the escape sequence to be returned as a string. How can I do this in Python?
Assuming Python sees the name as a normal string, you'll first have to decode it to unicode:
>>> name
'Christensen Sk\xf6ld'
>>> unicode(name, 'latin-1')
u'Christensen Sk\xf6ld'
Another way of achieving this:
>>> name.decode('latin-1')
u'Christensen Sk\xf6ld'
Note the "u" in front of the string, signalling it is Unicode. If you print this, the accented letter is shown properly:
>>> print name.decode('latin-1')
Christensen Sköld
BTW: when necessary, you can use the "encode" method to turn the Unicode string into e.g. a UTF-8 string:
>>> name.decode('latin-1').encode('utf-8')
'Christensen Sk\xc3\xb6ld'
I suspect that it's actually working correctly. The interactive interpreter displays the repr of a string, which on Python 2 escapes non-ASCII characters, since not all terminals support Unicode. If you actually print the string, though, it should work. See the following example:
>>> u'\xcfa'
u'\xcfa'
>>> print u'\xcfa'
Ïa
Given a byte string with Unicode escapes such as b"\N{SNOWMAN}", b"\N{SNOWMAN}".decode('unicode-escape') will produce the expected Unicode string u'\u2603'.
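The answers above use Python 2 syntax. As a rough Python 3 equivalent, and assuming the tag content arrives as Latin-1 encoded bytes (the encoding is a guess carried over from the answers, not something the question confirms), the same steps could be sketched as:

```python
# Assumption: the raw data is Latin-1 encoded bytes.
raw = b'Christensen Sk\xf6ld'

# Bytes to text: pick the encoding the data was written in.
name = raw.decode('latin-1')
print(name)

# Text back to bytes, this time as UTF-8.
utf8 = name.encode('utf-8')
print(utf8)

# A raw bytes literal keeps the backslash, so unicode_escape can
# interpret the \N{...} escape itself.
snowman = rb'\N{SNOWMAN}'.decode('unicode_escape')
print(snowman)
```

In Python 3, print(name) shows the accented letter directly and repr no longer escapes it, so the confusion between display and content is less likely to come up.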
