I am trying to get unicode subscripts working with string formatting...
I know I can do something like this...
>>> print('Y\u2081')
Y₁
>>> print('Y\u2082')
Y₂
But what I really need is something like this, since I need the subscript to iterate over a range. Obviously this doesn't work though:
>>>print('Y\u208{0}'.format(1))
File "<ipython-input-62-99965eda0209>", line 1
print('Y\u208{0}'.format(1))
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 1-5: truncated \uXXXX escape
Any help appreciated
\uhhhh is an escape syntax in the string literal. You'd have to produce a raw string (where the escape syntax is ignored), then re-apply the normal Python parser handling of escapes:
import codecs
print(codecs.decode(r'Y\u208{0}'.format(1), 'unicode_escape'))
However, you'd be better off using the chr() function to produce the whole character:
print('Y{0}'.format(chr(0x2080 + 1)))
The chr() function takes an integer and outputs the corresponding Unicode codepoint in a string. The above defines a hexadecimal number and adds 1 to produce your desired 2080 range Unicode character.
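Since the subscript digits ₀–₉ are contiguous starting at U+2080, iterating over a range is then just a matter of adding the loop index to that base code point. A small sketch of the approach above:

```python
# U+2080 is SUBSCRIPT ZERO; the subscript digits 0-9 are contiguous from there.
for i in range(1, 4):
    print('Y{0}'.format(chr(0x2080 + i)))
```

This prints Y₁, Y₂, and Y₃ on separate lines.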
Related
I've got strings that look something like this:
a = "testing test<U+00FA>ing <U+00F3>"
Format will not always be like that, but those unicode characters in brackets will be scattered throughout the code. I want to turn those into the actual unicode characters they represent. I tried this function:
import re

def replace_unicode(s):
    uni = re.findall(r'<U\+\w\w\w\w>', s)
    for a in uni:
        s = s.replace(a, f'\u{a[3:7]}')
    return s
This successfully finds all of the <U+> unicode strings, but it won't let me put them together to create a unicode escape in this manner.
File "D:/Programming/tests/test.py", line 8
s = s.replace(a, f'\u{a[3:7]}')
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
How can I create a unicode escape character using an f-string, or via some other method with the information I'm getting from strings?
chepner's answer is good, but you don't actually need an f-string. int(a[3:7], base=16) works perfectly fine.
Also, it would make a lot more sense to use re.sub() instead of re.findall() then str.replace(). I would also restrict the regex down to just hex digits and group them.
import re
def replace_unicode(s):
    pattern = re.compile(r'<U\+([0-9A-F]{4})>')
    return pattern.sub(lambda match: chr(int(match.group(1), base=16)), s)
a = "testing test<U+00FA>ing <U+00F3>"
print(replace_unicode(a)) # -> testing testúing ó
You can use an f-string to create an appropriate argument to int, whose result the chr function can use to produce the desired character.
for a in uni:
    s = s.replace(a, chr(int(f'0x{a[3:7]}', base=16)))
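For completeness, here is a sketch of that loop dropped back into the question's original function (regex unchanged from the question):

```python
import re

def replace_unicode(s):
    # Find all <U+XXXX> markers, then replace each with the character
    # whose code point is given by the four hex digits.
    uni = re.findall(r'<U\+\w\w\w\w>', s)
    for a in uni:
        s = s.replace(a, chr(int(f'0x{a[3:7]}', base=16)))
    return s

print(replace_unicode("testing test<U+00FA>ing <U+00F3>"))  # testing testúing ó
```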
I am new to Python. I am trying to instantiate a grid from DEM data. Then I will try to create a flow direction map from raster data. But when I write the following line:
grid = Grid.from_raster("C:\Users\ogun_\Masaüstü\DEM", data_name='dem')
I get this error:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
When I try the r or R functions, they don't work for this code.
Strings in Python allow "escape codes", which start with a \. For example, \n signifies a new-line character.
In your case, you've specified \U as part of \Users. Python is trying to interpret that as a raw Unicode value (which is what a \U escape code normally denotes).
You can solve this either with an escape for your \ characters (which makes the string "C:\\Users\\ogun_\\Masaüstü\\DEM") or with a "raw string" which doesn't do escape codes (which makes the string r"C:\Users\ogun_\Masaüstü\DEM").
I suspect the latter might be what you mean by "r functions". These are not functions but lexical markers; they're much lower-level than functions and can be thought of as part of the quoting itself. If you tried to call them as functions, that would be why it didn't work.
You can read more about marking a string as raw here.
NOTE: This also is different between Python 2 and 3. In 2, strings aren't Unicode by default, so you don't experience this unless you ask for a unicode string. In Python 3, strings are Unicode by default, so this happens unless you explicitly ask for bytes with a byte string.
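To make the difference concrete, both spellings below produce the same path string (the path itself is just the asker's example):

```python
escaped = "C:\\Users\\ogun_\\Masaüstü\\DEM"  # each backslash escaped
raw = r"C:\Users\ogun_\Masaüstü\DEM"         # raw string: no escape processing
print(escaped == raw)  # True
```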
Here is a list of strings built from the unicode map:
unicode_strings = ["\U00000{:0>3}".format(str.upper(hex(i))[2:]) for i in range(16)]
but this code raises an error:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-6: truncated \UXXXXXXXX escape
This happens because of the order of evaluation.
First, "\U00000" is parsed as part of the literal; only afterwards is format executed.
As the error message says, a unicode escape must be the complete block "\UXXXXXXXX".
The escape is evaluated when the literal is parsed, but at that point the block is not yet complete.
By the time the format method runs, the escape would be complete, but that is too late.
I know the 'r' prefix avoids this error message, but then it produces no unicode string.
How can I attach "\U" to the string, or make the format call happen first?
If I delete '\U', the code runs and gives this result:
['00000001',
'00000002',
'00000003',
'00000004',
'00000005',
'00000006',
'00000007',
'00000008',
'00000009',
'0000000A',
'0000000B',
'0000000C',
'0000000D',
'0000000E',
'0000000F']
UPDATE:
I want such a result.
['\U00000001',
'\U00000002',
'\U00000003',
'\U00000004',
'\U00000005',
'\U00000006',
'\U00000007',
'\U00000008',
'\U00000009',
'\U0000000A',
'\U0000000B',
'\U0000000C',
'\U0000000D',
'\U0000000E',
'\U0000000F']
I want to get the sequence of characters in Unicode map.
Not entirely sure what exactly you are after, but given that, for instance, \U00000000 is the same as \x00, the following comprehension would seem to make more sense for generating this list anyway:
unicode_strings = [chr(i) for i in range(16)]
If the question is why this happens, the format docs may be a little subtle about it: *)
The string on which this method is called can contain literal text or replacement fields delimited by braces {}... Returns a copy of the string where each replacement field is replaced with the string value of the corresponding argument.
But basically the literal strings and "replacement fields" are identified and each is handled as such. In your case, the string literal \U00000 is considered on its own and is invalid, since eight hex digits (four bytes) are expected after \U. In other words, it's not really a matter of order (literals first, expressions later), but of how the string gets split into chunks: literals and replacement fields are identified first, and each is processed as such.
So if you were trying to do something like that for a larger string generation, you could do it as follows:
somelist = [f"abcd{chr(i)}efgh" for i in range(16)]
*) PEP-498 on f-strings may be a bit more explicit (and the mechanics are the same in this respect), namely:
f-strings are parsed in to literal strings and expressions...
The parts of the f-string outside of braces are literal strings. These literal portions are then decoded. For non-raw f-strings, this includes converting backslash escapes such as '\n', '\"', "\'", '\xhh', '\uxxxx', '\Uxxxxxxxx', and named unicode characters '\N{name}' into their associated Unicode characters.
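If you really do want the escape text assembled at runtime (rather than the characters themselves), one sketch is to build a plain \UXXXXXXXX string and then decode it with the unicode_escape codec:

```python
import codecs

# Build the escape as ordinary text (the r prefix keeps the backslash literal),
# then let the codec perform the conversion the parser would normally do.
escape_texts = [r'\U{:08X}'.format(i) for i in range(16)]
chars = [codecs.decode(s, 'unicode_escape') for s in escape_texts]

print(escape_texts[15])                      # \U0000000F
print(chars == [chr(i) for i in range(16)])  # True
```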
I have a pandas dataframe with hex values as given below:
df['col1']
<0020>
<0938>
<002E>
<092B>
<092B>
<0916>
<0915>
<0915>
<096F>
<096C>
I want to convert the hex values to their corresponding unicode literals. So, I try to do the following:
df['col1'] = df['col1'].apply(lambda x : '\u' + str(x)[1:-1])
Hoping that this would convert it to the required unicode literal, but I get the following error:
File "<ipython-input-22-891ccdd39e79>", line 1
df['col1'].apply(lambda x : '\u' + str(x)[1:-1])
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
In Python 3, when we try the following, we get:
>>> string1 = '\u03b4'
>>> print(string1)
δ
So, I tried adding \u to my given string. I also tried adding \\u, but that shows up as two backslashes. Adding an r before \u also ends up showing two backslashes instead of the unicode literal. I also tried decode-unicode, but it didn't work either.
Also, it'd be great if someone could explain the concept of raw strings, \u, etc.
Oops, literals are for... literal values! As soon as you have variables, you should use conversion functions like int and chr.
Here you have a column containing strings. For each cell in the column, you want to remove the first and last characters, parse what remains as a hex value, and get the unicode character with that code point. In Python, that just reads:
df['col1'].apply(lambda x: chr(int(x[1:-1], 16)))
And with your values, it gives:
0
1 स
2 .
3 फ
4 फ
5 ख
6 क
7 क
8 ९
9 ६
Now for the reason of your error.
\uxxxx escape sequences are intended for the Python parser. When they are found in a string literal, they are automatically replaced with the unicode character having that code point. You can use the codecs module and the unicode_escape encoding to decode a string that contains literal \uxxxx sequences (meaning that you escape the backslash, as in "\\uxxxx"), but as you directly have a hex representation of the code point, it is simpler to use the chr function.
And in your initial code, when you write '\u', the parser sees the initial part of an escaped character and tries to decode it immediately... but cannot find the hex code point after it, so it throws the exception. If you really want to go that way, you have to double the backslash (\\) to escape it, store it as-is in the string, and then use codecs.decode(..., encoding='unicode_escape') to decode the string, as shown in ndclt's answer. But I do not advise you to do so.
References are to be found in the Standard Python Library documentation, chr function and codecs module.
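A quick self-contained illustration of the two routes described above, without pandas (the sample codes are taken from the question's column):

```python
import codecs

codes = ['<0938>', '<092B>']

# Direct route: parse the hex digits and map the code point with chr.
direct = [chr(int(c[1:-1], 16)) for c in codes]

# Escape route: build the backslash text at runtime, then decode it.
via_escape = [codecs.decode(r'\u' + c[1:-1], 'unicode_escape') for c in codes]

print(direct)                # ['स', 'फ']
print(direct == via_escape)  # True
```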
In order to convert all your codes into unicode, here is a one-liner:
import codecs
import pandas as pd
(
# create a series with the prefix "\u" to add to the existing column
pd.Series([r'\u'] * len(df['col1']))
# str.strip deletes the "<" and ">" from your column
# str.cat concatenates the prefix created before to the existing column
.str.cat(df['col1'].str.strip('<>'))
# then you apply a conversion from the raw string to normal string.
.apply(codecs.decode, args=['unicode_escape'])
)
In the previous code, you have to create the prefix as a raw string. If not, Python expects a valid \uXXXX escape to follow the backslash (the error you have in your code).
Edit: I add the explanation from Serge Ballesta post
\uxxxx escape sequences are intended for the Python parser. When they are found in a string literal, they are automatically replaced with the unicode character having that code point. You can use the codecs module and the unicode_escape encoding to decode a string that contains literal \uxxxx sequences (meaning that you escape the backslash, as in "\\uxxxx"), but as you directly have a hex representation of the code point, it is simpler to use the chr function.
His solution is more elegant than mine.
In Python 3, suppose I have
>>> thai_string = 'สีเ'
Using encode gives
>>> thai_string.encode('utf-8')
b'\xe0\xb8\xaa\xe0\xb8\xb5\xe0\xb9\x80'
My question: how can I get encode() to return a bytes sequence using \u instead of \x? And how can I decode them back to a Python 3 str type?
I tried using the ascii builtin, which gives
>>> ascii(thai_string)
"'\\u0e2a\\u0e35\\u0e40'"
But this doesn't seem quite right, as I can't decode it back to obtain thai_string.
Python documentation tells me that
\xhh escapes the character with the hex value hh while
\uxxxx escapes the character with the 16-bit hex value xxxx
The documentation says that \u is only used in string literals, but I'm not sure what that means. Is this a hint that my question has a flawed premise?
You can use unicode_escape:
>>> thai_string.encode('unicode_escape')
b'\\u0e2a\\u0e35\\u0e40'
Note that encode() will always return a byte string (bytes) and the unicode_escape encoding is intended to:
Produce a string that is suitable as Unicode literal in Python source code
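A sketch of the round trip, assuming the goal is just to serialize and restore the string (bytes.decode accepts the same codec name):

```python
s = 'สีเ'
escaped = s.encode('unicode_escape')         # b'\\u0e2a\\u0e35\\u0e40'
restored = escaped.decode('unicode_escape')  # undo the escapes
print(restored == s)  # True
```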