I've got strings that look something like this:
a = "testing test<U+00FA>ing <U+00F3>"
The format will not always be like that, but those bracketed Unicode code points will be scattered throughout the strings. I want to turn them into the actual Unicode characters they represent. I tried this function:
def replace_unicode(s):
    uni = re.findall(r'<U\+\w\w\w\w>', s)
    for a in uni:
        s = s.replace(a, f'\u{a[3:7]}')
    return s
This successfully finds all of the <U+> unicode strings, but it won't let me put them together to create a unicode escape in this manner.
File "D:/Programming/tests/test.py", line 8
s = s.replace(a, f'\u{a[3:7]}')
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
How can I create a unicode escape character using an f-string, or via some other method with the information I'm getting from strings?
chepner's answer is good, but you don't actually need an f-string. int(a[3:7], base=16) works perfectly fine.
Also, it would make a lot more sense to use re.sub() instead of re.findall() then str.replace(). I would also restrict the regex down to just hex digits and group them.
import re
def replace_unicode(s):
    pattern = re.compile(r'<U\+([0-9A-F]{4})>')
    return pattern.sub(lambda match: chr(int(match.group(1), base=16)), s)
a = "testing test<U+00FA>ing <U+00F3>"
print(replace_unicode(a)) # -> testing testúing ó
You can use an f-string to create an appropriate argument to int, whose result the chr function can use to produce the desired character.
for a in uni:
    s = s.replace(a, chr(int(f'0x{a[3:7]}', base=16)))
Related
Why doesn't this code work?
Code:
a="600"
print(f"\U0001f{a}")
Error:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-6: truncated \UXXXXXXXX escape
To overcome the syntax error:
The first backslash in your string is interpreted as a special character: because it's followed by a "U", Python treats it as the start of a Unicode code point escape, and that escape is processed before the f-string substitutes {a}. To fix this you need to escape the backslashes in the string.
The direct way to do this is by doubling the backslashes:
print("\\U0001f"+ a)
You can also put r in front of the string:
print(r"\U0001f"+ a)
Output:
\U0001f600
Finally checking if this displays an emoji.
>>> s = "\U0001f600"
>>> print(s)
😀
There are also a few other ways to print emojis.
Method1:
>>> print("\N{grinning face}")
😀
>>> print("\N{slightly smiling face}")
🙂
Method2:
There is a package called emoji that can be utilized.
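The \N{...} escape in Method1 only works when the name is written directly in the literal. When the name is held in a variable, the standard library's unicodedata.lookup() can do the same lookup at runtime. A small sketch; the helper name char_from_name is just for illustration:

```python
import unicodedata

def char_from_name(name):
    # unicodedata.lookup raises KeyError if the name is unknown.
    return unicodedata.lookup(name)

face = "GRINNING FACE"
print(char_from_name(face))
```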
FURTHERMORE:
Consider a scenario where we build the escape by string concatenation and expect an emoji to be printed:
>>> res = "\\U0001f"+ a
>>> print(res)
\U0001f600
Here, the printed value is the literal text \U0001f600, not the Unicode character it names. In these scenarios, I would recommend doing the following to keep things simple when writing a script:
>>> a = "600"
>>> string = '1F' + a
>>> print(string)
1F600
>>> print(chr(int(string,16)))
😀
I'm trying to read a CSV file into Python (Spyder), but I keep getting an error. My code:
import csv
data = open("C:\Users\miche\Documents\school\jaar2\MIK\2.6\vektis_agb_zorgverlener")
data = csv.reader(data)
print(data)
I get the following error:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes
in position 2-3: truncated \UXXXXXXXX escape
I have tried replacing the \ with \\ or with /, and I've tried putting an r before "C.., but none of these worked.
This error occurs because you are using a normal string as a path, so the backslashes are treated as escape characters. You can use one of the following three solutions to fix your problem:
1: Just put r before your normal string. It converts a normal string to a raw string:
pandas.read_csv(r"C:\Users\DeePak\Desktop\myac.csv")
2: Use forward slashes instead of backslashes:
pandas.read_csv("C:/Users/DeePak/Desktop/myac.csv")
3: Double every backslash:
pandas.read_csv("C:\\Users\\DeePak\\Desktop\\myac.csv")
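To see how the three spellings relate, here is a small check that runs on any operating system. It is only a sketch: it uses pathlib.PureWindowsPath to compare the paths without touching the file system, and the variable names are illustrative.

```python
from pathlib import PureWindowsPath

raw = r"C:\Users\DeePak\Desktop\myac.csv"
doubled = "C:\\Users\\DeePak\\Desktop\\myac.csv"
forward = "C:/Users/DeePak/Desktop/myac.csv"

# The raw and doubled-backslash literals produce the identical string.
print(raw == doubled)

# The forward-slash form is a different string, but it names the same Windows path.
print(raw == forward)
print(PureWindowsPath(forward) == PureWindowsPath(raw))
```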
The first backslash in your string is being interpreted as a special character. In fact, because it's followed by a "U", it's being interpreted as the start of a Unicode code point.
To fix this, you need to escape the backslashes in the string. The direct way to do this is by doubling the backslashes:
data = open("C:\\Users\\miche\\Documents\\school\\jaar2\\MIK\\2.6\\vektis_agb_zorgverlener")
If you don't want to escape backslashes in a string, and you don't have any need for escape codes or quotation marks in the string, you can instead use a "raw" string, using "r" just before it, like so:
data = open(r"C:\Users\miche\Documents\school\jaar2\MIK\2.6\vektis_agb_zorgverlener")
You can just put r in front of the string with your actual path, which denotes a raw string. For example:
data = open(r"C:\Users\miche\Documents\school\jaar2\MIK\2.6\vektis_agb_zorgverlener")
Treat it as a raw string: simply add r before your Windows path.
import csv
data = open(r"C:\Users\miche\Documents\school\jaar2\MIK\2.6\vektis_agb_zorgverlener")
data = csv.reader(data)
print(data)
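Try writing the file path as "C:\\Users\miche\Documents\school\jaar2\MIK\2.6\vektis_agb_zorgverlener", i.e. with a double backslash after the drive, as opposed to "C:\Users\miche\Documents\school\jaar2\MIK\2.6\vektis_agb_zorgverlener". Note that this only removes the \U error: \2 and \v in this particular path are still interpreted as escape sequences, so doubling every backslash or using a raw string is safer.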
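A quick way to check what a partially escaped path really contains is to look at its repr(). The sketch below uses only the tail of the path (an arbitrary cut, chosen to isolate the two sequences that Python recognises as escapes) and compares it with the raw-string version.

```python
# Only the tail of the path, to isolate the two problematic escapes.
tail_plain = "MIK\2.6\vektis"   # \2 (octal) and \v (vertical tab) are processed
tail_raw = r"MIK\2.6\vektis"    # raw string keeps both backslashes

print(repr(tail_plain))
print(repr(tail_raw))
print(tail_plain == tail_raw)
```

The plain version no longer contains any backslash at all, so it cannot name the intended folder.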
Add r before your string. It converts a normal string to a raw string.
As per String literals:
String literals can be enclosed within single quotes (i.e. '...') or double quotes (i.e. "..."). They can also be enclosed in matching groups of three single or double quotes (these are generally referred to as triple-quoted strings).
The backslash character (i.e. \) is used to escape characters which otherwise will have a special meaning, such as newline, backslash itself, or the quote character. String literals may optionally be prefixed with a letter r or R. Such strings are called raw strings and use different rules for backslash escape sequences.
In triple-quoted strings, unescaped newlines and quotes are allowed, except that the three unescaped quotes in a row terminate the string.
Unless an r or R prefix is present, escape sequences in strings are interpreted according to rules similar to those used by Standard C.
So ideally you need to replace the line:
data = open("C:\Users\miche\Documents\school\jaar2\MIK\2.6\vektis_agb_zorgverlener")
with any one of the following:
Using raw prefix and single quotes (i.e. '...'):
data = open(r'C:\Users\miche\Documents\school\jaar2\MIK\2.6\vektis_agb_zorgverlener')
Using double quotes (i.e. "...") and escaping backslash character (i.e. \):
data = open("C:\\Users\\miche\\Documents\\school\\jaar2\\MIK\\2.6\\vektis_agb_zorgverlener")
Using double quotes (i.e. "...") and forwardslash character (i.e. /):
data = open("C:/Users/miche/Documents/school/jaar2/MIK/2.6/vektis_agb_zorgverlener")
Just putting an r in front works well.
eg:
white = pd.read_csv(r"C:\Users\hydro\a.csv")
It worked for me after neutralizing the '\' by doubling it: f = open('F:\\file.csv')
The double \ should work for Windows, but you still need to take care of the folders you mention in your path. All of them (except the filename) must exist. Otherwise you will get an error.
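Two things in this thread are easy to trip over once the path itself is fixed: the file must actually exist, and print(data) on a csv.reader prints the reader object rather than the rows. A minimal sketch of both points, using a temporary file so it runs anywhere; read_rows and the demo file are hypothetical, not part of the original question.

```python
import csv
import tempfile
from pathlib import Path

def read_rows(path):
    path = Path(path)
    # Fail early with a clear message if the file is missing.
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    # newline='' is what the csv module documentation recommends for open().
    with path.open(newline='') as handle:
        return list(csv.reader(handle))

# Demo on a throwaway file instead of a real Windows path.
with tempfile.TemporaryDirectory() as folder:
    demo = Path(folder) / "demo.csv"
    demo.write_text("name,code\nalpha,1\n")
    rows = read_rows(demo)

print(rows)
```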
I have a pandas dataframe with hex values as given below:
df['col1']
<0020>
<0938>
<002E>
<092B>
<092B>
<0916>
<0915>
<0915>
<096F>
<096C>
I want to convert the hex values to their corresponding unicode literals. So, I try to do the following:
df['col1'] = df['col1'].apply(lambda x : '\u' + str(x)[1:-1])
I hoped that this would convert it to the required Unicode literal, but I get the following error:
File "<ipython-input-22-891ccdd39e79>", line 1
df['col1'].apply(lambda x : '\u' + str(x)[1:-1])
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
In Python 3, when we try the following, we get:
>>> string1 = '\u03b4'
>>> print(string1)
δ
So I tried adding \u to my given string. I also tried adding \\u, but that shows up as two backslashes. Adding an r before \u also ends up showing two backslashes instead of the Unicode character. I also tried decode-unicode, but it didn't work either.
Also, it'd be great if someone could explain the concepts of raw strings, \u, etc.
Oops, literals are for... literal values! As soon as you have variables, you should use conversion functions like int and chr.
Here you have a column containing strings. For each cell in the column, you want to remove the first and last characters, interpret what remains as a hex value, and get the Unicode character with that code point. In Python, it just reads:
df['col1'].apply(lambda x: chr(int(x[1:-1], 16)))
And with your values, it gives:
0
1 स
2 .
3 फ
4 फ
5 ख
6 क
7 क
8 ९
9 ६
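If some cells might not follow the <XXXX> pattern, the lambda above would raise a ValueError on them. A slightly more defensive variant is sketched below. It assumes cells are hex code points wrapped in angle brackets and leaves anything else unchanged; cell_to_char is a hypothetical helper name, and the plain list stands in for the column so the example runs without pandas.

```python
import re

# Assumption: a valid cell is a hex code point in angle brackets, e.g. "<0938>".
_CELL = re.compile(r'<([0-9A-Fa-f]{1,6})>')

def cell_to_char(cell):
    match = _CELL.fullmatch(str(cell))
    if match is None:
        return cell  # leave unexpected values unchanged
    code_point = int(match.group(1), 16)
    return chr(code_point) if code_point <= 0x10FFFF else cell

values = ['<0020>', '<0938>', '<002E>', 'n/a']
print([cell_to_char(v) for v in values])
```

With a DataFrame, the same function could be passed to df['col1'].apply(cell_to_char).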
Now for the reason of your error.
\uxxxx escape sequences are intended for the Python parser. When they are found in a string literal, they are automatically replaced with the Unicode character having that code point. You can use the codecs module and the unicode_escape encoding to decode a string that contains actual \u characters (meaning that you escape the backslash, as in "\\uxxxx"), but as you directly have a hex representation of the code point, it is simpler to use the chr function directly.
And in your initial code, when you write '\u', the parser sees the start of an escape sequence and tries to decode it immediately, but cannot find the four hex digits after it, so it raises the error. If you really want to go that way, you have to double the backslash (\) to escape it and store it as is in the string, and then use codecs.decode(..., encoding='unicode_escape') to decode the string, as shown in @ndclt's answer. But I do not advise you to do so.
References are to be found in the Standard Python Library documentation, chr function and codecs module.
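Since the question also asked about raw strings and \u, here is a small comparison of the three ways of writing the same text. It is only a sketch with illustrative variable names, and it deliberately sticks to ASCII-only input for the unicode_escape step.

```python
import codecs

literal = '\u03b4'    # the parser turns the escape into one character
raw = r'\u03b4'       # raw string: backslash, 'u', and four hex digits
doubled = '\\u03b4'   # the same six characters as the raw string

print(len(literal), len(raw), len(doubled))
print(raw == doubled)

# Two ways to turn the six-character text into the character at runtime:
print(codecs.decode(raw, 'unicode_escape') == literal)
print(chr(int(raw[2:], 16)) == literal)
```

The "two backslashes" seen in the interactive prompt are just how repr() displays a single literal backslash; print() shows only one.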
To convert all your codes into Unicode, here is a one-liner:
import codecs
import pandas as pd
(
    # create a series with the prefix "\u" to add to the existing column
    pd.Series([r'\u'] * len(df['col1']))
    # str.strip deletes the "<" and ">" from your column
    # str.cat concatenates the prefix created before to the existing column
    .str.cat(df['col1'].str.strip('<>'))
    # then you apply a conversion from the raw string to normal string.
    .apply(codecs.decode, args=['unicode_escape'])
)
In the previous code, the prefix must be created as a raw string. Otherwise, Python expects a complete \uXXXX escape with four hex digits after the backslash, which is the error you got in your code.
Edit: I'm adding the explanation from Serge Ballesta's post:
\uxxxx escape sequences are intended for the Python parser. When they are found in a string literal, they are automatically replaced with the Unicode character having that code point. You can use the codecs module and the unicode_escape encoding to decode a string that contains actual \u characters (meaning that you escape the backslash, as in "\\uxxxx"), but as you directly have a hex representation of the code point, it is simpler to use the chr function directly.
His solution is more elegant than mine.
I am trying to use the encode method of python strings to return the unicode escape codes for characters, like this:
>>> print( 'ф'.encode('unicode_escape').decode('utf8') )
\u0444
This works fine with non-ascii characters, but for ascii characters, it just returns the ascii characters themselves:
>>> print( 'f'.encode('unicode_escape').decode('utf8') )
f
The desired output would be \u0066. This script is for pedagogical purposes.
How can I get the unicode hex codes for ALL characters?
ord can be used for this, there is no need for encoding/decoding at all:
>>> '"\\U{:08x}"'.format(ord('f')) # ...or \u{:04x} if you prefer
'"\\U00000066"'
>>> eval(_)
'f'
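If you want to check the round trip in a script rather than at the prompt, ast.literal_eval() can stand in for eval(_), since it only evaluates literals. A small sketch; escape_char is a made-up helper name:

```python
import ast

def escape_char(ch):
    # Build the text of a string literal such as "\U00000066".
    return '"\\U{:08x}"'.format(ord(ch))

text = escape_char('f')
print(text)
print(ast.literal_eval(text))
```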
You'd have to do so manually; if you assume that all your input is within the Unicode BMP, then a straightforward regex will probably be fastest; this replaces every character with its \uhhhh escape:
import re
def unicode_escaped(s, _pattern=re.compile(r'[\x00-\uffff]')):
    return _pattern.sub(lambda m: '\\u{:04x}'.format(
        ord(m.group(0))), s)
I've explicitly limited the pattern to the BMP, so non-BMP code points are handled gracefully: they are simply left unescaped.
Demo:
>>> print(unicode_escaped('foo bar ф'))
\u0066\u006f\u006f\u0020\u0062\u0061\u0072\u0020\u0444
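If the input can contain characters outside the BMP and you want those escaped as well, one option is to fall back to the eight-digit \Uhhhhhhhh form for them. This is a sketch of that idea, not the answer's own method; escape_all is a hypothetical name.

```python
import codecs

def escape_all(s):
    parts = []
    for ch in s:
        code_point = ord(ch)
        if code_point <= 0xFFFF:
            parts.append('\\u{:04x}'.format(code_point))
        else:
            # Non-BMP characters need the eight-digit form.
            parts.append('\\U{:08x}'.format(code_point))
    return ''.join(parts)

escaped = escape_all('f\u0444\U0001f600')
print(escaped)
# The escaped text is pure ASCII, so unicode_escape can reverse it.
print(codecs.decode(escaped, 'unicode_escape') == 'f\u0444\U0001f600')
```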
I am trying to get unicode subscripts working with string formatting...
I know I can do something like this...
>>> print('Y\u2081')
Y₁
>>> print('Y\u2082')
Y₂
But what I really need is something like this, since the subscript has to iterate over a range. Obviously, this doesn't work:
>>> print('Y\u208{0}'.format(1))
File "<ipython-input-62-99965eda0209>", line 1
print('Y\u208{0}'.format(1))
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 1-5: truncated \uXXXX escape
Any help appreciated
\uhhhh is an escape syntax in the string literal. You'd have to produce a raw string (where the escape syntax is ignored), then re-apply the normal Python parser handling of escapes:
import codecs
print(codecs.decode(r'Y\u208{0}'.format(1), 'unicode_escape'))
However, you'd be better off using the chr() function to produce the whole character:
print('Y{0}'.format(chr(0x2080 + 1)))
The chr() function takes an integer and returns a string containing the character with that Unicode code point. The code above starts from the hexadecimal number 0x2080 and adds 1 to produce the desired subscript character, U+2081.
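For iterating over a range, and for indices above 9, the same chr(0x2080 + digit) idea can be applied digit by digit. A possible sketch using str.translate; the names SUBSCRIPT_DIGITS and with_subscript are illustrative only.

```python
# Map ASCII digits to the subscript digits U+2080..U+2089.
SUBSCRIPT_DIGITS = str.maketrans(
    '0123456789', ''.join(chr(0x2080 + d) for d in range(10)))

def with_subscript(symbol, index):
    return symbol + str(index).translate(SUBSCRIPT_DIGITS)

for i in range(1, 4):
    print(with_subscript('Y', i))
print(with_subscript('Y', 12))
```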