Backslash to forward slash in unicode string in Python - python

I have a spreadsheet with dates, usually encoded as strings in the format "DD\MM\YYYY", e.g. 08\09\2014. The function I use returns the data as unicode, and I use Python 2.7. So, I start with:
> data_prob_raw
08\09\2014
To convert the string to a datetime object (with dateutil.parser.parse()) I need a string without '\', but I can't find a way to remove or substitute that problematic character with '/'.
I already tried with unicode codes:
data_prob_raw=data_prob_raw.replace(r'\x81', '/201')
data_prob_raw=data_prob_raw.replace(u'\x81', '/201')
And simply a string:
data_prob_raw=data_prob_raw.replace('\201','/201')
But it doesn't change anything:
08\09\2014
I also tried encoding the string:
data_prob_raw=data_raw_unic.encode('ascii')
But \201 (0x81) is outside the 128 ASCII characters:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x81 in position 0: ordinal not in range(128)
How can I solve that problem?

When you read data from a file in Python, backslashes come in as single literal characters; the repr just displays them escaped.
I have a file called test.txt with the contents 01\01\2010
>>> with open(r'C:\users\john\desktop\test.txt') as f:
...     s = f.read()
>>> s
'01\\01\\2010'
>>> s.replace('\\', '/')
'01/01/2010'
and I have no problem using .replace on the string. What might be happening is that you are creating a variable directly, to test the functionality, and are assigning data_prob_raw='08\09\2014' when you should be testing with either data_prob_raw='08\\09\\2014' or reading the date in from the file.
As zondo suggested, you can also use raw strings, like so: data_prob_raw=r'08\09\2014'. Notice the preceding r; that r tells Python to treat the backslashes as literal backslashes instead of parsing them as escape characters.
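The difference can be sketched quickly (Python 3 syntax shown; the behavior is the same in 2.7). In the plain literal, \0 and \2 start octal escape sequences, so the string silently changes:

```python
# Plain literal: \0 and \2 begin octal escapes, so the backslashes
# are consumed and the string is not what it looks like.
plain = '08\09\2014'    # actually '08' + '\x00' + '9' + '\x81' + '4'
raw = r'08\09\2014'     # raw literal: backslashes kept as-is

print(repr(plain))             # '08\x009\x814'
print(repr(raw))               # '08\\09\\2014'
print(raw.replace('\\', '/'))  # 08/09/2014
```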

To put a literal backslash in a string, you have to write it twice, since it is the escape character. The following replace should be enough:
data_prob_raw=data_prob_raw.replace('\\', '/')

You don't need to perform replacement. datetime can parse any date format you specify:
>>> data = ur'08\09\2014'
>>> from datetime import datetime
>>> datetime.strptime(data, ur'%d\%m\%Y')
datetime.datetime(2014, 9, 8, 0, 0)

Related

How to tell Python that a string is actually a bytes object? Not converting

I have a txt file which contains a line:
' 6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
The contents in the double quotes is actually octal encoding, but with two escape characters.
After the line has been read in, I used regex to extract the contents in the double quotes.
c = re.search(r': "(.+)"', line).group(1)
After that, I have two problem:
First, I need to replace the two escape characters with one.
Second, Tell python that the str object c is actually a byte object.
Neither has been solved.
I have tried:
re.sub('\\', '\', line)
re.sub(r'\\', '\', line)
re.sub(r'\\', r'\', line)
All failed.
A bytes object can easily be defined with a b prefix.
c = b'\351\231\220\346\227\266\345\205\215\350\264\271'
How do I change the variable type of a string to bytes? I think this is not an encode-and-decode thing.
I googled a lot, but with no answers. Maybe I use the wrong key word.
Does anyone know how to do these? Or other way to get what I want?
This is always a little confusing. I assume your bytes object should represent a string like:
b = b'\351\231\220\346\227\266\345\205\215\350\264\271'
b.decode()
# '限时免费'
To get that with your escaped string, you could use the codecs library and try:
import re
import codecs
line = ' 6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
c = re.search(r': "(.+)"', line).group(1)
codecs.escape_decode(bytes(c, "utf-8"))[0].decode("utf-8")
# '限时免费'
giving the same result.
The string contains literal text for escape codes. You cannot just replace the literal backslashes with a single backslash as escape codes are used in source code to indicate a single character. Decoding is needed to change literal escape codes to the actual character, but only byte strings can be decoded.
Encoding a Unicode string to a byte string with the Latin-1 codec translates Unicode code points 1:1 to the corresponding byte, so it is the common way to directly convert a "byte-string-like" Unicode string to an actual byte string.
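That 1:1 property can be checked in isolation; a minimal sketch (sample values are arbitrary):

```python
# Latin-1 maps code points U+0000..U+00FF straight to bytes 0x00..0xFF,
# so encoding with it preserves the numeric values exactly.
s = '\x81\xe9'             # "byte-string-like" Unicode text
b = s.encode('latin-1')    # each code point becomes the same byte value
print(list(b))             # [129, 233]
```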
Step-by-Step:
>>> s = "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"
>>> print(s) # Actual text of the string
\351\231\220\346\227\266\345\205\215\350\264\271
>>> s.encode('latin1') # Convert to byte string
b'\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271'
>>> # decode the escape codes...result is latin-1 characters in Unicode
>>> s.encode('latin1').decode('unicode-escape')
'é\x99\x90æ\x97¶å\x85\x8dè´¹'
>>> # convert back to a byte string
>>> s.encode('latin1').decode('unicode-escape').encode('latin1')
b'\xe9\x99\x90\xe6\x97\xb6\xe5\x85\x8d\xe8\xb4\xb9'
>>> # data is UTF-8-encoded text so decode it correctly now
>>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
'限时免费'
Your text example looks like part of a Python dictionary. You may be able to save some steps by using the ast module's literal_eval function to turn the dictionary directly into a Python object, and then just fix this line of code:
>>> # Python dictionary-like text
d='{6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"}'
>>> import ast
>>> ast.literal_eval(d) # returns Python dictionary with value already decoded
{6: 'é\x99\x90æ\x97¶å\x85\x8dè´¹'}
>>> ast.literal_eval(d)[6] # but decoded incorrectly as Latin-1 text.
'é\x99\x90æ\x97¶å\x85\x8dè´¹'
>>> ast.literal_eval(d)[6].encode('latin1').decode('utf8') # undo Latin1, decode as UTF-8
'限时免费'

Unable to convert hex code to unicode characters, get unicodeescape error

I have a pandas dataframe with hex values as given below:
df['col1']
<0020>
<0938>
<002E>
<092B>
<092B>
<0916>
<0915>
<0915>
<096F>
<096C>
I want to convert the hex values to their corresponding unicode literals. So, I try to do the following:
df['col1'] = df['col1'].apply(lambda x : '\u' + str(x)[1:-1])
Hoping that this would convert it to my required unicode literal, but I get the following error:
File "<ipython-input-22-891ccdd39e79>", line 1
df['col1'].apply(lambda x : '\u' + str(x)[1:-1])
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
In Python 3, when we try the following, we get:
>>> string1 = '\u03b4'
>>> print(string1)
δ
So, I tried adding \u to my given string. I also tried adding \\u, but that shows up as two backslashes. Also, adding an r before \u also ends up showing two backslashes instead of the unicode literal. I also tried decode-unicode, but it didn't work either.
Also, it'd be great if someone could explain the concept of raw strings, \u, etc.
Oops, literals are for... literal values! As soon as you have variables, you should use conversion functions like int and chr.
Here you have a column containing strings. For each cell in the column, you want to remove the first and last characters, parse what remains as a hex value, and get the unicode character with that code point. In Python, it just reads:
df['col1'].apply(lambda x: chr(int(x[1:-1], 16)))
And with your values, it gives:
0
1 स
2 .
3 फ
4 फ
5 ख
6 क
7 क
8 ९
9 ६
Now for the reason of your error.
\uxxxx escape sequences are intended for the Python parser. When they are found in a string literal, they are automatically replaced with the unicode character having that code point. You can use the codecs module and the unicode_escape encoding to decode a string that contains literal \u sequences (meaning that you escape the backslash, as in "\\uxxxx"), but as you directly have a hex representation of the code point, it is simpler to use the chr function.
And in your initial code, when you write '\u', the parser sees the initial part of an escaped character and tries to decode it immediately... but cannot find the hex code point after it, so it throws the exception. If you really want to go that way, you have to double the backslash (\\) to escape it and store it as is in the string, and then use codecs.decode(..., encoding='unicode_escape') to decode the string, as shown in ndclt's answer. But I do not advise you to do so.
References are to be found in the Standard Python Library documentation, chr function and codecs module.
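The chr/int route can be sketched end to end on a few sample cells taken from the question:

```python
# Convert '<XXXX>'-style hex cells to their Unicode characters:
# strip the angle brackets, parse as base-16, then chr() the code point.
cells = ['<0020>', '<0938>', '<096F>']
chars = [chr(int(c[1:-1], 16)) for c in cells]
print(chars)   # [' ', 'स', '९']
```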
In order to convert all your codes into unicode, here is a one-liner:
import codecs
import pandas as pd
(
# create a series with the prefix "\u" to add to the existing column
pd.Series([r'\u'] * len(df['col1']))
# str.strip deletes the "<" and ">" from your column
# str.cat concatenates the prefix created before to the existing column
.str.cat(df['col1'].str.strip('<>'))
# then you apply a conversion from the raw string to normal string.
.apply(codecs.decode, args=['unicode_escape'])
)
In the previous code, you have to create the prefix as a raw string. If not, Python expects a valid \uXXXX escape to follow (the error you have in your code).
Edit: I add the explanation from Serge Ballesta post
\uxxxx escape sequences are intended for the Python parser. When they are found in a string literal, they are automatically replaced with the unicode character having that code point. You can use the codecs module and the unicode_escape encoding to decode a string that contains literal \u sequences (meaning that you escape the backslash, as in "\\uxxxx"), but as you directly have a hex representation of the code point, it is simpler to use the chr function.
His solution is more elegant than mine.

Normalizing unicode in dataset

Currently my code is the following:
import unicodedata
unicode = open("unicode.txt").read()
unicode = unicodedata.normalize('NFKC', unicode)
print(unicode)
where unicode.txt is a text file that simply reads \u00e9.
When I run the program, the output is still \u00e9, however, if I replace unicode in the .normalize line with \u00e9 the output is é.
The end goal is simply to replace all unicode strings (eg. \u00e9) with their regular characters. Like cafe instead of café.
The normalize function seems to work fine when the string is entered into the function but not when it is in a file to be opened. Even then it seems to return the stylized é instead of the regular e.
Is there any way to make this work?
The content of the file is literally six characters: \u00e9. '\u00e9' typed in code is a single Unicode code point represented as an escape code:
>>> print('\u00e9') # A single character escape code
é
>>> print(r'\u00e9') # A six-character string using raw string notation.
\u00e9 # Escape codes are ignored and characters are literal.
>>> print('\\u00e9') # A six-character string using an escaped backslash
\u00e9 # to indicate a literal backslash.
To convert the six-character string to a character, use the following:
>>> r'\u00e9'.encode('ascii').decode('unicode-escape')
'é'
The ascii encode is needed to translate a Unicode string of ASCII characters to a byte string because you can only decode byte strings in Python 3. Python 2 can skip it as it implicitly encodes Unicode strings back to ASCII if needed.
You can also directly read it from the file (assuming Python 3), with:
with open('unicode.txt', encoding='unicode-escape') as f:
    data = f.read()
Use import io and io.open on Python 2.
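Put together, a self-contained round trip (the file name here is made up) looks like:

```python
# Write six literal characters \u00e9 to a file, then read them back
# through the unicode-escape codec, which interprets the escape.
with open('unicode_demo.txt', 'w') as f:
    f.write(r'\u00e9')          # file holds \, u, 0, 0, e, 9

with open('unicode_demo.txt', encoding='unicode-escape') as f:
    data = f.read()

print(data)   # é
```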
I guess you can change it to be readline() or readlines().
The code would be:
import unicodedata
unicode = open("unicode.txt", 'r')
for ln in unicode.readlines():
    ln = unicodedata.normalize('NFKC', ln)
    print(ln)
The reason is that iterating over the result of read() iterates character by character, while readline() or readlines() give you a whole line (or all the lines) at a time, and unicodedata normalizes the string as a whole rather than per character. Hope that helps.
References:
https://www.tutorialspoint.com/what-are-the-differences-between-readline-and-readlines-in-selenium-with-python
https://discuss.codecademy.com/t/what-is-difference-between-read-and-readlines-in-python/478934

Python 3 Decoding Strings

I understand that this is likely a repeat question, but I'm having trouble finding a solution.
In short I have a string I'd like to decode:
raw = "\x94my quote\x94"
string = decode(raw)
expected from string
'"my quote"'
Last point of note is that I'm working with Python 3 so raw is unicode, and thus is already decoded. Given that, what exactly do I need to do to "decode" the "\x94" characters?
string = "\x22my quote\x22"
print(string)
You don't need to decode; Python 3 does that for you, but you need the correct character code for the double quote ".
If, however, you have a different character set (it appears you have Windows-1252), then you need to decode the byte string from that character set:
str(b"\x94my quote\x94", "windows-1252")
If your string isn't a byte string you have to encode it first, I found the latin-1 encoding to work:
string = "\x94my quote\x94"
str(string.encode("latin-1"), "windows-1252")
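For reference, \x94 in Windows-1252 is the right curly quotation mark (U+201D), so the round trip yields typographic quotes rather than straight ones; a short sketch:

```python
# \x94 in the str is code point U+0094; re-encode it to the byte 0x94
# with latin-1, then decode that byte as Windows-1252 to get U+201D.
raw = "\x94my quote\x94"
fixed = raw.encode("latin-1").decode("windows-1252")
print(fixed)   # ”my quote”
```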
I don't know if this is what you mean, but this works:
some_binary = b"\x94my quote\x94"
result = some_binary.decode("windows-1252")
And you get the result...
If you don't know which encoding to choose, you can use chardet.detect:
import chardet
chardet.detect(some_binary)
Did you try it like this? I think you need to call decode as a method of the bytes class and pass 'utf-8' as the argument. Add b in front of the string too.
string = b"\x94my quote\x94"
decoded_str = string.decode('utf-8', 'ignore')
print(decoded_str)

Evaluate UTF-8 literal escape sequences in a string in Python3

I have a string of the form:
s = '\\xe2\\x99\\xac'
I would like to convert this to the character ♬ by evaluating the escape sequence. However, everything I've tried either results in an error or prints out garbage. How can I force Python to convert the escape sequence into a literal unicode character?
What I've read elsewhere suggests that the following line of code should do what I want, but it results in a UnicodeEncodeError.
print(bytes(s, 'utf-8').decode('unicode-escape'))
I also tried the following, which has the same result:
import codecs
print(codecs.getdecoder('unicode_escape')(s)[0])
Both of these approaches produce the string 'â\x99¬', which print is subsequently unable to handle.
In case it makes any difference the string is being read in from a UTF-8 encoded file and will ultimately be output to a different UTF-8 encoded file after processing.
...decode('unicode-escape') will give you the string '\xe2\x99\xac'.
>>> s = '\\xe2\\x99\\xac'
>>> s.encode().decode('unicode-escape')
'â\x99¬'
>>> _ == '\xe2\x99\xac'
True
You need to decode it. But to decode it, encode it first with latin1 (or iso-8859-1) to preserve the bytes.
>>> s = '\\xe2\\x99\\xac'
>>> s.encode().decode('unicode-escape').encode('latin1').decode('utf-8')
'♬'
