How do I convert unicode to unicode-escaped text [duplicate] - python

This question already has an answer here:
How to encode Python 3 string using \u escape code?
(1 answer)
Closed 1 year ago.
I'm loading a file with a bunch of escaped UTF-8 byte sequences (e.g. \xe9\x87\x8b). I want to convert these characters to their escaped-unicode form (\u91cb) in Python. I've found a couple of similar questions here on StackOverflow, including this one: Evaluate UTF-8 literal escape sequences in a string in Python3, which does almost exactly what I want, but I can't work out how to save the data.
For example:
Input file:
\xe9\x87\x8b
Python Script
file = open("input.txt", "r")
text = file.read()
file.close()
encoded = text.encode().decode('unicode-escape').encode('latin1').decode('utf-8')
file = open("output.txt", "w")
file.write(encoded) # fails with a unicode exception
file.close()
Output File (That I would like):
\u91cb

You need to encode it again with unicode-escape encoding.
>>> br'\xe9\x87\x8b'.decode('unicode-escape').encode('latin1').decode('utf-8')
'釋'
>>> _.encode('unicode-escape')
b'\\u91cb'
Code modified (binary mode is used to avoid unnecessary encodes/decodes):
with open("input.txt", "rb") as f:
    text = f.read().rstrip()  # rstrip to remove trailing whitespace/newline
decoded = text.decode('unicode-escape').encode('latin1').decode('utf-8')
with open("output.txt", "wb") as f:
    f.write(decoded.encode('unicode-escape'))
http://asciinema.org/a/797ruy4u5gd1vsv8pplzlb6kq
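The same pipeline can be wrapped in a small helper (a sketch; the function name is my own invention):

```python
def utf8_escapes_to_unicode_escapes(text):
    """Turn literal \\xNN UTF-8 escape sequences into \\uNNNN escapes."""
    # Interpret the backslash escapes, reassemble the raw UTF-8 bytes,
    # decode them to real characters, then re-escape as \uNNNN.
    decoded = text.encode().decode('unicode-escape').encode('latin1').decode('utf-8')
    return decoded.encode('unicode-escape').decode('ascii')

print(utf8_escapes_to_unicode_escapes(r'\xe9\x87\x8b'))  # \u91cb
```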

\xe9\x87\x8b is not a Unicode character. It looks like the representation of a bytestring that represents the 釋 character encoded using the utf-8 character encoding. \u91cb is the representation of the 釋 character in Python source code (or in JSON format). Don't confuse the text representation with the character itself:
>>> b"\xe9\x87\x8b".decode('utf-8')
u'\u91cb' # repr()
>>> print(b"\xe9\x87\x8b".decode('utf-8'))
釋
>>> import unicodedata
>>> unicodedata.name(b"\xe9\x87\x8b".decode('utf-8'))
'CJK UNIFIED IDEOGRAPH-91CB'
To read text encoded as utf-8 from a file, specify the character encoding explicitly:
with open('input.txt', encoding='utf-8') as file:
    unicode_text = file.read()
It is exactly the same for saving Unicode text to a file:
with open('output.txt', 'w', encoding='utf-8') as file:
    file.write(unicode_text)
If you omit the explicit encoding parameter then locale.getpreferredencoding(False) is used, which may produce mojibake if it does not correspond to the actual character encoding used to save the file.
If your input file literally contains \xe9 (4 characters) then you should fix whatever software generates it. If you need to use 'unicode-escape', something upstream is broken.
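A quick sketch of the mojibake risk: decoding UTF-8 bytes with the wrong codec does not raise an error, it silently produces garbage:

```python
text = '\u91cb'                      # the character 釋
raw = text.encode('utf-8')           # b'\xe9\x87\x8b'

# Decoding with the wrong codec does not raise -- it just garbles:
mojibake = raw.decode('latin-1')
print(mojibake == text)              # False

# Decoding with the right codec round-trips cleanly:
print(raw.decode('utf-8') == text)   # True
```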

It looks as if your input file is UTF-8 encoded, so specify UTF-8 encoding when you open the file (Python 3 is assumed as per your reference):
with open("input.txt", "r", encoding='utf8') as f:
    text = f.read()
text will contain the content of the file as a str (i.e. unicode string). Now you can write it in unicode escaped form directly to a file by specifying encoding='unicode-escape':
with open('output.txt', 'w', encoding='unicode-escape') as f:
    f.write(text)
The content of your file will now contain unicode-escaped literals:
$ cat output.txt
\u91cb

Related

Python search through file and un-escape unicode characters [duplicate]

I'm working on an application which uses utf-8 encoding. For debugging purposes I need to print the text. If I use print() directly with the variable containing my unicode string, e.g. print(pred_str),
I get this error:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to
So I tried print(pred_str.encode('utf-8')) and my output looks like this:
b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m'
b'avipar\xc4\xabta-pudgala-dharma-nair\xc4\x81tmya-pratip\xc4\x81dana-artham'
b'tri\xe1\xb9\x83\xc5\x9bik\xc4\x81-vij\xc3\xb1apti-prakara\xe1\xb9\x87a-\xc4\x81rambha\xe1\xb8\xa5'
b'pudgala-dharma-nair\xc4\x81tmya-pratip\xc4\x81danam punar kle\xc5\x9ba-j\xc3\xb1eya-\xc4\x81vara\xe1\xb9\x87a-prah\xc4\x81\xe1\xb9\x87a-artham'
But, I want my output to look like this:
pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām
aviparīta-pudgala-dharma-nairātmya-pratipādana-artham
triṃśikā-vijñapti-prakaraṇa-ārambhaḥ
pudgala-dharma-nairātmya-pratipādanam punar kleśa-jñeya-āvaraṇa-prahāṇa-artham
If i save my string in file using:
with codecs.open('out.txt', 'w', 'UTF-8') as f:
    f.write(pred_str)
it saves string as expected.
Your data is encoded with the "UTF-8-SIG" codec, which is sometimes used in Microsoft environments.
This variant of UTF-8 prefixes encoded text with a byte order mark '\xef\xbb\xbf', to make it easier for applications to detect UTF-8 encoded text vs other encodings.
You can decode such bytestrings like this:
>>> bs = b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m'
>>> text = bs.decode('utf-8-sig')
>>> print(text)
pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām
To read such data from a file:
with open('myfile.txt', 'r', encoding='utf-8-sig') as f:
    text = f.read()
Note that even after decoding from UTF-8-SIG, you may still be unable to print your data because your console's default code page may not be able to encode other non-ascii characters in the data. In that case you will need to adjust your console settings to support UTF-8.
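If you cannot change the console settings, one workaround (a Python 3 sketch) is to re-encode with errors='backslashreplace', so unprintable characters become \uXXXX escapes instead of raising UnicodeEncodeError:

```python
text = '\ufeffpudgala-dharma-nair\u0101tmyayo\u1e25'

# Characters the target charset cannot represent become backslash
# escapes instead of raising UnicodeEncodeError:
safe = text.encode('ascii', errors='backslashreplace').decode('ascii')
print(safe)  # \ufeffpudgala-dharma-nair\u0101tmyayo\u1e25
```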
Try this code:
if pred_str.startswith('\ufeff'):
    pred_str = pred_str.split('\ufeff')[1]

python encoding error with utf-8

I want to write some strings to a file; they are not in English but in the Azeri language. Even if I do utf-8 encoding I get the following error:
TypeError: write() argument must be str, not bytes
Even if I make code as:
t_w = text_list[y].encode('utf-8')
new_file.write(t_w)
new_file.write('\n')
I get following error which is :
TypeError: write() argument must be str, not bytes
The reason I don't open the file as 'wb' is that I am writing different strings and integers to the file.
If text_list contains unicode strings you should encode (not decode) them to bytes before saving to the file.
Try this instead:
t_w = text_list[y].encode('utf-8')
Also it could be helpful to look at codecs standard module https://docs.python.org/2/library/codecs.html. You could try this:
import codecs
with codecs.open('path/to/file', 'w', 'utf-8') as f:
    f.write(text_list[y])
    f.write(u'\n')
But note that codecs always opens files in binary mode.
When using write() in text mode, encoding is handled for you, so do not encode the strings yourself (in Python 3, which I assume you use rather than Python 2, the default text-mode encoding comes from the locale and is commonly UTF-8). Alternatively, open your file in binary mode and encode EVERYTHING you write to it. I suggest NOT using binary mode in your case. So, your code will look like this:
with open('myfile.txt', 'w') as new_file:
    t_w = text_list[y]
    new_file.write(t_w)
    new_file.write('\n')
or for Python 2:
new_file = open('myfile.txt', 'wb')
t_w = text_list[y].encode('utf-8')  # I assume you use Unicode strings
new_file.write(t_w)
new_file.write(b'\n')
new_file.close()
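Since the worry was mixing strings and integers, note that in text mode you can simply convert the integers with str() before writing (a Python 3 sketch; the filename and values are examples):

```python
values = ['Azərbaycan', 42, 'Bakı']  # mixed strings and integers

with open('mixed.txt', 'w', encoding='utf-8') as f:
    for v in values:
        f.write(str(v))   # str() handles both int and str
        f.write('\n')

with open('mixed.txt', encoding='utf-8') as f:
    print(f.read())
```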

Python encoding unicode<>utf-8

So I am getting lost somewhere in converting unicode to utf-8. I am trying to define some JSON containing unicode characters, and writing them to file. When printing to the terminal the character is represented as '\u2606'. When having a look at the file the character is written as '\\u2606', note the double backslash. Could someone point me in the right direction regarding these encoding issues?
# encoding=utf8
import json
data = {"summary" : u"This is a unicode character: ☆"}
print data
decoded_data = unicode(data)
print decoded_data
with open('decoded_data.json', 'w') as outfile:
    json.dump(decoded_data, outfile)
I tried adding the following snippet to the head of my file, but this had no success either.
import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
sys.stderr = codecs.getwriter('utf8')(sys.stderr)
First, you are printing the representation of a dictionary, and Python represents it using only ASCII characters, escaping any other character as \uxxxx.
The same goes for json.dump, which tries to use only ASCII characters by default. You can force json.dump to keep non-ASCII characters with:
json_data = json.dumps(data, ensure_ascii=False)
with open('decoded_data.json', 'w') as outfile:
    outfile.write(json_data.encode('utf8'))
I think you can also refer to this link, it is also really useful: Set Default Encoding
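A minimal sketch of what ensure_ascii changes (Python 3 syntax):

```python
import json

data = {"summary": "This is a unicode character: \u2606"}

escaped = json.dumps(data)                      # default: ensure_ascii=True
literal = json.dumps(data, ensure_ascii=False)  # keep the character as-is

print(escaped)  # {"summary": "This is a unicode character: \u2606"}
print(literal)  # {"summary": "This is a unicode character: ☆"}
```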

Change string in file in python

I have a problem changing a file in python. I need to change a string.
The file is not a text file but can be edited with a text editor.
Here is my code:
with open(esp, "r") as f:
    content = f.readlines()
with open(esp_carsat, "w") as f:
    for line in content:
        f.write(line.replace("201", "202"))
Problem is that content is in byte I think.
'\xff\xfe<\x00I\x00n\x00s\x00t\x00a\x00n\x00c\x00e\x00N\x00a\x00m\x00e\x00s\x00>\x00\r\x00\n'
So my replace is not working. I tried to play with encoding but the file is not readable afterwards. Furthermore, I have accents in the file (é,è...)
Is there a way to do what I want?
You have UTF-16 encoded data. Decode to Unicode text, replace, and then encode back to UTF-16 again:
>>> data = '\xff\xfe<\x00I\x00n\x00s\x00t\x00a\x00n\x00c\x00e\x00N\x00a\x00m\x00e\x00s\x00>\x00\r\x00\n\x00'
>>> data.decode('utf16')
u'<InstanceNames>\r\n'
I had to append an extra \x00 to decode that; by reading the file without decoding Python split the line on the \n and left the \x00 for the next line.
Unicode data can handle accents just fine, no further work required there.
This is easiest done with io.open() to open file objects that do the decoding and encoding for you:
import io

with io.open(esp, "r", encoding='utf16') as f:
    content = f.readlines()

with io.open(esp_carsat, "w", encoding='utf16') as f:
    for line in content:
        f.write(line.replace("201", "202"))
It's UTF-16-LE data:
>>> b
'\xff\xfe<\x00I\x00n\x00s\x00t\x00a\x00n\x00c\x00e\x00N\x00a\x00m\x00e\x00s\x00>\x00\r\x00\n'
>>> print(b[:-1].decode('utf-16-le'))
<InstanceNames>
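The difference between the 'utf-16' and 'utf-16-le' codecs is what happens to the BOM (a sketch, reusing the bytes from the question):

```python
data = (b'\xff\xfe<\x00I\x00n\x00s\x00t\x00a\x00n\x00c\x00e\x00'
        b'N\x00a\x00m\x00e\x00s\x00>\x00')

# 'utf-16' reads the BOM to pick the byte order, then discards it:
print(data.decode('utf-16'))           # <InstanceNames>

# 'utf-16-le' assumes little-endian, so the BOM survives as U+FEFF:
print(repr(data.decode('utf-16-le')))  # '\ufeff<InstanceNames>'
```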

Why does Python print ASCII rather than Unicode despite my declaring coding=UTF-8?

# coding=UTF-8
with open('/home/marius/dev/python/navn/list.txt') as f:
lines = f.read().splitlines()
print lines
The file /home/marius/dev/python/navn/list.txt contains a list of strings with some special characters, such as æ,ø,å,Æ,Ø,Å. In the terminal, these are all rendered as hexadecimals. I want these to be rendered as UTF-8. How is this done?
By decoding the data from UTF-8 to Unicode values, then having Python encode those values back to your terminal encoding automatically:
with open('/home/marius/dev/python/navn/list.txt') as f:
    for line in f:
        print line.decode('utf8')
You can use io.open() and have the data decoded for you as you read:
import io
with io.open('/home/marius/dev/python/navn/list.txt', encoding='utf8') as f:
    for line in f:
        print line
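For reference, under Python 3 the built-in open() takes the encoding argument directly and print() handles str natively, so no manual decode is needed (a sketch; the filename and contents are examples):

```python
# Python 3: open() decodes for you; print() accepts str directly.
with open('list.txt', 'w', encoding='utf-8') as f:
    f.write('æ,ø,å,Æ,Ø,Å\n')

with open('list.txt', encoding='utf-8') as f:
    for line in f:
        print(line, end='')
```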
