Python encoding unicode<>utf-8

I am getting lost somewhere in converting unicode to UTF-8. I am trying to define some JSON containing unicode characters and write it to a file. When printed to the terminal, the character is represented as '\u2606'. When I look at the file, the character is encoded as '\\u2606'; note the double backslash. Could someone point me in the right direction regarding these encoding issues?
# encoding=utf8
import json
data = {"summary" : u"This is a unicode character: ☆"}
print data
decoded_data = unicode(data)
print decoded_data
with open('decoded_data.json', 'w') as outfile:
    json.dump(decoded_data, outfile)
I tried adding the following snippet to the head of my file, but that did not help either.
import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
sys.stderr = codecs.getwriter('utf8')(sys.stderr)

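First, you are printing the representation (repr) of a dictionary, which in Python 2 uses only ASCII characters and escapes every other character as \uxxxx.
The double backslash in the file appears because unicode(data) turns the dictionary into that repr string, and json.dump then escapes the backslashes already present in it; serialise data itself instead.
The same ASCII-only default applies to json.dump. You can make json.dumps keep the unicode characters with ensure_ascii=False:
json_data = json.dumps(data, ensure_ascii=False)
with open('decoded_data.json', 'w') as outfile:
    outfile.write(json_data.encode('utf8'))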
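As a rough Python 3 sketch of the same idea (the question itself uses Python 2), the snippet below compares the default output of json.dumps with ensure_ascii=False. The variable names escaped and literal are illustrative only.

```python
import json

# Same payload as in the question; "\u2606" is the star character.
data = {"summary": "This is a unicode character: \u2606"}

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(data)

# ensure_ascii=False: the character itself is kept in the output.
literal = json.dumps(data, ensure_ascii=False)

# Both strings are valid JSON and decode to the same object.
assert json.loads(escaped) == json.loads(literal) == data
```

When writing literal to disk in Python 3, opening the file with encoding='utf-8' stores the character as UTF-8 bytes.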

You can also refer to this link, which is quite useful:
Set Default Encoding

Related

Converting a json file with escaped unicode to real unicode while retaining escaped double quotes

I have an escaped unicode string in a json file, for example this:
{"word": "\u043a\u043e\u0433\u0434\u0430 \u0440\u0430\u043a \u043d\u0430 \u0433\u043e\u0440\u0435 \u0441\u0432\u0438\u0441\u0442\u043d\u0435\u0442",
"glosses": ["when pigs fly, never (lit., \"when the crawfish whistles on the mountain\")"]}
I want to convert this file so that proper unicode is shown. In Python I found several suggestions for this, for example this:
import codecs
# opens a file and converts input to true Unicode
with codecs.open("kaikki.org-dictionary-Russian.json", "rb", "unicode_escape") as my_input:
    contents = my_input.read()
    # type(contents) = unicode
# opens a file with UTF-8 encoding
with codecs.open("utf8-dictionary.json", "wb", "utf8") as my_output:
    my_output.write(contents)
I also wrote another similar function without using "codecs", but both got the same result. After executing the command, I get:
{"word": "когда рак на горе свистнет",
"glosses": ["when pigs fly, never (lit., "when the crawfish whistles on the mountain")"]}
The escaped double quotes are not escaped anymore, which makes the JSON invalid. How can I prevent this?
Edit: I forgot to mention that I have the file in a jsonlines format, so each line is a json object beginning and ending with { ... }.
Thanks for all the help! My final solution:
import json
with open("kaikki.org-dictionary-Russian.json", "r", encoding="utf-8") as input, \
        open("utf8-dictionary-4.json", "w", encoding="utf-8") as out:
    for line in input:
        data = json.loads(line)
        json.dump(data, out, ensure_ascii=False)
        out.write("\n")

Use the json library for working with JSON data.
It will make sure that serialised data are valid JSON, and it has a few options for controlling the output, such as indented pretty-printing and non-ASCII characters without escaping.
First, parse the data with json.load():
>>> with open("kaikki.org-dictionary-Russian.json", encoding="utf8") as f:
...     data = json.load(f)
Note: in Python 3, there's no need to use the codecs library for reading/writing files. Just specify the file encoding in the built-in open function.
Then serialise the data again, now using the ensure_ascii=False option, which leads to minimal use of escape sequences (only double quotes, backslashes and control characters such as newlines and tabs are escaped):
>>> with open("utf8-dictionary.json", "w", encoding="utf8") as f:
...     json.dump(data, f, ensure_ascii=False)
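For the JSON Lines layout mentioned in the question, each line would be parsed on its own rather than loading the whole file at once. The sketch below shows, on a shortened sample line, that a parse-and-dump round trip turns the \uXXXX escapes into real characters while the inner double quotes stay escaped. The helper name unescape_json_line and the sample are hypothetical.

```python
import json

def unescape_json_line(line: str) -> str:
    """Parse one JSON document and re-serialise it without \\uXXXX escapes."""
    return json.dumps(json.loads(line), ensure_ascii=False)

# Raw string: the backslashes below are literal, as they would be in the file.
sample = (
    r'{"word": "\u043a\u043e\u0433\u0434\u0430", '
    r'"glosses": ["never (lit., \"when the crawfish whistles\")"]}'
)

converted = unescape_json_line(sample)

# The result is still valid JSON and describes the same object.
assert json.loads(converted) == json.loads(sample)
```

Because the text passes through a real JSON parser and serialiser, quoting is handled by the library instead of by a byte-level unicode_escape decode.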

Python search through file and un-escape unicode characters [duplicate]

I'm working on an application that uses UTF-8 encoding. For debugging purposes I need to print the text. If I call print() directly on the variable containing my unicode string, e.g. print(pred_str),
I get this error:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
So I tried print(pred_str.encode('utf-8')) and my output looks like this:
b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m'
b'avipar\xc4\xabta-pudgala-dharma-nair\xc4\x81tmya-pratip\xc4\x81dana-artham'
b'tri\xe1\xb9\x83\xc5\x9bik\xc4\x81-vij\xc3\xb1apti-prakara\xe1\xb9\x87a-\xc4\x81rambha\xe1\xb8\xa5'
b'pudgala-dharma-nair\xc4\x81tmya-pratip\xc4\x81danam punar kle\xc5\x9ba-j\xc3\xb1eya-\xc4\x81vara\xe1\xb9\x87a-prah\xc4\x81\xe1\xb9\x87a-artham'
But, I want my output to look like this:
pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām
aviparīta-pudgala-dharma-nairātmya-pratipādana-artham
triṃśikā-vijñapti-prakaraṇa-ārambhaḥ
pudgala-dharma-nairātmya-pratipādanam punar kleśa-jñeya-āvaraṇa-prahāṇa-artham
If I save my string to a file using:
with codecs.open('out.txt', 'w', 'UTF-8') as f:
    f.write(pred_str)
it saves the string as expected.

Your data is encoded with the "UTF-8-SIG" codec, which is sometimes used in Microsoft environments.
This variant of UTF-8 prefixes encoded text with a byte order mark '\xef\xbb\xbf', to make it easier for applications to detect UTF-8 encoded text vs other encodings.
You can decode such bytestrings like this:
>>> bs = b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m'
>>> text = bs.decode('utf-8-sig')
>>> print(text)
pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām
To read such data from a file:
with open('myfile.txt', 'r', encoding='utf-8-sig') as f:
    text = f.read()
Note that even after decoding from UTF-8-SIG, you may still be unable to print your data because your console's default code page may not be able to encode other non-ascii characters in the data. In that case you will need to adjust your console settings to support UTF-8.
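If changing the console settings is not an option, one possible workaround for debugging output is to escape whatever the console encoding cannot represent before printing. This is only a sketch: printable is a hypothetical helper, and cp1252 stands in for whatever code page the console actually uses.

```python
def printable(text: str, encoding: str) -> str:
    """Return text with characters that `encoding` cannot represent
    replaced by backslash escapes, so printing it does not raise."""
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

# "\u0101" is the letter a with macron, which cp1252 cannot encode.
sample = "nair\u0101tmya"
safe = printable(sample, "cp1252")
```

In practice it could be called as print(printable(pred_str, sys.stdout.encoding or "utf-8")). On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") is another route when the terminal itself can display UTF-8.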

Try this code:
if pred_str.startswith('\ufeff'):
    pred_str = pred_str.split('\ufeff')[1]

Reading files with non ascii characters [duplicate]

I'm reading a series of source code files using Python and running into a unicode BOM error. Here's my code:
bytes = min(32, os.path.getsize(filename))
raw = open(filename, 'rb').read(bytes)
result = chardet.detect(raw)
encoding = result['encoding']
infile = open(filename, mode, encoding=encoding)
data = infile.read()
infile.close()
print(data)
As you can see, I'm detecting the encoding using chardet, then reading the file in memory and attempting to print it. The print statement fails on Unicode files containing a BOM with the error:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2:
character maps to <undefined>
I'm guessing it's trying to decode the BOM using the default character set and it's failing. How do I remove the BOM from the string to prevent this?

There is no reason to check whether a BOM exists: utf-8-sig manages that for you and behaves exactly like utf-8 if the BOM is absent:
# Standard UTF-8 without BOM
>>> b'hello'.decode('utf-8')
'hello'
>>> b'hello'.decode('utf-8-sig')
'hello'
# BOM encoded UTF-8
>>> b'\xef\xbb\xbfhello'.decode('utf-8')
'\ufeffhello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
'hello'
In the example above, you can see that utf-8-sig correctly decodes the given bytes regardless of whether a BOM is present. If you think there is even a small chance that a BOM might exist in the files you are reading, just use utf-8-sig and don't worry about it.

BOM characters should be automatically stripped when decoding UTF-16, but not UTF-8, unless you explicitly use the utf-8-sig encoding. You could try something like this:
import io
import os
import chardet
import codecs
bytes = min(32, os.path.getsize(filename))
raw = open(filename, 'rb').read(bytes)
if raw.startswith(codecs.BOM_UTF8):
    encoding = 'utf-8-sig'
else:
    result = chardet.detect(raw)
    encoding = result['encoding']
infile = io.open(filename, mode, encoding=encoding)
data = infile.read()
infile.close()
print(data)

I've composed a nifty BOM-based detector based on Chewie's answer.
It autodetects the encoding in the common use case where data can be either in a known local encoding or in Unicode with BOM (that's what text editors typically produce). More importantly, unlike chardet, it doesn't do any random guessing, so it gives predictable results:
import codecs
def detect_by_bom(path, default):
    with open(path, 'rb') as f:
        raw = f.read(4)  # will read less if the file is smaller
    # BOM_UTF32_LE's start is equal to BOM_UTF16_LE so need to try the former first
    for enc, boms in \
            ('utf-8-sig', (codecs.BOM_UTF8,)), \
            ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)), \
            ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        if any(raw.startswith(bom) for bom in boms):
            return enc
    return default
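A quick way to see the detector at work is to write the same text with several encodings and read it back using the detected one. The sketch below restates the function so that it runs on its own; roundtrip and the latin-1 fallback are assumptions made for the demonstration.

```python
import codecs
import os
import tempfile

def detect_by_bom(path, default):
    # Compact restatement of the function above so the sketch is self-contained.
    with open(path, "rb") as f:
        raw = f.read(4)
    for enc, boms in (
        ("utf-8-sig", (codecs.BOM_UTF8,)),
        ("utf-32", (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)),
        ("utf-16", (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)),
    ):
        if any(raw.startswith(bom) for bom in boms):
            return enc
    return default

def roundtrip(text, write_encoding, default="latin-1"):
    """Write text in write_encoding, then read it back via BOM detection."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(text.encode(write_encoding))
        detected = detect_by_bom(path, default)
        with open(path, encoding=detected) as f:
            return detected, f.read()
    finally:
        os.remove(path)
```

Python's utf-16 and utf-32 codecs write a BOM when encoding and consume it when decoding, which is why the detected names can be passed straight to open().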

chardet detects BOM_UTF8 automatically since version 2.3.0, released on Oct 7, 2014:
#!/usr/bin/env python
import chardet # $ pip install chardet
# detect file encoding
with open(filename, 'rb') as file:
    raw = file.read(32) # at most 32 bytes are returned
    encoding = chardet.detect(raw)['encoding']
with open(filename, encoding=encoding) as file:
    text = file.read()
print(text)
Note: chardet may return 'UTF-XXLE' or 'UTF-XXBE' encodings that leave the BOM in the text. The 'LE'/'BE' suffix should be stripped to avoid that, though at this point it is easier to detect the BOM yourself, e.g. as in @ivan_pozdeev's answer.
To avoid UnicodeEncodeError while printing Unicode text to Windows console, see Python, Unicode, and the Windows console.

I find the other answers overly complex. There is a simpler way that doesn't need dropping down into the lower-level idiom of binary file I/O, doesn't rely on a character set heuristic (chardet) that's not part of the Python standard library, and doesn't need a rarely-seen alternate encoding signature (utf-8-sig vs. the common utf-8) that doesn't seem to have an analog in the UTF-16 family.
The simplest approach I've found is dealing with BOM characters in Unicode, and letting the codecs do the heavy lifting. There is only one Unicode byte order mark, so once data is converted to Unicode characters, determining if it's there and/or adding/removing it is easy. To read a file with a possible BOM:
BOM = '\ufeff'
with open(filepath, mode='r', encoding='utf-8') as f:
    text = f.read()
    if text.startswith(BOM):
        text = text[1:]
This works with all the interesting UTF codecs (e.g. utf-8, utf-16le, utf-16be, ...), doesn't require extra modules, and doesn't require dropping down into binary file processing or specific codec constants.
To write a BOM:
text_with_BOM = text if text.startswith(BOM) else BOM + text
with open(filepath, mode='w', encoding='utf-16be') as f:
    f.write(text_with_BOM)
This works with any encoding. UTF-16 big endian is just an example.
This is not, btw, to dismiss chardet. It can help when you have no information what encoding a file uses. It's just not needed for adding / removing BOMs.
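To illustrate the point that there is only one BOM character once the data is decoded, here is a small sketch with hand-written byte strings. The helper strip_bom and the sample bytes are assumptions made for the example.

```python
BOM = "\ufeff"

def strip_bom(text: str) -> str:
    """Drop a single leading BOM character, if present."""
    return text[1:] if text.startswith(BOM) else text

# The same two-letter text, preceded by a BOM, serialised by three codecs.
utf8_bytes = b"\xef\xbb\xbfhi"
utf16le_bytes = b"\xff\xfeh\x00i\x00"
utf16be_bytes = b"\xfe\xff\x00h\x00i"

# Codecs with an explicit byte order keep the BOM as an ordinary character.
decoded = [
    utf8_bytes.decode("utf-8"),
    utf16le_bytes.decode("utf-16-le"),
    utf16be_bytes.decode("utf-16-be"),
]

cleaned = [strip_bom(d) for d in decoded]
```

The three byte sequences differ, yet each decodes to the same string starting with U+FEFF, so one check at the text level covers them all.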

In case you want to edit the file, you will want to know which BOM was used. This version of @ivan_pozdeev's answer returns both the encoding and the optional BOM:
import codecs
from typing import Optional, Tuple
def encoding_by_bom(path, default='utf-8') -> Tuple[str, Optional[bytes]]:
    """Adapted from https://stackoverflow.com/questions/13590749/reading-unicode-file-data-with-bom-chars-in-python/24370596#24370596 """
    with open(path, 'rb') as f:
        raw = f.read(4)  # will read less if the file is smaller
    # BOM_UTF32_LE's start is equal to BOM_UTF16_LE so need to try the former first
    for enc, boms in \
            ('utf-8-sig', (codecs.BOM_UTF8,)), \
            ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)), \
            ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        for bom in boms:
            if raw.startswith(bom):
                return enc, bom
    return default, None

A variant of @ivan_pozdeev's answer for strings/exceptions (rather than files), written for Python 2. I'm dealing with unicode HTML content that was stuffed in a Python exception (see http://bugs.python.org/issue2517):
import codecs
def detect_encoding(bytes_str):
    # check UTF-32 before UTF-16: BOM_UTF32_LE starts with BOM_UTF16_LE
    for enc, boms in \
            ('utf-8-sig', (codecs.BOM_UTF8,)), \
            ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)), \
            ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        if any(bytes_str.startswith(bom) for bom in boms):
            return enc
    return 'utf-8'  # default
def safe_exc_to_str(exc):
    try:
        return str(exc)
    except UnicodeEncodeError:
        return unicode(exc).encode(detect_encoding(exc.content))
Alternatively, this much simpler code deletes non-ASCII characters without much fuss:
def just_ascii(str):
    return unicode(str).encode('ascii', 'ignore')

URL component % and \x

I have a doubt.
st = "b%C3%BCrokommunikation"
urllib2.unquote(st)
OUTPUT: 'b\xc3\xbcrokommunikation'
But, if I print it:
print urllib2.unquote(st)
OUTPUT: bürokommunikation
Why is there a difference?
I have to write bürokommunikation instead of 'b\xc3\xbcrokommunikation' into a file.
My problem is:
I have lots of data with such values extracted from URLs. I have to store them as, e.g., bürokommunikation in a text file.

When you print the string, your terminal emulator interprets the UTF-8 byte sequence \xc3\xbc as the character ü and displays it correctly.
However, as @MarkDickinson says in the comments, ü doesn't exist in ASCII, so you'll need to tell Python that the string you want to write to a file is unicode, and which encoding you want to use, for instance UTF-8.
This is very easy using the codecs library:
import codecs
import urllib2
# First decode the UTF-8 bytes into a Python unicode string
st = "b%C3%BCrokommunikation"
decoded_string = urllib2.unquote(st).decode('utf-8')
# Write it to the file, encoding it as UTF-8
with codecs.open('my_file.txt', 'w', 'utf-8') as f:
    f.write(decoded_string)

You are looking at the same result. When you evaluate the expression without print, the interpreter shows the result of __repr__(). When you use print, it shows the character itself instead of escaping it with \x.
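The answers above are written for Python 2. As a hedged Python 3 sketch of the same distinction, urllib.parse offers unquote_to_bytes for the raw bytes and unquote for the decoded text (UTF-8 by default); the variable names are illustrative.

```python
from urllib.parse import unquote, unquote_to_bytes

st = "b%C3%BCrokommunikation"

raw_bytes = unquote_to_bytes(st)  # the percent-decoded bytes
text = unquote(st)                # those bytes decoded as UTF-8 (the default)

# repr() of the bytes shows \x escapes; the decoded text holds the real character.
bytes_repr = repr(raw_bytes)
assert raw_bytes.decode("utf-8") == text
```

Writing text to a file opened with open(path, 'w', encoding='utf-8') would then store bürokommunikation as intended.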

How do I convert unicode to unicode-escaped text [duplicate]

I'm loading a file with a bunch of unicode characters (e.g. \xe9\x87\x8b). I want to convert these characters to their escaped-unicode form (\u91cb) in Python. I've found a couple of similar questions here on StackOverflow including this one Evaluate UTF-8 literal escape sequences in a string in Python3, which does almost exactly what I want, but I can't work out how to save the data.
For example:
Input file:
\xe9\x87\x8b
Python Script
file = open("input.txt", "r")
text = file.read()
file.close()
encoded = text.encode().decode('unicode-escape').encode('latin1').decode('utf-8')
file = open("output.txt", "w")
file.write(encoded) # fails with a unicode exception
file.close()
Output File (That I would like):
\u91cb

You need to encode it again with unicode-escape encoding.
>>> br'\xe9\x87\x8b'.decode('unicode-escape').encode('latin1').decode('utf-8')
'釋'
>>> _.encode('unicode-escape')
b'\\u91cb'
Code modified (used binary mode to reduce unnecessary encode/decodes)
with open("input.txt", "rb") as f:
    text = f.read().rstrip() # rstrip to remove trailing spaces
decoded = text.decode('unicode-escape').encode('latin1').decode('utf-8')
with open("output.txt", "wb") as f:
    f.write(decoded.encode('unicode-escape'))
http://asciinema.org/a/797ruy4u5gd1vsv8pplzlb6kq

\xe9\x87\x8b is not a Unicode character. It looks like a representation of a bytestring that represents 釋 Unicode character encoded using utf-8 character encoding. \u91cb is a representation of 釋 character in Python source code (or in JSON format). Don't confuse the text representation and the character itself:
>>> b"\xe9\x87\x8b".decode('utf-8')
u'\u91cb' # repr()
>>> print(b"\xe9\x87\x8b".decode('utf-8'))
釋
>>> import unicodedata
>>> unicodedata.name(b"\xe9\x87\x8b".decode('utf-8'))
'CJK UNIFIED IDEOGRAPH-91CB'
To read text encoded as utf-8 from a file, specify the character encoding explicitly:
with open('input.txt', encoding='utf-8') as file:
    unicode_text = file.read()
It is exactly the same for saving Unicode text to a file:
with open('output.txt', 'w', encoding='utf-8') as file:
    file.write(unicode_text)
If you omit the explicit encoding parameter, locale.getpreferredencoding(False) is used, which may produce mojibake if it does not correspond to the actual character encoding used to save the file.
If your input file literally contains \xe9 (4 characters) then you should fix whatever software generates it. If you need to use 'unicode-escape', something is broken.

It looks as if your input file is UTF-8 encoded so specify UTF-8 encoding when you open the file (Python3 is assumed as per your reference):
with open("input.txt", "r", encoding='utf8') as f:
    text = f.read()
text will contain the content of the file as a str (i.e. unicode string). Now you can write it in unicode escaped form directly to a file by specifying encoding='unicode-escape':
with open('output.txt', 'w', encoding='unicode-escape') as f:
    f.write(text)
The content of your file will now contain unicode-escaped literals:
$ cat output.txt
\u91cb
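One detail worth knowing about this approach is that the unicode-escape codec also escapes newlines and other control characters, so a multi-line file would collapse into a single line. A possible alternative, sketched below with illustrative variable names, is the ascii codec with the backslashreplace error handler, which escapes only what ASCII cannot represent.

```python
# The character from the question followed by a newline.
text = "\u91cb\n"

# unicode-escape turns the newline into the two characters backslash and n.
via_unicode_escape = text.encode("unicode-escape")

# ascii + backslashreplace escapes the CJK character but keeps the newline.
via_backslashreplace = text.encode("ascii", errors="backslashreplace")
```

The same handler can be passed to the built-in open, e.g. open('output.txt', 'w', encoding='ascii', errors='backslashreplace'), which keeps the line structure of the input.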
