String Encodings IDNA -> UTF-8 (Python) - python

String encodings and formats always throw me.
Here's what I have:
'ไทย'
Which I believe is UTF-8, and
'xn--o3cw4h'
Which should be the same thing in IDNA encoding. However, I can't figure out how to get python to convert from one to the other.
I was just trying
a = u'xn--o3cw4h'
b = a.encode('idna')
b.decode('utf-8')
but I get the exact same string back ('xn--o3cw4h', although no longer unicode). I am using python 3.5 currently.

To convert from one encoding to another encoding, one must first decode the string to Unicode, then encode it again in the target encoding.
So, for example:
idna_encoded_bytes = b'xn--o3cw4h'
unicode_string = idna_encoded_bytes.decode('idna')
utf8_encoded_bytes = unicode_string.encode('utf-8')
print (repr(idna_encoded_bytes))
print (repr(utf8_encoded_bytes))
print (repr(unicode_string))
Python2 result:
'xn--o3cw4h'
'\xe0\xb9\x84\xe0\xb8\x97\xe0\xb8\xa2'
u'\u0e44\u0e17\u0e22'
As you can see, the first line is the IDNA encoding of ไทย, the second line is the UTF-8 encoding, and the final line is the unencoded sequence of Unicode code points U+0E44, U+0E17, and U+0E22.
To do the conversion in one step, just chain the operations:
utf8_encoded_bytes = idna_encoded_bytes.decode('idna').encode('utf8')
Responding to a comment:
What I'm starting with isn't b'xn--o3cw4h' but just the string 'xn--o3cw4h' [in Python 3].
You have an odd duck there. You have apparently-encoded data stored in a unicode string. We'll need to convert that to a bytes object somehow. An easy way to do that is to use (confusingly) ASCII encoding:
improperly_encoded_idna = 'xn--o3cw4h'
idna_encoded_bytes = improperly_encoded_idna.encode('ascii')
unicode_string = idna_encoded_bytes.decode('idna')
utf8_encoded_bytes = unicode_string.encode('utf-8')
print (repr(idna_encoded_bytes))
print (repr(utf8_encoded_bytes))
print (repr(unicode_string))
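For Python 3, where the question starts from a str rather than bytes, the steps above can be sketched as follows. This is a minimal sketch using only the built-in 'idna' codec; the variable names are illustrative.

```python
# Python 3 sketch: IDNA text -> str -> UTF-8 bytes, and back again.
idna_text = 'xn--o3cw4h'                     # IDNA form held in a str

# A str cannot be decoded directly, so go through ASCII bytes first.
unicode_string = idna_text.encode('ascii').decode('idna')
utf8_bytes = unicode_string.encode('utf-8')  # the same text as UTF-8 bytes
round_trip = unicode_string.encode('idna')   # back to the IDNA form (bytes)

print(unicode_string)
print(repr(utf8_bytes))
print(repr(round_trip))
```

Note that 'ไทย' as a Python 3 str is neither UTF-8 nor IDNA; it only becomes one or the other once it is encoded to bytes.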

Related

Decode UTF-8 encoding in JSON string

I have a JSON file which contains strings encoded like this:
"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1",
I am trying to parse this file using the json module. However I am not able to decode this string correctly.
What I get after decoding the JSON using .load() method is 'HornÃ\xadková'. The string should be correctly decoded as 'Horníková' instead.
I read the JSON specification and I understand that after \u there should be four hexadecimal digits specifying the Unicode code point of the character. But it seems that in this JSON file UTF-8 encoded bytes are stored as \u-sequences.
What type of encoding is this and how to correctly parse it in Python 3?
Is this type JSON file even valid JSON file according to the specification?
Your text is already encoded, and normally you would tell Python this by using a b prefix on the string. But since you're using json and the input needs to be a string, you have to decode the encoded text manually. Since your input is not bytes, you can use the 'raw_unicode_escape' encoding to convert the string to bytes without re-encoding it, and pass the same encoding to open() so that it does not apply its own default encoding. The resulting bytes can then be decoded as UTF-8.
Note that since you have to read the file content and perform the encoding and decoding on the loaded string, you should use json.loads() instead of json.load().
In [168]: with open('test.json', encoding='raw_unicode_escape') as f:
     ...:     d = json.loads(f.read().encode('raw_unicode_escape').decode())
     ...:
In [169]: d
Out[169]: {'sender_name': 'Horníková'}
The JSON you are reading was written incorrectly and the Unicode strings decoded from it will have to be re-encoded with the wrong encoding used, then decoded with the correct encoding.
Here's an example:
#!python3
import json
# The bad JSON you have
bad_json = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}'
print('bad_json =',bad_json)
# The wanted result from json.loads()
wanted = {'sender_name':'Horníková'}
# What correctly written JSON should look like
good_json = json.dumps(wanted)
print('good_json =',good_json)
# What you get when loading the bad JSON.
got = json.loads(bad_json)
print('wanted =',wanted)
print('got =',got)
# How to correct the mojibake string
corrected_sender = got['sender_name'].encode('latin1').decode('utf8')
print('corrected_sender =',corrected_sender)
Output:
bad_json = {"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}
good_json = {"sender_name": "Horn\u00edkov\u00e1"}
wanted = {'sender_name': 'Horníková'}
got = {'sender_name': 'HornÃ\xadková'}
corrected_sender = Horníková
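The same latin1/utf8 round trip can be applied to every string in the parsed document instead of one field at a time. The following is a minimal sketch; fix_mojibake is a hypothetical helper name, and it assumes every damaged string was produced the same way (UTF-8 bytes misread as Latin-1).

```python
import json

def fix_mojibake(value):
    """Hypothetical helper: undo UTF-8-read-as-Latin-1 damage, recursively."""
    if isinstance(value, str):
        try:
            return value.encode('latin-1').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            return value  # not this kind of mojibake, leave it untouched
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, dict):
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    return value  # numbers, booleans and None pass through unchanged

bad_json = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}'
fixed = fix_mojibake(json.loads(bad_json))
print(fixed)
```

Repairing after json.loads() rather than before keeps the JSON parser in charge of escapes such as \" and \\, which a text-level rewrite could otherwise disturb.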
I don't know enough about JSON to be able to say whether this is valid or not, but you can parse these strings using the raw_unicode_escape codec:
>>> "Horn\u00c3\u00adkov\u00c3\u00a1".encode('raw_unicode_escape').decode('utf8')
'Horníková'
Reencode to bytes, and then redecode to text.
>>> 'HornÃ\xadková'.encode('latin-1').decode('utf-8')
'Horníková'
Is this type JSON file even valid JSON file according to the specification?
No.
A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes [emphasis added].
source
A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). [...] Any code point may be represented as a hexadecimal escape sequence [...] represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point [emphasis added].
source
UTF-8 byte sequences are neither Unicode characters nor Unicode code points.

decode python binary string but not ensure ascii symbols

I have a binary object:
b'{"node": "\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435"}}'
and I want it to be printed in Unicode and not strictly using ASCII symbols.
There is a hacky way to do it:
decoded = string.decode()
parsed_to_dict = json.loads(decoded)
dumped = json.dumps(parsed_to_dict, ensure_ascii=False)
print(dumped)
>>> {"node": "Обновление"}
however the text will not always be parseable as JSON, so I need a simpler way.
Is there a way to print out my binary object (or a decoded Unicode string) as a non-ASCII string without going through parsing/dumping JSON?
For example, how to print this b'\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435' as Обновление?
A bytes string like
b'\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435'
has been encoded using Unicode escape sequences. To convert it back into a proper Unicode string you simply need to specify the 'unicode-escape' codec:
data = b'\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435'
out = data.decode('unicode-escape')
print(out)
output
Обновление
However, if data is already a Unicode string, then you first need to encode it to bytes. You can do that using the ascii codec, presuming data only contains ASCII characters. If it contains characters outside ASCII but within the range of \x80 to \xff you may be able to use the 'latin1' codec.
data = '\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435'
out = data.encode('ascii').decode('unicode-escape')
This should work so long as all the escapes are valid (no single \).
import ast
bytes_object = b'{"node": "\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435"}}'
unicode_string = ast.literal_eval("'{}'".format(bytes_object.decode()))
output:
'{"node": "Обновление"}}'
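One caveat with the 'unicode-escape' codec is that it reads any byte that is not part of an escape as Latin-1, so text that already contains real non-ASCII characters gets mangled. A hedged alternative is to expand only the \uXXXX sequences with a regular expression. In this sketch unescape_u is a hypothetical helper, and it does not combine surrogate pairs.

```python
import re

# Matches a literal backslash, the letter u, and exactly four hex digits.
_U_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def unescape_u(text):
    """Hypothetical helper: expand only \\uXXXX escapes, leave the rest alone."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

escaped = '\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435'
print(unescape_u(escaped))

# Mixed input: a real non-ASCII character next to an escape sequence.
mixed = 'é \\u0431'
via_codec = mixed.encode('utf-8').decode('unicode-escape')  # mangles the é
via_regex = unescape_u(mixed)                               # keeps the é
print(repr(via_codec), repr(via_regex))
```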

Python 3 Decoding Strings

I understand that this is likely a repeat question, but I'm having trouble finding a solution.
In short I have a string I'd like to decode:
raw = "\x94my quote\x94"
string = decode(raw)
expected from string
'"my quote"'
Last point of note is that I'm working with Python 3 so raw is unicode, and thus is already decoded. Given that, what exactly do I need to do to "decode" the "\x94" characters?
string = "\x22my quote\x22"
print(string)
You don't need to decode, Python 3 does that for you, but you need the correct escape code for the double quote ("), which is \x22 rather than \x94.
If however you have a different character set, it appears you have Windows-1252, then you need to decode the byte string from that character set:
str(b"\x94my quote\x94", "windows-1252")
If your string isn't a byte string you have to encode it first, I found the latin-1 encoding to work:
string = "\x94my quote\x94"
str(string.encode("latin-1"), "windows-1252")
I don't know if this is what you mean, but this works:
some_binary = b"\x94my quote\x94"
result = some_binary.decode("windows-1252")  # a bare .decode() assumes UTF-8 and fails on byte 0x94
And you get the result.
If you don't know which encoding to choose, you can use chardet.detect:
import chardet
chardet.detect(some_binary)
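Did you try it like this? I think you need to call decode as a method of the bytes class and pass utf-8 as the argument. Add b in front of the string too. Note that with 'ignore' the \x94 bytes are not valid UTF-8 and are silently dropped, so this prints my quote without the quotation marks.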
string = b"\x94my quote\x94"
decoded_str = string.decode('utf-8', 'ignore')
print(decoded_str)
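To tie the answers above together, here is a small sketch of the Latin-1 then Windows-1252 round trip. It assumes the \x94 characters really came from Windows-1252 text, where byte 0x94 is the right curly quote (U+201D); the final replace step is only needed if plain ASCII quotes are wanted, as in the question.

```python
raw = "\x94my quote\x94"  # str whose code points are really cp1252 byte values

# Latin-1 maps code points 0-255 straight back to the same byte values,
# and Windows-1252 then gives those bytes their intended meaning.
fixed = raw.encode("latin-1").decode("windows-1252")
print(fixed)

# Optional: normalise curly double quotes to the ASCII double quote.
plain = fixed.replace("\u201c", '"').replace("\u201d", '"')
print(plain)
```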

Python UnicodeDecodeError when writing German letters

I've been banging my head on this error for some time now and I can't seem to find a solution anywhere on SO, even though there are similar questions.
Here's my code:
f = codecs.open(path, "a", encoding="utf-8")
value = "Bitte überprüfen"
f.write(("\"%s\" = \"%s\";\n" % ("no_internet", value)).encode("utf-8"))
And what I get as an error is:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128)
Why ascii if I say utf-8? I would really appreciate any help.
Try:
value = u"Bitte überprüfen"
in order to declare value as a unicode string and
# -*- coding: utf-8 -*-
at the start of your file in order to declare that your python file is saved with utf-8 encoding.
For the sake of never being hurt by unicode errors ever again, switch to python3:
% python3
>>> with open('/tmp/foo', 'w') as f:
...     value = "Bitte überprüfen"
...     f.write(('"{}" = "{}";\n'.format('no_internet', value)))
...
36
>>> import sys
>>> sys.exit(0)
% cat /tmp/foo
"no_internet" = "Bitte überprüfen";
though if you're really tied to python2 and have no choice:
% python2
>>> with open('/tmp/foo2', 'w') as f:
...     value = u"Bitte überprüfen"
...     f.write(('"{}" = "{}";\n'.format('no_internet', value.encode('utf-8'))))
...
>>> import sys
>>> sys.exit(0)
% cat /tmp/foo2
"no_internet" = "Bitte überprüfen";
And as @JuniorCompressor suggests, don't forget to add # encoding: utf-8 at the start of your python2 file to tell python to read the source file as UTF-8, not as ASCII!
Your error in:
f.write(("\"%s\" = \"%s\";\n" % ("no_internet", value)).encode("utf-8"))
is that you're encoding the whole formatted string into utf-8, whereas you shall encode the value string into utf-8 before doing the format:
>>> with open('/tmp/foo2', 'w') as f:
...     value = u"Bitte überprüfen"
...     f.write(('"{}" = "{}";\n'.format('no_internet', value).encode('utf-8')))
...
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 6: ordinal not in range(128)
This happens because calling .format() on a byte-string template makes python implicitly convert the unicode value (which is what u"" creates) to a byte string using the ASCII codec, and that fails on ü. So you need to explicitly encode that value to UTF-8 before feeding it to the format parser to build the new string.
As Karl says in his answer, Python2 is totally messy/buggy when using unicode strings, defeating the Explicit is better than implicit zen of python. And for more weird behaviour, the following works just fine in python2:
>>> value = "Bitte überprüfen"
>>> out = '"{}" = "{}";\n'.format('no_internet', value)
>>> out
'"no_internet" = "Bitte \xc3\xbcberpr\xc3\xbcfen";\n'
>>> print(out)
"no_internet" = "Bitte überprüfen";
Still not convinced to switch to python3 ? :-)
Update:
This is the way to go to read a unicode string from one file and write it to another:
% echo "Bitte überprüfen" > /tmp/foobar
% python2
>>> with open('/tmp/foobar', 'r') as f:
...     data = f.read().decode('utf-8').strip()
...
>>>
>>> with open('/tmp/foo2', 'w') as f:
...     f.write(('"{}" = "{}";\n'.format('no_internet', data.encode('utf-8'))))
...
>>> import sys;sys.exit(0)
% cat /tmp/foo2
"no_internet" = "Bitte überprüfen";
Update:
as a general rule:
when you get a UnicodeDecodeError, a byte string is being decoded implicitly, so call .decode('utf-8') explicitly on the byte string that contains the encoded data, and
when you get a UnicodeEncodeError, a unicode string is being encoded implicitly, so call .encode('utf-8') explicitly on the unicode string
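The direction of each error can be checked directly. The sketch below uses Python 3 syntax, where the conversions have to be written out explicitly; in Python 2 the same ASCII conversions happen implicitly. The classify helper is hypothetical and only exists for this illustration.

```python
def classify(action):
    """Hypothetical helper: report which Unicode error, if any, a call raises."""
    try:
        action()
    except UnicodeDecodeError:
        return "decode error"  # bytes -> text failed
    except UnicodeEncodeError:
        return "encode error"  # text -> bytes failed
    return "ok"

encoded = b"Bitte \xc3\xbcberpr\xc3\xbcfen"  # UTF-8 bytes
text = u"Bitte \xfcberpr\xfcfen"             # the same words as text

print(classify(lambda: encoded.decode("ascii")))  # wrong codec for these bytes
print(classify(lambda: text.encode("ascii")))     # ü has no ASCII encoding
print(classify(lambda: encoded.decode("utf-8")))
print(classify(lambda: text.encode("utf-8")))
```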
Update: if you cannot update to python3, you can at least make your python2 behave like it is almost python3, using the following python-future import statement:
from __future__ import absolute_import, division, print_function, unicode_literals
HTH
Like already suggested your error results from this line:
f.write(("\"%s\" = \"%s\";\n" % ("no_internet", value)).encode("utf-8"))
it should be:
f.write(('"{}" = "{}";\n'.format('no_internet', value.encode('utf-8'))))
A note on unicode and encodings
If working with Python 2, software should only work with unicode strings internally, converting to a particular encoding on output.
To avoid making the same error over and over again, you should make sure you understand the difference between the ascii and utf-8 encodings and also between str and unicode objects in Python.
The difference between ASCII and UTF-8 encoding:
Ascii needs just one byte to represent all possible characters in the ascii charset/encoding. UTF-8 needs up to four bytes to represent the complete charset.
ascii (default)
1 If the code point is < 128, each byte is the same as the value of the code point.
2 If the code point is 128 or greater, the Unicode string can’t be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)
utf-8 (unicode transformation format)
1 If the code point is <128, it’s represented by the corresponding byte value.
2 If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
3 Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.
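The byte counts in the rules above can be observed directly. This is a small sketch in Python 3 syntax; the four sample characters are arbitrary picks, one from each size class.

```python
# One sample character per UTF-8 length class (1, 2, 3 and 4 bytes).
samples = [u"A", u"\u00fc", u"\u20ac", u"\U0001F600"]

# Number of bytes each character occupies once encoded as UTF-8.
lengths = [len(ch.encode("utf-8")) for ch in samples]
for ch, n in zip(samples, lengths):
    print(hex(ord(ch)), n)

# ASCII can only represent code points below 128.
try:
    u"\u00fc".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```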
The difference between str and unicode objects:
You can say that str is basically a byte string and unicode is a unicode string. A str holds bytes in some encoding like ascii or utf-8, while a unicode object holds code points and has no encoding until you encode it.
str vs. unicode
1 str = byte string (8-bit) - uses \x and two digits
2 unicode = unicode string - uses \u and four digits
3 basestring
      /\
     /  \
  str    unicode
If you follow some simple rules you should go fine with handling str/unicode objects in different encodings like ascii or utf-8 or whatever encoding you have to use:
Rules
1 encode(): Gets you from Unicode -> bytes
encode([encoding], [errors='strict']), returns an 8-bit string version of the Unicode string,
2 decode(): Gets you from bytes -> Unicode
decode([encoding], [errors]) method that interprets the 8-bit string using the given encoding
3 codecs.open(encoding="utf-8"): Read and write files directly to/from Unicode (you can use any encoding, not just utf-8, but utf-8 is most common).
4 u'': Makes your string literals into Unicode objects rather than byte sequences.
5 unicode(string[, encoding, errors])
Warning: Don’t use encode() on bytes or decode() on Unicode objects
And again: Software should only work with Unicode strings internally, converting to a particular encoding on output.
Why ascii if I say utf-8?
Because in Python 2, "Bitte überprüfen" is not a Unicode string. Before it can be .encoded by your explicit call, Python must implicitly decode it to Unicode (This is also why it raises a UnicodeDecodeError), and it chooses ASCII because it has no other information to work with. The ü is represented with some byte with value >= 128, so it's not valid ASCII.
The u prefix shown by @JuniorCompressor will make it a Unicode string, and you should specify the encoding for the file as well (don't just blindly set utf-8; it needs to match whatever your text editor saves the .py file with!).
Switching to Python 3 is realistically (part of) a better long-term solution :) but it is still essential to understand the problem. See http://bit.ly/unipain for more details. The Python 2 behaviour is really a bug, or at least a failure to meet Pythonic design principles: Explicit is better than implicit, and here we see why very clearly ;)
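For code that has to run on both Python 2.7 and Python 3, one hedged option is io.open, which is the same function as the built-in open in Python 3. The sketch below keeps everything as text and lets the file object do the encoding; the temporary file path is only there to make the example self-contained.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

value = u"Bitte überprüfen"
line = u'"{}" = "{}";\n'.format(u"no_internet", value)  # stays text throughout

# Hypothetical output location for this sketch.
path = os.path.join(tempfile.mkdtemp(), "strings.txt")

# Pass text to write(); the file object encodes it as UTF-8 on the way out.
with io.open(path, "a", encoding="utf-8") as f:
    f.write(line)

# Read it back as text to confirm nothing was lost.
with io.open(path, "r", encoding="utf-8") as f:
    read_back = f.read()
print(read_back == line)
```

The key difference from the code in the question is that there is no .encode() call before write(): encoding happens exactly once, inside the file object.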

Python, Encoding output to UTF-8

I have a definition that builds a string composed of UTF-8 encoded characters. The output files are opened using 'w+', "utf-8" arguments.
However, when I try to x.write(string) I get the UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 1: ordinal not in range(128)
I assume this is because normally, for example, you would do print(u'something'). But I need to use a variable, and the quotation marks in u'_' rule that out...
Any suggestions?
EDIT: Actual code here:
source = codecs.open("actionbreak/" + target + '.csv','r', "utf-8")
outTarget = codecs.open("actionbreak/" + newTarget, 'w+', "utf-8")
x = str(actionT(splitList[0], splitList[1]))
outTarget.write(x)
Essentially all this is supposed to be doing is building me a large amount of strings that look similar to this:
[日木曜 Deliverables]= CASE WHEN things = 11
THEN C ELSE 0 END
Are you using codecs.open()? Python 2.7's built-in open() does not support a specific encoding, meaning you have to manually encode non-ascii strings (as others have noted), but codecs.open() does support that and would probably be easier to drop in than manually encoding all the strings.
As you are actually using codecs.open(), going by your added code, and after a bit of looking things up myself, I suggest attempting to open the input and/or output file with encoding "utf-8-sig", which will automatically handle the BOM for UTF-8 (see http://docs.python.org/2/library/codecs.html#encodings-and-unicode, near the bottom of the section). I would think that would only matter for the input file. But if none of those combinations (utf-8-sig/utf-8, utf-8/utf-8-sig, utf-8-sig/utf-8-sig) work, then I believe the most likely situation is that your input file is encoded in a different Unicode format with a BOM: Python's default UTF-8 codec treats a BOM as a regular character, so the input would not have an issue but the output could.
Just noticed this, but... when you use codecs.open(), it expects a Unicode string, not an encoded one; try x = unicode(actionT(splitList[0], splitList[1])).
Your error can also occur when attempting to decode a unicode string (see http://wiki.python.org/moin/UnicodeEncodeError), but I don't think that should be happening unless actionT() or your list-splitting does something to the Unicode strings that causes them to be treated as non-Unicode strings.
In python 2.x there are two types of string: byte string and unicode string. First one contains bytes and last one - unicode code points. It is easy to determine, what type of string it is - unicode string starts with u:
# byte string
>>> 'abc'
'abc'
# unicode string:
>>> u'abc абв'
u'abc \u0430\u0431\u0432'
'abc' chars are the same, because they are in the ASCII range. \u0430 is a unicode code point; it is outside the ASCII range. Code points are the abstract numbers that identify characters, and they can't be saved to a file directly. They need to be encoded to bytes first. Here is how an encoded unicode string looks (once encoded, it becomes a byte string):
>>> s = u'abc абв'
>>> s.encode('utf8')
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
This encoded string now can be written to file:
>>> s = u'abc абв'
>>> with open('text.txt', 'w+') as f:
...     f.write(s.encode('utf8'))
Now, it is important to remember which encoding we used when writing to the file, because to be able to read the data back we need to decode the content. Here is what the data looks like without decoding:
>>> with open('text.txt', 'r') as f:
...     content = f.read()
>>> content
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
You see, we've got encoded bytes, exactly the same as in s.encode('utf8'). To decode them, you need to provide the encoding name:
>>> content.decode('utf8')
u'abc \u0430\u0431\u0432'
After decode, we've got back our unicode string with unicode code points.
>>> print content.decode('utf8')
abc абв
xgord is right, but for further edification it's worth noting exactly what \ufeff means. It's known as a BOM or byte order mark, and basically it's a throwback to the early days of unicode when people couldn't agree which way they wanted their bytes ordered. Many unicode documents, especially UTF-16 ones, start with the code point U+FEFF, which shows up as the bytes FE FF or FF FE depending on the byte order chosen. In UTF-8 it is the three bytes EF BB BF and only acts as a signature.
If you hit an error on that character in the first position, the likely issue is that the BOM is not being handled when decoding (for UTF-8, the 'utf-8-sig' codec strips it), and the file is probably still fine.
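The difference between the 'utf-8' and 'utf-8-sig' codecs mentioned above can be seen on a few bytes. This is a minimal sketch in Python 3 syntax; the sample text is borrowed from the question, and the BOM is prepended by hand to stand in for the input file.

```python
import codecs

text = u"[日木曜 Deliverables]= CASE WHEN things = 11"
data = codecs.BOM_UTF8 + text.encode("utf-8")  # file content starting with a BOM

kept = data.decode("utf-8")          # BOM survives as the character U+FEFF
stripped = data.decode("utf-8-sig")  # BOM is removed while decoding

print(repr(kept[0]))
print(stripped == text)
```

In Python 2, passing a unicode string that still contains U+FEFF to str() triggers the implicit ASCII encode seen in the question's traceback, which is why decoding with 'utf-8-sig' or calling unicode() instead of str() avoids the error.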
