PEP 263 specifies that source files with a declared encoding are processed in the following order:
1. read the file
2. decode it into Unicode assuming a fixed per-file encoding
3. convert it into a UTF-8 byte string
4. tokenize the UTF-8 content
5. compile it, creating Unicode objects from the given Unicode data and creating string objects from the Unicode literal data by first reencoding the UTF-8 data into 8-bit string data using the given file encoding
So, if I take this code:
print 'abcdefgh'
print u'abcdefgh'
And convert it to ROT-13:
# coding: rot13
cevag 'nopqrstu'
cevag h'nopqrstu'
I would expect that it is first decoded and then becomes identical to the original, printing:
abcdefgh
abcdefgh
But instead, it prints:
nopqrstu
abcdefgh
So, the unicode literal works as expected, but the str literal remains unconverted. Why?
Eliminating some possibilities:
I confirmed that the problem is not in a later phase (printing to console), but immediately at parsing, because this code produces "ValueError: unsupported format character 'q' (0x71) at index 1":
x = '%q' % 1 # that is %d !
I guess the last point actually explains what happens quite accurately:
compile it, creating Unicode objects from the given Unicode data and creating string objects from the Unicode literal data by first reencoding the UTF-8 data into 8-bit string data using the given file encoding
After the first 4 steps, the contents of the source file are a tokenized unicode version of the following string:
print 'abcdefgh'
print u'abcdefgh'
After that, in step 5, the string object 'abcdefgh' is reencoded into 8-bit string data using the given file encoding (which is rot13), so the contents become:
print 'nopqrstu'
print u'abcdefgh'
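You can mimic steps 2 and 5 interactively; a minimal Python 2 sketch, assuming the rot13 codec is available just as in the coding declaration above:
>>> decoded = 'nopqrstu'.decode('rot13')   # step 2: what the tokenizer sees
>>> decoded
u'abcdefgh'
>>> decoded.encode('rot13')                # step 5: the str literal re-encoded with the file encoding
'nopqrstu'
The unicode literal skips that final re-encoding step, which is why only u'abcdefgh' survives intact.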
Related
Is there an elegant way to convert "test\207\128" into "testπ" in python?
My issue stems from using avahi-browse on Linux, which has a -p flag to output information in an easy-to-parse format. However, the problem is that it outputs non-alphanumeric characters as escaped sequences. So a service published as "name#id" gets output by avahi-browse as "name\035id". This can be dealt with by splitting on the \, dropping the leading zero and using chr(35) to recover the #. This solution breaks on multi-byte UTF-8 characters such as "π", which gets output as "\207\128".
The input string you have is an encoding of a UTF-8 string, in a format that Python can't deal with natively. This means you'll need to write a simple decoder, then use Python to translate the UTF-8 string to a string object:
import re
value = r"test\207\128"
# First, turn this into a byte string, since the \### escapes describe raw UTF-8 bytes
value = value.encode("utf-8")
# Now replace any "\###" with a byte character based off
# the decimal number captured
value = re.sub(b"\\\\([0-9]{3})", lambda m: bytes([int(m.group(1))]), value)
# And now that we have a normal UTF-8 string, decode it back to a string
value = value.decode("utf-8")
print(value)
# Outputs: testπ
I have a string whose value is 'Opérations'. In my script I will read a file and do some comparisons. While comparing strings, the string that I have copied from the same source and placed in my Python script does NOT compare equal to the string that I get when reading the same file in my script. Printing both strings gives me 'Opérations'. However, when I encode them to UTF-8 I notice the difference.
b'Ope\xcc\x81rations'
b'Op\xc3\xa9rations'
My question is what do I do to ensure that the special character in my python script is the same as the file content's when comparing such strings.
Good to know:
You are talking about two types of strings: byte strings and unicode strings. Each has a method to convert it to the other type. Unicode strings have an .encode() method that produces bytes, and byte strings have a .decode() method that produces unicode. That means:
unicode.encode() ----> bytes
and
bytes.decode() -----> unicode
UTF-8 is easily the most popular encoding for storage and transmission of Unicode. It uses a variable number of bytes for each code point. The higher the code point value, the more bytes it needs in UTF-8.
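As a quick illustration of that last point (Python 2 here, to match the rest of this answer; the sample characters are arbitrary):
# -*- coding: utf-8 -*-
print len(u'A'.encode('utf-8'))   # 1 byte  (ASCII)
print len(u'é'.encode('utf-8'))   # 2 bytes (Latin-1 range)
print len(u'€'.encode('utf-8'))   # 3 bytes (Basic Multilingual Plane)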
Get to the point:
If you redefine your strings as two byte strings and two unicode strings, as follows:
a_byte = b'Ope\xcc\x81rations'
a_unicode = u'Ope\xcc\x81rations'
and
b_byte = b'Op\xc3\xa9rations'
b_unicode = u'Op\xc3\xa9rations'
you will see:
print 'a_byte length is: ', len(a_byte.decode("utf-8"))
#print 'a_unicode length is: ', len(a_unicode.encode("utf-8"))
print 'b_byte length is: ', len(b_byte.decode("utf-8"))
#print 'b_unicode length is: ', len(b_unicode.encode("utf-8"))
output:
a_byte length is: 11
b_byte length is: 10
So you see they are not the same.
My solution:
If you don't want to be confused, you can use repr(). While print a_byte, b_byte prints Opérations for both strings, the following:
print repr(a_byte),repr(b_byte)
will return:
'Ope\xcc\x81rations','Op\xc3\xa9rations'
You can also normalize the unicode before comparison, as in @Daniel's answer, as follows:
from unicodedata import normalize
from functools import partial
a_byte = 'Opérations'
norm = partial(normalize, 'NFC')
your_string = norm(a_byte.decode('utf8'))
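To see why normalization fixes the comparison, here is a minimal Python 2 sketch built from the two byte strings in the question:
from unicodedata import normalize

a_unicode = 'Ope\xcc\x81rations'.decode('utf8')  # "e" + combining acute accent
b_unicode = 'Op\xc3\xa9rations'.decode('utf8')   # precomposed "e with acute"

print a_unicode == b_unicode                                       # False
print normalize('NFC', a_unicode) == normalize('NFC', b_unicode)   # True
NFC composes the letter-plus-combining-accent sequence into the single precomposed character, so both forms end up identical.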
I have JSON file which contains followingly encoded strings:
"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1",
I am trying to parse this file using the json module. However I am not able to decode this string correctly.
What I get after decoding the JSON using .load() method is 'HornÃ\xadková'. The string should be correctly decoded as 'Horníková' instead.
I read the JSON specification and I understand that after \u there should be 4 hexadecimal digits specifying the Unicode number of the character. But it seems that in this JSON file UTF-8 encoded bytes are stored as \u-sequences.
What type of encoding is this and how to correctly parse it in Python 3?
Is this type JSON file even valid JSON file according to the specification?
Your text has effectively been encoded already, which you would normally signal to Python with a b prefix on the string; but since you're using json and its input needs to be a string, you have to undo the bogus encoding manually. Because your input is not bytes, you can use the 'raw_unicode_escape' codec to convert the string to bytes without really encoding it, and also to keep open() from applying its own default encoding. Then you can simply decode the result as UTF-8 to get the desired text.
Note that since you need to do the encoding and decoding yourself, you have to read the file content and perform the conversion on the loaded string, which means using json.loads() instead of json.load().
In [168]: with open('test.json', encoding='raw_unicode_escape') as f:
...: d = json.loads(f.read().encode('raw_unicode_escape').decode())
...:
In [169]: d
Out[169]: {'sender_name': 'Horníková'}
The JSON you are reading was written incorrectly: the Unicode strings decoded from it have to be re-encoded with the wrong encoding that was originally used, then decoded with the correct encoding.
Here's an example:
#!python3
import json
# The bad JSON you have
bad_json = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}'
print('bad_json =',bad_json)
# The wanted result from json.loads()
wanted = {'sender_name':'Horníková'}
# What correctly written JSON should look like
good_json = json.dumps(wanted)
print('good_json =',good_json)
# What you get when loading the bad JSON.
got = json.loads(bad_json)
print('wanted =',wanted)
print('got =',got)
# How to correct the mojibake string
corrected_sender = got['sender_name'].encode('latin1').decode('utf8')
print('corrected_sender =',corrected_sender)
Output:
bad_json = {"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}
good_json = {"sender_name": "Horn\u00edkov\u00e1"}
wanted = {'sender_name': 'Horníková'}
got = {'sender_name': 'HornÃ\xadková'}
corrected_sender = Horníková
I don't know enough about JSON to be able to say whether this is valid or not, but you can parse these strings using the raw_unicode_escape codec:
>>> "Horn\u00c3\u00adkov\u00c3\u00a1".encode('raw_unicode_escape').decode('utf8')
'Horníková'
Reencode to bytes, and then redecode to text.
>>> 'HornÃ\xadková'.encode('latin-1').decode('utf-8')
'Horníková'
Is this type JSON file even valid JSON file according to the specification?
No.
A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes [emphasis added].
source
A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). [...] Any code point may be represented as a hexadecimal escape sequence [...] represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point [emphasis added].
source
UTF-8 byte sequences are neither Unicode characters nor Unicode code points.
I have a definition that builds a string composed of UTF-8 encoded characters. The output files are opened using 'w+', "utf-8" arguments.
However, when I try to x.write(string) I get the UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 1: ordinal not in range(128)
I assume this is because normally, for example, you would do print(u'something'). But I need to use a variable, and the quotation marks in u'_' negate that...
Any suggestions?
EDIT: Actual code here:
source = codecs.open("actionbreak/" + target + '.csv','r', "utf-8")
outTarget = codecs.open("actionbreak/" + newTarget, 'w+', "utf-8")
x = str(actionT(splitList[0], splitList[1]))
outTarget.write(x)
Essentially, all this is supposed to be doing is building a large number of strings that look similar to this:
[日木曜 Deliverables]= CASE WHEN things = 11
THEN C ELSE 0 END
Are you using codecs.open()? Python 2.7's built-in open() does not support specifying an encoding, meaning you have to manually encode non-ASCII strings (as others have noted), but codecs.open() does support that and would probably be easier to drop in than manually encoding all the strings.
As you are actually using codecs.open(), going by your added code, and after a bit of looking things up myself, I suggest opening the input and/or output file with encoding "utf-8-sig", which will automatically handle the BOM for UTF-8 (see http://docs.python.org/2/library/codecs.html#encodings-and-unicode, near the bottom of the section). I would think that would only matter for the input file, but if none of those combinations (utf-8-sig/utf-8, utf-8/utf-8-sig, utf-8-sig/utf-8-sig) work, then the most likely situation is that your input file is encoded in a different Unicode format with a BOM, since Python's default UTF-8 codec interprets a BOM as a regular character, so the input would not raise an error but the output could.
Just noticed this, but... when you use codecs.open(), it expects a Unicode string, not an encoded one; try x = unicode(actionT(splitList[0], splitList[1])).
Your error can also occur when attempting to decode a unicode string (see http://wiki.python.org/moin/UnicodeEncodeError), but I don't think that should be happening unless actionT() or your list-splitting does something to the Unicode strings that causes them to be treated as non-Unicode strings.
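Putting those suggestions together, a hedged sketch of what the fixed script might look like (the file names and the row being built are placeholders, not your actual actionT()):
# -*- coding: utf-8 -*-
import codecs

# "utf-8-sig" strips a leading BOM from the input, if there is one
source = codecs.open('actionbreak/input.csv', 'r', 'utf-8-sig')
outTarget = codecs.open('actionbreak/output.csv', 'w+', 'utf-8')

for line in source:
    # Build a unicode string (note the u prefix), not a byte string:
    # files opened with codecs.open() expect unicode and encode it for you.
    row = u'[日木曜 Deliverables]= CASE WHEN things = %s THEN C ELSE 0 END\n' % line.strip()
    outTarget.write(row)

source.close()
outTarget.close()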
In Python 2.x there are two types of string: byte strings and unicode strings. The first contains bytes and the second contains Unicode code points. It is easy to tell which type a string is: a unicode string starts with u:
# byte string
>>> 'abc'
'abc'
# unicode string:
>>> u'abc абв'
u'abc \u0430\u0431\u0432'
The 'abc' characters look the same because they are in the ASCII range. \u0430 is a Unicode code point; it is outside the ASCII range. A "code point" is Python's internal representation of a Unicode character, and code points can't be saved to a file directly; they need to be encoded to bytes first. Here is what an encoded unicode string looks like (once encoded, it becomes a byte string):
>>> s = u'abc абв'
>>> s.encode('utf8')
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
This encoded string can now be written to a file:
>>> s = u'abc абв'
>>> with open('text.txt', 'w+') as f:
... f.write(s.encode('utf8'))
Now it is important to remember what encoding we used when writing to the file, because to read the data back we need to decode the content. Here is what the data looks like without decoding:
>>> with open('text.txt', 'r') as f:
... content = f.read()
>>> content
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
You see, we've got encoded bytes, exactly the same as from s.encode('utf8'). To decode them, we need to provide the codec name:
>>> content.decode('utf8')
u'abc \u0430\u0431\u0432'
After decoding, we've got back our unicode string with Unicode code points.
>>> print content.decode('utf8')
abc абв
xgord is right, but for further edification it's worth noting exactly what \ufeff means. It's known as a BOM or byte order mark, and it's a throwback to the early days of Unicode, when people couldn't agree on byte order. Unicode documents are often prefaced with a BOM, which reads as \ufeff in the intended byte order and as \ufffe when the bytes are swapped.
If you hit an error on that character in the very first position, you can be sure the issue is that you are not decoding the data as UTF-8, and the file itself is probably still fine.
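A small Python 2 illustration of the difference (the text after the BOM is arbitrary):
>>> data = '\xef\xbb\xbfsomething'   # UTF-8 BOM followed by ordinary text
>>> data.decode('utf-8')             # plain utf-8 keeps the BOM as a character
u'\ufeffsomething'
>>> data.decode('utf-8-sig')         # utf-8-sig strips it
u'something'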
I am attempting to pipe something to a subprocess using the following line:
p.communicate("insert into egg values ('egg');");
TypeError: must be bytes or buffer, not str
How can I convert the string to a buffer?
The correct answer is:
p.communicate(b"insert into egg values ('egg');");
Note the leading b, telling you that it's a string of bytes, not a string of unicode characters. Also, if you are reading this from a file:
value = open('thefile', 'rt').read()
p.communicate(value);
Then change that to:
value = open('thefile', 'rb').read()
p.communicate(value);
Again, note the 'b'.
Now if your value is a string you get from an API that only returns strings no matter what, then you need to encode it.
p.communicate(value.encode('latin-1'));
Latin-1, because unlike ASCII it supports all 256 bytes. But that said, having binary data in unicode is asking for trouble. It's better if you can make it binary from the start.
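Putting it together, a minimal Python 3 sketch; cat is just a stand-in for whatever process you are actually piping to:
import subprocess

# The payload is bytes (note the b prefix), so communicate() accepts it.
p = subprocess.Popen(['cat'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, err = p.communicate(b"insert into egg values ('egg');")
print(out)   # b"insert into egg values ('egg');"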
You can convert it to bytes with encode method:
>>> "insert into egg values ('egg');".encode('ascii') # ascii is just an example
b"insert into egg values ('egg');"