This is Python 2.
My understanding is:
decode is about decoding from bytes to anything (ASCII / code points / UTF-8 / whatever...)
encode is about encoding from Unicode code points to anything (bytes / ASCII / UTF-8 / ...)
From the code below:
>>> myUtf8
'Hi \xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'
>>> myUtf8.decode("ascii", "replace")
u'Hi \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd'
>>> myUtf8.decode('utf-16')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/encodings/utf_16.py", line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0xa4 in position 18: truncated data
Question:
Why is the code point \ufffd substituted for the bytes that can't be decoded?
Edit:
Question:
Say bytes are being received over the network. How do I find out which encoding those bytes are expressed in?
Your understanding is wrong. decode is "from bytes expressed in the encoding you pass as the first parameter, to unicode"; encode is "from unicode to bytes expressed in the encoding you pass as the first parameter".
In your example, you give some bytes expressed in UTF-8 and tell Python to interpret them as ASCII in order to build a unicode string; since all the bytes above 127 aren't valid ASCII, they are considered garbage, and thus, as you requested with the "replace" argument, each of them is replaced with the Unicode replacement character U+FFFD.
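Decoded with the encoding the bytes were actually produced in, they come back intact; this is the decode direction working as intended:

>>> myUtf8.decode('utf-8')
u'Hi \u2119\u01b4\u2602\u210c\xf8\u1f24'

As for the edited question: bytes do not carry their encoding with them. You find it out from out-of-band metadata (an HTTP Content-Type header, a protocol specification, an agreement with the sender) or, failing that, you guess statistically, for example with the third-party chardet library (a sketch; chardet only guesses, and its confidence varies with the input):

>>> import chardet
>>> chardet.detect(myUtf8)['encoding']
'utf-8'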
Related
So, I'm having issues with Python 3 encoding. I have a few bytes I want to work with as strings. (Long story.)
In a few words, this works:
a = "\x85".encode()
print(a.decode())
But this doesn't:
b = (0x85).to_bytes(1,"big")
print(b.decode())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 0: invalid start byte
I have read a handful of articles on the subject, but they insist that "Python 3 is broken" or that "you shouldn't be using strings for that". Plenty of articles on Stack Overflow just use workarounds (such as "use replace on errors" or "use utf-16").
Could anyone tell me where the difference lies and why the first one works while the second one doesn't? Shouldn't both of them work identically? Why can't UTF-8 decode the byte on the second attempt?
In the first case '\x85'.encode() encodes the Unicode code point U+0085 in the Python 3 default encoding of UTF-8. So the output is the correct two-byte UTF-8 encoding of that code point:
>>> '\x85'.encode()
b'\xc2\x85'
Decode then works because it was correctly encoded in UTF-8 to begin with:
>>> b'\xc2\x85'.decode()
'\x85'
The second case is a complicated way of creating a single byte string:
>>> (0x85).to_bytes(1,'big')
b'\x85'
This byte string is not correctly encoded as UTF-8, so it fails to decode:
>>> b'\x85'.decode()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 0: invalid start byte
Python 3 is definitely not "broken". It cleanly separates byte data from text.
If you have raw bytes, work with them as bytes. Raw data in Python 3 is intended to be manipulated in byte strings or byte arrays. Unicode strings are for text. Decode bytes to text to manipulate it, then encode back to bytes to serialize to file, socket, database, etc.
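A minimal sketch of that round trip (the file names here are hypothetical, and the input is assumed to be UTF-8 encoded):

with open('input.bin', 'rb') as f:
    raw = f.read()                     # raw bytes from disk
text = raw.decode('utf-8')             # decode once, at the boundary
text = text.upper()                    # manipulate it as text
with open('output.bin', 'wb') as f:
    f.write(text.encode('utf-8'))      # encode once, on the way back out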
If for some reason you feel the need to use Unicode strings for raw data, the first 256 code points of Unicode correspond to the latin1 codec, giving a 1:1 mapping between the two.
>>> '\x85'.encode('latin1')
b'\x85'
>>> b'\x85'.decode('latin1')
'\x85'
This is often used to correct programming errors due to encoding/decoding with the wrong encodings.
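For example, UTF-8 bytes that were mistakenly decoded as Latin-1 turn into mojibake; because the latin1 mapping is 1:1, the bad decode can be reversed and redone with the right codec:

>>> 'é'.encode('utf8').decode('latin1')    # the mistake: UTF-8 bytes read as Latin-1
'Ã©'
>>> 'Ã©'.encode('latin1').decode('utf8')   # the fix: undo the bad decode, redo it
'é'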
I have read a lot now on the topic of UTF-8 encoding in Python 3 but it still doesn't work, and I can't find my mistake.
My code looks like this
def main():
    with open("test.txt", "rU", encoding='utf-8') as test_file:
        text = test_file.read()
        print(str(len(text)))

if __name__ == "__main__":
    main()
My test.txt file looks like this
ö
And I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 0: invalid start byte
Your file is not UTF-8 encoded. The byte 0xF6 is the encoding for ö in Latin-1 and CP-1252, so your file was most likely saved in one of those encodings:
>>> b'\xf6'.decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 0: invalid start byte
>>> b'\xf6'.decode('latin1')
'ö'
You'll need to save that file as UTF-8 instead, with whatever tool you used to create that file.
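Alternatively, if you can't re-save the file, open it with the encoding it actually uses; a sketch, assuming it really is CP-1252 (Latin-1 would work the same way here):

with open("test.txt", "r", encoding="cp1252") as test_file:
    text = test_file.read()    # 0xF6 now decodes to 'ö'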
If open('text').read() works, then you were able to decode the file using the default system encoding. See the open() function documentation:
encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any encoding supported by Python can be used.
That is not to say that you were reading the file using the correct encoding; that just means that the default encoding didn't break (encountered bytes for which it doesn't have a character mapping). It could still be mapping those bytes to the wrong characters.
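You can inspect that platform default yourself; the 'UTF-8' result below is just one possible output (on Windows, for example, it is often 'cp1252'):

>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'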
I urge you to read up on Unicode and Python:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
My database has some wrong "ASCII" bytes mixed in; how do I concatenate those strings without errors?
My example situation is like this (some of the byte values are larger than 128):
>>> s=b'\xb0'
>>> addstr='read '+s
>>> print addstr
read ░
>>> addstr.encode('ascii','ignore')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 5: ordinal not in range(128)
>>> addstr.encode('utf_8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 5: ordinal not in range(128)
I can do:
>>> addstr.decode("windows-1252").encode('utf-8')
'read \xc2\xb0'
but you can see the windows-1252 codec changes my character.
I would like to convert addstr to unicode; how do I do it?
addstrUnicode = addstr.decode("unicode-escape")
You should not be concerned about the character changing; it is just that the UTF-8 encoding requires two bytes, not one, for code points between U+0080 and U+07FF, so when you encode as UTF-8 an extra byte (0xC2) is added.
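You can see that the two bytes are just the UTF-8 form of the same single code point:

>>> u'\xb0'.encode('utf-8')
'\xc2\xb0'
>>> '\xc2\xb0'.decode('utf-8')
u'\xb0'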
Reading up on how the different encodings represent characters will help in understanding what is going on here.
Additionally, make sure you know the original encoding of the character before you start trying to decode it. While you called it "ascii code", the ASCII character set only extends up to 127, which means this character cannot be ASCII-encoded. I'm assuming here that it is just the code point U+00B0 (DEGREE SIGN).
I got this string: 'Velcro Back Rest \xa36.99'. Note it does not have a u at the front; it's just plain ASCII.
How do I convert it to unicode?
I tried this,
>>> unicode('Velcro Back Rest \xa36.99')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 17: ordinal not in range(128)
This answer explains it nicely, but I have the same question as the OP of that question. In a comment on that answer, Winston says "You should not encode a string object ...".
But the framework I am working with requires that the value be converted to a unicode string. I use Scrapy, and I have this line:
loader.add_value('name', product_name)
Here product_name contains the problematic string, and this line throws the error.
You need to specify an encoding to decode the bytes to Unicode with:
>>> 'Velcro Back Rest \xa36.99'.decode('latin1')
u'Velcro Back Rest \xa36.99'
>>> print 'Velcro Back Rest \xa36.99'.decode('latin1')
Velcro Back Rest £6.99
In this case I was able to guess the encoding from experience, but you need to determine the correct codec for each piece of data you encounter. For web data, that is usually included in the form of the Content-Type header:
Content-Type: text/html; charset=iso-8859-1
where iso-8859-1 is the official standard name for the Latin 1 encoding, for example. Python recognizes latin1 as an alias for iso-8859-1.
Note that your input data is not plain ASCII. If it was, it'd only use bytes in the range 0 through to 127; \xa3 is 163 decimal, so outside of the ASCII range.
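Applied to the Scrapy line from your question, that means decoding before handing the value to the loader; a sketch, assuming the site really serves Latin-1 (substitute whatever charset its Content-Type header declares):

loader.add_value('name', product_name.decode('latin1'))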
I need to do this in Python 2.4 (yes, 2.4 :-( ).
I've got a plain string object, which represents some text encoded with UTF-8. It comes from an external library, which can't be modified.
So, what I think I need to do is create a unicode object using the bytes from that source object, and then convert it to some other encoding (iso-8859-2, actually).
The plain string object is x. unicode() seems not to work:
>>> x
'Sk\xc5\x82odowski'
>>> str(unicode(x, encoding='iso-8859-2'))
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128)
>>> unicode(x, encoding='iso-8859-2')
u'Sk\u0139\x82odowski'
>>> x.decode('utf8').encode('iso-8859-2')
'Sk\xb3odowski'
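That last line is the answer: first decode from the encoding the bytes are actually in (UTF-8, not iso-8859-2), then encode to the target encoding. A minimal Python 2.4-compatible sketch; the 'replace' error handler is an added assumption, on the theory that you would rather get '?' than an exception for any character iso-8859-2 cannot represent:

x = 'Sk\xc5\x82odowski'      # UTF-8 bytes from the external library
converted = x.decode('utf-8').encode('iso-8859-2', 'replace')
# converted == 'Sk\xb3odowski'; 0xb3 is the iso-8859-2 byte for l-with-stroke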