So, I'm having issues with Python 3 encoding. I have a few bytes that I want to work with as strings (long story).
In a few words, this works:
a = "\x85".encode()
print(a.decode())
But this doesn't:
b = (0x85).to_bytes(1,"big")
print(b.decode())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 0: invalid start byte
I have read a handful of articles on the subject, but they insist that 'Python 3 is broken' or that 'you shouldn't be using strings for that'. Plenty of articles on Stack Overflow just use workarounds (such as 'replace on error' or 'use utf-16').
Could anyone tell me where the difference lies, and why the first snippet works while the second one doesn't? Shouldn't both of them work identically? Why can't utf-8 decode the byte on the second attempt?
In the first case '\x85'.encode() encodes the Unicode code point U+0085 in the Python 3 default encoding of UTF-8. So the output is the correct two-byte UTF-8 encoding of that code point:
>>> '\x85'.encode()
b'\xc2\x85'
Decode then works because it was correctly encoded in UTF-8 to begin with:
>>> b'\xc2\x85'.decode()
'\x85'
The second case is a complicated way of creating a single byte string:
>>> (0x85).to_bytes(1,'big')
b'\x85'
This byte string is not correctly encoded as UTF-8, so it fails to decode:
>>> b'\x85'.decode()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 0: invalid start byte
Python 3 is definitely not "broken". It cleanly separates byte data from text.
If you have raw bytes, work with them as bytes. Raw data in Python 3 is intended to be manipulated in byte strings or byte arrays. Unicode strings are for text. Decode bytes to text to manipulate it, then encode back to bytes to serialize to file, socket, database, etc.
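A minimal sketch of that round trip (the byte values here are just illustrative):
>>> raw = b'caf\xc3\xa9'              # UTF-8 bytes, e.g. read from a file or socket
>>> text = raw.decode('utf-8')        # decode to text in order to manipulate it
>>> text
'café'
>>> text.upper().encode('utf-8')      # encode back to bytes to serialize
b'CAF\xc3\x89'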
If for some reason you feel the need to use Unicode strings for raw data, the first 256 code points of Unicode map 1:1 onto the latin1 codec, so you can convert losslessly in either direction.
>>> '\x85'.encode('latin1')
b'\x85'
>>> b'\x85'.decode('latin1')
'\x85'
This is often used to correct programming errors due to encoding/decoding with the wrong encodings.
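For instance, a sketch of undoing an accidental latin1 decode of UTF-8 bytes (the byte values are illustrative):
>>> bad = b'\xc2\x85'.decode('latin1')    # mojibake: decoded with the wrong codec
>>> bad.encode('latin1').decode('utf-8')  # the latin1 round trip repairs it
'\x85'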
Related
I've received a .xlsx file where foreign-language characters appear to have been originally encoded as utf-8 and then decoded as utf-16.
I don't have access to the original file.
é and ó, for example, should be read as é and ó, respectively, which are encoded as 0xC3 0xA9 and 0xC3 0xB3. Instead each single utf-8 character has been decoded at some point as two utf-16 characters.
I've tried encoding them to bytes and decoding them with UTF-8, but that doesn't translate correctly.
This:
s = "ó".encode("utf-16")
uni = s.decode("utf-8")
print(uni)
raises this:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
I've tried the above encode/decode sequence with a variety of different parameters, including UTF-16be and UTF-16le and every error parameter of decode, and I've tried just slicing off the BOM; none of it worked.
Is there a way to fix this? Is there some library that can make this a less painful process than reading every string and doing a literal replacement of ó with ó?
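For what it's worth, the usual repair for this kind of mojibake is the latin1 round trip described in the answer above; it assumes the damage came from decoding UTF-8 bytes with a one-byte codec (latin1 or cp1252) rather than UTF-16, which the 0xC3 0xA9 / 0xC3 0xB3 byte values suggest. A library like ftfy automates guessing the right combination.
>>> 'é'.encode('latin1').decode('utf-8')
'é'
>>> 'ó'.encode('latin1').decode('utf-8')
'ó'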
I am having a problem decoding byte strings of a file that I have to send from one computer to another. The file format is PDF. I get an error that goes:
fileStrings[i] = fileStrings[i].decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 648: invalid continuation byte
Any ideas how to remove the b'' marking? I need to reassemble the file afterwards, but I also need to know its size in bytes before sending it, and I figured I would get that by decoding each byte string (it works for txt files but not for pdf ones).
Code is:
with open(inputne, "rb") as file:
    while 1:
        readBytes = file.read(dataMaxSize)
        fileStrings.append(readBytes)
        if not readBytes:
            break
        readBytes = ''

filesize = 0
for i in range(0, len(fileStrings)):
    fileStrings[i] = fileStrings[i].decode()
    filesize += len(fileStrings[i])
Edit: For anyone having the same issue: len() will give you the size without the b''.
In Python, bytestrings are for raw binary data, and strings are for textual data. decode tries to decode it as utf-8, which is valid for txt files, but not for pdf files, since they can contain random bytes. You should not try to get a string, since bytestrings are designed for this purpose. You can get the length of bytestrings like normal, with len(data). Many of the string operations also apply to bytestrings, such as concatenation and slicing (data1 + data2 and data[1:3]).
As a side note, the b'' when you print it is just because the __str__ method for bytestrings is equivalent to repr. It's not in the data itself.
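A minimal sketch applying that to the code above (names taken from the question):
filesize = sum(len(chunk) for chunk in fileStrings)  # size in bytes, no decoding needed
data = b''.join(fileStrings)                         # reassemble the file as bytes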
I want to substitute a substring with a hash; said substring contains non-ASCII characters, so I tried to encode it to UTF-8.
result = re.sub(r'(Start:\s*)([^:]+)(:\s*)([^:]+)',
                lambda m: m.group(1) + m.group(2) + m.group(3) + hashlib.sha512(m.group(4).encode()).hexdigest(),
                line.encode('utf-8'))
I am not really sure why this doesn't work; I thought that with line.encode('utf-8') the whole string would get encoded.
I also tried encoding my m.group results to UTF-8, but I got the same UnicodeDecodeError:
UnicodeDecodeError: 'ascii' codec can't decode byte in position ...: ordinal not in range(128)
Sample input:
Start: myUsername: myÜsername:
What am I missing ?
EDIT:
Traceback (most recent call last):
File "C:/Users/Peter/Desktop/coding/filter.py", line 26, in <module>
encodeline = line.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 112: ordinal not in range(128)
Based on your symptoms, you're running on Python 2. Calling encode on a Python 2 str is almost always nonsensical.
You have two problems; one you're hitting now, and one you'll hit if you fix your current code.
Your first problem is that line is already a str of (apparently) UTF-8 encoded bytes, not unicode, so encoding it implicitly decodes with Python 2's default encoding (ASCII; this isn't locale-specific to my knowledge, and it's a rare Python 2 install that uses anything else), then re-encodes with the specified codec (or the default if not specified). Basically, line was already UTF-8 encoded; you told Python to encode it again as UTF-8, but that's nonsensical, so it tried to decode as ASCII first, and failed before it even tried to encode as you instructed.
The solution to this problem is to just not encode line at all; it's already UTF-8 encoded, so you're already golden.
Your second problem (which you haven't encountered yet, but you will) is that you're calling encode on the group(4) result. But of course, since the input was a str, the group is a str too, and you'll encounter the same problem trying to encode a str; since the group came from raw UTF-8 encoded bytes, the non-ASCII parts of it cause a UnicodeDecodeError during the implicit decode step before the encode.
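A short Python 2 sketch of that implicit decode (the string is modeled on the sample input):
>>> s = 'my\xc3\x9csername'    # a Python 2 str holding the UTF-8 bytes of 'myÜsername'
>>> s.encode('utf-8')          # really does s.decode('ascii').encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)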
The reason:
import sys
reload(sys)
sys.setdefaultencoding('UTF8')
works is that it (dangerously) changes the implicit decode step to use UTF-8, so all your encode calls now perform the implicit decode with UTF-8 instead of ASCII; the decode-and-encode is mostly pointless, since all it does is return the original str after confirming it's legal UTF-8 by decoding it as such, otherwise acting as an expensive no-op.
To fix the second problem, just change:
m.group(4).encode()
to:
m.group(4)
That leaves your final code as:
result = re.sub(r'(Start:\s*)([^:]+)(:\s*)([^:]+)',
                lambda m: m.group(1) + m.group(2) + m.group(3) + hashlib.sha512(m.group(4)).hexdigest(),
                line)
Optionally, if you want to confirm your expectation that line is in fact UTF-8 encoded bytes already, add the following above that re.sub line:
import sys  # needed for sys.exit below

try:
    line.decode('utf-8')
except Exception as e:
    sys.exit("line (of type {!r}) not decodable as UTF-8: {}".format(line.__class__.__name__, e))
which will cause the program to exit immediately if the data given is not legal UTF-8 (and will also let you know what type line is, so you can confirm for sure if it's really str or unicode, since str implies you chose the wrong codec, while unicode means your inputs aren't of the expected type).
I found what is, in my eyes, a workaround.
It doesn't feel right, but it does the job.
import sys
reload(sys)
sys.setdefaultencoding('UTF8')
I thought it could be done with .encode('utf-8'):
file = 'xyz'
res = hashlib.sha224(str(file).encode('utf-8')).hexdigest()
because a unicode object must be encoded to a byte string before hashing.
This is Python 2.
My understanding is:
decode is about decoding from bytes to anything (ascii/code points/utf-8/whatever...)
encode is about encoding from Unicode code points to anything (bytes/ascii/utf-8/...)
From the code below:
>>> myUtf8
'Hi \xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'
>>> myUtf8.decode("ascii", "replace")
u'Hi \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd'
>>> myUtf8.decode('utf-16')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/encodings/utf_16.py", line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0xa4 in position 18: truncated data
Question:
Why is the code point \ufffd substituted for the bytes that can't be decoded?
Edit:
Question:
Say the agreement is that bytes are being received from the network. How do I find out which encoding those bytes are expressed in?
Your understanding is wrong. decode is "from bytes expressed in the encoding you pass as first parameter to unicode"; encode is "from unicode to bytes expressed in the encoding you pass as first parameter".
In your example, you give some bytes expressed in UTF-8 and tell Python to interpret them as ASCII in order to build a unicode string; given that all the >127 bytes aren't valid ASCII, they are considered garbage and thus, as you requested with the "replace" parameter, they are replaced with the Unicode replacement character U+FFFD.
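Decoding with the codec that actually matches the bytes succeeds (a Python 2 sketch using the string from the question):
>>> myUtf8.decode('utf-8')
u'Hi \u2119\u01b4\u2602\u210c\xf8\u1f24'
As for the edit: bytes don't carry their own encoding, so in general the sender has to tell you (e.g. in a protocol header or by prior agreement), or you have to guess, for instance with a library like chardet.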
I am currently writing a program that reads data from a serial port, adds some header information, then writes this data to a .jpg file.
I need to write the file in the Windows-1252 encoding format; yes, the method by which I construct the data and the header is in hexadecimal format.
I realised my problem when comparing the picture that should have been written with what was actually written, and saw that DOUBLE LOW-9 QUOTES were not written as quotes but rather as a zero.
The decimal code for that symbol is 132 (0x84). If I use chr(0x84) I get the following error:
UnicodeEncodeError: 'charmap' codec can't encode character '\x84' in position 0: character maps to <undefined>
That would only make sense if chr() were trying to map to the Latin-1 codeset. I have tried to convert the int to unicode, but from my research chr() is the only function that does this.
I have also tried to use the struct package in python.
import struct
a = 123
b = struct.pack("c", a)
print(b)
I get the error
Traceback (most recent call last):
  File "python", line 3, in <module>
struct.error: char format requires a bytes object of length 1
Reading past questions, answers, and documentation gets quite confusing, as there is a mix of Python 2 and Python 3 answers, along with people converting to ASCII (which obviously wouldn't work).
I am using Python 3.4.3 (the latest version) on a Windows 7 machine.
UnicodeEncodeError: 'charmap' codec can't encode character \x84
\x84 is the encoding of the lower quotes character in Windows-1252. This suggests your data is already encoded, and you should not try to encode it again. In a text string the quote should show up as "\u201E". "\u0084" (the result of chr(132)) is actually a control character.
You should have either bytes which you can decode to a string:
>>> b"\x84".decode('windows-1252')
'\u201e'
Or you should have a text string, which you can encode to a byte string:
>>> "\u201e".encode('windows-1252')
b'\x84'
If you read data from somewhere, you could use the struct module like this:
import struct

# suppose we download some data:
data = b'*\x00\x00\x00abcde'
a, txt = struct.unpack("I5s", data)  # a == 42 (on little-endian platforms), txt == b'abcde'
print(txt.decode('windows-1252'))    # prints: abcde