I am having a problem decoding a byte string that I have to send from one computer to another. The file is in PDF format. I get an error that goes:
fileStrings[i] = fileStrings[i].decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 648: invalid continuation byte
Any ideas on how to remove the b'' marking? I need to reassemble the file, but I also need to know its size in bytes before sending it, and I figured I would get it by decoding each byte string (this works for txt files but not for pdf ones).
Code is:
with open(inputne, "rb") as file:
    while 1:
        readBytes = file.read(dataMaxSize)
        fileStrings.append(readBytes)
        if not readBytes:
            break
        readBytes = ''

filesize = 0
for i in range(0, len(fileStrings)):
    fileStrings[i] = fileStrings[i].decode()
    filesize += len(fileStrings[i])
Edit: For anyone having the same issue: len() will give you the size without the b'' marking.
In Python, bytestrings are for raw binary data, and strings are for textual data. decode() tries to decode the bytes as UTF-8, which is valid for txt files but not for pdf files, since they can contain arbitrary bytes. You should not try to get a string at all, since bytestrings are designed for this purpose. You can get the length of a bytestring like normal, with len(data). Many of the string operations also apply to bytestrings, such as concatenation and slicing (data1 + data2 and data[1:3]).
As a side note, the b'' you see when you print a bytestring is just because its __str__ method is equivalent to repr; it's not in the data itself.
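For example, a minimal sketch along those lines (reusing the question's inputne and dataMaxSize names) keeps everything as bytes and still yields the size:

fileStrings = []
filesize = 0
with open(inputne, "rb") as file:
    while True:
        readBytes = file.read(dataMaxSize)
        if not readBytes:
            break
        fileStrings.append(readBytes)
        filesize += len(readBytes)   # length in bytes, no decoding needed

data = b''.join(fileStrings)         # reassembles the original file contents
assert len(data) == filesize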
I'm using Python 3.8. I want to get image byte data into a JSON object. So I tried this
with open(os.path.join(dir_path, "../image_data", "myimg.jpg"), mode='rb') as img_file:
    image_data = img_file.read().decode("utf-16")

my_json_data = {
    "image_data": image_data
    ...
}
but the image_data = line is giving this error:
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 0-1: illegal UTF-16 surrogate
What's the proper way to load data for inclusion into a JSON object?
decode and encode work on character data. You use them to convert between character encodings, such as UTF-8 and ASCII. It doesn't make sense to take pure binary data -- such as an image -- and try to convert it to characters. Not every binary value can be converted to a character; most character encodings leave some values reserved or unused.
What you need is a simple raw byte format. Read the file as a sequence of bytes; this preserves the binary form, making it easy for your eventual JSON consumer to utilize the information.
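Since JSON itself can only hold text, one common way to do this (a sketch, not spelled out above; base64 is my addition, and dir_path is reused from the question) is to read the raw bytes and base64-encode them:

import base64
import json
import os

# dir_path is assumed to be defined as in the question
with open(os.path.join(dir_path, "../image_data", "myimg.jpg"), mode='rb') as img_file:
    raw = img_file.read()  # keep the bytes as bytes

my_json_data = {
    # bytes -> ASCII text that survives inside JSON
    "image_data": base64.b64encode(raw).decode("ascii"),
}
payload = json.dumps(my_json_data)

The consumer reverses it with base64.b64decode(my_json_data["image_data"]).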
So, I'm having issues with Python 3 encoding. I have a few bytes I want to work with as strings. (Long story.)
In a few words, this works:
a = "\x85".encode()
print(a.decode())
But this doesn't
b = (0x85).to_bytes(1,"big")
print(b.decode())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 0: invalid start byte
I have read a handful of articles on the subject, but they insist that 'Python 3 is broken' or that 'you shouldn't be using strings for that'. Plenty of articles on Stack Overflow just use workarounds (such as 'replace on error' or 'use utf-16').
Could anyone tell me where the difference lies and why the first one works while the second one doesn't? Shouldn't both of them work identically? Why can't UTF-8 decode the byte in the second attempt?
In the first case '\x85'.encode() encodes the Unicode code point U+0085 in the Python 3 default encoding of UTF-8. So the output is the correct two-byte UTF-8 encoding of that code point:
>>> '\x85'.encode()
b'\xc2\x85'
Decode then works because it was correctly encoded in UTF-8 to begin with:
>>> b'\xc2\x85'.decode()
'\x85'
The second case is a complicated way of creating a single byte string:
>>> (0x85).to_bytes(1,'big')
b'\x85'
This byte string is not correctly encoded as UTF-8, so it fails to decode:
>>> b'\x85'.decode()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 0: invalid start byte
Python 3 is definitely not "broken". It cleanly separates byte data from text.
If you have raw bytes, work with them as bytes. Raw data in Python 3 is intended to be manipulated in byte strings or byte arrays. Unicode strings are for text. Decode bytes to text to manipulate it, then encode back to bytes to serialize to file, socket, database, etc.
If for some reason you feel the need to use Unicode strings for raw data, the first 256 code points of Unicode correspond 1:1 to the latin1 codec, so each maps cleanly onto the other.
>>> '\x85'.encode('latin1')
b'\x85'
>>> b'\x85'.decode('latin1')
'\x85'
This is often used to correct programming errors due to encoding/decoding with the wrong encodings.
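For instance, a minimal round trip repairing "mojibake" caused by decoding UTF-8 bytes as Latin-1:

# UTF-8 bytes mistakenly decoded as Latin-1 produce mojibake:
mojibake = 'café'.encode('utf-8').decode('latin1')   # 'cafÃ©'

# latin1 maps code points 0-255 back to the same byte values,
# so re-encoding recovers the original bytes, which then decode correctly:
fixed = mojibake.encode('latin1').decode('utf-8')    # 'café'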
I have the following code:
with open("heart.png", "rb") as f:
byte = f.read(1)
while byte:
byte = f.read(1)
strb = byte.decode("utf-8", "ignore")
print(strb)
When reading the bytes from "heart.png" I have to read hex bytes such as:
b'\x1a', b'\xff', b'\xa4', etc.
and also bytes in this form:
b'A', b'D', b'O', b'B', b'E', etc. <- spells ADOBE
Now for some reason, when I use the above code to convert from bytes to string, it does not seem to work with the bytes in hex form, but it works for everything else.
So when b'\x1a' comes along it converts it to "" (an empty string), and when b'H' comes along it converts it to "H".
Does anyone know why this is the case?
There are a few things going on here.
The PNG file format can contain text chunks encoded in either Latin-1 or UTF-8. The tEXt chunks are encoded in Latin-1 and you would need to decode them using the 'latin-1' codec. iTXt chunks are encoded in UTF-8 and would need to be decoded with the 'utf-8' codec.
However, you appear to be trying to decode individual bytes, whereas characters in UTF-8 may span multiple bytes. So assuming you want to read UTF-8 strings, what you should do is read in the entire length of the string you wish to decode before attempting to decode it.
If instead you are trying to interpret binary data from the file, take a look at the struct module which is intended for that purpose.
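As a rough sketch of that last point (assuming heart.png is a valid PNG file), the file's fixed-layout header can be unpacked with struct instead of being decoded as text:

import struct

with open("heart.png", "rb") as f:
    signature = f.read(8)                 # every PNG starts with b'\x89PNG\r\n\x1a\n'
    # Each chunk begins with a 4-byte big-endian length and a 4-byte type:
    length, chunk_type = struct.unpack(">I4s", f.read(8))
    print(signature, length, chunk_type)  # the first chunk is normally IHDR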
I have a very long JSON message that contains characters that go beyond the ASCII table. I convert it into a string as follows:
messStr = json.dumps(message,encoding='utf-8', ensure_ascii=False, sort_keys=True)
I need to store this string using a service that restricts its size to X bytes. I want to split the JSON string into pieces of length X and store them separately. I ran into some issues doing this (described here) so I want to compress the string slices to work around those issues. I tried to do this:
ss = messStr[start:fin]  # get a piece of length X
ssc = zlib.compress(ss)  # compress it
When I do that, I get the following error from zlib.compress:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 225: ordinal not in range(128)
What is the right way to compress a UTF-8 string and what is then the right way to decompress it?
A little addition to Martijn's response.
I read in an Enthought blog a nifty one-liner that will spare you the need to import zlib in your own code.
Safely compressing a string (including your json dump) would look like this:
ssc = ss.encode('utf-8').encode('zlib_codec')
Decompressing back to utf-8 would be:
ss = ssc.decode('zlib_codec').decode('utf-8')
Hope this helps.
Your JSON data is not UTF-8 encoded. The encoding parameter to the json.dumps() function tells it how to interpret Python byte strings in message (i.e. the input), not how to encode the resulting output. It doesn't encode the output at all, because you used ensure_ascii=False.
Encode the data before compression:
ssc = zlib.compress(ss.encode('utf8'))
When decompressing again, there is no need to decode from UTF-8; the json.loads() function assumes UTF-8 if the input is a bytestring.
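Putting that together, a round-trip sketch (Python 2, to match the question; message is the question's variable, and the slicing into X-byte pieces is omitted):

import json
import zlib

ss = json.dumps(message, ensure_ascii=False, sort_keys=True)
ssc = zlib.compress(ss.encode('utf8'))  # encode to bytes first, then compress

# Later, after reading the compressed data back:
raw = zlib.decompress(ssc)              # UTF-8 encoded bytestring
restored = json.loads(raw)              # json.loads assumes UTF-8 for bytestrings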
I am trying to read a pdf file with the following content:
%PDF-1.4\n%âãÏÓ
If I read it with open, it works, but if I try codecs.open(filename, encoding="utf8", mode="rb") to get a unicode string, I get the following exception:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
Do you know a way to get a unicode string from the content of this file?
PS: I am using python 2.7
PDFs are made of binary data, not text. They cannot be meaningfully represented as Unicode strings.
For what it's worth, you can get a Unicode string containing those particular characters by treating the PDF as ISO8859-1 text:
f = codecs.open(filename, encoding="ISO8859-1", mode="rb")
But at that point, you're better off just using normal open and reading bytes. Unicode is for text, not data.
The issue of trying to interpret arbitrary binary data as text aside, 0xe2 is â in Latin-1, not UTF-8. You're using the wrong codec.
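To see the difference directly, a minimal check (Python 2, matching the question):

raw = '\xe2'                      # the byte the utf8 codec choked on

print raw.decode('latin-1')       # u'\xe2' -> prints as an a-circumflex
try:
    raw.decode('utf-8')           # 0xe2 opens a 3-byte UTF-8 sequence,
except UnicodeDecodeError as e:   # so a lone 0xe2 cannot be decoded
    print e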