Python converting bytes to string - python

I have the following code:
with open("heart.png", "rb") as f:
    byte = f.read(1)
    while byte:
        byte = f.read(1)
        strb = byte.decode("utf-8", "ignore")
        print(strb)
When reading the bytes from "heart.png" I have to read hex bytes such as:
b'\x1a', b'\xff', b'\xa4', etc.
and also bytes in this form:
b'A', b'D', b'O', b'B', b'E', etc. <- spells ADOBE
Now, for some reason, when I use the above code to convert from byte to string, it does not seem to work with the bytes in hex form, but it works for everything else.
So when b'\x1a' comes along it converts it to "" (an empty string),
and when b'H' comes along it converts it to "H".
does anyone know why this is the case?

There are a few things going on here.
The PNG file format can contain text chunks encoded in either Latin-1 or UTF-8. The tEXt chunks are encoded in Latin-1 and you would need to decode them using the 'latin-1' codec. iTXt chunks are encoded in UTF-8 and would need to be decoded with the 'utf-8' codec.
However, you appear to be trying to decode individual bytes, whereas characters in UTF-8 may span multiple bytes. So assuming you want to read UTF-8 strings, what you should do is read in the entire length of the string you wish to decode before attempting to decode it.
If instead you are trying to interpret binary data from the file, take a look at the struct module which is intended for that purpose.
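For the binary-interpretation route, here is a minimal sketch of struct in action. It parses a chunk header from an in-memory PNG prefix rather than the question's actual heart.png, so the bytes below are illustrative of the format, not taken from that file (PNG chunks are a 4-byte big-endian length, a 4-byte type, the data, and a 4-byte CRC):

```python
import io
import struct

# Build a fake start-of-PNG in memory: the fixed 8-byte signature,
# followed by the first chunk's 8-byte header (length + type).
png_start = b"\x89PNG\r\n\x1a\n" + struct.pack(">I4s", 13, b"IHDR")

f = io.BytesIO(png_start)
signature = f.read(8)                    # the fixed PNG signature
length, chunk_type = struct.unpack(">I4s", f.read(8))
print(chunk_type, length)
```

Note that struct works on whole multi-byte fields at once, which sidesteps the one-byte-at-a-time decoding problem entirely.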

Related

How to decode a text in python3?

I have a text Aur\xc3\xa9lien and want to decode it with python 3.8.
I tried the following
import codecs
s = "Aur\xc3\xa9lien"
codecs.decode(s, "utf-8")
codecs.decode(bytes(s), "utf-8")
codecs.decode(bytes(s, "utf-8"), "utf-8")
but none of them gives the correct result Aurélien.
How to do it correctly?
And is there no basic, general authoritative simple page that describes all these encodings for python?
First find the encoding of the string, then decode it. To do this you will need to make a byte string by adding the letter 'b' to the front of the original string.
Try this:
import chardet

s = "Aur\xc3\xa9lien"
bs = b"Aur\xc3\xa9lien"
encoding = chardet.detect(bs)["encoding"]
# Re-encode with latin-1 (a 1-to-1 byte mapping) to recover the raw bytes,
# then decode with the detected encoding. "result" avoids shadowing the
# built-in str.
result = s.encode("latin-1").decode(encoding)
print(result)
If you are reading the text from a file you can detect the encoding using the magic lib, see here: https://stackoverflow.com/a/16203777/1544937
Your string is not a Unicode sequence, so you should prefix it with b
import codecs
b = b"Aur\xc3\xa9lien"
b.decode('utf-8')
So you get the expected 'Aurélien'.
If you want to use s directly, you can first encode it with mbcs, latin-1, mac_roman, or any other 8-bit encoding; it doesn't matter which. Such 8-bit codecs map each character in your string back to a single byte (a 1-to-1 mapping), giving you a byte string that you can then decode as in the first part of this answer.
You have UTF-8 decoded as latin-1, so the solution is to encode as latin-1 then decode as UTF-8.
s = "Aur\xc3\xa9lien"
s.encode('latin-1').decode('utf-8')
print(s.encode('latin-1').decode('utf-8'))
Output
Aurélien
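To see why this round-trip works at the byte level: each mojibake character like '\xc3' and '\xa9' maps back to a single byte under Latin-1, and that byte pair is exactly the UTF-8 encoding of é. A short demonstration:

```python
s = "Aur\xc3\xa9lien"            # UTF-8 bytes mis-decoded as Latin-1
raw = s.encode("latin-1")        # 1-to-1 mapping: '\xc3' -> 0xc3, '\xa9' -> 0xa9
fixed = raw.decode("utf-8")      # 0xc3 0xa9 is the UTF-8 sequence for é
print(raw)
print(fixed)
```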

Python unable to decode byte string

I am having a problem decoding a byte string that I have to send from one computer to another. The file is in PDF format. I get an error that goes:
fileStrings[i] = fileStrings[i].decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 648: invalid continuation byte
Any ideas of how to remove the b'' marking? I need to reassemble the file, but I also need to know its size in bytes before sending it, and I figured I would get it by decoding each byte string (this works for txt files but not for pdf ones).
Code is:
with open(inputne, "rb") as file:
    while 1:
        readBytes = file.read(dataMaxSize)
        fileStrings.append(readBytes)
        if not readBytes:
            break
        readBytes = ''

filesize = 0
for i in range(0, len(fileStrings)):
    fileStrings[i] = fileStrings[i].decode()
    filesize += len(fileStrings[i])
Edit: For anyone having the same issue: len() will give you the size without the b''.
In Python, bytestrings are for raw binary data, and strings are for textual data. decode tries to decode it as utf-8, which is valid for txt files, but not for pdf files, since they can contain random bytes. You should not try to get a string, since bytestrings are designed for this purpose. You can get the length of bytestrings like normal, with len(data). Many of the string operations also apply to bytestrings, such as concatenation and slicing (data1 + data2 and data[1:3]).
As a side note, the b'' when you print it is just because the __str__ method for bytestrings is equivalent to repr. It's not in the data itself.
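Concretely, the size bookkeeping from the question works on the raw chunks with no decoding at all; the chunk values below are made up to stand in for pieces of a PDF:

```python
# Hypothetical binary chunks, as file.read() would return them.
chunks = [b"%PDF-1.4\n", b"\xda\xff\x00\x01", b""]

# len() counts raw bytes; the b'' prefix is only part of the repr,
# not of the data, so it contributes nothing to the size.
filesize = sum(len(c) for c in chunks)
print(filesize)
```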

In Python 3.8, how do I load image data from a file for inclusion in a JSON object?

I'm using Python 3.8. I want to get image byte data into a JSON object. So I tried this
with open(os.path.join(dir_path, "../image_data", "myimg.jpg"), mode='rb') as img_file:
    image_data = img_file.read().decode("utf-16")

my_json_data = {
    "image_data": image_data
    ...
}
but the image_data = line is giving this error:
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 0-1: illegal UTF-16 surrogate
What's the proper way to load data for inclusion into a JSON object?
decode and encode work on character data. You use them to convert between character encodings, such as utf-8 and ASCII. It doesn't make sense to take pure binary data -- such as an image -- and try to convert it to characters: not every binary value can be converted to a character, and most character formats leave a few values reserved or unused.
What you need is a simple raw byte format. Read the file as a sequence of bytes; this preserves the binary form, making it easy for your eventual JSON consumer to utilize the information.
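One common way to make those raw bytes JSON-safe (an addition here, not something the answer spells out) is to base64-encode them, which turns arbitrary bytes into a plain ASCII string; the byte values below are made up to stand in for the image file's contents:

```python
import base64
import json

# Hypothetical image bytes standing in for myimg.jpg's contents.
raw = b"\xff\xd8\xff\xe0fake-jpeg-bytes"

# base64 maps arbitrary bytes to ASCII text, which JSON can carry.
payload = {"image_data": base64.b64encode(raw).decode("ascii")}
text = json.dumps(payload)

# The consumer reverses the steps to recover identical bytes.
recovered = base64.b64decode(json.loads(text)["image_data"])
```

The round trip is lossless, at the cost of roughly a 33% size increase over the raw bytes.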

How to convert unicode characters into their respective symbols in python?

I have a text file which contains unicode characters in the following format:
\u0935\u094d\u0926\u094d\u0928\u094d\u0935\u094d\u0926\
I want to convert it into devnagri characters in the following format:
वर्जनरूपमिति दर्शित्म् । स पूरुषः अमृतत्वाय कल्पते व्द्न्व्द
and then write it to a file.
Presently my code
encoded = x.encode('utf-8')
print (encoded.decode('unicode-escape'))
can print the devnagri characters in the terminal. However when I try to write it to a file using
text = 'target:'+encoded.decode('unicode-escape')+'\n'
fileid.write(text)
I am getting the following error.
'ascii' codec can't encode characters in position 7-18: ordinal not in range(128)
Can anybody please help me?
If you are using Python 2, it's because after using .decode('unicode-escape') you have a unicode object, and fileid.write() only accepts string objects. Python then tries to convert the object to a byte string using the ASCII encoding, which doesn't cover Devanagari characters. This conversion causes the exception.
You need to manually convert the unicode string back into a byte string before writing it to the file:
fileid.write(text.encode('utf-8'))
Here I assumed you want UTF-8 encoding. If you want to save the characters in another encoding replace 'utf-8' with the name of that encoding.
In Python 3 you can set the used encoding when opening the file:
fileid = open('compare.txt', 'a', encoding='utf-8')
Then the extra .encode('utf-8') isn't necessary.
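Putting the Python 3 pieces together, a minimal sketch (the short escape sequence and the filename compare.txt are taken as illustrative stand-ins for the asker's data):

```python
# Escaped text as it might be read from the input file.
x = r"\u0935\u094d\u0926"

# Interpret the backslash escapes to get the actual Devanagari characters.
decoded = x.encode("utf-8").decode("unicode-escape")

# Open with an explicit encoding; write() then needs no extra .encode().
with open("compare.txt", "a", encoding="utf-8") as fileid:
    fileid.write("target:" + decoded + "\n")
print(decoded)
```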

Write to file bytes and strings

I have to create files which contain some chars and hex values in little-endian encoding. To do the encoding, I use:
pack("I", 0x01ddf23a)
and this gives me:
b':\xf2\xdd\x01'
The first problem is that this gives me a byte string which I cannot write to the file. The second is that \x3a is turned into ':'. What I expect is to write \x3a\xf2\xdd\x01 to the file as bytes, not as chars.
What I tried:
>>> a=0x01ddf23a
>>> str(pack("I", a))
"b':\\xf2\\xdd\\x01'"  <= wrong
>>> pack("I", a).hex()
'3af2dd01'  <= I need '\x' before each byte
>>> pack("I", a).decode()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf2 in position 1: invalid continuation byte
Changing open() from "w" to "wb" forces me to write only bytes, but I want to write lots of strings and a few bytes, e.g.:
Hello world
^I^M^T^B
End file
I know I can simply do this:
fs = open("file", "w")
fs.write("Hello world")
fs.write("\x3a\xf2\xdd\x01")
fs.write("End file")
fs.close()
But this makes my byte value 0x01ddf23a hard to read, and it is easy to make a mistake when changing the value in that form.
You are producing bytes, which can be written to files opened in binary mode without issue. Add b to the file mode when opening and either use bytes string literals or encode your strings to bytes if you need to write other data too:
from struct import pack

with open("file", "wb") as fs:
    fs.write(b"Hello world")  # note, a byte literal!
    fs.write(pack("I", 0x01ddf23a))
    fs.write("End file".encode('ASCII'))  # encoded string to bytes
The alternative would be to decode your binary packed data to a text string first, but since packed data does not, in fact, contain decodable text, that approach would require contortions to force the binary data to be decodable and encodable again, which only works if your file encoding was set to Latin-1 and severely limits what actual text you could add.
A bytes representation will always try to show printable characters where possible. The byte \x3a is also the correct ASCII value for the ':' character, so in a bytes representation the latter is preferred over using the \x3a escape sequence. The correct value is present in the bytes value and would be written to the file entirely correctly:
>>> b'\x3a'
b':'
>>> b'\x3a' == b':'
True
>>> b':'[0]
58
>>> b'\x3a'[0]
58
>>> hex(58)
'0x3a'
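One further detail worth noting (an addition, not part of the answer above): the "I" format uses the platform's native byte order and alignment, so on a big-endian machine it would not produce little-endian output. Prefixing the format with "<" guarantees little-endian regardless of platform:

```python
from struct import pack

# "<" forces little-endian byte order explicitly, independent of platform.
data = pack("<I", 0x01ddf23a)
print(data)
```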
