Read an image from file using ascii encoding - python

I've been having trouble loading images from a file as a string.
Many of the functions that I need to use in my program rely on the data being ASCII-encoded, and they simply fail to handle the data I give them, producing the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa8 in position 14: ordinal not in range(128)
So how would I go about converting this data to ASCII?
EDIT:
Here is the admittedly messy code I am using. Please do not comment on how messy it is; this is a rough draft:
import binascii

def text_to_bits(text, encoding='utf-8', errors='surrogatepass'):
    bits = bin(int(binascii.hexlify(text.encode(encoding, errors)), 16))[2:]
    return bits.zfill(8 * ((len(bits) + 7) // 8))

def str2int(string):
    binary = text_to_bits(string)
    number = int(binary, 2)
    return number

def go():
    # filen is the name of the file
    global filen
    # Reading the file
    content = str(open(filen, "r").read())
    # Using a function from above
    integer = str2int(content)
    # Write back to the file
    w = open(filen, "w").write(str(integer))

Image data is not ASCII. Image data is binary, and thus uses bytes that the ASCII standard doesn't cover. Don't try to decode the data as ASCII. You also want to make sure you open your file in binary mode, to avoid platform-specific line separator translations, something that'll damage your image data.
Any method expecting to handle image data will deal with binary data, and in Python 2 that means you'll be handling that as the str type.
In your specific case, you are using a function that expects to work on Unicode data, not binary image data, and it tries to encode that data to binary. In other words, because you are giving it data that is already binary (encoded), the function applies a conversion method meant for Unicode (to produce a binary representation) to data that is already binary. Python then first tries to decode, to give you Unicode to encode. It is that implicit decoding that fails here:
>>> '\xa8'.encode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa8 in position 0: ordinal not in range(128)
Note that I encoded, but got a decoding exception.
The code you are using is extremely convoluted. If you wanted to interpret the entire binary contents of a file as one large integer, you could do so by converting to a hex representation, but there would be no need to go through a bit string and back to an integer again. The following would suffice:
with open(filename, 'rb') as fileobj:
    binary_contents = fileobj.read()

integer_value = int(binascii.hexlify(binary_contents), 16)
Image data is not usually interpreted as one long number, however. Binary data can encode integers, but when processing images you'd usually use the struct module to decode specific integer values from specific bytes instead.
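For example, here is a minimal sketch of that struct-based approach, using the fixed layout of a PNG header (the header bytes are built by hand here purely for demonstration):

```python
import struct

# Build a minimal PNG-style header by hand:
# 8-byte signature, 4-byte IHDR chunk length, 4-byte chunk type "IHDR",
# then width and height as 4-byte big-endian unsigned integers.
header = (b'\x89PNG\r\n\x1a\n'             # PNG signature
          + struct.pack('>I', 13)          # IHDR chunk length
          + b'IHDR'                        # chunk type
          + struct.pack('>II', 640, 480))  # width, height

# Width and height live at byte offsets 16 and 20.
width, height = struct.unpack('>II', header[16:24])
print(width, height)  # -> 640 480
```

The format string `'>II'` says "two big-endian unsigned 32-bit integers", which is exactly how the PNG IHDR chunk stores the dimensions.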

Related

Python unable to decode byte string

I am having a problem decoding a byte string that I have to send from one computer to another. The file is in PDF format. I get an error that goes:
fileStrings[i] = fileStrings[i].decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 648: invalid continuation byte
Any ideas on how to remove the b'' marking? I need to assemble the file back together, but I also need to know its size in bytes before sending it, and I figured I would find that by decoding each byte string (works for txt files, but not for pdf ones).
Code is:
with open(inputne, "rb") as file:
    while 1:
        readBytes = file.read(dataMaxSize)
        fileStrings.append(readBytes)
        if not readBytes:
            break
        readBytes = ''

filesize = 0
for i in range(0, len(fileStrings)):
    fileStrings[i] = fileStrings[i].decode()
    filesize += len(fileStrings[i])
Edit: For anyone having the same issue, len() will give you the size without the b''.
In Python, byte strings are for raw binary data, and strings are for textual data. decode tries to decode the data as UTF-8, which is valid for txt files, but not for pdf files, since they can contain arbitrary bytes. You should not try to get a string, since byte strings are designed for this purpose. You can get the length of a byte string as normal, with len(data). Many of the string operations also apply to byte strings, such as concatenation and slicing (data1 + data2 and data[1:3]).
As a side note, the b'' when you print it is just because the __str__ method for bytestrings is equivalent to repr. It's not in the data itself.
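A short sketch of that approach, keeping everything as bytes from start to finish (the chunk contents here are made-up stand-ins for file.read() results):

```python
# Stand-ins for the chunks returned by file.read(dataMaxSize):
chunks = [b'%PDF-1.4\n', b'\xda\x01\xff', b'%%EOF']

# Size in bytes, no decoding needed: len() works on byte strings directly.
filesize = sum(len(chunk) for chunk in chunks)

# Reassembling the file is plain bytes concatenation.
data = b''.join(chunks)
assert filesize == len(data)
```

No decode call is ever needed, so the invalid-continuation-byte error cannot occur.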

In Python 3.8, how do I load image data from a file for inclusion in a JSON object?

I'm using Python 3.8. I want to get image byte data into a JSON object. So I tried this
with open(os.path.join(dir_path, "../image_data", "myimg.jpg"), mode='rb') as img_file:
    image_data = img_file.read().decode("utf-16")

my_json_data = {
    "image_data": image_data
    ...
}
but the image_data = line is giving this error:
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 0-1: illegal UTF-16 surrogate
What's the proper way to load data for inclusion into a JSON object?
decode and encode work on character data. You use them to convert between character encoding formats, such as UTF-8 and ASCII. It doesn't make sense to take pure binary data -- such as an image -- and try to convert it to characters. Not every binary value can be converted to a character; most character formats leave a few values as reserved or unused.
What you need is a simple raw byte format. Read the file as a sequence of bytes; this preserves the binary form, making it easy for your eventual JSON consumer to utilize the information.
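The answer above doesn't prescribe a particular wire format; one common convention (an assumption here, not the only option) is to Base64-encode the raw bytes so they become plain ASCII text that JSON can carry:

```python
import base64
import json

raw = b'\xff\xd8\xff\xe0fake-jpeg-bytes'   # stand-in for img_file.read()

# Base64 turns arbitrary bytes into ASCII text, which JSON can hold.
payload = {"image_data": base64.b64encode(raw).decode("ascii")}
text = json.dumps(payload)

# The consumer reverses the process to recover the exact original bytes.
recovered = base64.b64decode(json.loads(text)["image_data"])
assert recovered == raw
```

The cost is roughly a 33% size increase, which is the usual trade-off for embedding binary data in a text format like JSON.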

Write to file bytes and strings

I have to create files which contain some characters and hex values in little-endian encoding. To do the encoding, I use:
pack("I", 0x01ddf23a)
and this gives me:
b':\xf2\xdd\x01'
The first problem is that this gives me a byte string which I cannot write to a file. The second is that \x3a is turned into ':'. What I expect is to write \x3a\xf2\xdd\x01 to the file as bytes, not as chars.
What I tried:
>>> a=0x01ddf23a
>>> str(pack("I", a))
"b':\\xf2\\xdd\\x01'" <= wrong
>>> pack("I", a).hex()
'3af2dd01' <= I need '\x' before each byte
>>> pack("I", a).decode()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf2 in position 1: invalid continuation byte
Changing open() from "w" to "wb" forces me to write only bytes, but I want to write lots of strings and a few bytes, e.g.:
Hello world
^I^M^T^B
End file
I know I can simply do this:
fs = open("file", "w")
fs.write("Hello world")
fs.write("\x3a\xf2\xdd\x01")
fs.write("End file")
fs.close()
But this makes my byte value 0x01ddf23a hard to read, and it is easy to make a mistake when changing the value in that form.
You are producing bytes, which can be written to files opened in binary mode without issue. Add b to the file mode when opening and either use bytes string literals or encode your strings to bytes if you need to write other data too:
with open("file", "wb") as fs:
    fs.write(b"Hello world")  # note, a byte literal!
    fs.write(pack("I", 0x01ddf23a))
    fs.write("End file".encode('ASCII'))  # encoded string to bytes
The alternative would be to decode your binary packed data to a text string first, but since packed data does not, in fact, contain decodable text, that approach would require contortions to force the binary data to be decodable and encodable again, which only works if your file encoding was set to Latin-1 and severely limits what actual text you could add.
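To illustrate why Latin-1 is the one encoding that survives that round trip (every byte value 0 through 255 maps to exactly one character), a quick sketch:

```python
from struct import pack

data = pack("I", 0x01ddf23a)

# Latin-1 maps each of the 256 byte values to a character, so
# decoding and re-encoding is lossless...
assert data.decode('latin-1').encode('latin-1') == data

# ...while UTF-8 rejects byte sequences like 0xf2 0xdd outright.
try:
    data.decode('utf-8')
except UnicodeDecodeError:
    pass  # expected: 0xdd is not a valid continuation byte
```

Even so, writing bytes directly in binary mode, as shown above, is far simpler than funnelling everything through Latin-1.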
A bytes representation will always try to show printable characters where possible. The byte \x3a is also the correct ASCII value for the ':' character, so in a bytes representation the latter is preferred over using the \x3a escape sequence. The correct value is present in the bytes value and would be written to the file entirely correctly:
>>> b'\x3a'
b':'
>>> b'\x3a' == b':'
True
>>> b':'[0]
58
>>> b'\x3a'[0]
58
>>> hex(58)
'0x3a'

Python Conversion from int to Windows-1252

I am currently writing a program that reads data from a serial port adds some header information then writes this data to a .jpg file.
I need to write to the file in the Windows-1252 encoding format; the method by which I construct the data and the header uses hexadecimal values.
I realised my problem when comparing the picture that should have been written with what was actually written, and seeing that DOUBLE LOW-9 QUOTES were not written as quotes but rather as a zero.
The decimal code for that symbol is 132 (0x84). If I use chr(0x84) I get the following error
UnicodeEncodeError: 'charmap' codec can't encode character '\x84' in position 0: character maps to <undefined>
Which only makes sense if chr() was trying to map to Latin-1 codeset. I have tried to convert the int to a unicode but from my research chr is the only function that does this.
I have also tried to use the struct package in python.
import struct
a = 123
b = struct.pack("c", a)
print(b)
I get the error
Traceback (most recent call last):
  File "python", line 3, in <module>
struct.error: char format requires a bytes object of length 1
Reading past questions, answers, and documentation gets quite confusing, as there is a mix of Python 2 and Python 3 answers, mixed in with people converting to ASCII (which obviously wouldn't work).
I am using Python 3.4.3 (the latest version) on a Windows 7 machine.
UnicodeEncodeError: 'charmap' codec can't encode character \x84
\x84 is the encoding of the lower quotes character in Windows-1252. This suggests your data is already encoded, and you should not try to encode it again. In a text string the quote should show up as "\u201E". "\u0084" (the result of chr(132)) is actually a control character.
You should have either bytes which you can decode to a string:
>>> b"\x84".decode('windows-1252')
'\u201e'
Or you should have a text string, which you can encode to a byte string
>>> "\u201e".encode('windows-1252')
b'\x84'
If you read data from somewhere, you could use the struct module like this:
import struct

# suppose we download some data:
data = b'*\x00\x00\x00abcde'
a, txt = struct.unpack("I5s", data)
print(txt.decode('windows-1252'))

Python can't write to a file despite printing perfectly fine

I'm running up against what I assume is some strange encoding error, but it's really baffling me. Basically I'm trying to write a unicode string to a file as an image, and the string representation is printed fine.
ìԉcïԁiԁúлt cúɭpâ ρáncéttá, ëɑ ëɭìt haϻ offícìà còлѕêɋûät. Sunt ԁësërúлt
but any way I try to write the string out to any relevant place I get the standard ascii encoding error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
I've tried setting the encoding of my source files, and ensuring that my system variable isn't set to ascii, and I've tried directly outputting to a file via:
python script.py > output.jpg
and none of it seems to have any effect. I feel a little silly for not being able to solve a simple encoding issue, but I've really got no clue as to where the ascii codec is even coming from at this point.
Relevant code:
import numpy
import StringIO
from PIL import Image

def random_image(**kwargs):
    image_array = numpy.random.rand(kwargs["dims"][0], kwargs["dims"][1], 3) * 255
    image = Image.fromarray(image_array.astype('uint8')).convert('RGBA')
    format = kwargs.get("format", "JPEG")
    output = StringIO.StringIO()
    image.save(output, format=format)
    content = output.getvalue()
    output.close()
    content = [str(ord(char)) for char in content]
    return content
The first question is why do you store the contents of your image in the form of a Unicode string? Images typically contain arbitrary octets and should be represented with str (bytes in Python 3), not with the unicode type.
When you print a Unicode string to the screen, encoding is chosen based on the environment settings. When you print it to the file, you need to specify an encoding, otherwise ascii is assumed. To have your program default to something more sane for files, start it with:
import sys
import codecs

encoding = sys.stdout.encoding or 'utf-8'
sys.stdout = codecs.getwriter(encoding)(sys.stdout, errors='replace')
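Following the first point above, here is a minimal sketch that keeps the image data as bytes end to end (this is Python 3, where io.BytesIO replaces Python 2's StringIO; the file name and placeholder bytes are made up for illustration):

```python
import io

buffer = io.BytesIO()
buffer.write(b'\xff\xd8\xff\xe0')   # stand-in for image.save(buffer, format="JPEG")
content = buffer.getvalue()         # bytes, not a unicode string

# Write the bytes out in binary mode; no text encoding is involved.
with open("output.jpg", "wb") as f:
    f.write(content)
```

Because nothing is ever converted to a text string, there is no point at which the ascii codec can be invoked.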
