Write bytes and strings to a file - Python

I have to create files that contain some plain characters and some hex values in little-endian encoding. To do the encoding, I use:
pack("I", 0x01ddf23a)
and this gives me:
b':\xf2\xdd\x01'
The first problem is that this gives me a bytes string, which I cannot write to the file. The second is that \x3a is turned into ':'. What I expect is to write \x3a\xf2\xdd\x01 to the file as bytes, not as chars.
What I tried:
>>> a = 0x01ddf23a
>>> str(pack("I", a))
"b':\\xf2\\xdd\\x01'"    <= wrong
>>> pack("I", a).hex()
'3af2dd01'               <= I need '\x' before each byte
>>> pack("I", a).decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf2 in position 1: invalid continuation byte
Changing the open() mode from "w" to "wb" forces me to write only bytes, but I want to write lots of strings and a few bytes, e.g.:
Hello world
^I^M^T^B
End file
I know I can simply do this:
fs = open("file", "w")
fs.write("Hello world")
fs.write("\x3a\xf2\xdd\x01")
fs.write("End file")
fs.close()
But this makes my byte value 0x01ddf23a hard to read, and it is easy to make a mistake when changing the value in that form.

You are producing bytes, which can be written to files opened in binary mode without issue. Add b to the file mode when opening and either use bytes string literals or encode your strings to bytes if you need to write other data too:
with open("file", "wb") as fs:
    fs.write(b"Hello world")              # note, a bytes literal!
    fs.write(pack("I", 0x01ddf23a))
    fs.write("End file".encode('ASCII'))  # encoded string to bytes
The alternative would be to decode your binary packed data to a text string first, but since packed data does not, in fact, contain decodable text, that approach requires contortions to force the binary data to be decodable and encodable again. It only works if the file encoding is set to Latin-1, and it severely limits what actual text you could add.
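To make that concrete, here is a minimal sketch of that discouraged text-mode alternative, assuming the file is opened with encoding='latin-1' (Latin-1 maps bytes 0x00-0xFF to code points 1:1, so the packed bytes survive the round trip):
from struct import pack

with open("file", "w", encoding="latin-1") as fs:
    fs.write("Hello world")
    fs.write(pack("I", 0x01ddf23a).decode("latin-1"))  # force bytes into a str
    fs.write("End file")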
A bytes representation will always try to show printable characters where possible. The byte \x3a is also the correct ASCII value for the ':' character, so in a bytes representation the latter is preferred over using the \x3a escape sequence. The correct value is present in the bytes value and would be written to the file entirely correctly:
>>> b'\x3a'
b':'
>>> b'\x3a' == b':'
True
>>> b':'[0]
58
>>> b'\x3a'[0]
58
>>> hex(58)
'0x3a'

Related

Python unable to decode byte string

I am having a problem decoding a byte string that I have to send from one computer to another. The file is in PDF format. I get an error that goes:
fileStrings[i] = fileStrings[i].decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 648: invalid continuation byte
Any ideas how to remove the b' ' marking? I need to reassemble the file, but I also need to know its size in bytes before sending it, and I figured I would learn it by decoding each byte string (this works for txt files but not for pdf ones...)
Code is:
with open(inputne, "rb") as file:
    while 1:
        readBytes = file.read(dataMaxSize)
        fileStrings.append(readBytes)
        if not readBytes:
            break
    readBytes = ''
filesize = 0
for i in range(0, len(fileStrings)):
    fileStrings[i] = fileStrings[i].decode()
    filesize += len(fileStrings[i])
Edit: For anyone having the same issue: len() will give you the size without the b''.
In Python, bytestrings are for raw binary data, and strings are for textual data. decode tries to decode it as utf-8, which is valid for txt files, but not for pdf files, since they can contain random bytes. You should not try to get a string, since bytestrings are designed for this purpose. You can get the length of bytestrings like normal, with len(data). Many of the string operations also apply to bytestrings, such as concatenation and slicing (data1 + data2 and data[1:3]).
As a side note, the b'' when you print it is just because the __str__ method for bytestrings is equivalent to repr. It's not in the data itself.
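For example, with some illustrative bytes (not real PDF content):
>>> data = b'%PDF-1.7\xda\x00'
>>> len(data)                 # counts raw bytes; the b'' wrapper is not data
10
>>> data[0:4] + b'!'          # slicing and concatenation work as for text
b'%PDF!'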

Python3, issues using utf-8 encode/decode (on some occasions)

So, I'm having issues with Python3 encoding. I have a few bytes I want to work with as strings. (long story)
In a few words, this works:
a = "\x85".encode()
print(a.decode())
But this doesn't:
b = (0x85).to_bytes(1,"big")
print(b.decode())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 0: invalid start byte
I have read a handful of articles on the subject, but they insist that 'Python 3 is broken' or that 'you shouldn't be using strings for that'. Plenty of articles on Stack Overflow just use workarounds (such as "replace on error" or "use utf-16").
Could anyone tell me where the difference lies, and why the first works while the second one doesn't? Shouldn't both of them work identically? Why can't utf-8 decode the byte on the second attempt?
In the first case '\x85'.encode() encodes the Unicode code point U+0085 in the Python 3 default encoding of UTF-8. So the output is the correct two-byte UTF-8 encoding of that code point:
>>> '\x85'.encode()
b'\xc2\x85'
Decode then works because it was correctly encoded in UTF-8 to begin with:
>>> b'\xc2\x85'.decode()
'\x85'
The second case is a complicated way of creating a single byte string:
>>> (0x85).to_bytes(1,'big')
b'\x85'
This byte string is not correctly encoded as UTF-8, so it fails to decode:
>>> b'\x85'.decode()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 0: invalid start byte
Python 3 is definitely not "broken". It cleanly separates byte data from text.
If you have raw bytes, work with them as bytes. Raw data in Python 3 is intended to be manipulated in byte strings or byte arrays. Unicode strings are for text. Decode bytes to text to manipulate it, then encode back to bytes to serialize to file, socket, database, etc.
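For example, a typical round trip (the sample bytes are only for illustration):
>>> raw = b'na\xc3\xafve'             # UTF-8 bytes from a file or socket
>>> text = raw.decode('utf-8')        # decode to text to manipulate it
>>> text.upper()
'NAÏVE'
>>> text.upper().encode('utf-8')      # encode back to bytes to serialize
b'NA\xc3\x8fVE'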
If for some reason you feel the need to use Unicode strings for raw data, the first 256 code points of Unicode correspond 1:1 to the latin1 codec, mapping each byte to the code point of the same value.
>>> '\x85'.encode('latin1')
b'\x85'
>>> b'\x85'.decode('latin1')
'\x85'
This is often used to correct programming errors due to encoding/decoding with the wrong encodings.
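For example, a hypothetical repair of text that was decoded with the wrong codec:
>>> mojibake = 'caf\xc3\xa9'                   # UTF-8 bytes wrongly decoded as Latin-1
>>> mojibake.encode('latin1').decode('utf-8')  # undo the mistake, redo it right
'café'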

Read an image from file using ascii encoding

I've been having trouble loading images from a file as a string.
Many of the functions that I need to use in my program rely on the read data being encoded with ascii, and they simply fail to handle the data I give them, producing the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa8 in position 14: ordinal not in range(128)
So how would I go about converting this data to ascii?
EDIT:
Here is the admittedly messy code I am using. Please do not comment on how messy it is; this is a rough draft:
def text_to_bits(text, encoding='utf-8', errors='surrogatepass'):
    bits = bin(int(binascii.hexlify(text.encode(encoding, errors)), 16))[2:]
    return bits.zfill(8 * ((len(bits) + 7) // 8))

def str2int(string):
    binary = text_to_bits(string)
    number = int(binary, 2)
    return number

def go():
    # filen is the name of the file
    global filen
    # Reading the file
    content = str(open(filen, "r").read())
    # Using a function from above
    integer = str2int(content)
    # Write back to the file
    w = open(filen, "w").write(str(integer))
Image data is not ASCII. Image data is binary, and thus uses bytes that the ASCII standard doesn't cover. Don't try to decode the data as ASCII. You also want to make sure you open your file in binary mode, to avoid platform-specific line separator translations, something that'll damage your image data.
Any method expecting to handle image data will deal with binary data, and in Python 2 that means you'll be handling that as the str type.
In your specific case, you are using a function that expects to work on Unicode data, not binary image data, and it is trying to encode that data to binary. In other words, because you are giving it data that is already binary (encoded), the function applies a conversion meant for Unicode (to produce a binary representation) to data that is already binary. Python then tries to decode first, to give you Unicode to encode. It is that implicit decoding that fails here:
>>> '\xa8'.encode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa8 in position 0: ordinal not in range(128)
Note that I encoded, but got a decoding exception.
The code you are using is extremely convoluted. If you wanted to interpret the whole binary contents of a file as one large integer, you could do so by converting to a hex representation, but there would be no need to then convert to a bit string and back to an integer again. The following would suffice:
import binascii

with open(filename, 'rb') as fileobj:
    binary_contents = fileobj.read()

integer_value = int(binascii.hexlify(binary_contents), 16)
Image data is not usually interpreted as one long number, however. Binary data can encode integers, but when processing images you would usually use the struct module to decode specific integer values from specific bytes instead.
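For example, a minimal sketch of that approach, reading the image dimensions from a PNG file (the offsets follow the PNG specification; "image.png" is a placeholder name):
import struct

with open("image.png", "rb") as fileobj:
    header = fileobj.read(24)

# Bytes 16-23 of a PNG hold the IHDR width and height
# as big-endian unsigned 32-bit integers.
width, height = struct.unpack(">II", header[16:24])
print(width, height)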

Python converting bytes to string

I have the following code:
with open("heart.png", "rb") as f:
byte = f.read(1)
while byte:
byte = f.read(1)
strb = byte.decode("utf-8", "ignore")
print(strb)
When reading the bytes from "heart.png", I read hex bytes such as:
b'\x1a', b'\xff', b'\xa4', etc.
and also bytes in this form:
b'A', b'D', b'O', b'D', b'E', etc. <- spells ADOBE
Now, for some reason, when I use the above code to convert from bytes to string, it does not seem to work with the bytes in hex form, but it works for everything else.
So when b'\x1a' comes along, it converts it to "" (an empty string),
and when b'H' comes along, it converts it to "H".
Does anyone know why this is the case?
There are a few things going on here.
The PNG file format can contain text chunks encoded in either Latin-1 or UTF-8. The tEXt chunks are encoded in Latin-1 and you would need to decode them using the 'latin-1' codec. iTXt chunks are encoded in UTF-8 and would need to be decoded with the 'utf-8' codec.
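As a rough sketch of what reading those chunks involves (chunk layout per the PNG spec: a 4-byte big-endian length, a 4-byte type, the data, then a 4-byte CRC; error handling omitted):
import struct

with open("heart.png", "rb") as f:
    f.read(8)                        # skip the 8-byte PNG signature
    while True:
        header = f.read(8)
        if len(header) < 8:
            break
        length, ctype = struct.unpack(">I4s", header)
        data = f.read(length)
        f.read(4)                    # skip the CRC
        if ctype == b"tEXt":         # tEXt payloads are Latin-1 encoded
            keyword, _, value = data.partition(b"\x00")
            print(keyword.decode("latin-1"), value.decode("latin-1"))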
However, you appear to be trying to decode individual bytes, whereas characters in UTF-8 may span multiple bytes. So assuming you want to read UTF-8 strings, what you should do is read in the entire length of the string you wish to decode before attempting to decode it.
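To see why decoding byte-by-byte loses data, compare decoding a complete UTF-8 sequence with decoding only its first byte:
>>> data = 'é'.encode('utf-8')            # two bytes: b'\xc3\xa9'
>>> data.decode('utf-8')
'é'
>>> data[:1].decode('utf-8', 'ignore')    # a lone lead byte decodes to nothing
''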
If instead you are trying to interpret binary data from the file, take a look at the struct module which is intended for that purpose.

Reading UTF8 encoded CSV and converting to UTF-16

I'm reading in a CSV file that has UTF8 encoding:
ifile = open(fname, "r")
for row in csv.reader(ifile):
    name = row[0]
    print repr(row[0])
This works fine, and prints out what I expect it to print out: a UTF8-encoded str:
> '\xc3\x81lvaro Salazar'
> '\xc3\x89lodie Yung'
...
Furthermore, when I simply print the str (as opposed to its repr()) the output displays OK (which I don't understand either way - shouldn't this cause an error?):
> Álvaro Salazar
> Élodie Yung
but when I try to convert my UTF8 encoded strs to unicode:
ifile = open(fname, "r")
for row in csv.reader(ifile):
    name = row[0]
    print unicode(name, 'utf-8') # or name.decode('utf-8')
I get the infamous:
Traceback (most recent call last):
File "scripts/script.py", line 33, in <module>
print unicode(fullname, 'utf-8')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc1' in position 0: ordinal not in range(128)
So I looked at the unicode strings that are created:
ifile = open(fname, "r")
for row in csv.reader(ifile):
    name = row[0]
    unicode_name = unicode(name, 'utf-8')
    print repr(unicode_name)
and the output is
> u'\xc1lvaro Salazar'
> u'\xc9lodie Yung'
So now I'm totally confused as these seem to be mangled hex values. I've read this question:
Reading a UTF8 CSV file with Python
and it appears I am doing everything correctly, leading me to believe that my file is not actually UTF8; but when I initially print out the repr values of the cells, they appear to be correct UTF8 hex values. Can anyone either point out my problem or indicate where my understanding is breaking down (as I'm starting to get lost in the jungle of encodings)?
As an aside, I believe I could use codecs to open the file and read it directly into unicode objects, but the csv module doesn't support unicode natively, so I can't use this approach.
Your default encoding is ASCII. When you try to print a unicode object, the interpreter therefore tries to encode it using the ASCII codec, which fails because your text includes characters that don't exist in ASCII.
The reason that printing the UTF-8 encoded bytestring doesn't produce an error (which seems to confuse you, although it shouldn't) is that this simply sends the bytes to your terminal. It will never produce a Python error, although it may produce ugly output if your terminal doesn't know what to do with the bytes.
To print a unicode, use print some_unicode.encode('utf-8'). (Or whatever encoding your terminal is actually using).
As for u'\xc1lvaro Salazar', nothing here is mangled. The character Á is at the Unicode code point U+00C1 (which has nothing to do with its UTF-8 representation, but happens to be the same value as in Latin-1), and Python uses \x hex escapes instead of \u Unicode code point notation for code points whose most significant byte is 00, to save space (it could also have displayed this as \u00c1).
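For example (Python 2, assuming a UTF-8 terminal):
>>> u'\xc1' == u'\u00c1'    # the same code point, written two ways
True
>>> print u'\xc1lvaro Salazar'.encode('utf-8')
Álvaro Salazar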
To get a good overview of how Unicode works in Python, I suggest http://nedbatchelder.com/text/unipain.html
