Python - Read string from binary file

I need to read up to the point of a certain string in a binary file, and then act on the bytes that follow. The string is 'colr' (this is a JPEG 2000 file) and here is what I have so far:
from collections import deque
f = open('my.jp2', 'rb')
bytes = deque([], 4)
while ''.join(map(chr, bytes)) != 'colr':
    bytes.appendleft(ord(f.read(1)))
If this works:
bytes = deque([0x63, 0x6F, 0x6C, 0x72], 4)
print ''.join(map(chr, bytes))
(returns 'colr'), I'm not sure why the test in my loop never evaluates to True. I wind up spinning - just hanging - I don't even get an exit when I've read through the whole file.

Change your bytes.appendleft() to bytes.append() and then it will work -- it does for me.
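
The reason the order matters: appendleft() pushes each new byte onto the front, so after reading c, o, l, r the window holds them reversed ('rloc') and the comparison never matches. Here is a minimal sketch of the same idea in Python 3 (the question uses Python 2). The helper name find_marker is hypothetical, and an io.BytesIO object stands in for the open file. It also stops on an empty read instead of spinning at end of file.

```python
import io
from collections import deque

def find_marker(stream, marker=b"colr"):
    """Advance stream just past the first occurrence of marker.

    Returns True if the marker was found, False if end of file came first.
    """
    window = deque(maxlen=len(marker))  # keeps only the newest len(marker) bytes
    while True:
        byte = stream.read(1)
        if not byte:                    # empty read means end of file
            return False
        window.append(byte[0])          # append on the right keeps file order
        if bytes(window) == marker:
            return True

# Stand-in for open('my.jp2', 'rb'); the content is made up for illustration.
stream = io.BytesIO(b"\x00\x00jp2h....colr\x01\x02\x03")
found = find_marker(stream)
following = stream.read()               # the bytes that come after the marker
```

After the call, the stream position sits immediately after the marker, so the next read returns the bytes to act on.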

with open("my.jpg","rb") as f:
    print f.read().split("colr",1)
If you don't want to read it all at once, then:
def preprocess(line):
    print "Do something with this line"
def postprocess(line):
    print "Do something else with this line"
currentproc = preprocess
with open("my.jpg","rb") as f:
    for line in f:
        if "colr" in line:
            # split only on the first occurrence so exactly two parts come back
            left, right = line.split("colr", 1)
            preprocess(left)
            postprocess(right)
            currentproc = postprocess
        else:
            currentproc(line)
It's line by line rather than byte by byte, but meh.
I have a hard time believing that you don't have enough RAM to hold the whole JPEG in memory. Python is not really an awesome language for minimizing memory or time footprints,
but it is awesome for functional requirements :)

Related

I'm reading into a 256 byte string. I want to skip it, if it's all binary zeros (\x00) Is there a single test?

Totally new to python. Trying to parse a file but not all records contain data. I want to skip the records that are all hex 00.
I tried if record == ('\x00' * 256): (adapted from a sample, print("-"*80)), but it
gave a syntax error. Hey, I said I was new. :)
Thanks for the reply, I'm using 2.7 and reading like this....
with open(testfile, "rb") as f:
    counter = 0
    while True:
        record = f.read(256)
        counter += 1
Your example looks to be very close. I'm not sure about Python 2, but in Python 3 you should mark the literal as bytes with the b prefix, because data read from a file opened in binary mode is bytes, not str.
I would do something like:
empty = b'\x00' * 256
if record == empty:
    print('skipped this line')
Remember that Python 2 uses print statements, so you should do print 'skipped this line' instead.
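
Putting the read loop and the comparison together, a sketch in Python 3 might look like the following. The names RECORD_SIZE and non_empty_records are hypothetical, and io.BytesIO stands in for the real file.

```python
import io

RECORD_SIZE = 256
EMPTY = bytes(RECORD_SIZE)      # 256 zero bytes, same as b"\x00" * 256

def non_empty_records(stream, size=RECORD_SIZE):
    """Yield each fixed-size record that is not entirely zero bytes."""
    while True:
        record = stream.read(size)
        if not record:          # empty read means end of file
            break
        if record == EMPTY:     # single test for an all-zero record
            continue
        yield record

# Made-up sample: two all-zero records interleaved with two real ones.
data = b"\x00" * 256 + b"A" * 256 + b"\x00" * 256 + b"B" * 256
records = list(non_empty_records(io.BytesIO(data)))
```

If the last record in a file can be shorter than 256 bytes, testing with not any(record) instead of the equality check also catches a short all-zero tail.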

Ignore newline character in binary file with Python?

I open my file like so :
f = open("filename.ext", "rb") # ensure binary reading with b
My first line of data looks like this (when using f.readline()):
'\x04\x00\x00\x00\x12\x00\x00\x00\x04\x00\x00\x00\xb4\x00\x00\x00\x01\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x18\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00\x06\x00\x00\x00:\x00\x00\x00;\x00\x00\x00<\x00\x00\x007\x00\x00\x008\x00\x00\x009\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\t\x00\x00\x00\n'
Thing is, I want to read this data four bytes at a time (f.read(4)). While debugging, I realized that when it gets to the end of the first line, it still takes in the newline character \n, which is then used as the first byte of the next int I read. I don't want to simply use .splitlines() because some of the data could contain a \n and I don't want to corrupt it. I'm using Python 2.7.10, by the way. I also read that opening a binary file with the b parameter "takes care" of the newline/end-of-line characters; why is that not the case for me?
This is what happens in the console as the file's position is right before the newline character:
>>> d = f.read(4)
>>> d
'\n\x00\x00\x00'
>>> s = struct.unpack("i", d)
>>> s
(10,)
(Followed from discussion with OP in chat)
Seems like the file is in binary format and the newlines are just mis-interpreted values. This can happen when writing 10 to the file for example.
This doesn't mean that newline was intended, and it is probably not. You can just ignore it being printed as \n and just use it as data.
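
To see that the "newline" is just the integer 10, here is a small Python 3 sketch using the four bytes from the console session. The explicit "<" (little-endian) prefix is an assumption; the question uses the native-order "i" format.

```python
import struct

raw = b"\n\x00\x00\x00"                 # the four bytes shown in the console session
(value,) = struct.unpack("<i", raw)     # "<i": little-endian 32-bit signed integer

# Packing the integer 10 produces the very same bytes, newline included.
packed = struct.pack("<i", 10)
```

Byte 0x0A is both the character \n and the number 10, so readline() splitting there is an accident of the data rather than a real line ending.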
You should just be able to replace the bytes that indicate it is a newline.
>>> d = f.read(4).replace(b'\x0d\x0a', b'') #\r\n should be bytes b'\x0d\x0a'
>>> diff = 4 - len(d)
>>> while diff > 0: # You can probably make this more sophisticated
...     d += f.read(diff).replace(b'\x0d\x0a', b'') #\r\n should be bytes b'\x0d\x0a'
...     diff = 4 - len(d)
>>>
>>> s = struct.unpack("i", d)
This should give you an idea of how it will work. This approach could mess with your data's byte alignment.
If you really are seeing "\n" in your print of d then try .replace(b"\n", b"")

Python f.read not reading the correct number of bytes

I have code that is supposed to read 4 bytes but it is only reading 3 sometimes:
f = open('test.sgy', 'r+')
f.seek(99716)
AAA = f.read(4)
BBB = f.read(4)
CCC = f.read(4)
print len(AAA)
print len(BBB)
print len(CCC)
exit()
And this program returns:
4
3
4
What am I doing wrong? Thanks!
You're assuming read does something it does not. As its documentation tells you:
read(...)
read([size]) -> read at most size bytes, returned as a string.
In other words, it reads at most size bytes and may return fewer.
If you need exactly size bytes, you'll have to create a wrapper function.
Here's a (not thoroughly tested) example that you can adapt:
def read_exactly( fd, size ):
    data=""
    remaining= size
    while remaining>0: #or simply "while remaining", if you'd like
        newdata= fd.read(remaining)
        if len(newdata)==0: #problem
            raise IOError("Failed to read enough data")
        data+=newdata
        remaining-= len(newdata)
    return data
As Mark Dickinson mentioned in the comments, if you're on Windows, make sure you're reading in binary mode - otherwise you risk reading your (binary) data wrong.
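
For Python 3, where binary reads return bytes, the same wrapper could be sketched as below. TrickleStream is a hypothetical stand-in that hands back at most 3 bytes per call, only to exercise the loop; a real file would be opened with mode 'rb'.

```python
import io

def read_exactly(stream, size):
    """Read exactly size bytes from stream, or raise EOFError."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:                       # empty read: no more data is coming
            raise EOFError("Failed to read enough data")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

class TrickleStream:
    """Hypothetical stream that returns at most 3 bytes per read call."""
    def __init__(self, data):
        self._buffer = io.BytesIO(data)

    def read(self, size):
        return self._buffer.read(min(size, 3))

stream = TrickleStream(b"abcdefghij")
first = read_exactly(stream, 4)             # needs two underlying reads
second = read_exactly(stream, 4)
```

Collecting chunks in a list and joining once avoids repeated concatenation of growing byte strings.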

Python pySerial read data from arduino breaks when sending "(char)0"

I send some data from an arduino using pySerial.
My Data looks like
bytearray(DST, SRC, STATUS, TYPE, CHANNEL, DATA..., SIZEOFDATA)
where SIZEOFDATA is a check that all bytes were received.
The problem is, every time when a byte is zero, my python program just stops reading there:
serial_port = serial.Serial("/dev/ttyUSB0")
while serial_port.isOpen():
    response_header_str = serial_port.readline()
    format = '>'
    format += ('B'*len(response_header_str))
    response_header = struct.unpack(format, response_header_str)
    pprint(response_header)
serial_port.close()
For example, when I send bytearray(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15) everything is fine. But when I send something like bytearray(1,2,3,4,0,1,2,3,4) I don't see everything beginning with the zero.
The problem is that I cannot avoid sending zeros as I am just sending the "memory dump" e.g. when I send a float value, there might be zero bytes.
How can I tell pySerial not to ignore zero bytes?
I've looked through the source of PySerial and the problem is in PySerial's implementation of FileLike.readline (in http://svn.code.sf.net/p/pyserial/code/trunk/pyserial/serial/serialutil.py). The offending function is:
def readline(self, size=None, eol=LF):
    """\
    Read a line which is terminated with end-of-line (eol) character
    ('\n' by default) or until timeout.
    """
    leneol = len(eol)
    line = bytearray()
    while True:
        c = self.read(1)
        if c:
            line += c
            if line[-leneol:] == eol:
                break
            if size is not None and len(line) >= size:
                break
        else:
            break
    return bytes(line)
With the obvious problem being the if c: line. When c == b'\x00' this evaluates to false, and the routine breaks out of the read loop. The easiest thing to do would be to reimplement this yourself as something like:
def readline(port, size=None, eol="\n"):
    """\
    Read a line which is terminated with end-of-line (eol) character
    ('\n' by default) or until timeout.
    """
    leneol = len(eol)
    line = bytearray()
    while True:
        line += port.read(1)
        if line[-leneol:] == eol:
            break
        if size is not None and len(line) >= size:
            break
    return bytes(line)
To clarify from your comments, this is a replacement for the Serial.readline method that will consume null-bytes and add them to the returned string until it hits the eol character, which we define here as "\n".
An example of using the new method, with a file object substituted for the serial port:
>>> # Create some example data terminated by a newline containing nulls.
>>> handle = open("test.dat", "wb")
>>> handle.write(b"hell\x00o, w\x00rld\n")
>>> handle.close()
>>>
>>> # Use our readline method to read it back in.
>>> handle = open("test.dat", "rb")
>>> readline(handle)
'hell\x00o, w\x00rld\n'
Hopefully this makes a little more sense.
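
A Python 3 variant of the same idea, as a sketch only: the name read_until is hypothetical, eol is a bytes value, and the loop stops on an empty read (end of data, or a timeout on a serial port) so it cannot spin forever. Zero bytes are kept like any other byte.

```python
import io

def read_until(stream, eol=b"\n", size=None):
    """Read from stream until eol is seen, size bytes are read, or no data arrives."""
    line = bytearray()
    while True:
        chunk = stream.read(1)
        if chunk == b"":                # nothing returned: end of data or timeout
            break
        line += chunk
        if line.endswith(eol):
            break
        if size is not None and len(line) >= size:
            break
    return bytes(line)

# io.BytesIO stands in for the serial port; the payload contains NUL bytes.
port = io.BytesIO(b"hell\x00o, w\x00rld\nnext line\n")
first_line = read_until(port)
second_line = read_until(port)
```

Since the payload is a raw memory dump, any byte value (including 0x0A) can appear in the data, so a length-prefixed or fixed-size framing is usually more robust than a newline terminator.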

Convert binary data to web-safe text and back - Python

I want to convert a binary file (such as a jpg, mp3, etc) to web-safe text and then back into binary data. I've researched a few modules and I think I'm really close but I keep getting data corruption.
After looking at the documentation for binascii I came up with this:
from binascii import *
raw_bytes = open('test.jpg','rb').read()
text = b2a_qp(raw_bytes,quotetabs=True,header=False)
bytesback = a2b_qp(text,header=False)
f = open('converted.jpg','wb')
f.write(bytesback)
f.close()
When I try to open the converted.jpg I get data corruption :-/
I also tried using b2a_base64 with 57-long blocks of binary data. I took each block, converted to a string, concatenated them all together, and then converted back in a2b_base64 and got corruption again.
Can anyone help? I'm not super knowledgeable on all the intricacies of bytes and file formats. I'm using Python on Windows if that makes a difference with the \r\n stuff
Your code looks quite complicated. Try this:
#!/usr/bin/env python
from binascii import *
raw_bytes = open('28.jpg','rb').read()
i = 0
str_one = b2a_base64(raw_bytes) # 1
str_list = b2a_base64(raw_bytes).split("\n") #2
bytesBackAll = a2b_base64(''.join(str_list)) #2
print bytesBackAll == raw_bytes #True #2
bytesBackAll = a2b_base64(str_one) #1
print bytesBackAll == raw_bytes #True #1
Lines tagged with #1 and #2 represent alternatives to each other. #1 seems most straightforward to me - just make it one string, process it and convert it back.
You should use base64 encoding instead of quoted printable. Use b2a_base64() and a2b_base64().
Quoted-printable is much bigger for binary data like pictures. In this encoding each binary (non-alphanumeric) byte is replaced by = followed by its two-digit hex code. It is suited to text that consists mainly of alphanumeric characters, such as email subjects.
Base64 is much better for mainly binary data. It takes the first 6 bits of the first byte, then the last 2 bits of the first byte together with the first 4 bits of the second byte, and so on, so every 6 bits of input become one output character. It can be recognized by the = padding at the end of the encoded text (sometimes another character is used).
As an example I took a .jpeg of 271 700 bytes. In qp it is 627 857 bytes, while in base64 it is 362 269 bytes. The size of qp depends on the data type: text that consists of letters only does not change. The size of base64 is roughly orig_size * 8 / 6.
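
The base64 figure can be checked with a short Python 3 sketch. The helper name base64_length is hypothetical, and the input is placeholder zero bytes of the same size as the JPEG, since only the length matters here.

```python
import binascii
import math

def base64_length(n_bytes):
    """Length of padded base64 text for n_bytes of input, without line breaks."""
    return 4 * math.ceil(n_bytes / 3)   # every 3 input bytes become 4 characters

raw = bytes(271700)                     # placeholder data, same size as the example JPEG
encoded = binascii.b2a_base64(raw)      # b2a_base64 appends one trailing newline

predicted = base64_length(len(raw))
```

Here predicted is 362 268, and the trailing newline added by b2a_base64 accounts for the 362 269 bytes quoted above.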
Your documentation reference is for Python 3.0.1. There is no good reason to use Python 3.0. You should be using 3.2 or 2.7. What exactly are you using?
Suggestion: (1) change bytes to raw_bytes to avoid confusion with the bytes built-in (2) check for raw_bytes == bytes_back in your test script (3) while your test should work with quoted-printable, it is very inefficient for binary data; use base64 instead.
Update: Base64 encoding produces 4 output bytes for every 3 input bytes. Your base64 code doesn't work with 56-byte chunks because 56 is not an integral multiple of 3; each chunk is padded out to a multiple of 3. Then you join the chunks and attempt to decode, which is guaranteed not to work.
Your chunking loop would be much better written as:
output_string = ''.join(
    b2a_base64(raw_bytes[i:i+57]) for i in xrange(0, len(raw_bytes), 57)
)
In any case, chunking is rather slow and pointless; just do b2a_base64(raw_bytes)
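
To make the chunk-size point concrete, here is a Python 3 sketch (the helper name encode_in_chunks and the sample data are made up for illustration). With 57-byte chunks no padding appears before the final chunk, while 56-byte chunks each end in an = pad, which lands in the middle of the joined text.

```python
import binascii

def encode_in_chunks(raw, chunk_size):
    """Base64-encode raw in chunk_size pieces and join the resulting lines."""
    return b"".join(
        binascii.b2a_base64(raw[i:i + chunk_size])
        for i in range(0, len(raw), chunk_size)
    )

raw = bytes(range(256)) * 4             # 1024 bytes of sample data

good = encode_in_chunks(raw, 57)        # 57 is a multiple of 3
bad = encode_in_chunks(raw, 56)         # 56 is not a multiple of 3

roundtrip_ok = binascii.a2b_base64(good) == raw

# Look at the first encoded line of each variant.
good_first_line = good.split(b"\n")[0]
bad_first_line = bad.split(b"\n")[0]
good_has_early_padding = good_first_line.endswith(b"=")
bad_has_early_padding = bad_first_line.endswith(b"=")
```

Padding is only meaningful at the very end of a base64 stream, so the joined 56-byte version cannot be relied on to decode back to the original bytes.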
#PMC's answer copied from the question:
Here's what works:
from binascii import *
raw_bytes = open('28.jpg','rb').read()
str_list = []
i = 0
while i < len(raw_bytes):
    byteSegment = raw_bytes[i:i+57]
    str_list.append(b2a_base64(byteSegment))
    i += 57
bytesBackAll = a2b_base64(''.join(str_list))
print bytesBackAll == raw_bytes #True
Thanks for the help guys. I'm not sure why this would fail with [0:56] instead of [0:57] but I'll leave that as an exercise for the reader :P
