I have written a class which parses the header segment of a data file (for storing scientific instrument data) and collects things like offsets to the various data segments within the file. The actual data is obtained through various methods which read and parse the data segments.
The problem I'm running into is that there is a segment defined for vendor-specific, unstructured data. Since there's nothing to parse, I just need my method to return raw binary data. However, this segment could be very large, so I don't want to read it all at once and return a single bytes object.
What I'd like to do is have the method return an io.BufferedReader object (or similar) into the file that only reads between a beginning and end offset. I haven't been able to figure out a way to do this using the built-in IO classes. Is it possible?
BufferedReader inherits its methods from IOBase, so you can absolutely call reader.seek(byte_offset) to jump to that byte position in the stream. From there, however, you will have to track how many bytes you have read yourself until you reach the end offset of the segment. The start offset passed to seek() will of course have to be known beforehand, as will the ending byte offset. Some example code follows (assuming a start offset of byte 250):
import io

# Open the file unbuffered in binary mode to get a raw FileIO object, then wrap
# it in a BufferedReader explicitly. (Note that open("file.txt", "rb") already
# returns a BufferedReader on its own.)
raw = io.open("file.txt", "rb", buffering=0)
buffered_reader = io.BufferedReader(raw)
# set the stream to byte 250
buffered_reader.seek(250)
# read up to byte 750 (500 bytes from position 250)
data = buffered_reader.read(500)
Of course, if this header is dynamically sized, you will have to scan to determine the start positions, which means reading line by line.
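As far as I know there is no built-in reader that enforces an end offset, but a thin wrapper around an open file gets you the same effect. Here is a minimal sketch (the class name, file name, and offsets are all made up for illustration):

import io

class SegmentReader:
    """File-like view that only reads bytes in [start, end) of an underlying file."""

    def __init__(self, path, start, end):
        self._fh = io.open(path, "rb")
        self._fh.seek(start)
        self._end = end

    def read(self, size=-1):
        # Clamp every read so it can never run past the end offset.
        remaining = self._end - self._fh.tell()
        if remaining <= 0:
            return b""
        if size < 0 or size > remaining:
            size = remaining
        return self._fh.read(size)

    def close(self):
        self._fh.close()

# e.g. stream the vendor segment in chunks instead of one huge bytes object
# segment = SegmentReader("instrument.dat", start=1024, end=9 * 1024 * 1024)
# chunk = segment.read(65536)

Your parser method could construct and return one of these using the offsets it collected from the header.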
I have binary data that is stored in a non-trivial format where the information 'chunks' are not a fixed size and are similar to packets. I am reading them dynamically using this function:
import struct
from io import BytesIO

def unpack_bytes(stream: BytesIO, binary_format: str) -> tuple:
    size = struct.calcsize(binary_format)
    buf = stream.read(size)
    print(buf)
    return struct.unpack(binary_format, buf)
This function is called with the appropriate format as needed and the code that creates the stream and loops over it is as follows:
def parse_data_file(data_directory: str) -> Generator[CompressedFile]:
    with open(data_directory, 'rb') as packet_stream:
        while <EOF file logic here>:
            contents = parse_packet(packet_stream)
            contents = gzip.compress(data=contents, compresslevel=9)
            yield CompressedFile(filename=f"{uuid.uuid4()}.gz", datetime=datetime.now(),
                                 contents=contents)
CompressedFile is just a small dataclass to store the filename, datetime, and compressed contents.
parse_packet extracts a single packet (as per the data spec) from the bin file and returns the contents. Since the packets don't have a fixed width I am wondering what the best way to stop the loop would be. The two options I know of are:
Add some extra logic to unpack_bytes() to bubble up an EOF.
Do some cursor-foo to save the EOF position and check against it as the loop runs. I'd rather not manipulate the cursor directly if possible.
Is there a more idiomatic way to check for EOF within parse_data_file?
The last call to parse_packet (and by extension the last call to unpack_bytes) will consume all the data and the cursor will be at the end when the next iteration of the loop begins. I'd like to take advantage of that state instead of adding EOF handling code all the way up from unpack_bytes or fiddling with the cursor directly.
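One idiom that leans on exactly that state, without touching the cursor or threading EOF handling up from unpack_bytes: since open(..., 'rb') returns an io.BufferedReader, you could use peek(), which looks ahead without consuming anything and returns b'' once the stream is exhausted. A rough sketch, changing only the loop condition (parse_packet is the function from the question):

with open(data_directory, 'rb') as packet_stream:
    # peek() returns b'' only after the last packet has been consumed,
    # so the loop stops exactly at EOF without moving the cursor.
    while packet_stream.peek(1):
        contents = parse_packet(packet_stream)
        ...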
I am trying to extract embeddings from a hidden layer of an LSTM. I have a huge dataset with multiple sentences, which will therefore generate multiple numpy vectors. I want to store all those vectors efficiently in a single file. This is what I have so far:
with open(src_vectors_save_file, "wb") as s_writer, open(tgt_vectors_save_file, "wb") as t_writer:
    for batch in data_iter:
        encoder_hidden_layer, decoder_hidden_layer = self.extract_lstm_hidden_states_for_batch(
            batch, data.src_vocabs, attn_debug
        )
        encoder_hidden_layer = encoder_hidden_layer.detach().numpy()
        decoder_hidden_layer = decoder_hidden_layer.detach().numpy()
        enc_hidden_bytes = pickle.dumps(encoder_hidden_layer)
        dec_hidden_bytes = pickle.dumps(decoder_hidden_layer)
        s_writer.write(enc_hidden_bytes)
        s_writer.write("\n")
        t_writer.write(dec_hidden_bytes)
        t_writer.write("\n")
Essentially I am using pickle to get the bytes from the np.array and writing them to a binary file. I naively tried to separate each byte-encoded array with an ASCII newline, which obviously throws an error. I was planning to use the .readlines() function, or to read each byte-encoded array per line using a for loop, in the next program. However, that won't be possible now.
I am out of ideas. Can someone suggest an alternative? How can I efficiently store all the arrays in a compressed fashion in one file, and how can I read them back from that file?
There is a problem with using \n as a separator: the dump from pickle (enc_hidden_bytes) could itself contain \n, because the data is not ASCII-encoded.
There are two solutions. You can escape any \n appearing in the data and then use \n as a terminator, but this adds complexity to reading as well as writing.
The other solution is to put into the file the size of the data before starting the actual data. This is like some sort of a header and is a very common practice while sending data over a connection.
You can write the following two functions -
import struct

def write_bytes(handle, data):
    # Prefix the payload with its size as an 8-byte big-endian unsigned integer
    total_bytes = len(data)
    handle.write(struct.pack(">Q", total_bytes))
    handle.write(data)

def read_bytes(handle):
    # Read the 8-byte size prefix; an empty read means we reached end of file
    size_bytes = handle.read(8)
    if len(size_bytes) == 0:
        return None
    total_bytes = struct.unpack(">Q", size_bytes)[0]
    return handle.read(total_bytes)
Now you can replace
s_writer.write(enc_hidden_bytes)
s_writer.write("\n")
with
write_bytes(s_writer, enc_hidden_bytes)
and the same for the other variables.
While reading back from the file in a loop you can use the read_bytes function in a similar way.
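For example, a read-back loop might look like this (read_all_arrays is just a placeholder name; it collects everything into a list, but you could equally yield each array one at a time):

import pickle

def read_all_arrays(path):
    # Reads the length-prefixed pickled arrays back in write order,
    # using the read_bytes helper defined above.
    arrays = []
    with open(path, "rb") as handle:
        while True:
            payload = read_bytes(handle)
            if payload is None:  # read_bytes returns None at end of file
                break
            arrays.append(pickle.loads(payload))
    return arrays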
We are writing a class that calls read on an io object, likely (but not necessarily) a file or a pipe. That stream contains a large number of small binary messages of the format (length, blob); they are not of fixed length, but are on the order of 100 bytes.
The two obvious options are along the lines of
while True:
    length = unpack(f.read(4))
    blob = f.read(length)
which works fine but is slow, or:
while True:
    buffer = f.read(8192)
    for blob in unpack_buffer(buffer):
        ...
    # handle the remainder
which is fast, but when reading from a pipe that is streaming live, it will not return until 8K of data has accumulated, so it is not good for latency on sporadic input.
We have considered non-blocking reads, but that path is problematic as there isn't always an fd, and we're also reluctant to mess with the parameters of a file object passed in by the user.
Is there some way to make read() return a partial buffer < 8192 bytes as soon as some data is present? I believe the underlying Unix read() syscall does this, but not fread().
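One thing that may help here, assuming the stream is an io.BufferedReader (or another BufferedIOBase): it has a read1() method, which issues at most one read on the underlying raw stream and therefore returns whatever bytes are currently available instead of waiting for the full count. A rough sketch of the idea, reusing the unpack_buffer helper from the question and glossing over the partial-message bookkeeping:

while True:
    chunk = f.read1(8192)  # returns as soon as any data is available
    if not chunk:
        break              # EOF, or the writing end of the pipe was closed
    for blob in unpack_buffer(chunk):
        ...
    # carry any trailing partial message over into the next chunk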
I have a .txt file whose contents are:
This is an example file.
These are its contents.
This is line 3.
If I open the file, move to the beginning, and write some text like so...
f = open(r'C:\Users\piano\Documents\sample.txt', 'r+')
f.seek(0, 0)
f.write('Now I am adding text.\n')
What I am expecting is for the file to read:
Now I am adding text.
This is an example file.
These are its contents.
This is line 3.
...but instead it reads:
Now I am adding text.
.
These are its contents.
This is line 3.
So why is some of the text being replaced instead of the text I'm writing simply being added onto the beginning? How can I fix this?
write() will overwrite any existing content at the current position; it does not insert.
To overcome this, you can do:
with open(r'C:\Users\piano\Documents\sample.txt', 'r+') as file:
    string = file.read()
    file.truncate(0)  # delete all contents
    file.seek(0, 0)
    file.write('Now I am adding text.\n' + string)
It is also recommended that you use with, because the file is closed automatically by the __exit__() magic method when the block ends. This matters because not all Python interpreters behave like CPython, so you cannot rely on the file being closed as soon as it goes out of scope.
Bonus: If you wish to insert lines inbetween, you can do:
with open(r'C:\Users\piano\Documents\sample.txt', 'r+') as file:
    contents = file.readlines()
    contents.insert(1, 'Now I am adding text.\n')  # inserting as the second line
    file.truncate(0)  # delete all contents
    file.seek(0, 0)
    file.writelines(contents)
Most file systems don't work like that. A file's contents are mapped to data blocks, and these data blocks are not guaranteed to be contiguous on the underlying system (i.e. not necessarily "side-by-side").
When you seek, you're seeking to a byte offset. So if you want to insert new data between 2 byte offsets of a particular block, you'll have to actually shift all subsequent data over by the length of what you're inserting. Since the block could easily be entirely "filled", shifting the bytes over might require allocating a new block. If the subsequent block was entirely "filled" as well, you'll have to shift the data of that block too, and so on. You can start to see why there's no "simple" operation for shifting data.
Generally, we solve this by just reading all the data into memory and then re-writing it back to a file. When you encounter the byte offset you're interested in inserting "new" content at, you write your buffer and then continue writing the "original" data. In Python, you won't have to worry about interleaving multiple buffers when writing, since Python will abstract the data to some data structure. So you'd just concatenate the higher-level data structures (e.g. if it's a text file, just concat the 3 strings).
If the file is too large for you to comfortably place it in memory, you can write to a "new" temporary file, and then just swap it with the original when your write operation is done.
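A minimal sketch of that temp-file approach (the function and variable names are just placeholders); os.replace swaps the files atomically when they live on the same filesystem:

import os
import shutil
import tempfile

def prepend_text(path, new_text):
    # Write the new content first, stream the original file after it,
    # then replace the original with the temporary file.
    directory = os.path.dirname(os.path.abspath(path))
    with tempfile.NamedTemporaryFile("w", dir=directory, delete=False) as tmp, \
         open(path, "r") as src:
        tmp.write(new_text)
        shutil.copyfileobj(src, tmp)
    os.replace(tmp.name, path)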
Now if you consider the "shifting" of data in data blocks I mentioned above, you might consider the simpler edge case where you happen to be inserting data of length N at an offset that's a multiple of N, where N is the fixed size of the data block in the file system. In this case, if you think of the data blocks as a linked list, you might consider it a rather simple operation to add a new data block between the offset you're inserting at and the next block in the list.
In fact, Linux systems do support allocating an additional block at this boundary. See fallocate.
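Purely to illustrate that capability (not something you would normally reach for from Python), the insert-range variant can be invoked through ctypes. This is only a sketch: it assumes Linux with glibc, a filesystem that supports FALLOC_FL_INSERT_RANGE (e.g. ext4 or xfs), and offsets and lengths that are exact multiples of the filesystem block size:

import ctypes
import os

FALLOC_FL_INSERT_RANGE = 0x20  # constant from linux/falloc.h

libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.fallocate.argtypes = (ctypes.c_int, ctypes.c_int,
                           ctypes.c_longlong, ctypes.c_longlong)

def insert_gap(fd, offset, length):
    # Shift everything from `offset` onward forward by `length` bytes,
    # leaving a zero-filled gap that can then be overwritten in place.
    if libc.fallocate(fd, FALLOC_FL_INSERT_RANGE, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))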
I've been supplied with a zipped file containing multiple individual streams of compressed XML. The compressed file is 833 mb.
If I try to decompress it as a single object, I only get the first stream (about 19 kb).
I've modified the following code, supplied as an answer to an older question, to decompress each stream and write it to a file:
import zlib

outfile = open('output.xml', 'w')

def zipstreams(filename):
    """Return all zip streams and their positions in file."""
    with open(filename, 'rb') as fh:
        data = fh.read()
        i = 0
        print "got it"
        while i < len(data):
            try:
                zo = zlib.decompressobj()
                dat = zo.decompress(data[i:])
                outfile.write(dat)
                zo.flush()
                i += len(data[i:]) - len(zo.unused_data)
            except zlib.error:
                i += 1
    outfile.close()

zipstreams('payload')
This code runs and produces the desired result (all the XML data decompressed to a single file). The problem is that it takes several days to work!
Even though there are tens of thousands of streams in the compressed file, it still seems like this should be a much faster process. Roughly 8 days to decompress 833 MB (an estimated 3 GB raw) suggests that I'm doing something very wrong.
Is there another way to do this more efficiently, or is the slow speed the result of a read-decompress-write---repeat bottleneck that I'm stuck with?
Thanks for any pointers or suggestions you have!
It's hard to say very much without more specific knowledge of the file format you're actually dealing with, but it's clear that your algorithm's handling of substrings is quadratic, which is not a good thing when you've got tens of thousands of them. So let's see what we know:
You say that the vendor states that they are
using the standard zlib compression library. These are the same compression routines on which the gzip utilities are built.
From this we can conclude that the component streams are in raw zlib format, and are not encapsulated in a gzip wrapper (or a PKZIP archive, or whatever). The authoritative documentation on the ZLIB format is here: https://www.rfc-editor.org/rfc/rfc1950
So let's assume that your file is exactly as you describe: A 32-byte header, followed by raw ZLIB streams concatenated together, without any other stuff in between. (Edit: That's not the case, after all).
Python's zlib module provides a Decompress object (returned by zlib.decompressobj()) that is actually pretty well suited to churning through your file. It includes an attribute unused_data whose documentation states clearly that:
The only way to determine where a string of compressed data ends is by actually decompressing it. This means that when compressed data is contained in part of a larger file, you can only find the end of it by reading data and feeding it followed by some non-empty string into a decompression object's decompress() method until the unused_data attribute is no longer an empty string.
So, this is what you can do: Write a loop that reads through data, say, one block at a time (no need to even read the entire 800 MB file into memory). Push each block to the Decompress object and check the unused_data attribute. When it becomes non-empty, you've got a complete object. Write it to disk, create a new decompress object and initialize it with the unused_data from the last one. This just might work (untested, so check for correctness).
Edit: Since you do have other data in your data stream, I've added a routine that aligns to the next ZLIB start. You'll need to find and fill in the two-byte sequence that identifies a ZLIB stream in your data. (Feel free to use your old code to discover it.) While there's no fixed ZLIB header in general, it should be the same for each stream since it consists of protocol options and flags, which are presumably the same for the entire run.
import zlib

# FILL IN: ZHEAD is two bytes with the actual ZLIB settings in the input
ZHEAD = CMF + FLG

def findstart(header, buf, source):
    """Find `header` in str `buf`, reading more from `source` if necessary"""
    while buf.find(header) == -1:
        more = source.read(2**12)
        if len(more) == 0:  # EOF without finding the header
            return ''
        buf += more
    offset = buf.find(header)
    return buf[offset:]
You can then advance to the start of the next stream. I've added a try/except pair since the same byte sequence might occur outside a stream:
source = open(datafile, 'rb')
skip_ = source.read(32)  # Skip non-zlib header

buf = ''
while True:
    decomp = zlib.decompressobj()
    # Find the start of the next stream
    buf = findstart(ZHEAD, buf, source)
    try:
        stream = decomp.decompress(buf)
    except zlib.error:
        print "Spurious match(?) at output offset %d." % outfile.tell(),
        print "Skipping 2 bytes"
        buf = buf[2:]
        continue

    # Read until zlib decides it's seen a complete file
    while decomp.unused_data == '':
        block = source.read(2**12)
        if len(block) > 0:
            stream += decomp.decompress(block)
        else:
            break  # We've reached EOF

    outfile.write(stream)
    buf = decomp.unused_data  # Save for the next stream
    if len(block) == 0:
        break  # EOF

outfile.close()
PS 1. If I were you I'd write each XML stream into a separate file.
PS 2. You can test whatever you do on the first MB of your file, till you get adequate performance.
Decompressing 833 MB should take about 30 seconds on a modern processor (e.g. a 2 GHz i7). So yes, you are doing something very wrong. Attempting to decompress at every byte offset to see if you get an error is part of the problem, though not all of it. There are better ways to find the compressed data. Ideally you should find or figure out the format. Alternatively, you can search for valid zlib headers using the RFC 1950 specification, though you may get false positives.
More significant may be that you are reading the entire 833 MB into memory at once, and decompressing the 3 GB to memory, possibly in large pieces each time. How much memory does your machine have? You may be thrashing to virtual memory.
If the code you show works, then the data is not zipped. zip is a specific file format, normally with the .zip extension, that encapsulates raw deflate data within a structure of local and central directory information intended to reconstruct a directory in a file system. You must have something rather different, since your code is looking for and apparently finding zlib streams. What is the format you have? Where did you get it? How is it documented? Can you provide a dump of, say, the first 100 bytes?
The way this should be done is not to read the whole thing into memory and decompress entire streams at once, also into memory. Instead, make use of the zlib.decompressobj interface, which allows you to provide a piece at a time and get the resulting available decompressed data. You can read the input file in much smaller pieces, find the compressed data streams by using the documented format or by looking for zlib (RFC 1950) headers, and then run those a chunk at a time through the decompression object, writing out the decompressed data where you want it. decomp.unused_data can be used to detect the end of the compressed stream (as in the example you found).
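A condensed sketch of that approach (written for Python 3, using the eof and unused_data attributes of the decompression object); it assumes the zlib streams sit back-to-back with nothing in between, so any per-stream header bytes would still need to be skipped as discussed below:

import zlib

CHUNK = 1 << 16

def decompress_concatenated(in_path, out_path):
    # Stream through concatenated raw zlib streams without loading the whole
    # file, starting a fresh decompressobj each time one stream ends.
    with open(in_path, 'rb') as fin, open(out_path, 'wb') as fout:
        decomp = zlib.decompressobj()
        while True:
            buf = decomp.unused_data or fin.read(CHUNK)
            if not buf:
                break  # no leftover bytes and no more input
            if decomp.eof:
                # previous stream is complete; feed the leftovers to a new object
                decomp = zlib.decompressobj()
            fout.write(decomp.decompress(buf))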
From what you've described in the comments, it sounds like they're concatenating together the individual files they would have sent you separately. Which means each one has a 32-byte header you need to skip.
If you don't skip those headers, it would probably have exactly the behavior you described: If you get lucky, you'll get 32 invalid-header errors and then successfully parse the next stream. If you get unlucky, the 32 bytes of garbage will look like the start of a real stream, and you'll waste a whole lot of time parsing some arbitrary number of bytes until you finally get a decoding error. (If you get really unlucky, it'll actually decode successfully, giving you a giant hunk of garbage and eating up one or more subsequent streams.)
So, try just skipping 32 bytes after each stream finishes.
Or, if you have a more reliable way of detecting the start of the next stream (this is why I told you to print out the offsets and look at the data in a hex editor, and why alexis told you to look at the zlib spec), do that instead.
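For illustration only, in the loop from the earlier answer that would roughly amount to dropping 32 bytes from the saved leftover before starting on the next stream (assuming the 32-byte figure from the comments is accurate):

# ...after one stream has been fully decompressed...
outfile.write(stream)
buf = decomp.unused_data[32:]  # drop the 32-byte vendor header before the next stream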