I'd like to pack records into a list of io.BytesIO objects using gzip. I want to set a max_size for each pack that I don't want to exceed. The problem is that I don't know whether a new record will push me over that size until it already has, and once it's gone over the limit I don't have a good way of undoing that addition.
import gzip
import io
from typing import Any, List

def pack_gz_records(records: List[Any], max_size: int) -> List[io.BytesIO]:
    packets = []
    mem_file = io.BytesIO()
    gz = gzip.GzipFile(fileobj=mem_file, mode="w")
    for record in records:
        if gz.size >= max_size:
            # Size exceeded limit. Add this mem file to the package and cut a new mem file
            gz.close()
            mem_file.seek(0)
            packets.append(mem_file)
            mem_file = io.BytesIO()
            gz = gzip.GzipFile(fileobj=mem_file, mode="w")
        gz.write(serialize(record))
    if gz.size:
        gz.close()
        mem_file.seek(0)
        packets.append(mem_file)
    return packets
Is there a way to undo a write, or "peek" a write in an efficient way without making a copy of all of the bytes for each record before writing?
Yes. Use the zlib library (instead of gzip). Create the compression object with wbits=31 to select the gzip format. The copy() function can make a copy of the compression object before adding the next record. After making a copy, add the next record to the original object and flush with Z_BLOCK. If the result, plus some margin for the gzip trailer, doesn't go over your limit, then delete the copy. If it does go over, then delete the object that went over, and go back and finish (flush with Z_FINISH) the compression on the copied object.
This assumes that your records are at least several K in size, so that compression is not impacted significantly by the flushing. If your records are small, you should compress several records before flushing. (Experiment with the number of records per flush to measure the compression impact.) If you'd like to get fancy, when you go over your limit and back up, you could follow that with a binary search to determine the number of records to just fill it up.
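A minimal sketch of that approach might look like the following (serialize() is the same helper as in your code; TRAILER_MARGIN is a rough allowance I'm assuming for the held-back bits, final block and gzip trailer, not an exact figure):

import io
import zlib
from typing import Any, List

TRAILER_MARGIN = 18  # assumed allowance for the final block and gzip trailer

def pack_gz_records(records: List[Any], max_size: int) -> List[io.BytesIO]:
    packets = []
    comp = zlib.compressobj(wbits=31)      # wbits=31 selects the gzip format
    buf = io.BytesIO()                     # compressed bytes of the current packet

    for record in records:
        data = serialize(record)           # serialize() as in the question
        checkpoint = comp.copy()           # compression state before this record
        pending = comp.compress(data) + comp.flush(zlib.Z_BLOCK)

        if buf.tell() and buf.tell() + len(pending) + TRAILER_MARGIN > max_size:
            # Went over: finish the packet from the checkpoint and start a new one.
            # (If the packet is empty, the record is written anyway, oversized or not.)
            buf.write(checkpoint.flush(zlib.Z_FINISH))
            buf.seek(0)
            packets.append(buf)
            buf = io.BytesIO()
            comp = zlib.compressobj(wbits=31)
            pending = comp.compress(data) + comp.flush(zlib.Z_BLOCK)

        buf.write(pending)

    if buf.tell():
        buf.write(comp.flush(zlib.Z_FINISH))
        buf.seek(0)
        packets.append(buf)
    return packets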
I have a 100 GB text file in a 7z archive. I can find a pattern 'hello' in it by reading it in 1 MB blocks (7z writes the data to stdout):
Popen("7z e -so archive.7z big100gb_file.txt", stdout=PIPE)
while True:
block = proc.stdout.read(1024*1024) # 1 MB block
i += 1
...
if b'hello' in block: # omitting other details for search pattern split in consecutive blocks...
print('pattern found in block %i' % i)
...
Now that we have found, after 5 minutes of searching, that the pattern 'hello' is in, say, the 23456th block, how can I access this block or line very quickly in the future inside the 7z file?
(if possible, without saving this data in another file/index)
With 7z, how can I seek to the middle of the file?
Note: I already read Indexing / random access to 7zip .7z archives and random seek in 7z single file archive but these questions don't discuss concrete implementation.
It is possible, in principle, to build an index to compressed data. You would pick, say, a block size of uncompressed data, where the start of each block would be an entry point at which you could start decompressing. The index would be a separate file, or a large structure in memory, that you would build, with the entire decompression state saved for each entry point. You would need to decompress all of the compressed data once to build the index. The choice of block size is a trade-off between how quickly you want to access any given byte in the compressed data and the size of the index.
There are several different compression methods that 7z can use (deflate, lzma2, bzip2, ppmd). What you would need to do to implement this sort of random access would be entirely different for each method.
Also for each method there are better places to pick entry points than some fixed uncompressed block size. Such choices would greatly reduce the size of the index, taking advantage of the internal structure of the compressed data used by that method.
For example, bzip2 has natural entry points with no history at each bzip2 block, by default each with 900 KiB of uncompressed data. This allows the index to be quite small with just the compressed and uncompressed offsets needing to be saved.
For deflate, the entry points can be deflate blocks, where the index is the compressed and uncompressed offset of selected deflate blocks, along with the 32K dictionary for each entry point. zran.c implements such an index for deflate compressed data.
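As a concrete illustration of this kind of index, here is a rough Python sketch in the spirit of zran.c, written against a gzip/deflate stream with zlib rather than a 7z archive. It keeps the compressed data and the saved decompressor states in memory, so it only demonstrates the concept:

import zlib

CHUNK = 1 << 16                                   # compressed bytes fed per step

def build_index(compressed, block_size=1 << 20):
    """One full pass; save a checkpoint roughly every block_size uncompressed bytes."""
    d = zlib.decompressobj(wbits=31)              # wbits=31 -> gzip wrapper
    index = [(0, 0, d.copy())]                    # (compressed offset, uncompressed offset, state)
    out_off = 0
    for in_off in range(0, len(compressed), CHUNK):
        out_off += len(d.decompress(compressed[in_off:in_off + CHUNK]))
        if out_off - index[-1][1] >= block_size:
            index.append((in_off + CHUNK, out_off, d.copy()))
    return index

def read_at(compressed, index, start, length):
    """Return `length` uncompressed bytes starting at uncompressed offset `start`."""
    in_off, out_off, state = max(e for e in index if e[1] <= start)
    d = state.copy()                              # copy so the saved state stays reusable
    out = bytearray()
    while len(out) < (start - out_off) + length and in_off < len(compressed):
        out += d.decompress(compressed[in_off:in_off + CHUNK])
        in_off += CHUNK
    return bytes(out[start - out_off:start - out_off + length])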
The decompression state at any point in an lzma2 or ppmd compressed stream is extremely large. I do not believe that such a random access approach could be practical for those compression methods. The compressed data formats would need to be modified to break it up into blocks at the time of compression, at some cost to the compression ratio.
As you can see in the code below, I'm having trouble adding new rows to a Table saved in a memory-mapped file.
I just want to write the file again with the new rows.
import pyarrow as pa

source = pa.memory_map(path, 'r')
table = pa.ipc.RecordBatchFileReader(source).read_all()
schema = pa.ipc.RecordBatchFileReader(source).schema
new_table = create_arrow_table(schema.names)  # new table from pydict with same schema and random new values
updated_table = pa.concat_tables([table, new_table], promote=True)
source.close()
with pa.MemoryMappedFile(path, 'w') as sink:
    with pa.RecordBatchFileWriter(sink, updated_table.schema) as writer:
        writer.write_table(table)
I get an exception stating that I/O was attempted on a closed file:
ValueError: I/O operation on closed file.
Any suggestion?
Your immediate issue is that you are using pa.MemoryMappedFile(path, 'w') instead of pa.memory_map(path, 'w'). The latter is defined as...
_check_is_file(path)
cdef MemoryMappedFile mmap = MemoryMappedFile()
mmap._open(path, mode)
return mmap
...so it should be pretty clear why it was closed.
The next issue you'll run into (assuming it isn't a copy/paste error into SO) is that you are writing table and not updated_table. Easily fixed.
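Putting those two fixes together, the write would look something like the sketch below, which still runs into the size problem described next:

with pa.memory_map(path, 'w') as sink:
    with pa.RecordBatchFileWriter(sink, updated_table.schema) as writer:
        writer.write_table(updated_table)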
The third issue is more problematic. Memory mapped files have a fixed size and cannot grow naturally in the same way that normal files do. If you try to write your updated table into the same file, you will see...
OSError: Write out of bounds (offset = ..., size = ...) in file of size ...
This problem is not so easily overcome. You could resize the memory map (sink.resize(...)) to some "big enough" size, but then you end up with a file padded with zeros at the end, so you would need to shrink it back down after you write, and I'm not sure that would perform any better than writing a regular file.
You could also write to a bytes object, then resize the file and write those bytes into the memory-mapped file, but that adds some extra bookkeeping, and I don't know the performance impact of resizing the file.
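A rough sketch of that second option might look like this (serialize the updated table to an in-memory buffer, then resize the map to the exact size before writing; this assumes resize behaves as described above):

import pyarrow as pa

# Serialize the updated table into an in-memory buffer first
buf = pa.BufferOutputStream()
with pa.RecordBatchFileWriter(buf, updated_table.schema) as writer:
    writer.write_table(updated_table)
data = buf.getvalue()                  # pyarrow Buffer holding the complete IPC file

# Resize the memory-mapped file to fit exactly, then write the bytes
with pa.memory_map(path, 'r+') as sink:
    sink.resize(data.size)
    sink.seek(0)
    sink.write(data)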
I'm reading data from a large text file (a VCF) into a zarr array. The overall flow of the code is
with zarr.LMDBStore(...) as store:
    array = zarr.create(..., chunks=(1000,1000), store=store, ...)
    for line_num, line in enumerate(text_file):
        array[line_num, :] = process_data(line)
I'm wondering - when does zarr compress the modified chunks of the array and push them to the underlying store (in this case LMDB)? Does it do that every time a chunk is updated (i.e. each line)? Or does it wait till a chunk is filled/evicted from memory before doing that? Assuming that I need to process each line separately in a for loop (that there aren't efficient array operations to use here due to the nature of the data and processing), is there any optimization I should do here with regards to how I feed the data into Zarr?
I just don't want Zarr to run compression on every modified chunk for every line, when each chunk will be modified 1000 times before it is complete and ready to be saved to disk.
Thanks!
Every time you execute this line:
array[line_num, :] = process_data(line)
...zarr will (1) figure out which chunks overlap the array region you want to write to, (2) retrieve those chunks from the store, (3) decompress the chunks, (4) modify the data, (5) compress the modified chunks, (6) write the modified compressed chunks to the store.
This will happen regardless of what type of underlying storage you are using.
If you have created an array with chunks that are more than one row tall, then this will likely be inefficient, resulting in each chunk being read, decompressed, updated, compressed and written many times.
A better strategy would be to parse your input file in blocks of N lines, where N is equal to the number of rows in each chunk of the output array, so that each chunk is only compressed and written once.
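A rough sketch of that strategy, assuming chunks 1000 rows tall as in your snippet (n_rows, n_cols, the dtype and the store path are placeholders; process_data and text_file are from your code):

import numpy as np
import zarr

rows_per_chunk = 1000                                      # matches chunks=(1000, 1000)
with zarr.LMDBStore('example.lmdb') as store:              # placeholder path
    array = zarr.create(shape=(n_rows, n_cols), chunks=(rows_per_chunk, 1000),
                        dtype='f8', store=store)           # placeholder dtype
    buffer = np.empty((rows_per_chunk, n_cols), dtype=array.dtype)
    filled = row_start = 0
    for line in text_file:
        buffer[filled, :] = process_data(line)
        filled += 1
        if filled == rows_per_chunk:
            # each chunk in this row band is compressed and written exactly once
            array[row_start:row_start + filled, :] = buffer
            row_start += filled
            filled = 0
    if filled:                                             # flush the final partial block
        array[row_start:row_start + filled, :] = buffer[:filled]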
If by VCF you mean Variant Call Format files, you might want to look at the vcf_to_zarr function implementation in scikit-allel.
I believe the LMDB store (as far as I can tell) will write/compress every time you assign.
You could aggregate your rows in an in-memory Zarr and then assign for each block.
There could be a "batch" option to the datasets but it has not been implemeted yet as far as I can tell.
Here is a simple example to illustrate my problem:
I have a large binary file with 10 million values.
I want to get 5K values from certain points in this file.
I have a list of indexes giving me the exact place in the file I have my value in.
To solve this I tried two methods:
Going through the values and simply using seek() (from the start of the file) to get each value, something like this:
binaryFile_new = open(binary_folder_path, "r+b")
for index in index_list:
    binaryFile_new.seek(size * index, 0)
    wanted_line = binaryFile_new.read(size)
    wanted_line_list.append(wanted_line)
binaryFile_new.close()
But as I understand this solution reads through from the beginning for each index, therefore the complexity is O(N**2) in terms of file size.
Sorting the indexes so I could go through the file "once" while seeking from the current position with something like that:
binaryFile_new = open(binary_folder_path, "r+b")
sorted_index_list = sorted(index_list)
for i, index in enumerate(sorted_index_list):
    if i == 0:
        binaryFile_new.seek(size * index, 0)
    else:
        binaryFile_new.seek((index - sorted_index_list[i-1]) * size - size, 1)
    wanted_line = binaryFile_new.read(size)
    wanted_line_list.append(wanted_line)
binaryFile_new.close()
I expected the second solution to be much faster because in theory it would go through the whole file once O(N).
But for some reason both solutions run the same.
I also have a hard constraint on memory usage, as I run this operation in parallel and on many files, so I can't read files into memory.
Maybe the mmap package will help? Though, I think mmap also scans the entire file until it gets to the index so it's not "true" random access.
I'd go with #1:
for index in index_list:
    binary_file.seek(size * index)
    # ...
(I cleaned up your code a bit to comply with Python naming conventions and to avoid using a magic 0 constant, as SEEK_SET is default anyway.)
as I understand this solution reads through from the beginning for each index, therefore the complexity is O(N**2) in terms of file size.
No, a seek() does not "read through from the beginning", that would defeat the point of seeking. Seeking to the beginning of file and to the end of file have roughly the same cost.
Sorting the indexes so I could go through the file "once" while seeking from the current position
I can't quickly find a reference for this, but I believe there's absolutely no point in calculating the relative offset in order to use SEEK_CUR instead of SEEK_SET.
There might be a small improvement just from seeking to the positions you need in order instead of randomly, as there's an increased chance your random reads will be serviced from cache, in case many of the points you need to read happen to be close to each other (and so your read patterns trigger read-ahead in the file system).
Maybe the mmap package will help? Though, I think mmap also scans the entire file until it gets to the index so it's not "true" random access.
mmap doesn't scan the file. It sets up a region in your program's virtual memory to correspond to the file, so that accessing any page from this region the first time leads to a page fault, during which the OS reads that page (several KB) from the file (assuming it's not in the page cache) before letting your program proceed.
The internet is full of discussions of relative merits of read vs mmap, but I recommend you don't bother with trying to optimize by using mmap and use this time to learn about the virtual memory and the page cache.
[edit] reading in chunks larger than the size of your values might save you a bit of CPU time in case many of the values you need to read are in the same chunk (which is not a given) - but unless your program is CPU bound in production, I wouldn't bother with that either.
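If you do want to experiment with that, a sketch could group sorted indices that fall into the same larger chunk and serve them from a single read() (binary_folder_path, index_list and size are from your question; CHUNK_VALUES is an arbitrary tuning knob, not something prescribed):

CHUNK_VALUES = 4096                                  # values per read; tune for your pattern

wanted = {}
with open(binary_folder_path, "rb") as f:
    current_chunk_no, chunk = None, b""
    for index in sorted(index_list):
        chunk_no = index // CHUNK_VALUES
        if chunk_no != current_chunk_no:             # crossed into a new chunk: one read
            current_chunk_no = chunk_no
            f.seek(chunk_no * CHUNK_VALUES * size)
            chunk = f.read(CHUNK_VALUES * size)
        offset = (index % CHUNK_VALUES) * size
        wanted[index] = chunk[offset:offset + size]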
I have a file with integers stored as binary and I'm trying to extract values at specific locations. It's one big serialized integer array for which I need values at specific indexes. I've created the following code, but it's terribly slow compared to the F# version I created before.
import os, struct

def read_values(filename, indices):
    # indices are sorted and unique
    values = []
    with open(filename, 'rb') as f:
        for index in indices:
            f.seek(index * 4, os.SEEK_SET)
            b = f.read(4)
            v = struct.unpack("@i", b)[0]
            values.append(v)
    return values
For comparison here is the F# version:
open System
open System.IO

let readValue (reader:BinaryReader) cellIndex =
    // set stream to correct location
    reader.BaseStream.Position <- cellIndex*4L
    match reader.ReadInt32() with
    | Int32.MinValue -> None
    | v -> Some(v)

let readValues fileName indices =
    use reader = new BinaryReader(File.Open(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
    // Use list or array to force creation of values (otherwise reader gets disposed before the values are read)
    let values = List.map (readValue reader) (List.ofSeq indices)
    values
Any tips on how to improve the performance of the Python version, e.g. by using numpy?
Update
HDF5 works very well (from 5 seconds down to 0.8 seconds on my test file):
import tables

def read_values_hdf5(filename, indices):
    with tables.open_file(filename) as f:
        dset = f.root.raster
        return dset[indices]
Update 2
I went with np.memmap because the performance is similar to HDF5 and I already have numpy in production.
Depending heavily on your index file size, you might want to read it completely into a numpy array. If the file is not large, a complete sequential read may be faster than a large number of seeks.
One problem with the seek operations is that Python operates on buffered input. If the program were written in some lower-level language, using unbuffered I/O would be a good idea, as you only need a few values.
import numpy as np
# read the complete index into memory
index_array = np.fromfile("my_index", dtype=np.uint32)
# look up the indices you need (indices being a list of indices)
return index_array[indices]
If you would anyway read almost all pages (i.e. your indices are random and at a frequency of 1/1000 or more), this is probably faster. On the other hand, if you have a large index file, and you only want to pick a few indices, this is not so fast.
Then one more possibility - which might be the fastest - is to use the python mmap module. Then the file is memory-mapped, and only the pages really required are accessed.
It should be something like this:
import mmap
import struct

with open("my_index", "rb") as f:
    memory_map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for i in indices:
        # the index at position i:
        idx_value = struct.unpack('I', memory_map[4*i:4*i+4])[0]
(Note, I did not actually test that one, so there may be typing errors. Also, I did not care about endianess, so please check it is correct.)
Happily, these can be combined by using numpy.memmap. It should keep your array on disk but give you numpyish indexing. It should be as easy as:
import numpy as np
index_arr = np.memmap(filename, dtype='uint32', mode='r')
return index_arr[indices]
I think this should be the easiest and fastest alternative. However, if "fast" is important, please test and profile.
EDIT: As the mmap solution seems to gain some popularity, I'll add a few words about memory mapped files.
What is mmap?
Memory mapped files are not something uniquely pythonic, because memory mapping is something defined in the POSIX standard. Memory mapping is a way to use devices or files as if they were just areas in memory.
File memory mapping is a very efficient way to randomly access fixed-length data files. It uses the same technology as is used with virtual memory. The reads and writes are ordinary memory operations. If they point to a memory location which is not in the physical RAM memory ("page fault" occurs), the required file block (page) is read into memory.
The delay in random file access is mostly due to the physical rotation of the disks (SSDs are another story). On average, the block you need is half a rotation away; for a typical HDD this delay is approximately 5 ms plus any data handling delay. The overhead introduced by using Python instead of a compiled language is negligible compared to this delay.
If the file is read sequentially, the operating system usually uses a read-ahead cache to buffer the file before you even know you need it. For a randomly accessed big file this does not help at all. Memory mapping provides a very efficient way, because all blocks are loaded exactly when you need and remain in the cache for further use. (This could in principle happen with fseek, as well, because it might use the same technology behind the scenes. However, there is no guarantee, and there is anyway some overhead as the call wanders through the operating system.)
mmap can also be used to write files. It is very flexible in the sense that a single memory mapped file can be shared by several processes. This may be very useful and efficient in some situations, and mmap can also be used in inter-process communication. In that case usually no file is specified for mmap, instead the memory map is created with no file behind it.
mmap is not very well-known despite its usefulness and relative ease of use. It has, however, one important 'gotcha'. The file size has to remain constant. If it changes during mmap, odd things may happen.
Is the indices list sorted? I think you could get better performance if the list were sorted, as you would make far fewer disk seeks.