Out-of-core custom binary file processing - python

The problem at hand is a batch of large (500GB-1TB) binary files representing continuous time series of data, each of which has a header and a footer. I've seen plenty of solutions, like Dask, Python's core mmap, NumPy's memmap, and the like, but none of them seem to deal with the footer problem -- at least not that I have noticed. Maybe there is some trivial thing like a "file-size mask" that lets me "ghost" the last N bytes of each file.
My file access is restricted to read-only, and I have all the files available locally.
What I currently have is an io.open(filename, 'rb', buffering=0) loop that iterates over a list of files, manually moves the pointer after each read, and keeps track of the pointer's position relative to the file. When it reaches the point where the next read would run into the footer, i.e. into the next file, there is a really ugly bit that splits the chunk into two smaller, often asymmetric, chunks: the first is read, then the file switch occurs, and then the second is read.
I feel like I am reinventing the wheel here, and I would greatly appreciate any brainstorm-level suggestions. I'll gladly work out the technical details on my own once I know which direction to head.
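For what it's worth, numpy.memmap's offset and shape arguments can already be combined to "ghost" both the header and the last N bytes of each file. A minimal sketch, assuming the header/footer sizes and the on-disk sample dtype are known (the constants below are placeholders):

import os
import numpy as np

HEADER_SIZE = 4096          # assumption: real header length goes here
FOOTER_SIZE = 1024          # assumption: real footer length goes here
DTYPE = np.dtype('<f8')     # assumption: on-disk sample format

def map_payload(filename):
    """Memory-map only the data region of one file, excluding header and footer."""
    n_bytes = os.path.getsize(filename) - HEADER_SIZE - FOOTER_SIZE
    n_items = n_bytes // DTYPE.itemsize
    return np.memmap(filename, dtype=DTYPE, mode='r',
                     offset=HEADER_SIZE, shape=(n_items,))

def iter_chunks(filenames, chunk_items=8 * 1024 * 1024):
    """Stream fixed-size chunks across files; a chunk never spills into a footer."""
    for name in filenames:
        payload = map_payload(name)
        for start in range(0, payload.shape[0], chunk_items):
            yield payload[start:start + chunk_items]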

Related

gzip.open() look-forward rolling list when reading file line-by-line

Looking ahead is necessary to check whether the current line's data "makes sense" in context, or whether it should be omitted. Reading line by line is necessary because the files are sometimes 20GB of uncompressed data.
If I were reading entire files into memory, I would have used an index-based for-loop to look ahead. Then I thought it might be easier to read the file in reverse, but I presume that's unfeasible due to the need for seeking and the nature of gzip.
So the idea now is to keep a rolling list of the next X lines, filled from the front, while my loop consumes lines from the back and has forward-looking access to the next (X - 1) lines. If this is a reasonable solution, is there a name or an optimized recipe for it? Or is there a better solution?
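A minimal sketch of such a rolling look-ahead buffer over a gzip'd file, using collections.deque (the look-ahead depth, line_makes_sense and process are placeholders):

import gzip
from collections import deque
from itertools import islice

def windowed_lines(path, lookahead=5):
    """Yield (line, next_lines), where next_lines are up to `lookahead` lines ahead."""
    with gzip.open(path, 'rt') as f:
        window = deque(islice(f, lookahead + 1), maxlen=lookahead + 1)
        while window:
            current = window.popleft()
            yield current, list(window)          # forward-looking context
            nxt = f.readline()
            if nxt:
                window.append(nxt)

for line, upcoming in windowed_lines('big_file.gz'):
    if line_makes_sense(line, upcoming):         # hypothetical per-line check
        process(line)                            # hypothetical consumer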

Save periodically gathered data with python

I receive data periodically (every 15 minutes) and hold it in an array (a numpy array, to be precise) in Python; it has roughly 50 columns, and the number of rows varies, usually somewhere around 100-200.
Before, I only analyzed this data and tossed it, but now I'd like to start saving it, so that I can create statistics later.
I have considered saving it to a csv file, but it did not seem right to save large numbers of such big 2D arrays to csv.
I've looked at serialization options, particularly pickle and numpy's .tobytes(), but in both cases I run into an issue: I have to track the number of arrays stored. I've seen people write that count as the first thing in the file, but I don't know how I would keep incrementing it while the file is still open (the program that gathers the data runs practically non-stop). Constantly opening the file, reading the number, rewriting it, seeking to the end to write new data, and closing the file again doesn't seem very efficient.
I feel like I'm missing some vital information and have not been able to find it. I'd love it if someone could show me something I can not see and help me solve the problem.
Saving to a csv file might not be a good idea in this case; think about the accessibility and availability of your data. Using a database would be better: you can easily update your data and control the amount of data you store.
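For example, a minimal sketch of the database suggestion using the standard-library sqlite3 module, storing each 15-minute array as one row (file name, table layout and dtype are assumptions):

import sqlite3
import numpy as np

conn = sqlite3.connect('measurements.db')
conn.execute('CREATE TABLE IF NOT EXISTS batches'
             ' (ts TEXT, rows INTEGER, cols INTEGER, payload BLOB)')

def save_batch(arr, timestamp):
    # One row per gathered array; the shape is stored so it can be rebuilt later.
    conn.execute('INSERT INTO batches VALUES (?, ?, ?, ?)',
                 (timestamp, arr.shape[0], arr.shape[1],
                  arr.astype(np.float64).tobytes()))
    conn.commit()      # each batch is durable immediately; no counter to maintain

def load_batches():
    for ts, rows, cols, blob in conn.execute('SELECT * FROM batches'):
        yield ts, np.frombuffer(blob, dtype=np.float64).reshape(rows, cols)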

modify and write large file in python

Say I have a 5GB data file on disk and I want to append another 100MB of data at the end of the file -- just append, I don't want to modify or move the original data in the file. I know I can read the whole file into memory as a long list and append my small new data to it, but that's too slow. How can I do this more efficiently?
I mean, without reading the whole file into memory?
I have a script that generates a large stream of data, say 5GB, as a long list, and I need to save this data to a file. I tried generating the whole list first and then writing it all out at once, but as the list grew, the computer slowed down very severely. So I decided to write it out in several passes: each time I accumulate a 100MB list, I write it out and clear the list. (This is why I have the first problem.)
I have no idea how to do this. Is there any lib or function that can do it?
Let's start with the second point: if the list you keep in memory is larger than the available RAM, the computer starts using the hard drive as RAM (swapping), and this slows everything down severely. The optimal way to write out data in your situation is to fill the RAM as much as you safely can (always keeping enough space for the rest of the software running on your PC) and then write it to the file in one go.
The fastest way to store a list in a file would be using pickle, so that you store binary data, which takes much less space than formatted text (and makes the read/write process much faster too).
When you write to a file, you should keep the file open the whole time, using something like with open('namefile', 'w') as f. That way you save the time spent opening/closing the file, and the cursor is always at the end. If you decide to do that, call f.flush() after each write to avoid losing data if something bad happens. Append mode ('a') is a good alternative anyway.
If you provide some code it would be easier to help you...
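In the meantime, a minimal sketch of the append-mode approach: the file is opened in binary append mode, so nothing existing is read or moved (new_chunk and record_iter are hypothetical stand-ins for your generated data):

# Append a new block to an existing large file without touching its contents.
with open('big_data.bin', 'ab') as f:      # 'ab' positions every write at the end
    f.write(new_chunk)                     # new_chunk: a bytes object, e.g. ~100MB
    f.flush()

def write_in_batches(path, record_iter, batch_bytes=100 * 1024 * 1024):
    """Write records as they are generated, flushing roughly every 100MB."""
    buffered = 0
    with open(path, 'ab') as f:
        for rec in record_iter:            # rec is a bytes object
            f.write(rec)
            buffered += len(rec)
            if buffered >= batch_bytes:
                f.flush()                  # guard against losing data on a crash
                buffered = 0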

How to read a big (3-4GB) file that doesn't have newlines into a numpy array?

I have a 3.3 GB file containing one long line. The values in the file are comma-separated and are either floats or ints. Most of the values are 10. I want to read the data into a numpy array. Currently, I'm using numpy.fromfile:
>>> import numpy
>>> f = open('distance_matrix.tmp')
>>> distance_matrix = numpy.fromfile(f, sep=',')
but that has been running for over an hour now and it's currently using ~1 Gig memory, so I don't think it's halfway yet.
Is there a faster way to read in large data that is on a single line?
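One direction, sketched below, is to parse the single line in fixed-size byte chunks, carrying any value that gets cut off at a chunk boundary over to the next chunk (the chunk size is arbitrary):

import numpy as np

def iter_value_blocks(path, chunk_bytes=64 * 1024 * 1024):
    """Yield numpy blocks parsed from one huge comma-separated line."""
    leftover = b''
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            chunk = leftover + chunk
            cut = chunk.rfind(b',')              # keep the possibly truncated last value
            if cut == -1:
                leftover = chunk
                continue
            leftover = chunk[cut + 1:]
            yield np.array(chunk[:cut].decode('ascii').split(','), dtype=float)
    if leftover.strip():
        yield np.array([float(leftover)])

distance_matrix = np.concatenate(list(iter_value_blocks('distance_matrix.tmp')))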
This should probably be a comment... but I don't have enough reputation to put comments in.
I've used HDF5 files, via h5py, at sizes well over 200 GB with very little processing time -- on the order of a minute or two for file accesses. In addition, the HDF5 libraries support MPI and concurrent access.
This means that, assuming you can convert your original one-line file into an appropriately hierarchical HDF5 file (e.g. make a group for every 'large' segment of data), you can use the built-in capabilities of HDF5 to process your data on multiple cores, using MPI to pass whatever data you need between them.
You need to be careful with your code and understand how mpi works in conjunction with hdf, but it'll speed things up no end.
Of course all of this depends on putting the data into an hdf file in a way that allows you to take advantage of mpi... so maybe not the most practical suggestion.
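As a concrete illustration of getting the data into HDF5 in the first place, here is a minimal h5py sketch that streams blocks of parsed values into a resizable dataset, so the full array never has to sit in memory at once (iter_value_blocks is a hypothetical chunked parser, e.g. the one sketched under the question above; file and dataset names are placeholders):

import h5py

with h5py.File('distance_matrix.h5', 'w') as h5:
    dset = h5.create_dataset('values', shape=(0,), maxshape=(None,),
                             dtype='f8', chunks=(1000000,))
    for block in iter_value_blocks('distance_matrix.tmp'):
        n = dset.shape[0]
        dset.resize((n + block.size,))     # grow the dataset in place
        dset[n:] = block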
Consider dumping the data using some binary format. See something like http://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html
This way it will be much faster because you don't need to parse the values.
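For example, a round trip through numpy's own binary format looks like this (file names are placeholders):

import numpy as np

arr = np.arange(10, dtype=float)                        # stand-in for the parsed data
np.save('distance_matrix.npy', arr)                     # binary dump, nothing to parse later
loaded = np.load('distance_matrix.npy', mmap_mode='r')  # reload, optionally memory-mapped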
If you can't change the file format (it isn't produced by one of your own programs), then there's not much you can do about it. Make sure your machine has lots of RAM (at least 8GB) so that it doesn't need to touch swap at all. Defragmenting the hard drive might help as well, or using an SSD.
An intermediate solution might be a C++ binary to do the parsing and then dump it in a binary format. I don't have any links for examples on this one.

Any efficient way to read data from a large binary file?

I need to handle tens of gigabytes of data in one binary file. Each record in the data file is variable length.
So the file is like:
<len1><data1><len2><data2>..........<lenN><dataN>
The data contains integers, pointers, double values and so on.
I found Python can't even handle this situation well. There is no problem if I read the whole file into memory; that's fast. But it seems the struct package is not good performance-wise: it almost gets stuck unpacking the bytes.
Any help is appreciated.
Thanks.
struct and array, which other answers recommend, are fine for the details of the implementation, and might be all you need if your needs are always to sequentially read all of the file or a prefix of it. Other options include buffer, mmap, even ctypes, depending on many details you don't mention regarding your exact needs. Maybe a little specialized Cython-coded helper can offer all the extra performance you need, if no suitable and accessible library (in C, C++, Fortran, ...) already exists that can be interfaced for the purpose of handling this humongous file as you need to.
But clearly there are peculiar issues here -- how can a data file contain pointers, for example, which are intrinsically a concept related to addressing memory? Are they maybe "offsets" instead, and, if so, how exactly are they based and coded? Are your needs at all more advanced than simply sequential reading (e.g., random access), and if so, can you do a first "indexing" pass to get all the offsets from start of file to start of record into a more usable, compact, handily-formatted auxiliary file? (That binary file of offsets would be a natural for array -- unless the offsets need to be longer than array supports on your machine!). What is the distribution of record lengths and compositions and number of records to make up the "tens of gigabytes"? Etc, etc.
You have a very large scale problem (and no doubt very large scale hardware to support it, since you mention that you can easily read all of the file into memory that means a 64bit box with many tens of GB of RAM -- wow!), so it's well worth the detailed care to optimize the handling thereof -- but we can't help much with such detailed care unless we know enough detail to do so!-).
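To make the sequential-reading and indexing ideas concrete, here is a minimal sketch that assumes each record is prefixed with a 4-byte little-endian length (adjust LEN_FMT to the real on-disk format):

import struct
from array import array

LEN_FMT = '<I'                          # assumption: 4-byte little-endian length prefix
LEN_SIZE = struct.calcsize(LEN_FMT)

def iter_records(path):
    """Sequentially yield the raw payload bytes of each <len><data> record."""
    with open(path, 'rb') as f:
        while True:
            prefix = f.read(LEN_SIZE)
            if len(prefix) < LEN_SIZE:
                break
            (length,) = struct.unpack(LEN_FMT, prefix)
            yield f.read(length)

def build_offset_index(path):
    """One indexing pass: record the start offset of every record for later random access."""
    offsets = array('q')                # 64-bit signed offsets
    with open(path, 'rb') as f:
        while True:
            pos = f.tell()
            prefix = f.read(LEN_SIZE)
            if len(prefix) < LEN_SIZE:
                break
            offsets.append(pos)
            (length,) = struct.unpack(LEN_FMT, prefix)
            f.seek(length, 1)           # skip the payload
    return offsets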
Have a look at the array module, specifically the array.fromfile method. This bit:
Each record in the data file is variable length.
is rather unfortunate, but you could handle it with a try-except clause.
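For example, the fromfile/try-except pattern looks like this, assuming (purely for the sake of the pattern) a stream of fixed-size doubles:

from array import array

def read_doubles(path, batch=1000000):
    """Read doubles in batches; EOFError signals the short final batch."""
    values = array('d')
    with open(path, 'rb') as f:
        while True:
            try:
                values.fromfile(f, batch)
            except EOFError:
                break      # items read before EOF are already in the array
    return values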
For a similar task, I defined a ctypes Structure like this:
from ctypes import Structure, c_uint32, sizeof, memmove, addressof
class foo(Structure):
    _fields_ = [("myint", c_uint32)]
created an instance
bar = foo()
and did,
block = f.read(sizeof(bar))                   # f is the file, opened in binary mode
memmove(addressof(bar), block, sizeof(bar))
In the case of variable-size records, you can use a similar method to retrieve lenN and then read the corresponding data entry. It seems trivial to implement. However, I have no idea how fast this method is compared to using pack() and unpack(); perhaps someone else has profiled both methods.
For help with parsing the file without reading it into memory you can use the bitstring module.
Internally this is using the struct module and a bytearray, but an immutable Bits object can be initialised with a filename so it won't read it all into memory.
For example:
from bitstring import Bits
s = Bits(filename='your_file')
while s.bytepos < s.length // 8:      # s.length is in bits, s.bytepos is in bytes
    # Read a byte and interpret it as an unsigned integer
    length = s.read('uint:8')
    # Read 'length' bytes and convert to a Python string
    data = s.read(length * 8).bytes
    # Now do whatever you want with the data
Of course you can parse the data however you want.
You can also use slice notation to read the file contents, although note that the indices will be in bits rather than bytes so for example s[-800:] would be the final 100 bytes.
What if you dump the data file into an in-memory sqlite3 database?
import sqlite3
conn = sqlite3.connect(":memory:")
You can then use SQL to process the data.
Besides, you might want to look at generators and iterators.
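A minimal sketch of that idea, filling the in-memory database from a sequential reader (iter_records is a hypothetical reader, such as the one sketched earlier in this thread):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE records (id INTEGER PRIMARY KEY, payload BLOB)')
conn.executemany('INSERT INTO records (payload) VALUES (?)',
                 ((rec,) for rec in iter_records('data.bin')))
conn.commit()
print(conn.execute('SELECT COUNT(*) FROM records').fetchone()[0])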
PyTables is a very good library for handling HDF5, a binary format used in astronomy and meteorology to manage very big datasets. It works more or less like a hierarchical database, in which you can store multiple tables, each with typed columns, organized into groups. Have a look at it.
