Efficiently processing large binary files in Python

I'm currently reading binary files that are about 150,000 KB each. They contain roughly 3,000 structured binary messages, and I'm trying to figure out the quickest way to process them. Out of each message, I only need to read about 30 lines of data. The messages have headers that let me jump to specific portions of a message and find the data I need.
I'm trying to figure out whether it's more efficient to unpack the entire message (50 KB each) and pull my data from the resulting tuple, which includes a lot of data I don't actually need, or whether it would cost less to seek to each of the 30 lines of data I need in every message and unpack each of those individually. Alternatively, is this something better suited to mmap?

Seeking, possibly several times, within just 50 kB is probably not worthwhile: system calls are expensive. Instead, read each message into a single bytes object and use slicing to “seek” to the offsets you need and grab the right amount of data.
It may be beneficial to wrap the bytes in a memoryview to avoid copying when you slice, but for small individual reads it probably doesn’t matter much. If a memoryview works for you, definitely try mmap, which exposes a similar interface over the whole file. If you’re using struct, its unpack_from can already read at an offset within a bytes or an mmap without any wrapping or copying.
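For illustration, a minimal sketch of the read-plus-unpack_from approach (the message size, offsets and field format below are invented; substitute your real layout):

import struct

MESSAGE_SIZE = 50 * 1024          # assumed: 50 KB per message
FIELD_OFFSETS = [16, 128, 4096]   # hypothetical offsets of the fields you need
FIELD_FORMAT = '<2f'              # hypothetical: two little-endian floats per field

with open('messages.bin', 'rb') as f:
    while True:
        message = f.read(MESSAGE_SIZE)    # one read() per message, no seeking inside it
        if len(message) < MESSAGE_SIZE:
            break
        for offset in FIELD_OFFSETS:
            values = struct.unpack_from(FIELD_FORMAT, message, offset)
            # ... use values ...

And with mmap the whole file behaves like one big bytes-like object, so unpack_from can address any message directly without per-message reads:

import mmap, struct

with open('messages.bin', 'rb') as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    # e.g. a field of the fourth message, at a hypothetical offset
    values = struct.unpack_from('<2f', mm, 3 * 50 * 1024 + 128)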

Related

Is there a Pandas feature for streaming to / from a large binary source fast, instead of CSV or JSON? Or is there another tool for it?

JSON isn't necessarily an efficient structure for storing data, in terms of both byte overhead and parsing cost: it has to be parsed syntactically rather than letting you look up a specific segment directly. Say you have 20 years of timestep data, ~1 TB compressed, and you want to store it efficiently and load / save it as fast as possible for maximum simulation speed.
At first I tried relational databases, but those are actually not that fast: they're designed to be loaded over a network rather than locally, and the OSI model adds overhead.
I was able to speed this up by creating a custom binary data structure with defined block sizes and header indexes, sort of like a file system, but it was time-consuming and highly specialized for a single type of data (fixed-length data nodes, for example). Editing the data wasn't a feature; it was a one-time export that took days to run. I'm sure some library could do it better.
I learned about Pandas, but it seems to load from / save to CSV and JSON most commonly, and both of those are plain text, so storing an int takes the space of multiple characters rather than letting me choose, say, a 32-bit unsigned int.
What's the right tool? Can Pandas do this, or is there something better?
I need to be able to specify the data type for each property being stored, so if I only need a 16-bit int, that's the space that gets used.
I need to be able to stream reads / writes of big (1-10 TB) data as fast as the hardware fundamentally allows.
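As one concrete illustration of that kind of fixed-width, typed binary storage (a sketch only, with invented field names): NumPy structured arrays give exact per-field dtypes, and numpy.memmap streams them from disk without loading everything.

import numpy as np
import pandas as pd

# Hypothetical record layout: exact widths, no plain-text overhead
record = np.dtype([('timestamp', np.uint32),
                   ('sensor_id', np.uint16),
                   ('value', np.float32)])

# One-off export: dump a chunk of records as raw fixed-width binary
chunk = np.zeros(1000, dtype=record)
chunk.tofile('timesteps.bin')

# Later: memory-map the file; only the slices you touch are read from disk
mm = np.memmap('timesteps.bin', dtype=record, mode='r')
df = pd.DataFrame(np.asarray(mm[:100]))   # wrap a slice in Pandas if you want its tools

Pandas itself also has binary, dtype-preserving I/O (DataFrame.to_hdf / read_hdf and to_parquet / read_parquet), which may be closer to what you're after than CSV or JSON.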

Save periodically gathered data with python

I periodically receive data (every 15 minutes) and have it in an array (a numpy array, to be precise) in Python; it has roughly 50 columns and a varying number of rows, usually somewhere around 100-200.
Before, I only analyzed this data and tossed it, but now I'd like to start saving it, so that I can create statistics later.
I have considered saving it in a CSV file, but it did not seem right to keep dumping so many of these sizeable 2D arrays into a CSV file.
I've looked at serialization options, particularly pickle and numpy's .tobytes(), but in both cases I run into an issue: I have to track the number of arrays stored. I've seen people write that number at the start of the file, but I don't know how I would keep incrementing it while keeping the file open (the program that gathers the data runs practically non-stop). Constantly opening the file, reading the number, rewriting it, seeking to the end to write new data and closing the file again doesn't seem very efficient.
I feel like I'm missing some vital information and have not been able to find it. I'd love it if someone could point out what I cannot see and help me solve the problem.
Saving to a CSV file might not be a good idea in this case; think about the accessibility and availability of your data. Using a database would be better: you can easily update your data and control how much of it you store.
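For instance, a rough sketch with the standard-library sqlite3 module (table and column names are invented; with ~50 columns you'd generate the schema rather than type it out):

import sqlite3
import numpy as np

conn = sqlite3.connect('measurements.db')
n_cols = 50
cols = ', '.join('c%d REAL' % i for i in range(n_cols))
conn.execute('CREATE TABLE IF NOT EXISTS samples (batch INTEGER, %s)' % cols)

def save_batch(batch_id, arr):
    # arr is the (rows x 50) numpy array gathered in the last 15 minutes
    placeholders = ', '.join(['?'] * (n_cols + 1))
    rows = [(batch_id,) + tuple(map(float, row)) for row in arr]
    conn.executemany('INSERT INTO samples VALUES (%s)' % placeholders, rows)
    conn.commit()

save_batch(0, np.random.rand(150, 50))

No running counter is needed: SELECT COUNT(*) FROM samples, or a GROUP BY batch, recovers it later.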

modify and write large file in python

Say I have a 5 GB data file on disk, and I want to append another 100 MB of data at the end of the file -- just append, I don't want to modify or move the original data in the file. I know I can read the whole file into memory as one long list and append my small new data to it, but that's too slow. How can I do this more efficiently?
I mean, without reading the whole file into memory?
I have a script that generates a large stream of data, say 5 GB, as one long list, and I need to save that data to a file. I tried generating the whole list first and then writing it out all at once, but as the list grew, the computer slowed down severely. So I decided to write it out in several passes: each time the list reaches about 100 MB, I write it out and clear the list (this is why I have the first problem).
I have no idea how to do this. Is there any library or function that can do it?
Let's start from the second point: if the list you keep in memory grows larger than the available RAM, the computer starts using the disk as RAM (swapping) and this slows everything down severely. The optimal way to write in your situation is to fill the RAM as much as possible (always keeping enough space for the rest of the software running on your PC) and then write it to a file all at once.
The fastest way to store a list in a file is with pickle, so that you store binary data, which takes much less space than formatted text (and the read/write process is much faster as well).
When you write to the file, keep it open the whole time, using something like with open('namefile', 'wb') as f. That way you save the time spent opening/closing the file, and the cursor will always be at the end. If you do that, call f.flush() after each write to avoid losing data if something bad happens. Opening the file in append mode is a good alternative anyway.
If you provide some code it would be easier to help you...
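In the meantime, here is a minimal sketch of the chunked-append pattern described above (pickle in binary append mode, flushing after each chunk; the generator is just a stand-in):

import pickle

def generate_chunks():
    # stand-in for your real generator: yields lists of roughly 100 MB each
    for i in range(5):
        yield list(range(1_000_000))

with open('output.bin', 'ab') as f:      # 'ab' = append, binary; existing data is never touched
    for chunk in generate_chunks():
        pickle.dump(chunk, f)
        f.flush()                        # limits data loss if something bad happens

Reading it back is then a loop of pickle.load(f) until EOFError is raised.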

How to read a big (3-4GB) file that doesn't have newlines into a numpy array?

I have a 3.3 GB file containing one long line. The values in the file are comma-separated and are either floats or ints. Most of the values are 10. I want to read the data into a numpy array. Currently, I'm using numpy.fromfile:
>>> import numpy
>>> f = open('distance_matrix.tmp')
>>> distance_matrix = numpy.fromfile(f, sep=',')
but that has been running for over an hour now and it's currently using ~1 GB of memory, so I don't think it's even halfway yet.
Is there a faster way to read in large data that is on a single line?
This should probably be a comment... but I don't have enough reputation to put comments in.
I've used HDF files, via h5py, of sizes well over 200 GB with very little processing time, on the order of a minute or two for file accesses. In addition, the HDF libraries support MPI and concurrent access.
This means that, assuming you can reformat your original one-line file as an appropriately hierarchical HDF file (e.g. make a group for every 'large' segment of data), you can use the built-in capabilities of HDF to process your data on multiple cores, using MPI to pass whatever data you need between them.
You need to be careful with your code and understand how MPI works in conjunction with HDF, but it'll speed things up no end.
Of course all of this depends on putting the data into an HDF file in a way that allows you to take advantage of MPI... so maybe not the most practical suggestion.
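As a rough illustration of the h5py side of this (leaving MPI aside; file, group and dataset names are made up):

import h5py
import numpy as np

# Write: one group per 'large' segment of the original data
with h5py.File('distances.h5', 'w') as f:
    seg = f.create_group('segment_000')
    seg.create_dataset('distances', data=np.random.rand(1_000_000))

# Read: only the requested slice is pulled off disk
with h5py.File('distances.h5', 'r') as f:
    first_block = f['segment_000/distances'][:10_000]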
Consider dumping the data using some binary format. See something like http://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html
This way it will be much faster because you don't need to parse the values.
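For example, once the values have been parsed a single time, something like this avoids re-parsing the text on every run:

import numpy as np

# One-time conversion from the text file (still slow, but only done once)
distance_matrix = np.fromfile('distance_matrix.tmp', sep=',')
np.save('distance_matrix.npy', distance_matrix)

# Every later run: fast binary load; mmap_mode avoids pulling it all into RAM
distance_matrix = np.load('distance_matrix.npy', mmap_mode='r')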
If you can't change the file type (it's not the result of one of your programs), then there's not much you can do about it. Make sure your machine has lots of RAM (at least 8 GB) so that it doesn't need to use swap at all. Defragmenting the hard drive might help as well, or using an SSD.
An intermediate solution might be a C++ binary to do the parsing and then dump it in a binary format. I don't have any links for examples on this one.

Any efficient way to read data from a large binary file?

I need to handle tens of gigabytes of data in one binary file. Each record in the data file is variable length.
So the file is like:
<len1><data1><len2><data2>..........<lenN><dataN>
The data contains integers, pointers, double values and so on.
I found Python can hardly handle this situation. There is no problem if I read the whole file into memory; that part is fast. But the struct package seems to perform poorly: it gets almost stuck unpacking the bytes.
Any help is appreciated.
Thanks.
struct and array, which other answers recommend, are fine for the details of the implementation, and might be all you need if your needs are always to sequentially read all of the file or a prefix of it. Other options include buffer, mmap, even ctypes, depending on many details you don't mention regarding your exact needs. Maybe a little specialized Cython-coded helper can offer all the extra performance you need, if no suitable and accessible library (in C, C++, Fortran, ...) already exists that can be interfaced for the purpose of handling this humongous file as you need to.
But clearly there are peculiar issues here -- how can a data file contain pointers, for example, which are intrinsically a concept related to addressing memory? Are they maybe "offsets" instead, and, if so, how exactly are they based and coded? Are your needs at all more advanced than simply sequential reading (e.g., random access), and if so, can you do a first "indexing" pass to get all the offsets from start of file to start of record into a more usable, compact, handily-formatted auxiliary file? (That binary file of offsets would be a natural for array -- unless the offsets need to be longer than array supports on your machine!). What is the distribution of record lengths and compositions and number of records to make up the "tens of gigabytes"? Etc, etc.
You have a very large scale problem (and no doubt very large scale hardware to support it, since you mention that you can easily read all of the file into memory that means a 64bit box with many tens of GB of RAM -- wow!), so it's well worth the detailed care to optimize the handling thereof -- but we can't help much with such detailed care unless we know enough detail to do so!-).
Have a look at the array module, specifically the array.fromfile method. This bit:
Each record in the data file is variable length.
is rather unfortunate, but you could handle it with a try-except clause, as sketched below.
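A rough sketch of that idea, assuming (purely for illustration) that each record is one unsigned-int length prefix followed by that many doubles:

from array import array

with open('data.bin', 'rb') as f:
    while True:
        header = array('I')        # assumed: each record starts with one unsigned-int length
        try:
            header.fromfile(f, 1)
        except EOFError:
            break                  # clean end of file
        payload = array('d')       # assumed: the record body is 'length' doubles
        payload.fromfile(f, header[0])
        # ... process payload ...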
For a similar task, I defined a class like this:
from ctypes import Structure, c_uint32, sizeof, addressof, memmove

class foo(Structure):
    _fields_ = [("myint", c_uint32)]
created an instance
bar = foo()
and did,
block = file.read(sizeof(bar))
memmove(addressof(bar), block, sizeof(bar))
In the event of variable-size records, you can use a similar method for retrieving lenN, and then read the corresponding data entries. Seems trivial to implement. However, I have no idea of how fast this method is compared to using pack() and unpack(), perhaps someone else has profiled both methods.
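For comparison, the same <lenN><dataN> loop written with struct might look like this (the 4-byte little-endian length prefix and the field decoded from the payload are assumptions):

import struct

with open('data.bin', 'rb') as f:
    while True:
        prefix = f.read(4)                     # assumed: 4-byte little-endian length field
        if len(prefix) < 4:
            break
        (length,) = struct.unpack('<I', prefix)
        payload = f.read(length)
        # pull out whatever fields you need, e.g. a double at an assumed offset:
        # value, = struct.unpack_from('<d', payload, 0)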
For help with parsing the file without reading it into memory you can use the bitstring module.
Internally this is using the struct module and a bytearray, but an immutable Bits object can be initialised with a filename so it won't read it all into memory.
For example:
from bitstring import Bits
s = Bits(filename='your_file')
while s.pos < s.length:
    # Read a byte and interpret it as an unsigned integer
    length = s.read('uint:8')
    # Read 'length' bytes and keep them as a Python byte string
    data = s.read(length*8).bytes
    # Now do whatever you want with the data
Of course you can parse the data however you want.
You can also use slice notation to read the file contents, although note that the indices will be in bits rather than bytes so for example s[-800:] would be the final 100 bytes.
What if you dump the data file into an in-memory sqlite3 database?
import sqlite3
conn = sqlite3.connect(":memory:")
You can then use SQL to process the data.
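A sketch of that idea, with a made-up record schema:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (rec_id INTEGER, value REAL)")
# insert rows as you parse them out of the binary file
conn.executemany("INSERT INTO records VALUES (?, ?)", [(0, 1.5), (1, 2.5)])
# then query instead of re-scanning the file
total, = conn.execute("SELECT SUM(value) FROM records").fetchone()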
Besides, you might want to look at generators and iterators.
PyTables is a very good library for handling HDF5, a binary format used in astronomy and meteorology to work with very big datasets.
It works more or less like a hierarchical database, where you can store multiple tables, nested inside columns. Have a look at it.
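A tiny, hypothetical PyTables example of declaring a table with explicit column types and appending rows to it:

import tables

class Record(tables.IsDescription):
    timestamp = tables.UInt32Col()
    value = tables.Float64Col()

with tables.open_file('big.h5', mode='w') as h5:
    table = h5.create_table('/', 'records', Record)
    row = table.row
    for i in range(1000):
        row['timestamp'] = i
        row['value'] = i * 0.5
        row.append()
    table.flush()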
