Slow loading of large NumPy datasets - python

I notice a long loading time (~10 min) for a .npy file holding a 1D numpy array of object dtype with a length of ~10000. Each element in this array is an ordered dictionary (OrderedDict, a dictionary subclass from the collections package) with a length of ~5000. So, how can I efficiently save and load large NumPy arrays to and from disk? How are large data sets in Python traditionally handled?

NumPy pickles embedded objects by default, which is slow (you could avoid this with allow_pickle=False, but it sounds like you may need it; see https://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html).
You may want to look at Pandas (see http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization) or come up with your own file format that avoids pickling your complex data structures.
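For instance, if every dictionary happens to share the same keys and holds only numeric values, a minimal sketch of the "own file format" idea could split the data into a plain numeric array plus a small metadata file (here 'records' is a hypothetical stand-in for the original object array):

import json
import numpy as np

# Hypothetical layout: 'records' stands in for the original object array,
# assuming every OrderedDict has the same keys and numeric values.
keys = list(records[0].keys())
values = np.array([[rec[k] for k in keys] for rec in records], dtype=np.float64)

np.save('values.npy', values, allow_pickle=False)  # pure numeric data, no pickling
with open('keys.json', 'w') as f:
    json.dump(keys, f)                             # column names as small metadata

# Loading later is a fast, pickle-free operation:
values = np.load('values.npy')
with open('keys.json') as f:
    keys = json.load(f)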

Saving and loading large datasets to/from disk will always be a costly operation. One possible optimization is using memory mapping to disk and working directly on the array (if this is compatible with your application), especially if you're only interested in a small part of the dataset. This is what numpy.memmap does.
For example:
import numpy as np
a = np.memmap('largeArray.dat', dtype=np.int32, mode='w+', shape=(100000,))
this will create a numpy array 'a' of 100000 int32 values. It can be handled like any "normal" numpy array. It also creates the corresponding file 'largeArray.dat' on your disk that will contain the data in 'a'. Synchronization between 'a' and 'largeArray.dat' is handled by numpy, and how much of the data is kept in RAM depends on your RAM size.
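For completeness, a minimal sketch of reading the file back later without loading everything (same dtype and shape assumed as above):

import numpy as np

# Reopen the same file read-only; only the pages you actually touch are read.
b = np.memmap('largeArray.dat', dtype=np.int32, mode='r', shape=(100000,))
print(b[:10])  # small slice, no full load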
More info in the numpy.memmap documentation.

Related

How can I persistently store and efficiently access a very large 2D list in Python?

In Python, I'm reading in a very large 2D grid of data that consists of around 200,000,000 data points in total. Each data point is a tuple of 3 floats. Reading all of this data into a two-dimensional list frequently causes MemoryErrors. To get around this, I would like to be able to read the data into some sort of table on the hard drive that can be efficiently accessed when given a grid coordinate, e.g. harddrive_table.get(300, 42).
So far in my research, I've come across PyTables, which is an implementation of HDF5 and seems like overkill, and the built in shelve library, which uses a dictionary-like method to access saved data, but the keys have to be strings and the performance of converting hundreds of millions of grid coordinates to strings for storage could be too much of a performance hit for my use.
Are there any libraries that allow me to store a 2D table of data on the hard drive with efficient access for a single data point?
This table of data is only needed while the program is running, so I don't care about its interoperability or how it stores the data on the hard drive, as it will be deleted after the program has run.
HDF5 isn't really overkill if it works. In addition to PyTables there's the somewhat simpler h5py.
Numpy lets you mmap a file directly into a numpy array. The values will be stored in the disk file in the minimum-overhead way, with the numpy array shape providing the mapping between array indices and file offsets. mmap uses the same underlying OS mechanisms that power the disk cache to map a disk file into virtual memory, meaning that the whole thing can be loaded into RAM if memory permits, but parts can be flushed to disk (and reloaded later on demand) if it doesn't all fit at once.
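A minimal sketch of that approach for this question's layout (the grid dimensions and the float32 dtype are assumptions, not from the original post):

import numpy as np

# Assumed grid dimensions: 20000 x 10000 points, 3 floats per point.
ROWS, COLS = 20000, 10000

# mode='w+' creates the backing file; reopen later with mode='r+' or 'r'.
grid = np.memmap('grid.dat', dtype=np.float32, mode='w+', shape=(ROWS, COLS, 3))

grid[300, 42] = (1.0, 2.0, 3.0)  # write a single data point
point = grid[300, 42]            # read it back without loading the whole grid
grid.flush()                     # push dirty pages to disk when needed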

compress mixed type numpy arrays efficiently

I need a way to efficiently store (size & read speed) data using numpy arrays with mixed (heterogeneous) dtypes. Imagine a dataset that has 100M observations, and 5 variables per observation (3 of which are int32, and 2 are float32).
I'm currently storing the data in two gzipped .npy files, one for the ints and one for the floats:
import numpy as np
import gzip as gz
with gz.open('array_ints.npy.gz', 'wb') as fObj:
    np.save(fObj, int_ndarray)
with gz.open('array_floats.npy.gz', 'wb') as fObj:
    np.save(fObj, flt_ndarray)
I've also tried storing the data as a Structured Array, but the final file size is roughly 25% larger than the combined size of storing the ints and floats separately. My data is stretching into the TBs range, so I'm looking for the most efficient way to store it (but I'd like to avoid changing compression algos to something like LZMA).
Is there another way to efficiently store different data types together, so I can read both in at the same time? I'm starting to look into HDF5, but I'm not sure whether that can help.
EDIT:
Ultimately, I ended up going down the HDF5 route with h5py. Relative to the gzip-compressed .npy arrays, I actually see a 25% decrease in size using h5py, although this can largely be attributed to the shuffle filter. And when saving the two arrays in the same file, there is virtually no overhead relative to saving them as individual files.
I realize that the original question was too broad, and sufficient answers can't be given without the specific format of the data and a representative sample (which I can't really disclose). For this reason, I'm closing the question.
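For reference, a rough sketch of the h5py layout described in the edit (dataset names and options are illustrative, not the actual code used):

import h5py

# int_ndarray and flt_ndarray are the arrays from the snippet above.
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('ints', data=int_ndarray,
                     compression='gzip', shuffle=True, chunks=True)
    f.create_dataset('floats', data=flt_ndarray,
                     compression='gzip', shuffle=True, chunks=True)

# Both arrays can be read back from the same file:
with h5py.File('data.h5', 'r') as f:
    ints = f['ints'][:]
    floats = f['floats'][:]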

Save list of numpy arrays onto disk

I have a list of 42000 numpy arrays (each array is 240x240) that I want to save to a file for use in another python script.
I've tried using pickle and numpy.savez_compressed, and I run into MemoryErrors (I have 16 GB of DDR3). I read that HDF5, which is commonly used for deep learning stuff, cannot save lists, so I'm kind of stuck.
Does anyone have any idea how I can save my data?
EDIT: I previously saved this data to disk as a single numpy array using np.save; it was around 2.3 GB, but my computer couldn't always handle it, so it would sometimes crash if I tried to process it. I read that lists might be better, so I have moved to using lists of numpy arrays.
Assume we have a list of numpy arrays, A, and wish to save these sequentially to an HDF5 file.
We can use the h5py library to create datasets, with each dataset corresponding to an array in A.
import h5py, numpy as np
A = [arr1, arr2, arr3]  # each arrX is a 240x240 numpy array
with h5py.File('file.h5', 'w', libver='latest') as f:  # use 'latest' for performance
    for idx, arr in enumerate(A):
        dset = f.create_dataset(str(idx), shape=(240, 240), data=arr,
                                chunks=(240, 240),
                                compression='gzip', compression_opts=9)
I use gzip compression here for compatibility reasons, since it ships with every HDF5 installation. You may also wish to consider blosc & lzf filters. I also set chunks equal to shape, under the assumption you intend to read entire arrays rather than partial arrays.
The h5py documentation is an excellent resource to improve your understanding of the HDF5 format, as the h5py API follows the C API closely.
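Reading the arrays back is symmetric; a short usage sketch:

import h5py

with h5py.File('file.h5', 'r') as f:
    first = f['0'][:]                          # load one array fully
    A = [f[str(i)][:] for i in range(len(f))]  # or rebuild the whole list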

Saving numpy array such that it is readily available without loading

I have a 20 GB library of images stored as a high-dimensional numpy array. This library allows me to use these images without having to generate them anew each time. Now my problem is that np.load("mylibrary") takes as much time as it would take to generate a couple of those images. Therefore my question is: is there a way to store a numpy array such that it is readily accessible without having to load it?
Edit: I am using PyCharm
I would suggest h5py which is a Pythonic interface to the HDF5 binary data format.
It lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.
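A minimal sketch of that workflow (the file and dataset names, and the one-time conversion step with a small random stand-in array, are invented for illustration):

import h5py
import numpy as np

# One-time conversion of the existing array (a small random stand-in here).
library = np.random.rand(1000, 64, 64).astype(np.float32)
with h5py.File('library.h5', 'w') as f:
    f.create_dataset('images', data=library, chunks=(1, 64, 64))

# Later: opening the file is cheap, and only the slices you index are read.
f = h5py.File('library.h5', 'r')
img = f['images'][42]  # loads a single image, not the whole library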
You can also use PyTables, another HDF5 interface for Python and NumPy.
PyTables is a package for managing hierarchical datasets, designed to efficiently and easily cope with extremely large amounts of data. You can download PyTables and use it for free, and its website provides documentation, examples of use, and presentations.
numpy.memmap is another option, though it would be slower than HDF5. Another constraint is that an array should be limited to 2.5 GB.

How to save the n-d numpy array data and read it quickly next time?

Here is my question:
I have a 3-d numpy array Data with the shape (1000, 100, 100).
I want to save it as a .txt or .csv file. How can I achieve that?
My first attempt was to reshape it into a 1-d array of length 1000*100*100, convert it into a pandas.DataFrame, and then save it as a .csv file.
When I want to load it next time, I reshape it back into a 3-d array.
I think there must be an easier way.
If you need to re-read it quickly into numpy you could just use the cPickle module.
This is going to be much faster than parsing it back from an ASCII dump (though only a Python program will be able to re-read it). As a bonus, with just one instruction you can dump more than a single matrix (i.e. any data structure built from core Python objects and numpy arrays).
Note that parsing a floating-point value from an ASCII string is quite a complex and slow operation (if implemented correctly down to the ulp).
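A minimal sketch using Python 3's pickle module (cPickle was folded into pickle in Python 3), with a random stand-in for the 3-d array:

import pickle
import numpy as np

data = np.random.rand(1000, 100, 100)  # stand-in for the 3-d array

with open('data.pkl', 'wb') as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

with open('data.pkl', 'rb') as f:
    data2 = pickle.load(f)  # comes back with shape (1000, 100, 100) and dtype intact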
