I need a way to efficiently store (size & read speed) data using numpy arrays with mixed (heterogeneous) dtypes. Imagine a dataset that has 100M observations, and 5 variables per observation (3 of which are int32, and 2 are float32).
I'm currently storing the data in two gzipped .npy files, one for the ints and one for the floats:
import numpy as np
import gzip as gz
with gz.open('array_ints.npy.gz', 'wb') as fObj:
    np.save(fObj, int_ndarray)
with gz.open('array_floats.npy.gz', 'wb') as fObj:
    np.save(fObj, flt_ndarray)
I've also tried storing the data as a structured array, but the final file size is roughly 25% larger than the combined size of storing the ints and floats separately. My data stretches into the TB range, so I'm looking for the most efficient way to store it (but I'd like to avoid changing compression algorithms to something like LZMA).
Is there another way to efficiently store different data types together, so I can read both in at the same time? I'm starting to look into HDF5, but I'm not sure it can help.
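For reference, the structured-array attempt looked roughly like this (a sketch; the field names are illustrative, and it assumes int_ndarray has shape (N, 3) and flt_ndarray has shape (N, 2)):
import numpy as np
import gzip as gz

# One record per observation: 3 int32 fields and 2 float32 fields
rec_dtype = np.dtype([('i0', 'i4'), ('i1', 'i4'), ('i2', 'i4'),
                      ('f0', 'f4'), ('f1', 'f4')])
records = np.empty(int_ndarray.shape[0], dtype=rec_dtype)
for k, name in enumerate(['i0', 'i1', 'i2']):
    records[name] = int_ndarray[:, k]
for k, name in enumerate(['f0', 'f1']):
    records[name] = flt_ndarray[:, k]

with gz.open('array_records.npy.gz', 'wb') as fObj:
    np.save(fObj, records)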
EDIT:
Ultimately, I ended up going down the HDF5 route with h5py. Relative to gzip-compressed .npy arrays, I actually see a 25% decrease in size using h5py, though most of that can be attributed to the shuffle filter. More importantly, when saving two arrays in the same file, there is virtually no overhead relative to saving them as individual files.
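For reference, a minimal sketch of the h5py approach (dataset names are illustrative; the arrays are the same int_ndarray and flt_ndarray from the code above):
import h5py
import numpy as np

# Store both arrays in one HDF5 file with gzip compression and the shuffle
# filter, which reorders bytes by significance and usually compresses better.
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('ints', data=int_ndarray, compression='gzip', shuffle=True)
    f.create_dataset('floats', data=flt_ndarray, compression='gzip', shuffle=True)

# Read both back in one pass
with h5py.File('data.h5', 'r') as f:
    ints = f['ints'][:]
    floats = f['floats'][:]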
I realize that the original question was too broad, and sufficient answers can't be given without the specific format of the data and a representative sample (which I can't really disclose). For this reason, I'm closing the question.
Related
I have a large numpy array (188,995 values to be exact) containing 18-digit integers. Here would be the first 5:
array([873205635515447425, 872488459744513265, 872556415745513809,
872430459826834345, 867251246913838889])
The array's dtype is dtype('int64'). I'm currently storing this array in a .npy file that's 1.5 MB in size.
I'll be storing a couple of these arrays every day, and I want to be conscious of storage. If it helps, the integers are always 18-digits long. They don't have any discernible pattern, so dividing them down won't work.
I was able to decrease the file size to 1.4 MB by gzip-compressing it and storing it as a .npy.gz file, but that's the lowest it'll go.
Is there a way to compress the array down further?
In Python, I'm reading in a very large 2D grid of data that consists of around 200,000,000 data points in total. Each data point is a tuple of 3 floats. Reading all of this data into a two-dimensional list frequently causes a MemoryError. To get around this, I would like to be able to read the data into some sort of table on the hard drive that can be efficiently accessed when given a grid coordinate, e.g. harddrive_table.get(300, 42).
So far in my research, I've come across PyTables, which is an implementation of HDF5 and seems like overkill, and the built-in shelve library, which uses a dictionary-like method to access saved data; but its keys have to be strings, and converting hundreds of millions of grid coordinates to strings for storage could be too much of a performance hit for my use case.
Are there any libraries that allow me to store a 2D table of data on the hard drive with efficient access for a single data point?
This table of data is only needed while the program is running, so I don't care about its interoperability or how it stores the data on the hard drive, as it will be deleted after the program has run.
HDF5 isn't really overkill if it works. In addition to PyTables there's the somewhat simpler h5py.
Numpy lets you mmap a file directly into a numpy array. The values will be stored in the disk file in the minimum-overhead way, with the numpy array shape providing the mapping between array indices and file offsets. mmap uses the same underlying OS mechanisms that power the disk cache to map a disk file into virtual memory, meaning that the whole thing can be loaded into RAM if memory permits, but parts can be flushed to disk (and reloaded later on demand) if it doesn't all fit at once.
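A rough sketch of that approach for the grid described above, assuming float32 values and illustrative grid dimensions (20,000 x 10,000 points of 3 floats is roughly 2.4 GB on disk):
import numpy as np

# Illustrative dimensions: 20,000 x 10,000 grid points, 3 float32 values each
rows, cols = 20000, 10000

# Disk-backed array; the OS pulls only the touched pages into RAM
grid = np.memmap('grid.dat', dtype=np.float32, mode='w+', shape=(rows, cols, 3))

grid[300, 42] = (1.0, 2.0, 3.0)   # write one point
point = grid[300, 42]             # read it back
grid.flush()                      # push pending writes to disk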
I notice a long loading time (~10 min) for a .npy file containing a 1D numpy array of object dtype with a length of ~10,000. Each element of this array is an ordered dictionary (OrderedDict, a dictionary subclass from the collections package) with a length of ~5,000. So, how can I efficiently save and load large NumPy arrays to and from disk? How are large data sets traditionally handled in Python?
Numpy will pickle embedded objects by default (you could avoid that with allow_pickle=False, but it sounds like you may need it), which is slow (see https://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html).
You may want to check Pandas (see http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization) or try to come up with your own file format that avoids pickling of your complex data structures.
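For example, a rough sketch of a custom format that avoids pickling, assuming (purely for illustration) that both the keys and values of the OrderedDicts are numeric:
import numpy as np
from collections import OrderedDict

# Flatten a sequence of OrderedDicts into flat key/value arrays plus per-dict
# offsets, so np.savez can store them without pickling any Python objects.
def save_dicts(path, dicts):
    lengths = np.array([len(d) for d in dicts], dtype=np.int64)
    offsets = np.concatenate(([0], np.cumsum(lengths)))
    keys = np.concatenate([np.fromiter(d.keys(), dtype=np.int64) for d in dicts])
    values = np.concatenate([np.fromiter(d.values(), dtype=np.float64) for d in dicts])
    np.savez(path, offsets=offsets, keys=keys, values=values)

def load_dicts(path):
    with np.load(path) as f:
        offsets, keys, values = f['offsets'], f['keys'], f['values']
    return [OrderedDict(zip(keys[s:e], values[s:e]))
            for s, e in zip(offsets[:-1], offsets[1:])]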
Saving and loading large datasets to/from disk will always be a costly operation. One possible optimization is using memory mapping to disk and working directly on the array (if this is compatible with your application), especially if you're only interested in a small part of the dataset. This is what numpy.memmap does.
For example:
import numpy as np
a = np.memmap('largeArray.dat', dtype=np.int32, mode='w+', shape=(100000,))
This will create a numpy array 'a' of 100,000 int32 values. It can be handled like any "normal" numpy array. It also creates the corresponding file 'largeArray.dat' on your disk that will contain the data in 'a'. Synchronization between 'a' and 'largeArray.dat' is handled by numpy, and how much of the array stays resident depends on your RAM size.
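To reopen the same file later without rewriting it, the dtype and shape must match what was written (a small sketch):
import numpy as np

# Reopen the file created above read-only; dtype and shape must match
b = np.memmap('largeArray.dat', dtype=np.int32, mode='r', shape=(100000,))
print(b[:10])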
More info in the numpy.memmap documentation.
I have a large, sparse, multidimensional lookup table, where cells contain arrays varying in size from 34 kB to circa 10 MB (essentially one or more elements stored in this bin/bucket/cell). My prototype has dimensions of 30**5=24,300,000, of which only 4,568 cells are non-empty (so it's sparse). Prototype non-empty cells contain structured arrays with sizes between 34 kB and 7.5 MB. At 556 MB, the prototype is easily small enough to fit in memory, but the production version will be a lot larger; maybe 100–1000 times (it is hard to estimate). This growth will be mostly due to increased dimensions, rather than due to the data contained in individual cells. My typical use case is write once (or rarely), read often.
I'm currently using a Python dictionary, where the keys are tuples, i.e. db[(29,27,29,29,16)] is a structured numpy.ndarray of around 1 MB. However, as it grows, this won't fit in memory.
A natural and easy to implement extension would be the Python shelve module.
I think PyTables (the tables package) is fast, in particular for the write-once, read-often use case, but I don't think it fits my data structure.
Considering that I will always need access only by the tuple index, a very simple way to store it would be to have a directory with some thousands of files with names like entry-29-27-29-29-16, each of which stores the numpy.ndarray object in some format (NetCDF, HDF5, .npy, ...); a sketch of this layout follows below.
I'm not sure if a classical database would work, considering that the size of the entries varies considerably.
What is a way to store a data structure as described above, that has efficient storage and a fast retrieval of data?
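A minimal sketch of that one-file-per-cell layout, using plain .npy files and tuple-derived file names (the directory name and naming scheme are illustrative):
import os
import numpy as np

DB_DIR = 'db'  # illustrative directory name

def cell_path(key):
    # key is a tuple such as (29, 27, 29, 29, 16)
    return os.path.join(DB_DIR, 'entry-' + '-'.join(str(i) for i in key) + '.npy')

def write_cell(key, arr):
    os.makedirs(DB_DIR, exist_ok=True)
    np.save(cell_path(key), arr)

def read_cell(key):
    # raises FileNotFoundError for empty (never-written) cells
    return np.load(cell_path(key))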
From what I understand, you might want to look at the amazing pandas package, as it has a specific facility for the sparse data structure you've described.
Also, while this stackoverflow post doesn't specifically address sparse data, it's a great description of using pandas for BIG data, which may be of interest.
Best of luck!
I have a dictionary with many entries and a huge vector as each value. These vectors can be 60,000 dimensions long, and I have about 60,000 entries in the dictionary. To save time, I want to store this after computation. However, using pickle led to a huge file. I have tried storing to JSON, but the file remains extremely large (around 10.5 MB for a sample of 50 entries with fewer dimensions). I have also read about sparse matrices. As most entries will be 0, this is a possibility. Will this reduce the file size? Is there any other way to store this information? Or am I just unlucky?
Update:
Thank you all for the replies. I want to store this data because these are word counts. For example, given sentences, I store the number of times word 0 (at location 0 in the array) appears in the sentence. There are obviously more words across all sentences than appear in any one sentence, hence the many zeros. Then, I want to use this array to train at least three, maybe six, classifiers. It seemed easier to create the arrays with word counts and then run the classifiers overnight to train and test. I use sklearn for this. This format was chosen to be consistent with other feature vector formats, which is why I am approaching the problem this way. If this is not the way to go in this case, please let me know. I am very much aware that I have much to learn about coding efficiently!
I also started implementing sparse matrices. The file is even bigger now (testing with a sample set of 300 sentences).
Update 2:
Thank you all for the tips. John Mee was right by not needing to store the data. Both he and Mike McKerns told me to use sparse matrices, which sped up calculation significantly! So thank you for your input. Now I have a new tool in my arsenal!
See my answer to a very closely related question https://stackoverflow.com/a/25244747/2379433, if you are ok with pickling to several files instead of a single file.
Also see: https://stackoverflow.com/a/21948720/2379433 for other potential improvements, and here too: https://stackoverflow.com/a/24471659/2379433.
If you are using numpy arrays, it can be very efficient, as both klepto and joblib understand how to use minimal state representation for an array. If you indeed have most elements of the arrays as zeros, then by all means, convert to sparse matrices... and you will find huge savings in storage size of the array.
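A minimal sketch of the sparse-matrix route, assuming word-count rows like those described in the question (scipy's save_npz/load_npz store only the non-zero entries, and sklearn estimators generally accept sparse input directly):
import numpy as np
from scipy import sparse

# dense: one row per sentence, one column per vocabulary word (mostly zeros)
dense = np.zeros((300, 60000), dtype=np.int32)
dense[0, 5] = 2  # e.g. word 5 appears twice in sentence 0

counts = sparse.csr_matrix(dense)            # keeps only the non-zero entries
sparse.save_npz('word_counts.npz', counts)   # compact binary file on disk

counts = sparse.load_npz('word_counts.npz')  # reload later; pass straight to sklearn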
As the links above discuss, you could use klepto -- which provides you with the ability to easily store dictionaries to disk or database, using a common API. klepto also enables you to pick a storage format (pickle, json, etc.) -- where HDF5 is coming soon. It can utilize both specialized pickle formats (like numpy's) and compression (if you care about size and not speed).
klepto gives you the option to store the dictionary as an "all-in-one" file or as one entry per file, and it can also leverage multiprocessing or multithreading -- meaning that you can save and load dictionary items to/from the backend in parallel.
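A rough sketch of what that can look like with klepto's dir_archive (one entry per file); the exact keyword options may differ, so check the klepto documentation:
import numpy as np
from klepto.archives import dir_archive

# One entry per file on disk; cached=True keeps an in-memory cache as well
db = dir_archive('word_vectors', cached=True, serialized=True)
db['sentence_0'] = np.zeros(60000, dtype=np.int32)
db.dump()    # flush cached entries to the on-disk archive

db2 = dir_archive('word_vectors', cached=True, serialized=True)
db2.load()   # pull the stored entries back into memory
vec = db2['sentence_0']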
With 60,000 dimensions, do you mean 60,000 elements? If this is the case and the numbers are 1..10, then a reasonably compact but still efficient approach is to use a dictionary of Python array.array objects with 1 byte per element (type 'B').
The size in memory should be about 60,000 entries x 60,000 bytes, totaling roughly 3.35 GiB of data.
That data structure pickles to about the same size on disk, too.
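A small sketch of that layout, assuming the counts fit in an unsigned byte (0-255):
from array import array
import pickle

# One array.array('B') per key: 1 byte per element instead of 8 for int64
counts = {}
counts['sentence_0'] = array('B', [0] * 60000)
counts['sentence_0'][5] = 2  # word 5 appears twice in sentence 0

with open('counts.pkl', 'wb') as f:
    pickle.dump(counts, f, protocol=pickle.HIGHEST_PROTOCOL)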