I have a list of 42000 numpy arrays (each array is 240x240) that I want to save to a file for use in another python script.
I've tried using pickle and numpy.savez_compressed, but both run into MemoryErrors (I have 16 GB of DDR3 RAM). I read that HDF5, which is commonly used for deep learning, cannot save lists, so I'm kind of stuck.
Does anyone have any idea how I can save my data?
EDIT: I previously saved this data to disk as a single numpy array using np.save. The file was around 2.3GB, but my computer couldn't always handle it and would sometimes crash when I tried to process it. I read that lists might be better, so I have moved to using a list of numpy arrays.
Assume we have a list of numpy arrays, A, and wish to save these sequentially to an HDF5 file.
We can use the h5py library to create datasets, with each dataset corresponding to an array in A.
import h5py
import numpy as np
A = [arr1, arr2, arr3]  # each arrX is a 240x240 numpy array
with h5py.File('file.h5', 'w', libver='latest') as f:  # use 'latest' for performance
    for idx, arr in enumerate(A):
        # one dataset per array, named by its index in the list
        dset = f.create_dataset(str(idx), shape=(240, 240), data=arr, chunks=(240, 240),
                                compression='gzip', compression_opts=9)
I use gzip compression here for compatibility reasons, since it ships with every HDF5 installation. You may also wish to consider blosc & lzf filters. I also set chunks equal to shape, under the assumption you intend to read entire arrays rather than partial arrays.
The h5py documentation is an excellent resource to improve your understanding of the HDF5 format, as the h5py API follows the C API closely.
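Reading the data back in the other script could look roughly like the sketch below. It assumes the same one-dataset-per-index layout as above; the three small stand-in arrays and the temporary file path are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Small stand-in for the real list (the question has 42000 arrays of 240x240).
A = [np.full((240, 240), i, dtype=np.float32) for i in range(3)]

path = os.path.join(tempfile.mkdtemp(), "file.h5")  # hypothetical location

# Write one dataset per array, named by its index in the list.
with h5py.File(path, "w") as f:
    for idx, arr in enumerate(A):
        f.create_dataset(str(idx), data=arr, chunks=(240, 240), compression="gzip")

# Read back a single array without touching the others.
with h5py.File(path, "r") as f:
    n_arrays = len(f)        # number of datasets in the file
    second = f["1"][()]      # [()] loads this whole dataset into memory
```

Because each array is its own dataset, only the arrays you actually index are decompressed and loaded, which should keep memory use well below the size of the full list.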
Related
Though there exist so many of these questions, I couldn't find any working solution for Windows:
I have a large list of lists (of lists) with shape (~30000, 48, 411) (or even bigger in some cases), which I need as a numpy array for training my LSTM model...
Any ideas how to make this work? (I don't use Linux, just Windows and 64-bit Python.)
I already tried converting it to np.float32 -> still too big!
Then I tried to convert it to np.float16 -> "tuple not callable"
The idea was to save and load it via np.memmap(), but for that I would also need it as a numpy array first. (This format is also needed for the training process, so the goal is to convert it to an np.ndarray.)
I even tried to split it into smaller lists (tenths), but it was still unable to allocate.
It is not clear to me what format these "lists of lists" are in or what you mean by "too big" (for your memory, I assume?), but you might want to look into dask.
With that you can do something like
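import dask
import dask.array as da
import numpy as np
...
arrays = []
for i in range(nfiles):
    # read_list is your own loader; it must return a dask.delayed object
    # (e.g. decorate it with @dask.delayed). The dtype here is an assumption.
    arrays.append(da.from_delayed(read_list(...), shape=(...), dtype=np.float32))
arr = da.stack(arrays)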
The dask documentation has more examples on how to create dask arrays.
In general, if your data is too large for your memory to handle (which should not be the case for 2-3GB of data), processing will be very slow, so your best bet is to split it into chunks and analyze it chunk by chunk.
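If dask is not an option, the chunking idea can also be sketched with plain NumPy: convert the nested list a slice at a time and write each slice into an np.memmap, so the full array never has to be allocated in RAM. The sizes, the chunk length, and the file name below are illustrative assumptions, and the nested list itself still has to fit in memory.

```python
import os
import tempfile

import numpy as np

# Tiny stand-in for the nested list; the question's data is about (30000, 48, 411).
n_samples, n_steps, n_features = 10, 4, 5
nested = [[[float(i)] * n_features for _ in range(n_steps)] for i in range(n_samples)]

path = os.path.join(tempfile.mkdtemp(), "train.dat")  # hypothetical file name
out = np.memmap(path, dtype=np.float32, mode="w+",
                shape=(n_samples, n_steps, n_features))

chunk = 3  # samples converted per step; tune to the available RAM
for start in range(0, n_samples, chunk):
    stop = min(start + chunk, n_samples)
    # Only this slice of the list is turned into a temporary NumPy array.
    out[start:stop] = np.asarray(nested[start:stop], dtype=np.float32)
out.flush()  # make sure everything is written to disk

# Reopen read-only; slices are read from disk on demand.
data = np.memmap(path, dtype=np.float32, mode="r",
                 shape=(n_samples, n_steps, n_features))
```

The resulting `data` object can be sliced like an ordinary array, so batches for training can be pulled out one at a time.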
Since Numpy arrays map to C arrays and MonetDB is using C arrays as its storage model, is it possible to load data from in-memory Numpy arrays into MonetDB? This would save a round-trip to disk, i.e. writing the data from the Numpy array to disk and bulk loading it from disk into MonetDB. I'm aware of embedded Python in MonetDB but I'd rather have embedded MonetDB in Python.
The official MonetDBLite for Python implementation supports this. See the examples for inserting data. https://www.monetdb.org/blog/monetdblite-for-python
I notice a long loading time (~10 min) for a .npy file containing a 1D numpy array of object data type with a length of ~10000. Each element in this array is an ordered dictionary (OrderedDict, a dictionary subclass from the collections package) with a length of ~5000. So, how can I efficiently save and load large NumPy arrays to and from disk? How are large data sets in Python traditionally handled?
Numpy pickles embedded objects by default, which is slow (see https://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html). You could disable this with allow_pickle=False, but it sounds like you need it, since object arrays cannot be saved without pickling.
You may want to check Pandas (see http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization) or try to come up with your own file format that avoids pickling of your complex data structures.
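One way to avoid pickling, sketched here under the assumption that every dictionary has the same keys and only numeric values, is to store the values as a regular 2D array and the keys once as a string array. The record contents and the file name are made up for illustration.

```python
import os
import tempfile
from collections import OrderedDict

import numpy as np

# Stand-in data: 4 records instead of ~10000, 3 keys instead of ~5000.
keys = ["a", "b", "c"]
records = np.empty(4, dtype=object)
for i in range(4):
    records[i] = OrderedDict((k, float(i + j)) for j, k in enumerate(keys))

# One row per dictionary, one column per key, in a fixed key order.
values = np.array([[rec[k] for k in keys] for rec in records], dtype=np.float64)

path = os.path.join(tempfile.mkdtemp(), "records.npz")  # hypothetical file name
np.savez(path, keys=np.array(keys), values=values)

# Loading needs no pickling because both arrays have plain dtypes.
with np.load(path, allow_pickle=False) as loaded:
    loaded_keys = loaded["keys"].tolist()
    loaded_values = loaded["values"]
```

If the dictionaries do not share a common set of keys, this layout does not apply directly and a table-oriented format may be a better fit.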
Saving and loading large datasets to/from disk will always be a costly operation. One possible optimization is using memory mapping to disk and working directly on the array (if this is compatible with your application), especially if you're only interested in a small part of the dataset. This is what numpy.memmap does.
For example:
import numpy as np
a = np.memmap('largeArray.dat', dtype=np.int32, mode='w+', shape=(100000,))
This will create a numpy array 'a' of 100000 int32 values. It can be handled like any "normal" numpy array. This also creates the corresponding file 'largeArray.dat' on your disk, which will contain the data in 'a'. Synchronization between 'a' and 'largeArray.dat' is handled by numpy, and how much of the data is kept in memory at once depends on your RAM size.
More info here
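A slightly fuller sketch of the same idea, writing the memory-mapped file once and then reopening it to read only a small part. The temporary directory is an assumption made so the example does not clutter the working directory.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "largeArray.dat")

# Create the file-backed array and fill it like a normal array.
a = np.memmap(path, dtype=np.int32, mode="w+", shape=(100000,))
a[:] = np.arange(100000, dtype=np.int32)
a.flush()  # push pending changes to the file
del a      # drop the reference to the mapping

# Reopen read-only. The raw file stores no metadata,
# so dtype and shape must be given again.
b = np.memmap(path, dtype=np.int32, mode="r", shape=(100000,))
tail = np.array(b[-5:])  # copies only the last five values into RAM
file_size = os.path.getsize(path)
```

Note that the file is just the raw bytes of the array (4 bytes per int32 here), so the script that reads it has to know the dtype and shape in advance.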
I have a 20GB library of images stored as a high-dimensional numpy array. This library allows me to use these images without having to generate them anew each time. Now my problem is that np.load("mylibrary") takes as much time as it would take to generate a couple of those images. Therefore my question is: Is there a way to store a numpy array such that it is readily accessible without having to load it?
Edit: I am using PyCharm
I would suggest h5py which is a Pythonic interface to the HDF5 binary data format.
It lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.
You can also use PyTables. It is another HDF5 interface for Python and NumPy.
PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data. You can download PyTables and use it for free. You can access documentation, some examples of use and presentations here.
numpy.memmap is another option. It may, however, be slower than HDF5. Another limitation is that on 32-bit systems a memory-mapped file cannot be larger than 2GB.
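For an array that is already stored with np.save, a memory-mapped read is available through the mmap_mode argument of np.load. The sketch below uses a small stand-in library and a hypothetical file name.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "mylibrary.npy")  # hypothetical file name

# Stand-in for the image library: 8 images of 16x16 pixels.
library = np.arange(8 * 16 * 16, dtype=np.float32).reshape(8, 16, 16)
np.save(path, library)

# mmap_mode="r" maps the file instead of reading it all;
# shape and dtype come from the .npy header.
lazy = np.load(path, mmap_mode="r")
image = np.array(lazy[5])  # only this image is copied into memory
```

Opening the file this way should return almost immediately regardless of its size, and the cost is paid only for the images that are actually accessed.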
I have an existing HDF5 file with three arrays, and I want to extract one of the arrays using h5py.
h5py already reads files in as numpy arrays, so just:
import h5py
with h5py.File('the_filename', 'r') as f:
    my_array = f['array_name'][()]
The [()] means to read the entire array in; if you don't do that, it doesn't read the whole data but instead gives you lazy access to sub-parts (very useful when the array is huge but you only need a small part of it).
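The lazy access mentioned above can be sketched as follows. The file is created first so the example is self-contained; its path and contents are assumptions for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "the_filename.h5")  # hypothetical path

# Create a small file to read from.
with h5py.File(path, "w") as f:
    f.create_dataset("array_name", data=np.arange(1000).reshape(100, 10))

with h5py.File(path, "r") as f:
    dset = f["array_name"]   # a dataset handle; no array data is read yet
    shape = dset.shape       # metadata is available without loading the data
    rows = dset[10:12, :3]   # reads only this block and returns a numpy array
```

The slice has to be taken while the file is still open; once the with block ends, the dataset handle is no longer usable, but `rows` remains a normal in-memory array.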
For this question it is way overkill, but if you have a lot of things like this to do, I use a package called SpacePy that makes some of this easier.
See the datamodel.fromHDF5() documentation. This returns a dictionary of arrays stored in a similar way to how h5py handles data.