I have a very large dataset stored as a single .npy file that contains around 1.5 million elements, each a 150x150x3 image. The output has 51 columns (51 outputs). Since the dataset can't fit into memory, how do I load it and use it to fit the model? An efficient way is to use TFRecords and tf.data, but I couldn't understand how to do this. I would appreciate the help. Thank you.
One way is to load your .npy file fragment by fragment (to feed your neural network) rather than loading it all into memory at once. You can use numpy.load as normal and specify the mmap_mode keyword so that the array is kept on disk and only the necessary parts are loaded into memory upon access (more details here):
numpy.load(file, mmap_mode=None, allow_pickle=False, fix_imports=True, encoding='ASCII')
Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory. NumPy’s memmap’s are array-like objects. This differs from Python’s mmap module, which uses file-like objects.
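As a rough sketch of the mmap_mode approach described above, the generator below yields one batch at a time from memory-mapped .npy files. The file names, the stand-in dataset size, the batch size, the separate labels file, and the iter_batches helper are all assumptions made for illustration; they are not from the original question.

```python
import os
import tempfile

import numpy as np

# Build a small stand-in dataset; in practice the .npy files already exist.
tmp_dir = tempfile.mkdtemp()
x_path = os.path.join(tmp_dir, "images.npy")  # hypothetical file name
y_path = os.path.join(tmp_dir, "labels.npy")  # hypothetical file name
np.save(x_path, np.random.randint(0, 256, size=(40, 150, 150, 3), dtype=np.uint8))
np.save(y_path, np.random.rand(40, 51).astype(np.float32))


def iter_batches(x_path, y_path, batch_size):
    """Yield (images, labels) batches, copying one batch into RAM at a time."""
    x = np.load(x_path, mmap_mode="r")  # array stays on disk
    y = np.load(y_path, mmap_mode="r")
    for start in range(0, x.shape[0], batch_size):
        stop = start + batch_size
        # Converting the slice to float32 copies only that slice into memory.
        images = np.asarray(x[start:stop], dtype=np.float32) / 255.0
        labels = np.array(y[start:stop])  # np.array makes an in-memory copy
        yield images, labels


batches = list(iter_batches(x_path, y_path, batch_size=16))
print(len(batches), batches[0][0].shape, batches[-1][0].shape)
```

In a real training loop you would iterate over the generator instead of collecting it into a list, so that only one batch is resident in memory at any moment.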
If you want to know how to create a tfrecords from a numpy array, and then read the tfrecords using the Dataset API, this link provides a good answer.
Related
In Python, I'm reading in a very large 2D grid of data that consists of around 200,000,000 data points in total. Each data point is a tuple of 3 floats. Reading all of this data into a two-dimensional list frequently causes memory errors. To get around this, I would like to read the data into some sort of table on the hard drive that can be accessed efficiently when given a grid coordinate, e.g. harddrive_table.get(300, 42).
So far in my research, I've come across PyTables, which is an implementation of HDF5 and seems like overkill, and the built-in shelve library, which uses a dictionary-like method to access saved data. With shelve, however, the keys have to be strings, and converting hundreds of millions of grid coordinates to strings for storage could be too much of a performance hit for my use.
Are there any libraries that allow me to store a 2D table of data on the hard drive with efficient access for a single data point?
This table of data is only needed while the program is running, so I don't care about its interoperability or how it stores the data on the hard drive, as it will be deleted after the program has run.
HDF5 isn't really overkill if it works. In addition to PyTables there's the somewhat simpler h5py.
Numpy lets you mmap a file directly into a numpy array. The values will be stored in the disk file in the minimum-overhead way, with the numpy array shape providing the mapping between array indices and file offsets. mmap uses the same underlying OS mechanisms that power the disk cache to map a disk file into virtual memory, meaning that the whole thing can be loaded into RAM if memory permits, but parts can be flushed to disk (and reloaded later on demand) if it doesn't all fit at once.
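A minimal sketch of that idea for a 2D grid whose cells each hold 3 floats, using numpy.memmap. The grid size, the scratch file name, and the choice of float32 are assumptions for illustration only.

```python
import os
import tempfile

import numpy as np

rows, cols = 500, 400  # stand-in grid size; the real grid would be much larger
path = os.path.join(tempfile.mkdtemp(), "grid.dat")  # hypothetical scratch file

# mode="w+" creates the backing file; each grid cell stores 3 float32 values.
grid = np.memmap(path, dtype=np.float32, mode="w+", shape=(rows, cols, 3))

grid[300, 42] = (1.0, 2.5, -3.0)  # write a single data point
point = grid[300, 42]             # read it back by grid coordinate
grid.flush()                      # push pending changes to the file

print(tuple(point), os.path.getsize(path))
```

Indexing with a grid coordinate plays the role of the harddrive_table.get(300, 42) call from the question; the operating system decides which parts of the file are actually held in RAM.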
I have saved arrays as .npy files with sizes around 2 GB. Can I somehow load only specific columns or rows with numpy.load? I did not find a command for that. Is there a workaround for this case?
numpy.load has no option for reading only selected rows or columns of a .npy file, although passing mmap_mode keeps the array on disk so that slicing it reads only the parts you access. For this kind of problem, .h5 files with the h5py package are often recommended. You will find an example in this post: h5py: how to read selected rows of an hdf5 file?.
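As a possible workaround that stays with .npy, memory-mapping the file and slicing the result pulls only the selected rows or columns into memory. The sketch below uses a small stand-in array and a hypothetical file name.

```python
import os
import tempfile

import numpy as np

# Stand-in for the large saved array (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "table.npy")
np.save(path, np.arange(12_000, dtype=np.int64).reshape(1000, 12))

arr = np.load(path, mmap_mode="r")  # nothing is read into RAM yet

rows = np.array(arr[10:13])      # copy rows 10, 11 and 12 into memory
cols = np.array(arr[:, [0, 5]])  # copy columns 0 and 5 for every row

print(rows.shape, cols.shape)
```

Because the array is stored row by row, selecting rows touches a contiguous part of the file, whereas selecting columns still has to visit pages spread across the whole file.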
I'm dealing with a huge amount of data in Tensorflow.
One way is to define placeholder and then read my data by my own defined functions outside of the graph, such as a queue and feed a batch every time into the placeholders.
Another way is to use recorder related built-in classes in Tensorflow to directly read data as tensors.
I searched but failed to find any relevant comparison between the two. Does anyone have an idea of their advantages and disadvantages, especially regarding efficiency? Which one do you prefer when you use TensorFlow?
The different methods of reading data in Tensorflow are compared and discussed here with more comparison here
TFRecord lets you read data in chunks, so you can handle data that exceeds RAM capacity. It can also be arranged so that the data is read in a separate thread using tf.train.Coordinator and tf.train.start_queue_runners. More information can be found here
I notice a long loading time (~10 min) of a .npy file for a 1D numpy array of object data type and with a length of ~10000. Each element in this array is an ordered dictionary (OrderedDict, a dictionary subclass from collections package) with a length ~5000. So, how can I efficiently save and load large NumPy arrays to and from disk? How are large data sets in Python traditionally handled?
NumPy will pickle embedded objects by default (which you could avoid with allow_pickle=False, but it sounds like you may need it), and pickling is slow (see https://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html).
You may want to check Pandas (see http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization) or try to come up with your own file format that avoids pickling of your complex data structures.
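One way to avoid pickling, sketched under the assumption that every dictionary shares the same keys and holds only numeric values (the question does not confirm this), is to store the keys once and the values as a plain 2D array. The record contents and the file name below are made up for illustration.

```python
import os
import tempfile
from collections import OrderedDict

import numpy as np

# Stand-in records: 4 ordered dictionaries with identical keys and float values.
records = [
    OrderedDict((f"k{j}", float(i * 10 + j)) for j in range(5))
    for i in range(4)
]

key_list = list(records[0].keys())
keys = np.array(key_list)  # fixed-width string array, no pickling required
values = np.array([[rec[k] for k in key_list] for rec in records], dtype=np.float64)

path = os.path.join(tempfile.mkdtemp(), "records.npz")  # hypothetical file name
np.savez(path, keys=keys, values=values)

# allow_pickle=False confirms that no object arrays are involved.
with np.load(path, allow_pickle=False) as data:
    loaded_keys = data["keys"].tolist()
    loaded_values = data["values"]

restored = [OrderedDict(zip(loaded_keys, row.tolist())) for row in loaded_values]
print(restored == records)
```

If the dictionaries have differing keys or non-numeric values, this layout does not apply directly and a different format would be needed.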
Saving and loading large datasets to/from disk will always be a costly operation. One possible optimization is using memory mapping to disk and working directly on the array (if this is compatible with your application), especially if you're only interested in a small part of the dataset. This is what numpy.memmap does.
For example:
import numpy as np
a=np.memmap('largeArray.dat',dtype=np.int32,mode='w+',shape=(100000,))
This creates a NumPy array 'a' of 100000 int32 values, which can be handled like any "normal" NumPy array. It also creates the corresponding file 'largeArray.dat' on your disk, which holds the data in 'a'. Synchronization between 'a' and 'largeArray.dat' is handled by NumPy and depends on your RAM size.
More info here
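A short sketch of the full round trip with numpy.memmap: write some values, flush them, and reopen the file later. The temporary directory is an assumption added so the example does not leave files in the working directory.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "largeArray.dat")

# Create the file-backed array and write into it like a normal array.
a = np.memmap(path, dtype=np.int32, mode="w+", shape=(100000,))
a[:10] = np.arange(10)
a.flush()  # make sure the changes reach the file
del a      # drop the mapping

# A raw memmap file has no header, so dtype and shape must be given again.
b = np.memmap(path, dtype=np.int32, mode="r", shape=(100000,))
first_ten = b[:10].tolist()
print(first_ten)
```

Unlike .npy files, the raw file stores no metadata, so the dtype and shape have to be recorded somewhere else if the file is meant to outlive the program.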
I have a 20 GB library of images stored as a high-dimensional NumPy array. This library allows me to use these images without having to generate them anew each time. My problem is that np.load("mylibrary") takes as much time as it would take to generate a couple of those images. Therefore my question is: is there a way to store a NumPy array such that it is readily accessible without having to load it?
Edit: I am using PyCharm
I would suggest h5py which is a Pythonic interface to the HDF5 binary data format.
It lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.
You can also use PyTables. It is another HDF5 interface for Python and NumPy.
PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data. You can download PyTables and use it for free. You can access documentation, some examples of use and presentations here.
numpy.memmap is another option. It may, however, be slower than HDF5. Another limitation is that memory-mapped files cannot be larger than 2 GB on 32-bit systems.