Saving numpy array such that it is readily available without loading - python

I have a 20 GB library of images stored as a high-dimensional numpy array. This library allows me to use these images without having to generate them anew each time. Now my problem is that np.load("mylibrary") takes as much time as it would take to generate a couple of those images. Therefore my question is: is there a way to store a numpy array such that it is readily accessible without having to load it?
Edit: I am using PyCharm

I would suggest h5py, which is a Pythonic interface to the HDF5 binary data format.
It lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.
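A minimal sketch of that workflow (the file name, dataset name, and source array below are placeholders): write the library once, then slice it on demand.

import h5py
import numpy as np

image_library = np.random.rand(100, 256, 256)   # stand-in for your existing array

# write the library once; chunking + compression keep the file manageable
with h5py.File('mylibrary.h5', 'w') as f:
    f.create_dataset('images', data=image_library, chunks=True, compression='gzip')

# later: open the file and read only the slices you need
with h5py.File('mylibrary.h5', 'r') as f:
    first_ten = f['images'][:10]     # only these images are read from disk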
You can also use PyTables. It is another HDF5 interface for Python and NumPy.
PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data. You can download PyTables and use it for free. You can access documentation, some examples of use and presentations here.
numpy.memmap is another option. It would, however, be slower than HDF5. Another caveat is that the array should be limited to 2.5 GB.
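For the question as asked, the lightest-weight variant is numpy's own memory mapping of a .npy file; a sketch, assuming the library was saved once with np.save('mylibrary.npy', ...):

import numpy as np

lib = np.load('mylibrary.npy', mmap_mode='r')   # returns almost immediately
img = lib[42]                                    # only this slice is read from disk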

Related

How to store a set of arrays for deep learning not consuming too much memory (Python)?

I'm doing research in which the observations of my dataset are represented by matrices (arrays of numbers, similar to how images for deep learning are represented, but mine are not images) of different shapes.
What I've already tried is to write those arrays as lists in one column of a pandas DataFrame and then save this as a CSV/Excel file. After that I planned to simply load such a file, convert those lists back to arrays of the appropriate shapes, and then convert a set of such arrays to a tensor which I will finally use for training the deep model in Keras.
But this method seems extremely inefficient, because only 1/6 of my dataset already occupies about 6 GB of memory (pandas saved as CSV), which is huge, and I won't be able to load it into RAM (I'm using Google Colab to run my experiments).
So my question is: is there any other way of storing a set of arrays of different shapes which won't occupy so much memory? Maybe I can store tensors directly somehow? Or maybe there are ways to store a pandas DataFrame in some compressed type of file which is not so heavy?
Yes: avoid using CSV/Excel for big datasets. There are tons of data formats out there; for this case I would recommend a compressed format such as pd.DataFrame.to_hdf, pd.DataFrame.to_parquet or pd.DataFrame.to_pickle.
There are even more formats to choose from, and compression options within the functions (for example, to_hdf takes a complevel argument that you can set to 9).
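A short sketch of the HDF5 and Parquet variants, assuming an existing DataFrame (the one built below is a placeholder; to_hdf needs the tables package installed, to_parquet needs pyarrow or fastparquet):

import pandas as pd

df = pd.DataFrame({'a': range(1000), 'b': range(1000)})   # placeholder DataFrame

df.to_hdf('data.h5', key='data', mode='w', complevel=9, complib='blosc')
df.to_parquet('data.parquet', compression='gzip')

# reading back
df_h5 = pd.read_hdf('data.h5', key='data')
df_pq = pd.read_parquet('data.parquet')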
Are you storing purely (or mostly) continuous variables? If so, maybe you could reduce the precision of these variables (e.g., from float64 to float32) if you don't need such an accurate value per data point.
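For example, halving the footprint is a one-liner, assuming float32 precision is enough for your values:

import numpy as np

a = np.random.rand(1000, 1000)        # float64 by default, ~8 MB
a32 = a.astype(np.float32)            # same values at lower precision, ~4 MB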
There are many ways of reducing the size of the data held in memory, and what's written above is just one of them. Maybe you could break the process you've mentioned into smaller chunks (i.e., storage of data, extraction of data) and work on each chunk/stage individually, which will hopefully reduce the overall size of your data!
Otherwise, you could perhaps take advantage of database management systems (SQL or NoSQL depending on which fits best) which might be better, though querying that amount of data might constitute yet another issue.
I'm by no means an expert in this but I'm just explaining more of how I've dealt with excessively large datasets (similar to what you're currently experiencing) in the past, and I'm pretty sure someone here will probably give you a more definitive answer as compared to my 'a little of everything' answer. All the best!

How can I persistently store and efficiently access a very large 2D list in Python?

In Python, I'm reading in a very large 2D grid of data that consists of around 200,000,000 data points in total. Each data point is a tuple of 3 floats. Reading all of this data into a two-dimensional list frequently causes MemoryErrors. To get around this, I would like to be able to read this data into some sort of table on the hard drive that can be efficiently accessed when given a grid coordinate, e.g. harddrive_table.get(300, 42).
So far in my research, I've come across PyTables, which is an implementation of HDF5 and seems like overkill, and the built in shelve library, which uses a dictionary-like method to access saved data, but the keys have to be strings and the performance of converting hundreds of millions of grid coordinates to strings for storage could be too much of a performance hit for my use.
Are there any libraries that allow me to store a 2D table of data on the hard drive with efficient access for a single data point?
This table of data is only needed while the program is running, so I don't care about its interoperability or how it stores the data on the hard drive, as it will be deleted after the program has run.
HDF5 isn't really overkill if it works. In addition to PyTables there's the somewhat simpler h5py.
Numpy lets you mmap a file directly into a numpy array. The values will be stored in the disk file in the minimum-overhead way, with the numpy array shape providing the mapping between array indices and file offsets. mmap uses the same underlying OS mechanisms that power the disk cache to map a disk file into virtual memory, meaning that the whole thing can be loaded into RAM if memory permits, but parts can be flushed to disk (and reloaded later on demand) if it doesn't all fit at once.
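A sketch of that for the grid in the question, assuming the row/column counts are known (the numbers below are placeholders) and each point is three float64 values:

import numpy as np

rows, cols = 20000, 10000   # placeholder dimensions, ~200 million points total
grid = np.memmap('grid.dat', dtype=np.float64, mode='w+', shape=(rows, cols, 3))

# fill it however the data is produced, then look up single points without
# holding the whole table in RAM:
point = grid[300, 42]       # the 3-float tuple stored at grid coordinate (300, 42)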

Slow loading of large NumPy datasets

I notice a long loading time (~10 min) of a .npy file for a 1D numpy array of object data type and with a length of ~10000. Each element in this array is an ordered dictionary (OrderedDict, a dictionary subclass from collections package) with a length ~5000. So, how can I efficiently save and load large NumPy arrays to and from disk? How are large data sets in Python traditionally handled?
Numpy will pickle embedded objects by default (you could avoid that with allow_pickle=False, but it sounds like you may need it), which is slow (see https://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html).
You may want to check Pandas (see http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization) or try to come up with your own file format that avoids pickling of your complex data structures.
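One hedged sketch of that idea, assuming the OrderedDicts share (mostly) the same keys and hold numeric values, so they can be flattened into a table instead of being pickled object by object:

import pandas as pd

# records is assumed to stand in for your object array of OrderedDicts
records = [{'a': 1.0, 'b': 2.0}, {'a': 3.0, 'b': 4.0}]    # toy stand-in

df = pd.DataFrame(records)                                 # one row per dict, one column per key
df.to_hdf('records.h5', key='records', mode='w', complevel=9)

# loading this back avoids unpickling thousands of large Python objects
df = pd.read_hdf('records.h5', key='records')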
Saving and loading large datasets to/from disk will always be a costly operation. One possible optimization is using memory mapping to disk and working directly on the array (if this is compatible with your application), especially if you're only interested in a small part of the dataset. This is what numpy.memmap does.
For example:
import numpy as np
a=np.memmap('largeArray.dat',dtype=np.int32,mode='w+',shape=(100000,))
This will create a numpy array 'a' of 100,000 int32 values. It can be handled like any "normal" numpy array. It also creates the corresponding file 'largeArray.dat' on your disk that will contain the data in 'a'. Synchronization between 'a' and 'largeArray.dat' is handled by numpy, and how much of the array resides in RAM at any time depends on your RAM size.
More info here

Efficiently Reading Large Files with ATpy and numpy?

I've looked all over for an answer to this one, but nothing really seems to fit the bill. I've got very large files that I'm trying to read with ATpy, and the data comes in the form of numpy arrays. For smaller files the following code has been sufficient:
sat = atpy.Table('satellite_data.tbl')
From there I build up a number of variables that I have to manipulate later for plotting purposes. It's lots of these kinds of operations:
w1 = np.array([sat['w1_column']])
w2 = np.array([sat['w2_column']])
w3 = np.array([sat['w3_column']])
colorw1w2 = w1 - w2 #just subtracting w2 values from w1 values for each element
colorw1w3 = w1 - w3
etc.
But for very large files the computer can't handle it. I think all the data is getting stored in memory before parsing begins, and that's not feasible for 2GB files. So, what can I use instead to handle these large files?
I've seen lots of posts where people are breaking up the data into chunks and using for loops to iterate over each line, but I don't think that's going to work for me here given the nature of these files, and the kinds of operations I need to do on these arrays. I can't just do a single operation on every line of the file, because each line contains a number of parameters that are assigned to columns, and in some cases I need to do multiple operations with figures from a single column.
Honestly I don't really understand everything going on behind the scenes with ATpy and numpy. I'm new to Python, so I appreciate answers that spell it out clearly (i.e. not relying on lots of implicit coding knowledge). There has to be a clean way of parsing this, but I'm not finding it. Thanks.
For very large arrays (larger than your memory capacity) you can use pytables which stores arrays on disk in some clever ways (using the HDF5 format) so that manipulations can be done on them without loading the entire array into memory at once. Then, you won't have to manually break up your datasets or manipulate them one line at a time.
I know nothing about ATpy so you might be better off asking on an ATpy mailing list or at least some astronomy python users mailing list, as it's possible that ATpy has another solution built in.
From the PyTables website:
PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data.
PyTables is built on top of the HDF5 library, using the Python language and the NumPy package.
... a fast, yet extremely easy to use tool for interactively browsing, processing and searching very large amounts of data. One important feature of PyTables is that it optimizes memory and disk resources so that data takes much less space...
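A minimal PyTables sketch of that idea (the file, array, and column names are placeholders): chunks are appended to an extendable on-disk array and later sliced without loading everything into memory.

import numpy as np
import tables

# build the on-disk array piece by piece
with tables.open_file('satellite.h5', mode='w') as h5:
    w1 = h5.create_earray(h5.root, 'w1', atom=tables.Float64Atom(), shape=(0,))
    w1.append(np.arange(1000.0))        # append each chunk as it is parsed

# later: slice straight from disk
with tables.open_file('satellite.h5', mode='r') as h5:
    chunk = h5.root.w1[100:200]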
Look into using pandas. It's built for this kind of work. But the data files need to be stored in a well structured binary format like hdf5 to get good performance with any solution.

how to read a binary file into variables in python

I am working with information from big models, which means I have a lot of big ASCII files with two float columns (let's say X and Y). However, whenever I have to read these files it takes a long time, so I thought maybe converting them to binary files would make the reading process much faster.
I converted my ASCII files into binary files using the uu.encode(ascii_file, binary_file) command, and it worked quite well (actually, I tested the decode part and recovered the same files).
My question is: is there anyway to read the binary files directly into python and get the data into two variables (x and y)?
Thanks!
You didn't specify how your float columns are represented in Python. The cPickle module is a fast general solution, with the drawback that it creates files readable only from Python, and that it should never be allowed to read untrusted data (received from the network). It is likely to just work with all regular datatypes, including numpy arrays.
If you can use numpy and store your data in numpy arrays, look into numpy.save and numpy.savetxt and the corresponding loading functions, which should offer performance superior to manually extracting the data.
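A sketch of the numpy route, assuming x and y can be made into numpy arrays (the random data below is a placeholder):

import numpy as np

x = np.random.rand(1000)                      # placeholder columns
y = np.random.rand(1000)

np.save('data.npy', np.column_stack((x, y)))  # one binary file holding both columns

data = np.load('data.npy')                    # fast binary read
x, y = data[:, 0], data[:, 1]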
array.array also has methods for writing array data to file, with the drawback that the array data is written in the native format and cannot be read from a different architecture.
Check out python's struct module. It's probably what you'd want to be using for reading and writing your data.
Instead of the suggested struct module, if your model is just floats/doubles (coordinates), you should look at the array module; it should be much faster than the equivalent operations in struct. The downside is that the collection is homogeneous, so you need to have the first values at odd indexes and the second ones at even indexes, or store them sequentially.
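A sketch of the array-module approach, assuming the binary file holds plain native doubles stored sequentially (all X values first, then all Y values) and the number of points n is known:

from array import array

n = 1000                         # number of points, assumed known in advance
values = array('d')              # 'd' = C double
with open('data.bin', 'rb') as f:
    values.fromfile(f, 2 * n)    # read 2*n doubles in one call
x, y = values[:n], values[n:]    # split the homogeneous block into the two columns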
