Sparse matrix in npz format in Python - python

I have a sparse matrix in numpy's .npz format. I know that to read this matrix I need to use scipy.sparse.load_npz(), but I would like to understand its internals.
I see in the preview of the .npz file that it contains the following 5 parts:
data
format
indices
indptr
shape
How can I better understand this file format?

npz is a simple ZIP archive containing NumPy .npy files. A brief overview of the internal structure of the ZIP format can be found here: http://en.wikipedia.org/wiki/ZIP_(file_format)
Here are the docs:
Format of .npz files: https://numpy.org/doc/stable/reference/generated/numpy.savez.html
Format of .npy files: https://numpy.org/doc/stable/reference/generated/numpy.lib.format.html
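To see those five parts concretely, here is a small sketch (the filename is hypothetical) that saves a CSR matrix with scipy.sparse.save_npz and then opens the result both as a plain ZIP archive and with numpy.load:

```python
import zipfile

import numpy as np
import scipy.sparse as sp

# build a small CSR matrix and save it (csr_demo.npz is a hypothetical filename)
m = sp.csr_matrix(np.eye(3))
sp.save_npz('csr_demo.npz', m)

# an .npz file is just a ZIP archive containing one .npy file per array
with zipfile.ZipFile('csr_demo.npz') as z:
    names = sorted(z.namelist())
# names == ['data.npy', 'format.npy', 'indices.npy', 'indptr.npy', 'shape.npy']

# np.load exposes each member under its key
with np.load('csr_demo.npz') as npz:
    indptr = npz['indptr']

# scipy.sparse.load_npz reassembles the five members into a matrix again
m2 = sp.load_npz('csr_demo.npz')
```

The data, indices, and indptr members are the standard CSR arrays; format and shape tell load_npz which sparse class to rebuild and at what size.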

Related

Handling large numpy array in tensorflow with regression output(51 outputs)

I have a very large dataset: a single .npy file containing around 1.5M elements, each a 150x150x3 image. The output has 51 columns (51 outputs). Since the dataset can't fit into memory, how do I load it and use it to fit the model? An efficient way is using TFRecords and tf.data, but I couldn't understand how to do this. I would appreciate the help. Thank you.
One way is to load your NPY file fragment by fragment (to feed your neural network with), rather than loading it into memory all at once. You can use numpy.load as normal and specify the mmap_mode keyword so that the array is kept on disk, and only the necessary bits are loaded into memory upon access (more details here):
numpy.load(file, mmap_mode=None, allow_pickle=False, fix_imports=True, encoding='ASCII')
Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory. NumPy's memmaps are array-like objects. This differs from Python's mmap module, which uses file-like objects.
If you want to know how to create a tfrecords from a numpy array, and then read the tfrecords using the Dataset API, this link provides a good answer.
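As a sketch of the mmap_mode approach (the filename is hypothetical, and the array here is scaled way down from the 1.5M-image case), batches can be sliced out of the on-disk array one at a time:

```python
import numpy as np

# a scaled-down stand-in for the huge array (real shape would be ~(1500000, 150, 150, 3))
data = np.arange(4 * 5 * 5 * 3, dtype=np.float32).reshape(4, 5, 5, 3)
np.save('images_demo.npy', data)  # hypothetical filename

# mmap_mode='r' keeps the array on disk; only the sliced bits are read into RAM
X = np.load('images_demo.npy', mmap_mode='r')

batch_size = 2
batches = []
for start in range(0, len(X), batch_size):
    # np.asarray copies just this slice into memory, e.g. to feed model.train_on_batch
    batches.append(np.asarray(X[start:start + batch_size]))
```

Each iteration touches only one batch worth of bytes on disk, so peak memory stays at roughly one batch regardless of the total file size.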

Python: Load numpy arrays (npys) with specific columns

I have saved arrays as .npy files with sizes around 2 GB. Can I somehow load only specific columns/rows with numpy.load? I did not find a command for that; is there a workaround for that case?
This is not directly possible with .npy files (though numpy.load with mmap_mode='r' lets you slice the on-disk array without reading all of it). For that kind of problem, it is recommended to use .h5 files with the h5py package. You will find an example in this post: h5py: how to read selected rows of an hdf5 file?.
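As a minimal sketch of the h5py approach (file and dataset names are hypothetical), only the requested rows and columns are fetched from disk:

```python
import numpy as np
import h5py

# write a demo dataset (demo.h5 and 'table' are hypothetical names)
arr = np.arange(100).reshape(10, 10)
with h5py.File('demo.h5', 'w') as f:
    f.create_dataset('table', data=arr)

# h5py reads only the requested slices, not the whole (potentially 2 GB) dataset
with h5py.File('demo.h5', 'r') as f:
    rows = f['table'][2:5]      # rows 2..4 only
    col = f['table'][:, 3]      # column 3 only
```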

how to load Matlab's struct (saved with v7.3) in Python

I created a 1x20 struct in MATLAB. This struct has 9 fields. The struct is saved in -v7.3 format because of its size (about 3 GB). One of the fields contains a 4-D matrix, others contain cell arrays, so it is a complex struct.
I would like to know if there is a way to load this struct into Python?
MATLAB v7.3 uses HDF5 storage; scipy.io.loadmat cannot handle that:
MATLAB: Differences between .mat versions
Instead you have to use numpy plus h5py:
How to read a v7.3 mat file via h5py?
how to read Mat v7.3 files in python ?
and a scattering of more recent questions.
Try that, and come back with a new question if you still have problems sorting out the results.
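Since a -v7.3 .mat file is plain HDF5, a sketch of the h5py route looks like this (the file, group, and field names are all hypothetical, and the demo file is fabricated with h5py rather than MATLAB):

```python
import numpy as np
import h5py

# fabricate an HDF5 file shaped like a -v7.3 .mat struct
# ('struct_demo.h5', 'mystruct' and the field names are hypothetical)
with h5py.File('struct_demo.h5', 'w') as f:
    g = f.create_group('mystruct')
    g.create_dataset('field_a', data=np.ones((2, 3)))
    g.create_dataset('field_b', data=np.arange(4.0))

def load_struct_fields(path, name):
    """Read every plain-dataset field of a struct-like group into a dict.

    Real MATLAB data is stored column-major (a .T usually restores the
    expected shape), and cell-array fields appear as HDF5 object
    references that must be dereferenced via f[ref].
    """
    out = {}
    with h5py.File(path, 'r') as f:
        for field, item in f[name].items():
            if isinstance(item, h5py.Dataset):
                out[field] = item[()]
    return out

s = load_struct_fields('struct_demo.h5', 'mystruct')
```

For the cell-array fields in a real file, you would recurse through the references instead of skipping non-dataset items, as the linked questions describe.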

How to save the n-d numpy array data and read it quickly next time?

Here is my question:
I have a 3-D numpy array Data with shape (1000, 100, 100).
I want to save it as a .txt or .csv file; how do I achieve that?
My first attempt was to reshape it into a 1-D array of length 1000*100*100, convert it to a pandas.DataFrame, and then save it as a .csv file.
When I want to load it next time, I reshape it back into a 3-D array.
I think there must be an easier way.
If you need to re-read it quickly into numpy you could just use the pickle module (cPickle in Python 2).
This is going to be much faster than parsing it back from an ASCII dump (though only Python programs will be able to re-read it). As a bonus, with just one instruction you can dump more than a single matrix (i.e. any data structure built with core Python and numpy arrays).
Note that parsing a floating-point value from an ASCII string is a quite complex and slow operation (if implemented correctly down to the ulp).
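A minimal sketch of the pickle round trip (the filename is hypothetical, and the array is scaled down from the (1000, 100, 100) in the question); the shape and dtype come back intact, so no manual reshape is needed:

```python
import pickle

import numpy as np

# a scaled-down stand-in for the (1000, 100, 100) array in the question
data = np.arange(10 * 100 * 100, dtype=float).reshape(10, 100, 100)

# one instruction dumps the whole n-d array, shape and dtype included
with open('data_demo.pkl', 'wb') as f:  # hypothetical filename
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

with open('data_demo.pkl', 'rb') as f:
    restored = pickle.load(f)
```

np.save/np.load would work equally well here and also preserves the shape; pickle's advantage is that a single dump can hold several arrays or a mixed Python/numpy structure.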

how to export HDF5 file to NumPy using H5PY?

I have an existing HDF5 file with three arrays; I want to extract one of the arrays using h5py.
h5py already reads files in as numpy arrays, so just:
with h5py.File('the_filename', 'r') as f:
    my_array = f['array_name'][()]
The [()] means to read the entire array in; if you don't do that, the data is not read yet and you instead get lazy access to sub-parts (very useful when the array is huge but you only need a small part of it).
For this question it is way overkill, but if you have a lot of things like this to do, I use a package called SpacePy that makes some of this easier.
See the datamodel.fromHDF5() documentation; it returns a dictionary of arrays, stored in a similar way to how h5py handles data.
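A small sketch of the lazy-versus-full distinction (file and dataset names are hypothetical): the handle itself reads nothing, a slice reads only that slice, and [()] pulls in the whole array:

```python
import numpy as np
import h5py

# build a demo file with three arrays (file and dataset names are hypothetical)
with h5py.File('three_arrays.h5', 'w') as f:
    for name in ('a', 'b', 'c'):
        f.create_dataset(name, data=np.arange(6).reshape(2, 3))

with h5py.File('three_arrays.h5', 'r') as f:
    dset = f['b']            # lazy Dataset handle; nothing read yet
    first_row = dset[0, :]   # reads only the first row from disk
    whole = dset[()]         # reads the entire array into memory as an ndarray
```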
