Load a highly nested MAT file into Python

I'm trying to load a MAT file that is a cell array of structs. Each of those structs has many fields, some of which are themselves cell arrays.
A typical call would be:
myCell{1}.myStructField{1}.myStructField
How do I load such a nested structure into Python?
Thanks for your thoughts.

scipy.io.loadmat will load the MAT file if it's pre-v7.3; you can then access it like matfile['myCell'][0]['myStructField'][0]['myStructField'].
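A minimal sketch (the filename is hypothetical, and the exact unwrapping indices depend on the shapes MATLAB used; squeeze_me and struct_as_record are real loadmat options that make nested access less painful):

from scipy.io import loadmat

# squeeze_me collapses 1x1 wrappers; struct_as_record=False gives
# attribute-style access to struct fields instead of record arrays
matfile = loadmat('data.mat', squeeze_me=True, struct_as_record=False)
inner = matfile['myCell'][0].myStructField[0].myStructField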
If it's v7.3 or higher, you can use h5py; after opening the file, I think access will also look like f['myCell'][0]['myStructField'][0]['myStructField'], though you'll need to watch out for transposed matrices because of MATLAB's column-major versus numpy's row-major ordering.
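In practice it's usually not quite that direct with h5py, because v7.3 stores cell contents as HDF5 object references that you dereference through the file. A rough sketch (filename and exact layout are assumptions):

import h5py

with h5py.File('data.mat', 'r') as f:
    cell = f['myCell']                 # cell elements come back as object references
    struct = f[cell[0][0]]             # dereference the first cell element (a group)
    refs = struct['myStructField']     # a cell-valued field is again references
    inner = f[refs[0][0]]['myStructField'][()]   # dereference, then read the data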

Related

Python: Can I write to a file without loading its contents in RAM?

Got a big data-set that I want to shuffle. The entire set won't fit into RAM so it would be good if I could open several files (e.g. hdf5, numpy) simultaneously, loop through my data chronologically and randomly assign each data-point to one of the piles (then afterwards shuffle each pile).
I'm really inexperienced with working with data in Python, so I'm not sure if it's possible to write to a file without holding the rest of its contents in RAM (I've been using np.save and np.savez with little success).
Is this possible in h5py or numpy and, if so, how could I do it?
Memory-mapped files will allow for what you want. They create a numpy array that leaves the data on disk, loading it only as needed. The complete manual page is in the numpy.memmap documentation. However, the easiest way to use them is to pass mmap_mode='r+' or mmap_mode='w+' in the call to np.load, which leaves the file on disk (see the numpy.load documentation).
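A minimal sketch, assuming your data already lives in a hypothetical big.npy file:

import numpy as np

arr = np.load('big.npy', mmap_mode='r')   # nothing is read into RAM yet
chunk = arr[1000:2000]                    # only this slice is pulled from disk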
I'd suggest using advanced indexing. If you have data in a one-dimensional array arr, you can index it with a list, so arr[[0, 3, 5]] will give you the 0th, 3rd, and 5th elements of arr. That will make selecting the shuffled versions much easier. Since this would otherwise overwrite the data, you'll need to open the files on disk read-only and create copies (using mmap_mode='w+') to put the shuffled data in, as in the sketch below.
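Putting the two together, a sketch of an out-of-core shuffle (filenames are hypothetical; np.lib.format.open_memmap is one way to create the fresh on-disk copy described above):

import numpy as np

src = np.load('data.npy', mmap_mode='r')          # read-only source on disk
dst = np.lib.format.open_memmap('shuffled.npy', mode='w+',
                                dtype=src.dtype, shape=src.shape)
order = np.random.permutation(len(src))           # shuffled row indices
for i, j in enumerate(order):                     # copy row by row, low RAM use
    dst[i] = src[j]
dst.flush()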

how to load Matlab's struct (saved with v7.3) in Python

I created a 1x20 struct in MATLAB. The struct has 9 fields and is saved in -v7.3 format because of its size (about 3 GB). One of the fields contains a 4-D matrix, others contain cell arrays, so it is a complex struct.
I would like to know if there is a way to load this struct into Python?
MATLAB v7.3 uses HDF5 storage; scipy.io.loadmat cannot handle that:
MATLAB: Differences between .mat versions
Instead you have to use numpy plus h5py:
How to read a v7.3 mat file via h5py?
how to read Mat v7.3 files in python ?
and a scattering of more recent questions.
Try those, and come back with a new question if you still have problems sorting out the results.
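A minimal sketch of what that looks like (the variable and field names here are assumptions; in v7.3 files a struct appears as an HDF5 group, and cell-array fields come back as object references that need dereferencing):

import h5py
import numpy as np

with h5py.File('mystruct.mat', 'r') as f:
    s = f['myStruct']                    # hypothetical variable name
    print(list(s.keys()))                # the 9 field names
    # for a 1x20 struct array each field is typically one reference per element
    refs = s['someField']
    first = np.array(f[refs[0][0]])      # that field's data for element 1
    # note: MATLAB is column-major, so arrays may come back transposed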

How to save the n-d numpy array data and read it quickly next time?

Here is my question:
I have a 3-D numpy array Data with shape (1000, 100, 100).
I want to save it as a .txt or .csv file. How can I achieve that?
My first attempt was to reshape it into a 1-D array of length 1000*100*100, convert it to a pandas.DataFrame, and then save it as a .csv file.
When I want to load it next time, I reshape it back to a 3-D array.
I think there must be an easier way.
If you need to re-read it quickly into numpy, you could just use the cPickle module.
This is going to be much faster than parsing it back from an ASCII dump (though only Python programs will be able to re-read it). As a bonus, with just one instruction you can dump more than a single matrix (i.e. any data structure built from core Python objects and numpy arrays).
Note that parsing a floating-point value from an ASCII string is a fairly complex and slow operation (if implemented correctly down to the ulp).
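A minimal sketch (cPickle is Python 2; in Python 3 the same module is just pickle, and np.save/np.load is an equally fast alternative for a single array):

import pickle
import numpy as np

data = np.random.rand(1000, 100, 100)

with open('data.pkl', 'wb') as f:                        # binary, not ASCII
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

with open('data.pkl', 'rb') as f:
    data2 = pickle.load(f)                               # shape is preserved

assert data2.shape == (1000, 100, 100)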

how to read a binary file into variables in python

I am working with information from big models, which means I have a lot of big ASCII files with two float columns (let's say X and Y). However, reading these files takes a long time, so I thought converting them to binary files might make the reading process much faster.
I converted my ASCII files into binary files using the uu.encode(ascii_file, binary_file) command, and it worked quite well (I actually tested the decode part and recovered the same files).
My question is: is there any way to read the binary files directly into Python and get the data into two variables (x and y)?
Thanks!
You didn't specify how your float columns are represented in Python. The cPickle module is a fast general solution, with the drawback that it creates files readable only from Python, and that it should never be allowed to read untrusted data (received from the network). It is likely to just work with all regular datatypes, including numpy arrays.
If you can use numpy and store your data in numpy arrays, look into numpy.save and numpy.savetxt and the corresponding loading functions, which should offer performance superior to manually extracting the data.
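For instance, a round trip with numpy.save (filenames and sizes are hypothetical):

import numpy as np

x = np.random.rand(1000000)              # stand-ins for the two float columns
y = np.random.rand(1000000)

np.save('xy.npy', np.column_stack((x, y)))    # compact binary format

xy = np.load('xy.npy')
x2, y2 = xy[:, 0], xy[:, 1]              # recover the two columns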
array.array also has methods for writing array data to file, with the drawback that the data is written in the machine's native format and cannot portably be read on an architecture with a different byte order.
Check out python's struct module. It's probably what you'd want to be using for reading and writing your data.
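A sketch of that approach, assuming the file holds raw little-endian double pairs (that format is an assumption, not a given):

import struct

with open('data.bin', 'rb') as f:
    raw = f.read()
# '<dd' = one little-endian (x, y) pair of doubles per record
pairs = list(struct.iter_unpack('<dd', raw))
x = [p[0] for p in pairs]
y = [p[1] for p in pairs]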
Instead of the suggested struct module, if your model data is just floats/doubles (coordinates), have a look at the array module; it should be much faster than per-item operations with the struct module. The downside is that the collection is homogeneous: you would need to interleave the two columns (e.g. X values at even indices, Y at odd), or store them sequentially.
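A sketch of the array-module variant, under the same raw-doubles assumption, with the two columns interleaved:

import os
from array import array

a = array('d')                                    # homogeneous doubles
n = os.path.getsize('data.bin') // a.itemsize
with open('data.bin', 'rb') as f:
    a.fromfile(f, n)
x, y = a[0::2], a[1::2]                           # de-interleave X and Y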

how to export HDF5 file to NumPy using H5PY?

I have an existing HDF5 file with three arrays, and I want to extract one of them using h5py.
h5py already reads files in as numpy arrays, so just:
with h5py.File('the_filename', 'r') as f:
    my_array = f['array_name'][()]
The [()] means read the entire array in; if you don't do that, you get lazy access to sub-parts instead of the data itself (very useful when the array is huge but you only need a small part of it).
For this question it is way overkill, but if you have a lot of tasks like this to do, the SpacePy package makes some of them easier.
Its datamodel.fromHDF5() function (see the SpacePy documentation) returns a dictionary of arrays, stored in a similar way to how h5py handles data.
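Roughly like this, going from the SpacePy docs ('the_filename' and 'array_name' are the same placeholders as above):

from spacepy import datamodel

data = datamodel.fromHDF5('the_filename')   # dict-like collection of arrays
my_array = data['array_name']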
