I have an existing h5py file that I downloaded which is ~18G in size. It has a number of nested datasets within it:
h5f = h5py.File('input.h5', 'r')
data = h5f['data']
latlong_data = data['lat_long'].value
I want to be able to do some basic min/max scaling of the numerical data within latlong, so I want to put it in its own h5py file for easier use and lower memory usage.
However, when I try to write it out to its own file:
out = h5py.File('latlong_only.h5', 'w')
out.create_dataset('latlong', data=latlong_data)
out.close()
The output file is incredibly large. It's still not done writing to disk and is ~85GB in space. Why is the data being written to the new file not compressed?
It could be that h5f['data/lat_long'] is using compression filters (and you aren't). To check the original dataset's compression settings, use this line:
print(h5f['data/lat_long'].compression, h5f['data/lat_long'].compression_opts)
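If you do write the array yourself, compression has to be requested explicitly. Here is a minimal sketch, assuming gzip is acceptable; the synthetic array and the temporary output path are stand-ins for the real data, not part of the original question:

```python
import os
import tempfile

import h5py
import numpy as np

# Synthetic stand-in for the real lat/long array (assumption: 2-D float data).
latlong_data = np.zeros((1000, 2), dtype="f8")

out_path = os.path.join(tempfile.mkdtemp(), "latlong_only.h5")
with h5py.File(out_path, "w") as out:
    # compression and compression_opts are create_dataset keywords;
    # h5py picks a chunk shape automatically when compression is enabled.
    out.create_dataset(
        "latlong", data=latlong_data, compression="gzip", compression_opts=4
    )

# Confirm that the filter was actually applied to the new dataset.
with h5py.File(out_path, "r") as check:
    written_compression = check["latlong"].compression
    print(written_compression, check["latlong"].compression_opts)
```

The compression level (4 here) is an arbitrary choice for the sketch; higher levels trade write speed for a smaller file.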
After writing my answer, it occurred to me that you don't need to copy the data to another file to reduce the memory footprint. Your code reads the whole dataset into an array, which is unnecessary in most use cases. An h5py dataset object behaves much like a NumPy array: use ds = h5f1['data/lat_long'] to create a dataset object (instead of an array) and index or slice it as you would an array. FYI, .value is a deprecated way to return the dataset as an array (it was removed in h5py 3.0); use arr = h5f1['data/lat_long'][()] instead. Loading the dataset into an array also requires more memory than using an h5py dataset object, which can be an issue with large datasets.
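To illustrate working on the dataset object directly, here is a rough sketch of min/max scaling that reads only a block of rows at a time. The helper name dataset_min_max, the block size, and the small demo file are assumptions made for the example:

```python
import os
import tempfile

import h5py
import numpy as np


def dataset_min_max(ds, rows_per_block=100_000):
    """Return (min, max) of a dataset, reading one block of rows at a time."""
    lo, hi = np.inf, -np.inf
    for start in range(0, ds.shape[0], rows_per_block):
        block = ds[start:start + rows_per_block]  # only this slice is loaded
        lo = min(lo, block.min())
        hi = max(hi, block.max())
    return float(lo), float(hi)


# Small demo file standing in for the 18 GB input.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(demo_path, "w") as f:
    f.create_dataset("data/lat_long", data=np.arange(20, dtype="f8").reshape(10, 2))

with h5py.File(demo_path, "r") as f:
    ds = f["data/lat_long"]
    lo, hi = dataset_min_max(ds, rows_per_block=4)
    # Scale one slice; the rest of the dataset stays on disk.
    first_scaled = (ds[:4] - lo) / (hi - lo)
```

Peak memory is then bounded by the block size rather than by the size of the dataset.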
There are other ways to access the data. My suggestion to use dataset objects is one way. Your method (extracting data to a new file) is another. I am not fond of that approach because you now have 2 copies of the data, which is a bookkeeping nightmare. Another alternative is to create external links from the new file to the existing 18GB file. That way you have a small file that links to the big file (and no duplicate data). I describe that method in this post: "How can I combine multiple .h5 file?" (Method 1: Create External Links).
If you still want to copy the data, here is what I would do. Your code reads the dataset into an array and then writes the array to the new file (uncompressed). Instead, copy the dataset using h5py's group .copy() method; it will retain compression settings and attributes.
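A minimal sketch of the external-link idea, using h5py.ExternalLink. The file names, the link name 'latlong', and the tiny stand-in dataset are assumptions for illustration only:

```python
import os
import tempfile

import h5py
import numpy as np

workdir = tempfile.mkdtemp()
big_path = os.path.join(workdir, "input.h5")
link_path = os.path.join(workdir, "latlong_link.h5")

# Stand-in for the existing large file.
with h5py.File(big_path, "w") as big:
    big.create_dataset("data/lat_long", data=np.ones((5, 2)))

# The new file stores only a link, not a copy of the data.
with h5py.File(link_path, "w") as small:
    small["latlong"] = h5py.ExternalLink(big_path, "/data/lat_long")

# Reading through the link transparently opens the big file.
with h5py.File(link_path, "r") as small:
    linked_shape = small["latlong"].shape
```

The link records the file name it was given, so it stops resolving if the big file is later moved or renamed.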
See below:
with h5py.File('input.h5', 'r') as h5f1, \
     h5py.File('latlong_only.h5', 'w') as h5f2:
    h5f1.copy(h5f1['data/lat_long'], h5f2, 'latlong')
Is it possible to read binary MATLAB .mat files in Python?
I've seen that SciPy has alleged support for reading .mat files, but I'm unsuccessful with it. I installed SciPy version 0.7.0, and I can't find the loadmat() method.
An import is required, import scipy.io...
import scipy.io
mat = scipy.io.loadmat('file.mat')
Neither scipy.io.savemat nor scipy.io.loadmat works for MATLAB version 7.3 files. The good part is that MATLAB version 7.3 files are HDF5 files, so they can be read with a number of HDF5 tools and the contents converted to NumPy arrays.
For Python, you will need the h5py extension, which requires HDF5 on your system.
import numpy as np
import h5py
f = h5py.File('somefile.mat','r')
data = f.get('data/variable1')
data = np.array(data) # For converting to a NumPy array
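One detail to keep in mind: MATLAB stores arrays in column-major order, so a variable read through h5py shows up with its dimensions reversed. A small sketch of a helper that transposes the result back; read_v73_variable is a hypothetical name, and the demo file is a plain HDF5 file written by h5py as a stand-in, not a real MATLAB file:

```python
import os
import tempfile

import h5py
import numpy as np


def read_v73_variable(path, name):
    """Read one numeric variable and transpose it back to MATLAB's orientation."""
    with h5py.File(path, "r") as f:
        return np.array(f[name]).T


# Stand-in file: stored as 3x2, which corresponds to a 2x3 matrix in MATLAB.
demo_path = os.path.join(tempfile.mkdtemp(), "somefile.mat")
with h5py.File(demo_path, "w") as f:
    f.create_dataset("variable1", data=np.zeros((3, 2)))

arr = read_v73_variable(demo_path, "variable1")
print(arr.shape)
```

This only covers plain numeric arrays; cell arrays and structs are stored as HDF5 references and groups and need extra handling.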
First save the .mat file as:
save('test.mat', '-v7')
After that, in Python, use the usual loadmat function:
import scipy.io as sio
test = sio.loadmat('test.mat')
There is a nice package called mat4py which can easily be installed using
pip install mat4py
It is straightforward to use (from the website):
Load data from a MAT-file
The function loadmat loads all variables stored in the MAT-file into a simple Python data structure, using only Python’s dict and list objects. Numeric and cell arrays are converted to row-ordered nested lists. Arrays are squeezed to eliminate arrays with only one element. The resulting data structure is composed of simple types that are compatible with the JSON format.
Example: Load a MAT-file into a Python data structure:
from mat4py import loadmat
data = loadmat('datafile.mat')
The variable data is a dict with the variables and values contained in the MAT-file.
Save a Python data structure to a MAT-file
Python data can be saved to a MAT-file, with the function savemat. Data has to be structured in the same way as for loadmat, i.e. it should be composed of simple data types, like dict, list, str, int, and float.
Example: Save a Python data structure to a MAT-file:
from mat4py import savemat
savemat('datafile.mat', data)
The parameter data shall be a dict with the variables.
Having MATLAB 2014b or newer installed, the MATLAB engine for Python could be used:
import matlab.engine
eng = matlab.engine.start_matlab()
content = eng.load("example.mat", nargout=1)
Reading the file
import scipy.io
mat = scipy.io.loadmat(file_name)
Inspecting the type of MAT variable
print(type(mat))
#OUTPUT - <class 'dict'>
The keys inside the dictionary are MATLAB variables, and the values are the objects assigned to those variables.
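The dictionary also contains a few metadata entries whose names start with a double underscore. A short sketch of filtering them out, using a small file written with savemat as a stand-in (the variable names speed and label are made up for the example):

```python
import os
import tempfile

import numpy as np
import scipy.io

# Write a small demo file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "demo.mat")
scipy.io.savemat(path, {"speed": np.array([1.0, 2.0, 3.0]), "label": "run1"})

mat = scipy.io.loadmat(path)
# Drop __header__, __version__ and __globals__; keep the MATLAB variables.
variables = {k: v for k, v in mat.items() if not k.startswith("__")}
print(sorted(variables))

# MATLAB arrays are at least 2-D; squeeze_me=True drops the singleton axes.
squeezed = scipy.io.loadmat(path, squeeze_me=True)
print(variables["speed"].shape, squeezed["speed"].shape)
```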
There is a great library for this task called: pymatreader.
Just do as follows:
Install the package: pip install pymatreader
Import the relevant function of this package: from pymatreader import read_mat
Use the function to read the matlab struct: data = read_mat('matlab_struct.mat')
Use data.keys() to locate where the data is actually stored.
The keys will usually look like: dict_keys(['__header__', '__version__', '__globals__', 'data_opp']), where data_opp is the actual key that stores the data. The name of this key can of course differ between files.
Last step - Create your dataframe: my_df = pd.DataFrame(data['data_opp'])
That's it :)
There is also the MATLAB Engine for Python by MathWorks itself. If you have MATLAB, this might be worth considering (I haven't tried it myself but it has a lot more functionality than just reading MATLAB files). However, I don't know if it is allowed to distribute it to other users (it is probably not a problem if those persons have MATLAB. Otherwise, maybe NumPy is the right way to go?).
Also, if you want to do all the basics yourself, MathWorks provides detailed documentation on the structure of the file format (if the link changes, try to google for matfile_format.pdf or its title, MAT-FILE Format). It's not as complicated as I personally thought, but obviously, this is not the easiest way to go. It also depends on how many features of the .mat-files you want to support.
I've written a "small" (about 700 lines) Python script which can read some basic .mat-files. I'm neither a Python expert nor a beginner and it took me about two days to write it (using the MathWorks documentation linked above). I've learned a lot of new stuff and it was quite fun (most of the time). As I've written the Python script at work, I'm afraid I cannot publish it... But I can give some advice here:
First read the documentation.
Use a hex editor (such as HxD) and look into a reference .mat-file you want to parse.
Try to figure out the meaning of each byte by saving the bytes to a .txt file and annotating each line.
Use classes to save each data element (such as miCOMPRESSED, miMATRIX, mxDOUBLE, or miINT32)
The .mat-files' structure is optimal for saving the data elements in a tree data structure; each node has one class and subnodes
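As a first step in that direction, here is a rough sketch of decoding the fixed 128-byte header of a Level 5 MAT-file with the standard struct module. This is not the script mentioned above; parse_mat5_header is a hypothetical helper, and the header bytes in the demo are synthetic:

```python
import struct


def parse_mat5_header(header: bytes) -> dict:
    """Decode the fixed 128-byte header of a Level 5 MAT-file."""
    if len(header) < 128:
        raise ValueError("a Level 5 MAT-file header is 128 bytes long")
    # Bytes 0-115: human-readable description text.
    text = header[:116].decode("ascii", errors="replace").rstrip(" \x00")
    # Bytes 126-127: endian indicator. The raw bytes b"IM" mean the file
    # was written little-endian; b"MI" means big-endian.
    byte_order = "<" if header[126:128] == b"IM" else ">"
    # Bytes 124-125: version, a 16-bit integer (0x0100 for Level 5).
    (version,) = struct.unpack(byte_order + "H", header[124:126])
    return {"text": text, "version": version, "little_endian": byte_order == "<"}


# Synthetic header: 116 bytes of text, 8 bytes of subsystem offset,
# then the version and the endian indicator.
demo_header = (
    b"MATLAB 5.0 MAT-file, synthetic demo header".ljust(116)
    + bytes(8)
    + struct.pack("<H", 0x0100)
    + b"IM"
)
info = parse_mat5_header(demo_header)
print(info)
```

Everything after the header is a sequence of tagged data elements, which is where the per-element classes suggested above come in.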
To read a .mat file into a pandas DataFrame with mixed data types:
import pandas as pd
import scipy.io as sio
mat = sio.loadmat('file.mat')  # load the MAT-file
mdata = mat['myVar']  # struct variable in the MAT-file
ndata = {n: mdata[n][0, 0] for n in mdata.dtype.names}  # unwrap each struct field
columns = [n for n, v in ndata.items() if v.size == 1]  # fields holding a single element
d = {c: ndata[c][0] for c in columns}
df = pd.DataFrame.from_dict(d)
print(df)
Apart from scipy.io.loadmat for v4 (Level 1.0), v6, v7 to 7.2 matfiles and h5py.File for 7.3 format matfiles, there is another type of matfile in text format instead of binary, usually created by Octave, which can't even be read in MATLAB.
Neither scipy.io.loadmat nor h5py.File can load them (tested on scipy 1.5.3 and h5py 3.1.0), and the only solution I found is numpy.loadtxt.
import numpy as np
mat = np.loadtxt('xxx.mat')
You can also use the hdf5storage library. See its official documentation for details on MATLAB version support.
import hdf5storage
label_file = "./LabelTrain.mat"
out = hdf5storage.loadmat(label_file)
print(type(out)) # <class 'dict'>
from os.path import dirname, join as pjoin
import scipy.io as sio
data_dir = pjoin(dirname(sio.__file__), 'matlab', 'tests', 'data')
mat_fname = pjoin(data_dir, 'testdouble_7.4_GLNX86.mat')
mat_contents = sio.loadmat(mat_fname)
You can use the above code to read a .mat file saved in the default format in Python.
After struggling with this problem myself and trying other libraries (I have to say mat4py is a good one as well, but with a few limitations), I have built this library ("matdata2py") that can handle most variable types and, most importantly for me, the "string" type. The .mat file needs to be saved in the -v7.3 version. I hope this can be useful for the community.
Installation:
pip install matdata2py
How to use this lib:
import matdata2py as mtp
To load the Matlab data file:
Variables_output = mtp.loadmatfile(file_Name, StructsExportLikeMatlab = True, ExportVar2PyEnv = False)
print(Variables_output.keys()) # with ExportVar2PyEnv = False the variables are as elements of the Variables_output dictionary.
With ExportVar2PyEnv = True you can see each variable separately as Python variables with the same name as saved in the MAT-file.
Flag descriptions
StructsExportLikeMatlab = True/False structures are exported in dictionary format (False) or dot-based format similar to Matlab (True)
ExportVar2PyEnv = True/False export all variables in a single dictionary (False) or as separate individual variables into the Python environment (True)
scipy works well for loading .mat files like this one.
loadmat returns a dict, and the value retrieved with get() is already a NumPy array.
import matplotlib.pyplot as plt
import scipy.io
mat = scipy.io.loadmat('point05m_matrix.mat')
x = mat.get("matrix")
print(type(x))
print(len(x))
plt.imshow(x, extent=[0,60,0,55], aspect='auto')
plt.show()
To Upload and Read mat files in python
Install mat4py in Python. On successful installation we get:
Successfully installed mat4py-0.5.0.
Importing loadmat from mat4py.
Save file actual location inside a variable.
Load mat file format to a data value using python
pip install mat4py
from mat4py import loadmat
boston = r"E:\Downloads\boston.mat"
data = loadmat(boston, meta=False)
I have a few lists which I want to save to a *.mat file. According to the scipy.io.savemat documentation, I need to create a dictionary with the lists and then use the command to save it to a *.mat file.
If I save it the way the docs describe, the mat file will contain a structure whose fields are the arrays I put in the dictionary. My problem is that another program (which is not editable) uses the mat files and loads them to plot some graphs from the data. It cannot process the structure because it is written to load a mat file and directly process the arrays in it.
So is there a way to save the mat file without using dictionaries? Please see the image for more understanding.
Thanks
This is the sample code I used to save my *.mat file:
import os
os.getcwd()
os.chdir(os.getcwd())
import scipy.io as sio
x=[1,2,3,4,5]
y=[234,5445,778] #can be 1000 lists
data={}
data['x']=x
data['y']=y
sio.savemat('test.mat',{'interpolated_data':data})
How about
scipy.io.savemat('interpolated_data_max_compare.mat',
{'NA1_X_order10_ACCE_ms2': np.zeros((3000,1)),
'NA1_X_order10_DISP_mm': np.ones((3000,1))})
Should work fine...
According to the code you added in your question, instead of sio.savemat('...', {'interpolated_data':data}), just save
sio.savemat('...', data)
and you should be fine: data is already a dictionary, so you don't need to add an extra level with {'interpolated_data': data} when saving.
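To check the result without opening MATLAB, scipy.io.whosmat lists the top-level variables of a file. A minimal sketch using the lists from the question; the temporary output path is an assumption made so the example is self-contained:

```python
import os
import tempfile

import scipy.io as sio

data = {"x": [1, 2, 3, 4, 5], "y": [234, 5445, 778]}
path = os.path.join(tempfile.mkdtemp(), "test.mat")

# Each dictionary key becomes a top-level MATLAB variable.
sio.savemat(path, data)

# whosmat returns (name, shape, class) for each variable without loading it.
top_level = sio.whosmat(path)
print(top_level)
```

With the extra {'interpolated_data': data} level, the same call would instead report a single struct variable named interpolated_data.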
You could use the writing primitives directly:
import scipy.io.matlab as ml
# Note: the location of the mio5 module may differ between SciPy versions.
with open("something.mat", "wb") as f:
    mw = ml.mio5.MatFile5Writer(f)
    mw.put_variables({"testVar": 22})
I have a .mat workspace file containing 4 character variables. These variables contain paths to various folders I need to be able to cd to and from relatively quickly. Usually, when using only Matlab I can load this workspace as follows (provided the .mat file is in the current directory).
load paths.mat
Currently I am experimenting with the Matlab Engine API for Python. The Matlab help docs recommend using the following Python formula to send variables to the current workspace in the desktop app:
import matlab.engine
eng = matlab.engine.start_matlab()
x = 4.0
eng.workspace['y'] = x
a = eng.eval('sqrt(y)')
print(a)
Which works well. However the whole point of the .mat file is that it can quickly load entire sets of variables the user is comfortable with. So the above is not efficient when trying to load the workspace.
I have also tried two different variations in Python:
eng.load("paths.mat")
eng.eval("load paths.mat")
The first variation successfully loads a dict variable in Python containing all four keys and values but this does not propagate to the workspace in Matlab. The second variation throws an error:
File "", line unknown
SyntaxError: Error: Unexpected MATLAB expression.
How do I load up a workspace through the engine without having to manually do it in Matlab? This is an important part of my workflow....
You didn't specify the number of output arguments from the MATLAB engine, which is a possible reason for the error.
I would expect the error from eng.load("paths.mat") to read something like
TypeError: unsupported data type returned from MATLAB
The difference in error messages may arise from different versions of MATLAB, engine API...
In any case, try specifying the number of output arguments like so,
eng.load("paths.mat", nargout=0)
This was giving me fits for a while. A few things to try. I was able to get this working on Matlab 2019a with Python 3.7. I had the most trouble trying to create a string and using the string as an argument for load and eval/evalin, so there might be some trickiness with the single or double quotes, or needing to have an additional set of quotes in the string.
Make sure the MAT file is on the Matlab Path. You can use addpath and rmpath really easily with pathlib objects:
from pathlib import Path
mat_file = Path('local/path/from/cwd/example.mat').resolve()  # get absolute path
eng.addpath(str(mat_file.parent))
# Execute other commands
eng.rmpath(str(mat_file.parent))
Per dML's answer, make sure to specify nargout=0 when the function has no outputs, and always when calling a script. If the function returns one or more outputs, you don't have to capture them in Python; if more than one is requested, they are returned as a tuple.
You can also turn your script into a function (just won't have access to base workspace without using evalin/assignin):
function load_example_matfile()
    evalin('base','load example.mat')
end
eng.feval('load_example_matfile', nargout=0)
And, it does seem to work on the plain vanilla eval and load as well, but if you leave off the nargout=0 it either errors out or gives you the output of the file in python directly.
Both of these work.
eng.eval('load example.mat', nargout=0)
eng.load('example.mat', nargout=0)
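The pieces above can be wrapped in a small helper. This is only a sketch: load_workspace is a hypothetical name, eng is assumed to be an engine object returned by matlab.engine.start_matlab(), and passing an absolute path to load is used here in place of the addpath approach:

```python
from pathlib import Path


def load_workspace(eng, mat_path):
    """Load a MAT-file into the engine's workspace and return the path used.

    `eng` is assumed to be an object from matlab.engine.start_matlab().
    """
    mat_file = str(Path(mat_path).resolve())  # absolute path, no addpath needed
    # nargout=0 makes load() populate the workspace instead of returning a struct.
    eng.load(mat_file, nargout=0)
    return mat_file
```

Typical use would be eng = matlab.engine.start_matlab() followed by load_workspace(eng, 'paths.mat'), after which the variables should be reachable through eng.workspace.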
I am facing the loss of the alpha channel when I try to send an image to the clipboard. None of the solutions described here worked with the software I am working with, but when I copy and paste PNG files into this software, the alpha channel seems to be preserved.
Given this, I want to simulate the Ctrl+C on files that Windows Explorer allows. Using Clipview I found that field 15 (CF_HDROP) is relevant to my goal. I tried to set this field using win32clipboard:
import win32clipboard
win32clipboard.OpenClipboard(0)
file1="C:\\Users\\User\\Desktop\\test.png"
win32clipboard.SetClipboardData(15, file1)
win32clipboard.CloseClipboard()
I don't get any error doing this, but it does not work when I try to use this new clipboard content because, as described there, a tuple of unicode filenames must be stored in the CF_HDROP field.
I have no clue how to proceed. I also tried with
file1= (unicode('C:\\Users\\User\\Desktop\\CANEVAS\\test.png'),)
but I got this error:
TypeError: expected a readable buffer object.
The documentation for CF_HDROP says
The data consists of an STGMEDIUM structure that contains a global memory object. The structure's hGlobal member points to a DROPFILES structure as its hGlobal member.
win32clipboard.GetClipboardData has built-in support for CF_HDROP. It decodes the STGMEDIUM and DROPFILES structures to produce a tuple of file names.
The documentation does not state that SetClipboardData has the corresponding code to construct the STGMEDIUM and DROPFILES structures from a tuple of file names.
I don't know enough about Python or its FFI to know how straightforward it is to construct the structures and pass them to the SetClipboardData function, or whether there is an existing library that will do this for you.
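For what it's worth, the DROPFILES payload can be built by hand with Python's standard struct module. The following is a sketch under stated assumptions rather than a documented pywin32 recipe: build_dropfiles and copy_files_to_clipboard are hypothetical helper names, and it assumes SetClipboardData accepts a bytes buffer for CF_HDROP (the "expected a readable buffer object" error above suggests it takes buffers). The clipboard part only runs on Windows with pywin32 installed:

```python
import struct


def build_dropfiles(paths):
    """Pack file paths into the byte layout of a DROPFILES structure.

    Header (20 bytes): DWORD pFiles, LONG pt.x, LONG pt.y, BOOL fNC, BOOL fWide,
    followed by null-terminated UTF-16 strings and a final extra null.
    """
    header = struct.pack("<Iiiii", 20, 0, 0, 0, 1)  # pFiles=20, fWide=1
    file_list = "\0".join(paths) + "\0\0"
    return header + file_list.encode("utf-16-le")


def copy_files_to_clipboard(paths):
    """Put the file list on the clipboard (Windows only, assumes pywin32)."""
    import win32clipboard  # imported lazily so the helper above works anywhere

    win32clipboard.OpenClipboard()
    try:
        win32clipboard.EmptyClipboard()
        win32clipboard.SetClipboardData(
            win32clipboard.CF_HDROP, build_dropfiles(paths)
        )
    finally:
        win32clipboard.CloseClipboard()


payload = build_dropfiles(["C:\\a.png"])
print(len(payload))
```

Whether the target application then preserves the alpha channel depends on how it handles pasted files, which this sketch cannot guarantee.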