I am struggling to find a way to retrieve metadata information from a file using GDAL.
Specifically, I would like to retrieve the band names and the order in which they are stored in a given file (whether that is a GeoTIFF or a NetCDF).
For instance, if we follow the description in the GDAL documentation, there is the "GetMetadata" method on gdal.Dataset (see here and here). Although this method returns a whole set of information about the dataset, it does not provide the band names or the order in which they are stored within the given file. In fact, this appears to be an old problem (from 2015) that still has not been solved (more info here). The R language has apparently already solved this problem (see here), but Python has not.
Just to be thorough: I know there are other Python packages that can help in this endeavour (e.g., xarray, rasterio, etc.); nevertheless, it is important to keep the set of packages used in a single script small. Therefore, I would like to know a definitive way to find the band (a.k.a. variable) names, and the order in which they are stored, within a single file using GDAL.
Please let me know your thoughts in this regard.
Below is a starting point for solving this issue, in which a file is opened by GDAL (creating a Dataset object):
from osgeo import gdal
from osgeo.gdal import Dataset

# "input", "file_name" and "var" are assumed to be defined elsewhere
OpenedDatasetFile = gdal.Open(f'NETCDF:{input}/{file_name}.nc:' + var)

if isinstance(OpenedDatasetFile, Dataset):
    print("File opened successfully")
    # Here is where one should be capable of fetching the variable (a.k.a. band)
    # names of the opened dataset.
    # Ideally, some method that returned a dictionary with this information
    # would be most welcome, something like:
    # VariablesWithinFile = OpenedDatasetFile.getVariablesWithinFileAsDictionary()
I have finally found a way to retrieve variable names from the NetCDF file using GDAL, thanks to the comments given by Robert Davy above.
I have organized the code into a set of functions to aid readability. Notice that there is also a function for reading metadata from the NetCDF file, which returns this info as a dictionary (see the "readInfo" function).
from osgeo import gdal
from osgeo.gdal import Dataset, InfoOptions


def read_data(filename):
    dataset = gdal.Open(filename)
    if not isinstance(dataset, Dataset):
        raise FileNotFoundError("Impossible to open the netcdf file")
    return dataset


def readInfo(ds, infoFormat="json"):
    """How to: https://gdal.org/python/"""
    info = gdal.Info(ds, options=InfoOptions(format=infoFormat))
    return info


def listAllSubDataSets(infoDict: dict):
    # Subdataset metadata comes in *_NAME / *_DESC pairs; keep the *_NAME keys
    subDatasetVariableKeys = [x for x in infoDict["metadata"]["SUBDATASETS"].keys()
                              if "_NAME" in x]
    subDatasetVariableNames = [infoDict["metadata"]["SUBDATASETS"][x]
                               for x in subDatasetVariableKeys]
    formattedSubDatasetVariableNames = []
    for x in subDatasetVariableNames:
        # 'NETCDF:"file.nc":variable' -> 'variable'
        s = x.replace('"', '').split(":")[-1]
        formattedSubDatasetVariableNames.append(s)
    return formattedSubDatasetVariableNames


if __name__ == "__main__":
    filename = "netcdfFile.nc"
    ds = read_data(filename)
    infoDict = readInfo(ds)
    infoDict["VariableNames"] = listAllSubDataSets(infoDict)
I am using the book Forecasting: Methods and Applications by Makridakis, Wheelwright and Hyndman. I want to do the exercises along the way, but in Python, not R (as suggested in the book).
I do not know how to use R. I know that the datasets can be obtained from an R package, fma. This is the link to the package.
Is there a possible script, in R or Python, which will allow me to download the datasets as .csv files? That way, I will be able to access them using Python.
One possibility:
## install and load package:
install.packages('fma')
library('fma')
## list example data of package fma:
data(package = 'fma')
## export single data as csv:
write.csv(cement, file = 'cement.csv')
## bulk export:
## data names are in `[,3]`rd column of list member "results"
## of `data(...)` output
for (data_name in data(package = 'fma')[['results']][,3]){
  write.csv(get(data_name), file = paste0(data_name, '.csv'))
}
Edit:
As Anirban noted, attaching the package {fma} exposes only a few datasets to the search path. The data can be obtained by cloning or downloading from Rob J. Hyndman's source (click the green Code button and choose a method). The subfolder 'data' contains each dataset as an .rda file, which can be load()ed and converted. (Observe the licence conditions - GPL-3.0 - and acknowledge the authors' efforts.)
That said, you could load and convert the data like this:
setwd('path/to/fma-master/data')
for(data_name in dir()){
  cat(paste0('converting ', data_name, '... '))
  load(data_name)  # loads the dataset object into the workspace
  object_name <- gsub('\\.rda', '', data_name)
  write.csv(get(object_name),
            file = paste0(object_name, '.csv'),  # overwrites the file if it exists
            row.names = FALSE
  )
}
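On the Python side, the exported files can then be read back with pandas (a minimal sketch; cement.csv is just one of the exported names):

import pandas as pd

cement = pd.read_csv('cement.csv')  # one of the CSVs exported above
print(cement.head())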
Is it possible to read binary MATLAB .mat files in Python?
I've seen that SciPy has alleged support for reading .mat files, but I'm unsuccessful with it. I installed SciPy version 0.7.0, and I can't find the loadmat() method.
An import is required, import scipy.io...
import scipy.io
mat = scipy.io.loadmat('file.mat')
Neither scipy.io.savemat nor scipy.io.loadmat works for MATLAB arrays of version 7.3. But the good part is that MATLAB version 7.3 files are HDF5 datasets, so they can be read with a number of tools, including Python's h5py package.
For Python, you will need the h5py extension, which requires HDF5 on your system.
import numpy as np
import h5py
f = h5py.File('somefile.mat','r')
data = f.get('data/variable1')
data = np.array(data) # For converting to a NumPy array
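If you don't know the variable names in advance, the file can be explored first (a small sketch; only documented h5py calls are used):

import h5py

with h5py.File('somefile.mat', 'r') as f:
    print(list(f.keys()))  # top-level groups/datasets
    f.visit(print)         # walk the full hierarchy of names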
First, save the .mat file from MATLAB as:
save('test.mat', '-v7')
After that, in Python, use the usual loadmat function:
import scipy.io as sio
test = sio.loadmat('test.mat')
There is a nice package called mat4py which can easily be installed using
pip install mat4py
It is straightforward to use (from the website):
Load data from a MAT-file
The function loadmat loads all variables stored in the MAT-file into a simple Python data structure, using only Python’s dict and list objects. Numeric and cell arrays are converted to row-ordered nested lists. Arrays are squeezed to eliminate arrays with only one element. The resulting data structure is composed of simple types that are compatible with the JSON format.
Example: Load a MAT-file into a Python data structure:
from mat4py import loadmat
data = loadmat('datafile.mat')
The variable data is a dict with the variables and values contained in the MAT-file.
Save a Python data structure to a MAT-file
Python data can be saved to a MAT-file, with the function savemat. Data has to be structured in the same way as for loadmat, i.e. it should be composed of simple data types, like dict, list, str, int, and float.
Example: Save a Python data structure to a MAT-file:
from mat4py import savemat
savemat('datafile.mat', data)
The parameter data shall be a dict with the variables.
With MATLAB R2014b or newer installed, the MATLAB Engine for Python can be used:
import matlab.engine
eng = matlab.engine.start_matlab()
content = eng.load("example.mat", nargout=1)
Reading the file
import scipy.io
mat = scipy.io.loadmat(file_name)
Inspecting the type of MAT variable
print(type(mat))
#OUTPUT - <class 'dict'>
The keys inside the dictionary are MATLAB variables, and the values are the objects assigned to those variables.
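A short sketch of what that looks like in practice (loadmat also adds a few bookkeeping keys such as __header__ that you usually want to skip):

import scipy.io

mat = scipy.io.loadmat('file.mat')
# Skip scipy's bookkeeping entries ('__header__', '__version__', '__globals__')
variables = {k: v for k, v in mat.items() if not k.startswith('__')}
for name, value in variables.items():
    print(name, type(value), getattr(value, 'shape', None))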
There is a great library for this task called pymatreader.
Just do as follows:
Install the package: pip install pymatreader
Import the relevant function of this package: from pymatreader import read_mat
Use the function to read the matlab struct: data = read_mat('matlab_struct.mat')
Use data.keys() to locate where the data is actually stored.
The keys will usually look like: dict_keys(['__header__', '__version__', '__globals__', 'data_opp']), where data_opp is the actual key which stores the data. The name of this key can of course change between different files.
Last step - create your DataFrame (with pandas imported as pd): my_df = pd.DataFrame(data['data_opp'])
That's it :)
There is also the MATLAB Engine for Python by MathWorks itself. If you have MATLAB, this might be worth considering (I haven't tried it myself, but it has a lot more functionality than just reading MATLAB files). However, I don't know if it is allowed to distribute it to other users (it is probably not a problem if those people have MATLAB; otherwise, maybe NumPy is the right way to go?).
Also, if you want to do all the basics yourself, MathWorks provides (if the link changes, try googling for matfile_format.pdf or its title, MAT-FILE Format) detailed documentation on the structure of the file format. It's not as complicated as I personally thought, but obviously this is not the easiest way to go. It also depends on how many features of the .mat files you want to support.
I've written a "small" (about 700 lines) Python script which can read some basic .mat-files. I'm neither a Python expert nor a beginner and it took me about two days to write it (using the MathWorks documentation linked above). I've learned a lot of new stuff and it was quite fun (most of the time). As I've written the Python script at work, I'm afraid I cannot publish it... But I can give some advice here:
First read the documentation.
Use a hex editor (such as HxD) and look into a reference .mat-file you want to parse.
Try to figure out the meaning of each byte by saving the bytes to a .txt file and annotating each line.
Use classes to save each data element (such as miCOMPRESSED, miMATRIX, mxDOUBLE, or miINT32)
The .mat files' structure is optimal for saving the data elements in a tree data structure; each node has one class and subnodes (a header-parsing sketch follows below).
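To give a flavour of what such a parser starts with, here is a minimal sketch (my own, not the 700-line script mentioned above) that reads the 128-byte header every Level 5 MAT-file begins with, as laid out in the MathWorks documentation:

import struct

with open('somefile.mat', 'rb') as f:
    header = f.read(128)

text = header[:116].decode('ascii', errors='replace').rstrip()  # descriptive text
# Bytes 116:124 hold the subsystem data offset; 124:126 the version; 126:128 the
# endian indicator ('IM' when the file was written little-endian)
version, = struct.unpack('<H', header[124:126])  # assumes a little-endian file
endian = header[126:128].decode('ascii')
print(text)
print(hex(version), endian)  # typically 0x100 and 'IM'
# After the header come tagged data elements: a 4-byte data type
# (e.g. miCOMPRESSED = 15, miMATRIX = 14) followed by a 4-byte element size.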
To read a MAT-file into a pandas DataFrame with mixed data types:
import pandas as pd
import scipy.io as sio

mat = sio.loadmat('file.mat')  # load the MAT-file
mdata = mat['myVar']  # variable in the MAT-file
# Unpack the MATLAB struct fields into a plain dict of arrays
ndata = {n: mdata[n][0, 0] for n in mdata.dtype.names}
columns = [n for n, v in ndata.items() if v.size == 1]
d = dict((c, ndata[c][0]) for c in columns)
df = pd.DataFrame.from_dict(d)
display(df)  # display() exists in IPython/Jupyter; use print(df) elsewhere
Apart from scipy.io.loadmat for v4 (Level 1.0), v6, and v7 to v7.2 MAT-files, and h5py.File for v7.3 MAT-files, there is another type of MAT-file in text data format instead of binary, usually created by Octave, which can't even be read in MATLAB.
Neither scipy.io.loadmat nor h5py.File can load them (tested on scipy 1.5.3 and h5py 3.1.0), and the only solution I found is numpy.loadtxt.
import numpy as np
mat = np.loadtxt('xxx.mat')
You can also use the hdf5storage library; see the official documentation for details on MATLAB version support.
import hdf5storage
label_file = "./LabelTrain.mat"
out = hdf5storage.loadmat(label_file)
print(type(out)) # <class 'dict'>
from os.path import dirname, join as pjoin
import scipy.io as sio
data_dir = pjoin(dirname(sio.__file__), 'matlab', 'tests', 'data')
mat_fname = pjoin(data_dir, 'testdouble_7.4_GLNX86.mat')
mat_contents = sio.loadmat(mat_fname)
You can use the above code to read a .mat file saved in the default format in Python.
After struggling with this problem myself and trying other libraries (I have to say mat4py is a good one as well, but with a few limitations), I have built this library ("matdata2py"), which can handle most variable types and, most importantly for me, the "string" type. The .mat file needs to be saved in the -v7.3 version. I hope this can be useful for the community.
Installation:
pip install matdata2py
How to use this lib:
import matdata2py as mtp
To load the Matlab data file:
Variables_output = mtp.loadmatfile(file_Name, StructsExportLikeMatlab = True, ExportVar2PyEnv = False)
print(Variables_output.keys()) # with ExportVar2PyEnv = False the variables are as elements of the Variables_output dictionary.
With ExportVar2PyEnv = True you can see each variable separately as Python variables, with the same names as saved in the MAT-file.
Flag descriptions
StructsExportLikeMatlab = True/False: structures are exported in dictionary format (False) or in a dot-based format similar to MATLAB (True)
ExportVar2PyEnv = True/False: export each variable separately into the Python environment (True) or keep all variables in a single dictionary (False)
scipy will work perfectly to load the .mat files.
We can then use the dict's get() method to retrieve a variable, which loadmat has already converted to a NumPy array.
import scipy.io
import matplotlib.pyplot as plt

mat = scipy.io.loadmat('point05m_matrix.mat')
x = mat.get("matrix")  # fetch the variable named "matrix"
print(type(x))
print(len(x))
plt.imshow(x, extent=[0, 60, 0, 55], aspect='auto')
plt.show()
To upload and read MAT-files in Python:
Install mat4py in Python. On successful installation we get:
Successfully installed mat4py-0.5.0.
Import loadmat from mat4py.
Save the file's actual location inside a variable.
Load the MAT-file into a data variable using Python:
# in a shell: pip install mat4py
from mat4py import loadmat

boston = r"E:\Downloads\boston.mat"
data = loadmat(boston, meta=False)
I am trying to work with netCDF files on my MacBook Pro (macOS Mojave 10.14.6). For some reason, I can't export any of the xarray datasets I made to netCDF files. Basically, I am trying to create monthly netCDF files inside a for loop, which I would like to export in order to use the results of each iteration later in my script. I imported xarray and netCDF4. Below is some of the code inside my for loop (which iterates over a list of months).
# Xarray
tas_xr = xr.DataArray(bilt_month, dims = ['years'], coords = {'years':years})
tas_xr.attrs['units'] = 'degrees Celsius'
tas_xr.attrs['month'] = month
tas_g_xr = xr.DataArray(global_month, dims = ['years'], coords = {'years':years})
tas_g_xr.attrs['units'] = 'degrees Celsius'
tas_g_xr.attrs['month'] = month
# Dataset
ds = tas_xr.to_dataset(name = 'tas_Bilt')
ds['tas_global'] = tas_g_xr
# Exporting
file_out = 'obs_data_'+month+'.nc'
ds.to_netcdf(data + file_out, 'w')
where data is a string variable containing the location where I want the netCDF files to be stored. The program runs fine until the last line, where I get the following error:
AttributeError: module 'dask.base' has no attribute 'get_scheduler'
Is anyone familiar with this error? I have downloaded and installed Xcode, Homebrew, MacPorts, and XQuartz because I heard that the netCDF libraries are not necessarily compatible with macOS. While it seems my terminal is now capable of running ncview (which it wasn't before installing XQuartz), my Python script still gives me errors when trying to export xarrays to netCDF files.
(Hoping my question is a correct one, as I am new to using Stack Overflow and Python.)
You should upgrade your dask; it appears that your xarray is ahead of it (though it would be good to give your versions in the question). The function get_scheduler is certainly there now.
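For example, a quick way to check the installed versions before upgrading (a small sketch; the version cutoff mentioned in the comment is approximate):

import dask
import xarray as xr

# xarray calls dask.base.get_scheduler, which only exists in newer dask
# releases (roughly dask 0.18 onwards); an old dask next to a newer xarray
# raises exactly this AttributeError
print("dask:", dask.__version__)
print("xarray:", xr.__version__)
# then, in a shell: pip install --upgrade dask   (or: conda update dask)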
I'm trying to do some work on a complex Excel workbook that has a large number of variables created and used via the Name Box feature. See the attached picture for an example and details.
I'd like to store or change DeathRate, or perhaps read all the Name Boxes and build a dictionary mapping names to cell locations, from outside Excel.
I'm using the win32com library in Python, but I guess I could switch to another Excel reader as long as it copes with XLSX files.
Has someone come across this before?
Found the solution; see the code below:

import os
from win32com.client import Dispatch  # win32com is based around cells beginning at one

app_xl = Dispatch("Excel.Application")
WORKING_DIR = os.getcwd()
excelPath = os.path.join(WORKING_DIR, "SampleModel.xls")
wb = app_xl.Workbooks.Open(excelPath)

# Get named boxes
name_box_list = [x for x in app_xl.ActiveWorkbook.Names]
name_box_map = {x.Name: x.Value for x in name_box_list}
print(name_box_list)
print(name_box_map)

# Change named boxes
name_box_list[0].Name = u'NewName'
name_box_list[0].Value = u'=model!$B$5'
name_box_map = {x.Name: x.Value for x in name_box_list}
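Since another reader is acceptable for XLSX files, openpyxl also exposes defined names, without needing Excel installed. A rough sketch (the defined-names attribute layout has changed between openpyxl versions, so verify against your installed version; the file name is a placeholder):

import openpyxl

wb = openpyxl.load_workbook("SampleModel.xlsx")  # hypothetical XLSX counterpart
# In openpyxl >= 3.1, defined_names behaves like a dict of name -> DefinedName
for name, defn in wb.defined_names.items():
    print(name, defn.value)  # e.g. DeathRate  model!$B$5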