Efficient reading of netcdf variable in python

Efficient reading of netcdf variable in python - python

I need to be able to quickly read lots of netCDF variables in python (1 variable per file). I'm finding that the Dataset function in netCDF4 library is rather slow compared to reading utilities in other languages (e.g., IDL).
My variables have shape of (2600,5200) and type float. They don't seem that big to me (filesize = 52Mb).
Here is my code:
import numpy as np
from netCDF4 import Dataset
import time
file = '20151120-235839.netcdf'
t0=time.time()
openFile = Dataset(file,'r')
raw_data = openFile.variables['MergedReflectivityQCComposite']
data = np.copy(raw_data)
openFile.close()
print time.time-t0
It takes about 3 seconds to read one variable (one file). I think the main slowdown is np.copy. raw_data is <type 'netCDF4.Variable'>, thus the copy. Is this the best/fastest way to do netCDF reads in python?
Thanks.

The power of Numpy is that you can create views into the exiting data in memory via the metadata it retains about the data. So a copy will always be slower than a view, via pointers. As JCOidl says it's not clear why you don't just use:
raw_data = openFile.variables['MergedReflectivityQCComposite'][:]
For more info see SciPy Cookbook and SO View onto a numpy array?

I'm not sure what to say about the np.copy operation (which is indeed slow), but I find that the PyNIO module from UCAR works well for both NetCDF and HDF files. This will place data into a numpy array:
import Nio
f = Nio.open_file(file, format="netcdf")
data = f.variables['MergedReflectivityQCComposite'][:]
f.close()
Testing your code versus the PyNIO code on a ndfCDF file I have resulted in 1.1 seconds for PyNIO, versus 3.1 seconds for the netCDF4 module. Your results may vary; worth a look though.

You can use xarray for that.
%matplotlib inline
import xarray as xr
### Single netcdf file ###
ds = xr.open_dataset('path/file.nc')
### Opening multiple NetCDF files and concatenating them by time ####
ds = xr.open_mfdatset('path/*.nc', concat_dim='time
To read the variable you can simply type ds.MergedReflectivityQCCompositeor ds.['MergedReflectivityQCComposite'][:]
You can also use xr.load_dataset but I find that it uses up more space than the open function. For xr.open_mfdataset, you can also chunk along the dimensions of the file if you want. There are other options for both functions and you might be interested to learn more about it in the xarray documentation.

Related

Storing multiple GeoTiffs in HDF5 file in Python

I want to store multiple GeoTiff files in one HDF5 file to use it for further analysis since the function I am supposed to use can just deal with HDF5 (so basically like a raster stack in R but stored in a HDF5). I have to use Python. I am relatively new to HDF5 format (and geoanalysis in Python generally) and don't really know how to approach this issue. Especially keeping the geolocation/projection inforation seems tricky to me. So far I tried:
import h5py
import rasterio
r1 = rasterio.open("filename.tif")
r2 = rasterio.open("filename2.tif")
with h5py.File('path/test.h5', 'w') as hdf:
hdf.create_dataset('GeoTiff1', data=r1)
hdf.create_dataset('GeoTiff2', data=r2)
Yielding the following errror:
TypeError: Object dtype dtype('O') has no native HDF5 equivalent
I am pretty sure this not at all the correct approach and I'm happy about any suggestions.

What you can try is to do this:
import numpy as np
spec_dtype = h5py.special_dtype(vlen=np.dtype('float64'))
Just make a spec_dtype variable with float64 type then apply this to create_dataset:
with h5py.File('path/test.h5', 'w') as hdf:
hdf.create_dataset('GeoTiff1', data=r1,, dtype=spec_dtype)
hdf.create_dataset('GeoTiff2', data=r2,, dtype=spec_dtype)
Apply these and hopefully it will work.

Using HDFql in Python, your use-case could be solved as follows:
import HDFql
HDFql.execute("SHOW FILE SIZE filename.tif, filename2.tif")
HDFql.cursor_next()
HDFql.execute("CREATE DATASET path/test.h5 GeoTiff1 AS OPAQUE(%d) VALUES FROM BINARY FILE filename.tif" % HDFql.cursor_get_bigint())
HDFql.cursor_next()
HDFql.execute("CREATE DATASET path/test.h5 GeoTiff2 AS OPAQUE(%d) VALUES FROM BINARY FILE filename2.tif" % HDFql.cursor_get_bigint())

Python library to use .mat files [duplicate]

Is it possible to read binary MATLAB .mat files in Python?
I've seen that SciPy has alleged support for reading .mat files, but I'm unsuccessful with it. I installed SciPy version 0.7.0, and I can't find the loadmat() method.

An import is required, import scipy.io...
import scipy.io
mat = scipy.io.loadmat('file.mat')

Neither scipy.io.savemat, nor scipy.io.loadmat work for MATLAB arrays version 7.3. But the good part is that MATLAB version 7.3 files are hdf5 datasets. So they can be read using a number of tools, including NumPy.
For Python, you will need the h5py extension, which requires HDF5 on your system.
import numpy as np
import h5py
f = h5py.File('somefile.mat','r')
data = f.get('data/variable1')
data = np.array(data) # For converting to a NumPy array

First save the .mat file as:
save('test.mat', '-v7')
After that, in Python, use the usual loadmat function:
import scipy.io as sio
test = sio.loadmat('test.mat')

There is a nice package called mat4py which can easily be installed using
pip install mat4py
It is straightforward to use (from the website):
Load data from a MAT-file
The function loadmat loads all variables stored in the MAT-file into a simple Python data structure, using only Python’s dict and list objects. Numeric and cell arrays are converted to row-ordered nested lists. Arrays are squeezed to eliminate arrays with only one element. The resulting data structure is composed of simple types that are compatible with the JSON format.
Example: Load a MAT-file into a Python data structure:
from mat4py import loadmat
data = loadmat('datafile.mat')
The variable data is a dict with the variables and values contained in the MAT-file.
Save a Python data structure to a MAT-file
Python data can be saved to a MAT-file, with the function savemat. Data has to be structured in the same way as for loadmat, i.e. it should be composed of simple data types, like dict, list, str, int, and float.
Example: Save a Python data structure to a MAT-file:
from mat4py import savemat
savemat('datafile.mat', data)
The parameter data shall be a dict with the variables.

Having MATLAB 2014b or newer installed, the MATLAB engine for Python could be used:
import matlab.engine
eng = matlab.engine.start_matlab()
content = eng.load("example.mat", nargout=1)

Reading the file
import scipy.io
mat = scipy.io.loadmat(file_name)
Inspecting the type of MAT variable
print(type(mat))
#OUTPUT - <class 'dict'>
The keys inside the dictionary are MATLAB variables, and the values are the objects assigned to those variables.

There is a great library for this task called: pymatreader.
Just do as follows:
Install the package: pip install pymatreader
Import the relevant function of this package: from pymatreader import read_mat
Use the function to read the matlab struct: data = read_mat('matlab_struct.mat')
use data.keys() to locate where the data is actually stored.
The keys will usually look like: dict_keys(['__header__', '__version__', '__globals__', 'data_opp']). Where data_opp will be the actual key which stores the data. The name of this key can ofcourse be changed between different files.
Last step - Create your dataframe: my_df = pd.DataFrame(data['data_opp'])
That's it :)

There is also the MATLAB Engine for Python by MathWorks itself. If you have MATLAB, this might be worth considering (I haven't tried it myself but it has a lot more functionality than just reading MATLAB files). However, I don't know if it is allowed to distribute it to other users (it is probably not a problem if those persons have MATLAB. Otherwise, maybe NumPy is the right way to go?).
Also, if you want to do all the basics yourself, MathWorks provides (if the link changes, try to google for matfile_format.pdf or its title MAT-FILE Format) a detailed documentation on the structure of the file format. It's not as complicated as I personally thought, but obviously, this is not the easiest way to go. It also depends on how many features of the .mat-files you want to support.
I've written a "small" (about 700 lines) Python script which can read some basic .mat-files. I'm neither a Python expert nor a beginner and it took me about two days to write it (using the MathWorks documentation linked above). I've learned a lot of new stuff and it was quite fun (most of the time). As I've written the Python script at work, I'm afraid I cannot publish it... But I can give some advice here:
First read the documentation.
Use a hex editor (such as HxD) and look into a reference .mat-file you want to parse.
Try to figure out the meaning of each byte by saving the bytes to a .txt file and annotate each line.
Use classes to save each data element (such as miCOMPRESSED, miMATRIX, mxDOUBLE, or miINT32)
The .mat-files' structure is optimal for saving the data elements in a tree data structure; each node has one class and subnodes

To read mat file to pandas dataFrame with mixed data types
import scipy.io as sio
mat=sio.loadmat('file.mat')# load mat-file
mdata = mat['myVar'] # variable in mat file
ndata = {n: mdata[n][0,0] for n in mdata.dtype.names}
Columns = [n for n, v in ndata.items() if v.size == 1]
d=dict((c, ndata[c][0]) for c in Columns)
df=pd.DataFrame.from_dict(d)
display(df)

Apart from scipy.io.loadmat for v4 (Level 1.0), v6, v7 to 7.2 matfiles and h5py.File for 7.3 format matfiles, there is anther type of matfiles in text data format instead of binary, usually created by Octave, which can't even be read in MATLAB.
Both of scipy.io.loadmat and h5py.File can't load them (tested on scipy 1.5.3 and h5py 3.1.0), and the only solution I found is numpy.loadtxt.
import numpy as np
mat = np.loadtxt('xxx.mat')

Can also use the hdf5storage library. official documentation here for details on matlab version support.
import hdf5storage
label_file = "./LabelTrain.mat"
out = hdf5storage.loadmat(label_file)
print(type(out)) # <class 'dict'>

from os.path import dirname, join as pjoin
import scipy.io as sio
data_dir = pjoin(dirname(sio.__file__), 'matlab', 'tests', 'data')
mat_fname = pjoin(data_dir, 'testdouble_7.4_GLNX86.mat')
mat_contents = sio.loadmat(mat_fname)
You can use above code to read the default saved .mat file in Python.

After struggling with this problem myself and trying other libraries (I have to say mat4py is a good one as well but with a few limitations) I have built this library ("matdata2py") that can handle most variable types and most importantly for me the "string" type. The .mat file needs to be saved in the -V7.3 version. I hope this can be useful for the community.
Installation:
pip install matdata2py
How to use this lib:
import matdata2py as mtp
To load the Matlab data file:
Variables_output = mtp.loadmatfile(file_Name, StructsExportLikeMatlab = True, ExportVar2PyEnv = False)
print(Variables_output.keys()) # with ExportVar2PyEnv = False the variables are as elements of the Variables_output dictionary.
with ExportVar2PyEnv = True you can see each variable separately as python variables with the same name as saved in the Mat file.
Flag descriptions
StructsExportLikeMatlab = True/False structures are exported in dictionary format (False) or dot-based format similar to Matlab (True)
ExportVar2PyEnv = True/False export all variables in a single dictionary (True) or as separate individual variables into the python environment (False)

scipy will work perfectly to load the .mat files.
And we can use the get() function to convert it to a numpy array.
mat = scipy.io.loadmat('point05m_matrix.mat')
x = mat.get("matrix")
print(type(x))
print(len(x))
plt.imshow(x, extent=[0,60,0,55], aspect='auto')
plt.show()

To Upload and Read mat files in python
Install mat4py in python.On successful installation we get:
Successfully installed mat4py-0.5.0.
Importing loadmat from mat4py.
Save file actual location inside a variable.
Load mat file format to a data value using python
pip install mat4py
from mat4py import loadmat
boston = r"E:\Downloads\boston.mat"
data = loadmat(boston, meta=False)

How to put many numpy files in one big numpy file without having memory error?

I follow this question Append multiple numpy files to one big numpy file in python in order to put many numpy files in one big file, the result is:
import matplotlib.pyplot as plt
import numpy as np
import glob
import os, sys
fpath ="path_Of_my_final_Big_File"
npyfilespath ="path_of_my_numpy_files"
os.chdir(npyfilespath)
npfiles= glob.glob("*.npy")
npfiles.sort()
all_arrays = np.zeros((166601,8000))
for i,npfile in enumerate(npfiles):
all_arrays[i]=np.load(os.path.join(npyfilespath, npfile))
np.save(fpath, all_arrays)
data = np.load(fpath)
print data
print data.shape
I have thousands of files, by using this code, I have always a memory error, so I can't have my result file.
How to resolve this error?
How to read, write and append int the final numpy file by file, ?

Try to have a look to np.memmap. You can instantiateall_arrays:
all_arrays = np.memmap("all_arrays.dat", dtype='float64', mode='w+', shape=(166601,8000))
from the documentation:
Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory.
You will be able to access all the array, but the operating system will take care of loading the part that you actually need. Read carefully the documentation page and note that from the performance point of view you can decide whether the file should be stored column-wise or row-wise.

Reading 1-D Variables in WRF NetCDF files with GDAL python

My question is simple.
With an wrfout file "out.nc" for example.
The file contain Geo2D, Geo3D and 1D variables.
Using GDAL package in Python 2.7, I can extract the Geo2D variables easily like this:
## T2 is 2-d variable means temperature 2 m above the ground
temp = gdal.Open('NETCDF:"'+"out.nc"+'":T2')
But when I want to use this code to extract 1d array, it failed.
## Time is 1-d array represent the timeseries throught the simulation period
time = gdal.Open('NETCDF:"'+"out.nc"+'":Time')
Nothing happened! Wish some one offer some advice to read any-dimension of WRF output variables easyily!

You can also use the NetCDF reader in scipy.io:
import scipy.io.netcdf as nc
# Open a netcdf file object and assign the data values to a variable
time = nc.netcdf_file('out.nc', 'r').variables['Time'][:]
It has the benefit of scipy being a very popular and widely installed package, while working similar to opening files in some respects.

Save .dta files in python

I'm wondering if anyone knows a Python package that allows you to save numpy arrays/recarrays in the .dta format of the statistical data analysis software Stata. This would really speed up a few steps in a system I have.

The scikits.statsmodels package includes a reader for Stata data files, which relies in part on PyDTA as pointed out by #Sven. In particular, genfromdta() will return an ndarray, e.g.
from Python 2.7/statsmodels 0.3.1:
>>> import scikits.statsmodels.api as sm
>>> arr = sm.iolib.genfromdta('/Applications/Stata12/auto.dta')
>>> type(arr)
<type 'numpy.ndarray'>
The savetxt() function can be used in turn to save an array as a text file, which can be imported in Stata. For example, we can export the above as
>>> sm.iolib.savetxt('auto.txt', arr, fmt='%2s', delimiter=",")
and read it in Stata without a dictionary file as follows:
. insheet using auto.txt, clear
I believe a *.dta reader should be added in the near future.

The only Python library for STATA interoperability I could find merely provides read-only access to .dta files. The R foreign library however provides a function write.dta, and RPy provides a Python interface to R. Maybe the combination of these tools can help you.

pandas DataFrame objects now have a "to_stata" method. So you can do for instance
import pandas as pd
df = pd.read_stata('my_data_in.dta')
df.to_stata('my_data_out.dta')
DISCLAIMER: the first step is quite slow (in my test, around 1 minute for reading a 51 MB dta - also see this question), and the second produces a file which can be way larger than the original one (in my test, the size goes from 51 MB to 111MB). This answer may look less elegant, but it is probably more efficient.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.