fastest way to get NetCDF variable min/max using Python? - python

My usual method for extracting the min/max of a variable's data values from a NetCDF file is an order of magnitude slower when switching to the netCDF4 Python module compared to scipy.io.netcdf.
I am working with relatively large ocean model output files (from ROMS) with multiple depth levels over a given map region (Hawaii). When these were in NetCDF-3, I used scipy.io.netcdf.
Now that these files are in NetCDF-4 ("Classic") I can no longer use scipy.io.netcdf and have instead switched over to using the netCDF4 Python module. However, the slowness is a concern and I wondered if there is a more efficient method of extracting a variable's data range (minimum and maximum data values)?
Here was my NetCDF-3 method using scipy:
import scipy.io.netcdf
netcdf = scipy.io.netcdf.netcdf_file(file)
var = netcdf.variables['sea_water_potential_temperature']
min = var.data.min()
max = var.data.max()
Here is my NetCDF-4 method using netCDF4:
import netCDF4
netcdf = netCDF4.Dataset(file)
var = netcdf.variables['sea_water_potential_temperature']
var_array = var.data.flatten()
min = var_array.data.min()
max = var_array.data.max()
The notable difference is that I must first flatten the data array in netCDF4, and this operation apparently slows things down.
Is there a better/faster way?
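For illustration only, here is a minimal sketch of computing the range without flattening, reading the variable a few leading-axis slabs at a time so that memory use stays bounded. The helper name slab_min_max and its step parameter are hypothetical. The demo runs on a NumPy array; the same slicing should apply to a netCDF4 variable, but that is an assumption not tested here, and no speed-up is claimed.

```python
import numpy as np


def slab_min_max(var, step=1):
    """Return (min, max) of `var`, reading `step` leading-axis slabs at a time.

    `var` only needs `.shape` and slice indexing (hypothetical helper).
    """
    lo, hi = np.inf, -np.inf
    for start in range(0, var.shape[0], step):
        # Mask NaN/inf so missing values do not poison the result
        block = np.ma.masked_invalid(var[start:start + step])
        if block.count() == 0:
            continue  # slab holds no valid values
        lo = min(lo, float(block.min()))
        hi = max(hi, float(block.max()))
    return lo, hi


# Stand-in for a (time, lat, lon) variable
demo = np.arange(24, dtype=float).reshape(4, 3, 2)
demo[1, 0, 0] = np.nan
print(slab_min_max(demo, step=2))
```

Note that min() and max() already reduce over all axes of an N-dimensional array, so the flatten step is not needed to get a single value.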

Per suggestion of hpaulj here is a function that calls the nco command ncwa using subprocess. It hangs terribly when using an OPeNDAP address, and I don't have any files on hand to test it locally.
You can see if it works for you and what the speed difference is.
This assumes you have the nco library installed.
def ncwa(path, fnames, var, op_type, times=None, lons=None, lats=None):
    '''Perform arithmetic operations on netCDF file or OPeNDAP data

    Args
    ----
    path: str
        prefix
    fnames: str or iterable
        Names of file(s) to perform operation on
    op_type: str
        ncwa arithmetic operation to perform. Available operations are:
        avg,mabs,mebs,mibs,min,max,ttl,sqravg,avgsqr,sqrt,rms,rmssdn
    times: tuple
        Minimum and maximum timestamps within which to perform the operation
    lons: tuple
        Minimum and maximum longitudes within which to perform the operation
    lats: tuple
        Minimum and maximum latitudes within which to perform the operation

    Returns
    -------
    result: float
        Result of the operation on the selected data

    Note
    ----
    Adapted from the OPeNDAP examples in the NCO documentation:
    http://nco.sourceforge.net/nco.html#OPeNDAP
    '''
    import os
    import netCDF4
    import numpy
    import subprocess

    output = 'tmp_output.nc'

    # Concatenate subprocess command
    cmd = ['ncwa']
    cmd.extend(['-y', '{}'.format(op_type)])
    if times:
        cmd.extend(['-d', 'time,{},{}'.format(times[0], times[1])])
    if lons:
        cmd.extend(['-d', 'lon,{},{}'.format(lons[0], lons[1])])
    if lats:
        cmd.extend(['-d', 'lat,{},{}'.format(lats[0], lats[1])])
    cmd.extend(['-p', path])
    cmd.extend(numpy.atleast_1d(fnames).tolist())
    cmd.append(output)

    # Run cmd and check for errors
    subprocess.run(cmd, stdout=subprocess.PIPE, check=True)

    # Load, read, close data and delete temp .nc file
    data = netCDF4.Dataset(output)
    result = float(data[var][:])
    data.close()
    os.remove(output)
    return result

path = 'https://ecowatch.ncddc.noaa.gov/thredds/dodsC/hycom/hycom_reg6_agg/'
fname = 'HYCOM_Region_6_Aggregation_best.ncd'
times = (0.0, 48.0)
lons = (201.5, 205.5)
lats = (18.5, 22.5)
smax = ncwa(path, fname, 'salinity', 'max', times, lons, lats)
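Since the call can hang on an OPeNDAP address, it may help to look at the exact command line before running anything. The sketch below only builds the argument list, using the same -y, -d and -p flags as the function above; the helper name build_ncwa_cmd is hypothetical and nothing is executed.

```python
def build_ncwa_cmd(path, fnames, op_type, output,
                   times=None, lons=None, lats=None):
    """Build (but do not run) the ncwa argument list (hypothetical helper)."""
    cmd = ['ncwa', '-y', str(op_type)]
    # One '-d name,min,max' hyperslab per dimension that was requested
    for name, bounds in (('time', times), ('lon', lons), ('lat', lats)):
        if bounds:
            cmd.extend(['-d', '{},{},{}'.format(name, bounds[0], bounds[1])])
    cmd.extend(['-p', path])
    cmd.extend([fnames] if isinstance(fnames, str) else list(fnames))
    cmd.append(output)
    return cmd


cmd = build_ncwa_cmd(
    'https://ecowatch.ncddc.noaa.gov/thredds/dodsC/hycom/hycom_reg6_agg/',
    'HYCOM_Region_6_Aggregation_best.ncd',
    'max',
    'tmp_output.nc',
    times=(0.0, 48.0), lons=(201.5, 205.5), lats=(18.5, 22.5),
)
print(' '.join(cmd))
```

The printed line can be pasted into a shell to see where the time is spent, independently of Python.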

If you're just getting the min/max values across an array of a variable, you can use xarray.
%matplotlib inline
import xarray as xr
da = xr.open_dataset('infile/file.nc')
max = da.sea_water_potential_temperature.max()
min = da.sea_water_potential_temperature.min()
This should give you a single value for each of the min and max. You can also get the min/max of a variable across a selected dimension such as time, longitude, or latitude. Xarray is great for handling multidimensional arrays, which is why it makes NetCDF easy to handle in Python when you're not using other tools like CDO and NCO.
Lastly, xarray is also used in other related libraries that deal with weather and climate data in Python ( http://xarray.pydata.org/en/stable/related-projects.html ).
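As a small sketch of the per-dimension reduction mentioned above, the example below uses an in-memory DataArray instead of a file, so the data values and the dimension names time, lat and lon are made up for illustration.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a variable read from a NetCDF file
temp = xr.DataArray(
    np.arange(24, dtype=float).reshape(2, 3, 4),
    dims=("time", "lat", "lon"),
    name="sea_water_potential_temperature",
)

# Single value over the whole array
overall_max = float(temp.max())

# One maximum per time step, reducing over the two spatial dimensions
max_per_time = temp.max(dim=["lat", "lon"])

print(overall_max)
print(max_per_time.values)
```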

A Python solution (using CDO as a backend) is my package nctoolkit (https://pypi.org/project/nctoolkit/ https://nctoolkit.readthedocs.io/en/latest/installing.html).
This has a number of built in methods for calculating different types of min/max values.
We would first need to read the file in as a dataset:
import nctoolkit as nc
data = nc.open_data(file)
If you wanted the maximum value across space, for each timestep, you would do the following:
data.spatial_max()
Maximum across all depths for each grid cell and time step would be calculated as follows:
data.vertical_max()
If you wanted the maximum across time, you would do:
data.max()
These methods are chainable, and the CDO backend is very efficient, so should be ideal for working with ROMS data.

Related

How can I read values of a variable in a netcdf file in python?

I have a NetCDF file called air.sig995.2012.nc. It has four variables:
('lat','lon','time','air').
I am trying to read the values of any of the variables, let's say the variable air, using the lines below:
import scipy.io.netcdf as S
fileobj=S.netcdf_file('air.sig995.2012.nc','r')
data=fileobj.variables['air'].getValue()
but it gives me the error below:
ValueError: can only convert an array of size 1 to a Python scalar
I am fairly new to Python. Can anyone help me with this one?
If not xarray, you can do this with the netcdf4-python library with the same slice syntax:
from netCDF4 import Dataset
nc = Dataset('air.sig995.2012.nc')
my_array = nc.variables['air'][:]
The output you're expecting is quite ambiguous but both methods should work depending on your specific goal in mind.
import xarray as xr
### open netcdf file ###
df = xr.open_dataset('path/file.nc')
### extract values of variable 'air' ####
air = df['air'][:]
air_flat = df['air'].values.flatten() # 1-d data

Loading hdf5 files into python xarrays

The python module xarray greatly supports loading/mapping netCDF files, even lazily with dask.
The data source I have to work with are thousands of hdf5 files, with lots of groups, datasets, attributes - all created with h5py.
The Question is: How can I load (or even better with dask, lazily map) hdf5 data (datasets, metadata,...) into an xarray dataset structure?
Does anybody have experience with that or has anybody come across a similar issue?
Thank you!
One possible solution to this is to open the hdf5-file using netCDF4 in diskless non-persistence mode:
ncf = netCDF4.Dataset(hdf5file, diskless=True, persist=False)
Now you can inspect the file contents including groups.
After that you can make use of xarray.backends.NetCDF4DataStore to open the wanted hdf5-groups (xarray can only get hold of one hdf5-group at a time):
nch = ncf.groups.get('hdf5-name')
xds = xarray.open_dataset(xarray.backends.NetCDF4DataStore(nch))
This will give you a dataset xds with all attributes and variables (datasets) of the group hdf5-name. Note that you will not get access to sub-groups; you would need to open subgroups by the same mechanism. If you want to apply dask, you would need to add the keyword chunks with the wanted values.
There is no (real) automatic decoding of the data as there would be for NetCDF files. If you have an integer-compressed 2D variable (dataset) var with some attributes gain and offset, you can add the NetCDF-specific attributes scale_factor and add_offset to the variable:
var = xds['var']
var.attrs['scale_factor'] = var.attrs.get('gain')
var.attrs['add_offset'] = var.attrs.get('offset')
ds = xarray.decode_cf(xds)
This will decode your variable using netcdf mechanisms.
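To make the decoding step concrete, here is a minimal NumPy sketch of the arithmetic the CF packing convention describes: unpacked = packed * scale_factor + add_offset, with fill values turned into NaN. The helper name unpack and the sample numbers are invented for illustration; xarray.decode_cf does this (and more) for you.

```python
import numpy as np


def unpack(raw, scale_factor, add_offset, fill_value=None):
    """Apply CF-style unpacking to an integer-packed array (hypothetical helper)."""
    data = raw.astype(np.float64) * scale_factor + add_offset
    if fill_value is not None:
        # Packed fill values carry no data, so mark them as missing
        data[raw == fill_value] = np.nan
    return data


# Made-up packed values: gain 0.01, offset 273.15, fill value -9999
raw = np.array([[0, 100], [200, -9999]], dtype=np.int16)
decoded = unpack(raw, scale_factor=0.01, add_offset=273.15, fill_value=-9999)
print(decoded)
```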
Additionally, you could try to give the extracted dimensions useful names (you will get something like phony_dim_0, phony_dim_1, ..., phony_dim_N) and assign new (as in the example) or existing variables/coordinates to those dimensions to gain as much of the xarray machinery as possible:
var = xds['var']
var.attrs['scale_factor'] = var.attrs.get('gain')
var.attrs['add_offset'] = var.attrs.get('offset')
dims = var.dims
xds['var'] = var.rename({dims[0]: 'x', dims[1]: 'y'})
xds = xds.assign({'x': (['x'], xvals, xattrs)})
xds = xds.assign({'y': (['y'], yvals, yattrs)})
ds = xarray.decode_cf(xds)
References:
netCDF4 Dataset
xarray.backends.NetCDF4DataStore
xarray.decode_cf

how to prevent netcdf4 in python from loading entire variable when using datetime

I am looking to verify my understanding of how python objects behave in this example.
Say I have on a laptop with limited memory a very large netcdf4 dataset, for example a million points in the unlimited dimension which is "time" with units of seconds since 2015-11-12 16:0:8.000000 0:00. I want to access, as a datetime object, the very first and the very last time without loading all the values in memory.
Now I know I can get at the first and last dates as datetime objects with this code:
import netCDF4 as nc4
from netCDF4 import Dataset
cdf = Dataset(fname,mode="r",format='NETCDF4')
time_var = cdf.variables['time']
dtime = nc4.num2date(time_var[0:10],time_var.units)
print('data starts at %s' % dtime[0])
The print statement gives me what I want:
"data starts at 2015-11-12 16:00:08"
Now did python load all the 'time' data into memory to do this? Or, as I have come to understand using MATLAB, cdf is now a pointer to the 'time' variable in the open file.
Many thanks,
Marinna
Yes, cdf is a pointer or view into the open file, not a copy into memory. This answer discusses this.
https://stackoverflow.com/a/4371049/1211981
As @bart mentioned, you should just use:
dtime = nc4.num2date(time_var[0],time_var.units)
and
dtime2 = nc4.num2date(time_var[-1],time_var.units)
to get the times you want. No big copy into memory.
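The idea of touching only the two end points can be sketched with the standard library alone. Below, plain epoch arithmetic stands in for num2date, which assumes a standard Gregorian calendar and units of seconds; the function name first_and_last is hypothetical, and a range object plays the role of the million-point time variable.

```python
from datetime import datetime, timedelta


def first_and_last(time_values, epoch):
    """Convert only the first and last offsets (seconds since `epoch`)."""
    start = epoch + timedelta(seconds=float(time_values[0]))
    end = epoch + timedelta(seconds=float(time_values[-1]))
    return start, end


# Epoch taken from the units string in the question
epoch = datetime(2015, 11, 12, 16, 0, 8)

# Stand-in for the on-disk variable: indexable, never fully materialized
offsets = range(0, 1_000_000)

start, end = first_and_last(offsets, epoch)
print('data starts at %s' % start)
print('data ends at %s' % end)
```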

Reading 1-D Variables in WRF NetCDF files with GDAL python

My question is simple.
Take a wrfout file "out.nc", for example.
The file contains Geo2D, Geo3D and 1D variables.
Using GDAL package in Python 2.7, I can extract the Geo2D variables easily like this:
## T2 is 2-d variable means temperature 2 m above the ground
temp = gdal.Open('NETCDF:"'+"out.nc"+'":T2')
But when I use this code to extract a 1-d array, it fails.
## Time is a 1-d array representing the time series throughout the simulation period
time = gdal.Open('NETCDF:"'+"out.nc"+'":Time')
Nothing happened! I hope someone can offer some advice on how to easily read WRF output variables of any dimension!
You can also use the NetCDF reader in scipy.io:
import scipy.io.netcdf as nc
# Open a netcdf file object and assign the data values to a variable
time = nc.netcdf_file('out.nc', 'r').variables['Time'][:]
This has the benefit that scipy is a very popular and widely installed package, and the reader works similarly to opening ordinary files in some respects.

Efficient reading of netcdf variable in python

I need to be able to quickly read lots of netCDF variables in python (1 variable per file). I'm finding that the Dataset function in netCDF4 library is rather slow compared to reading utilities in other languages (e.g., IDL).
My variables have shape of (2600,5200) and type float. They don't seem that big to me (filesize = 52Mb).
Here is my code:
import numpy as np
from netCDF4 import Dataset
import time
file = '20151120-235839.netcdf'
t0=time.time()
openFile = Dataset(file,'r')
raw_data = openFile.variables['MergedReflectivityQCComposite']
data = np.copy(raw_data)
openFile.close()
print(time.time() - t0)
It takes about 3 seconds to read one variable (one file). I think the main slowdown is np.copy. raw_data is <type 'netCDF4.Variable'>, thus the copy. Is this the best/fastest way to do netCDF reads in python?
Thanks.
The power of NumPy is that you can create views into the existing data in memory via the metadata it retains about the data. So a copy will always be slower than a view, which works via pointers. As JCOidl says, it's not clear why you don't just use:
raw_data = openFile.variables['MergedReflectivityQCComposite'][:]
For more info see SciPy Cookbook and SO View onto a numpy array?
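The view-versus-copy distinction can be checked directly on an in-memory array, as in the short sketch below. This only illustrates the NumPy side of the argument: slicing a variable in an open file still has to read the values from disk into a new array.

```python
import numpy as np

a = np.arange(12.0).reshape(3, 4)

view = a[:]        # basic slicing: new array object, same underlying buffer
copy = np.copy(a)  # allocates a new buffer and copies every element

print(np.shares_memory(a, view))
print(np.shares_memory(a, copy))

# Writing through the view changes the original; the copy is unaffected
view[0, 0] = -1.0
print(a[0, 0], copy[0, 0])
```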
I'm not sure what to say about the np.copy operation (which is indeed slow), but I find that the PyNIO module from UCAR works well for both NetCDF and HDF files. This will place data into a numpy array:
import Nio
f = Nio.open_file(file, format="netcdf")
data = f.variables['MergedReflectivityQCComposite'][:]
f.close()
Testing your code versus the PyNIO code on a NetCDF file I have resulted in 1.1 seconds for PyNIO, versus 3.1 seconds for the netCDF4 module. Your results may vary; worth a look though.
You can use xarray for that.
%matplotlib inline
import xarray as xr
### Single netcdf file ###
ds = xr.open_dataset('path/file.nc')
### Opening multiple NetCDF files and concatenating them by time ####
ds = xr.open_mfdataset('path/*.nc', concat_dim='time', combine='nested')  # recent xarray versions need combine='nested' when concat_dim is given
To read the variable you can simply type ds.MergedReflectivityQCComposite or ds['MergedReflectivityQCComposite'][:]
You can also use xr.load_dataset but I find that it uses up more space than the open function. For xr.open_mfdataset, you can also chunk along the dimensions of the file if you want. There are other options for both functions and you might be interested to learn more about it in the xarray documentation.
