Loading hdf5 files into python xarrays - python

The python module xarray greatly supports loading/mapping netCDF files, even lazily with dask.
The data source I have to work with are thousands of hdf5 files, with lots of groups, datasets, attributes - all created with h5py.
The question is: how can I load (or, even better, lazily map with dask) hdf5 data (datasets, metadata, ...) into an xarray dataset structure?
Does anybody have experience with this, or has anyone come across a similar issue?
Thank you!

One possible solution to this is to open the hdf5-file using netCDF4 in diskless non-persistence mode:
ncf = netCDF4.Dataset(hdf5file, diskless=True, persist=False)
Now you can inspect the file contents including groups.
After that you can make use of xarray.backends.NetCDF4DataStore to open the wanted hdf5-groups (xarray can only get hold of one hdf5-group at a time):
nch = ncf.groups.get('hdf5-name')
xds = xarray.open_dataset(xarray.backends.NetCDF4DataStore(nch))
This will give you a dataset xds with all attributes and variables (datasets) of the group hdf5-name. Note that you will not get access to sub-groups; you would need to open each subgroup by the same mechanism. If you want to use dask, you would need to pass the chunks keyword with the wanted values to xarray.open_dataset.
There is no (real) automatic decoding of the data as there is for NetCDF files. If you have an integer-compressed 2D variable (dataset) var with some attributes gain and offset, you can add the NetCDF-specific attributes scale_factor and add_offset to the variable:
var = xds['var']
var.attrs['scale_factor'] = var.attrs.get('gain')
var.attrs['add_offset'] = var.attrs.get('offset')
ds = xarray.decode_cf(xds)
This will decode your variable using netcdf mechanisms.
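For intuition, the decoding triggered by scale_factor and add_offset is plain arithmetic: unpacked = packed * scale_factor + add_offset. Here is a minimal NumPy sketch of that step. The array and the gain/offset values are made up for illustration and do not come from any real file:

```python
import numpy as np

# Hypothetical packed data and attributes, chosen only for illustration.
packed = np.array([[0, 10], [20, 30]], dtype=np.int16)
gain = 0.5      # would be stored as scale_factor
offset = -5.0   # would be stored as add_offset

# CF-style unpacking: unpacked = packed * scale_factor + add_offset
unpacked = packed * gain + offset
print(unpacked)
```

xarray.decode_cf applies the same idea per variable, based on the attributes it finds.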
Additionally, you could give the extracted dimensions useful names (you will get something like phony_dim_0, phony_dim_1, ..., phony_dim_N) and assign new (as in the example) or existing variables/coordinates to those dimensions, to make use of as much of the xarray machinery as possible:
var = xds['var']
var.attrs['scale_factor'] = var.attrs.get('gain')
var.attrs['add_offset'] = var.attrs.get('offset')
dims = var.dims
xds['var'] = var.rename({dims[0]: 'x', dims[1]: 'y'})
xds = xds.assign({'x': (['x'], xvals, xattrs)})
xds = xds.assign({'y': (['y'], yvals, yattrs)})
ds = xarray.decode_cf(xds)
References:
netCDF4 Dataset
xarray.backends.NetCDF4DataStore
xarray.decode_cf

Related

Converting CSV to numpy NPY efficiently

How to convert a .csv file to .npy efficiently?
I've tried:
import numpy as np
filename = "myfile.csv"
vec = np.loadtxt(filename, delimiter=",")
np.save(f"{filename}.npy", vec)
While the above works for smallish files, the actual .csv file I'm working on has ~12 million lines with 1024 columns, and loading everything into RAM before converting to .npy format is very costly.
Q (Part 1): Is there some way to load/convert a .csv to .npy efficiently for large CSV file?
The above code snippet is similar to the answer from Convert CSV to numpy but that won't work for ~12M x 1024 matrix.
Q (Part 2): If there isn't any way to load/convert a .csv to .npy efficiently, is there some way to iteratively read the .csv file into .npy efficiently?
Also, there's an answer here https://stackoverflow.com/a/53558856/610569 to save the csv file as numpy array iteratively. But seems like the np.vstack isn't the best solution when reading the file. The accepted answer there suggests hdf5 but the format is not the main objective of this question and the hdf5 format isn't desired in my use-case since I've to read it back into a numpy array afterwards.
Q (Part 3): If parts 1 and 2 are not possible, are there other efficient storage formats (e.g. tensorstore) that can be converted efficiently to a numpy array when loading the saved data?
There is another library, tensorstore, that seems to handle arrays efficiently and supports conversion to a numpy array when read: https://google.github.io/tensorstore/python/tutorial.html. But somehow there isn't any information on how to save the tensor/array without knowing the exact dimensions; all of the examples seem to include configurations like 'dimensions': [1000, 20000].
Unlike HDF5, tensorstore doesn't seem to have read-overhead issues when converting to numpy. From the docs:
Conversion to an numpy.ndarray also implicitly performs a synchronous read (which hits the in-memory cache since the same region was just retrieved)
Nice question; informative in itself.
I understand you want to have the whole data set/array in memory, eventually, as a NumPy array. I assume, then, that you have enough memory (RAM) to hold such an array -- 12M x 1K.
I don't know specifically how np.loadtxt (genfromtxt) operates behind the scenes, so I will tell you how I would do it (after trying like you did).
Reasoning about memory...
Notice that a simple boolean array will cost ~12 GBytes of memory:
>>> print("{:.1E} bytes".format(
...     np.array([True]).itemsize * 12E6 * 1024
... ))
1.2E+10 bytes
And this is for a Boolean data type. Most likely, you have -- what -- a dataset of Integer, Float? The size may increase quite significantly:
>>> np.array([1], dtype=bool).itemsize
1
>>> np.array([1], dtype=int).itemsize
8
>>> np.array([1], dtype=float).itemsize
8
It's a lot of memory (which you know, just want to emphasize).
At this point, I would like to point out possible swapping of the working memory. You may have enough physical memory (RAM) in your machine, but if not enough of it is free, your system will use swap space (i.e., disk) to keep itself stable and get the work done. The cost you pay is clear: reading from and writing to disk is very slow.
My point so far is: check the data type of your dataset, estimate the size of your future array, and make sure you have at least that amount of RAM available.
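As a rough sketch of that estimate, the size is just the number of elements times the item size of the dtype. The helper name estimate_nbytes is hypothetical, and the shape is the one quoted in the question:

```python
import numpy as np

def estimate_nbytes(shape, dtype):
    """Return the number of bytes an array of this shape and dtype needs."""
    n_items = 1
    for dim in shape:
        n_items *= dim
    return n_items * np.dtype(dtype).itemsize

rows, cols = 12_000_000, 1024  # sizes quoted in the question
for dtype in (np.bool_, np.float32, np.float64):
    gib = estimate_nbytes((rows, cols), dtype) / 2**30
    print(f"{np.dtype(dtype).name}: {gib:.1f} GiB")
```

Comparing the result against the free memory reported by your OS tells you whether the in-memory route is realistic at all.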
I/O text
Assuming you do have all the memory (RAM) necessary to hold the whole numpy array, I would loop over the whole (~12M lines) text file, filling a pre-existing array row by row.
More precisely, I would have the (big) array already instantiated before starting to read the file. Only then would I read each line, split the columns, pass them to np.asarray, and assign those (1024) values to the respective row of the output array.
Looping over the file is slow, yes. The point is that you limit (and control) the amount of memory being used. Roughly speaking, the big objects consuming your memory are the "output" (big) array and the "line" (1024) array. Sure, a considerable amount of memory is consumed in each iteration by the temporary objects created while reading (text!) values, splitting them into list elements, and casting to an array. Still, that amount will remain largely constant over the whole ~12M lines.
So, the steps I would go through are:
0) estimate and guarantee enough RAM memory available
1) instantiate (np.empty or np.zeros) the "output" array
2) loop over "input.txt" file, create a 1D array from each line "i"
3) assign the line values/array to row "i" of "output" array
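The steps above can be sketched as follows. This is a minimal illustration, not a tuned implementation: it assumes a headerless, purely numeric CSV with a known number of columns, and the helper name csv_to_array and the tiny demo file are hypothetical. Rows are counted in a first pass so the output can be allocated once:

```python
import os
import tempfile
import numpy as np

def csv_to_array(path, n_cols, dtype=np.float64):
    """Fill a preallocated array row by row from a headerless CSV file."""
    # First pass: count non-empty rows so the output can be allocated once.
    with open(path) as f:
        n_rows = sum(1 for line in f if line.strip())
    out = np.empty((n_rows, n_cols), dtype=dtype)
    # Second pass: parse each line and assign it to its row.
    with open(path) as f:
        i = 0
        for line in f:
            if not line.strip():
                continue
            out[i] = np.asarray(line.strip().split(","), dtype=dtype)
            i += 1
    return out

# Tiny demo file standing in for the real ~12M-line CSV.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w") as f:
        f.write("1,2,3\n4,5,6\n")
    demo = csv_to_array(demo_path, n_cols=3)
print(demo)
```

Apart from the output array, only one line's worth of temporaries is alive at any time, which is the point of the approach.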
Sure enough, you can even make it parallel: while text files cannot be randomly (r/w) accessed, you can easily split them (see How can I split one text file into multiple *.txt files?) and, if you are up for it, read the pieces in parallel when time is critical.
Hope that helps.
TL;DR
Exporting to a format other than .npy seems inevitable unless your machine is able to handle the size of the data in memory, as described in @Brandt's answer.
Reading the data, then processing it (Kinda answering Q part 2)
To handle data larger than what the RAM can hold, one would often resort to libraries that perform "out-of-core" computation, e.g. turicreate.SFrame, vaex or dask. These libraries are able to lazily load the .csv files into dataframes and process them in chunks when evaluated.
from turicreate import SFrame
filename = "myfile.csv"
sf = SFrame.read_csv(filename)
sf.apply(...) # Trying to process the data
or
import vaex
filename = "myfile.csv"
df = vaex.from_csv(filename,
                   convert=True,
                   chunk_size=50_000_000)
df.apply(...)
Converting the read data into numpy array (kinda answering Q part 1)
While out-of-core libraries can read and process the data efficiently, converting into numpy is an "in-memory" operation: the machine needs to have enough RAM to fit all the data.
The turicreate.SFrame.to_numpy documentation writes:
Converts this SFrame to a numpy array
This operation will construct a numpy array in memory. Care must be taken when size of the returned object is big.
And the vaex documentation writes:
In-memory data representations
One can construct a Vaex DataFrame from a variety of in-memory data representations.
And dask actually implements its own array objects, which are simpler than numpy arrays; see https://docs.dask.org/en/stable/array-best-practices.html. But going through the docs, it seems that dask arrays are saved not as .npy but in various other formats.
Writing the file into non-.npy versions (answering Q Part 3)
Given that numpy arrays are inevitably in-memory, trying to save the data into one single .npy file isn't the most viable option.
Different libraries seem to have different solutions for storage. E.g.
vaex saves the data into hdf5 by default if the convert=True argument is set when the data is read through vaex.from_csv()
sframe saves the data into its own binary format
dask's export functions to_hdf() and to_parquet() save to HDF5 and Parquet formats
In its latest version (4.14), vaex supports "streaming", i.e. lazy loading of CSV files. It uses pyarrow under the hood, so it is super fast. Try something like
df = vaex.open('my_file.csv')
# or
df = vaex.from_csv_arrow('my_file.csv', lazy=True)
Then you can export to a bunch of formats as needed, or keep working with it like that (it is surprisingly fast). Of course, it is better to convert to some kind of binary format.
import numpy as np
import pandas as pd
# Define the input and output file names
csv_file = 'data.csv'
npy_file = 'data.npy'
# Create dummy data
data = np.random.rand(10000, 100)
df = pd.DataFrame(data)
df.to_csv(csv_file, index=False)
# Define the chunk size
chunk_size = 1000
# Read the header row and get the number of columns
header = pd.read_csv(csv_file, nrows=0)
num_cols = len(header.columns)
# Initialize an empty array to store the data
data = np.empty((0, num_cols))
# Loop over the chunks of the csv file
for chunk in pd.read_csv(csv_file, chunksize=chunk_size):
    # Convert the chunk to a numpy array
    chunk_array = chunk.to_numpy()
    # Append the chunk to the data array
    data = np.append(data, chunk_array, axis=0)
np.save(npy_file, data)
# Load the npy file and check the shape
npy_data = np.load(npy_file)
print('Shape of data before conversion:', data.shape)
print('Shape of data after conversion:', npy_data.shape)
I'm not aware of any existing function or utility that directly and efficiently converts csv files into npy files, where by "efficiently" I primarily mean with low memory requirements.
Writing an npy file iteratively is indeed possible, with some extra effort. There's already a question on SO that addresses this, see:
save numpy array in append mode
For example using the NpyAppendArray class from Michael's answer you can do:
with open('data.csv') as csv, NpyAppendArray('data.npy') as npy:
    for line in csv:
        row = np.fromstring(line, sep=',')
        npy.append(row[np.newaxis, :])
The NpyAppendArray class updates the npy file header on every call to append, which is a bit much for your 12M rows. Maybe you could update the class to (optionally) only write the header on close. Or you could easily batch the writes:
batch_lines = 128
with open('data.csv') as csv, NpyAppendArray('data.npy') as npy:
    done = False
    while not done:
        batch = []
        for count, line in enumerate(csv):
            row = np.fromstring(line, sep=',')
            batch.append(row)
            if count + 1 >= batch_lines:
                break
        else:
            # The file was exhausted without hitting the batch limit
            done = True
        if batch:  # skip a trailing empty batch
            npy.append(np.array(batch))
(code is not tested)
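A different route, not taken from the answer above: if the number of rows and columns is known up front (for example from a cheap line-counting pass), NumPy's np.lib.format.open_memmap can create the .npy file at its final size and let you fill it row by row, so the full array never has to sit in RAM. This is a sketch under those assumptions; the helper name csv_to_npy and the demo file are hypothetical:

```python
import os
import tempfile
import numpy as np

def csv_to_npy(csv_path, npy_path, n_rows, n_cols, dtype=np.float64):
    """Stream a headerless CSV into a .npy file via a disk-backed memmap."""
    # open_memmap writes a valid .npy header and maps the data region to disk.
    out = np.lib.format.open_memmap(
        npy_path, mode="w+", dtype=dtype, shape=(n_rows, n_cols)
    )
    with open(csv_path) as f:
        for i, line in enumerate(f):
            out[i] = np.asarray(line.strip().split(","), dtype=dtype)
    out.flush()  # push pending writes to disk
    del out      # release the memory map

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "demo.csv")
    npy_path = os.path.join(tmp, "demo.npy")
    with open(csv_path, "w") as f:
        f.write("1,2\n3,4\n5,6\n")
    csv_to_npy(csv_path, npy_path, n_rows=3, n_cols=2)
    loaded = np.load(npy_path)  # a plain np.load reads the result
print(loaded)
```

The resulting file is an ordinary .npy file, so it can later be read with np.load, optionally with mmap_mode="r" if it does not fit in memory.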

h5py file subset taking more space than parent file?

I have an existing h5py file that I downloaded which is ~18G in size. It has a number of nested datasets within it:
h5f = h5py.File('input.h5', 'r')
data = h5f['data']
latlong_data = data['lat_long'].value
I want to be able to do some basic min/max scaling of the numerical data within latlong, so I want to put it in its own h5py file for easier use and lower memory usage.
However, when i try to write it out to its own file:
out = h5py.File('latlong_only.h5', 'w')
out.create_dataset('latlong', data=latlong_data)
out.close()
The output file is incredibly large. It's still not done writing to disk and is ~85GB in space. Why is the data being written to the new file not compressed?
It could be that h5f['data/lat_long'] is using compression filters (and you aren't). To check the original dataset's compression settings, use this line:
print(h5f['data/lat_long'].compression, h5f['data/lat_long'].compression_opts)
After writing my answer, it occurred to me that you don't need to copy the data to another file to reduce the memory footprint. Your code reads the dataset into an array, which is not necessary in most use cases. An h5py dataset object behaves similarly to a NumPy array. Instead, use this line: ds = h5f1['data/lat_long'] to create a dataset object (instead of an array) and use it "like" a NumPy array. FYI, .value is a deprecated way to return the dataset as an array. Use this syntax instead: arr = h5f1['data/lat_long'][()]. Loading the dataset into an array also requires more memory than using an h5py object (which could be an issue with large datasets).
There are other ways to access the data. My suggestion to use dataset objects is one way. Your method (extracting data to a new file) is another. I am not fond of that approach because you now have 2 copies of the data; a bookkeeping nightmare. Another alternative is to create external links from the new file to the existing 18GB file. That way you have a small file that links to the big file (and no duplicate data). I describe that method in this post: How can I combine multiple .h5 file? (Method 1: Create External Links).
If you still want to copy the data, here is what I would do. Your code reads the dataset into an array then writes the array to the new file (uncompressed). Instead, copy the dataset using h5py's group .copy() method, it will retain compression settings and attributes.
See below:
with h5py.File('input.h5', 'r') as h5f1, \
     h5py.File('latlong_only.h5', 'w') as h5f2:
    h5f1.copy(h5f1['data/lat_long'], h5f2, 'latlong')
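To see that .copy() carries the compression settings along, here is a small self-contained sketch. It builds a tiny stand-in for the real input file in a temporary directory; the data, the gzip level and the file names are assumptions made only for the demonstration:

```python
import os
import tempfile
import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "input.h5")
    dst_path = os.path.join(tmp, "latlong_only.h5")

    # Build a small stand-in for the 18 GB source file (illustrative data).
    with h5py.File(src_path, "w") as src:
        src.create_dataset(
            "data/lat_long",
            data=np.zeros((1000, 2)),
            compression="gzip",
            compression_opts=4,
        )

    # Copy the dataset object itself; its filters travel with it.
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        src.copy(src["data/lat_long"], dst, "latlong")

    # Inspect the compression settings of the copied dataset.
    with h5py.File(dst_path, "r") as dst:
        copied_compression = dst["latlong"].compression
        copied_opts = dst["latlong"].compression_opts

print(copied_compression, copied_opts)
```

By contrast, create_dataset(..., data=array) without a compression argument writes the data uncompressed, which fits the size blow-up described in the question.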

Reading 1-D Variables in WRF NetCDF files with GDAL python

My question is simple.
Take a wrfout file "out.nc", for example.
The file contains Geo2D, Geo3D and 1D variables.
Using the GDAL package in Python 2.7, I can extract the Geo2D variables easily like this:
## T2 is a 2-D variable: temperature 2 m above the ground
temp = gdal.Open('NETCDF:"'+"out.nc"+'":T2')
But when I use this code to extract a 1-D array, it fails.
## Time is a 1-D array representing the time series through the simulation period
time = gdal.Open('NETCDF:"'+"out.nc"+'":Time')
Nothing happens! I hope someone can offer some advice on how to easily read WRF output variables of any dimension!
You can also use the NetCDF reader in scipy.io:
from scipy.io import netcdf_file
# Open a netCDF file object and assign the data values to a variable
time = netcdf_file('out.nc', 'r').variables['Time'][:]
This has the benefit that scipy is a very popular and widely installed package, and the interface works much like opening an ordinary file.

Efficient reading of netcdf variable in python

I need to be able to quickly read lots of netCDF variables in python (1 variable per file). I'm finding that the Dataset function in netCDF4 library is rather slow compared to reading utilities in other languages (e.g., IDL).
My variables have shape of (2600,5200) and type float. They don't seem that big to me (filesize = 52Mb).
Here is my code:
import numpy as np
from netCDF4 import Dataset
import time
file = '20151120-235839.netcdf'
t0=time.time()
openFile = Dataset(file,'r')
raw_data = openFile.variables['MergedReflectivityQCComposite']
data = np.copy(raw_data)
openFile.close()
print(time.time() - t0)
It takes about 3 seconds to read one variable (one file). I think the main slowdown is np.copy. raw_data is <type 'netCDF4.Variable'>, thus the copy. Is this the best/fastest way to do netCDF reads in python?
Thanks.
The power of Numpy is that you can create views into the existing data in memory via the metadata it retains about the data. So a copy will always be slower than a view, via pointers. As JCOidl says, it's not clear why you don't just use:
raw_data = openFile.variables['MergedReflectivityQCComposite'][:]
For more info see SciPy Cookbook and SO View onto a numpy array?
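A minimal sketch of the view/copy distinction for in-memory NumPy arrays (the array here is made up). Note that this is about arrays already in memory; indexing a netCDF4 variable with [:] still reads the data from disk into a new array:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

view = a[:, 1:3]            # basic slicing returns a view, no data copied
copy = np.copy(a[:, 1:3])   # explicit copy allocates new memory

print(np.shares_memory(a, view))  # the view points at a's buffer
print(np.shares_memory(a, copy))  # the copy owns its own buffer

view[0, 0] = -1             # writing through the view changes a
```

After the last line, a[0, 1] has changed while copy is untouched, which is exactly why a view is cheap and a copy is not.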
I'm not sure what to say about the np.copy operation (which is indeed slow), but I find that the PyNIO module from UCAR works well for both NetCDF and HDF files. This will place data into a numpy array:
import Nio
f = Nio.open_file(file, format="netcdf")
data = f.variables['MergedReflectivityQCComposite'][:]
f.close()
Testing your code versus the PyNIO code on a NetCDF file I have resulted in 1.1 seconds for PyNIO, versus 3.1 seconds for the netCDF4 module. Your results may vary; worth a look though.
You can use xarray for that.
import xarray as xr
### Single NetCDF file ###
ds = xr.open_dataset('path/file.nc')
### Opening multiple NetCDF files and concatenating them by time ###
ds = xr.open_mfdataset('path/*.nc', combine='nested', concat_dim='time')
To read the variable you can simply type ds.MergedReflectivityQCComposite or ds['MergedReflectivityQCComposite'][:]
You can also use xr.load_dataset, but I find that it uses more memory than the open function, since it loads the whole file eagerly. For xr.open_mfdataset, you can also chunk along the dimensions of the file if you want. There are other options for both functions, and you might want to learn more about them in the xarray documentation.

fastest way to get NetCDF variable min/max using Python?

My usual method for extracting the min/max of a variable's data values from a NetCDF file is an order of magnitude slower when switching to the netCDF4 Python module compared to scipy.io.netcdf.
I am working with relatively large ocean model output files (from ROMS) with multiple depth levels over a given map region (Hawaii). When these were in NetCDF-3, I used scipy.io.netcdf.
Now that these files are in NetCDF-4 ("Classic") I can no longer use scipy.io.netcdf and have instead switched over to using the netCDF4 Python module. However, the slowness is a concern and I wondered if there is a more efficient method of extracting a variable's data range (minimum and maximum data values)?
Here was my NetCDF-3 method using scipy:
import scipy.io.netcdf
netcdf = scipy.io.netcdf.netcdf_file(file)
var = netcdf.variables['sea_water_potential_temperature']
min = var.data.min()
max = var.data.max()
Here is my NetCDF-4 method using netCDF4:
import netCDF4
netcdf = netCDF4.Dataset(file)
var = netcdf.variables['sea_water_potential_temperature']
var_array = var.data.flatten()
min = var_array.data.min()
max = var_array.data.max()
The notable difference is that I must first flatten the data array in netCDF4, and this operation apparently slows things down.
Is there a better/faster way?
Per suggestion of hpaulj here is a function that calls the nco command ncwa using subprocess. It hangs terribly when using an OPeNDAP address, and I don't have any files on hand to test it locally.
You can see if it works for you and what the speed difference is.
This assumes you have the nco library installed.
def ncwa(path, fnames, var, op_type, times=None, lons=None, lats=None):
    '''Perform arithmetic operations on netCDF file or OPeNDAP data
    Args
    ----
    path: str
        prefix
    fnames: str or iterable
        Names of file(s) to perform operation on
    var: str
        Name of the variable to read from the ncwa output
    op_type: str
        ncwa arithmetic operation to perform. Available operations are:
        avg,mabs,mebs,mibs,min,max,ttl,sqravg,avgsqr,sqrt,rms,rmssdn
    times: tuple
        Minimum and maximum timestamps within which to perform the operation
    lons: tuple
        Minimum and maximum longitudes within which to perform the operation
    lats: tuple
        Minimum and maximum latitudes within which to perform the operation
    Returns
    -------
    result: float
        Result of the operation on the selected data
    Note
    ----
    Adapted from the OPeNDAP examples in the NCO documentation:
    http://nco.sourceforge.net/nco.html#OPeNDAP
    '''
    import os
    import netCDF4
    import numpy
    import subprocess
    output = 'tmp_output.nc'
    # Concatenate subprocess command
    cmd = ['ncwa']
    cmd.extend(['-y', '{}'.format(op_type)])
    if times:
        cmd.extend(['-d', 'time,{},{}'.format(times[0], times[1])])
    if lons:
        cmd.extend(['-d', 'lon,{},{}'.format(lons[0], lons[1])])
    if lats:
        cmd.extend(['-d', 'lat,{},{}'.format(lats[0], lats[1])])
    cmd.extend(['-p', path])
    cmd.extend(numpy.atleast_1d(fnames).tolist())
    cmd.append(output)
    # Run cmd and check for errors
    subprocess.run(cmd, stdout=subprocess.PIPE, check=True)
    # Load, read, close data and delete temp .nc file
    data = netCDF4.Dataset(output)
    result = float(data[var][:])
    data.close()
    os.remove(output)
    return result
path = 'https://ecowatch.ncddc.noaa.gov/thredds/dodsC/hycom/hycom_reg6_agg/'
fname = 'HYCOM_Region_6_Aggregation_best.ncd'
times = (0.0, 48.0)
lons = (201.5, 205.5)
lats = (18.5, 22.5)
smax = ncwa(path, fname, 'salinity', 'max', times, lons, lats)
If you're just getting the min/max values across an array of a variable, you can use xarray.
import xarray as xr
da = xr.open_dataset('infile/file.nc')
max = da.sea_water_potential_temperature.max()
min = da.sea_water_potential_temperature.min()
This should give you a single min/max value, respectively. You could also get the min/max of a variable across a selected dimension like time, longitude, latitude, etc. Xarray is great for handling multidimensional arrays, which is why it's pretty easy to handle NetCDF in Python when you're not using other tools like CDO and NCO.
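As a small sketch of reducing over everything versus over one dimension, here is an in-memory example. The DataArray is synthetic; the dimension names and sizes are assumptions standing in for a real model variable:

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a model variable; names and sizes are illustrative.
temp = xr.DataArray(
    np.arange(24, dtype=float).reshape(2, 3, 4),
    dims=("time", "lat", "lon"),
    name="sea_water_potential_temperature",
)

overall_max = float(temp.max())        # single value over all dimensions
max_over_time = temp.max(dim="time")   # one value per (lat, lon) cell
print(overall_max, max_over_time.shape)
```

The same calls work on a variable taken from a dataset opened with xr.open_dataset.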
Lastly, xarray is also used in other related libraries that deal with weather and climate data in Python (http://xarray.pydata.org/en/stable/related-projects.html).
A Python solution (using CDO as a backend) is my package nctoolkit (https://pypi.org/project/nctoolkit/ https://nctoolkit.readthedocs.io/en/latest/installing.html).
This has a number of built in methods for calculating different types of min/max values.
We would first need to read the file in as a dataset:
import nctoolkit as nc
data = nc.open_data(file)
If you wanted the maximum value across space, for each timestep, you would do the following:
data.spatial_max()
Maximum across all depths for each grid cell and time step would be calculated as follows:
data.vertical_max()
If you wanted the maximum across time, you would do:
data.max()
These methods are chainable, and the CDO backend is very efficient, so should be ideal for working with ROMS data.
