How to merge differently shaped netcdf4 files? - python

I am storing weather forecasts as netCDF4 files. These netCDF4 files are tiled following the Google Maps tiling principle: I define a zoom level (here 6), which determines the extent of each tile.
Based on that information I used the following code to slice the array:
sliced_data = data.where(
    (data[lat_coord_name] <= maxLat)
    & (data[lat_coord_name] > minLat)
    & (data[lon_coord_name] <= maxLon)
    & (data[lon_coord_name] > minLon),
    drop=True,
)
Here, data is an xarray.Dataset. At the end of this process I have 36 tiles for a weather model covering central Europe.
My problem is merging them back into the native, untiled xarray.Dataset. The projection of the weather model differs from the projection of the tile maps, so I end up with netCDF4 files of different shapes in the x and y dimensions, and there is no axis along which xarray can align them.
The dimensions of the native grid are 340x340. You can find a test dataset here
My expectation was:
import glob
import xarray
file_list = glob.glob('test_data_stackoverflow/*')
file_list.sort()
dataset = xarray.open_mfdataset(file_list, engine="h5netcdf")
But this fails because of the differently shaped datasets.
I am open to using other tools such as netCDF4, h5netcdf or CDO, but the data must not be manipulated, e.g. by interpolating back onto the original grid.

Combining datasets with the same dimension names but different dimension sizes is not possible within a single xarray Dataset. It is possible, however, with a newer xarray data structure, currently under development, called a DataTree. DataTree currently lives in a separate package - https://xarray-datatree.readthedocs.io/en/latest/ - but the plan is to merge it into xarray proper soon. DataTree is used by the ndpyramid library to store multi-scale array data, very similar to the use case you describe.
I would explore combining your datasets into a single DataTree object. First organize your data into a dict, then create a DataTree from the dict. You will need to decide how to encode the levels of the hierarchy; the simplest option is to use an integer for each zoom level, e.g.
from datatree import DataTree
import xarray as xr

data_dict = {str(level): xr.open_dataset(f) for level, f in enumerate(file_list)}
dt = DataTree.from_dict(data_dict)
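Once built, each differently shaped tile lives in its own node, so nothing has to align. A node can be pulled back out as an ordinary Dataset (a sketch using the string keys created above):
print(dt)           # prints the hierarchy of differently shaped tiles
tile0 = dt["0"].ds  # retrieve one tile as a regular xarray.Dataset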
The ndpyramid code might be a useful reference: https://github.com/carbonplan/ndpyramid/blob/main/ndpyramid/core.py

You can probably solve this using CDO's merge method:
cdo merge test_data_stackoverflow/* out.nc
If the 36 tiles make up a 6 x 6 grid, then mergegrid can potentially merge them:
cdo mergegrid test_data_stackoverflow/* out.nc
Note, though, that mergegrid takes exactly two input files, so you may need to apply it pairwise rather than with a single glob.

Related

Looking for a Simple Way to Access and Export to a new array multiple data variables in an xarray object

I have an xarray object containing MODIS data imported from HDF4 format. The structure of the Dataset is something like the one below, where spectral bands are stored as Data Variables - each band is a different variable.
# Create an example Dataset with 5 variables (let's pretend each layer is a band)
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {var: (("band", "x", "y"), np.random.rand(1, 10, 10)) for var in "abcde"}
)
If the bands were stored in an array, it would be easy to explore the data and plot each band using .plot and the built-in facet grid tools. However, in this case I have to plot each layer individually or use a loop. Is there a quick way to automate grabbing, say, x number of those variables or bands (maybe b, c and e as an example) and plotting them?
In some cases you may need to plot an RGB image, so I'd do something like this:
# This is one way of combining several bands into a single array object, but it's
# very manual: I have to specify each band in the concat statement. It does
# achieve the output that I want, though.
new_object = xr.concat([ds.b, ds.c, ds.e], dim="band")
# Now I can plot the data
new_object.plot.imshow()
My goal is to automate the process of selecting x bands (or x number of data variables) for both plotting/visualization and analysis. I don't want to hard-code each band into a concat() call as I did above. In the example above I wanted to plot an RGB image; in other cases I'd want to visually explore each band before additional processing, or just extract a few bands for other types of calculations and analysis.
Thank you for any direction!!
I think xarray.Dataset.to_array() may be what you are looking for. For example, for the RGB image, something like the following should work:
ds.squeeze("band").to_array(dim="band").sel(band=["b", "c", "e"]).plot.imshow()
You could also facet over the "band" dimension in that case:
ds.squeeze("band").to_array(dim="band").plot(col="band")

How do I get arrays of coordinate values for a variable from a netCDF gridded dataset using Siphon and MetPy?

I have requested a netCDF subset using Siphon and formed a query to retrieve a variable within a bounding box:
from siphon.catalog import TDSCatalog
cat = TDSCatalog("https://thredds.ucar.edu/thredds/catalog/grib/NCEP/GFS/Global_onedeg/catalog.xml?dataset=grib/NCEP/GFS/Global_onedeg/Best")
ncss = cat.datasets[0].subset()
query = ncss.query()
query.variables("Absolute_vorticity_isobaric")
query.lonlat_box(north=34., south=33., west=-102., east=-101.)
query.accept("netcdf4")
I am seeking a reliable, concise approach to getting the values of that variable's coordinates, specifically time and vertical level. A working, but impractical approach to this would be to request and work with the whole dataset.
Functional-but-Impractical Approach
Get the data
import xarray as xr
query.all_times()
data = ncss.get_data(query)
datastore = xr.backends.NetCDF4DataStore(data)
Get data as xarray.Dataset using MetPy's xarray accessor
ds = xr.open_dataset(datastore).metpy.parse_cf()
Get the coordinate axes from a constituent xarray.DataArray
For each variable of the dataset as an xarray.DataArray, calling ds.VARIABLE.metpy.DIMENSION has MetPy automatically return the appropriate coordinate variable (no matter what it is named, e.g. lat, lon, time, time1, altitude_above_msl, isobaric3, height_above_ground1), where DIMENSION is one of time, vertical, x, and y.
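For example (names as in this dataset; this is just the accessor pattern above written out):
# each lookup resolves to the matching coordinate variable, whatever it is named
time_coord = ds.Absolute_vorticity_isobaric.metpy.time       # resolves to ds.time
vert_coord = ds.Absolute_vorticity_isobaric.metpy.vertical   # resolves to ds.isobaric2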
Get the values
In this case, ds.Absolute_vorticity_isobaric.metpy.time returns ds.time, and ds.Absolute_vorticity_isobaric.metpy.vertical returns ds.isobaric2. Adding .values to the call returns just the numpy.ndarray with the values I have been trying to get. So, calling ds.Absolute_vorticity_isobaric.metpy.time.values produces the following (which is truncated below):
array(['2019-11-17T00:00:00.000000000', '2019-11-17T03:00:00.000000000',
'2019-11-17T06:00:00.000000000', ..., '2020-01-02T06:00:00.000000000',
'2020-01-02T09:00:00.000000000', '2020-01-02T12:00:00.000000000'],
dtype='datetime64[ns]')
Calling ds.Absolute_vorticity_isobaric.metpy.time.values and ds.Absolute_vorticity_isobaric.metpy.vertical.values will return just the NumPy arrays, which is what I seek.
The Problem
While the above does in fact do what I want, it took nearly a minute and a half to run for just one variable, and it (I assume) unnecessarily taxes UCAR servers. Is there any way to get the output above without the massive overhead of loading all of that data itself?
If you are concerned about the performance of your original method and only wish to extract the time and vertical coordinates, I would recommend using OPeNDAP to access your data rather than NCSS. This fetches only the metadata at first and then lazily loads the data you request (the time and vertical coordinates, in your case). Using MetPy v0.11 or newer, an example script using your TDS catalog of interest would look something like the following:
import metpy
import xarray as xr
from siphon.catalog import TDSCatalog
cat = TDSCatalog("https://thredds.ucar.edu/thredds/catalog/grib/NCEP/GFS/Global_onedeg/catalog.xml?dataset=grib/NCEP/GFS/Global_onedeg/Best")
opendap_url = cat.datasets[0].access_urls['OPENDAP']
ds = xr.open_dataset(opendap_url)
time = ds['Absolute_vorticity_isobaric'].metpy.time.values
vertical = ds['Absolute_vorticity_isobaric'].metpy.vertical.values
print(time)
print(vertical)
This takes roughly a half-second to run on my system.
If you instead have MetPy older than v0.11, you will need to use .metpy.parse_cf() when opening the dataset, as follows:
ds = xr.open_dataset(opendap_url).metpy.parse_cf()

Add custom property to vtkXMLUnstructuredGrid using python

I have a .vtu file representing a mesh, which I read through vtkXMLUnstructuredGridReader. Then I create a numpy array (nbOfPoints x 3) in which I store the mesh vertex coordinates, which I'll call meshArray.
I also have a column array (nbOfPoints x 1), which I'll call brightnessArray, representing a property I want to assign to the vertices of the mesh, so that each vertex corresponds to a scalar value. For example, the element meshArray[0] corresponds to brightnessArray[0], and so on.
How can I do this?
Is it then possible to interpolate the values at the mesh vertices to obtain a smooth variation of the property I set, in order to visualize it in ParaView?
Thank you.
Simon
Here is what you need to do:
Write a Python Programmable Source to read your numpy data as a vtkUnstructuredGrid.
Here are a few examples of programmable sources :
https://www.paraview.org/Wiki/ParaView/Simple_ParaView_3_Python_Filters
https://www.paraview.org/Wiki/Python_Programmable_Filter
Read your .vtu dataset
Use a "Ressample with Dataset" filter on your python programmable source output and select your dataset as "source"
And you're done.
The hardest part is writing the programmable source script.
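If you would rather attach the array in a standalone script instead of a Programmable Source, a minimal sketch could look like the following (file names are placeholders, and the random brightnessArray stands in for your real data; it must have one value per mesh point, in point order):
import numpy as np
import vtk
from vtk.util import numpy_support

# read the mesh
reader = vtk.vtkXMLUnstructuredGridReader()
reader.SetFileName("mesh.vtu")
reader.Update()
grid = reader.GetOutput()

# stand-in for the real brightnessArray: one scalar per point
brightnessArray = np.random.rand(grid.GetNumberOfPoints())

# wrap the numpy column as a named VTK array and attach it to the points
vtk_brightness = numpy_support.numpy_to_vtk(brightnessArray, deep=True)
vtk_brightness.SetName("brightness")
grid.GetPointData().AddArray(vtk_brightness)

# write it back out; ParaView can then colour by "brightness" and interpolate it
writer = vtk.vtkXMLUnstructuredGridWriter()
writer.SetFileName("mesh_with_brightness.vtu")
writer.SetInputData(grid)
writer.Write()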

Specify concat_dim for xarray open_mfdataset

I'm using xarray.open_mfdataset() to open and combine 8 netcdf files (output from model simulations with different settings) without loading them into memory. This works great if I specify concat_dim='run_number', which adds run_number as a dimension without coordinates and just fills it with values from 0 to 7.
The problem is that now I don't know which run_number belongs to which simulation. The original netCDF files all have attributes that help me distinguish them, e.g. identifyer=1, identifyer=2, etc., but these are not recognized by xarray, even if I specify concat_dim='identifyer' (perhaps because there are many attributes?).
Is there any way in which I can tell xarray that it has to use this attribute as concat_dim? Or alternatively, in which order does xarray read the input files, so that I can infer which value of the new dimension corresponds to which simulation?
Xarray will use the values of existing scalar coordinates to label result coordinates, but it doesn't look at attributes. Only looking at metadata found in coordinates is a general theme in xarray: we leave attrs to user code only. So this should work if you assign a scalar 'identifyer' coordinate to each dataset, e.g., using the preprocess argument to open_mfdataset:
def add_id(ds):
    # promote the distinguishing attribute to a scalar coordinate
    ds.coords['identifyer'] = ds.attrs['identifyer']
    return ds

xarray.open_mfdataset(path, preprocess=add_id)
Alternatively, you can either pass an explicit list of filenames to open_mfdataset or rely on the fact that open_mfdataset sorts the glob of filenames before combining them: the datasets will always be combined in lexicographic order of their names.
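Putting the two together, a full call might look like this (paths are hypothetical, add_id is the function defined above, and note that newer xarray versions require combine='nested' alongside concat_dim):
import glob
import xarray

files = sorted(glob.glob('simulations/*.nc'))  # explicit, lexicographically ordered list
combined = xarray.open_mfdataset(files, preprocess=add_id,
                                 concat_dim='identifyer', combine='nested')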

Regridding NetCDF4 in Python

I'm working with various climate models. Right now I'm regridding the latitudes and longitudes of these files from 2.5x2.5 to 0.5x0.5, and I am completely lost. I've been relying on the Anaconda package for all of my netCDF4 needs and have made good progress; it's just the regridding that baffles me completely. I have three main arrays that I'm using:
The first is the data_array, a numpy array that contains the information for precipitation.
The second is the lat_array, a numpy array containing the latitude information.
The third is the lon_array, a numpy array containing the longitude information.
All this data came from the netCDF4 file.
Again, my data is currently on the 2.5x2.5 grid, meaning lon x lat is 144x72. I use np.meshgrid(lon_array, lat_array) to expand both coordinate arrays to that 72x144 grid shape. My data_array has 72 rows as well, thus matching up perfectly.
This is where I get stuck and I have no idea how to proceed.
My thoughts: I want my 144x72 to convert to 720x360 in order for it to be 0.5x0.5.
I know one way of creating the lon/lat arrays that I want: np.arange(-89.75, 90.25, 0.5) and np.arange(-179.75, 180.25, 0.5). But I don't know how to match the data_array up with them.
Can anyone please offer any assistance? Any help is much appreciated!
Note: I also have ESMF modules available to me.
An easy option would be nctoolkit (https://nctoolkit.readthedocs.io/en/latest/installing.html). This has a built-in method called to_latlon that achieves exactly what you want. Just do the following for bilinear interpolation (and see the user guide for other methods):
import nctoolkit as nc

data = nc.open_data("infile.nc")
data.to_latlon(lon=[-179.75, 179.75], lat=[-89.75, 89.75], res=[0.5, 0.5])
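If you then need the result in Python rather than on disk, nctoolkit can hand the data back to xarray or write a new file (method names per the nctoolkit docs; worth double-checking for your installed version):
ds = data.to_xarray()     # convert the regridded data to an xarray.Dataset
data.to_nc("outfile.nc")  # or write it straight to a new netCDF file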
