Specify concat_dim for xarray open_mfdataset - python

I'm using xarray.open_mfdataset() to open and combine 8 netcdf files (output from model simulations with different settings) without loading them into memory. This works great if I specify concat_dim='run_number', which adds run_number as a dimension without coordinates and just fills it with values from 0 to 7.
The problem is that now I don't know which run_number belongs to which simulation. The original netCDF files all have attributes that help me to distinguish them, e.g. identifyer=1, identifyer=2, etc., but this is not recognized by xarray, even if I specify concat_dim='identifyer' (perhaps because there are many attributes?).
Is there any way in which I can tell xarray that it has to use this attribute as concat_dim? Or alternatively, in which order does xarray read the input files, so that I can infer which value of the new dimension corresponds to which simulation?

Xarray will use the values of existing scalar coordinates to label result coordinates, but it doesn't look at attributes. Looking only at metadata found in coordinates is a general theme in xarray: attrs are left entirely to user code. So this should work if you assign a scalar 'identifyer' coordinate to each dataset, e.g., using the preprocess argument to open_mfdataset:
import xarray

def add_id(ds):
    ds.coords['identifyer'] = ds.attrs['identifyer']  # promote the attribute to a scalar coordinate
    return ds

xarray.open_mfdataset(path, preprocess=add_id)
Alternatively, you can either pass an explicit list of filenames to open_mfdataset or rely on the fact that open_mfdataset sorts the glob of filenames before combining them: the datasets will always be combined in lexicographic order of their names.
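For example, a minimal sketch of the explicit-list approach, using the same concat_dim usage as in the question (the filenames and their order here are made up; adapt them to your runs):
import xarray

files = ['run0.nc', 'run1.nc', 'run2.nc']   # hypothetical filenames, listed in the order you want
ds = xarray.open_mfdataset(files, concat_dim='run_number')
# run_number 0, 1, 2 now correspond to run0.nc, run1.nc, run2.nc respectively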

Related

How to merge different shaped netcdf4 files?

I am storing weather forecasts as netcdf4 files. These netcdf4 files are batched following the google maps tiles principle. This means I define a zoom level (here 6) to get the extent of each tile.
Based on that information I used the following code to slice the array:
sliced_data = data.where(
    (data[lat_coord_name] <= maxLat)
    & (data[lat_coord_name] > minLat)
    & (data[lon_coord_name] <= maxLon)
    & (data[lon_coord_name] > minLon),
    drop=True,
)
Here, data is an xarray.Dataset. At the end of this process I have 36 tiles for a weather model covering central Europe.
My problem is merging them back into the native, untiled xarray.Dataset. The projection of the weather model differs from the projection of the tile maps, so I end up with netCDF4 files whose x and y dimensions have different sizes, and there is no common axis along which xarray can align them.
The dimension of the native grid is 340x340. You can find a test dataset here
My expectation was:
import glob
import xarray
file_list = glob.glob('test_data_stackoverflow/*')
file_list.sort()
dataset = xarray.open_mfdataset(file_list, engine="h5netcdf")
But this will fail due to different shaped datasets.
I am open to using other tools like netCDF4, h5netcdf or CDO, but the data should not be manipulated, e.g. by interpolating to the original grid.
Combining datasets with the same dimension names but different dimension sizes is not possible in an Xarray Dataset. But it is possible in a new type of Xarray data structure, currently under development, called a DataTree. Currently DataTree lives in a separate package - https://xarray-datatree.readthedocs.io/en/latest/ - but the plan is to merge it into Xarray proper soon. DataTree is used by the library ndpyramid to store multi-scale array data, very similar to the use case you are describing.
I would explore combining your datasets into a single DataTree object. First organize your data into a dict, and then create a DataTree from the dict. You will need to decide how to encode the level of the hierarchy. The simplest is to just use an integer for each zoom level, e.g.
from datatree import DataTree

data_dict = {str(level): xarray.open_dataset(f) for level, f in enumerate(file_list)}
dt = DataTree.from_dict(data_dict)
The ndpyramid code might be a useful reference: https://github.com/carbonplan/ndpyramid/blob/main/ndpyramid/core.py
You can probably solve this using CDO's merge method:
cdo merge test_data_stackoverflow/* out.nc
If the 36 tiles make up a 6 x 6 grid, then mergegrid can potentially merge them:
cdo mergegrid test_data_stackoverflow/* out.nc

How do I lazily concatenate "numpy ndarray"-like objects for sequential reading?

I have a list of several large hdf5 files, each with a 4D dataset. I would like to obtain a concatenation of them on the first axis, as in, an array-like object that would be used as if all datasets were concatenated. My final intent is to sequentially read chunks of the data along the same axis (e.g. [0:100,:,:,:], [100:200,:,:,:], ...), multiple times.
Datasets in h5py share a significant part of the numpy array API, which allows me to call numpy.concatenate to get the job done:
files = [h5.File(name, 'r') for name in filenames]
X = np.concatenate([f['data'] for f in files], axis=0)
On the other hand, the memory layout is not the same, and memory cannot be shared among them (related question). Alas, concatenate will eagerly copy the entire content of each array-like object into a new array, which I cannot accept in my use case. The source code of the array concatenation function confirms this.
How can I obtain a concatenated view over multiple array-like objects, without eagerly reading them to memory? As far as this view is concerned, slicing and indexing over this view would behave just as if I had a concatenated array.
I can imagine that writing a custom wrapper would work, but I would like to know whether such an implementation already exists as a library, or whether another solution to the problem is available and just as feasible. My searches so far have yielded nothing of this sort. I am also willing to accept solutions specific to h5py.
flist = [f['data'] for f in files] is a list of dataset objects. The actual data lives in the h5 files and is accessible as long as those files remain open.
When you do
arr = np.concatenate(flist, axis=0)
I imagine concatenate first does
temp = [np.asarray(a) for a in flist]
that is, construct a list of numpy arrays. I assume np.asarray(f['data']) is the same as f['data'].value or f['data'][:] (as I discussed 2 yrs ago in the linked SO question). I should do some time tests comparing that with
arr = np.concatenate([a.value for a in flist], axis=0)
flist is a kind of lazy compilation of these datasets, in that the data still resides in the files and is accessed only when you do something more with it.
[a.value[:,:,:10] for a in flist]
would load a portion of each of those data sets into memory; I expect that a concatenate on that list would be the equivalent of arr[:,:,:10].
Generators or generator comprehensions are a form of lazy evaluation, but I think they have to be turned into lists before use in concatenate. In any case, the result of concatenate is always an array with all the data in a contiguous block of memory. It is never blocks of data residing in files.
You need to tell us more about what you intend to do with this large concatenated collection of datasets. As outlined above, I think you can construct arrays that contain slices of all the datasets. You could also perform other actions, as I demonstrated in the previous answer, but with an access-time cost.
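A minimal sketch of that slice-wise idea, assuming all datasets share the same shape except along the first axis (the filenames and the requested range are made up):
import numpy as np
import h5py

filenames = ['a.h5', 'b.h5', 'c.h5']                  # hypothetical files
files = [h5py.File(name, 'r') for name in filenames]
dsets = [f['data'] for f in files]
lengths = [d.shape[0] for d in dsets]
starts = np.concatenate(([0], np.cumsum(lengths)))    # global start index of each file

def read_chunk(start, stop):
    # Read rows [start, stop) of the virtual concatenation along axis 0,
    # touching only the files (and slices) that overlap the requested range.
    parts = []
    for d, off, n in zip(dsets, starts[:-1], lengths):
        lo, hi = max(start - off, 0), min(stop - off, n)
        if lo < hi:
            parts.append(d[lo:hi])
    return np.concatenate(parts, axis=0)

chunk = read_chunk(100, 200)                          # like full_array[100:200, :, :, :]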

Python, how to insert value in Powerpoint template?

I want to use an existing powerpoint presentation to generate a series of reports:
In my imagination the powerpoint slides will have content in such or similar form:
Date of report: {{report_date}}
Number of Sales: {{no_sales}}
...
Then my python app opens the powerpoint, fills in the values for this report and saves the report with a new name.
I googled, but could not find a solution for this.
There is python-pptx out there, but this is all about creating a new presentation and not inserting values in a template.
Can anybody advice?
Ultimately, barring some other library with built-in "Find" functionality for PPT, you need some sort of brute-force approach: iterate the Slides collection and each Slide's respective Shapes collection in order to identify the matching shapes. Here is that brute force using only win32com:
from win32com import client
find_date = r'{{report_date}}'
find_sales = r'{{no_sales}}'
report_date = '01/01/2016' # Modify as needed
no_sales = '604' # Modify as needed
path = 'c:/path/to/file.pptx'
outpath = 'c:/path/to/output.pptx'
ppt = client.Dispatch("PowerPoint.Application")
pres = ppt.Presentations.Open(path, WithWindow=False)
for sld in pres.Slides:
    for shp in sld.Shapes:
        if not shp.HasTextFrame:
            continue
        tr = shp.TextFrame.TextRange
        if find_date in tr.Text:
            tr.Replace(find_date, report_date)
        elif find_sales in tr.Text:
            tr.Replace(find_sales, no_sales)
pres.SaveAs(outpath)
pres.Close()
ppt.Quit()
If these strings are inside other strings with mixed text formatting, it gets trickier to preserve existing formatting, but it should still be possible.
If the template file is still in design and subject to your control, I would consider giving the shape a unique identifier like a CustomXMLPart or you could assign something to the shapes' AlternativeText property. The latter is easier to work with because it doesn't require well-formed XML, and also because it's able to be seen & manipulated via the native UI, whereas the CustomXMLPart is only accessible programmatically, and even that is kind of counterintuitive. You'll still need to do shape-by-shape iteration, but you can avoid the string comparisons just by checking the relevant property value.
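For instance, a rough sketch of the AlternativeText variant (the tags 'report_date' and 'no_sales' are made up; they would have to be assigned to the shapes in the template first):
for sld in pres.Slides:
    for shp in sld.Shapes:
        if shp.AlternativeText == 'report_date':        # hypothetical tag set in the template
            shp.TextFrame.TextRange.Text = report_date
        elif shp.AlternativeText == 'no_sales':
            shp.TextFrame.TextRange.Text = no_sales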
I tried this on a ".ppx" file I had hanging around.
A microsoft office power point ".pptx" file is in ".zip" format.
When I unzipped my file, I got an ".xml" file and three directories.
My ".pptx" file has 116 slides comprised of 3,477 files and 22 directories/subdirectories.
Normally, I would say it is not workable, but since you have only two short changes, you could probably figure out what to change and re-zip the files to make a new ".pptx" file.
A warning: there are some xml blobs of binary data in one or more of the ".xml" files.
You can definitely do what you want with python-pptx, just perhaps not as straightforwardly as you imagine.
You can read the objects in a presentation, including the slides and the shapes on the slides. So if you wanted to change the text of the second shape on the second slide, you could do it like this:
from pptx import Presentation
prs = Presentation('template.pptx')  # your existing presentation file
slide = prs.slides[1]
shape = slide.shapes[1]
shape.text = 'foobar'
prs.save('report.pptx')
The only real question is how you find the shape you're interested in. If you can make non-visual changes to the presentation (template), you can determine the shape id or shape name and use that. Or you could fetch the text for each shape and use regular expressions to find your keyword/replacement bits.
It's not without its challenges, and python-pptx doesn't have features specifically designed for this role, but based on the parameters of your question, this is definitely a doable thing.
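As a rough illustration of the regular-expression route (this is not a built-in python-pptx feature, just a sketch; the placeholder names and file paths are assumptions):
import re
from pptx import Presentation

values = {'report_date': '01/01/2016', 'no_sales': '604'}   # hypothetical substitutions
pattern = re.compile(r'\{\{(\w+)\}\}')

prs = Presentation('template.pptx')                         # hypothetical template path
for slide in prs.slides:
    for shape in slide.shapes:
        if not shape.has_text_frame:
            continue
        for paragraph in shape.text_frame.paragraphs:
            for run in paragraph.runs:
                # Working run-by-run preserves each run's formatting, as long as
                # a placeholder is not split across runs.
                run.text = pattern.sub(lambda m: values.get(m.group(1), m.group(0)), run.text)
prs.save('report.pptx')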

MATLAB version of python iterator

I need to specify the color and marker for a series of plots on the same axis. In Python, I would simply create an iterator for each and use next() to get them out in order one at a time. I cannot find an equivalent in MATLAB; all the examples I have found involve explicitly calling the list holding the colors and markers by index, but this precludes using them in loops that don't use a matching iterator. Is there a more appropriate substitution for the iterator concept?
Alternately, is there a more appropriate way to accomplish this in MATLAB?
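For reference, the Python pattern I have in mind is roughly this (the colors and markers are arbitrary examples):
import itertools
import numpy as np
import matplotlib.pyplot as plt

colors = itertools.cycle(['r', 'g', 'b'])
markers = itertools.cycle(['o', 's', '^'])

t = np.linspace(0, 2 * np.pi, 50)
for i in range(5):
    # next() hands out the following color/marker on each pass through the loop
    plt.plot(t, np.sin(t - i / 5), color=next(colors), marker=next(markers))
plt.show()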
You can use the ColorOrder and LineStyleOrder properties of the axes; the complete documentation can be found here.
The ColorOrder property is a three-column matrix of RGB triplets and the LineStyleOrder is a cell array of line specifiers or, alternatively, a string of specifiers separated by |.
The code below produces a figure that cycles through five colors from the hsv colormap and three line styles. Of course, you can also generate the ColorOrder matrix using one of the built-in colormaps or even a custom one.
figure;
set(gca, 'ColorOrder', hsv(5));
set(gca, 'LineStyleOrder', '-|--|:');
hold on;
t = 0:pi/20:2*pi;
for i = 1:15
    plot(t, sin(t-i/5));
end
Anyway, as far as I know MATLAB doesn't have the concept of an iterator, certainly not in the Python sense, but this solution should at least address your problem without explicitly indexing into the list of colors and/or markers.
You can define the look (such as color and marker) for the plots in the plot command. E.g. plot(1:5,'-go') will produce a green plot with o-markers. (Info)
Alternatively, you can indeed iterate over the plots in an axis. If you do all the plots in one command, like
h = plot(1:5,[1:5;2:2:10]);
then h will be a vector of chart line objects, and you can then iterate over these objects using
for i=1:length(h)
    h(i).<some_modifications>
end
and set properties like this:
h(i).LineWidth = 2;
h(i).Marker = '*';
or in MATLAB versions before 2014:
set(h(i),'LineWidth',2)
set(h(i),'Marker','*')
If you do the plots in separate commands, you can manually collect the returned chart line objects in a vector and do the same thing (or of course modify them directly). You can find some properties you can use here.
Is this what you were looking for?

How to read NetCDF variable float data into a Numpy array with the same precision and scale as the original NetCDF float values?

I have a NetCDF file which contains a variable with float values with precision/scale == 7/2, i.e. there are possible values from -99999.99 to 99999.99.
When I take a slice of the values from the NetCDF variable and look at them in my debugger, I see that the values I now have in my array have more precision/scale than what I see in the original NetCDF. For example, when I look at the values in the ToosUI/ncdump viewer they display as '-99999.99' or '12.45', but when I look at the values in the slice array they look like '-99999.9921875' (a greater scale length). So if I'm using '-99999.99' as the expected value to indicate a missing data point, then I won't get a match with what gets pulled into the slice array, since those values have a greater scale length and the additional digits in the scale are not just zeros for padding.
For example I see this if I do a ncdump on a point within the NetCDF dataset:
Variable: precipitation(0:0:1, 40:40:1, 150:150:1)
float precipitation(time=1348, lat=180, lon=360);
  :units = "mm/month";
  :long_name = "precipitation totals";
data:
{
  {
    {-99999.99}
  }
}
However if I get a slice of the data from the variable like so:
value = precipitationVariable[0:1:1, 40:41:1, 150:151:1]
then I see it like this in my debugger (Eclipse/PyDev):
value == ndarray: [[[-99999.9921875]]]
So it seems as if the NetCDF dataset values that I read into a Numpy array are not being read with the same precision/scale of the original values in the NetCDF file. Or perhaps the values within the NetCDF are actually the same as what I'm seeing when I read them, but what's shown to me via ncdump is being truncated due to some format settings in the ncdump program itself.
Can anyone advise as to what's happening here? Thanks in advance for your help.
BTW I'm developing this code using Python 2.7.3 on a Windows XP machine and using the Python module for the NetCDF4 API provided here: https://code.google.com/p/netcdf4-python/
There is no simple way of doing what you want, because the variable holds single-precision (32-bit) floats: -99999.99 cannot be represented exactly in 32 bits, so the nearest representable value, -99999.9921875, is what you actually read back (ncdump simply rounds it for display).
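A quick way to see this with plain numpy, independent of netCDF:
import numpy as np

x = np.float32(-99999.99)
print(float(x))         # -99999.9921875: the nearest 32-bit float to -99999.99
print(x == -99999.99)   # False, so an exact comparison against -99999.99 fails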
However, netCDF already provides a mechanism for missing data (see the best practices guide). How was the netCDF file written in the first place? missing_value is a special variable attribute that holds the value used to indicate missing data points. In the C and Fortran interfaces, when the file is created all variable values are initialised to the fill value, so any point you never write stays marked as missing; you then set the missing_value (or _FillValue) attribute to that special value. See more about fill values in the C and Fortran interfaces. This is the recommended approach. The Python netCDF4 module plays well with these missing values, and such variables are read as masked arrays in numpy.
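A small sketch of how that looks with the netCDF4 Python module (the file, dimension and variable names are made up; the key points are the fill_value at creation time and the masked array on read):
import numpy as np
import netCDF4 as nc

MISSING = np.float32(-99999.99)

with nc.Dataset('example.nc', 'w') as ds:
    ds.createDimension('time', 3)
    var = ds.createVariable('precipitation', 'f4', ('time',), fill_value=MISSING)
    var.missing_value = MISSING
    var[1] = 12.45                             # indices 0 and 2 are never written, so they stay missing

with nc.Dataset('example.nc') as ds:
    data = ds.variables['precipitation'][:]    # returned as a numpy masked array
    print(data)                                # [-- 12.45 --]
    print(data.mask)                           # [ True False  True]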
If you must work with the file you currently have, then I'd suggest creating a mask to cover values around your missing value:
import numpy as np
value = precipitationVariable[:]
mask = (value < -99999.98) & (value > -100000.00)
value = np.ma.MaskedArray(value, mask=mask)
