xarray: How to apply a scipy function to a large netcdf dataset - python

I have a large netCDF file with several variables. I need to do a discrete integration along one dimension of a variable, say temperature, of shape (80, 100, 300000) with dimensions (time, depth, nodes). So I tried dividing the large dataset into chunks with xarray and then applying scipy.integrate.simps, but failed.
import xarray as xr
import scipy.integrate as sci
ds = xr.open_dataset('./temperature.nc',chunks={'time':5, 'nodes':1000})
temp = ds.temperature
Kindly help me with applying the simps function along the 2nd dimension of a chunked variable, and then saving the chunks to a netCDF file instead of dumping the whole data into RAM. I would like to do something like this:
temp.apply(sci.simps,{'dx':5}).to_netcdf('./temperature_integrated.nc')

I think that you are looking for xarray.apply_ufunc
Perhaps something like the following would work for you (untested):
import xarray as xr
import scipy.integrate as sci
xr.apply_ufunc(sci.simps, ds.temperature, input_core_dims=[['depth']], kwargs={'dx': 5})
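A fuller sketch of that approach (untested against the real file; the variable name, dims, and chunking are taken from the question, but the dataset here is a synthetic stand-in for temperature.nc). dask='parallelized' makes apply_ufunc run simps chunk by chunk, and to_netcdf() then streams the result to disk without materialising the whole array:

```python
import numpy as np
import scipy.integrate as sci
import xarray as xr

# 'simps' was renamed 'simpson' in newer SciPy releases
simps = sci.simps if hasattr(sci, 'simps') else sci.simpson

# Synthetic stand-in for ds = xr.open_dataset('./temperature.nc', chunks=...)
ds = xr.Dataset(
    {'temperature': (('time', 'depth', 'nodes'), np.ones((8, 10, 300)))}
).chunk({'time': 2, 'nodes': 100})

integrated = xr.apply_ufunc(
    simps,
    ds.temperature,
    input_core_dims=[['depth']],   # integrate out the depth dimension
    kwargs={'dx': 5},              # core dims are moved to the last axis,
                                   # which is simps' default integration axis
    dask='parallelized',
    output_dtypes=[ds.temperature.dtype],
)

# Writing triggers the lazy computation one chunk at a time
integrated.to_netcdf('./temperature_integrated.nc')
```

The key point is input_core_dims: it tells apply_ufunc which dimension the function consumes, so the result drops depth and keeps (time, nodes).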

Related

How to effectively store a very large list in python

Question: I have a big collection of 3D images that I would like to store in one file. How should I do this effectively?
Background: The dataset has about 1,000 3D MRI images with a size of 256 by 256 by 156. To avoid frequently opening and closing files, I tried to store all of them in one big list and export it.
So far I have tried reading each MRI in as a 3D numpy array and appending it to a list. When I tried to save it using numpy.save, it consumed all my memory and exited with "Memory Error".
Here is the code I tried:
import numpy as np
import nibabel as nib
import os

file_list = os.listdir('path/to/files')
data = []
for file in file_list:
    mri = nib.load(os.path.join('path/to/files', file))
    mri_array = np.array(mri.dataobj)
    data.append(mri_array)
np.save('imported.npy', data)
Expected Outcome:
Is there a better way to store such dataset without consuming too much memory?
Using the HDF5 file format or Numpy's memmap are the two options I would go to first if you want to jam all your data into one file. Neither loads all the data into memory.
Python has the h5py package to handle HDF5 files. These have a lot of features, and I would generally lean toward this option. It would look something like this:
import h5py
with h5py.File('data.h5') as h5file:
for n, image in enumerate(mri_images):
h5file[f'image{n}'] = image
memmap works with raw binary files, so it is not feature-rich at all. This would look something like:
import numpy as np

bin_file = np.memmap('data.bin', mode='w+', dtype=int, shape=(1000, 256, 256, 156))
for n, image in enumerate(mri_images):
    bin_file[n] = image
del bin_file  # flushes the data to the file
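To show that the HDF5 route really does avoid loading everything, here is a sketch that stores the whole collection as one 4D dataset (one write per image) and then reads a single image back. Shapes are toy-sized stand-ins for the 1,000 × 256 × 256 × 156 stack:

```python
import h5py
import numpy as np

n_images, shape = 10, (8, 8, 4)   # stand-in for 1000 images of 256x256x156

# Write one image at a time: only a single image is ever in memory
with h5py.File('mri_stack.h5', 'w') as f:
    dset = f.create_dataset('mri', shape=(n_images, *shape), dtype='float32')
    for n in range(n_images):
        dset[n] = np.full(shape, n, dtype='float32')

# Slicing reads only the requested image back from disk
with h5py.File('mri_stack.h5', 'r') as f:
    middle = f['mri'][5]
```

Using a single 4D dataset rather than one dataset per image also makes downstream indexing (e.g. `f['mri'][10:20]`) trivial.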

Displaying a large .dat binary file in python

I have a large 40 MB (about 173,397 lines) .dat file filled with binary data (random symbols). It is an astronomical photograph. I need to read and display it with Python. I am using a binary file because I will need to extract pixel value data from specific regions of the image. But for now I just need to ingest it into Python. Something like the READU procedure in IDL. I tried numpy and matplotlib but nothing worked. Suggestions?
You need to know the data type and dimensions of the binary file. For example, if the file contains float data, use numpy.fromfile like:
import numpy as np
data = np.fromfile(filename, dtype=float)
Then reshape the array to the dimensions of the image, dims, using numpy.reshape (the equivalent of REFORM in IDL):
im = np.reshape(data, dims)
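Putting the two steps together (the 512x512 float32 layout here is an assumption; the dtype and dims must match however the file was actually written, and the synthetic .tofile() call just stands in for the real .dat file):

```python
import numpy as np

dims = (512, 512)
# Synthetic stand-in for the astronomical image file
np.arange(dims[0] * dims[1], dtype=np.float32).tofile('data.dat')

# Read the raw binary (equivalent of READU in IDL), then reshape to 2D
data = np.fromfile('data.dat', dtype=np.float32)
im = np.reshape(data, dims)

# To display it:
# import matplotlib.pyplot as plt
# plt.imshow(im, cmap='gray'); plt.colorbar(); plt.show()
```

If the dtype or dims are wrong, np.fromfile will not complain; the image will simply come out garbled or the reshape will fail, so those two parameters are the first thing to double-check.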

Tiff to array - error

Hello, I have a problem converting a TIFF file to a numpy array.
I have a 16-bit signed raster file and I want to convert it to a numpy array.
I am using the GDAL library for this.
import numpy
from osgeo import gdal
ds = gdal.Open("C:/.../dem.tif")
dem = numpy.array(ds.GetRasterBand(1).ReadAsArray())
At first glance everything converts well, but I compared the result obtained in Python with the result in GIS software and got different results.
Python result (screenshot)
ArcMap result (screenshot)
I found many values in the numpy array that fall outside the real min and max values (91 and 278); those should not exist.
GDAL already returns a Numpy array, and wrapping it in np.array by default creates a copy of that array, which is an unnecessary performance hit. Just use:
dem = ds.GetRasterBand(1).ReadAsArray()
Or, if it's a single-band raster, simply:
dem = ds.ReadAsArray()
Regarding the statistics: are you sure ArcMap shows the absolute high/low value? I know QGIS, for example, often draws the statistics from a sample of the dataset (for performance) and, depending on the settings, sometimes uses a percentile (e.g. 1%, 99%).
Edit: BTW, is this a public dataset, like an SRTM tile? It might help if you list the source.

Substitute dataset coordinates in xarray (Python)

I have a dataset stored in NetCDF4 format that consists of Intensity values with 3 dimensions: Loop, Delay and Wavelength. I named my coordinates the same as the dimensions (I don't know if that's good or bad...).
I'm using xarray (formerly xray) in Python to load the dataset:
import xarray as xr
ds = xr.open_dataset('test_data.netcdf4')
Now I want to manipulate the data while keeping track of the original data. For instance, I would:
Apply an offset to the Delay coordinates while keeping the original Delay dataarray untouched. This seems to be done with:
ds_ = ds.assign_coords(Delay_corr=ds.Delay.copy(deep=True) + 25)
Substitute the coordinates Delay for Delay_corr for all relevant dataarrays in the dataset. However, I have no clue how to do this and I didn't find anything in the documentation.
Would anybody know how to perform item #2?
To download the NetCDF4 file with test data:
http://1drv.ms/1QHQTRy
The method you're looking for is the xr.swap_dims() method:
ds.coords['Delay_corr'] = ds.Delay + 25 # could also use assign_coords
ds2 = ds.swap_dims({'Delay': 'Delay_corr'})
See this section of the xarray docs for a full example.
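A self-contained sketch of that answer, with a small synthetic dataset standing in for the linked test file (dimension names are from the question; the values are invented):

```python
import numpy as np
import xarray as xr

# Stand-in for the question's dataset: Intensity over (Loop, Delay, Wavelength)
ds = xr.Dataset(
    {'Intensity': (('Loop', 'Delay', 'Wavelength'), np.random.rand(2, 3, 4))},
    coords={'Loop': [0, 1],
            'Delay': [10.0, 20.0, 30.0],
            'Wavelength': [400, 500, 600, 700]},
)

ds.coords['Delay_corr'] = ds.Delay + 25       # offset copy of Delay
ds2 = ds.swap_dims({'Delay': 'Delay_corr'})   # Intensity now indexed by Delay_corr

print(ds2.Intensity.dims)
```

After the swap, Delay is still carried along as a non-dimension coordinate, so the original values remain available for reference.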
I think it's much simpler than that.
If you don't want to change the existing data, you create a copy. Note that changing ds won't change the netcdf4 file, but assuming you still don't want to change ds:
ds_ = ds.copy(deep=True)
Then just set the Delay coord to a modified version of the old one:
ds_.coords['Delay'] = ds_['Delay'] + 25

Python NumPy Convert FFT To File

I was wondering if it's possible to get the frequencies present in a file with NumPy, and then alter those frequencies and create a new WAV file from them? I would like to do some filtering on a file, but I have yet to see a way to read a WAV file into NumPy, filter it, and then output the filtered version. If anyone could help, that would be great.
SciPy provides functions for doing FFTs on NumPy arrays, and also provides functions for reading and writing them to WAV files. e.g.
from scipy.io.wavfile import read, write
from scipy.fftpack import rfft, irfft

rate, data = read('input.wav')
transformed = rfft(data)
filtered = function_that_does_the_filtering(transformed)
output = irfft(filtered)
write('output.wav', rate, output)
(data, transformed and output are all numpy arrays; data is used instead of input to avoid shadowing the built-in)
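function_that_does_the_filtering is a placeholder in the answer above; one possible stand-in is a crude brick-wall low-pass that zeroes every coefficient above a cutoff. A sketch with a synthetic two-tone signal in place of input.wav (note that scipy.fftpack's rfft uses an interleaved real layout, so rfftfreq from the same module must be used to map bins to frequencies):

```python
import numpy as np
from scipy.fftpack import rfft, irfft, rfftfreq

rate = 8000
t = np.arange(rate) / rate
# Stand-in for input.wav: a 440 Hz tone plus 3000 Hz "noise"
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)

transformed = rfft(signal)
freqs = rfftfreq(len(signal), d=1.0 / rate)  # frequency of each rfft bin

# Brick-wall low-pass: zero all coefficients above 1000 Hz
transformed[freqs > 1000.0] = 0.0
output = irfft(transformed)
```

A real filter would usually taper the cutoff (or use scipy.signal) to avoid ringing, but this shows the round trip: transform, modify coefficients, inverse-transform.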
