Merging xarray datasets with same extent but different spatial resolution - python

I have one dataset of satellite-based solar-induced fluorescence (SIF) and one of modeled precipitation. I want to compare precipitation to SIF on a per-pixel basis in my study area. The two datasets cover the same area but at slightly different spatial resolutions. I can successfully plot these values across time and compare them against each other when I take the mean for the whole area, but I'm struggling to create a per-pixel scatter plot.
Honestly, I'm not sure this is the best way to compare these two quantities when looking for the impact of precipitation on SIF, so I'm open to different approaches. To merge the data I'm currently using xr.combine_by_coords, but it gives me the error described below. I could also convert the netCDFs to GeoTIFFs and then warp them with rasterio, but that seems like an inefficient way to do this comparison. Here is what I have so far:
import netCDF4
import numpy as np
import dask
import xarray as xr
rainy_bbox = np.array([
    [-69.29519955115512, -13.861261028444734],
    [-69.29519955115512, -12.384786628185896],
    [-71.19583431678012, -12.384786628185896],
    [-71.19583431678012, -13.861261028444734]])
max_lon_lat = np.max(rainy_bbox, axis=0)
min_lon_lat = np.min(rainy_bbox, axis=0)
# this dataset is available here: ftp://fluo.gps.caltech.edu/data/tropomi/gridded/
sif = xr.open_dataset('../data/TROPO_SIF_03-2018.nc')
# the dataset is global so subset to my study area in the Amazon
rainy_sif_xds = sif.sel(lon=slice(min_lon_lat[0], max_lon_lat[0]), lat=slice(min_lon_lat[1], max_lon_lat[1]))
# this data can all be downloaded from NASA Goddard here either manually or with wget but you'll need an account on https://disc.gsfc.nasa.gov/: https://pastebin.com/viZckVdn
imerg_xds = xr.open_mfdataset('../data/3B-DAY.MS.MRG.3IMERG.201803*.nc4')
# spatial subset
rainy_imerg_xds = imerg_xds.sel(lon=slice(min_lon_lat[0], max_lon_lat[0]), lat=slice(min_lon_lat[1], max_lon_lat[1]))
# I'm not sure the best way to combine these datasets but am trying this
combo_xds = xr.combine_by_coords([rainy_imerg_xds, rainy_sif_xds])
Currently I'm getting a seemingly unhelpful RecursionError: maximum recursion depth exceeded in comparison on that final line. When I add the argument join='left', only the data from rainy_imerg_xds ends up in combo_xds; with join='right' only the rainy_sif_xds data is present; and with join='inner' no data is present at all. I assumed this function did some internal interpolation, but apparently it does not.

This documentation from xarray outlines the solution to this problem quite simply: xarray allows you to interpolate in multiple dimensions and to specify another Dataset's x and y coordinates as the output grid. So in this case it is done with
# interpolation based on http://xarray.pydata.org/en/stable/interpolation.html
# interpolation can't be done across the chunked dimension so we have to load it all into memory
rainy_sif_xds.load()
#interpolate into the higher resolution grid from IMERG
interp_rainy_sif_xds = rainy_sif_xds.interp(lat=rainy_imerg_xds["lat"], lon=rainy_imerg_xds["lon"])
# visualize the output
rainy_sif_xds.dcSIF.mean(dim='time').hvplot.quadmesh('lon', 'lat', cmap='jet', geo=True, rasterize=True, dynamic=False, width=450).relabel('Initial') +\
interp_rainy_sif_xds.dcSIF.mean(dim='time').hvplot.quadmesh('lon', 'lat', cmap='jet', geo=True, rasterize=True, dynamic=False, width=450).relabel('Interpolated')
# now that the spatial coordinates match, convert the IMERG time axis from the default CFTimeIndex to a DatetimeIndex so it can be merged with the SIF data, which already uses datetime64 times
rainy_imerg_xds['time'] = rainy_imerg_xds.indexes['time'].to_datetimeindex()
# now the merge can easily be done with
merged_xds = xr.combine_by_coords([rainy_imerg_xds, interp_rainy_sif_xds], coords=['lat', 'lon', 'time'], join="inner")
# now visualize the two datasets together // multiply SIF by 30 because its values are so low
merged_xds.HQprecipitation.rolling(time=7, center=True).sum().mean(dim=('lat', 'lon')).hvplot().relabel('Precip') * \
(merged_xds.dcSIF.mean(dim=('lat', 'lon'))*30).hvplot().relabel('SIF')
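The per-pixel scatter the question asks for can then be built from merged_xds. A minimal sketch (using pandas/matplotlib here rather than hvplot) that flattens every (time, lat, lon) cell into one point:
import matplotlib.pyplot as plt
# flatten both variables so each (time, lat, lon) cell becomes one scatter point
pairs = merged_xds[['HQprecipitation', 'dcSIF']].to_dataframe().dropna()
plt.scatter(pairs['HQprecipitation'], pairs['dcSIF'], s=2, alpha=0.3)
plt.xlabel('IMERG HQprecipitation')
plt.ylabel('TROPOMI dcSIF')
plt.show()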

Related

How to average a 3-D array into a 2-D array in Python

I would like to take a temperature variable from a netcdf file in python and average over all of the satellite's scans.
The temperature variable is given as:
tdisk = file.variables['tdisk'][:,:,:] # Disk Temp(nscans,nlons,nlats)
The shape of the tdisk array is 68,52,46. The satellite makes 68 scans per day. The longitude and latitude variables are given as:
lats = file.variables['latitude'][:,:] # Latitude(nlons,nlats)
lons = file.variables['longitude'][:,:] # Longitude(nlons,nlats)
Which have sizes of 52,46. I would like to average all of the scans of temperature together to get a daily mean, so the temperature array becomes 52,46. I've seen ways to stack the arrays and concatenate them, but I would like a mean value. Eventually I am looking to make a contour plot with (x=longitude, y=latitude, and z=temp).
Is this possible to do? Thanks for any help.
If you are using Xarray, you can do this using DataArray.mean:
import xarray as xr
# open netcdf file
ds = xr.open_dataset('file.nc')
# take the mean of the tdisk variable
da = ds['tdisk'].mean(dim='nscans')
# make a contour plot
da.plot.contour(x='longitude', y='latitude')
Based on the question, you seem to want a mean over all of the scans rather than a daily time series, since you appear to have only one day of data. So the following will probably work:
ds.mean(dim='nscans')
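If the 2-D latitude/longitude variables are not attached to the Dataset as coordinates, you can also drop down to plain matplotlib for the contour plot. A small sketch using the variable names from the question:
import matplotlib.pyplot as plt
# average over the 68 scans (first axis), leaving a (nlons, nlats) array
tmean = tdisk.mean(axis=0)
# contour the daily-mean temperature against the 2-D longitude/latitude arrays
plt.contourf(lons, lats, tmean)
plt.colorbar(label='Disk temperature')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()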

Resampling a raster using rasterio - simple modification of grid spacing

I am resampling raster data using Python's rasterio. Looking at the rasterio.enums.Resampling class, it appears the only way to do this is to interpolate between adjacent raster cells, essentially smoothing the data.
Is there some way to do a simple upsampling that effectively divides each raster cell into many and preserves the original value for all of the sub-cells?
My resampling script is as follows - currently using the bilinear method:
import rasterio
from rasterio.enums import Resampling

with rasterio.open(str(rasterpath + filename), crs="EPSG:4326") as src:
    data = src.read(
        out_shape=(
            src.count,
            int(src.height * upscale_factor),
            int(src.width * upscale_factor)
        ),
        resampling=Resampling.bilinear)
    # scale image transform
    transform = src.transform * src.transform.scale(
        (src.width / data.shape[-1]),
        (src.height / data.shape[-2])
    )
Any suggestions? I would think some sort of treatment for discrete data would be built in but have not found it yet...
I found a solution.
Deleting resampling=Resampling.bilinear avoids the interpolation: src.read then falls back to its default, nearest-neighbour resampling, which performs the "simple" upsampling and preserves the original cell values.
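You can also make that explicit by passing Resampling.nearest, which copies each original cell value into the corresponding sub-cells when upsampling. A sketch of the same script with only the resampling method changed:
import rasterio
from rasterio.enums import Resampling

with rasterio.open(str(rasterpath + filename), crs="EPSG:4326") as src:
    # nearest-neighbour resampling: no interpolation, original values are preserved
    data = src.read(
        out_shape=(
            src.count,
            int(src.height * upscale_factor),
            int(src.width * upscale_factor)
        ),
        resampling=Resampling.nearest)
    transform = src.transform * src.transform.scale(
        (src.width / data.shape[-1]),
        (src.height / data.shape[-2])
    )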

How to quickly retrieve radiances from satellite netcdf files?

I would like to retrieve GOES-16 ABI radiance data for predetermined locations (about 10,000 points per individual image) for an entire year. Each day has ~100 individual images. I have all required ABI data (in netCDF format) on disk already. The points I'd like to extract are given in terms of the row and column of the netCDF array, so in principle, retrieving the correct radiances is an array indexing operation.
However, all my attempts at doing this have been painfully slow (on the order of 10+ minutes for a single day). I've been trying to use xarray, as follows.
import xarray as xr, pandas as pd
df = pd.read_csv("selected_pixels/20190101.csv")
ds = xr.open_mfdataset('noaa-goes16/ABI-L2-MCMIPF/2019/001/*/*.nc', parallel=True,
                       combine='nested', concat_dim='t')
t = xr.DataArray(df.time_id.values, dims="s")
x = xr.DataArray(df.col.values, dims="s")
y = xr.DataArray(df.row.values, dims="s")
a_df = ds[[f"CMI_C{str(i).rjust(2,'0')}" for i in range(1,17)]].isel(t=t,x=x, y=y).to_dataframe()
I'm fortunate enough to have multiple processors at my disposal: I would highly appreciate any suggestions to speed up this operation.
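One possible direction (a sketch, not verified against the real data, and it assumes df.time_id is simply the index of the file within the day in sorted order): skip the dask-backed open_mfdataset and index each file directly, which avoids building one huge lazy array out of ~100 files:
import glob
import pandas as pd
import xarray as xr

df = pd.read_csv("selected_pixels/20190101.csv")
# open_mfdataset sorts the glob results, so sort here too to keep the same t ordering
files = sorted(glob.glob('noaa-goes16/ABI-L2-MCMIPF/2019/001/*/*.nc'))
bands = [f"CMI_C{i:02d}" for i in range(1, 17)]

frames = []
for time_id, points in df.groupby("time_id"):  # assumption: time_id == position of the file in `files`
    with xr.open_dataset(files[int(time_id)]) as ds:
        y = xr.DataArray(points.row.values, dims="s")
        x = xr.DataArray(points.col.values, dims="s")
        frames.append(ds[bands].isel(y=y, x=x).to_dataframe())
result = pd.concat(frames)
Each file is independent, so the loop body could also be handed to a multiprocessing pool to make use of the extra processors.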

Is there a way to plot Matplotlib's Imshow against a specific array rather than the indices?

I'm trying to use imshow to plot a 2-D Fourier transform of my data. However, imshow plots the data against its index in the array. I would like to plot the data against a set of arrays I have containing the corresponding frequency values (one array for each dimension), but can't figure out how.
I have a 2D array of data (a Gaussian pulse signal) that I Fourier transform with np.fft.fft2. This all works fine. I then get the corresponding frequency bins for each dimension with np.fft.fftfreq(len(data))*sampling_rate. I can't figure out how to use imshow to plot the data against these frequencies, though. The 1D equivalent of what I'm trying to do is using plt.plot(x, y) rather than just plt.plot(y).
My first attempt was to use imshow's extent argument, but as far as I can tell that just changes the axis limits, not the actual bins.
My next solution was to use np.fft.fftshift to arrange the data in numerical order and then simply re-scale the axis using this answer: Change the axis scale of imshow. However, the index-to-frequency-bin mapping is not a pure scaling factor; there's typically a constant offset as well.
My next attempt was to use hist2d instead of imshow, but that doesn't work, since hist2d plots the number of times an ordered pair occurs, while I want to plot a scalar value corresponding to specific ordered pairs (i.e. the power of the signal at specific frequency combinations).
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
f = 200
st = 2500
x = np.linspace(-1,1,2*st)
y = signal.gausspulse(x, fc=f, bw=0.05)
data = np.outer(np.ones(len(y)),y) # A simple example with constant y
Fdata = np.abs(np.fft.fft2(data))**2
freqx = np.fft.fftfreq(len(x))*st # What I want to plot my data against
freqy = np.fft.fftfreq(len(y))*st
plt.imshow(Fdata)
I should see a peak at (200, 0) corresponding to the frequency of my signal (with some fall-off around it corresponding to the bandwidth), but instead my maximum occurs at some arbitrary position corresponding to the frequency's index in my data array. If anyone has any ideas, fixes, or other functions to use, I would greatly appreciate it!
I cannot run your code, but I think you are looking for the extent= argument to imshow(). See the page on origin and extent for more information.
Something like this may work?
plt.imshow(Fdata, extent=(freqx[0],freqx[-1],freqy[0],freqy[-1]))
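Note that np.fft.fftfreq returns the bins in unshifted order (zero, then positive, then negative frequencies), so the extents above only line up after shifting both the spectrum and the bin arrays. A sketch building on the arrays defined in the question:
# shift the spectrum and the frequency bins into monotonically increasing order
Fshift = np.fft.fftshift(Fdata)
fx = np.fft.fftshift(freqx)
fy = np.fft.fftshift(freqy)
plt.imshow(Fshift, extent=(fx[0], fx[-1], fy[0], fy[-1]), origin='lower', aspect='auto')
plt.xlabel('Frequency (x)')
plt.ylabel('Frequency (y)')
plt.colorbar()
plt.show()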

Heat map visualizing touch input on smartphone (weighted 2d binning, histogram)

I have a dataset where each sample consists of x- and y-position, timestamp and a pressure value of touch input on a smartphone. I have uploaded the dataset here (OneDrive): data.csv
It can be read by:
import pandas as pd
df = pd.read_csv('data.csv')
Now, I would like to create a heat map visualizing the pressure distribution in the x-y space.
I envision a heat map which looks like the left or right image:
For a heat map of spatial positions, a similar approach to the one given here could be used. For the heat map of pressure values the problem is that there are three dimensions, namely the x- and y-position and the pressure.
I'm happy about every input regarding the creation of the heat map.
There are several ways data can be binned. One is just by the number of events. Functions like numpy.histogram2d or hist2d allow you to specify a weight for each data point, so you can control how much each event contributes.
But there is a more general histogram function that might be useful in your case: scipy.stats.binned_statistic_2d
By using the keyword argument statistic you can pick how the value of each bin is calculated from the values that lie within:
mean
std
median
count
sum
min
max
or a user defined function
I guess in your case mean or median might be a good solution.
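A minimal sketch of that approach, assuming the CSV has columns named x, y and pressure (I have not inspected the uploaded file):
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import binned_statistic_2d

df = pd.read_csv('data.csv')
# each bin holds the mean pressure of all touches that fall inside it
stat, x_edges, y_edges, _ = binned_statistic_2d(
    df['x'], df['y'], values=df['pressure'], statistic='mean', bins=50)

# binned_statistic_2d puts the first sample (x) along axis 0, so transpose for imshow
plt.imshow(stat.T, origin='lower',
           extent=(x_edges[0], x_edges[-1], y_edges[0], y_edges[-1]),
           aspect='auto')
plt.colorbar(label='mean pressure')
plt.xlabel('x position')
plt.ylabel('y position')
plt.show()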
