How to quickly retrieve radiances from satellite netcdf files?

I would like to retrieve GOES-16 ABI radiance data for predetermined locations (about 10,000 points per individual image) for an entire year. Each day has ~100 individual images. I have all required ABI data (in netCDF format) on disk already. The points I'd like to extract are given in terms of the row and column of the netCDF array, so in principle, retrieving the correct radiances is an array indexing operation.
However, all my attempts at doing this have been painfully slow (on the order of 10+ minutes for a single day). I've been trying to use xarray, as follows.
import pandas as pd
import xarray as xr

df = pd.read_csv("selected_pixels/20190101.csv")
ds = xr.open_mfdataset('noaa-goes16/ABI-L2-MCMIPF/2019/001/*/*.nc', parallel=True,
                       combine='nested', concat_dim='t')
# Pointwise indexing: all three indexers share the dimension "s"
t = xr.DataArray(df.time_id.values, dims="s")
x = xr.DataArray(df.col.values, dims="s")
y = xr.DataArray(df.row.values, dims="s")
a_df = ds[[f"CMI_C{i:02d}" for i in range(1, 17)]].isel(t=t, x=x, y=y).to_dataframe()
I'm fortunate enough to have multiple processors at my disposal: I would highly appreciate any suggestions to speed up this operation.
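One approach that may be worth trying is to skip the combined multi-file dataset and visit each image once, pulling out only the pixels requested for that image with plain NumPy pointwise indexing. The sketch below illustrates the idea on synthetic arrays. The names `extract_pixels`, `extract_by_image`, and `load_image` are hypothetical, and the real loader (reading one variable from one netCDF file) is assumed rather than shown. Whether this beats `open_mfdataset` on real data depends on file chunking and compression, which is not measured here.

```python
import numpy as np

def extract_pixels(image, rows, cols):
    """Return image[rows[k], cols[k]] for every k (pointwise, not an outer product)."""
    rows = np.asarray(rows, dtype=np.intp)
    cols = np.asarray(cols, dtype=np.intp)
    return image[rows, cols]

def extract_by_image(load_image, time_ids, rows, cols):
    """Visit each image once and pull out only the pixels requested for it.

    load_image is a hypothetical callable: time_id -> 2-D array indexed (row, col).
    """
    time_ids = np.asarray(time_ids)
    rows = np.asarray(rows)
    cols = np.asarray(cols)
    out = np.empty(time_ids.shape, dtype=np.float64)
    for tid in np.unique(time_ids):
        mask = time_ids == tid
        out[mask] = extract_pixels(load_image(tid), rows[mask], cols[mask])
    return out

# Synthetic stand-in for three images on disk (assumption: real code would
# read one netCDF variable per file here instead of slicing an array).
rng = np.random.default_rng(0)
images = rng.random((3, 50, 60))
time_ids = np.array([0, 2, 2, 1, 0])
rows = np.array([10, 49, 0, 7, 3])
cols = np.array([5, 59, 0, 30, 12])

values = extract_by_image(lambda tid: images[tid], time_ids, rows, cols)
```

Because each image is handled independently, the loop over `tid` is a natural unit of work to hand to separate processes (for example with `concurrent.futures.ProcessPoolExecutor`), one file per task.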

Related

How to average a 3-D Array Into 2-D Array Python

I would like to take a temperature variable from a netcdf file in python and average over all of the satellite's scans.
The temperature variable is given as:
tdisk = file.variables['tdisk'][:,:,:] # Disk Temp(nscans,nlons,nlats)
The shape of the tdisk array is 68,52,46. The satellite makes 68 scans per day. The longitude and latitude variables are given as:
lats = file.variables['latitude'][:,:] # Latitude(nlons,nlats)
lons = file.variables['longitude'][:,:] # Longitude(nlons,nlats)
Both have shape (52, 46). I would like to average the temperature over all scans to get a daily mean, so that the temperature array has shape (52, 46). I've seen ways to stack the arrays and concatenate them, but I would like a mean value. Eventually I want to make a contour plot with x=longitude, y=latitude, and z=temp.
Is this possible to do? Thanks for any help

If you are using Xarray, you can do this using DataArray.mean:
import xarray as xr
# open netcdf file
ds = xr.open_dataset('file.nc')
# take the mean of the tdisk variable
da = ds['tdisk'].mean(dim='nscans')
# make a contour plot
da.plot.contour(x='longitude', y='latitude')

Based on the question, you seem to want a temporal mean rather than a daily mean, since you appear to have only one day of data. So the following will probably work:
ds.mean("time")
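If you are working with the plain arrays from the question rather than xarray, the same reduction is a mean over the first axis. Here is a minimal sketch on synthetic data: the array stands in for `file.variables['tdisk'][:,:,:]`, and the values are made up.

```python
import numpy as np

# Synthetic stand-in for file.variables['tdisk'][:, :, :] with shape
# (nscans, nlons, nlats) = (68, 52, 46); the values here are made up.
rng = np.random.default_rng(0)
tdisk = rng.normal(loc=250.0, scale=5.0, size=(68, 52, 46))

# Average over the scan axis (axis 0) to get one 2-D field for the day.
tdisk_mean = tdisk.mean(axis=0)

# If missing values are stored as NaN, np.nanmean skips them.
tdisk_with_gap = tdisk.copy()
tdisk_with_gap[0, 0, 0] = np.nan
tdisk_nanmean = np.nanmean(tdisk_with_gap, axis=0)
```

Since `lons`, `lats`, and `tdisk_mean` would all have shape (52, 46), they can be passed together to matplotlib, for example `plt.contourf(lons, lats, tdisk_mean)`.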

geemap: Average ImageCollection bands over time period and extract values to points

I'm trying to extract band values from a Daymet image collection (V4) at specific sampling locations, averaged over the course of a year (daily average). I'm using the Google Earth Engine Python API (geemap) to do so.
Currently, I'm able to extract all of the daily values for each band over the time period using the following code:
import os
import ee
import geemap

ee.Initialize()
daymet = ee.ImageCollection('NASA/ORNL/DAYMET_V4') \
    .filter(ee.Filter.date('2000-01-01', '2000-12-31'))
DaymetImage = daymet.toBands()  # convert to ee.Image in order to use extract_values_to_points
# export data
out_dir = os.path.expanduser('~/Downloads')
out_csv = os.path.join(out_dir, 'daymet2000.csv')
geemap.extract_values_to_points(fc_coords, DaymetImage, out_csv)
# Note: fc_coords contains sampling points (lat/long) and other variables of interest
While this has all of the information I need, the resulting .csv file is bulky and would require a lot of work to clean up. I would like to be able to directly extract the daily average value for each band in the ImageCollection. The code I have so far is:
ee.Initialize()
daymet = ee.ImageCollection('NASA/ORNL/DAYMET_V4') \
.filter(ee.Filter.date('2000-01-01', '2000-12-31')) \
.mean()
And this is where I get stuck. The .mean function appears to turn it from an ee.ImageCollection to an ee.Image; however, when I try to extract_values_to_points from this Image, I get the following error message: EEException: Image.reduceRegions: The default WGS84 projection is invalid for aggregations. Specify a scale or crs & crs_transform.
As it is an ee.Image, the toBands() function (which I used earlier on an ImageCollection) does not work. Is there any way I can extract the average band values from this ImageCollection easily?
Thanks.

Creating a new dataset from large dataset xarray

I have a large dataset, with hourly data for an entire year. What I want to do is create a new dataset with specific variables at specific distances from a point source and use all the data to create a box-and-whisker plot.
The dataset has time, lon and lat and a concentration variable with multiple dimensions:
Concentration[hours,lat, lon]
I want to create a dataset that loops through all of the times for different lat and lon and produces a concentration output for all times at each of these locations, then use it to create a box-and-whisker plot showing the decrease of atmospheric concentration away from a point source. I know the specific grid cells I am interested in but need help setting up the script.
EDIT:
I cropped the Global dataset and this is what the output currently looks like:
time: (8761), latitude: (30), longitude: (30)
I tried using a for loop, but it would not let me loop over lat/lon:
for i in range(8761):
    print(Conc[i,:,:])
This lets me loop over all times and see the concentration at all grid cells, but instead of printing I want to create a new dataset, and only loop through certain grid cells.
I want a list that gives me the 8761 concentration values for each grid cell that I specify, and to keep all the data in one dataset so I can make a box plot from it.
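A possible starting point, sketched here on synthetic data, is to select the cells of interest with pointwise indexing instead of looping. The array `conc` stands in for `Conc[time, lat, lon]`, and the index pairs in `cells` are hypothetical examples, not values from the question.

```python
import numpy as np

# Synthetic stand-in for Conc[time, lat, lon] with the cropped shape from
# the question; real values would come from the dataset.
rng = np.random.default_rng(0)
conc = rng.random((8761, 30, 30))

# Hypothetical grid cells of interest, as (lat_index, lon_index) pairs,
# e.g. at increasing distance from the source.
cells = [(15, 15), (15, 18), (15, 22), (15, 28)]
lat_idx = np.array([c[0] for c in cells])
lon_idx = np.array([c[1] for c in cells])

# Pointwise indexing: one full time series per selected cell,
# giving an array of shape (n_times, n_cells) with no explicit loop.
series = conc[:, lat_idx, lon_idx]
```

Each column of `series` is the full-year record for one cell, so `plt.boxplot(series)` from matplotlib would draw one box per cell. With xarray, the equivalent is `isel` with `DataArray` indexers that share a common dimension.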

Merging xarray datasets with same extent but different spatial resolution

I have one dataset of satellite based solar induced fluorescence (SIF) and one of modeled precipitation. I want to compare precipitation to SIF on a per pixel basis in my study area. My two datasets are of the same area but at slightly different spatial resolutions. I can successfully plot these values across time and compare against each other when I take the mean for the whole area, but I'm struggling to create a scatter plot of this on a per pixel basis.
Honestly I'm not sure if this is the best way to compare these two values when looking for the impact of precip on SIF so I'm open to ideas of different approaches. As for merging the data currently I'm using xr.combine_by_coords but it is giving me an error I have described below. I could also do this by converting the netcdfs into geotiffs and then using rasterio to warp them, but that seems like an inefficient way to do this comparison. Here is what I have thus far:
import netCDF4
import numpy as np
import dask
import xarray as xr
rainy_bbox = np.array([
[-69.29519955115512,-13.861261028444734],
[-69.29519955115512,-12.384786628185896],
[-71.19583431678012,-12.384786628185896],
[-71.19583431678012,-13.861261028444734]])
max_lon_lat = np.max(rainy_bbox, axis=0)
min_lon_lat = np.min(rainy_bbox, axis=0)
# this dataset is available here: ftp://fluo.gps.caltech.edu/data/tropomi/gridded/
sif = xr.open_dataset('../data/TROPO_SIF_03-2018.nc')
# the dataset is global so subset to my study area in the Amazon
rainy_sif_xds = sif.sel(lon=slice(min_lon_lat[0], max_lon_lat[0]), lat=slice(min_lon_lat[1], max_lon_lat[1]))
# this data can all be downloaded from NASA Goddard here either manually or with wget but you'll need an account on https://disc.gsfc.nasa.gov/: https://pastebin.com/viZckVdn
imerg_xds = xr.open_mfdataset('../data/3B-DAY.MS.MRG.3IMERG.201803*.nc4')
# spatial subset
rainy_imerg_xds = imerg_xds.sel(lon=slice(min_lon_lat[0], max_lon_lat[0]), lat=slice(min_lon_lat[1], max_lon_lat[1]))
# I'm not sure the best way to combine these datasets but am trying this
combo_xds = xr.combine_by_coords([rainy_imerg_xds, rainy_sif_xds])
Currently I'm getting a seemingly unhelpful RecursionError: maximum recursion depth exceeded in comparison on that final line. When I add the argument join='left', the data from the rainy_imerg_xds dataset is in combo_xds; when I use join='right', the rainy_sif_xds data is present; and with join='inner', no data is present. I assumed this function did some internal interpolation, but it appears not.

This documentation from xarray outlines quite simply the solution to this problem. xarray allows you to interpolate in multiple dimensions and specify another Dataset's x and y dimensions as the output dimensions. So in this case it is done with
# interpolation based on http://xarray.pydata.org/en/stable/interpolation.html
# interpolation can't be done across the chunked dimension so we have to load it all into memory
rainy_sif_xds.load()
#interpolate into the higher resolution grid from IMERG
interp_rainy_sif_xds = rainy_sif_xds.interp(lat=rainy_imerg_xds["lat"], lon=rainy_imerg_xds["lon"])
# visualize the output (the .hvplot accessor requires `import hvplot.xarray`)
rainy_sif_xds.dcSIF.mean(dim='time').hvplot.quadmesh('lon', 'lat', cmap='jet', geo=True, rasterize=True, dynamic=False, width=450).relabel('Initial') +\
interp_rainy_sif_xds.dcSIF.mean(dim='time').hvplot.quadmesh('lon', 'lat', cmap='jet', geo=True, rasterize=True, dynamic=False, width=450).relabel('Interpolated')
# now that our coordinates match, in order to actually merge we need to convert the default CFTimeIndex to datetime to merge dataset with SIF data because the IMERG rainfall dataset was CFTime and the SIF was datetime
rainy_imerg_xds['time'] = rainy_imerg_xds.indexes['time'].to_datetimeindex()
# now the merge can easily be done with
merged_xds = xr.combine_by_coords([rainy_imerg_xds, interp_rainy_sif_xds], coords=['lat', 'lon', 'time'], join="inner")
# now visualize the two datasets together // multiply SIF by 30 because values are so low
merged_xds.HQprecipitation.rolling(time=7, center=True).sum().mean(dim=('lat', 'lon')).hvplot().relabel('Precip') * \
(merged_xds.dcSIF.mean(dim=('lat', 'lon'))*30).hvplot().relabel('SIF')
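For the per-pixel comparison the question originally asked about, one option once both variables sit on the same grid is to flatten them and compare value by value. This is only a sketch on synthetic arrays: `precip` and `sif` stand in for something like `merged_xds.HQprecipitation.values` and `merged_xds.dcSIF.values`, and it assumes both arrays have the same dimension order.

```python
import numpy as np

# Synthetic stand-ins for two variables on the same (time, lat, lon) grid.
rng = np.random.default_rng(0)
precip = rng.random((31, 15, 19))
sif = 0.5 * precip + 0.1 * rng.random((31, 15, 19))
sif[3, 2, 1] = np.nan  # gaps are common in satellite retrievals

# Flatten to one value per (time, pixel) and keep pairs where both exist.
x = precip.ravel()
y = sif.ravel()
valid = np.isfinite(x) & np.isfinite(y)
x, y = x[valid], y[valid]

# Pearson correlation over all pixel-time pairs.
r = np.corrcoef(x, y)[0, 1]
```

The paired arrays `x` and `y` can go straight into `plt.scatter(x, y)`. If the two variables come out of xarray with different dimension orders, transpose one of them before flattening.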

How to correlate two time series with gaps and different time bases?

I have two time series of 3D accelerometer data that have different time bases (clocks started at different times, with some very slight creep during the sampling time), as well as containing many gaps of different size (due to delays associated with writing to separate flash devices).
The accelerometers I'm using are the inexpensive GCDC X250-2. I'm running the accelerometers at their highest gain, so the data has a significant noise floor.
The time series each have about 2 million data points (over an hour at 512 samples/sec), and contain about 500 events of interest, where a typical event spans 100-150 samples (200-300 ms each). Many of these events are affected by data outages during flash writes.
So, the data isn't pristine, and isn't even very pretty. But my eyeball inspection shows it clearly contains the information I'm interested in. (I can post plots, if needed.)
The accelerometers are in similar environments but are only moderately coupled, meaning that I can tell by eye which events match from each accelerometer, but I have been unsuccessful so far doing so in software. Due to physical limitations, the devices are also mounted in different orientations, where the axes don't match, but they are as close to orthogonal as I could make them. So, for example, for 3-axis accelerometers A & B, +Ax maps to -By (up-down), +Az maps to -Bx (left-right), and +Ay maps to -Bz (front-back).
My initial goal is to correlate shock events on the vertical axis, though I would eventually like to a) automatically discover the axis mapping, b) correlate activity on the mapped axes, and c) extract behavior differences between the two accelerometers (such as twisting or flexing).
The nature of the time series data makes Python's numpy.correlate() unusable. I've also looked at R's zoo package, but have made no headway with it. I've looked to different fields of signal analysis for help, but I've made no progress.
Anyone have any clues for what I can do, or approaches I should research?
Update 28 Feb 2011: Added some plots here showing examples of the data.

My interpretation of your question: Given two very long, noisy time series, find a shift of one that matches large 'bumps' in one signal to large bumps in the other signal.
My suggestion: interpolate the data so it's uniformly spaced, rectify and smooth the data (assuming the phase of the fast oscillations is uninteresting), and do a one-point-at-a-time cross correlation (assuming a small shift will line up the data).
import numpy
from scipy.ndimage import gaussian_filter
"""
sig1 and sig2 are assumed to be large, 1D numpy arrays
sig1 is sampled at times t1, sig2 is sampled at times t2
t_start, t_end, is your desired sampling interval
t_len is your desired number of measurements
"""
t = numpy.linspace(t_start, t_end, t_len)
sig1 = numpy.interp(t, t1, sig1)
sig2 = numpy.interp(t, t2, sig2)
#Now sig1 and sig2 are sampled at the same points.
"""
Rectify and smooth, so 'peaks' will stand out.
This makes big assumptions about your data;
these assumptions seem true-ish based on your plots.
"""
sigma = 10 #Tune this parameter to get the right smoothing
sig1, sig2 = abs(sig1), abs(sig2)
sig1, sig2 = gaussian_filter(sig1, sigma), gaussian_filter(sig2, sigma)
"""
Now sig1 and sig2 should look smoothly varying, with humps at each 'event'.
Hopefully we can search a small range of shifts to find the maximum of the
cross-correlation. This assumes your data are *nearly* lined up already.
"""
max_xc = 0
best_shift = 0
for shift in range(-10, 10):  # Tune this search range
    xc = (numpy.roll(sig1, shift) * sig2).sum()
    if xc > max_xc:
        max_xc = xc
        best_shift = shift
print('Best shift:', best_shift)
"""
If best_shift is at the edges of your search range,
you should expand the search range.
"""

If the data contains gaps of unknown sizes that are different in each time series, then I would give up on trying to correlate entire sequences, and instead try cross correlating pairs of short windows on each time series, say overlapping windows twice the length of a typical event (300 samples long). Find potential high cross correlation matches across all possibilities, and then impose a sequential ordering constraint on the potential matches to get sequences of matched windows.
From there you have smaller problems that are easier to analyze.
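As a rough illustration of the per-window matching step, the sketch below finds the lag that best aligns two equal-length windows. The function name `best_lag`, the 300-sample window, and the synthetic event are all assumptions for the example. The score is only approximately a correlation coefficient, because the mean and standard deviation are taken over the whole window rather than over the overlap.

```python
import numpy as np

def best_lag(window_a, window_b, max_lag):
    """Return (lag, score) that best aligns two equal-length windows.

    A positive lag means the feature appears `lag` samples later in window_b.
    """
    a = (window_a - window_a.mean()) / (window_a.std() + 1e-12)
    b = (window_b - window_b.mean()) / (window_b.std() + 1e-12)
    n = len(a)
    best = (0, -np.inf)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            score = np.dot(a[:n - lag], b[lag:]) / n   # pairs a[i] with b[i + lag]
        else:
            score = np.dot(a[-lag:], b[:n + lag]) / n  # same pairing, negative lag
        if score > best[1]:
            best = (lag, score)
    return best

# Synthetic event: a smooth bump plus noise, delayed by 25 samples in window B.
rng = np.random.default_rng(0)
t = np.arange(300)
event = np.exp(-0.5 * ((t - 120) / 15.0) ** 2)
win_a = event + 0.05 * rng.normal(size=300)
win_b = np.roll(event, 25) + 0.05 * rng.normal(size=300)

lag, score = best_lag(win_a, win_b, max_lag=60)  # lag should come out close to 25
```

Running this for every candidate pair of windows gives a table of (lag, score) values, and the sequential ordering constraint described above can then be applied to the high-scoring pairs.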

This isn't a technical answer, but it might help you come up with one:
Convert the plot to an image, and stick it into a decent image program like gimp or photoshop
break the plots into discrete images whenever there's a gap
put the first series of plots in a horizontal line
put the second series in a horizontal line right underneath it
visually identify the first correlated event
if the two events are not lined up vertically:
select whichever instance is further to the left and everything to the right of it on that row
drag those things to the right until they line up
This is pretty much how an audio editor works, so if you converted it into a simple audio format like an uncompressed WAV file, you could manipulate it directly in something like Audacity. (It'll sound horrible, of course, but you'll be able to move the data plots around pretty easily.)
Actually, audacity has a scripting language called nyquist, too, so if you don't need the program to detect the correlations (or you're at least willing to defer that step for the time being) you could probably use some combination of audacity's markers and nyquist to automate the alignment and export the clean data in your format of choice once you tag the correlation points.

My guess is, you'll have to manually build an offset table that aligns the "matches" between the series. Below is an example of a way to get those matches. The idea is to shift the data left-right until it lines up and then adjust the scale until it "matches". Give it a try.
library(rpanel)
#Generate the x1 and x2 data
n1 <- rnorm(500)
n2 <- rnorm(200)
x1 <- c(n1, rep(0,100), n2, rep(0,150))
x2 <- c(rep(0,50), 2*n1, rep(0,150), 3*n2, rep(0,50))
#Build the panel function that will draw/update the graph
lvm.draw <- function(panel) {
  plot(x=(1:length(panel$dat3))+panel$off, y=panel$dat3, ylim=panel$dat1, xlab="", ylab="y", main=paste("Alignment Graph Offset = ", panel$off, " Scale = ", panel$sca, sep=""), typ="l")
  lines(x=1:length(panel$dat3), y=panel$sca*panel$dat4, col="red")
  grid()
  panel
}
#Build the panel
xlimdat <- c(1, length(x1))
ylimdat <- c(-5, 5)
panel <- rp.control(title = "Eye-Ball-It", dat1=ylimdat, dat2=xlimdat, dat3=x1, dat4=x2, off=100, sca=1.0, size=c(300, 160))
rp.slider(panel, var=off, from=-500, to=500, action=lvm.draw, title="Offset", pos=c(5, 5, 290, 70), showvalue=TRUE)
rp.slider(panel, var=sca, from=0, to=2, action=lvm.draw, title="Scale", pos=c(5, 70, 290, 90), showvalue=TRUE)

It sounds like you want to minimize the function (Ax'+By) + (Az'+Bx) + (Ay'+Bz) over a pair of values, namely the time offset t0 and a time scale factor tr, where Ax' = tr*(Ax + t0), etc.
I would look into SciPy's bivariate optimize functions. And I would use a mask or temporarily zero the data (both Ax' and By for example) over the "gaps" (assuming the gaps can be programmatically determined).
To make the process more efficient, start with a coarse sampling of A and B, but set the precision in fmin (or whatever optimizer you've selected) that is commensurate with your sampling. Then proceed with progressively finer-sampled windows of the full dataset until your windows are narrow and are not down-sampled.
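To make the coarse-to-fine idea concrete, here is a simplified sketch. It swaps the optimizer for an exhaustive search over integer shifts and ignores the time scale factor, so it is not the full two-parameter fit described above. The names `shift_score` and `coarse_to_fine_shift`, the downsampling factors, and the synthetic signals are assumptions. It also assumes the signals have already been rectified and smoothed, since plain decimation without filtering would alias fast oscillations.

```python
import numpy as np

def shift_score(a, b, shift):
    """Sum of a[i] * b[i + shift] over the overlapping samples (equal-length inputs)."""
    if shift >= 0:
        x, y = a[:len(a) - shift], b[shift:]
    else:
        x, y = a[-shift:], b[:len(b) + shift]
    return float(np.dot(x, y))

def coarse_to_fine_shift(a, b, factors=(16, 4, 1), radius=8):
    """Estimate how many samples b lags a, refining from coarse to fine sampling."""
    estimate = 0  # current estimate, in full-resolution samples
    for f in factors:
        a_f, b_f = a[::f], b[::f]          # plain decimation (signals assumed smooth)
        center = int(round(estimate / f))  # previous estimate on this level's grid
        candidates = range(center - radius, center + radius + 1)
        best = max(candidates, key=lambda s: shift_score(a_f, b_f, s))
        estimate = best * f
    return estimate

# Synthetic smoothed "event" signals: four bumps, with b delayed by 100 samples.
rng = np.random.default_rng(0)
t = np.arange(4096)
centers = [500, 1400, 2100, 3300]
a = sum(np.exp(-0.5 * ((t - c) / 40.0) ** 2) for c in centers)
b = np.roll(a, 100) + 0.02 * rng.normal(size=t.size)

shift = coarse_to_fine_shift(a, b)  # should land close to 100
```

Each level searches only a small neighborhood around the previous estimate, so the cost stays low even for long records. The coarsest factor and `radius` together set the largest shift that can be found.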
Edit - matching axes
Regarding the issue of trying to identify which axis is co-linear with a given axis, and not knowing a thing about the characteristics of your data, I can point towards a similar question. Look into pHash or any of the other methods outlined in this post to help identify similar waveforms.
