applying a generalized additive model to an xarray - python

I have a netCDF file which I have read with xarray. The array contains time, latitude, longitude and only one data variable (i.e. index values)
# read the netCDF file
with xr.open_mfdataset('wet_tropics.nc') as wet:
    print(wet)
Out[]:
<xarray.Dataset>
Dimensions:       (time: 1437, x: 24, y: 20)
Coordinates:
  * y             (y) float64 -1.878e+06 -1.878e+06 -1.878e+06 -1.878e+06 ...
  * x             (x) float64 1.468e+06 1.468e+06 1.468e+06 1.468e+06 ...
  * time          (time) object '2013-03-29T00:22:28.500000000' ...
Data variables:
    index_values  (time, y, x) float64 dask.array<shape=(1437, 20, 24), chunksize=(1437, 20, 24)>
So far, so good.
Now I need to apply a generalized additive model to each grid cell in the array. The model I want to use comes from Facebook Prophet (https://facebook.github.io/prophet/) and I have successfully applied it to a pandas DataFrame before. For example:
cns_ap['y'] = cns_ap['av_index']  # Prophet requires the specific column names 'y' and 'ds'
cns_ap['ds'] = cns_ap['Date']
cns_ap['cap'] = 1
m1 = Prophet(weekly_seasonality=False,   # disable weekly seasonality
             daily_seasonality=False,    # disable daily seasonality
             growth='logistic',          # logistic because indices have a maximum
             yearly_seasonality=4,       # Fourier order, int between 1-10
             changepoint_prior_scale=0.5).fit(cns_ap)
future1 = m1.make_future_dataframe(periods=60,            # 5-year prediction
                                   freq='M',              # monthly predictions
                                   include_history=True)  # fit model to all historical data
future1['cap'] = 1  # set cap at maximum index value
forecast1 = m1.predict(future1)
# m1.plot_components(forecast1, plot_cap=False);
# m1.plot(forecast1, plot_cap=False, ylabel='CNS index', xlabel='Year');
The problem is that now I have to
1) iterate through every cell of the netCDF file,
2) get all the values for that cell through time,
3) apply the GAM (using fbprophet), and then export and plot the results.
The question: do you have any ideas on how to loop through the raster and get the index_values of each pixel for all times, so that I can run the GAM?
I think a nested for loop would be feasible, although I don't know how to write one that visits every cell.
Any help is appreciated
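For reference, one way to avoid a hand-written nested loop is to stack the two spatial dimensions into a single grid-cell dimension and iterate over that (the same stacking/groupby trick appears in a later answer on this page). A minimal sketch, assuming fbprophet is installed and the dataset is laid out as printed above; the column wiring mirrors the pandas example:
import pandas as pd
import xarray as xr
from fbprophet import Prophet

with xr.open_mfdataset('wet_tropics.nc') as wet:
    # Collapse (y, x) into one "gridcell" dimension: each group is then a
    # 1-D time series for a single pixel.
    stacked = wet.index_values.stack(gridcell=('y', 'x'))
    for cell, series in stacked.groupby('gridcell'):
        cns_ap = pd.DataFrame({
            'ds': pd.to_datetime(series['time'].values),  # Prophet needs datetimes
            'y': series.values,
            'cap': 1,  # logistic growth needs a cap column
        })
        m1 = Prophet(weekly_seasonality=False,
                     daily_seasonality=False,
                     growth='logistic',
                     yearly_seasonality=4,
                     changepoint_prior_scale=0.5).fit(cns_ap)
        # ...then make_future_dataframe()/predict() per cell, as in the
        # pandas example above, collecting the forecasts keyed by `cell`.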

Related

Linear Regression on Multiindex Pandas Dataframe in Python

I'm trying to perform a regression of annual temperatures over time, and obtain a slope/linear trend (number generated by the regression) for each latitude and longitude coordinate (the full dataset has many lat/lon locations). I want to replace the year and temp for each location with this slope value. My end goal is to map these trends with cartopy.
Here is some test data in a pandas MultiIndex DataFrame:
                        tempanomaly
lat   lon    time_bnds
-89.0 -179.0 1957          0.606364
             1958          0.495000
             1959          0.134286
This is my goal:
  lat    lon    trend
-89.0 -179.0 -0.23604
This is my regression function:
def regress(y):
    # X is the year or index, y is the temperature
    X = np.array(range(len(y))).reshape(len(y), 1)
    y = y.array
    fit = np.polyfit(X, y, 1)
    return (fit[0])
and here is how I'm attempting to call it:
reg = df.groupby(["lat", "lon"]).transform(regress)
The error I'm receiving is TypeError: Transform function invalid for data types.
In the debugging process, I found that the regression was running for each line (3 times, using the test data), as opposed to once for each location (only one location is in the test data). I believe the problem lies in the method I'm using to call the regression, but I can't figure out another way to iterate through and perform a regression by lat/lon pairs. I appreciate any help!
I think there is also an error in your regress function, because in your case X should be a 1D vector. So here is the fixed regress function:
def regress(y):
    # X is the year or index, y is the temperature
    X = np.array(range(len(y)))
    y = y.array
    fit = np.polyfit(X, y, 1)
    return (fit[0])
Per the pandas documentation, pandas.DataFrame.transform produces a DataFrame with the same axis length as self. Therefore aggregate is a better option for your case.
reg = df.groupby(["lat", "lon"]).aggregate(trend=pd.NamedAgg('tempanomaly', regress)).reset_index()
which produces:
lat lon trend
-89.0 -179.0 -0.236039
with the sample data created as follows:
lat_lon = [(-89.0, -179.0), (-89.0, -179.0), (-89.0, -179.0)]
index = pd.MultiIndex.from_tuples(lat_lon, names=["lat", "lon"])
df = pd.DataFrame({
    'time_bnds': [1957, 1958, 1959],
    'tempanomaly': [0.606364, 0.495000, 0.134286]
}, index=index)
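For what it's worth, apply can produce the same result; a minimal sketch using the fixed regress function from above:
# apply() calls regress once per (lat, lon) group and returns one scalar per
# group, giving the same per-location trend as the NamedAgg version.
reg = (
    df.groupby(["lat", "lon"])["tempanomaly"]
    .apply(regress)
    .rename("trend")
    .reset_index()
)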

Is it possible to select a dataset by a time range that is different for every pixel in Python's xarray module?

I am trying to select only the part of the data that falls within a specific time range, where the range is different for every pixel.
For indexing, I have two np.datetime64[ns] xr.DataArrays with shape (lat: 152, lon: 131), named time_range_min and time_range_max.
One holds the start dates and the other the end dates.
I tried this to select the data:
dataset = data.sel(time=slice(time_range_min, time_range_max))
but it yields
cannot use non-scalar arrays in a slice for xarray indexing:
<xarray.DataArray 'NDVI' (lat: 152, lon: 131)>
If I cannot use non-scalar arrays, does that mean this is in general not possible, or can I transform my arrays somehow?
If "time" is a list of dates in string that is ordered from past to present (e.g. ["10-20-2021", "10-21-2021", ...]:
import numpy as np
listOfMinMaxTimeRanges = [time_range_min, time_range_max]
specifiedRangeOfTimeIndexedList = []
for indexingListOfMinMaxTimeRanges in range(np.shape(listOfMinMaxTimeRanges)[1])):
specifiedRangeOfTimeIndexed = [specifiedRangeOfTime for specifiedRangeOfTime in np.arange(0, len(time), 1) if time.index(listOfMinMaxTimeRanges[0][indexingListOfMinMaxTimeRanges]) <= specifiedRangeOfTime <= time.index(listOfMinMaxTimeRanges[1][indexingListOfMinMaxTimeRanges])]
for indexes in range(len(specifiedRangeOfTimeIndexed)):
specifiedRangeOfTimeIndexedList.append(specifiedRangeOfTimeIndexed[indexes])
Depending on how your dataset is structured:
dataset = data.sel(time = specifiedRangeOfTimeIndexedList)
or
dataset = data.sel(time = time[specifiedRangeOfTimeIndexedList])
or
dataset = dataset[time[specifiedRangeOfTimeIndexedList]]
or
dataset = dataset[:, time[specifiedRangeOfTimeIndexedList]]
or
dataset = dataset[time[specifiedRangeOfTimeIndexedList], :, :]
or
dataset = dataset[specifiedRangeOfTimeIndexedList]
...
I found a way to group every cell with stacking in xarray. time_range_min and time_range_max now each mark a single date:
stack = dataset.value.stack(gridcell=['lat', 'lon'])
for unique_value, grouped_array in stack.groupby('gridcell'):
    grouped_array.sel(time=slice(time_range_min, time_range_max))
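An alternative sketch that avoids looping altogether, assuming data has a one-dimensional time coordinate: because the (lat, lon) bound arrays broadcast against it, a boolean mask can select each pixel's own time range in one shot (values outside the range become NaN):
# Broadcasting (time,) against (lat, lon) yields a (time, lat, lon) mask.
mask = (data.time >= time_range_min) & (data.time <= time_range_max)
selected = data.where(mask)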

Preparing Grid weather data for ConvLSTM2d

I am attempting to use a ConvLSTM2D model with hourly grid weather data. I can get the data into a 4D array with these dimensions: (num_hours, lat, lon, num_features). ConvLSTM2D requires 5D, and I was planning on setting a variable for a sequence length of maybe 24 hrs. My question is: how do I create an additional dimension in this array to get the sequence-length dimension, (num_hours, sequence_length, lat, lon, num_features)? Is there a smarter, more efficient way to get the data into the correct form from a pandas dataframe that has columns for lat, lon, time, feature type and value?
I realize it is always easier to have a sample dataset when asking a question, so I created a set to mimic the issue.
import pandas as pd
import numpy as np

weather_variables = ['windspeed', 'temp', 'pressure']
lats = [x/10 for x in range(400, 500, 5)]
lons = [x/10 for x in range(900, 1000, 5)]
hours = pd.date_range('1/1/2021', '9/28/2021', freq='H')
df = []
for i in range(0, len(hours)):
    for weather in weather_variables:
        temp_df = pd.DataFrame(index=lats, columns=lons,
                               data=np.random.randint(0, 100, size=(len(lats), len(lons))))
        temp_df = temp_df.unstack().to_frame()
        temp_df.reset_index(inplace=True)
        temp_df['weather_variable'] = weather
        temp_df['ts'] = hours[i]
        df.append(temp_df)
df = pd.concat(df)
df.columns = ['lon', 'lat', 'value', 'weather_variable', 'ts']
So this code will create a dummy dataset containing three grids (one per weather variable) for each hour. The goal is to convert this into a 5D array of overlapping 24-hour sequences. I think the array would look like this: (len(hours)?, 24, 20, 20, 3).
From the ConvLSTM paper:
"The weather radar data is recorded every 6 minutes, so there are 240 frames per day. To get disjoint subsets for training, testing and validation, we partition each daily sequence into 40 non-overlapping frame blocks and randomly assign 4 blocks for training, 1 block for testing and 1 block for validation. The data instances are sliced from these blocks using a 20-frame-wide sliding window. Thus our radar echo dataset contains 8148 training sequences, 2037 testing sequences and 2037 validation sequences and all the sequences are 20 frames long (5 for the input and 15 for the prediction)."
If my calculations are correct, each of the "non-overlapping frame blocks" should have 6 frames in it (240 frames per day / 40 blocks per day = 6 frames per block), so I'm not sure how you create a 20-frame-wide sliding window within a given block. Nonetheless, you could take a similar approach: divide your data into non-overlapping windows of a specific length. Perhaps you use 6 hours of data to predict the next 6. I'm not sure that you need to keep the windows within a given day; a change from 11 pm to 1 am seems just as valid a time window as one from, say, 3 am to 5 am.
I don't think Pandas will be an efficient way to massage the data. I would stick with NumPy or probably a TensorFlow data structure.
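To build the overlapping 24-hour sequences themselves, numpy's sliding_window_view (numpy >= 1.20) can add the sequence dimension as a view, without copying. A minimal sketch, assuming the 4D array from the question already exists (random data stands in for it here):
import numpy as np

# Stand-in for the real 4-D array: (num_hours, lat, lon, num_features)
arr = np.random.rand(500, 20, 20, 3)

seq_len = 24
# sliding_window_view appends the window axis at the end:
# shape (num_hours - seq_len + 1, lat, lon, num_features, seq_len)
windows = np.lib.stride_tricks.sliding_window_view(arr, seq_len, axis=0)
# Move the window axis to position 1 for the ConvLSTM2D layout:
# (num_sequences, seq_len, lat, lon, num_features). Still a view;
# Keras copies the data when it converts the input to a tensor.
sequences = np.moveaxis(windows, -1, 1)
print(sequences.shape)  # (477, 24, 20, 20, 3)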

Xarray merge separate day and hour dimensions into one time dimension in python

I have an xarray dataset. Its dimensions are (lat, lon, step (hours), time (days)). I want to merge the hours and days into one so that the dimensions are instead (lat, lon, timestep). How do I do this?
Creating a one-dimensional time dimension and coordinate
You can use the stack method to create a multiindex of the time and step dimensions. As your valid_time coord already holds the correct datetimes, you can then drop the multiindex coords and keep only the valid_time coord with the actual datetimes.
import numpy as np
import xarray as xr
import pandas as pd

# Create a dummy representation of your data
ds = xr.Dataset(
    data_vars={"a": (("x", "y", "time", "step"), np.random.rand(5, 5, 3, 24))},
    coords={
        "time": pd.date_range(start="1999-12-31", periods=3, freq="d"),
        "step": pd.timedelta_range(start="1h", freq="h", periods=24),
    },
)
ds = ds.assign_coords(valid_time=ds.time + ds.step)

# Stack the time and step dims
stacked_ds = ds.stack(datetime=("time", "step"))

# Drop the multiindex if you want to keep only the valid_time coord, which
# contains the combined date and time information.
# Rename vars and dims to your liking.
stacked_ds = (
    stacked_ds.drop_vars("datetime")
    .rename_dims({"datetime": "time"})
    .rename_vars({"valid_time": "time"})
)
print(stacked_ds)
<xarray.Dataset>
Dimensions:  (time: 72, x: 5, y: 5)
Coordinates:
  * time     (time) datetime64[ns] 1999-12-31T01:00:00 ... 2000-01-03
Dimensions without coordinates: x, y
Data variables:
    a        (x, y, time) float64 0.1961 0.3733 0.2227 ... 0.4929 0.7459 0.4106
Making the time coordinate an index
Like this we create a single time dimension with a continuous datetime series as its coordinate. However, it is not an index. For some methods, like resample, time needs to be an index. We can fix that by explicitly setting it as an index:
stacked_ds.set_index(time="time")
However, this will make 'time' a variable instead of a coordinate. To make it a coordinate again, we can use
stacked_ds.set_index(time="time").set_coords("time")
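For example, a short sketch continuing from stacked_ds above; once the datetimes are an index (and a coordinate again), resample works as usual:
# Hypothetical usage: compute 6-hourly means along the merged time dimension.
indexed = stacked_ds.set_index(time="time").set_coords("time")
six_hourly = indexed.resample(time="6h").mean()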
Working with DataArrays
You can use stacking of dimensions on DataArrays as well. However, they do not have the rename_dims and rename_vars methods. Instead, you can use swap_dims and rename:
(
    ds.a.stack(datetime=("time", "step"))
    .drop_vars("datetime")
    .swap_dims({"datetime": "time"})
    .rename({"valid_time": "time"})
).set_index(time="time")

Why are latitudes and longitudes two-dimensional arrays in a netCDF file?

I have a netCDF file which contains temperature data over some location. The data shape is 1450x900.
I am creating search functionality in my app to locate temperature data by lat/lon values.
So I extracted the lat and lon coordinate data from the netCDF file, but I was expecting 1D arrays and instead got 2D arrays of shape 1450x900 for both coordinates.
So my question: why are they 2D arrays, instead of 1450 latitude values and 900 longitude values? Don't 1450 lat values and 900 lon values describe the whole grid?
Let's say we have a 4x5 square; the indices locating the rightmost, bottom-most point of the grid will be [4, 5]. So my indices for x will be [1, 2, 3, 4] and for y: [1, 2, 3, 4, 5]; 9 indices in total are enough to locate any point on that grid (consisting of 20 cells). So why do the lat (x) and lon (y) coordinates in the netCDF file contain 20 indices each (40 in total), instead of 4 and 5 indices respectively (9 in total)? I hope you get what confuses me.
Is it possible to somehow map those 2D arrays and "downgrade" them to 1450 latitude values and 900 longitude values? Or is it OK as it is right now? How can I use those values for my purpose? Do I need to zip the lat/lon arrays?
here are the shapes:
>>> DS = xarray.open_dataset('file.nc')
>>> DS.tasmin.shape
(31, 1450, 900)
>>> DS.projection_x_coordinate.shape
(900,)
>>> DS.projection_y_coordinate.shape
(1450,)
>>> DS.latitude.shape
(1450, 900)
>>> DS.longitude.shape
(1450, 900)
consider that projection_x_coordinate and projection_y_coordinate are easting/northing values, not lat/lons
here is the metadata of the file if needed:
Dimensions:                       (bnds: 2, projection_x_coordinate: 900, projection_y_coordinate: 1450, time: 31)
Coordinates:
  * time                          (time) datetime64[ns] 2018-12-01T12:00:00 ....
  * projection_y_coordinate       (projection_y_coordinate) float64 -1.995e+0...
  * projection_x_coordinate       (projection_x_coordinate) float64 -1.995e+0...
    latitude                      (projection_y_coordinate, projection_x_coordinate) float64 ...
    longitude                     (projection_y_coordinate, projection_x_coordinate) float64 ...
Dimensions without coordinates: bnds
Data variables:
    tasmin                        (time, projection_y_coordinate, projection_x_coordinate) float64 ...
    transverse_mercator           int32 ...
    time_bnds                     (time, bnds) datetime64[ns] ...
    projection_y_coordinate_bnds  (projection_y_coordinate, bnds) float64 ...
    projection_x_coordinate_bnds  (projection_x_coordinate, bnds) float64 ...
Attributes:
    comment:        Daily resolution gridded climate observations
    creation_date:  2019-08-21T21:26:02
    frequency:      day
    institution:    Met Office
    references:     doi: 10.1002/joc.1161
    short_name:     daily_mintemp
    source:         HadUK-Grid_v1.0.1.0
    title:          Gridded surface climate observations data for the UK
    version:        v20190808
    Conventions:    CF-1.5
Your data adheres to version 1.5 of the Climate and Forecast conventions.
The document describing this version of the conventions is here, although the relevant section is essentially unchanged across many versions of the conventions.
See section 5.2:
5.2. Two-Dimensional Latitude, Longitude, Coordinate Variables
The latitude and longitude coordinates of a horizontal grid that was not defined as a Cartesian product of latitude and longitude axes can sometimes be represented using two-dimensional coordinate variables. These variables are identified as coordinates by use of the coordinates attribute.
It looks like you are using the HadOBS 1km resolution gridded daily minimum temperature, and this file in particular:
http://dap.ceda.ac.uk/thredds/fileServer/badc/ukmo-hadobs/data/insitu/MOHC/HadOBS/HadUK-Grid/v1.0.1.0/1km/tasmin/day/v20190808/tasmin_hadukgrid_uk_1km_day_20181201-20181231.nc (warning: >300MB download)
As it states, the data is on a transverse mercator grid.
If you look at output from ncdump -h <filename> you will also see the following description of the grid expressed by means of attributes of the transverse_mercator dummy variable:
int transverse_mercator ;
transverse_mercator:grid_mapping_name = "transverse_mercator" ;
transverse_mercator:longitude_of_prime_meridian = 0. ;
transverse_mercator:semi_major_axis = 6377563.396 ;
transverse_mercator:semi_minor_axis = 6356256.909 ;
transverse_mercator:longitude_of_central_meridian = -2. ;
transverse_mercator:latitude_of_projection_origin = 49. ;
transverse_mercator:false_easting = 400000. ;
transverse_mercator:false_northing = -100000. ;
transverse_mercator:scale_factor_at_central_meridian = 0.9996012717 ;
and you will also see that the coordinate variables projection_x_coordinate and projection_y_coordinate have units of metres.
The grid in question is the Ordnance Survey UK grid using numeric grid references.
See for example this description of the OS grid (from Wikipedia).
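As a side note, the grid_mapping attributes above appear to match the British National Grid (EPSG:27700), so the relationship between the 1-D projection axes and the 2-D latitude/longitude arrays can be checked with pyproj; a minimal sketch, assuming pyproj is installed:
import numpy as np
import xarray as xr
from pyproj import Transformer

DS = xr.open_dataset('file.nc')
# Expand the 1-D easting/northing axes to the full 2-D grid ...
x2d, y2d = np.meshgrid(DS.projection_x_coordinate, DS.projection_y_coordinate)
# ... and transform to WGS84. EPSG:4326 axis order is (lat, lon).
transformer = Transformer.from_crs("EPSG:27700", "EPSG:4326")
lat2d, lon2d = transformer.transform(x2d, y2d)
# These should reproduce the 2-D arrays stored in the file (up to rounding):
print(np.allclose(lat2d, DS.latitude), np.allclose(lon2d, DS.longitude))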
If you wish to express the data on a regular longitude-latitude grid then you will need to do some type of interpolation. I see that you are using xarray. You can combine this with pyresample to do the interpolation. Here is an example:
import xarray as xr
import numpy as np
from pyresample.geometry import SwathDefinition
from pyresample.kd_tree import resample_nearest, resample_gauss
ds = xr.open_dataset("tasmin_hadukgrid_uk_1km_day_20181201-20181231.nc")
# Define a target grid. For sake of example, here is one with just
# 3 longitudes and 4 latitudes.
lons = np.array([-2.1, -2., -1.9])
lats = np.array([51.7, 51.8, 51.9, 52.0])
# The target grid is regular (1-d lon, lat coordinates) but we will need
# a 2d version (similar to the input grid), so use numpy.meshgrid to produce this.
lon2d, lat2d = np.meshgrid(lons, lats)
origin_grid = SwathDefinition(lons=ds.longitude, lats=ds.latitude)
target_grid = SwathDefinition(lons=lon2d, lats=lat2d)
# get a numpy array for the first timestep
data = ds.tasmin[0].to_masked_array()
# nearest neighbour interpolation example
# Note that radius_of_influence has units metres
interpolated = resample_nearest(origin_grid, data, target_grid, radius_of_influence=1000)
# GIVES:
# array([[5.12490065, 5.02715332, 5.36414835],
# [5.08337723, 4.96372838, 5.00862833],
# [6.47538931, 5.53855722, 5.11511239],
# [6.46571817, 6.17949381, 5.87357538]])
# gaussian weighted interpolation example
# Note that radius_of_influence and sigmas both have units metres
interpolated = resample_gauss(origin_grid, data, target_grid, radius_of_influence=1000, sigmas=1000)
# GIVES:
# array([[5.20432465, 5.07436805, 5.39693221],
# [5.09069187, 4.8565934 , 5.08191639],
# [6.4505963 , 5.44018209, 5.13774416],
# [6.47345359, 6.2386732 , 5.62121948]])
I figured out the answer myself.
As it turns out, the 2D lat/lon arrays are used to define the "grid" of the location.
In other words, if we zip the lat/lon values and project them on the map, we get a "curved grid" (in other words, the earth's curvature is taken into account) over the location, which is then used to create the grid reference of the location.
Hope it's clear for anyone interested.
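And for the search functionality mentioned in the question, a brute-force nearest-pixel lookup works directly on the 2-D arrays; a minimal hypothetical sketch (for higher accuracy, use a proper geodesic distance or pyresample as in the answer above):
import numpy as np
import xarray as xr

def nearest_pixel(ds, lat0, lon0):
    # Squared "degree distance" to every grid cell; adequate for a local grid.
    dist2 = (ds.latitude - lat0) ** 2 + (ds.longitude - lon0) ** 2
    iy, ix = np.unravel_index(int(dist2.argmin()), dist2.shape)
    return iy, ix

DS = xr.open_dataset('file.nc')
iy, ix = nearest_pixel(DS, 51.5, -2.0)
value = DS.tasmin[0, iy, ix]  # first timestep at the nearest pixel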
