I need to apply a PCA transformation to some Landsat (satellite imagery) scenes stored as an xarray.Dataset and containing NaN values (for technical reasons, when a pixel is nodata, every band of that pixel is NaN).
Here is the code to create an example dataset:
import numpy as np
import xarray as xr
# Create a demo xarray.Dataset
ncols = 25
nrows = 50
lon = [50 + x * 0.2 for x in range(nrows)]
lat = [30 + x * 0.2 for x in range(ncols)]
red = np.random.rand(nrows, ncols) * 10000
green = np.random.rand(nrows, ncols) * 10000
blue = np.random.rand(nrows, ncols) * 10000
nir = np.random.rand(nrows, ncols) * 10000
swir1 = np.random.rand(nrows, ncols) * 10000
swir2 = np.random.rand(nrows, ncols) * 10000
ds = xr.Dataset({'red': (['longitude', 'latitude'], red),
                 'green': (['longitude', 'latitude'], green),
                 'blue': (['longitude', 'latitude'], blue),
                 'nir': (['longitude', 'latitude'], nir),
                 'swir1': (['longitude', 'latitude'], swir1),
                 'swir2': (['longitude', 'latitude'], swir2)},
                coords={'longitude': (['longitude'], lon),
                        'latitude': (['latitude'], lat)})
# To keep example realistic let's add some nodata
ds = ds.where(ds.latitude + ds.longitude < 90)
print(ds)
<xarray.Dataset>
Dimensions:  (latitude: 25, longitude: 50)
Coordinates:
  * longitude  (longitude) float64 50.0 50.2 50.4 50.6 50.8 51.0 51.2 51.4 ...
  * latitude   (latitude) float64 30.0 30.2 30.4 30.6 30.8 31.0 31.2 31.4 ...
Data variables:
    red        (longitude, latitude) float64 6.07e+03 13.8 9.682e+03 ...
    green      (longitude, latitude) float64 5.476e+03 350.4 7.556e+03 ...
    blue       (longitude, latitude) float64 4.306e+03 2.104e+03 9.267e+03 ...
    nir        (longitude, latitude) float64 1.445e+03 8.633e+03 6.388e+03 ...
    swir1      (longitude, latitude) float64 6.005e+03 7.692e+03 4.004e+03 ...
    swir2      (longitude, latitude) float64 8.235e+03 3.127e+03 674.6 ...
After searching online, I tried, unsuccessfully, to use the PCA functions from sklearn.decomposition.
I first flatten each two-dimensional band into a single dimension:
# flatten dataset
tmp_list = []
for b in ['red', 'green', 'blue', 'nir', 'swir1', 'swir2']:
    tmp_list.append(ds[b].values.flatten().astype('float64'))
flat_ds = np.array(tmp_list)
Then I tried to compute the PCA and transform the original data over a region without NaNs. I managed to generate some output, but it was completely different from the result produced by ArcGIS or GRASS.
When I changed the region, it became clear that the sklearn functions cannot process data containing NaNs. So I removed the NaN values from the flattened dataset, which becomes a problem when I deflate the flattened PCA result back to 2D, because it no longer contains a multiple of the original dataset dimensions.
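In essence, the PCA step looked something like this (a simplified sketch; flat_ds is the (6, n_pixels) array built above, and the only point here is that NaN pixels are dropped before fitting):
from sklearn.decomposition import PCA
# keep only the pixels that are valid in every band
valid = ~np.isnan(flat_ds).any(axis=0)
flat_valid = flat_ds[:, valid]                  # shape (6, n_valid_pixels)
pca = PCA(n_components=6)
flat_pcas = pca.fit_transform(flat_valid.T).T   # shape (6, n_valid_pixels)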
# deflate PCAs
dims = ds.dims['longitude'], ds.dims['latitude']
pcas = xr.Dataset()
for i in range(flat_pcas.shape[0]):
    pcas['PCA_%i' % (i + 1)] = xr.DataArray(np.reshape(flat_pcas[i], dims),
                                            coords=[ds.longitude.values, ds.latitude.values],
                                            dims=['longitude', 'latitude'])
To summarize the situation:
Is there a simpler approach to applying a PCA transformation to an xarray.Dataset?
How should I deal with the NaN values?
Try to use eofs, available here: https://github.com/ajdawson/eofs
In the documentation they say:
Transparent handling of missing values: missing values are removed automatically when computing EOFs and re-inserted into output fields.
I have used it a few times and found it very well designed.
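For reference, this is roughly how the eofs interface is used (a sketch only; da is assumed here to be an xarray.DataArray whose leading dimension is named time, so for your Landsat case you would first stack the six bands into that sample dimension):
from eofs.xarray import Eof
# grid points that are NaN at every step along the leading dimension are
# removed before the computation and re-inserted in the returned fields
solver = Eof(da)
eof_maps = solver.eofs(neofs=3)   # spatial patterns, with NaNs restored
pcs = solver.pcs(npcs=3)          # principal component series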
You can also use the EOFs available from pycurrents (https://currents.soest.hawaii.edu/ocn_data_analysis/installation.html)
I have an example at https://github.com/manmeet3591/Miscellaneous/blob/master/EOF/global_sst.ipynb
Related
I am trying to merge multiple .nc files containing physical oceanographic data for different depths at different latitudes and longitudes.
I am using ds = xr.open_mfdataset to do this, but the files are not merging correctly, and when I try to plot them there seems to be only one resulting value for the merged files.
This is the code I am using:
import xarray as xr
import matplotlib.pyplot as plt

## Combining using concat_dim and the nested method
ds = xr.open_mfdataset("33HQ20150809*.nc", concat_dim=['latitude'], combine="nested")
ds.to_netcdf('geotraces2015_combined.nc')
df = xr.open_dataset("geotraces2015_combined.nc")

## Setting up values. Oxygen values are transposed so they match the shape of lat and pressure.
oxygen = df['oxygen'].values.transpose()

## Plotting using contourf
fig = plt.figure()
ax = fig.add_subplot(111)
plt.contourf(oxygen, cmap='inferno')
plt.gca().invert_yaxis()
cbar = plt.colorbar(label='Oxygen Concentration (umol kg-1)')
You can download the nc files from here under CTD
https://cchdo.ucsd.edu/cruise/33HQ20150809
This is what each file looks like:
<xarray.Dataset>
Dimensions: (pressure: 744, time: 1, latitude: 1, longitude: 1)
Coordinates:
* pressure (pressure) float64 0.0 1.0 2.0 3.0 ... 741.0 742.0 743.0
* time (time) datetime64[ns] 2015-08-12T18:13:00
* latitude (latitude) float32 60.25
* longitude (longitude) float32 -179.1
Data variables: (12/19)
pressure_QC (pressure) int16 ...
temperature (pressure) float64 ...
temperature_QC (pressure) int16 ...
salinity (pressure) float64 ...
salinity_QC (pressure) int16 ...
oxygen (pressure) float64 ...
... ...
CTDNOBS (pressure) float64 ...
CTDETIME (pressure) float64 ...
woce_date (time) int32 ...
woce_time (time) int16 ...
station |S40 ...
cast |S40 ...
Attributes:
EXPOCODE: 33HQ20150809
Conventions: COARDS/WOCE
WOCE_VERSION: 3.0
...
Another file would look like this:
<xarray.Dataset>
Dimensions: (pressure: 179, time: 1, latitude: 1, longitude: 1)
Coordinates:
* pressure (pressure) float64 0.0 1.0 2.0 3.0 ... 176.0 177.0 178.0
* time (time) datetime64[ns] 2015-08-18T19:18:00
* latitude (latitude) float32 73.99
* longitude (longitude) float32 -168.8
Data variables: (12/19)
pressure_QC (pressure) int16 ...
temperature (pressure) float64 ...
temperature_QC (pressure) int16 ...
salinity (pressure) float64 ...
salinity_QC (pressure) int16 ...
oxygen (pressure) float64 ...
... ...
CTDNOBS (pressure) float64 ...
CTDETIME (pressure) float64 ...
woce_date (time) int32 ...
woce_time (time) int16 ...
station |S40 ...
cast |S40 ...
Attributes:
EXPOCODE: 33HQ20150809
Conventions: COARDS/WOCE
WOCE_VERSION: 3.0
EDIT: This is my new approach, which is still not working:
I'm trying to use preprocess to set_coords, squeeze, and expand_dims, following Michael's approach:
def preprocess(ds):
    return ds.set_coords('station').squeeze(["latitude", "longitude", "time"]).expand_dims('station')

ds = xr.open_mfdataset('33HQ20150809*.nc', concat_dim='station', combine='nested', preprocess=preprocess)
But I'm still having the same problem...
Solution: First, I had to identify the coordinate with the unique value; in my case it was 'station'. Then I used preprocess to apply the squeeze, set_coords, and expand_dims functions to each file, following Michael's answer.
import pandas as pd
import numpy as np
import os
import netCDF4
import pathlib
import matplotlib.pyplot as plt
import xarray as xr

def preprocess(ds):
    return ds.set_coords('station').squeeze(["latitude", "longitude", "time"]).expand_dims('station')

ds = xr.open_mfdataset('filename*.nc', preprocess=preprocess, parallel=True)
ds = ds.sortby('latitude').transpose()
ds.oxygen.plot.contourf(x="latitude", y="pressure")
plt.gca().invert_yaxis()
The xarray data model requires that all data dimensions be orthogonal and complete. In other words, every combination of every coordinate along each dimension must be present in the data array (either as data or as NaNs).
You can work with observational data such as yours using xarray, but you have to be careful with the indices to ensure you don't explode the data dimensionality. Specifically, whenever a quantity is not truly a dimension of the data but is simply an observation or attribute tied to a station or monitor, you should think of it as a data variable rather than a coordinate. In your case, your dimensions seem to be station ID and pressure level (which does not have a full set of observations for each station, but is a dimension of the data). On the other hand, time, latitude, and longitude are attributes of each station, and should not be treated as dimensions.
I'll generate some random data that looks like yours:
def generate_random_station():
    station_id = "{:09d}".format(np.random.randint(0, int(1e9)))
    time = np.random.choice(pd.date_range("2015-08-01", "2015-08-31", freq="H"))
    plevs = np.arange(np.random.randint(1, 1000)).astype(float)
    lat = np.random.random() * 10 + 30
    lon = np.random.random() * 10 - 80

    ds = xr.Dataset(
        {
            "salinity": (('pressure', ), np.sin(plevs / 200 + lat)),
            "woce_date": (("time", ), [time]),
            "station": station_id,
        },
        coords={
            "pressure": plevs,
            "time": [time],
            "latitude": [lat],
            "longitude": [lon],
        },
    )
    return ds
This ends up looking like the following:
In [11]: single = generate_random_station()
In [12]: single
Out[12]:
<xarray.Dataset>
Dimensions: (pressure: 37, time: 1, latitude: 1, longitude: 1)
Coordinates:
* pressure (pressure) float64 0.0 1.0 2.0 3.0 4.0 ... 33.0 34.0 35.0 36.0
* time (time) datetime64[ns] 2015-08-21T01:00:00
* latitude (latitude) float64 39.61
* longitude (longitude) float64 -72.19
Data variables:
salinity (pressure) float64 0.9427 0.941 0.9393 ... 0.8726 0.8702 0.8677
woce_date (time) datetime64[ns] 2015-08-21T01:00:00
station <U9 '233136181'
The problem is the latitude, longitude, and time coords aren't really dimensions which can be used to index a larger array. They aren't evenly spaced, and each combination of lat/lon/time does not have a station at it. Because of this, we need to be extra careful to make sure that when we combine the data, the lat/lon/time dimensions are not expanded.
To do this, we'll squeeze these dimensions, and expand the datasets along a new dimension, station:
In [13]: single.set_coords('station').squeeze(["latitude", "longitude", "time"]).expand_dims('station')
Out[13]:
<xarray.Dataset>
Dimensions: (pressure: 37, station: 1)
Coordinates:
* station (station) <U9 '233136181'
* pressure (pressure) float64 0.0 1.0 2.0 3.0 4.0 ... 33.0 34.0 35.0 36.0
time datetime64[ns] 2015-08-21T01:00:00
latitude float64 39.61
longitude float64 -72.19
Data variables:
salinity (station, pressure) float64 0.9427 0.941 0.9393 ... 0.8702 0.8677
woce_date (station) datetime64[ns] 2015-08-21T01:00:00
This can be done to all of your datasets, then they can be concatenated along the "station" dimension:
In [14]: all_stations = xr.concat(
...: [
...: generate_random_station()
...: .set_coords('station')
...: .squeeze(["latitude", "longitude", "time"])
...: .expand_dims('station')
...: for i in range(10)
...: ],
...: dim="station",
...: )
This results in a dataset indexed by pressure level and station:
In [15]: all_stations
Out[15]:
<xarray.Dataset>
Dimensions: (pressure: 657, station: 10)
Coordinates:
* pressure (pressure) float64 0.0 1.0 2.0 3.0 ... 653.0 654.0 655.0 656.0
* station (station) <U9 '197171488' '089978445' ... '107555081' '597650083'
time (station) datetime64[ns] 2015-08-19T06:00:00 ... 2015-08-24T15...
latitude (station) float64 37.96 34.3 38.74 39.28 ... 37.72 33.89 36.46
longitude (station) float64 -74.28 -73.59 -78.33 ... -76.6 -76.47 -77.96
Data variables:
salinity (station, pressure) float64 0.2593 0.2642 0.269 ... 0.8916 0.8893
woce_date (station) datetime64[ns] 2015-08-19T06:00:00 ... 2015-08-24T15...
You can now plot along the latitude and pressure level dimensions:
In [16]: all_stations.salinity.plot.contourf(x="latitude", y="pressure")
I've written a leave-one-out spatial interpolation routine using xarray, where the interpolation is done n-1 times across the input data. Each iteration of the interpolation is stored in an xarray Dataset under an "IterN" dimension. Here's an example:
grids
Out[2]:
<xarray.Dataset>
Dimensions: (IterN: 84, northing: 53, easting: 54)
Coordinates:
* easting (easting) float64 5.648e+05 5.653e+05 ... 5.907e+05 5.912e+05
* northing (northing) float64 4.333e+06 4.334e+06 ... 4.359e+06 4.359e+06
* IterN (IterN) int64 0 1 2 3 4 5 6 7 8 9 ... 75 76 77 78 79 80 81 82 83
Data variables:
Data (IterN, northing, easting) float64 2.065 2.018 ... 0.3037 0.2913
I can compute some statistics over the dataset using the in-built functions, for example median here:
median = grids.median(dim="IterN")
median
Out[5]:
<xarray.Dataset>
Dimensions: (northing: 53, easting: 54)
Coordinates:
* easting (easting) float64 5.648e+05 5.653e+05 ... 5.907e+05 5.912e+05
* northing (northing) float64 4.333e+06 4.334e+06 ... 4.359e+06 4.359e+06
Data variables:
Data (northing, easting) float64 2.111 2.061 2.011 ... 0.3104 0.2992
I would like to compute the root mean square error of the iteration sets against something like the median above. I can't see an inbuilt function for this, and I'm struggling to see how to implement this answer.
The desired output is an RMSE grid in the same XY dimensions as the other grids.
Any help would be greatly appreciated!
Does the following go in the direction you mean?
median = grids.median(dim="IterN")
error = grids - median
rmse = np.sqrt((error**2).mean("IterN"))
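If you find yourself doing this repeatedly, you could wrap it in a small helper (the function name and signature here are just a suggestion):
import numpy as np
import xarray as xr

def rmse_against(grids: xr.Dataset, reference: xr.Dataset, dim: str = "IterN") -> xr.Dataset:
    # cell-wise RMSE across the iteration dimension, relative to `reference`
    return np.sqrt(((grids - reference) ** 2).mean(dim))

rmse = rmse_against(grids, grids.median(dim="IterN"))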
I'm currently working on a project that involves clipping Xarray datasets based on a certain grid of latitude and longitude values. To do this, I've imported a file with the specified lats and lons, and added dimensions onto the dataset to give it the necessary lat/lon information. However, I also want to pass the dimensions to a data variable (called 'precip') within the dataset. I know that I'm able to pass the dimensions to the array of values in the variable, but can they be passed to the variable itself?
My code is below:
precip = obs_dataset_full.precip.expand_dims(lon = 350, lat = 450)
precip.coords['Longitude'] = (-1 * (360 - obs_dataset_full.Longitude))
precip.coords['Latitude'] = obs_dataset_full.Latitude
precip
With output as such:
<xarray.Dataset>
Dimensions:    (dim_0: 350, dim_1: 451, lat: 450, lon: 350)
Coordinates:
  * dim_0      (dim_0) int64 0 1 2 3 4 5 ... 345 346 347 348 349
  * dim_1      (dim_1) int64 0 1 2 3 4 5 ... 446 447 448 449 450
    Longitude  (lon) float64 -105.7 -105.7 ... -78.34 -78.26
    Latitude   (lat) float64 35.04 35.08 35.11 ... 51.52 51.56
Data variables:
    precip     (dim_0, dim_1) float64 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 nan
Attributes: (0)
Specifically, I want the data variable precip to also possess dimensions of lat and lon, as the dataset does.
Thanks in advance!
I have an xarray dataset with a longitude coordinate running from 0.5 to 359.5, like the following:
Dimensions: (bnds: 2, lat: 40, lev: 35, lon: 31, member_id: 1)
Coordinates:
lev_bnds (lev, bnds) float64 ...
lon_bnds (lon, bnds) float64 ...
lat_bnds (lat, bnds) float64 ...
* lon (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 359.5
* lev (lev) float64 2.5 10.0 20.0 32.5 ... 5e+03 5.5e+03 6e+03 6.5e+03
* lat (lat) float64 -89.5 -88.5 -87.5 -86.5 ... -53.5 -52.5 -51.5 -50.5
* member_id (member_id) object 'r1i1p1f1'
Dimensions without coordinates: bnds
Data variables:
so (member_id, lev, lat, lon) float32 nan nan nan ...
The area I'm interested in is between 60W to 30E, which probably corresponds to longitude 300.5 to 30.5. Is there any way to slice the dataset between these coordinates?
I tried to use isel(lon=slice(-60, 30)), but it's not possible to go from negative to positive numbers in the slice function.
I know I can just split the data into two small ones (300.5-359.5 and 0.5-30.5), but I was wondering if there is a better way.
Thank you!
As you correctly point out, currently isel can’t select from both the start and end of a dimension in a single pass.
If you combine it with roll (http://xarray.pydata.org/en/stable/generated/xarray.DataArray.roll.html), you can move the points you want into a contiguous region and then select the ones you need.
NB: I couldn't be sure from your example, but it looks like you may want sel rather than isel, given that you may be selecting by coordinate value rather than by position.
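Something along these lines should work (an untested sketch using positional indexing; it assumes a full 1-degree grid of 360 longitudes from 0.5 to 359.5, so adjust the shift and slice length to your actual grid):
# roll so the 300.5-359.5 block sits just before 0.5-30.5,
# then take the now-contiguous band of longitudes
ds_rolled = ds.roll(lon=60, roll_coords=True)
subset = ds_rolled.isel(lon=slice(0, 91))   # 300.5 ... 359.5, 0.5 ... 30.5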
I have two xr.Dataset objects. One is a continuous map of some variable (here precipitation). The other is a categorical map of a set of regions
['region_1', 'region_2', 'region_3', 'region_4'].
I want to calculate the mean precip in each region at each timestep, by masking by region/time, and then output a dataframe that looks like the one below.
In [6]: df.head()
Out[6]:
datetime region_name mean_value
0 2008-01-31 region_1 51.77333333333333
1 2008-02-29 region_1 44.87555555555556
2 2008-03-31 region_1 50.88444444444445
3 2008-04-30 region_1 48.50666666666667
4 2008-05-31 region_1 47.653333333333336
I have some code, but it runs very slowly for the real datasets. Can anyone help me optimize it?
A minimal reproducible example
Initialising our objects: two variables of the same shape. The region object will in practice have been read from a shapefile and will have more than two regions.
import xarray as xr
import pandas as pd
import numpy as np
def make_dataset(
    variable_name='precip',
    size=(30, 30),
    start_date='2008-01-01',
    end_date='2010-01-01',
    lonmin=-180.0,
    lonmax=180.0,
    latmin=-55.152,
    latmax=75.024,
):
    # create 2D lat/lon dimension
    lat_len, lon_len = size
    longitudes = np.linspace(lonmin, lonmax, lon_len)
    latitudes = np.linspace(latmin, latmax, lat_len)
    dims = ["lat", "lon"]
    coords = {"lat": latitudes, "lon": longitudes}

    # add time dimension
    times = pd.date_range(start_date, end_date, name="time", freq="M")
    size = (len(times), size[0], size[1])
    dims.insert(0, "time")
    coords["time"] = times

    # create values
    var = np.random.randint(100, size=size)

    return xr.Dataset({variable_name: (dims, var)}, coords=coords), size
ds, size = make_dataset()
# create dummy regions (not contiguous but doesn't matter for this example)
region_ds = xr.ones_like(ds).rename({'precip': 'region'})
array = np.random.choice([0, 1, 2, 3], size=size)
region_ds = region_ds * array
# create a dictionary explaining what the regions are
region_lookup = {
0: 'region_1',
1: 'region_2',
2: 'region_3',
3: 'region_4',
}
What do these objects look like?
In[]: ds
Out[]:
<xarray.Dataset>
Dimensions: (lat: 30, lon: 30, time: 24)
Coordinates:
* lat (lat) float64 -55.15 -50.66 -46.17 -41.69 ... 66.05 70.54 75.02
* lon (lon) float64 -180.0 -167.6 -155.2 -142.8 ... 155.2 167.6 180.0
* time (time) datetime64[ns] 2008-01-31 2008-02-29 ... 2009-12-31
Data variables:
precip (time, lat, lon) int64 51 92 14 71 60 20 82 ... 16 33 34 98 23 53
In[]: region_ds
Out[]:
<xarray.Dataset>
Dimensions: (lat: 30, lon: 30, time: 24)
Coordinates:
* lat (lat) float64 -55.15 -50.66 -46.17 -41.69 ... 66.05 70.54 75.02
* time (time) datetime64[ns] 2008-01-31 2008-02-29 ... 2009-12-31
* lon (lon) float64 -180.0 -167.6 -155.2 -142.8 ... 155.2 167.6 180.0
Data variables:
region (time, lat, lon) float64 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0
Current Implementation
In order to calculate the mean of the variable in ds for each of the regions ['region_1', 'region_2', ...] in region_ds at each time, I currently loop over each REGION and then over each TIMESTEP in the da object.
This operation gets very slow as the dataset grows (more pixels and more timesteps). Is there a more efficient / vectorized use of numpy / xarray that will get me the desired result faster?
def drop_nans_and_flatten(dataArray: xr.DataArray) -> np.ndarray:
    """flatten the array and drop nans from that array. Useful for plotting histograms.

    Arguments:
    ---------
    : dataArray (xr.DataArray)
        the DataArray of your value you want to flatten
    """
    # drop NaNs and flatten
    return dataArray.values[~np.isnan(dataArray.values)]


da = ds.precip
region_da = region_ds.region
valid_region_ids = [k for k in region_lookup.keys()]

# initialise empty lists
region_names = []
datetimes = []
mean_values = []

for valid_region_id in valid_region_ids:
    for time in da.time.values:
        region_names.append(region_lookup[valid_region_id])
        datetimes.append(time)
        # extract all non-nan values for that time-region
        mean_values.append(
            da.sel(time=time).where(region_da == valid_region_id).mean().values
        )

df = pd.DataFrame(
    {
        "datetime": datetimes,
        "region_name": region_names,
        "mean_value": mean_values,
    }
)
The output:
In [6]: df.head()
Out[6]:
datetime region_name mean_value
0 2008-01-31 region_1 51.77333333333333
1 2008-02-29 region_1 44.87555555555556
2 2008-03-31 region_1 50.88444444444445
3 2008-04-30 region_1 48.50666666666667
4 2008-05-31 region_1 47.653333333333336
In [7]: df.tail()
Out[7]:
datetime region_name mean_value
43 2009-08-31 region_4 50.83111111111111
44 2009-09-30 region_4 48.40888888888889
45 2009-10-31 region_4 51.56148148148148
46 2009-11-30 region_4 48.961481481481485
47 2009-12-31 region_4 48.36296296296296
In [20]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 3 columns):
datetime 96 non-null datetime64[ns]
region_name 96 non-null object
mean_value 96 non-null object
dtypes: datetime64[ns](1), object(2)
memory usage: 2.4+ KB
In [21]: df.describe()
Out[21]:
datetime region_name mean_value
count 96 96 96
unique 24 4 96
top 2008-10-31 00:00:00 region_1 48.88984800150122
freq 4 24 1
first 2008-01-31 00:00:00 NaN NaN
last 2009-12-31 00:00:00 NaN NaN
Any help would be very much appreciated! Thank you.
It's hard to avoid iterating to generate the masks for the regions given how they are defined, but once you have those constructed (e.g. with the code below), I think the following would be pretty efficient:
regions = xr.concat(
    [(region_ds.region == region_id).expand_dims(region=[region])
     for region_id, region in region_lookup.items()],
    dim='region'
)
result = ds.precip.where(regions).mean(['lat', 'lon'])
This generates a DataArray with 'time' and 'region' dimensions, where the value at each point is the mean at a given time over a given region. It would be straightforward to extend this to an area-weighted average if that were desired too.
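If you then want the tidy dataframe from your question, one way to get it (a sketch; the column renames simply match your desired output) is:
df = (
    result.rename("mean_value")
          .to_dataframe()
          .reset_index()
          .rename(columns={"time": "datetime", "region": "region_name"})
)
For the area-weighted version, recent xarray releases also support something like ds.precip.where(regions).weighted(np.cos(np.deg2rad(ds.lat))).mean(['lat', 'lon']).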
An alternative option that generates the same result would be:
regions = xr.DataArray(
    list(region_lookup.keys()),
    coords=[list(region_lookup.values())],
    dims=['region']
)
result = ds.precip.where(regions == region_ds.region).mean(['lat', 'lon'])
Here regions is basically just a DataArray representation of the region_lookup dictionary.