Converting 3D xarray dataset to dataframe - python

I have imported an xarray dataset like this and extracted the values at coordinates defined by zones (from a CSV file) and over a time period defined by a date range: 30 days of a (lon, lat) grid with environmental values at every coordinate.
from xgrads import open_CtlDataset
ds_Snow = open_CtlDataset(path + 'file')
ds_Snow = ds_Snow.sel(lat=list(set(zones['lat'])), lon=list(set(zones['lon'])),
                      time=period, method='nearest')
When I look at the information for ds_Snow, this is what I get:
Dimensions: (lat: 12, lon: 12, time: 30)
Coordinates:
* time (time) datetime64[ns] 2000-09-01 2000-09-02 ... 2000-09-30
* lat (lat) float32 3.414e+06 3.414e+06 3.414e+06 ... 3.414e+06 3.414e+06
* lon (lon) float32 6.873e+05 6.873e+05 6.873e+05 ... 6.873e+05 6.873e+05
Data variables:
spre (time, lat, lon) float32 dask.array<chunksize=(1, 12, 12), meta=np.ndarray>
Attributes:
title: SnowModel
undef: -9999.0

type : <class 'xarray.core.dataset.Dataset'>
I would like to make it a dataframe, respecting the initial dimensions (time, lat, lon).
So I did this:
df_Snow = ds_Snow.to_dataframe()
But here are the dimensions of the dataframe :
print(df_Snow)
                                  spre
lat       lon        time
3414108.0 687311.625 2000-09-01    0.0
                     2000-09-02    0.0
                     2000-09-03    0.0
                     2000-09-04    0.0
                     2000-09-05    0.0
...                                 ...
                     2000-09-26    0.0
                     2000-09-27    0.0
                     2000-09-28    0.0
                     2000-09-29    0.0
                     2000-09-30    0.0

[4320 rows x 1 columns]
It looks like all the data just got put in a single column.
I have tried specifying the dimension order, as the documentation explains:
df_Snow = ds_Snow.to_dataframe(dim_order = ['time', 'lat', 'lon'])
But it does not change anything, and I can't seem to find an answer in forums or the documentation. I would like to know a way to keep the array configuration in the dataframe.
EDIT: I found a solution.
Instead of converting the xarray dataset, I chose to build my dataframe from a pd.Series for each coordinate and variable, like this:
ds_Snow = ds_Snow.sel(lat=list(set(station_list['lat_utm'])),
                      lon=list(set(station_list['lon_utm'])),
                      time=Ind_Run_ERA5_Land, method='nearest')
time = pd.Series(ds_Snow.coords["time"].values)
lon = pd.Series(ds_Snow.coords["lon"].values)
lat = pd.Series(ds_Snow.coords["lat"].values)
spre = pd.Series(ds_Snow['spre'].values[:,0,0])
frame = { 'spre': spre, 'time': time, 'lon' : lon, 'lat' : lat}
df_Snow = pd.DataFrame(frame)

This is the expected behaviour. From the docs:
The DataFrame is indexed by the Cartesian product of index coordinates (in the form of a pandas.MultiIndex). Other coordinates are included as columns in the DataFrame.
There is only one variable, spre, in the dataset. The other properties, the 'coordinates', have become the index. Since there were several coordinates (lat, lon, and time), the DataFrame has a hierarchical MultiIndex.
You can either get the index data through tools like get_level_values or, if you want to change how the DataFrame is indexed, you can use reset_index().
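For example, a minimal sketch (using the ds_Snow dataset from the question) that flattens the MultiIndex back into ordinary columns, so each row carries its own lat, lon and time values:

# to_dataframe() indexes the result by a (lat, lon, time) MultiIndex
df_Snow = ds_Snow.to_dataframe()

# inspect a single index level without changing the DataFrame
times = df_Snow.index.get_level_values('time')

# or turn the index levels back into regular columns: lat, lon, time, spre
df_flat = df_Snow.reset_index()
print(df_flat.head())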


adding extra dimensions in a numpy array or a dataframe

I have a pandas Series named obs, of shape (62824,), that holds temperature values as follows:
0 16.9
1 11.0
2 5.9
3 9.4
4 15.4
...
I want to use the following code to transform my numpy array into an xr.DataArray:
lat = 35.93679
lon = 14.45663
obs_data = xr.DataArray(obs_tas, dims=['time', 'lat', 'lon'],
                        coords=[pd.date_range('1979-01-01', '2021-12-31', freq='D'), lat, lon])
My issue is that I get the following error:
ValueError: dimensions ('lat',) must have the same length as the number of data dimensions, ndim=0
From my understanding, this is because the numpy array has only one dimension. I tried the following:
obs = obs[..., np.newaxis, np.newaxis]
However, that did not work either, and I still get the same error.
How can I fix that?
You are correct about adding dimensions to obs.
In Creating a DataArray and API reference it is mentioned that the coordinates themselves should be array-like.
Your lat and lon are floats. I believe all you have to do is wrap them in a list, like so:
lat = [35.93679] # <- list
lon = [14.45663] # <- list
obs_data = xr.DataArray(
    obs[:, None, None],  # if obs is a pandas Series, use obs.values[:, None, None]
    dims=['time', 'lat', 'lon'],
    coords=[
        pd.date_range('1979-01-01', '2021-12-31', freq='D'), lat, lon
    ]
)

How to remove xarray dimension after adding another without deleting the data variables

I have data from ECMWF which, when read into xarray, looks like this:
Dimensions: (time: 424, step: 12, latitude: 3, longitude: 2)
Coordinates:
number int64 0
* time (time) datetime64[ns] 1990-03-01T06:00:00 ... 1993-04-22T18:0...
* step (step) timedelta64[ns] 01:00:00 02:00:00 ... 11:00:00 12:00:00
surface float64 0.0
* latitude (latitude) float64 41.0 40.75 40.5
* longitude (longitude) float64 -96.92 -96.67
valid_time (time, step) datetime64[ns] 1990-03-01T07:00:00 ... 1993-04-2...
Data variables:
i10fg (time, step, latitude, longitude) float32 4.876 4.637 ... 3.959
Attributes:
GRIB_edition: 1
GRIB_centre: ecmf
GRIB_centreDescription: European Centre for Medium-Range Weather Forecasts
GRIB_subCentre: 0
Conventions: CF-1.7
institution: European Centre for Medium-Range Weather Forecasts
history: 2022-04-23T22:18 GRIB to CDM+CF via cfgrib-0.9.9...
I've stacked the time and step dimensions to make a new one called forecast. I then added another dimension called valid and set that equal to the coordinate valid_time. Both valid and fcst are the same length, but I now want to drop the fcst dimension. However, when I do this, the data variable also gets deleted. Does anyone know how to fix this? Here is my sample code. There might be a better way to do what I'm doing, but I'm still pretty new to xarray!
ds = ds.stack(fcst=("time", "step")).transpose("fcst", "latitude", "longitude")
ds.expand_dims(valid=ds['valid_time']).drop_dims('fcst')
which leaves me with
<xarray.Dataset>
Dimensions: (valid: 5088, latitude: 3, longitude: 2)
Coordinates:
* valid (valid) datetime64[ns] 1990-03-01T07:00:00 ... 1993-04-23T06:0...
number int64 0
surface float64 0.0
* latitude (latitude) float64 41.0 40.75 40.5
* longitude (longitude) float64 -96.92 -96.67
Data variables:
*empty*
I've tried setting valid as the index then dropping fcst but it still deletes the data variables.
any help would be appreciated!
All variables in an xarray Dataset must be indexed by named dimensions. You can use ds.reset_index to drop any labeled coordinates associated with a dimension, but this isn't what you want. You can't simply do away with a dimension without losing the variables which are indexed by that dimension.
From the xarray docs on data structures:
dimension names are always present in the xarray data model: if you do not provide them, defaults of the form dim_N will be created.
Instead, you can swap a non-indexing coordinate for an indexing coordinate using swap_dims. This will switch valid_time and fcst in your array, such that valid_time becomes the dimension indexing any variables previously indexed by fcst (i10fg in your case).
So the answer is, after the stack but without expand_dims or drop:
ds.swap_dims({'fcst': 'valid_time'}).drop('fcst')
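For context, a minimal sketch of the full sequence (assuming ds is the dataset shown in the question); the leftover fcst coordinate is what the drop call above removes:

# stack time and step into a single forecast dimension
stacked = ds.stack(fcst=("time", "step")).transpose("fcst", "latitude", "longitude")

# make valid_time the indexing dimension instead of fcst
swapped = stacked.swap_dims({"fcst": "valid_time"})

# i10fg is now indexed by (valid_time, latitude, longitude);
# fcst remains only as a non-dimension coordinate, which .drop('fcst') removes
print(swapped["i10fg"].dims)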

How to set the coordinates of the output of xarray.assign?

I've been trying to create two new variables based on the latitude coordinate of a data point in an xarray dataset. However, I can only seem to assign new coordinates. The data set looks like this:
<xarray.Dataset>
Dimensions: (lon: 360, lat: 180, time: 412)
Coordinates:
* lon (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5
* lat (lat) float64 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5
* time (time) datetime64[ns] 1981-09-01 1981-10-01 ... 2015-12-01
Data variables:
evapr (time, lat, lon) float32 ...
lhtfl (time, lat, lon) float32 ...
...
What I've attempted so far is this:
def get_latitude_band(latitude):
    return np.select(
        condlist=[abs(latitude) < 23.45,
                  abs(latitude) < 35,
                  abs(latitude) < 66.55],
        choicelist=["tropical",
                    "sub_tropical",
                    "temperate"],
        default="frigid"
    )

def get_hemisphere(latitude):
    return np.select(
        [latitude > 0, latitude <= 0],
        ["north", "south"]
    )
mhw_data = mhw_data \
    .assign(climate_zone=get_latitude_band(mhw_data.lat)) \
    .assign(hemisphere=get_hemisphere(mhw_data.lat)) \
    .reset_index(["hemisphere", "climate_zone"]) \
    .reset_coords()
print(mhw_data)
Which is getting me close:
<xarray.Dataset>
Dimensions: (lon: 360, lat: 180, time: 412, hemisphere: 180, climate_zone: 180)
Coordinates:
* lon (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 356.5 357.5 358.5 359.5
* lat (lat) float64 -89.5 -88.5 -87.5 -86.5 ... 86.5 87.5 88.5 89.5
* time (time) datetime64[ns] 1981-09-01 1981-10-01 ... 2015-12-01
Dimensions without coordinates: hemisphere, climate_zone
Data variables:
evapr (time, lat, lon) float32 ...
lhtfl (time, lat, lon) float32 ...
...
hemisphere_ (hemisphere) object 'south' 'south' ... 'north' 'north'
climate_zone_ (climate_zone) object 'frigid' 'frigid' ... 'frigid' 'frigid'
...
However, I want to then stack the DataSet and convert it to a DataFrame. I am unable to do so, and I think it is because the new variables hemisphere_ and climate_zone_ do not have time, lat, lon coordinates:
stacked = mhw_data[var].stack(dim=["lon", "lat", "time"]).to_pandas().T
results in a KeyError on "lon".
So my question is: How do I assign new variables to the xarray dataset that maintain the original coordinates of time, lat and lon?
To assign a new variable or coordinate, xarray needs to know what the dimensions are called. There are a number of ways to define a DataArray or Coordinate, but the one closest to what you're currently using is to provide a tuple of (dim_names, array):
mhw_data = mhw_data.assign_coords(
    climate_zone=(('lat', ), get_latitude_band(mhw_data.lat)),
    hemisphere=(('lat', ), get_hemisphere(mhw_data.lat)),
)
Here I'm using Dataset.assign_coords, which defines climate_zone and hemisphere as non-dimension coordinates. You can think of these as additional metadata about latitude and about your data, rather than as proper data in themselves. This also allows them to be preserved when sending individual arrays to pandas.
For stacking, converting to pandas will stack automatically. The following will return a DataFrame with variables/non-dimension coordinates as columns and dimensions as a MultiIndex:
stacked = mhw_data.to_dataframe()
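A quick check of what that returns (a small sketch; the exact columns depend on the variables in your dataset):

df = mhw_data.to_dataframe()
print(df.index.names)  # the dimensions (lon, lat, time) as a MultiIndex
print(df.columns)      # data variables plus 'climate_zone' and 'hemisphere'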
Alternatively, if you want a Series indexed by (lat, lon, time) for just one of these coordinates you can always use expand_dims:
(
    mhw_data.climate_zone
    .expand_dims(lon=mhw_data.lon, time=mhw_data.time)
    .to_series()
)
The two possible solutions I've worked out for myself are as follows:
First, stack the xarray data into pandas DataFrames, and then create new columns:
import pandas as pd
from tqdm import tqdm

df = None
variables = list(mhw_data.data_vars)
for var in tqdm(variables):
    stacked = mhw_data[var].stack(dim=["lon", "lat", "time"]).to_pandas().T
    if df is None:
        df = stacked
    else:
        df = pd.concat([df, stacked], axis=1)
df.reset_index(inplace=True)
df.columns = list(mhw_data.variables)
df["climate_zone"] = df["lat"].swifter.apply(get_latitude_band)  # .swifter comes from the swifter package
df["hemisphere"] = df["lat"].swifter.apply(get_hemisphere)
Second, create new xarray.DataArrays for each variable you'd like to add, then add them to the dataset:
# calculate climate zone and hemisphere from latitude.
latitudes = mhw_data.lat.values.reshape(-1, 1)
zones = get_latitude_band(latitudes)
hemispheres = get_hemisphere(latitudes)
# Broadcast the (lat, 1) label arrays to the full dataset shape so they
# line up with the xarray dimensions.
shape = tuple(mhw_data.sizes.values())
zones = np.broadcast_to(zones, shape)
hemispheres = np.broadcast_to(hemispheres, shape)
# finally, create two new DataArrays and assign them as variables in the dataset.
zone_xarray = xr.DataArray(data=zones, coords=mhw_data.coords, dims=mhw_data.dims)
hemi_xarray = xr.DataArray(data=hemispheres, coords=mhw_data.coords, dims=mhw_data.dims)
mhw_data["zone"] = zone_xarray
mhw_data["hemisphere"] = hemi_xarray
# ... call the code to stack and convert to pandas (shown in method 1) ...
My intuition is that method 1 is faster and more memory efficient because there are no repeated values that need broadcasting into a large 3-dimensional array. I did not test this, however.
Also, my intuition is that there is a less cumbersome xarray native way of accomplishing the same goal, but I could not find it.
One thing is certain: method 1 is far more concise, because there is no need to create intermediate arrays or reshape data.

Using XArray.isel to access data in GRIB2 file from a specific location?

I'm trying to access the data in a GRIB2 file at a specific longitude and latitude. I have been following along with this tutorial (https://www.youtube.com/watch?v=yLoudFv3hAY) at approximately 2:52, but my GRIB file is formatted differently from the example and uses different variables.
import xarray as xr
import pygrib
ds=xr.open_dataset('testdata.grb2', engine='cfgrib', filter_by_keys={'typeOfLevel': 'heightAboveGround', 'topLevel':2})
ds
This prints:
<xarray.Dataset>
Dimensions: (latitude: 361, longitude: 720)
Coordinates:
time datetime64[ns] ...
step timedelta64[ns] ...
heightAboveGround float64 ...
* latitude (latitude) float64 90.0 89.5 89.0 ... -89.0 -89.5 -90.0
* longitude (longitude) float64 0.0 0.5 1.0 1.5 ... 358.5 359.0 359.5
valid_time datetime64[ns] ...
Data variables:
t2m (latitude, longitude) float32 ...
sh2 (latitude, longitude) float32 ...
r2 (latitude, longitude) float32 ...
I then try to use isel to index along the latitude and longitude (t2m?) dimensions using:
t0_ds = ds.isel(t2m={200,200})
which gives this error:
ValueError: Dimensions {'t2m'} do not exist. Expected one or more of Frozen({'latitude': 361, 'longitude': 720})
Obviously there is an error in the way I'm using isel, but I have tried many variations and I can't find much information about this particular error.
You can access the closest datapoint to a specific latitude/longitude using:
lat = #yourlatitude
lon = #yourlongitude
ds_loc = ds.sel(latitude = lat, longitude = lon, method = 'nearest')
isel is used to access points by index, i.e.:
ds_loc = ds.isel(latitude = 200)
will return a subset along the 200th latitude value.
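For example, a small sketch (with made-up target coordinates; note that the longitudes in this dataset run from 0 to 360):

# hypothetical point of interest
target_lat = 52.5
target_lon = 13.5   # use 0-360 longitudes to match this dataset

# nearest grid cell; latitude and longitude are reduced to scalars
point = ds.sel(latitude=target_lat, longitude=target_lon, method='nearest')

print(float(point['t2m']))   # 2 m temperature at that grid cell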

How can I find the indices equivalent to a specific selection of xarray?

I have an xarray dataset.
<xarray.Dataset>
Dimensions: (lat: 92, lon: 172, time: 183)
Coordinates:
* lat (lat) float32 4.125001 4.375 4.625 ... 26.624994 26.874996
* lon (lon) float32 nan nan nan ... 24.374996 24.624998 24.875
* time (time) datetime64[ns] 2003-09-01 2003-09-02 ... 2004-03-01
Data variables:
swnet (time, lat, lon) float32 dask.array<shape=(183, 92, 172), chunksize=(1, 92, 172)>
Find the nearest lat-long
df.sel(time='2003-09-01', lon=6.374997, lat=16.375006, method='nearest')
Need to find
The indices of this particular location. Basically, the row-column in the grid. What would be the easiest way to go about it?
Tried
nearestlat=df.sel(time='2003-09-01', lon=6.374997, lat=16.375006, method='nearest')['lat'].values
nearestlon=df.sel(time='2003-09-01', lon=6.374997, lat=16.375006, method='nearest')['lon'].values
rowlat=np.where(df['lat'].values==nearestlat)[0][0]
collon=np.where(df['lon'].values==nearestlon)[0][0]
But I am not sure if this is the right way to go about it. How can I do this 'correctly'?
I agree that finding the index associated with a .sel operation is trickier than one would expect!
This code works:
import xarray as xr
ds = xr.tutorial.open_dataset('air_temperature')
ilon = list(ds.lon.values).index(ds.sel(lon=250.0, method='nearest').lon)
ilat = list(ds.lat.values).index(ds.sel(lat=45.0, method='nearest').lat)
print(' lon index=',ilon,'\n','lat index=', ilat)
producing:
lon index= 20
lat index= 12
And just in case one is wondering why one might want to do this, we use this for investigating time stacks of images, where we are interested in selecting the image immediately preceding the image on a specified date:
import xarray as xr
ds = xr.tutorial.open_dataset('air_temperature')
itime = list(ds.time.values).index(ds.sel(time='2013-06-01 00:00:00', method='nearest').time)
print(itime)
which produces
848
I think that something like this should work:
ds = xr.tutorial.open_dataset('air_temperature')
idx = ds.indexes["time"].get_loc('2013-06-01 00:00:00', method="nearest")
print(idx)
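As another option (not from the answers above, just a small numpy-based sketch), you can compute the nearest index directly from the coordinate values:

import numpy as np
import xarray as xr

ds = xr.tutorial.open_dataset('air_temperature')

# index of the grid row/column closest to a target latitude/longitude
ilat = int(np.abs(ds.lat.values - 45.0).argmin())
ilon = int(np.abs(ds.lon.values - 250.0).argmin())
print(ilat, ilon)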
