adding extra dimensions in a numpy array or a dataframe - python

I have a pandas series named obs(62824,) that has values of temperatures as follows
0 16.9
1 11.0
2 5.9
3 9.4
4 15.4
...
I want to use the following code to basically transform my numpy array to a xr.DataArray
lat = 35.93679
lon = 14.45663
obs_data = xr.DataArray(obs_tas, dims=['time','lat','lon'], \
coords=[pd.date_range('1979-01-01', '2021-12-31', freq='D'), lat, lon])
My issue is that I get the following error
ValueError: dimensions ('lat',) must have the same length as the number of data dimensions, ndim=0
From my understanding, this is because the numpy array has only 1 dimension. I tried the following
obs = obs[..., np.newaxis, np.newaxis]
However, that did not work either, and I still get the same error.
How can I fix that?

You are correct about adding dimensions to obs.
The xarray documentation (Creating a DataArray and the API reference) states that the coordinates themselves should be array-like.
Your lat and lon are floats. I believe all you have to do is wrap them in a list, like so:
lat = [35.93679]  # <- list
lon = [14.45663]  # <- list
obs_data = xr.DataArray(
    # np.asarray accepts either a pandas Series or a numpy array
    np.asarray(obs)[:, None, None],
    dims=['time', 'lat', 'lon'],
    coords=[
        pd.date_range('1979-01-01', '2021-12-31', freq='D'), lat, lon
    ]
)
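As a minimal, self-contained sketch of the same fix, the snippet below uses a five-value series. The short date range and the variable names are assumptions made for illustration. Each coordinate must have the same length as the data along its dimension, so the time coordinate is built from len(obs) here.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in for the real temperature series.
obs = pd.Series([16.9, 11.0, 5.9, 9.4, 15.4])

# One timestamp per observation; the length must match the data.
time = pd.date_range("1979-01-01", periods=len(obs), freq="D")

# Scalar coordinates wrapped in lists so that they are array-like.
lat = [35.93679]
lon = [14.45663]

obs_data = xr.DataArray(
    np.asarray(obs)[:, None, None],  # shape (5,) -> (5, 1, 1)
    dims=["time", "lat", "lon"],
    coords=[time, lat, lon],
)
print(obs_data.sizes)
```

If the real series has more entries than the date range (for example sub-daily data against a daily range), xarray will raise a conflicting-sizes error, so the frequency of the date range needs to match the data.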

Related

rioxarray plot geotiff and show lat lon coordinates [duplicate]

I downloaded a geotiff from here: https://www.nass.usda.gov/Research_and_Science/Crop_Progress_Gridded_Layers/index.php
(file also available: https://drive.google.com/file/d/1XcfEw-CZgVFE2NJytu4B1yBvjWydF-Tm/view?usp=sharing)
Looking at one of the weeks in 2021, I'd like to convert the geotiff to a data frame so I have an associated value with each lat/lon pair in the geotiff.
I tried:
import pandas as pd
import rioxarray
fl = 'data/cpc2021/corn/cpccorn2021/condition/cornCond21w24.tif'
da = rioxarray.open_rasterio(fl, masked=True)
df = da[0].to_pandas()
df['y'] = df.index
pd.melt(df, id_vars='y')
However, this returns a dataframe with x and y that don't seem to correspond to the lat/lon. How can I add (or retain) this information while converting?
Expect lat/lon points to be in contiguous US
edit:
I found a meta file that has the projections: NAD_1983_Contiguous_USA_Albers
which I believe corresponds to EPSG:5070 (also seen later in the same xml file)
I also found the bounding box for lat/lon coordinates:
<GeoBndBox esriExtentType="search">
<exTypeCode Sync="TRUE">1</exTypeCode>
<westBL Sync="TRUE">-127.360895</westBL>
<eastBL Sync="TRUE">-68.589171</eastBL>
<northBL Sync="TRUE">51.723828</northBL>
<southBL Sync="TRUE">23.297865</southBL>
However, still uncertain how to include this information in my quest to convert to dataframe.
Result of print(da) is:
<xarray.DataArray (band: 1, y: 320, x: 479)>
[153280 values with dtype=float32]
Coordinates:
* band (band) int64 1
* x (x) float64 -2.305e+06 -2.296e+06 ... 1.987e+06 1.996e+06
* y (y) float64 3.181e+06 3.172e+06 ... 3.192e+05 3.102e+05
spatial_ref int64 0
Attributes:
AREA_OR_POINT: Area
RepresentationType: ATHEMATIC
STATISTICS_COVARIANCES: 0.1263692188822515
STATISTICS_MAXIMUM: 4.8569073677063
STATISTICS_MEAN: 3.7031858480518
STATISTICS_MINIMUM: 2.1672348976135
STATISTICS_SKIPFACTORX: 1
STATISTICS_SKIPFACTORY: 1
STATISTICS_STDDEV: 0.35548448472789
scale_factor: 1.0
add_offset: 0.0
Credit to Jose from the GIS community:
import rioxarray
import pandas as pd
da = rioxarray.open_rasterio(fl, masked=True)
da = da.rio.reproject("EPSG:4326")
df = da[0].to_pandas()
df['y'] = df.index
df = pd.melt(df, id_vars='y')
https://gis.stackexchange.com/questions/443801/add-lat-and-lon-to-dataarray-read-in-by-rioxarray/443810#443810
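As an alternative to the to_pandas/melt step, the long-form table can probably also be built with xarray's own to_series. The sketch below uses a tiny synthetic array in place of the reprojected raster (the values and coordinates are made up), and it does not perform any reprojection itself. The column names lat, lon and value are assumptions chosen for illustration.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a raster already reprojected to EPSG:4326,
# where y holds latitudes and x holds longitudes.
da = xr.DataArray(
    np.array([[[1.0, 2.0, 3.0], [4.0, np.nan, 6.0]]], dtype="float32"),
    dims=["band", "y", "x"],
    coords={"band": [1], "y": [40.0, 39.0], "x": [-100.0, -99.0, -98.0]},
)

df = (
    da.sel(band=1, drop=True)  # pick the single band and drop its coordinate
    .to_series()               # Series with a (y, x) MultiIndex
    .rename("value")
    .dropna()                  # masked cells are NaN, so drop them
    .reset_index()             # turn y and x into ordinary columns
    .rename(columns={"y": "lat", "x": "lon"})
)
print(df)
```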

Converting 3D xarray dataset to dataframe

I have imported a xarray dataset like this and extracted the values at coordinates defined by zones from a csv file, and a time period defined by a date range (30 days of a (lon,lat) grid with some environmental values for every coordinates).
from xgrads import open_CtlDataset
ds_Snow = open_CtlDataset(path + 'file')
ds_Snow = ds_Snow.sel(lat = list(set(zones['lat'])), lon = list(set(zones['lon'])),
time = period, method = 'nearest')
When I look at the information for ds_Snow, this is what I get:
Dimensions: (lat: 12, lon: 12, time: 30)
Coordinates:
* time (time) datetime64[ns] 2000-09-01 2000-09-02 ... 2000-09-30
* lat (lat) float32 3.414e+06 3.414e+06 3.414e+06 ... 3.414e+06 3.414e+06
* lon (lon) float32 6.873e+05 6.873e+05 6.873e+05 ... 6.873e+05 6.873e+05
Data variables:
spre (time, lat, lon) float32 dask.array<chunksize=(1, 12, 12), meta=np.ndarray>
Attributes:
title: SnowModel
undef: -9999.0
type : <class 'xarray.core.dataset.Dataset'>
I would like to make it a dataframe, respecting the initial dimensions (time, lat, lon).
So I did this :
df_Snow = ds_Snow.to_dataframe()
But here are the dimensions of the dataframe :
print(df_Snow)
lat lon time
3414108.0 687311.625 2000-09-01 0.0
2000-09-02 0.0
2000-09-03 0.0
2000-09-04 0.0
2000-09-05 0.0
... ...
2000-09-26 0.0
2000-09-27 0.0
2000-09-28 0.0
2000-09-29 0.0
2000-09-30 0.0
[4320 rows x 1 columns]
It looks like all the data just got put in a single column.
I have tried passing the dimension order, as some of the documentation explains:
df_Snow = ds_Snow.to_dataframe(dim_order = ['time', 'lat', 'lon'])
But it does not change anything, and I can't seem to find an answer in forums or the documentation. I would like to know a way to keep the array configuration in the dataframe.
EDIT : I found a solution
Instead of converting the xarray dataset, I chose to build my dataframe from a pd.Series for each attribute, like this:
ds_Snow = ds_Snow.sel(lat = list(set(station_list['lat_utm'])),lon = list(set(station_list['lon_utm'])), time = Ind_Run_ERA5_Land, method = 'nearest')
time = pd.Series(ds_Snow.coords["time"].values)
lon = pd.Series(ds_Snow.coords["lon"].values)
lat = pd.Series(ds_Snow.coords["lat"].values)
spre = pd.Series(ds_Snow['spre'].values[:,0,0])
frame = { 'spre': spre, 'time': time, 'lon' : lon, 'lat' : lat}
df_Snow = pd.DataFrame(frame)
This is the expected behaviour. From the docs:
The DataFrame is indexed by the Cartesian product of index coordinates (in the form of a pandas.MultiIndex). Other coordinates are included as columns in the DataFrame.
There is only one variable, spre, in the dataset. The other properties, the coordinates, have become the index. Since there are several coordinates (lat, lon, and time), the DataFrame has a hierarchical MultiIndex.
You can either get the index data through tools like get_level_values or, if you want to change how the DataFrame is indexed, you can use reset_index().
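The options above can be sketched on a small synthetic dataset. The grid sizes and values below are assumptions for illustration, and the unstack step is one possible way to get a wide table rather than something the question asked for.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small synthetic dataset shaped like the one in the question.
time = pd.date_range("2000-09-01", periods=3, freq="D")
ds = xr.Dataset(
    {"spre": (("time", "lat", "lon"),
              np.arange(12, dtype="float32").reshape(3, 2, 2))},
    coords={"time": time, "lat": [10.0, 20.0], "lon": [1.0, 2.0]},
)

# One row per (time, lat, lon) combination, one column per data variable.
df = ds.to_dataframe()

# Read one index level without changing the frame.
lat_values = df.index.get_level_values("lat")

# Turn the index levels into ordinary columns.
flat = df.reset_index()

# Optional: one column per time step, rows indexed by the remaining levels.
wide = df["spre"].unstack("time")

print(flat.head())
```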

How to set the coordinates of the output of xarray.assign?

I've been trying to create two new variables based on the latitude coordinate of a data point in an xarray dataset. However, I can only seem to assign new coordinates. The data set looks like this:
<xarray.Dataset>
Dimensions: (lon: 360, lat: 180, time: 412)
Coordinates:
* lon (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5
* lat (lat) float64 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5
* time (time) datetime64[ns] 1981-09-01 1981-10-01 ... 2015-12-01
Data variables:
evapr (time, lat, lon) float32 ...
lhtfl (time, lat, lon) float32 ...
...
What I've attempted so far is this:
def get_latitude_band(latitude):
    return np.select(
        condlist=
        [abs(latitude) < 23.45,
         abs(latitude) < 35,
         abs(latitude) < 66.55],
        choicelist=
        ["tropical",
         "sub_tropical",
         "temperate"],
        default="frigid"
    )
def get_hemisphere(latitude):
    return np.select(
        [latitude > 0, latitude <=0],
        ["north", "south"]
    )
mhw_data = mhw_data \
.assign(climate_zone=get_latitude_band(mhw_data.lat)) \
.assign(hemisphere=get_hemisphere(mhw_data.lat)) \
.reset_index(["hemisphere", "climate_zone"]) \
.reset_coords()
print(mhw_data)
Which is getting me close:
<xarray.Dataset>
Dimensions: (lon: 360, lat: 180, time: 412, hemisphere: 180, climate_zone: 180)
Coordinates:
* lon (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 356.5 357.5 358.5 359.5
* lat (lat) float64 -89.5 -88.5 -87.5 -86.5 ... 86.5 87.5 88.5 89.5
* time (time) datetime64[ns] 1981-09-01 1981-10-01 ... 2015-12-01
Dimensions without coordinates: hemisphere, climate_zone
Data variables:
evapr (time, lat, lon) float32 ...
lhtfl (time, lat, lon) float32 ...
...
hemisphere_ (hemisphere) object 'south' 'south' ... 'north' 'north'
climate_zone_ (climate_zone) object 'frigid' 'frigid' ... 'frigid' 'frigid'
...
However, I then want to stack the Dataset and convert it to a DataFrame. I am unable to do so, and I think it is because the new variables hemisphere_ and climate_zone_ do not have time, lat, lon coordinates:
stacked = mhw_data[var].stack(dim=["lon", "lat", "time"]).to_pandas().T
results in a KeyError on "lon".
So my question is: How do I assign new variables to the xarray dataset that maintain the original coordinates of time, lat and lon?
To assign a new variable or coordinate, xarray needs to know what the dimensions are called. There are a number of ways to define a DataArray or Coordinate, but the one closest to what you're currently using is to provide a tuple of (dim_names, array):
mhw_data = mhw_data.assign_coords(
climate_zone=(('lat', ), get_latitude_band(mhw_data.lat)),
hemisphere=(('lat', ), get_hemisphere(mhw_data.lat)),
)
Here I'm using da.assign_coords, which will define climate_zone and hemisphere as non-dimension coordinates, which you can think of as additional metadata about latitude and about your data, but which aren't proper data in themselves. This will also allow them to be preserved when sending individual arrays to pandas.
For stacking, converting to pandas will stack automatically. The following will return a DataFrame with variables/non-dimension coordinates as columns and dimensions as a MultiIndex:
stacked = mhw_data.to_dataframe()
Alternatively, if you want a Series indexed by (lat, lon, time) for just one of these coordinates you can always use expand_dims:
(
mhw_data.climate_zone
.expand_dims(lon=mhw_data.lon, time=mhw_data.time)
.to_series()
)
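Put together, the approach might look like the following sketch. The three-latitude dataset is synthetic and only there to make the example runnable; get_hemisphere is simplified to np.where, which is an assumption and not the original implementation.

```python
import numpy as np
import pandas as pd
import xarray as xr


def get_latitude_band(latitude):
    # The first matching condition wins, so the order of the bands matters.
    return np.select(
        condlist=[abs(latitude) < 23.45, abs(latitude) < 35, abs(latitude) < 66.55],
        choicelist=["tropical", "sub_tropical", "temperate"],
        default="frigid",
    )


def get_hemisphere(latitude):
    return np.where(latitude > 0, "north", "south")


# Synthetic stand-in for mhw_data with a single variable.
ds = xr.Dataset(
    {"evapr": (("time", "lat", "lon"), np.zeros((2, 3, 2), dtype="float32"))},
    coords={
        "time": pd.date_range("1981-09-01", periods=2, freq="MS"),
        "lat": [-45.0, 10.0, 80.0],
        "lon": [0.5, 1.5],
    },
)

# (dim_names, array) tuples tell xarray the labels vary along lat only.
ds = ds.assign_coords(
    climate_zone=(("lat",), get_latitude_band(ds.lat.values)),
    hemisphere=(("lat",), get_hemisphere(ds.lat.values)),
)

# Non-dimension coordinates are broadcast and included as columns.
df = ds.to_dataframe()
print(df.head())
```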
The two possible solutions I've worked out for myself are as follows:
First, stack the xarray data into pandas DataFrames, and then create new columns:
df = None
variables = list(mhw_data.data_vars)
for var in tqdm(variables):
    stacked = mhw_data[var].stack(dim=["lon", "lat", "time"]).to_pandas().T
    if df is None:
        df = stacked
    else:
        df = pd.concat([df, stacked], axis=1)
df.reset_index(inplace=True)
df.columns = list(mhw_data.variables)
df["climate_zone"] = df["lat"].swifter.apply(get_latitude_band)
df["hemisphere"] = df["lat"].swifter.apply(get_hemisphere)
Second, create new xarray.DataArrays for each variable you'd like to add, then add them to the dataset:
# calculate climate zone and hemisphere from latitude.
latitudes = mhw_data.lat.values.reshape(-1, 1)
zones = get_latitude_band(latitudes)
hemispheres = get_hemisphere(latitudes)
# Use numpy broadcasting to line the labels up with the xarray shape.
# (Adding a zeros array would fail here because the labels are strings.)
shape = tuple(mhw_data.sizes.values())
zones = np.broadcast_to(zones, shape)
hemispheres = np.broadcast_to(hemispheres, shape)
# finally, create two new DataArrays and assign them as variables in the dataset.
zone_xarray = xr.DataArray(data=zones, coords=mhw_data.coords, dims=mhw_data.dims)
hemi_xarray = xr.DataArray(data=hemispheres, coords=mhw_data.coords, dims=mhw_data.dims)
mhw_data["zone"] = zone_xarray
mhw_data["hemisphere"] = hemi_xarray
# ... call the code to stack and convert to pandas (shown in method 1) ...#
My intuition is that method 1 is faster and more memory efficient because there are no repeated values that need broadcasting into a large 3-dimensional array. I did not test this, however.
Also, my intuition is that there is a less cumbersome xarray native way of accomplishing the same goal, but I could not find it.
One thing is certain, method 1 is far more concise due to the fact that there is no need to create intermediate arrays or reshape data.

How can I find the indices equivalent to a specific selection of xarray?

I have an xarray dataset.
<xarray.Dataset>
Dimensions: (lat: 92, lon: 172, time: 183)
Coordinates:
* lat (lat) float32 4.125001 4.375 4.625 ... 26.624994 26.874996
* lon (lon) float32 nan nan nan ... 24.374996 24.624998 24.875
* time (time) datetime64[ns] 2003-09-01 2003-09-02 ... 2004-03-01
Data variables:
swnet (time, lat, lon) float32 dask.array<shape=(183, 92, 172), chunksize=(1, 92, 172)>
Find the nearest lat-long
df.sel(time='2003-09-01', lon=6.374997, lat=16.375006, method='nearest')
Need to find
The indices of this particular location. Basically, the row-column in the grid. What would be the easiest way to go about it?
Tried
nearestlat=df.sel(time='2003-09-01', lon=6.374997, lat=16.375006, method='nearest')['lat'].values
nearestlon=df.sel(time='2003-09-01', lon=6.374997, lat=16.375006, method='nearest')['lon'].values
rowlat=np.where(df['lat'].values==nearestlat)[0][0]
collon=np.where(df['lon'].values==nearestlon)[0][0]
But I am not sure if this is the right way to go about it. How can I do this 'correctly'?
I agree that finding the index associated with a .sel operation is trickier than one would expect!
This code works:
import xarray as xr
ds = xr.tutorial.open_dataset('air_temperature')
ilon = list(ds.lon.values).index(ds.sel(lon=250.0, method='nearest').lon)
ilat = list(ds.lat.values).index(ds.sel(lat=45.0, method='nearest').lat)
print(' lon index=',ilon,'\n','lat index=', ilat)
producing:
lon index= 20
lat index= 12
And just in case one is wondering why one might want to do this, we use this for investigating time stacks of images, where we are interested in selecting the image closest to a specified date (with method='pad' instead of 'nearest', the image at or immediately preceding that date):
import xarray as xr
ds = xr.tutorial.open_dataset('air_temperature')
idx = list(ds.time.values).index(ds.sel(time='2013-06-01 00:00:00', method='nearest').time)
print(idx)
which produces
848
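I think something like this should work:
import pandas as pd
import xarray as xr
ds = xr.tutorial.open_dataset('air_temperature')
# Index.get_loc(..., method=...) was removed in pandas 2.0; get_indexer takes a list of targets
idx = ds.indexes["time"].get_indexer([pd.Timestamp('2013-06-01 00:00:00')], method="nearest")[0]
print(idx)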
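The same idea can be applied to the lat/lon case from the question. The sketch below builds a synthetic grid (the coordinate ranges and the swnet values are assumptions, chosen so the example runs without downloading data) and looks up the integer positions through the pandas indexes that xarray exposes.

```python
import numpy as np
import xarray as xr

# Synthetic regular grid, loosely modelled on the one in the question.
lat = np.arange(4.125, 27.0, 0.25)   # 92 latitudes
lon = np.arange(0.125, 25.0, 0.25)   # 100 longitudes
ds = xr.Dataset(
    {"swnet": (("lat", "lon"),
               np.arange(lat.size * lon.size, dtype="float64").reshape(lat.size, lon.size))},
    coords={"lat": lat, "lon": lon},
)

# get_indexer returns an array of positions, one per requested target.
ilat = int(ds.indexes["lat"].get_indexer([16.375006], method="nearest")[0])
ilon = int(ds.indexes["lon"].get_indexer([6.374997], method="nearest")[0])

# Positional and label-based selection should pick the same cell.
by_position = ds["swnet"].isel(lat=ilat, lon=ilon)
by_label = ds["swnet"].sel(lat=16.375006, lon=6.374997, method="nearest")
print(ilat, ilon, float(by_position), float(by_label))
```

This avoids comparing floating-point coordinate values with ==, which the np.where approach in the question relies on.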

Extract coordinate values in xarray

I would like to extract the values of the coordinate variables.
For example I create a DataArray as:
import xarray as xr
import numpy as np
import pandas as pd
years_arr=range(1982,1986)
time = pd.date_range('14/1/' + str(years_arr[0]) + ' 12:00:00', periods=len(years_arr), freq=pd.DateOffset(years=1))
lon = range(20,24)
lat = range(10,14)
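# placeholder values; data was not defined in the original snippet
data = np.random.rand(len(time), len(lat), len(lon))
arr1 = xr.DataArray(data, coords=[time, lat, lon], dims=['time', 'latitude', 'longitude'])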
I now would like to output the lon values from arr1.
I'm asking as arr1 going into a function so I may not have the lon values available.
Because the dimension is named 'longitude' in your example, arr1.coords['longitude'] gives you the longitude as an xarray.DataArray, and arr1.coords['longitude'].values gives you the values as a numpy array.
Another possible solution is:
time, lat, lon = arr1.indexes.values()
The result is a pandas Index for each coordinate (for example, a Float64Index for floating-point lat/lon values in older pandas versions).
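Both approaches can be checked with a short sketch. The zero-filled data array is a placeholder, and np.arange is used for the coordinates here as an assumption in place of the range objects in the question.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("1982-01-14 12:00:00", periods=4, freq=pd.DateOffset(years=1))
lat = np.arange(10, 14)
lon = np.arange(20, 24)
data = np.zeros((4, 4, 4))  # placeholder values

arr1 = xr.DataArray(
    data, coords=[time, lat, lon], dims=["time", "latitude", "longitude"]
)

# Coordinate values as a plain numpy array, looked up by dimension name.
lon_values = arr1.coords["longitude"].values

# The same coordinate as a pandas Index.
lon_index = arr1.indexes["longitude"]

print(lon_values)
print(lon_index)
```

Inside a function that only receives arr1, arr1.dims lists the available dimension names, so the coordinate can be looked up without knowing the name in advance.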
