I downloaded a geotiff from here: https://www.nass.usda.gov/Research_and_Science/Crop_Progress_Gridded_Layers/index.php
(file also available: https://drive.google.com/file/d/1XcfEw-CZgVFE2NJytu4B1yBvjWydF-Tm/view?usp=sharing)
Looking at one of the weeks in 2021, I'd like to convert the geotiff to a data frame so that I have an associated value for each lat/lon pair in the geotiff.
I tried:
import rioxarray
import pandas as pd

fl = 'data/cpc2021/corn/cpccorn2021/condition/cornCond21w24.tif'
da = rioxarray.open_rasterio(fl, masked=True)
df = da[0].to_pandas()
df['y'] = df.index
pd.melt(df, id_vars='y')
However, this returns a dataframe with x and y values that don't seem to correspond to lat/lon. How can I add (or retain) this information while converting?
I expect the lat/lon points to be in the contiguous US.
Edit:
I found a metadata file that has the projection: NAD_1983_Contiguous_USA_Albers,
which I believe corresponds to EPSG:5070 (also seen later in the same xml file).
I also found the bounding box for the lat/lon coordinates:
<GeoBndBox esriExtentType="search">
<exTypeCode Sync="TRUE">1</exTypeCode>
<westBL Sync="TRUE">-127.360895</westBL>
<eastBL Sync="TRUE">-68.589171</eastBL>
<northBL Sync="TRUE">51.723828</northBL>
<southBL Sync="TRUE">23.297865</southBL>
However, I'm still uncertain how to include this information when converting to a dataframe.
The result of print(da) is:
<xarray.DataArray (band: 1, y: 320, x: 479)>
[153280 values with dtype=float32]
Coordinates:
* band (band) int64 1
* x (x) float64 -2.305e+06 -2.296e+06 ... 1.987e+06 1.996e+06
* y (y) float64 3.181e+06 3.172e+06 ... 3.192e+05 3.102e+05
spatial_ref int64 0
Attributes:
AREA_OR_POINT: Area
RepresentationType: ATHEMATIC
STATISTICS_COVARIANCES: 0.1263692188822515
STATISTICS_MAXIMUM: 4.8569073677063
STATISTICS_MEAN: 3.7031858480518
STATISTICS_MINIMUM: 2.1672348976135
STATISTICS_SKIPFACTORX: 1
STATISTICS_SKIPFACTORY: 1
STATISTICS_STDDEV: 0.35548448472789
scale_factor: 1.0
add_offset: 0.0
Credit to Jose from the GIS community:
import rioxarray
import pandas as pd
da = rioxarray.open_rasterio(fl, masked=True)
da = da.rio.reproject("EPSG:4326")
df = da[0].to_pandas()
df['y'] = df.index
df = pd.melt(df, id_vars='y')
https://gis.stackexchange.com/questions/443801/add-lat-and-lon-to-dataarray-read-in-by-rioxarray/443810#443810
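If you prefer a long-form table with explicit lon/lat columns (and want to drop the masked cells outside the CONUS extent), here is a minimal sketch along the same lines; the column name value is just an illustrative choice, and the write_crs line would only be needed if the GeoTIFF were missing the EPSG:5070 metadata found in the xml:
import rioxarray
fl = 'data/cpc2021/corn/cpccorn2021/condition/cornCond21w24.tif'
da = rioxarray.open_rasterio(fl, masked=True)
# da = da.rio.write_crs("EPSG:5070")  # only if the CRS is missing from the file
da = da.rio.reproject("EPSG:4326")
df = (
    da[0]
    .rename('value')                   # illustrative name for the band values
    .to_dataframe()
    .reset_index()                     # x/y (now lon/lat) become ordinary columns
    .rename(columns={'x': 'lon', 'y': 'lat'})
    .dropna(subset=['value'])          # drop masked cells
)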
Related
I have a pandas Series named obs (shape (62824,)) that holds temperature values as follows
0 16.9
1 11.0
2 5.9
3 9.4
4 15.4
...
I want to use the following code to transform my numpy array into an xr.DataArray
lat = 35.93679
lon = 14.45663
obs_data = xr.DataArray(obs, dims=['time', 'lat', 'lon'],
                        coords=[pd.date_range('1979-01-01', '2021-12-31', freq='D'), lat, lon])
My issue is that I get the following error
ValueError: dimensions ('lat',) must have the same length as the number of data dimensions, ndim=0
From my understanding, this is because the numpy array has only one dimension. I tried the following:
obs = obs[..., np.newaxis, np.newaxis]
However, that did not work either and I still get the same error.
How can I fix that?
You are correct about adding dimensions to obs.
In Creating a DataArray and API reference it is mentioned that the coordinates themselves should be array-like.
Your lat and lon are floats. I believe all you have to do is wrap them in a list, like so:
lat = [35.93679] # <- list
lon = [14.45663] # <- list
obs_data = xr.DataArray(
obs[:, None, None],
dims=['time', 'lat', 'lon'],
coords=[
pd.date_range('1979-01-01', '2021-12-31', freq='D'), lat, lon
]
)
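If obs is actually a pandas Series (as in the question), obs[:, None, None] will not work directly; take the underlying numpy array first. A minimal self-contained sketch (the temperature values here are random stand-ins):
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range('1979-01-01', '2021-12-31', freq='D')
obs = pd.Series(np.random.rand(len(time)))          # stand-in for the temperature series
obs_data = xr.DataArray(
    obs.to_numpy()[:, None, None],                  # Series -> ndarray, then add lat/lon axes
    dims=['time', 'lat', 'lon'],
    coords={'time': time, 'lat': [35.93679], 'lon': [14.45663]},
)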
I have imported an xarray dataset as shown below and extracted values at the coordinates defined by zones from a csv file, over a time period defined by a date range (30 days of a (lon, lat) grid with some environmental value for each coordinate).
from xgrads import open_CtlDataset
ds_Snow = open_CtlDataset(path + 'file')
ds_Snow = ds_Snow.sel(lat = list(set(zones['lat'])), lon = list(set(zones['lon'])),
time = period, method = 'nearest')
When I look at the information for ds_Snow, this is what I get:
Dimensions: (lat: 12, lon: 12, time: 30)
Coordinates:
* time (time) datetime64[ns] 2000-09-01 2000-09-02 ... 2000-09-30
* lat (lat) float32 3.414e+06 3.414e+06 3.414e+06 ... 3.414e+06 3.414e+06
* lon (lon) float32 6.873e+05 6.873e+05 6.873e+05 ... 6.873e+05 6.873e+05
Data variables:
spre (time, lat, lon) float32 dask.array<chunksize=(1, 12, 12), meta=np.ndarray>
Attributes:
title: SnowModel
undef: -9999.0
type: <class 'xarray.core.dataset.Dataset'>
I would like to make it a dataframe, respecting the initial dimensions (time, lat, lon).
So I did this:
df_Snow = ds_Snow.to_dataframe()
But here is what the resulting dataframe looks like:
print(df_Snow)
lat lon time
3414108.0 687311.625 2000-09-01 0.0
2000-09-02 0.0
2000-09-03 0.0
2000-09-04 0.0
2000-09-05 0.0
... ...
2000-09-26 0.0
2000-09-27 0.0
2000-09-28 0.0
2000-09-29 0.0
2000-09-30 0.0
[4320 rows x 1 columns]
It looks like all the data just got put in a single column.
I have tried specifying the dimension order, as some documentation suggested:
df_Snow = ds_Snow.to_dataframe(dim_order = ['time', 'lat', 'lon'])
But it does not change anything, and I can't seem to find an answer in forums or the documentation. I would like to know a way to keep the array configuration in the dataframe.
EDIT: I found a solution.
Instead of converting the xarray object directly, I built my dataframe from a pd.Series for each attribute, like this:
ds_Snow = ds_Snow.sel(lat = list(set(station_list['lat_utm'])),lon = list(set(station_list['lon_utm'])), time = Ind_Run_ERA5_Land, method = 'nearest')
time = pd.Series(ds_Snow.coords["time"].values)
lon = pd.Series(ds_Snow.coords["lon"].values)
lat = pd.Series(ds_Snow.coords["lat"].values)
spre = pd.Series(ds_Snow['spre'].values[:,0,0])
frame = { 'spre': spre, 'time': time, 'lon' : lon, 'lat' : lat}
df_Snow = pd.DataFrame(frame)
This is the expected behaviour. From the docs:
The DataFrame is indexed by the Cartesian product of index coordinates (in the form of a pandas.MultiIndex). Other coordinates are included as columns in the DataFrame.
There is only one variable, spre, in the dataset. The other properties, the 'coordinates' have become the index. Since there were several coordinates (lat, lon, and time), the DataFrame has a hierarchical MultiIndex.
You can either get the index data through tools like get_level_values or, if you want to change how the DataFrame is indexed, you can use reset_index().
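For example (a short sketch using the ds_Snow dataset from the question):
df_Snow = ds_Snow.to_dataframe()
lats = df_Snow.index.get_level_values('lat')   # pull one level out of the MultiIndex
df_flat = df_Snow.reset_index()                # lat, lon, time and spre become ordinary columns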
I've been trying to create two new variables based on the latitude coordinate of a data point in an xarray dataset. However, I can only seem to assign new coordinates. The data set looks like this:
<xarray.Dataset>
Dimensions: (lon: 360, lat: 180, time: 412)
Coordinates:
* lon (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5
* lat (lat) float64 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5
* time (time) datetime64[ns] 1981-09-01 1981-10-01 ... 2015-12-01
Data variables:
evapr (time, lat, lon) float32 ...
lhtfl (time, lat, lon) float32 ...
...
What I've attempted so far is this:
import numpy as np

def get_latitude_band(latitude):
return np.select(
condlist=
[abs(latitude) < 23.45,
abs(latitude) < 35,
abs(latitude) < 66.55],
choicelist=
["tropical",
"sub_tropical",
"temperate"],
default="frigid"
)
def get_hemisphere(latitude):
return np.select(
[latitude > 0, latitude <=0],
["north", "south"]
)
mhw_data = mhw_data \
.assign(climate_zone=get_latitude_band(mhw_data.lat)) \
.assign(hemisphere=get_hemisphere(mhw_data.lat)) \
.reset_index(["hemisphere", "climate_zone"]) \
.reset_coords()
print(mhw_data)
Which is getting me close:
<xarray.Dataset>
Dimensions: (lon: 360, lat: 180, time: 412, hemisphere: 180, climate_zone: 180)
Coordinates:
* lon (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 356.5 357.5 358.5 359.5
* lat (lat) float64 -89.5 -88.5 -87.5 -86.5 ... 86.5 87.5 88.5 89.5
* time (time) datetime64[ns] 1981-09-01 1981-10-01 ... 2015-12-01
Dimensions without coordinates: hemisphere, climate_zone
Data variables:
evapr (time, lat, lon) float32 ...
lhtfl (time, lat, lon) float32 ...
...
hemisphere_ (hemisphere) object 'south' 'south' ... 'north' 'north'
climate_zone_ (climate_zone) object 'frigid' 'frigid' ... 'frigid' 'frigid'
...
However, I want to then stack the DataSet and convert it to a DataFrame. I am unable to do so, and I think it is because the new variables hemisphere_ and climate_zone_ do not have time, lat, lon coordinates:
stacked = mhw_data[var].stack(dim=["lon", "lat", "time"]).to_pandas().T
results in a KeyError on "lon".
So my question is: How do I assign new variables to the xarray dataset that maintain the original coordinates of time, lat and lon?
To assign a new variable or coordinate, xarray needs to know what the dimensions are called. There are a number of ways to define a DataArray or Coordinate, but the one closest to what you're currently using is to provide a tuple of (dim_names, array):
mhw_data = mhw_data.assign_coords(
climate_zone=(('lat', ), get_latitude_band(mhw_data.lat)),
hemisphere=(('lat', ), get_hemisphere(mhw_data.lat)),
)
Here I'm using Dataset.assign_coords, which will define climate_zone and hemisphere as non-dimension coordinates. You can think of these as additional metadata about latitude and about your data, rather than proper data in themselves. This will also allow them to be preserved when sending individual arrays to pandas.
For stacking, converting to pandas will stack automatically. The following will return a DataFrame with variables/non-dimension coordinates as columns and dimensions as a MultiIndex:
stacked = mhw_data.to_dataframe()
Alternatively, if you want a Series indexed by (lat, lon, time) for just one of these coordinates you can always use expand_dims:
(
mhw_data.climate_zone
.expand_dims(lon=mhw_data.lon, time=mhw_data.time)
.to_series()
)
The two possible solutions I've worked out for myself are as follows.
Method 1: stack the xarray data into a pandas DataFrame, then create the new columns:
import pandas as pd
import swifter  # registers the .swifter accessor used below
from tqdm import tqdm

df = None
variables = list(mhw_data.data_vars)
for var in tqdm(variables):
stacked = mhw_data[var].stack(dim=["lon", "lat", "time"]).to_pandas().T
if df is None:
df = stacked
else:
df = pd.concat([df, stacked], axis=1)
df.reset_index(inplace=True)
df.columns = list(mhw_data.variables)
df["climate_zone"] = df["lat"].swifter.apply(get_latitude_band)
df["hemisphere"] = df["lat"].swifter.apply(get_hemisphere)
Method 2: create a new xarray.DataArray for each variable you'd like to add, then add them to the dataset:
import numpy as np
import xarray as xr

# calculate climate zone and hemisphere from latitude.
latitudes = mhw_data.lat.values.reshape(-1, 1)
zones = get_latitude_band(latitudes)
hemispheres = get_hemisphere(latitudes)
# Broadcast the 1-D results up to the full dataset shape so they line up with the xarray dims.
shape = tuple(mhw_data.sizes.values())
zones = np.broadcast_to(zones, shape)
hemispheres = np.broadcast_to(hemispheres, shape)
# finally, create two new DataArrays and assign them as variables in the dataset.
zone_xarray = xr.DataArray(data=zones, coords=mhw_data.coords, dims=mhw_data.dims)
hemi_xarray = xr.DataArray(data=hemispheres, coords=mhw_data.coords, dims=mhw_data.dims)
mhw_data["zone"] = zone_xarray
mhw_data["hemisphere"] = hemi_xarray
# ... call the code to stack and convert to pandas (shown in method 1) ...#
My intuition is that method 1 is faster and more memory-efficient, because there are no repeated values that need broadcasting into a large 3-dimensional array; I did not test this, however.
My intuition is also that there is a less cumbersome xarray-native way of accomplishing the same goal, but I could not find it.
One thing is certain: method 1 is far more concise, because there is no need to create intermediate arrays or reshape data.
I have a dataframe that contains lon/lat information. The aim is to find all the points within a distance rad of a specific point st_p.
I already have code that does this in R, but I need to do the same thing in Python.
What I do in R is convert the dataframe to an sf object, define a buffer, then take the intersection with the buffer.
Here is the R code.
I just don't know which libraries to use in Python to do the same.
within_radius <- function(df, st_p, rad) {
# Transform to an sf object and change from lon/lat to utm
sf_df <- st_transform(st_as_sf(
df,
coords = c("lon", "lat"),
crs = 4326,
agr = "constant"
), 6622)
# Create an utm st point based on the coordinates of the stop point
cntr <- st_transform(st_sfc(st_p, crs = 4326), 6622)
# Create a circular buffer with the given radius
buff <- st_buffer(cntr, rad)
# Filter the points that are within the buffer
intr <- st_intersects(sf_df, buff, sparse = F)
sf_df <- st_transform(sf_df, 4326)
sf_df <- sf_df[which(unlist(intr)), ]
# Compute the distance of each point to the beginning of the road segment
xy = st_coordinates(st_centroid(sf_df))
nc.sort = sf_df[order(xy[, "X"], xy[, "Y"]), ]
sf_df <- nc.sort %>%
mutate(dist = st_distance(
x = nc.sort,
y = nc.sort[1, ],
by_element = TRUE
))
}
You can use geopandas and shapely to do pretty much everything here.
Create a geopandas geodataframe from a pandas dataframe with lat, lng:
In [19]: import pandas as pd
In [20]: import geopandas as gpd
In [21]: from shapely.geometry import Point
In [22]: df = pd.DataFrame({"lat": [19.435175, 19.432909], "lng":[-99.141197, -99.146036]})
In [23]: gf = gpd.GeoDataFrame(df, geometry = [Point(x,y) for (x,y) in zip(df.lng, df.lat)], crs = "epsg:4326")
In [24]: gf
Out[24]:
lat lng geometry
0 19.435175 -99.141197 POINT (-99.14120 19.43518)
1 19.432909 -99.146036 POINT (-99.14604 19.43291)
Buffers, projections and other operations are available on a GeoDataFrame. This is how you convert to a metric projection and create a 10 m buffer:
In [27]: gf.to_crs(6622).buffer(10)
Out[27]:
0 POLYGON ((-3597495.980 -2115793.588, -3597496....
1 POLYGON ((-3598149.053 -2115813.383, -3598149....
dtype: geometry
You can call intersects to test whether the buffer contains a point:
In [29]: gf.to_crs(6622).buffer(10).intersects(Point(-3597505.980,-2115793.588))
Out[29]:
0 True
1 False
dtype: bool
compute centroids:
In [30]: gf.to_crs(6622).buffer(10).centroid
Out[30]:
0 POINT (-3597505.980 -2115793.588)
1 POINT (-3598159.053 -2115813.383)
dtype: geometry
filter using the buffer:
In [31]: gf.loc[gf.to_crs(6622).buffer(10).intersects(Point(-3597505.980,-2115793.588))]
Out[31]:
lat lng geometry
0 19.435175 -99.141197 POINT (-99.14120 19.43518)
distance gives you the distance to the closest point in a geometry:
In [33]: gf.to_crs(6622).buffer(10).distance(Point(-3597505.980,-2115793.588))
Out[33]:
0 0.000000
1 643.377576
dtype: float64
And you can do a lot more; just look at the documentation:
https://geopandas.org/index.html
Also look at shapely's documentation to see how to project a single point https://shapely.readthedocs.io/en/latest/manual.html#shapely.ops.transform
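Putting those pieces together, here is a rough Python sketch along the lines of the R within_radius function. It assumes the dataframe has lat and lng columns and that st_p is a shapely Point given as lon/lat; the coordinate-sorting step of the R version is left out, and the distance is measured to the stop point for simplicity:
import geopandas as gpd
from shapely.geometry import Point

def within_radius(df, st_p, rad):
    # lon/lat points -> GeoDataFrame -> metric CRS (EPSG:6622, as in the R code)
    gf = gpd.GeoDataFrame(
        df,
        geometry=gpd.points_from_xy(df.lng, df.lat),
        crs="EPSG:4326",
    ).to_crs(6622)
    # project the stop point the same way and buffer it by rad metres
    cntr = gpd.GeoSeries([st_p], crs="EPSG:4326").to_crs(6622).iloc[0]
    keep = gf[gf.intersects(cntr.buffer(rad))].copy()
    # distance of each kept point to the stop point
    keep["dist"] = keep.distance(cntr)
    return keep.to_crs("EPSG:4326")

# e.g. within_radius(df, Point(-99.141197, 19.435175), 100)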
I have a netCDF file that contains temperature data over some location. The data shape is 1450x900.
I am creating search functionality in my app to locate temperature data by lat/lon values.
So I extracted the lat and lon coordinate data from the netCDF file, but I was expecting them to be 1D arrays; instead I got 2D arrays of shape 1450x900 for both coordinates.
So my question: why are they 2D arrays instead of 1450 latitude values and 900 longitude values? Don't 1450 lat values and 900 lon values describe the whole grid?
Let's say we have a 4x5 grid; the indices locating the rightmost, bottom-most point of the grid will be [4, 5]. So my indices for x will be [1, 2, 3, 4] and for y: [1, 2, 3, 4, 5]. Nine indices in total are enough to locate any point on that grid (consisting of 20 cells). So why do the lat (x) and lon (y) coordinates in the netCDF file contain 20 indices each (40 in total), instead of 4 and 5 indices respectively (9 in total)? Hope you get what confuses me.
Is it possible to somehow map those 2D arrays and "downgrade" them to 1450 latitude values and 900 longitude values? Or is it OK as it is right now? How can I use those values for my purpose? Do I need to zip the lat/lon arrays?
here are the shapes:
>>> DS = xarray.open_dataset('file.nc')
>>> DS.tasmin.shape
(31, 1450, 900)
>>> DS.projection_x_coordinate.shape
(900,)
>>> DS.projection_y_coordinate.shape
(1450,)
>>> DS.latitude.shape
(1450, 900)
>>> DS.longitude.shape
(1450, 900)
Note that projection_x_coordinate and projection_y_coordinate are easting/northing values, not lat/lons.
Here is the metadata of the file if needed:
Dimensions: (bnds: 2, projection_x_coordinate: 900, projection_y_coordinate: 1450, time: 31)
Coordinates:
* time (time) datetime64[ns] 2018-12-01T12:00:00 ....
* projection_y_coordinate (projection_y_coordinate) float64 -1.995e+0...
* projection_x_coordinate (projection_x_coordinate) float64 -1.995e+0...
latitude (projection_y_coordinate, projection_x_coordinate) float64 ...
longitude (projection_y_coordinate, projection_x_coordinate) float64 ...
Dimensions without coordinates: bnds
Data variables:
tasmin (time, projection_y_coordinate, projection_x_coordinate) float64 ...
transverse_mercator int32 ...
time_bnds (time, bnds) datetime64[ns] ...
projection_y_coordinate_bnds (projection_y_coordinate, bnds) float64 ...
projection_x_coordinate_bnds (projection_x_coordinate, bnds) float64 ...
Attributes:
comment: Daily resolution gridded climate observations
creation_date: 2019-08-21T21:26:02
frequency: day
institution: Met Office
references: doi: 10.1002/joc.1161
short_name: daily_mintemp
source: HadUK-Grid_v1.0.1.0
title: Gridded surface climate observations data for the UK
version: v20190808
Conventions: CF-1.5
Your data adheres to version 1.5 of the Climate and Forecast conventions.
The document describing this version of the conventions is here, although the relevant section is essentially unchanged across many versions of the conventions.
See section 5.2:
5.2. Two-Dimensional Latitude, Longitude, Coordinate Variables
The latitude and longitude coordinates of a horizontal grid that was
not defined as a Cartesian product of latitude and longitude axes, can
sometimes be represented using two-dimensional coordinate variables.
These variables are identified as coordinates by use of the coordinates attribute.
It looks like you are using the HadOBS 1km resolution gridded daily minimum temperature, and this file in particular:
http://dap.ceda.ac.uk/thredds/fileServer/badc/ukmo-hadobs/data/insitu/MOHC/HadOBS/HadUK-Grid/v1.0.1.0/1km/tasmin/day/v20190808/tasmin_hadukgrid_uk_1km_day_20181201-20181231.nc (warning: >300MB download)
As it states, the data is on a transverse Mercator grid.
If you look at output from ncdump -h <filename> you will also see the following description of the grid expressed by means of attributes of the transverse_mercator dummy variable:
int transverse_mercator ;
transverse_mercator:grid_mapping_name = "transverse_mercator" ;
transverse_mercator:longitude_of_prime_meridian = 0. ;
transverse_mercator:semi_major_axis = 6377563.396 ;
transverse_mercator:semi_minor_axis = 6356256.909 ;
transverse_mercator:longitude_of_central_meridian = -2. ;
transverse_mercator:latitude_of_projection_origin = 49. ;
transverse_mercator:false_easting = 400000. ;
transverse_mercator:false_northing = -100000. ;
transverse_mercator:scale_factor_at_central_meridian = 0.9996012717 ;
and you will also see that the coordinate variables projection_x_coordinate and projection_y_coordinate have units of metres.
The grid in question is the Ordnance Survey UK grid using numeric grid references.
See for example this description of the OS grid (from Wikipedia).
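If you only need to look up values at a handful of lon/lat points rather than regrid the whole field, you can also go the other way: project the query point into the grid's CRS and select the nearest cell on the 1D projection coordinates. A short sketch, assuming the grid is OSGB 1936 / British National Grid (EPSG:27700), which the transverse_mercator parameters above appear to describe:
import xarray as xr
from pyproj import Transformer

ds = xr.open_dataset("tasmin_hadukgrid_uk_1km_day_20181201-20181231.nc")

# lon/lat -> easting/northing on the (assumed) EPSG:27700 grid
to_osgb = Transformer.from_crs("EPSG:4326", "EPSG:27700", always_xy=True)
x, y = to_osgb.transform(-0.1278, 51.5074)   # e.g. central London

point = ds.tasmin.sel(projection_x_coordinate=x,
                      projection_y_coordinate=y,
                      method="nearest")
print(point.isel(time=0).values)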
If you wish to express the data on a regular longitude-latitude grid then you will need to do some type of interpolation. I see that you are using xarray. You can combine this with pyresample to do the interpolation. Here is an example:
import xarray as xr
import numpy as np
from pyresample.geometry import SwathDefinition
from pyresample.kd_tree import resample_nearest, resample_gauss
ds = xr.open_dataset("tasmin_hadukgrid_uk_1km_day_20181201-20181231.nc")
# Define a target grid. For sake of example, here is one with just
# 3 longitudes and 4 latitudes.
lons = np.array([-2.1, -2., -1.9])
lats = np.array([51.7, 51.8, 51.9, 52.0])
# The target grid is regular (1-d lon, lat coordinates) but we will need
# a 2d version (similar to the input grid), so use numpy.meshgrid to produce this.
lon2d, lat2d = np.meshgrid(lons, lats)
origin_grid = SwathDefinition(lons=ds.longitude, lats=ds.latitude)
target_grid = SwathDefinition(lons=lon2d, lats=lat2d)
# get a numpy array for the first timestep
data = ds.tasmin[0].to_masked_array()
# nearest neighbour interpolation example
# Note that radius_of_influence has units metres
interpolated = resample_nearest(origin_grid, data, target_grid, radius_of_influence=1000)
# GIVES:
# array([[5.12490065, 5.02715332, 5.36414835],
# [5.08337723, 4.96372838, 5.00862833],
# [6.47538931, 5.53855722, 5.11511239],
# [6.46571817, 6.17949381, 5.87357538]])
# gaussian weighted interpolation example
# Note that radius_of_influence and sigmas both have units metres
interpolated = resample_gauss(origin_grid, data, target_grid, radius_of_influence=1000, sigmas=1000)
# GIVES:
# array([[5.20432465, 5.07436805, 5.39693221],
# [5.09069187, 4.8565934 , 5.08191639],
# [6.4505963 , 5.44018209, 5.13774416],
# [6.47345359, 6.2386732 , 5.62121948]])
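If you want the interpolated field back in xarray (or pandas) form, you can wrap it with the target grid as coordinates; a short sketch continuing the example above:
interpolated_da = xr.DataArray(
    interpolated,
    dims=("lat", "lon"),
    coords={"lat": lats, "lon": lons},
    name="tasmin",
)
df = interpolated_da.to_dataframe().reset_index()   # columns: lat, lon, tasmin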
I figured out an answer myself.
As it turns out, 2D lat/lon arrays are used to define the "grid" of a location.
In other words, if we zip the lat/lon values and project them on a map, we get a "curved grid" (that is, the Earth's curvature is taken into account) over the location, which is then used as the grid reference for that location.
Hope it's clear for anyone interested.
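For the search functionality mentioned in the question, one simple way to use the 2D arrays directly is a brute-force nearest-cell lookup; a minimal sketch (fine for occasional queries), assuming the variable and coordinate names from the file's metadata above:
import numpy as np
import xarray as xr

ds = xr.open_dataset('file.nc')

def nearest_tasmin(ds, lat0, lon0, time_index=0):
    # squared difference in degrees is enough to pick the closest grid cell
    dist2 = (ds.latitude - lat0) ** 2 + (ds.longitude - lon0) ** 2
    j, i = np.unravel_index(int(dist2.argmin()), dist2.shape)
    return float(ds.tasmin[time_index, j, i])

print(nearest_tasmin(ds, 51.5, -0.1))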