How to convert a numpy array to an xarray DataArray? - python

I am trying to convert a 3D numpy array to an xarray DataArray, but I am getting an error that I cannot figure out.
I have a 3D numpy array (lat, lon, and time), and I am hoping to convert it into an xarray DataArray with the dimensions lat, lon, and time.
The np.random.rand is just to make a reproducible example of a 3D array:
atae = np.random.rand(10,20,30) # 3d array
lat_atae = np.random.rand(10) # latitude is the same size as the first axis
lon_atae = np.random.rand(20) # longitude is the same size as second axis
time_atae = np.random.rand(30) # time is the 3rd axis
data_xr = xr.DataArray(atae,
                       coords=[{'y': lat_atae, 'x': lon_atae, 'time': time_atae}],
                       dims=["y", "x", "time"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-156-8f8f8a1fc7aa> in <module>
----> 1 test = xr.DataArray(atae, coords=[{'y': lat_atae,'x': lon_atae,'time': time_atae}],
2 dims=["y", "x", "time"])
3
~/opt/anaconda3/lib/python3.8/site-packages/xarray/core/dataarray.py in __init__(self, data, coords, dims, name, attrs, indexes, fastpath)
408 data = _check_data_shape(data, coords, dims)
409 data = as_compatible_data(data)
--> 410 coords, dims = _infer_coords_and_dims(data.shape, coords, dims)
411 variable = Variable(dims, data, attrs, fastpath=True)
412 indexes = dict(
~/opt/anaconda3/lib/python3.8/site-packages/xarray/core/dataarray.py in _infer_coords_and_dims(shape, coords, dims)
104 and len(coords) != len(shape)
105 ):
--> 106 raise ValueError(
107 f"coords is not dict-like, but it has {len(coords)} items, "
108 f"which does not match the {len(shape)} dimensions of the "
ValueError: coords is not dict-like, but it has 1 items, which does not match the 3 dimensions of the data
How do I convert this numpy array into an xarray DataArray?

You don't need to provide a list for coords; the dictionary alone is enough:
data_xr = xr.DataArray(atae,
                       coords={'y': lat_atae, 'x': lon_atae, 'time': time_atae},
                       dims=["y", "x", "time"])

Related

How to transform a Pandas Dataframe with irregular coordinates into a xarray Dataset

I'm working with a pandas DataFrame in Python, but in order to plot my data as a map I have to transform it into an xarray Dataset, since the library I'm using for plotting (salem) works best with this class. The problem I'm having is that the grid of my data isn't regular, so I can't seem to create the Dataset.
My Dataframe has the latitude and longitude, as well as the value in each point:
lon lat value
0 -104.936302 -51.339233 7.908411
1 -104.827377 -51.127686 7.969049
2 -104.719154 -50.915470 8.036676
3 -104.611641 -50.702595 8.096765
4 -104.504814 -50.489056 8.163690
... ... ... ...
65995 -32.911377 15.359591 25.475702
65996 -32.957718 15.579139 25.443994
65997 -33.004040 15.798100 25.429346
65998 -33.050335 16.016472 25.408105
65999 -33.096611 16.234255 25.383844
[66000 rows x 3 columns]
In order to create the Dataset using lat and lon as coordinates and fill all of the missing values with NaN, I was trying the following:
ds = xr.Dataset(
    {
        'ts': xr.DataArray(
            data=value,  # enter data here
            dims=['lon', 'lat'],
            coords={'lon': lon, 'lat': lat},
            attrs={
                '_FillValue': np.nan,
                'units': 'K'
            }
        )
    },
    attrs={'attr': 'RegCM output'}
)
ds
But I got the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [41], in <cell line: 1>()
1 ds = xr.Dataset({
----> 2 'ts': xr.DataArray(
3 data = value, # enter data here
4 dims = ['lon','lat'],
5 coords = {'lon': lon, 'lat':lat},
6 attrs = {
7 '_FillValue': np.nan,
8 'units' : 'K'
9 }
10 )},
11 attrs = {'example_attr': 'this is a global attribute'}
12 )
14 # ds = xr.Dataset(
15 # data_vars=dict(
16 # variable=(["lon", "lat"], value)
(...)
25 # }
26 # )
27 ds
File ~\anaconda3\lib\site-packages\xarray\core\dataarray.py:406, in DataArray.__init__(self, data, coords, dims, name, attrs, indexes, fastpath)
404 data = _check_data_shape(data, coords, dims)
405 data = as_compatible_data(data)
--> 406 coords, dims = _infer_coords_and_dims(data.shape, coords, dims)
407 variable = Variable(dims, data, attrs, fastpath=True)
408 indexes = dict(
409 _extract_indexes_from_coords(coords)
410 ) # needed for to_dataset
File ~\anaconda3\lib\site-packages\xarray\core\dataarray.py:123, in _infer_coords_and_dims(shape, coords, dims)
121 dims = tuple(dims)
122 elif len(dims) != len(shape):
--> 123 raise ValueError(
124 "different number of dimensions on data "
125 f"and dims: {len(shape)} vs {len(dims)}"
126 )
127 else:
128 for d in dims:
ValueError: different number of dimensions on data and dims: 1 vs 2
I would really appreciate any insights to solve this.
If you really require a rectangular gridded dataset, you need to resample your data onto a regular grid (rasterio, pyresample, etc. provide useful functionality for that). However, if you just want to plot the data, this is not necessary!
Not sure about salem (never used it so far), but I've tried my best to simplify plotting of irregularly sampled data in EOmaps, the visualization library I'm developing!
You could get a contour-plot-like appearance if you use a Delaunay triangulation to visualize the data:
import pandas as pd
df = pd.read_csv("... path-to df.csv ...", index_col=0)
from eomaps import Maps
m = Maps()
m.add_feature.preset.coastline()
m.set_data(df, x="lon", y="lat", crs=4326, parameter="value")
m.set_shape.delaunay_triangulation()
m.plot_map()
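As an aside (not part of the answer above): if the data did lie on a regular grid, pandas could hand it straight to xarray without any resampling. A minimal sketch, assuming df is the question's DataFrame with columns lon, lat and value:
# set_index + to_xarray builds a Dataset on the (lat, lon) grid and fills
# any missing grid cells with NaN; this only makes sense when the
# coordinate pairs actually lie on a regular grid.
ds = df.set_index(["lat", "lon"]).to_xarray()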

Dask looping over library function call

Goal
I would like to parallelize a loop with dask that uses a library function inside the loop. This function, mhw.detect(), calculates some statistics on a slice of a numpy array. None of the slices of the array depend on the other slices, so I was hoping that dask could be used to compute them in parallel and store them all in the same output array.
Code
The flow of the code I am working on is:
import numpy as np
import marineHeatWaves as mhw
from dask import delayed
# Create fake input data
lat_size, long_size = 100, 100
data = np.random.randint(0, 31, size=(10_000, long_size, lat_size))  # size = (time, longitude, latitude)
time = np.arange(730_000, 740_000)  # time in ordinal days

# Initialize an empty array to hold the output
output_array = np.empty(data.shape)

# Loop through each pixel in the data array
for idx_lat in range(lat_size):
    for idx_long in range(long_size):
        # Extract a slice of data
        data_slice = data[:, idx_lat, idx_long]
        # Use the library function to calculate the stats for the pixel
        # `library_output` is a dictionary that has a numpy array inside it
        _, library_output = delayed(mhw.detect)(time, data_slice)
        # Update the output array with the calculated values from the library
        output_array[:, idx_lat, idx_long] = library_output['seas']
Previous efforts
When I run this code I get the error TypeError: Delayed objects of unspecified length are not iterable. Another Stack Overflow post discusses this issue and resolves it by converting the output of the delayed function to a delayed object. However, because I didn't create the output object myself, I am not sure if I can convert it to a delayed object.
I've also tried wrapping the last line in da.from_delayed(), as in output_array[:, idx_lat, idx_long] = da.from_delayed(library_output['seas']), and initializing output_array with da.empty(data.shape). I get the same error, though, since I think the code doesn't make it past the line with the library function, delayed(mhw.detect)(time, data_slice).
Is it possible to parallelize this? Is this approach of asking dask to compute all the slices in parallel and put them together in an output array even a reasonable approach?
Full Traceback
TypeError Traceback (most recent call last)
/home/rwegener/mhw-ocetrac-census/notebooks/ejoliver_subset_MUR.ipynb Cell 44' in <cell line: 10>()
13 data_slice = data[:, idx_lat, idx_long]
14 # Use the library function to calculate the stats for the pixel
---> 15 _, point_clim = delayed(mhw.detect)(time_ordinal, data_slice)
16 # Update the output array with the calculated values from the library
17 output_array[:, idx_lat, idx_long] = point_clim['seas']
File ~/.conda/envs/dask/lib/python3.10/site-packages/dask/delayed.py:581, in Delayed.__iter__(self)
579 def __iter__(self):
580 if self._length is None:
--> 581 raise TypeError("Delayed objects of unspecified length are not iterable")
582 for i in range(self._length):
583 yield self[i]
TypeError: Delayed objects of unspecified length are not iterable
Update
Using .apply_along_axis() as suggested:
# Create fake input data
lat_size, long_size = 100, 100
data = np.random.randint(0, 30, size=(10_000, long_size, lat_size))  # size = (time, longitude, latitude)
data = dask.array.from_array(data, chunks=(-1, 100, 100))
time = np.arange(730_000, 740_000)  # time in ordinal days

# Initialize an empty array to hold the output
output_array = np.empty(data.shape)

# define a wrapper to rearrange arguments
def func1d(arr, time, shape=(10000,)):
    print(arr.shape)
    return mhw.detect(time, arr)

res = dask.array.apply_along_axis(func1d, 0, data, time=time)
With the output:
(1,)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
/homes/metogra/rwegener/mhw-ocetrac-census/notebooks/ejoliver_subset_MUR.ipynb Cell 48' in <cell line: 15>()
12 print(arr.shape)
13 return mhw.detect(time, arr)
---> 15 res = dask.array.apply_along_axis(func1d, 0, data, time=time)
File ~/.conda/envs/dask/lib/python3.10/site-packages/dask/array/routines.py:508, in apply_along_axis(func1d, axis, arr, dtype, shape, *args, **kwargs)
506 if shape is None or dtype is None:
507 test_data = np.ones((1,), dtype=arr.dtype)
--> 508 test_result = np.array(func1d(test_data, *args, **kwargs))
509 if shape is None:
510 shape = test_result.shape
/homes/metogra/rwegener/mhw-ocetrac-census/notebooks/ejoliver_subset_MUR.ipynb Cell 48' in func1d(arr, time, shape)
11 def func1d(arr, time, shape=(10000,)):
12 print(arr.shape)
---> 13 return mhw.detect(time, arr)
File ~/.conda/envs/dask/lib/python3.10/site-packages/marineHeatWaves-0.28-py3.10.egg/marineHeatWaves.py:280, in detect(t, temp, climatologyPeriod, pctile, windowHalfWidth, smoothPercentile, smoothPercentileWidth, minDuration, joinAcrossGaps, maxGap, maxPadLength, coldSpells, alternateClimatology, Ly)
278 tt = tt[tt>=0] # Reject indices "before" the first element
279 tt = tt[tt<TClim] # Reject indices "after" the last element
--> 280 thresh_climYear[d-1] = np.nanpercentile(tempClim[tt.astype(int)], pctile)
281 seas_climYear[d-1] = np.nanmean(tempClim[tt.astype(int)])
282 # Special case for Feb 29
IndexError: index 115 is out of bounds for axis 0 with size 1
Rather than using delayed, this seems like a good case for dask.array.
You can create a dask array by partitioning the input numpy array:
data = dask.array.from_array(data, chunks=(-1, 10, 10))
Now you can call mhw.detect using dask.array.map_blocks alongside np.apply_along_axis within each block:
# define a wrapper to rearrange arguments
def func1d(arr, time):
    return mhw.detect(time, arr)

def block_func(block, **kwargs):
    return np.apply_along_axis(func1d, 0, block, **kwargs)

res = data.map_blocks(block_func, meta=data, time=time)
res = res.compute()
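A note on the chunking (my reading, not stated in the answer): chunks=(-1, 10, 10) keeps the time axis unchunked within each block, which matters here because mhw.detect needs the full time series of every pixel to compute its statistics.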
The map_blocks answer above works great! Additionally, apply_along_axis() was suggested and discussed in comments. I was able to get that method to work, but in order for it to function properly you need to use both the dtype and shape inputs to da.apply_along_axis(). If these aren't supplied the function can't figure out the shape of the data it should pass as an argument.
So, another solution:
import dask.array as da

# Create fake input data
lat_size, long_size = 100, 100
data = da.random.randint(0, 31, size=(1_000, long_size, lat_size), chunks=(-1, 10, 10))  # size = (time, longitude, latitude)
time = np.arange(730_000, 731_000)  # time in ordinal days

# define a wrapper to rearrange arguments
def func1d(arr, time):
    return mhw.detect(time, arr)

result = da.apply_along_axis(func1d, 0, data, time=time, dtype=data.dtype, shape=(1000,))
result.compute()

How to convert coordinates for use in GeoPandas? (e.g., 37_N)

I have the following dataframe:
import pandas as pd
df_coords = pd.DataFrame({'lat': {'1010': '37_N',
                                  '1050': '32_N',
                                  '1059': '19_N',
                                  '1587': '6_S',
                                  '3367': '44_N'},
                          'lon': {'1010': '65_W',
                                  '1050': '117_W',
                                  '1059': '156_W',
                                  '1587': '106_E',
                                  '3367': '12_E'}})
and I'm trying to convert these coordinates so I can build a GeoPandas GeoDataFrame object, but I can't seem to figure out how to convert the strings.
import geopandas as gpd
gpd.points_from_xy(x=["37_N"], y=["65_W"])
# ---------------------------------------------------------------------------
# ValueError Traceback (most recent call last)
# <ipython-input-36-a02c9f8a011d> in <module>
# ----> 1 gpd.points_from_xy(x=["37_N"], y=["65_W"])
# ~/anaconda3/envs/soothsayer_py3.8_env/lib/python3.8/site-packages/geopandas/array.py in points_from_xy(x, y, z, crs)
# 256 output : GeometryArray
# 257 """
# --> 258 return GeometryArray(vectorized.points_from_xy(x, y, z), crs=crs)
# 259
# 260
# ~/anaconda3/envs/soothsayer_py3.8_env/lib/python3.8/site-packages/geopandas/_vectorized.py in points_from_xy(x, y, z)
# 241 def points_from_xy(x, y, z=None):
# 242
# --> 243 x = np.asarray(x, dtype="float64")
# 244 y = np.asarray(y, dtype="float64")
# 245 if z is not None:
# ~/anaconda3/envs/soothsayer_py3.8_env/lib/python3.8/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
# 83
# 84 """
# ---> 85 return array(a, dtype, copy=False, order=order)
# 86
# 87
# ValueError: could not convert string to float: '37_N'
Can anyone describe how to convert the N,S,W,E info in relation to the actual coordinates for GeoPandas?
You would need to map your values into the range -180 to 180 for longitude and -90 to 90 for latitude. See the geopandas documentation.
Here is a function that does that.
def convert_value(value):
    if value.endswith('_N') or value.endswith('_E'):
        return int(value[:-2])
    if value.endswith('_S') or value.endswith('_W'):
        return -int(value[:-2])
The input is the value you want to convert, for example "12_W"; the function will return -12. There is no guarantee of which values should be negative and which positive; I just put the minus sign where it made the most sense. Then pass the converted values to:
geopandas.points_from_xy(value_longitude, value_latitude, crs="EPSG:4326")
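Putting it together for the example DataFrame above, a sketch (note that points_from_xy expects x to be longitude and y to be latitude, the opposite of the ordering tried in the question):
import geopandas as gpd

lats = df_coords["lat"].apply(convert_value)
lons = df_coords["lon"].apply(convert_value)

gdf = gpd.GeoDataFrame(
    df_coords,
    geometry=gpd.points_from_xy(x=lons, y=lats, crs="EPSG:4326"),
)
print(gdf)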

Index Error: Index 206893 is out of bounds for axis 0 with size 206893, griddata issue

I have been stuck for the last 4 days trying to understand a Python error:
IndexError: index 206893 is out of bounds for axis 0 with size 206893
which occurs when applying griddata with the "nearest" interpolation method in the following lines:
# Create a matrix where I will store the first interpolated file
tempnew = np.ones(np.asarray(w1[0, 0, :, :]).shape) * np.nan

# The lon, lat coordinate points of the original grid
lonl, latl = np.meshgrid(lon, lat)
points = np.vstack((np.array(lonl).flatten(), np.array(latl).flatten())).transpose()

# The values of the original file
values = np.array([np.asarray(temp[0, 0, :, :])]).flatten()

# The dimensions of the grid that I want to interpolate to
lons = np.array(nav_lon)
lats = np.array(nav_lat)
X, Y = np.meshgrid(lons, lats)

# Interpolation
tempnew = griddata(points, values, (X, Y), method="nearest", fill_value=-3)
Here the dimension of each of the variables that I use above:
#tempnew.shape: (728, 312) #(Dimensions of tempnew is (lats,lons))
#lat.shape: (661,) #(original latitude)
#lon.shape: (313,) #(original longitude)
#points.shape: (206893, 2)
#values.shape: (206893,)
#X.shape: (728, 312)
#Y.shape: (728, 312)
Can you help me? I would like to note that the original file grid is a regular (A-type) grid, whereas the grid I want to interpolate to is not regular (it is C-grid data).
The error looks like this:
In [36]: tempnew = sp.interpolate.griddata(points, values, (X, Y), method="nearest", fill_value=-3)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-36-0d0b46a3542f> in <module>
----> 1 tempnew = sp.interpolate.griddata(points,values, (X,Y), method =
"nearest",fill_value=-3)
~/software/anaconda3/envs/mhw/lib/python3.7/site-packages/scipy/interpolate/ndgriddata.py in
griddata(points, values, xi, method, fill_value, rescale)
217 elif method == 'nearest':
218 ip = NearestNDInterpolator(points, values, rescale=rescale)
--> 219 return ip(xi)
220 elif method == 'linear':
221 ip = LinearNDInterpolator(points, values, fill_value=fill_value,
~/software/anaconda3/envs/mhw/lib/python3.7/site-packages/scipy/interpolate/ndgriddata.py in
__call__(self, *args)
79 xi = self._scale_x(xi)
80 dist, i = self.tree.query(xi)
---> 81 return self.values[i]
82
83
IndexError: index 206893 is out of bounds for axis 0 with size 206893
Thanks in advance,
Sofi
I encountered this error in my Python code using the scipy.interpolate.NearestNDInterpolator class. The error message that is returned is not very clear. In the end, I found that one of the values I was inserting into my interpolant had a value of 1e184 and caused this error message. After resetting this value to 0.0, my Python script ran successfully.
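A quick way to screen for that kind of bad input before interpolating, as a sketch using the question's points and values arrays (the 1e30 cutoff is an arbitrary illustration, not a scipy constant):
import numpy as np

# Flag source points or values that are non-finite or absurdly large
bad = (~np.isfinite(points).all(axis=1)
       | ~np.isfinite(values)
       | (np.abs(values) > 1e30))
print(f"{bad.sum()} suspicious entries out of {bad.size}")

# Drop them before calling griddata / NearestNDInterpolator
points_clean = points[~bad]
values_clean = values[~bad]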

Python numpy: create 2D array of values based on coordinates and plot with pcolormesh (heatmap)

I have arrays with latitude (Lat) and longitude (Lon), both 1D arrays of shape (5,).
Then I have another array with the values, C; this is also a 1D array of shape (5,). I would like to plot the whole thing with pcolormesh at the end, as a kind of heatmap plot!
Here is the corresponding code:
import numpy as np
import matplotlib.pyplot as plt
# Data
Lat = np.array([-65.62282562, -65.62266541, -65.62241364, -65.62398529, -65.62410736])
Lon = np.array([145.28251648, 145.38883972, 145.49528503, 121.4509201, 121.55738068, 121.66372681])
C = np.array([0., 0.5, 2, 3, 0])
# Plot
plt.pcolormesh(X, Y, C)
Then I get the following error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-164126d430da> in <module>()
1 # Plot
----> 2 plt.pcolormesh(X, Y, C)
/home/unix/anaconda2/lib/python2.7/site-packages/matplotlib/pyplot.pyc in pcolormesh(*args, **kwargs)
3091 ax.hold(hold)
3092 try:
-> 3093 ret = ax.pcolormesh(*args, **kwargs)
3094 finally:
3095 ax.hold(washold)
/home/unix/anaconda2/lib/python2.7/site-packages/matplotlib/__init__.pyc in inner(ax, *args, **kwargs)
1810 warnings.warn(msg % (label_namer, func.__name__),
1811 RuntimeWarning, stacklevel=2)
-> 1812 return func(ax, *args, **kwargs)
1813 pre_doc = inner.__doc__
1814 if pre_doc is None:
/home/unix/anaconda2/lib/python2.7/site-packages/matplotlib/axes/_axes.pyc in pcolormesh(self, *args, **kwargs)
5393 allmatch = (shading == 'gouraud')
5394
-> 5395 X, Y, C = self._pcolorargs('pcolormesh', *args, allmatch=allmatch)
5396 Ny, Nx = X.shape
5397
/home/unix/anaconda2/lib/python2.7/site-packages/matplotlib/axes/_axes.pyc in _pcolorargs(funcname, *args, **kw)
4993 if len(args) == 3:
4994 X, Y, C = [np.asanyarray(a) for a in args]
-> 4995 numRows, numCols = C.shape
4996 else:
4997 raise TypeError(
ValueError: need more than 1 value to unpack
So I would like to give each X-Y pair one C value: there are 5 XY pairs and 5 C values. In theory it should be no problem, but I really cannot find a solution!
You have two problems, one logical and one when you call pcolormesh:
The first is that C and Lat contain 5 values but Lon contains 6 values. That means you don't have 5 XY values and 5 C values, so that's something you should work out.
But it's possible to create a pcolormesh with several distinct coordinates if you expand your coordinates correctly:
import numpy as np
import matplotlib.pyplot as plt
plt.figure()
# Data
Lat = np.array([-65.62282562, -65.62266541, -65.62241364, -65.62398529, -65.62410736])
Lon = np.array([145.28251648, 145.38883972, 145.49528503, 121.4509201, 121.55738068])
C = np.array([0., 0.5, 2, 3, 0])
# Plot
plt.pcolormesh(np.expand_dims(Lat, 0), np.expand_dims(Lon, 1), C*np.eye(5))
The expand_dims calls make the dimensions broadcast correctly against each other, and np.eye ensures the diagonal cells take the values you assigned in C while every other coordinate is zero.
But the output probably won't look good because that's a really sparse coordinate frame.
There are other alternatives to pcolormesh; weighted histograms or contours could also be of interest:
Lat = np.random.normal(-65, 2, 50000)
Lon = np.random.normal(130, 5, 50000)
C = np.random.randint(0, 10, 50000)
plt.figure()
plt.hexbin(Lat, Lon, C=C, cmap=plt.cm.hot)
pcolormesh is for plotting meshes. Meshes are grids of values (for all three of lat, lon, and C), and pcolormesh will plot lines and quads connecting adjacent values within the grid.
You don't have a mesh (2D); you have at best a polyline (1D). That doesn't contain enough information for a unique heat map.
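One common way to recover a genuine heat map from scattered samples (not part of the original answers, just a standard approach) is to interpolate them onto a regular grid first. A minimal sketch with scipy.interpolate.griddata, using made-up sample data:
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import griddata

# Scattered samples, standing in for the question's Lat/Lon/C
lat = np.random.uniform(-66, -65, 50)
lon = np.random.uniform(121, 146, 50)
c = np.random.rand(50)

# Regular target grid spanning the data
grid_lat, grid_lon = np.meshgrid(np.linspace(lat.min(), lat.max(), 100),
                                 np.linspace(lon.min(), lon.max(), 100))

# Interpolate the scattered values onto the grid, then plot as a mesh
grid_c = griddata((lat, lon), c, (grid_lat, grid_lon), method="nearest")
plt.pcolormesh(grid_lat, grid_lon, grid_c)
plt.show()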
Pcolormesh is intended to be used with 2D variables. If you use np.eye for a variable with a large amount of data (as in your real case), you might run into memory problems.
One can use scatter in the following way to get a pcolormesh-like outcome from a 1D value array that has the same shape as the two coordinate arrays, with one value per coordinate pair.
import numpy as np
import matplotlib.pyplot as plt

# Data
Lat = np.array([-65.62282562, -65.62266541, -65.62241364, -65.62398529, -65.62410736])
Lon = np.array([145.28251648, 145.38883972, 145.49528503, 121.4509201, 121.55738068])
C = np.array([0., 0.5, 2, 3, 0])

# Plot: one colored marker per (Lat, Lon) pair
plt.scatter(Lat, Lon, c=C, s=10)
