Reading in Netcdf file and calculating RMSE in Python

I have a netCDF file with two variables: wspd_wrf_m and wspd_sodar_o. I want to read in the file and calculate the RMSE between wspd_wrf_m and wspd_sodar_o.
The variables have the dimensions (days, times), i.e. the shape (1094, 24).
I want to calculate the RMSE over the last 365 days of the file. Can you help me with this?
I know I need to use:
from netCDF4 import Dataset
import numpy as np
g = Dataset('','r',format='NETCDF3_64BIT')
wspd_wrf = g.variables["wspd_wrf_m"][:,:]
wspd_sodar = g.variables["wspd_sodar_o"][:,:]
But how do I select the last 365 days of hourly data that I need and calculate RMSE from this?

Selecting the last 365 days is a matter of slicing the arrays to the correct size. For example:
import numpy as np
var = np.zeros((1094, 24))
print(var.shape, var[729:,:].shape, var[-365:,:].shape)
which prints:
(1094, 24) (365, 24) (365, 24)
So both var[729:,:] and var[-365:,:] slice the last 365 days (with all hourly values) out of your 1094 day sized array.
There is more information, with more examples, in the NumPy manual.
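The question also asks for the RMSE itself, which the slicing alone does not give. A minimal sketch, assuming both variables are available as arrays of shape (1094, 24); the random arrays below are stand-ins for the real data, and np.ma.masked_invalid is used on the assumption that missing observations may be stored as NaN or already masked:

```python
import numpy as np

# Stand-in data with the same shape as the real variables (days, hours).
rng = np.random.default_rng(0)
wspd_wrf = rng.uniform(0, 20, size=(1094, 24))
wspd_sodar = wspd_wrf + rng.normal(0, 1.5, size=(1094, 24))

# Keep only the last 365 days (all 24 hourly values per day).
model = wspd_wrf[-365:, :]
obs = wspd_sodar[-365:, :]

# RMSE over every hourly value in the selected period.
diff = np.ma.masked_invalid(model - obs)  # skip pairs with missing values
rmse = float(np.sqrt((diff ** 2).mean()))
print(rmse)
```

Taking the mean along axis=0 instead, np.sqrt((diff ** 2).mean(axis=0)), would give one RMSE per hour of the day (24 values) rather than a single number.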
Calculate monthly mean from daily data for each year

I have seen many answers how to calculate the monthly mean from daily data across multiple years.
But what I want to do is to calculate the monthly mean from daily data for each year in my xarray separately. So, I want to end up with a mean for Jan 2020, Feb 2020 ... Dec 2024 for each lon/lat gridpoint.
My xarray has the dimensions Frozen({'time': 1827, 'lon': 180, 'lat': 90})
I tried using
var_resampled = var_diff.resample(time='1M').mean()
but this calculates the mean across all years (i.e. the mean for Jan 2020-2024).
I also tried
def mon_mean(x):
    return x.groupby('time.month').mean('time')
# group by year, then apply the function:
var_diff_mon = var_diff.groupby('time.year').apply(mon_mean)
This seems to do what I want but I end up with different dimensions (ie "month" and "year" instead of the original "time" dimension).
Is there a different way to calculate the monthly mean from daily data for each year separately or is there a way that the code using groupby above retains the same time dimension as before just with year and month now?
P.S. I also tried "cdo monmean", but as far as I understand this also just gives the monthly mean across all years.
I found a way using
def mon_mean(x):
    return x.groupby('time.month').mean('time')
# group by year, then apply the function:
var_diff_mon = var_diff.groupby('time.year').apply(mon_mean)
and then using
var_diff_mon.stack(time=("year", "month"))
to get my original time dimension back.
Is var_diff.resample(time='M') (or time='MS') doing what you expect?
Let's create a toy dataset like yours:
import numpy as np
import pandas as pd
import xarray as xr
dims = ('time', 'lat', 'lon')
time = pd.date_range("2021-01-01T00", "2023-12-31T23", freq="H")
lat = [0, 1]
lon = [0, 1]
coords = (time, lat, lon)
ds = xr.DataArray(data=np.random.randn(len(time), len(lat), len(lon)), coords=coords, dims=dims).rename("my_var")
ds = ds.to_dataset()
Let's resample it to monthly means, e.g. with ds.resample(time='MS').mean().
The dataset now has 36 time steps, one for each of the 36 months in the original dataset.
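For the daily 2020-2024 case in the question, a minimal sketch of the same idea. The array var_diff below is a synthetic stand-in on a tiny grid, not the asker's data. Unlike groupby('time.month'), resampling on a datetime coordinate keeps the years apart, so each calendar month of each year gets its own mean:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in for var_diff: daily data for 2020-2024 on a small grid.
time = pd.date_range("2020-01-01", "2024-12-31", freq="D")  # 1827 days
var_diff = xr.DataArray(
    np.random.randn(len(time), 3, 2),
    coords={"time": time, "lon": [0, 1, 2], "lat": [0, 1]},
    dims=("time", "lon", "lat"),
)

# One mean per calendar month of each year, labelled by the month start.
var_monthly = var_diff.resample(time="MS").mean()

print(var_monthly.sizes["time"])  # 60 (5 years x 12 months)
print(var_monthly.time.values[:2])
```

The result keeps a single time dimension, so no stacking of separate year and month dimensions is needed afterwards.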

Preparing Grid weather data for ConvLSTM2d

I am attempting to use a ConvLSTM2d model with hourly gridded weather data. I can get the data into a 4D array with the dimensions (num_hours, lat, lon, num_features). ConvLSTM2d requires 5D input, and I was planning on setting a sequence length of maybe 24 hours. My question is: how do I add a dimension to this array for the sequence length, i.e. (num_hours, sequence_length, lat, lon, num_features)? Is there a smarter, more efficient way to get the data into the correct form from a pandas dataframe that has columns for lat, lon, time, feature type and value?
I realize it is always easier to have a sample dataset when asking a question, so I created a set to mimic the issue.
import pandas as pd
import numpy as np
weather_variables = ['windspeed', 'temp','pressure']
lats = [x/10 for x in range(400,500,5)]
lons = [x/10 for x in range(900,1000,5)]
hours = pd.date_range('1/1/2021', '9/28/2021', freq= 'H')
df = []
for i in range(len(hours)):
    for weather in weather_variables:
        temp_df = pd.DataFrame(index=lats, columns=lons, data=np.random.randint(0, 100, size=(len(lats), len(lons))))
        temp_df = temp_df.unstack().to_frame()
        temp_df.reset_index(inplace=True)
        temp_df['weather_variable'] = weather
        temp_df['ts'] = hours[i]
        df.append(temp_df)  # collect each grid so pd.concat has something to join
df = pd.concat(df)
df.columns = ['lon','lat','value','weather_variable', 'ts']
So this code will create a dummy dataset containing 3 grids (one per weather variable) for each hour. The goal is to convert this into a 5D array of overlapping 24-hour sequences. I think the array would look like this: (len(hours)?, 24, 20, 20, 3)
From the ConvLSTM paper,
The weather radar data is recorded every 6 minutes, so there
are 240 frames per day. To get disjoint subsets for training, testing and validation, we partition each
daily sequence into 40 non-overlapping frame blocks and randomly assign 4 blocks for training, 1
block for testing and 1 block for validation. The data instances are sliced from these blocks using
a 20-frame-wide sliding window. Thus our radar echo dataset contains 8148 training sequences,
2037 testing sequences and 2037 validation sequences and all the sequences are 20 frames long (5
for the input and 15 for the prediction).
The paper's wording is ambiguous, but the numbers work out if each of the "non-overlapping frame blocks" is 40 frames long: 240 frames per day / 40 frames per block = 6 blocks per day, which matches the 4 + 1 + 1 blocks assigned to training, testing and validation. A 20-frame-wide sliding window then fits into each 40-frame block at 21 positions. You could take a similar approach: divide your data into non-overlapping blocks, assign whole blocks to training, testing and validation, and slide a fixed-length window within each block. Perhaps you use 6 hours of data to predict the next 6. I'm not sure that you need to keep the windows within a given day; a window from 11 pm to 1 am seems just as valid as one from, say, 3 am to 5 am.
I don't think Pandas will be an efficient way to massage the data. I would stick with NumPy or probably a TensorFlow data structure.
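Staying in NumPy, the sequence dimension can be created with a sliding window over the time axis. This is a sketch under assumptions: the random array stands in for the real 4D data, seq_len = 24 is just the value mentioned in the question, and sliding_window_view requires NumPy 1.20 or newer:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Stand-in for the real array: (num_hours, lat, lon, num_features).
num_hours, n_lat, n_lon, n_feat = 100, 20, 20, 3
data = np.random.rand(num_hours, n_lat, n_lon, n_feat).astype("float32")

seq_len = 24  # hypothetical sequence length in hours

# Windows along the time axis; the window dimension is appended last,
# so move it to position 1 to get (samples, time, lat, lon, features).
windows = sliding_window_view(data, window_shape=seq_len, axis=0)
sequences = np.moveaxis(windows, -1, 1)

print(sequences.shape)  # (77, 24, 20, 20, 3)
```

The result is a read-only view on the original array, so the overlapping windows cost no extra memory until they are copied. Taking sequences[::seq_len] keeps only non-overlapping windows.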

Apply Mann Whitney U test on multidimensional array and replace single values of variable of xarray data array in Python?

I'm new to Python and need some help with xarray.
I have two 3-dimensional data arrays (rlon, rlat, time) for future and past climate. I want to compute the Mann-Whitney U test for each grid point to analyse the significance of the temperature change in the future compared to the past. I already got the Mann-Whitney U test to work by selecting a time series from one grid point of the historical and the future data each. Example:
import numpy as np
import xarray as xr
import scipy.stats as sts
#selecting time period and grid point of past and future data
tp = fileHis['tas']
tf = fileFut['tas']
gridpoint_past=tp.sel(rlon=-6.375, rlat=1.375, time=slice('1999-01-01', '1999-01-31'))
gridpoint_future=tf.sel(rlon=-6.375, rlat=1.375, time=slice('2099-01-01', '2099-01-31'))
result=sts.mannwhitneyu(gridpoint_past, gridpoint_future, alternative='two-sided')
print('pvalue =',result[1])
pvalue = 0.05922372345359562
My problem now is that I need to do this for each grid point and each month, and in the end I would like to have a data array with p-values for each grid point and each month of the year.
I was thinking about looping through all rlat, rlon and months and running the Mann-Whitney U test for each, unless there is a better way to do it?
And how can I write the p-values one by one into a new data array with the same rlat, rlon dimensions?
I was trying this, but it does not work:
I created a data array pvalue_mon, which has the same rlat, rlon as tp and tf and has 12 months as time steps.
pvalue_mon.sel(rlon=-6.375, rlat=1.375, time=th.time.dt.month.isin([1])) = result[1]
SyntaxError: can't assign to function call
or this:
pvalue_mon.sel(rlon=-6.375, rlat=1.375, time=pvalue_mon.time.dt.month.isin([1])).update(result[1])
TypeError: 'numpy.float64' object is not iterable
How can I replace a single value of an existing variable?
Instead of using the .sel() function, try using .loc[], which supports assignment. .sel() returns a new object, so it cannot appear on the left-hand side of an assignment.
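A minimal sketch of how this could look. Everything here is an assumption for illustration: tp and tf are synthetic stand-ins for fileHis['tas'] and fileFut['tas'], pvalue_mon uses a "month" dimension instead of a time axis, and the axis argument of scipy.stats.mannwhitneyu (SciPy 1.7 or newer) is used to test all grid points at once instead of looping over rlat and rlon:

```python
import numpy as np
import pandas as pd
import xarray as xr
import scipy.stats as sts

rng = np.random.default_rng(0)
rlat = [1.125, 1.375]
rlon = [-6.625, -6.375, -6.125]

def make_year(year, offset):
    """Synthetic daily 'tas' for one year on a tiny (time, rlat, rlon) grid."""
    time = pd.date_range(f"{year}-01-01", f"{year}-12-31", freq="D")
    values = offset + rng.normal(size=(len(time), len(rlat), len(rlon)))
    return xr.DataArray(values, dims=("time", "rlat", "rlon"),
                        coords={"time": time, "rlat": rlat, "rlon": rlon})

tp = make_year(1999, 0.0)  # stand-in for fileHis['tas']
tf = make_year(2099, 1.0)  # stand-in for fileFut['tas'], shifted warmer

# Empty result array: one p-value per month and grid point.
pvalue_mon = xr.DataArray(
    np.full((12, len(rlat), len(rlon)), np.nan),
    dims=("month", "rlat", "rlon"),
    coords={"month": np.arange(1, 13), "rlat": rlat, "rlon": rlon},
)

# Single grid point and month, as in the question, written with .loc:
gp_past = tp.sel(rlon=-6.375, rlat=1.375, time=slice("1999-01-01", "1999-01-31"))
gp_future = tf.sel(rlon=-6.375, rlat=1.375, time=slice("2099-01-01", "2099-01-31"))
result = sts.mannwhitneyu(gp_past.values, gp_future.values, alternative="two-sided")
pvalue_mon.loc[dict(month=1, rlat=1.375, rlon=-6.375)] = result.pvalue

# All grid points, month by month: axis=0 runs the test along time.
for m in range(1, 13):
    past = tp.isel(time=np.flatnonzero(tp.time.dt.month.values == m))
    future = tf.isel(time=np.flatnonzero(tf.time.dt.month.values == m))
    res = sts.mannwhitneyu(past.values, future.values,
                           alternative="two-sided", axis=0)
    pvalue_mon.loc[dict(month=m)] = res.pvalue

print(pvalue_mon.sel(month=1))
```

The same .loc[dict(...)] pattern would also work inside an explicit loop over rlat and rlon if the vectorised call is not available.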

Applying K means clustering to 3 dim data

I am trying to apply k-means clustering in sklearn on a (52,168,2) dimensional dataset. As expected, it's giving dimension error for the estimator as 2D data is expected. What should be the way forward?
I have wind and load data in two separate files for a year, with one week of data (one-hour resolution) in each row of both files. The wind and load data are correlated (i.e., week 1 wind data corresponds to week 2). I am trying to apply k-means clustering to reduce the operating periods from 52 weeks to an appropriate number of weeks (ideally 12). Hence, each data point in this case is a 168*2 NumPy array with weekly wind and load data combined.
The dimension of the data comes out as (52,168,2), since I have 52 weeks and each data point is 168*2. However, I can't pass this to sklearn k-means as it requires 2D data. I am wondering: if I reshape the data as data.reshape(52,168*2), will it preserve what I am aiming to do?
import numpy as np
import pandas as pd
Load_data = pd.read_csv('Scenario_Load_Data.csv', header=None)
Load_data_final = Load_data.to_numpy()
Wind_data = pd.read_csv('Scenario_Wind_Data.csv', header=None)
Wind_data_final = Wind_data.to_numpy()
create_list = []
for i in range(len(Load_data_final)):
    # one (168, 2) array per week: column 0 is load, column 1 is wind
    intermediate_v = np.column_stack((Load_data_final[i,:], Wind_data_final[i,:]))
    create_list.append(intermediate_v)
data = np.array(create_list)  # shape (52, 168, 2)
ValueError: Found array with dim 3. Estimator expected <= 2.
Since you want to group by week, I believe you can concatenate the wind and load data in the same array: each week becomes one row, and the 168 + 168 hourly values become its attributes. So you'll have something like:
Week_1: at1, at2, at3, ..., at336
Week_2: at1, at2, at3, ..., at336
...
Week_52: at1, at2, at3, ..., at336
So I think it's pretty much what you're intending to do with reshape.
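To check that the reshape and the concatenation are interchangeable for clustering, here is a small sketch with random stand-in data (the real values would come from the two CSV files). The two layouts only differ in column order, and Euclidean distance does not depend on the order of the features:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the two CSV files: 52 weeks x 168 hourly values each.
load = rng.uniform(50, 100, size=(52, 168))
wind = rng.uniform(0, 30, size=(52, 168))

# Layout A: stack to (52, 168, 2), then flatten; load and wind interleaved.
data = np.stack((load, wind), axis=-1)
X_reshape = data.reshape(52, 168 * 2)

# Layout B: all load columns followed by all wind columns.
X_concat = np.hstack((load, wind))

# Same numbers per week, different column order, so the Euclidean
# distance between any two weeks is identical in both layouts.
d_a = np.linalg.norm(X_reshape[0] - X_reshape[1])
d_b = np.linalg.norm(X_concat[0] - X_concat[1])
print(X_reshape.shape, X_concat.shape, np.isclose(d_a, d_b))
# (52, 336) (52, 336) True
```

Either 2D array could then be passed to sklearn.cluster.KMeans(n_clusters=12).fit(X). Since load and wind are likely on different scales, it may be worth scaling them to comparable ranges first, otherwise the larger-valued variable dominates the distances.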

Reading and manipulating multiple netcdf files in python

I need help with reading multiple netCDF files. Despite a few examples on here, none of them works properly.
I am using Python(x,y) vers 2.7.5, and other packages: netcdf4 1.0.7-4, matplotlib 1.3.1-4, numpy 1.8, pandas 0.12, basemap 1.0.2...
There are a few things I'm used to doing with GrADS that I need to start doing in Python.
I have 2 meter temperature data (4 times daily, one file per year, from ECMWF). Each file contains 2 meter temperature data with Xsize=480, Ysize=241, Zsize(level)=1, Tsize(time) = 1460, or 1464 for leap years.
My file names look like this: ... etc.
Based on this page:
( Loop through netcdf files and run calculations - Python or R )
Here is where I am now:
from pylab import *
import netCDF4 as nc
from netCDF4 import *
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import numpy as np
f = nc.MFDataset('d:/data/ecmwf/t2m.????.nc') # as '????' being the years
t2mtr = f.variables['t2m']
ntimes, ny, nx = shape(t2mtr)
temp2m = zeros((ny,nx),dtype=float64)
print ntimes
for i in xrange(ntimes):
    temp2m += t2mtr[i,:,:] #I'm not sure how to slice this, just wanted to get the 00Z values.
    # is it possible to assign to a new array,...
    #... (for eg.) the average values of 00z for January only from 1981-2000?
#creating a NetCDF file
nco = nc.Dataset('d:/data/ecmwf/','w',clobber=True)
temp2m_v = nco.createVariable('t2m', 'i4', ( 'y', 'x'))
temp2m_v.long_name='2 meter Temperature'
temp2m_v.grid_mapping = 'Lambert_Conformal' # can it be something else or ..
#... eliminated?).This is straight from the solution on that webpage.
lono = nco.createVariable('longitude','f8')
lato = nco.createVariable('latitude','f8')
xo = nco.createVariable('x','f4',('x')) #not sure if this is important
yo = nco.createVariable('y','f4',('y')) #not sure if this is important
lco = nco.createVariable('Lambert_Conformal','i4') #not sure
#copy all the variable attributes from original file
for var in ['longitude','latitude']:
for att in f.variables[var].ncattrs():
# copy variable data for lon,lat,x and y
# write the temp at 2 m data
# copy Global attributes from original file
for att in f.ncattrs():
nco.Conventions='CF-1.6' #not sure what is this.
#attempt to plot the 00zJan mean
map = Basemap(projection='cyl',llcrnrlat=0.,urcrnrlat=10.,llcrnrlon=97.,urcrnrlon=110.,resolution='i')
cs = map.contourf(x,y,t2mtr,clevs,extend='both')
My first question is about temp2m += t2mtr[i,:,:]. I am not sure how to slice the data to get only 00Z (let's say for January only) from all files.
Second, while running the test, an error came up at cs = map.contourf(x,y,t2mtr,clevs,extend='both') saying "shape does not match that of z: found (1,1) instead of (241,480)". I know there is probably an error in the output data, due to a mistake in recording the values, but I can't figure out what or where.
Thanks for your time. I hope this is not confusing.
So t2mtr is a 3d array:
ntimes, ny, nx = shape(t2mtr)
This sums all values across the 1st axis:
for i in xrange(ntimes):
    temp2m += t2mtr[i,:,:]
A better way to do this is:
temp2m = np.sum(t2mtr, axis=0)
temp2m = t2mtr.sum(axis=0) # alt
If you want the average, use np.mean instead of np.sum.
To average across a subset of the times, jan_times, use an expression like:
jan_avg = np.mean(t2mtr[jan_times,:,:], axis=0)
This is simplest if you want just a simple range, e.g. the first 30 times. For simplicity I'm assuming the data is daily and years are constant length. You can adjust things for the four-times-daily sampling (1460 time steps per year) and leap years.
A simplistic way on getting Jan data for several years is to construct an index like:
yr_starts = np.arange(0,3)*365 # can adjust for leap years
jan_times = (yr_starts[:,None]+ np.arange(31)).flatten()
# array([ 0, 1, 2, ... 29, 30, 365, ..., 756, 757, 758, 759, 760])
Another option would be to reshape t2mtr (doesn't work well for leap years):
t2mtr.reshape(nyrs, 365, ny, nx)[:,0:31,:,:].mean(axis=1)
You could test the time sampling with something like:
Doesn't the data set have a time variable? You might be able to extract the desired time indices from that. I worked with ECMWF data a number of years ago, but don't remember a lot of the details.
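If the time variable is available, a boolean mask built from it avoids the index arithmetic and handles leap years automatically. This is a sketch under assumptions: the time axis below is synthetic (four steps per day, 1981-1983), standing in for the file's time values after decoding them to dates (for instance with netCDF4.num2date), and t2m is a small random array in place of the real (ntimes, ny, nx) data:

```python
import numpy as np

# Hypothetical decoded time axis: 00, 06, 12, 18 UTC for 1981-1983.
times = np.arange(np.datetime64("1981-01-01T00"),
                  np.datetime64("1984-01-01T00"),
                  np.timedelta64(6, "h"))
t2m = np.random.rand(times.size, 4, 5)  # small (ntimes, ny, nx) stand-in

# Month number (1-12) and hour of day (0-23) for every time step.
months = times.astype("datetime64[M]").astype(int) % 12 + 1
hours = (times - times.astype("datetime64[D]")).astype(int)

# Select January 00Z steps from all years and average over time.
jan_00z = (months == 1) & (hours == 0)
jan_00z_mean = t2m[jan_00z].mean(axis=0)

print(jan_00z.sum(), jan_00z_mean.shape)  # 93 (4, 5)
```

The mask picks 31 time steps per year, one per January day, so the mean has the shape of a single (ny, nx) field.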
As for your contourf error, I would check the shape of the 3 main arguments: x,y,t2mtr. They should match. I haven't worked with Basemap.

