I have seen many answers on how to calculate the monthly mean from daily data across multiple years.
But what I want to do is to calculate the monthly mean from daily data for each year in my xarray separately. So, I want to end up with a mean for Jan 2020, Feb 2020 ... Dec 2024 for each lon/lat gridpoint.
My xarray has the dimensions Frozen({'time': 1827, 'lon': 180, 'lat': 90})
I tried using
var_resampled = var_diff.resample(time='1M').mean()
but this calculates the mean across all years (i.e. the mean for Jan 2020-2024).
I also tried
def mon_mean(x):
    return x.groupby('time.month').mean('time')
# group by year, then apply the function:
var_diff_mon = var_diff.groupby('time.year').apply(mon_mean)
This seems to do what I want, but I end up with different dimensions (i.e. "month" and "year" instead of the original "time" dimension).
Is there a different way to calculate the monthly mean from daily data for each year separately, or is there a way for the groupby code above to retain the original time dimension, just with year and month now?
P.S. I also tried "cdo monmean", but as far as I understand this also just gives the monthly mean across all years.
Thanks!
Solution
I found a way using
def mon_mean(x):
    return x.groupby('time.month').mean('time')
# group by year, then apply the function:
var_diff_mon = var_diff.groupby('time.year').apply(mon_mean)
and then using
var_diff_mon.stack(time=("year", "month"))
to get my original time dimension back
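Note that after the stack, "time" is a (year, month) MultiIndex rather than real timestamps. If actual datetimes are needed, a rough sketch for rebuilding them (an assumption on my part, not strictly required) could look like this:
import pandas as pd

stacked = var_diff_mon.stack(time=("year", "month"))

# rebuild timestamps (first day of each month) from the (year, month) levels
new_time = pd.to_datetime(
    pd.DataFrame({"year": stacked["year"].values,
                  "month": stacked["month"].values,
                  "day": 1})
)
stacked = (
    stacked.reset_index("time")                            # drop the MultiIndex
           .assign_coords(time=("time", new_time.values))  # attach datetime coordinate
           .sortby("time")
)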
Is var_diff.resample(time='M') (or time='MS') doing what you expect?
Let's create a toy dataset like yours:
import numpy as np
import pandas as pd
import xarray as xr
dims = ('time', 'lat', 'lon')
time = pd.date_range("2021-01-01T00", "2023-12-31T23", freq="H")
lat = [0, 1]
lon = [0, 1]
coords = (time, lat, lon)
ds = xr.DataArray(data=np.random.randn(len(time), len(lat), len(lon)), coords=coords, dims=dims).rename("my_var")
ds = ds.to_dataset()
ds
Let's resample it:
ds.resample(time="MS").mean()
The dataset now has 36 time steps, one for each of the 36 months in the original dataset.
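A quick check on the toy example (just a sketch) confirms that the resampled time axis keeps one entry per month of each year, i.e. the per-year monthly means the question asks for:
monthly = ds.resample(time="MS").mean()

print(monthly.sizes["time"])    # 36 months: Jan 2021 ... Dec 2023
print(monthly.time.values[:3])  # 2021-01-01, 2021-02-01, 2021-03-01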
Related
I am aiming to calculate a daily climatology from a dataset, i.e. obtain the sea surface temperature (SST) for each day of the year by averaging over all the years (for example, for January 1st, the average SST of every January 1st from 1982 to 2018). To do so, I took the following steps:
DATA PREPARATION STEPS
Here is a Drive link to both datasets to make the code reproducible:
link to datasets
First, I load two datasets:
ds1 = xr.open_dataset('./anomaly_dss/archive_to2018.nc') #from 1982 to 2018
ds2 = xr.open_dataset('./anomaly_dss/realtime_from2018.nc') #from 2018 to present
Then I convert them to pandas DataFrames and merge both into one:
ds1 = ds1.where(ds1.time > np.datetime64('1982-01-01'), drop=True) # Grab all data since 1/1/1982
ds2 = ds2.where(ds2.time > ds1.time.max(), drop=True) # Grab all data since the end of the archive
# Convert to Pandas Dataframe
df1 = ds1.to_dataframe().reset_index().set_index('time')
df2 = ds2.to_dataframe().reset_index().set_index('time')
# Merge these datasets
df = df1.combine_first(df2)
So far, this is what my dataframe looks like:
NOTE THAT THE LAT,LON GOES FROM LAT(35,37.7), LON(-10,-5), THIS MUST REMAIN LIKE THAT
ANOMALY CALCULATION STEPS
# Anomaly calculation
def standardize(x):
    return (x - x.mean())/x.std()

# Calculate a daily average
df_daily = df.resample('1D').mean()

# Calculate the anomaly for each day of the year
df_daily['anomaly'] = df_daily['analysed_sst'].groupby(df_daily.index.dayofyear).transform(standardize)
I obtain the following dataframe:
As you can see, I obtain the mean values of all three variables.
QUESTION
As I want to plot the climatology data on a map, I DO NOT want the lat/lon variables to be averaged into one point. I need the anomaly at all lat/lon points, and I don't really know how to achieve that.
Any help would be very appreciated!!
I think you can do all that in a simpler and more straightforward way without converting your dataarray to a dataframe:
import os
import xarray as xr

# Will open and combine the 2 datasets automatically
DS = xr.open_mfdataset(os.path.join('./anomaly_dss', '*.nc'))
da = DS.analysed_sst

# Resampling to daily means
da = da.resample(time='1D').mean()

# Anomaly calculation
def standardize(x):
    return (x - x.mean())/x.std()

da_anomaly = da.groupby(da.time.dt.dayofyear).apply(standardize)
Then you can plot the anomaly for any day with:
da_anomaly[da_anomaly.dayofyear == 1].plot()
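If what you ultimately want on the map is the climatology itself (the mean SST for each day of the year at every grid point) rather than the standardized anomaly, the same groupby idea works while keeping lat/lon. A minimal sketch, assuming da is the daily-resampled DataArray from above:
# average over the years for each calendar day; lat/lon are preserved
climatology = da.groupby(da.time.dt.dayofyear).mean('time')

# e.g. a map of the mean SST for January 1st across all years
climatology.sel(dayofyear=1).plot()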
I am attempting to use a ConvLSTM2D model with hourly grid weather data. I can get the data into a 4D array with these dimensions: (num_hours, lat, lon, num_features). ConvLSTM2D requires 5D input, and I was planning on setting a variable for a sequence length of maybe 24 hours. My question is: how do I create the additional sequence-length dimension in this array, to get (num_hours, sequence_length, lat, lon, num_features)? Is there a smarter, more efficient way to get the data into the correct form from a pandas DataFrame that has columns for lat, lon, time, feature type and value?
I realize it is always easier to have a sample dataset when asking a question, so I created one to mimic the issue.
import pandas as pd
import numpy as np

weather_variables = ['windspeed', 'temp', 'pressure']
lats = [x/10 for x in range(400, 500, 5)]
lons = [x/10 for x in range(900, 1000, 5)]
hours = pd.date_range('1/1/2021', '9/28/2021', freq='H')

df = []
for i in range(len(hours)):
    for weather in weather_variables:
        temp_df = pd.DataFrame(index=lats, columns=lons,
                               data=np.random.randint(0, 100, size=(len(lats), len(lons))))
        temp_df = temp_df.unstack().to_frame()
        temp_df.reset_index(inplace=True)
        temp_df['weather_variable'] = weather
        temp_df['ts'] = hours[i]
        df.append(temp_df)

df = pd.concat(df)
df.columns = ['lon', 'lat', 'value', 'weather_variable', 'ts']
So this code creates a dummy dataset containing 3 grids (one per weather variable) for each hour. The goal is to convert this into a 5D array of overlapping 24-hour sequences. The array would look like this, I think: (len(hours)?, 24, 20, 20, 3).
From the ConvLSTM paper,
The weather radar data is recorded every 6 minutes, so there
are 240 frames per day. To get disjoint subsets for training, testing and validation, we partition each
daily sequence into 40 non-overlapping frame blocks and randomly assign 4 blocks for training, 1
block for testing and 1 block for validation. The data instances are sliced from these blocks using
a 20-frame-wide sliding window. Thus our radar echo dataset contains 8148 training sequences,
2037 testing sequences and 2037 validation sequences and all the sequences are 20 frames long (5
for the input and 15 for the prediction).
If my calculations are correct, each of the "non-overlapping frame blocks" should have 6 frames in it (240 frames per day / 40 blocks per day = 6 frames per block). So I'm not sure how you create a 20-frame-wide sliding window in a given block. Nonetheless, you could take a similar approach: divide your data into non-overlapping windows of a specific length. Perhaps you use 6 hours of data to predict the next 6. I'm not sure that you need to keep the windows within a given day--a change from 11 pm to 1 am seems just as valid a time window as from say 3 am to 5 am.
I don't think Pandas will be an efficient way to massage the data. I would stick with NumPy or probably a TensorFlow data structure.
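For the reshaping itself, one way to carve overlapping 24-hour sequences out of the 4D array with NumPy is numpy.lib.stride_tricks.sliding_window_view. A sketch, assuming NumPy >= 1.20 and an array shaped (num_hours, lat, lon, num_features) as described in the question:
import numpy as np

# placeholder array standing in for the real (num_hours, lat, lon, num_features) data
data = np.random.rand(1000, 20, 20, 3)

seq_len = 24
# windows over the time axis: shape (num_hours - seq_len + 1, lat, lon, features, seq_len)
windows = np.lib.stride_tricks.sliding_window_view(data, seq_len, axis=0)
# move the window axis next to the front: (num_sequences, seq_len, lat, lon, features)
sequences = np.moveaxis(windows, -1, 1)
print(sequences.shape)  # (977, 24, 20, 20, 3)
Note that sliding_window_view returns a view, so no data is copied until the result is materialized (e.g. when it is fed to the model).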
I am trying to run an OLS regression on a daily basis of Y against X, as depicted below. The period is January 2008 to December 2011. The idea is to get the daily parameters from which volatility will be inferred. I have managed to run regressions for the first day, but the other days seemingly take on values from the first day. Since the number of observations per day is not the same, this is difficult for me as a newbie.
Kindly help me rectify the code below.
import statsmodels.api as sm
from statsmodels.formula.api import ols

# group by index date to enable recursive regressions by day
Kempf_etal2015_model = np.zeros((1, 5))
intercept = np.zeros((1, 1))
betas = np.zeros((1, 4))

for idx, day in df_fp.groupby(df_fp.index.date):
    # for i in (day):
    for i in range(len(day)):
        for j in range(len(df_fp)):
            spreadlag1 = df_fp['relative_spread'].shift(-1)
            c_spreeeed = df_fp['relative_spread'] - df_fp['relative_spread'].shift(-1)
            c_spreeeed_lag1 = df_fp['relative_spread'] - df_fp['relative_spread'].shift(-2)
            c_spreeeed_lag2 = df_fp['relative_spread'] - df_fp['relative_spread'].shift(-3)
            c_spreeeed_lag3 = df_fp['relative_spread'] - df_fp['relative_spread'].shift(-4)
            Kempf_etal2015_model = ols('c_spreeeed ~ spreadlag1 + c_spreeeed_lag1 + c_spreeeed_lag2 + c_spreeeed_lag3', df_fp).fit()
            intercept = Kempf_etal2015_model.params[:0]
            betas = Kempf_etal2015_model.params[1:]
            print(Kempf_etal2015_model.params)
I end up getting the results below:
Intercept 0.000102
spreadlag1 -0.104292
c_spreeeed_lag1 0.430733
c_spreeeed_lag2 0.020808
c_spreeeed_lag3 0.072575
dtype: float64
Intercept 0.000102
spreadlag1 -0.104292
c_spreeeed_lag1 0.430733
c_spreeeed_lag2 0.020808
c_spreeeed_lag3 0.072575
dtype: float64
I have a netcdf file. I have two variables in this file: wspd_wrf_m and wspd_sodar_o. I want to read in the netcdf file and calculate the RMSE value between wspd_wrf_m and wspd_sodar_o.
The variables have the dimensions (days, times), which is (1094, 24).
I want to calculate the RMSE from the last 365 days of the files. Can you help me with this?
I know I need to use:
from netCDF4 import Dataset
import numpy as np
g = Dataset('station_test_new.nc','r',format='NETCDF3_64BIT')
wspd_wrf = g.variables["wspd_wrf_m"][:,:]
wspd_sodar = g.variables["wspd_sodar_o"][:,:]
But how do I select the last 365 days of hourly data that I need and calculate RMSE from this?
Selecting the last 365 days is a matter of slicing the arrays to the correct size. For example:
import numpy as np
var = np.zeros((1094, 24))
print(var.shape, var[729:,:].shape, var[-365:,:].shape)
which prints:
(1094, 24) (365, 24) (365, 24)
So both var[729:,:] and var[-365:,:] slice the last 365 days (with all hourly values) out of your 1094-day array.
There is more information and there are more examples in the NumPy manual.
There are plenty of examples of how to calculate the RMSE in Python (e.g. this one). Please give that a try, and if you can't get it to work, update your question with your attempts.
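For reference, the last step could be as simple as the following sketch (assuming the arrays read in above and RMSE taken over all days and hours of the final year):
# slice the last 365 days of both variables
wrf = wspd_wrf[-365:, :]
sodar = wspd_sodar[-365:, :]

# root-mean-square error over all remaining values
rmse = np.sqrt(np.mean((wrf - sodar)**2))
print(rmse)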
I have time series data in a pandas Series object. The values are floats representing the size of an event. Together, event times and sizes tell me how busy my system is. I'm using a scatterplot.
I want to look at traffic patterns over various time periods, to answer questions like "Is there a daily spike at noon?" or "Which days of the week have the most traffic?" Therefore I need to overlay data from many different periods. I do this by converting timestamps to timedeltas (first by subtracting the start of the first period, then by doing a mod with the period length).
Now my index uses time intervals relative to an "abstract" time period, like a day or a week. I would like to produce plots where the x-axis shows something other than just nanoseconds. Ideally it would show month, day of the week, hour, etc. depending on timescale as you zoom in and out (as Bokeh graphs generally do for time series).
The code below shows an example of how I currently plot. The resulting graph has an x-axis in units of nanoseconds, which is not what I want. How do I get a smart x-axis that behaves more like what I would see for timestamps?
import numpy as np
import pandas as pd
from bokeh.charts import show, output_file
from bokeh.plotting import figure
oneDay = np.timedelta64(24*60*60,'s')
fourHours = 4 * 60 * 60 * 1000000000 # four hours in nanoseconds (ugly)
time = [pd.Timestamp('2015-04-27 01:00:00'), # a Monday
pd.Timestamp('2015-05-04 02:00:00'), # a Monday
pd.Timestamp('2015-05-11 03:00:00'), # a Monday
pd.Timestamp('2015-05-12 04:00:00') # a Tuesday
]
resp = [2.0, 1.3, 2.6, 1.3]
ts = pd.Series(resp, index=time)
days = dict(list(ts.groupby(lambda x: x.weekday())))
monday = days[0] # this TimeSeries consists of all data for all Mondays
# Now convert timestamps to timedeltas
# First subtract the timestamp of the starting date
# Then take the remainder after dividing by one day
# Result: each index value is in the 24 hour range [00:00:00, 23:59:59]
tdi = monday.index - pd.Timestamp(monday.index.date[0])
x = pd.TimedeltaIndex([td % oneDay for td in tdi])
y = monday.values
output_file('bogus.html')
xmax = fourHours # why doesn't np.timedelta64 work for xmax?
fig = figure(x_range=[0,xmax], y_range=[0,4])
fig.circle(x, y)
show(fig)