Calculating 30 year climate normal from gridded dataset in Python - python

I am trying to calculate the 30 year temperature normal (1981-2010 average) for the NARR daily gridded data set linked below.
In the end for each grid point I want an array that contains 365 values, each of which contains the average temperature of that day calculated from the 30 years of data for that day. For example the first value in each grid point's array would be the average Jan 1 temperature calculated from the 30 years (1981-2010) of Jan 1 temperature data for that grid point. My end goal is to be able to use this new 30yrNormal array to calculate daily temperature anomalies from.
So far I have only been able to calculate anomalies from one year worth of data. The problem with this is that it is taking the difference between the daily temperature and the average for the whole year rather then the difference between the daily temperature and the 30 year average of that daily temperature:
file='air.sfc.2018.nc'
ncin = Dataset(file,'r')
#put data into numpy arrays
lons=ncin.variables['lon'][:]
lats=ncin.variables['lat'][:]
lats1=ncin.variables['lat'][:,0]
temp=ncin.variables['air'][:]
ncin.close()
AvgT=np.mean(temp[:,:,:],axis=0)
#compute anomalies by removing time-mean
T_anom=temp-AvgT
Data:
ftp://ftp.cdc.noaa.gov/Datasets/NARR/Dailies/monolevel/
For the years 1981-2010

This is most easily solved using CDO.
You can use my package, nctoolkit (https://nctoolkit.readthedocs.io/en/latest/ & https://pypi.org/project/nctoolkit/) if you are working with Python on Linux. This uses CDO as a backend.
Assuming that the 30 files are a list called ff_list. The code below should work.
First you would create the 30 year daily mean climatology.
import nctoolkit as nc
mean_30 = nc.open_data(ff_list)
mean_30.merge_time()
mean_30.drop(month=2,day=29)
mean_30.tmean("day")
mean_30.run()
Then you would subtract this from the daily figures to get the anomalies.
anom_30 = nc.open_data(ff_list)
anom_30.cdo_command("del29feb")
anom_30.subtract(mean_30)
anom_30.run()
This should have the anomalies
One issue is whether the files have leap years or how you want to handle leap years if they exists. CDO has an undocumented command -delfeb29, which I have used above

Related

how can I do pandas dataframe 'decadal resample' over a 'structured' or even 10 years?

I have searched for a while, but nothing related to my question is found.
So I post a new thread.
I have a simple dataset which is read in by pandas as dataframe, with some daily data starting on 1951-08-01, ending on 2018-10-01.
Now I want to down-sample the data to decadal mean, so I can simply do df.resample('10A').mean()['some data'].
This gives me 8 data points, which are at 1951-12, 1961-12, 1971-12, 1981-12, 1991-12, 2001-12, 2011-12, 2021-12. This indicates that the decadal mean values are calculated for year 1951 separately, years 1952-1961, 1962-1971, etc.
I wonder if it is possible to calculate the decadal mean values every 'structured' 10 years?
for example, the decadal mean values are calculated betwen 1950-1959, 1960-1969, 1970-1979, etc.
Any help is appreciated!
You can calculate the decade separately and group on that:
decade = df['Date'].dt.year.floordiv(10).mul(10)
df.groupby(decade)['Value'].mean()

How to calculate moving average incrementally with daily data added to data frame in pandas?

I have daily data and want to calculate 5 days, 30 days and 90 days moving average per user and write out to a CSV. New data comes in everyday. How do I calculate these averages for the new data only, assuming I will load the data frame with last 89 days data plus today's data.
date user daily_sales 5_days_MA 30_days_MV 90_days_MV
2019-05-01 1 34
2019-05-01 2 20
....
2019-07-18 .....
The number of rows per day is about 1 million. If data for 90days is too much, 30 days is OK
You can apply rolling() method on your dataset if it's in DataFrame format.
your_df['MA_30_days'] = df[where_to_apply].rolling(window = 30).mean()
If you need different window on which moving average will be calculated just change window parameter. In my example I used mean() to calculate but you can choose some other statistic as well.
This code will create another column named 'MA_30_days' with calculated moving average in your DataFrame.
You can also create another DataFrame where you will collect and loop over your dataset to calculate all moving averages and save it to CSV format as you wanted.
your_df.to_csv('filename.csv')
In your case to calculation should be consider only the newest data. If you want to perform this on latest data just slice it. However the very first rows will be NaN (depends on window).
df[where_to_apply][-90:].rolling(window = 30).mean()
This will calculate moving average on last 90 rows of specific column in some df and first 29 rows would be NaN. If your latest 90 rows should be all meaningful data than you can start calculation earlier than on last 90 rows - depends on window size.
if the df already contains yesterday's moving average, and just the new day's Simple MA is required, I would say use this approach:
MAlength=90
df.loc[day-1:'MA']=(
(df.loc[day-1:'MA']*MAlength) #expand yesterday's MA value
-df.loc[day-MAlength:'Price'] #remove oldest price
+df.loc[day-MAlength:'Price'] #add newest price
)/MAlength #re-average

Calculating and plotting a 20 year Climatology

I am working on plotting a 20 year climatology and have had issues with averaging.
My data is hourly data since December 1999 in CSV format. I used an API to get the data and currently have it in a pandas data frame. I was able to split up hours, days, etc like this:
dfROVC1['Month'] = dfROVC1['time'].apply(lambda cell: int(cell[5:7]))
dfROVC1['Day'] = dfROVC1['time'].apply(lambda cell: int(cell[8:9]))
dfROVC1['Year'] = dfROVC1['time'].apply(lambda cell: int(cell[0:4]))
dfROVC1['Hour'] = dfROVC1['time'].apply(lambda cell: int(cell[11:13]))
So I averaged all the days using:
z=dfROVC1.groupby([dfROVC1.index.day,dfROVC1.index.month]).mean()
That worked, but I realized I should take the average of the mins and average of the maxes of all my data. I have been having a hard time figuring all of this out.
I want my plot to look like this:
Monthly Average Section
but I can't figure out how to make it work.
I am currently using Jupyter Notebook with Python 3.
Any help would be appreciated.
Is there a reason you didn't just use datetime to convert your time column?
The minimums by month would be:
z=dfROVC1.groupby(['Year','Month']).min()

Take maximum rainfall value for each season over a time period (xarray)

I'm trying to find the maximum rainfall value for each season (DJF, MAM, JJA, SON) over a 10 year period. I am using netcdf data and xarray to try and do this. The data consists of rainfall (recorded every 3 hours), lat, and lon data. Right now I have the following code:
ds.groupby('time.season).max('time')
However, when I do it this way the output has a shape of (4,145,192) indicating that it's taking the maximum value for each season over the entire period. I would like the maximum for each individual season every year. In other words, output should have something with a shape like (40,145,192) (4 values for each year x 10 years)
I've looked into trying to do this with DataSet.resample as well using time=3M as the frequency, but then it doesn't split the months up correctly. If I have to I can alter the dataset, so it starts in the correct place, but I was hoping there would be an easier way considering there's already a function to group it correctly.
Thanks and let me know if you need anymore details!
Resample is going to be the easiest tool for this job. You are close with the time frequency but you probably want to use the quarterly frequency with an offset:
ds.resample(time='QS-Mar').max('time')
These offsets can be further configured as described in the Pandas documentation: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases

Manipulating a non-evenly spaced data series in python

Hello guys I've trying to plot a bunch of data of some measurements taken in uneven intervals of time and make a cubic spline interpolation of it. Here is a sample of the data:
1645 2 28 .0
1645 6 30 .0
1646 6 30 .0
1646 7 31 .0
The first column corresponds to the year which the measurement was made, the second column is the month, the third one is the number of measurements and the fourth one is the standard deviation of the measurements.
The thing is that I can't seem to figure out how to make a scatter plot of the data keeping the "unevenness" of the intervals of measurement. Also I'm not quite sure how to implement the interpolation cause I don't know what should be my x value for the data points (months maybe?)
Any advice or help would be greatly appreciated. Thank You.
Btw I'm working with python and using Scipy.
For x, you could either convert the year and month to a datetime object:
np.datetime64('2005-02')
Or convert it to months (assuming 1645 is your first value):
CumulativeMonth = (year - 1645) * 12 + month

Categories

Resources