Manipulating a non-evenly spaced data series in Python

Hello guys, I've been trying to plot a bunch of data from measurements taken at uneven intervals of time and to make a cubic spline interpolation of it. Here is a sample of the data:
1645 2 28 .0
1645 6 30 .0
1646 6 30 .0
1646 7 31 .0
The first column is the year in which the measurement was made, the second is the month, the third is the number of measurements, and the fourth is the standard deviation of the measurements.
The thing is that I can't seem to figure out how to make a scatter plot of the data that keeps the "unevenness" of the measurement intervals. I'm also not quite sure how to implement the interpolation, because I don't know what the x value for the data points should be (months, maybe?).
Any advice or help would be greatly appreciated. Thank You.
Btw, I'm working with Python and using SciPy.

For x, you could either convert the year and month to a datetime object:
np.datetime64('2005-02')
Or convert it to months (assuming 1645 is your first value):
CumulativeMonth = (year - 1645) * 12 + month
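A minimal sketch of the whole pipeline, assuming the four columns sit in a whitespace-separated file (measurements.txt is a hypothetical filename) and using cumulative months as x:

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import CubicSpline

# columns: year, month, number of measurements, standard deviation
year, month, n, std = np.loadtxt("measurements.txt", unpack=True)

# cumulative months preserve the uneven spacing between measurements
x = (year - year[0]) * 12 + month

cs = CubicSpline(x, n)  # x must be strictly increasing
x_fine = np.linspace(x.min(), x.max(), 500)

plt.scatter(x, n, label="measurements")
plt.plot(x_fine, cs(x_fine), label="cubic spline")
plt.xlabel("months since first measurement")
plt.legend()
plt.show()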

Related

Understanding Plotly Time Difference units

So, I have a problem similar to this question. I have a DataFrame with a column 'diff' and a column 'date' with the following dtypes:
delta_df['diff'].dtype
>>> dtype('<m8[ns]')
delta_df['date'].dtype
>>> datetime64[ns, UTC]
According to this answer, they are (kind of) equivalent. However, when I plot using plotly (I tried histogram and scatter), the 'diff' axis has a weird unit: something like 2T, 2.5T, 3T, etc. What is this? The data in the 'diff' column looks like 0 days 00:29:36.000001, so I don't understand what is happening (the 'date' column looks like 2018-06-11 01:04:25.000005+00:00).
BTW, the diff column was generated using df['date'].diff().
So my question is:
What is this T? Is it a standard chosen by plotly, like T is 30 mins and then 2T is 1 hour? If so, how do I check the value of the chosen T?
Maybe more important: how do I plot with the axis formatted the way the values appear in the column, so it's easier to read?
The "T" you see in the axis label of your plot represents a time unit, and in Plotly, it stands for "Time". By default, Plotly uses seconds as the time unit, but if your data spans more than a few minutes, it will switch to larger time units like minutes (T), hours (H), or days (D). This is probably what is causing the weird units you see in your plot.
It's worth noting that using "T" as a shorthand for minutes is a convention adopted by some developers and libraries because "M" is already used to represent months.
To confirm that the weird units you see are due to Plotly switching to larger time units, you can check the largest value in your 'diff' column. If the largest value is more than a few minutes, Plotly will switch to using larger time units.
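If you want a readable axis regardless, one sketch (assuming the dtypes shown in the question) is to convert the timedelta column to plain float minutes before plotting, so the axis becomes an ordinary numeric one:

import plotly.express as px

# timedelta64[ns] -> float minutes; the axis then shows plain numbers
delta_df["diff_minutes"] = delta_df["diff"].dt.total_seconds() / 60

fig = px.histogram(delta_df, x="diff_minutes")
fig.update_xaxes(title_text="difference (minutes)")
fig.show()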

How can I do a pandas DataFrame 'decadal resample' over 'structured' (even) 10-year periods?

I have searched for a while but found nothing related to my question, so I am posting a new thread.
I have a simple dataset which is read in by pandas as dataframe, with some daily data starting on 1951-08-01, ending on 2018-10-01.
Now I want to down-sample the data to decadal means, so I simply do df.resample('10A').mean()['some data'].
This gives me 8 data points, at 1951-12, 1961-12, 1971-12, 1981-12, 1991-12, 2001-12, 2011-12, and 2021-12. This indicates that the decadal means are calculated for the year 1951 separately, then for 1952-1961, 1962-1971, etc.
I wonder if it is possible to calculate the decadal means over 'structured' 10-year periods instead?
For example, means calculated between 1950-1959, 1960-1969, 1970-1979, etc.
Any help is appreciated!
You can calculate the decade separately and group on that:
decade = df['Date'].dt.year.floordiv(10).mul(10)
df.groupby(decade)['Value'].mean()
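A self-contained sketch of the idea, with a toy frame standing in for the daily data (the column names 'Date' and 'Value' are assumed):

import numpy as np
import pandas as pd

df = pd.DataFrame({"Date": pd.date_range("1951-08-01", "2018-10-01", freq="D")})
df["Value"] = np.arange(len(df), dtype=float)

# 1951 -> 1950, 1963 -> 1960, ...: calendar decades instead of the
# bins anchored on the first timestamp that resample would use
decade = df["Date"].dt.year.floordiv(10).mul(10)
print(df.groupby(decade)["Value"].mean())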

Calculating 30 year climate normal from gridded dataset in Python

I am trying to calculate the 30 year temperature normal (1981-2010 average) for the NARR daily gridded data set linked below.
In the end for each grid point I want an array that contains 365 values, each of which contains the average temperature of that day calculated from the 30 years of data for that day. For example the first value in each grid point's array would be the average Jan 1 temperature calculated from the 30 years (1981-2010) of Jan 1 temperature data for that grid point. My end goal is to be able to use this new 30yrNormal array to calculate daily temperature anomalies from.
So far I have only been able to calculate anomalies from one year's worth of data. The problem with this is that it takes the difference between the daily temperature and the average for the whole year, rather than the difference between the daily temperature and the 30-year average for that day:
from netCDF4 import Dataset
import numpy as np

file = 'air.sfc.2018.nc'
ncin = Dataset(file, 'r')
# put data into numpy arrays
lons = ncin.variables['lon'][:]
lats = ncin.variables['lat'][:]
lats1 = ncin.variables['lat'][:, 0]
temp = ncin.variables['air'][:]  # (time, lat, lon)
ncin.close()
# mean over the whole year -- not the 30-year daily normal I want
AvgT = np.mean(temp[:, :, :], axis=0)
# compute anomalies by removing the time mean
T_anom = temp - AvgT
Data:
ftp://ftp.cdc.noaa.gov/Datasets/NARR/Dailies/monolevel/
For the years 1981-2010
This is most easily solved using CDO. You can use my package, nctoolkit (https://nctoolkit.readthedocs.io/en/latest/ and https://pypi.org/project/nctoolkit/), if you are working with Python on Linux; it uses CDO as a backend.
Assuming the 30 files are in a list called ff_list, the code below should work.
First, create the 30-year daily mean climatology.
import nctoolkit as nc
mean_30 = nc.open_data(ff_list)
mean_30.merge_time()
mean_30.drop(month=2,day=29)
mean_30.tmean("day")
mean_30.run()
Then you would subtract this from the daily figures to get the anomalies.
anom_30 = nc.open_data(ff_list)
anom_30.cdo_command("del29feb")
anom_30.subtract(mean_30)
anom_30.run()
This should give you the anomalies.
One issue is whether the files include leap days, and how you want to handle them if they exist. CDO has an undocumented operator, del29feb, which I have used above.
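If you would rather stay with pure Python, a roughly equivalent sketch using xarray (assuming the yearly files follow the air.sfc.YYYY.nc naming pattern) would be:

import xarray as xr

# open the 30 yearly files as one dataset along the time axis
ds = xr.open_mfdataset("air.sfc.*.nc", combine="by_coords")

# drop Feb 29 so every year contributes the same 365 days
noleap = ~((ds["time"].dt.month == 2) & (ds["time"].dt.day == 29))
air = ds["air"].sel(time=noleap)

# 30-year mean for each calendar day -> shape (365, lat, lon)
clim = air.groupby("time.dayofyear").mean("time")

# anomaly: each day's value minus its 30-year normal
anom = air.groupby("time.dayofyear") - clim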

Take maximum rainfall value for each season over a time period (xarray)

I'm trying to find the maximum rainfall value for each season (DJF, MAM, JJA, SON) over a 10 year period. I am using netcdf data and xarray to try and do this. The data consists of rainfall (recorded every 3 hours), lat, and lon data. Right now I have the following code:
ds.groupby('time.season').max('time')
However, when I do it this way, the output has a shape of (4, 145, 192), indicating that it's taking the maximum value for each season over the entire period. I would like the maximum for each individual season of every year. In other words, the output should have a shape like (40, 145, 192) (4 values for each year × 10 years).
I've also looked into doing this with Dataset.resample, using time='3M' as the frequency, but then it doesn't split the months up correctly. If I have to, I can alter the dataset so it starts in the right place, but I was hoping there would be an easier way, considering there's already a function to group it correctly.
Thanks, and let me know if you need any more details!
Resample is going to be the easiest tool for this job. You are close with the time frequency but you probably want to use the quarterly frequency with an offset:
ds.resample(time='QS-Mar').max('time')
These offsets can be further configured as described in the Pandas documentation: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
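A small sketch with toy data (a coarse grid so it runs quickly) showing the shape you end up with:

import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", "2009-12-31 21:00", freq="3H")
rain = xr.DataArray(
    np.random.rand(time.size, 3, 4),
    coords={"time": time},
    dims=("time", "lat", "lon"),
    name="rain",
)

# quarters anchored on Mar/Jun/Sep/Dec line up with MAM, JJA, SON, DJF
seasonal_max = rain.resample(time="QS-Mar").max("time")
# (41, 3, 4): one value per season per year, with partial bins for the
# leading Jan-Feb and the trailing December
print(seasonal_max.shape)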

How to distribute proportionally dates on a scale with Python

I have a very simple charting component which takes integers on the x/y axes. My problem is that I need to represent dates/floats on this chart, so I thought I could distribute the dates proportionally on a scale. In other words, say I have the following dates: 01/01/2008, 02/01/2008, and 31/12/2008. The algorithm would return 0, 16.667, and 100 (1 month = 16.667%).
I tried to play with the datetime and timedelta classes of Python 2.5, but I was unable to achieve this. I thought I could use the number of ticks, but I am not even able to get that from datetime.
Any idea how I could write this algorithm in Python? Otherwise, any other ideas or algorithms?
If you're dealing with dates, you can use the toordinal method:
import datetime

jan1 = datetime.datetime(2008, 1, 1)
dec31 = datetime.datetime(2008, 12, 31)
feb1 = datetime.datetime(2008, 2, 1)

dates = [jan1, dec31, feb1]
dates.sort()
datesord = [d.toordinal() for d in dates]
start, end = datesord[0], datesord[-1]

def datetofloat(date, start, end):
    """date, start, end are ordinal dates,
    i.e. Jan 1 of the year 1 has ordinal 1,
    Jan 1 of the year 2008 has ordinal 733042"""
    return (date - start) * 1.0 / (end - start)

print datetofloat(datesord[0], start, end)  # 0.0
print datetofloat(datesord[1], start, end)  # 0.0849315068493*
print datetofloat(datesord[2], start, end)  # 1.0
*16.67% is about two months of a year, so the proportion for Feb 1 is about half of that.
It's fairly easy to convert a timedelta into a numeric value: select an epoch time, calculate deltas for every value relative to the epoch, and convert the deltas into numeric values. Then map the numeric values as you normally would.
The conversion is straightforward. Something like:
def f(delta):
    # days * 1440 minutes * 60 seconds = seconds per day
    return (delta.seconds + delta.days * 1440 * 60
            + delta.microseconds / 1000000.0)
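On Python 2.7 and later (beyond the question's Python 2.5), timedelta provides this directly, so the helper reduces to:

def f(delta):
    return delta.total_seconds()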
I don't know if I fully understand what you are trying to do, but you can just treat times as the number of seconds since the Unix epoch and then use plain old subtraction to get a range that you can scale to the size of your plot.
In Processing, the map() function handles this case for you (http://processing.org/reference/map_.html); I'm sure you can adapt it for your purpose.
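For a Python port, a tiny sketch (the helper name scale is made up) that reproduces the 0-100 scaling from the question:

def scale(value, lo, hi, out_lo=0.0, out_hi=100.0):
    """Linearly map value from [lo, hi] to [out_lo, out_hi]."""
    return out_lo + (value - lo) * (out_hi - out_lo) / (hi - lo)

# e.g. with ordinal dates: scale(feb1.toordinal(), start, end)
# gives roughly 8.49 on the 0-100 scale (see the footnote above)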
