Iterating a custom date range over multiple years in Pandas - python

The short of it: how do I iterate through yearly data using a non-standard year, in my case Sept to Sept?
I've got a script to parse through years' worth of hourly temperature data and calculate the accumulated growing degree days (GDD) per year. Some demo data and the script are in this gist if you're curious where I'm at. The meat and potatoes, though, is getting the yearly cumulative sum with this:
df[col_name] = df.resample('Y')['dGDD'].cumsum()
and all works well. Each day will show the accumulated GDD in the proper column until Dec 31 when it starts from zero again.
My next goal is to calculate Chilling Degree Days which works similarly as GDD but it runs from Sept to Sept each year and I have no idea how to work that in (or what to properly google for help). I know I can set a date range to run it over, ie df['2012-9-1':'2013-9-1'] but I'm not sure how to automate it for the entirety of my data (2007-2018).
Thanks!

Solved it! Turns out the period of time I'm looking for is known as a 'Water Year'. Learning that led me to another question and answer. That, combined with a closer look at the .resample() docs, where I learned you can direct the resample to use something other than the index, got me this:
df['WaterYear'] = df.index + pd.DateOffset(months=-8)
df[col_name] = df.resample('Y', on='WaterYear')['dCDD'].cumsum()
And everything seems to be working swimmingly now.
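For anyone landing here later, here is a minimal sketch of the same water-year idea, using a groupby on the shifted year instead of resample. The dCDD values are synthetic (a constant 1.0 per day), not the gist's real data:

```python
import numpy as np
import pandas as pd

# Synthetic daily CDD increments (1.0/day) spanning two water years.
idx = pd.date_range("2011-09-01", "2013-08-31", freq="D")
df = pd.DataFrame({"dCDD": np.ones(len(idx))}, index=idx)

# Shift each date back 8 months so a Sept-Aug water year falls inside
# a single calendar year, then cumsum within that shifted year.
water_year = (df.index + pd.DateOffset(months=-8)).year
df["CDD"] = df["dCDD"].groupby(water_year).cumsum()
```

With 1.0 per day, the cumulative sum reaches 366 on 2012-08-31 (the 2011-12 water year contains the leap day) and resets to 1 on 2012-09-01.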

Related

Use TensorFlow model to guess / predict values between 2 points

My question is something that I didn't encounter anywhere; I've been wondering if it was possible for a TF model to determine values between 2 dates that have real / validated values assigned to them.
I have an example :
Let's take the price of nickel; here's its chart for the last week:
There is no data for the two following dates: 19/11 and 20/11.
But we have the data points before and after.
So is it possible to use the data from before and after these 2 points to guess the values of the 2 missing dates?
Thank you a lot!
It would be possible to create a machine learning model to predict the prices given a dataset of previous prices. Take a look at this post for instance. You would have to modify it slightly such that it predicts the prices in the gaps given previous and upcoming prices.
But for the example you gave, assuming the dates are from this year (2022), those are a Saturday and a Sunday; the stock market is closed on weekends, hence there is no price for the item. Also note that there are other days in the year when no trading occurs, such as holidays, and there is likewise no price then.
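If a rough baseline is enough before reaching for TensorFlow, pandas can fill such a gap by interpolating between the neighboring points. The prices below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Made-up nickel prices with a weekend gap on 19/11 and 20/11.
idx = pd.date_range("2022-11-16", "2022-11-22", freq="D")
prices = pd.Series(
    [26500.0, 26800.0, 26600.0, np.nan, np.nan, 27000.0, 27100.0], index=idx
)

# Linear-in-time interpolation between the last point before the gap
# and the first point after it.
filled = prices.interpolate(method="time")
```

A learned model only becomes worth the effort if you expect the gap values to deviate from a simple trend between the neighbors.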

How to use Simple Moving average for N step forecast on unobserved data

I have a confusion about how to forecast future steps using MA. All the articles out there validate the model by only considering historical data that was OBSERVED. However, once we validate that an MA model has good performance on our training data, we need to set up a pipeline for future forecasts.
The problem is that for n-step ahead future forecast, all the data is observed only for the first forecast. What happens to the other n-1 steps? Here is an example.
Let's say we have a dataset from Jan 2021 to June 2022, and based on our experiments we noticed that a moving average over the last 3 values leads to the best error for a 3-step-ahead forecast horizon.
Now we want to forecast for July, August and September. For July, we already observed the prior 3 months' values, so we can take their mean and that's the forecast. However, the actual July data is missing for the August forecast. What happens here? Should we use the forecasted value for July along with the actuals for June and May to find the value for August?
I am sorry if my question is trivial but I am trying to code it myself in python so I want to make sure I am doing it the right way.
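That recursive scheme (feeding each forecast back into the window for the next step) is exactly how a multi-step SMA forecast is usually coded. A minimal sketch, with illustrative numbers rather than a real series:

```python
def sma_forecast(history, window=3, steps=3):
    """Recursive simple-moving-average forecast: each new forecast
    is appended to the history and used in the next window."""
    values = list(history)
    forecasts = []
    for _ in range(steps):
        f = sum(values[-window:]) / window
        forecasts.append(f)
        values.append(f)  # the forecast feeds the next step's window
    return forecasts

# E.g. observed Apr, May, June values; forecast July-September.
print(sma_forecast([10.0, 12.0, 14.0]))
```

Note that beyond the first step the forecast flattens toward the window mean; that loss of information is inherent to recursive SMA, not a bug in the code.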

Cumulative Time Spent in Specific States

I have a dataset that looks as follows:
What I would like to do with this data is calculate how much time was spent in specific states, per day. So say for example I wanted to know how long the unit was running today. I would just like to know the sum of the time the unit spent RUNNING: 45 minutes, NOT_RUNNING: 400 minutes, WARMING_UP: 10 minutes, etc.
I know how to summarize the column data on its own, but I'm looking to use the timestamp I have available to subtract the first time it was on from the last time it was on and get that measure of difference. I haven't had any luck searching for this solution, but there's no way I'm the first to come across this, and I know it can be done somehow; I'm just looking to learn how. Anything helps, thanks!
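Assuming a timestamp column and a state column (the column names and values below are made up, since the dataset isn't shown), one common approach is to treat each row's duration as the gap to the next timestamp, then group by state:

```python
import pandas as pd

# Hypothetical log: each row records when the unit entered a state.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-06-01 08:00", "2021-06-01 08:10",
        "2021-06-01 08:55", "2021-06-01 09:30",
    ]),
    "state": ["WARMING_UP", "RUNNING", "NOT_RUNNING", "RUNNING"],
})

# Each state lasts until the next timestamp; the final row needs an
# explicit end time (here, an assumed end-of-log timestamp).
end_of_log = pd.Timestamp("2021-06-01 10:00")
df["duration"] = df["timestamp"].shift(-1).fillna(end_of_log) - df["timestamp"]

minutes_per_state = (
    df.groupby("state")["duration"].sum().dt.total_seconds() / 60
)
```

For per-day totals you could group by `[df["timestamp"].dt.date, "state"]` instead of `"state"` alone.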

Extract another value based on an input (or automated input)

Not seen this answered elsewhere. I want to automate a file to put in three headers based on the current week. Firstly, I can't figure out how to get the current week, similar to the Excel formula WEEKNUM(TODAY()). So this week is week 25 in the fiscal calendar.
I need the program to work out the current week but then also input the previous 2 weeks, so it pulls back weeks 25, 24 and 23. I can't just take the week number 25 above and subtract 1, as when we hit week 1 next year the number will go to 0.
Hope that makes sense. I heard dates are a bit of a pain, so hopefully it's not too complicated.
I'm not sure what you are trying to accomplish but maybe this will help.
import datetime
datetime.date.today().isocalendar()
It returns
(2021, 25, 1)
So to get just week 25, simply do:
datetime.date.today().isocalendar()[1]
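To sidestep the week-1 wrap-around the question mentions, one option is to subtract whole weeks as dates before calling isocalendar(), so January weeks correctly roll back into the previous ISO year (function name is made up for this sketch):

```python
import datetime

def last_three_weeks(today=None):
    """Return (ISO year, ISO week) for the current week and the two before it."""
    today = today or datetime.date.today()
    return [(today - datetime.timedelta(weeks=i)).isocalendar()[:2]
            for i in range(3)]

# Near a year boundary the earlier weeks belong to the previous ISO year:
print(last_three_weeks(datetime.date(2022, 1, 5)))
```

Keeping the ISO year alongside the week number also disambiguates the headers when the three weeks straddle New Year.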

NetCDF: how to create list of time values for years with 366 days (all_leap or 366_day calendar)?

I'd like to write a NetCDF that will contain 366 days per year for all years, with the Feb 28th value repeated as the Feb 29th value in the case of non-leap years. How would I build the list/array of time values so that the Feb 29th slot contains the same time value as Feb 28th during non-leap years? Is this really what I want to do, or is there another approach typically used for this? I haven't yet found an example of how to create a time coordinate variable with calendar attribute all_leap or 366_day.
My concern is that I'll need to do something to account for the "filler" Feb 29th in the non-leap years in order to satisfy software such as Panoply which I use for quick plots when doing data analysis. I'm not referring to the data variable values, I mean the actual time step values such as "5894 days since 1900". For example when I'm stepping through the data timestep by timestep (day by day) I want to make sure that I don't start getting off-by-one errors that end up confusing Panoply, so when I'm looking at a plot for a timestep it's interpreted correctly when it displays the time value in date format.
Maybe the crux of this is whether or not I can have duplicate values in the array of time step values, and if so will Panoply etc. handle these gracefully, i.e. when I'm constructing an array of time values to load into the time coordinate can I duplicate the value for Feb 28th in the array element mapping to Feb 29th when it's not a leap year?
This is a tricky issue, which comes up when computing daily climatologies over many years. You want your computations to include 366 days even for non-leap years, but to use NaN for Feb. 29.
You don't mention what language you are using to create your NetCDF files. There is an answer using Python and Pandas, in the context of creating climatologies, at this question: Compute daily climatology using pandas python, which might help get you started.
My answer to that question shows a method for dealing with the leap year issue.
I've created 30 year daily climatology files, using this method and Panoply has no problems viewing them.
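As a sketch of the duplicate-Feb-28 idea in Python/pandas (names and data below are invented; this pads each year's values to 366 entries, which is the same alignment problem the time coordinate has):

```python
import numpy as np
import pandas as pd

# Hypothetical daily values spanning a non-leap and a leap year.
idx = pd.date_range("2015-01-01", "2016-12-31", freq="D")
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

def to_366(year_series):
    """Pad one calendar year of daily values to 366 entries,
    repeating Feb 28 into the Feb 29 slot in non-leap years."""
    year = year_series.index[0].year
    if pd.Timestamp(year=year, month=12, day=31).dayofyear == 366:
        return year_series.to_numpy()  # leap year: already 366 days
    vals = year_series.to_numpy()
    feb28 = year_series.index.get_loc(pd.Timestamp(year=year, month=2, day=28))
    return np.insert(vals, feb28 + 1, vals[feb28])  # duplicate Feb 28

padded = {yr: to_366(grp) for yr, grp in s.groupby(s.index.year)}
```

The same insertion position (index 59, the Feb 29 slot) is where you would either repeat the Feb 28 time value or, with a 366_day calendar, write a distinct Feb 29 time step.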
