Group by custom period annually in Xarray - python

I'm trying to group an xarray.Dataset object into a custom 4-month period spanning October through January, with an annual frequency. This is complicated because the period crosses New Year.
I've been trying to use this approach:
wb_start = temperature.sel(time=temperature.time.dt.month.isin([10,11,12,1]))
wb_start1 = wb_start.groupby('time.year')
But this predictably groups each January with the October-December of its own calendar year, rather than with the October-December that precede it. Any help would be appreciated!

I fixed this in a somewhat clunky but effective way by adding a year to the months before January. My method moves months 10, 11 and 12 up one year while leaving the January data in place, and then does a groupby on the year of the reindexed time data.
import pandas as pd
from dateutil.relativedelta import relativedelta

wb_start = temperature.sel(time=temperature.time.dt.month.isin([10,11,12,1]))
# convert cftime to datetime
datetimeindex = wb_start.indexes['time'].to_datetimeindex()
wb_start['time'] = pd.to_datetime(datetimeindex)
# Add custom group by year functionality
custom_year = wb_start['time'].dt.year
# convert time type to pd.Timestamp
time1 = [pd.Timestamp(i) for i in custom_year['time'].values]
# Add a year to the Timestamp objects for Oct-Dec, the months before January (relativedelta does not work with np.datetime64)
time2 = [i + relativedelta(years=1) if i.month>=10 else i for i in time1]
wb_start['time'] = time2
# Groupby using the new time index
wb_start1 = wb_start.groupby('time.year')
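A possibly simpler variant of the same idea is to compute a "season year" label directly from the month instead of rewriting the time index. The sketch below shows this on a pandas Series with made-up daily data; the names season and season_year are illustrative and not from the original post. With xarray, the same label could presumably be attached as a coordinate and passed to groupby.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series standing in for the temperature data
time = pd.date_range("2000-01-01", "2002-12-31", freq="D")
temperature = pd.Series(np.arange(len(time), dtype=float), index=time)

# Keep only October through January
season = temperature[temperature.index.month.isin([10, 11, 12, 1])]

# Label each timestamp with the year in which its season ends:
# Oct-Dec are pushed to the following year, January keeps its own year
months = season.index.month.to_numpy()
season_year = season.index.year.to_numpy() + (months >= 10)

season_mean = season.groupby(season_year).mean()
season_size = season.groupby(season_year).size()
print(season_size)
```

The first and last groups are incomplete here (a lone January and a lone October-December), so they may need to be dropped before comparing seasons.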

Related

Convert days data in years data in a list

I want to build a time series of temperature data from 1850 to 2014. My problem is that when I plot the time series, the time axis starts at 0, which corresponds to 1 January 1850, and stops at day 60 230, which corresponds to 31 December 2014.
I tried to write a loop that creates a new list with the time expressed in months and years, so that I could plot it against my initial temperature list, but it didn't work.
This is the kind of loop that I tested:
days = list(range(1,365+1))
years = []
y = 1850
years.append(y)
while y<2015:
    for i in days:
        years.append(y+i)
    y = y+1
del years [-1]
dsetyears = Dataset(years)
I also tried the "datetime" module, but that didn't work either (maybe this tool is better because it takes leap years into account...).
day_number = "0"
year = "1850"
res = datetime.strptime(year + "-" + day_number, "%Y-%j").strftime("%m-%d-%Y")
If anyone has a clue or a lead I can look into, I'm interested.
Thanks in advance!
You can achieve that using the datetime module. Let's declare the starting and ending dates.
import datetime
dates = []
starting_date = datetime.datetime(1850, 1, 1)
ending_date = datetime.datetime(2014, 12, 31)
Then we loop while the starting date is less than or equal to the ending date. In each iteration we append the formatted date string to the dates list and then advance the date by one day using timedelta.
while starting_date <= ending_date:
    dates.append(starting_date.strftime("%m-%d-%Y"))
    starting_date += datetime.timedelta(days=1)
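If pandas is available, the same daily axis can be generated without an explicit loop. This is a minimal sketch assuming the data follows the standard Gregorian calendar, with day 0 being 1 January 1850; the variable names are illustrative.

```python
import pandas as pd

# One timestamp per day from 1 January 1850 to 31 December 2014, leap days included
dates = pd.date_range(start="1850-01-01", end="2014-12-31", freq="D")

# Optional month-year labels for plotting, e.g. "01-1850"
labels = dates.strftime("%m-%Y")

print(len(dates))           # number of days in the range
print(dates[0], dates[-1])  # first and last timestamp
```

If the length of dates does not match the length of the temperature list, the data probably uses a different calendar (for example one without leap days), and this approach would need adjusting.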

Rewrite datetimeindex for timeseries for new Pandas version

I am developing a machine learning model to do some time series forecasting. I wrote a general function to create time series from my data. When I wrote it, the DatetimeIndex constructor in pandas accepted different parameters and my function ran properly. The constructor has since changed, and I am not sure how to rewrite the DatetimeIndex call. Can someone please help?
Here is the full timeseries function I wrote:
def make_time_series(mean_power_df, years, freq='D', start_idx=4):
    '''Creates as many time series as there are complete years. This code
    accounts for the leap year, 2016.
    :param Daily_Price_mean: A dataframe of bitcoin weighted price, averaged by day.
    This dataframe should also be indexed by a datetime.
    :param years: A list of years to make time series out of, ex. ['2013', '2014'].
    :param freq: The frequency of data recording (D = daily)
    :param start_idx: The starting dataframe index of the first point in the first time series.
    The default, 16, points to '2013-01-01'.
    :return: A list of pd.Series(), time series data.
    '''
    # store time series
    time_series = []
    # store leap year in this dataset
    leap = '2012'
    # create time series for each year in years
    for i in range(len(years)):
        year = years[i]
        if(year == leap):
            end_idx = start_idx+366
        else:
            end_idx = start_idx+365
        # create start and end datetimes
        t_start = year + '-01-01' # Jan 1st of each year = t_start
        t_end = year + '-12-31' # Dec 31st = t_end
        # get global consumption data
        data = mean_power_df[start_idx:end_idx]
        # create time series for the year
        index = pd.DatetimeIndex(start=t_start, end=t_end, freq=freq) ## this is the line causing problems, this was based on previous method inputs
        time_series.append(pd.Series(data=data, index=index))
        start_idx = end_idx
    # return list of time series
    return time_series
When I call this function the following way:
group = data.groupby('date') #the data was read and processed...
Daily_Price_mean = group['Weighted_Price'].mean()
full_years = ['2012', '2013', '2014']
freq='D' # daily recordings
# make time series
time_series = make_time_series(Daily_Price_mean, full_years, freq=freq)
I get the error: TypeError: __new__() got an unexpected keyword argument 'start'
Can someone please let me know how I can fix my function?
Thank you
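For reference, the start, end and freq keywords are no longer accepted by the pd.DatetimeIndex constructor in newer pandas releases, while pd.date_range takes the same three arguments. A minimal sketch of a replacement for the failing line, using the 2012 values from the question:

```python
import pandas as pd

# Values that make_time_series would build for the year '2012'
t_start = "2012" + "-01-01"
t_end = "2012" + "-12-31"
freq = "D"

# Replacement for pd.DatetimeIndex(start=t_start, end=t_end, freq=freq)
index = pd.date_range(start=t_start, end=t_end, freq=freq)

print(len(index))  # one entry per day of the year
```

Inside the function, only the line that builds index would need to change; the rest of the loop can stay as it is.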

Calculate Last Friday of Month in Pandas

I've written this function to get the last Thursday of the month
def last_thurs_date(date):
    month=date.dt.month
    year=date.dt.year
    cal = calendar.monthcalendar(year, month)
    last_thurs_date = cal[4][4]
    if month < 10:
        thurday_date = str(year)+'-0'+ str(month)+'-' + str(last_thurs_date)
    else:
        thurday_date = str(year) + '-' + str(month) + '-' + str(last_thurs_date)
    return thurday_date
But it's not working with the lambda function.
datelist['Date'].map(lambda x: last_thurs_date(x))
Where datelist is
datelist = pd.DataFrame(pd.date_range(start = pd.to_datetime('01-01-2014',format='%d-%m-%Y')
, end = pd.to_datetime('06-03-2019',format='%d-%m-%Y'),freq='D').tolist()).rename(columns={0:'Date'})
datelist['Date']=pd.to_datetime(datelist['Date'])
Jpp already added the solution, but just to add a slightly more readable formatted string - see this awesome website.
import calendar
def last_thurs_date(date):
    year, month = date.year, date.month
    cal = calendar.monthcalendar(year, month)
    # cal[4] is the fifth week row; column index 4 is Friday, since rows run Monday to Sunday
    # when that entry is 0 (the fifth row has no such day), fall back to the fourth row
    last_thurs_date = cal[4][4] if cal[4][4] > 0 else cal[3][4]
    return f'{year}-{month:02d}-{last_thurs_date}'
Also added a bit of logic - e.g. you got 2019-02-0 because the fifth week row of February 2019 ends before that weekday.
Scalar datetime objects don't have a dt accessor, series do: see pd.Series.dt. If you remove this, your function works fine. The key is understanding that pd.Series.apply passes scalars to your custom function via a loop, not an entire series.
def last_thurs_date(date):
    month = date.month
    year = date.year
    cal = calendar.monthcalendar(year, month)
    last_thurs_date = cal[4][4]
    if month < 10:
        thurday_date = str(year)+'-0'+ str(month)+'-' + str(last_thurs_date)
    else:
        thurday_date = str(year) + '-' + str(month) + '-' + str(last_thurs_date)
    return thurday_date
You can rewrite your logic more succinctly via f-strings (Python 3.6+) and a ternary statement:
def last_thurs_date(date):
    month = date.month
    year = date.year
    last_thurs_date = calendar.monthcalendar(year, month)[4][4]
    return f'{year}{"-0" if month < 10 else "-"}{month}-{last_thurs_date}'
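One caveat with indexing a fixed row such as cal[4]: calendar.monthcalendar returns between four and six week rows depending on the month, and with the default Monday-first weeks column 4 is Friday while Thursday is column 3. February 2021, for instance, has only four rows, so cal[4] raises an IndexError. A sketch of a variant that avoids the fixed row index follows; last_weekday_of_month is a hypothetical helper name, not part of the answers above.

```python
import calendar
from datetime import date


def last_weekday_of_month(year, month, weekday=calendar.FRIDAY):
    """Return the date of the last given weekday in a month (hypothetical helper)."""
    weeks = calendar.monthcalendar(year, month)
    # Days outside the month are 0, so the largest entry in the weekday
    # column is the last occurrence, however many week rows there are
    day = max(week[weekday] for week in weeks)
    return date(year, month, day)


print(last_weekday_of_month(2019, 2))                     # last Friday of February 2019
print(last_weekday_of_month(2021, 2, calendar.THURSDAY))  # last Thursday of February 2021
```

It takes plain integers, so with the datelist from the question it could be mapped as datelist['Date'].map(lambda d: last_weekday_of_month(d.year, d.month)).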
I know that a lot of time has passed since this was posted, but I think it is worth adding another option in case someone comes across this thread.
Even though I use pandas every day at work, in this case my suggestion would be to just use the dateutil library. The solution is a simple one-liner, with no unnecessary complications.
from dateutil.rrule import rrule, MONTHLY, FR, SA
from datetime import datetime as dt
import pandas as pd
# monthly options expiration dates (third Friday of each month) calculated for 2022
monthly_options = list(rrule(MONTHLY, count=12, byweekday=FR, bysetpos=3, dtstart=dt(2022,1,1)))
# last Saturday of each month
last_saturdays = list(rrule(MONTHLY, count=12, byweekday=SA, bysetpos=-1, dtstart=dt(2022,1,1)))
and then of course:
pd.DataFrame({'LAST_ST':last_saturdays}) #or whatever you need
This answers the question in the title, Calculate Last Friday of Month in Pandas.
It can be adapted to another day of the week by changing the frequency, here freq='W-FRI'.
I think the easiest way is to create a pandas.DataFrame using pandas.date_range and specifying freq='W-FRI'.
W-FRI is Weekly Fridays
pd.date_range(df.Date.min(), df.Date.max(), freq='W-FRI')
Creates all the Fridays in the date range between the min and max of the dates in df
Use a .groupby on year and month, and select .last(), to get the last Friday of every month for every year in the date range.
Because this method finds all the Fridays for every month in the range and then chooses .last() for each month, there's not an issue with trying to figure out which week of the month has the last Friday.
With this, use pandas: Boolean Indexing to find values in the Date column of the dataframe that are in last_fridays_in_daterange.
Use the .isin method to determine containment.
pandas: DateOffset objects
import pandas as pd
# test data: given a dataframe with a datetime column
df = pd.DataFrame({'Date': pd.date_range(start=pd.to_datetime('2014-01-01'), end=pd.to_datetime('2020-08-31'), freq='D')})
# create a dataframe with all Fridays in the date range between the min and max of df.Date
fridays = pd.DataFrame({'datetime': pd.date_range(df.Date.min(), df.Date.max(), freq='W-FRI')})
# use groupby and last to get the last Friday of each month into a list
last_fridays_in_daterange = fridays.groupby([fridays.datetime.dt.year, fridays.datetime.dt.month]).last()['datetime'].tolist()
# find the data for the last Friday of the month
df[df.Date.isin(last_fridays_in_daterange)]

Pandas date_range - subtracting numpy timedelta gives odd result, time becomes not 0:00:00

I am trying to generate a set of dates with pandas date_range functionality. Then I want to iterate over this range and subtract several months from each of the dates (exact number of month is determined in loop) to get a new date.
I get some very odd results when I do this.
MVP:
#get date range
dates = pd.date_range(start = '1/1/2013', end='1/1/2018', freq=str(test_size)+'MS', closed='left', normalize=True)
#take first date as example
date = dates[0]
date
Timestamp('2013-01-01 00:00:00', freq='3MS')
So far so good.
Now let's say I want to go just one month back from this date. I define numpy timedelta (it supports months for definition, while pandas' timedelta doesn't):
#get timedelta of 1 month
deltaGap = np.timedelta64(1,'M')
#subtract one month from date
date - deltaGap
Timestamp('2012-12-01 13:30:54', freq='3MS')
Why is that? Why do I get 13:30:54 in the time component instead of midnight?
Moreover, if I subtract more than 1 month, the shift becomes so large that I lose a whole day:
#let's say I want to subtract both 2 years and then 1 month
deltaTrain = np.timedelta64(2,'Y')
#subtract 2 years and then subtract 1 month
date - deltaTrain - deltaGap
Timestamp('2010-12-02 01:52:30', freq='3MS')
I've had similar issues with timedelta, and the solution I've ended up using was using relativedelta from dateutil, which is specifically built for this kind of application (taking into account all the calendar weirdness like leap years, weekdays, etc...). For example given:
from dateutil.relativedelta import relativedelta
date = dates[0]
>>> date
Timestamp('2013-01-01 00:00:00', freq='10MS')
deltaGap = relativedelta(months=1)
>>> date-deltaGap
Timestamp('2012-12-01 00:00:00', freq='10MS')
deltaGap = relativedelta(years=2, months=1)
>>> date-deltaGap
Timestamp('2010-12-01 00:00:00', freq='10MS')
Check out the documentation for more info on relativedelta
The issues with numpy.timedelta64
I think that the problem with np.timedelta64 is revealed in these 2 parts of the docs:
There are two Timedelta units (‘Y’, years and ‘M’, months) which are treated specially, because how much time they represent changes depending on when they are used. While a timedelta day unit is equivalent to 24 hours, there is no way to convert a month unit into days, because different months have different numbers of days.
and
The length of the span is the range of a 64-bit integer times the length of the date or unit. For example, the time span for ‘W’ (week) is exactly 7 times longer than the time span for ‘D’ (day), and the time span for ‘D’ (day) is exactly 24 times longer than the time span for ‘h’ (hour).
So the timedeltas are fine for hours, days and weeks, because these are fixed-length timespans. Months and years, however, vary in length (think leap years), so numpy appears to fall back on an average. One numpy "year" comes out as 365 days, 5 hours, 49 minutes and 12 seconds, which matches the mean Gregorian year of 365.2425 days, while one numpy "month" comes out as 30 days, 10 hours, 29 minutes and 6 seconds, which is one twelfth of that.
# Adding one numpy month adds 30 days + 10:29:06:
deltaGap = np.timedelta64(1,'M')
date+deltaGap
# Timestamp('2013-01-31 10:29:06', freq='10MS')
# Adding one numpy year adds 1 year + 05:49:12:
deltaGap = np.timedelta64(1,'Y')
date+deltaGap
# Timestamp('2014-01-01 05:49:12', freq='10MS')
This is not so easy to work with, which is why I would just go to relativedelta, which is much more intuitive (to me).
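The two durations quoted above can be checked with plain arithmetic. The sketch below assumes that the "average" in question is the mean Gregorian year of 365.2425 days (31,556,952 seconds), which is consistent with the observed results; it uses only the standard library.

```python
from datetime import datetime, timedelta

# Mean Gregorian year: 365.2425 days, written in whole seconds to avoid float rounding
mean_year = timedelta(seconds=31_556_952)
# One twelfth of the mean year
mean_month = mean_year / 12

print(mean_year)   # length of the averaged year
print(mean_month)  # length of the averaged month

# Reproduce the subtraction from the question: 2013-01-01 minus one averaged month
print(datetime(2013, 1, 1) - mean_month)
```

Because the averaged month is not a whole number of days, any date shifted by it lands at an odd time of day, which is the effect seen in the question.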
You can try pd.DateOffset, which is designed for applying calendar-aware offsets (month, year, hour) to dates.
# build a sample range of hourly dates
dates = pd.date_range(start = '1/1/2013', freq='H', periods=100, normalize=True)
#take first date as example
date = dates[0]
# subtract a month
dates[0] - pd.DateOffset(months=1)
Timestamp('2012-12-01 00:00:00')
# to apply this on all dates
new_dates = list(map(lambda x: x - pd.DateOffset(months=1), dates))
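As a small addition, the offset can also be applied to the whole DatetimeIndex at once instead of mapping over it. The sketch below uses a short month-start range as sample data (an assumption, not the range from the question) and also shows how DateOffset handles month ends.

```python
import pandas as pd

# Sample month-start dates standing in for the range in the question
dates = pd.date_range(start="2013-01-01", periods=4, freq="MS")

# Subtract one calendar month from every date in a single operation
shifted = dates - pd.DateOffset(months=1)
print(shifted)

# Month ends are clipped to the last valid day of the target month
end_of_march = pd.Timestamp("2013-03-31")
print(end_of_march - pd.DateOffset(months=1))
```

The time component stays at midnight in both cases, since the offset works on calendar fields rather than on a fixed number of seconds.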

Python Date Index: finding the closest date a year ago from today

I have a pandas DataFrame (stock prices) with a date index. It is daily, but only for working days.
I am trying to compute price performance year-to-date and relative to a year ago.
To get the first date of the current year in my dataframe I used the following method:
today = str(datetime.date.today())
curr_year = int(today[:4])
curr_month = int(today[5:7])
first_date_year = (df.loc[str(curr_year)].first_valid_index())
Now I am trying to get the closest date a year ago (exactly one year before the last_valid_index()). I could extract the month and the year, but then it wouldn't be as precise. Any suggestions?
Thanks
Since you didn't provide any data, I am assuming that you have a list of dates (string types) like the following:
dates = ['11/01/2016', '12/01/2016', '02/01/2017', '03/01/2017']
You then need to transform that into datetime format, I would suggest using pandas:
pd_dates = pd.to_datetime(dates)
Then you have to define today and one year ago. I would suggest using datetime for that:
from datetime import datetime
today = datetime.today()
date_1yr_ago = datetime(today.year-1, today.month, today.day)
Lastly, you slice the date list for dates larger than the date_1yr_ago value and get the first value of that slice:
pd_dates[pd_dates > date_1yr_ago][0]
This will return the first date that is larger than the 1 year ago date.
output:
Timestamp('2017-02-01 00:00:00')
You can convert that datetime value to string with the following code:
datetime.strftime(pd_dates[pd_dates > date_1yr_ago][0], '%Y/%m/%d')
output:
'2017/02/01'
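The approach above returns the first date after the one-year mark, which is not necessarily the closest one, and building the date from today.year-1 fails on 29 February. A sketch of an alternative that looks up the nearest index entry instead, using a made-up business-day index in place of the stock price data:

```python
import pandas as pd

# Hypothetical business-day index standing in for the stock price index
idx = pd.bdate_range("2016-01-01", "2017-12-18")

# One calendar year before the last available date
last_date = idx[-1]
target = last_date - pd.DateOffset(years=1)

# Position of the index entry nearest to the target (the index must be sorted)
pos = idx.get_indexer(pd.DatetimeIndex([target]), method="nearest")[0]
closest = idx[pos]

print(target, closest)
```

Here the target falls on a weekend, so the lookup returns the neighbouring business day; with a DataFrame, df.index would take the place of idx.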
