I want to resample stock prices yearly, but with a custom start date rather than the natural/calendar year. This is to simulate a fiscal year, which may differ from the calendar year.
I have:
Date
2010-02-01 39.921856
2010-02-02 39.929314
2010-02-03 40.511570
2010-02-04 39.541145
2010-02-05 39.899464
...
2022-01-24 138.550598
2022-01-25 135.536484
2022-01-26 134.152969
2022-01-27 134.241882
2022-01-28 135.902145
Name: Adj Close, Length: 3021, dtype: float64
I want to calculate the cumulative return for every fiscal year, so I need the closing price for the last day of each fiscal year. For example, if that day is 1 Feb, I want to get:
2010-02-01 39.921856
2011-02-01 xxx
2012-02-01 xxx
2013-02-01 xxx
I tried resample('1Y'), date_range, etc., but pandas forces the calendar year no matter what: I keep getting either 01-01 or 12-31, depending on whether I ask for year start or year end. The origin argument for resample also does nothing:
df = df.resample('Y', origin='2010-02-01').last()
How do I get what I want?
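One workaround sketch (assuming the series shown above is called prices and is sorted by its DatetimeIndex): build the fiscal year-end dates explicitly with a yearly DateOffset, then look up the last available close on or before each one with Series.asof:
import pandas as pd

# fiscal year-end anchor dates: 1 Feb of every year in the data
fy_ends = pd.date_range('2010-02-01', '2022-02-01', freq=pd.DateOffset(years=1))

# last available close on or before each anchor (covers non-trading 1 Febs)
fy_close = prices.asof(fy_ends)

# return over each fiscal year, i.e. from one 1-Feb close to the next
fy_returns = fy_close.pct_change()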
I am trying to get the last business day in the month prior to curr_date (which is end of current month):
nf.loc[:,'priorBDay'] = np.busday_offset(nf.curr_date.values.astype('M8[M]') + np.timedelta64(-1, 'M'), -1, roll='forward')
The data type of curr_date is:
datetime64[ns, UTC]
I get this error:
TypeError: cannot convert datetimelike to dtype [datetime64[D]]
curr_date is the last business day of the current month; I want to calculate the last business day of the prior month (priorBDay).
Use the following code:
nf['priorBDay'] = nf.curr_date.apply(
    lambda dat: pd.offsets.MonthBegin(-1).rollback(dat) - pd.offsets.BDay())
The first step (rollback) moves the source date to the start of the
current month, but only if the source date is not already the first
day of the month.
The second step (- BDay()) is to shift to the previous business day.
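For example, applying the two steps to one of the test dates below:
import pandas as pd

ts = pd.Timestamp('2020-10-31')
step1 = pd.offsets.MonthBegin(-1).rollback(ts)  # Timestamp('2020-10-01'): start of the current month
step2 = step1 - pd.offsets.BDay()               # Timestamp('2020-09-30'): previous business day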
For a couple of test dates the result is:
curr_date priorBDay
0 2019-09-10 2019-08-30
1 2020-06-10 2020-05-29
2 2020-09-10 2020-08-31
3 2020-10-30 2020-09-30
4 2020-10-31 2020-09-30
5 2020-11-01 2020-10-30
6 2020-11-02 2020-10-30
I assume that the curr_date column is of datetime type. If not, convert it first.
Note that nf.curr_date - pd.offsets.BMonthEnd() is a wrong formula.
E.g. for 2020-10-31 it yields 2020-10-30, i.e. the last business
day in the current month.
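A quick check of that formula on the date in question, evaluated directly on a Timestamp:
import pandas as pd

# 2020-10-31 is a Saturday, so rolling back to the business month end
# stays inside the current month:
pd.Timestamp('2020-10-31') - pd.offsets.BMonthEnd()
# Timestamp('2020-10-30 00:00:00')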
I have a relatively large data set of weather data covering 10 years, and I want to group by day of year to get the 10-year low or high for each and every single day. To use groupby I created a column this way:
df['dms'] = df['Date'].dt.strftime('%j')
The thing is, when I use dt.strftime('%j') I get two numbers for the same day, which is weird. For instance, when I filter only Dec 31st and do value_counts() I get this:
365 363
366 82
Name: dms, dtype: int64
On the other hand, everything works just fine if I do dt.strftime('%m-%d'):
Dec-31 445
Name: dm, dtype: int64
I even did dt.strftime('%b-%d-%r').value_counts() and I got the same correct count:
Dec-31-12:00:00 AM 445
Name: Date, dtype: int64
What is actually going wrong, or (to sound less newbie) what is happening behind the scenes in the %j case?
Let us consider an example with the following data:
df = pd.DataFrame({'Date' : ['2016-12-31', '2017-12-31', '2018-12-31', '2019-12-31', '2020-12-31']})
df['Date'] = pd.to_datetime(df['Date'])
df
Date
0 2016-12-31
1 2017-12-31
2 2018-12-31
3 2019-12-31
4 2020-12-31
In the data above, 2016 and 2020 are leap years, with an extra day on February 29th to make up for the fact that an actual year is roughly 365 days and 6 hours long (every fourth year those extra ~6 hours have added up to a full day, 4 x 6 = 24, which is why we have a leap day), so we should expect %j to return 366 for those years when we do:
import pandas as pd
df = pd.DataFrame({'Date' : ['2016-12-31', '2017-12-31', '2018-12-31', '2019-12-31', '2020-12-31']})
df['Date'] = pd.to_datetime(df['Date'])
df['Day'] = df['Date'].dt.strftime('%j')
df
Date Day
0 2016-12-31 366
1 2017-12-31 365
2 2018-12-31 365
3 2019-12-31 365
4 2020-12-31 366
However, when you do value_counts(), it returns:
365 3
366 2
Name: Day, dtype: int64
This is also expected behavior, so %j is working correctly behind the scenes: it is accounting for leap years.
%j returns the day number of the year, 001-366 (366 for a leap year, 365 otherwise). Since your data spans 10 years, 366 is a valid day number in the leap years.
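If the end goal is a 10-year low/high per calendar day, grouping on the month-day string (as you already observed with '%m-%d') keeps every Dec 31st in a single bucket regardless of leap years. A minimal sketch, assuming the frame has a datetime 'Date' column and a temperature column named 'Temp' (that column name is just a placeholder):
import pandas as pd

# month-day key puts Dec 31st in one bucket in leap and non-leap years alike
df['md'] = df['Date'].dt.strftime('%m-%d')
daily_extremes = df.groupby('md')['Temp'].agg(['min', 'max'])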
I have a dataframe containing hourly data. I want to get the max for each week of the year, so I used resample to group the data by week:
weeks = data.resample("W").max()
The problem is that the week max is calculated starting from the first Monday of the year, while I want it to be calculated starting from the first day of the year.
I obtain the following result, where you can see that there are 53 weeks and the last week is labelled in the next year, even though 2017 doesn't exist in the data:
Date dots
2016-01-03 0.647786
2016-01-10 0.917071
2016-01-17 0.667857
2016-01-24 0.669286
2016-01-31 0.645357
Date dots
2016-12-04 0.646786
2016-12-11 0.857714
2016-12-18 0.670000
2016-12-25 0.674571
2017-01-01 0.654571
Is there a way to calculate weeks for a pandas dataframe starting from the first day of the year?
Find the starting day of the year (say it's a Friday) and pass the matching anchored weekly alias to resample. Since weekly bins are right-closed and right-labelled by default, also request left-closed, left-labelled bins so that each week actually begins on that day:
weeks = data.resample("W-FRI", closed="left", label="left").max()
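A minimal sketch that derives the anchor from the data itself (assuming data has a DatetimeIndex; the anchor variable is just for illustration):
# weekly anchor from the first date in the index, e.g. 'FRI' for 2016-01-01;
# left-closed, left-labelled bins make each week start on that weekday
anchor = data.index.min().day_name()[:3].upper()
weeks = data.resample(f"W-{anchor}", closed="left", label="left").max()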
One quick remedy: given that your data lies within one year, you can resample it by day first, then take groups of 7 days:
new_df = (df.resample("D", on='Date').dots
            .max().reset_index())
weekly = new_df.groupby(new_df.index // 7).agg({'Date': 'min', 'dots': 'max'})
weekly.head(7)
Output:
Date dots
0 2016-01-01 0.996387
1 2016-01-08 0.999775
2 2016-01-15 0.997612
3 2016-01-22 0.979376
4 2016-01-29 0.998240
5 2016-02-05 0.995030
6 2016-02-12 0.987500
and tail:
Date dots
48 2016-12-02 0.999910
49 2016-12-09 0.992910
50 2016-12-16 0.996877
51 2016-12-23 0.992986
52 2016-12-30 0.960348
I was answering another question here about something in pandas I thought I knew, time series resampling, when I noticed this odd binning.
Let's say I have a dataframe with a daily date range index and a column I want to resample and sum on.
import numpy as np
import pandas as pd

index = pd.date_range(start="1/1/2018", end="31/12/2018")
df = pd.DataFrame(np.random.randint(100, size=len(index)),
                  columns=["sales"], index=index)
>>> df.head()
sales
2018-01-01 66
2018-01-02 18
2018-01-03 45
2018-01-04 92
2018-01-05 76
Now I resample by one month and everything looks fine:
>>> df.resample("1M").sum()
sales
2018-01-31 1507
2018-02-28 1186
2018-03-31 1382
[...]
2018-11-30 1342
2018-12-31 1337
If I try to resample by more months, though, the binning starts to look off. This is particularly evident with 6M:
df.resample("6M").sum()
sales
2018-01-31 1507
2018-07-31 8393
2019-01-31 7283
The first bin spans just one month, and the last bin extends one month into the future. Maybe I have to set closed="left" to get the proper limits:
df.resample("6M", closed="left").sum()
sales
2018-06-30 8090
2018-12-31 9054
2019-06-30 39
Now I have an extra bin in 2019 with data from 2018-12-31...
Is this working properly? Am I missing some option I should set?
EDIT: here's the output I would expect when resampling one year into six-month intervals, the first interval spanning Jan 1st to Jun 30 and the second spanning Jul 1st to Dec 31:
df.resample("6M", closed="left").sum()
sales
2018-06-30 8090
2018-12-31 9093 # 9054 + 39
Note that there's also some doubt here about what is happening with the June 30 data: does it go in the first bin, as I would expect, or in the second? With the last bin it's evident, but the same thing is probably happening in all the bins.
The M time offset alias implies month-end frequency.
What you need is 6MS, which is an alias for month-start frequency:
df.resample('6MS').sum()
resulting in
sales
2018-01-01 8130
2018-07-01 9563
2019-01-01 0
Also df.groupby(pd.Grouper(freq='6MS')).sum() can be used interchangeably.
For extra clarity you can compare ranges directly:
>>> pd.date_range('2018-01-01', '2018-12-31', freq='6M')
DatetimeIndex(['2018-01-31', '2018-07-31'], dtype='datetime64[ns]', freq='6M')
>>> pd.date_range('2018-01-01', '2018-12-31', freq='6MS')
DatetimeIndex(['2018-01-01', '2018-07-01'], dtype='datetime64[ns]', freq='6MS')
I added np.random.seed(365) so we can compare both our outputs.
print(df.resample("6M", kind='period').sum())
sales
2018-01 8794
2018-07 9033
Would this work for you?
I have a dataframe of surface weather observations (fzraHrObs) organized by a station identifier code and date. fzraHrObs has several columns of weather data. The station code and date (datetime objects) look like:
usaf dat
716270 2014-11-23 12:00:00
2015-12-20 08:00:00
2015-12-20 09:00:00
2015-12-21 04:00:00
2015-12-28 03:00:00
716280 2015-12-19 08:00:00
2015-12-19 08:00:00
I would like to get a count of the number of unique dates (days) per year for each station, i.e. the number of days of observations per year at each station. In my example above this would give me:
usaf Year Count
716270 2014 1
2015 3
716280 2014 0
2015 1
I've tried using groupby and grouping by station, year, and date:
grouped = fzraHrObs['dat'].groupby([fzraHrObs['usaf'], fzraHrObs.dat.dt.year, fzraHrObs.dat.dt.date])
Calling count, size, nunique, etc. on this just gives me the number of obs on each date, not the number of dates themselves per year. Any suggestions on getting what I want here?
It could be something like this: group the date by usaf and year, and then count the number of unique values:
import pandas as pd
df.dat.apply(lambda dt: dt.date()).groupby([df.usaf, df.dat.apply(lambda dt: dt.year)]).nunique()
# usaf dat
# 716270 2014 1
# 2015 3
# 716280 2015 1
# Name: dat, dtype: int64
The following should work:
df.groupby(['usaf', df.dat.dt.year])['dat'].apply(lambda s: s.dt.date.nunique())
What I did differently is to group by two levels only, then use the nunique method of a pandas Series to count the number of unique dates in each group.
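As a quick check, a minimal reproduction of the sample data above (including the duplicated 716280 timestamp) gives the per-year day counts:
import pandas as pd

fzraHrObs = pd.DataFrame({
    'usaf': [716270] * 5 + [716280] * 2,
    'dat': pd.to_datetime(['2014-11-23 12:00:00', '2015-12-20 08:00:00',
                           '2015-12-20 09:00:00', '2015-12-21 04:00:00',
                           '2015-12-28 03:00:00', '2015-12-19 08:00:00',
                           '2015-12-19 08:00:00'])})

counts = (fzraHrObs.groupby(['usaf', fzraHrObs.dat.dt.year])['dat']
                   .apply(lambda s: s.dt.date.nunique()))
print(counts)
# usaf  dat
# 716270  2014    1
#         2015    3
# 716280  2015    1
# Name: dat, dtype: int64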