I have a relatively large data set of weather data covering 10 years, and I want to group by day of year to get the 10-year low or high for every single day. To use groupby, I created a column this way:
df['dms'] = df['Date'].dt.strftime('%j')
The thing is, when I use dt.strftime('%j') I get two different numbers for the same day, which is weird. For instance, when I filter only Dec 31st and call value_counts() I get this:
365 363
366 82
Name: dms, dtype: int64
On the other hand, everything works just fine if I use dt.strftime('%b-%d'):
Dec-31 445
Name: dm, dtype: int64
I even tried dt.strftime('%b-%d-%r').value_counts() and got the same correct count:
Dec-31-12:00:00 AM 445
Name: Date, dtype: int64
What is actually going wrong, or (to sound less like a newbie) what is happening behind the scenes in the %j case?
Let us consider an example with the following data:
df = pd.DataFrame({'Date' : ['2016-12-31', '2017-12-31', '2018-12-31', '2019-12-31', '2020-12-31']})
df['Date'] = pd.to_datetime(df['Date'])
df
Date
0 2016-12-31
1 2017-12-31
2 2018-12-31
3 2019-12-31
4 2020-12-31
In the data above, 2016 and 2020 are leap years, with an extra day on February 29th to make up for the fact that an actual year is roughly 365 days and six hours long. Every fourth year a leap day is added to absorb the extra six hours accumulated over four years (4 x 6 = 24). We should therefore expect %j to return 366 for Dec 31st of those years when we do:
import pandas as pd
df = pd.DataFrame({'Date' : ['2016-12-31', '2017-12-31', '2018-12-31', '2019-12-31', '2020-12-31']})
df['Date'] = pd.to_datetime(df['Date'])
df['Day'] = df['Date'].dt.strftime('%j')
df
Date Day
0 2016-12-31 366
1 2017-12-31 365
2 2018-12-31 365
3 2019-12-31 365
4 2020-12-31 366
Consequently, when you call value_counts(), it returns:
365 3
366 2
Name: Day, dtype: int64
This is also expected behavior, so %j is working correctly behind the scenes: it is accounting for leap years.
%j returns the day of the year as a zero-padded number, 001-366 (Dec 31st is 366 in a leap year, 365 otherwise). Since your data spans 10 years, it includes leap years, so 366 is a valid value for Dec 31st.
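For the original goal (a 10-year low or high per calendar day), one option is to group on a month-day key rather than on %j, so that Dec 31st lands in a single group regardless of leap years. A minimal sketch, assuming a hypothetical Temp column with made-up values:

```python
import pandas as pd

# Hypothetical sample: the Temp column and its values are invented for illustration.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2016-12-31", "2017-12-31", "2018-12-31",
                            "2016-03-01", "2017-03-01"]),
    "Temp": [5.0, -2.0, 1.0, 10.0, 12.0],
})

# '%j' differs between leap and non-leap years from March 1st onwards ...
day_of_year = df["Date"].dt.strftime("%j").tolist()

# ... whereas a month-day key is the same in every year.
df["month_day"] = df["Date"].dt.strftime("%m-%d")
extremes = df.groupby("month_day")["Temp"].agg(["min", "max"])
print(extremes)
```

With this key, Feb 29th forms its own group that simply contains fewer observations than the other days.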
Related
I want to resample stock prices yearly but with a custom start date (not natural/calendar year). This is to simulate fiscal year which may be different from calendar year.
I have:
Date
2010-02-01 39.921856
2010-02-02 39.929314
2010-02-03 40.511570
2010-02-04 39.541145
2010-02-05 39.899464
...
2022-01-24 138.550598
2022-01-25 135.536484
2022-01-26 134.152969
2022-01-27 134.241882
2022-01-28 135.902145
Name: Adj Close, Length: 3021, dtype: float64
I want to calculate the cumulative return every fiscal year, so I need the closing price for the last day of each fiscal year. For example, this day is 1 Feb, so I want to get:
2010-02-01 39.921856
2011-02-01 xxx
2012-02-01 xxx
2013-02-01 xxx
I tried resample('1Y'), date_range etc., but pandas forces the calendar year no matter what. I keep getting either 01-01 or 12-31 depending on whether I ask for year start or year end. The origin argument for resample also does nothing:
df = df.resample('Y', origin='2010-02-01').last()
How do I get what I want?
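One way to sidestep the calendar-year anchoring is to compute a fiscal-year label by hand and group on it. This is a minimal sketch, assuming the fiscal year starts on 1 February and using made-up prices; fiscal_year, first_per_fy and last_per_fy are names introduced here for illustration only:

```python
import pandas as pd

# Made-up prices; only the dates matter for the illustration.
idx = pd.to_datetime(["2010-02-01", "2010-06-15", "2011-01-31",
                      "2011-02-01", "2011-09-30", "2012-02-01"])
prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0], index=idx)

# Assumption: the fiscal year starts on 1 February, so January dates
# still belong to the fiscal year that began in the previous calendar year.
fiscal_year = (idx.year - (idx.month < 2).astype(int)).to_numpy()

first_per_fy = prices.groupby(fiscal_year).first()  # first available price per fiscal year
last_per_fy = prices.groupby(fiscal_year).last()    # last available price per fiscal year
print(first_per_fy)
print(last_per_fy)
```

Because the grouping uses whatever rows exist, a fiscal year whose first day falls on a non-trading day still yields the first available price after it.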
I am trying to calculate the week number starting from the first Monday of October. Are there any functions in pandas or datetime to do this calculation efficiently?
MWE
import pandas as pd
from datetime import date,datetime
df = pd.DataFrame({'date': pd.date_range('2021-10-01','2022-11-01',freq='1D')})
df['day_name'] = df['date'].dt.day_name()
df2 = df.head().copy()  # copy so the assignment below does not touch a view of df
df2['Fiscal_Week'] = [52,52,52,1,1] # 2021-10-04 is a Monday, so from Oct 4 the week is 1
# How to do it programmatically for any year?
df2
date day_name Fiscal_Week
0 2021-10-01 Friday 52
1 2021-10-02 Saturday 52
2 2021-10-03 Sunday 52
3 2021-10-04 Monday 1
4 2021-10-05 Tuesday 1
Shift the dates by the number of days remaining until the new year, then use the standard Monday-based week number format (%W):
df['Fiscal_Week'] = (
df['date'] + pd.DateOffset(days=91)
).dt.strftime('%W').astype(int).replace({0:None}).fillna(method='ffill')
The offset in days can be calculated manually (I assume the fiscal year start is fixed).
The replace part is needed because the leftover days from the previous year are counted as week 0. The previous week might be 52 or 53, so those values are replaced with NA and then forward-filled.
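A different approach, not the offset trick above, is to compute the first Monday of October directly and count weeks from it, which avoids hard-coding the 91-day offset. A minimal standard-library sketch; first_monday_of_october and fiscal_week are hypothetical helper names:

```python
from datetime import date, timedelta

def first_monday_of_october(year):
    """Return the first Monday of October for the given calendar year."""
    oct1 = date(year, 10, 1)
    # weekday(): Monday == 0, so this is the number of days until the next Monday.
    return oct1 + timedelta(days=(7 - oct1.weekday()) % 7)

def fiscal_week(d):
    """Week number counted from the first Monday of October (week 1)."""
    start = first_monday_of_october(d.year)
    if d < start:
        # Dates before this year's start belong to the previous fiscal year.
        start = first_monday_of_october(d.year - 1)
    return (d - start).days // 7 + 1

print(fiscal_week(date(2021, 10, 3)), fiscal_week(date(2021, 10, 4)))  # 52 1
```

On the MWE's frame this could presumably be applied with df['date'].dt.date.map(fiscal_week), at the cost of a Python-level loop.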
Python pandas (0.24.1) is adding a seemingly arbitrary number of hours, minutes, and seconds to my datetime objects. This seems unexpected as default behavior; I would expect the time component to default to midnight (00:00:00). Is this a bug?
import pandas as pd
df = pd.DataFrame( {'yr': [2019, 2019],
'mo': [9, 9],
'dy': [25, 26]} )
df['dtime'] = ( pd.to_datetime(df['yr'],format='%Y')
+pd.to_timedelta(df['mo']-1,unit='M')
+pd.to_timedelta(df['dy']-1,unit='d') )
print('pandas version == '+pd.__version__)
df
################################################
OUTPUT:
################################################
pandas version == 0.24.1
yr mo dy dtime
0 2019 9 25 2019-09-25 11:52:48
1 2019 9 26 2019-09-26 11:52:48
The problem is the month conversion: pandas uses an average year (365.2425 days, which accounts for leap years) divided by 12 as the length of one month:
print (pd.to_timedelta(365.2425, unit='d') / 12)
30 days 10:29:06
print (pd.to_timedelta(1, unit='M'))
30 days 10:29:06
print (pd.to_timedelta(df['mo']-1,unit='M'))
0 243 days 11:52:48
1 243 days 11:52:48
Name: mo, dtype: timedelta64[ns]
A better solution is to use to_datetime with year, month and day columns and, if necessary (when the real data has other columns), select only those columns with list(d.values()):
d = {'yr':'year', 'mo':'month', 'dy':'day'}
df['dtime'] = pd.to_datetime(df.rename(columns=d)[list(d.values())])
print (df)
yr mo dy dtime
0 2019 9 25 2019-09-25
1 2019 9 26 2019-09-26
To add detail as to the issue with timedelta that Jezrael pointed out above, the issue with the month conversion is as follows: Pandas timedelta defines a month as 1/12 of a year, which is 365.2425 days based on leap year logic.
243 days 11:52:48 is 21037968 seconds.
>>> 243*60*60*24+11*60*60+52*60+48
21037968
Some dimensional analysis confirms this is 8/12 of a year that is 365.2425 days long.
>>> 21037968/((8/12)*365.2425*60*60*24)
1.0
As noted above, use to_datetime to avoid this.
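The arithmetic above can be reproduced with the standard library alone, without relying on the unit='M' conversion. A small sketch, assuming (as the answers state) that one month is treated as 1/12 of a 365.2425-day year:

```python
from datetime import timedelta

# Assumption taken from the answers above: one "month" is 1/12 of a 365.2425-day year.
SECONDS_PER_YEAR = 31556952  # 365.2425 days * 86400 seconds per day
one_month = timedelta(seconds=SECONDS_PER_YEAR) / 12
eight_months = 8 * one_month  # September is month 9, so mo - 1 == 8

print(one_month)     # 30 days, 10:29:06
print(eight_months)  # 243 days, 11:52:48
```

The 11:52:48 remainder is exactly the spurious time component that shows up in the dtime column of the question.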
I have a dataframe with two columns; Sales and Date.
dataset.head(10)
Date Sales
0 2015-01-02 34988.0
1 2015-01-03 32809.0
2 2015-01-05 9802.0
3 2015-01-06 15124.0
4 2015-01-07 13553.0
5 2015-01-08 14574.0
6 2015-01-09 20836.0
7 2015-01-10 28825.0
8 2015-01-12 6938.0
9 2015-01-13 11790.0
I want to convert the Date column from yyyy-mm-dd (e.g. 2015-06-01) to yyyy-ww (e.g. 2015-23), so I run the following piece of code:
dataset["Date"] = pd.to_datetime(dataset["Date"]).dt.strftime('%Y-%V')
Then I group by my Sales based on weeks, i.e.
data = dataset.groupby(['Date'])["Sales"].sum().reset_index()
data.head(10)
Date Sales
0 2015-01 67797.0
1 2015-02 102714.0
2 2015-03 107011.0
3 2015-04 121480.0
4 2015-05 148098.0
5 2015-06 132152.0
6 2015-07 133914.0
7 2015-08 136160.0
8 2015-09 185471.0
9 2015-10 190793.0
Now I want to create a date range based on the Date column, since I'm predicting sales based on weeks:
ds = data.Date.values
ds_pred = pd.date_range(start=ds.min(), periods=len(ds) + num_pred_weeks,
freq="W")
However, I'm getting the following error: could not convert string to Timestamp, which I'm not really sure how to fix. If I use 2015-01-01 as the start date instead, I get no error, which makes me realize that I'm using the functions wrong. However, I'm not sure how.
I would like to basically have a date range that spans weekly from the current week and then 52 weeks into the future.
I think the problem is that you are taking the minimum of the dataset["Date"] column, which is filled with strings in the format YYYY-VV, whereas date_range needs a YYYY-MM-DD string or a datetime object.
I found this:
Several additional directives not required by the C89 standard are included for convenience. These parameters all correspond to ISO 8601 date values. These may not be available on all platforms when used with the strftime() method. The ISO 8601 year and ISO 8601 week directives are not interchangeable with the year and week number directives above. Calling strptime() with incomplete or ambiguous ISO 8601 directives will raise a ValueError.
%V ISO 8601 week as a decimal number with Monday as the first day of the week. Week 01 is the week containing Jan 4.
Pandas 0.24.2 bug with YYYY-VV format:
dataset = pd.DataFrame({'Date':['2015-06-01','2015-06-02']})
dataset["Date"] = pd.to_datetime(dataset["Date"]).dt.strftime('%Y-%V')
print (dataset)
Date
0 2015-23
1 2015-23
ds = pd.to_datetime(dataset['Date'], format='%Y-%V')
print (ds)
ValueError: 'V' is a bad directive in format '%Y-%V'
A possible solution is to use %U or %W instead; check this:
%U Week number of the year (Sunday as the first day of the week) as a zero padded decimal number. All days in a new year preceding the first Sunday are considered to be in week 0.
%W Week number of the year (Monday as the first day of the week) as a decimal number. All days in a new year preceding the first Monday are considered to be in week 0.
dataset = pd.DataFrame({'Date':['2015-06-01','2015-06-02']})
dataset["Date"] = pd.to_datetime(dataset["Date"]).dt.strftime('%Y-%U')
print (dataset)
Date
0 2015-22
1 2015-22
ds = pd.to_datetime(dataset['Date'] + '-1', format='%Y-%U-%w')
print (ds)
0 2015-06-01
1 2015-06-01
Name: Date, dtype: datetime64[ns]
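If the ISO week numbering produced by %V should be kept, the standard library's datetime.strptime can parse it, provided the ISO year (%G) and an ISO weekday (%u) are supplied as well. A minimal sketch, assuming Python 3.6 or later and that Monday is the wanted weekday:

```python
from datetime import datetime

iso_week = "2015-23"  # ISO year and ISO week, as produced with '%G-%V'

# Append '-1' to select Monday (%u: Monday == 1, Sunday == 7).
monday = datetime.strptime(iso_week + "-1", "%G-%V-%u")
print(monday.date())  # 2015-06-01
```

The resulting datetime can then be passed as the start argument of pd.date_range.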
Or using data from original DataFrame in datetimes:
dataset = pd.DataFrame({'Date':['2015-06-01','2015-06-02'],
'Sales':[10,20]})
dataset["Date"] = pd.to_datetime(dataset["Date"])
print (dataset)
Date Sales
0 2015-06-01 10
1 2015-06-02 20
data = dataset.groupby(dataset['Date'].dt.strftime('%Y-%V'))["Sales"].sum().reset_index()
print (data)
Date Sales
0 2015-23 30
num_pred_weeks = 5
ds = data.Date.values
ds_pred = pd.date_range(start=dataset["Date"].min(), periods=len(ds) + num_pred_weeks, freq="W")
print (ds_pred)
DatetimeIndex(['2015-06-07', '2015-06-14', '2015-06-21',
'2015-06-28',
'2015-07-05', '2015-07-12'],
dtype='datetime64[ns]', freq='W-SUN')
If ds contains dates as strings formatted like '2015-01', which corresponds to '%Y-%W' (or '%G-%V' in the datetime library), you have to add a day number to obtain a date. Here, assuming that you want the Monday, you should do:
ds_pred = pd.date_range(start=pd.to_datetime(ds.min() + '-1', format='%Y-%W-%w'),
periods=len(ds) + num_pred_weeks, freq="W")
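Mixing the calendar year (%Y) with the ISO week (%V), as in the question, can mislabel days around New Year, because the ISO year and the calendar year differ there. A small sketch using date.isocalendar(); the label variables are names introduced here for illustration:

```python
from datetime import date

d = date(2016, 1, 1)  # a Friday that still belongs to the last ISO week of 2015
iso_year, iso_week, iso_weekday = d.isocalendar()

mixed_label = "{}-{:02d}".format(d.year, iso_week)   # what '%Y-%V' amounts to
iso_label = "{}-{:02d}".format(iso_year, iso_week)   # what '%G-%V' amounts to
print(mixed_label, iso_label)  # 2016-53 2015-53
```

Grouping on the mixed label would put this day at the end of 2016 instead of the end of 2015, so pairing %G with %V is the safer choice where the platform supports it.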
I'm trying to create a simple column using pandas that holds the number of days in the year of the adjacent date column.
I've already done this fairly easily for the numbers of days in the month using the daysinmonth attribute of DatetimeIndex, with the following:
def daysinmonth(row):
    x = pd.DatetimeIndex(row['Date']).daysinmonth
    return x
daysinmonth(df)
I'm having trouble mimicking these results for the year without a nifty pre-defined attribute.
My dataframe looks like the following (sans the Days_in_year column, since I'm trying to create that):
Date Days_in_month Days_in_year
1 2/28/2018 28 365
2 4/14/2019 30 365
3 1/1/2020 31 366
4 2/15/2020 29 366
Thanks to anyone who takes a look!
Take the year modulo 4: a result of 0 means 366 days, anything else means 365. (Note that this does not cover the special century cases; see the updated function further down.)
(pd.to_datetime(df.Date,format='%m/%d/%Y').dt.year%4).eq(0).map({True:366,False:365})
Out[642]:
1 365
2 365
3 366
4 366
Name: Date, dtype: int64
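You can use this, which decides leap years more accurately (definition from this site):
def daysinyear(x):
    if x % 4 == 0:
        if x % 100 == 0:
            if x % 400 == 0:
                return 366   # divisible by 400: leap year
            else:
                return 365   # century year not divisible by 400
        else:
            return 366       # divisible by 4 but not by 100: leap year
    else:
        return 365
# apply to the year itself, not to year % 4
pd.to_datetime(df.Date,format='%m/%d/%Y').dt.year.apply(daysinyear)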
Out[656]:
1 365
2 365
3 366
4 366
Name: Date, dtype: int64
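The standard library already implements the full Gregorian rule in calendar.isleap, so the nested conditions do not have to be written by hand. A minimal sketch; days_in_year is a hypothetical helper name:

```python
import calendar

def days_in_year(year):
    """Return 365 or 366 using the leap-year rule from the standard library."""
    return 366 if calendar.isleap(year) else 365

# 1900 is divisible by 4 but is not a leap year; 2000 is.
for y in (1900, 2000, 2018, 2020):
    print(y, days_in_year(y))
```

The century years are where a plain year % 4 check gives the wrong answer.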
You can also use YearEnd. You'll get a timedelta64 column with this method.
import pandas as pd
from pandas.tseries.offsets import YearEnd
df['Date'] = pd.to_datetime(df.Date)
(df.Date + YearEnd(1)) - (df.Date - YearEnd(1))
1 365 days
2 365 days
3 366 days
4 366 days
Name: Date, dtype: timedelta64[ns]
Here's another way using periods:
df['Date'].dt.to_period('A').dt.to_timestamp('A').dt.dayofyear
Output:
1 365
2 365
3 366
4 366
Name: Date, dtype: int64
I would do something like this:
import datetime
import numpy as np
def func(date):
    year = date.year
    begin = datetime.datetime(year,1,1)
    # use Jan 1 of the next year; Dec 31 minus Jan 1 would be one day short
    end = datetime.datetime(year+1,1,1)
    diff = (end - begin)
    result = np.timedelta64(diff, "D").astype("int")
    return result
print(func(datetime.datetime(2016,12,31)))
One solution is to take the first day of one year and of the next year. Then calculate the difference. You can then apply this using pd.Series.apply:
def days_in_year(x):
    day1 = x.replace(day=1, month=1)
    day2 = day1.replace(year=day1.year+1)
    return (day2 - day1).days
df['Date'] = pd.to_datetime(df['Date'])
df['Days_in_year'] = df['Date'].apply(days_in_year)
print(df)
Date Days_in_month Days_in_year
1 2018-02-28 28 365
2 2019-04-14 30 365
3 2020-01-01 31 366
4 2020-02-15 29 366
You can use the basic formula to check if a year is a leap year and add the result to 365 to get the number of days in a year.
# The conversion is not needed if df['Date'] is already of type datetime
dates = pd.to_datetime(df['Date'])
years = dates.dt.year
ndays = 365 + ((years % 4 == 0) & ((years % 100 != 0) | (years % 400 == 0))).astype(int)
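pandas also exposes the leap-year test directly through the dt.is_leap_year accessor, which keeps the computation vectorized without spelling out the modulo rule. A minimal sketch using the dates from the question's table:

```python
import pandas as pd

# Sample dates taken from the question's table.
df = pd.DataFrame({"Date": ["2/28/2018", "4/14/2019", "1/1/2020", "2/15/2020"]})
dates = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# is_leap_year is a boolean Series; True counts as one extra day.
df["Days_in_year"] = 365 + dates.dt.is_leap_year.astype(int)
print(df)
```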