I have a Pandas df that contains two time columns. These columns contain the yyyy-mm-dd of a yearly event.
How can I calculate the average mm-dd on which the event occurs, across all years?
I guess this involves counting, for each row, the number of days between the actual date and January 1 of that year, but I don't see how to do that efficiently with Pandas.
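Here is a rough, untested sketch of the idea; I suspect this element-by-element loop is exactly what Pandas can do more efficiently (the avg_month_day name is just for illustration):
import pandas as pd

def avg_month_day(dates):
    # `dates` is one event column as a Series of Timestamps (NaT allowed)
    offsets = [(d - pd.Timestamp(d.year, 1, 1)).days for d in dates.dropna()]
    mean_offset = sum(offsets) / len(offsets)
    # 2000 is only a reference year used to format the result as mm-dd
    return (pd.Timestamp(2000, 1, 1) + pd.Timedelta(days=mean_offset)).strftime('%m-%d')

# e.g. avg_month_day(df['dormancy1'])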
Thank you!
dormancy1 greenup1 maturity1 senescence1 dormancy2 greenup2 maturity2 senescence2
8 2002-08-31 2002-04-27 2002-05-06 2002-08-21 NaT NaT NaT NaT
22 2003-09-17 2003-06-06 2003-06-15 2003-07-22 NaT NaT NaT NaT
36 2004-09-10 2004-04-20 2004-05-15 2004-05-24 NaT NaT NaT NaT
44 2005-08-13 2005-04-24 2005-06-29 2005-07-18 NaT NaT NaT NaT
74 2007-05-10 2007-03-13 2007-04-07 2007-05-01 NaT NaT NaT NaT
95 2009-09-18 2009-04-26 2009-05-06 2009-06-03 NaT NaT NaT NaT
113 2010-09-09 2010-05-29 2010-06-08 2010-07-19 NaT NaT NaT NaT
Edit:
Complete steps to reproduce error:
# Create and format data
import datetime as dt
import numpy as np
import pandas as pd

df = pd.DataFrame({'dormancy1': ['2002-08-31','2003-09-17','2004-09-10','2005-08-13','2007-05-10','2009-09-18','2010-09-09'],
                   'greenup1': ['2002-04-27','2003-06-06','2004-04-20','2005-04-24','2007-03-13','2009-04-26','2010-05-29'],
                   'maturity1': ['2002-05-06','2003-06-15','2004-05-15','2005-06-29','2007-04-07','2009-05-06','2010-06-08'],
                   'senescence1': ['2002-08-21','2003-07-22','2004-05-24','2005-07-18','2007-05-01','2009-06-03','2010-07-19'],
                   'dormancy2': ['NaT','NaT','NaT','NaT','NaT','NaT','NaT'],
                   'greenup2': ['NaT','NaT','NaT','NaT','NaT','NaT','NaT'],
                   'maturity2': ['NaT','NaT','NaT','NaT','NaT','NaT','NaT'],
                   'senescence2': ['NaT','NaT','NaT','NaT','NaT','NaT','NaT']})
df['dormancy1'] = pd.to_datetime(df['dormancy1'])
df['dormancy2'] = pd.to_datetime(df['dormancy2'])
df['greenup1'] = pd.to_datetime(df['greenup1'])
df['greenup2'] = pd.to_datetime(df['greenup2'])
df['maturity1'] = pd.to_datetime(df['maturity1'])
df['maturity2'] = pd.to_datetime(df['maturity2'])
df['senescence1'] = pd.to_datetime(df['senescence1'])
df['senescence2'] = pd.to_datetime(df['senescence2'])
# Define the function
def computeYear(row):
    for i in row:
        if pd.isna(i):
            pass
        else:
            return dt.datetime(int(i.strftime('%Y')), 1, 1)
    return np.nan

df['1Jyear'] = df.apply(lambda row: computeYear(row), axis=1)

# This is the line that raises the error
df.apply(lambda x: pd.to_datetime((x - df['1Jyear']).values.astype(np.int64).mean()).strftime('%m-%d'))
This is what I would do:
First, convert your data to datetime format if that is not already done:
df['dormancy1'] = pd.to_datetime(df['dormancy1'])
df['greenup1'] = pd.to_datetime(df['greenup1'])
Get January 1st of the year of the row (I assumed that the events in one row occur in the same year):
df['1Jyear'] = df['dormancy1'].dt.year.apply(lambda x: dt.datetime(x, 1, 1))
This is how your dataframe looks now:
df.head()
dormancy1 greenup1 1Jyear
0 2002-08-31 2002-04-27 2002-01-01
1 2003-09-17 2003-06-06 2003-01-01
2 2004-09-10 2004-04-20 2004-01-01
3 2005-08-13 2005-04-24 2005-01-01
4 2007-05-10 2007-03-13 2007-01-01
To get the average month and day of each event:
df[['dormancy1', 'greenup1']].apply(lambda x: pd.to_datetime((x - df['1Jyear']).values.astype(np.int64).mean()).strftime('%m-%d'))
This outputs the following series:
dormancy1 08-10
greenup1 04-30
Let me know if this is the required result; I hope it helps.
Update: Handle Missing Data
Update2: Handle empty columns
I'm working with the following data:
dormancy1 greenup1 maturity1 senescence1 dormancy2 greenup2 maturity2 senescence2
8 2002-08-31 2002-04-27 2002-05-06 2002-08-21 NaT NaT NaT NaT
22 2003-09-17 2003-06-06 2003-06-15 2003-07-22 NaT NaT NaT NaT
36 2004-09-10 2004-04-20 2004-05-15 2004-05-24 NaT NaT NaT NaT
44 2005-08-13 2005-04-24 2005-06-29 2005-07-18 NaT NaT NaT NaT
74 2007-05-10 2007-03-13 2007-04-07 2007-05-01 NaT NaT NaT NaT
95 2009-09-18 2009-04-26 2009-05-06 2009-06-03 NaT NaT NaT NaT
113 2010-09-09 2010-05-29 2010-06-08 2010-07-19 NaT NaT NaT NaT
To compute the year of each row (I take the first year I find in the row, so again I assume every event in a row falls in the same year; if not, you would need to compute a separate column for each event):
def computeYear(row):
    for i in row:
        if not pd.isna(i):
            return dt.datetime(int(i.strftime('%Y')), 1, 1)
    return np.nan

df['1Jyear'] = df.apply(lambda row: computeYear(row), axis=1)
To get the result:
df.apply(lambda column: np.datetime64('NaT') if column.isnull().all() else\
pd.to_datetime((column - df['1Jyear']).values.astype(np.int64).mean()).strftime('%m-%d'))
Output:
dormancy1 08-20
greenup1 04-29
maturity1 05-21
senescence1 06-28
dormancy2 NaN
greenup2 NaN
maturity2 NaN
senescence2 NaN
1Jyear 01-01
dtype: object
Alright, now what you want to use here is the good old pandas.Series.dt.dayofyear attribute. It tells you how many days into the year a given date falls. This has probably flipped the switch in your mind and you're building the answer right now, but just in case:
avg_day_dormancy1 = df['dormancy1'].dt.dayofyear.mean()
# Now let's add those days to a year to get an actual date
import datetime as dtt  # You could do this in pandas, but this is quick and dirty
avg_date_dormancy1 = dtt.datetime.strptime('2000-01-01', '%Y-%m-%d')  # E.g. get a date in the year 2000
# Subtract 1 because dayofyear starts at 1 (January 1st is day 1)
avg_date_dormancy1 += dtt.timedelta(days=avg_day_dormancy1 - 1)
Given the data you provided, I got August 10th as the average date on which dormancy1 occurs. You could also call .std() on the dayofyear Series to get a rough 95% interval for when these events occur, for example.
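A minimal sketch of that spread idea, assuming the day-of-year values are roughly normally distributed (the 1.96 factor and the reference year 2000 are my choices, not part of the answer above):
import datetime as dtt

doy = df['dormancy1'].dt.dayofyear
mean_doy, std_doy = doy.mean(), doy.std()
base = dtt.datetime(2000, 1, 1)  # reference year, only used to format mm-dd
low = base + dtt.timedelta(days=mean_doy - 1.96 * std_doy - 1)
high = base + dtt.timedelta(days=mean_doy + 1.96 * std_doy - 1)
print(low.strftime('%m-%d'), high.strftime('%m-%d'))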
This is another way of doing it. Hope this helps.
import pandas as pd
from datetime import datetime

# Calculate the average day of the year for both events
# (this assumes the columns still hold 'yyyy-mm-dd' strings, not datetimes)
mean_greenup_DoY = df['greenup1'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d').timetuple().tm_yday).mean()
mean_dormancy_DoY = df['dormancy1'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d').timetuple().tm_yday).mean()
This first converts the date string to a datetime object and then finds the day of the year using the logic in the lambda function; mean() is then applied to get the average day of the year.
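If you then want that mean day of the year back as an mm-dd string, here is a small follow-up sketch (not part of the original snippet; 2001 is an arbitrary non-leap reference year):
from datetime import datetime, timedelta

mean_greenup_date = datetime(2001, 1, 1) + timedelta(days=mean_greenup_DoY - 1)  # day 1 == January 1st
print(mean_greenup_date.strftime('%m-%d'))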
I am a little bit perplexed as to why NaT values are showing up in my CSV... usually they show up as "". Here is my date formatting:
df['submitted_on'] = pd.to_datetime(df['submitted_on'], errors='coerce').dt.to_period('d')
df['resolved_on'] = pd.to_datetime(df['resolved_on'], errors='coerce').dt.to_period('d')
df['closed_on'] = pd.to_datetime(df['closed_on'], errors='coerce').dt.to_period('d')
df['duplicate_on'] = pd.to_datetime(df['duplicate_on'], errors='coerce').dt.to_period('d')
df['junked_on'] = pd.to_datetime(df['junked_on'], errors='coerce').dt.to_period('d')
df['unproducible_on'] = pd.to_datetime(df['unproducible_on'], errors='coerce').dt.to_period('d')
df['verified_on'] = pd.to_datetime(df['verified_on'], errors='coerce').dt.to_period('d')
When I call df.head(), this is my result. Good, fine, all is dandy.
identifier status submitted_on resolved_on closed_on duplicate_on junked_on \
0 xx1 D 2004-07-28 NaT NaT 2004-08-26 NaT
1 xx2 N 2010-03-02 NaT NaT NaT NaT
2 xx3 U 2005-10-26 NaT NaT NaT NaT
3 xx4 V 2006-06-30 2006-09-15 NaT NaT NaT
4 xx5 R 2012-09-21 2013-06-06 NaT NaT NaT
unproducible_on verified_on
0 NaT NaT
1 NaT NaT
2 2005-11-01 NaT
3 NaT 2006-11-20
4 NaT NaT
But when I write to CSV, the NaT values show up:
"identifier","status","submitted_on","resolved_on","closed_on","duplicate_on","junked_on","unproducible_on","verified_on"
"xx1","D","2004-07-28","NaT","NaT","2004-08-26","NaT","NaT","NaT"
"xx2","N","2010-03-02","NaT","NaT","NaT","NaT","NaT","NaT"
"xx3","U","2005-10-26","NaT","NaT","NaT","NaT","2005-11-01","NaT"
"xx4","V","2006-06-30","2006-09-15","NaT","NaT","NaT","NaT","2006-11-20"
"xx5","R","2012-09-21","2013-06-06","NaT","NaT","NaT","NaT","NaT"
"xx6","D","2009-11-25","NaT","NaT","2010-02-26","NaT","NaT","NaT"
"xx7","D","2003-08-29","NaT","NaT","2003-08-29","NaT","NaT","NaT"
"xx8","R","2003-06-06","2003-06-24","NaT","NaT","NaT","NaT","NaT"
"xx9","R","2004-11-05","2004-11-15","NaT","NaT","NaT","NaT","NaT"
"xx10","R","2008-02-21","2008-09-25","NaT","NaT","NaT","NaT","NaT"
"xx11","R","2007-03-08","2007-03-21","NaT","NaT","NaT","NaT","NaT"
"xx12","R","2011-08-22","2012-06-21","NaT","NaT","NaT","NaT","NaT"
"xx13","J","2003-07-07","NaT","NaT","NaT","2003-07-10","NaT","NaT"
"xx14","A","2008-09-24","NaT","NaT","NaT","NaT","NaT","NaT"
So, I did what I thought would fix the problem: df.fillna('', inplace=True), and nada. I then tried df.replace(pd.NaT, '') with no results, followed by na_rep='' when writing to CSV, which also did not produce the desired output. What am I supposed to use to prevent NaT from being written to the CSV?
Sample data:
"identifier","status","submitted_on","resolved_on","closed_on","duplicate_on","junked_on","unproducible_on","verified_on"
"xx1","D","2004-07-28 07:00:00.0","null","null","2004-08-26 07:00:00.0","null","null","null"
"xx2","N","2010-03-02 03:00:16.0","null","null","null","null","null","null"
"xx3","U","2005-10-26 14:20:20.0","null","null","null","null","2005-11-01 13:02:22.0","null"
"xx4","V","2006-06-30 07:00:00.0","2006-09-15 07:00:00.0","null","null","null","null","2006-11-20 08:00:00.0"
"xx5","R","2012-09-21 06:30:58.0","2013-06-06 09:35:25.0","null","null","null","null","null"
"xx6","D","2009-11-25 02:16:03.0","null","null","2010-02-26 12:28:22.0","null","null","null"
"xx7","D","2003-08-29 07:00:00.0","null","null","2003-08-29 07:00:00.0","null","null","null"
"xx8","R","2003-06-06 12:00:00.0","2003-06-24 12:00:00.0","null","null","null","null","null"
"xx9","R","2004-11-05 08:00:00.0","2004-11-15 08:00:00.0","null","null","null","null","null"
"xx10","R","2008-02-21 05:13:39.0","2008-09-25 17:20:57.0","null","null","null","null","null"
"xx11","R","2007-03-08 17:47:44.0","2007-03-21 23:47:57.0","null","null","null","null","null"
"xx12","R","2011-08-22 19:50:25.0","2012-06-21 05:52:12.0","null","null","null","null","null"
"xx13","J","2003-07-07 12:00:00.0","null","null","null","2003-07-10 12:00:00.0","null","null"
"xx14","A","2008-09-24 11:36:34.0","null","null","null","null","null","null"
Your problem is that you are converting to periods. The NaT you see is actually a period object.
One way around this is to convert to strings instead.
Use
.dt.strftime('%Y-%m-%d')
Instead of
.dt.to_period('d')
Then the NaT values you see will be strings and can be replaced like this:
.dt.strftime('%Y-%m-%d').replace('NaT', '')
df = pd.DataFrame(dict(date=pd.to_datetime(['2015-01-01', pd.NaT])))
df
        date
0 2015-01-01
1        NaT
df.date.dt.strftime('%Y-%m-%d')
0    2015-01-01
1           NaT
Name: date, dtype: object
df.date.dt.strftime('%Y-%m-%d').replace('NaT', '')
0    2015-01-01
1
Name: date, dtype: object
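Putting it together, here is a sketch of how you could apply this to the date columns from the question before writing the CSV (the column list comes from the sample data, and 'bugs_out.csv' is just a placeholder filename):
import pandas as pd

date_cols = ['submitted_on', 'resolved_on', 'closed_on', 'duplicate_on',
             'junked_on', 'unproducible_on', 'verified_on']
for col in date_cols:
    # Parse, format as a plain string, and blank out the missing values
    df[col] = (pd.to_datetime(df[col], errors='coerce')
                 .dt.strftime('%Y-%m-%d')
                 .replace('NaT', ''))

df.to_csv('bugs_out.csv', index=False)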