I have a Pandas df that contains several datetime columns. These columns contain the yyyy-mm-dd of a yearly event.
How can I calculate the average mm-dd of the occurrence of each event over all years?
I guess this involves counting, for each row, the number of days between the actual date and Jan 1 of that year, but I don't see how to do that efficiently with Pandas.
Thank you!
dormancy1 greenup1 maturity1 senescence1 dormancy2 greenup2 maturity2 senescence2
8 2002-08-31 2002-04-27 2002-05-06 2002-08-21 NaT NaT NaT NaT
22 2003-09-17 2003-06-06 2003-06-15 2003-07-22 NaT NaT NaT NaT
36 2004-09-10 2004-04-20 2004-05-15 2004-05-24 NaT NaT NaT NaT
44 2005-08-13 2005-04-24 2005-06-29 2005-07-18 NaT NaT NaT NaT
74 2007-05-10 2007-03-13 2007-04-07 2007-05-01 NaT NaT NaT NaT
95 2009-09-18 2009-04-26 2009-05-06 2009-06-03 NaT NaT NaT NaT
113 2010-09-09 2010-05-29 2010-06-08 2010-07-19 NaT NaT NaT NaT
Edit:
Complete steps to reproduce the error:
import pandas as pd

# Create and format data
df = pd.DataFrame({'dormancy1': ['2002-08-31', '2003-09-17', '2004-09-10', '2005-08-13', '2007-05-10', '2009-09-18', '2010-09-09'],
                   'greenup1': ['2002-04-27', '2003-06-06', '2004-04-20', '2005-04-24', '2007-03-13', '2009-04-26', '2010-05-29'],
                   'maturity1': ['2002-05-06', '2003-06-15', '2004-05-15', '2005-06-29', '2007-04-07', '2009-05-06', '2010-06-08'],
                   'senescence1': ['2002-08-21', '2003-07-22', '2004-05-24', '2005-07-18', '2007-05-01', '2009-06-03', '2010-07-19'],
                   'dormancy2': ['NaT'] * 7,
                   'greenup2': ['NaT'] * 7,
                   'maturity2': ['NaT'] * 7,
                   'senescence2': ['NaT'] * 7})
# Parse every column as datetime ('NaT' strings become real NaT)
for col in df.columns:
    df[col] = pd.to_datetime(df[col])
# Define the function
import datetime as dt
import numpy as np

def computeYear(row):
    for i in row:
        if pd.isna(i):
            pass
        else:
            return dt.datetime(int(i.strftime('%Y')), 1, 1)
    return np.nan
df['1Jyear'] = df.apply(lambda row: computeYear(row), axis=1)
df.apply(lambda x: pd.to_datetime((x - df['1Jyear']).values.astype(np.int64).mean()).strftime('%m-%d'))
This is what I would do:
Convert your data to datetime format, if that is not already done:
df['dormancy1'] = pd.to_datetime(df['dormancy1'])
df['greenup1'] = pd.to_datetime(df['greenup1'])
Get the 1st of January of the year of the row (I assumed that the events in one row occur in the same year):
df['1Jyear'] = df['dormancy1'].dt.year.apply(lambda x: dt.datetime(x, 1, 1))
This is how your dataframe looks now:
df.head()
dormancy1 greenup1 1Jyear
0 2002-08-31 2002-04-27 2002-01-01
1 2003-09-17 2003-06-06 2003-01-01
2 2004-09-10 2004-04-20 2004-01-01
3 2005-08-13 2005-04-24 2005-01-01
4 2007-05-10 2007-03-13 2007-01-01
To get the average month and day of each event:
df[['dormancy1', 'greenup1']].apply(lambda x: pd.to_datetime((x - df['1Jyear']).values.astype(np.int64).mean()).strftime('%m-%d'))
This outputs the following series:
dormancy1 08-10
greenup1 04-30
Let me know if this is the required result; I hope it helps.
Update: Handle Missing Data
Update2: Handle empty columns
I'm working with the following data:
dormancy1 greenup1 maturity1 senescence1 dormancy2 greenup2 maturity2 senescence2
8 2002-08-31 2002-04-27 2002-05-06 2002-08-21 NaT NaT NaT NaT
22 2003-09-17 2003-06-06 2003-06-15 2003-07-22 NaT NaT NaT NaT
36 2004-09-10 2004-04-20 2004-05-15 2004-05-24 NaT NaT NaT NaT
44 2005-08-13 2005-04-24 2005-06-29 2005-07-18 NaT NaT NaT NaT
74 2007-05-10 2007-03-13 2007-04-07 2007-05-01 NaT NaT NaT NaT
95 2009-09-18 2009-04-26 2009-05-06 2009-06-03 NaT NaT NaT NaT
113 2010-09-09 2010-05-29 2010-06-08 2010-07-19 NaT NaT NaT NaT
To compute the year of each row (I take the first year I find in the row, so again I assume every event in a row occurs in the same year; if not, you would need to compute a separate year column per event):
def computeYear(row):
    for i in row:
        if not pd.isna(i):
            return dt.datetime(int(i.strftime('%Y')), 1, 1)
    return np.nan
df['1Jyear'] = df.apply(lambda row: computeYear(row), axis=1)
To get the result:
df.apply(lambda column: np.datetime64('NaT') if column.isnull().all() else
         pd.to_datetime((column - df['1Jyear']).values.astype(np.int64).mean()).strftime('%m-%d'))
Output:
dormancy1 08-20
greenup1 04-29
maturity1 05-21
senescence1 06-28
dormancy2 NaN
greenup2 NaN
maturity2 NaN
senescence2 NaN
1Jyear 01-01
dtype: object
Alright, what you want to use here is the pandas.Series.dt.dayofyear accessor. It tells you how many days into the year a specific date occurs. This has probably flipped the switch in your mind and you're building the answer right now, but just in case:
avg_day_dormancy1 = df['dormancy1'].dt.dayofyear.mean()
# Now let's add those days to a year to get an actual date
import datetime as dtt  # You could do this in pandas, but this is quick and dirty
avg_date_dormancy1 = dtt.datetime.strptime('2000-01-01', '%Y-%m-%d')  # E.g. get date in year 2000
avg_date_dormancy1 += dtt.timedelta(days=avg_day_dormancy1 - 1)  # day 1 of the year is Jan 1 itself
Given the data you provided I got August 10th as the average date on which dormancy1 occurs. You could also call .std() on the dayofyear Series to get, for example, a 95% confidence interval for when these events occur.
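For instance, such an interval could be sketched as follows (a minimal sketch; the 95% figure assumes the day-of-year values are roughly normal, and the dates are hard-coded from the question's sample):

```python
import pandas as pd

# The dormancy1 dates from the question's data
dates = pd.to_datetime(pd.Series(
    ['2002-08-31', '2003-09-17', '2004-09-10', '2005-08-13',
     '2007-05-10', '2009-09-18', '2010-09-09']))

doy = dates.dt.dayofyear       # day-of-year of each event
mean_doy = doy.mean()
std_doy = doy.std()            # sample standard deviation

# A rough 95% interval, assuming approximate normality
low, high = mean_doy - 1.96 * std_doy, mean_doy + 1.96 * std_doy
print(round(mean_doy, 1), round(low, 1), round(high, 1))
```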
This is another way of doing it; hope this helps.
import pandas as pd
from datetime import datetime
# Calculate the average day of the year for both events
mean_greenup_DoY = df['greenup1'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d').timetuple().tm_yday).mean()
mean_dormancy_DoY = df['dormancy1'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d').timetuple().tm_yday).mean()
This first converts each date string to a datetime object and finds the day of the year via the logic in the lambda; mean() is then applied to get the average day of the year.
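To turn such a mean day-of-year back into an mm-dd string, one possible sketch (the mean value and the non-leap reference year 2001 are arbitrary choices here):

```python
from datetime import datetime, timedelta

mean_doy = 222.4  # hypothetical mean day-of-year, as computed above

# Day 1 of the year is Jan 1 itself, so add (mean_doy - 1) days to Jan 1
# of an arbitrary non-leap reference year.
ref = datetime(2001, 1, 1) + timedelta(days=mean_doy - 1)
print(ref.strftime('%m-%d'))  # 08-10
```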
I have a column in my dataframe which I want to convert to a Timestamp. However, it is in a bit of a strange format that I am struggling to manipulate. The column is in the format HHMMSS, but does not include the leading zeros.
For example for a time that should be '00:03:15' the dataframe has '315'. I want to convert the latter to a Timestamp similar to the former. Here is an illustration of the column:
message_time
25
35
114
1421
...
235347
235959
Thanks
Use Series.str.zfill to add the leading zeros, then to_datetime:
s = df['message_time'].astype(str).str.zfill(6)
df['message_time'] = pd.to_datetime(s, format='%H%M%S')
print (df)
message_time
0 1900-01-01 00:00:25
1 1900-01-01 00:00:35
2 1900-01-01 00:01:14
3 1900-01-01 00:14:21
4 1900-01-01 23:53:47
5 1900-01-01 23:59:59
In my opinion it is better to create timedeltas here, with to_timedelta:
s = df['message_time'].astype(str).str.zfill(6)
df['message_time'] = pd.to_timedelta(s.str[:2] + ':' + s.str[2:4] + ':' + s.str[4:])
print (df)
message_time
0 00:00:25
1 00:00:35
2 00:01:14
3 00:14:21
4 23:53:47
5 23:59:59
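A practical upside of the timedelta representation, as a small sketch: timedeltas support arithmetic directly, so you can add them to real dates or aggregate them (the sample values are taken from the output above; the 2020 date is arbitrary):

```python
import pandas as pd

td = pd.to_timedelta(['00:00:25', '23:59:59'])

# Timedeltas can be added to actual dates...
print(pd.Timestamp('2020-01-01') + td[1])  # 2020-01-01 23:59:59

# ...and aggregated directly
print(td.mean())  # 0 days 12:00:12
```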
I have a series:
0 2018-08-02 00:00:00
1 2016-07-20 00:00:00
2 2015-09-14 00:00:00
3 2014-09-11 00:00:00
Name: EUR6m3m, dtype: object
I wish to extend the series in length by one and shift it, such that the expected output is (where today() is today's date in the same format):
0 today()
1 2018-08-02 00:00:00
2 2016-07-20 00:00:00
3 2015-09-14 00:00:00
4 2014-09-11 00:00:00
Name: EUR6m3m, dtype: object
My current approach is to store the last value in the original series:
a = B[B.last_valid_index()]
then append:
B.append(a)
But I get the error:
TypeError: cannot concatenate object of type "<class 'pandas._libs.tslibs.timestamps.Timestamp'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
So I tried:
B.to_pydatetime() but with no luck.
Any ideas? I can neither append nor extend the series (ideally I'm appending); its values are objects, because it is a list of dates and times.
You can increment your index, add an item by label via pd.Series.loc, and then use sort_index.
It's not clear how last_valid_index is relevant given the input data you have provided.
s = pd.Series(['2018-08-02 00:00:00', '2016-07-20 00:00:00',
'2015-09-14 00:00:00', '2014-09-11 00:00:00'])
s = pd.to_datetime(s)
s.index += 1
s.loc[0] = pd.to_datetime('today')
s = s.sort_index()
Result
0 2018-09-05
1 2018-08-02
2 2016-07-20
3 2015-09-14
4 2014-09-11
dtype: datetime64[ns]
You can do appending here:
s = pd.Series([1,2,3,4])
s1 = pd.Series([5])
s1 = s1.append(s)
s1 = s1.reset_index(drop=True)
Simple and elegant. Output:
0 5
1 1
2 2
3 3
4 4
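Note that Series.append was deprecated in pandas 1.4 and removed in 2.0; an equivalent sketch with pd.concat gives the same result:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
s1 = pd.Series([5])

# Prepend s1 to s; ignore_index renumbers 0..4 like reset_index(drop=True)
out = pd.concat([s1, s], ignore_index=True)
print(out.tolist())  # [5, 1, 2, 3, 4]
```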
This question "gets asked a lot" - but after looking carefully at the other answers I haven't found a solution that works in my case. It's a shame this is still such a sticking point.
I have a pandas dataframe with a column datetime and I simply want to calculate the time range covered by the data, in seconds (say).
from datetime import datetime
# You can create fake datetime entries any way you like, e.g.
df = pd.DataFrame({'datetime': pd.date_range('10/1/2001 10:00:00', \
periods=3, freq='10H'),'B':[4,5,6]})
# (a) This yields NaT:
timespan_a = df['datetime'][-1:] - df['datetime'][:1]
print(timespan_a)
# 0 NaT
# 2 NaT
# Name: datetime, dtype: timedelta64[ns]
# (b) This does work - but why?
timespan_b = df['datetime'][-1:].values.astype("timedelta64") - \
             df['datetime'][:1].values.astype("timedelta64")
print(timespan_b)
# [72000000000000]
Why doesn't (a) work?
Why is (b) required instead? (It also gives a one-element numpy array rather than a timedelta object.)
My pandas is at version 0.20.3, which rules out an earlier known bug.
Is this a dynamic-range issue?
The problem is the different indexes: the two one-item Series cannot align, so you get NaT.
The solution is to convert the first or the second Series to a numpy array via values:
timespan_a = df['datetime'][-1:]-df['datetime'][:1].values
print (timespan_a)
2 20:00:00
Name: datetime, dtype: timedelta64[ns]
Or set both indexes to the same values:
a = df['datetime'][-1:]
b = df['datetime'][:1]
print (a)
2 2001-10-02 06:00:00
Name: datetime, dtype: datetime64[ns]
a.index = b.index
print (a)
0 2001-10-02 06:00:00
Name: datetime, dtype: datetime64[ns]
print (b)
0 2001-10-01 10:00:00
Name: datetime, dtype: datetime64[ns]
timespan_a = a - b
print (timespan_a)
0 20:00:00
Name: datetime, dtype: timedelta64[ns]
If you want to work with scalars:
a = df.loc[df.index[-1], 'datetime']
b = df.loc[0, 'datetime']
print (a)
2001-10-02 06:00:00
print (b)
2001-10-01 10:00:00
timespan_a = a - b
print (timespan_a)
0 days 20:00:00
Another solution, thanks to Anton vBR (note that get_value is deprecated in modern pandas; at is the current equivalent):
timespan_a = df.at[len(df) - 1, 'datetime'] - df.at[0, 'datetime']
print (timespan_a)
0 days 20:00:00
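In modern pandas a compact variant is positional scalar access via .iloc, which avoids index alignment entirely (a sketch over the question's data):

```python
import pandas as pd

df = pd.DataFrame({'datetime': pd.date_range('10/1/2001 10:00:00',
                                             periods=3, freq='10h'),
                   'B': [4, 5, 6]})

# Scalar minus scalar: no index alignment, so no NaT
timespan = df['datetime'].iloc[-1] - df['datetime'].iloc[0]
print(timespan)  # 0 days 20:00:00
```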
I am trying to parse a date column that looks like the one below,
date
061116
061216
061316
061416
However I cannot get pandas to accept the date format as there is no delimiter (eg '/'). I have tried this below but receive the error:
ValueError: time data '060116' does not match format '%dd%mm%YY' (match)
pd.to_datetime(df['Date'], format='%dd%mm%YY')
You need to add the parameter errors='coerce' to to_datetime, because months 13 and 14 do not exist, so those dates are converted to NaT:
print (pd.to_datetime(df['Date'], format='%d%m%y', errors='coerce'))
0 2016-11-06
1 2016-12-06
2 NaT
3 NaT
Name: Date, dtype: datetime64[ns]
Or maybe you need to swap months and days:
print (pd.to_datetime(df['Date'], format='%m%d%y'))
0 2016-06-11
1 2016-06-12
2 2016-06-13
3 2016-06-14
Name: Date, dtype: datetime64[ns]
EDIT:
print (df)
Date
0 0611160130
1 0612160130
2 0613160130
3 0614160130
print (pd.to_datetime(df['Date'], format='%m%d%y%H%M', errors='coerce'))
0 2016-06-11 01:30:00
1 2016-06-12 01:30:00
2 2016-06-13 01:30:00
3 2016-06-14 01:30:00
Name: Date, dtype: datetime64[ns]
See Python's strftime directives.
Your date format is wrong. You have days and months reversed, and the year in your data is two digits, so %Y should be %y. It should be:
%m%d%y
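A quick sanity check: with a two-digit year the format is %m%d%y (not %Y); applied to the sample values from the question:

```python
import pandas as pd

s = pd.Series(['061116', '061216', '061316', '061416'])
parsed = pd.to_datetime(s, format='%m%d%y')
print(parsed.dt.strftime('%Y-%m-%d').tolist())
# ['2016-06-11', '2016-06-12', '2016-06-13', '2016-06-14']
```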