Pandas: find the average month/day at which a yearly event occurs
I have a Pandas DataFrame containing several date columns, each holding the yyyy-mm-dd date of a yearly event.
How can I calculate the average mm-dd on which each event occurs across all years?
I guess this involves counting, for each row, the number of days between the event date and Jan 1 of that year, but I don't see how to do that efficiently with Pandas.
Thank you!
dormancy1 greenup1 maturity1 senescence1 dormancy2 greenup2 maturity2 senescence2
8 2002-08-31 2002-04-27 2002-05-06 2002-08-21 NaT NaT NaT NaT
22 2003-09-17 2003-06-06 2003-06-15 2003-07-22 NaT NaT NaT NaT
36 2004-09-10 2004-04-20 2004-05-15 2004-05-24 NaT NaT NaT NaT
44 2005-08-13 2005-04-24 2005-06-29 2005-07-18 NaT NaT NaT NaT
74 2007-05-10 2007-03-13 2007-04-07 2007-05-01 NaT NaT NaT NaT
95 2009-09-18 2009-04-26 2009-05-06 2009-06-03 NaT NaT NaT NaT
113 2010-09-09 2010-05-29 2010-06-08 2010-07-19 NaT NaT NaT NaT
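(A note on the day-counting idea above: once the columns are parsed with pd.to_datetime, pandas can produce that count directly. A minimal sketch, assuming df['dormancy1'] is already a datetime column:)
days_into_year = df['dormancy1'].dt.dayofyear  # Jan 1 == 1, so "days since Jan 1" is this minus 1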
Edit:
Complete steps to reproduce the error:
import datetime as dt

import numpy as np
import pandas as pd

# Create and format data
df = pd.DataFrame({'dormancy1': ['2002-08-31', '2003-09-17', '2004-09-10', '2005-08-13', '2007-05-10', '2009-09-18', '2010-09-09'],
                   'greenup1': ['2002-04-27', '2003-06-06', '2004-04-20', '2005-04-24', '2007-03-13', '2009-04-26', '2010-05-29'],
                   'maturity1': ['2002-05-06', '2003-06-15', '2004-05-15', '2005-06-29', '2007-04-07', '2009-05-06', '2010-06-08'],
                   'senescence1': ['2002-08-21', '2003-07-22', '2004-05-24', '2005-07-18', '2007-05-01', '2009-06-03', '2010-07-19'],
                   'dormancy2': ['NaT'] * 7,
                   'greenup2': ['NaT'] * 7,
                   'maturity2': ['NaT'] * 7,
                   'senescence2': ['NaT'] * 7})
df['dormancy1'] = pd.to_datetime(df['dormancy1'])
df['dormancy2'] = pd.to_datetime(df['dormancy2'])
df['greenup1'] = pd.to_datetime(df['greenup1'])
df['greenup2'] = pd.to_datetime(df['greenup2'])
df['maturity1'] = pd.to_datetime(df['maturity1'])
df['maturity2'] = pd.to_datetime(df['maturity2'])
df['senescence1'] = pd.to_datetime(df['senescence1'])
df['senescence2'] = pd.to_datetime(df['senescence2'])
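(Equivalently, since every column here should become datetime, a one-liner would do — my shorthand, not the original code:)
df = df.apply(pd.to_datetime)  # the 'NaT' strings parse to real NaT values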
# Define the function
def computeYear(row):
    for i in row:
        if pd.isna(i):
            pass
        else:
            return dt.datetime(int(i.strftime('%Y')), 1, 1)
    return np.nan

df['1Jyear'] = df.apply(lambda row: computeYear(row), axis=1)
# This line raises the error: for the all-NaT columns the mean effectively
# comes back as NaT, and NaT does not support strftime
df.apply(lambda x: pd.to_datetime((x - df['1Jyear']).values.astype(np.int64).mean()).strftime('%m-%d'))
This is what I would do:
Convert your data to datetime format if that is not already done:
df['dormancy1'] = pd.to_datetime(df['dormancy1'])
df['greenup1'] = pd.to_datetime(df['greenup1'])
Get the 1st of January of the row's year (I assumed that the events in one row occur in the same year):
df['1Jyear'] = df['dormancy1'].dt.year.apply(lambda x: dt.datetime(x, 1, 1))
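(A vectorized alternative — my variant, assuming the column is already datetime — avoids the Python-level lambda:)
df['1Jyear'] = pd.to_datetime(df['dormancy1'].dt.year.astype(str) + '-01-01')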
This is how your dataframe looks now:
df.head()
dormancy1 greenup1 1Jyear
0 2002-08-31 2002-04-27 2002-01-01
1 2003-09-17 2003-06-06 2003-01-01
2 2004-09-10 2004-04-20 2004-01-01
3 2005-08-13 2005-04-24 2005-01-01
4 2007-05-10 2007-03-13 2007-01-01
To get the average month and day of each event:
df[['dormancy1', 'greenup1']].apply(lambda x: pd.to_datetime((x - df['1Jyear']).values.astype(np.int64).mean()).strftime('%m-%d'))
This outputs the following series:
dormancy1 08-10
greenup1 04-30
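(Why the int64 trick works: timedelta64 values are stored as nanoseconds, so the mean is a nanosecond count that pd.to_datetime reinterprets as an offset from the 1970-01-01 epoch; only the mm-dd part is meaningful. An equivalent, slightly more readable variant — my rewrite, not the original answer — averages the timedeltas directly:)
df[['dormancy1', 'greenup1']].apply(lambda x: (pd.Timestamp('1970-01-01') + (x - df['1Jyear']).mean()).strftime('%m-%d'))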
Let me know if this is the required result, I hope it will help you.
Update: Handle Missing Data
Update2: Handle empty columns
I'm working with the data shown in the question above (note the all-NaT columns dormancy2 through senescence2).
To compute the year of each row, I take the first non-missing date I find in the row (so again I assume every event in a row occurs in the same year; if not, you would need to compute a separate year column per event):
def computeYear(row):
    for i in row:
        if not pd.isna(i):
            return dt.datetime(int(i.strftime('%Y')), 1, 1)
    return np.nan

df['1Jyear'] = df.apply(lambda row: computeYear(row), axis=1)
To get the result:
df.apply(lambda column: np.datetime64('NaT') if column.isnull().all() else
         pd.to_datetime((column - df['1Jyear']).values.astype(np.int64).mean()).strftime('%m-%d'))
Output:
dormancy1 08-20
greenup1 04-29
maturity1 05-21
senescence1 06-28
dormancy2 NaN
greenup2 NaN
maturity2 NaN
senescence2 NaN
1Jyear 01-01
dtype: object
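(Side note, my suggestion rather than part of the original answer: the helper column shows up in the result as 01-01; dropping it first keeps the output to the event columns only:)
df.drop(columns='1Jyear').apply(lambda column: np.datetime64('NaT') if column.isnull().all() else
                                pd.to_datetime((column - df['1Jyear']).values.astype(np.int64).mean()).strftime('%m-%d'))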
Alright, what you want to use here is the good old pandas.Series.dt.dayofyear accessor. It tells you how many days into the year a given date falls. This has probably flipped the switch in your mind and you're building the answer right now, but just in case:
avg_day_dormancy1 = df['dormancy1'].dt.dayofyear.mean()
# Now let's add those days to a year to get an actual date
import datetime as dtt  # You could do this in pandas, but this is quick and dirty
avg_date_dormancy1 = dtt.datetime.strptime('2000-01-01', '%Y-%m-%d')  # E.g. anchor at Jan 1 of the year 2000
avg_date_dormancy1 += dtt.timedelta(days=avg_day_dormancy1 - 1)  # minus 1 because Jan 1 is day 1, not day 0
Given the data you provided, I got August 10th as the average date on which dormancy1 occurs. You could also call .std() on the dayofyear Series and derive, say, a 95% interval within which these events occur.
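(A rough sketch of that interval idea — my addition; it treats the day-of-year as roughly normal, which is a crude assumption:)
doy = df['dormancy1'].dt.dayofyear
low, high = doy.mean() - 1.96 * doy.std(), doy.mean() + 1.96 * doy.std()
# low/high are day-of-year bounds; convert them to dates with timedelta as above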
This is another way of doing it; hope it helps.
import pandas as pd
from datetime import datetime
Calculate the average day of the year for both events:
# Note: this assumes the columns still hold raw 'YYYY-MM-DD' strings;
# if they were already converted with pd.to_datetime, use .dt.dayofyear instead
mean_greenup_DoY = df['greenup1'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d').timetuple().tm_yday).mean()
mean_dormancy_DoY = df['dormancy1'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d').timetuple().tm_yday).mean()
This first converts each date string to a datetime object, then extracts the day of the year via the logic in the lambda function; mean() is then applied to get the average day of the year.
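(To turn those averages back into a month/day — my addition, reusing the names above:)
from datetime import date, timedelta
avg_greenup = date(2001, 1, 1) + timedelta(days=mean_greenup_DoY - 1)  # 2001 is an arbitrary non-leap year
print(avg_greenup.strftime('%m-%d'))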
Related
Excel different formats of date how to sort in Pandas dataframe
I have a set of data and wish to do the analysis using Pandas, but the date formats in the dataset are inconsistent. Even though I changed the dates via Format Cells, some dates are still stored as text. This is what I get in Python:
You can use pd.to_datetime() with the errors='coerce' parameter, as follows:

# convert Date with different format strings
df['Date1'] = pd.to_datetime(df['Date'], format='%m/%d/%Y', errors='coerce')
df['Date2'] = pd.to_datetime(df['Date'], format='%m-%d-%y', errors='coerce')

Combine the results with .combine_first():

df['Date_combined'] = df['Date1'].combine_first(df['Date2'])

Then, you can sort the dates by:

df.sort_values(by='Date_combined')

Demo

Input:

         Date
0  11/26/2013
1  11/26/2015
2   3/23/2014
3    08-02-13
4    08-02-15
5    09-02-13
6   1/22/2014

Output:

         Date      Date1      Date2 Date_combined
0  11/26/2013 2013-11-26        NaT    2013-11-26
1  11/26/2015 2015-11-26        NaT    2015-11-26
2   3/23/2014 2014-03-23        NaT    2014-03-23
3    08-02-13        NaT 2013-08-02    2013-08-02
4    08-02-15        NaT 2015-08-02    2015-08-02
5    09-02-13        NaT 2013-09-02    2013-09-02
6   1/22/2014 2014-01-22        NaT    2014-01-22
How to check if a column has a particular Date format or not using DATETIME in python?
I am new to Python. I have a DataFrame with a date column that mixes several formats. I would like to check whether each value follows one particular date format, and drop the rows that do not. I have tried try/except while iterating over the rows, but I am looking for a faster way to check whether the column follows a particular date format, perhaps using the datetime library. My code:

Date_format = '%Y%m%d'

df =
         Date abc
0  2020-03-22   q
1  03-12-2020   w
2    55552020   e
3    25122020   r
4  12/25/2020   r
5  1212202033   y

Expected output:

         Date abc
0  2020-03-22   q
You could try:

pd.to_datetime(df.Date, errors='coerce')

0   2020-03-22
1   2020-03-12
2          NaT
3          NaT
4   2020-12-25
5          NaT

It's easy to drop the null values then.

EDIT: For a given format you can still leverage pd.to_datetime:

datetimes = pd.to_datetime(df.Date, format='%Y-%m-%d', errors='coerce')

datetimes
0   2020-03-22
1          NaT
2          NaT
3          NaT
4          NaT
5          NaT

df.loc[datetimes.notnull()]

Also note that I am using the format '%Y-%m-%d', which I think is the one you want based on your expected output (not the one you gave as Date_format).
How do I store the first use date in a dictionary using a for loop
I have a dataset of user IDs and all the times they used a particular pass. I need to find out how many days ago each of them first used the pass. I was thinking of running through the dataset, storing the first use in a dictionary, and subtracting it from today's date, but I can't seem to get it to work.

Userid  Start use   Day
1712    2019-01-04  Friday
1712    2019-01-05  Saturday
9050    2019-01-04  Friday
9050    2019-01-04  Friday
9050    2019-01-06  Sunday
9409    2019-01-05  Saturday
9683    2019-05-20  Monday
8800    2019-05-17  Friday
8800    2019-05-17  Friday

This is part of the dataset; the date format is Y-m-d.

usedict = {}
keys = df.user_id
values = df.start_date
for i in keys:
    if (usedict[i] == keys):
        continue
    else:
        usedict[i] = values[i]
print(usedict)

user_id  use_count  days_used  Ave Daily Trips register_date days_since_reg
12              42         23         1.826087           NaT            NaT
17              28         13         2.153846           NaT            NaT
114             54         24         2.250000    2019-02-04       107 days
169             31         17         1.823529           NaT            NaT
1414            49         20         2.450000           NaT            NaT
1712            76         34         2.235294           NaT            NaT
2388            24         12         2.000000           NaT            NaT
6150            10          5         2.000000    2019-02-05       106 days
You can achieve what you want with the following. I have used only two user IDs from your example, but the same applies to all.

import pandas as pd
import datetime

df = pd.DataFrame([{'Userid': '1712', 'use_date': '2019-01-04'},
                   {'Userid': '1712', 'use_date': '2019-01-05'},
                   {'Userid': '9050', 'use_date': '2019-01-04'},
                   {'Userid': '9050', 'use_date': '2019-01-04'},
                   {'Userid': '9050', 'use_date': '2019-01-06'}])

df.use_date = pd.to_datetime(df.use_date).dt.date
group_df = (df.sort_values(by='use_date')
              .groupby('Userid', as_index=False)
              .agg({'use_date': 'first'})
              .rename(columns={'use_date': 'first_use_date'}))
group_df['diff_from_today'] = datetime.datetime.today().date() - group_df.first_use_date

The output is:

print(group_df)
  Userid first_use_date diff_from_today
0   1712     2019-01-04        139 days
1   9050     2019-01-04        139 days

Check sort_values and groupby for more details.
I am only looking at two columns, but you could find the min for each id with groupby and then use apply to get the difference (I have done the difference in days):

import pandas as pd
import datetime

user_id = [1712, 1712, 9050, 9050, 9050, 9409, 9683, 8800, 8800]
start = ['2019-01-04', '2019-01-05', '2019-01-04', '2019-01-04', '2019-01-06',
         '2019-01-05', '2019-05-20', '2019-05-17', '2019-05-17']
df = pd.DataFrame(list(zip(user_id, start)), columns=['UserId', 'Start'])
df['Start'] = pd.to_datetime(df['Start'])
df = df.groupby('UserId')['Start'].agg([pd.np.min])  # note: pd.np is deprecated in newer pandas; use numpy's np.min directly
now = datetime.datetime.now()
df['days'] = df['amin'].apply(lambda x: (now - x).days)
a_dict = pd.Series(df.days.values, index=df.index).to_dict()
print(a_dict)

References: to_dict() method taken from @jeff

Output:
Using pandas to perform time delta from 2 "hh:mm:ss XX" columns in Microsoft Excel
I have an Excel file with a column named StartTime holding hh:mm:ss XX data, and the cells are in the 'h:mm:ss AM/PM' custom format. For example,

ID  StartTime
1   12:00:00 PM
2   1:00:00 PM
3   2:00:00 PM

I used the following code to read the file:

df = pd.read_excel('./mydata.xls',
                   sheet_name='Sheet1',
                   converters={'StartTime': str},
                   )

df shows:

ID  StartTime
1   12:00:00
2   1:00:00
3   2:00:00

Is it a bug, or how do you overcome this? Thanks.

[Update: 7-Dec-2018] I guess I may have made changes to the Excel file that made it weird. I created another Excel file to present here (I could not attach an Excel file here, and it is not safe to anyway). I created the following code to test:

import pandas as pd

df = pd.read_excel('./Book1.xlsx',
                   sheet_name='Sheet1',
                   converters={'StartTime': str, 'EndTime': str},
                   )
df['Hours1'] = pd.NaT
df['Hours2'] = pd.NaT
print(df, '\n')
df.loc[~df.StartTime.isnull() & ~df.EndTime.isnull(), 'Hours1'] = pd.to_datetime(df.EndTime) - pd.to_datetime(df.StartTime)
df['Hours2'] = pd.to_datetime(df.EndTime) - pd.to_datetime(df.StartTime)
print(df)

The outputs are:

   ID StartTime   EndTime Hours1 Hours2
0   0  11:00:00  12:00:00    NaT    NaT
1   1  12:00:00  13:00:00    NaT    NaT
2   2  13:00:00  14:00:00    NaT    NaT
3   3       NaN       NaN    NaT    NaT
4   4  14:00:00       NaN    NaT    NaT

   ID StartTime   EndTime         Hours1    Hours2
0   0  11:00:00  12:00:00  3600000000000  01:00:00
1   1  12:00:00  13:00:00  3600000000000  01:00:00
2   2  13:00:00  14:00:00  3600000000000  01:00:00
3   3       NaN       NaN            NaT       NaT
4   4  14:00:00       NaN            NaT       NaT

Now the question has become "Using pandas to perform time delta from 2 'hh:mm:ss XX' columns in Microsoft Excel", and I have changed the title of the question accordingly. Thank you to those who replied and tried it out. The remaining question is: how do I represent the time value in hours instead of microseconds?
It seems that the StartTime column is formatted as text in your file. Have you tried reading it with parse_dates along with a parser function specified via the date_parser parameter? It should work similarly to read_csv(), although the docs don't list the above options explicitly despite them being available. Like so:

pd.read_excel(r'./mydata.xls',
              parse_dates=['StartTime'],
              date_parser=lambda x: pd.datetime.strptime(x, '%I:%M:%S %p').time())
# note: pd.datetime is deprecated in newer pandas; use datetime.datetime instead

Given the update:

pd.read_excel(r'./mydata.xls', parse_dates=['StartTime', 'EndTime'])
(df['EndTime'] - df['StartTime']).dt.seconds // 3600

alternatively

# '//' is available since pandas v0.23.4, otherwise use '/' and round
(df['EndTime'] - df['StartTime']) // pd.Timedelta(1, 'h')

both resulting in the same

0    1
1    1
2    1
dtype: int64
How do I hide pandas to_datetime NaT when I write to CSV?
I am a little perplexed as to why NaT values are showing up in my CSV... usually they show up as "". Here is my date formatting:

df['submitted_on'] = pd.to_datetime(df['submitted_on'], errors='coerce').dt.to_period('d')
df['resolved_on'] = pd.to_datetime(df['resolved_on'], errors='coerce').dt.to_period('d')
df['closed_on'] = pd.to_datetime(df['closed_on'], errors='coerce').dt.to_period('d')
df['duplicate_on'] = pd.to_datetime(df['duplicate_on'], errors='coerce').dt.to_period('d')
df['junked_on'] = pd.to_datetime(df['junked_on'], errors='coerce').dt.to_period('d')
df['unproducible_on'] = pd.to_datetime(df['unproducible_on'], errors='coerce').dt.to_period('d')
df['verified_on'] = pd.to_datetime(df['verified_on'], errors='coerce').dt.to_period('d')

When I df.head(), this is my result. Good, fine, all is dandy.

  identifier status submitted_on resolved_on closed_on duplicate_on junked_on \
0        xx1      D   2004-07-28         NaT       NaT   2004-08-26       NaT
1        xx2      N   2010-03-02         NaT       NaT          NaT       NaT
2        xx3      U   2005-10-26         NaT       NaT          NaT       NaT
3        xx4      V   2006-06-30  2006-09-15       NaT          NaT       NaT
4        xx5      R   2012-09-21  2013-06-06       NaT          NaT       NaT

  unproducible_on verified_on
0             NaT         NaT
1             NaT         NaT
2      2005-11-01         NaT
3             NaT  2006-11-20
4             NaT         NaT

But when I write to CSV the NaT shows up:

"identifier","status","submitted_on","resolved_on","closed_on","duplicate_on","junked_on","unproducible_on","verified_on"
"xx1","D","2004-07-28","NaT","NaT","2004-08-26","NaT","NaT","NaT"
"xx2","N","2010-03-02","NaT","NaT","NaT","NaT","NaT","NaT"
"xx3","U","2005-10-26","NaT","NaT","NaT","NaT","2005-11-01","NaT"
"xx4","V","2006-06-30","2006-09-15","NaT","NaT","NaT","NaT","2006-11-20"
"xx5","R","2012-09-21","2013-06-06","NaT","NaT","NaT","NaT","NaT"
"xx6","D","2009-11-25","NaT","NaT","2010-02-26","NaT","NaT","NaT"
"xx7","D","2003-08-29","NaT","NaT","2003-08-29","NaT","NaT","NaT"
"xx8","R","2003-06-06","2003-06-24","NaT","NaT","NaT","NaT","NaT"
"xx9","R","2004-11-05","2004-11-15","NaT","NaT","NaT","NaT","NaT"
"xx10","R","2008-02-21","2008-09-25","NaT","NaT","NaT","NaT","NaT"
"xx11","R","2007-03-08","2007-03-21","NaT","NaT","NaT","NaT","NaT"
"xx12","R","2011-08-22","2012-06-21","NaT","NaT","NaT","NaT","NaT"
"xx13","J","2003-07-07","NaT","NaT","NaT","2003-07-10","NaT","NaT"
"xx14","A","2008-09-24","NaT","NaT","NaT","NaT","NaT","NaT"

So, I did what I thought would fix the problem, df.fillna('', inplace=True), and nada. I then tried df.replace(pd.NaT, '') with no results, followed by na_rep='' when I wrote to CSV, which also did not produce the desired output. What am I supposed to be using to prevent NaT from being transcribed into the CSV?
Sample data:

"identifier","status","submitted_on","resolved_on","closed_on","duplicate_on","junked_on","unproducible_on","verified_on"
"xx1","D","2004-07-28 07:00:00.0","null","null","2004-08-26 07:00:00.0","null","null","null"
"xx2","N","2010-03-02 03:00:16.0","null","null","null","null","null","null"
"xx3","U","2005-10-26 14:20:20.0","null","null","null","null","2005-11-01 13:02:22.0","null"
"xx4","V","2006-06-30 07:00:00.0","2006-09-15 07:00:00.0","null","null","null","null","2006-11-20 08:00:00.0"
"xx5","R","2012-09-21 06:30:58.0","2013-06-06 09:35:25.0","null","null","null","null","null"
"xx6","D","2009-11-25 02:16:03.0","null","null","2010-02-26 12:28:22.0","null","null","null"
"xx7","D","2003-08-29 07:00:00.0","null","null","2003-08-29 07:00:00.0","null","null","null"
"xx8","R","2003-06-06 12:00:00.0","2003-06-24 12:00:00.0","null","null","null","null","null"
"xx9","R","2004-11-05 08:00:00.0","2004-11-15 08:00:00.0","null","null","null","null","null"
"xx10","R","2008-02-21 05:13:39.0","2008-09-25 17:20:57.0","null","null","null","null","null"
"xx11","R","2007-03-08 17:47:44.0","2007-03-21 23:47:57.0","null","null","null","null","null"
"xx12","R","2011-08-22 19:50:25.0","2012-06-21 05:52:12.0","null","null","null","null","null"
"xx13","J","2003-07-07 12:00:00.0","null","null","null","2003-07-10 12:00:00.0","null","null"
"xx14","A","2008-09-24 11:36:34.0","null","null","null","null","null","null"
Your problem lies in the fact that you are converting to periods: the NaT you see is actually a period object. One way around this is to convert to strings instead. Use

.dt.strftime('%Y-%m-%d')

instead of

.dt.to_period('d')

Then the NaTs you see will be strings, and they can be replaced like so:

.dt.strftime('%Y-%m-%d').replace('NaT', '')

df = pd.DataFrame(dict(date=pd.to_datetime(['2015-01-01', pd.NaT])))

df.date.dt.strftime('%Y-%m-%d')
0    2015-01-01
1           NaT
Name: date, dtype: object

df.date.dt.strftime('%Y-%m-%d').replace('NaT', '')
0    2015-01-01
1
Name: date, dtype: object