Check if datetime column is empty - python

I want to check inside a function whether a datetime value is empty and, if so, do something.
My sample df:
date_dogovor date_pogash date_pogash_posle_prodl
0 2019-03-07 2020-03-06 NaT
1 2019-02-27 2020-02-05 NaT
2 2011-10-14 2016-10-13 2019-10-13
3 2019-03-28 2020-03-06 NaT
4 2019-04-17 2020-04-06 NaT
My function:
def term(date_contract, date_paymnt, date_paymnt_aftr_prlngtn):
    if date_paymnt_aftr_prlngtn is None:
        return date_paymnt - date_contract
    else:
        return date_paymnt_aftr_prlngtn - date_contract
Applying function to df:
df['term'] = df.apply(lambda x: term(x['date_dogovor'], x['date_pogash'], x['date_pogash_posle_prodl']), axis=1 )
Result is wrong:
df['term']
0 NaT
1 NaT
2 NaT
3 NaT
4 NaT
...
115337 NaT
115338 NaT
115339 2921 days
115340 NaT
115341 NaT
Name: term, Length: 115342, dtype: timedelta64[ns]
How to correctly check if datetime column is empty?

It is better (and faster) to use numpy.where with Series.isna:
df['term'] = np.where(df['date_pogash_posle_prodl'].isna(),
                      df['date_pogash'] - df['date_dogovor'],
                      df['date_pogash_posle_prodl'] - df['date_dogovor'])
If you keep the function, it should check with pandas.isna, because missing datetime values are NaT, not None:
def term(date_contract, date_paymnt, date_paymnt_aftr_prlngtn):
    if pd.isna(date_paymnt_aftr_prlngtn):
        return date_paymnt - date_contract
    else:
        return date_paymnt_aftr_prlngtn - date_contract
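A minimal, self-contained sketch of the fix; the column names mirror the question and the two rows are a hypothetical sample:

```python
import numpy as np
import pandas as pd

# Two-row sample mirroring the question's columns
df = pd.DataFrame({
    'date_dogovor': pd.to_datetime(['2019-03-07', '2011-10-14']),
    'date_pogash': pd.to_datetime(['2020-03-06', '2016-10-13']),
    'date_pogash_posle_prodl': pd.to_datetime([pd.NaT, '2019-10-13']),
})

# NaT is not None, which is why the original `is None` check never fired
assert df.loc[0, 'date_pogash_posle_prodl'] is not None
assert pd.isna(df.loc[0, 'date_pogash_posle_prodl'])

# Vectorized replacement for the row-wise apply
df['term'] = np.where(df['date_pogash_posle_prodl'].isna(),
                      df['date_pogash'] - df['date_dogovor'],
                      df['date_pogash_posle_prodl'] - df['date_dogovor'])
```

Here df['term'] comes out as 365 days for the first row and 2921 days for the second, consistent with the 2921 days shown in the question's output.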

Related

change string into datetime in pandas

How can I change the following string datetime into datetime in python? Here's my dataframe:
IN OUT
2022/6/10 10:20:30.00000000000000000000000000 2022/6/17 13:25:30
2022/6/5 12:48:10.0 2022/6/11 10:15
2022/6/9 08:25:30 2022/6/13 10:25:30
2022-06-08 17:18:37.00000000000000000000 0
0 0
2022-06-08 17:18:37 2022/06/08 19:38
I want to delete the row containing 0 value and change string into datetime of format '%Y-%m-%d %H:%M:%S'.
Here's my code:
import pandas as pd
from datetime import datetime as dt

def string_to_date(my_string):
    if '-' and '.' in my_string:
        data = dt.strftime(dt.strptime(my_string[:26], '%Y-%m-%d %H:%M:%S.%f'), '%Y-%m-%d %H:%M:%S')
        return data
    elif '/' and '.' in my_string:
        data = dt.strftime(dt.strptime(my_string[:26], '%Y/%m/%d %H:%M:%S.%f'), '%Y-%m-%d %H:%M:%S')
        return data
    elif '/' in my_string:
        data = dt.strftime(dt.strptime(my_string[:26], '%Y/%m/%d %H:%M:%S.%f'), '%Y-%m-%d %H:%M:%S')
        return data
    elif '-' in my_string:
        data = dt.strftime(dt.strptime(my_string[:26], '%Y-%m-%d %H:%M:%S.%f'), '%Y-%m-%d %H:%M:%S')
        return data
    else:
        data = dt.strftime(dt.strptime(my_string[:26], '%Y-%m-%d %H:%M:%S.%f'), '%Y-%m-%d %H:%M:%S')
        return data

if __name__ == '__main__':
    df = pd.read_excel('data.xlsx')
    col = df.columns[0:]
    df = df.loc[~(df == '0').all(axis=1)]
    print(df)
    i = 0
    for n in col:
        df[col[i]] = pd.to_datetime(df[col[i]])
        df[col[i]] = df[col[i]].apply(lambda x: string_to_date(x))
        i += 1
    print(df)
Letting pandas infer the format should get you started. You can parse to the datetime data type like so:
df['IN'] = pd.to_datetime(df['IN'], errors='coerce')
df['IN']
0 2022-06-10 10:20:30
1 2022-06-05 12:48:10
2 2022-06-09 08:25:30
3 2022-06-08 17:18:37
4 NaT
5 2022-06-08 17:18:37
Name: IN, dtype: datetime64[ns]
Note that setting keyword errors='coerce' leaves NaT (not-a-time) for all elements that pandas considers to be not a datetime, e.g. "0"
Now you can drop the rows where all values are NaT, e.g.
df['OUT'] = pd.to_datetime(df['OUT'], errors='coerce')
df
IN OUT
0 2022-06-10 10:20:30 2022-06-17 13:25:30
1 2022-06-05 12:48:10 2022-06-11 10:15:00
2 2022-06-09 08:25:30 2022-06-13 10:25:30
3 2022-06-08 17:18:37 NaT
4 NaT NaT
5 2022-06-08 17:18:37 2022-06-08 19:38:00
df = df.dropna(axis=0, how='all')
df
IN OUT
0 2022-06-10 10:20:30 2022-06-17 13:25:30
1 2022-06-05 12:48:10 2022-06-11 10:15:00
2 2022-06-09 08:25:30 2022-06-13 10:25:30
3 2022-06-08 17:18:37 NaT
5 2022-06-08 17:18:37 2022-06-08 19:38:00
docs: pd.to_datetime, pd.DataFrame.dropna, related: parsing/formatting directives
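Put together, a minimal end-to-end sketch; the frame below is a hypothetical in-memory stand-in for the Excel data:

```python
import pandas as pd

# Hypothetical stand-in for the question's messy columns:
# row 0 is fully valid, row 1 has one '0', row 2 is all '0'
df = pd.DataFrame({
    'IN': ['2022/06/10 10:20:30', '2022/06/05 12:48:10', '0'],
    'OUT': ['2022/06/17 13:25:30', '0', '0'],
})

# errors='coerce' turns anything unparseable (like '0') into NaT
for col in ['IN', 'OUT']:
    df[col] = pd.to_datetime(df[col], errors='coerce')

# how='all' drops only rows where *every* column is NaT,
# so row 1 (one valid timestamp) survives with a NaT in OUT
df = df.dropna(axis=0, how='all')
```

This leaves two rows: the fully valid one, and the partially valid one with NaT in OUT.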

Getting min value across multiple datetime columns in Pandas

I have the following dataframe
df = pd.DataFrame({
    'DATE1': ['NaT', 'NaT', '2010-04-15 19:09:08+00:00', '2011-01-25 15:29:37+00:00', '2010-04-10 12:29:02+00:00', 'NaT'],
    'DATE2': ['NaT', 'NaT', 'NaT', 'NaT', '2014-04-10 12:29:02+00:00', 'NaT']})
df.DATE1 = pd.to_datetime(df.DATE1)
df.DATE2 = pd.to_datetime(df.DATE2)
and I would like to create a new column with the minimum value across the two columns (ignoring the NaTs) like so:
df.min(axis=1)
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
dtype: float64
If I remove the timezone information (the +00:00) from every single cell then the desired output is produced like so:
0 NaT
1 NaT
2 2010-04-15 19:09:08
3 2011-01-25 15:29:37
4 2010-04-10 12:29:02
5 NaT
dtype: datetime64[ns]
Why does adding the timezone information break the function? My dataset has timezones so I would need to know how to remove them as a workaround.
Good question; this looks like a pandas bug with timezone-aware columns.
df.apply(lambda x: np.max(x), axis=1)
0 NaT
1 NaT
2 2010-04-15 19:09:08+00:00
3 2011-01-25 15:29:37+00:00
4 2014-04-10 12:29:02+00:00
5 NaT
dtype: datetime64[ns, UTC]
Odd. Seems like a bug. You could keep the timezone format and use this.
df.apply(lambda x: x.min(),axis=1)
0 NaT
1 NaT
2 2010-04-15 19:09:08+00:00
3 2011-01-25 15:29:37+00:00
4 2010-04-10 12:29:02+00:00
5 NaT
dtype: datetime64[ns, UTC]
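A runnable sketch of the workaround on a hypothetical two-row frame. Note that on recent pandas releases df.min(axis=1) handles tz-aware columns correctly, so the apply form is only needed on versions affected by the bug:

```python
import pandas as pd

# utc=True keeps both columns timezone-aware, as in the question
df = pd.DataFrame({
    'DATE1': pd.to_datetime(['NaT', '2010-04-15 19:09:08'], utc=True),
    'DATE2': pd.to_datetime(['NaT', 'NaT'], utc=True),
})

# Row-wise minimum that keeps the timezone and skips NaT
row_min = df.apply(lambda row: row.min(), axis=1)
```

row_min is NaT for the all-NaT row and the UTC timestamp for the other.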

Pandas: find average month/day at which yearly event occurs

I have a Pandas df that contains two time columns. These columns contain the yyyy-mm-dd of yearly event.
How is it possible to calculate the average mm-dd of the occurrence of the event over all years?
I guess this involves counting (for each line) the number of days between the actual date and the Jan-1 of the year, but I don't see how to do that efficiently with Pandas.
Thank you!
dormancy1 greenup1 maturity1 senescence1 dormancy2 greenup2 maturity2 senescence2
8 2002-08-31 2002-04-27 2002-05-06 2002-08-21 NaT NaT NaT NaT
22 2003-09-17 2003-06-06 2003-06-15 2003-07-22 NaT NaT NaT NaT
36 2004-09-10 2004-04-20 2004-05-15 2004-05-24 NaT NaT NaT NaT
44 2005-08-13 2005-04-24 2005-06-29 2005-07-18 NaT NaT NaT NaT
74 2007-05-10 2007-03-13 2007-04-07 2007-05-01 NaT NaT NaT NaT
95 2009-09-18 2009-04-26 2009-05-06 2009-06-03 NaT NaT NaT NaT
113 2010-09-09 2010-05-29 2010-06-08 2010-07-19 NaT NaT NaT NaT
Edit:
Complete steps to reproduce error:
import datetime as dt
import numpy as np
import pandas as pd

# Create and format data
df = pd.DataFrame({'dormancy1': ['2002-08-31', '2003-09-17', '2004-09-10', '2005-08-13', '2007-05-10', '2009-09-18', '2010-09-09'],
                   'greenup1': ['2002-04-27', '2003-06-06', '2004-04-20', '2005-04-24', '2007-03-13', '2009-04-26', '2010-05-29'],
                   'maturity1': ['2002-05-06', '2003-06-15', '2004-05-15', '2005-06-29', '2007-04-07', '2009-05-06', '2010-06-08'],
                   'senescence1': ['2002-08-21', '2003-07-22', '2004-05-24', '2005-07-18', '2007-05-01', '2009-06-03', '2010-07-19'],
                   'dormancy2': ['NaT'] * 7,
                   'greenup2': ['NaT'] * 7,
                   'maturity2': ['NaT'] * 7,
                   'senescence2': ['NaT'] * 7})
df['dormancy1'] = pd.to_datetime(df['dormancy1'])
df['dormancy2'] = pd.to_datetime(df['dormancy2'])
df['greenup1'] = pd.to_datetime(df['greenup1'])
df['greenup2'] = pd.to_datetime(df['greenup2'])
df['maturity1'] = pd.to_datetime(df['maturity1'])
df['maturity2'] = pd.to_datetime(df['maturity2'])
df['senescence1'] = pd.to_datetime(df['senescence1'])
df['senescence2'] = pd.to_datetime(df['senescence2'])
# Define the function
def computeYear(row):
    for i in row:
        if pd.isna(i):
            pass
        else:
            return dt.datetime(int(i.strftime('%Y')), 1, 1)
    return np.nan
df['1Jyear'] = df.apply(lambda row: computeYear(row), axis=1)
df.apply(lambda x: pd.to_datetime((x - df['1Jyear']).values.astype(np.int64).mean()).strftime('%m-%d'))
This is what I would do:
Convert your data to datetime format if that is not already done:
df['dormancy1'] = pd.to_datetime(df['dormancy1'])
df['greenup1'] = pd.to_datetime(df['greenup1'])
Get the 1st of January of the year of the row (I assumed that the events on one row occur in the same year):
df['1Jyear'] = df['dormancy1'].dt.year.apply(lambda x: dt.datetime(x, 1, 1))
This is how your dataframe looks now:
df.head()
dormancy1 greenup1 1Jyear
0 2002-08-31 2002-04-27 2002-01-01
1 2003-09-17 2003-06-06 2003-01-01
2 2004-09-10 2004-04-20 2004-01-01
3 2005-08-13 2005-04-24 2005-01-01
4 2007-05-10 2007-03-13 2007-01-01
To get the average month and day of each event:
df[['dormancy1', 'greenup1']].apply(lambda x: pd.to_datetime((x - df['1Jyear']).values.astype(np.int64).mean()).strftime('%m-%d'))
This outputs the following series:
dormancy1 08-10
greenup1 04-30
Let me know if this is the required result, I hope it will help you.
Update: Handle Missing Data
Update2: Handle empty columns
I'm working with the following data:
dormancy1 greenup1 maturity1 senescence1 dormancy2 greenup2 maturity2 senescence2
8 2002-08-31 2002-04-27 2002-05-06 2002-08-21 NaT NaT NaT NaT
22 2003-09-17 2003-06-06 2003-06-15 2003-07-22 NaT NaT NaT NaT
36 2004-09-10 2004-04-20 2004-05-15 2004-05-24 NaT NaT NaT NaT
44 2005-08-13 2005-04-24 2005-06-29 2005-07-18 NaT NaT NaT NaT
74 2007-05-10 2007-03-13 2007-04-07 2007-05-01 NaT NaT NaT NaT
95 2009-09-18 2009-04-26 2009-05-06 2009-06-03 NaT NaT NaT NaT
113 2010-09-09 2010-05-29 2010-06-08 2010-07-19 NaT NaT NaT NaT
To compute the year of each row, I take the first year I find in the row, so again I assume it is the same year for every event; if it is not, you would need to compute a separate column for each event:
def computeYear(row):
    for i in row:
        if not pd.isna(i):
            return dt.datetime(int(i.strftime('%Y')), 1, 1)
    return np.nan
df['1Jyear'] = df.apply(lambda row: computeYear(row), axis=1)
To get the result:
df.apply(lambda column: np.datetime64('NaT') if column.isnull().all() else
         pd.to_datetime((column - df['1Jyear']).values.astype(np.int64).mean()).strftime('%m-%d'))
Output:
dormancy1 08-20
greenup1 04-29
maturity1 05-21
senescence1 06-28
dormancy2 NaN
greenup2 NaN
maturity2 NaN
senescence2 NaN
1Jyear 01-01
dtype: object
Alright, what you want to use here is pandas.Series.dt.dayofyear. It tells you how many days into the year a specific date occurs. This has probably flipped the switch in your mind and you're building the answer right now, but just in case:
avg_day_dormancy1 = df['dormancy1'].dt.dayofyear.mean()
# Now let's add those days to a year to get an actual date
import datetime as dtt  # You could do this in pandas, but this is quick and dirty
avg_date_dormancy1 = dtt.datetime.strptime('2000-01-01', '%Y-%m-%d')  # E.g. get date in year 2000
avg_date_dormancy1 += dtt.timedelta(days=avg_day_dormancy1)
Given the data you provided I got August 10th as the average date on which dormancy1 occurs. You could also call the .std() method as well on the dayofyear Series and get a 95% confidence interval for which these events occur, for example.
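The dayofyear approach can also stay entirely in pandas. A self-contained sketch using a hypothetical three-date subset of the question's dormancy1 column:

```python
import pandas as pd

df = pd.DataFrame({'dormancy1': pd.to_datetime(
    ['2002-08-31', '2003-09-17', '2004-09-10'])})

# Mean day-of-year, then mapped back onto an arbitrary non-leap year
avg_doy = df['dormancy1'].dt.dayofyear.mean()
avg_date = pd.Timestamp('2001-01-01') + pd.Timedelta(days=avg_doy - 1)
print(avg_date.strftime('%m-%d'))  # → 09-09
```

The reference year only matters for leap-day handling; picking a non-leap year keeps day-of-year 60 as March 1st.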
This is another way of doing it. Hope this helps
import pandas as pd
from datetime import datetime
Calculating the average day of the year for both events
mean_greenup_DoY = df['greenup1'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d').timetuple().tm_yday).mean()
mean_dormancy_DoY = df['dormancy1'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d').timetuple().tm_yday).mean()
This first converts the date string to a datetime object, then finds the day of the year using the logic in the lambda function; mean() is then applied to get the average day of the year.

Using pandas to perform time delta from 2 "hh:mm:ss XX" columns in Microsoft Excel

I have an Excel file with a column named StartTime holding hh:mm:ss XX data, and the cells are in 'h:mm:ss AM/PM' custom format. For example,
ID StartTime
1 12:00:00 PM
2 1:00:00 PM
3 2:00:00 PM
I used the following code to read the file
df = pd.read_excel('./mydata.xls',
                   sheet_name='Sheet1',
                   converters={'StartTime': str},
                   )
df shows
ID StartTime
1 12:00:00
2 1:00:00
3 2:00:00
Is it a bug or how do you overcome this? Thanks.
[Update: 7-Dec-2018]
I guess I may have made changes to the Excel file that made it weird. I created another Excel file and present here (I could not attach an Excel file here, and it is not safe too):
I created the following code to test:
import pandas as pd
df = pd.read_excel('./Book1.xlsx',
                   sheet_name='Sheet1',
                   converters={'StartTime': str,
                               'EndTime': str}
                   )
df['Hours1'] = pd.NaT
df['Hours2'] = pd.NaT
print(df,'\n')
df.loc[~df.StartTime.isnull() & ~df.EndTime.isnull(),
       'Hours1'] = pd.to_datetime(df.EndTime) - pd.to_datetime(df.StartTime)
df['Hours2'] = pd.to_datetime(df.EndTime) - pd.to_datetime(df.StartTime)
print(df)
The outputs are
ID StartTime EndTime Hours1 Hours2
0 0 11:00:00 12:00:00 NaT NaT
1 1 12:00:00 13:00:00 NaT NaT
2 2 13:00:00 14:00:00 NaT NaT
3 3 NaN NaN NaT NaT
4 4 14:00:00 NaN NaT NaT
ID StartTime EndTime Hours1 Hours2
0 0 11:00:00 12:00:00 3600000000000 01:00:00
1 1 12:00:00 13:00:00 3600000000000 01:00:00
2 2 13:00:00 14:00:00 3600000000000 01:00:00
3 3 NaN NaN NaT NaT
4 4 14:00:00 NaN NaT NaT
Now the question has become: "Using pandas to perform time delta from 2 "hh:mm:ss XX" columns in Microsoft Excel". I have changed the title of the question too. Thank you for those who replied and tried it out.
The question is
How to represent the time value in hours instead of microseconds?
It seems that the StartTime column is formatted as text in your file.
Have you tried reading it with parse_dates along with a parser function specified via the date_parser parameter? It should work similarly to read_csv(), although the docs don't list the above options explicitly despite them being available.
Like so:
from datetime import datetime  # pd.datetime was removed in recent pandas; use the datetime module
pd.read_excel(r'./mydata.xls',
              parse_dates=['StartTime'],
              date_parser=lambda x: datetime.strptime(x, '%I:%M:%S %p').time())
Given the update:
pd.read_excel(r'./mydata.xls', parse_dates=['StartTime', 'EndTime'])
(df['EndTime'] - df['StartTime']).dt.seconds//3600
alternatively
# '//' is available since pandas v0.23.4, otherwise use '/' and round
(df['EndTime'] - df['StartTime'])//pd.Timedelta(1, 'h')
both resulting in the same
0 1
1 1
2 1
dtype: int64
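A compact sketch of the timedelta arithmetic, using hypothetical in-memory data standing in for the parsed Excel columns:

```python
import pandas as pd

# Time-only strings parse to today's date, so the deltas are correct
df = pd.DataFrame({
    'StartTime': pd.to_datetime(['11:00:00', '12:00:00']),
    'EndTime': pd.to_datetime(['12:00:00', '13:30:00']),
})

# Floor-divide the timedelta by one hour to get whole hours
hours = (df['EndTime'] - df['StartTime']) // pd.Timedelta(1, 'h')
print(hours.tolist())  # → [1, 1]
```

Note the floor division truncates, so the 1.5-hour gap in the second row also reports 1.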

How do I hide pandas to_datetime NaT in when I write to CSV?

I am a little perplexed as to why NaT values are showing up in my CSV... usually they show up as "". Here is my date formatting:
df['submitted_on'] = pd.to_datetime(df['submitted_on'], errors='coerce').dt.to_period('d')
df['resolved_on'] = pd.to_datetime(df['resolved_on'], errors='coerce').dt.to_period('d')
df['closed_on'] = pd.to_datetime(df['closed_on'], errors='coerce').dt.to_period('d')
df['duplicate_on'] = pd.to_datetime(df['duplicate_on'], errors='coerce').dt.to_period('d')
df['junked_on'] = pd.to_datetime(df['junked_on'], errors='coerce').dt.to_period('d')
df['unproducible_on'] = pd.to_datetime(df['unproducible_on'], errors='coerce').dt.to_period('d')
df['verified_on'] = pd.to_datetime(df['verified_on'], errors='coerce').dt.to_period('d')
When I df.head() this is my result. Good, fine, all is dandy.
identifier status submitted_on resolved_on closed_on duplicate_on junked_on \
0 xx1 D 2004-07-28 NaT NaT 2004-08-26 NaT
1 xx2 N 2010-03-02 NaT NaT NaT NaT
2 xx3 U 2005-10-26 NaT NaT NaT NaT
3 xx4 V 2006-06-30 2006-09-15 NaT NaT NaT
4 xx5 R 2012-09-21 2013-06-06 NaT NaT NaT
unproducible_on verified_on
0 NaT NaT
1 NaT NaT
2 2005-11-01 NaT
3 NaT 2006-11-20
4 NaT NaT
But I write to CSV and the NaT shows up:
"identifier","status","submitted_on","resolved_on","closed_on","duplicate_on","junked_on","unproducible_on","verified_on"
"xx1","D","2004-07-28","NaT","NaT","2004-08-26","NaT","NaT","NaT"
"xx2","N","2010-03-02","NaT","NaT","NaT","NaT","NaT","NaT"
"xx3","U","2005-10-26","NaT","NaT","NaT","NaT","2005-11-01","NaT"
"xx4","V","2006-06-30","2006-09-15","NaT","NaT","NaT","NaT","2006-11-20"
"xx5","R","2012-09-21","2013-06-06","NaT","NaT","NaT","NaT","NaT"
"xx6","D","2009-11-25","NaT","NaT","2010-02-26","NaT","NaT","NaT"
"xx7","D","2003-08-29","NaT","NaT","2003-08-29","NaT","NaT","NaT"
"xx8","R","2003-06-06","2003-06-24","NaT","NaT","NaT","NaT","NaT"
"xx9","R","2004-11-05","2004-11-15","NaT","NaT","NaT","NaT","NaT"
"xx10","R","2008-02-21","2008-09-25","NaT","NaT","NaT","NaT","NaT"
"xx11","R","2007-03-08","2007-03-21","NaT","NaT","NaT","NaT","NaT"
"xx12","R","2011-08-22","2012-06-21","NaT","NaT","NaT","NaT","NaT"
"xx13","J","2003-07-07","NaT","NaT","NaT","2003-07-10","NaT","NaT"
"xx14","A","2008-09-24","NaT","NaT","NaT","NaT","NaT","NaT"
So, I did what I thought would fix the problem. df.fillna('', inplace=True) and nada. I then tried df.replace(pd.NaT, '') with no results, followed by na_rep='' when I wrote to CSV which also did not result in desired output. What am I supposed to be using to prevent NaT from being transcribed into CSV?
Sample data:
"identifier","status","submitted_on","resolved_on","closed_on","duplicate_on","junked_on","unproducible_on","verified_on"
"xx1","D","2004-07-28 07:00:00.0","null","null","2004-08-26 07:00:00.0","null","null","null"
"xx2","N","2010-03-02 03:00:16.0","null","null","null","null","null","null"
"xx3","U","2005-10-26 14:20:20.0","null","null","null","null","2005-11-01 13:02:22.0","null"
"xx4","V","2006-06-30 07:00:00.0","2006-09-15 07:00:00.0","null","null","null","null","2006-11-20 08:00:00.0"
"xx5","R","2012-09-21 06:30:58.0","2013-06-06 09:35:25.0","null","null","null","null","null"
"xx6","D","2009-11-25 02:16:03.0","null","null","2010-02-26 12:28:22.0","null","null","null"
"xx7","D","2003-08-29 07:00:00.0","null","null","2003-08-29 07:00:00.0","null","null","null"
"xx8","R","2003-06-06 12:00:00.0","2003-06-24 12:00:00.0","null","null","null","null","null"
"xx9","R","2004-11-05 08:00:00.0","2004-11-15 08:00:00.0","null","null","null","null","null"
"xx10","R","2008-02-21 05:13:39.0","2008-09-25 17:20:57.0","null","null","null","null","null"
"xx11","R","2007-03-08 17:47:44.0","2007-03-21 23:47:57.0","null","null","null","null","null"
"xx12","R","2011-08-22 19:50:25.0","2012-06-21 05:52:12.0","null","null","null","null","null"
"xx13","J","2003-07-07 12:00:00.0","null","null","null","2003-07-10 12:00:00.0","null","null"
"xx14","A","2008-09-24 11:36:34.0","null","null","null","null","null","null"
Your problem lies in that you are converting to periods. The NaT you see is actually a period object.
One way around this is to convert to strings instead.
Use
.dt.strftime('%Y-%m-%d')
Instead of
.dt.to_period('d')
Then the NaTs you see will be strings and can be replaced like
.dt.strftime('%Y-%m-%d').replace('NaT', '')
df = pd.DataFrame(dict(date=pd.to_datetime(['2015-01-01', pd.NaT])))
df
df.date.dt.strftime('%Y-%m-%d')
0 2015-01-01
1 NaT
Name: date, dtype: object
df.date.dt.strftime('%Y-%m-%d').replace('NaT', '')
0 2015-01-01
1
Name: date, dtype: object
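A runnable sketch of the whole round-trip. Depending on the pandas version, dt.strftime yields either the literal string 'NaT' or NaN for missing values; the replace plus to_csv's default na_rep='' handles both cases:

```python
import io
import pandas as pd

df = pd.DataFrame(dict(date=pd.to_datetime(['2015-01-01', pd.NaT])))

# Format to strings first, then blank out any literal 'NaT'
out = df.assign(date=df['date'].dt.strftime('%Y-%m-%d').replace('NaT', ''))

buf = io.StringIO()
out.to_csv(buf, index=False)
csv_text = buf.getvalue()
```

The written CSV contains the formatted date and an empty field, with no 'NaT' anywhere.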
