Getting min value across multiple datetime columns in Pandas - python

I have the following dataframe:
import pandas as pd

df = pd.DataFrame({
    'DATE1': ['NaT', 'NaT', '2010-04-15 19:09:08+00:00', '2011-01-25 15:29:37+00:00', '2010-04-10 12:29:02+00:00', 'NaT'],
    'DATE2': ['NaT', 'NaT', 'NaT', 'NaT', '2014-04-10 12:29:02+00:00', 'NaT']})
df.DATE1 = pd.to_datetime(df.DATE1)
df.DATE2 = pd.to_datetime(df.DATE2)
and I would like to create a new column with the minimum value across the two columns (ignoring the NaTs). However, this is what I get:
df.min(axis=1)
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
dtype: float64
If I remove the timezone information (the +00:00) from every cell, then the desired output is produced:
0 NaT
1 NaT
2 2010-04-15 19:09:08
3 2011-01-25 15:29:37
4 2010-04-10 12:29:02
5 NaT
dtype: datetime64[ns]
Why does adding the timezone information break the function? My dataset has timezones, so I would need to know how to remove them as a workaround.

This is a good question; it looks like a bug in pandas' handling of timezone-aware columns. Applying the NumPy reduction row by row works:
import numpy as np

df.apply(lambda x: np.min(x), axis=1)
0 NaT
1 NaT
2 2010-04-15 19:09:08+00:00
3 2011-01-25 15:29:37+00:00
4 2010-04-10 12:29:02+00:00
5 NaT
dtype: datetime64[ns, UTC]

Odd. It does seem like a bug. You can keep the timezone information and use this instead:
df.apply(lambda x: x.min(), axis=1)
0 NaT
1 NaT
2 2010-04-15 19:09:08+00:00
3 2011-01-25 15:29:37+00:00
4 2010-04-10 12:29:02+00:00
5 NaT
dtype: datetime64[ns, UTC]
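As for the workaround asked about in the question, the timezone information can be stripped per column with Series.dt.tz_localize(None). A minimal sketch (this drops the UTC offset while keeping the wall-clock time):

# Assuming df from the question (tz-aware DATE1/DATE2 columns):
# dt.tz_localize(None) drops the UTC offset, leaving naive datetime64[ns],
# after which df.min(axis=1) behaves as expected.
for col in ['DATE1', 'DATE2']:
    df[col] = df[col].dt.tz_localize(None)

print(df.min(axis=1))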

Related

Pandas - Add mean, max, min as row in dataframe

The dataframe eventually gets converted to Excel...
I am trying to create an additional row with the average and max above each column, without disturbing the original headers for the actual data.
I don't want to hard-code the column names, as these will change; it needs to be somewhat abstract. I attempted to create a max row but failed. I need the max above the column headers.
Try this. I don't know how to create the row above the dataframe, but I believe it might still be a good solution:
import pandas as pd

df = pd.DataFrame({
    'date and time': ['2022-03-01', '2022-03-02', '2022-03-03', '2022-03-04'],
    '<PowerAC--->': [40, 20, 9, 12],
})
cols = ['<PowerAC--->']
agg = df[cols].agg(['mean', 'max'])
out = pd.concat([df, agg])
print(out)
A one-liner which also removes the "NaN" values to make it visually better (I'm a bit OCD ;)):
df.append(df.agg({'<PowerAC--->': ['mean', max]})).fillna('')
(Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; a pd.concat equivalent is sketched below.)
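For current pandas versions, a sketch of the same one-liner using pd.concat (assuming the same dataframe and column name as above):

# Equivalent on pandas >= 2.0, where DataFrame.append no longer exists:
pd.concat([df, df.agg({'<PowerAC--->': ['mean', 'max']})]).fillna('')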
I would say it's a good idea to keep your data separated from the reporting on it - I don't really see the logic for an "additional row above the column".
I would generate statistics for the overall data as a separate dataframe.
import pandas as pd
import numpy as np
np.random.seed(1)
t = pd.date_range(start='2022-05-31', end='2022-06-07')
x = np.random.rand(len(t))
df = pd.DataFrame({'date': t, 'data': x})
print(df)
# The 'numeric_only=False' behaviour will become default in a future version of pandas
d_mean = df.mean(numeric_only=False)
d_max = df.max()
# We need to transpose this, as the `d_mean` and `d_max` are Series (columns), and we want them as rows
df_stats = pd.DataFrame({'mean': d_mean, 'max':d_max}).transpose()
print(df_stats)
df output:
date data
0 2022-05-31 0.417022
1 2022-06-01 0.720324
2 2022-06-02 0.000114
3 2022-06-03 0.302333
4 2022-06-04 0.146756
5 2022-06-05 0.092339
6 2022-06-06 0.186260
7 2022-06-07 0.345561
df_stats output:
date data
mean 2022-06-03 12:00:00 0.276339
max 2022-06-07 00:00:00 0.720324
You could add this and the dataframe together with
pd.concat([df_stats, df])
which looks like
date data
mean 2022-06-03 12:00:00 0.276339
max 2022-06-07 00:00:00 0.720324
0 2022-05-31 00:00:00 0.417022
1 2022-06-01 00:00:00 0.720324
2 2022-06-02 00:00:00 0.000114
3 2022-06-03 00:00:00 0.302333
4 2022-06-04 00:00:00 0.146756
5 2022-06-05 00:00:00 0.092339
6 2022-06-06 00:00:00 0.18626
7 2022-06-07 00:00:00 0.345561
but I would keep them separate unless you've got a very good reason to.
There may be some way which makes sense using a multi-index, but that's a bit beyond me, and probably beyond the scope of this question.
Edit: If you don't infer any meaning from the max and mean of the date column but still want something compatible with that column (i.e. still a datetime but effectively null), you can replace it with np.datetime64('NaT') (NaT is similar to NaN, but "not a time"):
df_stats['date'] = np.datetime64('NaT')
print(pd.concat([df_stats, df]).head())
output:
date data
mean NaT 0.276339
max NaT 0.720324
0 2022-05-31 0.417022
1 2022-06-01 0.720324
2 2022-06-02 0.000114

Check if datetime column is empty

Inside a function, I want to check whether a datetime value is empty (NaT) and do something accordingly.
My sample df:
date_dogovor date_pogash date_pogash_posle_prodl
0 2019-03-07 2020-03-06 NaT
1 2019-02-27 2020-02-05 NaT
2 2011-10-14 2016-10-13 2019-10-13
3 2019-03-28 2020-03-06 NaT
4 2019-04-17 2020-04-06 NaT
My function:
def term(date_contract, date_paymnt, date_paymnt_aftr_prlngtn):
    if date_paymnt_aftr_prlngtn is None:
        return date_paymnt - date_contract
    else:
        return date_paymnt_aftr_prlngtn - date_contract
Applying function to df:
df['term'] = df.apply(lambda x: term(x['date_dogovor'], x['date_pogash'], x['date_pogash_posle_prodl']), axis=1 )
Result is wrong:
df['term']
0 NaT
1 NaT
2 NaT
3 NaT
4 NaT
...
115337 NaT
115338 NaT
115339 2921 days
115340 NaT
115341 NaT
Name: term, Length: 115342, dtype: timedelta64[ns]
How do I correctly check whether a datetime value is empty?
It is better/faster to use numpy.where with Series.isna:
import numpy as np

df['term'] = np.where(df['date_pogash_posle_prodl'].isna(),
                      df['date_pogash'] - df['date_dogovor'],
                      df['date_pogash_posle_prodl'] - df['date_dogovor'])
Your function should be changed to use pandas.isna; a missing datetime is NaT, not None, so the is None check never matches:
def term(date_contract, date_paymnt, date_paymnt_aftr_prlngtn):
    if pd.isna(date_paymnt_aftr_prlngtn):
        return date_paymnt - date_contract
    else:
        return date_paymnt_aftr_prlngtn - date_contract
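A quick illustration of why the original check fails (pd.NaT is its own sentinel value, not None):

import pandas as pd

print(pd.NaT is None)   # False: NaT is a distinct sentinel, so `is None` never fires
print(pd.isna(pd.NaT))  # True: pd.isna recognises NaT, NaN and None alike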

Sort date in string format in a pandas dataframe?

I have a dataframe like this; how do I sort it?
df = pd.DataFrame({'Date':['Oct20','Nov19','Jan19','Sep20','Dec20']})
Date
0 Oct20
1 Nov19
2 Jan19
3 Sep20
4 Dec20
I am familiar with sorting a list of date strings:
a.sort(key=lambda date: datetime.strptime(date, "%d-%b-%y"))
Any thoughts? Should I split it?
First convert the column to datetimes, get the positions of the sorted values with Series.argsort, and use them to change the ordering with DataFrame.iloc:
df = df.iloc[pd.to_datetime(df['Date'], format='%b%y').argsort()]
print (df)
Date
2 Jan19
1 Nov19
3 Sep20
0 Oct20
4 Dec20
Details:
print (pd.to_datetime(df['Date'], format='%b%y'))
0 2020-10-01
1 2019-11-01
2 2019-01-01
3 2020-09-01
4 2020-12-01
Name: Date, dtype: datetime64[ns]
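On pandas 1.1 or newer, the same ordering can be obtained without argsort by passing a key callable to DataFrame.sort_values; a short sketch:

# sort_values applies the key to the column before comparing values,
# so the 'Date' strings themselves are left untouched.
df = df.sort_values('Date', key=lambda s: pd.to_datetime(s, format='%b%y'))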

How to extend an object series in length in python

I have a series:
0 2018-08-02 00:00:00
1 2016-07-20 00:00:00
2 2015-09-14 00:00:00
3 2014-09-11 00:00:00
Name: EUR6m3m, dtype: object
I wish to extend the series in length by one and shift it, such that the expected output is (where today() is, obviously, today's date in the same format):
0 today()
1 2018-08-02 00:00:00
2 2016-07-20 00:00:00
3 2015-09-14 00:00:00
4 2014-09-11 00:00:00
Name: EUR6m3m, dtype: object
My current approach is to store the last value in the original series:
a = B[B.last_valid_index()]
then append:
B.append(a)
But I get the error:
TypeError: cannot concatenate object of type "<class 'pandas._libs.tslibs.timestamps.Timestamp'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
So I tried:
B.to_pydatetime() but with no luck.
Any ideas? I can neither append to nor extend the series (ideally I'm appending); its values are objects because they are dates and times.
You can increment your index, add an item by label via pd.Series.loc, and then use sort_index.
It's not clear how last_valid_index is relevant given the input data you have provided.
s = pd.Series(['2018-08-02 00:00:00', '2016-07-20 00:00:00',
               '2015-09-14 00:00:00', '2014-09-11 00:00:00'])
s = pd.to_datetime(s)
s.index += 1
s.loc[0] = pd.to_datetime('today')
s = s.sort_index()
Result
0 2018-09-05
1 2018-08-02
2 2016-07-20
3 2015-09-14
4 2014-09-11
dtype: datetime64[ns]
You can also do the appending directly (note that Series.append was deprecated in pandas 1.4 and removed in 2.0; a pd.concat version is sketched below):
s = pd.Series([1,2,3,4])
s1 = pd.Series([5])
s1 = s1.append(s)
s1 = s1.reset_index(drop=True)
Simple and elegant output:
0 5
1 1
2 2
3 3
4 4
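On recent pandas, a sketch of the same idea with pd.concat, applied to a datetime series like the one in the question (the name EUR6m3m is taken from there):

import pandas as pd

s = pd.to_datetime(pd.Series(['2018-08-02', '2016-07-20',
                              '2015-09-14', '2014-09-11'], name='EUR6m3m'))
# Prepend today's date (normalized to midnight to match the existing values),
# then renumber the index from 0.
today = pd.Series([pd.Timestamp('today').normalize()], name='EUR6m3m')
out = pd.concat([today, s], ignore_index=True)
print(out)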

How do I hide pandas to_datetime NaT when I write to CSV?

I am a little bit perplexed as to why NaTs are showing up in my CSV... usually they show up as "". Here is my date formatting:
df['submitted_on'] = pd.to_datetime(df['submitted_on'], errors='coerce').dt.to_period('d')
df['resolved_on'] = pd.to_datetime(df['resolved_on'], errors='coerce').dt.to_period('d')
df['closed_on'] = pd.to_datetime(df['closed_on'], errors='coerce').dt.to_period('d')
df['duplicate_on'] = pd.to_datetime(df['duplicate_on'], errors='coerce').dt.to_period('d')
df['junked_on'] = pd.to_datetime(df['junked_on'], errors='coerce').dt.to_period('d')
df['unproducible_on'] = pd.to_datetime(df['unproducible_on'], errors='coerce').dt.to_period('d')
df['verified_on'] = pd.to_datetime(df['verified_on'], errors='coerce').dt.to_period('d')
When I df.head() this is my result. Good, fine, all is dandy.
identifier status submitted_on resolved_on closed_on duplicate_on junked_on \
0 xx1 D 2004-07-28 NaT NaT 2004-08-26 NaT
1 xx2 N 2010-03-02 NaT NaT NaT NaT
2 xx3 U 2005-10-26 NaT NaT NaT NaT
3 xx4 V 2006-06-30 2006-09-15 NaT NaT NaT
4 xx5 R 2012-09-21 2013-06-06 NaT NaT NaT
unproducible_on verified_on
0 NaT NaT
1 NaT NaT
2 2005-11-01 NaT
3 NaT 2006-11-20
4 NaT NaT
But I write to CSV and the NaT shows up:
"identifier","status","submitted_on","resolved_on","closed_on","duplicate_on","junked_on","unproducible_on","verified_on"
"xx1","D","2004-07-28","NaT","NaT","2004-08-26","NaT","NaT","NaT"
"xx2","N","2010-03-02","NaT","NaT","NaT","NaT","NaT","NaT"
"xx3","U","2005-10-26","NaT","NaT","NaT","NaT","2005-11-01","NaT"
"xx4","V","2006-06-30","2006-09-15","NaT","NaT","NaT","NaT","2006-11-20"
"xx5","R","2012-09-21","2013-06-06","NaT","NaT","NaT","NaT","NaT"
"xx6","D","2009-11-25","NaT","NaT","2010-02-26","NaT","NaT","NaT"
"xx7","D","2003-08-29","NaT","NaT","2003-08-29","NaT","NaT","NaT"
"xx8","R","2003-06-06","2003-06-24","NaT","NaT","NaT","NaT","NaT"
"xx9","R","2004-11-05","2004-11-15","NaT","NaT","NaT","NaT","NaT"
"xx10","R","2008-02-21","2008-09-25","NaT","NaT","NaT","NaT","NaT"
"xx11","R","2007-03-08","2007-03-21","NaT","NaT","NaT","NaT","NaT"
"xx12","R","2011-08-22","2012-06-21","NaT","NaT","NaT","NaT","NaT"
"xx13","J","2003-07-07","NaT","NaT","NaT","2003-07-10","NaT","NaT"
"xx14","A","2008-09-24","NaT","NaT","NaT","NaT","NaT","NaT"
So, I did what I thought would fix the problem: df.fillna('', inplace=True), and nada. I then tried df.replace(pd.NaT, '') with no results, followed by na_rep='' when writing to CSV, which also did not produce the desired output. What am I supposed to use to prevent NaT from being transcribed into the CSV?
Sample data:
"identifier","status","submitted_on","resolved_on","closed_on","duplicate_on","junked_on","unproducible_on","verified_on"
"xx1","D","2004-07-28 07:00:00.0","null","null","2004-08-26 07:00:00.0","null","null","null"
"xx2","N","2010-03-02 03:00:16.0","null","null","null","null","null","null"
"xx3","U","2005-10-26 14:20:20.0","null","null","null","null","2005-11-01 13:02:22.0","null"
"xx4","V","2006-06-30 07:00:00.0","2006-09-15 07:00:00.0","null","null","null","null","2006-11-20 08:00:00.0"
"xx5","R","2012-09-21 06:30:58.0","2013-06-06 09:35:25.0","null","null","null","null","null"
"xx6","D","2009-11-25 02:16:03.0","null","null","2010-02-26 12:28:22.0","null","null","null"
"xx7","D","2003-08-29 07:00:00.0","null","null","2003-08-29 07:00:00.0","null","null","null"
"xx8","R","2003-06-06 12:00:00.0","2003-06-24 12:00:00.0","null","null","null","null","null"
"xx9","R","2004-11-05 08:00:00.0","2004-11-15 08:00:00.0","null","null","null","null","null"
"xx10","R","2008-02-21 05:13:39.0","2008-09-25 17:20:57.0","null","null","null","null","null"
"xx11","R","2007-03-08 17:47:44.0","2007-03-21 23:47:57.0","null","null","null","null","null"
"xx12","R","2011-08-22 19:50:25.0","2012-06-21 05:52:12.0","null","null","null","null","null"
"xx13","J","2003-07-07 12:00:00.0","null","null","null","2003-07-10 12:00:00.0","null","null"
"xx14","A","2008-09-24 11:36:34.0","null","null","null","null","null","null"
Your problem lies in that you are converting to periods. The NaT you see is actually a period object.
One way around this is to convert to strings instead.
Use
.dt.strftime('%Y-%m-%d')
Instead of
.dt.to_period('d')
Then the NaTs you see will be strings and can be replaced, like so:
.dt.strftime('%Y-%m-%d').replace('NaT', '')
df = pd.DataFrame(dict(date=pd.to_datetime(['2015-01-01', pd.NaT])))
df
df.date.dt.strftime('%Y-%m-%d')
0 2015-01-01
1 NaT
Name: date, dtype: object
df.date.dt.strftime('%Y-%m-%d').replace('NaT', '')
0 2015-01-01
1
Name: date, dtype: object
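Alternatively, a sketch that skips the string conversion entirely: keep the columns as plain datetime64 (i.e. drop the to_period step) and let to_csv handle both the formatting and the missing values via its date_format and na_rep parameters:

import pandas as pd

df = pd.DataFrame(dict(date=pd.to_datetime(['2015-01-01', pd.NaT])))
# date_format controls how datetime values are rendered in the file;
# na_rep is the string written for missing values (NaT/NaN).
df.to_csv('out.csv', index=False, date_format='%Y-%m-%d', na_rep='')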
