Replace date fields in pandas with date values from next columns - python

I have a Dataframe consisting of multiple date fields as follows
df = pd.DataFrame({
'Date1': ['2017-12-14', '2017-12-14', '2017-12-14', '2017-12-15', '2017-12-14', '2017-12-14', '2017-12-14'],
'Date2': ['2018-1-17', "NaT","NaT","NaT","NaT","NaT","NaT"],
'Date3': ['2018-2-15',"NaT","NaT",'2018-4-1','NaT','NaT','2018-4-1'],
'Date4': ['2018-3-11','2018-4-1','2018-4-1',"NaT",'2018-4-1','2018-4-2',"NaT"]})
df
Date1 Date2 Date3 Date4
2017-12-14 2018-1-17 2018-2-15 2018-3-11
2017-12-14 NaT NaT 2018-4-1
2017-12-14 NaT NaT 2018-4-1
2017-12-15 NaT 2018-4-1 NaT
2017-12-14 NaT NaT 2018-4-1
2017-12-14 NaT NaT 2018-4-2
2017-12-14 NaT 2018-4-1 NaT
As you can see, there are many empty date values which I need to fill with the date from the immediately following column.
Expected Output:
Date1 Date2 Date3 Date4
2017-12-14 2018-1-17 2018-2-15 2018-3-11
2017-12-14 2018-4-1 2018-4-1 2018-4-1
2017-12-14 2018-4-1 2018-4-1 2018-4-1
2017-12-15 2018-4-1 2018-4-1 NaT
2017-12-14 2018-4-1 2018-4-1 2018-4-1
2017-12-14 2018-4-2 2018-4-2 2018-4-2
2017-12-14 2018-4-1 2018-4-1 NaT
Please note : the last column can remain NaT
I have tried the bfill method, in vain:
df.bfill(axis=1)

Convert the values to datetimes if necessary, then back-fill the missing NaT values along the rows:
df = df.apply(pd.to_datetime).bfill(axis=1)
print (df)
Date1 Date2 Date3 Date4
0 2017-12-14 2018-01-17 2018-02-15 2018-03-11
1 2017-12-14 2018-04-01 2018-04-01 2018-04-01
2 2017-12-14 2018-04-01 2018-04-01 2018-04-01
3 2017-12-15 2018-04-01 2018-04-01 NaT
4 2017-12-14 2018-04-01 2018-04-01 2018-04-01
5 2017-12-14 2018-04-02 2018-04-02 2018-04-02
6 2017-12-14 2018-04-01 2018-04-01 NaT
If there are multiple columns and you need to specify them, use a list:
cols = ['Date1', 'Date2', 'Date3', 'Date4']
#or columns names with Date text
#cols = df.filter(like='Date').columns
df[cols] = df[cols].apply(pd.to_datetime).bfill(axis=1)
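To see why the plain df.bfill(axis=1) from the question appeared to do nothing, note that the literal string "NaT" is not a real missing value. A minimal sketch (with a small stand-in frame, not the question's full data):

```python
import pandas as pd

# Minimal frame mirroring the question's data: the "NaT" entries are
# plain strings, not real missing values.
df = pd.DataFrame({
    'Date1': ['2017-12-14'],
    'Date2': ['NaT'],
    'Date3': ['2018-2-15'],
})

# On object (string) columns there is nothing to fill, so bfill is a no-op:
assert df.bfill(axis=1).equals(df)

# After parsing to real datetimes, "NaT" becomes a true missing value
# and back-filling along the row works:
converted = df.apply(pd.to_datetime).bfill(axis=1)
assert converted.loc[0, 'Date2'] == pd.Timestamp('2018-02-15')
```

This is why the conversion with pd.to_datetime has to happen before the back-fill.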


How to create a time matrix full of NaT in python?

I would like to create an empty 3D time matrix (with known size) that I will later populate in a loop with either a pd.DatetimeIndex or a list of pd.Timestamp. Is there a simple method?
This does not work:
timeMatrix = np.empty( shape=(100, 1000, 2) )
timeMatrix[:] = pd.NaT
I can omit the second line, but then timeMatrix is filled with numbers on the order of 10^18.
timeMatrix = np.empty( shape=(100, 1000, 2) )
for pressureLevel in levels:
timeMatrix[ i_airport, 0:varyingNumberBelow1000, pressureLevel ] = dates_datetimeindex
Thank you
df = pd.DataFrame(index=range(10), columns=range(10), dtype="datetime64[ns]")
print(df)
Prints:
0 1 2 3 4 5 6 7 8 9
0 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
1 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
2 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
3 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
4 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
5 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
6 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
7 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
8 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
9 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
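An alternative without a DataFrame is a plain NumPy array: np.empty's default float dtype cannot represent NaT (hence the ~10^18 integers in the question), but a datetime64 array can. A minimal sketch:

```python
import numpy as np
import pandas as pd

# np.full with a datetime64 dtype gives a NaT-filled 3-D array; the
# default float dtype of np.empty cannot hold NaT, which is why the
# original attempt showed huge integers instead.
time_matrix = np.full((100, 1000, 2), np.datetime64('NaT'), dtype='datetime64[ns]')
assert np.isnat(time_matrix).all()

# Cells can later be populated with timestamps or a DatetimeIndex:
time_matrix[0, :3, 0] = pd.date_range('2020-01-01', periods=3)
```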

How can I identify date columns and format it to YYYY-MM-DD in Python using pandas?

I am trying to format the dates in Python using Pandas. Basically, I want to identify all the date columns and convert it to YYYY-MM-DD format, overwrite and save it.
Input:
ID NPC_code Date1      Date2    Date3     Date4
1  10001    10-01-2020 11012019 27-Jan-18 27Jan2016
2  10002    11-01-2020 11012020 28-Jan-18 27Jan2017
3  10003    12-01-2020 11012021 29-Jan-18 27Jan2018
4  10004    13-01-2020 11012022 30-Jan-18 27Jan2019
5  10005    14-01-2020 11012023 31-Jan-18 27Jan2020
Output:
ID NPC_code Date1      Date2      Date3      Date4
1  10001    2020-01-10 2019-01-11 2018-01-27 2016-01-27
2  10002    2020-01-11 2020-01-11 2018-01-28 2016-01-28
3  10003    2020-01-12 2021-01-11 2018-01-29 2016-01-29
4  10004    2020-01-13 2022-01-11 2018-01-30 2016-01-30
5  10005    2020-01-14 2023-01-11 2018-01-31 2016-01-31
If your column is a string, you will first need to convert it with `pd.to_datetime`:
df['Date'] = pd.to_datetime(df['Date'])
Then, use the .dt datetime accessor with strftime:
df = pd.DataFrame({'Date': pd.date_range('2017-01-01', periods=60, freq='D')})
df.Date.dt.strftime('%Y-%m-%d')
Or use a lambda function:
df.Date.apply(lambda x: x.strftime('%Y-%m-%d'))
Refer to:
strftime-and-strptime-behavior
pd.to_datetime
df['Date1'] = pd.to_datetime(df.Date1, format='%d-%m-%Y')
pd.to_datetime(df.Date2, format='%d%m%Y')
pd.to_datetime(df.Date3, format='%d-%b-%y')
pd.to_datetime(df.Date4, format='%d%b%Y')
To convert to string format:
pd.to_datetime(df.Date1, format='%d-%m-%Y').dt.strftime('%Y-%m-%d')
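Putting the per-column formats together, a runnable sketch (with a small frame standing in for the question's data) that parses every date column and overwrites it with the YYYY-MM-DD string form:

```python
import pandas as pd

# Small stand-in for the question's frame, one date format per column
df = pd.DataFrame({
    'Date1': ['10-01-2020', '11-01-2020'],
    'Date2': ['11012019', '11012020'],
    'Date3': ['27-Jan-18', '28-Jan-18'],
    'Date4': ['27Jan2016', '27Jan2017'],
})

formats = {'Date1': '%d-%m-%Y', 'Date2': '%d%m%Y',
           'Date3': '%d-%b-%y', 'Date4': '%d%b%Y'}

# Parse each column with its own format, then write back as YYYY-MM-DD
for col, fmt in formats.items():
    df[col] = pd.to_datetime(df[col], format=fmt).dt.strftime('%Y-%m-%d')

print(df.loc[0].tolist())  # → ['2020-01-10', '2019-01-11', '2018-01-27', '2016-01-27']
```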

in a pandas DF with 'season' (season1, season2...) columns, 6 months or ~182 days needs to be added to the last season that's not null

I have a pandas DF with multiple seasons, and for each row I need to add 6 months (~182 days) to the last season that's not null. The dates are dtype: datetime64[ns].
df:
S1 S2 S3
2020-12-31 NaT NaT
2020-12-31 NaT NaT
2020-12-31 2020-12-31 NaT
2020-12-31 2020-12-31 2021-01-31
Desired Output:
S1 S2 S3
2021-06-30 NaT NaT
2021-06-30 NaT NaT
2020-12-31 2021-06-30 NaT
2020-12-31 2020-12-31 2021-07-31
Use .shift() to find if the next cell in the row is NaT and then use pd.DateOffset() to add extra months to those cells:
import pandas as pd
from io import StringIO
text = """
S1 S2 S3
2020-12-31 naT naT
2020-12-31 naT naT
2020-12-31 2020-12-31 naT
2020-12-31 2020-12-31 2021-01-31
"""
df = pd.read_csv(StringIO(text), header=0, sep='\s+')
df = df.apply(pd.to_datetime, errors='coerce')
# find in which cells the next value is na
next_value_in_row_na = df.shift(-1, axis=1).isna()
# for each cell where the next value is na, try to add 6 months
df = df.mask(next_value_in_row_na, df + pd.DateOffset(months=6))
Resulting dataframe:
S1 S2 S3
0 2021-06-30 NaT NaT
1 2021-06-30 NaT NaT
2 2020-12-31 2021-06-30 NaT
3 2020-12-31 2020-12-31 2021-07-31

Trouble with using groupby and ffill

I have pandas Dataframe:
Date1 Date2 Date3 Date4 id
2019-01-01 2019-01-02 NaT 2019-01-03 111
NaT NaT 2019-01-02 NaT 111
2019-02-04 NaT 2019-02-05 2019-02-06 222
NaT 2019-02-08 NaT NaT 222
I expect:
Date1 Date2 Date3 Date4 id
2019-01-01 2019-01-02 2019-01-02 2019-01-03 111
2019-02-04 2019-02-08 2019-02-05 2019-02-06 222
I tried to use:
df = df.groupby(['id']).fillna(method='ffill')
But the process ran for a very long time without finishing.
Thanks for any suggestions.
The logic you want is first, which takes the first non-null value within each group. Assuming those NaT indicate proper datetime columns:
df.groupby('id', as_index=False).agg('first')
# id Date1 Date2 Date3 Date4
#0 111 2019-01-01 2019-01-02 2019-01-02 2019-01-03
#1 222 2019-02-04 2019-02-08 2019-02-05 2019-02-06
ffill is wrong because it returns a DataFrame indexed exactly like the original, whereas here you want an aggregation that collapses to one row per groupby key. Also, ffill only forward-fills, while sometimes the value you want occurs only on the second row.
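A minimal, self-contained version of the first aggregation (column names assumed from the question, with only two of the four date columns shown):

```python
import pandas as pd

df = pd.DataFrame({
    'Date1': pd.to_datetime(['2019-01-01', None, '2019-02-04', None]),
    'Date2': pd.to_datetime(['2019-01-02', None, None, '2019-02-08']),
    'id': [111, 111, 222, 222],
})

# first() takes the first non-null value per column within each group,
# collapsing the frame to one row per id
out = df.groupby('id', as_index=False).first()
print(out)
```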

Resample pandas times series that contains elapsed time values

I have time series data in the format shown on the bottom of this post.
I want to re-sample the data to 30-minute intervals, but I need the Time in State values to be split across the correct intervals (these values are expressed in whole seconds).
Now imagine for a certain row the Time in State is 2342 seconds (more than 30 minutes) and say the start time is at 08:22:00.
User Start Date Start Time State Time in State (secs)
J.Doe 03-02-2014 08:22:00 A 2342
When the re-sample is done I need for the Time in State to be split accordingly into the periods it overflows into, like this:
User Start Date Time Period State Time in State (secs)
J.Doe 03-02-2014 08:00:00 A 480
J.Doe 03-02-2014 08:30:00 A 1800
J.Doe 03-02-2014 09:00:00 A 62
480+1800+62 = 2342
I'm completely lost on how to achieve this in pandas...I would appreciate any help :-)
Source data format:
User Start Date Start Time State Time in State (secs)
J.Doe 03-02-2014 07:58:00 A 36
J.Doe 03-02-2014 07:59:00 A 43
J.Doe 03-02-2014 08:00:00 A 59
J.Doe 03-02-2014 08:01:00 A 32
J.Doe 03-02-2014 08:21:00 A 15
J.Doe 03-02-2014 08:22:00 B 3
J.Doe 03-02-2014 08:22:00 A 2342
J.Doe 03-02-2014 09:01:00 B 1
J.Doe 03-02-2014 09:01:00 A 375
J.Doe 03-02-2014 09:07:00 B 3
J.Doe 03-02-2014 09:07:00 A 6408
J.Doe 03-02-2014 10:54:00 B 2
J.Doe 03-02-2014 10:54:00 A 116
J.Doe 03-02-2014 10:58:00 B 2
J.Doe 03-02-2014 10:58:00 A 122
J.Doe 03-02-2014 10:58:00 A 12
J.Doe 03-02-2014 11:00:00 B 2
J.Doe 03-02-2014 11:00:00 A 3417
J.Doe 03-02-2014 11:57:00 B 3
J.Doe 03-02-2014 11:57:00 A 120
J.Doe 03-02-2014 11:59:00 C 165
J.Doe 03-02-2014 12:02:00 B 3
J.Doe 03-02-2014 12:02:00 A 7254
I would first create Start and End columns (as datetime64 objects):
In [11]: df['Start'] = pd.to_datetime(df['Start Date'] + ' ' + df['Start Time'])
In [12]: df['End'] = df['Start'] + df['Time in State (secs)'].apply(pd.offsets.Second)
In [13]: row = df.iloc[6, :]
In [14]: row
Out[14]:
User J.Doe
Start Date 03-02-2014
Start Time 08:22:00
State A
Time in State (secs) 2342
Start 2014-03-02 08:22:00
End 2014-03-02 09:01:02
Name: 6, dtype: object
One way to get the split times is to resample from Start and End, merge, and use diff:
def split_times(row):
    # note: written against the pandas 0.13-era resample API
    y = pd.Series(0, [row['Start'], row['End']])
    splits = y.resample('30min').index + y.index  # this fills in the middle and sorts too
    res = -splits.to_series().diff(-1)
    if len(res) > 2: res = res[1:-1]
    elif len(res) == 2: res = res[1:]
    return res.astype(int).resample('30min').astype(np.timedelta64)  # hack to resample again
In [16]: split_times(row)
Out[16]:
2014-03-02 08:22:00 00:08:00
2014-03-02 08:30:00 00:30:00
2014-03-02 09:00:00 00:01:02
dtype: timedelta64[ns]
In [17]: df.apply(split_times, 1)
Out[17]:
2014-03-02 07:30:00 2014-03-02 08:00:00 2014-03-02 08:30:00 2014-03-02 09:00:00 2014-03-02 09:30:00 2014-03-02 10:00:00 2014-03-02 10:30:00 2014-03-02 11:00:00 2014-03-02 11:30:00 2014-03-02 12:00:00 2014-03-02 12:30:00 2014-03-02 13:00:00 2014-03-02 13:30:00 2014-03-02 14:00:00
0 00:00:36 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
1 00:00:43 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
2 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
3 NaT 00:00:32 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
4 NaT 00:00:15 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
5 NaT 00:00:03 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
6 NaT 00:08:00 00:30:00 00:01:02 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
7 NaT NaT NaT 00:00:01 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
8 NaT NaT NaT 00:06:15 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
9 NaT NaT NaT 00:00:03 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
10 NaT NaT NaT 00:23:00 00:30:00 00:30:00 00:23:48 NaT NaT NaT NaT NaT NaT NaT
11 NaT NaT NaT NaT NaT NaT 00:00:02 NaT NaT NaT NaT NaT NaT NaT
12 NaT NaT NaT NaT NaT NaT 00:01:56 NaT NaT NaT NaT NaT NaT NaT
13 NaT NaT NaT NaT NaT NaT 00:00:02 NaT NaT NaT NaT NaT NaT NaT
14 NaT NaT NaT NaT NaT NaT 00:02:00 00:00:02 NaT NaT NaT NaT NaT NaT
15 NaT NaT NaT NaT NaT NaT 00:00:12 NaT NaT NaT NaT NaT NaT NaT
16 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
17 NaT NaT NaT NaT NaT NaT NaT NaT 00:26:57 NaT NaT NaT NaT NaT
18 NaT NaT NaT NaT NaT NaT NaT NaT 00:00:03 NaT NaT NaT NaT NaT
19 NaT NaT NaT NaT NaT NaT NaT NaT 00:02:00 NaT NaT NaT NaT NaT
20 NaT NaT NaT NaT NaT NaT NaT NaT 00:01:00 00:01:45 NaT NaT NaT NaT
21 NaT NaT NaT NaT NaT NaT NaT NaT NaT 00:00:03 NaT NaT NaT NaT
22 NaT NaT NaT NaT NaT NaT NaT NaT NaT 00:28:00 00:30:00 00:30:00 00:30:00 00:02:54
To replace the NaTs with 0 it looks like you have to do some fiddling in 0.13.1 (this may already be fixed up in master, otherwise it is a bug):
res2 = df.apply(split_times, 1).astype(int)
# hack to replace NaTs with 0
res2.where(res2 != -9223372036854775808, 0).astype(np.timedelta64)
# to just get the seconds
seconds = res2.where(res2 != -9223372036854775808, 0) / 10 ** 9
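For reference, the splitting step from the question (2342 seconds starting at 08:22 becoming 480 + 1800 + 62) can also be sketched in plain Python, without the old resample API. The function name and pure-datetime approach here are my own, not from the answer above:

```python
from datetime import datetime, timedelta

def split_interval(start, secs, bin_secs=1800):
    """Split a duration of `secs` seconds beginning at `start` into
    pieces aligned to fixed bins (default 30 minutes), returning
    (bin_start, seconds_in_bin) pairs."""
    end = start + timedelta(seconds=secs)
    pieces = []
    cur = start
    while cur < end:
        # Find the start of the bin containing `cur`
        midnight = cur.replace(hour=0, minute=0, second=0, microsecond=0)
        offset = int((cur - midnight).total_seconds()) % bin_secs
        bin_start = cur - timedelta(seconds=offset)
        # Clip this piece at the next bin boundary (or the overall end)
        piece_end = min(bin_start + timedelta(seconds=bin_secs), end)
        pieces.append((bin_start, int((piece_end - cur).total_seconds())))
        cur = piece_end
    return pieces

# The question's worked example: 2342 s at 08:22 → 480 s, 1800 s, 62 s
print(split_interval(datetime(2014, 2, 3, 8, 22), 2342))
```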
