How to create a time matrix full of NaT in python?

I would like to create an empty 3D time matrix (with a known size) that I will later populate in a loop with either a pd.DatetimeIndex or a list of pd.Timestamp. Is there a simple method?
This does not work:
timeMatrix = np.empty( shape=(100, 1000, 2) )
timeMatrix[:] = pd.NaT
I can do without the second line, but then timeMatrix is filled with meaningless uninitialized values on the order of 10^18.
timeMatrix = np.empty( shape=(100, 1000, 2) )
for pressureLevel in levels:
timeMatrix[ i_airport, 0:varyingNumberBelow1000, pressureLevel ] = dates_datetimeindex
Thank you

df = pd.DataFrame(index=range(10), columns=range(10), dtype="datetime64[ns]")
print(df)
Prints:
0 1 2 3 4 5 6 7 8 9
0 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
1 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
2 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
3 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
4 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
5 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
6 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
7 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
8 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
9 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
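Since the question asks for a 3D structure, a plain NumPy datetime64 array also works; the trick is to give the array a datetime64 dtype so NaT is representable. A minimal sketch (the fill-in step uses made-up dates for illustration):

```python
import numpy as np
import pandas as pd

# A 3D array of the known size, initialized entirely to NaT
time_matrix = np.full((100, 1000, 2), np.datetime64("NaT"), dtype="datetime64[ns]")

# Slices can later be populated from a DatetimeIndex in a loop
dates = pd.date_range("2020-01-01", periods=1000)
time_matrix[0, :, 0] = dates.values
```

Because the dtype is datetime64[ns] rather than float, the untouched cells stay NaT instead of becoming ~10^18 garbage numbers.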

Related

Replace date fields in pandas with date values from next columns

I have a DataFrame consisting of multiple date fields, as follows:
df = pd.DataFrame({
    'Date1': ['2017-12-14', '2017-12-14', '2017-12-14', '2017-12-15', '2017-12-14', '2017-12-14', '2017-12-14'],
    'Date2': ['2018-1-17', "NaT", "NaT", "NaT", "NaT", "NaT", "NaT"],
    'Date3': ['2018-2-15', "NaT", "NaT", '2018-4-1', 'NaT', 'NaT', '2018-4-1'],
    'Date4': ['2018-3-11', '2018-4-1', '2018-4-1', "NaT", '2018-4-1', '2018-4-2', "NaT"]})
df
Date1 Date2 Date3 Date4
2017-12-14 2018-1-17 2018-2-15 2018-3-11
2017-12-14 NaT NaT 2018-4-1
2017-12-14 NaT NaT 2018-4-1
2017-12-15 NaT 2018-4-1 NaT
2017-12-14 NaT NaT 2018-4-1
2017-12-14 NaT NaT 2018-4-2
2017-12-14 NaT 2018-4-1 NaT
As you can see there are lots of empty date values, which I need to be filled with dates from the immediately following column.
Expected Output:
Date1       Date2      Date3      Date4
2017-12-14  2018-1-17  2018-2-15  2018-3-11
2017-12-14  2018-4-1   2018-4-1   2018-4-1
2017-12-14  2018-4-1   2018-4-1   2018-4-1
2017-12-15  2018-4-1   2018-4-1   NaT
2017-12-14  2018-4-1   2018-4-1   2018-4-1
2017-12-14  2018-4-2   2018-4-2   2018-4-2
2017-12-14  2018-4-1   2018-4-1   NaT
Please note: the last column can remain NaT.
I have tried the bfill method in vain:
df.bfill(axis=1)
Convert the values to datetimes if necessary, and then back fill the missing NaT values:
df = df.apply(pd.to_datetime).bfill(axis=1)
print (df)
Date1 Date2 Date3 Date4
0 2017-12-14 2018-01-17 2018-02-15 2018-03-11
1 2017-12-14 2018-04-01 2018-04-01 2018-04-01
2 2017-12-14 2018-04-01 2018-04-01 2018-04-01
3 2017-12-15 2018-04-01 2018-04-01 NaT
4 2017-12-14 2018-04-01 2018-04-01 2018-04-01
5 2017-12-14 2018-04-02 2018-04-02 2018-04-02
6 2017-12-14 2018-04-01 2018-04-01 NaT
If there are multiple columns and you need to specify them as a list:
cols = ['Date1', 'Date2', 'Date3', 'Date4']
#or columns names with Date text
#cols = df.filter(like='Date').columns
df[cols] = df[cols].apply(pd.to_datetime).bfill(axis=1)
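Putting the pieces together, a self-contained sketch of this approach on a two-row sample (data abbreviated from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'Date1': ['2017-12-14', '2017-12-15'],
    'Date2': ['2018-1-17', 'NaT'],
    'Date3': ['2018-2-15', '2018-4-1'],
    'Date4': ['2018-3-11', 'NaT'],
})
# select the date columns, parse them, and back fill along each row
cols = df.filter(like='Date').columns
df[cols] = df[cols].apply(pd.to_datetime).bfill(axis=1)
```

In the second row, Date2 is filled from Date3, while the trailing Date4 stays NaT because there is nothing after it to fill from.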

in a pandas DF with 'season' (season1, season2...) columns, 6 months or ~182 days needs to be added to the last season that's not null

I have a pandas DF with multiple seasons and, for each row, I need to add 6 months (~182 days) to the last season that's not null. The dates are dtype: datetime64[ns].
df:
S1 S2 S3
2020-12-31 naT naT
2020-12-31 naT naT
2020-12-31 2020-12-31 naT
2020-12-31 2020-12-31 2021-01-31
Desired Output:
S1 S2 S3
2021-06-30 naT naT
2021-06-30 naT naT
2020-12-31 2021-06-30 naT
2020-12-31 2020-12-31 2021-07-31
Use .shift() to find if the next cell in the row is NaT and then use pd.DateOffset() to add extra months to those cells:
import pandas as pd
from io import StringIO
text = """
S1 S2 S3
2020-12-31 naT naT
2020-12-31 naT naT
2020-12-31 2020-12-31 naT
2020-12-31 2020-12-31 2021-01-31
"""
df = pd.read_csv(StringIO(text), header=0, sep=r'\s+')
df = df.apply(pd.to_datetime, errors='coerce')
# find in which cells the next value is na
next_value_in_row_na = df.shift(-1, axis=1).isna()
# for each cell where the next value is na, try to add 6 months
df = df.mask(next_value_in_row_na, df + pd.DateOffset(months=6))
Resulting dataframe:
S1 S2 S3
0 2021-06-30 NaT NaT
1 2021-06-30 NaT NaT
2 2020-12-31 2021-06-30 NaT
3 2020-12-31 2020-12-31 2021-07-31

fillna doesn't give the desired result

I'm trying to substitute NaTs in a pandas dataframe.
orders.PAID_AT
0 NaT
1 NaT
2 NaT
3 NaT
4 NaT
6 NaT
7 NaT
8 NaT
9 NaT
10 NaT
11 2018-08-04 16:19:10
12 2018-08-04 16:19:10
13 NaT
14 NaT
15 2018-08-04 13:49:08
16 2018-08-04 13:49:08
18 NaT
19 NaT
20 NaT
21 2018-08-04 12:41:48
Rows 0..10 need to be filled with the value of row 11, and so on. Somehow I can't get it right with:
orders.PAID_AT.fillna(method='bfill', inplace=True)
I'm getting the same result as above. What am I missing here?
To avoid chained assignment, assign the result back:
orders.PAID_AT = orders.PAID_AT.bfill()
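A minimal reproduction of the fix, on a small made-up frame in the same shape as the question's data:

```python
import pandas as pd

orders = pd.DataFrame({"PAID_AT": pd.to_datetime(
    [None, None, "2018-08-04 16:19:10", None])})

# assigning the result back (instead of fillna(..., inplace=True) on the
# attribute-accessed column) guarantees the original frame is updated
orders["PAID_AT"] = orders["PAID_AT"].bfill()
```

The two leading NaT rows take the value from row 2; the trailing NaT remains, since backward fill has nothing after it to pull from.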

Iterating through datetime64 columns in pandas dataframe [duplicate]

I have two columns in a Pandas data frame that are dates.
I am looking to subtract one column from another and the result being the difference in numbers of days as an integer.
A peek at the data:
df_test.head(10)
Out[20]:
First_Date Second Date
0 2016-02-09 2015-11-19
1 2016-01-06 2015-11-30
2 NaT 2015-12-04
3 2016-01-06 2015-12-08
4 NaT 2015-12-09
5 2016-01-07 2015-12-11
6 NaT 2015-12-12
7 NaT 2015-12-14
8 2016-01-06 2015-12-14
9 NaT 2015-12-15
I have created a new column successfully with the difference:
df_test['Difference'] = df_test['First_Date'].sub(df_test['Second Date'], axis=0)
df_test.head()
Out[22]:
First_Date Second Date Difference
0 2016-02-09 2015-11-19 82 days
1 2016-01-06 2015-11-30 37 days
2 NaT 2015-12-04 NaT
3 2016-01-06 2015-12-08 29 days
4 NaT 2015-12-09 NaT
However I am unable to get a numeric version of the result:
df_test['Difference'] = df_test[['Difference']].apply(pd.to_numeric)
df_test.head()
Out[25]:
First_Date Second Date Difference
0 2016-02-09 2015-11-19 7.084800e+15
1 2016-01-06 2015-11-30 3.196800e+15
2 NaT 2015-12-04 NaN
3 2016-01-06 2015-12-08 2.505600e+15
4 NaT 2015-12-09 NaN
How about:
df_test['Difference'] = (df_test['First_Date'] - df_test['Second Date']).dt.days
This will return the difference as an int if there are no missing values (NaT), and as a float if there are.
Pandas has rich documentation on time series / date functionality and time deltas.
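A self-contained sketch of the .dt.days behavior described above, using two rows of the question's data:

```python
import pandas as pd

df_test = pd.DataFrame({
    "First_Date": pd.to_datetime(["2016-02-09", None]),
    "Second Date": pd.to_datetime(["2015-11-19", "2015-12-04"]),
})
# .dt.days yields float64 here: the NaT row forces a NaN into the result
df_test["Difference"] = (df_test["First_Date"] - df_test["Second Date"]).dt.days
```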
You can divide a column of dtype timedelta by np.timedelta64(1, 'D'), but the output is not int but float, because of the NaN values:
df_test['Difference'] = df_test['Difference'] / np.timedelta64(1, 'D')
print (df_test)
First_Date Second Date Difference
0 2016-02-09 2015-11-19 82.0
1 2016-01-06 2015-11-30 37.0
2 NaT 2015-12-04 NaN
3 2016-01-06 2015-12-08 29.0
4 NaT 2015-12-09 NaN
5 2016-01-07 2015-12-11 27.0
6 NaT 2015-12-12 NaN
7 NaT 2015-12-14 NaN
8 2016-01-06 2015-12-14 23.0
9 NaT 2015-12-15 NaN
See also the pandas documentation on frequency conversion.
You can use the datetime module to help here. Also, as a side note, a simple date subtraction should work, as below:
import datetime as dt
import numpy as np
import pandas as pd
#Assume we have df_test:
In [222]: df_test
Out[222]:
first_date second_date
0 2016-01-31 2015-11-19
1 2016-02-29 2015-11-20
2 2016-03-31 2015-11-21
3 2016-04-30 2015-11-22
4 2016-05-31 2015-11-23
5 2016-06-30 2015-11-24
6 NaT 2015-11-25
7 NaT 2015-11-26
8 2016-01-31 2015-11-27
9 NaT 2015-11-28
10 NaT 2015-11-29
11 NaT 2015-11-30
12 2016-04-30 2015-12-01
13 NaT 2015-12-02
14 NaT 2015-12-03
15 2016-04-30 2015-12-04
16 NaT 2015-12-05
17 NaT 2015-12-06
In [223]: df_test['Difference'] = df_test['first_date'] - df_test['second_date']
In [224]: df_test
Out[224]:
first_date second_date Difference
0 2016-01-31 2015-11-19 73 days
1 2016-02-29 2015-11-20 101 days
2 2016-03-31 2015-11-21 131 days
3 2016-04-30 2015-11-22 160 days
4 2016-05-31 2015-11-23 190 days
5 2016-06-30 2015-11-24 219 days
6 NaT 2015-11-25 NaT
7 NaT 2015-11-26 NaT
8 2016-01-31 2015-11-27 65 days
9 NaT 2015-11-28 NaT
10 NaT 2015-11-29 NaT
11 NaT 2015-11-30 NaT
12 2016-04-30 2015-12-01 151 days
13 NaT 2015-12-02 NaT
14 NaT 2015-12-03 NaT
15 2016-04-30 2015-12-04 148 days
16 NaT 2015-12-05 NaT
17 NaT 2015-12-06 NaT
Now change the type to datetime.timedelta, and then use the .days attribute on valid timedelta objects.
In [226]: df_test['Difference_days'] = df_test['Difference'].astype(dt.timedelta).map(lambda x: np.nan if pd.isnull(x) else x.days)
In [227]: df_test
Out[227]:
first_date second_date Difference Difference_days
0 2016-01-31 2015-11-19 73 days 73
1 2016-02-29 2015-11-20 101 days 101
2 2016-03-31 2015-11-21 131 days 131
3 2016-04-30 2015-11-22 160 days 160
4 2016-05-31 2015-11-23 190 days 190
5 2016-06-30 2015-11-24 219 days 219
6 NaT 2015-11-25 NaT NaN
7 NaT 2015-11-26 NaT NaN
8 2016-01-31 2015-11-27 65 days 65
9 NaT 2015-11-28 NaT NaN
10 NaT 2015-11-29 NaT NaN
11 NaT 2015-11-30 NaT NaN
12 2016-04-30 2015-12-01 151 days 151
13 NaT 2015-12-02 NaT NaN
14 NaT 2015-12-03 NaT NaN
15 2016-04-30 2015-12-04 148 days 148
16 NaT 2015-12-05 NaT NaN
17 NaT 2015-12-06 NaT NaN
Hope that helps.
I feel that the answers above do not handle dates that 'wrap' around the end of a year. This is useful when measuring proximity to a date by day of year. To do these row operations, I did the following (I used this in a business setting for renewing customer subscriptions).
def get_date_difference(row, x, y):
    try:
        # Calculating the smallest date difference between the start and the close date.
        # There's some tricky logic in here for determining the date difference
        # the other way around (Dec -> Jan is 1 month rather than 11)
        sub_start_date = int(row[x].strftime('%j'))  # day of year (1-366)
        close_date = int(row[y].strftime('%j'))  # day of year (1-366)
        later_date_of_year = max(sub_start_date, close_date)
        earlier_date_of_year = min(sub_start_date, close_date)
        days_diff = later_date_of_year - earlier_date_of_year
        # Calculates the difference going across the next year (December -> Jan)
        days_diff_reversed = (365 - later_date_of_year) + earlier_date_of_year
        return min(days_diff, days_diff_reversed)
    except ValueError:
        return None
Then the function can be applied row-wise:
dfAC_Renew['date_difference'] = dfAC_Renew.apply(get_date_difference, x = 'customer_since_date', y = 'renewal_date', axis = 1)
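As a quick check of the wrap-around logic, the core computation can be run on a single hypothetical December-to-January pair (dates are my own):

```python
import pandas as pd

row = {"customer_since_date": pd.Timestamp("2020-12-20"),
       "renewal_date": pd.Timestamp("2021-01-10")}
start = int(row["customer_since_date"].strftime("%j"))  # day of year: 355
close = int(row["renewal_date"].strftime("%j"))         # day of year: 10
later, earlier = max(start, close), min(start, close)
# the naive difference is 345 days; going across the year boundary it is 20
wrapped = min(later - earlier, (365 - later) + earlier)
```

Note the result is approximate by up to a day in leap years, since the formula uses a fixed 365.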
Create a vectorized method
import numpy as np
from pandas.tseries.frequencies import to_offset

def calc_xb_minus_xa(df):
    time_dict = {
        '<Minute>': 'm',
        '<Hour>': 'h',
        '<Day>': 'D',
        '<Week>': 'W',
        '<Month>': 'M',
        '<Year>': 'Y'
    }
    # infer the bar frequency from the first row's open/end interval
    time_delta = df.at[df.index[0], 'end_time'] - df.at[df.index[0], 'open_time']
    offset_base_name = str(to_offset(time_delta).base)
    time_term = time_dict.get(offset_base_name)
    result = (df.end_time - df.open_time) / np.timedelta64(1, time_term)
    return result
Then in your df do:
df['x'] = calc_xb_minus_xa(df)
This will work for minutes, hours, days, weeks, months and years.
open_time and end_time need to be changed according to your df's column names.
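For example, with hypothetical hourly bars (column names open_time/end_time as in the answer), the core division by np.timedelta64 looks like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "open_time": pd.to_datetime(["2021-01-01 00:00", "2021-01-01 01:00"]),
    "end_time": pd.to_datetime(["2021-01-01 01:00", "2021-01-01 02:00"]),
})
# dividing a timedelta column by np.timedelta64(1, 'h') yields float hours
df["x"] = (df["end_time"] - df["open_time"]) / np.timedelta64(1, "h")
```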

Resample pandas times series that contains elapsed time values

I have time series data in the format shown on the bottom of this post.
I want to re-sample the data to 30 minute intervals, but I need the Time in State values to be split across the intervals they fall into (these values are expressed in whole seconds).
Now imagine for a certain row the Time in State is 2342 seconds (more than 30 minutes) and say the start time is at 08:22:00.
User Start Date Start Time State Time in State (secs)
J.Doe 03-02-2014 08:22:00 A 2342
When the re-sample is done I need for the Time in State to be split accordingly into the periods it overflows into, like this:
User Start Date Time Period State Time in State (secs)
J.Doe 03-02-2014 08:00:00 A 480
J.Doe 03-02-2014 08:30:00 A 1800
J.Doe 03-02-2014 09:00:00 A 62
480+1800+62 = 2342
I'm completely lost on how to achieve this in pandas... I would appreciate any help :-)
Source data format:
User Start Date Start Time State Time in State (secs)
J.Doe 03-02-2014 07:58:00 A 36
J.Doe 03-02-2014 07:59:00 A 43
J.Doe 03-02-2014 08:00:00 A 59
J.Doe 03-02-2014 08:01:00 A 32
J.Doe 03-02-2014 08:21:00 A 15
J.Doe 03-02-2014 08:22:00 B 3
J.Doe 03-02-2014 08:22:00 A 2342
J.Doe 03-02-2014 09:01:00 B 1
J.Doe 03-02-2014 09:01:00 A 375
J.Doe 03-02-2014 09:07:00 B 3
J.Doe 03-02-2014 09:07:00 A 6408
J.Doe 03-02-2014 10:54:00 B 2
J.Doe 03-02-2014 10:54:00 A 116
J.Doe 03-02-2014 10:58:00 B 2
J.Doe 03-02-2014 10:58:00 A 122
J.Doe 03-02-2014 10:58:00 A 12
J.Doe 03-02-2014 11:00:00 B 2
J.Doe 03-02-2014 11:00:00 A 3417
J.Doe 03-02-2014 11:57:00 B 3
J.Doe 03-02-2014 11:57:00 A 120
J.Doe 03-02-2014 11:59:00 C 165
J.Doe 03-02-2014 12:02:00 B 3
J.Doe 03-02-2014 12:02:00 A 7254
I would first create Start and End columns (as datetime64 objects):
In [11]: df['Start'] = pd.to_datetime(df['Start Date'] + ' ' + df['Start Time'])
In [12]: df['End'] = df['Start'] + df['Time in State (secs)'].apply(pd.offsets.Second)
In [13]: row = df.iloc[6, :]
In [14]: row
Out[14]:
User J.Doe
Start Date 03-02-2014
Start Time 08:22:00
State A
Time in State (secs) 2342
Start 2014-03-02 08:22:00
End 2014-03-02 09:01:02
Name: 6, dtype: object
One way to get the split times is to resample from Start and End, merge, and use diff:
def split_times(row):
    y = pd.Series(0, [row['Start'], row['End']])
    splits = y.resample('30min').index + y.index  # this fills in the middle and sorts too
    res = -splits.to_series().diff(-1)
    if len(res) > 2:
        res = res[1:-1]
    elif len(res) == 2:
        res = res[1:]
    return res.astype(int).resample('30min').astype(np.timedelta64)  # hack to resample again
In [16]: split_times(row)
Out[16]:
2014-03-02 08:22:00 00:08:00
2014-03-02 08:30:00 00:30:00
2014-03-02 09:00:00 00:01:02
dtype: timedelta64[ns]
In [17]: df.apply(split_times, 1)
Out[17]:
2014-03-02 07:30:00 2014-03-02 08:00:00 2014-03-02 08:30:00 2014-03-02 09:00:00 2014-03-02 09:30:00 2014-03-02 10:00:00 2014-03-02 10:30:00 2014-03-02 11:00:00 2014-03-02 11:30:00 2014-03-02 12:00:00 2014-03-02 12:30:00 2014-03-02 13:00:00 2014-03-02 13:30:00 2014-03-02 14:00:00
0 00:00:36 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
1 00:00:43 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
2 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
3 NaT 00:00:32 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
4 NaT 00:00:15 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
5 NaT 00:00:03 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
6 NaT 00:08:00 00:30:00 00:01:02 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
7 NaT NaT NaT 00:00:01 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
8 NaT NaT NaT 00:06:15 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
9 NaT NaT NaT 00:00:03 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
10 NaT NaT NaT 00:23:00 00:30:00 00:30:00 00:23:48 NaT NaT NaT NaT NaT NaT NaT
11 NaT NaT NaT NaT NaT NaT 00:00:02 NaT NaT NaT NaT NaT NaT NaT
12 NaT NaT NaT NaT NaT NaT 00:01:56 NaT NaT NaT NaT NaT NaT NaT
13 NaT NaT NaT NaT NaT NaT 00:00:02 NaT NaT NaT NaT NaT NaT NaT
14 NaT NaT NaT NaT NaT NaT 00:02:00 00:00:02 NaT NaT NaT NaT NaT NaT
15 NaT NaT NaT NaT NaT NaT 00:00:12 NaT NaT NaT NaT NaT NaT NaT
16 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
17 NaT NaT NaT NaT NaT NaT NaT NaT 00:26:57 NaT NaT NaT NaT NaT
18 NaT NaT NaT NaT NaT NaT NaT NaT 00:00:03 NaT NaT NaT NaT NaT
19 NaT NaT NaT NaT NaT NaT NaT NaT 00:02:00 NaT NaT NaT NaT NaT
20 NaT NaT NaT NaT NaT NaT NaT NaT 00:01:00 00:01:45 NaT NaT NaT NaT
21 NaT NaT NaT NaT NaT NaT NaT NaT NaT 00:00:03 NaT NaT NaT NaT
22 NaT NaT NaT NaT NaT NaT NaT NaT NaT 00:28:00 00:30:00 00:30:00 00:30:00 00:02:54
To replace the NaTs with 0 it looks like you have to do some fiddling in 0.13.1 (this may already be fixed in master; otherwise it is a bug):
res2 = df.apply(split_times, 1).astype(int)
# hack to replace NaTs with 0
res2.where(res2 != -9223372036854775808, 0).astype(np.timedelta64)
# to just get the seconds
seconds = res2.where(res2 != -9223372036854775808, 0) / 10 ** 9
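The resample tricks above rely on pre-0.14 pandas behavior (Series.resample returning a Series rather than a Resampler). On modern pandas the same per-interval split can be sketched directly with timestamp arithmetic; the helper name split_interval is my own:

```python
import pandas as pd

def split_interval(start, seconds, freq="30min"):
    """Split `seconds` starting at `start` into per-bin second counts."""
    end = start + pd.Timedelta(seconds=seconds)
    # bin edges covering the whole [start, end] interval
    edges = pd.date_range(start.floor(freq), end.ceil(freq), freq=freq)
    out = {}
    for left, right in zip(edges[:-1], edges[1:]):
        overlap = (min(end, right) - max(start, left)).total_seconds()
        if overlap > 0:
            out[left] = int(overlap)
    return out

parts = split_interval(pd.Timestamp("2014-03-02 08:22:00"), 2342)
```

For the 2342-second example this reproduces the 480 / 1800 / 62 split from the question.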
